2 What is SDMX
2.1 Introduction
This chapter provides some background on the SDMX Initiative, the issues which SDMX addresses, the history of the standards and guidelines which have come out of this initiative, and the areas in which SDMX is playing a role today. This chapter also provides some guidance about how prospective users should think about SDMX as it relates to their own part of the statistical process.
SDMX offers a wide variety of technical tools, and these, like any tools, produce positive results if used well. SDMX also offers a set of guidelines regarding the application of harmonized statistical concepts to data sets, and how these can be represented. Other guidelines address the classification of statistical data and domains, and the harmonization of relevant terminology. Additionally, SDMX represents a framework for the process of harmonization within domains. All of these different aspects are considered here.
2.2 Background – Official Statistics
SDMX comes out of the world of official statistics. If you work for a national or international statistical agency, you already understand “official statistics”, but many people who work with statistical data may not understand this domain, so we will provide a brief, high-level description.
“Official statistics” are the data collected and disseminated by governmental and international organizations to provide the factual basis for making policy and supporting research. Some countries have a “national statistical office” (NSO) while others may have several governmental organizations which are charged with collecting statistical data for governmental use. Most countries also have central banks or similar organizations which collect and disseminate financial and economic data.
Typically, several national government organizations have a statistical function (ministries of education, justice, labour, etc.).
These national organizations typically report their statistics to a set of supra-national organizations, representing either regions of the globe (examples include Eurostat and the European Central Bank) or domains (examples include the World Health Organization, the Food and Agriculture Organization, UNESCO, and the World Bank). Many of these organizations belong to the UN, or are treaty organizations.
All of these organizations exchange, report, and disseminate data in a chain which can be understood as starting at the lowest level within each country, and resulting in high-level data sets which are “aggregated” as they move through various levels to reach the international level.
The system of official statistics is this network of data reporting, governed by legal requirements or other types of agreements. Several important meetings, conferences, and initiatives within this system help organizations adopt similar approaches and techniques and coordinate their reporting; the Conference of European Statisticians and the United Nations Statistical Commission are important examples. Ultimately the goal is to measure important phenomena occurring in the world, and to report the data to policy makers, students, journalists, and other users to help inform their activity. The data is “official” because it comes with the reputation of the world’s governments and international institutions behind it.
2.3 The History of SDMX and its Work Products
In 2001, the heads of seven international statistical institutions came together to form SDMX, with the goal of taking concrete steps towards addressing issues around statistical exchange. These organizations became the sponsors of the SDMX Initiative: the Bank for International Settlements, the European Central Bank, Eurostat, the International Monetary Fund, the Organisation for Economic Co-operation and Development, the UN Statistics Division, and the World Bank. They created an initiative with a governing sponsors committee, and a secretariat function to execute the work programme.
The issues can be briefly characterized as follows:
- Statistical collection, processing, and exchange are time-consuming and resource-intensive
- Various international and national organisations have individual approaches for their constituencies
- Uncertainties (in 2001) about how to proceed with new technologies (XML, web services etc.)
The SDMX Initiative stated that it would address these issues:
- By focusing on business practices in the field of statistical information
- By identifying more efficient processes for exchange and sharing of data and metadata using modern technology
The initial projects of the initiative, based largely on work already on-going among various of the sponsor organizations, were:
- A practical case study on emerging e-standards for data exchange
- Maintaining and advancing existing standards for time series data exchange
- Creation of a common vocabulary for statistical metadata
- Development of a framework for metadata repositories
It was further stated that: “New standards should take advantage of the new web-based technologies and the expertise of those working on the business requirements and IT support for the collection, compilation, and dissemination of statistical information.”
Thus, the goals of the SDMX initiative were ones which were broadly agreed across the sponsoring organizations, and within the official statistics community generally. It is important to understand that there were some firm foundations on which SDMX was building:
- An existing standard for exchanging statistical data, known as GESMES/TS, was already in use among several of the sponsor organizations and their national-level counterparties. This was based not on modern Web technologies such as XML, but used the older UN/EDIFACT syntax.
- The work on the “metadata common vocabulary” was based on many years of harmonization work within the community, notably Eurostat’s Concept and Definitions Database (CODED) and the OECD Glossary of Terms.
The formation of the SDMX Initiative can be understood as a recognition by the sponsor organizations that working together to address these issues, and coordinating business approaches using modern, standards-based technology, was the best way forward. In one sense, SDMX evolved from earlier work, but it also signalled the sponsors' increased commitment to reaching its goals. It represents, too, a coming together of efforts to harmonize statistical content and terminology and to deploy technology in support of statistical processes.
Over time, the work of the SDMX Initiative has expanded, both in terms of content-oriented work products and technical ones. We will describe the evolution of these work products below.
The SDMX Initiative decided early on to position the content-oriented work and the work on technology and standards in a fashion which made these strains of work separate but complementary. The content-oriented work led to the development of the SDMX Content-Oriented Guidelines, while the technical work resulted in the SDMX Technical Specifications. There were several reasons for taking this approach. It reflected the realization that technical specifications must be very precise and detailed in order to allow for automation of statistical exchanges: the programming of computers relies on having very specific rules about how applications communicate, otherwise the communication fails. The SDMX technical standards in one sense function as exchange protocols for machine-to-machine communications (similar to HTTP, for example, but with a focus on specifically statistical exchanges).
Statistical content and terminology issues are very different: they are subject to interpretation and analysis by trained statisticians. Thus, the technology specifications formed a basis for supporting work on the content side, but are in fact a very different type of work product. This is easiest to see in the fact that the SDMX Content-Oriented Guidelines are guidelines, intended to suggest approaches to people in their statistical work, while the SDMX Technical Specifications are specifications: rules for developing conforming computer applications.
Another reason for this separation is that the technical specifications and content guidelines were expected to be maintained at different rates: once stable, technical specifications tend to be updated less frequently. Also, the reasons for making updates and changes in one area do not depend on those in the other, so it made sense to separate them. This is reflected in the fact that the technical specifications are submitted and published through the International Organization for Standardization (ISO), which publishes many IT-related standards in various domains, while the content-oriented guidelines are not submitted to ISO, but are maintained by the SDMX Initiative itself. This allows for updates of the content-oriented guidelines on an on-going basis.
A third reason for the separation of the SDMX Technical Specifications and the SDMX Content-Oriented Guidelines is that the technical specifications, being a technological foundation for exchanging any statistics, are applicable outside the domain of official statistics, while the content-oriented guidelines are specifically designed to be useful within that context (although they may also prove useful outside that community).
This coordinated-but-separate positioning of the two threads of work has proven to be very useful, too, because often statisticians and economists do not have deep expertise in IT, and technologists do not have deep expertise in statistics. SDMX helps to define the point where the two sets of expertise need to coordinate, to effectively use IT within statistical exchanges and processes.
Within the content-oriented work, there is a set of work products: the Content-Oriented Guidelines and five annexes:
- Cross-Domain Concepts
- Cross-Domain Codelists
- Statistical Subject-Matter Domains
- Metadata Common Vocabulary
- SDMX-ML for the Content-Oriented Guidelines (Concepts, Code Lists, Category Scheme)
These are discussed in more detail in the first annex to this user guide.
The SDMX Technical Specifications are now in version 2.1, but both version 1.0 and version 2.0 were implemented. The 1.0 version of the specifications has relatively limited coverage: a model for data formats and their structures, along with XML and UN/EDIFACT formats for exchanging these. The UN/EDIFACT format was backward-compatible with GESMES/TS; the XML formats were new. There was also some support provided for SDMX-based Web services: an XML query document, and a set of guidelines about the use of other related Web-services standards (SOAP and WSDL).
The 2.0 version of the technical specifications had a greatly expanded scope. The model was extended to include “reference metadata” as a way of structuring and formatting metadata related to data quality frameworks, methodological metadata, and other types of “footnote” metadata. Thus, XML formats for reference metadata were added. Further, a set of standard interfaces in XML for interactions with an SDMX Registry was added, for cataloguing the location of data and reference metadata across the Internet or within an organization, and for maintaining and retrieving structural metadata.
In version 2.1, many features of 2.0 have been improved, and the Web-services recommendations have been expanded to include a RESTful interface, standard functions, and error messages. Now, it is possible to develop generically interoperable applications based on the SDMX standards. Further, the various XML data formats have been simplified based on implementation experience with version 2.0.
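To make the RESTful interface concrete, the following is a minimal sketch of how a client might assemble a data query URL. It assumes the path pattern `data/{flowRef}/{key}/{providerRef}` with `startPeriod` and `endPeriod` query parameters, as we understand the 2.1 Web-services guidelines. The function name `build_data_query`, the base URL, and the dataflow and series key shown are hypothetical and serve only as illustration; the sketch builds the URL and makes no network call.

```python
from urllib.parse import urlencode


def build_data_query(base_url, flow_ref, key="all", provider_ref="all", **params):
    """Assemble a data query URL of the form
    {base}/data/{flowRef}/{key}/{providerRef}?param=value...

    Hypothetical helper: the name and signature are not part of SDMX.
    """
    # Join the path segments, tolerating a trailing slash on the base URL.
    path = "/".join([base_url.rstrip("/"), "data", flow_ref, key, provider_ref])
    # Drop parameters the caller left unset, then URL-encode the rest.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{path}?{query}" if query else path


# Illustrative values only: the endpoint, dataflow and key are assumptions.
url = build_data_query(
    "https://example.org/sdmx/rest",
    "EXR",
    key="M.USD.EUR.SP00.A",
    startPeriod="2010-01",
    endPeriod="2010-12",
)
print(url)
```

Because every conforming service is expected to accept the same URL structure, a client written this way should only need a different base URL to query a different provider.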
For all types of work products, there have been internal reviews within the SDMX community, and also public review of the guidelines and standards.
2.4 The SDMX “Toolkit” Approach
There are many different elements in the SDMX suite of guidelines and specifications, and it may seem daunting to think of implementing them all. It is important to understand the philosophy behind this suite of tools. SDMX has always taken seriously the idea that different organizations will implement at their own speeds, and with their own objectives. As much as possible, the initiative has recognized that investments in legacy systems must be protected, and that existing content and processes should still be supported.
The result of this requirement has been the “toolkit” approach: SDMX offers many different tools, but they need not all be adopted or used together. Indeed, many tools are now built on top of a more fine-grained set of components which themselves can be integrated into an organisation’s own systems. The technical specifications outline a number of different types of conformance with the specifications, based on which parts of the specifications are being used.
The following chapter describes the use-cases which SDMX supports, but a basic list of business applications can be given:
- Collection cases
  - SDMX as a “push” reporting format for data and metadata (reporter pushes data to collector)
  - SDMX as a “pull” reporting format for data and metadata (collector pulls data from reporter)
- Dissemination cases
- Data warehousing cases
Different parts of the standard are used for each of these cases (and others), and the specifications are specifically written to allow only the relevant parts of the standard to be used by any given application.
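The difference between the “push” and “pull” collection cases is simply which party initiates the transfer. The sketch below illustrates this with two in-memory classes. The names `Reporter`, `Collector`, `push_to`, `serve`, and `pull_from` are hypothetical and not part of any SDMX specification, and the plain strings stand in for real SDMX data messages.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """Hypothetical collecting organization; stores (source, message) pairs."""
    received: list = field(default_factory=list)

    def receive(self, source, message):
        self.received.append((source, message))

    def pull_from(self, reporter):
        # Pull: the collector initiates and fetches what the reporter exposes.
        for message in reporter.serve():
            self.receive(reporter.name, message)


@dataclass
class Reporter:
    """Hypothetical reporting organization holding messages ready to report."""
    name: str
    outbox: list = field(default_factory=list)

    def push_to(self, collector):
        # Push: the reporter initiates and delivers its messages.
        for message in self.outbox:
            collector.receive(self.name, message)
        self.outbox.clear()

    def serve(self):
        # The reporter only exposes its data; it does not send anything itself.
        return list(self.outbox)


collector = Collector()

pushing_reporter = Reporter("NSO-A", ["dataset-2010"])
pushing_reporter.push_to(collector)      # reporter-initiated

pulling_reporter = Reporter("BANK-B", ["dataset-2011"])
collector.pull_from(pulling_reporter)    # collector-initiated

print(collector.received)
```

In both cases the collector ends up with the same kind of message; only the direction of the request differs, which is why the two cases can draw on different parts of the standard.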
2.5 Uptake of SDMX within Domains
SDMX has become very widely used within the world of official statistics, so much so that it is difficult to form a comprehensive list of users. This section attempts to characterize the current users of SDMX – a group that will likely grow not only in terms of numbers, but also in terms of the breadth of applications. A few possibilities here are suggested at the end of this section.
If we are to look at the most common uses of SDMX, there are two:
- The use of SDMX as a reporting and collection format, which is especially prevalent within the central banking community (as a result of the earlier implementation of GESMES/TS, now SDMX-EDI), and also among the statistical agencies in Europe (also users of GESMES historically, but implementation is now increasingly driven by such projects as Eurostat’s Census Hub);
- Dissemination of statistical data from websites.
The second application is one which we see in a broad range of institutions, including central banks (ECB and European System of Central Banks, BIS, U.S. Federal Reserve Board and New York Federal Reserve, among others), other sponsoring institutions (IMF, World Bank, OECD, etc.), and national statistical agencies (INEGI in Mexico, Statistics New Zealand, Australian Bureau of Statistics, statistics offices in the European Statistical System, etc.).
A less common but growing use of SDMX is as the basis for data warehouses and other forms of data management. Perhaps the best example of this is the European Central Bank, which has created all of its internal data warehouses around the SDMX Information Model, and has realized many benefits from this. It is by no means the only organization looking at this type of implementation, however; many other organizations are using SDMX not only to manage their statistical data, but also to create metadata repositories and to integrate their metadata and data.
If we look at which statistical domains have been or are becoming major adopters of SDMX, the list would be something like this (in no particular order):
- Census and Demography
- Education
- Financial and Monetary Indicators
- Economic Indicators
- National Accounts
- Labour
- Food and Agriculture including fisheries
- Epidemiology
- Transport
- Data Quality
- Development Indicators
It is easy to see that this is a broad and cross-cutting set of statistical domains – in fact, there are probably very few domains in which SDMX is not being used in some fashion today, and the above list is intended as an indication of the breadth of the uptake.
SDMX was officially endorsed first within the European Statistical System, and then by the UN Statistical Commission. These endorsements were powerful incentives for organizations to use SDMX, and the result has been widespread adoption. There are no major competing standards, which has spared the world of statistics the rivalry between standards that has slowed uptake in some other communities.
Additionally, a strong culture of open-source and free tools development has emerged, helping to make the adoption of SDMX easier. This has come both from within the sponsors' community and without, and is supplemented by an increasing number of tools coming from commercial vendors as well.
To learn more about available SDMX tools, the best starting point is the SDMX Tools Database, a service provided by the SDMX sponsors and linked to from www.sdmx.org.
Support for the standards does not only take the form of tools. The Open Data Foundation hosts the SDMX User Forum in collaboration with the sponsors, providing a place where the community can interact online, and Eurostat’s CIRCA website provides many types of resources, from training videos to student guides. Many organizations offer in-person SDMX training for different levels of users. The best single point of entry is of course the SDMX website itself.
Looking forward, SDMX is increasingly coming into use. At the most recent SDMX Global Conference, held in Washington DC in May 2011, Google showed some interest in SDMX as a source of data for its Data Explorer, and there is now an interest in setting up a global registry so that all SDMX data and metadata sources can be easily found. Further, we see the strong possibility that the world of corporate statistics may realize the utility of having a strong standards basis around the vast amounts of data collected today to support business intelligence applications.