12 Annex 2 – SDMX Business Process Model
12.1 Introduction
The Generic Statistical Business Process Model (GSBPM) is a reference model of the statistical production life-cycle in national statistical agencies, developed by the METIS group in UN/ECE. The work was based on many earlier models, and represents a view of statistical production which is now being accepted as the standard view.
For this reason, we are using the GSBPM as the basis of an example, demonstrating how SDMX fits into the work of a national-level statistical agency.
This example is not a technical one – rather, it is meant to describe the use of SDMX from a business perspective: how, where, and why is SDMX used? These questions will be answered by using the example of effective exchange rates.
It is important to note that in some scenarios, where collected data is already in the form of aggregates, SDMX might be used earlier in the business process. However, for NSOs the most common scenario is probably where micro-data are collected and aggregated at the national level.
There are many benefits to the use of SDMX, and while many of these are related to the use of technology, in the end the real benefits are simple: it becomes easier for users to locate and utilize data, and the data themselves are more comparable. Further, the data become easier to visualize and format, into whatever form is needed, either for the creation of dissemination outputs, or for re-formatting by data users or collectors.
12.2 High Level Schematic of the GSBPM
It is important to have at least a high-level understanding of the GSBPM, as shown in the diagram below.
Figure 35: High-level schematic of the GSBPM
Across the top of the diagram, we see the high-level process steps, from 1 to 9. The process begins with the evaluation of data collection needs, proceeds through the design and creation of data-collection instruments, and then moves on to the actual collection of data. Once collected, the data are processed, coded, and edited; imputation is performed, weights are calculated, and the data are aggregated.
Up to this point (i.e. 5.6), the GSBPM has been concerned with the collection and processing of micro-data (at least from the perspective of an NSO – from the perspective of a supra-national organization, the collected data may themselves often be aggregates.)
For this example, we show how SDMX can be used from the point of aggregation forward, as we move through the GSBPM. For our purposes, then, we will focus on steps 5.7 and later, as shown below:
Figure 36: The part of the GSBPM supported by SDMX
To understand how SDMX can be used throughout this process, we need to look not only at the internal business process of the Central Bank or NSO, but also at the collection framework of the organization to which the aggregate data are reported.
Because SDMX focuses on the exchange of statistics, it will be necessary to consider the organization in our example which will be performing the collection. This will involve some constructs external to, but accessible by, the compiling organization – notably an SDMX Registry, with the various constructs that it contains (data flows, data providers, etc.).
SDMX is not only for reporting of aggregates, however – it also performs useful functions in the dissemination of data directly to users, and the archiving of data within the organization, so these functions of the organization will also be included in our example.
12.3 GSBPM and SDMX
12.3.1 Aggregation (Step 5.7), and Data Analysis (Step 6)
12.3.1.1 Calculation of Aggregates, and Understanding the SDMX Data Structure
Once data have been aggregated from the micro-data (Step 5.7 of the GSBPM) they will be stored in some format such as a relational database or data warehouse (Oracle, etc.) or in some processing format (SAS, SPSS), or in Excel spreadsheets or similar format. This will depend on the internal system and tools used within the organization, and is different in different organizations.
In order to capture the aggregates in a standard SDMX format, we must first look at the required data structures as dictated by the collecting agency. In our example, we will use the Effective Exchange Rates data structure developed by the ECB. Below is an example of the type of data which is contained in a data set structured according to this DSD.
Figure 37: Example ECB Data of Effective Exchange Rates
It is not important from the NSO or Central Bank’s perspective to understand every aspect of the analysis which went into the creation of the data structure, as the SDMX Data Structure Definition to be used for reporting will be provided by the agency collecting the data. All data reporters are expected to use the same data structure definition (DSD).
It is important to understand the data structure definition, because this is the resource which describes how the reported aggregates themselves must be structured. Below is a view of the Effective Exchange Rates DSD.
The DSD is an XML file, created according to the SDMX-ML standard (it can also be expressed in SDMX-EDI, but the XML version is more common). The fact that it is XML is important to the IT staff who must process it, and is used by many SDMX tools, but what is most important is that the statistical concepts and codelists (or “classifications”) it uses are also used in the reported aggregate data file. Thus, we do not look at the XML for this example – it is enough to know that the XML version of the DSD exists, and may be needed by tools or developers at some point.
Figure 38: Example Data Structure Definition (Dimensions) for Effective Exchange Rates with Code List
Figure 39: Example Data Structure Definition (Attributes) for Effective Exchange Rates
The view here shows a number of very important things:
The ID, Agency, and version of the DSD are displayed at the top of the screen. Below this, there is a listing of statistical concepts which are used as dimensions. (Frequency, Currency, Exchange Rate Type, etc.) Each of these concepts will be taken from some authoritative source, and descriptions and definitions can be obtained from the organization which publishes and maintains the DSD. In many cases, these concepts will be the standard concepts defined in the SDMX Cross-Domain Statistical Concepts document (which can be obtained at www.sdmx.org). In other cases, they may be formally defined and documented by the maintaining agency which publishes the DSD. It is important to understand the definition of these concepts, as they may be slightly different from those used by the NSO or Central Bank, but in most cases they will likely be very similar or the same as the concepts already used at the national level for this data.
The concepts used as Dimensions each take a value which has a standard representation. In most cases, this representation will be a codelist – a standard classification which must be used to identify and describe the observations. In the right-hand part of the screen, we can see which codelist is used to represent each concept used as a dimension. For example, the Frequency dimension uses a codelist called “CL_FREQ” version 1.0.
Frequency is perhaps the simplest example, as the reporting agency will generally know what the frequency of the data is, and have a record of this in their systems (quarterly, monthly, etc.)
Notice that the Time dimension is not coded, but instead has a time value, indicating the time of the observation.
The values for the codes may or may not match the classification used at the national level, and if they do not match then it will be necessary to map the codes used internally against the classification used by the SDMX DSD.
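The mapping step described above can be sketched in a few lines. This is only an illustration: the national labels and the function names are invented for the example, while A, Q and M are the usual annual, quarterly and monthly codes in CL_FREQ.

```python
# Hypothetical mapping from national frequency labels to CL_FREQ codes.
# The national labels on the left are invented for this example.
NATIONAL_TO_CL_FREQ = {
    "ANNUAL": "A",
    "QUARTERLY": "Q",
    "MONTHLY": "M",
}


def to_dsd_code(national_code, mapping):
    """Return the DSD code for a national code, or raise if it is unmapped."""
    try:
        return mapping[national_code]
    except KeyError:
        raise ValueError(f"No mapping defined for national code {national_code!r}")


def map_codes(national_codes, mapping):
    """Map a sequence of national codes, collecting unmapped ones for review."""
    mapped, unmapped = [], []
    for code in national_codes:
        if code in mapping:
            mapped.append(mapping[code])
        else:
            unmapped.append(code)
    return mapped, unmapped


mapped, unmapped = map_codes(["ANNUAL", "WEEKLY", "QUARTERLY"], NATIONAL_TO_CL_FREQ)
```

Collecting the unmapped codes, rather than failing on the first one, gives the statistician a complete list of national codes that still need a counterpart in the DSD codelist.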
For each statistical concept used as a dimension, it must be possible to provide a value as specified by the SDMX DSD. This might seem like a lot of work, but it is done for obvious and important reasons – if the collected data are to be comparable at the supranational level, then there must be a standard expression of the data, using the same statistical concepts and classifications to identify and describe the observations. This is no different than when reporting aggregate data today – each data collector will want to have a specific expression of the data collected. What is different, with SDMX, however, is that the data collectors are harmonizing the DSDs used in each domain, and there is an effort, through the SDMX Content-Oriented Guidelines, to use identical statistical concepts and representations where this is possible.
Harmonization of data is a difficult process, but it is one which will result, in time, in more useful data (because it is more comparable), and also hopefully in a more uniform collection of data at the national level, because all reporting countries must conform to a standard DSD as they calculate aggregates, which in turn impacts the data collection process as shown in the GSBPM.
If we look again at the high-level picture of the DSD (above), we will also see a section which shows statistical concepts used as “Attributes”. These are descriptive concepts, sometimes represented with codes, or sometimes with strings. They are different from Dimensional concepts, because they are not always required – in the table, each Attribute is marked as either “Conditional” or “Mandatory”.
An Attribute also has a relationship (the “Attribute Relationship”) to a construct such as a group of one or more Dimensions, or an Observation, as discussed in Chapter 4.
In other ways, the Attributes are very similar to the Dimensions of the DSD – the coding (if they are coded) must be standard, as dictated by the DSD, and for the same reasons.
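The difference between Dimensions (always required) and Attributes (Mandatory or Conditional) can be illustrated with a small check. This is a sketch under stated assumptions: the component IDs (FREQ, CURRENCY, TIME_PERIOD, OBS_STATUS, OBS_COM) and their statuses are hypothetical and are not taken from the actual Effective Exchange Rates DSD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    concept_id: str
    mandatory: bool  # True = "Mandatory", False = "Conditional"


def missing_components(record, dimension_ids, attributes):
    """List the dimensions and mandatory attributes absent from a record."""
    missing = [d for d in dimension_ids if d not in record]
    missing += [a.concept_id for a in attributes
                if a.mandatory and a.concept_id not in record]
    return missing


# Hypothetical component IDs, chosen only for illustration.
dimension_ids = ["FREQ", "CURRENCY", "TIME_PERIOD"]
attributes = [
    Attribute("OBS_STATUS", mandatory=True),
    Attribute("OBS_COM", mandatory=False),
]

record = {"FREQ": "M", "CURRENCY": "EUR", "TIME_PERIOD": "2010-01"}
problems = missing_components(record, dimension_ids, attributes)
```

Here the record supplies every dimension but omits the mandatory attribute, so only that attribute is reported; the conditional attribute may be left out without complaint.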
12.3.1.2 Formatting Aggregates with SDMX
Once it has been determined that the aggregate data can be expressed as SDMX, according to the rules of the DSD, then we need to think about what is involved in actually creating the SDMX-ML format for the data. This is important because if we can express the aggregates as SDMX-ML, there are a number of tools which will become useful in performing later activities in the GSBPM.
There are several techniques for creating SDMX-ML data sets, and several choices will need to be made. First, there are several “flavours” of SDMX: SDMX-EDI (also known as GESMES/TS) and four types of SDMX-ML (the XML version). This is a technical consideration, and it is typically the case that the data collector will dictate exactly which format is wanted. The XML formats include “Generic”, and “Structure Specific”. Different organizations use different flavours of SDMX-ML, but it is important to note that there are free tools available which will allow for transformations between these different flavours.
These are technical considerations which should be left to the IT staff, so we will not go into them in depth here – the reasons for using one or another are most often purely IT-technical ones.
We do need to look at the practical options for creating the SDMX-ML files, however. The options are discussed in Chapter 4 – Data and Metadata Creation and Reporting, and the technical mechanism for achieving different outputs from a database is discussed in Annex 4 – Data Reader and Data Writer Functions. There are several types of tools which will allow the formatting of aggregates into the correct form: tools based on Excel, tools which take the data from a relational database such as Oracle, and tools which work within processing tools such as SAS or PCAxis. Again, we will not look at the technical details of these tools, as this is an IT issue, but it is important to be aware that there are several different tools available. Eurostat provides a free tool for “data mapping” which is broadly useful, and PCAxis will have native SDMX support built into it in future versions. It should be noted that when working with processing applications such as SAS and SPSS, it is often the case that dedicated scripts will need to be written within those environments, as different national formats within those applications will require specific formatting scripts to produce SDMX-ML outputs.
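To give a feel for what such a formatting script does, the sketch below groups flat rows of aggregates into series and observations and writes them as XML. It is deliberately simplified: the element layout is only modelled on a series/observation structure, it omits the namespaces, header and schema references that real SDMX-ML requires, and the dimension IDs and figures are invented for the example.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, dimension_ids):
    """Group flat rows into Series elements keyed by their dimension values."""
    dataset = ET.Element("DataSet")
    series_by_key = {}
    for row in rows:
        key = tuple(row[d] for d in dimension_ids)
        series = series_by_key.get(key)
        if series is None:
            # First row for this key: create the Series with its dimension values.
            series = ET.SubElement(dataset, "Series", dict(zip(dimension_ids, key)))
            series_by_key[key] = series
        ET.SubElement(series, "Obs", {
            "TIME_PERIOD": row["TIME_PERIOD"],
            "OBS_VALUE": str(row["OBS_VALUE"]),
        })
    return dataset


# Invented figures, as they might come out of a database query or spreadsheet.
rows = [
    {"FREQ": "M", "CURRENCY": "EUR", "TIME_PERIOD": "2010-01", "OBS_VALUE": 101.2},
    {"FREQ": "M", "CURRENCY": "EUR", "TIME_PERIOD": "2010-02", "OBS_VALUE": 100.8},
    {"FREQ": "M", "CURRENCY": "USD", "TIME_PERIOD": "2010-01", "OBS_VALUE": 98.4},
]
dataset = rows_to_xml(rows, ["FREQ", "CURRENCY"])
xml_text = ET.tostring(dataset, encoding="unicode")
```

In practice the dedicated tools mentioned above would be used instead, since they also produce the message header and validate the result against the DSD.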
12.3.1.3 SDMX and Analysis of Aggregates (Step 6)
It may not seem obvious that SDMX is relevant to the process of analysis of aggregates, but it can sometimes be very useful. This will depend on which tools are used by an NSO to perform these various steps. Because most systems work well with XML generally – and because SDMX-ML is one flavour of XML – SDMX can provide some useful functions as the aggregates are analyzed and further processed.
In the preparation of draft outputs (Step 6.1), it may be helpful to use any of the various visualization tools based on SDMX when looking at the data. Tools exist for doing graphical visualizations of the SDMX data, using modern technology packages such as the Flex code developed by the European Central Bank (http://code.google.com/p/flexcb/ ). Other packages also exist, provided by various commercial vendors. Other free tools exist for producing Excel spreadsheets and HTML displays of the data.
Especially if files are passed between several individuals while the draft outputs are prepared, it may be useful to exchange the SDMX-ML file, so that different individuals can use different visualizations of the same data while performing this work.
The validation of outputs (Step 6.2) requires more than just data visualization, and it is here that SDMX-ML can provide some solid benefit. Some of the validation rules exist within the DSD, and these can be checked automatically using free SDMX data and metadata set tools. Others exist within an SDMX Registry, where cross references, versioning, and requests for deletion are validated to ensure the integrity of the structural metadata.
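One of the DSD-based rules, that every dimension value must come from the codelist assigned to that dimension, can be sketched as follows. The codelists shown are hypothetical fragments, and the function is an illustration of the idea rather than the behaviour of any particular SDMX tool.

```python
def invalid_codes(record, codelists):
    """Return {dimension: value} for values not found in the dimension's codelist."""
    return {
        dim: record.get(dim)
        for dim, allowed in codelists.items()
        if record.get(dim) not in allowed
    }


# Hypothetical codelist fragments, for illustration only.
codelists = {
    "FREQ": {"A", "Q", "M"},
    "CURRENCY": {"EUR", "USD"},
}

errors = invalid_codes({"FREQ": "M", "CURRENCY": "XXX"}, codelists)
```

A missing dimension is reported with the value `None`, so the same check also catches incomplete keys.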
What SDMX cannot validate is that the numbers reported are correct in terms of other values in the data set – that is, are they plausible values given the numbers reported in preceding periods, or in relation to other reported data. These are statistical issues that cannot be solved by SDMX-based technology, but which will require dedicated checks created by a statistician who understands the statistical issues.
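A dedicated plausibility check of this kind lies outside SDMX and has to be written separately. The sketch below flags observations that differ from the preceding period by more than a chosen relative threshold; the 10% default is an arbitrary assumption for the example, and a real check would be designed by a statistician for the series in question.

```python
def flag_implausible(values, max_relative_change=0.10):
    """Return indexes of observations that differ from the previous
    observation by more than the given relative threshold."""
    flagged = []
    for i in range(1, len(values)):
        previous, current = values[i - 1], values[i]
        if previous == 0:
            continue  # relative change is undefined; leave for manual review
        if abs(current - previous) / abs(previous) > max_relative_change:
            flagged.append(i)
    return flagged


# Invented series: the third value jumps sharply relative to the second.
suspicious = flag_implausible([100.0, 101.0, 150.0, 149.0])
```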
Scrutinizing and explaining the aggregates (Step 6.3) is something which typically involves visualization of the data (as for Step 6.1) but may also include the creation of specific tabular views for inclusion in reports. The same tools which provide the ability to visualize SDMX data may also allow for the creation of tabular views for use in reports (Excel tables, etc.) but this will vary based on the systems within each NSO or Central Bank.
There is nothing in SDMX which directly addresses disclosure control (Step 6.4) or the finalization of outputs (6.5), other than the use of visualization tools as described for earlier parts of Step 6. However, it should be noted that any corrections or edits to the data will need to be reflected in the SDMX-ML data to be reported. Depending on how the SDMX-ML is generated, this may involve going back to the tools and systems used to format the SDMX-ML in the first place, and making sure that the correct data are available in those tools for re-formatting as SDMX-ML.
12.3.2 Reporting/Dissemination (Step 7)
Step 7 of the GSBPM covers the process of dissemination in its broadest sense – that is, all users of the data are the target of this process step, including organizations which collect the aggregate data from NSOs and Central Banks. Thus, the GSBPM addresses reporting and dissemination as a single set of activities.
There are several types of data dissemination, and when we consider dissemination and reporting using the Internet this category is very broad. As we look at each sub-process in this step, we will need to consider this broad range of possibilities.
In addition to the sub-processes described by the GSBPM, we also need to consider one aspect of SDMX that potentially concerns all forms of reporting and dissemination, the SDMX Registry Services (see SDMX Registry/Repository).
The first sub-process in Step 7 is the updating of output systems. This involves taking the aggregates as prepared in Step 6, and loading them into whatever systems are used to drive dissemination. Typically, this will involve database systems (e.g. Oracle) and – if the same database is not used to drive Web dissemination – also loading data into whatever system drives the views of data on the Website.
SDMX can be used as a format for the exchange of data between systems, whether these systems are internal to an organization, or external, and thus it makes a good format for loading databases used in all types of dissemination. Further, because it is an XML format, SDMX-ML can be used as inputs to systems for creating HTML, PDF, Excel, and other output formats. An SDMX Registry can make the reporting of such data more automated by using the data registration mechanism supported by a registry. The benefit of such a system is that – once new data have been “registered” (see below), the data collector can come and simply query the service for the new data. This helps to ease the burden of data reporting.
This application of SDMX tends to be very technical – because XML is well-supported by many types of systems, it is useful also in loading the databases used to drive dissemination. The details of this application are not something we will explore in any detail here.
The next sub-processes in the GSBPM are the preparation of outputs, and the management of their release. This covers a wide variety of potential products based on the data: reports (typically printed and disseminated as PDF, combining tabular views of the aggregate data with explanatory text and analysis), HTML pages displayed on a Web-site, data downloads in various formats (Excel, CSV, etc.), and Web-based interfaces for querying the data, and for doing graphic visualizations, which may even be interactive.
SDMX can be used as the single XML format for the creation of all other dissemination products, at least for providing the tabular views of the data. (Obviously, websites have more than just data on them.) Again, this can be a very IT-technical topic, but it is important to understand that there are many good technologies for “styling” XML to produce other outputs, including all of the ones typically found on statistical websites.
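As an illustration of producing another output format from a single XML source, the sketch below turns a small XML data set into CSV. The input is a simplified, namespace-free layout invented for the example and is not schema-valid SDMX-ML; the same idea applies to HTML or Excel outputs, for which stylesheet technologies are often used instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified sample input with invented figures; real SDMX-ML carries
# namespaces and a message header that are omitted here.
SAMPLE = """<DataSet>
  <Series FREQ="M" CURRENCY="EUR">
    <Obs TIME_PERIOD="2010-01" OBS_VALUE="101.2"/>
    <Obs TIME_PERIOD="2010-02" OBS_VALUE="100.8"/>
  </Series>
</DataSet>"""


def xml_to_csv(xml_text):
    """Flatten Series/Obs elements into one CSV row per observation."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for series in root.iter("Series"):
        for obs in series.iter("Obs"):
            # Combine the series-level and observation-level values.
            row = {**series.attrib, **obs.attrib}
            if not header_written:
                writer.writerow(row.keys())
                header_written = True
            writer.writerow(row.values())
    return out.getvalue()


csv_text = xml_to_csv(SAMPLE)
```

The sketch assumes every series carries the same set of attributes, so that the header taken from the first observation fits all rows.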
SDMX is also directly useful in two ways: as a format for reporting to data collectors and as a direct download format. The use of SDMX as a download format has become increasingly popular, and in some cases has proven to be the most popular form of disseminated data available on Web-sites. Many users prefer this format because it is easy to process (being XML) and it is accompanied by rich metadata, including the structural metadata necessary for applications to process or visualize the data. Further, the format is predictable, allowing for easy use of the data coming from outside the organization.
It is worth noting that for many organizations, SDMX is being deployed using a Web service (such as those developed by ECB, IMF, and OECD). Such services allow for direct querying of the data sources, in SDMX-ML format, by any user allowed access to the service. Eurostat is currently developing a “Census Hub” Web service for querying the census data to be collected in 2011. Here, the data for each country remains in the database of the country, and the role of the “hub” is to broker a user query such that an “SDMX Query” is sent to each relevant database, which responds in SDMX. The resultant responses are then combined by the hub.
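The brokering role of a hub can be sketched as below. Everything here is a stand-in: the national databases are represented by local stub functions rather than Web services, the area codes, dimension name and figures are invented, and no actual SDMX query or response messages are exchanged.

```python
def query_hub(endpoints, query):
    """Send the same query to every national endpoint and merge the responses."""
    combined = []
    for area, endpoint in endpoints.items():
        for observation in endpoint(query):
            # Tag each observation with the area that supplied it.
            combined.append({"REF_AREA": area, **observation})
    return combined


def make_stub_endpoint(observations):
    """Stand-in for a national database that answers a query locally."""
    def endpoint(query):
        return [o for o in observations
                if all(o.get(k) == v for k, v in query.items())]
    return endpoint


# Hypothetical areas "AA" and "BB" with invented figures.
endpoints = {
    "AA": make_stub_endpoint([{"SEX": "F", "OBS_VALUE": 10},
                              {"SEX": "M", "OBS_VALUE": 9}]),
    "BB": make_stub_endpoint([{"SEX": "F", "OBS_VALUE": 20}]),
}
result = query_hub(endpoints, {"SEX": "F"})
```

The data never leave the national stubs except as answers to a query, which mirrors the principle that each country's data remain in its own database.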
The “advanced” use of SDMX – where an SDMX-capable database can create many dissemination products by transforming the SDMX into other formats, even in an on-demand fashion for Web dissemination – can greatly simplify the process of preparing dissemination outputs. Instead of several parallel forms of the data having to be produced, a single SDMX source means that, once the data are loaded, only print and PDF reports and static Web-pages still need to be prepared separately; all other types of data dissemination are handled by systems which generate the needed outputs (Excel, CSV, graphical visualizations, SDMX-ML) in an on-demand fashion.
The figure below illustrates the basic principle behind this type of SDMX use.
Figure 40: SDMX as the pivotal format in a dissemination system
It should be noted that when SDMX-ML represents a dissemination format in its own right, the SDMX-ML structure file containing the DSD and all its components should be provided along with the SDMX-ML data set it structures, as users will want both types of files for use in their own systems. In most cases, the DSD files will be available from their agency, but in this case a link to that source should be readily accessible to users (this may be through an SDMX Registry – see below).
Typically, all data products (including on-demand delivery via a Web service or query interface) are loaded into a “staging” environment, so that they can be subjected to quality assurance before being actually disseminated. SDMX does not change this aspect of the dissemination and reporting process, but does place an emphasis on the proper testing of Web-delivery for data.
The next sub-process in the GSBPM is the promotion of dissemination products. SDMX is extremely useful in this regard, although perhaps not in an obvious fashion. This process in the GSBPM is typically seen as the “advertising” of the statistical products, and in that narrow sense SDMX is not much use, except that the use of leading-edge standards may offer some opportunities for promotion (presentations at conferences, etc.).
Far more interesting in increasing the visibility and use of data is the existence of the SDMX Registry Services, which provide a platform for the automatic discovery of data products. Users have become used to the idea that resources can be “Googled”, and while the SDMX Registry services are not part of Google itself, they do provide a focused way of searching for all of the data produced within a domain, regardless of which site the data is published on.
In essence, the SDMX Registry Services provide an online catalogue, listing all of the data available within a community. That community can be open or closed, depending on who is allowed access to the catalogue. Thus, there are registries today which only provide access to data collectors, such as the SDMX Registry used by the Joint External Debt Hub (it is only visible to the organizations which exchange data: the BIS, the IMF, OECD, and the World Bank). SDMX Registries can be public, however, which means that any Website or Internet-aware application could search for all of the data listed in that catalogue, and then go to the site where that data is found.
This is a very powerful thing: increasingly, this approach to locating data is being used, because it leverages the latest generation of Web-based technology. Exposing the existence of your data to these types of sites and applications, and making it queryable or otherwise accessible in SDMX-ML format, is a very efficient way to make it visible and available to re-publishers and users of all types.
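The catalogue idea can be pictured with a small in-memory model: providers register where their data for a given data flow can be found, and users or collectors look the registrations up. This is a conceptual sketch only; the class, the identifiers and the example.org URLs are hypothetical and do not reflect the interface of an actual SDMX Registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    dataflow: str
    provider: str
    url: str


class Registry:
    """Toy catalogue of where data for each data flow can be retrieved."""

    def __init__(self):
        self._registrations = []

    def register(self, dataflow, provider, url):
        registration = Registration(dataflow, provider, url)
        self._registrations.append(registration)
        return registration

    def find(self, dataflow):
        """Return every registration for a data flow, whichever site hosts it."""
        return [r for r in self._registrations if r.dataflow == dataflow]


registry = Registry()
registry.register("EXR", "NSO_A", "https://example.org/nso-a/exr.xml")
registry.register("EXR", "NSO_B", "https://example.org/nso-b/exr.xml")
registry.register("CPI", "NSO_A", "https://example.org/nso-a/cpi.xml")

sources = registry.find("EXR")
```

The registry holds only the locations, not the data themselves, which is why a single search can cover data published on many different sites.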
12.3.3 Archiving (Step 8)
SDMX is not specifically designed to support archiving, which is Step 8 of the GSBPM, but it is worth noting a few significant aspects of the standard which can be useful in this process. First, because SDMX-ML is XML, it provides a format which is not specific to any particular software package. Because of this, it can be used as a good archival format. Second, because SDMX has an XML expression of the structures in the DSD, it is possible to always understand how an SDMX-ML data set is structured, such that it can always be easily processed. Third, SDMX has strict rules about versioning. For archival use, this is good, because changes in the data sets and their structures over time can be recorded and stored.
Thus, while SDMX is not explicitly designed as an archival format, there are aspects to it which are very useful in this process.
12.3.4 The GSBPM and Other Relevant Aspects of SDMX
One feature of SDMX that should be mentioned is the ability to document standard statistical processes. This is done by describing a Process, which is made up of Process Steps, which may themselves have sub-steps. Each step or sub-step can have inputs and outputs, and can be named and described. A Process Step can link to another Process Step either as a hierarchy or by reference via a Transition. The Computation involved in a Process Step can be documented, including the actual software used in the Process Step. Note that there is no support yet in SDMX for specifying actual computations in a way that can be invoked by software.
Figure 41: Schematic of Model for Process in SDMX
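The Process model described above (steps, sub-steps, inputs, outputs, transitions and a documented computation) can be approximated with a few data classes. The class and field names are assumptions made for this sketch and are not the names used in the SDMX information model; the step names follow the GSBPM steps used in this annex.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessStep:
    step_id: str
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    computation: str = ""  # free-text documentation, e.g. the software used
    sub_steps: list = field(default_factory=list)
    transitions: list = field(default_factory=list)  # IDs of target steps


@dataclass
class Process:
    name: str
    steps: list = field(default_factory=list)

    def walk(self):
        """Yield every step and sub-step, depth first."""
        stack = list(reversed(self.steps))
        while stack:
            step = stack.pop()
            yield step
            stack.extend(reversed(step.sub_steps))


process = Process("GSBPM", steps=[
    ProcessStep("5", "Process", transitions=["6"], sub_steps=[
        ProcessStep("5.7", "Calculate Aggregates",
                    inputs=["micro-data"], outputs=["aggregates"]),
    ]),
    ProcessStep("6", "Analyse", sub_steps=[
        ProcessStep("6.1", "Prepare Draft Outputs", inputs=["aggregates"]),
        ProcessStep("6.2", "Validate Outputs"),
    ]),
])
step_ids = [step.step_id for step in process.walk()]
```

Note that the computation is held only as descriptive text, in line with the point that SDMX documents computations but does not yet specify them in an executable form.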
The process can be expressed in SDMX-ML, so that documentation can be produced in many useful formats, using the same types of transforms described earlier for disseminating statistical data sets. Thus, a PDF or HTML version of a process description could be generated from the XML.
It is easy to see that a particular organization could use the GSBPM as the basis of such a process, describe each input and output, and then send this to another organization, so that the exact processing of the data was clear.
There is currently no particular requirement from Eurostat or other data collectors for this functionality of SDMX, but it is being implemented by some NSOs internationally, for internal process descriptions being exchanged between departments within the organization. In future, this feature may be used between organizations as well.
12.4 Summary
Our example involves the micro-data coming from external sources being recoded and aggregated, with consequent reporting and dissemination of the tabulated data. To provide a view of how SDMX can be used in this scenario, the relevant parts of the GSBPM are highlighted below, and a summary table provided.
Figure 42: Summary showing the processes supported by SDMX
The table below summarizes each step in the GSBPM where SDMX is used in our scenario.
| GSBPM Step | Use of SDMX | Notes |
|---|---|---|
| 5.7 Calculate Aggregates | No direct use – may influence earlier steps in collection process | Derived variables and recodes must match the requirements of the standard DSD |
| 5.8 Finalize Data Files | Use of SDMX-ML DSD and data formats to format aggregates | Used to pass data and structure to subsequent process steps |
| 6.1 Prepare Draft Outputs | SDMX can help to visualize and process data, and is used as a source format for outputs | Relies on technologies which easily transform XML into other output formats |
| 6.2 Validate Outputs | SDMX-ML provides validation of all rules in the DSD (correct codes, complete and valid descriptions and keys, etc.) | Some validation can be performed by XML schema (e.g. use of valid codes and dimension IDs), some validation can be undertaken with other SDMX constructs such as constraints, whilst some cannot be performed using SDMX structures, e.g. comparison of numbers to determine plausibility |
| 6.3 Scrutinize and Explain | SDMX visualizations may help to easily view data and generate views for output products | |
| 6.4 Apply Disclosure Control | SDMX visualizations help to verify disclosure processing | Not a primary application of SDMX, which does not dictate anything about disclosure |
| 6.5 Finalize Outputs | SDMX visualizations may provide views of data for final outputs; outputs may be generated on-demand for dissemination on Website, etc. | SDMX data must be updated if data are corrected |
| 7.1 Update Output Systems | SDMX provides useful format for loading into output systems | Most technology tools and databases provide good support for XML formats such as SDMX-ML |
| 7.2 Produce Dissemination Products | SDMX visualizations may provide views of data for final outputs; outputs may be generated on-demand for dissemination on Website, etc. | |
| 7.3 Manage Release of Dissemination Products | SDMX serves as a format for reporting and dissemination to some users/data collectors; serves as basis for generation of other outputs, whether static or on-demand | |
| 7.4 Promote Dissemination Products | Use of SDMX Registry Services provides a high level of visibility for data | Depends on the availability of a domain registry for this purpose – requires that new data be registered automatically or manually |
| 8.2 Manage Archive Repository | SDMX provides an easy format for generation of formats needed, based on the user demands on the archive; strict version control allows for explicit management of dependencies between data and metadata | |
| 8.3 Preserve Data and Associated Metadata | Rich metadata and application/platform independence make SDMX a good archival format | |

The benefits of using SDMX here are several:
- Standard data structures, statistical concepts and classifications, and formats make it easy to process and compare similar types of data from different national sources, both for data collectors and other users
- Richer dissemination format, complete with metadata, supports not only good visualization of data, but also allows easy downloading and use of data in internal systems
- SDMX-ML provides an excellent format for having a single source of data which can be easily transformed into different output formats for use
- Data becomes easier to find and use, through SDMX Registry Services, promoting the use of the data
- Data is archived in a long-lived format, independent of applications/platforms, and is accompanied by rich metadata, managed according to strict versioning rules
In some other scenarios, SDMX might also be useful as a data collection format, but in the case where micro-data are aggregated, the use of SDMX will be similar to that described here.