3 Guide for SDMX Format Standards

Last modified by Artur on 2025/09/10 11:19

3.1 Introduction

This guide exists to provide information to implementers of the SDMX format standards – SDMX-ML and SDMX-EDI – that are concerned with data, i.e. Data Structure Definitions and Data Sets. This section is intended to provide information which will help users of SDMX understand and implement the standards. It is not normative, and it does not provide any rules for the use of the standards, such as those found in SDMX-ML: Schema and Documentation and SDMX-EDI: Syntax and Documentation.

3.2 SDMX Information Model for Format Implementers

3.2.1 Introduction

The purpose of this sub-section is to provide an introduction to the SDMX-IM relating to Data Structure Definitions and Data Sets for those whose primary interest is in the use of the XML or EDI formats. For those wishing to have a deeper understanding of the Information Model, the full SDMX-IM document and other sections in this guide provide a more in-depth view, along with UML diagrams and supporting explanation. For those who are unfamiliar with DSDs, an appendix to the SDMX-IM provides a tutorial which may serve as a useful introduction.

The SDMX-IM is used to describe the basic data and metadata structures used in all of the SDMX data formats. The Information Model concerns itself with statistical data and its structural metadata, and that is what is described here. Both structural metadata and data have some additional metadata in common, related to their management and administration. These aspects of the data model are not addressed in this section; they are covered elsewhere in this guide or in the full SDMX-IM document.

The Data Structure Definition and Data Set parts of the information model are consistent with the GESMES/TS version 3.0 Data Model (called SDMX-EDI in the SDMX standard), with these exceptions:

the “sibling group” construct has been generalized to permit any dimension or dimensions to be wildcarded, and not just frequency, as in GESMES/TS. It has been renamed a “group” to distinguish it from the “sibling group” where only frequency is wildcarded. The set of allowable partial “group” keys must be declared in the DSD, and attributes may be attached to any of these group keys;
furthermore, whilst the “group” has been retained for compatibility with version 2.0 and with SDMX-EDI, it has, at version 2.1, been replaced by the “Attribute Relationship” definition, which is explained later;
the section on data representation is now a convention, to support interoperability with EDIFACT-syntax implementations (see section 3.3.2);
DSD-specific data formats are derived from the model, and some supporting features for declaring multiple measures have been added to the structural metadata descriptions.

Clearly, this is not a coincidence. The GESMES/TS Data Model provides the foundation for the EDIFACT messages in SDMX-EDI, and is also the starting point for the development of SDMX-ML.
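The generalized “group” described above can be pictured as a partial key: the series key with one or more dimensions wildcarded. The following Python fragment is a minimal sketch of that idea only, not part of the standard. The dimension names (FREQ, REF_AREA, CURRENCY) and the function names are hypothetical, and a real implementation would take the allowable group keys from the DSD.

```python
from collections import defaultdict


def group_key(series_key: dict[str, str], wildcarded: set[str]) -> tuple[tuple[str, str], ...]:
    """Return the partial key that remains after wildcarding the given dimensions."""
    return tuple((dim, val) for dim, val in series_key.items() if dim not in wildcarded)


def group_series(series_keys: list[dict[str, str]], wildcarded: set[str]) -> dict:
    """Collect series that share the same partial (group) key."""
    groups = defaultdict(list)
    for key in series_keys:
        groups[group_key(key, wildcarded)].append(key)
    return dict(groups)


# Hypothetical series keys; the dimension names are assumptions for illustration.
series = [
    {"FREQ": "M", "REF_AREA": "DE", "CURRENCY": "USD"},
    {"FREQ": "Q", "REF_AREA": "DE", "CURRENCY": "USD"},
    {"FREQ": "M", "REF_AREA": "FR", "CURRENCY": "USD"},
]

# Wildcarding only frequency reproduces the GESMES/TS "sibling group".
siblings = group_series(series, {"FREQ"})

# Wildcarding several dimensions gives the generalized "group".
by_currency = group_series(series, {"FREQ", "REF_AREA"})
```

In this sketch an attribute attached to a group would apply to every series collected under the same partial key.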

Note that in the descriptions below, text that is in courier and italicised gives the names used in the information model (e.g. DataSet).

3.3 SDMX-ML and SDMX-EDI: Comparison of Expressive Capabilities and Function

SDMX offers several equivalent formats for describing data and structural metadata, optimized for use in different applications. Although all of these formats are derived directly from the SDMX-IM, and are thus equivalent, the syntaxes used to express the model place some restrictions on their use. Also, different optimizations provide different capabilities. This section describes these differences, and provides some rules for applications which may need to support more than one SDMX format or syntax. This section is constrained to the Data Structure Definition and the Data Set.

3.3.1 Format Optimizations and Differences

The following section provides a brief overview of the differences between the various SDMX formats.

Version 2.0 was characterised by 4 data messages, each with a distinct format: Generic, Compact, Cross-Sectional and Utility. Because of the design, data in some formats could not always be related to another format. In version 2.1, this issue has been addressed by merging some formats and eliminating others. As a result, in SDMX 2.1 there are just two types of data formats: GenericData and StructureSpecificData (i.e. specific to one Data Structure Definition).

Both of these formats are now flexible enough to allow for data to be oriented in series with any dimension used to disambiguate the observations (as opposed to only time or a cross sectional measure in version 2.0). The formats have also been expanded to allow for ungrouped observations.

To allow for applications which only understand time series data, variations of these formats have been introduced in the form of two data messages: GenericTimeSeriesData and StructureSpecificTimeSeriesData. It is important to note that these variations are built on the same root structure and can be processed in the same manner as the base format, so they do NOT introduce additional processing requirements.

Structure Definition

The SDMX-ML Structure Message supports the use of annotations to the structure, which is not supported by the SDMX-EDI syntax.

The SDMX-ML Structure Message allows for the structures on which a Data Structure Definition depends – that is, codelists and concepts – to be either included in the message or to be referenced by the message containing the data structure definition. XML syntax is designed to leverage URIs and other Internet-based referencing mechanisms, and these are used in the SDMX-ML message. This option is not available to those using the SDMX-EDI structure message.

Validation

SDMX-EDI – as is typical of EDIFACT syntax messages – leaves validation to dedicated applications (“validation” being the checking of syntax, data typing, and adherence of the data message to the structure as described in the structural definition).

The SDMX-ML Generic Data Message also leaves validation above the XML syntax level to the application.

The SDMX-ML DSD-specific messages will allow validation of XML syntax and datatyping to be performed with a generic XML parser, and enforce agreement between the structural definition and the data to a moderate degree with the same tool.

Update and Delete Messages and Documentation Messages

All SDMX data messages allow for both delete messages and messages consisting of only data or only documentation.

Character Encodings

All SDMX-ML messages use the UTF-8 encoding, while SDMX-EDI uses the ISO 8859-1 character encoding. UTF-8 has a greater capacity to express some character sets (see the appendix “MAP OF ISO 8859-1 (UNOC) CHARACTER SET (LATIN 1 OR ‘WESTERN’)” in the document “SYNTAX AND DOCUMENTATION VERSION 2.0”). Many transformation tools are available which allow XML instances with UTF-8 encodings to be expressed as ISO 8859-1-encoded characters. Such tools should be used when transforming SDMX-ML messages into SDMX-EDI messages and vice versa.
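As a rough illustration of such a transformation, the following Python sketch transcodes between UTF-8 and ISO 8859-1 using only the standard codecs. The function names are hypothetical, and replacing unrepresentable characters with "?" is an assumption made for the example; a real converter would follow whatever substitution policy the exchange partners agree.

```python
def utf8_to_latin1(data: bytes) -> tuple[bytes, list[str]]:
    """Transcode UTF-8 bytes to ISO 8859-1 and list the characters that do not fit."""
    text = data.decode("utf-8")
    # ISO 8859-1 covers code points 0x00-0xFF only.
    unmappable = sorted({ch for ch in text if ord(ch) > 0xFF})
    # Assumption: substitute "?" for unmappable characters instead of failing.
    return text.encode("iso-8859-1", errors="replace"), unmappable


def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode ISO 8859-1 bytes to UTF-8; every Latin-1 byte has a UTF-8 form."""
    return data.decode("iso-8859-1").encode("utf-8")


encoded, lost = utf8_to_latin1("café €".encode("utf-8"))
round_trip = latin1_to_utf8(b"caf\xe9")
```

The direction from ISO 8859-1 to UTF-8 is lossless, whereas the reverse direction can lose characters (the euro sign in this example), which is why the sketch reports them.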

Data Typing

The XML syntax and EDIFACT syntax have different data-typing mechanisms. The section below provides a set of conventions to be observed when support for messages in both syntaxes is required. For more information on the SDMX-ML representations of data, see below.

3.3.2 Data Types

The XML syntax has a very different mechanism for data-typing than the EDIFACT syntax, and this difference may create some difficulties for applications which support both EDIFACT-based and XML-based SDMX data formats. This section provides a set of conventions for the expression of data in all formats, to allow for clean interoperability between them.

It should be noted that this section does not address character encodings – it is assumed that conversion software will include the use of transformations which will map between the ISO 8859-1 encoding of the SDMX-EDI format and the UTF-8 encoding of the SDMX-ML formats.

Note that the following conventions may be followed for ease of interoperation between EDIFACT and XML representations of the data and metadata. For implementations in which no transformation between EDIFACT and XML syntaxes is foreseen, the restrictions below need not apply.

1. Identifiers are:
- Maximum 18 characters;
- Any of A..Z (upper case alphabetic), 0..9 (numeric), _ (underscore);
- The first character is alphabetic.
2. Names are:
- Maximum 70 characters;
- From ISO 8859-1 character set (including accented characters).
3. Descriptions are:
- Maximum 350 characters;
- From ISO 8859-1 character set.
4. Code values are:
- Maximum 18 characters;
- Any of A..Z (upper case alphabetic), 0..9 (numeric), _ (underscore), / (solidus, slash), = (equal sign), - (hyphen);
- However, code values providing values to a dimension must use only the following characters: A..Z (upper case alphabetic), 0..9 (numeric), _ (underscore).

5. Observation values are:
- Decimal numerics (signed only if they are negative);
- The maximum number of significant figures is:
  - 15 for a positive number;
  - 14 for a positive decimal or a negative integer;
  - 13 for a negative decimal;
- Scientific notation may be used.

6. Uncoded statistical concept text values are:
- Maximum 1050 characters;
- From ISO 8859-1 character set.

7. Time series keys:

In principle, the maximum permissible length of time series keys used in a data exchange does not need to be restricted. However, for working purposes, an effort is made to limit the maximum length to 35 characters. For SDMX-EDI, this length also includes one separator position between each pair of successive dimension values, so the maximum length allowed for a pure series key (the concatenation of dimension values) can be less than 35 characters. By convention, the separator character is a colon (“:”).
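The conventions above lend themselves to simple mechanical checks. The following Python sketch shows one possible way to test identifiers, code values, text fields and series key lengths against them. All function and constant names are hypothetical, and the Latin-1 test relies on Python's iso-8859-1 codec, which only approximates the character set described here (it also accepts control characters).

```python
import re

# Identifiers: at most 18 characters, A-Z, 0-9 and underscore, first character alphabetic.
IDENTIFIER_RE = re.compile(r"[A-Z][A-Z0-9_]{0,17}")
# Code values: at most 18 characters, additionally allowing "/", "=" and "-".
CODE_RE = re.compile(r"[A-Z0-9_/=\-]{1,18}")
# Code values used for a dimension: the restricted character set only.
DIMENSION_CODE_RE = re.compile(r"[A-Z0-9_]{1,18}")


def is_valid_identifier(value: str) -> bool:
    return IDENTIFIER_RE.fullmatch(value) is not None


def is_valid_code(value: str, for_dimension: bool = False) -> bool:
    pattern = DIMENSION_CODE_RE if for_dimension else CODE_RE
    return pattern.fullmatch(value) is not None


def is_latin1_text(value: str, max_length: int) -> bool:
    """Check a name (70), description (350) or uncoded text value (1050)."""
    if len(value) > max_length:
        return False
    try:
        value.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


def series_key(dimension_values: list[str], separator: str = ":") -> str:
    """Join dimension values with the conventional colon separator."""
    return separator.join(dimension_values)


def fits_working_key_length(dimension_values: list[str], limit: int = 35) -> bool:
    """Apply the 35-character working limit, separators included."""
    return len(series_key(dimension_values)) <= limit
```

For example, `is_valid_code("A/B=C-1")` is accepted as a general code value but rejected with `for_dimension=True`, because dimension codes may not contain the solidus, equal sign or hyphen.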

3.4 SDMX-ML and SDMX-EDI Best Practices

3.4.1 Reporting and Dissemination Guidelines

3.4.1.1 Central Institutions and Their Role in Statistical Data Exchanges

Central institutions are the organisations to which other partner institutions "report" statistics. Central institutions use these statistics to compile aggregates and/or put them together and make them available in a uniform manner (e.g. on-line, on a CD-ROM or through file transfers). Therefore, central institutions receive data from other institutions and, usually, they also "disseminate" data to individuals and/or institutions for end-use. Within a country, an NSI or a national central bank (NCB) plays, of course, a central institution role as it collects data from other entities and disseminates statistical information to end users. In SDMX the role of central institution is very important: every statistical message is based on underlying structural definitions (statistical concepts, code lists, DSDs) which have been devised by a particular agency, usually a central institution. Such an institution plays the role of the reference "structural definitions maintenance agency" for the corresponding messages which are exchanged. Of course, two institutions could exchange data using or referring to structural information devised by a third institution.

Central institutions can play a double role:

collecting and further disseminating statistics;
devising structural definitions for use in data exchanges.

3.4.1.2 Defining Data Structure Definitions (DSDs)

The following guidelines are suggested for building a DSD. However, it is expected that these guidelines will be considered by central institutions when devising new DSDs.

Dimensions, Attributes and Code Lists

Avoid dimensions that are not appropriate for all the series in the data structure definition. If some dimensions are not applicable (this is evident from the need to have a code in a code list which is marked as “not applicable”, “not relevant” or “total”) for some series then consider moving these series to a new data structure definition in which these dimensions are dropped from the key structure. This is a judgement call as it is sometimes difficult to achieve this without increasing considerably the number of DSDs.

Devise DSDs with a small number of Dimensions for public viewing of data. A DSD with more than 6 or 7 dimensions is often difficult for non-specialist users to understand. In these cases it is better to have a larger number of DSDs with smaller “cubes” of data, or to eliminate dimensions and aggregate the data at a higher level. Dissemination of data on the web is a growing use case for the SDMX standards: the differentiations of observations by dimension that are necessary for statisticians and economists are often obscure to public consumers, who may not always understand the semantics of the differentiation.

Avoid composite dimensions. Each dimension should correspond to a single characteristic of the data, not to a combination of characteristics.

Consider the inclusion of the following attributes. Once the key structure of a data structure definition has been decided, then the set of (preferably mandatory) attributes of this data structure definition has to be defined. In general, some statistical concepts are deemed necessary across all Data Structure Definitions to qualify the contained information. Examples of these are:

A descriptive title for the series (this is most useful for dissemination of data for viewing e.g. on the web)
Collection (e.g. end of period, averaged or summed over period)
Unit (e.g. currency of denomination)
Unit multiplier (e.g. expressed in millions)
Availability (the institutions to which a series can be made available)
Decimals (i.e. number of decimal digits used in numerical observations)
Observation Status (e.g. estimate, provisional, normal)

Moreover, additional attributes may be considered as mandatory when a specific data structure definition is defined.

Avoid creating a new code list where one already exists. It is highly recommended that structural definitions and code lists be consistent with internationally agreed standard methodologies, wherever they exist, e.g. System of National Accounts 1993; Balance of Payments Manual, Fifth Edition; Monetary and Financial Statistics Manual; Government Finance Statistics Manual, etc. When setting up a new data exchange, the following order of priority is suggested when considering the use of code lists:

international standard code lists;
international code lists supplemented by other international and/or regional institutions;
standardised lists used already by international institutions;
new code lists agreed between two international or regional institutions;
new specific code lists.

The same code list can be used for several statistical concepts, within a data structure definition or across DSDs. Note that SDMX has recognised that these classifications are often quite large and the usage of codes in any one DSD is only a small extract of the full code list. In this version of the standard it is possible to exchange and disseminate a partial code list which is extracted from the full code list and which supports the dimension values valid for a particular DSD.

Data Structure Definition Structure

The following items have to be specified by a structural definitions maintenance agency when defining a new data structure definition:

Data structure definition (DSD) identification:

DSD identifier
DSD name

A list of metadata concepts assigned as dimensions of the data structure definition. For each:

(statistical) concept identifier
ordinal number of the dimension in the key structure (SDMX-EDI only)
code list identifier (Id, version, maintenance agency) if the representation is coded

A list of (statistical) concepts assigned as attributes for the data structure definition. For each:

(statistical) concept identifier
code list identifier if the concept is coded
assignment status: mandatory or conditional
attachment level
maximum text length for the uncoded concepts
maximum code length for the coded concepts

A list of the code lists used in the data structure definition. For each:

code list identifier
code list name
code values and descriptions

Definition of data flow definitions. Two (or more) partners performing data exchanges in a certain context need to agree on:

the list of data set identifiers they will be using;
for each data flow:
its content and description
the relevant DSD that defines the structure of the data reported or disseminated according to the dataflow definition
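The items listed above can be pictured as a small set of records. The following Python sketch is one hypothetical way to hold them in memory. It is not the SDMX-IM class model, and every class, field and identifier name in it (including CL_FREQ and DEMO_DSD) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodeList:
    id: str
    name: str
    codes: dict[str, str] = field(default_factory=dict)  # code value -> description


@dataclass
class Dimension:
    concept_id: str
    position: int  # ordinal number in the key structure (SDMX-EDI only)
    code_list_id: Optional[str] = None  # set when the representation is coded


@dataclass
class Attribute:
    concept_id: str
    assignment_status: str  # "mandatory" or "conditional"
    attachment_level: str
    code_list_id: Optional[str] = None  # set when the concept is coded
    max_length: Optional[int] = None  # maximum text or code length


@dataclass
class DataStructureDefinition:
    id: str
    name: str
    dimensions: list[Dimension] = field(default_factory=list)
    attributes: list[Attribute] = field(default_factory=list)
    code_lists: dict[str, CodeList] = field(default_factory=dict)

    def missing_code_lists(self) -> list[str]:
        """Code lists referenced by a component but not supplied with the DSD."""
        components = [*self.dimensions, *self.attributes]
        referenced = {c.code_list_id for c in components if c.code_list_id}
        return sorted(referenced - set(self.code_lists))


@dataclass
class Dataflow:
    id: str
    description: str
    dsd_id: str  # the DSD that structures data reported under this dataflow


dsd = DataStructureDefinition(
    id="DEMO_DSD",
    name="Demonstration structure",
    dimensions=[Dimension("FREQ", position=1, code_list_id="CL_FREQ")],
    attributes=[Attribute("OBS_STATUS", "mandatory", "observation", "CL_OBS_STATUS", 1)],
    code_lists={"CL_FREQ": CodeList("CL_FREQ", "Frequency", {"M": "Monthly"})},
)
flow = Dataflow("DEMO_FLOW", "Demonstration dataflow", dsd_id=dsd.id)
```

A check such as `missing_code_lists()` reflects the requirement that each coded dimension or attribute points at a code list; here it would flag the code list that the attribute references but the DSD does not supply.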

3.4.1.3 Exchanging Attributes

3.4.1.3.1 Attributes on series, sibling and data set level

Static properties.

Upon creation of a series the sender has to provide to the receiver values for all mandatory attributes. In case they are available, values for conditional attributes should also be provided. Whereas initially this information may be provided by means other than SDMX-ML or SDMX-EDI messages (e.g. paper, telephone) it is expected that partner institutions will be in a position to provide this information in SDMX-ML or SDMX-EDI format over time.
A centre may agree with its data exchange partners special procedures for authorising the setting of attributes' initial values.
Attribute values at a data set level are set and maintained exclusively by the centre administrating the exchanged data set.

Communication of changes to the centre.

Following the creation of a series, the attribute values do not have to be reported again by senders, as long as they do not change.
Whenever changes in attribute values for a series (or sibling group) occur, the reporting institutions should report either all attribute values again (this is the recommended option) or only the attribute values which have changed. This applies both to the mandatory and the conditional attributes. For example, if a previously reported value for a conditional attribute is no longer valid, this has to be reported to the centre.
A centre may agree with its data exchange partners special procedures for authorising modifications in the attribute values.
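The two reporting options described above (resend all attribute values, or only those that changed) can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, attribute values are held as plain dictionaries, and a conditional attribute that is no longer valid is represented by None, which is a choice made for this example rather than something prescribed by the formats.

```python
from typing import Optional

AttributeValues = dict[str, Optional[str]]


def attributes_to_report(
    previous: AttributeValues,
    current: AttributeValues,
    full_resend: bool = True,
) -> AttributeValues:
    """Return the attribute values a sender would report after a change.

    Nothing is reported when no value changed. With full_resend (the
    recommended option) all current values are reported; otherwise only
    the changed ones. Values that are no longer valid are reported as None.
    """
    changed = {
        name: current.get(name)
        for name in sorted(set(previous) | set(current))
        if previous.get(name) != current.get(name)
    }
    if not changed:
        return {}
    if not full_resend:
        return changed
    report = dict(current)
    report.update(changed)  # adds withdrawn attributes with the value None
    return report


# Hypothetical attribute names and values.
before = {"TITLE": "Exchange rate", "UNIT": "EUR", "COMMENT": "provisional source"}
after = {"TITLE": "Exchange rate", "UNIT": "USD"}

delta_only = attributes_to_report(before, after, full_resend=False)
everything = attributes_to_report(before, after)
```

In the example the withdrawn COMMENT attribute appears in both reports, mirroring the rule that a conditional attribute value that is no longer valid has to be communicated to the centre.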

Communication of observation level attributes “observation status”, "observation confidentiality", "observation pre-break".

In SDMX-EDI, the observation level attribute “observation status” is part of the fixed syntax of the ARR segment used for observation reporting. Whenever an observation is exchanged, the corresponding observation status must also be exchanged attached to the observation, regardless of whether it has changed or not since the previous data exchange. This rule also applies to the use of the SDMX-ML formats, although the syntax does not necessarily require this.
If the “observation status” changes and the observation remains unchanged, both components would have to be reported.
For Data Structure Definitions that also define the observation level attributes “observation confidentiality” and "observation pre-break", this rule applies to these attributes as well: if an institution receives from another institution an observation with only an observation status attribute attached, this means that the associated observation confidentiality and pre-break observation attributes either never existed or, from now on, do not have a value for this observation.
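One way to read the rule above is that an incoming observation replaces the stored one as a whole, rather than being merged with it. The Python sketch below illustrates that reading. The class and function names are hypothetical, observation periods are plain strings, and rejecting an empty status is an assumption that mirrors the requirement that the status always accompanies the observation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObservationMessage:
    value: str
    status: str  # always exchanged together with the observation
    confidentiality: Optional[str] = None
    pre_break: Optional[str] = None


def apply_observation(
    store: dict[str, ObservationMessage],
    period: str,
    incoming: ObservationMessage,
) -> None:
    """Replace the stored observation; attributes absent from the message are cleared."""
    if not incoming.status:
        raise ValueError("observation status must accompany every observation")
    store[period] = incoming


# Hypothetical stored observation carrying a confidentiality flag.
store = {"2024-01": ObservationMessage("101.5", "A", confidentiality="C")}

# The update carries the observation status only, so confidentiality is cleared.
apply_observation(store, "2024-01", ObservationMessage("101.7", "A"))
```

Replacing rather than merging is what makes an omitted confidentiality or pre-break value meaningful to the receiver.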

3.4.2 Best Practices for Batch Data Exchange

3.4.2.1 Introduction

Batch data exchange is the exchange and maintenance of entire databases between counterparties. It is an activity that often employs SDMX-EDI formats, and might also use the SDMX-ML DSD-specific data set. The following points apply equally to both formats.

3.4.2.2 Positioning of the Dimension "Frequency"

The position of the “frequency” dimension is unambiguously identified in the data structure definition. Moreover, most central institutions devising structural definitions have decided to assign to this dimension the first position in the key structure. This facilitates the easy identification of this dimension, which is necessary given frequency's crucial role in several database systems and in attaching attributes at the “sibling” group level.

3.4.2.3 Identification of Data Structure Definitions (DSDs)

In order to facilitate the easy and immediate recognition of the structural definition maintenance agency that defined a data structure definition, most central institutions devising structural definitions use the first characters of the data structure definition identifiers to identify their institution: e.g. BIS_EER, EUROSTAT_BOP_01, ECB_BOP1, etc.

3.4.2.4 Identification of the Data Flows

In order to facilitate the easy and immediate recognition of the institution administrating a data flow definition, many central institutions prefer to use the first characters of the data flow definition identifiers to identify their institution: e.g. BIS_EER, ECB_BOP1, etc. Note that in GESMES/TS the Data Set plays the role of the data flow definition (see DataSet in the SDMX-IM).

The statistical information in SDMX is broken down into two fundamental parts: structural metadata (comprising the Data Structure Definition and associated Concepts and Code Lists; see Framework for Standards) and observational data (the DataSet). This is an important distinction, with specific terminology associated with each part. Data, which is typically a set of numeric observations at specific points in time, is organized into data sets (DataSet). These data sets are structured according to a specific Data Structure Definition (DataStructureDefinition) and are described in the data flow definition (DataflowDefinition). The Data Structure Definition describes the metadata that allows an understanding of what is expressed in the data set, whilst the data flow definition provides the identifier and other important information (such as the periodicity of reporting) that is common to all of its component data sets.

Note that the role of the Data Flow (called DataflowDefinition in the model) and Data Set is very specific in the model, and the terminology used may not be the same as that used in all organisations; specifically, the term Data Set is used differently in SDMX than in GESMES/TS. Essentially, the GESMES/TS term "Data Set" corresponds, in SDMX, to the "Dataflow Definition", whilst the term Data Set in SDMX is used to describe the "container" for an instance of the data.

3.4.2.5 Special Issues

3.4.2.5.1 "Frequency" related issues

Special frequencies. The issue of data collected at special (regular or irregular) intervals at a lower than daily frequency (e.g. 24 or 36 or 48 observations per year, on irregular days during the year) is not extensively discussed here. However, for data exchange purposes:

such data can be mapped into a series with daily frequency; this daily series will only hold observations for those days on which the measured event takes place;
if the collection intervals are regular, additional values to the existing frequency code list(s) could be added in the future.
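The first option above, mapping specially spaced data into a daily series, amounts to a sparse daily series that carries observations only for the days on which the event occurs. The Python sketch below illustrates this. The function name and the dictionary layout are hypothetical, and the frequency code "D" for daily is an assumption about the code list in use.

```python
from datetime import date


def to_daily_series(observations: dict[date, float]) -> dict[str, object]:
    """Express irregularly spaced observations as a sparse daily series.

    Only the days with a measured event appear; no filler values are
    generated for the days in between.
    """
    return {
        "FREQ": "D",  # assumed code for daily frequency
        "observations": {
            day.isoformat(): value for day, value in sorted(observations.items())
        },
    }


# Hypothetical measurements taken on two irregular days of the month.
daily = to_daily_series({date(2024, 1, 20): 2.0, date(2024, 1, 5): 1.5})
```

Days without a measurement are simply absent from the series, so a receiver should not interpret a missing day as a missing value.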

Tick data. The issue of data collected at irregular intervals at a higher than daily frequency (e.g. tick-by-tick data) is not discussed here either. However, for data exchange purposes, such series can already be exchanged in the SDMX-EDI format by using the option to send observations with the associated time stamp.