4 Data and Metadata Creation and Reporting
4.1 Scope of this Chapter
This chapter covers the creation of data and metadata for reporting purposes, starting with the definition of the structures which the data and metadata are to be reported against. This chapter references two sets of samples, the details of which are provided in Annex 3 – Data and Metadata Samples. This section focuses on the details of the fundamental features of SDMX whereas the samples referenced in the Annex demonstrate the syntactical representation of these features in SDMX-ML.
4.2 Basics
Fundamental to defining a structure definition is defining the concepts which describe the information contained in the data or metadata and the code lists which provide specific values for these concepts.
4.2.1 Defining Concepts
Any component of a data or metadata structure definition must take its semantic from a concept. This concept is important in that it:
- Provides a detailed definition of the component which describes the structure of the data or metadata.
- Can allow data and metadata for different structures to be comparable when concepts are reused.
It is important when defining concepts to first consider concepts which are already defined within a given community, whether it be the SDMX community as a whole or a smaller community of users in a particular sector. As a general rule, one should not define a new concept when an existing concept will suffice. By reusing an existing concept, data and metadata are more easily understood in a wider range of applications. This leads to greater interoperability and comparability of data/metadata.
Assuming new concepts are to be defined, the first step is to determine an appropriate grouping for concepts. In SDMX, all concepts are defined in schemes. These schemes serve to group similar concepts into groups which can be useful for maintenance purposes. One possible distinction to be used to determine the grouping of concepts is their intended usage. If concepts are only to be used for metadata, it may be best to group these into a single scheme. Similarly, if concepts are only to be used for data, these too may be grouped into a single scheme. It is important to consider that concepts themselves are not versioned, rather the schemes in which they are defined are versioned. This means that if any property of a concept is to change, the version of the scheme in which it is defined must change if the scheme is marked as “final”. This in effect versions all concepts defined in that scheme. Therefore, when grouping concepts into schemes it is important to consider the stability of the concept definitions.
A concept itself consists of three main components:
- Its identification, which must be unique within the scheme.
- Its name.
- Its description.
Note that the description is not mandatory, but is highly recommended in order to provide a more complete definition of a concept. When you consider that these concepts are the building blocks which are constructed to define the structure of all data and metadata in SDMX, it should be apparent why complete definitions are important.
In addition to the basic definition properties of the concept, a concept can reference another concept from within the same scheme as its parent concept. The exact nature of this parent-child relationship is not strictly enforced by the standard, but it is typically used to denote that the child concept is a specialization of the parent. For example, one may have a concept which defines a reference area concept as a "geographic area to which a measured statistical phenomenon relates". In order to allow for more specific types of reference areas, one might define concepts for countries as well as groups of geographically similar countries (e.g. continents) and political groupings of countries (e.g. military or economic alliances). These child concepts could reference the reference area concept as a parent in order to note that they are specializations of a reference area. An example of this can be seen in the Common Structures sample set. In this sample the confidentiality status references the confidentiality concept as a parent. The status is a specialization of a general confidentiality concept.
Note: the test for whether something is a specialization is to ask the question “is xxx a type of yyy?”. In the Cross Domain Concepts the question would be “Is confidentiality status a type of confidentiality?”. The answer is “no”. The problem here is that the cross domain concepts are often grouped in a structural rather than a semantic sense. This, unfortunately, is carried over in the SDMX-ML and needs to be rectified.
When a component in a structure definition takes its semantic from a concept, it is always provided a value in the data or metadata reported against the structure. A concept can define the default nature of these values. This is done with a definition of a core representation. The core representation can either be an un-coded text format, in which a data type is defined along with facets which serve to further restrict that data type, or a coded enumeration in which a code list can be referenced. In the case where an enumerated core representation is defined, the nature of the enumeration values can be described in the same manner that the un-coded text format could be. Note that if a concept is not provided a core representation, it is assumed to have an unbounded value set (i.e. it could be represented by a variety of code lists or un-enumerated formats when used in a DSD or MSD). However, as with any core representation, a usage of the concept in a structure definition can always provide a local representation which overrides the core representation of the concept.
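The way a component resolves its representation can be illustrated with a minimal sketch. This is not SDMX-ML and not part of any SDMX library; the class names, the `effective_representation` method, and the identifiers `FREQ`, `CL_FREQ`, `CL_FREQ_SHORT` and `COMMENT` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Representation:
    """Either an un-coded text format or a reference to a code list."""
    data_type: Optional[str] = None       # e.g. "String" (un-coded)
    code_list_id: Optional[str] = None    # e.g. "CL_FREQ" (coded)


@dataclass(frozen=True)
class Concept:
    id: str
    name: str
    core_representation: Optional[Representation] = None


@dataclass(frozen=True)
class Component:
    concept: Concept
    local_representation: Optional[Representation] = None

    def effective_representation(self) -> Optional[Representation]:
        # A local representation always overrides the concept's core one.
        if self.local_representation is not None:
            return self.local_representation
        # May be None: the concept then has an unbounded value set.
        return self.concept.core_representation


# Hypothetical concepts and code list identifiers.
freq = Concept("FREQ", "Frequency", Representation(code_list_id="CL_FREQ"))
inherited = Component(freq)
overridden = Component(freq, Representation(code_list_id="CL_FREQ_SHORT"))
unbounded = Component(Concept("COMMENT", "Comment"))
```

In this sketch a result of `None` stands for the unbounded value set described above; a real implementation would more likely model that case explicitly.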
Finally, in order to provide a more complete conceptual definition, the concept can reference an ISO 11179 concept. The intention of this reference is to reference the ISO 11179 concept definition from which this SDMX concept definition is derived.
4.2.2 Defining Code Lists
Any component in a structure definition can specify an enumeration of possible values. These enumerations are defined as code lists. A code list contains a value set of enumerations and the names and descriptions of what is represented by the coded value.
The code list itself is provided an identity, name, and description. The codes comprising the code list have the same properties. Note that the identification of the code is its code value which will be used in any data/metadata sets.
Codes can also be arranged into simple hierarchies, where any code can reference another code within the same list as its parent. Note that SDMX does not formally define the exact nature of this relationship (e.g. whether the child code is additive to the parent and if so, whether there is a weight associated with the code). Nonetheless, it is often useful to capture the fact that some formal relationship does exist, if only to allow for more detailed descriptions of these code lists to be accessed in order to properly understand the data/metadata.
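A simple code hierarchy of this kind can be sketched as follows. The classes, the `ancestors` helper, and the code list `CL_AREA_EXAMPLE` with its codes are hypothetical and serve only to show that a parent must be a code in the same list; the sketch deliberately attaches no meaning (additive or otherwise) to the relationship.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Code:
    id: str                          # the code value used in data/metadata sets
    name: str
    parent_id: Optional[str] = None  # must refer to a code in the same list


@dataclass
class CodeList:
    id: str
    name: str
    codes: Dict[str, Code] = field(default_factory=dict)

    def add(self, code: Code) -> None:
        if code.id in self.codes:
            raise ValueError(f"duplicate code id: {code.id}")
        if code.parent_id is not None and code.parent_id not in self.codes:
            raise ValueError(f"parent {code.parent_id} is not in this list")
        self.codes[code.id] = code

    def ancestors(self, code_id: str) -> List[str]:
        """Return parent ids from the nearest parent up to the root."""
        chain = []
        parent = self.codes[code_id].parent_id
        while parent is not None:
            chain.append(parent)
            parent = self.codes[parent].parent_id
        return chain


areas = CodeList("CL_AREA_EXAMPLE", "Example reference areas")
areas.add(Code("WORLD", "World"))
areas.add(Code("EUROPE", "Europe", parent_id="WORLD"))
areas.add(Code("FR", "France", parent_id="EUROPE"))
```

Because a parent must already be present when a code is added, this sketch cannot produce a cyclic hierarchy.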
4.3 Data
Fundamental to SDMX is the exchange of data, or more specifically multi-dimensional data against a known structure. Data in SDMX starts with the structure which defines it.
This section details the data structure and its relationship to data sets.
4.3.1 Defining Data Structures
A data structure is a collection of components which define what is being measured and what additional properties (metadata) can be transmitted alongside the actual observed data values.
4.3.1.1 Data Structure Components
Figure 3: Schematic of the Data Structure Definition
Each component must reference a concept from which it takes its semantic. This concept defines the meaning of the component. Typically, the component will also take its identity from this concept, although it is possible for a (data) structure-specific identity to be created. Regardless of whether the concept identity is used or a local identifier is assigned, the identifier of a component must be unique within the scope of all data structure components.
A component can also inherit the representation of the concept from which it takes its semantic, or it can provide a data structure specific representation for the concept. This representation can be either coded or un-coded, although some component types have more restrictive representations.
A data structure component can also reference a collection of concepts for the purpose of identifying specific roles that the concept serves in the data structure. For example, a user community may define a collection of concepts which are intended to note special roles for data structure components (e.g. the unit of measure). A component can reference any of these concepts in order to specify that the role identified by the concept is being served by the component. Note that a component is implied to serve the role of the concept from which it takes its identity, therefore it does not need to reference this concept again.
4.3.1.1.1 Primary Measure
Every data structure must define a primary measure. This component always has a fixed identifier (OBS_VALUE), and its purpose is to give a consistent data structure artefact where the observed value can be found. This makes the processing of data messages much more consistent, as the measured value is always readily found. The primary measure can define a representation, or it can inherit the representation from the concept from which it takes its semantic (this concept can have, but need not have, an Id of OBS_VALUE).
4.3.1.1.2 Dimensions
The identification of the phenomenon being measured is achieved through the dimension descriptor. Any given data structure must have at least one dimension. The dimensions (in some models these are known as “classificatory variables”) reference concepts which define identifying properties of the phenomenon being measured. In a data set, each dimension is given a value. This set of dimension values is often referred to as a key. In any given data set, there can only be one observed value and collection of data attribute (metadata) values per key. Therefore, a key uniquely identifies any observed phenomenon.
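The role of the key can be shown with a small sketch in which a data set refuses a second observed value for the same key. The dimension identifiers, their order, and the `DataSet` class are assumptions made for illustration only.

```python
from typing import Dict, Tuple

# Hypothetical dimensions of a data structure, in a fixed order.
DIMENSIONS = ("FREQ", "REF_AREA", "TIME_PERIOD")

Key = Tuple[str, ...]


def make_key(values: Dict[str, str]) -> Key:
    """Build a key: one value for every dimension, in a fixed order."""
    missing = [d for d in DIMENSIONS if d not in values]
    if missing:
        raise ValueError(f"missing dimension values: {missing}")
    return tuple(values[d] for d in DIMENSIONS)


class DataSet:
    def __init__(self) -> None:
        self.observations: Dict[Key, float] = {}

    def add(self, dims: Dict[str, str], obs_value: float) -> None:
        key = make_key(dims)
        # Only one observed value is allowed per key.
        if key in self.observations:
            raise ValueError(f"key already has an observed value: {key}")
        self.observations[key] = obs_value


ds = DataSet()
ds.add({"FREQ": "A", "REF_AREA": "FR", "TIME_PERIOD": "2010"}, 1.5)
ds.add({"FREQ": "A", "REF_AREA": "FR", "TIME_PERIOD": "2011"}, 1.7)
```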
There are two specialized dimensions that can be defined only once for a data structure. The first is the time dimension. The time dimension represents the point in time at which the phenomenon was observed. If the explicit time dimension is used, then it must use any or all of the allowable time representations defined in SDMX. These representations are:
- A Gregorian calendar period (which can be any or all of a year, a month and year, or a date)
- A standard reporting period (which can be any or all of a year, a semester, a trimester, a quarter, a week, or a day). Each reporting period exists in the context of a reporting year which is defined by a start day (expressed as a month and day). This start day is communicated in a specialized data attribute which will be discussed later in this section.
- A distinct duration, which consists of a start date and time and a duration
- A distinct point in time (i.e. a full time stamp)
Each of these time representations allows for a time zone offset, so that the exact period of time encompassed by the value can be expressed with absolute precision.
The time dimension has a fixed identification which allows for strictly time series formats to be created. These explicit time series formats will be discussed in the Data Sets section.
The other specialized dimension is the measure dimension. A measure dimension can refer to a collection of properties that are measured for an entity classified by the other dimensions of the data structure (e.g. the demographic measures in the demography data structure). A measure dimension might also refer to the different ways in which a phenomenon may be measured (e.g. weight, volume, and price for a commodity).
A measure dimension must always take its representation from a concept scheme. This concept scheme must define a collection of concepts which define the value set of the measure dimension concept. An example of this can be seen in the Eurostat Demography data structure. For example, the demography measures concept scheme (DEMO_MEASURES) contains only concepts which define demographic measurements.
An advantage of defining a measure dimension is that it also allows for a more explicit definition of the representation of the observed value for a given measure. This is achieved by defining core representations for the measure dimension concepts. Note that it is necessary that the primary measure representation in this case be a superset of all possible representations of the measure concepts. This is demonstrated in the Eurostat Demography data structure. An analysis of the demography measures concept scheme shows that the number of deaths in a year is measured as an integer, whereas the life expectancy is measured as a decimal with only one decimal digit. This representation is carried over to the structure specific schema when the measure dimension is used as the observation dimension, and explicit measures are used. This can be seen in the data structure specific schema for the demography data structure.
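The effect of measure-specific representations can be sketched as a simple lexical check of an observed value against the representation of its measure concept. The measure identifiers `DEATHS` and `LIFE_EXP` and the format dictionary are hypothetical; they only mirror the integer versus one-decimal-digit distinction described above and are not taken from the actual demography concept scheme.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical core representations for two measure concepts.
MEASURE_FORMATS = {
    "DEATHS": {"type": "Integer"},
    "LIFE_EXP": {"type": "Decimal", "max_decimals": 1},
}


def is_valid_observation(measure: str, text: str) -> bool:
    """Check an observed value (as text) against its measure's format."""
    fmt = MEASURE_FORMATS[measure]
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    # Number of digits written after the decimal point.
    decimals = max(0, -value.as_tuple().exponent)
    if fmt["type"] == "Integer":
        return decimals == 0
    return decimals <= fmt["max_decimals"]
```

In this sketch the primary measure would need a representation wide enough to accept both formats (for example, a general decimal), which is the superset requirement noted above.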
By defining a measure dimension, a user of the data reported against the data structure will be able to better understand the relationships that exist between the observed values. Another advantage of using a measure dimension is that the data structure can be specific about the representation of the observed value, because each observed value relates to a specific measure (i.e. a concept in the concept scheme referenced from the measure dimension).
4.3.1.1.3 Attributes
A data structure can also define additional components which serve to hold additional information (metadata) about the data. This information can be presentational in nature (e.g. a series title) or be critical to understanding the data (e.g. the unit of measure). The ECB Exchange Rates data structure contains both of these attributes.
There is one specialized attribute in a data structure, and this is the reporting year start day. If the time dimension of a data structure allows for reporting periods in its representation, then it is strongly recommended that this attribute be used. It has a fixed identifier and representation (a month and day). The purpose of this attribute is to communicate the reference point for a reporting year. This reference point allows the exact calendar period for a reporting period to be determined. If this is not present, then the basis for all reporting periods will be assumed to be January 1.
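How the start day turns a reporting period into a calendar period can be sketched as follows. The sketch covers only month-based periods (semester, trimester, quarter) and assumes that each period spans a whole number of months counted from the reporting year start day; the function names and the single-letter period codes are choices made for this illustration.

```python
import calendar
from datetime import date, timedelta

# Months covered by each month-based reporting period type.
MONTHS_PER_PERIOD = {"S": 6, "T": 4, "Q": 3}


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reporting_period_bounds(year, period_type, number,
                            start_month=1, start_day=1):
    """Return (first_day, last_day) of a reporting period.

    With no start day supplied, the reporting year begins on January 1.
    """
    span = MONTHS_PER_PERIOD[period_type]
    year_start = date(year, start_month, start_day)
    first = add_months(year_start, (number - 1) * span)
    last = add_months(year_start, number * span) - timedelta(days=1)
    return first, last
```

For example, under these assumptions the second quarter of reporting year 2010 runs from 1 April to 30 June when the start day is January 1, but from 1 October to 31 December when the reporting year starts on July 1.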
In addition to the aforementioned component properties, an attribute must define its relationship to the other components of the data structure (i.e. the dimensions or the primary measure). This relationship states how the value of the attribute varies with the value of other components. In the Eurostat Demography data structure, the unit of measure attribute (UNIT_MEASURE) specifies an attribute relationship with the demographic measure dimension (DEMO). This should be intuitive, since the unit of measure differs if one is measuring a count, such as the number of live births in a year, or a rate, such as the fertility rate.
A data structure can define groups for the purpose of specifying attribute values. The advantage of defining groups is to avoid repetition of attribute values in a data set. Each group consists of a unique subset of the dimensions. Attributes can either explicitly specify a relationship with the group, or they can specify relationships with specific dimensions yet still reference a group for attachment purposes (although the dimensions with which the attribute has a relationship must all be part of the group dimensions).
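The two constraints on groups stated above (each group is a unique subset of the dimensions, and an attribute attached to a group may only relate to dimensions of that group) can be sketched as simple set checks. The class, its method names, the group identifier `SIBLING`, and the dimension identifiers are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Iterable


class DataStructure:
    """Holds only what is needed to check group definitions."""

    def __init__(self, dimensions: Iterable[str]) -> None:
        self.dimensions: FrozenSet[str] = frozenset(dimensions)
        self.groups: Dict[str, FrozenSet[str]] = {}

    def add_group(self, group_id: str, dims: Iterable[str]) -> None:
        dim_set = frozenset(dims)
        if not dim_set or not dim_set <= self.dimensions:
            raise ValueError("a group must be a non-empty subset of the dimensions")
        if dim_set in self.groups.values():
            raise ValueError("another group already uses this set of dimensions")
        self.groups[group_id] = dim_set

    def can_attach(self, group_id: str, related_dims: Iterable[str]) -> bool:
        """True if every dimension the attribute relates to is in the group."""
        return frozenset(related_dims) <= self.groups[group_id]


# Hypothetical dimensions and group.
dsd = DataStructure(["FREQ", "CURRENCY", "CURRENCY_DENOM", "TIME_PERIOD"])
dsd.add_group("SIBLING", ["CURRENCY", "CURRENCY_DENOM"])
```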
4.3.2 Data Sets
Every data set in SDMX must conform to a data structure definition. As described above, the data structure definition organizes concept definitions into various components which identify and supplement the data. When processing data, it is critical to be able to retrieve and fully understand the data structure definition. A clear understanding of the data structure is dependent on well-defined concepts. Therefore, useful data starts with well-defined concepts. Ultimately, understanding is dependent upon understanding the concepts which define what is being measured.
In terms of processing data, the data structure definition provides enough information so that the data can be validated and understood. However, the data structure does not dictate the exact orientation of the data.
The orientation of the data is defined by the data set itself. However, SDMX only allows for two basic orientations:
- A flat list of observations in which the full key is provided for each observed value.
- A collection of observations in series where all but one dimension has the same value and each observation is distinguished by the other dimension (e.g. a time series in which a series has a key and each observation in the series has a distinct time value). This dimension is known as the observation dimension.
A data set may also contain groups, if the data structure defines them. Each group will have a unique set of key values and provide the attribute values associated with the key set.
Data sets note the orientation by defining the observation dimension. In the case of the flat orientation, the observation dimension is actually all dimensions. In the second case, it is a specific dimension from the data structure (this must be the same dimension for the entire data set).
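The two orientations can be sketched as two views of the same observations. The functions below regroup a flat list of observations into series for a chosen observation dimension and back again; the dimension identifiers and values are invented for the example and the dictionaries are not a representation of any SDMX message format.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Key = Tuple[str, ...]


def to_series(flat: Dict[Key, float], dimensions: Sequence[str],
              obs_dimension: str) -> Dict[Key, Dict[str, float]]:
    """Group flat observations into series keyed by all other dimensions."""
    series_dims = [d for d in dimensions if d != obs_dimension]
    series: Dict[Key, Dict[str, float]] = defaultdict(dict)
    for key, value in flat.items():
        named = dict(zip(dimensions, key))
        series_key = tuple(named[d] for d in series_dims)
        series[series_key][named[obs_dimension]] = value
    return dict(series)


def to_flat(series: Dict[Key, Dict[str, float]], dimensions: Sequence[str],
            obs_dimension: str) -> Dict[Key, float]:
    """Rebuild the full key for every observation in every series."""
    series_dims = [d for d in dimensions if d != obs_dimension]
    flat = {}
    for series_key, observations in series.items():
        named = dict(zip(series_dims, series_key))
        for obs_value, value in observations.items():
            named[obs_dimension] = obs_value
            flat[tuple(named[d] for d in dimensions)] = value
    return flat


# Hypothetical dimensions and observations.
DIMS = ("FREQ", "CURRENCY", "TIME_PERIOD")
flat_data = {
    ("M", "USD", "2010-01"): 1.43,
    ("M", "USD", "2010-02"): 1.37,
    ("M", "JPY", "2010-01"): 130.3,
}
by_time = to_series(flat_data, DIMS, "TIME_PERIOD")
```

Converting back with `to_flat` yields the original observations, whichever dimension is chosen as the observation dimension.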
This observation dimension dictates where attributes should be present, based on the attribute relationships defined in the data structure. Any attribute which has a relationship with the observation dimension (i.e. the dimension is a part of a group or a set of dimensions with which the attribute has an “attribute relationship”) must exist at the observation level. This also holds true if an attribute has a relationship with the primary measure. If an attribute has a relationship with no data structure components, then only one value is provided per data set (i.e. the attribute exists at the data set level).
If an attribute has a relationship with a group, or specifies a group for attachment, the attribute will be communicated at the group level. In all other cases the attribute will exist at the series level.
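The attachment rules in the two preceding paragraphs can be condensed into a small decision function. This is one reading of those rules, not a normative algorithm; the function signature, the level names it returns, and the `ALL_DIMENSIONS` marker for the flat orientation are assumptions of this sketch.

```python
from typing import Iterable, Optional

ALL_DIMENSIONS = "AllDimensions"  # marker used here for the flat orientation


def attachment_level(related_dims: Iterable[str], obs_dimension: str, *,
                     primary_measure: bool = False,
                     group: Optional[str] = None,
                     group_dims: Iterable[str] = ()) -> str:
    """Return where an attribute value is reported in a data set."""
    # Dimensions reached directly or through the referenced group.
    related = set(related_dims) | set(group_dims)
    if primary_measure:
        return "observation"
    if not related:
        # No relationship with any component: one value per data set.
        return "dataset"
    if obs_dimension == ALL_DIMENSIONS or obs_dimension in related:
        return "observation"
    if group is not None:
        return "group"
    return "series"
```

For instance, an attribute related to a currency dimension through a group is reported at the group level in a time series, but moves to the observation level once the data set uses the flat orientation.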
Data can be expressed in one of four formats:
- Generic
- Time series generic
- Structure specific
- Time series structure specific
The time series formats are actually equivalent in content to their more generalized counterparts when they specify the time dimension at the observation level. Therefore, with the exception of the root element name, a generic data set with time at the observation level will be the same as a time series generic data set. The difference in these formats is that the time series only allows for time to be the dimension at the observation level.
Note that regardless of the organisation or the format, the data expressed is always the same. This can be seen by examining the various data messages for the ECB Exchange Rates data. In fact, this even holds true for the attribute values even though they are expressed at different levels depending on their relationships. This serves to show that it is critical that the attribute relationship be specified correctly, otherwise the correct value cannot be expressed in a data message.
4.4 Metadata
Reference metadata enables additional information to be attached to data or structural metadata. The design of the reference metadata model allows informational metadata to be:
- Structured and validated
- Late bound to the structural metadata or data to which it applies
Consider contact information for a data set. This information could be contained in a single data attribute which could be carried in the data message. However, the nature of data attributes is that they have no sub-structure. Therefore, there would be no means of separating the name of the contact person from the phone number (outside of creating many attributes to hold this information). In addition, this contact information is probably subject to change. It would not make much sense to update the data set when the data itself is unchanged simply to specify a new contact.
This is where reference metadata is useful. The structure of contact information can be clearly specified, and attached dynamically to the data. This same dynamic applies to structural metadata as well. Although all structural metadata components contain annotations, these often do not allow for the structure that is necessary to communicate the desired information.
A major use case for reference metadata is in the support of quality frameworks, where the metadata are not concerned with a data set but with the processes, regulations, and policies of data collection and dissemination.
4.4.1 Defining Metadata Structures
Figure 4: Schematic of the Basic Structure of a Metadata Structure Definition
A metadata structure defines two types of component lists. The first type of component list, the metadata target, serves to identify the types of objects to which the metadata described by this structure can be attached. The second type of component list, the report structure, identifies the content of the metadata reports which can be attached to the target objects.
This is, in a sense, similar to a data structure. In a data structure the dimension list describes how one identifies what is being measured, and the primary measure and attributes describe the details of that measurement. Similarly, the metadata structure uses the metadata target to describe how one identifies what the report pertains to, and the report structure defines the nature of the report in terms of what is to be reported (metadata attributes). The fundamental difference between the metadata structure and the data structure is that whereas a data structure only has one set of dimensions, attributes, and primary measure, a metadata structure can define multiple targets and report structures, and has no measures.
4.4.1.1 Metadata Target
Figure 5: Identification of Targets in a Metadata Structure Definition
A metadata target defines what is expected in the metadata set in order to identify the object to which the metadata pertains. It is given a unique identifier within the metadata structure in which it is defined. It consists of one or more target object descriptors, each of which itself has a unique identifier within the metadata target.
There are 5 types of target objects that one can use, each of which serves to uniquely identify an object within the SDMX information model.
4.4.1.1.1 Data Set Target
The data set target is used to attach metadata to a specific data set, which is identified by the identification of the data provider and the provider assigned identification of the data set. This target object has a fixed identifier and representation, so the purpose of defining this in a metadata target is to simply state that the data set reference is part of the metadata target value set. Only one data set target can occur within a metadata target.
4.4.1.1.2 Identifiable Object Target
The identifiable target object is used to attach metadata to any identifiable object in the SDMX information model. This type of target object can be repeated and each instance is assigned a unique identifier within the metadata target. Each instance identifies the type of object which is referenced by this target. The identification of the target object is always a complete reference. If the target identifiable object type is an item from within an item scheme, the representation of the target object can reference a scheme for the purposes of limiting the items to which metadata can be attached.
4.4.1.1.3 Dimension Descriptor Values Target
The key descriptor values target is used to attach metadata to data by identifying full or partial data "keys" (a collection of dimension identifier/value pairs). By itself, this target object is typically not descriptive enough as it does not identify the type of data to which the keys apply. Therefore, this is typically used with other target objects, such as the dataflow or data structure, which can identify these data. This target object has a fixed identifier and representation, therefore the metadata target is simply stating that it contains a key descriptor value set. Only one key descriptor values target can occur within a metadata target.
4.4.1.1.4 Constraint Content Target
The constraint content target is used to attach metadata to data by referencing an attachment constraint. This attachment constraint in turn defines the data set(s) and keys to which the metadata applies. This is equivalent to using the data set or identifiable object target along with the key descriptor values target. The difference is that the attachment constraint allows the target set to be defined once, and be reused by multiple reports, whereas the former method would require that the data set (or equivalent data structure, dataflow, or provision agreement) reference and key descriptor value set be repeated for each report. Only one constraint content target can occur within a metadata target.
4.4.1.1.5 Report Period Target
The report period target is used to state the reporting period for which metadata is applicable. This effectively allows the metadata to change over time while persisting the historicity of the changes. This target object has a fixed identifier, but its representation can specify the specific type of date that can be used. By default, this is the least restrictive date format. Only one report period target can occur within a metadata target.
4.4.1.1.6 Composing Targets
A metadata target consists of one or more of the target objects described above. It is the sum of these targets which identifies the actual target for the metadata. For example, one might attach metadata to a specific portion of data in a given data set. In this case, the data set target object and the key descriptor values target object would be used. In the metadata set, the data set target would identify the data set and the key descriptor values target would identify the portions of the data to which the metadata applies. In another example, one might wish to be able to attach metadata to portions of data for all data sets reported against a given data structure. Rather than repeating the metadata for each data set, a metadata target would be defined which contains an identifiable target object which references a data structure and a key descriptor values target object. When designing a metadata target one must consider how generally the metadata which is reported against the target might be applied.
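The composition rules described in this section (at least one target object, unique identifiers, and at most one occurrence of every target object type except the identifiable object target) can be sketched as a simple check. The type labels below follow the headings of this section and, like the function and the identifiers in the example, are assumptions of the sketch rather than a statement of the SDMX-ML element names.

```python
from collections import Counter
from typing import List, Tuple

# Target object types that may occur only once in a metadata target.
SINGLE_USE = {"DataSetTarget", "KeyDescriptorValuesTarget",
              "ConstraintContentTarget", "ReportPeriodTarget"}
# Target object types that may be repeated.
REPEATABLE = {"IdentifiableObjectTarget"}


def check_metadata_target(objects: List[Tuple[str, str]]) -> List[str]:
    """Check a list of (identifier, target object type) pairs."""
    problems = []
    if not objects:
        problems.append("a metadata target needs at least one target object")
    id_counts = Counter(identifier for identifier, _ in objects)
    type_counts = Counter(kind for _, kind in objects)
    for identifier, count in id_counts.items():
        if count > 1:
            problems.append(f"identifier used more than once: {identifier}")
    for kind, count in type_counts.items():
        if kind not in SINGLE_USE | REPEATABLE:
            problems.append(f"unknown target object type: {kind}")
        elif kind in SINGLE_USE and count > 1:
            problems.append(f"only one {kind} is allowed")
    return problems


# The second example from the text: a data structure plus partial keys.
structure_and_keys = [
    ("DSD_REF", "IdentifiableObjectTarget"),
    ("KEYS", "KeyDescriptorValuesTarget"),
]
```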
4.4.1.2 Report Structure
Figure 6: Schematic of the Report Structure in a Metadata Structure Definition
The metadata structure defines one or more report structures which define the content of its metadata reports. Each report structure is assigned a unique identifier within the metadata structure. The report structure defines the metadata attributes which make up its content and references the metadata targets from within the metadata structure that define where the report can be attached. Since a metadata set can only contain metadata reports for a single metadata structure, one must consider whether having reports in the same metadata set would be useful. Note that all reports in a metadata set must be defined in the same metadata structure.
4.4.1.2.1 Metadata Attribute
A metadata attribute is a component of the content of a metadata report. Similar to a data attribute in a data structure, the metadata attribute takes its semantic from a concept. This concept serves to define the meaning of the information contained in the report. As with data structure components, it is important to make use of common concepts whenever possible, as this makes the metadata relatable to other metadata reports.
The content of a metadata attribute can be a value and/or other metadata attributes. The value of a metadata attribute can serve a number of purposes. First, as with a data attribute it can be an enumerated value from an SDMX code list. It can also be a non-coded value of any given data type. Where the metadata attribute value differs from that of a data attribute is that if the value is textual, it can be represented in parallel language values, and if necessary be structured using XHTML. It is not necessary that a metadata attribute has a value, as it might only serve to contain other metadata attributes for the purpose of organising metadata reports (in which case the metadata attribute is designated as “isPresentational” indicating that no value is expected to be reported for the metadata attribute in a report).
Metadata attributes are not reusable across various levels of a report structure. If an attribute is intended to occur at multiple levels, it must be redefined at each level. However, the identification of a metadata attribute is only required to be unique within the scope within which it is defined. Therefore, if the intention is to reuse a metadata attribute at different levels, it is recommended that the same identification be used to convey this intention. It should also be noted that within a scope, a metadata attribute can have cardinality (minimum and maximum number of occurrences). This allows metadata attributes to be repeated at various levels, as well as giving the report structure the ability to enforce content requirements.
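Cardinality and presentational attributes can be illustrated with a recursive check of reported attributes against their definitions. The classes, the `check` function, and the contact attributes used in the example are hypothetical; identifiers only need to be unique among siblings, which is why the check works level by level.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttrDef:
    id: str
    min_occurs: int = 1
    max_occurs: Optional[int] = 1        # None means unbounded
    presentational: bool = False         # True: no value is expected
    children: List["AttrDef"] = field(default_factory=list)


@dataclass
class Reported:
    id: str
    value: Optional[str] = None
    children: List["Reported"] = field(default_factory=list)


def check(defs: List[AttrDef], reported: List[Reported], path: str = "") -> List[str]:
    """Return the problems found at this level and all levels below it."""
    problems = []
    known = {d.id for d in defs}
    for r in reported:
        if r.id not in known:
            problems.append(f"{path}/{r.id}: not defined at this level")
    for d in defs:
        where = f"{path}/{d.id}"
        matches = [r for r in reported if r.id == d.id]
        if len(matches) < d.min_occurs:
            problems.append(f"{where}: expected at least {d.min_occurs}, found {len(matches)}")
        if d.max_occurs is not None and len(matches) > d.max_occurs:
            problems.append(f"{where}: expected at most {d.max_occurs}, found {len(matches)}")
        for r in matches:
            if d.presentational and r.value is not None:
                problems.append(f"{where}: presentational attribute must not carry a value")
            problems.extend(check(d.children, r.children, where))
    return problems


# Hypothetical report structure: a contact with one name and any number of phones.
contact_defs = [AttrDef("CONTACT", presentational=True, children=[
    AttrDef("NAME"),
    AttrDef("PHONE", min_occurs=0, max_occurs=None),
])]
good_report = [Reported("CONTACT", children=[
    Reported("NAME", "A. Person"),
    Reported("PHONE", "555-0100"),
    Reported("PHONE", "555-0101"),
])]
```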
4.4.2 Metadata Sets
A metadata set is a collection of metadata reports from the same metadata structure definition. Each report is based on a report structure defined within the metadata structure on which the set is based. More than one report for a given report structure can exist within a set, so long as their targets are unique, which is to say that for any given target there should only be one instance of a report for a given report structure.
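The uniqueness rule for reports can be sketched as a check over pairs of report structure and target. Representing a target as a dictionary of reference names to values is a simplification chosen for this example, as are the identifiers used in the test data.

```python
from typing import Dict, List, Tuple

# (report structure id, target references) - a simplified report identity.
Report = Tuple[str, Dict[str, str]]


def duplicate_reports(reports: List[Report]) -> List[Report]:
    """Return reports that repeat an earlier (report structure, target) pair."""
    seen = set()
    duplicates = []
    for structure_id, target in reports:
        identity = (structure_id, frozenset(target.items()))
        if identity in seen:
            duplicates.append((structure_id, target))
        seen.add(identity)
    return duplicates


# Hypothetical reports in one metadata set.
reports = [
    ("CONTACT_REPORT", {"DSD": "DEMOGRAPHY", "REF_AREA": "FR"}),
    ("CONTACT_REPORT", {"DSD": "DEMOGRAPHY", "REF_AREA": "DE"}),
    ("QUALITY_REPORT", {"DSD": "DEMOGRAPHY", "REF_AREA": "FR"}),
    ("CONTACT_REPORT", {"REF_AREA": "FR", "DSD": "DEMOGRAPHY"}),
]
```

Only the last report in this example is a duplicate: it repeats the target of the first report for the same report structure, whereas the same target under a different report structure is allowed.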
The manner in which a metadata report is related to an object is through the specification of its target. The metadata structure defines the possible types of targets for a given report structure, and an instance of a report uses one of these target types to identify the actual target of the metadata. The target is essentially a collection of references to data or structural metadata and possibly a period to which the report applies. It is assumed that any system processing metadata reports will be able to resolve these references, or perhaps more appropriately, any system working with data or structural metadata will be able to process the related metadata reports.
The actual content of the report is always contained in an attribute set. This attribute set is the collection of metadata attributes for which values are provided. As with data, the key to a useful metadata report is understanding what is being reported. This comes down to effective concept usage. The content of any report is essentially made up of values reported against concepts. In order for systems to understand the meaning of the metadata, they must understand the concepts.
With the metadata structure definition, the content of any report can be evaluated for completeness and validity. From the metadata structure, one can determine if all necessary metadata attributes are present for a given report structure and if the values assigned to them are allowable based on the attribute definition.
In the Eurostat Demography metadata samples, one can see the similarities between the generic and the structure specific messages. The content of the reported metadata does not change with the format used. The structure specific metadata simply provides more validation of the content.