IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th 2008 The Data Documentation Initiative (DDI) Arofan Gregory / Pascal Heus agregory@opendatafoundation.org.


Content
Background on metadata and XML; Metadata and Microdata; XML and Microdata: the DDI; DDI 2.0; DDI 3.0; DDI 2.0 vs 3.0; Major stakeholders / initiatives

Metadata / XML

What is metadata? Common definition: data about data.
[Figure: labeled stuff versus unlabeled stuff.] The bean example is taken from: A Manager’s Introduction to Adobe eXtensible Metadata Platform,

What is XML? Today's Universal language on the web
Purpose is to facilitate sharing of structured information across information systems. XML stands for eXtensible Markup Language. eXtensible: can be customized. Markup: tags, marks, attach attributes to things. Language: syntax (grammatical rules). HTML (HyperText Markup Language) is a markup language but not extensible! It is also concerned with presentation, not content. XML is a text format (not a binary black box). XML is also a collection of technologies (built on the XML language). It is platform independent and is understood by modern programming languages (C++, Java, .NET, PHP, Perl, etc.). It is both machine and human readable.

Simple XML example
<catalog>
  <book isbn=" ">
    <title>Da Vinci Code</title>
    <author>Dan Brown</author>
  </book>
  <book isbn=" " pages="352">
    <title>I, robot</title>
    <author>Isaac Asimov</author>
    <language>English</language>
  </book>
</catalog>
Callouts on the slide: Attributes; Elements; Opening and Closing tags; Text content.
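As a minimal sketch of how software sees the callouts above, the catalog can be loaded with Python's standard `xml.etree.ElementTree` module. The helper name `describe_books` is an illustrative assumption, and the ISBN values are left blank because they are blank on the slide.

```python
import xml.etree.ElementTree as ET

# The slide's catalog; ISBN values are blank in the source and stay blank here.
CATALOG_XML = """
<catalog>
  <book isbn=" ">
    <title>Da Vinci Code</title>
    <author>Dan Brown</author>
  </book>
  <book isbn=" " pages="352">
    <title>I, robot</title>
    <author>Isaac Asimov</author>
    <language>English</language>
  </book>
</catalog>
"""


def describe_books(xml_text):
    """Return one dict per <book>, separating attributes from child-element text."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        books.append({
            "attributes": dict(book.attrib),                    # e.g. isbn, pages
            "elements": {child.tag: child.text for child in book},  # tag -> text content
        })
    return books


books = describe_books(CATALOG_XML)
for entry in books:
    print(entry["elements"]["title"], entry["attributes"])
```

Attributes arrive as a name/value mapping on the element, while text content belongs to the child elements; the parser keeps the two apart in the same way the slide's callouts do.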

XML Technology overview
Document Type Definition (DTD) and XML Schema (XSD) are used to validate an XML document by defining namespaces, elements, and rules. Specialized software and database systems can be used to create and edit XML documents; in the future the XForms standard will be used. XML separates the metadata storage from its presentation. XML documents can be transformed into something else (HTML, PDF, other XML, etc.) through the use of the eXtensible Stylesheet Language (XSL): XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO). Very much like a database system, XML documents can be searched and queried through the use of XPath or XQuery; there is no need to create tables, indexes, or define relationships. XML metadata or data can be published in “smart” catalogs, often referred to as registries, that can be used for discovery of information. XML documents can be sent like regular files but are typically exchanged between applications through Web Services using SOAP and other protocols.
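To illustrate the point about querying without tables or indexes, here is a small sketch that runs two XPath-style queries over a cut-down version of the book catalog. It uses the limited XPath subset built into Python's `xml.etree.ElementTree`; full XPath or XQuery needs a dedicated engine, which is not shown.

```python
import xml.etree.ElementTree as ET

# A reduced copy of the catalog example (attributes not needed here are omitted).
CATALOG_XML = """
<catalog>
  <book>
    <title>Da Vinci Code</title>
    <author>Dan Brown</author>
  </book>
  <book pages="352">
    <title>I, robot</title>
    <author>Isaac Asimov</author>
  </book>
</catalog>
"""

root = ET.fromstring(CATALOG_XML)

# Titles of every book that carries a "pages" attribute.
titles_with_pages = [b.findtext("title") for b in root.findall(".//book[@pages]")]

# Titles of books whose <author> child has exactly this text.
asimov_titles = [t.text for t in root.findall(".//book[author='Isaac Asimov']/title")]

print(titles_with_pages)
print(asimov_titles)
```

No schema, table, or index had to be declared first: the path expression alone selects the nodes.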

What is an XML Schema? Exchange / sharing / harmonization implies agreement on structure. We need a specification that describes the structure and rules: a schema. A schema is a set of rules to which an XML document must conform in order to be considered 'valid'. XML Schema was also designed with the intent that determination of a document's validity would produce a collection of information adhering to specific data types. Similar to the structural definition of a relational database. Many schemas exist for different purposes. Examples: DDI, SDMX, Dublin Core, RSS, XHTML, etc.

Metadata, XML and Microdata

What is a survey? More than just data….
A complex process to produce data for the purpose of statistical analysis. Beyond this, a tool to support evidence-based policy making and results monitoring. The data is surrounded by a large body of documentation. Survey data often come with limited documentation. Note that microdata is intended for experts (statisticians / researchers). Represents a single point in time and space. Needs to be aggregated to produce meaningful results. It is the beginning of the story.

What is survey metadata?
Survey documentation can be broken down into structured metadata and documents Structured metadata can be captured using XML Documents can be described in structured metadata Example of metadata: Survey level: Title, country, year, abstract, sampling, agencies, access policy, etc. Variable level: filename, label, code, questions, instructions, derivation, etc. Related materials: report, questionnaire, papers, manuals, scripts/programs, photos Cross-surveys: catalogs, longitudinal, concepts, comparability, etc.

Importance of survey metadata
Data Quality: Usefulness = accessibility + coherence + completeness + relevance + timeliness + … Undocumented data is useless Partially documented data is risky (misuse) Data discovery and access Preservation Replication standard (Gary King) Information exchange Reduce need to access sensitive data Maintain coherence / linkages across the complete life cycle (from respondent to policy maker) Reuse

The Data Documentation Initiative
The Data Documentation Initiative is an XML specification to capture structured metadata about “microdata” (broad sense). First generation, DDI 1.0…2.1 ( ): focus on single archived instance. Second generation, DDI 3.0 (2008): focus on life cycle; goes beyond the single survey concept; multi-purpose.

DDI Timeline / Status
Pre-DDI 1.0: 70’s / 80’s OSIRIS Codebook; 1993 IASSIST Codebook Action Group; 1996 SGML DTD; 1997 DDI XML; 1999 Draft DDI DTD.
2000 – DDI 1.0: simple survey; archival data formats; microdata only.
2003 – DDI 2.0: aggregate data (based on matrix structure); added geographic material to aid geographic search systems and GIS users; establishment of DDI Alliance.
2004 – Acceptance of a new DDI paradigm: lifecycle model; shift from the codebook centric / variable centric model to capturing the lifecycle of data; agreement on expanded areas of coverage.
2005 – Presentation of schema structure; focus on points of metadata creation and reuse.
2006 – Presentation of first complete 3.0 model; internal and public review.
2007 – Vote to move to Candidate Version (CR); establishment of a set of use cases to test application and implementation; October: 3.0 CR2.
2008 – February: 3.0 CR3; March: 3.0 CR3 update; April: 3.0 CR3 final; April 28th: 3.0 approved by DDI Alliance; May 21st: DDI 3.0 officially announced; initial presentations at IASSIST 2008.
2009 – DDI 3.1 and beyond.

DDI 1/2.x

The archive perspective
Focus on preservation of a survey. Often sees the survey as a collection of data files accompanied by documentation. Code book centric: report, questionnaire, methodologies, scripts, etc. Results in a static event: the archive. Maintained by a single agency. Is typically documentation after the fact. This is the initial DDI perspective (DDI 2.0).

DDI 2.0 Technical Overview
Based on a single structure (DTD): 1 codeBook, 5 sections.
docDscr: describes the DDI document (the preparation of the metadata).
stdyDscr: describes the study (title, abstract, methodologies, agencies, access policy).
fileDscr: describes each file in the dataset.
dataDscr: describes the data in the files: variables (name, code, ), variable groups, cubes.
otherMat: other related materials; basic document citation.
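The section layout above can be sketched with a tiny, hypothetical codeBook instance read in Python. The survey title, variable names, and labels are invented for illustration, only two of the five sections are shown, and the fragment is a simplification that has not been validated against the DDI 2 DTD.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical instance: study and variable content are made up,
# and docDscr, fileDscr and the other-materials section are omitted for brevity.
CODEBOOK_XML = """
<codeBook>
  <stdyDscr>
    <citation><titlStmt><titl>Example Household Survey</titl></titlStmt></citation>
  </stdyDscr>
  <dataDscr>
    <var name="SEX">
      <labl>Sex of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
    <var name="AGE">
      <labl>Age in years</labl>
    </var>
  </dataDscr>
</codeBook>
"""


def read_codebook(xml_text):
    """Return the study title and a dict of variables with labels and codes."""
    root = ET.fromstring(xml_text)
    title = root.findtext("stdyDscr/citation/titlStmt/titl")
    variables = {}
    for var in root.findall("dataDscr/var"):
        # Each category pairs a code value with its label.
        codes = {c.findtext("catValu"): c.findtext("labl") for c in var.findall("catgry")}
        variables[var.get("name")] = {"label": var.findtext("labl"), "codes": codes}
    return title, variables


title, variables = read_codebook(CODEBOOK_XML)
print(title)
print(variables["SEX"])
```

Note how most of the detail hangs off the variable, which is the codebook-centric character of DDI 1/2.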

Characteristics of DDI 1.0/2.0
Focuses on the static object of a codebook Designed for limited uses End user data discovery via the variable or high level study identification (bibliographic) Only heavily structured content relates to information used to drive statistical analysis Coverage is focused on single study, single data file, simple survey and aggregate data files Variable contains majority of information (question, categories, data typing, physical storage information, statistics)

Impact of these limitations
Treated as an “add on” to the data collection process Focus is on the data end product and end users (static) Limited tools for creation or exploitation The Variable must exist before metadata can be created Producers hesitant to take up DDI creation because it is a cost and does not support their development or collection process

DDI 1/2.x Tools Nesstar IHSN Other tools
Nesstar Publisher, Nesstar Server IHSN Microdata Management Toolkit NADA (online catalog for national data archive) Archivist / Reviewer Guidelines Other tools SDA, Harvard/MIT Virtual Data Center (Dataverse) UKDA DExT, ODaF DeXtris

DDI 2.0 perspective
[Diagram labels: Producers; Archivists; Users (Media/Press, General Public, Academic, Policy Makers, Government, Sponsors, Business); a separate “DDI 2 Survey” box repeated seven times.]

DDI 3.0 The life cycle

When to capture metadata?
Metadata must be captured at the time the event occurs! Documenting after the fact leads to considerable loss of information. Multiple contributors are typically involved in this process (not only the archivist). This is true for producers and researchers.

DDI 3.0 and the Survey Life Cycle
A survey is not a static process: it dynamically evolves across time and involves many agencies/individuals. DDI 2.x is about archiving; DDI 3.0 covers the entire “life cycle”. 3.0 focuses on metadata reuse (minimizes redundancies/discrepancies, supports comparison). Also supports multilingual content, grouping, geography, and others. 3.0 is extensible.

Requirements for 3.0 Improve and expand the machine-actionable aspects of the DDI to support programming and software systems Support CAI instruments through expanded description of the questionnaire (content and question flow) Support the description of data series (longitudinal surveys, panel studies, recurring waves, etc.) Support comparison, in particular comparison by design but also comparison-after-the fact (harmonization) Improve support for describing complex data files (record and file linkages) Provide improved support for geographic content to facilitate linking to geographic files (shape files, boundary files, etc.)

Approach: Shift from the codebook centric model of early versions of DDI to a lifecycle model, providing metadata support from data study conception through analysis and repurposing of data. Shift from an XML Document Type Definition (DTD) to an XML Schema model to support the lifecycle model, reuse of content, and increased controls to support programming needs. Redefine a “single DDI instance” to include a “simple instance” similar to DDI 1/2, which covered a single study, and “complex instances” covering groups of related studies. Allow a single study description to contain multiple data products (for example, a microdata file and aggregate products created from the same data collection). Incorporate the requested functionality in the first published edition.

Designing to support registries
Resource package structure to publish non-study-specific materials for reuse. Extracting specified types of information into schemes: Universe, Concept, Category, Code, Question, Instrument, Variable, etc. Allowing for either internal or external references: can include other schemes by reference and select only desired items. Providing Comparison Mapping: target can be an external harmonized structure.

Technical Overview: DDI 3 is composed of several schemas
Use only what you need! Schemas represent modules, sub-modules (substitutions), reusable, and external schemas: archive, comparative, conceptualcomponent, datacollection, dataset, dcelements, DDIprofile, ddi-xhtml11, ddi-xhtml11-model-1, ddi-xhtml11-modules-1, group, inline_ncube_recordlayout, instance, logicalproduct, ncube_recordlayout, physicaldataproduct, physicalinstance, proprietary_record_layout (beta), reusable, simpledc, studyunit, tabular_ncube_recordlayout, xml, plus a set of XML schemas to support XHTML.

Technical Overview Any element that can be referenced is globally uniquely identified Maintainable (by an agency) Versionable (can change across time) Identifiable (within a maintainable scheme) Modules Reflect closely related sets of information similar to the sections of DDI 1/2.* DTD Modules can be held as separate XML instances and be included in a large instance by either inclusion or reference All modules are maintainable (but not all maintainables are modules)
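One way to picture the maintainable / versionable / identifiable layering is as a composite key. The sketch below is an illustration only: the class `ItemReference`, the `publish` helper, and the example identifiers are hypothetical and do not reproduce the actual DDI 3.0 identifier syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemReference:
    """Hypothetical composite key; not the real DDI 3.0 identifier format."""
    agency: str      # maintainable: who maintains the scheme
    scheme_id: str   # the maintainable scheme or module
    version: str     # versionable: which version of the content
    item_id: str     # identifiable: the item inside the scheme


store = {}


def publish(ref, content):
    """Register content under a reference; published versions are never overwritten."""
    if ref in store:
        raise ValueError(f"{ref} is already published; issue a new version instead")
    store[ref] = content


v1 = ItemReference("example.org", "QuestionScheme1", "1.0", "Q1")
v2 = ItemReference("example.org", "QuestionScheme1", "2.0", "Q1")
publish(v1, "What is your age?")
publish(v2, "What was your age at your last birthday?")

print(store[v1])
print(store[v2])
```

Because the version is part of the key, an earlier wording stays resolvable after a revision, which is what makes reuse by reference safe.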

Technical Overview: Maintainable Schemes (that’s with an ‘e’ not an ‘a’)
Category Scheme Code Scheme Concept Scheme Control Construct Scheme GeographicStructureScheme GeographicLocationScheme InterviewerInstructionScheme Question Scheme NCubeScheme Organization Scheme Physical Structure Scheme Record Layout Scheme Universe Scheme Variable Scheme Packages of reusable metadata maintained by a single agency

DDI 3.0 Use Cases
Study design/survey instrumentation; Questionnaire generation/data collection and processing; Data recoding, aggregation and other processing; Data dissemination/discovery; Archival ingestion/metadata value-add; Question/concept/variable banks; DDI for use within a research project; Capture of metadata regarding data use; Metadata mining for comparison, etc.; Generating instruction packages/presentations

Study Design/Survey Instrumentation
This use case concerns how DDI 3.0 can support the design of studies and survey instrumentation Without benefit of a question or concept bank

Types of Metadata: Concepts (conceptual module), Universe (conceptual module), Questions (datacollection module), Flow Logic (datacollection module).
[Diagram labels: Drafting/Review/Revision with <DDI 3.0> Concepts, Universes; Drafting/Testing/Revision with <DDI 3.0> Questions, Flow Logic; Final: <DDI 3.0> Concepts, Universes, Questions, Flow Logic.]
As the survey instrument is tested, all revisions and history can be tracked and preserved. This would include question translation and internationalization.

Questionnaire Generation, Data Collection, and Processing
This use case concerns how DDI 3.0 can support the creation of various types of questionnaires/CAI, and the collection and processing of raw data into microdata.

Types of Metadata: Concepts (conceptual module), Universe (conceptual module), Questions (datacollection module), Flow Logic (datacollection module), Variables (logicalproduct module), Categories/Codes (logicalproduct module), Coding (datacollection module).
[Diagram labels: <DDI 3.0> Concepts, Universes, Questions, Flow Logic; Paper Questionnaire; Online Survey Instrument; Final CAI Instrument; Raw Data; Microdata; <DDI 3.0> Variables, Coding; <DDI 3.0> Categories, Codes; Physical Data Product; Physical Data Instance.]
DDI captures the content; XML allows each application to do its own presentation.

Data Recoding, Aggregation, etc.
This use case concerns how DDI 3.0 can describe recodes, aggregation, and similar types of data processing.

Initial microdata has: Concepts (conceptual module), Universes (conceptual module), Questions (datacollection module), Flow Logic (datacollection module), Variables (logicalproduct module), Coding (datacollection module), Categories (logicalproduct module), Codes (logicalproduct module), Physical Data Product, Physical Data Instance.
Recode adds: more codings (datacollection module), new variables, new categories, new codes, NCubes (for aggregation). Could be a recode, an aggregation, or other process.
[Diagram labels: Microdata with <DDI 3.0> Conceptual, Datacollection, Variables, Categories, Codes; Microdata/Aggregates with <DDI 3.0> Codings, Variables (new), Categories (new), Codes (new), NCubes.]
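A recode followed by an aggregation might look like the sketch below. Everything in it is hypothetical (the ages, the `AGE_GROUP` variable, the cut points, and the shape of the `recode_metadata` record); it only shows what kind of new metadata such a step produces: a coding, a new variable, new categories and codes, and a small cube of counts.

```python
from collections import Counter

# Hypothetical source variable AGE for five respondents.
ages = [7, 15, 34, 62, 80]

# New categories and codes introduced by the recode (labels are illustrative).
AGE_GROUP_CODES = {1: "0-14", 2: "15-64", 3: "65+"}


def recode_age(age):
    """Coding rule: map age in years to an age-group code."""
    if age < 15:
        return 1
    if age < 65:
        return 2
    return 3


# The new variable derived from the source variable.
age_group = [recode_age(a) for a in ages]

# Aggregation: a one-dimensional cube of counts per category.
cube = Counter(AGE_GROUP_CODES[code] for code in age_group)

# The kind of record a metadata system could keep about this step.
recode_metadata = {
    "new_variable": "AGE_GROUP",
    "source_variable": "AGE",
    "coding": "1 if AGE < 15; 2 if AGE < 65; otherwise 3",
    "codes": AGE_GROUP_CODES,
}

print(age_group)
print(dict(cube))
```

Capturing `recode_metadata` at the moment the recode is written is the lifecycle idea in miniature: the derivation is documented when it happens rather than reconstructed later.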

Data Dissemination/Data Discovery
This use case concerns how DDI 3.0 can support the discovery and dissemination of data.

Can add archival events metadata. Rich metadata supports auto-generation of websites and other delivery formats.
[Diagram labels: <DDI 3.0> [Full metadata set]; Microdata/Aggregates; Codebooks; Websites; Databases, repositories; Research Data Centers; Data-Specific Info Access Systems; Registries; Catalogues; Question/Concept/Variable Banks.]

Archival Ingestion and Metadata Value-Add
This use case concerns how DDI 3.0 can support the ingest and migration functions of data archives and data libraries.

Supports automation of processing if good DDI metadata is captured upstream. Provides a neutral format for data migration as analysis packages are versioned. Provides good format and foundation for value-added metadata by archive.
[Diagram labels: <DDI 3.0> [Full metadata set] (?); Microdata/Aggregates; Ingest Processing; Data Archive; Data Library; <DDI 3.0> [Full or additional metadata]; Archival events.]

Question/Concept/Variable Banks
This use case describes how DDI 3.0 can support question, concept, and variable banks. These are often termed “registries” or “metadata repositories” because they contain only metadata – links to the data are optional, but provide implied comparability. The focus is metadata reuse.

Because DDI has links, each type of bank functions in a modular, complementary way. Supports but does not require ISO 11179.
[Diagram labels: Question Bank with <DDI 3.0> Questions, Flow Logic, Codings; Variable Bank with <DDI 3.0> Variables, Categories, Codes; Concept Bank with <DDI 3.0> Concepts; each bank serves Users and Applications.]

DDI For Use within a Research Project
This use case concerns how DDI 3.0 can support various functions within a research project, from the conception of the study through collection and publication of the resulting data.

[Diagram labels: Principal Investigator; Research Staff; Collaborators; funding ($ € £); Submitted Proposal; <DDI 3.0> Concepts, Universe, Methods, Purpose, People/Orgs; <DDI 3.0> Funding, Revisions; <DDI 3.0> Questions, Instrument; <DDI 3.0> Data Collection, Data Processing; <DDI 3.0> Variables, Physical Stores; Presentations; Publication; Data Archive/Repository.]

Capture of Metadata Regarding Data Use
This use case concerns how DDI 3.0 can capture information about how researchers use data, which can then be added to the overall metadata set about the data sources they have accessed.

Types of Metadata: Recodes (datacollection module), Record subsets (physicalinstance module), Variable subsets (logicalproduct module), Comparison (comparative module).
[Diagram labels: Data Sets with <DDI 3.0> StudyUnit, DataCollection, LogicalProduct, PhysicalDataProduct, PhysicalInstance; Data; Data Analysis; <DDI 3.0> Recodes, Case Selection, Variable Selection, Comparison to original study, Resulting physical file descriptions.]

Metadata Mining for Comparison, etc.
This use case concerns how collections of DDI 3.0 metadata can act as a resource to be explored, providing further insight into the comparability and other features of a collection of data.

Types of Metadata: Universe (comparative module), Concept (comparative module), Question (datacollection module), Variable (logicalproduct module).
[Diagram labels: Metadata Repositories/Registries (Questions, Variables, Concepts, Universe); <DDI 3.0> Instances; Data Sets; <DDI 3.0> Comparison: Questions, Categories, Codes, Variables, Universe, Concepts; Recodes; Harmonizations.]

Generating Instruction Packages/Presentations
This use case concerns how DDI 3.0 can support automation around the instruction of students and others.

Types of Metadata: Individual studies (studyunit module), Grouping purpose (group module), Linking information (comparative module), Processing assistance (group module).
Topically related studies are selected. A group is made with a description of the intended use for the group. Comparative information is added indicating matching fields for linking and mapping between similar variables. Other materials, such as SAS/SPSS recode commands, are referenced from the group.
[Diagram labels: separate <DDI 3.0> instances for StudyUnit 1 through StudyUnit 4; a <DDI 3.0> group containing StudyUnit 1, StudyUnit 2, StudyUnit 3, StudyUnit 4, Comparative, OtherMaterials; Instructional Package.]

DDI 3.0 Tools
Under development. DDI Foundation Tools Program (DDI-FTP): road map; XML Beans, validation, DDI DExT, DDI2StatsProgs. Other tools: R SPSS Export, Algenta SurveyViz, others presented at IASSIST. DDI Editing Suite: proposed as extension of DDI-FTP; plan for generic editor in 6-9 months. DDI 3.0 related projects / initiatives: RDC Canada, Germany RDC / EURASI, DANS MIXED, NORC.

DDI 3 Relationship to Other Standards
SDMX (from microdata to indicators / time series): complete mapping to and from DDI NCubes. Dublin Core (surveys and documents get cited): mapping of citation elements; option for DC namespace basic entry. ISO – Geography (microdata gets mapped): search requirements; support for GIS users. METS: designed to support profile development. OAIS (alignment of archiving standards): reference model for the archival lifecycle. ISO/IEC (metadata mining through concepts): variable linking representation to concept and universe; optional data element construct in ConceptualComponent that allows for complete ISO/IEC structure as a maintained item.

DDI 3.0 perspective
[Diagram labels: Producers; Users; Archivists; Media/Press; General Public; Academic; Policy Makers; Government; Sponsors; Business.]

DDI 2.0 and DDI 3.0

DDI 2 / DDI 3
Single survey / Multiple surveys
Focus on the archive / Focus on life cycle
Non-reusable metadata / Highly reusable metadata
Maintained by single agency / Maintained by many agencies
Loose validation / Tight validation
DTD based / Schema based
Sparse documentation / Extensive guide
Designed by archivists / Designed by expert groups
Some tools are available / Tools are beginning to emerge

What 3.0 can do for you Manage multi-surveys
Support multiple contributors Support many different perspectives Support many different use cases Maintain metadata integrity across the life cycle Connect to other metadata spaces Metadata reuse Publication in registries Backward compatibility with 2.0

DDI Community

DDI Organizations/ Agencies
DDI Alliance; Interuniversity Consortium for Political and Social Research (ICPSR); International Association for Social Science Information Service & Technology (IASSIST); International Household Survey Network (IHSN); Open Data Foundation (ODaF); National Opinion Research Center Data Enclave (NORC); Metadata Technology

IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th The Statistical Data and Metadata Exchange Standard (SDMX): An Introduction Arofan Gregory / Pascal Heus / Open Data Foundation

Overview of the Session
SDMX Background and Goals SDMX and Data SDMX and Metadata SDMX and Best Practices: The Content-Oriented Guidelines The SDMX Information Model SDMX and Web Services The SDMX Registry SDMX Data Services Tools and Resources

SDMX Background and Goals

What is SDMX? The problem space:
Statistical collection, processing, and exchange is time-consuming and resource-intensive Focus on aggregate data (esp. time series) Various international and national organisations have individual approaches for their constituencies Uncertainties about how to proceed with new technologies (XML, web services …)

What is SDMX? The Statistical Data and Metadata Exchange (SDMX) initiative is taking steps to address these challenges and opportunities that have just been mentioned: By focusing on business practices in the field of statistical information By identifying more efficient processes for exchange and sharing of data and metadata using modern technology and open standards

Who is SDMX? SDMX is an initiative made up of seven international organizations: Bank for International Settlements European Central Bank Eurostat International Monetary Fund Organisation for Economic Cooperation and Development United Nations World Bank The initiative was launched in 2002

[Diagram labels: International Organisations; Regional Organisations; National Statistical Organisations (180+ countries); Banks, Corporates; Individual Households; flows of statistics, accounts, transactions, and micro-data; Internet, Search, Navigation.]

SDMX Products Technical standards for the formatting and exchange of aggregate statistics: SDMX Technical Specifications version 1.0 (now ISO/TS SDMX – TC 154 WG2) SDMX Technical Specifications version 2.0 (soon to be submitted to ISO – TC 154 WG2) Content-Oriented Guidelines (in draft) Common Metadata Vocabulary Cross-Domain Statistical Concepts Statistical Subject-Matter Domains

Major Features of SDMX Structure and formats (XML, EDIFACT) for aggregate data Structure and formats (XML) for metadata Formal information model (UML) for managing statistical exchange and sourcing Web-services guidelines and registry services specification for use of modern technologies Content-oriented guidelines to recommend best practices

Recent Events Jan 2007 – Launch meeting at the World Bank for SDMX 2.0 Technical Specifications February 2007 – Endorsement of SDMX by EU’s Statistical Programme Committee March 2008 – SDMX becomes the preferred standard for data and metadata of the UN Statistical Commission Other standards were mentioned – DDI and XBRL specifically

Adopters/Interest. The following are known adopters (or planning to adopt): US Federal Reserve Board and Bank of New York; European Central Bank; Joint External Debt Hub (WB, IMF, OECD, BIS); UN/TRADECOM at UN Statistical Division; NAAWE (National Accounts from OECD/Eurostat); SODI (Eurostat and European Governments); Mexican Federal System; Vietnamese Ministry of Planning and Investment; Qatar Information Exchange; IMF (BOP, SNA, SDDS/GDDS); Food and Agriculture Organization; Millennium Development Goals (UN System, others); International Labor Organization; Bank for International Settlements; OECD; World Bank; Marchioness Islands (Spanish/Portuguese Statistical Region); UNESCO (Education); Australian Bureau of Statistics; Statistics Canada. There are many others not listed or of which we are not aware.

Rate of Adoption Between January 2007 and January 2008, adoption has doubled We anticipate a similar rate of growth for the coming year Tools are becoming available UNSC recommendation makes it a safe course to follow for risk-averse institutions Training courses are in increasing demand (Eurostat, Metadata Technology) Standard data and metadata structures for many domains are being developed

SDMX and Data

SDMX and Data Formats SDMX provides a format for describing the structure of data (“structural metadata”) EDIFACT (was GESMES/TS, now SDMX-EDI) XML (SDMX-ML) SDMX provides formats for transmission and processing of data EDIFACT (1 message) XML (4 different equivalent flavors for different functions) Data is tabulated, aggregate data (eg, multi-dimensional/OLAP cubes) Can be any aggregate data! Most data formats are derived from the structural metadata (eg, XML schemas are generated for each type of structure according to the business rules)

Data Set: Structure

First: Identify the Concepts
A statistical concept is a characteristic of a time series or an observation (MCV) A concept is a unit of knowledge created by a unique combination of characteristics (SDMX Information Model) Whatever the definition, statistical concepts are the DNA of the key family Their usage (type, structure, sequence) define the structure of the data

Data Set Structure: Concepts
Computers need the structure of data: concepts, code lists, data values, and how these fit together. Concepts in the example: Unit Multiplier, Unit, Topic, Time/Frequency, Country, Stock/Flow.

Data Set Structure: Code Lists
CONCEPTS: Topic, Country, Flow.
TOPIC: A Brady Bonds; B Bank Loans; C Debt Securities.
COUNTRY: AR Argentina; MX Mexico; ZA South Africa.
STOCK/FLOW: 1 Stock; 2 Flow.

Data Makes Sense
Q,ZA,B,1, =16547
Quarterly, South Africa, Bank Loans, Stocks, for 30 June 1999: 16457
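The decoding step shown above can be sketched in a few lines of Python. The topic, country, and stock/flow code lists come from the code-list slide; the frequency code list and the order of concepts in the key are assumptions made for illustration, and the time component of the key, which is missing in the source, is left out.

```python
# Code lists from the slides; FREQUENCY is an assumption (it is not listed there).
CODE_LISTS = {
    "FREQUENCY": {"Q": "Quarterly"},
    "COUNTRY": {"AR": "Argentina", "MX": "Mexico", "ZA": "South Africa"},
    "TOPIC": {"A": "Brady Bonds", "B": "Bank Loans", "C": "Debt Securities"},
    "STOCK_FLOW": {"1": "Stock", "2": "Flow"},
}

# Assumed order of the concepts inside a comma-separated key.
KEY_ORDER = ["FREQUENCY", "COUNTRY", "TOPIC", "STOCK_FLOW"]


def decode_key(key):
    """Translate a coded key such as 'Q,ZA,B,1' into readable labels."""
    codes = key.split(",")
    if len(codes) != len(KEY_ORDER):
        raise ValueError(f"expected {len(KEY_ORDER)} codes, got {len(codes)}")
    return {concept: CODE_LISTS[concept][code] for concept, code in zip(KEY_ORDER, codes)}


decoded = decode_key("Q,ZA,B,1")
print(decoded)
```

The key carries no labels at all; its meaning comes entirely from the shared structure, which is why the structure must be agreed before data is exchanged.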

Data Set Structure: Defining Multi-Dimensional Structures
A multi-dimensional structure comprises: Dimensions: concepts that identify the observation value. Attributes: concepts that add additional metadata about the observation value. Measure: the concept that is the observation value. Representation: any of these may be coded, text, date/time, number, etc.

Data Set Structure: Concept Usage
Stock/Flow (Dimension); Country (Dimension); Unit Multiplier (Attribute); Unit (Attribute); Time (Dimension); Frequency (Dimension); Topic (Dimension); Observation (Measure)
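As a rough sketch of concept usage, the roles above can be written down as a small structure definition and used to split a record into key, value, and attributes. The function `split_observation`, the rule that attributes are optional, and the sample values (including the unit and the observation value) are assumptions for illustration; this is not SDMX-ML.

```python
# Role of each concept, taken from the concept-usage slide.
STRUCTURE = {
    "Frequency": "dimension",
    "Country": "dimension",
    "Topic": "dimension",
    "Stock/Flow": "dimension",
    "Time": "dimension",
    "Unit": "attribute",
    "Unit Multiplier": "attribute",
    "Observation": "measure",
}


def split_observation(record):
    """Split a record into (dimension key, observation value, attributes)."""
    unknown = [c for c in record if c not in STRUCTURE]
    if unknown:
        raise ValueError(f"unknown concepts: {unknown}")
    # Assumption: dimensions and the measure are required, attributes are optional.
    required = [c for c, role in STRUCTURE.items() if role != "attribute"]
    missing = [c for c in required if c not in record]
    if missing:
        raise ValueError(f"missing concepts: {missing}")
    key = tuple(record[c] for c, role in STRUCTURE.items() if role == "dimension")
    attributes = {c: v for c, v in record.items() if STRUCTURE[c] == "attribute"}
    return key, record["Observation"], attributes


# Hypothetical record: the unit and the observation value are made up.
record = {
    "Frequency": "Q", "Country": "ZA", "Topic": "B", "Stock/Flow": "1",
    "Time": "1999-06-30", "Unit": "USD", "Unit Multiplier": "6",
    "Observation": 100.0,
}
key, value, attributes = split_observation(record)
print(key, value, attributes)
```

Dimensions alone form the key that identifies an observation; attributes only qualify it, so they never appear in the key.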

SDMX and Metadata

SDMX and Metadata SDMX provides for several types of metadata
Structural (describes structures of data sets and metadata sets and related items) Provisioning (describes the sourcing of data between departments and organizations) “Reference” metadata – all other types of metadata (footnotes, methodology, quality, etc. Can be specified by the user!) Reference metadata is the most important one – it is what we typically think of as metadata

SDMX Metadata Sets. Version 2.0 of the SDMX Technical Specifications provides XML formats for metadata sets (SDMX-ML): to describe their structure, and to exchange metadata in XML. This is based on concepts (similar to the data formats). SDMX supports any metadata concepts the user wishes to report/exchange/process. May be flat lists or hierarchical. Definitions are provided by users, but recommendations exist for many common concepts. Metadata sets are attached to a formal object in the information model (an organization, a data set, a codelist, etc.).

SDMX and Metadata This is a very powerful feature of SDMX
It can be used to integrate/mimic other metadata standards! Provides very good support for standard exchange of metadata which cannot be anticipated by the designers of systems/standards Must be based on common agreements about the meaning of metadata concepts Often, concepts are taken from other metadata models/standards such as DDI, Dublin Core, etc.

The SDMX Information Model

The SDMX Information Model
A formal, documented conceptual model of statistical exchange, management, and sourcing Expressed as a UML model Used as the basis of all SDMX implementation XML EDIFACT Any other programming language/platform Provides consistency between implementations Based on analysis of many statistical processing systems Describes existing business practices in a generic way

Information Model: High-Level Schematic
[Schematic labels: Structure Maps (structure and code list maps); Data or Metadata Structure Definition; Data or Metadata Set (conforms to business rules of the data/metadata flow); Data or Metadata Flow (uses specific data/metadata structure; can be linked to categories in multiple category schemes; can get data/metadata from multiple data/metadata providers); Category Scheme (comprises subject or reporting categories); Category (can have child categories); Data Provider (publishes/reports data/metadata sets; can provide data/metadata for many data/metadata flows using agreed data/metadata structure; registers existence of data and metadata); Provision Agreement; Registration of Data or Metadata Set (URL, registration date, etc.).]

SDMX and Best Practices: The Content-Oriented Guidelines

SDMX Content-Oriented Guidelines
There is a long history of discussion about what is best practice in the collection of statistics SDMX decided to define the technical basis for statistical exchange, and then engage in this debate It makes reaching agreements between organizations easier! These documents build on many years of work defining statistical concepts, terms, and classifications Although described as “statistical”, much of what is here also applies to social science (and other) research

SDMX Content-Oriented Guidelines
Four main documents: Overview Metadata Common Vocabulary (annex) Cross-Domain Concepts (2 annexes) Statistical Subject-Matter Domains (annex) These will not become ISO specifications, but will evolve as publications of the SDMX Initiative They are now available in their first official release at

Common Metadata Vocabulary
A set of terms and definitions for the different parts of the SDMX technical standards, and many common concepts used in data and metadata structures Does not replace other major vocabularies in this space (such as the OECD glossary) but references these other works

Cross-Domain Concepts
Includes concepts which are common across many statistical domains Names & Definitions Representations Approximately 130 concepts, some with recommended representations (codelists) These are concepts which support both data and metadata structures Emphasis on quality frameworks for reference metadata concepts

Statistical Subject-Matter Domains
Based on the UN/ECE classification of statistical activities Provides a classification system for use in exchanging statistics across domain boundaries Provides a breakdown of the various domains within official statistics

SDMX and Web Services

Web-Services Components of SDMX
Web-Services Guidelines Part of the Technical Specifications package SDMX Query message Part of SDMX-ML SDMX Registry Services Part of version 2.0 Technical Specifications Interfaces are in SDMX-ML Document describes implementation rules

Web Services Guidelines
Recommends use of WSS 1.1 for web services which use SOAP, WSDL Provides standard function names for many typical web-services functions Querying for data Querying for metadata Querying for structural information

SDMX Query Message An XML Query to support two-way web-services calls using XML messages Designed to support: Queries for structural information from online databases/repositories Queries for data from online databases Queries for metadata from online databases Part of SDMX-ML Very similar to the SQL query language supported by all database packages Specific to SDMX objects

SDMX Registry Services
A “registry” is a common type of technology Every Windows machine has a “Windows registry” to let applications know what other applications are on that machine, and where they are located Web services registries do the same thing on a network Functions like a card catalogue in a print library – you can look up resources and find out how to obtain them A registry provides a single place on the Internet where everyone can discover the data, metadata, and structures that other organizations use/publish It does not contain the data and metadata – it just indexes them and links to them
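The card-catalogue idea can be sketched in a few lines of Python. This is only an illustration, not the SDMX Registry interface: the class names, the dataflow identifiers, and the example.org URLs are all made up for the example. The point it shows is that the registry stores only index entries and links, never the data sets themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    """One index entry: describes a data set and says where it lives."""
    provider: str   # organization publishing the data
    dataflow: str   # hypothetical dataflow identifier
    url: str        # location of the data set itself


@dataclass
class Registry:
    """Toy registry: holds registrations (index entries), never the data."""
    entries: list = field(default_factory=list)

    def register(self, provider, dataflow, url):
        self.entries.append(Registration(provider, dataflow, url))

    def query(self, dataflow):
        """Return the URLs of every registered data set for a dataflow."""
        return [e.url for e in self.entries if e.dataflow == dataflow]


registry = Registry()
registry.register("BIS", "EXTERNAL_DEBT", "https://example.org/bis/debt.xml")
registry.register("IMF", "EXTERNAL_DEBT", "https://example.org/imf/debt.xml")
registry.register("OECD", "TRADE", "https://example.org/oecd/trade.xml")

# A consumer discovers where the data is, then fetches it from the provider.
urls = registry.query("EXTERNAL_DEBT")
```

A real registry would also hold structural and provisioning metadata; the sketch covers only the "look up resources and find out how to obtain them" step.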

SDMX Registry Services (cont.)
SDMX Registry Services are based on generic, standard web-services registry technology ISO ebXML Registry/Repository OASIS UDDI Registry (part of .NET, etc.) SDMX Registry Services are not generic They are specific to SDMX exchanges of data and metadata, etc. There is not one central “SDMX Registry” Each domain will have its own registry for its members The registries can be linked (“federated”)

SDMX Registry/Repository
SDMX Registry Interfaces Indexes data and metadata Register REGISTRY Data Set/ Metadata Set Query Describes data and metadata sources and reporting processes Submit REPOSITORY Provisioning Metadata Query Submit REPOSITORY Structural Metadata Describes data and metadata structures Query

SDMX Registry/Repository
SDMX Registry Interfaces Indexes data and metadata Register REGISTRY Data Set/ Metadata Set Query Subscription/ Notification Applications can subscribe to notification of new or changed objects Submit REPOSITORY Provisioning Metadata Query Submit REPOSITORY Structural Metadata Describes data and metadata structures Query
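Subscription/notification can be pictured as a simple callback pattern. The sketch below is an assumption-laden toy, not the SDMX-ML registry interface: `NotifyingRegistry`, its method names, and the object identifiers are hypothetical. It shows an application subscribing to one object and being told when that object is new or changed.

```python
from collections import defaultdict


class NotifyingRegistry:
    """Toy registry that notifies subscribers of new or changed objects."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # object id -> callbacks
        self._versions = {}                    # object id -> latest version

    def subscribe(self, object_id, callback):
        """Ask to be called whenever object_id is submitted."""
        self._subscribers[object_id].append(callback)

    def submit(self, object_id, version):
        """Store the latest version and notify subscribers of that object."""
        event = "changed" if object_id in self._versions else "new"
        self._versions[object_id] = version
        for callback in self._subscribers[object_id]:
            callback(object_id, version, event)


received = []
reg = NotifyingRegistry()
reg.subscribe("CL_FREQ", lambda oid, ver, ev: received.append((oid, ver, ev)))

reg.submit("CL_FREQ", "1.0")   # first submission: "new"
reg.submit("CL_FREQ", "1.1")   # later submission: "changed"
reg.submit("CL_AREA", "1.0")   # nobody subscribed, so no notification
```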

The Old JEDH Site
[Diagram: BIS, IMF, OECD, and World Bank supply data in various formats to the website, on a 3-month production cycle]

JEDH with SDMX
[Diagram: BIS, IMF, OECD, and World Bank each publish SDMX-ML and register information about the data in the SDMX Registry; an SDMX “Agent” discovers the data and URLs through the registry, retrieves the data from the sites, and loads it into the JEDH DB (debtor database); the JEDH site serves SDMX-ML, with data provided in real time to the site]

Recent and On-Going Developments
Many organizations using SDMX have been implementing web services There is growing interest in forming a working group to further extend the specification for use with web-services technology Standard error messages Expanded function calls Standard WSDLs If you are interested in this, please tell me!

Tools and Resources

SDMX Tools There are now several sources for SDMX tools
All are free or open-source Eurostat – complete package of tools for data, metadata, and registry services Metadata Technology Ltd – similar package of tools Data editors are usually based on Excel Some other tools Open Data Foundation “SDMX Browser” for data visualization OECD, ECB, and UN/Statistical Division provide some other tools for specific applications Integration with PC-Axis has been prototyped, to be available this summer DevInfo has SDMX support FAME is developing SDMX support Commercial vendors provide good support through web-services functionality e.g., Oracle 11, .NET, etc.

Resources The SDMX Initiative Site: http://www.sdmx.org
The SDMX Toolkit and Forums: Various papers and (soon) open-source tools:

IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th SDMX, DDI, and Other Standards Arofan Gregory / Pascal Heus / Open Data Foundation

Overview of the Session
DDI/SDMX: Philosophy and Timing of Standards Development DDI/SDMX: Points of Functional Overlap DDI/SDMX: Direct Mappings DDI/SDMX: Integration Approaches Other Related Standards and On-Going Work

DDI/SDMX: Philosophy and Timing of Development

Development Philosophies/Timing
Unlike many standards bodies, both the SDMX Initiative and the DDI Alliance have attempted to create standards which do not duplicate existing efforts There is an awareness that users need to deal with several different standards DDI (3.0) and SDMX were both intentionally aligned with other, related standards DDI 1.*/2.* existed before SDMX It was largely self-contained SDMX was created before DDI 3.0 existed Created with an awareness of DDI 1.*/2.* DDI 3.0 benefited from having SDMX as a published specification Actively aligned with SDMX and many other standards

SDMX Design SDMX was intentionally designed to accommodate integration of standards which are used with the inputs to aggregate data This included DDI and XBRL Mechanism for integration is generic The key point for this integration is the SDMX Registry It provides links between aggregate (SDMX) data sets, and also to source data and metadata

DDI/SDMX: Points of Functional Overlap

SDMX and DDI as Complementary
DDI is designed to document micro-data 1.*/2.* versions were archival, after-the-fact documentation 3.0 version covers entire life cycle, but still has an after-the-fact function SDMX is designed as a standard for processing and automation It is not documentary, but is aimed at automation of statistical systems and exchanges These purposes are related, but not duplicative SDMX and DDI can both do useful things within a single system

Examples DDI could be used to document SDMX-based aggregates more completely for archival purposes DDI could be used to document the micro-data on which aggregates are based As soon as tabulation occurs, SDMX can be used to describe and format the data SDMX can describe micro-data, but it is not very useful DDI can be used to automate processing of multi-dimensional data cubes, but it is more difficult than with SDMX SDMX can be used to link DDI instances with other types of standard data and metadata (including both SDMX and DDI)

DDI and SDMX
SDMX: Aggregated data; Indicators, Time Series; Across time; Across geography; Open Access; Easy to use
DDI: Microdata; Low level observations; Single time period; Single geography; Controlled access; Expert Audience
Microdata is an important source of aggregated data. Crucial overlaps and mappings exist between both worlds (but are commonly undocumented). Interoperability provides users with a full picture of the production process.

Generic Process Example
DDI Survey/Register Anonymization, cleaning, recoding, etc. Tabulation, processing, case selection, etc. Indicators Raw Data Set Micro-Data Set/ Public Use Files Aggregation, harmonization Aggregation, harmonization SDMX Aggregate Data Set (Higher Level) Aggregate Data Set (Lower level)

DDI + SDMX? When you have data which has been tabulated/aggregated, it may be useful to have both SDMX and DDI SDMX for processing and exchanging the data DDI for documenting these processes, in case they are of interest to researchers DDI has a much richer descriptive capability for addressing the exact processes used in statistical packages SDMX is easier to process

DDI/SDMX: Direct Mappings

Direct Mappings: DDI & SDMX
IDs and referencing use the same approach (identifiable – versionable - maintainable; structured URN syntax) Both are organized around schemes Reusable packages of data, similar to relational tables in databases Both describe multi-dimensional data A “clean” cube in DDI maps directly to/from SDMX Both have concepts and codelists DDI has much less emphasis on concepts SDMX emphasizes concepts because they are needed for comparison Both contain mappings (“comparison”) for codes and concepts
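The "a code maps to a code" correspondence can be illustrated with plain dictionaries. This is a sketch under stated assumptions: the dictionary layouts, the `CL_` naming convention, and the agency name are invented for the example and are not the DDI or SDMX-ML schemas.

```python
def ddi_scheme_to_sdmx_codelist(scheme, agency):
    """Map a DDI-style code scheme (plain dict) to an SDMX-style codelist.

    Both layouts are hypothetical; only the one-to-one shape of the
    mapping (scheme -> codelist, code -> code) is the point.
    """
    return {
        "id": "CL_" + scheme["name"].upper(),
        "agency": agency,
        "version": scheme["version"],
        "codes": [
            {"id": code["value"], "name": code["label"]}
            for code in scheme["codes"]
        ],
    }


sex_scheme = {
    "name": "sex",
    "version": "1.0",
    "codes": [
        {"value": "1", "label": "Male"},
        {"value": "2", "label": "Female"},
    ],
}

codelist = ddi_scheme_to_sdmx_codelist(sex_scheme, agency="EXAMPLE_AGENCY")
```

In practice the same mapping would more likely be written as an XSLT transformation between the two XML formats.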

Formal Mapping There is on-going work to describe a formal mapping between SDMX and DDI It will cover these direct correspondences They are quite obvious: a code maps to a code; a concept to a concept; etc. There are currently no tools, because generic tools such as XSLT will work for this transformation Drafts of this work are expected this summer, as part of the SDMX submission to ISO for the version 2.0 Technical Specifications The direct mappings are the easy part!

Issues with Direct Mapping
It is possible to describe everything in the DDI as an SDMX Metadata Set This is probably not the best way to use SDMX with DDI! It is usually better to select the important fields, and keep the rest in native DDI format When you map from DDI to SDMX, you typically will not carry much of the descriptive metadata, question text, etc. Mostly structural (codelists, dimensions, attributes, concepts) You must have concepts for SDMX which are not always present in DDI Going from SDMX to DDI, it is not always possible to map all the data Especially for SDMX Metadata Sets, which may have user-configured concepts that don’t always exist in DDI Note that SDMX-DDI mappings refer to all versions of DDI
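Two of these issues lend themselves to a small sketch: carrying only the structural fields across, and spotting variables that have no concept and so cannot be mapped to SDMX yet. The field names and sample variables below are hypothetical and stand in for whatever a real DDI parser would return.

```python
# Fields treated as structural in this sketch (an assumption, not a standard list).
STRUCTURAL_FIELDS = {"name", "concept", "codelist"}


def split_variable(variable):
    """Separate what would travel to SDMX from what stays in native DDI."""
    structural = {k: v for k, v in variable.items() if k in STRUCTURAL_FIELDS}
    descriptive = {k: v for k, v in variable.items() if k not in STRUCTURAL_FIELDS}
    return structural, descriptive


def variables_missing_concepts(variables):
    """List the variables that need a concept before they can be mapped."""
    return [v["name"] for v in variables if not v.get("concept")]


variables = [
    {"name": "AGE", "concept": "AGE", "codelist": None,
     "question_text": "How old are you?"},
    {"name": "Q17", "concept": None, "codelist": "CL_YESNO",
     "question_text": "Do you own a car?"},
]

structural, descriptive = split_variable(variables[0])
missing = variables_missing_concepts(variables)
```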

DDI/SDMX: Integration Approaches

Integration Use Cases The most important aspect of DDI – SDMX integration is understanding what the use cases are This defines what mapping/transformation is needed It also defines what links need to be stored between data and metadata files There are some common use cases DDI used to describe and link microdata inputs to SDMX aggregates DDI used to more fully document SDMX aggregates for dissemination to users Using the SDMX Registry as a lifecycle management tool for DDI, SDMX, etc.

Linking Source Data and Aggregates
DDI provides a wealth of information about the micro-data which serves as an input to SDMX aggregates It is possible to capture these links in SDMX, at the cell level or higher, to provide automated access to source data An SDMX registry can be used to provide easy access to these links The user/collector of aggregate data can access the rich DDI metadata, and possibly the data (if they have access rights) It is possible to automatically generate SDMX output from the DDI metadata describing tabulation of micro-data This may not be useful if the desired SDMX target is a standard cube structure described by another organization It may make transformation to the standard cube easier, however The SDMX Registry provides a good tool for managing links Links between SDMX and DDI files are stored as Metadata Reports
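Links "at the cell level or higher" can be pictured as reports that target a full or partial key. The sketch below is illustrative only: the report layout, the `SOURCE_DDI` attribute name, the dimension names, and the example.org URLs are assumptions, not the SDMX Metadata Report format. A report applies to a cell when every dimension it names matches that cell.

```python
def make_link_report(target_key, ddi_url):
    """Build a toy metadata report linking a (partial) key to a DDI file."""
    return {"target": target_key, "attributes": {"SOURCE_DDI": ddi_url}}


def find_sources(reports, cell_key):
    """Return DDI links whose target key matches the cell on every named dimension."""
    return [
        report["attributes"]["SOURCE_DDI"]
        for report in reports
        if all(cell_key.get(dim) == value
               for dim, value in report["target"].items())
    ]


reports = [
    # Higher-level link: applies to every cell for one country.
    make_link_report({"REF_AREA": "DE"},
                     "https://example.org/ddi/de-survey.xml"),
    # More specific link: one country and one period.
    make_link_report({"REF_AREA": "DE", "TIME_PERIOD": "2007"},
                     "https://example.org/ddi/de-2007-wave.xml"),
    make_link_report({"REF_AREA": "FR"},
                     "https://example.org/ddi/fr-survey.xml"),
]

cell = {"REF_AREA": "DE", "TIME_PERIOD": "2007", "SEX": "F"}
sources = find_sources(reports, cell)
```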

Demo: SDMX – DDI Source Links

DDI + SDMX for Dissemination
Typically, the full DDI documentation is not provided on web-sites which publish aggregates/indicators SDMX is becoming a popular dissemination format for these data It has been shown to increase the use of data on the Web If the DDI documentation is available, this could also be delivered as additional documentation Especially useful at study level Links could be directly embedded in SDMX data files as attributes or stored in an SDMX Registry, or both

The SDMX Registry for Lifecycle Management
The SDMX Registry provides a tool for tracking the sources of data for aggregates It can also track the transformation of versions of DDI as the data moves through the lifecycle There is an SDMX model for processes This can be used to describe the DDI lifecycle model SDMX Metadata Reports can be used to link DDI metadata to specific stages of the DDI lifecycle, and to each other Applications could query the SDMX Registry to discover all of the DDI metadata produced upstream, as micro-data is collected and processed

Demos SDMX Metadata Report used to express DDI metadata
SDMX Metadata Report used to link DDI instances

Other Related Standards and On-Going Work

Many Related Standards
DDI SDMX ISO/IEC 11179 – concept management and semantic modelling ISO 19115 – Geographical metadata METS – packaging/archiving of digital objects PREMIS – Archival lifecycle metadata XBRL – business reporting Dublin Core – citation metadata Standard mappings are being defined by people from many different organizations (see presentation from METIS 2008 in Luxembourg)

ISO/IEC 11179 ISO/IEC 11179 is used to describe the meanings and representations of terms and concepts Both SDMX and DDI are aligned with ISO/IEC 11179 SDMX and DDI concepts can be defined using the ISO/IEC 11179 attributes Codelists and categories can be directly mapped (and other representations) ISO/IEC 11179 can be implemented with DDI (directly, for concepts) and/or with SDMX (as a Metadata Report) ISO/IEC 11179 has no standard expression in XML – it is just a model

ISO 19115 Geographical Metadata
ISO 19115 describes geographies (bounding boxes for countries, etc.) DDI uses the ISO 19115 model in its own XML It does not use the standard ISO 19115 XML format, but there is a 1-to-1 mapping SDMX could model ISO 19115 if desired Linking to DDI or ISO 19115 XML is probably more useful, using the standard SDMX mechanism Most geographies in SDMX aggregate data sets are coded, not directly described

METS METS is used to package a set of files which work together as a digital object Both DDI and SDMX metadata could be placed inside a METS wrapper They would be “metadata sections” The primary use case would be for archiving of a set of related data and metadata files, possibly with other related materials such as research publications
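A minimal sketch of the wrapper idea, using Python's standard `xml.etree.ElementTree`. The METS element and attribute names (`dmdSec`, `mdWrap`, `xmlData`, `MDTYPE`, `OTHERMDTYPE`) come from the METS schema; the payload elements are placeholders standing in for real DDI and SDMX documents, and the result is not a complete METS file (the required structural map is left out).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def wrap_in_mets(sections):
    """Wrap XML payloads as METS descriptive metadata sections.

    sections: list of (section_id, mdtype, other_mdtype, payload_element).
    """
    mets = ET.Element(f"{{{METS_NS}}}mets")
    for section_id, mdtype, other_mdtype, payload in sections:
        dmd_sec = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": section_id})
        attrs = {"MDTYPE": mdtype}
        if other_mdtype:
            # Used with MDTYPE="OTHER" to name a format METS does not list.
            attrs["OTHERMDTYPE"] = other_mdtype
        md_wrap = ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdWrap", attrs)
        xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
        xml_data.append(payload)
    return mets


# Placeholder payloads; real ones would be parsed DDI and SDMX-ML documents.
ddi_payload = ET.Element("ddiPlaceholder")
sdmx_payload = ET.Element("sdmxPlaceholder")

mets = wrap_in_mets([
    ("DMD1", "DDI", None, ddi_payload),
    ("DMD2", "OTHER", "SDMX", sdmx_payload),
])
xml_text = ET.tostring(mets, encoding="unicode")
```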

PREMIS PREMIS allows for the capture of administrative metadata as a collection is placed and managed within the archive DDI and SDMX files would be treated like any other files forming part of the collection Both may contain metadata which can be extracted and used to populate PREMIS instances (access levels, confidentiality, etc.)

XBRL XBRL is used by business to report required information to national supervisory bodies This includes banking supervision and other economic data XBRL is a source format for some aggregate statistics XBRL International and the SDMX Sponsors are working together to define a cross-walk between the two standards

Dublin Core Dublin Core is used to capture citation-type metadata for resources on the Internet and elsewhere It is widely used in digital repositories for research papers DDI has the basic Dublin Core XML format as an integral part of the DDI 3.0 specification Dublin Core can be easily mimicked as an SDMX Metadata Report [Demo]
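The idea of mimicking Dublin Core as a metadata report can be sketched as follows. The fifteen element names are those of the Dublin Core Metadata Element Set; everything else (the report layout, the `DC_` concept identifiers, the target URL, the sample values) is a hypothetical illustration rather than the SDMX Metadata Report format.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_report(target, **values):
    """Build a toy metadata report whose concepts mirror Dublin Core elements."""
    unknown = set(values) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {
        "target": target,
        # One reported attribute per Dublin Core element supplied.
        "attributes": {"DC_" + name.upper(): value
                       for name, value in values.items()},
    }


report = dublin_core_report(
    "https://example.org/ddi/de-survey.xml",
    title="Example Household Survey",
    creator="Example Statistical Office",
    date="2008",
)
```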

High-Level Vision – Standards Mappings
Federated Registries (Based on SDMX, ebXML, web services) ISO 11179 Semantic definitions Aggregated Data/Metadata (SDMX) registered Organized using References to source data METS/PREMIS XBRL Business Reports DDI Microdata Sets Standard classifications Dublin Core Citations Used in ISO 19115 Geographies
