Presentation on theme: "IZA Data Service Center DDI/SDMX Workshop Wiesbaden, Germany, June 18th 2008 The Data Documentation Initiative (DDI) Arofan Gregory / Pascal Heus firstname.lastname@example.org."— Presentation transcript:
1 IZA Data Service Center DDI/SDMX Workshop, Wiesbaden, Germany, June 18th 2008. The Data Documentation Initiative (DDI). Arofan Gregory / Pascal Heus, Open Data Foundation
2 Content: Background on metadata and XML; Metadata and Microdata; XML and Microdata: the DDI; DDI 2.0; DDI 3.0; DDI 2.0 vs 3.0; Major stakeholders / initiatives
4 What is metadata? Common definition: data about data. Labeled stuff / Unlabeled stuff. The bean example is taken from: A Manager’s Introduction to Adobe eXtensible Metadata Platform.
5 What is XML? Today's universal language on the web. Its purpose is to facilitate sharing of structured information across information systems. XML stands for eXtensible Markup Language: eXtensible (can be customized); Markup (tags, marks, attaches attributes to things); Language (syntax, i.e. grammatical rules). HTML (HyperText Markup Language) is a markup language but is not extensible, and it is concerned with presentation, not content. XML is a text format (not a binary black box). XML is also a collection of technologies built on the XML language. It is platform independent and is supported by modern programming languages (C++, Java, .NET, PHP, Perl, etc.). It is both machine and human readable.
7 XML Technology overview. Document Type Definitions (DTD) and XML Schema are used to validate an XML document by defining namespaces, elements and rules. Specialized software and database systems can be used to create and edit XML documents; in the future the XForms standard will be used. XML separates metadata storage from its presentation. XML documents can be transformed into something else (HTML, PDF, other XML, etc.) through the eXtensible Stylesheet Language: XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO). Much like a database system, XML documents can be searched and queried through XPath or XQuery, with no need to create tables, indexes or define relationships. XML metadata or data can be published in “smart” catalogs, often referred to as registries, that can be used for discovery of information. XML documents can be sent like regular files but are typically exchanged between applications through web services using SOAP and other protocols.
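To make the point about querying XML "like a database" concrete, here is a minimal sketch using Python's standard library, which supports a limited subset of XPath. The sample document, its element names and its attribute names are invented for illustration and are not taken from DDI or any other schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical survey metadata; all element and attribute names are invented.
SAMPLE = """
<survey id="hh2008">
  <title>Household Survey 2008</title>
  <variable name="age" type="numeric"><label>Age in years</label></variable>
  <variable name="sex" type="coded"><label>Sex of respondent</label></variable>
  <variable name="region" type="coded"><label>Region of residence</label></variable>
</survey>
"""

root = ET.fromstring(SAMPLE)


def labels_of_type(tree_root, var_type):
    """Return the labels of all variables whose 'type' attribute equals var_type."""
    # ElementTree understands a subset of XPath, including attribute predicates.
    path = f"./variable[@type='{var_type}']/label"
    return [element.text for element in tree_root.findall(path)]


# No tables or indexes were defined; the path expression does the selection.
coded_labels = labels_of_type(root, "coded")
print(coded_labels)
```

A full XPath or XQuery engine offers much more than this subset; the sketch only shows the idea of selecting by structure and attribute value.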
8 What is an XML Schema? Exchange, sharing and harmonization imply agreement on structure. We need a specification that describes the structure and rules: a schema. A schema is a set of rules to which an XML document must conform in order to be considered 'valid'. XML Schema was also designed with the intent that determining a document's validity would produce a collection of information adhering to specific data types. This is similar to the structural definition of a relational database. Many schemas exist for different purposes. Examples: DDI, SDMX, Dublin Core, RSS, XHTML, etc.
10 What is a survey? More than just data: a complex process to produce data for the purpose of statistical analysis. Beyond this, a tool to support evidence-based policy making and results monitoring. The data is surrounded by a large body of documentation. Survey data often come with limited documentation. Note that microdata is intended for experts (statisticians / researchers); represents a single point in time and space; needs to be aggregated to produce meaningful results. It is the beginning of the story.
11 What is survey metadata? Survey documentation can be broken down into structured metadata and documents. Structured metadata can be captured using XML. Documents can be described in structured metadata. Examples of metadata: Survey level: title, country, year, abstract, sampling, agencies, access policy, etc. Variable level: filename, label, code, questions, instructions, derivation, etc. Related materials: report, questionnaire, papers, manuals, scripts/programs, photos. Cross-surveys: catalogs, longitudinal, concepts, comparability, etc.
12 Importance of survey metadata. Data quality: usefulness = accessibility + coherence + completeness + relevance + timeliness + …; undocumented data is useless; partially documented data is risky (misuse). Data discovery and access. Preservation. Replication standard (Gary King). Information exchange. Reduce need to access sensitive data. Maintain coherence / linkages across the complete life cycle (from respondent to policy maker). Reuse.
13 The Data Documentation Initiative. The Data Documentation Initiative is an XML specification to capture structured metadata about “microdata” (broad sense). First generation, DDI 1.0…2.1 ( ): focus on single archived instance. Second generation, DDI 3.0 (2008): focus on life cycle; goes beyond the single survey concept; multi-purpose.
14 DDI Timeline / Status. Pre-DDI 1.0: 70’s / 80’s OSIRIS Codebook; 1993 IASSIST Codebook Action Group; 1996 SGML DTD; 1997 DDI XML; 1999 Draft DDI DTD. 2000, DDI 1.0: simple survey; archival data formats; microdata only. 2003, DDI 2.0: aggregate data (based on matrix structure); added geographic material to aid geographic search systems and GIS users; establishment of DDI Alliance. 2004: acceptance of a new DDI paradigm; lifecycle model; shift from the codebook-centric / variable-centric model to capturing the lifecycle of data; agreement on expanded areas of coverage. 2005: presentation of schema structure; focus on points of metadata creation and reuse. 2006: presentation of first complete 3.0 model; internal and public review. 2007: vote to move to Candidate Version (CR); establishment of a set of use cases to test application and implementation; October, 3.0 CR2. 2008: February, 3.0 CR3; March, 3.0 CR3 update; April, 3.0 CR3 final; April 28th, 3.0 approved by DDI Alliance; May 21st, DDI 3.0 officially announced; initial presentations at IASSIST 2008. 2009: DDI 3.1 and beyond.
16 The archive perspective. Focus on preservation of a survey. Often sees a survey as a collection of data files accompanied by documentation. Codebook centric: report, questionnaire, methodologies, scripts, etc. Results in a static event: the archive. Maintained by a single agency. Is typically documentation after the fact. This is the initial DDI perspective (DDI 2.0).
17 DDI 2.0 Technical Overview. Based on a single structure (DTD): 1 codeBook, 5 sections. docDscr: describes the DDI document (the preparation of the metadata). stdyDscr: describes the study (title, abstract, methodologies, agencies, access policy). fileDscr: describes each file in the dataset. dataDscr: describes the data in the files (variables (name, code, ), variable groups, cubes). othMat: other related materials (basic document citation).
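The one-codeBook, five-section layout can be sketched as a small XML skeleton. The five section names come from the slide; everything nested below them (`titl`, `var`, `labl` and their placement) is deliberately simplified and should be read as an assumption for illustration, not as the actual DDI 2 DTD.

```python
import xml.etree.ElementTree as ET


def build_codebook(title, variables):
    """Build a skeleton codeBook with the five top-level sections from the slide.

    The nesting below each section is a simplified placeholder, not the real DTD.
    """
    code_book = ET.Element("codeBook")
    ET.SubElement(code_book, "docDscr")           # describes the DDI document itself
    study = ET.SubElement(code_book, "stdyDscr")  # describes the study
    ET.SubElement(study, "titl").text = title     # simplified placement (assumption)
    ET.SubElement(code_book, "fileDscr")          # describes each data file
    data = ET.SubElement(code_book, "dataDscr")   # describes the variables
    for name, label in variables:
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "labl").text = label
    ET.SubElement(code_book, "othMat")            # other related materials
    return code_book


example = build_codebook(
    "Example Household Survey",  # hypothetical study title
    [("age", "Age in years"), ("sex", "Sex of respondent")],
)
print(ET.tostring(example, encoding="unicode"))
```

Note how the variable is the natural place to hang most of the information, which is the codebook-centric design the following slides discuss.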
18 Characteristics of DDI 1.0/2.0. Focuses on the static object of a codebook. Designed for limited uses: end-user data discovery via the variable or high-level study identification (bibliographic); the only heavily structured content relates to information used to drive statistical analysis. Coverage is focused on single study, single data file, simple survey and aggregate data files. The variable contains the majority of information (question, categories, data typing, physical storage information, statistics).
19 Impact of these limitations. Treated as an “add on” to the data collection process. Focus is on the data end product and end users (static). Limited tools for creation or exploitation. The variable must exist before metadata can be created. Producers are hesitant to take up DDI creation because it is a cost and does not support their development or collection process.
20 DDI 1/2.x Tools. Nesstar: Nesstar Publisher, Nesstar Server. IHSN: Microdata Management Toolkit; NADA (online catalog for national data archive); Archivist / Reviewer Guidelines. Other tools: SDA; Harvard/MIT Virtual Data Center (Dataverse); UKDA DExT; ODaF DeXtris.
21 DDI 2.0 perspective. [Diagram: repeated “DDI 2 / Survey” pairs, with the labels Media/Press, General Public, Academic, Users, Producers, Policy Makers, Government, Archivists, Sponsors, Business.]
23 When to capture metadata? Metadata must be captured at the time the event occurs! Documenting after the fact leads to considerable loss of information. Multiple contributors are typically involved in this process (not only the archivist). This is true for producers and researchers.
24 DDI 3.0 and the Survey Life Cycle. A survey is not a static process: it evolves dynamically across time and involves many agencies/individuals. DDI 2.x is about archiving; DDI 3.0 spans the entire “life cycle”. 3.0 focuses on metadata reuse (minimizes redundancies/discrepancies, supports comparison). It also supports multilingual content, grouping, geography, and others. 3.0 is extensible.
25 Requirements for 3.0. Improve and expand the machine-actionable aspects of the DDI to support programming and software systems. Support CAI instruments through expanded description of the questionnaire (content and question flow). Support the description of data series (longitudinal surveys, panel studies, recurring waves, etc.). Support comparison, in particular comparison by design but also comparison after the fact (harmonization). Improve support for describing complex data files (record and file linkages). Provide improved support for geographic content to facilitate linking to geographic files (shape files, boundary files, etc.).
26 Approach. Shift from the codebook-centric model of early versions of DDI to a lifecycle model, providing metadata support from data study conception through analysis and repurposing of data. Shift from an XML Document Type Definition (DTD) to an XML Schema model to support the lifecycle model, reuse of content and increased controls to support programming needs. Redefine a “single DDI instance” to include a “simple instance”, similar to DDI 1/2, which covered a single study, and “complex instances” covering groups of related studies. Allow a single study description to contain multiple data products (for example, a microdata file and aggregate products created from the same data collection). Incorporate the requested functionality in the first published edition.
27 Designing to support registries. Resource package: a structure to publish non-study-specific materials for reuse. Extracting specified types of information into schemes: Universe, Concept, Category, Code, Question, Instrument, Variable, etc. Allowing for either internal or external references: can include other schemes by reference and select only desired items. Providing comparison mapping: the target can be an external harmonized structure.
28 Technical Overview. DDI 3 is composed of several schemas. Use only what you need! Schemas represent modules, sub-modules (substitutions), reusable, external schemas: archive, comparative, conceptualcomponent, datacollection, dataset, dcelements, DDIprofile, ddi-xhtml11, ddi-xhtml11-model-1, ddi-xhtml11-modules-1, group, inline_ncube_recordlayout, instance, logicalproduct, ncube_recordlayout, physicaldataproduct, physicalinstance, proprietary_record_layout (beta), reusable, simpledc, studyunit, tabular_ncube_recordlayout, xml. The ddi-xhtml11 schemas are a set of XML schemas to support XHTML.
29 Technical Overview. Any element that can be referenced is globally uniquely identified: Maintainable (by an agency); Versionable (can change across time); Identifiable (within a maintainable scheme). Modules: reflect closely related sets of information, similar to the sections of the DDI 1/2.* DTD; modules can be held as separate XML instances and be included in a larger instance by either inclusion or reference; all modules are maintainable (but not all maintainables are modules).
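One way to picture the maintainable / versionable / identifiable layering is as nested identifiers that combine into a globally unique reference. The sketch below is an assumption-laden illustration: the class names, the colon-separated format and the sample agency are hypothetical and are not the identifier syntax defined by DDI 3.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Maintainable:
    """A scheme maintained by an agency; it carries its own version."""
    agency: str
    scheme_id: str
    version: str = "1.0.0"


@dataclass(frozen=True)
class Identifiable:
    """An item that is only unique within its maintainable parent."""
    parent: Maintainable
    item_id: str

    def global_id(self) -> str:
        # Hypothetical format: agency, scheme, version and item joined by colons.
        parts = [self.parent.agency, self.parent.scheme_id,
                 self.parent.version, self.item_id]
        return ":".join(parts)


scheme = Maintainable("example.agency", "QuestionScheme1")  # hypothetical agency
question = Identifiable(scheme, "Q1")
print(question.global_id())
```

Because the agency and version are part of the reference, two agencies can both have an item called `Q1`, and a revised scheme does not silently change what existing references point to.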
30 Technical Overview: Maintainable Schemes (that’s with an ‘e’ not an ‘a’). Category Scheme, Code Scheme, Concept Scheme, Control Construct Scheme, GeographicStructureScheme, GeographicLocationScheme, InterviewerInstructionScheme, Question Scheme, NCubeScheme, Organization Scheme, Physical Structure Scheme, Record Layout Scheme, Universe Scheme, Variable Scheme. These are packages of reusable metadata maintained by a single agency.
31 DDI 3.0 Use Cases. Study design/survey instrumentation. Questionnaire generation/data collection and processing. Data recoding, aggregation and other processing. Data dissemination/discovery. Archival ingestion/metadata value-add. Question/concept/variable banks. DDI for use within a research project. Capture of metadata regarding data use. Metadata mining for comparison, etc. Generating instruction packages/presentations.
32 Study Design/Survey Instrumentation. This use case concerns how DDI 3.0 can support the design of studies and survey instrumentation, without benefit of a question or concept bank.
33 Types of Metadata: Concepts (conceptual module); Universe (conceptual module); Questions (datacollection module); Flow Logic (datacollection module). As the survey instrument is tested, all revisions and history can be tracked and preserved. This would include question translation and internationalization. [Diagram: a <DDI 3.0> instance with Concepts and Universes goes through Drafting/Review/Revision to Final; <DDI 3.0> Questions and Flow Logic are added, and the combined instance (Concepts, Universes, Questions, Flow Logic) goes through Drafting/Testing/Revision to Final.]
34 Questionnaire Generation, Data Collection, and Processing. This use case concerns how DDI 3.0 can support the creation of various types of questionnaires/CAI, and the collection and processing of raw data into microdata.
35 Types of Metadata: Concepts (conceptual module); Universe (conceptual module); Questions (datacollection module); Flow Logic (datacollection module); Variables (logicalproduct module); Categories/Codes (logicalproduct module); Coding (datacollection module). DDI captures the content; XML allows each application to do its own presentation. [Diagram: the final <DDI 3.0> instance (Concepts, Universes, Questions, Flow Logic) feeds a Paper Questionnaire, an Online Survey Instrument and a CAI Instrument; Raw Data becomes Microdata, adding <DDI 3.0> Variables, Coding, Categories and Codes, plus the Physical Data Product and Physical Data Instance.]
36 Data Recoding, Aggregation, etc. This use case concerns how DDI 3.0 can describe recodes, aggregation, and similar types of data processing.
37 Initial microdata has: Concepts (conceptual module); Universes (conceptual module); Questions (datacollection module); Flow Logic (datacollection module); Variables (logicalproduct module); Coding (datacollection module); Categories (logicalproduct module); Codes (logicalproduct module); Physical Data Product; Physical Data Instance. Recode adds: more codings (datacollection module); new variables; new categories; new codes; NCubes (for aggregation). The process could be a recode, an aggregation, or other process. [Diagram: Microdata with <DDI 3.0> (Conceptual, Datacollection, Variables, Categories, Codes) is processed into Microdata/Aggregates with <DDI 3.0> (Codings, Variables (new), Categories (new), Codes (new), NCubes).]
38 Data Dissemination/Data Discovery. This use case concerns how DDI 3.0 can support the discovery and dissemination of data.
39 Can add archival events metadata. Rich metadata supports auto-generation of websites and other delivery formats. [Diagram: <DDI 3.0> [full metadata set] plus Microdata/Aggregates feeds Codebooks; Websites; Databases, repositories; Research Data Centers; Data-Specific Info Access Systems; Registries; Catalogues; Question/Concept/Variable Banks.]
40 Archival Ingestion and Metadata Value-Add. This use case concerns how DDI 3.0 can support the ingest and migration functions of data archives and data libraries.
41 Supports automation of processing if good DDI metadata is captured upstream. Provides a neutral format for data migration as analysis packages are versioned. Provides a good format and foundation for value-added metadata by the archive. [Diagram: <DDI 3.0> [full metadata set] (?) plus Microdata/Aggregates pass through Ingest Processing into the Data Archive / Data Library, yielding <DDI 3.0> [full or additional metadata] with archival events.]
42 Question/Concept/Variable Banks. This use case describes how DDI 3.0 can support question, concept, and variable banks. These are often termed “registries” or “metadata repositories” because they contain only metadata; links to the data are optional, but provide implied comparability. The focus is metadata reuse.
43 Because DDI has links, each type of bank functions in a modular, complementary way. Supports but does not require ISO 11179. [Diagram: a Question Bank holding <DDI 3.0> Questions, Flow Logic and Codings; a Variable Bank holding <DDI 3.0> Variables, Categories and Codes; a Concept Bank holding <DDI 3.0> Concepts; each serves Users and Applications.]
44 DDI For Use within a Research Project. This use case concerns how DDI 3.0 can support various functions within a research project, from the conception of the study through collection and publication of the resulting data.
46 Capture of Metadata Regarding Data Use. This use case concerns how DDI 3.0 can capture information about how researchers use data, which can then be added to the overall metadata set about the data sources they have accessed.
47 Types of Metadata: Recodes (datacollection module); Record subsets (physicalinstance module); Variable subsets (logicalproduct module); Comparison (comparative module). [Diagram: Data Sets with <DDI 3.0> (StudyUnit, DataCollection, LogicalProduct, PhysicalDataProduct, PhysicalInstance) go through Data Analysis, adding <DDI 3.0> Recodes, Case Selection, Variable Selection, Comparison to original study, and resulting physical file descriptions.]
48 Metadata Mining for Comparison, etc. This use case concerns how collections of DDI 3.0 metadata can act as a resource to be explored, providing further insight into the comparability and other features of a collection of data.
50 Generating Instruction Packages/Presentations. This use case concerns how DDI 3.0 can support automation around the instruction of students and others.
51 Types of Metadata: Individual studies (studyunit module); Grouping purpose (group module); Linking information (comparative module); Processing assistance (group module). Steps: topically related studies are selected; a group is made with a description of the intended use for the group; comparative information is added indicating matching fields for linking and mapping between similar variables; other materials such as SAS/SPSS recode commands are referenced from the group. [Diagram: four <DDI 3.0> StudyUnits (1 to 4) are combined into one <DDI 3.0> group with Comparative and OtherMaterials, forming the Instructional Package.]
52 DDI 3.0 Tools (under development). DDI Foundation Tools Program (DDI-FTP): road map; XML Beans, validation, DDI DExT, DDI2StatsProgs. Other tools: R SPSS Export, Algenta SurveyViz, others presented at IASSIST. DDI Editing Suite: proposed as an extension of DDI-FTP; plan for a generic editor in 6-9 months. DDI 3.0 related projects / initiatives: RDC Canada, Germany RDC / EURASI, DANS MIXED, NORC.
53 DDI 3 Relationship to Other Standards. SDMX (from microdata to indicators / time series): complete mapping to and from DDI NCubes. Dublin Core (surveys and documents get cited): mapping of citation elements; option for DC namespace basic entry. ISO – Geography (microdata gets mapped): search requirements; support for GIS users. METS: designed to support profile development. OAIS (alignment of archiving standards): reference model for the archival lifecycle. ISO/IEC (metadata mining through concepts): variable linking representation to concept and universe; optional data element construct in ConceptualComponent that allows for complete ISO/IEC structure as a maintained item.
54 DDI 3.0 perspective. [Diagram labels: Media/Press, General Public, Academic, Policy Makers, Government, Sponsors, Business, Producers, Users, Archivists.]
56 DDI 2 / DDI 3. DDI 2: single survey; focus on the archive; non-reusable metadata; maintained by a single agency; loose validation; DTD based; sparse documentation; designed by archivists; some tools are available. DDI 3: multiple surveys; focus on life cycle; highly reusable metadata; maintained by many agencies; tight validation; schema based; extensive guide; designed by expert groups; tools are beginning to emerge.
57 What 3.0 can do for you. Manage multi-surveys. Support multiple contributors. Support many different perspectives. Support many different use cases. Maintain metadata integrity across the life cycle. Connect to other metadata spaces. Metadata reuse. Publication in registries. Backward compatibility with 2.0.
59 DDI Organizations / Agencies. DDI Alliance (http://www.ddialliance.org); Interuniversity Consortium for Political and Social Research (ICPSR) (http://icpsr.umich.edu); International Association for Social Science Information Service & Technology (IASSIST) (http://www.iassistdata.org); International Household Survey Network (IHSN) (http://www.surveynetwork.org); Open Data Foundation (ODaF) (http://www.opendatafoundation.org); National Opinion Research Center Data Enclave (NORC) (http://dataenclave.norc.org); Metadata Technology (http://www.metadatatechnology.com)
60 IZA Data Service Center DDI/SDMX Workshop, Wiesbaden, Germany, June 18th. The Statistical Data and Metadata Exchange Standard (SDMX): An Introduction. Arofan Gregory / Pascal Heus, Open Data Foundation
61 Overview of the Session. SDMX Background and Goals; SDMX and Data; SDMX and Metadata; SDMX and Best Practices: The Content-Oriented Guidelines; The SDMX Information Model; SDMX and Web Services; The SDMX Registry; SDMX Data Services; Tools and Resources
63 What is SDMX? The problem space: statistical collection, processing, and exchange is time-consuming and resource-intensive; focus on aggregate data (esp. time series); various international and national organisations have individual approaches for their constituencies; uncertainties about how to proceed with new technologies (XML, web services …).
64 What is SDMX? The Statistical Data and Metadata Exchange (SDMX) initiative is taking steps to address the challenges and opportunities just mentioned: by focusing on business practices in the field of statistical information; by identifying more efficient processes for exchange and sharing of data and metadata using modern technology and open standards.
65 Who is SDMX? SDMX is an initiative made up of seven international organizations: Bank for International Settlements; European Central Bank; Eurostat; International Monetary Fund; Organisation for Economic Cooperation and Development; United Nations; World Bank. The initiative was launched in 2002.
67 SDMX Products. Technical standards for the formatting and exchange of aggregate statistics: SDMX Technical Specifications version 1.0 (now ISO/TS SDMX – TC 154 WG2); SDMX Technical Specifications version 2.0 (soon to be submitted to ISO – TC 154 WG2). Content-Oriented Guidelines (in draft): Common Metadata Vocabulary; Cross-Domain Statistical Concepts; Statistical Subject-Matter Domains.
68 Major Features of SDMX. Structure and formats (XML, EDIFACT) for aggregate data. Structure and formats (XML) for metadata. Formal information model (UML) for managing statistical exchange and sourcing. Web-services guidelines and registry services specification for use of modern technologies. Content-oriented guidelines to recommend best practices.
69 Recent Events. Jan 2007: launch meeting at the World Bank for SDMX 2.0 Technical Specifications. February 2007: endorsement of SDMX by EU’s Statistical Programme Committee. March 2008: SDMX becomes the preferred standard for data and metadata of the UN Statistical Commission; other standards were mentioned, DDI and XBRL specifically.
70 Adopters/Interest. The following are known adopters (or planning to adopt): US Federal Reserve Board and Bank of New York; European Central Bank; Joint External Debt Hub (WB, IMF, OECD, BIS); UN/TRADECOM at UN Statistical Division; NAAWE (National Accounts from OECD/Eurostat); SODI (Eurostat and European Governments); Mexican Federal System; Vietnamese Ministry of Planning and Investment; Qatar Information Exchange; IMF (BOP, SNA, SDDS/GDDS); Food and Agriculture Organization; Millennium Development Goals (UN System, others); International Labor Organization; Bank for International Settlements; OECD; World Bank; Marchioness Islands (Spanish/Portuguese Statistical Region); UNESCO (Education); Australian Bureau of Statistics; Statistics Canada. There are many others not listed or of which we are not aware.
71 Rate of Adoption. Between January 2007 and January 2008, adoption has doubled. We anticipate a similar rate of growth for the coming year. Tools are becoming available. UNSC recommendation makes it a safe course to follow for risk-averse institutions. Training courses are in increasing demand (Eurostat, Metadata Technology). Standard data and metadata structures for many domains are being developed.
73 SDMX and Data Formats. SDMX provides a format for describing the structure of data (“structural metadata”): EDIFACT (was GESMES/TS, now SDMX-EDI); XML (SDMX-ML). SDMX provides formats for transmission and processing of data: EDIFACT (1 message); XML (4 different equivalent flavors for different functions). Data is tabulated, aggregate data (e.g., multi-dimensional/OLAP cubes); it can be any aggregate data. Most data formats are derived from the structural metadata (e.g., XML schemas are generated for each type of structure according to the business rules).
75 First: Identify the Concepts. A statistical concept is a characteristic of a time series or an observation (MCV). A concept is a unit of knowledge created by a unique combination of characteristics (SDMX Information Model). Whatever the definition, statistical concepts are the DNA of the key family. Their usage (type, structure, sequence) defines the structure of the data.
76 Data Set Structure: Concepts. Concepts shown: Unit Multiplier, Unit, Topic, Time/Frequency, Country, Stock/Flow. Computers need the structure of data: concepts; code lists; data values; how these fit together.
77 Data Set Structure: Code Lists. Concepts: Topic, Country, Flow. TOPIC: A Brady Bonds; B Bank Loans; C Debt Securities. COUNTRY: AR Argentina; MX Mexico; ZA South Africa. STOCK/FLOW: 1 Stock; 2 Flow.
78 Data Makes Sense. Q,ZA,B,1, =16547: Quarterly, South Africa, Bank Loans, Stocks, for 30 June 1999, 16457
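The way code lists turn a terse key such as `Q,ZA,B,1` into something readable can be sketched as a simple lookup. The Topic, Country and Stock/Flow code lists below are the ones shown on the slides; the Frequency list contains only the single code that appears in the example, and the dimension order and all Python names are assumptions made for illustration.

```python
# Code lists from the slides; FREQUENCY is partial (only "Q" appears in the example).
CODE_LISTS = {
    "FREQUENCY": {"Q": "Quarterly"},
    "COUNTRY": {"AR": "Argentina", "MX": "Mexico", "ZA": "South Africa"},
    "TOPIC": {"A": "Brady Bonds", "B": "Bank Loans", "C": "Debt Securities"},
    "STOCK_FLOW": {"1": "Stock", "2": "Flow"},
}

# Assumed order of the codes within a key, matching the example "Q,ZA,B,1".
DIMENSION_ORDER = ["FREQUENCY", "COUNTRY", "TOPIC", "STOCK_FLOW"]


def decode_key(key):
    """Translate a comma-separated key into {dimension: human-readable label}."""
    codes = [code.strip() for code in key.split(",")]
    if len(codes) != len(DIMENSION_ORDER):
        raise ValueError(f"expected {len(DIMENSION_ORDER)} codes, got {len(codes)}")
    decoded = {}
    for dimension, code in zip(DIMENSION_ORDER, codes):
        if code not in CODE_LISTS[dimension]:
            raise ValueError(f"unknown {dimension} code: {code!r}")
        decoded[dimension] = CODE_LISTS[dimension][code]
    return decoded


print(decode_key("Q,ZA,B,1"))
```

Without the code lists and the agreed order, the key is just four opaque tokens, which is why the structure has to travel with (or ahead of) the data.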
79 Data Set Structure: Defining Multi-Dimensional Structures. Comprises: Dimensions (concepts that identify the observation value); Attributes (concepts that add additional metadata about the observation value); Measure (the concept that is the observation value); Representation (any of these may be coded, text, date/time, number, etc.).
80 Data Set Structure: Concept Usage. Stock/Flow (Dimension); Country (Dimension); Unit Multiplier (Attribute); Unit (Attribute); Time (Dimension); Frequency (Dimension); Topic (Dimension); Observation (Measure).
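The split between dimensions, attributes and the measure can be expressed as a small data structure. In this sketch the concept names and roles follow the slide (with "Time/Frequency" read as two separate dimensions, which is an interpretation); the class names, the record layout and the sample values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    DIMENSION = "dimension"   # identifies the observation
    ATTRIBUTE = "attribute"   # adds metadata about the observation
    MEASURE = "measure"       # is the observation value


@dataclass(frozen=True)
class Component:
    concept: str
    role: Role


# Roles as listed on the slide; Time and Frequency are treated as two dimensions.
STRUCTURE = (
    Component("Frequency", Role.DIMENSION),
    Component("Country", Role.DIMENSION),
    Component("Topic", Role.DIMENSION),
    Component("Stock/Flow", Role.DIMENSION),
    Component("Time", Role.DIMENSION),
    Component("Unit", Role.ATTRIBUTE),
    Component("Unit Multiplier", Role.ATTRIBUTE),
    Component("Observation", Role.MEASURE),
)


def concepts_with_role(structure, role):
    """Return the concept names that play the given role, in declared order."""
    return [c.concept for c in structure if c.role is role]


def split_record(structure, record):
    """Split a flat record into (key tuple, attribute dict, observation value)."""
    key = tuple(record[name] for name in concepts_with_role(structure, Role.DIMENSION))
    attributes = {name: record[name]
                  for name in concepts_with_role(structure, Role.ATTRIBUTE)
                  if name in record}
    (measure,) = concepts_with_role(structure, Role.MEASURE)  # assumes one measure
    return key, attributes, record[measure]


# Hypothetical record; the values are made up for illustration.
sample = {"Frequency": "Q", "Country": "ZA", "Topic": "B", "Stock/Flow": "1",
          "Time": "1999-Q2", "Unit": "USD", "Unit Multiplier": "6",
          "Observation": 100.0}
print(split_record(STRUCTURE, sample))
```

Only the dimensions form the key; attributes can be missing or change without altering which observation is being described.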
82 SDMX and Metadata. SDMX provides for several types of metadata: Structural (describes structures of data sets and metadata sets and related items); Provisioning (describes the sourcing of data between departments and organizations); “Reference” metadata: all other types of metadata (footnotes, methodology, quality, etc.), which can be specified by the user. Reference metadata is the most important one: it is what we typically think of as metadata.
83 SDMX Metadata Sets. Version 2.0 of the SDMX Technical Specifications provides XML formats for metadata sets (SDMX-ML): to describe their structure; to exchange metadata in XML. This is based on concepts (similar to the data formats). SDMX supports any metadata concepts the user wishes to report/exchange/process: these may be flat lists or hierarchical; definitions are provided by users, but recommendations exist for many common concepts. Metadata sets are attached to a formal object in the information model (an organization, a data set, a codelist, etc.).
84 SDMX and Metadata. This is a very powerful feature of SDMX. It can be used to integrate/mimic other metadata standards! It provides very good support for standard exchange of metadata which cannot be anticipated by the designers of systems/standards. It must be based on common agreements about the meaning of metadata concepts. Often, concepts are taken from other metadata models/standards such as DDI, Dublin Core, etc.
86 The SDMX Information Model. A formal, documented conceptual model of statistical exchange, management, and sourcing. Expressed as a UML model. Used as the basis of all SDMX implementations: XML; EDIFACT; any other programming language/platform. Provides consistency between implementations. Based on analysis of many statistical processing systems. Describes existing business practices in a generic way.
87 Information Model: High-Level Schematic. [Diagram. Objects: Structure Maps (structure and code list maps); Data or Metadata Structure Definition; Category Scheme; Category; Data or Metadata Flow; Data or Metadata Set; Data Provider; Provision Agreement; Registration of Data or Metadata Set (URL, registration date etc.). Relationship labels: uses specific data/metadata structure; comprises subject or reporting categories; can be linked to categories in multiple category schemes; conforms to business rules of the data/metadata flow; publishes/reports data/metadata sets; can get data/metadata from multiple data/metadata providers; can have child categories; can provide data/metadata for many data/metadata flows using agreed data/metadata structure; registers existence of data and metadata.]
88 SDMX and Best Practices: The Content-Oriented Guidelines
89 SDMX Content-Oriented Guidelines. There is a long history of discussion about what is best practice in the collection of statistics. SDMX decided to define the technical basis for statistical exchange, and then engage in this debate. It makes reaching agreements between organizations easier! These documents build on many years of work defining statistical concepts, terms, and classifications. Although described as “statistical”, much of what is here also applies to social science (and other) research.
90 SDMX Content-Oriented Guidelines. Four main documents: Overview; Metadata Common Vocabulary (annex); Cross-Domain Concepts (2 annexes); Statistical Subject-Matter Domains (annex). These will not become ISO specifications, but will evolve as publications of the SDMX Initiative. They are now available in their first official release at
91 Common Metadata Vocabulary. A set of terms and definitions for the different parts of the SDMX technical standards, and many common concepts used in data and metadata structures. Does not replace other major vocabularies in this space (such as the OECD glossary) but references these other works.
92 Cross-Domain Concepts. Includes concepts which are common across many statistical domains: names and definitions; representations. Approximately 130 concepts, some with recommended representations (codelists). These are concepts which support both data and metadata structures. Emphasis on quality frameworks for reference metadata concepts.
93 Statistical Subject-Matter Domains. Based on the UN/ECE classification of statistical activities. Provides a classification system for use in exchanging statistics across domain boundaries. Provides a breakdown of the various domains within official statistics.
95 Web-Services Components of SDMX. Web-Services Guidelines: part of the Technical Specifications package. SDMX Query message: part of SDMX-ML. SDMX Registry Services: part of version 2.0 Technical Specifications; interfaces are in SDMX-ML; a document describes implementation rules.
96 Web Services Guidelines. Recommends use of WSS 1.1 for web services which use SOAP, WSDL. Provides standard function names for many typical web-services functions: querying for data; querying for metadata; querying for structural information.
97 SDMX Query Message. An XML query to support two-way web-services calls using XML messages. Designed to support: queries for structural information from online databases/repositories; queries for data from online databases; queries for metadata from online databases. Part of SDMX-ML. Very similar to the SQL query language supported by all database packages. Specific to SDMX objects.
98 SDMX Registry Services. A “registry” is a common type of technology. Every Windows machine has a “Windows registry” to let applications know what other applications are on that machine, and where they are located. Web services registries do the same thing on a network. A registry functions like a card catalogue in a print library: you can look up resources and find out how to obtain them. A registry provides a single place on the Internet where everyone can discover the data, metadata, and structures that other organizations use/publish. It does not contain the data and metadata; it just indexes them and links to them.
99 SDMX Registry Services (cont.). SDMX Registry Services are based on generic, standard web-services registry technology: ISO ebXML Registry/Repository; OASIS UDDI Registry (part of .NET, etc.). SDMX Registry Services are not generic: they are specific to SDMX exchanges of data and metadata, etc. There is not one central “SDMX Registry”: each domain will have its own registry for its members; the registries can be linked (“federated”).
100 SDMX Registry/Repository
[Diagram: SDMX Registry Interfaces. REGISTRY (Data Set/Metadata Set): indexes data and metadata; operations Register, Query. REPOSITORY (Provisioning Metadata): describes data and metadata sources and reporting processes; operations Submit, Query. REPOSITORY (Structural Metadata): describes data and metadata structures; operations Submit, Query.]
101 SDMX Registry/Repository
[Diagram: SDMX Registry Interfaces. REGISTRY (Data Set/Metadata Set): indexes data and metadata; operations Register, Query. Subscription/Notification: applications can subscribe to notification of new or changed objects. REPOSITORY (Provisioning Metadata): operations Submit, Query. REPOSITORY (Structural Metadata): describes data and metadata structures; operations Submit, Query.]
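Subscription/notification can be pictured as a callback list kept per object type. The sketch below is a toy model under stated assumptions: the class `NotifyingRegistry`, its method names, the "new"/"changed" event labels, and the URLs are all hypothetical and do not reproduce the SDMX-ML registry messages.

```python
from collections import defaultdict


class NotifyingRegistry:
    """Toy registry that tells subscribers about new or changed objects."""

    def __init__(self):
        self._entries = {}                     # url -> (kind, version)
        self._subscribers = defaultdict(list)  # kind -> list of callbacks

    def subscribe(self, kind, callback):
        self._subscribers[kind].append(callback)

    def register(self, kind, url, version):
        # A URL seen before counts as a change, otherwise as a new object.
        event = "changed" if url in self._entries else "new"
        self._entries[url] = (kind, version)
        for callback in self._subscribers[kind]:
            callback(event, url, version)


received = []
registry = NotifyingRegistry()
registry.subscribe("data", lambda event, url, version: received.append((event, url, version)))

registry.register("data", "https://example.org/debt.xml", "1.0")
registry.register("structure", "https://example.org/dsd.xml", "1.0")  # nobody subscribed to this kind
registry.register("data", "https://example.org/debt.xml", "1.1")
print(received)
```

Only the two "data" registrations reach the subscriber; the structure registration is indexed silently.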
102 The Old JEDH Site
[Diagram: BIS, IMF, OECD, and World Bank supply data in various formats to the website, on a 3-month production cycle.]
103 JEDH with SDMX
[Diagram: BIS, IMF, OECD, and World Bank each provide SDMX-ML, and information about their data is registered in the SDMX Registry. An SDMX “Agent” discovers data and URLs through the registry, retrieves data from the sites, and the data is loaded into the JEDH DB. Data is provided in real time to the JEDH site as SDMX-ML. Further label: (Debtor database).]
104 Recent and On-Going Developments
- Many organizations using SDMX have been implementing web services
- There is growing interest in forming a working group to further extend the specification for use with web-services technology
  - Standard error messages
  - Expanded function calls
  - Standard WSDLs
- If you are interested in this, please tell me!
106 SDMX Tools
- There are now several sources for SDMX tools
  - All are free or open-source
  - Eurostat: complete package of tools for data, metadata, and registry services
  - Metadata Technology Ltd: similar package of tools
  - Data editors are usually based on Excel
- Some other tools
  - Open Data Foundation “SDMX Browser” for data visualization
  - OECD, ECB, and UN/Statistical Division provide some other tools for specific applications
  - Integration with PC-Axis has been prototyped, to be available this summer
  - DevInfo has SDMX support
  - FAME is developing SDMX support
- Commercial vendors provide good support through web-services functionality
  - E.g., Oracle 11, .NET, etc.
107 Resources
- The SDMX Initiative Site: http://www.sdmx.org
- The SDMX Toolkit and Forums:
- Various papers and (soon) open-source tools:
108 IZA Data Service Center DDI/SDMX Workshop, Wiesbaden, Germany, June 18th
SDMX, DDI, and Other Standards
Arofan Gregory / Pascal Heus, Open Data Foundation
109 Overview of the Session
- DDI/SDMX: Philosophy and Timing of Standards Development
- DDI/SDMX: Points of Functional Overlap
- DDI/SDMX: Direct Mappings
- DDI/SDMX: Integration Approaches
- Other Related Standards and On-Going Work
111 Development Philosophies/Timing
- Unlike many standards bodies, both the SDMX Initiative and the DDI Alliance have attempted to create standards which do not duplicate existing efforts
  - There is an awareness that users need to deal with several different standards
  - DDI (3.0) and SDMX were both intentionally aligned with other, related standards
- DDI 1.*/2.* existed before SDMX
  - It was largely self-contained
- SDMX was created before DDI 3.0 existed
  - Created with an awareness of DDI 1.*/2.*
- DDI 3.0 benefited from having SDMX as a published specification
  - Actively aligned with SDMX and many other standards
112 SDMX Design
- SDMX was intentionally designed to accommodate integration of standards which are used with the inputs to aggregate data
  - This included DDI and XBRL
  - Mechanism for integration is generic
- The key point for this integration is the SDMX Registry
  - It provides links between aggregate (SDMX) data sets, and also to source data and metadata
114 SDMX and DDI as Complementary
- DDI is designed to document micro-data
  - 1.*/2.* versions were archival, after-the-fact documentation
  - 3.0 version covers the entire life cycle, but still has an after-the-fact function
- SDMX is designed as a standard for processing and automation
  - It is not documentary, but is aimed at automation of statistical systems and exchanges
- These purposes are related, but not duplicative
  - SDMX and DDI can both do useful things within a single system
115 Examples
- DDI could be used to document SDMX-based aggregates more completely for archival purposes
- DDI could be used to document the micro-data on which aggregates are based
- As soon as tabulation occurs, SDMX can be used to describe and format the data
- SDMX can describe micro-data, but it is not very useful for that purpose
- DDI can be used to automate processing of multi-dimensional data cubes, but it is more difficult than with SDMX
- SDMX can be used to link DDI instances with other types of standard data and metadata (including both SDMX and DDI)
116 DDI and SDMX
- SDMX
  - Aggregated data
  - Indicators, Time Series
  - Across time
  - Across geography
  - Open Access
  - Easy to use
- DDI
  - Microdata
  - Low level observations
  - Single time period
  - Single geography
  - Controlled access
  - Expert Audience
- Microdata is an important source of aggregated data
- Crucial overlap and mappings exist between both worlds (but are commonly undocumented)
- Interoperability provides users with a full picture of the production process
117 Generic Process Example
[Diagram: On the DDI side, a Survey/Register yields a Raw Data Set; anonymization, cleaning, recoding, etc. produce a Micro-Data Set/Public Use Files. Tabulation, processing, case selection, etc. lead to the SDMX side: Indicators, an Aggregate Data Set (Lower level) and an Aggregate Data Set (Higher Level), connected by aggregation and harmonization steps.]
118 DDI + SDMX?
- When you have data which has been tabulated/aggregated, it may be useful to have both SDMX and DDI
  - SDMX for processing and exchanging the data
  - DDI for documenting these processes, in case they are of interest to researchers
- DDI has a much richer descriptive capability for addressing the exact processes used in statistical packages
- SDMX is easier to process
120 Direct Mappings: DDI & SDMX
- IDs and referencing use the same approach (identifiable, versionable, maintainable; structured URN syntax)
- Both are organized around schemes
  - Reusable packages of data, similar to relational tables in databases
- Both describe multi-dimensional data
  - A “clean” cube in DDI maps directly to/from SDMX
- Both have concepts and codelists
  - DDI has much less emphasis on concepts
  - SDMX emphasizes concepts because they are needed for comparison
- Both contain mappings (“comparison”) for codes and concepts
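The shared "identifiable, versionable, maintainable" pattern and the code-to-code mapping can be sketched as follows. This is an assumption-laden illustration: the `Scheme` and `Code` classes, the `CL_` naming prefix, the agency name, and especially the URN layout are invented for the example and are not the actual DDI or SDMX URN syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    value: str
    label: str


@dataclass(frozen=True)
class Scheme:
    """A reusable package of codes with a maintainer, an ID and a version."""
    agency: str   # maintaining organization
    id: str
    version: str
    codes: tuple

    def urn(self, standard):
        # Hypothetical URN layout for illustration; not the real DDI or SDMX syntax.
        return f"urn:{standard}:{self.agency}:{self.id}:{self.version}"


def ddi_scheme_to_sdmx_codelist(ddi_scheme):
    """Direct mapping: each code maps to a code; the identity fields carry over."""
    return Scheme(
        agency=ddi_scheme.agency,
        id="CL_" + ddi_scheme.id.upper(),  # naming convention assumed for this sketch
        version=ddi_scheme.version,
        codes=tuple(Code(c.value, c.label) for c in ddi_scheme.codes),
    )


sex_ddi = Scheme("example.agency", "sex", "1.0", (Code("1", "Male"), Code("2", "Female")))
sex_sdmx = ddi_scheme_to_sdmx_codelist(sex_ddi)

print(sex_ddi.urn("ddi"))
print(sex_sdmx.urn("sdmx"))
```

Because both sides carry the same agency, ID and version information, the mapping is a field-by-field copy with no lookup tables.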
121 Formal Mapping
- There is on-going work to describe a formal mapping between SDMX and DDI
  - It will cover these direct correspondences
  - They are quite obvious: a code maps to a code; a concept to a concept; etc.
  - There are currently no dedicated tools, because generic tools such as XSLT will work for this transformation
- Drafts of this work are expected this summer, as part of the SDMX submission to ISO for the version 2.0 Technical Specifications
- The direct mappings are the easy part!
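To show why a generic XML tool is enough for the direct correspondences, here is a minimal sketch using Python's standard `xml.etree.ElementTree` in place of XSLT. The element and attribute names on both sides (`CodeScheme`, `CodeList`, `Label`, `Description`) are simplified stand-ins chosen for this example; they are not the real DDI or SDMX-ML schemas.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical input; real DDI instances use namespaces and richer structure.
DDI_LIKE = """
<CodeScheme id="sex">
  <Code value="1"><Label>Male</Label></Code>
  <Code value="2"><Label>Female</Label></Code>
</CodeScheme>
"""


def to_sdmx_like(ddi_xml):
    """Copy each source code into a target code, one to one."""
    source = ET.fromstring(ddi_xml.strip())
    target = ET.Element("CodeList", {"id": "CL_" + source.get("id").upper()})
    for code in source.findall("Code"):
        out = ET.SubElement(target, "Code", {"value": code.get("value")})
        ET.SubElement(out, "Description").text = code.findtext("Label")
    return target


codelist = to_sdmx_like(DDI_LIKE)
print(ET.tostring(codelist, encoding="unicode"))
```

The same one-to-one walk over elements is what an XSLT stylesheet would express declaratively.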
122 Issues with Direct Mapping
- It is possible to describe everything in the DDI as an SDMX Metadata Set
  - This is probably not the best way to use SDMX with DDI!
  - It is usually better to select the important fields, and keep the rest in native DDI format
- When you map from DDI to SDMX, you typically will not carry much of the descriptive metadata, question text, etc.
  - Mostly structural (codelists, dimensions, attributes, concepts)
  - You must have concepts for SDMX which are not always present in DDI
- Going from SDMX to DDI, it is not always possible to map all the data
  - Especially for SDMX Metadata Sets, which may have user-configured concepts that don’t always exist in DDI
- Note that SDMX-DDI mappings refer to all versions of DDI
124 Integration Use Cases
- The most important aspect of DDI/SDMX integration is understanding what the use cases are
  - This defines what mapping/transformation is needed
  - It also defines what links need to be stored between data and metadata files
- There are some common use cases
  - DDI used to describe and link microdata inputs to SDMX aggregates
  - DDI used to more fully document SDMX aggregates for dissemination to users
  - Using the SDMX Registry as a lifecycle management tool for DDI, SDMX, etc.
125 Linking Source Data and Aggregates
- DDI provides a wealth of information about the micro-data which serves as an input to SDMX aggregates
- It is possible to capture these links in SDMX, at the cell level or higher, to provide automated access to source data
  - An SDMX registry can be used to provide easy access to these links
  - The user/collector of aggregate data can access the rich DDI metadata, and possibly the data (if they have access rights)
- It is possible to automatically generate SDMX output from the DDI metadata describing tabulation of micro-data
  - This may not be useful if the desired SDMX target is a standard cube structure described by another organization
  - It may make transformation to the standard cube easier, however
- The SDMX Registry provides a good tool for managing links
  - Links between SDMX and DDI files are stored as Metadata Reports
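Linking "at the cell level or higher" can be pictured as a lookup from a data set, or from one cell of it, to the DDI file documenting the source. The sketch below is a simplified stand-in for such links: the `SourceLink` record, the `UNEMP` data set, its dimension values, and the URLs are hypothetical, and a real implementation would store the links as SDMX Metadata Reports rather than Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLink:
    """Points from part of an aggregate data set to the DDI file for its source."""
    dataset: str     # identifier of the aggregate data set
    cell_key: tuple  # dimension values of one cell, or () for the whole data set
    ddi_url: str     # location of the DDI instance


# Hypothetical links, for illustration only.
LINKS = [
    SourceLink("UNEMP", (), "https://example.org/ddi/lfs-study.xml"),
    SourceLink("UNEMP", ("DE", "2007"), "https://example.org/ddi/lfs-2007-de.xml"),
]


def sources_for(dataset, cell_key):
    """Return matching links, most specific first: cell level, then data set level."""
    matches = [
        link for link in LINKS
        if link.dataset == dataset and link.cell_key in (cell_key, ())
    ]
    return sorted(matches, key=lambda link: len(link.cell_key), reverse=True)


for link in sources_for("UNEMP", ("DE", "2007")):
    print(link.ddi_url)
```

A cell without its own link still resolves to the data-set-level documentation, which matches the "cell level or higher" idea.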
127 DDI + SDMX for Dissemination
- Typically, the full DDI documentation is not provided on web-sites which publish aggregates/indicators
- SDMX is becoming a popular dissemination format for these data
  - It has been shown to increase the use of data on the Web
- If the DDI documentation is available, this could also be delivered as additional documentation
  - Especially useful at study level
  - Links could be directly embedded in SDMX data files as attributes or stored in an SDMX Registry, or both
128 The SDMX Registry for Lifecycle Management
- The SDMX Registry provides a tool for tracking the sources of data for aggregates
- It can also track the transformation of versions of DDI as the data moves through the lifecycle
  - There is an SDMX model for processes
  - This can be used to describe the DDI lifecycle model
  - SDMX Metadata Reports can be used to link DDI metadata to specific stages of the DDI lifecycle, and to each other
- Applications could query the SDMX Registry to discover all of the DDI metadata produced upstream, as micro-data is collected and processed
129 Demos
- SDMX Metadata Report used to express DDI metadata
- SDMX Metadata Report used to link DDI instances
131 Many Related Standards
- DDI
- SDMX
- ISO/IEC 11179: concept management and semantic modelling
- ISO 19115: geographical metadata
- METS: packaging/archiving of digital objects
- PREMIS: archival lifecycle metadata
- XBRL: business reporting
- Dublin Core: citation metadata
- Standard mappings are being defined by people from many different organizations (see presentation from METIS 2008 in Luxembourg)
132 ISO/IEC 11179
- ISO/IEC 11179 is used to describe the meanings and representations of terms and concepts
- Both SDMX and DDI are aligned with ISO/IEC 11179
  - SDMX and DDI concepts can be defined using the ISO/IEC 11179 attributes
  - Codelists and categories can be directly mapped (and other representations)
- ISO/IEC 11179 can be implemented with DDI (directly, for concepts) and/or with SDMX (as a Metadata Report)
- ISO/IEC 11179 has no standard expression in XML; it is just a model
133 ISO 19115 Geographical Metadata
- ISO 19115 describes geographies (bounding boxes for countries, etc.)
- DDI uses the ISO 19115 model in its own XML
  - It does not use the standard ISO XML format, but there is a 1-to-1 mapping
- SDMX could model ISO 19115 if desired
  - Linking to DDI or ISO XML is probably more useful, using the standard SDMX mechanism
  - Most geographies in SDMX aggregate data sets are coded, not directly described
134 METS
- METS is used to package a set of files which work together as a digital object
- Both DDI and SDMX metadata could be placed inside a METS wrapper
  - They would be “metadata sections”
- The primary use case would be for archiving of a set of related data and metadata files, possibly with other related materials such as research publications
135 PREMIS
- PREMIS allows for the capture of administrative metadata as a collection is placed and managed within the archive
- DDI and SDMX files would be treated like any other files forming part of the collection
- Both may contain metadata which can be extracted and used to populate PREMIS instances (access levels, confidentiality, etc.)
136 XBRL
- XBRL is used by business to report required information to national supervisory bodies
  - This includes banking supervision and other economic data
- XBRL is a source format for some aggregate statistics
- XBRL International and the SDMX Sponsors are working together to define a cross-walk between the two standards
137 Dublin Core
- Dublin Core is used to capture citation-type metadata for resources on the Internet and elsewhere
  - It is widely used in digital repositories for research papers
- DDI has the basic Dublin Core XML format as an integral part of the DDI 3.0 specification
- Dublin Core can be easily mimicked as an SDMX Metadata Report [Demo]
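The idea of mimicking Dublin Core as a metadata report can be sketched as a list of concept/value pairs attached to a target resource. The fifteen element names below are the standard Dublin Core element set; everything else (the `dublin_core_report` function, the `DC_` concept IDs, the report layout, and the target URL) is an assumption made for this illustration and is not the SDMX Metadata Report format shown in the demo.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_report(target_url, **values):
    """Build a hypothetical report: one concept/value attribute per supplied element."""
    unknown = set(values) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {
        "target": target_url,
        "attributes": [
            {"concept": "DC_" + name.upper(), "value": values[name]}
            for name in DUBLIN_CORE_ELEMENTS  # keeps a stable element order
            if name in values
        ],
    }


report = dublin_core_report(
    "https://example.org/presentations/ddi-sdmx",  # hypothetical target
    title="The Data Documentation Initiative (DDI)",
    creator="Arofan Gregory / Pascal Heus",
    date="2008",
)

for attribute in report["attributes"]:
    print(attribute["concept"], "=", attribute["value"])
```

Treating each Dublin Core element as a concept is what makes the citation metadata queryable alongside other reports in a registry.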
138 High-Level Vision: Standards Mappings
[Diagram: Federated Registries (based on SDMX, ebXML, web services) at the centre. Aggregated Data/Metadata (SDMX) is registered there, is organized using Standard classifications and ISO 11179 Semantic definitions, and holds references to source data. Surrounding standards: METS/PREMIS, XBRL Business Reports, DDI Microdata Sets, Dublin Core Citations, and ISO 19115 Geographies. Further arrow label: "Used in".]