Taking Advantage of DDI 3.0

IASSIST 2007: Workshop Montreal

Presenters
--Wendy Thomas, Minnesota Population Center
--Arofan Gregory, Open Data Foundation
--Joachim Wackerow, GESIS-ZUMA

Afternoon’s Schedule 1:30 – 5:00
1:30 Introduction and Overview
2:00 Maintainable objects
2:30 Questions to Variables to Data
3:00 Break
3:30 Creating groups
4:00 URNs and Versioning
4:30 What is it good for?

Ground Rules
--Time periods are general; we’ll adjust to address your interests
--Ask questions
--There’s a lot to cover, so we may suggest continuing a discussion one-on-one during the break or after the workshop
--Materials from the workshop will be posted on the DDI site

Basic Element Types
Differences from 2.1
--Not every element is identifiable
--Many individual elements or complex elements may be versioned
--A number of complex elements can be separately maintained

3.0 Modules and Schemas (one is not necessarily the other)
--Modules reflect closely related sets of information, similar to the sections of the DDI DTD
--Modules can be held as separate XML instances and be included in a larger instance by either inclusion or reference
--All modules are maintainable, but not all maintainables are modules

3.0 Modules and Schemas (one is not necessarily the other)
--Each .xsd file is a schema
--Some schemas are modules
--Some schemas are substitution sets
--Some schemas simply contain elements that are used by multiple schemas or may require more frequent updates
--Some schemas are “borrowed”

SCHEMAS
archive, comparative, conceptualcomponent, datacollection, dataset, dcelements, DDIprofile, ddi-xhtml11, ddi-xhtml11-model-1, ddi-xhtml11-modules-1, group, inline_ncube_recordlayout, instance, logicalproduct, ncube_recordlayout, organization, physicaldataproduct, physicalinstance, reusable, simpledc, studyunit, tabular_ncube_recordlayout, xml

Schemas: MODULES / SUBSTITUTIONS / REUSE / BORROWED / Maintainable Schemas
[Five slides: each repeats the full schema list shown above with a different subset highlighted for the named category; the highlighting is not recoverable from this transcript]

2.1 sections to 3.0 schema

DDI 2.1 sections
1.0 Document Description
  --Citation
  --Guide
  --Document Status
  --Source Document
2.0 Study Description
  --Study Information
  --Methodology
  --Data Access
  --Other Material
3.0 File Description
  --File Text
  --Location Map
4.0 Data Description
5.0 Other Material

DDI 3.0 schemas
Instance
Archive
Study Unit
Conceptual Components
Abstract / Purpose / Coverage
DataCollection
  --Methodology
  --Collection Event
  --Question Scheme
  --Instrument
  --Processing Event
LogicalProduct
  --Data Relationships
  --Category Scheme
  --Code Scheme
  --Variable Scheme
  --NCube
PhysicalDataProduct
  --Gross Record Structure
  --Base Record Structure
PhysicalInstance
  --File Identification
  --Statistics

Why the big change?
--Documentation focused on the Codebook remains a value-added commodity
--It becomes the de facto responsibility of the archive rather than the producer
--Producers capture information that helps them do their work
--The Life Cycle model focuses on the flow of data through a system [production oriented]
--Codebooks should be an output from the documentation process, not the sole commodity

Tighter control, more required items
--In order to support processing we needed tighter control on element and attribute contents
--Schemas provide tighter element-level control
--Profiles allow for customization of coverage
--A large set of required elements provides a more consistent base for programming

Reuse and Replacement
--Reuse of elements means that similar actions are handled in similar ways
  --Questions and Variables use the same category and coding schemes
  --Instrument flow logic is found in comparison and coding instructions for variables
  --Identification and referencing are handled in a consistent manner
--Replacement by substitution allows for addressing changing technologies without making major changes in existing schemas
  --Physical data structures
  --Data types (microdata, aggregate data, future coverage types?)

Maintainable Objects
Publishing persistent parts
--Concept lists
--Question Banks
--Coding schemes
--Comparison mapping
--Study description

Support for Registries
A metadata registry is a central location in an organization where metadata definitions are stored and maintained in a controlled method. Metadata registries are used whenever data must be used consistently within an organization or group of organizations.

Examples of Registry Users
--Organizations that transmit data using structures such as XML, Web Services or EDI
--Organizations that need consistent definitions of data across time, between organizations or between processes, for example when an organization builds a data warehouse
--Organizations that are attempting to break down "silos" of information captured within applications or proprietary file formats

Capturing and Reusing Metadata
--Whether captured at inception or created after the fact, some sections must be completed before other sections can be made
--Capturing metadata at the point of inception, in a non-proprietary structure that can be transferred out of and into process software, provides an incentive for metadata creation during the life cycle of the data

Metadata flow
--DDI is built on the life cycle of the data, and some information naturally occurs earlier than other information
--Reuse of and reference to certain types of information, such as universe, concepts, categories, and coding schemes, prescribe a creation order

[Figure: metadata creation order. Step labels shown: STEP 1, STEP 2, STEP 3, STEP 4, STEP 5, STEP 7, STEP 8, plus one label "optional". Items shown: Universe Scheme; Concept; Organization / Individual; StudyUnit (Citation, Coverage); Category; Coding; Question; Variable; Instrument; Processing Event (coding); Data Relationships; NCube; Record Structure; Remaining Logical Product items; Remaining Physical Data Product Items; Physical Instance; Archive / Group / etc.]
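The idea that reuse and reference prescribe a creation order can be sketched as a dependency sort. The dependency table below is an assumption made for illustration (the slide does not list exact dependencies), and the item names are shorthand rather than DDI element names.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each item maps to the items it references,
# which therefore have to exist first. Not taken from the DDI schemas.
DEPENDS_ON = {
    "Universe": set(),
    "Concept": set(),
    "CategoryScheme": set(),
    "CodeScheme": {"CategoryScheme"},
    "Question": {"Universe", "Concept", "CodeScheme"},
    "Variable": {"Question", "Universe", "Concept", "CodeScheme"},
    "PhysicalDataProduct": {"Variable"},
    "PhysicalInstance": {"PhysicalDataProduct"},
}


def creation_order(depends_on):
    """Return one valid order in which the items can be created."""
    return list(TopologicalSorter(depends_on).static_order())


order = creation_order(DEPENDS_ON)
print(order)
```

Any order the sorter returns places referenced items before the items that reference them; several valid orders may exist.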

Questions to Variables
[Figure: three software stages, with DDI passed between them]
--Development Software
  --Identifying Universe and Concepts
  --Building or Importing Question Text and Response Domains
--Instrument Development Software (CAI)
  --Organizing questions and flow logic
  --Capturing raw response data and process data
--Data Processing Software
  --Data cleaning and verification
  --Recoding and/or deriving new data elements using existing or new categories or coding schemes

Mapping Relationships [Preparing data for the user]
--Information previously provided in the Guide
--Now found in the logical product under Data Relationships
--Physical expression of the linking relationships is found in the physical data product

Record Type
What it tells you
--How many record types you have
--How you can tell what record type it is
--Whether it provides support for a “multi-part” record
--How to identify a “unique” record within a record type
--How you link one record type to another

Multiple Parts / Complex ID
[Figure: four blocks labeled SF DATA]
--UNIQUE within file = LOGRECNO
--IF SUMLEV = 050 then STATE and COUNTY
--IF SUMLEV = 160 then STATE and PLACE
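The complex identifier rule on this slide can be sketched as a small lookup. The field names LOGRECNO, SUMLEV, STATE, COUNTY and PLACE come from the slide; the record values and the fallback to LOGRECNO for other summary levels are assumptions made for illustration.

```python
# Key fields per summary level, as stated on the slide.
KEY_FIELDS_BY_SUMLEV = {
    "050": ("STATE", "COUNTY"),
    "160": ("STATE", "PLACE"),
}


def record_key(record):
    """Build the identifier used to link a record across file parts."""
    fields = KEY_FIELDS_BY_SUMLEV.get(record["SUMLEV"])
    if fields is None:
        # Assumption: other summary levels fall back to the within-file key.
        return ("LOGRECNO", record["LOGRECNO"])
    return (record["SUMLEV"],) + tuple(record[name] for name in fields)


# Hypothetical records, for illustration only.
county_row = {"LOGRECNO": "0000012", "SUMLEV": "050", "STATE": "27", "COUNTY": "053"}
place_row = {"LOGRECNO": "0000345", "SUMLEV": "160", "STATE": "27", "PLACE": "43000"}

print(record_key(county_row))
print(record_key(place_row))
```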

Time for a brief break from DDI

Grouping
--Grouping allows two or more studies to be grouped together
--Grouping can be done
  --by design
  --after the fact
  --virtually

Grouping by design
--Uses inheritance
  --to reuse rather than rewrite
  --to handle a form of multiple inheritance trees, first by hierarchy and then by reference
--Uses comparison to describe changes that take place in a series over time
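A rough sketch of inheritance in a group built by design: a study unit supplies only what differs, and everything else is looked up first in its hierarchy and then in referenced material. The dictionaries, keys and values here are hypothetical and stand in for DDI metadata; this is not how a DDI instance is serialized.

```python
from collections import ChainMap


def effective_metadata(local, *inherited):
    """Merge metadata, giving priority to local values, then to each
    inherited source in the order given (hierarchy first, then references)."""
    return dict(ChainMap(local, *inherited))


# Hypothetical group-level metadata shared by all waves.
group = {"unit_of_analysis": "person", "currency": "DM"}
# Hypothetical referenced resource, consulted last.
referenced = {"occupation_standard": "trend standard"}

wave_a = {}                    # inherits everything
wave_b = {"currency": "Euro"}  # overrides one inherited value

print(effective_metadata(wave_a, group, referenced))
print(effective_metadata(wave_b, group, referenced))
```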

Grouping after the fact
Uses comparison to describe equivalent or similar objects and how they compare
--Questions
--Concepts
--Variables
--Category Schemes
--Coding Schemes, including capturing the process for recodes

All waves inherit person level information from the group
Comparable by design

Satisfaction with life, School Degree
[Figure: waves, each containing Satisfaction with life and School Degree]
All waves contain common topical data on Satisfaction with life and School Degree

Currency fields CHANGE between 2001 and 2002
[Figure: one wave with Currency Euro, one wave with Currency DM]

Question and Data expanded between 1998 and 1999
[Figure: one wave with Size of Company; one wave with Size of Company and Concerns about Euro]

This set of questions is included PERIODICALLY
[Figure: Waves 1997, 2000, 2001 with Computer Usage]

Example DDI3: GROUPING and COMPARISON
Standard Eurobarometer 1970 ff.
--Occupation of respondent
--Changing “wave standard” category schemes
--Variations across countries: translation, question/variable structure
Field questionnaires and codebook excerpt - occupation not yet harmonized

Example DDI3: GROUPING and COMPARISON
TREND: OCCUPATION
R: WHAT IS YOUR OCCUPATION?
Example
--Harmonized systematic 3-digit coding
--Distinguishing major categories and sub-categories
--Avoiding artificial changes in occupational structures over time
Source: Jan W. van Deth: Using Published Survey Data, in: Harkness/Van de Vijver, Mohler: Cross-Cultural Survey Methods

Resource DDI3: GROUPING and COMPARISON
TREND: OCCUPATION
VARIABLE NAME: OCCUP
R: WHAT IS YOUR OCCUPATION?
110. FARMER / FISHERMAN (SKIPPERS) <until EB29>
111. FARMER
112. FISHERMAN
120. <SELF EMPLOYED> PROFESSIONAL (LAWYER, MEDICAL PRACTITIONER, ACCOUNTANT, ARCHITECT, ...)
130. OWNER OF A SHOP, CRAFTSMEN, BUSINESS PROPRIETOR
131. OWNER OF A SHOP, CRAFTSMEN, OTHER SELF EMPLOYED PERSON
132. BUSINESS PROPRIETORS, OWNER (FULL OR PARTNER) OF A COMPANY
210. EMPLOYED PROFESSIONAL (LAWYER, MEDICAL PRACTITIONER, ACCOUNTANT, ARCHITECT, ...)
220. EXECUTIVE, TOP MANAGEMENT, DIRECTOR
     <starting with EB30:> GENERAL MANAGEMENT
     <starting with EB37:> GENERAL MANAGEMENT, DIRECTOR OR TOP MANAGEMENT
230. MIDDLE MANAGEMENT, OTHER MANAGEMENT (DEPARTMENT HEAD, JUNIOR MANAGER, TEACHER, TECHNICIAN)
310. EMPLOYED POSITION, WORKING MAINLY AT A DESK
311. WHITE COLLAR – OFFICE WORKER <until EB29>
312. OTHER OFFICE EMPLOYEES <EB30 to EB36>
320. NON-OFFICE EMPLOYEES, NON MANUAL WORKERS (SERVICE SECTOR, E.G. SHOP ASSISTANT ETC.)
321. EMPLOYED POSITION, NOT AT A DESK BUT TRAVELLING (SALESMAN, DRIVER, ...)
...
540. UNEMPLOYED
     <starting with EB30:> TEMPORARILY NOT WORKING, UNEMPLOYED
998. DK / NA
999. INAP
Resource: Mannheim Eurobarometer Trend File
Codebook excerpt – occupation harmonized
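The wave-dependent labels in this excerpt (for example code 220) could be held as a small history and looked up per Eurobarometer wave. The sketch below uses the three labels and the EB30 and EB37 change points from the codebook; the starting wave of 0 for the first label is a placeholder assumption, since the excerpt does not say when it began.

```python
from bisect import bisect_right

# (first wave the label applies to, label); wave 0 is a placeholder start.
LABEL_HISTORY_220 = [
    (0, "EXECUTIVE, TOP MANAGEMENT, DIRECTOR"),
    (30, "GENERAL MANAGEMENT"),
    (37, "GENERAL MANAGEMENT, DIRECTOR OR TOP MANAGEMENT"),
]


def label_for_wave(history, wave):
    """Return the label in force for the given wave number."""
    starts = [start for start, _ in history]
    index = bisect_right(starts, wave) - 1
    if index < 0:
        raise ValueError(f"no label defined for wave {wave}")
    return history[index][1]


for wave in (29, 30, 36, 37):
    print(wave, label_for_wave(LABEL_HISTORY_220, wave))
```

Keeping the history as data rather than as comments in the label makes it possible to report the label a respondent actually saw in a given wave.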

DDI3: GROUPING and COMPARISON
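Eurobarometer Wave Standard
[Q…. ] OCCUPATION OF SELF:
SELF EMPLOYED
01. FARMERS, FISHERMEN (SKIPPERS)
02. PROFESSIONAL - LAWYERS, ACCOUNTANTS, ETC.
03. BUSINESS - OWNERS OF SHOPS, CRAFTSMEN, PROPRIETORS
EMPLOYED
04. MANUAL WORKER
05. WHITE COLLAR - OFFICE WORKER
06. EXECUTIVE, TOP MANAGEMENT, DIRECTOR
NOT EMPLOYED
07. RETIRED
08. HOUSEWIFE, NOT OTHERWISE EMPLOYED
09. STUDENT, MILITARY SERVICE
10. UNEMPLOYED
00. DK/NA

DE - Eurobarometer ? - 17
[ F… ] Sind Sie persönlich berufstätig? [Are you personally employed?]
Selbständige [Self-employed]
1. Landwirte [Farmers]
2. Freie Berufe (z.B. Arzt, Anwalt) [Liberal professions (e.g. doctor, lawyer)]
3. Kleine, mittlere, größere Selbständige [Small, medium, larger self-employed]
Berufstätige [Employed]
4. Arbeiter / Facharbeiter [Workers / skilled workers]
5. Angestellte / Beamte [Salaried employees / civil servants]
6. Ltd. Angestellte / ltd. Beamte [Senior salaried employees / senior civil servants]
Nicht berufstätige [Not employed]
7. Rentner / Pensionär [Retired / pensioner]
8. Hausfrauen (nicht anderweitig beschäftigt) [Housewives (not otherwise employed)]
9. Schüler, Studenten, Lehrling [Pupils, students, apprentice]
0. Arbeitslos [Unemployed]

DE - Eurobarometer
[F. ] Sind Sie persönlich berufstätig? [Are you personally employed?]
1. Voll berufstätig (einschl. vorübergehend arbeitslos) [Employed full time (incl. temporarily unemployed)]
2. Teilweise berufstätig (einschl. vorübergehend arbeitslos) [Employed part time (incl. temporarily unemployed)]
3. Rentner, Pensionär (früher berufstätig) [Retired, pensioner (formerly employed)]
4. Rentner, Pensionär (früher nicht berufstätig) [Retired, pensioner (formerly not employed)]
5. (In Ausbildung) Lehrling [(In training) apprentice]
6. (In Ausbildung) Schüler, Student [(In education) pupil, student]
7. Nicht berufstätig, aber früher berufstätig gewesen [Not employed, but formerly employed]
8. Noch nie berufstätig gewesen [Never been employed]

„Internal“ standard: Mannheim Trend File
„External“ standards: ESOMAR, EB-CH / OFS, ISCO-88

DE - EB
[F. ] Welchen Beruf üben Sie zur Zeit aus, bzw haben Sie zuletzt ausgeübt? [What is your current occupation, or what was your last occupation?]
11. Einfache Angestellte [Lower-level salaried employees]
12. Mittlere Angestellte [Mid-level salaried employees]
13. Qualifizierte Angestellte [Qualified salaried employees]
14. Leitende Angestellte [Senior salaried employees]
15. Ungelernte Arbeiter [Unskilled workers]
16. Angelernte Arbeiter [Semi-skilled workers]
17. Einfache Facharbeiter [Basic skilled workers]
18. Qualifizierte Facharbeiter [Qualified skilled workers]
21. Kleinere Selbständige [Smaller self-employed]
22. Mittlere Selbständige [Medium self-employed]
23. Größere Selbständige [Larger self-employed]
24. Freie Berufe (z.B. Arzt, Anwalt) [Liberal professions (e.g. doctor, lawyer)]
25. Beamte einfacher Dienst [Civil servants, lower service]
26. Beamte mittlerer Dienst [Civil servants, intermediate service]
27. Beamte gehobener Dienst [Civil servants, upper-intermediate service]
28. Beamte höherer Dienst [Civil servants, higher service]
31. Selbständige Landwirte - kleine (unter 5 ha) [Self-employed farmers, small (under 5 ha)]
32. Selbständige Landwirte - mittlere (5- unter 20 ha) [Self-employed farmers, medium (5 to under 20 ha)]
32. Selbständige Landwirte - große (20 ha +) [Self-employed farmers, large (20 ha or more)]

Wave standard for a certain period (“de facto” or in master questionnaire) and corresponding items in the German field questionnaires with changes over time regarding categories and question structure …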

DDI3: GROUPING and COMPARISON
[Figure: group, subgroup and study unit structure for the Eurobarometer occupation example. Labels shown:]
--Group: standards, inheritance (EB OCCUPATION Trend Standard; DataCollection, LogicalProduct)
--Resource: standards, comparison to instances
--Subgroup: overwriting, additions, inheritance
--Subgroup: no overwriting, comparison to instances
--DE 3-16, DE 17, DE 18-22, DE 23-29, DE 30-36, DE 37 ff., DE 18-23, DE 24-29
--Study Unit: overwriting, additions (EB 30, EB …, EB 36, each with DataCollection and LogicalProduct)
--EBCH, OFS/ISCO-88
All information about inconsistencies / changes and (possible) harmonisation strategies has to be documented and administered …

Versioning and Maintenance
There are three classes of objects:
--Identifiable (has ID)
--Versionable (has version and ID)
--Maintainable (has agency, version, and ID)
Very often, identifiable items such as Codes and Variables are maintained in parent schemes
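The three classes can be pictured as a small type hierarchy in which each level adds one piece of identification. This is a sketch for orientation only; the class and attribute names are chosen here and are not the names used in the DDI 3.0 schemas.

```python
from dataclasses import dataclass


@dataclass
class Identifiable:
    """Has an ID; agency and version come from its parents."""
    id: str


@dataclass
class Versionable(Identifiable):
    """Has an ID and a version."""
    version: str


@dataclass
class Maintainable(Versionable):
    """Has an ID, a version, and a maintenance agency."""
    agency: str


# Example values loosely based on the identifiers used later in the slides.
code = Identifiable(id="C1")  # "C1" is a hypothetical code ID
scheme = Maintainable(id="VScheme_2", version="1.1", agency="pop.umn.edu")

print(code)
print(scheme)
```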

Rationale
Because several organizations are involved in the creation of a set of metadata throughout the lifecycle flow:
--Rules for maintenance, versioning, and identification must be universal
--Reference to other organizations’ metadata is necessary for re-use – and very common

Maintenance Rules
--A maintenance agency is identified by its domain name (as for its website and ...)
--Maintenance agencies own the objects they maintain
  --Only they are allowed to change or version the objects
--Other organizations may reference external items in their own schemes, but may not change those items
  --You can make a copy which you maintain, but once you do that, you own it!
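The ownership rule can be sketched as a guard on updates plus an explicit copy operation. The `CodeScheme` type, the `revise` and `copy_for` helpers, the scheme ID "CS1" and the agency "example.org" are all hypothetical, and resetting a copy to version 1.0 is an assumption rather than a DDI rule.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CodeScheme:
    agency: str   # maintenance agency, identified by its domain name
    id: str
    version: str
    codes: tuple


def revise(scheme, acting_agency, codes, new_version):
    """Return a new version of the scheme; only the owning agency may do this."""
    if acting_agency != scheme.agency:
        raise PermissionError(f"{acting_agency} does not maintain {scheme.id}")
    return replace(scheme, codes=tuple(codes), version=new_version)


def copy_for(scheme, new_agency):
    """Copy a scheme so that new_agency maintains (and owns) the copy."""
    # Assumption: the copy restarts at version 1.0 under its new owner.
    return replace(scheme, agency=new_agency, version="1.0")


original = CodeScheme("pop.umn.edu", "CS1", "1.0", ("1", "2"))
mine = copy_for(original, "example.org")
revised = revise(mine, "example.org", ("1", "2", "3"), "1.1")
print(original)
print(revised)
```

The original stays untouched under its own agency; only the copy, now owned by the second agency, is revised.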

Versioning Rules
--If an object changes in any way, its version changes
--This will change the version of any containing maintainable object
--Typically, objects grow and are versioned as they move through the lifecycle
--Versionable objects inherit their agency from the maintainable scheme they live in
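A minimal sketch of how a change to one object also changes the version of its containing maintainable. The classes and the policy of incrementing the minor number are assumptions made for illustration; the slides only say that the version changes.

```python
from dataclasses import dataclass, field


def bump(version):
    """Increment the minor part of a 'major.minor' version string."""
    major, minor = version.split(".")
    return f"{major}.{int(minor) + 1}"


@dataclass
class Variable:
    id: str
    label: str
    version: str = "1.0"


@dataclass
class VariableScheme:
    id: str
    agency: str
    version: str = "1.0"
    variables: dict = field(default_factory=dict)

    def add(self, variable):
        self.variables[variable.id] = variable

    def relabel(self, variable_id, new_label):
        """Change a label; version both the variable and this scheme."""
        variable = self.variables[variable_id]
        if variable.label == new_label:
            return  # nothing changed, so no new version
        variable.label = new_label
        variable.version = bump(variable.version)
        self.version = bump(self.version)


scheme = VariableScheme(id="VScheme_2", agency="pop.umn.edu")
scheme.add(Variable(id="V1", label="Totl population"))  # hypothetical label
scheme.relabel("V1", "Total population")                # a spelling correction
print(scheme.variables["V1"].version, scheme.version)
```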

Identifiable Rules
--Identifiers are assigned to each identifiable object, and are unique within their maintained parent scheme
--Identifiable objects inherit their version from their containing versionable parent (if any)
--Identifiable objects inherit their maintaining agency from the maintainable object they live in

Referencing
When referencing an object, you must provide:
--The maintenance agency
--The identifier
--The version
Often, these are inherited from a maintained scheme
--This is part of their identification
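A reference that leaves out the agency or version can be completed from the maintained scheme it sits in. The sketch below shows that resolution step; `SchemeContext`, `Reference` and `resolve` are names invented here, and "example.org" is a hypothetical external agency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemeContext:
    """Identification of the maintained scheme a reference appears in."""
    agency: str
    scheme_id: str
    version: str


@dataclass(frozen=True)
class Reference:
    id: str
    agency: Optional[str] = None   # None means "inherit from the scheme"
    version: Optional[str] = None  # None means "inherit from the scheme"


def resolve(ref, context):
    """Return the full (agency, id, version) triple for a reference."""
    return (ref.agency or context.agency, ref.id, ref.version or context.version)


context = SchemeContext(agency="pop.umn.edu", scheme_id="VScheme_2", version="1.1")
print(resolve(Reference(id="V1"), context))
print(resolve(Reference(id="V9", agency="example.org", version="2.0"), context))
```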

Identification
--Identification can be by URN or a series of fields
--The fields make up the parts of the URN and can be used to compose it
--A number of fields can inherit information from a maintainable parent

Parts of the Identification Series
Identifiable Element Identification    Variable Identification
ID                                     V1
Identifying Agency                     pop.umn.edu
Version                                1.1 [default is 1.0]
Version Date                           (not shown)
Version Responsibility                 Wendy Thomas
Version Rationale                      Spelling correction

The URN
urn:ddi:3_0:VariableScheme:pop.umn.edu:VScheme_2:1_1:Variable:V1:1_1
--Declares that it's a DDI version 3.0 element
--Tells the type of maintainable object being referenced
--Gives the identifying agency of the scheme
--Tells the type of object and its unique ID
Note that this includes both a maintainable ID and an element ID, as uniqueness must be maintained within a maintainable object rather than within the agency
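Composing and splitting such a URN from its fields can be sketched as follows. The field order is read off the example URN on this slide (urn:ddi:3_0:VariableScheme:pop.umn.edu:VScheme_2:1_1:Variable:V1:1_1); the function names and dictionary keys are invented here, and writing versions with an underscore in place of the dot is inferred from that one example rather than from the specification.

```python
FIELDS = (
    "ddi_version", "maintainable_type", "agency", "maintainable_id",
    "maintainable_version", "element_type", "element_id", "element_version",
)
VERSION_FIELDS = {"ddi_version", "maintainable_version", "element_version"}


def build_urn(parts):
    """Compose a URN string from a dict holding the eight fields above."""
    values = []
    for name in FIELDS:
        value = parts[name]
        if name in VERSION_FIELDS:
            value = value.replace(".", "_")  # 1.1 is written as 1_1
        values.append(value)
    return ":".join(["urn", "ddi"] + values)


def parse_urn(urn):
    """Split a URN of the form shown on the slide back into its fields."""
    pieces = urn.split(":")
    if len(pieces) != len(FIELDS) + 2 or pieces[:2] != ["urn", "ddi"]:
        raise ValueError(f"not a recognized DDI URN: {urn}")
    parts = dict(zip(FIELDS, pieces[2:]))
    for name in VERSION_FIELDS:
        parts[name] = parts[name].replace("_", ".")
    return parts


example = {
    "ddi_version": "3.0", "maintainable_type": "VariableScheme",
    "agency": "pop.umn.edu", "maintainable_id": "VScheme_2",
    "maintainable_version": "1.1", "element_type": "Variable",
    "element_id": "V1", "element_version": "1.1",
}
urn = build_urn(example)
print(urn)
print(parse_urn(urn))
```

Only the version fields are converted between dots and underscores, so an identifier such as VScheme_2 keeps its underscore in both directions.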

DDI: What Is It Good For?
--There are some obvious differences between DDI 2.* and DDI 3.*
  --Ability to capture comparative information
  --Ability to re-use and share metadata
  --Ability to mark up data in XML
  --Greater ability to facilitate data discovery and relationships
--It is designed to capture lifecycle information as it occurs, and in a way that is useful during production
--It is machine-actionable – not just documentary
--All of this comes with added complexity
--It also allows for greater interoperable support between organizations
--Here are a few examples…

Scenario 1: Upstream Metadata Capture
Because there is support throughout the lifecycle, you can capture the metadata as it occurs
--It is re-useable throughout the lifecycle
--It is versionable as it is modified across the lifecycle
--It supports production at each stage of the lifecycle
--It moves into and out of the software tools used at each stage

Scenario 2: Reuse of Metadata
--You can reuse many types of metadata, benefiting from the work of others
  --Concepts
  --Variables
  --Categories and codes
  --Geography
  --Questions
--Promotes interoperability and standardization across organizations
--Can capture (and re-use) common cross-walks

Scenario 3: Virtual Data
--When researchers use data, they often combine variables from several sources
--This can be viewed as a “virtual” data set
--The re-coding and harmonization process can be captured as useful metadata
--The researcher’s data set can be re-created from this metadata
--Comparability of data from several sources can be expressed
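Capturing a recode as metadata means the step can be replayed later to re-create the derived data. The sketch below stores a recode as plain data and applies it to rows. The pairing of wave standard codes with trend file codes is illustrative only (matched by similar labels in the occupation example), not the actual Mannheim harmonization, and the source variable name OCCUP_WAVE is hypothetical.

```python
# Recode description held as data rather than buried in processing code.
RECODE = {
    "source_variable": "OCCUP_WAVE",  # hypothetical name
    "target_variable": "OCCUP",
    "map": {"01": "110", "05": "311", "06": "220", "00": "998"},  # illustrative
}


def apply_recode(rows, spec):
    """Return new rows with the target variable derived from the source."""
    result = []
    for row in rows:
        derived = dict(row)
        # Codes without a mapping are left as None rather than guessed.
        derived[spec["target_variable"]] = spec["map"].get(row[spec["source_variable"]])
        result.append(derived)
    return result


rows = [{"OCCUP_WAVE": "01"}, {"OCCUP_WAVE": "06"}, {"OCCUP_WAVE": "04"}]
print(apply_recode(rows, RECODE))
```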

Scenario 4: Mining the Archive
With metadata about relationships and structural similarities
--You can automatically identify potentially comparable data sets
--You can navigate the archive’s contents at a high level
--You have much better detail at a low level across divergent data sets
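One simple way to flag potentially comparable data sets is to look for shared concepts in their metadata. The sketch below does this with set intersection; the study names and concept labels are hypothetical, and a real archive would likely use richer comparison metadata than a flat set of concepts.

```python
from itertools import combinations


def comparable_pairs(concepts_by_dataset, min_shared=1):
    """List pairs of data sets that share at least min_shared concepts."""
    pairs = []
    for first, second in combinations(sorted(concepts_by_dataset), 2):
        shared = concepts_by_dataset[first] & concepts_by_dataset[second]
        if len(shared) >= min_shared:
            pairs.append((first, second, sorted(shared)))
    return pairs


# Hypothetical archive contents.
archive = {
    "study_a": {"occupation", "currency"},
    "study_b": {"occupation"},
    "study_c": {"computer usage"},
}
print(comparable_pairs(archive))
```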

Other Scenarios?
