Toward a Global Infrastructure for Data and Metadata: The Open Data Foundation Arofan Gregory Executive Manager The Open Data Foundation
Something Really Amazing Spaceships aren't that amazing… Aliens aren't that amazing… Mobile telephones aren't that amazing… Tasers aren't that amazing… These devices have access to the complete set of human (well, Federation) knowledge, via the ship's computer. That's AMAZING! An Epic Feat of Data Standardization!
A Big Idea It might seem too outrageous to imagine that every data source could be accessible and usable via a global network, but… –Consider all the domain grids which are emerging –Consider the number of modern technologies for leveraging data across networks –Consider the tools we have for solving problems of semantic interoperability Maybe Star Trek was only a few decades ahead of its time!
Something Missing… Technology alone cannot solve this problem For centuries, scientists, librarians, and archivists have worked to perfect taxonomies and classifications for organizing and accessing human knowledge –Technologists can't replace the disciplines which have evolved from this work with technology alone –They can only automate it Having an ontology doesn't mean you have an agreed, tried, and workable standard classification system! –A thousand little ontologies still produce chaos!
Why Now? The idea of a global data infrastructure is practical today because… –We have good, standards-based, networked technology –We have a highly sophisticated population of archivists and librarians who understand the challenges of large-scale classification, for all types of media –We have an emerging culture of data producers and users who are beginning to understand the potential offered by modern technology
The Open Data Movement From Wikipedia: Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control. It has a similar ethos to a number of other "Open" movements and communities such as Open Source and Open Access.
The Open Data Foundation (ODaF) Although we respect this traditional goal of the Open Data movement, we feel that the technology issues, as opposed to the legal ones, have a different focus: –Much public data is inaccessible or unusable –Confidential data is less accessible than it could be –The collection and publication of some critical data is lacking, notably in the Developing World It is not enough to put the rights to data into the public domain – it must also be practically accessible to all potential users
What Do We Mean by Data? Official statistics collected by government agencies and international organizations –Usually aggregates and time-series data –Covers a huge range of social, scientific, and economic topics Numeric research data supporting social sciences and hard sciences –Often lower-level microdata –May be gathered by survey or sourced from registers Qualitative data used in social sciences research –Not research papers, but source data (e.g., interviews)
ODaF's Mission To bring together individuals from the statistics community, the research community, and the technology standards community To promote the creation of a global infrastructure for data and metadata by providing open-source tools and supporting the adoption of a coordinated set of open technology standards To promote the creation and use of knowledge, and fact-based decision-making, through improved access to data and metadata
ODaF - Timeline The idea started at IASSIST 2006 in Edinburgh Incorporated in 2006 as a US scientific non-profit First face-to-face meeting in Washington DC in December 2006 at the National Opinion Research Center (NORC) September 2007: next face-to-face meeting in St. Helena, California Next face-to-face meeting: NORC in DC, December 2007, followed by a European meeting (UK, Netherlands, or Germany) in early 2008 NOTE: We are a virtual organization – we don't rely on face-to-face meetings for conducting work (Thanks, Skype!)
ODaF - Directors –Bob Glushko – head of the UC Berkeley Center for Document Engineering and member of OASIS Board of Directors –Julia Lane – Vice President, NORC and world-class expert in data confidentiality issues –Ernie Boyko – former President of IASSIST –Rune Gloersen – head of IT at Statistics Norway
ODaF - Executive Managers Arofan Gregory – background in SGML/XML, technology standards (notably ebXML, UBL, UN/CEFACT, ISO TC154, DDI, and SDMX) Pascal Heus - lead developer for World Bank and International Household Survey Network, much experience with field-work in Africa, DDI implementor Chris Nelson – veteran OMGer (CWM), worked with many technology standards (UN/EDIFACT, GESMES, ebXML, SDMX, DDI), consummate UML modeler Jostein Ryssevik – former CEO of Nesstar North America, now with Ideas2Evidence, associated with Gallup Europe; longtime DDI implementor
ODaF - Advisors Sandra Cannon - Board of Governors of the Federal Reserve System Gilles Collette - Visual Communications, Pan American Health Organization (WHO) Daniel Gillman - US Bureau of Labor Statistics Eduardo Gutentag - Chair, OASIS Board of Directors Paul Johanis - Statistics Canada Graeme Oakley - Australian Bureau of Statistics Dr. Andrew Nelson - Joint Research Centre of the European Commission
ODaF - Advisors (cont.) Ken Miller - UK Data Archive / Economic and Social Data Service Duane Nickull - Chair, OASIS SOA Reference Architecture TC Juraj Riecan - United Nations Economic Commission for Europe (UNECE) Gerard Salou - European Central Bank Professor Bo Sundgren, Ph.D - Statistics Sweden Wendy Thomas - Minnesota Population Center, University of Minnesota Wendy Watkins - Data Centre Coordinator, Maps, Data and Government Information Centre, Carleton University Library
ODaF - Organization We are project-oriented: –Any member can participate in projects May be paid consultants for specific work, or volunteers –Project proposal is put before Directors by Management team in consultation with Board of Advisors for approval –Work is conducted by specified project team, using specified resources –All Directors, Managers, and Advisors are volunteers Work is focused on coordination of projects, with resources coming from other participating organizations
The Problem Space The flows of data can be seen as forming a type of supply chain –Collected data are aggregated and reported/disseminated to other organizations –The points where data are exchanged can be problematic: Loss of metadata No automated integration into receiving systems Time- and resource-intensive This exchange of data and metadata must be managed in an efficient, standard fashion if we are to build a global infrastructure
[Diagram of the information chain. Labels: Individual Households; Banks, Corporates (transactions, accounts); National Statistical Organisations in 180+ countries (accounts, statistics); Regional Organisations (accounts, statistics); International Organisations; Internet, Search, Navigation across www.x.org, www.y.org, www.z.org, and www.hub.org]
Data Lifecycle Model Within each level of the information chain, we see a process: –Data sourcing or collection –Data processing (re-coding, harmonization, aggregation) –Data dissemination and archiving –Data reporting and re-purposing Throughout this cycle, each step generates important metadata which can be captured to provide better downstream processing and understanding of the data Today, this metadata is often lost –Between steps of the lifecycle –When the final data product is exchanged in the information chain
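The lifecycle idea above can be sketched in a few lines of Python. This is a minimal illustration, not an ODaF tool or any standard's API: the `Dataset` class, the `run_step` helper, and the sample survey rows are all hypothetical. The point is only that each step appends a metadata record which then travels with the data to the next step.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data product plus the metadata accumulated along the lifecycle."""
    rows: list
    metadata: list = field(default_factory=list)


def run_step(dataset, step_name, transform, note):
    """Apply one lifecycle step and keep a record of it with the data."""
    new_rows = transform(dataset.rows)
    record = {
        "step": step_name,
        "note": note,
        "rows_in": len(dataset.rows),
        "rows_out": len(new_rows),
    }
    # Return a new Dataset so earlier records are carried forward, not lost.
    return Dataset(rows=new_rows, metadata=dataset.metadata + [record])


def aggregate(rows):
    """Sum household size per region (a stand-in for real aggregation)."""
    totals = {}
    for region, size in rows:
        totals[region] = totals.get(region, 0) + size
    return sorted(totals.items())


# Collection: three hypothetical survey responses (region, household size).
collected = Dataset(rows=[("north", 3), ("north", 5), ("south", 2)])
collected = run_step(collected, "collection", lambda rows: list(rows),
                     "household survey, hypothetical sample")

# Processing: aggregation is recorded next to the collection record.
aggregated = run_step(collected, "processing", aggregate,
                      "summed household size by region")

for record in aggregated.metadata:
    print(record["step"], "-", record["note"])
```

A downstream recipient of `aggregated` can see both how the data was collected and how it was processed, which is exactly the information the slide says is usually lost at exchange points.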
An Observation on Organizations Governmental, supra-governmental, and research organizations which produce data have as a primary mission the collection of data –To support policy making –To support research –To support regulatory activities They do not have a primary mission to focus on the exchange of data with other organizations –This is often perceived as a burden rather than a part of the primary mission of the organization They are often not well-skilled in the latest technology for data exchange and interoperability Standards organizations tend to be too busy promoting their own standards to be worried about how users might combine them with other standards in implementations
Issues Issues with public data: –Public data which is not released: "Users won't understand it" - Too little metadata! –Public data which is unusable: formats are bad, too little metadata about formats, terminology, methodology, coding, and concepts –Public data which cannot be accessed because its location/existence is not known –Public data which loses value because it cannot be published and accessed in a timely manner
Issues (cont.) Issues with confidential data: –Public data sets derived from confidential data have been damaged by anonymization –Confidential data which are not seen because access produces unacceptable disclosure risk There are secure Research Data Centers for allowing access to confidential data to qualified researchers –These are not as accessible or as open as they could be, due to their physical nature and the fact that they generally are not in communication with each other –Better metadata management and shared metadata leads to a better understanding of disclosure risk, and thus improved access for researchers
Note on Data Confidentiality You might think proponents of Open Data would disapprove of confidential data –Response rates are falling for all types of survey data collection due to fears of disclosure –There are many new ways of collecting data about individuals (RFID chips, security cameras, cell phones, etc.) –The standards for data confidentiality are there for a good reason – to protect individuals! We believe that confidential data should be as open as possible and not more!
Issues (cont.) Issues with data in the Developing World: –Absent data due to inefficient or nonexistent data collection/publication –Unsustainable data collection/publication produces insufficient continuity of data Once educated, IT workers get jobs in Europe and America Funding is typically not on-going, but only for a limited period The vast majority of the world's population is in the Developing World, and the trend is increasing –To understand our world and make good policy, we must support sustainable data collection and publication about this huge segment of the population!
How Can We Solve These Problems? Many of these issues can be solved with modern technology –Better documentation using standard metadata formats –Better mechanisms for data discovery and access between organizations of all types –Better mechanisms for managing semantic interoperability –Free or inexpensive tools for metadata capture and data/metadata exchange –Improved mechanisms for sustainable collection and publication of data in the Developing World
ODaF's Vision A network of standard, federated registries provides the ability to discover data and metadata globally Standard data and metadata formats and models provide the basis for automated use and integration between applications Standard semantic registries and mappings to standard classifications/ontologies allow for semantic interoperability All of these standards would be coordinated to work together predictably in an open architecture Domains are self-governing – each has its own registries, classifications, etc. There must be minimum governance at the center for operation of the entire network. –Interoperability through mapping to the standards-based open architecture
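One way to picture federated discovery is the Python sketch below. It is an assumption-laden illustration, not a description of any real registry interface: `Entry`, `DomainRegistry`, `federated_search`, the concept identifiers, and the example.org locations are all hypothetical. Each domain keeps its own registry; the only thing shared is the concept identifier used to look things up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    title: str
    concept: str    # identifier from a shared classification (hypothetical)
    location: str   # where the data or metadata can be retrieved


class DomainRegistry:
    """A self-governing registry owned by one domain."""

    def __init__(self, domain, entries):
        self.domain = domain
        self._entries = list(entries)

    def find(self, concept):
        return [e for e in self._entries if e.concept == concept]


def federated_search(registries, concept):
    """Ask every registry for a concept and merge the answers."""
    results = []
    for registry in registries:
        for entry in registry.find(concept):
            results.append((registry.domain, entry))
    return results


statistics = DomainRegistry("official-statistics", [
    Entry("Consumer prices, monthly", "CPI", "https://stats.example.org/cpi"),
])
research = DomainRegistry("social-science", [
    Entry("Household expenditure survey", "CPI", "https://archive.example.org/hes"),
    Entry("Labour force survey", "EMPLOYMENT", "https://archive.example.org/lfs"),
])

hits = federated_search([statistics, research], "CPI")
for domain, entry in hits:
    print(domain, "|", entry.title, "|", entry.location)
```

The central function needs no knowledge of how each domain stores its entries, which mirrors the slide's point about minimum governance at the center.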
Which Standards? ISO 17369 Statistical Data and Metadata Exchange (SDMX) Data Documentation Initiative (DDI) ISO/IEC 11179 Metadata Registries ISO 19115 Digital Geographic Data Metadata Encoding and Transmission Standard (METS) Extensible Business Reporting Language (XBRL) Many others (SOA, ebXML, Web Services, Semantic Web, Dublin Core)
ISO 17369 SDMX Produced by official statistics organizations (BIS, ECB, Eurostat, IMF, OECD, World Bank, UN/SD) Now available as a 2.0 version –Supports all aggregate data & time-series –Supports all types of metadata (structural & reference metadata) –Provides standard registry interfaces for data sourcing and exchange (not specific to SDMX formats) Based on a formal meta-model (similar to OMG's Common Warehouse Metamodel, but more focused) Data and metadata formats and classifications are completely configurable Also provides recommendations for concepts, codes, and classifications for official statistics
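The role of structural metadata can be illustrated with a much-simplified Python sketch. It does not use SDMX formats or any SDMX library; the `STRUCTURE` dictionary, the dimension names, the code lists, and `validate_key` are illustrative assumptions. The idea shown is that a configurable structure definition (dimensions plus code lists) lets a receiving system check a series key automatically.

```python
# A hypothetical structure definition: each dimension has its own code list.
STRUCTURE = {
    "FREQ": {"A": "Annual", "Q": "Quarterly", "M": "Monthly"},
    "REF_AREA": {"NO": "Norway", "CA": "Canada"},
    "INDICATOR": {"CPI": "Consumer price index"},
}


def validate_key(structure, key):
    """Return a list of problems; an empty list means the key is valid."""
    problems = []
    for dimension, codes in structure.items():
        if dimension not in key:
            problems.append(f"missing dimension {dimension}")
        elif key[dimension] not in codes:
            problems.append(f"unknown code {key[dimension]!r} for {dimension}")
    for dimension in key:
        if dimension not in structure:
            problems.append(f"unexpected dimension {dimension}")
    return problems


good = {"FREQ": "M", "REF_AREA": "NO", "INDICATOR": "CPI"}
bad = {"FREQ": "W", "REF_AREA": "NO"}

print(validate_key(STRUCTURE, good))
print(validate_key(STRUCTURE, bad))
```

Because the structure is data rather than code, a different domain can supply its own dimensions and code lists without changing the validator.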
Data Documentation Initiative (DDI) Produced by a consortium of members (data archives and libraries, national statistical organizations, universities, etc.) Now in 3.0 candidate version which supports full data lifecycle (release Q1 2008) Fine-grained metadata for describing: –Data collection (surveys, registers, etc.) –Data processing (for recodes, harmonization, data comparison) –Data archiving and dissemination –Data can be stored inline or in native file formats –Supports microdata and n-dimensional cubes Aligned with SDMX, ISO/IEC 11179, METS, ISO 19115, and Dublin Core
ISO/IEC 11179 Metadata Registries Model for managing semantics of a data dictionary and the lifecycle of concepts/terms There is a separate ISO specification under development for providing bindings in XML, C, and other languages In widespread use in many other standards, as well as for terminology management within large organizations
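A rough Python sketch of the registry model follows. It is a simplification under stated assumptions: the class names follow the general ISO/IEC 11179 idea that a data element pairs a concept (an object class plus a property) with a value domain, but the field names, the example values, and the three status strings are illustrative and are not taken from the standard's text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str    # the thing described, e.g. "Person"
    property_name: str   # the characteristic, e.g. "Marital status"


@dataclass(frozen=True)
class ValueDomain:
    name: str
    permitted_values: tuple


@dataclass(frozen=True)
class DataElement:
    concept: DataElementConcept
    value_domain: ValueDomain
    status: str = "candidate"   # simplified lifecycle status (assumption)

    def accepts(self, value):
        """True if the value belongs to this element's value domain."""
        return value in self.value_domain.permitted_values


marital_status = DataElement(
    concept=DataElementConcept("Person", "Marital status"),
    value_domain=ValueDomain("Marital status codes", ("S", "M", "D", "W")),
)

# Lifecycle management: promoting the element creates a new registered
# state while the original record is left unchanged.
standardised = replace(marital_status, status="standard")

print(marital_status.status, "->", standardised.status)
print(standardised.accepts("M"), standardised.accepts("X"))
```

Separating the concept from the value domain means two organizations can agree on what "Person / Marital status" means even while using different code sets.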
ISO 19115 Digital Geographic Data Provides the standard metadata model for describing geographies Implemented in several XML standards, including DDI (there is also a standard ISO XML) Well-accepted within the technology community and among communities of use (geographers, etc.)
METS A packaging standard for digital libraries/archives –Pulls together associated sets of files and establishes their relationship to one another Can carry metadata payloads in their native XML namespaces as metadata sections Cooperatively developed with DDI –METS left the description of data to DDI –DDI supports METS for archival packaging
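The packaging idea can be sketched with Python's standard `xml.etree.ElementTree`. This is a minimal illustration that has not been validated against the METS schema: the `build_package` helper, the identifiers, and the file names are hypothetical, and a Dublin Core title stands in for a richer payload such as DDI. It shows a descriptive metadata section carrying a payload in its own namespace, a file section, and a structural map tying the two together.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)
ET.register_namespace("dc", DC)


def q(namespace, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{namespace}}}{tag}"


def build_package(title, files):
    """Return a small METS-style package for a title and a list of files."""
    root = ET.Element(q(METS, "mets"))

    # Descriptive metadata: the payload keeps its native (Dublin Core) namespace.
    dmd = ET.SubElement(root, q(METS, "dmdSec"), {"ID": "dmd1"})
    wrap = ET.SubElement(dmd, q(METS, "mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, q(METS, "xmlData"))
    ET.SubElement(xml_data, q(DC, "title")).text = title

    # File inventory and the structural map that relates files to metadata.
    file_sec = ET.SubElement(root, q(METS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS, "fileGrp"))
    struct = ET.SubElement(root, q(METS, "structMap"))
    div = ET.SubElement(struct, q(METS, "div"), {"DMDID": "dmd1"})

    for index, href in enumerate(files, start=1):
        file_id = f"file{index}"
        file_el = ET.SubElement(group, q(METS, "file"), {"ID": file_id})
        ET.SubElement(file_el, q(METS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK, "href"): href})
        ET.SubElement(div, q(METS, "fptr"), {"FILEID": file_id})
    return root


package = build_package("Hypothetical household survey",
                        ["data.csv", "codebook.xml"])
print(ET.tostring(package, encoding="unicode"))
```

Swapping the Dublin Core title for a DDI instance would only change what goes inside the wrapped XML section; the package structure around it stays the same.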
XBRL XML standard from the accounting world for describing business reports Widely used by banking supervisory organizations –Major source of financial statistics Well marketed and widely supported Ongoing alignment project with SDMX
ODaF Vision - Standards [Diagram. Box labels: Federated Registries (based on SDMX, ebXML, web services); Aggregated Data/Metadata (SDMX); XBRL Business Reports; DDI Microdata Sets; ISO 19115 Geographies; Dublin Core Citations; Standard classifications; Semantic definitions, organized using ISO 11179; METS Packaging. Connector labels: "Used in", "registered", "References to source data"]
ODaF Activities We are early in our efforts to create such an infrastructure –To establish a sufficient set of well-aligned standards –To build open-source tools to support the use of these standards –To otherwise support the adoption and use of standard models, formats, and registries
ODaF Projects Standards Alignment Project: on-going effort to establish an agreed mapping between the mentioned standards SDMX Registry Hosting: Host SDMX registries on our own servers for those wishing to do prototype implementations DDI Development Support: provide hosting and infrastructure to support the use and development of DDI 3.0 DDI Foundation Tools Program: providing technical coordination and infrastructure for a multi-institution effort to build an Eclipse-based open-source toolkit for working with DDI 3.0, including transforms to/from SAS, SPSS, and STATA SDMX Browser: Developing an open-source tool (using Adobe Flex) for collecting, updating, and viewing statistical data in SDMX format – working in informal collaboration with ECB and OECD
ODaF Projects (cont.) DeXtris Browser: beta end-user tool for viewing and searching DDI 1/2.* and 3.0 metadata files – supports version transformations UKDA QuDEX Draft Standard: Working as technical support for UKDA in their development of a standard for qualitative metadata (may become part of DDI) Canadian RDC Network: Providing technical advice to the Canadian RDC network on metadata management and implementation in support of DDI 3.0. NORC Virtual Data Enclave: Working to help develop and deploy the first virtual RDC in the US with data from NIST and others Also involved in proposals to build a European virtual RDC
ODaF Projects (cont.) Have contributed to the creation of training materials and online support for DDI 3.0, for general use White papers: DDI & SDMX (a comparison), guidelines for open-source tools development, others Member, DDI Alliance Sponsored IASSIST 2007 in Montreal (planned also for IASSIST 2008 in Palo Alto, CA)
ODaF - Where We Are Today New organization, lots of interest and support thus far Interesting projects are emerging, some early deliverables have been finished Looking for participation from interested, serious individuals Still at the stage of supporting and promoting a coordinated set of standards