Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Type Registries #2 Co-Chairs: RDA Chairs’ Mtg Gothenburg

Similar presentations


Presentation on theme: "Data Type Registries #2 Co-Chairs: RDA Chairs’ Mtg Gothenburg"— Presentation transcript:

1 Data Type Registries #2 Co-Chairs: RDA Chairs’ Mtg Gothenburg
June 2017 Co-Chairs: Larry Lannom - CNRI Tobias Weigel – DKRZ

2 Problem: Implicit Assumptions in Data
Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data How do we do this now? For documents – formats are enough, e.g., PDF, and then the document explains itself to humans This doesn’t work well with data – numbers are not self-explanatory What does the number 7 mean in cell B27? Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc. Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in its creation

3 Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries
Evaluate and identify a few assumptions in data that can be codified and shared in order to… Produce a functioning Registry system that can easily be evaluated by organizations before adoption Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization Supports several Type record dissemination variations Design for allowing federation between multiple Registry instances The emphasis is not on Identifying every possible assumption and data characteristic applicable for all domains Technology

4 What is a Data Type? A unique and resolvable identifier
Which resolves to characterization of structures, conventions, semantics, and representations of data Serves as a shortcut for humans and machines to understand and process data File formats and mime types have solved the ‘representation’ problem at a ‘unit’ level Examples of problems we aim to solve with data types: It is a number in cell A3, but is it temperature? If so, in Celsius? It is a dataset consisting of location, temperature, and time, but what variable names should I look for? Is it all packaged as CSV or NetCDF? And as a single unit or a collection of units? Type record structure will continue to evolve – not finished, but functioning

5 What is a Data Type Registry?
A low-level infrastructure with wide applicability to record and disseminate type records Not an immediate ROI application Assigns unique and resolvable identifiers to type records Enforces and validates common data model & expression for interoperation between multiple instances of Registries API for machine consumption UI for human use

6 Federated Set of Type Registries
Process Use Case 3 Users 2 4 Federated Set of Type Registries 1 Typed Data ID Type Payload 10100 11010 101…. Visualization I Agree Terms:… Rights Services Data Processing Data Set Dissemination Client (process or people) encounters unknown data type. 1 Resolved to Type Registry. 2 Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally 3 typed data or reference to typed data can be sent to service provider. 4

7 Discovery Use Case 2 Users 1 3 4
Federated Set of Type Registries 2 Users 1 3 4 Repositories and Metadata Registries ID Type Payload Clients (process or people) look for types that match their criteria for data. For example, clients may look for types that match certain criteria, e.g., combine location, temperature, and date-time stamp. 1 Type Registry returns matching types. 2 Clients look up in repositories and metadata registries for data sets matching those types. 3 Appropriate typed data is returned. 4

8 Type Registry History Handle Types – 0.Type/SomeType
Good idea, limited applicability Profiles – ‘type’ the whole set of handle/type/value triples. No traction Sloan Grant: Generic Registry system using Type Registry as a use case NSF Grant: Included support for Type Registry Research Data Alliance (RDA) Data Type Registries Working Group One of first two WGs approved at Plenary 1 – March 2013 International representation, > 50 members Co-chairs Lannom (CNRI), Broeder (Max Planck Psycholinguistics Institute) Approved as RDA Recommendation (2015) DTR Phase 2: Follow-on Group Focus on data type records Help data producers create useful record types Co-chairs Lannom (CNRI), Weigel (DKRZ)

9 Current State A prototype is at: http://typeregistry.org/
Multiple other implementations/projects DKRZ, Vermont Monitoring Coop, RPID project Implementation supports notions of primitives and derived types Primitives are fundamental types that we expect humans and software to parse and understand Integer, floating point, boolean value, string, date, timestamp, etc. Derived types depend on primitives to describe something complex Stream gauge, Lidar, Spatial bounding box, etc. Registered types are assigned unique identifiers ISO Study Group

10 ISO Study Group

11 ISO Study Group Activity on Data Type Records: Background
ISO-IEC/JTC1/SC32/WG2 is currently working on a meta model for dataset description. That meta model covers data elements "about" datasets versus "internal details" of datasets. We refer to the record that captures internal details a "data type record”. WG2 was receptive to the idea of exploring the "data type record" space. A study group was authorized Nov 2016, with CNRI as lead ISO-IEC/JTC1/SC32/WG2/SG Present use case(s) from existing RDA members and see what fields are needed to describe the internals of the datasets pertaining the use case.(s) Evaluate what existing ISO standards cover and what they do not. If applicable, recommend a technical report or technical specification or standard. The study group terminates June 2017

12 ISO Study Group Activity on Data Type Records: Work and Possible Outcomes
ISO members referenced portions of existing ISO standards that are applicable to the proposed data type record structure ( , ,   , and 11404) CNRI will create a UML diagram that, wherever applicable, references the pieces from each of those standards. Feb 10th – ‘Data Type Record – Elements for Characterizing Data’ submitted to WG. Has been posted to DTR site Based in part on CMIP6 and Vermont Monitoring Coop examples For discussion purposes, CNRI introduced the notion of a simple data type and a complex data type. Simple data type describes characteristics of a single value (e.g., a cell in a spreadsheet) Complex data type is an aggregate of simple data types (e.g., a row in a spreadsheet can be described using a complex data type) June ’17 Decision Technical Report: documentation on how to use existing standards for the data type use case Technical Specification: intermediate spec, possible future standard, still under development Technical Standard: something new and useful, fully standardized Good news: we get an ISO number, no matter what happens

13

14 Data Type Example

15 Data Type Registry Data set descriptions for automation
Stephen M Richard, IEDA, EarthCube

16 Use cases Document the meaning of entities and attributes in data.
Re-use of data type and attribute definitions Machine-assisted data integration: matching attribute content. Validation of data instances against a type definition. Tools that spin up a UI for a particular data type. Link software to data sources that it can use Support file introspection to assist with deep data registration

17 Progress JSON schema implemented for data type model
DataTypeJSON.json Initial testing with Cordra Next steps– how to get model compilations from spreadsheet or rdf to Cordra. Where to deploy

18

19 Corporation for National Research Initiatives enrich.cordra.org
Enrich DTR Service enhances dataset metadata Syntactic nature (Decimal values between 0.0 – 24.0) Semantic information (Concept time in unit hours) Currently applying the DTR at an attribute level (Concept and Unit) Expand to the dataset level Dataset Metadata (row/column count, maintainer, description, etc.) Dataset Schema (columns information – constraints, formats, semantic information, etc.) Utilizing DTR mapping, in addition to other metadata, to drive a dataset recommendation system. Continuing to collaborate with CNRI on this.

20

21 https://rd-alliance.org/ - https://twitter.com/resdatall
Concept as it was at P8 netCDF-Files Agent Collection script Processing service (WPS) <Metadata> (xml) ? (third-party input) well-defined ways to publish it (automatically) possible repacking into a new collection output multiple types, e.g. netcdf, xml, linked data, text reports, PROV record described in DTR -

22 Pathway for climate data processing services
Multiple upcoming angles for attaching a DT solution: CMIP6 data Handles, Handle records, potential PID Kernel Information profile Copernicus Data Services re-using existing WPS-based services Handles? – Depends on success and acceptance of CMIP6 solution and available effort Climate Analytics Service (DKRZ/CMCC) server-side data processing with PID support data sources: CMIP6, possibly Copernicus, ... -

23 https://rd-alliance.org/ - https://twitter.com/resdatall
Typing angles a) Service typing automated discovery or at least verification b) Data input and output typing (coarse) to distinguish the data sources and help with management c) Typing of data internals (fine) traditional DTR: enable machines to understand meaning of data -


Download ppt "Data Type Registries #2 Co-Chairs: RDA Chairs’ Mtg Gothenburg"

Similar presentations


Ads by Google