RDA’s Recently Endorsed Outputs September 16, 2015.



Slide 2: Agenda
- Introduction
- Data Foundation and Terminology
- PID Information Types
- Practical Policy
- Data Type Registries
- Questions

Data Foundation and Terminology - talking the same language – Peter Wittenburg, Gary Berg-Cross, Raphael Ritz

Slide 4: Summary of the Problem
- What is the problem?
  - Data organizations (DOrg) and the ideas about them all differ
  - We all speak different languages, wasting time and misunderstanding each other in any project involving data
  - Different DOrgs make data discovery and integration very time-consuming, inefficient and thus expensive
  - Different DOrgs prevent us from developing maintainable support software
- Who is impacted (specific domains, professions, etc.)?
  - All efforts to integrate data, for example in federations, BDA projects, etc.
- What are the ramifications of not having the problem resolved?
  - Combining data of all sorts across different origins (projects, repositories, disciplines, etc.) is a nightmare and requires a lot of curation and transformation before the actual scientific analysis can start

Slide 5: Highlights of the Effort and Deliverables
- Working Group structure (how many members, diversity of experience, geographies, etc.)
  - DFT WG had 60 members coming from almost all regions
  - Members came from different types of institutions and disciplines
  - DFT WG members ranged from relative newcomers to people with much experience in data-intensive projects
- DFT WG produced
  - a list of core terms essential to harmonize the conceptualization of data organizations
  - a graphical model relating the terms
  - a set of auxiliary documents, including many use cases, to demonstrate the bottom-up approach and research of the WG
  - a Term Tool (using Semantic MediaWiki) to store definitions and allow editing, classification and discussion of terms (also open to other groups)

Slide 6: Active Contributors to the Work

| Institute/Project | Country/Region | Domain |
|---|---|---|
| CNRI | US | IT Research and Systems |
| U Cardiff | UK | IT Research and Systems |
| AWI | DE | Oceanography & Environment |
| MPG | DE | Research Organisation |
| EUDAT | EU | Data Infrastructure |
| CLARIN | EU | Linguistic Research Infrastructure |
| EPOS | EU | Earth Observation Res. Infrastructure |
| ENES | Int | World Climate Res. Infrastructure |
| ENVRI | EU | Environmental Res. Infrastructure |
| DataOne | US | Environmental Infrastructure |
| ESSD/RENCI | US | Earth Science System Data |
| NCGEN/RENCI | US | Clinical Genomics |
| Europeana | EU | Humanities Infrastructure |
| DataCite/EPIC | Int | PID Infrastructures |
| DICE | US | IT Research and Systems |
| CAS | CN | Earth Science Model |
| ADCIRC/RENCI | US | Ocean and Storm modeling |

Slide 7: Impact of the Deliverable
- Who was impacted by the deliverable?
  - The European data infrastructure EUDAT is federating data from many discipline repositories where each data collection has a different data organization. If integration is not simply done at the physical level (file structures), this heterogeneity makes it very costly to integrate all data to enable re-purposing and to make it accessible at different repositories.
  - The Technology Director of the international CLARIN project said:
    - Very handy to have a lingua franca when discussing research infrastructure architectures
    - It was good to be involved as an adopting community from the start of the work
  - Similar experiences are reported by US, Chinese etc. colleagues who work on large-scale data integration. Integration work is bespoke and thus does not scale. Even the integration of a simple database of animal voices of the world (11 TB) required the development of special scripts to extract metadata, relations, rights etc. in addition to the data files.
  - Harmonization would reduce integration time by large factors and has already had great effects on interaction efficiency and integration.

Slide 8: Endorsements/Adopters and how have they used the deliverable
- Our adopters
  - The early adopters are to a certain extent those who have these dramatic problems in data integration, such as EUDAT, CLARIN, etc.
  - Their approach was aligned with the progress of the WG discussion. All their repository setups now adhere to the DFT model and their interactions with different communities are based on it: central is the Digital Object, which is described by metadata, is associated with a Persistent ID, and whose instances are stored in trustworthy repositories (see simplified diagram)
  - Several other projects, for example from humanities, health, bioinformatics, neuroinformatics and atmospheric research, also adopted the basic and simple model and the terminology.

[Diagram: simplified DFT model. Nodes: digital object, bitstream, repository, persistent ID, metadata. Relations: isRepresentedBy, isStoredIn, isReferencedBy, isDescribedBy, isa]
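The simplified model on this slide can be sketched as a small set of data classes. This is only an illustration: the class and field names are hypothetical, and the pairing of each relation label with its two nodes is one reading of the flattened diagram (the `isa` relation is not modeled here).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Repository:
    """A repository that stores bitstreams (hypothetical minimal form)."""
    name: str


@dataclass
class Bitstream:
    """One stored instance of a digital object's content."""
    content: bytes
    stored_in: Repository  # assumed reading of "isStoredIn"


@dataclass
class DigitalObject:
    """Central entity of the model, as described on the slide."""
    pid: str        # assumed reading of "isReferencedBy" (persistent ID)
    metadata: dict  # assumed reading of "isDescribedBy"
    bitstreams: list = field(default_factory=list)  # "isRepresentedBy"

    def repositories(self):
        """Names of all repositories holding an instance of this object."""
        return sorted({b.stored_in.name for b in self.bitstreams})


# Usage: one object with instances in two repositories.
repo_a = Repository("repo-a")
repo_b = Repository("repo-b")
obj = DigitalObject(pid="hdl:example/1", metadata={"title": "sample"})
obj.bitstreams.append(Bitstream(b"abc", repo_a))
obj.bitstreams.append(Bitstream(b"abc", repo_b))
```

Keeping the persistent ID and the metadata on the object, and the storage location on each bitstream, mirrors the slide's point that the object is managed independently of where its instances live.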

Slide 9: Endorsement/Adoption

| Institute/Project | Country/Region | Domain |
|---|---|---|
| CNRI | US | IT Research and Systems |
| U Cardiff | UK | IT Research and Systems |
| MPG | DE | Research Organisation |
| EUDAT | EU | Data Infrastructure |
| CLARIN | EU | Linguistic Research Infrastructure |
| EPOS | EU | Earth Observation Res. Infrastructure |
| ENES | Int | World Climate Res. Infrastructure |
| ENVRI | EU | Environmental Res. Infrastructure |
| ESSD/RENCI | US | Earth Science System Data |
| NCGEN/RENCI | US | Clinical Genomics |
| DICE | US | IT Research and Systems |
| ADCIRC/RENCI | US | Ocean and Storm modeling |
| Deep Carbon Project | US | Environmental/Atmospheric Research |

Note: There may be more projects/institutes that have endorsed or adopted the DFT model without notifying us.

Slide 10: How You Can Endorse
- Who could use the DFT Terminologies?
  - The vocabulary is openly available to everyone who wants to run a project, including those with large data collections
    - The organization should be strictly compliant with the model to guarantee independence and thus easy re-purposing of all components
  - The vocabulary is openly available to everyone who is working in a data federation project integrating data from different sources or who wants to re-purpose data for data-intensive science
    - Projects could use the DFT WG model as a common reference model to design transformations
    - Projects could use the suggested terminology to achieve quick, mutual understanding
  - Software developers can adopt this basic model to make sure that their software can be used by almost everyone adhering to state-of-the-art principles

Slide 11: How You Can Endorse
- How to access and use them
  - Take the "Core Terms and Model" document, which provides the final model and the corresponding terms, and apply it in your project
- In case of questions
  - Read the supplementary documents to understand the conceptualization and the background for choices
  - Meet the WG co-chairs and experts at a plenary
  - Contact the WG co-chairs
  - Contribute to the now functioning DFT IG ( , wiki, Term Tool)
  - Send a request to the RDA Europe support team ( , wiki)
(for references see last slide)

Slide 12: Next Steps
- Are there plans to further evolve this deliverable?
  - Yes. The WG focused only on the basic set of core terms, and additional RDA WGs are completing work, so there is much more where terminology harmonization would help substantially
  - We also see the need to consider the dynamics of the field and to be ready to adapt current definitions and perhaps even the model
- Is there an IG or WG that individuals can join on a related topic?
  - Yes, a follow-up DFT Interest Group has been established and will meet at Plenary 6
  - A larger scope of integrated work is being discussed as part of the Data Fabric IG

Slide 13: Contact Information
- Who can individuals contact to learn more about this deliverable?
  - DFT WG:
  - DFT IG:
  - TeD-T Term Definition Tool:
  - RDA EU Support Team:

PID Information Types: Towards PID interoperability Tobias Weigel (DKRZ / University of Hamburg) Tim DiLauro (Data Conservancy / Johns Hopkins University)

Slide 15: Summary of the Problem
- Move from management of files towards management of objects
  - How does object management scale with increasing numbers?
  - How do we further automate our processes?
  - Issues independent from particular disciplines, repositories, management approaches
- Understanding the most elemental characteristics of digital objects, for machine agents and human users
- Facilitate interoperability across PID systems and simplify PID record usage
- Not addressing these key challenges is likely to lead to insular solutions and reiteration of efforts

Slide 16: Highlights of the Deliverables
- More than 50 group members from EU/US/AU
- A lot of technical expertise and community experience
- Key deliverables (cf. summary report):
  - Conceptual insights on types and their possible structures
  - Practical type examples geared towards diverse use cases
  - Openly licensed API specification and Java-based prototype
  - Approach for using a general type registry

[Diagram: an IDENTIFIER linked to typed properties (size, checksum, timestamps, aggregation, version, license, format); example PID records listing Size, Format, Checksum, Date and License; a verification service]
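The idea of a PID record whose entries are typed properties, read by a verification service, can be sketched roughly as follows. This is a hedged illustration, not the WG's API specification or its Java prototype: the type identifiers, the in-memory registry, and all function names are hypothetical.

```python
import hashlib

# Hypothetical type registry: type identifier -> registered name and value kind.
TYPE_REGISTRY = {
    "type:size": {"name": "size", "kind": int},
    "type:checksum": {"name": "checksum", "kind": str},
    "type:format": {"name": "format", "kind": str},
}


def make_pid_record(data: bytes, fmt: str) -> dict:
    """Build a PID record whose keys are registered type identifiers."""
    return {
        "type:size": len(data),
        "type:checksum": hashlib.sha256(data).hexdigest(),
        "type:format": fmt,
    }


def read_property(record: dict, name: str):
    """Look a value up by its registered type name, checking the value kind."""
    for type_id, value in record.items():
        info = TYPE_REGISTRY.get(type_id)
        if info is not None and info["name"] == name:
            if not isinstance(value, info["kind"]):
                raise TypeError(f"{type_id} holds an unexpected value kind")
            return value
    raise KeyError(name)


def verify(record: dict, data: bytes) -> bool:
    """Toy verification service: compare the data with the record's claims."""
    return (
        read_property(record, "size") == len(data)
        and read_property(record, "checksum") == hashlib.sha256(data).hexdigest()
    )


record = make_pid_record(b"hello", "text/plain")
```

Because a client reads properties through the registered type rather than through a system-specific key, the same `verify` logic could in principle work against records from different PID systems, which is the interoperability point the slide makes.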

Slide 17: Impact of the Deliverable
- Some initial types have been registered, making it possible to explore further applications
  - Information on how to register new types is available in the report
  - Registration relies on the Type Registry
- Prompted plans in communities and projects for concrete applications
- PIDs and typing are increasingly seen as a crucial component to decouple management of objects from contents
  - Simplify client access to data across domains, implementations and changes in information models
  - More lightweight access to information on less accessible objects

Slide 18: Endorsements/Adopters
- Adopters can be:
  - Communities who can use existing types and share custom types, as well as build tools and services that exploit them
  - PID service providers who can offer a typing service as added value beyond registration and resolution, increasing PID interoperability

| Adopter | Category | Country | Scope / Goal |
|---|---|---|---|
| ENES | Community | Int. | IPCC AR6 data management |
| DCO-DS/RPI | Community | US | Enhancing existing PID usage |
| EUDAT | Community/Service provider | EU | Added-value service to various disciplinary communities |
| MGI/NIST | Community | US | Automation of data type conversions |
| EPIC | Service provider | EU | Generic added-value service |
| CNRI | Service provider | US | |
| DONA | Service provider | Int. | |

Slide 19: How You Can Endorse
- Make use of existing types, invent your own and please tell us about it!
  - Follow-up RDA WGs on Collections and Data Typing will continue the work on concrete types. The PID Interest Group is also a good place to provide general feedback.
- Specification and prototype source code are openly available
  - Possible development by EUDAT, DCO, ENES and others as interested adopters
  - Offer by PID service providers as a service beyond registration and resolution
- Contribution to a unified type registry is encouraged

Slide 20: Next Steps and Contact Information
- PID Information Types WG
- PID Interest Group
- PID Collections candidate WG
- Data Typing BoF
- personal contact:

Working Group Practical Policy based on slides and latest documents from the PP WG chaired by Reagan Moore, Rainer Stotzka

Slide 22: Summary of the Problem
Practical Policy: an assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection
Computer actionable policies are used to
- enforce data management
- automate administrative tasks
- validate compliance with assessment criteria
- automate scientific data processing and analyses
Users are motivated by issues related to scale and distribution

Slide 23: Policy Templates
- Practical Policy members represented
  - 11 types of data management systems
  - 30 institutions
- 2 testbeds
  - iRODS: Renaissance Computing Institute, DataNet Federation Consortium (DFC)
  - GPFS: Institute of Physics of the Academy of Sciences, CESNET Garching Computing Centre (RZG)
- Published two documents
  - Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, "Practical Policy Templates", February 2015, B3E CC.
  - Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, "Practical Policy Implementations", February 2015, B3E CC.

Slide 24: Production Environments
- Computer actionable rules to enforce:
  - Preservation standards
    - Authenticity, integrity, chain of custody, arrangement
  - Data management plans
    - Collection creation, product generation, publication, storage, archives
  - Data distribution
    - Replication, content distribution network
  - Publication
    - Descriptive metadata, time dependent access controls
  - Processing pipelines
    - Workflow execution

Slide 25: Endorsements/Adopters
- Distributed data management environments
  - EUDAT Data Policy Manager
  - B2SAFE use case
  - International Neuroinformatics Coordinating Facility
  - Institut national de physique nucléaire et de physique des particules
  - New Zealand BESTGRID
  - DataNet Federation Consortium
  - NSF data management plans
  - Odum Institute preservation archive
  - The iPlant Collaborative genomics data grid
  - Science Observatory Network digital library
  - SILS LifeTime Library
  - HydroShare
  - NOAA National Climatic Data Center
  - NASA Center for Climate Simulations

Slide 26: Applications
- Policy-based collection management
  - Purpose for assembling the collection
  - Properties required to support the purpose
  - Policies that control when and where the properties are enforced
  - Procedures that execute operations controlled by the policies
  - Persistent state information that is generated by the procedures
  - Periodic assessment criteria that verify compliance
- RDA Publications
  - Policy templates
    - Constraints, operations, required state information
  - Policy implementations
    - Computer actionable rules to automate policy enforcement
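The chain from purpose to periodic assessment on this slide can be illustrated with a toy replication policy. This is a sketch under stated assumptions, not one of the WG's published templates or an iRODS rule: the `Collection` class, the minimum-replica property, and the site names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    purpose: str       # purpose for assembling the collection
    min_replicas: int  # property required to support the purpose
    # Persistent state information: object name -> sites holding a copy.
    replicas: dict = field(default_factory=dict)


def replicate(collection: Collection, name: str, sites: list) -> None:
    """Procedure: copy an object to sites until the property holds."""
    held = collection.replicas.setdefault(name, [])
    for site in sites:
        if len(held) >= collection.min_replicas:
            break
        if site not in held:
            held.append(site)


def on_ingest(collection: Collection, name: str, sites: list) -> None:
    """Policy: on every ingest, run the replication procedure."""
    replicate(collection, name, sites)


def assess(collection: Collection) -> list:
    """Periodic assessment: names of objects that violate the property."""
    return sorted(
        name
        for name, held in collection.replicas.items()
        if len(held) < collection.min_replicas
    )


archive = Collection(purpose="preservation", min_replicas=2)
on_ingest(archive, "a.dat", ["site1", "site2", "site3"])
on_ingest(archive, "b.dat", ["site1"])  # only one site available
violations = assess(archive)
```

The assessment reads only the persistent state written by the procedure, so compliance can be checked later without re-running the policy.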

Slide 27: Next Steps and Contact Information
- Data Fabric Interest Group
  - Policies to support
    - Federation
    - Interoperability
- Data Foundations and Terminology Interest Group
  - Vocabulary for policy management
- Interoperability testbeds
  - EUDAT
  - National Data Service
  - DataNet Federation Consortium

Data Type Registries Larry Lannom, CNRI Daan Broeder, Meertens Institute, KNAW

Slide 29: Summary of the Problem
- Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data
- How do we do this now?
  - For documents, formats are enough, e.g., PDF, and then the document explains itself to humans
  - This doesn't work well with data: numbers are not self-explanatory
    - What does the number 7 mean in cell B27?
- Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
- Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in its creation
- Affects all data producers and consumers
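The "number 7 in cell B27" problem can be made concrete with a small sketch in which a value carries a reference to an explicit type record. Everything here is assumed for illustration: the type identifier, the choice of "water depth in feet", and the field names are hypothetical and not taken from any registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    """A type record that makes a producer's assumptions explicit."""
    identifier: str  # hypothetical type identifier
    variable: str    # what is being measured
    unit: str        # measurement unit
    to_si: float     # multiplier converting the value to the SI base unit


@dataclass(frozen=True)
class TypedValue:
    """A raw value paired with the type record that explains it."""
    value: float
    dtype: DataType

    def in_si(self) -> float:
        """Convert the raw value using the assumption stated in the type."""
        return self.value * self.dtype.to_si


# Hypothetical type: the producer measured water depth in feet.
DEPTH_FEET = DataType("type:depth-ft", "water depth", "ft", 0.3048)

# The bare 7 from cell B27 becomes interpretable once it carries its type.
cell_b27 = TypedValue(7, DEPTH_FEET)
depth_m = cell_b27.in_si()
```

A consumer who was not involved in creating the data can now read the variable name and unit from the type record instead of guessing them.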

Slide 30: Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries
- Evaluate and identify a few assumptions in data that can be codified and shared in order to…
- Produce a functioning Registry system that can easily be evaluated by organizations before adoption
  - Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
  - Supports several Type record dissemination variations
  - Design for allowing federation between multiple Registry instances
- The emphasis is not on
  - Identifying every possible assumption and data characteristic applicable for all domains
  - Technology

Slide 31: Highlights of the Deliverable
- Confirmation that detailed and precise data typing is a key consideration in data sharing and reuse and that a federated registry system for such types is highly desirable and needs to accommodate each community's own requirements
- Deployment of a prototype registry implementing one potential data model, against which various use cases can be tested
- Involvement of multiple ongoing scientific data management efforts, across a variety of domains, in actively planning for and testing the use of data types and associated registries in their data management efforts
- Integration with one additional RDA WG (Persistent Identifier Types) and at least one Interest Group (RDA/CODATA Materials Data, Infrastructure & Interoperability IG)
- Development of a set of questions that require further consideration before a detailed recommendation on data typing can be issued

Slide 32: Impact of Use Case: Process Use Case

[Diagram: users and a client working with typed data (each item has an ID, a Type and a Payload), a Federated Set of Type Registries, and services for visualization, terms agreement ("I Agree"), rights, data processing, and data set dissemination]

1. Client (process or people) encounters an unknown data type, which is resolved to the Type Registry.
2. Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally:
3. Typed data or a reference to typed data can be sent to a service provider.
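The process use case can be sketched as a lookup across a federated set of registries. This is a minimal in-memory illustration, not the prototype at typeregistry.org: the `TypeRegistry` class, the record fields, and the identifiers are hypothetical, and a real client would resolve type identifiers over the network.

```python
class TypeRegistry:
    """Toy in-memory stand-in for one registry in a federation."""

    def __init__(self, records: dict):
        self._records = dict(records)

    def lookup(self, type_id: str):
        """Return the type record, or None if this registry lacks it."""
        return self._records.get(type_id)


def resolve(type_id: str, registries: list) -> dict:
    """Steps 1-2: resolve an unknown type against each registry in turn."""
    for registry in registries:
        record = registry.lookup(type_id)
        if record is not None:
            return record
    raise LookupError(f"no registry knows type {type_id!r}")


def process(item: dict, registries: list) -> dict:
    """Use the resolved type record locally to interpret a typed item."""
    record = resolve(item["type"], registries)
    return {
        "name": record["name"],
        "payload": item["payload"],
        # Step 3 (optional): services the data could be sent to.
        "services": record.get("services", []),
    }


# Hypothetical federation: the local registry is empty, the remote one is not.
local = TypeRegistry({})
remote = TypeRegistry({
    "type:xrd-pattern": {
        "name": "X-ray diffraction pattern",
        "properties": {"unit": "degrees 2-theta"},
        "services": ["svc:normalize"],
    }
})
item = {"id": "obj-1", "type": "type:xrd-pattern", "payload": [1, 2]}
result = process(item, [local, remote])
```

Trying registries in order is only one possible federation strategy; the slides leave best practices for federation to the follow-up Data Typing WG.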

Slide 33: Endorsements/Adopters
- Materials Science Adoption Project
  - Demo at P6
  - X-ray diffraction use case
    - Normalize data sets resulting from multiple proprietary instruments
    - Enable a homogeneous analysis platform for data consumers to perform their analyses
- Deep Carbon Observatory
  - Goal: given a dataset identifier, discover detailed information about the structure(s) within that dataset, and act accordingly
  - DTR is a registry used for explicating structures in the form of type records
  - Facilitate norms of behavior relevant to data curation and re-use
- Digital Object Identifier
  - Given a DOI, what services are relevant and applicable?
  - Having chosen a service, how can a client invoke that service?
  - Having invoked a service, how can a client process the returned data?
- DOI, Materials Science, DCO, EUDAT

Slide 34: How You Can Endorse
- Start a new prototype effort
- Follow existing prototype efforts
- Attend the BOF at P6
- Join the Data Typing WG when it starts
- Try the public prototype at typeregistry.org

Slide 35: Next Steps and Contact Information
- A follow-up WG is planned: Data Typing
  - Leverage results of DTR
  - Collect results from multiple prototypes
  - Best practices for federation
- BOF on Data Typing at P6: 24 Sept., Breakout #6
- Proposed Chairs of Data Typing WG
  - Giridhar Manepalli, CNRI
  - Simon Cox, CSIRO
  - Tobias Weigel, DKRZ
- Larry and Daan are still around