Digital Preservation and Citeable Data: Challenges in Data Citation in e-Science Andreas Rauber Vienna University of Technology & Secure Business Austria.


Outline
- Data-driven Science
- Principles of Data Citation
- Challenges in non-trivial settings
- Summary

The Fourth Paradigm
- The Fourth Paradigm. Tony Hey, Stewart Tansley, and Kristin Tolle (Eds.), Oct. 2009, Microsoft Research
- Jim Gray, Turing Award winner 1998
- Presentation in January at NRC-CSTB

The Fourth Paradigm
1) Empirical Science
   - ~1,000 years ago
   - Description of observed phenomena
2) Theoretical Science
   - ~100 years ago
   - Model building, generalization
3) Computational Science
   - ~10 years ago
   - Simulation of complex phenomena
(adapted from Jim Gray, eScience Talk at NRC-CSTB meeting, Mountain View CA, January 11, 2007, slide 4)

The Fourth Paradigm
4) Data-intensive Science
   - ~ now
   - Connects theory, experiment and simulation
   - Huge amounts of data from sensors and simulations
   - Data processing via software
   - Storing data in networked infrastructures
   - Gaining knowledge by analysis of integrated data
- e-Science, Data-driven Science, Data-intensive Science
- Studies / Meta-Studies, Integration
- Data is the key enabler -->
  - Need to preserve data in digital repositories!
  - Need to precisely identify (cite) data!

Outline
- Data-driven Science
- Principles of Data Citation
- Challenges in non-trivial settings
- Summary

Joint Declaration of Data Citation Principles
- 8 principles created by the Data Citation Synthesis Group
- The Data Citation Principles cover purpose, function and attributes of citations
- Goal: Encourage communities to develop practices and tools that embody uniform data citation principles

Joint Declaration of Data Citation Principles (cont'd)
1) Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance as publications.
2) Credit and Attribution: Data citations should facilitate giving credit and normative and legal attribution to all contributors to the data.

Joint Declaration of Data Citation Principles (cont'd)
3) Evidence: Whenever and wherever a claim relies upon data, the corresponding data should be cited.
4) Unique Identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

Joint Declaration of Data Citation Principles (cont'd)
5) Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
6) Persistence: Unique identifiers, and metadata describing the data, and its disposition, should persist, even beyond the lifespan of the data they describe.

Joint Declaration of Data Citation Principles (cont'd)
7) Specificity and Verifiability: Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.

Joint Declaration of Data Citation Principles (cont'd)
8) Interoperability and Flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.

Outline
- Data-driven Science
- Principles of Data Citation
- Challenges in non-trivial settings
- Summary

Data Citation
- Citing data should be easy
  - from providing a URL in a footnote
  - via providing a reference in the bibliography section
  - to assigning a PID to a dataset (DOI, ARK, …) in a repository
- What's the problem?

Citation of Dynamic Data
- Citable datasets have to be static
  - Fixed set of data, no changes: no corrections to errors, no new data being added
- But: (research) data is dynamic
  - Adding new data, correcting errors, enhancing data quality, …
  - Changes sometimes highly dynamic, at irregular intervals
- Current approaches
  - Identifying entire data stream, without any versioning
  - Using "accessed at" date
  - "Artificial" versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!)
- Would like to cite precisely the data as it existed at a certain point in time, without delaying release of new data

Granularity of Data Citation
- What about granularity of data to be cited?
  - Databases collect enormous amounts of data over time
  - Researchers use specific subsets of data
  - Need to identify precisely the subset used
- Current approaches
  - Storing a copy of subset as used in study -> scalability
  - Citing entire dataset, providing textual description of subset -> imprecise (ambiguity)
  - Storing list of record identifiers in subset -> scalability, not for arbitrary subsets (e.g. when not entire record selected)
- Would like to be able to cite precisely the subset of dynamic data used in a study

Data Citation – Requirements
- Dynamic data
  - corrections, additions, …
- Arbitrary subsets of data (granularity)
  - rows/columns, time sequences, …
  - from single number to the entire set
- Stable across technology changes
  - e.g. migration to new database
- Machine-actionable
  - not just machine-readable, definitely not just human-readable and interpretable
- Scalable to very large / highly dynamic datasets

RDA WG Data Citation
- Research Data Alliance
- WG on Data Citation: Making Dynamic Data Citeable
- WG officially endorsed in March
- Concentrating on the problems of dynamic (changing) datasets
  - Focus!
  - Liaise with other WGs on attribution, metadata, …
  - Liaise with other initiatives on data citation (CODATA, DataCite, Force11, …)

Making Dynamic Data Citeable
Data Citation: Data + Means-of-access
- Data: time-stamped & versioned (aka history)
Researcher creates working-set via some interface:
- Access: assign PID to QUERY, enhanced with
  - Time-stamping for re-execution against versioned DB
  - Re-writing for normalization, unique-sort, mapping to history
  - Hashing result-set: verifying identity/correctness
  leading to landing page
S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013).
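
The result-set hashing step can be illustrated with a short sketch. This is not the reference implementation from the cited paper: the function name `result_set_hash`, the choice of SHA-256 and the JSON serialization are all assumptions made for illustration. The point is only that a unique sort plus a canonical serialization makes the hash independent of the order in which rows are returned.

```python
import hashlib
import json


def result_set_hash(rows, key_columns):
    """Hash a result set so that row order does not affect the digest.

    Hypothetical helper: `rows` is a list of dicts, `key_columns` names the
    columns that uniquely identify a row.
    """
    # Unique sort: order rows by their key columns to get a stable sequence.
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in key_columns))
    # Canonical serialization: sorted dict keys, fixed separators.
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rows_a = [{"id": 2, "value": 7.5}, {"id": 1, "value": 3.0}]
rows_b = [{"id": 1, "value": 3.0}, {"id": 2, "value": 7.5}]

# Same data in a different order yields the same hash.
same = result_set_hash(rows_a, ["id"]) == result_set_hash(rows_b, ["id"])

# A corrected value yields a different hash.
corrected = [{"id": 1, "value": 3.1}, {"id": 2, "value": 7.5}]
differs = result_set_hash(corrected, ["id"]) != result_set_hash(rows_a, ["id"])
```

A real deployment would also have to fix the representation of floats, NULLs and character encodings before hashing, which this sketch leaves out.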

Data Citation – Deployment
- Researcher uses workbench to identify subset of data
- Upon executing selection ("download") user gets
  - Data (package, access API, …)
  - PID (e.g. DOI) (query is time-stamped and stored)
  - Hash value computed over the data for local storage
  - Recommended citation text (e.g. BibTeX)
- PID resolves to landing page
  - Provides detailed metadata, link to parent data set, subset, …
  - Option to retrieve original data OR current version OR changes
- Upon activating PID associated with a data citation
  - Query is re-executed against time-stamped and versioned DB
  - Results as above are returned
Note: the query string provides excellent provenance information on the data set! This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump.
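
A minimal sketch of the cite-then-resolve cycle described on this slide, using Python's built-in `sqlite3`. Everything here is an assumption for illustration: the table `obs`, the columns `valid_from` / `valid_to`, the integer logical timestamps, the in-memory `query_store` dict and the PID string `example-pid-1` are hypothetical and not taken from the pilots.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obs (id INTEGER, value REAL, valid_from INTEGER, valid_to INTEGER)"
)
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?, ?)",
    [(1, 3.0, 100, None), (2, 7.5, 100, None)],
)

# Time-constrained query: only record versions valid at :ts, in a unique order.
QUERY = (
    "SELECT id, value FROM obs "
    "WHERE valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts) "
    "ORDER BY id"
)

query_store = {}  # hypothetical stand-in for a persistent query store


def digest(rows):
    return hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()


def cite(pid, ts):
    """Execute the selection, store the time-stamped query and result hash."""
    rows = conn.execute(QUERY, {"ts": ts}).fetchall()
    query_store[pid] = {"query": QUERY, "timestamp": ts, "result_hash": digest(rows)}
    return rows


def resolve(pid):
    """Re-execute the stored query and verify the result against the hash."""
    entry = query_store[pid]
    rows = conn.execute(entry["query"], {"ts": entry["timestamp"]}).fetchall()
    if digest(rows) != entry["result_hash"]:
        raise ValueError("result set no longer matches the cited data")
    return rows


cited = cite("example-pid-1", 150)

# Later correction at logical time 300: close the old version, add the new one.
conn.execute("UPDATE obs SET valid_to = 300 WHERE id = 2 AND valid_to IS NULL")
conn.execute("INSERT INTO obs VALUES (2, 8.0, 300, NULL)")

resolved = resolve("example-pid-1")                      # data as originally cited
current = conn.execute(QUERY, {"ts": 400}).fetchall()    # current version
```

Resolving the PID returns the data as it was cited, while the same query with a later timestamp returns the current version, which corresponds to the "original data OR current version" option on the landing page.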

Initial Pilots
- Devised concept
- Identify additional challenges (unique sorting, hash-key computation, distribution, …)
- Evaluate conceptually in different settings
  - How to apply versioning efficiently?
  - How to perform time-stamping efficiently?
  - How easy to adopt / changes required?
- Start implementing pilots
  - SQL: LNEC, MSD
  - CSV: MSD
  - XML: CLARIN, xBase

Dynamic Data Citation - Pilots
Dynamic Data Citation for SQL Data
LNEC, MSD Reference Implementation

SQL Prototype Implementation
- LNEC Laboratory of Civil Engineering, Portugal
- Monitoring dams and bridges
- 31 manual sensor instruments
- 25 automatic sensor instruments
- Web portal
  - Select sensor data
  - Define timespans
- Report generation
  - Analysis processes produce LaTeX, which produces the PDF report
Image: Florian Fuchs, CC-BY-3.0 (http://creativecommons.org/licenses/by/3.0), via Wikimedia Commons

SQL Prototype Implementation
- Million Song Dataset
- Largest benchmark collection in Music Retrieval
- Original set provided by Echonest
- No audio, only several sets of features
- Harvested, additional features and metadata extracted and offered by several groups
- Dynamics because of metadata errors, extraction errors
- Research groups select subsets by genre, audio length, audio quality, …

SQL Time-Stamping and Versioning
- Integrated
  - Extend original tables by temporal metadata
  - Expand primary key by record-version column
- Hybrid
  - Utilize history table for deleted record versions with metadata
  - Original table reflects latest version only
- Separated
  - Utilizes full history table
  - Also inserts reflected in history table
- Solution to be adopted depends on trade-off
  - Storage demand
  - Query complexity
  - Software adaptation
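
As a rough illustration of the "integrated" variant, the sketch below extends a table with temporal metadata and a record-version column that is part of the primary key. The table name `sensor`, the column names, the integer logical timestamps and the helper functions are assumptions for this example, not the schema used in the pilots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# valid_from / valid_to are the temporal metadata; valid_to stays NULL while
# a version is current. (id, version) is the expanded primary key.
conn.execute(
    """
    CREATE TABLE sensor (
        id INTEGER NOT NULL,
        version INTEGER NOT NULL,
        reading REAL NOT NULL,
        valid_from INTEGER NOT NULL,
        valid_to INTEGER,
        PRIMARY KEY (id, version)
    )
    """
)


def insert_record(rec_id, reading, ts):
    conn.execute(
        "INSERT INTO sensor VALUES (?, 1, ?, ?, NULL)", (rec_id, reading, ts)
    )


def update_record(rec_id, reading, ts):
    """Never overwrite: close the current version and add a new one."""
    (version,) = conn.execute(
        "SELECT version FROM sensor WHERE id = ? AND valid_to IS NULL", (rec_id,)
    ).fetchone()
    conn.execute(
        "UPDATE sensor SET valid_to = ? WHERE id = ? AND version = ?",
        (ts, rec_id, version),
    )
    conn.execute(
        "INSERT INTO sensor VALUES (?, ?, ?, ?, NULL)",
        (rec_id, version + 1, reading, ts),
    )


def as_of(ts):
    """Return the table contents as they existed at logical time ts."""
    return conn.execute(
        "SELECT id, reading FROM sensor "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",
        (ts, ts),
    ).fetchall()


insert_record(1, 10.0, 100)
insert_record(2, 20.0, 100)
update_record(1, 10.5, 200)  # correction of an erroneous reading

before = as_of(150)
after = as_of(250)
```

Every query against such a table has to carry the validity predicate, which is the query-complexity cost named in the trade-off list; the hybrid and separated variants move old versions into a history table instead.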

SQL: Storing Queries
- Add query store containing
  - PID of the query
  - Original query
  - Re-written query + query string hash
  - Timestamp (as used in re-written query)
  - Hash-key of query result
  - Metadata useful for citation / landing page (creator, institution, rights, …)
  - PID of parent dataset (or using fragment identifiers for query)
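
One possible shape for a query store record, sketched as a Python dataclass. The class name, field names, PID strings and the use of SHA-256 for the query string hash are hypothetical; the slide lists only which pieces of information are stored, not how.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueryStoreEntry:
    pid: str              # PID of the query
    original_query: str   # query as issued by the researcher
    rewritten_query: str  # normalized, time-constrained query
    query_hash: str       # hash of the re-written query string
    timestamp: int        # timestamp used in the re-written query
    result_hash: str      # hash-key of the query result
    parent_pid: str       # PID of the parent dataset
    metadata: dict = field(default_factory=dict)  # creator, institution, rights, ...


def make_entry(pid, original, rewritten, timestamp, result_hash, parent_pid, **metadata):
    """Build an entry, deriving the query string hash from the re-written query."""
    query_hash = hashlib.sha256(rewritten.encode("utf-8")).hexdigest()
    return QueryStoreEntry(
        pid, original, rewritten, query_hash, timestamp, result_hash, parent_pid, metadata
    )


entry = make_entry(
    pid="example-pid/q1",  # placeholder identifier, not a real PID
    original="SELECT id, value FROM obs WHERE value > 5",
    rewritten=(
        "SELECT id, value FROM obs WHERE (value > 5) "
        "AND valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts) ORDER BY id"
    ),
    timestamp=150,
    result_hash="0" * 64,  # placeholder; would be computed over the result set
    parent_pid="example-pid/obs",
    creator="A. Researcher",
)
```

Storing the hash of the re-written query string makes it cheap to check whether a semantically identical query has already been assigned a PID.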

SQL Query Re-Writing
- Adapt query to history table
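
The slide's example is not preserved in the transcript, so the following is only a simplified sketch of what adapting a query to a history table could look like. The function `rewrite_query`, the table `obs_history` and the validity columns are assumptions; a real re-writer would parse the SQL rather than assemble strings.

```python
import sqlite3


def rewrite_query(table, columns, where, key_column):
    """Build a time-constrained, uniquely sorted version of a simple selection.

    Hypothetical helper: identifiers are interpolated unchecked, so this is
    only suitable for trusted input in a demonstration.
    """
    cols = ", ".join(columns)
    predicate = f"({where}) AND " if where else ""
    return (
        f"SELECT {cols} FROM {table} "
        f"WHERE {predicate}valid_from <= :ts "
        f"AND (valid_to IS NULL OR valid_to > :ts) "
        f"ORDER BY {key_column}"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obs_history (id INTEGER, value REAL, valid_from INTEGER, valid_to INTEGER)"
)
conn.executemany(
    "INSERT INTO obs_history VALUES (?, ?, ?, ?)",
    [
        (1, 9.0, 100, 200),   # superseded at logical time 200
        (1, 4.0, 200, None),  # corrected value
        (2, 7.5, 100, None),
    ],
)

# The researcher's selection "value > 5", mapped onto the history table.
sql = rewrite_query("obs_history", ["id", "value"], "value > 5", "id")
at_150 = conn.execute(sql, {"ts": 150}).fetchall()
at_250 = conn.execute(sql, {"ts": 250}).fetchall()
```

The same re-written query returns different subsets depending on the stored timestamp: record 1 satisfies the selection at time 150 but drops out after its correction.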

Dynamic Data Citation for CSV Data
- Why CSV data?
  - Well understood and widely used
  - Simple and flexible
  - Most frequently requested during initial RDA meetings
- Goals:
  - Ensure cite-ability of CSV data
  - Enable subset citation
  - Support particularly small and large volume data
  - Support dynamically changing data
- 2 options:
  - Versioning system (subversion/svn, git, …)
  - Migration to RDBMS

CSV: Basic Steps
- Upload interface
  - Upload CSV files
- Migrate CSV file into RDBMS
  - Generate table structure
  - Add metadata columns for versioning
  - Add indices
- Dynamic data
  - Update existing records
  - Append new data
- Access interface
  - Track subset creation
  - Store queries
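
The migration step can be sketched with the standard `csv` and `sqlite3` modules. The function `migrate_csv`, the metadata column names and the choice to store all CSV fields as TEXT are assumptions for illustration, not the prototype's actual design.

```python
import csv
import io
import sqlite3


def migrate_csv(conn, table, csv_text, ts):
    """Load CSV text into a new table with versioning metadata columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames

    # Generate table structure from the header, plus metadata columns.
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(
        f'CREATE TABLE "{table}" '
        f"({col_defs}, valid_from INTEGER NOT NULL, valid_to INTEGER)"
    )

    # Insert every row as a current version (valid_to is NULL).
    placeholders = ", ".join("?" for _ in columns)
    for row in reader:
        conn.execute(
            f'INSERT INTO "{table}" VALUES ({placeholders}, ?, NULL)',
            [row[c] for c in columns] + [ts],
        )

    # Add an index on the validity interval used by time-constrained queries.
    conn.execute(
        f'CREATE INDEX "{table}_validity" ON "{table}" (valid_from, valid_to)'
    )
    return columns


conn = sqlite3.connect(":memory:")
header = migrate_csv(conn, "readings", "station,temp\nA,1.5\nB,2.0\n", ts=100)
loaded = conn.execute(
    "SELECT station, temp, valid_from, valid_to FROM readings ORDER BY station"
).fetchall()
```

Once the data is in the database, updates and appends can follow the same close-and-insert pattern as for SQL data, and subsets are cited through stored queries.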

CSV Data Prototype

Further Pilots
- NERC: UK Natural Environment Research Council
  - Butterfly monitoring
  - Ocean buoy network
  - Sociological Archive
  - National hydrological archive
- ARGO buoy network: SeaDataNet, distributed dataset
- XML Data in Field Linguistics (CLARIN)
- ESIP Mtg in Washington, Jan.: Earth Science Data
- Further Pilots on XML, LOD, …

Join RDA and Working Group
If you are interested in joining the discussion, contributing a pilot, or establishing a data citation solution, …
- Register for the RDA WG on Data Citation:
  - Website:
  - Mailing list:
  - Web conferences:
  - List of pilots: wg/wiki/collaboration-environments.html

Summary
- Trustworthy and efficient e-Science based on data
- Data as "1st-class citizen"
- Support for citing arbitrary subsets of dynamic data
  - Time-stamping and versioning of data
  - Storing (and citing) time-stamped queries
- Allows retrieving exact view on data set as used
- No need for artificial "versioning", delaying release of new data, or redundant storage of data subset dumps
- Helps tracing provenance (semantics) of data selection
- Future work: distributed datasets, data & query migration

Thank you!