Discussion of Larger Scope DFT Concepts & Terminological Issues Prepared for RDA P4, Amsterdam, Sept 2014 Gary Berg-Cross: Co-Chair DFT WG.



Topical Outline: is there more to define, and are there richer ways to define it?
Broader View of Data Processes
Broader WG and Data Concept View
Automated policy-driven definitions. Example goal: to identify typical application scenarios for policies such as replication, preservation, etc.
Selective examples from metadata
Is the Web view of data important to include?
Better Vocabulary Development
Recap from P2: taxonomies, relations, attributes
Operational; Procedural; Data curation; Registration; Unclassified; Organizational; Aggregations

Vocabulary can be Built out in Stages (adapted example from P2)
[Diagram labels: RDA Scope; WG Scope; Core; Starter Set; 3 months; 9-12 months; 18 months; P2; P3-P4?; P4-P5?]

One View of DFT Scope from the Process View
Policy defines this?
1. What elements are in a PID record
2. How to point to a metadata record
3. What is in a metadata record at registration
4. What is replication with identical vs. different bit-streams that may store additional attributes
The rest of data management and the lifecycle?
We have a policy for a minimum metadata record?
Other WGs...

Other Basic Terms: Data Citation as Example
Term: Data Citation
Definition: Data Citation is the practice of providing a citation to data in the same way that researchers routinely include a bibliographic reference to published resources.
Examples:
References: Australian National Data Service [ANDS] [2011]; Van Leunen [1992]
Other sources: quoted from cited there as 'adapted from
Related Term: Citable Data is a type of referable data that has undergone quality assessment and can be referred to as citations in publications and as part of Research Objects.
Data Citation: A citation is a formal structured reference to another scholarly published or unpublished work. In traditional print publishing, a “bibliographic citation” refers to a formal structured reference to another scholarly published or unpublished work. Typically, in-text citation pointers to these structured references are either marked off with parentheses or brackets, such as “(Author, Year),” or indicated by superscript or square-bracketed numerals, although in some research domains footnotes are used. Such citations of a single work may occur several times throughout the text. The full bibliographic reference to the work appears in the bibliography or reference list, often following the end of the main text, and is called a “reference” or “bibliographic reference.” Traditional print citations include “pinpointing” information, typically in the form of a page range that identifies which part of the cited work is being referenced.

Practical Policy WG area examples
Contextual metadata extraction
Data access control
Data backup
Data format control
Data retention
Disposition
Integrity (including replication)
Notification...

Contextual metadata extraction policies
This policy area focuses on metadata associated with files and collections. The creation of provenance and descriptive metadata defines a context for interpreting the relevance of files in a collection. Depending upon the data source, there are multiple ways to provide metadata:
Extract metadata from an associated document. An example is the medical imaging format DICOM.
Extract metadata from a structured document which includes internal metadata. Examples are FITS for astronomy, netCDF, and HDF.
Extract metadata by parsing patterns within the text of a document.
Identify a feature present within a file and label the file with the location of that feature.
Extract metadata: Attribute_name, Attribute_value, Attribute_unit, Source_file, Source_collection
A start on minimal MD?
Key processes across the data lifecycle?
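The pattern-parsing route and the five extracted fields can be illustrated with a minimal sketch. It assumes a hypothetical text layout of "name = value [unit]" lines; the class name, function name, and regular expression are illustrative choices, not part of any Practical Policy WG specification.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MetadataAttribute:
    """One extracted attribute, using the five fields named on the slide."""
    attribute_name: str
    attribute_value: str
    attribute_unit: Optional[str]
    source_file: str
    source_collection: str


# Assumed pattern: lines such as "temperature = 21.5 C" or "site = A7".
LINE_PATTERN = re.compile(r"^\s*(\w+)\s*=\s*(\S+)(?:\s+(\S+))?\s*$")


def extract_metadata(text: str, source_file: str,
                     source_collection: str) -> List[MetadataAttribute]:
    """Extract attributes by parsing 'name = value [unit]' lines in the text."""
    found = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            name, value, unit = match.groups()
            found.append(MetadataAttribute(name, value, unit,
                                           source_file, source_collection))
    return found


# Hypothetical file content and names, for illustration only.
sample_text = "temperature = 21.5 C\nnote: free text\nsite = A7\n"
attributes = extract_metadata(sample_text, "run01.txt", "field-campaign")
```

Lines that do not fit the assumed pattern are skipped, so free text in the same file does not produce attributes.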

Data Lifecycle: defines all stages in the existence of digital data, from creation to destruction, and chained operations (workflows).
With the lifecycle (LC), core definitions then might include:
Curation: The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep them fit for purpose. Higher levels of curation will also involve maintaining links with annotation and with other published materials.
Archiving: A curation activity which ensures that data is properly selected and stored, can be accessed, and that its logical and physical integrity is maintained over time, including security and authenticity.
Preservation: An activity within archiving in which specific items of data/collections are maintained over time so that they can still be accessed and understood through changes in technology.
Interoperability: The ability of systems to accept and send services and to use the services so exchanged to enable them to operate usefully (ISO TC204, document N271).

The Basic 4 Part Model (should also note Actors like User as needed)
[Diagram labels: Data; Process Infrastructure; Process; Transformed Data. Standardized relations: Uses; Modifies; Produces]

Illustrating the Basic 4 Part Model
[Diagram labels: Raw Data; Registry; Registration Process; Registered Data. Relations: Uses; Modifies; Produces]
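The registration illustration can be sketched in code. The direction of each relation is an assumption here, since the transcript keeps only the labels: the process is taken to use the raw data, modify the registry, and produce the registered data. The class names and the counter-based identifier are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Infrastructure:
    """Process infrastructure, e.g. a registry that a process modifies."""
    name: str
    entries: Dict[str, dict] = field(default_factory=dict)


@dataclass
class Process:
    """A process that uses data, modifies infrastructure, and produces data."""
    name: str
    step: Callable[[dict, Infrastructure], dict]

    def run(self, data: dict, infrastructure: Infrastructure) -> dict:
        return self.step(data, infrastructure)


def register(raw: dict, registry: Infrastructure) -> dict:
    # Assumed identifier scheme: a simple counter, not a real PID syntax.
    identifier = f"id-{len(registry.entries) + 1}"
    registry.entries[identifier] = {"name": raw["name"]}  # Modifies the registry
    return {**raw, "identifier": identifier}              # Produces registered data


raw_data = {"name": "sample.csv"}                         # Data that is used
registry = Infrastructure("Registry")
registration = Process("Registration Process", register)
registered = registration.run(raw_data, registry)
```

The raw data is left unchanged, and the registered data is a new object, which keeps the "Uses" and "Produces" relations distinct.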

Data Transformed to Information (Note: for now, our detailed Processes are what Practical Policy has identified)
[Diagram labels: Data Object; Representation Object = Process Infrastructure; Process; Information Object. Relations: Uses; Modifies; Produces]

Notional Core Diagram Reflecting Data Lifecycle and RDA WGs
[Diagram labels: Metadata Types; PID Types; Practical Policies (divided by policy types such as “Manage data sets in a repository”); Initial Process & Registration; Uses; Modifies; Metadata Registries; Raw Data ?; Data & Metadata Management; Type Registries; Data Repositories; Referable Data...; Curation Types?; Collection repositories; Archives...; Citation?; Provenance Types?; Curation/Provenance policies...; PIT WG; PP WG; DTR WG; MDSD WG; User/manager]

Digital Object/Data Management: Manage Data Sets within the Data Lifecycle
[Diagram labels: Data Types; Data Repositories; User/manager]
Practical Policies: Manage data sets in a repository
replication of digital objects
DO provenance metadata
DO naming & descriptive metadata
DO arrangement
DO representation & administrative metadata
DO access controls
DO retention, disposition, integrity
Any of these concepts might be defined as part of DFT

Metadata Management: Metadata Catalogs within the Data Lifecycle
[Diagram labels: Metadata Types; Metadata Registries/Directories/Catalogs; Data Repositories; User/manager]
Practical Policies: Manage Metadata Catalog
reserved vocabularies
metadata schema
metadata organization in tables
metadata properties (creation date, access control, ownership)

Digital Object/Collection Management: Building Data Collections
[Diagram labels: Data Repository; Collection Repositories; Metadata Repository; Data Citation etc...; User/manager]
Practical Policies: Collection Building and Management
Generate PID for digital collection object
Collection provenance metadata, including data source for collection
Collection naming & descriptive metadata, including composition & arrangement
Collection representation & administrative metadata
Collection access controls

Terminological Issues: Community Discussion and Better Definitions
5-step vocabulary process, from scope & requirements to tool building and population, to:
3. Focused Vocabulary Design Process and Community Agreement (at and after 3rd Plenary)
4. Refinement & Maintenance (ongoing)
5. Draft Vocabulary Publication and Review (4th Plenary)

Still Relevant From P2: Vocabulary Analysis Process
Identify concepts and concept relations implied by collected terms;
Analyze and model concept systems on the basis of identified concepts and concept relations that are used to understand a term and its referent;
Establish representations of concept systems through concept diagrams;
Craft concept-oriented definitions as a concept base;
Test arrangement in taxonomical class hierarchy(s);
Add essential Properties/Attributes/slots to distinguish related concepts;
Link concepts via Relations, etc.;
Associate a designated vocabulary term to each concept (in one or more languages); and
Document the vocabulary in an agreed-upon form, perhaps starting as a structured glossary with supporting concept models.
Adapted liberally from ISO TC 37 Standards, Basic Principles of Terminology

Simple Vocab Entry Example from P2 (illustrates taxonomy & other relations, attributes, etc.)
Term: Data Object
Type of: Abstract Object (Taxonomy)
Sub-types: digital object, ...
Definition: In computer science, an object is any entity that can be manipulated by the commands of a programming language, such as a value, variable, function, or data structure. (With the later introduction of object-oriented programming, the same word, "object", refers to a particular instance of a class.)
Definition 2: a Data Object is a dataset
Equivalent terms (other languages): ...
Attributes: ... metadata record with data object name, local ID, PID, representation info, checksum, ...
Relations: a data element isPartof Data Object, ...
Examples/Instances include: repository metadata, data models, databases, tables, views, files, entities, columns, data elements, and attributes. (Source
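An entry of this shape could be held as a small structured record. The following is a minimal sketch, assuming a hypothetical class and field layout that simply mirror the slots of the example entry; it is not the schema of the DFT term tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VocabularyEntry:
    """One glossary entry with taxonomy, attributes, and relations."""
    term: str
    type_of: str
    definitions: List[str]
    sub_types: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def to_glossary_text(self) -> str:
        """Render the entry as plain structured-glossary text."""
        lines = [f"{self.term} (type of: {self.type_of})"]
        for number, definition in enumerate(self.definitions, start=1):
            lines.append(f"  Definition {number}: {definition}")
        for subject, relation, obj in self.relations:
            lines.append(f"  Relation: {subject} {relation} {obj}")
        return "\n".join(lines)


# Values taken from the example entry; the long first definition is omitted.
data_object = VocabularyEntry(
    term="Data Object",
    type_of="Abstract Object",
    definitions=["a Data Object is a dataset"],
    sub_types=["digital object"],
    attributes=["data object name", "local ID", "PID",
                "representation info", "checksum"],
    relations=[("data element", "isPartOf", "Data Object")],
)
glossary_text = data_object.to_glossary_text()
```

Keeping relations as subject-relation-object tuples lets the same entry feed both a printed glossary and a concept model.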

Backup Slides

Vocabulary Design Process & Vocabulary Qualities
Both analysis and design may employ conceptual modeling to capture the essential meaning and structure of the descriptions of the vocabulary. The product of this is some form of conceptual model.
Desired Qualities
1. Adequate capture of content intuitions, expressed by domain experts:
   1. in an understandable form
   2. includes details on constraining descriptions
   3. uses well-defined relations, taxonomic and others
   4. illustrated with examples
2. Rigorous: stands up to rational analysis
3. Minimally redundant: no unintended synonyms

Overview of Term Development
Starter areas and items:
Persistent Identifiers (PIDs and types)
Digital Object - Data Object
Collection - Data Set - Aggregation
Repository (Registries and related Policies)
[Diagram labels: Scope; Terms from Model Papers; Placed in Tool; Defs & Refinement; Analysis and Revision Process; Getting Defs organized for review]
Example term: Digital Object. A digital object is composed of a structured sequence of bits/bytes. As an object it is named. This bit sequence can be identified & accessed by a unique and persistent identifier or by use of referencing attributes describing its properties.
Term Definition Tool prototyped and developed at Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft

Conceptual Spaces
[Diagram. Concepts: digital object; bit stream; instance of a bit stream; service object; informational object; aggregation; data object; collection; metadata record; PID record; data stream; data set; corpus; attribute; property. Relations: is_a; is_part_of; has_a; has_many; is_equal; contains_a. Panels: Peter’s Original; Refinements]
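A concept space of this kind can be held as subject-relation-object triples. The sketch below is illustrative only: the edges of the diagram are not recoverable from the transcript, so the triples are taken from "is a type of" and "isPartof" statements in the term definitions of this deck, and the function name is hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Edges taken from definitions in the deck, not from the diagram itself.
TRIPLES: List[Triple] = [
    ("data object", "is_a", "digital object"),
    ("service object", "is_a", "digital object"),
    ("information object", "is_a", "data object"),
    ("metadata", "is_a", "data object"),
    ("digital collection", "is_a", "aggregation"),
    ("data element", "is_part_of", "data object"),
]


def ancestors(concept: str, triples: List[Triple]) -> Set[str]:
    """Return every concept reachable from `concept` by following is_a links."""
    result: Set[str] = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "is_a" and obj not in result:
                result.add(obj)
                frontier.append(obj)
    return result


information_object_types = ancestors("information object", TRIPLES)
```

Only is_a is followed transitively; is_part_of and the other relations are stored but not treated as taxonomy.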

Digital Object (aka Digital Entity) is composed of a structured sequence of bits/bytes. As an object it is named. This bit sequence can be identified & accessed by a unique and persistent identifier or by use of referencing attributes describing its properties. Note: the Digital Entity definition from the X.1255 ITU standard is “machine-independent data structure consisting of one or more elements in digital form that can be parsed by different information systems; the structure helps to enable interoperability among diverse information systems in the Internet.”

Metadata is a type of data object that contains attributes describing properties of an associated data or digital object. It may contain as key the persistent identifier of that associated object. The association between a data object and metadata is that the content of the metadata describes the data object. Metadata may serve different purposes, such as helping people to find data of relevance (discovery; Michener 2006) or to bring data together (federation). A list of uses includes: discovery, access, selection, licensing, authorization, quality, suitability, provenance, and reproducibility. Data properties, both internal and external, are types of metadata, as is transactional information about data. Ref: Michener, W.K

Data Object is a type of digital object that includes the named bits of a digital object but also has a representation object allowing processing of its information content. The representation object is "information that maps a Data Object into more meaningful concepts" (OAIS); it makes humanly perceptible properties happen. Examples: file format, encoding scheme, data format, data type.

Representation Object provides some context for a data object. It contains provenance, description (e.g. format, encoding scheme, algorithm; Brown, 2008), structural, and administrative information about the object. This is a form of metadata. Brown, A. (2008). White paper: Representation information registries. Retrieved June 19, 2009, from

Service Object (Code Object) is a type of digital object containing executable code, considered as a unit. Note: Under specific circumstances a service object can be viewed as a data object.

Information Object is a type of Data Object which includes the object’s metadata in the Object. OAIS example: a digital image in TIFF format can only be rendered as an image using software which has been designed to interpret the bitstream in accordance with the TIFF format specification. In other words, the logical Information Object (the image) can only be derived from the physical Data Object (the bitstream) via a process of interpretation. OAIS uses the term Representation Information to describe the knowledge base required for this interpretation.

Active Data denotes virtual units of data objects that are created dynamically by executable code. Active data may be viewed as a product of executing the workflow. It may thus also be called a "dynamic data object".

Workflow Object: a text file listing the chained operations. (too specific to RENCI)

9. Aggregations
Digital Object Aggregation (aka aggregation) is a structured bundle of distinct data/digital objects.
Digital Collection is a type of aggregation. It is also a Digital Object with a PID to be referable and metadata describing at a minimum its aggregation properties.
Gappy Data is an attribute of data collections or sets indicating that the data is incomplete at its time of registration and is completed at unpredictable moments. Note: EUDAT has a definition that might be used.
Active Collection is a type of collection that is being generated dynamically by executable code and results in what some call a dynamic data object.
Data/Digital Set is a type of managed data collection. It is the basic unit of managed data and has a persistent identifier and metadata. Alternative: a Dataset is a collection of data, published or curated by a single agent, and available for access or download in one or more formats.
Compound Data Objects/digital objects are bounded aggregations of resources and their relationships.
Data Catalog is a curated collection of metadata about datasets.

1 Basic Digital & Data Concepts
Data is inherently collective, so one could say "data collection" as well as "data".
Digital Data refers to a structured sequence of bits/bytes that represents information content. In many contexts digital data and data are used interchangeably, implying both the bits and the content.
Real-Time Data is data or a data collection which is produced on its own schedule, has a tight time relation to the processes that create it, and requires immediate actions. Timeliness, such as real time, is an attribute of data.
Dynamic Data is a type of data which is changing frequently and asynchronously. Note 1: EUDAT has defined dynamic data since it occurs in many data creation situations and needs special treatment. Note 2: Dynamic data has also been used in the context of workflows: a workflow that is executed can be called a "dynamic data object", or the results from executing the workflow can be called a "dynamic data object".
Referable Data is a type of data (digital or not) that is persistently stored and which is referred to by a persistent identifier. Digital data may be accessed by the identifier. Some data object references may access a service on the object (OAI-ORE).
Citable Data is a type of referable data that has undergone quality assessment and can be referred to as citations in publications.

More Aggregation Types
Data Container is a type of data structure used to store data collections for particular uses.
File is a named and ordered sequence of bytes that is known by an operating system. A file can be zero or more bytes and has one file format, access permissions, and usually file system statistics such as size and last modification date.
Corpus is a type of collection that has a meaningful value to a researcher. It may or may not be digitally represented.

Metadata Management: Data Replication in Repository
[Diagram labels: Data Sources; Data Repository; Associated Metadata Repositories; Original Metadata; User/manager]
Practical Policies: Manage Data Replication in Repository
reserved vocabularies
metadata schema
metadata organization in tables
metadata properties (creation date, access control, ownership)

Query of Local Data & Metadata Repositories
[Diagram labels: Metadata Repositories; Local Data Repositories; Data Query; MD Query; User/manager]
Practical Policies: Local Query Operations
Data Query Processing
MD Query Processing

Query of Remote Data Repositories via PID
[Diagram labels: PID Registry; Data Repository; Data Query; User/manager]
Practical Policies: Remote (Repository) PID Query
PID Query Processing
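The remote query path can be sketched as a two-step lookup: resolve the PID in a registry, then fetch the object from the repository that the PID record names. This is a minimal in-memory sketch under stated assumptions; the class, the record fields ("repository", "local_id"), and the example PID are hypothetical and do not follow any particular PID system.

```python
from typing import Dict, Optional


class PidRegistry:
    """Maps a PID to the repository and local identifier that hold the object."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, str]] = {}

    def register(self, pid: str, repository: str, local_id: str) -> None:
        self._records[pid] = {"repository": repository, "local_id": local_id}

    def resolve(self, pid: str) -> Optional[Dict[str, str]]:
        return self._records.get(pid)


def query_by_pid(pid: str, registry: PidRegistry,
                 repositories: Dict[str, Dict[str, bytes]]) -> Optional[bytes]:
    """Resolve the PID, then read the object from the named repository."""
    record = registry.resolve(pid)
    if record is None:
        return None  # Unknown PID
    repository = repositories.get(record["repository"])
    if repository is None:
        return None  # PID record points at a repository we cannot reach
    return repository.get(record["local_id"])


# Hypothetical repository contents and PID, for illustration only.
repositories = {"remote-repo": {"obj-17": b"bit sequence"}}
pid_registry = PidRegistry()
pid_registry.register("example/0001", "remote-repo", "obj-17")
payload = query_by_pid("example/0001", pid_registry, repositories)
```

Keeping resolution separate from retrieval means the user or manager needs only the PID, and the object can move between repositories by updating its PID record.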