RDA Terminology: Data Management and Data Fabric. Prepared for RDA 6th Plenary, Paris, Sept. 23, 2015. Gary Berg-Cross, Co-Chair DFT IG, Co-organizing Chair.


1 RDA Terminology: Data Management and Data Fabric
Prepared for RDA 6th Plenary, Paris, Sept. 23, 2015
Gary Berg-Cross, Co-Chair DFT IG, Co-organizing Chair for DF IG
DFT Goal: Describe a basic, abstract (but clear) data organization model that systematizes the already large body of definition work on data management terms, especially as involved in RDA’s efforts.
Terminology Issue: What do we expect from RDA?
Adopt one or build our own language?
Spend years on terminology debates?
Build our own language stepwise?
Other, such as cooperating with other efforts?

2 Topics - RDA DFT is about clarifying and labeling concepts and Terminology Strategy
Franco Zoppi: “The document seems to suffer from a problem in the used terminology. Terms are sometimes unclear (in many cases definitions would help) or even wrong or misused. I guess that most of these problems could be avoided with a correct use of Computer Science/ICT well established and consolidated terminology. This is particularly evident in Sections 2.2, 2.3 and 2.6.”
Broadening discussion beyond a core to wider Data Management
Including suggested concepts with candidate terminology
Current strategy is to:
Clarify and update existing terms (Digital Objects need IDs, but what and how as part of data management? etc.)
Improve supporting models with conceptual relations (a big job)
Provide practical guidance (technical and policy views)

3 Broadening the Discussion (Stepwise or Scope-wise)
Digital Object Management (registered, digital data)
Digital Data Management, including unregistered data (is a broader concept)
Data Management (and use) is broader still
Where are datasets?

4 Integrate Concepts: Policy-based Digital Data Management Concept Graph (Reagan Moore)
Based on practical principles, policy defines when in a workflow a PID is created, as well as other curation activities. These definitions are linked.

5 Including suggested concepts with candidate terminology: Examples
1. Data practice is the actual application/use of ideas & methods (as opposed to theories) about how data are collected, created, stored (maintained), curated, used, shared and released (disseminated).
2. Data principles are rules that provide guidance across data management and use for such things as data acquisition, data lifecycle control, data policy & ownership, metadata practices, data quality, etc.
3. Common data solutions are agreed-upon, easily available, tested & approved approaches to widely occurring problems in data management and use.
4. Data discovery is a process of query and/or search to find (research) data of interest.
5. Database cracking features incremental partial indexing and/or sorting of the data. It combines features of automatic index selection and partial indexes. It reorganizes data within the query operators, integrating the re-organization effort (occasionally invoking creation or removal of indexes on tables and views based on use) into query execution. It shifts the cost of index maintenance from updates to query processing.
6. Adaptive indexing is characterized by the partial creation and refinement of preliminary or fixed DB indexes as side effects to support efficient query execution. (after http://www.vldb.org/pvldb/vol4/p586-idreos.pdf)
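
As a rough illustration of database cracking as defined above, the sketch below cracks a single in-memory column. It is a simplified toy under stated assumptions, not the algorithm from the cited paper: the class `CrackedColumn` and its method names are invented for this example, and partitioning builds new lists instead of swapping values in place. Each range query partitions only the piece of the column that its bounds fall into, so the column becomes more ordered as a side effect of querying.

```python
import bisect


class CrackedColumn:
    """Toy cracker column: a copy of the data that queries partially sort."""

    def __init__(self, values):
        self.data = list(values)  # cracker column, reorganized by queries
        self.pivots = []          # sorted pivot values used so far
        self.bounds = []          # bounds[i]: first position with value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing `pivot`.

        Returns the first position whose value is >= pivot.
        """
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.bounds[i]  # this pivot was already cracked
        # The piece between the neighbouring pivots is the only part touched.
        lo = self.bounds[i - 1] if i > 0 else 0
        hi = self.bounds[i] if i < len(self.pivots) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.bounds.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high, cracking as a side effect."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


column = CrackedColumn([13, 4, 9, 2, 12, 7, 1, 19, 3, 14])
first = column.range_query(5, 13)   # cracks the whole column on 5, then on 13
second = column.range_query(2, 8)   # only re-partitions two smaller pieces
```

No index exists before the first query, and later queries reuse the boundaries left behind by earlier ones. This is the sense in which the cost of index maintenance moves into query processing.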

6 Clarifying Concepts: we discussed other organizing model ideas
Digital Object (aka Digital Entity): A digital object is composed of a structured sequence of bits/bytes. As an object it is named. This bit sequence can be identified & accessed by a unique and persistent identifier or by use of referencing attributes describing its properties.
Note: Digital Entity definition from X.1255 ITU standard: “machine-independent data structure consisting of one or more elements in digital form that can be parsed by different information systems; the structure helps to enable interoperability among diverse information systems in the Internet.”
Link data management principles to the actual workflow of generating data.
Data Management Workflow Structured Object: includes provenance, versioning, and output MD (from PP)
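
The Digital Object definition above names two access paths: by persistent identifier, or by attributes that describe the object. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the `DigitalObject` and `Registry` names, the attribute keys, and the `example/...` identifier strings are hypothetical and do not come from X.1255 or any RDA specification.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A named bit sequence plus attributes describing its properties."""
    pid: str                       # unique, persistent identifier (hypothetical format)
    bits: bytes                    # the structured bit/byte sequence itself
    attributes: dict = field(default_factory=dict)


class Registry:
    """Toy in-memory registry offering both access paths from the definition."""

    def __init__(self):
        self._objects = {}

    def register(self, obj):
        if obj.pid in self._objects:
            raise ValueError(f"PID already registered: {obj.pid}")
        self._objects[obj.pid] = obj

    def resolve(self, pid):
        """Access path 1: identify and access the object by its PID."""
        return self._objects[pid]

    def find(self, **criteria):
        """Access path 2: select objects by referencing attributes."""
        return [
            obj for obj in self._objects.values()
            if all(obj.attributes.get(k) == v for k, v in criteria.items())
        ]


registry = Registry()
registry.register(DigitalObject("example/0001", b"\x00\x01", {"sample_type": "blood"}))
registry.register(DigitalObject("example/0002", b"\x02\x03", {"sample_type": "tissue"}))

by_pid = registry.resolve("example/0001")
by_attribute = registry.find(sample_type="tissue")
```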

7 Clarifying and updating existing terms: adding practicality
Comments on the DF White paper include challenges to the idea that Internal/External properties is a useful distinction for DOs:
Internal property refers to the properties, making up an internal structure, that allow one to interpret the content of a DO.
The statement “we need to distinguish the external characteristics from the internal characteristics to ensure that we really can separate common data management tasks from discipline–specific heterogeneity..” seems not appropriate: many such things considered external for data management vary by discipline too, e.g. search by sample type or Dx.
I think that the assignment of PIDs to single data items is infeasible. Therefore you need search and query capabilities to find the required data contained in datasets/databases identified by the PIDs.
ID, creation date, ... Sample type, UoM, Obs. Precision, Patient Age, Symptom, Dx ...
Common Management for these External Properties? Part is identification, but part is for discoverability.

8 Improving conceptual relations
Concept map overview of Core Terms
How is some part of a database or dataset to be identified/cited?
How should data stored in a repository that has complex internal structure and that is subject to change be identified/cited?
We will need smarter resolvers that offer additional services beyond getting from an identifier to an object location.

9 Providing Practical Guidance (Tech, Policy & Strategy)
When should a PID be assigned to be useful with dynamic data?
If you build up a clinical trial database you will continuously add and change data. There is no PID necessary because here you have the audit trail which stores all actions. A PID should be assigned, for example, when the database is cleaned and frozen, which is a definite working step in the workflow of clinical trials. (Christian Ohmann, Wolfgang Kuchinke, Steve Canham)
PIDs should be assigned at the level of granularity (data sets) appropriate for a functional use that is envisaged. (Costantino Thanos)
Responses
Scalability is an issue, so the management of objects & identifiers should work through the same mechanisms as much as possible. To enable management of objects beyond a view focusing on single items, adequate mechanisms should, for example, be able to select objects by their most important characteristics or aggregate them at multiple levels of granularity and provide basic CRUD operations on such object collections. (Tobias Weigel, Michael Lautenschlager)
For added-value services, registries at the resolvers’ level are also needed and should be maintained by recognized international organizations.
Publishers will rely on the DOI system because there has been major investment.
What highly available and scalable PID system is feasible?
We should develop a strategy built upon what exists and what can be done for those cases where currently no PID is used.
Etc.
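
The clinical-trial guidance above (no PID while data are still changing and the audit trail records every action; a PID once the database is cleaned and frozen) can be sketched as a small workflow. This is only an illustration under assumptions: `TrialDatabase`, its methods, and the `example-pid/...` identifier format are hypothetical, and a real deployment would mint identifiers through an actual PID service rather than `uuid`.

```python
import uuid


class TrialDatabase:
    """Toy database: changes are audited, and a PID is minted only at freeze time."""

    def __init__(self):
        self.records = {}
        self.audit_trail = []   # every change is logged while the data are dynamic
        self.pid = None         # no PID until the database is frozen

    def upsert(self, key, value):
        if self.pid is not None:
            raise RuntimeError("database is frozen; changes need a new version")
        action = "update" if key in self.records else "insert"
        self.records[key] = value
        self.audit_trail.append((action, key, value))

    def freeze(self, mint_pid=None):
        """Freeze the cleaned database and assign its PID (idempotent)."""
        if self.pid is None:
            # Hypothetical minting step; stands in for a call to a PID service.
            mint = mint_pid or (lambda: f"example-pid/{uuid.uuid4()}")
            self.pid = mint()
        return self.pid


db = TrialDatabase()
db.upsert("patient-1", {"age": 40})
db.upsert("patient-1", {"age": 41})     # a correction, tracked by the audit trail
frozen_pid = db.freeze(lambda: "example-pid/1")
```

After `freeze()` the PID refers to a stable state, which matches the point made in the comment: the frozen dataset, not each intermediate edit, is the unit that gets identified.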

