An OAI-ORE Aggregation for the National Virtual Observatory

David Reynolds, Tim DiLauro, Sayeed Choudhury
Library Digital Programs, Sheridan Libraries, Johns Hopkins University

Abstract: Johns Hopkins University, the National Virtual Observatory (NVO), and its partners are developing a data curation prototype system that connects data deposition with the publishing process using a Fedora-based repository at its foundation. As part of this effort, the development team is evaluating the use of the Open Archives Initiative - Object Reuse and Exchange (ORE) specification to create an ORE Aggregation that models the relationships between the various resources related to a particular publication (e.g., an article, the data it describes, and references cited). As stated on the OAI-ORE web site, OAI-ORE provides "specifications that allow distributed repositories to exchange information about their constituent digital objects. These specifications will include approaches for representing digital objects and repository services that facilitate access and ingest of these representations."

This presentation will outline observations from the data modeling process, illustrate the mapping of the model to an ORE Aggregation (and its associated ORE Resource Map or ReM) for the NVO data curation system, and identify linkages between the ORE Aggregation and other conceptual models. Finally, the presentation will identify potential next steps to advance this work into other domains.

OAI-ORE Project Briefing
Presentation Outline
Some background
Motivation for the work described here
Illustrate simplified current workflow
OAI-ORE background
Describe a possible ORE-enabled solution
Present rough model of a real article
Future directions

CNI Task Force Meeting, April 2008. OAI-ORE Project Briefing

Project Background
JHU Sheridan Libraries teaming with the NVO and the AAS to capture published data
Support curation and entrée to preservation
Stable location for data the journals are not prepared to store permanently
Provide platform for data services
Funding from IMLS and Microsoft

What we want to do:
Data curation is the active and ongoing management of data through its lifecycle of interest and usefulness to scholarship, science, and education.
We are working with the NVO and the AAS to capture published data. The NVO makes it possible for astronomical researchers to find, retrieve, and analyze astronomical data from ground- and space-based telescopes worldwide.
Augment a publishing system for AAS publications (AJ, the Astronomical Journal; ApJ, the Astrophysical Journal). We are also beginning discussions with the Optical Society of America (OSA).
Enhance the existing journal workflow to support capture, storage, and preservation of associated data sets.
Provide a means for authors to submit their research data along with their text, tables, and figures. Journals are not prepared to store these data permanently, so we can provide a valuable method of preservation and access.
We will build a Fedora-based repository to preserve these data and associated metadata, provide contextual information about how they relate to each other and to the publications, and provide a means to reuse the data in other contexts. Both through the journal system and outside via OAI-PMH?
We'd like to thank IMLS and Microsoft for funding support.

Motivation
Mandate to integrate data capture with extant workflow for multiple journals
Need to capture relationships with resources which are not part of a particular article
Desire to share risk
Goal of capturing relationships among conceptual objects and web resources

Data capture has to integrate with the current AAS publishing system. Capturing the data during the publishing process is our best chance of getting all the data and having it associated with the published article. Provenance: who handled it, when it was deposited, etc.
The system must be flexible enough to use with the publication workflow of more than one publisher, across multiple disciplines.
OAI-ORE has several advantages for us as a data model. By going with a developing standard that has the potential for wide adoption, we are sharing development risks with a wider community.
There are resources, such as data sets, that may form an important part of the conclusions of an article but aren't formally part of that article. In addition, there are components of an article, such as figures and tables, that may or may not be separate web resources in their own right. We want to clearly describe these resources and their relationships to various aspects of the published article. ORE has the flexibility to handle these complexities in a way that is both understandable to humans and machine-actionable.

Sample Astronomical Data
“major astronomy archives focus on standard data products, and less so on highly processed images or spectra that are associated with peer-reviewed publications” (from NVO grant narrative)

Without an integrated system for individual scientists to deposit these PROCESSED data into a persistent library-based archive or NVO-standard interfaces for access, it is impossible to query these data, identify gaps in knowledge, cite the data within publications, or preserve them for long-term access. (from NVO grant narrative)

Observations from many different telescopes and astronomers are brought together in published data catalogs such as the Sloan Digital Sky Survey, Data Release 4 (SDSS DR4) or the Spitzer Wide-Area Infrared Extragalactic Survey (SWIRE).
FITS (Flexible Image Transport System) is the standard data format used in astronomy. It is used to store both image and non-image data.
Astronomers take data such as spectra, images, catalogs, and tables from these published catalogs and refine them further, for example by filtering. This yields an altered data set. These data are used to make assertions and are often characterized by graphs or tables in the published article.
The problem is that the data are not curated and made available to other astronomers.

Current Submission Workflow
Diagram labels: Publisher, Author

The author creates the text of the article and starts the submission process with the publisher, submitting the text as a PDF. She also has table and figure data that she wants to send. The publisher has the means to accept the PDF and the table data separately, but they do not want to keep the figure data apart from the PDF. This is a problem because the table data is important in its own right and there is no place for it to be stored and preserved.

Archiving Published Data
OAI-ORE
Public alpha release December 2007
Executive editors: Carl Lagoze, Cornell University; Herbert Van de Sompel, Los Alamos National Lab
Utilizes Web architecture
Abstract Data Model, Vocabulary, Resource Map Profile of Atom

The ORE standard is used to instantiate, describe, and identify aggregations of web resources. The focus is on Web resources and URIs, not on repositories.
The central concept in ORE is the aggregation, which is described by a ReM. Each resource in the aggregation has a representation on the web, and each has its own URI. The ReM also has a representation on the web with its own URI.
“ORE provides a mechanism to associate identities with these aggregations and describe them in a machine-readable manner would make them visible to Web agents, both humans and machines. This could be useful for a number of applications and contexts” (direct quote from ORE Abstract Data Model)

Archiving Published Data
OAI-ORE Basic Model
“Resource Map (ReM), which is a resource identified by a URI (say ReM-1) that encapsulates a set of RDF statements. These statements instantiate an aggregation as a resource with a URI, enumerate the constituents of the aggregation, the relationships among those constituents, and the Web context of the aggregation.”

A Resource Map Document is a machine-readable representation of a Resource Map. A Resource Map Document may be serialized in different formats. The alpha release of the ORE specifications includes an Atom Syndication Format guide, but the specification does not limit serialization to just Atom.
ReM describes Aggregation: a one-to-one relationship.
ORE makes use of RDF triples to relate a subject resource to an object resource through a predicate or relationship. In our example, ReM-1 aggregates AR-1 and has type T-1. These are two RDF triples:
ReM-1 ore:aggregates AR-1
ReM-1 rdf:type T-1
An Aggregation aggregates one or more Aggregated Resources (ARs). An AR can be part of more than one aggregation.
The ReM has some required metadata that will help consuming clients establish trustworthiness (dc:creator) and timeliness (dcterms:modified, a date-timestamp showing the last modification of the ReM).
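As a rough illustration of the basic model, the statements in a Resource Map can be written as plain subject-predicate-object tuples. This is a minimal sketch, not the project's implementation: the example.org URIs, the creator string, the timestamp, and the helper names are all hypothetical. It follows the published ORE vocabulary, in which the Resource Map is linked to the Aggregation by ore:describes and the Aggregation is the subject of ore:aggregates.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, for illustration only.
REM = "http://example.org/rem/article-1"
AGG = REM + "#aggregation"
PARTS = [
    "http://example.org/article-1/text.pdf",
    "http://example.org/article-1/table1.fits",
]


def build_resource_map(rem, agg, parts, creator, modified):
    """Return the triples a Resource Map would carry for one Aggregation."""
    triples = [
        Triple(rem, "ore:describes", agg),          # ReM describes exactly one Aggregation
        Triple(rem, "dc:creator", creator),         # who asserts the ReM
        Triple(rem, "dcterms:modified", modified),  # when the ReM last changed
    ]
    # One ore:aggregates statement per Aggregated Resource.
    triples += [Triple(agg, "ore:aggregates", part) for part in parts]
    return triples


def aggregated_resources(triples, agg):
    """List the Aggregated Resources of the given Aggregation."""
    return [t.obj for t in triples
            if t.subject == agg and t.predicate == "ore:aggregates"]


rem_triples = build_resource_map(
    REM, AGG, PARTS, "Example Author", "2008-04-07T00:00:00Z"
)
for t in rem_triples:
    print(t.subject, t.predicate, t.obj)
```

A real Resource Map Document would serialize these statements in a format such as Atom or RDF/XML; tuples are used here only to keep the structure visible.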

A More Desirable Workflow
Diagram labels: Publisher, Author, Archive

In our revised publishing scenario, we add the data archive into the mix.
Now the author creates a ReM for the article aggregation. In this case the aggregation contains the text of the article in PDF along with data for a figure and a table. This all gets sent to the publisher.
The ReM serves as a manifest, so the publisher uses it to understand the components it was sent. The publisher then creates its published version of the article with the newly created table and figure integrated.
Next, the publisher creates a new ReM that describes the data it is sending to the data archive.
The archive uses the ReM to ingest the data that it is curating, in this case the figure and table data. The archive does not store a copy of the published article; that is the function of the publisher.
The archive now creates a third ReM that describes the two data sets it is keeping, along with pointers back to the article in the publisher's archive. This ReM also shows the relationship between the figure and table in the article and the data sets in the data archive.
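The hand-offs in this workflow can be sketched as three successive manifests. This is an illustrative sketch under stated assumptions, not the system's design: the ResourceMap class, the function names, the resource names, and the article URI are hypothetical, and a real ReM would carry RDF statements rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """Hypothetical stand-in for a ReM: who made it, what it aggregates, what it links to."""
    creator: str
    aggregated: list
    links: dict = field(default_factory=dict)


def author_submission():
    # ReM 1: the author aggregates the article text plus figure and table data.
    return ResourceMap("author", ["article.pdf", "figure1-data", "table1-data"])


def publisher_handoff(submission):
    # ReM 2: the publisher forwards only the data sets to the archive.
    data = [r for r in submission.aggregated if r.endswith("-data")]
    return ResourceMap("publisher", data)


def archive_record(handoff, article_uri):
    # ReM 3: the archive keeps the data sets and points back to the published article.
    links = {data_set: article_uri for data_set in handoff.aggregated}
    return ResourceMap("archive", list(handoff.aggregated), links)


rem1 = author_submission()
rem2 = publisher_handoff(rem1)
rem3 = archive_record(rem2, "http://example.org/journal/article-1")
print(rem3.aggregated)
print(rem3.links)
```

The point of the sketch is that each party reads the previous manifest to learn what it received and then publishes its own, so the archive never needs a copy of the article itself.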

A More Complex Example Our data model [slide 10]
The aggregation consists of text, tables, figures, and data sets. There are also resources outside the aggregation that are significant: citations to other articles, data sets from which our sets were derived, and data catalogs.
Explain our data model in detail, possibly starting with the two data catalogs.
***Have not decided how to characterize the queries on the original catalog that retrieved the FITS files, nor do we know how to characterize the processing that produced the derived FITS files. We need to get this done.***

What can we do with the ReM?
“Crawler-based search engines could use such descriptions to index information and provide search results sets at the granularity of the aggregations rather or in addition to their individual parts”
“These machine-readable descriptions could provide the foundation for advanced scholarly communication systems that allow the flexible reuse and refactoring of rich scholarly artifacts and their components”
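To make the first quoted use case concrete, the sketch below groups search hits on individual resources by the aggregations that contain them. It is a toy example, not a description of any existing search engine: the aggregation and resource names are invented, and the functions are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical aggregations: aggregation URI -> aggregated resources.
AGGREGATIONS = {
    "agg:article-1": ["text-1.pdf", "table-1.fits", "figure-1.fits"],
    "agg:article-2": ["text-2.pdf", "table-1.fits"],  # a resource may be shared
}


def index_by_resource(aggregations):
    """Invert the aggregation listing: resource -> set of aggregations containing it."""
    index = defaultdict(set)
    for agg_uri, parts in aggregations.items():
        for part in parts:
            index[part].add(agg_uri)
    return index


def group_hits(hits, index):
    """Group resource-level search hits under each aggregation they belong to."""
    grouped = defaultdict(list)
    for hit in hits:
        for agg_uri in sorted(index.get(hit, ())):
            grouped[agg_uri].append(hit)
    return dict(grouped)


resource_index = index_by_resource(AGGREGATIONS)
results = group_hits(["table-1.fits", "figure-1.fits"], resource_index)
print(results)
```

Because an Aggregated Resource can belong to more than one aggregation, a single hit may appear under several aggregations in the grouped result.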

Components: Data Driven Scholarship
Publication & Editorial Process: Data capture; Metadata capture & validation; Links; Identifiers
Library Curation, Preservation: Data Storage Appliance (shown three times in the diagram, each with Metadata database, Digital data objects, and Ancillary information); replication services; VOStore
Data Access: VO portals; Journal portals; Other after-market distributors; Registry; Logging

The library has the primary role in the middle layer, that of establishing and maintaining the data storage and performing various curatorial tasks. The library is also working with the publishers in the upper layer to establish the workflow for feeding the data into the repository. The library also works with various entities on the data access layer to provide services to enable re-use of the data in the repository.

Future Directions
Library can provide a common infrastructure
Investigate data curation in other scientific disciplines
Gather input from scholarly societies
Digital humanities

One of the main guiding principles of our data curation efforts is to identify commonalities across disciplines to the extent possible. If one thinks about physical infrastructure on a campus, everyone uses the same HVAC, networking, etc., but the bio labs look very different from the classics departments. At some level, there is common infrastructure we can apply, and that's probably where the Library can make a contribution. Highly specific domain needs might need to be addressed in more specific ways, perhaps by scholars or scholarly societies themselves.
With our work, we're using astronomy as a starting point. So it would be important to question how portable the ORE data model, the publishing workflows, the repository services, etc. might be across domains. By sharing our ideas with OSA, we're at least getting another scholarly society's and discipline's input on such questions. We don't want to keep doing this on a discipline-by-discipline basis, but until we have a better understanding and associated theoretical frameworks for data curation across disciplines, this seems like a good way to start. We hope that the framework will settle into a fair amount of uniformity, but we will likely need to continue integration work with new players for some time to come.
Thinking way outside the box, our work with Rose and other digital humanities projects also allows us to consider the portability question across not only the sciences and engineering, but other disciplines as well.
