Open Repositories 2015 Sharon Farnel, University of Alberta

1 Open Repositories 2015 Sharon Farnel, University of Alberta
Metadata at a crossroads: shifting ‘from strings to things’ for Hydra North
Open Repositories 2015
Sharon Farnel, University of Alberta
At the University of Alberta Libraries we are currently developing a Digital Asset Management System (DAMS), ‘Hydra North’, built on Hydra and Fedora 4, to bring all of our digital assets into one platform for discovery, access and preservation. As we move toward the DAMS we also find ourselves at a metadata crossroads. Today we would like to share a bit about our experience, in the hope that others may learn from it when they find themselves at a similar crossroads.

2 Preview
Where we came from
Where we want to be
Principles and questions
Where we are now
How we got to where we are now
Where we go next
Reflections

3 Where we came from
Our institutional repository has acted as the home for the intellectual output of our campus community, including theses, faculty and student publications, learning objects, small datasets and the like. Additional separate repositories have housed locally digitized content including ephemera, images, maps, newspapers, and archival materials. More recently, a research data repository was added to the mix to enable access to and preservation of research data from our campus community.

4 Where we came from
The metadata that underlies our existing repositories is as varied as the digital objects found within them. Numerous standards (descriptive, administrative and structural) are involved; a recently conducted metadata inventory includes MARC/MARCXML, MODS, DC (simple and qualified), ETDMS, DDI, METS/ALTO, CSDGM, NAP, EAD, DataCite and FOXML. In addition, the metadata has been created over many years following a variety of different guidelines, often by individuals expert in metadata creation, nearly as often by those more familiar with a collection itself. While some of the metadata was created natively in the standard in which it now exists, much has been converted from other standards.

5 Where we want to be
consolidated discovery, access and preservation platform
coherent index for all items in our digital collections
enabled by a single data dictionary that captures the ontologies and vocabularies used across the system
metadata created according to the principles of linked data
Though the task seems daunting at times, we see great potential in RDF and linked data generally for providing options other than conversion to or use of a single ‘lowest common denominator’ standard. But there are still challenges. How do we select ontologies and vocabularies that will best serve the objects but also enable our metadata to be fully functional within the linked data environment? How do we ensure our choices enable integration with local and external discovery platforms? What enhancements need to be made to our legacy metadata to make it as linked data compatible as possible? Given ever-present resource constraints, what enhancements can we realistically undertake, and how do we prioritize them? What tools and resources are out there to help with enhancement work? In essence, how do we take metadata that was created under a record-based framework and make it maximally functional within a framework of ‘things not strings’ in order to get the most out of the promise of linked data?

6 Principles and questions
≠ simply converting ‘standard x’ metadata from XML to RDF
≠ going from functional XML to semi-functional RDF
shift ‘from strings to things’ in a meaningful yet effective and efficient way that works in our context
questions
which ontologies and vocabularies to select?
what enhancements are necessary?
which can we realistically undertake?
how do we prioritize them?
what tools are out there that can help us?
We are very much focused on the fact that we don’t simply want to convert what we have to RDF, but rather to take this opportunity to rethink what we have created and find ways to express it as best we can in RDF (under the principles of linked data), knowing that we may, or will, be limited by the fact that this metadata was created under a different framework. We are adamant that we don’t want to go from functional XML to semi-functional RDF; if we are going to do this, let’s do it in such a way that we get the most out of the promise of linked data. Also, context is key.

7 Where we are now
data dictionary created and in use
beta version with institutional repository content (data and metadata) and functionality
complex service model, simple data model
linked data enhancements enumerated and prioritized
preliminary workflows developed, refined and documented

8 How we got to where we are now
ontologies and vocabularies
looked to partners in the repository community (e.g., rdf-vocab), online resources (e.g., Linked Open Vocabularies)
maintain running list in data dictionary
metadata enhancements
phase 1: item types, languages, licenses
consistent forms, include URIs
phase 2: place names, controlled subjects
phase 3: person names, freeform subjects
consistent forms, include URIs?
Phase 1 is the low-hanging fruit. Phase 2 is somewhat easier: place names with GeoNames integration, looking to LCSH for controlled subjects. Phase 3 covers person names and freeform subjects, aiming for consistency of forms at least (for now). The phasing is based on resources but also on policy and technical functionality.
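The phase 1 goal of "consistent forms, include URIs" can be pictured with a small sketch. This is an illustration in Python only, not the project's tooling (the work described in this talk used jq, Open Refine and XSLT). The variant table, the function name `normalize_language`, and the use of Library of Congress ISO 639-2 URIs are all assumptions made for the example.

```python
# Hypothetical lookup: variant strings seen in legacy metadata -> ISO 639-2 code.
LANGUAGE_VARIANTS = {
    "english": "eng",
    "en": "eng",
    "eng": "eng",
    "french": "fre",
    "fr": "fre",
    "fre": "fre",
    "français": "fre",
}

# Preferred display form for each code (assumed for illustration).
LANGUAGE_LABELS = {"eng": "English", "fre": "French"}

# Assumed URI pattern: the Library of Congress ISO 639-2 vocabulary.
URI_BASE = "http://id.loc.gov/vocabulary/iso639-2/"


def normalize_language(value):
    """Return (label, uri) for a legacy language string, or None if unknown."""
    code = LANGUAGE_VARIANTS.get(value.strip().lower())
    if code is None:
        return None  # unknown value: leave for manual review rather than guess
    return LANGUAGE_LABELS[code], URI_BASE + code
```

Returning `None` for unrecognized values keeps the messy cases visible for review instead of silently mapping them to something wrong.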

9 How we got to where we are now
tools
json and jq
Open Refine
oXygen and XSLT
ingest script (Fedora 3 FOXML to Fedora 4) based on script by Jose Blanco (UMichigan)
Github
preliminary steps
user database migrated
communities, collections migrated
audit log
facets defined
Extract of all metadata from our F3 repo in JSON; jq used to query and run reports on a collection/community basis (to account for idiosyncrasies across the system). Open Refine for reviewing reports and data, determining cleanup and eventually reconciliation (greater integration down the line). oXygen and XSLT for stylesheeting the major changes (a single stylesheet, others only when needed) to the datastream in the original FOXML. Ingest/migration script: Weiwei is presenting on this. GitHub for stylesheets and script.

10 How we got to where we are now
typical workflow
collection(s) identified for migration
json report(s) extracted (using jq) from full json file
json report(s) reviewed in Open Refine to identify issues
oXygen used to review and revise generic stylesheet or create collection-specific version
transform source FOXML for migration
migration initiated (using latest DCQ datastream)
original FOXML migrated as separate datastream
objects and metadata reviewed in test instance
Typical workflow for an IR collection (which is what we’ve been working on, collection by collection). Note: collections, communities and users were migrated beforehand, so the hooks are there. Start with the JSON extract of metadata (XSLT from all FOXML): 1,210,325 lines. Select a collection(s) for migration, e.g., the Circumpolar Digital Image Collection. Based on the collection pid, extract metadata (using jq). Use Open Refine to view the report and identify issues to be addressed (manually or in the stylesheet), e.g., data in the wrong element, local identifier for the collection, etc. As collections are fairly homogeneous (except e-theses, which we will tackle later on :) we’ve decided on a single stylesheet; we will create additional ones on an as-needed basis. Transformed FOXML is passed to the dev team and migrated using the script, based on the latest DCQ datastream. Original FOXML is migrated as a separate legacy datastream. Items are reviewed in the test environment.
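The "extract by collection pid, then review the values" step can be sketched as below. The talk used jq for this; the Python here is only an equivalent illustration. The record layout (`pid`, `collections`, `metadata`), the pids and the values are all invented, since the actual structure of the JSON extract is not shown in the slides.

```python
import json
from collections import Counter


def records_for_collection(records, collection_pid):
    """Keep only the records that belong to the given collection."""
    return [r for r in records if collection_pid in r.get("collections", [])]


def value_report(records, field):
    """Count distinct values of one metadata field, to spot inconsistent forms."""
    counts = Counter()
    for record in records:
        for value in record.get("metadata", {}).get(field, []):
            counts[value] += 1
    return counts


# Tiny stand-in for the full JSON extract (all pids and values are invented).
SAMPLE_EXTRACT = json.loads("""
[
  {"pid": "item:1", "collections": ["coll:A"],
   "metadata": {"language": ["English"], "type": ["Image"]}},
  {"pid": "item:2", "collections": ["coll:A"],
   "metadata": {"language": ["eng"], "type": ["Image"]}},
  {"pid": "item:3", "collections": ["coll:B"],
   "metadata": {"language": ["French"], "type": ["Text"]}}
]
""")

collection_a = records_for_collection(SAMPLE_EXTRACT, "coll:A")
language_counts = value_report(collection_a, "language")
```

A per-field count like `language_counts` surfaces the kind of issue the review step is looking for: here the same language appears under two different forms within one collection.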

11 Original FOXML DC Datastream

12 Item display in current IR

13 Transformed FOXML DCQ Datastream

14 Item display in Hydra North

15 JCR XML (excerpt) for item in Hydra North
You’ll notice that some of the phase 1 LOD enhancements aren’t there: language, item type, license. What we’d planned are language (click), item type (click), license (click), but … we’re in phase .9 at the moment; more on the next slide around challenges.

16 How we got to where we are now
challenges
when technology and metadata collide
displaying strings, storing URIs
repeatability of predicates, complexity of concepts
displaying metadata, not allowing users to add it
when old and new data meet
incorporating older values into new metadata models
consistency between strings and URIs, controlled and freeform values
respecting the past, preparing for the future
messy data
For languages, we want URIs but have no way to display the string. We can’t add both a string and a URI, or two URIs for a complex item type (journal article submitted), as the element is not repeatable. There is also the problem of displaying metadata but not allowing users to enter it (e.g., theses), and the downside of having a single simple data model. For some values, e.g., CC licenses, older versions exist; how do we get these to live nicely alongside the new versions available via the interface? We will not be capturing URIs or using controlled vocabs (at least not yet), so do we capture, or do we wait, meaning more cleanup down the line? How do we think about technical implementations of linked data for the future? How do we do what we can with the old data, build better functionality in the future, and allow it all to work together? Messy data ...
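One way to picture "displaying strings, storing URIs" for licenses, including older Creative Commons versions, is to keep the URI as the stored value and derive a display string from it. The sketch below is a hypothetical Python illustration, not something Hydra North does; the function name `describe_cc_license`, the label format and the regular expression are assumptions, based only on the public `creativecommons.org/licenses/<code>/<version>/` URI pattern.

```python
import re

# Matches URIs such as http://creativecommons.org/licenses/by-nc/3.0/
# (optionally followed by a jurisdiction segment such as "ca/").
CC_LICENSE_PATTERN = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<code>[a-z-]+)/(?P<version>\d+\.\d+)(?:/.*)?$"
)


def describe_cc_license(uri):
    """Return the stored URI plus a derived display label, or None if not a CC license URI."""
    cleaned = uri.strip()
    match = CC_LICENSE_PATTERN.match(cleaned)
    if match is None:
        return None
    code = match.group("code")
    version = match.group("version")
    return {
        "uri": cleaned,          # value to store
        "code": code,            # e.g. "by-nc"
        "version": version,      # e.g. "3.0"
        "label": f"CC {code.upper()} {version}",  # value to display
    }
```

Because the version is parsed from the URI rather than hard-coded, a 3.0 license on an older item and a 4.0 license chosen through a newer interface would be handled by the same code path.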

17 Where we go next
institutional repository
migration complete by September 1, 2015
additional LOD enhancements
begin work on digital collections and archives
simple service models, complex data models
more possibilities for LOD enhancements
integration with digital preservation and research data management platforms
more and different metadata, including preservation and structural

18 Reflections
look to and learn from the community
sometimes even the simplest things can be a challenge
phasing LOD enhancements is beneficial
working closely with your dev team can be a real benefit
be patient and enjoy the ride!

19 Thank you! Questions or comments?

20 Ontologies and vocabularies
dc
DCMI Type Vocabulary
dcterms
ualterms
Internet Media Types
ISO639-2 Languages
Creative Commons
XML Schema Datatypes
Bibliographic Ontology
VIVO Core Ontology
Geonames
MARC Relator Terms
LC Name Authority File
also on tap are TGM, Getty TGN, and more

21 Hydra North (Beta)
