Presentation transcript: Open Repositories 2015, Sharon Farnel, University of Alberta
1 Metadata at a crossroads: shifting ‘from strings to things’ for Hydra North
Open Repositories 2015
Sharon Farnel, University of Alberta

At the University of Alberta Libraries we are currently developing a Digital Asset Management System (‘Hydra North’, built on Hydra and Fedora 4) to bring all of our digital assets into one platform for discovery, access and preservation. As we move toward the DAMS, we find ourselves at a metadata crossroads. Today we would like to share a bit about our experience, in the hope that others may learn from it as they reach a metadata crossroads of their own.
2 Preview
- Where we came from
- Where we want to be
- Principles and questions
- Where we are now
- How we got to where we are now
- Where we go next
- Reflections
3 Where we came from
Our institutional repository has acted as the home for the intellectual output of our campus community, including theses, faculty and student publications, learning objects, small datasets and the like. Additional separate repositories have housed locally digitized content including ephemera, images, maps, newspapers, and archival materials. More recently, a research data repository was added to the mix to enable access to and preservation of research data from our campus community.
4 Where we came from
The metadata that underlies our existing repositories is as varied as the digital objects found within them. The standards involved (descriptive, administrative and structural) are numerous; a recently conducted metadata inventory includes MARC/MARCXML, MODS, DC (simple and qualified), ETDMS, DDI, METS/ALTO, CSDGM, NAP, EAD, DataCite and FOXML. In addition, the metadata has been created over many years following a variety of different guidelines, often by individuals expert in metadata creation, nearly as often by those more familiar with the collection itself. While some of the metadata was created natively in the standard in which it now exists, much has been converted from other standards.
5 Where we want to be
- consolidated discovery, access and preservation platform
- coherent index for all items in our digital collections
- enabled by:
  - a single data dictionary that captures the ontologies and vocabularies used across the system
  - metadata created according to the principles of linked data

Though the task seems daunting at times, we see great potential in RDF and linked data generally for providing options other than conversion to, or use of, a single ‘lowest common denominator’ standard. But there are still challenges. How do we select ontologies and vocabularies that will best serve the objects but also enable our metadata to be fully functional within the linked data environment? How do we ensure our choices enable integration with local and external discovery platforms? What enhancements need to be made to our legacy metadata to make it as linked data compatible as possible? Given ever-present resource constraints, what enhancements can we realistically undertake, and how do we prioritize them? What tools and resources are out there to help with enhancement work? In essence, how do we take metadata that was created under a record-based framework and make it maximally functional within a framework of ‘things, not strings’, in order to get the most out of the promise of linked data?
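To make the ‘strings to things’ shift concrete, here is a minimal sketch in Python with rdflib (the item URI and the legacy value are hypothetical, not drawn from Hydra North itself), contrasting a record-style literal with a resolvable URI for the same fact:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC, DCTERMS

    item = URIRef("http://example.org/items/1")  # hypothetical item URI

    g = Graph()
    # 'Strings': the record-based habit, a label only humans can interpret.
    g.add((item, DC.language, Literal("English")))
    # 'Things': a URI that other systems can resolve, match, and reuse.
    g.add((item, DCTERMS.language,
           URIRef("http://id.loc.gov/vocabulary/iso639-2/eng")))
    print(g.serialize(format="turtle"))

The first statement is only meaningful to a reader who recognizes the label; the second links the item to an identifier shared across the wider linked data environment.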
6 Principles and questions
≠ simply converting ‘standard x’ metadata from XML to RDF
≠ going from functional XML to semi-functional RDF
shift ‘from strings to things’ in a meaningful yet effective and efficient way that works in our context

questions:
- which ontologies and vocabularies to select?
- what enhancements are necessary?
- which can we realistically undertake?
- how do we prioritize them?
- what tools are out there that can help us?

We are very much focused on the fact that we don’t simply want to convert what we have to RDF, but rather to take this opportunity to rethink what we have created and find ways to express it as best we can in RDF (under the principles of linked data), knowing that we may, or will, be limited by the fact that this metadata was created under a different framework. We are adamant that we don’t want to go from functional XML to semi-functional RDF; if we are going to do this, let’s do it in such a way that we get the most out of the promise of linked data. Also, context is key.
7 Where we are now
- data dictionary created and in use
- beta version with institutional repository content (data and metadata) and functionality
- complex service model, simple data model
- linked data enhancements enumerated and prioritized
- preliminary workflows developed, refined and documented
8 How we got to where we are now
ontologies and vocabularies:
- looked to partners in the repository community (e.g., rdf-vocab) and to online resources (e.g., Linked Open Vocabularies)
- maintain a running list in the data dictionary
metadata enhancements:
- phase 1: item types, languages, licenses (consistent forms, include URIs)
- phase 2: place names, controlled subjects
- phase 3: person names, freeform subjects (consistent forms; include URIs?)

Phase 1 is the low-hanging fruit. Phase 2 is somewhat easier: place names with GeoNames integration, looking to LCSH for controlled subjects. Phase 3 covers person names and freeform subjects, with consistency of forms at least (for now). The phasing is based on available resources, but also on policy and technical functionality.
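The phase 1 pattern amounts to a lookup from legacy strings to URIs; a rough sketch in Python with rdflib (the lookup entries and the function name are illustrative, not the actual Hydra North mapping tables):

    from rdflib import Graph, URIRef
    from rdflib.namespace import DCTERMS

    # Legacy string -> URI lookups; coverage here is illustrative only.
    LANGUAGE_URIS = {
        "English": "http://id.loc.gov/vocabulary/iso639-2/eng",
        "eng": "http://id.loc.gov/vocabulary/iso639-2/eng",
        "French": "http://id.loc.gov/vocabulary/iso639-2/fre",
    }
    LICENSE_URIS = {
        "CC BY 4.0": "http://creativecommons.org/licenses/by/4.0/",
        "CC BY-NC-SA 3.0": "http://creativecommons.org/licenses/by-nc-sa/3.0/",
    }

    def enhance(g: Graph, item: URIRef, language: str, license_str: str) -> None:
        """Attach URI-based statements wherever a legacy string can be matched."""
        if language in LANGUAGE_URIS:
            g.add((item, DCTERMS.language, URIRef(LANGUAGE_URIS[language])))
        if license_str in LICENSE_URIS:
            g.add((item, DCTERMS.license, URIRef(LICENSE_URIS[license_str])))

Unmatched strings fall through untouched, which is what makes the phasing workable: each phase converts only the values it can match with confidence.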
9 How we got to where we are now
tools:
- JSON and jq
- Open Refine
- oXygen and XSLT
- ingest script (Fedora 3 FOXML to Fedora 4), based on a script by Jose Blanco (University of Michigan)
- GitHub
preliminary steps:
- user database migrated
- communities and collections migrated
- audit log facets defined

We extract all metadata from our Fedora 3 repository in JSON; jq is used to query it and run reports on a collection/community basis (to account for idiosyncrasies across the system). Open Refine is used for reviewing the reports and data, determining cleanup, and eventually reconciliation (greater integration down the line). oXygen and XSLT are used for stylesheeting the major changes (a single stylesheet, with others only when needed) to the datastream in the original FOXML. Weiwei is presenting on the ingest/migration script. GitHub holds the stylesheets and the script.
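The jq reporting step could equally be scripted as below; a sketch in Python (the field names ‘pid’, ‘title’, ‘language’, ‘type’ and ‘collections’ are assumptions about the shape of the JSON extract, which is not shown in the talk):

    import csv
    import json

    REPORT_FIELDS = ["pid", "title", "language", "type"]  # assumed field names

    def collection_report(extract_path, collection_pid, out_path):
        """Filter the full metadata extract to one collection and write a CSV
        that Open Refine can load for review and cleanup."""
        with open(extract_path) as f:
            records = json.load(f)
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=REPORT_FIELDS)
            writer.writeheader()
            for record in records:
                if collection_pid in record.get("collections", []):
                    writer.writerow({k: record.get(k, "") for k in REPORT_FIELDS})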
10 How we got to where we are now
typical workflow:
- collection(s) identified for migration
- JSON report(s) extracted (using jq) from the full JSON file
- JSON report(s) reviewed in Open Refine to identify issues
- oXygen used to review and revise the generic stylesheet (or create a collection-specific version) and to transform the source FOXML for migration
- migration initiated (using the latest DCQ datastream)
- original FOXML migrated as a separate datastream
- objects and metadata reviewed in a test instance

This is the typical workflow for an IR collection, which is what we’ve been working on, collection by collection. Note that collections, communities and users were migrated beforehand, so the hooks are there. We start with a JSON extract of all the metadata (produced via XSLT from the full FOXML): 1,210,325 lines. We select a collection for migration, e.g., the Circumpolar Digital Image Collection, and extract its metadata by collection PID (using jq). We use Open Refine to view the report and identify issues to be addressed, manually or in the stylesheet, e.g., data in the wrong element, a local identifier for the collection, etc. As collections are fairly homogeneous (except e-theses, which we will tackle later on :) we’ve decided on a single stylesheet, and will create additional ones on an as-needed basis. The transformed FOXML is passed to the dev team and migrated using the script, based on the latest DCQ datastream; the original FOXML is migrated as a separate legacy datastream. Items are then reviewed in the test environment.
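Once the stylesheet is settled, the transform step of that workflow can be batched; a sketch using Python and lxml (the paths and stylesheet name are hypothetical), applying the single generic stylesheet to every FOXML file in a collection:

    from pathlib import Path
    from lxml import etree

    def transform_collection(xslt_path, src_dir, out_dir):
        """Apply one XSLT stylesheet to every FOXML file in src_dir."""
        transform = etree.XSLT(etree.parse(xslt_path))
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for src in sorted(Path(src_dir).glob("*.xml")):
            result = transform(etree.parse(str(src)))
            (out / src.name).write_text(str(result))

    # e.g., transform_collection("generic.xsl", "foxml/circumpolar", "foxml/out")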
15 JCR XML (excerpt) for item in Hydra North
You’ll notice that some of the phase 1 LOD enhancements aren’t there: language, item type and license. What we’d planned are language (click), item type (click) and license (click). But … we’re at phase 0.9 at the moment; more on the next slide around challenges.
16 How we got to where we are now
challenges:
- when technology and metadata collide:
  - displaying strings, storing URIs
  - repeatability of predicates, complexity of concepts
  - displaying metadata, not allowing users to add it
- when old and new data meet:
  - incorporating older values into new metadata models
  - consistency between strings and URIs, controlled and freeform values
- respecting the past, preparing for the future
- messy data

For languages, we want URIs but have no way to display a string; we can’t add both a string and a URI, or two URIs for a complex item type (e.g., journal article, submitted version), as the element is not repeatable. There is also displaying metadata without allowing users to enter it (e.g., theses), and the downside of having a single simple data model. For some values, e.g., CC licenses, older versions exist; how do we get these to live nicely alongside the new versions available via the interface? Where we will not (yet) be capturing URIs or using controlled vocabularies, do we capture values now, or do we wait, meaning more cleanup down the line? How do we think about technical implementations of linked data for the future? How do we do what we can with the old data, build better functionality in the future, and allow it all to work together? And then there’s messy data ...
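One common workaround for the ‘displaying strings, storing URIs’ tension, offered here only as a sketch of the general technique and not as how Hydra North resolved it, is to store nothing but the URI and resolve a human-readable label from a locally cached lookup at display time:

    # Cached URI -> label pairs; in practice these might be harvested from the
    # vocabularies' own skos:prefLabel or rdfs:label values.
    LABELS = {
        "http://id.loc.gov/vocabulary/iso639-2/eng": "English",
        "http://purl.org/dc/dcmitype/StillImage": "Still image",
    }

    def display_label(uri: str) -> str:
        """Fall back to showing the raw URI when no label has been cached."""
        return LABELS.get(uri, uri)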
17 Where we go next
- institutional repository migration complete by September 1, 2015
- additional LOD enhancements
- begin work on digital collections and archives:
  - simple service models, complex data models
  - more possibilities for LOD enhancements
- integration with digital preservation and research data management platforms:
  - more and different metadata, including preservation and structural
18 Reflections
- look to and learn from the community
- sometimes even the simplest things can be a challenge
- phasing LOD enhancements is beneficial
- working closely with your dev team can be a real benefit
- be patient and enjoy the ride!
20 Ontologies and vocabularies
- dc
- DCMI Type Vocabulary
- dcterms
- ualterms
- Internet Media Types
- ISO639-2 Languages
- Creative Commons
- XML Schema Datatypes
- Bibliographic Ontology
- VIVO Core Ontology
- GeoNames
- MARC Relator Terms
- LC Name Authority File
Also on tap are TGM, Getty TGN, schema.org, and more.
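Kept in a data dictionary, a running list like this maps naturally onto namespace declarations; a sketch with rdflib (the ualterms URI is a placeholder, as the talk does not give it):

    from rdflib import Graph, Namespace
    from rdflib.namespace import DC, DCTERMS, XSD

    UALTERMS = Namespace("http://example.org/ualterms/")  # placeholder URI
    DCMITYPE = Namespace("http://purl.org/dc/dcmitype/")
    BIBO = Namespace("http://purl.org/ontology/bibo/")
    VIVO = Namespace("http://vivoweb.org/ontology/core#")
    GEONAMES = Namespace("http://www.geonames.org/ontology#")
    RELATORS = Namespace("http://id.loc.gov/vocabulary/relators/")

    g = Graph()
    for prefix, ns in [("dc", DC), ("dcterms", DCTERMS), ("xsd", XSD),
                       ("ualterms", UALTERMS), ("dcmitype", DCMITYPE),
                       ("bibo", BIBO), ("vivo", VIVO),
                       ("gn", GEONAMES), ("relators", RELATORS)]:
        g.bind(prefix, ns)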