Implications of Using PCDM in the Face of a Major Repository Migration


1 Implications of Using PCDM in the Face of a Major Repository Migration
Leaving Flatland: Implications of Using PCDM in the Face of a Major Repository Migration
Steve Van Tuyl, Hui Zhang, Michael Boock
Center for Digital Scholarship and Services, Oregon State University Libraries & Press

2 Content Types in SA@OSU
Theses and Dissertations: 24,842
Technical Reports: 11,185
Articles: 7,788
Presentations and Posters: 1,041
Audio and Video: 159
Datasets: 59
Notes: Oregon State University’s institutional repository has been in production since [...]. It contains over 58,000 total items: primarily theses and dissertations and technical reports (mostly government documents and extension and experiment station publications), plus items classified as books (primarily digitized books out of copyright), datasets, conference proceedings and presentations, multimedia, and, increasingly since the university passed an OA policy in 2013, faculty articles. Mediated deposit across many collections including ETD, which is require[...]

3 Multi Part Files in SA@OSU
25% of objects in SA are multi-file
11% of ETD objects are multi-file
Notes: How much content do we have that has multiple parts?
ETDs with multiple bitstreams: diagram indicating what those bitstreams are (file extensions). Steve may have a script that does this.
All file types with multiple bitstreams: diagram indicating what those bitstreams are (file extensions).
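The kind of count behind these percentages can be sketched in a few lines. This is a minimal illustration, not the script mentioned in the notes: it assumes a hypothetical export in which each item ID maps to a list of bitstream file names, and the sample data is invented.

```python
from collections import Counter
from pathlib import PurePosixPath


def multi_file_summary(items):
    """Return (share of multi-file items, extension counts for those items).

    `items` is a hypothetical mapping of item ID -> list of bitstream names.
    """
    total = len(items)
    multi = {item_id: names for item_id, names in items.items() if len(names) > 1}
    # Tally file extensions only for items that hold more than one bitstream.
    extensions = Counter(
        PurePosixPath(name).suffix.lower().lstrip(".") or "(none)"
        for names in multi.values()
        for name in names
    )
    share = len(multi) / total if total else 0.0
    return share, extensions


# Invented sample data, for illustration only.
sample_items = {
    "item-1": ["thesis.pdf"],
    "item-2": ["thesis.pdf", "appendix.zip"],
    "item-3": ["data.nc", "readme.txt", "data.tar"],
    "item-4": ["poster.pdf"],
}
share, extensions = multi_file_summary(sample_items)
```

Running the same function over ETD items only would give the ETD-specific share.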

4 ScholarsArchive@OSU - Migration Project
Current infrastructure doesn’t meet a number of needs:
  Flexible reporting & analytics
  Multi-file objects & relationships
Decision to Hydra-fy:
  Modularity
  Unification of developer base
  PCDM
Notes: Unification of developer base: most library development at OSULP is now in Rails (OregonDigital, SA, and others). There is less familiarity with Java (DSpace).

5 DSpace Simple Dissertation

6 DSpace Multi-file Example
Multi-file dataset and an article that used the dataset, both in ScholarsArchive

7 PCDM (probably for the 100th time today)
Collection
Object/Work
File
Notes: Talk here about the extreme flexibility of PCDM and the need to figure out how we want to represent different types of objects and their relationships with other objects.

8 Simple Document (dissertation)
[Diagram: Work "Dissertation" with a File "File [pdf]"]
Notes: Specific examples; these won’t look a lot different in Hydra.

9 Multi-Part Dissertation
[Diagram: Works "Article 1" and "Article 2" within a multi-part dissertation Work; File nodes labeled "File [pdf]"]
Notes: With more complex objects, we’ll be able to represent them differently, assign metadata more granularly, and represent relationships between objects that formerly either had to be packaged together as a single item or described separately, with no way to express the relationships between them (except in notes or by putting items together in collections, which still does not make clear how they relate).
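The three PCDM building blocks (Collection, Object/Work, File) and a multi-part dissertation like the one on this slide can be sketched as plain data classes. This is an illustrative model only, not Hydra code: the class names, titles, and file names are assumptions, and the exact shape of the slide's diagram is not recoverable from the transcript.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PcdmFile:
    name: str


@dataclass
class PcdmObject:
    """A Work: may contain other Works (pcdm:hasMember) and Files (pcdm:hasFile)."""
    title: str
    members: List["PcdmObject"] = field(default_factory=list)
    files: List[PcdmFile] = field(default_factory=list)


@dataclass
class PcdmCollection:
    """Groups Works; simplified here (PCDM also lets collections hold collections)."""
    title: str
    members: List[PcdmObject] = field(default_factory=list)


def count_files(work: PcdmObject) -> int:
    """Count files on a Work and, recursively, on all of its member Works."""
    return len(work.files) + sum(count_files(m) for m in work.members)


# Hypothetical multi-part dissertation: each article is its own Work,
# so it can carry its own metadata and relationships.
dissertation = PcdmObject(
    title="Dissertation",
    files=[PcdmFile("dissertation.pdf")],
    members=[
        PcdmObject("Article 1", files=[PcdmFile("article1.pdf")]),
        PcdmObject("Article 2", files=[PcdmFile("article2.pdf")]),
    ],
)
etds = PcdmCollection("Theses and Dissertations", members=[dissertation])
total_files = count_files(dissertation)
```

A simple dissertation is the same structure with no member Works: one `PcdmObject` holding a single `PcdmFile`.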

10 Simple Dataset
[Diagram: Work "Dataset" with Works "NetCDF Datafile" (File [nc]), "Compressed Files" (File [tar]), and "Readme" (File [txt])]
Notes: For datasets, DSpace is unable to represent hierarchy. It just lists a bunch of files with minimal description (size, format, brief description).

11 Not so simple dataset
Same data, ‘properly’ represented
[Diagram: Work "Dataset" with Works "Readme" (File [txt]) and "Data File" (File [nc]), a "Logical File Grouping" annotation, and seven Works "Data File 1" through "Data File 7", each with a File [nc]]
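To make the contrast with a flat bitstream list concrete, here is a small sketch that walks a nested dataset and records where each file sits in the hierarchy. The nested dictionary is a simplified, hypothetical stand-in for the diagram (titles and file names are invented), not the actual OSU model.

```python
def flatten(work, path=()):
    """Yield (path_of_titles, file_name) for every file under a nested work."""
    here = path + (work["title"],)
    for file_name in work.get("files", []):
        yield here, file_name
    for member in work.get("members", []):
        yield from flatten(member, here)


# Hypothetical dataset: a readme plus a logical grouping of seven data files.
dataset = {
    "title": "Dataset",
    "members": [
        {"title": "Readme", "files": ["readme.txt"]},
        {
            "title": "Logical File Grouping",
            "members": [
                {"title": f"Data File {i}", "files": [f"data_{i}.nc"]}
                for i in range(1, 8)
            ],
        },
    ],
}

rows = list(flatten(dataset))
# A flat, DSpace-style listing keeps only the file names and drops the paths.
flat_bitstreams = [file_name for _, file_name in rows]
```

The `rows` list keeps the grouping each file belongs to; `flat_bitstreams` shows what remains when that structure is discarded.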

12 Research Paper & Dataset
[Diagram: Work "Dataset" with a Work "Readme" (File [txt]) and four "Data File" Works (File [csv] each), linked by "Related (dcterms:isReferencedBy)" to Works "Pre-Print" (File [pdf]), "Final Manuscript" (File [pdf]), and "Figure" (File [tif])]
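The relationship on this slide can be written down as RDF-style statements. The sketch below builds N-Triples-like lines with plain string formatting. The PCDM and DCMI Terms namespace URIs are the published ones; the `example.org` base URI, the identifiers, and the direction of the link (the dataset is referenced by the paper) are assumptions made for illustration.

```python
PCDM = "http://pcdm.org/models#"
DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/works/"  # hypothetical base URI


def triple(subject, predicate, obj):
    """Format one statement in N-Triples style."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe(dataset_id, member_ids, paper_ids):
    """Link a dataset to its member works and to the papers that reference it."""
    lines = []
    for member_id in member_ids:
        lines.append(triple(BASE + dataset_id, PCDM + "hasMember", BASE + member_id))
    for paper_id in paper_ids:
        # Assumed direction: the dataset is referenced by the paper.
        lines.append(
            triple(BASE + dataset_id, DCTERMS + "isReferencedBy", BASE + paper_id)
        )
    return lines


lines = describe("dataset", ["readme", "data-file-1"], ["final-manuscript"])
```

Because the relationship is an explicit statement rather than a note or a shared collection, a discovery interface could follow it from the dataset to the paper.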

13 Datasets, Paper, Dissertation
[Diagram: Work "Dataset" with a Work "Readme" (File [pdf]) and four "Data File" Works (File [zip] each), linked by "Related (dcterms:isReferencedBy)" to a Work "Dissertation" with a "Final Manuscript" (File [pdf]); an additional File [pdf] node is also shown]

14 Challenges - Data Modeling Consistency
Managing the flexibility of PCDM
Consistency in:
  Modeling structure (especially throughout the migration process)
  Internal representation of object types (consistent with the data modeling)
  Community representation of object types
Diversity of item types (documents, AV, data, etc.)
  A large percentage are similar (single documents)
Notes: As we've demonstrated, we have a wide variety of item types with different relationships in the current repository. How do we ensure that these items and their relationships are migrated so that they are represented consistently in the new repository? For example, all theses with associated datasets should be represented according to our predetermined model.

15 Solutions - Data Modeling Consistency
Establish models for representation of broad types of content
  Single-file objects
  Multi-file objects
  Versioned objects
Curation of future repository content will adhere to these models
  Being able to identify new objects that require different models
Engage with the PCDM community to identify content type guidance/standards
  Interoperability
  Common language for troubleshooting
Notes: Plan for how we represent multi-file content deposited to the new repository. Determine any retrospective work that needs to be done to build relationships between existing content.

16 Challenges - Intent
At times researchers know better than librarians do about the structure of their datasets
  The diversity of dataset sources and formats
  The volume/size of datasets
  The relationships among files inside a dataset and to external/derivative resources (e.g. an article)
How do users want to discover objects?
Notes: The diversity of dataset sources and formats: whether files should be zipped together or represented as individual files.

17 Solutions - Intent
Balance between researcher/depositor intent with respect to their content and modeling
  None of us are right all the time
  Be transparent about expectations and procedures
  Try to be consistent with the existing data models
Need to consider at what level of granularity users want to discover content
  Representation of compound objects in a discovery interface? That's hard
  Come to #OR2017

18 Challenges - Migration to Hydra Land
We cannot babysit every item we migrate.
Migration of 58,000+ items in an automated manner
Retain all metadata, including structural information such as the collections and communities to which items belong in DSpace
  Though we don’t want to adhere too much…
  Strike a balance between familiar representation and more functional modeling
Are there issues with internal consistency if we allow changes in the structure of objects post-migration? Yes.
Notes: The community/collection structure in DSpace represents the institution's organizational chart. A collection name carries meaning, such as content type or owning institute.
On internal consistency: we have to make compromises during migration, but new items will be held strictly to the rules.

19 Solutions - Migration to Hydra Land
Identify what types of objects require manual or hands-on migration
  Datasets
  Multi-file dissertations & supplementary files
  Faux relationships
Capture community & collection structure in item-level metadata to inform creation of PCDM collections (or not)
  Migrated collections will represent object groupings (e.g. Biochemistry) rather than object types (Biochemistry posters)
After migration, reevaluate data models for new content to allow flexibility but pay attention to consistency
Notes: Capturing community & collection structure: manually audit the 400 collections in SA to create a crosswalk of community/collection structure to item-level metadata for migration.
Reevaluating data models after migration: e.g., multi-part dissertation, and Poster as a type facet.
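A crosswalk of this kind could look something like the sketch below, in which each DSpace community/collection pair maps to an object grouping plus a type facet. Everything here is hypothetical: the field names (`grouping`, `resource_type`, `needs_review`), the collection names, and the sample item are invented for illustration and are not taken from the OSU migration.

```python
# Hypothetical crosswalk: (community, collection) -> item-level metadata.
CROSSWALK = {
    ("Biochemistry", "Biochemistry Posters"): {
        "grouping": "Biochemistry",
        "resource_type": "Poster",
    },
    ("Biochemistry", "Biochemistry Theses"): {
        "grouping": "Biochemistry",
        "resource_type": "Thesis",
    },
}


def apply_crosswalk(item, crosswalk):
    """Return a copy of the item with its DSpace structure recorded as metadata."""
    enriched = dict(item)
    # Keep the original structure so nothing is lost during migration.
    enriched["dspace_community"] = item["community"]
    enriched["dspace_collection"] = item["collection"]
    mapped = crosswalk.get((item["community"], item["collection"]))
    if mapped is None:
        # Collections missing from the audit are flagged for a person to check.
        enriched["needs_review"] = True
    else:
        enriched.update(mapped)
        enriched["needs_review"] = False
    return enriched


poster = {"id": "p1", "community": "Biochemistry", "collection": "Biochemistry Posters"}
unknown = {"id": "u1", "community": "Forestry", "collection": "Forestry Maps"}
migrated_poster = apply_crosswalk(poster, CROSSWALK)
migrated_unknown = apply_crosswalk(unknown, CROSSWALK)
```

Recording the original community and collection alongside the new grouping leaves the decision about creating PCDM collections open until after migration.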

20 What’s Next A repository-wide analysis to find logical object types that require specific models Setting local model consistency Identifying relatedness of existing content Identify items that are too complex to automate (~10%?) Migration Automation Pilot with test collections Migrate majority of contents by data models pertaining to item types Either hand migrate remaining content OR modify automation to meet complexity Service critical collections such as ETD Discovery Identify items that are too complex to automate (~10%?) Unusual structure Parts that belong elsewhere (supplementary datasets with dissertations) Either hand migrate remaining content OR modify automation to meet complexity: depend on the number of items that require additional treatment Service critical collections such as ETD: it requires special consideration to assure smooth, minimum downtime transition (use DSpace for ETD until Sufia is thoroughly tested)

21 ? Thank You

22

23 This is what our abstract says we’re talking about...
Increasingly, repository managers and data curators are identifying gaps between current repository functionality and dataset preservation and dissemination requirements. Supplementary files such as datasets and software code are increasingly deposited together with documents (e.g. dissertations, faculty research articles). Datasets are also increasingly being deposited to repository platforms with the expectation that the structure of the data can be properly represented, along with the relationships between the data and other repository content. In our current repository, deposits are realized without regard for representation of the relationships between file types or differentiation in the description of files. In 2015, Oregon State University Libraries and Press (OSULP) began to migrate the institutional repository from DSpace to the Hydra-Sufia platform. This presentation describes and demonstrates OSULP’s prototype repository architecture that explicitly defines relationships between datasets and the publications using these datasets. We demonstrate how we use the Portland Common Data Model (PCDM) to contextualize files in relationship with other resources in the repository. We provide concrete examples of how this architectural migration improves the representation and publication of repository content and the implications for migration of a large repository.

24
[Diagram: Hydra::Works::Collection (a pcdm:Collection) pcdm:hasMember Hydra::Works::Work (a pcdm:Object); Hydra::Works::Work pcdm:hasMember Hydra::Works::FileSet (a pcdm:Object); Hydra::Works::FileSet pcdm:hasFile OriginalFile, Thumbnail, and ExtractedText (each a pcdm:File)]
Key to metadata markers: A = Access, B = Bitstream, D = Descriptive, T = Technical

25 DSpace Flatness Example
The only bitstream-level metadata are file name, size, format, and description. These are not indexed, so it is hard to tell which file is which and which files are important.

