Digital Preservation in Hydra/Fedora

March 24, 2015

We are going to provide an overview of current digital preservation work in the world of the Hydra/Fedora framework.

Get A head on Your Repository

About Hydra/Fedora
• Flexible Extensible Digital Object Repository Architecture
• Open-source project
• Provides a platform for digital preservation and presentation
• Used by hundreds of organizations, with over 52 Fedora Members contributing financially; Yale is one of these
• Originally developed at Cornell, now led by the Fedora Project Steering Group under the stewardship of DuraSpace.org
• Yale is also a Fedora development partner, and Mike Friscia serves on the Fedora Leadership Committee
• Currently actively engaged in development of Fedora 4

Clearly the big decision reflected in our migration path was our decision to standardize on Hydra/Fedora two years ago. Fedora is a widely used system: the current Fedora registry, which is by no means complete, shows 140 institutions in the U.S. and about as many overseas. Before we standardized on Fedora here at YUL, we had an abundance of digital collections platforms. Some were one-offs, and some were meant to be the future but failed to scale up (ContentDM). Many were entirely custom; I'll talk more later about the migration headaches this is causing.
Sponsors (2013)
• 3TU.Datacentrum / TU Delft Library • Arizona State University • Brown University • Charles Darwin University • Colorado Alliance of Research Libraries • Columbia University • Creighton University • FIZ Karlsruhe • Indiana University Bloomington / Indiana University Purdue University • London School of Economics & Political Science • LYRASIS • National Library of Finland • National Library of Medicine • National Library of Wales • Northeastern University • Northern Illinois University • Northern Territory Library • Northwestern University • Oregon State University • Penn State • Rutgers University • Smithsonian Institution • Stanford University • Texas A&M Libraries • Texas Digital Library • Université de Liège • University of California Los Angeles • University of California San Diego • University of California Santa Barbara • University of Cambridge • University of Cincinnati • University of Hong Kong • University of Manitoba • University of New South Wales • University of North Carolina at Chapel Hill • University of Oxford • University of Prince Edward Island • University of Victoria • University of Virginia • University of Wisconsin • Vanderbilt University Library

In-kind contributors
• Columbia University • discoverygarden inc. • FIZ Karlsruhe • Max Planck Digital Library • Media Shelf • Stanford University • University of California, Los Angeles • University of California, San Diego • University of New South Wales • University of North Carolina, Chapel Hill • University of Prince Edward Island • University of Texas, Austin • University of Virginia • University of Wisconsin • Virginia Tech • Yale University

Hydra
• Began in 2008 as a collaboration between Stanford, UVA, Univ. of Hull, and Fedora Commons
• YUL joined in 2013 as the 18th member
• Membership now up to around 27; recent additions include Princeton, Cornell, and Case Western
• Another 25 or more institutions are working in the Hydra framework without yet being formal members, including Brown, Johns Hopkins, Trinity College Dublin, Oxford, UC Berkeley, and others

Our second big decision after standardizing on Fedora was to join the Hydra project in 2013. Project Hydra was formed in 2008 at the Open Repositories conference by a small group interested in solving a similar set of repository problems using the Fedora Commons software. Fedora is very extensible but not easily customized, and once you do make customizations it becomes more difficult to keep the software up to date. This group set out to design a way to make it easier to customize Fedora.

Hydra Partners
OR = Open Repositories Conference

By 2012 the original group had met its goals and had brought in new partners along the way, almost tripling in size. Yale joined shortly after, becoming the 18th partner in 2013.

DuraSpace (f) • Stanford University (f) • University of Hull (f) • University of Virginia (f) • MediaShelf • University of Notre Dame • Northwestern University • Columbia University • Penn State University • Indiana University • London School of Economics and Political Science • Rock and Roll Hall of Fame and Museum • Royal Library of Denmark • Data Curation Experts • WGBH • Boston Public Library • Duke University • Yale University • Virginia Tech • University of Cincinnati • Princeton University Library • Cornell University • Oregon Digital (University of Oregon and Oregon State University) • Case Western Reserve University • Tufts University • Duoc UC • University of Alberta

Currently there are 27 partner institutions. It is estimated that by the end of this year there will be 34 partners, almost double the size of the community of just a couple of years ago.

A Worldwide Presence

It is difficult to track the exact number of institutions that have adopted Hydra, given the open-source nature of the framework. As of this month, 62 institutions have reported that they have adopted Hydra. This map gives a glimpse of where Hydra is in use, with the main concentration right around us in the Northeast.

Hydra at Yale

What Is Hydra?
• A framework for repository-powered applications, with multiple, tailored UIs and a robust repository back end
• One body, many heads
• A set of solution bundles
• A community
• If you want to go fast, go alone. If you want to go far, go together.

At Open Repositories in 2009 they gave a presentation called Project Hydra: Designing & Building a Reusable Framework for Multipurpose, Multifunction, Multi-institutional Repository-Powered Solutions. In the talk they outlined a three-year plan to create an application and middleware framework that, in combination with the Fedora Commons software, would create reusable, multipurpose repository solutions. In addition, they were seeking to build an open-source community around a common set of goals and principles to create a long-lasting solution.

Hydra Interface

[Diagram of the Yale Hydra stack. Labels: (IT use only); Single Image Zoom; Bookreader; Complex Object Display; Downloadable PDF; Data Import; Hydra-Head: creating and managing objects (CRUD); Blacklight: discovering and viewing objects (R); Active Fedora and Solrizer; Search/Facet Logic; Hydra Access Controls; Image Request; Metadata; Images; Fedora (Preservation); Solr (Index); Ladybird (Yale's Cataloging Tool); Link to Images; Image Retrieval; Managed Storage; Media Server; RSS; SQL Server]

In our local instance of Hydra, we use Ladybird, Fedora, and Blacklight together with the Hydra framework; we call the combination our Hydra Stack. Ladybird is responsible for packaging content for ingest as an object into Hydra, and Blacklight is used for public access and discovery. Objects are ingested according to content models. These predefined models set the requirements an object must meet for ingest to succeed, including the specific files that must be present.

Content model

Models are designed based on the type of content and in general include information such as the digital format of the files and the associated metadata requirements. The diagram here shows a simple content model for a still image. It indicates that for ingest to take place, the package must contain a TIF image, derivative files, descriptive metadata, and some additional metadata files used for rights management. There are some optional files, such as a text file for OCR output or a PDF. Content models are also used to express relationships to other content models when objects need complex relationships with other objects, such as the pages of a book relating to the object that represents the entire book.
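
As a rough illustration of the still-image model just described, the sketch below checks a package against a list of required and optional parts. The class, the role names (tif_master, derivatives, and so on), and the function are invented for this example and are not taken from Ladybird or Hydra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentModel:
    """Hypothetical description of what an ingest package must contain."""
    name: str
    required: frozenset
    optional: frozenset = frozenset()


# Illustrative still-image model; the role names are assumptions for this sketch.
STILL_IMAGE = ContentModel(
    name="still_image",
    required=frozenset(
        {"tif_master", "derivatives", "descriptive_metadata", "rights_metadata"}
    ),
    optional=frozenset({"ocr_text", "pdf"}),
)


def validate_package(model, roles_present):
    """Return a list of problems; an empty list means the package may be ingested."""
    present = set(roles_present)
    problems = [f"missing required: {r}" for r in sorted(model.required - present)]
    allowed = model.required | model.optional
    problems += [f"not in model: {r}" for r in sorted(present - allowed)]
    return problems
```

A package carrying all four required parts plus a PDF passes with no problems, while a package holding only the TIF master is rejected with three missing parts.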

Access Conditions
• Defined for each file in a content model
• Wide range of authorization definitions
• Customizable

Access conditions are set for each file that is part of the content model. The current system lets us assign access to each file separately, instead of the more common Fedora approach of using a single access condition for the entire content model. We chose this path because, for our image collections, the level of authorization required to access the master files needs to be separate from that of the derivative files. Access can be granted as broadly as open access for anyone in the world, or as narrowly as a single individual user.
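
To make the per-file idea concrete, here is a minimal sketch in which each file in an object carries its own access condition. The AccessCondition class, the group and user names, and the file names are all hypothetical; the actual Hydra access control implementation is not shown in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCondition:
    """Hypothetical per-file rule: open to all, to named groups, or to named users."""
    open_access: bool = False
    groups: frozenset = frozenset()
    users: frozenset = frozenset()


def can_access(condition, user=None, user_groups=()):
    """Return True if the requester satisfies the file's access condition."""
    if condition.open_access:
        return True
    if user is not None and user in condition.users:
        return True
    return bool(condition.groups & set(user_groups))


# One object, three files, three different levels of access (all names invented).
image_object = {
    "derivative.jpg": AccessCondition(open_access=True),
    "master.tif": AccessCondition(groups=frozenset({"library_staff"})),
    "rights.xml": AccessCondition(users=frozenset({"curator1"})),
}
```

With this layout an anonymous visitor can view the derivative, only members of the staff group can retrieve the master, and a single named user can read the rights file.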

Ingest Workflow

Once the package is created, ingest from Ladybird to Hydra is controlled by manual and automated processes. In some cases Ladybird is transparent and the content passes through it without user intervention; in other cases each step of the ingest process is user controlled in Ladybird. Most of what we have published to Hydra has used the manual ingest path: once the content is ready, someone publishes it. For the Kissinger papers, the entire process is automated.

Research Data into Hydra
• Colectica software exports contents in BagIt format
• Bag enters a watched folder in Ladybird
• Ladybird validates the bag contents: checksum validation, file characterization
• Ladybird maintains the original file hierarchy as a collection of complex objects
• Each Ladybird object is mapped to an Unstructured Content Model
• Each content model is then ingested into Hydra

One of the projects we have been working on is for research data. In this project the files and metadata are curated in Colectica, a system external to our Hydra stack. Once the content is ready, Colectica delivers it to Ladybird using the Library of Congress BagIt specification. Ladybird then imports the content, maintains the original hierarchy, and passes the contents on to Hydra for ingest. During the packaging process Ladybird runs several processes for file validation, format recognition, and fixity checks. At this time the technical data remains in Ladybird, and only the files and minimal metadata are passed to Hydra for storage. As this project progresses, we will increase the amount of data we send to Hydra and include the event-based metadata typically expressed as PREMIS events, along with the technical profiles for each of the files.
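
The checksum validation step can be sketched as follows. This is not Ladybird's code; it is a minimal illustration that reads a BagIt payload manifest (manifest-sha256.txt, one "checksum path" entry per line) and recomputes each digest. A complete validator would also check bagit.txt, the tag manifests, and payload files that are not listed in the manifest.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large payload files are not read into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def validate_bag(bag_dir, algorithm="sha256"):
    """Return {relative_path: 'ok' | 'missing' | 'checksum mismatch'} per manifest entry."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            results[rel_path] = "missing"
        elif file_digest(target, algorithm) != expected.lower():
            results[rel_path] = "checksum mismatch"
        else:
            results[rel_path] = "ok"
    return results
```

A watched-folder process could call validate_bag on each arriving bag and refuse to continue unless every entry comes back "ok".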

Unstructured Content Model

A challenge we faced with the research data project was developing a content model that would work with the content. Normally, when designing a content model, we need to know up front each file that will be stored in the model, and with research data we do not. To get around this we developed what we call the unstructured content model, which can be seen in this diagram. It is basically a content model that only requires metadata about the entire model. Including a file is optional, and it does not matter what type of file is included. When a file is present, we run file characterization software to create a profile and then pass this metadata into Hydra as part of the model. The reason the model does not require a file is that we chose to store the exact hierarchy in its original form. In the BagIt specification, the bag is essentially a folder filled with more folders and files. In the unstructured approach, we treat a folder level the same as a file so that each is properly represented.
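
A minimal sketch of the "folder treated the same as a file" idea: walk a bag's payload directory and emit one record per folder and per file, each pointing at its parent so the original hierarchy can be rebuilt. The record fields (path, parent, kind, has_file) are assumptions made for this illustration, and the characterization metadata mentioned above is left out.

```python
from pathlib import Path


def unstructured_objects(payload_dir):
    """Return one record per folder or file under payload_dir, parents before children."""
    root = Path(payload_dir)
    # Sorting on the relative path parts guarantees a folder precedes its contents.
    entries = sorted(root.rglob("*"), key=lambda p: p.relative_to(root).parts)
    objects = []
    for path in entries:
        rel = path.relative_to(root)
        parent = rel.parent.as_posix()
        objects.append(
            {
                "path": rel.as_posix(),
                "parent": None if parent == "." else parent,
                "kind": "folder" if path.is_dir() else "file",
                "has_file": path.is_file(),  # folders become objects with no file
            }
        )
    return objects
```

Each record would map to one object in the unstructured model: folders carry metadata only, while files carry metadata plus the file itself.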

Another project we recently worked on was integration with the Digital Preservation Network (DPN). In this pilot project we selected roughly 5000 objects from Hydra to export using the same BagIt specification and shipped them to Stanford for ingest into DPN. Once the Hydra objects were selected, I used an application in Ladybird to recall the contents and create bags on a temporary server, and then shipped them off to Stanford, literally using FedEx for the pilot. This prompted some brainstorming about eventually creating a mechanism that would give Hydra adopters a direct export link from their repositories to the Digital Preservation Network.

DPN
• The Stanford Digital Repository runs on Fedora 3. They had 147 TB as of last June.
• Chronopolis at UCSD is one of the few certified Trusted Digital Repositories. They are moving to Fedora 4.
• APTrust at UVA uses Fedora to manage metadata with pointers to content in Amazon S3 and Glacier, and administrative functions in Hydra and Blacklight.

Digital Preservation in Hydra
The Hydra community has been actively working on digital preservation for several years. Three of the more prominent projects to come from this work are the Hydra applications Sufia, Argo, and Chronos.

Hydra Solution Bundles
• Sufia • CurateND • ScholarSphere • HydraDAM • Argo • Chronos

Sufia is generally used as an institutional repository application that allows self-deposit of files. Using a simplified method for managing access, users can load content into Sufia and use it to create collections of materials. There are many successful Sufia deployments, including Penn State's ScholarSphere and Notre Dame's CurateND.

According to their site, CurateND offers researchers a secure platform for long-term preservation that meets the requirements of funding agencies. The system also offers an automated process to migrate files in danger of obsolescence to new formats while retaining the original files. This is beneficial because migrating to a new format sometimes results in a loss of functionality. Keeping the original is the best option, since emerging technologies for emulating obsolete computer environments make it possible to work with the old files in their original computing environment.

Penn State offers ScholarSphere, which is very similar in nature to CurateND. In addition to the same services offered by Notre Dame, ScholarSphere offers a scheduler service that performs regular file characterization and fixity checks to ensure document integrity over time. It also uses the same retention policies as CurateND, so the original digital files are kept in the repository.
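
The scheduling side of regular fixity checking can be sketched in a few lines: given when each object was last checked, pick the ones that are overdue. This is only an illustration of the idea, not ScholarSphere's scheduler; the function name and the 90-day default interval are assumptions.

```python
from datetime import datetime, timedelta


def due_for_fixity(last_checked, now, interval=timedelta(days=90)):
    """Return identifiers that were never checked or whose last check is older
    than `interval`, longest-overdue first. The 90-day default is illustrative."""
    due = [
        (when or datetime.min, ident)
        for ident, when in last_checked.items()
        if when is None or now - when >= interval
    ]
    return [ident for _, ident in sorted(due)]
```

A nightly job could feed the returned identifiers to whatever routine recomputes and compares the stored checksums.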

The most recent addition to Hydra is from the Royal Library in Copenhagen. They released Chronos, an application that sits alongside their deployment of a Blacklight-based Hydra stack, much like the one we have deployed here at Yale. Chronos provides an administrative interface for managing the content in the system. Management options include monitoring files for format obsolescence, running fixity checks in bulk, and scheduling additional tasks. Their approach was to develop a robust set of digital preservation policies first, and then create a strategy to implement the policies and secure the budgets. Last summer and fall they worked on the final stages of the software specifications, which led to a development process that started this past December and to the launch of the new application earlier this month.

Preservation Profiles
[Table: each preservation profile mapped to Encrypted, Integrity check, and Storage pillars I, II, III, IV, V-VIII, IX, X, XI]

Preservation profiles:
1: Storage without bit preservation
2: Born-digital collection material that has access restrictions
3: Legally deposited born-digital material that is not in the Webarchive
4: Born-digital collection material without access restrictions
5: Retro-digitized (expensive) materials with analog copies
6: Secret digital materials
7: Top secret digital materials

An interesting contribution to the Hydra community was their design of preservation profiles. As you can see here, the profiles are set up to take a general classification of content and match it with appropriate storage solutions. The Royal Library is charged, by law, with the long-term preservation, automated embargo management, and automated format migration of millions of government documents, a collection that ranges from open-access materials to top-secret government documents. Each of the storage pillars they use has a different associated cost, so they can project the cost model for each type of profile. In their calculations, the cost models include the price of storage, associated hardware, and both technical and non-technical staffing requirements.
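
One way to picture the cost projection is the sketch below, in which a profile is simply the set of storage pillars it uses and each pillar carries per-terabyte costs for storage, hardware, and staffing. The structure is an assumption made for illustration, and every figure in the usage note is a placeholder, not a Royal Library number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pillar:
    """Hypothetical annual cost components of one storage pillar, per terabyte."""
    storage_per_tb: float
    hardware_per_tb: float
    staff_per_tb: float

    @property
    def cost_per_tb(self):
        return self.storage_per_tb + self.hardware_per_tb + self.staff_per_tb


def annual_cost(profile_pillars, pillars, terabytes):
    """Project the yearly cost of keeping `terabytes` on every pillar a profile uses."""
    return terabytes * sum(pillars[name].cost_per_tb for name in profile_pillars)
```

For example, with placeholder pillars I = Pillar(100, 20, 30) and II = Pillar(50, 10, 15), a profile that stores 10 TB on both pillars projects to 10 × (150 + 75) = 2250 per year in whatever currency the inputs use.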

Future Development

Going forward, there are three major developments taking place in the Hydra/Fedora communities: the release of Fedora 4, auditing in Fedora 4, and the Portland Common Data Model.

Fedora 4 Roadmap:
• Audit Service
• Portland Common Data Model
• Migration Tools
• Asynchronous Storage
• Linked Data Platform
• Managed External Data Streams

Fedora 4 was released into production this past December. It represents a significant change in the platform, using modern technologies to support large-scale repositories, and it adopts a much more modern approach to application lifecycle management in the development process. The effort is led by DuraSpace, which employs two full-time project managers; the programming is donated by partner institutions, including contributions from programmers in Library IT. Strategic planning, budget, and governance of the project are managed by a steering committee, to which I was elected this past December.

Fedora 4 Auditing
• Track events: agent, date, activity, entity
• Allow import/export of events
• High performance
• Stored separate from repository entities
• Export in RDF format
• Provide SPARQL-Query search endpoint

A recent poll of the more than 170 Fedora adopters set the priorities for missing features. At the top of the list, above tools for migrating from earlier versions of Fedora, was the need to store audit data. Fedora does a great job of performing a wide range of functions on content in the repository, but it is missing features to record what it did and when each event took place. The design process spanned the last six weeks, and the implementation of auditing in Fedora 4 is expected to be released in May.
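
To show what an exported event might look like, the sketch below serializes one event (agent, date, activity, entity) as N-Triples. The choice of the W3C PROV-O vocabulary and the example.org URIs are assumptions made for this illustration; the talk does not say which vocabulary the Fedora 4 audit service uses.

```python
from datetime import datetime, timezone

# Vocabulary choice is an assumption of this sketch, not the Fedora 4 design.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def audit_event_ntriples(event_uri, agent_uri, entity_uri, activity, when):
    """Serialize one audit event as a list of N-Triples lines.

    `when` must be a timezone-aware datetime; it is written out in UTC.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Escape the characters that would break an N-Triples string literal.
    label = activity.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return [
        f"<{event_uri}> <{RDF_TYPE}> <{PROV}Activity> .",
        f'<{event_uri}> <{RDFS_LABEL}> "{label}" .',
        f"<{event_uri}> <{PROV}wasAssociatedWith> <{agent_uri}> .",
        f"<{event_uri}> <{PROV}used> <{entity_uri}> .",
        f'<{event_uri}> <{PROV}endedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
    ]
```

Because the triples are plain text, they can be kept apart from the repository objects and loaded into a separate triplestore for SPARQL queries, in line with the "stored separate" and "search endpoint" goals on the slide.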

Portland Common Data Model
Fedora 4 supports the legacy XML style of content storage, but anyone migrating to the new platform is encouraged to change to a new RDF model. This sparked some development in the Hydra community to rethink the way a content model is structured as well as described. Last fall a project known as Hydra Works was started, in which the group set out to create a model that could be used across all Hydra adopters, in essence making their content interoperable with that of other Hydra institutions. Since the content model is tied more to Fedora than to Hydra, the development slowly moved from Hydra into Fedora. Rather than create a model that scales only for Hydra, it made more sense to create a model that would work for all content going into Fedora. The Hydra Works project became known as the Portland Common Data Model, named after the city where much of the debate and discussion that shaped this new development path took place. In May, a group of Hydra partners and a group of Fedora programmers will pool their efforts on the programming necessary to release the new Portland model to the Fedora community.

HydraDAM2

Indiana University Libraries, in partnership with Northwestern University Library, also received a $750,000 grant from the Andrew W. Mellon Foundation to support work on the Avalon Media System project through January 2017.

Hydra Infrastructure

Hydra Architecture
• Open source, community developed software: Fedora Commons, Apache Solr, Blacklight, MySQL, Hydra Project
• Locally developed software: Ladybird, Media Delivery Service
• 1,000,000 GB

Repository Storage – Current State
[Diagram labels: New Haven/West Haven, CT; Rocky Hill, CT; Repository; Yale ITS Disk-based Enterprise Storage; Yale Library Tape-based Archival Storage; Iron Mtn., Offline Replicated Set - Tape]

Data replicated across 3 locations in 2 different types of storage infrastructures. Rocky Hill is 30 miles north of New Haven. $350TB/Year

Repository Storage – Current State
Risks of current state:
• Data resides in a single region, the Northeast
• Tape media handling and refresh constraints at petabyte scale
• One-month window in which primary and backup are in the same location
1,000,000 GB

Repository Storage – Future State
[Diagram labels: New Haven/West Haven, CT; Out-of-Region; Repository; Yale ITS Disk-based Enterprise Storage; Digital Preservation Network or Cloud storage provider (ex. Amazon Glacier) or Yale ITS Out-of-region Storage]

Data replicated across at least 2 locations in at least 2 different storage infrastructures. 2-4 years out.
• DPN: 50 members, 5 primary nodes (Stanford Digital Repository, UofT, Chronopolis, APTrust, HathiTrust), recently completed pilot
• AWS Glacier: $10/TB/mo; US, EU, Asia; along with ITS Amazon Virtual Private Cloud (Amazon VPC)
• Other EDU partners w/ shared space
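
As a back-of-the-envelope check on the Glacier option, the sketch below applies the $10/TB/mo rate quoted on the slide to the 1,000,000 GB figure shown on the storage slides. It assumes decimal terabytes (1 TB = 1000 GB) and a single copy, and it ignores retrieval and transfer charges; the function name is invented for this example.

```python
def monthly_storage_cost(gigabytes, dollars_per_tb_month, copies=1, gb_per_tb=1000):
    """Flat storage cost per month; retrieval and transfer fees are not modeled."""
    return gigabytes / gb_per_tb * dollars_per_tb_month * copies


monthly = monthly_storage_cost(1_000_000, 10)  # 1000 TB at $10/TB/mo
yearly = monthly * 12
print(f"${monthly:,.0f} per month, ${yearly:,.0f} per year")
```

Under those assumptions one out-of-region copy comes to $10,000 per month, or $120,000 per year.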

Yale Hydra Roadmap

Migrations in Progress
• Stanford has 147 TB in a Fedora preservation repository as of last June
• We are currently ingesting 2 TB/week into Hydra

Hydra Growth at Yale (TB)
We are already projected to reach the petabyte scale in the next year. If we really scale up to manage research data projects, faculty projects that include A/V and data, and University archival video, the bars in this chart are going to get a lot taller, fast.

Hydra Roadmap
• Complete Kissinger collection (1.7 million pages, 10 million files)
• Complete migration of legacy digital collections
• Discovery and display for curated research data
• Self-archiving (Sufia) project with ITS to support Yale faculty, student, and research content (first Fedora 4 collections)
• Move all collections to Fedora 4 (IIIF, RDF, auditing, other advanced features)
• Unified search
• Integration with ArchivesSpace (ArcLight Hydra project)
• ORCID support
• Online exhibitions in Spotlight
• Video streaming support, HydraDAM for video preservation
• DPN or other offsite copy support

Other things on the horizon:
• GeoBlacklight for GIS data
• Mirador2 image viewer

Digital Preservation Services
• Multiple Copies
• Bit Preservation
• Secure Storage with Managed Access
• Provenance and Authenticity Assurance
• Standards Compliance
• Obsolescence Monitoring
• Format migration and emulation services

Preservation per se has not been a Hydra development priority here at Yale, except as a side effect of good development and infrastructure practices and of the nature of Fedora itself as a preservation platform. Our work has by necessity concentrated on specific ingest, workflow, display, authentication, and discovery functionality for the Arcadia projects, the Kissinger papers, and legacy migrations. Nevertheless, I hope it has become clear that a great deal of preservation functionality is either already present in the system or under development by Hydra partners. We currently provide most of the services listed here, but we do not yet have a solution for obsolescence monitoring, format migration, and emulation services for born-digital collections in multiple file formats (5 TB for MSSA, more for BRBL).

Questions? “Not all digital objects are digital assets. Only those which store value and will realise future benefit can be described as assets. Those which won’t are liabilities.” -4C Roadmap, “Investing in Curation: A Shared Path to Sustainability”

Resources http://digitalpowrr.niu.edu/tool-grid/
