Presentation on theme: "Digital Preservation in Hydra/Fedora"— Presentation transcript:
1 Digital Preservation in Hydra/Fedora
March 24, 2015
Get A head on Your Repository
We are going to provide an overview of what is going on in digital preservation in the world of the Hydra/Fedora framework.
2 About Hydra/Fedora
• Flexible Extensible Digital Object Repository Architecture
• Open-source project
• Provides a platform for digital preservation and presentation
• Used by hundreds of organizations, with over 52 Fedora Members contributing financially; Yale is one of these
• Originally developed at Cornell, now led by the Fedora Project Steering Group under stewardship of DuraSpace.org
• Currently actively engaged in development of Fedora 4
(Yale is also a Fedora development partner, and Mike Friscia serves on the Fedora Leadership Committee.)
Clearly the big decision reflected in our migration path was our decision to standardize on Hydra/Fedora two years ago. It is a widely used system: the current Fedora registry, which is by no means complete, shows 140 institutions in the U.S., and about as many overseas. We did not standardize on Fedora as a platform here at YUL until [year missing]. Before that we had an abundance of digital collections platforms. Some were one-offs, some were meant to be the future but failed to scale up (ContentDM). Many were entirely custom; I’ll talk more later about the migration headaches this is causing.
Sponsors (2013): • 3TU.Datacentrum / TU Delft Library • Arizona State University • Brown University • Charles Darwin University • Colorado Alliance of Research Libraries • Columbia University • Creighton University • FIZ Karlsruhe • Indiana University Bloomington / Indiana University Purdue University • London School of Economics & Political Science • LYRASIS • National Library of Finland • National Library of Medicine • National Library of Wales • Northeastern University • Northern Illinois University • Northern Territory Library • Northwestern University • Oregon State University • Penn State • Rutgers University • Smithsonian Institution • Stanford University • Texas A&M Libraries • Texas Digital Library • Université de Liège • University of California Los Angeles • University of California San Diego • University of California Santa Barbara • University of Cambridge • University of Cincinnati • University of Hong Kong • University of Manitoba • University of New South Wales • University of North Carolina at Chapel Hill • University of Oxford • University of Prince Edward Island • University of Victoria • University of Virginia • University of Wisconsin • Vanderbilt University Library
In-kind contributors: • Columbia University • discoverygarden inc. • FIZ Karlsruhe • Max Planck Digital Library • Media Shelf • Stanford University • University of California, Los Angeles • University of California, San Diego • University of New South Wales • University of North Carolina, Chapel Hill • University of Prince Edward Island • University of Texas, Austin • University of Virginia • University of Wisconsin • Virginia Tech • Yale University
3 Hydra
• Began in 2008 as a collaboration between Stanford, UVA, Univ. of Hull, and Fedora Commons
• YUL joined in 2013 as 18th member. Membership now up to around 27; recent additions include Princeton, Cornell, Case Western
• Another 25 or more institutions are working in the Hydra framework without yet being formal members, including Brown, Johns Hopkins, Trinity College Dublin, Oxford, UC Berkeley and others
Our second big decision after standardizing on Fedora was to join the Hydra project in 2013. Project Hydra was formed in 2008 at the Open Repositories conference by a small group interested in solving a similar set of repository problems using the Fedora Commons software. Fedora is very extensible but not easily customized. When you do make customizations, it becomes more difficult to keep the software up to date. This group set out to design a way to make it easier to customize Fedora.
4 Hydra Partners
OR = Open Repositories Conference
By 2012 the original group had met their goals and brought in new partners along the way, almost tripling in size. Yale joined shortly after and became the 18th partner in 2013.
5 Hydra Partners ((f) = founding partner)
• Stanford University (f) • University of Hull (f) • University of Virginia (f) • DuraSpace (f) • MediaShelf • University of Notre Dame • Northwestern University • Columbia University • Penn State University • Indiana University • London School of Economics and Political Science • Rock and Roll Hall of Fame and Museum • Royal Library of Denmark • Data Curation Experts • WGBH • Boston Public Library • Duke University • Yale University • Virginia Tech • University of Cincinnati • Princeton University Library • Cornell University • Oregon Digital (University of Oregon and Oregon State University) • Case Western Reserve University • Tufts University • Duoc UC • University of Alberta
Currently there are 27 partner institutions. It is estimated that by the end of this year there will be 34 partners, which is almost double the size of the community from just a couple of years ago.
6 A Worldwide Presence
It is difficult to track the exact number of institutions that have adopted Hydra given the open source nature of the framework. But as of this month, 62 institutions have reported that they have adopted Hydra. This map gives a glimpse of where Hydra is in use, with the main concentration being right around us in the Northeast.
7 Hydra at Yale
8 What Is Hydra?
• A framework for repository-powered applications, with multiple, tailored UIs, and a robust repository back end
• One body, many heads
• A set of solution bundles
• A community
• If you want to go fast, go alone. If you want to go far, go together.
At Open Repositories in 2009 the founding group gave a presentation called Project Hydra: Designing & Building a Reusable Framework for Multipurpose, Multifunction, Multi-institutional Repository-Powered Solutions. In the talk they outlined a three-year plan to create an application and middleware framework that, in combination with the Fedora Commons software, would create reusable, multipurpose repository solutions. In addition they were seeking to build an open source community around a common set of goals and principles to create a long-lasting solution.
9 Hydra Interface
[Architecture diagram. Labels: Single Image Zoom; Bookreader; Complex Object Display; Downloadable PDF; Data Import; Hydra-Head (IT use only), creating and managing objects (CRUD); Blacklight, discovering and viewing objects (R); Active Fedora and Solrizer; Search/Facet Logic; Hydra Access Controls; Image Request; Metadata; Images; Fedora (Preservation); Solr (Index); Ladybird (Yale’s Cataloging Tool); Link to Images; Image Retrieval; Managed Storage; Media Server; RSS; SQL Server]
In our local instance of Hydra, we use Ladybird, Fedora, and Blacklight, as well as the Hydra framework, to pull this all together into what we call our Hydra Stack. Ladybird is responsible for packaging content for ingest as an object into Hydra, and Blacklight is used for public access and discovery. The ingested objects are called content models. These predefined models set the requirements an object must meet in order for ingest to be successful. These requirements include the specific files that must be present for ingest to take place.
10 Content model
Models are designed based on the type of content and in general include information such as the digital format of files and associated metadata requirements. The diagram here shows a simple content model for a still image. It indicates that in order for ingest to take place, the package must contain a TIF image, derivative files, descriptive metadata and some additional metadata files used for rights management. There are some optional files, such as a text file for OCR output or a PDF. Content models are also used to express the relationship to other content models when objects need complex relationships with other objects, such as pages in a book relating to the object that represents the entire book.
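The required/optional rule described for the still-image model can be sketched in a few lines of Python. This is a minimal illustration, not Ladybird or Hydra code: the `ContentModel` class, the role names, and the decision to reject files the model does not list are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ContentModel:
    """Hypothetical stand-in for a content model: file roles a package must or may contain."""
    name: str
    required: set[str]
    optional: set[str] = field(default_factory=set)

    def validate(self, package: dict[str, str]) -> list[str]:
        """Return a list of problems; an empty list means ingest may proceed."""
        present = set(package)
        problems = [f"missing required file: {role}"
                    for role in sorted(self.required - present)]
        # Assumption: files outside the model are rejected rather than ignored.
        allowed = self.required | self.optional
        problems += [f"unexpected file: {role}"
                     for role in sorted(present - allowed)]
        return problems


# Role names are invented; they loosely follow the still-image example.
STILL_IMAGE = ContentModel(
    name="still_image",
    required={"master_tif", "derivative_jpg", "descriptive_metadata", "rights_metadata"},
    optional={"ocr_text", "pdf"},
)

package = {
    "master_tif": "0001.tif",
    "derivative_jpg": "0001.jpg",
    "descriptive_metadata": "0001.mods.xml",
}
print(STILL_IMAGE.validate(package))
```

Here the package lacks its rights metadata, so validation reports one problem and ingest would not take place.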
11 Access Conditions
• Defined for each file in a content model
• Wide range of authorization definitions
• Customizable
• Example: [figure]
Access conditions are set for each file that is part of the content model. The current system provides us the ability to assign access to each individual file separately instead of the more common Fedora model of using a single access condition for the entire content model. We chose this path because of the need to separate the level of authorization required to access the master files from the derivative files for our image collections. Access can be granted as broadly as open access for anyone in the world down to access only to a single individual user.
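A rough sketch of per-file access conditions, as opposed to one condition for the whole object, might look like the following. The `Access` levels, the `AccessCondition` class, and the user and group names are hypothetical; the actual authorization definitions in the Yale stack are not described in the slides.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"              # anyone in the world
    GROUP = "group"            # members of the named groups
    INDIVIDUAL = "individual"  # only the named users


@dataclass(frozen=True)
class AccessCondition:
    level: Access
    principals: frozenset[str] = frozenset()  # group or user names; unused for OPEN


def can_access(condition: AccessCondition, user: str, groups: set[str]) -> bool:
    if condition.level is Access.OPEN:
        return True
    if condition.level is Access.GROUP:
        return bool(condition.principals & groups)
    return user in condition.principals


# One condition per file: the master is restricted, the derivative is open.
image_object = {
    "master.tif": AccessCondition(Access.INDIVIDUAL, frozenset({"curator1"})),
    "derivative.jpg": AccessCondition(Access.OPEN),
}


def readable_files(obj: dict[str, AccessCondition], user: str, groups=frozenset()) -> list[str]:
    return sorted(name for name, cond in obj.items()
                  if can_access(cond, user, set(groups)))


print(readable_files(image_object, "anonymous"))
print(readable_files(image_object, "curator1"))
```

Keeping the condition on the file rather than the object is what lets the master and the derivative of the same image carry different authorization levels.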
12 Ingest Workflow
Once the package is created, ingest from Ladybird to Hydra is controlled by manual and automated processes. In some cases Ladybird is transparent and the content passes through without anyone working in Ladybird; in other cases each step of the ingest process is user controlled in Ladybird. Most of what we have published to Hydra has used the manual ingest path: once the content is ready, someone publishes it. For the Kissinger papers, the entire process is automated.
13 Research Data into Hydra
• Colectica software exports contents in BagIt format
• Bag enters a watched folder in Ladybird
• Ladybird validates the bag contents: checksum validation, file characterization
• Ladybird maintains the original file hierarchy as a collection of complex objects
• Each Ladybird object mapped to an Unstructured Content Model
• Each content model is then ingested into Hydra
One of the projects we have been working on is for research data. In this project the files and metadata are curated in a system external to our Hydra stack named Colectica. Once ready, that system delivers the content to Ladybird using the Library of Congress BagIt specification. Ladybird then imports the content, maintains the original hierarchy and passes the contents on to Hydra for ingest. During the packaging process Ladybird runs several processes for file validation, format recognition and fixity checks. At this time, the technical data remains in Ladybird and only the files and minimal metadata are passed to Hydra for storage. As this project progresses, we will increase the amount of data we send to Hydra and include the event-based metadata typically expressed as PREMIS events, along with the technical profiles for each of the files.
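The checksum validation step can be illustrated with a small Python sketch that checks a bag's payload against its manifest. It relies only on the BagIt convention of a `manifest-<algorithm>.txt` file listing one checksum and one relative path per line; it is not Ladybird's implementation, it skips the rest of BagIt validation (the `bagit.txt` declaration, tag manifests, path encoding), and the function name `validate_bag` is made up for the example.

```python
import hashlib
from pathlib import Path


def validate_bag(bag_dir: Path, algorithm: str = "sha256") -> list[str]:
    """Check payload files against manifest-<algorithm>.txt; return a list of problems."""
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    if not manifest.is_file():
        return [f"no {manifest.name} in bag"]

    problems = []
    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        listed.add(rel_path)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        # Whole-file read keeps the sketch short; large files would be hashed in chunks.
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")

    # Payload files that the manifest does not mention are also reported.
    data_dir = bag_dir / "data"
    if data_dir.is_dir():
        for path in sorted(data_dir.rglob("*")):
            rel = path.relative_to(bag_dir).as_posix()
            if path.is_file() and rel not in listed:
                problems.append(f"not in manifest: {rel}")
    return problems
```

A watched-folder process could call `validate_bag` on each arriving bag and only pass bags with an empty problem list on to packaging.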
14 Unstructured Content model
A challenge we faced with the research data project was developing a content model that would work with the content. In designing a content model, we normally need to know up front each file that will be stored in the model, which is not possible for arbitrary research data. To get around this we developed what we call the unstructured content model, which can be seen in this diagram. It is basically a content model that only requires metadata about the entire model. Including a file is optional and it does not matter what type of file is included. When a file is present, we run file characterization software to create a profile and then pass this metadata into Hydra as part of the model. The reason that the model does not require a file is that we chose to store the exact hierarchy in its original form. In the BagIt specification, the bag is essentially a folder filled with more folders and files. In the unstructured approach, we treat a folder level the same as a file so that each is properly represented.
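As a sketch of the idea that folders and files are treated alike, the following Python walks a payload directory and emits one record per folder and per file, each pointing at its parent. The `UnstructuredObject` record and `map_hierarchy` function are hypothetical names for illustration; the real model also carries metadata that is omitted here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class UnstructuredObject:
    """Hypothetical record: every node gets an object, the file itself is optional."""
    path: str               # position in the original hierarchy, relative to the root
    parent: Optional[str]   # path of the containing folder, None for the root
    file: Optional[Path]    # None for folders


def map_hierarchy(root: Path) -> list[UnstructuredObject]:
    """Map a directory tree to one object per folder and per file."""
    objects = [UnstructuredObject(path=".", parent=None, file=None)]
    entries = sorted(root.rglob("*"), key=lambda p: p.relative_to(root).as_posix())
    for entry in entries:
        rel = entry.relative_to(root)
        objects.append(UnstructuredObject(
            path=rel.as_posix(),
            parent=rel.parent.as_posix(),  # "." for top-level entries
            file=entry if entry.is_file() else None,
        ))
    return objects
```

Because a folder becomes an object with no file, the original hierarchy can be rebuilt later from the parent links alone.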
15 Another project we recently worked on was integration with the Digital Preservation Network. In this pilot project we selected roughly 5,000 objects from Hydra to export using the same BagIt specification and ship to Stanford for ingest into the Digital Preservation Network. Once the Hydra objects were selected, I used an application in Ladybird to recall the contents and create bags stored on a temporary server, and then shipped them off to Stanford, literally using FedEx for the pilot. This prompted some brainstorming about eventually creating a mechanism that would give Hydra adopters a direct export link from their repositories to the Digital Preservation Network.
16 DPN
• The Stanford Digital Repository is Fedora 3. They had 147 TB as of last June.
• Chronopolis at UCSD is one of the few certified Trusted Digital Repositories. They are moving to Fedora 4.
• APTrust at UVA uses Fedora to manage metadata with pointers to content in Amazon S3 and Glacier and administrative functions in Hydra and Blacklight.
17 Digital Preservation in Hydra
The Hydra community has been actively working on digital preservation for several years. Three of the more prominent projects to come from the work include the Hydra applications Sufia, Argo and Chronos.
18 Hydra Solution Bundles
• Sufia • CurateND • ScholarSphere • HydraDAM • Argo • Chronos
Sufia is generally used as an institutional repository application that allows self-depositing of files. Using a simplified method for managing access, users can load content into Sufia and use it to create collections of materials. There are many successful Sufia deployments, including Penn State’s ScholarSphere and Notre Dame’s CurateND.
19 Taken from their site, CurateND offers researchers a secure platform for long-term preservation to meet the requirements of funding agencies. The system also offers an automated process to migrate files in danger of obsolescence to new formats while retaining the original files. This is beneficial since there are times when migrating to a new format results in the loss of functionality. Keeping the original is the best option, because emerging technologies allow emulation of obsolete computer environments, making it possible to work with the old files in their original computing environment.
20 Penn State offers ScholarSphere, which is very similar in nature to CurateND. In addition to the same services offered by Notre Dame, ScholarSphere offers a scheduler service that performs regular file characterization and fixity checks to ensure document integrity over time. It also uses the same retention policies as CurateND, so the original digital files are kept in the repository.
21 The most recent addition to Hydra is from the Royal Library in Copenhagen. They released Chronos, an application that sits alongside their deployment of a Blacklight-based Hydra stack, much like the one we have deployed here at Yale. Chronos provides an administrative interface to manage the content in the system. Management options include features such as monitoring files for format obsolescence, running fixity checks in bulk and scheduling additional tasks. Their approach was to develop a robust set of digital preservation policies and then create a strategy to implement the policies as well as secure the budgets. Last summer and fall they worked on the final stages of the software specifications, which led to a development process starting this past December and the launch of the new application earlier this month.
22 Preservation Profiles
[Table: preservation profiles against the columns Encrypted, Integrity check, and Storage pillars I, II, III, IV, V-VIII, IX, X, XI; cell values not recoverable]
1: Storage without bit preservation
2: Digital born collection of material that has access restrictions
3: Legally deposited born digital material that is not in the Webarchive
4: Born digital collection material, without access restrictions
5: Retro digitized (expensive) materials with analog copies
6: Secret digital materials
7: Top secret digital materials
An interesting contribution to the Hydra community was their design of preservation profiles. As you can see here, the profiles are set up to take a general classification of content and match it up with appropriate storage solutions. The Royal Library is charged, by law, with the long-term preservation, automated embargo management and automated format migration of millions of government documents, a collection that ranges from open access materials to top-secret government documents. Each storage pillar they use maps to a different cost, so that they can project the cost model for each type of profile. In their calculations, the cost models include the price of storage, associated hardware and both technical and non-technical staffing requirements.
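The way a profile's cost follows from the storage pillars it uses can be sketched as below. Everything numeric here is invented for illustration: the slides do not give the Royal Library's prices, pillar definitions, or the pillar sets assigned to each profile, and the `Pillar` class and `profile_cost` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pillar:
    """One storage pillar with hypothetical yearly costs per terabyte."""
    name: str
    storage_per_tb: float   # price of the storage itself
    hardware_per_tb: float  # associated hardware
    staff_per_tb: float     # technical and non-technical staffing

    @property
    def yearly_per_tb(self) -> float:
        return self.storage_per_tb + self.hardware_per_tb + self.staff_per_tb


def profile_cost(pillars: list[Pillar], terabytes: float) -> float:
    """Yearly cost of keeping `terabytes` of content on every pillar of a profile."""
    return terabytes * sum(p.yearly_per_tb for p in pillars)


# Invented figures, purely to show the arithmetic.
disk = Pillar("disk", storage_per_tb=100.0, hardware_per_tb=20.0, staff_per_tb=30.0)
tape = Pillar("tape", storage_per_tb=20.0, hardware_per_tb=5.0, staff_per_tb=10.0)

print(profile_cost([disk, tape], terabytes=10))  # profile stored on both pillars
print(profile_cost([tape], terabytes=10))        # profile stored on tape only
```

With a structure like this, changing which pillars a profile maps to immediately changes its projected cost, which is the point of tying profiles to pillars.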
23 Future Development
Going forward there are three major developments taking place in the Hydra/Fedora communities: the release of Fedora 4, auditing in Fedora 4, and the Portland Common Data Model.
24 Fedora 4 Roadmap:
• Audit Service
• Portland Common Data Model
• Migration Tools
• Asynchronous Storage
• Linked Data Platform
• Managed External Data Streams
Fedora 4 was released into production this past December. It represents a significant change in the platform, using modern technologies to support large scale repositories, and adopts a much more modern approach to application lifecycle management in the development process. The effort is led by DuraSpace, which employs two full time project managers, and the programming efforts are donated by partner institutions, including contributions from programmers in Library IT. Strategic planning, budget and governance of the project are also managed by a steering committee, to which I was elected this past December.
25 Fedora 4 Auditing
• Track events: agent, date, activity, entity
• Allow import/export of events
• High performance
• Stored separate from repository entities
• Export in RDF format
• Provide SPARQL-Query search endpoint
In a recent poll of the more than 170 Fedora adopters, the prioritization of missing features was set. At the top of the list, above tools for migrating from earlier versions of Fedora, was the need to store audit data. Fedora does a great job at performing a wide range of functions on content in the repository but is missing features to store data about what it did and when the event took place. The design process spanned the last six weeks and the work to implement auditing into Fedora 4 is expected to be released in May.
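To make the four tracked fields concrete, here is a minimal sketch of an audit event kept in its own log and exported as RDF in N-Triples syntax. It is not the Fedora 4 audit service: the `AuditEvent` class, the `example.org` URIs and the activity vocabulary are hypothetical, and the W3C PROV-O terms are used only as a familiar vocabulary, not as a claim about what Fedora emits.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


@dataclass(frozen=True)
class AuditEvent:
    event_uri: str   # identifier of the event itself
    agent: str       # URI of who or what acted
    date: datetime   # timezone-aware timestamp
    activity: str    # URI naming the kind of activity (hypothetical vocabulary)
    entity: str      # URI of the repository resource acted upon

    def to_ntriples(self) -> str:
        stamp = self.date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        triples = [
            (RDF_TYPE, f"<{PROV}Activity>"),
            (RDF_TYPE, f"<{self.activity}>"),
            (PROV + "wasAssociatedWith", f"<{self.agent}>"),
            (PROV + "used", f"<{self.entity}>"),
            (PROV + "startedAtTime", f'"{stamp}"^^<{XSD_DATETIME}>'),
        ]
        return "".join(f"<{self.event_uri}> <{p}> {o} .\n" for p, o in triples)


# The log lives apart from the repository objects it describes.
audit_log: list[AuditEvent] = []

event = AuditEvent(
    event_uri="http://example.org/audit/1",
    agent="http://example.org/agents/ingest-service",
    date=datetime(2015, 3, 24, 12, 0, tzinfo=timezone.utc),
    activity="http://example.org/activities/fixity-check",
    entity="http://example.org/repo/objects/123",
)
audit_log.append(event)
print(event.to_ntriples())
```

Triples in this form could be loaded into any triple store, which is one way the SPARQL search requirement on the slide could be met.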
26 Portland Common Data Model
Fedora 4 offers support for the legacy XML style of content storage, but anyone migrating to the new platform is encouraged to change to a new RDF model. This sparked some development in the Hydra community to rethink the way a content model is structured as well as described. Last fall a project known as Hydra Works was launched, in which the group set out to create a model that could be used across all Hydra adopters, in essence making their content interoperable with other Hydra institutions. Since the content model is more tied to Fedora than it is to Hydra, the development slowly moved from Hydra into Fedora. Rather than create a model that scales only for Hydra, it made more sense to create a model that would work for all content going into Fedora. The Hydra Works project became known as the Portland Common Data Model, named after the city where much of the debate and discussion took place to shape this new development path. In May, a group of Hydra partners as well as a group of Fedora programmers will pool their efforts into the programming necessary to release the new Portland model to the Fedora community.
28 Indiana University Libraries, in partnership with Northwestern University Library, also received a $750,000 grant from the Andrew W. Mellon Foundation to support work on the Avalon Media System project through January 2017.
29 Hydra Infrastructure
30 Hydra Architecture
• Open source, community developed software: Fedora Commons, Apache Solr, Blacklight, MySQL
• Hydra Project open source, community developed software
• Locally developed software: Ladybird, Media Delivery Service
• 1,000,000 GB
31 Repository Storage – Current State
[Diagram labels: New Haven/West Haven, CT; Rocky Hill, CT; Repository; Yale ITS Disk-based Enterprise Storage; Yale Library Tape-based Archival Storage; Iron Mtn., Offline Replicated Set - Tape]
• Data replicated across 3 locations in 2 different types of storage infrastructures
• Rocky Hill - 30 miles north of New Haven
• $350/TB/year
32 Repository Storage – Current State
Risks of current state:
• Data resides in single region, the Northeast
• Tape media handling and refresh constraints at petabyte scale
• One month window in which primary and backup are in same location
• 1,000,000 GB
33 Repository Storage – Future State
[Diagram labels: New Haven/West Haven, CT; Out-of-Region; Repository; Yale ITS Disk-based Enterprise Storage; Digital Preservation Network or Cloud storage provider (ex. Amazon Glacier) or Yale ITS Out-of-region Storage]
• Data replicated across at least 2 locations in at least 2 different storage infrastructures
• 2-4 years out
• DPN: 50 members, 5 primary nodes (Stanford Digital Repository, UofT, Chronopolis, APTrust, HathiTrust), recently completed pilot
• AWS Glacier: $10/TB/mo; US, EU, Asia; along with ITS Amazon Virtual Private Cloud (Amazon VPC)
• Other EDU partners w/ shared space
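The two price points on the storage slides can be put on the same yearly basis with a few lines of arithmetic. This is a rough sketch under stated assumptions: the current-state figure is read as $350 per TB per year, the 1,000,000 GB figure is taken as the repository size with 1 TB = 1,000 GB, and the Glacier figure covers storage only, ignoring retrieval and request charges. The slides do not say how many copies either price covers.

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes


def yearly_cost(size_gb: float, price_per_tb_year: float) -> float:
    """Yearly storage cost in dollars for a repository of `size_gb` gigabytes."""
    return size_gb / GB_PER_TB * price_per_tb_year


REPOSITORY_GB = 1_000_000          # figure shown on the slides
CURRENT_PER_TB_YEAR = 350          # current state, read as $350/TB/year
GLACIER_PER_TB_YEAR = 10 * 12      # $10/TB/month, storage only

print(yearly_cost(REPOSITORY_GB, CURRENT_PER_TB_YEAR))   # 350000.0
print(yearly_cost(REPOSITORY_GB, GLACIER_PER_TB_YEAR))   # 120000.0
```

On these assumptions an out-of-region Glacier copy of the full repository would cost about a third of the current per-terabyte rate, before any retrieval costs.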
34 Yale Hydra Roadmap
35 Migrations in Progress
• Stanford has 147 TB in a Fedora preservation repository as of last June
• We are currently ingesting 2 TB/week into Hydra
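For a sense of scale, the ingest rate can be turned into elapsed time with simple division. The sketch below assumes the 2 TB/week rate stays constant, which the slides do not claim; `weeks_to_ingest` is an illustrative helper, not part of any tool mentioned here.

```python
def weeks_to_ingest(total_tb: float, tb_per_week: float = 2.0) -> float:
    """Weeks needed to ingest `total_tb` at a constant weekly rate."""
    if tb_per_week <= 0:
        raise ValueError("tb_per_week must be positive")
    return total_tb / tb_per_week


# A collection the size of Stanford's 147 TB repository, at Yale's current rate.
print(weeks_to_ingest(147))  # 73.5
```

At the current rate, ingesting a body of content as large as Stanford's repository would take roughly a year and a half.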
36 Hydra Growth at Yale (TB)
We are already projected to reach the petabyte scale in the next year. If we really scale up to manage research data projects, faculty projects that include A/V and data, and University archival video, these bars are going to get a lot taller fast.
37 Hydra Roadmap
• Complete Kissinger collection (1.7 million pages, 10 million files)
• Complete migration of legacy digital collections
• Discovery and display for curated research data
• Self-archiving (Sufia) project with ITS to support Yale faculty, student, and research content (first Fedora 4 collections)
• Move all collections to Fedora 4 (IIIF, RDF, auditing, other advanced features)
• Unified search
• Integration with ArchivesSpace (ArcLight Hydra project)
• ORCID support
• Online exhibitions in Spotlight
• Video streaming support, HydraDAM for video preservation
• DPN or other offsite copy support
Other things on the horizon:
• GeoBlacklight for GIS data
• Mirador 2 image viewer
38 Digital Preservation Services
• Multiple Copies
• Bit Preservation
• Secure Storage with Managed Access
• Provenance and Authenticity Assurance
• Standards Compliance
• Obsolescence Monitoring
• Format migration and emulation services
Preservation per se has not been a Hydra development priority here at Yale, except as a side effect of good development and infrastructure practices, and the nature of Fedora itself as a preservation platform. Our work has by necessity concentrated on specific ingest, workflow, display, authentication, and discovery functionality for the Arcadia projects, Kissinger papers, and legacy migrations. Nevertheless, I hope it has become clear that a great deal of preservation functionality is either already present in the system or under development by Hydra partners. We currently do most of these, but we do not yet have a solution for obsolescence monitoring, format migration, and emulation services for born digital collections in multiple file formats (5 TB for MSSA, more for BRBL).
39 Questions?
“Not all digital objects are digital assets. Only those which store value and will realise future benefit can be described as assets. Those which won’t are liabilities.”
- 4C Roadmap, “Investing in Curation: A Shared Path to Sustainability”