UC3 Curation Micro-Services Simplified Repository Ingest UC Curation Center California Digital Library May 20, 2010.



Agenda
Introduction
  – Welcome and review of objectives
  – UC3 and digital curation
  – Landscape, assumptions, and imperatives
Curation micro-services
  – The Merritt project
  – Design goals
  – The future of the DPR
Simplified repository ingest
  – Concepts
  – Implementation
  – Demonstration
Discussion

Objectives
By the end of this discussion we hope that you will understand
  – Digital curation and the UC3 mission
  – The emergent, micro-services approach to curation infrastructure
  – The Merritt curation environment and the future of the DPR
  – The Merritt Ingest service and its interactions with the Identity, Storage, and Inventory services
  – How to incorporate the Ingest service into your workflows

University of California Curation Center (UC3)
We’ve changed our name, but not our commitment
  – Ensuring that the information resources supporting, and resulting from, the University’s research, teaching, and learning mission remain authentic, available, and usable
UC3 is a Center of Excellence
  – A creative partnership bringing together the expertise and resources of the CDL, the ten UC campuses, and the broader international curation community

Digital curation
The set of policies and practices focused on managing and adding value to a body of trusted digital content
  – Preservation ensures access over time
  – Access depends upon preservation up to a point in time
It can also be seen as facilitating the alignment of the scholarly and information lifecycles

Landscape
Ever increasing number, size, and diversity of content
  – More stuff, fewer resources
Ever increasing diversity of partners, stakeholders, and expectations
  – Producers / consumers → prosumers / conducers
Inevitability of disruptive change
  – Technology
  – User expectation
  – Institutional mission and resources
Problem or opportunity?
[Figure labels: $, Work, Time]

Assumptions
Curated content gains
  – Safety through redundancy: “Lots of copies keeps stuff safe”
  – Meaning through context: “Lots of description keeps stuff meaningful”
  – Utility through service: “Lots of services keeps stuff useful”
  – Value through use: “Lots of uses keeps stuff valuable”
Curation is an outcome, not a place
  – Decentralized curation can be as effective as centralized
Curation stewardship is a relay

Imperatives
Provide innovative, effective, and efficient services
Plan for change
  – Focus on content, not the systems in which that content is managed
      Systems come and go (but not our system ;-)
  – Occam’s Razor and Murphy’s Law suggest
      Favor the small and simple over the large and complex
      Favor the minimally sufficient over the feature laden
      Favor the configurable over the prescribed
      Favor the proven over the (merely) novel
Enable curation at the point of use
Do more with less

Curation micro-services
Devolve curation function into a granular set of independent, but interoperable micro-services
  – Since each is small and self-contained, they are collectively easier to develop, maintain, and enhance
  – Since the level of investment in, and therefore commitment to, any given service is small, they are easier to replace when they have outlived their usefulness
  – The scope of each service is limited, but complex behavior emerges from the strategic composition of individual atomistic services

Merritt curation micro-services
[Diagram: services arranged in layers labeled Value, Utility, Context, and State, spanning Curation and Preservation]
  – Annotation of content by consumers
  – Notification of new content availability
  – Transformation to create derivatives
  – Search of content and metadata
  – Index to enable fast search
  – Ingest of content for curation
  – Characterization to extract content properties
  – Inventory of curated content
  – Replication for safety
  – Fixity to verify bit-level integrity
  – Storage for long-term retention
  – Identity for long-term reference
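
The slide describes Fixity as verifying bit-level integrity. A minimal sketch of that idea follows, assuming a digest is recorded when content is stored and recomputed later. The function names and the choice of SHA-256 are illustrative assumptions, not details taken from Merritt.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of a bit stream (the algorithm choice is an assumption)."""
    return hashlib.new(algorithm, data).hexdigest()


def fixity_ok(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """True when the bit stream still matches the digest recorded earlier."""
    return digest(data, algorithm) == recorded_digest
```

A caller would store `digest(content)` alongside the file and run `fixity_ok` on a schedule; a single changed bit makes the comparison fail.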

What is the future of the DPR?
The DPR will continue to be operated as a core UC3 service
  – However, the components of the underlying system will be gradually replaced with their new Merritt-based equivalents
  – All content currently managed in the DPR will be automatically migrated to the new environment
Micro-services can also be used to deploy locally hosted repositories to meet specialized local needs

What is the future of the DPR?
Continuing stewardship commitment by UC3 regarding managed content
  – Safety, persistence, efficiency, economy
Streamlined workflows for submission, access, and collection management
  – Easy in, easy out
Minimal technical requirements for contribution
Great flexibility in deploying customized repository solutions

Design goals
Policy neutral, protocol and platform independent
  – We know we can’t foresee all of the contexts in which these services can be usefully deployed
Principle of least surprise
  – Extensive options, but meaningful default behavior
Linked data
  – All entities exist within a web of semantic relations
The file system is the database
  – All content and metadata are expressed in the file system
  – Some subset of this information may be replicated in databases as an optimization for fast query

Design goals
Code to interfaces
  – Underlying implementations should and will evolve over time without invalidating the public interface “contract”
Exploit agile methods
  – Early prototyping, frequent refactoring
  – Stakeholder engagement
The appropriate benchmark for submission user experience is Flickr
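
To illustrate "code to interfaces", here is a small sketch in which callers depend only on a declared contract, so the implementation behind it can change freely. `StorageBackend`, `InMemoryBackend`, and `archive` are hypothetical names invented for this example; they do not describe the actual Merritt interfaces.

```python
from typing import Protocol


class StorageBackend(Protocol):
    """Hypothetical public contract; callers depend only on these two methods."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBackend:
    """One interchangeable implementation, convenient for tests."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def get(self, key: str) -> bytes:
        return self._items[key]


def archive(backend: StorageBackend, key: str, data: bytes) -> int:
    """Store data through the interface and return the number of bytes read back."""
    backend.put(key, data)
    return len(backend.get(key))
```

Swapping `InMemoryBackend` for a file-system or networked implementation would leave `archive` untouched, which is the point of the contract.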

Storage concepts
Node
  – A sub-domain of the Storage service established to meet specific policy, administrative, or technical needs
Object
  – Encapsulation in digital form of an abstract intellectual or aesthetic work
Version
  – A set of files representing a discrete state of the object
  – Any change to object state constitutes a new version
File
  – A formatted bit stream
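
The four storage concepts can be modeled as nested data structures. The sketch below is one possible reading, with the rule that any change to object state produces a new version. All class and method names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredFile:
    name: str
    content: bytes  # a formatted bit stream


@dataclass(frozen=True)
class Version:
    number: int
    files: tuple[StoredFile, ...]  # the files that make up one discrete state


@dataclass
class DigitalObject:
    identifier: str
    versions: list[Version] = field(default_factory=list)

    def add_version(self, files: list[StoredFile]) -> Version:
        """Record a new state; earlier versions are never modified."""
        version = Version(number=len(self.versions) + 1, files=tuple(files))
        self.versions.append(version)
        return version


@dataclass
class Node:
    name: str
    objects: dict[str, DigitalObject] = field(default_factory=dict)

    def add_version(self, identifier: str, files: list[StoredFile]) -> Version:
        """Create the object on first use, then append a version to it."""
        obj = self.objects.setdefault(identifier, DigitalObject(identifier))
        return obj.add_version(files)
```

Making `Version` and `StoredFile` immutable mirrors the slide's rule: updating a file never rewrites history, it appends version 2 next to version 1.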

Storage concepts
Stable reference
  – All objects (and their versions, and their files) managed in the Storage service have stable URLs that can be used to retrieve entities or metadata about entities, subject to appropriate access control
[Diagram labels: File, Version, Object, Storage service, Request type, Storage node]

Ingest concepts
Queue
  – Asynchronous processing of submitted material
Batch
  – A set of digital objects submitted together
  – The unit of notification and reporting
Job
  – The processing of a single digital object
Handler
  – A specific processing stage
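
The relationship between queue, batch, job, and handler can be sketched as follows. This is a single-threaded toy built on the standard library `queue.Queue`; the handler names `characterize` and `store` and the `run_queue` function are hypothetical stand-ins, not the real Ingest handlers.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable

# A handler is one processing stage applied to a job (hypothetical signature).
Handler = Callable[["Job"], None]


@dataclass
class Job:
    object_name: str
    log: list[str] = field(default_factory=list)


@dataclass
class Batch:
    batch_id: str
    jobs: list[Job]


def run_queue(queue: "Queue[Batch]", handlers: list[Handler]) -> dict[str, list[str]]:
    """Drain the queue, pass every job through each handler in order, report per batch."""
    report: dict[str, list[str]] = {}
    while not queue.empty():
        batch = queue.get()
        for job in batch.jobs:
            for handler in handlers:
                handler(job)
        # The batch is the unit of reporting: one entry per batch.
        report[batch.batch_id] = [job.object_name for job in batch.jobs]
    return report


def characterize(job: Job) -> None:
    job.log.append("characterized")


def store(job: Job) -> None:
    job.log.append("stored")
```

Because each handler sees only a `Job`, stages can be added, removed, or reordered without changing the queue or batch logic.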

Ingest concepts
Profile
  – A user-specific set of processing choices
  – Negotiated as part of the submission agreement
Notification
  – At the time of ingest submission and completion
  – Our stewardship obligation begins at the time of ingest completion
Submit by-value (a file) or by-reference (a URL)
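
The by-value and by-reference submission modes can be represented as two small types. The sketch below assumes nothing about the real submission API; `ByValue`, `ByReference`, and `describe` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ByValue:
    """The submitter uploads the bytes directly."""
    filename: str
    payload: bytes


@dataclass(frozen=True)
class ByReference:
    """The submitter supplies a URL from which the content is to be fetched."""
    url: str


Submission = Union[ByValue, ByReference]


def describe(submission: Submission) -> str:
    """Summarize what the ingest side would need to do for each mode."""
    if isinstance(submission, ByValue):
        return f"file {submission.filename} ({len(submission.payload)} bytes)"
    return f"fetch from {submission.url}"
```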

Ingest process flow
[Sequence diagram]
Participants: Submitting library, Ingest, Inventory, Storage Node, Identity
Message labels: Submit; Create identifier; Identifier; Add version; Get version metadata; Version metadata; Notification; Version metadata; Get version metadata; Add version
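
One plausible reading of the process flow is sketched below with in-memory stand-ins for the Identity, Storage, and Inventory services. The call order (create identifier, add version to storage, fetch version metadata, register it in the inventory, notify) is an assumption inferred from the message labels, and every class, method, and identifier format here is hypothetical.

```python
import itertools


class Identity:
    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def create_identifier(self) -> str:
        # Placeholder scheme, not a real identifier syntax.
        return f"id-{next(self._counter)}"


class StorageNode:
    def __init__(self) -> None:
        self.versions: dict[str, list[dict[str, bytes]]] = {}

    def add_version(self, identifier: str, files: dict[str, bytes]) -> int:
        """Append a new version and return its number (1-based)."""
        self.versions.setdefault(identifier, []).append(dict(files))
        return len(self.versions[identifier])

    def get_version_metadata(self, identifier: str, version: int) -> dict:
        files = self.versions[identifier][version - 1]
        return {"id": identifier, "version": version, "files": sorted(files)}


class Inventory:
    def __init__(self) -> None:
        self.records: list[dict] = []

    def add_version(self, metadata: dict) -> None:
        self.records.append(metadata)


def ingest(files, identity, storage, inventory, notify):
    """Run one submission through the assumed sequence of service calls."""
    identifier = identity.create_identifier()
    version = storage.add_version(identifier, files)
    metadata = storage.get_version_metadata(identifier, version)
    inventory.add_version(metadata)
    notify(f"ingested {identifier} version {version}")
    return metadata
```

Here the Ingest function is the only piece that knows the overall sequence; each service exposes a narrow operation and knows nothing about the others.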

Ingest implementation
[Architecture diagram]
Components: Submitting library, Submitter, Consumer, Ingester, Storage, Queue
Component notes: HTML form; Servlet (implicitly multi-threaded), shown twice; Dæmon (explicitly multi-threaded); ZooKeeper dæmon
Data flows: Job metadata, Job payload, Submission notification, Ingest notification
Submission unit: Batch or single object

Demonstration
A few caveats…
  – Still a work in progress!
  – The final interface style sheets are not yet applied
  – Inventory and authentication/authorization services still under development
  – Full error reporting is not complete

Development roadmap
First wave:   Identity, Storage
Second wave:  Inventory, Ingest
Third wave:   Index, Fixity
Fourth wave:  Search, Replication
Fifth wave:   Notification, Characterization
Sixth wave:   Annotation, Transformation
Other roadmap items: Object / collection modeling; Metadata standards; Authentication / authorization; Semantic interoperability; Policy / business model development

Early community reaction
Collaborative development and integration projects with UC3 partners
Independent implementation of key Merritt specifications
Presentation/BOF at Open Repositories 2010
Digital curation group and Barcamp

Discussion
Will existing workflows continue to work?
  – Yes, we have a crosswalk from the existing METS-based feeder submission
What are the minimal requirements for an acceptable digital object?
  – A per-object METS file is no longer required
  – The DPR will accept any content in any form
      However, the long-term curation service level may vary depending on the object’s formal characteristics, the presence (or absence) of accompanying metadata, the general state of curation understanding, and the availability of appropriate tools

Discussion
How do I include metadata in my submission?
  – The Ingest submission form provides an opportunity to specify descriptive Dublin Kernel metadata
  – Administrative metadata is implied by the user’s profile
      Name, affiliation, contact information, collection, …
  – Technical (and, potentially, descriptive) metadata is automatically extracted by the characterization handler
  – Additional metadata can be expressed in recognized schemas and stored in files with well-known names
      mrt-dublin-core.txt
      mrt-mods.xml
      mrt-creative-commons.rdf
      …
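
The well-known file names above suggest a simple convention: metadata files are recognized by name and separated from content. The sketch below shows how a submission directory might be sorted on that basis. The three file names come from the slide; the schema labels, the `split_submission` function, and the flat directory layout are assumptions.

```python
from pathlib import Path

# File names come from the slide; the schema labels are descriptive guesses.
WELL_KNOWN_METADATA = {
    "mrt-dublin-core.txt": "Dublin Core",
    "mrt-mods.xml": "MODS",
    "mrt-creative-commons.rdf": "Creative Commons",
}


def split_submission(directory: Path) -> tuple[dict[str, str], list[str]]:
    """Return (recognized metadata files mapped to a schema label, other content files)."""
    metadata: dict[str, str] = {}
    content: list[str] = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue  # this sketch ignores subdirectories
        if path.name in WELL_KNOWN_METADATA:
            metadata[path.name] = WELL_KNOWN_METADATA[path.name]
        else:
            content.append(path.name)
    return metadata, content
```

Anything not on the list is treated as ordinary content, which matches the slide's point that no particular metadata file is required.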

Discussion
Isn’t an enterprise storage solution or RDBMS (e.g. Oracle) better than just relying on the file system?
  – No, we believe that there are a number of important advantages to directly exploiting the file system
      No vendor lock-in; proprietary systems are difficult to debug
      Modern file systems have excellent scaling characteristics
      The ability to re-instantiate the system by walking the file system is significant
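
Re-instantiating the system by walking the file system can be illustrated with a short sketch. It assumes a hypothetical layout of `root/<object>/<version>/<file>`, which is an invention for this example and not the actual Merritt storage layout, and rebuilds an in-memory catalog from the directory tree alone.

```python
import os
from pathlib import Path


def rebuild_inventory(root: Path) -> dict[str, dict[str, list[str]]]:
    """Walk root/<object>/<version>/<files> and rebuild a catalog of what is stored."""
    inventory: dict[str, dict[str, list[str]]] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        relative = Path(dirpath).relative_to(root).parts
        # Only directories exactly two levels down are version directories.
        if len(relative) != 2 or not filenames:
            continue
        object_id, version = relative
        inventory.setdefault(object_id, {})[version] = sorted(filenames)
    return inventory
```

If a database index is lost or corrupted, a walk like this can regenerate it, because the file system remains the authoritative record.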

Discussion
Why is there a separate Ingest service? Why can’t I just submit directly to the Storage service?
  – Merritt embraces the “separation of concerns” principle
      The Storage service only “knows” about storage and has strict requirements for the allowable form of submissions
      The Ingest service was explicitly designed for user-facing operation and imposes minimal constraints on submission forms

Discussion (questions for you)
What constitutes a “collection”?
  – Does it have hierarchically-arranged sub-components?
What tools do you need to manage your collections effectively?
How do you expect to retrieve content from the repository?
  – Following a saved link?
  – Search query? If so, what would be the query terms?

Discussion (questions for you)
What level of access control is necessary?
  – Bright vs. dark policy
  – Embargo periods
  – Redaction
Who are the subject populations?
  – UC affiliates
  – Non-UC
How fine-grained must this control be?
  – Collection or object
  – Campus, research group, user

Discussion (questions for you)
Are there other repository tools or protocols that we should investigate?
Please respond to the DPR survey at

For more information
UC Curation Center
Curation micro-services
DPR survey
Digital curation group and Barcamp
UC3: Stephen Abrams, Erik Hetzner, Margaret Low, Mark Reyes, Perry Willett, Patricia Cruse, Greg Janée, John Kunze, Tracy Seneca, Scott Fisher, David Loy, Isaac Rabinovitch, Marisa Strong