An Introduction to Repositories Thornton Staples Director of Community Strategy and Alliances Director of the Fedora Project
Creating a digital library is not a process of moving the traditional library online. Increasingly, it’s more about the care and feeding of the web!
Creating digital surrogates of paper collections is only the beginning Surrogate collections are an important step! Collecting born-digital materials is rapidly coming upon us Simple Institutional repository approaches are good but only scratch the surface Complex scholarly and scientific projects are the biggest challenge
Repositories are designed to be flexible and adaptable Relational databases are too rigid Need to be able to add new content types and media easily Need to be able to handle arbitrary complexity in relatively simple ways Above all, it all needs to be durable over a very long time!
[Diagram: Preservation and Archiving, Scholars Workbench, Institutional Repository and Data Curation Solutions; The Repository (Content abstraction); RAID Arrays, Tape Libraries and Cloud Storage]
Repositories are the foundation for many applications A set of abstractions that can be used to represent different kinds of data Manages the actual content beneath the surface Negotiates the connection between access and storage Designed to make data “durable” over the long term
Access is the core purpose of a repository Searching is important but it is not the only thing Finding is the point of searching! The point of finding is very often to use the resource that you have found, for analysis or reuse New digital resources that reuse found objects depend on continuing access for validity
Any unit of content may have more than one context Within one collection –An architectural image may be related to more than one building Across collections –Special collections images may be art objects Across repositories –Born-digital publications will almost always cross institutional boundaries
Authenticity and fidelity What is an authoritative digital surrogate of a real object? When is a copy of an original surrogate exact? A born-digital object has nothing to compare against Digital “fingerprints” must be captured and managed as metadata When formats change, objects will not have all the same technical characteristics…
Making complex digital information “durable” is a very hard problem Durability implies that digital content is directly in use and sustained long-term A history of the changes to the encoding and state of content must be reliably provided A meaningful context for any unit of content may be one of many and must be sustained Replication appears to be our best friend and the cloud looks like an answer
Management is the core function of a repository Repositories are designed to keep everything as stable as possible while providing flexible access Managing things such that when they aren’t changing they are reliably the same Accounting for migration for technical reasons Disaster preparedness (lots of copies!) Must respect legal and policy issues
Repository abstractions provide a durability framework for managing. Content is “unitized” as information objects that combine data, metadata, policies, relationships and the history of the object. Complex digital resources are formally defined graphs of related objects. The public view of the content is presented as virtual data components.
A data object is one unit of content [Diagram: a data object with a Persistent ID; Reserved Datastreams: DC, RELS-EXT, AUDIT, POLICY; Custom Datastreams 1 to n (any type, any number)]
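The data object on this slide can be pictured as a small data structure. The following is a minimal sketch, not the Fedora API: the class names, fields and the sample PID `demo:1` are hypothetical, and only the reserved datastream IDs (DC, RELS-EXT, AUDIT, POLICY) come from the slide.

```python
from dataclasses import dataclass, field

# Reserved datastream IDs as named on the slide.
RESERVED_IDS = ("DC", "RELS-EXT", "AUDIT", "POLICY")


@dataclass
class Datastream:
    """One component of a data object (hypothetical model)."""
    ds_id: str
    mime_type: str
    content: bytes


@dataclass
class DataObject:
    """One unit of content: a persistent ID plus its datastreams."""
    pid: str
    datastreams: dict = field(default_factory=dict)

    def add(self, ds: Datastream) -> None:
        self.datastreams[ds.ds_id] = ds

    def reserved(self) -> list:
        return [d for i, d in self.datastreams.items() if i in RESERVED_IDS]

    def custom(self) -> list:
        # Any type, any number: everything that is not reserved.
        return [d for i, d in self.datastreams.items() if i not in RESERVED_IDS]


obj = DataObject(pid="demo:1")  # hypothetical PID
obj.add(Datastream("DC", "text/xml", b"<dc/>"))
obj.add(Datastream("IMAGE", "image/jpeg", b"\xff\xd8"))
```

Keeping reserved and custom datastreams in one mapping, distinguished only by ID, is a design choice of this sketch; it keeps the object open to new content types without changing its shape.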
Files are stored on disk and managed directly Versioning is necessary Checksums for each file provide assurance that the file has not changed Can be managed by the repository or as remote files
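The checksum idea can be illustrated with a short sketch. It assumes SHA-256 as the digest, which the slide does not specify, and the function names are hypothetical: a checksum is recorded when the file is stored and recomputed later to confirm that the bytes are the same.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest of the content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded: str) -> bool:
    """Compare the current content against the checksum recorded earlier."""
    return fingerprint(data) == recorded


original = b"scanned page 1"
recorded = fingerprint(original)  # stored as metadata at ingest time
```

A later audit would call `is_unchanged` with the bytes read back from storage; a mismatch signals that the file was altered or corrupted.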
Virtual datastreams provide the access abstraction Can be simply retrieving a stored component Views of the content can be derived on demand, for different formats and resolutions Other data productions can be derived on demand; e.g. tiles from a JPEG2000 file By providing an abstract view of the content you break the dependence on the stored files
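One way to picture this access abstraction is a set of named views over a stored component, each computed only when requested. This is a sketch under assumptions: the class and view names are hypothetical, and a byte slice stands in for a real derivation such as tiling or format conversion.

```python
from typing import Callable, Dict


class VirtualDatastreams:
    """Named views over one stored component (hypothetical model)."""

    def __init__(self, stored: bytes):
        self._stored = stored
        # The simplest view just retrieves the stored component.
        self._views: Dict[str, Callable[[bytes], bytes]] = {
            "original": lambda b: b,
        }

    def register(self, name: str, derive: Callable[[bytes], bytes]) -> None:
        self._views[name] = derive

    def get(self, name: str) -> bytes:
        # Derived on demand: nothing is precomputed or stored per view.
        return self._views[name](self._stored)


vds = VirtualDatastreams(b"Hello, repository")
vds.register("preview", lambda b: b[:5])  # stand-in for a real derivation
```

Callers ask for a view by name and never touch the stored file, so the stored format can be migrated without changing how the content is requested.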
[Diagram labels: Content Access; Content Management]
Descriptive metadata is about the content of the resource Indexed for searching Also used for rendering user experiences Some standards in use: Dublin Core - general MODS - bibliographic VRA Core - cultural heritage FGDC - GIS datasets DDI - social science datasets
Administrative metadata is more about the encoding and use Metadata about the object generally, like checksums Technical metadata about the specifics of the encoding of each format Event metadata, about what happens to an object over its lifetime; audit trails Policy metadata, like access restrictions and credit lines
Relationships Among Objects Describes adjacency relationships among objects, among units of content Can be done by explicitly listing IDs in XML, using METS for example or using RDF: PID – typeOfRelationship – relatedObjectPID Can be used to assemble complex resources and aggregations of objects Explicit and implicit aggregations
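The PID – typeOfRelationship – relatedObjectPID pattern can be sketched as plain triples. All PIDs and relationship names below are hypothetical, and an in-memory list stands in for an RDF store; the point is only that one object can carry several relationships and that aggregations can be assembled by querying them.

```python
from collections import namedtuple

# subject PID, type of relationship, related object PID
Triple = namedtuple("Triple", "subject predicate object")

triples = [
    Triple("demo:image1", "depicts", "demo:buildingA"),
    Triple("demo:image1", "depicts", "demo:buildingB"),
    Triple("demo:image1", "isMemberOf", "demo:collection1"),
    Triple("demo:image2", "isMemberOf", "demo:collection1"),
]


def related(pid: str, predicate: str) -> list:
    """Objects that `pid` points to through the given relationship."""
    return [t.object for t in triples
            if t.subject == pid and t.predicate == predicate]


def members_of(collection_pid: str) -> list:
    """Assemble an aggregation by following isMemberOf backwards."""
    return [t.subject for t in triples
            if t.predicate == "isMemberOf" and t.object == collection_pid]
```

Here `related("demo:image1", "depicts")` yields both buildings, which is the case of one image belonging to more than one context.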
Establishing and Enforcing Policies Policies must be established for the entire life-cycle of the information –Ownership and workflow policies –Access and use policies –Policies associated with sustaining (or not!) Policies must be expressed for end users Policies must also be expressed for machine access
Indexing In a repository there is no “catalog”; the repository is the catalog Many indexes can be created for many reasons Either metadata or full content, or both Ontology-based indexes are rapidly becoming more feasible Keeping indexes updated is the trick
Indexing as a harvesting service [Diagram: Fedora Repository Service; Simple JMS; GSearch, OAI, Ingest, Blacklight, More…] repository publishes events services listen and consume events or other messages
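The publish-and-listen pattern on this slide can be sketched in a few lines. This is an in-process stand-in, not JMS and not the Fedora or GSearch API: the `EventBus` and `SearchIndex` classes, the event fields and the sample titles are all hypothetical. It shows an index staying current by consuming repository events rather than being updated by the repository itself.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a messaging service."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener) -> None:
        self._listeners.append(listener)

    def publish(self, event: dict) -> None:
        for listener in self._listeners:
            listener(event)


class SearchIndex:
    """A listening service that keeps a word index up to date."""

    def __init__(self):
        self.terms = defaultdict(set)

    def on_event(self, event: dict) -> None:
        pid = event["pid"]
        # Drop stale entries for this object before re-indexing it.
        for pids in self.terms.values():
            pids.discard(pid)
        if event["type"] != "purge":
            for word in event["title"].lower().split():
                self.terms[word].add(pid)

    def search(self, word: str) -> list:
        return sorted(self.terms.get(word.lower(), set()))


bus = EventBus()
index = SearchIndex()
bus.subscribe(index.on_event)

# The repository publishes events; the index consumes them.
bus.publish({"type": "ingest", "pid": "demo:1", "title": "Rotunda drawing"})
bus.publish({"type": "modify", "pid": "demo:1", "title": "Rotunda elevation"})
```

Because the index only listens, further services can subscribe to the same events without any change to the repository.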