Collaborating to Compile Information about Formats: The vision, the current state, and the challenges for format registries. Caroline R. Arms, Library of Congress, Office of Strategic Initiatives. Atlantic Workshop on Long Term Knowledge Retention, University of Bath, Feb 12, 2007
Assumptions: Nobody can keep digital content usable for decades without identifying and understanding the formats the content is encoded in. We need: –information for humans to make decisions –information for systems to retrieve and process –services that can be invoked on the basis of the information and decisions. Many content stewards will need the same information about formats and the same associated services. This is not such a simple problem that we can all afford to do it independently.
Information for humans and systems. Format-related tasks for humans: –Setting policies: what formats will be used/accepted/mandated –Planning for dealing with content in unacceptable or obsolete formats. Format-related tasks for systems: –identification – I have a digital object; what format is it? –validation – I have an object purportedly of format F; is it? –characterization – I have an object of format F; what are its significant technical properties? –delivery – I have an object of format F; how can I render it? –transformation – I have an object of format F, but need it in format G; how can I produce it? –risk assessment – I have an object of format F; is it at risk of obsolescence?
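The identification and validation tasks listed above can be illustrated with a minimal sketch. It assumes identification by leading "magic number" bytes, which is only one common technique and not necessarily what any registry discussed here uses. The `SIGNATURES` table and the `identify` and `validate` functions are hypothetical names introduced for illustration.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table: (leading bytes, format name).
# A real registry would hold many more entries and richer matching rules.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(data: bytes) -> Optional[str]:
    """Identification: I have a digital object; what format is it?"""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None  # unknown to this table


def validate(data: bytes, claimed: str) -> bool:
    """Shallow validation: does the object at least start like format F?

    This checks the signature only. Full validation would parse the object
    against the format specification.
    """
    return identify(data) == claimed
```

A signature match is weak evidence: it answers "what does this object claim to be?" rather than "is this a well-formed instance?", which is why validation and characterization are listed as separate tasks.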
The Library of Congress (LC): Collects materials created by others. Acquires the majority through copyright registration/deposit –adds approximately 10,000 items to the collections daily. Has very limited rights to make copies of or transform content it receives through copyright deposit during the period of copyright protection. Can require, under Section 407 of the Copyright Act (Mandatory Deposit), that publishers deposit the "best" edition they publish, but cannot set a quality threshold. Collects much more than books –29 million books and other printed materials, 2.7 million recordings, 12 million photographs, 4.8 million maps, 5 million music items and 58 million manuscripts.
Not just a problem for archives. The obsolescence of digital formats is a problem in many contexts where preservation of digital content matters: –For future generations of scholars or the public –For current scholarship based on cumulative data (e.g. climatology, astronomy, genetics), where new methods may call for migration of older data to new formats –For commercial re-purposing by large enterprises and individual creators –For maintenance of long-lived facilities –For compliance and forensics –For personal memories
Vision of a Global Digital Format Registry. A common mechanism to pool and share scarce technical expertise on a global basis, reducing the necessity for duplicative local effort. A channel for the widest possible distribution of the fruits of that expertise to all actors engaged in preservation activities. A process for generating community-wide agreement as to the normative definitions of format syntax and semantics, promoting best practices and effective interchange of digital assets. A foundation for additional value-added services requiring detailed knowledge of digital formats. A trustworthy resource. Concretely, a two-year project funded by the Mellon Foundation will provide services for: –Centrally-organized collection of format representation information –Distributed storage, discovery, and delivery of that information
GDFR Project – selected highlights (http://www.formatregistry.org). Led by Harvard University (Stephen Abrams). System development by OCLC. Rich data model: –Builds on PRONOM model from UK National Archives –Information about formats (very inclusively defined), permitting persistent identification and description at as fine a level of granularity as needed –Information about software tools –Relationships, dependencies –Links to external resources, e.g. specifications subject to copyright, format assessments. Software intended for reliable production use: –Specific application of more general registry suite from OCLC
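To make the idea of a data model linking formats, software tools, and relationships more concrete, here is a minimal sketch. It is not the GDFR or PRONOM data model: the classes `FormatRecord`, `ToolRecord`, and `Registry`, their fields, and all identifiers are hypothetical. It shows how tool records that name an input and an output format could answer the transformation question ("I have format F, but need format G") with a breadth-first search.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FormatRecord:
    identifier: str  # persistent identifier (hypothetical scheme)
    name: str
    # Relationship type -> identifiers of related formats, e.g. "version-of".
    related: Dict[str, List[str]] = field(default_factory=dict)
    # Links to external resources such as specifications or assessments.
    links: List[str] = field(default_factory=list)


@dataclass
class ToolRecord:
    name: str
    reads: str   # identifier of the format the tool accepts
    writes: str  # identifier of the format the tool produces


class Registry:
    def __init__(self) -> None:
        self.formats: Dict[str, FormatRecord] = {}
        self.tools: List[ToolRecord] = []

    def add_format(self, record: FormatRecord) -> None:
        self.formats[record.identifier] = record

    def add_tool(self, tool: ToolRecord) -> None:
        self.tools.append(tool)

    def conversion_path(self, source: str, target: str) -> Optional[List[str]]:
        """Return the shortest chain of tool names from source to target.

        Returns an empty list if source equals target, or None if no chain
        of registered tools connects the two formats.
        """
        queue = deque([(source, [])])
        seen = {source}
        while queue:
            current, path = queue.popleft()
            if current == target:
                return path
            for tool in self.tools:
                if tool.reads == current and tool.writes not in seen:
                    seen.add(tool.writes)
                    queue.append((tool.writes, path + [tool.name]))
        return None
```

Keeping tools as separate records that point at format identifiers, rather than embedding them in format records, is one way to let the same registry serve both human questions about formats and system questions about available services.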
Learned while waiting. Several partners in the National Digital Information Infrastructure and Preservation Program (NDIIPP) have worked on part of the puzzle. Interesting variety in focus: –Assessment to select preferred formats (LC) –Simple view of formats with emphasis on connections to tools and services –Emphasis on information future programmers would need to build tools –Specialized archive stresses need for fine granularity of format variants, to allow fine-grained characterization and validation, and for non-hierarchical relationships, e.g. involving DTDs, XML Schemas, and instances. Common threads: –Need to have or enlist domain experts –Relationships among formats are complex. Some are important to humans and irrelevant to systems. –Challenges of documenting proprietary formats (possible LC role: specifications in escrow) –Enthusiasm about making local registries compatible with GDFR
Registry collaboration/interoperation. Need to build trust for systems to rely on global registry. Global system needs: –critical mass of content perceived to be authoritative –proven track record of reliability –evidence of economic viability. GDFR will pursue technical steps in current project: –interoperation with other registries/repositories (e.g. PRONOM, Representation Information Registry Repository) –building population of records, including from other resources –links to information in other resources. Knowing that social steps are important too: –editorial process and culture –governance: separate activity, led by U.S. National Archives
Library of Congress contribution: Analysis of digital formats. Description and assessment. Focusing on formats promising for long-term sustainability and for content LC is likely to collect. Web site: http://www.digitalpreservation.gov/formats/
Two Types of Evaluation Factors. Sustainability factors for all formats –influence feasibility and cost of preserving content in the face of future change. Quality and functionality factors that vary by content category –reflect considerations that will be expected by future users
Quality and Functionality Factors. Vary according to content type, e.g., text, image, sound. Pertain to current and future usefulness, e.g., for scholarship or repurposing. Identification of factors reflects consideration of features that are likely to be significant or essential for some content items and that may not be representable in all formats: –Surround sound for audio –Color maintenance for still images –Logical structure for text documents. Domain expertise is clearly key here –Bill Regli's group at Drexel University is beginning to document engineering formats