
Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.


1 Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California Digital Library Stephen.Abrams@ucop.edu

2 Characterization /ker-ik-t(ə-)rə-zā'-shən/ noun 1. The action or result of characterizing. 2. Description of characteristics or essential features.

3 Characterization Knowing what you have, as a stable starting point for iterative preservation analysis, planning, and action Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 1:2 (June 2007).

4 What? So what? What do you have? –Identification –Feature extraction –Conformance What should you do with what you have? –Assessment

5 Ingest workflow

6 Migration workflow

7 Two approaches to characterization Implicit –Custom grammars defining a single format, processed by a generic engine that understands all grammars: Unix file; National Archives (UK) DROID; Open Grid Forum DFDL; Planets XCEL/XCDL. Explicit –Plug-in framework with custom modules that each understand a single format: NLNZ Metadata Extractor; JHOVE

8 Why choose one over the other? Implicit –Pro: More sustainable in the long term –Con: Is the formal notation rich enough to capture all nuances of formats of interest? Explicit –Pro: It’s just programming –Con: It’s more programming
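The contrast between the two approaches can be illustrated with a minimal sketch. This is not code from DROID, JHOVE, or any tool named above: the signature table, class names, and function names are hypothetical, and the "grammar" is reduced to well-known leading byte signatures.

```python
from typing import Optional

# Implicit style: formats are described as data (here, a toy "grammar" of
# leading byte signatures) and one generic engine interprets every entry.
SIGNATURES = {
    "GIF": [b"GIF87a", b"GIF89a"],
    "PDF": [b"%PDF-"],
    "TIFF": [b"II*\x00", b"MM\x00*"],
    "JPEG": [b"\xff\xd8\xff"],
}


def identify_implicit(data: bytes) -> Optional[str]:
    """Generic engine: knows nothing about any format beyond the table."""
    for name, magics in SIGNATURES.items():
        if any(data.startswith(magic) for magic in magics):
            return name
    return None


# Explicit style: each format gets its own hand-written module.
class GifModule:
    name = "GIF"

    def matches(self, data: bytes) -> bool:
        return data[:6] in (b"GIF87a", b"GIF89a")


class PdfModule:
    name = "PDF"

    def matches(self, data: bytes) -> bool:
        return data.startswith(b"%PDF-")


MODULES = [GifModule(), PdfModule()]


def identify_explicit(data: bytes) -> Optional[str]:
    """Framework: asks each format-specific module in turn."""
    for module in MODULES:
        if module.matches(data):
            return module.name
    return None
```

Supporting a new format in the implicit style means adding a table entry; in the explicit style it means writing another module, which is the "more programming" trade-off noted on the slide.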

9 JHOVE Extensible framework for format identification, validation, and characterization –Pluggable format-specific modules for: GIF, JPEG, JPEG 2000, TIFF; AIFF, WAVE; ASCII, HTML, UTF-8, XML; PDF –GUI, command-line, and Java API. Collaborative project of Harvard University and the JSTOR Electronic-Archive Initiative –Funded by Andrew W. Mellon Foundation –GNU LGPL license

10 JHOVE2 A next generation architecture for format-aware preservation processing –Three-fold goals: Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement; Provide significant new function; (Re-)Implement modules. Collaborative project of CDL, Portico, and Stanford University –Funded by Library of Congress/NDIIPP –Open source BSD license

11 JHOVE2 enhancements JHOVE assumed 1 object = 1 file = 1 format. But what about… –TIFF with embedded ICC profile and XMP metadata: 1 object = 1 file = 3 formats –JPEG 2000 JPX fragmentation: 1 object = n files = 1 format –ESRI Shapefile: 1 object = 3 files = 3 formats. JHOVE2 will support 1 object = n files = m formats
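A minimal sketch of a data model that allows 1 object = n files = m formats. The class and field names are hypothetical and are not taken from JHOVE2; the file names and format labels are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FileUnit:
    """One file, which may carry several formats (e.g. embedded metadata)."""
    name: str
    formats: List[str]


@dataclass
class DigitalObject:
    """One logical object made of any number of files."""
    label: str
    files: List[FileUnit] = field(default_factory=list)

    def shape(self) -> Tuple[int, int]:
        """Return (number of files, number of distinct formats)."""
        distinct = {fmt for unit in self.files for fmt in unit.formats}
        return len(self.files), len(distinct)


# TIFF with embedded ICC profile and XMP metadata: 1 file, 3 formats.
tiff = DigitalObject("scan", [FileUnit("scan.tif", ["TIFF", "ICC", "XMP"])])

# Fragmented JPX: several files, 1 format (three fragments chosen arbitrarily).
jpx = DigitalObject("map", [FileUnit(f"map.{i}.jpx", ["JPX"]) for i in range(3)])

# ESRI Shapefile: 3 files, 3 formats.
shapefile = DigitalObject("roads", [
    FileUnit("roads.shp", ["Shapefile main file"]),
    FileUnit("roads.shx", ["Shapefile index file"]),
    FileUnit("roads.dbf", ["dBASE table"]),
])
```

The original JHOVE assumption corresponds to the special case where `shape()` always returns `(1, 1)`.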

12 JHOVE2 enhancements Generic plug-in interface; Configurable set of modules iteratively invoked against each object; Inter-module memory structure for stateful processing; Identification de-coupled from conformance; Standardized handling of format profiles and error reporting; Configurable conformance criteria; API-level support for limited editing
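A minimal sketch of how a generic plug-in interface, iterative module invocation, and a shared inter-module memory structure could fit together. All names are hypothetical and this is not the JHOVE2 API; the conformance check is a deliberately toy test.

```python
from typing import Any, Dict, List, Protocol


class Module(Protocol):
    """Generic plug-in interface: every module sees the data and shared state."""

    def process(self, data: bytes, state: Dict[str, Any]) -> None: ...


class Identifier:
    """Identification only: records a format name, makes no validity claim."""

    def process(self, data: bytes, state: Dict[str, Any]) -> None:
        if data.startswith(b"%PDF-"):
            state["format"] = "PDF"


class PdfConformance:
    """Conformance, de-coupled from identification: reads the earlier result."""

    def process(self, data: bytes, state: Dict[str, Any]) -> None:
        if state.get("format") != "PDF":
            return
        # Toy criterion (assumption): the file ends with an %%EOF marker.
        ok = data.rstrip().endswith(b"%%EOF")
        state["conforms"] = ok
        if not ok:
            state.setdefault("errors", []).append("missing %%EOF marker")


def characterize(data: bytes, modules: List[Module]) -> Dict[str, Any]:
    """Invoke the configured modules in order against one object."""
    state: Dict[str, Any] = {}
    for module in modules:
        module.process(data, state)
    return state
```

Because the module list is an argument, the set and order of modules is configurable per run, and each module can build on what earlier modules left in `state`.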

13 JHOVE2 modules Identification; Feature extraction and conformance for: –GIF, JPEG, JPEG 2000, TIFF –AIFF, WAVE –ASCII, HTML, SGML, UTF-8, XML –PDF –Shapefile –ICC; Symbolic display of selected binary formats; Assessment based on prior characterization and locally-defined policy rules and heuristics

15 JHOVE2 modules Identification; Feature extraction and conformance for: –JPEG 2000, TIFF –WAVE –ASCII, SGML, UTF-8, XML –PDF –Shapefile –ICC; Symbolic display of selected binary formats; Assessment based on prior characterization and locally-defined policy rules and heuristics
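Assessment based on prior characterization and locally-defined policy rules can be sketched as a list of named predicates applied to an already extracted set of properties. The rule labels, property keys, and accepted-format list below are invented for illustration and do not reflect any actual institutional policy or JHOVE2 behavior.

```python
from typing import Any, Callable, Dict, List, Tuple

# A rule is a label plus a test over previously extracted properties.
Rule = Tuple[str, Callable[[Dict[str, Any]], bool]]

# Hypothetical local policy.
POLICY: List[Rule] = [
    ("format is on the local accepted list",
     lambda c: c.get("format") in {"TIFF", "PDF", "JPEG 2000"}),
    ("object is valid against its format",
     lambda c: c.get("valid") is True),
    ("TIFF is uncompressed",
     lambda c: c.get("format") != "TIFF" or c.get("compression") == "none"),
]


def assess(characterization: Dict[str, Any]) -> List[str]:
    """Return the labels of every policy rule the object fails."""
    return [label for label, test in POLICY if not test(characterization)]
```

An empty result means the object meets local policy; a non-empty result lists the rules that would prompt preservation planning or action.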

16 JHOVE2 data abstraction Determine the “natural” conceptual structures of a format and their component attributes –Each such structure maps to a class with methods for parsing, validating, reporting, and serializing –Each such attribute maps to a field with accessor and mutator methods. UTF-8 → Character; TIFF → IFH and IFD; JPEG 2000 → Box; PDF → boolean, number, string, name, array, dictionary, stream, and null
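As a sketch of this mapping, the 8-byte TIFF image file header (IFH) can be modeled as one class whose fields are the header's attributes and whose methods parse, validate, report, and serialize. The class and method names are hypothetical, not JHOVE2's; the header layout (byte-order mark, the number 42, offset of the first IFD) follows the TIFF 6.0 specification.

```python
import struct
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TiffIFH:
    byte_order: bytes       # b"II" (little-endian) or b"MM" (big-endian)
    magic: int              # 42 in a classic TIFF
    first_ifd_offset: int   # byte offset of the first IFD

    @staticmethod
    def _endian(order: bytes) -> str:
        # Anything other than b"II" is read as big-endian; validate() flags it.
        return "<" if order == b"II" else ">"

    @classmethod
    def parse(cls, data: bytes) -> "TiffIFH":
        if len(data) < 8:
            raise ValueError("a TIFF header needs 8 bytes")
        order = data[:2]
        magic, offset = struct.unpack(cls._endian(order) + "HI", data[2:8])
        return cls(order, magic, offset)

    def validate(self) -> List[str]:
        errors = []
        if self.byte_order not in (b"II", b"MM"):
            errors.append("unknown byte-order mark")
        if self.magic != 42:
            errors.append("magic number is not 42")
        if self.first_ifd_offset < 8:
            errors.append("first IFD offset points inside the header")
        return errors

    def report(self) -> Dict[str, object]:
        return {"byteOrder": self.byte_order.decode("ascii", "replace"),
                "magic": self.magic,
                "firstIFDOffset": self.first_ifd_offset}

    def serialize(self) -> bytes:
        packed = struct.pack(self._endian(self.byte_order) + "HI",
                             self.magic, self.first_ifd_offset)
        return self.byte_order + packed
```

In Python the dataclass fields play the role of accessors and mutators directly; for example, assigning a new `first_ifd_offset` and calling `serialize()` yields an edited header.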

17 JHOVE2 timeline Months 1-6: Outreach, design, and prototyping; Months 7-9: Core APIs and framework; Months 10-24: Modules

18 For more information… www.significantproperties.org.uk droid.sourceforge.net forge.gridforum.org/projects/dfdl-wg hki.uni-koeln.de/planets/ meta-extractor.sourceforge.net hul.harvard.edu/jhove www.ucop.edu:8080/display/JHOVE2Info/Home

