Transparent Format Migration of Preserved Web Content D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito D-Lib Magazine, 11(1), 2005


1 Transparent Format Migration of Preserved Web Content D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito D-Lib Magazine, 11(1), 2005 http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html Slides by Frank McCown Old Dominion University March 17, 2005

2 Format Migration What is it? Conversion of an older digital object (DO) format to a current format What other major digital preservation strategy could be used? Emulation: the original DO format is preserved and presented to the user When should a DO be migrated to a new format? Format change does not imply obsolescence

3 Format Obsolescence of Web Content Web format is obsolete when widely used browsers can no longer present the content Backwards compatibility of browsers a must HTML 4 vs. XHTML Old Web formats die slowly How many can you think of? Emulation is difficult to implement Find older browser, original plug-in, etc.

4 Migration of Obsolete Formats Three migration points Migration on ingest Convert all incoming objects into selected format before preserving Batch migration Convert all preserved objects into new format when preserved format is perceived to be obsolete Migration on access Convert preserved object into new format on-the-fly when requested by a user

5 Migration Issues Keep the original format in case the conversion tool is later found to have a bug or to have lost vital information during conversion The conversion tool should also be preserved, both to document the original format and in case a bug is found in the tool Choose the migration format wisely: it can significantly reduce the need for, and cost of, future migrations

6 The LOCKSS System LOCKSS [1]: Lots Of Copies Keep Stuff Safe™ Developed at Stanford University Open source, P2P software Used by libraries to ensure web-accessible content (e-journals and open access material) remains available at all times Each peer collects material to preserve by crawling publisher’s web site Peers continually perform content consistency checks and repair content when needed Preserved material is transparently presented to user if publisher’s copy is not available (using web proxy) Currently used by 80 libraries worldwide [1] http://lockss.stanford.edu

7 LOCKSS Format Migration Plug-in format converter registers input/output MIME types IANA MIME types: http://www.iana.org/assignments/media-types/ LOCKSS web proxy uses plug-in converters to perform on-the-fly conversion of obsolete formats (migration on access) Converters are preserved along with web content among peers
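
The registration idea on this slide can be illustrated with a minimal sketch. It is not the LOCKSS plug-in API, which the slides do not show: the class name ConverterRegistry, its methods, and the placeholder converter are all hypothetical, and the only thing taken from the slide is that each converter declares an input and an output MIME type.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A converter takes the preserved bytes and returns bytes in the new format.
Converter = Callable[[bytes], bytes]


class ConverterRegistry:
    """Hypothetical registry keyed by (input MIME type, output MIME type)."""

    def __init__(self) -> None:
        self._converters: Dict[Tuple[str, str], Converter] = {}

    def register(self, input_type: str, output_type: str, fn: Converter) -> None:
        # MIME types are case-insensitive, so normalise before storing.
        self._converters[(input_type.lower(), output_type.lower())] = fn

    def outputs_for(self, input_type: str) -> List[str]:
        """All formats this registry can produce from input_type."""
        wanted = input_type.lower()
        return sorted(out for (inp, out) in self._converters if inp == wanted)

    def find(self, input_type: str, output_type: str) -> Optional[Converter]:
        return self._converters.get((input_type.lower(), output_type.lower()))


def placeholder_gif_to_png(data: bytes) -> bytes:
    # Placeholder only: a real converter would decode the GIF and encode a PNG.
    return b"PNG:" + data


registry = ConverterRegistry()
registry.register("image/gif", "image/png", placeholder_gif_to_png)
```

A proxy could then ask the registry which output types exist for a stored image/gif object before deciding whether a request can be satisfied.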

8 Proof of Concept Convert “obsolete” GIF images to PNG Proxy Web server prevents MIME type image/gif from matching any Accept: header Mismatch prompts conversion so content is delivered using the original URL but with Content-Type: image/png Images from Fig 1 and 2 at http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html

9 HTTP Format Negotiation Browser can tell a web server a format is obsolete by telling it not to send that format HTTP/1.1 [1] defines how web servers and client browsers negotiate the format, language, and encoding of web content Browser sends request using Accept: header listing acceptable MIME types of content format [1] http://www.w3.org/Protocols/rfc2616/rfc2616.html

10 Format Negotiation Examples
Accept: text/plain;q=0.5, text/xml;q=0.8, text/html
“I prefer text/html first, text/xml second, and finally text/plain.”
*/*;q=0.1
“If you can’t give me what I want, give me what you have.”
image/*, image/gif;q=0
“Send me any kind of image except GIFs.”
NOTE: q=0 semantics are not actually defined in HTTP/1.1
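
To make the preference ordering in these examples concrete, here is a small sketch that splits an Accept value into media ranges and q-values and sorts them by preference. The function name parse_accept is hypothetical, a missing q is taken as 1.0, and treating a malformed q as 1.0 is an assumption made for this sketch only.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media_range, q) pairs, most preferred first."""
    entries: List[Tuple[str, float]] = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue  # skip empty items such as a trailing comma
        q = 1.0  # a media range without an explicit q counts as q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 1.0  # assumption: ignore a malformed q-value
        entries.append((media_range, q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(entries, key=lambda e: e[1], reverse=True)


prefs = parse_accept("text/plain;q=0.5, text/xml;q=0.8, text/html")
```

For the first example on the slide, prefs lists text/html first, then text/xml, then text/plain, matching the quoted reading.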

11 Format Negotiation Illustration
[Figure: Browser, LOCKSS Proxy, and Web Server exchanging a request and response]
HTTP Request: Accept: */*;q=0.1, image/gif;q=0
Browser callout: “I’ll take whatever you have except obsolete GIF images.”
Callout at the GIF to PNG Converter: “All I have are GIFs. I’ll convert them to a format the browser can handle.”
HTTP Response: Content-Type: image/png
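
The decision in this illustration can be sketched as follows, assuming the rule that the most specific matching media range decides a type's q-value and that q=0 means "do not send". The names quality and choose_delivery_type, and the convertible_to mapping, are hypothetical and not taken from LOCKSS.

```python
from typing import Dict, List, Optional


def quality(accept: str, mime_type: str) -> float:
    """q-value that an Accept header assigns to mime_type."""
    if not accept.strip():
        return 1.0  # no Accept header: assume anything is acceptable
    mime_type = mime_type.lower()
    main_type = mime_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                q = float(param[2:])
        if media_range == mime_type:
            specificity = 2  # exact type, e.g. image/gif
        elif media_range == main_type + "/*":
            specificity = 1  # type wildcard, e.g. image/*
        elif media_range == "*/*":
            specificity = 0  # full wildcard
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def choose_delivery_type(
    accept: str, stored_type: str, convertible_to: Dict[str, List[str]]
) -> Optional[str]:
    """Pick the MIME type to deliver, or None if nothing is acceptable."""
    if quality(accept, stored_type) > 0:
        return stored_type  # the preserved format is still wanted
    candidates = [
        t for t in convertible_to.get(stored_type, []) if quality(accept, t) > 0
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: quality(accept, t))


target = choose_delivery_type(
    "*/*;q=0.1, image/gif;q=0", "image/gif", {"image/gif": ["image/png"]}
)
```

With the request shown on the slide, the stored image/gif is rejected by q=0 while image/png matches */*;q=0.1, so target is image/png and the proxy would run the converter before responding.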

12 Future Work for LOCKSS Replace proof-of-concept implementation with complete implementation with API for plug-in converters Use a format migration service like TOM Use JHOVE format metadata extraction and validation technology to improve the quality of format metadata

13 TOM (Typed Object Model) Came from John Ockerbloom’s Ph.D. thesis at Carnegie Mellon [1] Currently managed by developers at Univ of Pennsylvania Library led by Ockerbloom Addresses the problem that the growing number of new and obsolete data formats makes using digital information problematic TOM makes it possible to Explain a data format Interpret the format for proper data extraction Convert the format into other formats [1] http://tom.library.upenn.edu/pubs/thesis/

14 TOM Two components Data Model that describes data formats and operations that can be performed on them Networked software that supports the description and operations of the data formats Figure from http://tom.library.upenn.edu/intro.html

15 TOM Applications TOM example broker http://tom.library.upenn.edu/cgi-bin/typebrowse/showtype?broker=tom%2elibrary%2eupenn%2eedu& TOM Conversion Service http://tom.library.upenn.edu/convert/ Could be used by LOCKSS for format migration Fred (Format Registry Demonstration) http://tom.library.upenn.edu/fred/

16 JHOVE JSTOR/Harvard Object Validation Environment [1] Provides functions to perform format-specific identification, validation, and characterization of digital objects Identification What is the format of my digital object? Validation Is my digital object really of type X? Characterization What are the significant properties of my digital object of type X? GIF example http://hul.harvard.edu/jhove/gif-hul.html [1] http://hul.harvard.edu/jhove/
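
The three questions on this slide can be illustrated with a toy sketch for GIF. This is not JHOVE and does not use its API; the function names are hypothetical, and the checks (file signature, trailer byte, logical screen size) are far shallower than real format validation.

```python
import struct
from typing import Dict, Optional, Union

GIF_SIGNATURES = (b"GIF87a", b"GIF89a")
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def identify(data: bytes) -> Optional[str]:
    """Identification: guess the format from the leading signature bytes."""
    if data[:6] in GIF_SIGNATURES:
        return "image/gif"
    if data[:8] == PNG_SIGNATURE:
        return "image/png"
    return None


def looks_like_valid_gif(data: bytes) -> bool:
    """Validation (very shallow): signature, minimum size, and trailer byte."""
    # 6-byte signature + 7-byte logical screen descriptor + 1-byte trailer.
    return identify(data) == "image/gif" and len(data) >= 14 and data.endswith(b";")


def characterize_gif(data: bytes) -> Dict[str, Union[str, int]]:
    """Characterization: report the version and logical screen size."""
    if identify(data) != "image/gif" or len(data) < 10:
        raise ValueError("not a GIF")
    # Width and height are little-endian 16-bit values after the signature.
    width, height = struct.unpack("<HH", data[6:10])
    return {"version": data[3:6].decode("ascii"), "width": width, "height": height}


# A hand-built 2x3 GIF skeleton with no image data, used only for the demo.
sample = b"GIF89a" + struct.pack("<HH", 2, 3) + b"\x00\x00\x00" + b";"
info = characterize_gif(sample)
```

The point of the sketch is the separation of concerns: identification answers "which format", validation answers "is it well formed", and characterization extracts properties worth recording as metadata.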

17 JHOVE Use in Repository Figure from http://hul.harvard.edu/jhove/ Submission Information Package (SIP) - OAIS

18 JHOVE and LOCKSS JHOVE generates reliable format metadata LOCKSS can use JHOVE to extract quality metadata about the contents of its repository What if object to store is not valid? It may be easier to write a conversion tool using JHOVE to supply format metadata

19 Conclusion Goal is to ensure obsolete formats will not make current LOCKSS content inaccessible Migration on access can be done transparently to the user Format migration service like TOM can be used to perform conversions Use of JHOVE would improve quality of LOCKSS content metadata

