Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination Joan A. Smith & Michael L. Nelson Old Dominion University Department.

1 Generating Best Effort Preservation Metadata for Web Resources at Time of Dissemination
Joan A. Smith & Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu
Joint Conference on Digital Libraries (JCDL) 2007
Presented: 20 June 2007

2 20 June 2007{jas,mln}@odu.edu Slide # 2 What’s In A Web Page?

3 A Simple Web Page: Behind the Scenes

4 HTTP: Behind the Scenes
Non-text resource example: http://foo.edu/jackJill.jpg
Note the sparse metadata returned by the HTTP GET request. The binary content is not human-readable and does not even display properly in the terminal window.
We really need more metadata for the digital archeologist of the future:
–Color map
–NISO information
–Base64 encoding of resource
–MD5 or other hash function
–Subject matter
And more metadata would help preserve the Jack and Jill document, too:
–Language
–Document summary/abstract
–Keyword extraction
–Lexical signature
Terminal session:
    % telnet foo.edu 80
    Trying 82.165.199.160...
    Connected to foo.edu.
    Escape character is '^]'.
    GET /jackJill.jpg HTTP/1.1
    Host: foo.edu
    HTTP/1.1 200 OK
    Date: Mon, 11 Jun 2007 16:49:25 GMT
    Server: Apache/1.3.33 (Unix)
    Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
    ETag: "5800535-3e72-4312f924"
    Accept-Ranges: bytes
    Content-Length: 15986
    Content-Type: image/jpeg
    ÿØÿà "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ ¬ê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü­b»[g¨øx^zè ²
    Connection closed by foreign host.
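
To make the gap concrete, here is a minimal Python sketch (not part of the original slides) that splits a raw HTTP response into headers and body and then derives two of the items the slide asks for: an MD5 hash and a Base64 encoding. The function name `describe_response` and the 4-byte stand-in body are assumptions for illustration only; the real resource on the slide is a full JPEG file.

```python
import base64
import hashlib


def describe_response(raw: bytes) -> dict:
    """Split a raw HTTP/1.1 response and derive extra metadata from the body."""
    # Headers and body are separated by the first blank line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "status": status_line,
        "headers": headers,                      # what HTTP already gives us
        "md5": hashlib.md5(body).hexdigest(),    # fixity information
        "base64": base64.b64encode(body).decode("ascii"),  # text-safe copy
        "length": len(body),
    }


# Hypothetical stand-in response: a 4-byte body beginning with the JPEG marker.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: image/jpeg\r\n"
    b"Content-Length: 4\r\n"
    b"\r\n"
    b"\xff\xd8\xff\xe0"
)
info = describe_response(sample)
print(info["headers"]["content-type"], info["length"], info["base64"])
```

The headers say little more than type and size; everything else in the returned dictionary has to be computed from the bytes themselves.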

5 Preservation & Metadata
[Figure: Probability of Preservation (Low to High) against Resource Metadata Available (Less to More). Labels: "What I get from the HTTP/HTML"; "What I need to make an Archival Information Package (AIP)".]

6 Post-Harvest Processing (at Ingest)
Harvest, then Analyze/Examine/Process, then Archive
Often a combination of manual and automated input

7 Metadata Generation Utility Examples
Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
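
As an illustration of the "File Magic" row, the following Python sketch identifies a format from the first bytes of the content rather than from a file extension or a server-supplied header. It is a simplified stand-in for what the `file` utility does, not its implementation; the signature table and the name `sniff_mime` are assumptions chosen for this example.

```python
# A tiny, hand-picked signature table (assumption: only a few common formats).
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(data: bytes, default: str = "application/octet-stream") -> str:
    """Return a MIME type guessed from the leading bytes of the content."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return default


print(sniff_mime(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(sniff_mime(b"%PDF-1.4\n"))
print(sniff_mime(b"plain text"))
```

A content-based guess like this can disagree with the Content-Type header, which is one reason a best-effort approach may want to record both.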

8 The Conscientious Webmaster
"He who waits to do a great deal of good will never do anything." -- Samuel Johnson
Preservation is important… But I’m soooo busy… How to help???

9 Configuring the Web-Server for Automatic Metadata
Example resource: http://foo.edu/example.html
No impact to everyday users:
–Regular “GET” => “regular” response
–OAI-PMH “Get Record” => “crate” response: http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate
Standard Apache “Location” directive
mod_oai module configured with “plug-ins”
Scripts, utilities, etc. can vary by MIME type
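
The two ideas on this slide, a GetRecord request for the "crate" format and plug-ins chosen by MIME type, can be sketched in Python as follows. This is illustrative only: it is not Apache or mod_oai configuration syntax, and the names `get_record_url`, `PLUGINS_BY_MIME`, and `plugins_for` are hypothetical. The sketch percent-encodes the identifier, which the URL on the slide leaves unencoded, and spells the verb "GetRecord" as the OAI-PMH protocol does.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord request URL for one resource."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,          # percent-encoded by urlencode
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical plug-in table: which utilities run for which MIME type.
PLUGINS_BY_MIME = {
    "image/jpeg": ["file", "jhove"],
    "*/*": ["file"],                       # fallback for any other type
}


def plugins_for(mime_type: str) -> list:
    """Pick the utilities configured for a MIME type, else the fallback list."""
    return PLUGINS_BY_MIME.get(mime_type, PLUGINS_BY_MIME["*/*"])


url = get_record_url("http://foo.edu/modoai/", "http://foo.edu/example.html")
print(url)
print(plugins_for("image/jpeg"), plugins_for("text/html"))
```

An ordinary GET for http://foo.edu/example.html is untouched by any of this; only the GetRecord request triggers the per-type utilities.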

10 Harvest with Metadata (at Dissemination)
Metadata Magic: Get the resource together with its metadata
Harvest: Pre-processed resource

11 Automatic Metadata via mod_oai
Request: http://foo.edu/modoai/?verb=getRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
Response:
    2007-06-18T18:21:46Z
    <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/
    http://foo.edu/jackJill.jpg
    2007-01-17T04:09:07Z
    mime:image:jpeg
    image/jpeg encoding="base64"
    JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc
    "file magic" /usr/bin/file jackJill.jpg file-4.16
    JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
    "jhove" /opt/jhove/jhove -m jpeg-hul
    Jhove (Rel. 1.1, 2006-06-05)
    Date: 2007-06-18 14:35:50 EDT
    RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
    ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
    LastModified: 2007-01-16 23:09:07 EST
    Size: 27750
    Format: JPEG
    Version: 1.00
    Status: Well-Formed and valid
    SignatureMatches: JPEG-hul
    MIMEtype: image/jpeg
    Profile: JFIF
    JPEGMetadata:
    CompressionType: Huffman coding, Baseline DCT
    Images:
    Number: 1
    Image:
    NisoImageMetadata:
    MIMEType: image/jpeg
    ByteOrder: big-endian
    CompressionScheme: JPEG
    ColorSpace: YCbCr
    SamplingFrequencyUnit: inch
    XSamplingFrequency: 33
    YSamplingFrequency: 26
    ImageWidth: 172
    ImageLength: 146
    BitsPerSample: 8, 8, 8
    SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
    QuantizationTable:
    Precision: 8-bit
    DestinationIdentifier: 0
    Comments: LEAD Technologies Inc. V1.01
    ApplicationSegments: APP0
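
A harvester receiving a response like this gets the utility output as plain text. As a rough sketch of how a consumer might read it, the Python below splits "Key: Value" lines of the kind Jhove prints into pairs. The function name `parse_report` is hypothetical, the sample is a short excerpt of the fields shown on the slide, and the sketch ignores the nesting that real Jhove output expresses through indentation.

```python
def parse_report(text: str) -> list:
    """Split 'Key: Value' lines into (key, value) pairs, keeping their order."""
    pairs = []
    for line in text.splitlines():
        if ":" not in line:
            continue                      # e.g. a banner line without a field
        # Split on the first colon only, so timestamps in values stay intact.
        key, _, value = line.partition(":")
        pairs.append((key.strip(), value.strip()))
    return pairs


# Excerpt of the fields shown on the slide.
report = """\
Format: JPEG
Status: Well-Formed and valid
MIMEtype: image/jpeg
LastModified: 2007-01-16 23:09:07 EST
ImageWidth: 172
ImageLength: 146
"""
fields = dict(parse_report(report))
print(fields["Format"], fields["ImageWidth"], fields["ImageLength"])
```

Returning ordered pairs rather than a dictionary from `parse_report` keeps repeated keys visible; converting to `dict` as in the usage lines is a convenience that keeps only the last value for each key.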

12 Preservation & Metadata
[Figure: Probability of Preservation (Low to High) against Resource Metadata Available (Less to More). Labels: "HTTP/HTML"; "Automatic metadata utilities/CRATE"; "Archival Information Package (AIP)".]

13 Automatic, Best-Effort Metadata
Unverified
–Utility results are not cross-checked
–Output of the analyses goes directly into the XML response
Undifferentiated
–No categorization of output
–Resource and metadata cohabit the response
Automatic
–Generated at time of dissemination
–Integrates preservation functions with the web server
A simple, easy-to-implement option for improving preservation metadata for web resources
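
The three properties above can be illustrated with a small Python sketch that packages a resource and whatever its utilities printed into one XML document. The element and attribute names (`crate`, `resource`, `metadata`, `utility`) are assumptions made for this example and are not the actual CRATE schema; the function name `build_crate` is likewise hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def build_crate(identifier: str, mime_type: str, content: bytes, utility_outputs: dict) -> str:
    """Wrap a resource and raw utility output in one XML document."""
    crate = ET.Element("crate", {"identifier": identifier})
    # The resource itself travels inside the response, Base64-encoded.
    resource = ET.SubElement(crate, "resource", {"type": mime_type, "encoding": "base64"})
    resource.text = base64.b64encode(content).decode("ascii")
    # Unverified and undifferentiated: each utility's output is copied as-is,
    # with no cross-checking and no categorization of the fields.
    for name, output in utility_outputs.items():
        meta = ET.SubElement(crate, "metadata", {"utility": name})
        meta.text = output
    return ET.tostring(crate, encoding="unicode")


xml_text = build_crate(
    "http://foo.edu/jackJill.jpg",
    "image/jpeg",
    b"\xff\xd8\xff\xe0",                  # stand-in bytes, not the real image
    {"file magic": "JPEG image data, JFIF standard 1.00"},
)
print(xml_text)
```

Because the utility text is stored verbatim, conflicting results from two utilities would both appear in the response; resolving them is left to whoever processes the archive later.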

14 Further Information
The mod_oai project home page: http://www.modoai.org/
IWAW 2007: “CRATE: A Simple Model for Self-Describing Web Resources”
Authors’ web pages:
http://www.cs.odu.edu/~mln/pubs/
http://www.joanasmith.com/pubs.html
I Helped!

