Presentation is loading. Please wait.

Presentation is loading. Please wait.

An Apache Module for Generating Self-Describing Web Resources Joan A. Smith Michael L. Nelson Alliance for Information Science and Technology Innovation.

Similar presentations


Presentation on theme: "An Apache Module for Generating Self-Describing Web Resources Joan A. Smith Michael L. Nelson Alliance for Information Science and Technology Innovation."— Presentation transcript:

1 An Apache Module for Generating Self-Describing Web Resources Joan A. Smith Michael L. Nelson Alliance for Information Science and Technology Innovation East Coast Meeting 16 October 2007

2 {jsmit,mln}@cs.odu.edu Slide # 2 Web Site Preservation: 2 Problems The counting problem How many pages are on that site? To save it you have to find it The representation problem What’s that page all about? Future use requires understanding Guess the bean count, win the jar

3 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 3 A Crawler’s View of the Web Site web root http://www.foo.edu/ X X X The crawler has run into the counting problem, and doesn’t know it….

4 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 4 Pages Out of Crawler Reach Some pages linked from web root Some dynamic content Some orphaned pages Some pages protected with access controls Some pages too deep for a particular crawler Sitemap protocol attempts to address these problems

5 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 5 Sitemap Protocol Google-driven initiative –Derives from earlier concept of graphical “site map” and alphabetical “site index” –Google standardized the protocol XML-formatted file –Created by webmaster of a web site –Can be “tweaked” to include/exclude files based on a wide variety of criteria Simplifies site resource exposure to search engines (i.e., to their robots) Supported by Google, MSN, Yahoo and ASK But it doesn’t address inherent HTTP limitations

6 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 6 Web Crawling & The Counting Problem HTTP cannot ask for only new or modified resources –Conditional GET by datestamp or etag has limited benefit –Cannot get a list of pages that have been deleted; changed; added –Each resource must be requested, one at a time, by name There is no “SELECT * ” in HTTP –Crawlers cannot request a list of all URLs for the site –Crawlers can only GET one resource at a time, by name –HTTP cannot give a crawler a list of resources it has Undiscovered resources will not be refreshed Sitemaps –XML document lays out site structure (cf. http://www.sitemaps.org/protocol.php ) http://www.sitemaps.org/protocol.php http://www.example.com/ 2005-01-01 monthly 0.8 –Provides minimal, crawl-oriented metadata (update frequency, etc.) –Can include Dynamic URLs Counting Problem Search Engine Solution

7 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 7 A Web Page: Behind the Scenes

8 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 8 HTTP: Behind the Scenes Resource example: http://foo.edu/jackJill.jpg Note the limited metadata from the HTTP GET request Binary content is not human-readable We only “GET” one resource at a time Additional metadata could help the digital image archeologist of the future: –Color map –NISO information –Base64 encoding of resource –MD5 or other hash function –Subject matter And metadata that could help preserve the Jack and Jill document content: –Language –Script type and version –Document summary/abstract –Keyword extraction –Lexical signature % telnet foo.edu 80 Trying 82.165.199.160... Connected to foo.edu. Escape character is '^]'. GET /jackJill.jpg HTTP/1.1 Host: foo.edu HTTP/1.1 200 OK Date: Mon, 11 Jun 2007 16:49:25 GMT Server: Apache/1.3.33 (Unix) Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT ETag: "5800535-3e72-4312f924" Accept-Ranges: bytes Content-Length: 15986 Content-Type: image/jpeg ÿØÿà "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ ¬ê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü­b»[g¨øx^zè ² "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ ¬ê@‘XÑ9÷'M½ÂšX¬4ýÃÆ{çÉÎ Ð?~‰·õÔÓ!RÓ@Š’û¡·TÓ`r’pz{ ëÖ.éhéQ)Ùè5ü­b»[g¨øx^zè Connection closed by foreign host.

9 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 9 Web Crawling & The Representation Problem HTTP provides limited metadata –Server-client communication, focused on the here and now –Not concerned with preservation-related information –Format obsolescence is not addressed –File content comprehension is the client’s problem Client and Server must be configured to handle each resource “type” –Default includes most common current file types –Older types, unusual files may get an error message from the client browser:

10 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 10 Archives: Metadata-Rich Each model handles resource representation information in its own way Metadata helps ensure long-term persistence, availability of resources

11 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 11 The MPEG-21 DIDL Model A complex-object model combining the resource and its metadata together

12 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 12 Post-Harvest Processing required for ingestion Harvest Analyze/Examine/ProcessArchive Often a combination of manual and automated input

13 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 13 Metadata Generation Utility Examples NameDescription JhoveAnalysis by type (img, audio, text) KeaKey phrase extraction OTSOpen Text Summarizer ExifToolImage/video metadata extractor PDFlib-pCOSExtract PDF metadata MP3-TagExtract audio file tags EssenceCustomized information extraction GDFRMIME++ MD5Message Digest File MagicUses content-identification bits of the file

14 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 14 # Webs >> # Archiving Institutions Archivist Web Sites Typical ingest scenario

15 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 15 Harvest with Metadata Metadata Magic: Get the resource together with its metadata Harvest Pre-processed resource

16 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 16 Harnessing the Web Server Archivist: mod_oai GetRecord request and response User: standard GET request and response Self-describing resource

17 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 17 Configuring the Web-Server for Metadata Magic http://foo.edu/example.html No impact to everyday users Regular “GET” => “regular” response OAI-PMH “Get Record” => “crate” response http://foo.edu/modoai/?verb=getRecord&identifier= http://foo.edu/example.html&metadataPrefix=crate Standard Apache “Location” directive mod_oai module configured with “plug-ins” Scripts, utilities, etc. can vary by MIME type

18 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 18 6 Verbs of the OAI-PMH VerbFunction Identifydescription of repository ListMetadataFormatsmetadata formats supported by repository ListSetssets defined by repository ListIdentifiersOAI unique ids contained in repository ListRecordslisting of N records GetRecordlisting of a single record metadata about the repository harvesting verbs most verbs can take qualifying arguments: dates, sets, ids, metadata formats, and resumption token (for flow control) Compatible with HTTP Supports OAIS model Can support complex object model OAI-PMH can help resolve the Counting and Representation Problems

19 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 19 OAI-PMH Verbs and mod_oai: Addressing The Counting & Representation Problems Counting Problem –“ListIdentifiers” provides equivalent of Sitemap –“ListRecords” response serializes the site’s contents using a single request –Qualifiers – by date range, by MIME set – enable customized crawls –Simplifies update semantics Representation Problem –“ListRecords” and “GetRecords” responses can include a wealth of metadata MPEG-21 DIDL CRATE –Allows more sophisticated crawling and archival preparation OAI-PMH verbs address inherent HTTP limitations mod_oai lets the web server provide self-describing resources

20 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 20 What is a “Self-Describing” Resource? EXIF TOOL: File Name103_0315.JPG Camera Model NameCanon EOS DIGITAL REBEL Date/Time Original2003:09:30 13:37:51 Shooting ModeSports Shutter Speed1/2000 Aperture7.1 Metering ModeEvaluative Exposure Compensation0 ISO400 Lens75.0 - 300.0mm Focal Length300.0mm Image Size3072x2048 QualityNormal FlashOff White BalanceAuto Focus ModeAI Servo AF Contrast+1 Sharpness+1 Saturation+1 Color ToneNormal File Size1606 kB File Number103-0315 Standard HTTP Headers -- Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT ETag: "5800535-3e72-4312f924" Content-Length: 15986 Content-Type: image/jpeg PLUS: Output from built-in utilities: JHOVE TOOL: Date: 2007-06-18 14:35:50 EDT RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22) LastModified: 2007-01-16 23:09:07 EST Size: 27750 Format: JPEG Version: 1.00 Status: Well-Formed and valid SignatureMatches: JPEG-hul MIMEtype: image/jpeg Profile: JFIF JPEGMetadata: CompressionType: Huffman coding, Baseline DCT Images: Number: 1 Image: NisoImageMetadata: MIMEType: image/jpeg ByteOrder: big-endian CompressionScheme: JPEG ColorSpace: YCbCr SamplingFrequencyUnit: inch XSamplingFrequency: 33 YSamplingFrequency: 26 ImageWidth: 172 ImageLength: 146 BitsPerSample: 8, 8, 8 SamplesPerPixel: 3 Scans: 1 QuantizationTables: QuantizationTable: Precision: 8-bit DestinationIdentifier: 0 Comments: LEAD Technologies Inc. V1.01 ApplicationSegments: APP0 File/Magic: JPEG image data JFIF standard 1.00 resolution (DPI) "LEAD Technologies Inc. V1.01“ 33 x 26 MD5 Hash: 58a54e8638db432f4515eedf89f44505 …CRATE: Wrapped together with the resource in simple XML

21 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 21 Apache: mod_oai Location Directive Scripts Pipes Executables MIME-based selective processing

22 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 22 Building a CRATE URI, UUID Standard HTTP Headers Plug-In Metadata Base64-Encoded Resource CRATE CRATE ID METADATA RESOURCE A simple model for self-describing web resources

23 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 23 A self-describing resource using mod_oai http://foo.edu/modoai/?verb=getRecord&identifier= http://foo.edu/jackJill.jpg&metadataPrefix=crate 2007-06-18T18:21:46Z <request verb="GetRecord" identifier=http://foo.edu/jackJill.jpg metadataPrefix=“crate">http://foo.edu/crate/ http://foo.edu/jackJill.jpg 2007-01-17T04:09:07Z mime:image:jpeg image/jpeg encoding=“base64” JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc “file magic” /usr/bin/file jackJill.jpg file-4.16 JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26 “jhove” /opt/jhove/jhove –m jpeg-hul Jhove (Rel. 1.1, 2006-06-05) Date: 2007-06-18 14:35:50 EDT RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22) LastModified: 2007-01-16 23:09:07 EST Size: 27750 Format: JPEG Version: 1.00 Status: Well-Formed and valid SignatureMatches: JPEG-hul MIMEtype: image/jpeg Profile: JFIF JPEGMetadata: CompressionType: Huffman coding, Baseline DCT Images: Number: 1 Image: NisoImageMetadata: MIMEType: image/jpeg ByteOrder: big-endian CompressionScheme: JPEG ColorSpace: YCbCr SamplingFrequencyUnit: inch XSamplingFrequency: 33 YSamplingFrequency: 26 ImageWidth: 172 ImageLength: 146 BitsPerSample: 8, 8, 8 SamplesPerPixel: 3 Scans: 1 QuantizationTables: QuantizationTable: Precision: 8-bit DestinationIdentifier: 0 Comments: LEAD Technologies Inc. V1.01 ApplicationSegments: APP0

24 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 24 Automatic, Best-Effort Metadata Automatic –Generated at time of dissemination –Integrates preservation functions with the web server Unverified –Utility results are not cross-checked –Output of analyses go directly into XML response Undifferentiated –No categorization of output –Resource and metadata form complex-object response A simple, easy-to-implement option for improving available preservation metadata for web resources

25 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 25 Preservation & Metadata Resource Metadata Available Less More Probability of Preservation Low High HTTP/HTML Automatic metadata utilities/CRATE Archival Information Package (AIP)

26 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 26 Current Status mod_oai Open Source release at project’s completion Draft CRATE schema definition (XSD) Metrics Collection & Evaluation –Impact of utilities on web server performance –Examine utility compatibility and issues –Address security concerns Native Utility Efficiency –Language dependent (Java, C) –Improvements may depend on external pressure Security –Metadata vs information exposure risk –Access controls

27 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 27 Demo AT MODOAI.ORG: http://www.modoai.org/demos.html

28 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 28 Further Information The mod_oai project home page: http://www.modoai.org/ JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination IWAW 2007: CRATE: A Simple Model For Self-Describing Web Resources Authors’ webs: http://www.cs.odu.edu/~mln/pubs/ http://www.joanasmith.com/pubs.html

29 Supplementary Slides

30 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 30 Robot Crawls of A Large, Deep Web Google example here: http://www.joanasmith.com/deepWeb/animOdu2.gifhttp://www.joanasmith.com/deepWeb/animOdu2.gif

31 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 31 Addressing the Counting Problem Using OAI-PMH via mod_oai Advantages for Crawler: Single request itemizes all resources in web tree: ListIdentifiers Can refine by MIME set, Datestamp Original modoai was limited: No Dynamic URLs Web root tree only Same metadata as HTTP Basic request: http://www.foo.edu/modoai/?verb=ListIdentifiers&metadataPrefix=oai_dc Enhanced request: &from=2006-09-15&set=mime:video:mpeg New version: Utilizes sitemap files Can include Dynamic URLs Rich metadata possibilities via “plugins”

32 16 October 2007{jsmit,mln}@cs.odu.edu Slide # 32 Web Server Configuration: “conf” file ### Section 1: Global Environment # ServerType standalone ServerRoot "/etc/httpd" PidFile /var/run/httpd.pid ResourceConfig /dev/null AccessConfig /dev/null Timeout 300 KeepAlive On MaxKeepAliveRequests 0 KeepAliveTimeout 15 MinSpareServers 16 MaxSpareServers 64 StartServers 16 MaxClients 512 MaxRequestsPerChild 100000 ### Section 2: 'Main' server configuration # Port 80 Listen 80 Listen 443 User www Group www ServerAdmin admin@openna.com ServerName www.openna.com DocumentRoot "/home/httpd/ona" Options None AllowOverride None Order deny,allow Deny from all Options None AllowOverride None Order allow,deny Allow from all Options None AllowOverride None Order deny,allow Deny from all DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi # #Include conf/mmap.conf # UseCanonicalName On TypesConfig /etc/httpd/conf/mime.types DefaultType text/plain HostnameLookups Off Operational Rules Modules (mod_perl, etc.) Security Virtual Hosts


Download ppt "An Apache Module for Generating Self-Describing Web Resources Joan A. Smith Michael L. Nelson Alliance for Information Science and Technology Innovation."

Similar presentations


Ads by Google