CRATE: A Simple Model for Self-Describing Web Resources
International Web Archiving Workshop 2007
Joan A. Smith & Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk, VA
{jsmit,
IWAW ‘07 Slide # 2
WWW and Digital Libraries: Vastly Different Worlds

World Wide Web
–A disorganized free-for-all
–Near-zero metadata
–Unpredictable additions, deletions, modifications
–No preservation policy
Crawlapalooza

Digital Library
–Organized
–Groomed content
–Lots of metadata
–Structured changes
–Active preservation policies
Harvester Home Companion
IWAW ‘07 Slide # 3
Web Sites: Metadata Challenged

  % telnet foo.edu 80
  Trying
  Connected to foo.edu.
  Escape character is '^]'.
  GET /jackJill.jpg HTTP/1.1
  Host: foo.edu

  HTTP/ OK
  Date: Mon, 11 Jun :49:25 GMT
  Server: Apache/ (Unix)
  Last-Modified: Mon, 29 Aug :01:40 GMT
  ETag: " e f924"
  Accept-Ranges: bytes
  Content-Length:
  Content-Type: image/jpeg

  [binary JPEG data]

HTML metadata
JPEG metadata
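To make the "metadata challenged" point concrete, here is a minimal Python sketch that parses a response header block like the one on this slide and returns everything a client learns about the resource. The function name is hypothetical, and the sample bytes are illustrative: the status line is assumed, and only header values that are legible on the slide are reused.

```python
import http.client
import io


def parse_response_headers(raw: bytes) -> dict[str, str]:
    """Return the header fields of a raw HTTP response as a dict."""
    # Drop the status line; parse_headers expects to start at the first header.
    _status_line, _, rest = raw.partition(b"\r\n")
    # parse_headers stops reading at the blank line that ends the header block.
    message = http.client.parse_headers(io.BytesIO(rest))
    return dict(message.items())


# Illustrative sample (assumption): status line assumed, body is a stand-in.
SAMPLE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Accept-Ranges: bytes\r\n"
    b"Content-Type: image/jpeg\r\n"
    b"\r\n"
    b"...binary image bytes..."
)

if __name__ == "__main__":
    for name, value in parse_response_headers(SAMPLE).items():
        print(f"{name}: {value}")
```

Nothing in these fields describes the image format in depth, its dimensions, or its fixity, which is the gap the rest of the talk addresses.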
IWAW ‘07 Slide # 4 Archives: Metadata-Rich
IWAW ‘07 Slide # 5 YAMM?! (Yet Another Metadata Model?)
IWAW ‘07 Slide # 6 The MPEG-21 DIDL Model
IWAW ‘07 Slide # 7
Preservation & Metadata
[Figure: probability of preservation (low to high) plotted against resource metadata available (less to more). Labeled points: HTTP/HTML; automatic metadata utilities/CRATE; Archival Information Package (AIP).]
IWAW ‘07 Slide # 8
# Webs >> # Archivists
[Figure: typical ingest scenario, showing an archivist and web sites.]
IWAW ‘07 Slide # 9
Harnessing the Web Server
–Archivist: mod_oai GetRecord request and response
–User: standard GET request and response
–Self-describing resource
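The archivist's side of this exchange can be sketched as an OAI-PMH GetRecord URL. The parameter names (verb, identifier, metadataPrefix) come from OAI-PMH and from the request shown later in the talk; the base URL and the resource identifier below are placeholders assumed for illustration, not the project's actual endpoints.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url: str, identifier: str,
                        metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord request URL for one resource."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes the identifier so it is safe in a query string.
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint and identifier (assumptions).
    print(build_getrecord_url("http://foo.edu/modoai",
                              "http://foo.edu/jackJill.jpg"))
```

An ordinary user would simply issue a GET for the resource path; only the archivist's request goes through the mod_oai handler.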
IWAW ‘07 Slide # 10
What is a “Self-Describing” Resource?

Standard HTTP Headers:
  Last-Modified: Mon, 29 Aug :01:40 GMT
  ETag: " e f924"
  Content-Length:
  Content-Type: image/jpeg

PLUS: Output from built-in utilities:

EXIF TOOL:
  File Name: 103_0315.JPG
  Camera Model Name: Canon EOS DIGITAL REBEL
  Date/Time Original: 2003:09:30 13:37:51
  Shooting Mode: Sports
  Shutter Speed: 1/2000
  Aperture: 7.1
  Metering Mode: Evaluative
  Exposure Compensation: 0
  ISO: 400
  Lens: mm
  Focal Length: 300.0mm
  Image Size: 3072x2048
  Quality: Normal
  Flash: Off
  White Balance: Auto
  Focus Mode: AI Servo AF
  Contrast: +1
  Sharpness: +1
  Saturation: +1
  Color Tone: Normal
  File Size: 1606 kB
  File Number:

JHOVE TOOL:
  Date: :35:50 EDT
  RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
  ReportingModule: JPEG-hul, Rel. 1.2 ( )
  LastModified: :09:07 EST
  Size:
  Format: JPEG
  Version: 1.00
  Status: Well-Formed and valid
  SignatureMatches: JPEG-hul
  MIMEtype: image/jpeg
  Profile: JFIF
  JPEGMetadata:
    CompressionType: Huffman coding, Baseline DCT
    Images:
      Number: 1
      Image:
        NisoImageMetadata:
          MIMEType: image/jpeg
          ByteOrder: big-endian
          CompressionScheme: JPEG
          ColorSpace: YCbCr
          SamplingFrequencyUnit: inch
          XSamplingFrequency: 33
          YSamplingFrequency: 26
          ImageWidth: 172
          ImageLength: 146
          BitsPerSample: 8, 8, 8
          SamplesPerPixel: 3
        Scans: 1
        QuantizationTables:
          QuantizationTable:
            Precision: 8-bit
            DestinationIdentifier: 0
        Comments: LEAD Technologies Inc. V1.01
    ApplicationSegments: APP0

File/Magic:
  JPEG image data JFIF standard 1.00 resolution (DPI) "LEAD Technologies Inc. V1.01" 33 x 26

MD5 Hash:
  58a54e8638db432f4515eedf89f44505

…CRATE: Wrapped together with the resource in simple XML
IWAW ‘07 Slide # 11
Metadata Generation Utility Examples

  Name          Description
  Jhove         Analysis by type (img, audio, text)
  Kea           Key phrase extraction
  OTS           Open Text Summarizer
  ExifTool      Image/video metadata extractor
  PDFlib-pCOS   Extract PDF metadata
  MP3-Tag       Extract audio file tags
  Essence       Customized information extraction
  GDFR          MIME++
  MD5           Message Digest
  File Magic    Uses content-identification bits of the file
IWAW ‘07 Slide # 12
Web Server Configuration: “conf” file

  ### Section 1: Global Environment
  #
  ServerType standalone
  ServerRoot "/etc/httpd"
  PidFile /var/run/httpd.pid
  ResourceConfig /dev/null
  AccessConfig /dev/null
  Timeout 300
  KeepAlive On
  MaxKeepAliveRequests 0
  KeepAliveTimeout 15
  MinSpareServers 16
  MaxSpareServers 64
  StartServers 16
  MaxClients 512
  MaxRequestsPerChild

  ### Section 2: 'Main' server configuration
  #
  Port 80
  Listen 80
  Listen 443
  User www
  Group www
  ServerAdmin
  ServerName
  DocumentRoot "/home/httpd/ona"

  Options None
  AllowOverride None
  Order deny,allow
  Deny from all

  Options None
  AllowOverride None
  Order allow,deny
  Allow from all

  Options None
  AllowOverride None
  Order deny,allow
  Deny from all

  DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
  #
  #Include conf/mmap.conf
  #
  UseCanonicalName On
  TypesConfig /etc/httpd/conf/mime.types
  DefaultType text/plain
  HostnameLookups Off

Callouts:
–Operational Rules
–Modules (mod_perl, etc.)
–Security
–Virtual Hosts
IWAW ‘07 Slide # 13
Apache: mod_oai Location Directive

  SetHandler modoai-handler
  modoai_oai_active ON

  label "md5sum"
  exec "/usr/bin/md5sum %s"
  version "/usr/bin/md5sum --version"
  mime "*/*"

  label "file"
  exec "/usr/bin/file -kz %s"
  version "/usr/bin/file -v"
  mime "*/*"

  label "jhove"
  exec "/opt/jhove/jhove -m pdf-hul %s"
  version "/opt/jhove/jhove -v"
  mime "application/pdf"

  label "pronom"
  exec "java -jar DROID.jar -L %s"
  version "java -jar DROID.jar -V"
  mime "*/*"

Callouts:
–Scripts
–Pipes
–Executables
–MIME-based selective processing
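MIME-based selective processing can be illustrated with a short Python sketch. The plug-in labels, command templates, and MIME patterns are copied from the slide; the `Plugin` class, the function names, and the use of glob-style matching are assumptions made for illustration and are not a description of how mod_oai is implemented.

```python
import shlex
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Plugin:
    label: str
    exec_template: str  # "%s" marks where the resource path goes
    mime: str           # glob-style pattern such as "*/*" (assumed semantics)


PLUGINS = [
    Plugin("md5sum", "/usr/bin/md5sum %s", "*/*"),
    Plugin("file", "/usr/bin/file -kz %s", "*/*"),
    Plugin("jhove", "/opt/jhove/jhove -m pdf-hul %s", "application/pdf"),
    Plugin("pronom", "java -jar DROID.jar -L %s", "*/*"),
]


def select_plugins(mime_type: str, plugins=PLUGINS) -> list[Plugin]:
    """Return the plug-ins whose MIME pattern matches the resource type."""
    return [p for p in plugins if fnmatchcase(mime_type.lower(), p.mime)]


def build_argv(plugin: Plugin, path: str) -> list[str]:
    """Expand the command template into an argument list for one file."""
    # Substituting after splitting keeps paths with spaces as one argument.
    return [path if token == "%s" else token
            for token in shlex.split(plugin.exec_template)]


if __name__ == "__main__":
    for plugin in select_plugins("image/jpeg"):
        print(plugin.label, build_argv(plugin, "/tmp/jackJill.jpg"))
```

With the configuration above, a JPEG would be sent to md5sum, file, and pronom, while the PDF-specific jhove entry would be skipped.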
IWAW ‘07 Slide # 14
Building a CRATE

CRATE
–CRATE ID: URI, UUID
–METADATA: Standard HTTP Headers, Plug-In Metadata
–RESOURCE: Base64-Encoded Resource
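The three-part structure on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the element and attribute names are invented here (the talk lists the CRATE schema as not yet released), and deriving the UUID from the URI with `uuid5` is an arbitrary choice, not the authors' method.

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri: str, headers: dict[str, str],
                plugin_output: dict[str, str], content: bytes) -> str:
    """Wrap an identifier, metadata, and the resource into one XML string."""
    crate = ET.Element("crate")  # hypothetical element names throughout

    # 1. CRATE ID: the resource URI plus a UUID (derivation is an assumption).
    ident = ET.SubElement(crate, "id")
    ET.SubElement(ident, "uri").text = uri
    ET.SubElement(ident, "uuid").text = str(uuid.uuid5(uuid.NAMESPACE_URL, uri))

    # 2. METADATA: HTTP headers, then raw plug-in output, copied in unchanged.
    metadata = ET.SubElement(crate, "metadata")
    for name, value in headers.items():
        ET.SubElement(metadata, "header", {"name": name}).text = value
    for label, output in plugin_output.items():
        ET.SubElement(metadata, "plugin", {"label": label}).text = output

    # 3. RESOURCE: the bytes themselves, Base64-encoded so they survive XML.
    resource = ET.SubElement(crate, "resource", {"encoding": "base64"})
    resource.text = base64.b64encode(content).decode("ascii")

    return ET.tostring(crate, encoding="unicode")


if __name__ == "__main__":
    print(build_crate("http://foo.edu/jackJill.jpg",
                      {"Content-Type": "image/jpeg"},
                      {"file": "JPEG image data"},
                      b"\xff\xd8\xff\xe0"))
```

Because the resource travels inside the same document as its metadata, a harvester can recover the original bytes by decoding the Base64 text.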
IWAW ‘07 Slide # 15
CRATE example from mod_oai

  T18:21:46Z
  <request verb="GetRecord" identifier= metadataPrefix="crate">
  T04:09:07Z
  mime:image:jpeg
  image/jpeg
  encoding="base64"
  JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc
  "file magic"
  /usr/bin/file jackJill.jpg
  file-4.16
  JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
  "jhove"
  /opt/jhove/jhove -m jpeg-hul
  Jhove (Rel. 1.1, )
  <![CDATA[
  Date: :35:50 EDT
  RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
  ReportingModule: JPEG-hul, Rel. 1.2 ( )
  LastModified: :09:07 EST
  Size:
  Format: JPEG
  Version: 1.00
  Status: Well-Formed and valid
  SignatureMatches: JPEG-hul
  MIMEtype: image/jpeg
  Profile: JFIF
  JPEGMetadata:
    CompressionType: Huffman coding, Baseline DCT
    Images:
      Number: 1
      Image:
        NisoImageMetadata:
          MIMEType: image/jpeg
          ByteOrder: big-endian
          CompressionScheme: JPEG
          ColorSpace: YCbCr
          SamplingFrequencyUnit: inch
          XSamplingFrequency: 33
          YSamplingFrequency: 26
          ImageWidth: 172
          ImageLength: 146
          BitsPerSample: 8, 8, 8
          SamplesPerPixel: 3
        Scans: 1
        QuantizationTables:
          QuantizationTable:
            Precision: 8-bit
            DestinationIdentifier: 0
        Comments: LEAD Technologies Inc. V1.01
    ApplicationSegments: APP0
  ]]>
IWAW ‘07 Slide # 16
Automatic, Best-Effort Metadata

Automatic
–Generated at time of dissemination
–Integrates preservation functions with the web server

Unverified
–Utility results are not cross-checked
–Output of the analyses goes directly into the XML response

Undifferentiated
–No categorization of output
–Resource and metadata form a complex-object response

A simple, easy-to-implement option for improving the preservation metadata available for web resources
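The "best-effort, unverified" behavior can be sketched as a loop that runs each utility and records whatever it prints, without cross-checking results and without letting one failure stop the others. This is a hedged Python illustration; the function name, the timeout, and the failure marker text are assumptions, not part of mod_oai.

```python
import subprocess
import sys


def run_best_effort(commands: dict[str, list[str]],
                    timeout: float = 10.0) -> dict[str, str]:
    """Run each utility once and keep its raw output, verified or not."""
    results = {}
    for label, argv in commands.items():
        try:
            done = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout)
            # Output is stored as-is: no validation, no categorization.
            results[label] = done.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung utility must not block the response.
            results[label] = f"[utility failed: {exc.__class__.__name__}]"
    return results


if __name__ == "__main__":
    # Stand-in commands (assumptions) so the sketch runs anywhere Python does.
    demo = {
        "echo": [sys.executable, "-c", "print('ok')"],
        "missing": ["/nonexistent/utility-xyz"],
    }
    for label, output in run_best_effort(demo).items():
        print(label, "->", output.strip())
```

The captured strings would then be placed directly into the XML response alongside the resource.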
IWAW ‘07 Slide # 17
Issues - Or Not?

Web Server Performance
–Academic vs dot-com expectations
–Solution options

Utility Efficiency
–Java-based vs C-based
–Market pressures

Security
–Metadata vs risk
–Access controls
IWAW ‘07 Slide # 18
Next Up…

mod_oai Open Source release
Formalize/release CRATE schema definition (XSD)
Metrics Collection & Evaluation
–Academic sites
–Dot-Com sites
–Examine utility compatibility and issues
–Address security concerns
IWAW ‘07 Slide # 19
Demo

TODAY:
  crate&identifier=
  crate&identifier=
AT MODOAI.ORG:
IWAW ‘07 Slide # 20
Further Information

The mod_oai project home page:
JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
Authors’ webs: