
CRATE: A Simple Model for Self-Describing Web Resources. International Web Archiving Workshop 2007. Joan A. Smith & Michael L. Nelson, Old Dominion University.


1 CRATE: A Simple Model for Self-Describing Web Resources
International Web Archiving Workshop 2007
Joan A. Smith & Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu

2 IWAW ’07 {jsmit,mln}@cs.odu.edu Slide # 2
WWW and Digital Libraries: Vastly Different Worlds
World Wide Web ("Crawlapalooza")
–A disorganized free-for-all
–Near-zero metadata
–Unpredictable additions, deletions, modifications
–No preservation policy
Digital Library ("Harvester Home Companion")
–Organized
–Groomed content
–Lots of metadata
–Structured changes
–Active preservation policies

3 Web Sites: Metadata Challenged
    % telnet foo.edu 80
    Trying 82.165.199.160...
    Connected to foo.edu.
    Escape character is '^]'.
    GET /jackJill.jpg HTTP/1.1
    Host: foo.edu

    HTTP/1.1 200 OK
    Date: Mon, 11 Jun 2007 16:49:25 GMT
    Server: Apache/1.3.33 (Unix)
    Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
    ETag: "5800535-3e72-4312f924"
    Accept-Ranges: bytes
    Content-Length: 15986
    Content-Type: image/jpeg

    ÿØÿà "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ
Callouts: HTML metadata; JPEG metadata
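To make the "metadata challenged" point concrete, here is a minimal Python sketch that parses the response headers from the session above. It uses only the standard library; the helper name `parse_headers` is illustrative and is not part of mod_oai.

```python
from email.parser import Parser

# Header block copied from the HTTP response shown on the slide.
RAW_HEADERS = """\
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
"""


def parse_headers(raw):
    """Parse an RFC 822 style header block into a name -> value dict."""
    message = Parser().parsestr(raw, headersonly=True)
    return dict(message.items())


headers = parse_headers(RAW_HEADERS)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Seven header fields are everything a plain GET reveals about the resource; none of them describes the image content itself.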

4 Archives: Metadata-Rich

5 YAMM?! (Yet Another Metadata Model?)

6 The MPEG-21 DIDL Model

7 Preservation & Metadata
[Figure: probability of preservation (low to high) plotted against resource metadata available (less to more). Plotted items: HTTP/HTML; automatic metadata utilities/CRATE; Archival Information Package (AIP).]

8 # Webs >> # Archivists
[Figure: typical ingest scenario, with one archivist and many web sites.]

9 Harnessing the Web Server
Archivist: mod_oai GetRecord request and response
User: standard GET request and response
Self-describing resource

10 What is a “Self-Describing” Resource?
EXIF TOOL:
    File Name              103_0315.JPG
    Camera Model Name      Canon EOS DIGITAL REBEL
    Date/Time Original     2003:09:30 13:37:51
    Shooting Mode          Sports
    Shutter Speed          1/2000
    Aperture               7.1
    Metering Mode          Evaluative
    Exposure Compensation  0
    ISO                    400
    Lens                   75.0 - 300.0mm
    Focal Length           300.0mm
    Image Size             3072x2048
    Quality                Normal
    Flash                  Off
    White Balance          Auto
    Focus Mode             AI Servo AF
    Contrast               +1
    Sharpness              +1
    Saturation             +1
    Color Tone             Normal
    File Size              1606 kB
    File Number            103-0315
Standard HTTP Headers:
    Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
    ETag: "5800535-3e72-4312f924"
    Content-Length: 15986
    Content-Type: image/jpeg
PLUS: Output from built-in utilities:
JHOVE TOOL:
    Date: 2007-06-18 14:35:50 EDT
    RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
    ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
    LastModified: 2007-01-16 23:09:07 EST
    Size: 27750
    Format: JPEG
    Version: 1.00
    Status: Well-Formed and valid
    SignatureMatches: JPEG-hul
    MIMEtype: image/jpeg
    Profile: JFIF
    JPEGMetadata:
     CompressionType: Huffman coding, Baseline DCT
     Images:
      Number: 1
      Image:
       NisoImageMetadata:
        MIMEType: image/jpeg
        ByteOrder: big-endian
        CompressionScheme: JPEG
        ColorSpace: YCbCr
        SamplingFrequencyUnit: inch
        XSamplingFrequency: 33
        YSamplingFrequency: 26
        ImageWidth: 172
        ImageLength: 146
        BitsPerSample: 8, 8, 8
        SamplesPerPixel: 3
       Scans: 1
       QuantizationTables:
        QuantizationTable:
         Precision: 8-bit
         DestinationIdentifier: 0
     Comments: LEAD Technologies Inc. V1.01
     ApplicationSegments: APP0
File/Magic:
    JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
MD5 Hash:
    58a54e8638db432f4515eedf89f44505
…CRATE: Wrapped together with the resource in simple XML

11 Metadata Generation Utility Examples
    Name          Description
    Jhove         Analysis by type (img, audio, text)
    Kea           Key phrase extraction
    OTS           Open Text Summarizer
    ExifTool      Image/video metadata extractor
    PDFlib-pCOS   Extract PDF metadata
    MP3-Tag       Extract audio file tags
    Essence       Customized information extraction
    GDFR          MIME++
    MD5           Message Digest
    File Magic    Uses content-identification bits of the file

12 Web Server Configuration: “conf” file
    ### Section 1: Global Environment
    #
    ServerType standalone
    ServerRoot "/etc/httpd"
    PidFile /var/run/httpd.pid
    ResourceConfig /dev/null
    AccessConfig /dev/null
    Timeout 300
    KeepAlive On
    MaxKeepAliveRequests 0
    KeepAliveTimeout 15
    MinSpareServers 16
    MaxSpareServers 64
    StartServers 16
    MaxClients 512
    MaxRequestsPerChild 100000
    ### Section 2: 'Main' server configuration
    #
    Port 80
    Listen 80
    Listen 443
    User www
    Group www
    ServerAdmin admin@openna.com
    ServerName www.openna.com
    DocumentRoot "/home/httpd/ona"
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
    DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
    #
    #Include conf/mmap.conf
    #
    UseCanonicalName On
    TypesConfig /etc/httpd/conf/mime.types
    DefaultType text/plain
    HostnameLookups Off
Callouts: Operational Rules; Modules (mod_perl, etc.); Security; Virtual Hosts

13 Apache: mod_oai Location Directive
    SetHandler modoai-handler
    modoai_oai_active ON
    label "md5sum" exec "/usr/bin/md5sum %s" version "/usr/bin/md5sum --version" mime "*/*"
    label "file" exec "/usr/bin/file -kz %s" version "/usr/bin/file -v" mime "*/*"
    label "jhove" exec "/opt/jhove/jhove -m pdf-hul %s" version "/opt/jhove/jhove -v" mime "application/pdf"
    label "pronom" exec "java -jar DROID.jar -L %s" version "java -jar DROID.jar -V" mime "*/*"
Callouts: Scripts; Pipes; Executables; MIME-based selective processing
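The "MIME-based selective processing" idea can be illustrated with a short Python sketch. This is not mod_oai's implementation (mod_oai is an Apache module). It only mimics the matching logic under two assumptions: the `mime` value is treated as a shell-style pattern, and `%s` is replaced by the resource path. The names `PLUGINS`, `select_plugins`, and `build_command` are hypothetical.

```python
import fnmatch
import shlex

# Plug-in entries transcribed from the slide (label, exec template, MIME pattern).
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s", "mime": "application/pdf"},
    {"label": "pronom", "exec": "java -jar DROID.jar -L %s", "mime": "*/*"},
]


def select_plugins(mime_type, plugins=PLUGINS):
    """Return the plug-ins whose MIME pattern matches the resource's type."""
    return [p for p in plugins if fnmatch.fnmatchcase(mime_type, p["mime"])]


def build_command(plugin, path):
    """Expand the exec template into an argument list, substituting %s."""
    return [path if token == "%s" else token for token in shlex.split(plugin["exec"])]


for plugin in select_plugins("image/jpeg"):
    print(plugin["label"], build_command(plugin, "/var/www/jackJill.jpg"))
```

With this configuration a JPEG is handed to md5sum, file, and pronom, while the PDF-specific jhove entry is skipped. Building an argument list rather than a shell string avoids quoting problems for paths that contain spaces.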

14 Building a CRATE
CRATE
    CRATE ID: URI, UUID
    METADATA: Standard HTTP Headers, Plug-In Metadata
    RESOURCE: Base64-Encoded Resource
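The three-part structure above (ID, metadata, resource) can be sketched in Python as follows. The slides say the CRATE schema was still to be formalized, so every element and attribute name here (`crate`, `crateId`, `httpHeader`, `plugin`, and so on) is a placeholder invented for illustration. Deriving the UUID from the URI with `uuid5` is likewise an assumption.

```python
import base64
import hashlib
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri, headers, plugin_output, resource_bytes):
    """Wrap an identifier, metadata, and the Base64-encoded resource in XML."""
    crate = ET.Element("crate")

    # CRATE ID: the resource URI plus a UUID (derived from the URI here).
    ident = ET.SubElement(crate, "crateId")
    ET.SubElement(ident, "uri").text = uri
    ET.SubElement(ident, "uuid").text = str(uuid.uuid5(uuid.NAMESPACE_URL, uri))

    # METADATA: standard HTTP headers and raw plug-in output, uncategorized.
    meta = ET.SubElement(crate, "metadata")
    for name, value in headers.items():
        ET.SubElement(meta, "httpHeader", name=name).text = value
    for label, output in plugin_output.items():
        ET.SubElement(meta, "plugin", label=label).text = output

    # RESOURCE: the bytes themselves, Base64-encoded so they are XML-safe.
    resource = ET.SubElement(crate, "resource", encoding="base64")
    resource.text = base64.b64encode(resource_bytes).decode("ascii")
    return crate


data = b"\xff\xd8\xff\xe0 stand-in for JPEG bytes"
crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    {"Content-Type": "image/jpeg"},
    {"md5": hashlib.md5(data).hexdigest()},
    data,
)
print(ET.tostring(crate, encoding="unicode"))
```

Base64 increases the payload size by roughly a third, which is the price of carrying arbitrary binary content inside a text-based XML response.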

15 CRATE example from mod_oai
    http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

    2007-06-18T18:21:46Z
    <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/
    http://foo.edu/jackJill.jpg
    2007-01-17T04:09:07Z
    mime:image:jpeg
    image/jpeg
    encoding="base64"
    JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc
    "file magic"
    /usr/bin/file jackJill.jpg
    file-4.16
    JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
    "jhove"
    /opt/jhove/jhove -m jpeg-hul
    Jhove (Rel. 1.1, 2006-06-05)
    <![CDATA[
    Date: 2007-06-18 14:35:50 EDT
    RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
    ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
    LastModified: 2007-01-16 23:09:07 EST
    Size: 27750
    Format: JPEG
    Version: 1.00
    Status: Well-Formed and valid
    SignatureMatches: JPEG-hul
    MIMEtype: image/jpeg
    Profile: JFIF
    JPEGMetadata:
     CompressionType: Huffman coding, Baseline DCT
     Images:
      Number: 1
      Image:
       NisoImageMetadata:
        MIMEType: image/jpeg
        ByteOrder: big-endian
        CompressionScheme: JPEG
        ColorSpace: YCbCr
        SamplingFrequencyUnit: inch
        XSamplingFrequency: 33
        YSamplingFrequency: 26
        ImageWidth: 172
        ImageLength: 146
        BitsPerSample: 8, 8, 8
        SamplesPerPixel: 3
       Scans: 1
       QuantizationTables:
        QuantizationTable:
         Precision: 8-bit
         DestinationIdentifier: 0
     Comments: LEAD Technologies Inc. V1.01
     ApplicationSegments: APP0
    ]]>
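The GetRecord request on this slide can be assembled programmatically. The Python sketch below builds the query string with the standard library; the function name `get_record_url` is illustrative. Unlike the URL printed on the slide, `urlencode` percent-encodes the identifier, which is the safer form when the identifier is itself a URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_record_url(base_url, identifier, metadata_prefix="crate"):
    """Build an OAI-PMH GetRecord request URL for one resource."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


url = get_record_url("http://foo.edu/modoai/", "http://foo.edu/jackJill.jpg")
print(url)

# Decoding the query string recovers the original identifier.
print(parse_qs(urlsplit(url).query)["identifier"][0])
```

An archivist's harvester would issue this request, while an ordinary user keeps requesting http://foo.edu/jackJill.jpg directly with a standard GET.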

16 Automatic, Best-Effort Metadata
Automatic
–Generated at time of dissemination
–Integrates preservation functions with the web server
Unverified
–Utility results are not cross-checked
–Output of analyses goes directly into the XML response
Undifferentiated
–No categorization of output
–Resource and metadata form a complex-object response
A simple, easy-to-implement option for improving available preservation metadata for web resources

17 Issues - Or Not?
Web Server Performance
–Academic vs dot-com expectations
–Solution options
Utility Efficiency
–Java-based vs C-based
–Market pressures
Security
–Metadata vs risk
–Access controls

18 Next Up…
mod_oai Open Source release
Formalize/release CRATE schema definition (XSD)
Metrics Collection & Evaluation
–Academic sites
–Dot-Com sites
–Examine utility compatibility and issues
–Address security concerns

19 Demo
TODAY:
    http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
    http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://localhost/modoaitest/diag.jpg
AT MODOAI.ORG:
    http://www.modoai.org/demos.html

20 Further Information
The mod_oai project home page: http://www.modoai.org/
JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
Authors’ webs:
    http://www.cs.odu.edu/~mln/pubs/
    http://www.joanasmith.com/pubs.html

