CRATE: A Simple Model for Self-Describing Web Resources. International Web Archiving Workshop 2007. Joan A. Smith & Michael L. Nelson, Old Dominion University.

Presentation transcript:

CRATE: A Simple Model for Self-Describing Web Resources
International Web Archiving Workshop 2007
Joan A. Smith & Michael L. Nelson
Old Dominion University Department of Computer Science Norfolk, VA
{jsmit,

IWAW ‘07 Slide # 2 WWW and Digital Libraries: Vastly Different Worlds
World Wide Web (Crawlapalooza)
–A disorganized free-for-all
–Near-zero metadata
–Unpredictable additions, deletions, modifications
–No preservation policy
Digital Library (Harvester Home Companion)
–Organized
–Groomed content
–Lots of metadata
–Structured changes
–Active preservation policies

IWAW ‘07 Slide # 3 Web Sites: Metadata Challenged
% telnet foo.edu 80
Trying
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu
HTTP/ OK
Date: Mon, 11 Jun :49:25 GMT
Server: Apache/ (Unix)
Last-Modified: Mon, 29 Aug :01:40 GMT
ETag: " e f924"
Accept-Ranges: bytes
Content-Length:
Content-Type: image/jpeg
ÿØÿà "#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ
Slide labels: HTML metadata; JPEG metadata
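
To make the slide's point concrete, the sketch below parses a header block of the same shape and lists everything a client learns about the resource from HTTP alone. The header values are placeholders invented for illustration (the slide's own values did not survive extraction), and `describe_headers` is a hypothetical helper, not part of any tool named in the talk.

```python
from email.parser import Parser

# Placeholder response headers; values are illustrative, not taken from the slide.
RAW_HEADERS = (
    "Date: Mon, 01 Jan 2007 00:00:00 GMT\n"
    "Server: Apache\n"
    "Last-Modified: Mon, 01 Jan 2007 00:00:00 GMT\n"
    "Accept-Ranges: bytes\n"
    "Content-Length: 1234\n"
    "Content-Type: image/jpeg\n"
    "\n"
)


def describe_headers(raw: str) -> dict:
    """Return the header fields as a plain dict with lower-cased names."""
    message = Parser().parsestr(raw, headersonly=True)
    return {name.lower(): value for name, value in message.items()}


headers = describe_headers(RAW_HEADERS)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Nothing in these fields says what the image depicts, how it was produced, or whether the file is well-formed; that is the gap the rest of the talk addresses.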

IWAW ‘07 Slide # 4 Archives: Metadata-Rich

IWAW ‘07 Slide # 5 YAMM?! (Yet Another Metadata Model?)

IWAW ‘07 Slide # 6 The MPEG-21 DIDL Model

IWAW ‘07 Slide # 7 Preservation & Metadata
[Chart. Horizontal axis: Resource Metadata Available (Less to More). Vertical axis: Probability of Preservation (Low to High). Points plotted: HTTP/HTML; Automatic metadata utilities/CRATE; Archival Information Package (AIP).]

IWAW ‘07 Slide # 8 # Webs >> # Archivists
[Figure: typical ingest scenario. Labels: Archivist; Web Sites.]

IWAW ‘07 Slide # 9 Harnessing the Web Server
Archivist: mod_oai GetRecord request and response
User: standard GET request and response
Self-describing resource
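
As a rough illustration of the archivist's side of this exchange, the sketch below builds a GetRecord request URL. `verb`, `identifier`, and `metadataPrefix` are the standard OAI-PMH arguments, and the `crate` prefix appears in the example response later in the talk. The base URL path `/modoai` and the helper name `get_record_url` are assumptions made for this example only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_record_url(base_url: str, identifier: str, prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord URL for one resource."""
    query = urlencode(
        {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": prefix}
    )
    return f"{base_url}?{query}"


# Hypothetical base URL; foo.edu is the example host used on the slides.
url = get_record_url("http://foo.edu/modoai", "http://foo.edu/jackJill.jpg")
print(url)

# Parsing the query back confirms the three arguments survive URL encoding.
arguments = parse_qs(urlsplit(url).query)
print(arguments)
```

An ordinary user would still fetch `http://foo.edu/jackJill.jpg` with a plain GET and receive only the image.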

IWAW ‘07 Slide # 10 What is a “Self-Describing” Resource?
Standard HTTP Headers --
Last-Modified: Mon, 29 Aug :01:40 GMT
ETag: " e f924"
Content-Length:
Content-Type: image/jpeg
PLUS: Output from built-in utilities:
EXIF TOOL:
File Name: 103_0315.JPG
Camera Model Name: Canon EOS DIGITAL REBEL
Date/Time Original: 2003:09:30 13:37:51
Shooting Mode: Sports
Shutter Speed: 1/2000
Aperture: 7.1
Metering Mode: Evaluative
Exposure Compensation: 0
ISO: 400
Lens: mm
Focal Length: 300.0mm
Image Size: 3072x2048
Quality: Normal
Flash: Off
White Balance: Auto
Focus Mode: AI Servo AF
Contrast: +1
Sharpness: +1
Saturation: +1
Color Tone: Normal
File Size: 1606 kB
File Number:
JHOVE TOOL:
Date: :35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 ( )
LastModified: :09:07 EST
Size:
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
 CompressionType: Huffman coding, Baseline DCT
 Images:
  Number: 1
  Image:
   NisoImageMetadata:
    MIMEType: image/jpeg
    ByteOrder: big-endian
    CompressionScheme: JPEG
    ColorSpace: YCbCr
    SamplingFrequencyUnit: inch
    XSamplingFrequency: 33
    YSamplingFrequency: 26
    ImageWidth: 172
    ImageLength: 146
    BitsPerSample: 8, 8, 8
    SamplesPerPixel: 3
   Scans: 1
   QuantizationTables:
    QuantizationTable:
     Precision: 8-bit
     DestinationIdentifier: 0
 Comments: LEAD Technologies Inc. V1.01
 ApplicationSegments: APP0
File/Magic: JPEG image data JFIF standard 1.00 resolution (DPI) "LEAD Technologies Inc. V1.01" 33 x 26
MD5 Hash: 58a54e8638db432f4515eedf89f44505
…CRATE: Wrapped together with the resource in simple XML

IWAW ‘07 Slide # 11 Metadata Generation Utility Examples
Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file

IWAW ‘07 Slide # 12 Web Server Configuration: “conf” file
### Section 1: Global Environment
#
ServerType standalone
ServerRoot "/etc/httpd"
PidFile /var/run/httpd.pid
ResourceConfig /dev/null
AccessConfig /dev/null
Timeout 300
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 15
MinSpareServers 16
MaxSpareServers 64
StartServers 16
MaxClients 512
MaxRequestsPerChild
### Section 2: 'Main' server configuration
#
Port 80
Listen 80
Listen 443
User www
Group www
ServerAdmin
ServerName
DocumentRoot "/home/httpd/ona"
Options None
AllowOverride None
Order deny,allow
Deny from all
Options None
AllowOverride None
Order allow,deny
Allow from all
Options None
AllowOverride None
Order deny,allow
Deny from all
DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
#
#Include conf/mmap.conf
#
UseCanonicalName On
TypesConfig /etc/httpd/conf/mime.types
DefaultType text/plain
HostnameLookups Off
Slide labels: Operational Rules; Modules (mod_perl, etc.); Security; Virtual Hosts

IWAW ‘07 Slide # 13 Apache: mod_oai Location Directive
SetHandler modoai-handler
modoai_oai_active ON
label "md5sum" exec "/usr/bin/md5sum %s" version "/usr/bin/md5sum --version" mime "*/*"
label "file" exec "/usr/bin/file -kz %s" version "/usr/bin/file -v" mime "*/*"
label "jhove" exec "/opt/jhove/jhove -m pdf-hul %s" version "/opt/jhove/jhove -v" mime "application/pdf"
label "pronom" exec "java -jar DROID.jar -L %s" version "java -jar DROID.jar -V" mime "*/*"
Slide labels: Scripts; Pipes; Executables; MIME-based selective processing
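
The "MIME-based selective processing" on this slide can be sketched as a lookup over the four plug-in entries. The labels, commands, and MIME patterns below are copied from the slide; the glob-style matching via `fnmatchcase` and the helper `commands_for` are assumptions for illustration and may not match how mod_oai itself selects plug-ins.

```python
import shlex
from fnmatch import fnmatchcase

# Plug-in entries as listed on the slide (the "version" commands are omitted).
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s", "mime": "application/pdf"},
    {"label": "pronom", "exec": "java -jar DROID.jar -L %s", "mime": "*/*"},
]


def commands_for(path: str, mime_type: str) -> list:
    """Return (label, command) pairs for plug-ins whose pattern matches mime_type."""
    selected = []
    for plugin in PLUGINS:
        # Assumption: "*/*" is treated as a glob that matches any MIME type.
        if fnmatchcase(mime_type, plugin["mime"]):
            command = plugin["exec"].replace("%s", shlex.quote(path))
            selected.append((plugin["label"], command))
    return selected


jpeg_commands = commands_for("/tmp/jackJill.jpg", "image/jpeg")
pdf_commands = commands_for("/tmp/paper.pdf", "application/pdf")
for label, command in jpeg_commands:
    print(label, "->", command)
```

With these entries a JPEG is handled by three plug-ins, while a PDF additionally triggers the `jhove` entry, which is restricted to `application/pdf`.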

IWAW ‘07 Slide # 14 Building a CRATE
[Diagram: inputs combined into one CRATE]
CRATE ID: URI, UUID
METADATA: Standard HTTP Headers; Plug-In Metadata
RESOURCE: Base64-Encoded Resource
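
The three-part structure on this slide can be sketched in a few lines of Python. The element and attribute names (`crate`, `id`, `metadata`, `header`, `plugin`, `resource`) are hypothetical, since the talk lists a formal CRATE schema as future work, and the use of a random UUID is likewise an assumption; only the overall shape (ID, metadata, Base64-encoded resource) comes from the slide.

```python
import base64
import hashlib
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri: str, headers: dict, plugin_output: dict, content: bytes) -> ET.Element:
    """Wrap a resource and its metadata in one XML element (hypothetical layout)."""
    crate = ET.Element("crate")

    # CRATE ID: the resource URI plus a UUID (random here, by assumption).
    ident = ET.SubElement(crate, "id")
    ET.SubElement(ident, "uri").text = uri
    ET.SubElement(ident, "uuid").text = str(uuid.uuid4())

    # METADATA: standard HTTP headers followed by plug-in output, stored as-is.
    metadata = ET.SubElement(crate, "metadata")
    for name, value in headers.items():
        ET.SubElement(metadata, "header", name=name).text = value
    for label, output in plugin_output.items():
        ET.SubElement(metadata, "plugin", label=label).text = output

    # RESOURCE: the bytes themselves, Base64-encoded so they are safe inside XML.
    resource = ET.SubElement(crate, "resource", encoding="base64")
    resource.text = base64.b64encode(content).decode("ascii")
    return crate


content = b"\xff\xd8\xff\xe0 placeholder bytes, not a real JPEG"
crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    {"Content-Type": "image/jpeg"},
    {"md5sum": hashlib.md5(content).hexdigest()},
    content,
)
print(ET.tostring(crate, encoding="unicode"))
```

Decoding the `resource` text with `base64.b64decode` returns the original bytes, so a harvester receives both the object and its description in a single response.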

IWAW ‘07 Slide # 15 CRATE example from mod_oai
T18:21:46Z
<request verb="GetRecord" identifier= metadataPrefix="crate">
T04:09:07Z
mime:image:jpeg
image/jpeg
encoding="base64"
JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc
"file magic"
/usr/bin/file jackJill.jpg
file-4.16
JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
"jhove"
/opt/jhove/jhove -m jpeg-hul
Jhove (Rel. 1.1, )
<![CDATA[
Date: :35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 ( )
LastModified: :09:07 EST
Size:
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
 CompressionType: Huffman coding, Baseline DCT
 Images:
  Number: 1
  Image:
   NisoImageMetadata:
    MIMEType: image/jpeg
    ByteOrder: big-endian
    CompressionScheme: JPEG
    ColorSpace: YCbCr
    SamplingFrequencyUnit: inch
    XSamplingFrequency: 33
    YSamplingFrequency: 26
    ImageWidth: 172
    ImageLength: 146
    BitsPerSample: 8, 8, 8
    SamplesPerPixel: 3
   Scans: 1
   QuantizationTables:
    QuantizationTable:
     Precision: 8-bit
     DestinationIdentifier: 0
 Comments: LEAD Technologies Inc. V1.01
 ApplicationSegments: APP0
]]>

IWAW ‘07 Slide # 16 Automatic, Best-Effort Metadata
Automatic
–Generated at time of dissemination
–Integrates preservation functions with the web server
Unverified
–Utility results are not cross-checked
–Output of analyses goes directly into the XML response
Undifferentiated
–No categorization of output
–Resource and metadata form a complex-object response
A simple, easy-to-implement option for improving available preservation metadata for web resources
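
One possible reading of "best-effort" is sketched below: each utility is run against the file, whatever it prints is kept verbatim, and a utility that is missing, fails, or times out is skipped rather than aborting the response. This failure policy is an assumption of the sketch, not something the slides state. The "utilities" here are stand-ins launched through the current Python interpreter so the example runs anywhere; a real deployment would point at tools such as those in the utility table.

```python
import os
import subprocess
import sys
import tempfile


def run_best_effort(utilities: dict, path: str, timeout: float = 10) -> dict:
    """Run each utility on path; keep raw stdout, silently skip failures."""
    results = {}
    for label, argv in utilities.items():
        try:
            done = subprocess.run(
                argv + [path], capture_output=True, text=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired):
            continue  # utility missing or too slow: best effort, move on
        if done.returncode == 0:
            # Unverified and undifferentiated: output is stored exactly as produced.
            results[label] = done.stdout
    return results


# Stand-in utilities: one that reports the file size, one that does not exist.
SIZE_SCRIPT = "import os, sys; print(os.path.getsize(sys.argv[1]))"
utilities = {
    "size": [sys.executable, "-c", SIZE_SCRIPT],
    "missing": ["/no/such/utility"],
}

with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name

results = run_best_effort(utilities, sample_path)
os.unlink(sample_path)
print(results)
```

Only the utility that ran successfully contributes an entry; nothing checks that its output is correct, which matches the "unverified" caveat above.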

IWAW ‘07 Slide # 17 Issues - Or Not?
Web Server Performance
–Academic vs dot-com expectations
–Solution options
Utility Efficiency
–Java-based vs C-based
–Market pressures
Security
–Metadata vs risk
–Access controls

IWAW ‘07 Slide # 18 Next Up…
mod_oai Open Source release
Formalize/release CRATE schema definition (XSD)
Metrics Collection & Evaluation
–Academic sites
–Dot-Com sites
–Examine utility compatibility and issues
–Address security concerns

IWAW ‘07 Slide # 19 Demo
TODAY:
crate&identifier=
crate&identifier=
AT MODOAI.ORG:

IWAW ‘07 Slide # 20 Further Information
The mod_oai project home page:
JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
Authors’ webs: