EVALUATING THE TEMPORAL COHERENCE OF ARCHIVED PAGES SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY IIPC 2015 APRIL 27 – MAY 1, 2015 STANFORD.

Presentation transcript:

EVALUATING THE TEMPORAL COHERENCE OF ARCHIVED PAGES
SCOTT G. AINSWORTH, MICHAEL L. NELSON, OLD DOMINION UNIVERSITY
HERBERT VAN DE SOMPEL, LOS ALAMOS NATIONAL LABORATORY
IIPC 2015, APRIL 27 – MAY 1, 2015, STANFORD UNIVERSITY

HE WENT TO VIEW AN ARCHIVED PAGE. YOU WON’T BELIEVE WHAT HE SAW NEXT…
SCOTT G. AINSWORTH, MICHAEL L. NELSON, OLD DOMINION UNIVERSITY
HERBERT VAN DE SOMPEL, LOS ALAMOS NATIONAL LABORATORY
IIPC 2015, APRIL 27 – MAY 1, 2015, STANFORD UNIVERSITY

2015 IIPC General Assembly, 4/27/15, Scott G. Ainsworth, Michael L. Nelson, Herbert Van de Sompel

CONTENTS
- Motivation
- Composite Mementos
- Coherence Framework
- Temporal Coherence
- Future Work
- Conclusion

RESEARCH TO DATE
- How to crawl a site to maximize coherence: Ben Saad et al., JCDL 2011, TPDL 2011
- Detecting, visualizing temporal defects: Spaniol et al., WICOW 2009, IWAW 2009, VLDB 2009

RESEARCH TO
- How much of the web is archived? Ainsworth et al., JCDL 2011
- Are the archives stable? Brunelle et al., JCDL 2013
- Temporal drift while browsing in an archive? Ainsworth et al., JCDL 2013
- Are the missing resources important? Brunelle et al., JCDL 2014
- Are the present resources correct?

AS PRESENTED BY IA
(now 404, but that's a different story…)

NOT ALL T19:09:26

CLEAR OR CLOUDY?

QUESTIONS
- How prevalent is temporal incoherence?
- Can temporal coherence be improved by using multiple archives?
- Can temporal coherence be improved by introducing memento selection heuristics?

CONTENTS
- Motivation
- Composite Memento
- Coherence Framework
- Temporal Coherence
- Future Work
- Conclusion

COMPOSITE MEMENTO
PRESENTATION, STRUCTURE

CONTENTS
- Motivation
- Composite Memento
- Coherence Framework
- Temporal Coherence
- Future Work
- Conclusion

COHERENCE STATES
- Prima Facie Coherent: Evidence that the memento existed in its archived state when the root was acquired.
- Prima Facie Violative: Evidence … did not exist...
- Possibly Coherent: Evidence … might have existed...
- Probably Violative: Evidence … probably did not exist...

CONSIDER THIS PAGE…

WITH THESE RESPONSE HEADERS

HTTP/ OK
Server: Tengine/2.0.3
Date: Mon, 27 Apr :03:32 GMT
Content-Type: image/jpeg
Content-Length:
Connection: keep-alive
Memento-Datetime: Tue, 07 Feb :58:23 GMT
Link:
X-Archive-Orig-server: Apache/ (Unix) ApacheJServ/1.1.2 PHP/4.3.4
X-Archive-Orig-etag: "4978-3d10-3e4d822e"
X-Archive-Orig-content-length:
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-date: Tue, 07 Feb :58:20 GMT
X-Archive-Orig-content-type: image/jpeg
X-Archive-Orig-last-modified: Fri, 14 Feb :56:30 GMT
X-Archive-Orig-connection: close
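The two fields that drive the coherence analysis are Memento-Datetime (when the archive captured the resource) and the preserved original Last-Modified, which this archive exposes as X-Archive-Orig-last-modified. The following is a minimal Python sketch of extracting both from a response. The function name `capture_times` is hypothetical, and the years and hours in the sample values are assumed for illustration because they are missing from the header listing.

```python
from email.utils import parsedate_to_datetime


def capture_times(headers):
    """Return (memento_datetime, last_modified); last_modified may be None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    memento_datetime = parsedate_to_datetime(h["memento-datetime"])
    raw_lm = h.get("x-archive-orig-last-modified")
    last_modified = parsedate_to_datetime(raw_lm) if raw_lm else None
    return memento_datetime, last_modified


# Illustrative values only: years and hours are assumed, not taken from the slide.
sample = {
    "Memento-Datetime": "Tue, 07 Feb 2006 12:58:23 GMT",
    "X-Archive-Orig-last-modified": "Fri, 14 Feb 2003 23:56:30 GMT",
}
md, lm = capture_times(sample)
```

A missing Last-Modified is returned as `None` rather than raising, since its absence is itself meaningful for the coherence states.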

PRIMA FACIE COHERENT
Bracket Pattern: Memento-Datetime + Last-Modified
(yes, Last-Modified is sometimes wrong, but many of those cases can be detected)

PRIMA FACIE COHERENT
Equal Pattern: simultaneous capture
(with an optionally tunable “bubble of simultaneity”)

PRIMA FACIE VIOLATIVE
Closest memento created and acquired after the root was acquired

POSSIBLY COHERENT
Closest (or only) memento captured before the root

PROBABLY VIOLATIVE
Closest (or only) memento captured after the root but with no Last-Modified (possibly indicating a dynamically generated representation)
(for both PC & PV, you could do content comparison if there are 2 mementos that straddle the root page)
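The four states and their patterns can be read as a small decision procedure over three timestamps: the root's Memento-Datetime, the embedded memento's Memento-Datetime, and the embedded memento's Last-Modified (if any). The sketch below is one reading of the slides, not the authors' implementation; the function name `classify` and the `bubble` parameter (standing in for the "bubble of simultaneity") are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify(root_md: datetime, emb_md: datetime,
             emb_lm: Optional[datetime] = None,
             bubble: timedelta = timedelta(0)) -> str:
    """Classify an embedded memento relative to its root memento."""
    # Equal pattern: captured simultaneously, within the tolerance.
    if abs(emb_md - root_md) <= bubble:
        return "Prima Facie Coherent"
    # Captured before the root: it existed, but may have changed since.
    if emb_md < root_md:
        return "Possibly Coherent"
    # Captured after the root, with nothing to show when it was created.
    if emb_lm is None:
        return "Probably Violative"
    # Bracket pattern: last modified before the root, captured after it.
    if emb_lm <= root_md:
        return "Prima Facie Coherent"
    # Created and acquired after the root was acquired.
    return "Prima Facie Violative"


root = datetime(2010, 6, 1, tzinfo=timezone.utc)   # hypothetical root capture
later = datetime(2010, 7, 1, tzinfo=timezone.utc)  # hypothetical embedded capture
```

The order of the checks matters: the equal pattern is tested first so that a nonzero `bubble` can absorb captures that fall a few seconds on either side of the root.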

CONTENTS
- Motivation
- Composite Memento
- Coherence Framework
- Temporal Coherence
- Future Work
- Conclusion

TEMPORAL COHERENCE

TEMPORAL COHERENCE
…:36:08, +9 days, +18 days, +7 months, +2.1 years

EMBEDDED RESOURCES

Resource                         Memento-Datetime   Delta
…                                … 01:36:08
mm_menu.js                       …:39:12            9.0 d
style.css                        …:39:39            9.0 d
gfx-logo-odu-crown.gif           …:39:39            9.0 d
ddmenu_ddown.js                  …:39:43            9.0 d
university.js                    …:39:56            9.0 d
rmenu_1st_about.png              …:40:…             … d
rmenu_bottom_229.gif             …:07:…             … d
shadow-bl.gif                    …:55:…             … d
ecsbdg.jpg                       …:56:…             … d
shadow-br.gif                    …:18:…             … d
gfx-btn-go-dblue.gif             …:34:…             … d
shadow-tr.gif                    …:55:…             … d
header-right1.gif                …:06:…             … d
spacer.gif                       …:23:…             … d
jimcheng.gif                     …:37:…             … d
jsmith.gif                       …:58:…             … d
rmenu_1st_featured_alumni.png    …:21:…             … d
hmenu_college_...-new.png        …:14:25            7.3 mo
rmenu_1st_upcoming_news.png      …:15:14            7.3 mo
rmenu_1st_upcoming_events.png    …:01:12            7.3 mo
lmenu_1st_resources.png          …:47:41            7.5 mo
bullet_blue_triangle.gif         …:43:48            7.5 mo
logo-cs.gif                      …:54:29            7.5 mo
rmenu_1st_featured_student.png   …:36:07            2.1 years
shadow-b.gif                     …:35:17            2.1 years
shadow-r.gif                     404 Not Found

Embedded Resources    26
Mean Delta            125.9 days
Standard Deviation    207.7 days
Minimum Delta         9.0 days
Maximum Delta         2.1 years
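The summary figures are ordinary descriptive statistics over the per-resource deltas, where a delta is the gap between an embedded memento's Memento-Datetime and the root's. A minimal sketch follows; the function name `summarize` and the input values are hypothetical, and the use of the sample standard deviation is an assumption, since the slide does not say which form was used.

```python
import statistics
from datetime import timedelta


def summarize(deltas):
    """Summarize a list of timedelta gaps, reporting values in days."""
    days = [d.total_seconds() / 86400 for d in deltas]
    return {
        "count": len(days),
        "mean": statistics.mean(days),
        # Sample standard deviation; needs at least two values.
        "stdev": statistics.stdev(days) if len(days) > 1 else 0.0,
        "min": min(days),
        "max": max(days),
    }


# Hypothetical deltas, not the values from the slide.
example = summarize([timedelta(days=9), timedelta(days=9), timedelta(days=18)])
```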

REPRESENTING COHERENCE

THE FULL CHART
media.gif

EXPERIMENT: DATA SET
- 4,000 sample URI-Rs (data set from JCDL 2011)
- Single and Multiple Archives
- Two Heuristics:
  - Minimum distance (current default Wayback behavior): choose closest Memento-Datetime
  - Bracket (proposed here): use combination of Memento-Datetime + Last-Modified
- Download all TimeMaps
- Download all root mementos
- Download all embedded resources
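The two heuristics differ only in how they pick one memento from the candidates in a TimeMap. The sketch below illustrates the difference under stated assumptions: the `Memento` tuple and both function names are hypothetical, and the Bracket rule shown (prefer a memento whose Last-Modified and Memento-Datetime straddle the root, otherwise fall back to minimum distance) is one plausible reading of "use combination of Memento-Datetime + Last-Modified", not a confirmed description of the authors' method.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional, Sequence


class Memento(NamedTuple):
    uri_m: str
    memento_datetime: datetime
    last_modified: Optional[datetime] = None


def select_min_dist(root_md: datetime, candidates: Sequence[Memento]) -> Memento:
    """Minimum distance: the memento captured closest in time to the root."""
    return min(candidates, key=lambda m: abs(m.memento_datetime - root_md))


def select_bracket(root_md: datetime, candidates: Sequence[Memento]) -> Memento:
    """Bracket: prefer mementos whose Last-Modified..Memento-Datetime span the root."""
    bracketing = [
        m for m in candidates
        if m.last_modified is not None
        and m.last_modified <= root_md <= m.memento_datetime
    ]
    # Assumed fallback: with no bracketing memento, behave like minimum distance.
    return select_min_dist(root_md, bracketing or candidates)


utc = timezone.utc
root = datetime(2010, 6, 1, tzinfo=utc)
near = Memento("urn:example:near", datetime(2010, 5, 25, tzinfo=utc))
spanning = Memento("urn:example:spanning", datetime(2010, 7, 1, tzinfo=utc),
                   datetime(2010, 1, 1, tzinfo=utc))
```

With these hypothetical candidates, minimum distance picks `near` (7 days away, coherence unknown) while the bracket rule picks `spanning` (30 days away, but shown to be unchanged across the root capture).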

EXPERIMENT: SAMPLING
- For each root URI-R TimeMap, choose a single memento per month
- Extract embedded URI-Rs
- Download TimeMaps for embedded URI-Rs
- Download heuristically best URI-Ms
- Repeat recursively
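The first sampling step, one memento per month, can be sketched as a grouping over Memento-Datetimes. The slide does not say which memento is kept within a month; the sketch assumes the earliest, and the function name `one_per_month` is hypothetical.

```python
from datetime import datetime


def one_per_month(memento_datetimes):
    """Keep the earliest datetime in each (year, month), in chronological order."""
    chosen = {}
    for dt in sorted(memento_datetimes):
        # setdefault keeps the first (earliest) value seen for each month.
        chosen.setdefault((dt.year, dt.month), dt)
    return list(chosen.values())


# Hypothetical capture times for one root URI-R.
captures = [
    datetime(2010, 3, 20), datetime(2010, 3, 2),
    datetime(2010, 4, 11), datetime(2011, 3, 5),
]
sampled = one_per_month(captures)
```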

ROOT URI-R STATISTICS

Archival Data
  Root URI-Rs archived        2,…     …%
  In multiple archives        1,…     …%
  Mean archives per URI-R     1.58
  Mean mementos per URI-R     …

URI-M Status
  OK                          82,…    …%
  503 Service Unavailable     4,…     …%
  404 Not found               …       …%
  403 Forbidden               …       …%
  Others                      …       …%

EMBEDDED URI-R STATISTICS
Row groups: Archival Data, URI-M Failure Reasons

  Embedded URI-Rs               1,623,127
    per root URI-M              19.7
  Embedded URI-Ms available     1,332,…    …%
    per root URI-M              15.1
  Not archived                  312,…      …%
  404 Not found                 44,…       …%
  403 Forbidden                 6,…        …%
  503 Service Unavailable       5,…        …%
  Others                        3,…        …%

COMPOSITE MEMENTO (ROOT) COMPLETENESS & COHERENCE

Completeness (and Missing)
Description                   MinDist Single   MinDist Multi   Bracket Single   Bracket Multi
Mean Complete                 76.1%            80.2%           76.2%            80.3%
Mean Missing                  23.9%            19.8%           23.8%            19.7%

Coherence
Description                   MinDist Single   MinDist Multi   Bracket Single   Bracket Multi
Mean Prima Facie Coherent     41.0%            40.9%           54.7%            54.6%
Mean Possibly Coherent        27.3%            28.7%           12.8%            14.2%
Mean Probably Violative       2.5%             5.3%            2.5%             5.3%
Mean Prima Facie Violative    5.3%             5.3%            6.2%             6.2%

At least 5% of pages can be shown to have temporal violations!
Multiple archives: +completeness, -coherence?

EMBEDDED MEMENTO COHERENCE

Description              MinDist Single   MinDist Multi   Bracket Single   Bracket Multi
Prima Facie Coherent     622,565          621,447         864,736          859,625
Possibly Coherent        497,405          466,046         244,104          215,585
Probably Violative       104,376          53,734          104,339          53,694
Prima Facie Violative    100,760          103,662         114,062          117,469
Totals                   1,325,106        1,244,889       1,327,241        1,246,373

Description              MinDist Single   MinDist Multi   Bracket Single   Bracket Multi
Prima Facie Coherent     47.0%            49.9%           65.2%            69.0%
Possibly Coherent        37.5%            37.4%           18.4%            17.3%
Probably Violative       7.9%             4.3%            7.9%             4.3%
Prima Facie Violative    7.6%             8.3%            8.6%             9.4%

At least 7% of embedded resources are used violatively!
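The percentage table follows directly from the count table: each cell is the state's count divided by its column total. A short sketch that recomputes two of the columns from the counts on the slide; the names `counts` and `shares` are illustrative.

```python
# Embedded memento counts per coherence state, copied from the slide.
counts = {
    "MinDist Single": {
        "Prima Facie Coherent": 622_565,
        "Possibly Coherent": 497_405,
        "Probably Violative": 104_376,
        "Prima Facie Violative": 100_760,
    },
    "Bracket Multi": {
        "Prima Facie Coherent": 859_625,
        "Possibly Coherent": 215_585,
        "Probably Violative": 53_694,
        "Prima Facie Violative": 117_469,
    },
}


def shares(column):
    """Convert one column of counts to percentages of the column total."""
    total = sum(column.values())
    return {state: round(100 * n / total, 1) for state, n in column.items()}


min_dist_single = shares(counts["MinDist Single"])
bracket_multi = shares(counts["Bracket Multi"])
```

Both recomputed columns match the slide's percentages to one decimal place, and the column sums match the stated totals.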

CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Coherence
- Future work
- Conclusion

MINOR OR MAJOR VIOLATIONS?
- This is a temporal violation. But is it meaningful?
- How to judge?
  - Most archives transform HTML
  - Not all archives support export of original file
  - How to measure similarity on binary files?
  - early results: very few cases of equivalent binaries

HOW TO CONVEY COHERENCE?
- How to scale to > 100 embedded mementos?
- How to convey coherence & contributing archive?

POLICIES & HEURISTICS
Tradeoffs:
- Fast: minimize distance
- Accurate: maximize coherence
- Complete: query all (not just top k) archives in order to maximize completeness

CONTENTS
- Motivation
- Composite Memento
- Coherence Framework
- Temporal Coherence
- Future Work
- Conclusion

CONCLUSION
- Defined four classes of temporal coherence for describing the relationship between root & embedded mementos:
  - Prima Facie {Coherent|Violative}
  - Possibly Coherent / Probably Violative
- Determine classes using a combination of HTTP metadata, primarily Memento-Datetime & Last-Modified
- At least 5% of IA pages have 1 or more temporal violations
- Using multiple archives increases completeness, but with a possible loss of coherence
- Determining the semantic impact of violations and UI issues (status, policy choices) are areas of future research