Lazy Preservation, Warrick, and the Web Infrastructure


1 Lazy Preservation, Warrick, and the Web Infrastructure
Frank McCown
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
Internet Archive Tutorial, JCDL 2007, Vancouver, BC
June 19, 2007

2 Available at http://warrick.cs.odu.edu/
McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

3 What Types of Websites Are Lost?
Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.

4 Success of website recovery each week
On average, we recovered 61% of a website in any given week.

5 Overlap with Internet Archive
Overall, the Internet Archive (IA) contained only 46% of the resources available in search engine (SE) caches.

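6 Web Server: Recoverable vs. Not Recoverable
[Diagram: a web server holding static files (HTML files, PDFs, images, style sheets, JavaScript, etc.), a dynamic page, a config file, a Perl script, and a database, shown alongside the Web Infrastructure. Regions of the diagram are labeled "Recoverable" and "Not Recoverable".]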

7 Injecting Server Components into Crawlable Pages
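Server components are encoded with erasure codes and the resulting blocks are injected into HTML pages. Recovering at least m blocks is enough to reconstruct the components.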
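The "recover at least m blocks" property can be illustrated with a small sketch. This is not Warrick's implementation; the slide does not say which erasure code is used. The example assumes a Reed-Solomon-style code over the prime field of size 257, and the names `encode`, `decode`, and `_interpolate` are hypothetical helpers chosen for illustration. Data is split into n blocks so that any m of them are enough to rebuild the original bytes.

```python
P = 257  # prime modulus, just larger than any byte value (assumption for this sketch)


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (xi, yi) points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, m: int, n: int):
    """Return n blocks as (block_id, symbols); any m of them suffice to decode."""
    if not 0 < m <= n < P:
        raise ValueError("need 0 < m <= n < 257")
    padded = data + b"\x00" * (-len(data) % m)  # pad to a multiple of m
    blocks = [[] for _ in range(n)]
    for off in range(0, len(padded), m):
        # The m data bytes are the polynomial's values at x = 1..m.
        points = [(x + 1, padded[off + x]) for x in range(m)]
        for b in range(n):
            blocks[b].append(_interpolate(points, b + 1))
    return [(b + 1, blocks[b]) for b in range(n)]


def decode(available, m: int, length: int) -> bytes:
    """Rebuild the original bytes from any m of the encoded blocks."""
    if len(available) < m:
        raise ValueError("fewer than m blocks recovered")
    chosen = available[:m]
    out = bytearray()
    for k in range(len(chosen[0][1])):
        points = [(block_id, symbols[k]) for block_id, symbols in chosen]
        out.extend(_interpolate(points, x) for x in range(1, m + 1))
    return bytes(out[:length])


if __name__ == "__main__":
    payload = b"config + Perl script + database dump"
    blocks = encode(payload, m=3, n=5)
    # Pretend only three of the five blocks were found in crawled pages.
    survivors = [blocks[4], blocks[1], blocks[3]]
    print(decode(survivors, m=3, length=len(payload)) == payload)
```

In this sketch the first m blocks are the data itself and the remaining n - m blocks are redundancy, so losing up to n - m blocks (for example, pages that were never crawled or cached) still allows full recovery.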


