Measuring and Archiving the Web

Measuring and Archiving the Web
Michael L. Nelson CS 495/595 Old Dominion University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License This course is based on Dr. McCown's class

W3C Characterization Activity
Work from Provided definitions for common Web terms like resource, link, proxy, server, etc., some of which are now dated Attempted to answer questions like: How many web pages are there? and How fast is the Web growing? Summary in Pitkow, Summary of WWW Characterizations, Journal of the World Wide Web, 1999

Web Page Popularity The Zipf distribution of number of page hits versus rank for five days of AOL December 1997 data Summary of WWW Characterizations

The number of clicks per user at the Xerox WWW Site during May 1998.
Web Page Popularity The number of clicks per user at the Xerox WWW Site during May 1998. The curve follows an Inverse Gaussian Distribution, which has a heavy right tail. Summary of WWW Characterizations

OCLC Characterization Research
Work from Analyzed Web samples annually to look for trends Sample obtained by randomly sampling IP addresses and connecting to port 80 Today this method would miss a large number of websites that use virtual hosting – multiple domain names hosted on same computer using one IP address (remember the "Host:" request header?) Findings: O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Number of Public Websites
O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Distribution of Websites by Country
1999 2002 O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Popular Websites by In-Links
OCLC Most Linked-To Websites1 Most Linked-To Websites (Jan 2013)2 facebook.com twitter.com google.com youtube.com adobe.com wordpress.org blogspot.com wikipedia.org godaddy.com wordpress.com 1http:// 2

Popular Websites by Visits
Alexa’s Top Websites (Jan 2013)1 Most Linked-To Websites (Jan 2013)2 facebook.com google.com youtube.com yahoo.com baidu.com wikipedia.org live.com amazon.com qq.com twitter.com facebook.com twitter.com google.com youtube.com adobe.com wordpress.org blogspot.com wikipedia.org godaddy.com wordpress.com 1 2 see also:

Characterizing National Web Domains
A large-scale study by Baeza-Yates at al.1 analyzed web collections from 10 national domains and multinational Web spaces of African and Indochinese Web sites Examined languages, file sizes, pages per site, link structure, etc. 1Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

Web Page Languages Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

Some Power-law Distributions
File sizes for small and large files Pages per site Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

In and Out Degree of Web Pages
In-degree of web pages Out-degree of web pages for few and many outlinks Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

More than 95% of content was HTML
Non-HTML File Content More than 95% of content was HTML Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

How dynamic is the Web? How often are pages added to the Web?
How often are pages are removed from the Web? How often do pages change? What kinds of changes do pages typically exhibit? How does the link structure change over time?

How dynamic is the Web? Two studies attempt to answer these questions:
2004 study (Fetterly et al.1) of 150 million web pages over 11 weeks analyzed weekly snapshots 2004 study (Ntoulas et al.2) of 150 websites over one year analyzed weekly snapshots What follows are some selected highlights 1Fetterly et al. , A large-scale study of the evolution of Web pages, Software Practice & Experience ,2004 2Ntoulas et al. , What's new on the web?: the evolution of the web from a search engine perspective, Proc WWW 2004

Document Length 2n bytes Fetterly et al. ,2004

Successful Downloads Weeks Fetterly et al. ,2004

Rates of Change by TLD Fetterly et al. ,2004

New Pages Ntoulas et al. , 2004

Link Evolution Ntoulas et al. , 2004

Summary of Findings Fetterly et al., 2004 Ntoulas et al., 2004
When pages change, they change in trivial ways or just their markup Strong relationship between TLD and rate of change but not degree of change The larger the document, the more likely it is to be changed more frequently and significantly Past frequency of changes to a page is good predictor of future page changes Web page changes are usually minor New pages are created at rate of 8% per week Only 20% of pages today will be accessible in a year Large number of pages borrow content from existing pages Every week, 25% new links are created, and after 1 year, 80% of links are replaced with new ones Past degree of change to web page is good predictor of future degree of change

Linkrot: The 404 Problem Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Fetterly et al. (‘03) – about 0.5% of web pages disappeared per week Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year McCown et al. (‘05) - 10 year half-life for URLs in D-Lib Magazine articles SalahEldeen & Nelson (‘12) – 11% of URLs from Tweets disappear after 1 year 23

Twitter killing blogs? http://wayoftheduck.com/blogs-vs-tweets
Blogosphere Twitter killing blogs?

Twitter Users, mid-2012

Twitter Cities, June 2012

Web Archiving Web archiving is the process of collecting pages from the Web and saving them in an archive Usually it’s important to save all associated resources (images, style sheets, etc.) to preserve look Archives are typically produced using web crawlers

The Ephemeral Web Link rot is a significant problem
Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year Even if links don’t disappear, existing content is likely to change over time

Why archive the Web? If the Web isn’t saved, we might loose a significant amount of our digital heritage The Web gives historians and other social scientists significant insight into our society, especially into how technology has effected it Serves as important resource in many lawsuits Some organizations want to or are legally obliged to archive their web materials Someone worked hard on this stuff, why not save it?

one answer: replay of contemporary pages >> summary pages
Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages

Archiving Moves At Hurricane Speed, Most News Stories Move Faster

Who Archives the Web?

good? yes. complete story? no.
Heroic solo efforts good? yes. complete story? no. 1TB endowment = ~$4700:

If you could choose a website to preserve for all time, what would you choose? What websites will people want to look at 50 to 500 years from now?

K12 Web Archiving Program
Program by Internet Archive and Library of Congress 5th to 12th graders get to participate in web archiving activities As of Oct 2010, students in the program have archived 2,379 websites Video:

Who is archiving the Web?
Internet Archive Founded by Brewster Kahle in 1996 Largest web archive in the world (338+ billion pages) Pages Available to public via Wayback Machine 30k s/w titles, 600k books, 900k films/movies, 1M audio recordings, 2M hours TV, 2M ebooks Images:

Who is archiving the Web?
National libraries & national archives, usually focusing on culturally significant web collections US Library of Congress: Minerva UK Web Archiving Consortium: UK Web Archive National Library of Australia: PANDORA list from the IIPC: Commercial organizations Hanzo Archives: commercial web archiving tools Nextpoint: archiving service for organizations Iterasi: for corporate, legal, and govt Etc.

now w/ a diff interface: http://www.loc.gov/websites/collections/

Other Players Free on-demand archiving (page-at-a-time model, not a crawler) WebCite archive.is many others…

archive.is http://archive.is/www.cs.odu.edu
see also:

Web Crawlers Wget and HTTrack Heritrix
Simple tools for mirroring a website All crawled URLs are converted into a path and file saved as foo.org/test/index.html Not designed for large-scale crawling Heritrix Built by Internet Archive and Nordic national libraries for larger web archiving tasks Archived content stored in Web ARChive file format (WARC) Uses web interface Can find links in JavaScript, Flash, etc.

WARC File Organization
Slide from John Kunze at

WARC Tools warc-tools WARCreate & WAIL
Python scripts for interacting with WARC files WARCreate & WAIL

Archive Viewers/Searchers
Wayback Open source Java implementation of IA’s Wayback Machine Search by URL NutchWAX Full-text search of a web archive WAXToolbar For accessing Wayback’s archive Provides search capabilities for NutchWAX

How Much of The Web Is Archived?

Public Archives, ca. Late 2010 / Early 2011
Three categories of archives Internet Archive Search engine Other archives UK US See also:

1000 URIs Ordered by First Observation Date
See also:

see also: http://ws-dl. blogspot

How Much of the Web is Archived? It Depends on Which Web…
Including SE cache Excluding SE Cache 90% 79% 97% 68% 35% 16% 88% 19% 2013 95% 92% 23% 26% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives

Long Tail of Archives Archive.is
see also:

Memento: Time Travel for the Web
The Memento Team Herbert Van de Sompel Michael L. Nelson Robert Sanderson Lyudmila Balakireva Scott Ainsworth Harihar Shankar Memento is partially funded by the Library of Congress 71

W3C Web Architecture: Resource – URI - Representation
Identifies dereference Resource Representation Represents 72

W3C Web Architecture: Resource – URI - Representation
Identifies dereference content negotiation Resource Representation 1 Represents Representation 2 73

Resources

Resources have Representations

Resources have Representations that Change over Time

Only the Current Representation is Available from a Resource

Old Representations are Lost Forever

Archived Resources Exist

Archived Resources Sep 11 2001, 20:36:10 UTC Dec 20 2001, 4:51:00 UTC
archived resource for archived resource for 80

Finding Archived Resources
Go to and search On select desired datetime 81

Finding Archived Resources
Go to and click History Browse History 82

Navigating Archived Resources
Dec , 4:51:00 UTC Navigating Archived Resources current Pentagon archived resource for 83

Navigating Archived Resources
Sep , 20:36:10 UTC Sep , 21:38:55 UTC Navigating Archived Resources SPACE archived resource for 84

Current and Past Web are Not Integrated
Current and Past Web based on same technology. But, going from Current to Past Web is a matter of (manual) discovery. Memento wants to make going from Current to Past Web a (HTTP) protocol matter. Memento wants to integrate Current And Past Web.

TBL on Generic vs. Specific Resources

In The Beginning… there was the inode
struct stat { dev_t st_dev; /* ID of device containing file */ ino_t st_ino; /* inode number */ mode_t st_mode; /* protection */ nlink_t st_nlink; /* number of hard links */ uid_t st_uid; /* user ID of owner */ gid_t st_gid; /* group ID of owner */ dev_t st_rdev; /* device ID (if special file) */ off_t st_size; /* total size, in bytes */ blksize_t st_blksize; /* blocksize for filesystem I/O */ blkcnt_t st_blocks; /* number of blocks allocated */ time_t st_atime; /* time of last access */ time_t st_mtime; /* time of last modification */ time_t st_ctime; /* time of last status change */ };

Limited Time Semantics…
% telnet 80 Trying Connected to Escape character is '^]'. HEAD /images/ndiipp_header6.jpg HTTP/1.1 Host: Connection: close HTTP/ OK Date: Mon, 19 Jul :41:04 GMT Server: Apache Last-Modified: Thu, 18 Jun :25:54 GMT ETag: "1bc dca24880" Accept-Ranges: bytes Content-Length: 67893 Content-Type: image/jpeg Connection closed by foreign host.

Time Semantics Becoming Less, Not More Available
% telnet 80 Trying Connected to Escape character is '^]'. HEAD / HTTP/1.1 Host: Connection: close HTTP/ OK Date: Mon, 19 Jul :36:00 GMT Server: Apache Accept-Ranges: bytes Content-Type: text/html Connection closed by foreign host.

The Past Links to the Present…
explicit HTML link; no HTTP links; opaque URI

The Past Links to the Present…
no HTML links; no HTTP links; implicit from URI

But the Present Does Not Link to the Past
no hints in HTML, HTTP, or URI % telnet 80 Trying Connected to Escape character is '^]'. HEAD / HTTP/1.1 Host: Connection: close HTTP/ OK Date: Mon, 19 Jul :36:00 GMT Server: Apache Accept-Ranges: bytes Content-Type: text/html Connection closed by foreign host.

Linking the Past and the Present
Codify existing methods to create linkage from the past to the present easy: an archived version knows for which URI it is an archived version Create a linkage from the present to the past hard: solve with a level of indirection from present to past

Normal HTTP Flow GET R 200 OK

The Web without a Time Dimension
96

97

Need to use a different URI to access archived versions of a resource and its current version 98

The Web with Time Dimension added by Memento
In Memento: use URI of the current version to access archived versions, but qualify it with datetime 99

The Web with Time Dimension added by Memento
TimeGate … and arrive at an archived version via level of indirection 100

Content Negotiation in the datetime dimension
Many systems support content negotiation for media type: Your client by default asks for HTML and gets HTML; But it could get PDF via the same URI. Memento proposes a new dimension for content negotiation – time: Your client by default asks for the current time, and gets it But it could get an older version via the same URI Can be accomplished with two new HTTP headers: Request header: Accept-Datetime Conveys datetime of content requested by client Response header: Memento-Datetime Conveys datetime of content returned by server 101 101

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 102

The Memento Framework

MementoFox: https://addons.mozilla.org/en-us/firefox/addon/mementofox/

art history: http://www. moma. org/collection/browse_results. php

Measuring and Archiving the Web

Similar presentations

Presentation on theme: "Measuring and Archiving the Web"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Measuring and Archiving the Web

Similar presentations

Presentation on theme: "Measuring and Archiving the Web"— Presentation transcript:

Similar presentations

About project

Feedback