Measuring and Archiving the Web Michael L. Nelson CS 495/595 Old Dominion University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License This course is based on Dr. McCown's class
W3C Characterization Activity Work from 1998-1999 Provided definitions for common Web terms like resource, link, proxy, server, etc., some of which are now dated http://www.w3.org/1999/05/WCA-terms/ Attempted to answer questions like: How many web pages are there? and How fast is the Web growing? Summary in Pitkow, Summary of WWW Characterizations, Journal of the World Wide Web, 1999
Web Page Popularity The Zipf distribution of number of page hits versus rank for five days of AOL December 1997 data Summary of WWW Characterizations
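A Zipf distribution means hits fall off roughly as a power of rank. A minimal sketch, assuming the idealized form hits(r) = C / r^s; the exponent s = 1 and the top-page count are illustrative values, not figures from the AOL data:

```python
def zipf_hits(rank, top_hits, s=1.0):
    """Expected hits for the page at `rank` under an idealized Zipf law.

    `top_hits` is the hit count of the rank-1 page; `s` is the exponent.
    Both are illustrative assumptions, not values fitted to real logs.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_hits / (rank ** s)


if __name__ == "__main__":
    # With s = 1, the 10th page gets a tenth of the top page's hits.
    for rank in (1, 2, 10, 100):
        print(rank, zipf_hits(rank, top_hits=1_000_000))
```

On a log-log plot of hits versus rank, this form shows up as a straight line with slope -s.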
Web Page Popularity The number of clicks per user at the Xerox WWW Site during May 1998. The curve follows an Inverse Gaussian Distribution, which has a heavy right tail. Summary of WWW Characterizations
OCLC Characterization Research Work from 1998-2002 Analyzed Web samples annually to look for trends Sample obtained by randomly sampling IP addresses and connecting to port 80 Today this method would miss a large number of websites that use virtual hosting – multiple domain names hosted on same computer using one IP address (remember the "Host:" request header?) Findings: O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003
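The sampling step can be sketched as drawing random 32-bit values and keeping only globally routable addresses. This is a hypothetical illustration, not OCLC's actual procedure; the function name and the filtering rule are assumptions, and no connection to port 80 is made here:

```python
import ipaddress
import random


def sample_public_ipv4(n, seed=None):
    """Draw `n` random IPv4 addresses, skipping private/reserved ranges.

    Illustrative only: a real survey would then try to connect to
    port 80 on each address, finding at most one site per IP.
    """
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:  # drop 10.0.0.0/8, 127.0.0.0/8, multicast, etc.
            sample.append(addr)
    return sample


if __name__ == "__main__":
    for addr in sample_public_ipv4(5, seed=42):
        print(addr)
```

Because each address yields at most one site, every additional virtual host sharing that IP is invisible to this kind of sample.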
Number of Public Websites O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003
Distribution of Websites by Country 1999 2002 O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003
Popular Websites by In-Links OCLC Most Linked-To Websites [1] Most Linked-To Websites (Jan 2013) [2]: facebook.com, twitter.com, google.com, youtube.com, adobe.com, wordpress.org, blogspot.com, wikipedia.org, godaddy.com, wordpress.com [1] http://www.oclc.org/research/activities/past/orprojects/wcp/stats/linkage.htm [2] http://www.seomoz.org/top500
Popular Websites by Visits Alexa’s Top Websites (Jan 2013) [1]: facebook.com, google.com, youtube.com, yahoo.com, baidu.com, wikipedia.org, live.com, amazon.com, qq.com, twitter.com Most Linked-To Websites (Jan 2013) [2]: facebook.com, twitter.com, google.com, youtube.com, adobe.com, wordpress.org, blogspot.com, wikipedia.org, godaddy.com, wordpress.com [1] http://www.alexa.com/topsites [2] http://www.seomoz.org/top500 see also: http://www.builda-website.net/alexa-traffic-rank.html, http://www.alexa.com/help/traffic-learn-more
Characterizing National Web Domains A large-scale study by Baeza-Yates et al. [1] analyzed web collections from 10 national domains and the multinational Web spaces of African and Indochinese Web sites Examined languages, file sizes, pages per site, link structure, etc. [1] Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
Web Page Languages Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
Some Power-law Distributions File sizes for small and large files Pages per site Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
In and Out Degree of Web Pages In-degree of web pages Out-degree of web pages for few and many outlinks Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
Non-HTML File Content More than 95% of content was HTML Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
How dynamic is the Web? How often are pages added to the Web? How often are pages removed from the Web? How often do pages change? What kinds of changes do pages typically exhibit? How does the link structure change over time?
How dynamic is the Web? Two studies attempt to answer these questions: 2004 study (Fetterly et al. [1]) of 150 million web pages over 11 weeks, analyzed weekly snapshots; 2004 study (Ntoulas et al. [2]) of 150 websites over one year, analyzed weekly snapshots. What follows are some selected highlights. [1] Fetterly et al., A large-scale study of the evolution of Web pages, Software Practice & Experience, 2004 [2] Ntoulas et al., What's new on the web?: the evolution of the web from a search engine perspective, Proc WWW 2004
Document Length (axis: 2^n bytes) Fetterly et al., 2004
Successful Downloads (axis: weeks) Fetterly et al., 2004
Rates of Change by TLD Fetterly et al., 2004
New Pages Ntoulas et al., 2004
Link Evolution Ntoulas et al., 2004
Summary of Findings Fetterly et al., 2004: When pages change, they change in trivial ways or just their markup; Strong relationship between TLD and rate of change but not degree of change; The larger the document, the more likely it is to be changed more frequently and significantly; Past frequency of changes to a page is a good predictor of future page changes. Ntoulas et al., 2004: Web page changes are usually minor; New pages are created at a rate of 8% per week; Only 20% of pages today will be accessible in a year; Large number of pages borrow content from existing pages; Every week, 25% new links are created, and after 1 year, 80% of links are replaced with new ones; Past degree of change to a web page is a good predictor of future degree of change.
Linkrot: The 404 Problem Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Fetterly et al. (‘03) – about 0.5% of web pages disappeared per week Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year McCown et al. (‘05) - 10 year half-life for URLs in D-Lib Magazine articles SalahEldeen & Nelson (‘12) – 11% of URLs from Tweets disappear after 1 year
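These figures are reported in different units (weekly loss, loss over N years, half-life). A rough way to compare them is to assume a constant decay rate; that is a simplifying assumption made for this sketch, not a claim from any of the studies:

```python
import math


def surviving_from_half_life(years, half_life_years):
    """Fraction of URLs still alive after `years`, given a half-life."""
    return 0.5 ** (years / half_life_years)


def surviving_from_weekly_loss(weeks, weekly_loss):
    """Fraction still alive after `weeks` if a fixed share vanishes weekly."""
    return (1.0 - weekly_loss) ** weeks


def half_life_from_loss(fraction_lost, years):
    """Half-life implied by losing `fraction_lost` over `years`.

    Assumes exponential decay at a constant rate.
    """
    return years * math.log(0.5) / math.log(1.0 - fraction_lost)


if __name__ == "__main__":
    # 0.5% lost per week, compounded over a 52-week year
    print(surviving_from_weekly_loss(52, 0.005))
    # 67% lost in 4 years, expressed as a half-life in years
    print(half_life_from_loss(0.67, 4))
    # 10-year half-life, fraction left after 5 years
    print(surviving_from_half_life(5, 10))
```

Under that assumption, 0.5% weekly loss leaves about 77% of pages after a year, and 67% loss in 4 years corresponds to a half-life of roughly 2.5 years.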
Twitter killing blogs? http://wayoftheduck.com/blogs-vs-tweets Blogosphere http://www.sifry.com/alerts/archives/000432.html
Twitter Users, mid-2012 http://semiocast.com/publications/2012_07_30_Twitter_reaches_half_a_billion_accounts_140m_in_the_US
Twitter Cities, June 2012 http://semiocast.com/publications/2012_07_30_Twitter_reaches_half_a_billion_accounts_140m_in_the_US
Web Archiving Web archiving is the process of collecting pages from the Web and saving them in an archive Usually it’s important to save all associated resources (images, style sheets, etc.) to preserve look Archives are typically produced using web crawlers
The Ephemeral Web Link rot is a significant problem Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year Even if links don’t disappear, existing content is likely to change over time
Why archive the Web? If the Web isn’t saved, we might lose a significant amount of our digital heritage The Web gives historians and other social scientists significant insight into our society, especially into how technology has affected it Serves as an important resource in many lawsuits Some organizations want to or are legally obliged to archive their web materials Someone worked hard on this stuff, why not save it?
Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages http://www.slideshare.net/phonedude/why-careaboutthepast http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html
vs.
Archiving Moves At Hurricane Speed, Most News Stories Move Faster
Who Archives the Web?
Heroic solo efforts good? yes. complete story? no. 1TB endowment = ~$4700: http://blog.dshr.org/2011/02/paying-for-long-term-storage.html
If you could choose a website to preserve for all time, what would you choose? What websites will people want to look at 50 to 500 years from now?
K12 Web Archiving Program Program by Internet Archive and Library of Congress 5th to 12th graders get to participate in web archiving activities As of Oct 2010, students in the program have archived 2,379 websites Video: http://www.archive.org/details/K12WebArchivingProgramStudents
Who is archiving the Web? Internet Archive Founded by Brewster Kahle in 1996 Largest web archive in the world (338+ billion pages) Pages Available to public via Wayback Machine 30k s/w titles, 600k books, 900k films/movies, 1M audio recordings, 2M hours TV, 2M ebooks Images: http://en.wikipedia.org/wiki/Internet_Archive
Who is archiving the Web? National libraries & national archives, usually focusing on culturally significant web collections US Library of Congress: Minerva UK Web Archiving Consortium: UK Web Archive National Library of Australia: PANDORA list from the IIPC: http://netpreserve.org/resources/member-archives Commercial organizations Hanzo Archives: commercial web archiving tools Nextpoint: archiving service for organizations Iterasi: for corporate, legal, and govt Etc.
now w/ a diff interface: http://www.loc.gov/websites/collections/
Other Players Free on-demand archiving (page-at-a-time model, not a crawler) WebCite archive.is www.freezepage.com www.peep.us many others…
archive.is http://archive.is/www.cs.odu.edu see also: http://ws-dl.blogspot.com/2013/07/2013-07-09-archiveis-supports-memento.html
Web Crawlers Wget and HTTrack Simple tools for mirroring a website All crawled URLs are converted into a path and file http://foo.org/test/ saved as foo.org/test/index.html Not designed for large-scale crawling Heritrix Built by Internet Archive and Nordic national libraries for larger web archiving tasks Archived content stored in Web ARChive file format (WARC) Uses web interface Can find links in JavaScript, Flash, etc.
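The URL-to-file mapping used by mirroring tools can be sketched as follows. This is a simplified illustration of the idea, not the actual Wget or HTTrack algorithm (which also handle query strings, unsafe characters, and name collisions); `mirror_path` and its `index_name` default are hypothetical names:

```python
from urllib.parse import urlsplit


def mirror_path(url, index_name="index.html"):
    """Map a URL to a relative file path of the form host/path.

    Directory-style URLs (ending in "/", or with no path at all)
    get `index_name` appended. Query strings are ignored here.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += index_name
    return parts.netloc + path


if __name__ == "__main__":
    print(mirror_path("http://foo.org/test/"))
```

The example URL from the slide, http://foo.org/test/, maps to foo.org/test/index.html.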
WARC File Organization Slide from John Kunze at http://www.iwaw.net/05/kunze.pdf
WARC Tools warc-tools Python scripts for interacting with WARC files WARCreate & WAIL http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
Archive Viewers/Searchers Wayback Open source Java implementation of IA’s Wayback Machine Search by URL NutchWAX Full-text search of a web archive WAXToolbar For accessing Wayback’s archive Provides search capabilities for NutchWAX
How Much of The Web Is Archived?
Public Archives, ca. Late 2010 / Early 2011 Three categories of archives Internet Archive Search engine Other archives UK US See also: http://arxiv.org/abs/1212.6177
1000 URIs Ordered by First Observation Date See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
see also: http://ws-dl. blogspot
How Much of the Web is Archived? It Depends on Which Web… Including SE cache Excluding SE Cache 90% 79% 97% 68% 35% 16% 88% 19% 2013 95% 92% 23% 26% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives
Long Tail of Archives Archive.is see also: http://www.cs.odu.edu/~mln/pubs/tpdl-2013/paper_134.pdf
Memento: Time Travel for the Web The Memento Team Herbert Van de Sompel Michael L. Nelson Robert Sanderson Lyudmila Balakireva Scott Ainsworth Harihar Shankar Memento is partially funded by the Library of Congress
W3C Web Architecture: Resource – URI – Representation [Figure: a URI identifies a Resource; dereferencing the URI returns a Representation, which represents the Resource]
W3C Web Architecture: Resource – URI – Representation [Figure: a URI identifies a Resource; dereferencing with content negotiation returns Representation 1 or Representation 2, each of which represents the Resource]
Resources
Resources have Representations
Resources have Representations that Change over Time
Only the Current Representation is Available from a Resource
Old Representations are Lost Forever
Archived Resources Exist
Archived Resources Sep 11 2001, 20:36:10 UTC: http://web.archive.org/web/20010911203610/http://www.cnn.com/ archived resource for http://cnn.com Dec 20 2001, 4:51:00 UTC: http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks
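In the Wayback Machine URI, the archival datetime is visible as a 14-digit YYYYMMDDhhmmss stamp followed by the original URI. A small sketch that splits the two apart; the helper name is hypothetical and the URI layout is inferred from the examples on the slide:

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "http://web.archive.org/web/"


def parse_wayback_uri(uri):
    """Split a Wayback URI into (archival datetime in UTC, original URI)."""
    if not uri.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine URI")
    # Everything up to the first "/" after the prefix is the timestamp.
    stamp, _, original = uri[len(WAYBACK_PREFIX):].partition("/")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return when, original


if __name__ == "__main__":
    when, original = parse_wayback_uri(
        "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
    )
    print(when.isoformat(), original)
```

The Wikipedia URI, by contrast, carries only an opaque `oldid`, so its datetime cannot be read from the URI itself.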
Finding Archived Resources Go to http://www.archive.org/ and search http://cnn.com On http://web.archive.org/web/*/http://cnn.com, select desired datetime
Finding Archived Resources Go to http://en.wikipedia.org/wiki/September_11_attacks and click History Browse History
Navigating Archived Resources Dec 20 2001, 4:51:00 UTC: http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks Pentagon link: current http://en.wikipedia.org/wiki/The_Pentagon
Navigating Archived Resources Sep 11 2001, 20:36:10 UTC: http://web.archive.org/web/20010911203610/http://www.cnn.com/ archived resource for http://cnn.com SPACE link: Sep 11 2001, 21:38:55 UTC http://web.archive.org/web/20010911213855/www.cnn.com/TECH/space/
Current and Past Web are Not Integrated Current and Past Web based on same technology. But, going from Current to Past Web is a matter of (manual) discovery. Memento wants to make going from Current to Past Web a (HTTP) protocol matter. Memento wants to integrate Current And Past Web.
TBL on Generic vs. Specific Resources http://www.w3.org/DesignIssues/Generic.html
In The Beginning… there was the inode
struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for filesystem I/O */
    blkcnt_t  st_blocks;  /* number of blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
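The three `time_t` fields are all the time information the inode carries. A minimal sketch of reading them from Python through `os.stat`; the helper name `file_times` is hypothetical:

```python
import os
import tempfile
from datetime import datetime, timezone


def file_times(path):
    """Return the three inode timestamps of `path` as UTC datetimes."""
    st = os.stat(path)

    def to_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "atime": to_utc(st.st_atime),  # last access
        "mtime": to_utc(st.st_mtime),  # last modification
        "ctime": to_utc(st.st_ctime),  # last status change (on Unix)
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as handle:
        for name, value in file_times(handle.name).items():
            print(name, value.isoformat())
```

Each field holds only the most recent event of its kind; earlier states of the file are not recorded.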
Limited Time Semantics…
% telnet www.digitalpreservation.gov 80
Trying 140.147.249.7...
Connected to www.digitalpreservation.gov.
Escape character is '^]'.
HEAD /images/ndiipp_header6.jpg HTTP/1.1
Host: www.digitalpreservation.gov
Connection: close

HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:41:04 GMT
Server: Apache
Last-Modified: Thu, 18 Jun 2009 16:25:54 GMT
ETag: "1bc861-10935-dca24880"
Accept-Ranges: bytes
Content-Length: 67893
Content-Type: image/jpeg

Connection closed by foreign host.
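The only time information in such a response is the `Date` and `Last-Modified` headers. A short sketch that pulls them out of a raw response head, using the response shown on this slide as sample input; `header_datetimes` and `SAMPLE` are hypothetical names introduced for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

SAMPLE = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 19 Jul 2010 21:41:04 GMT\r\n"
    "Server: Apache\r\n"
    "Last-Modified: Thu, 18 Jun 2009 16:25:54 GMT\r\n"
    'ETag: "1bc861-10935-dca24880"\r\n'
    "Accept-Ranges: bytes\r\n"
    "Content-Length: 67893\r\n"
    "Content-Type: image/jpeg\r\n"
    "\r\n"
)


def header_datetimes(raw_response):
    """Return the Date and Last-Modified headers as datetimes, if present."""
    found = {}
    for line in raw_response.splitlines()[1:]:  # skip the status line
        if not line.strip():
            break  # blank line ends the header block
        name, _, value = line.partition(":")
        key = name.strip().lower()
        if key in ("date", "last-modified"):
            found[key] = parsedate_to_datetime(value.strip())
    return found


if __name__ == "__main__":
    times = header_datetimes(SAMPLE)
    print(times["date"] - times["last-modified"])  # age of the image at request time
```

A response without `Last-Modified` yields only the `date` key, so the client learns nothing about when the content was last changed.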
Time Semantics Becoming Less, Not More Available
% telnet www.digitalpreservation.gov 80
Trying 140.147.249.7...
Connected to www.digitalpreservation.gov.
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.digitalpreservation.gov
Connection: close

HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:36:00 GMT
Server: Apache
Accept-Ranges: bytes
Content-Type: text/html

Connection closed by foreign host.
The Past Links to the Present… explicit HTML link; no HTTP links; opaque URI
The Past Links to the Present… no HTML links; no HTTP links; implicit from URI
But the Present Does Not Link to the Past no hints in HTML, HTTP, or URI
% telnet www.digitalpreservation.gov 80
Trying 140.147.249.7...
Connected to www.digitalpreservation.gov.
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.digitalpreservation.gov
Connection: close

HTTP/1.1 200 OK
Date: Mon, 19 Jul 2010 21:36:00 GMT
Server: Apache
Accept-Ranges: bytes
Content-Type: text/html

Connection closed by foreign host.
Linking the Past and the Present Codify existing methods to create linkage from the past to the present easy: an archived version knows for which URI it is an archived version Create a linkage from the present to the past hard: solve with a level of indirection from present to past
Normal HTTP Flow GET R 200 OK
The Web without a Time Dimension
The Web without a Time Dimension Need to use a different URI to access archived versions of a resource and its current version
The Web with Time Dimension added by Memento In Memento: use URI of the current version to access archived versions, but qualify it with datetime
The Web with Time Dimension added by Memento TimeGate … and arrive at an archived version via level of indirection
Content Negotiation in the datetime dimension Many systems support content negotiation for media type: Your client by default asks for HTML and gets HTML; But it could get PDF via the same URI. Memento proposes a new dimension for content negotiation – time: Your client by default asks for the current time, and gets it; But it could get an older version via the same URI. Can be accomplished with two new HTTP headers: Request header: Accept-Datetime Conveys datetime of content requested by client Response header: Memento-Datetime Conveys datetime of content returned by server
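A minimal sketch of producing and reading the two headers. It assumes their values use the same HTTP date format as `Date` and `Last-Modified` (e.g. "Thu, 18 Jun 2009 16:25:54 GMT"); the helper names are hypothetical:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(when):
    """Build the request header asking for content as of `when`."""
    if when.tzinfo is None:
        raise ValueError("`when` must be timezone-aware")
    # HTTP dates are expressed in GMT, so normalize to UTC first.
    value = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": value}


def memento_datetime(response_headers):
    """Read the datetime of the returned content, or None if absent."""
    value = response_headers.get("Memento-Datetime")
    return parsedate_to_datetime(value) if value else None


if __name__ == "__main__":
    wanted = datetime(2001, 9, 11, 20, 36, 10, tzinfo=timezone.utc)
    print(accept_datetime_header(wanted))
```

The client states the datetime it wants; the server's `Memento-Datetime` reports the datetime it actually returned, which may differ.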
Memento HTTP Flow
HEAD R, Accept-Datetime -> 200, Link: G
GET G, Accept-Datetime -> 302 to M, Vary, Link: R, B, M
GET M, Accept-Datetime -> 200, Memento-Datetime, Link: R, B, M
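The three-step flow can be sketched with a toy in-memory server, so no network access is needed. Everything here is illustrative: the example.org URIs, the archive contents, the "closest in time" selection rule, and the `timegate`/`original` relation names are assumptions, and the B link from the slide is left out:

```python
from datetime import datetime, timezone

# Hypothetical URIs: R = original resource, G = its TimeGate.
R = "http://example.org/page"
G = "http://timegate.example.org/" + R
# Hypothetical archive: Memento URI -> archival datetime.
MEMENTOS = {
    "http://archive.example.org/2001/page": datetime(2001, 9, 11, tzinfo=timezone.utc),
    "http://archive.example.org/2009/page": datetime(2009, 6, 18, tzinfo=timezone.utc),
}


def serve(method, uri, accept_datetime):
    """Toy server: return (status, headers) for R, G, or a Memento."""
    if uri == R:
        return 200, {"Link": '<%s>; rel="timegate"' % G}
    if uri == G:
        # Assumed policy: redirect to the Memento closest to the request.
        best = min(MEMENTOS, key=lambda m: abs(MEMENTOS[m] - accept_datetime))
        return 302, {"Location": best, "Vary": "accept-datetime",
                     "Link": '<%s>; rel="original"' % R}
    if uri in MEMENTOS:
        # Kept as a datetime for simplicity; a real header is a string.
        return 200, {"Memento-Datetime": MEMENTOS[uri],
                     "Link": '<%s>; rel="original"' % R}
    return 404, {}


def time_travel(uri, when):
    """Follow R -> G -> M and return (Memento URI, Memento datetime)."""
    _, headers = serve("HEAD", uri, when)                # step 1: find G
    timegate = headers["Link"].split(">")[0].lstrip("<")
    status, headers = serve("GET", timegate, when)       # step 2: negotiate
    if status != 302:
        raise RuntimeError("TimeGate did not redirect")
    memento = headers["Location"]
    _, headers = serve("GET", memento, when)             # step 3: fetch M
    return memento, headers["Memento-Datetime"]


if __name__ == "__main__":
    print(time_travel(R, datetime(2002, 1, 1, tzinfo=timezone.utc)))
```

The client starts from the URI of the current resource and a datetime; the TimeGate supplies the level of indirection that selects an archived version.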
The Memento Framework
MementoFox: https://addons.mozilla.org/en-us/firefox/addon/mementofox/
art history: http://www.moma.org/collection/browse_results.php