Presentation is loading. Please wait.

Presentation is loading. Please wait.

Measuring and Archiving the Web

Similar presentations


Presentation on theme: "Measuring and Archiving the Web"— Presentation transcript:

1 Measuring and Archiving the Web
Michael L. Nelson CS 495/595 Old Dominion University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License This course is based on Dr. McCown's class

2 W3C Characterization Activity
Work from Provided definitions for common Web terms like resource, link, proxy, server, etc., some of which are now dated Attempted to answer questions like: How many web pages are there? and How fast is the Web growing? Summary in Pitkow, Summary of WWW Characterizations, Journal of the World Wide Web, 1999

3 Web Page Popularity The Zipf distribution of number of page hits versus rank for five days of AOL December 1997 data Summary of WWW Characterizations

4 The number of clicks per user at the Xerox WWW Site during May 1998.
Web Page Popularity The number of clicks per user at the Xerox WWW Site during May 1998. The curve follows an Inverse Gaussian Distribution, which has a heavy right tail. Summary of WWW Characterizations

5 OCLC Characterization Research
Work from Analyzed Web samples annually to look for trends Sample obtained by randomly sampling IP addresses and connecting to port 80 Today this method would miss a large number of websites that use virtual hosting – multiple domain names hosted on same computer using one IP address (remember the "Host:" request header?) Findings: O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

6 Number of Public Websites
O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

7 Distribution of Websites by Country
1999 2002 O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

8 Popular Websites by In-Links
OCLC Most Linked-To Websites1 Most Linked-To Websites (Jan 2013)2 facebook.com twitter.com google.com youtube.com adobe.com wordpress.org blogspot.com wikipedia.org godaddy.com wordpress.com 1http:// 2

9 Popular Websites by Visits
Alexa’s Top Websites (Jan 2013)1 Most Linked-To Websites (Jan 2013)2 facebook.com google.com youtube.com yahoo.com baidu.com wikipedia.org live.com amazon.com qq.com twitter.com facebook.com twitter.com google.com youtube.com adobe.com wordpress.org blogspot.com wikipedia.org godaddy.com wordpress.com 1 2 see also:

10 Characterizing National Web Domains
A large-scale study by Baeza-Yates at al.1 analyzed web collections from 10 national domains and multinational Web spaces of African and Indochinese Web sites Examined languages, file sizes, pages per site, link structure, etc. 1Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

11 Web Page Languages Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

12 Some Power-law Distributions
File sizes for small and large files Pages per site Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

13 In and Out Degree of Web Pages
In-degree of web pages Out-degree of web pages for few and many outlinks Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

14 More than 95% of content was HTML
Non-HTML File Content More than 95% of content was HTML Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

15 How dynamic is the Web? How often are pages added to the Web?
How often are pages are removed from the Web? How often do pages change? What kinds of changes do pages typically exhibit? How does the link structure change over time?

16 How dynamic is the Web? Two studies attempt to answer these questions:
2004 study (Fetterly et al.1) of 150 million web pages over 11 weeks analyzed weekly snapshots 2004 study (Ntoulas et al.2) of 150 websites over one year analyzed weekly snapshots What follows are some selected highlights 1Fetterly et al. , A large-scale study of the evolution of Web pages,  Software Practice & Experience ,2004   2Ntoulas et al. , What's new on the web?: the evolution of the web from a search engine perspective, Proc WWW 2004  

17 Document Length 2n bytes Fetterly et al. ,2004  

18 Successful Downloads Weeks Fetterly et al. ,2004  

19 Rates of Change by TLD Fetterly et al. ,2004  

20 New Pages Ntoulas et al. , 2004  

21 Link Evolution Ntoulas et al. , 2004  

22 Summary of Findings Fetterly et al., 2004 Ntoulas et al., 2004
When pages change, they change in trivial ways or just their markup Strong relationship between TLD and rate of change but not degree of change The larger the document, the more likely it is to be changed more frequently and significantly Past frequency of changes to a page is good predictor of future page changes Web page changes are usually minor New pages are created at rate of 8% per week Only 20% of pages today will be accessible in a year Large number of pages borrow content from existing pages Every week, 25% new links are created, and after 1 year, 80% of links are replaced with new ones Past degree of change to web page is good predictor of future degree of change

23 Linkrot: The 404 Problem Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Fetterly et al. (‘03) – about 0.5% of web pages disappeared per week Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year McCown et al. (‘05) - 10 year half-life for URLs in D-Lib Magazine articles SalahEldeen & Nelson (‘12) – 11% of URLs from Tweets disappear after 1 year 23

24 Twitter killing blogs? http://wayoftheduck.com/blogs-vs-tweets
Blogosphere Twitter killing blogs?

25 Twitter Users, mid-2012

26 Twitter Cities, June 2012

27 Web Archiving Web archiving is the process of collecting pages from the Web and saving them in an archive Usually it’s important to save all associated resources (images, style sheets, etc.) to preserve look Archives are typically produced using web crawlers

28 The Ephemeral Web Link rot is a significant problem
Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year Even if links don’t disappear, existing content is likely to change over time

29 Why archive the Web? If the Web isn’t saved, we might loose a significant amount of our digital heritage The Web gives historians and other social scientists significant insight into our society, especially into how technology has effected it Serves as important resource in many lawsuits Some organizations want to or are legally obliged to archive their web materials Someone worked hard on this stuff, why not save it?

30 one answer: replay of contemporary pages >> summary pages
Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages

31

32 vs.

33

34

35

36

37

38

39

40

41

42

43

44

45 Archiving Moves At Hurricane Speed, Most News Stories Move Faster

46

47

48

49 Who Archives the Web?

50 good? yes. complete story? no.
Heroic solo efforts good? yes. complete story? no. 1TB endowment = ~$4700:

51 If you could choose a website to preserve for all time, what would you choose? What websites will people want to look at 50 to 500 years from now?

52 K12 Web Archiving Program
Program by Internet Archive and Library of Congress 5th to 12th graders get to participate in web archiving activities As of Oct 2010, students in the program have archived 2,379 websites Video:

53 Who is archiving the Web?
Internet Archive Founded by Brewster Kahle in 1996 Largest web archive in the world (338+ billion pages) Pages Available to public via Wayback Machine 30k s/w titles, 600k books, 900k films/movies, 1M audio recordings, 2M hours TV, 2M ebooks Images:

54

55

56

57 Who is archiving the Web?
National libraries & national archives, usually focusing on culturally significant web collections US Library of Congress: Minerva UK Web Archiving Consortium: UK Web Archive National Library of Australia: PANDORA list from the IIPC: Commercial organizations Hanzo Archives: commercial web archiving tools Nextpoint: archiving service for organizations Iterasi: for corporate, legal, and govt Etc.

58 now w/ a diff interface: http://www.loc.gov/websites/collections/

59 Other Players Free on-demand archiving (page-at-a-time model, not a crawler) WebCite archive.is many others…

60 archive.is http://archive.is/www.cs.odu.edu
see also:

61 Web Crawlers Wget and HTTrack Heritrix
Simple tools for mirroring a website All crawled URLs are converted into a path and file saved as foo.org/test/index.html Not designed for large-scale crawling Heritrix Built by Internet Archive and Nordic national libraries for larger web archiving tasks Archived content stored in Web ARChive file format (WARC) Uses web interface Can find links in JavaScript, Flash, etc.

62 WARC File Organization
Slide from John Kunze at

63 WARC Tools warc-tools WARCreate & WAIL
Python scripts for interacting with WARC files WARCreate & WAIL

64 Archive Viewers/Searchers
Wayback Open source Java implementation of IA’s Wayback Machine Search by URL NutchWAX Full-text search of a web archive WAXToolbar For accessing Wayback’s archive Provides search capabilities for NutchWAX

65 How Much of The Web Is Archived?

66 Public Archives, ca. Late 2010 / Early 2011
Three categories of archives Internet Archive Search engine Other archives UK US See also:

67 1000 URIs Ordered by First Observation Date
See also:

68 see also: http://ws-dl. blogspot

69 How Much of the Web is Archived? It Depends on Which Web…
Including SE cache Excluding SE Cache 90% 79% 97% 68% 35% 16% 88% 19% 2013 95% 92% 23% 26% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives

70 Long Tail of Archives Archive.is
see also:

71 Memento: Time Travel for the Web
The Memento Team Herbert Van de Sompel Michael L. Nelson Robert Sanderson Lyudmila Balakireva Scott Ainsworth Harihar Shankar Memento is partially funded by the Library of Congress 71

72 W3C Web Architecture: Resource – URI - Representation
Identifies dereference Resource Representation Represents 72

73 W3C Web Architecture: Resource – URI - Representation
Identifies dereference content negotiation Resource Representation 1 Represents Representation 2 73

74 Resources

75 Resources have Representations

76 Resources have Representations that Change over Time

77 Only the Current Representation is Available from a Resource

78 Old Representations are Lost Forever

79 Archived Resources Exist

80 Archived Resources Sep 11 2001, 20:36:10 UTC Dec 20 2001, 4:51:00 UTC
archived resource for archived resource for 80

81 Finding Archived Resources
Go to and search On select desired datetime 81

82 Finding Archived Resources
Go to and click History Browse History 82

83 Navigating Archived Resources
Dec , 4:51:00 UTC Navigating Archived Resources current Pentagon archived resource for 83

84 Navigating Archived Resources
Sep , 20:36:10 UTC Sep , 21:38:55 UTC Navigating Archived Resources SPACE archived resource for 84

85 Current and Past Web are Not Integrated
Current and Past Web based on same technology. But, going from Current to Past Web is a matter of (manual) discovery. Memento wants to make going from Current to Past Web a (HTTP) protocol matter. Memento wants to integrate Current And Past Web.

86 TBL on Generic vs. Specific Resources

87 In The Beginning… there was the inode
struct stat { dev_t st_dev; /* ID of device containing file */ ino_t st_ino; /* inode number */ mode_t st_mode; /* protection */ nlink_t st_nlink; /* number of hard links */ uid_t st_uid; /* user ID of owner */ gid_t st_gid; /* group ID of owner */ dev_t st_rdev; /* device ID (if special file) */ off_t st_size; /* total size, in bytes */ blksize_t st_blksize; /* blocksize for filesystem I/O */ blkcnt_t st_blocks; /* number of blocks allocated */ time_t st_atime; /* time of last access */ time_t st_mtime; /* time of last modification */ time_t st_ctime; /* time of last status change */ };

88 Limited Time Semantics…
% telnet 80 Trying Connected to Escape character is '^]'. HEAD /images/ndiipp_header6.jpg HTTP/1.1 Host: Connection: close HTTP/ OK Date: Mon, 19 Jul :41:04 GMT Server: Apache Last-Modified: Thu, 18 Jun :25:54 GMT ETag: "1bc dca24880" Accept-Ranges: bytes Content-Length: 67893 Content-Type: image/jpeg Connection closed by foreign host.

89 Time Semantics Becoming Less, Not More Available
% telnet 80 Trying Connected to Escape character is '^]'. HEAD / HTTP/1.1 Host: Connection: close HTTP/ OK Date: Mon, 19 Jul :36:00 GMT Server: Apache Accept-Ranges: bytes Content-Type: text/html Connection closed by foreign host.

90 The Past Links to the Present…
explicit HTML link; no HTTP links; opaque URI

91 The Past Links to the Present…
no HTML links; no HTTP links; implicit from URI

92 But the Present Does Not Link to the Past
no hints in HTML, HTTP, or URI % telnet 80 Trying Connected to Escape character is '^]'. HEAD / HTTP/1.1 Host: Connection: close HTTP/ OK Date: Mon, 19 Jul :36:00 GMT Server: Apache Accept-Ranges: bytes Content-Type: text/html Connection closed by foreign host.

93 Linking the Past and the Present
Codify existing methods to create linkage from the past to the present easy: an archived version knows for which URI it is an archived version Create a linkage from the present to the past hard: solve with a level of indirection from present to past

94 Normal HTTP Flow GET R 200 OK

95 Normal HTTP Flow GET R 200 OK

96 The Web without a Time Dimension
96

97 The Web without a Time Dimension
97

98 The Web without a Time Dimension
Need to use a different URI to access archived versions of a resource and its current version 98

99 The Web with Time Dimension added by Memento
In Memento: use URI of the current version to access archived versions, but qualify it with datetime 99

100 The Web with Time Dimension added by Memento
TimeGate … and arrive at an archived version via level of indirection 100

101 Content Negotiation in the datetime dimension
Many systems support content negotiation for media type: Your client by default asks for HTML and gets HTML; But it could get PDF via the same URI. Memento proposes a new dimension for content negotiation – time: Your client by default asks for the current time, and gets it But it could get an older version via the same URI Can be accomplished with two new HTTP headers: Request header: Accept-Datetime Conveys datetime of content requested by client Response header: Memento-Datetime Conveys datetime of content returned by server 101 101

102 Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 102

103 Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 103

104 Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 104

105 Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 105

106 Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 106

107 Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG
GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 107

108 The Memento Framework

109 MementoFox: https://addons.mozilla.org/en-us/firefox/addon/mementofox/

110 art history: http://www. moma. org/collection/browse_results. php


Download ppt "Measuring and Archiving the Web"

Similar presentations


Ads by Google