Measuring and Archiving the Web

Slides:



Advertisements
Similar presentations
The Internet and the Web
Advertisements

BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS Jani Stenvall
TC2-Computer Literacy Mr. Sencer February 4, 2010.
World Wide Web1 Applications World Wide Web. 2 Introduction What is hypertext model? Use of hypertext in World Wide Web (WWW) – HTML. WWW client-server.
The World Wide Web and the Internet Dr Jim Briggs 1WUCM1.
1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006.
WEB DESIGN SOME FOUNDATIONS. SO WHAT IS THIS INTERNET.
Web Archiving Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike.
July 25, 2012 Arlington, Virginia Digital Preservation 2012warcreate.com WARCreate Create Wayback-Consumable WARC Files from Any Webpage Mat Kelly, Michele.
Web Architecture Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike.
HINARI/Basic Internet Concepts (module 1.1). Instructions - This part of the:  course is a PowerPoint demonstration intended to introduce you to Basic.
Web Characterization: What Does the Web Look Like?
2013Dr. Ali Rodan 1 Handout 1 Fundamentals of the Internet.
CP476 Internet Computing Lecture 5 : HTTP, WWW and URL 1 Lecture 5. WWW, HTTP and URL Objective: to review the concepts of WWW to understand how HTTP works.
1 Chapter 2 & Chapter 4 §Browsers. 2 Terms §Software §Program §Application.
Web Archiving Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
Annick Le Follic Bibliothèque nationale de France Tallinn,
1 Archive-It: Archiving and Preserving Born Digital Content NDIIPP June 2009 Molly Bragg Partner Specialist Internet Archive.
Preserving Digital Culture: Tools & Strategies for Building Web Archives : Tools and Strategies for Building Web Archives Internet Librarian 2009 Tracy.
Kingdom of Saudi Arabia Ministry of Higher Education Al-Imam Muhammad Ibn Saud Islamic University College of Computer and Information Sciences Chapter.
Chapter 8 Browsing and Searching the Web. 2Practical PC 5 th Edition Chapter 8 Getting Started In this Chapter, you will learn: − What is a Web page −
1 WWW. 2 World Wide Web Major application protocol used on the Internet Simple interface Two concepts –Point –Click.
Microsoft Office 2008 for Mac – Illustrated Unit D: Getting Started with Safari.
INTRODUCTION Dr Mohd Soperi Mohd Zahid Semester /16.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Strategies for archiving the Danish web space Bjarne Andersen Head of Digital Resources State and University Library, Aarhus
HTML PROJECT #1 Project 1 Introduction to HTML. HTML Project 1: Introduction to HTML 2 Project Objectives 1.Describe the Internet and its associated key.
Web Archiving Workshop Mark Phillips Texas Conference on Digital Libraries June 4, 2008.
Web Development. Agenda Web History Network Architecture Types of Server The languages of the web Protocols API 2.
The World Wide Web.
4.01 How Web Pages Work.
Archiving & Preserving Digital Content
Web fundamentals: Clients, Servers, and Communication
4.01 How Web Pages Work.
Chapter 10: Web Basics.
Block 5: An application layer protocol: HTTP
How HTTP Works Made by Manish Kushwaha.
Chapter 8 Browsing and Searching the Web
Why We Need Multiple Archives
HTTP – An overview.
Web Server Design Week 10 Old Dominion University
COMP2322 Lab 2 HTTP Steven Lee Feb. 8, 2017.
E-commerce | WWW World Wide Web - Concepts
Project 1 Introduction to HTML.
E-commerce | WWW World Wide Web - Concepts
Debugging Your Website with Fiddler and Chrome Developer Tools
Web Server Design Week 7 Old Dominion University
Tutorial (4): HTTP Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
IS333D: MULTI-TIER APPLICATION DEVELOPMENT
A Brief Introduction to the Internet
Windows Internet Explorer 7-Illustrated Essentials
Chapter 27 WWW and HTTP.
Web Server Design Week 4 Old Dominion University
Web archive data and researchers’ needs: how might we meet them?
Group 3: Olena Hunsicker and Divya Josyula
Characterization of Search Engine Caches
Web Server Design Week 8 Old Dominion University
Web Server Design Week 6 Old Dominion University
Introduction to World Wide Web
CIS 133 mashup Javascript, jQuery and XML
Chien-Chung Shen CIS, UD
Web Server Design Week 5 Old Dominion University
Web Server Design Week 4 Old Dominion University
4.01 How Web Pages Work.
An Introduction to the Internet and the Web
Web Server Design Week 6 Old Dominion University
CSCI-351 Data communication and Networks
Web Server Design Week 7 Old Dominion University
Web Server Design Week 7 Old Dominion University
Old Dominion University Computer Science IIPC New Member
Presentation transcript:

Measuring and Archiving the Web Michael L. Nelson CS 495/595 Old Dominion University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License This course is based on Dr. McCown's class

W3C Characterization Activity Work from 1998-1999 Provided definitions for common Web terms like resource, link, proxy, server, etc., some of which are now dated http://www.w3.org/1999/05/WCA-terms/ Attempted to answer questions like: How many web pages are there? and How fast is the Web growing? Summary in Pitkow, Summary of WWW Characterizations, Journal of the World Wide Web, 1999

Web Page Popularity The Zipf distribution of number of page hits versus rank for five days of AOL December 1997 data Summary of WWW Characterizations

The number of clicks per user at the Xerox WWW Site during May 1998. Web Page Popularity The number of clicks per user at the Xerox WWW Site during May 1998. The curve follows an Inverse Gaussian Distribution, which has a heavy right tail. Summary of WWW Characterizations

OCLC Characterization Research Work from 1998-2002 Analyzed Web samples annually to look for trends Sample obtained by randomly sampling IP addresses and connecting to port 80 Today this method would miss a large number of websites that use virtual hosting – multiple domain names hosted on same computer using one IP address (remember the "Host:" request header?) Findings: O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Number of Public Websites O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Distribution of Websites by Country 1999 2002 O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Popular Websites by In-Links OCLC Most Linked-To Websites1 Most Linked-To Websites (Jan 2013)2 facebook.com twitter.com google.com youtube.com adobe.com wordpress.org blogspot.com wikipedia.org godaddy.com wordpress.com 1http://www.oclc.org/research/activities/past/orprojects/wcp/stats/linkage.htm 2 http://www.seomoz.org/top500

Popular Websites by Visits Alexa’s Top Websites (Jan 2013)1 Most Linked-To Websites (Jan 2013)2 facebook.com google.com youtube.com yahoo.com baidu.com wikipedia.org live.com amazon.com qq.com twitter.com facebook.com twitter.com google.com youtube.com adobe.com wordpress.org blogspot.com wikipedia.org godaddy.com wordpress.com 1 http://www.alexa.com/topsites 2 http://www.seomoz.org/top500 see also: http://www.builda-website.net/alexa-traffic-rank.html, http://www.alexa.com/help/traffic-learn-more

Characterizing National Web Domains A large-scale study by Baeza-Yates at al.1 analyzed web collections from 10 national domains and multinational Web spaces of African and Indochinese Web sites Examined languages, file sizes, pages per site, link structure, etc. 1Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

Web Page Languages Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

Some Power-law Distributions File sizes for small and large files Pages per site Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

In and Out Degree of Web Pages In-degree of web pages Out-degree of web pages for few and many outlinks Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

More than 95% of content was HTML Non-HTML File Content More than 95% of content was HTML Baeza-Yates et al. , Characterization of national Web domains, ACM Trans. Internet Technol., May 2007 

How dynamic is the Web? How often are pages added to the Web? How often are pages are removed from the Web? How often do pages change? What kinds of changes do pages typically exhibit? How does the link structure change over time?

How dynamic is the Web? Two studies attempt to answer these questions: 2004 study (Fetterly et al.1) of 150 million web pages over 11 weeks analyzed weekly snapshots 2004 study (Ntoulas et al.2) of 150 websites over one year analyzed weekly snapshots What follows are some selected highlights 1Fetterly et al. , A large-scale study of the evolution of Web pages,  Software Practice & Experience ,2004   2Ntoulas et al. , What's new on the web?: the evolution of the web from a search engine perspective, Proc WWW 2004  

Document Length 2n bytes Fetterly et al. ,2004  

Successful Downloads Weeks Fetterly et al. ,2004  

Rates of Change by TLD Fetterly et al. ,2004  

New Pages Ntoulas et al. , 2004  

Link Evolution Ntoulas et al. , 2004  

Summary of Findings Fetterly et al., 2004 Ntoulas et al., 2004 When pages change, they change in trivial ways or just their markup Strong relationship between TLD and rate of change but not degree of change The larger the document, the more likely it is to be changed more frequently and significantly Past frequency of changes to a page is good predictor of future page changes Web page changes are usually minor New pages are created at rate of 8% per week Only 20% of pages today will be accessible in a year Large number of pages borrow content from existing pages Every week, 25% new links are created, and after 1 year, 80% of links are replaced with new ones Past degree of change to web page is good predictor of future degree of change

Linkrot: The 404 Problem Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Fetterly et al. (‘03) – about 0.5% of web pages disappeared per week Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year McCown et al. (‘05) - 10 year half-life for URLs in D-Lib Magazine articles SalahEldeen & Nelson (‘12) – 11% of URLs from Tweets disappear after 1 year 23

Twitter killing blogs? http://wayoftheduck.com/blogs-vs-tweets Blogosphere http://www.sifry.com/alerts/archives/000432.html Twitter killing blogs? http://wayoftheduck.com/blogs-vs-tweets

Twitter Users, mid-2012 http://semiocast.com/publications/2012_07_30_Twitter_reaches_half_a_billion_accounts_140m_in_the_US

Twitter Cities, June 2012 http://semiocast.com/publications/2012_07_30_Twitter_reaches_half_a_billion_accounts_140m_in_the_US

Web Archiving Web archiving is the process of collecting pages from the Web and saving them in an archive Usually it’s important to save all associated resources (images, style sheets, etc.) to preserve look Archives are typically produced using web crawlers

The Ephemeral Web Link rot is a significant problem Kahle (‘97) - Average page lifetime is 44 days Koehler (‘99, ‘04) - 67% URLs lost in 4 years Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”) Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years Ntoulas et al. (‘04) – predicted only 20% of pages today will be accessible in a year Even if links don’t disappear, existing content is likely to change over time

Why archive the Web? If the Web isn’t saved, we might loose a significant amount of our digital heritage The Web gives historians and other social scientists significant insight into our society, especially into how technology has effected it Serves as important resource in many lawsuits Some organizations want to or are legally obliged to archive their web materials Someone worked hard on this stuff, why not save it?

one answer: replay of contemporary pages >> summary pages Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages http://www.slideshare.net/phonedude/why-careaboutthepast http://www.nytimes.com/2013/06/19/books/seven-american-deaths-and-disasters-transcribes-the-news.html

vs.

Archiving Moves At Hurricane Speed, Most News Stories Move Faster

Who Archives the Web?

good? yes. complete story? no. Heroic solo efforts good? yes. complete story? no. 1TB endowment = ~$4700: http://blog.dshr.org/2011/02/paying-for-long-term-storage.html

If you could choose a website to preserve for all time, what would you choose? What websites will people want to look at 50 to 500 years from now?

K12 Web Archiving Program Program by Internet Archive and Library of Congress 5th to 12th graders get to participate in web archiving activities As of Oct 2010, students in the program have archived 2,379 websites Video: http://www.archive.org/details/K12WebArchivingProgramStudents

Who is archiving the Web? Internet Archive Founded by Brewster Kahle in 1996 Largest web archive in the world (338+ billion pages) Pages Available to public via Wayback Machine 30k s/w titles, 600k books, 900k films/movies, 1M audio recordings, 2M hours TV, 2M ebooks Images: http://en.wikipedia.org/wiki/Internet_Archive

Who is archiving the Web? National libraries & national archives, usually focusing on culturally significant web collections US Library of Congress: Minerva UK Web Archiving Consortium: UK Web Archive National Library of Australia: PANDORA list from the IIPC: http://netpreserve.org/resources/member-archives Commercial organizations Hanzo Archives: commercial web archiving tools Nextpoint: archiving service for organizations Iterasi: for corporate, legal, and govt Etc.

now w/ a diff interface: http://www.loc.gov/websites/collections/

Other Players Free on-demand archiving (page-at-a-time model, not a crawler) WebCite archive.is www.freezepage.com www.peep.us many others…

archive.is http://archive.is/www.cs.odu.edu see also: http://ws-dl.blogspot.com/2013/07/2013-07-09-archiveis-supports-memento.html

Web Crawlers Wget and HTTrack Heritrix Simple tools for mirroring a website All crawled URLs are converted into a path and file http://foo.org/test/ saved as foo.org/test/index.html Not designed for large-scale crawling Heritrix Built by Internet Archive and Nordic national libraries for larger web archiving tasks Archived content stored in Web ARChive file format (WARC) Uses web interface Can find links in JavaScript, Flash, etc.

WARC File Organization Slide from John Kunze at http://www.iwaw.net/05/kunze.pdf

WARC Tools warc-tools WARCreate & WAIL Python scripts for interacting with WARC files WARCreate & WAIL http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

Archive Viewers/Searchers Wayback Open source Java implementation of IA’s Wayback Machine Search by URL NutchWAX Full-text search of a web archive WAXToolbar For accessing Wayback’s archive Provides search capabilities for NutchWAX

How Much of The Web Is Archived?

Public Archives, ca. Late 2010 / Early 2011 Three categories of archives Internet Archive Search engine Other archives UK US See also: http://arxiv.org/abs/1212.6177

1000 URIs Ordered by First Observation Date See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html

see also: http://ws-dl. blogspot

How Much of the Web is Archived? It Depends on Which Web… Including SE cache Excluding SE Cache 90% 79% 97% 68% 35% 16% 88% 19% 2013 95% 92% 23% 26% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives

Long Tail of Archives Archive.is see also: http://www.cs.odu.edu/~mln/pubs/tpdl-2013/paper_134.pdf

Memento: Time Travel for the Web The Memento Team Herbert Van de Sompel Michael L. Nelson Robert Sanderson Lyudmila Balakireva Scott Ainsworth Harihar Shankar Memento is partially funded by the Library of Congress 71

W3C Web Architecture: Resource – URI - Representation Identifies dereference Resource Representation Represents 72

W3C Web Architecture: Resource – URI - Representation Identifies dereference content negotiation Resource Representation 1 Represents Representation 2 73

Resources

Resources have Representations

Resources have Representations that Change over Time

Only the Current Representation is Available from a Resource

Old Representations are Lost Forever

Archived Resources Exist

Archived Resources Sep 11 2001, 20:36:10 UTC Dec 20 2001, 4:51:00 UTC http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks http://web.archive.org/web/20010911203610/http://www.cnn.com/ archived resource for http://cnn.com 80

Finding Archived Resources Go to http://www.archive.org/ and search http://cnn.com On http://web.archive.org/web/*/http://cnn.com, select desired datetime 81

Finding Archived Resources Go to http://en.wikipedia.org/wiki/September_11_attacks and click History Browse History 82

Navigating Archived Resources Dec 20 2001, 4:51:00 UTC Navigating Archived Resources current Pentagon http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333 archived resource for http://en.wikipedia.org/wiki/September_11_attacks3 http://en.wikipedia.org/wiki/The_Pentagon 83

Navigating Archived Resources Sep 11 2001, 20:36:10 UTC Sep 11 2001, 21:38:55 UTC Navigating Archived Resources SPACE http://web.archive.org/web/20010911203610/http://www.cnn.com/ archived resource for http://cnn.com http://web.archive.org/web/20010911213855/www.cnn.com/TECH/space/ 84

Current and Past Web are Not Integrated Current and Past Web based on same technology. But, going from Current to Past Web is a matter of (manual) discovery. Memento wants to make going from Current to Past Web a (HTTP) protocol matter. Memento wants to integrate Current And Past Web.

TBL on Generic vs. Specific Resources http://www.w3.org/DesignIssues/Generic.html

In The Beginning… there was the inode struct stat { dev_t st_dev; /* ID of device containing file */ ino_t st_ino; /* inode number */ mode_t st_mode; /* protection */ nlink_t st_nlink; /* number of hard links */ uid_t st_uid; /* user ID of owner */ gid_t st_gid; /* group ID of owner */ dev_t st_rdev; /* device ID (if special file) */ off_t st_size; /* total size, in bytes */ blksize_t st_blksize; /* blocksize for filesystem I/O */ blkcnt_t st_blocks; /* number of blocks allocated */ time_t st_atime; /* time of last access */ time_t st_mtime; /* time of last modification */ time_t st_ctime; /* time of last status change */ };

Limited Time Semantics… % telnet www.digitalpreservation.gov 80 Trying 140.147.249.7... Connected to www.digitalpreservation.gov. Escape character is '^]'. HEAD /images/ndiipp_header6.jpg HTTP/1.1 Host: www.digitalpreservation.gov Connection: close HTTP/1.1 200 OK Date: Mon, 19 Jul 2010 21:41:04 GMT Server: Apache Last-Modified: Thu, 18 Jun 2009 16:25:54 GMT ETag: "1bc861-10935-dca24880" Accept-Ranges: bytes Content-Length: 67893 Content-Type: image/jpeg Connection closed by foreign host.

Time Semantics Becoming Less, Not More Available % telnet www.digitalpreservation.gov 80 Trying 140.147.249.7... Connected to www.digitalpreservation.gov. Escape character is '^]'. HEAD / HTTP/1.1 Host: www.digitalpreservation.gov Connection: close HTTP/1.1 200 OK Date: Mon, 19 Jul 2010 21:36:00 GMT Server: Apache Accept-Ranges: bytes Content-Type: text/html Connection closed by foreign host.

The Past Links to the Present… explicit HTML link; no HTTP links; opaque URI

The Past Links to the Present… no HTML links; no HTTP links; implicit from URI

But the Present Does Not Link to the Past no hints in HTML, HTTP, or URI % telnet www.digitalpreservation.gov 80 Trying 140.147.249.7... Connected to www.digitalpreservation.gov. Escape character is '^]'. HEAD / HTTP/1.1 Host: www.digitalpreservation.gov Connection: close HTTP/1.1 200 OK Date: Mon, 19 Jul 2010 21:36:00 GMT Server: Apache Accept-Ranges: bytes Content-Type: text/html Connection closed by foreign host.

Linking the Past and the Present Codify existing methods to create linkage from the past to the present easy: an archived version knows for which URI it is an archived version Create a linkage from the present to the past hard: solve with a level of indirection from present to past

Normal HTTP Flow GET R 200 OK

Normal HTTP Flow GET R 200 OK

The Web without a Time Dimension 96

The Web without a Time Dimension 97

The Web without a Time Dimension Need to use a different URI to access archived versions of a resource and its current version 98

The Web with Time Dimension added by Memento In Memento: use URI of the current version to access archived versions, but qualify it with datetime 99

The Web with Time Dimension added by Memento TimeGate … and arrive at an archived version via level of indirection 100

Content Negotiation in the datetime dimension Many systems support content negotiation for media type: Your client by default asks for HTML and gets HTML; But it could get PDF via the same URI. Memento proposes a new dimension for content negotiation – time: Your client by default asks for the current time, and gets it But it could get an older version via the same URI Can be accomplished with two new HTTP headers: Request header: Accept-Datetime Conveys datetime of content requested by client Response header: Memento-Datetime Conveys datetime of content returned by server 101 101

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 102

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 103

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 104

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 105

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 106

Memento HTTP Flow HEAD R, Accept-Datetime 200, LinkG GET G, Accept-Datetime 302M, Vary, LinkR,B,M GET M, Accept-Datetime 200, Memento-Datetime, LinkR,B,M 107

The Memento Framework

MementoFox: https://addons.mozilla.org/en-us/firefox/addon/mementofox/

art history: http://www. moma. org/collection/browse_results. php