Web Characterization: What Does the Web Look Like?


Web Characterization: What Does the Web Look Like?
Dr. Frank McCown
Intro to Web Science
Harding University
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License

W3C Characterization Activity
Work from 1998-1999
Provided definitions for common Web terms like resource, link, proxy, server, etc., some of which are now dated: http://www.w3.org/1999/05/WCA-terms/
Attempted to answer questions like "How many web pages are there?" and "How fast is the Web growing?"
Summary in Pitkow, Summary of WWW Characterizations, Journal of the World Wide Web, 1999

Web Page Popularity (1994)
Source: Summary of WWW Characterizations

OCLC Characterization Research
Work from 1998-2002
Analyzed Web samples annually to look for trends
Sample obtained by randomly sampling IP addresses and connecting to port 80
Today this method would miss a large number of websites that use virtual hosting (multiple domain names hosted on the same computer using one IP address)
Findings: O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003
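The sampling step described on this slide can be sketched roughly as follows. This is only an illustration of the general idea, not OCLC's actual procedure: the function names, the seed, the timeout, and the choice to skip non-routable addresses are all assumptions made for the example.

```python
import ipaddress
import random
import socket


def random_public_ipv4(rng):
    """Draw random 32-bit addresses until one looks publicly routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        # Assumption: skip private, reserved, and multicast ranges.
        if addr.is_global and not addr.is_multicast and not addr.is_reserved:
            return addr


def has_web_server(addr, port=80, timeout=2.0):
    """Return True if a TCP connection to addr:port succeeds within the timeout."""
    try:
        with socket.create_connection((str(addr), port), timeout=timeout):
            return True
    except OSError:
        return False


rng = random.Random(42)  # fixed seed so the sample is reproducible
sample = [random_public_ipv4(rng) for _ in range(5)]
print(sample)
# Probing is left commented out because it generates real network traffic.
# hits = [addr for addr in sample if has_web_server(addr)]
```

A successful connection only shows that one IP address answers on port 80. With virtual hosting, that single address may serve many websites, which is why this method undercounts today.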

Number of Public Websites
O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Distribution of Websites by Country
Panels: 1999 and 2002
O'Neill et al., Trends in the Evolution of the Public Web, D-Lib Magazine, Apr 2003

Popular Websites by In-Links
OCLC Most Linked-To Websites [1]
Most Linked-To Websites in 2009 [2]:
en.wikipedia.org
www.youtube.com
www.dictionary.com
www.craigslist.com
www.facebook.com
www.myspace.com
www.twitter.com
www.imdb.com
www.hulu.com
www.perezhilton.com
[1] http://www.oclc.org/research/activities/past/orprojects/wcp/stats/linkage.htm
[2] http://www.seomoz.org/blog/tangled-web-the-most-linked-to-pages-on-the-internet

Bow-Tie Structure of the Web
Broder et al. (Graph Structure of the Web, 2000)
Examined a large web graph (200M pages, 1.5B links)
SCC = Strongly Connected Component
Figure label: 17 million nodes
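To make the bow-tie terminology concrete, here is a minimal sketch that splits a toy graph into SCC, IN, and OUT using forward and backward reachability. It assumes we already know one page (`core_node`, a hypothetical parameter) that lies inside the large strongly connected component. Tendrils and disconnected pages are lumped together as OTHER, and the edge list is made up for illustration.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return the set of nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(edges, core_node):
    """Classify nodes relative to the SCC that contains core_node."""
    forward = defaultdict(list)   # page -> pages it links to
    backward = defaultdict(list)  # page -> pages linking to it
    nodes = set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        nodes.update((src, dst))
    descendants = reachable(core_node, forward)
    ancestors = reachable(core_node, backward)
    scc = descendants & ancestors
    return {
        "SCC": scc,                                # mutually reachable pages
        "IN": ancestors - scc,                     # can reach the SCC only
        "OUT": descendants - scc,                  # reachable from the SCC only
        "OTHER": nodes - ancestors - descendants,  # tendrils, disconnected pages
    }


# Hypothetical toy graph: a, b, c form the core.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("in1", "a"), ("c", "out1"), ("t1", "t2")]
parts = bow_tie(edges, "a")
print(parts)
```

On a graph of hundreds of millions of pages the same idea applies, but the graph would not fit in in-memory Python dictionaries like this.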

Characterizing National Web Domains
A large-scale study by Baeza-Yates et al. [1] analyzed web collections from 10 national domains and multinational Web spaces of African and Indochinese Web sites
Examined languages, file sizes, pages per site, link structure, etc.
[1] Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

Web Page Languages
Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

Some Power-law Distributions
Panels: file sizes for small and large files; pages per site
Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
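One common way to check whether data such as file sizes or pages per site follow a power law is to estimate the exponent with the standard maximum-likelihood formula for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)). The sketch below applies it to synthetic data. The exponent 2.1, the sample size, and the function names are arbitrary choices for illustration and are not taken from the study.

```python
import math
import random


def power_law_exponent(values, x_min):
    """Maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha), x >= x_min."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(rng, alpha, x_min, n):
    """Draw n synthetic values from a continuous power law by inverse transform."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


rng = random.Random(0)
synthetic_sizes = sample_power_law(rng, alpha=2.1, x_min=1.0, n=20000)
estimate = power_law_exponent(synthetic_sizes, x_min=1.0)
print(f"estimated exponent: {estimate:.2f}")
```

On real crawl data the choice of `x_min` matters a great deal, since the slide itself notes that small and large files behave differently.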

In and Out Degree of Web Pages
Panels: in-degree of web pages; out-degree of web pages for few and many outlinks
Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007
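Degree histograms like the ones plotted on this slide can be computed from a list of (source, destination) links. The following is a minimal sketch on a made-up four-page graph. The function name and the data are assumptions for illustration.

```python
from collections import Counter


def degree_histograms(edges):
    """Return (in_hist, out_hist): how many pages have each in/out degree."""
    nodes = {page for edge in edges for page in edge}
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    # Include pages with degree 0, which never appear in the Counters above.
    in_hist = Counter(in_degree.get(page, 0) for page in nodes)
    out_hist = Counter(out_degree.get(page, 0) for page in nodes)
    return in_hist, out_hist


# Hypothetical link graph: page "c" is linked to by three other pages.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_hist, out_hist = degree_histograms(links)
print("in-degree histogram:", dict(in_hist))
print("out-degree histogram:", dict(out_hist))
```

Plotting these histograms on log-log axes is the usual way to look for the heavy tail that a power law would produce.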

Non-HTML File Content
More than 95% of content was HTML
Baeza-Yates et al., Characterization of national Web domains, ACM Trans. Internet Technol., May 2007

How dynamic is the Web?
How often are pages added to the Web?
How often are pages removed from the Web?
How often do pages change?
What kinds of changes do pages typically exhibit?
How does the link structure change over time?

How dynamic is the Web?
Two studies attempt to answer these questions:
2004 study (Fetterly et al. [1]) of 150 million web pages over 11 weeks, analyzed as weekly snapshots
2004 study (Ntoulas et al. [2]) of 150 websites over one year, analyzed as weekly snapshots
What follows are some selected highlights
[1] Fetterly et al., A large-scale study of the evolution of Web pages, Software Practice & Experience, 2004
[2] Ntoulas et al., What's new on the web?: the evolution of the web from a search engine perspective, Proc WWW 2004
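Comparing weekly snapshots requires some measure of how much a page changed between two crawls. One simple option, shown here only as a sketch and not necessarily the measure used in either study, is the Jaccard similarity of word shingles. The shingle length `k=3` and the sample text are assumptions for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in text, case-folded."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(old_text, new_text, k=3):
    """Jaccard similarity of two snapshots: 1.0 means no detectable change."""
    a, b = shingles(old_text, k), shingles(new_text, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical weekly snapshots of the same page.
week1 = "Welcome to our site. Updated Monday. News and events below."
week2 = "Welcome to our site. Updated Tuesday. News and events below."
score = similarity(week1, week2)
print(f"similarity between snapshots: {score:.3f}")
```

A one-word edit in a very short page moves the score a lot. On full-length pages the same kind of edit would register as a small change, which matches the slide's interest in degree of change and not just whether a page changed.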

Document Length
Axis: 2^n bytes
Fetterly et al., 2004

Successful Downloads
Axis: Weeks
Fetterly et al., 2004

Rates of Change by TLD
Fetterly et al., 2004

New Pages
Ntoulas et al., 2004

Link Evolution
Ntoulas et al., 2004

Summary of Findings

Fetterly et al., 2004
When pages change, they change in trivial ways or just their markup
Strong relationship between TLD and rate of change but not degree of change
The larger the document, the more likely it is to be changed more frequently and significantly
Past frequency of changes to a page is a good predictor of future page changes

Ntoulas et al., 2004
Web page changes are usually minor
New pages are created at a rate of 8% per week
Only 20% of pages today will be accessible in a year
Large number of pages borrow content from existing pages
Every week, 25% new links are created, and after 1 year, 80% of links are replaced with new ones
Past degree of change to a web page is a good predictor of future degree of change

Linkrot: The 404 Problem
Kahle (‘97) - Average page lifetime is 44 days
Koehler (‘99, ‘04) - 67% URLs lost in 4 years
Lawrence et al. (‘01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”)
Spinellis (‘03) - 27% URLs in CACM/Computer papers gone in 5 years
Fetterly et al. (‘03) - about 0.5% of web pages disappeared per week
Ntoulas et al. (‘04) - predicted only 20% of pages today will be accessible in a year
McCown et al. (‘05) - 10 year half-life for URLs in D-Lib Magazine articles
Nelson & Allen (‘02) - 3% objects in digital library gone in 1 year
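The loss rates above are reported over different time spans, which makes them hard to compare directly. One rough way to put them on a common scale is to convert each to a half-life, assuming URLs disappear at a constant rate (exponential decay). That assumption is ours for the sake of the example; the studies themselves do not necessarily model decay this way.

```python
import math


def half_life(fraction_lost, period):
    """Half-life in the same unit as period, assuming constant-rate decay."""
    if not 0.0 < fraction_lost < 1.0:
        raise ValueError("fraction_lost must be strictly between 0 and 1")
    surviving = 1.0 - fraction_lost
    return period * math.log(2) / -math.log(surviving)


# Figures taken from the list above; periods are in years.
koehler = half_life(0.67, 4)             # 67% of URLs lost in 4 years
spinellis = half_life(0.27, 5)           # 27% of URLs gone in 5 years
ntoulas = half_life(0.80, 1)             # only 20% still accessible after a year
fetterly = half_life(0.005, 1) / 52      # 0.5% lost per week, converted to years

for name, years in [("Koehler", koehler), ("Spinellis", spinellis),
                    ("Ntoulas et al.", ntoulas), ("Fetterly et al.", fetterly)]:
    print(f"{name}: half-life of roughly {years:.1f} years")
```

Under this assumption the figures imply half-lives ranging from under half a year to about 11 years, which suggests how strongly the result depends on which population of URLs was sampled.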

Blogosphere
http://www.sifry.com/alerts/archives/000432.html