Web as Network: A Case Study Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.


The Web as Network
Consider the web as a network
–vertices: individual (HTML) pages
–edges: hyperlinks between pages
–will view as both a directed and undirected graph
What is the structure of this network?
–connected components
–degree distributions
–etc.
What does it say about the people building and using it?
–page and link generation
–visitation statistics
What are the algorithmic consequences?
–web search
–community identification

Graph Structure in the Web [Broder et al. paper]
Report on the results of two massive “web crawls”
Executed by AltaVista in May and October 1999
Details of the crawls:
–automated script following hyperlinks (URLs) from pages found
–large set of starting points collected over time
–crawl implemented as breadth-first search
–have to deal with webspam, infinite paths, timeouts, duplicates, etc.
May ’99 crawl:
–200 million pages, 1.5 billion links
Oct ’99 crawl:
–271 million pages, 2.1 billion links
Unaudited, self-reported Sep ’03 stats:
–3 major search engines claim > 3 billion pages indexed
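The crawl procedure described on this slide can be sketched as a breadth-first search over a link map. This is a minimal illustration, not the AltaVista crawler: `bfs_crawl`, `get_links`, `max_pages`, and the `TOY_WEB` dictionary are all hypothetical names, and the toy dictionary stands in for real page fetches.

```python
from collections import deque

def bfs_crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl. Returns (pages in visit order, list of links seen)."""
    seeds = list(dict.fromkeys(seeds))   # drop duplicate starting points
    visited = set(seeds)                 # duplicate detection
    queue = deque(seeds)
    order, links = [], []
    # The page budget is a crude guard against infinite paths.
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in get_links(page):
            links.append((page, target))
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order, links

# Hypothetical toy web: page -> pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

order, links = bfs_crawl(["a"], lambda page: TOY_WEB.get(page, []))
print(order)        # ['a', 'b', 'c', 'd']
print(len(links))   # 5
```

A real crawl would also need timeouts and spam filtering, which this sketch leaves out.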

Five Easy Pieces
Authors did two kinds of breadth-first search:
–ignoring link direction → weak connectivity
–only following forward links → strong connectivity
They then identify five different regions of the web:
–strongly connected component (SCC): can reach any page in SCC from any other in directed fashion
–component IN: can reach any page in SCC in directed fashion, but not reverse
–component OUT: can be reached from any page in SCC, but not reverse
–component TENDRILS: weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN)
–SCC+IN+OUT+TENDRILS form weakly connected component (WCC)
–everything else is called DISC (disconnected from the above)
–here is a visualization of this structure
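The five regions can be recovered with three breadth-first searches from a single page. The sketch below assumes that page (`core_node`, a hypothetical parameter) is already known to lie in the SCC; the function names and the toy edge list are illustrative and not taken from the paper.

```python
from collections import deque

def reachable(start, adj):
    """Set of nodes reachable from start by BFS over adjacency dict adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges, core_node):
    """Split nodes into SCC / IN / OUT / TENDRILS / DISC relative to core_node."""
    fwd, bwd, und, nodes = {}, {}, {}, set()
    for u, v in edges:
        nodes.update((u, v))
        fwd.setdefault(u, []).append(v)    # forward links
        bwd.setdefault(v, []).append(u)    # reversed links
        und.setdefault(u, []).append(v)    # direction ignored
        und.setdefault(v, []).append(u)
    down = reachable(core_node, fwd)   # pages the core can reach
    up = reachable(core_node, bwd)     # pages that can reach the core
    wcc = reachable(core_node, und)    # weakly connected component
    scc = down & up
    return {
        "SCC": scc,
        "IN": up - scc,
        "OUT": down - scc,
        "TENDRILS": wcc - down - up,
        "DISC": nodes - wcc,
    }

# Hypothetical toy web with one page or pair per region.
edges = [("s1", "s2"), ("s2", "s1"),   # SCC
         ("i", "s1"),                  # IN
         ("s2", "o"),                  # OUT
         ("i", "t"),                   # tendril pointed to by IN
         ("x", "y")]                   # disconnected
regions = bow_tie(edges, "s1")
print(regions["TENDRILS"])   # {'t'}
```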

Size of the Five
SCC: ~56M pages, ~28%
IN: ~43M pages, ~21%
OUT: ~43M pages, ~21%
TENDRILS: ~44M pages, ~22%
DISC: ~17M pages, ~8%
WCC > 91% of the web --- the giant component
One interpretation of the pieces:
–SCC: the heart of the web
–IN: newer sites not yet discovered and linked to
–OUT: “insular” pages like corporate web sites
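As a quick check of the percentages, the rounded page counts on this slide can be summed and divided out. The sketch treats the rounded counts (in millions) as exact, which is an assumption; the variable names are illustrative.

```python
# Rounded page counts from the slide, in millions (treated as exact here).
sizes_m = {"SCC": 56, "IN": 43, "OUT": 43, "TENDRILS": 44, "DISC": 17}

total_m = sum(sizes_m.values())   # 203
shares = {name: 100 * count / total_m for name, count in sizes_m.items()}

# The WCC is everything except DISC.
wcc_share = 100 * (total_m - sizes_m["DISC"]) / total_m

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
print(f"WCC: {wcc_share:.1f}%")   # WCC: 91.6%
```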

Diameter Measurements
Directed worst-case diameter of the SCC:
–at least 28
Directed worst-case diameter of IN ∪ SCC ∪ OUT:
–at least 503
Over 75% of the time, there is no directed path between a random start and finish page in the WCC
–when there is a directed path, average length is 16
Average undirected distance in the WCC is 7
Moral:
–web is a “small world” when we ignore direction
–otherwise the picture is more complex
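The contrast between directed and undirected distances can be illustrated on a toy graph. This sketch checks every ordered pair exhaustively, which is only feasible on small graphs; measurements at web scale would have to sample start pages instead. The function names and the five-page graph are hypothetical.

```python
from collections import deque

def bfs_dist(start, adj):
    """Shortest hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def undirected(adj):
    """Copy of adj with every link made traversable in both directions."""
    und = {}
    for u, targets in adj.items():
        for v in targets:
            und.setdefault(u, set()).add(v)
            und.setdefault(v, set()).add(u)
    return und

def path_stats(nodes, adj):
    """Return (fraction of ordered pairs with no path, mean length of existing paths)."""
    lengths, pairs = [], 0
    for s in nodes:
        dist = bfs_dist(s, adj)
        for t in nodes:
            if t != s:
                pairs += 1
                if t in dist:
                    lengths.append(dist[t])
    no_path = (pairs - len(lengths)) / pairs
    mean = sum(lengths) / len(lengths) if lengths else float("nan")
    return no_path, mean

# Hypothetical toy web: a directed 4-cycle plus one page "e" that only links in.
adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"], "e": ["a"]}
nodes = ["a", "b", "c", "d", "e"]

directed_stats = path_stats(nodes, adj)
undirected_stats = path_stats(nodes, undirected(adj))
print(directed_stats)     # (0.2, 2.125)
print(undirected_stats)
```

Even in this tiny example, ignoring direction removes the unreachable pairs and shortens the average distance.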

Degree Distributions
They are, of course, heavy-tailed
Power law distribution of component size
–consistent with the Erdős-Rényi model
Undirected connectivity of web not reliant on “connectors”
–what happens as we remove high-degree vertices?
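The question about removing high-degree vertices can be explored with a small experiment: tabulate degrees, delete every vertex at or above a threshold, and measure the largest weakly connected component that remains. This is a sketch under assumptions: it uses total degree (in plus out) and deletes whole vertices, which may differ from the exact procedure in the paper, and the names `without_hubs` and threshold `k` are hypothetical.

```python
from collections import Counter, deque

def degree_counts(edges):
    """Total degree (in + out) of every node that appears in an edge."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def largest_wcc(nodes, edges):
    """Size of the largest weakly connected component."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

def without_hubs(nodes, edges, k):
    """Drop every node with degree >= k, along with its edges."""
    deg = degree_counts(edges)
    kept = {n for n in nodes if deg[n] < k}
    return kept, [(u, v) for u, v in edges if u in kept and v in kept]

# Hypothetical toy web: one hub linking to four pages that also form a ring.
edges = [("hub", "p1"), ("hub", "p2"), ("hub", "p3"), ("hub", "p4"),
         ("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p1")]
nodes = {"hub", "p1", "p2", "p3", "p4"}

histogram = Counter(degree_counts(edges).values())   # degree -> number of nodes
kept, kept_edges = without_hubs(nodes, edges, k=4)
print(largest_wcc(nodes, edges))       # 5
print(largest_wcc(kept, kept_edges))   # 4
```

Here the ring keeps the remaining pages connected after the hub is gone, which is the kind of behavior the slide describes for the web.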

Here is a 2005 update on all this stuff.