Web Graph Characteristics
Kira Radinsky
All of the following slides are courtesy of Ronny Lempel (Yahoo!)



2 The Web as a Graph
Pages are the graph nodes, hyperlinks are the edges.
- Sometimes sites, rather than pages, are taken as the nodes.
Some natural questions:
1. Distribution of the number of in-links to a page.
2. Distribution of the number of out-links from a page.
3. Distribution of the number of pages in a site.
4. Connectivity: is it possible to reach most pages from most pages?
5. Is there a theoretical model that fits the graph?
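
Questions 1 and 2 can be explored on any edge list. The following is a minimal sketch, assuming a tiny hand-made graph (the page names and the helper `degree_histograms` are hypothetical, not taken from the slides); it counts how many pages have each in-degree and each out-degree.

```python
from collections import Counter

def degree_histograms(edges):
    """Return (in_hist, out_hist): maps from a degree value to the number of nodes with it."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1   # one more out-link from src
        in_deg[dst] += 1    # one more in-link to dst
        nodes.update((src, dst))
    # Counter returns 0 for nodes that never appear as a source (or target).
    in_hist = Counter(in_deg[n] for n in nodes)
    out_hist = Counter(out_deg[n] for n in nodes)
    return in_hist, out_hist

# Hypothetical toy graph: each pair is (linking page, linked page).
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
in_hist, out_hist = degree_histograms(toy_edges)
```

On a real crawl the same histograms are what gets plotted on log-log axes to look for a Power-Law.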

3 Mathematical Background: Power-Law Distributions
A non-negative random variable X is said to have a Power-Law distribution if, for some constants c > 0 and α > 0:
$\Pr[X > x] \sim c\,x^{-\alpha}$, or equivalently $f(x) \sim x^{-(\alpha+1)}$.
Taking logs of both sides, we have:
$\log \Pr[X > x] = -\alpha \log x + \log c$
so the tail is a straight line of slope -α on a log-log plot.
Power-Law distributions have "heavy" or "long" tails, i.e. the probability mass of events whose value is far from the expectation or median of the distribution is significant.
- Unlike Normal or Geometric/Exponential distributions, where the probability mass of the tail decreases at least exponentially, in Power-Law distributions the mass of the tail decreases only polynomially, as $x^{-\alpha}$.
- Another point of view: in an Exponential distribution, f(x)/f(x+k) is constant (independent of x), whereas in a Power-Law distribution, f(x)/f(kx) is constant.
- The "average" quantity in a Power-Law distribution is not "typical".
Examples of Power-Law distributions are the Pareto and Zipf distributions (see next slides).
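
The contrast between the two ratios, f(x)/f(x+k) for an Exponential density and f(x)/f(kx) for a Power-Law density, can be checked numerically. This is a minimal sketch with arbitrarily chosen parameters (`alpha`, `lam` and `k` are illustrative assumptions, and the power-law density is left unnormalised because the constant cancels in the ratio).

```python
import math

def power_law_pdf(x, alpha):
    """Unnormalised Power-Law density, f(x) ~ x^-(alpha+1)."""
    return x ** -(alpha + 1)

def exponential_pdf(x, lam):
    """Exponential density with rate lam."""
    return lam * math.exp(-lam * x)

alpha, lam, k = 1.1, 0.5, 2.0   # illustrative values only
xs = [1.0, 10.0, 100.0]

# Power law: f(x)/f(kx) equals k^(alpha+1) no matter which x is used.
power_ratios = [power_law_pdf(x, alpha) / power_law_pdf(k * x, alpha) for x in xs]

# Exponential: f(x)/f(x+k) equals e^(lam*k) no matter which x is used.
exp_ratios = [exponential_pdf(x, lam) / exponential_pdf(x + k, lam) for x in xs]
```

Rescaling x by a factor leaves the shape of a Power-Law unchanged, which is the sense in which such distributions are called scale-free.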

4 Mathematical Background: The Pareto Distribution
A continuous, positive random variable X taking values in $[L, \infty)$ is said to be distributed Pareto(L, k) if its probability density function is:
$f(x; k, L) = \dfrac{k L^{k}}{x^{k+1}}$
This implies that $\Pr(X > x) = (L/x)^{k}$.
- Has a finite expectation, $Lk/(k-1)$, only for k > 1.
- Has finite variance only for k > 2.
Named after the Italian economist Vilfredo Pareto ( ), who used it to model the distribution of wealth in society.
- Most people have little income; 20% of society holds 80% of the wealth.
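
The tail formula Prob(X > x) = (L/x)^k can be checked by simulation. The sketch below assumes inverse-transform sampling (X = L / U^(1/k) for U uniform on (0, 1]), which is one standard way to draw Pareto samples and is not described in the slides; the parameter values and the fixed seed are illustrative.

```python
import random

def sample_pareto(L, k, n, rng):
    """Draw n Pareto(L, k) samples by inverting the tail Pr(X > x) = (L/x)^k."""
    # 1 - rng.random() lies in (0, 1], which avoids dividing by zero.
    return [L / (1.0 - rng.random()) ** (1.0 / k) for _ in range(n)]

def pareto_mean(L, k):
    """Theoretical expectation L*k/(k-1); only finite for k > 1."""
    if k <= 1:
        raise ValueError("the expectation is infinite for k <= 1")
    return L * k / (k - 1)

L, k, n = 1.0, 2.5, 100_000          # illustrative values only
rng = random.Random(0)               # fixed seed so the run is repeatable
samples = sample_pareto(L, k, n, rng)

x = 2.0 * L
empirical_tail = sum(1 for s in samples if s > x) / n
theoretical_tail = (L / x) ** k
```

With k close to 1 the sample mean becomes very unstable from run to run, which illustrates why the "average" is not "typical" for heavy-tailed data.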

5 Mathematical Background: Zipf's Law
A random variable X follows Zipf's Law (is "Zipfian") with parameter α when the j'th most popular value of X occurs with probability proportional to $j^{-\alpha}$.
- Essentially, the distribution is over the discrete ranks.
Whenever α > 1, X may take an infinite number of values (i.e. have infinitely many different value popularities), since the sum of $j^{-\alpha}$ over all ranks then converges.
Named after the American linguist George Kingsley Zipf ( ), who observed it in the frequencies of words in the English language.
- On a large corpus of English text, the 135 most frequently occurring words accounted for half of the text.
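
A Zipfian distribution over a finite number of ranks is easy to write down explicitly. This is a minimal sketch, assuming a hypothetical vocabulary of 10,000 ranks and α = 1 (both values are illustrative and do not come from the slides); it normalises the weights j^(-α) and counts how many top ranks are needed to cover half of the probability mass.

```python
def zipf_pmf(num_ranks, alpha):
    """Probability of each rank j = 1..num_ranks, proportional to j^-alpha."""
    weights = [j ** -alpha for j in range(1, num_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def ranks_needed(pmf, mass):
    """Smallest number of top ranks whose probabilities sum to at least `mass`."""
    cumulative = 0.0
    for rank, p in enumerate(pmf, start=1):
        cumulative += p
        if cumulative >= mass:
            return rank
    return len(pmf)

pmf = zipf_pmf(num_ranks=10_000, alpha=1.0)   # illustrative parameters
top_half = ranks_needed(pmf, 0.5)
```

A small number of top ranks covers half of the mass, in the same spirit as the word-frequency observation on the slide.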

6 Mathematical Background: An Observed Zipfian Sample Implies a Power-Law
The following analysis is due to Lada Adamic.
Assume that N units of wealth (coins) are distributed to M individuals.
- There are N observations of a random variable Y that can take on the discrete values 1, 2, …, M.
- $Y_k = j$ (k = 1, …, N; j = 1, …, M) means that person j got coin k.
- Denote by $X_1$ [$X_M$] the number of coins of the richest [poorest] individual at the end of the process.
For simplicity, assume that N >> M and that the $X_j$'s are all distinct.
Assume that perfect Zipfian behavior is observed, i.e. $X_r / N \sim r^{-b}$ for all r = 1, …, M.
- This trivially implies $X_r \sim r^{-b}$.

7 Mathematical Background: An Observed Zipfian Sample Implies a Power-Law (cont.)
Recap: we distributed N coins to M individuals, and denoted by $X_1$ [$X_M$] the number of coins of the richest [poorest] individual at the end of the process.
By assuming Zipfian wealth: $X_r \sim r^{-b}$, or $X_r = c\,r^{-b}$.
Let Z be the random variable of a person's wealth, i.e. the number of coins a person gets by this process.
Observation: if the r'th richest person got $X_r$ coins, then exactly r people out of M got $X_r$ coins or more:
$\Pr[Z \ge X_r] = \Pr[Z \ge c\,r^{-b}] = r/M$
Define $y = c\,r^{-b}$, so that $r = (y/c)^{-1/b}$, and therefore:
$\Pr[Z \ge y] = y^{-1/b}\,c^{1/b} / M$
Hence $\Pr[Z \ge y] \sim y^{-1/b}$, and Z obeys a Power-Law.
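
The rank-to-tail argument can be verified numerically. This is a minimal sketch, assuming perfectly Zipfian wealth X_r = c·r^(-b) with illustrative values for M, b and c (real-valued wealth is used instead of whole coins to keep the check exact); it compares the fraction of people holding at least y with the closed form y^(-1/b)·c^(1/b)/M.

```python
M, b, c = 200, 0.8, 1000.0   # illustrative values only

# wealth[r - 1] is X_r, the wealth of the r'th richest person.
wealth = [c * r ** -b for r in range(1, M + 1)]

def empirical_ccdf(y):
    """Fraction of the M people whose wealth is at least y."""
    return sum(1 for w in wealth if w >= y) / M

def predicted_ccdf(y):
    """Closed form from the derivation: Pr[Z >= y] = y^(-1/b) * c^(1/b) / M."""
    return y ** (-1.0 / b) * c ** (1.0 / b) / M

# Compare the two at every observed wealth level X_r.
max_gap = max(abs(empirical_ccdf(x) - predicted_ccdf(x)) for x in wealth)
```

A Zipf exponent b on the rank plot therefore corresponds to a tail exponent 1/b in the Power-Law for Z.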

8 Distribution of Inlinks
[Figure: log-log plot of the in-degree distribution. Image taken from "Graph Structure in the Web", Broder et al., WWW 2000.]
A plot of the number of nodes having each value of in-degree.
- Both axes are in log-scale.
Denoting the size of the sample crawl by N (over 200M here), we have:
$\log(N \cdot \Pr[\text{node has in-degree } x]) \approx -a \log x + c$
$\log \Pr[\text{node has in-degree } x] \approx -a \log x + c'$, where $c' = c - \log N$
which indicates the Power-Law $\Pr[\text{node has in-degree } x] \sim x^{-a}$.
Note that the number of nodes with small in-degree is over-estimated while the number of nodes with very high in-degree is under-estimated.
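
Reading the exponent off a log-log plot amounts to fitting a straight line to (log x, log count). The following is a minimal sketch on synthetic, noise-free counts generated with exponent 2.1; the data, the degree range and the helper `fit_loglog_slope` are assumptions for illustration, not the crawl data behind the figure. On real, noisy data a plain least-squares fit like this is only a rough first estimate.

```python
import math

def fit_loglog_slope(xs, counts):
    """Least-squares fit of log(count) = slope * log(x) + intercept."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(cnt) for cnt in counts]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

true_a = 2.1                      # exponent used to generate the synthetic data
n_pages = 200_000_000             # stands in for the crawl size N
degrees = list(range(1, 1001))
norm = sum(d ** -true_a for d in degrees)

# Expected number of pages with each in-degree under an exact Power-Law.
counts = [n_pages * d ** -true_a / norm for d in degrees]

slope, c = fit_loglog_slope(degrees, counts)
a_hat = -slope                    # estimated exponent a
c_prime = c - math.log(n_pages)   # intercept for probabilities: c' = c - log N
```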

9 More Power-Laws on the Web
We've seen that the in-degree of pages exhibits a Power-Law. Furthermore, so do:
- Out-degree (somewhat surprising)
- Degrees of the inter-host graph
- Number of pages in Web sites
- Number of visits to Web sites/pages
- PageRank scores
  - With an exponent very close to that of the in-degree distribution
  - Curiously, degrees in the telephone call graph have the same 2.1 exponent
- Frequencies of words (as observed by Zipf)
- Popularities of queries submitted to search engines (will be discussed later in the course)

10 The Web as a Graph
Connectivity: is it possible to reach most pages from most pages?
The Web is a bow-tie!
The Web graph is also scale-free, fractal: many slices and subgraphs exhibit similar properties.
[Figure: the bow-tie structure of the Web. Image taken from "Graph Structure in the Web", Broder et al., WWW 2000.]
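
The bow-tie picture splits the graph into a strongly connected core (SCC), an IN part that can reach the core, an OUT part reachable from the core, and everything else. The sketch below is one simple way to compute such a split on a small graph, assuming the core is the largest strongly connected component and finding it by intersecting forward and backward reachability; the toy edge list and the function names are hypothetical, and this approach is too slow for a real crawl.

```python
from collections import defaultdict, deque

def reachable(start, adjacency):
    """All nodes reachable from start (including start) by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges):
    """Split the nodes of a directed graph into SCC / IN / OUT / OTHER."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
        nodes.update((src, dst))
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "OTHER": set()}

    best = None
    assigned = set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        forward = reachable(node, succ)    # nodes this node can reach
        backward = reachable(node, pred)   # nodes that can reach this node
        scc = forward & backward           # its strongly connected component
        assigned |= scc
        if best is None or len(scc) > len(best[0]):
            best = (scc, forward, backward)

    core, forward, backward = best
    in_part = backward - core
    out_part = forward - core
    other = nodes - core - in_part - out_part
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}

# Hypothetical toy graph: a 3-page core, two upstream pages, two downstream pages,
# one page hanging off IN, and a disconnected pair.
toy_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),      # core cycle
    ("i2", "i1"), ("i1", "a"),               # IN
    ("c", "o1"), ("o1", "o2"),               # OUT
    ("i1", "t1"),                            # reachable from IN only
    ("d1", "d2"),                            # disconnected component
]
parts = bow_tie(toy_edges)
```

Comparing the relative sizes of these parts across different slices of the graph is one way to look at the self-similarity mentioned on the slide.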

11 Self-Similarity on the Web
Dill et al., ACM TOIT 2002
Created large Thematically Unified Clusters (TUCs):
- Pages containing a certain keyword
- Pages of large Web sites/Intranets
- Pages containing a geographical reference in the Western US
- The host graph
In general, the TUCs display very similar graph properties, e.g.:
- In/out degree distributions
- Bow-tie structure (relative sizes of the components)
Also discovered that the strongly connected components (SCCs) of the different TUCs are strongly connected to one another, i.e. it is possible to browse between the TUCs.