# Algorithms for Large Data Sets, Ziv Bar-Yossef, Lecture 7, May 14, 2006


1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 7 May 14, 2006 http://www.ee.technion.ac.il/courses/049011

2 Web Structure I: Power Laws and the Small World Phenomenon

3 Outline
- Power laws
- The preferential attachment model
- Small-world networks
- The Watts-Strogatz model

4 Observed Phenomena
- Few multi-billionaires, but many with modest income [Pareto, 1896]
- Few frequent words, but many infrequent words [Zipf, 1932]
- Few “mega-cities”, but many small towns [Zipf, 1949]
- Few web pages with high degree, but many with low degree [Kumar et al. 99] [Barabási & Albert 99]

All of the above obey power laws.

5 Power Law (Pareto) Distribution
- $\Pr[X \ge x] = \left(\frac{k}{x}\right)^{\alpha}$ for $x \ge k$
- $\alpha > 0$: shape parameter (“slope”)
- $k > 0$: location parameter
- Ex: ($k$ = \$1000, $\alpha$ = 2)
  - 1/100 earn ≥ \$10,000
  - 1/10,000 earn ≥ \$100,000
  - 1/1,000,000 earn ≥ \$1,000,000
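
The income example can be checked numerically. The following is a minimal sketch, assuming the tail formula $\Pr[X \ge x] = (k/x)^{\alpha}$ for $x \ge k$; the function name `pareto_tail` is hypothetical and chosen only for illustration.

```python
def pareto_tail(x: float, k: float, alpha: float) -> float:
    """Return Pr[X >= x] for a Pareto distribution with location k and shape alpha."""
    if x <= k:
        return 1.0  # every sample is at least k
    return (k / x) ** alpha


# Fraction of people earning at least each income level, for k = $1000, alpha = 2.
fractions = {income: pareto_tail(income, k=1000, alpha=2)
             for income in (10_000, 100_000, 1_000_000)}
for income, fraction in fractions.items():
    print(f"income >= {income}: fraction {fraction:g}")
```

Each tenfold increase in income divides the fraction by $10^{\alpha} = 100$, which is the pattern in the slide's example.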

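6 Power Law Properties
- PDF: $f(x) = \frac{\alpha k^{\alpha}}{x^{\alpha+1}}$ for $x \ge k$
- Infinite mean for $\alpha \le 1$
- Infinite variance for $\alpha \le 2$
- When X is discrete, $\Pr[X = x] \propto x^{-(\alpha+1)}$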
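
The thresholds for the mean and the variance follow from a direct integration. A short worked derivation, assuming the standard Pareto density $f(x) = \alpha k^{\alpha} x^{-(\alpha+1)}$ on $x \ge k$ (the symbol $f$ is notation introduced here):

```latex
\begin{align*}
\mathbb{E}[X]
  &= \int_k^{\infty} x \cdot \frac{\alpha k^{\alpha}}{x^{\alpha+1}}\,dx
   = \alpha k^{\alpha} \int_k^{\infty} x^{-\alpha}\,dx
   = \begin{cases}
       \dfrac{\alpha k}{\alpha - 1} & \text{if } \alpha > 1,\\[1ex]
       \infty & \text{if } \alpha \le 1,
     \end{cases} \\[1ex]
\mathbb{E}[X^2]
  &= \alpha k^{\alpha} \int_k^{\infty} x^{1-\alpha}\,dx
   = \begin{cases}
       \dfrac{\alpha k^2}{\alpha - 2} & \text{if } \alpha > 2,\\[1ex]
       \infty & \text{if } \alpha \le 2.
     \end{cases}
\end{align*}
```

Since the variance is $\mathbb{E}[X^2] - \mathbb{E}[X]^2$, it is infinite whenever the second moment diverges, that is, for $\alpha \le 2$.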

7 Power Law Graphs [Figure: a power law on a linear-scale plot and on a log-log plot; on the log-log plot it is a straight line with slope $-\alpha$]
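
The straight line on the log-log plot can be reproduced without plotting. This is a rough sketch, assuming the quantity plotted is the tail $\Pr[X \ge x] = (k/x)^{\alpha}$; the helper `loglog_slope` and the sample points are illustrative choices, not part of the lecture.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


k, alpha = 1000.0, 2.0
xs = [k * 2 ** i for i in range(10)]      # points spaced evenly on a log scale
tail = [(k / x) ** alpha for x in xs]     # exact Pareto tail at those points
slope = loglog_slope(xs, tail)
print(slope)
```

Because $\log \Pr[X \ge x] = \alpha \log k - \alpha \log x$, the fitted slope recovers $-\alpha$ up to floating-point error. On real data the same fit would be applied to an empirical tail and would be noisier.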

8 Scale-Free Distributions
- Power laws are invariant to scale
  - $\Pr[X \ge ck] = c^{-\alpha}$ for every $c \ge 1$, whatever the value of $k$
- Ex: ($k$ = arbitrary, $\alpha$ = 2)
  - 1/100 earn ≥ 10k
  - 1/10,000 earn ≥ 100k
  - 1/1,000,000 earn ≥ 1000k

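9 Heavy Tailed Distributions
- In many “classical” distributions, $\Pr[X \ge x]$ decays at least exponentially fast in $x$ (“light tail”)
  - Ex: normal, exponential
- In power law distributions, $\Pr[X \ge x]$ decays only polynomially in $x$ (“heavy tail”)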

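10 Zipf’s Law
- Size of the $r$-th largest city is proportional to $r^{-\beta}$
- Equivalent to a power law:
  - $X$ = size of a city
  - If the $r$-th largest city has size $x$, then exactly $r$ cities have size $\ge x$, so $\Pr[X \ge x] \propto r$
  - Change variables: $x \propto r^{-\beta}$ gives $r \propto x^{-1/\beta}$, hence $\Pr[X \ge x] \propto x^{-1/\beta}$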

11 Power Laws and the Internet
- Web graph
  - In- and out-degrees (in slope: ~2.1, out slope: ~2.7) [Kumar et al. 99, Barabási & Albert 99, Broder et al. 00]
  - Sizes of connected components [Broder et al. 00]
  - Website sizes [Huberman & Adamic 99]
- Internet graph
  - Degrees [Faloutsos, Faloutsos & Faloutsos 99]
  - Eigenvalues [Mihail & Papadimitriou 02]
- Traffic
  - Number of visits to websites

12 Power Laws and Graphs
- If X is a random web page, then the in-degree and the out-degree of X are power-law distributed
- What random graph model explains this phenomenon?

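13 Erdős-Rényi Random Graphs
- $G_{n,p}$
  - $n$: size of the graph (fixed)
  - $p$: edge existence probability (fixed)
  - Every pair $u,v$ is connected by an edge with probability $p$, independently of all other pairs
- Theorem [Erdős & Rényi, 60]: for any node $x$ in $G_{n,p}$, the degree of $x$ is binomially distributed: $\Pr[\deg(x) = k] = \binom{n-1}{k} p^k (1-p)^{n-1-k}$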
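
To see why this model does not explain power-law degrees, it helps to look at how quickly the degree distribution falls off. A small sketch, assuming each node's degree in $G_{n,p}$ is Binomial($n-1$, $p$); the values of `n` and `p` and the cutoff of 50 are arbitrary illustrative choices.

```python
from math import comb


def degree_pmf(n: int, p: float, k: int) -> float:
    """Pr[deg(x) = k] for a fixed node x of G(n, p), i.e. Binomial(n - 1, p) at k."""
    return comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)


n, p = 200, 0.05
pmf = [degree_pmf(n, p, k) for k in range(n)]       # possible degrees are 0 .. n-1
mean_degree = sum(k * q for k, q in enumerate(pmf))
tail_50 = sum(pmf[50:])                             # Pr[deg(x) >= 50]
print(mean_degree, tail_50)
```

The mean degree is $(n-1)p$, and the probability of a degree five times larger is already negligible. A power law with a moderate exponent would assign that event a far larger probability, which is the mismatch the lecture goes on to address.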

14 Preferential Attachment [Barabási & Albert 99]
- A novel random graph model
  - Initialization: the graph starts with a single node with two self loops.
  - Growth: at every step a new node v is added to the graph. v has a self loop and connects to one neighbor.
  - Preferential attachment: v connects to u with probability proportional to the current degree of u.
- The rich get richer / The winner takes it all
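
The growth process can be simulated in a few lines. This is a minimal sketch under two assumptions that the slide text does not settle: "degree" is taken to mean indegree (a self loop contributes 1), and the new node picks its neighbor among the nodes that existed before it arrived. The names `preferential_attachment` and `indegree_histogram` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(steps: int, seed: int = 0) -> list:
    """Simulate the model and return the list of indegrees, indexed by node."""
    rng = random.Random(seed)
    indeg = [2]         # initial node with two self loops
    # One entry per unit of indegree, so a uniform draw from this list
    # picks node u with probability proportional to indeg[u].
    targets = [0, 0]
    for _ in range(steps):
        v = len(indeg)                              # id of the new node
        u = targets[rng.randrange(len(targets))]    # chosen among existing nodes
        indeg.append(1)                             # v's self loop
        indeg[u] += 1                               # edge v -> u
        targets.append(v)
        targets.append(u)
    return indeg


def indegree_histogram(indeg: list) -> Counter:
    """Map each indegree value k to the number of nodes with that indegree."""
    return Counter(indeg)


indeg = preferential_attachment(20_000)
hist = indegree_histogram(indeg)
print(sorted(hist.items())[:5], max(indeg))
```

Keeping a flat `targets` list avoids recomputing a weighted distribution at every step. The histogram typically shows most nodes at indegree 1 and a handful of nodes with very large indegree.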

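15 Why Does it Work?
- Track the # of nodes whose indegree = k after t steps
- Expected growth of this count in one step, in two cases:
  - k = 1: every new node arrives with indegree 1 (its self loop), and a node leaves this group when it receives an edge
  - k > 1: a node joins this group when a node of indegree k − 1 receives an edge, and leaves it when a node of indegree k receives an edge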

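16 Why Does it Work? (2)
- Fact: after sufficiently many steps, the fraction of nodes with indegree k reaches a “steady state”.
- $c_k$ = value of this fraction at the steady state.
- At the steady state, the expected growth equations turn into a recurrence that expresses $c_k$ in terms of $c_{k-1}$.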

17 Why Does it Work? (3)
- Solving the recurrence shows that $c_k$ decays polynomially in k, so the indegrees follow a power law.
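
The formulas on these three slides did not survive in the transcript. The following is one standard way to carry out the argument, offered as a hedged reconstruction rather than the lecture's exact derivation. Assumptions: $N_{t,k}$ denotes the number of nodes with indegree $k$ after $t$ steps (notation introduced here), the new node attaches to $u$ with probability proportional to the indegree of $u$, and the total indegree after $t$ steps is approximated by $2t$.

```latex
\begin{align*}
\mathbb{E}[N_{t+1,1}] - N_{t,1} &= 1 - \frac{N_{t,1}}{2t}, \\
\mathbb{E}[N_{t+1,k}] - N_{t,k} &= \frac{(k-1)\,N_{t,k-1}}{2t} - \frac{k\,N_{t,k}}{2t}
  \qquad (k > 1). \\
\intertext{At the steady state $N_{t,k} \approx c_k t$, so the growth per step is $c_k$:}
c_1 &= 1 - \frac{c_1}{2} \;\Longrightarrow\; c_1 = \frac{2}{3}, \\
c_k &= \frac{(k-1)\,c_{k-1}}{2} - \frac{k\,c_k}{2}
  \;\Longrightarrow\; c_k = \frac{k-1}{k+2}\,c_{k-1}. \\
\intertext{Unrolling the recurrence:}
c_k &= \frac{2}{3}\prod_{j=2}^{k}\frac{j-1}{j+2}
     = \frac{2}{3}\cdot\frac{3!\,(k-1)!}{(k+2)!}
     = \frac{4}{k(k+1)(k+2)} \;\sim\; 4\,k^{-3}.
\end{align*}
```

Under these assumptions the fraction of nodes with indegree $k$ falls off as $k^{-3}$, and the values $c_k$ sum to 1 over $k \ge 1$, as fractions should.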

18 Six Degrees of Separation [Stanley Milgram, 67]
- “Random starters” in Nebraska, Kansas, etc.
- Destinations: in Boston
- Intermediaries send postcards to Milgram
- Findings: average of 6 postcards
- “Conclusion”: every two people in the US are connected by a path of length ~6

19 Small-World Networks
- Average diameter: length of the shortest path from u to v, averaged over all pairs u,v
- Clustering coefficient: fraction of pairs of neighbors of v that are neighbors of each other, averaged over all v
- Small-world network: a sparse graph with average diameter O(log n) and a constant clustering coefficient
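
Both quantities are straightforward to compute on a small graph. A minimal sketch, assuming an undirected graph stored as a dict that maps each node to the set of its neighbors; the function names are hypothetical, nodes with fewer than two neighbors are counted as contributing 0 to the clustering coefficient, and unreachable pairs are left out of the distance average (both are conventions chosen here).

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj: dict) -> float:
    """Average over all nodes of the fraction of neighbor pairs that are adjacent."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue  # no neighbor pairs; contributes 0 by the convention above
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += linked / (d * (d - 1) / 2)
    return total / len(adj)


def average_distance(adj: dict) -> float:
    """Mean shortest-path length over ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1            # exclude the source itself
    return total / pairs


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(clustering_coefficient(triangle), average_distance(triangle))
print(clustering_coefficient(path), average_distance(path))
```

Running one breadth-first search per node costs O(n·m) time overall, which is fine for small examples; for graphs at the scale of the Web, sampling source nodes is the usual workaround.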

20 The Web as a Small World Network
- Low diameter
  - Study of a synthetic web graph model [Albert, Jeong, Barabási 99]
    - Average diameter of the Web is ~19
    - Grows logarithmically with the size of the Web
  - Study of a large crawl [Broder et al. 00]
    - Average diameter of the SCC is ~16
    - Maximum diameter of the SCC is ≥ 28
  - Diameter of host graph [Adamic 99]
    - Average diameter of SCC: ~4
- High clustering coefficient
  - Clustering coefficient of host graph [Adamic 99]
    - Clustering coefficient: ~0.08 (compared to 0.001 in a comparable random graph)

21 Model for Small-World Networks [Watts & Strogatz 98]
- One extreme: random networks
  - Low diameter
  - Low clustering coefficient
- Other extreme: “regular” networks (e.g., a lattice)
  - High clustering coefficient
  - High diameter
- Small-world: interpolation between the two
  - Low diameter
  - High clustering coefficient
  - Regularity: social networking
  - Randomness: individual interests

22 Random Network
- The model:
  - n vertices
  - Every pair u,v is connected by an edge with probability p = d/n
- Properties:
  - Expected number of edges: ~dn/2
  - Graph is connected w.h.p. (when d ≥ c ln n for a constant c > 1)
  - Diameter: O(log n) w.h.p.
  - Clustering coefficient: ~p = d/n = o(1)

23 Ring Lattice
- The model:
  - n vertices on a circle
  - Every vertex has d neighbors: the d/2 vertices to its right and the d/2 vertices to its left
- Properties:
  - Number of edges: dn/2
  - Graph is connected
  - Diameter: O(n/d)
  - Clustering coefficient: $\frac{3(d-2)}{4(d-1)}$, which tends to 3/4 as d grows

24 Random Rewiring
- Start from a ring lattice
- for i = 1 to d/2 do
  - for v = 1 to n do
    - Pick the i-th clockwise nearest neighbor of v
    - With probability p, replace this neighbor by a random vertex
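
The rewiring loop can be sketched in Python as follows. The slide does not say how the random vertex is drawn, so this sketch assumes it is chosen uniformly among vertices that are neither v itself nor already adjacent to v, which avoids self loops and duplicate edges. It also assumes d is even and n > d. The names `ring_lattice` and `rewire` and the `seed` parameter are hypothetical.

```python
import random


def ring_lattice(n: int, d: int) -> dict:
    """Ring of n vertices, each joined to its d/2 nearest neighbors on either side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            w = (v + i) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def rewire(n: int, d: int, p: float, seed: int = 0) -> dict:
    """Ring lattice in which each clockwise edge is rewired with probability p."""
    assert d % 2 == 0 and n > d
    rng = random.Random(seed)
    adj = ring_lattice(n, d)
    for i in range(1, d // 2 + 1):
        for v in range(n):
            w = (v + i) % n                 # i-th clockwise nearest neighbor of v
            if rng.random() >= p:
                continue                    # keep the lattice edge
            # Assumed rule: new endpoint is uniform over non-neighbors of v.
            candidates = [u for u in range(n) if u != v and u not in adj[v]]
            if not candidates:
                continue
            u = rng.choice(candidates)
            adj[v].discard(w)
            adj[w].discard(v)
            adj[v].add(u)
            adj[u].add(v)
    return adj


small_world = rewire(n=200, d=6, p=0.1, seed=1)
print(sum(len(nbrs) for nbrs in small_world.values()) // 2)  # number of edges
```

Every rewiring removes one edge and adds one, so the graph keeps dn/2 edges for any p. Setting p = 0 returns the ring lattice unchanged.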

25 Analysis
- If p = 0, ring lattice
  - High clustering coefficient
  - High diameter
- If p = 1, random network
  - Logarithmic diameter
  - Low clustering coefficient
- However,
  - Diameter goes down rapidly as p grows
  - Clustering coefficient goes down slowly as p grows
- Therefore, for small p, we get a small-world network.
  - Logarithmic diameter
  - High clustering coefficient

26 End of Lecture 7
