Challenges and Opportunities Posed by Power Laws in Network Analysis
Bruno Ribeiro, UMass Amherst
MURI Review Meeting, Berkeley, 26th Oct 2011

Power Laws in Networks
→ Network topology:
– power law distribution of node degrees (AS topology, social networks such as Facebook)
→ Network traffic:
– Flow: subset of packets
– Power law distribution of flow sizes
[Figures: router packet stream; Flickr dataset, P[deg > d] versus vertex degree d]

Characterizing Networks from Incomplete Data
This talk:
→ Estimate distributions (of degrees, of flow sizes, …) from incomplete data (sampled edges, sampled packets, …)
→ Uncover central nodes in the network

Outline
→ Challenge: Estimating subset size distributions from incomplete data
– Incomplete data: randomly sampled edges, randomly sampled packets, …
– Impact of power laws on estimation accuracy
– Impact of other distributions on estimation accuracy
→ Opportunity: Uncovering central nodes in power law networks

Part 1: Challenge
ESTIMATING SUBSET SIZE DISTRIBUTIONS FROM INCOMPLETE DATA

Subset size distributions
→ Set: set of fish
→ Subsets: types of fish
→ Subset size: number of fish of a given type
→ Distribution: fraction of subsets (types of fish) with size x, where x is the subset size (number of fish)

Estimating subset size distributions
→ Randomly sample N fish (uniformly) from the set of fish
→ From the sampled fish, compute an unbiased estimate of the distribution: fraction of subsets (types of fish) with size x, where x is the subset size (number of fish)

Questions
→ How many fish must be caught to obtain accurate distribution estimates?
→ What is the impact of the distribution shape on estimation accuracy?

Abstract Problem Overview
→ Set
→ Subsets (non-overlapping)
→ Subset size distribution

Incomplete Data Estimation
[Figure: set of IP packets grouped into flows; random sampling yields the sampled packets; estimation recovers the IP flow size distribution]

Network-related subset size distributions (webgraph)
→ Distribution of the number of incoming links to a webpage
– Q: do we need to crawl most of the web graph?
→ Incoming links are observed as outgoing links from other webpages
– set = set of links
– subset = incoming links to a webpage
– sampling: link sampling
[Figure: in-degree (# of links to a webpage) observed through outgoing links]

Network-related subset size distributions (IP traffic)
→ Distribution of the number of packets in a TCP flow
– Set = IP packets
– Subset = an IP flow
– Sampling: packet sampling
[Figure: router packet stream]

Incomplete Data, Edge Sampling Example
[Figure: original graph; sampled in-degrees; 3x estimator; original in-degree distribution]

Incomplete data model
→ Set elements sampled with probability p
– without replacement
– independently
→ Model
– $b_{ij}$: probability that j out of i subset elements are sampled
– $\theta_i$: fraction of subsets with i elements, e.g. fraction of nodes with degree i, fraction of flows with i packets
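The sampling model above can be simulated in a few lines. The following is a minimal sketch, assuming each subset is represented only by its size and each element is kept independently with probability p; the function names `sample_subset_sizes` and `size_distribution` are illustrative and do not come from the talk.

```python
import random
from collections import Counter

def sample_subset_sizes(sizes, p, rng):
    """For each subset of size i, keep each of its i elements
    independently with probability p and return the sampled sizes."""
    return [sum(1 for _ in range(i) if rng.random() < p) for i in sizes]

def size_distribution(sizes):
    """Fraction of subsets observed with each size."""
    counts = Counter(sizes)
    n = len(sizes)
    return {size: count / n for size, count in sorted(counts.items())}

if __name__ == "__main__":
    rng = random.Random(0)            # fixed seed, for repeatability only
    original = [1, 1, 1, 2, 2, 5]     # hypothetical subset sizes
    sampled = sample_subset_sizes(original, 0.5, rng)
    print("original:", size_distribution(original))
    print("sampled: ", size_distribution(sampled))
```

Subsets whose sampled size is 0 show up in this simulation but would be invisible to a real observer, which is why the size-0 bin needs separate treatment in estimation.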

Model (cont)
→ $b_{ij}$: binomial probability, $b_{ij} = \binom{i}{j} p^j (1-p)^{i-j}$
→ $\theta_i$: fraction of subsets with i elements
→ $W$: maximum subset size
→ $d_j$: fraction of subsets with j sampled elements
– $d_0$ is not observable
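As a worked illustration of this model, the sketch below computes the expected fraction of subsets with j sampled elements from a given subset size distribution, assuming independent sampling with probability p so that the number of sampled elements of a size-i subset is binomial. The names `binom_prob` and `sampled_distribution` are hypothetical helpers, not part of the talk.

```python
from math import comb

def binom_prob(i, j, p):
    """Probability that j out of i elements are sampled,
    each independently with probability p."""
    return comb(i, j) * p**j * (1 - p) ** (i - j)

def sampled_distribution(theta, p):
    """theta[i - 1] is the fraction of subsets with i elements, i = 1..W.
    Returns a list d where d[j] is the expected fraction of subsets
    with j sampled elements, j = 0..W."""
    W = len(theta)
    return [
        sum(binom_prob(i, j, p) * theta[i - 1] for i in range(max(j, 1), W + 1))
        for j in range(W + 1)
    ]

if __name__ == "__main__":
    theta = [0.5, 0.5]                 # half the subsets have 1 element, half have 2
    d = sampled_distribution(theta, 0.5)
    print(d)                           # d[0] is the unobservable mass
```

In this sketch `d[0]` is the fraction of subsets that leave no trace in the sample, matching the remark that the size-0 bin is not observable.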

Mean Squared Error Question
→ $\hat{\theta}_i$: unbiased estimate of $\theta_i$
→ $p$: sampling probability
→ $N$: number of sampled subsets (e.g. N sampled flows)
Does there exist an unbiased estimator with small mean squared error $\mathrm{MSE}(\hat{\theta}_i)$?
Try the Maximum Likelihood Estimator (MLE)?

Maximum Likelihood Estimation
→ Simulation: edge sampling
→ Flickr network (photo-sharing), 1.5M nodes
[Figure axis: in-degree]

Cramer-Rao Lower Bound (CRLB)
→ Let $B = [b_{ij}]$, $d = [d_j]$, $\theta = [\theta_i]$
– Then $d = B\theta$
→ $D = \mathrm{diag}(d)$: diagonal matrix with $D_{jj} = d_j$
→ $\hat{\theta}_i$: unbiased estimate of $\theta_i$
→ $J$: Fisher information matrix of N subsets
– $J = B^T D B$
– lower bound on the mean squared error of $\hat{\theta}_i$: $\mathrm{MSE}(\hat{\theta}_i) \ge (J^{-1})_{ii}/N$
Need to find $J^{-1}$

Recap
→ Interested in the inverse of the Fisher information matrix because $\mathrm{MSE}(\hat{\theta}_i) \ge (J^{-1})_{ii}/N$
→ $N$: # of subsets sampled (# of nodes, # of TCP flows)
→ $\hat{\theta}$: subset size distribution estimate (what we seek)
→ $p$: sampling probability (edges, packets)
→ $W$: maximum subset size

Results

Heavier than exponential subset size distribution tail
→ Theorem 1: Suppose that $\theta_W$ decreases more slowly than exponentially in W. More precisely, assume $-\log(\theta_W) = o(W)$.
Error grows with subset size W.

Exponential subset size distribution tail
→ Theorem 2: Suppose that $\theta_W$ decreases exponentially in W. More precisely, assume $-\log(\theta_W) = W \log a + o(W)$ as $W \to \infty$ for some $0 < a < 1$.

Lighter than exponential subset size distribution tail
→ Theorem 3: Suppose that $\theta_W$ decreases faster than exponentially in W. More precisely, assume $-\log(\theta_W) = \omega(W)$. Then it follows that $0 < p \le 1$

Infinite support & power laws
→ If $\theta$ is a power law with infinite support ($W \to \infty$):
– if p < ½, any unbiased estimator has "infinite" MSE; might as well output random estimates
– if p > ½, estimates can be accurate if enough samples are collected

Estimating Subset Size Average
→ $I$: randomly chosen subset size
→ Average subset size $E[I]$:
– if $E[I] < \infty$ and $E[I^2] = \infty$, then the estimation error is unbounded
Reason: inspection paradox. Sampling is biased towards very large subsets.
– Average size of sampled subsets $\approx E[I^2] / (2E[I])$
– otherwise, the error is bounded

Part 2: Opportunity
IMPACT OF POWER LAWS ON SAMPLING CENTRAL NETWORK NODES

Central Nodes
→ Central nodes are important in networks
– Communication bottlenecks, trend setters, information aggregators
→ Notions of centrality
– betweenness, closeness, PageRank, degree
Challenge: identify the top k central nodes while exploring a small fraction of the network
[Figure: central nodes]

Degree as a proxy for centrality
→ Betweenness centrality: a node is central if it belongs to many shortest paths
→ Closeness centrality: a node is central if it has short paths to all other nodes
→ Rank correlation measures the degree of similarity between two rankings
→ Low rank correlation in planar graphs (e.g. power grid)

Set          Type of Network  # of nodes  # of edges  Description
AS-Snapshot  Computer         22,963      48,436      Snapshot of Internet at level of AS
ca-CondMat   Collaboration    23,133      186,936     ArXiv Condensed Matter
ca-HepPh     Collaboration    12,008      237,010     ArXiv High Energy Physics
email-Enron  Social           36,692      367,662     Email network from Enron

[Figure: rank correlation with degree]
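To make "rank correlation" concrete, here is a minimal sketch using Spearman's rank correlation coefficient. The slide does not say which rank correlation measure was used, so Spearman's is an assumption chosen for illustration; the code also assumes there are no tied scores. The score lists are made-up toy values, not data from the talk.

```python
def ranks(scores):
    """Rank 1 goes to the highest score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    result = [0] * len(scores)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result

def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    which is valid when neither ranking has ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

if __name__ == "__main__":
    degree = [5, 3, 2, 1]                 # toy degree scores for four nodes
    betweenness = [0.9, 0.4, 0.5, 0.0]    # toy betweenness scores, same nodes
    print(spearman(degree, betweenness))
```

A value near 1 means ranking nodes by degree gives nearly the same order as ranking them by the other centrality, which is the sense in which degree can act as a proxy.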

Looking for high degree nodes
→ A random walk in steady state visits a node with probability proportional to the node's degree
→ In power law graphs this bias towards high degree nodes is strong
→ We observe that RWs are more efficient than more elaborate techniques (AXS, RXS)
[Figure axis: % of network sampled]
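The degree bias of a random walk can be checked numerically. The sketch below assumes a simple random walk on an undirected graph stored as an adjacency dictionary, and verifies that the distribution deg(v) / (sum of all degrees) is unchanged by one step of the walk. The graph and the function names are illustrative only.

```python
def degree_proportional(adj):
    """Distribution that gives each node probability deg(v) / sum of degrees."""
    total = sum(len(neighbors) for neighbors in adj.values())
    return {v: len(neighbors) / total for v, neighbors in adj.items()}

def walk_step(adj, dist):
    """One step of a simple random walk: the probability mass at u
    is split evenly among the neighbors of u."""
    nxt = {v: 0.0 for v in adj}
    for u, neighbors in adj.items():
        for v in neighbors:
            nxt[v] += dist[u] / len(neighbors)
    return nxt

if __name__ == "__main__":
    # Small undirected toy graph: a hub connected to a, b, c, plus edge a-b.
    adj = {
        "hub": ["a", "b", "c"],
        "a": ["hub", "b"],
        "b": ["hub", "a"],
        "c": ["hub"],
    }
    pi = degree_proportional(adj)
    after = walk_step(adj, pi)
    for v in adj:
        print(v, pi[v], after[v])   # the two columns agree for every node
```

Because the hub has the largest degree it also has the largest steady-state visit probability, which is the bias the walk exploits when searching for high degree nodes.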

Thank you
