# Sampling from Large Graphs


Sampling from Large Graphs

Motivation

Our purpose is to analyze and model social networks.

- An online social network graph is composed of millions of nodes and edges.
- To analyze it, we have to store the whole graph in the computer's memory.
- Sometimes this is impossible.
- Even when it is possible, computing even basic graph properties is extremely time-consuming.
- Thus we need to extract a small sample of the graph and analyze that instead.

Problem

Given a huge real graph, how can we derive a representative sample?

- Which sampling method should we use?
- How small can the sample size be?
- How do we measure success?

Problem

What do we compare against?

- Scale-down sampling: given a graph G with n nodes, derive a sample graph G' with n' nodes (n' << n) that is as similar to G as possible.
- Back-in-time sampling: let G_n' denote the graph G at the point in time when it had n' nodes. Find a sample S on n' nodes that is as similar as possible to G_n', that is, to G when it had the same size as S.

Evaluation Techniques

Criteria for scale-down sampling:

- In-degree distribution
- Out-degree distribution
- Distribution of sizes of weakly connected components (WCC)
- Distribution of sizes of strongly connected components
- Hop plot: number of reachable pairs of nodes at distance h
- Hop plot on the largest WCC
- Distribution of the clustering coefficient
- Distribution of singular values of the graph adjacency matrix versus the rank
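As an illustration of the hop-plot criterion, the following is a minimal sketch, not the presenters' implementation. It assumes the graph is stored as a dictionary mapping every node to the set of its out-neighbors (a hypothetical representation chosen for this example) and counts ordered reachable pairs at each exact distance h with one breadth-first search per node.

```python
from collections import deque


def hop_counts(graph):
    """Count ordered pairs (u, v), u != v, by shortest-path distance.

    `graph` maps each node to the set of its out-neighbors; every node,
    including nodes without out-links, must appear as a key.
    """
    counts = {}
    for source in graph:
        # Breadth-first search gives shortest hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                counts[d] = counts.get(d, 0) + 1
    return counts


# Directed path 0 -> 1 -> 2 -> 3.
path = {0: {1}, 1: {2}, 2: {3}, 3: set()}
print(hop_counts(path))
```

A running sum over h turns these counts into the cumulative form of the hop plot (pairs within h hops). The all-sources search is quadratic in the worst case, which is one reason the criterion is costly on the full graph.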

Evaluation Techniques

Criteria for back-in-time sampling:

- Densification Power Law: number of edges versus number of nodes over time
- Effective diameter of the graph over time (observed to shrink and then stabilize)
- Normalized size of the largest WCC over time
- Average clustering coefficient over time
- Largest singular value of the graph adjacency matrix over time
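The densification power law is usually stated as e(t) ∝ n(t)^a, where e(t) and n(t) are the edge and node counts at time t. Under that form, the exponent a can be estimated as the least-squares slope of log e against log n. The sketch below shows one way to do that; the function name and the synthetic snapshot data are illustrative assumptions, not measurements from the presentation.

```python
import math


def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log(edges) versus log(nodes)."""
    if len(node_counts) != len(edge_counts) or len(node_counts) < 2:
        raise ValueError("need at least two matching (nodes, edges) snapshots")
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("node counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic snapshots that follow e = n^1.5 exactly (made-up data).
nodes = [10, 100, 1000, 10000]
edges = [n ** 1.5 for n in nodes]
print(densification_exponent(nodes, edges))
```

Comparing the exponent fitted on the growing sample with the one fitted on the true graph's history gives a single number for this criterion.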

Statistical Tests

Comparing graph patterns using the Kolmogorov-Smirnov D-statistic:

- Measure the agreement between two distributions using D = max_x |F'(x) - F(x)|
- F and F' are the two cumulative distribution functions
- Does not address the issue of scaling
- Only compares the shape of the distributions

Comparing graph patterns using the visiting probability:

- For each node u ∈ G, calculate the probability of visiting node w ∈ G
- Use the Frobenius norm to calculate the difference in visiting probability
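The D-statistic can be computed directly from two lists of observations, for example the degree sequences of the original graph and of a sample. The following is a minimal sketch using empirical CDFs; the function name and the example data are illustrative assumptions.

```python
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Return D = max_x |F'(x) - F(x)| for two empirical samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Empirical CDFs are step functions, so the largest gap occurs
    # at one of the observed values.
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        f_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d


# Made-up degree sequences for the full graph and a sample.
original_degrees = [1, 1, 2, 2, 3, 5, 8]
sampled_degrees = [1, 2, 2, 3]
print(ks_d_statistic(original_degrees, sampled_degrees))
```

D lies between 0 and 1, with 0 meaning the two empirical distributions coincide. Because only the CDFs enter the formula, the two samples may have very different sizes, which fits the comparison of a small sample against a large graph.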

Algorithms

Sampling by random node selection:

- Random node sampling: select a set of nodes uniformly at random
- Random PageRank node sampling: set the probability of a node being selected into the sample proportional to its PageRank weight
- Random degree node sampling: set the probability of a node being selected into the sample proportional to its degree
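A minimal sketch of two of these node-selection schemes follows. It assumes the graph is a dictionary mapping each node to the set of its out-neighbors, that the sample graph is the subgraph induced by the chosen nodes, and that "degree" means out-degree; all three are assumptions made for this example. PageRank-proportional sampling would follow the same pattern with PageRank scores as weights.

```python
import random


def induced_subgraph(graph, nodes):
    """Keep only the chosen nodes and the edges between them."""
    keep = set(nodes)
    return {u: {v for v in graph[u] if v in keep} for u in keep}


def random_node_sample(graph, k, rng):
    """Random node sampling: k distinct nodes chosen uniformly at random."""
    return induced_subgraph(graph, rng.sample(sorted(graph), k))


def random_degree_node_sample(graph, k, rng):
    """Random degree node sampling: selection probability proportional to
    out-degree (assumption). Draws are repeated until k distinct nodes
    have been collected."""
    nodes = sorted(graph)
    weights = [len(graph[u]) for u in nodes]
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("not enough nodes with positive degree")
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choices(nodes, weights=weights, k=1)[0])
    return induced_subgraph(graph, chosen)


# Small made-up directed graph.
graph = {0: {1, 2}, 1: {2}, 2: {0}, 3: {0}, 4: set(), 5: {4}}
rng = random.Random(0)  # fixed seed only to make the run repeatable
print(random_node_sample(graph, 3, rng))
print(random_degree_node_sample(graph, 3, rng))
```

Passing the random generator explicitly keeps the samplers repeatable, which helps when comparing methods on the same graph.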

Algorithms

Sampling by random edge selection:

- Random edge (RE) sampling: select edges uniformly at random
- Random node-edge (RNE) sampling: select a node uniformly at random, then select an edge incident to it uniformly at random
- Hybrid sampling: with probability p perform an RNE sampling step, with probability 1-p perform an RE sampling step
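The three edge-based schemes can be sketched with one function, since hybrid sampling reduces to RE when p = 0 and to RNE when p = 1. This is an illustrative sketch under stated assumptions: the graph is a dictionary of out-neighbor sets, "incident" is taken to mean outgoing, and the RNE step picks only among nodes that have at least one out-edge.

```python
import random


def edge_list(graph):
    """All directed edges (u, v) in a deterministic order."""
    return [(u, v) for u in sorted(graph) for v in sorted(graph[u])]


def hybrid_sample(graph, n_edges, p, rng):
    """Collect n_edges distinct edges. Each step is an RNE step with
    probability p and an RE step with probability 1 - p."""
    edges = edge_list(graph)
    if n_edges > len(edges):
        raise ValueError("graph has fewer edges than requested")
    # Assumption: RNE only considers nodes that have an outgoing edge.
    sources = [u for u in sorted(graph) if graph[u]]
    sampled = set()
    while len(sampled) < n_edges:
        if rng.random() < p:
            u = rng.choice(sources)              # RNE: node first ...
            v = rng.choice(sorted(graph[u]))     # ... then one of its edges
            sampled.add((u, v))
        else:
            sampled.add(rng.choice(edges))       # RE: uniform over edges
    return sampled


graph = {0: {1, 2}, 1: {2}, 2: {0}, 3: {0}, 4: set(), 5: {4}}
rng = random.Random(0)  # fixed seed only to make the run repeatable
print(hybrid_sample(graph, 4, 0.5, rng))
```

The RNE step favors edges of low-degree nodes, because a node with a single edge always contributes that edge when chosen, while the RE step favors high-degree nodes simply because they own more edges.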

Algorithms

Sampling by exploration:

- Random node neighbor sampling: select a node uniformly at random together with all of its outgoing neighbors
- Random walk sampling: select a starting node uniformly at random and perform a random walk with restarts. If the walk gets stuck, randomly select another node to start from.
- Random jump sampling: same as random walk sampling, but with probability p we jump to a new node
- Forest fire sampling: choose a node u uniformly at random. Generate a random number z and select z out-links of u that have not yet been visited. Apply this step recursively to all z selected links.
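The random walk and forest fire procedures can be sketched as follows. The slides do not give parameter values, so the restart probability of 0.15, the per-start step budget, the geometric distribution for z, and the burning probability of 0.7 are all assumptions made for this example, as is the dictionary-of-out-neighbor-sets graph representation.

```python
import random
from collections import deque


def random_walk_sample(graph, k, rng, restart_prob=0.15):
    """Random walk with restarts; returns a set of k visited nodes."""
    nodes = sorted(graph)
    if k > len(nodes):
        raise ValueError("k exceeds the number of nodes")
    visited = set()
    while len(visited) < k:
        start = rng.choice(nodes)
        current = start
        visited.add(start)
        # Hypothetical step budget: if the walk is stuck in a small
        # region, give up and pick another starting node.
        for _ in range(100 * len(nodes)):
            if len(visited) >= k:
                break
            if rng.random() < restart_prob or not graph[current]:
                current = start  # restart, also used at dead ends
            else:
                current = rng.choice(sorted(graph[current]))
                visited.add(current)
    return visited


def forest_fire_sample(graph, k, rng, burn_prob=0.7):
    """Forest fire: burn z unvisited out-links per node, recursively."""
    if not 0 <= burn_prob < 1:
        raise ValueError("burn_prob must be in [0, 1)")
    nodes = sorted(graph)
    if k > len(nodes):
        raise ValueError("k exceeds the number of nodes")
    visited = set()
    while len(visited) < k:
        seed = rng.choice(nodes)
        if seed in visited:
            continue
        visited.add(seed)
        frontier = deque([seed])
        while frontier and len(visited) < k:
            u = frontier.popleft()
            z = 0  # assumption: z is geometric, mean burn_prob / (1 - burn_prob)
            while rng.random() < burn_prob:
                z += 1
            unvisited = [v for v in sorted(graph[u]) if v not in visited]
            for v in rng.sample(unvisited, min(z, len(unvisited))):
                if len(visited) >= k:
                    break
                visited.add(v)
                frontier.append(v)
    return visited


graph = {0: {1, 2}, 1: {2, 3}, 2: {0}, 3: {4}, 4: set(), 5: {0}}
rng = random.Random(1)  # fixed seed only to make the run repeatable
print(random_walk_sample(graph, 4, rng))
print(forest_fire_sample(graph, 4, rng))
```

Random jump sampling would differ from `random_walk_sample` only in the restart branch, where the walk moves to a node chosen uniformly from the whole graph instead of returning to its starting node.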

Evaluation

Three groups of algorithms:

- RDN, RJ, RW (random degree node, random jump, random walk): biased towards high-degree nodes and the densely connected part of the graph
- FF, RPN, RN (forest fire, random PageRank node, random node): not biased towards high-degree nodes; they match the temporal densification of the true graph
- RE, RNE, HYB (random edge, random node-edge, hybrid): for small sample sizes the resulting graph is very sparsely connected

Conclusion:

- For the scale-down goal, methods based on random walks perform best
- For the back-in-time goal, the forest fire algorithm performs best
- There is no single perfect answer to graph sampling
- Experiments showed that a 15% sample is usually enough

Further thoughts

- Is trying to match all properties at once the wrong approach? Maybe we should try matching one at a time.
- Test sampling methods on graphs with weighted or labeled edges.
- Current algorithms are extremely slow when the graph is read from a file. Better implementations are needed to reduce the I/O cost.

Bibliography

- Sampling from large graphs, J. Leskovec and C. Faloutsos
- Unbiased sampling of Facebook, M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou
- What is the real size of a sampled network? The case of the Internet, F. Viger, A. Barrat, L. Dall'Asta, C. Zhang and E. D. Kolaczyk
- Sampling large Internet topologies for simulation purposes, V. Krishnamurthy, M. Faloutsos, M. Chrobak, J. Cui, L. Lao and A. G. Percus
