Minas Gjoka, Emily Smith, Carter T. Butts


1 Estimating Clique Composition and Size Distributions from Sampled Network Data
Minas Gjoka, Emily Smith, Carter T. Butts
University of California, Irvine

2 Outline
Problem statement
Estimation methodology
Results with real-life graphs

3 Cliques
A complete subgraph that contains i vertices is an order-i clique
A maximal clique is a clique that is not included in a larger clique
[Figure: example cliques of order 1, 2, 3, 4, 5, ..., i]

4 Cliques
A complete subgraph that contains i vertices is an order-i clique
A maximal clique is a clique that is not included in a larger clique
[Figure: an order-4 clique on vertices a, b, c, d contains 4 non-maximal order-3 cliques]

5 Counting of Cliques
Ci is the count of order-i cliques (maximal or non-maximal)
[Figure: graph G on 8 nodes with its order-1 to order-4 clique counts C1, C2, C3, C4]
Clique distribution of G: C = (C1, C2, C3, C4) = (0, 1, 2, 1)
Goal 1: Estimate Ci (for all i) in graph G from sampled network data
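The clique distribution can be illustrated with a small sketch. The graph below is hypothetical (the slide's graph G did not survive extraction); it was chosen so that its maximal cliques give the same vector as the slide, C = (0, 1, 2, 1). The enumeration uses the standard Bron-Kerbosch algorithm with pivoting, which is one possible exact counting algorithm and not necessarily the one the authors used.

```python
from collections import Counter


def build_adjacency(nodes, edges):
    """Return an undirected adjacency map {node: set of neighbors}."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique as a frozenset (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical 8-node graph: one 4-clique, two triangles, one maximal edge.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (4, 6), (5, 6), (5, 7), (6, 7), (7, 8)]
adj = build_adjacency(range(1, 9), EDGES)

counts = Counter(len(c) for c in maximal_cliques(adj))
clique_distribution = tuple(counts.get(i, 0) for i in range(1, 5))
print(clique_distribution)  # (0, 1, 2, 1)
```

Counting non-maximal cliques as well would give a different vector for the same graph, which is why the clique type has to be fixed before estimating.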

6 Counting of Cliques: Vertex Attributes
Vertex attribute vector Xj, j=1..p, p<=N
[Figure: graph G on 8 nodes with p = 3 and three example vectors u; the vector entries did not survive extraction]
Clique composition distribution of G: Cu is the count of order-u cliques
Goal 2: Estimate Cu (for all u) in graph G from sampled network data
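A minimal sketch of a clique composition count follows. It rests on an assumption that the transcript does not state explicitly: that the composition vector u records, for each of the p attribute values, how many members of the clique carry that value. The graph, the attribute assignment, and the function names are all hypothetical, and cliques are enumerated by brute force, which is only practical for tiny graphs.

```python
from collections import Counter
from itertools import combinations


def all_cliques(adj, max_order):
    """Yield every clique (maximal or not) of order 1..max_order by brute force."""
    nodes = sorted(adj)
    for order in range(1, max_order + 1):
        for combo in combinations(nodes, order):
            if all(b in adj[a] for a, b in combinations(combo, 2)):
                yield combo


def composition(clique, attr, p):
    """Assumed definition: u[k] = number of clique members with attribute value k."""
    u = [0] * p
    for v in clique:
        u[attr[v]] += 1
    return tuple(u)


# Hypothetical graph: triangle 1-2-3 plus a pendant edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
# Hypothetical attribute values in {0, 1, 2}, so p = 3.
attr = {1: 0, 2: 1, 3: 1, 4: 2}

C_u = Counter(composition(c, attr, 3) for c in all_cliques(adj, 3))
print(C_u[(1, 1, 0)])  # 2: edges {1,2} and {1,3}
print(C_u[(1, 2, 0)])  # 1: triangle {1,2,3}
```

Summing Cu over all vectors u whose entries add up to i recovers Ci under this definition.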

7 What type of cliques can we count?
Maximal cliques
Non-maximal cliques

8 Motivation
Counting of cliques
  cliques describe local structure (clustering, cohesive subgroups)
  algorithmic implications of cliques in engineering contexts
  cliques are used as input to network models
Sampled network data
  unknown graphs with access limitations
  massive known graphs

9 Related Work
Model-based methods
  Do not scale
  Do not help with counting
Design-based methods
  Subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
  No support for subgraphs of size larger than 10
  No support for vertex attributes
  Biased estimation

10 Estimation

11 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
  uniform independence sampling
  weighted independence sampling
  link-trace sampling
  with replacement or without replacement

12 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
[Figure: graph G(V,E) on 8 nodes with n = 2 sampled nodes; C3 marked]

13 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
Fetch the egonet of each sampled node: G[Vj], j=1..n
[Figure: graph G(V,E) on 8 nodes and the egonets of the n = 2 sampled nodes; C3 marked]

14 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
Fetch the egonet of each sampled node: G[Vj], j=1..n
Calculate the clique count Ci (or Cu) in each egonet Hj
[Figure: graph G(V,E) on 8 nodes and the egonets of the n = 2 sampled nodes; C3 marked]
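The first two steps can be sketched as follows, assuming the whole graph is available as an adjacency map and that an egonet is the subgraph induced by a sampled node and its neighbors. The graph and the function names are hypothetical, and only uniform independence sampling is shown.

```python
import random


def egonet(adj, ego):
    """Subgraph induced by the ego and its neighbors, as an adjacency map."""
    members = adj[ego] | {ego}
    return {v: adj[v] & members for v in members}


def sample_egonets(adj, n, rng, replace=False):
    """Draw n egos uniformly at random and fetch the egonet of each."""
    nodes = sorted(adj)
    egos = rng.choices(nodes, k=n) if replace else rng.sample(nodes, n)
    return [(ego, egonet(adj, ego)) for ego in egos]


# Hypothetical 8-node graph (not the one drawn on the slide).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (4, 6), (5, 6), (5, 7), (6, 7), (7, 8)]
adj = {v: set() for v in range(1, 9)}
for a, b in EDGES:
    adj[a].add(b)
    adj[b].add(a)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for ego, h in sample_egonets(adj, n=2, rng=rng):
    print(ego, sorted(h))
```

In a crawling setting, where the graph is unknown, `egonet` would be replaced by whatever query returns a node's neighbors and the links among them.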

15 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
Fetch the egonet of each sampled node: G[Vj], j=1..n
Calculate the clique count Ci (or Cu) in each egonet Hj
  can use existing exact clique counting algorithms
  clique type is determined by the counting algorithm
[Figure: graph G(V,E) on 8 nodes and the egonets of the n = 2 sampled nodes; C3 marked]

16 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
Fetch the egonet of each sampled node: G[Vj], j=1..n
Calculate the clique count Ci (or Cu) in each egonet Hj
Apply an estimation method that combines the calculations
  Clique Degree Sums (CDS)
  Distinct Clique Counting (CC)
[Figure: graph G(V,E) on 8 nodes and the egonets of the n = 2 sampled nodes; C3 marked]

17 Methodology
Collect an egocentric network sample H1,..,Hn
Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
Fetch the egonet of each sampled node: G[Vj], j=1..n
Calculate the clique count Ci (or Cu) in each egonet Hj
Apply an estimation method that combines the calculations
  Clique Degree Sums (CDS): labeling of neighbors not required, more space efficient
  Distinct Clique Counting (CC): higher accuracy
[Figure: graph G(V,E) on 8 nodes and the egonets of the n = 2 sampled nodes; C3 marked; label "Maximal cliques"]

18 Labeling of neighbors
[Figure: graph G on 9 nodes; C3 marked]

19 Labeling of neighbors
Vj, X[Vj], G[Vj]
[Figure: graph G on 9 nodes and the egonets of the n = 2 sampled nodes; C3 marked]

20 Labeling of neighbors
Distinct Clique Counting (CC): labeled neighbors
[Figure: graph G with n = 2 sampled egonets whose neighbors are labeled; count C3 is calculated from the labeled egonets]

21 Labeling of neighbors
Distinct Clique Counting (CC): labeled neighbors
Clique Degree Sums (CDS): unlabeled neighbors
[Figure: graph G with n = 2 sampled egonets, shown once with labeled neighbors and once with unlabeled neighbors; count C3 is calculated in each case]

22 Clique Degree Sums (unlabeled neighbors)
The order-i clique degree dij is the number of i-cliques that node j belongs to

23 Clique Degree Sums (unlabeled neighbors)
The order-i clique degree dij is the number of i-cliques that node j belongs to
[Figure: graph G(V,E) on 8 nodes with the egonet H8 of node 8; C3 marked; d38 = 2]

24 Clique Degree Sums (unlabeled neighbors)
Di is the order-i clique degree sum: the sum, over all nodes j, of dij, the number of i-cliques that node j belongs to

25 Clique Degree Sums (unlabeled neighbors)
Di is the order-i clique degree sum: the sum, over all nodes j, of dij, the number of i-cliques that node j belongs to
[Figure: graph G(V,E) on 8 nodes; C3 marked; d38 highlighted]
D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38 = 9
D3 = 3C3
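The identity D3 = 3C3 holds because every order-3 clique is counted once at each of its three members. The sketch below checks this on a hypothetical 8-node graph, built so that it reproduces the numbers quoted on the slide (d38 = 2, D3 = 9); it is not the slide's actual graph. Each dij is computed from node j's neighborhood only, so neighbor labels are not needed beyond the egonet itself.

```python
from itertools import combinations


def clique_degree(adj, j, i):
    """d_ij: number of order-i cliques (maximal or not) that contain node j."""
    return sum(
        1
        for others in combinations(sorted(adj[j]), i - 1)
        if all(b in adj[a] for a, b in combinations(others, 2))
    )


# Hypothetical graph with triangles {3,4,6}, {5,7,8}, {2,5,8} and edge 1-2.
EDGES = [(3, 4), (3, 6), (4, 6), (5, 7), (5, 8), (7, 8),
         (2, 5), (2, 8), (1, 2)]
adj = {v: set() for v in range(1, 9)}
for a, b in EDGES:
    adj[a].add(b)
    adj[b].add(a)

d3 = {j: clique_degree(adj, j, 3) for j in adj}
D3 = sum(d3.values())

# Exact C3 by brute force over all node triples, for comparison.
C3 = sum(
    1
    for triple in combinations(sorted(adj), 3)
    if all(b in adj[a] for a, b in combinations(triple, 2))
)
print(d3[8], D3, C3)  # 2 9 3
```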

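26 Clique Degree Sums (unlabeled neighbors)
[Equations: the order-i clique degree sum over all nodes, where dij is the number of i-cliques that node j belongs to, and its estimate over sampled nodes using the inclusion probability of each node j]
The estimate is a design-unbiased Horvitz-Thompson estimator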

27 Clique Degree Sums (unlabeled neighbors)
[Equations: the clique degree sum over all nodes, using the number of i-cliques (or the number of u-cliques) that node j belongs to, and its estimate over sampled nodes using the inclusion probability of each node j]
The estimate is a design-unbiased Horvitz-Thompson estimator
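The slide's equations did not survive extraction, so the sketch below uses the textbook Horvitz-Thompson form as an assumption: Ci is estimated as (1/i) times the sum, over sampled nodes j, of dij divided by the inclusion probability of j. The clique degrees are hypothetical values for an 8-node graph with three triangles, and the design shown is uniform independence sampling without replacement, where every node has inclusion probability n/N. Averaging the estimate over every possible sample illustrates design-unbiasedness.

```python
from itertools import combinations
from math import isclose


def cds_estimate(sampled_degrees, inclusion_probs, i):
    """Horvitz-Thompson estimate of C_i from the clique degrees of sampled nodes."""
    d_hat = sum(d / pi for d, pi in zip(sampled_degrees, inclusion_probs))
    return d_hat / i  # each i-clique is counted at each of its i members


# Hypothetical order-3 clique degrees d_3j; they sum to 9, so C3 = 3.
D3_BY_NODE = {1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}
N, n = len(D3_BY_NODE), 3
pi = n / N  # inclusion probability under uniform sampling without replacement

# Enumerate every possible sample of n distinct nodes.
estimates = [
    cds_estimate([D3_BY_NODE[j] for j in sample], [pi] * n, 3)
    for sample in combinations(D3_BY_NODE, n)
]
mean_estimate = sum(estimates) / len(estimates)
print(isclose(mean_estimate, 3.0))  # True: the average over all samples equals C3
```

The same function applies to other designs once the per-node inclusion probabilities are known; only the `inclusion_probs` argument changes.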

28 Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of the estimators of Ci and Cu
[Equations: variance estimators in terms of the node inclusion probability and the joint node inclusion probability]

29 Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of the estimators of Ci and Cu
Sampling designs
  Uniform independence sampling
  Weighted independence sampling
  Link-trace sampling
  Without replacement or with replacement

30 Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of the estimators of Ci and Cu
Case shown: uniform independence sampling, without replacement
[Equations: sums over sampled nodes and over all nodes, with the node inclusion probability and the joint node inclusion probability for this design]
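A sketch of the variance estimate for this design follows. Since the slide's formulas were lost, it uses the standard Horvitz-Thompson variance estimator as an assumption: a double sum over sampled nodes j, k of (pi_jk - pi_j * pi_k) / pi_jk times (y_j / pi_j) * (y_k / pi_k), with pi_jj = pi_j. For uniform sampling of n out of N nodes without replacement, pi_j = n/N and pi_jk = n(n-1) / (N(N-1)). The sampled clique degrees are hypothetical, and the function name is illustrative.

```python
from math import isclose
from statistics import variance


def ht_variance_estimate(y, pi, pi_joint):
    """Horvitz-Thompson estimate of the variance of sum(y_j / pi_j).

    y and pi are aligned lists over the sampled nodes; pi_joint(a, b) returns
    the joint inclusion probability of the nodes at positions a != b.
    """
    total = 0.0
    for a in range(len(y)):
        for b in range(len(y)):
            joint = pi[a] if a == b else pi_joint(a, b)
            total += ((joint - pi[a] * pi[b]) / joint
                      * (y[a] / pi[a]) * (y[b] / pi[b]))
    return total


N, n, i = 8, 4, 3
sampled_d3 = [2, 1, 0, 1]                # hypothetical d_3j for the sampled nodes
pi = [n / N] * n                         # node inclusion probability
joint = n * (n - 1) / (N * (N - 1))      # joint node inclusion probability

var_D3 = ht_variance_estimate(sampled_d3, pi, lambda a, b: joint)
var_C3 = var_D3 / i ** 2                 # the estimate of C3 is the estimate of D3 divided by 3

# Sanity check against the closed form for simple random sampling.
closed_form = N ** 2 * (1 - n / N) * variance(sampled_d3) / n
print(isclose(var_D3, closed_form))  # True
```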

31 Distinct Clique Counting (labeled neighbors)
[Equation: estimator built from the number of distinct i-cliques in H1,..,Hn and the i-clique inclusion probability]
The estimate is a design-unbiased Horvitz-Thompson estimator
Sampling designs
  Uniform independence sampling
  Weighted independence sampling
  Link-trace sampling
  With replacement or without replacement

32 Distinct Clique Counting (labeled neighbors)
[Equation: estimator built from the number of distinct i-cliques in H1,..,Hn and the i-clique inclusion probability]
The estimate is a design-unbiased Horvitz-Thompson estimator
Case shown: uniform independence sampling, with replacement

33 Distinct Clique Counting (labeled neighbors)
[Figure: graph G with N = 8 nodes and order-3 cliques a, b, c; C3 marked; n = 4, UIS with replacement]

34 Distinct Clique Counting (labeled neighbors)
[Figure: graph G with N = 8 nodes and order-3 cliques a, b, c; C3 marked; n = 4, UIS with replacement. The observed order-3 cliques include repeats across egonets; only the distinct order-3 cliques are counted]
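The deduplication step and the resulting estimate can be sketched as below. The inclusion probability is an assumption derived from the setup rather than copied from the slide: an order-i clique is observed whenever at least one of its i members is drawn as an ego, so under uniform sampling with replacement it is 1 - (1 - i/N)^n. The graph is hypothetical (three triangles on 8 nodes) and the function names are illustrative.

```python
from itertools import combinations, product
from math import isclose


def cliques_containing(adj, ego, i):
    """Order-i cliques that contain the ego, as frozensets of node labels."""
    return {
        frozenset(others) | {ego}
        for others in combinations(sorted(adj[ego]), i - 1)
        if all(b in adj[a] for a, b in combinations(others, 2))
    }


def dcc_estimate(adj, egos, i):
    """Distinct clique count divided by the clique inclusion probability."""
    N, n = len(adj), len(egos)
    # Labeled neighbors let the same clique seen in several egonets be merged.
    distinct = set().union(*(cliques_containing(adj, ego, i) for ego in egos))
    pi_clique = 1 - (1 - i / N) ** n  # assumed: UIS with replacement
    return len(distinct) / pi_clique


# Hypothetical graph with triangles {3,4,6}, {5,7,8}, {2,5,8} and edge 1-2.
EDGES = [(3, 4), (3, 6), (4, 6), (5, 7), (5, 8), (7, 8),
         (2, 5), (2, 8), (1, 2)]
adj = {v: set() for v in range(1, 9)}
for a, b in EDGES:
    adj[a].add(b)
    adj[b].add(a)

egos = [8, 5, 8, 1]  # one hypothetical draw with n = 4; node 8 is drawn twice
print(dcc_estimate(adj, egos, 3))

# Averaging over all N**n equally likely draws recovers C3 = 3.
all_draws = list(product(sorted(adj), repeat=4))
mean_estimate = sum(dcc_estimate(adj, d, 3) for d in all_draws) / len(all_draws)
print(isclose(mean_estimate, 3.0))  # True
```

Storing the set of distinct cliques is what makes this method need more space than Clique Degree Sums.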

35 Computational complexity
Space complexity to count Ci or Cu
  O(1) for the Clique Degree Sums method
  O(ci) or O(cu) for the Distinct Clique Counting method
Time complexity
  from O(3^(N/3)) to O(n*3^(D/3)), where N is the graph size, D is the maximum degree, and n is the sample size
  from O(n*3^(D/3)) to O(3^(D/3)) via parallel computations per egonet

36 Benefits of our methodology
Full knowledge of the graph not required
Fast estimation for massive known graphs
Estimation or exact computation easily parallelizable for massive known graphs
Estimation with or without neighbor labels
Supports vertex attributes
Supports a variety of sampling designs

37 Results

38 Simulation Results

39 Simulation Results: Facebook New Orleans
[Figure: panels for Distinct Clique Counting and Clique Degree Sums]
Egonet sample size n=1,000
Uniform independence sampling, without replacement
1000 simulations

40 Simulation Results
1000 simulations
Error metric: Normalized Mean Absolute Error [formula not recovered]
[Figure: panels for Clique Degree Sums and Distinct Clique Counting]
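The formula for the error metric is missing from the transcript. A common definition, assumed here, is the mean absolute deviation of the estimates from the true count, divided by the true count. The numbers in the example are made up for illustration and are not results from the talk.

```python
def nmae(estimates, true_value):
    """Normalized mean absolute error (assumed definition): mean(|est - true|) / true."""
    if true_value == 0:
        raise ValueError("NMAE is undefined when the true value is zero")
    mean_abs_error = sum(abs(e - true_value) for e in estimates) / len(estimates)
    return mean_abs_error / true_value


# Hypothetical estimates of a clique count whose true value is 10.
simulated_estimates = [9, 11, 10, 12]
print(nmae(simulated_estimates, 10))  # 0.1
```

With 1000 simulated samples per setting, `estimates` would hold the 1000 values of the estimated Ci and `true_value` the exact Ci of the full graph.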

41 Simulation Results
[Figure: panels for Clique Degree Sums and Distinct Clique Counting]

42 Which estimation method to use? Heuristic
Average Edge Count = (all edges between egos and neighbors) / (unique edges between egos and neighbors)
[Figure: graph G with N = 8 nodes and n = 3 sampled egonets]
Example: Average Edge Count = 9 / 6 = 1.5
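A small sketch of the heuristic follows, reading "edges between egos and neighbors" as the ego-to-neighbor links only (an assumption; links among neighbors are not counted here). The graph is hypothetical and was chosen to reproduce the 9 / 6 = 1.5 of the slide's example: three egos that form a triangle, each with one extra neighbor.

```python
def average_edge_count(adj, egos):
    """All ego-neighbor edges across the sample divided by the distinct ones."""
    all_edges = [frozenset((ego, nbr)) for ego in egos for nbr in adj[ego]]
    return len(all_edges) / len(set(all_edges))


# Hypothetical graph: egos a, b, c form a triangle; x, y, z are pendant neighbors.
adj = {
    "a": {"b", "c", "x"},
    "b": {"a", "c", "y"},
    "c": {"a", "b", "z"},
    "x": {"a"},
    "y": {"b"},
    "z": {"c"},
}
print(average_edge_count(adj, ["a", "b", "c"]))  # 1.5
```

A value of 1 means no edge was seen from more than one ego, so the sampled egonets do not overlap along ego-neighbor links.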

43 Which estimation method to use? Heuristic
[Figure: Clique Degree Sums error and Distinct Clique Counting error plotted against Average Edge Count]

44 Estimation Results: Facebook ‘09
Facebook ‘09 crawled dataset [1]
36,628 unique egonets
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, IEEE INFOCOM 2010.

45 Estimation Results: vertex attributes, Facebook ‘09
Dataset complemented with gender attributes
About 6 million users

46 Thank you!
Unbiased estimation methods of clique distributions
  Clique Degree Sums
  Distinct Clique Counting
Facebook cliques
Future work
  support estimation of any subgraphs (beyond cliques)
References
[1] M. Gjoka, E. Smith, C. T. Butts, “Estimating Clique Composition and Size Distributions from Sampled Network Data”, IEEE NetSciCom '14.
[2] Facebook datasets:
[3] Python code for Clique Estimators:

