Minas Gjoka, Emily Smith, Carter T. Butts

Presentation transcript:

Estimating Clique Composition and Size Distributions from Sampled Network Data. Minas Gjoka, Emily Smith, Carter T. Butts. University of California, Irvine.

Outline: problem statement; estimation methodology; results with real-life graphs.

Cliques. A complete subgraph that contains i vertices is an order-i clique (order-1, order-2, order-3, order-4, order-5, ..., order-i). A maximal clique is a clique that is not included in a larger clique. Example: an order-4 clique on vertices a, b, c, d contains 4 non-maximal order-3 cliques.
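The two definitions above can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a dict of neighbor sets; the function names and the four-vertex example are illustrative and are not taken from the authors' code.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of vertices in `nodes` is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def is_maximal_clique(adj, nodes):
    """True if `nodes` is a clique that no outside vertex can extend."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    # A vertex adjacent to every member would yield a larger clique.
    return not any(nodes <= adj[w] for w in adj if w not in nodes)

# Hypothetical order-4 clique on vertices a, b, c, d (neighbor sets).
k4 = {v: set("abcd") - {v} for v in "abcd"}

triples = [set(t) for t in combinations("abcd", 3)]
non_maximal = [t for t in triples
               if is_clique(k4, t) and not is_maximal_clique(k4, t)]
print(len(non_maximal))  # 4 non-maximal order-3 cliques inside the order-4 clique
```

The maximality test only needs to look for a single vertex adjacent to all members, because any larger clique containing the set would supply such a vertex.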

Counting of Cliques. Ci is the count of order-i cliques (maximal or non-maximal). Example: the clique distribution of graph G is C = (C1, C2, C3, C4) = (0, 1, 2, 1). Goal 1: estimate Ci (for all i) in graph G from sampled network data.
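For a graph small enough to enumerate, the clique distribution can be computed by brute force. This is only a sketch: the edge list is a hypothetical six-node graph (not the graph on the slide), the function name is illustrative, and the loop counts every clique of each order, maximal or not.

```python
from itertools import combinations

def clique_distribution(adj, max_order):
    """Return [C1, ..., C_max_order], counting every clique of each order."""
    nodes = sorted(adj)
    counts = []
    for i in range(1, max_order + 1):
        counts.append(sum(
            1 for subset in combinations(nodes, i)
            if all(v in adj[u] for u, v in combinations(subset, 2))
        ))
    return counts

# Hypothetical example graph, stored as neighbor sets.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(clique_distribution(adj, 4))  # [6, 7, 2, 0]
```

Enumerating all vertex subsets is exponential in the graph size, which is the cost that sampling-based estimation is meant to avoid.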

Counting of Cliques: vertex attributes. Vertex attribute vector Xj, j=1..p, p<=N (p = 3 in the example). Example clique composition vectors: u = [3 0 0], u = [2 1 0], u = [2 0 1]. Clique Composition Distribution of G: Cu is the count of order-u cliques. Goal 2: estimate Cu (for all u) in graph G from sampled network data.
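One reading of the composition vector, consistent with the examples u = [3 0 0] and u = [2 1 0], is that entry k counts the clique members whose attribute takes the k-th value. The sketch below works under that assumption; the graph, the attribute assignment and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def composition(clique, attr, p):
    """Vector u: entry k-1 counts the clique members whose attribute equals k."""
    return tuple(sum(1 for v in clique if attr[v] == k) for k in range(1, p + 1))

def composition_distribution(adj, attr, p, order):
    """Count cliques of the given order, grouped by composition vector u."""
    counts = Counter()
    for subset in combinations(sorted(adj), order):
        if all(v in adj[u] for u, v in combinations(subset, 2)):
            counts[composition(subset, attr, p)] += 1
    return counts

# Hypothetical graph and attribute values in 1..p with p = 3.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
attr = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 1}

c_u = composition_distribution(adj, attr, p=3, order=3)
print(dict(c_u))  # {(3, 0, 0): 1, (2, 1, 0): 1}
```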

What type of cliques can we count? Maximal cliques and non-maximal cliques.

Motivation. Counting of cliques: cliques describe local structure (clustering, cohesive subgroups); cliques have algorithmic implications in engineering contexts; cliques are used as input in network models. Sampled network data: unknown graphs with access limitations; massive known graphs.

Related Work. Model-based methods: do not scale; do not help with counting. Design-based methods: subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA; no support for subgraphs of size larger than 10; no support for vertex attributes; biased estimation.

Estimation

Methodology
Step 1. Collect an egocentric network sample H1,..,Hn. First collect a probability sample of n nodes from the graph (Vj, X[Vj], j=1..n) using uniform independence sampling, weighted independence sampling, or link-trace sampling, with or without replacement. Then fetch the egonet of each sampled node (G[Vj], j=1..n).
Step 2. Calculate the clique count Ci (or Cu) in each egonet Hj. Existing exact clique counting algorithms can be used; the clique type (e.g. maximal cliques) is determined by the counting algorithm.
Step 3. Apply an estimation method that combines the calculations. Clique Degree Sums (CDS): labeling of neighbors not required, more space efficient. Distinct Clique Counting (CC): higher accuracy.
Figure: graph G(V,E) with n=2 sampled egonets and their order-3 cliques (C3).
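The data collection step can be sketched as follows for the simplest design, uniform independence sampling without replacement. The sketch assumes the whole graph is available as a dict of neighbor sets; in a crawl, `egonet` would instead query the network. All names and the example graph are hypothetical.

```python
import random

def egonet(adj, v):
    """Induced subgraph on v and its neighbors, as a dict of neighbor sets."""
    members = adj[v] | {v}
    return {u: adj[u] & members for u in members}

def sample_egonets(adj, n, seed=None):
    """Uniform independence sample of n egos (without replacement) with egonets."""
    rng = random.Random(seed)
    egos = rng.sample(sorted(adj), n)
    return {v: egonet(adj, v) for v in egos}

# Hypothetical example graph.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

h4 = egonet(adj, 4)            # ego 4 with neighbors 2, 3 and 5
sample = sample_egonets(adj, n=2, seed=0)
print(sorted(h4), sorted(sample))
```

Each egonet keeps the edges among the ego's neighbors, which is what allows cliques containing the ego to be counted locally in the next step.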

Labeling of neighbors. Example: graph G, n=2 sampled nodes (Vj, X[Vj], G[Vj]), order-3 cliques (C3). Distinct Clique Counting (CC) calculates the count C3 from egonets with labeled neighbors. Clique Degree Sums (CDS) calculates the count C3 from egonets with unlabeled neighbors.
Figure: the two sampled egonets of graph G, shown with labeled neighbors and with unlabeled neighbors.

Clique Degree Sums (unlabeled neighbors). The order-i clique degree dij is the number of i-cliques that node j belongs to. Example: in the egonet H8 of graph G(V,E), node 8 belongs to two order-3 cliques, so d38 = 2.

Clique Degree Sums (unlabeled neighbors). Di is the order-i clique degree sum: Di = Σ_{j ∈ V} dij, the sum over all nodes j of the number of i-cliques that node j belongs to. Example for order-3 cliques in graph G(V,E):
D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
D3 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2
D3 = 9
D3 = 3C3, so C3 = 3.
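The identity Di = i·Ci holds because every order-i clique is counted once by each of its i members. A small sketch that checks it on a hypothetical six-node graph (not the eight-node graph on the slide); the function name is illustrative.

```python
from itertools import combinations

def clique_degree(adj, j, order):
    """d_ij: number of cliques of the given order that contain node j."""
    # A clique containing j is j plus mutually adjacent neighbors of j,
    # so the egonet of j holds all the information needed.
    return sum(
        1 for others in combinations(sorted(adj[j]), order - 1)
        if all(v in adj[u] for u, v in combinations(others, 2))
    )

# Hypothetical example graph with two triangles: {1, 2, 3} and {2, 3, 4}.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

d3 = {j: clique_degree(adj, j, 3) for j in sorted(adj)}
D3 = sum(d3.values())
print(d3)       # {1: 1, 2: 2, 3: 2, 4: 1, 5: 0, 6: 0}
print(D3 // 3)  # 2
```

Note that the clique degree needs only counts, not the identities of the neighbors, which is why this method works with unlabeled neighbors.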

Clique Degree Sums (unlabeled neighbors). Replacing the sum over all nodes by a sum over the sampled nodes S, and weighting each sampled node j by the inverse of its inclusion probability πj, gives Ĉi = (1/i) Σ_{j ∈ S} dij / πj, where dij is the number of i-cliques that node j belongs to. Ĉi is a design-unbiased Horvitz-Thompson estimator of Ci. The same construction, using the number of u-cliques that node j belongs to, gives a design-unbiased Horvitz-Thompson estimator of Cu.
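The slide's equations did not survive in this transcript, so the sketch below assumes the standard Horvitz-Thompson form implied by Di = i·Ci: sum dij / πj over the sampled nodes and divide by i. The clique degrees are hypothetical values, and the enumeration over all possible samples only illustrates what design-unbiased means under uniform sampling without replacement (πj = n/N).

```python
from itertools import combinations

def cds_estimate(sampled_degrees, inclusion_probs, order):
    """Horvitz-Thompson estimate of C_i from clique degrees of sampled nodes."""
    weighted_sum = sum(d / pi for d, pi in zip(sampled_degrees, inclusion_probs))
    return weighted_sum / order

# Hypothetical order-3 clique degrees for a six-node graph; true C3 = 6 / 3 = 2.
d3 = {1: 1, 2: 2, 3: 2, 4: 1, 5: 0, 6: 0}
N, n = len(d3), 2

# Average the estimate over every equally likely sample of n distinct nodes.
samples = list(combinations(d3, n))
estimates = [cds_estimate([d3[j] for j in s], [n / N] * n, order=3)
             for s in samples]
mean_estimate = sum(estimates) / len(estimates)
print(round(mean_estimate, 9))  # 2.0, the true C3
```

Individual estimates vary from sample to sample; only their average over the sampling design equals the true count.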

Clique Degree Sums: estimator variance. We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of the estimators of Ci and Cu. These variance estimators are computed over the sampled nodes and depend on the node inclusion probabilities and the joint node inclusion probabilities. They can be derived for uniform independence sampling, weighted independence sampling and link-trace sampling, with or without replacement; the slide works out the case of uniform independence sampling without replacement.
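For uniform independence sampling without replacement, the inclusion probabilities are πj = n/N and πjk = n(n-1)/(N(N-1)), and the general Horvitz-Thompson variance estimator reduces to the textbook simple-random-sampling form N²(1 - n/N)s²/n for the estimated total. The sketch below uses that reduced form, divided by i² to move from Di to Ci; it is an assumption that this matches the expression on the slide, and the clique degrees are hypothetical.

```python
from itertools import combinations
from statistics import pvariance, variance

def cds_estimate_with_variance(sampled_degrees, N, order):
    """Estimate of C_i and of its variance under uniform sampling w/o replacement."""
    n = len(sampled_degrees)
    total_hat = N * sum(sampled_degrees) / n          # HT estimate of D_i
    var_total = N**2 * (1 - n / N) * variance(sampled_degrees) / n
    return total_hat / order, var_total / order**2

# Hypothetical order-3 clique degrees for a six-node graph.
d3 = [1, 2, 2, 1, 0, 0]
N, n = len(d3), 3

results = [cds_estimate_with_variance(list(s), N, order=3)
           for s in combinations(d3, n)]
true_variance = pvariance([c for c, _ in results])     # spread of the estimates
mean_variance_estimate = sum(v for _, v in results) / len(results)
print(round(true_variance, 6), round(mean_variance_estimate, 6))
```

The two printed numbers agree because the variance estimator is itself unbiased under this design; `variance` is the sample variance with an n - 1 denominator, so at least two sampled nodes are required.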

Distinct Clique Counting (labeled neighbors). Count the distinct i-cliques observed in H1,..,Hn and weight each by the inverse of its i-clique inclusion probability; the result is a design-unbiased Horvitz-Thompson estimator of Ci. Inclusion probabilities can be derived for uniform independence sampling, weighted independence sampling and link-trace sampling, with or without replacement; the slide works out the case of uniform independence sampling with replacement.
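Under uniform independence sampling with replacement, an i-clique is observed when at least one of its i members is drawn, so its inclusion probability is 1 - (1 - i/N)^n. The sketch below uses that probability; it is an assumption that this is the expression on the slide, and the graph and function names are hypothetical. Node labels are needed here to recognize the same clique seen from two egonets.

```python
from itertools import combinations, product

def cliques_containing(adj, j, order):
    """Labeled cliques of the given order that contain node j."""
    return {
        frozenset(others + (j,))
        for others in combinations(sorted(adj[j]), order - 1)
        if all(v in adj[u] for u, v in combinations(others, 2))
    }

def dcc_estimate(adj, egos, order):
    """Distinct clique count divided by the clique inclusion probability."""
    N, n = len(adj), len(egos)
    observed = set()
    for j in egos:                      # repeated egos add nothing new
        observed |= cliques_containing(adj, j, order)
    inclusion_prob = 1 - (1 - order / N) ** n
    return len(observed) / inclusion_prob

# Hypothetical example graph with two triangles, so the true C3 is 2.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Average over all equally likely ordered samples of n = 2 draws.
all_samples = list(product(sorted(adj), repeat=2))
mean_estimate = sum(dcc_estimate(adj, s, 3) for s in all_samples) / len(all_samples)
print(round(mean_estimate, 9))  # 2.0
```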

Distinct Clique Counting (labeled neighbors): example. Graph G with N=8 nodes, n=4 sampled egos (UIS with replacement), order-3 cliques (C3).
Figure: the observed order-3 cliques, some of which are seen more than once, and the distinct order-3 cliques among them.

Computational complexity. Space complexity to count Ci or Cu: O(1) for the Clique Degree Sums method; O(ci) or O(cu) for the Distinct Clique Counting method. Time complexity: reduced from O(3^(N/3)) to O(n*3^(D/3)), where N is the graph size, D is the maximum degree, and n is the sample size; reduced further from O(n*3^(D/3)) to O(3^(D/3)) via parallel computations per egonet.
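The 3^(N/3) term is the classical worst-case bound for enumerating maximal cliques. As an illustration of the kind of exact algorithm that could be run per egonet, here is a sketch of Bron-Kerbosch with pivoting; the deck does not say which algorithm the authors used, so this choice is an assumption, and the example graph is hypothetical.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of the graph (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-processed vertices
        if not p and not x:
            yield frozenset(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

# Hypothetical example graph.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

found = set(maximal_cliques(adj))
print(sorted(sorted(c) for c in found))
```

Running the enumeration on an egonet instead of the whole graph limits the input to at most D + 1 vertices, and the egonets are independent of each other, so they can be processed in parallel.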

Benefits of our methodology: full knowledge of the graph is not required; fast estimation for massive known graphs; estimation or exact computation is easily parallelizable for massive known graphs; estimation with or without neighbor labels; supports vertex attributes; supports a variety of sampling designs.

Results

Simulation Results

Simulation Results: Facebook New Orleans. Egonet sample size n=1,000; uniform independence sampling without replacement; 1000 simulations.
Figure: results for Distinct Clique Counting and Clique Degree Sums.

Simulation Results (1000 simulations). Error metric: Normalized Mean Absolute Error.
Figure: error of Clique Degree Sums and Distinct Clique Counting.
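The formula for the error metric is not preserved in this transcript. A common definition, assumed here, averages the absolute error over the simulation runs and divides by the true value; the function name and the numbers are illustrative only.

```python
def nmae(estimates, true_value):
    """Mean absolute error over simulation runs, normalized by the true value."""
    mean_abs_error = sum(abs(e - true_value) for e in estimates) / len(estimates)
    return mean_abs_error / true_value

# Hypothetical estimates of a clique count whose true value is 10.
runs = [8.0, 12.0, 10.0]
print(round(nmae(runs, 10.0), 4))  # 0.1333
```

Normalizing by the true count makes errors comparable across clique orders, whose counts can differ by orders of magnitude.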

Simulation Results. Figure: results for Clique Degree Sums and Distinct Clique Counting.

Which estimation method to use? Heuristic: Average Edge Count = (all edges between egos and neighbors) / (unique edges between egos and neighbors). Example: graph G with N=8 and n=3 egos (a, b, c): Average Edge Count = 9 / 6 = 1.5.
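The ratio can be computed directly from the sampled egos. The sketch below reads "edges between egos and neighbors" as the edges incident to each ego, counted once per ego in the numerator and once overall in the denominator. The graph is hypothetical, chosen so that three mutually adjacent egos reproduce the 9 / 6 = 1.5 of the example; the function name is illustrative.

```python
def average_edge_count(adj, egos):
    """All ego-neighbor edges divided by the distinct ego-neighbor edges."""
    ego_edges = [frozenset((v, u)) for v in egos for u in adj[v]]
    return len(ego_edges) / len(set(ego_edges))

# Hypothetical graph: egos a, b, c form a triangle and each has one more neighbor.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "x"), ("b", "y"), ("c", "z")]
adj = {}
for s, t in edges:
    adj.setdefault(s, set()).add(t)
    adj.setdefault(t, set()).add(s)

print(average_edge_count(adj, ["a", "b", "c"]))  # 1.5
```

A value of 1 means no edge is seen from more than one ego, and larger values indicate more overlap among the sampled egonets. The sketch assumes at least one ego has a neighbor, since otherwise the denominator is zero.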

Which estimation method to use? Heuristic.
Figure: Clique Degree Sums error and Distinct Clique Counting error, plotted against Average Edge Count.

Estimation Results: Facebook ‘09. Facebook ‘09 crawled dataset [1]; 36,628 unique egonets.
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, IEEE INFOCOM 2010.

Estimation Results: vertex attributes, Facebook ‘09. Dataset complemented with gender attributes; about 6 million users.

Summary: unbiased estimation methods of clique distributions (Clique Degree Sums, Distinct Clique Counting); Facebook cliques. Future work: support estimation of any subgraphs (beyond cliques).
References
[1] M. Gjoka, E. Smith, C. T. Butts, “Estimating Clique Composition and Size Distributions from Sampled Network Data”, IEEE NetSciCom '14.
[2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
[3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators
Thank you!