DOULION: Counting Triangles in Massive Graphs with a Coin

Presentation transcript:

DOULION: Counting Triangles in Massive Graphs with a Coin Charalampos (Babis) Tsourakakis Carnegie Mellon University KDD ‘09 Paris Joint work with: U Kang, Gary L. Miller, Christos Faloutsos DOULION, KDD 09

Outline: Motivation, Related Work, Proposed Method, Results, Conclusion, Extra

Why is Triangle Counting important? Clustering coefficient and transitivity ratio. Social network analysis fact: “Friends of friends are friends” [WF94] (figure: triangle on nodes A, B, C). Hidden thematic structure of the Web (Eckmann et al., PNAS [EM02]). Motif detection (e.g., [YPSB05]). Web spam detection (Becchetti et al., KDD ’08 [BBCG08]).

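Personal Motivation [CET08]. The total number of triangles is Δ = (1/6) Σ_i λ_i^3, where the λ_i are the eigenvalues of the adjacency matrix; the per-node counts use the same eigenvalues together with the eigenvectors. Political Blogs example: keep only 3 eigenvalues!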

Outline: Motivation, Related Work, Proposed Method, Results, Conclusion, Extra

Counting methods (M. Latapy, Theory and Experiments)
Dense graphs. Fast: time O(n^2.37), space O(n^2). Low space: time O(n^3), space O(m).
Sparse graphs. Fast: time O(m^0.7 n^1.2 + n^(2+o(1))), space Θ(n^2) (eventually). Low space: time e.g. O(n …), space Θ(m).
Matrix multiplication is not practical.

Naive Sampling: take r independent samples of three distinct vertices. X=1 if the sampled triple is of type T3 (a triangle); X=0 if it is of type T0, T1, or T2.

Naive Sampling: r independent samples of three distinct vertices. Then the following holds: … with probability at least 1-δ. It works, but it is prohibitive for graphs with T3 = o(n^2), e.g., T3 … n^2 … log n.
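The naive scheme can be sketched in a few lines. This is only an illustration under assumptions: vertices are taken to be the integers 0..n-1, the function name `naive_triangle_estimate` is hypothetical, and the estimate is the fraction of sampled triples that are triangles scaled by the number of all triples, C(n,3).

```python
import random
from math import comb


def naive_triangle_estimate(n, edges, r, seed=0):
    """Estimate T3 from r uniform samples of three distinct vertices."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(range(n), 3)
        # X = 1 only when all three edges are present (a T3 triple).
        if (frozenset((a, b)) in edge_set
                and frozenset((a, c)) in edge_set
                and frozenset((b, c)) in edge_set):
            hits += 1
    # Fraction of triangle triples, scaled by the number of all triples.
    return hits / r * comb(n, 3)
```

When triangles are rare among the C(n,3) triples, almost every sample returns X=0, which is why many samples are needed.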

Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler [BFLSS06]: sample uniformly at random an edge (i,j) and a node k in V-{i,j}, then check if edges (i,k) and (j,k) exist in E(G). (Figure: edge (i,j), node k, and the two edges to check marked “?”.)
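A minimal sketch of this edge-and-node sampling follows, assuming an in-memory edge list rather than a stream and vertices labeled 0..n-1. The function name and the scaling step are illustrative: each triangle is found by exactly 3 of the m(n-2) possible (edge, node) pairs, so the hit fraction is multiplied by m(n-2)/3.

```python
import random


def edge_node_estimate(n, edges, r, seed=0):
    """Estimate the triangle count by sampling an edge (i, j) and a node k."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    hits = 0
    for _ in range(r):
        i, j = rng.choice(edge_list)
        k = rng.randrange(n)
        while k == i or k == j:  # resample until k lies in V - {i, j}
            k = rng.randrange(n)
        # Success when both remaining edges of the triangle exist.
        if frozenset((i, k)) in edge_set and frozenset((j, k)) in edge_set:
            hits += 1
    # Each triangle is found by 3 of the m * (n - 2) possible pairs.
    return hits / r * len(edge_list) * (n - 2) / 3
```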

Outline: Motivation, Related Work, Proposed Method, Results, Conclusion, Extra

Our Sampling Approach: toss a coin for each edge of G(V,E). HEADS! Edge (i,j) “survives” and gets weight 1/p.

Our Sampling Approach: TAILS! Edge (k,m) of G(V,E) “dies”.

Sampling approach
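The coin-tossing step can be sketched as follows. The function name `sparsify` and the dictionary representation of the weighted graph are assumptions made for illustration; the slides only state that a surviving edge is reweighted to 1/p.

```python
import random


def sparsify(edges, p, rng=None):
    """Keep each edge independently with probability p; survivors get weight 1/p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    rng = rng or random.Random()
    # rng.random() < p is "heads": the edge survives with weight 1/p.
    return {tuple(e): 1.0 / p for e in edges if rng.random() < p}
```

A triangle whose three edges all survive then carries a weight product of 1/p^3, which is the factor used to scale the count on the sparsified graph.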

Our Sampling Approach on K_n: with p = 0.5 the sparsified graph is G(n, 0.5). The slide compares the graph initially, in expectation after sampling, and after weighting.

Mean and Variance: Δ = #triangles = k + (Δ-k), with k non-edge-disjoint triangles. X is a random variable, our estimate. E[X] = Δ.
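The unbiasedness claim can be checked with a short derivation. The notation below is an assumption introduced for illustration: δ_i is the indicator that all three edges of triangle i survive, and κ is the number of unordered pairs of triangles that share an edge (a symbol chosen here, which may be defined differently from the slide's k).

```latex
\begin{align*}
X &= \frac{1}{p^{3}} \sum_{i=1}^{\Delta} \delta_i,
  \qquad \Pr[\delta_i = 1] = p^{3}, \\
\mathbb{E}[X] &= \frac{1}{p^{3}} \sum_{i=1}^{\Delta} \mathbb{E}[\delta_i]
  = \frac{1}{p^{3}} \cdot \Delta\, p^{3} = \Delta, \\
\operatorname{Var}(X) &= \frac{\Delta\,(p^{3} - p^{6}) + 2\kappa\,(p^{5} - p^{6})}{p^{6}}.
\end{align*}
```

The variance line uses the fact that two distinct triangles sharing an edge span five edges, so their indicators have covariance p^5 - p^6, while triangles with no common edge are independent.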

Outline: Motivation, Related Work, Proposed Method, Results, Conclusion, Extra

Doulion and Node Iterator: sparsify first and then use Node Iterator to count triangles. Node Iterator: consider each node and count how many edges exist among its neighbors.
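A minimal sketch of this two-step pipeline is given below, assuming a simple undirected graph given as an edge list with no self-loops or duplicate edges. The function names are hypothetical, and the count on the sparsified graph is scaled by 1/p^3 to obtain the estimate.

```python
import random
from itertools import combinations


def node_iterator(edges):
    """Exact triangle count: for each node, count edges among its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for nbrs in adj.values():
        total += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return total // 3  # every triangle is seen once from each of its 3 nodes


def doulion(edges, p, seed=0):
    """Sparsify with probability p, count exactly, then scale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return node_iterator(kept) / p ** 3
```

With p = 1 no edge is dropped and the estimate equals the exact count; smaller p trades accuracy for a smaller graph.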

Expected Speedup: 1/p^2. Proof: let R be the running time of Node Iterator after the sparsification. In expectation R is p^2 times the running time on the original graph; therefore, the expected speedup is 1/p^2.
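One way to see the p^2 factor, under the assumption that Node Iterator's cost is measured by the number of neighbor pairs it checks (d_v is the degree of node v, D_v its degree after sparsification, and T the cost on the original graph; these symbols are introduced here for illustration):

```latex
\begin{align*}
T &= \sum_{v \in V} \binom{d_v}{2}, \qquad
R = \sum_{v \in V} \binom{D_v}{2}, \\
\mathbb{E}[R] &= \sum_{v \in V} p^{2} \binom{d_v}{2} = p^{2}\, T, \qquad
\frac{T}{\mathbb{E}[R]} = \frac{1}{p^{2}}.
\end{align*}
```

Each pair of edges incident to v survives with probability p^2, since the two coin tosses are independent.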

Some results (I). Labels on the slide: ~3M, ~35M and ~400K, ~2.1M.

Some results (II). Labels on the slide: ~3.1M, ~37M and ~3.6M, ~42M.

Outline: Motivation, Related Work, Proposed Method, Results, Conclusion, Extra

Conclusions: a new sampling approach that counts triangles approximately. Basic analysis of the estimate (expectation, variance, expected speedup). Experimentation on many real-world datasets, where we showed that for constant p we get high-quality estimates and constant 1/p^2 speedups.

Question: can p be smaller than constant? How small can we afford p to be and at the same time guarantee concentration? Could, e.g., p be as small as 1/ ??? Motivation (p: speedup): 0.001: 10^6; 0.005: 4*10^4; 0.01: 10^4.
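The speedup column follows directly from the 1/p^2 rule. A small sketch, with a hypothetical helper name, that recomputes it:

```python
def expected_speedup(p):
    """Expected speedup of Node Iterator after sparsifying with probability p."""
    return 1 / p ** 2


for p in (0.001, 0.005, 0.01):
    # Rounded because 1 / p**2 is computed in floating point.
    print(p, round(expected_speedup(p)))
```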

Outline: Motivation, Related Work, Proposed Method, Results, Conclusion, Extra

Approximate Triangle Counting. arXiv preprint: http://arxiv.org/PS_cache/arxiv/pdf/0904/0904.3761v1.pdf. C.E.T., M.N. Kolountzakis, G.L. Miller.

Theorem (C.E.T., Kolountzakis, Miller 2009). How to choose p? Slide annotations: “Mildness, pick p=1” and “Concentration”.

Practitioner’s Guide. Wikipedia 2005: 1.6M nodes, 18.5M edges. Pick p=1/… and keep doubling until concentration. Concentration appears, then becomes stronger.
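The doubling heuristic might look like the sketch below. Several pieces are assumptions, not taken from the slides: the starting value `p0` (the slide's value is truncated), the number of trials, and the concentration test, which here declares concentration when the spread of repeated estimates is within a tolerance `tol` of their mean.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact count: for each node, count edges among its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = sum(1 for nbrs in adj.values()
               for a, b in combinations(nbrs, 2) if b in adj[a])
    return seen // 3  # each triangle is seen from all 3 of its nodes


def doulion_estimate(edges, p, rng):
    """Sparsify with probability p, count exactly, scale by 1/p^3."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3


def pick_p_by_doubling(edges, p0=1 / 16, trials=5, tol=0.05, seed=0):
    """Double p until repeated estimates agree (hypothetical criterion)."""
    rng = random.Random(seed)
    p = p0
    while True:
        estimates = [doulion_estimate(edges, p, rng) for _ in range(trials)]
        mean = sum(estimates) / trials
        spread = max(estimates) - min(estimates)
        # Stop at p = 1 (exact count) or once the estimates concentrate.
        if p >= 1.0 or (mean > 0 and spread <= tol * mean):
            return p, mean
        p = min(1.0, 2 * p)
```

The loop always terminates, because at p = 1 the sparsified graph is the original graph and the count is exact.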

“Bad” Instances: remove edge (1,2); remove any weighted edge, with w sufficiently large.

Thanks! Code and datasets available: http://www.cs.cmu.edu/~ctsourak/projects.html, graphminingtoolbox@gmail.com (HADOOP, MATLAB, and JAVA implementations along with small real-world graphs; all datasets used are on the web). “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software environment and the complete set of instructions which generated the figures.” Buckheit and Donoho [BD95]

References
[BBCG08] Becchetti, Boldi, Castillo, Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs.
[YPSB05] Ye, Peyser, Spencer, Bader. Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast.

References
[EM02] Eckmann, Moses. Curvature of co-links uncovers hidden thematic layers in the World Wide Web.

References
[CET08] C. Tsourakakis. Fast Counting of Triangles in Large Real-World Networks: Algorithms and Laws.
[BD95] Buckheit, Donoho. Wavelab and reproducible research.

References
[WF94] Wasserman, Faust. Social Network Analysis: Methods and Applications.
[BFLSS06] Buriol, Frahling, Leonardi, Spaccamela, Sohler. Counting triangles in data streams.

Doulion