
1 Local Sparsification for Scalable Module Identification in Networks
Srinivasan Parthasarathy
Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang
Data Mining Research Laboratory, Dept. of Computer Science and Engineering, The Ohio State University

2 The Data Deluge: “Every 2 days we create as much information as we did up to 2003” - Eric Schmidt, Google ex-CEO

3 Data Storage Costs Are Low: $600 buys a disk drive that can store all of the world’s music [McKinsey Global Institute Special Report, June ’11]

4 Data does not exist in isolation.

5 Data almost always exists in connection with other data.

6 Social networks, protein interactions, the Internet, VLSI networks, data dependencies, neighborhood graphs.

7 All this data is only useful if we can scalably extract useful knowledge from it.

8 Challenges 1. Large scale: billion-edge graphs are commonplace; scalable solutions are needed.

9 Challenges 2. Noise: links on the web, protein interactions; the noise needs to be alleviated.

10 Challenges 3. Novel structure: hub nodes, small-world phenomena, clusters of varying densities and sizes, directionality; novel algorithms or techniques are needed.

11 Challenges 4. Domain-specific needs, e.g. balance, constraints; mechanisms to specify them are needed.

12 Challenges 5. Network dynamics: How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?

13 Challenges 6. Cognitive overload: guided interaction is needed to support the human in the loop.

14 Our Vision and Approach
Graph pre-processing: Sparsification (SIGMOD ’11, WebSci ’12); Near-neighbor search for non-graph data (PVLDB ’12); Symmetrization for directed graphs (EDBT ’10)
Core clustering: Consensus clustering (KDD ’06, ISMB ’07); Viewpoint neighborhood analysis (KDD ’09); Graph clustering via stochastic flows (KDD ’09, BCB ’10)
Dynamic analysis and visualization: Event-based analysis (KDD ’07, TKDD ’09); Network visualization (KDD ’08); Density plots (SIGMOD ’08, ICDE ’12)
Scalable implementations and systems support on modern architectures: Multicore systems (VLDB ’07, VLDB ’09), GPUs (VLDB ’11), STI Cell (ICS ’08), Clusters (ICDM ’06, SC ’09, PPoPP ’07, ICDE ’10)
Application domains: Bioinformatics (ISMB ’07, ISMB ’09, ISMB ’12, ACM BCB ’11, BMC ’12); Social network and social media analysis (TKDD ’09, WWW ’11, WebSci ’12)

15 Graph Sparsification for Community Discovery (SIGMOD ’11, WebSci ’12)

16 Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

17 Graph Clustering and Community Discovery: Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.

18 Graph Clustering: Applications. Social network and graph compression; direct analytics on the compressed representation.

19 Graph Clustering: Applications. Optimize VLSI layout.

20 Graph Clustering: Applications. Protein function prediction.

21 Graph Clustering: Applications. Data distribution to minimize communication and balance load.

22 Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

23 Preview: original graph vs. sparsified graph [automatically visualized using Prefuse]

24 The promise: Clustering algorithms can run much faster and be more accurate on a sparsified graph. Ditto for network visualization.

25 Utopian objective: Retain edges that are likely to be intra-cluster edges, while discarding likely inter-cluster edges.

26 What we need: a way to rank edges by “strength” or similarity.

27 Algorithm: Global Sparsification (G-Spar)
Parameter: sparsification ratio, s
1. For each edge (i, j): calculate Sim(i, j)
2. Retain the top s% of edges in order of Sim; discard the others
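A minimal sketch of G-Spar in Python, assuming Sim is the Jaccard similarity of the two endpoints' neighbor sets (the slide leaves Sim abstract; the minwise-hashing slides below suggest a Jaccard-style measure). The adjacency representation and names are illustrative, not the paper's implementation.

```python
# Hedged sketch of global sparsification (G-Spar): rank every edge by the
# Jaccard similarity of its endpoints' neighbor sets, keep the top s%.
# `adj` maps each node to the set of its neighbors (undirected graph).

def jaccard(adj, u, v):
    a, b = adj[u], adj[v]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def g_spar(adj, s=0.2):
    edges = {(min(u, v), max(u, v)) for u in adj for v in adj[u]}
    ranked = sorted(edges, key=lambda e: jaccard(adj, *e), reverse=True)
    return set(ranked[: int(s * len(ranked))])
```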

28 Dense clusters are over-represented and sparse clusters under-represented. Works great when the goal is just to find the top communities.

29 Algorithm: Local Sparsification (L-Spar)
Parameter: sparsification exponent, e (0 < e < 1)
1. For each node i of degree d_i:
   (i) For each neighbor j: calculate Sim(i, j)
   (ii) Retain the top (d_i)^e neighbors of node i in order of Sim
This underscores the importance of local ranking.
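A companion sketch of L-Spar under the same assumptions as the G-Spar sketch above (Jaccard similarity on neighbor sets, illustrative names); it reuses the jaccard helper defined there. Treating an edge as kept when either endpoint selects it is an assumption of this sketch and may differ in detail from the paper.

```python
# Hedged sketch of local sparsification (L-Spar): each node i of degree d_i
# keeps only its top ceil(d_i ** e) incident edges by neighborhood similarity.
# Reuses jaccard() from the G-Spar sketch above.

import math

def l_spar(adj, e=0.5):
    kept = set()
    for i, neighbors in adj.items():
        budget = max(1, math.ceil(len(neighbors) ** e))
        ranked = sorted(neighbors, key=lambda j: jaccard(adj, i, j), reverse=True)
        # An edge survives if either endpoint keeps it (assumption).
        kept.update((min(i, j), max(i, j)) for j in ranked[:budget])
    return kept
```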

30 Ensures representation of clusters of varying densities.

31 But... similarity computation is expensive!

32 A randomized, approximate solution based on minwise hashing [Broder et al., 1998]

33 Minwise Hashing
Universe: {dog, cat, lion, tiger, mouse}
Permutation 1: [cat, mouse, lion, dog, tiger]
Permutation 2: [lion, cat, mouse, dog, tiger]
Set A = {mouse, lion}
mh_1(A) = earliest element of A under permutation 1 = mouse
mh_2(A) = earliest element of A under permutation 2 = lion
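A few lines of Python reproducing the slide's example, with the min-hash of a set defined as its earliest element under an explicit permutation of the universe; the permutations are the ones shown on the slide.

```python
# Min-hash via explicit permutations: mh_p(A) is the element of A that
# appears earliest in the permuted ordering p of the universe.

def minhash(perm, items):
    return min(items, key=perm.index)

perm1 = ["cat", "mouse", "lion", "dog", "tiger"]
perm2 = ["lion", "cat", "mouse", "dog", "tiger"]
A = {"mouse", "lion"}
print(minhash(perm1, A), minhash(perm2, A))   # -> mouse lion
```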

34 Key fact: For two sets A, B and a min-hash function mh_i(), Pr[mh_i(A) = mh_i(B)] = |A ∩ B| / |A ∪ B| = Sim(A, B). An unbiased estimator for Sim using k hashes: Sim(A, B) ≈ (1/k) Σ_{i=1..k} 1[mh_i(A) = mh_i(B)].
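A minimal sketch of the k-hash estimator. Replacing true random permutations with salted hash functions is a standard approximation and an assumption of this sketch; all names are illustrative.

```python
# Estimate Jaccard similarity from k min-hash values per set. Salted hashing
# stands in for k independent random permutations (an approximation).

import random

def make_signature_fn(k=30, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    def signature(items):
        # One min-hash value per salt: the smallest salted hash over the set.
        return [min(hash((salt, x)) for x in items) for salt in salts]
    return signature

def estimated_sim(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

signature = make_signature_fn()
print(estimated_sim(signature({"dog", "cat", "lion"}),
                    signature({"cat", "lion", "tiger"})))   # roughly 0.5 (true Jaccard = 2/4)
```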

35 Time complexity using minwise hashing: O(|Edges| x |Hashes|). Only 2 sequential passes over the input, which is great for disk-resident data. Note: exact similarity is less important here; we really just care about the relative ranking, so a lower k suffices.

36 Theoretical Analysis of L-Spar: Main Results
Q: Why choose the top d^e edges for a node of degree d?
A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control the degree of sparsification.
Proposition: If the input graph has a power-law degree distribution with exponent α, then the sparsified graph also has a power-law degree distribution, with an exponent determined by α and e.
Corollary: The sparsification ratio corresponding to exponent e is bounded in terms of α and e; for α = 2.1 and e = 0.5, ~17% of edges will be retained. Higher α (steeper power laws) and/or lower e leads to more sparsification.

37 Experiments
Datasets: 3 PPI networks (BioGRID, DIP, Human); 2 information networks (Wiki, Flickr) and 2 social networks (Orkut, Twitter). Largest network (Orkut), roughly a billion edges. Ground truth is available for the PPI networks and Wiki.
Clustering algorithms: Metis [Karypis & Kumar ’98], MLR-MCL [Satuluri & Parthasarathy ’09], Metis+MQI [Lang & Rao ’04], Graclus [Dhillon et al. ’07], spectral methods [Shi ’00], edge-based agglomerative/divisive methods [Newman ’04].
Compared sparsifications: L-Spar, G-Spar, RandomEdge, and ForestFire.

38 Results Using Metis
Dataset (n, m)         | Spars. Ratio | Random: Speed / Quality | G-Spar: Speed / Quality | L-Spar: Speed / Quality
Yeast_Noisy (6k, 200k) | 17%          | 11x / -10%              | 30x / -15%              | 25x / +11%
Wiki (1.1M, 53M)       | 15%          | 8x / -26%               | 104x / -24%             | 52x / +50%
Orkut (3M, 117M)       | 17%          | 13x / +20%              | 30x / +60%              | 36x / +60%
[Hardware: quad-core Intel i5 CPU, 3.2 GHz, 16 GB RAM]

39 Results Using Metis (same table as above). Note: the same sparsification ratio is used for all three methods.

40 Results Using Metis (same table as above). Note: good speedups, but typically a loss in quality.

41 Results Using Metis (same table as above). Note: great speedups and quality.

42 L-Spar: Results Using MLR-MCL
Dataset (n, m)         | Spars. Ratio | L-Spar: Speed / Quality
Yeast_Noisy (6k, 200k) | 17%          | 17x / +4%
Wiki (1.1M, 53M)       | 15%          | 23x / -4.5%
Orkut (3M, 117M)       | 17%          | 22x / 0%
[Hardware: quad-core Intel i5 CPU, 3.2 GHz, 16 GB RAM]

43 L-Spar: Qualitative Examples
Node                                      | Retained neighbors                                             | Discarded neighbors
Graph (Wiki article)                      | Graph Theory, Adjacency list, Adjacency matrix, Model theory  | Tessellation, Roman letters used in Mathematics, Morphism
Jack Dorsey (Twitter user and co-founder) | Biz Stone, Evan Williams, Jason Goldman, Sarah Lacy (Twitter executives, Silicon Valley figures) | Alyssa Milano, JetBlue Airways, WholeFoods, Parul Sharma
Gladiator (Flickr tag)                    | colosseum, world-heritage, site, italy                         | europe, travel, canon, sky, summer

44 Impact of Sparsification on Noisy Data: As the graphs get noisier, L-Spar is increasingly beneficial.

45 Impact of Sparsification on Spectrum: Yeast PPI

46 Impact of Sparsification on Spectrum: Epinions. Global sparsification results in multiple components; local sparsification seems to match the trends of the original graph.

47 Impact of Sparsification on Spectrum: Human PPI

48 Impact of Sparsification on Spectrum: Flickr

49 Anatomy of a density plot: the y-axis shows some measure of density; the x-axis shows a specific ordering of the vertices in the graph.

50 Density Overlay Plots: visual comparison between global and local sparsification.

51 Summary
Sparsification is a simple pre-processing step that makes a big difference: it takes only tens of seconds to execute on multi-million-node graphs, reduces clustering time from hours down to minutes, improves accuracy for several algorithms by removing noisy edges, and helps visualization.
Ongoing and future work: The spectral results suggest one might be able to provide a theoretical rationale; can we tease it out? Investigate other kinds of graphs, incorporate content, and explore novel applications (e.g. wireless sensor networks, VLSI design).

52 Prior Work
Random edge sampling [Karger ’94]. Sampling in proportion to effective resistances: good guarantees, but very slow [Spielman and Srivastava ’08]. Matrix sparsification [Arora et al. ’06]: fast, but the same as random sampling in the absence of weights.

53 Topological Measures

54 Modularity (from Wikipedia): Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range [−1/2, 1). It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
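The textbook form of this definition is Q = (1/2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j). Below is a minimal sketch computing it for a small undirected graph; the adjacency representation, labels, and the O(n²) loop are illustrative choices, not an optimized implementation.

```python
# Modularity Q = (1/2m) * sum over node pairs (i, j) in the same group of
# [ A_ij - k_i * k_j / (2m) ], where k_i is the degree of node i.
# `adj`: node -> set of neighbors (undirected); `labels`: node -> group id.

def modularity(adj, labels):
    two_m = sum(len(nbrs) for nbrs in adj.values())   # each edge counted twice
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] != labels[j]:
                continue
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m

# Two triangles joined by a single edge, labelled as two groups.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(modularity(adj, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}))  # ~0.357
```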


57 The MCL algorithm
Input: A, the adjacency matrix. Initialize M to M_G, the canonical transition matrix: M := M_G := (A + I) D^-1.
Repeat until converged:
  Expand: M := M * M
  Inflate: M := M.^r (r usually 2), then renormalize columns. This enhances flow to well-connected nodes (i.e. nodes within a community) and increases inequality in each column (“rich get richer, poor get poorer”), reducing flow across communities.
  Prune: remove entries close to zero. This saves memory and enables faster convergence.
Output clusters. Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster labels.
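A minimal NumPy sketch of this expand/inflate/prune loop, for intuition only; the pruning threshold, convergence test, and cluster extraction below are simplified assumptions, not the tuned MCL or MLR-MCL implementations.

```python
# Sketch of the MCL loop from the slide: expand, inflate, prune, renormalize.
import numpy as np

def mcl_sketch(A, r=2, prune_threshold=1e-5, max_iters=100):
    n = A.shape[0]
    M = A + np.eye(n)
    M = M / M.sum(axis=0)                      # M_G = (A + I) D^-1, column-stochastic
    for _ in range(max_iters):
        M_prev = M
        M = M @ M                              # Expand
        M = M ** r                             # Inflate (elementwise power)
        M[M < prune_threshold] = 0.0           # Prune entries close to zero
        M = M / M.sum(axis=0)                  # Renormalize columns
        if np.allclose(M, M_prev, atol=1e-9):
            break
    # Nodes whose columns flow to the same "sink" row share a cluster label.
    return M.argmax(axis=0)

# Usage: two triangles joined by one edge separate into two clusters.
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]], dtype=float)
print(mcl_sketch(A))   # e.g. [2 2 2 3 3 3]; labels are attractor node ids
```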

