CMU SCS I2.2 Large Scale Information Network Processing INARC 1 Overview Goal: scalable algorithms to find patterns and anomalies on graphs 1. Mining Large.

1 CMU SCS I2.2 Large Scale Information Network Processing INARC
Overview
Goal: scalable algorithms to find patterns and anomalies on graphs
1. Mining Large Graphs: Algorithms, Inference, and Discoveries
2. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
3. Patterns on the Connected Components of Terabyte-Scale Graphs
PI: Christos Faloutsos (CMU)
Students: Leman Akoglu, Polo Chau, U Kang

2 Mining Large Graphs: Algorithms, Inference, and Discoveries
U Kang, Duen Horng Chau, Christos Faloutsos
School of Computer Science, Carnegie Mellon University

3 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

4 Motivation
Inference on graphs: "guilt by association"
- Adult sites tend to be connected to adult sites, while educational sites tend to be connected to educational ones
- Given labels (adult or edu) on a subset of the nodes, infer the labels of the remaining unlabeled nodes in the graph
- Tool: Belief Propagation (BP)
[Figure: red nodes connected to red nodes; blue nodes connected to blue nodes]

5 Belief Propagation
[Figure: the two BP equations with annotated terms.
Belief computation, labeled: node belief, prior prob., messages from neighbors.
Message computation, labeled: message from node i to node j, prior prob., propagation matrix, ~messages from neighbors.]
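The equations on this slide did not survive in the transcript. The following is a minimal sketch of the message and belief computations it names, assuming the standard sum-product form of loopy BP: the message from i to j sums, over the states of i, the prior of i times the propagation matrix entry times the messages into i from every neighbor except j; the belief of i is its prior times all incoming messages, normalized. The function name `belief_propagation`, the 4-node chain, the priors, and the matrix `psi` are illustrative assumptions, not values from the presentation.

```python
from math import prod


def belief_propagation(edges, priors, psi, iterations=10):
    """Loopy sum-product BP on an undirected graph with discrete states.

    edges:  list of undirected edges (u, v)
    priors: dict node -> list of prior probabilities, one per state
    psi:    propagation matrix, psi[xi][xj] = compatibility of states xi, xj
    """
    states = range(len(psi))
    neighbors = {v: set() for v in priors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # One message per directed edge (i -> j), initialised to uniform.
    msgs = {(i, j): [1.0] * len(psi) for i in neighbors for j in neighbors[i]}

    for _ in range(iterations):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for xj in states:
                total = 0.0
                for xi in states:
                    # Messages into i from all neighbors except the target j.
                    incoming = prod(msgs[(k, i)][xi]
                                    for k in neighbors[i] if k != j)
                    total += priors[i][xi] * psi[xi][xj] * incoming
                out.append(total)
            z = sum(out)
            new_msgs[(i, j)] = [x / z for x in out]  # normalise for stability
        msgs = new_msgs

    # Belief = prior times all incoming messages, normalised.
    beliefs = {}
    for i in neighbors:
        b = [priors[i][x] * prod(msgs[(k, i)][x] for k in neighbors[i])
             for x in states]
        z = sum(b)
        beliefs[i] = [x / z for x in b]
    return beliefs


# Hypothetical example: chain 0-1-2-3, node 0 is probably state 0 ("edu"),
# node 3 is probably state 1 ("adult"); nodes 1 and 2 are unlabeled.
edges = [(0, 1), (1, 2), (2, 3)]
priors = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.5, 0.5], 3: [0.1, 0.9]}
psi = [[0.8, 0.2], [0.2, 0.8]]  # homophily: same-state neighbors are likelier
beliefs = belief_propagation(edges, priors, psi)
```

In this toy run the unlabeled node 1 leans toward the label of its nearer labeled neighbor (node 0), and node 2 toward node 3, which is the "guilt by association" effect described in the motivation.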

6 A Challenge in BP
Scalability!
Existing works assume that all the nodes (and/or edges) of the input graph fit in memory
- Problem: what if the graph is too large to fit in memory?
Challenge: scaling up the inference algorithm to very large graphs whose nodes do not fit in memory

7 Problem Definition
How can we scale up the BP algorithm to very large graphs?
Goal
- Scalability: to billions of nodes and edges
- Efficiency: fast algorithm

8 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

9 Main Idea
Our approach
- Use Hadoop to scale up BP
Challenge
- How can we formulate BP using a simple, efficient operation supported by Hadoop?

10 Main Idea
Key observation
- BP message update equation = local message exchange
[Figure: graph whose edges carry the messages m_13, m_31, m_01, m_10, m_12, m_21, m_24, m_42]
A message is updated from its neighboring messages. For example, m_12 is updated from m_01 and m_31.

11 Main Idea
BP message update can be expressed as a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G
- Nodes in L(G) are edges in G
- Two nodes in L(G) are connected if the corresponding edges are adjacent in G
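A small sketch of the line-graph construction may make this concrete. It assumes the directed variant that BP needs: each undirected edge of G becomes two directed messages, and message (i, j) depends on every message (k, i) with k different from j. The function name `directed_line_graph` is hypothetical, and the edge list is inferred from the message labels in the earlier example (m_01, m_12, m_13, m_24 and their reverses), not stated in the deck.

```python
def directed_line_graph(edges):
    """Map each directed message (i, j) to the messages it is updated from.

    edges: undirected edges (u, v) of the original graph G.
    Returns a dict: (i, j) -> sorted list of (k, i) with k != j.
    """
    # Every undirected edge carries one message in each direction.
    directed = list(edges) + [(v, u) for u, v in edges]
    incoming = {}
    for (i, j) in directed:
        incoming[(i, j)] = sorted(
            (k, l) for (k, l) in directed if l == i and k != j
        )
    return incoming


# Assumed example graph: edges 0-1, 1-2, 1-3, 2-4.
line_graph = directed_line_graph([(0, 1), (1, 2), (1, 3), (2, 4)])
print(line_graph[(1, 2)])  # messages that feed m_12
```

With this edge list, m_12 depends on m_01 and m_31, matching the example given on the previous slide.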

12 Proposed: HA-LFP algorithm
BP message update can be expressed as a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G
[Figure: new message vector = line graph of G, multiplied (generalized m-v multiplication) by the old message vector]
Multiply repeatedly until convergence
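The deck does not spell out what "generalized" means here. One common reading, used as an assumption in the sketch below, is that the multiply and sum of an ordinary matrix-vector product are replaced by user-supplied functions. The names `generalized_matvec`, `combine2`, and `combine_all` are hypothetical and are not taken from the presentation.

```python
def generalized_matvec(matrix, vector, combine2, combine_all):
    """Sparse matrix-vector product with pluggable operations.

    matrix:      dict {(row, col): value}, only non-zero entries
    vector:      dict {col: value}
    combine2:    replaces the scalar multiply m * v
    combine_all: replaces the sum over one row's partial results
    """
    partial = {}
    for (row, col), m in matrix.items():
        partial.setdefault(row, []).append(combine2(m, vector[col]))
    return {row: combine_all(parts) for row, parts in partial.items()}


matrix = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
vector = {0: 10, 1: 20}

# Ordinary product: multiply, then sum.
ordinary = generalized_matvec(matrix, vector, lambda m, v: m * v, sum)

# Same skeleton with different operations (add, then take the minimum).
min_plus = generalized_matvec(matrix, vector, lambda m, v: m + v, min)
```

Under this reading, a BP iteration would plug the message-update rule into `combine2` and `combine_all`, with the line graph as the matrix and the current messages as the vector.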

13 Complexity
One iteration of HA-LFP on L(G) = one matrix-vector multiplication on G
Time: O((V + E) / M)
Space: O(V + E)
V: # of nodes, E: # of edges, M: # of machines

14 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

15 Questions
Q1: How fast is HA-LFP?
Q2: How does HA-LFP scale up?
Q3: How can we find 'good' and 'bad' sites in a web graph?

16 Running Time
Q1: How fast is HA-LFP?
[Figure: running time, 10 iterations]

17 Scale Up
Q2: How does HA-LFP scale up?
Linear in the number of machines and edges

18 Advantages of HA-LFP
Scalability
- The only solution when the node information cannot fit in memory
- Near-linear scale-up
Running Time
- Faster than the single-machine version for large graphs
Fault Tolerance

19 Analysis of Web Graph
Q3: How can we find 'good' and 'bad' sites in a web graph?
Pages whose goodness scores < 0.9 are likely to be adult pages

20 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

21 Conclusion
HA-LFP
- Belief Propagation for billion-scale graphs on Hadoop
- Near-linear scalability in the # of machines and edges
Many applications
- Finding 'good' and 'bad' web sites
- Fraud detection
- …

22 Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation
U Kang, Brendan Meeder, Christos Faloutsos
School of Computer Science, Carnegie Mellon University

23 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

24 Problem Definition
Eigensolver
- Computes the top-k eigenvalues and eigenvectors
- Applications: SVD, triangle counting, spectral clustering, …
Existing eigensolvers
- Can handle up to millions of nodes
How can we scale up eigensolvers to billion-scale graphs?

25 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

26 Main Idea
HEigen algorithm (Hadoop Eigen-solver)
- Selectively parallelize the 'Lanczos' algorithm
  Expensive operations: on Hadoop, for scalability
  Inexpensive operations: on a single machine, for accuracy
- Block encoding
  Block-encode the matrix, and then do matrix-vector multiplication
- Exploiting skewness in matrix-matrix multiplication
  Applies when one matrix is very large and the other is very small

27 Application of HEigen
Triangle Counting
- Real social networks have a lot of triangles
  Friends of friends are friends
But: triangles are expensive to compute
- (3-way join; several approx. algos)
Q: Can we do that quickly?
A: Yes!
- $\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$
- (and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]
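The triangle formula on this slide follows from a standard fact: for the adjacency matrix A of an undirected simple graph, trace(A^3) equals the sum of the cubed eigenvalues of A, and it counts each triangle six times (three starting nodes, two directions). The sketch below checks this on a tiny dense matrix using the trace side only, so it needs no eigensolver. It is an illustration of the identity, not the HEigen implementation, and the helper names are hypothetical.

```python
from math import comb


def count_triangles(adj):
    """Count triangles as trace(A^3) / 6 for a 0/1 symmetric matrix."""
    n = len(adj)
    # a2[i][j] = number of length-2 walks from i to j.
    a2 = [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # trace(A^3) = number of closed length-3 walks.
    trace_a3 = sum(a2[i][k] * adj[k][i] for i in range(n) for k in range(n))
    return trace_a3 // 6


def complete_graph(n):
    """Adjacency matrix of the complete graph K_n."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]


def cubed_eigenvalue_sum_complete(n):
    """Sum of cubed eigenvalues of K_n.

    K_n has eigenvalue n-1 once and eigenvalue -1 with multiplicity n-1.
    """
    return (n - 1) ** 3 - (n - 1)


k5 = complete_graph(5)
# Three ways to get the triangle count of K_5:
by_trace = count_triangles(k5)
by_eigenvalues = cubed_eigenvalue_sum_complete(5) // 6
by_combinatorics = comb(5, 3)
```

For K_5 the single top eigenvalue alone contributes 4^3 = 64 of the total 60, with the four -1 eigenvalues correcting by -4, which hints at why a few large eigenvalues can dominate the sum.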

28 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

29 Questions
Q1: How does HEigen scale up?
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Q3: How can we find anomalous sites in a web graph?

30 Running Time
Q1: How does HEigen scale up?
HEigen-BLOCK is faster than the PLAIN version
Linear in the number of machines and edges

31 Scale Up
Q2: Which matrix-matrix multiplication algorithm runs the fastest?
Cache-based MM runs the fastest!

32 Results
Q3: How can we find anomalous sites in a web graph?
Triangle counting on the Twitter social network [Twitter 2009; ~3 billion edges]
- U.S. politicians: moderate number of triangles vs. degree
- Adult sites: very large number of triangles vs. degree

33 Outline: Problem Definition, Proposed Method, Experiment, Conclusion

34 Conclusion
HEigen
- Eigensolver for billion-scale graphs on Hadoop
- Near-linear scalability in the # of machines and edges
- Cache-based matrix-matrix multiplication: fastest!
- Anomalies in triangle counts
Many applications
- Triangle counting
- SVD
- …

35 Patterns on the Connected Components of Terabyte-Scale Graphs
U Kang*, Mary McGlohon*†, Leman Akoglu*, Christos Faloutsos*
(*) SCS, Carnegie Mellon University; (†) Google

36 Outline: Problem Definition, Static Patterns, Evolution Patterns, Model, Conclusion

37 Problem Definition
A large graph is composed of many connected components
Q1: static patterns?
Q2: evolution patterns?
Q3: model?
[Figure: count vs. size of connected components. YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes]
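As a small-scale illustration of the count-versus-size view on this slide, the sketch below finds connected components with a union-find structure and tallies how many components exist of each size. This is a single-machine toy under assumed names (`component_size_counts` is hypothetical); it is not the Hadoop-based method used for the 1.4-billion-node graph.

```python
from collections import Counter


def component_size_counts(num_nodes, edges):
    """Return {component size: number of components of that size}."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    # Size of each component, keyed by its root node.
    sizes = Counter(find(v) for v in range(num_nodes))
    # How many components have each size.
    return dict(Counter(sizes.values()))


# Toy graph: components {0, 1, 2}, {3, 4}, {5}, {6}.
counts = component_size_counts(7, [(0, 1), (1, 2), (3, 4)])
```

Plotting the keys of `counts` against its values on log-log axes would give the kind of size/count distribution the slide shows for the YahooWeb graph.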

38 Outline: Problem Definition, Static Patterns, Evolution Patterns, Model, Conclusion

39 Q1: Static Patterns
What are the regularities in the connected components of a static graph?
- What do they look like?
- Do the GCC (giant connected component) and the other connected components look similar? Chain? Clique?
Idea: use 'density' and 'radius' to find patterns

40 Density of Connected Component
What is a good metric for the density of a connected component?
- A candidate: |E| / |V| ("average degree")
- Problem: it increases over time
[Figure: number of edges vs. number of nodes]

41 Density of Connected Component
We want a metric that can measure the 'intrinsic' density of a component
- Proposed: Graph Fractal Dimension (GFD) = log |E| / log |V| [Leskovec+ KDD05]
[Figure: number of edges vs. number of nodes]

42 Density of Connected Component
Graph Fractal Dimension (GFD)
- log |E| / log |V|
Chain: GFD ~1
Star: GFD ~1
Bipartite core: 1 < GFD < 2
Clique: GFD ~2
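The four reference values can be checked directly from the definition GFD = log |E| / log |V|. The sketch below does so for graphs of 1000 nodes; the function name `gfd` and the choice of 1000 nodes (and a 500-by-500 complete bipartite core) are illustrative assumptions.

```python
from math import log


def gfd(num_nodes, num_edges):
    """Graph Fractal Dimension: log |E| / log |V|."""
    return log(num_edges) / log(num_nodes)


n = 1000
chain = gfd(n, n - 1)                 # path on n nodes has n - 1 edges
star = gfd(n, n - 1)                  # one hub linked to n - 1 leaves
clique = gfd(n, n * (n - 1) // 2)     # every pair of nodes is linked
half = n // 2
bipartite_core = gfd(n, half * half)  # complete bipartite, half x half

print(round(chain, 3), round(star, 3),
      round(bipartite_core, 3), round(clique, 3))
```

The clique value approaches 2 only as the graph grows, since |E| is about |V|^2 / 2; at 1000 nodes it is still a little below 2, which is why the slide writes "~2".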

43 Density of Connected Component
What are the GFDs of connected components in a large, real graph?

44 Density of Connected Component
GFDs of CCs in the YahooWeb graph
- CCs are slightly denser than trees
- GFDs of CCs are constant on average
[Figure: number of edges vs. number of nodes; slope = 1.08]

45 Radius of Connected Component
Q1.1: What does the GCC look like?
Q1.2: What do the remaining CCs look like? (What are the GFDs?)

46 Radius of Connected Component
What are the patterns of radii in connected components?
A1.1: GCC looks like a 'kite'
A1.2: Chain-like disconnected components
[Figure: average radius and max. radius; slope = 1.38; 'core' and 'chain' regions marked]

47 Outline: Problem Definition, Static Patterns, Evolution Patterns, Model, Conclusion

48 Q2: Evolution Patterns
How do the connected components evolve?
- Do the largest connected components grow at the same rate?
- How often does a newcomer join the disconnected components?
[Figure: a newcomer deciding which component to join]

49 Gelling Point
Gelling point [McGlohon+ KDD08]
- Diameter starts to shrink

50 Growth of Connected Component
GFDs of the top 3 CCs over time
- Before the "gelling point": GFDs of the top 3 CCs stay constant, "tree"-like.
- After the "deviation point": GFD of the GCC takes off, becomes denser.

51 'Rebel' Probability
What are the chances that a newcomer doesn't belong to the GCC? ("rebel" prob.)
[Figure: a newcomer choosing between the GCC and the disconnected components (DCs)]

52 'Rebel' Probability
What are the chances that a newcomer doesn't belong to the GCC? ("rebel" prob.)
d: degree of a newcomer
s: size (|V|) of DC
But, how exactly?

53 'Rebel' Probability
- 'Rebel' prob. is exponential in the degree d
- 'Rebel' prob. is a power of |V| in the DC
d: degree of a newcomer
s: size (|V|) of DC

54 Outline: Problem Definition, Static Patterns, Evolution Patterns, Model, Conclusion

55 Q3: Model
How can we explain the static and the evolution patterns with a generative model?
Modeling goals
- (G1) Constant GFDs
- (G2) ERP (Exponential Rebel Probability)
- (G3) Disconnected components

56 CommunityConnection Model
CommunityConnection model
- Defines the behavior of a new node joining the network
1. Chooses a host to link to.
2. Visits the neighbors.
Repeat the two steps!

57 CommunityConnection Model
How does the CommunityConnection model match reality?

58 CommunityConnection Model Results
(G1) Constant GFDs
[Figure: two panels, number of edges vs. number of nodes]

59 CommunityConnection Model Results
(G2) ERP (Exponential Rebel Probability)
(G3) Disconnected Components
[Figure: log(rebel prob.) vs. degree; log(rebel prob.) vs. log(|V| in DC)]

60 Outline: Problem Definition, Static Patterns, Evolution Patterns, Model, Conclusion

61 Conclusion
Patterns in the Connected Components
- Goal 1: Static Patterns
  Chain-like disconnected components
  'Kite'-like GCC
- Goal 2: Evolution Patterns
  Constant, low GFD ("density") until the gelling point
  ERP (Exponential Rebel Probability)
- Goal 3: Model
  CommunityConnection Model (matches reality)

62 Future Plan
Hadoop/PEGASUS: Degree Distr., Pagerank, Diameter, Conn. Comp, Eigensolver, Belief Propagation, Clustering, …

63 Thank you!

