Algorithms for Large Graph Mining

Presentation on theme: "Algorithms for Large Graph Mining"— Presentation transcript:

1 Algorithms for Large Graph Mining
Christos Faloutsos, CMU

2 Our goal: One-stop solution for mining huge graphs
Faloutsos et al 2/24/2019 PEGASUS project (PEta GrAph mining System): open-source code and papers. CMU 2010 C. Faloutsos (CMU)

3 Outline
Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools; Problem#3: Scalability; Conclusions

4 Graphs - why should we care?
[Figures: Internet Map [lumeta.com]; Food Web [Martinez ’91]; Friendship Network [Moody ’01]; Protein Interactions [genomebiology.com]]

5 Graphs - why should we care?
IR: bi-partite graphs (doc-terms) [Figure labels: D1 ... DN, T1 ... TM]
web: hyper-text graph
... and more

6 Graphs - why should we care?
network of companies & board-of-directors members
‘viral’ marketing
web-log (‘blog’) news propagation
computer network security: /IP traffic and anomaly detection
....

7 Outline
Introduction – Motivation; Problem#1: Patterns in graphs (static graphs, weighted graphs, time evolving graphs); Problem#2: Tools; Problem#3: Scalability; Conclusions

8 Problem #1 - network and graph mining
What does the Internet look like?
What does the web look like?
What is ‘normal’/‘abnormal’?
Which patterns/laws hold?

9 Problem #1 - network and graph mining
What does the Internet look like?
What does the web look like?
What is ‘normal’/‘abnormal’?
Which patterns/laws hold?
We cannot spot anomalies without first discovering patterns.
Large datasets reveal patterns that may be invisible otherwise…

10 Graph mining
Are real graphs random?

11 Laws and patterns
Are real graphs random? A: NO!!
Diameter
in- and out-degree distributions
other (surprising) patterns
So, let’s look at the data

12 Solution# S.1: Power law in the degree distribution [SIGCOMM99]
internet domains
[Plot: log(degree) vs. log(rank), slope -0.82; labeled points: att.com, ibm.com]
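
The exponent on this slide is the slope of a straight line in log-log space. A minimal Python sketch of such a fit follows, assuming an ordinary least-squares line through the (log rank, log degree) points. The function name `rank_exponent` and the synthetic degree sequence are illustrative only and do not come from the talk.

```python
import math


def rank_exponent(degrees):
    """Least-squares slope of log(degree) against log(rank)."""
    # Rank 1 is the highest-degree node; zero-degree nodes have no logarithm.
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(d) for d in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic (made-up) degrees that follow degree = 1000 * rank^(-0.82) exactly.
synthetic = [1000.0 * rank ** -0.82 for rank in range(1, 201)]
slope = rank_exponent(synthetic)
```

On real data the points scatter around the line, so the fitted slope is only an estimate of the exponent.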

13 Solution# S.2: Eigen Exponent E
A2: power law in the eigenvalues of the adjacency matrix
[Plot: eigenvalue vs. rank of decreasing eigenvalue, May 2001; exponent = slope, E = -0.48]

14 Solution# S.2: Eigen Exponent E
[Mihail, Papadimitriou ’02]: slope is ½ of rank exponent
[Plot: eigenvalue vs. rank of decreasing eigenvalue, May 2001; exponent = slope, E = -0.48]

15 But: How about graphs from other domains?

16 More power laws: web hit counts [w/ A. Montgomery]
[Plot: Web Site Traffic (users, sites); count (log scale) vs. in-degree (log scale); annotations: Zipf, ``ebay’’]

17 epinions.com
who-trusts-whom [Richardson + Domingos, KDD 2001]
[Plot: count vs. (out) degree; annotation: trusts-2000-people user]

18 Outline
Introduction – Motivation; Problem#1: Patterns in graphs (static graphs: degree, diameter, eigen, triangles, cliques; weighted graphs; time evolving graphs); Problem#2: Tools

19 Solution# S.3: Triangle ‘Laws’
Real social networks have a lot of triangles

20 Solution# S.3: Triangle ‘Laws’
Real social networks have a lot of triangles
Friends of friends are friends
Any patterns?

21 Triangle Law: #S.3 [Tsourakakis ICDM 2008]
[Plots: HEP-TH, ASN, Epinions. X-axis: # of triangles a node participates in. Y-axis: count of such nodes]

22 Triangle Law: #S.4 [Tsourakakis ICDM 2008]
[Plots: Reuters, SN, Epinions. X-axis: degree. Y-axis: mean # triangles]
n friends -> ~n^1.6 triangles

23 Triangle Law: Computations [Tsourakakis ICDM 2008]
(details)
But: triangles are expensive to compute (3-way join; several approx. algos)
Q: Can we do that quickly?

24 Triangle Law: Computations [Tsourakakis ICDM 2008]
(details)
But: triangles are expensive to compute (3-way join; several approx. algos)
Q: Can we do that quickly?
A: Yes! #triangles = \frac{1}{6} \sum_i \lambda_i^3
(and, because of skewness, we only need the top few eigenvalues!)
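
The identity behind this answer can be checked on a toy graph. For a symmetric 0/1 adjacency matrix A, the sum of the cubed eigenvalues equals trace(A^3), which counts closed 3-step walks, and each triangle is counted 6 times (3 starting nodes times 2 directions). The sketch below, in plain Python with hypothetical helper names, compares trace(A^3)/6 against a brute-force count. It checks the exact identity only; it does not implement the top-few-eigenvalues approximation from the slide.

```python
from itertools import combinations


def matmul(a, b):
    """Dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def triangles_by_trace(adj):
    """trace(A^3) / 6, which equals (1/6) * sum of cubed eigenvalues of A."""
    a3 = matmul(matmul(adj, adj), adj)
    return sum(a3[i][i] for i in range(len(adj))) // 6


def triangles_brute_force(adj):
    """Count node triples whose three connecting edges are all present."""
    n = len(adj)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if adj[i][j] and adj[j][k] and adj[i][k])


# Toy graph (illustrative): triangles {0, 1, 2} and {2, 3, 4} share node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)]
adj = [[0] * 5 for _ in range(5)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

by_trace = triangles_by_trace(adj)
by_brute_force = triangles_brute_force(adj)
```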

25 Triangle Law: Computations [Tsourakakis ICDM 2008]
(details)
1000x+ speed-up, >90% accuracy

26 EigenSpokes
B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, June 2010.

27 EigenSpokes
Eigenvectors of adjacency matrix
equivalent to singular vectors (symmetric, undirected graph)
A = U \Sigma U^T

28 EigenSpokes
Eigenvectors of adjacency matrix
equivalent to singular vectors (symmetric, undirected graph)
A = U \Sigma U^T
[Figure labels: \vec{u}_1, \vec{u}_i]

29 EigenSpokes
EE plot: scatter plot of scores of u1 vs u2
One would expect: many points at the origin, a few scattered ~randomly
[Plot axes: u1, u2]

30 EigenSpokes
EE plot: scatter plot of scores of u1 vs u2
One would expect: many points at the origin, a few scattered ~randomly
[Plot axes: u1, u2; annotation: 90°]

31 EigenSpokes - pervasiveness
Present in: mobile social graph (across time and space); patent citation graph

32 EigenSpokes - explanation
Near-cliques, or near-bipartite-cores, loosely connected

33 EigenSpokes - explanation
Near-cliques, or near-bipartite-cores, loosely connected
So what? Extract nodes with high scores: high connectivity, good “communities”
[Figure: spy plot of top 20 nodes]

34 Bipartite Communities!
patents from same inventor(s)
cut-and-paste bibliography!
[Figure: magnified bipartite community]

35 Outline
Introduction – Motivation; Problem#1: Patterns in graphs (static graphs: degree, diameter, eigen, triangles, cliques; weighted graphs; time evolving graphs); Problem#2: Tools

36 Observations on weighted graphs?
A: yes - even more ‘laws’!
M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

37 Observation W.1: fortification
Q: How do the weights of nodes relate to degree?

38 Observation W.1: fortification: Snapshot Power Law
Weight: super-linear on in-degree
exponent ‘iw’: 1.01 < iw < 1.26
More donors, even more $
e.g. John Kerry, $10M received, from 1K donors
[Plot (Orgs-Candidates): in-weights ($) vs. edges (# donors)]

39 Outline
Introduction – Motivation; Problem#1: Patterns in graphs (static graphs, weighted graphs, time evolving graphs); Problem#2: Tools

40 Problem: Time evolution
with Jure Leskovec (CMU -> Stanford) and Jon Kleinberg (Cornell – CMU)

41 T.1 Evolution of the Diameter
Prior work on Power Law graphs hints at slowly growing diameter:
diameter ~ O(log N)
diameter ~ O(log log N)
What is happening in real data?
[Notes: Diameter first, DPL second. Check diameter formulas. As the network grows the distances between nodes slowly grow.]

42 T.1 Evolution of the Diameter
Prior work on Power Law graphs hints at slowly growing diameter:
diameter ~ O(log N)
diameter ~ O(log log N)
What is happening in real data?
Diameter shrinks over time
[Notes: Diameter first, DPL second. Check diameter formulas. As the network grows the distances between nodes slowly grow.]

43 T.1 Diameter – “Patents”
Patent citation network, 25 years of data
[Plot: diameter vs. time [years]]

44 T.2 Temporal Evolution of the Graphs
N(t) … nodes at time t
E(t) … edges at time t
Suppose that N(t+1) = 2 * N(t)
Q: what is your guess for E(t+1) =? 2 * E(t)

45 T.2 Temporal Evolution of the Graphs
N(t) … nodes at time t
E(t) … edges at time t
Suppose that N(t+1) = 2 * N(t)
Q: what is your guess for E(t+1) =? 2 * E(t)
A: over-doubled! But obeying the ``Densification Power Law’’

46 T.2 Densification – Patent Citations
Citations among patents granted
1999: 2.9 million nodes, 16.5 million edges
Each year is a datapoint
[Plot: E(t) vs. N(t), slope 1.66]
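
To make "over-doubled" concrete, here is a small arithmetic sketch. It assumes the densification law has the form E(t) proportional to N(t)^a and uses the slope 1.66 from the patent-citation plot as a. The function names are illustrative.

```python
import math

DPL_EXPONENT = 1.66  # slope of the patent-citation plot on this slide


def edge_growth_factor(node_growth_factor, exponent=DPL_EXPONENT):
    """If E is proportional to N**exponent, scaling N scales E by factor**exponent."""
    return node_growth_factor ** exponent


def fitted_exponent(n_old, e_old, n_new, e_new):
    """Exponent implied by two (nodes, edges) snapshots under the same assumption."""
    return math.log(e_new / e_old) / math.log(n_new / n_old)


# Doubling the nodes multiplies the edges by 2**1.66, roughly 3.16.
doubling = edge_growth_factor(2.0)
```

Under this assumption the edge count more than triples each time the node count doubles, which is the answer to the quiz on the previous slides.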

47 Outline
Introduction – Motivation; Problem#1: Patterns in graphs (static graphs, weighted graphs, time evolving graphs); Problem#2: Tools

48 More on Time-evolving graphs
M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

49 Observation T.3: NLCC behavior
Q: How do NLCC’s emerge and join with the GCC?
(``NLCC’’ = non-largest conn. components)
Do they continue to grow in size?
or do they shrink?
or stabilize?

50 Observation T.3: NLCC behavior
After the gelling point, the GCC takes off, but NLCC’s remain ~constant (actually, oscillate).
[Plot (IMDB): CC size vs. time-stamp]

51 Timing for Blogs
with Mary McGlohon (CMU), Jure Leskovec (CMU->Stanford), Natalie Glance (now at Google), Mat Hurst (now at MSR) [SDM’07]

52 T.4 : popularity over time
Post popularity drops-off – exponentially?
[Plot: # in links vs. lag (days after post); post at @t, links at @t + lag]

53 T.4 : popularity over time
Post popularity drops-off – exponentially? POWER LAW! Exponent?
[Plot: # in links (log) vs. days after post (log)]

54 T.4 : popularity over time
Post popularity drops-off – exponentially? POWER LAW! Exponent? -1.6
close to -1.5: Barabasi’s stack model
and like the zero-crossings of a random walk
[Plot: # in links (log) vs. days after post (log), slope -1.6]

55 Outline
Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools (CenterPiece Subgraphs, OddBall (anomaly detection), PEGASUS); Problem#3: Scalability; Conclusions

56 CenterPiece Subgraphs
Hanghang TONG et al, KDD’06

57 Center-Piece Subgraph Discovery [Tong+ KDD 06]
Input: original graph [Figure: nodes A, B, C]
Q: Who is the most central node wrt the black nodes? (e.g., master-mind criminal, common advisor/collaborator, etc)
[Notes: In the first application, the ceps]

58 Center-Piece Subgraph Discovery [Tong+ KDD 06]
Input: original graph [Figure: nodes A, B, C]
Output: CePS [Figure: nodes A, B, C and the CePS node]
Q: How to find hub for the query nodes?
A: Combine proximity scores (RWR)
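
A rough sketch of "combine proximity scores" follows. It runs a random walk with restart (RWR) from each query node and, as an assumption made for this illustration, multiplies the per-query scores so that only nodes close to all query nodes score well. The restart probability 0.15, the iteration count, the function names, and the toy graph are hypothetical and are not taken from the CePS paper or code.

```python
def rwr_scores(neighbors, seed, restart=0.15, iterations=100):
    """Random walk with restart from `seed`, computed by fixed-point iteration."""
    n = len(neighbors)
    scores = [0.0] * n
    scores[seed] = 1.0
    for _ in range(iterations):
        nxt = [0.0] * n
        for u, nbrs in enumerate(neighbors):
            # The walker at u moves to a uniformly chosen neighbor.
            share = (1.0 - restart) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        # With probability `restart` the walker jumps back to the seed.
        nxt[seed] += restart
        scores = nxt
    return scores


def center_piece(neighbors, queries):
    """Non-query node with the largest product of per-query RWR scores."""
    per_query = [rwr_scores(neighbors, q) for q in queries]
    combined = {}
    for node in range(len(neighbors)):
        if node in queries:
            continue
        product = 1.0
        for scores in per_query:
            product *= scores[node]
        combined[node] = product
    return max(combined, key=combined.get)


# Toy graph: query nodes 0, 1, 2 all touch node 3; node 4 hangs off node 0 only.
neighbors = [[3, 4], [3], [3], [0, 1, 2], [0]]
best = center_piece(neighbors, queries=[0, 1, 2])
```

Node 3 wins because it is one hop from every query node, while node 4 is close to only one of them.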

59 CePS: Example (AND Query)
DBLP co-authorship network: 400,000 authors, 2,000,000 edges
[Figure annotation: ?]
Code at:

60 CePS: Example (AND Query)
DBLP co-authorship network: 400,000 authors, 2,000,000 edges
Code at:

61 Outline
Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools (CenterPiece Subgraphs, OddBall (anomaly detection), PEGASUS); Problem#3: Scalability; Conclusions

62 OddBall: Spotting Anomalies in Weighted Graphs
Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science To appear in PAKDD 2010, Hyderabad, India

63 Main idea
For each node, extract ‘ego-net’ (=1-step-away neighbors)
Extract features (#edges, total weight, etc.)
Compare with the rest of the population
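
The first two steps can be sketched in a few lines of Python. The sketch assumes an unweighted, undirected graph stored as a dict of neighbor sets, and it computes only two features: the number of neighbors and the number of edges inside the ego-net. The function name `egonet_features` and both toy graphs are illustrative, not taken from the OddBall code.

```python
def egonet_features(neighbors, ego):
    """Return (number of neighbors, number of edges) for the ego-net of `ego`."""
    # The ego-net is the ego, its 1-step neighbors, and all edges among them.
    members = neighbors[ego] | {ego}
    edge_count = sum(1 for u in members for v in neighbors[u]
                     if v in members and u < v)  # u < v counts each edge once
    return len(neighbors[ego]), edge_count


# Two extreme ego-nets with the same number of neighbors.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
clique = {u: {v for v in range(5) if v != u} for u in range(5)}

star_features = egonet_features(star, 0)      # (4, 4): neighbors are not linked
clique_features = egonet_features(clique, 0)  # (4, 10): all neighbors are linked
```

The third step, comparing against the rest of the population, would then look for nodes whose feature pairs sit far from the bulk of the other nodes.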

64 What is an egonet?
[Figure: an ego and its egonet]
[Notes: Mention the ball-like look of egonets, hence the name OddBall. Obtain a new graph – a weighted one!! Show the ego in orange.]

65 Selected Features
Ni: number of neighbors (degree) of ego i
Ei: number of edges in egonet i
Wi: total weight of egonet i
λw,i: principal eigenvalue of the weighted adjacency matrix of egonet i
[Notes: This slide should talk about features, but should be very attention-grabbing. 1 – add a star figure after 1-2. 2 – add a high-total-weight figure after 3 (will relate 2-3). 3 – add a dominant-heavy-link figure after 4 (will relate 3-4). Here explain why eigenvalues: show a star with N edges, where sqrt(N) is the eigenvalue; show a two-node, single-edge graph with edge weight W, where the eigenvalue is W; show a graph with a heavy link, where the eigenvalue is ~ W. Show these real examples.]
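
The two eigenvalue facts in the notes can be verified directly without an eigen-solver. The sketch below checks that A v = λ v holds for a hand-picked vector v. Because each v has only positive entries and each graph is connected, the Perron-Frobenius theorem implies that λ is the principal eigenvalue. Helper names and the sample values N = 9 and W = 7.5 are illustrative.

```python
import math


def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def is_eigenpair(matrix, value, vector):
    """True if matrix @ vector equals value * vector, up to rounding."""
    product = matvec(matrix, vector)
    return all(math.isclose(lhs, value * x, abs_tol=1e-12)
               for lhs, x in zip(product, vector))


def star_adjacency(n_leaves):
    """Adjacency matrix of a star: node 0 is the center, the rest are leaves."""
    size = n_leaves + 1
    a = [[0.0] * size for _ in range(size)]
    for leaf in range(1, size):
        a[0][leaf] = a[leaf][0] = 1.0
    return a


# Star with N edges: eigenvalue sqrt(N), eigenvector (sqrt(N), 1, ..., 1).
N = 9
star_ok = is_eigenpair(star_adjacency(N), math.sqrt(N),
                       [math.sqrt(N)] + [1.0] * N)

# Two nodes joined by one edge of weight W: eigenvalue W, eigenvector (1, 1).
W = 7.5
pair_ok = is_eigenpair([[0.0, W], [W, 0.0]], W, [1.0, 1.0])
```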

66 Near-Clique/Star
SOME OLD RULES

67 Near-Clique/Star
SOME OLD RULES

68 Outline
Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools (CenterPiece Subgraphs, OddBall (anomaly detection), PEGASUS); Problem#3: Scalability; Conclusions

69 Outline – Algorithms & results
[Status table. Columns: Centralized, Hadoop/PEGASUS. Rows: Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization. Status labels: old, DONE, STARTED]

70 HADI for diameter estimation
Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
Our HADI: linear on E (~10B)
Near-linear scalability wrt # machines
Several optimizations -> 5x faster

71 YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
Largest publicly available graph ever studied.
[Plot: count vs. radius; annotations: ‘??’, ‘19+? [Barabasi+]’]

72 YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
effective diameter: surprisingly small.
Multi-modality: probably mixture of cores.

73 YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
effective diameter: surprisingly small.
Multi-modality: probably mixture of cores.

74 Radius Plot of GCC of YahooWeb.

75 Running time - Kronecker and Erdos-Renyi Graphs with billions of edges.

76 Outline – Algorithms & results
[Status table. Columns: Centralized, Hadoop/PEGASUS. Rows: Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization. Status labels: old, DONE, STARTED]

77 Generalized Iterated Matrix Vector Multiplication (GIMV)
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up).

78 Generalized Iterated Matrix Vector Multiplication (GIMV)
PageRank
proximity (RWR)
Diameter
Connected components
(eigenvectors, Belief Prop. … )
Common core: matrix – vector multiplication (iterated)

79 Example: GIM-V At Work
Connected Components
[Plot: count vs. size]

80 Example: GIM-V At Work
Connected Components
[Plot: count vs. size; annotations: ‘300-size cmpt. Why?’, ‘1100-size cmpt X 65. Why?’]

81 Example: GIM-V At Work
Connected Components
[Plot: count vs. size; annotation: suspicious financial-advice sites (not existing now)]

82 GIM-V At Work
Connected Component over Time
LinkedIn: 7.5M nodes and 58M edges
Stable tail slope after the gelling point

83 Outline – Algorithms & results
[Status table. Columns: Centralized, Hadoop/PEGASUS. Rows: Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization. Status labels: old, DONE, STARTED]

84 Triangles : Computations [Tsourakakis ICDM 2008]
(mentioned already)
But: triangles are expensive to compute (3-way join; several approx. algos)
Q: Can we do that quickly?
A: Yes! #triangles = \frac{1}{6} \sum_i \lambda_i^3
(and, because of skewness, we only need the top few eigenvalues!)

85 Triangle Law: #1 [Tsourakakis ICDM 2008]
(mentioned already)
[Plots: HEP-TH, ASN, Epinions. X-axis: # of triangles a node participates in. Y-axis: count of such nodes]

86 Outline – Algorithms & results
[Status table. Columns: Centralized, Hadoop/PEGASUS. Rows: Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization. Status labels: old, DONE, STARTED]

87 Visualization: ShiftR
Supporting Ad Hoc Sensemaking: Integrating Cognitive, HCI, and Data Mining Approaches. Aniket Kittur, Duen Horng (‘Polo’) Chau, Christos Faloutsos, Jason I. Hong. Sensemaking Workshop at CHI 2009, April 4-5. Boston, MA, USA.

88 Supporting Sensemaking in Large Graphs
Shiftr supports sensemaking in large graphs. It supports user-directed sensemaking of graph data: the user can create arbitrary, potentially competing groups to categorize some of the nodes of interest, and Shiftr then recommends more relevant nodes using the Belief Propagation algorithm. This screenshot of Shiftr shows citation data around the article “The Cost Structure of Sensemaking”, where the user tries to visualize and make sense of the research areas relevant to this article. The user has identified four areas: sensemaking (blue), information visualization (orange), web revisitation (green), and collaborative search (red). Shiftr helps the user find more relevant articles in each of the areas, some of which are simultaneously relevant to more than one area, and visualize connections among the articles.

89 Outline
Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools; Problem#3: Scalability (main idea, performance results); Conclusions

90 Motivation
Goal: How can we find the diameter, connected components, RWR scores, and PageRanks of graphs with billions of nodes and edges?
YahooWeb: |V|: 1.4 B, |E|: 6.6 B, File: 120 Gb
Yahoo had 5Pb of data [Fayyad, KDD’07]

91 Our approach
Hadoop: parallel processing, easy to use, scalable
Unified framework (GIM-V) for several algorithms

92 Main Idea
Plain M-V multiplication
[Figure labels: i, j, k, m_i,j]

93 Main Idea
Plain M-V multiplication
Three implicit operations here:
combine2: multiply m_i,j and v_j
combineAll: sum n multiplication results
assign: update v_i with v_i’
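
To make the three implicit operations visible, the sketch below writes an ordinary sparse matrix-vector product as explicit calls to `combine2`, `combine_all` and `assign`. The operation names follow the slide; the dictionary layout `{(i, j): m_ij}` and the function `matvec_by_operations` are assumptions made for this illustration.

```python
def combine2(m_ij, v_j):
    """Multiply one matrix entry by the matching vector entry."""
    return m_ij * v_j


def combine_all(partials):
    """Sum the partial products that belong to one row."""
    return sum(partials)


def assign(v_i_old, v_i_new):
    """Plain multiplication simply overwrites the old value."""
    return v_i_new


def matvec_by_operations(entries, v):
    """Sparse M x v written as combine2 -> combineAll -> assign."""
    partials = [[] for _ in v]
    for (i, j), m_ij in entries.items():
        partials[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partials[i])) for i in range(len(v))]


# M = [[1, 2], [0, 3]] stored sparsely, v = [4, 5].
entries = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
result = matvec_by_operations(entries, [4, 5])  # [1*4 + 2*5, 3*5]
```

Swapping in different bodies for the three functions changes the algorithm while the surrounding loop stays the same.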

94 Main Idea
GIM-V: customize the three operations combine2, combineAll, assign
Matrix represents edges (strength of connection (src, dst))
Vector represents some value of nodes (e.g., ‘importance’ or PageRank)

95 Main Idea
GIM-V: Connected Component
HCC: Hadoop Connected Component
How many connected components? Which node belongs to which component?
[Figure: 8-node example graph with components A, B, C]
node  component  component id
1     A          1
2     A          1
3     A          1
4     A          1
5     B          5
6     B          5
7     C          7
8     C          7

96 Main Idea
GIM-V: HCC
Component vector c satisfies
HCC in terms of GIM-V
Initialize c_i to i
Do until c converges
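
The loop above can be sketched with a generic GIM-V pass. In this sketch `combine2` multiplies the edge value 1 by the neighbor's component id, `combineAll` is `min`, and `assign` keeps the smaller of the old and new ids. That choice matches the min(c_i, min(...)) updates shown for the 8-node example, but the code itself is an in-memory illustration with hypothetical function names, not the Hadoop implementation.

```python
def gimv(matrix, v, combine2, combine_all, assign):
    """One GIM-V pass over a sparse matrix stored as {(i, j): m_ij}."""
    partials = {i: [] for i in v}
    for (i, j), m_ij in matrix.items():
        partials[i].append(combine2(m_ij, v[j]))
    # Nodes without edges keep their current value.
    return {i: (assign(v[i], combine_all(partials[i])) if partials[i] else v[i])
            for i in v}


def hcc(undirected_edges, nodes):
    """Label every node with the smallest node id in its connected component."""
    matrix = {}
    for a, b in undirected_edges:
        matrix[(a, b)] = 1
        matrix[(b, a)] = 1
    c = {i: i for i in nodes}  # initialize c_i to i
    passes = 0
    while True:                # do until c converges
        updated = gimv(matrix, c,
                       combine2=lambda m_ij, c_j: m_ij * c_j,
                       combine_all=min,
                       assign=min)
        passes += 1
        if updated == c:
            return c, passes
        c = updated


# The 8-node example: path 1-2-3-4, edge 5-6, edge 7-8.
edges = [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
components, passes = hcc(edges, range(1, 9))
```

Here `passes` counts the final pass that detects convergence, so the longest path (1-2-3-4) needs three changing passes plus one confirming pass.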

97 Main Idea
GIM-V: Connected Component
HCC: Hadoop Connected Component
[Figure: 8 x 8 adjacency matrix of the example graph (components A, B, C) multiplied by the component id vector]

98 Main Idea
GIM-V in HCC
node  c_i  update             new c_i
1     1    min(1, min(2))     1
2     2    min(2, min(1,3))   1
3     3    min(3, min(2,4))   2
4     4    min(4, min(3))     3
5     5    min(5, min(6))     5
6     6    min(6, min(5))     5
7     7    min(7, min(8))     7
8     8    min(8, min(7))     7

99 Main Idea
GIM-V in HCC
[Same table as slide 98]
k applications of GIM-V with the MIN() operation = find the minimum node id within k hops

100 Main Idea
GIM-V in HCC
node  c_i  update             iter 1  iter 2  iter 3
1     1    min(1, min(2))     1       1       1
2     2    min(2, min(1,3))   1       1       1
3     3    min(3, min(2,4))   2       1       1
4     4    min(4, min(3))     3       2       1
5     5    min(5, min(6))     5       5       5
6     6    min(6, min(5))     5       5       5
7     7    min(7, min(8))     7       7       7
8     8    min(8, min(7))     7       7       7

101 Main Idea
GIM-V in HCC
[Same table as slide 100]
Maximum # of iterations: |d|

102 Outline – Algorithms & results
[Status table. Columns: Centralized, Hadoop/PEGASUS. Rows: Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization. Status labels: old, DONE, STARTED. Additional labels: easy, IMV, GIMV, GIMV, IMV]

103 Outline
Goal; Main Idea: GIM-V; Fast Algorithm for GIM-V (GIM-V BASE, BL, CL, DI); Performance; GIM-V At Work; Conclusion

104 Fast Algorithms For GIM-V
Naïve-Method (GIM-V BASE)
Input: Matrix(src,dst) and Vector(id,val)
First Job: combine2()
- Join M and V using M.dst and V.id
- Output M.src, V.val
Second Job: combineAll(), assign()
- Aggregate (M.src, V.val) by M.src
- Output(M.src, min(V.val1, V.val2,…))
[Figure: the 8 x 8 example matrix (rows: src, columns: dst) and the vector]
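
The two jobs can be imitated in memory to see what each one emits. The sketch below is a stand-in for the Hadoop jobs and not the PEGASUS code: the first function performs the join on M.dst = V.id, and the second groups by M.src and takes the minimum. Including the node's own old value in that minimum is an assumption that corresponds to the HCC assign step. Function names and the list-of-tuples layout are illustrative.

```python
from collections import defaultdict


def job1_combine2(matrix, vector):
    """Join Matrix(src, dst) with Vector(id, val) on dst == id; emit (src, val)."""
    val_by_id = dict(vector)
    return [(src, val_by_id[dst]) for src, dst in matrix]


def job2_combine_all_assign(joined, vector):
    """Group the joined pairs by src and keep the minimum value per node."""
    grouped = defaultdict(list)
    for src, val in joined:
        grouped[src].append(val)
    # The old value takes part in the minimum (the HCC assign step).
    return [(node, min([old] + grouped[node])) for node, old in vector]


# The 8-node example; every undirected edge is stored in both directions.
matrix = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3),
          (5, 6), (6, 5), (7, 8), (8, 7)]
vector = [(i, i) for i in range(1, 9)]

joined = job1_combine2(matrix, vector)
after_one_iteration = job2_combine_all_assign(joined, vector)
```

Running the pair of jobs repeatedly, feeding each output vector back in as the next input, gives the iterated multiplication.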

105 Fast Algorithms For GIM-V
Block-Method (GIM-V BL)
Group matrix, vectors into blocks
Do the multiplication on blocks
[Figure: the 8 x 8 example matrix and the vector, grouped into blocks]

106 Fast Algorithms For GIM-V
Block-Method (GIM-V BL)
Group matrix, vectors into blocks
Do the multiplication on blocks
Decrease Sorting Time
Decrease File Size
[Figure: the 8 x 8 example matrix and the vector, grouped into blocks]

107 Fast Algorithms For GIM-V
Clustering (GIM-V CL)
[Figure: the example matrix before and after clustering (nodes reordered)]

108 Fast Algorithms For GIM-V
Diagonal Block Iteration (GIM-V DI)
Multiply diagonal edges and vectors as much as possible in one iteration
[Figure: the 8 x 8 example matrix with the vector after successive multiplications]

109 Outline
Goal; Main Idea: GIM-V; Fast Algorithm for GIM-V (GIM-V BASE, BL, CL, DI); Performance; GIM-V At Work; Conclusion

110 Performance
GIM-V BL-CL is >= 5 times faster than GIM-V BASE!

111 Scalability w/ edges
Linear on the number of edges

112 Outline
Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools; Problem#3: Scalability; Conclusions

113 OVERALL CONCLUSIONS – low level:
Several new patterns (fortification, triangle-laws, etc)
New tools: CenterPiece Subgraphs, anomaly detection (OddBall), PEGASUS
Scalability: PEGASUS / hadoop

114 OVERALL CONCLUSIONS – high level
Large datasets may reveal patterns/outliers that would be invisible otherwise
Unprecedented opportunities:
huge datasets, available on-line
Amazing h/w and s/w developments

115 References
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong.
Hanghang Tong and Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.

116 References
T. G. Kolda and J. Sun. Scalable Tensor Decompositions for Multi-aspect Data Mining. In: ICDM 2008, pp , December 2008.

117 References
Jure Leskovec, Jon Kleinberg and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL. ("Best Research Paper" award).

118 References
Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM, Minneapolis, Minnesota, Apr 2007.
Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. GraphScope: Parameter-free Mining of Large Time-evolving Graphs. ACM SIGKDD Conference, San Jose, CA, August 2007.

119 Goal
One-stop shopping for large graph mining
Chau, Polo; McGlohon, Mary; Tsourakakis, Babis; Akoglu, Leman; Prakash, Aditya; Tong, Hanghang; Kang, U
Thanks to: NSF IIS-0705359, IIS-0534205, Yahoo (M45), LLNL, IBM, SPRINT, INTEL, HP

