Algorithms for Large Graph Mining

Presentation transcript:

Algorithms for Large Graph Mining. Christos Faloutsos, CMU

Our goal: One-stop solution for mining huge graphs: PEGASUS project (PEta GrAph mining System), www.cs.cmu.edu/~pegasus. Open-source code and papers. CMU 2010 C. Faloutsos (CMU)

Outline: Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools; Problem#3: Scalability; Conclusions

Graphs - why should we care? [Figures: Internet Map [lumeta.com]; Food Web [Martinez ’91]; Friendship Network [Moody ’01]; Protein Interactions [genomebiology.com]]

Graphs - why should we care? IR: bipartite graphs (doc-terms) [Figure: documents D1 ... DN linked to terms T1 ... TM]; web: hyper-text graph; ... and more

Graphs - why should we care? Network of companies & board-of-directors members; ‘viral’ marketing; web-log (‘blog’) news propagation; computer network security: email/IP traffic and anomaly detection; ....

Outline: Introduction – Motivation; Problem#1: Patterns in graphs (Static graphs; Weighted graphs; Time evolving graphs); Problem#2: Tools; Problem#3: Scalability; Conclusions

Problem #1 - network and graph mining: What does the Internet look like? What does the web look like? What is ‘normal’/‘abnormal’? Which patterns/laws hold? We cannot spot anomalies without first discovering patterns. Large datasets reveal patterns that may be invisible otherwise…

Graph mining: Are real graphs random?

Laws and patterns: Are real graphs random? A: NO!! Diameter; in- and out-degree distributions; other (surprising) patterns. So, let’s look at the data

Solution# S.1: Power law in the degree distribution [SIGCOMM99]. [Figure: internet domains, log(degree) vs. log(rank); slope -0.82; labeled points: att.com, ibm.com]

Solution# S.2: Eigen Exponent E. [Figure: eigenvalue vs. rank of decreasing eigenvalue, May 2001; Eigenvalue Exponent = slope, E = -0.48] A2: power law in the eigenvalues of the adjacency matrix. [Mihail, Papadimitriou ’02]: slope is ½ of rank exponent

But: How about graphs from other domains?

More power laws: web hit counts [w/ A. Montgomery]. [Figure: Web Site Traffic (users, sites); count (log scale) vs. in-degree (log scale); Zipf; labeled point: ``ebay’’]

epinions.com who-trusts-whom [Richardson + Domingos, KDD 2001]. [Figure: count vs. user (out) degree; labeled point: trusts-2000-people]

Outline: Introduction – Motivation; Problem#1: Patterns in graphs (Static graphs: degree, diameter, eigen, triangles, cliques; Weighted graphs; Time evolving graphs); Problem#2: Tools

Solution# S.3: Triangle ‘Laws’. Real social networks have a lot of triangles: friends of friends are friends. Any patterns?

Triangle Law: #S.3 [Tsourakakis ICDM 2008]. [Figure panels: HEP-TH, ASN, Epinions; X-axis: # of triangles a node participates in; Y-axis: count of such nodes]

Triangle Law: #S.4 [Tsourakakis ICDM 2008]. [Figure panels: Reuters, SN, Epinions; X-axis: degree; Y-axis: mean # triangles] n friends -> ~n^1.6 triangles

Triangle Law: Computations [Tsourakakis ICDM 2008] (details). But: triangles are expensive to compute (3-way join; several approx. algos). Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum(λ_i^3), where the λ_i are the eigenvalues of the adjacency matrix (and, because of skewness, we only need the top few eigenvalues!)

Triangle Law: Computations [Tsourakakis ICDM 2008] (details): 1000x+ speed-up, >90% accuracy
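The eigenvalue identity for triangle counts can be checked on a toy graph. What follows is a minimal sketch, not the implementation from [Tsourakakis ICDM 2008]: it assumes NumPy is available, it computes the full spectrum with a dense solver for simplicity (a scalable version would compute only the top few eigenvalues iteratively), and the function names and the toy graph are made up for illustration.

```python
import numpy as np

def triangles_exact(adj):
    """Total triangle count from the full spectrum: (1/6) * sum(lambda_i ** 3)."""
    eigenvalues = np.linalg.eigvalsh(adj)  # adj must be a symmetric 0/1 matrix
    return float(np.sum(eigenvalues ** 3) / 6.0)

def triangles_top_k(adj, k):
    """Approximate count that keeps only the k eigenvalues of largest magnitude."""
    eigenvalues = np.linalg.eigvalsh(adj)
    top = sorted(eigenvalues, key=abs, reverse=True)[:k]
    return float(sum(v ** 3 for v in top) / 6.0)

# Hypothetical toy graph: a 4-clique (4 triangles) plus one pendant node.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

print("exact:", round(triangles_exact(A), 6))
print("top-1 approximation:", triangles_top_k(A, 1))
```

On a graph this small the top-1 estimate is only rough; the speed/accuracy figures quoted on the slide refer to large real graphs with skewed spectra, not to this toy example.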

EigenSpokes: B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.

EigenSpokes: Eigenvectors of adjacency matrix; equivalent to singular vectors (symmetric, undirected graph): A = U \Sigma U^T. [Figure: columns \vec{u}_1, \vec{u}_i of U]

EigenSpokes: EE plot: scatter plot of scores of u1 vs u2. One would expect: many points @ origin; a few scattered ~randomly. [Figure: axes u1, u2; annotation: 90°]

EigenSpokes - pervasiveness: Present in mobile social graph, across time and space; patent citation graph

EigenSpokes - explanation: Near-cliques, or near-bipartite-cores, loosely connected. So what? Extract nodes with high scores: high connectivity, good “communities”. [Figure: spy plot of top 20 nodes]

Bipartite Communities! Patents from same inventor(s); cut-and-paste bibliography! [Figure: magnified bipartite community]

Outline: Introduction – Motivation; Problem#1: Patterns in graphs (Static graphs: degree, diameter, eigen, triangles, cliques; Weighted graphs; Time evolving graphs); Problem#2: Tools

Observations on weighted graphs? A: yes - even more ‘laws’! M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

Observation W.1: fortification. Q: How do the weights of nodes relate to degree?

Observation W.1: fortification: Snapshot Power Law. Weight: super-linear on in-degree; exponent ‘iw’: 1.01 < iw < 1.26. More donors, even more $. [Figure: Orgs-Candidates; in-weights ($) vs. edges (# donors); e.g. John Kerry, $10M received, from 1K donors]

Outline: Introduction – Motivation; Problem#1: Patterns in graphs (Static graphs; Weighted graphs; Time evolving graphs); Problem#2: Tools; …

Problem: Time evolution. With Jure Leskovec (CMU -> Stanford) and Jon Kleinberg (Cornell – sabb. @ CMU)

T.1 Evolution of the Diameter: Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N); diameter ~ O(log log N). That is, as the network grows, the distances between nodes are expected to grow slowly. What is happening in real data? Diameter shrinks over time

T.1 Diameter – “Patents”: Patent citation network, 25 years of data. [Figure: diameter vs. time [years]]

T.2 Temporal Evolution of the Graphs: N(t) … nodes at time t; E(t) … edges at time t. Suppose that N(t+1) = 2 * N(t). Q: what is your guess for E(t+1)? Is it 2 * E(t)? A: over-doubled! But obeying the ``Densification Power Law’’

T.2 Densification – Patent Citations: Citations among patents granted; 1999: 2.9 million nodes, 16.5 million edges. Each year is a datapoint. [Figure: E(t) vs. N(t); exponent 1.66]
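To make "over-doubled" concrete: under a densification power law E(t) ∝ N(t)^a with a = 1.66, doubling the nodes multiplies the edges by 2^1.66 ≈ 3.16. The sketch below shows one way such an exponent could be estimated, as the least-squares slope in log-log space. The data points are synthetic (generated from the exponent itself, not the patent data), and the function name is an assumption for illustration.

```python
import math

def fit_power_law_exponent(nodes, edges):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic snapshots: node count doubles each step, edges follow N ** 1.66.
nodes = [1000 * 2 ** t for t in range(6)]
edges = [n ** 1.66 for n in nodes]

a = fit_power_law_exponent(nodes, edges)
growth = 2 ** a  # factor by which edges grow when nodes double
print(round(a, 2), round(growth, 2))
```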

Outline: Introduction – Motivation; Problem#1: Patterns in graphs (Static graphs; Weighted graphs; Time evolving graphs); Problem#2: Tools; …

More on Time-evolving graphs: M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

Observation T.3: NLCC behavior. Q: How do NLCC’s emerge and join with the GCC? (``NLCC’’ = non-largest conn. components) Do they continue to grow in size? Or do they shrink? Or stabilize?

Observation T.3: NLCC behavior. After the gelling point, the GCC takes off, but NLCC’s remain ~constant (actually, oscillate). [Figure: IMDB; CC size vs. time-stamp]

Timing for Blogs: with Mary McGlohon (CMU), Jure Leskovec (CMU->Stanford), Natalie Glance (now at Google), Mat Hurst (now at MSR) [SDM’07]

T.4: popularity over time. [Figure: # in links vs. lag (days after post); a post at time t, links arriving at t + lag] Post popularity drops off – exponentially? [Figure: # in links (log) vs. days after post (log); slope -1.6] POWER LAW! Exponent? -1.6, close to -1.5: Barabasi’s stack model, and like the zero-crossings of a random walk

Outline: Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools (CenterPiece Subgraphs; OddBall (anomaly detection); PEGASUS); Problem#3: Scalability; Conclusions

CenterPiece Subgraphs: Hanghang TONG et al, KDD’06

Center-Piece Subgraph Discovery [Tong+ KDD 06]. Input: original graph with query nodes A, B, C (black nodes). Q: Who is the most central node wrt the black nodes? (e.g., master-mind criminal, common advisor/collaborator, etc)

Center-Piece Subgraph Discovery [Tong+ KDD 06]. Input: original graph (query nodes A, B, C). Output: CePS (with the CePS node). Q: How to find hub for the query nodes? A: Combine proximity scores (RWR)
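A rough illustration of "combine proximity scores (RWR)": compute one random-walk-with-restart score vector per query node, then combine them per candidate node. This is a sketch under stated assumptions, not the CePS code linked from these slides: the restart probability 0.15, the plain power iteration, the choice to combine scores by multiplication (one plausible reading of an AND query), and the toy graph are all illustrative.

```python
def rwr(neighbors, seed, restart=0.15, iters=200):
    """Random walk with restart from `seed` on an unweighted, undirected graph."""
    score = {v: 0.0 for v in neighbors}
    score[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in neighbors}
        for u, nbrs in neighbors.items():
            share = (1.0 - restart) * score[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[seed] += restart  # with probability `restart`, jump back to the seed
        score = nxt
    return score

def center_piece(neighbors, queries):
    """Non-query node with the largest product of per-query RWR scores."""
    per_query = [rwr(neighbors, q) for q in queries]
    best, best_score = None, -1.0
    for v in neighbors:
        if v in queries:
            continue
        combined = 1.0
        for scores in per_query:
            combined *= scores[v]
        if combined > best_score:
            best, best_score = v, combined
    return best

# Hypothetical toy graph: H touches all three query nodes; P and R touch one each.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("A", "P"), ("B", "R")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, []).append(v)
    neighbors.setdefault(v, []).append(u)

hub = center_piece(neighbors, {"A", "B", "C"})
print(hub)
```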

CePS: Example (AND Query). DBLP co-authorship network: 400,000 authors, 2,000,000 edges. Code at: http://www.cs.cmu.edu/~htong/soft.htm

Outline: Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools (CenterPiece Subgraphs; OddBall (anomaly detection); PEGASUS); Problem#3: Scalability; Conclusions

OddBall: Spotting Anomalies in Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science To appear in PAKDD 2010, Hyderabad, India

Main idea: For each node, extract ‘ego-net’ (=1-step-away neighbors); extract features (#edges, total weight, etc.); compare with the rest of the population

What is an egonet? [Figure: an ego (shown in orange) and its egonet] Egonets look like balls, hence the name OddBall. From the egonets we obtain a new, weighted graph.

Selected Features: Ni: number of neighbors (degree) of ego i; Ei: number of edges in egonet i; Wi: total weight of egonet i; λw,i: principal eigenvalue of the weighted adjacency matrix of egonet i. Why eigenvalues? A star with N edges has principal eigenvalue sqrt(N); a two-node graph with a single edge of weight W has eigenvalue W; a graph with one dominant heavy link has eigenvalue ~ W.
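The first three egonet features are simple counts, and a small sketch may help fix the definitions. The code below is illustrative only and is not the OddBall implementation: the dict-of-dicts graph layout, the helper names, and the two toy graphs are assumptions, and the eigenvalue feature λw,i is left out to keep the example dependency-free.

```python
from itertools import combinations

def build_graph(weighted_edges):
    """Undirected weighted graph as a dict of dicts: adj[u][v] = weight."""
    adj = {}
    for u, v, w in weighted_edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def egonet_features(adj, ego):
    """Return (N_i, E_i, W_i) for the egonet of `ego`."""
    members = set(adj[ego]) | {ego}      # ego plus its 1-step neighbors
    n_neighbors = len(adj[ego])
    n_edges = 0
    total_weight = 0.0
    for u in members:
        for v, w in adj[u].items():
            if v in members and u < v:   # count each undirected edge once
                n_edges += 1
                total_weight += w
    return n_neighbors, n_edges, total_weight

# A star (ego "s" with four leaves) and a 4-clique with edge weight 2.0.
star = build_graph([("s", leaf, 1.0) for leaf in "abcd"])
clique = build_graph([(u, v, 2.0) for u, v in combinations("wxyz", 2)])

print(egonet_features(star, "s"))    # star: E_i equals N_i
print(egonet_features(clique, "w"))  # clique: E_i equals N_i * (N_i + 1) / 2
```

Stars and cliques sit at the two extremes of the edge count for a given number of neighbors, which is why comparing these features across the population can flag near-star and near-clique egonets.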

Near-Clique/Star (some old rules)

Outline: Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools (CenterPiece Subgraphs; OddBall (anomaly detection); PEGASUS); Problem#3: Scalability; Conclusions

Outline – Algorithms & results. [Table: columns Centralized, Hadoop/PEGASUS; rows Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization; status labels: old, DONE, STARTED]

HADI for diameter estimation. Radius Plots for Mining Tera-byte Scale Graphs, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10. Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B). Our HADI: linear on E (~10B). Near-linear scalability wrt # machines. Several optimizations -> 5x faster

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges). Largest publicly available graph ever studied. [Figure: count vs. radius; annotation: ?? 19+? [Barabasi+]]

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges): effective diameter: surprisingly small. Multi-modality: probably mixture of cores.

Radius Plot of GCC of YahooWeb.

Running time - Kronecker and Erdos-Renyi graphs with billions of edges.

Outline – Algorithms & results. [Table: columns Centralized, Hadoop/PEGASUS; rows Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization; status labels: old, DONE, STARTED]

Generalized Iterated Matrix Vector Multiplication (GIMV). PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up).

Generalized Iterated Matrix Vector Multiplication (GIMV): PageRank; proximity (RWR); Diameter; Connected components; (eigenvectors, Belief Prop. …): each is an iterated matrix-vector multiplication

Example: GIM-V At Work: Connected Components. [Figure: count vs. size of connected components] 300-size cmpt: why? 1100-size cmpt x 65: why? Suspicious financial-advice sites (not existing now)

GIM-V At Work: Connected Component over Time. LinkedIn: 7.5M nodes and 58M edges. Stable tail slope after the gelling point

Outline – Algorithms & results. [Table: columns Centralized, Hadoop/PEGASUS; rows Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization; status labels: old, DONE, STARTED]

Triangles: Computations [Tsourakakis ICDM 2008] (mentioned already). But: triangles are expensive to compute (3-way join; several approx. algos). Q: Can we do that quickly? A: Yes! #triangles = 1/6 Sum(λ_i^3), where the λ_i are the eigenvalues of the adjacency matrix (and, because of skewness, we only need the top few eigenvalues!)

Triangle Law: #1 [Tsourakakis ICDM 2008] (mentioned already). [Figure panels: HEP-TH, ASN, Epinions; X-axis: # of triangles a node participates in; Y-axis: count of such nodes]

Outline – Algorithms & results. [Table: columns Centralized, Hadoop/PEGASUS; rows Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization; status labels: old, DONE, STARTED]

Visualization: ShiftR. Supporting Ad Hoc Sensemaking: Integrating Cognitive, HCI, and Data Mining Approaches. Aniket Kittur, Duen Horng (‘Polo’) Chau, Christos Faloutsos, Jason I. Hong. Sensemaking Workshop at CHI 2009, April 4-5. Boston, MA, USA.

Supporting Sensemaking in Large Graphs: Shiftr supports user-directed sensemaking of graph data: the user can create arbitrary, potentially competing groups to categorize some of the nodes of interest, and Shiftr recommends further relevant nodes using the Belief Propagation algorithm. This screenshot of Shiftr shows citation data around the article “The Cost Structure of Sensemaking”, where the user tries to visualize and make sense of the research areas relevant to this article. The user has identified four areas: sensemaking (blue), information visualization (orange), web revisitation (green), and collaborative search (red). Shiftr helps the user find more relevant articles in each of the areas, some of which are simultaneously relevant to more than one area, and visualize connections among the articles.

Outline: Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools; Problem#3: Scalability (Main idea; Performance results); Conclusions

Motivation. Goal: How can we find the diameter, connected components, RWR scores, and PageRanks of graphs with billions of nodes and edges? YahooWeb: |V|: 1.4 B, |E|: 6.6 B, File: 120 Gb. Yahoo had 5Pb of data [Fayyad, KDD’07]

Our approach: Hadoop (parallel processing; easy to use; scalable); unified framework (GIM-V) for several algorithms

Main Idea: Plain M-V multiplication. [Figure: matrix entry m_i,j (row i, column j) multiplied with vector entry v_j] Three implicit operations here: combine2: multiply m_i,j and v_j; combineAll: sum n multiplication results; assign: update v_i with v_i’

Main Idea: GIM-V. Customize the three operations: combine2, combineAll, assign. Matrix represents edges (strength of connection (src, dst)); vector represents some value of nodes (e.g., ‘importance’ or PageRank)
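The three customizable operations can be pictured as parameters of one generic routine. The following single-machine sketch is only meant to show the shape of the idea; it is not the PEGASUS/Hadoop code, and the function name `gimv`, the sparse dict layout, and the tiny example matrix are assumptions. Plugging in multiply, sum, and overwrite recovers the plain matrix-vector product.

```python
def gimv(matrix, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step.

    matrix: dict {(i, j): m_ij} holding only the nonzero entries
    vector: dict {i: v_i}
    """
    partial = {i: [] for i in vector}
    for (i, j), m_ij in matrix.items():
        partial[i].append(combine2(m_ij, vector[j]))       # per-entry combine
    return {i: assign(vector[i], combine_all(partial[i]))  # per-row reduce + update
            for i in vector}

# Plain M-V multiplication as one instance of the three operations.
matrix = {(0, 0): 1, (0, 1): 2, (1, 1): 3}   # [[1, 2], [0, 3]]
vector = {0: 1, 1: 1}
result = gimv(
    matrix,
    vector,
    combine2=lambda m_ij, v_j: m_ij * v_j,   # multiply m_i,j and v_j
    combine_all=sum,                          # sum the partial products
    assign=lambda old, new: new,              # overwrite v_i with the new value
)
print(result)
```

Swapping `combine_all` for `min` and `assign` for a `min` of old and new values turns the same loop into the connected-components variant described on the following slides.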

Main Idea: GIM-V: Connected Component. HCC: Hadoop Connected Component. How many connected components? Which node belongs to which component? [Figure: 8-node graph with three components A = {1, 2, 3, 4}, B = {5, 6}, C = {7, 8}; each node gets a component id, either the letter or the minimum node id in its component: 1, 1, 1, 1, 5, 5, 7, 7]

Main Idea: GIM-V: HCC. Component vector c; HCC in terms of GIM-V: initialize c_i to i; do until c converges

Main Idea: GIM-V: Connected Component. HCC: Hadoop Connected Component. [Figure: 8x8 adjacency matrix of the example graph (components A, B, C) next to the initial vector (1, 2, ..., 8)]

Main Idea: GIM-V in HCC. (k) GIM-V with MIN() operation = find minimum node ids within (k) hops. Maximum # of iterations: |d|
node  first update         after 1  after 2  after 3
1     min(1, min(2))       1        1        1
2     min(2, min(1,3))     1        1        1
3     min(3, min(2,4))     2        1        1
4     min(4, min(3))       3        2        1
5     min(5, min(6))       5        5        5
6     min(6, min(5))       5        5        5
7     min(7, min(8))       7        7        7
8     min(8, min(7))       7        7        7
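The min-propagation walk-through above can be reproduced in a few lines. This is a single-machine sketch of the idea rather than the Hadoop implementation; the edge list (1-2, 2-3, 3-4, 5-6, 7-8) is read off the min(...) expressions on the slide, and the function name `hcc` is an assumption.

```python
def hcc(num_nodes, edges):
    """Label every node with the minimum node id in its connected component."""
    neighbors = {i: set() for i in range(1, num_nodes + 1)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    c = {i: i for i in neighbors}   # initialize c_i to i
    iterations = 0
    while True:
        # combineAll = min over neighbors, assign = min(old value, new value)
        new_c = {i: min([c[i]] + [c[j] for j in neighbors[i]]) for i in c}
        if new_c == c:              # converged: nothing changed in this pass
            break
        c = new_c
        iterations += 1
    return c, iterations

# Example graph from the slides: components {1,2,3,4}, {5,6}, {7,8}.
edges = [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
labels, iterations = hcc(8, edges)
print(labels, iterations)
```

On this graph three passes change the vector, which matches the longest shortest path in the example (node 4 to node 1) and the three result columns on the slide.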

Outline – Algorithms & results. [Table: columns Centralized, Hadoop/PEGASUS; rows Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization; status labels: old, DONE, STARTED; annotations: easy, IMV, GIMV, GIMV, IMV, …]

Outline: Goal; Main Idea: GIM-V; Fast Algorithm for GIM-V (GIM-V BASE, BL, CL, DI); Performance; GIM-V At Work; Conclusion

Fast Algorithms For GIM-V: Naïve-Method (GIM-V BASE). Input: Matrix(src,dst) and Vector(id,val). First Job: combine2(): join M and V using M.dst and V.id; output M.src, V.val. Second Job: combineAll(), assign(): aggregate (M.src, V.val) by M.src; output (M.src, min(V.val1, V.val2,…)). [Figure: 8x8 matrix (rows: src, columns: dst) and the vector (1, 2, ..., 8)]
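The two jobs of the naïve method can be mimicked with plain Python lists standing in for the Matrix(src,dst) and Vector(id,val) files. This is a sketch of the dataflow only, with no Hadoop involved: the function names are made up, the example edges are the ones from the HCC slides, and passing the old vector directly into the second job is a simplifying assumption about how assign() sees the previous value.

```python
from collections import defaultdict

def job1_combine2(matrix, vector):
    """Join M and V on M.dst == V.id and emit (M.src, V.val) pairs."""
    val_by_id = dict(vector)
    return [(src, val_by_id[dst]) for src, dst in matrix]

def job2_combine_all_assign(pairs, vector):
    """Group pairs by src, take the minimum, and merge it with the old value."""
    grouped = defaultdict(list)
    for src, val in pairs:
        grouped[src].append(val)
    return [(i, min([old] + grouped[i])) for i, old in vector]

# Undirected example graph stored as directed (src, dst) rows in both directions.
undirected = [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
matrix = undirected + [(v, u) for u, v in undirected]
vector = [(i, i) for i in range(1, 9)]   # initial component ids

pairs = job1_combine2(matrix, vector)
vector_after_one = job2_combine_all_assign(pairs, vector)
print(vector_after_one)
```

Running the two jobs repeatedly until the vector stops changing gives the component labels; each pass costs a full join plus a full aggregation, which is the overhead the block, clustering, and diagonal-block variants aim to reduce.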

Fast Algorithms For GIM-V: Block-Method (GIM-V BL). Group matrix, vectors into blocks; do the multiplication on blocks. Decrease sorting time; decrease file size. [Figure: blocked 8x8 matrix and vector]

Fast Algorithms For GIM-V: Clustering (GIM-V CL). [Figure: two 8x8 matrices and the example graph with relabeled nodes]

Fast Algorithms For GIM-V: Diagonal Block Iteration (GIM-V DI). Multiply diagonal edges and vectors as much as possible in one iteration. [Figure: blocked 8x8 matrix with successive vector values]

Outline: Goal; Main Idea: GIM-V; Fast Algorithm for GIM-V (GIM-V BASE, BL, CL, DI); Performance; GIM-V At Work; Conclusion

Performance: GIM-V BL-CL is >= 5 times faster than GIM-V BASE!

Scalability w/ edges: Linear on the number of edges

Outline: Introduction – Motivation; Problem#1: Patterns in graphs; Problem#2: Tools; Problem#3: Scalability; Conclusions

OVERALL CONCLUSIONS – low level: Several new patterns (fortification, triangle-laws, etc); New tools: CenterPiece Subgraphs, anomaly detection (OddBall), PEGASUS; Scalability: PEGASUS / hadoop

OVERALL CONCLUSIONS – high level: Large datasets may reveal patterns/outliers that would be invisible otherwise. Unprecedented opportunities: huge datasets, available on-line; amazing h/w and s/w developments

References: Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong. Hanghang Tong, Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.

References: T. G. Kolda and J. Sun. Scalable Tensor Decompositions for Multi-aspect Data Mining. ICDM 2008, pp. 363-372, December 2008.

References: Jure Leskovec, Jon Kleinberg and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL. ("Best Research Paper" award).

References: Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM, Minneapolis, Minnesota, Apr 2007. Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. GraphScope: Parameter-free Mining of Large Time-evolving Graphs. ACM SIGKDD Conference, San Jose, CA, August 2007.

Goal: One-stop shopping for large graph mining: www.cs.cmu.edu/~pegasus. Chau, Polo; McGlohon, Mary; Tsourakakis, Babis; Akoglu, Leman; Prakash, Aditya; Tong, Hanghang; Kang, U. Thanks to: NSF IIS-0705359, IIS-0534205, Yahoo (M45), LLNL, IBM, SPRINT, INTEL, HP