Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong Machine Learning Department Carnegie Mellon University

Presentation transcript:

Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong Machine Learning Department Carnegie Mellon University

Graphs are everywhere! Internet Map [Koren 2009], Food Web [2007], Protein Network [Salthe 2004], Social Network [Newman 2005], Web Graph, Terrorist Network [Krebs 2002]. Why Do We Care?

Research Theme: How to help users understand and utilize large graph-related data?

A1: Social Networks
– Facebook (300m users, $10bn value, $500mn revenue)
– MSN (240m users, 4.5 PB); Myspace (110m users)
– LinkedIn (50m users, $1bn value); Twitter (18m users)
How to help users explore such networks? (e.g., find strange persons, communities, locate common friends, etc.) [Figure labels: Community, Anomaly]

A2: Network Forensics [Sun+ 2007]. How to detect abnormal traffic? [Figure: a traffic graph (e.g., ibm.com, cmu.edu) and its adjacency matrix (source IP x destination IP), shown for normal traffic, port scanning, and DDoS.]

A3: Business Intelligence. How close is "IBM" to the service business over the years? [Figure: yearly graphs (2006, 2007, …) linking NY Times, Forbes, and Reuters to Hardware, Service, and IBM; plot of the proximity of "IBM" wrt "Service" by year (higher is better).] Footnote: nodes are business reviews and keywords; an edge means 'reporting'.

A4: Financial Fraud Detection [Tong+ 2007]
– 7.5% of U.S. adults lost money to financial fraud
– 50%+ of U.S. corporations lost >= $500,000 [Albrecht+ 2001]
– e.g., Enron ($70bn)
– Total cost of financial fraud: $1 trillion [Ansari 2006]
How to detect abnormal transaction patterns? (e.g., a money-laundering ring) [Figure legend: anonymous accounts; anonymous banks]

A5: Immunization. How to select the k "best" nodes for immunization? Footnote: SARS cost 700+ lives and $40+ bn.

This Talk
Querying [Goal: query complex relationships]
– Q.1. Find complex user-specific patterns;
– Q.2. Proximity tracking;
– Q.3. Answer all the above questions quickly.
Mining [Goal: find interesting patterns]
– M.1. Immunization;
– M.2. Spot anomalies.

Overview Q1 Q3 Q2 Q3 M1 M2

Overview
– Q1: CePS, iPoG (KDD06, ICDM08, CIKM09)
– Q3: FastProx (ICDM06, KAIS07, KDD07 b, ICDM08)
– Q2: pTrack/cTrack (SDM08, SAM08)
– Q3: FastProx (SDM08, SAM08)
– M1: NetShield
– M2: Colibri-S (KDD08)
– M2: Colibri-D (KDD08)

Background: Proximity Measurement. Q: How close is A to B? A.k.a. relevance, closeness, 'similarity', …

Background: Random Walk with Restart [Tong+ ICDM 2006]. [Figure: an example graph and the ranking vector of its nodes with respect to Node 4.] More red, more relevant; nearby nodes get higher scores.

Background: RWR: Think of it as a Wine Spill. 1. Spill a drop of wine on cloth. 2. It spreads/diffuses to the neighborhood.

Background: RWR: Wine Spill on a Graph. [Figure: wine spill on cloth vs. RWR on a graph from a query node.] Both follow the same diffusion equation.

Background: Random Walk with Restart: same diffusion equation.

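Background: Intuitions: Why is RWR a Good Score? Score(Red Path) = $(1-c)\, c^{6} \times W(1,3) \times W(3,4) \times \dots \times W(14,20)$ for the red path from the source (node 1) to the target (node 20). The factor $c^{6}$ penalizes the length of the path; the product of $W$ entries is the probability of traversing the path. Footnote: (1-c) is the restart probability in RWR; W is the normalized adjacency matrix of the graph.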

Background: Intuitions: Why is RWR a Good Score? Prox(1, 20) = Score(Red Path) + Score(Green Path) + Score(Yellow Path) + Score(Purple Path) + … A high proximity between source and target means many short, heavily weighted paths.
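The path-sum intuition above can be checked numerically. The following is a minimal sketch, not the author's implementation: it assumes a column-normalized W and sums the contributions of all walks of length 0, 1, 2, …, each weighted by (1-c) c^k as on the slide. The function names and the 4-node path graph are illustrative assumptions.

```python
def column_normalize(adj):
    """Divide each column by its sum so that every non-empty column sums to 1."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rwr_by_walk_length(w, source, c=0.5, max_len=200):
    """Accumulate (1-c) * c^k * W^k e over walk lengths k = 0..max_len."""
    n = len(w)
    walk = [0.0] * n          # W^k e, starting with k = 0
    walk[source] = 1.0
    scores = [0.0] * n
    weight = 1.0 - c          # (1-c) * c^k, starting with k = 0
    for _ in range(max_len + 1):
        scores = [s + weight * x for s, x in zip(scores, walk)]
        walk = matvec(w, walk)
        weight *= c
    return scores


# Hypothetical example: a path graph 0 - 1 - 2 - 3, queried from node 0.
path_graph = [[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]]
scores = rwr_by_walk_length(column_normalize(path_graph), source=0)
```

On this toy graph the scores decrease with distance from the source, which matches the "nearby nodes, higher scores" remark. Strictly speaking the sum runs over walks (nodes may repeat), which is what the matrix powers count.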


Q1: Find Complex User-Specific Patterns
– Q1.1 Center-Piece Subgraph Discovery, e.g., find the master-mind criminal given some suspects X, Y and Z?
– Q1.2 Interactive Querying (e.g., negation), e.g., find the conferences most similar to KDD, but not like ICML?
Our algorithms for Q1.1 and Q1.2 => Cyano (a real system at IBM)


Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06]. Input: the original graph. Q: Who is the most central node wrt the black nodes? (e.g., master-mind criminal, common advisor/collaborator, etc.)

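Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06]. Q: How to find the hub for the black nodes? Input: the original graph with black query nodes A, B, C. Output: the CePS node (red) and the center-piece subgraph (CePS). Our solution: pick the red node that maximizes Prox(A, Red) x Prox(B, Red) x Prox(C, Red).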
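The product rule above can be sketched as follows. This is only an illustration of the scoring step, under the assumption that Prox is computed by RWR with plain power iteration; the full CePS method also extracts a connecting subgraph, which is not shown. The function names and the toy graph are hypothetical.

```python
def rwr(adj, source, c=0.5, iters=200):
    """RWR scores from `source` by power iteration; (1-c) is the restart probability."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    r = [0.0] * n
    r[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                if adj[i][j]:
                    nxt[i] += c * adj[i][j] / col_sums[j] * r[j]
        nxt[source] += 1.0 - c
        r = nxt
    return r


def center_piece_node(adj, queries, c=0.5):
    """Return the non-query node maximizing the product of proximities (AND query)."""
    prox = [rwr(adj, q, c) for q in queries]
    best, best_score = None, -1.0
    for node in range(len(adj)):
        if node in queries:
            continue
        score = 1.0
        for p in prox:
            score *= p[node]
        if score > best_score:
            best, best_score = node, score
    return best, best_score


# Hypothetical graph: query nodes 0, 1, 2 all touch node 3;
# nodes 4 and 5 each hang off a single query node.
edges = [(0, 3), (1, 3), (2, 3), (0, 4), (1, 5)]
n = 6
adj = [[0] * n for _ in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

hub, hub_score = center_piece_node(adj, queries=[0, 1, 2])
```

Because the scores are multiplied, a node that is close to only one query node (such as node 4 here) is penalized heavily, which is the AND-query behavior the slide describes.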

CePS: Example (AND Query). DBLP co-authorship network: 400,000 authors, 2,000,000 edges.



Q1.2: Interactive Querying. Q: What are the most related conferences wrt KDD, for a user who likes SIGIR, but not ICML?

Q1.2 iPoG for Interactive Querying [Tong+ ICDM 08, CIKM 09]. What are the most related conferences wrt KDD? (DBLP author-conference bipartite graph)
– Initial results: ICDM, ICML, SDM, VLDB, ICDE, SIGMOD, NIPS, PKDD, IJCAI, PAKDD
– After "No" to ICML: ICDM, SDM, PKDD, ICDE, VLDB, SIGMOD, PAKDD, CIKM, SIGIR, WWW
– After "Yes" to SIGIR: SIGIR, TREC, CIKM, ECIR, CLEF, ICDM, JCDL, VLDB, ACL, ICDE
There are two main sub-communities in KDD: DBs (green) vs. ML/AI, also labeled Stat (red). Negative feedback on ICML excludes other ML/AI conferences (NIPS, IJCAI). Positive feedback on SIGIR brings in more IR (brown) conferences.


Q2.2 pTrack: Challenge [Tong+ SDM 08]
Observations (CePS, iPoG, …)
– All for static graphs
– Proximity: main tool
Graphs are evolving over time!
– New nodes/edges show up;
– Existing nodes/edges die out;
– Edge weights change…

Given: author-conference bipartite graphs. Q1: What are the top-k conferences for Yu over the years? Q2: How close is KDD to VLDB over the years? A: Track proximity, incrementally!

pTrack: Philip S. Yu's Top-5 conferences up to each year. DBLP (author x conference): 400k authors, 3.5k conferences, 20 years.
– ICDE, ICDCS, SIGMETRICS, PDIS, VLDB
– CIKM, ICDCS, ICDE, SIGMETRICS, ICMCS
– KDD, SIGMOD, ICDM, CIKM, ICDCS
– ICDM, KDD, ICDE, SDM, VLDB
[Figure labels: Databases, Performance, Distributed Sys.; Databases, Data Mining]

KDD's Rank wrt. VLDB over the years. [Figure: proximity rank vs. year; lower rank is closer.] Data Mining and Databases are getting closer and closer.

Q2: pTrack on Bipartite Graphs
Computational challenges (assuming )
– Iterative method: O(m)
– Straightforward update
Example
– Netflix (2.6m users x 18k movies, 100m ratings)
– Both need > 1 hr
Our solution (Fast-Update):
– ~10 seconds on the Netflix data set

Q2: pTrack on Bipartite Graphs Observation #1 – n 1 authors; n 2 conferences; – n 1 >> n 2 e.g., > 400k authors, 3.5k conf.s in DBLP Observation #2 – m edges changed, (n 1 authors, n 2 conf.s) – rank of update = = update Proposed algorithm: Fast-Update 37 Theorem: (Tong+ 2008) (1) Fast-Update has no quality loss (2) Fast-Update is Conferences ~~~ KDD … … …

Q2: Speed Comparison. [Figure: log(time) in seconds per data set; our method achieves 40x and 176x speedups.]


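Computing RWR: $r = c\,W r + (1-c)\,e$, where $r$ is the n x 1 ranking vector, $W$ is the n x n (normalized) adjacency matrix, $e$ is the n x 1 starting vector, and $1-c$ is the restart probability. Footnote: Maxwell Equation for Web [Chakrabarti]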

Computing RWR: $Q = (I - cW)^{-1}$. How to get (elements of) Q? Footnote: 1-c is the restart probability; W is the normalized adjacency matrix.

Computing RWR
Power Method
– No pre-computation;
– Light storage cost: O(m)
– Slow on-line response: O(m x Iter)
Pre-Compute
– Fast on-line response
– Prohibitive pre-compute cost: $O(n^3)$
– Prohibitive storage cost: $O(n^2)$
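The two extremes above can be compared side by side. The sketch below is illustrative only and assumes NumPy plus a small dense matrix; `rwr_power` and `precompute_q` are hypothetical names, and the restart convention (1-c is the restart probability) follows the footnotes on the earlier slides.

```python
import numpy as np


def rwr_power(W, e, c=0.9, tol=1e-12, max_iter=1000):
    """Power method: iterate r <- c W r + (1-c) e until the update is tiny."""
    r = e.copy()
    for _ in range(max_iter):
        r_next = c * (W @ r) + (1 - c) * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


def precompute_q(W, c=0.9):
    """Pre-compute Q = (I - cW)^{-1}: cubic time and quadratic storage in n."""
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) - c * W)


# Hypothetical 4-node undirected graph, column-normalized.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)

e = np.zeros(4)
e[0] = 1.0

r_iter = rwr_power(W, e)              # on-line work only, no stored inverse
Q = precompute_q(W)                   # off-line work, then a column lookup
r_direct = (1 - 0.9) * Q[:, 0]
```

Both routes give the same ranking vector; they differ only in where the cost is paid, which is the trade-off the following slides try to balance.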

Q: How to balance on-line and off-line cost? Goal: efficiently get (elements of) Q.

B_Lin: Pre-Compute [Tong+ ICDM 2006]. Find communities; compute within-community scores ($Q_{1,1}$, $Q_{1,2}$, $Q_{1,3}$).

B_Lin: On-Line [Tong+ ICDM 2006]. Find communities; fix the remaining; combine.

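B_Lin: details. $W = W_1 + \text{(cross-community part)}$, where $W_1$ holds the within-community links and the cross-community part is replaced by a low-rank approximation.

B_Lin: details. $I - cW \approx I - cW_1 - cUSV$, where $I - cW_1$ is easy to invert and $USV$ is the low-rank approximation (LRA) of the difference. Apply the Sherman–Morrison lemma!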

B_Lin: Pre-Compute Stage. Q: How to efficiently compute and store Q? A: A few small matrix inversions, instead of ONE BIG one. Footnote: $Q_1 = (I - cW_1)^{-1}$

B_Lin: On-Line Stage. Q: How to efficiently recover one column of Q? A: A few, instead of MANY, matrix-vector multiplications.
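The two stages can be sketched as below. This is a rough illustration under stated assumptions, not the published B_Lin code: the communities are given by hand rather than found by a partitioner, the low-rank part comes from a truncated SVD of the cross-community links, and the combination uses the Woodbury form of the Sherman–Morrison lemma, $(I - cW_1 - cUSV)^{-1} = Q_1 + c\,Q_1 U (S^{-1} - c V Q_1 U)^{-1} V Q_1$. All function names are hypothetical.

```python
import numpy as np


def blin_precompute(W, blocks, c=0.9, rank=2):
    """Off-line: invert each community block, then build the small correction matrix."""
    Q1 = np.zeros_like(W)
    W1 = np.zeros_like(W)
    for idx in blocks:
        sel = np.ix_(idx, idx)
        W1[sel] = W[sel]
        # One small inversion per community instead of one big n x n inversion.
        Q1[sel] = np.linalg.inv(np.eye(len(idx)) - c * W[sel])
    # Low-rank approximation of the cross-community links (assumes s[:rank] > 0).
    U, s, Vt = np.linalg.svd(W - W1)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]
    Lam = np.linalg.inv(np.diag(1.0 / s) - c * Vt @ Q1 @ U)
    return Q1, U, Lam, Vt


def blin_query(pre, e, c=0.9):
    """On-line: a few matrix-vector products recover the ranking vector for e."""
    Q1, U, Lam, Vt = pre
    q1e = Q1 @ e
    return (1 - c) * (q1e + c * (Q1 @ (U @ (Lam @ (Vt @ q1e)))))


# Hypothetical graph: two triangles joined by the single edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
W = A / A.sum(axis=0)

pre = blin_precompute(W, blocks=[[0, 1, 2], [3, 4, 5]], rank=2)
e = np.zeros(6)
e[0] = 1.0
r_blin = blin_query(pre, e)
r_exact = (1 - 0.9) * np.linalg.solve(np.eye(6) - 0.9 * W, e)
```

In this toy case the cross-community part has rank 2, so the rank-2 approximation is exact and the result matches the direct solve. On real graphs the rank is chosen much smaller than the true rank, which is where the quality/speed trade-off reported on the results slide comes from.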

Our Results: Query Time vs. Pre-Compute Time. [Figure: log query time vs. log pre-compute time.] Quality: 90%+. On-line: up to 150x speedup. Pre-computation: two orders of magnitude saving.

More on Scalability Issues for Querying (the spectrum of "FastProx")
– B_Lin: one large linear system [Tong+ ICDM06, KAIS08]
– BB_Lin: the intrinsic complexity is small [Tong+ KAIS08]
– FastUpdate: time-evolving linear system [Tong+ SDM08, SAM08]
– FastAllDAP: multiple linear systems [Tong+ KDD07 a]
– Fast-iPoG: dealing w/ on-line feedback [Tong+ ICDM 2008, Tong+ CIKM09]


A5: Immunization. How to select the k "best" nodes for immunization?

Background: M1: SIS Virus Model [Chakrabarti+ 2008]. 'Flu'-like: Susceptible-Infectious-Susceptible. If the virus 'strength' $s < 1/\lambda_{1,A}$, an epidemic cannot happen. Intuition:
– $s$: # of sneezes before healing
– $\lambda_{1,A}$: # of edges/paths
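The threshold condition above is easy to evaluate on a small graph. The sketch below assumes NumPy and an undirected (symmetric) adjacency matrix; the function names and the 4-node complete graph are illustrative assumptions, and how the strength s is derived from infection and healing rates is not covered here.

```python
import numpy as np


def leading_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix."""
    return float(np.linalg.eigvalsh(A)[-1])   # eigvalsh returns ascending order


def below_epidemic_threshold(A, strength):
    """True if s < 1 / lambda_{1,A}, i.e. an epidemic cannot happen in the SIS model."""
    return strength < 1.0 / leading_eigenvalue(A)


# Hypothetical example: the complete graph on 4 nodes has leading eigenvalue 3.
K4 = np.ones((4, 4)) - np.eye(4)
lam = leading_eigenvalue(K4)
safe = below_epidemic_threshold(K4, strength=0.2)     # 0.2 < 1/3
unsafe = below_epidemic_threshold(K4, strength=0.5)   # 0.5 > 1/3
```

Lowering the leading eigenvalue raises the threshold 1/λ, which is why the immunization problem on the next slides is phrased as maximizing the drop in λ.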

M1: Optimal Method. Select k nodes whose absence creates the largest drop in $\lambda_{1,A}$. [Figure: the original graph with its $\lambda_{1,A}$, and the graph without nodes {2, 6} with its leading eigenvalue.]

M1: Optimal Method. Select k nodes whose absence creates the largest drop in $\lambda_{1,A}$, which requires the leading eigenvalue without each candidate subset of nodes S. But the time this needs is prohibitive.
– Example: 1,000 nodes, with 10,000 edges
– It takes 0.01 seconds to compute λ
– It takes 2,615 years to find the best-5 nodes!
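The 2,615-year figure can be sanity-checked with a short calculation. The sketch assumes that the exhaustive method evaluates one leading eigenvalue per 5-node subset at 0.01 seconds each; that reading of the slide, and the 365-day year, are assumptions.

```python
import math

n_nodes, k = 1000, 5
seconds_per_eigenvalue = 0.01

# Number of candidate subsets: "1000 choose 5".
subsets = math.comb(n_nodes, k)

# One eigenvalue computation per subset.
total_seconds = subsets * seconds_per_eigenvalue
years = total_seconds / (365 * 24 * 3600)
```

The result lands at roughly 2,600 years, consistent with the slide's estimate.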

M1: Netshield to the Rescue. Theorem (Tong+ 2009), item (1). Notation: $Au = \lambda_{1,A} \times u$; $u(i)$ is the eigen-score of node i. Think of u(i) as PageRank or in-degree. [Photos: G. W. Stewart, J. G. Sun]

M1: Netshield to the Rescue. Intuition of item (1) of the theorem (Tong+ 2009): find a set of nodes S in which
– (1) each node has a high eigen-score, and
– (2) the nodes are diverse among themselves.

M1: Netshield to the Rescue
Example:
– 1,000 nodes, with 10,000 edges
– Netshield takes < 0.1 seconds to find the best-5 nodes!
– … as opposed to 2,615 years
Theorem (Tong+ 2009): (1) (2) Br(S) is sub-modular (3) Netshield is near-optimal (wrt max Br(S)) (4) Netshield is $O(nk^2 + m)$
Footnote: near-optimal means $Br(S_{\text{Netshield}}) \geq (1 - 1/e)\, Br(S_{\text{Opt}})$

Why is Netshield Near-Optimal? Sub-modular (i.e., diminishing returns): the marginal benefit of deleting {5,6} on top of the benefit of deleting {1,2} >= the marginal benefit of deleting {5,6} on top of the benefit of deleting {1,2,3,4}.

Why is Netshield Near-Optimal? Sub-modular (i.e., diminishing returns). Theorem: a k-step greedy algorithm for maximizing a sub-modular function is guaranteed to reach (1-1/e) of the optimum [Nemhauser+ 78].
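The greedy guarantee can be illustrated on a generic sub-modular function. The sketch below deliberately uses set coverage as a stand-in objective, not Netshield's Br(S), whose formula is not reproduced in this transcript; `greedy_maximize` and the toy data are hypothetical.

```python
from itertools import combinations
import math


def greedy_maximize(candidates, value, k):
    """k-step greedy: at each step add the element with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        remaining = [x for x in candidates if x not in chosen]
        best = max(remaining, key=lambda x: value(chosen + [x]) - value(chosen))
        chosen.append(best)
    return chosen


# Stand-in sub-modular objective: how many items a selection of sets covers.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}


def coverage(selection):
    covered = set()
    for name in selection:
        covered |= covers[name]
    return len(covered)


picked = greedy_maximize(list(covers), coverage, k=2)

# Brute-force optimum over all 2-element selections, for comparison only.
optimum = max(coverage(list(s)) for s in combinations(covers, 2))
```

On this small instance the greedy choice happens to reach the optimum; the theorem only promises at least (1-1/e) of it, which is the bound the test below checks.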

M1: Why is Br(S) sub-modular? With {1,2} already deleted and {5,6} newly deleted: marginal benefit of deleting {5,6} = pure benefit from {5,6} - interaction between {5,6} and {1,2}. Only the purple (interaction) term depends on {1,2}!

M1: Why is Br(S) sub-modular? Marginal benefit = Blue - Purple. More green means more purple, hence less red, so the marginal benefit of the left >= the marginal benefit of the right. Footnote: green nodes are already deleted; the blue nodes {5,6} are the nodes to be deleted.

M1: Quality of Netshield. [Figure: eigen-drop vs. k for Netshield, Optimal, and (1-1/e) x Optimal; higher is better.]

M1: Speed of Netshield. [Figure: time vs. k on the NIPS co-authorship network; lower is better. Netshield: 0.1 seconds; the alternative: > 10 days.]

Scalability of Netshield. [Figure: time vs. # of edges (x 10^8); lower is better.]


Motivation [Tong+ KDD 08 b]. Q: How to find patterns in a large graph? – e.g., communities, anomalies, etc. [Figure: author-conference bipartite graph with authors John, Tom, Bob, Carl, Van, Roy and conferences KDD, ICDM, RECOMB, ISMB.]

Motivation [Tong+ KDD 08 b]. Q: How to find patterns in a large graph? – e.g., communities, anomalies, etc. A: Low-Rank Approximation (LRA) of the adjacency matrix of the graph: $A \approx L \times M \times R$.

LRA for Graph Mining. [Figure: the author-conference graph (John, Tom, Bob, Carl, Van, Roy; KDD, ICDM, RECOMB, ISMB) and its author x conference adjacency matrix A.]

LRA for Graph Mining: Communities. Adjacency matrix $A \approx L \times M \times R$, where L gives the author groups, M the group-group interaction, and R the conference groups.

LRA for Graph Mining: Anomalies. Compare the adjacency matrix A with the reconstructed $\tilde{A}$: the reconstruction error is high => 'Carl-KDD' is abnormal.
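The reconstruction-error idea can be tried on a toy version of the author-conference example. The sketch below is an assumption-laden illustration: the edge list is invented to resemble the slide (the transcript does not list the edges), and a truncated SVD stands in for the low-rank approximation rather than Colibri itself.

```python
import numpy as np

authors = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
confs = ["KDD", "ICDM", "RECOMB", "ISMB"]

# Assumed edges: a data-mining group, a bioinformatics group, plus one odd
# edge from Carl (bioinformatics group) to KDD.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)


def low_rank_reconstruction(A, rank):
    """Rank-`rank` reconstruction of A from its truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


def worst_reconstructed_edge(A, rank):
    """Existing edge whose entry is reconstructed with the largest absolute error."""
    err = np.abs(A - low_rank_reconstruction(A, rank))
    err[A == 0] = 0.0                      # only score edges that actually exist
    i, j = np.unravel_index(np.argmax(err), err.shape)
    return int(i), int(j), float(err[i, j])


i, j, error = worst_reconstructed_edge(A, rank=2)
suspicious = (authors[i], confs[j])
```

With two groups, a rank-2 model captures the community structure, and the edge that fits it worst is the cross-group one.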

Challenges: How to Get (L, M, R)?
– Efficiently, in both time and space
– Intuitively, for easy interpretation
– Dynamically, tracking patterns over time
None of the existing methods fully meets our wish list!

Why Not SVD and CUR/CMD?
SVD (optimal in $L_2$ and $L_F$)
– Efficiency (time, space): (L, R) are dense
– Interpretation: linear combination of many columns
– Dynamic: not easy
CUR/CMD (example-based)
– Efficiency: better than SVD, but redundancy in L
– Interpretation: actual columns from A
– Dynamic: not easy

Solutions: Colibri [Tong+ KDD 08 b]
– Colibri-S: for static graphs. Basic idea: remove linear redundancy
– Colibri-D: for dynamic graphs. Basic idea: leverage smoothness over time
Theorem (Tong+ 2008): (1) Colibri = CUR/CMD in accuracy (2) Colibri <= CUR/CMD in time (3) Colibri <= CUR/CMD in space
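One way to picture "remove linear redundancy" is sketched below. It is a loose illustration, not the published Colibri-S algorithm: the sampled column indices are passed in by hand, a column is kept only if it is not a linear combination of the columns kept so far, and the choices M = (L^T L)^{-1} and R = L^T A are assumptions that make L M R the projection of A onto the kept columns. The function name and example matrix are hypothetical.

```python
import numpy as np


def colibri_s_sketch(A, sampled_cols, tol=1e-8):
    """Keep only linearly independent sampled columns, then build A ~ L M R."""
    kept = []
    for j in sampled_cols:
        col = A[:, j]
        if kept:
            L = A[:, kept]
            coef, *_ = np.linalg.lstsq(L, col, rcond=None)
            residual = col - L @ coef      # part of col not explained by kept columns
        else:
            residual = col
        if np.linalg.norm(residual) > tol:
            kept.append(j)                 # the column adds new information
    L = A[:, kept]
    M = np.linalg.inv(L.T @ L)             # small core matrix (assumed form)
    R = L.T @ A
    return kept, L, M, R


# Hypothetical author x conference matrix; the last two columns are identical.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

kept, L, M, R = colibri_s_sketch(A, sampled_cols=[0, 1, 2, 3])
reconstruction = L @ M @ R
```

Dropping the duplicate column shrinks L and M without changing the reconstruction, which is the sense in which accuracy can stay the same while time and space go down.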

Comparison: SVD, CUR vs. Colibri. Wish list: efficiency, interpretation, dynamics. Methods: SVD [Golub+ 1989], CUR [Drineas+ 2005], Colibri [Tong+ 2008].

Performance of Colibri-S. [Figure: time and space for SVD, CUR, CMD, and ours.]
– Accuracy: same, 91%+
– Time: 12x of CMD, 28x of CUR
– Space: ~1/3 of CMD, ~10% of CUR

Performance of Colibri-D. [Figure: time vs. # of changed columns for CMD, Colibri-S (prior best method), and Colibri-D.] Colibri-D achieves up to 112x speedup. Network traffic data: 21,837 nodes, 1,220 hours, 22,800 edges/hr. Accuracy: same, 93%+.


Some of my other work
– #1: FastDAP (in KDD07 a): Predict Link Direction
– #2: Graph X-Ray (in KDD 07 b): Best Effort Pattern Match in Attributed Graphs
– #3: GhostEdge (in KDD 08 a): Classification in Sparsely Labeled Network
– #4: TANGENT (in KDD09): "surprise-me" recommendation
– #5: GMine (in VLDB 06): Interactive Graph Visualization and Mining
– #6: Graphite (in ICDM 08): Visual Query System for Attributed Graphs
– #7: T3/MT3 (in CIKM 08): Mine Complex Time-stamped Events
– #8: BlurDetect (in ICME 04): Determine whether or not, and how, an image is blurred
– #9: MRBIR (in MM 04, TIP06): Manifold-Ranking based Image Retrieval
– #10: GBMML (in CVPR05, ACM/Multimedia 05): Graph-based Multiple Modality Learning

Overview (this talk + others), by task and data type
– Querying, static graphs: CePS, iPoG, Basset, DAP, G-Ray, Graphite, TANGENT, FastRWR (KDD06, ICDM06, KDD07a, KDD07b, ICDM08, KAIS08, CIKM09, KDD09)
– Querying, dynamic graphs: pTrack, cTrack, Fast-Update (SDM08, SAM08)
– Querying, images: MRBIR, UOLIR (MM04, CVPR05)
– Mining, static graphs: Netshield, Colibri-S, GhostEdge, GMine, Pack, Shiftr (VLDB06, KDD08a, KDD08b, SDM-LinkAnalysis 09)
– Mining, dynamic graphs: T3/MT3, Colibri-D (KDD08a, CIKM08)
– Mining, images: BlurDetect, GBMML, iQuality, iExpertise (ICDE04, ICIP04, MMM05, PCM05, MM05)

What is Next? Research Theme: Help users to understand and utilize large graph-related data. Plans by goal:
– G1 Querying. Step 1 (this talk): CePS, iPoG, pTrack. Step 2 (medium term): Recommendation; Interpretable Q. Step 3 (long term): Querying rich data.
– G2 Mining. Step 1 (this talk): Netshield, Colibri. Step 2 (medium term): Immunization; Interpretable M. Step 3 (long term): Mining rich data.
– G3 Scalability. Step 1 (this talk): All above O(m) or better (single machine). Step 2 (medium term): Scalable by parallel. Step 3 (long term): Scalable on rich data.

Current Recommendation (focus on relevance). [Figure: movie graph with clusters for sci. fiction, comedy, horror, and adventure; red nodes are those returned by (most of the) existing algorithms.] Footnote: nodes are movies; an edge is the similarity between movies.

"Broad Spectrum Recommendation" (focus on completeness = relevance + diversity + novelty). [Figure: the same movie graph with clusters for adventure, sci. fiction, comedy, and horror.] Footnote: nodes are movies; an edge is the similarity between movies.


Interpretable Recommendation. Current recommendation: Amazon.com recommends (based on items you purchased or told us you own).

Interpretable Recommendation. Current recommendation: Amazon.com recommends (based on items you purchased or told us you own). Interpretable recommendation: Amazing.com recommends an item because it has the topics you are interested in (graph mining, linear algebra) and topics you might be interested in (Hadoop, submodularity).


Immunization
This talk: SIS (e.g., flu)
In the future
– Immunize for SIR (e.g., chicken pox)
– Immunize in dynamic settings: dynamics of graphs (e.g., edges/nodes are changing) and dynamics of the virus (e.g., the infection/healing rates are changing)
Footnote: SIR stands for susceptible-infectious-recovered.


Interpretable Mining
– Find communities
– Find a few nodes/edges to describe each community and the relationship between 2 communities
Footnote: nodes are actors; edges indicate co-play in a movie.


Querying Rich Graphs (e.g., geo-coded, attributed). What is the difference between North America and Asia? [Figure labels: Teenager, Adult, Phone, MSN]

Mining Rich Graphs (e.g., geo-coded, attributed). How to find patterns? (e.g., communities, anomalies) [Figure labels: Teenager, Adult, Phone, MSN, telemarketer]


Scalability: two orthogonal efforts
– E1: O(m) or better on a single machine
– E2: Parallelism (e.g., Hadoop) (implementation, decouple, analysis)

Research Theme: Help users to understand and utilize large graph-related data. [Figure labels: Real Data, User, Scalability]

My Collaboration Graph (During Ph.D. Study). Legend: green = querying, yellow = mining, purple = others. [Figure: collaboration graph grouped by Q1, Q2, Q3, M1, M2, M3, with projects CePS, iPoG, Basset, pTrack, cTrack, BLin, BBLin, NBLin, FastUpdate, Fast-iPoG, G-Ray, DAP, Colibri, NetShield, GhostEdge, Graphite, Pack, TANGENT, GMine, T3, MT3.]

Q & A. Thank you!