Tools and Algorithms for Querying and Mining Large Graphs Hanghang Tong Machine Learning Department Carnegie Mellon University

htong@cs.cmu.edu http://www.cs.cmu.edu/~htong

Thesis Committee: Christos Faloutsos, William Cohen, Jeff Schneider, Philip S. Yu

Graphs are everywhere!

Motivating Questions (high level): Given a large graph, we want to support two tasks. Task A: Querying (e.g., CePS on DBLP [Tong+ KDD 06]). Task B: Mining (e.g., T3 on CIKM [Tong+ CIKM 08]). Will return to this later…

Motivating Questions (in detail) Querying [Goal: query complex relationships] – Q.1. Find complex user-specific patterns; – Q.2. Link prediction & proximity tracking; – Q.3. Answer all the above questions quickly. Mining [Goal: find interesting patterns] – M.1. Spot anomalies; – M.2. Mine time & space; – M.3. Detect communities.

Thesis Overview [Figure: grid of querying tasks (Q1, Q2, Q3) and mining tasks (M1, M2, M3).]

Thesis Overview: Questions That We Ask (completed vs. proposed)
Q1: CePS, G-Ray, ProSIN (KDD06, KDD07 a, ICDM08)
Q2: DAP (KDD07 b); pTrack/cTrack (SDM08, SAM08)
Q3: FastProx (ICDM06, KAIS07, KDD07 b, ICDM08); FastProx (SDM08, SAM08)
M1: Colibri-S (KDD08 b); Colibri-D (KDD08 b)
M2: T3/MT3 (CIKM08)
Proposed: P1 (M3), P2 (M2), P3

Thesis Overview: Impact
Task: Impact, Applications
Querying
Q1: Identify master-mind criminals and money-laundering rings; interactive search & summarization
Q2: Predict who-calls-whom; trend analysis on graph level
Q3: Scale all the above applications to large, disk-resident graphs
Mining
M1: Efficient anomaly detection in an intuitive, dynamic way
M2: Mine time/space in complex settings
M3: Detect communities w/ optional constraints
Footnote: Our work for Q1 has been transferred into an IBM product (Cyano)

Roadmap: Introduction; Completed Work – Querying (Preliminary, Q1, Q2, Q3) – Mining; Proposed Work

Preliminary: Proximity Measurement (a.k.a. Relevance, Closeness, ‘Similarity’…)

Completed work on Q1 Goal: Find complex user-specific patterns – Q1.1. Center-Piece Subgraph Discovery – e.g., who is the master-mind criminal, given some suspects X, Y and Z? – Q1.2. Best-Effort Pattern Match – e.g., money-laundering ring – Q1.3. Interactive querying (e.g., negation) – e.g., find the conferences most similar to KDD, but not like ICML

Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06] Input: the original graph with black query nodes A, B, C. Output: CePS. Q: How to find a hub for the black nodes? A: The CePS node (red) maximizes Prox(A, Red) x Prox(B, Red) x Prox(C, Red).
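The scoring rule on this slide can be sketched in a few lines. This is a minimal illustration, not the CePS implementation: the proximity values, the node indices, and the function names `and_query_scores` and `center_piece_node` are all hypothetical, and the sketch only picks the single best node rather than extracting a subgraph.

```python
import numpy as np

def and_query_scores(prox):
    """prox[q, j] is the proximity from query node q to node j.
    The AND-query score of node j is the product over all query nodes."""
    return np.prod(prox, axis=0)

def center_piece_node(prox, query_nodes):
    """Return the non-query node with the highest AND-query score."""
    scores = and_query_scores(prox).astype(float)
    scores[list(query_nodes)] = -np.inf  # a query node is not its own hub
    return int(np.argmax(scores))

# Hypothetical proximities for query nodes A=0, B=1, C=2 over six nodes.
prox = np.array([
    [1.0, 0.1, 0.1, 0.4, 0.3, 0.05],
    [0.1, 1.0, 0.1, 0.5, 0.1, 0.30],
    [0.1, 0.1, 1.0, 0.3, 0.2, 0.02],
])
best = center_piece_node(prox, query_nodes=[0, 1, 2])
```

Node 3 wins here because it is reasonably close to all three query nodes, whereas node 5 is close to B only; the product penalizes any weak link.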

CePS: Example (AND Query) DBLP co-authorship network: 400,000 authors, 2,000,000 edges

K_SoftAND: Relaxation of AND. Asking an AND query? → No answer! Causes: disconnected communities, noise.
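One way to read K_SoftAND is "a node scores well if at least k of the query nodes are close to it". The sketch below makes that concrete under a loudly stated assumption: each proximity is treated as an independent probability in [0, 1], so the score is the chance that at least k of the Q events happen. The function name `k_softand_scores` and the numbers are hypothetical, and the slides do not confirm that this is the exact combination rule.

```python
import numpy as np

def k_softand_scores(prox, k):
    """prox[q, j] in [0, 1]: assumed probability that query q 'reaches' node j.
    Returns, per node, P(at least k of the Q independent events occur)."""
    num_queries, num_nodes = prox.shape
    # dist[m, j]: probability that exactly m of the queries seen so far reach j.
    dist = np.zeros((num_queries + 1, num_nodes))
    dist[0] = 1.0
    for q in range(num_queries):
        p = prox[q]
        updated = dist * (1.0 - p)      # query q does not reach node j
        updated[1:] += dist[:-1] * p    # query q reaches node j
        dist = updated
    return dist[k:].sum(axis=0)

# Hypothetical values: node 0 is unreachable from the third query node.
prox = np.array([
    [0.4, 0.5],
    [0.3, 0.6],
    [0.0, 0.2],
])
and_scores = k_softand_scores(prox, k=3)    # strict AND: product of the column
soft_scores = k_softand_scores(prox, k=2)   # 2_SoftAND
```

With k equal to the number of query nodes the score reduces to the AND product, so node 0 gets zero; relaxing to k = 2 gives it a non-zero score, which is the "no answer" case the slide describes.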

CePS: 2_SoftAND [Figure labels: Stat., DB]

Q1.2. Best-Effort Pattern Match [Tong+ KDD 2007 b] Q: How to find the matching subgraph? [Figure: data graph and query graph as input; matching subgraph as output, with an ‘Interception’ label.]

G-Ray: How to? Goodness of a matching node = Prox(12, 4) x Prox(4, 12) x Prox(7, 4) x Prox(4, 7) x Prox(11, 7) x Prox(7, 11) x Prox(12, 11) x Prox(11, 12)

Effectiveness: star-query [Figure: query and result; labels: Databases, Bio-medical, Intelligent Agent]

Effectiveness: line-query [Figure: query and result; labels: Databases, Learning, Bio-medical, Theory]

Q1.3: Interactive Querying (User Feedback)

Q1.3 ProSIN for Interactive Querying [Tong+ ICDM 08]
What are the most related conferences wrt KDD? (DBLP author-conference bipartite graph)
Initial results: 'ICDM', 'ICML', 'SDM', 'VLDB', 'ICDE', 'SIGMOD', 'NIPS', 'PKDD', 'IJCAI', 'PAKDD'
No to 'ICML': 'ICDM', 'SDM', 'PKDD', 'ICDE', 'VLDB', 'SIGMOD', 'PAKDD', 'CIKM', 'SIGIR', 'WWW'
Yes to 'SIGIR': 'SIGIR', 'TREC', 'CIKM', 'ECIR', 'CLEF', 'ICDM', 'JCDL', 'VLDB', 'ACL', 'ICDE'
– Two main sub-communities in KDD: DBs (green) vs. Stat (red)
– Negative feedback on ICML will exclude other stats conferences (NIPS, IJCAI)
– Positive feedback on SIGIR will bring in more IR (brown) conferences

Q2.1 Link Prediction: direction [Tong+ KDD 07 a] Q: Given the existence of the link, what is the direction of the link? A: (DAP) Compare Prox(i → j) and Prox(j → i). [Figure: density of Prox(i → j) - Prox(j → i); annotation: >70%.] Data: Web Link, 4,000 nodes, 10,000 edges
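The comparison rule can be illustrated with any asymmetric proximity. The sketch below uses a random walk with restart on the directed graph as a stand-in; that choice is an assumption made for illustration and is not the DAP measure itself. The graph, the function names `rwr_directed` and `predict_direction`, and the parameter values are hypothetical.

```python
import numpy as np

def rwr_directed(adj, source, c=0.85, iters=200):
    """Stand-in proximity: RWR scores for a walker restarting at `source`.
    adj[u, v] = 1 means a directed edge u -> v."""
    adj = np.asarray(adj, dtype=float)
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize; nodes without out-links simply leak probability mass.
    P = np.divide(adj, out_deg, out=np.zeros_like(adj), where=out_deg > 0)
    e = np.zeros(adj.shape[0])
    e[source] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = c * (P.T @ r) + (1 - c) * e
    return r

def predict_direction(adj, i, j):
    """Guess the direction of a link known to exist between i and j."""
    prox_ij = rwr_directed(adj, i)[j]
    prox_ji = rwr_directed(adj, j)[i]
    return (i, j) if prox_ij >= prox_ji else (j, i)

# Hypothetical graph: the link between 0 and 3 is hidden; the rest is known.
adj = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    adj[u, v] = 1.0
guess = predict_direction(adj, 0, 3)
```

Every known path runs from node 0 toward node 3 and none runs back, so the stand-in proximity is positive in one direction and zero in the other, and the guess is 0 → 3.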

Q2.2 pTrack/cTrack: Challenge [Tong+ SDM 08] Observations (CePS, G-Ray, ProSIN…) – All for static graphs – Proximity: main tool. Graphs are evolving over time! – New nodes/edges show up; – Existing nodes/edges die out; – Edge weights change… Q: How to make everything incremental? A: Track proximity!

pTrack/cTrack: Trend analysis on graph level [Figure: rank of influence vs. year for M. Jordan, G. Hinton, C. Koch, T. Sejnowski.]

pTrack: Problem Definitions [Given] – (1) a large, skewed, time-evolving bipartite graph, – (2) the query nodes of interest [Track] – (1) top-k most related nodes for each query node at each time step t; – (2) the proximity score (or rank of proximity) between any two query nodes at each time step t
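To make the problem statement concrete, here is a naive baseline that recomputes proximity from scratch at every time step. It is a sketch under stated assumptions, not the pTrack algorithm: snapshots are accumulated by simple summation, proximity is a random walk with restart, and the names `rwr` and `track_top_k` plus the toy data are hypothetical. The point of a fast tracking method is precisely to avoid this full recomputation.

```python
import numpy as np

def rwr(adj, source, c=0.9, iters=300):
    """Random walk with restart on a column-normalized adjacency matrix."""
    col_sum = adj.sum(axis=0, keepdims=True)
    W = np.divide(adj, col_sum, out=np.zeros_like(adj), where=col_sum > 0)
    e = np.zeros(adj.shape[0])
    e[source] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = c * (W @ r) + (1 - c) * e
    return r

def track_top_k(snapshots, query_author, k):
    """snapshots[t][a, v]: papers by author a in conference v during step t.
    Returns the top-k conference indices for the query author at each step."""
    n_auth, n_conf = snapshots[0].shape
    total = np.zeros((n_auth, n_conf))
    history = []
    for M in snapshots:
        total += M  # assumption: aggregate all edges seen so far
        adj = np.block([[np.zeros((n_auth, n_auth)), total],
                        [total.T, np.zeros((n_conf, n_conf))]])
        conf_scores = rwr(adj, query_author)[n_auth:]
        history.append([int(i) for i in np.argsort(-conf_scores)[:k]])
    return history

# Hypothetical data: author 0 moves from conference 0 to conference 2.
snapshots = [
    np.array([[3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
    np.array([[0.0, 0.0, 10.0], [0.0, 0.0, 0.0]]),
]
history = track_top_k(snapshots, query_author=0, k=1)
```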

pTrack: Philip S. Yu’s Top-5 conferences up to each year
1992: ICDE, ICDCS, SIGMETRICS, PDIS, VLDB
1997: CIKM, ICDCS, ICDE, SIGMETRICS, ICMCS
2002: KDD, SIGMOD, ICDM, CIKM, ICDCS
2007: ICDM, KDD, ICDE, SDM, VLDB
Trend: Databases, Performance, Distributed Sys. (early years) to Databases, Data Mining (later years)
Data: DBLP (Au. x Conf.), 400k authors, 3.5k conferences, 20 years

KDD’s Rank wrt. VLDB over years [Figure: proximity rank vs. year.] Data Mining and Databases are getting closer and closer.

cTrack: 10 most influential authors in the NIPS community up to each year. Author-paper bipartite graph from NIPS 1987-1999: 1,740 papers, 2,037 authors, spread over 13 years. [Figure labels: T. Sejnowski, M. Jordan]

Proximity is the main tool. Q.1: CePS, G-Ray, ProSIN. Q.2: DAP, pTrack/cTrack. Q: What is a ‘good’ score (a.k.a. Relevance, Closeness, ‘Similarity’…)?

Random walk with restart [Pan+ KDD 2004] [Figure: a 12-node example graph with query node 4 and its ranking vector; the redder a node, the more relevant.] Nearby nodes, higher scores.
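The "nearby nodes, higher scores" behaviour is easy to reproduce on a toy graph. The sketch below is a plain power iteration, assuming a column-normalized adjacency matrix and a walker that follows an edge with probability c and restarts otherwise; the path graph, the value c = 0.5, and the function name `rwr` are illustrative choices, not taken from the slides.

```python
import numpy as np

def rwr(adj, source, c=0.5, iters=100):
    """Iterate r <- c * W r + (1 - c) * e until (approximately) converged."""
    col_sum = adj.sum(axis=0, keepdims=True)
    W = np.divide(adj, col_sum, out=np.zeros_like(adj), where=col_sum > 0)
    e = np.zeros(adj.shape[0])
    e[source] = 1.0  # the walker always restarts at the query node
    r = e.copy()
    for _ in range(iters):
        r = c * (W @ r) + (1 - c) * e
    return r

# Toy path graph 0 - 1 - 2 - 3, queried from node 0.
adj = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
r = rwr(adj, source=0)
```

On this path the scores fall off with distance from the query node, and they sum to one because every column of the normalized matrix is a probability distribution.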

Why is RWR a good score? RWR summarizes all the weighted paths from i to j: all paths from i to j with length 1, with length 2, with length 3, and so on. ($\tilde{W}$: adjacency matrix; c: damping factor.)
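The path-summation claim follows from expanding the closed form of the ranking vector. The derivation below is a sketch that assumes $\tilde{W}$ is normalized so that the spectral radius of $c\tilde{W}$ is below one (true for a column- or symmetrically normalized matrix with $0 < c < 1$); the symbols $\vec{r}_i$ and $\vec{e}_i$ (ranking vector and indicator vector of query node i) are notation introduced here.

```latex
\begin{align*}
\vec{r}_i &= c\,\tilde{W}\,\vec{r}_i + (1-c)\,\vec{e}_i \\
\vec{r}_i &= (1-c)\,(I - c\,\tilde{W})^{-1}\,\vec{e}_i \\
          &= (1-c)\,\bigl(I + c\,\tilde{W} + c^{2}\,\tilde{W}^{2} + c^{3}\,\tilde{W}^{3} + \cdots\bigr)\,\vec{e}_i
\end{align*}
```

The $(j, i)$ entry of $\tilde{W}^{k}$ adds up the weights of all length-$k$ paths from i to j, so the score of j is a sum over paths of every length, with longer paths discounted by $c^{k}$.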

Computing RWR. On the fly: – No pre-computation; – Light storage cost ($\tilde{W}$); – Slow on-line response: O(mE). Pre-compute: – Fast on-line response; – Prohibitive pre-compute cost: $O(n^3)$; – Prohibitive storage cost: $O(n^2)$.

Q: How to Balance? On-line Off-line 38 Goal: Efficiently Get (elements) of

B_Lin: Basic Idea [Tong+ ICDM 2006] [Figure: the 12-node example graph processed in three steps: find community, fix the remaining, combine.]

B_Lin: details. $\tilde{W} = \tilde{W}_1 + \tilde{W}_2$, where $\tilde{W}_1$: within community; $\tilde{W}_2$: cross community.

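B_Lin: details. $I - c\tilde{W} \approx I - c\tilde{W}_1 - cUSV$, where $I - c\tilde{W}_1$ is easy to invert and $USV$ is a low-rank approximation (LRA) of the difference $\tilde{W} - \tilde{W}_1$. Then apply the Sherman–Morrison lemma!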

B_Lin: summary. Pre-Compute Stage Q: Efficiently compute and store Q. A: A few small, instead of ONE BIG, matrix inversions. On-Line Stage Q: Efficiently recover one column of Q. A: A few, instead of MANY, matrix-vector multiplications.
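The two stages can be sketched as follows. This is an illustration under assumptions rather than the published B_Lin code: communities are given by hand instead of being found by a partitioner, the matrix is column-normalized, dense NumPy arrays are used, and the names `normalize_columns`, `blin_precompute`, and `blin_query` are hypothetical. The low-rank correction is applied through the Sherman–Morrison–Woodbury identity.

```python
import numpy as np

def normalize_columns(adj):
    col_sum = adj.sum(axis=0, keepdims=True)
    return np.divide(adj, col_sum, out=np.zeros_like(adj), where=col_sum > 0)

def blin_precompute(W, communities, c=0.9, rank=2):
    """Off-line: invert each community block, then a small rank x rank matrix."""
    W1 = np.zeros_like(W)   # within-community part
    Q1 = np.zeros_like(W)   # block-diagonal inverse of (I - c * W1)
    for nodes in communities:  # communities must partition all nodes
        idx = np.ix_(nodes, nodes)
        W1[idx] = W[idx]
        Q1[idx] = np.linalg.inv(np.eye(len(nodes)) - c * W[idx])
    # Low-rank approximation of the cross-community part W - W1.
    U, s, Vt = np.linalg.svd(W - W1)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]  # kept singular values must be non-zero
    Lam = np.linalg.inv(np.diag(1.0 / s) - c * Vt @ Q1 @ U)
    return Q1, U, Lam, Vt

def blin_query(pre, source, c=0.9):
    """On-line: a few matrix-vector products recover one ranking vector."""
    Q1, U, Lam, Vt = pre
    e = np.zeros(Q1.shape[0])
    e[source] = 1.0
    base = Q1 @ e
    return (1 - c) * (base + c * Q1 @ (U @ (Lam @ (Vt @ base))))

# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2 - 3.
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0
W = normalize_columns(adj)
pre = blin_precompute(W, communities=[[0, 1, 2], [3, 4, 5]], c=0.9, rank=2)
r = blin_query(pre, source=0, c=0.9)
```

In this toy graph the cross-community part has rank two, so a rank-2 approximation is exact and the result matches a direct solve; on real graphs a small rank trades some accuracy for speed.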

Query Time vs. Pre-Compute Time [Figure: log query time vs. log pre-compute time.] Our results: Quality: 90%+; On-line: up to 150x speedup; Pre-computation: two orders of magnitude saving.

More on Scalability Issues for Querying (the spectrum of "FastProx")
– B_Lin: one large linear system [Tong+ ICDM06, KAIS08]
– BB_Lin: the intrinsic complexity is small [Tong+ KAIS08]
– FastUpdate: time-evolving linear system [Tong+ SDM08, SAM08]
– FastAllDAP: multiple linear systems [Tong+ KDD07 a]
– Fast-ProSIN: dealing w/ on-line feedback [Tong+ ICDM 2008]

Roadmap: Introduction; Completed Work – Querying – Mining (M1: Spotting Anomalies, M2: Mining Time); Proposed Work

Motivation [Tong+ KDD 08 b] Q: How to find patterns? – e.g., communities, anomalies, etc. A: Low-Rank Approximation (LRA) for the adjacency matrix of the graph: $A \approx L \times M \times R$.

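LRA for Graph Mining: Example [Figure: author-conference graph with authors John, Tom, Bob, Carl, Van, Roy and conferences KDD, ICDM, RECOMB, ISMB.] Adjacency matrix $A \approx L \times M \times R$, with L: author clusters, M: interaction, R: conference clusters. Reconstruction error is high → ‘Carl’ is abnormal.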
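The "high reconstruction error means abnormal" rule can be tried on a toy author-conference matrix. The sketch uses a truncated SVD as the low-rank approximation, which is a stand-in and not the Colibri decomposition; the matrix entries are made up (the slide does not give them), and `row_reconstruction_error` is a hypothetical helper name.

```python
import numpy as np

def row_reconstruction_error(A, rank):
    """Per-row error of the best rank-`rank` approximation of A (via SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.linalg.norm(A - A_hat, axis=1)

authors = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
# Columns: KDD, ICDM, RECOMB, ISMB. Hypothetical publication pattern:
# a data-mining cluster, a bio-informatics cluster, and Carl straddling both.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],  # John
    [1.0, 1.0, 0.0, 0.0],  # Tom
    [1.0, 1.0, 0.0, 0.0],  # Bob
    [1.0, 0.0, 1.0, 0.0],  # Carl
    [0.0, 0.0, 1.0, 1.0],  # Van
    [0.0, 0.0, 1.0, 1.0],  # Roy
])
errors = row_reconstruction_error(A, rank=2)
most_abnormal = authors[int(np.argmax(errors))]
```

Two components capture the two clusters well, so the row that fits neither cluster keeps the largest residual.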

Challenges: How to get (L, M, R)? – Efficiently, in both time and space – Intuitively, easy to interpret – Dynamically, tracking patterns over time. None of the existing methods fully meets our wish list!

Why Not SVD and CUR/CX?
SVD: Optimal in $L_2$ and $L_F$ – Efficiency: Time: Space: (L, R) are dense – Interpretation: linear combination of many columns – Dynamic: not easy
CUR: Example-based – Efficiency: better than SVD, but redundancy in L – Interpretation: actual columns from A – Dynamic: not easy

Solutions: Colibri [Tong+ KDD 08 b] Colibri-S: for static graphs – Basic idea: remove linear redundancy – Same accuracy as CUR/CX – Significant savings in both time & space: up to 53x speed-up. Colibri-D: for dynamic graphs – Basic idea: leverage smoothness between time steps – Same accuracy as CUR/CMD: up to 112x speed-up.
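The "remove linear redundancy" idea for the static case can be sketched as: sample columns as an example-based method would, but keep a column only if it is not already a linear combination of the columns kept so far. The code below is an illustration under assumptions, not the published Colibri-S procedure: the independence test uses a least-squares residual, the core matrix is taken as the inverse of $L^T L$ with $R = L^T A$, and the name `colibri_s_sketch` and the toy matrix are hypothetical.

```python
import numpy as np

def colibri_s_sketch(A, sampled_cols, tol=1e-8):
    """Build A ~ L @ M @ R from sampled columns, skipping redundant ones."""
    L = np.zeros((A.shape[0], 0))
    for j in sampled_cols:
        col = A[:, j:j + 1]
        if L.shape[1] > 0:
            coef, *_ = np.linalg.lstsq(L, col, rcond=None)
            residual = col - L @ coef
        else:
            residual = col
        # Keep the column only if it adds a new direction to span(L).
        if np.linalg.norm(residual) > tol * max(1.0, np.linalg.norm(col)):
            L = np.hstack([L, col])
    M = np.linalg.inv(L.T @ L)  # small core matrix
    R = L.T @ A
    return L, M, R

# Columns 1 and 2 are multiples of column 0, so they are redundant.
A = np.array([
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 1.0, 1.0],
])
L, M, R = colibri_s_sketch(A, sampled_cols=[0, 1, 2, 3, 0])
A_hat = L @ M @ R
```

Only two of the five sampled columns survive, yet the product reproduces the matrix exactly here because the kept columns span its whole column space; the saving in time and space comes from the smaller L and M.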

A Pictorial Comparison (for static graphs) [Figure: 1st vs. 2nd singular vector, with panels for SVD, CUR, CMD, and Colibri-S.]

Comparison: SVD, CUR vs. Colibri. [Table: SVD [Golub+ 1989], CUR/CX [Drineas+ 2005], and Colibri [Tong+ 2008] against the wish list: Efficiency, Interpretation, Dynamics.]

Performance of Colibri-S [Figure: time and space of ours vs. CUR and CMD.] Accuracy: same (91%+). Time: 12x speed-up over CMD, 28x over CUR. Space: ~1/3 of CMD, ~10% of CUR. Data set: network traffic, 21,837 sources/destinations, 158,805 edges

Performance of Colibri-D [Figure: time vs. number of changed columns for CMD, Colibri-S, and Colibri-D.] Colibri-D achieves up to 112x speed-up. Data set: network traffic, 21,837 nodes, 1,220 hours, 22,800 edges/hr

M2: How to mine time in a complex context? [Tong+ CIKM 08]

A Motivating Example: Inputs
Time: Event (e.g., Session): Entities
Oct. 26: Link Analysis: Tom, Bob
Oct. 26: Clustering: Bob, Alan
Oct. 27: Classification: Bob, Alan
Oct. 27: Anomaly Detection: Alan, Beck
Oct. 28: Party: Beck, Dan
Oct. 29: Web Search: Dan, Jack
Oct. 29: Advertising: Jack, Peter
Oct. 30: Enterprise Search: Jack, Peter
Oct. 31: Q & A: Peter, Smith

A Motivating Example: Outputs [Figure: time stamps Oct. 26 to Oct. 30 grouped by pattern.] – Time cluster, rep. entities: "Tom", "Bob", "Alan" – Abnormal time, rep. entities: "Beck", "Dan" – Time cluster, rep. entities: "Jack", "Peter", "Smith"

Problem Definitions (how to mine time in such a complex context) Given data sets collected at different time stamps, we want to find – 1: Time clusters – 2: Abnormal time stamps – 3: Interpretations – 4: The right time granularity. Our solutions: T3, MT3

Data Sets
CIKM: from CIKM proceedings – Time: publication year (1993-2007, 15) – Event: paper published (952) – Entities: author (1,895) & session (279) – Attribute: keyword (158)
DeviceScan: from MIT Reality Mining – Time: the day the scanning happened (1/1/2004-5/5/2005, 294) – Event: Bluetooth device scanning a person (114,046) – Entities: device (103) & person (97) – Attribute: NA

T3 on ‘CIKM’ Data Set
First group – Rep. authors: James P. Callan, W. Bruce Croft, James Allan, Philip S. Yu, George Karypis, Charles Clarke – Rep. keywords: Web, Cluster, Classification, XML, Language, Stream
Second group – Rep. authors: Elke Rundensteiner, Daniel Miranker, Andreas Henrich, Il-Yeol Song, Scott B. Huffman, Robert J. Hall – Rep. keywords: Knowledge, System, Unstructured, Rule, Object-oriented, Deductive

MT3 on ‘DeviceScan’ Data Set. Aggregate by month: Apr. 2004 is an anomaly. Aggregate by day: work days vs. semester break & holidays.

Roadmap: Introduction; Completed Work – Querying – Mining; Proposed Work – P1: Community detection – P2: Mining Space – P3: Diffusion Wavelets

P1: Detecting Communities. Observations: two seemingly opposite efforts in community detection – E1: parameter-free (no user intervention) – E2: cluster w/ constraints (listen to users). Challenge: How to fill the gap? Idea: MDL-based method, encoding the constraints in the description.

P2: Mining Space. Given the data sets collected at different locations, we want to – Find similar locations – Spot abnormal locations – Provide interpretations. Idea: extend T3/MT3 to the 2-d case.

P3: Diffusion Wavelets. Observation #1: The graph Laplacian is the basis for many querying and mining techniques. Observation #2: Diffusion wavelets focus on the local spectrum at multiple scales. Conjecture: Diffusion wavelets (might) provide an alternative/better way for – Querying – Mining

Time Line
Dec. ‘08: Thesis Proposal
Jan. – Feb. ‘09: Research on Community Detection (P1)
Mar. – Apr. ‘09: Research on Mining Space (P2)
May – Jul. ‘09: Research on Diffusion Wavelets (P3)
Aug. ‘09: Thesis Write-up
Sep. ‘09: Defense

Selected References
H. Tong & C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413.
H. Tong, C. Faloutsos, & J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622. (best paper award)
H. Tong, Y. Koren, & C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756.
H. Tong, B. Gallagher, C. Faloutsos, & T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746.
H. Tong, S. Papadimitriou, P.S. Yu, & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM. (best paper award)
H. Tong, S. Papadimitriou, J. Sun, P.S. Yu, & C. Faloutsos. (2008) Fast Mining of Static and Dynamic Graphs. In KDD.
H. Tong, Y. Sakurai, T. Eliassi-Rad, & C. Faloutsos. (2008) Fast Mining of Complex Time-Stamped Events. In CIKM.
H. Tong, H. Qu, & H. Jamjoom. (2008) Measuring Proximity on Graphs with Side Information. In ICDM.

My other work during Ph.D. study
GhostEdge (w/ Brian, Christos and Tina, in KDD 08) – Classification in Sparsely Labeled Networks
GMine (w/ Junio, Agma, Christos and Jure, in VLDB 06) – Interactive Graph Visualization and Mining
Graphite (w/ Polo, Christos, Jason, Brian and Tina, in ICDM 08) – Visual Query System for Attributed Graphs
TANGENT (w/ Kensuke and Christos) – "surprise-me" recommendation
PaCK (w/ Jingrui, Spiros, Tina, Jaime and Christos) – Community detection for heterogeneous graphs

Acknowledgements (the old way): Christos Faloutsos, Jia-Yu Pan, Yehuda Koren, Spiros Papadimitriou, Philip S. Yu, Jimeng Sun, Huiming Qu, Hani Jamjoom, Tina Eliassi-Rad, Brian Gallagher, Yasushi Sakurai, Kensuke Oonuma, Duen Horng (Polo) Chau, Jason I. Hong, Jingrui He, Jaime Carbonell, José Fernando Rodrigues Jr., Jure Leskovec, Agma J. M. Traina, Charalampos (Babis) Tsourakakis, Meng Su

A Graph Miner’s Way: My Collaboration Graph (During Ph.D. Study) [Figure: graph linking the tasks Q1, Q2, Q3, M1, M2, M3 to the projects CePS, ProSIN, G-Ray, DAP, pTrack, cTrack, BLin, BBLin, FastUpdate, FastDAP, Fast-ProSIN, Colibri, T3/MT3, GhostEdge, Graphite, PaCK, TANGENT, GMine, and the proposed P1, P2, P3.] Legend: Green: Querying; Blue: Mining; Purple: Others; separate markers for completed and proposed work.

Q & A. Thank you!
