SCS CMU Proximity on Large Graphs Speaker: Hanghang Tong 2008-4-10 15-826 Guest Lecture.


1 SCS CMU Proximity on Large Graphs Speaker: Hanghang Tong 2008-4-10 15-826 Guest Lecture

2 SCS CMU 2 Graphs are everywhere!

3 SCS CMU 3 Food-web: example

4 SCS CMU 4 Graph Mining: the big picture Graph/Global Level Subgraph/ Community Level Node Level We are here!

5 SCS CMU 5 Proximity on Graph: What? a.k.a Relevance, Closeness, ‘Similarity’…

6 SCS CMU 6 Proximity is the main tool behind… Link prediction [Liben-Nowell+], [Tong+] Ranking [Haveliwala], [Chakrabarti+] Email Management [Minkov+] Image caption [Pan+] Neighborhood Formation [Sun+] Conn. subgraph [Faloutsos+], [Tong+], [Koren+] Pattern match [Tong+] Collaborative Filtering [Fouss+] Many more… Will return to this later

7 SCS CMU 7 Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Basic: RWR Variants Asymmetry of Prox. Group Prox Prox w/ Attributes Prox w/ Time

8 SCS CMU 8 Why not shortest path? ‘pizza delivery guy’ problem ‘multi-facet’ relationship Some ``bad’’ proximities

9 SCS CMU 9 Why not max. netflow? No punishment on long paths Some ``bad’’ proximities

10 SCS CMU 10 Why not ``effective conductance”? Some ``bad’’ proximities ‘pizza delivery guy’ problem

11 SCS CMU 11 What is a ``good’’ Proximity? Multiple Connections Quality of connection Direct & Indirect Conns Length, Degree, Weight… …

12 SCS CMU 12 Random walk with restart [Figure: example graph with nodes 1 to 12]

13 SCS CMU 13 Random walk with restart [Figure: example graph with nodes 1 to 12; query node: Node 4] Nodes: Node 1 to Node 12. Ranking vector: 0.13 0.10 0.13 0.22 0.13 0.05 0.08 0.04 0.03 0.04 0.02 More red, more relevant Nearby nodes, higher scores

14 SCS CMU 14 Why is RWR a good score? It accounts for all paths from i to j with length 1, all paths from i to j with length 2, all paths from i to j with length 3, …
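
The path-counting intuition on this slide can be written as a series. The following is a sketch under assumed notation, since the slide's own formula did not survive: W is a column-normalized adjacency matrix, e_i is the indicator vector of the query node i, and 0 < c < 1, so that the ranking vector satisfies r_i = c W r_i + (1 - c) e_i.

```latex
\mathbf{r}_i
  = (1-c)\,(\mathbf{I} - c\,\mathbf{W})^{-1}\,\mathbf{e}_i
  = (1-c)\sum_{k=0}^{\infty} c^{k}\,\mathbf{W}^{k}\,\mathbf{e}_i
```

Entry j of W^k e_i aggregates the weight of all length-k paths from i to j, so every path contributes, but a path of length k is damped by c^k.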

15 SCS CMU 15 Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Basic: RWR Variants Asymmetry of Prox. Group Prox Prox w/ Attributes Prox w/ Time

16 SCS CMU 16 Variant: escape probability Define Random Walk (RW) on the graph Esc_Prob(A → B) –Prob (starting at A, reaches B before returning to A) Esc_Prob = Pr (smile before cry) [Figure: nodes A, B, and the remaining graph]

17 SCS CMU 17 Other Variants Other measure by RWs –Commute Time/Hitting Time [Fouss+] –SimRank [Jeh+] Equivalence of Random Walks –Electric Networks: EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] –Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chebotarev+]

18 SCS CMU 18 Other Variants Other measure by RWs –Commute Time/Hitting Time [Fouss+] –SimRank [Jeh+] Equivalence of Random Walks –Electric Networks: EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] –Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chebotarev+] All are related to, or similar to random walk with restart!

19 SCS CMU 19 Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Basic: RWR Variants Asymmetry of Prox. Group Prox Prox w/ Attributes Prox w/ Time

20 SCS CMU 20 Asymmetry of Proximity [Tong+ KDD07 a] What is Prox from A to B? What is Prox from B to A? What is Prox between A and B?

21 SCS CMU 21 Asymmetry also exists in un-directed graphs Hanghang’s most important conf. is KDD The most important author in KDD is... So is love… Hanghang KDD

22 SCS CMU 22 Roadmap Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Basic: RWR Variants Asymmetry of Prox. Group Prox Prox w/ Attributes Prox w/ Time

23 SCS CMU 23 Group Proximity [Tong+ 2007] Q: How close are Accountants to SECs? A: Prob (starting at any RED, reaches any GREEN before touching any RED again)

24 SCS CMU 24 Proximity on Attribute Graphs What is the proximity from node 7 to 10? If we know that…

25 SCS CMU 25 Sol: Augmented graphs

26 SCS CMU 26 Attributes on nodes/edges (ER graph) [Chakrabarti+ WWW07] skip Wrote, Sent, Received, In-Replied-to, Cited, Works

27 SCS CMU 27 Proximity w/ Time Sol #1: treat time as a categorical attr. [Minkov+] Sol #2: aggregate slice matrices over time [Tong+]: global aggregation, sliding window, exponential emphasis
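
The three aggregation schemes named in Sol #2 can be sketched as weighted sums of the per-time-step slice matrices. This is a minimal illustration, not the paper's exact definition: the function name `aggregate_slices`, the `window` and `decay` parameters, and the specific weights are assumptions.

```python
def aggregate_slices(slices, scheme="global", window=2, decay=0.5):
    """Combine per-time-step matrices (lists of lists, oldest first) into one.

    scheme "global":      every slice gets weight 1
    scheme "sliding":     only the last `window` slices get weight 1
    scheme "exponential": a slice that is `age` steps old gets weight decay**age
    (the weights are illustrative assumptions, not taken from the slides)
    """
    T = len(slices)
    if scheme == "global":
        weights = [1.0] * T
    elif scheme == "sliding":
        weights = [1.0 if t >= T - window else 0.0 for t in range(T)]
    elif scheme == "exponential":
        weights = [decay ** (T - 1 - t) for t in range(T)]
    else:
        raise ValueError("unknown scheme: " + scheme)

    rows, cols = len(slices[0]), len(slices[0][0])
    return [
        [sum(w * s[i][j] for w, s in zip(weights, slices)) for j in range(cols)]
        for i in range(rows)
    ]


# Three toy 1 x 2 slices, oldest first.
toy_slices = [[[1, 0]], [[0, 2]], [[4, 0]]]
global_sum = aggregate_slices(toy_slices, "global")
last_two = aggregate_slices(toy_slices, "sliding", window=2)
recent_heavy = aggregate_slices(toy_slices, "exponential", decay=0.5)
```

Any of the resulting matrices can then be fed to a static proximity computation; the choice of scheme controls how quickly old edges are forgotten.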

28 SCS CMU 28 Summary of Part I Goal: Summarize multiple … relationships Solutions –Basic: Random Walk with Restart –Property: Asymmetry –Variants: Esc_Prob and many others. –Generalization: Group Prox.; w/ Attr.; w/ Time

29 SCS CMU 29 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap B_Lin: RWR FastAllDAP: Esc_Prob BB_Lin: Skewed BGs FastUpdate: Time-Evolving

30 SCS CMU Preliminary: Sherman–Morrison Lemma 30 = If: Then:
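
For reference, the standard rank-one form of the Sherman–Morrison lemma is given below. The slide's own equations did not survive, so the symbols A, x, and y are assumptions; the statement requires A to be invertible and the denominator to be non-zero.

```latex
\text{If: } \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{x}\,\mathbf{y}^{\top}
\qquad
\text{Then: } \tilde{\mathbf{A}}^{-1}
  = \mathbf{A}^{-1}
  - \frac{\mathbf{A}^{-1}\,\mathbf{x}\,\mathbf{y}^{\top}\mathbf{A}^{-1}}
         {1 + \mathbf{y}^{\top}\mathbf{A}^{-1}\,\mathbf{x}}
```

The practical point is that once the inverse of A is known, the inverse after a low-rank change costs only matrix-vector work rather than a fresh inversion.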

31 SCS CMU SM Lemma: Applications RLS –and almost any algorithm in time series! Leave-one-out cross validation for LS Kalman filtering Incremental matrix decomposition … and all the fast sols we will introduce! 31

32 SCS CMU 32 Computing RWR [Figure: example graph with nodes 1 to 12. Equation labels: ranking vector (n x 1), adjacency matrix (n x n), starting vector (n x 1), restart probability]
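
A minimal power-iteration sketch of RWR follows. It assumes the update r = c W r + (1 - c) e, where W is the column-normalized adjacency matrix, e is the starting vector, and 1 - c is treated as the restart probability; the function name `rwr`, the parameter names, and the toy graph are illustrative only.

```python
def rwr(adj, query, c=0.9, tol=1e-10, max_iter=1000):
    """Random walk with restart by power iteration.

    adj[i][j] is the weight of edge i -> j (list of lists).
    Assumed convention: follow an edge with probability c,
    jump back to `query` with probability 1 - c.
    """
    n = len(adj)
    out_weight = [sum(row) for row in adj]
    # W[j][i] = probability of stepping from node i to node j.
    W = [
        [adj[i][j] / out_weight[i] if out_weight[i] else 0.0 for i in range(n)]
        for j in range(n)
    ]
    e = [1.0 if k == query else 0.0 for k in range(n)]
    r = e[:]
    for _ in range(max_iter):
        new_r = [
            c * sum(W[j][i] * r[i] for i in range(n)) + (1 - c) * e[j]
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if delta < tol:
            break
    return r


# Toy undirected path graph 0 - 1 - 2 - 3, queried from node 0.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = rwr(path, query=0)
```

Each iteration touches every edge once, which is why answering a query this way needs no pre-computation but is slow on large graphs.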

33 SCS CMU 33 Beyond RWR P-PageRank [Haveliwala] PageRank [Haveliwala] RWR [Pan, Sun] SM Learning [Zhou, Zhu] RL in CBIR [He] Fast RWR (B_Lin) Finds the Root Solution ! : Maxwell Equation for Web! [Chakrabarti]

34 SCS CMU 34 Beyond RWR RWR is the building block for computing… –Escape Probability (augmented w/sink) [Tong+] –Effective Conductance, Resistance Dist., Commute Time –MRF (special structure) [Cohen] Similar Idea of B_Lin to compute other measurements

35 SCS CMU 35 Q: Given query i, how to solve it? ? ? Adjacency matrix Starting vector

36 SCS CMU 36 OnTheFly: No pre-computation/ light storage Slow on-line response O(mE) [Figure: example graph with nodes 1 to 12 and its ranking vector]

37 SCS CMU 37 PreCompute [Haveliwala] [Figure: example graph with nodes 1 to 12, query node 4, and the pre-computed ranking vectors R]

38 SCS CMU 38 PreCompute: Fast on-line response Heavy pre-computation/storage cost O(n^3) / O(n^2) [Figure: example graph with nodes 1 to 12 and its ranking vector]

39 SCS CMU 39 Q: How to Balance? On-line Off-line

40 SCS CMU 40 B_Lin: Basic Idea [Tong+] Find Community Fix the remaining Combine [Figure: example graph with nodes 1 to 12, split into communities and then recombined]

41 SCS CMU 41 Pre-computational stage Q: Efficiently compute and store Q A: A few small, instead of ONE BIG, matrix inversions

42 SCS CMU 42 On-Line Query Stage Q: Efficiently recover one column of Q A: A few, instead of MANY, matrix-vector multiplications

43 SCS CMU 43 Pre-compute Stage p1: B_Lin Decomposition –P1.1 partition –P1.2 low-rank approximation p2: Q matrices –P2.1 computing (for each partition) –P2.2 computing (for concept space)

44 SCS CMU 44 P1.1: partition Within-partition links; cross-partition links [Figure: example graph before and after partitioning] skip

45 SCS CMU 45 P1.1: block-diagonal [Figure: example graph and its block-diagonal within-partition matrix] skip

46 SCS CMU 46 P1.2: LRA for W2; |S| << |W2| [Figure: example graph and the low-rank approximation of its cross-partition links] skip

47 SCS CMU 47 + = skip

48 SCS CMU 48 p2.1 Computing c skip

49 SCS CMU 49 Comparing one big inversion and the block-wise inversions Q1,1, Q1,2, …, Q1,k Computing Time –100,000 nodes; 100 partitions –Computing is 10,000x faster! Storage Cost –100x saving! skip
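
The savings quoted on this slide can be checked with back-of-the-envelope arithmetic. The sketch assumes equal-sized partitions and a simple cost model in which inverting a dense d x d matrix costs about d^3 operations and storing it takes d^2 numbers; both are assumptions, not figures from the talk.

```python
n, k = 100_000, 100          # nodes and partitions, as on the slide
block = n // k               # nodes per partition, assuming equal sizes

# Assumed cost model: dense d x d inversion ~ d**3 operations, d**2 storage.
time_one_big = n ** 3
time_k_small = k * block ** 3
space_one_big = n ** 2
space_k_small = k * block ** 2

time_speedup = time_one_big // time_k_small
space_saving = space_one_big // space_k_small
print(time_speedup, space_saving)
```

Under this model the speedup is k^2 and the storage saving is k, so finer partitions help until the cross-partition links start to dominate.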

50 SCS CMU 50 Q: How to fix the green portions? + ~ ~ ~ + ? skip

51 SCS CMU 51 p2.2 Computing (for concept space): U, V, Q1,1, Q1,2, …, Q1,k [Figure: example graph with nodes 1 to 12] skip

52 SCS CMU 52 SM Lemma says: We have: Communities Bridges skip

53 SCS CMU 53 On-Line Stage Q + Query Result ? A (SM lemma) Pre-Computation skip

54 SCS CMU 54 On-Line Query Stage q1: q2: q3: q4: q5: q6: skip

55 SCS CMU 55 skip

56 SCS CMU 56 Query Time vs. Pre-Compute Time Log Query Time Log Pre-compute Time Quality: 90%+ On-line: Up to 150x speedup Pre-computation: Two orders saving

57 SCS CMU 57 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap B_Lin: RWR FastAllDAP: Esc_Prob BB_Lin: Skewed BGs FastUpdate: Time-Evolving

58 SCS CMU 58 FastAllDAP [Tong+] Footnote: augmented w/ universal sink as practical modification A B the remaining graph Q: How to compute –Esc_Prob = Pr (smile before cry)?

59 SCS CMU 59 Solving DAP (Straightforward way) One matrix inversion, one proximity! 1 x (n-2) (n-2) x (n-2) 1-c: fly-out probability (to black-hole)

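60 SCS CMU 60 $\mathrm{Esc\_Prob}(1 \to 5) = c^{2}\,\mathbf{P}(1,\mathcal{I})\,\bigl(\mathbf{I} - c\,\mathbf{P}(\mathcal{I},\mathcal{I})\bigr)^{-1}\,\mathbf{P}(\mathcal{I},5) + c\,\mathbf{P}(1,5)$, where $\mathcal{I}$ is the set of all nodes except 1 and 5. P: Transition matrix (row norm.)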
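
A small numerical sketch of the escape probability with fly-out follows. It assumes the form c^2 P(i,I) (I - c P(I,I))^-1 P(I,j) + c P(i,j), with I the set of nodes other than i and j, and it replaces the explicit matrix inversion with repeated substitution to stay dependency-free. The helper names and the toy graph are illustrative.

```python
def solve_fixed_point(A, b, iters=2000):
    """Solve v = A v + b by repeated substitution.

    Converges when the spectral radius of A is below 1, which holds here
    because A = c * P(I, I) with c < 1 and P row-stochastic.
    """
    v = b[:]
    for _ in range(iters):
        v = [sum(A[r][k] * v[k] for k in range(len(v))) + b[r]
             for r in range(len(b))]
    return v


def esc_prob(P, i, j, c=0.9):
    """Escape probability from i to j; 1 - c is the fly-out probability per step."""
    rest = [k for k in range(len(P)) if k not in (i, j)]
    # v = (I - c P(I,I))^{-1} P(I,j), obtained as the fixed point of v = c P(I,I) v + P(I,j).
    A = [[c * P[a][k] for k in rest] for a in rest]
    rhs = [P[a][j] for a in rest]
    v = solve_fixed_point(A, rhs)
    return c * c * sum(P[i][a] * va for a, va in zip(rest, v)) + c * P[i][j]


# Row-normalized transition matrix of the undirected path 0 - 1 - 2.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
end_to_end = esc_prob(P, 0, 2)      # must pass through node 1
end_to_middle = esc_prob(P, 0, 1)
middle_to_end = esc_prob(P, 1, 0)
```

On this toy graph the proximity from the end node to the middle node differs from the reverse direction, which illustrates the asymmetry discussed in Part I even though the graph is undirected.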

61 SCS CMU 61 Challenges Case 1: Medium Size Graph –Matrix inversion is feasible, but… –What if we want many proximities? –Q: How to get all (n^2) proximities efficiently? –A: FastAllDAP! Case 2: Large Size Graph –Matrix inversion is infeasible –Q: How to get one proximity efficiently? –A: FastOneDAP! skip

62 SCS CMU 62 FastAllDAP Q1: How to efficiently compute all possible proximities on a medium size graph? –a.k.a. how to efficiently solve multiple linear systems simultaneously? Goal: reduce # of matrix inversions!

63 SCS CMU 63 FastAllDAP: Observation Need two different matrix inversions! P=

64 SCS CMU 64 FastAllDAP: Rescue Redundancy among different linear systems! P= Overlap between two gray parts! Prox(1 → 5) Prox(1 → 6)

65 SCS CMU 65 FastAllDAP: Theorem Theorem: Proof: by SM Lemma Example:

66 SCS CMU 66 FastAllDAP: Algorithm Alg. –Compute Q –For i,j =1,…, n, compute Prox(i → j) Computational saving: O(1) instead of O(n^2) matrix inversions! Example –w/ 1000 nodes, –1M matrix inversions vs. 1 matrix inversion!

67 SCS CMU 67 FastAllDAP Size of Graph Time (sec) Straight-Solver FastAllDAP 1,000x faster!

68 SCS CMU 68 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap B_Lin: RWR FastAllDAP: Esc_Prob BB_Lin: Skewed BGs FastUpdate: Time-Evolving

69 SCS CMU RWR on Bipartite Graph 69 n m authors Conferences Author-Conf. Matrix Observation: n >> m! Examples: 1. DBLP: 400k aus, 3.5k confs 2. NetFlix: 2.7M usrs, 18k mvs

70 SCS CMU 70 RWR on Skewed bipartite graphs Q: Given query i, how to solve it? [Figure: block adjacency matrix over n authors and m conferences, with zero diagonal blocks and off-diagonal blocks Ar and Ac]

71 SCS CMU 71 BB_Lin: Pre-Computation [Tong+ 06] Step 1: M = Ac x Ar (2-step RWR for Conferences) Step 2: All Conf-Conf Prox. Scores Cost, by example: –NetFlix: 1.5hr for pre-computation; –DBLP: a few minutes

72 SCS CMU 72 BB_Lin: Pre-Computation [Tong+ 06] Step 1: M = Ac x Ar (2-step RWR for Conferences) Step 2: All Conf-Conf Prox. Scores

73 SCS CMU 73 BB_Lin: Pre-Computation [Tong+ 06] Step 1: M = Ac x Ar (2-step RWR for Conferences) Step 2: All Conf-Conf Prox. Scores Cost, by example: –NetFlix: 1.5hr for pre-computation; –DBLP: a few minutes Ac/Ar: E edges; M: m x m
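
Step 1 can be illustrated on a toy author-conference matrix. This sketch assumes Ar is the row-normalized author-to-conference matrix (n x m) and Ac is the row-normalized conference-to-author matrix (m x n); the slides do not state the normalization, and the helper names and data are made up for illustration.

```python
def row_normalize(M):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


# Toy data: 4 authors (rows) x 2 conferences (columns), entries = paper counts.
author_conf = [[2, 0], [1, 1], [0, 3], [1, 0]]

Ar = row_normalize(author_conf)               # n x m: author -> conference step
Ac = row_normalize(transpose(author_conf))    # m x n: conference -> author step
M = matmul(Ac, Ar)                            # m x m: conference -> conference in two steps
```

Because m is much smaller than n in skewed bipartite graphs, M is small enough to invert and store, which is what makes the pre-computation affordable.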

74 SCS CMU BB_Lin: On-Line Stage 74 Ac/Ar E edges Case 1: - Conf - Conf authors Conferences Read out !

75 SCS CMU BB_Lin: On-Line Stage 75 Ac/Ar E edges Case 2: - Au - Conf authors Conferences 1 matrix-vec!

76 SCS CMU 76 BB_Lin: On-Line Stage Ac/Ar E edges Case 3: - Au - Au authors Conferences 2 matrix-vec!

77 SCS CMU BB_Lin: Examples NetFlix dataset (2.7m user x 18k movies) –1.5hr for pre-computation; –<1 sec for on-line DBLP dataset (400k authors x 3.5k confs) –A few minutes for pre-computation –<0.01 sec for on-line 77

78 SCS CMU 78 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap B_Lin: RWR FastAllDAP: Esc_Prob BB_Lin: Skewed BGs FastUpdate: Time-Evolving

79 SCS CMU 79 Challenges BB_Lin is good for skewed bipartite graphs –for NetFlix (2.7M nodes and 100M edges) –w/ 1.5 hr pre-computation for m x m core matrix –fraction of seconds for on-line query But…what if the graph is evolving over time –New edges/nodes arrive; edge weights increase… –1.5hr itself becomes a part of on-line cost!

80 SCS CMU 80 t=0 Q: How to update the core matrix? t=1 ~ ~ ?

81 SCS CMU 81 Update the core matrix Step 1: M = Ac x Ar; the updated M equals the old M plus a rank-2 update Step 2: ?

82 SCS CMU 82 Update: General Case [Tong+ 2008] E’ edges changed Involves n’ authors, m’ confs. Observation: M = Ac x Ar [Figure: n authors, m conferences]

83 SCS CMU 83 Update: General Case Observation: –the rank of update is small! Algorithm: –E’ edges changed –Involves n’ authors, m’ confs. –our Alg. –(details in the paper) [Figure: n authors, m conferences]

84 SCS CMU 84 FastOneUpdate 176x speedup 40x speedup Time (Seconds) Datasets

85 SCS CMU 85 Fast-Batch-Update Min (n’, m’)E’ Time (Seconds) 15x speed-up on average!

86 SCS CMU 86 Summary of Part II Goal: Efficiently Solve Linear System(s) Sols. –B_Lin: Approximate one large linear system –FastAllDAP: multiple inner-related linear systems –BB_Lin: the intrinsic complexity is small –FastUpdate: (smooth) dynamic linear system

87 SCS CMU 87 B_Lin FastAllDAP … BB_Lin … FastUpdate

88 SCS CMU 88 Motivation Part I: Definitions Part II: Fast Solutions Part III: Applications Conclusion Roadmap Link Prediction NF gCap CePS G-Ray pTrack/cTrack

89 SCS CMU 89 Link Prediction: existence [Figure: density of Prox (i → j) + Prox (j → i) for pairs with link and with no link] Prox. is effective to distinguish red and blue!

90 SCS CMU 90 Link Prediction: direction Q: Given the existence of the link, what is the direction of the link? A: Compare prox(i → j) and prox(j → i) >70% [Figure: density of Prox (i → j) - Prox (j → i)]
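
The two link-prediction rules can be sketched as a single scoring function. It assumes a pre-computed proximity table `prox[a][b]`; the rule that the link points away from the endpoint with the larger outgoing proximity is an assumed convention, since the slides only say to compare the two directions.

```python
def link_scores(prox, i, j):
    """Return (existence_score, predicted_direction) for the node pair (i, j).

    prox[a][b] is assumed to hold a pre-computed proximity from a to b.
    """
    forward, backward = prox[i][j], prox[j][i]
    existence = forward + backward   # larger sum -> link more likely to exist
    # Assumed convention: orient the link along the larger proximity.
    direction = (i, j) if forward >= backward else (j, i)
    return existence, direction


# Made-up proximities between two nodes.
toy_prox = [
    [0.0, 0.3],
    [0.1, 0.0],
]
existence, direction = link_scores(toy_prox, 0, 1)
```

In practice the existence score would be thresholded, with the threshold chosen on held-out node pairs.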

91 SCS CMU 91 Neighborhood Formation Q: What is the most related conference to ICDM? A: RWR! [Sun ICDM2005] [Figure: Conference-Author bipartite graph]

92 SCS CMU 92 NF: example

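93 SCS CMU 93 gCaP: Automatic Image Caption Q: {Sea, Sun, Sky, Wave} {Cat, Forest, Grass, Tiger} {?, ?, ?} A: RWR! [Pan KDD2004]

94 SCS CMU 94 [Figure: graph with Image, Keyword, and Region nodes, plus the Test Image] Keywords: Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass

95 SCS CMU 95 [Figure: graph with Image, Keyword, and Region nodes, plus the Test Image] Keywords: Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass Result for the Test Image: {Grass, Forest, Cat, Tiger}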

96 SCS CMU 96 Center-Piece Subgraph(CePS) ? Original Graph Black: query nodes CePS Q A: RWR! [Tong KDD 2006] Red: Max (Prox(Red, A) x Prox(Red, B) x Prox(Red, C)) CePS guy
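
The AND-style center-piece score on this slide, the product of a candidate node's proximities to every query node, can be sketched as follows. The node names and proximity values are made up, and `center_piece_scores` is a hypothetical helper name.

```python
from math import prod


def center_piece_scores(prox_to_queries):
    """Map each candidate node to the product of its proximities to all query nodes."""
    return {node: prod(scores) for node, scores in prox_to_queries.items()}


# Made-up proximities from three candidate nodes to query nodes A, B, C.
prox_to_queries = {
    "u": [0.30, 0.20, 0.10],
    "v": [0.50, 0.40, 0.00],   # close to A and B but unrelated to C
    "w": [0.20, 0.20, 0.20],
}
scores = center_piece_scores(prox_to_queries)
best = max(scores, key=scores.get)
```

The product rewards nodes that are reasonably close to all queries over nodes that are very close to only some of them, which is also why a strict AND can return nothing when the queries sit in disconnected communities.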

97 SCS CMU 97 CePS: Example

98 SCS CMU 98 K_SoftAnd: Relaxation of AND Asking AND query? → No Answer! Disconnected Communities Noise

99 SCS CMU 99 2_SoftAnd And 1_SoftAnd (OR) x 1e-4

100 SCS CMU 100 CePS: 2 Soft_AND Stat. DB

101 SCS CMU 101 Graph X-Ray Input: Attributed Data Graph, Query Graph Output: Matching Subgraph

102 SCS CMU 102 G-Ray: How to? matching node Goodness = Prox (12, 4) x Prox (4, 12) x Prox (7, 4) x Prox (4, 7) x Prox (11, 7) x Prox (7, 11) x Prox (12, 11) x Prox (11, 12)
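
The goodness score on this slide multiplies the proximities in both directions along every matched edge. A minimal sketch follows, using the four node pairs from the slide and made-up proximity values; the function name `goodness` and the dictionary layout are assumptions.

```python
def goodness(prox, matched_edges):
    """Product of Prox(a, b) * Prox(b, a) over all matched edges (a, b)."""
    score = 1.0
    for a, b in matched_edges:
        score *= prox[(a, b)] * prox[(b, a)]
    return score


# Node pairs taken from the slide; the proximity values are made up.
edges = [(12, 4), (7, 4), (11, 7), (12, 11)]
toy_prox = {}
for a, b in edges:
    toy_prox[(a, b)] = 0.5
    toy_prox[(b, a)] = 0.25

match_quality = goodness(toy_prox, edges)
```

Using both directions penalizes a match in which one endpoint is close to the other but not vice versa.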

103 SCS CMU 103 Effectiveness: star-query Query Result

104 SCS CMU 104 Effectiveness: line-query Query Result

105 SCS CMU 105 Query Result Effectiveness: loop-query

106 SCS CMU 106 pTrack [Given] –(1) a large, skewed time-evolving bipartite graph, –(2) the query nodes of interest [Track] –(1) top-k most related nodes for each query node at each time step t; –(2) the proximity score (or rank of proximity) between any two query nodes at each time step t [Figure: Author A’s rank in KDD, by year]

107 SCS CMU 107 Philip S. Yu’s Top-5 conferences up to each year
1992: ICDE, ICDCS, SIGMETRICS, PDIS, VLDB
1997: CIKM, ICDCS, ICDE, SIGMETRICS, ICMCS
2002: KDD, SIGMOD, ICDM, CIKM, ICDCS
2007: ICDM, KDD, ICDE, SDM, VLDB
Areas: Databases, Performance, Distributed Sys. → Databases, Data Mining

108 SCS CMU 108 KDD’s Rank wrt. VLDB over years [Figure: Rank vs. Year] Data Mining and Databases are more and more relevant!

109 SCS CMU 109 cTrack [Given] –(1) a large, skewed time-evolving graph, –(2) the query nodes of interest [Track] –(1) top-k most central nodes at each time step t; –(2) the centrality score (or rank of centrality) for each query node at each time step t

110 SCS CMU 110 Ranking of Centrality up to each year (in NIPS) M. Jordan G.Hinton C. Koch T. Sejnowski Year Rank of Influential-ness

111 SCS CMU 111 10 most influential authors up to each year Author-paper bipartite graph from NIPS 1987-1999. 3k. 1740 papers, 2037 authors, spreading over 13 years T. Sejnowski M. Jordan

112 SCS CMU 112 Proximity On Graphs Definitions (Weighted Multiple Relationship): RWR, Variants, w/ Time, w/ Attribute, Group Prox. Computations (Efficiently Solve Linear System(s)): B_Lin, FastAllDAP, BB_Lin, FastUpdate Applications (Use Proximity as Building block): Link Prediction, NF, gCap, CePS, G-Ray, pTrack, cTrack

113 SCS CMU Take-home Messages Proximity Definitions –RWR –and a lot of variants Computations –SM Lemma 113

114 SCS CMU 114 References L. Page, S. Brin, R. Motwani, & T. Winograd. (1998) The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Library. T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW, 517-526, 2002. J.Y. Pan, H.J. Yang, C. Faloutsos & P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD, 653-658, 2004. C. Faloutsos, K. S. McCurley & A. Tomkins. (2004) Fast discovery of connection subgraphs. In KDD, 118-127, 2004. J. Sun, H. Qu, D. Chakrabarti & C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, 418-425, 2005. W. Cohen. (2007) Graph Walks and Graphical Models. Draft.

115 SCS CMU 115 References P. Doyle & J. Snell. (1984) Random walks and electric networks, volume 22. Mathematical Association of America, New York. Y. Koren, S. C. North, and C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245–255, 2006. A. Agarwal, S. Chakrabarti & S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14-23, 2006. S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW, 571-580, 2007. F. Fouss, A. Pirotte, J.-M. Renders, & M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355-369, 2007.

116 SCS CMU 116 References H. Tong & C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413, 2006. H. Tong, C. Faloutsos, & J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622, 2006. H. Tong, Y. Koren, & C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756, 2007. H. Tong, B. Gallagher, C. Faloutsos, & T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007. H. Tong, S. Papadimitriou, P.S. Yu & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. To appear in SDM 2008.

117 SCS CMU 117 Thank you! htong@cs.cmu.edu www.cs.cmu.edu/~htong

