ON LINK-BASED SIMILARITY JOIN A joint work with: Liwen Sun, Xiang Li, David Cheung (University of Hong Kong) Jiawei Han (University of Illinois Urbana.

Slides:

Advertisements

Similar presentations

Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.

Advertisements

Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,

Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Microsoft Research.

Analysis and Modeling of Social Networks Foudalis Ilias.

Searching on Multi-Dimensional Data

1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.

Frequent Subgraph Pattern Mining on Uncertain Graph Data

A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.

Yubao Wu 1, Ruoming Jin 2, Xiang Zhang 1

Neighborhood Formation and Anomaly Detection in Bipartite Graphs Jimeng Sun Huiming Qu Deepayan Chakrabarti Christos Faloutsos Speaker: Jimeng Sun.

Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.

1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006.

1 Fast Dynamic Reranking in Large Graphs Purnamrita Sarkar Andrew Moore.

1 PageSim: A Link-based Similarity Measure for the World Wide Web Zhenjiang Lin, Irwin King, and Michael, R., Lyu Computer Science & Engineering, The Chinese.

Abstract Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in the literature take a landmark embedding.

Evaluation of Top-k OLAP Queries Using Aggregate R-trees Nikos Mamoulis (HKU) Spiridon Bakiras (HKUST) Panos Kalnis (NUS)

Survey on Evolving Graphs Research Speaker: Chenghui Ren Supervisors: Prof. Ben Kao, Prof. David Cheung 1.

1 Applications of Relative Importance  Why is relative importance interesting? Web Social Networks Citation Graphs Biological Data  Graphs become too.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.

Mehdi Kargar Aijun An York University, Toronto, Canada Discovering Top-k Teams of Experts with/without a Leader in Social Networks.

Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.

PageRank for Product Image Search Kevin Jing (Googlc IncGVU, College of Computing, Georgia Institute of Technology) Shumeet Baluja (Google Inc.) WWW 2008.

1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.

Diversified Top-k Graph Pattern Matching 1 Yinghui Wu UC Santa Barbara Wenfei Fan University of Edinburgh Southwest Jiaotong University Xin Wang.

The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.

December 7-10, 2013, Dallas, Texas

Discovering Meta-Paths in Large Heterogeneous Information Network

P-Rank: A Comprehensive Structural Similarity Measure over Information Networks CIKM’ 09 November 3 rd, 2009, Hong Kong Peixiang Zhao, Jiawei Han, Yizhou.

Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.

Xiaowei Ying, Xintao Wu Univ. of North Carolina at Charlotte PAKDD-09 April 28, Bangkok, Thailand On Link Privacy in Randomizing Social Networks.

Finding Top-k Shortest Path Distance Changes in an Evolutionary Network SSTD th August 2011 Manish Gupta UIUC Charu Aggarwal IBM Jiawei Han UIUC.

Efficient Processing of Top-k Spatial Preference Queries

Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.

1 LinkClus: Efficient Clustering via Heterogeneous Semantic Links Xiaoxin Yin, Jiawei Han Univ. of Illinois at Urbana-Champaign Philip S. Yu IBM T.J. Watson.

1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.

Exact indexing of Dynamic Time Warping

1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor ： Dr. Koh Jia-Ling Speaker ： Tu Yi-Lang Date ： Hong.

1 On Optimal Worst-Case Matching Cheng Long (Hong Kong University of Science and Technology) Raymond Chi-Wing Wong (Hong Kong University of Science and.

Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.

1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.

Page 1 PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi.

1 Authors: Glen Jeh, Jennifer Widom (Stanford University) KDD, 2002 Presented by: Yuchen Bian SimRank: a measure of structural-context similarity.

RoundTripRank Graph-based Proximity with Importance and Speciﬁcity Yuan FangUniv. of Illinois at Urbana-Champaign Kevin C.-C. ChangUniv. of Illinois at.

Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.

Kijung Shin Jinhong Jung Lee Sael U Kang

Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.

03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.

23 1 Christian Böhm 1, Florian Krebs 2, and Hans-Peter Kriegel 2 1 University for Health Informatics and Technology, Innsbruck 2 University of Munich Optimal.

Glen Jeh & Jennifer Widom KDD  Many applications require a measure of “similarity” between objects.  Web search  Shopping Recommendations  Search.

A Connectivity-Based Popularity Prediction Approach for Social Networks Huangmao Quan, Ana Milicic, Slobodan Vucetic, and Jie Wu Department of Computer.

CS 440 Database Management Systems Web Data Management 1.

Discovering Meta-Paths in Large Heterogeneous Information Network Changping Meng (Purdue University) Reynold Cheng (University of Hong Kong) Silviu Maniu.

Cohesive Subgraph Computation over Large Graphs

SIMILARITY SEARCH The Metric Space Approach

FORA: Simple and Effective Approximate Single-Source Personalized PageRank Sibo Wang, Renchi Yang, Xiaokui Xiao, Zhewei Wei, Yin Yang School of Information.

CIKM’ 09 November 3rd, 2009, Hong Kong

Probabilistic Data Management

Jiawei Han Department of Computer Science

Spatio-temporal Pattern Queries

ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs Yu Liu , Bolong Zheng, Xiaodong He, Zhewei Wei, Xiaokui Xiao, Kai.

Probably Approximately

Zhenjiang Lin, Michael R. Lyu and Irwin King

Jiawei Han Department of Computer Science

Feifei Li, Ching Chang, George Kollios, Azer Bestavros

Asymmetric Transitivity Preserving Graph Embedding

Panagiotis G. Ipeirotis Luis Gravano

Efficient Processing of Top-k Spatial Preference Queries

Approximate Graph Mining with Label Costs

PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs.

Presentation transcript:

ON LINK-BASED SIMILARITY JOIN A joint work with: Liwen Sun, Xiang Li, David Cheung (University of Hong Kong) Jiawei Han (University of Illinois Urbana Champaign) Presenter: Reynold Cheng Department of Computer Science The University of Hong Kong VLDB 2011

Graph applications  Social networks 2 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han  Bibliographic networks  Coauthor/citation relationships  Biological databases  Protein-protein interaction link prediction, recommendation, spam detection,...

Link-based Similarity (LS)  Similarity between a node pair based on links  Personalized PageRank  [Widom, WWW’03][Fogara, Inter. Math’05]  SimRank  [Lizorkin, VLDBJ’10] [Li, SDM’10]  Discounted Hitting Times  [Sarkar, KDD’10] 3 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Similarity Join  Similarity join: discovers relationship between two sets of objects based on some similarity function  Extensively studied in:  high dim. data [Boehm, SIGMOD’01] [Dittrich, KDD’01]  sets/strings [Arasu, VLDB’06] [Xiao, WWW’08]  Similarity join for graphs: use shortest-path distance for road network and graph pattern matching [Sankaranarayanan, GIS’06; Zou, VLDB’09] 4 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Link-based Similarity Join (LS-Join)  LS-Join: Given two subsets of nodes P and Q in a graph and a LS measure S, return k pairs of nodes, with the highest values of S. 5 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

LS-Join and Promotion Strategies Find the top-k closest (Sales, Customer) from a social network, using PageRank In a citation network, find top-k similar pairs of papers from the DB and AI communities 6 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han Top-1 LS-Join on Sales, Customer

More about LS measures  A LS measure often involves random walk  Let be a probabilistic measure between u and v  Personalized PageRank (PPR)  : prob. a surfer from u visits v at i-th step  SimRank (SR)  : prob. 2 surfers from u and v first meet at i-th step  Discounted Hitting Time (DHT)  : prob. a surfer from u first visits v at i-th step  can be expensive to compute 7 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Challenge of Evaluating LS-Join Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han 8  Let S(u,v) be the similarity between u and v based on a LS measure  A simple algorithm:  For each node pair and, compute S(p, q)  Return the k pairs with the highest S(p,q)  Drawback:  S(p,q) is expensive to compute  S(p,q) of a non-answer pair is also evaluated  Can we have a better solution?

LS-Join Algorithms  Iterative Deepening Join (IDJ)  An algorithm for computing any given LS measure  Customization of IDJ for:  Personalized PageRank (PPR)  SimRank (SR) 9 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

e-function: A general form of S(u,v)  where  a, b : real-valued constants; a>0  : decay factor; 0 < <1  : prob. measure  e.g., for PPR:  : prob. a surfer from u visits v at i-th step  a = 1- ; b = 0 10 Link Similarity Join S(u,v) has a general form called e-function Practically, we approximate S(u,v) by some d depth

Properties of e-function where 11 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han Observations 1.This bound decreases exponentially with d 2.At small d, S d (u,v) is cheap to compute; it only needs short random walks Observations 1.This bound decreases exponentially with d 2.At small d, S d (u,v) is cheap to compute; it only needs short random walks

Iterative Deepening Join (IDJ)  At iteration i, compute the bound of S(u,v), where d=2 i  As d increases, the bound shrinks and converges to S(u,v)  Compute the bound more frequently at small depths Higher pruning power The bound is cheaper to compute  Conversely, spend less effort for large d 12

IDJ Example: find the top-1 pair Compute S 2 : Perform 2 steps of random walks Iteration 1: d = 2. graph space Prune nodes using bounds 13 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

IDJ Example: find the top-1 pair Iteration 2: d = 4. graph space Compute S 4 : Perform 4 steps of random walks Compute S 4 : Perform 4 steps of random walks Prune nodes using bounds 14 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

IDJ Example: find the top-1 pair Iteration 3: d = 8. graph space Compute actual score S; Return top-1 pair Compute actual score S; Return top-1 pair Compute S 8 : Perform 8 steps of random walks Compute S 8 : Perform 8 steps of random walks 15 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Remarks on IDJ  IDJ is inspired by the Iterative Deepening Depth-First Search  Search a small scope at early iterations for efficient pruning  Exponentially expand the search scope  Space efficient only store the states of one random surfer at a time Use a small heap to track the top-k candidate pairs  IDJ computes many S d (u,v)’s, which is expensive when d is large.  Can we achieve better pruning for PPR and SR? 16 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Customization for PPR  Personalized PageRank  V i (p,q): prob. a random surfer from p visits q at the i-th step. 17 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Customization for PPR  Upper-Bound for PPR  V i (p,Q): prob. a random surfer from p visits any node in Q at the i-th step.  V i (p,q) ≤ V i (p,Q), since. Replace V i (p,q) with V i (p,Q) and obtain an upper-bound of S d (p,q).  How to obtain V i (p, Q) efficiently? Take nodes in Q as start points, and perform backward random walks 18 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Example: Compute V 2 (p, Q) 1/2 1/5 1/10 1/5 Normal (forward) random walk 1/2 PQ V2(, Q ) = 1/10 + 1/10 = 1/5 19

Example: Compute V 2 (p, Q) 1 1/5 Normal (forward) random walk 1 PQ V2(, Q ) = 1/10 + 1/10 = 1/5 V2(, Q ) = 1/5 + 1/5 = 2/5 20

Example: Compute V 2 (p, Q) V2(, Q ) = 1/10 + 1/10 = 1/5 V2(, Q ) = 1/5 + 1/5 = 2/5 Normal (forward) random walkbackward random walk 1 1/5 1/2 1/5 2/5 PQPQ  Benefit Compute V 2 (p, Q) for all p in P by ONE ROUND of random walks – O(|P|) improvement! 21

 SR is more difficult to handle than PPR  SR involves computing prob. that two random surfers first meet at the i-th iteration  Computing P i (p,q) and S d (u,v) can be very costly  Idea: prune node pairs without evaluating P i.  Pr(“first meet”) ≤ Pr(“meet”)  Pr(“meet”) is much cheaper to derive  Further speed up by backward random walk 22 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han Customization for SR (Sketch)

Experiments  Data set  Yeast: protein-protein interaction graph  Coauthor: graph extracted from DBLP  Cora: citation graph  Default value  k = Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

PPR on Yeast 24 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

PPR on Coauthor 25 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Performance Analysis  PPR on Coauthor 26 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

SR on Cora 27 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Performance Analysis SR in Cora  SR on Cora 28 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Conclusions  The LS-join is a similarity join for graph applications  The e-function captures random-walk LS measures  We develop two LS-join algorithms  IDJ for any e-function  Customized and faster algorithms for PPR and SR 29 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Thank you! Reynold Cheng University of Hong Kong 30 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Future Work  Examine other link-based similarity measures  Consider content- and link- similarity together  Develop indexes and distributed algorithms 31 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

References Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han 32  J. Sankaranarayanan et al. Distance join queries on spatial networks. In GIS, pages 211–218,  L. Zou et al. Distance-join: pattern match query in a large graph database. PVLDB, 2(1):886–897,  J. Dittrich et al. GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces. In KDD, pages 47–56,  A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929,  C. Boehm et al. Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data. In SIGMOD, pages 379–388,  C. Xiao et al. Efficient similarity joins for near duplicate detection. In WWW, pages 131–140,  G. Jeh and J. Widom. Scaling personalized web search. In WWW, pages 271–279,  D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for simrank computation. VLDBJ, 19:45–66,  P. Li et al. Fast single-pair simrank computation. In SDM, pages 571–582,  D. Fogaras and B. R´acz. Scaling link-based similarity search. In WWW, pages 641–650,  P. Sarkar and A. Moore. Fast nearest neighbor search in disk-resident graphs. In KDD, pp. 513–522, 2010.

Related works  Similarity Join  Finds relationship among 2 sets of objects  Set/string data [Arasu et al. VLDB’06] [Xiao et al. WWW’08]  High dimensional database [Boehm et al. SIGMOD’01] [Dittrich et al. KDD’01] 33 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Link-based Similarity Join (LS-Join) Given two subsets of nodes P and Q in a graph and a link- based similarity measure S (e.g., SimRank), return k pairs of nodes, one from each set, with the highest S scores. 34 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Other Applications of LS-Join  Citation network analysis  In the citation network, find top-k similar pairs of papers from AI & DB  Link prediction  Predict future friendships between individuals from two communities, and 35 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

The e d -Function  A unified function for a broad class of random-walk-based similarity metrics  a, b : real-valued constants, with a>0  : decay factor; 0 < <1  : a random walk based probability of the i-th step  e-Function 36 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Examples of e-function  Personalized PageRank (PPR)  : prob. A surfer from u visits v at i-th step  a = 1- ; b = 0  SimRank (SR)  : prob. 2 surfers from u and v first meet at i-th step  a = 1; b = 0  Discounted Hitting Time (DHT)  : prob. a surfer from u first visits v at i-th step  a = 1; b = 1 37 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Remarks  IDJ is inspired by the classical search strategy: Iterative Deepening Depth-First Search  Only search a small scope at early iterations Prune nodes quickly  Exponentially expand the search scope The bounds are refined and more nodes can be pruned  Space efficient only store the states of one random surfer at a time Use a small heap to track the top-k candidate pairs 38 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

Customization for SimRank (SR)  SimRank  F i (p,q): the probability that two random surfers first meet at the i-th step. 39 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

 SR is more difficult to handle than PPR  SR involves two random surfers.  Computing F i (p,q) is even more costly!  We try to prune node pairs without evaluating F i.  Upper bound Idea: Pr(“first meet”) <= Pr(“meet”) And Pr(“meet”) can be much cheaper to derive Further speed up… Using backward random walk! 40 Link Similarity Join L. Sun, R. Cheng, X. Li, D. Cheung, J. Han Customization for SR