PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs.

Presentation transcript:

PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs
Zhewei Wei, Xiaodong He, Xiaokui Xiao, Sibo Wang, Yu Liu, Xiaoyong Du, and Ji-Rong Wen
Zhewei Wei, Renmin University of China

Problems and Motivations

SimRank [KDD 02]
[Figure: example graph over University, Professor A, Professor B, Student A, and Student B, annotated with "Similarity = 1", "High", "High", and the decay factor c ∈ (0,1)]

c-walk
[Figure: example graph with nodes 1 to 12]
c-walk: at each step, the walk terminates with probability 1 − c and moves to a random in-neighbor with probability c.
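
The c-walk definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the dictionary `in_neighbors`, the toy graph, and the choice to stop a walk at a node without in-neighbors are all assumptions made for the example.

```python
import random

def c_walk(in_neighbors, start, c, rng):
    """Sample one c-walk and return the list of visited nodes."""
    path = [start]
    node = start
    # Continue with probability c, i.e. terminate with probability 1 - c.
    while rng.random() < c:
        preds = in_neighbors.get(node, [])
        if not preds:
            # Assumption: a walk at a node without in-neighbors stops.
            break
        node = rng.choice(preds)  # move to a uniformly random in-neighbor
        path.append(node)
    return path

# Hypothetical toy graph: in_neighbors[v] lists the nodes with an edge into v.
in_neighbors = {1: [2, 3], 2: [3], 3: [1], 4: [1, 2]}
walk = c_walk(in_neighbors, start=4, c=0.6, rng=random.Random(0))
print(walk)
```

Passing the random generator in explicitly keeps the sketch reproducible; the expected walk length is bounded by 1/(1 − c) steps regardless of graph size.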

SimRank and c-walk
[Figure: example graph with nodes 1 to 12]
s(u,v) = Pr{two c-walks from u and v meet at the same step}
Monte Carlo algorithm: generate multiple pairs of c-walks; s(u,v) ≈ the fraction of pairs that meet (at the same step).
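
The Monte Carlo estimator described on this slide can be sketched as follows. It is an illustrative sketch under stated assumptions: `c` is the per-step continuation probability of the walk as defined on the c-walk slide, walks stop at nodes without in-neighbors, and the three-node graph `toy` is hypothetical.

```python
import random

def c_walk(in_neighbors, start, c, rng):
    """Sample one c-walk and return the list of visited nodes."""
    path, node = [start], start
    while rng.random() < c:
        preds = in_neighbors.get(node, [])
        if not preds:
            break  # assumption: stop when there is no in-neighbor
        node = rng.choice(preds)
        path.append(node)
    return path

def walks_meet(walk_u, walk_v):
    """True if the two walks are at the same node at the same step."""
    # zip stops at the shorter walk, so both walks are still alive at each step compared.
    return any(a == b for a, b in zip(walk_u, walk_v))

def simrank_mc(in_neighbors, u, v, c, num_pairs, seed=0):
    """Estimate s(u, v) as the fraction of sampled walk pairs that meet."""
    rng = random.Random(seed)
    hits = sum(
        walks_meet(c_walk(in_neighbors, u, c, rng), c_walk(in_neighbors, v, c, rng))
        for _ in range(num_pairs)
    )
    return hits / num_pairs

# Hypothetical graph: "a" and "b" share the single in-neighbor "p".
toy = {"a": ["p"], "b": ["p"], "p": []}
estimate = simrank_mc(toy, "a", "b", c=0.6, num_pairs=20000)
print(estimate)
```

In this toy graph the two walks meet exactly when both take their first step, so the estimate should be close to c² = 0.36.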

Single-Source and Top-k SimRank Queries
[Figure: example graph with nodes 1 to 12 and a score table for node 4 against nodes 1 to 12; values shown: 0.43, 0.10, 0.13, 0.46, 0.05, 0.0]
Single-source query for node 4
Top-2 query for node 4: nodes 1 and 5
Allow an error of a predetermined ε
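
To illustrate how the two query types relate, the sketch below derives a top-k answer from a single-source result. The score dictionary is hypothetical: the slide lists the values but the transcript does not preserve which node each value belongs to, so the assignment here is an assumption for illustration only.

```python
import heapq

def top_k(scores, source, k):
    """Return the k nodes most similar to `source`, as (node, score) pairs."""
    candidates = ((node, s) for node, s in scores.items() if node != source)
    return heapq.nlargest(k, candidates, key=lambda item: item[1])

# Hypothetical single-source result for node 4 (node-to-value assignment assumed).
scores_for_node_4 = {1: 0.46, 2: 0.10, 3: 0.13, 4: 1.0, 5: 0.43, 6: 0.05, 7: 0.0}
result = top_k(scores_for_node_4, source=4, k=2)
print([node for node, _ in result])
```

With approximate scores that are each within ε of the true value, a top-k list built this way can differ from the exact one only among nodes whose true scores are within 2ε of each other.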

Applications
Spam detection [KDD12]
Recommendation systems [WWW15]
Clustering via semantic links [VLDB06]
SimRank has many applications, such as collaborative filtering and recommendation systems. For example, suppose the system needs to guess a user's rating of a given singer in order to decide whether to recommend that singer. First, the system finds the k singers most similar to the given one among those the user has already rated, using a similarity measure such as SimRank, which performs well. Then the system computes a weighted score from those ratings according to the similarities.
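
The recommendation example above amounts to a similarity-weighted average over the k most similar rated items. A minimal sketch follows; the function name, the singer names, the ratings, and the similarity values are all hypothetical, and the similarities are assumed to come from some SimRank computation.

```python
def predict_rating(user_ratings, similarity_to_target, k):
    """Predict a rating as the similarity-weighted mean of the k most similar rated items."""
    rated = [
        (similarity_to_target.get(item, 0.0), rating)
        for item, rating in user_ratings.items()
    ]
    # Keep the k rated items most similar to the target item.
    top = sorted(rated, key=lambda pair: pair[0], reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in top)
    if total_similarity == 0:
        return None  # no similar rated item, so no prediction
    return sum(sim * rating for sim, rating in top) / total_similarity

# Hypothetical data: ratings the user already gave, and similarity of each singer to the target.
user_ratings = {"singer_a": 5.0, "singer_b": 3.0, "singer_c": 1.0}
similarity_to_target = {"singer_a": 0.4, "singer_b": 0.1, "singer_c": 0.0}
predicted = predict_rating(user_ratings, similarity_to_target, k=2)
print(predicted)
```

Here the two most similar singers contribute (0.4 · 5 + 0.1 · 3) / 0.5, so the prediction is about 4.6.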

Taxonomy
Categories: Iterative, Non-iterative, Random Walk
Methods and references (in slide order): PartialSum, Monte Carlo; Lizorkin, VLDB08; Monte Carlo, EDBT04, WWW05; NI-Sim, C. Li, EDBT10; TopSim, Jeffery Yu, ICDE12; FS-SR, P. Li, SDM10; Linearization, Kusumoto, SIGMOD14; KDD14, ICDE15; SRK-Join, G. Li, VLDB14; OIP, W Yu, ICDE13; Information Sciences17; CloudWalker, VLDB15; Par-SR, W Yu, VLDB15; Bin Cui, VLDB15; READS, W Yu, VLDB17

Drawback 1: Linear Query Time
Existing methods (READS [VLDB18], TSF [VLDB15], MC, ...)
[Figure: source node u against nodes 1, ..., i, ..., n; # nodes = 10,000,000]

Drawback 2: SimRank vs. Graph Structure
Dataset        Type      n            m
It-2004        directed  41,291,594   1,150,725,436
Twitter-2010             41,652,230   1,468,365,182
Query time (sec):
it-2004: ProbeSim 0.018, TSF 1.01, TopSim-SM 35.18, Trun-TopSim 0.67, Prio-TopSim 0.2
twitter-2010: 13.6, 191.28, N/A

Our Results

1. Achieving Sub-Linear Time
Can we do better than O(n) on worst-case graphs?
[Figure: example graph with nodes 1 to 7, labeled "SimRank" and "c"]
Output size: O(n)

The end?

1. Achieving Sub-Linear Time
Can we do better than O(n) on real-world graphs?
Power-law graph: P(k) ∼ k^(−γ), γ > 1

PRSim: Query Time
Number of nodes with degree k: P(k) ∼ k^(−γ), γ > 1
2/γ − 1 < 1 and 1/γ < 1

2. γ vs. Query Time
Dataset        Type      n            m
It-2004        directed  41,291,594   1,150,725,436
Twitter-2010             41,652,230   1,468,365,182
Query time (sec):
it-2004: ProbeSim 0.018, TSF 1.01, TopSim-SM 35.18, Trun-TopSim 0.67, Prio-TopSim 0.2
twitter-2010: 13.6, 191.28, N/A
Labels on the slide: Small γ, Large γ

High Level Ideas

PRSim: High-Level Ideas
Reversely calculate probability trees
Precomputation
Sample in the query phase
[Figure: probability tree over nodes a, b, c, d, f, i, j, k, s, t, u, w, x, z with levels depth = 2, depth = 3, depth = 4; the probability of w ↝ c is 1/3]
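
One way to read "reversely calculate probability trees" is as a level-by-level propagation from a target node along out-edges. The sketch below is an interpretation, not the authors' implementation: it computes, for each depth, the probability that a walk of that many steps (each step moving to a uniformly random in-neighbor) ends at the target. The termination probability of the c-walk is left out for simplicity, and the graph and function name are hypothetical.

```python
from collections import defaultdict

def reverse_probabilities(in_neighbors, target, depth):
    """levels[l][x] = probability that an l-step in-neighbor walk from x is at `target`."""
    # Build out-neighbor lists so we can expand backwards from the target.
    out_neighbors = defaultdict(list)
    for node, preds in in_neighbors.items():
        for pred in preds:
            out_neighbors[pred].append(node)

    levels = [{target: 1.0}]
    for _ in range(depth):
        next_level = defaultdict(float)
        for y, prob in levels[-1].items():
            for x in out_neighbors[y]:
                # A walk at x moves to y with probability 1 / in-degree(x).
                next_level[x] += prob / len(in_neighbors[x])
        levels.append(dict(next_level))
    return levels

# Hypothetical graph: node "c" has three in-neighbors, one of which is "w".
graph = {"c": ["w", "a", "b"], "d": ["c"], "w": [], "a": [], "b": []}
levels = reverse_probabilities(graph, target="w", depth=2)
print(levels[1], levels[2])
```

In this toy graph a one-step walk from "c" reaches "w" with probability 1/3, and a two-step walk from "d" does so with the same probability because "d" has "c" as its only in-neighbor.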

Indexing Probability Trees
SLING [SIGMOD16]: precompute probability trees for all target nodes
Resulting index size of O(n/ε)
Much larger than the graph size m
Not scalable for small error ε
Our method: precompute probability trees for only "hub" nodes

Indexing
Hub nodes: nodes with high PageRanks
A random walk from a random source node u is more likely to visit nodes with higher PageRanks
Precomputing probability trees for hub nodes is the most efficient way to reduce query time
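
The hub-selection idea can be illustrated with a PageRank-style score that measures how often a c-walk from a uniformly random start visits each node. This is a sketch under assumptions: the walk follows in-edges as on the c-walk slide, the score is computed by a fixed number of propagation rounds, and the function names, the star-shaped graph, and the parameter `num_hubs` are hypothetical.

```python
def walk_visit_scores(in_neighbors, nodes, c, iterations=50):
    """PageRank-style score: (1 - c) * sum over steps of Pr[a c-walk from a random node is at v]."""
    mass = {v: 1.0 / len(nodes) for v in nodes}  # walk position at the current step
    score = {v: 0.0 for v in nodes}
    for _ in range(iterations):
        for v, p in mass.items():
            score[v] += (1 - c) * p
        next_mass = {v: 0.0 for v in nodes}
        for v, p in mass.items():
            preds = in_neighbors.get(v, [])
            for y in preds:
                # Continue with probability c, then pick a uniform in-neighbor.
                next_mass[y] += c * p / len(preds)
        mass = next_mass
    return score

def pick_hubs(score, num_hubs):
    """Return the num_hubs nodes with the highest score."""
    return sorted(score, key=score.get, reverse=True)[:num_hubs]

# Hypothetical star graph: "h" is the only in-neighbor of every other node.
star = {"a": ["h"], "b": ["h"], "d": ["h"], "h": []}
score = walk_visit_scores(star, nodes=["a", "b", "d", "h"], c=0.6)
hubs = pick_hubs(score, num_hubs=1)
print(hubs, score["h"])
```

In the star graph every walk that takes a step lands on "h", so "h" gets the highest score and would be the first node to receive a precomputed probability tree.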

Probe Algorithm [VLDB18]
Estimate the probability tree for non-hub nodes in the query phase
Sample w according to Pr[w ↝ c] = 1/3
Sample node i w.p. 1
Sample nodes j, k w.p. 1
Sample node f w.p. 1/3
[Figure: probability tree over nodes a, b, c, d, f, i, j, k, s, t, u, w, x, z with levels depth = 2, depth = 3, depth = 4]

Backward Walk Algorithm
Probe algorithm: not efficient for nodes with large out-degrees
Backward Walk algorithm:
Sort each adjacency list by in-degree during preprocessing
Draw a random number r (for example, r = 0.3)
Only visit nodes with in-degree < 1/r
[Figure: node w with out-neighbors ..., j, k, l, p, q, t]
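
The sorted-adjacency trick on this slide can be sketched as below. The sketch covers only the single step the slide describes; the function names, the in-degree values, and the use of binary search to find the cutoff are assumptions for illustration. Because r is uniform on (0, 1), a neighbor with in-degree d satisfies d < 1/r with probability 1/d, which is the probability that a walk at that neighbor would step to w.

```python
import bisect

def preprocess(out_neighbors, in_degree):
    """Sort each out-adjacency list by in-degree and keep the sorted degrees alongside."""
    index = {}
    for node, outs in out_neighbors.items():
        ordered = sorted(outs, key=lambda x: in_degree[x])
        index[node] = (ordered, [in_degree[x] for x in ordered])
    return index

def backward_step(index, node, r):
    """Return the out-neighbors of `node` whose in-degree is strictly below 1/r."""
    ordered, degrees = index[node]
    # Because the list is sorted, the qualifying nodes form a prefix.
    cutoff = bisect.bisect_left(degrees, 1.0 / r)
    return ordered[:cutoff]

# Hypothetical in-degrees for the out-neighbors of w shown on the slide.
out_neighbors = {"w": ["t", "q", "p", "l", "k", "j"]}
in_degree = {"j": 1, "k": 2, "l": 3, "p": 5, "q": 8, "t": 20}
index = preprocess(out_neighbors, in_degree)
visited = backward_step(index, "w", r=0.3)
print(visited)
```

With r = 0.3 the threshold is 1/r ≈ 3.33, so only the neighbors with in-degree 1, 2, and 3 are visited, and the cost of the step depends on the number of visited nodes rather than on the full out-degree of w.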

Experiments

Experiments
Datasets:
Competitors:
Index-based: READS [VLDB18], SLING [SIGMOD16], and TSF [VLDB15]
Index-free: ProbeSim [VLDB18] and TopSim [ICDE12]
Pooling [VLDB18] is used to evaluate precision on large graphs without ground truth

Experiments

Experiments

Experiments
[Figure panels: synthetic power-law graphs; synthetic ER graphs]

Conclusion
A sublinear-time algorithm for single-source SimRank queries on power-law graphs.
It outperforms the state of the art on large graphs in terms of query time, accuracy, index space, and preprocessing time.
The hardness of SimRank computation depends on the power-law exponent γ.

Thank you!