Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.

Slides:

Advertisements

Similar presentations

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Advertisements

Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.

Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology,

Introduction to Algorithms Rabie A. Ramadan rabieramadan.org 2 Some of the sides are exported from different sources.

Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.

Fast Algorithms For Hierarchical Range Histogram Constructions

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

VisualRank: Applying PageRank to Large-Scale Image Search Yushi Jing, Member, IEEE, and Shumeet Baluja, Member, IEEE.

1 Appendix B: Solving TSP by Dynamic Programming Course: Algorithm Design and Analysis.

Maximizing the Spread of Influence through a Social Network

Beyond Trilateration: On the Localizability of Wireless Ad Hoc Networks Reported by: 莫斌.

Author: Jie chen and Yousef Saad IEEE transactions of knowledge and data engineering.

DATA MINING LECTURE 12 Link Analysis Ranking Random walks.

Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.

Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.

Neighborhood Formation and Anomaly Detection in Bipartite Graphs Jimeng Sun Huiming Qu Deepayan Chakrabarti Christos Faloutsos Speaker: Jimeng Sun.

Heuristic alignment algorithms and cost matrices

Semi-Supervised Learning Using Randomized Mincuts Avrim Blum, John Lafferty, Raja Reddy, Mugizi Rwebangira.

Visual Querying By Color Perceptive Regions Alberto del Bimbo, M. Mugnaini, P. Pala, and F. Turco University of Florence, Italy Pattern Recognition, 1998.

Novel Self-Configurable Positioning Technique for Multihop Wireless Networks Authors ： Hongyi Wu Chong Wang Nian-Feng Tzeng IEEE/ACM TRANSACTIONS ON NETWORKING,

1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman

Probability Grid: A Location Estimation Scheme for Wireless Sensor Networks Presented by cychen Date ： 3/7 In Secon (Sensor and Ad Hoc Communications and.

Query-Based Outlier Detection in Heterogeneous Information Networks Jonathan Kuck 1, Honglei Zhuang 1, Xifeng Yan 2, Hasan Cam 3, Jiawei Han 1 1 University.

Exposure In Wireless Ad-Hoc Sensor Networks S. Megerian, F. Koushanfar, G. Qu, G. Veltri, M. Potkonjak ACM SIG MOBILE 2001 (Mobicom) Journal version: S.

Network Measures Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Network Measures Klout.

TEDI: Efficient Shortest Path Query Answering on Graphs Author: Fang Wei SIGMOD 2010 Presentation: Dr. Greg Speegle.

Focused Matrix Factorization for Audience Selection in Display Advertising BHARGAV KANAGAL, AMR AHMED, SANDEEP PANDEY, VANJA JOSIFOVSKI, LLUIS GARCIA-PUEYO,

WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS junction.

CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina

SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu

1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,

Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.

On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.

Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.

The Application of The Improved Hybrid Ant Colony Algorithm in Vehicle Routing Optimization Problem International Conference on Future Computer and Communication,

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova ， Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.

Finding Top-k Shortest Path Distance Changes in an Evolutionary Network SSTD th August 2011 Manish Gupta UIUC Charu Aggarwal IBM Jiawei Han UIUC.

Semantic Wordfication of Document Collections Presenter: Yingyu Wu.

Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.

Learning the Structure of Related Tasks Presented by Lihan He Machine Learning Reading Group Duke University 02/03/2006 A. Niculescu-Mizil, R. Caruana.

ITEC 2620A Introduction to Data Structures Instructor: Prof. Z. Yang Course Website: 2620a.htm Office: TEL 3049.

2005/12/021 Content-Based Image Retrieval Using Grey Relational Analysis Dept. of Computer Engineering Tatung University Presenter: Tienwei Tsai ( 蔡殿偉.

1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.

Jiafeng Guo(ICT) Xueqi Cheng(ICT) Hua-Wei Shen(ICT) Gu Xu (MSRA) Speaker: Rui-Rui Li Supervisor: Prof. Ben Kao.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

D-skyline and T-skyline Methods for Similarity Search Query in Streaming Environment Ling Wang 1, Tie Hua Zhou 1, Kyung Ah Kim 2, Eun Jong Cha 2, and Keun.

Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.

CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,

Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.

1 CS 430: Information Discovery Lecture 5 Ranking.

Sporadic model building for efficiency enhancement of the hierarchical BOA Genetic Programming and Evolvable Machines (2008) 9: Martin Pelikan, Kumara.

Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.

1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.

Scalable Learning of Collective Behavior Based on Sparse Social Dimensions Lei Tang, Huan Liu CIKM ’ 09 Speaker: Hsin-Lan, Wang Date: 2010/02/01.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

Glen Jeh & Jennifer Widom KDD  Many applications require a measure of “similarity” between objects.  Web search  Shopping Recommendations  Search.

Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.

Federated text retrieval from uncooperative overlapped collections Milad Shokouhi, RMIT University, Melbourne, Australia Justin Zobel, RMIT University,

3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.

Spatial Online Sampling and Aggregation

Section 7.12: Similarity By: Ralucca Gera, NPS.

Probably Approximately

ITEC 2620M Introduction to Data Structures

Binghui Wang, Le Zhang, Neil Zhenqiang Gong

Asymmetric Transitivity Preserving Graph Embedding

Topological Signatures For Fast Mobility Analysis

Locality In Distributed Graph Algorithms

PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs.

Presentation transcript:

Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda Das

Outline  Introduction  Problem formulation  Panther using path sampling  Random Path Sampling  Theoretical Analysis  Algorithms  Panther++  Complexity analysis  Experimental setup  Datasets

Outline Cont..  Efficiency and scalability  Accuracy performance Results  Parameter sensitivity analysis  Qualitative case study  Related work  Conclusion  Review

Introduction  Estimating similarity between vertices is fundamental issue in network analysis.  For example domains like social networks, biological networks, clustering, graph matching etc.  Methods such as common neighbors, structural contexts are difficult to scale up in large networks.  Examples Jaccard index, Cosine similarity based on common neighbors.  First challenge is designing an unified method for both principles.  Second challenge is Most existing methods have a high computation cost

Introduction Cont..  Sampling method is proposed that accurately estimates similarity  Algorithm is based on Random Path to enhance structural similarity between disconnected vertices.  It is 300X faster  Also better performance  Extend the algorithm by augmenting each vertex with a vector of structure-based features. Referred to as Panther++.

Top-k similar search

Problem Formulation  Given Undirected Weighted Network. Let G = (V, E, W)  where V is a set of |V | vertices  E ⊂ V × V is a set of |E| edges between vertices.  W be a weight matrix, with each element w ij ∈ W representing the weight associated with edge e ij.  Focus is finding top-k similar vertices of a given network G = (V, E, W) and a query vertex v ∈ V, how to find a set X v,k of k vertices that have the highest similarities to vertex v, where k is a positive integer.  difference between the two sets X ∗ v,k and X v,k is less than a threshold ε ∈ (0, 1), i.e., Diff(X ∗ v,k, X v,k ) ≤ ε

PANTHER: FAST TOP-K SIMILARITY SEARCH USING PATH SAMPLING  A simple approach is number of common neighbors of v i and v j.  Using Jaccard Index similarity is defined as,  If we use algorithms like SimRank, similarity is  Panther propose a sampling-based method to estimate the top-k similar vertices.  Panther randomly generates R paths with length T in a given network G=(V,E,W)  similarity estimation between two vertices is estimating how likely two vertices appear on a same path.

Random Path Sampling  The basic idea is that two vertices are similar if they frequently appear on the same paths.  A T-path sequence of vertices p = (v 1, · · ·, v T +1 ), which consists of T + 1 vertices and T edges.  Path similarity between vi and v j is defined as:  Eq (1) is recalculated based on the R sampled paths  To generate a path, we randomly select a vertex in G as the starting point,  And then conduct random walks of T steps from v,

Random path sampling cont..  Then w(p) is defined as  The Eq(2) is written as  To calculate Eq. (4), the time complexity is O(RT).

Random Path Sampling Example

Theoretical Analysis  From Vapnik-Chervonenkis (VC) the following is defined.  Let R be a range set on a domain D, with V C(R) ≤ d, and let φ be a distribution on D. Given ε, δ ∈ (0, 1), let S be a set of |S| points sampled from D according to φ, with  The value d is minimized to  Then the Sample space is calculated as

Algorithms

Panther++  Limitation of Panther is that the similarities obtained by the algorithm have a bias to close neighbors.  Panther++ is the extension of Panther Algorithm.  The idea is to augment each vertex with a feature vector.  We select the top-D similarities calculated by Panther to represent the probability distribution.  For vertex v i in the network, we first calculate the similarity between v i and all the other vertices using Panther.  Then we construct a feature vector for v i by taking the largest D similarity scores as feature values, i.e.,

Panther++  Finally, the similarity between vi and vj is re-calculated as the reciprocal Euclidean distance between their feature vectors:  The indexing techniques used to improve the algorithm efficiency.  A memory based kd-tree index for feature vectors of all vertices is created.  Then given a vertex, we can retrieve top-k vertices in the kd-tree  Based on the index, we can query whether a given point is stored in the index very fast.

Algorithm

Index built in Panther++

Complexity Analysis  In general, existing methods result in high complexities.  Table 1 summarizes the time and space complexities of the different methods.

Complexity Analysis Cont..  For Panther, its time complexity includes two parts:  Random path sampling: The time complexity of generating random paths is O(RT log d¯), where log d¯is very small and can be simplified as a small constant c. Hence, the time complexity is O(RT c).  Top-k similarity search: The time complexity of calculating top-k similar vertices for all vertices is O(|V |RT¯ + |V |M¯ )  Panther++ requires additional computation to build the kd-tree.  The time complexity of building a kd-tree is O(|V | log |V |) and querying top- k similar vertices for any vertex is O(|V | log |V |)

Experimental Setup

Datasets  Tencent :  A popular Twitter-like microblogging service in China  consists of over 355,591,065 users and 5,958,853,072 “following” relationships.  The weight associated with each edge is set as 1.0 uniformly.  used it to evaluate the efficiency performance of our methods.  Twitter :  first selected the most popular user on Twitter, i.e., “Lady Gaga” and randomly selected 10,000 of her followers.  Then collected all followers of these users. In the end, 113,044 users and 468,238 “following” relationships in total obtained.  The weight associated with each edge is also set as 1.0 uniformly  this dataset to used to evaluate the accuracy of Panther and Panther++.

Datasets  Mobile:  consists of millions of call records.  each user as a vertex, and communication between users as an edge.  The resultant network consists of 194,526 vertices and 206,934 edges.  The weight associated with each edge is defined as the number of calls.  Co-author:  The dataset is from AMiner.org, and contains 2,092,356 papers from the conferences KDD, ICDM, SIGIR, CIKM, SIGMOD, ICDE, and ICML.  The weight associated with each edge is the number of papers collaborated on by the two connected authors

Efficiency and Scalability Performance

Speed-up of Panther++

Accuracy Performance with Applications Identity Resolution :  authorship at different conferences is used to generate multiple co-author networks.  An author may have a corresponding vertex in each of the generated networks.  Given any two co-author networks, for example KDD-ICDM, a top-k search is performed to find similar vertices from ICDM for each vertex in KDD by different methods.  Correct if the returned k similar vertices from ICDM by a method consists of the corresponding author of the query vertex from KDD.

Performance Result

Accuracy Performance with Applications Approximating Common Neighbors :  Panther performs as good as TopSim, the top-k version of SimRank.  Both based on the principle that two vertices are considered structuraly equivalent if they have many common neighbors in a network.  However, TopSim performs much slower than Panther.

Parameter Sensitivity Analysis Effect of Path Length T :  A too small T(< 5) would result in inferior performance.  On Twitter, when increasing its value up to 5, it almost becomes stable  On Mobile, T = 5 seems to be a good choice.

Parameter Sensitivity Analysis Effect of Vector Dimension D :  performance gets better when D increases and it remains the same after D gets larger than 50.  Thus, the higher the vector dimension, the better the approximation  Once the dimension exceeds a threshold, the performance gets stable.

Qualitative Case study  Similar Researchers:

RELATED WORK  Early similarity measures, including bibliographical coupling and co-citation are based on the assumption that two vertices are similar if they have many common neighbors.  Katz counts two vertices as similar if there are more and shorter paths between them.  SimRank algorithm follows a basic recursive intuition that two nodes are similar if they are referenced by similar nodes.  Most of these methods cannot handle similarity estimation across different networks.  HITS-based recursive method and RoleSim can also calculate the similarity between disconnected vertices.  ReFex defines basic features such as degree, the number of within/out-egonet edges but the computational complexity of ReFex depends on the recursive times.

CONCLUSION  The algorithm is based on the idea of random path.  The extended method is also presented to enhance the structural similarity when two vertices are completely disconnected.  An extensive empirical study is performed and shown that our algorithm can obtain top-k similar vertices for any vertex in a network approximately 300× faster than other methods.  Applications in social networks is performed, to evaluate the accuracy of the estimated similarities.  Experimental results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.

Review  A detailed example of random path sampling and how the algorithms work would have helped in the detailed understanding of the methods.  Theorem proof is also not in detail.

Questions??