
1 Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda Das

2 Outline  Introduction  Problem formulation  Panther using path sampling  Random Path Sampling  Theoretical Analysis  Algorithms  Panther++  Complexity analysis  Experimental setup  Datasets

3 Outline Cont..  Efficiency and scalability  Accuracy performance results  Parameter sensitivity analysis  Qualitative case study  Related work  Conclusion  Review

4 Introduction  Estimating similarity between vertices is a fundamental issue in network analysis.  It arises in domains such as social networks, biological networks, clustering, and graph matching.  Methods based on common neighbors or structural contexts are difficult to scale up to large networks.  Examples include the Jaccard index and cosine similarity based on common neighbors.  The first challenge is designing a unified method for both principles (common neighbors and structural context).  The second challenge is that most existing methods have a high computation cost.

5 Introduction Cont..  A sampling method is proposed that accurately estimates similarity.  The algorithm is based on random path sampling.  It is approximately 300× faster than alternative methods,  with better performance as well.  The algorithm is extended by augmenting each vertex with a vector of structure-based features, which also captures structural similarity between disconnected vertices. The extension is referred to as Panther++.

6 Top-k similar search

7 Problem Formulation  Given an undirected weighted network G = (V, E, W),  where V is a set of |V| vertices,  E ⊂ V × V is a set of |E| edges between vertices,  and W is a weight matrix, with each element w_ij ∈ W representing the weight associated with edge e_ij.  The focus is top-k similarity search: given a network G = (V, E, W) and a query vertex v ∈ V, find a set X_v,k of k vertices that have the highest similarities to vertex v, where k is a positive integer.  The difference between the exact top-k set X*_v,k and the returned set X_v,k should be less than a threshold ε ∈ (0, 1), i.e., Diff(X*_v,k, X_v,k) ≤ ε.

8 PANTHER: FAST TOP-K SIMILARITY SEARCH USING PATH SAMPLING  A simple approach is to count the number of common neighbors of v_i and v_j.  Using the Jaccard index, similarity is defined as J(v_i, v_j) = |Γ(v_i) ∩ Γ(v_j)| / |Γ(v_i) ∪ Γ(v_j)|, where Γ(v) is the set of neighbors of v.  Algorithms like SimRank instead define similarity recursively: two vertices are similar if their neighbors are similar.  Panther proposes a sampling-based method to estimate the top-k similar vertices.  Panther randomly generates R paths of length T in a given network G = (V, E, W).  The similarity between two vertices is estimated as how likely the two vertices are to appear on the same path.
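
As a point of comparison for the sampling approach, the common-neighbor baseline can be computed directly over adjacency sets. A minimal Python sketch; the toy graph and vertex labels are illustrative, not from the paper:

```python
def jaccard_similarity(neighbors, vi, vj):
    """Jaccard index over the neighbor sets of two vertices."""
    ni, nj = neighbors[vi], neighbors[vj]
    if not ni and not nj:
        return 0.0
    return len(ni & nj) / len(ni | nj)

# neighbors maps each vertex to the set of its adjacent vertices (toy example).
neighbors = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(jaccard_similarity(neighbors, "a", "b"))  # 1/3: common neighbor {c}, union {a, b, c}
```

Computing this exactly for every pair is what becomes expensive on large networks, which is what motivates the sampling approach.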

9 Random Path Sampling  The basic idea is that two vertices are similar if they frequently appear on the same paths.  A T-path is a sequence of vertices p = (v_1, ..., v_{T+1}), which consists of T + 1 vertices and T edges.  Path similarity between v_i and v_j is defined in Eq. (1) as how likely the two vertices appear on the same path.  Eq. (1) is then estimated based on the R sampled paths.  To generate a path, we randomly select a vertex v in G as the starting point,  and then conduct a random walk of T steps from v, as in the sketch below.
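
A minimal sketch of the path generator described above. The adjacency representation (vertex → {neighbor: weight}) and the rule of stepping proportionally to edge weight are assumptions for illustration; the slide does not reproduce the exact transition probabilities:

```python
import random

def sample_path(adj, T):
    """Sample one random path of up to T steps from a uniformly random start vertex.

    adj maps each vertex to a dict {neighbor: edge weight}; each step picks a
    neighbor with probability proportional to the edge weight (an assumption).
    """
    v = random.choice(list(adj))          # uniformly random starting vertex
    path = [v]
    for _ in range(T):
        nbrs = adj[v]
        if not nbrs:                      # dead end: stop the walk early
            break
        v = random.choices(list(nbrs), weights=list(nbrs.values()), k=1)[0]
        path.append(v)
    return path
```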

10 Random Path Sampling Cont..  The path weight w(p) is then defined as  Eq. (2) is then rewritten as  Calculating Eq. (4) takes O(RT) time.
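
Given R sampled paths (e.g. paths = [sample_path(adj, T) for _ in range(R)]), the co-occurrence counting behind the estimate can be done in a single pass over the paths, which is where the O(RT) cost comes from. A sketch assuming uniform path weights, since the exact w(p) formula is not reproduced in the transcript:

```python
from collections import defaultdict

def topk_by_path_cooccurrence(paths, query, k):
    """Score each vertex by the fraction of sampled paths it shares with `query`."""
    R = len(paths)
    count = defaultdict(int)
    for p in paths:                       # one pass over R paths of length <= T + 1
        verts = set(p)
        if query in verts:
            for u in verts:
                if u != query:
                    count[u] += 1
    scores = {u: c / R for u, c in count.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```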

11 Random Path Sampling Example

12 Theoretical Analysis  The required number of sampled paths is derived from Vapnik-Chervonenkis (VC) theory, as follows.  Let R be a range set on a domain D, with VC(R) ≤ d, and let φ be a distribution on D. Given ε, δ ∈ (0, 1), let S be a set of |S| points sampled from D according to φ, with |S| bounded from below.  The minimal value of d is then derived.  The sample size (the number of paths to sample) is then calculated from these quantities.
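
The bound omitted from the transcript has, in standard VC theory, the ε-approximation form below; the universal constant c and the exact expression the paper uses for d are not shown on the slide, so this is a hedged reconstruction rather than the paper's equation:

```latex
% Standard epsilon-approximation bound from VC theory (assumed form; c is a universal constant):
% with probability at least 1 - delta, the sample S is an epsilon-approximation if
|S| \;\ge\; \frac{c}{\varepsilon^{2}}\left(d + \ln\frac{1}{\delta}\right)
```

Plugging the bound on d into this expression gives the number of paths R to sample.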

13 Algorithms

14 Panther++  A limitation of Panther is that the similarities it obtains are biased toward close neighbors.  Panther++ is an extension of the Panther algorithm.  The idea is to augment each vertex with a feature vector.  We select the top-D similarities calculated by Panther to represent the probability distribution.  For each vertex v_i in the network, we first calculate the similarity between v_i and all the other vertices using Panther.  Then we construct a feature vector for v_i by taking the largest D similarity scores as feature values, as in the sketch below.
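
A sketch of the feature-vector construction described above. panther_similarities is a hypothetical stand-in for Panther's output (a {vertex: score} map for the given vertex); the zero-padding for vertices with fewer than D scores is an assumption:

```python
def build_feature_vector(v, panther_similarities, D):
    """Represent v by its D largest Panther similarity scores, in descending order."""
    scores = sorted(panther_similarities(v).values(), reverse=True)
    vec = scores[:D]
    return vec + [0.0] * (D - len(vec))   # pad so every vertex has a length-D vector
```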

15 Panther++  Finally, the similarity between v_i and v_j is re-calculated as the reciprocal of the Euclidean distance between their feature vectors.  Indexing techniques are used to improve the algorithm's efficiency.  A memory-based kd-tree index over the feature vectors of all vertices is created.  Then, given a vertex, we can retrieve its top-k vertices from the kd-tree.  Based on the index, we can also query very quickly whether a given point is stored in it.
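
A sketch of the re-scoring and kd-tree lookup using SciPy's cKDTree (the actual index implementation in the paper may differ; the small epsilon guarding against a zero distance is an assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

# vectors: a |V| x D array of Panther++ feature vectors; row order matches `vertices`.
vertices = ["v1", "v2", "v3", "v4"]
vectors = np.random.rand(len(vertices), 8)

tree = cKDTree(vectors)                              # memory-based kd-tree over the vectors

def topk_similar(i, k):
    """Return (vertex, similarity) pairs for the k vertices closest to vertex i."""
    dist, idx = tree.query(vectors[i], k=k + 1)      # +1 because vertex i finds itself
    return [(vertices[j], 1.0 / (d + 1e-12)) for d, j in zip(dist, idx) if j != i][:k]

print(topk_similar(0, 2))
```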

16 Algorithm

17 Index built in Panther++

18 Complexity Analysis  In general, existing methods result in high complexities.  Table 1 summarizes the time and space complexities of the different methods.

19 Complexity Analysis Cont..  For Panther, its time complexity includes two parts:  Random path sampling: The time complexity of generating random paths is O(RT log d̄), where log d̄ is very small and can be simplified as a small constant c. Hence, the time complexity is O(RTc).  Top-k similarity search: The time complexity of calculating top-k similar vertices for all vertices is O(|V|RT̄ + |V|M̄).  Panther++ requires additional computation to build the kd-tree.  The time complexity of building a kd-tree is O(|V| log |V|), and querying top-k similar vertices for any vertex is O(|V| log |V|).

20 Experimental Setup

21 Datasets  Tencent:  A popular Twitter-like microblogging service in China.  It consists of over 355,591,065 users and 5,958,853,072 “following” relationships.  The weight associated with each edge is set as 1.0 uniformly.  This dataset is used to evaluate the efficiency performance of the methods.  Twitter:  The most popular user on Twitter, “Lady Gaga”, was selected first, together with 10,000 randomly chosen followers of hers.  Then all followers of these users were collected. In the end, 113,044 users and 468,238 “following” relationships were obtained in total.  The weight associated with each edge is also set as 1.0 uniformly.  This dataset is used to evaluate the accuracy of Panther and Panther++.

22 Datasets  Mobile:  The dataset consists of millions of call records.  Each user is treated as a vertex, and communication between users as an edge.  The resultant network consists of 194,526 vertices and 206,934 edges.  The weight associated with each edge is defined as the number of calls.  Co-author:  The dataset is from AMiner.org and contains 2,092,356 papers from the conferences KDD, ICDM, SIGIR, CIKM, SIGMOD, ICDE, and ICML.  The weight associated with each edge is the number of papers coauthored by the two connected authors.

23 Efficiency and Scalability Performance

24 Speed-up of Panther++

25 Accuracy Performance with Applications Identity Resolution:  Authorship at different conferences is used to generate multiple co-author networks.  An author may have a corresponding vertex in each of the generated networks.  Given any two co-author networks, for example KDD-ICDM, each method performs a top-k search to find similar vertices in ICDM for each vertex in KDD.  A result is counted as correct if the k similar vertices returned from ICDM contain the vertex corresponding to the author of the query vertex from KDD, as in the sketch below.
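
The correctness criterion described above is essentially a hit@k check over vertex pairs known to belong to the same author. A minimal sketch; topk_search and the true_match mapping are illustrative assumptions, not artifacts from the paper:

```python
def hit_at_k(query_vertices, true_match, topk_search, k):
    """Fraction of query vertices whose known counterpart appears in the returned top-k."""
    hits = sum(1 for v in query_vertices if true_match[v] in topk_search(v, k))
    return hits / len(query_vertices)
```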

26 Performance Result

27 Accuracy Performance with Applications Approximating Common Neighbors:  Panther performs as well as TopSim, the top-k version of SimRank.  Both are based on the principle that two vertices are considered structurally equivalent if they have many common neighbors in a network.  However, TopSim runs much slower than Panther.

28 Parameter Sensitivity Analysis Effect of Path Length T:  A too small T (< 5) results in inferior performance.  On Twitter, performance becomes almost stable once T is increased to 5.  On Mobile, T = 5 also seems to be a good choice.

29 Parameter Sensitivity Analysis Effect of Vector Dimension D:  Performance gets better as D increases and stays the same once D grows beyond 50.  Thus, the higher the vector dimension, the better the approximation.  Once the dimension exceeds a threshold, the performance becomes stable.

30 Qualitative Case study  Similar Researchers:

31 RELATED WORK  Early similarity measures, including bibliographic coupling and co-citation, are based on the assumption that two vertices are similar if they have many common neighbors.  Katz considers two vertices more similar if they are connected by more and shorter paths.  The SimRank algorithm follows a basic recursive intuition: two nodes are similar if they are referenced by similar nodes.  Most of these methods cannot handle similarity estimation across different networks.  A HITS-based recursive method and RoleSim can also calculate the similarity between disconnected vertices.  ReFeX defines basic features such as the degree and the number of within-/out-egonet edges, but its computational complexity depends on the number of recursions.

32 CONCLUSION  The algorithm is based on the idea of random paths.  An extended method is also presented to enhance structural similarity when two vertices are completely disconnected.  An extensive empirical study shows that the algorithm can obtain the top-k similar vertices for any vertex in a network approximately 300× faster than other methods.  Applications in social networks are used to evaluate the accuracy of the estimated similarities.  Experimental results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.

33 Review  A detailed example of random path sampling and of how the algorithms work would have helped in understanding the methods more deeply.  The theorem proof is also not presented in detail.

34 Questions??

