1 Authors: Glen Jeh, Jennifer Widom (Stanford University) KDD, 2002 Presented by: Yuchen Bian 3.24.2015 SimRank: a measure of structural-context similarity.

Presentation transcript:

1 Authors: Glen Jeh, Jennifer Widom (Stanford University) KDD, 2002 Presented by: Yuchen Bian SimRank: a measure of structural-context similarity

2 Content: 1. Motivation (similarity measure). 2. SimRank and other versions (basic definition, variant versions, computing SimRank). 3. SimRank and random walk (expected-f meeting distance). 4. Experiments. 5. Conclusion.

4 1. Motivation of RWR. Real-world problems: (a) Ranking: friendship in social networks, keyword search on the WWW, citations among scientific papers. (b) Link prediction: spam prediction in the WWW network, recommender systems in markets. Common problem: given a node (or set of nodes) in a graph, which other nodes are most similar to it? One query node: single-source. All nodes: all-pairs.

5 1. Motivation of RWR. Similarity. Input: a topological graph (nodes and edges). Output: the similarity of other nodes to the query nodes. (a) Shortest path. (b) Common neighbors. Bibliometrics of scientific papers: co-citation, bibliographic coupling. Web pages: hubs and authorities. [Figure: two example graphs on nodes p, q, a, b, ..., n]

6 1. Motivation of RWR. Similarity. Input: nodes and a topological graph. Output: the similarity of other nodes to the query nodes. Observation: two objects are similar if they are related to similar objects. Intuition: similarity can transfer from node pairs to other node pairs. [Figure: two graphs labeled N1 and N2; legend: nodes, edges, node pairs, order, “simplified”]

8 2. SimRank and other Versions: Basic Version. SimRank base case: if $a = b$, then $s(a,b) = 1$. Special case: if $|I(a)| = 0$ or $|I(b)| = 0$, then $s(a,b) = 0$. The constant $c$ is the confidence factor. Information flow: how much similarity (information) can flow to $a$ and $b$ from the similarity sources. Reference: Jeh G, Widom J. SimRank: a measure of structural-context similarity, SIGKDD, 2002.
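
For reference, the basic SimRank recursion of the cited paper can be written with the slide's symbols as below. This is a restatement under the usual notation, where $I(a)$ is the set of in-neighbors of $a$ and $I_i(a)$ is its $i$-th member; the indexing notation comes from the paper rather than from this transcript.

```latex
s(a,b) \;=\; \frac{c}{|I(a)|\,|I(b)|}
  \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a),\, I_j(b)\bigr),
  \qquad a \neq b,\quad 0 < c < 1 .
```

The base case $s(a,a) = 1$ and the special case $s(a,b) = 0$ when either in-neighbor set is empty complete the definition.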

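9 2. SimRank and other Versions: Basic Version. Base case: if $a = b$, then $s(a,b) = 1$. Special case: if $|I(a)| = 0$ or $|I(b)| = 0$, then $s(a,b) = 0$. The constant $c$ is the confidence factor. Example in $G^2$ (15 nodes, 21 edges), with node pairs such as {ProfA,ProfA}, {StudentA,StudentA}, {StudentB,StudentB}, {ProfB,ProfB}, {ProfA,StudentA}, {Univ,ProfA}, {StudentA,Univ}. Following the cycle of pairs: $s\{\text{ProfA},\text{StudentA}\} = c\, s\{\text{Univ},\text{ProfA}\} = c^2\, s\{\text{StudentA},\text{Univ}\} = c^3\, s\{\text{ProfA},\text{StudentA}\}$. Since $c^3 \neq 1$, this forces $s\{\text{ProfA},\text{StudentA}\} = s\{\text{Univ},\text{ProfA}\} = s\{\text{StudentA},\text{Univ}\} = 0$.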

10 2. SimRank and other Versions: Bipartite Version. Application: recommender systems.

11 2. SimRank and other Versions: Bipartite Version in a Homogeneous Domain. Web pages: hubs and authorities. Two scores are defined: a “points to” similarity score and a “pointed to” similarity score. [Figure: example graph on nodes a, b, c, d, e, f, g] Reference: Jeh G, Widom J. SimRank: a measure of structural-context similarity, SIGKDD, 2002.
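
As a sketch of what the two scores look like, the cited paper defines them by mutual recursion. The names $s_1$, $s_2$, $C_1$, $C_2$ and the out-neighbor set $O(a)$ are notation assumed from the paper, not taken from this transcript: $s_1$ is the “points to” score (built from out-neighbors) and $s_2$ is the “pointed to” score (built from in-neighbors).

```latex
s_1(a,b) \;=\; \frac{C_1}{|O(a)|\,|O(b)|}
  \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s_2\bigl(O_i(a),\, O_j(b)\bigr),
\qquad
s_2(a,b) \;=\; \frac{C_2}{|I(a)|\,|I(b)|}
  \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s_1\bigl(I_i(a),\, I_j(b)\bigr).
```

Both scores equal 1 when $a = b$.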

12 2. SimRank and other Versions: Minimax Version. Students (A, B) and courses (c, d). A and B are two students in the same major. Some of the courses they select are the same, e.g. because of curricular requirements, but they must also select some elective courses: A → c, B → d. Problem: what is the similarity of two (different-major) students? Inverse problem: what is the similarity of two (elective) courses?

14 2. SimRank and other Versions: Compute SimRank. Naive method: iterative fixed-point method (iterate until convergence). The iterates $R_k$ converge to the unique solution $s$, with $0 \le s \le 1$. Reference: Jeh G, Widom J. SimRank: a measure of structural-context similarity, SIGKDD, 2002.

15 2. SimRank and other Versions: Compute SimRank. Naive method: iterative fixed-point method (iterative computation). The iterates $R_k$ converge to the unique solution $s$, with $0 \le s \le 1$. 1. Existence: $\lim_{k\to\infty} R_k$ exists ($R_k$ converges). 2. Correctness: $\lim_{k\to\infty} R_k = s$, with $0 \le \lim_{k\to\infty} R_k \le 1$. 3. Uniqueness: $\lim_{k\to\infty} R_k$ is unique. Reference: Jeh G, Widom J. SimRank: a measure of structural-context similarity, SIGKDD, 2002.
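
A minimal sketch of the naive fixed-point iteration, assuming the standard update in which $R_0$ is 1 on the diagonal and 0 elsewhere, and each round averages the previous scores over all in-neighbor pairs. The function name `simrank`, the default `c=0.8`, the fixed iteration count, and the example graph edges are assumptions for illustration; the edges follow the Univ/Prof/Student example of the cited paper and are not listed in this transcript.

```python
from itertools import product


def simrank(in_neighbors, c=0.8, iterations=50):
    """Naive fixed-point iteration for SimRank on a directed graph.

    in_neighbors maps every node to the list of nodes that point to it.
    """
    nodes = list(in_neighbors)
    # R_0: 1 on the diagonal, 0 everywhere else.
    scores = {(a, b): 1.0 if a == b else 0.0
              for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                updated[(a, b)] = 1.0  # base case
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                updated[(a, b)] = 0.0  # special case: no in-neighbors
                continue
            # Average the previous scores over all in-neighbor pairs.
            total = sum(scores[(x, y)] for x in ia for y in ib)
            updated[(a, b)] = c * total / (len(ia) * len(ib))
        scores = updated
    return scores


# Assumed example graph (edges are an assumption for illustration):
# Univ -> ProfA, Univ -> ProfB, ProfA -> StudentA, StudentA -> Univ,
# ProfB -> StudentB, StudentB -> ProfB.
EXAMPLE_IN = {
    "Univ": ["StudentA"],
    "ProfA": ["Univ"],
    "ProfB": ["Univ", "StudentB"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}

example_scores = simrank(EXAMPLE_IN, c=0.8)
```

On this assumed graph, `example_scores[("ProfA", "ProfB")]` settles near 0.414, while `example_scores[("ProfA", "StudentA")]` stays at 0, because that pair lies on a cycle of pairs that never reaches a diagonal pair.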

16 2. SimRank and other Versions: Complexity of the Naive Method. Naive method: iterative fixed-point method. Space: $O(n^2)$. Runtime: $O(Kn^2d^2)$ for $K$ iterations.

17 2. SimRank and other Versions: Pruning Strategy. “When n is significantly large, it is very likely that the neighborhood (say, nodes within a radius of 2 or 3) of a typical node will be a very small percentage (< 1%) of the entire domain.” “If we consider only node-pairs within a radius of r from each other in the underlying undirected graph (other criteria are possible), and there are on average $d_r$ such neighbors for a node, then there will be $n d_r$ node-pairs.”
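
The radius criterion can be sketched as a breadth-first search on the underlying undirected graph. This is one possible way to enumerate the candidate pairs, not the authors' implementation; the function name `pairs_within_radius` and the small path graph are hypothetical.

```python
from collections import deque


def pairs_within_radius(edges, radius):
    """Return the unordered node pairs at undirected distance <= radius."""
    # Build the underlying undirected graph from the directed edge list.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    pairs = set()
    for source in adjacency:
        # Breadth-first search, cut off once the radius is reached.
        depth = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if depth[node] == radius:
                continue
            for nxt in adjacency[node]:
                if nxt not in depth:
                    depth[nxt] = depth[node] + 1
                    queue.append(nxt)
        for other in depth:
            if other != source:
                pairs.add(frozenset((source, other)))
    return pairs


# Hypothetical path graph 0 -> 1 -> 2 -> 3 -> 4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
near_pairs = pairs_within_radius(path_edges, radius=2)
```

Only the pairs returned here would be scored by the iteration; every other pair is treated as having similarity 0.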

18 “Limited-information” problem. Problem: document corpora contain many “unpopular” documents, i.e., documents with very few in-citations. Although the scarcity of contextual information makes them difficult to analyze, these documents are often the most important, since they tend to be harder for humans to find. This is especially true for new documents, which are likely unpopular because it takes time for others to notice and cite them, yet new documents are often the ones we are most interested in. Analysis: A is a new paper cited only by B, so $A_1$ through $A_m$ all appear equally similar to A. When the earlier citations are considered, however, $A_m$ and A intuitively receive “similarity” from B and B', which are similar to each other, so the similarity of $A_m$ and A is larger than that of $A_1$ and A.

19 “Limited-information” problem. Complementary problem: we are interested in a general document C and ask whether A, which has limited information, should be included in a list of the documents most similar to C. On the one hand, A has only one in-citation, which might be an “outlier” citation, so it would be safer to consider only documents for which we have more information. On the other hand, we do not want to eliminate unpopular documents from consideration, nor to favor popular documents for every query. Strategy: eliminate the $|I(b)|$ factor and re-weight the final results using a constant $0 < P < 1$, a parameter adjustable by the end user.

21 4. Experiments. Dataset: scientific research papers (homogeneous graph). Nodes: 278,628 papers. Edges: 688,898 cross-references. Two rough similarity baselines: Citations, the fraction of q's citations that are also cited by p; Titles, the fraction of the words in q's title that also appear in p's title. Algorithms are evaluated by their average improvement (difference) over the baselines.
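
The two rough baselines are simple set fractions. The sketch below is one reading of the slide's wording, assuming citations are given as lists of paper identifiers and titles are split on whitespace after lowercasing; the function names are hypothetical.

```python
def citation_baseline(p_citations, q_citations):
    """Fraction of q's citations that p also cites."""
    q_set = set(q_citations)
    if not q_set:
        return 0.0
    return len(q_set & set(p_citations)) / len(q_set)


def title_baseline(p_title, q_title):
    """Fraction of the words in q's title that also occur in p's title."""
    q_words = set(q_title.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(p_title.lower().split())) / len(q_words)


# Hypothetical inputs, for illustration only.
shared_citations = citation_baseline(["x", "y", "z"], ["y", "z", "w", "v"])
shared_words = title_baseline("A measure of similarity", "Similarity measure")
```

Both measures are asymmetric in p and q, since the denominator is always taken from q.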

22 4. Experiments: scientific research papers (homogeneous graph). For 13,481 objects p, run SimRank and co-citation and select the top N nodes. Results: [Figure: results]

23 4. Experiments. Dataset: students and courses (bipartite graph). Nodes: 1030 undergraduate students, with an average of 40 courses per student. Edges: 1030 transcripts. Rough similarity baseline: if two courses p and q are from the same department, the rough similarity is 1, otherwise 0. Algorithms are evaluated by their average improvement (difference) over the baselines.

24 4. Experiments Students and Courses (bipartite graph), 3193 trials Datasets:

26 5. Conclusion. Evaluation of SimRank. Advantages: the intuition is easy to understand; it solves the all-pairs similarity problem and converges in a few iterations; it ties in with the random-walk view; it considers the entire topological structure, not only common neighbors, which matters especially for the “limited-information” problem; it can be combined with other similarity measures. Disadvantages: space $O(n^2)$; runtime $O(Kn^2d^2)$; the pruning strategy cuts nodes; used on its own, SimRank can sometimes contradict direct intuition about similarity.

27 Yuchen Bian Thank you! Q & A

29 2. SimRank and other Versions: Compute SimRank. Naive method: iterative fixed-point method. The iterates $R_k$ converge to the unique solution $s$, with $0 \le s \le 1$. 1. $\lim_{k\to\infty} R_k$ exists ($R_k$ converges). 2. $\lim_{k\to\infty} R_k$ is unique. 3. $\lim_{k\to\infty} R_k = s$. 4. $0 \le \lim_{k\to\infty} R_k \le 1$. Proof of 4: by induction on $k$, using $0 < c < 1$. Reference: Jeh G, Widom J. SimRank: a measure of structural-context similarity, SIGKDD, 2002.

30 2. SimRank and other Versions: Compute SimRank. Naive method: iterative fixed-point method. The iterates $R_k$ converge to the unique solution $s$, with $0 \le s \le 1$. 1. $\lim_{k\to\infty} R_k$ exists ($R_k$ converges). Proof of 1: use the monotonicity of $R_k$ and the Completeness Axiom. $R_k$ is monotonically nondecreasing and bounded above by 1, so $R_k$ converges to a limit $R$; $R_0$ is a lower bound of $R$.

31 2. SimRank and other Versions: Compute SimRank. Naive method: iterative fixed-point method. The iterates $R_k$ converge to the unique solution $s$, with $0 \le s \le 1$. 1. $\lim_{k\to\infty} R_k$ exists ($R_k$ converges). 2. $\lim_{k\to\infty} R_k$ is unique. 3. $\lim_{k\to\infty} R_k = s$. 4. $0 \le \lim_{k\to\infty} R_k \le 1$. Proof of 3: Reference: Jeh G, Widom J. SimRank: a measure of structural-context similarity, SIGKDD, 2002.

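32 2. SimRank and other Versions: Compute SimRank. Naive method: iterative fixed-point method. The iterates $R_k$ converge to the unique solution $s$, with $0 \le s \le 1$. 2. $\lim_{k\to\infty} R_k$ is unique. Proof: assume there were two solutions, $s_1(a,b)$ and $s_2(a,b)$. For every pair $a, b \in V$, let $\delta(a,b) = s_1(a,b) - s_2(a,b)$, and let $M = \max_{(a,b)} |\delta(a,b)|$. Special cases: if $a = b$, then $s_1(a,b) = s_2(a,b)$ and the difference is 0; if $a$ or $b$ has no in-neighbors, then $s_1(a,b) = s_2(a,b) = 0$ and the difference is 0. Otherwise, the recursion gives $|\delta(a,b)| \le cM$, so $M \le cM$. Since $0 < c < 1$, it follows that $M = 0$ and the solution is unique.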
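
The bound behind the uniqueness argument can be written out as follows. The symbol $\delta$ is a name introduced here for the difference of the two assumed solutions, and $M$ is its maximum absolute value over all pairs; the derivation is a sketch consistent with the standard SimRank recursion.

```latex
\begin{aligned}
\delta(a,b) &= s_1(a,b) - s_2(a,b), \qquad M = \max_{(a,b)} |\delta(a,b)|, \\
|\delta(a,b)|
  &= \left| \frac{c}{|I(a)|\,|I(b)|}
     \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|}
     \Bigl( s_1\bigl(I_i(a), I_j(b)\bigr) - s_2\bigl(I_i(a), I_j(b)\bigr) \Bigr) \right| \\
  &\le \frac{c}{|I(a)|\,|I(b)|}
     \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} M \;=\; cM .
\end{aligned}
```

Taking the maximum over all pairs gives $M \le cM$, which for $0 < c < 1$ is only possible when $M = 0$.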

34 3. SimRank and Random Walk. The SimRank score $s(a, b)$ measures how soon two random surfers are expected to meet at the same node if they start at nodes $a$ and $b$ and walk the graph at random. Random walk with restart: high iteration count, single-source. $O(n^3)$ to $O(n^2)$ (top-k), all pairs.

35 3. SimRank and Random Walk. In a strongly connected graph, the expected distance $d(u,v)$ is the expected number of steps a random walk starting at $u$ takes to reach $v$ for the first time: the walk ends at $v$ and does not touch $v$ before the end. It is given both as a direct definition and in a recursive version.
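
A sketch of the two forms, using notation assumed from the cited paper: a tour $t = \langle w_1, \dots, w_k \rangle$ from $u$ to $v$ touches $v$ only at its end, $l(t) = k - 1$ is its length, $P[t]$ is the probability of walking it, and $O(w)$ is the set of out-neighbors of $w$. The recursive form is the usual first-step decomposition and is stated here as an assumption about what the slide showed.

```latex
\begin{aligned}
d(u,v) &= \sum_{t \,:\, u \rightsquigarrow v} P[t]\; l(t),
  \qquad P[t] = \prod_{i=1}^{k-1} \frac{1}{|O(w_i)|}, \\
d(v,v) &= 0, \qquad
d(u,v) = 1 + \frac{1}{|O(u)|} \sum_{i=1}^{|O(u)|} d\bigl(O_i(u),\, v\bigr)
  \quad \text{for } u \neq v .
\end{aligned}
```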

36 3. SimRank and Random Walk. The derived graph $G^2 = (V^2, E^2)$: each node of $V^2 = V \times V$ represents a pair $(a, b)$ of nodes in $G$. An edge from $(a, b)$ to $(c, d)$ exists in $E^2$ iff the edges $(a, c)$ and $(b, d)$ exist in $G$. Expected meeting distance (EMD): the expected distance in $G^2$ from $(a,b)$ to any singleton node $(x,x)$, i.e., until the two walks meet at the same node, where $t$ is a tour from $(a,b)$ to $(x,x)$. Be careful: $G^2$ is not necessarily strongly connected, so the distance may be infinite. Intuitively, if we think in terms of similarity (information) flow, there must be some relationship between EMD and the SimRank score. [Figure: example graphs on nodes a, b, c, d, f]
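
The construction of the derived graph can be sketched in a few lines. One assumption is made loudly here: the pairs $(a, b)$ and $(b, a)$ are merged into a single unordered node, which reproduces the 15 nodes and 21 edges quoted earlier for the Univ/Prof/Student example. The function name `build_pair_graph` and the example edges (taken from the cited paper's running example, not from this transcript) are assumptions for illustration.

```python
from itertools import combinations_with_replacement


def build_pair_graph(out_neighbors):
    """Build G^2 over unordered node pairs of a directed graph G.

    out_neighbors maps every node to the list of nodes it points to.
    A pair node is a frozenset; a singleton pair (x, x) is frozenset({x}).
    """
    nodes = sorted(out_neighbors)
    pair_nodes = {frozenset(p)
                  for p in combinations_with_replacement(nodes, 2)}
    pair_edges = set()
    for a, b in combinations_with_replacement(nodes, 2):
        # Edge {a, b} -> {c, d} whenever a -> c and b -> d exist in G.
        for c in out_neighbors[a]:
            for d in out_neighbors[b]:
                pair_edges.add((frozenset((a, b)), frozenset((c, d))))
    return pair_nodes, pair_edges


# Assumed example graph (edges are an assumption for illustration).
EXAMPLE_OUT = {
    "Univ": ["ProfA", "ProfB"],
    "ProfA": ["StudentA"],
    "ProfB": ["StudentB"],
    "StudentA": ["Univ"],
    "StudentB": ["ProfB"],
}

pair_nodes, pair_edges = build_pair_graph(EXAMPLE_OUT)
```

With ordered pairs, as in the definition $V^2 = V \times V$, the same graph would instead have 25 nodes; the unordered form is a common simplification because SimRank scores are symmetric.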

37 3. SimRank and Random Walk. To solve the “infinite EMD” problem, we look for a mapping function $f$ from EMD to the SimRank score. EMD can grow to infinity, whereas SimRank is at most 1, so there should be a negative correlation between EMD and SimRank. Expected-f meeting distance (EMD): if $a = b$, then $s'(a,b) = 1$; if there is no path to any $(x,x)$, then $s'(a,b) = 0$. At this point the relationship between $s$ and $s'$ is still not established. Also, is the $c$ here the same as the parameter of the SimRank score?

38 3. SimRank and Random Walk. Expected-f meeting distance (EMD): the idea is the same as in the recursive description of $d(u,v)$ in $G$. First step: move from $(a,b)$ to any out-neighbor pair $O_z((a,b))$. Suppose $t'$ is the remaining tour from $O_z((a,b))$ to $(x,x)$; then $l(t) = l(t') + 1$. At this point we can see that:
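
The step can be sketched as follows, assuming the definition from the cited paper with $f(z) = c^z$, where $P[t]$ is the probability of the tour $t$ in $G^2$ and each first step to an out-neighbor pair has probability $1/(|O(a)|\,|O(b)|)$. The notation is assumed from the paper rather than recovered from this transcript.

```latex
\begin{aligned}
s'(a,b) &= \sum_{t \,:\, (a,b) \rightsquigarrow (x,x)} P[t]\; c^{\,l(t)} \\
        &= \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} \frac{1}{|O(a)|\,|O(b)|}
           \sum_{t' \,:\, (O_i(a),\, O_j(b)) \rightsquigarrow (x,x)} P[t']\; c^{\,l(t')+1} \\
        &= \frac{c}{|O(a)|\,|O(b)|}
           \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s'\bigl(O_i(a),\, O_j(b)\bigr).
\end{aligned}
```

This has the same form as the SimRank recursion with out-neighbors in place of in-neighbors, so walking the edges backwards yields the SimRank score with the same constant $c$.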