SimRank: A Measure of Structural-Context Similarity Glen Jeh and Jennifer Widom Stanford University ACM SIGKDD 2002 January 19, 2011 Taikyoung Kim SNU.

Presentation transcript:

SimRank: A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom, Stanford University
ACM SIGKDD 2002
January 19, 2011
Taikyoung Kim, SNU IDB Lab.

Outline
• Introduction
• Basic Graph Model
• SimRank
• Random Surfer-Pairs Model
• Conclusion
• Future Work

Introduction
• Many applications require a measure of “similarity” between objects
  – “Find-similar-document” query in a search engine
  – Collaborative filtering in a recommender system

Introduction
• Proposes a general approach that exploits the object-to-object relationships found in many domains
  – An algorithm that computes similarity scores between nodes based on their structural context
• Intuition behind the algorithm
  – Similar objects are related to similar objects
  – The base case is that objects are similar to themselves
• “Two objects are similar if they are referenced by similar objects”

Basic Graph Model
• G = (V, E) [vertices, edges]
  – Nodes in V: objects in the domain
  – Directed edges in E: relationships between objects
  – ⟨p, q⟩: an edge from object p to object q
• For a node v, denote:
  – I(v): the set of in-neighbors of v
  – O(v): the set of out-neighbors of v
  – I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
  – O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
[Figure: example graph with O(Univ) and I(ProfB) marked]
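The in-neighbor and out-neighbor sets can be sketched in a few lines of Python. This is only an illustration: the function name `build_neighbors` and the edge list are assumptions, loosely modeled on the Univ/Prof/Student node names that appear in the slides, not data taken from the presentation.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): dicts mapping each node to its in- and out-neighbor lists."""
    in_nbrs = defaultdict(list)
    out_nbrs = defaultdict(list)
    for p, q in edges:           # directed edge from object p to object q
        out_nbrs[p].append(q)    # q is an out-neighbor of p
        in_nbrs[q].append(p)     # p is an in-neighbor of q
    return in_nbrs, out_nbrs

# Hypothetical edge list (an assumption made for this example).
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("StudentA", "Univ"),
         ("ProfB", "StudentB"), ("StudentB", "ProfB")]
I, O = build_neighbors(edges)
print("O(Univ) =", O["Univ"])
print("I(ProfB) =", I["ProfB"])
```

Lists are used rather than sets so that I_i(v) and O_i(v) have a well-defined index i.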


SimRank
• Motivation
  – Two objects are similar if they are referenced by similar objects
  – An object is considered maximally similar to itself (similarity score of 1)
[Figure: example graph; similar nodes: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …]

SimRank: Basic SimRank Equation
• The similarity between objects a and b: s(a, b) ∈ [0, 1]
  – C is a constant between 0 and 1
    • Called the confidence level or decay factor
    • Because C < 1, it gives the rate at which similarity decays as it flows across edges
  – If a or b has no in-neighbors, s(a, b) = 0
  – SimRank scores are symmetric, i.e., s(a, b) = s(b, a)
• The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C
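Written out from the description above (an average over all pairs of in-neighbors, scaled by C), the basic equation would take the following form. It is a reconstruction from the bullet text and the notation of the graph-model slide, and it agrees with the definition in the original SimRank paper as far as I recall; the base case s(a, a) = 1 comes from the motivation slide.

```latex
s(a,b) \;=\; \frac{C}{|I(a)|\,|I(b)|}
        \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a),\, I_j(b)\bigr)
        \quad (a \neq b),
\qquad s(a,a) = 1,
\qquad s(a,b) = 0 \ \text{if } I(a) = \emptyset \text{ or } I(b) = \emptyset .
```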

SimRank: Basic SimRank Equation
• Similarity can be thought of as “propagating” from pair to pair
  – Consider the derived graph G² = (V², E²), where
    • V² = V × V; each node in V² represents a pair (a, b) of nodes in G
    • An edge from (a, b) to (c, d) exists in E² iff the edges ⟨a, c⟩ and ⟨b, d⟩ exist in G
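The edge rule for the derived pair graph can be illustrated with a small sketch: every ordered pair of edges of G yields one edge of the pair graph. The function name `pair_graph_edges` and the three-node example are assumptions made for illustration.

```python
from itertools import product

def pair_graph_edges(edges):
    """Edges of the pair graph: (a, b) -> (c, d) iff a -> c and b -> d are edges of G."""
    return {((a, b), (c, d)) for (a, c), (b, d) in product(edges, repeat=2)}

# Hypothetical graph G: u points to both v and w.
g_edges = [("u", "v"), ("u", "w")]
g2_edges = pair_graph_edges(g_edges)
for src, dst in sorted(g2_edges):
    print(src, "->", dst)
```

Materializing every pair edge grows quadratically with the number of edges of G, so this is only practical for small examples.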

SimRank: Bipartite SimRank
• Bipartite domains consist of two types of objects
• Recommender system
  – People are similar if they purchase similar items
  – Items are similar if they are purchased by similar people

SimRank: Bipartite SimRank
• Bipartite equation
  – Directed edges go from people to items
  – s(A, B) denotes the similarity between persons A and B (A ≠ B)
  – s(c, d) denotes the similarity between items c and d (c ≠ d)
  – The similarity between persons A and B is the average similarity between the items they purchased
  – The similarity between items c and d is the average similarity between the people who purchased them
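Following the two verbal definitions above, the bipartite equations can be written as below. Because edges go from people to items, persons are compared through their out-neighbors and items through their in-neighbors. The separate decay constants C_1 and C_2 are an assumption based on my recollection of the original paper; the slide text itself does not name them.

```latex
s(A,B) \;=\; \frac{C_1}{|O(A)|\,|O(B)|}
        \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A),\, O_j(B)\bigr)
        \quad (A \neq B)

s(c,d) \;=\; \frac{C_2}{|I(c)|\,|I(d)|}
        \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c),\, I_j(d)\bigr)
        \quad (c \neq d)
```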

SimRank: Computing SimRank - Naïve Method
• R_k(a, b) gives the score between a and b at iteration k
• The values R_k(*, *) are non-decreasing as k increases
• In experiments, R_k converges rapidly, within about K = 5 iterations
• Complexity
  – Space: O(n²) to store the results R_k
  – Time: O(K n² d_2), where d_2 is the average of |I(a)||I(b)| over all node pairs (a, b)
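A minimal sketch of the naïve iteration follows, assuming the usual starting point R_0(a, b) = 1 if a = b and 0 otherwise. The function name `simrank`, the default C = 0.8, and the tiny example graph are assumptions chosen for illustration, not details taken from the slides.

```python
from itertools import product

def simrank(nodes, in_nbrs, C=0.8, K=5):
    """Naive SimRank: K rounds of updates over all n^2 node pairs."""
    # R_0: a node is maximally similar to itself, everything else starts at 0.
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                R_next[(a, b)] = 1.0
                continue
            Ia, Ib = in_nbrs.get(a, []), in_nbrs.get(b, [])
            if not Ia or not Ib:
                R_next[(a, b)] = 0.0   # no in-neighbors: similarity is 0
                continue
            # Average similarity over all in-neighbor pairs, scaled by C.
            total = sum(R[(i, j)] for i in Ia for j in Ib)
            R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R

# Hypothetical graph: u points to both v and w.
nodes = ["u", "v", "w"]
in_nbrs = {"v": ["u"], "w": ["u"]}
scores = simrank(nodes, in_nbrs, C=0.8, K=5)
print(scores[("v", "w")])
```

In this example v and w share their only in-neighbor u, so their score settles at C after the first iteration, while any pair involving u (which has no in-neighbors) stays at 0.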

SimRank: Computing SimRank - Pruning
• Pruning the logical graph G²
  – In the naïve method,
    • All n² nodes of G² are considered
    • Similarity scores are computed for every node pair
  – Nodes far from a node v have lower similarity scores with v than nodes near v
• Pruning
  – Set the similarity between two nodes that are far apart to 0
  – Consider only node pairs whose nodes are near each other, within a radius r
  – Complexity
    • Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
    • Time: O(K n d_r d_2)
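One way to pick the node pairs that survive pruning is a bounded breadth-first search from every node. The sketch below assumes that "near" means hop distance at most r in the undirected version of the graph; the slide does not spell out the distance measure, so that choice, along with the names `neighbors_within_radius` and `candidate_pairs`, is an assumption.

```python
from collections import deque

def neighbors_within_radius(adj, source, r):
    """Nodes at undirected hop distance <= r from source (bounded BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == r:          # do not expand beyond the radius
            continue
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

def candidate_pairs(edges, r):
    """Node pairs kept after pruning; all other pairs are treated as similarity 0."""
    adj = {}
    for p, q in edges:            # ignore edge direction when measuring distance
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    return {(a, b) for a in adj for b in neighbors_within_radius(adj, a, r)}

# Hypothetical path graph a -> b -> c -> d.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
pairs_r1 = candidate_pairs(path_edges, 1)
pairs_r2 = candidate_pairs(path_edges, 2)
print(len(pairs_r1), len(pairs_r2))
```

Only the pairs returned here would be stored and updated during the iteration, which is where the O(n d_r) space bound comes from.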


Random Surfer-Pairs Model
• Provides an intuitive model for similarity scores
  – Based on “random surfers”
  – Shows that the SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node
• Expected distance (ED)
  – u and v are nodes in a strongly connected graph
  – The ED from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u
  – Tour t = ⟨w_1, …, w_k⟩
  – l(t): length of t
  – P[t]: probability of traveling t
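The two tour quantities can be made concrete with a short sketch. It assumes the definitions I recall from the original paper, which the slide does not state: the length of a tour is its number of edges (k - 1), and a surfer at node w follows each out-edge with probability 1/|O(w)|. The function names and the example graph are hypothetical.

```python
from fractions import Fraction

def tour_length(tour):
    """Length of a tour: the number of edges it traverses."""
    return len(tour) - 1

def tour_probability(tour, out_nbrs):
    """Probability that a random surfer starting at tour[0] follows exactly this tour."""
    prob = Fraction(1)
    for w, nxt in zip(tour, tour[1:]):
        if nxt not in out_nbrs.get(w, ()):
            return Fraction(0)                  # not a valid tour in this graph
        prob *= Fraction(1, len(out_nbrs[w]))   # uniform choice among out-edges
    return prob

# Hypothetical graph: u -> v, u -> w, v -> u.
out_nbrs = {"u": ["v", "w"], "v": ["u"]}
tour = ["u", "v", "u"]
print(tour_length(tour), tour_probability(tour, out_nbrs))
```

Under these assumptions, the expected distance from u to v is the sum of P[t] · l(t) over all tours t that start at u and reach v for the first time at their last step.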

Random Surfer-Pairs Model
• Expected Meeting Distance (EMD)
  – EMD is symmetric
  – The EMD m(a, b) is simply the expected distance in G² from (a, b) to any singleton node (x, x) ∈ V²
[Figure: example graphs annotated with m(v, w) = 1, m(u, v) = ∞, m(u, w) = ∞; m(*, *) = ∞; m(*, *) = 3]

Random Surfer-Pairs Model
• Expected-f Meeting Distance
  – The approach used to circumvent the “infinite EMD” problem
    • Map all distances to a finite interval: instead of the expected length l(t) of a tour, compute the expected value of f(l(t))
• Equivalence to SimRank
  – s′(*, *) exactly models the original definition of SimRank scores
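As a worked form of the idea above, the expected-f meeting distance can be written as a sum over tours in G² that end at a singleton node. The choice f(z) = c^z with 0 < c < 1 is an assumption taken from my recollection of the original paper; the slide text only says that distances are mapped to a finite interval.

```latex
s'(a,b) \;=\; \sum_{t:\,(a,b)\,\rightsquigarrow\,(x,x)} P[t]\; c^{\,l(t)},
\qquad 0 < c < 1 .
```

With this choice, a pair that meets immediately (l(t) = 0) contributes 1, longer tours contribute geometrically less, and pairs that can never meet get a score of 0 rather than an infinite distance.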


Conclusion
• Main contributions
  – A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
  – A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
  – Experimental results from an in-memory implementation of SimRank over two real data sets, which show the effectiveness and feasibility of SimRank

Future Work
• Address efficiency and scalability issues
  – Including additional pruning heuristics and disk-based algorithms
• Consider ternary (or more) relationships in computing structural-context similarity
• Explore the combination of SimRank with other domain-specific similarity measures