Page 1 PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi.

PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu VLDB ’11 Summarized and presented by Kim Chungrim

Contents
Introduction
Motivation & Terminology
–Heterogeneous Information Network (HIN)
–Network Schema
–Meta Path
Meta Path-based Similarity Search Framework
PathSim: A Novel Meta Path-Based Similarity Measure
Online Query Processing for Top-K Similarity Search
–PathSim-baseline
–PathSim-pruning
Experiments
Conclusions

Introduction
Logical networks involving multi-typed objects and multi-typed links denoting different relations are becoming common
–Bibliographic networks
–Social media networks
–Knowledge networks encoded in Wikipedia
It is important to study similarity search in such networks, as similarity search is a primitive operation in database and Web search engines.
Similarity search has so far been studied only for traditional relational databases or homogeneous information networks
–Personalized PageRank (P-PageRank)
–SimRank
–Random Walk (RW)
–Pairwise Random Walk (PRW)
There are few studies of similarity search on Heterogeneous Information Networks (HIN)

Motivation
When conventional similarity measures for homogeneous information networks are applied to an HIN, the subtle semantic meaning that each type of link carries is ignored
Limitations of current similarity/proximity measures defined on networks
–They do NOT distinguish different types of objects and different types of links in the network
–Different types of objects and links have different semantic meanings
–E.g., personalized PageRank (P-PageRank), SimRank
To distinguish the semantics of the paths connecting two objects, a meta path-based similarity framework can be considered

Terminology
HIN:
–Networks containing multi-typed objects, interconnected via multi-typed relationships
–$G = (V, E)$
–Examples
  DBLP network: papers, authors, venues, terms
  Flickr network: pictures, tags, users, groups
–Sources
  From online web services: online shopping websites, social media websites, bibliographic websites, …
  From database systems: medical databases, university databases, police department databases, …
Network Schema:
–Information about the entity types and their binary relations
–$T_G = (\mathcal{A}, \mathcal{R})$, with object types $\mathcal{A}$ and relation types $\mathcal{R}$
–Similar to the E-R model
[Figure: example DBLP network instance (authors Jim, Ann; papers P1, P2; venue VLDB; terms Network, Data) and the DBLP network schema over the types Paper, Venue, Term, Author]
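One way to picture the HIN and network schema definitions above is as a typed graph plus a set of allowed type triples. The sketch below is illustrative only: the relation names (`writes`, `published_in`, `mentions`) and the specific links are assumptions, since the slide's figure did not survive; only the object names and types come from the slide.

```python
from collections import defaultdict

# Object -> type. Names follow the slide's figure; everything else is hypothetical.
node_type = {
    "Jim": "Author", "Ann": "Author",
    "P1": "Paper", "P2": "Paper",
    "VLDB": "Venue",
    "Network": "Term", "Data": "Term",
}

# Network schema: the allowed (source type, relation, target type) triples.
schema = {
    ("Author", "writes", "Paper"),
    ("Paper", "published_in", "Venue"),
    ("Paper", "mentions", "Term"),
}

# Typed links (assumed for illustration).
edges = [
    ("Jim", "writes", "P1"),
    ("Ann", "writes", "P1"),
    ("Ann", "writes", "P2"),
    ("P1", "published_in", "VLDB"),
    ("P2", "published_in", "VLDB"),
    ("P1", "mentions", "Network"),
    ("P2", "mentions", "Data"),
]

def conforms(edges, node_type, schema):
    """Return True if every link's (type, relation, type) triple is in the schema."""
    return all((node_type[u], r, node_type[v]) in schema for u, r, v in edges)

def neighbors_by_type(edges, node_type):
    """Undirected adjacency indexed by (node, neighbor type)."""
    adj = defaultdict(set)
    for u, _, v in edges:
        adj[(u, node_type[v])].add(v)
        adj[(v, node_type[u])].add(u)
    return adj

adj = neighbors_by_type(edges, node_type)
print(conforms(edges, node_type, schema))
print(sorted(adj[("P1", "Author")]))
```

Indexing neighbors by type is a convenience for following meta paths later; it is a design choice of this sketch, not something the slides prescribe.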

Terminology (cont.)
Meta Path
–Two objects can be connected via different connectivity paths
–E.g., two authors can be connected by
  “author-paper-author” (APA)
  “author-paper-author-paper-author” (APAPA)
  “author-paper-venue-paper-author” (APCPA)
Each connectivity path represents a different semantic meaning and implies different similarity semantics
A meta path is a meta-level description of the topological connectivity between objects
–Given a network schema, a meta path can be defined as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$
–It can be considered as a new composite relation defined on types $A_1$ and $A_{l+1}$
[Figure: example DBLP network instance with authors Jim, Ann; papers P1, P2; venue VLDB; terms Network, Data]
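To make the distinction between a meta path (a sequence of types) and its path instances (concrete object sequences) tangible, here is a minimal sketch. The adjacency data is a hypothetical toy network, and `path_instances` is an illustrative helper, not something defined in the slides. The slides abbreviate Venue as "C" in APCPA.

```python
def path_instances(adj, start, meta_path):
    """Enumerate path instances from `start` that follow `meta_path`.

    adj maps (node, neighbor_type) -> list of neighbors; meta_path is a list
    of type names whose first entry is the type of `start`.
    """
    paths = [[start]]
    for next_type in meta_path[1:]:
        # Extend every partial path by each neighbor of the required type.
        paths = [p + [n] for p in paths for n in adj.get((p[-1], next_type), [])]
    return paths

# Hypothetical toy data: Jim and Ann co-author P1, Ann also wrote P2,
# and both papers appear in VLDB.
adj = {
    ("Jim", "Paper"): ["P1"],
    ("Ann", "Paper"): ["P1", "P2"],
    ("P1", "Author"): ["Jim", "Ann"],
    ("P2", "Author"): ["Ann"],
    ("P1", "Venue"): ["VLDB"],
    ("P2", "Venue"): ["VLDB"],
    ("VLDB", "Paper"): ["P1", "P2"],
}

APA = ["Author", "Paper", "Author"]
APCPA = ["Author", "Paper", "Venue", "Paper", "Author"]

apa = path_instances(adj, "Jim", APA)
apcpa = path_instances(adj, "Jim", APCPA)
print(apa)
print(apcpa)
```

In this toy network Jim reaches Ann once via APA (shared paper) but twice via APCPA (shared venue), which illustrates how different meta paths carry different similarity semantics.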

Meta Path-based Similarity Search Framework
Similarity definition
–Meta path × similarity measure
Conventional similarity measures
–Path Count: the number of path instances $p$ between $x$ and $y$ following $\mathcal{P}$
–$s(x,y) = |\{p_{x \leadsto y} : p_{x \leadsto y} \in \mathcal{P}\}|$
–Random Walk: the probability of a random walk that starts from $x$ and ends at $y$ following meta path $\mathcal{P}$, which is the sum of the probabilities $Prob(p)$ of all the path instances $p$
–$s(x,y) = \sum_{p_{x \leadsto y} \in \mathcal{P}} Prob(p_{x \leadsto y})$
–Pairwise Random Walk: for a meta path $\mathcal{P}$ that can be decomposed into two shorter meta paths of the same length, $\mathcal{P} = (\mathcal{P}_1 \mathcal{P}_2)$, the pairwise random walk probability is the probability that walks starting from $x$ and from $y$ reach the same middle object $z$
–$s(x,y) = \sum_{(p_1 p_2) \in (\mathcal{P}_1 \mathcal{P}_2)} Prob(p_1)\,Prob(p_2^{-1})$
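The three conventional measures can be compared on a small example. The sketch below assumes a weighted author-venue relation (weights are paper counts) and the short round-trip meta path author-venue-author; the data and the function names are hypothetical. Exact fractions are used so the asymmetry of the random walk measure is visible without rounding.

```python
from fractions import Fraction

# Hypothetical weighted author-venue links (number of papers per venue).
W = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
}
venues = ["SIGMOD", "VLDB"]

def out_prob(x, c):
    """Probability that a walk leaving author x steps to venue c."""
    return Fraction(W[x].get(c, 0), sum(W[x].values()))

def in_prob(c, y):
    """Probability that a walk leaving venue c steps to author y."""
    total = sum(W[a].get(c, 0) for a in W)
    return Fraction(W[y].get(c, 0), total)

def path_count(x, y):
    """Number of author-venue-author path instances between x and y."""
    return sum(W[x].get(c, 0) * W[y].get(c, 0) for c in venues)

def random_walk(x, y):
    """Probability of walking from x to y along author-venue-author."""
    return sum(out_prob(x, c) * in_prob(c, y) for c in venues)

def pairwise_random_walk(x, y):
    """Probability that walks from x and from y meet at the same venue."""
    return sum(out_prob(x, c) * out_prob(y, c) for c in venues)

print(path_count("Mike", "Jim"))
print(random_walk("Mike", "Jim"), random_walk("Jim", "Mike"))
print(pairwise_random_walk("Mike", "Jim"))
```

Path count and pairwise random walk are symmetric in this setup, while random walk is not, because the two directions normalize by different degrees.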

PathSim: A Novel Meta Path-Based Similarity Measure
Similarity in terms of “peers”
–Two similar peer objects should not only be strongly connected, but also share comparable visibility.
Path count and Random walk (RW)
–Favor highly visible objects (objects with large degrees)
Pairwise random walk (PRW)
–Favors pure objects (objects with highly skewed scatterness in their in-links or out-links)
PathSim
–Favors “peers” (objects with similar visibility and strong connectivity under the given meta path)

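PathSim: A Novel Meta Path-Based Similarity Measure (cont)
Restricted to round-trip meta paths
–A round-trip meta path is a path of the form $\mathcal{P} = (\mathcal{P}_l \mathcal{P}_l^{-1})$, i.e., a path $\mathcal{P}_l$ followed by its reverse
–Guarantees a symmetric relation
–$s(x,y) = \frac{2 \times |\{p_{x \leadsto y} : p_{x \leadsto y} \in \mathcal{P}\}|}{|\{p_{x \leadsto x} : p_{x \leadsto x} \in \mathcal{P}\}| + |\{p_{y \leadsto y} : p_{y \leadsto y} \in \mathcal{P}\}|}$
[Figure: authors Jim and Mike linked to venues VLDB and SIGMOD with link weights 2, 50, 20, 1; worked example of s(Mike, Jim)]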
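PathSim normalizes the path count between two objects by their self path counts: twice the number of round-trip path instances between x and y, divided by the sum of the round-trip counts from x to itself and from y to itself. The sketch below works this through for the Mike and Jim example. The assignment of the four weights to links is an assumption (Mike: 2 papers in SIGMOD and 1 in VLDB; Jim: 50 in SIGMOD and 20 in VLDB), because the figure itself did not survive in the transcript.

```python
def pathsim(wx, wy):
    """PathSim for a round-trip meta path, given each object's half-path counts.

    wx and wy map a middle object (here: a venue) to the number of
    half-path instances reaching it from x and from y respectively.
    """
    cross = sum(v * wy.get(k, 0) for k, v in wx.items())  # paths x -> y
    self_x = sum(v * v for v in wx.values())              # paths x -> x
    self_y = sum(v * v for v in wy.values())              # paths y -> y
    denom = self_x + self_y
    return 2 * cross / denom if denom else 0.0

# Assumed link weights (see the note above).
mike = {"SIGMOD": 2, "VLDB": 1}
jim = {"SIGMOD": 50, "VLDB": 20}

# 2 * (2*50 + 1*20) / ((2*2 + 1*1) + (50*50 + 20*20)) = 240 / 2905
print(round(pathsim(mike, jim), 4))
```

Under these assumed weights the two authors share both venues, yet the score is low because Jim's visibility is far larger than Mike's.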

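PathSim: A Novel Meta Path-Based Similarity Measure (cont)
Properties of PathSim
–Symmetric
–$s(x,y) = s(y,x)$
–Self Maximum
–$s(x,y) \in [0,1]$, $s(x,x) = 1$
–Balance of Visibility
–$s(x_i,x_j) \le \frac{2}{\sqrt{M_{ii}/M_{jj}} + \sqrt{M_{jj}/M_{ii}}}$, where $M_{ii}$ is the number of round-trip path instances from $x_i$ to itself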
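The three properties can be checked numerically on a small example. The sketch below assumes the round-trip path counts are given by M = W Wᵀ for a non-negative half-path count matrix W (hypothetical values), so that PathSim is 2·M[i][j] / (M[i][i] + M[j][j]). The balance-of-visibility bound used here, 2 / (√(M_ii/M_jj) + √(M_jj/M_ii)), follows from the Cauchy-Schwarz inequality M_ij ≤ √(M_ii·M_jj).

```python
import math

# Hypothetical half-path counts: 3 objects x 3 middle objects.
W = [[2, 1, 0], [50, 20, 0], [0, 3, 4]]
n = len(W)

# Round-trip path counts M = W * W^T.
M = [[sum(a * b for a, b in zip(W[i], W[j])) for j in range(n)] for i in range(n)]

def pathsim(i, j):
    return 2 * M[i][j] / (M[i][i] + M[j][j])

def visibility_bound(i, j):
    """Upper bound on pathsim(i, j) that depends only on the two visibilities."""
    return 2 / (math.sqrt(M[i][i] / M[j][j]) + math.sqrt(M[j][j] / M[i][i]))

pairs = [(i, j) for i in range(n) for j in range(n)]
symmetric = all(pathsim(i, j) == pathsim(j, i) for i, j in pairs)
self_max = all(pathsim(i, i) == 1.0 for i in range(n)) and all(
    0 <= pathsim(i, j) <= 1 for i, j in pairs
)
balanced = all(pathsim(i, j) <= visibility_bound(i, j) + 1e-12 for i, j in pairs)
print(symmetric, self_max, balanced)
```

The bound shrinks as the two visibilities diverge, which is why objects with very different visibility cannot score as close peers.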

PathSim: A Novel Meta Path-Based Similarity Measure (cont) Comparison with other measures.

Online Query Processing for Top-K Similarity Search
The Top-K Similarity Search Problem under PathSim
–Given an HIN $G$ and its network schema, and a round-trip meta path $\mathcal{P} = (\mathcal{P}_l \mathcal{P}_l^{-1})$, the top-k similarity search is defined as:
–For a given query object $x \in A_1$, find the sorted $k$ objects $y$ of the same type $A_1$ that are most similar to $x$ under the PathSim similarity definition
Major issues for online computation
–Very large matrix multiplication: the commuting matrix of the meta path must be computed
–Calculating the commuting matrix online is too time-consuming
–Full materialization of the commuting matrix is also expensive in time and space
Solution
–Partially materialize commuting matrices for short meta paths, and concatenate them online to get longer ones for a given query
–Materialize the commuting matrix $M_{\mathcal{P}_l}$ for the half meta path $\mathcal{P}_l$; the full matrix is then $M_{\mathcal{P}} = M_{\mathcal{P}_l} M_{\mathcal{P}_l}^T$
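The idea of concatenating short commuting matrices can be sketched with plain matrix products. The example assumes the meta path author-paper-venue-paper-author, with small hypothetical adjacency matrices `W_AP` (authors × papers) and `W_PC` (papers × venues). Only the half-path matrix is materialized; the round-trip matrix is obtained by multiplying it with its transpose.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical adjacency matrices.
W_AP = [[1, 1, 0],   # author 0 wrote papers 0 and 1
        [0, 1, 1]]   # author 1 wrote papers 1 and 2
W_PC = [[1, 0],      # paper 0 appeared in venue 0
        [1, 0],      # paper 1 appeared in venue 0
        [0, 1]]      # paper 2 appeared in venue 1

# Half-path commuting matrix (author-paper-venue), materialized once.
M_APC = matmul(W_AP, W_PC)

# Round-trip commuting matrix, obtained by concatenation with the reverse path.
M_APCPA = matmul(M_APC, transpose(M_APC))

def pathsim(M, i, j):
    return 2 * M[i][j] / (M[i][i] + M[j][j])

print(M_APC)
print(M_APCPA)
print(pathsim(M_APCPA, 0, 1))
```

For a single query only one row of the round-trip matrix is needed, so in practice the last product can be restricted to the query's row of the half-path matrix.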

Online Query Processing for Top-K Similarity Search – PathSim-baseline
Find the candidates by traversing the network from the query object $x$ following meta path $\mathcal{P}$
For each candidate $y$, calculate $s(x,y)$ using the partial commuting matrix $M_{\mathcal{P}_l}$
–Calculate $M(x,y) = M_{\mathcal{P}_l}(x,:)\,M_{\mathcal{P}_l}(y,:)^T$ and scale it by the sum of visibilities $M(x,x) + M(y,y)$
–$M(x,x)$ and $M(y,y)$ can be pre-computed and stored
Sort the candidates $y$ according to $s(x,y)$ and return the top-k objects
Still very time-consuming if the candidate set is very large!
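The baseline steps above can be sketched as follows. This is a minimal illustration under stated assumptions: rows of the partial commuting matrix are stored as sparse dictionaries, candidates are found through an inverted index from middle objects to rows (standing in for the network traversal), and the data and function names are hypothetical.

```python
import heapq
from collections import defaultdict

def build_index(rows):
    """Pre-compute diagonal entries M(x, x) and an inverted index for candidates."""
    diag = {x: sum(v * v for v in r.values()) for x, r in rows.items()}
    inverted = defaultdict(set)
    for x, r in rows.items():
        for feature in r:
            inverted[feature].add(x)
    return diag, inverted

def pathsim_baseline(rows, diag, inverted, x, k):
    """Return the k objects most similar to x as (score, object) pairs."""
    # Candidates: every object sharing at least one middle object with x.
    candidates = set().union(*(inverted[f] for f in rows[x]))
    scored = []
    for y in candidates:
        m_xy = sum(v * rows[y].get(f, 0) for f, v in rows[x].items())
        scored.append((2 * m_xy / (diag[x] + diag[y]), y))
    return heapq.nlargest(k, scored)

# Hypothetical rows of the partial commuting matrix (author -> venue counts).
rows = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 2, "ICDE": 1},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
    "Ann": {"ICDE": 1, "KDD": 1},
}
diag, inverted = build_index(rows)
top3 = pathsim_baseline(rows, diag, inverted, "Mike", 3)
print(top3)
```

The query object itself is returned with score 1; a caller that does not want it can filter it out. Every candidate still requires one exact dot product, which is the cost the pruning variant tries to avoid.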

Online Query Processing for Top-K Similarity Search – PathSim-pruning
The PathSim-pruning algorithm prunes the candidates that are not promising
Offline: generate co-clusters according to the partial commuting matrix, and store statistics for each block for deriving an upper bound of the similarity
Online: for each query
–Calculate the upper bound of the similarity between the query object and the candidate cluster; prune the whole cluster if it is not promising
–Calculate the upper bound of the similarity between the query and each candidate in the cluster; prune the candidate if it is not promising
–Calculate the exact similarity between the query and the candidate, and update the top-k list
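The three online steps can be illustrated with deliberately simplified bounds. This sketch does not reproduce the co-clustering statistics from the paper: it assumes the clusters are given, uses the element-wise maximum row and the smallest diagonal entry of each cluster as the cluster-level bound, and uses the Cauchy-Schwarz (visibility) bound per candidate. All names and data are hypothetical, and rows are assumed non-empty and non-negative.

```python
import heapq
import math

EPS = 1e-12  # slack so floating-point rounding never prunes a true tie

def dot(a, b):
    return sum(v * b.get(f, 0) for f, v in a.items())

def cluster_stats(rows, diag, clusters):
    """Offline: per cluster, the element-wise max row and the smallest diagonal."""
    stats = []
    for members in clusters:
        max_vec = {}
        for y in members:
            for f, v in rows[y].items():
                max_vec[f] = max(max_vec.get(f, 0), v)
        stats.append((max_vec, min(diag[y] for y in members)))
    return stats

def pathsim_pruned(rows, diag, clusters, stats, x, k):
    """Online: return (top-k list, number of exact similarity computations)."""
    top = []  # min-heap holding the current k best (score, object) pairs
    exact_evals = 0
    for members, (max_vec, min_diag) in zip(clusters, stats):
        kth = top[0][0] if len(top) == k else -1.0
        # Step 1: cluster-level upper bound.
        if 2 * dot(rows[x], max_vec) / (diag[x] + min_diag) + EPS < kth:
            continue
        for y in members:
            kth = top[0][0] if len(top) == k else -1.0
            # Step 2: candidate-level upper bound from the visibilities alone.
            bound = 2 * math.sqrt(diag[x] * diag[y]) / (diag[x] + diag[y])
            if bound + EPS < kth:
                continue
            # Step 3: exact similarity and top-k update.
            exact_evals += 1
            s = 2 * dot(rows[x], rows[y]) / (diag[x] + diag[y])
            if len(top) < k:
                heapq.heappush(top, (s, y))
            elif s > top[0][0]:
                heapq.heapreplace(top, (s, y))
    return sorted(top, reverse=True), exact_evals

rows = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 2, "ICDE": 1},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
    "Ann": {"ICDE": 1, "KDD": 1},
}
diag = {x: dot(r, r) for x, r in rows.items()}
clusters = [["Mike", "Bob", "Mary"], ["Jim"], ["Ann"]]
stats = cluster_stats(rows, diag, clusters)
result, evals = pathsim_pruned(rows, diag, clusters, stats, "Mike", 2)
print(result, evals)
```

In this toy run the clusters containing Jim and Ann are skipped entirely once two perfect matches are found, so only three of the five objects need an exact computation. How much is saved in general depends on cluster quality and on the order in which clusters are visited.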

Online Query Processing for Top-K Similarity Search – PathSim-pruning

Experiments
The DBLP network
–Snapshot as of Nov. 2009
–Contains over 710K authors, 1.2M papers, 5K venues (conferences/journals), and around 70K terms appearing more than once (stopwords removed)
–Called the full DBLP dataset
Two subsets were created
–DBIS dataset: contains all 464 venues and the top-5000 authors from the database and information system area
–4-area dataset: contains 20 venues and the top-5000 authors from 4 areas: database, data mining, machine learning, and information retrieval
  Cluster labels are given for all 20 venues and a subset of 1713 authors

Experiment - Effectiveness
Labeled the top-15 results for 15 queries of venue type in the DBIS dataset
–Each result object was labeled with a relevance score
  0: non-relevant
  1: some-relevant
  2: very-relevant
Used nDCG to evaluate the quality of a ranking algorithm
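For readers unfamiliar with nDCG, here is a minimal sketch of one common formulation: gains are the 0/1/2 relevance labels, each discounted by log2(rank + 1). The slides do not say which variant was used, and computing the ideal ordering from the labels of the returned list alone is an assumption of this sketch.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical labels for one query's ranked results (2 = very relevant).
good_ranking = [2, 2, 1, 0]
poor_ranking = [0, 1, 2, 2]
print(ndcg(good_ranking), ndcg(poor_ranking))
```

A ranking that places the most relevant results first scores 1, and the score drops as relevant results move down the list.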

Experiment - Effectiveness

Experiment - Efficiency

Conclusion & Contribution
The authors
–Defined a meta path-based similarity framework for HINs
–Proposed a new measure called PathSim, which is able to detect peer objects for the given meta path
–Proposed a co-clustering-based efficient online search algorithm to support top-k search

Summary / Discussion / Future work
Network inference procedure assumes ad-hoc edge filtering
Introduced a threshold on edges and a family of networks to find an optimal threshold for a certain prediction task
–The prediction accuracies peak in a non-obvious yet relatively narrow threshold range
Tested on too few datasets
Not enough to give a solid conclusion
Apply the method to a variety of networks
Test various thresholds for more interests
