

Page 1 PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks
Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu
VLDB ’11
Summarized and presented by Kim Chungrim

Page 2 Contents
Introduction
Motivation & Terminology
–Heterogeneous Information Network (HIN)
–Network Schema
–Meta Path
Meta Path-based Similarity Search Framework
PathSim: A Novel Meta Path-Based Similarity Measure
Online Query Processing for Top-K Similarity Search
–PathSim-baseline
–PathSim-pruning
Experiments
Conclusions

Page 3 Introduction
Logical networks involving multi-typed objects and multi-typed links denoting different relations are arising
–Bibliographic networks
–Social media networks
–Knowledge networks encoded in Wikipedia
It is important to study similarity search in such networks, as similarity search is a primitive operation in databases and Web search engines.
Similarity search has been studied only for traditional relational databases or homogeneous information networks
–Personalized PageRank (P-PageRank)
–SimRank
–Random Walk (RW)
–Pairwise Random Walk (PRW)
There are few studies of similarity search on Heterogeneous Information Networks (HINs).

Page 4 Motivation
When conventional similarity measures for homogeneous information networks are applied to an HIN, the subtle semantic meaning that each type of link carries is ignored.
Limitations of current similarity/proximity measures defined in networks (e.g., personalized PageRank (P-PageRank), SimRank)
–They do NOT distinguish different types of objects and different types of links in the network
–Different types of objects and links have different semantic meanings
To distinguish the semantics among paths connecting two objects, a meta path-based similarity framework can be considered.

Page 5 Terminology
HIN:
–Networks containing multi-typed objects, interconnected via multi-typed relationships
–$G = (V, E)$
–Examples
  DBLP network: papers, authors, venues, terms
  Flickr network: pictures, tags, users, groups
–Sources
  From online web services: online shopping websites, social media websites, bibliographic websites, …
  From database systems: medical databases, university databases, police department databases, …
Network Schema:
–Information about the entity types and their binary relations
–$T_G = (A, R)$, with object types $A$ and relation types $R$
–Similar to the E-R Model
[Figure: example DBLP network instance (authors Jim and Ann; papers P1 and P2; venue VLDB; terms Network and Data) and the DBLP network schema (Paper, Venue, Term, Author)]

Page 6 Terminology (cont.)
Meta Path
–Two objects can be connected via different connectivity paths
–E.g., two authors can be connected by
  “author-paper-author” (APA)
  “author-paper-author-paper-author” (APAPA)
  “author-paper-venue-paper-author” (APCPA)
Each connectivity path represents a different semantic meaning and implies different similarity semantics.
A meta path is a meta-level description of the topological connectivity between objects
–Given a network schema $T_G = (A, R)$, a meta path can be defined as $P = A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$
–It can be considered as a new composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ defined on types $A_1$ and $A_{l+1}$
[Figure: example DBLP network instance (authors Jim and Ann; papers P1 and P2; venue VLDB; terms Network and Data)]

Page 7 Meta Path-based Similarity Search Framework
Similarity definition
–Meta Path × Similarity Measure
Conventional similarity measures
–Path Count: the number of path instances $p$ between $x$ and $y$ following $P$
  $s(x,y) = |\{p : p \in P\}|$
–Random Walk: the probability that a random walk starting from $x$ and following meta path $P$ ends at $y$, which is the sum of the probabilities $Prob(p)$ of all the path instances $p$
  $s(x,y) = \sum_{p \in P} Prob(p)$
–Pairwise Random Walk: for a meta path $P$ that can be decomposed into two shorter meta paths of the same length, $P = (P_1 P_2)$, the pairwise random walk probability is the probability that walks starting from $x$ and from $y$ reach the same middle object $z$
  $s(x,y) = \sum_{(p_1 p_2) \in (P_1 P_2)} Prob(p_1)\, Prob(p_2^{-1})$
[Figure: path instance $p$ from $x$ to $y$; pairwise walks from $x$ and $y$ meeting at a middle object $z$]
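The three conventional measures can be sketched on a toy network for the meta path author-paper-author (APA). The data and function names below are hypothetical and only illustrate the definitions; the random walk is assumed to pick each outgoing link uniformly at random, and the pairwise random walk splits APA into the two halves A-P and P-A.

```python
from collections import defaultdict

# Hypothetical toy data: each author maps to the papers they wrote.
author_papers = {
    "Jim": ["p1", "p2"],
    "Ann": ["p2", "p3"],
    "Tom": ["p3"],
}

def paper_authors(ap):
    """Invert author -> papers into paper -> authors."""
    pa = defaultdict(list)
    for author, papers in ap.items():
        for paper in papers:
            pa[paper].append(author)
    return pa

def path_count(ap, x, y):
    """Number of A-P-A path instances from x to y."""
    pa = paper_authors(ap)
    return sum(1 for p in ap[x] for a in pa[p] if a == y)

def random_walk(ap, x, y):
    """Probability that a uniform walk from x along A-P-A ends at y."""
    pa = paper_authors(ap)
    total = 0.0
    for p in ap[x]:
        for a in pa[p]:
            if a == y:
                total += (1 / len(ap[x])) * (1 / len(pa[p]))
    return total

def pairwise_random_walk(ap, x, y):
    """Probability that one-step walks from x and from y meet at the same paper."""
    shared = set(ap[x]) & set(ap[y])
    return len(shared) * (1 / len(ap[x])) * (1 / len(ap[y]))

print(path_count(author_papers, "Jim", "Ann"))
print(random_walk(author_papers, "Jim", "Ann"))
print(pairwise_random_walk(author_papers, "Ann", "Tom"))
```

Note that path count is symmetric for this round-trip path, while the random walk score generally is not, because it is normalized by the degree of the starting object only.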

Page 8 PathSim: A Novel Meta Path-Based Similarity Measure
Similarity in terms of “peers”
–Two similar peer objects should not only be strongly connected, but also share comparable visibility.
Path count and Random walk (RW)
–Favor highly visible objects (objects with large degrees)
Pairwise random walk (PRW)
–Favors pure objects (objects with highly skewed scatterness in their in-links or out-links)
PathSim
–Favors “peers” (objects with similar visibility and strong connectivity under the given meta path)

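Page 9 PathSim: A Novel Meta Path-Based Similarity Measure (cont.)
Restricted to round-trip meta paths
–A round-trip meta path is a path of the form $P = (P_l P_l^{-1})$
–Guarantees a symmetric relation
–$s(x,y) = \dfrac{2 \times |\{p_{x \rightsquigarrow y} : p_{x \rightsquigarrow y} \in P\}|}{|\{p_{x \rightsquigarrow x} : p_{x \rightsquigarrow x} \in P\}| + |\{p_{y \rightsquigarrow y} : p_{y \rightsquigarrow y} \in P\}|}$
[Figure: worked example with authors Jim and Mike and venues VLDB and SIGMOD, computing s(Mike, Jim)]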
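A minimal sketch of the PathSim computation, assuming the definition from the original paper: twice the number of path instances between x and y, divided by the number of round-trip path instances from x to itself plus those from y to itself. The meta path here is author-venue-author, and the publication counts are illustrative values chosen for this sketch, not numbers recovered from the slide.

```python
# Illustrative author -> {venue: number of papers} counts (assumed values).
author_venue = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 3, "VLDB": 1},
}

def commuting_entry(w, x, y):
    """Number of author-venue-author path instances between x and y."""
    return sum(count * w[y].get(venue, 0) for venue, count in w[x].items())

def pathsim(w, x, y):
    """PathSim under the round-trip meta path author-venue-author."""
    denom = commuting_entry(w, x, x) + commuting_entry(w, y, y)
    return 2 * commuting_entry(w, x, y) / denom if denom else 0.0

for other in ("Jim", "Mary"):
    print(other, round(pathsim(author_venue, "Mike", other), 4))
```

With these counts Mike is strongly connected to Jim in absolute terms, yet scores far lower with Jim than with Mary, because Jim's visibility is much larger than Mike's. This is the "peer" behaviour the measure is designed for.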

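Page 10 PathSim: A Novel Meta Path-Based Similarity Measure (cont.)
Properties of PathSim
–Symmetric
  $s(x,y) = s(y,x)$
–Self Maximum
  $s(x,y) \in [0,1]$, $s(x,x) = 1$
–Balance of Visibility
  $s(x,y) \le \dfrac{2}{\sqrt{M_{xx}/M_{yy}} + \sqrt{M_{yy}/M_{xx}}}$, where $M$ is the commuting matrix of the meta path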
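The balance-of-visibility property can be derived in a few lines. The sketch below assumes the matrix form of PathSim from the original paper, $s(x,y) = 2M_{xy}/(M_{xx}+M_{yy})$, where $M = W W^{\top}$ is the commuting matrix of the round-trip meta path and $W$ is the commuting matrix of its first half; the symbol $W$ is introduced here only for the derivation.

```latex
\begin{align*}
M_{xy} &= \sum_{z} W_{xz} W_{yz}
        \le \sqrt{\sum_{z} W_{xz}^{2}} \, \sqrt{\sum_{z} W_{yz}^{2}}
        = \sqrt{M_{xx} M_{yy}}
        && \text{(Cauchy--Schwarz)} \\
s(x,y) &= \frac{2 M_{xy}}{M_{xx} + M_{yy}}
        \le \frac{2 \sqrt{M_{xx} M_{yy}}}{M_{xx} + M_{yy}}
        = \frac{2}{\sqrt{M_{xx}/M_{yy}} + \sqrt{M_{yy}/M_{xx}}}
\end{align*}
```

The bound equals 1 only when $M_{xx} = M_{yy}$, so two objects with very different visibility can never be highly similar, no matter how many paths connect them.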

Page 11 PathSim: A Novel Meta Path-Based Similarity Measure (cont) Comparison with other measures.

Page 12 Online Query Processing for Top-K Similarity Search
The Top-K Similarity Search Problem under PathSim
–Given an HIN G and its network schema, and given a round-trip meta path $P = (P_l P_l^{-1})$, the top-K similarity search is defined as:
–For a given query object $x \in A_1$, find the sorted k objects $y$ of the same type $A_1$ that are most similar to $x$ under the PathSim similarity definition
Major issues for online computation
–Very large matrix multiplication: need to compute the commuting matrix
–Calculating the commuting matrix online is too time-consuming
–Full materialization of the commuting matrix is also expensive in time and space
Solution
–Partially materialize commuting matrices for short meta paths, and concatenate them online to get longer ones for a given query
–Materialize the commuting matrix $M_{P_l}$ for the half meta path $P_l$
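The partial-materialization idea can be sketched with sparse dictionaries: the commuting matrix of the half path is built offline, and only the query's row of the full round-trip matrix is computed online. The matrices, names, and the author-paper-venue example below are assumptions made for illustration.

```python
def matmul(a, b):
    """Product of two sparse matrices stored as dict-of-dicts."""
    out = {}
    for i, row in a.items():
        acc = {}
        for k, v in row.items():
            for j, u in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + v * u
        out[i] = acc
    return out

def transpose(a):
    """Transpose of a dict-of-dicts sparse matrix."""
    out = {}
    for i, row in a.items():
        for j, v in row.items():
            out.setdefault(j, {})[i] = v
    return out

def query_row(m_half, m_half_t, x):
    """Row x of M = M_half * M_half^T, computed online for one query."""
    row = {}
    for k, v in m_half[x].items():
        for y, u in m_half_t[k].items():
            row[y] = row.get(y, 0) + v * u
    return row

# Hypothetical adjacency matrices: author-paper and paper-venue.
w_ap = {"Jim": {"p1": 1, "p2": 1}, "Ann": {"p2": 1, "p3": 1}}
w_pc = {"p1": {"VLDB": 1}, "p2": {"VLDB": 1}, "p3": {"KDD": 1}}

# Offline: materialize the half path author-paper-venue and its transpose.
m_apc = matmul(w_ap, w_pc)
m_apc_t = transpose(m_apc)

# Online: concatenate for the query only (author-paper-venue-paper-author).
print(query_row(m_apc, m_apc_t, "Jim"))
```

Only the rows and columns touched by the query are visited online, which avoids ever storing the full author-by-author matrix.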

Page 13 Online Query Processing for Top-K Similarity Search – PathSim-baseline
Find the candidates by traversing the network along meta path P from the query object x
For each candidate y, calculate s(x,y) using the partial commuting matrix $M_{P_l}$
–Calculate $M(x,y)$ and scale it by the sum of visibilities $M(x,x) + M(y,y)$
–$M(x,x)$ and $M(y,y)$ can be pre-computed and stored
Sort the candidates y according to s(x,y) and return the top-k objects
Still very time-consuming if the candidate set is very large!
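The baseline steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the half-path commuting matrix is a hypothetical dict-of-dicts, the diagonal is computed inside the function rather than stored offline, and the query object itself is kept in the result.

```python
def pathsim_baseline(m_half, x, k):
    """Top-k PathSim search by scoring every reachable candidate."""
    # Visibility M(y, y) of each object (would be pre-computed offline).
    diag = {y: sum(v * v for v in row.values()) for y, row in m_half.items()}

    # Inverted index: column -> {object: weight}, used to find candidates.
    cols = {}
    for y, row in m_half.items():
        for c, v in row.items():
            cols.setdefault(c, {})[y] = v

    # Traverse from x along the meta path, accumulating M(x, y).
    m_x = {}
    for c, v in m_half[x].items():
        for y, u in cols[c].items():
            m_x[y] = m_x.get(y, 0) + v * u

    # Scale by the sum of visibilities, then sort and cut at k.
    scored = [(y, 2 * m / (diag[x] + diag[y])) for y, m in m_x.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

# Illustrative author -> {venue: paper count} matrix (assumed values).
m_half = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 3, "VLDB": 1},
    "Bob": {"KDD": 4},
}
print(pathsim_baseline(m_half, "Mike", 2))
```

Bob shares no venue with Mike, so he is never reached by the traversal and never scored; every other author is scored exactly, which is the cost the pruning variant tries to reduce.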

Page 14 Online Query Processing for Top-K Similarity Search – PathSim-pruning
The PathSim-pruning algorithm prunes the candidates that are not promising
Offline: generate co-clusters according to the partial commuting matrix and store statistics for each block, used to derive an upper bound on similarity
Online: for each query
–Calculate the upper bound of the similarity between the query object and each candidate cluster; prune the whole cluster if it is not promising
–Calculate the upper bound of the similarity between the query and each candidate in the cluster; prune the candidate if it is not promising
–Calculate the exact similarity between the query and the candidate, and update the top-k list
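The bound-then-verify pattern can be illustrated with a much simpler upper bound than the co-cluster statistics described above. The sketch below swaps in the balance-of-visibility bound, $s(x,y) \le 2\sqrt{M_{xx} M_{yy}}/(M_{xx}+M_{yy})$, and prunes individual candidates only; it does not implement co-clustering or cluster-level pruning. Names and data are hypothetical.

```python
import heapq
import math

def upper_bound(dx, dy):
    """Visibility-only upper bound on PathSim (Cauchy-Schwarz)."""
    if dx == 0 or dy == 0:
        return 0.0
    return 2 * math.sqrt(dx * dy) / (dx + dy)

def pathsim_pruned(m_half, x, k):
    """Top-k search that skips exact scoring when the bound cannot help."""
    diag = {y: sum(v * v for v in row.values()) for y, row in m_half.items()}
    heap = []          # min-heap of (score, object) holding the current top-k
    exact_evals = 0    # how many exact similarities were computed
    for y, row in m_half.items():
        bound = upper_bound(diag[x], diag[y])
        if len(heap) == k and bound <= heap[0][0]:
            continue   # cannot beat the current k-th best score: prune
        m_xy = sum(v * row.get(c, 0) for c, v in m_half[x].items())
        exact_evals += 1
        denom = diag[x] + diag[y]
        score = 2 * m_xy / denom if denom else 0.0
        if len(heap) < k:
            heapq.heappush(heap, (score, y))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, y))
    top = sorted(heap, key=lambda pair: (-pair[0], pair[1]))
    return [(y, s) for s, y in top], exact_evals

# Illustrative author -> {venue: paper count} matrix (assumed values).
m_half = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 3, "VLDB": 1},
    "Bob": {"KDD": 4},
}
top, evals = pathsim_pruned(m_half, "Mike", 2)
print(top, evals)
```

Because the bound never underestimates the true score, pruning cannot remove an object that would strictly improve the top-k list; how much work it saves depends on the order in which candidates are visited.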

Page 15 Online Query Processing for Top-K Similarity Search – PathSim-pruning

Page 16 Experiments
The DBLP network
–By Nov
–Contains over 710K authors, 1.2M papers, 5K venues (conferences/journals), and around 70K terms appearing more than once (stopwords have been removed)
–Called the full DBLP dataset
Two subsets were created
–DBIS dataset: contains all 464 venues and the top-5000 authors from the database and information system area
–4-area dataset: contains 20 venues and the top-5000 authors from 4 areas: database, data mining, machine learning and information retrieval
  Cluster labels are given for all 20 venues and a subset of 1713 authors

Page 17 Experiment - Effectiveness
Labeled the top-15 results for 15 queries of venue type in the DBIS dataset
–Each result object is labeled with a relevance score:
  0: non-relevant
  1: some-relevant
  2: very-relevant
Used nDCG to evaluate the quality of a ranking algorithm
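A minimal sketch of nDCG over the 0/1/2 relevance labels described above. The slides do not state which nDCG variant was used, so two assumptions are made here: the gain is the raw label discounted by log2(rank + 1), and the ideal ranking is obtained by sorting the labels of the same result list.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with raw relevance as the gain."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideally ordered list."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Hypothetical labels for one query's ranked results (2, 1, or 0 each).
print(ndcg([2, 2, 1, 0]))
print(ndcg([0, 2]))
```

A ranking that already lists results in decreasing relevance scores 1.0, and placing a very-relevant result lower in the list reduces the score.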

Page 18 Experiment - Effectiveness

Page 19 Experiment - Efficiency

Page 20 Experiment - Efficiency

Page 21 Conclusion & Contribution
The authors
–Defined a meta path-based similarity framework for HINs
–Proposed a new measure called PathSim, which is able to detect peer objects for the given meta path
–Proposed an efficient co-clustering-based online search algorithm to support top-k search

Page 22 Summary / Discussion / Future work
Network inference procedure assumes ad-hoc edge filtering
Introduced a threshold on edges and a family of networks to find an optimal threshold for a certain prediction task
–The prediction accuracies peak in a non-obvious yet relatively narrow threshold range
Tested on too few datasets
–Not enough to give a solid conclusion
Apply the method to a variety of networks
Test various thresholds for more interests