Integrating Meta-Path Selection With User-Guided Object Clustering in Heterogeneous Information Networks Yizhou Sun†, Brandon Norick†, Jiawei Han†, Xifeng.


Integrating Meta-Path Selection With User-Guided Object Clustering in Heterogeneous Information Networks
  Yizhou Sun†, Brandon Norick†, Jiawei Han†, Xifeng Yan‡, Philip S. Yu§, and Xiao Yu†
  †University of Illinois at Urbana-Champaign
  ‡University of California at Santa Barbara
  §University of Illinois at Chicago
  November 13, 2018

Outline
  Background
  Motivation and Problem Definition
  The PathSelClus Model and Algorithm
  Experimental Results
  Conclusion

Information Networks Are Everywhere
  Social networking websites
  Biological network: protein interaction
  Research collaboration network
  Product recommendation network via emails
  They are all treated as homogeneous networks!

Heterogeneous Networks
  Multiple object types and/or multiple link types
  Examples:
    DBLP bibliographic network: Venue, Paper, Author
    IMDB movie network: Actor, Movie, Director, Studio
    The Facebook network
  Homogeneous networks are information-losing projections of heterogeneous networks!
  New problems are emerging in heterogeneous networks!
  Directly mine the information-richer heterogeneous networks

Network Schema and Meta-Path [Sun et al., VLDB’11]
  Objects are connected together via different types of relationships!
    Author-Paper-Author: “Jim-P1-Ann”, “Mike-P2-Ann”, “Mike-P3-Bob”
    Author-Paper-Venue-Paper-Author: “Jim-P1-SIGMOD-P2-Ann”, “Mike-P3-SIGMOD-P2-Ann”, “Mike-P4-KDD-P5-Bob”
  Network schema
    Meta-level description of a network
  Meta-path
    Meta-level description of a path between two objects
    A path on the network schema
    Denotes an existing or concatenated relation between two object types
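
To make the idea of a meta-path concrete, the following Python sketch counts path instances between authors along Author-Paper-Author and Author-Paper-Venue-Paper-Author. The toy network is an assumption: it is reconstructed only from the path instances quoted on the slide (the original figure is not available), and the names `AUTHOR_PAPER`, `PAPER_VENUE`, `invert`, and `count_path_instances` are hypothetical helpers, not part of any library.

```python
from collections import Counter

# Hypothetical toy network, reconstructed from the path instances on the slide.
AUTHOR_PAPER = {
    "Jim": ["P1"],
    "Ann": ["P1", "P2"],
    "Mike": ["P2", "P3", "P4"],
    "Bob": ["P3", "P5"],
}
PAPER_VENUE = {
    "P1": ["SIGMOD"], "P2": ["SIGMOD"], "P3": ["SIGMOD"],
    "P4": ["KDD"], "P5": ["KDD"],
}


def invert(relation):
    """Return the inverse relation (e.g. Paper-Author from Author-Paper)."""
    inverse = {}
    for src, targets in relation.items():
        for dst in targets:
            inverse.setdefault(dst, []).append(src)
    return inverse


PAPER_AUTHOR = invert(AUTHOR_PAPER)
VENUE_PAPER = invert(PAPER_VENUE)


def count_path_instances(start, relations):
    """Count path instances from `start` along a meta-path.

    `relations` is the list of typed relations that make up the meta-path,
    in order. The result maps each reachable end object to its path count.
    """
    counts = Counter({start: 1})
    for relation in relations:
        next_counts = Counter()
        for node, n in counts.items():
            for neighbor in relation.get(node, []):
                next_counts[neighbor] += n
        counts = next_counts
    return counts


APA = [AUTHOR_PAPER, PAPER_AUTHOR]
APVPA = [AUTHOR_PAPER, PAPER_VENUE, VENUE_PAPER, PAPER_AUTHOR]

jim_apa = count_path_instances("Jim", APA)
jim_apvpa = count_path_instances("Jim", APVPA)
```

In this toy data Jim reaches Bob only through the longer, venue-based meta-path, which illustrates why the choice of meta-path changes which authors look similar.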

Why Meta-Path Selection?
  Goal: cluster authors based on their connections in the network
  Which meta-path to choose? Different choices give different clusterings of the same eight authors:
    {1,3} {2,4} {5,7} {6,8}
    {1,2,3,4} {5,6,7,8}
    {1,3,5,7} {2,4,6,8}

The Role of User Guidance
  It is the users’ responsibility to specify their clustering purpose, say, by giving seeds in each cluster
  Seeds + Meta-path(s) → Clustering result
    Seeds {1} {5} → {1,2,3,4} {5,6,7,8}
    Seeds {1} {2} {5} {6} → {1,3} {2,4} {5,7} {6,8}

The Problem of User-Guided Clustering with Meta-Path Selection
  Input:
    The target type for clustering: $T$
    Number of clusters: $K$
    Seeds in some of the clusters: $L_1, L_2, \ldots, L_K$
    $M$ candidate meta-paths starting from $T$: $\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_M$
  Output:
    The quality weight $\alpha_m$ for each candidate meta-path in the clustering process
    The clustering results $\theta_i$ that are consistent with the user guidance

Existing Link-based User-Guided Clustering Approaches
  Link-based clustering algorithms on homogeneous networks (Zhu et al., 2003)
    Treat all types of links as equally important
  Approaches that distinguish different relations in HIN (Long et al., 2007)
    Use ALL the relations in the network
    Do not distinguish different clustering tasks with different semantic meanings

The Probabilistic Model
  Part 1: Modeling the relationship generation
    A good clustering result should lead to a high likelihood of observing the existing relationships
    Keep in mind: higher-quality relations should count more in the total likelihood
  Part 2: Modeling the guidance from users
    The more consistent with the guidance, the higher the probability of the clustering result
  Part 3: Modeling the quality weights for meta-paths
    The more consistent with the clustering result, the higher the quality weight
  Objective function: [equation not preserved in the transcript]

Part 1: Modeling the Relationship Generation
  For each meta-path $\mathcal{P}_m$, let the relation matrix be $W_m$
  The relationship $\langle t_i, f_{j,m} \rangle$ is generated under a mixture of multinomial distributions:
    $\pi_{ij,m} = P(j \mid i, m) = \sum_k P(k \mid i)\, P(j \mid k, m) = \sum_k \theta_{ik}\, \beta_{kj,m}$
    $\theta_{ik}$: the probability that $t_i$ belongs to cluster $k$
    $\beta_{kj,m}$: the probability that feature object $f_{j,m}$ appears in cluster $k$
  The probability of observing all the relationships in $\mathcal{P}_m$: $P(W_m \mid \Theta)$ [formula and worked example not preserved in the transcript]
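
The mixture above can be sketched in a few lines of Python. The function names and the toy numbers are hypothetical. The log-likelihood form used here, $\sum_i \sum_j w_{ij,m} \log \pi_{ij,m}$ (a multinomial likelihood up to a constant), is an assumption, since the slide's own formula did not survive in the transcript.

```python
import math


def relation_probabilities(theta_i, beta):
    """pi_{ij} = sum_k theta_{ik} * beta_{kj} for one target object t_i.

    theta_i: list of K cluster membership probabilities for t_i.
    beta:    K x J matrix, beta[k][j] = P(feature j | cluster k) for one meta-path.
    """
    n_features = len(beta[0])
    return [
        sum(theta_i[k] * beta[k][j] for k in range(len(theta_i)))
        for j in range(n_features)
    ]


def log_likelihood(theta, beta, w):
    """Assumed form: sum_i sum_j w_ij * log(pi_ij), dropping constants.

    Requires pi_ij > 0 wherever the observed count w_ij is positive.
    """
    total = 0.0
    for theta_i, w_i in zip(theta, w):
        pi_i = relation_probabilities(theta_i, beta)
        for count, p in zip(w_i, pi_i):
            if count > 0:
                total += count * math.log(p)
    return total


# Toy data: 2 target objects, 2 clusters, 2 feature objects.
beta = [[0.9, 0.1], [0.1, 0.9]]
w = [[4, 0], [0, 4]]
good_theta = [[1.0, 0.0], [0.0, 1.0]]  # matches the link pattern
bad_theta = [[0.0, 1.0], [1.0, 0.0]]   # swapped cluster assignment

good_ll = log_likelihood(good_theta, beta, w)
bad_ll = log_likelihood(bad_theta, beta, w)
```

The clustering that agrees with the observed links gets the higher likelihood, which is the property Part 1 relies on.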

Part 2: Modeling the Guidance from Users
  For each soft clustering probability vector $\theta_i$: model it as generated from a Dirichlet prior
  If $t_i$ is labeled as a seed in cluster $k^*$:
    $\theta_i \sim \mathrm{Dir}(\lambda \mathbf{e}_{k^*} + \mathbf{1})$
    $\mathbf{e}_{k^*}$ is an all-zero vector except for item $k^*$, which is 1
    $\lambda$ is the user confidence for the guidance
  If $t_i$ is not labeled in any cluster:
    $\theta_i \sim \mathrm{Dir}(\mathbf{1})$
    The prior density is uniform, a special case of the Dirichlet distribution
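
As a small illustration of this prior, the Python sketch below builds the parameter vector $\lambda \mathbf{e}_{k^*} + \mathbf{1}$ and evaluates the standard Dirichlet log-density with `math.lgamma`. The helper names `seed_prior_parameters` and `dirichlet_log_density` are hypothetical and chosen for this example only.

```python
import math


def seed_prior_parameters(n_clusters, seed_cluster=None, lam=0.0):
    """Dirichlet parameters lam * e_{k*} + 1, or all ones if t_i is unlabeled."""
    params = [1.0] * n_clusters
    if seed_cluster is not None:
        params[seed_cluster] += lam
    return params


def dirichlet_log_density(theta, params):
    """Log of the Dirichlet density at theta (theta must sum to 1)."""
    log_norm = math.lgamma(sum(params)) - sum(math.lgamma(a) for a in params)
    # Components with parameter 1 contribute theta^0 = 1, so they are skipped.
    return log_norm + sum(
        (a - 1.0) * math.log(t) for a, t in zip(params, theta) if a != 1.0
    )


theta = [0.5, 0.25, 0.25]
unlabeled = seed_prior_parameters(3)
seeded = seed_prior_parameters(3, seed_cluster=0, lam=2.0)

uniform_logp = dirichlet_log_density(theta, unlabeled)
seeded_logp = dirichlet_log_density(theta, seeded)
```

For an unlabeled object the density is the same for every $\theta_i$, while for a seed it grows with $\theta_{i k^*}^{\lambda}$, so a larger $\lambda$ pulls the seed more strongly toward its labeled cluster.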

Part 3: Modeling the Quality Weights for Meta-Paths
  Model the quality weight $\alpha_m$ as the relative weight for each relationship in $W_m$
    Observation of relationships: $W_m \to \alpha_m W_m$
  Further assume relationship generation with a Dirichlet prior: $\boldsymbol{\pi}_{i,m} \sim \mathrm{Dir}(\mathbf{1})$
  The best $\alpha_m$: the one most likely to generate the current clustering-based parameters
    When $\alpha_m$ is small, $\boldsymbol{\pi}_{i,m}$ is more likely to be a uniform distribution (randomly generated)
    When $\alpha_m$ is large, $\boldsymbol{\pi}_{i,m}$ is more likely to be $\mathbf{w}_{i,m} / n_{i,m}$, what we observed (consistent with the observation)
  [Figure: Dirichlet distribution]
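
The effect of $\alpha_m$ can be illustrated with the posterior mean of $\boldsymbol{\pi}_{i,m}$. The sketch below assumes the usual Dirichlet-multinomial conjugate update, in which a $\mathrm{Dir}(\mathbf{1})$ prior combined with scaled counts $\alpha_m \mathbf{w}_{i,m}$ gives a $\mathrm{Dir}(\alpha_m \mathbf{w}_{i,m} + \mathbf{1})$ posterior. The slides do not spell this step out, so treat it as an assumption; `posterior_mean` is a hypothetical helper.

```python
def posterior_mean(w_i, alpha):
    """Mean of Dir(alpha * w_i + 1): (alpha * w_ij + 1) / (alpha * n_i + J)."""
    n_i = sum(w_i)
    n_features = len(w_i)
    return [(alpha * w + 1.0) / (alpha * n_i + n_features) for w in w_i]


w_i = [3, 1, 0]  # observed relationship counts for one target object

low_weight = posterior_mean(w_i, alpha=0.0)    # prior only: uniform
high_weight = posterior_mean(w_i, alpha=1e6)   # approaches w_i / n_i
```

With `alpha=0.0` the counts are ignored and the result is uniform. As `alpha` grows, the result approaches the empirical distribution `[0.75, 0.25, 0.0]`, matching the two regimes described on the slide.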

The Learning Algorithm
  An iterative algorithm in which the clustering result $\Theta$ and the quality weight vector $\boldsymbol{\alpha}$ mutually enhance each other
  Step 1: Optimize $\Theta$ given $\boldsymbol{\alpha}$
    $\theta_i$ is determined by all the relation matrices with different weights $\alpha_m$, as well as the labeled seeds
  Step 2: Optimize $\boldsymbol{\alpha}$ given $\Theta$
    In general, the higher the likelihood of observing $W_m$ given $\Theta$, the higher $\alpha_m$
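
Step 1 can be sketched as one EM-style update of $\Theta$ with $\boldsymbol{\alpha}$ held fixed. This is a minimal sketch under stated assumptions, not the authors' exact procedure: it uses the standard EM responsibilities for a mixture of multinomials, weights each meta-path's contribution by $\alpha_m$, and adds $\lambda$ to the seeded cluster as the MAP effect of the $\mathrm{Dir}(\lambda \mathbf{e}_{k^*} + \mathbf{1})$ prior. The function name `update_theta` and its argument layout are hypothetical. Step 2, the update of $\boldsymbol{\alpha}$, is not shown because the slides give no formula for it.

```python
def update_theta(theta, betas, relations, alphas, seeds, lam):
    """One EM-style update of the soft memberships, alpha held fixed.

    theta:     N x K matrix, theta[i][k] = P(cluster k | target i)
    betas:     one K x J_m matrix per meta-path
    relations: one N x J_m count matrix W_m per meta-path
    alphas:    one quality weight per meta-path
    seeds:     dict mapping a seed's target index to its cluster index
    lam:       user confidence in the seeds
    """
    n_clusters = len(theta[0])
    new_theta = []
    for i, theta_i in enumerate(theta):
        # Prior term: lam is added only to the cluster a seed is labeled with.
        scores = [lam if seeds.get(i) == k else 0.0 for k in range(n_clusters)]
        for w, beta, alpha in zip(relations, betas, alphas):
            for j, count in enumerate(w[i]):
                if count == 0:
                    continue
                joint = [theta_i[k] * beta[k][j] for k in range(n_clusters)]
                total = sum(joint)
                if total == 0:
                    continue
                for k in range(n_clusters):
                    # Responsibility of cluster k, weighted by count and alpha.
                    scores[k] += alpha * count * joint[k] / total
        norm = sum(scores)
        # Keep the old row if the object has no evidence at all.
        new_theta.append([s / norm for s in scores] if norm > 0 else list(theta_i))
    return new_theta


# Toy run: 2 targets, 2 clusters, 1 meta-path with 2 feature objects.
theta0 = [[0.5, 0.5], [0.5, 0.5]]
betas = [[[0.9, 0.1], [0.1, 0.9]]]
relations = [[[4, 0], [0, 4]]]
theta1 = update_theta(theta0, betas, relations, alphas=[1.0], seeds={0: 0}, lam=1.0)
```

In the toy run the seeded object moves toward its labeled cluster and the unlabeled object follows its links, which is the behavior Step 1 describes: memberships are driven by the weighted relation matrices together with the seeds.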

Experiments: Datasets
  DBLP
    Object types: Authors, Venues, Papers, Terms
    Relation types: AP, PA, VP, PV, TP, PT
  Yelp
    Object types: Users, Businesses, Reviews, Terms
    Relation types: UR, RU, BR, RB, TR, RT

DBLP: Clustering Venues According to Research Areas
  Task:
    Target objects: venues
    Number of clusters: 4
    Candidate meta-paths: V-P-A-P-V, V-P-T-P-V
  Output:
    Weights:
      V-P-A-P-V: 1576 (0.0017 per relationship)
      V-P-T-P-V: 17001 (0.0003 per relationship)
    Clustering results: [table not preserved in the transcript]

Yelp-T2
  Task:
    Target objects: restaurants
    Number of clusters: 6
    Candidate meta-paths: B-R-U-R-B, B-R-T-R-B
  Output:
    Weights:
      B-R-U-R-B: 6000 (0.1716 per relationship, compared with 0.5864 for clustering shopping categories)
      B-R-T-R-B: $2.9522 \times 10^{7}$ (0.0138 per relationship)

Conclusion
  Meta-path selection
    An unavoidable problem in information-rich heterogeneous information networks
  User guidance
    Users’ responsibility to state the mining purpose
  PathSelClus
    A probabilistic model-based algorithm integrating meta-path selection with user-guided clustering
  Future work
    Candidate meta-path generation
    Different forms of guidance

Q & A
  Acknowledgement
    Co-authors
    Funding agencies: this research was supported by U.S. NSF, DHS, and ARL/NS-CTA