Query-Based Outlier Detection in Heterogeneous Information Networks Jonathan Kuck 1, Honglei Zhuang 1, Xifeng Yan 2, Hasan Cam 3, Jiawei Han 1 1 University.

Presentation transcript:

Query-Based Outlier Detection in Heterogeneous Information Networks
Jonathan Kuck (1), Honglei Zhuang (1), Xifeng Yan (2), Hasan Cam (3), Jiawei Han (1)
(1) University of Illinois at Urbana-Champaign
(2) University of California at Santa Barbara
(3) US Army Research Lab

Heterogeneous information networks are networks composed of multi-typed, interconnected vertices and links.
Outlier detection aims to find vertices that deviate significantly from other vertices.
However, outliers in such a heterogeneous network can be defined in many different ways.
– E.g. finding the outlier among the authors of this paper: Jonathan, Honglei, Xifeng, Hasan, Jiawei
Users have their own intuition about which kinds of outliers interest them, so the system allows users to specify queries for outlier detection.

Research Challenges
How do users interact with the system to specify their queries?
How do we define a general outlierness measure for different queries?
How do we efficiently find outliers for different queries?

Outline
Basic Concepts and Notations
Outlier Query Language
NetOut: Outlierness Measure
Processing Outlier Queries
Experimental Results
Summary

Basic Concepts and Notations
Heterogeneous Information Network
– An information network with multiple types of vertices, $G = (V, E, T, \phi)$, where $V$ is the set of vertices, $E$ is the set of edges, $T$ is the set of types, and $\phi : V \to T$ assigns each vertex a type.
[Figures: network schema; an instantiated network]

Basic Concepts and Notations
Meta-Path
– An ordered sequence of vertex types, denoted as $P = (T_1 T_2 \dots T_l)$
– E.g. [example not preserved]
Meta-Path Instantiation
– Use [notation not preserved] to denote the paths between two vertices whose vertex types align with the given meta-path
– E.g. [example not preserved] is shown in the figure in red solid lines
[Figures: network schema; an instantiated network]

Basic Concepts and Notations
Neighborhood
– Based on a meta-path, defined as [formula not preserved]
– E.g. [example not preserved]
Neighbor Vector
– Based on a meta-path, the neighbor vector of a vertex $v$ describes how many paths there are from $v$ to each of its neighbors
– E.g. [example not preserved]
[Figures: network schema; an instantiated network]
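The neighborhood and neighbor vector can be illustrated with a minimal Python sketch. It assumes a toy bibliographic network stored as typed adjacency dictionaries; the names `edges`, `neighbor_vector`, `neighborhood` and all of the data are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical toy network: adjacency lists keyed by (source type, target type).
edges = {
    ("author", "paper"): {
        "alice": ["p1", "p2"],
        "bob": ["p2", "p3"],
    },
    ("paper", "venue"): {
        "p1": ["KDD"],
        "p2": ["KDD"],
        "p3": ["VLDB"],
    },
}


def neighbor_vector(start, meta_path):
    """Count meta-path instances from `start` to every reachable end vertex.

    `meta_path` is an ordered list of vertex types, for example
    ["author", "paper", "venue"]; `start` must be a vertex of the first type.
    """
    counts = Counter({start: 1})
    # Walk the meta-path one hop at a time, carrying path counts forward.
    for src_type, dst_type in zip(meta_path, meta_path[1:]):
        step = edges[(src_type, dst_type)]
        next_counts = Counter()
        for vertex, n_paths in counts.items():
            for neighbor in step.get(vertex, []):
                next_counts[neighbor] += n_paths
        counts = next_counts
    return counts


def neighborhood(start, meta_path):
    """Vertices reachable from `start` by at least one path instance."""
    return set(neighbor_vector(start, meta_path))


apv = ["author", "paper", "venue"]
print(dict(neighbor_vector("alice", apv)))
print(neighborhood("bob", apv))
```

In this toy data both of alice's papers appear at KDD, so her neighbor vector along author.paper.venue has a single entry with count 2, while bob's vector spreads over two venues.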

Formalization of Outlier Queries
Given a heterogeneous information network, a query can be formalized as $Q = (S_c, S_r, \mathcal{P})$, where
– $S_c$ is a set of (same-typed) candidate vertices
  Specifies where outliers are chosen from
– $S_r$ is a set of (same-typed) reference vertices
  Specifies a sample of normal vertices
  Optional; by default it equals the candidate set
– $\mathcal{P}$ defines a weighted set of meta-paths
  Specifies how two vertices are compared
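One way to picture the three components of a query is as a small record type. The sketch below is only an illustration: the class name `OutlierQuery`, its field names, and the optional `top_k` field are assumptions, not part of the authors' system.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class OutlierQuery:
    """Hypothetical container for an outlier query."""

    # Candidate vertices: where outliers are chosen from.
    candidates: FrozenSet[str]
    # Weighted meta-paths: how two vertices are compared.
    weighted_meta_paths: Dict[Tuple[str, ...], float]
    # Reference vertices: a sample of normal vertices (optional).
    references: Optional[FrozenSet[str]] = None
    # Number of outliers to return (optional).
    top_k: Optional[int] = None

    def reference_set(self) -> FrozenSet[str]:
        """Fall back to the candidate set when no reference set is given."""
        return self.candidates if self.references is None else self.references


query = OutlierQuery(
    candidates=frozenset({"alice", "bob"}),
    weighted_meta_paths={("author", "paper", "venue"): 1.0},
)
print(query.reference_set())
```

The fallback in `reference_set` mirrors the statement that the reference set defaults to the candidate set.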

Outlier Query Language
General formulation:
FIND OUTLIERS
FROM author{"C. Faloutsos"}.paper.author
COMPARE TO venue {"KDD"}.paper.author
JUDGED BY author.paper.venue
TOP 10;
– FROM specifies the candidate set
– COMPARE TO specifies the reference set
– JUDGED BY specifies how vertices are compared: by their neighbor vectors along the given meta-path
– TOP 10 returns only the top 10 outliers

Outlier Query Language (cont'd)
Allow complicated judging standards
– E.g. multiple weighted neighbor vectors
  JUDGED BY author.paper.venue: 0.6, author.paper.author: 0.4
Allow operations on sets
– E.g. select by conditions
  FROM venue {"KDD"}.paper.author AS A WHERE COUNT( A.paper ) > 10
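To make the weighted JUDGED BY clause concrete, here is a minimal Python sketch that splits such a clause into meta-paths and weights. The grammar is inferred only from the two examples on the slides, the function name `parse_judged_by` is hypothetical, and giving an unweighted meta-path a weight of 1.0 is an assumption.

```python
def parse_judged_by(clause):
    """Parse the body of a JUDGED BY clause.

    'author.paper.venue: 0.6, author.paper.author: 0.4' becomes
    [(['author', 'paper', 'venue'], 0.6), (['author', 'paper', 'author'], 0.4)].
    A meta-path without an explicit weight gets 1.0 (an assumption).
    """
    parsed = []
    for item in clause.split(","):
        item = item.strip()
        if not item:
            continue
        # Separate the dotted meta-path from its optional weight.
        path_text, colon, weight_text = item.partition(":")
        weight = float(weight_text) if colon else 1.0
        meta_path = [type_name.strip() for type_name in path_text.split(".")]
        parsed.append((meta_path, weight))
    return parsed


print(parse_judged_by("author.paper.venue: 0.6, author.paper.author: 0.4"))
print(parse_judged_by("author.paper.venue"))
```

Splitting on the colon before splitting on dots keeps the decimal point of a weight such as 0.6 from being mistaken for a meta-path separator.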

NetOut: A General Network-Based Outlierness Measure
Given a meta-path (in the JUDGED BY clause):
Define the connectivity between two vertices
– The more paths between them, the more similar they are
Define relative connectivity
– Compare the connectivity to self-connectivity
– Use self-connectivity as an expected connectivity to measure whether two vertices are unexpectedly connected or unexpectedly disconnected
Outlier measure: NetOut [formula not preserved]
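The bullet points above can be turned into a small Python sketch. Because the formula itself is not preserved here, this is a reconstruction under stated assumptions rather than the authors' exact definition: connectivity is taken to be the dot product of two neighbor vectors, relative connectivity divides it by the candidate's self-connectivity, and the score sums relative connectivity over the reference vertices. The function names, the toy vectors, and the choice to return 0.0 for a vertex with an empty neighbor vector are all hypothetical.

```python
def connectivity(u, v):
    """Assumed connectivity: dot product of two sparse neighbor vectors."""
    return sum(count * v.get(key, 0) for key, count in u.items())


def netout(candidate, reference_vectors):
    """Sum of connectivity to each reference vertex, relative to self-connectivity.

    Lower values mean the candidate is less connected to the reference set
    than its own activity would suggest, so it is more likely an outlier.
    """
    self_connectivity = connectivity(candidate, candidate)
    if self_connectivity == 0:
        return 0.0  # assumption: a vertex with no paths gets the lowest score
    total = sum(connectivity(candidate, ref) for ref in reference_vectors)
    return total / self_connectivity


# Hypothetical neighbor vectors along author.paper.venue.
references = [
    {"KDD": 3, "VLDB": 2},
    {"KDD": 4, "VLDB": 1},
    {"KDD": 2, "VLDB": 2},
]
typical = {"KDD": 3, "VLDB": 2}
unusual = {"SIGGRAPH": 30}

print(netout(typical, references))
print(netout(unusual, references))
```

With these made-up vectors the author who publishes only at a venue no reference author uses receives the lowest score, which matches the convention that lower values indicate outliers.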

Comparing NetOut to Other Measures
[Table: a toy example. Neighbor vectors over the venues VLDB, KDD, STOC and SIGGRAPH for the reference author(s) and for Sarah, Rob, Lucy, *Joe and *Emma. Column boundaries were lost in extraction; the surviving row values read: Sarah 10 11; Rob 0120; Lucy 0510; *Joe 0002; *Emma 00030.]
[Table: outlier measure comparison of NetOut, PathSim [1] and CosSim for Sarah, Rob, Lucy, Joe and Emma; values not preserved. The lower the value, the more likely the author is an outlier.]
Joe is not necessarily an interesting outlier, as those papers might simply be noise.
Emma is obviously an outlier and should be assigned a lower value.
[1] Sun, Yizhou, et al. "PathSim: Meta path-based top-k similarity search in heterogeneous information networks." VLDB 2011.

Comparison on Real Data Set
Applied to a network constructed from the DBLP data set.
Find outliers among Christos Faloutsos' coauthors, in terms of their publishing venues.
[Figure annotation: authors with very few papers; uninteresting outliers]

Query Processing
The calculation of NetOut can be written as [formula not preserved]
Time complexity:
– $O(|S_r|)$: only calculated once
– $O(|S_c|)$: calculated over all the candidate vertices
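The two complexity terms suggest a factored computation, sketched below in Python. It assumes, as a reconstruction rather than the authors' stated formula, that NetOut is the dot product of a candidate's neighbor vector with the sum of all reference neighbor vectors, divided by the candidate's self-connectivity. The names `netout_scores` and `top_outliers` and the toy vectors are hypothetical.

```python
import heapq


def netout_scores(candidates, references):
    """Score every candidate with one pass over each set of neighbor vectors."""
    # One pass over the reference set, done once: O(|S_r|) vector additions.
    reference_sum = {}
    for vector in references.values():
        for key, count in vector.items():
            reference_sum[key] = reference_sum.get(key, 0) + count

    # One dot product per candidate: O(|S_c|) vector operations.
    scores = {}
    for name, vector in candidates.items():
        self_connectivity = sum(count * count for count in vector.values())
        cross = sum(count * reference_sum.get(key, 0) for key, count in vector.items())
        # Assumption: an empty neighbor vector gets the lowest score, 0.0.
        scores[name] = cross / self_connectivity if self_connectivity else 0.0
    return scores


def top_outliers(scores, k):
    """Return the k (name, score) pairs with the lowest scores."""
    return heapq.nsmallest(k, scores.items(), key=lambda pair: pair[1])


# Hypothetical neighbor vectors; the reference set defaults to the candidates.
vectors = {
    "a": {"KDD": 3, "VLDB": 2},
    "b": {"KDD": 4, "VLDB": 1},
    "c": {"SIGGRAPH": 30},
}
scores = netout_scores(vectors, vectors)
print(top_outliers(scores, 1))
```

Summing the reference vectors first avoids comparing every candidate with every reference vertex, which is what separates the work into one pass over each set.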

Pre-Materialization of Meta-Path
We observe:
– Calculation of outlierness only has time complexity of [expression not preserved]
– Retrieving [notation not preserved] for each vertex is much more time-consuming
Pre-materialization
– Pre-store the materialization of all the length-2 meta-paths

Selective Pre-Materialization
Storing the materialization of all length-2 meta-paths can be space-consuming.
Selective pre-materialization
– From a given set of training queries, take the vertices that frequently appear in the candidate sets (e.g. those that appear in more than 10% of queries)
– Pre-materialize all the length-2 meta-paths for these frequently appearing vertices
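The selection step can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function names, the dictionary-based index, and the `compute_vector` callback (standing in for whatever routine walks the network) are hypothetical, and the threshold is written as a fraction so that 0.1 corresponds to the 10% example.

```python
from collections import Counter


def frequent_vertices(training_candidate_sets, min_fraction=0.1):
    """Vertices appearing in more than `min_fraction` of the training queries."""
    counts = Counter()
    for candidate_set in training_candidate_sets:
        counts.update(set(candidate_set))  # count each vertex once per query
    n_queries = len(training_candidate_sets)
    return {v for v, c in counts.items() if c / n_queries > min_fraction}


def build_index(vertices, length2_meta_paths, compute_vector):
    """Pre-materialize neighbor vectors for the selected vertices only."""
    return {
        (vertex, tuple(path)): compute_vector(vertex, path)
        for vertex in vertices
        for path in length2_meta_paths
    }


def lookup(index, vertex, meta_path, compute_vector):
    """Use the stored vector when present, otherwise compute it on demand."""
    key = (vertex, tuple(meta_path))
    if key in index:
        return index[key]
    return compute_vector(vertex, meta_path)


# Hypothetical training queries, reduced to their candidate sets.
training = [{"a", "b"}, {"a"}, {"a", "c"}, {"d"}]
hot = frequent_vertices(training, min_fraction=0.5)
print(hot)
```

Vertices that miss the threshold are still answerable through `lookup`, which falls back to computing the vector at query time, so the threshold trades index size against query latency.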

Experimental Setup
Data Set
– DBLP data set
  2,244,018 publications, 1,274,360 authors
  The network is constructed according to the schema above
  Publishing venue information is included
  Terms are extracted from publication titles
Query Sets
– Three different types of queries are designed to find different kinds of outliers
– 10,000 random queries are generated for each query type as a query set

Case Study
With different queries, the outlier measure is able to capture different outliers accordingly.
– E.g. finding outliers among Christos Faloutsos' coauthors
[Tables: outlier results by comparing publishing venues; outlier results by comparing coauthor communities]

Efficiency Study
Comparing strategies
– Baseline: no pre-materialization
– Pre-Materialization (PM): all length-2 meta-path instantiations are pre-computed and indexed
– Selective Pre-Materialization (SPM): only a subset of instantiations with relative frequency larger than 0.01 is indexed

Parameter Studies for SPM Different thresholds for selective pre-materialization

Summary
Propose a framework for query-based outlier detection in heterogeneous information networks
Formalize the definition of a query and provide a query language for users to interact with the system
Design a general outlierness measure, NetOut, which can effectively find interesting outliers
Present an efficient implementation to process such queries

Thank you