Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data Mining, 2006

Introduction Outlier: an object that deviates from the rest of the dataset. – Its outlierness typically appears more outstanding with respect to its local neighborhood. Applications: fraud detection, intrusion discovery, video surveillance, pharmaceutical tests and weather prediction. Local outliers: outliers whose density is significantly lower than that of their local neighborhood, i.e. whose density distribution differs significantly from their neighborhood's. Outlierness – The degree of outlierness of an object p is defined as the ratio of the average density of p's neighboring objects to the density of p itself.

Example (1) The densities of the nearest neighboring objects of p and q are the same, but p is slightly closer to cluster C1 than q. In this case p will have a stronger outlierness measure than q, which is obviously wrong. Although the density of r is lower than that of p, the average density of r's neighboring objects (two objects from C2 and an outlier) is lower than that of p's neighbors. Thus, when the above measure is computed, p could turn out to have a stronger outlierness measure than r, which again is wrong.

Motivation The existing outlierness measure is not easily applicable to complex situations in which the dataset contains multiple clusters with very different density distributions. Propose taking both the nearest neighbors (NNs) and the reverse nearest neighbors (RNNs) into account when estimating the neighborhood's density distribution. RNNs: the RNNs of an object p are the objects that have p as one of their k nearest neighbors. By considering the symmetric neighborhood relationship of both NN and RNN, the space of an object influenced by other objects is well determined, the density of its neighborhood can be reasonably estimated, and the outliers found are more meaningful.
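To make the NN/RNN asymmetry concrete, here is a minimal brute-force sketch on a toy dataset (the helper names `knn` and `rnn` are ours, not the paper's):

```python
import math

def knn(points, p, k):
    """Indices of the k nearest neighbors of point index p (excluding p)."""
    others = [i for i in range(len(points)) if i != p]
    others.sort(key=lambda i: math.dist(points[p], points[i]))
    return set(others[:k])

def rnn(points, p, k):
    """Reverse k-nearest neighbors: indices q that have p among their kNN."""
    return {q for q in range(len(points)) if q != p and p in knn(points, q, k)}

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
print(knn(points, 0, 2))  # -> {1, 2}
print(rnn(points, 0, 2))  # the relation is not symmetric in general
```

Note that an isolated point can have many nearest neighbors yet few or no reverse nearest neighbors, which is exactly the signal the paper exploits.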

Example (2) p has two RNNs: {s, t}. q has no RNNs: {}. r has only one.

Method proposed Propose mining outliers based on a symmetric neighborhood relationship. Assign each object a degree of being an INFLuenced Outlierness (INFLO). The higher the INFLO, the more likely the object is an outlier. Present several efficient algorithms for mining top-n outliers based on INFLO: – a naïve index-based method – a two-way search method

Influential Measure of Outlierness by Symmetric Relationship Let D be a database of size N, let p, q and o be some objects in D, and let k be a positive integer. Use d(p,q) to denote the Euclidean distance between objects p and q.
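The slide stops before the formal definitions; the following is a reconstruction of the measure as it is usually stated for this paper, in the notation above (treat the exact form as an assumption rather than a quotation):

```latex
\begin{align*}
k\_dist(p) &= \text{distance from } p \text{ to its } k\text{-th nearest neighbor} \\
den(p) &= \frac{1}{k\_dist(p)} \qquad \text{(local density of } p\text{)} \\
IS_k(p) &= NN_k(p) \cup RNN_k(p) \qquad \text{(the $k$-influence space of } p\text{)} \\
INFLO_k(p) &= \frac{\dfrac{1}{|IS_k(p)|}\displaystyle\sum_{o \in IS_k(p)} den(o)}{den(p)}
\end{align*}
```

An object deep inside a cluster has density close to its influence space's average, so its INFLO is near 1; an outlier has much lower density than its influence space, so its INFLO is large.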

Example (3) D = {p, q1, q2, q3, q4, q5}, k = 3
NN_3(p) = {q1, q2, q4} = RNN_3(p), so IS_3(p) = {q1, q2, q4}
NN_3(q1) = {p, q2, q4}, RNN_3(q1) = {p, q2, q3, q4, q5}
NN_3(q2) = {p, q1, q3}, RNN_3(q2) = {p, q1, q3, q4, q5}
NN_3(q3) = {q1, q2, q5}, RNN_3(q3) = {q2, q5}
NN_3(q4) = {p, q1, q2, q5}, RNN_3(q4) = {p, q1}
NN_3(q5) = {q1, q2, q3}, RNN_3(q5) = {q3, q4}

A Naïve Index-based Method Finding influential outliers requires kNN and RNN operations for each object in the database, so the search cost is high. Maintain all points in a spatial index such as an R-tree, and reduce the cost of range queries with pruning techniques. Suppose we have computed a temporary k_dist(p) by checking only a subset of the objects; this value is clearly an upper bound on the actual k_dist(p). If the minimum distance between p and the MBR of a node in the R-tree is greater than k_dist(p), none of the objects in the subtree rooted at that node can be among the k nearest neighbors of p. Along with the kNN search, the RNNs of each object can be maintained dynamically in the R-tree.
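The pruning bound rests on the standard R-tree MinDist between a point and an axis-aligned MBR; a small sketch of that computation (helper name `min_dist` is ours):

```python
def min_dist(point, mbr_low, mbr_high):
    """Minimum Euclidean distance from a point to an axis-aligned MBR,
    given as its lower and upper corners. Zero if the point is inside."""
    s = 0.0
    for x, lo, hi in zip(point, mbr_low, mbr_high):
        d = max(lo - x, 0.0, x - hi)  # per-axis gap between x and [lo, hi]
        s += d * d
    return s ** 0.5

# A node whose MBR satisfies min_dist(p, lo, hi) > k_dist(p) can be
# pruned: no object in its subtree can be a k-nearest neighbor of p.
print(min_dist((0, 0), (3, 4), (5, 6)))  # -> 5.0
```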

Algorithm 1: Index-Based Method
Input: k, D, n, the root of the R-tree
Output: the top-n INFLO objects of D
Method:
FOR each object p ∈ D DO
  MBRList = {root}; k_dist(p) = ∞; heap = ∅;
  WHILE MBRList is not empty DO
    Delete the first MBR from MBRList;
    IF the first MBR is a leaf THEN
      FOR each object q in the first MBR DO
        IF d(p, q) < k_dist(p) AND heap.size < k THEN
          heap.insert(q); k_dist(p) = d(p, heap.top);
    ELSE
      Append the MBR's children to MBRList;
      Sort MBRList by MinDist;
    FOR each MBR in MBRList DO
      IF k_dist(p) <= MinDist(p, MBR) THEN remove the MBR from MBRList;
  FOR each object q in heap DO
    Add q into NN_k(p); add p into RNN_k(q);
FOR each object p ∈ D DO
  Compute INFLO(p) from NN_k(p) and RNN_k(p);
Sort and report the top-n INFLO objects;
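Ignoring the R-tree machinery, the end-to-end ranking this algorithm produces can be sketched with brute-force kNN (function and variable names are ours; tie handling and the incremental upper-bound maintenance of k_dist are simplified away):

```python
import math

def top_n_inflo(points, k, n):
    """Rank objects by INFLO = (avg density of IS_k(p)) / density of p,
    with brute-force kNN standing in for the R-tree search."""
    N = len(points)
    nn, kdist = {}, {}
    for p in range(N):
        others = sorted((i for i in range(N) if i != p),
                        key=lambda i: math.dist(points[p], points[i]))
        nn[p] = set(others[:k])
        kdist[p] = math.dist(points[p], points[others[k - 1]])
    # RNN_k(p): objects that have p among their k nearest neighbors
    rnn = {p: {q for q in range(N) if p in nn[q]} for p in range(N)}
    den = {p: 1.0 / kdist[p] for p in range(N)}
    inflo = {}
    for p in range(N):
        space = nn[p] | rnn[p]  # the k-influence space IS_k(p)
        inflo[p] = (sum(den[o] for o in space) / len(space)) / den[p]
    return sorted(inflo, key=inflo.get, reverse=True)[:n]

points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(top_n_inflo(points, 2, 1))  # -> [4], the isolated point
```

Cluster members come out with INFLO near 1, while the isolated point's low density relative to its influence space pushes its INFLO well above 1.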

Two-way search method Two major factors hamper the efficiency of the previous algorithm. – For any object p, the RNN space cannot be determined until all other objects have finished their nearest-neighbor search. – A large amount of extra storage is required on the R-tree, where each object stores at least k pointers to its kNN and m pointers to its RNN. Reduce the computation cost for RNN and the corresponding storage cost by analyzing the characteristics of INFLO: – for any object that is a member of a cluster, INFLO ≈ 1; – pruning off these cluster objects saves not only computation cost but also extra storage space. Early pruning: first search p's k nearest neighbors, then dynamically find NN_k for each of these neighbors. If NN_k(NN_k(p)) still contains p, then p lies in a closely influenced space and is a core object of a cluster, so p can be pruned immediately.

Two-way search method
Input: k, D, n, the root of the R-tree, a threshold M
Output: the top-n INFLO objects of D
Method:
FOR each object p ∈ D DO
  count = |RNN_k(p)|;
  IF unvisited(p) THEN
    S = getKNN(p);            // search the k nearest neighbors
    unvisited(p) = FALSE;
  ELSE
    S = KNN(p);               // retrieve the stored nearest neighbors
  FOR each object q ∈ S DO
    IF unvisited(q) THEN
      T = getKNN(q); unvisited(q) = FALSE;
    IF p ∈ T THEN
      Add q into RNN_k(p); add p into RNN_k(q); count++;
  IF count >= |S| * M THEN    // M is the pruning threshold
    Mark p as pruned;
FOR each object p ∈ D' DO     // D' is the unpruned database
  Compute INFLO(p) from NN_k(p) and RNN_k(p);
Sort and report the top-n INFLO objects;
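The early-pruning idea can be sketched as follows. This is a simplified stand-in, not the paper's implementation: brute-force kNN replaces the R-tree search, and the reverse-neighbor count is taken only over p's own k nearest neighbors, so it approximates the algorithm above:

```python
import math

def prune_cluster_objects(points, k, M=1.0):
    """Keep only outlier candidates: prune any object whose kNN mostly
    'point back' at it (count >= |kNN| * M), since such an object sits
    inside a cluster and has INFLO close to 1."""
    N = len(points)
    nn = {}
    for p in range(N):
        others = sorted((i for i in range(N) if i != p),
                        key=lambda i: math.dist(points[p], points[i]))
        nn[p] = set(others[:k])
    survivors = []
    for p in range(N):
        count = sum(1 for q in nn[p] if p in nn[q])  # neighbors pointing back
        if count < len(nn[p]) * M:
            survivors.append(p)  # still a candidate outlier
    return survivors

points = [(0, 0), (1, 0), (0, 1), (1, 1), (9, 9)]
print(prune_cluster_objects(points, 2))  # -> [4]
```

The four cluster points are pruned because each appears in its own neighbors' kNN lists; only the isolated point survives to the INFLO-ranking stage, which is where the storage and computation savings come from.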

Experiment Experiments on datasets with different sizes and dimensions; compare with the LOF method. – LOF only considers the nearest neighborhood as the density estimation space. Datasets – Synthetic data: generated from multiple Gaussian distributions; cardinality varies from 1,000 to 1,000,000 tuples and dimensionality from 2 to 18. – Real-life data: the statistics archive of the National Hockey League (NHL); 22,180 records with 12 dimensions in total.

Synthetic Data 50 percent of the top-6 outliers are the same points under both measures; INFLO finds even more top outliers that differ from LOF's. The outliers found by INFLO are more meaningful, since LOF only considers the nearest neighborhood as the density estimation space.

Real-life Dataset NHL playoff data (22,180 tuples); k varies from 10 to 50. Projections are taken by randomly selecting dimensions, and the outlierness of hockey players is evaluated. Focus on the statistics in the 3-dimensional subspace of games played, goals, and shooting percentage. Rob Blake ranks 4th by INFLO but only 31st by LOF. The variation in shooting percentage is usually small, since only very few players are excellent shooters. Although Blake's shooting percentage is rather low, he is still not too far from the other players, so under the LOF measure he cannot be ranked among the top outliers. Under INFLO, the reason he is a most exceptional player is that no other player has such a low shooting percentage while scoring so many goals.

Conclusion Propose mining outliers based on a symmetric neighborhood relationship: the influenced space of an object takes both its neighbors and its reverse neighbors into account when estimating its neighborhood density distribution. Propose a new measure, INFLO, based on this symmetric relationship, with two algorithms: – an index-based method – a two-way search method Experiments show that the proposed methods are efficient and effective on both synthetic and real-life datasets.