An Unbiased Distance-based Outlier Detection Approach for High Dimensional Data DASFAA 2011 By Hoang Vu Nguyen, Vivekanand Gopalkrishnan and Ira Assent.


Presented by Salman Ahmed Shaikh (D1)

Contents
– Introduction
– Subspace Outlier Detection Challenges
– Objectives of Research
– The Approach: Subspace Outlier Score Function FS_out; the HighDOD Algorithm
– Empirical Results and Analysis
– Conclusion

Introduction
An outlier is an observation that appears to deviate markedly from the other members of the sample in which it occurs. [1]
Popular techniques of outlier detection:
– Distance-based
– Density-based
Since these techniques take the full-dimensional space into account, their performance is impacted by noisy or irrelevant features. Recently, researchers have therefore switched to subspace anomaly detection.
(Figure: o1, o2 and o3 are anomalous instances w.r.t. the data.)

Subspace Outlier Detection Challenges
Unavoidable exploration of all subspaces to mine the full result set:
– The monotonicity property does not hold for outliers, so an Apriori-like heuristic cannot be applied to prune subspaces.
Difficulty in devising an outlier notion:
– Full-dimensional outlier detection techniques suffer from dimensionality bias in subspaces: they assign higher outlier scores in high-dimensional subspaces than in lower-dimensional ones.
Exposure to a high false alarm rate:
– A binary decision on each data point (normal or outlier) in each subspace flags too many points as outliers.
– The solution is a ranking-based algorithm.

Objectives
Build an efficient technique for mining outliers in subspaces, which should
– Avoid an expensive scan of all subspaces while still yielding high detection accuracy
– Ease the task of parameter setting
– Facilitate the design of pruning heuristics to speed up the detection process
– Provide a ranking of outliers across subspaces

The Approach
The authors make an assertion and give several definitions to explain their approach.
Non-monotonicity Property: Consider a data point p in the dataset DS. Even if p is not anomalous in a subspace S of DS, it may be an outlier in some projection(s) of S. Conversely, even if p is a normal data point in all projections of S, it may be an outlier in S.
(Figure: A is an outlier in the full space but not in a subspace; B is an outlier in a subspace but not in the full space.)

(Subspace) Outlier Score Function
Outlier score function F_out, as given by Angiulli et al. for the full space [2]: the dissimilarity of a point p with respect to its k nearest neighbors is its cumulative neighborhood distance, i.e. the total distance from p to its k nearest neighbors in DS:
F_out(p) = Σ_{q ∈ kNN(p)} dist(p, q)
– To ensure that the non-monotonicity property is not violated, the authors redefine the outlier score function as below.
Subspace outlier score function FS_out: the dissimilarity of a point p with respect to its k nearest neighbors in a subspace S of dimensionality dim(S) is its cumulative neighborhood distance in S, i.e. the total distance from p to its k nearest neighbors in DS (projected onto S), normalized by dim(S)^(1/l):
FS_out(p, S) = (1 / dim(S)^(1/l)) Σ_{q ∈ kNN_S(p)} dist_S(p_S, q_S)
– Where p_S is the projection of a data point p ∈ DS onto S, and l is the order of the L_l norm used as the distance.
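The score function above fits in a few lines; a minimal sketch assuming a small in-memory dataset of numeric tuples and the L_l norm as the distance (the helper names dist and fs_out are illustrative, not from the paper):

```python
import math

def dist(p, q, S, l=2):
    """L_l distance between points p and q projected onto subspace S
    (S is a tuple of attribute indices)."""
    return sum(abs(p[i] - q[i]) ** l for i in S) ** (1.0 / l)

def fs_out(p, data, S, k=1, l=2):
    """FS_out: total distance from p to its k nearest neighbours in the
    projection onto S, normalised by dim(S)^(1/l)."""
    # exclude p itself (and exact duplicates) from its own neighbourhood
    dists = sorted(dist(p, q, S, l) for q in data if q != p)
    return sum(dists[:k]) / len(S) ** (1.0 / l)
```

With k = 1 and l = 2, a point at distance 1 from its nearest neighbour in a 2-dimensional subspace scores 1/√2, matching the worked example on the next slide.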

FS_out is Dimensionality Unbiased
FS_out assigns each data point one outlier score per subspace and is dimensionality unbiased.
Example: let k = 1 and l = 2.
– In Fig. (a), A's outlier score in the 2-dimensional space is 1/2^(1/2), which is the largest across all subspaces.
– In Fig. (b), the outlier score of B when projected onto the subspace of the x-axis is 1, which is also the largest across all subspaces.
Hence, FS_out flags both A and B as outliers.

FS_out is Globally Comparable
Range of distance: in each subspace S of DS, the distance between any two data points p and q is bounded by dim(S)^(1/l) (assuming each attribute is normalized to [0, 1]).
Range of outlier score: for an arbitrary data point p and any subspace S, we have 0 ≤ FS_out(p, S) ≤ k.
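The score bound follows directly from the distance bound; a short derivation, assuming each attribute is normalized to [0, 1]:

```latex
\operatorname{dist}_S(p_S, q_S)
  = \Bigl(\sum_{i \in S} |p_i - q_i|^{\,l}\Bigr)^{1/l}
  \le \bigl(\dim(S)\bigr)^{1/l}
\quad\Longrightarrow\quad
FS_{\mathrm{out}}(p, S)
  = \frac{1}{\dim(S)^{1/l}} \sum_{q \in kNN_S(p)} \operatorname{dist}_S(p_S, q_S)
  \le \frac{k \cdot \dim(S)^{1/l}}{\dim(S)^{1/l}} = k .
```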

Subspace Outlier Detection Problem
Using FS_out for outliers in subspaces, the mining problem can now be restated as: given two positive integers k and n, mine the top n distinct outliers whose outlier scores (in any subspace) are the largest.

HighDOD: Subspace Outlier Detection Algorithm
HighDOD (High-dimensional Distance-based Outlier Detection) is
– A distance-based approach to detecting outliers in very high-dimensional datasets
– Unbiased w.r.t. the dimensionality of different subspaces
– Capable of producing a ranking of outliers
HighDOD is composed of the following three procedures:
– OutlierDetection
– CandidateExtraction
– SubspaceMining
The OutlierDetection procedure examines subspaces of dimensionality up to some threshold m = O(log N), as suggested by Aggarwal and Ailon in [3, 4].

Algorithm 1: OutlierDetection
Carries out a bottom-up exploration of all subspaces up to a dimensionality of m = O(log N).
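A compact sketch of this bottom-up exploration, assuming an in-memory dataset and naively recomputing the FS_out score for every (point, subspace) pair; outlier_detection is an illustrative name, and the actual algorithm adds the candidate-extraction and pruning steps that this sketch omits:

```python
import itertools
import math

def outlier_detection(data, n, k=1, l=2):
    """Bottom-up scan of all subspaces of dimensionality 1..m, with
    m = O(log N); every point keeps the largest FS_out score it attains
    in any examined subspace, and the top-n points are returned."""
    N, D = len(data), len(data[0])
    m = min(D, max(1, int(math.log2(N))))

    def score(i, S):
        # FS_out: total distance to the k nearest neighbours in the
        # projection onto S, normalised by dim(S)^(1/l)
        dists = sorted(
            sum(abs(data[i][a] - data[j][a]) ** l for a in S) ** (1.0 / l)
            for j in range(N) if j != i)
        return sum(dists[:k]) / len(S) ** (1.0 / l)

    best = [0.0] * N
    for d in range(1, m + 1):
        for S in itertools.combinations(range(D), d):
            for i in range(N):
                best[i] = max(best[i], score(i, S))
    return sorted(range(N), key=lambda i: best[i], reverse=True)[:n]
```

The exhaustive inner loops make this exponential in m, which is exactly why the paper caps the explored dimensionality at O(log N) and prunes with candidates.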

Algorithm 2: CandidateExtraction
Estimates the data points' local densities using a kernel density estimator and chooses the βn data points with the lowest estimates as potential outlier candidates.
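A rough sketch of this candidate-extraction idea, assuming a Gaussian kernel with a fixed, hand-picked bandwidth (the paper's estimator and bandwidth choice may differ; candidate_extraction is an illustrative name):

```python
import math

def candidate_extraction(data, S, beta_n, bandwidth=0.1):
    """Estimate each point's local density in subspace S with a Gaussian
    kernel density estimator and return the indices of the beta*n points
    with the lowest estimates, i.e. the candidate outliers."""
    n = len(data)

    def proj_dist2(p, q):
        # squared Euclidean distance in the projection onto S
        return sum((p[i] - q[i]) ** 2 for i in S)

    densities = []
    for i, p in enumerate(data):
        # sum of Gaussian kernels centred at every other point
        dens = sum(math.exp(-proj_dist2(p, q) / (2 * bandwidth ** 2))
                   for j, q in enumerate(data) if j != i) / n
        densities.append(dens)
    return sorted(range(n), key=lambda i: densities[i])[:beta_n]
```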

Algorithm 3: SubspaceMining
Updates the set of outliers TopOut with the 2n candidate outliers extracted from a subspace S.

Empirical Results and Analysis
The authors compare HighDOD with DenSamp, HighOut, PODM and LOF. Experiments compare detection accuracy and scalability. The precision-recall trade-off curve is used to evaluate the quality of an unordered set of retrieved items.
Datasets: 4 real datasets from the UCI Repository.

Comparison of Detection Accuracy
(Figure: detection accuracy of HighDOD, DenSamp, HighOut, PODM and LOF.)

Comparison of Scalability
Since PODM and LOF yield unsatisfactory accuracy, they are not included in this experiment. The scalability test uses the CorelHistogram (CH) dataset, whose records lie in a 32-dimensional space.
(Figure: scalability of HighDOD, DenSamp and HighOut.)

Conclusion
– The work proposes a new outlier detection technique that is dimensionality unbiased.
– It extends distance-based anomaly detection to subspace analysis.
– It facilitates the design of ranking-based algorithms.
– It introduces HighDOD, a ranking-based technique for subspace outlier mining.

Precision-Recall Curve
Precision and recall are used to evaluate the quality of an unordered set of retrieved items.
– Recall measures the ability of a system to present all the relevant items.
– Precision measures the ability of a system to present only relevant items.
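Both measures can be computed directly from the retrieved and relevant sets; a minimal sketch (precision_recall is an illustrative helper, not from the paper):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```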

References
[1] Wikipedia: Outlier.
[2] Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng.
[3] Aggarwal, C.C., Yu, P.S.: An effective and efficient algorithm for high-dimensional outlier detection. VLDB Journal.
[4] Ailon, N., Chazelle, B.: Faster dimension reduction. Commun. ACM, 2010.