Outlier Detection & Analysis

Slides:



Advertisements
Similar presentations
Copyright Jiawei Han, modified by Charles Ling for CS411a
Advertisements

Hierarchical Cellular Tree: An Efficient Indexing Scheme for Content-Based Retrieval on Multimedia Databases Serkan Kiranyaz and Moncef Gabbouj.
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates By Yihong Zhao, Prasad M. Desphande and Jeffrey F. Naughton Presented by Kia Hall.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Hierarchical Clustering, DBSCAN The EM Algorithm
PARTITIONAL CLUSTERING
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar Introduction to Data Mining.
FTP Biostatistics II Model parameter estimations: Confronting models with measurements.
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for.
More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
Cluster Analysis.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Evaluating Hypotheses
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Anomaly Detection. Anomaly/Outlier Detection  What are anomalies/outliers? The set of data points that are considerably different than the remainder.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Inferences About Process Quality
Data Mining – Intro.
Birch: An efficient data clustering method for very large databases
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Data Mining Techniques
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu.
Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
Visual Information Systems Recognition and Classification.
© Copyright McGraw-Hill Correlation and Regression CHAPTER 10.
D. M. J. Tax and R. P. W. Duin. Presented by Mihajlo Grbovic Support Vector Data Description.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Lecture 7: Outlier Detection Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Manoranjan.
COMP5331 Outlier Prepared by Raymond Wong Presented by Raymond Wong
K-Means Algorithm Each cluster is represented by the mean value of the objects in the cluster Input: set of objects (n), no of clusters (k) Output:
CURE: EFFICIENT CLUSTERING ALGORITHM FOR LARGE DATASETS VULAVALA VAMSHI PRIYA.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar.
A new clustering tool of Data Mining RAPID MINER.
Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar GNET 713 BCB Module Spring 2007 Wei Wang.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Nearest Neighbour and Clustering. Nearest Neighbour and clustering Clustering and nearest neighbour prediction technique was one of the oldest techniques.
Anomaly Detection Carolina Ruiz Department of Computer Science WPI Slides based on Chapter 10 of “Introduction to Data Mining” textbook by Tan, Steinbach,
Data Mining – Intro.
Hierarchical Clustering: Time and Space requirements
Data Science Algorithms: The Basic Methods
Data Mining: Concepts and Techniques
Topic 3: Cluster Analysis
Classification and Prediction
Outlier Discovery/Anomaly Detection
K Nearest Neighbor Classification
A Linear Method for Deviation Detection in Large databases
Data Mining Anomaly/Outlier Detection
CSCI N317 Computation for Scientific Applications Unit Weka
CSCI N317 Computation for Scientific Applications Unit Weka
MIS2502: Data Analytics Clustering and Segmentation
MIS2502: Data Analytics Clustering and Segmentation
Nearest Neighbors CSC 576: Data Mining.
Topic 5: Cluster Analysis
CSE572: Data Mining by H. Liu
Data Mining Anomaly Detection
Data Mining Anomaly Detection
Data Mining CSCI 307, Spring 2019 Lecture 23
Presentation transcript:

Outlier Detection & Analysis By: Eric Poulin Colin Yu

Outlier - Outline Introduction / Motivation / Definition Statistical-based Detection Distribution-based, depth-based Deviation-based Method Sequential exception, OLAP data cube Distance-based Detection Index-based, nested-loop, cell-based, local-outliers Questions

Introduction Traditional Data Mining Categories Majority of Objects Dependency detection Class identification Class description Exceptions Exception/outlier detection

Motivation for Outlier Analysis Fraud Detection (Credit card, telecommunications, criminal activity in e-Commerce) Customized Marketing (high/low income buying habits) Medical Treatments (unusual responses to various drugs) Analysis of performance statistics (professional athletes) Weather Prediction Financial Applications (loan approval, stock tracking) “One persons noise could be another person’s signal.”

What is an outlier? Observations inconsistent with rest of the dataset – Global Outlier Special outliers – Local Outlier Observations inconsistent with their neighborhoods A local instability or discontinuity

Causes of Outliers Poor data quality / contamination Low quality measurements, malfunctioning equipment, manual error Correct but exceptional data

Outlier Detection Approaches Objective: Define what data can be considered as inconsistent in a given data set Statistical-Based Outlier Detection Deviation-Based Outlier Detection Distance-Based Outlier Detection Find an efficient method to mine the outliers

Why A Special Technique to Identify Outliers? Why not just modify clustering or other algorithms to detect outliers? Performance considerations Subjective to the clustering algorithm and clustering parameters Only certain attributes may have outlier properties, no need to disqualify the entire tuple Contamination may occur by “column”, not by row

Outlier Analysis - Outline Introduction / Motivation / Definition Statistical-based Detection Distribution-based, depth-based Deviation-based Method Sequential exception, OLAP data cube Distance-based Detection Index-based, nested-loop, cell-based, local-outliers Questions

Statistical-Based Outlier Detection (Distribution-based) Assumptions: Knowledge of data (distribution, mean, variance) Statistical discordancy test Data is assumed to be part of a working hypothesis (working hypothesis) Each data object in the dataset is compared to the working hypothesis and is either accepted in the working hypothesis or rejected as discordant into an alternative hypothesis (outliers)

Statistical-Based Outlier Detection (Distribution-based) Assumptions: Knowledge of data (distribution, mean, variance) Statistical discordancy test Data is assumed to be part of a working hypothesis (working hypothesis) Each data object in the dataset is compared to the working hypothesis and is either accepted in the working hypothesis or rejected as discordant into an alternative hypothesis (outliers)

Statistical-Based Outlier detection (Depth-based) Data is organized into layers according to some definition of depth Shallow layers are more likely to contain outliers than deep layers Can efficiently handle computation for k < 4

Statistical-Based Outlier Detection Strengths Most outlier research has been done in this area, many data distributions are known Weakness Almost all of the statistical models are univariate (only handle one attribute) and those that are multivariate only efficiently handle k<4 All models assume the distribution is known –this is not always the case Outlier detection is completely subjective to the distribution used

Outlier Analysis - Outline Introduction / Motivation / Definition Statistical-based Detection Distribution-based, depth-based Deviation-based Method Sequential exception, OLAP data cube Distance-based Detection Index-based, nested-loop, cell-based, local-outliers Questions

Deviation-Based Outlier Detection Simulate a mechanism familiar to human being: after seeing a series of similar data, an element disturbing the series is considered an exception Sequential Exception Techniques OLAP Data Cube Techniques

Sequential Exception Select subsets of data Ij (j=1,2,…,n) from the dataset I Compare the dissimilarity of I and (I-Ij) Find out the minimum subset Ij that reduce the disimuliarity the most Smoothing factor D is a dissimilarity function C is a cardinality function, for example, the number of elements in the dataset

Example Let the data set I be the set of integer values {1,4,4,4} Ij I- Ij C(I- Ij) D(I- Ij) SF(Ij) {} {1,4,4,4} 4 1.69 0.00 {4} {1,4,4} 3 2.00 -0.93 {4,4} {1,4} 2 2.25 -1.12 {4,4,4} {1} 1 5.07 3.38 Note, when Ij = {}, D(I) = D(I-Ij) = 1.69, SF(Ij)=0 When Ij={1}, SF(Ij) has the maximum value, so {1} is the outlier set

OLAP Data Cube Technique Deviation detection process is overlapped with cube computation Precomputed measures indicating data exceptions are needed A cell value is considered an exception if it is significantly different from the expected value, based on a statistical model Use visual cues such as background color to reflect the degree of exception

Outlier Analysis - Outline Introduction / Motivation / Definition Statistical-based Detection Distribution-based, depth-based Deviation-based Method Sequential exception, OLAP data cube Distance-based Detection Index-based, nested-loop, cell-based, local-outliers Questions

Distance-Based Outlier Detection Distance-based: An object O in a dataset T is a DB(p,D) outier if at least fraction p of the objects in T are >= distance D from O A point O in a dataset is an outlier with respect to parameters k and d if no more than k points in the dataset are at a distance of d or less from O. Relative measurement: Let Dk(O) denote the distance of the kth nearest neighbor of O. It is a measure of how much of an outlier point O is.

Index-based Algorithm [KN98] Indexing Structures such as R-tree (R+-tree), K-D (K-D-B) tree are built for the multi-dimensional database The index is used to search for neighbors of each object O within radius D around that object. Once K (K = N(1-p)) neighbors of object O are found, O is not an outlier. Worst-case computation complexity is O(K*n2), K is the dimensionality and n is the number of objects in the dataset. Pros: scale well with K Cons: the index construction process may cost much time

Nested-loop Algorithm [KN98] Divides the buffer space into two halves (first and second arrays) Break data into blocks and then feed two blocks into the arrays. Directly computes the distance between each pair of objects, inside the array or between arrays Decide the outlier. Here comes an example:… Same computational complexity as the index-based algorithm Pros: Avoid index structure construction Try to minimize the I/Os

Example – stage 1 Buffer DB A is the target block on stage 1 Load A into the first array (1R) Load B into the second array (1R) Load C into the second array (1R) Load D into the second array (1R) Total: 4 Reads A B A B C D Starting Point of Stage 1 A D A B C D End Point of Stage 1

Example Example – stage 2 Buffer DB D is the target block on stage 2 D is already in the buffer (no R) A is already in the buffer (no R) Load B into the first array (1R) Load C into the first array (1R) Total: 2 Reads A D A B C D Starting Point of Stage 2 C D A B C D End Point of Stage 2

Example – stage 3 Buffer DB C is the target block on stage 3 C is already in the buffer (no R) D is already in the buffer (no R) Load A into the second array (1R) Load B into the second array (1R) Total: 2 Reads C D A B C D Starting Point of Stage 3 C B A B C D End Point of Stage 3

Example Example – stage 4 Buffer DB B is the target block on stage 4 B is already in the buffer (no R) C is already in the buffer (no R) Load A into the first array (1R) Load D into the first array (1R) Total: 2 Reads Every block is ¼ of the DB. From stage 1-4, a grand total of 10 blocks are read, amounting to 10/4 passes over the entire dataset. C B A B C D Starting Point of Stage 4 D B A B C D End Point of Stage 4

Cell-Based Algorithm [KN98] Divide the dataset into cells with length K is the dimensionality, D is the distance Define Layer-1 neighbors – all the intermediate neighbor cells. The maximum distance between a cell and its neighbor cells is D Define Layer-2 neighbors – the cells within 3 cell of a certain cell. The minimum distance between a cell and the cells outside of Layer-2 neighbors is D Criteria Search a cell internally. If there are M objects inside, all the objects in this cell are not outlier Search its layer-1 neighbors. If there are M objects inside a cell and its layer-1 neighbors, all the objects in this cell are not outlier Search its layer-2 neighbors. If there are less than M objects inside a cell, its layer-1 neighbor cells, and its layer-2 neighbor cells, all the objects in this cell are outlier Otherwise, the objects in this cell could be outlier, and then need to calculate the distance between the objects in this cell and the objects in the cells in the layer-2 neighbor cells to see whether the total points within D distance is more than M or not. An example

Example Red – A certain cell Yellow – Layer-1 Neighbor Cells Blue – Layer-2 Neighbor Cells Notes: The maximum distance between a point in the red cell and a point In its layer-1 neighbor cells is D The minimum distance between A point in the red cell and a point outside its layer-2 neighbor cells is D

Distance-Based Outlier Detection (Local Outliers) Some outliers can be defined as global outliers, some can be defined as local outliers to a given cluster O2 would not normally be considered an outlier with regular distance-based outlier detection, since it looks at the global picture

Distance-Based Outlier Detection (Local Outliers) Each data object is assigned a local outlier factor (LOF) Objects which are closer to dense clusters receive a higher LOF LOF varies according to the parameter MinPts

Distance-Based Outlier Detection (Local Outliers)

Distance-Based Outlier Detection (Partition-based) Partition-based detection Use BIRCH clustering to identify clusters/partitions of non-outliers Prune partitions that do not contain outliers Use Index/Nested Loop algorithms on the remaining data points Since many data point are removed during pruning, the efficiency is increased significantly.

Outlier Analysis - Outline Introduction / Motivation / Definition Statistical-based Detection Distribution-based, depth-based Deviation-based Method Sequential exception, OLAP data cube Distance-based Detection Index-based, nested-loop, cell-based, local-outliers Questions