Outlier Discovery/Anomaly Detection

Presentation transcript:

Outlier Discovery/Anomaly Detection November 15, 2018 Data Mining: Concepts and Techniques

Anomaly/Outlier Detection
What are anomalies/outliers?
- The set of data points that are considerably different than the remainder of the data
Variants of anomaly/outlier detection problems:
- Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
- Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
- Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
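The first two problem variants differ only in how the anomaly scores f(x) are turned into a decision. A minimal Python sketch, assuming the scores have already been computed by some detector (the function names and the score values are made up for illustration):

```python
def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n largest anomaly scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Hypothetical anomaly scores f(x) for five data points.
scores = [0.1, 0.9, 0.3, 2.5, 0.2]
flagged = above_threshold(scores, 0.8)   # points 1 and 3 exceed the threshold
ranked = top_n(scores, 2)                # point 3 first, then point 1
```

The threshold variant needs a meaningful scale for the scores, while the top-n variant only needs their ranking.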

Applications
- Credit card fraud detection
- Telecommunication fraud detection
- Network intrusion detection
- Fault detection
- Many more

Anomaly Detection Challenges
- How many outliers are there in the data?
- Method is unsupervised
  - Validation can be quite challenging (just like for clustering)
- Finding a needle in a haystack
- Working assumption: there are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes
General steps:
- Build a profile of the “normal” behavior
  - The profile can be patterns or summary statistics for the overall population
- Use the “normal” profile to detect anomalies
  - Anomalies are observations whose characteristics differ significantly from the normal profile
Types of anomaly detection schemes:
- Graphical and statistical-based
- Distance-based
- Model-based

Graphical Approaches
- Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
- Limitations
  - Time consuming
  - Subjective
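A boxplot is usually read by eye, but the points it draws beyond the whiskers can also be listed programmatically. The sketch below assumes the common Tukey convention of placing the whiskers 1.5 interquartile ranges beyond the quartiles; that convention and the sample data are assumptions, not something the slides specify.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values a boxplot would draw beyond its whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Made-up 1-D sample with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 40]
extreme = boxplot_outliers(data)  # only 40 lies outside the whiskers
```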

Convex Hull Method
- Extreme points are assumed to be outliers
- Use the convex hull method to detect extreme values
- What if the outlier occurs in the middle of the data?
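The limitation raised in the last bullet can be seen in a small 2-D sketch. It uses Andrew's monotone chain algorithm to compute the hull (one of several possible hull algorithms, chosen here only because it is short); the data set is invented, with two tight groups and one isolated point between them.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# Two tight groups plus one isolated point halfway between them.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5.5, 5.5)]
hull = convex_hull(points)
isolated_is_extreme = (5.5, 5.5) in hull  # False: the hull never reports it
```

The isolated point lies strictly inside the hull, so a method that flags only hull vertices cannot report it.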

Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)

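Grubbs’ Test
- Detects outliers in univariate data
- Assumes the data come from a normal distribution
- Detects one outlier at a time: remove the outlier and repeat
  - H0: There is no outlier in the data
  - HA: There is at least one outlier
- Grubbs’ test statistic: $G = \max_i |X_i - \bar{X}| / s$, where $\bar{X}$ is the sample mean and $s$ is the sample standard deviation
- Reject H0 (two-sided test at significance level $\alpha$) if: $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$, where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t-distribution with N − 2 degrees of freedom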
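A minimal sketch of one round of Grubbs' test in Python, using the standard form of the statistic (largest absolute deviation from the mean, divided by the sample standard deviation). The standard library has no t-distribution quantile function, so the sketch assumes the caller looks up the critical t value (upper α/(2N) point with N − 2 degrees of freedom for the two-sided test) and passes it in as `t_crit`; the function names and the sample are illustrative.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index) for the point farthest from the sample mean."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx


def grubbs_critical(n, t_crit):
    """Rejection bound for G, given a t critical value supplied by the caller."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


# Made-up measurements; the last one looks suspicious.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.0]
G, suspect = grubbs_statistic(sample)
# H0 is rejected when G > grubbs_critical(len(sample), t_crit) for a t_crit
# taken from a t table; the suspect is then removed and the test repeated.
```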

Statistical-based – Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
- General approach:
  - Initially, assume all the data points belong to M
  - Let Lt(D) be the log likelihood of D at time t
  - For each point xt that belongs to M, move it to A
    - Let Lt+1(D) be the new log likelihood
    - Compute the difference Δ = Lt+1(D) – Lt(D), the gain in log likelihood from the move
    - If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A; otherwise it is returned to M

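Statistical-based – Likelihood Approach
- Data distribution: D = (1 – λ) M + λ A
- M is a probability distribution estimated from data
  - Can be based on any modeling method
- A is initially assumed to be a uniform distribution
- Likelihood at time t, with M_t and A_t the sets of points currently assigned to M and A:
  $\prod_{i=1}^{N} P_D(x_i) = \Big( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \Big) \Big( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \Big)$
- Taking logarithms gives the log likelihood:
  $L_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$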
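The procedure can be sketched in Python for 1-D data. Several choices here are assumptions made only for illustration: M is modeled as a Gaussian re-estimated from its current members, A is uniform over the observed data range, and the values of λ and c are arbitrary. The sketch flags a point when moving it from M to A raises the log likelihood by more than c, and it assumes the data are not all identical.

```python
import math


def log_likelihood(M, A, lam, log_p_a):
    """Mixture log likelihood with M ~ Gaussian (fitted to M) and A ~ uniform."""
    mu = sum(M) / len(M)
    var = max(sum((x - mu) ** 2 for x in M) / len(M), 1e-12)  # guard against zero
    ll_m = sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in M)
    ll = len(M) * math.log(1 - lam) + ll_m
    if A:
        ll += len(A) * (math.log(lam) + log_p_a)
    return ll


def likelihood_anomalies(data, lam=0.1, c=2.0):
    """Move points from M to A whenever doing so gains more than c in log likelihood."""
    log_p_a = -math.log(max(data) - min(data))  # uniform density over the data range
    M, A = list(data), []
    for x in data:
        if len(M) <= 2:  # keep enough points to estimate M
            break
        ll_t = log_likelihood(M, A, lam, log_p_a)
        M_trial = list(M)
        M_trial.remove(x)
        ll_t1 = log_likelihood(M_trial, A + [x], lam, log_p_a)
        if ll_t1 - ll_t > c:
            M, A = M_trial, A + [x]  # permanent move; otherwise x stays in M
    return A


data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.2, 0.8, 8.0]
anomalies = likelihood_anomalies(data)  # only 8.0 is moved to A
```

Removing 8.0 lets the Gaussian fit the remaining points far more tightly, which is what produces the large gain; removing any of the ordinary points lowers the log likelihood instead.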

Limitations of Statistical Approaches
- Most of the tests are for a single attribute
- In many cases, the data distribution may not be known
- For multi-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
- Data is represented as a vector of features
- Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest-Neighbor Based Approach
- Compute the distance between every pair of data points
- There are various ways to define outliers:
  - Data points for which there are fewer than p neighboring points within a distance D
  - The top n data points whose distance to the kth nearest neighbor is greatest
  - The top n data points whose average distance to the k nearest neighbors is greatest
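The second definition (top n points by distance to the kth nearest neighbor) can be sketched with a brute-force pairwise distance computation. The function names and the toy data are illustrative, and Euclidean distance is an assumption; the quadratic cost is acceptable only for small data sets.

```python
import math


def kth_nn_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbor distance."""
    scores = kth_nn_distances(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
outliers = top_n_outliers(points, k=2, n=1)  # the point (5, 5), index 4
```

Replacing `dists[k - 1]` with the mean of `dists[:k]` gives the third definition (average distance to the k nearest neighbors).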

Density-based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of p's nearest neighbors to the density of p
- Outliers are points with the largest LOF values
[Figure: points p1 and p2]
- In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers

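LOF
The local outlier factor (LOF) is defined as follows:
  $LOF_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}$
where N_k(p) is the set of k-nearest neighbors of p and
  $lrd_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}$, with $\text{reach-dist}_k(p, o) = \max\{k\text{-distance}(o),\, d(p, o)\}$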
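A brute-force Python sketch of LOF, following the standard definition in terms of local reachability density (lrd): the LOF of p is the average lrd of its k nearest neighbors divided by the lrd of p. It assumes Euclidean distance, no duplicate points (which would give a zero reachability sum), and exactly k neighbors per point with ties broken by index; the data set is made up.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, computed by brute force."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each point, excluding the point itself
    neighbors = [sorted((j for j in range(n) if j != i),
                        key=lambda j: dist[i][j])[:k] for i in range(n)]
    k_dist = [dist[i][neighbors[i][-1]] for i in range(n)]  # k-distance

    def lrd(i):
        # reach-dist(i, o) = max(k-distance(o), d(i, o))
        reach = [max(k_dist[o], dist[i][o]) for o in neighbors[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    return [sum(lrds[o] for o in neighbors[i]) / (len(neighbors[i]) * lrds[i])
            for i in range(n)]


# A 3 x 3 unit grid plus one distant point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
most_anomalous = scores.index(max(scores))  # index 9, the distant point
```

Points inside a uniformly dense region get scores close to 1, while a point whose neighbors are much denser than it is gets a score well above 1.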

Clustering-Based
Basic idea:
- Cluster the data into groups of different density
- Choose points in small clusters as candidate outliers
- Compute the distance between candidate points and non-candidate clusters
  - If candidate points are far from all other non-candidate points, they are outliers
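A minimal sketch of the candidate-and-check step, assuming the clustering has already been done by some algorithm and is supplied as one label per point. The parameters `min_size` (what counts as a small cluster) and `min_dist` (what counts as far) are hypothetical knobs introduced for illustration, as are the data and labels.

```python
import math
from collections import Counter


def cluster_based_outliers(points, labels, min_size, min_dist):
    """Flag points in small clusters that are far from every non-candidate point."""
    sizes = Counter(labels)
    candidates = [i for i, l in enumerate(labels) if sizes[l] < min_size]
    others = [points[i] for i, l in enumerate(labels) if sizes[l] >= min_size]
    # Note: if no large cluster exists, every candidate is reported.
    return [i for i in candidates
            if all(math.dist(points[i], q) > min_dist for q in others)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (1.5, 1.5), (9, 9)]
labels = [0, 0, 0, 0, 1, 2]  # assumed output of a clustering step
outliers = cluster_based_outliers(points, labels, min_size=3, min_dist=2.0)
```

Here both singleton clusters are candidates, but only the point at (9, 9) is far from the large cluster; the point at (1.5, 1.5) sits next to it and is not reported.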

Outliers in Lower Dimensional Projection
- Divide each attribute into φ equal-depth intervals
  - Each interval contains a fraction f = 1/φ of the records
- Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  - If attributes are independent, we expect the region to contain a fraction f^k of the records
  - If there are N points, we can measure the sparsity of a cube D as:
    $S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}$, where n(D) is the number of points in D
  - Negative sparsity indicates that the cube contains fewer points than expected
- To detect the sparse cells, you have to consider all cells, and their number is exponential in the number of dimensions d. Heuristics can be used to find them.

Example
N = 100, φ = 5, f = 1/5 = 0.2, N × f² = 4
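The numbers in this example can be checked with a short Python sketch. It assumes the usual sparsity coefficient for grid cells, (observed count minus expected count) divided by the standard deviation of the count under independence; the function name and the observed counts of 0 and 4 are illustrative.

```python
import math


def sparsity(n_in_cube, N, phi, k):
    """Sparsity coefficient of a k-dimensional cube holding n_in_cube of N points."""
    f = 1 / phi                      # fraction of records per interval
    expected = N * f ** k            # expected count if attributes are independent
    return (n_in_cube - expected) / math.sqrt(expected * (1 - f ** k))


# Setting from the example: N = 100, phi = 5, 2-D cubes, so 4 points are expected.
expected_count = 100 * (1 / 5) ** 2
empty_cell = sparsity(0, N=100, phi=5, k=2)    # clearly negative: sparser than expected
typical_cell = sparsity(4, N=100, phi=5, k=2)  # essentially zero
```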