Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

Slides:



Advertisements
Similar presentations
Applications of one-class classification
Advertisements

Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Descriptive Measures MARE 250 Dr. Jason Turner.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar Introduction to Data Mining.
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
© Tan,Steinbach, Kumar Introduction to Data Mining 1/17/ Data Mining Cluster Analysis: Advanced Concepts and Algorithms Figures for Chapter 9 Introduction.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
© Tan,Steinbach, Kumar Introduction to Data Mining 1/17/ Data Mining Cluster Analysis: Basic Concepts and Algorithms Figures for Chapter 8 Introduction.
© Tan,Steinbach, Kumar Introduction to Data Mining 1/17/ Data Mining Anomaly Detection Figures for Chapter 10 Introduction to Data Mining by Tan,
BCOR 1020 Business Statistics
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Anomaly Detection brief review of my prospectus Ziba Rostamian CS590 – Winter 2008.
© Tan,Steinbach, Kumar Introduction to Data Mining 1/17/ Data Mining Classification: Alternative Techniques Figures for Chapter 5 Introduction to.
Anomaly Detection. Anomaly/Outlier Detection  What are anomalies/outliers? The set of data points that are considerably different than the remainder.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
© Tan,Steinbach, Kumar Introduction to Data Mining 1/17/ Data Mining: Exploring Data Figures for Chapter 3 Introduction to Data Mining by Tan, Steinbach,
Inferences About Process Quality
© Tan,Steinbach, Kumar Introduction to Data Mining 1/17/ Data Mining Association Analysis: Advanced Concepts Figures for Chapter 7 Introduction to.
Jeff Howbert Introduction to Machine Learning Winter Anomaly Detection Some slides taken or adapted from: “Anomaly Detection: A Tutorial” Arindam.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Intrusion Detection Jie Lin. Outline Introduction A Frame for Intrusion Detection System Intrusion Detection Techniques Ideas for Improving Intrusion.
Anomaly Detection Presented by: Anupam Das CS 568MCC Spring 2013
Outlier Detection Using k-Nearest Neighbour Graph Ville Hautamäki, Ismo Kärkkäinen and Pasi Fränti Department of Computer Science University of Joensuu,
Outlier Detection Lian Duan Management Sciences, UIOWA.
N. GagunashviliRAVEN Workshop Heidelberg Nikolai Gagunashvili (University of Akureyri, Iceland) Data mining methods in RAVEN network.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
CLUSTER ANALYSIS Introduction to Clustering Major Clustering Methods.
Part II Tools for Knowledge Discovery Ch 5. Knowledge Discovery in Databases Ch 6. The Data Warehouse Ch 7. Formal Evaluation Technique.
Lecture 7: Outlier Detection Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
COMP5331 Outlier Prepared by Raymond Wong Presented by Raymond Wong
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar.
Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.
Chapter 20 Classification and Estimation Classification – Feature selection Good feature have four characteristics: –Discrimination. Features.
Chapter 13 (Prototype Methods and Nearest-Neighbors )
Chapter 6: Interpreting the Measures of Variability.
Variability Introduction to Statistics Chapter 4 Jan 22, 2009 Class #4.
Ch. Eick: Some Ideas for Task4 Project2 Ideas on Creating Summaries that Characterize Clustering Results Focus: Primary Focus Cluster Summarization (what.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
Anomaly Detection Carolina Ruiz Department of Computer Science WPI Slides based on Chapter 10 of “Introduction to Data Mining” textbook by Tan, Steinbach,
1 CSE 881: Data Mining Lecture 22: Anomaly Detection.
Anomaly Detection Nathan Dautenhahn CS 598 Class Lecture March 3, 2011.
BAE 6520 Applied Environmental Statistics
BAE 5333 Applied Water Resources Statistics
Ch8: Nonparametric Methods
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Lecture Notes for Chapter 9 Introduction to Data Mining, 2nd Edition
Data Mining Classification: Alternative Techniques
Data Mining Anomaly Detection
Outlier Discovery/Anomaly Detection
Data Mining Anomaly/Outlier Detection
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Lecture 14: Anomaly Detection
Day 52 – Box-and-Whisker.
CSE572: Data Mining by H. Liu
Data Mining Anomaly Detection
Data Mining Anomaly/Outlier Detection
Data Mining Anomaly Detection
Presentation transcript:

Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Anomaly/Outlier Detection l What are anomalies/outliers? –The set of data points that are considerably different than the remainder of the data l Variants of Anomaly/Outlier Detection Problems –Given a database D, find all the data points x  D with anomaly scores greater than some threshold t –Given a database D, find all the data points x  D having the top- n largest anomaly scores f(x) –Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D l Applications: –Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Importance of Anomaly Detection Ozone Depletion History l In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels l Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? l The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources:

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Anomaly Detection l Challenges –How many outliers are there in the data? –Method is unsupervised  Validation can be quite challenging (just like for clustering) –Finding needle in a haystack l Working assumption: –There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Anomaly Detection Schemes l General Steps –Build a profile of the “normal” behavior  Profile can be patterns or summary statistics for the overall population –Use the “normal” profile to detect anomalies  Anomalies are observations whose characteristics differ significantly from the normal profile l Types of anomaly detection schemes –Graphical –Model-based –Distance-based –Clustering-based

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Graphical Approaches l Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D) l Limitations –Time consuming –Subjective e.g.: data are outliers if the more than 1.5 IQR away from the box!

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Convex Hull Method l Extreme points are assumed to be outliers l Use convex hull method to detect extreme values l l Approach: Fit a polygon to a point set; outliers are determined by their distance to the polygon boundary

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Statistical Approaches---Model-based l Fit a parametric model M describing the distribution of the data (e.g., normal distribution) l Assess the probability of each point M l The lower a point’s probability the more likely the point is an outlier. l Outlier Detection Approach: –Sort points by the probability –Determine Outliers  based on a probability threshold  take the bottom x percent as outliers.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Grubbs’ Test if Dataset Contains Outliers l Detect outliers in univariate data l Assume data comes from normal distribution l Detects one outlier at a time, remove the outlier, and repeat –H 0 : There is no outlier in data –H A : There is at least one outlier l Grubbs’ test statistic: l Reject H 0 if:

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Limitations of Statistical Approaches l Most of the tests are for a single attribute l In many cases, data distribution/model may not be known l For high dimensional data, it may be difficult to estimate the true distribution, but there is new work based on EM and mixtures of Gaussians

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Distance-based Approaches l Data is represented as a vector of features l Three major approaches –Nearest-neighbor based –Density based –Clustering based

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Nearest-Neighbor Based Approach l Approach: –Compute the distance between every pair of data points –There are various ways to define outliers:  Data points for which there are fewer than p neighboring points within a distance D  The top n data points whose distance to the kth nearest neighbor is greatest  The top n data points whose average distance to the k nearest neighbors is greatest

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Density-based: LOF approach l For each point, compute the density of its local neighborhood; e.g. use DBSCAN’s approach l Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors l Outliers are points with lowest LOF value p 2  p 1  In the NN approach, p 2 is not considered as outlier, while LOF approach find both p 1 and p 2 as outliers Alternative approach: directly use density function; e.g. DENCLUE’s density function

© Tan,Steinbach, Kumar Introduction to Data Mining 4/8/ Clustering-Based l Idea: Use a clustering algorithm that has some notion of outliers! l Problem what parameters should I choose for the algorithm; e.g. DBSCAN? l Rule of Thumb: Less than x% of the data should be outliers (with x typically chosen between 0.1 and 10); x might be determined with other methods; e.g. statistical tests or domain knowledge. l Once you have an idea, how much outliers you want, the parameter selection problem becomes easier.