
Anomaly Detection Carolina Ruiz Department of Computer Science WPI Slides based on Chapter 10 of “Introduction to Data Mining” textbook by Tan, Steinbach, Kumar (all figures and some slides taken from this chapter)

Class Discussion Points
• What's an anomaly (or outlier)?
• Give an example of a situation in which an anomaly should be removed during pre-processing of the dataset, and another example of a situation in which an anomaly is an interesting data instance worth keeping and/or studying in more detail.
• Define each of the following approaches to anomaly detection, and describe the differences between each pair:
  – Model-based, Proximity-based, and Density-based techniques.
• Can visualization be used to detect outliers? If so, how?
  – Give specific examples of visualization techniques that can be used for anomaly detection.
  – For each one, explain whether or not the visualization technique can be considered a Model-based (which includes Statistical), Proximity-based, or Density-based technique for anomaly detection.

Class Discussion Points (cont.)
• Define each of the following modes of anomaly detection, and describe the differences between each pair: supervised, unsupervised, and semi-supervised.
• Consider the case of a dataset that has labels identifying the anomalies, where the task is to learn how to detect similar anomalies in unlabeled data.
  – Is that supervised or unsupervised anomaly detection? Explain.
• Consider the case of a dataset that doesn't have labels identifying the anomalies, where the task is to find how to assign a sound anomaly score, f(x), to each instance x in the dataset.
  – Is that supervised or unsupervised anomaly detection? Explain.
• Precision, recall, and false positive rate are mentioned in the textbook as appropriate metrics to evaluate anomaly detection algorithms.
  – What are those metrics and how can they be used to evaluate anomaly detection?

© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/
Limitation of Accuracy
• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example
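The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the class counts come from the slide, while the variable names and the "always predict class 0" model are written out here for illustration.

```python
# Class counts taken from the slide: 9990 examples of class 0, 10 of class 1.
actual = [0] * 9990 + [1] * 10
predicted = [0] * len(actual)  # trivial model: always predict class 0

# Accuracy: fraction of examples whose predicted label matches the actual one.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Fraction of the true class 1 examples that the model detected.
detected = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall_class1 = detected / actual.count(1)

print(f"accuracy = {accuracy:.3f}, class 1 recall = {recall_class1:.1f}")
```

The accuracy looks excellent even though not a single class 1 example is found, which is why the metrics on the following slide are needed.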

Accuracy vs. Precision and Recall

Count                    PREDICTED CLASS
                         Class=Yes    Class=No
ACTUAL     Class=Yes     a            b
CLASS      Class=No      c            d

N = a + b + c + d
Accuracy = (a + d)/N
Precision = a/(a + c)
Recall = a/(a + b)
False Positive Rate = c/(c + d)
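As a rough illustration, the metrics can be computed directly from the four cells of the confusion matrix. The function name and the example counts below are hypothetical. Precision and recall are assumed to follow their usual definitions, a/(a+c) and a/(a+b), with "Yes" taken as the anomaly class.

```python
def confusion_metrics(a, b, c, d):
    """Metrics from a 2x2 confusion matrix.

    a: actual Yes, predicted Yes    b: actual Yes, predicted No
    c: actual No,  predicted Yes    d: actual No,  predicted No
    """
    def safe_div(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    n = a + b + c + d
    return {
        "accuracy": safe_div(a + d, n),
        "precision": safe_div(a, a + c),
        "recall": safe_div(a, a + b),
        "false_positive_rate": safe_div(c, c + d),
    }

# Hypothetical counts: 10 true anomalies among 10000 instances.
metrics = confusion_metrics(a=8, b=2, c=40, d=9950)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts the accuracy stays above 99%, while the precision shows that most flagged instances are false alarms.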

Anomaly/Outlier Detection
• What are anomalies/outliers?
  – The set of data points that are considerably different from the remainder of the data
• Variants of Anomaly/Outlier Detection Problems
  – Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  – Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  – Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications:
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
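The first two problem variants differ only in how the anomaly scores are cut off. The sketch below assumes the scores f(x) have already been computed by some method and are stored in a dictionary. The function names, point identifiers, and score values are hypothetical.

```python
def above_threshold(scores, t):
    """Variant 1: all points whose anomaly score exceeds the threshold t."""
    return [x for x, s in scores.items() if s > t]

def top_n(scores, n):
    """Variant 2: the n points with the largest anomaly scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

# Hypothetical anomaly scores f(x) for four data points.
scores = {"p1": 0.2, "p2": 3.5, "p3": 0.9, "p4": 7.1}

print(above_threshold(scores, t=1.0))  # points scoring above 1.0
print(top_n(scores, n=1))              # the single highest-scoring point
```

The threshold variant can return any number of points, including none, whereas the top-n variant always returns exactly n points (when at least n exist) whether or not they are truly anomalous.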

Importance of Anomaly Detection: Ozone Depletion History
• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Anomaly Detection
• Challenges
  – How many outliers are there in the data?
  – Method is unsupervised
      – Validation can be quite challenging (just like for clustering)
  – Finding a needle in a haystack
• Working assumption:
  – There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes
• General Steps
  – Build a profile of the “normal” behavior
      – Profile can be patterns or summary statistics for the overall population
  – Use the “normal” profile to detect anomalies
      – Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  – Graphical & Statistical-based
  – Distance-based
  – Model-based

Graphical Approaches
• Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
• Limitations
  – Time consuming
  – Subjective
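A boxplot is usually read by eye, but the rule it conventionally draws can also be applied numerically. The sketch below assumes the common 1.5 × IQR whisker convention, which the slide itself does not state. The function name and data are hypothetical.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Values a conventional boxplot would draw beyond its whiskers."""
    # Quartiles of the data; the interquartile range is Q3 - Q1.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_outliers(data))
```

This removes some of the subjectivity noted above for one-dimensional data, although the whisker length remains a choice rather than a law.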

Anomaly Detection: General Approach
For each of the anomaly detection approaches (statistical-based, proximity-based, density-based, and clustering-based) do:
1. State the definition(s) of outlier used by the approach.
2. How can this definition be used to assign an anomaly score to each data instance?
3. How does this anomaly detection approach work in general? Give an example to illustrate your description.

Anomaly Detection: Statistical Approach
Definition of Outlier:
  – Probabilistic definition of outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
Anomaly score function:
  – Given a data instance x from a dataset D, f(x) = 1/P(x|D)
How does the approach work? (in general)
1. Calculate the anomaly score, f(x), for each data point in the dataset.
2. Use a threshold t on this score to determine outliers. That is, x is an outlier iff f(x) > t.
  – To figure out a good value for the threshold, one can reuse the idea from clustering: sort all data points according to their score value, and then find a good "elbow" in that plot. See example on next slide.
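The two steps above can be sketched as follows. This assumes, purely for illustration, that the data are one-dimensional and that a Gaussian fitted to the dataset is used as the probability model, with its density standing in for P(x|D). The data values and the threshold are hypothetical.

```python
from statistics import NormalDist

def anomaly_scores(data):
    """f(x) = 1 / P(x | D), with P taken as a Gaussian density fitted to D."""
    model = NormalDist.from_samples(data)  # assumed model of "normal" behavior
    # Note: for very extreme x the density can underflow to 0.0,
    # in which case this division would fail.
    return [1.0 / model.pdf(x) for x in data]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.0]
scores = anomaly_scores(data)

t = 20.0  # hypothetical threshold on the anomaly score
outliers = [x for x, s in zip(data, scores) if s > t]
print(outliers)
```

Because the model is fitted to data that still contain the outlier, the outlier inflates the estimated standard deviation. More robust estimates are possible but are outside the scope of this sketch.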

Finding a good value for the threshold
[Figure: anomaly score f(x) on the vertical axis; data instances sorted in increasing order of f(x) on the horizontal axis]
1. What would be a natural choice for the value of this threshold t?
2. Assume that we want to classify 20% of the dataset instances as anomalies. In this case, what threshold value would you pick based on the plot above?
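For the second question, the threshold can be read off the sorted scores rather than the plot. The sketch below is one simple way to do that. The function name and the score values are hypothetical, and it assumes a non-empty list of scores.

```python
def threshold_for_fraction(scores, fraction):
    """Threshold t such that roughly `fraction` of the scores satisfy f(x) > t."""
    ordered = sorted(scores)
    n_out = int(fraction * len(ordered))  # number of points to flag
    if n_out <= 0:
        return ordered[-1]                # nothing exceeds the maximum score
    if n_out >= len(ordered):
        return float("-inf")              # every point exceeds the threshold
    # Largest score among the points that should NOT be flagged.
    return ordered[len(ordered) - n_out - 1]

# Hypothetical anomaly scores for ten data instances.
scores = [0.5, 0.7, 0.6, 0.9, 0.8, 1.0, 1.1, 1.2, 6.0, 9.0]
t = threshold_for_fraction(scores, 0.20)
flagged = [s for s in scores if s > t]
print(t, flagged)
```

If several instances tie at the threshold value, fewer than the requested fraction are flagged, since the rule f(x) > t is strict.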

Anomaly Detection: Statistical Approach
Example: If the data follow a normal (Gaussian) distribution:
• Outliers are those in the right or left tail of the distribution.
• Remember that for normal distributions, z_N is a constant that tells how many standard deviations from the mean in both directions (i.e., mean ± z_N * sigma) contain N% of the area under the curve. z_N can be found in statistical tables.
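Instead of a statistical table, z_N can be obtained from the inverse CDF of the standard normal distribution. The sketch below does this and then flags the points falling outside mean ± z_N * sigma. The function names and data are hypothetical, and the mean and sigma are simply estimated from the sample.

```python
from statistics import NormalDist

def z_value(n_percent):
    """z_N such that mean +/- z_N * sigma covers N% of a normal distribution."""
    # The central N% leaves (100 - N)/2 percent in each tail.
    return NormalDist().inv_cdf(0.5 + n_percent / 200.0)

def gaussian_outliers(data, n_percent=95.0):
    """Points in the left or right tail, outside the central N% band."""
    model = NormalDist.from_samples(data)
    z = z_value(n_percent)
    low = model.mean - z * model.stdev
    high = model.mean + z * model.stdev
    return [x for x in data if x < low or x > high]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.0]
print(round(z_value(95), 2))
print(gaussian_outliers(data, n_percent=95))
```

Choosing N is the same decision as choosing the threshold t in the general approach: a larger N gives a wider band and fewer flagged points.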

Anomaly Detection: Proximity Approach
Definition of Outlier:
  – Proximity-based definition of outlier using distance to k-nearest neighbor
Anomaly score function:
  – Given a data instance x from a dataset D and a value k, alternate definitions:
      – f(x) = Distance between x and its k-th nearest neighbor
      – f(x) = Average distance between x and its k nearest neighbors
How does the approach work? (in general):
1. Calculate the anomaly score, f(x), for each data point in the dataset.
2. Use a threshold t on this score to determine outliers. That is, x is an outlier iff f(x) > t.
  – To figure out a good value for k, one can repeat the same idea used in clustering: run experiments with different values of k.
  – To figure out a good value for the threshold, one can repeat the same idea used in clustering: sort all data points according to their score value, and then find a good "elbow" in that plot.
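Both score definitions can be sketched with a brute-force neighbor search. The code below assumes Euclidean distance and k smaller than the number of points. The function name and the sample points are hypothetical.

```python
import math

def knn_scores(points, k, average=False):
    """Anomaly score per point: distance to its k-th nearest neighbor,
    or the average distance to its k nearest neighbors if average=True."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / k if average else nearest[-1])
    return scores

# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kth = knn_scores(points, k=2)
avg = knn_scores(points, k=2, average=True)
print(kth)
print(avg)
```

The brute-force search compares every pair of points, so its cost grows quadratically with the dataset size. That is acceptable for a small example but not for large datasets.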

Anomaly Detection: Proximity Approach Examples: Next 4 slides

Anomaly Detection: Density Approach

Anomaly score function:
  – Given a data instance x from a dataset D, f(x) = 1/density(x,k), or f(x) = 1/avg_rel_density(x,k)
How does the approach work? (in general):
1. Calculate the anomaly score, f(x), for each data point in the dataset.
2. Use a threshold t on this score to determine outliers. That is, x is an outlier iff f(x) > t.
  – Same comments on how to determine good values for k and the threshold as discussed above.
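The slide does not spell out density(x,k) or avg_rel_density(x,k). The sketch below assumes the common textbook-style definitions: density is the inverse of the average distance to the k nearest neighbors, and the average relative density is the point's density divided by the mean density of those neighbors. The function names and data are hypothetical, and duplicate points (zero distances) are not handled.

```python
import math

def k_nearest(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of points[i]."""
    others = [(math.dist(points[i], q), j) for j, q in enumerate(points) if j != i]
    return sorted(others)[:k]

def density(points, i, k):
    """Assumed definition: inverse of the average distance to the k nearest neighbors."""
    neighbors = k_nearest(points, i, k)
    return len(neighbors) / sum(d for d, _ in neighbors)

def avg_rel_density_score(points, i, k):
    """f(x) = 1 / avg_rel_density(x, k)."""
    neighbors = k_nearest(points, i, k)
    # Mean density of the neighbors of points[i].
    neighbor_density = sum(density(points, j, k) for _, j in neighbors) / len(neighbors)
    rel = density(points, i, k) / neighbor_density
    return 1.0 / rel

# A compact group of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [avg_rel_density_score(points, i, k=2) for i in range(len(points))]
print([round(s, 2) for s in scores])
```

Points inside the compact group score close to 1 because their density matches that of their neighbors, while the isolated point scores much higher.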

LOF: Local Outlier Factor
It uses the avg_rel_density.
Points A, C, and D have the largest anomaly scores:
  – C: the most extreme outlier
  – D: the most extreme point with respect to the compact set of points
  – A: the most extreme point with respect to the loose set of points

Anomaly Detection: Clustering Approach
Definition of Outlier:
  – Clustering-based definition of outlier: A data instance is a cluster-based outlier if the instance does not strongly belong to any cluster.
Anomaly score function:
  – Given a data instance x from a dataset D, alternate definitions:
1. f(x) = distance between x and its closest centroid
2. f(x) = relative distance: the ratio of the point's distance from the centroid to the median distance of all points in the cluster from the centroid
3. f(x) = improvement in the goodness of a cluster (as measured by an objective function) when x is removed

Anomaly Detection: Clustering Approach
How does the approach work? (in general):
1. Calculate the anomaly score, f(x), for each data point in the dataset.
2. Use a threshold t on this score to determine outliers. That is, x is an outlier iff f(x) > t.
  – Same comments on how to determine good values for k and the threshold as discussed above.

Outlier detection using K-means with 2 clusters. The figure uses the distance of each point from its closest centroid (D is not considered an outlier).

The figure uses the relative distance of each point from its closest centroid to adjust for the difference in densities among the clusters.
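The contrast between the two figures can be reproduced on a toy example. The sketch below assumes the centroids are already available from a previous K-means run. The centroid values, the coordinates, and the points labelled D and A are all hypothetical and are not taken from the figures.

```python
import math
from statistics import median

def cluster_scores(points, centroids):
    """Return (distance, relative distance) anomaly scores for each point."""
    assigned = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        nearest = dists.index(min(dists))        # index of the closest centroid
        assigned.append((nearest, dists[nearest]))
    # Median distance to the centroid within each cluster.
    clusters = {cl for cl, _ in assigned}
    medians = {c: median(d for cl, d in assigned if cl == c) for c in clusters}
    distance = [d for _, d in assigned]
    # Assumes no cluster has a median distance of exactly zero.
    relative = [d / medians[cl] for cl, d in assigned]
    return distance, relative

centroids = [(0, 0), (10, 10)]                   # assumed K-means output
compact = [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1), (1, 0)]   # last point plays the role of D
loose = [(12, 10), (8, 10), (10, 12), (10, 8), (10, 15)]       # last point plays the role of A
distance, relative = cluster_scores(compact + loose, centroids)

print("D:", distance[4], round(relative[4], 2))
print("A:", distance[9], round(relative[9], 2))
```

By plain distance, D is closer to its centroid than ordinary members of the loose cluster, so a single distance threshold would not flag it. By relative distance, D stands out because it is far from the centroid compared with the rest of its own compact cluster.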