Data Mining: Concepts and Techniques (3rd ed.), Chapter 12. Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at Urbana-Champaign.

Presentation transcript:

1 Data Mining: Concepts and Techniques (3rd ed.)
Chapter 12
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2012 Han, Kamber & Pei. All rights reserved.

2 Chapter 12. Outlier Analysis
- Outlier and Outlier Analysis
- Outlier Detection Methods
  - Supervised Methods
  - Unsupervised Methods
  - Semi-supervised Methods
- Statistical Approaches
- Proximity-Based Approaches
- Summary

3 What Are Outliers?
- Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
  - Ex.: an unusual credit card purchase
- Outliers are different from noise
  - Noise is random error or variance in a measured variable
  - Noise should be removed before outlier detection

4 Types of Outliers (I)
- Three kinds: global, contextual, and collective outliers
- Global outlier (or point anomaly)
  - An object O_g is a global outlier if it deviates significantly from the rest of the data set
  - Ex.: intrusion detection in computer networks
  - Issue: finding an appropriate measurement of deviation
- Contextual outlier (or conditional outlier)
  - An object O_c is a contextual outlier if it deviates significantly with respect to a selected context
  - Ex.: is 80°F in NYC an outlier? It depends on whether it is summer or winter
  - Attributes of data objects should be divided into two groups
    - Contextual attributes: define the context, e.g., time and location
    - Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature, pressure, humidity
  - Issue: how to define or formulate a meaningful context?
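
The contextual/behavioral split can be illustrated with a minimal sketch. It assumes the season is the only contextual attribute and temperature the only behavioral attribute; the readings, the `contextual_outliers` helper, and the z-score cutoff of 2.0 are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) records whose value deviates within its own context.

    The context (season) only groups the records; the deviation itself is
    measured on the behavioral attribute (temperature).
    """
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)

    flagged = []
    for context, value in records:
        values = groups[context]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical NYC temperatures in °F, invented for this example.
readings = [("summer", t) for t in (78, 80, 82, 79, 81, 80)]
readings += [("winter", t) for t in (30, 35, 32, 28, 33, 31, 34, 29, 80)]

flagged = contextual_outliers(readings)
print(flagged)
```

In this toy data 80°F appears in both seasons, but only the winter reading is flagged, because the deviation is evaluated relative to the other winter readings rather than to the whole data set.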

5 Types of Outliers (II)
- Collective outliers
  - A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
  - Applications: e.g., intrusion detection, when a number of computers keep sending denial-of-service packets to each other
- Detection of collective outliers
  - Consider not only the behavior of individual objects, but also that of groups of objects
  - Requires background knowledge of the relationship among data objects, such as a distance or similarity measure on objects

6 Challenges of Outlier Detection
- Modeling normal objects and outliers properly
  - It is hard to enumerate all possible normal behaviors in an application
  - The border between normal and outlier objects is often a gray area
- Application-specific outlier detection
  - The choice of distance measure among objects and the model of relationship among objects are often application-dependent
  - E.g., in clinical data a small deviation could be an outlier, whereas in marketing analysis larger fluctuations are expected
- Handling noise in outlier detection
  - Noise may distort the normal objects and blur the distinction between normal objects and outliers
  - It may help hide outliers and reduce the effectiveness of outlier detection
- Understandability
  - Understand why these are outliers: justification of the detection

8 Outlier Detection
- Two ways to categorize outlier detection methods:
  - Based on whether user-labeled examples of outliers can be obtained: supervised, semi-supervised, and unsupervised methods
  - Based on assumptions about normal data and outliers: statistical, proximity-based, and clustering-based methods

9 Supervised Methods
- Model outlier detection as a classification problem
- Ways to learn a classifier for outlier detection effectively:
  - Model normal objects and report those not matching the model as outliers, or
  - Model outliers and treat those not matching the model as normal
- Challenges
  - Imbalanced classes, i.e., outliers are rare

10 Unsupervised Methods
- Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features
- An outlier is expected to be far away from any group of normal objects
- Weakness: cannot detect collective outliers effectively
  - Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area

11 Semi-Supervised Methods
- In many applications, the amount of labeled data is small: labels could be on outliers only, normal objects only, or both
- If some labeled normal objects are available
  - Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
  - Objects that do not fit the model of normal objects are detected as outliers

12 Statistical Methods
- Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model
- Data that do not follow the model are outliers

13 Statistical Methods – Using Maximum Likelihood
- Ex.: avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
- Use the maximum likelihood method to estimate μ and σ
- For the above data with n = 10, we have μ = 28.61 and σ = 1.51
- Consider the value μ – 3σ = 28.61 – 3*1.51 = 24.08
- Since 24.0 < 24.08, 24.0 is an outlier
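
The estimation step can be checked with a short sketch, assuming a univariate normal model and the divide-by-n maximum likelihood estimators; the helper name `mle_normal` is hypothetical. Recomputing from the ten listed values gives σ of about 1.54 rather than the 1.51 quoted on the slide, which places 24.0 almost exactly on the 3σ boundary, so the verdict is sensitive to rounding.

```python
from math import sqrt

def mle_normal(values):
    """Maximum likelihood estimates (mean, std) under a normal model.

    The variance estimate divides by n, not n - 1.
    """
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)
    return mu, sigma

# The ten average temperatures listed on the slide.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

mu, sigma = mle_normal(temps)
most_deviating = max(temps, key=lambda x: abs(x - mu))
z = abs(most_deviating - mu) / sigma

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"most deviating value = {most_deviating}, |z| = {z:.2f}")
print(f"lower 3-sigma bound = {mu - 3 * sigma:.2f}")
```

Either way, 24.0 is by far the most deviating value in the set; whether it crosses a hard 3σ cutoff depends on the precision of the σ estimate.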

14 Statistical Methods – Box Plot
- Values less than Q1 – 1.5×IQR or greater than Q3 + 1.5×IQR are outliers
- Consider the following dataset: 10.2, 14.1, [...], 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
- Here, Q2 (median) = 14.6, Q1 = 14.4, Q3 = 14.9
- IQR = Q3 – Q1 = 14.9 – 14.4 = 0.5
- Outliers will be any points
  - below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65, or
  - above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65
- So, the outliers are at 10.2, 15.9, and 16.4
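
The fence arithmetic can be reproduced with a minimal sketch. It takes the quartiles as stated on the slide rather than recomputing them, because part of the dataset is missing from the transcript; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) box-plot fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as given on the slide.
lower, upper = tukey_fences(q1=14.4, q3=14.9)

# Only the values that are legible in the transcript.
legible_values = [10.2, 14.1, 14.4, 14.5, 14.5, 14.6, 14.7,
                  14.7, 14.7, 14.9, 15.1, 15.9, 16.4]
outliers = [x for x in legible_values if x < lower or x > upper]

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print("outliers:", outliers)
```

The multiplier `k=1.5` is the conventional box-plot choice used on the slide; a larger value would flag fewer points.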

15 Statistical Methods – Using Histogram
- The figure shows a histogram of purchase amounts in transactions
- A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
- Problem: it is hard to choose an appropriate bin size for the histogram
  - Too small a bin size → normal objects fall in empty or rare bins (false positives)
  - Too big a bin size → outliers fall in some frequent bins (false negatives)
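
The bin-size trade-off can be sketched as follows. The purchase amounts are hypothetical (the slide's data is not available), and `histogram_outliers` with its `min_fraction` cutoff is an illustrative assumption rather than the slide's exact procedure: a value is flagged when its bin holds less than a given fraction of all values.

```python
from collections import Counter

def histogram_outliers(values, bin_width, min_fraction):
    """Flag values whose histogram bin holds less than min_fraction of the data."""
    bins = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return [v for v in values if bins[int(v // bin_width)] / n < min_fraction]

# Hypothetical purchase amounts in dollars, invented for this example.
amounts = [120, 250, 310, 480, 520, 640, 700, 850, 900, 990,
           1100, 1250, 1400, 1600, 1800, 2100, 2400, 2900,
           3200, 3500, 3800, 7500]

reasonable = histogram_outliers(amounts, bin_width=1000, min_fraction=0.10)
too_small = histogram_outliers(amounts, bin_width=100, min_fraction=0.10)
too_big = histogram_outliers(amounts, bin_width=10000, min_fraction=0.10)

print("bin width 1000: ", reasonable)
print("bin width 100:  ", len(too_small), "values flagged")
print("bin width 10000:", too_big)
```

With a moderate bin width only the $7,500 purchase is flagged; with very narrow bins every value lands in a rare bin (false positives), and with one very wide bin nothing is flagged (a false negative).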

16 Statistical Methods – Other Methods
- Grubbs' test
- Mahalanobis distance
- Chi-square test

17 Proximity-Based Methods
- An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set
- Example (right figure): model the proximity of an object using its 3 nearest neighbors
  - Objects in region R are substantially different from other objects in the data set
  - Thus the objects in R are outliers
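
A minimal sketch of the 3-nearest-neighbor idea, assuming Euclidean distance and scoring each object by the distance to its k-th nearest neighbor. The points, the `knn_distance` helper, and the cutoff of 2.0 are hypothetical; the two isolated points stand in for region R of the figure.

```python
from math import dist

def knn_distance(points, k=3):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical 2-D objects: one dense group and two isolated objects.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9), (1.0, 0.8),
          (5.0, 5.0), (5.2, 5.1)]

scores = knn_distance(points, k=3)
threshold = 2.0  # assumed cutoff, chosen by eye for this toy data
outliers = [p for p, s in zip(points, scores) if s > threshold]

print(outliers)
```

Using the third neighbor rather than the first matters here: the two isolated objects are close to each other, so a 1-nearest-neighbor score would miss them, while their third-nearest neighbors lie in the distant dense group.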

19 Summary
- Types of outliers: global, contextual, and collective outliers
- Outlier detection: supervised, semi-supervised, or unsupervised
- Statistical (or model-based) approaches
- Proximity-based approaches