Machine Learning University of Eastern Finland

Slides:



Advertisements
Similar presentations
Applications of one-class classification
Advertisements

Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar Introduction to Data Mining.
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for.
Clustering (1) Clustering Similarity measure Hierarchical clustering Model-based clustering Figures from the book Data Clustering by Gan et al.
University at BuffaloThe State University of New York Interactive Exploration of Coherent Patterns in Time-series Gene Expression Data Daxin Jiang Jian.
1 Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data Aparna S. Varde, Elke A. Rundensteiner, Carolina Ruiz, Mohammed Maniruzzaman.
Overview Of Clustering Techniques D. Gunopulos, UCR.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Anomaly Detection. Anomaly/Outlier Detection  What are anomalies/outliers? The set of data points that are considerably different than the remainder.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Example Data Sets Prior Research Join related objects to form independent compound objects, cluster normally (Yin et al., 2005). Use attribute-based distance.
Intelligent Systems Lecture 23 Introduction to Intelligent Data Analysis (IDA). Example of system for Data Analyzing based on neural networks.
Clustering Methods: Part 2d Pasi Fränti Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu, FINLAND Swap-based.
Clustering methods Course code: Pasi Fränti Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu,
Outlier Detection Using k-Nearest Neighbour Graph Ville Hautamäki, Ismo Kärkkäinen and Pasi Fränti Department of Computer Science University of Joensuu,
Self-organizing map Speech and Image Processing Unit Department of Computer Science University of Joensuu, FINLAND Pasi Fränti Clustering Methods: Part.
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
Some working definitions…. ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably Data mining = –the discovery of interesting,
Garrett Poppe, Liv Nguekap, Adrian Mirabel CSUDH, Computer Science Department.
Kernel Methods A B M Shawkat Ali 1 2 Data Mining ¤ DM or KDD (Knowledge Discovery in Databases) Extracting previously unknown, valid, and actionable.
Fast search methods Pasi Fränti Clustering methods: Part 5 Speech and Image Processing Unit School of Computing University of Eastern Finland
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation Dongmei Ren, Baoying Wang, William Perrizo North Dakota State University,
1 Pattern Recognition Pattern recognition is: 1. A research area in which patterns in data are found, recognized, discovered, …whatever. 2. A catchall.
Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Lecture 7: Outlier Detection Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
Graph preprocessing. Framework for validating data cleaning techniques on binary data.
Presented by Ho Wai Shing
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar.
Digital Camera and Computer Vision Laboratory Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan, R.O.C.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Clustering Categorical Data
Agglomerative clustering (AC)
International Conference on Mathematical Modelling and Computational Methods in Science and Engineering, Alagappa University, Karaikudi, India February.
Dr. Hongqin FAN Department of Building and Real Estate
What Is Cluster Analysis?
CSE 4705 Artificial Intelligence
Fast nearest neighbor searches in high dimensions Sami Sieranoja
Random Swap algorithm Pasi Fränti
A Methodology for Finding Bad Data
Lecture Notes for Chapter 9 Introduction to Data Mining, 2nd Edition
Integrating Meta-Path Selection With User-Guided Object Clustering in Heterogeneous Information Networks Yizhou Sun†, Brandon Norick†, Jiawei Han†, Xifeng.
Overview Of Clustering Techniques
Data Mining Anomaly Detection
Outlier Discovery/Anomaly Detection
Community Distribution Outliers in Heterogeneous Information Networks
Random Swap algorithm Pasi Fränti
TOP DM 10 Algorithms C4.5 C 4.5 Research Issue:
Data Mining Anomaly/Outlier Detection
University of Crete Department Computer Science CS-562
K-means properties Pasi Fränti
GPX: Interactive Exploration of Time-series Microarray Data
CSE572, CBS572: Data Mining by H. Liu
Pasi Fränti and Sami Sieranoja
Introduction to Sensor Interpretation
CSE572: Data Mining by H. Liu
Data Mining Anomaly Detection
Introduction to Sensor Interpretation
Mean-shift outlier detection
Inferring Road Networks from GPS Trajectories
Dimensionally distributed Pasi Fränti and Sami Sieranoja
Data Mining Anomaly Detection
Clustering methods: Part 10
Machine Learning.
Presentation transcript:

Machine Learning University of Eastern Finland Clustering methods: Part 7 Outlier removal Pasi Fränti 16.5.2017 Machine Learning University of Eastern Finland

Outlier detection methods Distance-based methods Knorr & Ng Density-based methods KDIST: Kth nearest distance MeanDIST: Mean distance Graph-based methods MkNN: Mutual K-nearest neighbor ODIN: Indegree of nodes in k-NN graph

What is outlier? One definition: Outlier is an observation that deviates from other observations so much that it is expected to be generated by a different mechanism. Outliers

Distance-based method [Knorr and Ng , CASCR 1997] Definition: Data point x is an outlier if at most k points are within the distance d from x. Example with k=3 Inlier Inlier Outlier

Selection of distance threshold Too small value of d: false detection of outliers Too large value of d: outliers missed

Density-based method: KDIST [Ramaswamy et al., ACIM SIGMOD 2000] Define KDIST as distance to the kth nearest point. Points are sorted by their KDIST distance. The last n points in the list are classified as outliers.

Density-based: MeanDist [Hautamäki et al., ICPR 2004] MeanDIST = the mean of k nearest distances. User parameters: Cutting point k, and local threshold t:

Comparison of KDIST and MeanDIST

Distribution-based method [Aggarwal and Yu, ACM SIGMOD, 2001]

Detection of sparse cells

Mutual k-nearest neighbor [Brito et al Mutual k-nearest neighbor [Brito et al., Statistics & Probability Letters, 1997] Generate directed k-NN graph. Create undirected graph: Points a and b are mutual neighbors if both links a b and b a exist. Change all mutual links ab to undirected link a—b. Remove the rest. Connected components are clusters. Isolated points as outliers.

Mutual k-NN example k = 2 Given a data with one outlier. 1 Given a data with one outlier. For each point find two nearest neighbours and create directed 2-NN graph. For each pair of points, create link if both a→b and b→a exist. 6 5 1 1 2 4 5 8 1 2 4 5 8 6 1 2 4 5 8 6 2 3 Clusters I am not sure, if the points should be coloured in this picture, because we have no outlier detection so far. Perhaps they should be all just blue. Outlier

ODIN: Outlier detection using indegree [Hautamäki et al., ICPR 2004] Definition: Given kNN graph, classify data point x as an outlier its indegree  T.

Example of ODIN k = 2 Input data Graph and indegrees Outlier 1 5 6 2 4 8 Input data 1 2 4 5 8 6 3 Graph and indegrees 1 2 4 5 8 6 3 Outlier Threshold value 0 1 2 4 5 8 6 3 Outliers Threshold value 1

Example of FA and FR k = 2 T False Acceptance False Rejection 0/1 0/5 0/1 0/5 1 2/5 2 3 4/5 4 5/5 5 6 Detected as outlier with different threshold values (T) 1 2 4 5 8 6 3 4 3 1 1

1 2 4 5 8 6

Experiments Measures False acceptance (FA): False rejection (FR): Number of outliers that are not detected. False rejection (FR): Number of good points wrongly classified as outlier. Half total error rate: HTER = (FR+FA) / 2

Comparison of graph-based methods

Difficulty of parameter setup MeanDIST: ODIN: KDD S1 Value of k is not important as long as threshold below 0.1. A clear valley in error surface between 20-50.

Improved k-means using outlier removal Original After 40 iterations After 70 iterations At each step, remove most diverging data objects and construct new clustering.

Example of removal factor Outlier factor:

CERES algorithm [Hautamäki et al., SCIA 2005]

Experiments Artificial data sets Image data sets Plot of M2 A1 S3 S4

Comparison

Literature D.M. Hawkins, Identification of Outliers, Chapman and Hall, London, 1980. W. Jin, A.K.H. Tung, J. Han, "Finding top-n local outliers in large database", In Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 293-298, 2001. E.M. Knorr, R.T. Ng, "Algorithms for mining distance-based outliers in large datasets", In Proc. 24th Int. Conf. Very Large Data Bases, pp. 392-403, New York, USA, 1998. M.R. Brito, E.L. Chavez, A.J. Quiroz, J.E. Yukich, "Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection", Statistics & Probability Letters, 35 (1), 33-42, 1997.

Literature C.C. Aggarwal and P.S. Yu, "Outlier detection for high dimensional data", Proc. Int. Conf. on Management of data ACM SIGMOD, pp. 37-46, Santa Barbara, California, United States, 2001. V. Hautamäki, S. Cherednichenko, I. Kärkkäinen, T. Kinnunen and P. Fränti, Improving K-Means by Outlier Removal, In Proc. 14th Scand. Conf. on Image Analysis (SCIA’2005), 978-987, Joensuu, Finland, June, 2005. V. Hautamäki, I. Kärkkäinen and P. Fränti, "Outlier Detection Using k-Nearest Neighbour Graph", In Proc. 17th Int. Conf. on Pattern Recognition (ICPR’2004), 430-433, Cambridge, UK, August, 2004.