Distance Measures (Tan et al., from Chapter 2)

Similarity and Dissimilarity
Similarity:
- Numerical measure of how alike two data objects are.
- Higher when objects are more alike.
- Often falls in the range [0, 1].
Dissimilarity:
- Numerical measure of how different two data objects are.
- Lower when objects are more alike.
- Minimum dissimilarity is often 0; the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.
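
When a similarity is needed but only a dissimilarity is available, simple monotone transformations are commonly used. A minimal Python sketch; the two transformations below are common conventions, not the only valid choices:

    def sim_from_bounded(d):
        # For a dissimilarity d already scaled to [0, 1]: s = 1 - d.
        return 1.0 - d

    def sim_from_unbounded(d):
        # For d in [0, infinity): s = 1 / (1 + d) maps into (0, 1].
        return 1.0 / (1.0 + d)

    print(sim_from_bounded(0.25))   # 0.75
    print(sim_from_unbounded(3.0))  # 0.25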

Euclidean Distance
d(p, q) = sqrt( Σ_{k=1..n} (p_k − q_k)^2 )
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.
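
A minimal sketch of the formula above in pure Python (the helper name euclidean is ours):

    import math

    def euclidean(p, q):
        # d(p, q) = sqrt( sum_k (p_k - q_k)^2 )
        return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

    print(euclidean((0, 2), (2, 0)))  # 2.8284...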

Euclidean Distance: Distance Matrix
[Figure: a table of example 2-D points and the symmetric matrix of their pairwise Euclidean distances; the diagonal entries are 0.]
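
A sketch that builds such a matrix for a handful of illustrative 2-D points (the points are our own example, not recovered from the slide):

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def distance_matrix(points):
        # Symmetric matrix of pairwise distances; the diagonal is all zeros.
        return [[euclidean(p, q) for q in points] for p in points]

    pts = [(0, 2), (2, 0), (3, 1), (5, 1)]
    for row in distance_matrix(pts):
        print([round(d, 3) for d in row])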

Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance:
d(p, q) = ( Σ_{k=1..n} |p_k − q_k|^r )^(1/r)
where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples
- r = 1: city block (Manhattan, taxicab, L1 norm) distance. A common special case is the Hamming distance, which is just the number of bits that differ between two binary vectors.
- r = 2: Euclidean distance (L2 norm).
- r → ∞: "supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any components of the vectors. Example: the L∞ distance between (1, 0, 2) and (6, 0, 3) is max(|1 − 6|, |0 − 0|, |2 − 3|) = 5, as the sketch below verifies.
Do not confuse r with n: all of these distances are defined for any number of dimensions.
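
A sketch evaluating the three special cases on the vectors from the example above (helper names are ours):

    def minkowski(p, q, r):
        # ( sum_k |p_k - q_k|^r )^(1/r); r = 1 is city block, r = 2 is Euclidean.
        return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1.0 / r)

    def supremum(p, q):
        # The r -> infinity limit: the maximum per-component difference.
        return max(abs(pk - qk) for pk, qk in zip(p, q))

    p, q = (1, 0, 2), (6, 0, 3)
    print(minkowski(p, q, 1))  # 6.0   (L1)
    print(minkowski(p, q, 2))  # 5.099 (L2)
    print(supremum(p, q))      # 5     (L-infinity)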

Minkowski Distance: Distance Matrix
[Figure: the same example points with their pairwise distance matrices for several values of r.]

Mahalanobis Distance
mahalanobis(p, q) = (p − q) Σ^(−1) (p − q)^T
where Σ is the covariance matrix of the input data X. When the covariance matrix is the identity matrix, the Mahalanobis distance coincides with the (squared) Euclidean distance. Useful for detecting outliers.
Q: What is the shape of the data when the covariance matrix is the identity?
Q: Is A closer to P or to B?
[Figure: a scatter plot of correlated points with points A, B, and P marked; for the two highlighted red points, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.]
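
A numpy sketch of the definition (the function name is ours). With the identity covariance it reduces to the squared Euclidean distance, as noted above:

    import numpy as np

    def mahalanobis_sq(p, q, cov):
        # (p - q) cov^{-1} (p - q)^T, matching the square-free definition above.
        d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
        return float(d @ np.linalg.inv(cov) @ d)

    print(mahalanobis_sq((0, 0), (3, 4), np.eye(2)))  # 25.0 = 3^2 + 4^2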

Mahalanobis Distance: Example
Covariance matrix: Σ = [[0.3, 0.2], [0.2, 0.3]]
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
mahalanobis(A, B) = 5
mahalanobis(A, C) = 4
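
A numpy sketch reproducing the two results above (note: Σ and point C are not legible in the transcript; the values shown are reconstructed to be consistent with the stated results of 5 and 4):

    import numpy as np

    def mahalanobis_sq(p, q, cov):
        d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
        return float(d @ np.linalg.inv(cov) @ d)

    # Covariance and point C reconstructed to match the stated results.
    cov = np.array([[0.3, 0.2], [0.2, 0.3]])
    A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
    print(round(mahalanobis_sq(A, B, cov), 6))  # 5.0
    print(round(mahalanobis_sq(A, C, cov), 6))  # 4.0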

Common Properties of a Distance
Distances such as the Euclidean distance have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is a metric, and a space equipped with such a metric is called a metric space.
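
A quick sketch that spot-checks these properties for the Euclidean distance on random points (positive definiteness is checked only in its d ≥ 0 half):

    import math
    import random

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    random.seed(0)
    for _ in range(1000):
        p, q, r = (tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(3))
        assert euclidean(p, q) >= 0                                          # non-negativity
        assert abs(euclidean(p, q) - euclidean(q, p)) < 1e-12                # symmetry
        assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r) + 1e-12  # triangle
    print("all checks passed")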

Common Properties of a Similarity
Similarities also have some well-known properties:
1. s(p, q) = 1 (or the maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
- M01 = the number of attributes where p was 0 and q was 1
- M10 = the number of attributes where p was 1 and q was 0
- M00 = the number of attributes where p was 0 and q was 0
- M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard coefficients:
SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 1-to-1 matches / number of not-both-zero attributes = M11 / (M01 + M10 + M11)

Binary Variables ({0, 1} or {true, false})
A contingency table for binary data (rows: object p, columns: object q):
            q = 1   q = 0
    p = 1    M11     M10
    p = 0    M01     M00
The simple matching coefficient is invariant to the coding of the binary variable: if you assign 1 to "pass" and 0 to "fail", or the other way around, you get the same value.

SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
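
A sketch that recomputes the example (the helper name binary_sims is ours; vector p is reconstructed to match the stated counts):

    def binary_sims(p, q):
        # Tally the four attribute patterns between two 0/1 vectors.
        m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
        m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
        m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
        m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
        smc = (m11 + m00) / (m01 + m10 + m11 + m00)
        jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 1.0
        return smc, jaccard

    p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
    print(binary_sims(p, q))  # (0.7, 0.0)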

Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 0.3150; the cosine distance is 1 − cos(d1, d2) = 0.685.
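
A sketch reproducing the computation (the helper name cosine is ours):

    import math

    def cosine(d1, d2):
        # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
        dot = sum(a * b for a, b in zip(d1, d2))
        norm1 = math.sqrt(sum(a * a for a in d1))
        norm2 = math.sqrt(sum(b * b for b in d2))
        return dot / (norm1 * norm2)

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
    c = cosine(d1, d2)
    print(round(c, 4))      # 0.315
    print(round(1 - c, 4))  # cosine distance: 0.685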

Nominal Attributes
A generalization of the binary variable: it can take more than two states, e.g., red, yellow, blue, green.
Method 1 (simple matching): d(i, j) = (p − m) / p, where m is the number of matching attributes and p is the total number of variables.
Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states (one-hot encoding).
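
A sketch of both methods (the helper names are ours; Method 2 is ordinary one-hot encoding):

    def nominal_dissim(i, j):
        # Method 1, simple matching: d(i, j) = (p - m) / p.
        p = len(i)                             # total number of variables
        m = sum(a == b for a, b in zip(i, j))  # number of matches
        return (p - m) / p

    def one_hot(value, states):
        # Method 2: one binary variable per nominal state.
        return [1 if value == s else 0 for s in states]

    print(nominal_dissim(["red", "slow"], ["red", "fast"]))       # 0.5
    print(one_hot("yellow", ["red", "yellow", "blue", "green"]))  # [0, 1, 0, 0]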

Measures of Cluster Distance
- Minimum distance: the smallest distance between any pair of points, one from each cluster (single link).
- Maximum distance: the largest distance between any such pair (complete link).
- Mean distance: the distance between the cluster means (centroids).
- Average distance: the average over all pairwise distances between the two clusters.
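
A sketch computing all four measures for two small clusters (the helper names are ours):

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cluster_distances(X, Y):
        # X, Y: lists of points belonging to two clusters.
        pairwise = [euclidean(x, y) for x in X for y in Y]
        mean_x = tuple(sum(c) / len(X) for c in zip(*X))
        mean_y = tuple(sum(c) / len(Y) for c in zip(*Y))
        return {
            "minimum": min(pairwise),                  # single link
            "maximum": max(pairwise),                  # complete link
            "mean": euclidean(mean_x, mean_y),         # distance between centroids
            "average": sum(pairwise) / len(pairwise),  # average pairwise distance
        }

    print(cluster_distances([(0, 0), (1, 0)], [(3, 0), (4, 0)]))
    # {'minimum': 2.0, 'maximum': 4.0, 'mean': 3.0, 'average': 3.0}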