Similarity and Dissimilarity


Similarity and Dissimilarity
Similarity
- Numerical measure of how alike two data objects are.
- Higher when objects are more alike.
- Often falls in the range [0, 1].
Dissimilarity
- Numerical measure of how different two data objects are.
- Lower when objects are more alike.
- Minimum dissimilarity is often 0; the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
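The table from this slide did not survive in the transcript. As a rough sketch only, the following Python shows one common textbook convention for per-attribute dissimilarity; the function names, the rank-based ordinal formula, and the `1 - d` conversion to similarity are assumptions for illustration, not necessarily what the original slide showed.

```python
def nominal_dissimilarity(p, q):
    """Nominal (unordered categorical): 0 if the values match, 1 otherwise."""
    return 0 if p == q else 1


def ordinal_dissimilarity(p, q, levels):
    """Ordinal: map values to ranks 0..n-1 and scale the rank gap to [0, 1].

    `levels` lists the categories in order (a hypothetical helper argument).
    """
    rank = {value: i for i, value in enumerate(levels)}
    return abs(rank[p] - rank[q]) / (len(levels) - 1)


def interval_dissimilarity(p, q):
    """Interval or ratio (continuous): absolute difference of the values."""
    return abs(p - q)


def similarity_from_unit_dissimilarity(d):
    """Assumed conversion for dissimilarities that already lie in [0, 1]."""
    return 1 - d


levels = ["low", "medium", "high"]
print(nominal_dissimilarity("apple", "pear"))
print(ordinal_dissimilarity("low", "medium", levels))
print(interval_dissimilarity(1.70, 1.85))
```

The `1 - d` conversion only makes sense when the dissimilarity is bounded by 1, which is why it is not applied to the interval case here.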

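Euclidean Distance
$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
where n is the number of dimensions (attributes) and p_k and q_k are the kth attributes of data objects p and q.
Standardization is necessary if the scales of the attributes differ.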
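To see why standardization matters, here is a minimal sketch that computes Euclidean distance before and after z-score standardization. The data (height in metres, income in dollars) is made up, and z-scoring with the population standard deviation is only one of several possible standardization choices.

```python
import math
from statistics import mean, pstdev


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its std deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, mus, sigmas)]
        for row in rows
    ]


# Hypothetical objects: (height in m, income in $)
rows = [[1.60, 30000], [1.90, 30500], [1.62, 60000]]
z = standardize(rows)

# On the raw data, income dominates and the height gap is almost invisible.
print(euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2]))
# After standardization, both attributes contribute on a comparable scale.
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))
```

On the raw values the first two objects look nearly identical even though their heights differ by 30 cm; after standardization that height difference carries as much weight as the income difference between the first and third objects.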

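Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:
$d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are the kth attributes of data objects p and q.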

Minkowski Distance: Examples
- r = 1: City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that differ between two binary vectors.
- r = 2: Euclidean distance.
- r → ∞: "supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
Do not confuse r with n: all of these distances are defined for any number of dimensions.

Minkowski Distance: Examples
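The worked example from this slide is not present in the transcript. As a stand-in, the sketch below evaluates the Minkowski distance for r = 1, r = 2, and r → ∞; the points and the function name `minkowski` are illustrative assumptions.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


p, q = (0, 2), (3, 1)              # hypothetical 2-dimensional points
print(minkowski(p, q, 1))          # city block: |0-3| + |2-1|
print(minkowski(p, q, 2))          # Euclidean: sqrt(3^2 + 1^2)
print(minkowski(p, q, math.inf))   # supremum: max(3, 1)

# With r = 1 on binary vectors, the result is the Hamming distance.
print(minkowski([1, 0, 1, 1], [0, 0, 1, 0], 1))
```

The supremum case is handled separately because it is the limit of the formula as r grows, not a value that can be plugged into the exponent.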

Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
- d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness)
- d(p, q) = d(q, p) for all p and q. (Symmetry)
- d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
Here d(p, q) is the distance (dissimilarity) between points (data objects) p and q. A distance that satisfies these properties is a metric.
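These properties can be spot-checked numerically. The sketch below tests them on a small, made-up set of points; the helper `satisfies_metric_properties` is hypothetical, and passing on a finite sample does not prove that a function is a metric, although a single failure does disprove it. Squared Euclidean distance is included as a dissimilarity that fails the triangle inequality.

```python
import itertools
import math


def satisfies_metric_properties(points, d, tol=1e-12):
    """Check the three metric properties of d on a finite set of points."""
    for p, q in itertools.product(points, repeat=2):
        if d(p, q) < 0:                      # non-negativity
            return False
        if (d(p, q) <= tol) != (p == q):     # zero distance iff same point
            return False
        if abs(d(p, q) - d(q, p)) > tol:     # symmetry
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:  # triangle inequality
            return False
    return True


def squared_euclidean(p, q):
    """Squared Euclidean distance: a dissimilarity, but not a metric."""
    return math.dist(p, q) ** 2


points = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(satisfies_metric_properties(points, math.dist))
print(satisfies_metric_properties(points, squared_euclidean))
```

For the collinear points (0, 0), (1, 0), (2, 0), the squared distance from the first to the last is 4, which exceeds 1 + 1, so the triangle inequality fails.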

Common Properties of a Similarity
Similarities also have some well-known properties:
- s(p, q) = 1 (or maximum similarity) only if p = q.
- s(p, q) = s(q, p) for all p and q. (Symmetry)
Here s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors: Simple Matching and Jaccard Coefficients
Cosine similarity
Correlation
See IDM section 2.4 for details.
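The slide only names these measures, so the following is a minimal sketch using their standard definitions; the function names, the example vectors, and the choice to return 0.0 for an undefined Jaccard coefficient are assumptions for illustration.

```python
import math


def simple_matching(p, q):
    """SMC: fraction of positions where two binary vectors agree."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / len(p)


def jaccard(p, q):
    """Jaccard: 1-1 matches divided by positions where either vector is 1."""
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return f11 / not_both_zero if not_both_zero else 0.0  # assumed convention


def cosine(p, q):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))


def correlation(p, q):
    """Pearson correlation between two equal-length numeric sequences."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    spp = sum((a - mp) ** 2 for a in p)
    sqq = sum((b - mq) ** 2 for b in q)
    return cov / math.sqrt(spp * sqq)


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(simple_matching(p, q))   # 7 of 10 positions agree
print(jaccard(p, q))           # no position is 1 in both vectors
print(cosine([1, 0], [1, 1]))
print(correlation([1, 2, 3], [2, 4, 6]))
```

The binary example shows why the two coefficients can disagree sharply: simple matching counts the shared zeros as agreement, while Jaccard ignores them.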