Similarity and Dissimilarity


Presentation transcript:

Similarity and Dissimilarity
Similarity
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0, 1].
Dissimilarity
- Numerical measure of how different two data objects are.
- Is lower when objects are more alike.
- Minimum dissimilarity is often 0.
- Upper limit varies.
Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
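The slide's table of formulas is not part of this transcript. As a rough sketch of one common convention for single attributes (the function names are illustrative, and ordinal values are assumed to be already mapped to integer ranks 0..n-1), the dissimilarity between p and q could be computed like this:

```python
def nominal_dissimilarity(p, q):
    """Nominal attribute: 0 if the values match, 1 otherwise."""
    return 0.0 if p == q else 1.0


def ordinal_dissimilarity(p, q, n_levels):
    """Ordinal attribute: rank difference scaled into [0, 1].

    Assumes p and q are integer ranks in 0..n_levels-1.
    """
    return abs(p - q) / (n_levels - 1)


def interval_dissimilarity(p, q):
    """Interval or ratio attribute: absolute difference."""
    return abs(p - q)


def similarity_from_unit_dissimilarity(d):
    """One simple conversion, valid when d lies in [0, 1]."""
    return 1.0 - d


# Hypothetical ordinal scale: very slow=0, slow=1, medium=2, fast=3, very fast=4
d = ordinal_dissimilarity(1, 3, n_levels=5)   # slow vs. fast
s = similarity_from_unit_dissimilarity(d)
```

Other conversions from dissimilarity to similarity exist (for example 1 / (1 + d) for unbounded d); which one fits depends on the application.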

Euclidean Distance
Standardization is necessary if the attribute scales differ.
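To illustrate why standardization matters, here is a minimal sketch that computes Euclidean distance (the square root of the summed squared attribute differences) before and after z-scoring each attribute. The data and the helper names are made up for illustration, and z-scoring is only one of several possible standardizations.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared per-attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(rows):
    """Z-score every column: subtract its mean, divide by its std. deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]


# Hypothetical objects: (height in metres, income in dollars)
people = [[1.60, 30000.0], [1.90, 30500.0], [1.62, 60000.0]]

# On the raw data, income dominates and the height difference is invisible.
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# After standardization both attributes contribute on a comparable scale.
z = standardize(people)
std_01 = euclidean(z[0], z[1])
std_02 = euclidean(z[0], z[2])
```

On the raw values the first pair looks far closer than the second only because income is measured in much larger units; after standardization the two distances are of similar size.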

Minkowski Distance
Minkowski distance is a generalization of Euclidean distance.
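The formula itself is not in the transcript. The standard definition, written with r as the order parameter, n as the number of dimensions, and p_k, q_k as the k-th attributes of p and q (the subscript notation is an assumption here), is:

```latex
\[
  \mathrm{dist}(p, q) \;=\; \left( \sum_{k=1}^{n} \lvert p_k - q_k \rvert^{\,r} \right)^{1/r}
\]
```

Setting r = 2 recovers the Euclidean distance.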

Minkowski Distance: Examples
- r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that differ between two binary vectors.
- r = 2. Euclidean distance.
- r → ∞. "Supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
Do not confuse r with n: all of these distances are defined for any number of dimensions.

Minkowski Distance: Examples
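The slide's worked example is not in the transcript. As a stand-in, the following sketch evaluates the three cases r = 1, r = 2, and r → ∞ on a pair of hypothetical points; the function names and the data are illustrative only.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance of order r; r = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


def hamming(p, q):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(p, q))


# Hypothetical 2-D points
p, q = (0, 2), (3, 6)
l1 = minkowski(p, q, 1)            # 7.0  (3 + 4)
l2 = minkowski(p, q, 2)            # 5.0  (sqrt(9 + 16))
lmax = minkowski(p, q, math.inf)   # 4    (max(3, 4))

# For binary vectors, the L1 distance equals the Hamming distance.
a, b = (1, 0, 1, 1), (0, 0, 1, 0)
assert minkowski(a, b, 1) == hamming(a, b) == 2
```

The same function works for vectors of any length, which is the point of the remark about not confusing r with n.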

Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
- d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness)
- d(p, q) = d(q, p) for all p and q. (Symmetry)
- d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
Here d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is a metric.
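The three properties can be spot-checked numerically. The sketch below tests them on a small set of hypothetical points; passing such a check does not prove that a function is a metric, but a single failure is enough to show that it is not. The helper names are illustrative.

```python
import itertools
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def satisfies_metric_properties(points, d, tol=1e-12):
    """Check positive definiteness, symmetry, and the triangle inequality
    for every combination of the given (distinct) points."""
    for p, q in itertools.product(points, repeat=2):
        if d(p, q) < 0:
            return False                      # negative distance
        if (d(p, q) == 0) != (p == q):
            return False                      # zero distance iff same point
        if abs(d(p, q) - d(q, p)) > tol:
            return False                      # symmetry
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False                      # triangle inequality
    return True


points = [(0.0,), (1.0,), (2.0,)]
euclid_ok = satisfies_metric_properties(points, euclidean)
squared_ok = satisfies_metric_properties(points, squared_euclidean)
```

On these points the Euclidean distance passes, while the squared Euclidean distance fails the triangle inequality (4 > 1 + 1), so it is a dissimilarity but not a metric.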

Common Properties of a Similarity
Similarities also have some well-known properties:
- s(p, q) = 1 (or maximum similarity) only if p = q.
- s(p, q) = s(q, p) for all p and q. (Symmetry)
Here s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors
- Simple Matching Coefficient
- Jaccard Coefficient
- Cosine similarity
- Correlation
See IDM section 2.4 for details.
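The four measures listed above can be sketched in a few lines of plain Python. The definitions follow the usual textbook forms; the function names, the example vectors, and the choice to return 0.0 for the Jaccard coefficient of two all-zero vectors are assumptions made for illustration.

```python
import math


def simple_matching(x, y):
    """Fraction of positions where two binary vectors agree (1-1 or 0-0)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    """1-1 matches divided by positions where at least one vector has a 1."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 0.0  # convention for all-zero input


def cosine(x, y):
    """Dot product divided by the product of vector lengths (nonzero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def correlation(x, y):
    """Pearson correlation (assumes neither vector is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Illustrative sparse binary vectors
p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
smc_pq = simple_matching(p, q)   # 0.7: seven 0-0 matches out of ten positions
jac_pq = jaccard(p, q)           # 0.0: no position has a 1 in both vectors
```

The example shows why the Jaccard coefficient is often preferred for sparse binary data: the simple matching coefficient is high only because of the many shared zeros.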