Measurement of Similarity and Clustering

1 Measurement of Similarity and Clustering
Dr Eamonn Keogh, Computer Science & Engineering Department, University of California - Riverside, Riverside, CA

2 Outline of Talk
What is Similarity?
  Some nomenclature
  A useful tool (the dendrogram)
Why Measure Similarity?
  Classification
  Clustering
  Indexing
Desirable Properties of Similarity Measures
  Mathematical properties
  Intuitiveness
  Time and space complexity
Two Approaches
  Feature Projection
  Transformation (Edit Distance)
Hierarchical Clustering…

3 What is Similarity? The quality or state of being similar; likeness; resemblance; as, a similarity of features. Webster's Dictionary Similarity is hard to define, but… “We know it when we see it” The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

4 Some Nomenclature I We shall talk of measuring “similarity”, however we are usually measuring “dissimilarity”. Similarity: The larger the number, the more alike two objects are. Dissimilarity: The larger the number, the less alike two objects are. Distance is a common synonym for Dissimilarity, so we may speak of “Distance measure” and “Dissimilarity measure” interchangeably. However a “Distance measure” is not the same thing as a “Distance Metric”. We will see why later…

5 Some Nomenclature II
Similarity queries are often expressed as nearest neighbor queries or range queries. Nearest neighbor: what is the nearest item to the green item? (This can be generalized to the K nearest neighbors.) Range query: what items are within R of the blue item? (R is given by the user.)

6 A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the examples given in the early part of this talk, we will now introduce the dendrogram. (We will have much more to say about dendrograms later) The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.

7 A dendrogram can also be written as a nested text string with branch lengths, e.g.:
(Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:…,(Chimp:…,Human:…):…):…):…):…);

8 (Figure: the same tree drawn in several styles: Eurogram, Curvogram, Swoopogram, Tree Diagram, Cladogram, Phenogram.)

9 Why Measure Similarity?
Classification: Given an unlabeled item Q, assign it to one of two or more predefined classes. (We can do classification without measuring similarity, but similarity-based methods, e.g. nearest neighbor, are very competitive.) Clustering: Find natural groupings of items under some similarity measure. Indexing (Query by Content): Given a query object Q and some similarity measure, find the nearest matching item in the database without having to examine every item.

10 Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2). (Figure: a "black box" distance function applied to, say, Peter and Piotr, returning a number such as 0.23, 3, or 342.7.)

11 What properties should a distance measure have?
When we peek inside one of these black boxes, we see some function of two variables. These functions might be very simple or very complex. For strings such as Peter and Piotr, for instance, the box might contain the edit distance recurrence: d('', '') = 0; d(s, '') = d('', s) = |s|, i.e. the length of s; d(s1+ch1, s2+ch2) = min( d(s1, s2) + (0 if ch1 = ch2 else 1), d(s1+ch1, s2) + 1, d(s1, s2+ch2) + 1 ), which here yields 3. In either case it is natural to ask: what properties should these functions have?
D(A,B) = D(B,A)  Symmetry
D(A,A) = 0  Constancy of Self-Similarity
D(A,B) = 0 iff A = B  Positivity (Separation)
D(A,B) ≤ D(A,C) + D(B,C)  Triangular Inequality

12 Intuitions behind desirable distance measure properties
D(A,B) = D(B,A)  Symmetry: otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0  Constancy of Self-Similarity: otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B  Positivity (Separation): otherwise there are objects in your world that are different, but you cannot tell them apart.
D(A,B) ≤ D(A,C) + D(B,C)  Triangular Inequality: otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."

13 Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold. Suppose I have a database of 3 objects, a, b and c. Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database.

14 Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold. Suppose I am looking for the closest point to Q in a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database. I find a and calculate that it is 2 units from Q; it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q. Now I don't have to calculate the distance from Q to c! I know D(Q,b) ≤ D(Q,c) + D(b,c), so D(Q,b) - D(b,c) ≤ D(Q,c), which gives 5.51 ≤ D(Q,c). So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.
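For concreteness, here is a minimal Python sketch of this pruning idea. The object coordinates, names and helper functions are illustrative only (not from the talk); the point is that a precomputed pairwise-distance table plus the triangle inequality lets us skip some true distance computations.

import math

def euclidean(p, q):
    # The "true" distance: plain Euclidean distance between 2-D points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical database of 3 objects and a precomputed pairwise-distance table.
db = {'a': (0.0, 0.0), 'b': (6.0, 3.0), 'c': (7.0, 1.0)}
pair_dist = {(x, y): euclidean(db[x], db[y]) for x in db for y in db}

def nearest_with_pruning(query):
    best_item, best_dist = None, float('inf')
    computed = {}                      # distances from the query we actually computed
    for item in db:
        # Triangle inequality: D(Q, item) >= D(Q, seen) - D(seen, item) for any already-seen item,
        # so if that lower bound already reaches the best-so-far we can skip the real computation.
        lower = max((d - pair_dist[(seen, item)] for seen, d in computed.items()), default=0.0)
        if lower >= best_dist:
            continue                   # pruned: item cannot beat the best-so-far
        d = euclidean(query, db[item])
        computed[item] = d
        if d < best_dist:
            best_item, best_dist = item, d
    return best_item, best_dist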

15 Thoughts on the Triangular Inequality I
Sometimes the triangular inequality requirement maps nicely onto human intuitions. Consider the similarity between the horse, the zebra and the lion. The horse and the zebra are very similar, and both are very unlike the lion.

16 Thoughts on the Triangular Inequality II
Sometimes the triangular inequality requirement fails to map onto human intuition. Consider the similarity between a horse, a man and a centaur. The horse and the man are very different, but both share many features with the centaur. This relationship does not obey the triangular inequality. (The centaur example is due to Remco Veltkamp.)

17 What other properties should we require of a distance measure?
It should really measure similarity! (Whatever that means.)
It should be fast to compute: Euclidean distance and Hamming distance are O(n); Dynamic Time Warping and String Edit Distance are O(n²).
It should be space efficient. (This is usually not as important as time efficiency.)
It should allow indexing. (If the measure is a metric, this is automatically true; otherwise it depends…)
If it is not fast to compute, a fast lower bound measure is desirable: ∀ A, B: lower_bound_distance(A, B) ≤ true_distance(A, B). (We will see why on the next slide.)

18 …If not fast to compute, a fast lower bound measure is desirable
Algorithm Lower_Bounding_Sequential_Scan(Q)
  best_so_far = infinity
  for all items Ci in database
    LB_dist = lower_bound_distance(Ci, Q)
    if LB_dist < best_so_far
      true_dist = true_distance(Ci, Q)
      if true_dist < best_so_far
        best_so_far = true_dist
        index_of_best_match = i
      endif
    endif
  endfor
Assume that true_distance(A, B) is the correct distance function but is very expensive to compute, and that lower_bound_distance(A, B) is a cheap lower-bounding estimate of true_distance(A, B). Then the above algorithm allows faster sequential searching.
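A runnable Python sketch of the same scan. The two distance functions below are illustrative stand-ins, not definitions from the talk: the "true" distance is full Euclidean distance, and the lower bound is the distance in the first dimension alone, which can never overestimate it.

import math

def true_distance(a, b):
    # Assume this is the expensive, exact distance (Euclidean here, for illustration).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lower_bound_distance(a, b):
    # Cheap bound: the gap in one dimension can never exceed the full Euclidean distance.
    return abs(a[0] - b[0])

def lower_bounding_sequential_scan(query, database):
    best_so_far = float('inf')
    index_of_best_match = None
    for i, c in enumerate(database):
        if lower_bound_distance(c, query) < best_so_far:   # only then pay for the true distance
            d = true_distance(c, query)
            if d < best_so_far:
                best_so_far = d
                index_of_best_match = i
    return index_of_best_match, best_so_far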

19 If we want to measure the similarity between items, we will have to measure some features
Scalar features:
Binary: only two possible states, e.g. True/False, Jew/Gentile, Married/Unmarried.
Nominal: a generalization of binary to 3 or more states, e.g. Jew/Catholic/Protestant, Married/Divorced/Widower. In basketball, jersey numbers are nominal. You cannot order, or do any mathematical operations on, nominal data.

20 Scalar (continued)
Ordinal: same as nominal, but order matters. However, the distance between two values is not meaningful. For example, we might have a coded survey: 0 = no high school, 1 = some high school, 2 = high school diploma, 3 = some college… While we can clearly rank these attributes, the distance between a "1" and a "2" is not the same as the distance between a "2" and a "3".
Interval: the distance between attribute values is meaningful. In this case we can measure intervals and take averages, but we cannot form ratios (i.e. we cannot say 10 is twice as large as 5). For example, consider temperature in Fahrenheit or Celsius.
Ratio: you can meaningfully form ratios. For example: weight, height, number of children.

21 Scalar (continued)
Note that both interval and ratio data can be either discrete or continuous. For example, consider these two examples of ratio data: number of children (for a given person) is discrete, while average number of children (for women in different countries) is continuous. Some algorithms work better (or only work) with either discrete or continuous data, but we can convert from continuous to discrete…

22 In addition to scalar values, much of the data we are interested in is nonscalar...
Vectors or Matrices of Binary/Nominal/Ordinal/Interval/Ratio Bitmaps, Time Series, Strings, Trees, Graphs

23 Consider “color”, what kind of feature is this?
Nominal scalar: Blue, Red, Yellow, etc.
Ordinal discrete: Red, Orange, Yellow, Green, Blue, Indigo, Violet.
Ordinal continuous: 780–622 nm, 622–597 nm, 597–577 nm, …
Vector continuous: [0.95, 0.01, 0.21] (Red/Green/Blue, or Hue/Saturation/Luminosity).
We sometimes have a choice of representation. Often making the right choice can be very important.

24 The similarity between two items depends on the features we measure (and on the distance measure itself). (Figure: two dendrograms over the same people, one under Last Name Similarity, one under Skin Color Similarity.)

25 Sometimes we are given the perfect features to measure similarity…
…sometimes we need to:
Generate features: suppose we hope to find similar people with regard to their medical conditions; knowing both their height and weight is not helpful, but knowing their BMI is (BMI = weight in kilograms / (height in meters)²).
Clean features: our features may contain noise or outliers.
Normalize features: we may need to transform features.
Reduce features: we may have too many features to do efficient similarity measurement, so dimensionality reduction may be necessary.

26 There is no single “magic” black box for measuring similarity
However, there are two useful and general tricks:
"Feature Projection": project the data into feature space; the distance in feature space (appropriately measured) becomes the similarity.
"Edit Distance": transform one object into the other; the "cost" of this transformation becomes the similarity.

27 Feature Projection Example I
Use the features to project the items into feature space. The distance between two objects in this space (appropriately measured) is the measure of similarity. (Figure: birds plotted with Body Mass on the x-axis and ratio of beak length over body length on the y-axis; from left to right: Bee Hummingbird, Costas Hummingbird, Ruby Topaz Hummingbird, Kestrel, Gyrfalcon, Bald Eagle.)

28 Feature Projection Example II
R. A. Fisher's Iris dataset: 3 variations of the Iris flower, 50 of each: Iris Setosa, Iris Versicolor, Iris Virginica.

29 A generic technique for measuring similarity
To measure the similarity between two objects, transform one of the objects into the other and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Patty and Selma: change dress color, 1 point; change earring shape, 1 point; change hair part, 1 point. D(Patty, Selma) = 3.
The distance between Marge and Selma: change dress color, 1 point; add earrings, 1 point; decrease height, 1 point; take up smoking, 1 point; lose weight, 1 point. D(Marge, Selma) = 5.

30 Edit Distance Example I
How similar are the names "Peter" and "Piotr"? Assume the following cost function: substitution 1 unit, insertion 1 unit, deletion 1 unit. Then D(Peter, Piotr) is 3:
Peter → Piter (substitution, i for e) → Pioter (insertion, o) → Piotr (deletion, e)
It is possible to transform any string Q into string C using only substitution, insertion and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)
(Figure: related forms of the same name: Peter, Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero.)
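A standard dynamic-programming sketch of this string edit distance in Python, with the unit costs assumed above; this is one way the "cheapest transformation" can actually be found.

def edit_distance(q, c):
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j].
    dp = [[0] * (len(c) + 1) for _ in range(len(q) + 1)]
    for i in range(len(q) + 1):
        dp[i][0] = i                      # delete all of q[:i]
    for j in range(len(c) + 1):
        dp[0][j] = j                      # insert all of c[:j]
    for i in range(1, len(q) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # substitution (or free match)
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return dp[len(q)][len(c)]

print(edit_distance("Peter", "Piotr"))    # prints 3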

31 Edit Distance Example II
We can make two time series appear more similar by letting one point in one sequence map onto two (or more) points in the other. For example, suppose we have Q = [5, 6, 8, 8, 7] and C = [5, 6, 6, 8, 7]. A "one to one" measure would have to match an "8" in Q to a "6" in C. However, if we allow nonlinear alignments, every number can match with "itself". Another way of looking at it is as an attempt to make the two sequences more similar by inserting values: both become [5, 6, 6, 8, 8, 7]. This is called Dynamic Time Warping.

32 Dynamic Time Warping
Fixed time axis: sequences are aligned "one to one". "Warped" time axis: nonlinear alignments are possible.
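A minimal dynamic-programming sketch of DTW (the basic O(n²) recurrence, with no warping-window constraint), using absolute difference as the point-to-point cost; this is an illustration, not the exact formulation from the talk.

def dtw(q, c):
    # dtw_mat[i][j] = cost of the best warping path aligning q[:i] with c[:j].
    inf = float('inf')
    n, m = len(q), len(c)
    dtw_mat = [[inf] * (m + 1) for _ in range(n + 1)]
    dtw_mat[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # A point may match one point in the other sequence, or "stretch" to cover several.
            dtw_mat[i][j] = cost + min(dtw_mat[i - 1][j],      # q[i-1] repeats
                                       dtw_mat[i][j - 1],      # c[j-1] repeats
                                       dtw_mat[i - 1][j - 1])  # one-to-one step
    return dtw_mat[n][m]

print(dtw([5, 6, 8, 8, 7], [5, 6, 6, 8, 7]))   # 0.0: a nonlinear alignment matches everything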

33 The Minkowski Metric
So, we have projected our objects into feature space. How do we measure the distance between points? Assume Q and C are vectors of features measured from the objects of interest. The Minkowski metric is D(Q, C) = ( Σi |qi - ci|^p )^(1/p), where:
p = 1: Manhattan (Rectilinear, City Block)
p = 2: Euclidean
p = ∞: Max (Supremum, "sup")
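A direct Python sketch of the three special cases for feature vectors Q and C (given as equal-length lists of numbers):

def minkowski(q, c, p):
    # General Minkowski distance: (sum_i |q_i - c_i|^p)^(1/p).
    return sum(abs(a - b) ** p for a, b in zip(q, c)) ** (1.0 / p)

def manhattan(q, c):                 # p = 1
    return sum(abs(a - b) for a, b in zip(q, c))

def euclidean(q, c):                 # p = 2
    return minkowski(q, c, 2)

def max_distance(q, c):              # p = infinity (supremum)
    return max(abs(a - b) for a, b in zip(q, c))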

34 The Minkowski Metric, a Weakness
Suppose we have a database of 3 items, with 2 features: number of children and temperature. We want to know who is most similar to Mr Red under the Euclidean distance. Measured in Celsius, the three points are (5, 36), (1, 36) and (5, 38); measured in Fahrenheit, the same people are at (5, 96.8), (1, 96.8) and (5, 102.2). In Celsius, Green is closest to Red; in Fahrenheit, Blue is closest to Red. The Minkowski metric is sensitive to the units used to measure features, a very undesirable property since the units are usually arbitrary. Two solutions suggest themselves: normalize the features, or use a weighted version of the Minkowski metric.
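A quick check of this effect, using the coordinates read off the figure; which point is "Mr Red" (and which are Green and Blue) is an assumption here, chosen so that the nearest neighbor flips between unit systems as the slide describes.

from math import dist   # Python 3.8+

red_c, blue_c, green_c = (5, 36), (1, 36), (5, 38)          # (children, temperature) in Celsius
red_f, blue_f, green_f = (5, 96.8), (1, 96.8), (5, 102.2)   # the same people, in Fahrenheit

print(dist(red_c, green_c), dist(red_c, blue_c))   # 2.0   4.0  -> Green is closest in Celsius
print(dist(red_f, green_f), dist(red_f, blue_f))   # ~5.4  4.0  -> Blue is closest in Fahrenheit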

35 Normalizing Features
Let C be a database of items, with the ith feature denoted by ci. To normalize the database:
  for each feature
    ci = (ci - mean(ci)) / std(ci)
  end
After normalization, each feature will have a mean of zero and a standard deviation of one. Note that in both the before/after plots the axes are square (there is the same number of pixels per unit in both the X and Y directions). Before normalization, the Y-axis dominates; after normalization, both axes are equally important.
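A small sketch of this z-normalization for a feature matrix (rows = items, columns = features), assuming NumPy is available; the example values are made up.

import numpy as np

def z_normalize(data):
    # Subtract each column's mean and divide by its standard deviation,
    # so every feature ends up with mean 0 and standard deviation 1.
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

items = np.array([[1, 1000.0], [2, 2000.0], [3, 3000.0]])
print(z_normalize(items))   # both columns now carry equal weight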

36 The Weighted Minkowski Metric
Assume Q and C are vectors of features measured from the objects of interest. Further assume that W is a vector containing the relative importance of the features; the weighted metric is D(Q, C) = ( Σi wi |qi - ci|^p )^(1/p). But how do we know the weights?

37 The Minkowski Metrics have Simple Geometric Interpretations
The set of points at a fixed distance from a query point forms a circle under Euclidean distance, an ellipse under weighted Euclidean, a diamond under Manhattan, and a square under Max. (Figure: points a, b, c shown inside each of these contours.)

38

39 What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
Organizing data into classes such that there is high intra-class similarity and low inter-class similarity.
Finding the class labels and the number of classes directly from the data (in contrast to classification).
More informally, finding natural groupings among objects.

40 What is a natural grouping among these objects?

41 What is a natural grouping among these objects?
Clustering is subjective. (Figure: the same individuals grouped in different ways: Simpson's Family, School Employees, Females, Males.)

42 Even if we know in advance the number of clusters we expect to see, the clustering obtained may be subjective.

43 Two Types of Clustering
Partitional algorithms: construct various partitions and then evaluate them by some criterion. Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion. (Figure: an example of a hierarchical clustering and a partitional clustering.)

44 Desirable Properties of a Clustering Algorithm
Scalability (in terms of both time and space)
Ability to deal with different data types
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability

45 Hierarchical Clustering
The number of dendrograms with n leaves is (2n - 3)! / [2^(n-2) (n - 2)!]:
n = 2: 1 dendrogram; n = 3: 3; n = 4: 15; n = 5: 105; … n = 10: 34,459,425.
Since we cannot test all possible trees, we will have to do a heuristic search over the space of possible trees. We could do this:
Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
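A quick check of this count (for n ≥ 2 the formula equals the double factorial (2n - 3)!!):

from math import factorial

def num_dendrograms(n):
    # Number of distinct rooted binary trees (dendrograms) with n labelled leaves, n >= 2.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425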

46 We begin with a distance matrix which contains the distances between every pair of objects in our database. (Figure: an example distance matrix; e.g. one pair of objects is at distance 8 and the closest pair is at distance 1.)

47 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges… Choose the best

48 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges… Choose the best Consider all possible merges… Choose the best

49 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges… Choose the best Consider all possible merges… Choose the best Consider all possible merges… Choose the best

50 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. Consider all possible merges… Choose the best Consider all possible merges… Choose the best Consider all possible merges… Choose the best

51 In the first iteration of agglomerative clustering we merged the two closest items, so we need to remove them from the matrix and add the resulting single cluster to our new, smaller matrix. But what values do we fill in? What is the distance between the new cluster and each of the remaining objects?

52 We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is non-obvious.
Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
Complete linkage (furthest neighbor): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
Group average: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
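A sketch of the three linkage definitions in Python, operating on two clusters given as lists of items and a pairwise distance function d (all names here are illustrative):

def single_linkage(cluster_a, cluster_b, d):
    # Nearest neighbor: distance of the closest pair across the two clusters.
    return min(d(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b, d):
    # Furthest neighbor: distance of the most distant pair across the two clusters.
    return max(d(a, b) for a in cluster_a for b in cluster_b)

def group_average(cluster_a, cluster_b, d):
    # Average of all pairwise distances between the two clusters.
    return sum(d(a, b) for a in cluster_a for b in cluster_b) / (len(cluster_a) * len(cluster_b))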

53 Using single linkage (nearest neighbor), the distance from the merged cluster to each remaining object is the minimum of the distances from either of its members. In the running example, the two new entries in the matrix work out to Min[…, …] = 4 and Min[…, …] = 7.
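Putting slides 46–53 together, here is a naive bottom-up clustering loop over a precomputed distance matrix. It is only a sketch: it uses single linkage and, for simplicity, recomputes cluster-to-cluster distances from the original matrix rather than maintaining the shrinking matrix shown on the slides.

def agglomerative(dist_matrix):
    # dist_matrix[i][j] = distance between items i and j (symmetric, zero diagonal).
    clusters = [[i] for i in range(len(dist_matrix))]    # start: each item in its own cluster
    merges = []
    while len(clusters) > 1:
        # Consider all possible merges and choose the best (smallest single-linkage distance).
        best = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist_matrix[a][b]
                                      for a in clusters[ij[0]] for b in clusters[ij[1]]))
        i, j = best
        merges.append((clusters[i][:], clusters[j][:]))
        clusters[i] = clusters[i] + clusters[j]          # fuse the two clusters
        del clusters[j]
    return merges                                        # the merge order defines the dendrogram

dm = [[0, 8, 7], [8, 0, 1], [7, 1, 0]]
print(agglomerative(dm))   # merges items 1 and 2 first (distance 1), then fuses in item 0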

54 Summary of Hierarchical Clustering Methods
No need to specify the number of clusters in advance. The hierarchical structure maps nicely onto human intuition for some domains. However, these methods do not scale well: time complexity of at least O(n²), where n is the total number of objects. Like any heuristic search algorithm, local optima are a problem. Interpretation of results is subjective.

