Presentation on theme: "CLUSTERING PROXIMITY MEASURES"— Presentation transcript:
1 CLUSTERING PROXIMITY MEASURES
By Çağrı Sarıgöz. Submitted to Assoc. Prof. Turgay İbrikçi. EE 639
2 Classification
Classifying has been one of the crucial thought activities of humankind. It makes it easy to perceive the outside world and act accordingly. Aristotle's Classification of Living Things is one of the most famous classification works, dating back to ancient times.
3 Cluster Analysis
Cluster analysis brings mathematical methodology to the solution of classification problems. It deals with the classification or grouping of data into a set of categories or clusters. Data objects that are in the same cluster should be similar, and objects in different clusters should be dissimilar, in some context. Determining this context is generally a subjective matter.
4 Approaching the Data Objects
Feature Types: Continuous, Discrete, Binary
Measurement Levels: Qualitative (Nominal, Ordinal), Quantitative (Interval, Ratio)
5 Feature Types
A continuous feature can take a value from an uncountably infinite range, e.g. the exact weight of a person. A discrete feature, in contrast, has a range of values that is finite or countably infinite, e.g. the heart rate of a person, in bpm. A binary feature is a special case of a discrete feature where there are only two values the feature can take, e.g. the presence or absence of tattoos on a person's skin.
6 Measurement Levels: Qualitative
Features at the nominal level have no mathematical meaning; they are generally labels, states, or names, e.g. the color of a car, the condition of the weather, etc. Features at the ordinal level are still just names, but with a certain order; the differences between the values are still meaningless in a mathematical sense, e.g. degrees of headache: none, slight, moderate, severe, unbearable.
7 Measurement Levels: Quantitative
At the interval level, the difference between feature values has a meaning, but there is no true zero in the range, i.e. the ratio between two values has no meaning. Example: IQ score. A person with an IQ of 140 is not necessarily twice as intelligent as a person with an IQ of 70. Features at the ratio level have all the properties of the interval level, plus a true zero, so that the ratio between two values has a mathematical meaning. Example: the number of cars in a parking lot.
8 Definition of Proximity Measures: Dissimilarity (Distance)
A dissimilarity or distance function D on a data set X is defined to satisfy these conditions:
Symmetry: D(xi, xj) = D(xj, xi)
Positivity: D(xi, xj) ≥ 0 for all xi and xj
It is called a dissimilarity metric if these conditions also hold:
Triangle inequality: D(xi, xj) ≤ D(xi, xk) + D(xk, xj) for all xi, xj and xk
Reflexivity: D(xi, xj) = 0 iff xi = xj
It is called a semimetric if the triangle inequality does not hold. If the following condition also holds, it is called an ultrametric:
D(xi, xj) ≤ max(D(xi, xk), D(xj, xk)) for all xi, xj and xk
9 Definition of Proximity Measures: Similarity
A similarity function S is defined to satisfy the following conditions:
Symmetry: S(xi, xj) = S(xj, xi)
Positivity: 0 ≤ S(xi, xj) ≤ 1 for all xi and xj
It is called a similarity metric if the following additional conditions also hold for all xi, xj and xk:
S(xi, xj) S(xj, xk) ≤ [S(xi, xj) + S(xj, xk)] S(xi, xk)
S(xi, xj) = 1 iff xi = xj
10 Proximity Measures for Continuous Variables
Euclidean distance (also known as the L2 norm):
D(xi, xj) = ( Σ_{l=1..d} (xil − xjl)² )^(1/2)
where xi and xj are d-dimensional data objects. Euclidean distance is a metric, tending to form hyperspherical clusters. Clusters formed with Euclidean distance are also invariant to translations and rotations in the feature space.
Without normalizing the data, features with large values and variances will tend to dominate the other features. A commonly used method is data standardization, in which each feature is rescaled to zero mean and unit variance:
xil = (xil* − ml) / sl
where xil* represents the raw data, and the sample mean ml and sample standard deviation sl are defined as
ml = (1/N) Σ_{i=1..N} xil*  and  sl = ( (1/N) Σ_{i=1..N} (xil* − ml)² )^(1/2)
respectively.
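The standardization and Euclidean distance above can be sketched as follows; this is a minimal illustration, with `X` a hypothetical N x d data matrix, not code from the presentation.

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    m = X.mean(axis=0)   # sample mean m_l per feature
    s = X.std(axis=0)    # sample standard deviation s_l per feature
    return (X - m) / s

def euclidean(xi, xj):
    """Euclidean (L2) distance between two d-dimensional objects."""
    return np.sqrt(np.sum((xi - xj) ** 2))

# Hypothetical data: the second feature has a much larger scale.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
Xs = standardize(X)
d_raw = euclidean(X[0], X[1])    # dominated by the large-valued feature
d_std = euclidean(Xs[0], Xs[1])  # both features contribute equally
```

After standardization, each column of `Xs` has zero mean and unit variance, so no feature dominates the distance by scale alone.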
11 Proximity Measures for Continuous Variables
Other normalization approaches can also be used. The Euclidean distance can be generalized as a special case of a family of metrics, called the Minkowski distance or Lp norm, defined as:
D(xi, xj) = ( Σ_{l=1..d} |xil − xjl|^p )^(1/p)
When p = 2, the distance becomes the Euclidean distance.
p = 1: the city-block (Manhattan) distance or L1 norm, D(xi, xj) = Σ_{l=1..d} |xil − xjl|
p → ∞: the sup distance or L∞ norm, D(xi, xj) = max_{1≤l≤d} |xil − xjl|
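A minimal sketch of the Minkowski family, covering the three special cases named above; the function name is illustrative, not from the presentation.

```python
import numpy as np

def minkowski(xi, xj, p):
    """Lp (Minkowski) distance; p = np.inf gives the sup distance."""
    diff = np.abs(xi - xj)
    if np.isinf(p):                       # L-infinity: largest coordinate gap
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
d1 = minkowski(a, b, 1)          # city-block: 3 + 4 + 0 = 7
d2 = minkowski(a, b, 2)          # Euclidean: sqrt(9 + 16) = 5
dinf = minkowski(a, b, np.inf)   # sup: max(3, 4, 0) = 4
```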
12 Proximity Measures for Continuous Variables
The squared Mahalanobis distance is also a metric:
D(xi, xj) = (xi − xj)^T S⁻¹ (xi − xj)
where S is the within-class covariance matrix defined as S = E[(x − μ)(x − μ)^T], μ is the mean vector, and E[·] calculates the expected value of a random variable.
The Mahalanobis distance tends to form hyperellipsoidal clusters, which are invariant to any nonsingular linear transformation. The calculation of the inverse of S may cause some computational burden for large-scale data. When the features are uncorrelated and have unit variance, S reduces to the identity matrix, making the Mahalanobis distance equal to the (squared) Euclidean distance.
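A sketch of the squared Mahalanobis distance; here S is estimated from sample data with `np.cov`, an assumption standing in for the within-class covariance of the definition above.

```python
import numpy as np

def mahalanobis_sq(xi, xj, S_inv):
    """Squared Mahalanobis distance given the inverse covariance S_inv."""
    diff = xi - xj
    return float(diff @ S_inv @ diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # hypothetical sample data
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverting S: the costly step
d2 = mahalanobis_sq(X[0], X[1], S_inv)

# When S is the identity, the measure reduces to squared Euclidean distance.
d2_id = mahalanobis_sq(np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.eye(2))
```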
13 Proximity Measures for Continuous Variables
The point symmetry distance is based on the assumption that the cluster's structure is symmetric:
D(xi, xr) = min_{j=1..N, j≠i} ( ||(xi − xr) + (xj − xr)|| / (||xi − xr|| + ||xj − xr||) )
where xr is a reference point (e.g. the centroid of the cluster) and ||·|| represents the Euclidean norm. It calculates the distance between an object xi and the reference point xr, given the other N − 1 objects, and is minimized when a symmetric pattern exists.
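The point symmetry distance can be sketched as below; the distance is 0 when some other object is the exact mirror image of xi about the reference point xr. Names and the toy data are illustrative.

```python
import numpy as np

def point_symmetry(i, X, xr):
    """Point symmetry distance of object i w.r.t. reference point xr."""
    vi = X[i] - xr
    best = np.inf
    for j in range(len(X)):
        if j == i:
            continue
        vj = X[j] - xr
        # If xj mirrors xi about xr, then vi + vj = 0 and the ratio is 0.
        num = np.linalg.norm(vi + vj)
        den = np.linalg.norm(vi) + np.linalg.norm(vj)
        best = min(best, num / den)
    return best

# X[1] is the exact mirror of X[0] about the origin, so the distance is 0.
X = np.array([[1.0, 2.0], [-1.0, -2.0], [3.0, 0.0]])
d = point_symmetry(0, X, np.array([0.0, 0.0]))
```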
14 Proximity Measures for Continuous Variables
The distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient, defined as
rij = Σ_{l=1..d} (xil − x̄i)(xjl − x̄j) / ( Σ_{l=1..d} (xil − x̄i)² Σ_{l=1..d} (xjl − x̄j)² )^(1/2)
The correlation coefficient is in the range [−1, 1], with −1 and 1 indicating the strongest negative and positive correlation, respectively. So we can define the distance measure as
D(xi, xj) = (1 − rij) / 2
which is in the range [0, 1]. Features should be measured on the same scales; otherwise, the mean or variance in the Pearson correlation coefficient calculation would have no meaning.
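A minimal sketch of the correlation-based distance D = (1 − r) / 2, mapping r in [−1, 1] to a distance in [0, 1]; `np.corrcoef` computes the Pearson coefficient.

```python
import numpy as np

def pearson_distance(xi, xj):
    """Distance derived from the Pearson correlation coefficient."""
    r = np.corrcoef(xi, xj)[0, 1]   # Pearson r between the two objects
    return (1.0 - r) / 2.0

a = np.array([1.0, 2.0, 3.0, 4.0])
d_pos = pearson_distance(a, 2 * a + 1)   # perfect positive correlation -> 0
d_neg = pearson_distance(a, -a)          # perfect negative correlation -> 1
```

Note that `2 * a + 1` is at distance 0 from `a`: like cosine similarity, this measure ignores magnitude differences.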
15 Proximity Measures for Continuous Variables
Cosine similarity is an example of a similarity measure that can be used to compare a pair of data objects with continuous variables:
S(xi, xj) = (xi^T xj) / (||xi|| ||xj||)
It can be converted into a distance measure by simply using D(xi, xj) = 1 − S(xi, xj). Like the Pearson correlation coefficient, cosine similarity is unable to provide information on the magnitude of the differences.
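A minimal sketch of cosine similarity and its distance form; the vectors are illustrative.

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two vectors."""
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])        # same direction, different magnitude
c = np.array([0.0, 1.0])         # orthogonal to a
s_ab = cosine_similarity(a, b)   # 1.0: the magnitude difference is invisible
s_ac = cosine_similarity(a, c)   # 0.0
d_ab = 1.0 - s_ab                # distance form D = 1 - S
```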
16 Examples and Applications of the Proximity Measures for Continuous Variables
17 Proximity Measures for Discrete Variables: Binary Variables
Invariant similarity measures for symmetric binary variables: the 1-1 matches and 0-0 matches of the variables are regarded as equally important, and the unmatched pairs are weighted based on their contribution to the similarity. Let a and d be the numbers of features on which the two objects are both 1 and both 0, respectively, and b and c the numbers of features on which they disagree. The simple matching coefficient is then
S(xi, xj) = (a + d) / (a + b + c + d)
and the corresponding dissimilarity measure from D(xi, xj) = 1 − S(xi, xj) is known as the Hamming distance.
18 Proximity Measures for Discrete Variables: Binary Variables
Non-invariant similarity measures for asymmetric binary variables: these measures focus on the 1-1 match features while ignoring the 0-0 matches, which are considered uninformative. Again, the unmatched pairs are weighted depending on their importance. A well-known example is the Jaccard coefficient, S(xi, xj) = a / (a + b + c), where a is the number of 1-1 matches and b and c are the numbers of mismatches.
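The two families can be sketched side by side: the simple matching coefficient (symmetric, counts 0-0 matches) and the Jaccard coefficient (asymmetric, ignores them). Function names are illustrative.

```python
def binary_counts(xi, xj):
    """Contingency counts for two binary feature vectors."""
    a = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 1)  # 1-1 matches
    b = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 0)  # mismatches
    c = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 1)  # mismatches
    d = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 0)  # 0-0 matches
    return a, b, c, d

def simple_matching(xi, xj):
    """Symmetric: 1-1 and 0-0 matches count equally."""
    a, b, c, d = binary_counts(xi, xj)
    return (a + d) / (a + b + c + d)

def jaccard(xi, xj):
    """Asymmetric: 0-0 matches are dropped as uninformative."""
    a, b, c, _ = binary_counts(xi, xj)
    return a / (a + b + c)

x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
s_sm = simple_matching(x, y)   # a=2, b=1, c=1, d=1 -> 3/5 = 0.6
s_j = jaccard(x, y)            # 2/4 = 0.5
```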
19 Proximity Measures for Discrete Variables with More than Two Values
One simple and direct approach is to map the variables into new binary features. This is simple, but it may introduce too many binary variables. A more effective and commonly used method is based on a matching criterion. For a pair of d-dimensional objects xi and xj, the similarity using the simple matching criterion is given as:
S(xi, xj) = (1/d) Σ_{l=1..d} Sijl
where Sijl = 1 if xil = xjl, and Sijl = 0 otherwise.
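The matching criterion above amounts to the fraction of features on which the two objects agree; a minimal sketch, with illustrative feature values:

```python
def matching_similarity(xi, xj):
    """Fraction of features in which the two objects take the same value."""
    matches = sum(1 for u, v in zip(xi, xj) if u == v)
    return matches / len(xi)

# Hypothetical 4-dimensional categorical objects.
x = ["red", "manual", "sedan", "diesel"]
y = ["red", "auto", "sedan", "petrol"]
s = matching_similarity(x, y)   # 2 of 4 features match -> 0.5
```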
20 Proximity Measures for Discrete Variables with More than Two Values
The categorical features may display a certain order; these are known as ordinal features. In this case, the codes from 1 to Ml, where Ml is the highest level, are no longer meaningless in similarity measures: the closer two levels are, the more similar the two objects are in that feature. Objects with this type of feature can be compared using the continuous dissimilarity measures. Since the number of possible levels varies across features, the original ranks ril* for the ith object in the lth feature are usually converted into new ranks ril in the range [0, 1], using:
ril = (ril* − 1) / (Ml − 1)
Then the city-block or Euclidean distance can be used.
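The rank conversion can be sketched with the headache example from the qualitative-levels slide (none=1 up to unbearable=5, so Ml = 5); the mapping is an assumption only in its variable names.

```python
def normalize_rank(r_star, M):
    """Map an original rank in 1..M into the range [0, 1]."""
    return (r_star - 1) / (M - 1)

M = 5  # levels: none=1, slight=2, moderate=3, severe=4, unbearable=5
ranks = [normalize_rank(r, M) for r in range(1, M + 1)]
# -> [0.0, 0.25, 0.5, 0.75, 1.0]

# City-block distance on this one feature: slight (2) vs severe (4).
d = abs(normalize_rank(2, M) - normalize_rank(4, M))
```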
21 Proximity Measures for Mixed Variables
The similarity measure for a pair of d-dimensional mixed data objects xi and xj can be defined as:
S(xi, xj) = ( Σ_{l=1..d} δijl Sijl ) / ( Σ_{l=1..d} δijl )
where Sijl indicates the similarity for the lth feature between the two objects, and δijl is a 0-1 coefficient that is 0 when the measurement of the lth feature is missing for either object, and 1 otherwise. Correspondingly, the dissimilarity measure can be obtained by simply using D(xi, xj) = 1 − S(xi, xj).
The component similarity for discrete variables: Sijl = 1 if xil = xjl, and 0 otherwise.
For continuous variables: Sijl = 1 − |xil − xjl| / Rl
where Rl is the range of the lth variable over all objects, written as Rl = max_i xil − min_i xil.
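The mixed-variable similarity above can be sketched as follows; the per-feature `kinds` and `ranges` arguments are assumptions of this illustration, and `None` stands for a missing measurement (δijl = 0).

```python
def mixed_similarity(xi, xj, kinds, ranges):
    """Gower-style similarity: kinds[l] is 'discrete' or 'continuous';
    ranges[l] is R_l for continuous features (ignored for discrete)."""
    num = den = 0.0
    for l, (u, v) in enumerate(zip(xi, xj)):
        if u is None or v is None:   # delta_ijl = 0: skip missing features
            continue
        if kinds[l] == "discrete":
            s = 1.0 if u == v else 0.0
        else:
            s = 1.0 - abs(u - v) / ranges[l]
        num += s
        den += 1.0
    return num / den

# Hypothetical objects: one discrete match, one continuous gap, one missing.
xi = ["red", 10.0, None]
xj = ["red", 20.0, 5.0]
kinds = ["discrete", "continuous", "continuous"]
ranges = [None, 40.0, 10.0]
s = mixed_similarity(xi, xj, kinds, ranges)   # (1 + 0.75) / 2 = 0.875
d = 1.0 - s
```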