# Clustering: Introduction Adriano Joaquim de O Cruz ©2002 NCE/UFRJ


Clustering: Introduction Adriano Joaquim de O Cruz ©2002 NCE/UFRJ adriano@nce.ufrj.br

## Introduction

*@2001 Adriano Cruz *NCE e IM - UFRJ

### What is cluster analysis?

- The process of grouping a set of physical or abstract objects into classes of similar objects.
- The class labels are not known in advance.
- Classification, by contrast, separates objects into classes whose labels are known.

### What is cluster analysis? (cont.)

- Clustering is a form of learning by observation.
- Neural networks learn by examples.
- Clustering is unsupervised learning.

### Applications

- In business, clustering helps to discover distinct groups of customers.
- In data mining, it is used to gain insight into the distribution of the data and to observe the characteristics of each cluster.
- It serves as a pre-processing step for classification.
- Pattern recognition.

### Requirements

- Scalability: work with large databases.
- Ability to deal with different types of attributes (not only interval-based data).
- Clusters of arbitrary shape, not only spherical.
- Minimal requirements for domain knowledge.
- Ability to deal with noisy data.

### Requirements (cont.)

- Insensitivity to the order of input records.
- Work with samples of high dimensionality.
- Constraint-based clustering.
- Interpretability and usability: results should be easily interpretable.

### Sensitivity to Input Order

- Some algorithms are sensitive to the order of the input data.
- The Leader algorithm is an example.
- Figure labels: Ellipse: 2 1 3 5 4 6; Triangle: 1 2 6 4 5 3
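The slide names the Leader algorithm without describing it. The sketch below assumes the common single-pass formulation: each item joins the first existing leader within a distance threshold, otherwise it becomes a new leader. The function name, the 1-D data and the threshold are illustrative assumptions, not taken from the presentation.

```python
def leader_clustering(values, threshold):
    """Single-pass leader clustering on 1-D values (assumed formulation)."""
    leaders = []   # first member of each cluster
    clusters = []  # members of each cluster, in arrival order
    for x in values:
        for idx, leader in enumerate(leaders):
            if abs(x - leader) <= threshold:
                clusters[idx].append(x)  # join the first leader that is close enough
                break
        else:
            leaders.append(x)            # no leader is close enough: x becomes a leader
            clusters.append([x])
    return clusters


# The same three values presented in two different orders.
print(leader_clustering([1.0, 2.0, 3.0], threshold=1.0))
print(leader_clustering([2.0, 1.0, 3.0], threshold=1.0))
```

With the first order the value 3.0 is too far from the leader 1.0 and starts its own cluster; with the second order the leader is 2.0 and all three values end up together, so the result depends on the input order.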

## Clustering Techniques

### Heuristic Clustering Techniques

- Incomplete or heuristic clustering: geometrical methods or projection techniques.
- Dimension reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
- Heuristic methods based on visualisation are used to determine the clusters.

### Deterministic Crisp Clustering

- Each datum is assigned to exactly one cluster.
- Each cluster partition defines an ordinary partition of the data set.

### Overlapping Crisp Clustering

- Each datum is assigned to at least one cluster.
- Elements may belong to more than one cluster at various degrees.

### Probabilistic Clustering

- For each element, a probability distribution over the clusters is determined.
- The distribution specifies the probability with which a datum is assigned to a cluster.
- If the probabilities are interpreted as degrees of membership, then these are fuzzy clustering techniques.

### Possibilistic Clustering

- Degrees of membership or possibility indicate to what extent a datum belongs to the clusters.
- Possibilistic cluster analysis drops the constraint that the sum of the memberships of each datum to all clusters is equal to one.

### Hierarchical Clustering

- Descending techniques: they divide the data into more fine-grained classes.
- Ascending techniques: they combine small classes into more coarse-grained ones.

### Objective Function Clustering

- An objective function assigns to each cluster partition values that have to be optimised.
- This is strictly an optimisation problem.

## Data Types

### Data Types

- Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.
- Binary variables have only two states. Ex. smoker, fever, client, owner.
- Nominal variables are a generalisation of binary variables to m states. Ex. map colour, marital status.

### Data Types (cont.)

- Ordinal variables are ordered nominal variables. Ex. Olympic medals, professional ranks.
- Ratio-scaled variables have a non-linear scale. Ex. growth of a bacterial population.

### Interval-scaled variables

- Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.
- Their values depend on the units used.
- The measurement unit can affect the analysis, so standardisation should be used.

### Problems

| Person | Age (yr) | Height (cm) |
|---|---|---|
| A | 35 | 190 |
| B | 40 | 190 |
| C | 35 | 160 |
| D | 40 | 160 |

### Standardisation

- Converts the original measurements to unitless values.
- Attempts to give all variables equal weight.
- Useful when there is no prior knowledge of the data.

### Standardisation algorithm

- Z-scores indicate how far and in what direction an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
- The transformed scores have a mean of zero and a standard deviation of one.
- They are useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.

### Standardisation algorithm (cont.)

- Consider $n$ values $x_1, \dots, x_n$ of a variable $x$.
- Calculate the mean value: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
- Calculate the standard deviation $s$.
- Calculate the z-score: $z_i = \frac{x_i - \bar{x}}{s}$.
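The steps above can be sketched in a few lines of Python, applied to the ages and heights of persons A to D from the Problems slide. The function name is illustrative, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slides do not say whether the population or the sample form is meant.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return [(x - m) / s for x in values]


ages = [35, 40, 35, 40]          # years, persons A to D
heights = [190, 190, 160, 160]   # centimetres, persons A to D

print(z_scores(ages))
print(z_scores(heights))
```

After standardisation both variables take only the values -1 and +1, so a 5-year age gap and a 30 cm height gap carry the same weight in a distance computation.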

[Figure: Z-scores example]

[Figure: Real heights and ages charts]

[Figure: Z-scores for heights and ages]

[Figures: Data chart (two slides)]

## Similarities

### Data Matrices

- Data matrix: represents $n$ objects with $p$ characteristics.
- Ex. person = {age, sex, income, ...}
- Dissimilarity matrix: represents the dissimilarities between all pairs of objects.

### Dissimilarities

- A dissimilarity measures some form of distance between objects.
- Clustering algorithms use dissimilarities to cluster data.
- How can dissimilarities be measured?

### How to calculate dissimilarities?

- The most popular methods are based on the distance between pairs of objects.
- Minkowski distance: $d(i,j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q}$
- $p$ is the number of characteristics.
- $q$ is the distance type.
- $q = 2$ gives the Euclidean distance; $q = 1$ gives the Manhattan distance.
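As a small illustration of the Minkowski distance, the sketch below computes it for two objects given as equal-length sequences of numbers. The function name and the sample points are illustrative assumptions.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two objects with p characteristics."""
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of characteristics")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


i = [0.0, 0.0]
j = [3.0, 4.0]
print(minkowski(i, j, q=2))  # Euclidean distance
print(minkowski(i, j, q=1))  # Manhattan distance
```

For these two points the Euclidean distance is 5 and the Manhattan distance is 7, which shows how the choice of $q$ changes the notion of closeness.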

### Similarities

- It is also possible to work with similarities $s(x_i, x_j)$.
- $0 \le s(x_i, x_j) \le 1$
- $s(x_i, x_i) = 1$
- $s(x_i, x_j) = s(x_j, x_i)$
- A dissimilarity can then be taken as $d(x_i, x_j) = 1 - s(x_i, x_j)$.

[Figure: Distances]

### Dissimilarities

- There are other ways to obtain dissimilarities.
- These need not be distances, so we no longer speak of distances.
- Basically, dissimilarities are nonnegative numbers $d(i,j)$ that are small (close to 0) when $i$ and $j$ are similar.

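### Pearson

- Pearson product-moment correlation between variables $f$ and $g$: $R(f,g) = \frac{\sum_{i=1}^{n} (x_{if} - m_f)(x_{ig} - m_g)}{\sqrt{\sum_{i=1}^{n} (x_{if} - m_f)^2} \sqrt{\sum_{i=1}^{n} (x_{ig} - m_g)^2}}$, where $m_f$ and $m_g$ are the means of $f$ and $g$ over the $n$ objects.
- Coefficients lie between -1 and +1.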

### Pearson (cont.)

- A correlation of +1 means that there is a perfect positive linear relationship between the variables.
- A correlation of -1 means that there is a perfect negative linear relationship between the variables.
- A correlation of 0 means that there is no linear relationship between the two variables.

### Pearson: example

- $r_{yz} = 0.9861$; $r_{yw} = -0.9551$; $r_{yr} = 0.2770$

### Correlation and dissimilarities 1

- $d(f,g) = \frac{1 - R(f,g)}{2}$ (1)
- Variables with a high positive correlation (close to +1) receive a dissimilarity close to 0.
- Variables with a strongly negative correlation are considered very dissimilar (dissimilarity close to 1).

### Correlation and dissimilarities 2

- $d(f,g) = 1 - |R(f,g)|$ (2)
- Variables with a high positive correlation (close to +1) or a strongly negative correlation (close to -1) receive a dissimilarity close to 0.

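### Numerical Example

| Name | Weight | Height | Month | Year |
|---|---|---|---|---|
| Ilan | 15 | 95 | 1 | 82 |
| Jack | 49 | 156 | 5 | 55 |
| Kim | 13 | 95 | 11 | 81 |
| Lieve | 45 | 160 | 7 | 56 |
| Leon | 85 | 178 | 6 | 48 |
| Peter | 66 | 176 | 6 | 56 |
| Talia | 12 | 90 | 12 | 83 |
| Tina | 10 | 78 | 1 | 84 |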


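### Numerical Example 1

Correlations:

| Quantity | Weight | Height | Month | Year |
|---|---|---|---|---|
| Weight | 1 | | | |
| Height | 0.957 | 1 | | |
| Month | -0.036 | 0.021 | 1 | |
| Year | -0.953 | -0.985 | 0.013 | 1 |

Dissimilarities from equation (1):

| Quantity | Weight | Height | Month | Year |
|---|---|---|---|---|
| Weight | 0 | | | |
| Height | 0.021 | 0 | | |
| Month | 0.518 | 0.489 | 0 | |
| Year | 0.977 | 0.992 | 0.493 | 0 |

Dissimilarities from equation (2):

| Quantity | Weight | Height | Month | Year |
|---|---|---|---|---|
| Weight | 0 | | | |
| Height | 0.043 | 0 | | |
| Month | 0.964 | 0.979 | 0 | |
| Year | 0.047 | 0.015 | 0.987 | 0 |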
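The Weight/Height entries of the numerical example can be reproduced with a short sketch. The function names are illustrative assumptions; the data are the Weight and Height columns of the eight persons in the example, and the two conversions are equations (1) and (2).

```python
from math import sqrt


def pearson(f, g):
    """Pearson product-moment correlation between two variables."""
    n = len(f)
    mean_f = sum(f) / n
    mean_g = sum(g) / n
    cov = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g))
    var_f = sum((a - mean_f) ** 2 for a in f)
    var_g = sum((b - mean_g) ** 2 for b in g)
    return cov / sqrt(var_f * var_g)


def dissimilarity_signed(r):
    """Equation (1): negative correlation counts as dissimilar."""
    return (1 - r) / 2


def dissimilarity_absolute(r):
    """Equation (2): only the strength of the correlation matters."""
    return 1 - abs(r)


weight = [15, 49, 13, 45, 85, 66, 12, 10]
height = [95, 156, 95, 160, 178, 176, 90, 78]

r = pearson(weight, height)
print(round(r, 3), round(dissimilarity_signed(r), 3), round(dissimilarity_absolute(r), 3))
```

The correlation comes out at about 0.957, so the two variables are close under both equations; they differ only for negatively correlated pairs such as Weight and Year.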

### Binary Variables

- Binary variables have only two states.
- States can be symmetric or asymmetric.
- Binary variables are symmetric if both states are equally valuable. Ex. gender.
- When the states are not equally important, the variable is asymmetric. Ex. disease tests (1 = positive; 0 = negative).

### Contingency tables

- Consider objects $i$ and $j$ described by $p$ binary variables.
- $q$ variables are equal to 1 on both $i$ and $j$.
- $r$ variables are 1 on $i$ and 0 on $j$.
- $s$ variables are 0 on $i$ and 1 on $j$.
- $t$ variables are equal to 0 on both $i$ and $j$, so $p = q + r + s + t$.

### Symmetric Variables

- Dissimilarity based on symmetric variables is invariant.
- The result should not change when the two states are interchanged.
- Simple dissimilarity coefficient: $d(i,j) = \frac{r + s}{q + r + s + t}$

### Symmetric Variables (cont.)

- Dissimilarity: $d(i,j) = \frac{r + s}{q + r + s + t}$
- Similarity: $\mathrm{sim}(i,j) = \frac{q + t}{q + r + s + t} = 1 - d(i,j)$

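### Asymmetric Variables

- Similarity based on asymmetric variables is not invariant.
- Two ones (a positive match) are more important than two zeros (a negative match).
- Jaccard coefficient: $\mathrm{sim}(i,j) = \frac{q}{q + r + s}$, which gives the dissimilarity $d(i,j) = \frac{r + s}{q + r + s}$.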

[Figure: Computing dissimilarities]

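### Computing Dissimilarities

| Variable | Jack | Mary | q (1,1) | r (1,0) | s (0,1) | t (0,0) |
|---|---|---|---|---|---|---|
| Fever | Y | Y | 1 | 0 | 0 | 0 |
| Cough | N | N | 0 | 0 | 0 | 1 |
| Test1 | P | P | 1 | 0 | 0 | 0 |
| Test2 | N | N | 0 | 0 | 0 | 1 |
| Test3 | N | P | 0 | 0 | 1 | 0 |
| Test4 | N | N | 0 | 0 | 0 | 1 |
| Total | | | 2 | 0 | 1 | 3 |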

### Computing dissimilarities (cont.)

Jim and Mary have the highest dissimilarity value, so they are unlikely to have the same disease.
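The Jack and Mary rows of the example can be checked with the sketch below. It assumes the coding Y/P = 1 and N = 0, and it uses the usual simple-matching and Jaccard forms built from the counts q, r, s and t; the function names are illustrative. Jim's values are not present in this transcript, so he is left out.

```python
def contingency(obj_i, obj_j):
    """Count the (1,1), (1,0), (0,1) and (0,0) variables of two binary objects."""
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def simple_matching_dissimilarity(obj_i, obj_j):
    """Symmetric variables: all four counts are used."""
    q, r, s, t = contingency(obj_i, obj_j)
    return (r + s) / (q + r + s + t)


def jaccard_dissimilarity(obj_i, obj_j):
    """Asymmetric variables: the (0,0) matches are ignored."""
    q, r, s, _ = contingency(obj_i, obj_j)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (r + s) / (q + r + s)


# Fever, Cough, Test1, Test2, Test3, Test4 (Y/P coded as 1, N coded as 0)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]

print(contingency(jack, mary))
print(jaccard_dissimilarity(jack, mary))
print(simple_matching_dissimilarity(jack, mary))
```

The counts match the Total row of the table (q = 2, r = 0, s = 1, t = 3). The Jaccard dissimilarity is 1/3, while simple matching gives 1/6 because the three shared negative results also count as agreement.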

### Nominal Variables

- A nominal variable is a generalisation of the binary variable.
- A nominal variable can take more than two states.
- Ex. marital status: married, single, divorced.
- Each state can be represented by a number or a letter.
- There is no specific ordering.

### Computing dissimilarities

- Consider two objects $i$ and $j$ described by nominal variables.
- Each object has $p$ characteristics.
- $m$ is the number of matches, which gives $d(i,j) = \frac{p - m}{p}$.
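A minimal sketch of this matching-based dissimilarity, assuming the usual form in which the number of mismatches $p - m$ is divided by the number of characteristics $p$. The function name and the two sample objects are invented for illustration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Fraction of nominal characteristics on which two objects differ."""
    if len(obj_i) != len(obj_j):
        raise ValueError("both objects must have the same number of characteristics")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# Hypothetical objects: (marital status, map colour, blood group)
person_a = ("married", "red", "A")
person_b = ("single", "red", "A")

print(nominal_dissimilarity(person_a, person_b))
```

Two of the three characteristics match here, so the dissimilarity is 1/3; identical objects give 0 and objects with no matches give 1.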

### Binarising nominal variables

- A nominal variable can be encoded by creating a new binary variable for each state.
- Example: marital status = {married, single, divorced}
  - Married: 1 = yes, 0 = no
  - Single: 1 = yes, 0 = no
  - Divorced: 1 = yes, 0 = no
- Ex. marital status = married gives married = 1, single = 0, divorced = 0.

### Ordinal variables

- A discrete ordinal variable is similar to a nominal variable, except that the states are ordered in a meaningful sequence.
- Ex. bronze, silver and gold medals.
- Ex. assistant, associate, full member.

### Computing dissimilarities

- Consider $n$ objects defined by a set of ordinal variables.
- $f$ is one of these ordinal variables and has $M_f$ states.
- These states define the ranking $r_f \in \{1, \dots, M_f\}$.

### Steps to calculate dissimilarities

- Assume that the value of $f$ for the $i$th object is $x_{if}$. Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \dots, M_f\}$.
- Since the number of states differs from variable to variable, it is often necessary to map the range onto [0.0, 1.0] using the equation $z_{if} = \frac{r_{if} - 1}{M_f - 1}$.
- Dissimilarity can then be computed using the distance measures for interval-scaled variables.
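The steps above can be sketched as follows, assuming the usual rank mapping in which rank 1 goes to 0.0 and rank $M_f$ goes to 1.0. The function name is illustrative and the medal example comes from the ordinal variables slide.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank and map the rank onto [0.0, 1.0]."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("an ordinal variable needs at least two states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


medal_order = ["bronze", "silver", "gold"]    # lowest to highest
observed = ["gold", "bronze", "silver"]

z = ordinal_to_unit_interval(observed, medal_order)
print(z)
print(abs(z[0] - z[1]))  # interval-style dissimilarity between gold and bronze
```

Once mapped, the values behave like interval-scaled data: gold and bronze are at distance 1.0, while silver lies halfway between them.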

### Ratio-scaled variables

- Variables on a non-linear scale, such as an exponential one.
- There are three methods to compute dissimilarities:
  - Treat them as interval-scaled. Not always good.
  - Apply a transformation such as $y = \log(x)$ and treat the result as interval-scaled.
  - Treat them as ordinal data and treat the ranks as interval-scaled.

### Variables of mixed types

- One technique is to bring all variables onto a common scale, the interval [0.0, 1.0].
- Suppose that the data set contains $p$ variables of mixed type. The dissimilarity between $i$ and $j$ is $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
- The indicator $\delta_{ij}^{(f)}$ is 0 when the value of $f$ is missing for $i$ or $j$ (or when both values are 0 for an asymmetric binary variable) and 1 otherwise.

### Variables of mixed types (cont.)

- The contribution $d_{ij}^{(f)}$ of each variable depends on its type.
- $f$ is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$.
- $f$ is interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
- $f$ is ordinal or ratio-scaled: compute the ranks and treat them as interval-based.
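A rough sketch of a mixed-type dissimilarity, assuming the usual weighted-average form: each variable contributes a value in [0, 1], and variables with a missing value (here `None`) are skipped. The function name, the `kinds` and `ranges` parameters and the two sample objects are hypothetical; asymmetric binary variables are not handled.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Average the per-variable contributions over the variables that can be compared.

    kinds[f] is "nominal" (also used for binary) or "interval";
    ranges[f] is max - min of variable f over the data set (interval only).
    """
    total = 0.0
    compared = 0
    for f, kind in enumerate(kinds):
        x_i, x_j = obj_i[f], obj_j[f]
        if x_i is None or x_j is None:
            continue  # missing value: this variable does not take part
        if kind == "nominal":
            contribution = 0.0 if x_i == x_j else 1.0
        elif kind == "interval":
            contribution = abs(x_i - x_j) / ranges[f]
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        total += contribution
        compared += 1
    if compared == 0:
        raise ValueError("the objects share no comparable variables")
    return total / compared


kinds = ["interval", "nominal", "interval"]   # age, marital status, height
ranges = [50.0, None, 40.0]                   # hypothetical ranges over the data set

a = (35, "single", 190)
b = (40, "married", 160)
print(mixed_dissimilarity(a, b, kinds, ranges))
```

Ordinal and ratio-scaled variables would first be converted to ranks on [0.0, 1.0] and then passed in as interval variables with a range of 1.0.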

## Clustering Methods

### Classification types

- Clustering is an unsupervised method.

### Clustering Methods

- Partitioning
- Hierarchical
- Density-based
- Grid-based
- Model-based

### Partitioning Methods

- Given $n$ objects, $k$ partitions are created.
- Each partition must contain at least one element.
- An iterative relocation technique is used to improve the partitioning.
- Distance is the usual criterion.

### Partitioning Methods (cont.)

- They work well for finding spherical-shaped clusters.
- They are not efficient on very large databases.
- K-means: each cluster is represented by the mean value of the objects in the cluster.
- K-medoids: each cluster is represented by an object near the centre of the cluster.
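To make the iterative relocation idea concrete, here is a minimal k-means sketch. It is an illustration under stated assumptions, not the presentation's own algorithm: the first k points are used as initial centres, squared Euclidean distance is the criterion, and the sample points are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, max_iterations=100):
    centres = list(points[:k])  # assumption: seed with the first k points
    clusters = []
    for _ in range(max_iterations):
        # Relocation step: assign every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the mean of its cluster.
        new_centres = [mean_point(members) if members else centres[c]
                       for c, members in enumerate(clusters)]
        if new_centres == centres:
            break  # no centre moved, so the partition is stable
        centres = new_centres
    return centres, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centres, clusters = kmeans(points, k=2)
print(centres)
print(clusters)
```

On this toy data the first assignment is poor because both seeds lie in the same group, and the relocation steps then move (1.0, 2.0) to the other cluster. A k-medoids variant would replace `mean_point` with the cluster member that minimises the total distance to the other members.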

### Hierarchical Methods

- They create a hierarchical decomposition of the set.
- Agglomerative approaches start with each object forming a separate group.
- They merge objects or groups until all objects belong to one group or a termination condition occurs.
- Divisive approaches start with all objects in the same cluster.
- Each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.

### Hierarchical Clustering (cont.)

- Definition of cluster proximity:
- Min: the distance between the two most similar members (sensitive to noise).
- Max: the distance between the two most dissimilar members (tends to break large clusters).
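The effect of the two proximity definitions can be seen in a small agglomerative sketch. It assumes 1-D data, absolute difference as the distance, and a target number of clusters as the termination condition; the function name and the sample values are illustrative. Passing Python's built-in `min` or `max` as `linkage` selects the Min or Max definition.

```python
def agglomerative(values, n_clusters, linkage=min):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[v] for v in values]  # start: every object is its own cluster
    while len(clusters) > n_clusters:
        best = None  # (proximity, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                proximity = linkage(abs(x - y)
                                    for x in clusters[a] for y in clusters[b])
                if best is None or proximity < best[0]:
                    best = (proximity, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


data = [1.0, 2.0, 4.0, 8.0, 13.0]
print(agglomerative(data, 2, linkage=min))
print(agglomerative(data, 2, linkage=max))
```

With Min the growing cluster keeps absorbing its nearest neighbour and ends as [1, 2, 4, 8] plus the single value 13, a chaining effect. With Max the large cluster is penalised by its far members, and the result is [1, 2, 4] and [8, 13].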

### Density-based methods

- The method keeps growing a cluster as long as the density in the neighbourhood exceeds some threshold.
- Able to find clusters of arbitrary shapes.

### Grid-based methods

- Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
- Cells that contain more than a certain number of elements are treated as dense.
- Dense cells are connected to form clusters.
- Fast processing time, independent of the number of objects.
- STING and CLIQUE are examples.

### Model-based methods

- Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.
- Statistical models
- SOM networks

### Partition methods

- Given a database of $n$ objects, a partition method organises them into $k$ clusters ($k \le n$).
- The methods try to minimise an objective function, such as distance.
- Similar objects are “close” to each other.