
1 Cluster Analysis Chapter 12

2 Cluster analysis
Cluster analysis is a class of techniques used to classify cases into groups that are relatively homogeneous within themselves and heterogeneous with respect to each other. Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables. These groups are called clusters.

3 Market segmentation
Cluster analysis is especially useful for market segmentation. Segmenting a market means dividing its potential consumers into separate sub-sets where:
consumers in the same group are similar with respect to a given set of characteristics
consumers belonging to different groups are dissimilar with respect to the same set of characteristics
This allows one to calibrate the marketing mix differently according to the target consumer group.

4 Other uses of cluster analysis
Product characteristics and the identification of new product opportunities: clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches.
Data reduction: factor analysis and principal component analysis reduce the number of variables, while cluster analysis reduces the number of observations by grouping them into homogeneous clusters.
Maps profiling consumers and products simultaneously, together with market opportunities and preferences, as in preference or perceptual mapping (lecture 14).

5 Steps to conduct a cluster analysis
Select a distance measure
Select a clustering algorithm
Define the distance between two clusters
Determine the number of clusters
Validate the analysis

6 Distance measures for individual observations
To measure similarity between two observations a distance measure is needed. With a single variable, similarity is straightforward. Example: income – two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases. Multiple variables require an aggregate distance measure: with many characteristics (e.g. income, age, consumption habits, family composition, car ownership, education level, job…) it becomes more difficult to define similarity with a single value. The best known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.

7 Examples of distances
Euclidean distance: $D_{ij} = \sqrt{\sum_{k} (x_{ki} - x_{kj})^2}$
City-block (Manhattan) distance: $D_{ij} = \sum_{k} |x_{ki} - x_{kj}|$
where $D_{ij}$ is the distance between cases i and j and $x_{kj}$ is the value of variable $x_k$ for case j.
Problems: variables measured on different scales receive different weights; correlation between variables (double counting).
Solution: standardization, rescaling, principal component analysis.
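As an illustration, here is a minimal Python sketch of the two distances above on made-up values for three variables (income, age, household size); scipy's euclidean and cityblock functions implement exactly these formulas. Note how the income variable would dominate both distances unless the data were standardized first.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock

# Hypothetical profiles for two consumers i and j (income, age, household size)
x_i = np.array([35000.0, 42.0, 3.0])
x_j = np.array([28000.0, 35.0, 2.0])

# Euclidean distance: square root of the sum of squared differences
d_euclid = euclidean(x_i, x_j)      # same as np.sqrt(((x_i - x_j) ** 2).sum())

# City-block (Manhattan) distance: sum of absolute differences
d_manhattan = cityblock(x_i, x_j)   # same as np.abs(x_i - x_j).sum()

print(d_euclid, d_manhattan)
```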

8 Other distance measures
Other distance measures include the Chebychev, Minkowski and Mahalanobis distances. An alternative approach uses correlation measures, where correlations are computed not between variables but between observations: each observation is characterized by a set of measurements (one for each variable), and bivariate correlations can be computed between two observations.
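A minimal sketch of the correlation-between-observations idea on a made-up data matrix: passing the observations-by-variables matrix to np.corrcoef yields correlations between rows (observations), which can be turned into a dissimilarity as 1 − correlation.

```python
import numpy as np

# Rows = observations, columns = variables (illustrative data)
X = np.array([[1.0, 5.0, 3.0, 2.0],
              [2.0, 6.0, 4.0, 3.0],
              [9.0, 1.0, 7.0, 5.0]])

# np.corrcoef treats each row as a "variable", so passing the data matrix
# directly yields correlations between observations (their variable profiles)
obs_corr = np.corrcoef(X)

# A correlation-based dissimilarity: high correlation -> small distance
obs_dist = 1.0 - obs_corr
print(np.round(obs_dist, 2))
```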

9 Clustering procedures
Hierarchical procedures:
agglomerative (start from n clusters to arrive at 1 cluster)
divisive (start from 1 cluster to arrive at n clusters)
Non-hierarchical procedures:
k-means clustering

10 Hierarchical clustering
Agglomerative:
Each of the n observations constitutes a separate cluster.
The two clusters that are most similar according to some distance rule are aggregated, so that after step 1 there are n-1 clusters.
In the second step another cluster is formed (n-2 clusters) by nesting the two clusters that are most similar, and so on.
There is a merge in each step, until all observations end up in a single cluster in the final step.
Divisive:
All observations are initially assumed to belong to a single cluster.
The most dissimilar observation is extracted to form a separate cluster.
After step 1 there are 2 clusters, after the second step three clusters, and so on, until the final step produces as many clusters as there are observations.
The desired number of clusters determines the stopping rule for these algorithms.

11 Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a single partition.
Knowledge of the number of clusters (c) is required.
In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations, or c observations chosen at random).
Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres.
Cluster centres are then recomputed, and observations may be reallocated to the nearest cluster in the next iteration.
When no observations can be reallocated, or a stopping rule is met, the process stops.

12 Distance between clusters
Algorithms vary according to the way the distance between two clusters is defined. The most common algorithms for hierarchical methods include:
single linkage method
complete linkage method
average linkage method
Ward algorithm (see slide 14)
centroid method (see slide 15)

13 Linkage methods
Single linkage method (nearest neighbour): the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.
Complete linkage method (furthest neighbour): nests two clusters using as a basis the maximum distance between observations belonging to the separate clusters.
Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters.

14 Ward algorithm
The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster.
The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.
It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with the potential increase in the total sum of squared distances for each possible aggregation of clusters.

15 Centroid method
The distance between two clusters is the distance between the two centroids. Centroids are the cluster averages for each of the variables: each cluster is defined by a single set of coordinates, the averages of the coordinates of all individual observations belonging to that cluster.
Difference between the centroid and the average linkage method: the centroid method computes the average of the coordinates of the observations belonging to an individual cluster, whereas average linkage computes the average of the distances between observations in two separate clusters.
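The between-cluster distance definitions of slides 12–15 map directly onto the method argument of scipy's hierarchical clustering routine; a short sketch on random toy data (the choice of 4 clusters at the end is arbitrary here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 observations, 3 standardized variables (toy data)

# Between-cluster distance definitions discussed on slides 13-15
Z_single   = linkage(X, method='single')    # nearest neighbour
Z_complete = linkage(X, method='complete')  # furthest neighbour
Z_average  = linkage(X, method='average')   # average of all pairwise distances
Z_ward     = linkage(X, method='ward')      # smallest increase in within-cluster SS
Z_centroid = linkage(X, method='centroid')  # distance between cluster centroids

# Cut the Ward tree into, say, 4 clusters
labels = fcluster(Z_ward, t=4, criterion='maxclust')
```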

16 Non-hierarchical clustering: K-means method
The number k of clusters is fixed in advance.
An initial set of k “seeds” (aggregation centres) is provided: the first k elements, or other seeds (randomly selected or explicitly defined).
Given a certain fixed threshold, all units are assigned to the nearest cluster seed.
New seeds are computed.
The assignment step is repeated until no reclassification is necessary.
Units can be reassigned in successive steps (optimising partitioning).
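A minimal NumPy sketch of these k-means steps, written out explicitly for illustration (in practice one would normally use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

def k_means(X, k, max_iter=100):
    """Minimal k-means sketch following the steps listed on this slide."""
    seeds = X[:k].copy()                       # initial seeds: the first k elements
    labels = None
    for _ in range(max_iter):
        # assign every unit to the nearest seed (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # no unit was reclassified: stop
        labels = new_labels
        # recompute each seed as the mean of the units currently assigned to it
        for c in range(k):
            if np.any(labels == c):
                seeds[c] = X[labels == c].mean(axis=0)
    return labels, seeds

# Example usage on toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
labels, seeds = k_means(X, k=4)
```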

17 Non-hierarchical threshold methods
Sequential threshold methods: a prior threshold is fixed and units within that distance are allocated to the first seed; a second seed is then selected and the remaining units are allocated, and so on.
Parallel threshold methods: more than one seed is considered simultaneously.
When reallocation is possible after each stage, the methods are termed optimizing procedures.

18 Hierarchical vs. non-hierarchical methods
Hierarchical methods:
No decision about the number of clusters is required
Problems when data contain a high level of error
Can be very slow, preferable with small data-sets
Initial decisions are more influential (one-step only)
At each step they require computation of the full proximity matrix
Non-hierarchical methods:
Faster, more reliable, work with large data-sets
Need to specify the number of clusters
Need to set the initial seeds
Only cluster distances to seeds need to be computed in each iteration

19 The number of clusters c
Two alternatives: the number of clusters is either determined by the analysis or fixed by the researcher.
In segmentation studies, c represents the number of potential separate segments.
Preferable approach: “let the data speak”, i.e. use a hierarchical approach and identify the optimal partition through statistical tests (the stopping rule for the algorithm).
However, the detection of the optimal number of clusters is subject to a high degree of uncertainty.
If the research objectives allow one to choose rather than estimate the number of clusters, non-hierarchical methods are the way to go.

20 Example: fixed number of clusters
A retailer wants to identify several shopping profiles in order to activate new and targeted retail outlets. The budget only allows him to open three types of outlets. A partition into three clusters follows naturally, although it is not necessarily the optimal one. This calls for a fixed number of clusters and a non-hierarchical (k-means) approach.

21 Example: c determined from the data
Clustering of shopping profiles is expected to detect a new market niche. For market segmentation purposes, it is less advisable to constrain the analysis to a fixed number of clusters.
A hierarchical procedure allows one to explore all potentially valid numbers of clusters, and for each of them there are statistical diagnostics to pinpoint the best partition.
What is needed is a stopping rule for the hierarchical algorithm, which determines the number of clusters at which the algorithm should stop.
Statistical tests are not always univocal, leaving some room for the researcher’s experience and discretion. Statistical rigidities should be balanced with the knowledge gained from, and the interpretability of, the final classification.

22 Determining the optimal number of clusters from hierarchical methods
Graphical criteria: dendrogram, scree diagram.
Statistical criteria: Arnold’s criterion, pseudo-F statistic, pseudo-t² statistic, cubic clustering criterion (CCC).

23 Dendrogram
The individual cases appear at the bottom of the dendrogram. Early merges, such as that of case 231 and case 275, occur at relatively small distances; a dotted line marks the distance at which clusters are merged. As the algorithm proceeds, the merging distances become larger.
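A short sketch of how such a dendrogram can be produced with scipy on toy data; truncating the plot to the last merges keeps it readable when there are many cases:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy standardized data

Z = linkage(X, method='ward')
dendrogram(Z, truncate_mode='lastp', p=30)    # show only the last 30 merges
plt.ylabel('Merging distance')
plt.show()
```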

24 Scree diagram
The merging distance is plotted on the y-axis against the number of clusters. In the example, when one moves from 7 to 6 clusters the merging distance increases noticeably.
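SPSS does not produce the scree diagram directly, but it can be drawn from the merging distances of the last few steps of the hierarchical algorithm; a sketch on toy data (restricting the plot to the last 10 merges is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy standardized data

Z = linkage(X, method='ward')
last_merges = Z[-10:, 2]                      # merging distances of the last 10 steps
n_clusters = np.arange(10, 0, -1)             # clusters remaining after each merge

plt.plot(n_clusters, last_merges, marker='o')
plt.gca().invert_xaxis()                      # read from many clusters down to one
plt.xlabel('Number of clusters')
plt.ylabel('Merging distance')
plt.show()
```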

25 Statistical tests
The rationale is that in an optimal partition, variability within clusters should be as small as possible, while variability between clusters should be maximized.
This principle is similar to the one behind the ANOVA F-test.
However, since hierarchical algorithms proceed sequentially, the probability distribution of statistics relating within-cluster to between-cluster variability is unknown and differs from the F distribution.

26 Statistical criteria to detect the optimal partition
Arnold’s criterion: find the minimum of the determinant of the within-cluster sum of squares matrix W.
Pseudo-F, CCC and pseudo-t²: the ideal number of clusters should correspond to a local maximum of the pseudo-F and CCC, and to a small value of the pseudo-t² which increases at the next step (preferably a local minimum).
These criteria are rarely consistent with one another, so the researcher should also rely on meaningful (interpretable) partitions.
Non-parametric methods (available in SAS) also allow one to determine the number of clusters. In the k-th nearest neighbour method, the researcher sets a parameter k, and for each value of k the method returns an optimal number of clusters; if this optimal number is the same for several values of k, then the determination of the number of clusters is relatively robust.
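As an illustration of the pseudo-F idea, the Calinski–Harabasz index computed by scikit-learn corresponds to this statistic; a sketch that evaluates it for a range of candidate partitions cut from a Ward tree (toy data, and the range 2–8 is an arbitrary choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                 # toy standardized data

Z = linkage(X, method='ward')

# Pseudo-F (Calinski-Harabasz) for each candidate partition: look for a local maximum
for k in range(2, 9):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, round(calinski_harabasz_score(X, labels), 1))
```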

27 Suggested approach: two-step procedure
First run a hierarchical method to define the number of clusters, then use the k-means procedure to actually form the clusters.
The reallocation problem: a rigidity of hierarchical methods is that once a unit is classified into a cluster, it cannot be moved to another cluster in subsequent steps, whereas the k-means method allows a reclassification of all units in each iteration.
If some uncertainty about the number of clusters remains after running the hierarchical method, one may also run several k-means clustering procedures and apply the previously discussed statistical tests to choose the best partition.
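A sketch of this two-step strategy in Python, assuming the number of clusters (4 here) has already been chosen from the hierarchical diagnostics; the hierarchical cluster means serve as initial seeds for k-means:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # toy standardized data

# Step 1: hierarchical (Ward) run, used to choose the number of clusters
Z = linkage(X, method='ward')
k = 4                                          # assumed choice after the diagnostics
hier_labels = fcluster(Z, t=k, criterion='maxclust')

# Use the hierarchical cluster means as initial seeds for k-means
seeds = np.vstack([X[hier_labels == c].mean(axis=0) for c in range(1, k + 1)])

# Step 2: k-means refines the partition, allowing units to be reallocated
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
final_labels = km.labels_
```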

28 The SPSS two-step procedure
The observations are preliminarily aggregated into clusters using a hybrid hierarchical procedure based on a cluster feature tree. This first step produces a number of pre-clusters, which is larger than the final number of clusters but much smaller than the number of observations.
In the second step, a hierarchical method is used to classify the pre-clusters, obtaining the final classification. During this second clustering step it is possible to determine the number of clusters: the user can either fix it or let the algorithm search for the best one according to information criteria, which are also based on goodness-of-fit measures.
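The SPSS procedure itself is proprietary, but its cluster-feature-tree idea resembles the BIRCH algorithm; a rough analogue using scikit-learn's Birch on toy data (the threshold value here is an arbitrary choice, not the SPSS default):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy data standing in for the survey observations
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Birch builds a cluster-feature (CF) tree of pre-clusters and then applies a
# global clustering step to those pre-clusters, broadly mirroring the two-step idea
model = Birch(threshold=0.8, n_clusters=4).fit(X)
labels = model.labels_
```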

29 Evaluation and validation
Goodness of fit of a cluster analysis can be assessed through:
the ratio between the sum of squared errors and the total sum of squared errors (similar to R²)
the root mean standard deviation within clusters
Validation: if the identified cluster structure (number of clusters and cluster characteristics) is real, it should not depend on the particular sample or on the specific choices made in the analysis.
Validation approaches:
use of different samples to check whether the final output is similar
splitting the sample into two groups when no other samples are available
checking the impact of the initial seeds / order of cases (hierarchical approach) on the final partition
checking the impact of the selected clustering method
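A sketch of two of these checks on toy data: an R²-like share of variation explained by the clusters, and a split-sample stability check that compares two partitions of the same half-sample using the adjusted Rand index (k-means with k = 4 is assumed here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                 # toy standardized data
k = 4

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# R²-like goodness of fit: share of total variation explained by the clusters
grand_mean = X.mean(axis=0)
total_ss = ((X - grand_mean) ** 2).sum()
within_ss = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in range(k))
r_squared = 1.0 - within_ss / total_ss

# Split-sample stability: cluster two random halves, then compare two partitions
# of the second half (its own run vs. the first half's centroids)
idx = rng.permutation(len(X))
a, b = idx[:200], idx[200:]
km_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[a])
km_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[b])
stability = adjusted_rand_score(km_b.labels_, km_a.predict(X[b]))
print(round(r_squared, 3), round(stability, 3))
```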

30 Cluster analysis in SPSS
Three types of cluster analysis are available in SPSS

31 Hierarchical cluster analysis
Variables selected for the analysis
Create a new variable with cluster membership for each case
Clustering method and options
Statistics required in the analysis
Graphs (dendrogram) – advice: no plots

32 Statistics
The agglomeration schedule is a table which shows the steps of the clustering procedure, indicating which cases (clusters) are merged and the merging distance.
The cluster membership table shows the cluster membership of individual cases, but only for a sub-set of solutions.
The proximity matrix contains all distances between cases (it may be huge).

33 Plots
The dendrogram shows the clustering process, indicating which cases are aggregated and the merging distance. With many cases, the dendrogram is hardly readable.
The icicle plot (which can be restricted to cover a small range of clusters) shows at what stage cases are clustered. The plot is cumbersome and slows down the analysis (advice: no icicle).

34 Method Choose a hierarchical algorithm
Choose the type of data (interval, counts, binary) and the appropriate measure.
Specify whether the variables (values) should be standardized before the analysis: z-scores return variables with zero mean and unit variance, and other standardizations are possible. Distance measures can also be transformed.
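A minimal sketch of the z-score standardization step outside SPSS, on a made-up data matrix (rows are cases, columns are variables):

```python
import numpy as np
from scipy.stats import zscore

X = np.array([[35000.0, 42.0, 3.0],
              [28000.0, 35.0, 2.0],
              [51000.0, 58.0, 4.0]])

# Column-wise z-scores: each variable gets zero mean and unit variance,
# so no single variable dominates the distance computations
X_std = zscore(X, axis=0)
print(np.round(X_std, 2))
```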

35 Cluster memberships
If the number of clusters has been decided (or at least narrowed down to a range of solutions), it is possible to save the cluster membership of each case into new variables.

36 The example: agglomeration schedule
Last 10 stages of the process (from 10 clusters down to 1):

Stage   Number of clusters   Distance   Diff. Dist
490     10                   544.4      –
491     9                    559.3      14.9
492     8                    575.0      15.7
493     7                    591.6      16.6
494     6                    610.6      19.0
495     5                    636.6      26.0
496     4                    663.7      27.1
497     3                    700.8      37.1
498     2                    754.1      53.3
499     1                    864.2      110.2

As the algorithm proceeds towards the end, the merging distance increases.

37 Scree diagram
The scree diagram (not provided by SPSS, but created from the agglomeration schedule) shows a larger increase in the merging distance when the number of clusters goes below 4 – a possible elbow.

38 Non-hierarchical solution with 4 clusters

39 K-means solution (4 clusters)
Specify the variables and the (fixed) number of clusters.
Ask for a single pass (classify only) or for more iterations before stopping the algorithm.
It is possible to read a file with the initial seeds or to write the final seeds to a file.

40 K-means options
Create a new variable with the cluster membership of each case.
Improve the algorithm by allowing more iterations and using running means (seeds are recomputed at each stage).
More options are available, including an ANOVA table with statistics.

41 Results from k-means (initial seeds chosen by SPSS)
The k-means algorithm is sensitive to outliers, and SPSS chose an implausibly large recreation expenditure as the initial seed for cluster 2 (probably an outlier due to misrecording or an exceptional expenditure).

42 Results from k-means: initial seeds from hierarchical clustering
The first cluster is now larger, but it still represents older and poorer households. The other clusters are not very different from the ones obtained with the Ward algorithm, indicating a certain robustness of the results.

43 2-step clustering
It is possible to distinguish between categorical and continuous variables.
An information criterion is used to choose the optimal partition.
The search for the optimal number of clusters may be constrained.
One may also ask for plots and descriptive statistics.

44 Options
It is advisable to control for outliers, because the analysis is usually sensitive to them.
It is possible to choose which variables should be standardized prior to running the analysis.
More advanced options are available for better control of the procedure.

45 Output Results are not satisfactory
With no prior decision on the number of clusters, two clusters are found: one with a single observation and the other with the remaining 499 observations. Allowing for outlier treatment does not improve the results. Setting the number of clusters to four produces the results shown.
It seems that the two-step clustering is biased towards finding one macro-cluster. This might be due to the fact that the number of observations is relatively small; in any case, the combination of the Ward algorithm with the k-means algorithm is more effective here.

46 SAS cluster analysis
Compared to SPSS, SAS provides more diagnostics and the option of non-parametric clustering, through three groups of SAS/STAT procedures:
the CLUSTER and VARCLUS procedures (for hierarchical and k-th nearest neighbour methods)
the FASTCLUS procedure (for non-hierarchical methods)
the MODECLUS procedure (for non-parametric methods)

47 Discussion
It might seem that cluster analysis is too sensitive to the researcher’s choices. This is partly due to the relatively small data-set and possibly to correlation between variables.
However, all outputs point to a segment of older and poorer households and another of younger and larger households with high expenditures.
By intensifying the search and adjusting some of the settings, cluster analysis does help identify homogeneous groups.
“Moral”: cluster analysis needs to be adequately validated; it is risky to run a single cluster analysis and take the results as truly informative, especially in the presence of outliers.

