Cluster Analysis Purpose and process of clustering Profile analysis Selection of variables and sample Determining the # of clusters
Intro to Clustering
Clustering is like "reverse linear discriminant analysis": you are looking for groups, but cannot define them a priori. The usual starting point is a "belief" that a particular population is not homogeneous -- that there are two or more "kinds" of folks in the group.
Reasons for this "belief" -- usually your own "failure":
- predictive models don't work, or seem way too complicated (need lots of unrelated predictors)
- treatment programs only work for some folks
- "best predictors" or "best treatments" vary across folks
- "gut feeling"
Process of a Cluster Analysis
- Identification of the target population
- Identification of a likely set of variables to "carve" the population into groups
- Broad sampling procedures -- represent all "groups"
- Construct a "profile" for each participant
- Compute "similarities" among all participant profiles
- Determine the # of clusters
- Identify who is in which cluster
- Describe/interpret the clusters
- Plan the next study -- replication & extension
Profiles and Profile Comparisons
A profile is a person's "pattern" of scores for a set of variables. Profiles differ in two basic ways:
- level -- average value ("height")
- shape -- peaks & dips
[Figure: profiles for 4 folks across 5 variables, labeled A-E]
Who's most "similar" to whom? How should we measure "similarity" -- level, shape, or both? Cluster analysts are usually interested in both shape and level differences among groups.
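The level-versus-shape distinction can be made concrete with two common similarity measures: Euclidean distance responds to both level and shape, while the Pearson correlation between two profiles responds to shape only. A minimal sketch with made-up profile scores (the names and numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical profiles for three people across five variables
a = np.array([10.0, 20.0, 15.0, 25.0, 30.0])
b = a + 5.0                                   # same shape as a, higher level
c = np.array([30.0, 25.0, 15.0, 20.0, 10.0])  # different shape

def euclidean(p, q):
    """Profile distance sensitive to BOTH level and shape differences."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

def shape_r(p, q):
    """Pearson r between profiles: sensitive to shape only (level drops out)."""
    return float(np.corrcoef(p, q)[0, 1])

print(euclidean(a, b))  # nonzero: the level shift counts as a difference
print(shape_r(a, b))    # 1.0: identical shape despite the level shift
print(shape_r(a, c))    # negative: the shapes run in opposite directions
```

If you care about both level and shape (as the slide says most cluster analysts do), a distance measure like Euclidean is the usual choice; correlation-based similarity would merge clusters that differ only in elevation.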
Commonly found MMPI Clusters -- simplified
[Figure: four simplified MMPI cluster profiles -- "normal", "elevated", "misrepresenting", and "unhappy" -- plotted across the validity scales (L, F, K) and clinical scales (Hs, D, Hy, Pd, Mf, Pa, Pt, Sz, Ma, Si)]
How Hierarchical Clustering Works
- Data are in an "X" matrix (cases x variables)
- Compute the "profile similarity" of all pairs of cases and put those values in a "D" matrix (cases x cases)
- Start with # clusters = # cases (1 case in each cluster)
- On each step:
  - identify the 2 clusters that are "most similar" & combine them into a single cluster
  - compute the "profile error" (not everyone in a cluster is identical)
  - re-compute the "profile similarity" among all cluster pairs
- Repeat until there is a single cluster
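The steps above can be sketched with SciPy's hierarchical clustering routines (assuming SciPy is available; the X matrix below is made-up toy data, not from the slides). `pdist` builds the "D" matrix, `linkage` runs the agglomeration, and `fcluster` reads off who is in which cluster at a chosen cut:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy "X" matrix: 6 cases x 3 variables, forming 3 obvious pairs
X = np.array([[1.0, 2.0, 1.0],
              [1.1, 2.1, 0.9],
              [5.0, 6.0, 5.0],
              [5.2, 5.9, 5.1],
              [9.0, 1.0, 9.0],
              [9.1, 1.2, 8.8]])

D = pdist(X, metric="euclidean")     # the "D" matrix (condensed, cases x cases)
Z = linkage(D, method="ward")        # n-1 agglomeration steps, most similar first
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree at 3 clusters
print(labels)                        # each pair of cases shares a label
```

Each row of `Z` records one agglomeration step: which two clusters merged and at what "error" (merge distance), which is exactly the bookkeeping the slide describes.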
Determining the # of Hierarchical Clusters
With each agglomeration step in the clustering procedure, the 2 most similar groups are combined into a single cluster:
- parsimony increases -- fewer clusters = simpler solution
- error increases -- cases combined into clusters aren't identical
We want to identify the "parsimony-error trade-off". Examine the "error increase" at each agglomeration step:
- a large "jump" in error indicates "too few" clusters -- we have just combined two clusters that are very dissimilar
- frankly, this doesn't often work very well by itself -- you need to include more info to decide how many clusters you have!!!
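The "jump in error" heuristic can be read directly off the linkage matrix: the third column of SciPy's linkage output is the merge error at each step, and a large increase from one step to the next flags a bad merge. A minimal sketch on made-up data (toy values, and only a heuristic -- as the slide warns, it should not be used alone):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy X matrix: 6 cases in 3 well-separated pairs (values are made up)
X = np.array([[1.0, 2.0, 1.0], [1.1, 2.1, 0.9],
              [5.0, 6.0, 5.0], [5.2, 5.9, 5.1],
              [9.0, 1.0, 9.0], [9.1, 1.2, 8.8]])

Z = linkage(X, method="ward")   # n-1 agglomeration steps
heights = Z[:, 2]               # "error" (merge distance) at each step
jumps = np.diff(heights)        # error increase from one step to the next

# The largest jump sits between step j and step j+1 (0-indexed);
# stopping after step j leaves n - (j + 1) clusters.
j = int(np.argmax(jumps))
suggested_k = X.shape[0] - (j + 1)
print(suggested_k)
```

Plotting `heights` against the step number gives the usual "scree"-style picture of the parsimony-error trade-off.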
Determining the # of Clusters, cont...
When evaluating a cluster solution, be sure to consider:
- Stability -- are clusters similar if you add or delete clusters?
- Replicability -- split-half or replication analyses
- Meaningfulness (e.g., knowing a priori about the groups helps)
There is really no substitute for obtaining and plotting multiple clustering solutions & paying close attention to the # of cases in the different clusters. Follow the merging of clusters, asking whether "importantly dissimilar groups of substantial size have been merged." Be sure to consider the sizes of the groups -- subgroups of less than 5% are usually too small to trust without theory & replication. We'll look at some statistical help for this later...
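One simple version of the split-half replicability check is to cluster each half of the sample separately and ask whether every cluster mean profile from one half has a close counterpart in the other. A sketch on synthetic data (the group structure, sample sizes, and closeness threshold are all illustrative assumptions, not prescriptions from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic population: three made-up groups of 40 cases on 4 variables
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 5]], dtype=float)
X = np.vstack([c + rng.normal(0.0, 0.5, (40, 4)) for c in centers])
rng.shuffle(X)

def cluster_means(data, k=3):
    """Ward clustering cut at k clusters; return the k cluster mean profiles."""
    labels = fcluster(linkage(data, method="ward"), t=k, criterion="maxclust")
    return np.array([data[labels == g].mean(axis=0) for g in range(1, k + 1)])

# Split-half check: cluster each half separately, then see whether every
# mean profile from one half has a close counterpart in the other half
m1 = cluster_means(X[::2])
m2 = cluster_means(X[1::2])
gaps = [float(min(np.linalg.norm(c - m2, axis=1))) for c in m1]
print(np.round(gaps, 2))   # small gaps suggest the solution replicates
```

Large gaps, or clusters in one half with no counterpart in the other, are the warning sign the slide is pointing at.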
How different is "different enough" to keep as separate clusters? That is a tough one...
- On how many variables must the clusters differ?
- By how much must they differ?
- Are level differences really important? Or only shape differences?
- How many cases have to be in a cluster for it to be interesting?
This is a great example of something we've discussed many times before: the more content knowledge you bring to the analysis, the more informative the analysis is likely to be!!! You need to know about the population and the related literature to know "how much of a difference is a difference that matters."
"Strays" in Hierarchical Analyses
A "stray" is a person with a profile that matches no one else's, due to:
- a data collection, collation, computation, or other error
- member(s) of a population/group not otherwise represented
"Strays" can cause us a couple of kinds of problems:
- a 10-group clustering might be 6 strays and 4 substantial clusters -- the agglomeration error can't tell you; you have to track the cluster frequencies
- a stray may be forced into a group without really belonging there, changing the profile of that group so that which other cases join it changes -- you have to check whether group members are really similar (more later)
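Tracking cluster frequencies is mechanical: tally the membership counts and flag any cluster below the 5% rule of thumb from the previous slide. A sketch with hypothetical labels mimicking the "10-group solution that is really 4 clusters plus 6 strays" scenario:

```python
from collections import Counter

# Hypothetical cluster membership for 100 cases: a "10-group" solution
# that is really 4 substantial clusters plus 6 single-case strays
labels = [1] * 30 + [2] * 28 + [3] * 22 + [4] * 14 + [5, 6, 7, 8, 9, 10]

counts = Counter(labels)
n = len(labels)
# Flag any cluster holding less than 5% of the sample as a likely stray group
strays = sorted(c for c, size in counts.items() if size / n < 0.05)
print(strays)   # -> [5, 6, 7, 8, 9, 10]
```

Here the frequency table immediately shows that only 4 of the 10 "clusters" are substantial, which the agglomeration error alone could not reveal.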
Within-cluster Variability in Cluster Analyses
When we plot profiles, differences in level or shape can look important enough to keep two clusters separated. Adding "whiskers" (Std, SEM, or CIs) can help us recognize when groups are and aren't really different (these aren't). NHST tests can help too (more later).
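The whisker check can be done numerically as well as graphically: compute each cluster's mean profile and SEM, and see whether the mean +/- 2*SEM bands overlap. A minimal sketch with made-up scores for two clusters whose levels look different but whose whiskers overlap (the data and the 2*SEM band width are illustrative assumptions):

```python
import numpy as np

# Two hypothetical clusters measured on three variables; the mean profiles
# differ by one point on every variable -- but is that a real difference?
g1 = np.array([[9, 11, 10], [10, 12, 11], [11, 13, 12], [10, 12, 11]], float)
g2 = np.array([[10, 12, 11], [11, 13, 12], [12, 14, 13], [11, 13, 12]], float)

def mean_sem(g):
    """Mean profile and its standard error for one cluster."""
    return g.mean(axis=0), g.std(axis=0, ddof=1) / np.sqrt(len(g))

m1, s1 = mean_sem(g1)
m2, s2 = mean_sem(g2)

# "Whisker" check: if the mean +/- 2*SEM bands overlap on every variable,
# the apparent level difference between the clusters may not be real
overlap = (m1 + 2 * s1 > m2 - 2 * s2) & (m2 + 2 * s2 > m1 - 2 * s1)
print(overlap)   # -> [ True  True  True]
```

Plotting `m1` and `m2` with error bars of half-width `2 * s1` and `2 * s2` gives the whiskered profile plot the slide describes.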
Making Cluster Solutions More Readable
Some variable sets and their orderings are well known (MMPI, WISC, NEO, MCMI, etc.); if so, follow the expected ordering. Most of the time, the researcher can select the variable order:
- pick an order that highlights and simplifies cluster comparisons
- minimize the number of "humps" & "cross-overs"
[Figure: the same cluster profiles plotted twice, once with variable order A B C D E F and once with order C D F A B E -- the one on the left is probably better]