Cluster Analysis Purpose and process of clustering Profile analysis Selection of variables and sample Determining the # of clusters.


Intro to Clustering
- Clustering is like "reverse linear discriminant analysis": you are looking for groups, but can't define them a priori.
- The usual starting point is a "belief" that a particular population is not homogeneous -- that there are two or more "kinds" of folks in the group.
- Reasons for this "belief" -- usually your own "failure":
  - predictive models don't work, or seem way too complicated (need lots of unrelated predictors)
  - treatment programs only work for some folks
  - "best predictors" or "best treatments" vary across folks
  - "gut feeling"

Process of a Cluster Analysis
1. Identify the target population
2. Identify a likely set of variables to "carve" the population into groups
3. Use broad sampling procedures -- represent all "groups"
4. Construct a "profile" for each participant
5. Compute "similarities" among all participant profiles
6. Determine the # of clusters
7. Identify who is in which cluster
8. Describe/interpret the clusters
9. Plan the next study -- replication & extension
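The computational core of these steps (profiles, similarities, clustering, membership) can be sketched with SciPy's hierarchical clustering tools. A minimal sketch with hypothetical data -- the two "kinds" of folks, variable counts, and distance/linkage choices here are illustrative assumptions, not part of the lecture:

```python
# Sketch of the cluster-analysis pipeline (hypothetical data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# X: cases x variables "profile" matrix -- 20 participants, 5 variables,
# built from two hypothetical "kinds" of folks
X = np.vstack([rng.normal(0, 1, (10, 5)),
               rng.normal(3, 1, (10, 5))])

D = pdist(X, metric="euclidean")                 # profile similarities (distances)
Z = linkage(D, method="ward")                    # agglomerative clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # who is in which cluster, for k = 2
print(labels)
```

The real work, as the slides emphasize, is in the steps the code skips: choosing variables, sampling broadly, and deciding on the number of clusters.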

Profiles and Profile Comparisons
A profile is a person's "pattern" of scores for a set of variables. Profiles differ in two basic ways:
- level -- average value ("height")
- shape -- peaks & dips
(Figure: profiles for 4 folks across 5 variables, A B C D E.)
Who's most "similar" to whom? How should we measure "similarity" -- level, shape, or both? Cluster analysts are usually interested in both shape and level differences among groups.
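The level-vs.-shape distinction maps onto the choice of similarity measure. A toy illustration (profiles are hypothetical): Euclidean distance is sensitive to both level and shape, while correlation-based distance ignores level and reflects shape only.

```python
# Level vs. shape similarity with toy profiles (hypothetical values).
import numpy as np

p1 = np.array([1.0, 3.0, 2.0, 4.0, 3.0])  # some profile
p2 = p1 + 10.0                             # same shape, much higher level
p3 = p1[::-1]                              # similar level, reversed shape

def euclid(a, b):
    return np.linalg.norm(a - b)           # sensitive to level AND shape

def shape_dist(a, b):
    return 1 - np.corrcoef(a, b)[0, 1]     # sensitive to shape only

print(euclid(p1, p2), shape_dist(p1, p2))  # far apart in level, identical shape
print(euclid(p1, p3), shape_dist(p1, p3))  # close in level, different shape
```

Which measure you feed the clustering algorithm decides which kinds of group differences it can find.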

(Figure: commonly found MMPI clusters, simplified -- four profiles labeled "normal", "elevated", "misrepresenting", and "unhappy", plotted across the validity scales L, F, K and the clinical scales Hs, D, Hy, Pd, Mf, Pa, Pt, Sz, Ma, Si.)

How Hierarchical Clustering Works
- Data go in an "X" matrix (cases x variables).
- Compute the "profile similarity" of all pairs of cases and put those values in a "D" matrix (cases x cases).
- Start with # clusters = # cases (1-case clusters).
- On each step:
  - identify the 2 clusters that are "most similar" & combine them into a single cluster
  - compute the "profile error" (not everyone in a cluster is identical)
  - re-compute the "profile similarity" among all cluster pairs
- Repeat until there is a single cluster.
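The agglomeration loop above can be written out from scratch in a few lines. This sketch uses single linkage (closest pair of members) as the between-cluster similarity and a tiny hypothetical X matrix; real analyses would use a library implementation and a deliberately chosen linkage rule:

```python
# Minimal from-scratch agglomeration loop (single linkage, hypothetical data).
import numpy as np

# X matrix: 5 cases x 2 variables
X = np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])

# D matrix: cases x cases profile distances
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)

clusters = [[i] for i in range(len(X))]    # start: # clusters = # cases

def cluster_dist(a, b):
    # single linkage: distance between the closest pair of members
    return min(D[i, j] for i in a for j in b)

merges = []
while len(clusters) > 1:
    # find the 2 most-similar clusters...
    pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append((clusters[i], clusters[j], d))
    # ...and combine them into a single cluster
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
               + [clusters[i] + clusters[j]]

for a, b, d in merges:
    print(f"merged {a} + {b} at distance {d:.2f}")
```

Note how the merge distances grow as increasingly dissimilar clusters are forced together; that growth is the "profile error" the next slide uses to pick the number of clusters.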

Determining the # of Hierarchical Clusters
- With each agglomeration step, the 2 most similar clusters are combined into a single cluster:
  - parsimony increases -- fewer clusters = simpler solution
  - error increases -- cases combined into clusters aren't identical
- We want to identify the "parsimony-error trade-off."
- Examine the "error increase" at each agglomeration step: a large "jump" in error indicates "too few" clusters -- you have just combined two clusters that are very dissimilar.
- Frankly, this doesn't often work very well by itself -- you need to include more information to decide how many clusters you have!
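The "error jump" idea can be read directly off the merge heights that SciPy records in its linkage matrix. A sketch with deliberately well-separated hypothetical groups (real data rarely cooperate this cleanly, which is the slide's point):

```python
# Looking for the "jump" in agglomeration error (hypothetical, well-separated data).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (15, 4)) for c in (0, 5, 10)])  # 3 clear groups

Z = linkage(X, method="ward")
heights = Z[:, 2]                       # merge "error" at each agglomeration step
jumps = np.diff(heights)                # error increase from one step to the next
# the merge right after the biggest jump went "one step too far"
k = len(X) - 1 - int(np.argmax(jumps))  # clusters remaining before that merge
print("suggested # of clusters:", k)
```

With messy data the biggest jump is often ambiguous, which is why the slide insists on bringing in stability, replicability, and cluster sizes as well.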

Determining the # of Clusters, cont…
When evaluating a cluster solution, be sure to consider:
- Stability -- are clusters similar if you add or delete clusters?
- Replicability -- split-half or replication analyses
- Meaningfulness -- e.g., knowing a priori about the groups helps
There is really no substitute for obtaining and plotting multiple clustering solutions & paying close attention to the # of cases in the different clusters. Follow the merging of clusters, asking whether "importantly dissimilar groups of substantial size have been merged."
- Be sure to consider the sizes of the groups -- subgroups of less than 5% are usually too small to trust without theory & replication.
- We'll look at some statistical help for this later…
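One way to operationalize the replicability check is a split-half analysis: cluster each random half separately, assign everyone to the nearest centroid from each half's solution, and compare the two assignments. This sketch is one possible implementation, not the lecture's prescribed procedure -- the nearest-centroid assignment and the adjusted Rand index as an agreement measure are assumptions layered on top of the slide:

```python
# Split-half replicability sketch (hypothetical data and procedure).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1.0, (30, 4)) for c in (0, 6)])  # 2 groups, 60 cases

idx = rng.permutation(len(X))
half1, half2 = idx[:30], idx[30:]

def cluster_then_centroids(rows, k=2):
    # cluster one half, return its k cluster centroids
    labels = fcluster(linkage(X[rows], method="ward"), t=k, criterion="maxclust")
    return np.array([X[rows][labels == g].mean(axis=0) for g in range(1, k + 1)])

c1, c2 = cluster_then_centroids(half1), cluster_then_centroids(half2)

def assign(cents):
    # nearest-centroid assignment of ALL cases to one half's solution
    return np.argmin(np.linalg.norm(X[:, None] - cents[None], axis=2), axis=1)

ari = adjusted_rand_score(assign(c1), assign(c2))
print("split-half agreement (ARI):", ari)
```

High agreement across halves is evidence the clusters are more than sampling noise; low agreement is exactly the kind of warning the slide is asking you to look for.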

How different is "different enough" to keep as separate clusters? That is a tough one…
- On how many variables must the clusters differ?
- By how much must they differ?
- Are level differences really important? Or only shape differences?
- How many cases have to be in the cluster for it to be interesting?
This is a great example of something we've discussed many times before: the more content knowledge you bring to the analysis, the more informative the analysis is likely to be! You need to know the population and related literature to know "how much of a difference is a difference that matters."

"Strays" in Hierarchical Analyses
A "stray" is a person with a profile that matches no one else's, due to:
- a data collection, collation, computation, or other error
- member(s) of a population/group not otherwise represented
"Strays" can cause a couple of kinds of problems:
- A 10-group clustering might be 6 strays and 4 substantial clusters. The agglomeration error can't tell you this -- you have to track the cluster frequencies.
- A stray may be forced into a group without really belonging there, changing the profile of that group so that which other cases join it changes -- you have to check whether group members are really similar (more later).
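Tracking cluster frequencies is straightforward once you have the linkage tree: cut it at several candidate values of k and look for tiny clusters. A sketch with one deliberately planted stray in otherwise hypothetical data:

```python
# Tracking cluster frequencies to spot "strays" (hypothetical data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (25, 4)),
               rng.normal(6, 1, (25, 4)),
               [[30.0, -30.0, 30.0, -30.0]]])   # one wildly atypical profile: a stray

Z = linkage(X, method="ward")
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    sizes = np.bincount(labels)[1:]             # cluster sizes (labels start at 1)
    print(f"k={k}: cluster sizes = {sorted(sizes.tolist(), reverse=True)}")
```

A persistent 1-case (or otherwise tiny) cluster across several values of k is the frequency signature of a stray; the agglomeration error alone would never reveal it.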

Within-Cluster Variability in Cluster Analyses
When we plot profiles, differences in level or shape can look important enough to keep two clusters separated. Adding "whiskers" (Std, SEM, or CIs) can help us recognize when groups are and aren't really different (these aren't). HNST tests can help too (more later).

Making Cluster Solutions More Readable
- Some variable sets and their orderings are well known (MMPI, WISC, NEO, MCMI, etc.); if so, follow the expected ordering.
- Most of the time, the researcher can select the variable order: pick an order that highlights and simplifies cluster comparisons by minimizing the number of "humps" & "cross-overs."
(Figure: the same cluster profiles plotted under two variable orders, A B C D E F and C D F A B E; the one on the left is probably better.)