Cluster Analysis: Grouping Cases or Variables

Clustering Cases
The goal is to cluster cases into groups based on shared characteristics. Start out with each case being a one-case cluster. The clusters are located in v-dimensional space, where v is the number of variables. Compute the squared Euclidean distance between each case and every other case.

Squared Euclidean Distance
The sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (X_i) and the score on variable i for the other case (Y_i):

$$d^2(X, Y) = \sum_{i=1}^{v} (X_i - Y_i)^2$$
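The distance just defined can be sketched in a few lines of Python. The function name and the two example cases are made up for illustration; they are not taken from the faculty data.

```python
def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases' scores."""
    if len(x) != len(y):
        raise ValueError("both cases need a score on every variable")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Two hypothetical cases scored on v = 3 variables.
case_a = [1.0, 2.0, 3.0]
case_b = [2.0, 4.0, 6.0]
d = squared_euclidean(case_a, case_b)  # 1 + 4 + 9 = 14.0
```

Because the differences are squared and summed, a variable measured on a large scale can dominate the distance.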

Agglomerate
The two cases closest to each other are agglomerated into a cluster. The distances between entities (clusters and cases) are then recomputed, and the two entities closest to each other are agglomerated. This continues until all cases end up in one cluster.
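The agglomeration loop might look roughly like the sketch below. The slides do not say how the distance between two multi-case clusters is recomputed, so average linkage (the mean squared Euclidean distance over all cross-cluster pairs) is an assumption here, and the five two-variable cases are made up.

```python
from itertools import combinations


def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def average_linkage(cluster_a, cluster_b, data):
    """Mean squared Euclidean distance over all cross-cluster pairs (assumed rule)."""
    total = sum(squared_euclidean(data[i], data[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(data):
    """Return the merge schedule as (stage, members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(data))]  # every case starts as its own cluster
    schedule = []
    stage = 1
    while len(clusters) > 1:
        # Recompute all between-entity distances and pick the closest pair.
        dist, a, b = min(
            (average_linkage(clusters[a], clusters[b], data), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        schedule.append((stage, list(clusters[a]), list(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a stays valid
        stage += 1
    return schedule


# Hypothetical cases (0-indexed) on two variables.
data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
schedule = agglomerate(data)
for row in schedule:
    print(row)
```

Each row of the returned schedule plays the role of one line of an agglomeration schedule: which two entities were combined and how far apart they were.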

What is the Correct Solution?
You may have theoretical reasons to expect a certain k-cluster solution. Look at that solution and see if it matches your expectations. Alternatively, you may try to make sense of the solutions at two or more levels of the analysis.

Faculty Salaries
Subjects were faculty in Psychology at ECU. Variables were rank, experience, number of publications, course load, and salary. Data are at ClusterAnonFaculty.sav. Also see the statistical output.

Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Save

Proximity Matrix
We did not request this, but if we had, it would display a measure of dissimilarity for each pair of entities. The pair of cases with the smallest squared Euclidean distance is clustered first.

[Table: Stage; Cluster Combined (Cluster 1, Cluster 2); Coefficients]
Look at the Agglomeration Schedule. Cases 32 and 33 are clustered. They are very similar (distance = 0.000).

Agglomeration Schedule, Steps 2 Through 5
[Table: Stage; Cluster Combined (Cluster 1, Cluster 2); Coefficients; Stage Cluster First Appears (Cluster 1, Cluster 2); Next Stage]

Stages 2-5
The agglomeration schedule shows that in Stage 2 cases 41 and 42 are clustered. In Stage 3 cases 43 and 44 are clustered. In Stage 4 cases 37 and 38 are clustered. In Stage 5 case 39 is added to the cluster that contains cases 37 and 38. And so on.
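To see how a schedule builds clusters, the first five stages described in the slides can be replayed in a short sketch. The function name is hypothetical, and writing Stage 5 as the pair (37, 39), with the existing cluster identified by its lowest case number, is an assumption about how that row is recorded.

```python
def replay_schedule(stages):
    """Replay (cluster_1, cluster_2) merges and return the clusters formed so far."""
    membership = {}  # case number -> set of cases in its current cluster
    for first, second in stages:
        merged = membership.get(first, {first}) | membership.get(second, {second})
        for case in merged:
            membership[case] = merged
    unique = {frozenset(members) for members in membership.values()}
    return sorted(sorted(cluster) for cluster in unique)


# Stages 1-5 as described in the slides; (37, 39) for Stage 5 is an assumed encoding.
stages = [(32, 33), (41, 42), (43, 44), (37, 38), (37, 39)]
clusters = replay_schedule(stages)
print(clusters)
```

Cases that have not yet been merged with anything are simply absent from the result; they are still one-case clusters at this point.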

Vertical Icicle, Two Clusters
Look at the top of the display (next slide). You can see two clusters:
– On the left, Boris through Willy
– On the right, Deanna through Sunila
The 2 cluster solution was adjuncts versus full-time faculty.

Vertical Icicle, Three Clusters
Look at the second-highest white bar in the icicle plot. Now there are three clusters:
– Adjuncts
– Junior faculty (Deanna through Mickey)
– Senior faculty (Lawrence through Roslyn)

Vertical Icicle, Four Clusters
Look at the white bar furthest to the right. Now there are four clusters:
– Adjuncts
– Junior faculty
– The acting chair (Lawrence)
– The rest of the senior faculty (Catalina through Roslyn)

The Dendrogram
At the far right you can see the two cluster solution. The next step to the left shows the three cluster solution. The next step to the left shows the four cluster solution. And so on. A truncated and rotated dendrogram is on the next slide.

Compare Two Clusters
The 2 cluster solution was adjuncts versus everybody else. Look at the t tests in the output. Adjuncts had lower rank, experience, number of publications, course load, and salary.

Compare Three Clusters
Look at the ANOVAs and plots. The senior faculty had higher salary, experience, rank, and number of pubs.

Compare Four Clusters
The acting chair had a higher salary and number of publications.

I Could Not Help Myself
With these data on hand, I could not resist predicting salary from the other variables. Salary was well correlated with Rank, FTEs, Publications, and Experience. In the multiple regression, only Rank and FTEs had significant unique effects. The residuals suggest who was being overpaid and who was being underpaid.
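The idea of reading residuals as over- or underpayment can be sketched with a small ordinary least squares fit. Everything below is illustrative: the helper names, the predictor values, and the salary figures are made up, and the actual analysis was run in SPSS on the faculty data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(predictors, outcome):
    """Fit outcome = b0 + b1*x1 + ...; return (coefficients, residuals)."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1.0 is the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(coefs, r)) for r in rows]
    residuals = [y - f for y, f in zip(outcome, fitted)]
    return coefs, residuals


# Hypothetical (rank, FTE) predictors and salaries in thousands.
predictors = [(1, 0.5), (2, 1.0), (3, 1.0), (1, 1.0), (2, 0.5)]
salary = [25.0, 40.0, 47.0, 33.0, 30.0]
coefs, residuals = ols(predictors, salary)
```

A positive residual means the actual salary is above what the model predicts from the other variables, and a negative residual means it is below.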

Split by Sex
For men, the unique effect of number of publications was positive: more publications, higher salary. For women it was negative: more publications, lower salary. Curious.

Workaholism
Aziz & Zickar (2005): Workaholics may be defined as those
– High in work involvement,
– High in drive to work, and
– Low in work enjoyment.
For each case, a score was obtained for each of these three dimensions.

The Three Cluster Solution
Workaholics
– High work involvement
– High drive to work
– Low work enjoyment
Positively engaged workers
– High work involvement
– Medium drive to work
– High work enjoyment
Unengaged workers
– Low work involvement
– Low drive to work
– Low work enjoyment
Past research/theory indicated there should be six clusters, but the theorized six clusters were not obtained.

Clustering Variables
Data: FactBeer.sav. Also see the statistical output.
Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Proximity Matrix
The proximity matrix is simply the intercorrelation matrix. The two most correlated variables are Color and Aroma (r = .909); they are clustered on the first step.
Stage 2: Size and Alcohol (r = .904) are clustered.
Stage 3: Taste is added to the cluster that already contains Color and Aroma.

Also See Other Tables & Plots
Stage 4: Cost is added to the cluster that already contains Size and Alcohol.
Stage 5: The two clusters are combined.
– But they are not very similar (similarity coefficient = .038).
– Now we have one cluster with six variables and one with one (Reputation).
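When variables rather than cases are clustered, the proximity measure is a correlation, so the first step picks the most highly correlated pair. A minimal sketch of that first step follows; the function names and the ratings are made up for illustration and are not the FactBeer data.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def most_similar_pair(variables):
    """Return the names of the two variables with the highest correlation."""
    return max(
        combinations(variables, 2),
        key=lambda pair: pearson_r(variables[pair[0]], variables[pair[1]]),
    )


# Hypothetical ratings from five respondents on three variables.
ratings = {
    "color": [1, 2, 3, 4, 5],
    "aroma": [2, 4, 6, 8, 10],
    "cost": [5, 3, 4, 1, 2],
}
first_merge = most_similar_pair(ratings)
print(first_merge)
```

Note the direction of the comparison: with distances the smallest value is merged first, while with correlations the largest value is merged first.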