Chapter Twenty: Cluster Analysis

Chapter Outline
1) Overview
2) Basic Concept
3) Statistics Associated with Cluster Analysis
4) Conducting Cluster Analysis
   i. Formulating the Problem
   ii. Selecting a Distance or Similarity Measure
   iii. Selecting a Clustering Procedure
   iv. Deciding on the Number of Clusters
   v. Interpreting and Profiling the Clusters
   vi. Assessing Reliability and Validity

Chapter Outline (cont.)
5) Applications of Nonhierarchical Clustering
6) Clustering Variables
7) Summary

Cluster Analysis
Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis or numerical taxonomy.
Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case in order to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.

Fig. 20.1: An Ideal Clustering Situation — well-separated clusters of objects plotted on Variable 1 versus Variable 2.

Fig. 20.2: A Practical Clustering Situation — objects plotted on Variable 1 versus Variable 2; cluster boundaries are not clear-cut.

Statistics Associated with Cluster Analysis
Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
Cluster centroid. The cluster centroid consists of the mean values of the variables for all the cases or objects in a particular cluster.
Cluster centers. The cluster centers are the initial starting points in nonhierarchical clustering. Clusters are built around these centers, or seeds.
Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.

Statistics Associated with Cluster Analysis
Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right. Figure 20.8 is a dendrogram.
Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.

Statistics Associated with Cluster Analysis
Icicle plot. An icicle plot is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle plot is read from bottom to top. Figure 20.7 is an icicle plot.
Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangle matrix containing pairwise distances between objects or cases.
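Such a pairwise distance matrix can be produced directly in software. A minimal sketch using SciPy, with invented ratings (the chapter's Table 20.1 values are not reproduced here):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Invented ratings for four cases on two variables.
X = np.array([[6.0, 4.0],
              [2.0, 3.0],
              [7.0, 2.0],
              [4.0, 6.0]])

# pdist returns the condensed lower-triangle pairwise distances;
# squareform expands them into a full symmetric matrix for inspection.
condensed = pdist(X, metric="euclidean")
print(squareform(condensed))
```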

Fig. 20.3: Conducting Cluster Analysis — Formulate the Problem → Select a Distance Measure → Select a Clustering Procedure → Decide on the Number of Clusters → Interpret and Profile Clusters → Assess the Validity of Clustering.

Table 20.1: Attitudinal Data for Clustering — cases rated on six attitudinal variables, V1 through V6.

Conducting Cluster Analysis: Formulate the Problem
Perhaps the most important part of formulating the clustering problem is selecting the variables on which the clustering is based. Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.
Basically, the set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem.
The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested. In exploratory research, the researcher should exercise judgment and intuition.

Conducting Cluster Analysis: Select a Distance or Similarity Measure
The most commonly used measure of similarity is the Euclidean distance or its square. The Euclidean distance is the square root of the sum of the squared differences in values for each variable. Other distance measures are also available. The city-block or Manhattan distance between two objects is the sum of the absolute differences in values for each variable. The Chebychev distance between two objects is the maximum absolute difference in values for any variable.
If the variables are measured in vastly different units, the clustering solution will be influenced by the units of measurement. In these cases, before clustering respondents, we must standardize the data by rescaling each variable to have a mean of zero and a standard deviation of unity. It is also desirable to eliminate outliers (cases with atypical values).
Use of different distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results.
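A sketch of the three distance measures and the standardization step, assuming NumPy and SciPy are available and using invented values:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([2.0, 5.0, 7.0])
b = np.array([4.0, 1.0, 6.0])

print(distance.euclidean(a, b))  # square root of the sum of squared differences
print(distance.cityblock(a, b))  # sum of absolute differences (Manhattan)
print(distance.chebyshev(a, b))  # maximum absolute difference

# When units differ vastly, standardize each variable to mean 0 and
# standard deviation 1 before computing distances:
X = np.array([[2.0, 500.0],
              [4.0, 100.0],
              [3.0, 300.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z)
```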

Fig. 20.4: A Classification of Clustering Procedures — hierarchical (agglomerative: linkage, variance, and centroid methods; divisive) versus nonhierarchical (sequential threshold, parallel threshold, optimizing partitioning).

Conducting Cluster Analysis: Select a Clustering Procedure — Hierarchical
Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be agglomerative or divisive.
Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process is continued until all objects are members of a single cluster.
Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.
Agglomerative methods are commonly used in marketing research. They consist of linkage methods, error sums of squares or variance methods, and centroid methods.

Conducting Cluster Analysis: Select a Clustering Procedure — Linkage Methods
The single linkage method is based on minimum distance, or the nearest-neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points (see Figure 20.5).
The complete linkage method is similar to single linkage, except that it is based on the maximum distance, or the farthest-neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two farthest points.
The average linkage method works similarly. However, in this method the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is taken from each cluster (Figure 20.5).
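SciPy's hierarchical clustering implements these same linkage criteria. A minimal sketch comparing the three on invented two-group data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two invented, loosely separated groups of ten objects each
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomeration schedule
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut to two clusters
    print(method, labels)
```

On well-separated data the three criteria agree; single linkage tends to chain overlapping clusters together, while complete and average linkage produce more compact groups.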

Fig. 20.5: Linkage Methods of Clustering — single linkage (minimum distance between Cluster 1 and Cluster 2), complete linkage (maximum distance), and average linkage (average distance).

Conducting Cluster Analysis: Select a Clustering Procedure — Variance Methods
The variance methods attempt to generate clusters so as to minimize the within-cluster variance. A commonly used variance method is Ward's procedure. For each cluster, the means for all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated (Figure 20.6). These distances are summed for all the objects. At each stage, the two clusters whose merger produces the smallest increase in the overall sum of within-cluster squared distances are combined.
In the centroid methods, the distance between two clusters is the distance between their centroids (the means for all the variables), as shown in Figure 20.6. Every time objects are grouped, a new centroid is computed.
Of the hierarchical methods, average linkage and Ward's method have been shown to perform better than the other procedures.
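Both Ward's and the centroid method are available in SciPy; a short sketch on invented data (SciPy computes the Euclidean distances internally from the raw observations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])

# Ward's method: at each stage, merge the pair of clusters whose union
# gives the smallest increase in total within-cluster sum of squares.
Z_ward = linkage(X, method="ward")
# Centroid method: inter-cluster distance is the distance between centroids.
Z_cent = linkage(X, method="centroid")

print(fcluster(Z_ward, t=2, criterion="maxclust"))
print(fcluster(Z_cent, t=2, criterion="maxclust"))
```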

Fig. 20.6: Other Agglomerative Clustering Methods — Ward's Procedure and the Centroid Method.

Conducting Cluster Analysis: Select a Clustering Procedure — Nonhierarchical
The nonhierarchical clustering methods are frequently referred to as k-means clustering. These methods include sequential threshold, parallel threshold, and optimizing partitioning.
In the sequential threshold method, a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together. Then a new cluster center or seed is selected, and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is no longer considered for clustering with subsequent seeds.
The parallel threshold method operates similarly, except that several cluster centers are selected simultaneously and objects within the threshold level are grouped with the nearest center.
The optimizing partitioning method differs from the two threshold procedures in that objects can later be reassigned to clusters to optimize an overall criterion, such as average within-cluster distance for a given number of clusters.
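Standard k-means, as implemented in scikit-learn, is of the optimizing-partitioning type: cases are reassigned between iterations to reduce the criterion. A sketch on invented three-group data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three invented groups of fifteen cases each
X = np.vstack([rng.normal((0, 0), 1, (15, 2)),
               rng.normal((5, 0), 1, (15, 2)),
               rng.normal((0, 5), 1, (15, 2))])

# k-means reassigns cases between iterations to reduce the total
# within-cluster squared distance (the optimizing-partitioning idea).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # final cluster centers
print(km.labels_)           # cluster membership for each case
```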

Conducting Cluster Analysis: Select a Clustering Procedure
It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters and the cluster centroids so obtained are then used as inputs to the optimizing partitioning method.
Choice of a clustering method and choice of a distance measure are interrelated. For example, squared Euclidean distances should be used with Ward's and the centroid methods. Several nonhierarchical procedures also use squared Euclidean distances.
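One way to sketch the tandem approach, assuming SciPy and scikit-learn and invented data: a Ward's pass suggests the number of clusters and provisional centroids, which then seed the optimizing pass.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Step 1: hierarchical (Ward's) pass suggests the number of clusters
# and yields provisional centroids.
labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
seeds = np.array([X[labels == k].mean(axis=0) for k in (1, 2)])

# Step 2: those centroids seed the optimizing (k-means) pass.
km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
print(km.cluster_centers_)
```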

Table 20.2: Results of Hierarchical Clustering — Agglomeration Schedule Using Ward's Procedure (stage; clusters combined; coefficient; stage at which each cluster first appears; next stage).

Table 20.2, cont.: Cluster Membership of Cases Using Ward's Procedure (case label and cluster membership for the four-, three-, and two-cluster solutions).

Fig. 20.7: Vertical Icicle Plot Using Ward's Method.

Fig. 20.8: Dendrogram Using Ward's Method.

Conducting Cluster Analysis: Decide on the Number of Clusters
Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.
In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram.
In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow or a sharp bend occurs indicates an appropriate number of clusters.
The relative sizes of the clusters should be meaningful.
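The elbow check is easy to run numerically. A sketch using scikit-learn's within-cluster sum of squares (inertia) on invented three-group data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1, (20, 2)) for m in (0, 5, 10)])

# Print total within-cluster sum of squares against k; the elbow
# (here at k = 3) suggests an appropriate number of clusters.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```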

Conducting Cluster Analysis: Interpreting and Profiling the Clusters
Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.
It is often helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.
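A sketch of profiling with pandas, using invented clustering variables V1–V3 and an invented demographic variable (income) that is held out of the clustering:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["V1", "V2", "V3"])
df["income"] = rng.normal(50, 10, 30)  # invented profiling variable

# Cluster on V1-V3 only, then profile each cluster on all variables,
# including the one that was not used for clustering.
labels = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(df[["V1", "V2", "V3"]])
print(df.groupby(labels).mean())  # per-cluster centroids and income profile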

Table 20.3: Cluster Centroids — means of variables V1 through V6 for each cluster.

Conducting Cluster Analysis: Assess Reliability and Validity
1. Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions.
2. Use different methods of clustering and compare the results (see the sketch below).
3. Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples.
4. Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables.
5. In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orders of cases until the solution stabilizes.
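Check 2 can be quantified by comparing the two memberships with an agreement index; the adjusted Rand index is one common choice (an addition here, not part of the chapter). A sketch on invented data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(5, 1, (25, 2))])

# Cluster the same data with two different methods and compare the
# resulting memberships (1.0 = identical partitions).
a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
b = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(adjusted_rand_score(a, b))
```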

Table 20.4: Results of Nonhierarchical Clustering — Initial Cluster Centers (V1–V6 for clusters 1, 2, and 3) and Iteration History (change in cluster centers at each iteration; convergence was achieved at iteration 2 due to no or small change in cluster centers).

Table 20.4, cont.: Cluster Membership (case number, assigned cluster, and distance from the cluster center).

Table 20.4, cont.: Final Cluster Centers (V1–V6 for clusters 1, 2, and 3) and Distances between Final Cluster Centers.

Table 20.4, cont.: Results of Nonhierarchical Clustering.

Table 20.5: Results of Two-Step Clustering.

Table 20.5, cont.: Cluster Distribution (N, % of Combined, and % of Total for each cluster and for the combined total).

Table 20.5, cont.: Cluster Profiles.

Clustering Variables
In this instance, the units used for analysis are the variables, and the distance measures are computed for all pairs of variables.
Hierarchical clustering of variables can aid in the identification of unique variables, or variables that make a unique contribution to the data.
Clustering can also be used to reduce the number of variables. Associated with each cluster is a linear combination of the variables in the cluster, called the cluster component. A large set of variables can often be replaced by the set of cluster components with little loss of information. However, a given number of cluster components does not generally explain as much variance as the same number of principal components.
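One way to sketch variable clustering, using 1 − |r| (one minus the absolute correlation) as the inter-variable distance — an illustrative choice, not the chapter's prescription — on six invented variables built from two underlying factors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
base1, base2 = rng.normal(size=(2, 100))
# Six invented variables: V1-V3 track one factor, V4-V6 another.
X = np.column_stack([base1 + rng.normal(0, 0.3, 100) for _ in range(3)] +
                    [base2 + rng.normal(0, 0.3, 100) for _ in range(3)])

# Treat the variables as the units: distance between each pair of
# variables is 1 - |correlation|, then cluster hierarchically.
D = 1 - np.abs(np.corrcoef(X, rowvar=False))
Z = linkage(squareform(D, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g., [1 1 1 2 2 2]
```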

SPSS Windows
To select this procedure using SPSS for Windows, click:
Analyze > Classify > Hierarchical Cluster …
Analyze > Classify > K-Means Cluster …
Analyze > Classify > Two-Step Cluster …

SPSS Windows: Hierarchical Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then HIERARCHICAL CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],” and “Compare Prices [v6]” into the VARIABLES box.
4. In the CLUSTER box, check CASES (default option). In the DISPLAY box, check STATISTICS and PLOTS (default options).
5. Click on STATISTICS. In the pop-up window, check AGGLOMERATION SCHEDULE. In the CLUSTER MEMBERSHIP box, check RANGE OF SOLUTIONS. Then, for MINIMUM NUMBER OF CLUSTERS, enter 2, and for MAXIMUM NUMBER OF CLUSTERS, enter 4. Click CONTINUE.
6. Click on PLOTS. In the pop-up window, check DENDROGRAM. In the ICICLE box, check ALL CLUSTERS (default). In the ORIENTATION box, check VERTICAL. Click CONTINUE.
7. Click on METHOD. For CLUSTER METHOD, select WARD’S METHOD. In the MEASURE box, check INTERVAL and select SQUARED EUCLIDEAN DISTANCE. Click CONTINUE.
8. Click OK.

SPSS Windows: K-Means Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then K-MEANS CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],” and “Compare Prices [v6]” into the VARIABLES box.
4. For NUMBER OF CLUSTERS, select 3.
5. Click on OPTIONS. In the pop-up window, in the STATISTICS box, check INITIAL CLUSTER CENTERS and CLUSTER INFORMATION FOR EACH CASE. Click CONTINUE.
6. Click OK.

SPSS Windows: Two-Step Clustering
1. Select ANALYZE from the SPSS menu bar.
2. Click CLASSIFY and then TWO-STEP CLUSTER.
3. Move “Fun [v1],” “Bad for Budget [v2],” “Eating Out [v3],” “Best Buys [v4],” “Don’t Care [v5],” and “Compare Prices [v6]” into the CONTINUOUS VARIABLES box.
4. For DISTANCE MEASURE, select EUCLIDEAN.
5. For NUMBER OF CLUSTERS, select DETERMINE AUTOMATICALLY.
6. For CLUSTERING CRITERION, select AKAIKE’S INFORMATION CRITERION (AIC).
7. Click OK.

SAS Learning Edition
To select this procedure using SAS Learning Edition, click:
Analyze > Multivariate > Cluster Analysis

SAS Learning Edition: Hierarchical Clustering
1. Select ANALYZE from the SAS Learning Edition menu bar.
2. Select Multivariate > Cluster Analysis.
3. Move V1–V6 to the Analysis variables task role.
4. Click Cluster and select Ward’s minimum variance method under Cluster method.
5. Click Results and select Simple summary statistics.
6. Click Run.

SAS Learning Edition: K-Means Clustering
1. Select ANALYZE from the SAS Learning Edition menu bar.
2. Select Multivariate > Cluster Analysis.
3. Move V1–V6 to the Analysis variables task role.
4. Click Cluster and select K-means algorithm as the cluster method and 3 for the Maximum number of clusters.
5. Click Run.

Copyright © 2010 Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.