Cluster Analysis
Jaehyun Lee, Probability and Statistics Laboratory, Department of Industrial Engineering, POSTECH

Definition
Cluster analysis is a technique used for combining observations into groups or clusters such that:
- each group or cluster is homogeneous, or compact, with respect to certain characteristics;
- each group is different from the other groups with respect to those same characteristics.

Examples
- A marketing manager is interested in identifying similar cities that can be used for test marketing.
- The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues.

Objective of cluster analysis
The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables.

Overview of cluster analysis
Step 1: n objects are measured on p variables.
Step 2: Transform the data into an n x n similarity (distance) matrix.
Step 3: Cluster formation (hierarchical or nonhierarchical clustering).
Step 4: Cluster profiling.

Key problems
Measure of similarity
Fundamental to the use of any clustering technique is the computation of a measure of similarity, or distance, between the respective objects.
- Distance-type measures: Euclidean distance, Euclidean distance for standardized data, Mahalanobis distance.
- Matching-type measures: association coefficients, correlation coefficients.

A procedure for forming the clusters
- Hierarchical clustering: centroid method, single-linkage method, complete-linkage method, average-linkage method, Ward's method.
- Nonhierarchical clustering: k-means clustering.

Similarity measure: distance type
Minkowski metric:
    d_ij = [ SUM_k |x_ik - x_jk|^r ]^(1/r)
If r = 2, this is the Euclidean distance; if r = 1, it is the absolute (city-block) distance. Consider the example below.

Data
Subject  Income  Education
S1         5        5
S2         6        6
S3        15       14
S4        16       15
S5        25       20
S6        30       19

Similarity matrix (squared Euclidean distances)
       S1    S2    S3    S4    S5    S6
S1      0
S2      2     0
S3    181   145     0
S4    221   181     2     0
S5    625   557   136   106     0
S6    821   745   250   212    26     0

The measure is not scale invariant. For three persons A, B, and C measured on weight (in pounds) and height, changing the height unit from feet to inches changes the distances: with height in feet, d_AB = 3.08, d_AC = 5.02, d_BC = 2.01; with height in inches, d_AB = 8.92, d_AC = 7.81, d_BC = 3.12, so even the ordering of the pairs changes. (The underlying weight and height values were lost in transcription.)
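The Minkowski metric above can be sketched in a few lines. This is a minimal illustration using the slide's six subjects; the function and variable names are my own.

```python
# Minkowski metric: d = (sum_k |x_k - y_k|^r)^(1/r).
# r = 2 gives the Euclidean distance, r = 1 the absolute (city-block) distance.

def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

subjects = {
    "S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

# Squared Euclidean distance between S1 and S3 (the 181 in the matrix above)
d_euc_sq = minkowski(subjects["S1"], subjects["S3"], r=2) ** 2
# Absolute (city-block) distance between the same pair
d_abs = minkowski(subjects["S1"], subjects["S3"], r=1)

print(round(d_euc_sq), d_abs)  # 181 19.0
```

Computing the full 6 x 6 matrix is just a double loop over `subjects`, which reproduces the similarity matrix shown above.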

Similarity measure: distance type (continued)
Euclidean distance for standardized data
To make the measure scale invariant, the data are standardized: each term of the squared Euclidean distance is weighted by 1/s_k^2, the reciprocal of the variance of variable k.

Mahalanobis distance
    D^2 = (x_i - x_j)' S^-1 (x_i - x_j)
where x is a p x 1 vector and S is the p x p covariance matrix. It is designed to take into account the correlation among the variables and is also scale invariant. (The numeric entries of the slide's standardized-data similarity matrix were lost in transcription.)
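A minimal sketch of the squared Mahalanobis distance for the two-variable example data. To keep it dependency-free, the 2 x 2 covariance inverse is written out by hand; all names are my own.

```python
# Squared Mahalanobis distance D^2 = (x - y)' S^-1 (x - y) for p = 2 variables.

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

income = [5, 6, 15, 16, 25, 30]
education = [5, 6, 14, 15, 20, 19]

s11, s22 = cov(income, income), cov(education, education)
s12 = cov(income, education)
det = s11 * s22 - s12 * s12                      # determinant of 2x2 covariance matrix
inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]   # its inverse

def mahalanobis_sq(x, y):
    d = [x[0] - y[0], x[1] - y[1]]
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

print(mahalanobis_sq((5, 5), (15, 14)))          # scale-invariant distance S1-S3
```

Because the covariance matrix absorbs the units of measurement, rescaling a variable (say, income in thousands) leaves D^2 unchanged, unlike the raw Euclidean distance.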

Similarity measure: matching type
Association coefficients
This type of measure is used to represent similarity for binary variables. The attributes of two persons, A and B, are cross-tabulated in a 2 x 2 table: a counts the attributes both possess (1-1 matches), d the attributes both lack (0-0 matches), and b and c the mismatches. Similarity coefficients are then computed from these four counts. (The numeric entries of the slide's example table were lost in transcription.)
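Two standard association coefficients built from the 2 x 2 match table are the simple matching coefficient, (a + d)/(a + b + c + d), and the Jaccard coefficient, a/(a + b + c), which ignores joint absences. The slide's numeric example did not survive transcription, so the two binary vectors below are illustrative only.

```python
# Similarity for binary attribute vectors via the 2x2 match table:
# a = 1-1 matches, b and c = mismatches, d = 0-0 matches.

def match_counts(u, v):
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return a, b, c, d

def simple_matching(u, v):
    a, b, c, d = match_counts(u, v)
    return (a + d) / (a + b + c + d)

def jaccard(u, v):
    a, b, c, _ = match_counts(u, v)
    return a / (a + b + c)

person_a = [1, 1, 0, 1, 0]   # illustrative attribute profiles, not from the slide
person_b = [1, 0, 0, 1, 1]
print(simple_matching(person_a, person_b))  # (2 + 1) / 5 = 0.6
print(jaccard(person_a, person_b))          # 2 / 4 = 0.5
```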

Similarity measure: matching type (continued)
Correlation coefficient
The Pearson product-moment correlation coefficient can also be used as a measure of similarity between the profiles of two subjects. For the data below, r_AB = 1 and r_AC = 0.82.

Person  X1  X2  X3  X4
A        1   3   2   2
B        4  10   7   7
C        1   2   2   2
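The slide's two correlations can be reproduced directly (B is the linear function 3A + 1 of A, hence the perfect correlation):

```python
import math

# Pearson product-moment correlation as a profile-similarity measure.
def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

A = [1, 3, 2, 2]
B = [4, 10, 7, 7]
C = [1, 2, 2, 2]

print(round(pearson(A, B), 2))  # 1.0
print(round(pearson(A, C), 2))  # 0.82
```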

Hierarchical clustering: centroid method
Each group is replaced by an average subject, the centroid of that group.

Data for five clusters
Cluster  Members   Income  Education
1        S1 & S2     5.5     5.5
2        S3         15      14
3        S4         16      15
4        S5         25      20
5        S6         30      19

Similarity matrix (squared Euclidean distances)
         S1&S2    S3     S4     S5    S6
S1&S2      0
S3       162.5     0
S4       200.5     2      0
S5       590.5   136    106      0
S6       782.5   250    212     26     0

Data for four clusters
Cluster  Members   Income  Education
1        S1 & S2     5.5     5.5
2        S3 & S4    15.5    14.5
3        S5         25      20
4        S6         30      19

Similarity matrix
         S1&S2   S3&S4    S5    S6
S1&S2      0
S3&S4    181.0     0
S5       590.5  120.5     0
S6       782.5  230.5    26     0

Centroid method (continued)
Data for three clusters
Cluster  Members   Income  Education
1        S1 & S2     5.5     5.5
2        S3 & S4    15.5    14.5
3        S5 & S6    27.5    19.5

Similarity matrix
         S1&S2   S3&S4   S5&S6
S1&S2      0
S3&S4    181       0
S5&S6    680     169       0
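One centroid-method step can be sketched as follows: a merged pair is replaced by its centroid and the squared Euclidean distances are recomputed. The names are my own; the data are the slide's six subjects.

```python
# One step of the centroid method: replace a merged group by its centroid.

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

S = {"S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
     "S4": (16, 15), "S5": (25, 20), "S6": (30, 19)}

c12 = centroid([S["S1"], S["S2"]])   # (5.5, 5.5) after merging S1 and S2
print(sq_dist(c12, S["S3"]))         # 162.5, as in the five-cluster matrix

# Distance between the {S3,S4} and {S5,S6} centroids (three-cluster stage)
print(sq_dist(centroid([S["S3"], S["S4"]]),
              centroid([S["S5"], S["S6"]])))   # 169.0
```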

Hierarchical clustering: single-linkage method
The distance between two clusters is the minimum of the distances between all possible pairs of subjects, one from each cluster. For example,
d({S1,S2}, S3) = min(181, 145) = 145 and d({S1,S2}, S4) = min(221, 181) = 181.

Similarity matrix
         S1&S2    S3     S4     S5    S6
S1&S2      0
S3       145       0
S4       181       2      0
S5       557     136    106      0
S6       745     250    212     26     0

Hierarchical clustering: complete-linkage method
The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations, one from each cluster. For example,
d({S1,S2}, S3) = max(181, 145) = 181 and d({S1,S2}, S5) = max(625, 557) = 625.

Similarity matrix
         S1&S2    S3     S4     S5    S6
S1&S2      0
S3       181       0
S4       221       2      0
S5       625     136    106      0
S6       821     250    212     26     0

Hierarchical clustering: average-linkage method
The distance between two clusters is obtained by taking the average of the distances between all pairs of subjects, one from each cluster. For example,
d({S1,S2}, S3) = (181 + 145) / 2 = 163.

Similarity matrix
         S1&S2    S3     S4     S5    S6
S1&S2      0
S3       163       0
S4       201       2      0
S5       591     136    106      0
S6       783     250    212     26     0
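The three linkage rules above differ only in how the between-cluster distance is computed, so a single agglomerative loop can take the rule as a parameter. This is a compact sketch (my own names, squared Euclidean distances as in the slides), not a production implementation; ties are broken by position.

```python
from itertools import combinations

# Agglomerative clustering with a pluggable linkage rule:
# single = min pairwise distance, complete = max, average = mean.

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def linkage_dist(c1, c2, points, rule):
    ds = [sq_dist(points[i], points[j]) for i in c1 for j in c2]
    return {"single": min, "complete": max,
            "average": lambda v: sum(v) / len(v)}[rule](ds)

def agglomerate(points, rule):
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage rule.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage_dist(clusters[p[0]], clusters[p[1]],
                                              points, rule))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

S = {"S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
     "S4": (16, 15), "S5": (25, 20), "S6": (30, 19)}
print(agglomerate(S, "single"))
```

Under single linkage the merge sequence for this data is {S1,S2}, then {S3,S4}, then {S5,S6}, and then {S3,S4} joins {S5,S6} (distance 106, the smallest remaining minimum) before everything merges.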

Hierarchical clustering: Ward's method
Ward's method forms clusters by maximizing within-cluster homogeneity, with the within-group sum of squares used as the measure of homogeneity. At each step it merges the pair of clusters whose union produces the smallest increase in the total within-group, or within-cluster, sum of squares.

Evaluating the cluster solution and determining the number of clusters
Root-mean-square standard deviation (RMSSTD) of the new cluster
RMSSTD is the pooled standard deviation of all the variables forming the cluster:
    pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables.

R-squared (RS)
RS is the ratio of SS_b to SS_t (SS_t = SS_b + SS_w). For example, RS of CL2 is (701.67 - 196) / 701.67 = 0.72, where 196 = 183 + 13 is the total within-group SS of the two clusters remaining at that step.

Within-group sum of squares and degrees of freedom for the clusters formed in steps 1-5
                  Within-group sum of squares      Degrees of freedom
Step  Cluster   Income  Education  Pooled      Income  Education  Pooled   RMSSTD
1     CL5         0.50     0.50      1.00        1         1         2      0.71
2     CL4         0.50     0.50      1.00        1         1         2      0.71
3     CL3        12.50     0.50     13.00        1         1         2      2.55
4     CL2       101.00    82.00    183.00        3         3         6      5.52
5     CL1       498.83   202.83    701.67        5         5        10      8.38
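The RMSSTD definition above is short enough to compute directly. This sketch (my own names) reproduces the table's value for CL2 = {S1, S2, S3, S4}:

```python
import math

# RMSSTD: sqrt(pooled within-cluster SS / pooled degrees of freedom),
# pooling over all variables in the cluster.

def rmsstd(cluster):
    """cluster: list of (income, education) rows."""
    n, p = len(cluster), len(cluster[0])
    pooled_ss, pooled_df = 0.0, 0
    for k in range(p):
        col = [row[k] for row in cluster]
        m = sum(col) / n
        pooled_ss += sum((x - m) ** 2 for x in col)   # within-cluster SS of variable k
        pooled_df += n - 1                            # df contributed by variable k
    return math.sqrt(pooled_ss / pooled_df)

cl2 = [(5, 5), (6, 6), (15, 14), (16, 15)]   # cluster {S1, S2, S3, S4}
print(round(rmsstd(cl2), 2))                  # sqrt(183 / 6) = 5.52
```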

Semipartial R-squared (SPR)
The difference between the pooled SS_w of a new cluster and the sum of the pooled SS_w's of the clusters joined to obtain it is called the loss of homogeneity. If the loss of homogeneity is large, then the new cluster was obtained by merging two heterogeneous clusters. SPR is this loss of homogeneity due to combining two groups, expressed as a fraction of the total SS: SPR of CL2 is (183 - (1 + 1)) / 701.67 = 0.258.

Distance between clusters
This is simply the Euclidean distance between the centroids of the two clusters that are to be joined, or merged, and is termed the centroid distance (CD).

Data for three clusters
Cluster  Members   Income  Education
1        S1 & S2     5.5     5.5
2        S3 & S4    15.5    14.5
3        S5 & S6    27.5    19.5
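Both statistics for the merge of {S1,S2} with {S3,S4} follow in a few lines (the SS values are the ones from the within-group SS table; names are my own):

```python
import math

# SPR and centroid distance (CD) for the step that forms CL2 = {S1,S2,S3,S4}.

sst = 701.67            # total pooled SS when all six subjects form one cluster
ss_new = 183.0          # pooled SS_w of the merged cluster CL2
ss_joined = 1.0 + 1.0   # pooled SS_w of {S1,S2} and of {S3,S4}

spr = (ss_new - ss_joined) / sst             # loss of homogeneity / total SS
cd = math.dist((5.5, 5.5), (15.5, 14.5))     # distance between the two centroids

print(round(spr, 3))    # 0.258
print(round(cd, 2))     # sqrt(181) = 13.45
```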

Summary of the statistics for evaluating a cluster solution
Statistic  Concept measured                 Comments
RMSSTD     Homogeneity of new clusters      Value should be small
SPR        Homogeneity of merged clusters   Value should be small
RS         Homogeneity of new clusters      Value should be high
CD         Homogeneity of merged clusters   Value should be small

Nonhierarchical clustering
The data are divided into k partitions or groups, with each partition representing a cluster. The number of clusters must be known a priori.

Steps
1. Select k initial cluster centroids or seeds, where k is the number of clusters desired.
2. Assign each observation to the cluster to which it is closest.
3. Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule.
4. Stop if there is no reallocation of data points, or if the reassignment satisfies the stopping rule; otherwise go to step 2.

Nonhierarchical algorithms differ in:
- the method used for obtaining the initial cluster centroids or seeds;
- the rule used for reassigning observations.
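The four steps above can be sketched as a minimal k-means loop. Following Algorithm 1 below, it uses the first k observations as seeds; on the slide data it converges to the clusters {S1}, {S2}, {S3, S4, S5, S6}. Names are my own.

```python
# Minimal k-means: seed, assign, recompute centroids, repeat until stable.

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, max_iter=100):
    centroids = list(points[:k])          # first k observations as seeds
    labels = [0] * len(points)
    for it in range(max_iter):
        # Step 2: assign each observation to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if it > 0 and new_labels == labels:
            break                         # Step 4: no reallocation, stop
        labels = new_labels
        # Step 3: recompute each cluster centroid from its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return labels, centroids

pts = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]
labels, cents = kmeans(pts, 3)
print(labels)   # [0, 1, 2, 2, 2, 2]
```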

Nonhierarchical clustering: Algorithm 1
Steps
1. Select the first k observations as cluster centers.
2. Compute the centroid of each cluster.
3. Reassign each observation by computing its distance to each centroid; repeat until no observation changes cluster.

Initial cluster centroids (first three observations)
Variable    Cluster 1  Cluster 2  Cluster 3
Income          5          6         15
Education       5          6         14

With these seeds, S1 is assigned to cluster 1, S2 to cluster 2, and S3, S4, S5, and S6 to cluster 3. Recomputing the centroids gives:

Centroids of the three clusters
Variable    Cluster 1  Cluster 2  Cluster 3
Income          5          6        21.5
Education       5          6        17.0

Reassignment with these centroids leaves every observation in its previous cluster, so the algorithm stops.

Nonhierarchical clustering: Algorithm 2
Steps
1. Select the first k observations as cluster centers.
2. The seeds are replaced by the remaining observations as they are processed.
3. Reassign each observation by computing its distance to each centroid.

The cluster membership evolves as follows:
1. {S1}, {S2}, {S3}
2. {S1}, {S2}, {S3, S4}
3. {S1, S2}, {S5}, {S3, S4}
4. {S1, S2}, {S5, S6}, {S3, S4}

Centroids of the three clusters
Variable    Cluster 1  Cluster 2  Cluster 3
Income         5.5       27.5       15.5
Education      5.5       19.5       14.5

No observation changes cluster on reassignment, so the final solution is {S1, S2}, {S5, S6}, {S3, S4}.

Nonhierarchical clustering: Algorithm 3
The initial seeds are selected using Sum(i), the sum of the values of the variables for observation i, and observations are then reassigned so as to minimize the error sum of squares (ESS).

Initial assignment
Subject  Income  Education  Sum(i)
S1         5        5         10
S2         6        6         12
S3        15       14         29
S4        16       15         31
S5        25       20         45
S6        30       19         49

(The C_i cutoff values that map Sum(i) to clusters were lost in transcription; the resulting assignment places S1 and S2 in cluster 1, S3 and S4 in cluster 2, and S5 and S6 in cluster 3.)

Centroids of the three clusters
Variable    Cluster 1  Cluster 2  Cluster 3
Income         5.5       15.5       27.5
Education      5.5       14.5       19.5

A reassignment is accepted only if it reduces the ESS. For example, moving S1 from cluster 1 to cluster 3 would change the ESS by
    increase: 2[(5 - 27.5)^2 + (5 - 19.5)^2]/3
    decrease: 2[(5 - 5.5)^2 + (5 - 5.5)^2]/2
Since the increase far exceeds the decrease, S1 is not moved; no reassignment reduces the ESS, so the partition is final.

Which clustering method is best?
Hierarchical methods
- Advantage: they do not require a priori knowledge of the number of clusters or of a starting partition.
- Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster.

Nonhierarchical methods
- The cluster centers, or the initial partition, must be identified before the technique can proceed to cluster observations.
- Nonhierarchical clustering algorithms are, in general, very sensitive to the initial partition. The k-means algorithm and other nonhierarchical algorithms perform poorly when random initial partitions are used; their performance is much better when the results of a hierarchical method are used to form the initial partition.

Hierarchical and nonhierarchical techniques should therefore be viewed as complementary clustering techniques rather than as competing ones.