Unsupervised Learning

Unsupervised Learning (Chapter 10)
STT592-002: Intro. to Statistical Learning
Disclaimer: These slides are adapted from IOM 530: Intro. to Statistical Learning.

What is clustering?

PCA and Clustering
PCA seeks a low-dimensional representation of the observations that explains a good fraction of the variance; clustering seeks homogeneous subgroups among the observations.

Review: Examples of Unsupervised Learning
- A cancer researcher might assay gene expression levels in 100 patients with breast cancer and look for subgroups among the samples, or among the genes, in order to obtain a better understanding of the disease.
- An online shopping site might identify groups of shoppers with similar browsing and purchase histories, as well as items of interest within each group. An individual shopper can then be preferentially shown the items he or she is likely to be interested in, based on the purchase histories of similar shoppers.
- A search engine might choose which search results to display to a particular individual based on the click histories of other individuals with similar search patterns.

Clustering
- Clustering refers to a set of techniques for finding subgroups, or clusters, in a data set.
- A good clustering is one in which the observations within a group are similar to each other, while observations in different groups are very different.
- For example, suppose we collect p measurements on each of n breast cancer patients. There may be different, unknown subtypes of cancer that we could discover by clustering the data.

Different Clustering Methods
There are many different types of clustering methods. We will concentrate on two of the most commonly used approaches:
- K-Means Clustering
- Hierarchical Clustering

K-Means Clustering

K-Means Clustering
- To perform K-means clustering, one must first specify the desired number of clusters K (having to pre-specify K is a drawback!).
- The K-means algorithm then assigns each observation to exactly one of the K distinct, non-overlapping clusters.

How does K-Means work?
- We would like to partition the data set into K clusters.
- Each observation belongs to at least one of the K clusters.
- The clusters are non-overlapping, i.e. no observation belongs to more than one cluster.
- Idea: the objective is to have minimal "within-cluster variation", i.e. elements within a cluster should be as similar as possible.

How does K-Means work? (continued)
- Idea: the objective is to have minimal "within-cluster variation", i.e. elements within a cluster should be as similar as possible.
- One way to achieve this is to minimize the sum of all pairwise squared Euclidean distances between the observations in each cluster.
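In symbols (a standard formulation of this objective; the notation, with x_ij for feature j of observation i and C_k for the set of observations in cluster k, is assumed here rather than taken from the slides):

$$\underset{C_1,\ldots,C_K}{\text{minimize}} \;\; \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2$$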

K-Means Algorithm
1. Initial step: randomly assign each observation to one of the K clusters.
2. Iterate until the cluster assignments stop changing:
   a. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the mean of the observations assigned to the kth cluster.
   b. Assign each observation to the cluster whose centroid is closest (where "closest" is defined using Euclidean distance).
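A minimal sketch of this algorithm in Python with NumPy (all names here are ours for illustration; for simplicity it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Naive K-means: X is an (n, p) array, K the desired number of clusters."""
    rng = np.random.default_rng(seed)
    # Initial step: randomly assign each observation to one of K clusters.
    labels = rng.integers(0, K, size=X.shape[0])
    for _ in range(max_iter):
        # Compute each cluster centroid as the mean of its assigned observations
        # (assumes, for simplicity, that no cluster is empty).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Reassign each observation to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # cluster assignments stopped changing
        labels = new_labels
    return labels, centroids
```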

An Illustration of the K-Means Algorithm
(Figure: progress of the algorithm on example data.) Points are randomly assigned to clusters; cluster centers are computed from the initial assignments; each point is assigned to the closest cluster center; new cluster centers are computed; the last two steps repeat until there is no further change, at which point the algorithm stops.

Local Optima
(Figure: a bad solution vs. a good solution.)
- The K-means result depends on the initial (random) cluster assignment: the algorithm can get stuck in a local optimum and fail to find the best solution.
- Hence, it is important to run the algorithm multiple times with different random starting points and keep the best solution.
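In practice, libraries automate the restarts. For instance, scikit-learn's KMeans accepts an n_init argument that reruns the algorithm from different random starts and keeps the fit with the lowest within-cluster sum of squares (the data below is a made-up placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(100, 2))  # placeholder data

# n_init=20: run K-means 20 times from random starts, keep the best fit.
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.labels_[:10])  # cluster assignments of the first 10 observations
print(km.inertia_)      # within-cluster sum of squares of the best run
```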

Hierarchical Clustering

Hierarchical Clustering
- K-means clustering requires choosing the number of clusters K in advance. If we don't want to do that, an alternative is to use hierarchical clustering.
- Hierarchical clustering has the added advantage that it produces a tree-based representation of the observations, called a dendrogram.
- We will cover bottom-up, or agglomerative, clustering.

Dendrograms
- First join the closest points (5 and 7).
- The height of the fusion/merge (on the vertical axis) indicates how different the points are.
- After two points are fused, they are treated as a single observation and the algorithm continues.

Interpretation
- Each "leaf" of the dendrogram represents one of the 45 observations.
- At the bottom of the dendrogram, each observation is a distinct leaf. However, as we move up the tree, some leaves begin to fuse; these correspond to observations that are similar to each other.
- As we move higher up the tree, an increasing number of observations have fused. The earlier (lower in the tree) two observations fuse, the more similar they are to each other; observations that fuse late are quite different.

Choosing Clusters
- To choose clusters, we draw horizontal lines across the dendrogram.
- We can form any number of clusters depending on where we draw the break point.
(Figure: cuts yielding one, two, and three clusters.)

Algorithm (Agglomerative Approach)
The dendrogram is produced as follows:
1. Start with each point as a separate cluster (n clusters).
2. Calculate a measure of dissimilarity between all pairs of points/clusters.
3. Fuse the two clusters that are most similar, so that there are now n-1 clusters.
4. Fuse the next two most similar clusters, so that there are now n-2 clusters.
5. Continue until there is only one cluster.
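A sketch of how one might run this procedure with SciPy's hierarchical clustering routines (the data is made up; the fcluster call at the end performs the dendrogram cut described on the "Choosing Clusters" slide):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(45, 2))  # 45 made-up observations

# Build the full fusion tree with complete linkage and Euclidean distance.
Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z)  # plot the tree; fusion height reflects dissimilarity
plt.show()

# "Cut" the dendrogram to obtain, say, 3 clusters.
clusters = fcluster(Z, t=3, criterion="maxclust")
```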

Review Example
- Start with 9 clusters.
- Fuse 5 and 7.
- Fuse 6 and 1.
- Fuse the (5,7) cluster with 8.
- Continue until all observations are fused.

How do we define dissimilarity?
How do we define the dissimilarity, or linkage, between the fused cluster (5,7) and point 8? There are four options:
- Complete Linkage
- Single Linkage
- Average Linkage
- Centroid Linkage

Linkage Methods: Distance Between Clusters
- Complete Linkage: the largest distance between observations in the two clusters.
- Single Linkage: the smallest distance between observations in the two clusters.
- Average Linkage: the average distance between all pairs of observations in the two clusters.
- Centroid Linkage: the distance between the centroids of the two clusters.

Linkage Can Be Important
- Here we have three clustering results for the same data; the only difference is the linkage method, yet the results are very different.
- Complete and average linkage tend to yield evenly sized, balanced clusters, whereas single linkage tends to yield extended clusters to which single leaves are fused one by one.
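One way to see this on data of your own is to hold the data and distance fixed and vary only the linkage rule (a sketch, again on made-up data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(45, 2))  # made-up data

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, method in zip(axes, ["complete", "average", "single"]):
    # Same data, same distance metric; only the linkage rule changes.
    dendrogram(linkage(X, method=method), ax=ax, no_labels=True)
    ax.set_title(method + " linkage")
plt.show()
```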

Choice of Dissimilarity Measure
- So far, we have considered using Euclidean distance as the dissimilarity measure.
- However, an alternative measure that can make sense in some cases is correlation-based distance.

Comparing Dissimilarity Measures
- In this example, we have 3 observations and p = 20 variables.
- In terms of Euclidean distance, observations 1 and 3 are similar.
- However, observations 1 and 2 are highly correlated, so they would be considered similar under a correlation-based measure.

Online Shopping Example
- Suppose we record the number of purchases of each item (columns) for each customer (rows).
- Using Euclidean distance, customers who have purchased very little will be clustered together.
- Using a correlation-based measure, customers who tend to purchase the same types of products will be clustered together, even if the magnitudes of their purchases are quite different.
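A sketch of how correlation-based distance might be used in practice: precompute pairwise distances between customers and pass them to the hierarchical procedure (the purchase matrix is invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Rows = customers, columns = items; entries = purchase counts (invented).
X = np.random.default_rng(2).poisson(lam=1.0, size=(50, 30)).astype(float)

# pdist with metric="correlation" returns 1 - Pearson correlation for each
# pair of rows: small when two customers' purchase profiles move together,
# regardless of overall purchase volume (assumes no row is constant).
D = pdist(X, metric="correlation")

Z = linkage(D, method="average")  # cluster customers by purchase profile
```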

Standardizing/Scaling the Variables
Consider an online shop that sells two items: socks and computers.
- Left: in terms of quantity purchased, socks have higher weight.
- Center: after standardizing, socks and computers have equal weight.
- Right: in terms of dollar sales, computers have higher weight.
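Standardizing is typically a one-liner; a sketch with scikit-learn's StandardScaler (the tiny socks/computers matrix is invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: number of socks bought, number of computers bought (invented data).
X = np.array([[8.0, 0.0], [11.0, 0.0], [7.0, 1.0], [6.0, 1.0]])

# Center each variable to mean zero and scale to standard deviation one,
# so neither item dominates the distance computation by sheer magnitude.
X_scaled = StandardScaler().fit_transform(X)
```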

Final Thoughts

Practical Issues in Clustering
In order to perform clustering, some decisions must be made:
- Should the features first be standardized, i.e. centered to have mean zero and scaled to have standard deviation one?
- In the case of hierarchical clustering:
  - What dissimilarity measure should be used?
  - What type of linkage should be used?
  - Where should we cut the dendrogram in order to obtain clusters?
- In the case of K-means clustering: how many clusters K should we look for in the data?
In practice, we try several different choices and look for the one with the most useful or interpretable solution. There is no single right answer!

Final Thoughts
- Most importantly, one must be careful about how the results of a clustering analysis are reported.
- These results should NOT be taken as the absolute truth about a data set.
- Rather, they should constitute a starting point for the development of scientific hypotheses and further study, preferably on independent data. In other words, clustering is exploratory data analysis.