LISA Short Course Series: Multivariate Clustering Analysis in R. Yuhyun Song, Nov 3, 2015.


Laboratory for Interdisciplinary Statistical Analysis

Collaboration: visit our website to request personalized statistical advice and assistance with:
- Designing experiments
- Analyzing data
- Interpreting results
- Grant proposals
- Software (R, SAS, JMP, Minitab, ...)

LISA statistical collaborators aim to explain concepts in ways useful for your research. Great advice right now: meet with LISA before collecting your data. All services are FREE for VT researchers. We assist with research, not class projects or homework. LISA helps VT researchers benefit from the use of statistics.

LISA also offers:
- Educational Short Courses: designed to help graduate students apply statistics in their research.
- Walk-In Consulting: available Monday-Friday from 1-3 PM in the Old Security Building (OSB); Tuesday, Thursday, and Friday from 10 AM-12 PM in the GLC; and Wednesday from 10 AM-12 PM in Hutcheson, for questions under 30 minutes.

OUTLINE
1. Data
2. What is multivariate analysis?
3. What is clustering analysis?
4. Clustering algorithms
   - Hierarchical agglomerative clustering
   - Partitioning clustering: K-means, Partitioning Around Medoids (PAM)
5. Cluster validation


DATA: Twitter data
The data set can be downloaded from the course website and contains 320 tweets.

DATA: Twitter data
Because raw text is not numeric, the data need several munging steps before clustering:
- Transforming the text: changing letters to lower case; removing punctuation, numbers, and stop words
- Stemming words
- Building a term-document matrix containing word frequencies
We will carry out these steps before applying the clustering algorithms to the data matrix.
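The munging steps above can be sketched in base R (the course itself may use a text-mining package such as tm; the two tweets below are made-up examples, and stemming is omitted):

```r
# Base-R sketch of the munging steps above; the tweets are made up.
tweets <- c("Clustering in R is fun!", "R makes clustering easy, easy.")

clean <- tolower(tweets)                   # letters to lower case
clean <- gsub("[[:punct:]]", "", clean)    # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)    # remove numbers

stopwords <- c("in", "is", "a", "the")     # tiny illustrative stop-word list
tokens <- lapply(strsplit(clean, "\\s+"),
                 function(w) w[!w %in% stopwords])

# Term-document matrix: rows = words, columns = tweets, entries = counts
vocab <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(w) table(factor(w, levels = vocab)))
tdm
```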


Multivariate Data Analysis
- Univariate data analysis is used when one outcome variable is measured for each object.
- Multivariate data analysis is used when more than one outcome variable is measured for each object. It refers to any statistical technique used to analyze data that arise from more than one variable, and is concerned with the study of association among sets of measurements.

Multivariate Data Analysis
- Principal Components Analysis: dimension reduction (exploratory)
- Factor Analysis: understand patterns of intercorrelation (both exploratory and confirmatory)
- Multidimensional Scaling Analysis: create a spatial representation from object similarities (mainly exploratory)
- Classification Analysis: build classification rules for predefined groups (both exploratory and confirmatory)
- Clustering Analysis: create groupings from object similarities (exploratory)


Clustering Analysis
What is a natural grouping among characters? Segmenting characters into groups is subjective: the same set of characters could be grouped into villains vs. heroes, or into males vs. females.

Clustering Analysis
Cluster: a collection of data objects such that
- objects are similar to one another within the same cluster;
- objects are dissimilar to the objects in other clusters.
Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping the data objects so that intra-cluster distances are minimized and inter-cluster distances are maximized.
Clustering is unsupervised learning: there are no predefined classes.

Two Types of Clustering Analysis
- Hierarchical clustering: objects are partitioned into nested groups that are organized as a hierarchical tree.
- Partitioning clustering: objects are partitioned into non-overlapping groups, and each object belongs to exactly one group.

Data Structure
- Data matrix: an n x p matrix, where n is the number of data objects and p is the number of variables; most suitable for partitioning methods.
- Similarity/dissimilarity (distance) matrix: an n x n matrix calculated from the data matrix; most suitable for hierarchical agglomerative methods.

Dissimilarity (Distance) Measures
A distance measure is a numerical measure of how different two objects are; the lower its value, the more similar the objects. Given two data objects X1 = (x11, ..., x1p) and X2 = (x21, ..., x2p), the distance between them is a real number denoted by d(X1, X2). Common distance measures between data objects:
- Euclidean distance: d(X1, X2) = sqrt((x11 - x21)^2 + ... + (x1p - x2p)^2)
- Manhattan distance: d(X1, X2) = |x11 - x21| + ... + |x1p - x2p|
- Minkowski distance: d(X1, X2) = (|x11 - x21|^m + ... + |x1p - x2p|^m)^(1/m), which gives the Euclidean distance for m = 2 and the Manhattan distance for m = 1
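All three measures are available in base R's dist(); a small check on two made-up points, (0, 0) and (3, 4):

```r
# dist() computes the distances defined above; the points are made up.
x <- rbind(c(0, 0), c(3, 4))

d_euc <- as.numeric(dist(x, method = "euclidean"))
d_man <- as.numeric(dist(x, method = "manhattan"))
d_min <- as.numeric(dist(x, method = "minkowski", p = 3))  # m = 3

c(euclidean = d_euc, manhattan = d_man, minkowski = d_min)
# euclidean = 5 (sqrt(9 + 16)), manhattan = 7 (3 + 4)
```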


Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering produces a sequence of nested clustering solutions organized in a hierarchical tree structure. It uses a distance matrix for clustering, and the solution is visualized by a dendrogram. This method does not require the number of clusters K as an input.

Hierarchical Agglomerative Clustering
Distance between clusters (linkage functions):
- Single linkage: smallest distance between an object in one cluster and an object in the other, i.e., d(Ci, Cj) = min d(Xip, Xjq) over all Xip in Ci and Xjq in Cj.
- Complete linkage: largest distance between an object in one cluster and an object in the other, i.e., d(Ci, Cj) = max d(Xip, Xjq).
- Average linkage: average distance between an object in one cluster and an object in the other, i.e., d(Ci, Cj) = avg d(Xip, Xjq).

Hierarchical Agglomerative Clustering
Given a data set of n data objects, the hierarchical agglomerative clustering algorithm is implemented in the following steps:
Step 1. Calculate the distance matrix for the n data objects.
Step 2. Set each object as its own cluster.
Step 3. Repeat until the number of clusters is 1:
  Step 3.1. Merge the two closest clusters.
  Step 3.2. Update the distance matrix using the linkage function.
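The steps above are what base R's hclust() implements; a minimal sketch on five made-up one-dimensional points:

```r
# hclust() carries out the agglomerative steps above; data are made up.
x <- matrix(c(1, 2, 5, 6, 12), ncol = 1,
            dimnames = list(c("A", "B", "C", "D", "E"), NULL))

d  <- dist(x)                        # Step 1: distance matrix
hc <- hclust(d, method = "single")   # Steps 2-3 with single linkage
cutree(hc, k = 2)                    # cut the dendrogram into 2 clusters
```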

Hierarchical Agglomerative Clustering
Example: given 5 data objects A, B, C, D, E and their distance matrix (shown on the slide), we repeatedly merge the two closest clusters and update the distance matrix using the single linkage function:
1. In the beginning we have 5 clusters: A, B, C, D, E.
2. We merge clusters A and B into cluster (A, B).
3. We merge clusters C and D into cluster (C, D).
4. We merge clusters (A, B) and (C, D) into ((A, B), (C, D)).
5. We merge clusters ((A, B), (C, D)) and E.
6. The last cluster contains all the objects, which concludes the computation.
The merge distances are recorded as heights in the dendrogram.

Hierarchical Agglomerative Clustering
How do we decide the number of clusters? Cut the tree: cutting the dendrogram at different heights yields different numbers of clusters (e.g., K = 2, 3, or 4).

R: Hierarchical Agglomerative Clustering
In R, we will build a term-document matrix of word frequencies that counts the number of times each word occurs in each tweet (document). Then we will cluster the words in the tweets with the hierarchical agglomerative clustering algorithm.


Partitioning Algorithm
Partitioning method: construct a partition of the n data objects into a set of K clusters. Given a pre-determined K, find the partition into K clusters that optimizes the chosen partitioning criterion.
- K-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
- PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the data objects in the cluster.


K-means Clustering
Given a set of observations, K-means clustering aims to partition the n observations into K clusters by minimizing the within-cluster sum of squares (WCSS):
WCSS = sum over k = 1..K of sum over x in Ck of ||x - mu_k||^2, where mu_k is the centroid of cluster Ck.
- Each cluster is associated with a centroid; the centroid is the mean of the points in the cluster.
- Each point is assigned to the cluster with the closest centroid.
- The initial K centroids are chosen randomly.
- The number of clusters, K, must be specified.

K-means Clustering
Given K, the K-means algorithm is implemented in four steps:
Step 1. Partition the objects into K nonempty subsets.
Step 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
Step 3. Assign each object to the cluster with the nearest seed point.
Step 4. Go back to Step 2; stop when no assignments change.
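Base R's kmeans() implements these steps; a minimal sketch on two made-up, well-separated groups of points:

```r
# kmeans() runs the four steps above; the data are made up so that two
# clusters are clearly separated. nstart tries several random starts.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)
km$size           # points per cluster
km$tot.withinss   # the WCSS that K-means minimizes
```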

K-means Clustering
How do we determine the number of clusters in K-means clustering?
- Fit K-means clustering with different values of K and calculate the WCSS for each.
- Draw a scree plot of WCSS against K.
- Choose the number of clusters where there is a sharp drop in the WCSS.
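A sketch of this scree-plot approach on made-up data with two well-separated groups:

```r
# Fit K-means for K = 1..5, record the WCSS, and draw the scree plot;
# the data are made up with two well-separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

wcss <- sapply(1:5, function(k)
  kmeans(x, centers = k, nstart = 10)$tot.withinss)

plot(1:5, wcss, type = "b",
     xlab = "Number of clusters K", ylab = "WCSS")
# The sharp drop from K = 1 to K = 2 suggests K = 2.
```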

Clustering Analysis: K-means Clustering (animation)
Reference: "Kmeans animation withoutWatermark" by Incheol, licensed under CC BY-SA 4.0.


Partitioning Around Medoids (PAM)
The PAM algorithm partitions the n objects into K clusters by finding the clustering solution that minimizes the overall dissimilarity between the representative of each cluster and its members.
- Each cluster is associated with a medoid, and each point is assigned to the cluster with the closest medoid.
- The K medoids are K representative data objects.

Partitioning Around Medoids (PAM)
In PAM, the swapping cost is used as the objective function:
- For each pair of a medoid m and a non-medoid object h, measure whether h would be a better medoid than m.
- Using the squared-error criterion, compute Eh - Em.
- A negative value means the swap brings benefit; choose the swap with the minimum swapping cost.

Partitioning Around Medoids (PAM)
Given K, PAM is implemented in six steps:
Step 1. Randomly pick K data points as the initial medoids.
Step 2. Assign each data point to the nearest medoid x.
Step 3. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion).
Step 4. Randomly select a non-medoid point y.
Step 5. Swap x with y if the swap reduces the objective function.
Step 6. Repeat Steps 2-5 until there is no change.
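These steps are implemented by pam() in the cluster package (a recommended package bundled with R); a minimal sketch on made-up data:

```r
# pam() from the cluster package runs the medoid-swapping steps above;
# the data are made up with two well-separated groups.
library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

p <- pam(x, k = 2)
p$medoids            # the two representative data objects
table(p$clustering)  # cluster sizes
```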


Cluster Validation
Why is cluster validation necessary? Clustering algorithms will define clusters even if there is no natural cluster structure, and in higher dimensions it is not easy to detect whether natural cluster structures exist. Thus, we need approaches to determine whether there is non-random structure in the data and how well the resulting clusters fit the data.

The Silhouette Coefficient
The silhouette coefficient is a method of interpreting and validating consistency within clusters of data; it quantifies the quality of a clustering. To compute it for an individual point i:
- Calculate a = the average distance of i to the points in its own cluster.
- Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
- The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, which is not the usual case).
We can calculate the average silhouette coefficient for a cluster or for a whole clustering; the closer to 1, the better.
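The a, b, and s computations above can be checked by hand in base R on a tiny made-up example with clusters {1, 2} and {10}:

```r
# Silhouette coefficient of the first point, computed as described above.
x  <- c(1, 2, 10)   # three 1-D points
cl <- c(1, 1, 2)    # cluster labels: {1, 2} and {10}

i <- 1                                     # point 1, first in its cluster
a <- mean(abs(x[i] - x[cl == cl[i]][-1]))  # avg distance within own cluster
                                           # ([-1] drops point i itself)
b <- mean(abs(x[i] - x[cl != cl[i]]))      # avg distance to other cluster
s <- if (a < b) 1 - a / b else b / a - 1
s   # 1 - 1/9, close to 1: point 1 is well clustered
```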

The Silhouette Coefficient
Interpretation of the average silhouette coefficient:
- 0.71 - 1.00: a strong structure has been found
- 0.51 - 0.70: a reasonable structure has been found
- 0.26 - 0.50: the structure is weak and could be artificial; try additional methods of data analysis
- < 0.26: no substantial structure has been found

R: K-means Clustering and PAM
We will cluster the tweets with K-means clustering and PAM. Then we will visualize the silhouette plot to assess the quality of the clustering solutions.

References
- RDataMining
- Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison-Wesley, 2005.
- Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. Berlin: Springer, 2001.

Please don't forget to fill in the sign-in sheet and to complete the survey that will be sent to you by email. Thank you!