Clustering (1)
Outline: clustering; similarity measures; hierarchical clustering; model-based clustering.
Figures from the book Data Clustering by Gan et al.

Clustering
Objects in a cluster should:
- share closely related properties
- have small mutual distances
- be clearly distinguishable from objects not in the same cluster
A cluster should be a densely populated region surrounded by relatively empty regions.
- Compact cluster --- can be represented by a center
- Chained cluster --- higher-order structures

Clustering

Clustering Types of clustering:

Similarity measures
A metric distance function d(x, y) should satisfy:
- non-negativity: d(x, y) ≥ 0
- identity: d(x, y) = 0 if and only if x = y
- symmetry: d(x, y) = d(y, x)
- triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

Similarity measures
Similarity function: a function s(x, y) that is large when x and y are alike; typically 0 ≤ s(x, y) ≤ 1, s(x, x) = 1, and s(x, y) = s(y, x).

Similarity measures
From a dataset of n objects we can build:
- the distance matrix: an n × n symmetric matrix with entries d(x_i, x_j) and zeros on the diagonal;
- the similarity matrix: an n × n symmetric matrix with entries s(x_i, x_j) and ones on the diagonal.
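As a small illustration (not from the slides), here is how both matrices could be built in Python with SciPy; the conversion s = 1/(1 + d) is just one common choice of similarity, not the slides' definition.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy dataset: 5 objects with 2 continuous features (made up for illustration).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])

# Distance matrix: symmetric, zeros on the diagonal.
D = squareform(pdist(X, metric="euclidean"))

# One common way to turn distances into similarities (an assumption, not the slides' formula).
S = 1.0 / (1.0 + D)            # symmetric, ones on the diagonal

print(np.round(D, 2))
print(np.round(S, 2))
```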

Some similarity measures for continuous data
- Euclidean distance: d(x, y) = ( Σ_k (x_k − y_k)² )^(1/2)
- Manhattan distance: d(x, y) = Σ_k |x_k − y_k|
- Manhattan segmental distance (using only part of the dimensions): the Manhattan distance restricted to a chosen subset of dimensions, divided by the number of dimensions used

Some similarity measures for continuous data
- Maximum distance (sup distance): d(x, y) = max_k |x_k − y_k|
- Minkowski distance, the general case: d(x, y) = ( Σ_k |x_k − y_k|^R )^(1/R)
  R = 2 gives the Euclidean distance; R = 1 the Manhattan distance; R = ∞ the maximum distance.
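A minimal NumPy sketch of these distances; the two example points are made up.

```python
import numpy as np

x = np.array([2.0, 4.0, 3.0])
y = np.array([5.0, 1.0, 3.0])

def minkowski(x, y, r):
    """Minkowski distance: r=1 is Manhattan, r=2 is Euclidean, r=inf is the maximum distance."""
    if np.isinf(r):
        return float(np.max(np.abs(x - y)))
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

print(minkowski(x, y, 1))        # Manhattan: 6.0
print(minkowski(x, y, 2))        # Euclidean: sqrt(18) ≈ 4.24
print(minkowski(x, y, np.inf))   # Maximum (sup) distance: 3.0
```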

Some similarity measures for continuous data
Mahalanobis distance: d(x, y) = ( (x − y)ᵀ Σ⁻¹ (x − y) )^(1/2), where Σ is the covariance matrix of the data.
It is invariant under non-singular linear transformations: if C is any nonsingular d × d matrix and every point x is mapped to Cx, the new covariance matrix is C Σ Cᵀ.

Some similarity measures for continuous data
Under such a transformation the Mahalanobis distance doesn't change:
(Cx − Cy)ᵀ (C Σ Cᵀ)⁻¹ (Cx − Cy) = (x − y)ᵀ Cᵀ C⁻ᵀ Σ⁻¹ C⁻¹ C (x − y) = (x − y)ᵀ Σ⁻¹ (x − y).
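A small NumPy check of this invariance on synthetic data (a sketch with made-up data and an arbitrary C, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # toy data: 200 points in 3 dimensions

def mahalanobis(x, y, cov):
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

d_orig = mahalanobis(X[0], X[1], np.cov(X, rowvar=False))

# Apply an arbitrary nonsingular linear transformation C to every point.
C = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
XC = X @ C.T                                                   # each point x becomes C x
d_trans = mahalanobis(XC[0], XC[1], np.cov(XC, rowvar=False))  # covariance is now C Σ Cᵀ

print(d_orig, d_trans)                    # the two values agree up to rounding
```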

Some similarity measures for categorical data
- In one dimension: d(x, y) = 0 if x = y, and 1 otherwise.
- Simple matching distance for multiple dimensions: the number (or proportion) of attributes on which the two objects disagree.
- Taking category frequency into account: agreement on rare categories can be weighted more heavily than agreement on frequent ones.

Some similarity measures for categorical data
For more general definitions of similarity, for a pair of objects define:
- the number of attributes on which the values match;
- the number of attributes where at least one value is missing (NA; '?' means missing here);
- the number of attributes on which the values do not match.

Some example similarity measures for categorical data

Some similarity measures for categorical data
Binary feature vectors: for a pair of binary vectors, define S_ij as the number of features taking value i in the first vector and value j in the second, so S_11, S_10, S_01 and S_00 count the four possible combinations.

Some similarity measures for categorical data
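The slide's table of coefficients is not reproduced in the transcript; as an illustration, two widely used coefficients built from these counts are the simple matching coefficient and the Jaccard coefficient (my choice of examples, not necessarily the slide's list):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 0, 1])
y = np.array([1, 1, 1, 0, 0, 0, 0])

S11 = int(np.sum((x == 1) & (y == 1)))
S10 = int(np.sum((x == 1) & (y == 0)))
S01 = int(np.sum((x == 0) & (y == 1)))
S00 = int(np.sum((x == 0) & (y == 0)))

# Simple matching coefficient: fraction of positions where the two vectors agree.
smc = (S11 + S00) / (S11 + S10 + S01 + S00)

# Jaccard coefficient: agreement on 1s only, ignoring joint absences (0,0).
jaccard = S11 / (S11 + S10 + S01)

print(smc, jaccard)    # 0.571..., 0.4
```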

Some similarity measures for mixed-type data
General similarity coefficient by Gower: s(x, y) = Σ_k w_k s_k(x, y) / Σ_k w_k, where s_k is a per-attribute similarity (for a categorical attribute, 1 if the two values match and 0 otherwise; for a continuous attribute, 1 − |x_k − y_k| / R_k with R_k the range of attribute k), and the weight w_k is 0 when either value is missing and 1 otherwise.
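A minimal sketch of Gower's coefficient for a single pair of mixed-type records; the attribute layout, the ranges, and the helper name gower_similarity are made up for illustration.

```python
def gower_similarity(x, y, is_categorical, ranges):
    """Gower's general similarity coefficient for one pair of mixed-type records.
    x, y           : lists of attribute values (None marks a missing value)
    is_categorical : one boolean per attribute
    ranges         : range R_k per attribute (ignored for categorical attributes)
    """
    num, den = 0.0, 0.0
    for xi, yi, cat, r in zip(x, y, is_categorical, ranges):
        if xi is None or yi is None:        # weight w_k = 0 when a value is missing
            continue
        s = (1.0 if xi == yi else 0.0) if cat else 1.0 - abs(xi - yi) / r
        num += s                            # numerator:   sum of w_k * s_k
        den += 1.0                          # denominator: sum of w_k
    return num / den if den > 0 else 0.0

# Attributes: height in cm (continuous, range 50), growth rate and fruit (categorical).
x = [180.0, "fast", "apple"]
y = [170.0, "fast", None]
print(gower_similarity(x, y, [False, True, True], [50.0, None, None]))   # 0.9
```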

Similarity measures
Similarity between clusters:
- Mean-based distance: the distance between the two cluster mean vectors (centroids).
- Nearest neighbor distance: the smallest distance between a point in one cluster and a point in the other.

Similarity measures
- Farthest neighbor distance: the largest distance between a point in one cluster and a point in the other.
- Average neighbor distance: the average of all pairwise distances between points of the two clusters.
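A NumPy/SciPy sketch of the four between-cluster distances on two made-up point sets:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cluster A
B = np.array([[4.0, 4.0], [5.0, 4.0]])               # cluster B

pairwise = cdist(A, B)        # all point-to-point distances between the two clusters

mean_based = float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
nearest    = float(pairwise.min())    # nearest neighbor  (single-linkage style)
farthest   = float(pairwise.max())    # farthest neighbor (complete-linkage style)
average    = float(pairwise.mean())   # average neighbor  (average-linkage style)

print(mean_based, nearest, farthest, average)
```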

Hierarchical clustering
- Agglomerative: build the tree bottom-up by joining nodes (merging clusters).
- Divisive: build the tree top-down by dividing groups of objects.

Hierarchical clustering

Hierarchical clustering Example data:

Hierarchical clustering Single linkage: find the distance between any two nodes by nearest neighbor distance.

Hierarchical clustering Single linkage:

Hierarchical clustering Complete linkage: find the distance between any two nodes by farthest neighbor distance. Average linkage: find the distance between any two nodes by average distance.
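A short SciPy sketch of agglomerative clustering with the three linkage rules; the toy data are generated here, since the slides' example data appear only in a figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two loose groups of 2D points, purely for illustration.
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(5.0, 0.5, size=(10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # build the agglomerative tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```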

Hierarchical clustering
Comments:
- Hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height.
- The complete linkage method favors compact, ball-shaped clusters; the single linkage method favors chain-shaped clusters; average linkage is somewhere in between.

Model-based clustering
Impose certain model assumptions on the potential clusters and try to optimize the fit between the data and the model. The data are viewed as coming from a mixture of probability distributions, each of which represents one cluster.

Model-based clustering
For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point i comes from cluster j is the Gaussian density of cluster j evaluated at x_i.
Classification likelihood approach: find the cluster assignments and parameters that maximize the product, over all data points, of the density of each point under its assigned cluster.
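The slide's formulas appear only as images in the transcript; in standard notation (the symbols below are my choice) they read:

```latex
% Gaussian density of cluster j evaluated at data point x_i
f_j(x_i \mid \mu_j, \Sigma_j)
  = \frac{1}{(2\pi)^{d/2} \lvert \Sigma_j \rvert^{1/2}}
    \exp\!\Big( -\tfrac{1}{2} (x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j) \Big)

% Classification likelihood: \gamma_i is the cluster assignment of point i
L_C(\theta_1, \dots, \theta_K; \gamma_1, \dots, \gamma_n)
  = \prod_{i=1}^{n} f_{\gamma_i}(x_i \mid \theta_{\gamma_i})
```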

Model-based clustering
Mixture likelihood approach: maximize the likelihood of the full mixture, with the mixing proportions treated as parameters. The most commonly used method is the EM algorithm, which iterates between soft cluster assignment (E-step) and parameter estimation (M-step).
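Again in standard notation (symbols assumed, since the slide's formula is an image), the mixture likelihood with mixing proportions π_j is:

```latex
L_M(\theta, \pi) = \prod_{i=1}^{n} \sum_{j=1}^{K} \pi_j \, f_j(x_i \mid \theta_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{K} \pi_j = 1
```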

Model-based clustering
EM algorithm in the simplest case: a two-component Gaussian mixture in 1D.
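A compact NumPy sketch of this special case; the data generation, initialization, and fixed iteration count are arbitrary choices for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data from two Gaussians (for illustration only).
data = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.8, 100)])

# Initial guesses for the parameters of the two components.
mu = np.array([-1.0, 1.0])
sigma2 = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: soft assignment, responsibility of each component for each point.
    dens = pi * normal_pdf(data[:, None], mu, sigma2)   # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma2 = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(data)

print(mu, np.sqrt(sigma2), pi)   # close to means (-2, 3), std devs (1, 0.8), weights (0.6, 0.4)
```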

Model-based clustering

Model-based clustering

Model-based clustering Gaussian cluster models.

Model-based clustering
Common assumptions on the component covariance matrices, ordered from most to least restrictive:
(1) Σ_j = σ²I --- the same spherical covariance for every cluster;
(2) Σ_j = σ_j²I --- spherical, but with a cluster-specific scale;
(3) Σ_j = Σ --- a common full covariance matrix shared by all clusters;
(4) Σ_j unrestricted --- a separate full covariance matrix per cluster.
From 1 to 4, the model becomes more flexible, yet more parameters need to be estimated, and the fit may become less stable.