EE3J2 Data Mining, Lecture 11: Clustering. Martin Russell.

Presentation transcript:

Slide 1 EE3J2 Data Mining Lecture 11: Clustering, Martin Russell

Slide 2 Objectives
- To explain the motivation for clustering
- To introduce the ideas of distance and distortion
- To describe agglomerative and divisive clustering
- To explain the relationships between clustering and decision trees

Slide 3 Example from speech processing
Plot of high-frequency energy vs low-frequency energy, for 25 ms speech segments, sampled every 10 ms

Slide 4 Structure of data
- Typical real data is not uniformly distributed
- It has structure
- Variables might be correlated
- The data might be grouped into natural 'clusters'
- The purpose of cluster analysis is to find this underlying structure automatically

Slide 5 Clusters and centroids
- If we assume that the clusters are spherical, then they are determined by their centres
- The cluster centres are called centroids
- How many centroids do we need?
- Where should we put them?
(Figure label: centroids)

Slide 6 Distance
- A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:
  – d(x,x) = 0 for every point x
  – d(x,y) = d(y,x) for all points x and y (d is symmetric)
  – d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)

Slide 7 Example metrics
- The most common metric is the Euclidean metric
- In this case, if x = (x_1, x_2, …, x_N) and y = (y_1, y_2, …, y_N) then:
  $d(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$
- This corresponds to the standard notion of distance in Euclidean space
- There are many other metrics, but this lecture focuses on this one
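As a small illustration of the Euclidean metric on this slide, the sketch below computes the distance between two points given as equal-length sequences. The function name `euclidean` is chosen here for illustration and does not come from the lecture.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    print(euclidean((0, 0), (3, 4)))
```

On any sample of points the three properties from the previous slide can be checked directly, for example that `euclidean(x, z)` never exceeds `euclidean(x, y) + euclidean(y, z)`.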

Slide 8 Distortion
- Distortion is a measure of how well a set of centroids models a set of data
- Suppose we have:
  – data points y_1, y_2, …, y_T
  – centroids c_1, …, c_M
- For each data point y_t let c_{i(t)} be the closest centroid
- In other words: $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$

Slide 9 Distortion
- The distortion for the centroid set C = {c_1, …, c_M} is defined by:
  $\mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})$
- In other words, the distortion is the sum of distances between each data point and its nearest centroid
- The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
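The definition of distortion can be sketched in a few lines of Python. This is a minimal illustration that assumes the Euclidean metric (via the standard library's `math.dist`, Python 3.8+); the helper names `nearest_centroid` and `distortion` are made up for this example.

```python
import math


def nearest_centroid(y, centroids):
    """Return (index, distance) of the centroid closest to point y."""
    distances = [math.dist(y, c) for c in centroids]
    i = min(range(len(centroids)), key=distances.__getitem__)
    return i, distances[i]


def distortion(data, centroids):
    """Sum, over all data points, of the distance to the nearest centroid."""
    return sum(nearest_centroid(y, centroids)[1] for y in data)


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0)]
    centroids = [(0, 1), (10, 0)]
    print(distortion(data, centroids))
```

In the small example, the first two points are each at distance 1 from the centroid (0, 1) and the third point coincides with the centroid (10, 0), so the total is 1 + 1 + 0.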

Slide 10 Types of Clustering
- Initially we will look at two types of cluster analysis:
  – Agglomerative clustering, or 'bottom-up' clustering
  – Divisive clustering, or 'top-down' clustering

Slide 11 Agglomerative clustering
- Agglomerative clustering begins by assuming that each data point belongs to its own, unique, 1-point cluster
- Clusters are then combined until the required number of clusters is obtained
- The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
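The simplest agglomerative algorithm described on this slide might look like the sketch below. The slide does not say how two centroids are combined, so this example assumes the merged centroid is the mean of the two, weighted by the number of data points each one represents. The function name `agglomerate` and the use of Euclidean distance are also assumptions of this illustration.

```python
import math
from itertools import combinations


def agglomerate(points, target):
    """Merge the two closest centroids until only `target` centroids remain."""
    if not 1 <= target <= len(points):
        raise ValueError("target must be between 1 and the number of points")
    # Each cluster is (centroid, number of data points it represents).
    clusters = [(tuple(float(v) for v in p), 1) for p in points]
    while len(clusters) > target:
        # Find the pair of centroids (i < j) with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Assumed merge rule: count-weighted mean of the two centroids.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        del clusters[j]  # j > i, so index i is still valid afterwards
        clusters[i] = (merged, ni + nj)
    return [c for c, _ in clusters]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(agglomerate(pts, 2))
```

This version searches all pairs at every step, which is fine for a few hundred points such as the 302-point example in the following slides but scales poorly to large data sets.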

Slide 12 Original data (302 points)

Slide 13 252 centroids

Slide 14 152 centroids

Slide 15 52 centroids

Slide 16 12 centroids

Slide 17 Divisive Clustering
- Divisive clustering begins by assuming that there is just one centroid, typically in the centre of the set of data points
- That point is replaced with 2 new centroids
- Then each of these is replaced with 2 new centroids
- …
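One possible way to realise the repeated splitting is sketched below. The slide does not specify how a centroid is replaced by two new ones, so the splitting rule here is purely an assumption for illustration: the centroid is nudged by a small amount `eps` in both directions along the coordinate with the largest range, the cluster's points are assigned to the nearer of the two nudged centroids, and each group's mean becomes a new centroid. The names `mean`, `split` and `divisive` are hypothetical.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def split(cluster, eps=1e-3):
    """Split a cluster's points into two groups (assumed splitting rule)."""
    c = mean(cluster)
    # Pick the coordinate along which the cluster is most spread out.
    spans = [max(col) - min(col) for col in zip(*cluster)]
    k = spans.index(max(spans))
    lo = tuple(v - eps if d == k else v for d, v in enumerate(c))
    hi = tuple(v + eps if d == k else v for d, v in enumerate(c))
    left, right = [], []
    for p in cluster:
        (left if math.dist(p, lo) <= math.dist(p, hi) else right).append(p)
    return left, right


def divisive(points, levels):
    """Start from one cluster and split every cluster `levels` times."""
    clusters = [list(points)]
    for _ in range(levels):
        next_clusters = []
        for cluster in clusters:
            # A group can be empty if the cluster cannot be split further.
            next_clusters.extend(g for g in split(cluster) if g)
        clusters = next_clusters
    return [mean(c) for c in clusters]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(divisive(pts, 1))
```

With `levels=0` the result is the single centroid at the centre of the data; each further level at most doubles the number of centroids.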

Slide 18 Original data (302 points)

Slide 19 Original data (302 points)

Slide 20 Decision tree interpretation
- Single centroid: whole set
- Multiple centroids: one per data point
- Top-down clustering: divisive
- Bottom-up clustering: agglomerative

Slide 21 Note on optimality
- An 'optimal' set of centroids is one which minimises the distortion
- None of these methods necessarily gives an optimal set of centroids
- Instead they give locally optimal sets of centroids
- Why?

Slide 22 Summary
- Distance metrics and distortion
- Agglomerative clustering
- Divisive clustering
- Decision tree interpretation