EE3J2 Data Mining
Lecture 11: Clustering
Martin Russell

Objectives
• To explain the motivation for clustering
• To introduce the ideas of distance and distortion
• To describe agglomerative and divisive clustering
• To explain the relationships between clustering and decision trees

Example from speech processing
Figure: plot of high-frequency energy against low-frequency energy for 25 ms speech segments, sampled every 10 ms

Structure of data
• Typical real data is not uniformly distributed
• It has structure
• Variables might be correlated
• The data might be grouped into natural ‘clusters’
• The purpose of cluster analysis is to find this underlying structure automatically

Clusters and centroids
• If we assume that the clusters are spherical, then they are determined by their centres
• The cluster centres are called centroids
• How many centroids do we need?
• Where should we put them?

Distance
• A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:
  – d(x,x) = 0 for every point x
  – d(x,y) = d(y,x) for all points x and y (d is symmetric)
  – d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)
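The three conditions can be tested numerically on a finite sample of points. The sketch below is illustrative only: the helper names `satisfies_metric_axioms` and `squared_euclidean` are hypothetical, and passing the check on a sample does not prove that a function is a metric. A failure, however, does show that it is not one.

```python
import itertools
import math


def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the three conditions above for d on a finite sample of points."""
    for x in points:
        if abs(d(x, x)) > tol:                   # d(x,x) = 0
            return False
    for x, y in itertools.product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:         # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:    # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Squared Euclidean distance: symmetric, but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


sample = [(0.0,), (1.0,), (2.0,)]
print(satisfies_metric_axioms(math.dist, sample))          # True
print(satisfies_metric_axioms(squared_euclidean, sample))  # False
```

The squared Euclidean distance fails because d(0,2) = 4 exceeds d(0,1) + d(1,2) = 2, which violates the triangle inequality.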

Example metrics
• The most common metric is the Euclidean metric
• In this case, if x = (x_1, x_2, …, x_N) and y = (y_1, y_2, …, y_N) then:
  \[ d(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2} \]
• This corresponds to the standard notion of distance in Euclidean space
• There are many other metrics, but this lecture focuses on this one
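As a minimal sketch, the Euclidean metric can be computed as follows. The function name `euclidean` is an assumption made for illustration; Python's standard library offers the same computation as `math.dist`.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of components")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean((0, 0), (3, 4)))  # 5.0
```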

Distortion
• Distortion is a measure of how well a set of centroids models a set of data
• Suppose we have:
  – data points y_1, y_2, …, y_T
  – centroids c_1, …, c_M
• For each data point y_t, let c_{i(t)} be the closest centroid
• In other words: d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)

Distortion
• The distortion for the centroid set C = {c_1, …, c_M} is defined by:
  \[ \mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)}) \]
• In other words, the distortion is the sum of distances between each data point and its nearest centroid
• The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
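The definition of distortion can be sketched in a few lines, assuming the Euclidean metric. The names `nearest_centroid` and `distortion` are illustrative and do not come from the lecture.

```python
import math


def nearest_centroid(y, centroids):
    """Return the index i(t) of the centroid closest to the data point y."""
    return min(range(len(centroids)), key=lambda m: math.dist(y, centroids[m]))


def distortion(data, centroids):
    """Sum, over all data points, of the distance to the nearest centroid."""
    return sum(math.dist(y, centroids[nearest_centroid(y, centroids)]) for y in data)


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
print(distortion(data, centroids))  # 2.0
```

Here the first two points are each at distance 1 from the centroid (0, 1) and the third point coincides with its centroid, so the distortion is 1 + 1 + 0 = 2.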

Types of clustering
• Initially we will look at two types of cluster analysis:
  – Agglomerative clustering, or ‘bottom-up’ clustering
  – Divisive clustering, or ‘top-down’ clustering

Agglomerative clustering
• Agglomerative clustering begins by assuming that each data point belongs to its own, unique, one-point cluster
• Clusters are then combined until the required number of clusters is obtained
• The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
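A minimal sketch of this simplest algorithm follows. The slide does not say how two centroids are combined, so the code assumes the merged centroid is the mean of the two, weighted by the number of data points each represents. The function name `agglomerate` and the Euclidean metric are likewise assumptions.

```python
import math


def agglomerate(points, target):
    """Merge the two closest centroids repeatedly until `target` remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    # Each cluster is stored as (centroid, number of points it represents).
    clusters = [(tuple(p), 1) for p in points]
    while len(clusters) > target:
        best = None
        # Exhaustive search for the closest pair of centroids.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = math.dist(clusters[i][0], clusters[j][0])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Assumed merge rule: count-weighted mean of the two centroids.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, ni + nj))
    return [centroid for centroid, _ in clusters]


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(sorted(agglomerate(points, 2)))  # [(0.0, 0.5), (10.0, 0.5)]
```

The pairwise search makes each merge quadratic in the number of centroids, which is acceptable for a few hundred points but not for large data sets.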

Figure: original data (302 points)

Figure: 252 centroids

Divisive clustering
• Divisive clustering begins by assuming that there is just one centroid, typically in the centre of the set of data points
• That point is replaced with 2 new centroids
• Then each of these is replaced with 2 new centroids
• …
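The slide does not specify how a centroid is replaced with two new ones. The sketch below assumes one common choice: each centroid is perturbed by a small offset `eps` in opposite directions, every data point is assigned to its nearest candidate, and each candidate is then moved to the mean of its assigned points. The names `mean`, `split_once`, `divisive` and the parameter `eps` are all hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def split_once(points, centroids, eps=1e-3):
    """Replace each centroid with two, then re-centre on the assigned points."""
    candidates = []
    for c in centroids:
        candidates.append(tuple(x + eps for x in c))  # assumed perturbation
        candidates.append(tuple(x - eps for x in c))
    groups = [[] for _ in candidates]
    for p in points:
        k = min(range(len(candidates)), key=lambda m: math.dist(p, candidates[m]))
        groups[k].append(p)
    # Candidates that attract no data points are dropped.
    return [mean(g) for g in groups if g]


def divisive(points, levels):
    """Start from the centre of the data and split `levels` times."""
    centroids = [mean(points)]
    for _ in range(levels):
        centroids = split_once(points, centroids)
    return centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(sorted(divisive(points, 1)))  # [(0.0, 1.0), (10.0, 1.0)]
```

Perturbing every coordinate by the same amount is a simplification: if the data happen to be spread at right angles to that direction, the two candidates may not separate them well.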

Figure: original data (302 points)

Decision tree interpretation
• Single centroid: whole set
• Multiple centroids: one per data point
• Top-down clustering: divisive
• Bottom-up clustering: agglomerative

Note on optimality
• An ‘optimal’ set of centroids is one which minimises the distortion
• None of these methods necessarily gives an optimal set of centroids
• Instead they give locally optimal sets of centroids
• Why?
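One possible illustration of non-optimality uses four points on a line. It assumes that agglomerative merging replaces two centroids with their count-weighted mean, which the slides do not specify. Under that assumption, the points 0, 4, 5 and 9.5 reduce to the centroids 3 and 9.5: first 4 and 5 merge to 4.5, then 0 and 4.5 merge to 3. The function name `distortion_1d` is hypothetical.

```python
def distortion_1d(data, centroids):
    """Sum of distances from each 1-D data point to its nearest centroid."""
    return sum(min(abs(y - c) for c in centroids) for y in data)


data = [0.0, 4.0, 5.0, 9.5]
greedy = [3.0, 9.5]       # result of the two merges described above
alternative = [4.0, 9.5]  # a different pair of centroids, chosen by hand

print(distortion_1d(data, greedy))       # 6.0
print(distortion_1d(data, alternative))  # 5.0
```

The hand-picked pair has lower distortion, so the greedy result is not optimal in this case. Each merge is the best available step at that stage, but an early merge cannot be undone later.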

Summary
• Distance metrics and distortion
• Agglomerative clustering
• Divisive clustering
• Decision tree interpretation
