Slide 1: EE3J2 Data Mining, Lecture 11: Clustering (Martin Russell)

Slide 2: Objectives
• To explain the motivation for clustering
• To introduce the ideas of distance and distortion
• To describe agglomerative and divisive clustering
• To explain the relationships between clustering and decision trees

Slide 3: Example from speech processing
(figure) Plot of high-frequency energy vs low-frequency energy for 25 ms speech segments, sampled every 10 ms

Slide 4: Structure of data
• Typical real data is not uniformly distributed
• It has structure
• Variables might be correlated
• The data might be grouped into natural 'clusters'
• The purpose of cluster analysis is to find this underlying structure automatically

Slide 5: Clusters and centroids
• If we assume that the clusters are spherical, then they are determined by their centres
• The cluster centres are called centroids
• How many centroids do we need?
• Where should we put them?

Slide 6: Distance
• A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:
  - d(x,x) = 0 for every point x
  - d(x,y) = d(y,x) for all points x and y (d is symmetric)
  - d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)
• Strictly, a metric also requires d(x,y) > 0 whenever x ≠ y; without that condition d is only a pseudometric

Slide 7: Example metrics
• The most common metric is the Euclidean metric
• In this case, if x = (x_1, x_2, …, x_N) and y = (y_1, y_2, …, y_N) then:
  $$ d(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2} $$
• This corresponds to the standard notion of distance in Euclidean space
• There are lots of others, but we focus on this one
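The Euclidean metric can be written in a few lines of Python. This is only an illustrative sketch: the function name `euclidean` and the choice to represent points as tuples of floats are assumptions made here, not part of the lecture.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of components")
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0 (the 3-4-5 right triangle)
```

The three properties listed on the Distance slide can be spot-checked on sample points, for example `euclidean(p, p) == 0.0` and `euclidean(p, q) == euclidean(q, p)`.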

Slide 8: Distortion
• Distortion is a measure of how well a set of centroids models a set of data
• Suppose we have:
  - data points y_1, y_2, …, y_T
  - centroids c_1, …, c_M
• For each data point y_t let c_{i(t)} be the closest centroid
• In other words: d(y_t, c_{i(t)}) = min_m d(y_t, c_m)

Slide 9: Distortion
• The distortion for the centroid set C = {c_1, …, c_M} is defined by:
  $$ \mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)}) $$
• In other words, the distortion is the sum of distances between each data point and its nearest centroid
• The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
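A minimal sketch of the distortion calculation, taking the definition literally as the sum of Euclidean distances from each data point to its nearest centroid. The function names (`nearest_centroid`, `distortion`) and the toy data are hypothetical, chosen only for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_centroid(point, centroids):
    """Index i(t) of the centroid closest to the given data point."""
    return min(range(len(centroids)), key=lambda m: euclidean(point, centroids[m]))


def distortion(data, centroids):
    """Sum over all data points of the distance to the nearest centroid."""
    return sum(euclidean(y, centroids[nearest_centroid(y, centroids)]) for y in data)


# Toy data (made up): two well-separated pairs of points.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_centroid = [(5.0, 1.0)]
two_centroids = [(0.0, 1.0), (10.0, 1.0)]

print(distortion(data, two_centroids))  # 4.0: every point is 1.0 from its centroid
print(distortion(data, one_centroid) > distortion(data, two_centroids))  # True
```

Placing one centroid in each natural group gives a much lower distortion than a single centroid in the middle, which is the sense in which distortion measures how well the centroids model the data.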

Slide 10: Types of clustering
• Initially we will look at two types of cluster analysis:
  - Agglomerative clustering, or 'bottom-up' clustering
  - Divisive clustering, or 'top-down' clustering

Slide 11: Agglomerative clustering
• Agglomerative clustering begins by assuming that each data point belongs to its own, unique, one-point cluster
• Clusters are then combined until the required number of clusters is obtained
• The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
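The simplest agglomerative scheme described above might be sketched as follows. The slides do not say how two centroids are combined; this sketch assumes the merged centroid is the mean of the two, weighted by how many data points each already represents. The function name `agglomerate` and the toy data are likewise assumptions.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def agglomerate(points, target):
    """Repeatedly merge the two closest centroids until `target` remain."""
    if not 1 <= target <= len(points):
        raise ValueError("target must be between 1 and the number of points")
    # Start with one cluster per data point: (centroid, number of points).
    clusters = [(tuple(p), 1) for p in points]
    while len(clusters) > target:
        # Exhaustive search for the closest pair of centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Assumption: merged centroid is the size-weighted mean of the pair.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        del clusters[j]  # delete the larger index first so index i stays valid
        del clusters[i]
        clusters.append((merged, ni + nj))
    return [c for c, _ in clusters]


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(sorted(agglomerate(data, 2)))  # [(0.0, 0.5), (10.0, 0.5)]
```

The closest-pair search here is deliberately naive (every pair is re-examined at every stage), which is adequate for a few hundred points but not for large data sets.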

Slide 12 (figure): Original data (302 points)

Slide 13 (figure): 252 centroids

Slide 14 (figure): 152 centroids

Slide 15 (figure): 52 centroids

Slide 16 (figure): 12 centroids

Slide 17: Divisive clustering
• Divisive clustering begins by assuming that there is just one centroid, typically in the centre of the set of data points
• That point is replaced with 2 new centroids
• Then each of these is replaced with 2 new centroids
• …
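One possible reading of the divisive procedure is sketched below. The slides do not specify how a centroid is replaced by two new ones, so the splitting rule here is an assumption: each centroid is copied twice with a small offset of plus or minus `eps` on every coordinate, each data point is assigned to its nearest copy, and each copy is then moved to the mean of the points assigned to it. The names `split_once`, `divisive`, `eps` and `levels` are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def split_once(data, centroids, eps=0.01):
    """Replace every centroid with two new ones (assumed splitting rule)."""
    # Two slightly perturbed copies of each centroid.
    candidates = []
    for c in centroids:
        candidates.append(tuple(v + eps for v in c))
        candidates.append(tuple(v - eps for v in c))
    # Assign each data point to its nearest candidate.
    groups = [[] for _ in candidates]
    for y in data:
        m = min(range(len(candidates)), key=lambda k: euclidean(y, candidates[k]))
        groups[m].append(y)
    # Move each candidate to the mean of its points; drop unused candidates.
    return [mean(g) for g in groups if g]


def divisive(data, levels):
    """Start from the centre of the data and split `levels` times."""
    centroids = [mean(data)]
    for _ in range(levels):
        centroids = split_once(data, centroids)
    return centroids


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(sorted(divisive(data, 1)))  # [(0.0, 0.5), (10.0, 0.5)]
```

With this rule the number of centroids at most doubles at each level; it can grow more slowly if a perturbed copy attracts no data points and is dropped. The direction of the perturbation is arbitrary, and a different direction can lead to a different split.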

Slide 18 (figure): Original data (302 points)

Slide 19 (figure): Original data (302 points)

Slide 20: Decision tree interpretation
(figure: a tree of centroids)
• Top of the tree: single centroid, representing the whole set
• Bottom of the tree: multiple centroids, one per data point
• Top-down clustering: divisive
• Bottom-up clustering: agglomerative

Slide 21: Note on optimality
• An 'optimal' set of centroids is one which minimises the distortion
• None of these methods necessarily gives an optimal set of centroids
• Instead they give locally optimal sets of centroids
• Why?

Slide 22: Summary
• Distance metrics and distortion
• Agglomerative clustering
• Divisive clustering
• Decision tree interpretation

