Clustering methods Course code: 175314 Pasi Fränti 10.3.2014 Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu,

1 Clustering methods Course code: 175314 Pasi Fränti 10.3.2014 Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu, FINLAND Part 1: Introduction

2 Sample data Sources of RGB vectors Red-Green plot of the vectors

3 Sample data Employment statistics:

4 Application example 1 Color reconstruction Image with compression artifacts Image with original colors

5 Application example 2 Speaker modeling for voice biometrics [Diagram: training data from the speakers Matti, Mikko and Tomi goes through feature extraction and clustering to produce speaker models; features extracted from an unknown sample are compared against the models. Best match: Matti]

6 Speaker modeling Speech data Result of clustering

7 Application example 3 Image segmentation Normalized color plots according to red and green components. Image with 4 color clusters red green

8 Application example 4 Quantization Quantized signal Original signal Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values
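The definition above can be illustrated with a minimal sketch of uniform scalar quantization. The function names `quantize` and `reconstruct`, the value range [0, 1] and the choice of 4 levels are assumptions made for this example only; the slide does not prescribe a particular quantizer.

```python
def quantize(value: float, lo: float, hi: float, levels: int) -> int:
    """Map a value in [lo, hi] to one of `levels` integer symbols (equal-width bins)."""
    step = (hi - lo) / levels
    index = int((value - lo) / step)
    # Clamp so that out-of-range values and value == hi land in a valid bin.
    return min(max(index, 0), levels - 1)


def reconstruct(index: int, lo: float, hi: float, levels: int) -> float:
    """Approximate the original value by the midpoint of its bin."""
    step = (hi - lo) / levels
    return lo + (index + 0.5) * step


signal = [0.03, 0.26, 0.49, 0.74, 0.98]          # "original signal"
symbols = [quantize(v, 0.0, 1.0, 4) for v in signal]
approx = [reconstruct(s, 0.0, 1.0, 4) for s in symbols]  # "quantized signal"

print(symbols)  # [0, 1, 1, 2, 3]
print(approx)   # [0.125, 0.375, 0.375, 0.625, 0.875]
```

Clustering generalizes this idea: instead of fixed equal-width bins, the representative values are fitted to the data.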

9 Color quantization of images Color image RGB samples Clustering

10 Application example 5 Clustering of spatial data

11 Clustered locations of users

12 Clustering of photos Timeline clustering

13 Clustering GPS trajectories Mobile users, taxi routes, fleet management

14 Conclusions from clusters Cluster 1: Office Cluster 2: Home

15 Part I: Clustering problem

16 Subproblems of clustering
1. Where are the clusters? (Algorithmic problem)
2. How many clusters? (Methodological problem: which criterion?)
3. Selection of attributes (Application-related problem)
4. Preprocessing the data (Practical problems: normalization, outliers)

17 Clustering result as partition Illustrated by Voronoi diagram Illustrated by Convex hulls Cluster prototypes Partition of data

18 Duality of partition and centroids
Cluster prototypes: centroids as prototypes
Partition of data: partition by nearest prototype mapping

19 Challenges in clustering
Incorrect cluster allocation
Incorrect number of clusters
[Figure labels: Cluster missing, Clusters missing, Too many clusters]

20 How to solve?
Solve the clustering (algorithmic problem):
- Given input data (X) of N data vectors and the number of clusters (M), find the clusters.
- The result is given as a set of prototypes or as a partition.
Solve the number of clusters (mathematical problem):
- Define an appropriate cluster validity function f.
- Repeat the clustering algorithm for several values of M.
- Select the best result according to f.
Solve the problem efficiently (computer science problem).

21 Taxonomy of clustering [Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.] One possible classification based on cost function. MSE is well defined and most popular.

22 Definitions and data
Set of N data points: $X = \{x_1, x_2, \ldots, x_N\}$
Set of M cluster prototypes (centroids): $C = \{c_1, c_2, \ldots, c_M\}$
Partition of the data: $P = \{p_1, p_2, \ldots, p_M\}$

23 Distance and cost function Euclidean distance of data vectors: Mean square error:
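The formulas on this slide were images and did not survive in the transcript. A standard form is sketched below under two assumptions that the transcript does not state: each data vector has K attributes, with $x_{ik}$ denoting attribute k of $x_i$, and each $p_j$ in the partition P is the set of data vectors assigned to cluster j. Some texts additionally divide the error by K; that normalization is not shown here.

```latex
\[
d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}
\]
\[
\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{j=1}^{M} \sum_{x_i \in p_j} d(x_i, c_j)^2
\]
```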

24 Dependency of data structures
Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters.
Optimal partition: for given centroids (C), the optimal partition assigns each data vector to its nearest centroid.
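The formulas for these two conditions are missing from the transcript. Writing d for the Euclidean distance of slide 23, and assuming that $p_j$ denotes the set of data vectors in cluster j (an interpretation the transcript does not confirm), they can be sketched as follows. Ties in the nearest-centroid rule may be broken arbitrarily.

```latex
\[
c_j = \frac{1}{|p_j|} \sum_{x_i \in p_j} x_i, \qquad j = 1, \ldots, M
\]
\[
p_j = \{\, x_i \in X : d(x_i, c_j) \le d(x_i, c_k) \ \text{for all } k = 1, \ldots, M \,\}
\]
```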

25 Complexity of clustering Clustering problem is NP complete [Garey et al., 1982] Optimal solution by branch-and-bound in exponential time. Practical solutions by heuristic algorithms. Number of possible clusterings:
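The formula itself is missing from the transcript. The number of ways to split N labeled data vectors into M non-empty clusters is the Stirling number of the second kind, which is probably what the slide showed, although this cannot be confirmed from the text. The sketch below evaluates its standard closed form; the function name `stirling2` is chosen for this example.

```python
from math import comb, factorial


def stirling2(n: int, m: int) -> int:
    """Number of partitions of n labeled items into m non-empty clusters.

    Closed form: S(n, m) = (1 / m!) * sum_{j=0..m} (-1)^(m-j) * C(m, j) * j^n
    """
    total = sum((-1) ** (m - j) * comb(m, j) * j ** n for j in range(m + 1))
    return total // factorial(m)  # the sum is always divisible by m!


print(stirling2(4, 2))    # 7
print(stirling2(10, 3))   # 9330
print(len(str(stirling2(100, 5))))  # number of decimal digits for N=100, M=5
```

Even for small data sets the count grows too fast for exhaustive search, which is why heuristic algorithms are used in practice.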

26 Cluster software
Main area: working space for data
Input area: inputs to be processed
Output area: obtained results
Menu Process: selection of operation
http://cs.joensuu.fi/sipu/soft/cluster2009.exe

27 Procedure to simulate k-means
[Figure labels: Clustering image, Data set, Codebook, Partition]
Open the data set (file *.ts) and move it into the Input area
Process – Random codebook, select the number of clusters
REPEAT
  Move the obtained codebook from the Output area into the Input area
  Process – Optimal partition, select Error function
  Move the codebook into the Main area and the partition into the Input area
  Process – Optimal codebook
UNTIL DESIRED CLUSTERING
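The same loop can be sketched in code, independently of the Cluster software. This is a minimal pure-Python illustration, not the tool's implementation: the function names, the stopping rule (stop when the partition no longer changes), the handling of empty clusters and the toy data are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def optimal_partition(data, codebook):
    """Assign each vector the index of its nearest centroid."""
    return [min(range(len(codebook)), key=lambda j: squared_distance(x, codebook[j]))
            for x in data]


def optimal_codebook(data, partition, codebook):
    """Replace each centroid by the mean of its cluster (kept as is if the cluster is empty)."""
    new_codebook = []
    for j, old in enumerate(codebook):
        members = [x for x, p in zip(data, partition) if p == j]
        if members:
            new_codebook.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_codebook.append(old)
    return new_codebook


def mse(data, partition, codebook):
    """Mean squared distance of the vectors to their assigned centroids."""
    return sum(squared_distance(x, codebook[p]) for x, p in zip(data, partition)) / len(data)


def k_means(data, m, seed=0, max_iter=100):
    rng = random.Random(seed)
    codebook = rng.sample(data, m)                 # "Random codebook"
    partition = optimal_partition(data, codebook)  # "Optimal partition"
    for _ in range(max_iter):                      # REPEAT
        codebook = optimal_codebook(data, partition, codebook)  # "Optimal codebook"
        new_partition = optimal_partition(data, codebook)
        if new_partition == partition:             # UNTIL nothing changes
            break
        partition = new_partition
    return codebook, partition


# Toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codebook, partition = k_means(data, m=2)
print(codebook, partition, mse(data, partition, codebook))
```

On this toy data the two groups end up in separate clusters; which group receives label 0 depends on the random initial codebook.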

28 XLMiner software http://www.resample.com/xlminer/help/HClst/HClst_ex.htm

29 Example of data in XLMiner

30 Distance matrix & dendrogram

31 Conclusions
- Clustering is a fundamental tool in speech and image processing.
- Failing to do the clustering properly may undermine the application analysis.
- A good clustering tool is needed so that researchers can focus on application requirements.

32 Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.
2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.
4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.
5. F. Aurenhammer, Voronoi diagrams-a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.

