Clustering 1 (Introduction and k-means): Presentation transcript

1 Clustering 1 (Introduction and k-means)
Prepared by Raymond Wong. Some parts of these notes are borrowed from LW Chan's notes. Screenshots of XLMiner captured by Kai Ho CHAN. Presented by Raymond Wong. COMP1942

2 Clustering
Name     Computer  History
Raymond  100       40
Louis    90        45
Wyman    20        95
…
Cluster 1 (e.g. high score in Computer and low score in History): Raymond, Louis
Cluster 2 (e.g. high score in History and low score in Computer): Wyman
Problem: to find all clusters
COMP1942

3 Why Clustering?
Clustering for Utility
  Summarization
  Compression

4 Why Clustering?
Clustering for Understanding
Applications:
  Biology: group different species
  Psychology and Medicine: group medicines
  Business: group different customers for marketing
  Network: group different types of traffic patterns
  Software: group different programs for data analysis
COMP1942

5 Clustering Methods
K-means Clustering
  Original k-means Clustering
  Sequential k-means Clustering
  Forgetful Sequential k-means Clustering
  How to use the data mining tool
Hierarchical Clustering Methods
  Agglomerative methods
  Divisive methods – polythetic approach and monothetic approach
COMP1942

6 K-means Clustering
Suppose that we have n example feature vectors x1, x2, …, xn, and we know that they fall into k compact clusters, k < n.
Let mi be the mean of the vectors in cluster i. We can say that x is in cluster i if the distance from x to mi is the minimum of all the k distances.
COMP1942

7 Procedure for finding k-means
Make initial guesses for the means m1, m2, …, mk
Until there is no change in any mean:
  Assign each data point to the cluster whose mean is the nearest
  Calculate the mean of each cluster:
    For i from 1 to k
      Replace mi with the mean of all examples for cluster i
COMP1942
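A minimal Python sketch of this procedure (the toy data, the function name kmeans, and the stopping test are illustrative assumptions, not part of the course materials):

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Basic k-means on an (n, d) array X. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    # Initial guesses for the means: k randomly chosen examples.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    while True:
        # Assign each data point to the cluster whose mean is nearest.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace mi with the mean of all examples assigned to cluster i
        # (an empty cluster keeps its old mean).
        new_means = np.array([X[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):   # no change in any mean: stop
            return means, labels
        means = new_means

# Toy data loosely modelled on the Computer/History scores example.
X = np.array([[100, 40], [90, 45], [20, 95], [25, 90]], dtype=float)
means, labels = kmeans(X, k=2)
print(means)
print(labels)
```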

8 Procedure for finding k-means
[Figure: worked example with k = 2 on a 10 × 10 grid of points. Arbitrarily choose k means; assign each object to the most similar center; update the cluster means; reassign; repeat until the means stop changing.]
COMP1942

9 Initialization of k-means
The way to initialize the means was not specified. One popular way to start is to randomly choose k of the examples.
The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
COMP1942
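One concrete way to try several starting points, as a sketch: scikit-learn's KMeans is used here in place of the course's XLMiner tool, and its n_init parameter reruns k-means from several random initialisations, keeping the run with the smallest within-cluster sum of squares (the data are the same toy scores as above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[100, 40], [90, 45], [20, 95], [25, 90]], dtype=float)

# n_init=10: run k-means from 10 random starting points and keep the best run.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the final means
print(km.inertia_)           # within-cluster sum of squares of the best run
```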

10 Disadvantages of k-means
With a "bad" initial guess, it can happen that no points are assigned to the cluster with the initial mean mi, so that mean is never updated.
The value of k must be supplied by the user, which is not user-friendly: we usually do not know the number of clusters before we start looking for them.
COMP1942

11 Clustering Methods
K-means Clustering
  Original k-means Clustering
  Sequential k-means Clustering
  Forgetful Sequential k-means Clustering
  How to use the data mining tool
Hierarchical Clustering Methods
  Agglomerative methods
  Divisive methods – polythetic approach and monothetic approach
COMP1942

12 Sequential k-Means Clustering
Another way to modify the k-means procedure is to update the means one example at a time, rather than all at once.
This is particularly attractive when we acquire the examples over a period of time and want to start clustering before we have seen all of them.
Here is a modification of the k-means procedure that operates sequentially.
COMP1942

13 Sequential k-Means Clustering
Make initial guesses for the means m1, m2, …, mk
Set the counts n1, n2, …, nk to zero
Until interrupted:
  Acquire the next example, x
  If mi is closest to x:
    Increment ni
    Replace mi by mi + (1/ni)(x - mi)
COMP1942
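A Python sketch of this sequential procedure ("until interrupted" is modelled here as a loop over a finite stream; the stream and the initial means are arbitrary choices for illustration):

```python
import numpy as np

def sequential_kmeans(stream, init_means):
    """Update the means one example at a time."""
    means = np.array(init_means, dtype=float)
    counts = np.zeros(len(means), dtype=int)   # n1, ..., nk start at zero
    for x in stream:                           # "until interrupted"
        x = np.asarray(x, dtype=float)
        i = int(np.argmin(np.linalg.norm(means - x, axis=1)))  # mi closest to x
        counts[i] += 1                                         # increment ni
        means[i] += (1.0 / counts[i]) * (x - means[i])         # mi + (1/ni)(x - mi)
    return means

stream = [[100, 40], [20, 95], [90, 45], [25, 90]]
print(sequential_kmeans(stream, init_means=[[40, 40], [60, 60]]))
```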

14 Clustering Methods
K-means Clustering
  Original k-means Clustering
  Sequential k-means Clustering
  Forgetful Sequential k-means Clustering
  How to use the data mining tool
Hierarchical Clustering Methods
  Agglomerative methods
  Divisive methods – polythetic approach and monothetic approach
COMP1942

15 Forgetful Sequential k-means
This also suggests another alternative in which we replace the counts by constants. In particular, suppose that a is a constant between 0 and 1, and consider the following variation:
Make initial guesses for the means m1, m2, …, mk
Until interrupted:
  Acquire the next example, x
  If mi is closest to x, replace mi by mi + a(x - mi)
COMP1942
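The corresponding sketch, with the per-cluster count replaced by the constant a (a = 0.1 is an arbitrary choice for illustration):

```python
import numpy as np

def forgetful_kmeans(stream, init_means, a=0.1):
    """Forgetful sequential k-means: constant step size a in (0, 1)."""
    means = np.array(init_means, dtype=float)
    for x in stream:                           # "until interrupted"
        x = np.asarray(x, dtype=float)
        i = int(np.argmin(np.linalg.norm(means - x, axis=1)))  # mi closest to x
        means[i] += a * (x - means[i])                         # mi + a(x - mi)
    return means

stream = [[100, 40], [20, 95], [90, 45], [25, 90]]
print(forgetful_kmeans(stream, init_means=[[40, 40], [60, 60]]))
```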

16 Forgetful Sequential k-means
The result is called the "forgetful" sequential k-means procedure. It is not hard to show that mi is a weighted average of the examples that were closest to mi, where the weight decreases exponentially with the "age" of the example.
That is, if m0 is the initial value of the mean vector and xk is the k-th of the n examples that were used to form mi, then
  mn = (1 - a)^n m0 + a Σ_{k=1}^{n} (1 - a)^(n-k) xk
COMP1942
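A quick numerical check of this closed form against the recursive update, for a single one-dimensional mean (all values are arbitrary):

```python
import numpy as np

a, m0 = 0.1, 0.0
xs = np.array([5.0, 2.0, 7.0, 4.0])   # the n examples, in order
n = len(xs)

# Recursive update: m <- m + a (x - m) for each example in turn.
m = m0
for x in xs:
    m += a * (x - m)

# Closed form: mn = (1 - a)^n m0 + a * sum_{k=1}^{n} (1 - a)^(n-k) xk
closed = (1 - a) ** n * m0 + a * sum(
    (1 - a) ** (n - k) * x for k, x in enumerate(xs, start=1))

print(m, closed)   # the two values agree
```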

17 Forgetful Sequential k-means
Thus, the initial value m0 is eventually forgotten, and recent examples receive more weight than ancient examples.
This variation of k-means is particularly simple to implement, and it is attractive when the nature of the problem changes over time and the cluster centres "drift".
COMP1942

18 Clustering Methods
K-means Clustering
  Original k-means Clustering
  Sequential k-means Clustering
  Forgetful Sequential k-means Clustering
  How to use the data mining tool
Hierarchical Clustering Methods
  Agglomerative methods
  Divisive methods – polythetic approach and monothetic approach
COMP1942

19 How to use the data mining tool
We can use XLMiner to perform k-means clustering.
Open "cluster.xls".
COMP1942

20–23 [XLMiner screenshots; no slide text] COMP1942

24 [XLMiner screenshot: data selection fields: Workbook, Data source, Worksheet, Data range] COMP1942

25 [XLMiner screenshot; no slide text] COMP1942

26 [XLMiner screenshot: variable selection ("First row contains headers" option; Variables In Input Data: Name, Computer, History; # Cols and # Rows fields)] COMP1942

27–29 [XLMiner screenshots; no slide text] COMP1942

30 [XLMiner screenshot: Parameters ("Normalize input data" option; # Clusters: 2; # Iterations: 10)] COMP1942

31 [XLMiner screenshot: Options (Fixed Start / Random Starts; No. of Starts: 5; Set Seed: 12345)] COMP1942

32 [XLMiner screenshot: output options ("Show data summary"; "Show distances from each cluster center")] COMP1942

33–54 [XLMiner screenshots; no slide text] COMP1942

