Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.


1 Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University

2 Perform a cluster analysis on gene expression profiles

3 Perform a cluster analysis on gene expression profiles by computing the Pearson correlation coefficient
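The Pearson correlation mentioned on this slide can be sketched in a few lines of Python; the two expression profiles below are hypothetical examples for illustration, not data from the slides:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical gene expression profiles measured across 5 conditions.
gene1 = [2.0, 4.0, 6.0, 8.0, 10.0]
gene2 = [1.0, 2.0, 3.0, 4.0, 5.0]   # varies in lockstep with gene1
print(pearson(gene1, gene2))         # 1.0 (perfect positive correlation)
```

Profiles that rise and fall together give a coefficient near +1, anti-correlated profiles give a value near -1, and unrelated profiles give values near 0; this is why Pearson correlation is a natural similarity measure for clustering expression profiles.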

4 Hierarchical Clustering Method We continue this process, clustering 1 with 4, then {2,3} with 5. The resulting hierarchy is a dendrogram whose leaves, read left to right, are 2, 3, 5, 1, 4.
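The merging process on this slide can be sketched as a minimal single-linkage agglomerative routine. The distance matrix below is hypothetical, chosen only so that the merge order matches the slide (2 with 3, then 1 with 4, then {2,3} with 5):

```python
def single_linkage(clusters, dist):
    """Repeatedly merge the two closest clusters (single linkage)."""
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under single linkage
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical symmetric distance matrix over elements 1..5.
D = {1: {2: 9, 3: 9, 4: 2, 5: 8},
     2: {1: 9, 3: 1, 4: 9, 5: 3},
     3: {1: 9, 2: 1, 4: 9, 5: 4},
     4: {1: 2, 2: 9, 3: 9, 5: 8},
     5: {1: 8, 2: 3, 3: 4, 4: 8}}
for a, b, d in single_linkage([{1}, {2}, {3}, {4}, {5}], D):
    print(sorted(a), sorted(b), d)   # merges in order: {2,3}, {1,4}, {2,3}+{5}, final
```

Each pass scans all cluster pairs, merges the closest pair, and records the merge; the recorded sequence is exactly the dendrogram described on the slide.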

5 K-Means Clustering Problem: Formulation Input: A set, V, consisting of n points and a parameter k Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
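The squared error distortion d(V, X) from this formulation can be written out directly; the data points and candidate centers below are illustrative assumptions:

```python
def squared_error_distortion(V, X):
    """d(V, X): mean over all data points in V of the squared distance
    to the closest of the cluster centers in X."""
    total = 0.0
    for v in V:
        total += min((v[0] - x[0]) ** 2 + (v[1] - x[1]) ** 2 for x in X)
    return total / len(V)

V = [(1, 1), (2, 1), (4, 3), (5, 4)]   # four hypothetical 2-D data points
X = [(1.5, 1.0), (4.5, 3.5)]           # two candidate cluster centers
print(squared_error_distortion(V, X))  # (0.25 + 0.25 + 0.5 + 0.5) / 4 = 0.375
```

The k-means problem asks for the set X of k centers that makes this quantity as small as possible.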

6 1-Means Clustering Problem: an Easy Case Input: A set, V, consisting of n points Output: A single point x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x

7 K-Means Clustering Problem: Formulation The basic step of k-means clustering is simple. Iterate until stable (i.e., no object changes group): 1. Determine the centroid coordinates. 2. Determine the distance of each object to the centroids. 3. Group the objects based on minimum distance. Ref: http://www.people.revoledu.com/kardi/tutorial/kMean/NumericalExample.htm

8 K-Means Clustering Problem: Formulation Suppose we have several objects (4 types of medicines), and each object has two attributes or features, as shown in the table below. Our goal is to group these objects into K = 2 groups of medicine based on the two features (weight index and pH).

Object       attribute 1 (X): weight index   attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4

Each medicine represents one point with two attributes (X, Y), which we can plot as a coordinate in attribute space, as shown in the figure on the right.

9 K-Means Clustering Problem: Formulation 1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let C1 and C2 denote the coordinates of the centroids; then C1 = (1, 1) and C2 = (2, 1).
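The first iteration on this example can be checked in a few lines of Python, following the three numbered steps from slide 7 (assign to the nearest centroid, then recompute each centroid as its group mean):

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]   # C1 = medicine A, C2 = medicine B

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Assign each medicine to its nearest centroid (0 = C1, 1 = C2).
assignment = {name: min(range(2), key=lambda i: dist(p, centroids[i]))
              for name, p in points.items()}
print(assignment)   # {'A': 0, 'B': 1, 'C': 1, 'D': 1}

# Recompute each centroid as the center of gravity of its group.
new_centroids = []
for i in range(2):
    members = [p for n, p in points.items() if assignment[n] == i]
    new_centroids.append((sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members)))
print(new_centroids)   # [(1.0, 1.0), (3.666..., 2.666...)]
```

A stays with C1 (distance 0), while B, C, and D are all closer to C2, so the second centroid moves to the mean of B, C, D, namely (11/3, 8/3).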

10 K-Means Clustering Problem: Formulation

11-19 (figure-only slides; the figures, which showed successive positions of points x1, x2, x3, are not included in the transcript)
20 1-Means Clustering Problem: an Easy Case Input: A set, V, consisting of n points Output: A single point x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x The 1-Means Clustering problem is easy. However, it becomes very difficult (NP-complete) for more than one center. An efficient heuristic (a method that learns by exploring) for K-Means clustering is the Lloyd algorithm. Perform two steps until either it converges or the fluctuations become very small: 1. Assign each data point to the cluster C corresponding to the closest cluster representative x_i (1 ≤ i ≤ k). 2. After the assignments of all n data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is (1/|C|) Σ_{v∈C} v for every cluster C.
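The two-step Lloyd iteration described above can be sketched as follows; it is run here on the four-medicine example from slide 8, with convergence declared when no point changes cluster:

```python
def lloyd_kmeans(points, centers, max_iter=100):
    """Lloyd's heuristic for k-means: alternate nearest-center assignment
    and center-of-gravity updates until assignments stop changing."""
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to its closest cluster representative.
        new_assignment = [min(range(len(centers)),
                              key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                            (p[1] - centers[i][1]) ** 2)
                          for p in points]
        if new_assignment == assignment:
            break                      # converged: no point changed cluster
        assignment = new_assignment
        # Step 2: move each representative to its cluster's center of gravity.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignment

pts = [(1, 1), (2, 1), (4, 3), (5, 4)]           # medicines A-D
centers, assign = lloyd_kmeans(pts, [(1, 1), (2, 1)])
print(centers)   # [(1.5, 1.0), (4.5, 3.5)]
print(assign)    # [0, 0, 1, 1]
```

Starting from centroids A and B, the algorithm converges in three iterations: {A, B} form one cluster with centroid (1.5, 1) and {C, D} form the other with centroid (4.5, 3.5).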

21 K-Means Clustering Problem: Formulation Like other algorithms, K-means clustering has several weaknesses:
- When there are few data points, the initial grouping largely determines the clusters.
- The number of clusters, k, must be determined beforehand.
- We never know the real clusters: with few data points, inputting the same data in a different order may produce different clusters.
- It is sensitive to the initial condition: different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
- We never know which attribute contributes more to the grouping process, since we assume each attribute has the same weight.
- The arithmetic mean is not robust to outliers: data very far from a centroid may pull the centroid away from the true one.
- The resulting clusters are circular in shape, because grouping is based on distance.

