Lecture 9: Clustering Algorithms. Bioinformatics Data Analysis and Tools. Centre for Integrative Bioinformatics VU.


1 Lecture 9: Clustering Algorithms. Bioinformatics Data Analysis and Tools

2 Part A: Cluster analysis in general

3 The challenge
Biologists are estimated to produce 10,000,000,000,000,000 bytes of data each year (roughly 16 billion CD-ROMs). How do we learn something from this data? First step: find patterns/structure in the data. → Use cluster analysis.

4 Cluster analysis
Definition: Clustering is the process of grouping several objects into a number of groups, or clusters.
Goal: Objects in the same cluster are more similar to one another than they are to objects in other clusters.

5 Simple example
Species    Fat (%)   Proteins (%)
Horse       1.0       2.6
Donkey      1.4       1.7
Mule        1.8       2.0
Camel       3.4       3.5
Llama       3.2       3.9
Zebra       4.8       3.0
Sheep       6.4       5.6
Buffalo     7.9       5.9
Fox         5.9       7.4
Pig         5.1       6.6
Rabbit     13.1       7.1
Rat        12.6      12.3
Deer       19.7       9.2
Reindeer   20.3      10.4
Whale      21.2      11.1
Data taken from: W.S. Spector, Ed., Handbook of Biological Data, WB Saunders Company, London, 1956

6 Simple example: composition of mammalian milk
[Figure: plot of the milk data with axes Fat (%) and Proteins (%), showing the clustering.]

7 Clustering vs. Classification
Clustering (unsupervised learning): We have objects, but we do not know to which class (group/cluster) they belong. The task here is to find a ‘natural’ grouping of the objects.
Classification (supervised learning): We have labeled objects (we know to which class they belong). The task here is to build a classifier based on the labeled objects: a model which is able to predict the class of a new object. This can be done with Support Vector Machines (SVM), neural networks, etc.

8 Example of classification
Let’s say we have gene expression data of 5000 genes, for the healthy case and for 5 types of cancer. So for each gene expression measurement, we know to which class (healthy, cancer type 1, cancer type 2, etc.) it belongs. Based on this data, we can build a classifier by training (learning). If we then have the gene expression data of a person who might have cancer, we can use our classifier to predict to which class this person belongs (healthy, cancer type 1, cancer type 2, etc.).

9 Back to cluster analysis
The six steps of cluster analysis:
1. Noise/outlier filtering
2. Choose distance measure
3. Choose clustering criterion
4. Choose clustering algorithm
5. Validation of the results
6. Interpretation of the results

10 Step 1 – Noise/outlier filtering
Noise and outliers are ‘bad measurements’ in data, caused by:
– Human errors
– Measurement errors (e.g. a defective thermometer)
Noise and outliers can negatively influence your data analysis, so it is better to filter them out. But be careful! An outlier may be a correct but rare measurement. A lot of filtering methods exist, but we won’t go into that.

11 Step 2 – Choose a distance measure
The distance measure defines the dissimilarity between the objects that we are clustering. What might be a good distance measure for comparing:
– Body sizes of persons? d = | length_person_1 – length_person_2 |
– Protein sequences? d = number of positions with different amino acids, or difference in molecular weight, or…
– Coordinates in n-dimensional space? d = Euclidean distance: d(x, y) = sqrt( (x_1 – y_1)² + … + (x_n – y_n)² )
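
The three answers on this slide can be written down directly. The following is a minimal sketch, not code from the lecture; the function names and the example values are made up for illustration, and the sequence distance assumes two sequences of equal length.

```python
import math

def height_distance(a_cm, b_cm):
    """Absolute difference between two body lengths."""
    return abs(a_cm - b_cm)

def mismatch_distance(seq_a, seq_b):
    """Number of positions with a different amino acid.

    Assumption: both sequences have the same length (no alignment step).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def euclidean_distance(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(height_distance(180, 165))            # 15
print(mismatch_distance("MKTAY", "MKSAY"))  # 1
print(euclidean_distance((0, 0), (3, 4)))   # 5.0
```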

12 Step 3 – Choose clustering criterion
We just defined the distances between the objects we want to cluster. But how do we decide whether two objects should be in the same cluster? The clustering criterion defines how the objects should be clustered. Possible clustering criteria:
– A cost function to be minimized. This is optimization-based clustering.
– A (set of) rule(s). This is hierarchical clustering.

13 Step 3 – Choose clustering criterion
Different types of hierarchical clustering, based on different clustering criteria:
– Single linkage / Nearest neighbour
– Complete linkage / Furthest neighbour
– Group averaging / UPGMA
(see also the lecture on machine learning)
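
The three linkage rules differ only in how the distance between two clusters is derived from the pairwise object distances. The sketch below is a naive agglomerative procedure written for illustration, not the lecture's implementation; the function name `agglomerative` and its parameters are assumptions. The example points are the (fat, protein) values of Horse, Donkey, Mule, Deer, Reindeer and Whale from the milk table.

```python
import math

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    def cluster_distance(a, b):
        pairwise = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # nearest neighbour
            return min(pairwise)
        if linkage == "complete":    # furthest neighbour
            return max(pairwise)
        if linkage == "average":     # group averaging (UPGMA)
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# (fat %, protein %) for Horse, Donkey, Mule, Deer, Reindeer, Whale
milk = [(1.0, 2.6), (1.4, 1.7), (1.8, 2.0), (19.7, 9.2), (20.3, 10.4), (21.2, 11.1)]
for rule in ("single", "complete", "average"):
    print(rule, agglomerative(milk, 2, linkage=rule))
```

On this well-separated subset all three rules give the same two groups; on data with less clear structure they can disagree.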

14 Step 4 – Choose clustering algorithm
Choose an algorithm that performs the clustering, based on the chosen distance measure and clustering criterion.

15 Step 5 – Validation of the results
Look at the resulting clusters! Are the objects indeed grouped together as expected? Experts mostly know what to expect. For example, if a membrane protein is assigned to a cluster containing mostly DNA-binding proteins, there might be something wrong (or not!).

16 Important to know!
The results depend totally on the choices made. Other distance measures, clustering criteria and clustering algorithms lead to different results! For example, try hierarchical clustering with different cluster criteria: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html

17 Part B: Cluster analysis of gene expression data with dynamic evolutionary fuzzy clustering

18 Gene expression data
Microarrays are commonly used for measuring gene expression. How are microarrays produced? For an animation, see http://www.bio.davidson.edu/courses/genomics/chip/chip.html

19 Gene expression data
From the colors of the microarray, the expression levels for the different genes can be deduced. In the figure, each line represents the expression level of one gene during some time period.

20 The task
Experimental observation: Genes that are expressed in a similar way (co-expressed) are under common regulatory control, and their products (proteins) share a biological function [1]. So it would be interesting to know which genes are expressed in a similar way. → Clustering!
[1] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 1998, 95:14863–14868.

21 The challenge
Many proteins have multiple roles in the cell, and act with distinct sets of cooperating proteins to fulfill each role. Their genes are therefore co-expressed with different groups of genes. This means that one gene might belong to different groups/clusters. → We need ‘fuzzy’ clustering.

22 Fuzzy clustering
Hard clustering: Each object is assigned to exactly one cluster.
Fuzzy clustering: Each object is assigned to all clusters, but to some extent. The task of the clustering algorithm in this case is to assign a membership degree u_i,j for each object j to each cluster i.
Example with clusters 1, 2, 3, …, C: for object 1, u_1,1 = 0.6, u_2,1 = 0.3, u_3,1 = 0.1. The membership degrees of each object sum to one: Σ_{i=1..C} u_i,j = 1.
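
As a small illustration of the membership degrees, the sketch below takes the three values shown for object 1 on the slide (so it assumes C = 3) and checks the sum-to-one constraint. The helper names are made up for this example; `harden` shows how a hard assignment can be read off as the cluster with the highest degree.

```python
# Membership degrees of object 1 from the slide: cluster index -> u_i,1
memberships = {1: 0.6, 2: 0.3, 3: 0.1}

def is_valid_membership(u, tol=1e-9):
    """Every degree lies in [0, 1] and the degrees sum to 1 (within tol)."""
    in_range = all(0.0 <= value <= 1.0 for value in u.values())
    return in_range and abs(sum(u.values()) - 1.0) <= tol

def harden(u):
    """Hard assignment: the cluster with the highest membership degree."""
    return max(u, key=u.get)

print(is_valid_membership(memberships))  # True
print(harden(memberships))               # 1
```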

23 Choose distance function
Remember our data: which genes are closer, based on their expression? Green and red, or green and blue?
[Figure: gene expression level against time for a green, a red and a blue gene.]
We want to cluster genes that show co-expression, so the shapes of the lines should be similar. In this case, it is not important whether the lines are near each other. So green and blue are closer in the example.

24 Choose distance function
(1 – Pearson correlation) is a good measure for the distance between the shapes of two lines v_i and x_j: d(v_i, x_j) = 1 – r(v_i, x_j), where r is the Pearson correlation coefficient.
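
A minimal sketch of this distance follows. The three profiles are invented to mimic the green/red/blue example: the hypothetical `blue` profile has the same shape as `green` but a higher level, while `red` lies near `green` in level but has the opposite shape.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_x * var_y)

green = [1.0, 2.0, 3.0, 2.0, 1.0]
blue = [5.0, 6.0, 7.0, 6.0, 5.0]   # same shape, shifted upwards
red = [1.2, 1.0, 0.8, 1.0, 1.2]    # similar level, opposite shape

print(pearson_distance(green, blue))  # close to 0: same shape
print(pearson_distance(green, red))   # close to 2: opposite shape
```

The distance ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated), so it ignores how far apart the lines are and looks only at their shape.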

25 Choose cluster criterion
In this case we choose a cost function J to be minimized, which is commonly used for fuzzy clustering. Each cluster i is represented by its center v_i.
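
The formula for J is not part of this transcript. As an assumption, the sketch below uses the textbook fuzzy c-means objective, J = Σ_i Σ_j (u_i,j)^m · d(v_i, x_j)², where the fuzzifier m is a parameter introduced here; the cost function actually used in the lecture may differ. The one-dimensional data and the absolute-difference distance are chosen only to keep the arithmetic easy to follow.

```python
def fuzzy_cost(data, centers, memberships, distance, m=2.0):
    """Textbook fuzzy c-means cost (an assumption, not the lecture's formula).

    memberships[i][j] is the degree to which object j belongs to cluster i.
    """
    total = 0.0
    for i, center in enumerate(centers):
        for j, x in enumerate(data):
            total += memberships[i][j] ** m * distance(center, x) ** 2
    return total

def abs_distance(a, b):
    return abs(a - b)

data = [0.0, 4.0]
centers = [0.0, 4.0]

crisp = [[1.0, 0.0], [0.0, 1.0]]   # each object fully in its own cluster
vague = [[0.5, 0.5], [0.5, 0.5]]   # each object split evenly

print(fuzzy_cost(data, centers, crisp, abs_distance))  # 0.0
print(fuzzy_cost(data, centers, vague, abs_distance))  # 8.0
```

With the centers sitting exactly on the two objects, the crisp memberships give zero cost, while spreading the memberships evenly is penalised.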

26 How to deal with outliers?
Outliers can disturb the clustering process and the resulting clustering, so we should deal with them. We adapt our cost function by giving each object x_j a weight w_j. If w_j is high, the contribution of object x_j is low, so it will hardly influence the clustering process.

27 Minimization of the cost function
Use a genetic algorithm.
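
The details of the genetic algorithm are not in this transcript. The sketch below is a generic, minimal genetic algorithm (selection of the better half, uniform crossover, Gaussian mutation), given only to show the idea of minimizing a cost function by evolving a population. All names, parameters and the toy cost function are assumptions; in the clustering setting an individual would encode the cluster centers and `cost` would be the fuzzy cost function J.

```python
import random

def genetic_minimize(cost, n_dims, pop_size=30, generations=100,
                     mutation_sd=0.3, seed=0):
    """Generic GA sketch; returns (best individual, best cost per generation)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(n_dims)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=cost)
        history.append(cost(population[0]))
        # Selection: the better half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: small Gaussian perturbation of every gene.
            child = [gene + rng.gauss(0.0, mutation_sd) for gene in child]
            children.append(child)
        population = parents + children
    best = min(population, key=cost)
    history.append(cost(best))
    return best, history

def sphere(individual):
    """Toy cost function with its minimum at the origin."""
    return sum(gene * gene for gene in individual)

best, history = genetic_minimize(sphere, n_dims=3)
print(history[0], history[-1])
```

Because the better half of the population always survives, the best cost in `history` never increases from one generation to the next.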

28 Test the algorithm
Generate data, based on these profiles:
[Figure: the template profiles.]

30 Test the algorithm
Each scenario represents a different test case. Data is generated by producing new gene expression profiles: take a template profile, add noise to it, and add the result to the database. Do this 50 times for each profile. The algorithm should find the correct number of clusters (= number of template profiles), and should assign each datum to the correct cluster.
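
The data generation described here could look like the sketch below. The slide does not say what kind of noise is used, so Gaussian noise with standard deviation `noise_sd` is an assumption, as are the function name, the seed and the two example templates.

```python
import random

def generate_dataset(templates, copies=50, noise_sd=0.1, seed=0):
    """Create `copies` noisy versions of every template profile.

    Returns (data, labels); labels[k] is the index of the template that
    data[k] was generated from, i.e. the cluster it should end up in.
    Assumption: the noise is Gaussian with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    data, labels = [], []
    for label, template in enumerate(templates):
        for _ in range(copies):
            noisy = [value + rng.gauss(0.0, noise_sd) for value in template]
            data.append(noisy)
            labels.append(label)
    return data, labels

# Two hypothetical template profiles over five time points.
templates = [[0.0, 1.0, 2.0, 1.0, 0.0], [2.0, 1.0, 0.0, 1.0, 2.0]]
data, labels = generate_dataset(templates)
print(len(data), labels.count(0), labels.count(1))  # 100 50 50
```

Keeping the labels makes it possible to score a clustering afterwards: the number of clusters found is compared with `len(templates)`, and each assignment with the stored label.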

31 Results
The algorithm finds the true number of clusters in 90–95% of the cases. The algorithm puts the genes in the correct clusters in 90–97% of the cases. I am now testing the algorithm with real data.

