Basic Gene Expression Data Analysis--Clustering Pairwise Measures Clustering Motif Searching/Network Construction Integrated Analysis (NMR/SNP/Clinic/….)

1 Basic Gene Expression Data Analysis--Clustering Pairwise Measures Clustering Motif Searching/Network Construction Integrated Analysis (NMR/SNP/Clinic/….)

2 Microarray Experiment Control Treated mRNA RT and label with fluor dyes cDNA Spot (DNA probe): known cDNA or Oligo Mix and hybridize target to microarray

3 Collections of Experiments Time course after a treatment Different treatments Disease cell lines Data are represented in a matrix

4 Cluster Analysis Grouping of genes with “similar” expression profiles Grouping of disease cell lines/toxicants with “similar” effects on gene expression Clustering algorithms –Hierarchical clustering –Self-organizing maps –K-means clustering

5 Normalized Expression Data Gene Expression Clustering Protein/protein complex DNA regulatory elements Semantics of clusters: From co-expressed to co-regulated

6 Key Terms in Cluster Analysis Distance & Similarity measures Hierarchical & non-hierarchical Single/complete/average linkage Dendrograms & ordering

7 Measuring Similarity of Gene Expression For two points $(x_1, y_1)$ and $(x_2, y_2)$:
Euclidean ($L_2$) distance
Manhattan ($L_1$) distance
$L_m$: $(|x_1 - x_2|^m + |y_1 - y_2|^m)^{1/m}$
$L_\infty$: $\max(|x_1 - x_2|, |y_1 - y_2|)$
Inner product: $x_1 x_2 + y_1 y_2$
Correlation coefficient
Spearman rank correlation coefficient
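The distance measures listed on this slide can be sketched in a few lines. This is a minimal illustration, not code from the presentation: the function name `minkowski` and the two sample points are assumptions chosen for the example.

```python
import math

def minkowski(p, q, m):
    """L_m distance between two equal-length points.

    Passing m = math.inf gives the L-infinity (maximum) distance.
    """
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(m):
        return max(diffs)
    return sum(d ** m for d in diffs) ** (1.0 / m)

# Illustrative points (hypothetical, not taken from the slides)
p, q = (0.0, 0.0), (4.0, 3.0)

manhattan = minkowski(p, q, 1)         # L1: |4| + |3|
euclidean = minkowski(p, q, 2)         # L2: sqrt(4^2 + 3^2)
chebyshev = minkowski(p, q, math.inf)  # L-infinity: max(|4|, |3|)
```

The same function covers all three common cases, which shows that Manhattan, Euclidean and maximum distance differ only in the exponent m.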

8 Distance Measures: Minkowski Metric ref

9 Commonly Used Minkowski Metrics

10 An Example [Figure: two points on x and y axes, with sides of length 4 and 3]

11 Manhattan distance is called Hamming distance when all features are binary. Gene Expression Levels Under 17 Conditions (1-High,0-Low)
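The statement on this slide can be checked with a small sketch. The two binary profiles below are made-up values for 17 conditions (1 = high, 0 = low), not data from the presentation, and the function names are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical binary expression profiles under 17 conditions
gene_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
gene_b = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]

d_hamming = hamming(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
```

For 0/1 features every differing position contributes exactly 1 to the Manhattan sum, so the two values coincide.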

12 From Clustering to Correlation [Figure: expression level over time for Gene A and Gene B, three panels]

13 Similarity Measures: Correlation Coefficient
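The formula on this slide did not survive in the transcript. As a hedged sketch, the block below computes the standard Pearson correlation coefficient; the function name and the three expression profiles are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError for a constant profile (zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical time courses
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape as gene_a, ten times the scale
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # mirror image of gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Here `gene_a` and `gene_b` are far apart in Euclidean terms but perfectly correlated, while `gene_c` is perfectly anti-correlated, which is why correlation is often preferred when the shape of the profile matters more than its magnitude.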

14 Hierarchical Clustering Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
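The four steps above can be sketched as a naive agglomerative procedure. This is an illustrative implementation, not the presenters' code: the function name, the `linkage` parameter and the 4x4 distance matrix are assumptions, and cluster distances are recomputed from item distances on every pass instead of being cached, which keeps the code short at the cost of speed.

```python
def hierarchical_clustering(dist, linkage=min):
    """Naive agglomerative clustering on an N x N distance matrix.

    linkage=min gives single link, linkage=max gives complete link.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:                     # step 4: repeat until one cluster
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # step 3, done on the fly: cluster distance from item distances
                d = linkage(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)             # step 2: remember the closest pair
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical symmetric distance matrix for four items
dist = [
    [0, 1, 4, 6],
    [1, 0, 3, 7],
    [4, 3, 0, 2],
    [6, 7, 2, 0],
]

single_merges = hierarchical_clustering(dist, linkage=min)
complete_merges = hierarchical_clustering(dist, linkage=max)
```

The merge history is the information a dendrogram displays: which clusters were joined and at what distance.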

15 Normalized Expression Data Hierarchical Clustering

16 3 clusters? 2 clusters?

17 Cluster Analysis Eisen et al. (1998) (PNAS, 95:14863): Correlation as measure of co-expression. Experiment over time (t0, t1, t2, ...) against control; N genes; N*N correlation matrix

18 Cluster Analysis (1) Scan matrix for maximum (2) Join genes to one node (3) Update matrix

19 Cluster Analysis Result: Dendrogram assembling N genes Points of discussion –similarity based, useful for co-expression –dependent on similarity measure? –useful in preliminary scans –biological relevance of clusters?

20 Distance Between Two Clusters
Min distance: Single-Link Method / Nearest Neighbor
Max distance: Complete-Link / Furthest Neighbor
Average distance: distance between their centroids, or average of all cross-cluster pairs
single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
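The three linkage definitions can be written directly as the minimum, maximum and mean over all cross-cluster pairs. The sketch below assumes Euclidean distance between members; the function names and the two small clusters are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cross_distances(cluster_a, cluster_b):
    """All pairwise distances between a member of one cluster and a member of the other."""
    return [euclidean(p, q) for p in cluster_a for q in cluster_b]

def single_link(cluster_a, cluster_b):
    return min(cross_distances(cluster_a, cluster_b))   # nearest neighbor

def complete_link(cluster_a, cluster_b):
    return max(cross_distances(cluster_a, cluster_b))   # furthest neighbor

def average_link(cluster_a, cluster_b):
    d = cross_distances(cluster_a, cluster_b)
    return sum(d) / len(d)                               # mean over all cross pairs

# Hypothetical clusters of 2-D points
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

d_single = single_link(cluster_a, cluster_b)
d_complete = complete_link(cluster_a, cluster_b)
d_average = average_link(cluster_a, cluster_b)
```

For any pair of clusters the single-link distance is at most the average-link distance, which is at most the complete-link distance.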

21 Single-Link Method [Figure: points a, b, c, d with their Euclidean distance matrix; merge steps (1) {a,b}, c, d (2) {a,b,c}, d (3) {a,b,c,d}]

22 Complete-Link Method [Figure: points a, b, c, d with their Euclidean distance matrix; merge steps (1) {a,b}, c, d (2) {a,b}, {c,d} (3) {a,b,c,d}]

23 Identifying disease genes Non-tumor Liver Tumor Liver Liver-specific Ribosomal proteins Proliferation Endothelial cells 1 X. Chen & P.O. Brown et al Molecular Biology of the Cell Vol. 13, , June 2002

24 Human tumor patient and normal cells; various conditions Cluster or Classify genes according to tumors Cluster tumors according to genes

25 K-Means Clustering Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest center
3) Compute the new centers of the clusters
4) Repeat steps 2 and 3 until no object changes cluster
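The loop described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `seed` and `max_iter` parameters and the sample points are hypothetical, and the initial step picks k data points as starting centers, which is a common substitute for choosing an initial partition and not necessarily what the presenters used.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1 (assumption): k data points as initial centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the closest center
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:     # step 4: stop when no object changes cluster
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:              # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical data: two well-separated groups of 2-D points
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = kmeans(points, k=2)
```

On data this well separated the two groups are recovered regardless of which points are drawn as initial centers; on real expression data the result can depend on the initialization, so several restarts are commonly compared.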

26 K-Means Clustering Basic Ideas: using the cluster centre (mean) to represent each cluster; assigning data elements to the closest cluster (centre). Goal: minimise the square error (intra-class dissimilarity). Variations of K-Means –Initialisation (select the number of clusters, initial partitions) –Updating of center –Hill-climbing (trying to move an object to another cluster). This method first takes as many components of the population as the final required number of clusters, chosen so that these points are mutually farthest apart. Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.
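The formula for the square error is missing from the transcript. A common way to write this objective, given here as an assumption and not as the slide's own notation, is the within-cluster sum of squares, with $C_j$ the set of objects in cluster $j$, $\mu_j$ its mean and $k$ the number of clusters:

```latex
E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each assignment step and each center update can only lower or keep $E$, which is why the iteration terminates.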

27 The K-Means Clustering Method Example

28 k-means Clustering : Procedure (1) Initialization 1 Specify the number of clusters k : for example, k = Expression matrix Each point is called “gene”

29 k-means Clustering : Procedure (2) Initialization 2 Genes are randomly assigned to one of k clusters

30 k-means Clustering : Procedure (3) Calculate the mean of each cluster (1,2) (3,2) (3,4) (6,7) [(6,7) + (3,4) + …]

31 k-means Clustering : Procedure (4) Each gene is reassigned to the nearest cluster Gene i to cluster c

32 k-means Clustering : Procedure (4) Each gene is reassigned to the nearest cluster Gene i to cluster c

33 k-means Clustering : Procedure (5) Iterate until the means converge

34 k-means clustering : application 6220 yeast genes, 15 time points during cell cycle (M/G1 phase, G1 phase, M phase) Result : 13 of the 30 clusters had statistical significance for a biological function S. Tavazoie & GM Church Nature Genetics Vol. 22, July 1999 :

35 Computation Time and Memory Requirement n genes and m experiments; t: number of K-means iterations
Computation time: Hierarchical clustering –$O(m n^2 \log n)$ K-means clustering –$O(k t m n)$
Memory requirement: Hierarchical clustering –$O(mn + n^2)$ K-means clustering –$O(mn + kn)$

36 Issues in Cluster Analysis A lot of clustering algorithms A lot of distance/similarity metrics Which clustering algorithm runs faster and uses less memory? How many clusters after all? Are the clusters stable? Are the clusters meaningful?

37 K-Means vs Hierarchical Clustering

38 Pattern Recognition Clarification of decision making processes and automating them using computers
supervised: known number of classes; based on a training set; used to classify future observations
unsupervised: unknown number of classes; no prior knowledge; cluster analysis = one form

