1 Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

2 Gene Expression Data (Microarray)  A matrix of p genes on n samples: rows are genes, columns are mRNA samples (sample1, sample2, sample3, sample4, sample5, …)  Entry (i, j) is the expression level of gene i in mRNA sample j, typically log(treated-exp-value / controlled-exp-value)
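The log-ratio entries in the matrix above can be computed directly. A minimal sketch, assuming hypothetical treated and control intensity values (the numbers are made up for illustration):

```python
import math

# Hypothetical intensities for 3 genes across 2 samples.
treated = [[120.0, 240.0], [30.0, 15.0], [60.0, 60.0]]
control = [[60.0, 60.0], [60.0, 60.0], [60.0, 60.0]]

# Entry (i, j): log2 ratio of treated vs. control expression
# for gene i in sample j (positive = up-regulated, negative = down-regulated).
log_ratios = [[math.log2(t / c) for t, c in zip(t_row, c_row)]
              for t_row, c_row in zip(treated, control)]
```

Base-2 logs are a common convention here, so that a ratio of 2 maps to +1 and a ratio of 1/2 maps to -1.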

3 Some possible applications  Sample from a specific organ to show which genes are expressed there  Compare samples from healthy and diseased hosts to find gene-disease connections  Discover co-regulated genes  Discover promoters

4 Major Analysis Techniques  Single gene analysis  Compare the expression levels of the same gene under different conditions  Main techniques: Significance test (e.g., t-test)  Gene group analysis  Find genes that are expressed similarly across many different conditions  Main techniques: Clustering (many possibilities)  Gene network analysis  Analyze gene regulation relationship at a large scale  Main techniques: Bayesian networks
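For single-gene analysis, the significance test mentioned above can be sketched as a two-sample t statistic. This is one common variant (Welch's unequal-variance form), and the expression values are hypothetical:

```python
import math

def welch_t(xs, ys):
    """Welch's two-sample t statistic for comparing one gene's
    expression levels under two conditions (unequal variances allowed)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Hypothetical log-expression replicates: treated vs. control.
t = welch_t([2.1, 2.3, 1.9], [0.9, 1.1, 1.0])
```

A large |t| suggests the gene is differentially expressed; in practice one converts t to a p-value and corrects for testing thousands of genes at once.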

5 Clustering Methods  Similarity-based (need a similarity function)  Construct a partition  Agglomerative, bottom-up  Searching for an optimal partition  Typically "hard" clustering  Model-based (latent models, probabilistic or algebraic)  First compute the model  Clusters are obtained easily once we have the model  Typically "soft" clustering

6 Similarity-based Clustering  Define a similarity function to measure similarity between two objects  Common criteria: Find a partition to  Maximize intra-cluster similarity  Minimize inter-cluster similarity  Two ways to construct the partition  Hierarchical (e.g., Agglomerative Hierarchical Clustering)  Search by starting at a random partition (e.g., K-means)

7 Method 1 (Similarity-based): Agglomerative Hierarchical Clustering

8 Agglomerative Hierarchical Clustering  Given a similarity function to measure similarity between two objects  Gradually group similar objects together in a bottom-up fashion  Stop when some stopping criterion is met  Variations: different ways to compute group similarity based on individual object similarities

9 Similarity Measure: Pearson CC  The most popular correlation coefficient is the Pearson correlation coefficient (1892)  Correlation between X = {X_1, X_2, …, X_n} and Y = {Y_1, Y_2, …, Y_n}: s_XY = Σ_i (X_i - mean(X))(Y_i - mean(Y)) / sqrt( Σ_i (X_i - mean(X))^2 · Σ_i (Y_i - mean(Y))^2 )  s_XY is the similarity between X & Y  Better measures focus on a subset of values… (Adapted from a slide by Shin-Mu Tseng)
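The Pearson correlation formula above translates directly into code; a minimal sketch for two expression profiles:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two expression profiles.

    Ranges from -1 (anti-correlated) through 0 (uncorrelated) to +1
    (perfectly correlated), so it can serve as the similarity s_XY.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Note that Pearson CC is invariant to shifting and scaling a profile, which is often desirable for expression data: two genes that rise and fall together are similar even if their absolute levels differ.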

10 Similarity-induced Structure

11 How to Compute Group Similarity? Given two groups g1 and g2, three popular methods:  Single-link algorithm: s(g1, g2) = similarity of the closest pair  Complete-link algorithm: s(g1, g2) = similarity of the farthest pair  Average-link algorithm: s(g1, g2) = average similarity over all pairs
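The three linkage rules can be written as one-liners. Since `sim` returns a similarity (higher = more alike), the closest pair has the maximum similarity and the farthest pair the minimum:

```python
def single_link(g1, g2, sim):
    """Similarity of the closest pair across the two groups."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(g1, g2, sim):
    """Similarity of the farthest pair across the two groups."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(g1, g2, sim):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))
```

Any of these can be plugged into the agglomerative loop of slide 8: repeatedly merge the pair of groups with the highest group similarity.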

12 Three Methods Illustrated [Figure: two clusters g1 and g2, highlighting the closest pair (single-link), the farthest pair (complete-link), and all cross-cluster pairs (average-link)]

13 Comparison of the Three Methods  Single-link  “Loose” clusters  Individual decision, sensitive to outliers  Complete-link  “Tight” clusters  Individual decision, sensitive to outliers  Average-link  “In between”  Group decision, insensitive to outliers  Which one is the best? Depends on what you need!

14 Method 2 (similarity-based): K-Means

15 K-Means Clustering  Given a similarity function  Start with k randomly selected data points  Assume they are the centroids of k clusters  Assign every data point to the cluster whose centroid is closest to it  Recompute the centroid of each cluster  Repeat until the similarity-based objective function converges
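The loop above can be sketched in a few lines. This version uses squared Euclidean distance as the (dis)similarity and a fixed iteration cap instead of a convergence test, to keep it short; the data points are arbitrary tuples:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: points are coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # k random points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to nearest centroid
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):      # recompute each centroid
            if cl:
                centroids[j] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return centroids, clusters
```

In practice one reruns K-means from several random initializations and keeps the best result, since the outcome depends on the starting centroids.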

16 Method 3 (model-based): Mixture Models

17 Mixture Model for Clustering [Figure: three component densities P(X|Cluster 1), P(X|Cluster 2), P(X|Cluster 3) and the resulting mixture] P(X) = λ_1 P(X|Cluster 1) + λ_2 P(X|Cluster 2) + λ_3 P(X|Cluster 3), with λ_1 + λ_2 + λ_3 = 1

18 Mixture Model Estimation  Likelihood function  Parameters: λ_i, μ_i, σ_i (mixing weights, component means, component standard deviations)  Estimated using the EM algorithm  Similar to "soft" K-means
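The EM iteration can be sketched for the one-dimensional Gaussian case. This is a minimal sketch, assuming λ_i / μ_i / σ_i are the mixing weights, means, and standard deviations of a Gaussian mixture (the slide does not fix the component family); the deterministic initialization is a simplification:

```python
import math

def em_gmm_1d(xs, k=2, iters=50):
    """EM for a 1-D Gaussian mixture: returns (lam, mu, var)."""
    n = len(xs)
    lam = [1.0 / k] * k                       # mixing weights lambda_j
    lo, hi = min(xs), max(xs)
    mu = ([lo + (hi - lo) * j / (k - 1) for j in range(k)]
          if k > 1 else [sum(xs) / n])        # means spread over the data range
    var = [1.0] * k                           # variances sigma_j^2
    for _ in range(iters):
        # E-step: posterior P(cluster j | x_i) -- the "soft" assignment
        resp = []
        for x in xs:
            ps = [lam[j] / math.sqrt(2 * math.pi * var[j])
                  * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                  for j in range(k)]
            s = sum(ps)
            resp.append([p / s for p in ps])
        # M-step: re-estimate lambda_j, mu_j, sigma_j^2 from soft counts
        for j in range(k):
            nj = sum(r[j] for r in resp)
            lam[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = (sum(r[j] * (x - mu[j]) ** 2
                          for r, x in zip(resp, xs)) / nj + 1e-6)
    return lam, mu, var
```

The E-step is exactly the "soft" analogue of K-means assignment: each point is split fractionally across clusters in proportion to how well each component explains it, instead of being assigned to a single nearest centroid.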

19 Method 4 (model-based) [If we have time] Singular Value Decomposition (SVD) Also called "Latent Semantic Indexing" (LSI)

20 Example of “Semantic Concepts” (Slide from C. Faloutsos’s talk)

21 Singular Value Decomposition (SVD) A [n x m] = U [n x r] Λ [r x r] (V [m x r])^T  A: n x m matrix (n documents, m terms)  U: n x r matrix (n documents, r concepts)  Λ: r x r diagonal matrix (strength of each 'concept'; r = rank of the matrix)  V: m x r matrix (m terms, r concepts) (Slide from C. Faloutsos's talk)
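A full SVD routine is beyond a slide, but the strongest single 'concept' (the leading singular value and its vectors) can be found by power iteration on AᵀA. A pure-Python sketch on a toy matrix; `top_singular` is a hypothetical helper, not part of any library:

```python
import math

def top_singular(A, iters=200):
    """Leading singular triple (sigma, u, v) of A by power iteration.

    Repeatedly applies A^T A to a vector v; v converges to the top right
    singular vector, and sigma = |A v|, u = A v / sigma.
    """
    n, m = len(A), len(A[0])
    v = [1.0 / math.sqrt(m)] * m               # arbitrary unit start vector
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
        w = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(m)]  # A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
    sigma = math.sqrt(sum(x * x for x in Av))
    return sigma, [x / sigma for x in Av], v

# Toy term-document-style matrix: two identical "CS" rows over the
# first two terms, one "MD" row on the third term.
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
sigma, u, v = top_singular(A)
```

Here the top singular vector v loads only on the first two terms, i.e. it recovers the "CS-concept"; deflating and repeating yields the next concept, and keeping only the strongest concepts is the dimensionality reduction used by LSI.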

22 Example of SVD [Figure: a matrix over terms data, inf, retrieval, brain, lung and CS/MD documents factors as A = U Λ V^T; U gives each document's weight on the CS-concept and MD-concept, Λ gives the strength of each concept (e.g., of the CS-concept), and V^T gives the term representation of each concept; truncating weak concepts yields dimensionality reduction] (Slide adapted from C. Faloutsos's talk)

23 More clustering methods and software  Partitioning: K-Means, K-Medoids, PAM, CLARA, …  Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK  Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …  Grid-based: STING, CLIQUE, WaveCluster, …  Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …  Two-way clustering  Block clustering

