
1
Microarray Data Analysis
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
March 19, 2004
ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign

2
Gene Expression Data (Microarray)
A microarray data set records the expression levels of p genes across n mRNA samples. Entry (i, j) is the expression level of gene i in mRNA sample j, typically computed as log(treated-exp-value / control-exp-value).
(Figure: a p x n matrix with genes as rows and samples sample1 ... sample5, ... as columns.)
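
The log-ratio entries described above can be computed in a few lines. A minimal sketch with hypothetical intensities (the base-2 log is a common convention, assumed here):

```python
import numpy as np

# Hypothetical 3-gene x 2-sample example: raw intensities from the
# treated and control channels of each array.
treated = np.array([[400.0, 800.0],
                    [100.0, 100.0],
                    [ 50.0,  25.0]])
control = np.array([[200.0, 200.0],
                    [100.0, 200.0],
                    [200.0, 100.0]])

# Each entry of the expression matrix is log2(treated / control):
# positive = up-regulated, negative = down-regulated, 0 = unchanged.
expression = np.log2(treated / control)
print(expression)
```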

3
Some Possible Applications
- Sample from a specific organ to show which genes are expressed there
- Compare samples from healthy and diseased hosts to find gene-disease connections
- Discover co-regulated genes
- Discover promoters

4
Major Analysis Techniques
- Single-gene analysis: compare the expression levels of the same gene under different conditions. Main technique: significance testing (e.g., the t-test).
- Gene group analysis: find genes that are expressed similarly across many different conditions. Main technique: clustering (many possibilities).
- Gene network analysis: analyze gene regulation relationships at a large scale. Main technique: Bayesian networks.
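
To make the single-gene case concrete, here is a minimal sketch of a two-sample t statistic (Welch's unequal-variance form, one common choice) applied to hypothetical log-ratios; the data values are invented for illustration:

```python
import numpy as np

def welch_t(x, y):
    """Welch's two-sample t statistic:
    t = (mean(x) - mean(y)) / sqrt(var(x)/nx + var(y)/ny)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx = x.var(ddof=1) / len(x)   # sample variance over sample size
    vy = y.var(ddof=1) / len(y)
    return (x.mean() - y.mean()) / np.sqrt(vx + vy)

# Hypothetical log-ratios for one gene in healthy vs. diseased samples.
healthy  = [0.1, -0.2, 0.0, 0.3, -0.1]
diseased = [1.2,  0.9, 1.5, 1.1,  1.3]
print(welch_t(diseased, healthy))  # a large |t| suggests differential expression
```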

5
Clustering Methods
- Similarity-based (needs a similarity function): construct a partition, either agglomeratively (bottom-up) or by searching for an optimal partition. Typically "hard" clustering.
- Model-based (latent models, probabilistic or algebraic): first fit the model; clusters are obtained easily once the model is known. Typically "soft" clustering.

6
Similarity-based Clustering
- Define a similarity function to measure the similarity between two objects.
- Common criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity.
- Two ways to construct the partition: hierarchically (e.g., agglomerative hierarchical clustering) or by searching from a random starting partition (e.g., K-means).

7
Method 1 (Similarity-based): Agglomerative Hierarchical Clustering

8
Agglomerative Hierarchical Clustering
- Given a similarity function that measures the similarity between two objects,
- gradually group similar objects together in a bottom-up fashion,
- stopping when some stopping criterion is met (e.g., a desired number of clusters).
- Variations: different ways to compute group similarity from individual object similarities.
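
The bottom-up procedure above can be sketched in a few lines. This is a minimal illustration, not an efficient implementation; it assumes average-link group similarity (one of the variations discussed next) and stops at k clusters:

```python
def agglomerate(items, sim, k):
    """Merge the two most similar clusters until only k remain."""
    clusters = [[x] for x in items]          # start: each object alone
    def group_sim(g1, g2):                   # average-link group similarity
        return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))
    while len(clusters) > k:                 # stopping criterion: k clusters
        # find the most similar pair of clusters
        i, j = max(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: group_sim(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

# Toy 1-D data; similarity = negative distance.
points = [1.0, 1.2, 5.0, 5.1, 9.5]
clusters = agglomerate(points, lambda a, b: -abs(a - b), 3)
print(clusters)
```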

9
Similarity Measure: Pearson CC
The most popular correlation coefficient is the Pearson correlation coefficient (1892). The correlation between X = {X_1, X_2, ..., X_n} and Y = {Y_1, Y_2, ..., Y_n} is

  s_XY = sum_i (X_i - X̄)(Y_i - Ȳ) / sqrt( sum_i (X_i - X̄)^2 * sum_i (Y_i - Ȳ)^2 )

where X̄ and Ȳ are the means of X and Y. s_XY is the similarity between X and Y. Better measures focus on a subset of the values...
(Adapted from a slide by Shin-Mu Tseng)
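
A direct transcription of the formula above, applied to two hypothetical expression profiles (the numbers are invented for illustration):

```python
import math

def pearson(x, y):
    """Pearson correlation: s_XY = sum((x_i - mx)(y_i - my)) /
    sqrt(sum((x_i - mx)^2) * sum((y_i - my)^2))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical expression profiles across 5 conditions.
g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly 2*g1: strongly co-expressed
print(pearson(g1, g2))           # close to +1
```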

10
Similarity-induced Structure

11
How to Compute Group Similarity?
Three popular methods, given two groups g1 and g2:
- Single-link algorithm: s(g1, g2) = similarity of the closest pair
- Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
- Average-link algorithm: s(g1, g2) = average similarity over all pairs
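
The three definitions translate directly into code. A minimal sketch on toy 1-D groups, using negative distance as the similarity function:

```python
def single_link(g1, g2, sim):
    """Similarity of the closest (most similar) pair across the groups."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(g1, g2, sim):
    """Similarity of the farthest (least similar) pair across the groups."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(g1, g2, sim):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

# Toy 1-D groups; similarity = negative distance.
sim = lambda a, b: -abs(a - b)
g1, g2 = [0.0, 1.0], [3.0, 5.0]
print(single_link(g1, g2, sim))    # -2.0 (closest pair: 1.0 vs 3.0)
print(complete_link(g1, g2, sim))  # -5.0 (farthest pair: 0.0 vs 5.0)
print(average_link(g1, g2, sim))   # -3.5 (mean over the four pairs)
```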

12
Three Methods Illustrated
(Figure: two groups g1 and g2, showing which pair of points each of the single-link, complete-link, and average-link algorithms uses.)

13
Comparison of the Three Methods
- Single-link: "loose" clusters; individual decision, sensitive to outliers.
- Complete-link: "tight" clusters; individual decision, sensitive to outliers.
- Average-link: "in between"; group decision, insensitive to outliers.
Which one is the best? It depends on what you need!

14
Method 2 (Similarity-based): K-Means

15
K-Means Clustering
Given a similarity function:
1. Start with k randomly selected data points; assume they are the centroids of k clusters.
2. Assign every data point to the cluster whose centroid is closest to it.
3. Recompute the centroid of each cluster.
4. Repeat steps 2-3 until the similarity-based objective function converges.
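
The loop above is Lloyd's algorithm; a minimal sketch using Euclidean distance (one common choice of the similarity function) on invented toy data:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-means: assign each point to its nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random start
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):      # converged
            break
        centroids = new
    return labels, centroids

# Two well-separated toy clusters of "expression profiles".
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)
```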

16
Method 3 (model-based): Mixture Models

17
Mixture Model for Clustering
Per-cluster distributions: P(X|Cluster 1), P(X|Cluster 2), P(X|Cluster 3). The data distribution is their weighted sum:

  P(X) = λ_1 P(X|Cluster 1) + λ_2 P(X|Cluster 2) + λ_3 P(X|Cluster 3)

where the mixing weights λ_i sum to 1.

18
Mixture Model Estimation
- Likelihood function over the parameters: the mixing weights λ_i and the component parameters (for Gaussian components, means μ_i and variances σ_i²)
- Estimated using the EM algorithm
- Similar to "soft" K-means
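
A minimal EM sketch for a two-component 1-D Gaussian mixture, assuming the λ/μ/σ² parameterization above; the data are synthetic log-ratios invented for illustration:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    Parameters: mixing weights lam, means mu, variances var."""
    x = np.asarray(x, float)
    lam = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])      # spread-out initial means
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: posterior P(cluster k | x_i) -- the "soft" assignment
        dens = lam * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate lam, mu, var from the soft assignments
        nk = resp.sum(axis=0)
        lam = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return lam, mu, var

# Synthetic log-ratios drawn around two expression levels (0 and 3).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.3, 100), rng.normal(3.0, 0.3, 100)])
lam, mu, var = em_gmm_1d(x)
print(lam, sorted(mu))
```

Each E-step is the "soft" analogue of K-means assignment; the M-step is the analogue of recomputing centroids, weighted by the posteriors.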

19
Method 4 (model-based) [If we have time]: Singular Value Decomposition (SVD), also called "Latent Semantic Indexing" (LSI)

20
Example of “Semantic Concepts” (Slide from C. Faloutsos’s talk)

21
Singular Value Decomposition (SVD)
A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T
- A: n x m matrix (n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Σ: r x r diagonal matrix (strength of each "concept"); r = rank of the matrix
- V: m x r matrix (m terms, r concepts)
(Slide from C. Faloutsos's talk)
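
The decomposition can be computed with NumPy's SVD routine; a minimal sketch on an invented rank-2 term-document-style matrix:

```python
import numpy as np

# Toy 4x3 matrix A (rows = documents, cols = terms), A = U diag(s) Vt.
A = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 4.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)  # singular values = strength of each "concept"

# Rank-2 reconstruction keeps only the two strongest concepts
# (dimensionality reduction, as in LSI).
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(A, A2))   # True here, since A already has rank 2
```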

22
Example of SVD
(Figure: a document-term matrix over the terms "data", "inf", "retrieval", "brain", "lung", decomposed as A = U Σ V^T; the decomposition reveals a CS-concept and an MD-concept, V gives the term representation of each concept, the entries of Σ give the strength of each concept, and truncating Σ performs dimensionality reduction. Slide adapted from C. Faloutsos's talk.)

23
More Clustering Methods and Software
- Partitioning: K-Means, K-Medoids, PAM, CLARA, ...
- Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
- Density-based: CAST, DBSCAN, OPTICS, CLIQUE, ...
- Grid-based: STING, CLIQUE, WaveCluster, ...
- Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, ...
- Two-way clustering; block clustering
