
1
Microarray Data Analysis
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
March 19, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

2
Gene Expression Data (Microarray)
p genes measured on n mRNA samples; entry (i, j) is the expression level of gene i in sample j, recorded as log(treated expression value / control expression value):

          sample1  sample2  sample3  sample4  sample5 ...
  gene 1    0.46     0.30     0.80     1.51     0.90  ...
  gene 2   -0.10     0.49     0.24     0.06     0.46  ...
  gene 3    0.15     0.74     0.04     0.10     0.20  ...
  gene 4   -0.45    -1.03    -0.79    -0.56    -0.32  ...
  gene 5   -0.06     1.06     1.35     1.09    -1.09  ...
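Each matrix entry can be computed from the raw treated and control intensities. A minimal sketch; base-2 logs are a common microarray convention, though the slide does not state the base, and the intensity values here are invented:

```python
import math

# Hypothetical raw intensities for one gene in one sample pair.
treated_expr = 2.0   # expression value in the treated sample
control_expr = 1.0   # expression value in the control sample

# Matrix entry: log ratio of treated vs. control expression.
log_ratio = math.log2(treated_expr / control_expr)
print(log_ratio)  # 1.0: twice as much expression under treatment
```

A positive entry means the gene is up-regulated under treatment; a negative entry means it is down-regulated.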

3
Some Possible Applications
- Sample from a specific organ to show which genes are expressed there
- Compare samples from healthy and sick hosts to find gene-disease connections
- Discover co-regulated genes
- Discover promoters

4
Major Analysis Techniques
- Single gene analysis: compare the expression levels of the same gene under different conditions. Main technique: significance testing (e.g., t-test).
- Gene group analysis: find genes that are expressed similarly across many different conditions. Main technique: clustering (many possibilities).
- Gene network analysis: analyze gene regulation relationships at a large scale. Main technique: Bayesian networks.
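The single-gene t-test can be sketched directly. This uses Welch's unequal-variance form of the statistic (one common choice; the slide does not specify a variant), and the expression values are invented for illustration:

```python
import statistics

# Toy log-ratio values for one gene under two conditions (made up).
healthy = [0.1, 0.2, 0.0, 0.15, 0.05]
sick = [0.9, 1.1, 0.8, 1.0, 1.2]

def t_statistic(a, b):
    """Welch's t-statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

t = t_statistic(healthy, sick)
# A large |t| suggests the gene is differentially expressed
# between the two conditions.
```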

5
Clustering Methods
- Similarity-based (need a similarity function)
  - Construct a partition: agglomerative (bottom-up), or search for an optimal partition
  - Typically "hard" clustering
- Model-based (latent models, probabilistic or algebraic)
  - First compute the model; clusters are obtained easily once the model is known
  - Typically "soft" clustering

6
Similarity-based Clustering
- Define a similarity function to measure the similarity between two objects
- Common criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity
- Two ways to construct the partition:
  - Hierarchical (e.g., agglomerative hierarchical clustering)
  - Search starting from a random partition (e.g., K-means)

7
Method 1 (Similarity-based): Agglomerative Hierarchical Clustering

8
Agglomerative Hierarchical Clustering
- Given a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarities

9
Similarity Measure: Pearson Correlation Coefficient
The most popular similarity measure is the Pearson correlation coefficient (1892). The correlation between X = {X1, X2, ..., Xn} and Y = {Y1, Y2, ..., Yn} is

  s_XY = sum_i (Xi - X̄)(Yi - Ȳ) / ( sqrt(sum_i (Xi - X̄)^2) * sqrt(sum_i (Yi - Ȳ)^2) )

where X̄ and Ȳ are the means of X and Y. s_XY is the similarity between X and Y. Better measures focus on a subset of values...
(Adapted from a slide by Shin-Mu Tseng)
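The coefficient can be computed directly from its definition; a minimal sketch, with invented expression profiles:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx) ** 2 for xi in x) ** 0.5
    sy = sum((yi - my) ** 2 for yi in y) ** 0.5
    return cov / (sx * sy)

# Two perfectly linearly related profiles correlate at 1.0.
assert abs(pearson([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
```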

10
Similarity-induced Structure

11
How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:
- Single-link algorithm: s(g1, g2) = similarity of the closest pair
- Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
- Average-link algorithm: s(g1, g2) = average similarity over all pairs
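The three rules can be written directly on top of any pairwise similarity function. A sketch using a toy 1-D negative-distance similarity (the sim function and points are invented for illustration):

```python
def sim(x, y):
    """Toy pairwise similarity: negative distance between 1-D points."""
    return -abs(x - y)

def single_link(g1, g2):
    return max(sim(a, b) for a in g1 for b in g2)   # closest pair

def complete_link(g1, g2):
    return min(sim(a, b) for a in g1 for b in g2)   # farthest pair

def average_link(g1, g2):
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))
```

On g1 = [0, 1] and g2 = [3, 5]: single-link gives -2 (closest pair 1 and 3), complete-link gives -5 (farthest pair 0 and 5), and average-link gives -3.5.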

12
Three Methods Illustrated
(Figure: the single-link, complete-link, and average-link similarities between two groups g1 and g2.)

13
Comparison of the Three Methods
- Single-link: "loose" clusters; individual decision, sensitive to outliers
- Complete-link: "tight" clusters; individual decision, sensitive to outliers
- Average-link: "in between"; group decision, insensitive to outliers
Which one is the best? It depends on what you need!

14
Method 2 (similarity-based): K-Means

15
K-Means Clustering
- Given a similarity function
- Start with k randomly selected data points; assume they are the centroids of k clusters
- Assign every data point to the cluster whose centroid is closest to it
- Recompute the centroid of each cluster
- Repeat until the similarity-based objective function converges
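The loop above can be sketched in a few lines. This toy version works on 1-D points with Euclidean distance (i.e., negative distance as the similarity) and uses made-up data; real expression profiles would be vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch on 1-D points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random points as initial centroids
    for _ in range(iters):
        # Assignment step: each point joins its closest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups recover centroids near 1 and 10.
cents = kmeans([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], 2)
```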

16
Method 3 (model-based): Mixture Models

17
Mixture Model for Clustering
Each cluster has its own component distribution P(X|Cluster_i), and the data distribution is their weighted mixture:

  P(X) = α1 P(X|Cluster1) + α2 P(X|Cluster2) + α3 P(X|Cluster3)

where the mixing weights αi sum to 1.

18
Mixture Model Estimation
- Likelihood function: L = prod_j P(X_j) = prod_j sum_i α_i P(X_j | Cluster_i)
- Parameters: mixing weights α_i plus the parameters of each component (e.g., means μ_i and variances σ_i² for Gaussian components)
- Estimated using the EM algorithm
- Similar to "soft" K-means
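A sketch of EM for a two-component 1-D Gaussian mixture; the initialization scheme, variance floor, and data are assumptions for illustration, not from the lecture:

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a 2-component 1-D Gaussian mixture ('soft' K-means)."""
    # Crude initialization from the data range (an assumption).
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    alpha = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: posterior ("soft") membership of each point in each cluster.
        resp = []
        for x in xs:
            w = [alpha[j] * pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            alpha[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj
            var[j] = max(var[j], 1e-6)  # guard against variance collapse
    return alpha, mu, var

# Two clumps of points, near 0 and near 5, recover means near 0 and 5.
alpha, mu, var = em_gmm_1d([0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05])
```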

19
Method 4 (model-based) [if we have time]: Singular Value Decomposition (SVD), also called "Latent Semantic Indexing" (LSI)

20
Example of “Semantic Concepts” (Slide from C. Faloutsos’s talk)

21
Singular Value Decomposition (SVD)
A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T
- A: n x m matrix (n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Σ: r x r diagonal matrix (strength of each "concept"; r is the rank of A)
- V: m x r matrix (m terms, r concepts)
(Slide from C. Faloutsos's talk)
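A minimal numpy sketch of the decomposition, with a made-up 3x3 matrix standing in for a document-term (or gene-sample) matrix:

```python
import numpy as np

# Toy matrix: rows 1-2 share one "concept", row 3 carries another.
A = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization reconstructs A (up to floating-point error).
A_rec = U @ np.diag(s) @ Vt

# Rank-1 truncation keeps only the strongest "concept"
# (this is the dimensionality reduction used by LSI).
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
```

Here the strongest singular value is sqrt(10) (from the rank-1 block in rows 1-2), so the rank-1 truncation reproduces that block exactly and zeroes out row 3.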

22
Example of SVD
(Figure: a term-document matrix over the terms "data", "inf", "retrieval", "brain", "lung", factored as A = U Σ V^T. U gives each document's membership in a CS concept and an MD concept, Σ gives the strength of each concept, and V^T gives the term representation of each concept; truncating Σ performs dimensionality reduction.)
(Slide adapted from C. Faloutsos's talk)

23
More Clustering Methods and Software
- Partitioning: K-Means, K-Medoids, PAM, CLARA, ...
- Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
- Density-based: CAST, DBSCAN, OPTICS, CLIQUE, ...
- Grid-based: STING, CLIQUE, WaveCluster, ...
- Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, ...
- Two-way clustering, block clustering
