Learn from chips: Microarray data analysis and clustering CS 374 Yu Bai Nov. 16, 2004.


1 Learn from chips: Microarray data analysis and clustering CS 374 Yu Bai Nov. 16, 2004

2 Outline: Background & motivation; algorithms overview; fuzzy k-means clustering (1st paper); independent component analysis (2nd paper)

3 Why does cancer occur? CHIP-ing away at medical questions: molecular-level understanding, diagnosis, treatment, drug design. A snapshot of gene expression: the "(DNA) microarray"

4 Spot your genes: known gene sequences are spotted on a glass slide (the chip); RNA is isolated from cancer cells and normal cells and labeled with Cy3 and Cy5 dyes

5 Matrix of expression: rows are genes (Gene 1 … Gene N), columns are experiments (Exp 1, Exp 2, Exp 3), and each entry E is an expression level

6 Why care about "clustering"? Reordering the gene rows (Gene 1 … Gene N) across experiments E1, E2, E3 groups similar expression profiles. This lets us discover functional relations (similar expression means functionally related), assign function to unknown genes, and find which gene controls which other genes

7 A review: microarray data analysis. Supervised (classification) vs. un-supervised (clustering). "Heuristic" methods: hierarchical clustering, k-means clustering, self-organizing maps, others. Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others

8 Heuristic methods: distance metrics
1. Euclidean distance: D(X,Y) = sqrt[(x1 - y1)^2 + (x2 - y2)^2 + … + (xn - yn)^2]
2. (Pearson) correlation coefficient: R(X,Y) = (1/n) * Σi [(xi - E(x))/σx] * [(yi - E(y))/σy], where σx = sqrt(E(x^2) - E(x)^2) and E(x) is the expected value of x. R = 1 if x = y; R = 0 if E(xy) = E(x)E(y)
3. Other choices for distances…
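As a concrete illustration, the two metrics above can be computed directly. A minimal sketch in plain Python; the function names are my own:

```python
import math

def euclidean(x, y):
    # D(X,Y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # R(X,Y) = (1/n) * sum_i [(x_i - E(x)) / sigma_x] * [(y_i - E(y)) / sigma_y]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
```

Note the two metrics disagree on purpose: Euclidean distance is sensitive to absolute expression levels, while correlation only compares the shapes of the profiles.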

9 Hierarchical clustering (on E1, E2, E3): easy, but the result depends on where the grouping starts, and the "tree" structure can be hard to interpret
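A naive single-linkage sketch of the agglomerative idea (plain Python, my own naming; this O(n^3) loop is for illustration only, real analyses use an optimized library implementation):

```python
import math

def single_linkage(points, n_clusters):
    # Agglomerative clustering: start with singleton clusters and repeatedly
    # merge the two clusters whose closest members are nearest (single linkage).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]   # merge b into a
        del clusters[b]
    return clusters
```

Recording the merge order instead of stopping at n_clusters would yield the dendrogram ("tree") the slide refers to.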

10 K-means clustering: how many clusters (k)? How to initialize? Local minima. Generally, heuristic methods have no established way to determine the "correct" number of clusters, to choose the "best" algorithm, or to perform an overall optimization
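A minimal Lloyd's-algorithm sketch (numpy, my own implementation). The random initialization and the fixed iteration cap reflect exactly the weaknesses the slide lists: a different seed can land in a different local minimum, and k must be supplied:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Lloyd's algorithm: alternate nearest-center assignment and centroid update.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```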

11 Probability-based methods: principal component analysis (PCA) (Pearson 1901; Everitt 1992; Basilevsky 1994). Common use: reduce dimension & filter noise. Goal: find (linear) "uncorrelated" component(s) that account for as much of the variance of the initial variables as possible. "Uncorrelated": E[xy] = E[x]*E[y] for x ≠ y

12 PCA algorithm. Form the "column-centered" matrix A and the covariance matrix AᵀA; eigenvalue decomposition AᵀA = U Λ Uᵀ, where the columns of U are the eigenvectors (principal components) and Λ holds the eigenvalues. Then digest the principal components (Gaussian assumption). [Figure: the genes × experiments matrix decomposed into eigenarrays over Exp 1-n]
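The recipe above, transcribed directly (numpy sketch; note it follows the slide's AᵀA convention without a 1/n normalization, so the eigenvalues are unnormalized variances):

```python
import numpy as np

def pca(A, n_components):
    # "Column-centered" matrix A, covariance matrix A^T A,
    # eigendecomposition A^T A = U Lambda U^T.
    A = A - A.mean(axis=0)                 # column-center
    eigvals, U = np.linalg.eigh(A.T @ A)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # largest variance first
    U = U[:, order[:n_components]]         # principal components
    return A @ U, eigvals[order[:n_components]]
```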

13 Are biologists satisfied? Biological processes are non-Gaussian; "faithful" vs. "meaningful". [Figure: expression levels of Gene 1-Gene 5 grouped into ribosome biogenesis, energy pathway, and biological regulators; a super-Gaussian model]

14 This is equivalent to "source separation": given Mixture 1, can we recover Source 1 and Source 2?

15 Independent vs. uncorrelated: E[g(x)f(y)] = E[g(x)]*E[f(y)] for x ≠ y and any functions g, f. The fact that sources are independent is stronger than their being uncorrelated. Example: sources x1, x2 and two mixtures y1 = 2*x1 + 3*x2, y2 = 4*x1 + x2. [Figure: scatter of y1 vs. y2 with the independent components and the principal components drawn as axes]
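The slide's mixtures can be checked numerically: independent sources are (near-)uncorrelated, but the mixtures y1, y2 are strongly correlated. A small sketch; the Laplace distribution here is my stand-in for a super-Gaussian source:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.laplace(size=(2, 10000))   # independent super-Gaussian sources
y1 = 2 * x1 + 3 * x2                    # the two mixtures from the slide
y2 = 4 * x1 + x2
r_sources = np.corrcoef(x1, x2)[0, 1]   # near 0: sources are uncorrelated
r_mixtures = np.corrcoef(y1, y2)[0, 1]  # clearly nonzero: mixing correlates them
```

Undoing that induced correlation is exactly what PCA achieves, but only ICA additionally demands independence, which is what separates the sources.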

16 Independent component analysis (ICA). Simplified notation: find the "unmixing" matrix W that makes the recovered s1, …, sm as independent as possible

17 (Linear) ICA algorithm. "Likelihood function" = log(probability of the observation). With Y = WX: p(x) = |det W| p(y) and p(y) = Πi pi(yi), so L(y, W) = log p(x) = log |det W| + Σi log pi(yi)
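A runnable sketch of maximizing this L(y, W). This is my own minimal implementation, not the papers' algorithm: the mixtures are whitened first, then W follows the natural-gradient ascent rule with the super-Gaussian prior p(y) ∝ 1/cosh(y), whose score is d log p/dy = -tanh(y):

```python
import numpy as np

def ica(X, iters=1000, lr=0.05, seed=0):
    # Maximize L(y, W) = log|det W| + sum_i log p_i(y_i) by natural-gradient
    # ascent, with the super-Gaussian prior log p_i(y) = -log cosh(y).
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))               # whiten the mixtures first
    X = E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ X
    n, T = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(iters):
        Y = W @ X
        # natural gradient of L: (I - E[tanh(y) y^T]) W
        W += lr * (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
    return W @ X, W
```

As with all ICA, the recovered components come back in arbitrary order and with arbitrary sign/scale.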

18 (Linear) ICA algorithm: find the W that maximizes L(y, W), using a super-Gaussian model for the pi

19 First paper: Gasch et al. (2002) Genome Biology, 3, 1-59. Improving the detection of conditional coregulation in gene expression by fuzzy k-means clustering

20 Biology is "fuzzy": many genes are conditionally co-regulated. k-means clustering vs. fuzzy k-means: Xi is the expression profile of the i-th gene and Vj is the j-th cluster center
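A plain fuzzy k-means (fuzzy c-means) sketch of this contrast: each gene gets a graded membership m_ij in every cluster instead of a single hard assignment. This is my own simplification, initialized from the first k genes; the paper instead seeds the centers with PCA eigenvectors and adds per-gene weights:

```python
import numpy as np

def fuzzy_kmeans(X, k, q=2.0, iters=100):
    # Graded memberships m_ij in every cluster V_j; centers are
    # membership-weighted means (standard fuzzy c-means updates).
    V = X[:k].astype(float).copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        m = d ** (-2.0 / (q - 1.0))
        m /= m.sum(axis=1, keepdims=True)        # memberships sum to 1
        w = m.T ** q
        V = (w @ X) / w.sum(axis=1, keepdims=True)
    return m, V
```

A conditionally co-regulated gene can thus hold substantial membership in several clusters at once, which is the point of the paper.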

21 FuzzyK flowchart. Initialize the Vj with PCA eigenvectors, then iterate the center update
Vj' = Σi m²XiVj WXi Xi / Σi m²XiVj WXi
where the weight WXi evaluates the correlation of Xi with the other genes. Three cycles are run; between cycles, correlated genes (> 0.7) are removed

22 FuzzyK performance: k is more "definitive"; uncovers new gene clusters (cell wall and secretion factors); reveals new promoter sequences; recovers the clusters found by classical methods

23 Second paper: ICA is so new… Lee et al. (2003) Genome Biology, 4, R76. Systematic evaluation of ICA with respect to other clustering methods (PCA, k-means)

24 From linear to non-linear. Linear ICA: X = AS, where X is the expression matrix (N conditions × K genes), si is an independent vector of K gene levels, and xj = Σi aji si. Or non-linear ICA: X = f(AS)

25 How to do non-linear ICA? Construct a feature space F, map X to Ψ in F, then run ICA on Ψ. Input space ℝⁿ maps to feature space ℝᴸ; normally L > N

26 Kernel trick. The xi (the i-th column of X) in ℝⁿ are mapped to Φ(xi) in feature space, and the kernel computes inner products there: k(xi, xj) = Φ(xi)·Φ(xj). Constructing F amounts to constructing Φ. Choose vectors {v1 … vL} from the {xi} so that ΦV = {Φ(v1), Φ(v2), … Φ(vL)} is a basis of F, i.e. rank(ΦVᵀΦV) = L, where
ΦVᵀΦV = [ k(v1,v1) … k(v1,vL) ; … ; k(vL,v1) … k(vL,vL) ]
Mapped points in F:
Ψ[xi] = (ΦVᵀΦV)^(-1/2) [ k(v1,xi), …, k(vL,xi) ]ᵀ
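This construction can be sketched numerically (my own choice of a Gaussian kernel, assuming the basis points V are chosen so the Gram matrix is invertible). The defining property, that inner products of the Ψ vectors reproduce kernel values on the basis points, is easy to verify:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # k(x_i, x_j) = Phi(x_i) . Phi(x_j), computed without forming Phi
    return np.exp(-gamma * np.sum((a - b) ** 2))

def feature_map(X, V, gamma=1.0):
    # Psi[x_i] = (Phi_V^T Phi_V)^(-1/2) [k(v_1, x_i), ..., k(v_L, x_i)]^T
    G = np.array([[rbf(v, w, gamma) for w in V] for v in V])  # Phi_V^T Phi_V
    d, E = np.linalg.eigh(G)
    G_inv_sqrt = E @ np.diag(d ** -0.5) @ E.T                 # G^(-1/2)
    K = np.array([[rbf(v, x, gamma) for x in X] for v in V])  # k(v_l, x_i)
    return G_inv_sqrt @ K
```

Linear ICA is then run on the columns of this Ψ matrix, which is how the non-linearity enters.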

27 ICA-based clustering. Independent components yi = (yi1, yi2, … yiK), i = 1, … M. "Load": the j-th entry of yi is the load of the j-th gene. Two clusters per component: Cluster i,1 = {gene j | yij is among the (C% × K) largest loads in yi}; Cluster i,2 = {gene j | yij is among the (C% × K) smallest loads in yi}
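The per-component cluster rule, transcribed directly (numpy sketch; names are my own):

```python
import numpy as np

def load_clusters(y, c=0.1):
    # For one independent component y_i over K genes, the C%*K largest
    # loads form one cluster and the C%*K smallest loads form the other.
    n = max(1, int(c * len(y)))
    order = np.argsort(y)
    return set(order[-n:].tolist()), set(order[:n].tolist())
```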

28 Evaluate biological significance. Compare the clusters from the ICs (Cluster 1 … Cluster n) with the functional classes (GO 1 … GO m): for each pair (Cluster i, GO j), calculate the p-value, i.e. the probability that they share that many genes by chance

29 Evaluate biological significance. Let g be the number of genes in the microarray data, f the size of a functional class, n the size of a cluster, and k the number of genes they share. Then
p = 1 − Σ_{i=0}^{k−1} [ C(f,i) · C(g−f, n−i) / C(g,n) ]
is the probability of sharing at least k genes by chance. True-positive rate = k/n; sensitivity = k/f
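The hypergeometric tail above, as a direct transcription (plain Python, assuming the sum runs from i = 0 as in the standard enrichment p-value):

```python
from math import comb

def enrichment_p(g, f, n, k):
    # p = 1 - sum_{i=0}^{k-1} C(f, i) * C(g - f, n - i) / C(g, n)
    # g genes on the chip, f in the functional class, n in the cluster,
    # k shared between class and cluster.
    return 1 - sum(comb(f, i) * comb(g - f, n - i) for i in range(k)) / comb(g, n)
```

With k = 0 shared genes the p-value is 1, as expected: any overlap at all is at least that likely by chance.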

30 Who is better? Conclusion: ICA-based clustering is generally better

31 References
Su-In Lee (2002) group talk: "Microarray data analysis using ICA"
Altman et al. (2001) "Whole-genome expression analysis: challenges beyond clustering", Curr. Opin. Struct. Biol. 11, 340
Hyvärinen et al. (1999) "Survey on independent component analysis", Neural Comput. Surv. 2
Alter et al. (2000) "Singular value decomposition for genome-wide expression data processing and modeling", PNAS 97
Harmeling et al., "Kernel feature spaces & nonlinear blind source separation"

