‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.

1 ‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E May 15, 2003

2 Presentation Outline: Biology Background; Reminder of Principal Component Analysis; What is Gene Shaving?; The ‘Gene Shaving’ Algorithm; Applications of Gene Shaving; Conclusions

3 What is “gene expression”? Each cell contains a complete copy of all genes. The difference between a skin cell and a bone cell is determined by which genes are producing proteins, i.e., which genes are being “expressed”. The expression of DNA information occurs in two steps: Transcription: DNA → mRNA; Translation: mRNA → protein. DNA microarrays measure transcription (i.e., the mRNA produced).

4

5 [Figure: microarray experiment workflow. Labels: reference cells sample; test cells sample; transcription; label with dye; hybridize to array.]

6 The Dataset: N × p expression matrix $X = [x_{ij}]$, with N rows (genes) and p columns (patients). Green: under-expressed genes. Red: over-expressed genes.

7 The ratio of the red and green intensities for each spot indicates the relative abundance of the corresponding DNA probe in the two nucleic acid target samples: $X_{ij} = \log_2(R/G)$. Following the colour convention of the previous slide (red: over-expressed, green: under-expressed): $X_{ij} > 0$, gene is over-expressed in the test sample relative to the reference sample; $X_{ij} = 0$, gene is expressed equally; $X_{ij} < 0$, gene is under-expressed in the test sample relative to the reference sample.
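The log-ratio on this slide can be illustrated with a minimal sketch. The function name `log_ratio` and the intensity values are hypothetical, chosen only so that the arithmetic is easy to check by hand.

```python
import numpy as np

def log_ratio(red, green):
    """Return log2(R/G) for arrays of positive spot intensities."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return np.log2(red / green)

# Hypothetical intensities for three spots (made-up numbers).
red = [800.0, 200.0, 100.0]
green = [200.0, 200.0, 400.0]
x = log_ratio(red, green)
# The value is positive when the red channel is brighter, zero when the
# two channels are equal, and negative when the green channel is brighter.
```

Working on the log2 scale makes a four-fold increase and a four-fold decrease symmetric about zero (+2 and -2 here).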

8 Remarks: Knowing the list of human genes does not mean we know what they do. cDNA arrays help study the variation of gene expression across samples (e.g., tissues or patients). A major challenge is interpreting data that consist of the expression levels of, say, 6000 genes in 50 patients. Present goal: create a clustering that organizes genes with coherent behavior across samples.

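9 1st eigengene (principal component of $X^T$). Singular value decomposition of $X^T$, whose columns are the genes $g_1, g_2, \dots, g_N$: $X^T = U \Sigma V^T$, with $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$, so that $X^T V = U \Sigma$. The first eigengene is $\sigma_1 u_1 = X^T v_1$, the linear combination of the columns of $X^T$ (genes) with highest variance.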
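A small numerical check of the eigengene identity can make the notation concrete. This is a sketch on random data; the matrix sizes and the row-centering step are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 8                           # hypothetical: 50 genes, 8 samples
X = rng.normal(size=(N, p))
X = X - X.mean(axis=1, keepdims=True)  # center each gene (row); an assumption

# Thin SVD of X^T (p x N): X^T = U diag(s) V^T
U, s, Vt = np.linalg.svd(X.T, full_matrices=False)

u1 = U[:, 0]          # first left singular vector, length p
v1 = Vt[0, :]         # first right singular vector, length N (gene weights)
eigengene = X.T @ v1  # weighted combination of the gene columns of X^T

# Any other unit-norm weight vector gives a combination with a smaller norm.
w = rng.normal(size=N)
w = w / np.linalg.norm(w)
other = X.T @ w
```

Here `eigengene` coincides with `s[0] * u1`, which is the relation between the singular value, the left singular vector and the gene weights stated on the slide.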

10 Introduction: What is Gene Shaving? A new statistical method that identifies subsets of genes with coherent expression patterns and large variation across different conditions. It differs from hierarchical clustering and other widely used methods for analyzing gene expression in that genes may belong to more than one cluster.

11 The Gene Shaving Algorithm
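The algorithm on this slide was shown as an image and is not captured in the transcript. The sketch below follows the shaving loop as described in the published gene shaving paper by Hastie, Tibshirani and co-authors: repeatedly compute the leading principal component of the remaining genes and discard the genes least aligned with it. The function name `shaving_sequence`, the parameter `shave_frac` and its default of 10% are assumptions made for illustration.

```python
import numpy as np

def shaving_sequence(X, shave_frac=0.1):
    """Return nested clusters (arrays of row indices), largest first."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)      # center each gene (row)
    idx = np.arange(X.shape[0])
    clusters = [idx.copy()]
    while idx.size > 1:
        sub = X[idx]                           # remaining genes, k x p
        # Leading principal component of the rows = first right singular vector.
        _, _, Vt = np.linalg.svd(sub, full_matrices=False)
        pc1 = Vt[0]                            # length p
        score = np.abs(sub @ pc1)              # |inner product| for each gene
        n_drop = max(1, int(np.floor(shave_frac * idx.size)))
        n_drop = min(n_drop, idx.size - 1)     # always keep at least one gene
        keep = np.sort(np.argsort(score)[n_drop:])  # shave the smallest scores
        idx = idx[keep]
        clusters.append(idx.copy())
    return clusters

# Hypothetical demo on random data: 40 genes, 6 samples.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 6))
clusters = shaving_sequence(X_demo)
sizes = [c.size for c in clusters]
```

The result is a nested sequence running from all genes down to a single gene. Choosing one member of the sequence is a separate step, handled by the cluster-size criterion on the following slides.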

12 Estimating the Optimal Cluster Size K: Gene Shaving requires a quality measure for a cluster. To select a good cluster, the method focuses on high coherence between members of the cluster.

13 Estimating the Optimal Cluster Size K (cont.) The method defines two variance measures for a cluster $S_k$. The ‘Between Variance’ is the variance of the mean gene (the vector of column averages over the cluster). The ‘Within Variance’ measures the variability of each gene about that average.

14 Estimating the Optimal Cluster Size K (cont.) A useful measure for choosing cluster size is the percent variance, $R^2 = 100 \cdot V_B / (V_B + V_W)$, where $V_B$ and $V_W$ are the between and within variances. A large $R^2$ implies a tight cluster of coherent genes. Gene Shaving uses this measure for selecting a cluster from the shaving sequence $S_k$.
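The formulas on these slides were images and did not survive in the transcript. The sketch below uses the usual within/between decomposition, which is assumed here to match the slides: the between variance is the variance of the column-average vector, and the within variance is the mean squared deviation of each gene from that vector. The function name `cluster_variances` is hypothetical.

```python
import numpy as np

def cluster_variances(X, rows):
    """Within variance, between variance and percent variance for X[rows]."""
    S = np.asarray(X, dtype=float)[rows]       # k x p block for the cluster
    col_avg = S.mean(axis=0)                   # the "mean gene", length p
    v_between = np.mean((col_avg - col_avg.mean()) ** 2)
    v_within = np.mean((S - col_avg) ** 2)     # averaged over genes and samples
    # Percent variance; undefined if the block is constant (total variance 0).
    r2 = 100.0 * v_between / (v_within + v_between)
    return v_within, v_between, r2

# Two perfectly coherent genes: all of the variance is "between".
X_tight = np.array([[1.0, -1.0, 1.0, -1.0],
                    [1.0, -1.0, 1.0, -1.0]])
tight = cluster_variances(X_tight, [0, 1])

# Two opposed genes: the mean gene is flat, so all of the variance is "within".
X_loose = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])
loose = cluster_variances(X_loose, [0, 1])
```

The two toy clusters sit at the extremes of the measure: identical genes give a percent variance of 100, while genes that cancel each other give 0.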

15 Estimating the Optimal Cluster Size K (cont.) Once a cluster is selected from the sequence, we can proceed to finding the optimal cluster size. Let $D_k$ be the $R^2$ measure for the k-th member of the sequence. We wish to find the “Gap” between $D_k$ and $D_k^{*b}$, the $R^2$ measure for cluster $S_k^{*b}$, where $S_k^{*b}$ is the shaving sequence obtained from a permuted matrix $X^{*b}$.

16 Estimating the Optimal Cluster Size K (cont.) The “Gap” function is defined as $\mathrm{Gap}(k) = D_k - D_k^{*}$, where $D_k^{*}$ is the average of $D_k^{*b}$ over b. The optimal cluster size K is selected such that this “Gap” is the largest: $K = \arg\max_k \mathrm{Gap}(k)$.
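The gap computation can be sketched as follows, assuming the $R^2$ values for the real data and for each permuted matrix have already been computed. The function names and all numbers are hypothetical. Permuting the entries within each row is the permutation scheme assumed here; the slides do not spell it out.

```python
import numpy as np

def permute_rows(X, rng):
    """Return a copy of X with the entries of each row shuffled independently."""
    return np.array([rng.permutation(row) for row in np.asarray(X)])

def gap_curve(D, D_perm):
    """Gap(k) = D_k minus the mean over b of D_k^{*b}; also return the argmax."""
    D = np.asarray(D, dtype=float)
    D_perm = np.asarray(D_perm, dtype=float)   # shape (B, number of clusters)
    gap = D - D_perm.mean(axis=0)
    return gap, int(np.argmax(gap))

# Made-up R^2 values for a shaving sequence with four members.
sizes = [100, 50, 20, 5]
D = [30.0, 55.0, 80.0, 90.0]
D_perm = [[25.0, 40.0, 50.0, 85.0],            # permutation b = 1
          [35.0, 50.0, 60.0, 89.0]]            # permutation b = 2
gap, best = gap_curve(D, D_perm)
chosen_size = sizes[best]

# A permuted matrix keeps each row's values but destroys the column alignment.
rng = np.random.default_rng(0)
X_small = np.arange(12.0).reshape(3, 4)
X_perm = permute_rows(X_small, rng)
```

In this toy example the smallest cluster has the highest raw $R^2$, but permuted data reach almost the same value at that size, so the gap criterion picks the cluster of 20 genes instead.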

17 The Gene Shaving Algorithm (cont.)

18 So far: form clusters $S_k$ with high variance across samples, high correlation among genes within a cluster, and low correlation between genes in different clusters. The procedure seeks clusters $S_k$ by maximizing $v(S_k) = \mathrm{var}$(vector of column averages). Now incorporate supervision: use information, y, about the patients, and seek $S_k$ by maximizing $(1-\alpha)\, v(S_k) + \alpha\, J(v(S_k), y)$.

19 Goal: predicting patient survival. Find genes whose expression correlates with patient survival. Produce groupings of patients which are statistically different in survival. Use additional information about the patients, $y = (y_1, \dots, y_p)$, and combine unsupervised and supervised criteria into the objective function $(1-\alpha)\, v(S_k) + \alpha\, J(v(S_k), y)$, with $0 \le \alpha \le 1$.

20 Maximize $(1-\alpha)\, v(S_k) + \alpha\, J(v(S_k), y)$. The information measure $J(v(S_k), y)$ is a quadratic function that depends on the type of patient information, y. $y = (y_1, \dots, y_p)$ may identify categories of patients. Used here: y = (p patient survival times), and $J(v(S_k), y) = g g^T$, where g is the score vector of the Cox model for predicting survival.
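The blended objective can be sketched with a pluggable information measure. Everything below is illustrative: the function names are hypothetical, the measure is passed the column-average vector (an assumption about the slide's notation), and `squared_covariance` is a simple stand-in, not the Cox score statistic used in the presentation.

```python
import numpy as np

def squared_covariance(col_avg, y):
    """Stand-in information measure (NOT the Cox score): squared covariance."""
    col_avg = np.asarray(col_avg, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((col_avg - col_avg.mean()) * (y - y.mean())) ** 2

def supervised_objective(S, y, alpha, info=squared_covariance):
    """(1 - alpha) * variance of the column averages + alpha * info(col_avg, y)."""
    col_avg = np.asarray(S, dtype=float).mean(axis=0)
    v = np.var(col_avg)                        # unsupervised term v(S_k)
    return (1.0 - alpha) * v + alpha * info(col_avg, y)

# Hypothetical cluster of two genes over four patients, with a binary label y.
S = np.array([[1.0, -1.0, 1.0, -1.0],
              [1.0, -1.0, 1.0, -1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

unsupervised = supervised_objective(S, y, alpha=0.0)   # pure variance term
blended = supervised_objective(S, y, alpha=0.1)
supervised = supervised_objective(S, y, alpha=1.0)     # pure information term
```

Setting `alpha` to 0 recovers the unsupervised criterion and setting it to 1 uses the patient information alone; values in between trade the two off.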

21 They chose $\alpha = 0.1$ as it “seemed to give a good mix of high gene correlation and low p-value for the Cox model”.

22 This produced a cluster of 234 genes. It includes “strong” genes for predicting survival (130 of the 200 strongest) as well as some “weak” genes (e.g., #1332).

23 (a) Gap curve for supervised shaving. (b) Survival curves in the two groups defined by the low or high expression of the 234 genes. Group 1 has high expression of positive genes and low expression of negative genes; Group 2 has low expression of positive genes and high expression of negative genes. Negative genes are those preceded by a minus sign in Table 2.

24 Conclusions: The proposed gene shaving methods search for clusters of genes showing both high variation across the samples and high correlation across the genes. This method is a potentially useful tool for exploring gene expression data and identifying interesting clusters of genes worth further investigation.

