
The Problem of Detecting Differentially Expressed Genes.

Presentation on theme: "The Problem of Detecting Differentially Expressed Genes."— Presentation transcript:

1 The Problem of Detecting Differentially Expressed Genes

2 [Figure: gene expression data matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M]

3 [Figure: the same expression matrix (Gene 1, ..., Gene N by Sample 1, ..., Sample M), with the samples divided into Class 1 and Class 2]

4 Fold Change is the Simplest Method Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen. Fold change is not a statistical test.
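
A minimal sketch of the fold-change rule described on this slide, assuming strictly positive expression values and a per-class arithmetic mean. The function names and the toy data are hypothetical; the slide does not say how the ratio is averaged over samples.

```python
import math


def log2_fold_changes(class1, class2):
    """Return log2(mean of class 2 / mean of class 1) for each gene."""
    ratios = []
    for x1, x2 in zip(class1, class2):
        mean1 = sum(x1) / len(x1)
        mean2 = sum(x2) / len(x2)
        ratios.append(math.log2(mean2 / mean1))
    return ratios


def flag_by_fold_change(log_ratios, fold=2.0):
    """Flag genes whose absolute log ratio exceeds log2(fold)."""
    cutoff = math.log2(fold)
    return [abs(r) > cutoff for r in log_ratios]


# Hypothetical data: 3 genes, 3 samples per class.
class1 = [[100.0, 110.0, 90.0], [50.0, 55.0, 45.0], [200.0, 210.0, 190.0]]
class2 = [[400.0, 420.0, 380.0], [52.0, 50.0, 48.0], [50.0, 45.0, 55.0]]

log_ratios = log2_fold_changes(class1, class2)
flags = flag_by_fold_change(log_ratios)  # two-fold cutoff
```

The cutoff ignores within-class variability, which is why the slide notes that fold change is not a statistical test.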

5 Test of a Single Hypothesis
(1) For gene j, consider the null hypothesis of no association between its expression level and its class membership.
(2) Decide on the level of significance (commonly 5%).
(3) Perform a test (e.g. Student’s t-test) for each gene.
(4) Obtain the P-value corresponding to that test statistic.
(5) Compare the P-value with the significance level. Then either reject or retain the null hypothesis.

6 Two-Sample t-Statistic Student’s t-statistic:

7 Two-Sample t-Statistic Pooled form of the Student’s t-statistic, assumed common variance in the two classes:

8 Two-Sample t-Statistic Modified t-statistic of Tusher et al. (2001):
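
The formulas on slides 6 to 8 did not survive as text. The sketch below uses the standard textbook forms as an assumption: the unpooled statistic divides the difference in class means by sqrt(s1^2/n1 + s2^2/n2), the pooled form uses a common variance estimate, and the modified statistic of Tusher et al. (2001) adds a positive constant (here called `a0`, a hypothetical name) to the pooled standard error. The sign convention (class 1 minus class 2) is also an assumption.

```python
import math
from statistics import mean, variance


def unpooled_t(x1, x2):
    """Two-sample t-statistic with separate class variances."""
    n1, n2 = len(x1), len(x2)
    se = math.sqrt(variance(x1) / n1 + variance(x2) / n2)
    return (mean(x1) - mean(x2)) / se


def pooled_t(x1, x2, a0=0.0):
    """Pooled t-statistic; a0 > 0 gives the modified (SAM-style) form."""
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(s2 * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / (se + a0)


# Hypothetical expression values for one gene.
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 5.0, 6.0]

t_unpooled = unpooled_t(x1, x2)
t_pooled = pooled_t(x1, x2)
t_modified = pooled_t(x1, x2, a0=1.0)
```

With equal class sizes and equal sample variances the unpooled and pooled forms coincide, and a positive `a0` shrinks the statistic toward zero, which damps genes whose variance estimate happens to be very small.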

9 Types of Errors in Hypothesis Testing
TRUE \ PREDICTED | Retain Null | Reject Null
Null | - | False positive
Non-null | False negative | -

10 Multiplicity Problem: When many hypotheses are tested, the probability of a false positive increases sharply with the number of hypotheses. Further: genes are co-regulated, so the test statistics are correlated.

11 Example: Suppose we measure the expression of 10,000 genes in a microarray experiment. If all 10,000 genes were not differentially expressed, then we would expect:
P = 0.05 for each test: 500 false positives.
P = 0.05/10,000 for each test: 0.05 false positives.
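
The arithmetic behind this example can be checked directly. The sketch below also computes the chance of at least one false positive, which assumes independent tests; that assumption is not made on the slide and is added here only for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_genes = 10_000
per_test = expected_false_positives(n_genes, 0.05)               # about 500
bonferroni = expected_false_positives(n_genes, 0.05 / n_genes)   # about 0.05
fwer_unadjusted = prob_at_least_one_false_positive(n_genes, 0.05)
fwer_bonferroni = prob_at_least_one_false_positive(n_genes, 0.05 / n_genes)
```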

12 Controlling the Error Rate: Methods for controlling false positives, e.g. Bonferroni, are too strict for microarray analyses. Use the False Discovery Rate (FDR) instead (Benjamini and Hochberg, 1995).

13 Methods for dealing with the Multiplicity Problem
The Bonferroni Method controls the family-wise error rate (FWER), i.e. the probability that at least one false positive error will be made.
- Too strict for gene expression data: it tries to make it unlikely that even one false rejection of the null is made, which may lead to missed findings.
The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.
- Good for gene expression data: it says something about the chosen genes.

14 False Discovery Rate (Benjamini and Hochberg, 1995): The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.

15 Possible Outcomes for N Hypothesis Tests

16 where

17 Positive FDR

18 Lindsay, Kettenring, and Siegmund (2004). A Report on the Future of Statistics. Statist. Sci. 19.

19

20 Controlling FDR: Benjamini and Hochberg (1995)
Key papers on controlling the FDR: Genovese and Wasserman (2002); Storey (2002, 2003); Storey and Tibshirani (2003a, 2003b); Storey, Taylor and Siegmund (2004); Black (2004); Cox and Wong (2004)

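21 Benjamini-Hochberg (BH) Procedure
Controls the FDR at level $\alpha$ when the P-values following the null distribution are independent and uniformly distributed.
(1) Let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(N)}$ be the ordered observed P-values.
(2) Calculate $\hat{k} = \max\{k : p_{(k)} \le k\alpha/N\}$.
(3) If $\hat{k}$ exists, then reject the null hypotheses corresponding to $p_{(1)} \le \cdots \le p_{(\hat{k})}$. Otherwise, reject nothing.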

22 Example: Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are carried out, leading to the following ordered P-values: 0.00017 0.00448 0.00671 0.00907 0.01220 0.33626 0.39341 0.53882 0.58125 0.98617
(a) With $\alpha = 0.05$, the Bonferroni test rejects any hypothesis whose P-value is less than $\alpha/10 = 0.005$. Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that $p_{(k)} \le k\alpha/N$. Here k = 5, thus we reject the first five hypotheses.
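
The worked example can be reproduced with a short sketch. The function names are hypothetical; the BH step uses the usual "less than or equal" comparison, which gives the same answer here because no P-value sits exactly on a threshold.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject hypotheses whose P-value is below alpha / N."""
    n = len(pvalues)
    return [p < alpha / n for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Reject the k smallest P-values, k = max{k : p_(k) <= k * alpha / N}."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_hat = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / n:
            k_hat = rank
    reject = [False] * n
    for i in order[:k_hat]:
        reject[i] = True
    return reject


# The ten ordered P-values from the slide.
pvalues = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
           0.33626, 0.39341, 0.53882, 0.58125, 0.98617]

n_bonferroni = sum(bonferroni_reject(pvalues))     # 2 rejections
n_bh = sum(benjamini_hochberg_reject(pvalues))     # 5 rejections
```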

23 q-VALUE: The q-value of gene j is the expected proportion of false positives when calling that gene significant. The P-value is the probability under the null hypothesis of obtaining a value of the test statistic as or more extreme than its observed value. The q-value for an observed test statistic can be viewed as the expected proportion of false positives among all genes with test statistics as or more extreme than the observed value.

24 LIST OF SIGNIFICANT GENES: Call all genes significant if $p_j < 0.05$, or call all genes significant if $q_j < 0.05$, to produce a set of significant genes such that a proportion of them (< 0.05) is expected to be false (at least for a large number of genes, not necessarily independent).

25 BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001): BRCA1 (7) versus BRCA2-mutation (8) positive tumours, N = 3,226 genes. P = 0.001 gave 51 genes differentially expressed; P = 0.0001 gave 9-11 genes.

26 Using q < 0.05, 160 genes are taken to be significant. This means that approximately 8 of these 160 genes are expected to be false positives. Also, it is estimated that 33% of the genes are differentially expressed.

27

28

29 Null Distribution of the Test Statistic: Permutation Method. The null distribution has a resolution on the order of the number of permutations. If we perform B permutations, then the P-value will be estimated with a resolution of 1/B. If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.

30 Using just the B permutations of the class labels for the gene-specific statistic $W_j$, the P-value for $W_j = w_j$ is assessed as: $p_j = \#\{b : |w_{0j}^{(b)}| \ge |w_j|\} / B$, where $w_{0j}^{(b)}$ is the null version of $w_j$ after the bth permutation of the class labels.

31 If we pool over all N genes, then: $p_j = \sum_{i=1}^{N} \#\{b : |w_{0i}^{(b)}| \ge |w_j|\} / (NB)$.
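
A sketch of the gene-specific and pooled permutation P-values, assuming the pooled two-sample t-statistic as the gene statistic and a two-sided comparison of absolute values. The data, the seed and all function names are hypothetical; random relabellings stand in for a full enumeration of permutations.

```python
import math
import random
from statistics import mean, variance


def pooled_t(x1, x2):
    """Pooled two-sample t-statistic (common variance assumed)."""
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / math.sqrt(s2 * (1 / n1 + 1 / n2))


def null_statistics(x1, x2, n_perm, rng):
    """Recompute the statistic after each of n_perm random relabellings."""
    values = list(x1) + list(x2)
    n1 = len(x1)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(values)
        stats.append(pooled_t(values[:n1], values[n1:]))
    return stats


def permutation_p_value(observed, null_stats):
    """Proportion of null statistics at least as extreme as the observed one."""
    hits = sum(1 for w0 in null_stats if abs(w0) >= abs(observed))
    return hits / len(null_stats)


# Hypothetical data: gene 0 looks null, gene 1 is clearly separated.
genes = [
    ([0.1, -0.4, 0.3, 0.9, -1.2, 0.5, -0.2],
     [0.2, -0.6, 1.1, -0.3, 0.4, -0.9, 0.7, 0.0]),
    ([0.1, 0.5, 0.3, 0.2, 0.4, 0.6, 0.0],
     [5.1, 5.3, 4.8, 5.6, 5.0, 4.9, 5.2, 5.4]),
]
B = 200
rng = random.Random(0)
observed = [pooled_t(x1, x2) for x1, x2 in genes]
nulls = [null_statistics(x1, x2, B, rng) for x1, x2 in genes]

# Gene-specific P-values: resolution 1/B.
p_gene = [permutation_p_value(w, null) for w, null in zip(observed, nulls)]

# Pooled P-values: all N*B null statistics are shared, resolution 1/(N*B).
pooled_null = [w0 for null in nulls for w0 in null]
p_pooled = [permutation_p_value(w, pooled_null) for w in observed]
```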

32 Null Distribution of the Test Statistic: Example
Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.
       | Class 1 | Class 2
Gene 1 | $A_1^{(1)}\ A_2^{(1)}\ A_3^{(1)}$ | $B_4^{(1)}\ B_5^{(1)}\ B_6^{(1)}$
Gene 2 | $A_1^{(2)}\ A_2^{(2)}\ A_3^{(2)}$ | $B_4^{(2)}\ B_5^{(2)}\ B_6^{(2)}$

33        | Class 1 | Class 2
Gene 1 | $A_1^{(1)}\ A_2^{(1)}\ A_3^{(1)}$ | $B_4^{(1)}\ B_5^{(1)}\ B_6^{(1)}$
Gene 2 | $A_1^{(2)}\ A_2^{(2)}\ A_3^{(2)}$ | $B_4^{(2)}\ B_5^{(2)}\ B_6^{(2)}$
To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1), so that:
Gene 1 | $A_1^{(1)}\ A_2^{(1)}\ A_3^{(1)}$ | $A_4^{(1)}\ A_5^{(1)}\ A_6^{(1)}$
and permute the class labels:
Perm. 1 | $A_1^{(1)}\ A_2^{(1)}\ A_4^{(1)}$ | $A_3^{(1)}\ A_5^{(1)}\ A_6^{(1)}$
...
There are 10 distinct permutations.

34 Ten Permutations of Gene 1
$A_1^{(1)}\ A_2^{(1)}\ A_3^{(1)} \mid A_4^{(1)}\ A_5^{(1)}\ A_6^{(1)}$
$A_1^{(1)}\ A_2^{(1)}\ A_4^{(1)} \mid A_3^{(1)}\ A_5^{(1)}\ A_6^{(1)}$
$A_1^{(1)}\ A_2^{(1)}\ A_5^{(1)} \mid A_3^{(1)}\ A_4^{(1)}\ A_6^{(1)}$
$A_1^{(1)}\ A_2^{(1)}\ A_6^{(1)} \mid A_3^{(1)}\ A_4^{(1)}\ A_5^{(1)}$
$A_1^{(1)}\ A_3^{(1)}\ A_4^{(1)} \mid A_2^{(1)}\ A_5^{(1)}\ A_6^{(1)}$
$A_1^{(1)}\ A_3^{(1)}\ A_5^{(1)} \mid A_2^{(1)}\ A_4^{(1)}\ A_6^{(1)}$
$A_1^{(1)}\ A_3^{(1)}\ A_6^{(1)} \mid A_2^{(1)}\ A_4^{(1)}\ A_5^{(1)}$
$A_1^{(1)}\ A_4^{(1)}\ A_5^{(1)} \mid A_2^{(1)}\ A_3^{(1)}\ A_6^{(1)}$
$A_1^{(1)}\ A_4^{(1)}\ A_6^{(1)} \mid A_2^{(1)}\ A_3^{(1)}\ A_5^{(1)}$
$A_1^{(1)}\ A_5^{(1)}\ A_6^{(1)} \mid A_2^{(1)}\ A_3^{(1)}\ A_4^{(1)}$
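
The count of ten can be reproduced by enumeration. The sketch assumes equal class sizes and a two-sided statistic, so that swapping the two groups gives nothing new; keeping sample 1 in the first group removes those mirror images. Sample indices 1 to 6 and the function name are illustrative.

```python
from itertools import combinations


def splits_with_first_sample_fixed(n_per_class):
    """Distinct two-group splits of 2 * n_per_class samples.

    Sample 1 is always placed in the first group, so each split and its
    mirror image (groups swapped) is counted once.
    """
    n = 2 * n_per_class
    result = []
    for rest in combinations(range(2, n + 1), n_per_class - 1):
        group1 = (1,) + rest
        group2 = tuple(s for s in range(1, n + 1) if s not in group1)
        result.append((group1, group2))
    return result


splits = splits_with_first_sample_fixed(3)
for group1, group2 in splits:
    print(group1, "|", group2)
```

There are C(6,3) = 20 ways to choose three samples for the first class, and halving for the mirror images leaves the 10 splits listed on the slide, in the same order.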

35 As there are only 10 distinct permutations here, the null distribution based on these permutations is too granular. Hence consideration is given to permuting the labels of each of the other genes and estimating the null distribution of a gene based on the pooled permutations so obtained. But there is a problem with this method: the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.

36 Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1. Suppose that Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2 will have a mixed distribution.

37        | Class 1 | Class 2
Gene 1 | $A_1^{(1)}\ A_2^{(1)}\ A_3^{(1)}$ | $B_4^{(1)}\ B_5^{(1)}\ B_6^{(1)}$
Gene 2 | $A_1^{(2)}\ A_2^{(2)}\ A_3^{(2)}$ | $B_4^{(2)}\ B_5^{(2)}\ B_6^{(2)}$
Permute the class labels:
Perm. 1 | $A_1^{(2)}\ A_2^{(2)}\ B_4^{(2)}$ | $A_3^{(2)}\ B_5^{(2)}\ B_6^{(2)}$
...

38 Example of a null case: with 7 N(0,1) points and 8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with $t_{13}$ density superimposed.

39 Example of a non-null case: with 7 N(0,1) points and 8 N(10,9) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with $t_{13}$ density superimposed.

40 The SAM Method: Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001). The order statistics $t_{(1)}, \ldots, t_{(N)}$ are plotted against their null expectations. A good test in situations where more genes are over-expressed than under-expressed, or vice versa.

41 The FDR and other error rates
TRUE \ PREDICTED | Retain Null | Reject Null
Null | a | b (false positive)
Non-null | c (false negative) | d
$\mathrm{FDR} \approx b/(b+d)$, $\mathrm{FNDR} \approx c/(a+c)$

42 The FDR and other error rates
TRUE \ PREDICTED | Retain Null | Reject Null
Null | a | b (false positive)
Non-null | c (false negative) | d
$\mathrm{FDR} \approx b/(b+d)$, $\mathrm{FNR} = c/(c+d)$

43 Toy Example with 10,000 Genes
            | Test result: non-DE | Test result: DE | Total
True non-DE | A = 9025 | B = 475 | 9500
True DE     | C = 100 | D = 400 | 500
Total       | 9125 | 875 | 10000
FDR = B / (B + D) = 475 / 875 = 54%
FNDR = C / (A + C) = 100 / 9125 = 1%
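
The toy example's rates follow from the four cell counts. The sketch below also reports the false negative rate and the false positive rate; the FPR definition b / (a + b) is the usual one and is an assumption here, since the slide does not state it.

```python
def error_rates(a, b, c, d):
    """Error rates from a 2x2 table of true status by test result.

    a: true non-DE, test non-DE    b: true non-DE, test DE (false positive)
    c: true DE, test non-DE (false negative)    d: true DE, test DE
    """
    return {
        "FDR": b / (b + d),    # false positives among genes called DE
        "FNDR": c / (a + c),   # false negatives among genes called non-DE
        "FNR": c / (c + d),    # false negatives among truly DE genes
        "FPR": b / (a + b),    # false positives among truly non-DE genes
    }


rates = error_rates(a=9025, b=475, c=100, d=400)
for name, value in rates.items():
    print(f"{name} = {value:.3f}")
```

Even with a per-test false positive rate of 5%, more than half of the 875 selected genes are false discoveries, because null genes heavily outnumber DE genes.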

44 Two-component mixture model: $f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$, where $\pi_0$ is the proportion of genes that are not differentially expressed, and $\pi_1 = 1 - \pi_0$.

45 Use of the P-Value as a Summary Statistic (Allison et al., 2002): Instead of using the pooled form of the t-statistic, we can work with the value $p_j$, which is the P-value associated with $t_j$ in the test of the null hypothesis of no difference in expression between the two classes. The distribution of the P-value is modelled by the h-component mixture model $f(p_j) = \sum_{i=1}^{h} \pi_i \, \beta(p_j; \alpha_{i1}, \alpha_{i2})$, where $\alpha_{11} = \alpha_{12} = 1$.

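46 Use of the P-Value as a Summary Statistic: Under the null hypothesis of no difference in expression for the jth gene, $p_j$ will have a uniform distribution on the unit interval; i.e. the $\beta(1, 1)$ distribution. The $\beta(\alpha_1, \alpha_2)$ density is given by $\beta(p; \alpha_1, \alpha_2) = p^{\alpha_1 - 1}(1 - p)^{\alpha_2 - 1} / B(\alpha_1, \alpha_2)$, where $B(\alpha_1, \alpha_2) = \Gamma(\alpha_1)\Gamma(\alpha_2)/\Gamma(\alpha_1 + \alpha_2)$.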

47

48 Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96, 1151-1160.
Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104.
Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.
Efron B (2005) Local False Discovery Rates.
Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.

49 McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.
McLachlan GJ, Bean RW, Ben-Tovim Jones L. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.

50 Two-component mixture model: $\pi_0$ is the proportion of genes that are not differentially expressed. The two-component mixture model is: $f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$. Using Bayes’ Theorem, we calculate the posterior probability that gene j is not differentially expressed: $\tau_0(w_j) = \pi_0 f_0(w_j) / f(w_j)$.

51 If we conclude that gene j is differentially expressed if $\tau_0(w_j) \le c_0$, then this decision minimizes the (estimated) Bayes risk

52 where $e_{01}$ is the probability of a false positive and $e_{10}$ is the probability of a false negative.

53 Estimated FDR where

54 Similarly, the false positive rate is given by and the false non-discovery rate and false negative rate by:

55 Glonek and Solomon (2003)
$F_0$: N(0,1), $\pi_0 = 0.9$; $F_1$: N(1,1), $\pi_1 = 0.1$. Reject $H_0$ if $z \ge 2$: $\tau_0(2) = 0.99972$ but FDR = 0.17.
$F_0$: N(0,1), $\pi_0 = 0.6$; $F_1$: N(1,1), $\pi_1 = 0.4$. Reject $H_0$ if $z \ge 2$: $\tau_0(2) = 0.251$ but FDR = 0.177.
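
The contrast between the local quantity $\tau_0(z)$ and the tail-area FDR can be checked numerically for the setting with $\pi_0 = 0.6$. The sketch assumes that $\tau_0$ is the posterior probability of the null evaluated at z, and that the FDR of the rule "reject if z >= 2" is the null share of the upper tail mass; the function names are hypothetical.

```python
from statistics import NormalDist


def tau0(z, pi0, null, alt):
    """Posterior probability of the null at the point z (local FDR)."""
    f0, f1 = null.pdf(z), alt.pdf(z)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)


def tail_fdr(z, pi0, null, alt):
    """FDR of the rule 'reject when the statistic is >= z'."""
    t0 = 1.0 - null.cdf(z)
    t1 = 1.0 - alt.cdf(z)
    return pi0 * t0 / (pi0 * t0 + (1.0 - pi0) * t1)


null = NormalDist(0.0, 1.0)   # F0: N(0,1)
alt = NormalDist(1.0, 1.0)    # F1: N(1,1)

local = tau0(2.0, 0.6, null, alt)        # about 0.251
overall = tail_fdr(2.0, 0.6, null, alt)  # about 0.177
```

The local value at the cutoff is larger than the tail-area rate because the tail average also includes genes with z well beyond 2, for which the null is less plausible.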

56

57

58 Gene Statistics: Two-Sample t-Statistic Student’s t-statistic Pooled form of the Student’s t-statistic, assumed common variance in the two classes Modified t-statistic of Tusher et al. (2001)

59 The Procedure
1. Obtain the z-score for each of the genes.
2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, $p_j$).
3. The posterior probability of non-differential expression of gene j is given by $\tau_0(z_j)$.
4. Conclude gene j to be differentially expressed if $\tau_0(z_j) < c_0$.
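
Step 4 and the resulting error estimates can be sketched as follows, assuming the posterior probabilities $\tau_0(z_j)$ have already been obtained from a fitted mixture. The estimated FDR is taken as the average of $\tau_0$ over the selected genes, and the estimated false non-discovery rate as one minus the average of $\tau_0$ over the unselected genes; the toy values and function names are hypothetical.

```python
def select_genes(tau0_values, c0):
    """Indices of genes declared differentially expressed (tau0 < c0)."""
    return [j for j, t in enumerate(tau0_values) if t < c0]


def estimated_fdr(tau0_values, c0):
    """Mean posterior null probability among the selected genes."""
    selected = select_genes(tau0_values, c0)
    if not selected:
        return 0.0
    return sum(tau0_values[j] for j in selected) / len(selected)


def estimated_fndr(tau0_values, c0):
    """Mean posterior non-null probability among the unselected genes."""
    chosen = set(select_genes(tau0_values, c0))
    rest = [t for j, t in enumerate(tau0_values) if j not in chosen]
    if not rest:
        return 0.0
    return 1.0 - sum(rest) / len(rest)


# Hypothetical posterior probabilities of non-differential expression.
tau0_values = [0.002, 0.05, 0.08, 0.12, 0.18, 0.90, 0.95]
selected = select_genes(tau0_values, c0=0.1)
fdr_hat = estimated_fdr(tau0_values, c0=0.1)
fndr_hat = estimated_fndr(tau0_values, c0=0.1)
```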

60 $\tau_0(z_j) \le c$. If $\pi_0/\pi_1 \le 9$, then, e.g. with $c = 0.2$,

61 Suppose $\pi_0/\pi_1 \ge 0.9$; then $\tau_0(z_j) \le c$, e.g. $\tau_0(z_j) \le 0.2$, if: Much stronger level of evidence against the null than in standard one-at-a-time testing.

62 e.g. $Z_j \sim N(\mu, 1)$, $H_0: \mu = 0$ vs $H_1: \mu = 2.8$. Rejecting $H_0$ if $|Z_j| \ge 1.96$ yields a two-sided test of size 0.05 and power 0.80. $f_1(1.96)/f_0(1.96) = 4.8$.
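
The size, power and likelihood ratio quoted here can be verified numerically. The last function is an added illustration rather than something stated on the slides: from $\tau_0(z) = \pi_0 f_0(z) / (\pi_0 f_0(z) + \pi_1 f_1(z))$ it follows that $\tau_0(z) \le c$ exactly when $f_1(z)/f_0(z) \ge (\pi_0/\pi_1)(1 - c)/c$, and the prior odds of 9 used below are an assumed value.

```python
from statistics import NormalDist

null = NormalDist(0.0, 1.0)   # H0: mu = 0
alt = NormalDist(2.8, 1.0)    # H1: mu = 2.8
cut = 1.96

# Two-sided test that rejects when |Z| >= 1.96.
size = 2.0 * (1.0 - null.cdf(cut))
power = (1.0 - alt.cdf(cut)) + alt.cdf(-cut)

# Evidence against the null at the boundary of the rejection region.
likelihood_ratio = alt.pdf(cut) / null.pdf(cut)


def required_likelihood_ratio(c, prior_odds):
    """Smallest f1/f0 for which tau0 <= c, given prior odds pi0/pi1."""
    return prior_odds * (1.0 - c) / c


needed = required_likelihood_ratio(c=0.2, prior_odds=9.0)
```

Under these assumptions a posterior threshold of 0.2 asks for a likelihood ratio of 36, compared with about 4.8 at the boundary of the classical 5% test.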

63 [Figure: four panels of z-scores: null case, +1, +2, +3]

64 Use of Wilson-Hilferty Transformation as in Broet et al. (2004) [Figure: plot of $z_j$ against j]

65 Plot of approximate z-scores obtained by Wilson-Hilferty transformation of simulated values of F-statistic with 1 and 8 degrees of freedom.

66 EMMIX-FDR A program has been written in R which interfaces with EMMIX to implement the algorithm described in McLachlan et al (2006). We fit a mixture of two normal components to test statistics calculated from the gene expression data.

67 When we equate the sample mean and variance of the mixture to their population counterparts, we obtain:

68 When we are working with the theoretical null, we can easily estimate the mean and variance of the non-null component with the following formulae.
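
The formulae on this slide were lost in extraction. The sketch below is a reconstruction from the standard moments of a two-component mixture, offered as an assumption rather than the authors' exact expressions. With mixture mean $\pi_0\mu_0 + \pi_1\mu_1$ and variance $\pi_0\sigma_0^2 + \pi_1\sigma_1^2 + \pi_0\pi_1(\mu_0 - \mu_1)^2$, setting $\mu_0 = 0$ and $\sigma_0^2 = 1$ gives $\mu_1 = \bar{z}/\pi_1$ and $\sigma_1^2 = (s^2 - \pi_0 - \pi_0\pi_1\mu_1^2)/\pi_1$, where $\bar{z}$ and $s^2$ are the sample mean and variance of the z-scores.

```python
def nonnull_moments(sample_mean, sample_var, pi0):
    """Method-of-moments mean and variance of the non-null component.

    Assumes a theoretical N(0, 1) null and a known (or pre-estimated) pi0.
    """
    pi1 = 1.0 - pi0
    mu1 = sample_mean / pi1
    var1 = (sample_var - pi0 - pi0 * pi1 * mu1 ** 2) / pi1
    return mu1, var1


# Hypothetical check: a mixture with pi0 = 0.7, mu1 = 2, sigma1^2 = 1.5
# has mean 0.3 * 2 = 0.6 and variance 0.7 + 0.3 * 1.5 + 0.21 * 4 = 1.99.
mu1, var1 = nonnull_moments(sample_mean=0.6, sample_var=1.99, pi0=0.7)
```

In practice the variance estimate can come out negative when pi0 is poorly estimated, so a real implementation would need a guard that this sketch omits.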

69 Following the approach of Storey and Tibshirani (2003) we can obtain an initial estimate for π 0 as follows: This estimate of π 0 is used as input to EMMIX as part of the initial parameter estimates. Thus no random or k-means starts are required, as is usually the case. There are two different versions of EMMIX used, the standard version for the empirical null and a modified version for the theoretical null which fixes the mean and variance of the null component to be 0 and 1 respectively.
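
The formula for the initial estimate is missing from the transcript. A common fixed-threshold form associated with Storey's approach is sketched below as an assumption: P-values above a cutoff `lam` are treated as coming mostly from null genes, which are uniform on (0, 1), so $\hat{\pi}_0 = \#\{p_j > \lambda\} / \{(1 - \lambda) N\}$. The choice `lam = 0.5` and the function name are hypothetical.

```python
def initial_pi0(pvalues, lam=0.5):
    """Fixed-threshold estimate of the proportion of null genes."""
    n_above = sum(1 for p in pvalues if p > lam)
    estimate = n_above / ((1.0 - lam) * len(pvalues))
    return min(1.0, estimate)  # a proportion cannot exceed 1


# Hypothetical input: the ten P-values from the Bonferroni/BH example.
pvalues = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
           0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
pi0_start = initial_pi0(pvalues)  # 3 of 10 exceed 0.5, giving 0.6
```

A value obtained this way would serve only as a starting point for the mixture fit, in line with the slide's remark that it replaces random or k-means starts.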

70 Theoretical and empirical nulls Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance.

71 Examples: We examined the performance of EMMIX-FDR on three well-known data sets from the literature.
Alon colon cancer data (1999)
Hedenfalk breast cancer data (2001)
van’t Wout HIV data (2003)

72 Hedenfalk Breast Cancer Data Hedenfalk et al. (2001) used cDNA arrays to obtain gene expression profiles of tumours from carriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers), as well as sporadic breast cancer. We consider their data set of M = 15 patients, comprising two patient groups: BRCA1 (7) versus BRCA2 - mutation positive (8), with N = 3,226 genes. The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients. Hedenfalk et al. (2001) NEJM, 344, 539-547

73 Two-component model for the Breast Cancer Data: Fit to the N values of $w_j$ (based on the pooled two-sample t-statistic). The jth gene is taken to be differentially expressed if:

74 Fitting two-component mixture model to Hedenfalk data [Figure: fitted null and non-null (DE genes) components]

75 Estimates of π 0 for Hedenfalk data 0.52 (Broet, 2004) 0.64 (Gottardo, 2006) 0.61 (Ploner et al, 2006) 0.47 (Storey, 2002) Using a theoretical null, we estimated π 0 to be 0.65.

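76 We used $p_j = 2F_{13}(-|w_j|)$, where $F_{13}$ denotes the distribution function of the t distribution with 13 degrees of freedom. Similar results were obtained when the $p_j$ were obtained by permutation methods.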

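77 Ranking and Selecting the Genes
[Table: genes ranked as Gene 1, ..., Gene R, ..., Gene R+R_1, ..., Gene N, with columns Gene j, $P_j$, $z_j$, $\tau_0(z_j)$ (local FDR). Values of $\tau_0(z_j)$ shown: 0.002, ..., 0.1, 0.12, 0.18, ..., 0.20; threshold $c_0 = 0.1$.]
FDR = Sum/R = 0.06
Proportion of False Negatives = 1 - Sum_1 / R_1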

78 Estimated FDR for various levels of $c_0$
$c_0$ | R | FDR
0.5 | 906 | 0.26
0.4 | 694 | 0.21
0.3 | 518 | 0.16
0.2 | 326 | 0.11
0.1 | 143 | 0.06

79 Estimated FDR and other error rates for various levels of threshold $c_0$ applied to the posterior probability of nondifferential expression for the breast cancer data ($N_r$ = number of selected genes)
$c_0$ | $N_r$ | FDR | FNDR | FNR | FPR
0.1 | 143 | 0.06 | 0.32 | 0.88 | 0.004
0.2 | 338 | 0.11 | 0.28 | 0.73 | 0.02
0.3 | 539 | 0.16 | 0.25 | 0.60 | 0.04
0.4 | 742 | 0.21 | 0.22 | 0.48 | 0.08
0.5 | 971 | 0.27 | 0.18 | 0.37 | 0.12

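80 Comparison of identified DE genes
[Venn diagram of the gene lists from our method (143), Hedenfalk (175), and Storey and Tibshirani (160). Region counts as extracted: 101, 6, 12, 8, 39, 29, 24. The only assignment consistent with the three totals is: all three lists 101; our method and Hedenfalk only 6; our method and Storey and Tibshirani only 12; Hedenfalk and Storey and Tibshirani only 39; our method only 24; Hedenfalk only 29; Storey and Tibshirani only 8.]
Storey and Tibshirani (2003) PNAS, 100, 9440-9445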

81 Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2
Gene | GO Term
UBE2B, DDB2 (UBE2V1) | DNA repair (cell cycle)
RAB9, RHOC | small GTPase signal transduction
ITGB5, ITGA3 | integrin mediated signalling pathway
PRKCBP1 | regulation of transcription
HDAC3, MIF | negative regulation of apoptosis
KIF5B, spindle body pole protein | cytoskeleton organisation
CTCL1 | vesicle mediated transport
TNAFIP1 | cation transport
HARS, HSD17B7 | metabolism

82 The Procedure
1. Obtain the z-score for each of the genes.
2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, $p_j$).
3. The posterior probability of non-differential expression of gene j is given by $\tau_0(z_j)$.
4. Conclude gene j to be differentially expressed if $\tau_0(z_j) < c_0$.

83 Breast cancer data: plot of fitted two-component normal mixture model with theoretical N(0,1) null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

84 Hedenfalk breast cancer data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

85 Histogram of N = 3,226 z-values from the Breast Cancer Study. The theoretical N(0,1) null is much narrower than the central peak, which has $(\delta_0, \sigma_0) = (-0.02, 1.58)$. In this case the central peak seems to include the entire histogram.

86 Alon Colon Cancer Data Affymetrix arrays, samples from colon tissues from 62 patients Dataset of M = 62 patients: 40 tumor vs 22 normal, with N = 2,000 genes. Alon et al. (1999) PNAS, 96, 6745-6750

87 Alon data theoretical null We estimated π 0 to be 0.39 using the theoretical null. With the empirical null, it is 0.53. Six smooth muscle genes are included in the 2,000 genes and each has an estimated posterior probability of non-DE less than 0.015.

88 Colon cancer data: plot of fitted two-component normal mixture model with theoretical N(0,1) null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

89 Colon cancer data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

90 van’t Wout HIV Data
Four experiments using the same RNA preparation on 4 different slides.
Expression levels of transcripts assessed in CD4-T-cell lines 24 hours after infection with HIV type 1.
Two samples: one corresponds to HIV-infected cells and one to non-infected cells.
Dataset of M = 8 samples: 4 HIV-infected vs 4 non-infected, with N = 7,680 genes.
Analysed by Gottardo et al (2006).
van’t Wout et al (2003), J Virology 77, 1392-1402
Gottardo et al (2006), Biometrics 62, 10-18

91 HIV data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

92 Conclusions
The mixture-model based approach to finding DE genes can yield new information.
It gives a measure of the posterior probability that a gene is not DE (i.e. a local FDR rather than a global one).
It provides estimates of global rates: the FDR and the false negative rate FNR (FNR = 1 - sensitivity).

93 Mixture Model-Based Approach: Advantages Provides a local FDR and a measure of the FNR, as well as the global FDR for the selected genes. We show that it can yield new information

94 Allison Mice Simulation: Allison generated data for 10 mice over 3000 genes. The data are generated in six groups of 500, with a value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the 500 x 500 covariance matrix used to generate each group. For a random 20% of the genes, a value d of 0, 4, or 8 is added to the gene expression levels of the last five mice.

95 Theoretical null, ρ=0.8, d=4

96 Empirical null, ρ=0.8, d=4

97 Theoretical null, ρ=0.8, d=8

98 Empirical null, ρ=0.8, d=8

99 Suppose $\tau_0(w)$ is monotonic decreasing in w. Then for

100 Suppose $\tau_0(w)$ is monotonic decreasing in w. Then for where

101 For a desired control level $\alpha$, say $\alpha = 0.05$, define w (1) If is monotonic in w, then using (1) to control the FDR [with and taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic $w_j$.

