Multiple testing and false discovery rate in feature selection


1 Multiple testing and false discovery rate in feature selection
• Workflow of feature selection using high-throughput data
• General considerations of statistical testing using high-throughput data
• Some important FDR methods
  - Benjamini-Hochberg FDR
  - Storey-Tibshirani's q-value
  - Efron et al.'s local fdr
  - Ploner et al.'s multidimensional local fdr

2 Gene/protein/metabolite expression data
After all the pre-processing, we have a feature-by-sample matrix of expression indices. It is like a molecular "fingerprint" of each sample. The most common use: to find biomarkers of a disease.

3 Workflow of feature selection
Raw data
→ Preprocessing, normalization, filtering …
→ Feature-level expression in each sample
→ Statistical testing/model fitting
→ Test statistic for every feature
→ Compare to null distribution
→ P-value for every feature
→ Significance of every feature (FWER, FDR, …)
→ Selected features (also using other information: fold change, biological pathway …)
→ Biological interpretation, …

4 Workflow of feature selection
Another route:
Raw data
→ Preprocessing, normalization, filtering …
→ Feature-level expression in each sample
→ Statistical testing/model fitting, combined with feature group information (biological pathway …)
→ Group-level significance
→ Selected features from significant groups

5 Gene/protein/metabolite expression data
The simplest strategy: assume each gene is independent of the others, perform a test between treatment groups for every gene, and select those that are significant.
When we do 50,000 t-tests at an alpha level of 0.05, we expect ~50,000 × 0.05 = 2,500 false positives!
What if we use Bonferroni's correction? The per-test level becomes 0.05/50,000 = 1e-6. Unrealistic!
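The arithmetic on this slide can be checked with a minimal sketch. The variable names are illustrative and not from the slides, and the expected count of false positives assumes that every null hypothesis is true.

```python
# Illustrative names; the numbers are the ones quoted on the slide.
m = 50_000      # number of t-tests
alpha = 0.05    # per-test significance level

# Expected number of false positives, assuming every null hypothesis is true.
expected_false_positives = m * alpha

# Per-test threshold under Bonferroni's correction.
bonferroni_threshold = alpha / m

print(expected_false_positives, bonferroni_threshold)
```

A p-value would have to fall below roughly one in a million to survive the correction, which is why the slide calls the threshold unrealistic for this kind of data.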

6 General considerations
Family-wise error rate (FWER)
When we have multiple tests, let V be the number of true nulls called significant (false positives).
FWER = P(V ≥ 1) = 1 − P(V = 0)
"Family": a group of hypotheses that are similar in purpose and need to be jointly accurate.
Bonferroni correction is one way to control the FWER. It is the simplest and most conservative approach.

7 General considerations
Bonferroni correction: control every test at the level α/m.
For each test, $P(T_i \text{ significant} \mid H_0) \le \alpha/m$.
Then $P(\text{some } T_i \text{ are significant} \mid H_0) \le \alpha$, i.e. FWER = P(V ≥ 1) ≤ α.
It has little power to detect differential expression when m is big.
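A minimal sketch of the Bonferroni rule follows. The function names are hypothetical, and the FWER helper additionally assumes independent tests with all nulls true, which is stronger than what the Bonferroni bound itself needs.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Call a test significant when its p-value is at most alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def fwer_if_independent(m, per_test_level):
    """P(V >= 1) for m independent tests when every null is true."""
    return 1.0 - (1.0 - per_test_level) ** m


decisions = bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05)
fwer_corrected = fwer_if_independent(50_000, 0.05 / 50_000)
fwer_uncorrected = fwer_if_independent(50_000, 0.05)
print(decisions, fwer_corrected, fwer_uncorrected)
```

With the per-test level set to α/m the family-wise error rate stays just under α, while testing every feature at α makes at least one false positive practically certain.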

8 References
Non-technical reviews:
• Gusnanto A, Calza S, Pawitan Y. Curr Opin Lipidol 2007; 18:187-193.
• Pounds SB. Brief Bioinf 2005; 7(1):25-36.
• Saeys Y, Inza I, Larranaga P. Bioinformatics 2007; 23(19):2507-2517.
Original papers:
• Benjamini Y, Hochberg Y. JRSS B 1995; 57(1):289-300.
• Storey JD, Tibshirani R. Proc Natl Acad Sci U S A 2003; 100:9440-9445.
• Efron B. Ann Stat 2007; 35(4):1351-137.
• Ploner A, Calza S, Gusnanto A, Pawitan Y. Bioinf 2006; 22(5):556-565.
(A number of figures were taken from these papers.)

9 General considerations
                             Significant   Non-significant   Total
No change                    V             U                 Q
Differentially expressed     S             T                 M-Q
Total                        R             M-R               M
Simultaneously test M hypotheses.
Q is the number of true nulls: genes that didn't change (unobserved).
R is the number rejected: genes called significant (observed).
U, V, T, S are unobservable random variables.
V: number of type-I errors; T: number of type-II errors.

10 General considerations
                             Significant   Non-significant   Total
No change                    V             U                 Q
Differentially expressed     S             T                 M-Q
Total                        R             M-R               M
In traditional testing, we consider just one test, from a frequentist's point of view. We control the false positive rate: E(V/Q).
Sensitivity: E[S/(M-Q)]
Specificity: E[U/Q]

11 General considerations
There is always a trade-off between sensitivity and specificity.
                             Significant            Non-significant        Total
No change                    False positive         True negative          Total true negative
Differentially expressed     True positive          False negative         Total true positive
Total                        Total positive calls   Total negative calls   Total
Receiver operating characteristic (ROC) curve. Example from Jiang et al. BMC Bioinformatics 7:417.

12 http://upload.wikimedia.org/wikipedia/en/b/b4/Roc-general.png General considerations

13 False discovery rate (FDR) = E(V/R)
Among all tests called significant, what percentage are false calls?
                             Significant   Non-significant   Total
No change                    V             U                 Q
Differentially expressed     S             T                 M-Q
Total                        R             M-R               M

14 General considerations
                             Significant   Non-significant   Total
No change                    5             49795             49800
Differentially expressed     95            105               200
Total                        100           49900             50000
It makes more sense than this, which leans too heavily towards sensitivity:
                             Significant   Non-significant   Total
No change                    320           49480             49800
Differentially expressed     180           20                200
Total                        500           49500             50000

15 General considerations
                             Significant   Non-significant   Total
No change                    5             49795             49800
Differentially expressed     95            105               200
Total                        100           49900             50000
It makes more sense than this, which leans too heavily towards specificity:
                             Significant   Non-significant   Total
No change                    1             49799             49800
Differentially expressed     14            186               200
Total                        15            49985             50000
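To make the comparison on slides 14 and 15 concrete, the sketch below computes the realized false discovery proportion, sensitivity, and specificity for the three tables. The function name and dictionary keys are illustrative; the counts are read from the tables, and V/R here is a single realized proportion rather than the expectation E(V/R).

```python
def rates(V, U, S, T):
    """Summaries of one 2x2 outcome table (V, U, S, T as defined on slide 9)."""
    R = V + S  # total number of features called significant
    return {
        "FDR": V / R if R > 0 else 0.0,   # realized proportion of false calls
        "sensitivity": S / (S + T),
        "specificity": U / (V + U),
    }


balanced = rates(V=5, U=49795, S=95, T=105)
too_sensitive = rates(V=320, U=49480, S=180, T=20)
too_specific = rates(V=1, U=49799, S=14, T=186)

for name, r in [("balanced", balanced),
                ("too sensitive", too_sensitive),
                ("too specific", too_specific)]:
    print(name, {k: round(v, 4) for k, v in r.items()})
```

Specificity is above 0.99 in all three tables because true nulls vastly outnumber true signals, so it barely distinguishes them; the false discovery proportion and sensitivity are the quantities that differ.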

16 Was the BH definition the first? No. Defined in 1955…. True discovery rate True positive rate False positive rate http://en.wikipedia.org/wiki/Precision_and_recall

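17 FDR – BH procedure
Testing m hypotheses: $H_1, H_2, \dots, H_m$.
The p-values are: $P_1, P_2, \dots, P_m$.
Order the p-values such that: $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$.
Let q* be the level of FDR we want to control. Find the largest i such that $P_{(i)} \le \frac{i}{m} q^*$.
Make the corresponding p-value the cutoff value; the FDR is then controlled at q*.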

18 FDR – BH procedure
The method assumes weak dependence between test statistics.
In computation, it can be simplified by taking $m P_{(i)} / i$ and comparing it to q*.
Intuitively, $m P_{(i)}$ is the number of false positives expected if the cutoff is $P_{(i)}$. If the cutoff were $P_{(i)}$, then we would select the first i features. So $m P_{(i)} / i$ is the expected fraction of false positives, i.e. the FDR.
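The BH step-up rule can be sketched in a few lines of plain Python. The function name and the example p-values are illustrative, not taken from the slides.

```python
def benjamini_hochberg(p_values, q_star=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank i with P_(i) <= (i / m) * q_star; 0 means no rejection.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q_star:
            k = rank

    # Reject everything up to and including rank k (step-up behaviour).
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, q_star=0.05))
```

Because the rule looks for the largest qualifying rank, a p-value that misses its own threshold is still rejected when a larger p-value further down the list meets its threshold.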

19 Higher power compared to FWER controlling methods: FDR – BH procedure

20 ST q-value
                             Significant   Non-significant   Total
No change                    V             U                 Q
Differentially expressed     S             T                 M-Q
Total                        R             M-R               M
FDR = E[V/(V+S)] = E[V/R]
Let t be the threshold on the p-value. With all p-values observed, V and R become functions of t:
$V(t) = \#\{\text{null } p_i \le t\}$
$R(t) = \#\{p_i \le t\}$
$\mathrm{FDR}(t) = E[V(t)/R(t)] \approx E[V(t)]/E[R(t)]$
For R(t), we can simply plug in $\#\{p_i \le t\}$. For V(t), we use the fact that true null p-values should be uniformly distributed.

21 ST q-value
                             Significant   Non-significant   Total
No change                    V             U                 Q
Differentially expressed     S             T                 M-Q
Total                        R             M-R               M
Because true null p-values are uniform, $E[V(t)] = Qt$. However, Q is unknown. Let $\pi_0 = Q/M$.
Now, try to find $\pi_0$. Without specifying the distribution of the alternative p-values, but assuming most of them are small, we can use the area of the p-value histogram that is relatively flat (p-values above a cutoff λ) to estimate $\pi_0$.
(Figure: density of p-values, with the cutoff λ marked.)

22 ST q-value
                             Significant   Non-significant   Total
No change                    V             U                 Q
Differentially expressed     S             T                 M-Q
Total                        R             M-R               M
This procedure involves tuning the parameter λ. With most alternative p-values at the smaller end, the estimated $\pi_0$ should stabilize when λ is above a certain value.
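A simplified sketch of the q-value computation is given here. It assumes the common estimator $\hat{\pi}_0(\lambda) = \#\{p_i > \lambda\} / (M(1-\lambda))$ evaluated at a single fixed λ, whereas Storey and Tibshirani smooth the estimate over a range of λ values; the function names and the default λ = 0.5 are illustrative choices.

```python
def estimate_pi0(p_values, lam=0.5):
    """Fraction of true nulls, estimated from the flat part of the histogram."""
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    # Null p-values are uniform, so about pi0 * m * (1 - lam) of them exceed lam.
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """q-value of each feature: minimum estimated FDR at which it is called."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])

    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum
    # of pi0 * m * p / rank over all cutoffs at least as large as p.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q


print(q_values([0.01, 0.8, 0.02, 0.6]))
```

The running minimum is what keeps the q-values in the same order as the p-values, and with the estimated π0 set to 1 the thresholds reduce to the BH quantity $m P_{(i)} / i$.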

23 “The more mathematical definition of the q value is the minimum FDR that can be attained when calling that feature significant” Given a list of ordered p-values, this guarantees the corresponding q-values are increasing in the same order as the p-values. The q-value procedure is robust against weak dependence between features, which “can loosely be described as any form of dependence whose effect becomes negligible as the number of features increases to infinity.” ST q-value

24

25

26 Efron's Local fdr
The previous versions of FDR make statements about features falling on the tails of the distribution of the test statistic. However, they don't make statements about an individual feature, i.e. how likely is this feature to be a false positive given its specific p-value?
-------------------------------
Efron's local FDR uses a mixture model and the empirical Bayes approach. An empirical null distribution is put in the place of the theoretical null.
With z being the test statistic, the local FDR is $\mathrm{fdr}(z) = p_0 f_0(z) / f(z)$, where $f_0$ is the null density, $f$ the mixture density, and $p_0$ the proportion of nulls.

27 Efron's Local fdr
The test statistics come from a mixture of two distributions: $f(z) = p_0 f_0(z) + p_1 f_1(z)$, with $p_1 = 1 - p_0$.
The exact form of $f_1$ is not specified. It is required to be longer-tailed than $f_0$.
We need the empirical null, but we only have a histogram from the mixture. So the null comes in a strong parametric form.
We also need the proportion $p_0$, the Bayes a priori probability.

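28 Efron's Local fdr
One way to estimate the empirical null and $p_0$, used in the R package locfdr, is "central matching": use a quadratic form to approximate $\log f(z)$ near $z = 0$, $\log f(z) \approx \beta_0 + \beta_1 z + \beta_2 z^2$, where the mixture is assumed to be dominated by the null component.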

29 6033 test statistics

30 Efron's Local fdr
Now we have the null distribution and the proportion.
Define the null subdensity around z: $f_0^+(z) = p_0 f_0(z)$.
The Bayes posterior probability that a case is null given z: $\mathrm{fdr}(z) = f_0^+(z) / f(z)$.
Compare to other forms of Fdr that focus on tail area: $\mathrm{Fdr}(z) = p_0 F_0(z) / F(z)$, with $F(z) = p_0 F_0(z) + p_1 F_1(z)$ ($F_0$ and $F_1$ are the c.d.f.s of $f_0$ and $f_1$).
Fdr(z) is the average of fdr(Z) for Z ≤ z.
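The difference between the local and the tail-area quantities can be illustrated numerically. The sketch assumes the usual definitions $\mathrm{fdr}(z) = p_0 f_0(z)/f(z)$ and left-tail $\mathrm{Fdr}(z) = p_0 F_0(z)/F(z)$, and it uses a fully known toy mixture: a standard normal null, a normal alternative centred at -3, and $p_0 = 0.9$. All three are assumptions made only for this example; in Efron's method $f_1$ is left unspecified and $f_0$, $p_0$ are estimated from the data.

```python
import math

P0 = 0.9          # assumed proportion of null cases
ALT_MEAN = -3.0   # assumed centre of the non-null distribution


def normal_pdf(z, mean=0.0, sd=1.0):
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z, mean=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((z - mean) / (sd * math.sqrt(2.0))))


def local_fdr(z):
    """Posterior probability that a case with statistic exactly z is null."""
    null_part = P0 * normal_pdf(z)
    mixture = null_part + (1.0 - P0) * normal_pdf(z, mean=ALT_MEAN)
    return null_part / mixture


def tail_fdr(z):
    """Proportion of nulls among all cases with statistic at most z."""
    null_part = P0 * normal_cdf(z)
    mixture = null_part + (1.0 - P0) * normal_cdf(z, mean=ALT_MEAN)
    return null_part / mixture


for z in (-4.0, -3.0, -2.5, -2.0):
    print(z, round(local_fdr(z), 4), round(tail_fdr(z), 4))
```

In this toy setting the tail-area value at a cutoff is smaller than the local value at the same point, because the tail average includes more extreme cases that are less likely to be null.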

31 Efron’s Local fdr A real data example. Notice most non-null cases (bars plotted negatively) are not reported. A big loss of sensitivity to control FDR, which is very common.

32 Multidimensional Local fdr
A natural extension to the local FDR. Use more than one test statistic to capture different characteristics of the features. Now we have a multidimensional mixture model.
Comment: Remember the "curse of dimensionality"? Since we don't have too many realizations of the non-null distribution, we can't go beyond just a few, say 2, dimensions.

33 Multidimensional Local fdr Using t-statistic in one dimension and the log standard error in the other. Genes with small s.e. tend to have higher FDR. This approach discounts genes with too small s.e. – similar to the fold change idea but in a theoretically sound way. Simulated:

34 Multidimensional Local fdr
• The null distribution is generated by permutation: permute the treatment labels of the samples and re-compute the test statistics. Repeat 100 times to obtain the null distribution $f_0(z)$.
• The $f(z)$ is obtained from the observed Z.
• Like local FDR, smoothing is involved. Here two densities in 2D need to be obtained by smoothing. In 2D, the points are not as dense as in 1D, so the choice of smoothing parameters becomes more consequential.
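The permutation step can be sketched with the standard library alone. Everything here is an illustrative assumption: the function names, the Welch-type t-statistic, and the simulated 20 × 10 data matrix. Only the idea of shuffling the treatment labels and recomputing the statistics 100 times comes from the slide, and the sketch covers one dimension (the t-statistic) rather than the 2D case.

```python
import math
import random
import statistics


def t_statistic(x, y):
    """Welch-type t-statistic comparing two groups of measurements."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def feature_statistics(data, labels):
    """One t-statistic per feature (row), splitting samples by label 1 vs 0."""
    stats = []
    for row in data:
        x = [v for v, g in zip(row, labels) if g == 1]
        y = [v for v, g in zip(row, labels) if g == 0]
        stats.append(t_statistic(x, y))
    return stats


def permutation_null(data, labels, n_perm=100, seed=0):
    """Pool the statistics recomputed under n_perm shuffles of the labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        # The same shuffled labels are used for every feature, which keeps
        # the correlation between features intact.
        rng.shuffle(shuffled)
        null.extend(feature_statistics(data, shuffled))
    return null


# Simulated data: 20 features measured on 10 samples, 5 per treatment group.
rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(20)]
labels = [0] * 5 + [1] * 5

observed = feature_statistics(data, labels)
null = permutation_null(data, labels, n_perm=100)
print(len(observed), len(null))
```

The pooled null statistics play the role of draws from $f_0(z)$, and the observed statistics are draws from $f(z)$; both still need to be smoothed into densities before a local fdr can be formed.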

35 Multidimensional Local fdr
To address the problem, the authors did smoothing on the ratio (details skipped): $r(z) = f(z) / [f(z) + p f_0(z)]$, where p is the number of permutations.
Afterwards, the local fdr is estimated by: $\mathrm{fdr}(z) = \pi_0 [1 - r(z)] / [p \, r(z)]$.

36 Multidimensional Local fdr Real data:

37 Multidimensional Local fdr Using other statistics:

