Multiple testing in high-throughput biology. Petter Mostad.


1 Multiple testing in high-throughput biology. Petter Mostad

2 Overview
– Review of hypothesis testing
– Multiple testing: what is the issue?
– Types of errors to control
– Adjusting p-values
– A Bayesian alternative

3 Statistical hypothesis testing
– Fix a null hypothesis H0, and a model for what the data would look like if H0 is true.
– Find a test statistic (something computed from the data) which will be more extreme when H0 is not true.
– Compute the test statistic from the data.
– Compute its p-value: the probability of observing this or a more extreme test statistic if H0 is true.
– Reject H0 if p is very small (e.g., less than 0.05).

4 Example: testing the mean of a sample
– Assume you have observed values x1, ..., xn.
– Assume they come from a normal distribution with expectation μ and variance σ².
– H0: μ = μ0.
– Test statistic: t = (x̄ − μ0) / (s / √n), where x̄ is the sample mean and s the sample standard deviation.
– Under the null hypothesis, t has a t(n−1) distribution, i.e., a t-distribution with n−1 d.f.
– The p-value is the probability of values more extreme than t in this distribution.
– p-value < 0.05 if and only if μ0 is outside the 95% confidence interval for μ.
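The t statistic on this slide can be computed by hand in a few lines. A minimal sketch (the function name and the example data are mine, chosen for illustration):

```python
import math

def t_statistic(xs, mu0):
    """t = (mean - mu0) / (s / sqrt(n)), with s the sample
    standard deviation (n-1 denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))

xs = [5.1, 4.8, 5.6, 5.0, 5.3]
t = t_statistic(xs, mu0=5.0)
# Compare |t| with the t(n-1) critical value
# (e.g. 2.776 for n = 5 at the two-sided 0.05 level).
```

In practice one would use a statistics library to get the p-value from the t(n−1) distribution; the sketch only shows the statistic itself.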

5 Example: Non-parametric tests
– Tests where the null hypothesis does not specify a particular probability model for the data.
– Example: are the values in two groups different? To find out, rank all the values.
– Wilcoxon rank sum test statistic: add the ranks of all items in one group, and compare to the sum of the other ranks.
– Compare with the rank sums one would get if one set of items were selected at random. This gives a p-value.
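The rank sum and its permutation p-value can be sketched as below. This is an illustration under simplifying assumptions (distinct values, so no midranks for ties; function names are mine):

```python
import random

def rank_sum(group_a, group_b):
    """Sum of the 1-based ranks of group_a values in the pooled sample.
    Assumes all values are distinct (no midranks for ties)."""
    pooled = sorted(group_a + group_b)
    return sum(pooled.index(x) + 1 for x in group_a)

def rank_sum_pvalue(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value: how often a random split of the
    pooled values gives a rank sum at least as far from its expectation."""
    rng = random.Random(seed)
    observed = rank_sum(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    expected = k * (len(pooled) + 1) / 2  # mean rank sum under H0
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        w = rank_sum(pooled[:k], pooled[k:])
        if abs(w - expected) >= abs(observed - expected):
            count += 1
    return count / n_perm
```

For small groups one could enumerate all splits exactly instead of sampling them.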

6 Example: Finding differentially expressed genes
– Assume we want to find differentially expressed genes using 4 cDNA arrays with 6000 genes, comparing ”treated” samples with ”untreated”.
– Assume there are NO differentially expressed genes, only noise, and that all 6000 × 4 values are normally distributed.
– Of the 6000 t-test statistics computed, roughly 300 will have values ”more extreme” than the 0.05 cutoff value, just by chance. These are false positives.
– This problem is a consequence of the hypothesis testing approach and multiple testing, not of the particular method chosen.
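The rough count of 300 false positives can be checked by simulation. A sketch assuming each gene gives 4 independent N(0,1) log-ratios tested against zero with a one-sample t test (3.182 is the two-sided 5% critical value of the t distribution with 3 d.f.):

```python
import math
import random
import statistics

random.seed(1)
n_genes, n_arrays = 6000, 4
crit = 3.182  # two-sided 0.05 critical value, t with 3 d.f.
false_pos = 0
for _ in range(n_genes):
    xs = [random.gauss(0, 1) for _ in range(n_arrays)]
    t = statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n_arrays))
    if abs(t) > crit:
        false_pos += 1
# Since every null hypothesis is true, about 5% of 6000,
# i.e. roughly 300 genes, exceed the cutoff purely by chance.
```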

7 Multiple testing in general
– Multiple testing is in fact always a problem: when testing at the 5% significance level, 1 in 20 true null hypotheses will be falsely rejected, whether the tests occur in the same or in different research projects.
– The problem seems more acute when a large number of formalized statistical tests are performed, as in modern high-throughput biology.
– What to do?

8 Ways to deal with the problem
– Adjusting p-values:
– Adjusting the levels of p-values to account for multiple tests.
– Changing the interpretation of p-values.
– Reformulating statistics away from hypothesis testing towards (Bayesian) model choices.

9 Types of errors

                   # not rejected   # rejected   SUM
# true null hyp          U               V        M0
# false null hyp         T               S        M1
SUM                      W               R        M

– Traditionally: we want a procedure controlling the size of V, while at the same time keeping T as small as possible.
– Type 1 errors: false positives (V).
– Type 2 errors: false negatives (T).

10 Some error rates
– Per-comparison error rate: PCER = E(V) / M.
– Family-wise error rate: FWER = P(V ≥ 1).
– False discovery rate: FDR = E(V/R | R > 0) · P(R > 0), with V/R defined as 0 when R = 0.
– Positive false discovery rate: pFDR = E(V/R | R > 0).

11 Types of control
– Exact control: controlling error rates conditionally on the set of true null hypotheses.
– Weak control: controlling error rates under the assumption that all null hypotheses are true.
– Strong control: controlling error rates no matter which hypotheses are true or false.

12 Controlling FWER with Bonferroni
– Adjust p-values p1, p2, ..., pM using the Bonferroni method: the adjusted p-values are p̃i = min(M · pi, 1).
– Provides strong control of FWER.
– Very conservative: it is hard to get any hypotheses rejected.
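The Bonferroni adjustment is a one-liner; a minimal sketch (function name is mine):

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: p_adj_i = min(M * p_i, 1)."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]
```

Rejecting all hypotheses with adjusted p-value below 0.05 then controls the FWER at 5%.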

13 Alternative: Sidak adjustment
– Given p-values p1, p2, ..., pM, the adjusted p-values are now p̃i = 1 − (1 − pi)^M.
– Provides strong control of FWER if the test statistics are independent (rarely the case).
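The Sidak formula is equally short; a sketch (function name is mine):

```python
def sidak(pvals):
    """Sidak-adjusted p-values: p_adj_i = 1 - (1 - p_i)**M.
    Valid under independence of the test statistics."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]
```

Note that for a single test (M = 1) the adjustment leaves the p-value unchanged, and for small p it is close to, but slightly less conservative than, Bonferroni.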

14 Holm step-down adjusted p-values
– Order the tests so that the raw p-values are ordered: p(1) ≤ p(2) ≤ ... ≤ p(M).
– Compute adjusted p-values: p̃(i) = max over k ≤ i of min((M − k + 1) · p(k), 1).
– Provides strong control of FWER.
– Somewhat less conservative than Bonferroni.
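The step-down formula can be implemented by walking through the p-values from smallest to largest while carrying a running maximum. A sketch (function name is mine):

```python
def holm(pvals):
    """Holm step-down adjusted p-values (strong FWER control).
    Returns adjusted values in the original input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):      # rank 0 = smallest p-value
        val = min((m - rank) * pvals[i], 1.0)
        running_max = max(running_max, val)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted
```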

15 Permutation testing
– Given items divided into two or more groups.
– Compute a test statistic capturing the way in which your data seem to be extreme.
– Compare with the test statistics computed when the items are permuted between groups. This gives a p-value.
– The Wilcoxon rank sum test can be seen as a permutation test.
– Can be a good way to find statistically significant differences without using parametric assumptions.
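The steps above can be sketched for any two-group, difference-like statistic; the statistic is passed in as a function. Names and defaults are mine:

```python
import random

def permutation_pvalue(group_a, group_b, stat, n_perm=10000, seed=0):
    """Two-sided permutation p-value: the fraction of random
    regroupings whose |statistic| reaches the observed |statistic|."""
    rng = random.Random(seed)
    observed = abs(stat(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(stat(pooled[:k], pooled[k:])) >= observed:
            count += 1
    return count / n_perm

def mean_diff(a, b):
    """Example statistic: difference of group means."""
    return sum(a) / len(a) - sum(b) / len(b)
```

Passing `rank_sum`-style statistics instead of `mean_diff` recovers rank-based tests, which is the sense in which the Wilcoxon test is a permutation test.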

16 Combining permutation testing with adjusting p-values
– Given a data matrix, with some columns corresponding to ”treated” samples and some to ”untreated”.
– Permute the columns to get permutation-derived p-values.
– Adjust these, using for example the Holm step-down method.

17 Correcting for multiple testing with permutations (Dudoit / Speed)
– Assumes treatment and control samples have been compared to a common reference sample.
– For each gene j, use a two-sample t-type statistic, tj = (x̄2j − x̄1j) / sqrt(s²1j/n1 + s²2j/n2), comparing the treated and control means.
– By permuting data from treated and control samples we can approximate the null distribution of the test statistic.
– Use permutations also for the multiple testing problem.

18 Adjusting p-values
– Order the genes so that the observed |t| values are non-increasing: |t(1)| ≥ |t(2)| ≥ ... ≥ |t(M)|.
– Compute permutation t-values t(b, j) for each permutation b.
– The unadjusted p-value for gene j is the proportion of permutation values |t(b, j)| that exceed |t(j)|.
– For each permutation b, set u(b, j) = max of |t(b, k)| over k ≥ j, a non-increasing envelope.
– Set the adjusted p-value for gene j as the proportion of the u(b, j) that exceed |t(j)|.
– Make the adjusted p-values non-decreasing in j by possibly increasing values (enforcing monotonicity).

19 Adjusting p-values (figure; not captured in the transcript)

20 QQ-plots
– We may also compute t-values from differences within groups instead of between groups, giving null t-values.
– Find extreme quantiles of the null distribution, and compare with the most extreme observed t-values.
– Compare the null t-values, or a normal distribution, with the observed t-values in a QQ-plot.
– Single-gene p-values may be computed from quantiles of the null distribution.

21 Controlling FDR
– Start by ordering the tests so that the raw p-values are ordered: p(1) ≤ p(2) ≤ ... ≤ p(M).
– Adjusted p-values can then be computed: p̃(i) = min over k ≥ i of min(M · p(k) / k, 1).
– This can be interpreted as a step-up procedure, starting with the largest p-value.
– Provides strong control of FDR under independence of the test statistics, and under certain more general conditions.
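The step-up formula translates into a loop from the largest p-value down, carrying a running minimum. A sketch of this FDR adjustment (function name is mine):

```python
def fdr_adjust(pvals):
    """Step-up FDR-adjusted p-values:
    p_adj_(i) = min over k >= i of min(M * p_(k) / k, 1).
    Returns adjusted values in the original input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):     # step up from the largest p-value
        i = order[rank]
        val = min(m * pvals[i] / (rank + 1), 1.0)
        running_min = min(running_min, val)  # enforce monotonicity
        adjusted[i] = running_min
    return adjusted
```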

22 Interpreting FDR-adjusted p-values
– Remember that FDR-adjusted p-values must be interpreted differently:
– If you select all genes with FDR-adjusted p-values less than 0.05, you expect a proportion of about 0.05 of these to be false discoveries.
– In contrast, if you select all genes with Holm-adjusted p-values less than 0.05, the chance of seeing any false positive gene at all is at most 0.05.

23 Multiple testing using Bioconductor software
– In the limma package, the topTable function can be used to adjust p-values. It offers a number of different procedures, for example ”bonferroni”, ”holm”, and ”fdr”.
– The multtest package can be used to perform permutation-based computations.

24 Criticism of the FDR approach
– It is ”possible to cheat”:
– You have a number of hypotheses you want to reject, but the p-values are not quite good enough.
– Add to your hypotheses a number of null hypotheses known to be false, with very low p-values.
– The number of rejections will rise, but not the number of false rejections, so your FDR improves, and you ”prove” the hypotheses you care about.

25 Criticism of hypothesis testing in general
– Assume a pharmaceutical company wants to prove a new drug is effective.
– If they compute new p-values continuously, as new data come in, they can just stop whenever the p-value dips below 0.05.
– This gives an ”unfair advantage” compared to testing only at the end of the entire study, as they perform multiple tests.
– ”Solution”: decide before the study at what times to compute p-values, and adjust them accordingly.
– Result: two companies can have exactly identical data, but one gets its drug approved and the other does not.

26 Example: A Bayesian approach (Scott and Berger, 2003)
– Objective: given a large number of normally distributed noise values with expectation zero, and some values which deviate substantially more from zero, how do we find the deviating values while limiting errors of types 1 and 2?
– Crucial ingredient: use the data to estimate the probability that values are deviating.

27 Hierarchical model used
– Observations: xi | μi, σ² ~ N(μi, σ²), with indicators γi ∈ {0, 1} for each observation.
– The model specifies ”noise values”, for which γi = 0 and μi = 0, and ”signal values”, where γi = 1 and μi ~ N(0, V).
– We would like to fit the model to the data, and thus find the posterior probability for each γi to be equal to 1.

28 Example computations
– We assume we have 10 ”signal observations”: -8.48, -5.43, -4.81, -2.64, -2.40, 3.32, 4.07, 4.81, 5.81, 6.24.
– We have n = 10, 50, 500, or 5000 ”noise observations”, normally distributed N(0,1).
– We find the posterior probability for each observation to be noise or signal: can we fish out the signals from the noise?

29 Results

Posterior signal probabilities for the central seven ”signal” observations, and the number of noise observations with pi > .6 (false positives):

    n   -5.4  -4.8  -2.6  -2.4   3.3   4.1   4.8 | # pi > .6
   10    1     1    .94   .89   .99    1     1   |     1
   50    1     1    .71   .59   .94    1     1   |     0
  500    1     1    .26   .17   .67   .96    1   |     1
 5000   1.0   .98   .03   .02   .16   .67   .98  |     2

– Note: the 3 most extreme signal observations always have probability 1 of being signals.
– Note: the number of ”false positives” does not change much with the amount of noise: adjustment for multiple testing is automatic!
– Note: the most central observations are ”drowned out” when the noise increases.

30 Multiple testing for microarray data in practice
– Often too few replicates (2 or 3) of each sample are used to reliably find differentially expressed genes.
– Instead of a definite list of differentially expressed genes, the output may be a ranked list, ordering genes according to how likely they are to be differentially expressed. The top of the list is chosen for further validation.
– Combination with other sources of information may be used to find interesting genes.

