Published by Mya Boote. Modified over 2 years ago.

1
Statistics
Achim Tresch, UoC / MPIPZ Cologne
treschgroup.de/OmicsModule1415.html
tresch@mpipz.mpg.de

2
Multiple Testing
Challenge: You test plants/patients/… in two settings (or from different populations). You want to know which, and how many, genes are differentially expressed (alternate). You don’t want to make too many mistakes, i.e., declare genes to be alternate (differentially expressed) when in fact they are null (not differentially expressed).

3
The Multiple Testing Problem
You choose a significance level, say 0.05. You calculate p-values of the differences in expression. The p-value of a gene g is the probability that, if g is null (not differentially expressed), it would have a test statistic (e.g., t-statistic) at least as extreme as the one observed. You declare all genes with p-value ≤ 0.05 to be truly different. What’s the problem?

4
The Multiple Testing Problem
You are testing many genes at the same time. Suppose that you test 10,000 genes, but no genes are truly differentially expressed. About 5% of the genes will still have a p-value ≤ 0.05 by chance alone, so you will find about 500 “significant” genes, all of them false positives. Bad.
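The 10,000-gene thought experiment can be checked with a small simulation. This is only a sketch under stated assumptions: each gene's test statistic is taken to be a standard normal z-score under the null (the slides mention a t-statistic; a z-test is used here because it needs nothing beyond the standard library), and the helper name `two_sided_p` is made up for illustration.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value of a z-statistic under the standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
n_genes = 10_000
alpha = 0.05

# Assumption: every gene is null, so its z-statistic is drawn from N(0, 1).
p_values = [two_sided_p(random.gauss(0.0, 1.0)) for _ in range(n_genes)]

# Every gene that passes the threshold is a false positive here.
false_positives = sum(p <= alpha for p in p_values)

print(f"expected by chance: {n_genes * alpha:.0f}")
print(f"called significant: {false_positives}")
```

The simulated count fluctuates around 500 from seed to seed, although no gene is differentially expressed.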

5
The Multiple Testing Problem

6
Bonferroni Correction

7
Bonferroni Correction (FWER control)
Bonferroni controls the family-wise error rate (FWER), i.e., the probability that our list of differentially expressed genes contains at least one mistake:
FWER = Pr(at least one null gene found diff. expr.)
This is very strict.
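As a minimal sketch of the standard Bonferroni rule (compare each p-value with the significance level divided by the number of tests), the following may help. The function name and the five p-values are hypothetical and not taken from the slides.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of tests that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    threshold = alpha / m  # each test is held to a stricter level
    return [i for i, p in enumerate(p_values) if p <= threshold]

# Hypothetical p-values for five genes.
p_values = [0.0004, 0.009, 0.012, 0.03, 0.6]

unadjusted = [i for i, p in enumerate(p_values) if p <= 0.05]
adjusted = bonferroni_significant(p_values, alpha=0.05)

print("unadjusted:", unadjusted)
print("Bonferroni:", adjusted)
```

With five tests the per-gene threshold drops from 0.05 to 0.01, so fewer genes survive. With 10,000 genes the threshold would be 0.000005, which illustrates why the slide calls the correction very strict.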

8
False Discovery Rate (FDR) estimation
A Fundamental Insight: All truly null genes (i.e., not truly differentially expressed) are equally likely to have any p-value. That is by construction of the p-value: under the null hypothesis, 1% of the genes will be in the top percentile, 1% will be between the 89th and 90th percentiles, and so on. The p-value is just a way of stating the percentile under the null condition.
[Figure: p-value axis from 0 to 1]
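The uniformity of null p-values can be made visible by binning simulated p-values, which is a numerical version of the flat marble histogram. The sketch assumes normally distributed null test statistics (a z-test rather than the t-test mentioned earlier) and an arbitrary choice of ten bins.

```python
import math
import random

random.seed(1)
n_genes = 20_000

# Null z-statistics from N(0, 1), converted to two-sided p-values.
p_values = [math.erfc(abs(random.gauss(0.0, 1.0)) / math.sqrt(2))
            for _ in range(n_genes)]

# Count p-values in ten equal-width bins: [0, 0.1), [0.1, 0.2), ...
bin_counts = [0] * 10
for p in p_values:
    bin_counts[min(int(p * 10), 9)] += 1  # p == 1.0 goes into the last bin

fractions = [count / n_genes for count in bin_counts]
print([round(f, 3) for f in fractions])
```

Each bin should hold roughly one tenth of the genes, so the histogram of null p-values is flat up to sampling noise.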

9
False Discovery Rate (FDR) estimation
Idea: The observed p-value distribution is a mixture of null genes (light blue marbles) and truly different genes (red marbles). If the chosen test is appropriate, red marbles should be concentrated at the low p-values.
[Figure: marbles stacked along the p-value axis from 0 to 1. Legend: differential gene (red), non-differential gene (light blue)]

10
False Discovery Rate (FDR) estimation
Of course, we don’t know the colors of the marbles, i.e., we don’t know which genes are true alternates. However, we know that null marbles are equally likely to have any p-value. So, at the p-values where the height of the marbles levels off, we have primarily light blue marbles (null genes).

11
We estimate the baseline of null marbles
This works because, if all genes/marbles were null, the heights would be about uniform. Provided the reds are concentrated near the low p-values, the flat regions will be primarily light blues.
[Figure: p-value histogram; x-axis: p-value (0 to 1), y-axis: absolute frequency; the flat baseline is marked “≈ non-differential genes”]

12
False Discovery Rate (FDR) estimation
Subtracting the “baseline” of true null hypotheses, the remaining balls are primarily red, i.e., they are true alternative hypotheses.
[Figure: p-value histogram; x-axis: p-value (0 to 1), y-axis: absolute frequency; area below the baseline marked “≈ non-differential genes”, area above it marked “≈ differential genes”]

13
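False Discovery Rate (FDR) estimation
Given a p-value cutoff, we can estimate the rate of false discoveries (FDR) among the genes that pass this threshold:
FDR(p-cut) = (≈ non-differential genes below p-cut) / ((≈ non-differential genes below p-cut) + (≈ differential genes below p-cut))
[Figure: p-value histogram with the p-value cutoff marked; x-axis: p-value (0 to 1), y-axis: absolute frequency]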
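The baseline idea can be turned into a rough estimator. The sketch below assumes that the region p > 0.5 is flat and contains almost only null genes; this choice (`flat_from`), the function name, and the toy data are illustrative assumptions, not something the slides specify.

```python
def estimate_fdr(p_values, p_cut, flat_from=0.5):
    """Estimate FDR(p_cut) from the histogram baseline of null genes.

    Assumption: p-values above flat_from come (almost) only from null genes.
    """
    n_flat = sum(p > flat_from for p in p_values)
    # Estimated number of null genes per unit of p-value (the baseline height).
    null_density = n_flat / (1.0 - flat_from)
    est_null_below = null_density * p_cut      # estimated nulls passing the cutoff
    n_below = sum(p <= p_cut for p in p_values)  # all genes passing the cutoff
    if n_below == 0:
        return None  # no gene passes the cutoff, the FDR is undefined
    return min(1.0, est_null_below / n_below)

# Toy data: 900 evenly spread "null" p-values plus 100 small "alternate" ones.
nulls = [(i + 0.5) / 900 for i in range(900)]
alternates = [(j + 0.5) / 10_000 for j in range(100)]
fdr = estimate_fdr(nulls + alternates, p_cut=0.1)
print(round(fdr, 3))
```

In the toy data, 90 nulls and 100 alternates fall below the cutoff 0.1, and the estimated baseline also predicts 90 nulls there, so the estimate is 90 / 190.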

14
FDR-based p-value cutoff
Given a desired FDR (e.g., 20%), we can find the largest p-value cutoff for which this FDR is achieved.
[Figure: p-value histogram with the baseline of nulls; x-axis: p-value (0 to 1), y-axis: absolute frequency. Marked cutoff: p-cut 1 = 0.1 with FDR(p-cut 1) = 9%]

15
FDR-based p-value cutoff
Given a desired FDR (e.g., 20%), we can find the largest p-value cutoff for which this FDR is achieved.
[Figure: p-value histogram with the baseline of nulls; x-axis: p-value (0 to 1), y-axis: absolute frequency. Marked cutoffs: p-cut 1 = 0.1 with FDR(p-cut 1) = 9%; p-cut 2 = 0.2 with FDR(p-cut 2) = 20%]

16
FDR-based p-value cutoff
Given a desired FDR (e.g., 20%), we can find the largest p-value cutoff for which this FDR is achieved.
[Figure: p-value histogram with the baseline of nulls; x-axis: p-value (0 to 1), y-axis: absolute frequency. Marked cutoffs: p-cut 1 = 0.1 with FDR(p-cut 1) = 9%; p-cut 2 = 0.2 with FDR(p-cut 2) = 20%; p-cut 3 = 0.7 with FDR(p-cut 3) = 52%]

17
FDR-based p-value cutoff
Given a desired FDR (e.g., 20%), we can find the largest p-value cutoff for which this FDR is achieved.
[Figure: p-value histogram with the baseline of nulls; x-axis: p-value (0 to 1), y-axis: absolute frequency. Marked cutoffs: p-cut 1 = 0.1 with FDR(p-cut 1) = 9%; p-cut 2 = 0.2 with FDR(p-cut 2) = 20%; p-cut 3 = 0.7 with FDR(p-cut 3) = 52%]
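Finding the largest cutoff for a desired FDR amounts to scanning candidate cutoffs and keeping the largest one whose estimated FDR is still acceptable. The sketch below uses the observed p-values themselves as candidates and estimates the null baseline from the region p > 0.5; both choices, the function name, and the toy data are assumptions made for illustration.

```python
def largest_cutoff(p_values, target_fdr, flat_from=0.5):
    """Largest observed p-value whose estimated FDR is <= target_fdr.

    Assumption: p-values above flat_from come (almost) only from null genes.
    Returns None if no cutoff reaches the target.
    """
    ordered = sorted(p_values)
    n_flat = sum(p > flat_from for p in ordered)
    null_density = n_flat / (1.0 - flat_from)  # baseline height of nulls
    best = None
    for rank, p in enumerate(ordered, start=1):
        # rank = number of genes with p-value <= p (exact when there are no ties)
        est_fdr = min(1.0, null_density * p / rank)
        if est_fdr <= target_fdr:
            best = p  # later (larger) passing cutoffs overwrite earlier ones
    return best

# Toy data: 900 evenly spread "null" p-values plus 100 small "alternate" ones.
nulls = [(i + 0.5) / 900 for i in range(900)]
alternates = [(j + 0.5) / 10_000 for j in range(100)]

cutoff = largest_cutoff(nulls + alternates, target_fdr=0.2)
print(round(cutoff, 4))
```

For data consisting only of evenly spread nulls the function returns `None`, because no cutoff brings the estimated FDR down to 20%.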

18
Example: All null
Consider the all null case (all marbles are blue). For any p-value cutoff, the estimated FDR will be close to 100%. For any sensible FDR (substantially below 100%), there will be no suitable p-value cutoff, and the method will not return any gene. Good.
[Figure: flat p-value histogram of blue marbles; x-axis: p-value (0 to 1)]

19
Example: All alternate
Consider the all alternate case (all marbles are red). For a large range of p-value cutoffs, the estimated FDR will be close to 0. For sensible FDR cutoffs (e.g., 20%), the corresponding p-value cutoff will be high. The method will return many genes. Good.
[Figure: p-value histogram of red marbles concentrated at low p-values; x-axis: p-value (0 to 1)]
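The two extreme cases can be reproduced numerically. This is a sketch under several assumptions: p-values come from a two-sided z-test, alternate genes have a strong effect (a mean shift of 5 standard deviations, chosen arbitrarily), and the null baseline is estimated from the region p > 0.5. All function names are hypothetical.

```python
import math
import random

def estimated_fdr(p_values, p_cut, flat_from=0.5):
    """Baseline-based FDR estimate; None if no gene passes the cutoff."""
    null_density = sum(p > flat_from for p in p_values) / (1.0 - flat_from)
    n_below = sum(p <= p_cut for p in p_values)
    return min(1.0, null_density * p_cut / n_below) if n_below else None

def simulate_p_values(n_genes, shift, rng):
    """Two-sided z-test p-values; shift = 0 gives null genes."""
    return [math.erfc(abs(rng.gauss(shift, 1.0)) / math.sqrt(2))
            for _ in range(n_genes)]

rng = random.Random(2)
all_null = simulate_p_values(20_000, shift=0.0, rng=rng)
all_alternate = simulate_p_values(20_000, shift=5.0, rng=rng)

fdr_null = estimated_fdr(all_null, p_cut=0.05)
fdr_alternate = estimated_fdr(all_alternate, p_cut=0.05)

print(f"all null:      estimated FDR = {fdr_null:.2f}")
print(f"all alternate: estimated FDR = {fdr_alternate:.4f}")
```

In the all null case the estimate stays near 1 for any cutoff, while in the all alternate case almost no p-values land in the flat region, so the estimated baseline and the estimated FDR are close to 0.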

20
Conclusions
A flat p-value distribution may force us to the far left (a very small p-value cutoff) in order to get a low False Discovery Rate. This may eliminate genes of interest. If subsequent validation experiments are not too expensive, we can accept a higher False Discovery Rate (e.g., 20%). FDR and significance level are entirely different things!

22
Gene Set Enrichment

23
Fisher’s exact test, once more

25
Gene Ontology Example 559

26
Gene Ontology Example (immune response) (macromolecule biosynthesis)

27
Kolmogorov-Smirnov Test
Walk along the ranked gene list:
Move 1/K up when you see a gene from group a.
Move 1/(N-K) down when you see a gene not in group a.
[Figure annotation: < 10^-10]
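The up/down rule describes a running sum over a ranked gene list. The sketch below assumes that N is the total number of ranked genes and K the number of them that belong to group a (the slide does not define the symbols, but with this reading the walk ends at zero). The gene names are made up, and the code computes only the walk and its largest deviation, not a p-value.

```python
def ks_running_sum(ranked_genes, group):
    """Return the running-sum walk and its maximum absolute deviation."""
    group = set(group)
    n = len(ranked_genes)                       # N: all ranked genes
    k = sum(gene in group for gene in ranked_genes)  # K: genes in group a
    if k == 0 or k == n:
        raise ValueError("group must contain some, but not all, ranked genes")
    up, down = 1.0 / k, 1.0 / (n - k)
    position, walk = 0.0, []
    for gene in ranked_genes:
        # Step up for a group member, step down otherwise.
        position += up if gene in group else -down
        walk.append(position)
    return walk, max(abs(x) for x in walk)

# Hypothetical ranked list (most differentially expressed first) and gene group.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
group_a = {"g1", "g2", "g4"}

walk, statistic = ks_running_sum(ranked, group_a)
print([round(x, 3) for x in walk])
print(round(statistic, 3))
```

A group whose genes cluster near the top of the ranking pushes the walk far above zero before it returns, which gives a large statistic.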

28
GO scoring: general problem

29
GO Independence Assumption
[Figure labels: light yellow; GO sets]

30
GO Independence Assumption
[Figure label: light yellow]

31
The elim method

32
Top 10 significant nodes (boxes) obtained with the elim method

33
Algorithms Summary

34
Evaluation: Top scoring GO term Significant GO terms in the ALL dataset

35
Advantages & Disadvantages for ALL

36
Simulation Study Introduce noise

37
Simulation Study

39
Quality of GO scoring methods
[Figure panels: 10% noise level; 40% noise level]

40
Summary
