Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.


1 Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006

2 [Diagram: two preprocessing workflows]
ANOVA based: Filtering, Linearisation, Bootstrapping, Log transformation
Array by array approach: Filtering, normalization, Ratio, Test statistic (T-test), Log transformation
Shared labels: Preprocessing, Background correction

3 Overview further analysis
Data stages: Raw data, Preprocessed data, Differentially expressed genes, Clusters of coexpressed genes
Analysis steps: Preprocessing, Test statistic, Clustering

4 Preprocessing: test statistic
Comparison of 2 experiments: Fold test, T-test, SAM, …
A plethora of different methods is available. Which one performs best?
– Different underlying statistical assumptions
– Implications for the final result
– Difficult to define the best method

5 Diff Expr Genes: test statistic
Type 1: Comparison of 2 samples
Statistical testing: control sample versus induced sample
Retrieve statistically over- or underexpressed genes

6 Black/white experiment
Description (array V mice genes)
– Condition 1: pygmy mouse, 10 days old (test)
– Condition 2: normal mouse, 10 days old (ref)
Goal: detect differentially expressed genes
Experiment design (Latin Square)
Array 1: Condition 1 dye1 Replica L; Condition 1 dye1 Replica R; Condition 2 dye2 Replica L; Condition 2 dye2 Replica R
Array 2: Condition 2 dye1 Replica L; Condition 2 dye1 Replica R; Condition 1 dye2 Replica L; Condition 1 dye2 Replica R
Per gene, per condition, 4 measurements are available

7 Fold change (ratio test)
4 measurements per gene, per condition
– Calculate the average
– Sort the averages
– Call a gene differentially expressed if the sample/control ratio exceeds a threshold (usually 2-fold, i.e. |log2(sample/control)| > 1)
Drawbacks
– Arbitrary threshold
– Discards all information obtained from replicates
– Implicitly assumes constant variance, but variance depends on the expression value
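The fold test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, the intensity values and the function name `fold_change_calls` are hypothetical, and the 2-fold threshold is applied to the ratio of replicate averages.

```python
import math
from statistics import mean

def fold_change_calls(test, ref, threshold=2.0):
    """Return {gene: log2 ratio} for genes whose average ratio passes the fold threshold.

    test, ref: dicts mapping gene name -> list of replicate intensities.
    """
    cutoff = math.log2(threshold)
    calls = {}
    for gene in test:
        # Replicates are averaged first, so their spread is discarded.
        log_ratio = math.log2(mean(test[gene]) / mean(ref[gene]))
        if abs(log_ratio) >= cutoff:
            calls[gene] = log_ratio
    return calls

# Hypothetical intensities, for illustration only.
test = {
    "geneA": [400, 420, 380, 400],   # consistent 4-fold increase
    "geneB": [30, 10, 50, 10],       # noisy, low-level signal
    "geneC": [100, 110, 90, 100],    # unchanged
}
ref = {
    "geneA": [100, 100, 100, 100],
    "geneB": [10, 15, 10, 15],
    "geneC": [100, 100, 100, 100],
}
calls = fold_change_calls(test, ref)
```

In this made-up data both geneA and geneB pass the threshold, although the geneB replicates are far less consistent; the fold test cannot tell the two apart because it never looks at the replicate spread.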

8 Why does fold change fail?
Majority of genes are expressed at low levels, where signal/noise is low => not sufficiently conservative
– 2-fold change occurs at random for a large number of genes
– High number of false positives
At higher levels of expression, smaller changes in gene expression may be real => too conservative
– High number of false negatives
Improvement:
– T-test
– Pairwise fold change: genes are significantly differentially expressed if an R-fold change is observed consistently between paired samples
– SAM http://www-stat-class.stanford/SAM/SAMServlet
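A small simulation can illustrate why a fixed 2-fold cutoff is not conservative enough at low expression levels. The sketch below assumes a simple additive Gaussian noise model with the same standard deviation at every expression level; that model, the parameter values and the function name are assumptions made for illustration, not something taken from the slides.

```python
import math
import random

def random_two_fold_rate(level, noise_sd=20.0, n_genes=5000, seed=0):
    """Fraction of unchanged genes that show a >= 2-fold change purely by noise.

    level: true expression level, identical in both conditions.
    noise_sd: standard deviation of additive measurement noise (assumed model).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Floor at 1.0 so the ratio and its logarithm stay defined.
        a = max(1.0, level + rng.gauss(0, noise_sd))
        b = max(1.0, level + rng.gauss(0, noise_sd))
        if abs(math.log2(a / b)) >= 1.0:
            hits += 1
    return hits / n_genes

low = random_two_fold_rate(level=50.0)     # weakly expressed genes
high = random_two_fold_rate(level=5000.0)  # strongly expressed genes
```

Under these assumptions a sizeable fraction of the weakly expressed, truly unchanged genes crosses the 2-fold cutoff, while essentially none of the strongly expressed ones do.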

9 T-test: hypothesis test
Possible if replicates of reference and test are available
Significance of the difference between the reference and test data (level of expression) relative to the observed level of within-class variation (consistency)
Assumptions
– Normal distribution of variables
– Population mean and variance estimated from data => (Student t distribution for H0 hypothesis)
– Not all genes need to have the same variance
– Under the null hypothesis sample means should be equal (rescaling obligatory)

10 Paired t-test (microarray data are paired)
– Consider paired data as a new variable
– Calculate the average ratio
– Calculate the standard deviation of the 4 ratio measurements
– Determine the t-value
– df, Student t distribution, t-value => p-value
The p-value is the probability of obtaining a test statistic at least as extreme as the observed one if the null hypothesis is true; it is not the probability that the null hypothesis is true
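The paired t-test steps can be sketched as follows, using only the Python standard library. The four log ratios are hypothetical values for a single gene, and instead of computing an exact p-value the sketch compares the t-value with the tabulated two-sided 5% critical value for 3 degrees of freedom (3.182).

```python
import math
from statistics import mean, stdev

def paired_t(log_ratios):
    """Return (t-value, degrees of freedom) for H0: mean log ratio = 0."""
    n = len(log_ratios)
    d_bar = mean(log_ratios)
    s = stdev(log_ratios)            # sample standard deviation (n - 1 in the denominator)
    t = d_bar / (s / math.sqrt(n))   # mean divided by its standard error
    return t, n - 1

# Hypothetical log2(test/ref) values for one gene, one per paired measurement.
t_value, df = paired_t([1.1, 0.9, 1.2, 0.8])

# Two-sided 5% critical value of the Student t distribution with 3 degrees of freedom.
T_CRIT_DF3 = 3.182
significant = abs(t_value) > T_CRIT_DF3
```

With only 4 paired measurements there are 3 degrees of freedom, so the critical value is large; a gene needs a consistent change across replicates, not just a large average, to be called significant.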

11 Classical hypothesis tests (t-test, Wilcoxon rank-sum test, ...):
– A test statistic is calculated (t-value)
– The p-value is calculated: the probability that an equally good or better test statistic is generated if a certain null hypothesis is true
– The null hypothesis: the gene has no difference in mean expression levels between 2 conditions
– Low p-value (below rejection level α): null hypothesis is not likely: reject null hypothesis: there is a difference in (mean) expression between the two classes
[Figure: t-test for gene x, showing H0 (D = 0) and H1 (D ≠ 0) with the Type I and Type II errors]

12 Comparison of fold test with paired t-test
Gene expression levels measured under two different conditions
Rejection level α
– p_j < α: null hypothesis rejected (result Positive)
– p_j > α: null hypothesis not rejected (result Negative)
But: multiple testing: Type I and Type II errors = false positives and false negatives

13 SAM
Each gene is assigned a score on the basis of its change in expression relative to the standard deviation of repeated measurements for that gene
H0 (expected relative difference) is estimated by permutation analysis
– Permute the samples
– Calculate d(i) values for both the experimental samples and the permuted control samples
– Rank genes by magnitude of their d(i) values for both the experimental and the permuted control samples

14 SAM
Observed values
– Calculate d(i) value for each gene
– Rank genes according to their d(i) value
Simulated values
– Permute dataset
– Calculate d(i) value for each gene in each permuted dataset
– Calculate average d(i) value for each gene
– Rank d(i) values
Make scatterplot
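The permutation scheme can be sketched roughly as below. This is not the SAM software itself: it is a minimal two-class, unpaired illustration in which the constant `s0`, the number of permutations, the simulated data and all function names are assumptions. It also sorts the d values within each permuted dataset before averaging per rank, which is how the published SAM procedure obtains the expected values.

```python
import math
import random
from statistics import mean

def d_scores(data, labels, s0=0.1):
    """SAM-style relative difference per gene.

    data: list of rows (one per gene), each a list of log expression values per sample.
    labels: list of 0 (reference) / 1 (test), one per sample.
    s0: small positive constant added to the denominator (assumed value).
    """
    scores = []
    for row in data:
        test = [x for x, g in zip(row, labels) if g == 1]
        ref = [x for x, g in zip(row, labels) if g == 0]
        n1, n0 = len(test), len(ref)
        m1, m0 = mean(test), mean(ref)
        # Pooled standard error of the difference between the two means.
        ss = sum((x - m1) ** 2 for x in test) + sum((x - m0) ** 2 for x in ref)
        s = math.sqrt((1 / n1 + 1 / n0) * ss / (n1 + n0 - 2))
        scores.append((m1 - m0) / (s + s0))
    return scores

def expected_d(data, labels, n_perm=200, seed=0):
    """Average the sorted d values over random permutations of the sample labels."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        for rank, d in enumerate(sorted(d_scores(data, shuffled))):
            totals[rank] += d
    return [t / n_perm for t in totals]

# Simulated data for illustration: 50 genes, 4 reference and 4 test samples.
rng = random.Random(1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [[rng.gauss(0, 1) for _ in labels] for _ in range(50)]
# Gene 0 is given a true shift in the test samples.
data[0] = [x + (4 if g == 1 else 0) for x, g in zip(data[0], labels)]

observed = sorted(d_scores(data, labels))
expected = expected_d(data, labels)
# The pairs (expected[i], observed[i]) are the points of the SAM scatterplot.
```

Genes whose observed d value lies far from the diagonal of that scatterplot are the candidates for differential expression; how far is "far" is a separate threshold choice not covered by this sketch.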

15 SAM

16 Comparison of test statistics
Columns: Test statistic | Assumptions | Distribution H0
Test statistics: T-test; Paired t-test; SAM
Assumptions: Errors normally distributed; Errors equal variance (iid); Less stringent assumption; No explicit assumption
Distribution H0: Parametrized: Student t-distribution; Order statistics
Remark: Restricted number of repeat measurements: impossible to evaluate assumption

17

18 Multiple testing: problem
P value: measure of significance in terms of the false positive rate
– The rate at which truly null features are called significant
– Significance is 5%: on average 5% of the truly null features will be called significant (type I error)
Type I error: null hypothesis rejected when it is true
– ‘Accidental’ low p-value
– Falsely declared differentially expressed = false positive
Multiple testing example: 10000 genes with random expression profiles
– α = 5%
– One would find ≈ 500 genes with a p-value lower than 5% = false positives
Type II error: null hypothesis not rejected when it is not true (false negative). A gene that is actually differentially expressed is not declared differentially expressed.
Adapted from De Smet et al
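The 10000-gene example can be checked with a short simulation. The sketch assumes that, when no gene is differentially expressed, p-values are uniformly distributed between 0 and 1; the function name and the random seed are arbitrary.

```python
import random

def count_false_positives(n_genes=10000, alpha=0.05, seed=0):
    """Count p-values below alpha when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniformly distributed on [0, 1].
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)

n_fp = count_false_positives()
```

The count fluctuates around 10000 × 0.05 = 500 from run to run, and every one of those genes is a false positive.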

19 Multiple testing: solutions
Control of the familywise error rate (FWE): P(FP ≥ 1)
– Protection against type I errors
– Bonferroni correction: reject null hypothesis at rejection level α/N, which guarantees that FWE = P(FP ≥ 1) ≤ α
– Is OK when very few genes are expected to be actually differentially expressed (i.e., affected by the difference in conditions / for which the null hypothesis is false): every false positive is ‘costly’
– Rejection rate becomes very conservative
But in microarray data, usually a considerable number of genes is actually differentially expressed: control of the FWE results in a severe loss of statistical power (FN or type II error is large)
In practice we do not have to protect against every possible FP
Better solution: FDR, the false discovery rate
Adapted from De Smet et al
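As a minimal sketch of the Bonferroni correction, the snippet below rejects a null hypothesis only when its p-value falls below α/N. The p-values and the function name are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where H0 is rejected at level alpha / N."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]

# Hypothetical p-values for four genes; the per-gene cutoff is 0.05 / 4 = 0.0125.
p_values = [0.00001, 0.004, 0.03, 0.2]
rejected = bonferroni_reject(p_values)

# For a full array the cutoff shrinks quickly, e.g. with N = 10000 genes:
cutoff_10000 = 0.05 / 10000
```

With N = 10000 the per-gene cutoff is about 5e-6, which shows how conservative the rejection level becomes on a realistic array.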

20 FDR
We need a sensible balance between the number of true positives and the number of false positives
Therefore it is better to control the ‘False Discovery Rate’ (FDR) instead of the FWE:
– The false positive rate: the rate at which truly null features are called significant
– The FDR = % of false positives among all the genes that are declared positive = % of true null hypotheses erroneously rejected among all the null hypotheses rejected
Adapted from De Smet et al

21 Difference p-value and FDR
– 5% FDR: 5% false positives among the features called significant
– 5% p-value cutoff: 5% false positives among all the null features in the dataset; says little about the content of the features actually called significant

22 FDR estimation at rejection level t = p_i
– An estimate of E[S(t)] is the observed S(t): i = the number of observed p-values ≤ p_i
– E[F(t)] = N_0 p_i
– Estimate N_0
[Figure: p-value histogram as a superposition of two distributions. No real differential expression (randomised data set): uniform distribution. Non-accidental differential expression. The rejection level α separates TP and FP from FN and TN.]
Adapted from De Smet et al
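The two quantities on this slide can be combined into a rough FDR estimate, N_0 p_i / i, at each observed p-value. The sketch below is one way to do that under stated assumptions: N_0 defaults to the total number of genes (a conservative choice, since N_0 is unknown), ties between p-values are ignored, and the function name and input values are hypothetical.

```python
def estimated_fdr(p_values, n0=None):
    """Estimate the FDR at each observed p-value used as rejection threshold.

    n0: assumed number of truly null genes; defaults to all genes (conservative).
    Returns a list of (p_value, estimated FDR) pairs sorted by p-value.
    """
    ordered = sorted(p_values)
    if n0 is None:
        n0 = len(ordered)
    result = []
    for i, p in enumerate(ordered, start=1):
        expected_false = n0 * p   # E[F(t)] = N_0 * p_i
        called = i                # observed S(t): number of p-values <= p_i
        result.append((p, min(1.0, expected_false / called)))
    return result

# Hypothetical p-values for five genes.
fdr = estimated_fdr([0.001, 0.01, 0.02, 0.5, 0.9])
```

A smaller estimate of N_0 lowers every FDR estimate proportionally, which is why the slide lists estimating N_0 as a separate step.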

23 Overview
MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
– Slide by slide normalisation
– ANOVA
– Exercises

