Differentially expressed genes

Differentially expressed genes 09/19/07

Identify differentially expressed genes

Fold Change
Based on the expression index, select genes with high fold change (e.g., R/G > 3).
Advantages: intuitive; a larger fold change may indicate greater biological impact.
Drawback: reliable estimates are difficult to get. The 1st approach is sensitive to noise; the 2nd approach loses quantitative information.

Fold change is noisy
[Figure: log-transformed expression in replicate #1 versus log-transformed expression in replicate #2.]
Noise is very high at low intensity.

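SAM
Significance Analysis of Microarrays (SAM) considers a signal-to-noise ratio,
$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0},$$
where $\bar{x}_I(i)$ and $\bar{x}_U(i)$ are the mean expression levels of gene i in conditions I and U, $s(i)$ is the gene-specific scatter (standard deviation of the repeated measurements), and $s_0$ is a small positive constant.
d(i) can be large if either the signal is large or the noise is low. Therefore, it is different from fold change.
Genes are ranked by d(i). The top candidate genes correspond to the most positive or most negative d(i).
(Tusher et al. 2001)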
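A minimal sketch of a SAM-style relative difference for a single gene, assuming the form given by Tusher et al. (2001): the difference of group means divided by a pooled standard error plus a constant s0. The function name `sam_d` and its arguments are hypothetical, and s0 is simply passed in rather than estimated from the data.

```python
import math

def sam_d(group_i, group_u, s0=0.0):
    """Relative difference d = (mean_I - mean_U) / (s + s0) for one gene.

    group_i, group_u: expression values of the gene in conditions I and U.
    s0: small positive constant (assumed given here, not estimated).
    """
    n1, n2 = len(group_i), len(group_u)
    mean_i = sum(group_i) / n1
    mean_u = sum(group_u) / n2
    # Pooled within-group sum of squared deviations
    ss = (sum((x - mean_i) ** 2 for x in group_i)
          + sum((x - mean_u) ** 2 for x in group_u))
    # Gene-specific scatter: pooled standard error of the mean difference
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean_i - mean_u) / (s + s0)

# With s0 = 0 this reduces to the ordinary two-sample t-statistic;
# a positive s0 damps genes whose scatter is very small.
d_plain = sam_d([3, 4, 5], [1, 2, 3])
d_damped = sam_d([3, 4, 5], [1, 2, 3], s0=1.0)
```

Setting s0 > 0 keeps genes with near-zero measured noise from dominating the ranking, which is the main practical difference from a plain t-statistic.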

Permutation test
[Figure: arrays 1 to 6 with their original labels I and U, next to the same arrays with permuted labels “I” and “U”.]
If a gene is expressed at the same level in the I and U conditions, then relabeling the arrays will not affect the distribution of d.

SAM
To test for statistical significance, arrays are randomly permuted.
For each permutation p, compute the statistics and rank them to obtain dp(i).
Calculate the expected value dE(i) as the average of dp(i) over the permutations.
The idea is that for truly differentially expressed genes, d(i) should be greater in magnitude than dE(i).
Select those genes whose d(i) differs from dE(i) by more than a threshold level Δ.

Statistical hypothesis testing
Null hypothesis H0: there is no association between the expression levels and the sample groups.
Alternative hypothesis H1: there is an association.
Differentially expressed genes: those for which the null hypothesis is rejected.
Genes are selected regardless of fold change. The last feature is often not desirable.

Single Hypothesis Testing
Calculate the value of a test statistic.
IF the value is very unlikely given the null hypothesis H0, THEN H0 is rejected and H1 is accepted: the gene is differentially expressed.
ELSE H0 is not rejected: the gene is not called differentially expressed.

Rejection Region
[Figure: density versus t-value, showing the rejection region.]

Two types of errors
[Figure: density versus t-value.]

p-value
The p-value is the probability, assuming the null hypothesis H0 is true, of obtaining a result at least as extreme as the one observed.
It is also the minimum significance level at which H0 would be rejected.

Choice of test statistic
Standard t-test: assuming that the yij are Gaussian distributed, ti follows the Student t-distribution. A p-value is calculated from the t-distribution with 2n-2 degrees of freedom.
Issues:
When n is small, the denominator is an unreliable estimate of the variance.
The assumption that the yij are Gaussian is often violated in real data.

Variance shrinkage Basic idea: The variance at different genes should be correlated. If the data are noisy, then they are likely to be noisy everywhere. Thus one can use the information from other genes to estimate the variance at a given gene.

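Variance shrinkage (Smyth 2004)
Assume
$$s_i^2 \mid \sigma_i^2 \sim \frac{\sigma_i^2}{d_i}\,\chi^2_{d_i} \quad\text{and}\quad \frac{1}{\sigma_i^2} \sim \frac{1}{d_0 s_0^2}\,\chi^2_{d_0},$$
where $d_0$ and $s_0^2$ correspond to the pooled data. Then the posterior mean of $1/\sigma_i^2$ given $s_i^2$ is $1/\tilde{s}_i^2$, with
$$\tilde{s}_i^2 = \frac{d_0 s_0^2 + d_i s_i^2}{d_0 + d_i}.$$
Modify the t-statistic by replacing $s_i^2$ with $\tilde{s}_i^2$.
The new statistic follows a t-distribution with $d_0 + d_i$ degrees of freedom.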
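The shrinkage step can be sketched as a weighted average of the gene-specific variance and the pooled variance, following the form in Smyth (2004). The sketch assumes d0 and s0^2 are already known (in practice they are estimated from all genes, which is not shown), and the function names and the two-group setting with sizes n1 and n2 are illustrative assumptions.

```python
import math

def shrink_variance(s2, d, s0_2, d0):
    """Shrunken variance: weighted average of the gene-specific variance s2
    (d degrees of freedom) and the pooled variance s0_2 (d0 degrees of freedom)."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

def moderated_t(mean_diff, s2, d, s0_2, d0, n1, n2):
    """Two-group t-statistic with s2 replaced by the shrunken variance.

    Returns the statistic and its degrees of freedom, d0 + d.
    """
    s2_tilde = shrink_variance(s2, d, s0_2, d0)
    t = mean_diff / math.sqrt(s2_tilde * (1.0 / n1 + 1.0 / n2))
    return t, d0 + d

# A gene with a suspiciously small variance (0.01) is pulled toward
# the pooled value (0.25), which tempers its t-statistic.
t_mod, df = moderated_t(mean_diff=1.0, s2=0.01, d=4, s0_2=0.25, d0=4, n1=3, n2=3)
```

The weights d0 and d control how strongly each gene is pulled toward the pooled variance: with few replicates (small d) the pooled information dominates.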

Permutation test
[Figure: arrays 1 to 6 with their original labels normal and cancer, next to the same arrays with permuted labels “normal” and “cancer”.]
If H0 is correct, then relabeling the arrays will not affect the distribution of the test statistic.

Permutation p-value
Permutation test: for the b-th permutation, b = 1, …, B,
1. Permute the n columns (array labels) of the data matrix X.
2. Compute test statistics t1,b, …, tm,b, one for each hypothesis Hi (that the i-th gene is not differentially expressed), i = 1, …, m.
The permutation distribution of the test statistic Ti for hypothesis Hi is given by ti,1, …, ti,B.
For two-sided alternative hypotheses, the permutation p-value for hypothesis Hi is
$$p_i = \frac{1}{B}\sum_{b=1}^{B} I\left(|t_{i,b}| \ge |t_i|\right),$$
where $t_i$ is the statistic observed with the original labels and I(.) is the indicator function, equaling 1 if the condition in parentheses is true, and 0 otherwise.
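The procedure above can be sketched for a single gene as follows. This is an illustration under stated assumptions: the statistic is a simple difference of group means (any statistic could be substituted), labels are coded 0/1, permutations are drawn at random rather than enumerated, and the names `mean_diff` and `permutation_pvalue` are hypothetical. For a full data matrix, the same permuted labels would be applied to all genes at once.

```python
import random

def mean_diff(values, labels):
    """Difference of group means; labels are 1 (group 1) or 0 (group 0)."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def permutation_pvalue(values, labels, stat=mean_diff, B=1000, seed=0):
    """Two-sided permutation p-value: the fraction of B random relabelings
    whose |statistic| is at least as large as the observed |statistic|."""
    rng = random.Random(seed)
    t_obs = stat(values, labels)
    perm = list(labels)
    count = 0
    for _ in range(B):
        rng.shuffle(perm)  # relabel the arrays
        if abs(stat(values, perm)) >= abs(t_obs):
            count += 1
    return count / B

# Six arrays, three per group, with a clear separation between groups.
p = permutation_pvalue([10, 11, 12, 1, 2, 3], [1, 1, 1, 0, 0, 0])
```

With only six arrays there are just 20 distinct relabelings, so the smallest achievable two-sided p-value is 2/20 = 0.1; this is why small experiments limit the resolution of permutation p-values.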

Permutation p-value
[Figure: permutation distribution and scaled t-distribution; panels: H0 is correct, H0 is rejected.]

Multiple hypothesis testing
Microarray experiments measure the expression levels of thousands of genes.
The hypothesis testing procedure is applied once for each gene, so a large number of false positives may result.
Example: with a cutoff at p = 0.05 for 6000 genes, about 6000 × 0.05 = 300 genes are expected to be falsely rejected.
If the number of real targets is ~100, then most rejected genes are false targets.
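The arithmetic behind this example can be written out explicitly. The sketch below makes assumptions beyond the slide: null p-values are uniform, all of the roughly 100 real targets are detected, and only the non-target genes can be false positives, which gives 295 rather than the slide's rounded 300. The symbols m_0, m_1, and V are notation introduced here.

```latex
\begin{align*}
m &= 6000, \quad m_1 \approx 100, \quad m_0 = m - m_1 = 5900 \\
E[V] &= 0.05 \times m_0 = 295 \quad \text{(expected false positives)} \\
\frac{E[V]}{E[V] + m_1} &= \frac{295}{395} \approx 0.75 \quad \text{(rough share of rejected genes that are false targets)}
\end{align*}
```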

Bonferroni correction
Let m be the total number of tests.
Reject each hypothesis at level α/m instead of α.
Strong control of FWER.
Too conservative.

Adjusted p-value
The adjusted p-value for a single hypothesis Hj is the nominal level of the entire test procedure at which Hj would just be rejected, given the values of all test statistics involved.
Example: pi = 0.001. If rejecting all hypotheses with cutoff p < pi leads to FDR = 0.2, then the adjusted p-value is 0.2.
The adjusted p-value depends on the specific test procedure.

Adjusted p-value
The adjusted p-value for the Bonferroni correction is $\tilde{p}_j = \min(m\,p_j,\ 1)$, where m is the total number of tests.
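As a small illustration, Bonferroni-adjusted p-values multiply each raw p-value by the number of tests m and cap the result at 1. The function name `bonferroni_adjust` is hypothetical.

```python
def bonferroni_adjust(pvalues):
    """Bonferroni-adjusted p-values: min(m * p, 1) for each raw p-value."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]

# Rejecting genes whose adjusted p-value is below 0.05 is equivalent to
# rejecting those whose raw p-value is below 0.05 / m.
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
```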

False Discovery Rate
Control of the family-wise error rate (FWER) aims at allowing no false positives at all. This is often too stringent in practice.
The false discovery rate (FDR) was proposed by Benjamini and Hochberg (1995). The idea is to allow a few false positives while enhancing the power.

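Control of FDR, BH-procedure
Find the ordered observed p-values, $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$, and let $H_{(i)}$ be the hypothesis corresponding to $p_{(i)}$.
Let k be the largest i for which $p_{(i)} \le \frac{i}{m}\alpha$, where α is the target FDR level.
Reject all $H_{(1)}, \dots, H_{(k)}$.
Strongly controls FDR.
Also weakly controls FWER.
(Benjamini and Hochberg, 1995)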
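A minimal sketch of the Benjamini-Hochberg step-up procedure, assuming the standard threshold i·α/m for the i-th smallest p-value. The function name `benjamini_hochberg` and the example p-values are illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that satisfies the bound
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

# Step-up behaviour: only the largest p-value meets its own threshold,
# yet all four hypotheses are rejected because k = 4.
all_rejected = benjamini_hochberg([0.03, 0.04, 0.045, 0.049], alpha=0.05)
```

Because k is the largest rank that passes, smaller p-values are rejected even if they individually exceed their own thresholds; this is what distinguishes the step-up rule from checking each p-value in isolation.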

Positive false discovery rate (pFDR) Better power than FDR procedure. Estimate

Estimation of p0(t)
Under the null hypothesis, the p-value is uniformly distributed on [0, 1].

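Estimation of p0(t)
Procedure:
Choose 0 < λ < 1.
Assume the pi are uniformly distributed for p > λ, i.e., that p-values above λ come from true null hypotheses.
Then estimate p0 from the fraction of p-values exceeding λ.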
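One common way to turn this procedure into a number is sketched below. It assumes that p0 denotes the proportion of true null hypotheses and uses a Storey-style estimator, #{p_i > λ} / (m(1 − λ)); the original formula is not preserved in the transcript, so this is an assumption rather than the slide's exact expression. The function name `estimate_pi0` and the default λ = 0.5 are illustrative.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true nulls from p-values above lam.

    If null p-values are uniform, a fraction (1 - lam) of them falls above
    lam, so the count above lam is rescaled by m * (1 - lam).
    """
    m = len(pvalues)
    n_above = sum(1 for p in pvalues if p > lam)
    # Cap at 1, since a proportion cannot exceed 1
    return min(1.0, n_above / (m * (1.0 - lam)))

pi0_hat = estimate_pi0([0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.6, 0.7, 0.8, 0.9])
```

A larger λ reduces bias from non-null p-values leaking above the cutoff but leaves fewer p-values to count, so the estimate becomes noisier.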

[Two figures credited to Streinsland.]

Estimation of FDR in SAM
R ≈ #(genes called significant)
V ≈ #(genes called significant in permutation tests)
FDR ≈ V/R
The power of SAM is better than that of fold change criteria.

Data: Apo AI experiment
8 mice in the treatment group (apo AI knockout); 8 mice in the control group (normal).
16 arrays: Cy5 for mRNA from treatment or control mice; Cy3 for mRNA from pooled control mice.
6356 genes.
Goal: detect genes that are differentially expressed between treatment and control mice.

SAM is the least stringent

Cutoff value vs top genes
Each metric can be viewed as a monotonic transformation of another; the only difference is in the cutoff values.
All statistical hypothesis testing methods are equivalent in terms of selecting the top k genes, for a fixed k.