1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.

Presentation transcript:

1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007 Dan Nettleton Iowa State University

2 Myostatin Knockout Mice vs. Wild Type Belgian Blue cattle have a mutation in the myostatin gene.

3 Affymetrix GeneChips on 5 Mice per Genotype [Figure: GeneChips for wild-type (WT) and mutant (M) mice.]

4 The Dataset [Table: one row per gene, with columns Gene ID, Wild Type, Mutant, and p-value ($p_1, p_2, \ldots$); the expression values were not preserved in the transcript.]

5 A Standard Analysis Two-sample t-tests for each gene. Compute p-values by comparing t-statistics to a t-distribution with 8 d.f. Use an adjustment for multiple testing to create a list of genes declared to be differentially expressed.
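The per-gene step of this analysis can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the gene names and expression values are made up, and the p-value uses the closed-form series for the t-distribution with an even number of degrees of freedom (which covers the 8 d.f. case here) so that only the standard library is needed.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group1, group2):
    """Equal-variance two-sample t-statistic."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error


def two_sided_t_pvalue(t, df):
    """P(|T| >= |t|) for a t-distribution with an even number of d.f."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form only handles even df >= 2")
    cos_sq = df / (df + t * t)
    sin_theta = abs(t) / math.sqrt(df + t * t)
    term, series = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (2 * k - 1) / (2 * k) * cos_sq
        series += term
    return 1.0 - sin_theta * series


# Hypothetical expression values: 5 wild-type and 5 mutant mice per gene.
expression = {
    "gene_a": ([7.1, 6.8, 7.4, 7.0, 6.9], [8.2, 8.0, 8.5, 7.9, 8.3]),
    "gene_b": ([5.0, 5.3, 4.8, 5.1, 4.9], [5.1, 4.9, 5.2, 5.0, 4.8]),
}

p_values = {}
for gene, (wild_type, mutant) in expression.items():
    t = pooled_t_statistic(wild_type, mutant)
    df = len(wild_type) + len(mutant) - 2  # 5 + 5 - 2 = 8
    p_values[gene] = two_sided_t_pvalue(t, df)
```

The resulting dictionary of p-values is the input to whatever multiple-testing adjustment is chosen. A gene with zero variance in both groups would cause a division by zero in this sketch and would need separate handling.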

6 Histogram of p-values from the Two-Sample t-Tests [Figure: histogram; axes: p-value, Number of Genes.]

7 Example p-value Distributions Two-sample t-test of $H_0: \mu_1 = \mu_2$ with $n_1 = n_2 = 5$ and variance $= 1$. [Figure: p-value distributions for $\mu_1 - \mu_2 = 1$, $\mu_1 - \mu_2 = 0.5$, and $\mu_1 - \mu_2 = 0$.]

8 Histogram of p-values from the Two-Sample t-Tests [Figure: histogram; axes: p-value, Number of Genes.]

9 False Discovery Rate (FDR) FDR is an error measure that can be useful for multiple testing problems encountered in microarray experiments. FDR was introduced by Benjamini and Hochberg (1995) and is formally defined as follows: $R$ = number of null hypotheses rejected when conducting $m$ tests; $V$ = number of type I errors (false discoveries); $\mathrm{FDR} = E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ otherwise.
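The slide gives only the definition of FDR. As one illustration of a procedure designed to control it, here is a minimal sketch of the step-up rule from the cited Benjamini and Hochberg (1995) paper, which controls FDR at level alpha when the tests are independent. The function name and the example p-values are assumptions made for illustration; this is not the Storey and Tibshirani method used later in the talk.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k with p_(k) <= (k / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            n_reject = rank
    # Reject every hypothesis up to that rank, even if some of the
    # intermediate p-values missed their own thresholds.
    return sorted(order[:n_reject])


# Hypothetical p-values for ten genes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_p, alpha=0.05)
```

Because the rule looks for the largest qualifying rank, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.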

10 A Conceptual Description of FDR Suppose a scientist conducts many independent microarray experiments. For each experiment, the scientist uses a method for producing a list of genes declared to be differentially expressed. For each list consider the ratio of the number of false positive results to the total number of genes on the list (set this ratio to 0 if the list contains no genes). The FDR is approximated by the average of the ratios described above.

11 A Conceptual Description of FDR (continued) Some of the gene lists may contain a high proportion of false positive results and yet the method used may still control FDR at a given level. It is the average performance across repeated experiments that matters.
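The averaging described on these two slides can be sketched in a few lines of Python. The gene names, the set of truly null genes, and the four gene lists are all hypothetical; in a real experiment the true nulls are unknown, so a calculation like this is only possible in simulation.

```python
def false_discovery_proportion(declared, true_nulls):
    """Ratio Q = V / R for one gene list, set to 0 for an empty list."""
    if not declared:
        return 0.0
    false_positives = sum(1 for gene in declared if gene in true_nulls)
    return false_positives / len(declared)


# Hypothetical truth: these genes are not differentially expressed.
true_nulls = {"g3", "g4", "g5", "g6"}

# Hypothetical gene lists from four independent experiments.
gene_lists = [
    ["g1", "g2"],              # no false positives
    ["g1", "g3"],              # half of the list is false positives
    [],                        # empty list, ratio defined as 0
    ["g1", "g2", "g7", "g8"],  # no false positives
]

ratios = [false_discovery_proportion(lst, true_nulls) for lst in gene_lists]
approximate_fdr = sum(ratios) / len(ratios)
```

Here one list has a ratio of 0.5 while the average over the four experiments is 0.125, which mirrors the point that an individual list can be poor even when the average is low.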

12 Number of Genes Declared to be Differentially Expressed for Various Estimated FDR Levels [Table: columns FDR, Number of Genes, P-Value Threshold; the values were not preserved in the transcript.] FDR estimated using the method of Storey and Tibshirani (2003).

13 Using Information about Genes to Interpret the Results of Microarray Experiments Based on a large body of past research, some information is known about many of the genes represented on a microarray. The information might include tissues in which a gene is known to be expressed, the biological process in which a gene’s protein is known to act, or other general or quite specific details about the function of the protein produced by a gene. By examining this information in concert with the results of a microarray experiment, biologists can often gain a greater understanding of their microarray experiments.

14 Gene Ontology (GO) Terms GO terms provide one example of information that is available about genes. The GO project provides three ontologies (structured controlled vocabularies) that describe a gene’s 1. Biological Processes, 2. Cellular Components, and 3. Molecular Functions.

15 Gene Ontology (GO) Terms Each gene may be associated with 0 or more GO terms in a given ontology. The GO terms in each ontology have varying levels of specificity. The GO terms in each ontology can be organized in a directed acyclic graph (DAG) where each node represents a term and arrows point from specific terms to more general terms.

16 Portion of the Biological Processes Ontology Shown in a DAG [Figure: DAG with nodes Biological Process; Cellular Process; Metabolic Process; Cellular Metabolic Process; Primary Metabolic Process; Macromolecule Metabolic Process; Carbohydrate Metabolic Process; Generation of Precursor Metabolites and Energy; Energy Derivation by Oxidation of Organic Compounds; Alcohol Metabolic Process.]

17 Constructing Gene Categories from GO Terms The set of genes associated with any particular GO term could be considered as a category or gene set of interest for subsequent testing. For example, we might ask if genes that are associated with the Molecular Function term muscle alpha-actinin binding are affected by a treatment of interest. We could simultaneously query many groups, general and specific, to better understand the impact of treatment on expression.
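One way to build such categories is to propagate each gene's direct annotations up the DAG, so that a gene annotated with a specific term also belongs to the categories of all more general terms. The sketch below assumes a simple chain of parent links among Molecular Function terms named in this talk and uses made-up gene identifiers; the real ontology has many more terms and links.

```python
# Assumed parent links (specific term -> more general terms).
parents = {
    "muscle alpha-actinin binding": ["alpha-actinin binding"],
    "alpha-actinin binding": ["actinin binding"],
    "actinin binding": ["cytoskeletal protein binding"],
    "cytoskeletal protein binding": ["protein binding"],
    "protein binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical direct annotations; a gene may have zero terms.
direct_annotations = {
    "gene_1": ["muscle alpha-actinin binding"],
    "gene_2": ["actinin binding"],
    "gene_3": ["binding"],
    "gene_4": [],
}


def ancestors(term, parents):
    """All more general terms reachable from `term` in the DAG."""
    found, stack = set(), list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def build_categories(direct_annotations, parents):
    """Map each term to the set of genes annotated to it or to a descendant."""
    categories = {}
    for gene, terms in direct_annotations.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                categories.setdefault(t, set()).add(gene)
    return categories


categories = build_categories(direct_annotations, parents)
```

General terms end up with large categories and specific terms with small ones, which is what makes it possible to query general and specific groups at the same time.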

18 Simultaneous Testing of Multiple Categories with Various Levels of Specificity [Figure: DAG with nodes molecular function; binding; protein binding; actinin binding; cytoskeletal protein binding; alpha-actinin binding; muscle alpha-actinin binding; enzyme binding; ATPase binding; RNA polymerase core enzyme binding; myosin binding; beta-actinin binding.]

19 Some Formal Methods for Testing Gene Categories with Microarray Data Fisher’s exact test on lists of genes declared to be differentially expressed (DDE); Gene Set Enrichment Analysis (GSEA); Significance Analysis of Function and Expression (SAFE); Pathway Level Analysis of Gene Expression (PLAGE); Domain Enhanced Analysis (DEA); many others appearing and soon to appear.

20 Number of Genes Declared to be Differentially Expressed for Various Estimated FDR Levels [Table: columns FDR, Number of Genes, P-Value Threshold; the values were not preserved in the transcript.] FDR estimated using the method of Storey and Tibshirani (2003).

21 Are genes of category X overrepresented among the genes declared to be differentially expressed? [Table: 2 x 2 table of Gene of Category X? (yes/no) by Declared to be Differentially Expressed? (yes/no); the counts were not preserved in the transcript.] Highly significant overrepresentation according to a chi-square test or Fisher’s exact test.
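The overrepresentation question can be answered with a one-sided Fisher's exact test, which is an upper-tail probability of the hypergeometric distribution. The sketch below uses only the standard library; the function name, argument names, and example counts are hypothetical, since the counts from the slide are not available.

```python
from math import comb


def overrepresentation_p_value(n_genes, n_in_category, n_declared, n_overlap):
    """One-sided Fisher's exact test p-value.

    Probability of seeing at least `n_overlap` category genes among the
    `n_declared` genes declared differentially expressed, if those genes
    were drawn at random from all `n_genes` genes.
    """
    max_overlap = min(n_in_category, n_declared)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_declared - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / comb(n_genes, n_declared)


# Hypothetical counts: 1000 genes on the array, 40 in category X,
# 100 declared differentially expressed, 12 of those in category X.
p_value = overrepresentation_p_value(1000, 40, 100, 12)
```

With these counts only 4 category genes would be expected on the list by chance (100 x 40 / 1000), so an overlap of 12 yields a small p-value.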

22 Problems with Chi-Square or Fisher’s Exact Test for Detecting Overrepresentation The outcome of the overrepresentation test depends on the significance threshold used to declare genes differentially expressed. Functional categories in which many genes exhibit small changes may go undetected. Genes are not independent, so a key assumption of the chi-square and Fisher’s exact tests is violated. Information in the multivariate distribution of genes in a category is not utilized.

23 Advantage of a Multivariate Approach [Figure: scatter plot; axes: Gene 1 Expression, Gene 2 Expression.]

24 Multiresponse Permutation Procedure (MRPP) Mielke and Berry (2001). Permutation Methods: A Distance Function Approach. Springer, N.Y. A nonparametric test for a difference among multivariate distributions. The test statistic is based on within-group inter-point distances. The p-value is obtained by data permutation.

25 The MRPP Test: An Illustrative Example For balanced data, the test statistic is the sum of within-group inter-point distances. [Figure: scatter plot; axes: Expression Measure for Gene 1, Expression Measure for Gene 2.]

26 The test statistic is computed for all permutations of the data. [Figure: scatter plot; axes: Expression Measure for Gene 1, Expression Measure for Gene 2.]

27 An Example Permutation [Figure: scatter plot; axes: Expression Measure for Gene 1, Expression Measure for Gene 2.]

28 Test statistic will be larger for permutations than for the original data. [Figure: scatter plot; axes: Expression Measure for Gene 1, Expression Measure for Gene 2.] $\text{Permutation p-value} = \dfrac{\#(T_{\text{original}} \ge T_{\text{permutation}})}{\#\,\text{permutations}} = \dfrac{2}{252}$
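The balanced two-group case illustrated on these slides can be sketched as follows. The points are hypothetical expression measures for two genes in 5 wild-type and 5 mutant mice, the distance is assumed to be Euclidean, and the function names are made up for this example. With 10 observations split 5 and 5 there are 252 ways to assign the group labels, and all of them are enumerated.

```python
import math
from itertools import combinations


def within_group_distance_sum(points):
    """Sum of Euclidean distances over all pairs of points in one group."""
    return sum(math.dist(a, b) for a, b in combinations(points, 2))


def mrpp_statistic(points, group1_indices):
    """Balanced-data statistic: total within-group inter-point distance."""
    chosen = set(group1_indices)
    group1 = [points[i] for i in sorted(chosen)]
    group2 = [p for i, p in enumerate(points) if i not in chosen]
    return within_group_distance_sum(group1) + within_group_distance_sum(group2)


def mrpp_permutation_test(group1, group2):
    """Return (# relabelings with T <= T_original, # relabelings)."""
    points = list(group1) + list(group2)
    n1 = len(group1)
    t_original = mrpp_statistic(points, range(n1))
    n_as_small = n_permutations = 0
    for indices in combinations(range(len(points)), n1):
        n_permutations += 1
        # Small tolerance guards against floating-point ties.
        if mrpp_statistic(points, indices) <= t_original + 1e-12:
            n_as_small += 1
    return n_as_small, n_permutations


# Hypothetical (gene 1, gene 2) expression measures per mouse.
wild_type = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.9), (2.0, 1.5), (1.4, 1.4)]
mutant = [(5.1, 5.0), (5.6, 4.4), (4.8, 5.7), (6.0, 5.2), (5.3, 5.5)]

count, total = mrpp_permutation_test(wild_type, mutant)
p_value = count / total
```

Because these two groups are well separated, only the original labeling and its mirror image (the two group labels swapped) give a statistic as small as the observed one, which reproduces the 2/252 on the slide.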

29 Portion of Directed Acyclic Graph of GO Molecular Function Terms