1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.


1 1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007 Dan Nettleton Iowa State University

2 2 Myostatin Knockout Mice vs. Wild Type Belgian Blue cattle have a mutation in the myostatin gene.

3 3 Affymetrix GeneChips on 5 Mice per Genotype [Figure: one GeneChip per mouse, labeled WT (wild type) or M (mutant)]

4 4 The Dataset
Gene ID   Wild Type                                 Mutant                                    p-value
1         4835.8  4578.2  4856.3  4483.7  4275.3    4170.7  3836.9  3901.8  4218.4  4094.0    p1
2          153.9   161.0   139.7   173.0   160.1     180.1   265.1   201.2   130.8   130.7    p2
3         3546.5  3622.7  3364.3  3433.6  2757.2    3346.9  2723.8  2892.0  3021.3  2452.7    p3
4          711.3   717.3   776.6   787.5   750.3     910.2   813.3   687.9   811.1   695.6    p4
5          126.3   178.2   114.5   158.7   157.3     231.7   147.0   102.8   157.6   146.8    p5
6         4161.8  4622.9  3795.7  4501.2  4265.8    3931.3  3327.6  3726.7  4003.0  3906.8    p6
7          419.3   555.3   509.6   515.5   488.9     426.6   425.8   500.8   347.8   580.3    p7
8         2420.7  2616.1  2768.7  2663.7  2264.6    2379.7  2196.2  2491.3  2710.0  2759.1    p8
9          321.5   540.6   471.9   348.2   356.6     382.5   375.9   481.5   260.6   515.7    p9
10        1061.4   949.4  1236.8  1034.7   976.8    1059.8   903.6  1060.3   960.1  1134.5    p10
11        1293.3  1147.7  1173.8  1173.9  1274.2    1062.8  1172.1  1113.0  1432.1  1012.4    p11
12         336.1   413.5   425.2   462.8   412.2     391.7   388.1   363.7   310.8   404.6    p12
13         325.2   278.9   242.8   255.6   283.5     161.1   181.0   222.0   279.3   232.9    p13
...
22690      249.6   283.6   271.0   246.9   252.7     214.2   217.9   266.6   193.7   413.2    p22690

5 5 A Standard Analysis Two-sample t-tests for each gene. Compute p-values by comparing t-statistics to a t-distribution with 8 d.f. Use an adjustment for multiple testing to create a list of genes declared to be differentially expressed.
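The per-gene test described on this slide can be sketched in a few lines. This is a minimal illustration, not the code used for the talk: it assumes the equal-variance (pooled) two-sample t-test, which is what gives 5 + 5 - 2 = 8 d.f., and it obtains the p-value by numerically integrating the t density with Simpson's rule so that only the Python standard library is needed. The function names and the choice of raw (untransformed) expression values are assumptions.

```python
from math import gamma, pi, sqrt
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Equal-variance two-sample t-statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(x) - mean(y)) / se, df


def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, 1 - P(|T| <= |t|), via Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central = 2 * total * h / 3  # integral over [-|t|, |t|] by symmetry
    return max(0.0, 1.0 - central)


# Gene 13 from the dataset slide: first five values wild type, last five mutant
# (ASSUMPTION: that column order, and that raw values are analyzed).
wild_type = [325.2, 278.9, 242.8, 255.6, 283.5]
mutant = [161.1, 181.0, 222.0, 279.3, 232.9]
t_stat, df = pooled_t_statistic(wild_type, mutant)
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")
```

Repeating this for every gene yields the 22690 p-values that the later slides summarize in a histogram.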

6 6 Histogram of p-values from the Two-Sample t-Tests [Figure: histogram; x-axis: p-value, y-axis: Number of Genes]

7 7 Example p-value Distributions Two-sample t-test of H0: μ1 = μ2 with n1 = n2 = 5 and variance = 1. [Figure: three p-value histograms, for μ1 − μ2 = 1, μ1 − μ2 = 0.5, and μ1 − μ2 = 0]

8 8 Histogram of p-values from the Two-Sample t-Tests [Figure: histogram; x-axis: p-value, y-axis: Number of Genes]

9 9 False Discovery Rate (FDR) FDR is an error measure that can be useful for multiple testing problems encountered in microarray experiments. FDR was introduced by Benjamini and Hochberg (1995) and is formally defined as follows:
R = number of rejected null hypotheses when conducting m tests
V = number of type I errors (false discoveries)
FDR = E(Q), where Q = V/R if R > 0 and Q = 0 otherwise.

10 10 A Conceptual Description of FDR Suppose a scientist conducts many independent microarray experiments. For each experiment, the scientist uses a method for producing a list of genes declared to be differentially expressed. For each list consider the ratio of the number of false positive results to the total number of genes on the list (set this ratio to 0 if the list contains no genes). The FDR is approximated by the average of the ratios described above.

11 11 A Conceptual Description of FDR (continued) Some of the gene lists may contain a high proportion of false positive results and yet the method used may still control FDR at a given level. It is the average performance across repeated experiments that matters.
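The "average over many experiments" idea can be made concrete with a small simulation. This is a sketch under stated assumptions, not something taken from the slides: the list-producing method is taken to be the Benjamini-Hochberg step-up procedure, each "experiment" is a hypothetical set of 200 independent one-sided z-tests of which 40 have a true mean shift of 3, and all parameter values and function names are illustrative.

```python
import random
from statistics import NormalDist


def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up: indices of hypotheses rejected at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank  # largest rank whose p-value is under its threshold
    return set(order[:k])


def simulate_fdr(n_experiments=1000, m=200, m_alt=40, shift=3.0, alpha=0.05, seed=1):
    """Average and worst-case false discovery proportion Q = V/R over experiments."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    ratios = []
    for _ in range(n_experiments):
        # Genes 0..m_alt-1 are truly differentially expressed (hypothetical setup).
        pvalues = []
        for g in range(m):
            z = rng.gauss(shift if g < m_alt else 0.0, 1.0)
            pvalues.append(1.0 - std_normal.cdf(z))
        rejected = bh_reject(pvalues, alpha)
        false_positives = sum(1 for g in rejected if g >= m_alt)
        ratios.append(false_positives / len(rejected) if rejected else 0.0)
    return sum(ratios) / len(ratios), max(ratios)


average_q, worst_q = simulate_fdr()
print(f"average Q = {average_q:.3f}, largest single-experiment Q = {worst_q:.3f}")
```

The average of the ratios approximates the FDR, while the largest single ratio shows how far an individual gene list can stray from that average.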

12 12 Number of Genes Declared to be Differentially Expressed for Various Estimated FDR Levels
FDR     Number of Genes   P-Value Threshold
0.01          8           0.000003
0.05        313           0.000900
0.10        748           0.004339
0.15       1465           0.012730
0.20       2143           0.024909
FDR estimated using the method of Storey and Tibshirani (2003).
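A table like this one comes from converting p-values to estimated FDR levels (q-values) and counting how many genes fall below each level. The sketch below is a simplified stand-in, not the method as published: Storey and Tibshirani (2003) estimate the proportion of true nulls with a smoother over a grid of tuning values, whereas this version assumes a single fixed tuning value `lam = 0.5`. The function names and the toy p-values are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Fixed-lambda estimate of the proportion of true null hypotheses."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))


def q_values(pvalues, lam=0.5):
    """Estimated q-value for each p-value (same order as the input)."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


def genes_at_fdr(pvalues, level, lam=0.5):
    """Number of genes whose estimated q-value is at most the given FDR level."""
    return sum(q <= level for q in q_values(pvalues, lam))


# Toy p-values (hypothetical), standing in for the 22690 per-gene p-values.
toy = [0.001, 0.002, 0.003, 0.6, 0.8]
for level in (0.01, 0.05, 0.10):
    print(level, genes_at_fdr(toy, level))
```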

13 13 Using Information about Genes to Interpret the Results of Microarray Experiments Based on a large body of past research, some information is known about many of the genes represented on a microarray. The information might include tissues in which a gene is known to be expressed, the biological process in which a gene’s protein is known to act, or other general or quite specific details about the function of the protein produced by a gene. By examining this information in concert with the results of a microarray experiment, biologists can often gain a greater understanding of their microarray experiments.

14 14 Gene Ontology (GO) Terms GO terms provide one example of information that is available about genes. The GO project provides three ontologies (structured controlled vocabularies) that describe a gene’s 1. Biological Processes, 2. Cellular Components, and 3. Molecular Functions.

15 15 Gene Ontology (GO) Terms Each gene may be associated with 0 or more GO terms in a given ontology. The GO terms in each ontology have varying levels of specificity. The GO terms in each ontology can be organized in a directed acyclic graph (DAG) where each node represents a term and arrows point from specific terms to more general terms.

16 16 Portion of the Biological Processes Ontology Shown in a DAG [Figure: DAG with nodes Biological Process; Cellular Process; Metabolic Process; Cellular Metabolic Process; Primary Metabolic Process; Macromolecule Metabolic Process; Carbohydrate Metabolic Process; Generation of Precursor Metabolites and Energy; Energy Derivation by Oxidation of Organic Compounds; Alcohol Metabolic Process]

17 17 Constructing Gene Categories from GO Terms The set of genes associated with any particular GO term could be considered as a category or gene set of interest for subsequent testing. For example, we might ask if genes that are associated with the Molecular Function term muscle alpha-actinin binding are affected by a treatment of interest. We could simultaneously query many groups, general and specific, to better understand the impact of treatment on expression.
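One way to build such categories is to attach each gene to its directly annotated terms and to every more general term reachable in the DAG, so that both specific and general gene sets are available for testing. The sketch below assumes that upward propagation (a common convention, not stated on the slide). The edge list is an illustrative subset of the Molecular Function terms named in this talk rather than an extract from a GO release, and the gene IDs and annotations are hypothetical.

```python
from collections import defaultdict

# Child -> parents edges; arrows point from specific terms to more general ones.
# ASSUMPTION: illustrative edges only, not loaded from an actual GO release.
PARENTS = {
    "muscle alpha-actinin binding": ["alpha-actinin binding"],
    "alpha-actinin binding": ["actinin binding"],
    "actinin binding": ["cytoskeletal protein binding"],
    "myosin binding": ["cytoskeletal protein binding"],
    "cytoskeletal protein binding": ["protein binding"],
    "protein binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# ASSUMPTION: hypothetical genes with their direct GO annotations.
DIRECT = {
    "geneA": ["muscle alpha-actinin binding"],
    "geneB": ["myosin binding"],
    "geneC": ["protein binding"],
    "geneD": [],  # a gene may have zero terms in an ontology
}


def ancestors(term, parents):
    """All more general terms reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_categories(direct, parents):
    """Map each GO term to the set of genes annotated to it or to a descendant."""
    categories = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                categories[t].add(gene)
    return dict(categories)


categories = build_categories(DIRECT, PARENTS)
for term in ("muscle alpha-actinin binding", "cytoskeletal protein binding", "protein binding"):
    print(term, sorted(categories[term]))
```

Each resulting gene set, from the most specific term to the root, is a candidate category for the tests discussed on the following slides.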

18 18 Simultaneous Testing of Multiple Categories with Various Levels of Specificity [Figure: DAG with nodes molecular function; binding; protein binding; actinin binding; cytoskeletal protein binding; alpha-actinin binding; muscle alpha-actinin binding; enzyme binding; ATPase binding; RNA polymerase core enzyme binding; myosin binding; beta-actinin binding]

19 19 Some Formal Methods for Testing Gene Categories with Microarray Data
Fisher’s exact test on lists of genes declared to be differentially expressed (DDE)
Gene Set Enrichment Analysis (GSEA)
Significance Analysis of Function and Expression (SAFE)
Pathway Level Analysis of Gene Expression (PLAGE)
Domain Enhanced Analysis (DEA)
Many others appearing and soon to appear

20 20 Number of Genes Declared to be Differentially Expressed for Various Estimated FDR Levels
FDR     Number of Genes   P-Value Threshold
0.01          8           0.000003
0.05        313           0.000900
0.10        748           0.004339
0.15       1465           0.012730
0.20       2143           0.024909
FDR estimated using the method of Storey and Tibshirani (2003).

21 21 Are genes of category X overrepresented among the genes declared to be differentially expressed?
                        Declared to be Differentially Expressed?
Gene of Category X?     yes       no        total
yes                      50       250         300
no                       50     19650       19700
total                   100     19900       20000
Highly significant overrepresentation according to a chi-square test or Fisher’s exact test.
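The strength of the overrepresentation in this table can be checked directly. The following is a minimal sketch using only the standard library: the one-sided Fisher p-value is the hypergeometric probability of seeing 50 or more category X genes among the 100 declared genes, and the chi-square statistic is the usual sum of (observed - expected)^2 / expected without continuity correction. The function names are illustrative, and the assignment of rows and columns is an assumption; both quantities are unchanged if the table is transposed.

```python
from fractions import Fraction
from math import comb


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact p-value P(X >= a) for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b          # genes in category X
    col1 = a + c          # genes declared differentially expressed
    n = a + b + c + d     # all genes
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    # Exact rational arithmetic avoids overflow from the very large binomials.
    return float(Fraction(numerator, comb(n, col1)))


def chi_square_statistic(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    observed = [[a, b], [c, d]]
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    n = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


p_fisher = fisher_upper_tail(50, 250, 50, 19650)
chi2 = chi_square_statistic(50, 250, 50, 19650)
print(f"Fisher one-sided p = {p_fisher:.3e}, chi-square = {chi2:.1f}")
```

With only 300 × 100 / 20000 = 1.5 category X genes expected on the list by chance, observing 50 is far in the tail, which is why both tests report overwhelming significance.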

22 22 Problems with Chi-Square or Fisher’s Exact Test for Detecting Overrepresentation The outcome of the overrepresentation test depends on the significance threshold used to declare genes differentially expressed. Functional categories in which many genes exhibit small changes may go undetected. Genes are not independent, so a key assumption of the chi-square and Fisher’s exact tests is violated. Information in the multivariate distribution of genes in a category is not utilized.

23 23 Advantage of a Multivariate Approach [Figure: scatter plot; x-axis: Gene 1 Expression, y-axis: Gene 2 Expression]

24 24 Multiresponse Permutation Procedure (MRPP) Mielke and Berry, (2001). Permutation Methods: A Distance Function Approach. Springer, N.Y. Nonparametric test for a difference among multivariate distributions Test statistic based on within-group inter-point distances P-value obtained by data permutation

25 25 The MRPP Test: An Illustrative Example For balanced data, test statistic is sum of within-group inter-point distances. [Figure: scatter plot; x-axis: Expression Measure for Gene 1, y-axis: Expression Measure for Gene 2]

26 26 The test statistic is computed for all permutations of the data. [Figure: scatter plot; x-axis: Expression Measure for Gene 1, y-axis: Expression Measure for Gene 2]

27 27 An Example Permutation [Figure: scatter plot; x-axis: Expression Measure for Gene 1, y-axis: Expression Measure for Gene 2]

28 28 Test statistic will be larger for permutations than for the original data. [Figure: scatter plot; x-axis: Expression Measure for Gene 1, y-axis: Expression Measure for Gene 2] Permutation p-value = #(T_original ≥ T_permutation) / #permutations = 2/252
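The illustrative MRPP example can be reproduced with a short exhaustive permutation test. This is a sketch for the balanced two-group case only, with hypothetical two-gene expression values chosen so the groups are well separated; it uses Euclidean distance and the plain sum of within-group inter-point distances as the statistic, and the function names are assumptions. With 5 observations per group there are C(10, 5) = 252 ways to relabel the data.

```python
from itertools import combinations
from math import dist


def within_group_sum(points, indices):
    """Sum of pairwise Euclidean distances among the points in one group."""
    return sum(dist(points[i], points[j]) for i, j in combinations(indices, 2))


def mrpp_statistic(points, group1):
    """Balanced-data MRPP statistic: total within-group inter-point distance."""
    members = set(group1)
    group2 = [i for i in range(len(points)) if i not in members]
    return within_group_sum(points, sorted(members)) + within_group_sum(points, group2)


def mrpp_permutation_test(points, group1, tol=1e-12):
    """Count relabelings whose statistic is <= the observed one; return (count, total)."""
    t_observed = mrpp_statistic(points, group1)
    count = total = 0
    for candidate in combinations(range(len(points)), len(group1)):
        total += 1
        if mrpp_statistic(points, candidate) <= t_observed + tol:
            count += 1
    return count, total


# Hypothetical (gene 1, gene 2) expression pairs: indices 0-4 are one genotype,
# indices 5-9 the other.
points = [
    (1.0, 1.2), (1.3, 0.9), (0.8, 1.1), (1.1, 1.4), (1.2, 1.0),
    (5.0, 5.1), (5.2, 4.8), (4.9, 5.3), (5.1, 5.0), (5.3, 5.2),
]
count, total = mrpp_permutation_test(points, group1=range(5))
print(f"permutation p-value = {count}/{total}")
```

For clusters this well separated, only the observed labeling and its mirror image (the two group labels swapped) attain the smallest statistic, so the sketch yields 2/252, the same value as in the slide's example.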

29 29 Portion of Directed Acyclic Graph of GO Molecular Function Terms

