Significance Tests: P-values and Q-values

Presentation transcript:

Significance Tests P-values and Q-values

Outline
- Statistical significance in multiple testing
- Empirical distribution of test statistics
- Family-wide p-values
- Correlation and p-values
- False discovery rates

Tests and Test Statistics
- The t-test is fairly robust to skew, but not robust to outliers (“thick tails” of the distribution)
- Non-parametric tests are robust, but lose too much ability to detect differences (power)
- Robust tests can be useful
- Permutation tests are simple and easy to program
- Some authors use a modified statistic [formula not preserved] rather than the standard one [formula not preserved], to reduce the number of low fold-changes among the highly significant scores
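
To illustrate the claim that permutation tests are easy to program, here is a minimal sketch of a two-sided permutation test on the difference in group means. The function name `permutation_p_value`, the sample values, and the choice of the mean difference as the statistic are assumptions made for illustration; they are not taken from the slides.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign the group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values for one gene in two conditions.
control = [5.1, 4.9, 5.3, 5.0, 5.2]
treated = [6.0, 6.4, 5.9, 6.2, 6.1]
p = permutation_p_value(control, treated)
print(f"permutation p-value: {p:.4f}")
```

With only five samples per group there are just 252 distinct label assignments, so the smallest attainable two-sided p-value is about 0.008 no matter how large the effect is.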

Distribution of test statistics
[Figure: quantile plots of t-statistics. Left: random distribution; right: experiment]

Distribution of Set of p-values

Multiple comparisons
- Suppose 10,000 genes on a chip
- None actually differentially expressed
- Each gene has a 5% chance of exceeding the threshold score for a p-value of 0.05
  - This is the definition of Type I error
- On average, 500 genes should exceed the 0.05 threshold ‘by chance’
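
The arithmetic above (10,000 × 0.05 = 500) can be checked with a small simulation. This sketch assumes independent tests and relies on the fact that a p-value is uniformly distributed on [0, 1] under the null hypothesis; the variable names are illustrative.

```python
import random

N_GENES = 10_000
ALPHA = 0.05

rng = random.Random(1)
# Under the null hypothesis a p-value is uniformly distributed on [0, 1].
null_p_values = [rng.random() for _ in range(N_GENES)]

n_false_positives = sum(1 for p in null_p_values if p < ALPHA)
expected = N_GENES * ALPHA
print(f"expected about {expected:.0f} false positives, observed {n_false_positives}")
```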

Family-Wide Error Rate
- ‘Corrected’ p-value: probability of finding at least one false positive among all N tests
- Normally all tests are run at the same threshold
- Simplest correction (Bonferroni): p_i* = min(N p_i, 1)
- Fairly close to the true false positive rate in simulations of independent tests
- Too conservative in practice!
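
The Bonferroni correction is a one-line computation. A minimal sketch, with a hypothetical function name and made-up raw p-values:

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(n * p, 1.0) for p in p_values]


# Hypothetical raw p-values for four tests.
raw = [0.00001, 0.0004, 0.01, 0.2]
adjusted = bonferroni(raw)
for p, p_adj in zip(raw, adjusted):
    print(f"raw {p:g} -> adjusted {p_adj:g}")
```

With 10,000 genes, a raw p-value must fall below 0.05 / 10,000 = 0.000005 to survive at a family-wide level of 0.05, which is why the correction is often too conservative for expression data.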

P-Values from Correlated Genes
[Figure: three panels showing the null distribution from independent genes, from perfectly correlated genes, and from highly correlated genes. Rows: genes; columns: samples; entries: p-values from randomized distribution]

The Effect of Correlation
- If all genes are uncorrelated, the Sidak correction is exact
- If all genes were perfectly correlated:
  - p-values for one are p-values for all
  - No multiple-comparisons correction is needed
- Typical gene data is highly correlated
  - The first component of the SVD may account for more than half the variance
- More sensitive tests are possible if we can generate the joint null distribution of p-values
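
The slide names the Sidak correction without stating it. For N independent tests the adjusted p-value is 1 − (1 − p)^N, and the per-test threshold that gives a family-wide level α is 1 − (1 − α)^(1/N). The sketch below checks by simulation that this threshold gives the intended family-wide error rate when tests are independent; the function names and simulation sizes are assumptions chosen for illustration.

```python
import random


def sidak_adjust(p_values):
    """Sidak-adjusted p-values for independent tests."""
    n = len(p_values)
    return [1.0 - (1.0 - p) ** n for p in p_values]


def simulated_fwer(n_tests, per_test_alpha, n_trials=5000, seed=2):
    """Fraction of all-null experiments with at least one p below the threshold."""
    rng = random.Random(seed)
    trials_with_fp = 0
    for _ in range(n_trials):
        if any(rng.random() < per_test_alpha for _ in range(n_tests)):
            trials_with_fp += 1
    return trials_with_fp / n_trials


ALPHA, N_TESTS = 0.05, 20
sidak_alpha = 1.0 - (1.0 - ALPHA) ** (1.0 / N_TESTS)
bonferroni_alpha = ALPHA / N_TESTS

fwer_sidak = simulated_fwer(N_TESTS, sidak_alpha)
fwer_bonf = simulated_fwer(N_TESTS, bonferroni_alpha)
print(f"Sidak threshold {sidak_alpha:.6f}: simulated FWER {fwer_sidak:.3f}")
print(f"Bonferroni threshold {bonferroni_alpha:.6f}: simulated FWER {fwer_bonf:.3f}")
```

The Sidak threshold is slightly larger than the Bonferroni threshold, so for independent tests Bonferroni is only marginally conservative; the real loss of power comes from correlation between genes.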

Re-formulating the Question
- Independent: ~5% of genes exceed the 0.05 threshold, all the time
- Perfectly correlated: all genes exceed the 0.05 threshold, ~5% of the time
- Realistically correlated: a fraction f_1 of genes (0.05 < f_1 < 1) exceeds the 0.05 threshold in a fraction f_2 of the cases (0.05 < f_2 < 1)
- New question: for a given f_1 and α, how likely is it that a fraction f_1 of genes will exceed the α threshold?
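
The three regimes can be compared by simulation. This sketch assumes a simple one-factor model, in which each null gene score is a standard normal variable built from one shared factor plus independent noise so that any two genes have correlation `rho`. The model, the function names, and the simulation sizes are illustrative assumptions, not the slides' method.

```python
import math
import random


def fraction_below_alpha(rho, rng, n_genes=500, alpha=0.05):
    """Fraction of null genes with p < alpha in one simulated experiment."""
    shared = rng.gauss(0.0, 1.0)
    n_sig = 0
    for _ in range(n_genes):
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        if p < alpha:
            n_sig += 1
    return n_sig / n_genes


def simulate(rho, n_experiments=400, seed=3):
    """Fractions of significant genes over many simulated null experiments."""
    rng = random.Random(seed)
    return [fraction_below_alpha(rho, rng) for _ in range(n_experiments)]


independent = simulate(0.0)
perfect = simulate(1.0)
partial = simulate(0.8)

for name, fracs in [("independent", independent),
                    ("rho = 0.8", partial),
                    ("perfectly correlated", perfect)]:
    mean_frac = sum(fracs) / len(fracs)
    print(f"{name}: mean fraction {mean_frac:.3f}, largest fraction {max(fracs):.3f}")
```

The mean fraction stays near 0.05 in every regime, but its spread across experiments grows with correlation, which is what makes the joint null distribution informative.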

Step-Down p-Values
- Calculate single-step p-values for genes: p_1, …, p_N
- Order the smallest k p-values: p_(1), …, p_(k)
- For each k, ask: how likely are we to get k p-values less than p_(k) if no differences are real?
- Generate the null distribution by permutations
- Yields more significant genes, at the same level of Type I error, than single-step procedures
- See Ge et al., Test, 2003
- Bioconductor package multtest
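
As a rough illustration of a permutation-based step-down adjustment, the sketch below implements the step-down maxT variant, which works on test statistics rather than on p-values and is therefore cheaper than the p-value version outlined above. It is a simplified sketch under stated assumptions (Welch t-statistic, random rather than exhaustive permutations, simulated data), not the multtest implementation. All function names and data are hypothetical.

```python
import math
import random


def t_stat(a, b):
    """Welch t-statistic for two lists of values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))


def abs_t_per_gene(data, labels):
    """|t| for every gene; data[g][s] is gene g in sample s, labels[s] is 0 or 1."""
    result = []
    for row in data:
        a = [x for x, lab in zip(row, labels) if lab == 0]
        b = [x for x, lab in zip(row, labels) if lab == 1]
        result.append(abs(t_stat(a, b)))
    return result


def step_down_max_t(data, labels, n_perm=300, seed=4):
    """Step-down maxT adjusted p-values from random label permutations."""
    rng = random.Random(seed)
    observed = abs_t_per_gene(data, labels)
    n_genes = len(data)
    # Gene indices ordered from most to least significant.
    order = sorted(range(n_genes), key=lambda g: -observed[g])
    counts = [0] * n_genes
    perm_labels = list(labels)
    for _ in range(n_perm):
        # Permuting sample labels keeps the correlation between genes intact.
        rng.shuffle(perm_labels)
        perm_t = abs_t_per_gene(data, perm_labels)
        # Successive maxima: rank k competes only with genes ranked k or lower.
        running_max = 0.0
        for k in range(n_genes - 1, -1, -1):
            running_max = max(running_max, perm_t[order[k]])
            if running_max >= observed[order[k]]:
                counts[k] += 1
    adjusted = [0.0] * n_genes
    previous = 0.0
    for k, g in enumerate(order):
        previous = max(previous, counts[k] / n_perm)  # enforce monotonicity
        adjusted[g] = previous
    return adjusted


# Simulated data: 30 genes, 5 + 5 samples, the first 3 genes truly shifted.
data_rng = random.Random(5)
labels = [0] * 5 + [1] * 5
data = [[data_rng.gauss(6.0 if (g < 3 and lab == 1) else 0.0, 1.0) for lab in labels]
        for g in range(30)]
adjusted = step_down_max_t(data, labels)
print("adjusted p-values of the shifted genes:", [round(p, 3) for p in adjusted[:3]])
```

The step-down gain comes from the successive maxima: the most significant gene is compared against the maximum over all genes, while lower-ranked genes are compared against the maximum over fewer genes.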

False Discovery Rate
- At threshold t*, what fraction of the genes called significant are likely to be true positives? (The FDR is the complementary fraction of false positives.)
- Illustration: 10,000 independent genes
  [Table with columns t, p, #sig, E(FP), FDR*; the values were not preserved in the transcript]
- In practice, use a permutation algorithm to compute the FDR
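
The columns of the illustration are related by simple arithmetic: with N independent null genes, E(FP) = N × p, and the estimated FDR is E(FP) divided by the number of genes called significant. The sketch below shows that calculation. The counts of significant genes are hypothetical and are not the values from the original table.

```python
N_GENES = 10_000


def estimated_fdr(p_threshold, n_significant, n_genes=N_GENES):
    """Return (E(FP), E(FP) / #significant), treating every gene as null."""
    expected_fp = n_genes * p_threshold
    if n_significant == 0:
        return expected_fp, 0.0
    return expected_fp, min(expected_fp / n_significant, 1.0)


# Hypothetical counts of significant genes at three thresholds.
for p_thr, n_sig in [(0.01, 300), (0.001, 100), (0.0001, 40)]:
    e_fp, fdr = estimated_fdr(p_thr, n_sig)
    print(f"p < {p_thr:g}: #sig = {n_sig}, E(FP) = {e_fp:.0f}, FDR ~ {fdr:.1%}")
```

Treating every gene as null overestimates E(FP) when many genes are truly changed, so this estimate errs on the conservative side.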

pFDR
- How to estimate the FDR?
- ‘Positive’ false discovery rate: pFDR = E(#false positives / #positives | #positives > 0)
- The Benjamini-Hochberg FDR is this quantity multiplied by P(#positives > 0)
- Simes’ inequality allows the FDR to be controlled using the p-values alone
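
Simes' inequality underlies the Benjamini-Hochberg step-up procedure, which works directly on the sorted p-values. The sketch below computes Benjamini-Hochberg adjusted p-values; the function name and the example p-values are assumptions. It controls the FDR rather than the pFDR, and it assumes independent (or positively dependent) tests.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
bh = benjamini_hochberg(raw_p)
for p, q in zip(raw_p, bh):
    print(f"raw {p:g} -> BH-adjusted {q:g}")
```

Calling every gene with an adjusted value at or below 0.05 significant keeps the expected proportion of false positives among the calls at or below 5% under the stated assumptions.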