Uniform-Beta Mixture Modeling of the p-value Distribution

Uniform-Beta Mixture Modeling of the p-value Distribution 2/17/2011 Copyright © 2011 Dan Nettleton

Mixture Modeling of the p-value Distribution
First proposed by Allison, D. B., Gadbury, G. L., Heo, M., Fernández, J. R., Lee, C.-K., Prolla, T. A., Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis, 39, 1-20.
Model the p-value distribution as a mixture of a Uniform(0,1) distribution (corresponding to true nulls) and a Beta(α,β) distribution (corresponding to false nulls).
Pounds and Morris (2003) propose a mixture of Uniform(0,1) and Beta(α,1) (the BUM model).

Beta Distributions
A Beta(α,β) distribution is a probability distribution on the interval (0,1).
The probability density function of a Beta(α,β) distribution is
f(x) = [Γ(α+β) / (Γ(α)Γ(β))] x^(α-1) (1-x)^(β-1) for 0 < x < 1.
The mean of a Beta(α,β) distribution is α / (α+β).
The variance of a Beta(α,β) distribution is αβ / [(α+β+1)(α+β)^2].
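These formulas are easy to sanity-check numerically. A minimal sketch in Python, assuming NumPy and SciPy are available (the shape parameters 4 and 1 are arbitrary illustrations, not values from the slides):

```python
import numpy as np
from scipy.stats import beta
from scipy.special import gamma

a, b = 4.0, 1.0  # illustrative shape parameters

# Density from the formula above vs. SciPy's built-in Beta density
x = 0.7
pdf_formula = gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)
print(np.isclose(pdf_formula, beta.pdf(x, a, b)))  # True

# Mean and variance from the formulas above
mean_formula = a / (a + b)
var_formula = a * b / ((a + b + 1) * (a + b)**2)
print(np.isclose(mean_formula, beta.mean(a, b)))   # True
print(np.isclose(var_formula, beta.var(a, b)))     # True
```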

Various Beta Distributions
[Figure: density curves f(x) of Beta(8,8), Beta(4,1), and Beta(1,1) plotted against x.]

Model the distribution of observed p-values as a mixture of uniform and beta
[Figure: histogram of observed p-values; axes: p-value (x) and number of genes (y).]

The p-value density is assumed to be a mixture of a uniform density and a beta density:
g(p) = π0 + π1 [Γ(α+β) / (Γ(α)Γ(β))] p^(α-1) (1-p)^(β-1), for 0 < p < 1.
π0 and π1 are non-negative mixing proportions that sum to 1. Matching up with our previous notation, we have π0 = m0/m and π1 = m1/m.
The parameters π0, π1, α, and β are estimated by the method of maximum likelihood, assuming independence of all p-values. Numerical maximization is necessary.
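The likelihood has no closed-form maximizer, but any general-purpose optimizer will do. A minimal sketch in Python using SciPy's Nelder-Mead routine (the function name, the unconstrained re-parameterization, and the starting values are illustrative choices, not taken from the slides):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta

def fit_uniform_beta_mixture(pvals):
    """Fit g(p) = pi0 + pi1 * Beta(alpha, beta) density to p-values by maximum likelihood."""
    pvals = np.clip(np.asarray(pvals, dtype=float), 1e-12, 1 - 1e-12)  # avoid p = 0 or 1

    def neg_log_lik(theta):
        # Unconstrained parameterization: theta = (logit(pi0), log(alpha), log(beta))
        pi0 = 1.0 / (1.0 + np.exp(-theta[0]))
        a, b = np.exp(theta[1]), np.exp(theta[2])
        dens = pi0 + (1.0 - pi0) * beta.pdf(pvals, a, b)   # mixture density g(p)
        return -np.sum(np.log(dens))

    start = np.array([1.0, 0.0, 1.0])  # roughly pi0 = 0.73, alpha = 1, beta = 2.7
    res = minimize(neg_log_lik, start, method="Nelder-Mead")
    pi0_hat = 1.0 / (1.0 + np.exp(-res.x[0]))
    return pi0_hat, 1.0 - pi0_hat, np.exp(res.x[1]), np.exp(res.x[2])

# Usage: pi0_hat, pi1_hat, a_hat, b_hat = fit_uniform_beta_mixture(pvals)
```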

Maximum likelihood estimates: π̂0 = 0.8725, π̂1 = 0.1275, α̂ = 0.657, β̂ = 15.853.
[Figure: p-value histogram with density axis, showing the fitted components 0.8725·U(0,1) and 0.1275·Beta(0.657, 15.853).]

Posterior Probability of Differential Expression
Bayes Rule: P(A|B) = P(A) P(B|A) / P(B).
P(H0i is False | pi = p) = P(H0i is False) f(pi = p | H0i is False) / g(pi = p)
= π1 [Γ(α+β) / (Γ(α)Γ(β))] p^(α-1) (1-p)^(β-1) / { π0 + π1 [Γ(α+β) / (Γ(α)Γ(β))] p^(α-1) (1-p)^(β-1) }.

Posterior Probability of Differential Expression (continued)
The posterior probability of differential expression is the probability that a gene is differentially expressed given its p-value. It can be estimated by replacing the unknown parameters π0, π1, α, and β in the previous expression by their maximum likelihood estimates.
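A sketch of this plug-in calculation in Python (the function name is illustrative; the fitted values in the usage comment are the ones shown earlier in the slides):

```python
from scipy.stats import beta

def ppde(p, pi0_hat, pi1_hat, a_hat, b_hat):
    """Estimated posterior probability of differential expression for p-value(s) p."""
    signal = pi1_hat * beta.pdf(p, a_hat, b_hat)   # pi1 * Beta(alpha, beta) density at p
    return signal / (pi0_hat + signal)             # divide by the full mixture density g(p)

# Usage with the fitted values from the earlier slide:
# ppde(0.000001111, 0.8725, 0.1275, 0.657, 15.853)  # close to 0.986
```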

Estimated posterior probabilities of differential expression for selected genes, ranked by p-value:

Rank    p-value        Estimated PPDE
1       0.000001111    0.9862353
2       0.000020858    0.9632383
3       0.000025233    0.9618519
4       0.000028355    0.9593173
5       0.000032869    0.9572907
...
501     0.009275782    0.7381684
502     0.009286863    0.7380571
503     0.009318375    0.7377411
504     0.009332409    0.7376005
505     0.009347553    0.7374489

Relationship between Posterior Probability of Differential Expression (PPDE) and FDR
1 minus the average estimated PPDE for a list of genes provides an estimate of the FDR for that list of genes. For example, the estimated FDR for the top 5 genes is 1 - (0.986 + 0.963 + 0.961 + 0.959 + 0.957)/5 = 0.035. The theoretical properties of this approach to estimating FDR have not been thoroughly investigated.
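A quick check of the top-5 arithmetic above (the rounded PPDE values are hard-coded; variable names are illustrative):

```python
import numpy as np

top5_ppde = np.array([0.986, 0.963, 0.961, 0.959, 0.957])  # rounded values from the table
fdr_top5 = 1.0 - top5_ppde.mean()
print(round(fdr_top5, 3))  # 0.035
```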

Plot of FDR Estimates Based on PPDE vs. q-values for the Simulated Example
[Figure: FDR estimates based on PPDE and q-values, plotted against p-value.]

Plot of FDR Estimates Based on PPDE vs. Actual Ratio of False Positives to Number of Rejections for the Simulated Example
[Figure: FDR estimates based on PPDE and the actual ratio V/R, plotted against p-value.]

FDR Estimates Based Directly on the Estimated Mixture Model
P(H0i is True | pi ≤ c) = P(H0i is True) P(pi ≤ c | H0i is True) / P(pi ≤ c)
= π0 c / ( π0 c + π1 ∫0^c [Γ(α+β) / (Γ(α)Γ(β))] p^(α-1) (1-p)^(β-1) dp ).
Replacing the parameters in the expression above with their estimates gives an estimated "FDR" for any significance cutoff c.
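Because the integral in the denominator is simply the Beta(α,β) CDF evaluated at c, the plug-in estimate is a one-liner. A sketch (function name illustrative; the commented usage plugs in the fitted values shown earlier):

```python
from scipy.stats import beta

def fdr_at_cutoff(c, pi0_hat, pi1_hat, a_hat, b_hat):
    """Estimated FDR for the rejection rule p <= c under the fitted uniform-beta mixture."""
    return pi0_hat * c / (pi0_hat * c + pi1_hat * beta.cdf(c, a_hat, b_hat))

# Example with the fitted values from the earlier slide:
# fdr_at_cutoff(0.1, 0.8725, 0.1275, 0.657, 15.853)
```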

FDR Estimates Based Directly on the Estimated Mixture Model
[Figure: with the cutoff c = 0.1 marked on the p-value axis, the FDR estimate is the area under the dashed line divided by the area under the solid curve over (0, c).]

Comparison of Mixture Model Methods for Estimating FDR
[Figure: FDR estimates based on 1 - average PPDE plotted against FDR estimates based directly on the estimated mixture model.]

Comments
The two methods will produce similar FDR estimates when there are a large number of closely spaced p-values.
The method based on 1 - average estimated PPDE may be useful for estimating the FDR in a list of genes that does not necessarily include the most significant genes.
The method based directly on the estimated mixture model may be conceptually preferable in the usual case where a list consists of the most differentially expressed genes.