Mixture Modeling of the p-value Distribution


Mixture Modeling of the p-value Distribution 3/12/2009 Copyright © 2009 Dan Nettleton

Mixture Modeling of the p-value Distribution First proposed by Allison, D. B., Gadbury, G. L., Heo, M., Fernández, J. R., Lee, C.-K., Prolla, T. A., and Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis, 39, 1-20. Model the p-value distribution as a mixture of a Uniform(0,1) distribution (corresponding to true nulls) and a Beta(α,β) distribution (corresponding to false nulls). Pounds and Morris (2003) propose a mixture of Uniform(0,1) and Beta(α,1), known as the BUM model.

Beta Distributions A Beta(α,β) distribution is a probability distribution on the interval (0,1). The probability density function of a Beta(α,β) distribution is f(x) = [Γ(α+β) / (Γ(α)Γ(β))] x^(α−1) (1−x)^(β−1) for 0 < x < 1. The mean of a Beta(α,β) distribution is α/(α+β). The variance of a Beta(α,β) distribution is αβ / [(α+β+1)(α+β)²].
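As a quick check on these formulas, the density, mean, and variance can be computed directly. This is a minimal sketch using only the Python standard library; the helper names are illustrative, not part of any package:

```python
import math

def beta_pdf(x, a, b):
    # f(x) = Gamma(a+b) / (Gamma(a) * Gamma(b)) * x^(a-1) * (1-x)^(b-1)
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_mean(a, b):
    # mean = a / (a + b)
    return a / (a + b)

def beta_var(a, b):
    # variance = a*b / ((a + b + 1) * (a + b)^2)
    return a * b / ((a + b + 1) * (a + b) ** 2)

# Beta(1,1) is the Uniform(0,1) distribution: density 1 everywhere on (0,1).
print(beta_pdf(0.3, 1, 1))  # 1.0
print(beta_mean(8, 8))      # 0.5
```

Note that Beta(1,1) reduces to Uniform(0,1), which is why the uniform component of the mixture below can be viewed as a special case of the beta family.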

Various Beta Distributions [Figure: density curves f(x) of Beta(8,8), Beta(4,1), and Beta(1,1) plotted against x.]

Model the distribution of observed p-values as a mixture of uniform and beta [Figure: histogram of observed p-values; number of genes vs. p-value.]

The p-value density is assumed to be a mixture of a uniform density and a beta density: g(p) = π0 + π1 [Γ(α+β) / (Γ(α)Γ(β))] p^(α−1) (1−p)^(β−1). π0 and π1 are non-negative mixing proportions that sum to 1. Matching up with our previous notation, we have π0 = m0/m and π1 = m1/m. The parameters π0, π1, α, and β are estimated by the method of maximum likelihood, assuming independence of all p-values. Numerical maximization is necessary.

Maximum likelihood estimates for the simulated example: π̂0 = 0.8725, π̂1 = 0.1275, α̂ = 0.657, β̂ = 15.853. [Figure: p-value histogram with the fitted mixture density 0.8725·U(0,1) + 0.1275·Beta(0.657, 15.853) overlaid.]

Posterior Probability of Differential Expression Bayes Rule: P(A|B) = P(A)P(B|A) / P(B). Thus P(H0i is False | pi = p) = P(H0i is False)·P(pi = p | H0i is False) / P(pi = p) = π1 [Γ(α+β)/(Γ(α)Γ(β))] p^(α−1)(1−p)^(β−1) / {π0 + π1 [Γ(α+β)/(Γ(α)Γ(β))] p^(α−1)(1−p)^(β−1)}.

Posterior Probability of Differential Expression (continued) The posterior probability of differential expression is the probability that a gene is differentially expressed given its p-value. It can be estimated by replacing the unknown parameters π0, π1, α, and β in the previous expression by their maximum likelihood estimates.
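Plugging the maximum likelihood estimates into the expression above can be sketched as follows (an illustrative helper, using the fitted values from the simulated example; it should agree with the tabulated PPDEs up to numerical precision):

```python
import math

def ppde(p, pi0, pi1, a, b):
    # pi1 * f_Beta(p; a, b) / (pi0 + pi1 * f_Beta(p; a, b))
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    f = const * p ** (a - 1) * (1 - p) ** (b - 1)
    return pi1 * f / (pi0 + pi1 * f)

# Fitted values from the simulated example: pi0 = 0.8725, pi1 = 0.1275,
# alpha = 0.657, beta = 15.853.  Smaller p-values give larger PPDEs.
smallest = ppde(0.000001111, 0.8725, 0.1275, 0.657, 15.853)
larger   = ppde(0.009347553, 0.8725, 0.1275, 0.657, 15.853)
```

Because α̂ < 1, the fitted beta density is largest near 0, so the PPDE decreases as the p-value grows.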

Estimated Posterior Probability of Differential Expression for Selected Genes

Rank   p-value        Estimated PPDE
1      0.000001111    0.9862353
2      0.000020858    0.9632383
3      0.000025233    0.9618519
4      0.000028355    0.9593173
5      0.000032869    0.9572907
...
501    0.009275782    0.7381684
502    0.009286863    0.7380571
503    0.009318375    0.7377411
504    0.009332409    0.7376005
505    0.009347553    0.7374489

Relationship between Posterior Probability of Differential Expression (PPDE) and FDR 1 minus the average PPDE for a list of genes provides an estimate of the FDR for that list of genes. For example, the estimated FDR for the top 5 genes is 1 − (0.986 + 0.963 + 0.961 + 0.959 + 0.957)/5 = 0.035. The properties of this approach to estimating FDR are unknown.
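This calculation is trivial to carry out; a minimal sketch:

```python
def fdr_from_ppde(ppde_list):
    # Estimated FDR for a gene list: 1 minus the average PPDE over the list.
    return 1 - sum(ppde_list) / len(ppde_list)

# Estimated PPDEs (rounded) for the top 5 genes in the simulated example.
top5 = [0.986, 0.963, 0.961, 0.959, 0.957]
print(round(fdr_from_ppde(top5), 3))  # 0.035
```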

Plot of FDR Estimates Based on PPDE vs. q-value for the Simulated Example [Figure: estimated FDR based on PPDE plotted against q-values across p-value cutoffs.]

Plot of FDR Estimates Based on PPDE vs. Actual Ratio of False Positives to Number of Rejections for the Simulated Example [Figure: estimated FDR based on PPDE plotted against V/R across p-value cutoffs.]

FDR Estimates Based Directly on the Estimated Mixture Model P(H0i is True | pi ≤ c) = P(H0i is True)·P(pi ≤ c | H0i is True) / P(pi ≤ c) = π0 c / {π0 c + π1 ∫₀ᶜ [Γ(α+β)/(Γ(α)Γ(β))] p^(α−1)(1−p)^(β−1) dp}. Replacing the parameters in the expression above with their estimates gives an estimated “FDR” for any significance cutoff c.
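The integral in the denominator is just the Beta(α,β) CDF evaluated at c, so the estimate is easy to compute. An illustrative sketch assuming SciPy, with the fitted values from the simulated example:

```python
from scipy.stats import beta

def fdr_at_cutoff(c, pi0, pi1, a, b):
    # pi0 * c / (pi0 * c + pi1 * F_Beta(c; a, b)), where F_Beta is the
    # Beta(a, b) CDF, i.e. the integral of the beta density from 0 to c.
    return pi0 * c / (pi0 * c + pi1 * beta.cdf(c, a, b))

# Fitted values from the simulated example, with cutoff c = 0.1.
est_fdr = fdr_at_cutoff(0.1, 0.8725, 0.1275, 0.657, 15.853)
```

Tightening the cutoff c shrinks the uniform-component numerator faster than the beta-component mass in the denominator, so stricter cutoffs yield smaller estimated FDRs.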

FDR Estimates Based Directly on the Estimated Mixture Model For cutoff c = 0.1, the estimate is the area under the dashed line (the uniform component) to the left of c, divided by the area under the solid mixture density to the left of c. [Figure: fitted mixture density with the cutoff c = 0.1 marked on the p-value axis.]

Comparison of Mixture Model Methods for Estimating FDR [Figure: FDR estimates based on 1 − average PPDE plotted against FDR estimates based directly on the estimated mixture model.]

Comments The two methods will produce similar FDR estimates when there are a large number of closely spaced p-values. The method based on 1 – average PPDE may be useful for estimating the FDR in a list of genes that does not necessarily include the most significant genes. The method based directly on the estimated mixture model is probably preferable in the usual case where a list will consist of the most differentially expressed genes.