Differential Expressions Bayesian Techniques Lecture Topic 8

Why Bayes? A Bayesian friend of mine, when asked this question, gave three answers: some problems are very hard to solve by classical techniques (e.g. the Behrens-Fisher problem); with classical methods every new problem requires a new solution; Bayes provides a coherent path.

The Frequentist Paradigm Probability refers to a limiting relative frequency. Probabilities are OBJECTIVE properties of the real world. Parameters are fixed unknown constants, so NO probability statement is possible about a parameter. Statistical procedures should be designed to have well-defined LONG-RUN frequency properties. For example, a 95% confidence interval should trap the true value of the parameter with a limiting frequency of 95%.

Bayesian Philosophy Probability describes a DEGREE OF BELIEF, not a relative frequency. As such, you can make probability statements about anything, not just data. We CAN make probability statements about parameters even if they are fixed constants. We make inferences about a parameter by producing its probability distribution. Inferences such as point or interval estimates may be extracted from the probability distribution of the parameter.

The Contrasts According to Larry Wasserman: “Bayesian inference is a controversial approach as it embraces a subjective notion of probability”. In general, Bayesian methods have NO guarantees for long-run performance.

Advantages of Bayesian Methods They provide the ability to formally incorporate prior information. Inference is conditional on the actual data (not on what might have been). Results are more easily interpretable by non-specialists (e.g. confidence intervals). All analyses follow directly from the posterior distribution. The stopping rule does not affect inference. Any question can be answered directly, e.g. bioequivalence: –H0: θ0 ≠ θ1 –H1: θ0 = θ1 This reverses the roles of the null and alternative, which is hard to handle with traditional testing methods but easy in the Bayesian framework.

Disadvantages The initial Bayesians were subjectivists, so results are not “objective” and could be manipulated to yield any desired result. How should the prior be set in general? The methods are computationally difficult: complex integrals must be evaluated even for simple problems, which requires inexpensive high-speed computing.

How the Bayesian Method Works Choose a probability density f(θ) – called the PRIOR distribution – that expresses our beliefs about the parameter θ BEFORE we see any data. Choose a statistical model f(x | θ) that reflects our beliefs about x given θ; here we write it as f(x | θ), NOT f(x; θ) as in the frequentist world. After OBSERVING the data X1, …, Xn, we update our belief about the parameter and calculate the posterior distribution f(θ | x). This update is exactly Bayes' theorem.

Bayes' Theorem: Discrete Version A simple probability result. Let B1, B2, …, Bn be disjoint sets with P(Bk) > 0 for all k and P(B1 ∪ B2 ∪ … ∪ Bn) = 1 (mutually exclusive and exhaustive). Then for any event A, P(Bj | A) = P(Bj) P(A | Bj) / Σk P(Bk) P(A | Bk).

EXAMPLE: Disease incidence in the population: P(D) = 0.001. Diagnostic test: false positive rate 0.05, P(+ | not D) = 0.05; false negative rate 0.01, P(- | D) = 0.01. If a person drawn at random tests +, what is the probability that he has the disease D?
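Applying the discrete Bayes formula from the previous slide (with P(+ | D) = 1 − 0.01 = 0.99), a short R check of the arithmetic:

```r
# Worked version of the disease-testing example using Bayes' theorem.
p_D      <- 0.001                  # prior: disease incidence P(D)
p_pos_D  <- 1 - 0.01               # sensitivity = 1 - false negative rate
p_pos_nD <- 0.05                   # false positive rate P(+ | not D)
p_D_pos  <- p_pos_D * p_D / (p_pos_D * p_D + p_pos_nD * (1 - p_D))
p_D_pos                            # approximately 0.019
```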

Comment Hence, the probability that you HAVE the disease given that you have TESTED positive is still pretty LOW (about 2% here), even with very small FALSE POSITIVE and FALSE NEGATIVE rates. This rule is very useful in numerous other situations.

Bayes' Theorem: The Continuous Version Let f(θ) be our prior distribution (density) for the parameter θ. Suppose we have data X1, …, Xn with density f(X1, …, Xn | θ), also written as the likelihood Ln(X, θ).
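In this notation, Bayes' theorem gives the posterior as

f(θ | X1, …, Xn) = Ln(X, θ) f(θ) / ∫ Ln(X, θ) f(θ) dθ,

where the integral in the denominator runs over the parameter space.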

Some Simplifications The denominator is often very hard to deal with, since the integration over the parameters is not trivial. We call it the normalizing constant, and in most cases we do not evaluate it explicitly. Instead we use the idea that the posterior is proportional to the likelihood times the prior: f(θ | x) ∝ f(x | θ) f(θ).

Bayes' Idea Think of a model f(y1, …, yn | θ) for the data y1, …, yn, e.g. Normal, Binomial, etc., with θ random and having prior density g(·). Bayes' rule says that p(θ | y1, …, yn) ∝ g(θ) f(y1, …, yn | θ). Hence the posterior is proportional to the prior density multiplied by the probability of the data given the parameter (the likelihood).

Hypothesis Testing: Classical vs. Bayesian Classical: set up the null and alternative hypotheses, perform a test, calculate a p-value, and reject or fail to reject the null. Bayesian: inference is based on the posterior distribution p(θ | y1, …, yn); we consider the evidence in favor of particular parameter values, so the data as well as prior beliefs influence the inference.

Major Challenge 1: Setting Priors Approaches: Subjective – based on the beliefs of an individual, an expert, etc. Issues: hard to do in practice, and people are inconsistent, although elicitation can help. Non-informative – based on “prior ignorance” about the parameter. Issues: often hard to define, may lead to improper posteriors, and sensitive to parameterization.

Setting Priors: Conjugate Priors Conjugate priors are priors chosen so that, combined with the model, the posterior has a KNOWN distribution. Issues: they are a choice of convenience, they avoid computational problems, and they exist only for limited families. Example: if y ~ Bin(n, θ) and θ ~ Beta(α, β), then θ | y ~ Beta(α + y, β + n − y). The Normal conjugate is the Normal for the location parameter, the Poisson conjugate is the Gamma, and the Inverse Gamma is often used as a prior for the Normal σ². Generally, all members of the exponential families have conjugate priors.
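A minimal sketch of the Beta-Binomial update above in R; the prior hyper-parameters and data are made up for illustration:

```r
# Beta-Binomial conjugate update: prior Beta(alpha, beta), y successes in n trials.
alpha <- 2; beta <- 2          # hypothetical prior hyper-parameters
n <- 20; y <- 14               # hypothetical data
alpha_post <- alpha + y        # posterior is Beta(alpha + y, beta + n - y)
beta_post  <- beta + n - y
c(post_mean = alpha_post / (alpha_post + beta_post))
qbeta(c(0.025, 0.975), alpha_post, beta_post)  # 95% equal-tailed credible interval
```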

Setting Priors: Non-informative Assuming we have no REAL information about the parameter, we can model it with a “non-informative” prior. For example, if θ is discrete with values θ1, …, θn we can take P(θi) = 1/n for i = 1, …, n. If we know an interval (a, b) in which θ lies, we can define the prior as P(θ) = 1/(b − a) for a < θ < b. We can also define P(θ) = c for some c > 0 (an improper prior, since it is not a pdf).

Setting Priors: Jeffreys' Prior Uniform non-informative priors are criticized because they are not preserved under transformation of the parameter. Jeffreys' prior, which IS invariant under transformation, is often used instead: P(θ) ∝ [I(θ)]^(1/2), where I is the Fisher information matrix.
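For instance (an example not on the slide), for a single Bernoulli(θ) observation the Fisher information is I(θ) = 1/[θ(1 − θ)], so Jeffreys' prior is P(θ) ∝ θ^(-1/2) (1 − θ)^(-1/2), i.e. a Beta(1/2, 1/2) distribution, which transforms consistently under reparameterization where the flat Uniform(0, 1) prior does not.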

Major Challenge II: Computation We need to evaluate complicated high-dimensional integrals, and a lot of technology has been developed in recent years. Approaches: the earliest solutions were approximations and numerical integration; non-iterative Monte Carlo includes direct sampling and indirect sampling (importance, rejection); Markov Chain Monte Carlo (MCMC) includes Gibbs sampling, the Metropolis-Hastings algorithm, hybrid methods, and more. MCMC is the most popular and can be implemented in high-dimensional situations.

Simple Example
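Judging from the bullets on the next slide, this example is the standard Normal model with known variance and a conjugate Normal prior. As a sketch of that setup: suppose y1, …, yn ~ N(θ, σ²) with σ² known and prior θ ~ N(μ, τ²). Then

θ | y ~ N( (μ/τ² + n ȳ/σ²) / (1/τ² + n/σ²), 1 / (1/τ² + n/σ²) ),

so the posterior mean is a precision-weighted average of the prior mean μ and the sample mean ȳ, and the posterior precision is the sum of the prior precision 1/τ² and the data precision n/σ².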

Simple Example contd… The posterior mean is a weighted average of the prior mean and the data mean: the sample average is shrunk toward the prior mean, and the weight depends on the relative variability of the prior and the data. The posterior precision is the sum of the prior precision and the data precision. Samples from the posterior are easy to get given the data, σ², μ and τ².
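As a minimal sketch of this update in R (the data and the values of σ², μ and τ² below are made up for illustration):

```r
# Normal likelihood with known variance, Normal prior on the mean.
set.seed(1)
sigma2 <- 4; mu0 <- 0; tau2 <- 1               # hypothetical known variance and prior
y <- rnorm(25, mean = 1.5, sd = sqrt(sigma2))  # hypothetical data
n <- length(y)
post_prec <- 1 / tau2 + n / sigma2             # posterior precision = prior + data precision
post_mean <- (mu0 / tau2 + n * mean(y) / sigma2) / post_prec
post_var  <- 1 / post_prec
theta_draws <- rnorm(5000, post_mean, sqrt(post_var))  # easy posterior samples
c(post_mean = post_mean, post_sd = sqrt(post_var))
```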

Lessons from the Example General principle: the posterior is a compromise between the prior and the data. When μ and τ² are not known there are two options: Empirical Bayes, which estimates μ and τ² from the data, or Hierarchical Bayes, which puts priors on μ and τ² as well.

Bayesian Hypothesis Testing The idea is due to Jeffreys (1961). Idea: based on the data that each hypothesis is supposed to predict, one applies Bayes' theorem and computes the posterior probability that the first hypothesis is correct. UNLIKE classical methods, the hypotheses DO NOT have to be nested within each other.

Mechanics of Bayesian Hypothesis Testing Suppose we have two hypotheses H0 and H1 (Bayesians prefer the word “models” to “hypotheses”, but we will keep “hypotheses” to be consistent with the classical ideas). Let H0 and H1 be two hypotheses concerning the data Y, let θ0 and θ1 be the associated parameters, let πi(θi) be the corresponding priors, and let fi(y | θi) be the corresponding marginal distributions. We can use Bayes' theorem to calculate the posteriors P(Hi | y). Bayesian hypothesis testing consists of computing the following quantities and applying pre-specified cut-offs for decisions: –B = [P(H0 | y)/P(H1 | y)] / [P(H0)/P(H1)] (the Bayes factor) –P(H0 | Y = y), P(H0 | Y >= y) (Bayesian p-values)
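As a hypothetical illustration (not from the slides), the Bayes factor for a binomial point null H0: θ = 0.5 against H1: θ ~ Uniform(0, 1) can be computed directly from the two marginal likelihoods:

```r
# Bayes factor B01 = m0(y) / m1(y), the ratio of marginal likelihoods.
y <- 7; n <- 10                               # hypothetical data: 7 successes in 10 trials
m0 <- dbinom(y, n, 0.5)                       # marginal likelihood under H0 (point null)
m1 <- integrate(function(th) dbinom(y, n, th), 0, 1)$value  # under H1, uniform prior
B01 <- m0 / m1
B01                                           # > 1 favours H0, < 1 favours H1
```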

Bayesian Hypothesis Tests in Microarrays For gene g, let Hg1: the gene is differentially expressed, and Hg0: the gene is not differentially expressed. Traditional Bayesians would summarise the evidence through the posterior probabilities P(Hg0 | data) and P(Hg1 | data).

Method 1: Differential Expression Score Use a t-statistic or Wilcoxon rank-sum statistic zg for each gene, then calculate P(H0 | zg = z) or P(H0 | zg >= z), equivalently P(vg = 0 | zg = z) or P(vg = 0 | zg >= z). McClure and Wit (2004) show that the second (tail) version is identical to using the FDR method for controlling error.

Fully Bayesian Analysis In general we are interested in the posterior probability that a gene is inactive, written in terms of p0, the fraction of inactive genes on the array, F0, the distribution of the test statistic under the null hypothesis v = 0, and F, the marginal distribution of the test statistic.
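A plausible reconstruction of this quantity, assuming the standard two-group mixture form F = p0 F0 + (1 − p0) F1, follows directly from Bayes' theorem:

P(vg = 0 | zg >= z) = p0 [1 − F0(z)] / [1 − F(z)],

with the density version P(vg = 0 | zg = z) = p0 f0(z) / f(z).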

Bayesian t test The t statistic zg for gene g is modelled as follows. Assume zg | {vg = 0} ~ N(0, σ0²) and zg | {vg = 1} ~ N(0, σ1²). Hence, marginally, zg ~ (1 − p1) N(0, σ0²) + p1 N(0, σ1²).
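A small sketch of the posterior probability of differential expression implied by this mixture; the mixture weight and variances below are purely illustrative, not estimates:

```r
# P(v_g = 1 | z_g) under the two-component normal mixture, for assumed parameters.
p1 <- 0.1; sigma0 <- 1; sigma1 <- 3          # hypothetical mixture weight and sds
post_de <- function(z) {
  num <- p1 * dnorm(z, 0, sigma1)
  den <- (1 - p1) * dnorm(z, 0, sigma0) + num
  num / den
}
post_de(c(0, 2, 4))   # the probability rises with |z_g|
```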

Bayesian t test: Priors p1 ~ Uniform(0, 1), vg ~ Bernoulli(p1), and σ0² and σ1² are given Gamma priors whose hyper-parameters are in turn given Gamma hyper-priors. Write θ for the full parameter vector (v, p1, σ0², σ1² and the hyper-parameters). These are all conjugate priors chosen to make the calculations easier. One uses the Gibbs sampler to simulate from P(θ | z) and estimate p1, σ0² and σ1², from which the required posterior probability is calculated.

Gibbs Sampler The Gibbs sampler is used to approximate posterior summaries such as the posterior mean. It does not calculate P(θ | y) explicitly; it simulates draws from this distribution, and from sample summaries we get a good picture of the joint posterior as well as the marginal distribution of interest, P(v | y). It repeatedly samples from the full conditional distributions P(θi | θ-i, y) until the chain converges to its stationary distribution; the initial period before convergence is called the “burn-in”. After burn-in, each draw of θ is a draw from the posterior distribution. Bayes' theorem states that the conditional distribution P(θi | θ-i, y) is proportional to the likelihood times the prior, P(y | θ) P(θ), viewed as a function of θi. When these full conditionals have recognizable forms (generally achieved by using conjugate priors), the procedure is easy to apply.
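As a minimal illustration of the mechanics, here is a Gibbs sampler for a simpler conjugate problem than the microarray model above (Normal data with unknown mean and variance); the data and prior values are made up:

```r
# Gibbs sampler for y ~ N(theta, sigma2) with priors
# theta ~ N(mu0, tau2) and sigma2 ~ Inverse-Gamma(a0, b0).
set.seed(42)
y <- rnorm(30, 2, 1.5); n <- length(y)          # hypothetical data
mu0 <- 0; tau2 <- 100; a0 <- 0.01; b0 <- 0.01   # hypothetical vague priors
n_iter <- 5000; burn <- 1000
theta <- mean(y); sigma2 <- var(y)              # starting values
draws <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("theta", "sigma2")))
for (t in 1:n_iter) {
  # Full conditional for theta given sigma2: Normal (conjugate update)
  prec  <- 1 / tau2 + n / sigma2
  m     <- (mu0 / tau2 + n * mean(y) / sigma2) / prec
  theta <- rnorm(1, m, sqrt(1 / prec))
  # Full conditional for sigma2 given theta: Inverse-Gamma
  a <- a0 + n / 2
  b <- b0 + sum((y - theta)^2) / 2
  sigma2 <- 1 / rgamma(1, a, rate = b)
  draws[t, ] <- c(theta, sigma2)
}
colMeans(draws[-(1:burn), ])   # posterior means after discarding burn-in
```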

Empirical Bayes Idea The prior distributions depend upon unknown parameters, which in turn may need a second- or higher-stage prior in some hierarchical setting. At some point we HAVE to specify all remaining parameters of the hyper-prior; in other words, we HAVE to use our knowledge to specify the prior. The Empirical Bayes method instead uses the sample data to estimate the parameters of the final-stage prior. The idea: if we are interested in θ | y, let θ ~ P(θ | λ1), λ1 ~ P(λ1 | λ2), …, λL-1 ~ P(λL-1 | λL). In empirical Bayes we use the data to estimate the final-stage parameter λL as the value that maximizes the marginal likelihood P(Y | λL). We plug the estimate of λL into the priors, and the posterior distribution becomes P(θ | y, estimated λL).

Empirical Bayes' Idea in Differential Expression Average log fold change – problem: non-DE genes with large variances have too much chance of being selected. Ordinary t-statistics – problem: apparently DE genes with very small sample variances are suspect. Moderated t-statistics – a happy compromise between the two above: an empirical Bayes estimate that uses the data across genes to estimate a new standard error sg for each gene, as described on the next slide.

The moderated t statistic Smoothed standard deviations: the gene-wise standard deviations sg are shrunk towards a common prior value s0. This eliminates large t-statistics due merely to very small s values, and reduces the impact of very large s values.
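In the standard form used by limma (which this slide appears to follow), the smoothed variance is a degrees-of-freedom-weighted average of the common prior value s0² (with d0 degrees of freedom) and the gene-wise sample variance sg² (with dg degrees of freedom),

s̃g² = (d0 s0² + dg sg²) / (d0 + dg),

and the moderated t-statistic t̃g is the ordinary t-statistic for gene g with sg replaced by s̃g.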

EB Idea: Posterior Odds The posterior odds (and hence the posterior probability) of differential expression for any gene is a monotonic function of t̃g² for constant degrees of freedom d.

Estimating hyper-parameters Closed-form estimators with good properties are available: for s0 and d0 in terms of the first two moments of log s², and for c0 in terms of quantiles of the |t̃g|. Nowadays the EB estimate is used most often for differential expression, and the genes are ranked by the EB estimates: instead of doing strict error control, the top genes under the EB ranking are examined. Sometimes |t̃g| > 4 is used as an empirical cut-off. limma in R uses empirical Bayes estimates to assess which genes are differentially expressed.
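A minimal sketch of this workflow with limma; the simulated expression matrix, group labels and number of reported genes below are placeholders, not part of the lecture:

```r
# Moderated t-statistics via limma's empirical Bayes shrinkage.
library(limma)
set.seed(1)
exprs_mat <- matrix(rnorm(600), nrow = 100)        # hypothetical 100 genes x 6 samples
exprs_mat[1:5, 4:6] <- exprs_mat[1:5, 4:6] + 2     # make the first 5 genes differentially expressed
group  <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~ group)                    # intercept + treated-vs-control effect
fit <- lmFit(exprs_mat, design)                    # gene-wise linear models
fit <- eBayes(fit)                                 # EB moderation of the gene-wise variances
topTable(fit, coef = 2, number = 10)               # top genes ranked by moderated statistics
```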