
Bayesian and Least Squares fitting: Problem: Given data (d) and a model (m) with adjustable parameters (x), what are the best values and uncertainties for the parameters? Given Bayes’ Theorem: prob(m|d,I) ∝ prob(d|m,I) prob(m|I), plus Gaussian data uncertainties and flat priors, then log( prob(m|d) ) = constant – ½ χ²_data, and maximizing log( prob(m|d) ) is equivalent to minimizing chi-squared (i.e., least squares).

Here χ²_data = Σ_i ( (d_i – m_i)/σ_i )², and the fitted parameters are x_fitted = x|0 + Δx, where the corrections Δx come from solving the linearized residual equations r = P Δx.

Weighted least-squares: Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j. Each datum equation: r_i = Σ_j ( P_ij Δx_j ). Dividing both sides of each equation by the datum uncertainty, σ_i, i.e., r_i → r_i/σ_i and P_ij → P_ij/σ_i for each j, gives the variance-weighted solution.
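A minimal NumPy sketch of this weighting step (the function name, array shapes, and the use of a least-squares solver are illustrative assumptions, not from the slides):

    import numpy as np

    def weighted_lsq_step(r, P, sigma):
        # r: residuals d - m, shape (N,); P: partials P[i, j] = dm_i/dx_j, shape (N, M);
        # sigma: 1-sigma data uncertainties, shape (N,).
        rw = r / sigma                    # weight each datum equation by 1/sigma_i
        Pw = P / sigma[:, None]           # divide each row of P by sigma_i
        dx, *_ = np.linalg.lstsq(Pw, rw, rcond=None)   # variance-weighted corrections
        return dx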

Including priors in least-squares: Least squares equations: r = d – m = P Δx, where P_ij = ∂m_i/∂x_j. Each data equation: r_i = Σ_j ( P_ij Δx_j ). The weighted data (residuals) need not be homogeneous: r = ( d – m )/σ can be composed of N “normal” data and some “prior-like” data. A possible prior-like datum: x_k = v_k ± σ_k (for the k-th parameter). Then r_N+1 = ( v_k – x_k )/σ_k and P_N+1,j = 1/σ_k for j = k, and 0 for j ≠ k.
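A sketch of appending one such prior-like row to an already-weighted residual vector and design matrix (names and shapes are assumptions; it builds on the weighted quantities from the previous sketch):

    import numpy as np

    def add_prior_row(rw, Pw, k, v_k, x_k, sigma_k):
        # Append the prior-like datum x_k = v_k +/- sigma_k as one extra equation.
        n_par = Pw.shape[1]
        r_prior = (v_k - x_k) / sigma_k       # r_(N+1) = (v_k - x_k)/sigma_k
        P_prior = np.zeros(n_par)
        P_prior[k] = 1.0 / sigma_k            # P_(N+1),j = 1/sigma_k for j = k, else 0
        return np.append(rw, r_prior), np.vstack([Pw, P_prior])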

Non-linear models: The least squares equations r = P Δx, where P_ij = ∂m_i/∂x_j, have the solution Δx = (PᵀP)⁻¹ Pᵀ r. If the partial derivatives of the model are independent of the parameters, then the first-order Taylor expansion is exact and applying the parameter corrections, Δx, gives the final answer. Example linear problem: m = x₁ + x₂ t + x₃ t². If not, you have a non-linear problem and the 2nd- and higher-order terms in the Taylor expansion can be important until Δx → 0, so iteration is required. Example non-linear problem: m = sin( x₁ t ).
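A compact sketch of this iterative (Gauss-Newton style) solution; the function names, convergence tolerance, and the example model at the end are illustrative assumptions:

    import numpy as np

    def iterative_lsq(model, partials, x0, t, d, sigma, n_iter=20, tol=1e-10):
        # model(x, t) -> predicted data; partials(x, t) -> P with P[i, j] = dm_i/dx_j.
        # For a linear model one pass suffices; otherwise iterate until dx -> 0.
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            r = (d - model(x, t)) / sigma              # weighted residuals
            P = partials(x, t) / sigma[:, None]        # weighted partials
            dx, *_ = np.linalg.lstsq(P, r, rcond=None) # dx = (P^T P)^-1 P^T r
            x = x + dx
            if np.max(np.abs(dx)) < tol:
                break
        return x

    # Example non-linear model from the slide: m = sin(x1 * t)
    model = lambda x, t: np.sin(x[0] * t)
    partials = lambda x, t: (t * np.cos(x[0] * t))[:, None]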

How to calculate partial derivatives (P_ij): Analytic formulae (if the model can be expressed analytically). Numerical evaluation: “wiggle” the parameters one at a time: x^w = x except for the j-th parameter, where x_j^w = x_j + δx. The partial derivative of the i-th datum for parameter j is then P_ij = ( m_i(x^w) – m_i(x) ) / ( x_j^w – x_j ). NB: choose δx small enough to avoid 2nd-order errors, but large enough to avoid numerical inaccuracies. Always use 64-bit computations!
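A sketch of the numerical “wiggle” partials (the step size and names are assumptions; 64-bit floats are used throughout):

    import numpy as np

    def numerical_partials(model, x, t, dx=1e-6):
        # P[i, j] = ( m_i(x^w) - m_i(x) ) / ( x_j^w - x_j ), wiggling one parameter at a time.
        x = np.asarray(x, dtype=np.float64)
        m0 = model(x, t)
        P = np.empty((m0.size, x.size))
        for j in range(x.size):
            xw = x.copy()
            xw[j] += dx                     # wiggle only the j-th parameter
            P[:, j] = (model(xw, t) - m0) / (xw[j] - x[j])
        return P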

Can do very complicated modeling: Example problem: model the pulsating photosphere of Mira variables (see Reid & Goldston 2002, ApJ, 568, 931). Data: observed flux, S(t, λ), at radio, IR, and optical wavelengths. Model: assume power-law temperature, T(r,t), and density, ρ(r,t); calculate opacity sources (ionization equilibrium, H₂ formation, …); numerically integrate the radiative transfer along ray paths through the atmosphere for many impact parameters and wavelengths; parameters include T₀ and ρ₀ at radius r₀. Even though the model is complicated and not analytic, one can easily calculate the partials numerically and solve for the best parameter values.

Modeling Mira Variables: Visual: Δm_v ~ 8 mag ~ a factor of 1000; variable formation of TiO clouds at ~2 R* with top T ~ 1400 K. IR: seeing the pulsating stellar surface. Radio: H free-free opacity at ~2 R*.

Iteration and parameter adjustment: The least squares equations r = P Δx, where P_ij = ∂m_i/∂x_j, have the solution Δx = (PᵀP)⁻¹ Pᵀ r. It is often better to make parameter adjustments slowly, so for the (k+1)-th iteration set x|k+1 = x|k + λ Δx|k, where 0 < λ < 1. NB: this is equivalent to scaling the partial derivatives by 1/λ. So, if one iterates enough, one only needs to get the sign of the partial derivatives correct!
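A sketch of the damped update inside the iteration loop (reusing the model, partials, d, t, and sigma names from the earlier sketches; the value of lam is illustrative):

    lam = 0.3                                    # damping factor, 0 < lam < 1
    for _ in range(200):                         # more iterations are needed when lam < 1
        r = (d - model(x, t)) / sigma
        P = partials(x, t) / sigma[:, None]
        dx, *_ = np.linalg.lstsq(P, r, rcond=None)
        x = x + lam * dx                         # x|k+1 = x|k + lam * dx|k
        if np.max(np.abs(dx)) < 1e-10:
            break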

Evaluating Fits: The least squares equations r = P Δx, where P_ij = ∂m_i/∂x_j, have the solution Δx = (PᵀP)⁻¹ Pᵀ r. Always carefully examine the final residuals (r): plot them, look for >3σ values, and look for non-random behavior. Always look at the parameter correlations; correlation coefficient: ρ_jk = D_jk / sqrt( D_jj D_kk ), where D = (PᵀP)⁻¹.
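A sketch of the correlation-coefficient calculation from the weighted design matrix (names are assumptions):

    import numpy as np

    def parameter_correlations(P, sigma):
        # D = (P^T P)^-1 with P already divided by sigma; rho_jk = D_jk / sqrt(D_jj D_kk).
        Pw = P / sigma[:, None]
        D = np.linalg.inv(Pw.T @ Pw)             # parameter covariance matrix
        rho = D / np.sqrt(np.outer(np.diag(D), np.diag(D)))
        return D, rho

    # Residual checks: plot r/sigma, flag points with |r/sigma| > 3, look for patterns.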

Example: for a sinusoidal model the parameter uncertainty scales as σ_x = σ_d / sqrt( Σ cos²(ω t) ), shrinking as data are added, e.g., 1/sqrt(2) = 0.7, 1/sqrt(3) = 0.6, 1/sqrt(4) = 0.5.

Bayesian vs Least Squares Fitting: Least Squares fitting seeks the best parameter values and their uncertainties. Bayesian fitting seeks the posterior probability distribution of the parameters.

Bayesian Fitting: Bayesian question: what is the posterior probability distribution of the parameters? Answer: evaluate prob(m|d,I) ∝ prob(d|m,I) prob(m|I). If the data and parameter priors have Gaussian distributions, log( prob(m|d) ) = constant – ½ χ²_data – ½ χ²_param_priors. “Simply” evaluate this for all (reasonable) parameter values. But this can be computationally challenging: e.g., a modest problem with only 10 parameters, evaluated on a coarse grid of 100 values each, requires 100¹⁰ = 10²⁰ model calculations!

Markov chain Monte Carlo (McMC) methods: Instead of a complete exploration of parameter space, avoid regions with low probability and wander about quasi-randomly over the high-probability regions: “Monte Carlo” → random trials (like a roulette wheel in Monte Carlo casinos); “Markov chain” → the (k+1)-th trial parameter values are “close to” the k-th values.

McMC using the Metropolis-Hastings (M-H) algorithm:
1. Given the k-th model (i.e., values for all parameters in the model), generate the (k+1)-th model by small random changes: x_j|k+1 = x_j|k + β g σ_j, where β is an “acceptance fraction” parameter, g is a Gaussian random number (mean = 0, standard deviation = 1), and σ_j is the width of the posterior probability distribution of parameter x_j.
2. Evaluate the probability ratio: R = prob(m|d)|k+1 / prob(m|d)|k.
3. Draw a random number, U, uniformly distributed from 0 to 1.
4. If R > U, “accept” and store the (k+1)-th parameter values; else “replace” the (k+1)-th values with a copy of the k-th values and store them (NB: this yields many duplicate models).
The stored parameter values from the M-H algorithm give the posterior probability distribution of the parameters!
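A minimal, self-contained NumPy sketch of these four steps (function and argument names are illustrative; the log of prob(m|d) is used so the ratio R is formed stably):

    import numpy as np

    def metropolis_hastings(log_prob, x0, sigma_post, beta=0.3, n_trials=100_000, seed=0):
        # log_prob(x): log of prob(m|d) up to a constant; sigma_post: per-parameter step widths.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        lp = log_prob(x)
        chain = np.empty((n_trials, x.size))
        for k in range(n_trials):
            # 1. propose small random changes: x_j|k+1 = x_j|k + beta * g * sigma_j
            x_new = x + beta * rng.standard_normal(x.size) * sigma_post
            lp_new = log_prob(x_new)
            R = np.exp(lp_new - lp)           # 2. probability ratio R
            U = rng.uniform()                 # 3. uniform random number in [0, 1)
            if R > U:                         # 4. accept, ...
                x, lp = x_new, lp_new
            chain[k] = x                      # ... else store a copy of the k-th model
        return chain                          # histograms of the columns -> parameter PDFs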

Metropolis-Hastings (M-H) details: M-H McMC parameter adjustments: x_j|k+1 = x_j|k + β g σ_j, where β determines the “acceptance fraction” (start near 1/√N), g is a Gaussian random number (mean = 0, standard deviation = 1), and σ_j is the “sigma” of the posterior probability distribution of x_j. The M-H “acceptance fraction” should be about 23% for problems with many parameters and about 50% for few parameters; iteratively adjust β to achieve this; decreasing β increases the acceptance rate. Since one doesn’t know the parameter posterior uncertainties, σ_j, at the start, one needs to do trial solutions and iteratively adjust. When exploring the PDF of the parameters with the M-H algorithm, one should start with near-optimum parameter values, so discard the early “burn-in” trials.

M-H McMC flow:
Enter the data (d) and initial guesses for the parameters (x, σ_prior, σ_posteriori).
Start “burn-in” & “acceptance-fraction” adjustment loops (e.g., ~10 loops):
  start a McMC loop (with, e.g., ~10⁵ trials):
    make a new model: x_j|k+1 = x_j|k + β g σ_j,posteriori
    calculate the (k+1)-th log( prob(m|d) ) = constant – ½ ( χ²_data + χ²_param_priors )
    calculate the Metropolis ratio: R = exp( log(prob_k+1) – log(prob_k) )
    if R > U_k+1, accept and store the model
    if R < U_k+1, replace with the k-th model and store it
  end McMC loop
  estimate & update σ_posteriori and adjust β for the desired acceptance fraction
End “burn-in” loops.
Start the “real” McMC exploration with the latest parameter values, using the final σ_posteriori and β to determine the parameter step sizes. Use a large number of trials (e.g., ~10⁶).
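An illustrative driver for this flow, reusing the metropolis_hastings sketch above; the loop counts, trial numbers, and the 23% target come from these slides, while the update rules inside the loop are simple assumptions:

    import numpy as np

    def run_mcmc(log_prob, x0, sigma_post0, beta=0.3, n_burn_loops=10):
        x = np.asarray(x0, dtype=float)
        sigma_post = np.asarray(sigma_post0, dtype=float)
        for _ in range(n_burn_loops):                           # burn-in / tuning loops
            chain = metropolis_hastings(log_prob, x, sigma_post, beta, n_trials=100_000)
            x = chain[-1]                                       # latest parameter values
            sigma_post = chain.std(axis=0)                      # update sigma_posteriori
            accepted = np.any(np.diff(chain, axis=0) != 0, axis=1)
            A = max(accepted.mean(), 0.01)                      # measured acceptance rate
            beta *= A / 0.23                                    # steer toward ~23% acceptance
        return metropolis_hastings(log_prob, x, sigma_post, beta, n_trials=1_000_000)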

Estimation of σ_posteriori: Make a histogram of the trial parameter values (it must cover the full range). Start bin loop: check where the cumulative count crosses “–1σ” (the 15.9% point); check where it crosses “+1σ” (the 84.1% point). End bin loop. Estimates of the (Gaussian) σ_posteriori: σ ≈ ½ | p_val(+1σ) – p_val(–1σ) | or σ ≈ ¼ | p_val(+2σ) – p_val(–2σ) |. Relatively robust to non-Gaussian PDFs.
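A sketch of turning one parameter’s chain of trial values into a Gaussian-equivalent σ_posteriori via its ±1σ and ±2σ crossings (the percentile values are the standard Gaussian ones):

    import numpy as np

    def sigma_posteriori(samples):
        # +/-1 sigma and +/-2 sigma crossings of the cumulative distribution of trial values.
        lo1, hi1 = np.percentile(samples, [15.9, 84.1])
        lo2, hi2 = np.percentile(samples, [2.3, 97.7])
        return 0.5 * (hi1 - lo1), 0.25 * (hi2 - lo2)   # two estimates of sigma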

Adjusting the Acceptance Rate Parameter (β): Metropolis trial acceptance rule for the (k+1)-th trial model: if R > U_k+1, accept and store the model; if R < U_k+1, replace it with the k-th model and store that. For the n-th set of M-H McMC trials, count the cumulative numbers of accepted (N_a) and replaced (N_r) models. Acceptance rate: A = N_a / (N_a + N_r). For the (n+1)-th set of trials, set β_n+1 = ( A_n / A_desired ) β_n.
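A one-line implementation of this tuning rule (a sketch; the default A_desired of 0.23 follows the earlier slide):

    def update_beta(beta_n, n_accepted, n_replaced, A_desired=0.23):
        A_n = n_accepted / (n_accepted + n_replaced)   # measured acceptance rate
        return (A_n / A_desired) * beta_n              # beta_(n+1) = (A_n / A_desired) * beta_n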

“Non-least-squares” Bayesian fitting: Sivia gives 2 examples where the data uncertainties are not known Gaussians, so least squares is non-optimal: 1) prob(σ|σ₀) = σ₀/σ² for σ ≥ σ₀ (and 0 otherwise), where σ is the error on a datum, which typically is close to σ₀ (the minimum error) but can occasionally be much larger. 2) prob(σ'|σ,γ,β) = (1–γ) δ(σ' – σ) + γ δ(σ' – βσ), where γ is the fraction of “bad” data whose uncertainties are βσ.

“Error tolerant” Bayesian fitting: Sivia’s “conservative formulation”: the data uncertainties are given by prob(σ|σ₀) = σ₀/σ² for σ ≥ σ₀ (and 0 otherwise), where σ is the error on a datum, which typically is close to σ₀ (the minimum error) but can occasionally be much larger. Marginalizing over σ gives prob(d|m,σ₀) = ∫ prob(d|m,σ) prob(σ|σ₀) dσ = ∫_σ₀^∞ ( σ₀ / (σ³ √(2π)) ) exp[ –(d–m)²/2σ² ] dσ = ( 1/(σ₀ √(2π)) ) ( 1 – exp(–R²/2) ) / R², where R = (d–m)/σ₀. Thus, one maximizes Σ_i log( ( 1 – exp(–R_i²/2) ) / R_i² ) instead of minimizing χ², since Σ_i log( exp(–R_i²/2) ) = Σ_i –R_i²/2.
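A sketch of this error-tolerant log-likelihood (constant terms that do not depend on the parameters are dropped; the small floor on R² is an assumption to avoid 0/0 at R = 0, where the ratio tends to 1/2):

    import numpy as np

    def log_like_error_tolerant(d, m, sigma0):
        # sum_i log( (1 - exp(-R_i^2/2)) / R_i^2 ), with R_i = (d_i - m_i)/sigma0_i
        R2 = np.maximum(((d - m) / sigma0) ** 2, 1e-12)
        return np.sum(np.log((1.0 - np.exp(-0.5 * R2)) / R2))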

Data PDFs: The Gaussian pdf has a sharper peak, giving more accurate parameter estimates (provided all the data are good). The error-tolerant pdf doesn’t have a large penalty for a wild point, so it will not “care much” about some wild data.

Error tolerant fitting example: Goal: determine the motions of hundreds of maser spots. Data: maps (positions) at 12 epochs. Method: find all spots at nearly the same position over all epochs; then fit for linear motion. Problem: some “extra” spots appear near those selected for fitting (e.g., R > 10). There is too much data to plot, examine, and excise by hand.

Error tolerant fitting example: Error tolerant fitting output with no “human intervention”

The “good-and-bad” data Bayesian fitting: Box & Tiao’s (1968) data uncertainties come in two “flavors”, given by prob(σ'|σ,γ,β) = (1–γ) δ(σ' – σ) + γ δ(σ' – βσ), where γ is the fraction of “bad” data whose uncertainties are βσ. Marginalizing over σ' for Gaussian errors gives prob(d|m,σ,γ,β) = ∫ prob(d|m,σ') prob(σ'|σ,γ,β) dσ' = ( 1/(σ √(2π)) ) ( (γ/β) exp(–R²/2β²) + (1–γ) exp(–R²/2) ), where R = (d–m)/σ. Thus, one maximizes constant + Σ_i log( (γ/β) exp(–R_i²/2β²) + (1–γ) exp(–R_i²/2) ), which for no bad data (γ = 0) recovers least squares. But one must estimate 2 extra parameters: γ and β.
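A sketch of the corresponding two-flavor log-likelihood (the parameter names gamma and beta follow the reconstruction above; constants are dropped):

    import numpy as np

    def log_like_good_bad(d, m, sigma, gamma, beta):
        # sum_i log( (gamma/beta) exp(-R_i^2/(2 beta^2)) + (1 - gamma) exp(-R_i^2/2) );
        # gamma = fraction of "bad" data, beta = factor by which their errors are inflated.
        R2 = ((d - m) / sigma) ** 2
        return np.sum(np.log((gamma / beta) * np.exp(-0.5 * R2 / beta**2)
                             + (1.0 - gamma) * np.exp(-0.5 * R2)))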

Estimation of parameter PDFs: Bayesian fitting result: the histogram of the M-H trial values of a parameter is its PDF. This “integrates” over all values of all the other parameters and is the parameter estimate “marginalized” over all other parameters. Parameter correlations: e.g., plot all trial values of x_i versus x_j.