# Bayesian and Least Squares Fitting: Problem: Given data (d) and a model (m) with adjustable parameters (x), what are the best values and uncertainties for the parameters?


Bayesian and Least Squares fitting: Problem: Given data (d) and a model (m) with adjustable parameters (x), what are the best values and uncertainties for the parameters? Bayes' Theorem: prob(m|d,I) ∝ prob(d|m,I) prob(m|I). With Gaussian data uncertainties and flat priors, log( prob(m|d) ) ≈ constant – ½ χ²_data, so maximizing log( prob(m|d) ) is equivalent to minimizing chi-squared (i.e., least squares).

Linearize about a starting point x|₀: r = P Δx, and x_fitted = x|₀ + Δx.

Weighted least-squares: Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j. Each datum equation: r_i = Σ_j ( P_ij Δx_j ). Dividing both sides of each equation by the datum uncertainty, σ_i, i.e., r_i → r_i/σ_i and P_ij → P_ij/σ_i for each j, gives the variance-weighted solution.
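
A minimal numpy sketch of this weighting (an illustration; the helper name `weighted_lsq` is invented here, not from the slides):

```python
import numpy as np

def weighted_lsq(P, r, sigma):
    """Variance-weighted least squares: divide each datum equation
    r_i = sum_j P_ij * dx_j by its uncertainty sigma_i, then solve."""
    Pw = P / sigma[:, None]          # scale each row of P by 1/sigma_i
    rw = r / sigma                   # scale each residual by 1/sigma_i
    dx, *_ = np.linalg.lstsq(Pw, rw, rcond=None)
    return dx
```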

Including priors in least-squares: Least squares equations: r = d – m = P Δx, where P_ij = ∂m_i/∂x_j. Each data equation: r_i = Σ_j ( P_ij Δx_j ). The weighted data (residuals) need not be homogeneous: r = ( d – m )/σ can be composed of N "normal" data and some "prior-like data". Possible prior-like datum: x_k = v_k ± σ_k (for the k-th parameter). Then r_{N+1} = ( v_k – x_k )/σ_k and P_{N+1,j} = 1/σ_k for j = k, and 0 for j ≠ k.
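
One possible way to append such a prior-like row to an already variance-weighted P and r (an illustrative sketch, not code from the slides):

```python
import numpy as np

def add_prior_row(P, r, k, v_k, x_k, sigma_k):
    """Append a prior-like datum x_k = v_k +/- sigma_k as one extra
    weighted equation: residual (v_k - x_k)/sigma_k, partial 1/sigma_k
    in column k and 0 elsewhere."""
    n_par = P.shape[1]
    row = np.zeros(n_par)
    row[k] = 1.0 / sigma_k
    P_aug = np.vstack([P, row])
    r_aug = np.append(r, (v_k - x_k) / sigma_k)
    return P_aug, r_aug
```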

Non-linear models: Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j, has solution Δx = (Pᵀ P)⁻¹ Pᵀ r. If the partial derivatives of the model are independent of the parameters, then the first-order Taylor expansion is exact and applying the parameter corrections, Δx, gives the final answer. Example linear problem: m = x_1 + x_2 t + x_3 t². If not, you have a non-linear problem and the 2nd and higher order terms in the Taylor expansion can be important until Δx → 0, so iteration is required. Example non-linear problem: m = sin( x_1 t ).
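
As a small illustration of the one-step solution for the linear case (example data and names invented here, not from the slides):

```python
import numpy as np

# Example linear problem m = x1 + x2*t + x3*t^2: the partials
# dm/dx = [1, t, t^2] do not depend on the parameters, so one
# solve of dx = (P^T P)^-1 P^T r gives the answer (no iteration).
t = np.linspace(0.0, 10.0, 50)
x_true = np.array([2.0, -1.0, 0.3])
d = x_true[0] + x_true[1] * t + x_true[2] * t**2   # noiseless "data"

x0 = np.zeros(3)                                   # starting guess
P = np.column_stack([np.ones_like(t), t, t**2])    # P_ij = dm_i/dx_j
r = d - (x0[0] + x0[1] * t + x0[2] * t**2)         # residuals d - m
dx = np.linalg.solve(P.T @ P, P.T @ r)             # normal equations
print(x0 + dx)                                     # recovers x_true
```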

How to calculate partial derivatives (P_ij): Analytic formulae (if the model can be expressed analytically). Numerical evaluation: "wiggle" parameters one at a time: x^w = x except for the j-th parameter, x^w_j = x_j + δx. Partial derivative of the i-th datum for parameter j: P_ij = ( m_i(x^w) – m_i(x) ) / ( x^w_j – x_j ). NB: choose δx small enough to avoid 2nd-order errors, but large enough to avoid numerical inaccuracies. Always use 64-bit computations!
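
A sketch of this "wiggle" recipe in numpy (illustrative; the `model` callable and the step size `dx` are assumptions):

```python
import numpy as np

def numerical_partials(model, x, dx=1e-6):
    """Estimate P_ij = dm_i/dx_j by wiggling one parameter at a time.
    `model(x)` returns the model values m_i; `dx` must be small enough
    to avoid 2nd-order errors but large enough for numerical accuracy."""
    m0 = model(x)
    P = np.empty((m0.size, x.size))
    for j in range(x.size):
        xw = x.copy()                 # x^w = x except for parameter j
        xw[j] += dx
        P[:, j] = (model(xw) - m0) / (xw[j] - x[j])
    return P
```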

Can do very complicated modeling: Example problem: model the pulsating photosphere of Mira variables (see Reid & Goldston 2002, ApJ, 568, 931). Data: observed flux, S(t,λ), at radio, IR and optical wavelengths. Model: assume a power-law temperature, T(r,t), and density, ρ(r,t); calculate opacity sources (ionization equilibrium, H₂ formation, …); numerically integrate radiative transfer along ray-paths through the atmosphere for many impact parameters and wavelengths; parameters include T₀ and ρ₀ at radius r₀. Even though the model is complicated and not analytic, one can easily calculate partials numerically and solve for the best parameter values.

Modeling Mira Variables: Visual: Δm_v ~ 8 mag ~ a factor of 1000; variable formation of TiO clouds at ~2R_* with top T ~ 1400 K. IR: seeing the pulsating stellar surface. Radio: H⁻ free-free opacity at ~2R_*.

Iteration and parameter adjustment: Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j, has solution Δx = (Pᵀ P)⁻¹ Pᵀ r. It is often better to make parameter adjustments slowly, so for the (k+1)-th iteration set x|_{k+1} = x|_k + λ Δx|_k, where 0 < λ < 1. NB: this is equivalent to scaling the partial derivatives by 1/λ. So, if one iterates enough, one only needs to get the sign of the partial derivatives correct!
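
A hedged sketch of the damped iteration, combining numerical partials with x|_{k+1} = x|_k + λ Δx|_k (the helper name `iterate_fit` and the default λ = 0.3 are choices made here, not from the slides):

```python
import numpy as np

def iterate_fit(model, x, d, sigma, lam=0.3, n_iter=50, dx_wiggle=1e-6):
    """Damped Gauss-Newton style iteration: at each step solve the
    weighted linearized equations and apply only a fraction `lam`
    of the correction, x <- x + lam * dx, until dx -> 0."""
    for _ in range(n_iter):
        m = model(x)
        r = (d - m) / sigma                       # weighted residuals
        P = np.empty((d.size, x.size))
        for j in range(x.size):                   # numerical partials
            xw = x.copy()
            xw[j] += dx_wiggle
            P[:, j] = (model(xw) - m) / dx_wiggle / sigma
        dx = np.linalg.solve(P.T @ P, P.T @ r)
        x = x + lam * dx
    return x
```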

Evaluating Fits: Least squares equations: r = P Δx, where P_ij = ∂m_i/∂x_j, has solution Δx = (Pᵀ P)⁻¹ Pᵀ r. Always carefully examine the final residuals (r): plot them, look for >3σ values, look for non-random behavior. Always look at parameter correlations; correlation coefficient: ρ_jk = D_jk / sqrt( D_jj D_kk ), where D = (Pᵀ P)⁻¹.
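
An illustrative numpy helper for D = (PᵀP)⁻¹ and the correlation coefficients ρ_jk (names invented here):

```python
import numpy as np

def parameter_correlations(P, sigma):
    """Covariance D = (P^T P)^-1 of the weighted problem and the
    correlation coefficients rho_jk = D_jk / sqrt(D_jj * D_kk)."""
    Pw = P / sigma[:, None]
    D = np.linalg.inv(Pw.T @ Pw)
    rho = D / np.sqrt(np.outer(np.diag(D), np.diag(D)))
    return D, rho
```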

[Slide example: parameter uncertainty scales as 1/sqrt( Σ cos²(ωt) ), e.g. 1/√2 ≈ 0.7, 1/√3 ≈ 0.6, 1/√4 = 0.5.]

Bayesian vs Least Squares Fitting: Least squares fitting seeks the best parameter values and their uncertainties. Bayesian fitting seeks the posterior probability distribution of the parameters.

Bayesian Fitting: Bayesian: what is the posterior probability distribution of the parameters? Answer: evaluate prob(m|d,I) ∝ prob(d|m,I) prob(m|I). If the data and the parameter priors have Gaussian distributions, log( prob(m|d) ) ≈ constant – ½ χ²_data – ½ χ²_param_priors. "Simply" evaluate this for all (reasonable) parameter values. But this can be computationally challenging: e.g., a modest problem with only 10 parameters, evaluated on a coarse grid of 100 values each, requires 10²⁰ model calculations!

Markov chain Monte Carlo (McMC) methods: Instead of a complete exploration of parameter space, avoid regions of low probability and wander quasi-randomly over high-probability regions: "Monte Carlo" → random trials (like a roulette wheel in Monte Carlo casinos); "Markov chain" → the (k+1)-th trial parameter values are "close to" the k-th values.

McMC using the Metropolis-Hastings (M-H) algorithm:
1. Given the k-th model (i.e., values for all parameters in the model), generate the (k+1)-th model by small random changes: x_j|_{k+1} = x_j|_k + β g σ_j, where β is an "acceptance fraction" parameter, g is a Gaussian random number (mean = 0, standard deviation = 1), and σ_j is the width of the posterior probability distribution of parameter x_j.
2. Evaluate the probability ratio: R = prob(m|d)|_{k+1} / prob(m|d)|_k.
3. Draw a random number, U, uniformly distributed from 0 to 1.
4. If R > U, "accept" and store the (k+1)-th parameter values; else "replace" the (k+1)-th values with a copy of the k-th values and store them (NB: this yields many duplicate models).
The stored parameter values from the M-H algorithm give the posterior probability distribution of the parameters!
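
A minimal sketch of these four steps (illustrative only; `log_prob`, the starting point `x0`, the step widths `sigma_post`, and `beta` are assumed inputs). Working with log probabilities, the test R > U becomes log R > log U:

```python
import numpy as np

def metropolis_hastings(log_prob, x0, sigma_post, beta, n_trials, rng=None):
    """Minimal M-H sampler: propose x_{k+1} = x_k + beta*g*sigma_j,
    accept if the probability ratio R exceeds a uniform random number U,
    otherwise store a copy of the k-th model."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    lp = log_prob(x)
    chain = np.empty((n_trials, x.size))
    for k in range(n_trials):
        x_new = x + beta * rng.standard_normal(x.size) * sigma_post
        lp_new = log_prob(x_new)
        # R > U  <=>  log R > log U, with R = prob_new / prob_old
        if lp_new - lp > np.log(rng.uniform()):
            x, lp = x_new, lp_new          # accept
        chain[k] = x                        # store (duplicates on reject)
    return chain
```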

Metropolis-Hastings (M-H) details: M-H McMC parameter adjustments: x_j|_{k+1} = x_j|_k + β g σ_j, where β determines the "acceptance fraction" (start near 1/√N), g is a Gaussian random number (mean = 0, standard deviation = 1), and σ_j is the "sigma" of the posterior probability distribution of x_j. The M-H acceptance fraction should be about 23% for problems with many parameters and about 50% for few parameters; iteratively adjust β to achieve this; decreasing β increases the acceptance rate. Since one doesn't know the posterior parameter uncertainties, σ_j, at the start, one needs to do trial solutions and iteratively adjust. When exploring the PDF of the parameters with the M-H algorithm, one should start with near-optimum parameter values, so discard the early "burn-in" trials.

M-H McMC flow:
- Enter data (d) and initial guesses for the parameters (x, σ_prior, σ_posteriori).
- Start "burn-in" & "acceptance-fraction" adjustment loops (e.g., ~10 loops):
  - start a McMC loop (with, e.g., ~10^5 trials):
    - make a new model: x_j|_{k+1} = x_j|_k + β g σ_j,posteriori
    - calculate the (k+1)-th log( prob(m|d) ) ≈ constant – ½ ( χ²_data + χ²_param_priors )
    - calculate the Metropolis ratio: R = exp( log(prob_{k+1}) – log(prob_k) )
    - if R > U_{k+1}, accept and store the model; if R < U_{k+1}, replace with the k-th model and store it
  - end the McMC loop
  - estimate & update σ_posteriori and adjust β for the desired acceptance fraction
- End the "burn-in" loops.
- Start the "real" McMC exploration with the latest parameter values, using the final σ_posteriori and β to set the parameter step sizes; use a large number of trials (e.g., ~10^6).

Estimation of σ_posteriori: Make a histogram of the trial parameter values (it must cover the full range). Loop over bins, checking where the cumulative count crosses the "–1σ" level (15.9%) and the "+1σ" level (84.1%). Estimates of the (Gaussian) σ_posteriori: ½ | p_val(+1σ) – p_val(–1σ) | or ¼ | p_val(+2σ) – p_val(–2σ) |. Relatively robust to non-Gaussian PDFs.
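
A compact version of this estimate using percentiles of the stored trials (an illustrative helper; 15.9% and 84.1% are the ±1σ crossing points of a Gaussian):

```python
import numpy as np

def sigma_posteriori(samples):
    """Gaussian-equivalent sigma from the stored trial values of one
    parameter: half the spread between the cumulative 15.9% and 84.1%
    points (the +/-1 sigma crossings); fairly robust to non-Gaussian PDFs."""
    lo, hi = np.percentile(samples, [15.9, 84.1])
    return 0.5 * (hi - lo)
```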

Adjusting the Acceptance Rate Parameter (β): Metropolis trial acceptance rule for the (k+1)-th trial model: if R > U_{k+1}, accept and store the model; if R < U_{k+1}, replace with the k-th model and store it. For the n-th set of M-H McMC trials, count the cumulative number of accepted (N_a) and replaced (N_r) models. Acceptance rate: A = N_a / (N_a + N_r). For the (n+1)-th set of trials, set β_{n+1} = ( A_n / A_desired ) β_n.
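
As a short sketch (helper name invented here), the rescaling between sets of trials is:

```python
def adjust_beta(beta, n_accepted, n_replaced, a_desired=0.23):
    """After a set of trials, rescale the step parameter:
    beta_{n+1} = (A_n / A_desired) * beta_n, with A = Na / (Na + Nr)."""
    a_n = n_accepted / (n_accepted + n_replaced)
    return (a_n / a_desired) * beta
```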

“Non-least-squares” Bayesian fitting: Sivia gives 2 examples where the data uncertainties are not known Gaussians, so least squares is non-optimal: 1) prob(σ|σ₀) = σ₀/σ² for σ ≥ σ₀ (0 otherwise), where σ is the error on a datum, which typically is close to σ₀ (the minimum error) but can occasionally be much larger. 2) a two-component ("good-and-bad") error model in which a fraction β of the data are "bad", with uncertainties γ times larger than σ₀.

“Error tolerant” Bayesian fitting: Sivia's "conservative formulation": data uncertainties are given by prob(σ|σ₀) = σ₀/σ² for σ ≥ σ₀ (0 otherwise), where σ is the error on a datum, which typically is close to σ₀ (the minimum error) but can occasionally be much larger. Marginalizing over σ gives prob(d|m,σ₀) = ∫ prob(d|m,σ) prob(σ|σ₀) dσ = ∫_{σ₀}^∞ [1/(σ√(2π))] exp[ –(d–m)²/(2σ²) ] (σ₀/σ²) dσ = [1/(σ₀√(2π))] ( 1 – exp(–R²/2) ) / R², where R = (d–m)/σ₀. Thus, one maximizes Σᵢ log( ( 1 – exp(–Rᵢ²/2) ) / Rᵢ² ) instead of minimizing χ², i.e., instead of maximizing Σᵢ log( exp(–Rᵢ²/2) ) = Σᵢ –Rᵢ²/2.
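
A possible implementation of this error-tolerant log-likelihood, dropping the constant factor (illustrative; the guard near R = 0 uses the limit (1 – exp(–R²/2))/R² → ½):

```python
import numpy as np

def log_like_error_tolerant(d, m, sigma0):
    """Sum of log[ (1 - exp(-R^2/2)) / R^2 ] with R = (d - m)/sigma0,
    up to an additive constant. Near R = 0 the ratio tends to 1/2,
    so guard against 0/0."""
    R2 = ((d - m) / sigma0) ** 2
    safe = np.maximum(R2, 1e-12)
    ratio = np.where(R2 < 1e-12, 0.5, (1.0 - np.exp(-0.5 * safe)) / safe)
    return np.sum(np.log(ratio))
```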

Data PDFs: The Gaussian pdf has a sharper peak, giving more accurate parameter estimates (provided all the data are good). The error-tolerant pdf does not impose a large penalty for a wild point, so it will not "care much" about some wild data.

Error tolerant fitting example: Goal: determine the motions of hundreds of maser spots. Data: maps (positions) at 12 epochs. Method: find all spots at nearly the same position over all epochs, then fit for linear motion. Problem: some "extra" spots appear near those selected to fit (e.g., R > 10). Too much data to plot, examine and excise by hand.

Error tolerant fitting example: Error tolerant fitting output with no “human intervention”

The “good-and-bad” data Bayesian fitting: Box & Tiao's (1968) data uncertainties come in two "flavors": a fraction (1–β) of "good" data have uncertainty σ₀, and a fraction β of "bad" data have uncertainty γσ₀ (γ > 1). Marginalizing over σ for Gaussian errors gives prob(d|m,σ₀,β,γ) = ∫ prob(d|m,σ) prob(σ|σ₀,β,γ) dσ = [1/(σ₀√(2π))] ( (β/γ) exp(–R²/(2γ²)) + (1–β) exp(–R²/2) ), where R = (d–m)/σ₀. Thus, one maximizes constant + Σᵢ log( (β/γ) exp(–Rᵢ²/(2γ²)) + (1–β) exp(–Rᵢ²/2) ), which for no bad data (β = 0) recovers least squares. But one must estimate 2 extra parameters: β and γ.
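
An illustrative implementation of this two-component log-likelihood, again dropping the constant (names and numerical details are choices made here, not from Box & Tiao):

```python
import numpy as np

def log_like_good_and_bad(d, m, sigma0, beta, gamma):
    """Sum of log[ (beta/gamma) exp(-R^2/(2 gamma^2)) + (1-beta) exp(-R^2/2) ]
    with R = (d - m)/sigma0; beta = fraction of "bad" data, gamma = factor by
    which their uncertainties are larger. beta = 0 reduces to least squares."""
    R2 = ((d - m) / sigma0) ** 2
    like = (beta / gamma) * np.exp(-0.5 * R2 / gamma**2) \
         + (1.0 - beta) * np.exp(-0.5 * R2)
    return np.sum(np.log(like))
```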

Estimation of parameter PDFs: Bayesian fitting result: the histogram of M-H trial parameter values (the PDF). This "integrates" over all values of all other parameters and is the parameter estimate "marginalized" over all other parameters. Parameter correlations: e.g., plot all trial values of x_i versus x_j.
