1 Correlation with Errors-In-Variables and an Application to Galaxies
William H. Jefferys, University of Texas at Austin, USA

2 A Problem in Correlation
A graduate student at Maryland asked me to assist her with a problem involving galaxy data. She wanted to know whether the data showed clear evidence of correlation and, if so, what the correlation was and how strong the evidence for it was.

3 The Data

4 Comments on Data
At first glance the correlation seems obvious. But there is an unusual feature of her problem: she knew that the data were imperfect, and each data point had an error bar in both x and y. Standard treatments of correlation do not address this situation.

5 The Data

6 Comments on Data
The presence of the error bars contributes to uncertainty about how large the correlation is and how well it is determined.
The data are sparse, so we also need to be concerned about small-number statistics.
The student was also concerned about how the lowest point affected any correlation. What would happen, she wondered, if it were not included in the sample? [She was afraid that the ellipticity of that particular galaxy was so low that it might not have been measured accurately, and in fact that the galaxy might belong to a different class of galaxies.]

7 The Data

8 Bayesian Analysis and Astronomy
She had been trying to use our program GaussFit to analyze the data, but it is not designed for tasks such as measuring correlation. I, of course, suggested to her that a Bayesian approach might be appropriate.
Bayesian methods offer many advantages for astronomical research and have attracted much recent interest. Astronomy and Astrophysics Abstracts lists 169 articles with the keywords ‘Bayes’ or ‘Bayesian’ in the past 5 years, and the number is increasing rapidly (there were 53 in 2000 alone, up from 33 in 1999).

9 Advantages of Bayesian Methods
Bayesian methods allow us to do things that would be difficult or impossible with standard (frequentist) statistical analysis.
It is simple to incorporate prior physical or statistical information.
Interpretation of results is very natural.
Model comparison is easy and straightforward. (This problem is such a case.)
It is a systematic way of approaching statistical problems, rather than a collection of ad hoc techniques.
Very complex problems (difficult or impossible to handle classically) are straightforwardly analyzed within a Bayesian framework.

10 The Fundamental Idea
In a nutshell, Bayesian analysis entails the following uniform and systematic steps:
Choose prior distributions (“priors”) that reflect your knowledge about each parameter and model under consideration, prior to looking at the data.
Determine the likelihood function for the data under each model and parameter value.
Compute and normalize the full posterior distribution, conditioned on the data, using Bayes’ theorem.
Derive summaries of quantities of interest from the full posterior distribution by integrating over it to produce marginal distributions or integrals of interest (e.g., means, variances).

11 Priors
Choose prior distributions (“priors”) that reflect your knowledge about each parameter and model prior to looking at the data.
The investigator is required to provide all relevant prior information that he has before proceeding.
There is always prior information. For example, we cannot count a negative number of photons. Parallaxes are greater than zero. We now know that the most likely value of the Hubble constant is in the ballpark of 60–80 km/s/Mpc (say), with smaller probabilities of its being higher or lower.
Prior information can be statistical in nature, e.g., we may have statistical knowledge about the spatial or velocity distribution of stars, or the variation in a telescope’s plate scale.

12 Prior Information
In Bayesian analysis, our knowledge about an unknown quantity θ is encoded in a prior distribution on the quantity in question, e.g., p(θ | B), where B is background information.
Where prior information is vague or uninformative, a vague prior often recovers results similar to a classical analysis.
» EXCEPTION: Model selection/model averaging.
Sensitive dependence of the results on reasonable variations in prior information indicates that no analysis, Bayesian or otherwise, can give reliable results.

13 Likelihood Function
Determine the likelihood function of the data under each model and parameter value.
The likelihood function describes the statistical properties of the mathematical model of our problem. It tells us how the statistics of the observations (e.g., normal or Poisson data) are related to the parameters and any background information.
It is proportional to the sampling distribution for observing the data Y, given the parameters, but we are interested in its functional dependence on the parameters: L(θ; Y) ∝ p(Y | θ).
The likelihood is known up to a constant but arbitrary factor, which cancels out in the analysis.

14 Posterior Distribution
Compute and normalize the full posterior distribution, conditioned on the data, using Bayes’ theorem.
The posterior distribution encodes what we know about the parameters and model after we observe the data. Thus, Bayesian analysis models learning.
Bayes’ theorem says that p(θ | Y) = p(Y | θ) p(θ) / p(Y).
Bayes’ theorem is a trivial result of probability theory. The denominator is just a normalization factor and can often be dispensed with.

15 Bayes’ Theorem (Proof)
By standard probability theory, p(θ, Y) = p(θ | Y) p(Y) = p(Y | θ) p(θ), from which Bayes’ theorem follows immediately.

16 Integration and Marginalization
Derive summaries of quantities of interest from the full posterior distribution by integrating over it to produce marginal distributions or integrals of interest (e.g., means, variances).
Bayesian methodology provides a simple and systematic way of handling nuisance parameters that are required by the analysis but are of no interest to us. We just integrate them out (marginalize them) to obtain the marginal distribution of the parameter(s) of interest. Likewise, computing summary statistics such as posterior means and variances is simple; the relevant integrals are sketched below.
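
The integrals the slide alludes to can be written as follows (a sketch, using θ₁ for the parameters of interest and θ₂ for the nuisance parameters; this notation is ours, not the slide's):

```latex
p(\theta_1 \mid Y) = \int p(\theta_1, \theta_2 \mid Y)\, d\theta_2,
\qquad
E[\theta_1 \mid Y] = \int \theta_1\, p(\theta_1 \mid Y)\, d\theta_1,
\qquad
\mathrm{Var}[\theta_1 \mid Y] = \int \bigl(\theta_1 - E[\theta_1 \mid Y]\bigr)^2\, p(\theta_1 \mid Y)\, d\theta_1 .
```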

17 Bayesian Model Selection/Averaging
Given models M_i, each of which depends on a vector of parameters θ_M, and given data Y, Bayes’ theorem tells us that p(θ_M, M_i | Y) ∝ p(Y | θ_M, M_i) p(θ_M | M_i) p(M_i).
The probabilities p(θ_M | M_i) and p(M_i) are the prior probabilities of the parameters given the model and of the model, respectively; p(Y | θ_M, M_i) is the likelihood function; and p(θ_M, M_i | Y) is the joint posterior probability distribution of the parameters and models, given the data.
Note that some parameters may not appear in some models, and there is no requirement that the models be nested.

18 Bayesian Model Selection
Assume for the moment that we have supplied priors and performed the necessary integrations to produce a normalized posterior distribution. In practice this is often done by simulation using Markov Chain Monte Carlo (MCMC).
Once this has been done, it is simple in principle to compute the posterior probabilities of the models: p(M_i | Y) = ∫ p(θ_M, M_i | Y) dθ_M.
This set of numbers summarizes our degree of belief in each of the models, after looking at the data. In model selection, we choose the model with the highest posterior probability.

19 Strategy
I do not see a simple frequentist approach to this student’s problem. A reasonable Bayesian approach is fairly straightforward: assume that the underlying “true” (but unknown) galaxy parameters ξ_i and η_i (corresponding to the observed x_i and y_i) are distributed as a bivariate normal distribution.

20 Strategy
Since we do not know ξ_i and η_i for each galaxy but instead observe only x_i and y_i, we introduce the ξ_i and η_i for each galaxy as latent variables. These are parameters to be estimated.
Here a and b give the true center of the distribution, ρ is the true correlation coefficient, and σ_ξ and σ_η are the true standard deviations. None of these quantities is known; they are also parameters that must be estimated.

21 Strategy
[Since we are using a bivariate normal, the variance-covariance matrix is

Σ = ( σ_ξ²        ρ σ_ξ σ_η )
    ( ρ σ_ξ σ_η   σ_η²      )

with inverse

Σ⁻¹ = 1/(1 − ρ²) · ( 1/σ_ξ²         −ρ/(σ_ξ σ_η) )
                   ( −ρ/(σ_ξ σ_η)    1/σ_η²      )

22 Strategy
So the density is

p(ξ_i, η_i | a, b, ρ, σ_ξ, σ_η) = 1 / (2π σ_ξ σ_η √(1 − ρ²)) · exp(−½ u′ Σ⁻¹ u),

where u = (ξ_i − a, η_i − b)′.]

23 Strategy
This expression may be regarded as our prior on the latent variables ξ_i and η_i. It depends on the other parameters (a, b, ρ, σ_ξ, σ_η). We can regard these as hyperparameters, which will in turn require their own priors.
The joint prior on all the latent variables ξ_i and η_i can be written as a product:

p(ξ, η | a, b, ρ, σ_ξ, σ_η) = ∏_i p(ξ_i, η_i | a, b, ρ, σ_ξ, σ_η),

where ξ and η are vectors whose components are ξ_i and η_i respectively.

24 Strategy
We know the distributions of the data x_i and y_i conditional on ξ_i and η_i. Their joint distribution is given by

p(x_i, y_i | ξ_i, η_i) = 1 / (2π s_i t_i) · exp(−(x_i − ξ_i)² / (2 s_i²) − (y_i − η_i)² / (2 t_i²)).

Here s_i and t_i are the standard deviations of the data points, assumed known perfectly for this analysis (these are the basis of the error bars I showed earlier...).
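
A small simulation of the generative model described on the last few slides: true (ξ_i, η_i) drawn from a bivariate normal, observed (x_i, y_i) obtained by adding Gaussian noise with known standard deviations s_i, t_i. The independence of the x and y errors and all numerical values (a, b, ρ, the σ's, n) are illustrative assumptions, not the student's data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hyperparameters of the "true" bivariate normal (illustrative values only).
a, b = 0.0, 0.0                 # true center
sigma_xi, sigma_eta = 1.0, 1.0  # true standard deviations
rho = -0.8                      # true correlation (illustrative)

n = 15                          # a sparse sample, as in the talk
cov = np.array([[sigma_xi**2,                 rho * sigma_xi * sigma_eta],
                [rho * sigma_xi * sigma_eta,  sigma_eta**2              ]])

# Latent "true" galaxy parameters (xi_i, eta_i).
xi, eta = rng.multivariate_normal([a, b], cov, size=n).T

# Known measurement standard deviations (the error bars) and the observed data.
s = rng.uniform(0.2, 0.5, size=n)
t = rng.uniform(0.2, 0.5, size=n)
x = xi + rng.normal(0.0, s)     # errors in x
y = eta + rng.normal(0.0, t)    # errors in y
```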

25 The Data
The “true” value is somewhere near this error ellipse.

26 Strategy
Now we can write down the likelihood, the joint probability of observing the data, conditional on the parameters (here only the latent parameters appear; the others enter implicitly through the prior on the latent parameters):

p(X, Y | ξ, η, S, T) = ∏_i p(x_i, y_i | ξ_i, η_i),

where X, Y, S, and T are vectors whose components are x_i, y_i, s_i, and t_i, respectively.

27 Priors
The next step is to assign priors for each of the parameters, including the latent variables. Lacking special information, I chose conventional priors for all but ρ and M. Thus, I assign:
Improper constant (flat) priors on a and b.
Improper Jeffreys priors 1/σ_ξ and 1/σ_η on σ_ξ and σ_η.
We have two models, one with correlation (M_1) and one without (M_0). I assign p(M_1) = p(M_0) = 1/2. We will compare M_1 and M_0 by computing their posterior probabilities.
I chose the prior p(ρ | M_1) on ρ to be flat and normalized on [–1, 1] and zero elsewhere; I chose a delta-function prior p(ρ | M_0) = δ(ρ – 0) under M_0.
Priors on ξ and η were displayed earlier.

28 Posterior Distribution
The posterior distribution is proportional to the prior times the likelihood, as Bayes instructs us.
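
Putting the pieces from the preceding slides together, the unnormalized joint posterior has the following form (a sketch of what "prior times likelihood" means for this model; the explicit factorization is our reconstruction, not the slide's equation):

```latex
p(a, b, \rho, \sigma_\xi, \sigma_\eta, \xi, \eta, M \mid X, Y)
\;\propto\;
p(M)\, p(\rho \mid M)\, \frac{1}{\sigma_\xi}\, \frac{1}{\sigma_\eta}\,
\prod_i p(\xi_i, \eta_i \mid a, b, \rho, \sigma_\xi, \sigma_\eta)\,
\prod_i p(x_i, y_i \mid \xi_i, \eta_i).
```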

29 Simulation Strategy
We used simulation to generate a sample from the posterior distribution through a combination of Gibbs and Metropolis-Hastings samplers (“Metropolis-within-Gibbs”). The sample can be used to calculate quantities of interest:
» Compute the posterior mean and variance of the correlation coefficient ρ (calculate the sample mean and variance of ρ).
» Plot the posterior distribution of ρ (plot a histogram of ρ from the sample).
» Determine quantiles of the posterior distribution of ρ (use quantiles of the sample).
» Compute posterior probabilities of each model (calculate the frequency of the model in the sample).
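
A minimal sketch of these summaries as code, assuming the sampler output is available as NumPy arrays of sampled ρ values and model indicators (0 for M_0, 1 for M_1); the names and structure are ours, not the original program:

```python
import numpy as np

def summarize(rho_samples, model_samples):
    """Posterior summaries from MCMC output (one entry per post-burn-in iteration).

    rho_samples   -- sampled values of the correlation rho
    model_samples -- sampled model indicator: 0 for M0, 1 for M1
    """
    rho_m1 = rho_samples[model_samples == 1]   # rho under the model with correlation

    # Posterior mean, standard deviation, median, and quantiles of rho (sample statistics).
    print("mean rho:", rho_m1.mean(), "+/-", rho_m1.std(ddof=1))
    print("median rho:", np.median(rho_m1))
    print("95% interval:", np.quantile(rho_m1, [0.025, 0.975]))

    # Posterior probability of each model (frequency of the model in the sample),
    # and the posterior odds on the model with correlation.
    p_m1 = model_samples.mean()
    print("p(M1 | data):", p_m1, "  odds on correlation:", p_m1 / (1.0 - p_m1))

    # The posterior distribution of rho itself would be plotted as a histogram of rho_m1.
```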

30 Posterior Conditionals
The conditional distribution on ξ_i and η_i is proportional to p(ξ_i, η_i | a, b, ρ, σ_ξ, σ_η) · p(x_i, y_i | ξ_i, η_i). By combining terms and completing the square we can sample ξ_i and η_i from a bivariate normal.
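
The completed square can be written explicitly as a standard Gaussian-Gaussian update (our reconstruction under the independent-error assumption, not the slide's formula):

```latex
(\xi_i, \eta_i)' \mid \text{rest} \;\sim\;
N_2\!\left( V_i \left[ \Sigma^{-1} (a, b)' + D_i\, (x_i, y_i)' \right],\; V_i \right),
\qquad
V_i = \left( \Sigma^{-1} + D_i \right)^{-1},
\qquad
D_i = \mathrm{diag}\!\left( 1/s_i^2,\ 1/t_i^2 \right).
```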

31 Posterior Conditionals
Similarly, the posterior conditional on a, b is proportional to ∏_i p(ξ_i, η_i | a, b, ρ, σ_ξ, σ_η) (the priors on a and b are flat). Again, by completing the square we can sample a and b from a bivariate normal.
Note that if this were not an EIV problem, we would have x’s and y’s instead of ξ’s and η’s.

32 Posterior Conditionals
The posterior conditional on σ_ξ and σ_η is proportional to (1/σ_ξ)(1/σ_η) ∏_i p(ξ_i, η_i | a, b, ρ, σ_ξ, σ_η).
It wasn’t obvious to me that we could sample this in a Gibbs step, but maybe there’s a way to do it. I just used independent M-H steps with a uniform symmetric proposal distribution, tuning the step size for good mixing, and this worked fine.
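
A minimal sketch of a Metropolis-Hastings update with a symmetric uniform proposal, of the kind described for σ_ξ and σ_η; the function log_cond and the step size are placeholders, not the original code:

```python
import numpy as np

def mh_uniform_step(value, log_cond, half_width, rng):
    """One M-H step with a symmetric uniform proposal centered on the current value.

    value      -- current value of the parameter (e.g. sigma_xi)
    log_cond   -- function returning the log of the (unnormalized) full conditional
    half_width -- half-width of the uniform proposal; tuned for good mixing
    """
    proposal = value + rng.uniform(-half_width, half_width)
    if proposal <= 0.0:                 # a standard deviation must stay positive
        return value                    # target density is zero there: reject
    # Symmetric proposal: q cancels, so the M-H ratio is just the density ratio.
    log_alpha = log_cond(proposal) - log_cond(value)
    if np.log(rng.uniform()) < log_alpha:
        return proposal                 # accept
    return value                        # reject: keep the old value
```

In the sampler this would be applied once per sweep to σ_ξ and once to σ_η, with the half-width adjusted for good mixing.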

33 Posterior Conditionals
We do a reversible-jump step on ρ and M simultaneously. The idea is to propose a model M and, at the same time, a correlation ρ in an M-H step, and either accept or reject according to the M-H ratio α.
Proposing from M_0 to M_0 or from M_1 to M_1 is basically simple, just an ordinary M-H step. If we are proposing between models, things are a bit more complicated, because the dimensionalities of the parameter spaces differ between the two models.

34 Posterior Conditionals
The posterior conditional of ρ under M_1 is

p(ρ | ξ, η, a, b, σ_ξ, σ_η, M_1) ∝ ½ ∏_i p(ξ_i, η_i | a, b, ρ, σ_ξ, σ_η),   −1 ≤ ρ ≤ 1.

(The leading factor of 1/2 comes from the prior on ρ and is very important.)

35 Posterior Conditionals
The posterior conditional of ρ under M_0 is

p(ρ | ξ, η, a, b, σ_ξ, σ_η, M_0) ∝ δ(ρ) ∏_i p(ξ_i, η_i | a, b, 0, σ_ξ, σ_η).

The δ function guarantees that ρ = 0. Here the proportionality factor is chosen so as to match the factor of 1/2 under M_1. The factors come from the priors [p(ρ | M_1) ~ U(–1, 1), which has an implicit factor 1/2, and p(ρ | M_0) ~ δ(ρ)].

36 Posterior Conditionals
The M-H ratio when jumping from (M, ρ) to (M*, ρ*) is therefore (where the q’s are the proposals)

α = min{1, [p(M*, ρ* | …) q(M, ρ | …)] / [p(M, ρ | …) q(M*, ρ* | …)]}.

We sampled using a beta proposal q(ρ | M_1, …) with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal q(M_1 | …) that was also chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M_1. The idea is that a beta proposal on ρ matches the actual conditional pretty well, so it will be accepted with high probability; the M-H ratio will be close to 1.
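
A minimal sketch of the reversible-jump move just described, assuming the U(–1, 1) prior on ρ under M_1 and the point mass at ρ = 0 under M_0 from the earlier slides. The beta proposal is applied to (ρ + 1)/2 so it lives on [–1, 1], which is our assumption about how the proposal was parameterized; log_lik_rho, q_m1, a, and b are placeholders, not the original code:

```python
import numpy as np
from scipy import stats

def log_target(m, rho, log_lik_rho, log_prior_m=np.log(0.5)):
    """Log of the unnormalized joint posterior restricted to (M, rho).

    log_lik_rho -- function giving the log of prod_i p(xi_i, eta_i | a, b, rho, ...),
                   with all other parameters held at their current values (placeholder).
    """
    if m == 0:                                   # M0: rho fixed at 0 (point-mass prior)
        return log_prior_m + log_lik_rho(0.0)
    return log_prior_m + np.log(0.5) + log_lik_rho(rho)   # M1: U(-1,1) prior density 1/2

def rj_step(m, rho, log_lik_rho, rng, q_m1=0.5, a=2.0, b=8.0):
    """One reversible-jump M-H step on (M, rho); q_m1, a, b are tuning constants."""
    m_star = 1 if rng.uniform() < q_m1 else 0
    if m_star == 1:
        u = rng.beta(a, b)
        rho_star = 2.0 * u - 1.0
        log_q_star = stats.beta.logpdf(u, a, b) - np.log(2.0)   # Jacobian of rho = 2u - 1
    else:
        rho_star, log_q_star = 0.0, 0.0          # rho is fixed under M0: no draw needed
    log_q_m_star = np.log(q_m1) if m_star == 1 else np.log(1.0 - q_m1)

    # Proposal density of the *current* state (needed for the reverse move).
    if m == 1:
        log_q_cur = stats.beta.logpdf((rho + 1.0) / 2.0, a, b) - np.log(2.0)
    else:
        log_q_cur = 0.0
    log_q_m_cur = np.log(q_m1) if m == 1 else np.log(1.0 - q_m1)

    log_alpha = (log_target(m_star, rho_star, log_lik_rho) + log_q_m_cur + log_q_cur) \
              - (log_target(m, rho, log_lik_rho) + log_q_m_star + log_q_star)
    if np.log(rng.uniform()) < log_alpha:
        return m_star, rho_star                  # accept the proposed (M, rho)
    return m, rho                                # reject: keep the current state
```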

37 Sampling Strategy for Our Problem
To summarize:
We sampled a, b and the ξ_j, η_j in Gibbs steps (a and b appear in the posterior distribution as a bivariate normal distribution, as do the ξ_j, η_j).
We sampled σ_ξ, σ_η with M-H steps using symmetric uniform proposals centered on the current point, adjusting the maximum step for good mixing.
We sampled ρ and M in a simultaneous reversible-jump M-H step, using a beta proposal on ρ with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal on M that was also chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M.

38 Results
For the data set including the circled point, we obtained:
Odds on the model with correlation = 207 (assumes prior odds equal to 1)
Median ρ = −0.81
Mean ρ = −0.79 ± 0.10

39 Posterior distribution of ρ (including all points)

40 Results
For the data set including the circled point, we obtained:
Odds on the model with correlation = 207 (assumes prior odds equal to 1)
Median ρ = −0.81
Mean ρ = −0.79 ± 0.10
For the data set without the circled point, we obtained:
Odds on the model with correlation = 9.9
Median ρ = −0.70
Mean ρ = −0.68 ± 0.16

41 Posterior distribution of ρ (excluding 1 point)

42 Final Comments
This problem combines several interesting features:
Latent variables, introduced because this is an errors-in-variables problem.
Model selection, implemented through reversible-jump MCMC simulation.
A combination of Gibbs and Metropolis-Hastings steps to implement the sampler (“Metropolis-within-Gibbs”).
It is a good example of how systematic application of basic Bayesian analysis can yield a satisfying solution to a problem that, viewed frequentistically, seems almost intractable.

43 Final Comments
One final comment: if you look at the tail area in either of the two cases investigated, you will see that it is much less than the 1/200 or 1/10 odds ratio that we calculated for the odds of M_0 against M_1. This is an example of how tail areas in general are not reliable statistics for deciding whether a hypothesis should be selected.

44 Practical Computation
Until recently, a major practical difficulty has been computing the required integrals, limiting the method to situations where results can be obtained exactly or with analytic approximations.
In the past 15 years, considerable progress has been made in solving the computational difficulties, particularly with the development of Markov Chain Monte Carlo (MCMC) methods for simulating a random sample from the full posterior distribution, from which marginal distributions and summary means and variances (as well as other averages) can be calculated conveniently.
These methods have their origin in physics. The Metropolis-Hastings and Gibbs sampler methods are two popular schemes that originated in early attempts to solve large physics problems by Monte Carlo methods.

45 Practical Computation: Markov Chains
Start from an arbitrary point in the space of models and parameters. Following a specific set of rules, which depend only on the unnormalized posterior distribution, generate a random walk in this space, such that the distribution of the generated points converges to a sample drawn from the underlying posterior distribution.
The random walk is a Markov chain: that is, each step depends only upon the immediately previous step, and not on any of the earlier steps.
Many rules for generating the transition from one state to the next are possible. All converge to the same distribution. One attempts to choose a rule that will give efficient sampling with a reasonable expenditure of effort and time.

46 Gibbs Sampler
The Gibbs sampler is a scheme for generating a sample from the full posterior distribution by sampling in succession from the conditional distributions, when they can be sampled from conveniently.
Thus, let the parameter vector θ be decomposed into a set of subvectors θ_1, θ_2, …, θ_n. Suppose it is possible to sample from the conditional distributions
p(θ_1 | θ_2, θ_3, …, θ_n),
p(θ_2 | θ_1, θ_3, …, θ_n),
…
p(θ_n | θ_1, θ_2, …, θ_{n−1}).

47 Gibbs Sampler (2)
Starting from an arbitrary initial vector θ^(0) = (θ_1^(0), θ_2^(0), …, θ_n^(0)), generate in succession vectors θ^(1), θ^(2), … by sampling in succession from the conditional distributions:
p(θ_1^(k) | θ_2^(k−1), θ_3^(k−1), …, θ_n^(k−1)),
p(θ_2^(k) | θ_1^(k), θ_3^(k−1), …, θ_n^(k−1)),
…
p(θ_n^(k) | θ_1^(k), θ_2^(k), …, θ_{n−1}^(k)),
with θ^(k) = (θ_1^(k), θ_2^(k), …, θ_n^(k)).
In the limit, the sample thus generated will converge (in distribution) to a sample drawn from the full posterior distribution.
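
A toy illustration of this recursion in code, for two components whose full conditionals are known. The target here is a standard bivariate normal with correlation ρ, chosen only because its conditionals are simple univariate normals; it is not the talk's posterior:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is univariate normal:
      theta1 | theta2 ~ N(rho * theta2, 1 - rho**2), and symmetrically for theta2.
    """
    theta1, theta2 = 0.0, 0.0              # arbitrary starting point
    draws = np.empty((n_iter, 2))
    for k in range(n_iter):
        theta1 = rng.normal(rho * theta2, np.sqrt(1.0 - rho**2))
        theta2 = rng.normal(rho * theta1, np.sqrt(1.0 - rho**2))
        draws[k] = theta1, theta2
    return draws

rng = np.random.default_rng(0)
sample = gibbs_bivariate_normal(rho=0.9, n_iter=5000, rng=rng)
print(np.corrcoef(sample[2000:].T))        # after burn-in, close to 0.9
```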

48 Metropolis-Hastings Step
Gibbs sampling works when the conditional distributions are standard distributions from which samples can easily be drawn. This is not usually the case, and we would then have to replace Gibbs steps with another scheme.
A Metropolis-Hastings step involves producing a sample from a suitable proposal distribution q(θ* | θ), where θ is the value at the previous step. Then a calculation is done to see whether to accept the new θ* as the new step, or to keep the old θ as the new step. If we retain the old value, the sampler does not “move” the parameter θ at this step. If we accept the new value, it will move.
We choose q so that random samples can easily be generated from it, and with other desirable characteristics. This is part of the art of designing “good” (efficient) sampling schemes.

49 Metropolis-Hastings Step (2)
Specifically, if p(θ) is the target distribution from which we wish to sample, first generate θ* from q(θ* | θ). Then calculate

α = min{1, [p(θ*) q(θ | θ*)] / [p(θ) q(θ* | θ)]}.

Then generate a random number r uniform on [0, 1]. Accept the proposed θ* if r ≤ α; otherwise keep θ.
Note that if p(θ*) = q(θ* | θ) for all θ, θ*, then we will always accept the new value. In this case the Metropolis-Hastings step becomes an ordinary Gibbs step.
Generally, although Metropolis-Hastings steps are guaranteed to produce a Markov chain with the right limiting distribution, one gets better performance the closer we can approximate p(θ*) by q(θ* | θ).
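
The recipe above translates directly into code; a generic sketch for a scalar parameter, working with log densities for numerical stability (an implementation choice, not part of the slide):

```python
import numpy as np

def mh_step(theta, log_p, propose, log_q, rng):
    """One Metropolis-Hastings step.

    theta   -- current value
    log_p   -- log of the (unnormalized) target density p
    propose -- function drawing theta* from q(theta* | theta)
    log_q   -- function giving log q(theta_to | theta_from)
    """
    theta_star = propose(theta, rng)
    # alpha = min{1, p(theta*) q(theta | theta*) / (p(theta) q(theta* | theta))}
    log_alpha = (log_p(theta_star) + log_q(theta, theta_star)) \
              - (log_p(theta) + log_q(theta_star, theta))
    r = rng.uniform()                     # uniform on [0, 1]
    return theta_star if np.log(r) <= log_alpha else theta

# Example: sample a standard normal using a Gaussian random-walk proposal.
rng = np.random.default_rng(0)
log_p = lambda th: -0.5 * th**2
propose = lambda th, rng: th + rng.normal(0.0, 1.0)
log_q = lambda th_to, th_from: -0.5 * (th_to - th_from)**2   # symmetric, so it cancels
chain, theta = [], 0.0
for _ in range(5000):
    theta = mh_step(theta, log_p, propose, log_q, rng)
    chain.append(theta)
print(np.mean(chain), np.std(chain))      # roughly 0 and 1
```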

