Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.


1 Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016

2 Collaborators and Funding. Collaborators: H.P. Singh, R. Gupta, C. Ngeow, L. Macri, A. Bhardwaj, S. Das, R. Kundu, S. Deb, A. Nanthakumar. Funding: NSF, IUSSTF, IUCAA, Delhi University, SUNY Oswego. Websites: http://www.oswego.edu/~kanbur/iucaa2016/ and http://www.oswego.edu/~kanbur/DU2014

3 Linear Regression. A very common type of model in science. Data (x_i, y_i), i = 1, …, N, with Y_i = a + b x_i + ε_i, where x_i and y_i are the independent and dependent variables respectively, a and b are the intercept and slope respectively, and ε_i is the error. The error model is usually ε_i ~ N(0, σ²). We are interested in testing hypotheses on the slope b.

4 Linear Regression. Least-squares estimates of the slope and intercept are given by b̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and â = ȳ − b̂x̄, with standard errors SE(b̂) = σ̂ / √Σ(x_i − x̄)² and SE(â) = σ̂ √(1/N + x̄²/Σ(x_i − x̄)²), where σ̂² = Σε̂_i²/(N − 2) is the residual variance estimate.
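As a minimal sketch, the least-squares formulas above can be computed directly with numpy; the data set and noise level below are made up for illustration.

```python
import numpy as np

# Synthetic data: true intercept 1.0, true slope 2.0 (made-up values).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

N = x.size
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)

b_hat = np.sum((x - xbar) * (y - ybar)) / Sxx   # slope estimate
a_hat = ybar - b_hat * xbar                     # intercept estimate

resid = y - (a_hat + b_hat * x)
sigma2 = np.sum(resid ** 2) / (N - 2)           # residual variance estimate

se_b = np.sqrt(sigma2 / Sxx)                    # standard error of the slope
se_a = np.sqrt(sigma2 * (1.0 / N + xbar ** 2 / Sxx))  # standard error of the intercept
print(b_hat, se_b, a_hat, se_a)
```

The estimates should recover the true slope and intercept to within a few standard errors.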

5 Linear Regression. Interested in testing whether the following model is better: H₀: b = b₀ versus H_A: b = b₁ for x ≤ x₀ and b = b₂ for x > x₀. That is, there is a change of slope at x₀, the break point. We can fit regression lines to the data on each side of the break point to obtain slope estimates b̂₁ and b̂₂.

6 Linear Regression. The standard way to “check” this is to look at the intervals b̂ ± m·SE(b̂) on each side of the break and see whether they are mutually exclusive. This essentially puts confidence intervals around the slope estimates. With m chosen as the appropriate quantile, the probability that the true slope lies in the interval above is 1 − α, i.e. the probability of an error is α.

7 Linear Regression. Then let A = {“short-period” slope is wrong} and B = {“long-period” slope is wrong}. In comparing the long- and short-period slopes, the probability of at least one mistake is 1 − (1 − α)² = 2α − α², assuming the two comparisons are independent. If 0 < α < 1, then 2α − α² > α. So if we instead carry out a single statistical test to significance level α, the statistical tests outlined in this talk have a smaller chance of making an error than the overlapping-intervals comparison.
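A quick numerical check of this point: two independent comparisons at level α give a combined error probability of 2α − α², which always exceeds α itself for 0 < α < 1.

```python
# Combined error probability for two independent level-alpha comparisons.
alpha = 0.05
p_at_least_one = 1.0 - (1.0 - alpha) ** 2   # = 2*alpha - alpha**2
print(p_at_least_one)
```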

8 F Test. Perhaps the simplest way to test for nonlinearity is to use the F test: F = [(RSS_R − RSS_F)/(ν_R − ν_F)] / (RSS_F/ν_F), where the subscripts R and F stand for the reduced and full models respectively, ν stands for the degrees of freedom, and RSS stands for the residual sum of squares. Refer this test statistic to the theoretical F(ν_R − ν_F, ν_F) distribution. Assumptions: normality, homoskedasticity and independent, identically distributed observations.
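A sketch of this test in Python, comparing a single line (reduced model) with two lines split at a break point x₀, which is assumed known here; the data are synthetic.

```python
import numpy as np
from scipy import stats

def f_test_break(x, y, x0):
    """F test: reduced model = one line, full model = two lines split at x0."""
    def rss(xs, ys):
        X = np.column_stack([np.ones_like(xs), xs])
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
        return np.sum((ys - X @ beta) ** 2)

    n = x.size
    nu_r, nu_f = n - 2, n - 4                         # degrees of freedom
    rss_r = rss(x, y)                                 # reduced-model RSS
    lo = x <= x0
    rss_f = rss(x[lo], y[lo]) + rss(x[~lo], y[~lo])   # full-model RSS
    F = ((rss_r - rss_f) / (nu_r - nu_f)) / (rss_f / nu_f)
    p = stats.f.sf(F, nu_r - nu_f, nu_f)              # refer to F(nu_r - nu_f, nu_f)
    return F, p

# Synthetic broken-line data with a slope change at x = 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 100)
y = np.where(x <= 1.0, 1.0 + 2.0 * x, 3.0 + 0.5 * (x - 1.0)) + rng.normal(0.0, 0.05, 100)
F, p = f_test_break(x, y, x0=1.0)
print(F, p)
```

With a genuine slope change the p-value is very small, so the single-line (reduced) model is rejected.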

9 Normality/Heteroskedasticity. Data (X_i, Y_i) with fitted values Y_i^f and residuals ε_i, so Y_i = Y_i^f + ε_i. Permute the residuals without replacement (the bootstrap samples with replacement): ε_i^n = ε_{π(i)}, and form Y_i^n = Y_i^f + ε_i^n. With (X_i, Y_i^n) compute the F statistic; repeat to obtain F_1, F_2, …. Find the proportion of the F_i that are greater than the observed value of F. For heteroskedasticity, plot the residuals against the independent variable; if the spread varies, try a transformation, perhaps a log.
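The permutation recipe above can be sketched as follows: refit after permuting the single-line residuals to build a null distribution for F, so no normality assumption is needed. The break point and data are synthetic.

```python
import numpy as np

def rss_and_fit(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fit = X @ beta
    return np.sum((y - fit) ** 2), fit

def f_stat(x, y, x0):
    rss_r, _ = rss_and_fit(x, y)                 # reduced: one line
    lo = x <= x0
    rss_f = rss_and_fit(x[lo], y[lo])[0] + rss_and_fit(x[~lo], y[~lo])[0]
    return ((rss_r - rss_f) / 2.0) / (rss_f / (x.size - 4))

def permutation_pvalue(x, y, x0, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    f_obs = f_stat(x, y, x0)
    _, fit = rss_and_fit(x, y)                   # fitted values Y_i^f
    resid = y - fit
    hits = 0
    for _ in range(n_perm):
        y_perm = fit + rng.permutation(resid)    # permute without replacement
        hits += f_stat(x, y_perm, x0) >= f_obs
    return hits / n_perm                         # proportion of F_i >= observed F

# Synthetic data with a real slope change at x = 1.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 80)
y = np.where(x <= 1.0, x, 1.0 + 3.0 * (x - 1.0)) + rng.normal(0.0, 0.05, 80)
p = permutation_pvalue(x, y, x0=1.0)
print(p)
```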

10 Testing for Normality. Data (X_i, Y_i), i = 1, …, N. Empirical quantiles: F_n(u) = #{Y_i ≤ u}/N; compare these with the quantiles expected from a normal distribution. If the data are from a normal distribution, the q-q plot should be close to a straight line.
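A minimal numerical version of this check: compare the empirical quantiles with theoretical normal quantiles; for normal data the q-q relation is close to a straight line, i.e. the correlation is near 1. The sample below is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.normal(0.0, 1.0, 200)                      # synthetic residuals

emp = np.sort(resid)                                   # empirical quantiles
probs = (np.arange(1, emp.size + 1) - 0.5) / emp.size  # plotting positions
theo = stats.norm.ppf(probs)                           # theoretical normal quantiles

r = np.corrcoef(theo, emp)[0, 1]   # straightness of the q-q plot
print(r)
```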

11 Random Walk Methods. Order the independent variable: x₁ < x₂ < … < x_N. If r_k is the kth residual from a linear regression, form the partial sums C(j) = Σ_{k=1}^{j} r_k. If the data are consistent with a single linear regression, then the C(j) are a simple random walk. Our test statistic, R, is the vertical range of the C(j): R = max_j C(j) − min_j C(j).

12 Random Walk Methods. If the partial sums are a random walk, R will be small; systematic departures from linearity inflate it. Permute the r_k so as to randomize the residuals, then recompute R. Repeat this procedure for a large number (~10000) of permutations. The significance statistic is the fraction of the permuted R statistics that are greater than the observed value of R: this is the significance level under the null hypothesis of linearity. This is a non-parametric test and does not depend on normality of the errors.
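The procedure above can be sketched as follows; the break point and data are synthetic, and fewer permutations are used than the ~10000 suggested, purely to keep the example fast.

```python
import numpy as np

def vertical_range(resid):
    c = np.cumsum(resid)                  # partial sums C(j)
    return c.max() - c.min()

def random_walk_pvalue(x, y, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(x)                 # order by the independent variable
    xs, ys = x[order], y[order]
    X = np.column_stack([np.ones_like(xs), xs])
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    resid = ys - X @ beta                 # residuals r_k from a single line
    r_obs = vertical_range(resid)
    hits = sum(vertical_range(rng.permutation(resid)) >= r_obs
               for _ in range(n_perm))
    return hits / n_perm                  # small => evidence against linearity

# Synthetic data with a slope change at x = 1.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 100)
y = np.where(x <= 1.0, x, 1.0 + 3.0 * (x - 1.0)) + rng.normal(0.0, 0.05, 100)
p = random_walk_pvalue(x, y)
print(p)
```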

13 Testimator (a combined “test” and “estimator”). Sort the data in order of increasing independent variable. Divide the sample into N₁ non-overlapping, and hence completely independent, subsets. Each subset has n data points, with any remaining data points included in the last subset. We fit a linear regression to the first subset and determine an initial slope estimate, β′.

14 Testimator. This initial estimate of the slope becomes β₀ for the next subset, under the null hypothesis that the slope of the second subset is equal to the slope of the first subset. We calculate the t-statistic t = |β̂ − β₀| / SE(β̂), where β̂ is the slope fitted to the current subset.

15 Testimator. Since there will be n_g = N₁ − 1 hypothesis tests, the critical t value is of Bonferroni type, t_crit = t(α/(2n_g), ν), where ν is the degrees of freedom in each subset. Once we know the observed and critical values of the t-statistic, we compute k = t_obs/t_crit, which measures how plausible the initial testimator guess is. If k < 1, the null hypothesis is accepted and we derive the new testimator slope for the next subset as a combination of the current estimate and the previously determined β’s.

16 Testimator. This value of the testimator is taken as β₀ for the next subset. This process of hypothesis testing is repeated n_g times, or until k > 1, which suggests rejection of the null hypothesis; that is, the data are more consistent with a non-linear relation.

17 The Extra-Galactic Distance Scale. μ = m − M, so μ = m − (a + b log P). In a calibrating galaxy, observe Cepheids and determine M = a + b log P. In a target galaxy, observe Cepheid magnitudes m_i, i = 1, …, N, so μ_i = m_i − (a + b log P_i). Write y = Lq, where y = (m₁, m₂, …, m_N), q = (a, b, μ₁, μ₂, …, μ_N) is the vector of unknowns and L is an N × (N+2) matrix containing 1’s and log P_i’s.

18 The Extra-Galactic Distance Scale. This is a vector equation for the q’s, easily solvable using the General Linear Model interface in R. Minimizing χ² = (y − Lq)ᵀ C⁻¹ (y − Lq), where C is the matrix of measurement errors, yields the maximum-likelihood estimator for q; this is the weighted least-squares estimate when the errors are normally distributed: q′ = (Lᵀ C⁻¹ L)⁻¹ Lᵀ C⁻¹ y, and the standard errors for the parameters in q′ come from the covariance matrix (Lᵀ C⁻¹ L)⁻¹. If you formulate your statistical data-analysis problems in this General Linear Model formalism, they are very easy to solve in R along with a full error analysis.
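A numpy sketch of the estimator above, q′ = (LᵀC⁻¹L)⁻¹LᵀC⁻¹y with covariance (LᵀC⁻¹L)⁻¹. Here q = (a, b) for a single PL relation with made-up numbers; the slides solve the larger (a, b, μ₁, …, μ_N) system in the same way in R.

```python
import numpy as np

# Synthetic PL-relation data: true intercept and slope are made up.
rng = np.random.default_rng(0)
a_true, b_true = -1.5, -2.5
logP = rng.uniform(0.5, 1.5, 40)
err = rng.uniform(0.02, 0.10, 40)                # per-point measurement errors
y = a_true + b_true * logP + rng.normal(0.0, err)

L = np.column_stack([np.ones_like(logP), logP])  # design matrix
Cinv = np.diag(1.0 / err ** 2)                   # C is diagonal here

cov = np.linalg.inv(L.T @ Cinv @ L)              # parameter covariance (L^T C^-1 L)^-1
q_hat = cov @ (L.T @ Cinv @ y)                   # weighted least-squares estimate q'
se = np.sqrt(np.diag(cov))                       # standard errors on (a, b)
print(q_hat, se)
```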

19 The Extra-Galactic Distance Scale and Bayes. The Bayesian GLM formalism applied to the estimation of H₀.

20 Segmented Lines and the Davies Test. The model is Y = a_s + b_s X + ψ(X) Δa (X − X_b), where Δa = a_L − a_S is the change of slope between the long- and short-period regimes, and ψ(X) = 0 for X < X_b, ψ(X) = 1 for X ≥ X_b. This assumes a continuous transition between the two linear models. A more general situation, allowing a discontinuity, is Y = a_s + b_s X + ψ(X)[Δa (X − X_b) − γ], where γ represents the magnitude of the gap.

21 Segmented Lines. Choose an initial break point X_b′ and fit the other parameters in the equation. Estimate a new break point, X_b″ = X_b′ + γ/Δa. Repeat until γ ≈ 0.
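The iteration above can be sketched as follows (it is essentially Muggeo's segmented-regression scheme): at a provisional break point, fit the model including the gap term, then shift the break point by γ/Δ until the gap vanishes. The data and starting break point are synthetic.

```python
import numpy as np

def fit_segmented(x, y, xb, n_iter=30, tol=1e-8):
    """Iteratively fit Y = a + bX + psi(X)[Delta(X - X_b) - gamma]."""
    for _ in range(n_iter):
        u = np.where(x > xb, x - xb, 0.0)      # psi(X) * (X - X_b) term
        v = -(x > xb).astype(float)            # carries the gap coefficient gamma
        X = np.column_stack([np.ones_like(x), x, u, v])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        a_s, b_s, delta, gamma = beta
        step = gamma / delta                   # X_b'' = X_b' + gamma / Delta
        xb = xb + step
        if abs(step) < tol:                    # stop when gamma ~ 0
            break
    return a_s, b_s, delta, xb

# Synthetic continuous broken line: slopes 1 and 3, break at x = 1.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2.0, 200))
y = np.where(x <= 1.0, 1.0 + x, 2.0 + 3.0 * (x - 1.0)) + rng.normal(0.0, 0.02, 200)
a_s, b_s, delta, xb = fit_segmented(x, y, xb=0.8)
print(xb, delta)
```

Starting from a deliberately wrong break point (0.8), the iteration converges to the true break near 1.0.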

22 Cepheid PL Relations

23 Cepheid PC Relations

24 Multiphase PL Relations

25 Multiwavelength PL Relations

26 Galactic PL Relations

27 ExtraGalactic PL Relations
