Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.


Collaborators and Funding Collaborators: H. P. Singh, R. Gupta, C. Ngeow, L. Macri, A. Bhardwaj, S. Das, R. Kundu, S. Deb, A. Nanthakumar. Funding: NSF, IUSSTF, IUCAA, Delhi University, SUNY Oswego. Website:

Linear Regression A very common type of model in science. Given data (x_i, y_i), i = 1,…,N, the model is Y_i = a + b x_i + ε_i, where x_i and y_i are the independent and dependent variables respectively, a and b are the intercept and slope respectively, and ε_i is the error term. The error model is usually ε_i ~ N(0, σ²). We are interested in testing hypotheses about the slope b.

Linear Regression The least-squares estimates of the slope and intercept are b̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and â = ȳ − b̂x̄, with standard errors SE(b̂) = s / √(Σ(x_i − x̄)²) and SE(â) = s·√(1/N + x̄²/Σ(x_i − x̄)²), where s² = Σ(y_i − â − b̂x_i)² / (N − 2) is the residual variance.
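These closed-form estimates can be sketched in a few lines of Python (numpy only; the function name ols_fit is mine, not from the talk):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least-squares fit of y = a + b*x with standard errors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    b = np.sum((x - xbar) * (y - ybar)) / sxx        # slope estimate
    a = ybar - b * xbar                              # intercept estimate
    s2 = np.sum((y - a - b * x) ** 2) / (n - 2)      # residual variance
    se_b = np.sqrt(s2 / sxx)                         # SE of the slope
    se_a = np.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx)) # SE of the intercept
    return a, b, se_a, se_b
```

On noiseless data y = 2 + 3x the fit recovers a = 2, b = 3 with vanishing standard errors.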

Linear Regression We are interested in testing whether the following model is better: H_0: b = b_0 vs. H_A: b = b_1 for x ≤ x_0 and b = b_2 for x > x_0. That is, there is a change of slope at x_0, the break point. We can fit regression lines to the data on each side of the break point to obtain slope estimates b̂_1 and b̂_2.
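Fitting separate lines on either side of a candidate break point can be sketched as follows (a Python illustration, not code from the talk; the helper names are mine):

```python
import numpy as np

def slope_and_se(x, y):
    """Least-squares slope and its standard error for one segment."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    b = np.sum((x - xbar) * (y - ybar)) / sxx
    a = ybar - b * xbar
    s2 = np.sum((y - a - b * x) ** 2) / (n - 2)
    return b, np.sqrt(s2 / sxx)

def slopes_about_break(x, y, x0):
    """Slope estimates (with standard errors) on each side of the break x0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo = x <= x0
    return slope_and_se(x[lo], y[lo]), slope_and_se(x[~lo], y[~lo])
```

For piecewise-linear data with slopes 1 and 2 meeting at x = 5, the two returned slope estimates recover 1 and 2.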

Linear Regression The standard way to “check” this is to place confidence intervals around the two slope estimates, b̂ ± m·SE(b̂), and see whether the intervals are mutually exclusive. Depending on the choice of the multiplier m, the probability that the true slope lies in the interval is 1 − α, i.e. the probability of an error is α.

Linear Regression Let A = {“short period” slope is wrong} and B = {“long period” slope is wrong}, each with probability α. In comparing the long- and short-period slopes, the probability of at least one mistake is P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 2α − α² when the two events are independent. Since 2α − α² > α for 0 < α < 1, comparing two separate confidence intervals has a larger error probability than a single test carried out at significance level α; the statistical tests outlined in this talk therefore have a smaller chance of making an error.

F Test Perhaps the simplest way to test for nonlinearity is the F test: F = [(RSS_R − RSS_F)/(ν_R − ν_F)] / (RSS_F/ν_F), where the subscripts R and F denote the reduced and full models respectively, ν denotes the degrees of freedom, and RSS denotes the residual sum of squares. Refer this statistic to the theoretical F(ν_R − ν_F, ν_F) distribution. The test assumes normality, homoskedasticity and IID observations.

Normality/Heteroskedasticity If these assumptions are in doubt, use a permutation version of the F test. Given data (X_i, Y_i) with fitted values Y^f_i and residuals ε_i, so that Y_i = Y^f_i + ε_i: permute the residuals without replacement (the bootstrap samples with replacement), ε^n_i = ε_j, and form new responses Y^n_i = Y^f_i + ε^n_i. With (X_i, Y^n_i), recompute the F statistic; repeat to obtain F_1, F_2, …, and find the proportion of the F_i that are greater than the observed value of F. For heteroskedasticity, plot the residuals against the independent variable; if the scatter is not constant, try a transformation, perhaps a log.
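The permutation F test above can be sketched in Python for the one-line (reduced) vs. broken-line (full) comparison; this is my illustration of the procedure, with a fixed, user-supplied break point x0 and function names of my own:

```python
import numpy as np

def rss_line(x, y):
    """Residual sum of squares for a single least-squares line."""
    b, a = np.polyfit(x, y, 1)
    return np.sum((y - (a + b * x)) ** 2)

def f_stat(x, y, x0):
    """F statistic: one line (reduced, 2 params) vs two lines broken at x0 (full, 4 params)."""
    lo = x <= x0
    rss_r = rss_line(x, y)
    rss_f = rss_line(x[lo], y[lo]) + rss_line(x[~lo], y[~lo])
    nu_r, nu_f = len(x) - 2, len(x) - 4
    return ((rss_r - rss_f) / (nu_r - nu_f)) / (rss_f / nu_f)

def permutation_f_test(x, y, x0, n_perm=500, seed=0):
    """p-value: fraction of permuted F statistics exceeding the observed one."""
    rng = np.random.default_rng(seed)
    b, a = np.polyfit(x, y, 1)           # reduced-model fit
    fitted = a + b * x
    resid = y - fitted
    f_obs = f_stat(x, y, x0)
    count = 0
    for _ in range(n_perm):
        y_perm = fitted + rng.permutation(resid)  # permute residuals, no replacement
        count += f_stat(x, y_perm, x0) > f_obs
    return f_obs, count / n_perm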

Testing for Normality Given data (X_i, Y_i), i = 1,…,N, form the empirical quantiles F_n(u) = #{Y_i ≤ u}/N and compare them with those expected from a normal distribution. If the data are from a normal distribution, the q-q plot should be close to a straight line.
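A q-q comparison against the normal distribution can be sketched with the standard library's NormalDist (a Python illustration; the plotting-position convention (i − 0.5)/N is one common choice, not specified in the talk):

```python
import numpy as np
from statistics import NormalDist

def qq_points(data):
    """Empirical quantiles vs standard-normal quantiles; near-linear if Gaussian."""
    data = np.sort(np.asarray(data, float))
    n = len(data)
    probs = (np.arange(1, n + 1) - 0.5) / n             # plotting positions
    theor = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theor, data
```

Straightness can be quantified by the correlation of the two coordinate sets, which is close to 1 for a genuinely normal sample.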

Random Walk Methods Order the data by the independent variable: x_1 < x_2 < … < x_N. If r_k is the kth residual from a linear regression fit, form the partial sums C(j) = Σ_{k=1}^{j} r_k. If the data are consistent with a single linear regression, the C(j) behave like a simple random walk. Our test statistic, R, is the vertical range of the C(j): R = max_j C(j) − min_j C(j).

Random Walk Methods If the partial sums are a genuine random walk, R will be comparatively small; systematic departures from linearity make the C(j) drift and inflate R. Permute the r_k so that the residuals are randomized, then recompute R. Repeat this procedure for a large number (~10000) of permutations. The significance statistic is the fraction of the permuted R statistics that are greater than the observed value of R: this is the significance level under the null hypothesis of linearity. This is a non-parametric test and does not depend on normality of the errors.
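The range statistic and its permutation p-value can be sketched as follows (my Python illustration of the procedure described above; fewer permutations than the ~10000 suggested, for speed):

```python
import numpy as np

def range_statistic(resid):
    """Vertical range of the partial sums C(j) of the residuals."""
    c = np.cumsum(resid)
    return c.max() - c.min()

def random_walk_test(x, y, n_perm=500, seed=0):
    """Permutation p-value for R; a small p-value suggests nonlinearity."""
    order = np.argsort(x)                          # order by the independent variable
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    r_obs = range_statistic(resid)
    rng = np.random.default_rng(seed)
    exceed = sum(range_statistic(rng.permutation(resid)) > r_obs
                 for _ in range(n_perm))
    return r_obs, exceed / n_perm
```

Applied to clearly curved data (e.g. a parabola fitted by a single line), the ordered residuals drift, R is large, and the p-value is small.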

Testimator (Test Estimator) Sort the data in order of increasing independent variable. Divide the sample into N_1 non-overlapping, and hence completely independent, subsets. Each subset has n data points, with any remaining data points included in the last subset. Fit a linear regression to the first subset to obtain an initial slope estimate, β′.

Testimator This initial estimate of the slope becomes β_0 for the next subset, under the null hypothesis that the slope of the second subset equals the slope of the first. We calculate the t-statistic t = (β̂ − β_0)/SE(β̂), where β̂ is the slope fitted to the current subset.

Testimator Since there will be n_g = n − 1 hypothesis tests, the critical t value is of Bonferroni type, t_{α/(2n_g), ν}, where ν is the number of degrees of freedom in each subset. Once we know the observed and critical values of the t-statistic, we compute k = |t_observed|/t_critical, which measures how plausible the initial testimator guess is. If k < 1, the null hypothesis is accepted and we derive the new testimator slope for the next subset from the previously determined β’s: β_T = k·β̂ + (1 − k)·β_0.

Testimator This value of the testimator is taken as β_0 for the next subset. The process of hypothesis testing is repeated n_g times, or until k > 1, which suggests rejection of the null hypothesis: that is, the data are more consistent with a non-linear relation.

The Extra-Galactic Distance Scale μ = m − M = m − (a + b·logP). In a calibrating galaxy, observe Cepheids and determine M = a + b·logP. In a target galaxy, observe Cepheid apparent magnitudes m_i, i = 1,…,N, so that μ_i = m_i − (a + b·logP_i). In matrix form, y = Lq, where y = (m_1, m_2,…, m_N), q = (a, b, μ_1, μ_2,…, μ_N) is the vector of unknowns, and L is an N × (N+2) matrix containing 1’s and logP_i’s.

The Extra-Galactic Distance Scale This is a vector equation for the q’s, easily solvable using the General Linear Model interface in R. Minimizing χ² = (y − Lq)ᵀC⁻¹(y − Lq), where C is the matrix of measurement errors, yields the MLE for q; this is the weighted least-squares estimate when the errors are normally distributed: q′ = (LᵀC⁻¹L)⁻¹LᵀC⁻¹y, with parameter covariance matrix (LᵀC⁻¹L)⁻¹ (whose diagonal gives the standard errors). If you formulate your statistical data analysis problems in this General Linear Model formalism, they are very easy to solve in R along with a full error analysis.
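The weighted GLM solution q′ = (LᵀC⁻¹L)⁻¹LᵀC⁻¹y can be sketched directly (in Python rather than R, for illustration). The toy design below is my simplification: a single distance modulus μ for the target galaxy, with calibrator rows [1, logP, 0] and target rows [1, logP, 1], which keeps the system well-posed; the per-star μ_i case uses the same algebra with a wider L:

```python
import numpy as np

def weighted_glm(L, y, C):
    """MLE q' = (L^T C^-1 L)^-1 L^T C^-1 y, with covariance (L^T C^-1 L)^-1."""
    Ci = np.linalg.inv(C)
    cov = np.linalg.inv(L.T @ Ci @ L)   # parameter covariance matrix
    q = cov @ L.T @ Ci @ y              # weighted least-squares estimate
    return q, cov
```

On noiseless synthetic data the estimator recovers the PL zero point a, slope b and modulus μ exactly.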

The Extra-Galactic Distance Scale and Bayes Bayesian GLM formalism applied to the estimate of H0

Segmented Lines and the Davies Test The model is Y = a_S + b_S·X + Ψ(X)·Δa·(X − X_b), where Δa = a_L − a_S is the change in slope and Ψ(X) = 0 for X < X_b, Ψ(X) = 1 for X ≥ X_b. This assumes a continuous transition between the two linear models. A more general situation, allowing a discontinuity, is Y = a_S + b_S·X + Ψ(X)·[Δa·(X − X_b) − γ], where γ represents the magnitude of the gap.

Segmented Lines Choose an initial break point X_b′ and then fit the other parameters in the equation. Estimate a new break point, X_b′′ = X_b′ + γ/Δa. Repeat until γ ≈ 0.
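This iterative refit-and-shift scheme (Muggeo-style segmented regression) can be sketched in Python; the basis construction and function name are mine, with the coefficient of the indicator column signed so that it estimates the gap γ and the update matches X_b′′ = X_b′ + γ/Δa above:

```python
import numpy as np

def segmented_fit(x, y, xb, n_iter=30, tol=1e-8):
    """Iteratively refit a broken line and shift the break point by gamma/delta."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    for _ in range(n_iter):
        u = np.where(x > xb, x - xb, 0.0)    # slope-change basis, (X - X_b)_+
        v = np.where(x > xb, -1.0, 0.0)      # gap basis; coefficient estimates gamma
        A = np.column_stack([np.ones_like(x), x, u, v])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        a_s, b_s, da, gamma = coef
        xb = xb + gamma / da                 # the update X_b'' = X_b' + gamma/Delta a
        if abs(gamma) < tol:                 # converged: the gap has closed
            break
    return a_s, b_s, da, xb
```

On continuous piecewise-linear data with a break at X_b = 4 and slope change 1.5, starting from an initial guess of 3, the iteration converges to the true break point.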

Cepheid PL Relations

Cepheid PC Relations

Multiphase PL Relations

Multiwavelength PL Relations

Galactic PL Relations

ExtraGalactic PL Relations