Statistics for Business and Economics Module 2: Regression and time series analysis Spring 2010 Lecture 4: Inference about regression Priyantha Wijayatunga, Department of Statistics, Umeå University. These materials are adapted from copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book The Practice of Business Statistics Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.

Inference about the regression model and using the model (Reference: Chapters 10.1, 10.2 and 10.3 of the book)
• Statistical model for simple linear regression
• Estimating the regression parameters and standard errors
• Conditions for regression inference
• Confidence intervals and significance tests
• Inference about correlation
• Confidence and prediction intervals
• Analysis of variance for regression and coefficient of determination

Error Variable: Required Conditions The error ε is a critical part of the linear regression model y = β0 + β1x + ε. Four requirements involving the distribution of ε must be satisfied:
• The probability distribution of ε is normal.
• The mean of ε is zero: E(ε) = 0.
• The standard deviation of ε is σ for all values of x.
• The errors associated with different values of y are all independent.
Our estimated model is ŷ = b0 + b1x.
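As an illustration, here is a minimal Python sketch that generates data satisfying all four error conditions. The parameter values (β0, β1, σ, n, the range of x) are assumptions chosen only for illustration, not values from the course examples:

```python
import numpy as np

# Minimal sketch: simulate data from y = beta0 + beta1*x + eps, where the
# errors are independent, normal, mean zero, and have constant sigma.
# All numeric values below are hypothetical, for illustration only.
rng = np.random.default_rng(seed=1)
beta0, beta1, sigma, n = 100.0, 0.07, 15.0, 60

x = rng.uniform(100.0, 500.0, size=n)   # explanatory variable
eps = rng.normal(0.0, sigma, size=n)    # iid N(0, sigma) errors
y = beta0 + beta1 * x + eps             # response under the model
```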

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. A different sample gives a different plot. We want to describe the population mean response μy as a function of the explanatory variable x, μy = β0 + β1x, and to assess whether the observed relationship is statistically significant (not entirely explained by chance variation due to random sampling).

Statistical model for simple linear regression In the population, the linear regression equation is μy = β0 + β1x. Sample data then fit the model: data = fit + residual, yi = (β0 + β1xi) + εi, where the εi are independent and normally distributed N(0, σ). Linear regression assumes equal variance of y (σ is the same for all values of x).

 y =   +   x The intercept  , the slope  , and the standard deviation  of y are the unknown parameters of the regression model  We rely on the random sample data to provide unbiased estimates of these parameters.  The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (  y ) for a given value of x.  The least-squares regression line (ŷ = b 0 + b 1 x) obtained from sample data is the best estimate of the true population regression line (  y =   +   x). ŷ unbiased estimate for mean response  y b 0 unbiased estimate for intercept  0 b 1 unbiased estimate for slope   Estimating the regression parameters

Regression standard error: s (standard error of estimate) The standard deviation σ of the error variable shows the dispersion around the true line for a given x. If σ is large, there is wide dispersion around the true line; if σ is small, the observations tend to lie close to the line and the model fits the data well. We can therefore use σ as a measure of the suitability of a linear model. However, we usually do not know σ, so we use s to estimate it.

The regression standard error s for n sample data points is calculated from the residuals (yi − ŷi): s = √( Σ(yi − ŷi)² / (n − 2) ). s estimates the regression standard deviation σ (s² is an unbiased estimate of σ²). The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εi around the mean μy.
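A short sketch of this computation, continuing the simulated x and y from the earlier snippet (np.polyfit is one of several ways to obtain the least-squares line):

```python
import numpy as np

# Fit the least-squares line and estimate the regression standard error s.
# Continues the simulated x, y from the earlier sketch.
b1, b0 = np.polyfit(x, y, deg=1)               # slope b1, intercept b0
y_hat = b0 + b1 * x                            # fitted values
resid = y - y_hat                              # residuals y_i - yhat_i
s = np.sqrt(np.sum(resid**2) / (len(x) - 2))   # s = sqrt(SSE / (n - 2))
```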

Conditions for regression inference
• The observations are independent (the error terms of different observations are independent of each other).
• The relationship is indeed linear.
• The standard deviation of y, σ, is the same for all values of x.
• The response y varies normally around its mean; that is, the error term is normally distributed with mean zero.

Using residual plots to check for regression validity The residuals (y − ŷ) give useful information about the contribution of individual data points to the overall pattern of scatter. We view the residuals in a residual plot: if the residuals are scattered randomly around 0 with uniform variation, the data fit a linear model with normally distributed residuals for each value of x and constant standard deviation σ.

Residuals are randomly scattered → good! Curved pattern → the relationship is not linear. Change in variability across the plot → σ is not equal for all values of x.
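A residual plot of this kind takes only a few lines of matplotlib. This is a sketch continuing the fitted values and residuals from the previous snippet:

```python
import matplotlib.pyplot as plt

# Residuals against fitted values, with a reference line at zero.
# Random scatter of roughly even spread supports linearity and constant sigma.
plt.scatter(y_hat, resid)
plt.axhline(0.0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.show()
```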

What is the relationship between the average speed a car is driven and its fuel efficiency? We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log transformed (log of miles per hour, LOGMPH) the new scatterplot shows a positive, linear relationship.

Residual plot: the spread of the residuals is reasonably random, with no clear pattern, so the relationship is indeed linear. But we see one low residual (3.8, −4) and one potentially influential point (2.5, 0.5). Normal quantile plot of the residuals: the plot is fairly straight, supporting the assumption of normally distributed residuals. → Data okay for inference.

Checking normality of residuals Most departures from the required conditions can be diagnosed by residual analysis. For our food company example: for the first observation, with ADVER = 276, the residual equals the observed SALES minus the predicted SALES.

Checking normality Non-normality of the residuals can be checked by making a histogram of the residuals.

Heteroscedasticity: the variance of the errors is not constant (a violation of the requirement). Homoscedasticity: the variance of the errors is constant (no violation). Check: plot the residuals against the values of Y predicted by the model. Non-independence of the error variable is another violation to check for.

Confidence interval for regression parameters Estimating the regression parameters β0 and β1 is a case of one-sample inference with unknown population variance, so we rely on the t distribution with n − 2 degrees of freedom. A level C confidence interval for the slope β1 is: b1 ± t* SEb1. A level C confidence interval for the intercept β0 is: b0 ± t* SEb0. Here t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

Significance test for the slope We can test the hypothesis H0: β1 = 0 versus a one- or two-sided alternative. We calculate t = b1 / SEb1, which has the t(n − 2) distribution, to find the p-value of the test. Note: software typically provides two-sided p-values.

Standard errors of b1 and b0 To do inference about the regression parameters, we calculate the standard errors of the estimated regression coefficients. The standard error of the least-squares slope b1 is: SEb1 = s / √( Σ(xi − x̄)² ). The standard error of the intercept b0 is: SEb0 = s √( 1/n + x̄² / Σ(xi − x̄)² ).
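Putting the last three slides together, here is a sketch of the standard errors, a 95% confidence interval for the slope, and the t-test of H0: β1 = 0, continuing x, y, b1 and s from the earlier snippets:

```python
import numpy as np
from scipy import stats

# Standard errors of b1 and b0, a 95% CI for the slope, and the slope t-test.
n = len(x)
sxx = np.sum((x - x.mean()) ** 2)                # sum of squares of x about its mean
se_b1 = s / np.sqrt(sxx)
se_b0 = s * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)

t_star = stats.t.ppf(0.975, df=n - 2)            # critical value: area 0.95 between -t*, +t*
ci_b1 = (b1 - t_star * se_b1, b1 + t_star * se_b1)

t_stat = b1 / se_b1                              # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
```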

Testing the hypothesis of no relationship We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn. For that, we test whether the regression slope parameter β1 is equal to zero: H0: β1 = 0 vs. Ha: β1 ≠ 0. Testing H0: β1 = 0 also allows us to test the hypothesis of no correlation between x and y in the population. Note: a test of hypothesis for β0 is usually irrelevant (the value x = 0, at which β0 is the mean response, is often not even attainable).

Example: ADVER and SALES (data table: observation i, xi = ADVER, yi = SALES for the nine observations, with column totals and means).

Regression standard error (standard error of the estimate) and the standard deviation (standard error) of b1 are computed from the data. If H0 is true, the test statistic t = b1/SEb1 has a t-distribution with df = 9 − 2 = 7. P-value < 2 × 0.005 = 0.01: reject H0 at the 0.01 level of significance.

Example: ADVER and SALES, SPSS output. Coefficients (dependent variable: SALES):
(Constant): B = 99.201, Std. Error = 12.080, t = 8.212, Sig. = .000
ADVER: B = .068, Std. Error = .018, Beta = .826, t = 3.878, Sig. = .006
95.0% confidence interval for B:
(Constant): 70.635 to 127.767
ADVER: .027 to .110

Using technology: SPSS. Computer software runs all the computations for regression analysis. Here is some software output for the car speed/gas efficiency example; the output gives the slope, the intercept, p-values for tests of significance, and confidence intervals. The t-test for the regression slope is highly significant (p < 0.001): there is a significant relationship between average car speed and gas efficiency.

Excel and SAS output for the same example: "intercept" is the intercept and "logmph" is the slope; both outputs also give p-values for tests of significance and confidence intervals.

Inference for correlation To test the null hypothesis of no linear association, we can alternatively use the correlation parameter ρ.
• When x is clearly the explanatory variable, this test is equivalent to testing the hypothesis H0: β1 = 0.
• When there is no clear explanatory variable (e.g., arm length vs. leg length), a regression of x on y is no more legitimate than one of y on x. In that case, the correlation test of significance should be used.
• When both x and y are normally distributed, H0: ρ = 0 tests for no association of any kind between x and y, not just linear associations.

The test of significance for ρ uses the one-sample t-test for H0: ρ = 0. For sample size n and correlation coefficient r we compute the t statistic t = r √(n − 2) / √(1 − r²). The p-value is the area under t(n − 2) for values of T as extreme as t or more in the direction of Ha.
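As a sketch, the same test can be run with scipy's built-in Pearson test or by hand from the formula above (continuing x, y and n from the earlier snippets); both routes give the same two-sided p-value:

```python
import numpy as np
from scipy import stats

# Test H0: rho = 0 two ways.
r, p_builtin = stats.pearsonr(x, y)              # scipy's Pearson correlation test
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)     # t = r*sqrt(n-2)/sqrt(1-r^2)
p_manual = 2 * stats.t.sf(abs(t_r), df=n - 2)    # matches p_builtin
```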

Relationship between average car speed and fuel efficiency There is a significant correlation (ρ is not 0) between fuel efficiency (MPG) and the logarithm of average speed (LOGMPH); the output shows r, the p-value, and n.

Example: ADVER and SALES Correlation coefficient r = .826. To test H0: ρ = 0 vs. Ha: ρ ≠ 0, we compute the value of the test statistic and find the p-value from the t-distribution with df = 7: p-value < 2 × 0.005 = 0.01, so reject H0. SPSS Correlations output: Pearson correlation between ADVER and SALES = .826 (Sig. 2-tailed = .006, N = 9). **: Correlation is significant at the 0.01 level (2-tailed).

Confidence Interval for Regression Response Using inference, we can also calculate a confidence interval for the population mean μy of all responses y when x takes the value x* (within the range of data tested). This interval is centered on ŷ, the unbiased estimate of μy. The true value of the population mean μy at a given value of x will indeed be within our confidence interval in C% of all intervals calculated from many different random samples.

The level C confidence interval for the mean response μy at a given value x* of x is: ŷ ± t* SEμ̂. A separate confidence interval is calculated for μy at each value that x takes; graphically, the series of confidence intervals is shown as a continuous band on either side of ŷ (figure: 95% confidence interval for μy). t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

Prediction Interval for Regression Response One use of regression is to predict the value of y, ŷ, for any value of x within the range of data tested: ŷ = b0 + b1x. But the regression equation depends on the particular sample drawn, so more reliable predictions require statistical inference. To estimate an individual response y for a given value of x, we use a prediction interval. If we sampled repeatedly, there would be many different values of y obtained for a particular x, following N(0, σ) around the mean response μy.

The level C prediction interval for a single observation on y when x takes the value x* is: ŷ ± t* SEŷ. The prediction interval represents mainly the error from the normal distribution of the residuals εi. Graphically, the series of prediction intervals is shown as a continuous band on either side of ŷ (figure: 95% prediction interval for ŷ). t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

The confidence interval for μy contains, with C% confidence, the population mean μy of all responses at a particular value of x. The prediction interval contains C% of all the individual values taken by y at a particular value of x; the 95% prediction interval for ŷ is therefore wider than the 95% confidence interval for μy. Estimating μy uses a narrower interval than predicting an individual response in the population (the sampling distribution is narrower than the population distribution).

To estimate or predict future responses, we calculate the following standard errors. The standard error of the estimated mean response μ̂y is: SEμ̂ = s √( 1/n + (x* − x̄)² / Σ(xi − x̄)² ). The standard error for predicting an individual response ŷ is: SEŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² ).
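In practice, software computes both intervals from these standard errors. Here is a sketch with statsmodels, continuing the simulated x and y; the value x* = 300 is an assumed illustration point inside the simulated x range:

```python
import numpy as np
import statsmodels.api as sm

# Fit the model, then get a 95% CI for the mean response and a 95%
# prediction interval for an individual response at x* = 300 (assumed value).
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

x_new = np.array([[1.0, 300.0]])   # design row [intercept, x*], built by hand
pred = fit.get_prediction(x_new).summary_frame(alpha=0.05)
# pred["mean_ci_lower"], pred["mean_ci_upper"]: CI for the mean response mu_y
# pred["obs_ci_lower"], pred["obs_ci_upper"]:   prediction interval for y
```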

1918 flu epidemic The line graph suggests that 7 to 9% of those diagnosed with the flu died within about a week of their diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

Excel output, 1918 flu epidemic: relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier. The output gives the regression statistics (multiple R, R square, adjusted R square = 0.82, standard error, number of observations) and the coefficient table (coefficients, standard errors, t statistics, p-values and 95% confidence limits for the intercept and FluCases). With r = 0.91, the p-value for H0: β1 = 0 is very small → reject H0 → β1 is significantly different from 0. There is a significant relationship between the number of flu cases and the number of deaths from flu a week later.

SPSS plot: the least-squares regression line with the 95% confidence interval for μy and the 95% prediction interval for ŷ. CI for the mean weekly death count one week after 4000 flu cases are diagnosed: μy within about 300–380. Prediction interval for a weekly death count one week after 4000 flu cases are diagnosed: ŷ within about 180–500 deaths.

What is this? A 90% prediction interval for the height (above) and a 90% prediction interval for the weight (below) of male children, ages 3 to 18.

Analysis of variance and coefficient of determination The overall variability in y is explained in part by the regression model; the rest remains unexplained (the error). Variation in the dependent variable Y = variation explained by the independent variable + unexplained variation (random error), i.e. SST = SSR + SSE. The greater the share of variation explained by the independent variable, the better the model. The coefficient of determination is a measure of the explanatory power of the model.

Analysis of variance for regression The regression model is: data = fit + residual, yi = (β0 + β1xi) + εi, where the εi are independent and normally distributed N(0, σ), and σ is the same for all values of x. For the data, the total variation decomposes as Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)², i.e. SST = SSR + SSE. Sometimes SSR is called SSM.

Analysis of variance for regression The regression model is: data = fit + residual, yi = (β0 + β1xi) + εi, where the εi are independent and normally distributed N(0, σ), and σ is the same for all values of x. It resembles an ANOVA, which also assumes equal variance, where SST = SSmodel + SSerror and DFT = DFmodel + DFerror.

For a simple linear relationship, the ANOVA tests the hypotheses H0: β1 = 0 versus Ha: β1 ≠ 0 by comparing MSM (model) to MSE (error): F = MSM/MSE. When H0 is true, F follows the F(1, n − 2) distribution, and the p-value is P(F > f). The ANOVA test and the two-sided t-test for H0: β1 = 0 yield the same p-value. Software output for regression may provide t, F, or both, along with the p-value.
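The decomposition and the F-test can be verified directly. A sketch, continuing y, y_hat and n from the earlier snippets; note that F equals the square of the slope's t statistic:

```python
import numpy as np
from scipy import stats

# ANOVA decomposition SST = SSM + SSE and the F-test of H0: beta1 = 0.
ssm = np.sum((y_hat - y.mean()) ** 2)   # model (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares; sst == ssm + sse

msm = ssm / 1                           # DF model = 1
mse = sse / (n - 2)                     # DF error = n - 2
F = msm / mse                           # equals t_stat**2 for the slope
p_F = stats.f.sf(F, 1, n - 2)           # tail area above F
```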

ANOVA table
Source             | Sum of squares | DF    | Mean square   | F       | P-value
Model (Regression) | SSM            | 1     | MSM = SSM/DFM | MSM/MSE | tail area above F
Error (Residual)   | SSE            | n − 2 | MSE = SSE/DFE |         |
Total              | SST            | n − 1 |               |         |
SST = SSM + SSE and DFT = DFM + DFE. The regression standard error s for n sample data points is calculated from the residuals ei = yi − ŷi: s = √( SSE/(n − 2) ) = √MSE; s² is an unbiased estimate of the regression variance σ².

F-distribution: critical values (table excerpt).

Coefficient of determination, r² The coefficient of determination r², the square of the correlation coefficient, is the percentage of the variance in y (the vertical scatter from the regression line) that can be explained by changes in x: r² = (variation in y explained by x, i.e., by the regression line) / (total variation of the observed y values around their mean).
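A two-line sketch confirming the two equivalent computations of r², continuing the sums of squares and the simulated x, y from the snippets above:

```python
import numpy as np

# r^2 as the explained share of variation, and as the squared correlation.
r2_anova = ssm / sst                     # SSM / SST
r2_corr = np.corrcoef(x, y)[0, 1] ** 2   # square of r; the two values agree
```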

Food company example: call X = ADVER and Y = SALES (data table: observation i, xi, yi for the nine observations, with totals and means).

Food company example, SPSS ANOVA output (dependent variable: SALES; predictors: (Constant), ADVER):
Regression: Sum of Squares = 3894.419, df = 1, Mean Square = 3894.419, F = 15.039, Sig. = .006
Residual: Sum of Squares = 1812.657, df = 7, Mean Square = 258.951
Total: Sum of Squares = 5707.076, df = 8

What is the relationship between the average speed a car is driven and its fuel efficiency? We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log transformed (log of miles per hour, LOGMPH) the new scatterplot shows a positive, linear relationship.

Using software: SPSS. The ANOVA F-test and the t-test give the same p-value. r² = SSM/SST = 494/552 ≈ 0.89.

1918 flu epidemic The line graph suggests that about 7 to 8% of those diagnosed with the flu died within about a week of their diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

Minitab output — Regression Analysis: FluDeaths1 versus FluCases0 (1918 flu epidemic: relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier). The output gives the regression equation (FluDeaths1 as a linear function of FluCases0), the predictor table (Coef, SE Coef, T, P for the Constant and FluCases0), the regression standard error S, R-Sq = 83.0%, R-Sq(adj) = 81.8%, and the analysis-of-variance table (DF, SS, MS, F, P for Regression, Residual Error, and Total). The P-value shown is for H0: β1 = 0 vs. Ha: β1 ≠ 0; r = 0.91, and r² = SSM/SST = 83.0%.