
Chapter 14 Inference on the Least-Squares Regression Model and Multiple Regression

Section 14.1 Testing the Significance of the Least-Squares Regression Model

Objectives
1. State the requirements of the least-squares regression model
2. Compute the standard error of the estimate
3. Verify that the residuals are normally distributed
4. Conduct inference on the slope
5. Construct a confidence interval about the slope of the least-squares regression model

Objective 1 State the Requirements of the Least-Squares Regression Model

Requirement 1 for Inference on the Least-Squares Regression Model For any particular value of the explanatory variable x, the mean of the corresponding responses in the population depends linearly on x. That is,

μ_{y|x} = β₀ + β₁x

for some numbers β₀ and β₁, where μ_{y|x} represents the population mean response when the value of the explanatory variable is x.

Requirement 2 for Inference on the Least-Squares Regression Model The response variables are normally distributed with mean μ_{y|x} = β₀ + β₁x and standard deviation σ.

“In Other Words” When doing inference on the least-squares regression model, we require (1) that for any value of the explanatory variable, x, the mean of the response variable, y, depends on x through a linear equation, and (2) that the response variable, y, is normally distributed with a constant standard deviation, σ. The mean response increases or decreases at a constant rate determined by the slope, while the spread about the mean remains constant.

The least-squares regression model is given by

yᵢ = β₀ + β₁xᵢ + εᵢ

where
yᵢ is the value of the response variable for the ith individual
β₀ and β₁ are the parameters to be estimated based on sample data
xᵢ is the value of the explanatory variable for the ith individual
εᵢ is a random error term with mean 0 and variance σ², and the error terms are independent
i = 1, …, n, where n is the sample size (the number of ordered pairs in the data set)

Objective 2 Compute the Standard Error of the Estimate

The standard error of the estimate, sₑ, is found using the formula

sₑ = √( Σ(yᵢ − ŷᵢ)² / (n − 2) ) = √( Σ residuals² / (n − 2) )

Parallel Example 2: Compute the Standard Error of the Estimate Compute the standard error of the estimate for the drilling data, which is presented on the next slide.

Solution Step 1: Using technology, we find the least-squares regression line (its slope is b₁ = 0.0116). Steps 2 and 3: The predicted values and the residuals for the 12 observations are given in the table on the next slide.

Solution Step 4: We find the sum of the squared residuals by summing the last column of the table: Σ(yᵢ − ŷᵢ)² ≈ 2.70. Step 5: The standard error of the estimate is then sₑ = √( 2.70 / (12 − 2) ) ≈ 0.5197.

CAUTION! Be sure to divide by n-2 when computing the standard error of the estimate.
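
To make Steps 4 and 5 concrete, here is a minimal Python sketch of the computation; the drilling data are not reproduced in this transcript, so the x and y arrays below are hypothetical stand-ins.

    import numpy as np

    def standard_error_of_estimate(x, y):
        """Fit y = b0 + b1*x by least squares and return s_e."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        b1, b0 = np.polyfit(x, y, 1)            # slope, intercept
        residuals = y - (b0 + b1 * x)
        # CAUTION: divide the sum of squared residuals by n - 2, not n
        return np.sqrt(np.sum(residuals**2) / (len(x) - 2))

    # Hypothetical depth/time pairs, for illustration only
    x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
    y = [5.9, 6.2, 6.3, 6.6, 6.8, 6.9, 7.0, 7.0, 7.2, 7.3, 7.3, 7.5]
    print(standard_error_of_estimate(x, y))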

Objective 3 Verify That the Residuals Are Normally Distributed

Parallel Example 4: Verifying That the Residuals Are Normally Distributed Verify that the residuals from the drilling example are normally distributed. Solution: A normal probability plot of the residuals (shown on the original slide) is roughly linear, so it is reasonable to conclude that the residuals are normally distributed.

Objective 4 Conduct Inference on the Slope

Hypothesis Test Regarding the Slope Coefficient, β₁ To test whether two quantitative variables are linearly related, we use the following steps, provided that (1) the sample is obtained using random sampling, and (2) the residuals are normally distributed with constant error variance.

Step 1: Determine the null and alternative hypotheses. The hypotheses can be structured in one of three ways:

Two-Tailed: H₀: β₁ = 0 versus H₁: β₁ ≠ 0
Left-Tailed: H₀: β₁ = 0 versus H₁: β₁ < 0
Right-Tailed: H₀: β₁ = 0 versus H₁: β₁ > 0

Step 2: Select a level of significance, α, depending on the seriousness of making a Type I error.

Step 3: Compute the test statistic

t₀ = b₁ / s_{b₁}, where s_{b₁} = sₑ / √( Σ(xᵢ − x̄)² ),

which follows Student’s t-distribution with n − 2 degrees of freedom. Remember, when computing the test statistic, we assume the null hypothesis to be true, so we assume that β₁ = 0.

Classical Approach Step 4: Use Table VI to determine the critical value using n-2 degrees of freedom.

Classical Approach Two-Tailed

Classical Approach Left-Tailed

Classical Approach Right-Tailed

Classical Approach Step 5: Compare the critical value with the test statistic.

P-Value Approach Step 4: Use Table VI to estimate the P-value using n-2 degrees of freedom.

P-Value Approach Two-Tailed

P-Value Approach Left-Tailed

P-Value Approach Right-Tailed

P-Value Approach Step 5: If the P-value < α, reject the null hypothesis.

Step 6: State the conclusion.

CAUTION! Before testing H₀: β₁ = 0, be sure to draw a residual plot to verify that a linear model is appropriate.

Parallel Example 5: Testing for a Linear Relation Test the claim that there is a linear relation between drill depth and drill time at the α = 0.05 level of significance using the drilling data.

Solution Verify the requirements: We assume that the experiment was randomized so that the data can be assumed to represent a random sample. In Parallel Example 4 we confirmed that the residuals were normally distributed by constructing a normal probability plot. To verify the requirement of constant error variance, we plot the residuals against the explanatory variable, drill depth.

The plot of residuals against drill depth shows no discernible pattern, so the constant-error-variance requirement is satisfied.

Solution Step 1: We want to determine whether a linear relation exists between drill depth and drill time without regard to the sign of the slope. This is a two-tailed test with H₀: β₁ = 0 versus H₁: β₁ ≠ 0. Step 2: The level of significance is α = 0.05. Step 3: Using technology, we obtained the estimate of β₁ in Parallel Example 2, b₁ = 0.0116. To determine the standard deviation of b₁, we compute s_{b₁} = sₑ / √( Σ(xᵢ − x̄)² ). The calculations are on the next slide.

Solution Step 3, cont’d: We have s_{b₁} ≈ 0.003. The test statistic is t₀ = b₁ / s_{b₁} = 0.0116 / 0.003 ≈ 3.867.

Solution: Classical Approach Step 4: Since this is a two-tailed test, we determine the critical t-values at the α = 0.05 level of significance with n − 2 = 12 − 2 = 10 degrees of freedom to be −t_{0.025} = −2.228 and t_{0.025} = 2.228. Step 5: Since the value of the test statistic, 3.867, is greater than 2.228, we reject the null hypothesis.

Solution: P-Value Approach Step 4: Since this is a two-tailed test, the P-value is the sum of the area under the t-distribution with 12-2=10 degrees of freedom to the left of -t0 = -3.867 and to the right of t0 = 3.867. Using Table VI we find that with 10 degrees of freedom, the value 3.867 is between 3.581 and 4.144 corresponding to right-tail areas of 0.0025 and 0.001, respectively. Thus, the P-value is between 0.002 and 0.005. Step 5: Since the P-value is less than the level of significance, 0.05, we reject the null hypothesis.

Solution Step 6: There is sufficient evidence at the α = 0.05 level of significance to conclude that a linear relation exists between drill depth and drill time.
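
A minimal SciPy sketch of Steps 3 through 5, using the numbers from this example (b₁ = 0.0116, s_{b₁} ≈ 0.003, n = 12):

    from scipy import stats

    b1, s_b1, n = 0.0116, 0.003, 12
    t0 = b1 / s_b1                           # test statistic, about 3.867
    df = n - 2                               # 10 degrees of freedom

    t_crit = stats.t.ppf(1 - 0.05 / 2, df)   # critical value, 2.228
    p_value = 2 * stats.t.sf(abs(t0), df)    # two-tailed P-value, about 0.003

    reject = abs(t0) > t_crit                # classical approach
    reject_p = p_value < 0.05                # P-value approach; same decision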

Objective 5 Construct a Confidence Interval about the Slope of the Least-Squares Regression Model

Confidence Intervals for the Slope of the Regression Line A (1 − α)·100% confidence interval for the slope of the true regression line, β₁, is given by the following formulas:

Lower bound: b₁ − t_{α/2} · s_{b₁}
Upper bound: b₁ + t_{α/2} · s_{b₁}

Here, s_{b₁} = sₑ / √( Σ(xᵢ − x̄)² ) and t_{α/2} is computed using n − 2 degrees of freedom.

Note: The confidence interval formula for β₁ can be computed only if the data are randomly obtained, the residuals are normally distributed, and there is constant error variance.

Parallel Example 7: Constructing a Confidence Interval for the Slope of the True Regression Line Construct a 95% confidence interval for the slope of the least-squares regression line for the drilling example.

Solution The requirements for the usage of the confidence interval formula were verified in previous examples. We also determined b1 = 0.0116 in previous examples.

Solution Since t_{0.025} = 2.228 for 10 degrees of freedom, we have
Lower bound = 0.0116 − 2.228 · 0.003 = 0.0049
Upper bound = 0.0116 + 2.228 · 0.003 = 0.0183
We are 95% confident that the mean increase in the time it takes to drill 5 feet for each additional foot of depth at which the drilling begins is between 0.005 and 0.018 minutes.
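
The same interval in a short SciPy sketch:

    from scipy import stats

    b1, s_b1, n = 0.0116, 0.003, 12
    t_crit = stats.t.ppf(0.975, n - 2)   # 2.228 for 10 degrees of freedom
    lower = b1 - t_crit * s_b1           # about 0.0049
    upper = b1 + t_crit * s_b1           # about 0.0183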

Section 14.2 Confidence and Prediction Intervals

Objectives
1. Construct confidence intervals for a mean response
2. Construct prediction intervals for an individual response

Confidence intervals for a mean response are intervals constructed about the predicted value of y, at a given level of x, that are used to measure the accuracy of the mean response of all the individuals in the population. Prediction intervals for an individual response are intervals constructed about the predicted value of y that are used to measure the accuracy of a single individual’s predicted value.

Objective 1 Construct Confidence Intervals for a Mean Response

Confidence Interval for the Mean Response of y, ŷ A (1 − α)·100% confidence interval for ŷ, the mean response of y for a specified value of x, is given by

Lower bound: ŷ − t_{α/2} · sₑ · √( 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )
Upper bound: ŷ + t_{α/2} · sₑ · √( 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )

where x* is the given value of the explanatory variable, n is the number of observations, and t_{α/2} is the critical value with n − 2 degrees of freedom.
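
A sketch of this formula in Python; the helper function below is illustrative (not from the text) and assumes the data come in as paired arrays:

    import numpy as np
    from scipy import stats

    def mean_response_ci(x, y, x_star, alpha=0.05):
        """(1 - alpha)*100% CI for the mean response at x = x_star."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        b1, b0 = np.polyfit(x, y, 1)             # slope, intercept
        y_hat = b0 + b1 * x_star                 # predicted mean response
        se = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
        sxx = np.sum((x - x.mean())**2)
        margin = (stats.t.ppf(1 - alpha / 2, n - 2) * se
                  * np.sqrt(1 / n + (x_star - x.mean())**2 / sxx))
        return y_hat - margin, y_hat + margin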

Parallel Example 1: Constructing a Confidence Interval for a Mean Response Construct a 95% confidence interval about the predicted mean time to drill 5 feet for all drillings started at a depth of 110 feet.

Solution The least-squares regression line was found in Parallel Example 2. To find the predicted mean time to drill 5 feet for all drillings started at 110 feet, let x* = 110 in the regression equation and obtain ŷ ≈ 6.80 minutes. Recall: sₑ = 0.5197 and t_{0.025} = 2.228 for 10 degrees of freedom.

Solution Therefore, the lower bound is approximately 6.45 and the upper bound is approximately 7.15.

Solution We are 95% confident that the mean time to drill 5 feet for all drillings started at a depth of 110 feet is between 6.45 and 7.15 minutes.

Objective 2 Construct Prediction Intervals for an Individual Response

Prediction Interval for an Individual Response about ŷ A (1 − α)·100% prediction interval for y, the individual response, is given by

Lower bound: ŷ − t_{α/2} · sₑ · √( 1 + 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )
Upper bound: ŷ + t_{α/2} · sₑ · √( 1 + 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )

where x* is the given value of the explanatory variable, n is the number of observations, and t_{α/2} is the critical value with n − 2 degrees of freedom.
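
In code, the only change from the mean-response interval is the extra 1 under the radical; a companion sketch to the illustrative function above:

    import numpy as np
    from scipy import stats

    def prediction_interval(x, y, x_star, alpha=0.05):
        """(1 - alpha)*100% prediction interval for an individual response at x = x_star."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        b1, b0 = np.polyfit(x, y, 1)
        y_hat = b0 + b1 * x_star
        se = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
        sxx = np.sum((x - x.mean())**2)
        # the extra "1 +" widens the interval for a single response
        margin = (stats.t.ppf(1 - alpha / 2, n - 2) * se
                  * np.sqrt(1 + 1 / n + (x_star - x.mean())**2 / sxx))
        return y_hat - margin, y_hat + margin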

Parallel Example 2: Constructing a Prediction Interval for an Individual Response Construct a 95% prediction interval about the predicted time to drill 5 feet for a single drilling started at a depth of 110 feet.

Solution The least-squares regression line was found in Parallel Example 2. To find the predicted time to drill 5 feet for a single drilling started at 110 feet, let x* = 110 in the regression equation and obtain ŷ ≈ 6.80 minutes, as before. Recall: sₑ = 0.5197 and t_{0.025} = 2.228 for 10 degrees of freedom.

Solution Therefore, the lower bound is approximately 5.59 and the upper bound is approximately 8.01.

Solution We are 95% confident that the time to drill 5 feet for a random drilling started at a depth of 110 feet is between 5.59 and 8.01 minutes.

Section 14.3 Multiple Regression

Objectives
1. Obtain the correlation matrix
2. Use technology to find a multiple regression equation
3. Interpret the coefficients of a multiple regression equation
4. Determine R² and adjusted R²

Objectives (continued)
5. Perform an F-test for lack of fit
6. Test individual regression coefficients for significance
7. Construct confidence and prediction intervals
8. Build a regression model

Objective 1 Obtain the Correlation Matrix

A multiple regression model is given by

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ··· + βₖxₖᵢ + εᵢ

where
yᵢ is the value of the response variable for the ith individual
β₀, β₁, …, βₖ are the parameters to be estimated based on sample data
x₁ᵢ is the ith observation for the first explanatory variable, x₂ᵢ is the ith observation for the second explanatory variable, and so on
εᵢ is a random error term that is normally distributed with mean 0 and variance σ²; the error terms are independent
i = 1, …, n, where n is the sample size.

A correlation matrix shows the linear correlation between each pair of variables under consideration in a multiple regression model.

Multicollinearity exists between two explanatory variables if they have a high linear correlation.

CAUTION! If two explanatory variables in the regression model are highly correlated with each other, watch out for strange results in the regression output.

Parallel Example 1: Constructing a Correlation Matrix As cheese ages, various chemical processes take place that determine the taste of the final product. The next two slides give concentrations of various chemicals in 30 samples of mature cheddar cheese and a subjective measure of taste for each sample. Source: Moore, David S., and George P. McCabe (1989)

Solution The following correlation matrix is from MINITAB:

Correlations: taste, Acetic, H2S, Lactic

         taste   Acetic   H2S
Acetic   0.550
H2S      0.756   0.618
Lactic   0.704   0.604    0.645

Cell Contents: Pearson correlation
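
The same matrix is easy to produce with pandas; the file name below is hypothetical, and the column names follow the MINITAB output above:

    import pandas as pd

    df = pd.read_csv("cheese.csv")   # hypothetical file with the 30 cheese samples
    print(df[["taste", "Acetic", "H2S", "Lactic"]].corr().round(3))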

Objective 2 Use Technology to Find a Multiple Regression Equation

Parallel Example 2: Multiple Regression Use technology to obtain the least-squares regression equation ŷ = b₀ + b₁x₁ + b₂x₂ + b₃x₃, where x₁ represents the natural logarithm of the cheese’s acetic acid concentration, x₂ represents the natural logarithm of the cheese’s hydrogen sulfide concentration, x₃ represents the cheese’s lactic acid concentration, and y represents the subjective taste score (the combined score of several tasters). Draw residual plots and a boxplot of the residuals to assess the adequacy of the model.

Solution [MINITAB regression output, residual plots, and the boxplot of residuals are not reproduced in this transcript.]
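
“Using technology” might look like the following statsmodels sketch; it assumes the Acetic and H2S columns already hold the natural logarithms of the concentrations, and the file name is hypothetical:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("cheese.csv")                  # hypothetical file name
    fit = smf.ols("taste ~ Acetic + H2S + Lactic", data=df).fit()
    print(fit.summary())     # coefficients, R^2, F-test, individual t-tests

    residuals = fit.resid    # use for residual plots and a boxplot of residuals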

Objective 3 Interpret the Coefficients of a Multiple Regression Equation

Parallel Example 3: Interpreting Regression Coefficients Interpret the regression coefficients for the least-squares regression equation found in Parallel Example 2.

Solution The least-squares regression equation found in Parallel Example 2 has estimated coefficients b₁ = 0.328, b₂ = 3.912, and b₃ = 19.671. Since b₁ = 0.328, for every 1-unit increase in the natural logarithm of acetic acid concentration, the cheese’s taste score is predicted to increase by 0.328, assuming that the hydrogen sulfide and lactic acid concentrations remain unchanged.

Solution Since b₂ = 3.912, for every 1-unit increase in the natural logarithm of hydrogen sulfide concentration, the cheese’s taste score is predicted to increase by 3.912, assuming that the acetic acid and lactic acid concentrations remain unchanged. Since b₃ = 19.671, for every 1-unit increase in lactic acid concentration, the cheese’s taste score is predicted to increase by 19.671, assuming that the hydrogen sulfide and acetic acid concentrations remain unchanged.

Objective 4 Determine R² and Adjusted R²

The adjusted R², denoted R²_adj, is the adjusted coefficient of determination. It modifies the value of R² based on the sample size, n, and the number of explanatory variables, k. The formula for the adjusted R² is

R²_adj = 1 − [ (1 − R²) · (n − 1) / (n − k − 1) ]
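
The formula is simple to compute directly; with the values reported later for the cheese model (R² = 0.652, n = 30, k = 3), this sketch returns about 0.612:

    def adjusted_r2(r2, n, k):
        """Adjusted coefficient of determination."""
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.652, 30, 3))   # about 0.612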

“In Other Words” Adding one more explanatory variable can never decrease R², and in practice it almost always increases it, even when the new variable is unrelated to the response.

CAUTION! Never use R² to compare regression models with a different number of explanatory variables. Rather, use the adjusted R².

Parallel Example 4: Coefficient of Determination For the regression model obtained in Parallel Example 2, determine the coefficient of determination and the adjusted R².

Solution From the MINITAB output, we see that R² = 65.2%. This means that 65.2% of the variation in the taste scores is explained by the least-squares regression model. We also find that R²_adj = 1 − (1 − 0.652)(29/26) ≈ 0.612, or 61.2%.

Objective 5 Perform an F-Test for Lack of Fit

Test Statistic for Multiple Regression

F₀ = MSR / MSE = [ Σ(ŷᵢ − ȳ)² / k ] / [ Σ(yᵢ − ŷᵢ)² / (n − k − 1) ]

with k degrees of freedom in the numerator and n − k − 1 degrees of freedom in the denominator, where k is the number of explanatory variables and n is the sample size.

F-Test Statistic for Multiple Regression Using R²

F₀ = ( R² / k ) / ( (1 − R²) / (n − k − 1) )

where R² is the coefficient of determination, k is the number of explanatory variables, and n is the sample size.

Decision Rule for Testing H₀: β₁ = β₂ = ··· = βₖ = 0 If the P-value is less than the level of significance, α, reject the null hypothesis. Otherwise, do not reject the null hypothesis.
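
Using the R²-based form of the statistic, the F-test for the cheese model (R² = 0.652, n = 30, k = 3) can be sketched as:

    from scipy import stats

    r2, n, k = 0.652, 30, 3
    F = (r2 / k) / ((1 - r2) / (n - k - 1))   # about 16.2
    p_value = stats.f.sf(F, k, n - k - 1)     # far below 0.001, so reject H0
    print(F, p_value)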

“In Other Words” The null hypothesis states that there is no linear relation between the explanatory variables and the response variable. The alternative hypothesis states that there is a linear relation between at least one explanatory variable and the response variable.

Parallel Example 5: Inference on the Regression Model Test H₀: β₁ = β₂ = β₃ = 0 versus H₁: at least one βᵢ ≠ 0 for the multiple regression model for the cheese taste data.

Solution We must first determine whether it is reasonable to believe that the residuals are normally distributed with no outliers.

Solution Although there appears to be one outlier, the sample size is large enough for this to not be of great concern. We look at the P-value associated with the F-test statistic from the MINITAB output. Since the P-value < 0.001, we reject H0 and conclude that at least one of the regression coefficients is different from zero.

CAUTION! If we reject the null hypothesis that the slope coefficients are zero, then we are saying that at least one of the slopes is different from zero, not that they all are different from zero.

Objective 6 Test Individual Regression Coefficients for Significance

Parallel Example 6: Testing the Significance of Individual Predictor Variables Test the following hypotheses for the cheese taste data:
a) H₀: β₁ = 0 versus H₁: β₁ ≠ 0
b) H₀: β₂ = 0 versus H₁: β₂ ≠ 0
c) H₀: β₃ = 0 versus H₁: β₃ ≠ 0

Solution We again use the MINITAB output. The test statistic for acetic acid is 0.07 with a P-value of 0.942, so we fail to reject H₀: β₁ = 0. The test statistic for hydrogen sulfide is 3.13 with a P-value of 0.004, so we reject H₀: β₂ = 0. The test statistic for lactic acid is 2.28 with a P-value of 0.031, so we reject H₀: β₃ = 0. We conclude that the natural logarithm of hydrogen sulfide concentration and the lactic acid concentration are useful predictors of taste, but the natural logarithm of acetic acid concentration is not.

We refit the model using the natural logarithm of hydrogen sulfide concentration and lactic acid concentration as the explanatory variables. The adjusted R² of the refit model is 62.6%, which is actually higher than it was for the model with three predictors (61.2%).
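
Refitting and comparing adjusted R² values takes two lines with statsmodels; a sketch under the same assumptions as before (hypothetical file name, log-transformed Acetic and H2S columns):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("cheese.csv")
    full = smf.ols("taste ~ Acetic + H2S + Lactic", data=df).fit()
    reduced = smf.ols("taste ~ H2S + Lactic", data=df).fit()
    print(full.rsquared_adj, reduced.rsquared_adj)   # about 0.612 vs. 0.626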

Objective 7 Construct Confidence and Prediction Intervals

Parallel Example 7: Constructing Confidence and Prediction Intervals Construct a 95% confidence interval for the mean taste score of all cheddar cheeses whose natural logarithm of hydrogen sulfide concentration is 5.5 and whose lactic acid concentration is 1.75. Construct a 95% prediction interval for the taste score of an individual cheddar cheese with the same values of the predictors.

Solution
Predicted Values for New Observations

New Obs   Fit     SE Fit   95% CI            95% PI
1         28.92   3.34     (22.07, 35.76)    (7.40, 50.43)

Values of Predictors for New Observations

New Obs   H2S    Lactic
1         5.50   1.75

Solution Based on the MINITAB output, we are 95% confident that the mean taste score of all cheddar cheeses with ln(hydrogen sulfide) = 5.5 and a lactic acid concentration of 1.75 is between 22.07 and 35.76. We are 95% confident that the taste score of an individual cheddar cheese with ln(hydrogen sulfide) = 5.5 and a lactic acid concentration of 1.75 will be between 7.40 and 50.43.
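
In statsmodels, both intervals come from get_prediction on the refit model; a sketch under the same assumptions as before:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("cheese.csv")                        # hypothetical file name
    fit = smf.ols("taste ~ H2S + Lactic", data=df).fit()

    new = pd.DataFrame({"H2S": [5.5], "Lactic": [1.75]})
    pred = fit.get_prediction(new).summary_frame(alpha=0.05)
    print(pred[["mean", "mean_ci_lower", "mean_ci_upper",   # 95% CI for the mean response
                "obs_ci_lower", "obs_ci_upper"]])           # 95% PI for an individual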

Objective 8 Build a Regression Model

Guidelines in Developing a Multiple Regression Model Step 1: Construct a correlation matrix to help identify the explanatory variables that have a high correlation with the response variable. In addition, look at the correlation matrix for any indication that the explanatory variables are correlated with each other. Remember, just because two explanatory variables have a high correlation does not mean that multicollinearity will be a problem, but it is a tip-off to watch for strange results in the regression model.

Guidelines in Developing a Multiple Regression Model Step 2: Determine the multiple regression model using all the explanatory variables that have been identified by the researcher. Step 3: If the null hypothesis that all the slope coefficients are zero has been rejected, proceed to look at the individual slope coefficients. Identify those whose t-test statistics are small (and whose P-values are therefore large); these are candidates for removal from the model. Remove only one explanatory variable at a time before recomputing the regression model.

Guidelines in Developing a Multiple Regression Model Step 4: Repeat Step 3 until all slope coefficients are significantly different from zero. Finally, be sure that the model is appropriate by drawing residual plots.