CHAPTER 26: Inference for Regression

1 CHAPTER 26: Inference for Regression
Basic Practice of Statistics, 7th Edition. Lecture PowerPoint Slides.

2 In Chapter 26, We Cover …
Conditions for regression inference
Estimating the parameters
Using technology
Testing the hypothesis of no linear relationship
Testing lack of correlation
Confidence intervals for the regression slope
Inference about prediction
Checking the conditions for inference

3 Introduction
When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. If the data are a random sample from a larger population, we need statistical inference to answer questions like these:
Is there really a linear relationship between x and y in the population, or could the pattern we see in the scatterplot plausibly happen just by chance?
What is the slope (rate of change) that relates y to x in the population, including a margin of error for our estimate of the slope?
If we use the least-squares regression line to predict y for a given value of x, how accurate is our prediction (again, with a margin of error)?

4 Example
STATE: Infants who cry easily may be more easily stimulated than others. This may be a sign of higher IQ. Child development researchers explored the relationship between the crying of infants 4 to 10 days old and their later IQ test scores.
PLAN: Make a scatterplot. If the relationship appears linear, use correlation and regression to describe it. Finally, ask whether there is a statistically significant linear relationship between crying and IQ.
SOLVE (first steps): Consider the scatterplot: look for the form, direction, and strength of the relationship, as well as for outliers or other deviations. There is a moderately strong positive linear relationship, with no extreme outliers or potentially influential observations.

5 Example
SOLVE: Because the scatterplot shows a roughly linear (straight-line) pattern, the correlation describes the direction and strength of the relationship. The correlation between crying and IQ is r = 0.455. We are interested in predicting the response from information about the explanatory variable, so we find the least-squares regression line for predicting IQ from crying. The equation of the regression line is ŷ = a + bx = 91.27 + 1.493x.
CONCLUDE (first steps): Children who cry more vigorously do tend to have higher IQs. Because r² = 0.207, only about 21% of the variation in IQ scores is explained by crying intensity, so prediction of IQ will not be very accurate. Is this observed relationship statistically significant?
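The quantities used in this example (correlation, r², and the least-squares line) can all be computed from a few sums. A minimal sketch in pure Python, using small made-up data rather than the crying/IQ data:

```python
# Sketch: correlation and least-squares regression from first principles.
# The x and y values here are hypothetical, chosen only for illustration.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical explanatory values
y = [2.1, 2.9, 3.6, 4.4, 5.2]   # hypothetical responses
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

b = sxy / sxx                 # least-squares slope
a = ybar - b * xbar           # least-squares intercept
r = sxy / sqrt(sxx * syy)     # correlation

print(round(b, 2), round(a, 2))   # fitted line: y-hat = a + b x
print(round(r ** 2, 3))           # fraction of variation in y explained by x
```

For the crying/IQ data the same arithmetic over the 38 observations yields the line and the r² reported on the slide.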

6 Conditions for Regression Inference
To do inference, think of the slope b and intercept a of the least-squares line as estimates of unknown corresponding parameters β and α that describe the population of interest.
CONDITIONS FOR REGRESSION INFERENCE
We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.
For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
The mean response μy has a straight-line relationship with x given by a population regression line μy = α + βx. The slope β and intercept α are unknown parameters.
The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
There are thus three population parameters that we must estimate from the data: α, β, and σ.

7 Conditions for Regression Inference
The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line µy= α + βx. For each possible value of the explanatory variable x, the mean of the responses µy moves along this line. The value of σ determines whether the points fall close to the population regression line (small σ) or are widely scattered (large σ). The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation σ, so the variability of y is the same for all values of x.
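This model can be simulated directly, which makes the roles of α, β, and σ concrete. A minimal sketch, with made-up parameter values (α = 1, β = 2, σ = 0.5) chosen purely for illustration:

```python
# Sketch of the regression model y = alpha + beta*x + Normal(0, sigma) noise.
# The parameter values are made up for illustration.
import random
random.seed(0)  # reproducible draws

alpha, beta, sigma = 1.0, 2.0, 0.5
x = [i / 20 for i in range(200)]   # fixed x values, 0.00 .. 9.95
y = [alpha + beta * xi + random.gauss(0, sigma) for xi in x]

# When the model holds, the least-squares slope recovers beta closely.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
print(round(b, 2))  # close to the true slope beta = 2
```

Rerunning with a larger σ scatters the points more widely about the population line, exactly as the figure describes.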

8 Estimating the Parameters
The first step in inference is to estimate the unknown parameters α, β, and σ.
ESTIMATING THE POPULATION REGRESSION LINE
When the conditions for regression are met and we calculate the least-squares line ŷ = a + bx, the slope b of the least-squares line is an unbiased estimator of the population slope β, and the intercept a of the least-squares line is an unbiased estimator of the population intercept α.
The remaining parameter is the standard deviation σ, which describes the variability of the response y about the population regression line.

9 Estimating the Parameters
The least-squares line estimates the population regression line, so the residuals estimate how much y varies about the population line. Recall that the residuals are the vertical deviations of the data points from the least-squares line:
residual = observed y − predicted y = y − ŷ
REGRESSION STANDARD ERROR
The regression standard error is
s = √[ (1/(n − 2)) Σ residual² ] = √[ (1/(n − 2)) Σ (y − ŷ)² ]
Use s to estimate the standard deviation σ of responses about the mean given by the population regression line.

10 Using Technology
The least-squares regression line for these data is
predicted IQ = 91.27 + 1.493 × Crycount
The intercept, or constant coefficient, is the predicted IQ if we observed a cry count of 0. The slope, or regression coefficient, is the predicted change in an infant's IQ for each additional observed cry: here 1.493, an increase of about 1.5 points per cry.

11 Using Technology

12 Testing the Hypothesis of No Linear Relationship
Significance Test for Regression Slope
To test the hypothesis H0: β = 0, compute the test statistic
t = b / SE_b
In this formula, the standard error of the least-squares slope b is
SE_b = s / √[ Σ (x − x̄)² ]
The sum runs over all observations on the explanatory variable x. In terms of a random variable T having the t(n − 2) distribution, the P-value for a test of H0 against
Ha: β > 0 is P(T ≥ t)
Ha: β < 0 is P(T ≤ t)
Ha: β ≠ 0 is 2 · P(T ≥ |t|)
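The test statistic is easy to compute once s is in hand. A sketch on the same made-up data used earlier (not the crying/IQ data):

```python
# Sketch: the t statistic for testing H0: beta = 0, on made-up data.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 5.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

se_b = s / sqrt(sxx)   # standard error of the least-squares slope
t = b / se_b
# Compare |t| to the t(n - 2) critical value; for df = 3 the two-sided 5%
# critical value is 3.182, so a t this large is strongly significant.
print(round(t, 1))
```

Software reports the exact P-value from the t(n − 2) distribution; by hand, the statistic is compared to a t table with n − 2 degrees of freedom.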

13 Example Crying and IQ: Is the relationship significant?
SOLVE: The hypothesis H0: β = 0 says that crying has no straight-line relationship with IQ. We conjecture that there is a positive relationship, so we use the one-sided alternative, Ha: β > 0. The earlier scatterplot showed a positive relationship; the Minitab output shows b = 1.4929 and SE_b = 0.4870. Thus,
t = b / SE_b = 1.4929 / 0.4870 = 3.07
CONCLUDE: It is not surprising that all the outputs give t = 3.07 with two-sided P-value 0.004. The P-value for the one-sided test is half of this, P = 0.002. There is very strong evidence that IQ increases as the intensity of crying increases.

14 Testing Lack of Correlation
The least-squares regression slope b is closely related to the correlation r between the explanatory and response variables. In the same way, the slope β of the population regression line is closely related to the correlation between x and y in the population. Testing the null hypothesis, 𝐻 0 : β = 0, is therefore exactly the same as testing that there is no correlation between x and y in the population from which we drew our data.
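The equivalence described above is an algebraic identity: the slope t statistic can also be written t = r√(n − 2) / √(1 − r²). A sketch verifying the two forms agree, on made-up data:

```python
# Sketch: the t statistic from the slope equals the t statistic from r,
# which is why testing beta = 0 tests for zero correlation. Made-up data.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 5.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)

b = sxy / sxx
a = ybar - b * xbar
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
t_slope = b / (s / sqrt(sxx))           # t from the slope and its SE

r = sxy / sqrt(sxx * syy)
t_corr = r * sqrt(n - 2) / sqrt(1 - r ** 2)  # t written in terms of r

print(round(t_slope, 6), round(t_corr, 6))   # the two agree
```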

15 Confidence Intervals for Regression Slope
The slope is the rate of change of the mean response as the explanatory variable increases. We often want to estimate β. The confidence interval for β has the familiar form
estimate ± t* SE_estimate
CONFIDENCE INTERVAL FOR REGRESSION SLOPE
A level C confidence interval for the slope of the population regression line is
b ± t* SE_b
Here, t* is the critical value for the t(n − 2) density curve with area C between −t* and t*.
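The interval is a one-line computation once b and SE_b are known. A sketch on the same made-up data, where n − 2 = 3 and the 95% critical value of t(3) is 3.182:

```python
# Sketch: 95% confidence interval for the population slope, made-up data.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 5.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
se_b = s / sqrt(sxx)

t_star = 3.182                      # t(3) critical value, area 0.95 between -t* and t*
lo, hi = b - t_star * se_b, b + t_star * se_b
print(round(lo, 4), round(hi, 4))   # interval for the population slope beta
```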

16 Example Crying and IQ: estimating the slope
From the computer output, we have slope b = 1.4929 and SE_b = 0.4870. There are 38 data points, so the degrees of freedom are n − 2 = 36. Using software, for 95% confidence, enter the cumulative proportion 0.975 to obtain the critical value t* = 2.028. The 95% confidence interval for the slope of the population regression line is
b ± t* SE_b = 1.4929 ± (2.028)(0.4870) = 1.4929 ± 0.9876 = 0.505 to 2.481
We are 95% confident that the mean IQ increases by between 0.5 and 2.5 points for each additional peak in crying.

17 Inference About Prediction
One of the most common reasons to fit a line to data is to predict the response to a particular value of the explanatory variable. We want not simply a prediction, but a prediction with a margin of error that describes how accurate the prediction is likely to be.
Write the given value of the explanatory variable as x*. The distinction between predicting a single outcome and predicting the mean of all outcomes when x = x* determines what margin of error is correct. To emphasize the distinction, we use different terms for the two intervals.
To estimate the mean response, we use a confidence interval. It is an ordinary confidence interval for the mean response when x has the value x*, which is μy = α + βx*. This is a parameter, a fixed number whose value we don't know.
To estimate an individual response y, we use a prediction interval. A prediction interval estimates a single random response y rather than a parameter like μy. The response y is not a fixed number. If we took more observations with x = x*, we would get different responses.

18 Inference About Prediction
CONFIDENCE AND PREDICTION INTERVALS FOR REGRESSION RESPONSE
A level C confidence interval for the mean response μy when x takes the value x* is
ŷ ± t* SE_μ̂
The standard error SE_μ̂ is
SE_μ̂ = s √[ 1/n + (x* − x̄)² / Σ(x − x̄)² ]

19 Inference About Prediction
CONFIDENCE AND PREDICTION INTERVALS FOR REGRESSION RESPONSE
A level C prediction interval for a single observation y when x takes the value x* is
ŷ ± t* SE_ŷ
The standard error for prediction SE_ŷ is
SE_ŷ = s √[ 1 + 1/n + (x* − x̄)² / Σ(x − x̄)² ]
In both intervals, t* is the critical value for the t(n − 2) density curve with area C between −t* and t*.
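The only difference between the two standard errors is the extra "1 +" inside the square root, which accounts for the scatter of individual responses about the line. A sketch comparing them at a hypothetical x* = 4, on made-up data:

```python
# Sketch: SE for the mean response vs SE for predicting one new response
# at x* = 4. Made-up data; t_star = 3.182 is the 95% t(3) critical value.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 5.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x_star = 4.0
y_hat = a + b * x_star
se_mean = s * sqrt(1 / n + (x_star - xbar) ** 2 / sxx)      # mean response
se_pred = s * sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)  # one new response

t_star = 3.182
print(round(y_hat, 3))
print(round(t_star * se_mean, 4), round(t_star * se_pred, 4))  # margins of error
```

The prediction margin is always the wider of the two, since a single response varies about the mean response.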

20 Checking the Conditions for Inference
You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. If the scatterplot doesn't show a roughly linear pattern, the fitted line may be almost useless. Before you can trust the results of inference, you must check the conditions for inference one by one:
The relationship is linear in the population.
The response varies Normally about the population regression line.
Observations are independent.
The standard deviation of the responses is the same for all values of x.
You can check all of the conditions for regression inference by looking at graphs of the residuals, such as a residual plot.
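A residual plot is just the residuals graphed against x. A minimal sketch, on made-up data, that computes the pairs one would plot and confirms a basic property of least-squares residuals:

```python
# Sketch: computing residuals for a residual plot, on made-up data.
# Plotting (x, residual) pairs would reveal curvature, changing spread,
# or outliers; here we just list them and check they average to zero.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 5.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
for xi, e in zip(x, residuals):
    print(xi, round(e, 3))            # the pairs to plot in a residual plot
print(abs(sum(residuals)) < 1e-9)     # least-squares residuals sum to zero
```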

