Chapter 13 Simple Linear Regression and Correlation: Inferential Methods.

1 Chapter 13 Simple Linear Regression and Correlation: Inferential Methods

2 Suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is the first-year college grade point average determined solely by the high school grade point average? A relationship in which the value of y is completely determined by the value of an independent variable x is called a deterministic relationship. First-year college grade point average and high school grade point average do NOT have a deterministic relationship. A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model. The equation for an additive probabilistic model is $y = f(x) + e$, where $f(x)$ is a deterministic function of x and e is an “error” variable (a random deviation).

3 The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, $y = \alpha + \beta x + e$. Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line. [Figure: population regression line (slope β) on x and y axes, with the deviations $e_1$ and $e_2$ of the observations at $x_1$ and $x_2$ drawn as vertical distances from the line.]

4 Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
3. The distribution of e at any particular value of x is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

5 Let’s look at the heights and weights of a population of adult women. How much would an adult female weigh if she were 5 feet tall? Weights of women who are 5 feet tall will vary; in other words, there is a distribution of weights for adult females who are 5 feet tall. Are some of these weights more likely than others? What would this distribution look like? Under the model, this distribution is normal. What would you expect for other heights? We want the standard deviations of all these normal distributions to be the same. Where would you expect the population regression line to be? [Figure: weight plotted against height, with a normal distribution of weights drawn at each height.]

6 Basic Assumptions of the Simple Linear Regression Model Revisited
1. The distribution of e at any particular x value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
3. The distribution of e at any particular value of x is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.
Remember that the variable e measures the extent to which individual y values deviate from the population regression line. For any particular x value, the standard deviation of y equals the standard deviation of e, and the distribution of y at any particular value of x is normal.
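The last two remarks on this slide can be checked with a small simulation. The sketch below draws many y values at one fixed x from $y = \alpha + \beta x + e$ with normal e. The values of `ALPHA`, `BETA`, and `SIGMA` are made up for illustration and do not come from the slides.

```python
import random
import statistics

# Hypothetical population parameters (assumptions, not taken from the slides).
ALPHA = 1.0   # population y-intercept
BETA = 0.5    # population slope
SIGMA = 2.0   # standard deviation of the random deviation e

def simulate_y(x, n, rng):
    """Draw n observations of y = ALPHA + BETA*x + e, with e ~ Normal(0, SIGMA)."""
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for _ in range(n)]

rng = random.Random(0)            # fixed seed so the run is repeatable
x_fixed = 10.0
ys = simulate_y(x_fixed, 10_000, rng)

mean_y = statistics.fmean(ys)     # should land near ALPHA + BETA * x_fixed
sd_y = statistics.stdev(ys)       # should land near SIGMA, the sd of e

print(f"mean of y at x = {x_fixed}: {mean_y:.3f}")
print(f"sd of y at x = {x_fixed}:   {sd_y:.3f}")
```

Because e has mean 0, the simulated y values centre on the height of the population line at that x, and their spread matches the spread of e.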

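7 We use the least-squares line $\hat{y} = a + bx$ to estimate the true population regression line, where b = point estimate of β = $\frac{S_{xy}}{S_{xx}}$, with $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$, and a = point estimate of α = $\bar{y} - b\bar{x}$.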

8 Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x (years):  15    17    18    15    16    19    17    16    18    19
y (grams):  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

Sketch a scatterplot of these data. [Figure: scatterplot of baby’s weight (g) against mother’s age (yrs).] The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

9 Birth Weight Continued... (Data on x = maternal age in years and y = birth weight in grams as on slide 8.) The least-squares line for these data is $\hat{y} = -1163.45 + 245.15x$. The weight of babies increases by approximately 245.15 grams, on average, for each increase of 1 year in the mother’s age. What is the point estimate for the mean weight of babies born to 18-year-old mothers? $\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams. This is the point estimate for the mean weight of all babies born to 18-year-old mothers. It is also the prediction of the weight of a single baby born to a mother 18 years of age.
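The slope quoted on this slide can be reproduced from the listed data. The sketch below reads the run-together values as ten (age, weight) pairs and applies the usual least-squares formulas. The helper name `least_squares` is illustrative.

```python
from statistics import fmean

# Birth-weight data from the slide, read as ten (x, y) pairs.
age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def least_squares(x, y):
    """Return (a, b): intercept and slope of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx            # point estimate of the population slope
    a = y_bar - b * x_bar      # point estimate of the population intercept
    return a, b

a, b = least_squares(age, weight)
y_hat_18 = a + b * 18          # estimated mean weight for 18-year-old mothers

print(f"a = {a:.2f}, b = {b:.2f}, y-hat at x = 18: {y_hat_18:.2f}")
```

The same number serves as the estimate of the mean weight at age 18 and as the prediction for a single baby. Only the uncertainty attached to it differs.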

10 The statistic for estimating the variance σ² is $s_e^2 = \frac{SSResid}{n - 2}$, where $SSResid = \sum (y - \hat{y})^2$. The estimate for the standard deviation σ is $s_e = \sqrt{s_e^2}$. The subscript e reminds us that we are estimating the variance of the “errors” or residuals. Note that the degrees of freedom associated with estimating σ² or σ in simple linear regression is df = n - 2. Why n - 2? Because we must estimate both α and β in the regression line, we reduce the sample size n by 2. Recall that the coefficient of determination, $r^2 = 1 - \frac{SSResid}{SSTo}$ with $SSTo = \sum (y - \bar{y})^2$, is the proportion of observed y variation that is attributed to the model relationship.

11 Birth Weight Revisited... (Data on x = maternal age in years and y = birth weight in grams as on slide 8.) For these data, $SSResid = 337212.45$ and $SSTo = 1539182.9$, so $s_e = \sqrt{337212.45 / 8} \approx 205.3$ and $r^2 = 1 - 337212.45 / 1539182.9 \approx 0.781$. For a particular mother’s age, the typical deviation of a baby’s weight from the population regression line is approximately 205 grams. Approximately 78% of the observed variability in baby weight can be explained by this model.
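As a check on this slide, the sketch below computes $s_e$ and $r^2$ directly from the ten listed (age, weight) pairs, using $s_e^2 = SSResid/(n-2)$ and $r^2 = 1 - SSResid/SSTo$. The variable names are illustrative.

```python
import math
from statistics import fmean

age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
n = len(age)

# Least-squares fit.
x_bar, y_bar = fmean(age), fmean(weight)
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, weight))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual and total sums of squares.
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(age, weight))
ss_total = sum((y - y_bar) ** 2 for y in weight)

s_e = math.sqrt(ss_resid / (n - 2))   # estimate of sigma, df = n - 2
r_squared = 1 - ss_resid / ss_total   # coefficient of determination

print(f"SSResid = {ss_resid:.2f}, SSTo = {ss_total:.2f}")
print(f"s_e = {s_e:.1f} grams, r^2 = {r_squared:.3f}")
```

Running it on the data as listed makes it easy to compare the computed values with the figures quoted on the slide.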

12 Properties of the Sampling Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is β. That is, $\mu_b = \beta$.
2. The standard deviation of the statistic b is $\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$.
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β. Since σ is usually unknown, the estimated standard deviation of the statistic b is $s_b = \frac{s_e}{\sqrt{S_{xx}}}$.
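The first two properties can be illustrated by simulation. The sketch below repeatedly generates samples from a made-up population line at the ten maternal ages from the birth-weight example, fits the slope each time, and compares the mean and standard deviation of the fitted slopes with β and $\sigma/\sqrt{S_{xx}}$. `ALPHA`, `BETA`, and `SIGMA` are hypothetical values chosen for illustration, not estimates from the slides.

```python
import random
from statistics import fmean, stdev

# Hypothetical population values (assumptions for illustration only).
ALPHA, BETA, SIGMA = -1000.0, 240.0, 200.0

# Fixed x values: the ten maternal ages from the birth-weight example.
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
x_bar = fmean(ages)
s_xx = sum((x - x_bar) ** 2 for x in ages)

def slope_of_one_sample(rng):
    """Simulate one sample of y values at the fixed ages and return its fitted slope b."""
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in ages]
    y_bar = fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, ys))
    return s_xy / s_xx

rng = random.Random(1)                       # fixed seed for repeatability
slopes = [slope_of_one_sample(rng) for _ in range(2000)]

mean_b = fmean(slopes)                       # should be close to BETA
sd_b = stdev(slopes)                         # should be close to SIGMA / sqrt(S_xx)
theoretical_sd_b = SIGMA / s_xx ** 0.5

print(f"mean of b = {mean_b:.1f}, sd of b = {sd_b:.1f}, sigma/sqrt(Sxx) = {theoretical_sd_b:.1f}")
```

A histogram of `slopes` would show the roughly normal shape described in property 3.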

13 Confidence Interval for β
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form $b \pm (t \text{ critical value}) \cdot s_b$, where the t critical value is based on df = n - 2.

14 Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete’s performance in a 20-km ski race? The following data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise, 1995):

x (treadmill time, min):  7.7   8.4   8.7   9.0   9.6   9.6   10.0  10.2  10.4  11.0  11.7
y (ski time, min):        71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

Sketch a scatterplot for the data. [Figure: scatterplot of ski time (min) against treadmill time (min).] The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.

15 Biathletes Continued... x = treadmill exhaustion time, y = ski time (data as on slide 14). Find a 95% confidence interval for the slope of the true regression line. With b = -2.3335, $s_b$ = 0.5911, and a t critical value of 2.262 for df = 9, the interval is $b \pm (t \text{ critical value}) \cdot s_b = -2.3335 \pm 2.262(0.5911) = -2.3335 \pm 1.337 = (-3.671, -0.996)$. We are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.
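The interval on this slide can be rebuilt from the Minitab summary values alone. The sketch below uses the slope and its estimated standard deviation from the Treadmill row of the output. The t critical value 2.262 (95% confidence, df = 9) is taken from a standard t table rather than from the slides.

```python
# Values from the Minitab output for the biathlete data.
b = -2.3335        # estimated slope (Coef, Treadmill row)
s_b = 0.5911       # estimated standard deviation of b (StDev, Treadmill row)

# Assumption: t critical value for 95% confidence with df = n - 2 = 9, from a t table.
T_CRIT_95_DF9 = 2.262

margin = T_CRIT_95_DF9 * s_b
lower, upper = b - margin, b + margin

print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Both endpoints are negative, so the interval describes a decrease in ski time of roughly 1 to 3.7 minutes per extra minute on the treadmill.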

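16 Biathletes Continued... Partial Minitab Output

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor      Coef     StDev      T       P
Constant     88.796     5.750   15.44   0.000
Treadmill   -2.3335    0.5911   -3.95   0.003

S = 2.188   R-Sq = 63.4%   R-Sq(adj) = 59.3%

Analysis of Variance
Source            DF        SS       MS       F       P
Regression         1    74.630   74.630   15.58   0.003
Residual Error     9    43.097    4.789
Total             10   117.727

Reading the output:
The regression equation is the equation of the estimated regression line.
Constant Coef (88.796) is the estimated y-intercept a.
Treadmill Coef (-2.3335) is the estimated slope b.
Treadmill StDev (0.5911) is $s_b$, the estimated standard deviation of b.
S (2.188) is $s_e$.
R-Sq (63.4%) is $100 \times r^2$. R-Sq(adj), the adjusted $r^2$, is not used in simple linear regression.
Residual Error SS (43.097) is SSResid, Total SS (117.727) is SSTo, and Residual Error DF (9) is n - 2.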
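Several entries in this output are tied together by the formulas from the earlier slides. The sketch below recomputes S, R-Sq, R-Sq(adj), and F from the three sums of squares and their degrees of freedom in the Analysis of Variance table. The variable names are illustrative.

```python
import math

# Entries read from the Analysis of Variance table.
ss_regression = 74.630
ss_resid = 43.097
ss_total = 117.727
df_resid = 9          # n - 2
df_total = 10         # n - 1

s_e = math.sqrt(ss_resid / df_resid)               # reported as S
r_squared = 1 - ss_resid / ss_total                # reported as R-Sq
r_squared_adj = 1 - (ss_resid / df_resid) / (ss_total / df_total)  # R-Sq(adj)
f_stat = (ss_regression / 1) / (ss_resid / df_resid)               # reported as F

print(f"s_e = {s_e:.3f}")
print(f"R-Sq = {100 * r_squared:.1f}%, R-Sq(adj) = {100 * r_squared_adj:.1f}%")
print(f"F = {f_stat:.2f}")
```

Up to rounding in the printed table, these match the values Minitab reports, which is a quick way to confirm that a hand-copied output has been read correctly.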

17 Summary of Hypothesis Tests Concerning β
Null hypothesis: $H_0$: β = hypothesized value
Test statistic: $t = \frac{b - \text{hypothesized value}}{s_b}$. The test is based on df = n - 2.
Alternative hypothesis and P-value:
$H_a$: β > hypothesized value. P-value: area to the right of t under the appropriate t curve.
$H_a$: β < hypothesized value. P-value: area to the left of t under the appropriate t curve.
$H_a$: β ≠ hypothesized value. P-value: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative.
Often the hypothesized value is zero; this is called the model utility test for simple linear regression.

18 Summary of Hypothesis Tests Concerning β Continued...
Assumptions: For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is σ, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

19 Suppose the least-squares line relating weight to height is horizontal. Would height be useful in predicting weight? What is the slope of a horizontal line? A slope of zero means that there is NO linear relationship between x and y! [Figure: scatterplot of weight against height with a horizontal least-squares line.]

20 The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of $H_0$: β = 0 versus $H_a$: β ≠ 0. The null hypothesis specifies that there is no useful linear relationship between x and y.
Test statistic: $t = \frac{b}{s_b}$, with the test based on df = n - 2.

21 Biathletes Revisited... x = treadmill exhaustion time, y = ski time (data as on slide 14). Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let’s perform the model utility test.
$H_0$: β = 0
$H_a$: β ≠ 0
where β is the slope of the population regression line between treadmill time and ski time.
P-value = .003, α = .05, df = 9.
Since the P-value < α, we reject $H_0$. There is sufficient evidence of a linear relationship between treadmill time and ski time.

22 Biathletes Revisited... Partial Minitab Output (the same output as on slide 16). In the Treadmill row, the t test statistic is Coef ÷ StDev = -2.3335 ÷ 0.5911 = -3.95, and the P column gives the P-value, 0.003. Statistical software usually performs the model utility test with $H_0$: β = 0 versus $H_a$: β ≠ 0.
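The T and P entries in the Treadmill row can be reproduced without statistical software. The sketch below divides the coefficient by its standard deviation and then finds the two-tailed area under the t curve with df = 9. The tail area is obtained by integrating the t density numerically with Simpson's rule, a simple stand-in for a library routine. The function names are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Values from the Treadmill row of the Minitab output.
b, s_b, df = -2.3335, 0.5911, 9

t_stat = b / s_b                                  # model utility test: H0: beta = 0
p_value = 2 * t_upper_tail(abs(t_stat), df)       # two-sided alternative

print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.4f}")
```

With α = .05 the P-value is well below the significance level, in line with the decision to reject $H_0$.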

23 Checking Model Adequacy
The simple linear regression model is $y = \alpha + \beta x + e$, where e represents the random deviation of an observed y value from the population regression line $\alpha + \beta x$. The assumptions for simple linear regression are based on this random deviation e. If we knew the deviations $e_1, e_2, \ldots, e_n$, we could examine them for any inconsistencies with the model assumptions. However, we do not know these deviations because the population regression line is unknown. Therefore, we must estimate them using the residuals from the estimated line. Thus, we use the residuals to check our assumptions.

24 Residual Analysis
Standardize the residuals to look at their magnitudes. Most statistical software will perform this calculation; it is tedious to do by hand. Any observation with a large positive or negative standardized residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.
Create a residual plot (from Chapter 5) or a standardized residual plot (a plot of the (x, standardized residual) pairs). A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than another) and that has no point far removed from all the others.

25 A Look at Standardized Residual Plots
[Figure: several standardized residual plots, described in turn below.]
This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
This plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.
In this plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!
Both of these plots contain points far away from the others. These points can have substantial effects on the estimates of α and β as well as on other quantities.

26 Biathletes Revisited... r = residuals, sr = standardized residuals (from Minitab)

x    7.7    8.4    8.7    9.0    9.6    9.6    10.0   10.2   10.4   11.0   11.7
y    71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r    0.17   2.21   -3.49  0.91   -1.99  3.01   -2.46  -0.39  2.37   -0.53  0.21
sr   0.10   1.13   -1.74  0.44   -0.96  1.44   -1.18  -0.19  1.16   -0.27  0.12

Let’s look at a normal probability plot of the standardized residuals. [Figure: normal probability plot of standardized residual against normal score.] The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt that the random deviations e are normally distributed.

27 Biathletes Continued... r = residuals, sr = standardized residuals (from Minitab; values as in the table on slide 26). Sketch a standardized residual plot. [Figure: standardized residuals plotted against treadmill time.] The standardized residual plot does not show evidence of any pattern or of increasing spread. Sketch a residual plot. [Figure: residuals plotted against treadmill time.] Notice that these two plots have similar appearances. Remember that residuals can also be plotted against the predicted values $\hat{y}$.
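The residual rows can be reproduced from the data. The transcript lists eleven y values but only ten x values; the sketch below assumes the missing sixth x value is 9.6, which is the value implied by the listed residual 3.01 and the Minitab line. The standardized residual is computed as the residual divided by $s_e\sqrt{1 - h_i}$ with $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$, a common definition that appears to match the Minitab values here.

```python
import math
from statistics import fmean

# Biathlete data; the sixth x value (9.6) is an assumption recovered from the residuals.
treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
n = len(treadmill)

# Least-squares fit.
x_bar, y_bar = fmean(treadmill), fmean(ski)
s_xx = sum((x - x_bar) ** 2 for x in treadmill)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(treadmill, ski))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals: observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(treadmill, ski)]
s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))

# Standardized residuals: each residual divided by its estimated standard deviation.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in treadmill]
std_residuals = [r / (s_e * math.sqrt(1 - h)) for r, h in zip(residuals, leverage)]

for x, r, sr in zip(treadmill, residuals, std_residuals):
    print(f"x = {x:5.1f}  residual = {r:6.2f}  standardized = {sr:6.2f}")
```

Plotting `std_residuals` against `treadmill` gives the standardized residual plot the slide asks for.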

28 Optional Topics: Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient

29 Properties of the Sampling Distribution of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
1. The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
2. The standard deviation of the statistic a + bx*, denoted by $\sigma_{a+bx^*}$, is given by $\sigma_{a+bx^*} = \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$.
3. The distribution of a + bx* is normal.
The farther x* is from the center $\bar{x}$, the larger $\sigma_{a+bx^*}$ is. Since σ is unknown, $\sigma_{a+bx^*}$ can be estimated by $s_{a+bx^*}$, which substitutes $s_e$ in place of σ.

30 Confidence Interval for a Mean y Value
When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the mean y value when x has value x*, is $a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$, where the t critical value is based on df = n - 2. Because $s_{a+bx^*}$ is larger the farther x* is from $\bar{x}$, the confidence interval becomes wider as x* moves away from the center of the data.
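The widening of the interval can be seen numerically. The sketch below applies the formula to the biathlete data from the earlier slides (with the sixth x value assumed to be 9.6, as implied by the listed residuals) at two x* values, one near the center of the data and one at the edge. The function name and the t critical value 2.262 (95% confidence, df = 9, from a t table) are not from the slides.

```python
import math
from statistics import fmean

# Biathlete data; the sixth x value (9.6) is an assumption recovered from the residuals.
treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
T_CRIT_95_DF9 = 2.262   # assumption: t table value for 95% confidence, df = 9

def mean_response_ci(x, y, x_star, t_crit):
    """Confidence interval for the mean y value at x = x_star."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    # Estimated standard deviation of a + b*x_star.
    s_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    center = a + b * x_star
    return center - t_crit * s_fit, center + t_crit * s_fit

near_lo, near_hi = mean_response_ci(treadmill, ski, 10.0, T_CRIT_95_DF9)
edge_lo, edge_hi = mean_response_ci(treadmill, ski, 7.7, T_CRIT_95_DF9)

print(f"x* = 10.0: ({near_lo:.2f}, {near_hi:.2f}), width {near_hi - near_lo:.2f}")
print(f"x* =  7.7: ({edge_lo:.2f}, {edge_hi:.2f}), width {edge_hi - edge_lo:.2f}")
```

The interval at x* = 7.7, the smallest treadmill time in the sample, is noticeably wider than the one at x* = 10.0, which lies close to $\bar{x}$.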

31 Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.) Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured. The scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.

32 Jaws Continued...

The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor      Coef      StDev       T       P
Constant      0.688      1.299    0.53   0.599
Length      0.96345    0.08228   11.71   0.000

S = 1.376   R-Sq = 76.6%   R-Sq(adj) = 76.0%

The simple linear regression model explains 76.6% of the variability in jaw width. The model utility test confirms the usefulness of this model. Let’s use the data to compute a 90% confidence interval for the mean jaw width for 15-foot-long sharks. The point estimate is $a + b(15) = 0.688 + 0.96345(15) = 15.140$ inches. The estimated standard deviation of a + b(15) is $s_{a+b(15)} = s_e \sqrt{\frac{1}{n} + \frac{(15 - \bar{x})^2}{S_{xx}}}$, with $s_e = 1.376$ and n = 44.

33 Jaws Continued... (Minitab output as on slide 32.) The 90% confidence interval is $a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm 0.358 = (14.782, 15.498)$. Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet is between 14.782 and 15.498 inches.

34 Prediction Interval for a Single y Value
When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form $a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$, where the t critical value is based on df = n - 2. The prediction interval and the confidence interval are centered at exactly the same place, a + bx*. The prediction interval is wider than the confidence interval due to the addition of $s_e^2$ under the square-root symbol.

35 Jaws Revisited... Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet. The 90% prediction interval is $15.140 \pm 2.339 = (12.801, 17.479)$. We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches. Notice that this interval is much wider than the confidence interval for the mean jaw width.
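The prediction interval can be cross-checked against the confidence interval for the mean using only numbers that appear in the slides. The sketch below backs out $s_{a+b(15)}$ from the half-width of the 90% confidence interval (14.782, 15.498) and then applies $\sqrt{s_e^2 + s_{a+bx^*}^2}$. The t critical value 1.68 is an assumption (a rounded t-table value for roughly 40 degrees of freedom); the transcript does not show the value actually used.

```python
import math

# Values from the Minitab output and the confidence interval for the mean.
a, b = 0.688, 0.96345        # estimated intercept and slope
s_e = 1.376                  # S in the Minitab output
x_star = 15                  # shark length in feet
ci_lower, ci_upper = 14.782, 15.498

# Assumption: 90% t critical value, rounded from a t table (df = 44 - 2 = 42).
T_CRIT = 1.68

y_hat = a + b * x_star                       # point estimate / prediction
ci_half = (ci_upper - ci_lower) / 2          # half-width of the CI for the mean
s_fit = ci_half / T_CRIT                     # implied s_{a+b(15)}

# Prediction interval adds s_e^2 under the square root.
pi_half = T_CRIT * math.sqrt(s_e ** 2 + s_fit ** 2)
pi_lower, pi_upper = y_hat - pi_half, y_hat + pi_half

print(f"implied s_(a+b*15) = {s_fit:.3f}")
print(f"90% prediction interval: ({pi_lower:.3f}, {pi_upper:.3f})")
```

Because $s_e$ is much larger than the implied $s_{a+b(15)}$, nearly all of the prediction interval's width comes from the variability of individual sharks rather than from uncertainty about the line.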

36 [Figure: regression plot from Minitab showing the confidence interval and the prediction interval for the shark data.] Notice that the prediction interval is substantially wider than the confidence interval. Also notice that the confidence interval is very narrow close to $\bar{x}$ but widens the farther x is from the mean.

37 A Test for Independence in a Bivariate Normal Population
The Greek letter “rho,” ρ, is the population correlation coefficient. It assesses the extent of any linear relationship in the population, and ρ must be between -1 and 1.
Null hypothesis: $H_0$: ρ = 0
Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$. The test is based on df = n - 2.
Alternative hypothesis and P-value:
$H_a$: ρ > 0 (positive dependence). P-value: area to the right of t.
$H_a$: ρ < 0 (negative dependence). P-value: area to the left of t.
$H_a$: ρ ≠ 0 (dependence). P-value: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative.
Many investigators are interested in whether ANY relationship exists between x and y. That is, are x and y independent of each other? However, ρ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population. A bivariate normal population is one where, for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.

38 A Test for Independence in a Bivariate Normal Population
Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population. One way to check whether a bivariate normal population is plausible is to construct separate normal probability plots of the x values and of the y values.

39 The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = 0.11. Does this support the claim that short sleep duration is associated with reduced leptin? Use α = .01.
State the hypotheses.
$H_0$: ρ = 0
$H_a$: ρ > 0
where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans.
To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume that a bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.
Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.11}{\sqrt{\frac{1 - 0.11^2}{714}}} \approx 2.96$
P-value = .0015, df = 714, α = .01

40 Sleepless Nights Continued... $H_0$: ρ = 0 versus $H_a$: ρ > 0, where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans. P-value = .0015, df = 714, α = .01. Since the P-value < .01, we reject $H_0$. There is evidence to suggest that there is a positive association (perhaps a weak one, since r = .11) between sleep duration and blood leptin level. Note: the hypothesis of no linear relationship ($H_0$: β = 0) can also be used to test for independence in a bivariate normal population.
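The test statistic for this example follows directly from r = 0.11 and n = 716. The sketch below computes it and approximates the one-sided P-value with the standard normal curve, which is very close to the t curve when df = 714. The normal approximation is a simplification chosen here to keep the example within the standard library; it is not the method stated on the slides.

```python
import math

r = 0.11      # sample correlation coefficient
n = 716       # number of participants
df = n - 2

# Test statistic for H0: rho = 0.
t_stat = r / math.sqrt((1 - r ** 2) / df)

# Upper-tail area under the standard normal curve, used as an approximation
# to the t curve with 714 degrees of freedom (assumption: df is large enough).
p_value_approx = 0.5 * math.erfc(t_stat / math.sqrt(2))

alpha = 0.01
print(f"t = {t_stat:.2f}, df = {df}, approximate P-value = {p_value_approx:.4f}")
print("reject H0" if p_value_approx < alpha else "fail to reject H0")
```

The large sample is what makes such a small correlation statistically significant; the size of r still indicates only a weak linear association.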

