Simple Linear Regression and Correlation: Inferential Methods

Chapter 13 Simple Linear Regression and Correlation: Inferential Methods

Suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is the first-year college grade point average determined solely by the high school grade point average? A relationship in which the value of y is completely determined by the value of an independent variable x is called a deterministic relationship. The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship. A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model. The equation for an additive probabilistic model is $y = f(x) + e$, where f(x) is a deterministic function of x and e is a random “error” variable.

The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, $y = \alpha + \beta x + e$. Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.

[Figure: population regression line (slope β, y-intercept α) with observations at $x_1$ and $x_2$ and their deviations $e_1$ and $e_2$ from the line.]

Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
3. The distribution of e at any particular value of x is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.
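The four assumptions can be made concrete with a small simulation. The sketch below is only an illustration: the values of α, β and σ and the function name `simulate_observations` are hypothetical and do not come from the slides. Each deviation e is drawn independently from a normal distribution with mean 0 and the same σ at every x.

```python
import random
import statistics


def simulate_observations(alpha, beta, sigma, x_values, seed=0):
    """Generate (x, y) pairs from y = alpha + beta*x + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = []
    for x in x_values:
        e = rng.gauss(0.0, sigma)  # mean 0, same sigma at every x, independent draws
        data.append((x, alpha + beta * x + e))
    return data


# Hypothetical population values, chosen only for illustration.
ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0
xs = [i % 10 for i in range(10000)]
sample = simulate_observations(ALPHA, BETA, SIGMA, xs)

# Deviations of the simulated y values from the population regression line.
deviations = [y - (ALPHA + BETA * x) for x, y in sample]
print(round(statistics.mean(deviations), 3), round(statistics.stdev(deviations), 3))
```

With a large simulated sample, the mean of the deviations should be close to 0 and their standard deviation close to the chosen σ.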

Let’s look at the heights and weights of a population of adult women.
How much would an adult female weigh if she were 5 feet tall? Weights of women who are 5 feet tall will vary; in other words, there is a distribution of weights for adult females who are 5 feet tall. Are some of these weights more likely than others? What would this distribution look like? Under the model, this distribution is normal. What would you expect for other heights? We want the standard deviations of all these normal distributions to be the same. Where would you expect the population regression line to be? It passes through the means of these distributions.

[Figure: weight plotted against height, with a normal distribution of weights drawn at each height.]

Basic Assumptions of the Simple Linear Regression Model Revisited
1. The distribution of e at any particular x value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
3. The distribution of e at any particular value of x is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

Remember that the variable e is a measure of the extent to which individual y values deviate from the population regression line. As a consequence, the distribution of y at any particular value of x is normal, and for any particular x value the standard deviation of y equals the standard deviation of e.

We use $\hat{y} = a + bx$ to estimate the true population regression line.
b = point estimate of β = $\frac{S_{xy}}{S_{xx}}$, where $S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$.
a = point estimate of α = $\bar{y} - b\bar{x}$.
Let x* denote a specific value of the predictor variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.
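The point estimates can be computed directly from the sums that appear in the least-squares formulas. The following is a minimal sketch; the four data points and the value of `x_star` are made up for illustration and are not from the slides.

```python
def least_squares(x, y):
    """Return (a, b): the least-squares intercept and slope for paired data."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    b = s_xy / s_xx                 # point estimate of the population slope
    a = sum_y / n - b * sum_x / n   # point estimate of the population intercept
    return a, b


# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
a, b = least_squares(x, y)

x_star = 2.5
estimate = a + b * x_star  # estimate of the mean y at x*, and prediction of a single y at x*
print(a, b, estimate)
```

The same number, a + bx*, serves both as the estimate of the mean y value and as the prediction of a single y value at x*; the two uses differ only in how much uncertainty is attached to them.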

Sketch a scatterplot of these data.
Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

Mother’s age x (yr): 15 17 18 16 19
Baby’s weight y (g): 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

[Figure: scatterplot of baby’s weight (g) against mother’s age (yr).]

Birth Weight Continued . . .
For the data on x = maternal age (in years) and y = birth weight of baby (in grams), summary statistics are computed from the sample data and substituted into the formulas for b and a to obtain the estimated regression line $\hat{y} = a + bx$.

Birth Weight Continued . . .
The slope is interpreted as follows: the weight of babies increases by approximately b grams (the estimated slope) for each increase of 1 year in the mother’s age.

What is the point estimate for the mean weight of babies born to 18-year-old mothers? Substituting x = 18 into the estimated regression line gives a + b(18). This is the point estimate for the mean weight of all babies born to 18-year-old mothers. It is also the prediction of the weight of a single baby born to a mother 18 years of age.

The statistic for estimating the variance σ² is
$s_e^2 = \frac{SSResid}{n - 2}$, where $SSResid = \sum (y - \hat{y})^2$. The estimate for the standard deviation σ is $s_e = \sqrt{s_e^2}$. The subscript e reminds us that we are estimating the variance of the “errors” or residuals.

Why n − 2? The degrees of freedom associated with estimating σ² or σ in simple linear regression is df = n − 2. Since we must estimate both α and β for the regression line, we reduce the sample size n by 2.

Recall that the coefficient of determination, $r^2 = 1 - \frac{SSResid}{SSTo}$ with $SSTo = \sum (y - \bar{y})^2$, is the proportion of observed y variation that is attributed to the model relationship.
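To show how SSResid, SSTo, the estimate of σ and r² fit together, here is a minimal sketch. The function name `fit_summary` and the four data points are illustrative assumptions, not part of the slides.

```python
import math


def fit_summary(x, y):
    """Return (SSResid, SSTo, s_e, r^2) for the least-squares line through (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_to = sum((yi - y_bar) ** 2 for yi in y)
    s_e = math.sqrt(ss_resid / (n - 2))  # n - 2 because a and b are both estimated
    r_squared = 1 - ss_resid / ss_to
    return ss_resid, ss_to, s_e, r_squared


# Illustrative data only.
ss_resid, ss_to, s_e, r_squared = fit_summary([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
print(ss_resid, ss_to, s_e, r_squared)
```

Dividing by n − 2 rather than n reflects the two estimated quantities, a and b, that the residuals depend on.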

Birth Weight Revisited . . .
For the data on x = maternal age (in years) and y = birth weight of baby (in grams), find SSResid and SSTo. Use these to compute $s_e$ and $r^2$.

Interpretation of $s_e$: for a particular mother’s age, the typical deviation of a baby’s weight from the regression line is approximately 231 grams.
Interpretation of $r^2$: approximately 76% of the observed variability in the weight of babies can be explained by this model.

Properties of the Sampling Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is β. That is, $\mu_b = \beta$.
2. The standard deviation of the statistic b is $\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$.
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β. Since σ is usually unknown, the estimated standard deviation of the statistic b is $s_b = \frac{s_e}{\sqrt{S_{xx}}}$.

Confidence Interval for b
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form $b \pm (t \text{ critical value}) \cdot s_b$, where the t critical value is based on df = n − 2.

Sketch a scatterplot for the data.
Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete’s performance in a 20-km ski race? The following data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise, 1995):

x: 7.7 8.4 8.7 9.0 9.6 9.6 10.0 10.2 10.4 11.0 11.7
y: 71.0 71.4 65.0 68.7 64.4 69.4 63.0 64.6 66.9 62.6 61.7

The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.

[Figure: scatterplot of ski time (min) against treadmill time (min).]

Biathletes Continued . . .

For the biathlete data (x = treadmill exhaustion time, y = ski time), find a 95% confidence interval for the slope of the true regression line.

We are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.
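The interval quoted above can be checked with a short calculation. This is a sketch under two stated assumptions: the treadmill times include 9.6 twice (eleven pairs in all), which reproduces the residuals and Minitab figures quoted for this example, and the t critical value 2.262 (95% confidence, df = 9) is taken from a standard t table rather than from the slides. The function name is hypothetical.

```python
import math

# x = treadmill time (min), y = 20-km ski time (min).
# ASSUMPTION: 9.6 appears twice, giving eleven (x, y) pairs.
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
T_CRIT = 2.262  # ASSUMPTION: t table value for 95% confidence, df = n - 2 = 9


def slope_confidence_interval(x, y, t_crit):
    """Return (b, s_b, (lower, upper)) for the slope of the population regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    s_b = s_e / math.sqrt(s_xx)  # estimated standard deviation of b
    return b, s_b, (b - t_crit * s_b, b + t_crit * s_b)


b, s_b, (lower, upper) = slope_confidence_interval(x, y, T_CRIT)
print(round(b, 4), round(s_b, 4), round(lower, 2), round(upper, 2))
```

Both endpoints are negative, which is why the interval is read as a decrease in ski time per additional minute of treadmill time.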

Biathletes Continued . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 – 2.33 treadmill time

Predictor    Coef      StDev     T        P
Constant     88.796    5.750     15.44    0.000
Treadmill    ____      0.5911    -3.95    0.003

S = ____     R-Sq = 63.4%     R-Sq (adj) = 59.3%

Analysis of Variance
Source            DF    SS        MS       F
Regression         1    74.630    ____     15.58
Residual Error     9    43.097    4.789
Total             10    ____

How to read the output:
- The first line gives the equation of the estimated regression line.
- Coef for Constant is the estimated y-intercept a.
- Coef for Treadmill is the estimated slope b.
- StDev for Treadmill is $s_b$, the estimated standard deviation of b.
- S is $s_e$.
- R-Sq is 100 × r². R-Sq (adj) is not used in simple linear regression.
- DF for Residual Error is n − 2.
- SS for Residual Error is SSResid.
- SS for Total is SSTo.
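The Analysis of Variance portion of the output can be reproduced from the raw data. The sketch below again assumes that 9.6 appears twice among the treadmill times; the function name `anova_table` is hypothetical. It also checks a fact that holds in simple linear regression: the F statistic equals the square of the t statistic for the slope.

```python
import math

# ASSUMPTION: 9.6 appears twice, giving eleven (x, y) pairs.
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def anova_table(x, y):
    """Return a dict with the sums of squares, mean square, F and t for the fitted line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_to = sum((yi - y_bar) ** 2 for yi in y)
    ss_regr = ss_to - ss_resid
    ms_resid = ss_resid / (n - 2)
    f_stat = (ss_regr / 1) / ms_resid          # regression has 1 degree of freedom
    t_stat = b / (math.sqrt(ms_resid) / math.sqrt(s_xx))
    return {"ss_regr": ss_regr, "ss_resid": ss_resid, "ss_to": ss_to,
            "ms_resid": ms_resid, "f": f_stat, "t": t_stat}


table = anova_table(x, y)
for name, value in table.items():
    print(name, round(value, 3))
```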

Summary of Hypothesis Tests Concerning b
Null hypothesis: H0: β = hypothesized value
Test statistic: $t = \frac{b - \text{hypothesized value}}{s_b}$
The test is based on df = n − 2.

Alternative hypothesis and P-value:
Ha: β > hypothesized value: area to the right of t under the appropriate t curve
Ha: β < hypothesized value: area to the left of t under the appropriate t curve
Ha: β ≠ hypothesized value: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

Often the hypothesized value is zero; this is called the model utility test for simple linear regression.

Summary of Hypothesis Tests Concerning b Continued . . .
Assumptions: For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is σ, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

What is the slope of a horizontal line?
Suppose the least-squares line is horizontal. Would height be useful in predicting weight? A slope of zero means that there is NO linear relationship between x and y!

[Figure: scatterplot of weight against height with a horizontal least-squares line.]

The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of H0: β = 0 versus Ha: β ≠ 0. The null hypothesis specifies that there is no useful linear relationship between x and y. Test statistic: $t = \frac{b}{s_b}$, based on df = n − 2.

Biathletes Revisited . . .
Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let’s perform the model utility test for the biathlete data (x = treadmill exhaustion time, y = ski time).

H0: β = 0
Ha: β ≠ 0
where β is the slope of the population regression line between treadmill time and ski time.

t = -3.95, df = 9, P-value = .003, α = .05

Since the P-value < α, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.
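The P-value for this test is a two-tailed area under the t curve with 9 degrees of freedom. As a rough check that needs no statistics package, the sketch below integrates the t density numerically with Simpson’s rule. The function names are hypothetical, and the values t = -3.95 and df = 9 are those reported in the Minitab output for this example.

```python
import math


def t_pdf(u, df):
    """Density of the t distribution with df degrees of freedom at u."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + u * u / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """Return 2 * P(T > |t|), using Simpson's rule for the area between 0 and |t|."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # area between 0 and |t|
    return 2 * (0.5 - central_area)


p_value = two_sided_p_value(-3.95, 9)
print(p_value, p_value < 0.05)
```

In practice a statistics package or a t table gives the same tail area; the numerical integration here only avoids an external dependency.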

Biathletes Revisited . . . Partial Minitab Output
Statistical software usually performs the model utility test with H0: β = 0 versus Ha: β ≠ 0. In the partial Minitab output for the biathlete data, the Treadmill row reports both quantities needed for the test: T = -3.95 is the t test statistic (the coefficient divided by its standard deviation, Coef ÷ StDev), and P = 0.003 is the P-value.

Checking Model Adequacy
The simple linear regression model is y = α + βx + e, where e represents the random deviation of an observed y value from the population regression line α + βx. The assumptions for simple linear regression are based on this random deviation e. If we knew the deviations $e_1, e_2, \ldots, e_n$, we could examine them for any inconsistencies with model assumptions. However, we do not know these deviations because the population regression line is unknown. Therefore, we must estimate them using the residuals from the estimated line. Thus, we use the residuals to check our assumptions.

Residual Analysis

Standardize the residuals to look at their magnitudes. Most statistical software will perform this calculation; it is tedious to do by hand. Then create a residual plot (from Chapter 5) or a standardized residual plot, which is a plot of the (x, standardized residual) pairs. A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than another) and that has no point far removed from all the others. Any observation with a large positive or negative residual should be examined carefully for an error in recording data, a nonstandard experimental condition, or an atypical experimental unit.

A Look at Standardized Residual Plots
[Figures: five standardized residual plots, described in turn below.]

The first plot is desirable in that it exhibits no pattern and has no point that lies far away from the other points.

The next two plots both contain points far away from the others. These points can have substantial effects on the estimates of α and β as well as other quantities.

Another plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.

In the last plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!

Biathletes Revisited . . .
Let’s look at a normal probability plot of the standardized residuals (standardized residuals from Minitab).

x       y       residual    standardized residual
7.7     71.0     0.17        0.10
8.4     71.4     2.21        1.13
8.7     65.0    -3.49       -1.74
9.0     68.7     0.91        0.44
9.6     64.4    -1.99       -0.96
9.6     69.4     3.01        1.44
10.0    63.0    -2.46       -1.18
10.2    64.6    -0.39       -0.19
10.4    66.9     2.37        1.16
11.0    62.6    -0.53       -0.27
11.7    61.7     0.21        0.12

The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt the plausibility that the random deviations e are normally distributed.

[Figures: scatterplot of ski time (min) against treadmill time (min); normal probability plot of standardized residual against normal score.]

Biathletes Continued . . .
Using the residuals and standardized residuals listed for the biathlete data, sketch a residual plot and a standardized residual plot, each against treadmill time. Notice that these two plots have similar appearances. The standardized residual plot does not show evidence of any pattern or of increasing spread. Remember that residuals can also be plotted against y.

[Figures: residuals against treadmill time; standardized residuals against treadmill time.]
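The residuals and standardized residuals can be recomputed from the data. The slides take the standardized values from Minitab and do not state a formula, so the sketch below assumes the usual definition, residual divided by $s_e\sqrt{1 - \frac{1}{n} - \frac{(x - \bar{x})^2}{S_{xx}}}$. It also assumes that 9.6 appears twice among the treadmill times. The function name is hypothetical.

```python
import math

# ASSUMPTION: 9.6 appears twice, giving eleven (x, y) pairs.
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def residual_table(x, y):
    """Return (residuals, standardized residuals) for the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    # ASSUMPTION: each residual is divided by its own estimated standard deviation.
    standardized = [
        r / (s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx))
        for xi, r in zip(x, residuals)
    ]
    return residuals, standardized


residuals, standardized = residual_table(x, y)
for xi, r, sr in zip(x, residuals, standardized):
    print(xi, round(r, 2), round(sr, 2))
```

Plotting the (x, residual) and (x, standardized residual) pairs from this output gives the two plots described above.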

Optional Topics: Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient

Properties of the Sampling Distribution of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
1. The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
2. The standard deviation of the statistic a + bx*, denoted by $\sigma_{a+bx^*}$, is given by $\sigma_{a+bx^*} = \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$.
3. The distribution of a + bx* is normal.

The farther x* is from the center of the data ($\bar{x}$), the larger $\sigma_{a+bx^*}$ is. Since σ is unknown, $\sigma_{a+bx^*}$ can be estimated by $s_{a+bx^*}$, which substitutes $s_e$ in place of σ.

Confidence Interval for a Mean y Value
When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the mean y value when x has value x*, is $a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$, where the t critical value is based on df = n − 2. Because $s_{a+bx^*}$ is larger the farther x* is from $\bar{x}$, the confidence interval becomes wider as x* moves away from the center of the data.
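The widening of the interval away from the center of the data can be seen numerically. Since the shark data are not reproduced in this transcript, the sketch below reuses the biathlete data (assuming 9.6 appears twice). The x* values 10.0 and 11.7, the function name, and the t critical value 2.262 (95% confidence, df = 9, from a standard t table) are illustrative assumptions.

```python
import math

# ASSUMPTION: biathlete data with 9.6 appearing twice (eleven pairs).
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
T_CRIT_95 = 2.262  # ASSUMPTION: t table value for 95% confidence, df = 9


def mean_response_interval(x, y, x_star, t_crit):
    """Return (a + b*x_star, its estimated standard deviation, confidence interval)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    fit = a + b * x_star
    se_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    return fit, se_fit, (fit - t_crit * se_fit, fit + t_crit * se_fit)


fit_near, se_near, ci_near = mean_response_interval(x, y, 10.0, T_CRIT_95)
fit_far, se_far, ci_far = mean_response_interval(x, y, 11.7, T_CRIT_95)
print(round(fit_near, 2), round(se_near, 3), ci_near)
print(round(fit_far, 2), round(se_far, 3), ci_far)
```

The interval at x* = 11.7, the largest treadmill time in the sample, is wider than the one at x* = 10.0, which lies near the mean treadmill time.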

Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.) Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured. The scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.

Jaws Continued . . .
Partial Minitab Output

The regression equation is
Jaw Width = ____ + ____ Length

Predictor    Coef     StDev    T        P
Constant     0.688    1.299    0.53     0.599
Length       ____     ____     11.71    0.000

S = 1.376     R-Sq = 76.6%     R-Sq (adj) = 76.0%

The model utility test confirms the usefulness of this model (t = 11.71, P-value = 0.000). The simple linear regression model explains 76.6% of the variability in jaw width.

Let’s use the data to compute a 90% confidence interval for the mean jaw width for 15-foot-long sharks. The point estimate is a + b(15). The estimated standard deviation of a + b(15) is $s_{a+b(15)} = s_e \sqrt{\frac{1}{n} + \frac{(15 - \bar{x})^2}{S_{xx}}}$.

Jaws Continued . . .
The 90% confidence interval is $a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)}$, with the t critical value based on df = 44 − 2 = 42. Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet lies between the two endpoints of this interval (in inches).

Prediction Interval for a Single y Value
When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form $a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$, where the t critical value is based on df = n − 2. The prediction interval is wider than the confidence interval due to the addition of $s_e^2$ under the square-root symbol. The prediction interval and the confidence interval are centered at exactly the same place, a + bx*.
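A side-by-side calculation shows the two intervals sharing a center but differing in width. As the shark data are not reproduced in this transcript, the sketch reuses the biathlete data (assuming 9.6 appears twice). The value x* = 10.0, the function name, and the t critical value 1.833 (90% confidence, df = 9, from a standard t table) are illustrative assumptions.

```python
import math

# ASSUMPTION: biathlete data with 9.6 appearing twice (eleven pairs).
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
T_CRIT_90 = 1.833  # ASSUMPTION: t table value for 90% confidence, df = 9


def intervals_at(x, y, x_star, t_crit):
    """Return (confidence interval for the mean y, prediction interval for one y) at x_star."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    fit = a + b * x_star
    se_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    se_pred = math.sqrt(s_e ** 2 + se_fit ** 2)  # extra s_e^2 term for a single observation
    ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return ci, pi


ci, pi = intervals_at(x, y, 10.0, T_CRIT_90)
print("confidence interval:", ci)
print("prediction interval:", pi)
```

The extra $s_e^2$ term does not shrink as the sample grows, so a prediction interval stays wide even when the mean is estimated very precisely.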

Jaws Revisited . . .
Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet. The 90% prediction interval is $a + b(15) \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+b(15)}^2}$, with df = 42. We can be 90% confident that an individual shark of length 15 feet will have a jaw width between the two endpoints of this interval (in inches). Notice that this interval is much wider than the confidence interval for the mean jaw width.

[Figure: regression plot from Minitab showing the confidence interval and the prediction interval for the shark data.]

Notice that the prediction interval is substantially wider than the confidence interval. Also notice that the confidence interval is very narrow close to $\bar{x}$ but widens the farther x* is from the mean.

A Test for Independence in a Bivariate Normal Population
The Greek letter “rho,” ρ, denotes the population correlation coefficient. It assesses the extent of any linear relationship in the population, and ρ must be between -1 and 1.

Many investigators are interested in whether ANY relationship exists between x and y. That is, are x and y independent of each other? However, ρ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population. A bivariate normal population is one where, for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.

Null hypothesis: H0: ρ = 0
Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
The test is based on df = n − 2.

Alternative hypothesis and P-value:
Ha: ρ > 0 (positive dependence): area to the right of t
Ha: ρ < 0 (negative dependence): area to the left of t
Ha: ρ ≠ 0 (dependence): 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

A Test for Independence in a Bivariate Normal Population
Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population. One way to check whether a bivariate normal population is plausible is to construct separate normal probability plots of the x values and of the y values.

The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = .11. Does this support the claim that short sleep duration is associated with reduced leptin? Use α = .01.

State the hypotheses.
H0: ρ = 0
Ha: ρ > 0
where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans.

To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume that a bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.

Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$, with df = 716 − 2 = 714. The P-value is the area to the right of t under the t curve with 714 degrees of freedom, to be compared with α = .01.

Sleepless Nights Continued . . .
H0: ρ = 0
Ha: ρ > 0
where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans.

The test statistic is $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$ with df = 714 and α = .01. Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one, since r = .11) between sleep duration and blood leptin level.

Note: the test of no linear relationship (H0: β = 0) can also be used to test for independence in a bivariate normal population.
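The test statistic for this example follows directly from r = .11 and n = 716. In the sketch below, the function name is hypothetical, and the P-value is only approximated: because df = 714 is large, the upper-tail area under the standard normal curve is used in place of the t curve, which is an assumption of this sketch rather than the method in the slides.

```python
import math
from statistics import NormalDist


def correlation_t_statistic(r, n):
    """Return t = r / sqrt((1 - r^2) / (n - 2)) for testing H0: rho = 0."""
    return r / math.sqrt((1 - r * r) / (n - 2))


r, n = 0.11, 716
t_stat = correlation_t_statistic(r, n)
df = n - 2

# ASSUMPTION: with df = 714 the t curve is very close to the standard normal curve,
# so the upper-tail normal area is used as an approximate P-value for Ha: rho > 0.
p_approx = 1 - NormalDist().cdf(t_stat)

print(round(t_stat, 2), df, p_approx < 0.01)
```

Even a correlation as small as .11 produces a large t statistic here because the sample is large; statistical significance does not imply a strong association.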
