# Summarizing Bivariate Data: Introduction to Linear Regression

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.


2 Linear Relations The relationship y = a + bx is the equation of a straight line. The value b, called the slope of the line, is the amount by which y increases when x increases by 1 unit. The value of a, called the intercept (or sometimes the vertical intercept) of the line, is the height of the line above the value x = 0.

3 Example [Figure: graph of the line y = 7 + 3x, showing intercept a = 7 and slope b = 3: when x increases by 1, y increases by 3.]

4 Example [Figure: graph of the line y = 17 - 4x, showing intercept a = 17 and slope b = -4: when x increases by 1, y changes by -4 (i.e., decreases by 4).]
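The two example lines can be checked numerically. The following is a minimal sketch; the helper name `line` is introduced here for illustration and does not come from the slides.

```python
def line(a, b, x):
    """Height of the straight line y = a + b*x at the point x."""
    return a + b * x

# Slide 3 example: y = 7 + 3x
print(line(7, 3, 0))                      # intercept a: height of the line at x = 0
print(line(7, 3, 5) - line(7, 3, 4))      # slope b: change in y when x increases by 1

# Slide 4 example: y = 17 - 4x
print(line(17, -4, 0))                    # intercept a
print(line(17, -4, 3) - line(17, -4, 2))  # slope b (negative, so y decreases)
```

A negative slope, as in the second example, simply means that y falls as x rises.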

5 Least Squares Line The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is the sum of the squared deviations about the line:

$$\sum [y - (a + bx)]^2 = [y_1 - (a + bx_1)]^2 + [y_2 - (a + bx_2)]^2 + \cdots + [y_n - (a + bx_n)]^2$$

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.
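The criterion can be sketched in a few lines of Python. The data set below is hypothetical (it is not the Greyhound data), and the function name is an illustrative choice. The sketch compares the sum of squared deviations for two candidate lines; the smaller sum indicates the better fit.

```python
def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustrative data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two candidate lines for the same data.
fit_1 = sum_squared_deviations(2.2, 0.6, xs, ys)  # about 2.4
fit_2 = sum_squared_deviations(2.0, 0.5, xs, ys)  # 3.75
print(fit_1, fit_2)
```

For this data the line y = 2.2 + 0.6x happens to be the least squares line, so no other choice of a and b gives a smaller sum.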

6 Coefficients a and b The slope of the least squares line is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and the y intercept is

$$a = \bar{y} - b\bar{x}$$

We write the equation of the least squares line as

$$\hat{y} = a + bx$$

where the ^ above y emphasizes that $\hat{y}$ (read as y-hat) is a prediction of y resulting from the substitution of a particular x value into the equation.
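As a rough sketch, the coefficients can be computed directly from the standard least squares formulas, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The data and the function name `least_squares` are hypothetical and are used only for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical illustrative data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
print(f"y-hat = {a:.1f} + {b:.1f} x")  # y-hat = 2.2 + 0.6 x
```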

7 Calculating Formula for b

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
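The usual calculating formula for the slope, b = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n], needs only running totals. The sketch below, on hypothetical data with illustrative function names, checks that it agrees with the definition based on deviations from the means.

```python
def slope_computational(xs, ys):
    """Slope from the totals: sum of x, sum of y, sum of xy, sum of x squared."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

def slope_definition(xs, ys):
    """Slope from deviations about the means, for comparison."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Hypothetical illustrative data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(slope_computational(xs, ys), slope_definition(xs, ys))  # both about 0.6
```

The two forms are algebraically identical; the computational form was convenient for hand calculation, while the deviation form tends to be less sensitive to rounding.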

8 Greyhound Example Continued

9 Calculations From the previous slide, we have

10 Minitab Graph The following graph is a copy of the output from a Minitab command to graph the regression line.

11 Greyhound Example Revisited

12 Greyhound Example Revisited Using the calculation formula we have: Notice that we get the same result.

13 Three Important Questions To examine how useful or effective the line is in summarizing the relationship between x and y, we consider the following three questions.

1. Is a line an appropriate way to summarize the relationship between the two variables?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?
3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?

14 Terminology

15 Greyhound Example Continued

16 Residual Plot A residual plot is a scatter plot of the data pairs (x, residual). The following plot was produced by Minitab from the Greyhound example.

17 Residual Plot - What to look for. Isolated points or patterns indicate potential problems. Ideally the points should be randomly spread out above and below zero. This residual plot indicates no systematic bias in using the least squares line to predict the y value. Generally this is the kind of pattern that you would like to see. Note:

1. Values below 0 indicate overprediction.
2. Values above 0 indicate underprediction.
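The sign convention in the note follows if a residual is taken to be the observed value minus the predicted value, y − ŷ. That is the usual definition and is assumed here. The sketch below uses hypothetical data together with its least squares line ŷ = 2.2 + 0.6x and labels each point accordingly.

```python
# Hypothetical illustrative data and its least squares line (not the Greyhound data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Residual = observed y minus predicted y-hat (assumed definition).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Negative residual: the line predicted too high. Positive: too low.
labels = ["over" if r < 0 else "under" for r in residuals]

for x, r, label in zip(xs, residuals, labels):
    print(f"x = {x}: residual = {r:+.1f} ({label}prediction)")
```

Plotting the pairs (x, residual) from such a list gives the residual plot; for a least squares line the residuals always sum to zero, so points necessarily fall on both sides of the zero line.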

18 The Greyhound example continued For the Greyhound example, it appears that the line systematically predicts fares that are too high for cities close to Rochester and predicts fares that are too low for most cities between 200 and 500 miles. [Residual plot annotations: "Predicted fares are too high." and "Predicted fares are too low."]

19 More Residual Plots

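20 Definition formulae The total sum of squares, denoted by SSTo, is

$$SSTo = \sum (y - \bar{y})^2$$

and the residual sum of squares, denoted by SSResid, is

$$SSResid = \sum (y - \hat{y})^2$$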

21 Calculational formulae SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:

$$SSTo = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$SSResid = \sum y^2 - a\sum y - b\sum xy$$
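A short sketch can confirm that the standard definitions, SSTo = Σ(y − ȳ)² and SSResid = Σ(y − ŷ)², agree with the usual computational forms, SSTo = Σy² − (Σy)²/n and SSResid = Σy² − aΣy − bΣxy. The data and the function name are hypothetical. Note that the computational form for SSResid holds only when a and b are the least squares coefficients.

```python
def sums_of_squares(xs, ys, a, b):
    """Return SSTo and SSResid, each computed from its definition and its shortcut."""
    n = len(ys)
    y_bar = sum(ys) / n

    # Definition formulas.
    ssto_def = sum((y - y_bar) ** 2 for y in ys)
    ssresid_def = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    # Computational formulas (valid for the least squares a and b).
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    ssto_calc = sum_y2 - sum_y ** 2 / n
    ssresid_calc = sum_y2 - a * sum_y - b * sum_xy

    return ssto_def, ssto_calc, ssresid_def, ssresid_calc

# Hypothetical illustrative data with its least squares line y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
ssto_def, ssto_calc, ssresid_def, ssresid_calc = sums_of_squares(xs, ys, 2.2, 0.6)
print(ssto_def, ssto_calc)        # both about 6
print(ssresid_def, ssresid_calc)  # both about 2.4
```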

22 Coefficient of Determination The coefficient of determination, denoted by $r^2$, gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y. Note that the coefficient of determination is the square of the Pearson correlation coefficient.
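The coefficient of determination is commonly computed as r² = 1 − SSResid/SSTo. As a worked check, the sketch below plugs in the SSResid and SSTo values reported in the Minitab output on the final slide of this deck; the variable names are illustrative.

```python
# Values taken from the Minitab analysis of variance table for the Greyhound example.
ss_resid = 509.1   # Residual Error SS
ss_total = 7807.2  # Total SS

# Proportion of the variation in y attributed to the linear relationship.
r_squared = 1 - ss_resid / ss_total
print(f"r^2 = {r_squared:.3f}")  # r^2 = 0.935
```

This reproduces the R-Sq value of 93.5% shown in the Minitab output.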

23 Greyhound Example Revisited

24 Greyhound Example Revisited We can say that 93.5% of the variation in fare (y) can be attributed to the least squares linear relationship between distance (x) and fare.

25 More on variability The standard deviation about the least squares line is denoted $s_e$ and given by

$$s_e = \sqrt{\frac{SSResid}{n - 2}}$$

$s_e$ is interpreted as the “typical” amount by which an observation deviates from the least squares line.
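The standard deviation about the least squares line is conventionally computed as the square root of SSResid/(n − 2). As a worked check, the sketch below uses SSResid = 509.1 from the Minitab output on the final slide and n = 13, which is inferred from the Total row having 12 degrees of freedom.

```python
import math

ss_resid = 509.1  # Residual Error SS from the Minitab output
n = 13            # inferred: Total DF = n - 1 = 12

# Standard deviation about the least squares line, with n - 2 degrees of freedom.
s_e = math.sqrt(ss_resid / (n - 2))
print(f"s_e = {s_e:.3f}")  # s_e = 6.803
```

The result matches the value S = 6.803 reported by Minitab.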

26 Greyhound Example Revisited The “typical” deviation of actual fare from the prediction is \$6.80.

27 Minitab output for Regression

Regression Analysis: Standard Fare versus Distance

The regression equation is Standard Fare = 10.1 + 0.150 Distance

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 10.138 | 4.327 | 2.34 | 0.039 |
| Distance | 0.15018 | 0.01196 | 12.56 | 0.000 |

S = 6.803, R-Sq = 93.5%, R-Sq(adj) = 92.9%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 7298.1 | 7298.1 | 157.68 | 0.000 |
| Residual Error | 11 | 509.1 | 46.3 | | |
| Total | 12 | 7807.2 | | | |

[Slide annotations label the regression equation as the least squares regression line, the Constant and Distance coefficients as a and b, S as $s_e$, R-Sq as $r^2$, the Residual Error SS as SSResid, and the Total SS as SSTo.]
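The fitted coefficients in this output can be used to predict a fare. The sketch below is only an illustration: the function name and the 300-mile distance are hypothetical, and the units are assumed to be dollars and miles. It also checks that each T value in the output equals the coefficient divided by its standard error.

```python
# Coefficients from the Minitab output above.
a = 10.138    # Constant (intercept)
b = 0.15018   # Distance (slope)

def predict_fare(distance):
    """Predicted standard fare from the least squares line y-hat = a + b * distance."""
    return a + b * distance

# Hypothetical trip length chosen for illustration.
print(f"{predict_fare(300):.2f}")  # 55.19

# T is the coefficient divided by its standard error (SE Coef).
t_constant = 10.138 / 4.327
t_distance = 0.15018 / 0.01196
print(round(t_constant, 2), round(t_distance, 2))  # 2.34 12.56
```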