Slide 2 - 42 Copyright © 2008 Pearson Education, Inc. Chapter 4 Descriptive Methods in Regression and Correlation.


2 Chapter 4 Descriptive Methods in Regression and Correlation

3 Definition 4.1

4 Key Fact 4.1 Figure 4.6

5 Table 4.2 displays data on age and price for a sample of cars of a particular make and model. We refer to the car as the Orion, but the data, obtained from the Asian Import edition of the Auto Trader magazine, are for a real car. Ages are in years; prices are in hundreds of dollars, rounded to the nearest hundred dollars.

6 Plotting the data in a scatterplot helps us visualize any apparent relationship between age and price. Generally speaking, a scatterplot (or scatter diagram) is a graph of data from two quantitative variables of a population. To construct a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other. Each pair of observations is then plotted as a point. Figure 4.7 shows a scatterplot for the age-price data in Table 4.2. Note that we use a horizontal axis for ages and a vertical axis for prices. Each age-price observation is plotted as a point. For instance, the second car in Table 4.2 is 4 years old and has a price of 103 ($10,300). We plot this age-price observation as the point (4, 103), shown in magenta in Fig. 4.7.

7 Figure 4.7

8 Although the age-price data points do not fall exactly on a line, they appear to cluster about a line. We want to fit a line to the data points and use that line to predict the price of an Orion based on its age. Because we could draw many different lines through the cluster of data points, we need a method to choose the “best” line. The method, called the least-squares criterion, is based on an analysis of the errors made in using a line to fit the data points. To introduce the least-squares criterion, we use a very simple data set in Example 4.3. We return to the Orion data shortly.

9 Example 4.3 Consider the problem of fitting a line to the four data points in Table 4.3. Many (in fact, infinitely many) lines can “fit” those four data points. Two possibilities are shown in Figs. 4.9(a) and 4.9(b). Table 4.3

10 Figure 4.9 Example 4.3

11 Example 4.3 To avoid confusion, we use \(\hat{y}\) to denote the y-value predicted by a line for a value of x. For instance, Line A and Line B each predict a y-value for x = 2; for Line A that value is \(\hat{y} = 3\). To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points. Line A predicts a y-value of \(\hat{y} = 3\) when x = 2, whereas the actual y-value for x = 2 is y = 2 (see Table 4.3). So, the error made in using Line A to predict the y-value of the data point (2, 2) is \(e = y - \hat{y} = 2 - 3 = -1\), as seen in Fig. 4.9(a).

12 Table 4.4 Example 4.3 In general, an error, e, is the signed vertical distance from the line to a data point. The fourth column of Table 4.4(a) shows the errors made by Line A for all four data points; the fourth column of Table 4.4(b) shows the same for Line B.
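The error calculation described above can be sketched in a few lines of Python. Table 4.3 and the equations of Lines A and B are not reproduced in this transcript, so the data set and the candidate line below are illustrative assumptions; only the point (2, 2) and the predicted value 3 at x = 2 come from the text.

```python
# Illustrative data only: Table 4.3 is not reproduced in this transcript.
# The point (2, 2) comes from the text; the other points are hypothetical.
points = [(1, 1), (1, 2), (2, 2), (4, 6)]


def candidate_line(x):
    """Hypothetical line y_hat = 1 + x; it predicts 3 at x = 2, as Line A does."""
    return 1 + x


def errors(data, predict):
    """Signed vertical distances e = y - y_hat from the line to each point."""
    return [y - predict(x) for x, y in data]


def sum_squared_errors(data, predict):
    """Sum of squared errors, the quantity the least-squares criterion minimizes."""
    return sum(e ** 2 for e in errors(data, predict))


errs = errors(points, candidate_line)
sse = sum_squared_errors(points, candidate_line)
print("errors:", errs)
print("sum of squared errors:", sse)
```

Comparing `sum_squared_errors` for two candidate lines is one way to decide which of them fits the points better under the least-squares criterion.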

13 Key Fact 4.2 Definition 4.2

14 Definition 4.3

15 Formula 4.1

16 Table 4.5 Example 4.4 In the first two columns of Table 4.5, we repeat our data on age and price for a sample of 11 Orions.

17 Example 4.4 a. Determine the regression equation for the data. b. Graph the regression equation and the data points. c. Describe the apparent relationship between age and price of Orions. d. Interpret the slope of the regression line in terms of prices for Orions. e. Use the regression equation to predict the price of a 3-year-old Orion and a 4-year-old Orion.

18 Solution Example 4.4 a. We first need to compute \(b_1\) and \(b_0\) by using Formula 4.1. We did so by constructing a table of values for x (age), y (price), xy, \(x^2\), and their sums in Table 4.5. The slope of the regression line therefore is \(b_1 = \dfrac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n} = -20.26\).

19 Solution Example 4.4 a. The y-intercept is \(b_0 = \frac{1}{n}\left(\sum y - b_1 \sum x\right) = 195.47\). So the regression equation is \(\hat{y} = 195.47 - 20.26x\).
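As a rough check of part (a), the slope and intercept can be computed from the column sums. Table 4.5 is not included in this transcript, so the ages and prices below are an assumption: they were chosen to agree with every figure the text does quote (11 cars, the point (4, 103), the point (2, 169), prices from 48 to 169, a mean price of 88.64). The function name `least_squares` is likewise only illustrative.

```python
# ASSUMED data: Table 4.5 is not in the transcript. These values are
# consistent with the summary figures quoted in the text.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                # years
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # hundreds of dollars


def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    s_xy = sum_xy - sum_x * sum_y / n   # numerator of the slope
    s_xx = sum_x2 - sum_x ** 2 / n      # denominator of the slope
    b1 = s_xy / s_xx
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


b0, b1 = least_squares(ages, prices)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

Working from the sums of x, y, xy, and x² mirrors the table-based hand calculation described in the solution.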

20 Solution Example 4.4 b. To graph the regression equation, we need to substitute two different x-values in the regression equation to obtain two distinct points. Let’s use the x-values 2 and 8. The corresponding y-values are \(\hat{y} = 195.47 - 20.26 \cdot 2 = 154.95\) and \(\hat{y} = 195.47 - 20.26 \cdot 8 = 33.39\). Therefore, the regression line goes through the two points (2, 154.95) and (8, 33.39). In Fig. 4.10, we plotted these two points with hollow dots. Drawing a line through the two hollow dots yields the regression line, the graph of the regression equation. Figure 4.10 also shows the data points from the first two columns of Table 4.5.

21 Figure 4.10

22 Solution Example 4.4 c. Because the slope of the regression line is negative, price tends to decrease as age increases, which is no particular surprise. d. Because x represents age, in years, and y represents price, in hundreds of dollars, the slope of −20.26 indicates that Orions depreciate an estimated $2026 per year, at least in the 2- to 7-year-old range. e. For a 3-year-old Orion, x = 3, and the regression equation yields the predicted price of \(\hat{y} = 134.69\). Similarly, the predicted price for a 4-year-old Orion is \(\hat{y} = 114.43\). Interpretation The estimated price of a 3-year-old Orion is $13,469, and the estimated price of a 4-year-old Orion is $11,443.
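Part (e) amounts to substituting an age into the regression equation and converting hundreds of dollars to dollars. The sketch below uses the slope −20.26 quoted in the solution and the intercept 195.47, which follows from that slope and the point (2, 154.95); the function names are illustrative.

```python
B0 = 195.47   # intercept, in hundreds of dollars
B1 = -20.26   # slope, in hundreds of dollars per year of age


def predicted_price(age_years):
    """Predicted price in hundreds of dollars for an Orion of the given age."""
    return B0 + B1 * age_years


def to_dollars(price_in_hundreds):
    """Convert a price in hundreds of dollars to whole dollars."""
    return round(price_in_hundreds * 100)


for age in (3, 4):
    price = predicted_price(age)
    print(f"age {age}: y_hat = {price:.2f}, about ${to_dollars(price):,}")
```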

23 Extrapolation Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside that range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. Grossly incorrect predictions can result from extrapolation. The Orion example is a case in point. Its observed ages (values of the predictor variable) range from 2 to 7 years old. But suppose that we extrapolate to predict the price of an 11-year-old Orion. Using the regression equation, the predicted price is \(\hat{y} = 195.47 - 20.26 \cdot 11 = -27.39\),

24 Extrapolation or −$2739. Clearly, this result is ridiculous: no one is going to pay us $2739 to take away their 11-year-old Orion. Consequently, although the relationship between age and price of Orions appears to be linear in the range from 2 to 7 years old, it is definitely not so in the range from 2 to 11 years old. Figure 4.11 summarizes the discussion on extrapolation as it applies to age and price of Orions.
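One way to make the extrapolation warning concrete is to have a prediction helper refuse ages outside the observed range unless the caller explicitly opts in. This is only a sketch: the guard, the function name, and the `allow_extrapolation` flag are assumptions for illustration, not part of the textbook's method. The coefficients are those of the regression equation, with the intercept 195.47 implied by the slope and the plotted point (2, 154.95).

```python
B0, B1 = 195.47, -20.26       # regression equation y_hat = 195.47 - 20.26x
OBSERVED_AGE_RANGE = (2, 7)   # observed ages of the sampled Orions, in years


def predict_price(age_years, allow_extrapolation=False):
    """Predicted price in hundreds of dollars; guards against extrapolation."""
    low, high = OBSERVED_AGE_RANGE
    if not (low <= age_years <= high) and not allow_extrapolation:
        raise ValueError(
            f"age {age_years} is outside the observed range {low} to {high}; "
            "the linear relationship may not hold there"
        )
    return B0 + B1 * age_years


# Forcing the extrapolation reproduces the nonsensical negative price.
print(f"{predict_price(11, allow_extrapolation=True):.2f}")
```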

25 Figure 4.11

26 Outliers and Influential Observations An outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is “pulled” toward such a data point without counteraction by other data points.

27 Outliers and Influential Observations For the Orion data, the data point (2, 169) might be an influential observation because the age of 2 years appears separated from the other observed ages. Removing that data point and recalculating the regression equation yields \(\hat{y} = 160.33 - 14.24x\). Figure 4.12 reveals that this equation differs markedly from the regression equation based on the full data set. The data point (2, 169) is indeed an influential observation. The influential observation (2, 169) is not a recording error; it is a legitimate data point. Nonetheless, we may need either to remove it, thus limiting the analysis to Orions between 4 and 7 years old, or to obtain additional data on 2- and 3-year-old Orions so that the regression analysis is not so dependent on one data point.
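The effect of the point (2, 169) can be checked by fitting the line twice, once with and once without it. The data below are an assumption (the table is not part of this transcript) chosen to agree with the figures quoted in the text; `fit_line` is an illustrative helper using the mean-centered form of the least-squares formulas.

```python
# ASSUMED data, consistent with the figures quoted in the text;
# the original table is not reproduced in the transcript.
orions = [(5, 85), (4, 103), (6, 70), (5, 82), (5, 89), (5, 98),
          (6, 66), (6, 95), (2, 169), (7, 70), (7, 48)]


def fit_line(data):
    """Least-squares intercept and slope for a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


full_b0, full_b1 = fit_line(orions)
without = [point for point in orions if point != (2, 169)]
reduced_b0, reduced_b1 = fit_line(without)

print(f"all 11 cars:      y_hat = {full_b0:.2f} + ({full_b1:.2f})x")
print(f"without (2, 169): y_hat = {reduced_b0:.2f} + ({reduced_b1:.2f})x")
```

A large change in slope after dropping a single point is the practical signature of an influential observation.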

28 Figure 4.12

29 Key Fact 4.3

30 Definition 4.4

31 Definition 4.5

32 Example 4.7 The scatterplot and regression line for the age and price data of 11 Orions are repeated in Fig. 4.15 on the next slide. The scatterplot reveals that the prices of the 11 Orions vary widely, ranging from a low of 48 ($4800) to a high of 169 ($16,900). But Fig. 4.15 also shows that much of the price variation is “explained” by the regression (or age); that is, the regression line, with age as the predictor variable, predicts a sizeable portion of the type of variation found in the prices. Make this qualitative statement precise by finding and interpreting the coefficient of determination for the Orion data.

33 Figure 4.15

34 Table 4.6 Solution Example 4.7 To compute the total sum of squares, SST, we must first find the mean of the observed prices. Referring to the second column of Table 4.6, we get \(\bar{y} = \frac{\sum y_i}{n} = 88.64\).

35 Table 4.6 Solution Example 4.7 After constructing the third column of Table 4.6, we calculate the entries for the fourth column and then find the total sum of squares, \(SST = \sum (y_i - \bar{y})^2\), which is the total variation in the observed prices.

36 Table 4.7 Solution Example 4.7 To compute the regression sum of squares, SSR, we need the predicted prices and the mean of the observed prices. Each predicted price is obtained by substituting the age of the Orion in question for x in the regression equation. The third column of Table 4.7 shows the predicted prices for all 11 Orions.

37 Solution Example 4.7 Recalling that \(\bar{y} = 88.64\), we construct the fourth column of Table 4.7. We then calculate the entries for the fifth column and obtain the regression sum of squares, \(SSR = \sum (\hat{y}_i - \bar{y})^2\), which is the variation in the observed prices explained by the regression. From SST and SSR, we compute the coefficient of determination, the percentage of variation in the observed prices explained by the regression (i.e., by the linear relationship between age and price for the sampled Orions): \(r^2 = SSR/SST = 0.853\), or 85.3%. Interpretation Evidently, age is quite useful for predicting price because 85.3% of the variation in the observed prices is explained by the regression of price on age.

38 Table 4.8 Error Sum of Squares To compute SSE, we need the observed prices and the predicted prices. Both quantities are displayed in Table 4.7 and are repeated in the second and third columns of Table 4.8. From the final column, we get the error sum of squares, \(SSE = \sum (y_i - \hat{y}_i)^2\),

39 Error Sum of Squares Continued which is the variation in the observed prices not explained by the regression. Because the regression line is the line that best fits the data according to the least-squares criterion, SSE is also the smallest possible sum of squared errors among all lines.
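The three sums of squares and the coefficient of determination can be computed together, which also makes it easy to see that SST splits into SSR plus SSE for the least-squares line. The data are an assumption (Tables 4.6 to 4.8 are not in this transcript) chosen to match the figures the text quotes. The sketch uses the unrounded fitted line, so intermediate values may differ slightly from a hand calculation based on rounded predictions.

```python
# ASSUMED data, consistent with the mean (88.64) and r^2 (0.853) in the text.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n

# Least-squares slope and intercept (unrounded).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices))
s_xx = sum((x - mean_x) ** 2 for x in ages)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

predicted = [b0 + b1 * x for x in ages]

sst = sum((y - mean_y) ** 2 for y in prices)                    # total variation
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)         # explained
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, predicted))  # unexplained
r_squared = ssr / sst

print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
print(f"r^2 = {r_squared:.3f}")
```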

40 Definition 4.6

41 Understanding the Linear Correlation Coefficient We now discuss some other important properties of the linear correlation coefficient, r. Keep in mind that r measures the strength of the linear relationship between two variables and that the following properties of r are meaningful only when the data points are scattered about a line. r reflects the slope of the scatterplot. The magnitude of r indicates the strength of the linear relationship. The sign of r suggests the type of linear relationship. The sign of r and the sign of the slope of the regression line are identical.
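The last property, that r and the regression slope share a sign, can be checked numerically. Definition 4.6 is not reproduced in this transcript, so the sketch below assumes the standard Pearson formula \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\); the data are likewise an assumption chosen to agree with the figures quoted in the text.

```python
import math

# ASSUMED data, consistent with the slope (-20.26) and r^2 (0.853) in the text.
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices))
s_xx = sum((x - mean_x) ** 2 for x in ages)
s_yy = sum((y - mean_y) ** 2 for y in prices)

r = s_xy / math.sqrt(s_xx * s_yy)   # linear correlation coefficient
slope = s_xy / s_xx                 # least-squares slope b1

print(f"r = {r:.3f}, slope = {slope:.2f}, r^2 = {r * r:.3f}")
```

Both quantities have the numerator \(S_{xy}\) and positive denominators, which is why their signs must agree; squaring r recovers the coefficient of determination.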

42 Figure 4.17 Understanding the Linear Correlation Coefficient To graphically portray the meaning of the linear correlation coefficient, we present various degrees of linear correlation in Fig. 4.17.

