3 To find a linear model given two data points, we find the equation of the line that passes through them. However, we often have more than two data points, and they will rarely all lie on a single straight line, although they may come close to doing so. The problem is then to find the line that comes closest to passing through all of the points.
4 Linear Regression Suppose, for example, that we are conducting research for a company interested in expanding into Mexico. Of interest to us would be current and projected growth in that country’s economy. The following table shows past and projected per capita gross domestic product (GDP) of Mexico for 2000–2014.
5 Linear Regression A plot of these data (Figure 27(a)) suggests roughly linear growth of the GDP: the points show a roughly linear relationship between t and y, although they clearly do not all lie on a single straight line.
6 Linear Regression Figure 27(b) shows the points together with several lines, some fitting better than others. Can we precisely measure which lines fit better than others? For instance, which of the two lines labeled as “good” fits in Figure 27(b) models the data more accurately? Figure 27(b)
7 Linear Regression We begin by considering, for each value of t, the difference between the actual GDP (the observed value) and the GDP predicted by a linear equation (the predicted value). The difference between the observed value and the predicted value is called the residual: Residual = Observed Value – Predicted Value
8 Linear Regression On the graph, the residuals measure the vertical distances between the (observed) data points and the line (Figure 28) and they tell us how far the linear model is from predicting the actual GDP. Figure 28
9 Linear Regression The more accurate our model, the smaller the residuals should be. We can combine all the residuals into a single measure of accuracy by adding their squares. (We square the residuals in part to make them all positive.) The sum of the squares of the residuals is called the sum-of-squares error, SSE. Smaller values of SSE indicate more accurate models.
10 Linear Regression Observed and Predicted Values Suppose we are given a collection of data points $(x_1, y_1), \ldots, (x_n, y_n)$. The n quantities $y_1, y_2, \ldots, y_n$ are called the observed y-values. If we model these data with a linear equation $\hat{y} = mx + b$ (ŷ stands for “estimated y” or “predicted y”), then the y-values we get by substituting the given x-values into the equation are called the predicted y-values:
\[
\begin{aligned}
\hat{y}_1 &= mx_1 + b && \text{(substitute } x_1 \text{ for } x\text{)} \\
\hat{y}_2 &= mx_2 + b && \text{(substitute } x_2 \text{ for } x\text{)} \\
&\;\;\vdots \\
\hat{y}_n &= mx_n + b && \text{(substitute } x_n \text{ for } x\text{)}
\end{aligned}
\]
11 Linear Regression Quick Example Consider the three data points (0, 2), (2, 5), and (3, 6). The observed y-values are $y_1 = 2$, $y_2 = 5$, and $y_3 = 6$. If we model these data with the equation $\hat{y} = x + 2.5$, then the predicted values are:
\[
\begin{aligned}
\hat{y}_1 &= x_1 + 2.5 = 0 + 2.5 = 2.5 \\
\hat{y}_2 &= x_2 + 2.5 = 2 + 2.5 = 4.5 \\
\hat{y}_3 &= x_3 + 2.5 = 3 + 2.5 = 5.5
\end{aligned}
\]
12 Linear Regression Residuals and Sum-of-Squares Error (SSE) If we model a collection of data $(x_1, y_1), \ldots, (x_n, y_n)$ with a linear equation $\hat{y} = mx + b$, then the residuals are the n quantities (Observed Value – Predicted Value):
\[ (y_1 - \hat{y}_1),\ (y_2 - \hat{y}_2),\ \ldots,\ (y_n - \hat{y}_n). \]
The sum-of-squares error (SSE) is the sum of the squares of the residuals:
\[ \mathrm{SSE} = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2. \]
13 Linear Regression Quick Example For the data and linear approximation given above, the residuals are:
\[
\begin{aligned}
y_1 - \hat{y}_1 &= 2 - 2.5 = -0.5 \\
y_2 - \hat{y}_2 &= 5 - 4.5 = 0.5 \\
y_3 - \hat{y}_3 &= 6 - 5.5 = 0.5
\end{aligned}
\]
and so
\[ \mathrm{SSE} = (-0.5)^2 + (0.5)^2 + (0.5)^2 = 0.75. \]
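The Quick Example can be checked with a few lines of Python. This is only a minimal sketch: the function names `predict`, `residuals`, and `sse` are illustrative choices, not part of the original slides. The data points and the model ŷ = x + 2.5 are the ones from the Quick Example.

```python
# Quick-example data from the slides: (0, 2), (2, 5), (3, 6)
xs = [0, 2, 3]
ys = [2, 5, 6]

def predict(x, m, b):
    """Predicted value from the linear model y-hat = m*x + b."""
    return m * x + b

def residuals(xs, ys, m, b):
    """Residual = observed value - predicted value, one per data point."""
    return [y - predict(x, m, b) for x, y in zip(xs, ys)]

def sse(xs, ys, m, b):
    """Sum-of-squares error: the sum of the squared residuals."""
    return sum(r ** 2 for r in residuals(xs, ys, m, b))

# Model from the Quick Example: y-hat = x + 2.5, so m = 1 and b = 2.5
res = residuals(xs, ys, 1.0, 2.5)
total = sse(xs, ys, 1.0, 2.5)
print(res)    # [-0.5, 0.5, 0.5]
print(total)  # 0.75
```

To compare two candidate lines for the same data, call `sse` once per line and prefer the smaller result.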
14 Example 1 – Computing SSE Using the data above on the GDP in Mexico, compute SSE for the linear models y = 0.5t + 8 and y = 0.25t + 9. Which model is the better fit? Solution: We begin by creating a table showing the values of t, the observed (given) values of y, and the values predicted by the first model.
15 Example 1 – Solution We now add two new columns for the residuals and their squares. SSE, the sum of the squares of the residuals, is then the sum of the entries in the last column, SSE = 8.
16 Example 1 – Solution Repeating the process using the second model, y = 0.25t + 9, yields the following table: This time, SSE = 2, and so the second model is a better fit.
17 Example 1 – Solution Figure 29 shows the data points and the two linear models in question.
18 Linear Regression Among all possible lines, there ought to be one with the least possible value of SSE—that is, the greatest possible accuracy as a model. The line (and there is only one such line) that minimizes the sum of the squares of the residuals is called the regression line, the least-squares line, or the best-fit line. To find the regression line, we need a way to find values of m and b that give the smallest possible value of SSE.
19 Linear Regression Regression Line The regression line (least-squares line, best-fit line) associated with the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is the line that gives the minimum sum-of-squares error (SSE).
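20 Linear Regression The regression line is y = mx + b, where m and b are computed as follows:
\[
m = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2},
\qquad
b = \frac{\sum y - m\left(\sum x\right)}{n},
\]
where n = number of data points. The quantities m and b are called the regression coefficients.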
21 Linear Regression Here, “Σ” means “the sum of.” Thus, for example,
$\sum x$ = Sum of the x-values = $x_1 + x_2 + \cdots + x_n$
$\sum xy$ = Sum of products = $x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$
$\sum x^2$ = Sum of the squares of the x-values = $x_1^2 + x_2^2 + \cdots + x_n^2$.
On the other hand, $\left(\sum x\right)^2$ = Square of $\sum x$ = Square of the sum of the x-values.
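As a rough illustration, the regression coefficients can be computed directly from these sums. The sketch below assumes the standard least-squares formulas m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = (Σy − mΣx) / n. The function name `regression_coefficients` is hypothetical, and the data are the three Quick Example points, reused here purely for illustration.

```python
def regression_coefficients(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b.

    Assumes the standard formulas
        m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx**2)
        b = (Sy - m*Sx) / n
    where Sx, Sy, Sxy, Sx2 are the sums of x, y, x*y, and x**2.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)   # sum of the squares
    denom = n * sum_x2 - sum_x ** 2   # note: sum_x ** 2 is the square of the sum
    if denom == 0:
        raise ValueError("all x-values are equal; the slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - m * sum_x) / n
    return m, b

# Quick-example points (0, 2), (2, 5), (3, 6)
m, b = regression_coefficients([0, 2, 3], [2, 5, 6])
print(round(m, 4), round(b, 4))  # 1.3571 2.0714
```

For these three points the sums are Σx = 5, Σy = 13, Σxy = 28, and Σx² = 13, so m = 19/14 and b = 29/14. The resulting SSE is 1/14 ≈ 0.071, smaller than the 0.75 obtained for ŷ = x + 2.5.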
22 Coefficient of Correlation
23 Coefficient of Correlation If the data points do not all lie on one straight line, we would like to be able to measure how closely they can be approximated by a straight line. We know that SSE measures the sum of the squares of the deviations from the regression line; therefore it constitutes a measurement of what is called “goodness of fit.” (For instance, if SSE = 0, then all the points lie on a straight line.) However, SSE depends on the units we use to measure y, and also on the number of data points (the more data points we use, the larger SSE tends to be).
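The dependence of SSE on units can be seen in a small sketch. It reuses the Quick Example points and the model ŷ = x + 2.5. The factor of 100 is a hypothetical unit change chosen only for illustration, and the helper name `sse` is likewise an assumption.

```python
xs = [0, 2, 3]
ys = [2, 5, 6]

def sse(xs, ys, m, b):
    """Sum of squared residuals for the model y-hat = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

scale = 100  # hypothetical change of units for y (illustrative only)

# Same data and same line, expressed in the original and the rescaled units
sse_original = sse(xs, ys, 1.0, 2.5)
sse_rescaled = sse(xs, [scale * y for y in ys], scale * 1.0, scale * 2.5)

print(sse_original)  # 0.75
print(sse_rescaled)  # 7500.0
```

The line fits the data exactly as well in both cases, yet SSE grows by a factor of 100² = 10,000. This is why SSE alone cannot be used to compare fits across different data sets.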
24 Coefficient of Correlation Thus, while we can (and do) use SSE to compare the goodness of fit of two lines to the same data, we cannot use it to compare the goodness of fit of one line to one set of data with that of another to a different set of data. To remove this dependency, statisticians have found a related quantity that can be used to compare the goodness of fit of lines to different sets of data. This quantity, called the coefficient of correlation or correlation coefficient, and usually denoted r, is between –1 and 1. The closer r is to –1 or 1, the better the fit.
25 Coefficient of Correlation For an exact fit, we would have r = –1 (for a line with negative slope) or r = 1 (for a line with positive slope). For a bad fit, we would have r close to 0. Figure 31 shows several collections of data points with least squares lines and the corresponding values of r. Figure 31
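26 Coefficient of Correlation Correlation Coefficient The coefficient of correlation of the n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is
\[
r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\left(\sum x^2\right) - \left(\sum x\right)^2}\;\sqrt{n\left(\sum y^2\right) - \left(\sum y\right)^2}}.
\]
It measures how closely the data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ fit the regression line. (The value $r^2$ is sometimes called the coefficient of determination.)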
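A minimal sketch of this computation follows, assuming the standard formula r = (nΣxy − ΣxΣy) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²)). The function name `correlation_coefficient` is hypothetical, and the three Quick Example points are reused as illustrative data rather than the GDP table.

```python
import math

def correlation_coefficient(xs, ys):
    """Coefficient of correlation r for paired data.

    Assumes the standard formula
        r = (n*Sxy - Sx*Sy) / (sqrt(n*Sx2 - Sx**2) * sqrt(n*Sy2 - Sy**2)).
    The denominator is zero if all x-values or all y-values are equal,
    in which case r is undefined and a ZeroDivisionError is raised.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Quick-example points (0, 2), (2, 5), (3, 6)
r = correlation_coefficient([0, 2, 3], [2, 5, 6])
print(round(r, 4))  # 0.9959
```

Here r = 19/√364, which is positive and close to 1, consistent with a regression line of positive slope that passes near all three points.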
27 Coefficient of Correlation Interpretation If r is positive, the regression line has positive slope; if r is negative, the regression line has negative slope. If r = 1 or –1, then all the data points lie exactly on the regression line; if it is close to ±1, then all the data points are close to the regression line. On the other hand, if r is not close to ±1, then the data points are not close to the regression line, so the fit is not a good one. As a general rule of thumb, a value of | r | less than around 0.8 indicates a poor fit of the data to the regression line.
28 Example 3 – Computing the Coefficient of Correlation Use the following table, which shows past and projected per capita gross domestic product (GDP) of Mexico for 2000–2014, to find the correlation coefficient. Is the regression line a good fit?
29 Example 3 – Solution The formula for r requires $\sum x$, $\sum x^2$, $\sum xy$, $\sum y$, and $\sum y^2$. Let’s organize our work in the form of a table, where the original data are entered in the first two columns and the bottom row contains the column sums.
30 Example 3 – Solution Substituting these values into the formula we get As r is close to 1, the fit is a fairly good one; that is, the original points lie nearly along a straight line. cont’d