Introduction to Linear Regression

 You have seen how to find the equation of a line that connects two points.

 Often, we have more than two data points, and usually the data points do not all lie on a single line.

 It is possible to find the equation of a line that most closely fits a set of data points. Such a line is called a regression line or a linear regression equation.

 Our goal here is to learn what a regression line is. You can then watch the presentation on how to find the equation of a regression line on Excel.

 Consider the following table, which shows the average price (in millions of dollars) of a two-bedroom apartment in downtown New York City from 1994 to 2004, where t=0 represents 1994.

 We can plot each of these data points on a graph. Each point is of the form (t, p), so we have 6 points to plot.

 They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), and (10, 1.60). Just looking at them like this doesn’t give much indication of a pattern, although we can see that the p-values are increasing as t increases.

 When we plot the points all together on a set of axes, we get the following scatter plot:

 It seems that the data do follow a somewhat linear pattern.

 We can find the line that most closely fits the data and graph it over the data points.

 Notice that the line does not go through all of the data points.

 We can also find the equation of this “line of best fit”.

 We can also get what’s called the correlation coefficient.

 You will be able to do all of this on Excel once you watch the instructional video and read the PDFs for this material. For now, we just want to get an idea of what the regression line is and what the correlation coefficient tells us about the regression equation.

 What does the regression equation tell us about the relationship between time and sale price?

 The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.

 In this case, the p-intercept tells us what the sale price is predicted to be when t=0 (that is, in the year 1994).

 The regression equation is p=0.1264t+0.2229. Recall that price is in millions of dollars.

 Thus, if t=0, the regression equation predicts a price of \$0.2229 million or \$222,900.

 According to the table, the actual price was \$0.38 million or \$380,000. These values don’t have to be the same, however, since the regression equation can’t match every point exactly. It is only a model that most closely fits the data points.

 What does the slope of the regression equation tell us?

 The slope of our regression equation is 0.1264.

 We can always write a number x as x divided by 1, so we can write this slope as $\frac{0.1264}{1}$.

 Recall that the definition of slope is $\frac{\text{change in } y}{\text{change in } x}$.

 In this case we are using p and t, so it’s $\frac{\text{change in } p}{\text{change in } t}$.

 So for our problem, we have $\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.

 We can interpret this to mean that when t increases by 1, we can expect that p will increase by 0.1264.

 For this problem, t is measured in years and p is measured in millions of dollars.

 So more specifically, the slope can be interpreted to mean that if t increases by 1 year, the model predicts that the average price p of a two-bedroom apartment will increase by about \$0.1264 million, or \$126,400.

 Even more plainly, we can say that the model predicts that the average price of a two-bedroom apartment in New York City will increase by about \$126,400 per year.

 We can now use the linear regression model to predict future prices. For example, if we wanted to predict what the price of an apartment was in 2008, we could plug in 14 for t in the regression equation (since t=0 is 1994).

 Plugging in 14 for t into the regression equation gives p=0.1264(14)+0.2229=1.9925.

 This means that if the trend continued, we can expect that the price of a two-bedroom apartment was around \$1,992,500 in 2008.

 You can also use the regression equation to check how closely the model matches the actual price in some years that were given on the table. For example, for 2000 the equation predicts a price of p=0.1264(6)+0.2229=0.9813, or \$981,300.

 According to the table, the actual price was \$950,000, so the regression equation is pretty close.
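 The two calculations above (the 2008 forecast and the 2000 check) can be sketched in a few lines of Python. This is only an illustration and not part of the course materials: the function name `predict_price` and the constant names are our own labels, and the coefficients are the rounded values quoted in the regression equation.

```python
# Regression equation from the presentation: p = 0.1264 t + 0.2229,
# where t is years since 1994 and p is in millions of dollars.
SLOPE = 0.1264       # rounded slope quoted in the presentation
INTERCEPT = 0.2229   # rounded p-intercept quoted in the presentation


def predict_price(year: int) -> float:
    """Return the model's predicted price (millions of dollars) for a calendar year."""
    t = year - 1994  # t = 0 corresponds to 1994
    return SLOPE * t + INTERCEPT


price_2008 = predict_price(2008)  # outside the table: a forecast
price_2000 = predict_price(2000)  # inside the table: a check against real data
actual_2000 = 0.95                # table value for t = 6
gap_2000 = price_2000 - actual_2000  # how far the model is from the table value

print(f"2008 prediction: {price_2008:.4f} million")
print(f"2000 prediction: {price_2000:.4f} million (table: {actual_2000:.2f})")
print(f"2000 gap:        {gap_2000:.4f} million")
```

 The gap for 2000 is about 0.03 million, which is the sense in which the model is "pretty close" for that year.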

 It is important to remember that the regression equation is just a model, and it won’t give the exact values.

 If the equation is a good fit to the data, however, it will give a good approximation, so it can be used to forecast what may happen in the future if the current trend continues.

 Next, let’s take a quick look at how a regression equation is derived, and then take a look at what the correlation coefficient (or the r-squared value on Excel) tells us about the regression equation.

 Let’s take another look at the data points and the regression line.

 Why does this particular line give the best “fit” for the data? Why not some other line?

 It has to do with what is called a residual.

 A residual is the difference between a particular data point and the regression line.

 If we zoom in on a particular data point, we can see what a residual is.

 Let’s zoom in on this particular data point.

 Zooming into this box:

 We see the data point and the line.

 The vertical distance between the line and the data point is the residual.

 The idea behind linear regression is to keep the residuals as small as possible.
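 To make the idea of a residual concrete, here is a small Python sketch that computes one residual per data point for the apartment data. It assumes the usual sign convention (actual value minus predicted value), which the presentation does not state, and it uses the rounded regression equation p=0.1264t+0.2229. The names `data`, `predicted`, and `residuals` are ours.

```python
# (t, p) pairs from the table: t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t: float) -> float:
    """Value on the regression line at time t (rounded coefficients)."""
    return 0.1264 * t + 0.2229


# Residual = actual p minus the p predicted by the line (assumed sign convention).
# A positive residual means the point lies above the line, negative means below.
residuals = [p - predicted(t) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predicted(t):.4f}  residual={res:+.4f}")
```

 Some residuals come out positive and some negative, which matches the picture: the line passes below some points and above others.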

 There is a method that allows us to minimize the sum of the squares of all of the residuals.

 This is called the least-squares method. You can read about it in the PDF for linear regression.

 Since these formulas can get fairly complicated, you will not be required to use them in the course.

 You will only need to know how to find a regression line using Excel. You can watch the video on how to do this, or read through the PDF, or both.
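 For the curious, the following Python sketch shows what the least-squares method does with our six data points. It is not required for the course. It assumes the standard textbook formulas for a line with one input variable (we have not checked them against the course PDF), and the comparison line through the first and last points is a hypothetical alternative chosen only for illustration.

```python
# (t, p) pairs from the table: t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Standard least-squares formulas for a line p = slope * t + intercept.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean


def sum_squared_residuals(m: float, b: float) -> float:
    """Sum of squared vertical gaps between the data and the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in data)


best = sum_squared_residuals(slope, intercept)

# Hypothetical alternative: the line through the first and last data points.
alt_slope = (1.60 - 0.38) / (10 - 0)
alt_intercept = 0.38
other = sum_squared_residuals(alt_slope, alt_intercept)

print(f"least-squares line: p = {slope:.4f} t + {intercept:.4f}")
print(f"sum of squared residuals, least-squares line: {best:.4f}")
print(f"sum of squared residuals, alternative line:   {other:.4f}")
```

 Rounded to four decimal places, the computed slope and intercept agree with the regression equation p=0.1264t+0.2229, and the alternative line has a larger sum of squared residuals, which is why it is not the "best fit".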

 Next, we look at what the correlation coefficient tells us about the regression equation.

 Recall that in our graph, a number was given, called the correlation coefficient, denoted by the letter r.

 The correlation coefficient tells us how closely the regression line “fits” the data points.

 It has a value between -1 and 1. A value very close to 1 indicates a very good fit with a positive sloping linear function.

 A value very close to -1 indicates a very good fit with a negative sloping linear function.

 A value very close to 0 indicates a very poor fit with the data, so there is little or no linear relationship between the variables in this case.
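 As an illustration, the correlation coefficient for the apartment data can be computed directly. The sketch below assumes the standard formula for r (sometimes called Pearson's r); the presentation itself does not show a formula, and the variable names are ours.

```python
import math

# (t, p) pairs from the table: t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Sums of products of deviations from the means.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)
s_pp = sum((p - p_mean) ** 2 for _, p in data)

# Assumed (standard) formula for the correlation coefficient.
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r = {r:.4f}")
```

 For this data set r is positive and close to 1, which agrees with what we saw on the scatter plot: the points rise in a roughly linear way.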

 Excel will not give the value of r; instead, it gives the value of r squared.

 The r-squared value basically tells us the same thing, but it will only be between 0 and 1.

 If the r-squared value is close to 1, there is a very good linear fit for the data points.

 If the r-squared value is close to 0, there is a very poor linear fit for the data points.
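 Here is a rough Python sketch of where an r-squared value can come from, using the apartment data. It assumes one common definition, 1 minus (sum of squared residuals) divided by (total squared variation of p around its mean); we have not confirmed that this is exactly how Excel computes the number it displays. For a least-squares line with one input variable, this definition gives the same number as squaring r.

```python
# (t, p) pairs from the table: t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Least-squares line (standard formulas), so the fit is not affected by rounding.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

# Variation left over after fitting the line, and total variation in p.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
ss_tot = sum((p - p_mean) ** 2 for _, p in data)

# Assumed definition of r squared.
r_squared = 1 - ss_res / ss_tot

print(f"r squared = {r_squared:.4f}")
```

 The result is close to 1, so by the rule above the line is a good fit for these six points.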

 We will now look at some examples of what it looks like with an r-squared value close to 1 and with an r-squared value close to 0.

 Consider the following set of data points.

 They follow a clear linear pattern, so we should expect the r-squared value to be close to 1.

 And it is.

 Now consider the following set of data points.

 These points seem to be scattered everywhere and don’t follow any linear pattern.

 We expect the r-squared value to be close to 0.

 And it is.

 So, to summarize, a linear regression equation is a line that most closely fits a given set of data points.

 The regression equation can be used to predict future values, or values that are outside of the given data range.

 We can find a regression equation for any set of data points, no matter how scattered the data look, but we can tell how closely the data follow a linear pattern by looking at the r-squared value.

 An r-squared value close to 1 indicates a very good fit to the given data, and an r-squared value close to zero indicates a very poor fit to the data.

 The topic of linear regression is very deep, and we have only given a very brief introduction to it here.

 You can read more about it in the PDF given on the Assigned Reading for section 1.4.

 Be sure you also watch the video about how to find a linear regression on Excel! You can find the video link in the Assigned Reading for section 1.4.
