Objective: Understanding and using linear regression. Answer the following questions: (c) If one house is larger in size than another, do you think it affects the price? (d) Take a guess as to how much (on average) each additional 1000 square feet would increase the price. (e) Would you guess that distance explains about 50% of the variability in airfares, about 65% of this variability, or about 85% of this variability?
What is the line that best “fits” or models this data? In other words, what constitutes a “good” line through a scatterplot?
We can model the relationship with a line and give its equation. No line can go through all the points, but it can still be a useful “model.” The best line might not even go through any of the points. We want to find the line that comes closer to all the points than any other line.
Residual. Definition: the difference between the observed value and its associated predicted value. The residual tells us how far off the model’s prediction is at that point. We always subtract the predicted value from the observed one: $e = y - \hat{y}$.
A negative residual means the model made an overestimate. A positive residual means the model made an underestimate.
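The sign convention can be sketched in a few lines of Python. The data points and the model line below are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from the presentation.

```python
# Hypothetical data: house size (thousands of sq ft) and price (thousands of dollars).
sizes = [1.0, 1.5, 2.0, 2.5]
prices = [140.0, 185.0, 260.0, 310.0]


def predict(size):
    """Hypothetical model line, chosen only for illustration."""
    return 10.0 + 120.0 * size


def residual(observed, predicted):
    """Residual = observed value minus predicted value (never the reverse)."""
    return observed - predicted


residuals = [residual(y, predict(x)) for x, y in zip(sizes, prices)]

for x, y, e in zip(sizes, prices, residuals):
    if e > 0:
        label = "model underestimated"
    elif e < 0:
        label = "model overestimated"
    else:
        label = "exact"
    print(f"size={x}: observed={y}, predicted={predict(x)}, residual={e:+.1f} ({label})")
```

A point above the line gets a positive residual and a point below the line gets a negative one.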
When we draw a line through the scatterplot, some residuals are positive and some are negative. If we add up all the residuals, what happens? We faced the same issue when we calculated a standard deviation to measure spread. How do we deal with it? We square the residuals! Since squaring them will make them all positive, we can now sum them. Squaring also emphasizes the larger residuals. When we add up all the squared residuals, that sum indicates how well the line we drew fits the data. Do we want a small or large sum?
LINE OF BEST FIT: the line for which the sum of squared residuals is the smallest. Our line has the property that the variation of data from the model is the smallest it can be for any straight line model for the data. We say that this line “minimizes the sum of squared residuals,” so the best fit line becomes the “least squares” line.
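To make "smallest sum of squared residuals" concrete, here is a minimal sketch that compares the least squares line against a few other candidate lines. The data are made up, and the closed-form slope and intercept formulas are the standard least squares ones, which the slides do not show.

```python
# Hypothetical (x, y) data with a roughly linear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_squared_residuals(intercept, slope):
    """Add up (observed - predicted)^2 over all points for the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line():
    """Standard closed-form least squares intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


b0, b1 = least_squares_line()
best = sum_squared_residuals(b0, b1)

# Other lines someone might draw by eye; each should do no better than `best`.
candidates = [(0.0, 2.0), (0.5, 1.8), (b0, b1 + 0.1), (b0 + 0.2, b1)]
for intercept, slope in candidates:
    total = sum_squared_residuals(intercept, slope)
    print(f"intercept={intercept:.2f}, slope={slope:.2f}: sum of squares={total:.4f}")
print(f"least squares line: sum of squares={best:.4f}")
```

Nudging either the slope or the intercept away from the least squares values makes the sum larger.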
Correlation & the Line What we know about correlation can lead us to the equation of the linear model… Let’s look at scatterplots of standardized variables again: What line would you choose to model the relationship of standardized variables?
The line must go through the point of means, $(\bar{x}, \bar{y})$. So in z-scores, the line must go through the point $(0, 0)$, the origin. The equation of a line that passes through the origin can be written as $y = mx$. We again need to change it to z-scores: $\hat{z}_y = m z_x$, where m is the slope, so moving over one unit in the z-scores of x corresponds to moving up m units in the predicted z-scores of y. Many different lines pass through the origin, each with its own slope. Which one fits our data best? In other words, which slope minimizes the sum of the squared residuals? It turns out that the best choice for m is the correlation coefficient itself, r! So now, we can write: $\hat{z}_y = r z_x$
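The claim that the best slope in z-scores is r can be checked numerically. The sketch below standardizes a small made-up data set, tries a grid of slopes through the origin, and compares the winner with the correlation. The data and the grid spacing are assumptions for illustration only.

```python
import math

# Hypothetical data with a positive but imperfect linear association.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
n = len(xs)


def standardize(values):
    """Convert raw values to z-scores using the sample SD (divisor n - 1)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


zx = standardize(xs)
zy = standardize(ys)

# Correlation coefficient: sum of products of z-scores divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)


def sse(m):
    """Sum of squared residuals for the line z_y_hat = m * z_x."""
    return sum((b - m * a) ** 2 for a, b in zip(zx, zy))


# Try slopes from -1.000 to 1.000 in steps of 0.001 and keep the best one.
slopes = [i / 1000 for i in range(-1000, 1001)]
best_m = min(slopes, key=sse)

print(f"r = {r:.4f}, best slope on the grid = {best_m:.3f}")
```

The best grid slope agrees with r up to the grid spacing.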
What does it tell us? In moving one standard deviation away from the mean in x, we can expect to move about r standard deviations away from the mean in y. In general, moving any number of standard deviations in x moves r times that number of standard deviations in y.
A scatterplot of house prices (in thousands of dollars) vs. house size (in thousands of square feet) shows a relationship that is straight, with only moderate scatter, and no outliers. The correlation is 0.85. If a house is one SD above the mean in size (making it about 2170 sq ft), how many SD above the mean would you predict its sale price to be? What would you predict about the sale price of a house that’s 2 SDs below average in size?
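As a worked sketch of the two questions above, assuming the standardized model in which the predicted z-score of y is r times the z-score of x, with the stated correlation r = 0.85:

```latex
\begin{align*}
z_{\text{size}} = 1:\quad  & \hat{z}_{\text{price}} = r \, z_{\text{size}} = 0.85 \times 1 = 0.85 \\
z_{\text{size}} = -2:\quad & \hat{z}_{\text{price}} = r \, z_{\text{size}} = 0.85 \times (-2) = -1.70
\end{align*}
```

So the predicted price is 0.85 SD above the mean price in the first case and 1.7 SDs below it in the second. In both cases the prediction is closer to the mean, in SD units, than the size was.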
The regression line in real units. If we want to find real values, we don’t always want to convert to z-scores, find the correlation, use the formula for looking at standard deviation changes, and then convert back to the original units. Why not write an equation of the line for our data in the original units? In algebra, you learned that the equation of a line is:
$y = mx + b$. Statisticians use slightly different notation: $\hat{y} = b_0 + b_1 x$, where $b_0$ is the intercept and $b_1$ is the slope.
y: price (in thousands of dollars); x: house size (in thousands of sq. feet). What does the slope mean? What are the units? How much can the homeowner expect the value of his house to increase if he builds an additional 2000 sq feet? How much would you expect to pay for a house of 3000 sq ft?
Calculating a Regression Equation step-by-step. Estimate the costs per person associated with traffic delays. Annual cost per person: mean = 298.96, SD = 180.830. Peak-period freeway speed: mean = 54.34 mph, SD = 4.494 mph. r = -0.90. Find the equation of the regression line and write a sentence interpreting your equation.
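One way to carry out the steps is sketched below. It assumes cost is the response (y) and speed is the predictor (x), and it uses the standard formulas slope = r · SD_y / SD_x and intercept = mean_y − slope · mean_x, which follow from the z-score form of the line. The function name is hypothetical.

```python
def regression_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Return (intercept, slope) of the least squares line from summary statistics.

    slope     = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x   (the line passes through the point of means)
    """
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Summary statistics from the slide: x = peak-period freeway speed (mph),
# y = annual cost per person. Treating cost as the response is an assumption.
b0, b1 = regression_from_summary(
    mean_x=54.34, sd_x=4.494, mean_y=298.96, sd_y=180.830, r=-0.90
)

print(f"cost_hat = {b0:.2f} + ({b1:.2f}) * speed")
```

The slope is negative, so the model predicts lower annual cost per person for each additional mph of peak-period speed, with the size of the decrease given by the slope.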
Summary of Residuals. A common theme in statistical modeling is to think of each data point as being composed of 2 parts: the part that is explained by the model (often called the fit) and the “leftover” part (often called the residual). In the context of least squares regression, the fitted value for an observation is simply the y value that the regression line would predict for the x-value of that observation. The residual is the difference between the actual y value and the fitted value $\hat{y}$. Residual = actual – fitted.
Data = Model + Residual, or Residual = Data – Model. In symbols: $e = y - \hat{y}$. We can do a “residual plot” in the hopes of finding “nothing.”
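The decomposition can be sketched numerically. The example below fits a least squares line to made-up data, splits each observation into its fitted value and residual, and lists the (x, residual) pairs that a residual plot would display. The data and function name are illustrative assumptions.

```python
# Hypothetical (x, y) data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Least squares intercept and slope (standard closed-form formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


b0, b1 = fit_line(xs, ys)

fitted = [b0 + b1 * x for x in xs]                # the "model" part
residuals = [y - f for y, f in zip(ys, fitted)]   # the "leftover" part

# Data = Model + Residual holds for every observation.
for x, y, f, e in zip(xs, ys, fitted, residuals):
    print(f"x={x}: data={y:.2f} = fit {f:.2f} + residual {e:+.2f}")

# For a least squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
print(f"sum of residuals = {sum(residuals):.2e}")
```

Plotting the residuals against x, or against the fitted values, gives the residual plot.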
The residual plot shown offers a good example of what a problem-free plot should look like. There are no odd fan or curved trends in the plot, the average of the residuals is zero, and the points are spread evenly above and below the horizontal axis. Each residual represents the difference between the observed response variable Y and the value predicted by the regression line.
Accounting for Variation. The variation in residuals is the key to assessing how well the model fits. All regression models fall between the two extremes of zero correlation and perfect correlation. Can we gauge where our model falls? Compare a regression model with correlation 0.5 and one with correlation –0.5 in terms of strength of linearity.
Since these two models differ only in direction, we can square the correlation coefficient to get $r^2$. $R^2$ gives us the fraction of the data’s variation accounted for by the model, and $1 - r^2$ is the fraction of the original variation left in the residuals. What does $r^2 = 0$ mean? What does $r^2 = 69\%$ mean?
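The "fraction of variation" reading can be checked directly. The sketch below, on made-up data, computes the fraction of variation left in the residuals and confirms that one minus that fraction matches the squared correlation. The data and variable names are assumptions for illustration.

```python
import math

# Hypothetical (x, y) data with moderate scatter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Correlation coefficient and least squares line.
r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation left over in the residuals.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

fraction_left = sse / syy          # corresponds to 1 - r^2
r_squared = 1 - fraction_left      # fraction accounted for by the model

print(f"r = {r:.4f}, r^2 = {r**2:.4f}")
print(f"fraction accounted for by the model = {r_squared:.4f}")
print(f"fraction left in the residuals = {fraction_left:.4f}")
```

Flipping the sign of every y value would flip the sign of r but leave both fractions unchanged.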
Back to our regression of house price: $\widehat{\text{price}} = 9.564 + 122.72 \cdot \text{size}$. The $R^2$ value is reported as 71.4%. What does this $R^2$ value mean about the relationship of price and size? Is the correlation positive or negative? How do you know? If we measured the size in thousands of square meters instead of thousands of square feet, would the $R^2$ value change? What about the slope?
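The unit-change question can be explored with a small experiment. The sketch below refits the same made-up data after converting size from thousands of square feet to thousands of square meters. The data are hypothetical, not the presentation's data set; the conversion factor 0.09290304 m² per ft² is exact.

```python
SQM_PER_SQFT = 0.09290304  # exact: (0.3048 m)^2


def fit(xs, ys):
    """Return (intercept, slope, r_squared) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical houses: size in thousands of sq ft, price in thousands of dollars.
size_kft2 = [1.2, 1.8, 2.1, 2.6, 3.0]
price = [150.0, 240.0, 250.0, 340.0, 370.0]

# Same houses, size re-expressed in thousands of square meters.
size_km2 = [s * SQM_PER_SQFT for s in size_kft2]

b0_ft, b1_ft, r2_ft = fit(size_kft2, price)
b0_m, b1_m, r2_m = fit(size_km2, price)

print(f"sq ft: slope={b1_ft:.2f}, intercept={b0_ft:.2f}, R^2={r2_ft:.4f}")
print(f"sq m:  slope={b1_m:.2f}, intercept={b0_m:.2f}, R^2={r2_m:.4f}")
```

In this sketch $R^2$ is the same in both fits, because rescaling x does not change the correlation. The slope changes: it is divided by the conversion factor, since each unit of x now represents more area.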