1 Simple Linear Regression Chapter 16. 2 Introduction In this chapter we examine the relationship among interval variables via a mathematical equation.
Published byModified over 4 years ago
Presentation on theme: "1 Simple Linear Regression Chapter 16. 2 Introduction In this chapter we examine the relationship among interval variables via a mathematical equation."— Presentation transcript:
2 Introduction In this chapter we examine the relationship among interval variables via a mathematical equation. The motivation for using the technique: –Forecast the value of a dependent variable (y) from the value of independent variables (x 1, x 2,…x k.). –Analyze the specific relationships between the independent variables and the dependent variable.
3 16.1 Simple Linear Regression Model The model has a deterministic and a probabilistic components
4 House size House Cost Most lots sell for $25,000 Building a house costs about 75$ per square foot. House cost = 25000 + 75(Size) The Deterministic part of the model
5 House cost = 25000 + 75(Size) However, house cost may vary even among same size houses! The Model
6 House cost = 25000 + 75(Size) House size House Cost Most lots sell for $25,000 The Model Since cost behave unpredictably,we add a random component.
7 The Model The simple first order (linear) regression model is: y = dependent variable x = independent variable 0 = y-intercept 1 = slope of the line = error variable x y 00 Run Rise = Rise/Run 0 and 1 are unknown population parameters, therefore are estimated from the data.
8 16.2 The Least Squares Method The estimates are determined by –drawing a sample from the population of interest, –calculating sample statistics. –producing a straight line that cuts into the data. Question: What is the best line to describe the specific linear relationship? x y
9 The Least Squares (Regression) Line A good line is considered one, that minimizes the sum of squared errors. X Actual value of Y Equation value of Y Error
10 The Least Squares (Regression) Line 3 3 4 1 1 4 (1,2) 2 2 (2,4) (3,1.5) (4,3.2) 2.5 Here is a short comparison of two possible lines drawn over 4 data points: 1. Horizontal line 2. Positively increasing line Which line is better?
11 The Least Squares (Regression) Line 3 3 4 1 1 4 (1,2) 2 2 (2,4) (3,1.5) (4,3.2) Sum of squared differences =(2 - 1) 2 +(4 - 2) 2 +(1.5 - 3) 2 +(3.2 - 4) 2 = 6.89 Sum of squared differences =(2 -2.5) 2 +(4 - 2.5) 2 +(1.5 - 2.5) 2 +(3.2 - 2.5) 2 = 3.99 2.5 The smaller the sum of squared differences the better the fit of the line to the data.
12 The Estimated Coefficients To calculate the estimates of 0 and 1 that minimize the differences between the data points and the line, use the formulas shown below (alternative formulae are suggested later): The regression equation that estimates the equation of the first order linear model is:
13 Example 2 –A car dealer wants to find the relationship between the odometer reading and the selling price of 3-year old Tauruses. –A random sample of 100 cars is selected, and the data recorded. –Find the regression line. Dependent variable y Inependent variable x The Simple Linear Regression Line
14 In order to more easily use Excel results, we’ll use another version of the formula for b 1 : COV (the covariance of x and y) is a measure of the common ‘movement’ of x and y (do both generally increase together, or move in opposite directions) The Simple Linear Regression Line cov(x,y) is also denoted by S xy
15 -2.909 Excel provides a population covariance of -2.879. The sample covariance is: -2.879n/(n-1)=-2.879(100/(99) = -2.909 The Simple Linear Regression Line Solution –Solving by hand: Calculate statistics. See data in where n = 100. Car Price
16 Solution – continued –Using the computer: Tools > Data analysis > Regression > [Shade the y range and the x range] > OK The Simple Linear Regression Line Car Price
17 The Simple Linear Regression Line Odometer readingSelling price Car Price
18 Interpreting the Linear Regression - Equation The intercept is b 0 = $17,248. 0 No data The regression equation describes the linear relationship within the range covered by the sample only. Thus, do not interpret the intercept as the “Price of cars that have not been driven” 17025 This is the slope of the line. For each additional mile on the odometer, the price decreases by an average of $0.0669
19 Interpreting the Linear Regression - Equation Remember: The regression equation pertains to the sample only!! To generalize the results by making inference about the population, we are about to apply statistical inference techniques.
20 16.3 Model Assumptions The error is a critical part of the regression model. Four requirements involving must be satisfied. –The probability distribution of is normal. –The mean of is zero: E( ) = 0. –The standard deviation of is for all values of x. –The set of errors associated with different values of y are all independent.
21 The Normality of The mean value of y for a given value of x is E(y|x) = 0 + 1 x + E( ) = 0 + 1 x, since E( ) = 0. 0 + 1 x 1 0 + 1 x 2 0 + 1 x 3 E(y|x 2 ) E(y|x 3 ) x1x1 x2x2 x3x3 E(y|x 1 ) The standard deviation remains constant, but the mean value changes with x Recall: y = 0 + 1 x+ Since 0 + 1 x is deterministic and is normally distributed, y is also normally distributed. The standard deviation of y is for all values of y
22 + + + + + + + + + + + + + + + + + + + + + + + + Notice, that for small values of y, and for large values of y the errors are mostly negative, while for midrange values of y the errors are positive. The errors are not independent. Consequently, linear regression is not the correct model to work with here. One can also question the assumption of ‘The mean error is zero’. The independence of the errors Here is a case where linear regression is not the right model to apply to.
23 16.4 Assessing the Model The least squares method produces a regression line whether or not there is a linear relationship between x and y. Consequently, it is important to assess how well the linear model fits the data (how strong the linear relationship is). Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.
24 SSE is the sum of vertical differences between the points and the regression line. It was the function we minimized when constructing the regression equation. It can serve as a measure of how well the line fits the data. SSE is defined by Sum of Squares for Errors A shortcut formula o xixi yiyi +
25 If is small the errors tend to be close to their mean, and the model fits the data well. Since the mean error is equal to zero, the model fits the data well when is close to zero. Therefore, we can, use as a measure of the suitability of using a linear model. To do this we need to estimate . An unbiased estimator of 2 is given by s 2 Standard Error of Estimate The standard error of estimate is defined by s
26 Testing the Slope Different inputs (x) yield different outputs (y). No linear relationship. Different inputs (x) yield the same output (y). The slope is not equal to zeroThe slope is equal to zero Linear relationship.
27 We can draw inference about 1 from b 1 by testing H 0 : 1 = 0 H 1 : 1 = 0 (or 0) –The test statistic is –If the error variable is normally distributed, the statistic is Student t distribution with d.f. = n-2. The standard error of b 1. where Testing the Slope
28 Example 4 –Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses, in example 2. Use = 5%. –Solution. The alternative hypothesis here (H 1 ) is of the form ‘Not equal to zero’, because we try to show that there is a linear relationship, which can be verified by either positive value or negative value for 1. Testing the Slope, Example
29 Solving by hand –To compute “t” we need the values of b 1 and s b1. –For the two tails rejection region, t > t.025 or t < -t.025 with = n-2 = 98. From the t-table t.025 = 1.984 approximately, Testing the Slope, Example Obtained from Descriptive Statistics in Excel.
30 Using the computer Testing the Slope, Example There is overwhelming evidence to infer that the odometer reading affects the auction selling price.` Car Price
31 –To measure the strength of the linear relationship we use the coefficient of determination. Coefficient of determination R 2 Here the line explains all the variation between the different ‘y’ values. Here the line explains only some of the variation among the ‘y’ values (there are errors).
32 Coefficient of determination The mathematical formulation of R 2 is based on the following characteristic ( stated without a proof): Overall variation in y = The variation of the regression model around the mean + The variation of the actual points around the line That is: SST = SSR + SSE Observe next
33 Coefficient of Determination = [Variation in y explained by the linear relationship] [Total Variation in y Coefficient of Determination R 2
34 Coefficient of determination - insight SST = SSR + SSE When the relationship between x and y is perfectly linear – there are no deviation of the actual points from the line, so SSE = 0. Therefore, SSR = SST and R 2 = SSR/SST = 1 When the relationship between x and y is not perfectly linear – there are deviations of the actual points from the regression line, so SSE > 0. Therefore, SSR < SST, and R 2 = SSR/SST < 1. When no linear relationship exists between x and y – none of the total variation among the actual points is explained by a linear relationship, so SST = SSE. Therefore, R 2 = 0.
35 Example 3 –Find the coefficient of determination for example 2; what does this statistic tell you about the model? Solution –Solving by hand; we use an alternative form of the R 2 formula, that makes it easier to use Excel. Coefficient of determination, Example
36 – Using the computer From the regression output we have Coefficient of determination – Example 65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
37 Coefficient of Correlation The coefficient of correlation is used to measure the strength of association between two variables. The coefficient values range between -1 and 1. –If r = -1 (negative association) or r = +1 (positive association) every point falls on the regression line. –If r = 0 there is no linear pattern. The coefficient can be used to test for linear relationship between two variables.
38 If we are satisfied with how well the model fits the data, we can use it to predict the values of y. To make a prediction we use –Point prediction, and –Interval prediction 16.6 Using the Regression Equation Before using the regression model, we need to assess how well it fits the data.
39 Point Prediction Example 5 –Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 2). –It is predicted that a 40,000 miles car would sell for $14,575. – How close is this prediction to the real price? A point prediction
40 Interval Estimates Two intervals can be used to discover how closely the predicted value will match the true value of y. –Prediction interval – predicts y for a given value of x, –Confidence interval – predicts the average y for a given x. –The confidence interval –The prediction interval
41 Interval Estimates, Example Example 5 - continued –Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer. –Two types of predictions are required: A prediction for a specific car A prediction for the mean price per car
42 Interval Estimates, Example Solution –A prediction interval provides the price estimate for a single car: xgxg t.025,98 Approximately t /2.n-1 ss xgxg b1b1 b0b0
43 Solution – continued –A confidence interval provides the estimate of the mean price per car for a Ford Taurus with 40,000 miles reading on the odometer. The confidence interval (95%) = Interval Estimates, Example