Chapter 8: Simple Linear Regression Yang Zhenlin.


1 Chapter 8: Simple Linear Regression. Yang Zhenlin, zlyang@smu.edu.sg, http://www.mysmu.edu/faculty/zlyang/

2 STAT306, Term II, 09/10 Chapter 8 STAT151, Term I 2015-16 © Zhenlin Yang, SMU. Learning Objectives:
Describing the Relationship between Two Variables
-- Scatter plot
-- Numerical measures
Simple Linear Regression Model
Least Squares Method for Model Estimation
A Measure of Goodness of Fit: R-Square
Inference about the Regression Coefficients
Predictions
-- Predicting the value of a future observation
-- Predicting the mean of future observations

3 Introduction. We are interested in the relationship between two numerical variables X and Y. One of these variables, say X, is known in advance and is called the explanatory variable, or independent variable. The other variable, Y, is a random variable whose values or general random behavior are of interest. For this reason, Y is called the response variable, or dependent variable. If there is a strong relationship between X and Y, one can predict a future value of Y from the known future value of X through this “relationship”. To study the relation, n pairs of observations on (X, Y) are collected, denoted as \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\). The Least Squares Method helps find such a relation.

4 Describing the Relationship. Example 8.1. Prices of used cars and odometer readings. A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected, and the data are recorded. Construct a scatter plot of the data. Scatter diagram: a plot of the pairs of observed values \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) of the variables X and Y. It is a very effective graphical tool for “revealing” the relationship between variables.

5 Describing the Relationship. [Figure: scatter plot of price against odometer reading.] The plot indeed shows a negative linear relation between the price and the odometer reading.

6 Describing the Relationship. Besides the graphical display of the data, some numerical measures, such as the sample covariance and the sample coefficient of correlation, can be used to measure the direction and strength of the linear relationship between two variables. Sample means: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\), \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\). Sample variances: \(s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\), \(s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2\). Sample covariance: \(s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\). Sample correlation coefficient: \(r = \frac{s_{xy}}{s_x s_y}\). This is called the ‘five statistics summary’ of the data.
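
The summary statistics on this slide can be computed directly from their definitions. The following is a minimal sketch, not the course's own code: the function name `five_statistics` and the five toy data pairs are assumptions made for illustration, and the Example 8.1 car data are not reproduced here.

```python
import math

def five_statistics(x, y):
    """Sample means, variances, covariance and correlation of paired data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Divisor n - 1, matching the usual sample variance and covariance.
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    r = s_xy / math.sqrt(s_x2 * s_y2)
    return {"x_bar": x_bar, "y_bar": y_bar, "s_x2": s_x2,
            "s_y2": s_y2, "s_xy": s_xy, "r": r}

# Hypothetical toy data (not the used-car sample).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]
summary = five_statistics(x, y)
print(summary)
```

For these toy values the covariance is negative and r is close to -1, which is the same kind of pattern the slides describe for price against odometer reading.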

7 Describing the Relationship. Example 8.2. Continuing with Example 8.1, find the five statistics summary and comment on the linear relationship between price and odometer reading. Solution: As r = -0.8063, there exists a strong negative linear relation …

8 Describing the Relationship. Sample coefficient of correlation r, on a scale from +1 through 0 to -1. r near +1, or Cov(X, Y) > 0: strong positive linear relationship; the scatter diagram shows a clear upward trend. r = 0, or Cov(X, Y) = 0: no linear relationship; the scatter diagram shows either no pattern or a non-linear pattern. r near -1, or Cov(X, Y) < 0: strong negative linear relationship; the scatter diagram shows a clear downward trend.

9 Simple Linear Regression Model. The simple linear regression model takes the form \(Y = \beta_0 + \beta_1 X + \varepsilon\), where Y = dependent variable, X = independent variable, \(\beta_0\) = y-intercept, \(\beta_1\) = slope of the line (rise/run), and \(\varepsilon\) = error variable. \(\beta_0\) and \(\beta_1\) are unknown population parameters and therefore need to be estimated from the data. The scatter diagram in Example 8.1 shows a general trend: as the odometer reading increases, the price of the used car decreases. The relation is not deterministic, however, as cars with the same odometer reading can have different prices. Thus, price is also affected by some unknown random errors.

10 Simple Linear Regression Model. To learn this theoretical relationship, in particular to estimate the parameters \(\beta_0\) and \(\beta_1\), a random sample of n experimental units is selected, and the values of (X, Y) for each unit are observed to give \((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\). These n pairs of observations satisfy \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\), i = 1, 2, ..., n. As Y is a random variable, so must be \(\varepsilon\). Due to the random sampling mechanism, \(\{Y_i\}\) must be independent, and so are the \(\{\varepsilon_i\}\). Further, it is reasonable to assume that \(E(\varepsilon_i) = 0\), i = 1, 2, ..., n, for if they are not zero, the nonzero constant can be absorbed into \(\beta_0\). Thus, \(E(Y_i) = \beta_0 + \beta_1 X_i\).

11 Least Squares Estimation. Based on the observed data, we seek the line that best fits the data when two variables are related to one another. We define the “best fit line” as the line for which the sum of squared differences between it and the data points is minimized. Different lines generate different errors, and thus different sums of squared errors. There is a line that minimizes the sum of squared errors, and in this sense it is the best line.

12 Least Squares Estimation. Let \(\hat{y} = b_0 + b_1 x\) be a fitted line. Finding the best line, the one that minimizes the sum of squared errors, is equivalent to finding the intercept \(b_0\) and the slope \(b_1\) that minimize \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), where \(y_i\) is the actual Y value of point i and \(\hat{y}_i = b_0 + b_1 x_i\) is the value of point i calculated from the equation. That is, to minimize \(\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2\) over \(b_0\) and \(b_1\).

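13 Least Squares Estimation. Taking partial derivatives of \(\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2\) with respect to \(b_0\) and \(b_1\) and setting them to zero: \(-2\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0\) and \(-2\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0\). This leads to \(\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i\) and \(\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2\). Substituting \(b_0 = \bar{y} - b_1 \bar{x}\) from the first equation into the second gives \(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = b_1 \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\).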

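14 Least Squares Estimation. And the solutions: \(\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), where \(s_{xy}\) is the sample covariance and \(s_x^2\) is the sample variance of X. This gives the least squares equation: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\).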
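
The closed-form least squares solution (slope equal to the sample covariance divided by the sample variance of X, intercept equal to the mean of Y minus slope times the mean of X) can be sketched as below. The function name `least_squares` and the toy data are illustrative assumptions, not taken from the slides. The final lines check numerically that the fitted line satisfies both first-order conditions: the residuals sum to zero and are orthogonal to x.

```python
def least_squares(x, y):
    """Return (b0, b1): least squares intercept and slope for y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x values must not all be equal")
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # the (n - 1) divisors cancel in the ratio
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data (not the used-car sample).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]
b0, b1 = least_squares(x, y)

# Residuals of the fitted line and the two normal-equation checks.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sum_resid = sum(residuals)
sum_x_resid = sum(xi * ei for xi, ei in zip(x, residuals))
print(b0, b1, sum_resid, sum_x_resid)
```

For the toy data the slope is about -1.5 and the intercept about 10.9; both check sums are zero up to rounding error.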

15 Least Squares Estimation. Example 8.3. Continuing with Example 8.2, find the least squares line relating odometer reading to the price of the used car. Solution: The estimated coefficients are \(\hat{\beta}_1 = -0.0623\) and \(\hat{\beta}_0 = 17067\). The least squares equation is \(\hat{y} = 17067 - 0.0623x\). Interpretation of \(\hat{\beta}_1 = -0.0623\): for one additional mile on the odometer, it is estimated that the average price of the cars decreases by $0.0623.

16 Least Squares Estimation: Interpreting the Linear Regression Equation. The estimated slope of the line is -0.0623: for each additional mile on the odometer, the price decreases by an average of $0.0623. The intercept is estimated as $17067. Do not interpret the intercept as the “price of cars that have not been driven”: there are no data near an odometer reading of 0.

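17 Least Squares Estimation: Properties of the Least Squares Estimators. For the simple linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\), where the \(\{\varepsilon_i\}\) are independent with \(E(\varepsilon_i) = 0\), the least squares estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are unbiased estimators of \(\beta_0\) and \(\beta_1\): \(E(\hat{\beta}_0) = \beta_0\) and \(E(\hat{\beta}_1) = \beta_1\). To see this, treat the \(X_i\) as fixed and write \(\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}) Y_i}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\). Since \(E(Y_i) = \beta_0 + \beta_1 X_i\), \(\sum_{i=1}^{n} (X_i - \bar{X}) = 0\) and \(\sum_{i=1}^{n} (X_i - \bar{X}) X_i = \sum_{i=1}^{n} (X_i - \bar{X})^2\), we have \(E(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(\beta_0 + \beta_1 X_i)}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \beta_1\), and \(E(\hat{\beta}_0) = E(\bar{Y}) - \beta_1 \bar{X} = \beta_0 + \beta_1 \bar{X} - \beta_1 \bar{X} = \beta_0\). More on white board in class.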
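
Unbiasedness can also be illustrated by simulation: generate many samples from a model with known coefficients, fit each by least squares, and average the estimates. The sketch below is only an illustration under stated assumptions. The true values (intercept 1, slope 2), the error standard deviation, the fixed x values, the number of replications and the seed are all arbitrary choices, and normal errors are used purely for convenience since the result only needs zero-mean errors.

```python
import random

def least_squares(x, y):
    """Return (b0, b1): least squares intercept and slope for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def average_estimates(beta0, beta1, sigma, x, reps, seed=0):
    """Average the least squares estimates over `reps` simulated samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total_b0 = 0.0
    total_b1 = 0.0
    for _ in range(reps):
        # Simulate Y_i = beta0 + beta1 * x_i + error, with zero-mean errors.
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        b0, b1 = least_squares(x, y)
        total_b0 += b0
        total_b1 += b1
    return total_b0 / reps, total_b1 / reps

# Hypothetical design: x fixed at 0, 1, ..., 9 in every replication.
x_fixed = [float(i) for i in range(10)]
avg_b0, avg_b1 = average_estimates(1.0, 2.0, 1.0, x_fixed, reps=2000)
print(avg_b0, avg_b1)
```

The two averages should land close to the true values 1 and 2. Individual estimates vary from sample to sample, but they are centred on the parameters.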

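18 Measure of Goodness of Fit: Sum of Squares due to Errors (SSE). This is the sum of squared differences between the points and the regression line. It can serve as a measure of how well the line fits the data. SSE is defined by \(SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), where \(\hat{y}_i\) is the fitted value at \(x_i\). A shortcut formula: \(SSE = (n-1)\left(s_y^2 - \frac{s_{xy}^2}{s_x^2}\right)\), where \(s_x^2\) and \(s_y^2\) are the sample variances and \(s_{xy}\) is the sample covariance.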

19 Measure of Goodness of Fit: Coefficient of Determination \(R^2\). It is a measure of the strength of the linear relationship between the response Y and the explanatory variable(s) X, and is defined as \(R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\), or \(R^2 = \frac{s_{xy}^2}{s_x^2 s_y^2}\). The first definition is a general one and applies to linear regression models with multiple predictors. It simplifies to the second definition when there is only one predictor X. In the case of simple linear regression, \(R^2\) is also the square of the sample correlation coefficient r.

20 Measure of Goodness of Fit. To understand the significance of the coefficient of determination, note: \(\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\), that is, SST = SSR + SSE. SST: total variation (sum of squares) in Y; SSR: sum of squares due to regression; SSE: sum of squares due to error. It follows that \(R^2 = 1 - SSE/SST = SSR/SST\). \(R^2\) measures the proportion of the variation in Y that is explained by the variation in X, or by the model. \(R^2\) takes a value between zero and one. \(R^2 = 1\): perfect match between the line and the data points. \(R^2 = 0\): there is no linear relationship between X and Y.
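
The decomposition of the total sum of squares and the resulting R-square can be checked numerically. The sketch below uses hypothetical toy data and a helper name, `sums_of_squares`, chosen for illustration. It verifies that SST equals SSR plus SSE for a least squares fit and that 1 - SSE/SST agrees with the squared sample correlation.

```python
import math

def sums_of_squares(x, y):
    """Fit y on x by least squares and return (SST, SSR, SSE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    return sst, ssr, sse

# Hypothetical toy data (not the used-car sample).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]
sst, ssr, sse = sums_of_squares(x, y)
r_squared = 1.0 - sse / sst

# Squared sample correlation, for comparison with R-square.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)
print(sst, ssr, sse, r_squared, r ** 2)
```

For the toy data, SST is about 23.7, SSR about 22.5 and SSE about 1.2, so roughly 95% of the variation in y is explained by the line.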

21 Inferences for the Model. Error Variable: Required Conditions. The error \(\varepsilon\) is a critical part of the regression model. For formal statistical inferences for the model, four requirements involving the distribution of \(\varepsilon\) must be satisfied. (1) The probability distribution of \(\varepsilon\) is normal. (2) The mean of \(\varepsilon\) is zero: \(E(\varepsilon) = 0\). (3) The standard deviation of \(\varepsilon\) is \(\sigma_\varepsilon\) for all values of X. (4) The errors associated with different observations on Y are all independent. It follows that the response Y is normally distributed with mean \(E(Y) = \beta_0 + \beta_1 X\) and standard deviation \(\sigma_\varepsilon\), and that the random sample of n observations \(\{Y_1, Y_2, \ldots, Y_n\}\) made on Y are independent.

22 Inferences for the Model. [Figure: normal error distributions with common standard deviation \(\sigma_\varepsilon\), centered at \(E(y|x_1) = \beta_0 + \beta_1 x_1\), \(E(y|x_2) = \beta_0 + \beta_1 x_2\) and \(E(y|x_3) = \beta_0 + \beta_1 x_3\).] Normality of \(\varepsilon\): the standard deviation remains constant, but the mean value changes with x. Changing the X value increases (or decreases, if \(\beta_1 < 0\)) the mean of Y, but does not change the shape of its distribution.

23 Inferences for the Model: Estimate of Error Standard Deviation. The mean error is equal to zero. If \(\sigma_\varepsilon\) is small, the errors tend to be close to zero (close to the mean error), and the model then fits the data well. Therefore, we can also use \(\sigma_\varepsilon\) as a measure of the suitability of using a linear model. However, \(\sigma_\varepsilon\) is unknown and has to be estimated. As SSE is the sum of squared errors, it leads naturally to an estimator of \(\sigma_\varepsilon^2\): \(s_\varepsilon^2 = \frac{SSE}{n-2}\). It can be shown that \(E(s_\varepsilon^2) = \sigma_\varepsilon^2\).
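
As a sketch of the estimate described on this slide, the code below computes the error standard deviation estimate as the square root of SSE divided by n - 2. The divisor n - 2 is assumed here to match the n - 2 degrees of freedom used in Example 8.5. The function name `error_std_estimate` and the toy data are hypothetical.

```python
import math

def error_std_estimate(x, y):
    """Return s_eps = sqrt(SSE / (n - 2)) for the least squares fit of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Two degrees of freedom are used up by estimating b0 and b1.
    return math.sqrt(sse / (n - 2))

# Hypothetical toy data (not the used-car sample).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]
s_eps = error_std_estimate(x, y)
print(s_eps)
```

With these five points SSE is about 1.2, so the estimate is the square root of about 0.4.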

24 Inferences for the Model. Example 8.4. Calculate the estimate of the error standard deviation and the coefficient of determination for Example 8.1, and describe what they tell you about the model fit. Solution: It is hard to assess the model based on \(s_\varepsilon\), even when compared with the mean value of Y (calculated earlier).

25 Inferences for the Model. 65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model. Some Theoretical Results. If the errors \(\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}\) are independent and identically distributed as \(N(0, \sigma_\varepsilon^2)\), then we have (a) (b) (c)

26 Inferences for the Model: Testing the Slope. We can draw inference about \(\beta_1\) from \(\hat{\beta}_1\) by testing \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\) (or \(< 0\), or \(> 0\)). The implication of this test is clear: if \(H_0\) is rejected, one can conclude that there is sufficient evidence to show that Y and X are linearly related; otherwise, there is insufficient evidence of a linear relationship. The same question can be answered by constructing a confidence interval for \(\beta_1\). From the theoretical result given earlier and the results presented in Chapter 5b regarding the t-distribution, it is immediate to see that \(t = \frac{\hat{\beta}_1 - \beta_1}{s_\varepsilon / \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \sim t_{n-2}\). This is a statistic for testing the slope parameter or constructing a confidence interval for it.

27 Inferences for the Model. A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is given as \(\hat{\beta}_1 \pm t_{n-2}(\alpha/2)\, \frac{s_\varepsilon}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}\). Apparently, the quantity \(s_\varepsilon / \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\) is an estimate of the standard deviation of \(\hat{\beta}_1\), and is thus referred to as the estimated standard error of \(\hat{\beta}_1\). Inference concerning the intercept parameter \(\beta_0\) can be carried out in a similar manner, but it is not as interesting or important as inference for the slope parameter \(\beta_1\).

28 Inferences for the Model. Example 8.5. Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 8.4. Use \(\alpha = 5\%\). Solution: \(H_0: \beta_1 = 0\) vs \(H_1: \beta_1 \neq 0\). With \(\nu = n - 2 = 98\) degrees of freedom, the rejection region is \(t > t_{98}(.025)\) or \(t < -t_{98}(.025)\), where \(t_{98}(.025) \approx 1.984\). As \(t = -13.49 < -1.984\), reject \(H_0\) at the 5% level of significance. Yes, there is enough evidence to … A 95% CI for \(\beta_1\):
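
The slope test and confidence interval can be sketched in code as follows. This is an illustration under assumptions rather than a reproduction of Example 8.5: the data are hypothetical, the function name `slope_inference` is invented, and the critical value is passed in by hand from a t table because the Python standard library has no t-distribution quantile function. For the toy data n = 5, so the table value for 3 degrees of freedom at upper tail area .025, about 3.182, is used.

```python
import math

def slope_inference(x, y, t_crit):
    """Two-sided test of H0: beta1 = 0 and a confidence interval for beta1.

    t_crit is the upper alpha/2 critical value of the t distribution with
    n - 2 degrees of freedom, looked up from a table by the caller.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_eps = math.sqrt(sse / (n - 2))
    se_b1 = s_eps / math.sqrt(s_xx)   # estimated standard error of the slope
    t_stat = b1 / se_b1               # test statistic under H0: beta1 = 0
    reject = abs(t_stat) > t_crit
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return t_stat, reject, ci

# Hypothetical toy data (not the used-car sample).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]
T_CRIT_DF3 = 3.182  # table value: t with 3 degrees of freedom, tail area .025
t_stat, reject, ci = slope_inference(x, y, T_CRIT_DF3)
print(t_stat, reject, ci)
```

For the toy data the statistic is about -7.5, which falls in the rejection region, and the 95% interval for the slope excludes zero, so the test and the interval lead to the same conclusion.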

29 Predictions. Before using the regression model, we need to assess how well it fits the data. If we are satisfied with how well the model fits the data, we can use it to predict a future value \(Y_0\), or the mean of \(Y_0\), based on the future value \(X_0\). This is in fact an important application of a regression model. The simple linear regression model can be easily extended to include more predictor variables. For example, in the examples presented, the price of a used car is affected not only by its odometer reading but also by its ‘age’, color, etc. These constitute important topics in an advanced course: Applied Regression Methods (STAT312). The end. Thank you.
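
As a small illustration of prediction, the sketch below plugs an odometer reading into the least squares equation from Example 8.3 (intercept 17067, slope -0.0623). The function name `predict_price` and the reading of 40,000 miles are hypothetical choices. The same plug-in value serves as the point prediction for a single future price and as the estimate of the mean price at that reading. The two cases differ in their interval estimates, which are not shown because the transcript gives no formulas for them.

```python
# Coefficients of the least squares equation reported in Example 8.3.
INTERCEPT = 17067.0
SLOPE = -0.0623

def predict_price(odometer_miles):
    """Point prediction of price (in dollars) at a given odometer reading."""
    if odometer_miles < 0:
        raise ValueError("odometer reading must be non-negative")
    return INTERCEPT + SLOPE * odometer_miles

# Hypothetical future odometer reading.
x0 = 40000
predicted = predict_price(x0)
print(round(predicted, 2))
```

Such a prediction is only meaningful for odometer readings within the range covered by the sample, for the same reason the intercept should not be read as the price of an undriven car.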

