1 VI. Regression Analysis
A. Simple Linear Regression
1. Scatter Plots
Regression analysis is best taught via an example. Pencil lead is a ceramic material. Like other ceramic materials, the porosity of the pencil lead is closely tied to the strength of the body: the lower the porosity, the higher the strength.

2 The following data are the porosities and strengths for a popular #2 pencil lead.

3 First, consider a scatter plot of these data. As expected, the plot suggests that as the porosity gets smaller, the strength gets larger (a negative relationship). We also see that the relationship is not perfect (the data do not form a perfect straight line). The data exhibit some “noise” or variability around the possible linear relationship.

4 2. The Simple Linear Regression Model
Suppose we believe that, at least as a first approximation, there is a strictly linear relationship between strength and porosity. What is the appropriate model?

y_i = β0 + β1·x_i + ε_i

where y_i is the response, in this case the strength of the i-th pencil lead; x_i is the predictor or regressor, in this case the porosity of the i-th pencil lead; β0 is the y-intercept; β1 is the slope (in our case, we expect β1 to be negative); and ε_i is a random error.

5 We usually assume that the random errors are independent and that they all have an expected value of 0 and variance σ². With these assumptions,

E(y_i) = β0 + β1·x_i,

which is a straight line. The statistical model represents the approximate relationship between y_i, the response of interest, and x_i, the regressor. By knowing β0 and β1, we know the relationship between y and x. If β1 < 0, then there is a negative relationship between y_i and x_i. If β1 > 0, then there is a positive relationship. If β1 = 0, then there is no relationship!
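To make the model concrete, one can simulate responses from a known straight line plus random error. This is a sketch with illustrative parameter values (β0 = 10, β1 = −0.3, σ = 0.1 are assumptions for the demonstration, not the pencil-lead values):

```python
import random

# Simulate y_i = beta0 + beta1*x_i + eps_i, where eps_i ~ N(0, sigma^2).
# The parameter values here are illustrative assumptions, not the example's.
random.seed(0)
beta0, beta1, sigma = 10.0, -0.3, 0.1

x = [0.05 * i for i in range(1, 11)]    # ten regressor (porosity-like) values
y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
```

A scatter plot of such simulated (x, y) pairs shows exactly the pattern described above: an underlying negative line with "noise" around it.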

6 Problem: Do we ever know β0 or β1? How should we choose our estimates for β0 and β1? Since E(y_i) = β0 + β1·x_i is a straight line, we should choose our estimates to produce the "best" line through the data. Note: There are an infinite number of possible lines. How shall we define the "best" line?

7 3. Least Squares Estimation
Consider an estimated relationship between y and x given by

ŷ_i = b0 + b1·x_i.

Note: ŷ_i is an estimate or prediction of y_i; b0 is an estimate of the y-intercept; b1 is an estimate of the slope.

8 One possible line through our scatter plot is the following.

9 Consider the difference between each actual observation y_i and its predicted value ŷ_i. We usually call this difference the i-th residual, e_i; thus,

e_i = y_i − ŷ_i.

For a good estimated line, all of the residuals should be "small". Thus, one possible measure of how good our estimated line is would be the sum Σ e_i. Problem: for some data points e_i < 0, and for others e_i > 0. A poor fit where e_i is much less than 0 can be compensated for by another very poor fit where e_i is much greater than 0. Thus, Σ e_i is not a particularly good measure.

10 A better measure is

SS_res = Σ e_i² = Σ (y_i − ŷ_i)²,

which we call the sum of squares for the residuals. Our best estimated line, then, is the one which minimizes SS_res. Therefore we wish to choose b0 and b1 such that SS_res is minimized. What values of b0 and b1 accomplish this? We need to return to basic calculus.

11 Setting the partial derivatives of SS_res with respect to b0 and b1 equal to 0 yields the normal equations:

∂SS_res/∂b0 = −2 Σ (y_i − b0 − b1·x_i) = 0
∂SS_res/∂b1 = −2 Σ x_i·(y_i − b0 − b1·x_i) = 0

The first equation gives b0 = ȳ − b1·x̄.

12 Substituting b0 = ȳ − b1·x̄ into SS_res and minimizing with respect to b1, we obtain

b1 = SS_xy / SS_xx,

13 where

SS_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i·y_i − (Σ x_i)(Σ y_i)/n

and

SS_xx = Σ (x_i − x̄)² = Σ x_i² − (Σ x_i)²/n.
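These formulas are straightforward to compute directly. The sketch below applies them to a small hypothetical porosity/strength data set (the actual pencil-lead data are not reproduced in this transcript):

```python
# Least-squares slope and intercept from the SS_xy / SS_xx formulas,
# using hypothetical (x, y) data, not the pencil-lead measurements.
x = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]   # porosity (illustrative values)
y = [9.8, 9.6, 9.7, 9.3, 9.4, 9.1]         # strength (illustrative values)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ss_xx = sum((xi - xbar) ** 2 for xi in x)

b1 = ss_xy / ss_xx        # estimated slope
b0 = ybar - b1 * xbar     # estimated intercept
```

A quick check of the fit: the residuals from the least-squares line sum to zero and are uncorrelated with x, exactly as the normal equations require.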

14 For our data, SS_xx = 0.585 and SS_xy = −0.155.

15 Thus,

b1 = SS_xy / SS_xx = −0.155 / 0.585 = −0.265.

16 Therefore, our prediction equation is ŷ = b0 − 0.265·x. The following is a plot of this prediction equation through the actual data.

17 4. Hypothesis Tests for β1
Usually, the most important question to be addressed by regression analysis is whether the slope is 0. Our approach for answering this question: a hypothesis test.
1. State the hypotheses: H0: β1 = 0 versus H1: β1 ≠ 0 (or a one-sided alternative, β1 < 0 or β1 > 0). Note: Most statistical software packages assume H1: β1 ≠ 0.
2. State the test statistic. To obtain the test statistic, we need to understand the distribution of b1.

18 If the random errors, the ε's, follow a normal distribution with mean 0 and variance σ², then b1 also follows a normal distribution with a mean of β1 and a variance of σ²/SS_xx. Note: b1 is an unbiased estimator of β1. Problem: Do we know σ²? Of course not. Therefore, to develop an appropriate test statistic, we first need to develop an appropriate estimate for σ².

19 We shall use

MS_res = SS_res / (n − 2).

Why use the denominator n − 2? The denominator for our variance estimators is always the appropriate degrees of freedom (df). The appropriate degrees of freedom are always given by

df = (number of observations used) − (number of parameter estimates required).

20 Look at SS_res. Note: We must estimate both β0 and β1 to calculate SS_res; thus, df = n − 2. To compute MS_res, we need a better way to find SS_res. The definitional formula for SS_res is

SS_res = Σ (y_i − ŷ_i)².

21 The computational formula is given by

SS_res = SS_total − b1·SS_xy,

where

SS_total = Σ (y_i − ȳ)² = Σ y_i² − (Σ y_i)²/n.

Thus, an appropriate estimate of the variance of b1 is given by MS_res/SS_xx. An appropriate test statistic is

t = b1 / √(MS_res/SS_xx).

This t statistic has n − 2 degrees of freedom.
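The t statistic can be assembled directly from the summary quantities reported for this example (SS_total = 0.159, b1 = −0.265, SS_xx = 0.585, n = 10):

```python
import math

# t statistic for H0: beta1 = 0, built from the example's summary values.
n, b1 = 10, -0.265
ss_total, ss_xx = 0.159, 0.585
ss_xy = b1 * ss_xx                  # SS_xy recovered from b1 = SS_xy / SS_xx

ss_res = ss_total - b1 * ss_xy      # computational formula for SS_res
ms_res = ss_res / (n - 2)           # estimate of sigma^2 with n - 2 df
se_b1 = math.sqrt(ms_res / ss_xx)   # estimated standard error of b1

t_stat = b1 / se_b1                 # about -1.67
```

Note how each line mirrors one formula from the derivation above; MS_res comes out near 0.0147, matching the value used later in the ANOVA table.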

22 We may also express this test statistic as

t = b1 / se(b1),

where se(b1) = √(MS_res/SS_xx) is the estimated standard error of b1. Most software packages report b1 and se(b1), which allows the analyst to compute the t statistic from the output.
3. State the rejection region. Rejection regions are just as before. Thus, if H1: β1 < 0, we reject H0 if t < −t_{α, n−2}. If H1: β1 > 0, we reject H0 if t > t_{α, n−2}. Steps 4 and 5 are the same as before.

23 Return to our example: Perform the appropriate test for β1 using a .05 significance level.
1. State the hypotheses: H0: β1 = 0 versus H1: β1 < 0.
2. State the test statistic. In this case, the test statistic is t = b1 / √(MS_res/SS_xx).
3. State the critical region. In this case, we reject H0 if t < −t_{0.05, n−2}.

24 Since n = 10, we reject H0 if t < −t_{0.05, 8} = −1.860.
4. Conduct the experiment and calculate the test statistic.

25 t = b1 / √(MS_res/SS_xx) = −0.265 / √(0.01474/0.585) = −1.67.

26 5. Reach conclusions. Since t = −1.67 is not less than −1.860, we must fail to reject H0. Thus, we have insufficient evidence to establish that there is a negative linear relationship between porosity and strength. Since we failed to reject the claim that there is no relationship, we are not required to calculate a confidence interval for β1. In general, however, we can construct a (1 − α)·100% confidence interval for β1 by

b1 ± t_{α/2, n−2} · se(b1).

27 For our specific case, a 95% confidence interval for β1 is

−0.265 ± (2.306)(0.159) ≈ (−0.63, 0.10),

which clearly includes 0.
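The interval can be verified numerically from the same summary values, with the critical value t_{0.025, 8} = 2.306 taken from a t table:

```python
import math

# 95% confidence interval for beta1 from the example's summary values.
n, b1, ss_total, ss_xx = 10, -0.265, 0.159, 0.585
ms_res = (ss_total - b1 * (b1 * ss_xx)) / (n - 2)   # MS_res via SS_res = SS_total - b1*SS_xy
se_b1 = math.sqrt(ms_res / ss_xx)

t_crit = 2.306                       # t_{0.025, 8} from a t table
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1          # interval is roughly (-0.63, 0.10)
```

Because the interval straddles 0, it agrees with the failure to reject H0 in the hypothesis test.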

28 5. The Coefficient of Determination, R²
We can partition the total variability in the data, SS_total, into two components: SS_reg, the sum of squares due to the regression model, and SS_res, the sum of squares due to the residuals. We define SS_reg by

SS_reg = SS_total − SS_res.

SS_reg represents the variability in the data explained by our model. SS_res represents the variability unexplained and presumed due to error.

29 Note: If our model fits the data well, then SS_reg should be "large", and SS_res should be near 0. On the other hand, if the model does not fit the data well, then SS_reg should be near 0, and SS_res should be large. One reasonable measure of the overall performance of our model is the coefficient of determination, R², given by

R² = SS_reg / SS_total.

It can be shown that 0 ≤ R² ≤ 1.

30 Note: If the fit is good, SS_res is near 0 and R² is near 1. If the fit is poor, SS_res is large and R² is near 0. A problem with R²: what defines a good value? The answer depends upon the application area. Typically, in many engineering problems, R² > .9; however, there are some very "noisy" systems where a good R² is .20.

31 We generally use the computational formula to compute SS_reg, which is

SS_reg = b1·SS_xy = b1²·SS_xx.

For our example: SS_total = 0.159, b1 = −0.265, and SS_xx = 0.585. Thus,

SS_reg = (−0.265)²(0.585) = 0.041

and

R² = 0.041 / 0.159 = 0.258,

which is rather poor and confirms our hypothesis test.
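This calculation is a one-liner once the sums of squares are in hand:

```python
# Coefficient of determination from the example's summary values.
ss_total, b1, ss_xx = 0.159, -0.265, 0.585

ss_reg = b1 ** 2 * ss_xx        # computational formula: b1^2 * SS_xx
r_squared = ss_reg / ss_total   # about 0.26 -- a rather poor fit
```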

32 6. The Overall F-Test
This procedure focuses purely on whether some relationship, either positive or negative, exists between the response and the regressor. Consequently, it is inherently a two-sided procedure. In general, this test evaluates the overall adequacy of the model. For simple linear regression, this test reduces to a two-sided test for the slope, in which case our hypotheses are

H0: β1 = 0 versus H1: β1 ≠ 0.

In multiple regression, this test simultaneously evaluates all of the slopes.

33 Our test statistic is based on MS_reg, which is defined by

MS_reg = SS_reg / df_reg,

where df_reg = the number of regressors. In the case of simple linear regression, df_reg = 1. Our test statistic is

F = MS_reg / MS_res.

The degrees of freedom for the test statistic are 1 for the numerator and n − 2 for the denominator. One way to view this F statistic is as a signal-to-noise ratio: MS_reg is a standardized measure of what the model explains (the signal), and MS_res is a standardized measure of the error (the noise).

34 Since we have only one possible alternative hypothesis, we always reject H0 if F > F_{α, df_reg, df_res}. In our case, we reject the null hypothesis if F > F_{0.05, 1, 8} = 5.32. For our example, SS_reg = 0.041 and df_reg = 1. Thus, MS_reg = 0.041/1 = 0.041.

35 Since MS_res = 0.01474, our test statistic is

F = 0.041 / 0.01474 = 2.782.

Apart from rounding errors, this value for the F statistic is the square of the value for the t statistic we used to test the slope originally. We typically use the following analysis of variance (ANOVA) table to summarize the calculations for this test.

Source      Degrees of Freedom   Sum of Squares   Mean Square   F
Regression  df_reg               SS_reg           MS_reg        F
Residual    df_res               SS_res           MS_res
Total       n − 1                SS_total

36 For our specific situation, the ANOVA table is the following.

Source      Degrees of Freedom   Sum of Squares   Mean Square   F
Regression  1                    0.041            0.041         2.782
Residual    8                    0.118            0.01474
Total       9                    0.159

Source refers to our partition of the total variability into two components: one for the regression model and the other for the residual or error. For simple linear regression, the degrees of freedom for the model are (number of parameters) − 1 = 2 − 1 = 1.

37 The degrees of freedom for the residuals for this particular situation are (number of observations) − (number of parameters) = n − 2 = 10 − 2 = 8. We obtain the mean squares by dividing the appropriate sum of squares by the corresponding degrees of freedom. We calculate the F statistic by dividing the mean square for regression by the mean square for the residuals. Since 2.782 < 5.32, we cannot reject the null hypothesis that porosity and strength are unrelated.
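The ANOVA quantities chain together as the table shows; a quick numerical check using the table's values:

```python
# Overall F test assembled from the ANOVA table for this example.
n = 10
ss_reg, df_reg = 0.041, 1
ss_res, df_res = 0.118, n - 2

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res
f_stat = ms_reg / ms_res        # about 2.78, versus F(0.05; 1, 8) = 5.32
```

Since f_stat falls below the critical value 5.32, the computation reproduces the conclusion above: fail to reject H0.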

38 7. Reading a Computer Generated Analysis Repeat the analysis of the porosity and strength data using the software package of your choice. Highlight all of the tests we did by hand. If your software allows you to calculate confidence and prediction bands, then do so.

39 C. Multiple Linear Regression
Choose either one of the exercises from the book that you do not assign for homework or a data set of your own to illustrate multiple linear regression. Do all of the analysis within the software package of your choice. Continually emphasize that multiple linear regression is a straightforward extension of simple linear regression. We spent a lot of time in simple linear regression laying the necessary foundations for multiple linear regression. Once students understand reading the computer output for simple linear regression, they can pick up very quickly how to read the output for multiple linear regression.

40 The following questions should guide your discussion of the example:
1. What is our model and how should we estimate it?
2. What is the overall adequacy of our model?
3. Which specific regressors seem important?
Once we begin residual analysis, we shall add the following questions: Is the model reasonably correct? How well do our data meet the assumptions required for our analysis?

41 Highlight the following:
1. The model:

y_i = β0 + β1·x_i1 + β2·x_i2 + … + βk·x_ik + ε_i

where y_i is the i-th response; x_ij is the i-th value for the j-th regressor; k is the number of regressors; β0 is the y-intercept; βj is the coefficient associated with the j-th regressor; and ε_i is a random error with mean 0 and constant variance σ².

42 Again, emphasize the extension from simple linear regression. Be sure to emphasize that we can no longer call the β j 's slopes. The β j 's represent the expected change in y given a one unit change in x j if we hold all of the other regressors constant.

43 2. We estimate the model using least squares. The estimated model is

ŷ_i = b0 + b1·x_i1 + b2·x_i2 + … + bk·x_ik,

where ŷ_i is the predicted response, b0 is the estimated y-intercept, and bj is the estimated coefficient for the j-th regressor. In multiple regression, we once again find the b's which minimize

SS_res = Σ (y_i − ŷ_i)².

In this course, we always let computer software packages perform the estimation.
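For instructors who want to show what the package is doing, here is a minimal least-squares sketch with two regressors on synthetic data (the true coefficients and variable names are assumptions for illustration, not an exercise from the book):

```python
import numpy as np

# Fit y = b0 + b1*x1 + b2*x2 by least squares on synthetic data.
rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0.0, 0.1, n)   # assumed true model

X = np.column_stack([np.ones(n), x1, x2])    # design matrix with intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # b = [b0, b1, b2]

y_hat = X @ b
residuals = y - y_hat
```

As in simple linear regression, the least-squares residuals are orthogonal to every column of the design matrix, which is the multiple-regression version of the normal equations.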

44 3. We determine the overall adequacy of the model in two ways. First, the multiple coefficient of determination, R², given by

R² = SS_reg / SS_total,

where SS_reg = SS_total − SS_res, which is the same way that we defined the coefficient of determination for simple linear regression. Second, the overall F test, which tests the hypotheses

H0: β1 = β2 = … = βk = 0 versus H1: at least one βj ≠ 0.

45 The test statistic is based on MS_reg and MS_res, which are defined just as they were for simple linear regression; thus,

MS_reg = SS_reg / df_reg,

where df_reg is the number of regressors (k), and

MS_res = SS_res / df_res,

where df_res is the number of observations (n) minus the number of parameters estimated (k + 1); thus, df_res = n − k − 1. The test statistic is

F = MS_reg / MS_res

and has k numerator degrees of freedom and n − k − 1 denominator degrees of freedom.

46 4. We determine whether a specific regressor makes a significant contribution through t tests of the form

H0: βj = 0 versus H1: βj ≠ 0.

The test statistics have the form

t = bj / se(bj).

It is important to note that these tests actually are tests on the contribution of the specific regressor given that all of the other regressors are in the model. Thus, these tests do not determine whether the specific regressor is important in isolation from the effects of the other regressors. This point emphasizes why we cannot call the β's slopes! Again, we let computer software packages perform this task.

47 D. Residual Analysis Continue the multiple regression analysis by checking the model assumptions using proper residual analysis. Again, do all of the calculations and plotting using the software package of choice. If you have the option, use the studentized residuals rather than the raw or standardized residuals.

48 Some key points to emphasize are:
- a plot of the studentized residuals against the predicted values, which checks the constant variance assumption and model misspecification,
- a plot of the studentized residuals against the regressors, which also checks the constant variance assumption and model misspecification,
- a plot of the studentized residuals in time order, which checks the independence assumption,
- a stem-and-leaf display of the residuals or of the studentized residuals, which checks the well-behaved distribution assumption, and
- a normal probability plot of the studentized residuals, which also checks the well-behaved distribution assumption.
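If a package only reports raw residuals, studentized residuals can be computed from the hat matrix. This sketch uses synthetic straight-line data for illustration:

```python
import numpy as np

# Internally studentized residuals r_i = e_i / sqrt(MS_res * (1 - h_ii)),
# where h_ii are the diagonal elements of the hat matrix H = X (X'X)^{-1} X'.
rng = np.random.default_rng(1)
n = 25
x = rng.uniform(0.0, 1.0, n)
y = 1.0 - 0.5 * x + rng.normal(0.0, 0.2, n)   # synthetic data (assumed model)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
e = y - H @ y                                 # raw residuals
p = X.shape[1]                                # parameters estimated (2)

ms_res = e @ e / (n - p)
r = e / np.sqrt(ms_res * (1.0 - np.diag(H)))  # studentized residuals to plot
```

These r values are what the plots above should use; unlike raw residuals, they have roughly unit variance, so values beyond about ±2 or ±3 stand out on the same scale in every plot.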

