Session 6 Applied Regression -- Prof. Juran.


1 Session 6 Applied Regression Prof. Juran

2 Outline
Residual Analysis
- Are they normal?
- Do they have a common variance?
Multicollinearity
Autocorrelation (serial correlation)

3 Residual Analysis
Assumptions about regression models:
- The Form of the Model
- The Residual Errors
- The Predictor Variables
- The Data

4 Regression Assumptions
Recall the assumptions about regression models.
The Form of the Model: the relationship between Y and each X is assumed to be linear.

5 The Residual Errors
- The residuals are normally distributed.
- The residuals have a mean of zero.
- The residuals have the same variance.
- The residuals are independent of each other.

6 The Predictor Variables
- The X variables are nonrandom (i.e., fixed or selected in advance). This assumption is rarely true in business regression analysis.
- The data are measured without error. This assumption is rarely true in business regression analysis.
- The X variables are linearly independent of each other (uncorrelated, or orthogonal). This assumption is rarely true in business regression analysis.

7 The Data
The observations are equally reliable and have equal weights in determining the regression model.

8 Because many of these assumptions center on the residuals, we need to spend some time studying the residuals in our model to assess the degree to which these assumptions are valid.

9 Example: Anscombe's Quartet
Here are four bivariate data sets, devised by F. J. Anscombe.
Anscombe, F. J. (1973), "Graphs in Statistical Analysis," The American Statistician, 27, 17-21.

10-14 [Figures: the Anscombe quartet data sets and their scatterplots]

15 Three observations:
- These data sets are clearly different from each other.
- The differences would not be made obvious by any descriptive statistics or summary regression statistics.
- We need tools to identify characteristics such as those which differentiate these four data sets.
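To make this concrete, here is a minimal Python sketch (an illustration, assuming the seaborn package, which bundles a copy of Anscombe's data) showing that all four data sets produce essentially the same summary and regression statistics:

# Sketch: summary statistics hide the differences in Anscombe's quartet.
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, grp in df.groupby("dataset"):
    slope, intercept = np.polyfit(grp["x"], grp["y"], 1)
    print(f"Set {name}: mean_x = {grp['x'].mean():.2f}, "
          f"mean_y = {grp['y'].mean():.2f}, "
          f"r = {grp['x'].corr(grp['y']):.3f}, "
          f"fitted line: y = {intercept:.2f} + {slope:.3f}x")

# All four sets print essentially the same numbers (y ≈ 3.00 + 0.500x,
# r ≈ 0.816); only the plots reveal how different they are.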

16 The differences can be detected in the different ways that these data sets violate the basic regression assumptions regarding residual errors.

17 Assumption: The residuals have a mean of zero.
This assumption is not likely to be a problem, because the regression procedure ensures that it will hold unless there is a serious skewness problem.

18 Assumption: The residuals are normally distributed.
We can check this with a number of methods. We might plot a histogram of the residuals to see if they "look" reasonably normal. For this purpose we might want to "standardize" the residuals, so that their values can be compared with our expectations in terms of the standard error.
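As an illustration, a minimal Python sketch of such a histogram (with made-up data standing in for a real model's residuals):

# Sketch: histogram of residuals scaled by the standard error of the
# regression, to judge informally whether they "look" normal.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 + 0.5 * x + rng.normal(0, 1, 100)    # illustrative data

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
s = np.sqrt(resid @ resid / (len(x) - 2))  # standard error of the regression

plt.hist(resid / s, bins=15, edgecolor="black")
plt.xlabel("residual / standard error")
plt.show()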

19 Standardized Residuals
In order to judge whether residuals are outliers or have an inordinate impact on the regression, they are commonly standardized. The variance of the i-th residual e_i, perhaps surprisingly, is not σ², though this is in many examples a reasonable approximation.

20 The correct variance is
Var(e_i) = σ²(1 − h_ii),
where h_ii is the i-th diagonal element of the hat matrix X(XᵀX)⁻¹Xᵀ (the leverage of observation i). One way to go, therefore, is to calculate the so-called standardized residual for each observation:
r_i = e_i / ( σ̂ √(1 − h_ii) )
Alternatively, we could use the so-called studentized residuals:
r_i* = e_i / ( σ̂_(i) √(1 − h_ii) ),
where σ̂_(i) is the standard error estimated with observation i left out.

21 These are both measures of how far individual observations are from their predicted values, and large values of either are signals of concern. Excel (and any other statistics package) produces standardized residuals on command.
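A minimal Python sketch of both calculations, using the hat-matrix formulas above (the identity used for the studentized version avoids refitting the model n times):

# Sketch: standardized and studentized residuals via the hat matrix.
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, n)
y = 3 + 0.5 * x + rng.normal(0, 1, n)          # illustrative data

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                               # raw residuals
p = X.shape[1]                                 # number of parameters

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages h_ii
sigma2 = e @ e / (n - p)                       # sigma-hat squared

standardized = e / np.sqrt(sigma2 * (1 - h))
# Studentized (observation i omitted), via the standard identity:
studentized = standardized * np.sqrt((n - p - 1) / (n - p - standardized**2))

print(np.round(standardized[:5], 3))
print(np.round(studentized[:5], 3))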

22 [Figure: regression output with standardized residuals]

23 Another way to assess normality is to use a normal probability plot, which graphs the distribution of residuals against what we would expect to see from a standard normal distribution. The normal score is calculated using the following procedure:
1. Order the observations in increasing order of their residual errors.
2. Calculate a quantile, which basically measures what proportion of the data lie below each observation.
3. Calculate the normal score, a measure of where we would expect the quantiles to be if we drew a sample of this size from a perfect standard normal distribution.
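A minimal Python sketch of the three steps (Blom's plotting position (i − 3/8)/(n + 1/4) is one common choice of quantile rule; software packages differ slightly):

# Sketch: normal scores and a normal probability plot of residuals.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

e = np.array([-1.2, 0.3, 0.8, -0.5, 1.9, -0.1, 0.4, -2.2, 0.6, 0.0])

e_sorted = np.sort(e)             # Step 1: order the residuals
n = len(e_sorted)
i = np.arange(1, n + 1)
q = (i - 0.375) / (n + 0.25)      # Step 2: quantiles (plotting positions)
scores = stats.norm.ppf(q)        # Step 3: normal scores

plt.scatter(scores, e_sorted)     # roughly a straight line if normal
plt.xlabel("normal score")
plt.ylabel("ordered residual")
plt.show()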

24-26 [Figures: normal probability plot output]

27 Trouble!
- Excel gives us a normal probability plot for the dependent variable, not the residuals.
- We have never assumed that Y is normally distributed.
- Another reason to switch to Minitab, SAS, SPSS, etc.

28 [Figure]

29 Assumption: The residuals have the same variance.
One way to check this is to plot the actual values of Y against the predicted values. In the case of simple regression, this is a lot like plotting them against the X variable.

30 [Figure: actual values of Y plotted against predicted values]

31 Another method is to plot the residuals against the predicted value of Y (or the actual observed value of Y, or, in simple regression, against the X variable):
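A minimal Python sketch of this residual-versus-fitted plot (with made-up data):

# Sketch: residuals against fitted values; a funnel or a curve is a warning.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3 + 0.5 * x + rng.normal(0, 1, 100)   # illustrative data

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

plt.scatter(fitted, y - fitted)
plt.axhline(0, color="gray")
plt.xlabel("fitted value")
plt.ylabel("residual")
plt.show()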

32 Collinearity
Collinearity (also called multicollinearity) is the situation in which one or more of the predictor variables are nearly a linear combination of other predictors. (The opposite condition, in which all independent variables are more or less independent, is called orthogonality.)

33 In the extreme case of exact dependence, the XᵀX matrix cannot be inverted and the regression procedure will fail. In less extreme cases, we suffer from several possible problems:
- The independent variables are not "independent." We can't talk about the slope coefficients in terms of the effects of one variable on Y "all other things held constant," because changes in one of the X variables are associated with expected changes in the other X variables.
- The slope coefficient values can be very sensitive to changes in the data, and/or to which other independent variables are included in the model.
- Forecasting problems: large standard errors for all parameters, uncertainty about whether true relationships have been detected, and uncertainty about the stability of the correlation structure.

34 Sources of Collinearity
Data collection method. Some combination of X variable values does not exist in the data. Example: say that we did the tool wear case without ever trying the Type A machine at low speed or the Type B machine at high speed. Collinearity here is the result of the experimental design.

35 Sources of Collinearity
Constraints on the model or in the population. Some combination of X variable values does not exist in the population. Example: in the cigarette data, imagine if the states with a high proportion of high school graduates also had a high proportion of black citizens. Collinearity here is the result of attributes of the population.

36 Sources of Collinearity
Model specification. Adding or including variables that are tightly correlated with other variables already in the model. Example: in a study to predict the profitability of TV programs, we might include both the Nielsen rating and the Nielsen share. Collinearity here is the result of including multiple variables that contain more or less the same information.

37 Sources of Collinearity
Over-definition. We may have a relatively small number of observations but a large number of independent variables for each. Collinearity here is the result of too few degrees of freedom: n − p − 1 is small (or, in the extreme, negative) because p is large compared with n.

38 Detecting Collinearity
First, be aware of the potential problem, and be vigilant. Second, check the various combinations of independent variables for obvious evidence of collinearity. This might include pairwise correlation analysis, or even regressing each independent variable against all of the others; a high R-square coefficient would be a sign of trouble.
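A minimal Python sketch of both checks (illustrative data in which x2 is deliberately built to be nearly collinear with x1):

# Sketch: pairwise correlations, plus the R-square from regressing each
# predictor on all of the others.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr().round(2))                   # pairwise correlation matrix

for col in X.columns:                      # auxiliary regressions
    A = np.column_stack([np.ones(n), X.drop(columns=col).to_numpy()])
    beta, *_ = np.linalg.lstsq(A, X[col].to_numpy(), rcond=None)
    resid = X[col].to_numpy() - A @ beta
    r2 = 1 - (resid @ resid) / ((X[col] - X[col].mean()) ** 2).sum()
    print(f"R-square of {col} on the others: {r2:.3f}")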

39 Detecting Collinearity
Third, after a regression model has been estimated, watch for these clues:
- Large changes in a coefficient as an independent variable is added or removed.
- Large changes in a coefficient as an observation is added or removed.
- Inappropriate signs or magnitudes of an estimated coefficient as compared to common sense or prior expectations.
The Variance Inflation Factor (VIF) is one measure of collinearity's impact.

40 Variance Inflation Factor
For predictor X_j,
VIF_j = 1 / (1 − R_j²),
where R_j² is the R-square from regressing X_j on all of the other predictors. A VIF near 1 indicates little collinearity; a common rule of thumb flags VIF > 10 as serious.
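A minimal Python sketch (assuming the statsmodels package, whose variance_inflation_factor helper implements exactly this formula):

# Sketch: VIF_j = 1 / (1 - R_j^2) for each predictor.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])  # include the constant column

for j, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(X, j), 1))
# x1 and x2 should show very large VIFs here.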

41 Countermeasures
- Design as much orthogonality into the data as you can. You may improve a pre-existing situation by collecting additional data, as orthogonally as possible.
- Exclude variables from the model that you know are correlated.
- Principal Components Analysis: basically, creating a small set of new independent variables, each of which is a linear combination of the larger set of original independent variables (Ch. 9.5, RABE); see the sketch below.
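A minimal sketch of the principal-components idea (assuming scikit-learn; this only builds the new predictors, which would then replace the originals in the regression):

# Sketch: replace correlated predictors with uncorrelated components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

pca = PCA(n_components=2)                  # 2 components instead of 3 predictors
Z = pca.fit_transform(X)                   # new, orthogonal predictors
print(pca.explained_variance_ratio_)       # variation carried by each component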

42 Countermeasures
In some cases, rescaling and centering the data can diminish the collinearity. For example, we can translate each observation into a z-statistic (by subtracting the mean and dividing by the standard deviation).
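For example, in Python:

# Sketch: center and rescale a predictor into z-statistics.
import numpy as np

x = np.array([12.0, 15.0, 11.0, 20.0, 17.0])
z = (x - x.mean()) / x.std(ddof=1)   # subtract the mean, divide by the s.d.
print(z.round(3))                    # mean 0, standard deviation 1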

43 Collinearity in the Supervisor Data

44-46 [Figures: regression output for the supervisor data]

47 Cars

48 New model: Dependent variable is Volkswagen.

49 Reduced Volkswagen model:

50 There are 42 cars in the data set, 29 of which are represented by the dummy variables above. Of the 13 remaining, 5 are VWs.

51 Some Minitab Output
Regression Analysis: MSRP versus MPG City, HP, Trunk, Warranty, Audi, Chevrolet, ...
The following terms cannot be estimated and were removed: Saturn, Volkswagen, AWD.
Analysis of Variance table, with columns Source, DF, Adj SS, Adj MS, F-Value, and P-Value, and rows for Regression, MPG City, HP, Trunk, Warranty, Audi, Chevrolet, Chrysler, Ford, Honda, Lexus, Mazda, Nissan, Toyota, FWD, RWD, Error, and Total. [Numeric values were lost in transcription.]
Adjusted sums of squares are the additional sums of squares determined by adding each particular term to the model, given that the other terms are already in the model.

52 Model Summary
S = 4435.45, R-sq = 95.20%, R-sq(adj) = 92.43%, R-sq(pred) = 88.33%
Coefficients table, with columns Term, Coef, SE Coef, T-Value, P-Value, and VIF, and rows for Constant, MPG City, HP, Trunk, Warranty, Audi, Chevrolet, Chrysler, Ford, Honda, Lexus, Mazda, Nissan, Toyota, FWD, and RWD. [Numeric values were lost in transcription.]

53 Serial Correlation (a.k.a. Autocorrelation)
Here we are concerned with the assumption that the residuals are independent of each other. In particular, we are suspicious that sequential residuals have a positive correlation. In other words, some information about an observed value of the dependent variable is contained in the previous observation.

54 Consider the following historical data set, in which the dependent variable is Consumer Expenditure and the independent variable is Money Stock. (Economists are interested in the effect of Money Stock on Expenditure because, if it is significant, it presents an opportunity to influence the economy through public policy.)
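A minimal Python sketch of the kind of check involved (assuming statsmodels' durbin_watson helper; the data below are made up, with errors deliberately built to be positively autocorrelated):

# Sketch: Durbin-Watson statistic on residuals; values near 2 suggest no
# first-order autocorrelation, values well below 2 suggest positive
# serial correlation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
n = 60
t = np.arange(n, dtype=float)
e = np.zeros(n)
for i in range(1, n):                  # positively autocorrelated errors
    e[i] = 0.8 * e[i - 1] + rng.normal()
y = 10 + 0.5 * t + e                   # illustrative time series

slope, intercept = np.polyfit(t, y, 1)
resid = y - (intercept + slope * t)
print(round(durbin_watson(resid), 2))  # well below 2 for these data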

55-62 [Figures: regression of Expenditure on Money Stock, with residual plots and analysis for serial correlation]

63 Summary
Residual Analysis
- Are they normal?
- Do they have a common variance?
Multicollinearity
Autocorrelation (serial correlation)

64 For Session 7
- Practice the Excel array functions.
- Do a full multiple regression model of the cigarette data: replicate the regression results using matrix algebra (OK to send this one to the TAs); see the sketch below.
- Artsy case.
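For the matrix-algebra replication, a minimal Python sketch of the normal-equations calculation (illustrative data; the cigarette data would be substituted in):

# Sketch: beta-hat = (X'X)^(-1) X'y, the matrix form of least squares.
import numpy as np

rng = np.random.default_rng(7)
n = 25
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)  # normal equations
print(beta_hat.round(3))                       # approximately [2, 1.5, -0.7]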

