
1 Multiple Regression ©2005 Dr. B. C. Paul

2 Problems with Regression So Far: We have only been able to consider one controlling factor at a time. Everything else has gone into the sum of squares for error. With our MPG data we suspect other factors are involved.

3 Looking at Other Variables: Let’s try plotting MPG against outside temperature. Click on Graphs, highlight Interactive to bring up the side menu, then highlight and click on Scatterplot.

4 Setting the Plot: Move your Y and X axis variables into position. This interface requires you to drag and drop rather than click on arrows. When you are done, click OK.

5 Up Comes the Plot: There appears to be evidence that MPG improves as the outside temperature increases.

6 Ordering a Model: We would like to have a model that includes more than one factor at a time. Such a model exists.

7 Function of the Model: Works by least squares error. The objective remains to pick the coefficients such that the average squared error between the model and the data points is a minimum. Again, we will skip any derivations or explanations of how this is done. For us, we’ll push the right SPSS buttons.
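As a rough illustration of what "pick the coefficients to minimize the squared error" means for a model of the form MPG = b0 + b1*temp + b2*dist, here is a minimal sketch in plain Python. The numbers are made up for the example (they are not the lecture's data), and the helper name `fit_least_squares` is hypothetical. SPSS does the equivalent work behind its buttons; this only shows the idea.

```python
# Illustrative sketch only: made-up numbers, not the lecture's MPG data set.

def fit_least_squares(columns, y):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared errors."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y[m] for m, r in enumerate(rows))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


temp = [30, 45, 50, 60, 65, 70, 80, 85]   # outside temperature (assumed values)
dist = [5, 20, 10, 30, 15, 40, 25, 35]    # distance driven (assumed values)
mpg = [24.3, 28.3, 27.1, 31.6, 29.7, 35.0, 32.9, 35.6]

b0, b1, b2 = fit_least_squares([temp, dist], mpg)
predicted = [b0 + b1 * t + b2 * d for t, d in zip(temp, dist)]
residuals = [m - p for m, p in zip(mpg, predicted)]
sse = sum(e * e for e in residuals)

print("intercept, temp, dist coefficients:", round(b0, 3), round(b1, 3), round(b2, 3))
print("sum of squares for error:", round(sse, 3))
```

Any other choice of the three coefficients would give a larger sum of squares for error on these eight points.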

8 How Do We Deal with Significance? We have seen that any coefficient in a model can be analyzed individually for the certainty that it is not really zero (if it is zero, that term is not even in the model). The trick! The significance of a coefficient depends on how much of the total variation the model explains and how much of the credit for that is going to other variables. It makes a difference what is in the model. Example: for MPG = f(distance driven), the linear regression was significant. When a quadratic term was added, the regression fit improved, but neither term, including the linear one, was significant.

9 Method of Variable Entry: The Decree Method. I can tell SPSS I want it to do a regression with such and such a variable. SPSS will do the best regression it can and then show me the ANOVA table. I can look and see whether I believe my coefficients are strong enough for me to sign off on.

10 Forward Regression: The computer looks at the variables available and tries a linear regression on each one. If a variable comes up at 95% significance, it becomes a candidate to enter. The computer gets the significance of each variable and, if several are over 95%, picks the best. It then looks at the remaining variables to explain the residuals: it tries each variable, checks its significance in explaining the residuals, looks for variables over 95% significant, and again chooses the best.

11 Forward Regression Continued: Variables keep entering until all variables have been selected or no remaining variable is significant. 95% significance is the “default” significance to enter. We can reset it to a different significance level.
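The forward procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are made up, the helper names (`rss`, `partial_f`, `forward_select`) are hypothetical, and a fixed F-to-enter cutoff of 4.0 stands in for the 95% significance test. A real package would look up the exact significance for the degrees of freedom involved.

```python
# Illustrative sketch only: made-up data; F_TO_ENTER-style cutoff is an assumption.

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y[m] for m, r in enumerate(rows))]
         for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    return sum((y[m] - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for m, r in enumerate(rows))


def partial_f(base, extra, y):
    """F statistic for adding the column `extra` to the columns in `base`."""
    rss_small = rss(base, y)
    rss_big = rss(base + [extra], y)
    df_error = len(y) - (len(base) + 1) - 1
    return (rss_small - rss_big) / (rss_big / df_error)


def forward_select(candidates, y, f_to_enter=4.0):
    """Add the strongest remaining variable until none clears the cutoff."""
    selected = []
    while len(selected) < len(candidates):
        base = [candidates[name] for name in selected]
        scores = {name: partial_f(base, col, y)
                  for name, col in candidates.items() if name not in selected}
        best = max(scores, key=scores.get)
        if scores[best] < f_to_enter:
            break
        selected.append(best)
    return selected


candidates = {
    "temp": [30, 45, 50, 60, 65, 70, 80, 85],
    "dist": [5, 20, 10, 30, 15, 40, 25, 35],
    "age": [3, 7, 1, 5, 8, 2, 6, 4],   # built to be nearly unrelated to mpg
}
mpg = [24.3, 28.3, 27.1, 31.6, 29.7, 35.0, 32.9, 35.6]

selected = forward_select(candidates, mpg)
print("entered, in order:", selected)
```

In this made-up set the two real effects enter and the unrelated variable is left out; the order of entry depends entirely on the data.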

12 Backward Regression: The computer starts by doing a regression with all possible variables in the equation. It then does a T test on each coefficient to see if the coefficient might be zero. Any variable that falls below 75% significance is removed from the equation (75% is a default that you can reset). The T tests are then repeated. The process repeats until no more variables can be thrown out of the equation.
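A minimal sketch of the backward procedure, again in plain Python with made-up data and hypothetical helper names. Instead of a T test it uses the partial F statistic for dropping one variable, which for a single coefficient equals the square of its T score. The fixed cutoff of 3.0 is an assumed stand-in for the significance-to-remove setting, not a value taken from SPSS.

```python
# Illustrative sketch only: made-up data; the f_to_remove cutoff is an assumption.

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y[m] for m, r in enumerate(rows))]
         for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    return sum((y[m] - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for m, r in enumerate(rows))


def partial_f(base, extra, y):
    """F statistic for the column `extra` given the columns in `base`."""
    rss_small = rss(base, y)
    rss_big = rss(base + [extra], y)
    df_error = len(y) - (len(base) + 1) - 1
    return (rss_small - rss_big) / (rss_big / df_error)


def backward_eliminate(candidates, y, f_to_remove=3.0):
    """Start with everything in; repeatedly drop the weakest variable."""
    kept = list(candidates)
    while kept:
        scores = {}
        for name in kept:
            others = [candidates[n] for n in kept if n != name]
            scores[name] = partial_f(others, candidates[name], y)
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= f_to_remove:
            break  # every remaining variable is strong enough to stay
        kept.remove(weakest)
    return kept


candidates = {
    "temp": [30, 45, 50, 60, 65, 70, 80, 85],
    "dist": [5, 20, 10, 30, 15, 40, 25, 35],
    "age": [3, 7, 1, 5, 8, 2, 6, 4],   # built to be nearly unrelated to mpg
}
mpg = [24.3, 28.3, 27.1, 31.6, 29.7, 35.0, 32.9, 35.6]

kept = backward_eliminate(candidates, mpg)
print("variables left in the equation:", kept)
```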

13 Step Wise Regression: Starts out like Forward Regression and moves forward until two variables are in the equation. Now the computer does T tests on all the variables in the equation, like a Backward Regression, to see if any should be thrown out. If not, it goes into another forward step. After the forward step it again checks all variables in the equation with T tests. It continues the back-and-forth process until nothing changes or it goes into an infinite loop.
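The back-and-forth can be sketched as a loop that alternates one forward step with one backward check. This is a simplified illustration, not SPSS's exact algorithm: the data are made up, the helper names are hypothetical, the F cutoffs (4.0 to enter, 3.0 to remove) are assumed stand-ins for the significance settings, and a `max_steps` limit is added as a guard against the endless cycling mentioned on the slide.

```python
# Illustrative sketch only: made-up data; cutoffs and max_steps are assumptions.

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y[m] for m, r in enumerate(rows))]
         for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    return sum((y[m] - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for m, r in enumerate(rows))


def partial_f(base, extra, y):
    """F statistic for the column `extra` given the columns in `base`."""
    rss_small = rss(base, y)
    rss_big = rss(base + [extra], y)
    df_error = len(y) - (len(base) + 1) - 1
    return (rss_small - rss_big) / (rss_big / df_error)


def stepwise(candidates, y, f_to_enter=4.0, f_to_remove=3.0, max_steps=50):
    selected = []
    for _ in range(max_steps):  # guard against endless enter/remove cycling
        changed = False
        # Forward step: try to add the best remaining variable.
        base = [candidates[n] for n in selected]
        scores = {n: partial_f(base, col, y)
                  for n, col in candidates.items() if n not in selected}
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= f_to_enter:
                selected.append(best)
                changed = True
        # Backward check: drop the weakest variable if it no longer qualifies.
        if len(selected) > 1:
            scores = {}
            for n in selected:
                others = [candidates[m] for m in selected if m != n]
                scores[n] = partial_f(others, candidates[n], y)
            weakest = min(scores, key=scores.get)
            if scores[weakest] < f_to_remove:
                selected.remove(weakest)
                changed = True
        if not changed:
            break
    return selected


candidates = {
    "temp": [30, 45, 50, 60, 65, 70, 80, 85],
    "dist": [5, 20, 10, 30, 15, 40, 25, 35],
    "age": [3, 7, 1, 5, 8, 2, 6, 4],   # built to be nearly unrelated to mpg
}
mpg = [24.3, 28.3, 27.1, 31.6, 29.7, 35.0, 32.9, 35.6]

selected = stepwise(candidates, mpg)
print("final equation uses:", selected)
```

Keeping the cutoff to remove below the cutoff to enter is one common way to make a variable that has just entered less likely to be thrown straight back out.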

14 What are the chances that the Methods will give you the same answer? About zero. Some smaller, easier sets will converge for all methods, but the larger sets usually do not yield the same answer. Which is right? Maybe it’s a dumb question. Method does influence the answer. It may be more important to make sure you have a good, defensible method. (Maybe it’s the teacher’s favorite answer: Step Wise.)

15 Let’s Try It: I added a variable for Distance Squared. Multilinear regression can only consider linear effects of a variable, but I can trick it by creating a non-linear variable. In my case I still think I saw the MPG bending down as the drive distance increased (logical because the engine warmed up).

16 Start Like We Are Going To Do Regular Linear Regression: Click Analyze to pull down the menu. Highlight Regression to pop out the side menu. Highlight and click Linear.

17 Select My Variables: Note the change here is that I entered all the possible independent variables (you can’t see that I also entered Distance Squared).

18 Set the Regression Method to Step Wise

19 Check My Options: Click on Options. Note that this controls my significance to enter and to remove. The default is set to 95% to enter and 90% to remove.

20 Set My Plots: I ask for my histograms. I ask for my residuals to be plotted against the predicted value to search for trends in the residuals.

21 Click Ok and Out Comes Stuff

22 We Can See Some Model History: Our first model was MPG as a linear function of outside temperature. It explained about 54% of the observed variation.

23 The Saga Continues: The next step was to add an effect for distance. The two variables explained 91% of the observed variation.

24 The Rest of the History: The model next added Age and finally a distance squared term. It appears that none of the variables was removed in a backward step. The procedure just moved forward until all variables were in. In the end we have 93.5% of the variation explained.

25 Looking at the ANOVA for the Regression Equations: All four regressions were highly significant.

26 Checking the Significance of Coefficients: We actually knew that none of our variables got bounced out. Note that every variable is significant at better than the alpha = 5% level.

27 We Can See Interaction Between Variables as they Enter: Note that the T score for Distance dips when Distance Squared enters (for some reason it appears the values are correlated).
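One likely reason for the correlation: over a range of positive distances, a variable and its square rise together almost in lockstep. A small sketch with assumed distance values (not the lecture's data) and a hand-written Pearson correlation helper shows the effect.

```python
import math

# Illustrative sketch only: the distance values are assumed, not the lecture's.

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


dist = [5, 10, 15, 20, 25, 30, 35, 40]
dist_sq = [d * d for d in dist]   # the "trick" variable: distance squared

r = pearson(dist, dist_sq)
print("correlation between distance and distance squared:", round(r, 3))
```

With two predictors this closely related, they compete for credit for the same variation, which is consistent with the T score for Distance dropping when Distance Squared comes in.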

28 More Interactions: As the unexplained random variation decreased, the significance of the temperature effect increased steadily.

29 Our Equation Is

30 Look at the Significance that Controlled the Order the Variables Came In: To start with, Age had less than 50% significance, but Distance and Distance Squared were both strong. Distance had a better T score and entered next.

31 Next Regression Step: In the next step both Age and Distance Squared cleared the 5% level, but Age was stronger.

32 Checking Out Our Residuals: I’ve seen better normal distributions on a cell-by-cell basis, but this doesn’t trigger any immediate concerns. (Remember, we do assume our residuals will be normally distributed with a mean of zero around the predicted value.)

33 Looking at Cumulative Probability: On a cumulative probability chart we do very well in assuming a normal distribution of the error with a mean of 0.

34 Our Scatter Plot: If there is a trend there, I don’t see it (which is exactly what one wants to see after the regression is well done).
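For contrast, here is a small made-up example of what a trend in the residuals looks like when a needed term is missing. A straight line is fitted to data that actually follows a curve; the residuals still average zero, but they are positive at both ends and negative in the middle. The numbers are chosen only to make the arithmetic clean.

```python
# Illustrative sketch only: a straight line fitted to deliberately curved data.

x = [1, 2, 3, 4, 5, 6, 7]
y = [v * v for v in x]            # the true relationship is quadratic

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares line through the points.
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * a for a in x]
residuals = [b - p for b, p in zip(y, predicted)]

for p, e in zip(predicted, residuals):
    print("predicted", p, "residual", e)
```

A U-shaped pattern like this in the residual plot is the kind of trend one searches for; adding a squared term to the model is one way to remove it.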

35 Summary on Regression: ANOVA works for category data: is a particular category significant? Ford Escorts are made in 3 plants. Is there a difference in the mechanical problem rate that depends on which factory built the car? Plants #1, #2, and #3 really have no order except an arbitrary one. If I had looked at MPG based on Spring, Summer, Fall, and Winter, assigning a numeric value to the seasons would be totally arbitrary. Category data lends itself poorly to regression.

36 So When Should I Choose Regression? For continuous quantitative variables. I could break my drivers’ ages into groups, but the break points would be arbitrary. This little artifact is one of the reasons two car insurance companies can look at the same regional risk for drivers in an area and yet quote different rates for the same coverage. Creating categories out of continuous data can cause some weird effects. Regression tends to work better for continuously distributed quantitative data. It also provides predictive models as opposed to category means.

