Things Favoring ANOVA Analysis: ANOVA tells you whether a factor is controlling a result. It requires that the control factor be easily categorized (example: spring, summer, fall). It tends to work well on non-quantitative, unordered, or discontinuous controlling factors. It does not quantify the magnitude or type of effect, only its existence. Example: gas mileage is influenced by the season of the year, the driving distance, and the driver.
Things Favoring Regression Analysis: Suppose your gas mileage data is recorded against outside temperature, distance driven, and age of driver. This data can be categorized only by arbitrary divisions. Suppose I want to know quantitatively how these continuous numeric variables control gas mileage.
What Regression Does: The idea is that you have a “Dependent Variable” that is a function of some “Independent Variable”: Y = F(X). This could be gas mileage as a function of temperature. The simplest form of a function is a straight line: Y = b_0 + b_1*X.
Reminders on Linear Form: b_0 is the intercept of the line with the vertical axis at X = 0. b_1 represents the units of rise in Y per unit of run in X (i.e., it is the slope of the line).
Idea Behind Linear Regression: Most of the variation of Y can be explained as a linear function of X. The portion of variation in Y due to other known, unknown, or random causes is assumed to be normally distributed about the regression line. The degree to which we missed predicting Y using X can be measured by squaring the difference between the actual and predicted value. We select our linear coefficients b_0 and b_1 such that the sum of all these squared differences is minimized. For this class we will skip the derivation and the formulas used to get b_0 and b_1. Linear regression is readily done by most calculators and by our friend, the program SPSS.
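For readers curious about what the calculator or SPSS is doing behind the scenes, the least-squares idea can be sketched in a few lines of Python. This is only an illustrative sketch: the distance and MPG numbers are made up (they are not the class data set), and `fit_line` and `sum_squared_error` are hypothetical helper names.

```python
# Made-up illustration data (NOT the class data set):
# trip distance in miles and the gas mileage observed on that trip.
distance = [2.0, 5.0, 8.0, 12.0, 20.0, 35.0]
mpg = [18.0, 21.0, 22.0, 25.0, 27.0, 30.0]


def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared differences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Co-variation of x with y, and variation of x about its own mean.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx               # slope: rise in Y per unit run in X
    b0 = y_mean - b1 * x_mean    # intercept: value of the line at X = 0
    return b0, b1


def sum_squared_error(x, y, b0, b1):
    """Sum of squared differences between actual and predicted Y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(distance, mpg)
sse = sum_squared_error(distance, mpg, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sum of squared differences = {sse:.3f}")
```

Nudging either coefficient away from the fitted value and calling `sum_squared_error` again should give a larger total, which is all that "minimized" means here.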
Doing Linear Regression With SPSS: Begin by entering the data. In this case we will consider gas mileage as a function of the distance a car is driven. We believe that there may be a relationship because vehicles take a while to warm up and get better mileage after warm-up.
Why Did I Pick Linear Regression? The controlling variable was continuous, not categorical. If I had looked at gas mileage as a function of gender, the control variable would have been a category (male, female). A linear relationship is an easy one to consider. There are ways of plotting data to see if it appears there might be a linear trend.
A Note on Modeling: Statistical methods are all about fitting mathematical models to real data. A linear regression attempts to fit a straight-line function of X through the data Y. Ultimately the quality of what I do depends on how well the model represents reality. A poorly fit model will still produce answers, but right answers cost more and are harder to get.
Visually Examining Our Data Set: Go to Graphs and click to pull down the menu. Highlight and click on Scatter Plot.
You Will Be Given a Choice of Types of Scatter Plots: The default is a simple scatter plot, which I am going to accept. I will click on the Define button to move to the next screen.
I Need to Define What to Plot on the Y and X Axis: The Y axis is my dependent variable. In this case I believe that MPG is a function of distance traveled. To make it my Y variable I will highlight MPG and then click the arrow by Y Axis.
Next Choose My Independent Variable: Since I believe that MPG might be a function of distance driven, I next select Distance and click the arrow by X Axis to move the variable over to the X axis. Then I click OK to go to the plot.
Out Comes My Plot: I see a fairly clear indication that gas mileage is improving with the length of the trip.
Now Getting on to Regression: Click the pull-down menu for Analyze. Highlight Regression to pop the side menu out. Highlight and click Linear.
Select the Regression Variables: Note that I selected MPG for my dependent variable and distance as my independent variable.
Click OK and Out Comes Stuff: First it tells me about the variables that entered (i.e., what did it try to make MPG a function of). I told it to make MPG a function of distance, and the table says it entered distance as the controlling variable. Method Enter means it entered that variable because I told it to.
Next Box Tells Me About How Well I Did Guessing a Linear Model: R^2 is called the coefficient of determination; it is the square of R, the Pearson product-moment correlation coefficient. It tells me how much of the total scatter in the data is explained by my linear regression on one variable (distance). Here 39.3% was explained.
More Interpretation: The R value tells you how well your data followed a straight line. A magnitude of 1 means it is a straight line. 0 means it is nothing like a straight line (a circle would pull a 0 even though Y is related to X; the relationship is just not a linear one). The Standard Error of the Estimate is roughly how far, on average, you would miss your guess if you just gave the mileage predicted by the equation.
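The R value and the Standard Error of the Estimate can be reproduced by hand, which helps in reading the SPSS box. The sketch below is illustrative only: the distance and MPG numbers are made up, the helper names `pearson_r` and `std_error_of_estimate` are hypothetical, and the standard error is assumed to use n - 2 degrees of freedom (two coefficients were fitted). It also checks the circle remark: points spread evenly around a circle give an R of essentially 0.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient R between two equal-length lists."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def std_error_of_estimate(x, y):
    """Typical miss of the fitted line, assuming n - 2 degrees of freedom."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


# Made-up illustration data (NOT the class data set).
distance = [2.0, 5.0, 8.0, 12.0, 20.0, 35.0]
mpg = [18.0, 21.0, 22.0, 25.0, 27.0, 30.0]

r = pearson_r(distance, mpg)
print(f"R = {r:.3f}, R^2 = {r * r:.3f}")
print(f"Standard error of the estimate = {std_error_of_estimate(distance, mpg):.3f}")

# Eight points evenly spaced around a circle: clearly related, but not linearly.
angles = [2 * math.pi * k / 8 for k in range(8)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]
r_circle = pearson_r(circle_x, circle_y)
print(f"R for the circle is essentially zero: {abs(r_circle) < 1e-9}")
```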
The ANOVA Table: SPSS does an ANOVA on the linear model as a predictor. The F value for the regression is given in the table. The chance of getting an F value that high if the model fit was a fluke is essentially 0.
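The F value in the ANOVA table is the ratio of two mean squares: the variation explained by the line, and the variation left over. A minimal sketch of that arithmetic follows, using made-up data rather than the class data set; the variable names are illustrative and are not SPSS output labels. Turning F into a significance level needs the F distribution with (1, n - 2) degrees of freedom, which this sketch does not attempt.

```python
# Made-up illustration data (NOT the class data set).
x = [2.0, 5.0, 8.0, 12.0, 20.0, 35.0]
y = [18.0, 21.0, 22.0, 25.0, 27.0, 30.0]

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean
predicted = [b0 + b1 * xi for xi in x]

# Split the total scatter into the part the line explains and the part it misses.
ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_regression = sum((pi - y_mean) ** 2 for pi in predicted)
ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

df_regression = 1      # one independent variable
df_error = n - 2       # n points minus the two fitted coefficients

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_value = ms_regression / ms_error

print(f"SS regression = {ss_regression:.3f}, SS error = {ss_error:.3f}")
print(f"F = {f_value:.2f} on ({df_regression}, {df_error}) degrees of freedom")
```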
The Coefficients Table: The coefficient table gives the regression constants b_0 and b_1 (here b_1 = 0.654), which go into the line Y = b_0 + b_1*X.
How Good Are Our Coefficients? A test statistic is computed for each coefficient in the equation. The “null hypothesis” is that the slope or intercept is actually 0. The test statistic has a t distribution. The standard deviation (standard error) for each coefficient value is given here: the table lists the constant together with the standard deviation of that estimate.
Significance of the Coefficients: The significance levels in this table give the chance of seeing a coefficient estimate this far from 0 if the real value of the regression coefficient were 0. As can be seen, for both coefficients that chance is essentially nil, so we reject the idea that either coefficient should be 0.
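The t statistic for each coefficient is just the coefficient divided by its standard deviation (standard error). The sketch below shows that arithmetic under the usual simple-regression formulas, with made-up data standing in for the class data set and illustrative variable names. Converting t to a significance level requires the t distribution with n - 2 degrees of freedom, which is left to SPSS or a table.

```python
import math

# Made-up illustration data (NOT the class data set).
x = [2.0, 5.0, 8.0, 12.0, 20.0, 35.0]
y = [18.0, 21.0, 22.0, 25.0, 27.0, 30.0]

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

# Scatter of the points about the fitted line (n - 2 degrees of freedom).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Standard errors of the slope and of the intercept.
se_b1 = s / math.sqrt(sxx)
se_b0 = s * math.sqrt(1.0 / n + x_mean ** 2 / sxx)

# Test statistics for the null hypothesis that each coefficient is 0.
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

print(f"constant: b0 = {b0:.3f}, std. error = {se_b0:.3f}, t = {t_b0:.2f}")
print(f"slope:    b1 = {b1:.3f}, std. error = {se_b1:.3f}, t = {t_b1:.2f}")
```

With a single independent variable, the square of the slope's t statistic equals the F value from the ANOVA table, so the two tests agree.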
Some Conclusions: There is definitely a linear influence of miles driven on gas mileage; however, the linear relationship only explains about 40% of the variability in the data. We know there is still something out there. We may also want to examine our residuals to see if there are any trends in them indicating that we might be missing something, or that our assumption of a constant normal distribution of residuals about the model is wrong. It might come up wrong, for example, if we were wrong about a linear model being the best fit.
Examining Residuals of Regression: Set up your linear regression in the usual manner.
Selecting Plots: After setting your dependent and independent variables, and before clicking OK, click Plots instead.
Picking Residual Plots: Plot the residual on the Y axis against the predicted value on the X axis. Ask for histograms and normal probability plots.
More Plots: Use the Next button to allow you to select another plot. Then enter the residual on the Y axis against the dependent variable. Finally, tell the computer to Continue.
You Will Still Get the Normal Tables We Saw Before: Scroll down to see what is new.
Some Abnormality in the Histogram: A histogram is a bar chart showing the number of results in different numeric intervals. In this case we can see there may be two families of unexplained events, and one of them is causing the model to over-predict (note the negative tail).
We Have a Cumulative Probability Plot: Cumulative probability counts all the samples that should have come up by a certain point (it is an integration of the probability distribution). Normally distributed residuals would plot on a straight line. This is somewhat straight, but the slope at the center is wrong and the tails drift off. (More commentary on reading cumulative probability plots later.)
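To make the idea of a cumulative probability plot concrete, the sketch below builds the pairs of numbers such a plot compares: the observed fraction of residuals at or below each value, and the fraction a normal distribution would predict. It is a rough sketch under stated assumptions, not the procedure SPSS uses: the residuals are made up, `pp_points` is a hypothetical helper, and the plotting position (rank - 0.5)/n is one common convention among several.

```python
from statistics import NormalDist, mean, stdev

# Made-up residuals for illustration (NOT from the class data set).
residuals = [-3.1, -1.4, -0.8, -0.2, 0.1, 0.5, 0.9, 1.2, 2.8]


def pp_points(values):
    """Return (observed, expected) cumulative probability pairs."""
    n = len(values)
    # Normal distribution with the same mean and spread as the residuals.
    normal = NormalDist(mean(values), stdev(values))
    points = []
    for rank, value in enumerate(sorted(values), start=1):
        observed = (rank - 0.5) / n    # fraction of samples seen by this point
        expected = normal.cdf(value)   # fraction a normal curve would predict
        points.append((observed, expected))
    return points


for observed, expected in pp_points(residuals):
    print(f"observed {observed:.3f}   expected {expected:.3f}")
```

If the residuals were normal, the two columns would track each other closely, which is the straight diagonal line on the plot; gaps between them at the center or in the tails are the drift described on the slide.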
Look for Trends That Have Been Systematically Missed: This plot shows the residual (the amount we missed by) against the predicted value. If there is a trend in the points it may tell us what we missed. In this case it is pretty scattered.
Missing Trends: We are still missing something because there is a definite trend in the residuals relative to the actual MPG. We are missing a variable or factor (it might be linear).
Consider Another Data Set: We have an independent and a dependent variable. (The data set could represent any problem we wished to model.)
Tell It to Do a Regression of the Dependent Against the Independent Variable: Be sure we also ask for our residual plots.
Go to Results: The R^2 value is darn near one, and one is a straight line. How much closer do you want to be? This regression looks like it fits like a glove: the mean square for regression is 5 orders of magnitude greater than the MS for error. The F statistic blows the null hypothesis off the map.
No Chance the Slope or Constant are Zero
There is some evidence the distribution of residuals is a little skewed.
The residual distribution is definitely skewed off to one side
Oh Boy, Can You See the Trend We Missed Here? Here the residuals follow a clear and unmistakable shape of an effect we missed.
This Thing Has a Second Order or Curved Effect
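The situation on these slides, a near-perfect R^2 sitting on top of a missed curve, is easy to reproduce. The sketch below is an illustration under assumed data, not the data set shown in the slides: it generates Y from a made-up formula with a small second-order term, fits a straight line, and prints the sign of each residual in order of X.

```python
# Made-up data with a deliberate second-order term (NOT the slide data set):
# Y = 2 + 3*X + 0.05*X^2 for X = 0, 1, ..., 20.
x = [float(i) for i in range(21)]
y = [2.0 + 3.0 * xi + 0.05 * xi ** 2 for xi in x]

# Fit a straight line by least squares.
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

# Residual = actual value minus the value the line predicts.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_error = sum(e ** 2 for e in residuals)
r_squared = 1.0 - ss_error / ss_total

# One character per point, in order of increasing X.
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"R^2 = {r_squared:.4f}")
print(f"Residual signs by increasing X: {signs}")
```

The residuals come out positive at both ends and negative through the middle, a bowl shape, even though R^2 is above 0.99. A good R^2 alone does not rule out a curved effect; the residual plot is what reveals it.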
OK, Now What Do I Do? Linear regression rapidly and quantitatively fits a simple linear function of one variable to another. We noted that there had to be other effects present on the gas mileage, but simple linear regression as used here only handles one independent variable. We also noted that sometimes there are second or higher order effects of a variable present, and a straight line just doesn’t fit that. We may want to have some more powerful tools to fall back on (we just try the easy stuff first).