# Multiple regression. Problem: to draw a straight line through the points that best explains the variance Regression.


Multiple regression

Regression. Problem: to draw a straight line through the points that best explains the variance.


Regression: test with F, just like ANOVA.

$$F = \frac{\text{variance explained by x-variable} / df}{\text{variance still unexplained} / df}$$

Variance explained: change in line lengths². Variance unexplained: residual line lengths².

In regression, each x-variable will normally have 1 df.

Essentially a cost:benefit analysis: is the benefit in variance explained worth the cost in using up degrees of freedom?

Regression: also have $R^2$, the proportion of total variance explained by the variable.

$$R^2 = \frac{\text{variance explained by x-variable}}{\text{variance explained by x-variable} + \text{unexplained variance}}$$

Regression example

Total variance for 32 data points is 300 units. An x-variable is then regressed against the data, accounting for 150 units of variance.

1. What is the $R^2$?
2. What is the F ratio?

$R^2 = 150/300 = 0.5$

$F_{1,30} = \dfrac{150/1}{150/30} = 30$

Why is df error = 30?
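The arithmetic of this example can be sketched in a few lines of Python. The function name `r_squared_and_f` is made up for illustration; the error degrees of freedom are taken as the number of points, minus 1 for the overall mean, minus 1 for each x-variable, which is where 32 − 1 − 1 = 30 comes from.

```python
def r_squared_and_f(ss_total, ss_explained, n_points, df_model=1):
    """Return (R^2, F, df_error) for a regression summary.

    df_model is the number of x-variables (1 df each, as on the slide).
    """
    ss_error = ss_total - ss_explained      # variance still unexplained
    df_error = n_points - 1 - df_model      # one df for the mean, one per x-variable
    r2 = ss_explained / ss_total
    f = (ss_explained / df_model) / (ss_error / df_error)
    return r2, f, df_error


# Numbers from the slide: 32 points, total 300 units, 150 units explained.
r2, f_ratio, df_error = r_squared_and_f(300, 150, 32)
print(f"R^2 = {r2:.2f}, F(1, {df_error}) = {f_ratio:.1f}")
# prints: R^2 = 0.50, F(1, 30) = 30.0
```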

Multiple regression

[Figure: herbivore damage against tree age, with higher-nutrient and lower-nutrient trees distinguished.]

$\text{Damage} = m_1 \cdot \text{age} + b$

[Figure: residuals of herbivore damage against tree nutrient concentration.]

$\text{Damage} = m_1 \cdot \text{age} + m_2 \cdot \text{nutrient} + b$

$\text{Damage} = m_1 \cdot \text{age} + m_2 \cdot \text{nutrient} + m_3 \cdot \text{age} \cdot \text{nutrient} + b$

[Figure: two panels of y, "No interaction (additive)" and "Interaction (non-additive)".]

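Non-linear regression? Just a special case of multiple regression!

$Y = m_1 x + m_2 x^2 + b$

| X (X₁) | X² (X₂) | Y |
|---|---|---|
| 1 | 1 | 1.1 |
| 2 | 4 | 2.0 |
| 3 | 9 | 3.6 |
| 4 | 16 | 3.1 |
| 5 | 25 | 5.2 |
| 6 | 36 | 6.7 |
| 7 | 49 | 11.3 |

$Y = m_1 x_1 + m_2 x_2 + b$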
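To make the "special case" concrete, here is a minimal sketch that fits the seven data points from the slide by treating X and X² as two ordinary x-variables. It uses only the standard library and solves the least-squares normal equations directly; the helper names `solve` and `fit` are made up for this example, and a statistics package would normally do this step.

```python
x = [1, 2, 3, 4, 5, 6, 7]
y = [1.1, 2.0, 3.6, 3.1, 5.2, 6.7, 11.3]
x1 = x                      # first x-variable: X itself
x2 = [v ** 2 for v in x]    # second x-variable: X squared


def solve(a, b):
    """Solve the linear system a @ coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def fit(columns, response):
    """Least-squares fit with an intercept; returns [b, m1, m2, ...]."""
    design = [[1.0] + [col[k] for col in columns] for k in range(len(response))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yk for row, yk in zip(design, response)) for i in range(p)]
    return solve(xtx, xty)


b, m1, m2 = fit([x1, x2], y)
predicted = [m1 * u + m2 * v + b for u, v in zip(x1, x2)]
residuals = [yk - pk for yk, pk in zip(y, predicted)]
print(f"Y = {m1:.3f} * x1 + {m2:.3f} * x2 + {b:.3f}")
```

The fitting routine never needs to know that `x2` was computed from `x1`; the curve is non-linear in X but the model is still linear in its parameters.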

STEPWISE REGRESSION

[Figure: jump height (how high ball can be raised off the ground), in feet off ground.]

Total SS = 11.11

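| X variable | Parameter | SS | F 1,13 | p |
|---|---|---|---|---|
| Height of player | +0.943 | 9.96 | 112 | <0.0001 |

| X variable | Parameter | SS | F 1,13 | p |
|---|---|---|---|---|
| Weight of player | +0.040 | 7.92 | 32 | <0.0001 |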

Why do you think weight is + correlated with jump height?

An idea: perhaps if we took two people of identical height, the lighter one might actually jump higher? Excess weight may reduce ability to jump high…

How could we test this idea?

[Figure labels: lighter, heavier.]

| X variable | Parameter | SS | F | p |
|---|---|---|---|---|
| Height | +2.133 | 9.956 | 803 | <0.0001 |
| Weight | −0.059 | 1.008 | 81 | <0.0001 |

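Why did the parameter estimates change? Why did the F tests change?

Height and weight in the model together:

| X variable | Parameter | SS | F | p |
|---|---|---|---|---|
| Height | +2.133 | 9.956 | 803 | <0.0001 |
| Weight | −0.059 | 1.008 | 81 | <0.0001 |

Weight alone:

| X variable | Parameter | SS | F 1,13 | p |
|---|---|---|---|---|
| Weight of player | +0.040 | 7.92 | 32 | <0.0001 |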

Heavy people are often tall (tall people are often heavy). Tall people can jump higher. People light for their height can jump a bit more.

[Diagram: Weight and Height (+), Height and Jump (+), Weight and Jump (−).]
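The sign flip can be reproduced with a small constructed data set. The numbers below are hypothetical, not the players' data from the slides: nine imaginary players whose jump is built as exactly 2 × height − 0.05 × weight, with heavier players tending to be taller. The helper names are also made up for this sketch.

```python
# Hypothetical data (not from the slides): weight rises with height,
# and jump = 2 * height - 0.05 * weight exactly.
height = [5, 5, 5, 6, 6, 6, 7, 7, 7]
weight = [130, 140, 150, 150, 160, 170, 170, 180, 190]
jump = [2 * h - 0.05 * w for h, w in zip(height, weight)]


def centered_cross(a, b):
    """Sum of cross-products of a and b around their means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))


def simple_slope(x, y):
    """Slope when y is regressed on a single x-variable."""
    return centered_cross(x, y) / centered_cross(x, x)


def two_predictor_slopes(x1, x2, y):
    """Slopes when y is regressed on x1 and x2 together (closed form)."""
    s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
    s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


weight_alone = simple_slope(weight, jump)
height_alone = simple_slope(height, jump)
height_both, weight_both = two_predictor_slopes(height, weight, jump)

print(f"weight alone:  {weight_alone:+.3f}")
print(f"height alone:  {height_alone:+.3f}")
print(f"both in model: height {height_both:+.3f}, weight {weight_both:+.3f}")
```

On its own, weight gets a positive slope because it stands in for height. Once height is in the model, the weight slope turns negative and the height slope grows, the same pattern as in the tables above.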

The problem: the parameter estimate and significance of an x-variable are affected by the x-variables already in the model! How do we know which variables are significant, and in which order to enter them into the model?

Solutions

1) Use a logical order. For example, it makes sense to test the interaction first.

2) Stepwise regression: “tries out” various orders of entering and removing variables.

Stepwise regression: enters or removes variables in order of significance, and checks after each step whether the significance of other variables has changed.

Enters one by one: forward stepwise.

Enters all, removes one by one: backwards stepwise.

Forward stepwise regression

1. Enter the variable with the highest correlation with the y-variable first (if p < p-to-enter).
2. Next, enter the variable that explains the most residual variation (if p < p-to-enter).
3. Remove variables that become insignificant (p > p-to-leave) due to other variables being added.

And so on…
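The procedure above can be sketched as follows. This is a simplified illustration, not a statistics package's routine: it uses F-to-enter and F-to-leave thresholds of 4.0 in place of p-values (an assumption, since the standard library has no F distribution), and the data are hypothetical, constructed so that `headband` is unrelated to jump height. All function and variable names are made up for this example.

```python
def solve(a, b):
    """Solve the linear system a @ coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def rss(y, columns):
    """Residual sum of squares of y regressed on the columns plus an intercept."""
    design = [[1.0] + [col[k] for col in columns] for k in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yk for row, yk in zip(design, y)) for i in range(p)]
    coef = solve(xtx, xty)
    return sum((yk - sum(c * v for c, v in zip(coef, row))) ** 2
               for row, yk in zip(design, y))


def forward_stepwise(y, candidates, f_enter=4.0, f_leave=4.0):
    """Forward stepwise selection using F thresholds (assumed stand-ins for p)."""
    n = len(y)
    selected = []

    def cols(names):
        return [candidates[name] for name in names]

    for _ in range(2 * len(candidates)):     # bounded, in case enter/remove cycles
        current = rss(y, cols(selected))
        best_name, best_f = None, f_enter
        for name in candidates:
            if name in selected:
                continue
            trial = rss(y, cols(selected + [name]))
            df_error = n - len(selected) - 2  # intercept + selected + the new variable
            f = (current - trial) / (trial / df_error)
            if f > best_f:
                best_name, best_f = name, f
        if best_name is None:                 # nothing left that passes F-to-enter
            break
        selected.append(best_name)
        # Re-check the variables entered earlier; drop any that fell below F-to-leave.
        full = rss(y, cols(selected))
        df_error = n - len(selected) - 1
        kept = []
        for name in selected:
            reduced = rss(y, cols([v for v in selected if v != name]))
            f_partial = (reduced - full) / (full / df_error)
            if name == best_name or f_partial >= f_leave:
                kept.append(name)
        selected = kept
    return selected


# Hypothetical data: nine imaginary players (not the data behind the slides).
candidates = {
    "height": [5, 5, 5, 6, 6, 6, 7, 7, 7],
    "weight": [130, 140, 150, 150, 160, 170, 170, 180, 190],
    "headband": [1, 1, 1, 0, 0, 0, 1, 1, 1],
}
jump = [3.6, 3.0, 2.4, 4.5, 4.0, 3.5, 5.4, 5.0, 4.6]

selected = forward_stepwise(jump, candidates)
print(selected)
```

With these made-up numbers, weight on its own falls just short of the F-to-enter threshold, but it enters easily once height is in the model, which is the order-dependence the slides describe.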

General words of caution! Correlation does not equal causation!

Can interpolate between points, but don’t extrapolate (Mark Twain effect):

“In the space of 176 years the Lower Mississippi has shortened itself 242 miles. That is an average of a trifle over 1 1/3 miles per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian Period, just a million years ago next November, the Lower Mississippi River was upwards of 1,300,000 miles long, and stuck out over the Gulf of Mexico like a fishing rod.”
