Presentation on theme: "The Right Questions about Statistics: How regression works Maths Learning Centre The University of Adelaide Regression is a method designed to create a."— Presentation transcript:
The Right Questions about Statistics: How regression works. Maths Learning Centre, The University of Adelaide. Regression is a method designed to create a FORMULA that uses some information to PREDICT or EXPLAIN an outcome, using DATA. You calculate the formula that most closely matches your data.
Regression begins with a research question about a numerical outcome: you’re interested in exactly how one or more things affect that outcome. The outcome is a NUMERICAL variable, called the “RESPONSE VARIABLE”, “DEPENDENT VARIABLE”, “OUTCOME VARIABLE” or “CRITERION VARIABLE”. The things that might affect it are also NUMERICAL variables, called the “INDEPENDENT VARIABLES”, “EXPLANATORY VARIABLES” or “PREDICTOR VARIABLES”.
For example, you might be interested in how a person’s body temperature is affected by the number of grams of chilli in a meal. The explanatory variable is Chilli (g), which is NUMERICAL, and the outcome variable is Temp (°C), which is also NUMERICAL.
What the regression will produce is a FORMULA that will let you calculate the outcome ($Y$) based on the explanatories ($X_1$ and $X_2$), all of them NUMERICAL variables. The formula is also called a MODEL: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
For example, with Chilli (g) as the explanatory variable and Temp (°C) as the outcome, the formula might look like this: temp = 36.06 + 0.45(chilli). The process of regression finds the numbers in this formula so that it gives answers closest to the actual data.
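As a rough illustration of "finding the numbers so the formula is closest to the data", here is a minimal sketch of a least-squares straight-line fit. The measurements are invented for the example (they are not the data behind the slide), and "closest" is assumed to mean the smallest sum of squared differences between the formula's answers and the data, which is the usual choice.

```python
# Hypothetical measurements, invented for illustration (not the slide's data).
chilli = [0.0, 1.0, 2.0, 3.0, 4.0]      # grams of chilli in the meal
temp = [36.1, 36.5, 36.9, 37.5, 37.8]   # body temperature in °C


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """Total squared gap between the formula's answers and the data."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(chilli, temp)
print(f"temp = {b0:.2f} + {b1:.2f}(chilli)")
print("sum of squared errors:", sum_squared_errors(chilli, temp, b0, b1))
```

Any other choice of intercept or slope gives a larger sum of squared errors on the same data, which is the sense in which the fitted formula is "closest".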
What sort of relationship might be there? A “DESCRIBE” question! Draw a “SCATTERPLOT”: the shape will tell you what sort of formula you ought to use.
The easiest one to work with is LINEAR regression, because the formula is simplest: $Y = \beta_0 + \beta_1 X$
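With LINEAR relationships, to DESCRIBE how strong the relationship is, you can calculate the CORRELATION (r). It runs from -1 to 1: r = -1 means the points lie exactly on a falling line, r = 0 means no linear relationship, and r = 1 means the points lie exactly on a rising line. It ignores how steep the slope is and just tells you how close to a line the points are.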
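To make the point about steepness concrete, here is a small sketch that computes the correlation by hand for a few made-up data sets. The function name `pearson_r` and all the numbers are assumptions for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: three exact lines with different slopes, one noisier pattern.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
steep = [10.0 * xi for xi in x]          # exact line, steep rise
shallow = [0.1 * xi for xi in x]         # exact line, gentle rise
falling = [-2.0 * xi for xi in x]        # exact line, going down
scattered = [2.0, 1.0, 4.0, 3.0, 5.0]    # rising trend, not exactly on a line

for name, y in [("steep", steep), ("shallow", shallow),
                ("falling", falling), ("scattered", scattered)]:
    print(f"{name:>9}: r = {pearson_r(x, y):.2f}")
```

The steep and shallow lines both give a correlation of 1 because every point sits exactly on the line; only the scattered data gives a value strictly between 0 and 1.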
The process so far... Have a “what’s the formula?” question about NUMERICAL variables. Look at the pattern, usually with a scatterplot, to help choose a formula.
The next step is to find the numbers in the formula itself: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. Here $\beta_0$ is the “INTERCEPT” or “CONSTANT”, and $\beta_1$ and $\beta_2$ are the “COEFFICIENTS” or “SLOPES”. There are some complicated-looking equations to figure out what these are, based on calculus and matrix algebra, BUT the computer program will do all that for you.
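For the curious, the equations in question usually take the following form. This is the standard least-squares result, given here as a sketch rather than as the exact method any particular program uses. The symbols are assumptions introduced for this note: $n$ is the number of observations, $X$ is the table of data with a column of ones followed by the $X_1$ and $X_2$ columns, and $\mathbf{y}$ is the column of outcomes.

```latex
% Choose the numbers that make the total squared error as small as possible:
\min_{\beta_0,\,\beta_1,\,\beta_2}\;
  \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} \bigr)^2

% Setting the derivatives to zero (calculus) and solving (matrix algebra) gives:
\hat{\boldsymbol{\beta}}
  = \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
  = \bigl( X^{\mathsf T} X \bigr)^{-1} X^{\mathsf T} \mathbf{y}
```

The second line assumes $X^{\mathsf T} X$ can be inverted, which fails if one explanatory variable is an exact linear combination of the others.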
What the computer will do in Excel: [Screenshot labels: original data, regression output, formula numbers.] The formula numbers in the regression output give temp = 36.06 + 0.45(chilli).
What the formula means: temp = 36.06 + 0.45(chilli). The 0.45 is how much temperature changes on average for a change of 1 g of chilli: 1 g extra of chilli puts your temperature up by 0.45 degrees on average. Does getting this number in my data mean that chilli really does affect temperature?
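The meaning of the 0.45 can be checked by simply plugging numbers into the formula. The sketch below uses the two numbers from the slide; the function name `predicted_temp` and the particular gram amounts are made up for illustration.

```python
INTERCEPT = 36.06   # from the slide's formula
SLOPE = 0.45        # °C per gram of chilli, from the slide's formula


def predicted_temp(chilli_grams):
    """Average temperature the formula gives for a meal with this much chilli."""
    return INTERCEPT + SLOPE * chilli_grams


for grams in (0, 1, 2, 3):
    print(f"{grams} g of chilli -> {predicted_temp(grams):.2f} °C")

# Going from any amount to 1 g more always adds the slope.
print("extra per gram:", round(predicted_temp(3) - predicted_temp(2), 2))
```

Whatever starting amount you pick, adding one more gram raises the formula's answer by the same 0.45 degrees.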
Is there really a relationship? A “DECIDE” question! temp = 36.06 + 0.45(chilli). If there were no relationship, then this number (the 0.45) would most likely be close to zero. Assuming some things, we can calculate a test statistic that comes from a t-distribution, and find a p-value. P-value = 0.0000000000000199, which is very small, so this is a “SIGNIFICANT EFFECT”.
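A minimal sketch of the test statistic, under the usual simple linear regression assumptions: the statistic is the slope divided by its standard error. The data are invented (not the slide's data), and instead of computing a p-value, which needs the t-distribution itself from a statistics package, this sketch compares the statistic with the 5% critical value from a standard t-table.

```python
import math

# Hypothetical measurements, invented for illustration (not the slide's data).
chilli = [0.0, 1.0, 2.0, 3.0, 4.0]      # grams of chilli
temp = [36.1, 36.5, 36.9, 37.5, 37.8]   # body temperature in °C

# Two-sided 5% critical value of the t-distribution with n - 2 = 3 degrees
# of freedom, taken from a standard t-table.
T_CRITICAL_DF3 = 3.182


def slope_and_standard_error(x, y):
    """Least-squares slope of y on x, and the standard error of that slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = sse / (n - 2)
    return slope, math.sqrt(residual_variance / sxx)


slope, se = slope_and_standard_error(chilli, temp)
t_stat = slope / se   # how many standard errors the slope is away from zero
significant = abs(t_stat) > T_CRITICAL_DF3
print(f"slope = {slope:.2f}, t = {t_stat:.1f}, significant at 5%: {significant}")
```

A statistic far beyond the critical value corresponds to a very small p-value, which is what the regression output reports directly.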
Is there really a relationship (for multiple regression)? $Y = 18.3 + 3.0X_1 - 0.24X_2$. If there were no relationship at all, then these numbers (the 3.0 and the -0.24) would most likely be close to zero. Assuming some things, we can calculate a test statistic that comes from an F distribution, and find a p-value. P-value = 0.0000000000000000000000098, which is very small, so this is a “SIGNIFICANT RELATIONSHIP”.
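For reference, the overall test statistic is usually written as below. This is the standard textbook form, offered as a sketch; the symbols are assumptions introduced here: $n$ is the number of observations, $k$ is the number of explanatory variables ($k = 2$ in the example), $\bar{y}$ is the average outcome, and $\hat{y}_i$ is the formula's answer for observation $i$.

```latex
SS_{\text{total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad
SS_{\text{error}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

F = \frac{\bigl( SS_{\text{total}} - SS_{\text{error}} \bigr) / k}
         {SS_{\text{error}} / (n - k - 1)}
```

If none of the explanatory variables had any relationship with the outcome, and the usual assumptions held, $F$ would follow an F distribution with $k$ and $n - k - 1$ degrees of freedom; a large $F$ gives a small p-value.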
Is there really a relationship with $X_1$? $Y = 18.3 + 3.0X_1 - 0.24X_2$. If there were no relationship with $X_1$, then this number (the 3.0) would most likely be close to zero. P-value = 0.000000000000000000000003, which is very small, so this is a “SIGNIFICANT EFFECT”.
Is there really a relationship with $X_2$? $Y = 18.3 + 3.0X_1 - 0.24X_2$. If there were no relationship with $X_2$, then this number (the -0.24) would most likely be close to zero. P-value = 0.26, which is not small, so this is a “NOT SIGNIFICANT EFFECT”. At this stage you would normally remove $X_2$ from the formula and do a new regression.
How big could the effect be? An “ESTIMATE” question! temp = 36.06 + 0.45(chilli). We are asking what options for this number (the 0.45) are consistent with our data. Assuming some things, we can calculate a confidence interval for this number. The 95% CI is from 0.38 to 0.52.
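A minimal sketch of how such an interval is usually built for a simple linear regression: the slope, plus or minus a t-table value times the slope's standard error. The data are invented (not the slide's data), so the interval printed here will not be the slide's 0.38 to 0.52.

```python
import math

# Hypothetical measurements, invented for illustration (not the slide's data).
chilli = [0.0, 1.0, 2.0, 3.0, 4.0]      # grams of chilli
temp = [36.1, 36.5, 36.9, 37.5, 37.8]   # body temperature in °C

# Two-sided 5% critical value of the t-distribution with n - 2 = 3 degrees
# of freedom, taken from a standard t-table.
T_CRITICAL_DF3 = 3.182


def slope_and_standard_error(x, y):
    """Least-squares slope of y on x, and the standard error of that slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(sse / (n - 2) / sxx)


slope, se = slope_and_standard_error(chilli, temp)
low = slope - T_CRITICAL_DF3 * se    # lower end of the 95% interval
high = slope + T_CRITICAL_DF3 * se   # upper end of the 95% interval
print(f"slope = {slope:.2f}, 95% CI from {low:.2f} to {high:.2f}")
```

With only five made-up points the interval is fairly wide; more data generally shrinks the standard error and narrows the interval.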
DISCLAIMER: There are a whole lot of things you need to check in order to make sure your regression is acceptable statistically (especially if you are using p-values or confidence intervals). I have not mentioned any of these today. You will need to look them up in a book like Medical Statistics at a Glance or Intro Stats.
So this is how you perform regression: Have a “what’s the formula?” question. Collect data. Look at the pattern to choose a formula. Get a computer to calculate the numbers and p-values. Check the p-values. Choose your final formula.
And this is what regression means: It tells you a formula for how to calculate an outcome based on other information. It does NOT tell you if some things CAUSE others, only how to calculate them as accurately as possible. The computer output will tell you p-values and confidence intervals to answer other types of questions.