
1 (Correlation and) (Multiple) Regression Friday 5th March (and Logistic Regression too!)

2 The Shape of Things to Come… (Rest of Module)
Week 8. Morning: Regression. Afternoon: Logistic Regression.
Week 9. Morning: Published Multivariate Analyses. Afternoon: Regression & Logistic Regression (Computing Session).
Week 10. Morning: Log-linear Models. Afternoon: Log-linear Models (Computing Session).
ASSESSMENT D; ASSESSMENT E

3 The Correlation Coefficient (r)
[Scatter plot of Age at First Childbirth against Age at First Cohabitation]
r shows the strength/closeness of a relationship. Here r = 0.5 (or perhaps less…)

4 [Three scatter plots illustrating r = +1, r = -1 and r = 0]

5 Correlation… and Regression
r measures correlation in a linear way… and is connected to linear regression.
More precisely, it is r² (r-squared) that is of relevance: it is the ‘variation explained’ by the regression line, and is sometimes referred to as the ‘coefficient of determination’.
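As a minimal sketch, r and r² can be computed in Python with NumPy; the data below are invented purely for illustration:

```python
# Invented data purely for illustration.
import numpy as np

age_at_cohabitation = np.array([19, 21, 22, 24, 25, 27, 29, 30])
age_at_first_birth = np.array([22, 23, 27, 26, 30, 29, 33, 31])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(age_at_cohabitation, age_at_first_birth)[0, 1]
print(f"r = {r:.3f}")             # strength/closeness of the linear relationship
print(f"r-squared = {r**2:.3f}")  # proportion of 'variation explained'
```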

6 [Scatter plot of y against x with a horizontal line at the mean of y; the arrows show the overall variation, i.e. variation from the mean of y]

7 [The same scatter plot with the regression line added] Some of the overall variation is explained by the regression line (i.e. the arrows tend to be shorter than the dashed lines, because the regression line is closer to the points than the mean line is).

8 [Scatter plot of Length of Residence (y) against Age (x), showing the regression line and an outlier]
y = Bx + C + ε, where B is the slope, C is the constant, and ε is the error term (residual).

9 Choosing the line that best explains the data
Some variation is explained by the regression line; the residuals constitute the unexplained variation.
The regression line is chosen so as to minimise the sum of the squared residuals, i.e. to minimise Σε² (Σ means ‘sum of’).
The full/specific name for this technique is Ordinary Least Squares (OLS) linear regression.
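A minimal sketch of OLS in Python, using the closed-form least-squares estimates of B and C (the data are invented for illustration):

```python
import numpy as np

# Invented data for illustration: Age (x) and Length of Residence (y)
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0])
y = np.array([1.0, 4.0, 3.0, 8.0, 10.0, 9.0, 14.0])

# Closed-form OLS estimates of the slope (B) and constant (C)
B = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
C = y.mean() - B * x.mean()

residuals = y - (B * x + C)  # the epsilons: unexplained variation
print(f"y = {B:.3f}x + {C:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```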

10 Regression assumptions #1 and #2
[Histogram of residuals: a symmetric, bell-shaped distribution centred on ε = 0]
#1: Residuals have the usual symmetric, ‘bell-shaped’ normal distribution.
#2: Residuals are independent of each other.

11 Regression assumption #3
[Two scatter plots of y against x]
Homoscedasticity: the spread of the residuals (ε) stays consistent in size (range) as x increases.
Heteroscedasticity: the spread of the residuals (ε) increases as x increases (or varies in some other way). In this case, use Weighted Least Squares.
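A minimal sketch of Weighted Least Squares using the statsmodels library; the simulated data and the choice of weights (1/x², assuming the residual variance grows with x²) are illustrative assumptions, not part of the slide:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2 * x + 1 + rng.normal(scale=x)  # residual spread grows with x (heteroscedastic)

X = sm.add_constant(x)                        # adds the intercept column
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights assume variance proportional to x**2
print(wls.params)                             # [constant, slope]
```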

12 Regression assumption #4: Linearity! (We’ve already assumed this…)
In the case of a non-linear relationship, one may be able to use a non-linear regression equation, such as: y = B₁x + B₂x² + c
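As an illustrative sketch, a quadratic equation of this form can be fitted with np.polyfit (the data here are invented):

```python
import numpy as np

# Invented data lying roughly on a quadratic curve
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 5.9, 12.2, 19.8, 30.1, 42.0])

B2, B1, c = np.polyfit(x, y, deg=2)  # np.polyfit returns the highest power first
print(f"y = {B1:.2f}x + {B2:.2f}x² + {c:.2f}")
```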

13 Another problem: Multicollinearity
If two ‘independent variables’, x and z, are perfectly correlated (i.e. identical), it is impossible to tell what the B values corresponding to each should be.
e.g. if y = 2x + c, and we add z, should we get:
y = 1.0x + 1.0z + c, or
y = 0.5x + 1.5z + c, or
y = -5001.0x + 5003.0z + c?
The problem applies if two variables are highly (but not perfectly) correlated too…
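A minimal sketch of the problem: when z is almost identical to x, the fitted B values become huge and unstable (all numbers below are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
z = x + rng.normal(scale=1e-6, size=100)         # z is x plus a tiny amount of noise
y = 2 * x + 3 + rng.normal(scale=0.1, size=100)  # 'true' model: y = 2x + 3

X = np.column_stack([x, z, np.ones(100)])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)              # the B values for x and z are huge and unstable
print(np.linalg.cond(X))  # an enormous condition number flags the collinearity
```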

14 Example of Regression (from Pole and Lampard, 2002, Ch. 9)
GHQ = (-0.69 x INCOME) + 4.94
Is -0.69 significantly different from 0 (zero)? A test statistic that takes account of the ‘accuracy’ of the B of -0.69 (by dividing it by its standard error) is t = -2.142.
For this value of t in this example, the significance value is p = 0.038 < 0.05.
r-squared here is (-0.321)² = 0.103 = 10.3%.
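A minimal sketch of this significance test in Python. The standard error is recovered from the slide's own figures (SE = B/t); the degrees of freedom are not given on the slide, so the value 40 below is an assumption for illustration:

```python
from scipy import stats

B = -0.69
t = -2.142
se = B / t                      # standard error implied by the slide: ~0.322
df = 40                         # ASSUMED degrees of freedom; not given on the slide
p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value
print(f"SE = {se:.3f}, p = {p:.3f}")  # p comes out near the slide's 0.038
```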

15 … and of Multiple Regression
GHQ = (-0.47 x INCOME) + (-1.95 x HOUSING) + 5.74
For B = -0.47, t = -1.51 (& p = 0.139 > 0.05)
For B = -1.95, t = -2.60 (& p = 0.013 < 0.05)
The r-squared value for this regression is 0.236 (23.6%).

16 Interaction effects…
[Plot of the square root of length of residence against age, with separate lines for women, all, and men]
In this situation there is an interaction between the effects of age and of gender, so B (the slope) varies according to gender and is greater for women.
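A minimal sketch of fitting such an interaction with statsmodels, using an AGE × SEX product term; the simulated data and the coding (SEX = 1 for women) are assumptions for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
age = rng.uniform(20, 60, 200)
sex = rng.integers(0, 2, 200)  # 1 = women, 0 = men (illustrative coding)
# Simulated outcome: the age slope is steeper when sex == 1
y = 0.1 * age + 0.05 * age * sex + rng.normal(scale=0.5, size=200)

X = sm.add_constant(np.column_stack([age, sex, age * sex]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # the last coefficient is the extra age slope for women
```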

17 Logistic regression and odds ratios
Men: 1967/294 = 6.69 (to 1)
Women: 1980/511 = 3.87 (to 1)
Odds ratio: 6.69/3.87 = 1.73
Men: p/(1-p) = 3.87 x 1.73 = 6.69
Women: p/(1-p) = 3.87 x 1 = 3.87
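These calculations can be reproduced directly (a minimal sketch using the counts from the slide):

```python
men_yes, men_no = 1967, 294
women_yes, women_no = 1980, 511

odds_men = men_yes / men_no         # 6.69 (to 1)
odds_women = women_yes / women_no   # 3.87 (to 1)
odds_ratio = odds_men / odds_women  # 1.73

print(f"odds (men) = {odds_men:.2f}, odds (women) = {odds_women:.2f}")
print(f"odds ratio = {odds_ratio:.2f}")
```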

18 Odds and log odds
Odds = Constant x Odds ratio
Log odds = log(constant) + log(odds ratio)

19 Men: log(p/(1-p)) = log(3.87) + log(1.73)
Women: log(p/(1-p)) = log(3.87) + log(1) = log(3.87)
i.e. log(p/(1-p)) = constant + log(odds ratio)

20 Note that (using natural logarithms): log(3.87) = 1.354, log(6.69) = 1.900, log(1.73) = 0.546, log(1) = 0.
And that the ‘reverse’ of the logarithmic transformation is exponentiation.

21 log(p/(1-p)) = constant + (B x SEX), where B = log(1.73), SEX = 1 for men, and SEX = 0 for women.
Log odds for men = 1.354 + 0.546 = 1.900
Log odds for women = 1.354 + 0 = 1.354
Exp(1.900) = 6.69 & Exp(1.354) = 3.87
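A minimal sketch verifying this arithmetic with Python's math module:

```python
import math

constant = math.log(3.87)          # 1.354
B = math.log(1.73)                 # 0.546

log_odds_men = constant + B * 1    # SEX = 1 for men
log_odds_women = constant + B * 0  # SEX = 0 for women

print(math.exp(log_odds_men))      # ~6.69, the men's odds
print(math.exp(log_odds_women))    # ~3.87, the women's odds
```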

22 Interpreting effects in Logistic Regression
In the above example: Exp(B) = Exp(log(1.73)) = 1.73 (the odds ratio!)
In general, effects in logistic regression analysis take the form of exponentiated Bs (Exp(B)), which are odds ratios. Odds ratios have a multiplicative effect on the odds of the outcome.
Is a B of 0.546 (= log(1.73)) significant? In this case p = 0.000 < 0.05 for this B.

23 Back from odds to probabilities
Probability = Odds / (1 + Odds)
Men: 6.69 / (1 + 6.69) = 0.870
Women: 3.87 / (1 + 3.87) = 0.795
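A minimal sketch of the conversion:

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds (p/(1-p)) back to a probability p."""
    return odds / (1 + odds)

print(odds_to_probability(6.69))  # 0.870 (men)
print(odds_to_probability(3.87))  # 0.795 (women)
```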

24 ‘Multiple’ Logistic regression
log odds = c + (B₁ x SEX) + (B₂ x AGE) = c + (0.461 x SEX) + (-0.099 x AGE)
For B₁ = 0.461, p = 0.000 < 0.05
For B₂ = -0.099, p = 0.000 < 0.05
Exp(B₁) = Exp(0.461) = 1.59
Exp(B₂) = Exp(-0.099) = 0.905
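A minimal sketch of turning these coefficients into predicted probabilities; note that the constant c is not reported on the slide, so the value used below is purely a hypothetical placeholder:

```python
import math

c = 3.0                 # HYPOTHETICAL constant: the slide does not report its value
B1, B2 = 0.461, -0.099  # coefficients for SEX and AGE from the slide

def predicted_probability(sex: int, age: float) -> float:
    log_odds = c + B1 * sex + B2 * age
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(predicted_probability(sex=1, age=30))  # a 30-year-old man
print(predicted_probability(sex=0, age=30))  # a 30-year-old woman
```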

