1
Applying Regression
2
The Course: 14 (or so) lessons. Some flexibility: depends how we feel and what we get through.
3
Part I: Theory of Regression
Models in statistics Models with more than one parameter: regression Samples to populations Introducing multiple regression More on multiple regression
4
Part 2: Application of regression
Categorical predictor variables; Assumptions in regression analysis; Issues in regression analysis; Non-linear regression; Categorical and count variables; Moderators (interactions) in regression; Mediation and path analysis
Part 3: Taking Regression Further (kind of brief): Introducing longitudinal multilevel models
5
Bonuses. Bonus lesson 1: Why is it called regression? Bonus lesson 2: Other types of regression.
6
House Rules: Jeremy must remember not to talk too fast. If you don't understand, ask, any time. If you think I'm wrong, ask. (I'm not always right.)
7
The Assistants Carla Xena - cxenag@essex.ac.uk
Eugenia Suarez Arian
8
Learning New Techniques
Best kind of data to learn a new technique Data that you know well, and understand Your own data In computer labs (esp later on) Use your own data if you like My data I’ll provide you with Simple examples, small sample sizes Conceptually simple (even silly)
9
Computer Programs: Stata, mostly (I'll explain SPSS options, but you'll like Stata more); Excel, for calculations; GPower, semi-optional.
10
Lesson 1: Models in statistics
Models, parsimony, error, mean, OLS estimators
11
What is a Model?
12
What is a model? A representation of reality, not reality itself. A model aeroplane represents a real aeroplane; if the model aeroplane = the real aeroplane, it isn't a model.
13
Statistics is about modelling: representing and simplifying; sifting what is important from what is not important. In statistical models we seek parsimony. Parsimony = simplicity.
14
Parsimony in Science. A model should: 1) explain a lot; 2) use as few concepts as possible. The more it explains, the more you get; the fewer the concepts, the lower the price. Is it worth paying a higher price for a better model?
15
The Mean as a Model
16
The (Arithmetic) Mean. We all know the mean: the 'average', learned about at school. We forget (or never knew) how clever the mean is. The mean is: an Ordinary Least Squares (OLS) estimator, and a Best Linear Unbiased Estimator (BLUE).
17
Mean as OLS Estimator. Going back a step or two: a MODEL was a representation of DATA, and we said we want a model that explains a lot. How much does a model explain? DATA = MODEL + ERROR, so ERROR = DATA − MODEL. We want a model with as little ERROR as possible.
18
What is error?
Data (Y)   Model (b0 = mean)   Error (e)
1.40       1.60                -0.20
1.55       1.60                -0.05
1.80       1.60                 0.20
1.62       1.60                 0.02
1.63       1.60                 0.03
19
How can we calculate the ‘amount’ of error?
Sum of errors? Sum of absolute errors?
20
Are small and large errors equivalent? One error of 4 versus four errors of 1: the same? What happens with different data? Y = (2, 2, 5): b0 = 2 minimises the sum of absolute errors, but is not very representative. Y = (2, 2, 4, 4): b0 = any value from 2 to 4 does equally well. Indeterminate: there are an infinite number of solutions which would satisfy our criterion for minimum error.
21
Sum of squared errors (SSE)
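The formula itself was an image on the original slide; a reconstruction in LaTeX from the definitions above (ERROR = DATA − MODEL):
\[ \mathrm{SSE} = \sum_{i=1}^{N} e_i^{2} = \sum_{i=1}^{N} (Y_i - b_0)^{2} \]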
22
Determinate: if we minimise SSE, we always get one answer, and that answer is the mean. Shown in graph: SSE plotted against b0; the minimum value of SSE occurs when b0 = mean.
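A minimal Stata sketch of the same idea, using Stata's built-in auto data rather than the course data: compute SSE for a few candidate values of b0 and watch the minimum land at the mean.
  sysuse auto, clear
  quietly summarize price
  local m = r(mean)                    // mean of price is about 6165
  foreach b0 of numlist 5000 5500 6000 6165 6500 7000 {
      quietly generate double e2 = (price - `b0')^2
      quietly summarize e2
      display "b0 = " `b0' "   SSE = " %14.0f r(sum)
      drop e2
  }
SSE is smallest at b0 = 6165, the mean.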
24
The Mean as an OLS Estimate
25
Mean as OLS Estimate The mean is an Ordinary Least Squares (OLS) estimate As are lots of other things This is exciting because OLS estimators are BLUE Best Linear Unbiased Estimators Proven with Gauss-Markov Theorem Which we won’t worry about
26
BLUE Estimators Best Minimum variance (of all possible unbiased estimators) Narrower distribution than other estimators e.g. median, mode
27
SSE and the Standard Deviation
Tying up a loose end
28
SSE is closely related to the SD. The sample standard deviation, s, is a biased estimator of the population standard deviation, σ. We need to know the mean to calculate the SD, which reduces N by 1; hence we divide by N − 1, not N. Like losing one df.
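In symbols (reconstructed; the slide's formula was an image):
\[ s = \sqrt{\frac{\sum_i (Y_i - \bar{Y})^{2}}{N-1}} = \sqrt{\frac{\mathrm{SSE}}{N-1}} \]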
29
Proof that the mean minimises SSE: not that difficult, as statistical proofs go. Available in Maxwell and Delaney, Designing Experiments and Analyzing Data; and Judd and McClelland, Data Analysis: A Model Comparison Approach (out of print?).
30
What's a df? The number of values free to vary when one is fixed. The term comes from engineering: the movement available to structures.
31
Back to the data: the mean has 5 (N) df, the 1st moment. s has N − 1 df, because the mean has been fixed: the 2nd moment. Can think of it as the amount cases vary away from the mean.
32
While we are at it … Skewness has N − 2 df: the 3rd moment. Kurtosis has N − 3 df: the 4th moment, the amount cases vary from s.
33
Parsimony and df: the number of df remaining is a measure of parsimony. A model which contained all the data would have 0 df: not a parsimonious model. The normal distribution can be described in terms of the mean and s: 2 parameters (z, with 0 parameters).
34
Summary of Lesson 1 Statistics is about modelling DATA
Models have parameters Fewer parameters, more parsimony, better Models need to minimise ERROR Best model, least ERROR Depends on how we define ERROR If we define error as sum of squared deviations from predicted value Mean is best MODEL
35
Lesson 1a A really brief introduction to Stata
36
[Stata interface: Command window, Review window, Variables list, and Output window]
37
Stata Commands: you can use menus, but commands are easy. All have a similar format: command variables, options. Stata is case sensitive: BEDS, beds and Beds are different. Stata lets you shorten commands: summarize sqft can be su sq.
38
More Stata Commands. Open exercise 1.4.dta and run: summarize sqm; table beds; mean price; histogram price. Or, shortened: su sq; tab be; mean pr; hist pr.
39
Lesson 2: Models with one more parameter - regression
40
In Lesson 1 we said … Use a model to predict and describe data
Mean is a simple, one parameter model
41
More Models Slopes and Intercepts
42
More Models: the mean is OK as far as it goes; it just doesn't go very far. Very simple prediction, using very little information. We often have more information than that, and we want to use it.
43
House Prices Look at house prices in one area of Los Angeles
Predictors of house prices Using: Sale price, size, number of bedrooms, size of lot, year built …
46
House Prices
address                   listprice   beds   baths   sqft
3628 OLYMPIAD Dr          649500      4      3       2575
3673 OLYMPIAD Dr          450000             2       1910
3838 CHANSON Dr           489900                     2856
3838 West 58TH Pl         330000                     1651
3919 West 58TH Pl         349000                     1466
3954 FAIRWAY Blvd         514900             2.25    2018
4044 OLYMPIAD Dr          649000             2.5     3019
4336 DON LUIS Dr          474000                     2188
4421 West 59TH St         460000                     1519
4518 WHELAN Pl            388000             1.5     1403
4670 West 63RD St         259500                     1491
5000 ANGELES VISTA Blvd   678800      5              3808
47
One Parameter Model The mean “How much is that house worth?” $415,689
Use 1 df to say that
48
Adding More Parameters
We have more information than this; we might as well use it. Add a linear function of size (square feet), x1.
49
Alternative Expression
Estimate of Y (expected value of Y) Value of Y
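The two expressions, reconstructed in LaTeX (the slide showed them as images):
\[ \hat{Y} = b_0 + b_1 x_1 \qquad\text{and}\qquad Y_i = b_0 + b_1 x_{i1} + e_i \]
The first gives the estimate (expected value) of Y; the second describes each actual value of Y, error included.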
50
Estimating the Model We can estimate this model in four different, equivalent ways Provides more than one way of thinking about it 1. Estimating the slope which minimises SSE 2. Examining the proportional reduction in SSE 3. Calculating the covariance 4. Looking at the efficiency of the predictions
51
Estimate the Slope to Minimise SSE
52
Estimate the Slope, Stage 1: draw a scatterplot, with the x-axis at the mean, not at zero. Mark the errors on it; these are called 'residuals'. Sum and square them to find SSE.
55
Add another slope to the chart: redraw the residuals and recalculate SSE. Move the line around to find the slope which minimises SSE.
56
First attempt:
57
Any straight line can be defined with two parameters: the location (height) of the line, b0 (sometimes called a), and the gradient of the slope, b1.
58
[Diagram: gradient; the line rises b1 units for each 1 unit increase in x]
59
[Diagram: height; the line meets the y-axis b0 units up]
60
Height is defined as the point where the line hits the y-axis: the constant, or the y-intercept. If we fix the slope to zero, the height becomes the mean; hence the mean is b0.
61
Why the constant? b0 is really b0x0, where x0 is 1.00 for every case, i.e. x0 is constant. It is implicit in Stata (and SPSS, SAS, R); some packages force you to make it explicit. (Later on we'll need to make it explicit.)
62
Why the intercept? Where the regression line intercepts the y-axis
Sometimes called y-intercept
63
Finding the Slope: how do we find the values of b0 and b1? We jiggle the values to find the best estimates, those which minimise SSE: an iterative approach. Computer intensive; that used to matter, but doesn't really any more (with fast computers and sensible search algorithms; more on that later).
64
Start with b0 = 416 (the mean), b1 = 0.5 (a nice round number): SSE = 365,774.
b0 = 300, b1 = 0.5: SSE = 341,683
b0 = 300, b1 = 0.6: SSE = 310,240
b0 = 300, b1 = 0.8: SSE = 264,573
b0 = 300, b1 = 1: SSE = 301,797
b0 = 250, b1 = 1: SSE = 255,366
…..
65
Quite a long time later: b0 = 216, b1 = 1.084, SSE = 145,636.78. This gives the position of the regression line, or line of best fit. Better than guessing. Not necessarily the only method, but it is OLS, so it is the best (it is BLUE).
67
We now know the equation, and it tells us two things: a zero-square-metre house is worth $216,000, and adding a square metre adds $1,080. Don't extrapolate to meaningless values of the x-axis: the constant is not necessarily useful in itself, but it is necessary to estimate the equation.
68
Exercise 2a, 2b
69
Standardised Regression Line
One big but: the slope is scale dependent. Values change (£ to €, inflation); scales change (£, £000, £00?). We need to deal with this.
70
Don’t express in ‘raw’ units
Express in SD units: sx1 = 183.82, sy = , b1 = 1.103. We increase x1 by 1, and Ŷ increases by 1.084 raw units; so if we increase x1 by 1, Ŷ increases by 1.084/sy SDs.
71
Similarly, 1 unit of x1 = 1/69.017 SDs; increase x1 by 1 SD and Ŷ increases by 69.017 × b1 raw units. Put them both together.
72
The standardised regression line
The change (in SDs) in Ŷ associated with a change of 1 SD in x1. A different route to the same answer: standardise both variables (divide by their SDs), then find the line of best fit.
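In symbols (a reconstruction, not the original slide's image): the standardised slope is the raw slope rescaled by the two SDs,
\[ \beta_1 = b_1 \times \frac{s_{x_1}}{s_y} \]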
73
The Correlation Coefficient
The standardised regression line has a special name The Correlation Coefficient (r) (r stands for ‘regression’, but more on that later) Correlation coefficient is a standardised regression slope Relative change, in terms of SDs
74
Exercise 2c
75
Proportional Reduction in Error
76
Proportional Reduction in Error
We might be interested in the level of improvement of the model: how much less error (as a proportion) do we have? This is the Proportional Reduction in Error (PRE). Mean only: Error(model 0) = 341,683. Mean + slope: Error(model 1) = 196,046.
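As an equation (reconstructed), using the numbers above:
\[ \mathrm{PRE} = \frac{\mathrm{Error}_0 - \mathrm{Error}_1}{\mathrm{Error}_0} = \frac{341{,}683 - 196{,}046}{341{,}683} \approx 0.43 \]
This proportion of variance explained is R².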
78
But we squared all the errors in the first place
So we could take the square root This is the correlation coefficient Correlation coefficient is the square root of the proportion of variance explained
79
Standardised Covariance
80
Standardised Covariance
We are still iterating; we need a 'closed-form' equation, one we solve to get the parameter estimates. The answer is a standardised covariance. A variable has variance: an amount of 'differentness'. We have used SSE so far.
81
SSE varies with N: higher N, higher SSE. Divide by N: this gives SSE per person (or house). (Actually N − 1; we have lost a df to the mean.) This gives us the variance, the same as SD². We thought of SSE as a scattergram: Y plotted against X.
83
Or we could plot Y against Y
Axes meet at the mean (415). Draw a square for each point; calculate the area of each square; sum the areas. Sum of areas = SSE; sum of areas divided by N = variance.
84
[Plot of Y against Y]
85
Draw Squares: a point 40.1 above the mean gives area 40.1 × 40.1 = 1608.1; a point 53.9 below the mean (35 − 88.9 = −53.9) gives area (−53.9) × (−53.9) = 2905.2.
86
What if we do the same procedure, but instead of Y against Y, plot Y against X? Draw rectangles (not squares), sum the areas, and divide by N − 1. This gives us the variance of x with y: the covariance, shortened to Cov(x, y).
88
For one point: 55 − 88.9 = −33.9 on y, 1 − 3 = −2 on x, so area = (−33.9) × (−2) = 67.8. For another: 49.1 on y, 4 − 3 = 1 on x, so area = 49.1 × 1 = 49.1.
89
More formally (and easily)
We can state what we are doing as an equation (below), where Cov(x, y) is the covariance. Here Cov(x, y) = 5165. What do points in different sectors do to the covariance?
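The equation, reconstructed from the description (sum the rectangle areas, divide by N − 1):
\[ \mathrm{Cov}(x,y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N-1} \]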
90
Problem with the covariance
The covariance tells us about two things: the variances of X and Y, and the covariation between them. We need to standardise it, like the slope. Two ways to standardise the covariance: standardise the variables first (subtract the mean and divide by the SD), or standardise the covariance afterwards.
91
The first approach is much more computationally expensive: too much like hard work to do by hand, since you need to standardise every value. The second approach is much easier: standardise the final value only. For that we need the combined variance: multiply the two variances and take the square root (since two variances were multiplied in the first place).
92
Standardised covariance
93
The correlation coefficient
A standardised covariance is a correlation coefficient
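That is (reconstructed):
\[ r = \frac{\mathrm{Cov}(x,y)}{s_x s_y} \]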
94
Expanded …
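A reconstruction of the expanded formula (the slide showed an image); substituting the covariance and the two SDs and cancelling the N − 1 terms gives:
\[ r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^{2}\,\sum_i (y_i-\bar{y})^{2}}} \]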
95
This means … We now have a closed form equation to calculate the correlation Which is the standardised slope Which we can use to calculate the unstandardised slope
96
We know that the correlation is the standardised slope; rearranging recovers the unstandardised slope (the slide's formulas, reconstructed below):
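\[ r = b_1 \frac{s_x}{s_y} \qquad\Longrightarrow\qquad b_1 = r \times \frac{s_y}{s_x} \]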
97
So value of b1 is the same as the iterative approach
98
The intercept, just while we are at it. The variables are centred at zero: we subtracted the mean from both variables. The intercept is zero, because the axes cross at the mean.
99
Add the mean of y to the constant: this adjusts for centring y. Subtract the mean of x, but not the whole mean of x: it needs to be corrected for the slope.
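In symbols (reconstructed):
\[ b_0 = \bar{y} - b_1\bar{x} \]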
100
Accuracy of Prediction
101
One More (Last One) We have one more way to calculate the correlation
Looking at the accuracy of the prediction Use the parameters b0 and b1 To calculate a predicted value for each case
102
Plot actual price against predicted price
From the model
104
r = 0.653: the correlation between the actual and the predicted value. Seems a futile thing to do, and at this stage it is. But later on, we will see why.
105
Some More Formulae For hand calculation Point biserial
106
Phi (φ): used for 2 dichotomous variables.
                Vote P    Vote Q
Homeowner       A: 19     B:
Not homeowner   C: 60     D: 53
107
Problem with the phi correlation
Unless Px= Py (or Px = 1 – Py) Maximum (absolute) value is < 1.00 Tetrachoric correlation can be used to correct this Rank (Spearman) correlation Used where data are ranked
108
Summary: the mean is an OLS estimate, and OLS estimates are BLUE. The regression line is the best prediction of the outcome from the predictor, and an OLS estimate (like the mean). The standardised regression line is a correlation.
109
Four ways to think about a correlation
1. Standardised regression line 2. Proportional Reduction in Error (PRE) 3. Standardised covariance 4. Accuracy of prediction
110
Regression and Correlation in Stata
  correlate x y
  correlate x y, cov
  regress y x
e.g. regress price sqm
111
Post-Estimation: Stata commands 'leave behind' something, so you can run post-estimation commands; they mean 'from the last regression'. Get predicted values: predict my_preds. Get residuals: predict my_res, residuals (residuals comes after the comma, so it's an option).
112
Graphs. Scatterplot: scatter price beds. Regression line: lfit price beds. Both: twoway (scatter price beds) (lfit price beds).
113
What happens if you run reg without a predictor?
regress price
114
Exercises
115
Lesson 3: Samples to Populations – Standard Errors and Statistical Significance
116
The Problem: in the social sciences we investigate samples. Theoretically: randomly taken from a specified population; every member has an equal chance of being sampled; sampling one member does not alter the chances of sampling another. Not the case in (say) physics, biology, etc.
117
But it's the population that we are interested in, not the sample. A population statistic is represented with a Greek letter; a hat means 'estimate'.
118
Sample statistics (e.g. mean) estimate population parameters
Want to know Likely size of the parameter If it is > 0
119
Sampling Distribution
We need to know the sampling distribution of a parameter estimate How much does it vary from sample to sample If we make some assumptions We can know the sampling distribution of many statistics Start with the mean
120
Sampling Distribution of the Mean
Given a normal distribution, a random sample, and continuous data, the mean has a known sampling distribution: repeated sampling will give a known distribution of means, centred around the true (population) mean (μ).
121
Analysis Example: Memory
Difference in memory for different words 10 participants given a list of 30 words to learn, and then tested Two types of word Abstract: e.g. love, justice Concrete: e.g. carrot, table
123
Confidence Intervals: if we know the mean in our sample, we can estimate where the mean in the population (μ) is likely to be, using the standard error (SE) of the mean, which represents the standard deviation of the sampling distribution of the mean.
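In symbols (a reconstruction; the slide's formula was an image), for sample SD s:
\[ se(\bar{Y}) = \frac{s}{\sqrt{N}}, \qquad 95\%\ \mathrm{CI} = \bar{Y} \pm t_{0.975,\,N-1} \times se(\bar{Y}) \]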
124
1 SD contains 68% Almost 2 SDs contain 95%
125
We know the sampling distribution of the mean: t-distributed if N < 30; normal with large N (> 30), i.e. asymptotically normal. So we know the range within which means from other samples will fall, and therefore the likely range of μ.
126
Two implications of equation
Increasing N decreases the SE, but only slowly (the SE halves if N is 4 times bigger). Decreasing the SD decreases the SE. Calculate confidence intervals from standard errors; 95% is a standard level of CI. In 95% of samples, the CI will contain the true mean. In large samples: 95% CI = 1.96 SE. In smaller samples it depends on the t distribution (df = N − 1 = 9).
129
What is a CI? (For a 95% CI): is there a 95% chance that the true (population) value lies within the confidence interval? No. Rather: for 95% of samples, the confidence interval will contain the true mean.
130
Significance Test: test whether μ is a certain value, almost always 0 (though it doesn't have to be). We want to test the hypothesis that the difference is equal to 0, i.e. find the probability of this difference occurring in our sample IF μ = 0. (Not the same as the probability that μ = 0.)
131
Calculate SE, and then t t has a known sampling distribution
Can test probability that a certain value is included
132
Other Parameter Estimates
Same approach Prediction, slope, intercept, predicted values At this point, prediction and slope are the same Won’t be later on One predictor only More complicated with > 1
133
Testing the Degree of Prediction
Prediction is correlation of Y with Ŷ The correlation – when we have one IV Use F, rather than t Started with SSE for the mean only This is SStotal Divide this into SSresidual SSregression SStot = SSreg + SSres
135
Back to the house prices
Original SSE (SStotal) = . SSresidual = : what is left after our model. SSregression = SStotal − SSresidual = : what our model explains.
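The decomposition and test, reconstructed in LaTeX (k = number of predictors):
\[ SS_{\mathrm{tot}} = SS_{\mathrm{reg}} + SS_{\mathrm{res}}, \qquad F = \frac{SS_{\mathrm{reg}}/k}{SS_{\mathrm{res}}/(N-k-1)} \]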
137
F = 18.6, df = 1, 25, p = 0.0002. H0: the prediction is not better than chance. We can reject H0: a significant effect.
138
Statistical Significance: What does a p-value (really) mean?
139
A Quiz Six questions, each true or false
Write down your answers (if you like) An experiment has been done. Carried out perfectly. All assumptions perfectly satisfied. Absolutely no problems. P = 0.01 Which of the following can we say?
140
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
141
2. You have found the probability of the null hypothesis being true.
142
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
143
4. You can deduce the probability of the experimental hypothesis being true.
144
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
145
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
146
OK, What is a p-value Cohen (1994)
“[a p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe it does” (p 997).
147
OK, What is a p-value Sorry, didn’t answer the question
It’s “The probability of obtaining a result as or more extreme than the result we have in the study, given that the null hypothesis is true” Not probability the null hypothesis is true
148
A Bit of Notation: not because we like notation, but because we have to say a lot less. Probability: P. Null hypothesis is true: H. Result (data): D. Given: |.
149
What's a p-value? P(D|H): the probability of the data occurring if the null hypothesis is true. Not P(H|D), the probability that the null hypothesis is true given that we have the data, which is what we want to know. P(H|D) ≠ P(D|H).
150
What is the probability you are prime minister, given that you are British: P(M|B)? Very low. What is the probability you are British, given that you are prime minister: P(B|M)? Very high. P(M|B) ≠ P(B|M).
151
There's been a murder: someone murdered an instructor (perhaps they talked too much). The police have DNA, and they have your DNA. They match(!). The DNA matches 1 in 1,000,000 people. What's the probability you didn't do the murder, given the DNA match: P(H|D)?
152
The police say: P(D|H) = 1/1,000,000. Luckily, you have Jeremy on your defence team. We say: P(D|H) ≠ P(H|D). The probability that someone who didn't do the murder matches the DNA is incredibly high.
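Formally this is Bayes' theorem (not on the slide, but it is the machinery underneath):
\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} \]
For illustration, with an assumed population of 60 million: about 60 people would match the DNA, so the probability of innocence given a match is roughly 59/60, nothing like 1/1,000,000.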
153
Back to the Questions: Haller and Krauss (2002) asked those questions of groups in Germany: psychology students; psychology lecturers and professors (who didn't teach stats); psychology lecturers and professors (who did teach stats).
154
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). 'True': 34% of students; 15% of professors/lecturers; 10% of professors/lecturers teaching statistics. False: we have found evidence against the null hypothesis.
155
2. You have found the probability of the null hypothesis being true. 'True': 32% of students; 26% of professors/lecturers; 17% of professors/lecturers teaching statistics. False: we don't know.
156
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means). 20% of students 13% of professors/lecturers 10% of professors/lecturers teaching statistics False
157
You can deduce the probability of the experimental hypothesis being true.
59% of students 33% of professors/lecturers 33% of professors/lecturers teaching statistics False
158
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. 'True': 68% of students; 67% of professors/lecturers; 73% of professors/lecturers teaching statistics. False.
159
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. 'True': 41% of students; 49% of professors/lecturers; 37% of professors/lecturers teaching statistics. False: another tricky one, though it (the probability of replication) can be worked out.
160
One Last Quiz: I carry out a study. All assumptions perfectly satisfied; a random sample from the population. I find p = 0.05. You replicate the study exactly. What is the probability you find p < 0.05?
161
Again: I carry out a study; all assumptions perfectly satisfied; a random sample from the population. This time I find p = 0.01. You replicate the study exactly. What is the probability you find p < 0.05?
162
Significance testing creates boundaries and gaps where none exist.
Significance testing means that we find it hard to build upon knowledge; we don't get an accumulation of knowledge.
163
Yates (1951): "the emphasis given to formal tests of significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the results of the tests of significance ... and too little to the estimates of the magnitude of the effects they are investigating."
164
Testing the Slope Same idea as with the mean
Estimate 95% CI of slope Estimate significance of difference from a value (usually 0) Need to know the SD of the slope Similar to SD of the mean
166
Similar(ish) to the equation for the SD of the mean. Then we need the standard error; when we have the standard error, we can go on to the 95% CI and the significance of the difference.
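One standard way to write it (a reconstruction; the slide's equation was an image):
\[ se(b_1) = \frac{\sqrt{SS_{\mathrm{res}}/(N-2)}}{\sqrt{\sum_i (x_i-\bar{x})^{2}}} \]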
168
Confidence Limits: the 95% CI, or 95% confidence limits. The critical t with N − k − 1 df is 2.31; the limits are the estimate ± 2.31 standard errors (here 5.24 × 2.31 ≈ 12.06 either side).
169
Significance of difference from zero
i.e. probability of getting result if b=0 Not probability that b = 0 This probability is (of course) the same as the value for the prediction
170
Testing the Standardised Slope (Correlation)
Correlation is bounded between –1 and +1 Does not have symmetrical distribution, except around 0 Need to transform it Fisher z’ transformation – approximately normal
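The transformation and its standard error (reconstructed):
\[ z' = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad se(z') = \frac{1}{\sqrt{N-3}} \]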
171
95% CIs: 0.879 − 1.96 × 0.38 = 0.13; 0.879 + 1.96 × 0.38 = 1.62
172
Transform back to correlation
95% CIs = 0.13 to 0.92 Very wide Because of small sample size Maybe that’s why CIs are not reported?
173
Using Excel: FISHER() carries out the Fisher transformation; FISHERINV() transforms back to a correlation.
174
The Others Same ideas for calculation of CIs and SEs for
Predicted score Gives expected range of values given X Same for intercept But we have probably had enough
175
One more tricky thing (Don’t worry if you don’t understand)
For means, regression estimates, etc., the 95% confidence intervals and the p-value match: if the CI excludes zero, p < 0.05, and vice versa.
176
For correlations, odds ratios, etc., the 95% CIs and the p-value no longer match. The sampling distribution of the mean does not depend on the value; the sampling distribution of a correlation or proportion does depend on the value: there is more certainty around 0.9 than around 0.00.
177
Lesson 4: Introducing Multiple Regression
178
Residuals: we said Y = b0 + b1x1. We could have said Yi = b0 + b1xi1 + ei. We ignored the i on the Y, and we ignored the ei; it's called error, after all. But it isn't just error: it's trying to tell us something.
179
What Error Tells Us: error tells us that a case has a different score for Y than we predict; there is something about that case. It is called the residual: what is left over, after the model. It contains information: something is making the residual non-zero. But what?
182
The residual (+ the mean) is the expected value of Y if all cases were equal on X. It is the value of Y, controlling for X. Other words: holding constant; partialling; residualising (residualised scores); conditioning on.
183
Sometimes adjustment is enough on its own: measure performance against criteria. Teenage pregnancy: measure pregnancy and abortion rates in areas; control for socio-economic deprivation, religion, rural/urban, and anything else important; see which areas have lower teenage pregnancy and abortion rates, given the same level of deprivation. Value-added education tables: measure school performance, controlling for initial intake.
184
Sqm     Price   Predicted   Residual   Adj value (mean + resid)
239.2   605.0   475.77       129.23    544.8
177.4   400.0   408.78        -8.78    406.8
265.3   529.5   504.08        25.42    441.0
153.4   315.0   382.69       -67.69    347.9
136.2   341.0   364.05       -23.05    392.6
187.5   525.0   419.66       105.34    520.9
280.5   585.0   520.51        64.49    480.1
203.3   430.0   436.79        -6.79    408.8
141.1   436.0   369.39        66.61    482.2
130.3   390.0   357.70        32.30    447.9
185
Control? In experimental research: use experimental control, e.g. same conditions, materials, time of day, accurate measures, random assignment to conditions. In non-experimental research: we can't use experimental control; use statistical control instead.
186
Analysis of Residuals What predicts differences in crime rate
After controlling for socio-economic deprivation Number of police? Crime prevention schemes? Rural/Urban proportions? Something else This is (mostly) what multiple regression is about
187
Exam performance: consider the number of books a student read (books) and the number of lectures (max 20) a student attended (attend). Books and attend as IVs, grade as the outcome.
188
First 10 cases
189
Use books as IV: R = 0.492, F = 12.1, df = 1, 28, p = 0.001; b0 = 52.1, b1 = 5.7 (the intercept makes sense). Use attend as IV: R = 0.482, F = 11.5, df = 1, 38, p = 0.002; b0 = 37.0, b1 = 1.9 (the intercept makes less sense).
190
[Scatterplot: Grade (out of 100) against Books]
192
Problem Use R2 to give proportion of shared variance
Books = 24% Attend = 23% So we have explained 24% + 23% = 47% of the variance NO!!!!!
193
Look at the correlation matrix
         BOOKS   ATTEND   GRADE
BOOKS    1
ATTEND   0.44    1
GRADE    0.49    0.48     1
The correlation of books and attend is (unsurprisingly) not zero. Some of the variance that books shares with grade is also shared by attend.
194
I have access to 2 cars. My wife has access to 2 cars. Do we have access to four cars? No: we need to know how many of my 2 cars are shared. Similarly with regression. We can do this with the residuals: residuals are what is left after (say) books; see if the residual variance is explained by attend. We can use this new residual variance to calculate SSres, SStotal and SSreg.
195
Well. Almost. This would give us correct values for the SS, but it would not be correct for the slopes, etc., because it assumes that the variables have a causal priority. Why should attend have to take what is left from books? Why should books have to take what is left by attend? Use OLS again, and share out the variance they share; see the sketch below.
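A minimal Stata sketch of the residual idea just described (variable names as in the grades example):
  regress grade books             // model 1: books only
  predict res_books, residuals    // what books leaves behind
  regress res_books attend        // does attend explain the leftovers?
As noted above, this gets the sums of squares right but not the slopes, because books is given causal priority.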
196
Simultaneously estimate 2 parameters
b1 and b2: Y = b0 + b1x1 + b2x2, where x1 and x2 are IVs with shared variance. We are not trying to fit a line any more; we are trying to fit a plane. Can solve iteratively; closed-form equations are better, but they are unwieldy.
197
[3D scatterplot of y against x1 and x2 (2 points only)]
198
[Regression plane for y on x1 and x2, showing b0, b1 and b2]
200
Increasing Power What if the predictors don’t correlate?
Regression is still good It increases the power to detect effects (More on power later) Less variance left over When do we know the two predictors don’t correlate?
201
(Really) Ridiculous Equations
202
The good news: there is an easier way. The bad news: it involves matrix algebra. (We don't really need to know how to do it.)
203
We’re not programming computers
So we usually don’t care Very, very occasionally it helps to know what the computer is doing
204
Back to the Good News: we can calculate the standardised parameters as B = Rxx⁻¹Rxy, where B is the vector of regression weights, Rxx⁻¹ is the inverse of the correlation matrix of the independent (x) variables, and Rxy is the vector of correlations of the x variables with y.
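A minimal Stata sketch (using the grades variables; names assumed):
  quietly correlate books attend grade
  matrix C = r(C)
  matrix Rxx = C[1..2, 1..2]      // correlations among the predictors
  matrix Rxy = C[1..2, 3]         // correlations of predictors with outcome
  matrix B = invsym(Rxx) * Rxy    // the standardised weights
  matrix list B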
205
Exercise 4.2
206
Exercises. Exercise 4.1: grades data in Excel. Exercise 4.2: repeat in Stata. Exercise 4.3: zero correlation. Exercise 4.4: repeat therapy data. Exercise 4.5: PTSD in families.
207
Lesson 5: More on Multiple Regression
208
Contents: more on parameter estimates (standard errors of coefficients); R, R², adjusted R²; extra bits: suppressors, decisions about control variables, standardised estimates > 1, variable entry techniques.
209
More on Parameter Estimates
210
Parameter Estimates Parameter estimates (b1, b2 … bk) were standardised Because we analysed a correlation matrix Represent the correlation of each IV with the outcome When all other IVs are held constant
211
Can also be unstandardised
Unstandardised represent the unit (rather than SD’s) change in the outcome associated with a 1 unit change in the IV When all the other variables are held constant Parameters have standard errors associated with them As with one IV Hence t-test, and associated probability can be calculated Trickier than with one IV
212
Standard Error of Regression Coefficient
The standardised form is easier. R²i is the value of R² when all the other predictors are used as predictors of that variable. Note that if R²i = 0, the equation is the same as for the one-predictor case.
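A common form of the standardised-coefficient standard error (a reconstruction, not the slide's image):
\[ se(\beta_i) = \sqrt{\frac{1-R^{2}}{(1-R_i^{2})(N-k-1)}} \]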
213
Multiple R
214
Multiple R The degree of prediction
R (or multiple R) is no longer equal to β. R² might be equal to the sum of the squared βs, but only if all the x's are uncorrelated.
215
In Terms of Variance Can also think of R2 in terms of variance explained. Each IV explains some variance in the outcome The IVs share some of their variance Can’t share the same variance twice
216
Variance in Y accounted for by x1: r²x1y = 0.36. The total variance of Y = 1.
217
In this model, R² = r²yx1 + r²yx2 = 0.36 + 0.36 = 0.72. But if x1 and x2 are correlated, this is no longer the case.
218
Variance in Y accounted for by x1: r²x1y = 0.36. Variance in Y accounted for by x2: r²x2y = 0.36. Variance shared between x1 and x2 (not equal to rx1x2). The total variance of Y = 1.
219
We can no longer sum the r²s: we need to sum them and subtract the shared variance. But it's not simply the correlation between them; it's their overlap as a proportion of the variance of Y. There are two different ways to write this (below).
220
Based on estimates: if rx1x2 = 0, then rxy = βx1, and the result is equivalent to r²yx1 + r²yx2.
221
Based on correlations: if rx1x2 = 0, equivalent to r²yx1 + r²yx2.
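The two forms, reconstructed for two predictors:
\[ R^{2} = \beta_1 r_{yx_1} + \beta_2 r_{yx_2} = \frac{r_{yx_1}^{2} + r_{yx_2}^{2} - 2\,r_{yx_1} r_{yx_2} r_{x_1x_2}}{1 - r_{x_1x_2}^{2}} \]
Set rx1x2 = 0 and both collapse to r²yx1 + r²yx2.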
222
Can also be calculated using methods we have seen
Based on PRE (predicted value) Based on correlation with prediction Same procedure with >2 IVs
223
Adjusted R²: R² is on average an overestimate of the population value of R². Any x will not correlate exactly 0 with Y in a sample; any variation away from 0 increases R, and variation from 0 is more pronounced with lower N. We need to correct R²: adjusted R².
224
Calculation of adjusted R²: 1 − R² is the proportion of unexplained variance. We multiply this by an adjustment: more variables, greater adjustment; more people, less adjustment.
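The usual formula (reconstructed; k = number of predictors):
\[ R^{2}_{\mathrm{adj}} = 1 - (1-R^{2})\,\frac{N-1}{N-k-1} \]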
226
Extra Bits: some stranger, counter-intuitive things that can happen.
227
Suppressor variables: can be hard to understand; very counter-intuitive. Definition: a predictor which increases the size of the parameters associated with other predictors above the size of their correlations.
228
An example (based on Horst, 1941)
Success of trainee pilots Mechanical ability (x1), verbal ability (x2), success (y) Correlation matrix
229
Mechanical ability correlates 0.3 with success
Verbal ability correlates 0.0 with success What will the parameter estimates be? (Don’t look ahead until you have had a guess)
230
Mechanical ability: b = 0.4, larger than r! Verbal ability: b = −0.2, smaller than r!! So what is happening? You need verbal ability to do the mechanical ability test, but verbal ability is not actually related to success; the measure of mechanical ability is contaminated by verbal ability.
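The arithmetic behind this, using the standardised two-predictor solution and assuming the mechanical-verbal correlation was 0.5 (the correlation matrix itself was an image, but 0.5 reproduces the slide's estimates):
\[ \beta_1 = \frac{r_{y1} - r_{y2} r_{12}}{1 - r_{12}^{2}} = \frac{0.3 - 0}{0.75} = 0.4, \qquad \beta_2 = \frac{0 - 0.3 \times 0.5}{0.75} = -0.2 \]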
231
High mech, low verbal: the mech part is positive (0.4), and the low (negative, in standardised terms) verbal score times the negative coefficient, −(−0.2), is also positive: your mech is really high, since you did well on the mechanical test without being good at the words. High mech, high verbal: you had a head start on mech because of verbal, and need to be brought down a bit.
232
Another suppressor? b1 = b2 =
233
Another suppressor? b1 =0.26 b2 = -0.06
234
And another? b1 = b2 =
235
And another? b1 = 0.53 b2 = -0.47
236
One more? b1 = b2 =
237
One more? b1 = 0.53 b2 = 0.47
238
Suppression happens when two opposing forces are happening together
And have opposite effects. Don't throw away your IVs just because they are uncorrelated with the outcome. Be careful in the interpretation of regression estimates: you really need the correlations too, to interpret what is going on, and you cannot compare between studies with different predictors. Think about what you want to know before throwing variables into the analysis.
239
What to Control For? What is the added value of a ‘better’ college
In terms of salary More academic people go to ‘better’ colleges Control for: Ability? Social class? Mother’s education? Parent’s income? Course? Ethnic group? …
240
Decisions about control variables should be guided by theory. Effect of gender: controlling for hair length and skirt wearing?
242
Do dogs make kids healthier? What to control for? Parent's weight? Yes: obese parents are more likely to have obese kids, so what matters is whether kids are thin relative to their parents. No: the dog might make the parent thinner too; by controlling for parental weight, you're controlling away part of the effect of the dog.
244
[Path diagram: Dog → Kid's health, with examples of good and bad control variables]
245
[Path diagram: Dog → Kid's health; candidate controls: parent weight, child asthma, rural/urban?, house/apartment?, income]
246
Standardised Estimates > 1
Correlations are bounded -1.00 ≤ r ≤ +1.00 We think of standardised regression estimates as being similarly bounded But they are not Can go >1.00, <-1.00 R cannot, because that is a proportion of variance
247
Three measures of ability
Mechanical ability, verbal ability 1, verbal ability 2 Score on science exam Before reading on, what are the parameter estimates?
248
Mechanical: about where we expect. Verbal 1: very high. Verbal 2: very low.
249
What is going on? It's a suppressor again: a predictor which increases the size of the parameters associated with other predictors above the size of their correlations. Verbal 1 and verbal 2 are correlated so highly that they need to cancel each other out.
250
Variable Selection What are the appropriate predictors to use in a model? Depends what you are trying to do Multiple regression has two separate uses Prediction Explanation
251
Prediction: what will happen in the future? Emphasis on practical application; variables selected (more) empirically; value free. Explanation: why did something happen? Emphasis on understanding phenomena; variables selected theoretically; not value free.
252
Visiting the doctor precedes suicide attempts: it predicts suicide, but does not explain suicide. More on causality later on … Which are the appropriate variables to collect data on? To include in the analysis? The decision needs to be based on: theoretical knowledge of the behaviour of those variables; statistical analysis of those variables (later), unless you didn't collect the data; common sense (not a useful thing to say).
253
Variable Entry Techniques
Entry-wise All variables entered simultaneously Hierarchical Variables entered in a predetermined order Stepwise Variables entered according to change in R2 Actually a family of techniques
254
Hierarchical regression
Entrywise regression: all variables entered simultaneously, all treated equally. Hierarchical regression: variables entered in a theoretically determined order; the change in R² is assessed and tested for significance. E.g. sex and age should not be treated equally with other variables: sex and age MUST come first (they are unchangeable). Not to be confused with hierarchical linear modelling (MLM).
255
R-Squared Change: SSE0 and df0 are the SSE and df for the first (smaller) model; SSE1 and df1 are the SSE and df for the second (larger) model.
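The test of the change (reconstructed):
\[ F_{\mathrm{change}} = \frac{(SSE_0 - SSE_1)/(df_0 - df_1)}{SSE_1/df_1} \]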
256
Stepwise Example Variables entered empirically
The variable which increases R² the most goes first, then the next, and so on; variables which have no effect can be removed from the equation. Example: house prices. What's important? Size, lot size, list price, …
257
Stepwise Analysis Data determines the order
Model 1: listing price, R2 = 0.87 Model 2: listing price + lot size, R2 = 0.89
258
Hierarchical analysis
Theory determines the order. Model 1: lot size + house size, R² = 0.499. Model 2: + list price, R² = 0.905. Change in R² = 0.41, p < 0.001.
259
Which is the best model? Entrywise: OK. Stepwise: excluded size, the MOST IMPORTANT PREDICTOR. Hierarchical: listing price accounted for additional variance; whoever decides the price has information that we don't. Other problems with stepwise: F and df are wrong (it cheats with df); unstable results: small changes (sampling variance) produce large differences in models.
260
Uses a lot of paper Don’t use a stepwise procedure to pack your suitcase
261
Is Stepwise Always Evil?
Yes. (All right, no, provided that:) the research goal is entirely predictive (technological), not explanatory (scientific): what happens, not why; N is large (40 people per predictor: Cohen, Cohen, West, and Aiken, 2003); and cross-validation takes place.
262
Alternatives to stepwise regression
More recently developed, used for genetic studies: 1000s of predictors, one outcome, small samples. LARS (least angle regression); lasso (least absolute shrinkage and selection operator).
263
Entry Methods in Stata. Entrywise: what regress does. Hierarchical: two ways. One is hireg, an add-on module (net search hireg, then install); the other is test (below).
264
Hierarchical Regression
Use (on one line): hireg outcome (block1var1 block1var2) (block2var1 block2var2). hireg reports the parameter estimates for the two regressions, the R² for each model, and the change in R².
265
[hireg output: R, F(df) and p for model 1 (df 1,98) and model 2 (df 2,97); R² change = 0.136 with F(1,97) change and its p: one p-value for the R², one for the change in R²]
266
Hierarchical Regression (Cont…)
I don't like hireg, for two reasons: it's different to regress, and it only works for OLS regression, not logistic, multinomial, Poisson, etc. Alternative 2: use test. The p-value associated with the change in R² for a variable is equal to the p-value for that variable.
267
Hierarchical Regression (Cont…)
Example (using cars): hireg price () (extro). Parameters from the final model: [output columns: Coef., Std. Err., t, P>|t|, 95% Conf. Interval for extro]. R² change statistics: [R² change, F(1,36) change, p]. (What is the relationship between t and F?) We know the p-value of the R² change when there is one predictor in the block. What about when there's more than one?
268
Hierarchical Regression (Cont)
test isn’t exactly what we want But it is the same as what we want Advantage of test You can always use it (I can always remember how it works)
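A hypothetical example (variable names invented): block 1 is x1, block 2 adds x2 and x3.
  regress y x1 x2 x3
  test x2 x3      // joint F test that both block-2 coefficients are zero;
                  // its p-value matches the p for the change in R2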
269
(For SPSS) SPSS calls them ‘blocks’
Enter some variables, click ‘next block’ Enter more variables Click on ‘Statistics’ Click on R-squared change
270
Stepwise Regression: add the stepwise: prefix, with pr() (probability value for removal from the equation) and pe() (probability value for entry into the equation). E.g.: stepwise, pe(0.05) pr(0.2): reg price sqm lotsize originallis
271
A quick note on R2 R2 is sometimes regarded as the ‘fit’ of a regression model Bad idea If good fit is required – maximise R2 Leads to entering variables which do not make theoretical sense
272
Propensity Scores: another method of controlling for variables. Ensure that the control variables are uncorrelated with the predictor of interest; then you don't need to control for them.
273
x’s Uncorrelated? Two cases when x’s are uncorrelated
Experimental design Predictors are uncorrelated We randomly assigned people to conditions to ensure that was the case Sample weights We can deliberately sample Ensure that they are uncorrelated
274
E.g. sample 20 women with a college degree, 20 women without, 20 men with, and 20 men without. Or use post hoc sample weights: propensity weighting, weighting to ensure that the variables are uncorrelated. Usually done to avoid having to control, e.g. ethnic differences in PTSD symptoms; it can incorporate many more control variables (100+).
275
Propensity Scores Race profiling of police stops
Same time, place, area, etc
276
Critique of Multiple Regression
Goertzel (2002) “Myths of murder and multiple regression” Skeptical Inquirer (Paper B1) Econometrics and regression are ‘junk science’ Multiple regression models (in US) Used to guide social policy
277
More Guns, Less Crime. Lott and Mustard: a 1% increase in gun ownership (controlling for other factors) gives a 3.3% decrease in murder rates. But: more guns in the rural Southern US; more crime in the urban North (crack cocaine epidemic at the time of the data).
278
Executions Cut Crime? No difference in crime between US states with or without the death penalty. Ehrlich (1975) controlled for all variables that affect crime rates: the death penalty had an effect in reducing the crime rate. There is no statistical way to decide who's right.
279
Legalised Abortion. Donohue and Levitt (1999): legalised abortion in the 1970s cut crime in the 1990s. Lott and Whitley (2001): "Legalising abortion decreased murder rates by … 0.5 to 7 per cent." It's impossible to model these data while controlling for other historical events: crack cocaine (again).
280
Crime is still dropping in the US
Despite the recession Levitt says it’s mysterious, because the abortion effect should be over Some suggest Xboxes, Playstations, etc Netflix, DVRs (Violent movies reduce crime).
281
Another Critique: Berk (2003), Regression Analysis: A Constructive Critique (Sage). Three cheers for regression as a descriptive technique; two cheers as an inferential technique; one cheer as causal analysis.
282
Is Regression Useless? No: do regression carefully. Don't go beyond data you have a strong theoretical understanding of. Validate models: where possible, validate the predictive power of models in other areas, times, and groups; particularly important with stepwise.
283
Lesson 6: Categorical Predictors
284
Introduction
285
Introduction So far, just looked at continuous predictors
Also possible to use categorical (nominal, qualitative) predictors e.g. Sex; Job; Religion; Region; Type (of anything) Usually analysed with t-test/ANOVA
286
Historical Note But these (t-test/ANOVA) are special cases of regression analysis Aspects of General Linear Models (GLMs) So why treat them differently? Fisher’s fault Computers’ fault Regression, as we have seen, is computationally difficult Matrix inversion and multiplication Can’t do it, without a computer
287
In the special cases where:
You have one categorical predictor Your IVs are uncorrelated It is much easier to do it by partitioning of sums of squares These cases Very rare in ‘applied’ research Very common in ‘experimental’ research Fisher worked at Rothamsted agricultural research station Never have problems manipulating wheat, pigs, cabbages, etc
288
In psychology this led to a split between 'experimental' and 'correlational' psychologists; experimental psychologists (until recently) would not think in terms of continuous variables. It is still (too) common to dichotomise a variable when it seems too difficult to analyse properly: equivalent to discarding 1/3 of your data.
289
The Approach
290
The Approach: recode the nominal variable into one or more variables that represent it. The names are slightly confusing: some texts use 'dummy coding' to refer to all of these techniques; some (most) use it to refer to one of them; most techniques have more than one name.
291
If a variable has g possible categories it is represented by g-1 variables
Simplest case: Smokes: Yes or No Variable 1 represents ‘Yes’ Variable 2 is redundant If it isn’t yes, it’s no
292
The Techniques
293
We will examine two coding schemes
Dummy coding For two groups For >2 groups Effect coding Look at analysis of change Equivalent to ANCOVA Pretest-posttest designs
294
Dummy Coding – 2 Groups Sometimes called ‘simple coding’
A categorical variable with two groups One group chosen as a reference group The other group is represented in a variable e.g. 2 groups: Experimental (Group 1) and Control (Group 0) Control is the reference group Dummy variable represents experimental group Call this variable ‘group1’
295
For variable ‘group1’ 1 = ‘Yes’, 0=‘No’
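A minimal Stata sketch (variable names assumed):
  generate group1 = (group == 1)   // 1 = experimental, 0 = control
  regress score group1
_cons is the control-group mean; the coefficient on group1 is the difference between the two group means.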
296
Some data Group is x, score is y
297
Control group = 0: the intercept = the score on Y when x = 0 = the mean of the control group. Experimental group = 1: b = the change in Y when x increases 1 unit = the difference between the experimental and control groups.
298
Gradient of slope represents difference between means
299
Dummy Coding – 3+ Groups: with three groups the approach is similar.
g = 3, therefore g-1 = 2 variables needed 3 Groups Control Experimental Group 1 Experimental Group 2
300
Recoded into two variables
Note – do not need a 3rd variable If we are not in group 1 or group 2 MUST be in control group 3rd variable would add no information (What would happen to determinant?)
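A minimal Stata sketch (variable names assumed), letting Stata build the dummies:
  tabulate group, generate(g)   // creates g1, g2, g3
  regress score g2 g3           // omit g1: the control group is the reference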
301
The F and its associated p test the H0 that the groups do not differ. b1 and b2 and their associated p-values test the difference between each experimental group and the reference group. To test the difference between the experimental groups, you need to rerun the analysis (or just do ANOVA with post-hoc tests).
302
One more complication: we have now run multiple comparisons, which increases α, i.e. the probability of a Type I error. We need to correct for this: Bonferroni correction; multiply the given p-values by two or three (depending on how many comparisons were made).
303
Effect Coding Usually used for 3+ groups
Compares each group (except the reference group) to the mean of all groups Dummy coding compares each group to the reference group. Example with 5 groups 1 group selected as reference group Group 5
304
Each group (except reference) has a variable
1 if the individual is in that group 0 if not -1 if in reference group
305
Examples Dummy coding and Effect Coding
Group 1 chosen as reference group each time Data
306
Dummy coding:             Effect coding:
Group  dummy2  dummy3     Group  effect2  effect3
1      0       0          1      -1       -1
2      1       0          2       1        0
3      0       1          3       0        1
307
Dummy: R = 0.543, F = 5.7, df = 2, 27, p = 0.009; b0 = 52.4; b1 = 3.9, p = 0.100; b2 = 7.7, p = 0.002.
Effect: R = 0.543, F = 5.7, df = 2, 27, p = 0.009; b0 = 56.27; b1 = 0.03, p = 0.980; b2 = 3.8, p = 0.007.
308
In Stata: use the xi: prefix for dummy coding, or the xi3 module for more codings. But I don't like it; I do it by hand: I don't understand what it's doing, it makes very long variable names, and then I can't use test. BUT: if doing stepwise, you need to keep the variables together. Example: xi: reg outcome contpred i.catpred (put i. in front of categorical predictors). This has changed in Stata 11: xi: is no longer needed.
309
xi: reg salary i.job_description
[Output: Coef., Std. Err., t and P>|t| for _Ijob_desc~2, _Ijob_desc~3 and _cons]
310
Exercise 6.1 5 golf balls Which is best?
311
In SPSS: SPSS provides two equivalent procedures for regression, Regression and GLM. GLM will: automatically code categorical variables; automatically calculate interaction terms; allow you to not understand. GLM won't: give standardised effects, or give hierarchical R² p-values.
312
ANCOVA and Regression
313
Test (Which is a trick; but it’s designed to make you think about it) Use bank data (Ex 5.3) Compare the pay rise (difference between salbegin and salary) For ethnic minority and non-minority staff What do you find?
314
ANCOVA and Regression Dummy coding approach has one special use
In ANCOVA, for the analysis of change Pre-test post-test experimental design Control group and (one or more) experimental groups Tempting to use difference score + t-test / mixed design ANOVA Inappropriate
315
Salivary cortisol levels
Used as a measure of stress Not absolute level, but change in level over day may be interesting Test at: 9.00am, 9.00pm Two groups High stress group (cancer biopsy) Group 1 Low stress group (no biopsy) Group 0
316
Correlation of AM and PM = 0.493 (p=0.008)
Has there been a significant difference in the rate of change of salivary cortisol? 3 different approaches
317
Approach 1 – find the differences, do a t-test
t = 1.31, df=26, p=0.203 Approach 2 – mixed ANOVA, look for interaction effect F = 1.71, df = 1, 26, p = 0.203 F = t2 Approach 3 – regression (ANCOVA) based approach
318
Approach 3: IVs are AM and group; the outcome is PM. b1 (group) = 3.59, standardised b1 = 0.432, p = 0.01. Why is the regression approach better? The other two approaches took the difference, which assumes that r = 1.00; any departure from r = 1.00 and you add error variance. Subtracting error is the same as adding error.
319
Using regression ensures that all the variance that is subtracted is true variance, which reduces the error variance. Two effects: it adjusts the means (compensates for differences between groups), and it removes error variance. [Graph: AM against PM cortisol.]
320
More on Change: if the difference score is correlated with either pre-test or post-test, subtraction fails to remove the difference between the scores. If the two scores are uncorrelated, the difference will be correlated with both: a failure to control. With equal SDs and r = 0, the correlation of change and pre-score = 0.707 (in absolute value).
321
Even More on Change A topic of surprising complexity
What I said about difference scores isn’t always true Lord’s paradox – it depends on the precise question you want to answer Collins and Horn (1993). Best methods for the analysis of change Collins and Sayer (2001). New methods for the analysis of change More later
322
Lesson 7: Assumptions in Regression Analysis
323
The Assumptions: 1. The distribution of residuals is normal (at each value of the outcome). 2. The variance of the residuals for every set of values for the predictor is equal; violation is called heteroscedasticity. 3. The error term is additive; no interactions. 4. At every value of the outcome the expected (mean) value of the residuals is zero; no non-linear relationships.
324
5. The expected correlation between residuals, for any two cases, is 0: the independence assumption (lack of autocorrelation). 6. All predictors are uncorrelated with the error term. 7. No predictors are a perfect linear function of other predictors (no perfect multicollinearity). 8. The mean of the error term is zero.
325
What are we going to do? Deal with some of these assumptions in some detail; deal with others in passing only, and look at them again later on.
326
Assumption 1: The Distribution of Residuals is Normal at Every Value of the outcome
327
Look at Normal Distributions
A normal distribution symmetrical, bell-shaped (so they say)
328
What can go wrong? Skew: non-symmetricality, one tail longer than the other. Kurtosis: too flat or too peaked; kurtosed. Outliers: individual cases which are far from the distribution.
329
Effects on the mean. Skew: biases the mean, in the direction of the skew. Kurtosis: the mean is not biased, but the standard deviation is, and hence the standard errors and significance tests.
330
Examining Univariate Distributions
Graphs Histograms Boxplots P-P plots Calculation based methods
331
Histograms A and B
332
C and D
333
E & F
334
Histograms can be tricky ….
335
Boxplots
336
P-P Plots A & B
337
C & D
338
E & F
339
Calculation Based Skew and Kurtosis statistics
Outlier detection statistics
340
Skew and Kurtosis Statistics
A normal distribution has skew = 0 and kurtosis = 0. Two methods of calculation, Fisher's and Pearson's, give very similar answers. The associated standard error can be used for a significance test (t-test) of departure from normality, but this is not actually very useful: nothing passes as normal above N = 400 or so.
342
Outlier Detection: calculate the distance from the mean, as a z-score (number of standard deviations) or a deleted z-score (that case biased the mean, so remove it first). Look up the expected distance from the mean: e.g. what proportion of cases should be 3+ SDs out?
343
Non-Normality in Regression
344
Effects on OLS Estimates
The mean is an OLS estimate; the regression line is an OLS estimate. Lack of normality biases the position of the regression slope and makes the standard errors wrong, and hence the probability values attached to statistical significance wrong.
345
Checks on Normality: check the residuals are normally distributed. Draw a histogram of the residuals; use regression diagnostics (there are lots of them; most aren't very interesting).
346
Regression Diagnostics
Residuals: standardised, studentised-deleted; look for cases > |3| (?). Influence statistics: look for the effect a case has. If we remove that case, do we get a different answer? DFBeta, standardised DFBeta: changes in b.
347
DfFit, standardised DfFit: the change in the predicted value. Distances: measures of 'distance' from the centroid; some include the IV, some don't.
348
More on Residuals Residuals are trickier than you might have imagined
Raw residuals OK Standardised residuals Residuals divided by SD
349
Standardised / Studentised
Now we can calculate the standardised residuals SPSS calls them studentised residuals Also called internally studentised residuals
350
Deleted Studentised Residuals
Studentised residuals do not have a known distribution Cannot use them for inference Deleted studentised residuals Externally studentised residuals Studentized (jackknifed) residuals Distributed as t With df = N – k – 1
351
Testing Significance: we can calculate the probability of a residual; is it sampled from the same population? BUT: massive Type I error rate. Bonferroni correct it: multiply the p-value by N.
352
Bivariate Normality
We didn't just say "residuals normally distributed"; we said "at every value of the outcome". Two variables can each be normally distributed (univariate) but not bivariate normal.
353
Couple’s IQs male and female Seem reasonably normal
354
But wait!!
355
When we look at bivariate normality (so plot X against Y), it is not normal: there is an outlier.
A case can be OK for bivariate normality but still be a multivariate outlier. We would need to draw a graph in 3+ dimensions, and we can't draw a graph in more than 3 dimensions. But we can look at the residuals instead …
356
IQ histogram of residuals
357
Multivariate Outliers …
Will be explored later in the exercises So we move on …
358
What to do about Non-Normality
Skew: much easier to deal with. Kurtosis: less serious anyway.
Transform the data to remove skew:
positive skew – log transform
negative skew – square
359
Transformation
May need to transform the IV and/or the outcome; more often the outcome. Time, income, symptoms (e.g. depression) are all positively skewed.
Skew can cause non-linear effects (more later) if only one variable is transformed.
Transformation alters the interpretation of the unstandardised parameter, and may alter the meaning of the variable. Some people say this is such a big problem that you should never transform.
May add / remove non-linear and moderator effects.
360
Change measures: increase sensitivity at the extremes of the range, avoiding floor and ceiling effects.
Outliers can be tricky. Why did the outlier occur?
Error? Delete it.
Weird person? Probably delete it.
Normal person? Tricky.
361
You are trying to model a process
Is the data point 'outside' the process? e.g. lottery winners, when looking at salary; a yawn, when looking at reaction time.
Which is better?
A good model which explains 99% of your data (because we threw the outliers out)?
A poor model which explains all of it (because we kept the outliers in)?
I prefer a good model.
362
More on House Prices
Zillow.com tracks and predicts house prices in the USA. It sometimes detects outliers: "We don't trust this selling price, so we haven't used it."
363
Example in Stata
reg salary educ
predict res, res
hist res
gen logsalary = log(salary)
reg logsalary educ
predict logres, res
hist logres
366
But … the parameter estimates change, and the interpretation of the parameter estimates is different.
Exercise 7.0, 7.1
367
Bootstrapping Bootstrapping is very, very cool And very, very clever
But very, very simple
368
Bootstrapping
When we estimate a test statistic (F or r or t or χ²), we rely on knowing the sampling distribution, which we know if the distributional assumptions are satisfied.
369
Estimate the Distribution
Bootstrapping lets you skip the bit about distributional assumptions and estimate the sampling distribution from the data. This shouldn't be allowed, but it is. Hence bootstrapping.
370
How to Bootstrap
We resample, with replacement:
Take our sample.
Sample 1 individual.
Put that individual back, so that they can be sampled again.
Sample another individual.
Keep going until we've sampled as many people as were in the sample.
Analyze the data.
Repeat the process B times, where B is a big number.
371
Example: case numbers drawn in each bootstrap sample
Original  B1  B2  B3
1         1   1   2
2         1   2   2
3         3   3   3
4         3   4   2
5         3   4   4
6         3   4   4
7         7   8   6
8         7   8   7
9         9   9   9
10        9   10  9
372
Analyze each dataset: this gives the sampling distribution of the statistic.
Two approaches to CIs or p-values.
Semi-parametric: calculate the standard deviation of the bootstrap statistics and call that the standard error. Makes no assumption about the distribution of the data, but does make an assumption about the sampling distribution.
373
Non-parametric (Stata calls this percentile): count. If you have 1000 samples, the 25th is the lower CI limit and the 975th is the upper CI limit; the p-value is the proportion of estimates that cross zero. Non-parametric needs more samples.
374
Bootstrapping in Stata
Very easy: use the bootstrap: (or bs: or bstrap:) prefix, or (better) use the vce(bootstrap) option.
By default it does 50 samples; not enough. Use reps(): at least 1000.
375
Example Again
reg salary salbegin educ, vce(bootstrap, reps(50))
[Output: observed coefficient, bootstrap Std. Err. and z for salbegin and educ]
376
More Reps
1,000 reps: z = 17.31; again: z = 17.59.
10,000 reps: 17.23; 17.02.
377
Exercise 7.2, 7.3
378
Assumption 2: The variance of the residuals for every set of values for the predictor is equal.
379
Heteroscedasticity
This assumption is about heteroscedasticity of the residuals. Hetero = different; scedastic = scattered. We don't want heteroscedasticity; we want our data to be homoscedastic. Draw a scatterplot to investigate.
381
A scatterplot only works with one IV; otherwise we would need every combination of IVs. Easy to get around: use the predicted values, and plot them against the residuals. A bit like turning the scatterplot so that the line of best fit is flat.
382
Good – no heteroscedasticity
383
Bad – heteroscedasticity
384
Testing Heteroscedasticity
White's test:
1. Do the regression, save the residuals.
2. Square the residuals.
3. Square the IVs.
4. Calculate the interactions of the IVs, e.g. x1·x2, x1·x3, x2·x3.
385
5. Run a regression using the squared residuals as the outcome, and the IVs, squared IVs, and interactions as the IVs.
Test statistic = N × R², distributed as χ² with df = k (for the second regression).
Use education and salbegin to predict salary (employee data.sav): R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.001.
Automatic in Stata: estat imtest, white
386
Plot of Predicted and Residual
387
White’s Test as Test of Interest
Possible to have a theory that predicts heteroscedasticity Lupien, et al, 2006 Heteroscedasticity in relationship of hippocampal volume and age
388
Magnitude of Heteroscedasticity
Chop the data into 5 "slices"; calculate the variance of each slice; check the ratio of the smallest to the largest. Less than 5: OK.
389
gen slice = 1
replace slice = 2 if pred > 30000
replace slice = 3 if pred > 60000
replace slice = 4 if pred > 90000
replace slice = 5 if pred > …
bysort slice: su pred
Slice 1: 3954; slice 5: 17116. (Doesn't look too bad, thanks to the skew in the predictors.)
390
Dealing with Heteroscedasticity
Use Huber-White (robust) estimates, also called sandwich estimates, also called empirical estimates. Or use survey techniques. Relatively straightforward in SAS and Stata, fiddly in SPSS (Google: SPSS Huber-White).
391
Why's it a Sandwich? The usual SE is based on (X'X)^-1 alone. In sketch form, the sandwich estimator is (X'X)^-1 [X' diag(e_i^2) X] (X'X)^-1: the two (X'X)^-1 terms are the bread, and the middle term, built from the squared residuals, is the meat.
392
Example
reg salary educ
Standard errors: 204, 2821
reg salary educ, robust
Standard errors: 267, 3347
SEs usually go up; they can go down.
393
Heteroscedasticity – Implications and Meanings
What happens as a result of heteroscedasticity? The parameter estimates are correct (not biased); the standard errors (and hence the p-values) are incorrect.
394
However … if there is no skew in the predicted scores, the p-values are only a tiny bit wrong. If skewed, the p-values can be very wrong.
Exercise 7.4
395
Robust SE Haiku
T-stat looks too good.
Use robust standard errors.
Significance gone.
396
Meaning
What is heteroscedasticity trying to tell us? Our model is wrong: it is misspecified. Something important is happening that we have not accounted for. e.g. the amount of money given to charity (given) depends on:
earnings
the degree of importance the person assigns to the charity (import)
397
Do the regression analysis
R² = 0.60, p < 0.001: seems quite good.
b0 = 0.24, p = 0.97
b1 = 0.71, p < 0.001
b2 = 0.23, p = 0.031
White's test: χ² = 18.6, df = 5, p = 0.002
The plot of predicted values against residuals …
398
Plot shows heteroscedastic relationship
399
Which means … the effects of the variables are not additive
If you think that what a charity does is important you might give more money how much more depends on how much money you have
401
One more thing about heteroscedasticity
it is the equivalent of homogeneity of variance in ANOVA/t-tests
402
Exercise 7.4, 7.5, 7.6
403
Assumption 3: The Error Term is Additive
404
Additivity
What heteroscedasticity shows you: the effects of the variables need to be additive (we assume no interaction between the variables).
Heteroscedasticity doesn't always show it to you. You can test for it, but it is hard work (same as the homogeneity of covariance assumption in ANCOVA). You have to know it from your theory: it is a specification error.
405
Additivity and Theory
Two IVs. Alcohol has a sedative effect: a bit makes you a bit tired, a lot makes you very tired. Some painkillers have a sedative effect too. But a bit of alcohol and a bit of painkiller doesn't just make you a bit tired: the effects multiply together, they don't add together.
406
If you don't test for it, it's very hard to know that it will happen. There are so many possible non-additive effects that we cannot test for all of them; we can test for the obvious ones. In medicine, choose to test for salient non-additive effects, e.g. sex, race. More on this when we look at moderators.
407
Exercise 7.6 Exercise 7.7
408
Assumption 4: At every value of the outcome the expected (mean) value of the residuals is zero
409
Linearity
Relationships between variables should be linear: best represented by a straight line. Not a very common problem in the social sciences: our measures are not sufficiently accurate (too much measurement error) for it to make a difference, and R² is too low. Unlike, say, physics.
410
Relationship between speed of travel and fuel used
411
R² = 0.938: looks pretty good. Know the speed, make a good prediction of the fuel used.
BUT look at the chart: if we know the speed we can make a perfect prediction of the fuel used. R² should be 1.00.
412
Detecting Non-Linearity
Residual plot, just like heteroscedasticity. In this example it is very, very obvious; usually it is pretty obvious.
413
Residual plot
414
Linearity: A Case of Additivity
Linearity = additivity along the range of the IV. Jeremy rides his bicycle harder: the increase in speed depends on the current speed. Not additive: multiplicative. MacCallum and Mar (1995). Distinguishing between moderator and quadratic effects in multiple regression. Psychological Bulletin.
415
Assumption 5: The expected correlation between residuals, for any two cases, is 0. The independence assumption (lack of autocorrelation).
416
Independence Assumption
Also called lack of autocorrelation. A tricky one, often ignored; it exists for almost all tests. All cases should be independent of one another: knowing the value of one case should not tell you anything about the value of other cases.
417
How is it Detected?
It can be difficult: you need some clever statistics (multilevel models). Better off avoiding situations where it arises, or handling it when it does arise. Residual plots can help.
418
Residual Plots
Were the data collected in time order? If so, plot ID number against the residuals and look for any pattern: a linear relationship, a non-linear relationship, heteroscedasticity.
420
How does it arise? Two main ways.
Time-series analyses, when cases are time periods: the weather on Tuesday and the weather on Wednesday are correlated; inflation in 1972 and inflation in 1973 are correlated.
Clusters of cases: patients treated by three doctors, children from different classes, people assessed in groups.
421
Why does it matter? Standard errors can be wrong, therefore significance tests can be wrong. Parameter estimates can be wrong: really, really wrong, even flipping from positive to negative.
An example: students do an exam (on statistics) and choose one of three questions. IV: time; outcome: grade.
422
Result, with line of best fit
423
The result shows that people who spent longer in the exam achieve better grades.
BUT … we haven't considered which question people answered. We might have violated the independence assumption: the outcome will be autocorrelated. Look again with the questions marked.
424
Now somewhat different
425
Now, people who spent longer got lower grades. The questions differed in difficulty: do a hard one, get a better grade; and if you can do it, you can do it quickly.
426
Dealing with Non-Independence
For time series data Time series analysis (another course) Multilevel models (hard, some another course) For clustered data Robust standard errors Generalized estimating equations Multilevel models
427
Cluster Robust Standard Errors
Predictor: School size Outcome: Grades Sample: 20 schools 20 children per school What is the N?
428
Robust Standard Errors
The sample is 400 children. Is it really 400? Not really. Each child adds information: the first child in a school adds lots of information about that school, the 100th child adds less. How much less depends on how similar the children in the school are. Is it 20 schools, then? It's more than 20.
429
Robust SE in Stata
Very easy:
reg outcome predictor, robust cluster(clusterid)
BUT: only to be used where clustering is a nuisance only. It only adjusts the standard errors, not the parameter estimates, so it is only to be used where the parameter estimates shouldn't be affected by the clustering.
430
Example of Robust SE Effects of incentives for attendance at adult literacy class Some students rewarded for attendance Others not rewarded 152 classes randomly assigned to each condition Scores measured at mid term and final
431
Example of Robust SE
Naïve: reg postscore tx midscore
Est: …, SE: …
Clustered: reg postscore tx midscore, robust cluster(classid)
Est: …, SE: …
432
Problem with Robust Estimates
It only corrects the standard error; it does not correct the estimate. The other predictors must be uncorrelated with the predictors of group membership, or the estimates are wrong. Two alternatives: generalized estimating equations (GEE) and multilevel models.
433
Independence + Heteroscedasticity
The assumption is that the residuals are independently and identically distributed: i.i.d. The same procedure is used for both problems; really, it is the same problem.
434
Exercise 7.9, exercise 7.10
435
Assumption 6: All predictor variables are uncorrelated with the error term.
436
Uncorrelated with the Error Term
A curious assumption: by definition, the residuals are uncorrelated with the predictors (try it and see, if you like). What it really says is that there are no other important predictors: none that correlate with the predictors and the error, i.e. that have an effect.
437
A problem in economics: demand increases supply, supply increases wages, higher wages increase demand. OLS estimates will be (badly) biased in this case; we need a different estimation procedure:
two-stage least squares
simultaneous equation modelling
instrumental variables
438
Another Haiku
Supply and demand:
without a good instrument,
not identified.
439
Assumption 7: No predictors are a perfect linear function of other predictors (no perfect multicollinearity).
440
No Perfect Multicollinearity
IVs must not be linear functions of one another. If they are, the matrix of correlations of the IVs is not positive definite: it cannot be inverted, and the analysis cannot proceed. We have seen this with age, age at start, and time working (you can't have all three in the model); it also occurs with a subscale and the total in the model at the same time.
441
Large amounts of collinearity: sometimes a problem (as we shall see), but not an assumption.
Exercise 7.11
442
Assumption 8: The mean of the error term is zero.
You will like this one.
443
Mean of the Error Term = 0; mean of the residuals = 0.
That is what the constant is for: if the mean of the error term deviates from zero, the constant soaks it up. (Note: Greek letters, because we are talking about population values.)
444
Can do regression without the constant
Usually a bad idea. E.g. R² = 0.995, p < 0.001: looks good, but only because R² is calculated differently when the constant is suppressed.
446
Lesson 8: Issues in Regression Analysis
Things that alter the interpretation of the regression equation
447
The Four Issues Causality Sample sizes Collinearity Measurement error
448
Causality
449
What is a Cause?
There is debate about the definition of cause; some statistics (and philosophy) books try to avoid it completely. We are not going into depth, just going to show why it is hard. Two dimensions of cause:
ultimate versus proximal
determinate versus probabilistic
450
Proximal versus Ultimate
Why am I here?
I walked here, because …
this is the location of the class, because …
Eric Tanenbaum asked me, because …
(I don't know), because …
I was in my office when he rang, because …
I was a lecturer at Derby University, because …
I saw an advert in the paper, because …
451
… I exist, because my parents met, because my father had a job …
Proximal cause: the direct and immediate cause of something. I fell off my bicycle because of the bump.
Ultimate cause: the thing that started the process off. I fell off because I was going too fast.
452
Determinate versus Probabilistic Cause
Why did I fall off my bicycle? I was going too fast. But every time I ride too fast, I don't fall off: a probabilistic cause.
Why did my tyre go flat? A nail was stuck in my tyre. Every time a nail sticks in my tyre, the tyre goes flat: a deterministic cause.
453
Can get into trouble by mixing them together
Eating deep-fried Mars Bars and doing no exercise are causes of heart disease. "My Grandad ate three deep fried Mars Bars every day, and the most exercise he ever got was when he walked to the shop next door to buy one." This (deliberately?) confuses deterministic and probabilistic causes.
454
Criteria for Causation
Association (correlation)
Direction of influence (a → b)
Isolation (not c → a and c → b)
455
Association
Correlation does not mean causation, as we all know. But causation does mean correlation. We need to show that two things are related: it may be a correlation, or it may be a regression when controlling for a third (or more) factor.
456
Relationship between price and sales
Suppliers may be cunning: when people want something more, they stick the price up. So: no relationship between price and sales.
457
Until (of course) we control for demand: b1 (price) = …, b2 (demand) = 0.94. But which variables do we enter?
458
Direction of Influence
Relationship between A and B: three possible processes.
A → B: A causes B
A ← B: B causes A
A ← C → B: C causes both A and B
459
How do we establish the direction of influence?
Longitudinally? The barometer drops, then the storm arrives. Now if we could just get that barometer needle to stay where it is … This is where the role of theory comes in (more on this later).
460
Isolation
Isolate the outcome from all other influences, as experimenters try to do. We cannot do this, but we can statistically isolate the effect, using multiple regression.
461
Role of Theory
Strong theory is crucial to making causal statements. Fisher said that to make causal statements you should "make your theories elaborate": don't rely purely on statistical analysis. You need strong theory to guide the analyses. This is what critics of non-experimental research don't understand.
462
S.J. Gould, a critic, says: correlate the price of petrol and his age for the last 10 years and you find a correlation. Ha! (he says) that doesn't mean there is a causal link. Of course not! (we say). No social scientist would do that analysis without first thinking (very hard) about the possible causal relations between the variables of interest; we would control for time, prices, etc …
463
Gould says "most correlations are non-causal" (1982, p. 243). Of course!!!!
Atkinson, et al. (1996): the relationship between college grades and the number of hours worked is a negative correlation. We need to control for other variables: ability, intelligence.
464
I drink a lot of beer →
karaoke; jokes (about statistics); children wake me early; time in the bathroom; headache; sleeping; equations (on a beermat); laughing; thirst; fried breakfast; no beer left; curry; chips; falling over; losing keys; curtains closed.
16 causal relations give 120 non-causal correlations (every pair of consequences correlates: 16 × 15 / 2 = 120).
465
Abelson (1995) elaborates on this
The "method of signatures": a collection of correlations relating to the process, the "signature" of the process. e.g. tobacco smoking and lung cancer: can we account for all of these findings with any other theory?
466
The longer a person has smoked cigarettes, the greater the risk of cancer.
The more cigarettes a person smokes over a given time period, the greater the risk of cancer.
People who stop smoking have lower cancer rates than do those who keep smoking.
Smokers' cancers tend to occur in the lungs, and to be of a particular type.
Smokers have elevated rates of other diseases.
People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer.
Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers.
Non-smokers who live with smokers have elevated cancer rates.
(Abelson, 1995)
467
In addition, there should be no anomalous correlations: if smokers had more fallen arches than non-smokers, that would not be consistent with the theory.
Failure to use theory to select appropriate variables is a specification error. e.g. in the earlier example, predict wealth from price and sales: increase the price and wealth increases; increase the sales and wealth increases.
468
Sometimes these are indicators of the process, not the process itself
e.g. the barometer: stopping the needle won't help. e.g. inflation: an indicator or a cause of economic health?
469
No Causation without Experimentation
Blatantly untrue: I don't doubt that the sun shining makes us warm. Why the aversion? Pearl (2000) says the problem is that there is no mathematical operator for causation (as "=" is for equality). No one realised that you needed one, until you build a robot.
470
AI and Causality
A robot needs to make judgements about causality, so it needs a mathematical representation of causality. Suddenly, a problem: one doesn't exist. Most operators are non-directional; causality is directional.
471
“How many subjects does it take to run a regression analysis?”
Sample Sizes
472
Introduction
Social scientists don't worry enough about the sample size required. "Why didn't you get a significant result?" "I didn't have a large enough sample." Not a common answer, but a very common reason. More recently, awareness of sample size has been increasing: use too few and there is no point doing the research; use too many and you waste people's time.
473
Research funding bodies and ethical review panels have both become more interested in sample size calculations. We will look at two approaches: rules of thumb (quite quickly) and power analysis (more slowly).
474
Rules of Thumb
Lots of simple rules of thumb exist, e.g. 10 cases per IV, and at least 100 cases. Green (1991) is more sophisticated: to test the significance of R², N = 50 + 8k; to test the significance of the slopes, N = 104 + k. Rules of thumb don't take into account all the information that we have; power analysis does.
475
Introducing Power Analysis
A hypothesis test tells us the probability of a result of that magnitude occurring if the null hypothesis is correct (i.e. there is no effect in the population). It doesn't tell us the probability of that result if the null hypothesis is false (i.e. there actually is an effect in the population).
476
According to Cohen (1982) all null hypotheses are false
Everything that might have an effect does have an effect; it is just that the effect is often very tiny.
477
Type I Errors
A Type I error is a false rejection of H0. The probability of making a Type I error is α: the significance cut-off, usually 0.05 (by convention). It is always this value; it is not affected by the sample size or the type of test.
478
Type II Errors
A Type II error is a false acceptance of the null hypothesis. Much, much trickier: we think we have some idea, and we almost certainly don't. Example: I do an experiment (random sampling, all assumptions perfectly satisfied) and I find p = 0.05.
479
You repeat the experiment exactly, with a different random sample from the same population. What is the probability you will find p < 0.05? Answer: 0.5.
Another experiment: I find p = 0.01. The probability you find p < 0.05? Answer: 0.79.
Very hard to work out, and not intuitive: you need to understand non-central sampling distributions (more in a minute).
480
The probability of a Type II error = beta (β), the same symbol as the population regression parameter (to be confusing). Power = 1 – β: the probability of getting a significant result (given that there is an effect to be found).
481
State of the World versus Research Findings:
                               H0 true (no effect to be found)   H0 false (effect to be found)
We find no effect (p > 0.05)   correct                           Type II error, p = β
We find an effect (p < 0.05)   Type I error, p = α               correct; power = 1 – β
482
Four parameters in power analysis:
α: the probability of a Type I error
β: the probability of a Type II error (power = 1 – β)
effect size: the size of the effect in the population
N
Know any three and you can calculate the fourth. We will look at them one at a time.
483
α: Probability of Type I Error
Usually set to 0.05. Somewhat arbitrary; sometimes adjusted because of circumstances, rarely because of power analysis. You may want to adjust it, based on a power analysis.
484
β: Probability of Type II Error
Power (the probability of finding a result) = 1 – β. The standard is 80%; some argue for 90%. The 80% standard implies that a Type I error is 4 times more serious than a Type II error; you can adjust the ratio with a compromise power analysis.
485
Effect Size in the Population
The most problematic to determine. Three ways:
What effect size would be useful to find? (A tiny R² is probably no use.)
Base it on previous research: what have other people found?
Use Cohen's conventions: small R² = 0.02, medium R² = 0.13, large R² = 0.26.
486
Effect size is usually measured as f².
For R²: f² = R² / (1 – R²)
487
For (standardised) slopes: f² = sr² / (1 – R²), where sr² is the contribution to the variance accounted for by the variable of interest, i.e. sr² = R² (with the variable) – R² (without): the change in R² in a hierarchical regression.
488
N: the sample size. We usually use the other three parameters to determine this; sometimes we adjust the other parameters (α) based on it, e.g. "You can have 50 participants. No more."
489
Doing Power Analysis
With a power analysis program: SamplePower, GPower (free), nQuery. Or with the Stata command sampsi, which I find very confusing, but we'll use it anyway.
490
sampsi
Limited in usefulness: a categorical, two-group predictor.
sampsi 0 0.5, pre(1) r01(0.5) n1(50) sd(1)
Find the power for detecting an effect of 0.5, when there is one other variable at baseline which correlates 0.5, with 50 people in each group, when the sd is 1.0.
491
sampsi output
Method: ANCOVA
relative efficiency = 1.143
adjustment to sd = …
adjusted sd1 = …
Estimated power: power = …
492
GPower Better for regression designs
495
Underpowered Studies Research in the social sciences is often underpowered Why? See Paper B11 – “the persistence of underpowered studies”
496
Extra Reading Power traditionally focuses on p values What about CIs?
Paper B8 – “Obtaining regression coefficients that are accurate, not simply significant”
497
Exercise 8.1
498
Collinearity
499
Collinearity as Issue and Assumption
Collinearity (multicollinearity): the extent to which the predictors are (multiply) correlated. If the R² for any IV, using the other IVs, = 1.00, we have perfect collinearity: the variable is a linear sum of the other variables, and the regression will not proceed (SPSS will arbitrarily throw out a variable).
500
If R² < 1.00 but high, other problems may arise. Four things to look at in collinearity: meaning, implications, detection, actions.
501
Meaning of Collinearity
Literally "co-linearity": lying along the same line. Perfect collinearity is when some IVs predict another: Total = S1 + S2 + S3 + S4, so S1 = Total – (S2 + S3 + S4). Rare.
502
Less than perfect collinearity: when some IVs are close to predicting the other IVs. The correlations between IVs are high (usually, but not always); the multiple correlations are high.
503
Implications
Collinearity affects the stability of the parameter estimates, and so the standard errors of the parameter estimates, and so the significance tests and CIs. Why? Because of shared variance, which the regression procedure doesn't know where to put.
504
Sex differences due to genetics? due to upbringing?
(almost) perfect collinearity statistically impossible to tell
505
When collinearity is less than perfect, it increases the variability of the estimates between samples: the estimates are unstable, which is reflected in the variances, and hence the standard errors.
506
Detecting Collinearity
Look at the parameter estimates: large standardised parameter estimates (> 0.3?) which are not significant should make you suspicious. Run a series of regressions: each IV as the outcome, with all the other IVs as the IVs.
507
Sounds like hard work? SPSS does it for us: ask for collinearity diagnostics.
Tolerance: calculated for every IV (1 minus the R² for predicting that IV from the others).
Variance Inflation Factor (VIF = 1 / tolerance): its square root is the amount the s.e. has been increased.
508
Actions
What can you do about collinearity? There is "no quick fix" (Fox, 1991).
Get new data: avoids the problem by addressing the question in a different way. e.g. find people who have been raised as the "wrong" gender; they exist, but are rare. Not a very useful suggestion.
509
Collect more data: not different data, more data. Collinearity increases the standard error (se); the se decreases as N increases, so get a bigger N.
Remove / combine variables: if an IV correlates highly with the other IVs, it is not telling us much new. If you have two (or more) IVs which are very similar (e.g. two measures of depression, socio-economic status, achievement, etc.) …
510
… sum them, average them, or remove one. With many measures, use principal components analysis to reduce them.
Use stepwise regression (or some flavour of it): see the previous comments; it can be useful in a theoretical vacuum.
Ridge regression: not very useful, behaves weirdly.
511
Exercise 8.2, 8.3, 8.4
512
Measurement Error
513
What is Measurement Error
In social science, it is unlikely that we measure any variable perfectly; measurement error represents this imperfection. We assume that we have a true score T, and a measure of that score, x.
514
x = T + e, just like a regression equation. Standardise the parameters, and the amount of variance in x which comes from T is the reliability. But, like a regression equation, we assume that e is random and has a mean of zero (more on that later).
515
Simple Effects of Measurement Error
Measurement error lowers the measured correlation between two variables. The real correlation is between the true scores (x* and y*); the measured correlation is between the measured scores (x and y).
516
rxy: the measured correlation of x and y
rx*y*: the true correlation of x* and y*
rxx: the reliability of x
ryy: the reliability of y
517
Attenuation of correlation: rxy = rx*y* × √(rxx × ryy)
Attenuation-corrected correlation: rx*y* = rxy / √(rxx × ryy)
518
Example
519
Complex Effects of Measurement Error
Really horribly complex. Measurement error reduces correlations, and reduces the estimate of b; but reducing one estimate increases the others, because the effects of control combine with the effects of suppressor variables. There is an exercise to examine this.
520
Dealing with Measurement Error
Attenuation correction: very dangerous, not recommended. Better to avoid the problem in the first place: use reliable measures, don't discard information, don't categorise (Age: 10-20, 21-30, …).
521
Complications
We assume measurement error is additive and linear.
Additive: e.g. weight, where people may under-report / over-report at the extremes.
Linear: particularly an issue when using proxy variables.
522
e.g. proxy measures. Want to know the effort of childcare? Count the number of children: but the 1st child is more effort than the 19th child. Want to know financial status? Count income: but the 1st £1 has a much greater effect on financial status than the 1,000,000th.
523
Exercise 8.5
524
Lesson 9: Non-Linear Analysis in Regression
525
Introduction
A non-linear effect occurs when the effect of one predictor is not consistent across the range of the IV. An assumption is then violated: the expected value of the residuals = 0 is no longer the case.
526
Some Examples
527
A Learning Curve (skill plotted against experience)
528
Yerkes-Dodson Law of Arousal (performance plotted against arousal)
529
Enthusiasm Levels over a Lesson on Regression (enthusiasm, from suicidal to enthusiastic, plotted against time)
530
Learning: the line never changed direction. Yerkes-Dodson: the line changed direction once. Enthusiasm: the line changed direction twice.
531
Everything is Non-Linear
Every relationship we look at is non-linear, for two reasons. First, limits: exam results cannot keep increasing with reading more books; relationships are only linear in the range we examine. Second, for small departures from linearity we cannot detect the difference, and modelling them is a non-parsimonious solution.
532
Non-Linear Transformations
533
Bending the Line
Non-linear regression is hard, so we cheat: we linearise the data and do linear regression. We transform the data rather than estimating a curved line, which would be very difficult and may not work with OLS. We can take a straight line and bend it, or take a curved line and straighten it, and get back to linear (OLS) regression.
534
We still do linear regression: linear in the parameters, Y = b1x + b2x² + …
We can do regression that is non-linear in the parameters, but it is much trickier: the statistical theory either breaks down or becomes harder.
535
Linear transformations
Multiply by a constant, or add a constant: these change the slope and the intercept.
536
[Graph: the lines y = x, y = 2x and y = x + 3]
537
Linear transformations are no use here: they alter the slope and intercept but don't alter the standardised parameter estimate. A non-linear transformation will bend the slope: a quadratic transformation, y = x², gives one change of direction.
538
Cubic transformation: y = x² + x³, two changes of direction.
539
To estimate a non-linear regression, we don't actually estimate anything non-linear: we transform the x-variable to a non-linear version, and can then estimate a straight line that represents the curve. We don't bend the line; we stretch the space around the line, and make it flat.
540
Detecting Non-linearity
541
Draw a Scatterplot Draw a scatterplot of y plotted against x
see if it looks a bit non-linear e.g. Education and beginning salary from bank data with line of best fit
542
A Real Example Starting salary and years of education
From employee data.sav
543
Expected value of error (residual) is > 0
544
Use Residual Plot Scatterplot is only good for one variable
use the residual plot (that we used for heteroscedasticity) Good for many variables
545
We want points to lie in a nice straight sausage
546
We don’t want a nasty bent sausage
547
Educational level and starting salary
548
Carrying Out Non-Linear Regression
549
Linear Transformation
A linear transformation doesn't change the standardised slope, the se, t, or p of the slope, or R² (only the interpretation of the unstandardised slope). It can, however, change the effect of a subsequent non-linear transformation.
550
Actually it is more complex: with some transformations you can add a constant with no effect (e.g. the quadratic); with others it does have an effect (inverse, log). Sometimes it is necessary to add a constant: negative numbers have no square root, and 0 has no log.
551
Education and Salary: Linear Regression
We saw previously that the assumption of expected errors = 0 was violated. Anyway … R² = 0.401, p < 0.001; salbegin = b0 + b1 × educ; standardised b1 (educ) = 0.633. Both parameters make sense.
552
Non-linear Effect
Compute a new variable, the quadratic: educ2 = educ². Add this variable to the equation: R² = 0.585, p < 0.001; salbegin = b0 + b1 × educ + b2 × educ2. Slightly curious: standardised b1 (educ) = -2.4, b2 (educ2) = 3.1. What is going on?
553
Collinearity is what is going on: the correlation of educ and educ2 is r = 0.990, and the regression equation becomes difficult (impossible?) to interpret. We need hierarchical regression: what is the change in R², and is that change significant? R² (change) = 0.184, p < 0.001.
554
While we are at it, let's look at the cubic effect: terms in e, e² and e³.
R² (change) = 0.004, p = 0.045. Standardised: b1(e) = 0.04, b2(e²) = -2.04, b3(e³) = 2.71.
555
Fourth Power
Keep going while we are ahead? When do we stop?
556
Interpretation
Tricky, given that the parameter estimates are a bit nonsensical. Two methods:
1: Use the R² change.
2: Save (or calculate) the predicted values to plot the line of best fit: save them from the equation, and plot them against the IV.
558
Differentiate with respect to e
We said: s = b0 + b1e + b2e² + b3e³, but first we will simplify it to the quadratic: s = b0 + b1e + b2e². Then ds/de = b1 + 2 × b2 × e.
559
1 year of education at the higher end of the scale, better than 1 year at the lower end of the scale. MBA versus GCSE
560
Differentiate the cubic: ds/de = 103 – 206 × 2 × e + 12 × 3 × e². We can calculate the slopes for the quadratic and the cubic at different values of e.
562
A Quick Note on Differentiation
For y = x^p: dy/dx = p × x^(p-1).
For equations such as y = b1x + b2x^p: dy/dx = b1 + b2 × p × x^(p-1).
e.g. y = 3x + 4x²: dy/dx = 3 + 4 × 2 × x = 3 + 8x.
563
y = b1x + b2x² + b3x³: dy/dx = b1 + b2 × 2 × x + b3 × 3 × x²
e.g. y = 4x + 5x² + 6x³: dy/dx = 4 + 5 × 2 × x + 6 × 3 × x² = 4 + 10x + 18x²
Many functions are simple to differentiate. Not all, though.
564
Splines and Knots
Estimate a different slope following an event: the lines are splines, the events are knots. The event might be known (marriage) or unknown (how many years after brain injury does recovery start?).
565
Lesson 10: Regression for Counts and Categories
Dichotomous/Nominal outcomes
566
Contents
General and generalized linear models
Dichotomous outcomes: logistic / probit
Counts: Poisson and negative binomial
567
GLMs and GLMs
General linear models: ordinary least squares regression based models, with an identity link function. Regression, ANOVA, correlation, etc.
Generalized linear models: more link functions, more error structures. General linear models are a subset of generalized linear models.
568
Often in the social sciences we have a dichotomous or nominal outcome; we will look at dichotomous first, then take a quick look at multinomial. Dichotomous outcomes: e.g. guilty/not guilty, pass/fail, won/lost, alive/dead (used in medicine).
569
Why Won’t OLS Do?
570
Example: PTSD in Veterans
How does length of deployment affect the probability of PTSD? You have PTSD, or you don't. We might be interested in severity; the Army are not: if you have PTSD, you need help, and you are not going back. Develop a selection procedure. Two predictor variables: rank (1 = Staff Sgt, 5 = Private) and deployment length (months).
571
1st ten cases
572
Outcome: PTSD (1 = yes, 0 = no). Just consider rank first: carry out a regression with rank as the predictor and PTSD as the outcome. R² = 0.097, F = 4.1, df = 1, 48, p = …; b0 = 0.190, b1 = 0.110, p = 0.028. Seems OK.
573
Residual plot
574
Problems 1 and 2: strange distributions of residuals mean the parameter estimates may be wrong, and the standard errors will certainly be wrong.
575
Next problem: interpretation.
I have rank 2: predicted PTSD = 0.190 + 0.110 × 2 = 0.41.
I have rank 8: predicted PTSD = 0.190 + 0.110 × 8 = 1.07.
Seems OK, but what does it mean? You cannot score 0.41 or 1.07; you can only score 0 or 1. It cannot be interpreted: we need a different approach.
576
A Different Approach Logistic Regression
577
Logit Transformation
In lesson 9 we transformed the IVs; now we transform the outcome. We need a transformation which gives us graduated scores between 0 and 1: there is an upper limit (we can't predict that someone will pass twice) and a lower limit (you can't do worse than fail).
578
Step 1: Convert to Probability
First, stop talking about values and talk about probability: for each value of the score, calculate the probability of a pass. This solves the problem of graduated scales.
579
probability of PTSD given a rank of 1 is 0.7
580
This is better. A score of 0.41 now has a meaning: a 0.41 probability of a pass. But a score of 1.07 still has no meaning: you cannot have a probability > 1 (or < 0). We need another transformation.
581
Step 2: Convert to Odds
We need to remove the upper limit, so convert to odds: odds as used by betting shops, 5:1, 1:2. Slightly different from odds in everyday speech: a 1 in 2 chance (50%) is odds of 1:1 (evens).
582
Odds ratio = (number of times it happened) / (number of times it didn’t happen)
583
p = 0.8: odds = 0.8 / 0.2 = 4, equivalent to 4:1 (odds on); it happens 4 times out of five.
p = 0.2: odds = 0.2 / 0.8 = 0.25, equivalent to 1:4 (4:1 against); it happens 1 time out of five.
584
Now we have solved the upper bound problem: we can interpret 1.07, 2.07, … But we still have the zero problem: we cannot interpret predicted scores less than zero.
585
Step 3: The Log
Log10 of a number x: log(10) = 1, log(100) = 2.
586
log(1) = 0, log(0.1) = -1, log(0.00001) = -5.
587
Natural Logs and e
Don't use log10; use loge, the natural log, ln. It has some desirable properties that log10 doesn't. For us: if y = ln(x) + c, then dy/dx = 1/x. Not true for any other logarithm.
588
Be careful – calculators and stats packages are not consistent when they use log
Sometimes log10, sometimes loge
589
Take the natural log of the odds. It goes from -∞ to +∞, so we can interpret any predicted value.
590
Putting them all together: the logit transformation, logit(p) = ln(p / (1 - p)), the log-odds, bounded neither at zero nor at one.
592
Probability gets closer to zero, but never reaches it as logit goes down.
593
Hooray! Problem solved, lesson over … errrmmm, almost. Because we are now using the log-odds, we can't use OLS: we need a new technique, called Maximum Likelihood (ML), to estimate the parameters.
594
Parameter Estimation using ML
ML tries to find the estimates of the model parameters that are most likely to give rise to the pattern of observations in the sample data. It all gets a bit complicated. OLS is a special case of ML: the mean is an ML estimator.
595
There are no closed form equations: the model must be solved iteratively, maximising the likelihood function (LF) to find the parameters most likely to give rise to the patterns observed in the data. We aren't going to worry about this, except to note that sometimes the estimates do not converge: ML cannot find a solution.
596
R² in Logistic Regression
A dichotomous variable doesn't have a free variance: if you know the mean (the proportion), you know the variance. So you can't have R². There are several pseudo-R² statistics; none are perfect. There's something better (later).
597
Logistic Regression in Stata
Exercise 10.1. Two (almost) equivalent commands:
logistic ptsd rank deployment
logit ptsd rank deployment
598
Logit gives output in log-odds:
logit ptsd rank deployment
[Output: Coef., Std. Err., z, P>|z| and 95% Conf. Interval for deployment, rank and _cons]
599
Logistic gives output in odds ratios (no intercept is shown):
logistic ptsd rank deployment
[Output: Odds Ratio, Std. Err., z, P>|z| and 95% Conf. Interval for deployment and rank]
600
SPSS produces a classification table, and Stata produces one if you ask: predictions of the model based on a cut-off of 0.5 (by default), predicted values against actual values. DO NOT USE IT! Will this person go to prison? No. You will be right 99.9% of the time; that doesn't mean you have a good model (Gottman and Murray, in Blink).
602
Model parameters
B: the change in the logged odds associated with a change of 1 unit in the IV. Just like OLS regression, but difficult to interpret.
SE(B): the standard error. Multiply by 1.96 to get 95% CIs.
603
Constant (i.e. score = 0): B = 1.314; Exp(B) = e^B = e^1.314 = 3.720.
OR = 3.720; p = 1 - (1 / (OR + 1)) = 1 - (1 / 4.720) = 0.788.
604
Score = 1: constant b = 1.314, score B = -0.467.
Exp(1.314 - 0.467) = Exp(0.847) = 2.332. OR = 2.332; p = 1 - (1 / (OR + 1)) = 1 - (1 / 3.332) = 0.699.
605
Standard Errors and CIs
Symmetrical in B Non-symmetrical (sometimes very) in exp(B)
606
The odds of failing the test are multiplied by 0.63 (CIs = 0.408, …; p = 0.033) for every additional point on the aptitude test.
607
Hierarchical Logistic Regression
In OLS regression we use the R² change; in logistic regression we use the chi-square change. The difference in chi-squares is itself a chi-square, and the difference in df is its df.
608
Hierarchical Logistic Regression
Model 1: Experience Model 2: Experience + Score Model 1: Chi-square =4.83, df = 1 Model 2: Chi-square =5.77, df = 2
609
Difference: chi-square = 5.77 - 4.83 = 0.94, df = 2 - 1 = 1.
gen p = 1 - chi2(1, 0.94)
tab p
p = 0.332. The p-value from the SE = 0.339. Why the difference?
610
More on Standard Errors
Because of Wald standard errors: Wald SEs are overestimated, which makes the p-values in the estimates wrong (too high). The CIs are still correct.
611
The two estimates use slightly different information. The p-value asks "what if there is no effect?"; the CI asks "what if there is this effect?". The variance depends on the hypothesised ratio of the number of people in the two groups. You can calculate likelihood-ratio-based p-values, if you can be bothered; some packages provide them automatically.
612
Probit Regression
Very similar to logistic, with a much more complex initial transformation (to the normal distribution). Very similar results to logistic (multiplied by 1.7). Swap logistic for probit in the Stata command. Harder to interpret: the parameter doesn't mean something tangible, the way a log-odds does.
613
Differentiating Between Probit and Logistic
It depends on the shape of the error term: normal or logistic. The graphs are very similar to each other; you could distinguish them on quality of fit, given an enormous sample size. Logistic ≈ probit × 1.7.
Probit advantage: we understand the normal distribution.
Logistic advantage: much simpler to get back to the probability.
615
Infinite Parameters
Non-convergence can happen because of infinite parameters: an insoluble model. Three kinds. Complete separation: the groups are completely distinct, e.g. the pass group all score more than 10 and the fail group all score less than 10.
616
Quasi-complete separation: separation with some overlap, e.g. the pass group all score 10 or more, the fail group all score 10 or less. In both cases there is no convergence; close to this, you get curious estimates and curious standard errors.
617
Categorical Predictors
Can cause separation, especially if correlated. You need people in every cell: male/female × white/non-white × below/above the poverty line.
618
Logistic Regression and Diagnosis
Logistic regression can be used for diagnostic tests. For every score, calculate the probability that the result is positive, and the proportion of people with that score (or lower) who have a positive result. Calculate the c statistic: a measure of discriminative power, the percentage of all possible pairs of cases where the model gives a higher probability to a correct case than to an incorrect case.
619
Perfect c-statistic = 1.0 Random c-statistic = 0.5
620
Sensitivity and Specificity
Sensitivity: the probability of saying someone has a positive result, if they do: p(pos|pos).
Specificity: the probability of saying someone has a negative result, if they do: p(neg|neg).
621
C-Statistic, Sensitivity and Specificity
After logistic: lroc gives the c-statistic. Better than R-squared.
623
More Advanced Techniques
Multinomial logistic regression: more than two categories in the outcome, same procedure. One category is chosen as the reference group, and we model the odds of being in each other category rather than the reference. Ordinal multinomial logistic regression is for ordinal outcome variables.
624
More on Odds Ratios
Odds ratios are horrid; we use them because they have nice distributional properties. Example: 40% in group 1 get PTSD, 60% in group 2 get PTSD. What's the odds ratio? The odds are 0.4 / 0.6 = 0.67 and 0.6 / 0.4 = 1.5, so the OR = 1.5 / 0.67 = 2.25. How is this confusing? Most people would expect 1.5.
625
Alternatives to Odds Ratios
Risk difference: 20 percentage points higher. Relative risk: the probability is 1.5 times higher; this is what you would think an odds ratio meant. Can we use these in regression? RD: maybe, sometimes. RR: yes, but we need to do something else first.
626
Final Thoughts
Logistic regression can be extended: dummy variables, non-linear effects, interactions.
Same issues as OLS: collinearity, outliers.
627
Same additional options as regress: xi:, cluster, robust.
628
Poisson Regression
629
Counts and the Poisson Distribution
Von Bortkiewicz (1898): numbers of Prussian soldiers kicked to death by horses.
Deaths  Frequency
0       109
1       65
2       22
3       3
4       1
630
The data fitted a Poisson probability distribution. When counts of events occur, the Poisson distribution is common: e.g. papers published by researchers, police arrests, numbers of murders, ship accidents. A common approach is to log transform and treat as normal. Problems: censored at 0; integers only allowed; heteroscedasticity.
631
The Poisson Distribution
633
P(y) = e^(-m) × m^y / y!
where y is the count and m is the mean of the Poisson distribution. (Excel has a Poisson function you can use.) In a Poisson distribution the mean = the variance (hence the heteroscedasticity issue): m = s².
634
Poisson Probabilities: P(score) for different means
Score   m=1    m=2    m=3    m=10
0       0.37   0.14   0.05   0.00
1       0.37   0.27   0.15   0.00
2       0.18   0.27   0.22   0.00
3       0.06   0.18   0.22   0.01
4       0.02   0.09   0.17   0.02
5       0.00   0.04   0.10   0.04
6       0.00   0.01   0.05   0.06
7       0.00   0.00   0.02   0.09
8       0.00   0.00   0.01   0.11
9       0.00   0.00   0.00   0.13
635
Issues with Estimation
Just as with logistic regression, we can't predict a mean below zero. So don't predict the mean: predict the log of the mean.
636
Poisson Regression in Stata
Adult literacy study Number of sessions attended Count variable Poisson regression
637
poisson sessions tx
[Output: Coef., Std. Err., z, P>|z| and 95% Conf. Interval for tx and _cons]
poisson sessions tx, irr
[Output: IRR, Std. Err., z, P>|z| and 95% Conf. Interval for tx]
638
But was it Poisson? Look at the predicted probabilities and compare them with the actual probabilities.
Predicted means: control: exp(1.899) = 6.86; intervention: exp(…) = 5.28.
Get the means and SDs.
639
bysort tx: sum sessions
[Output: Obs, Mean and Std. Dev. of sessions, for tx = 0 and tx = 1]
640
Compare the predicted probabilities with the actual probabilities:
tab sessions tx, col nofreq
We do OK on the means; we don't do OK on the variances: the variances are too high. Draw graphs: not horrible, except the zeroes.
642
Test for Goodness of Fit to the Poisson Distribution
After running poisson:
estat gof
Goodness-of-fit chi2 = …; Prob > chi2(150) = …
Highly significant: the Poisson distribution doesn't fit.
643
Overdispersion
A problem in Poisson regression: too many zeroes. It causes χ² inflation and standard error deflation, hence p-values that are too low and a higher Type I error rate. Two solutions: negative binomial regression, and robust standard errors.
644
Robust Standard Errors
poisson sessions tx, robust
[Output: Coef., robust Std. Err., z and P>|z| for tx and _cons]
The robust SEs are larger.
645
Negative Binomial Regression
Adds an extra dispersion parameter, called alpha, to account for the extra variance (the zeroes):
nbreg sessions tx
OR
nbreg sessions tx, robust
646
Back to Categorical Outcomes
We said: odds ratios are not good, and we like relative risk instead. What is the ratio of the risks? What analysis technique do we know that gives ratios of means?
647
Poisson regression! Wait. It won’t work. The distribution is wrong. Robust estimates!
648
Poisson Regression in SPSS
SPSS 15 (and above) has added it, under generalized linear models.
649
Lesson 11: Mediation and Path Analysis
650
Introduction
Moderator: the level of one variable influences the effect of another variable.
Mediator: one variable influences another via a third variable.
All relationships are really mediated: are we interested in the mediators? Can we make the process more explicit?
651
In the examples with the bank: education → beginning salary. Why? What is the process? Are we making assumptions about the process? Should we test those assumptions?
652
[Diagram: education → job skills, expectations, negotiating skills, kudos for the bank → beginning salary]
653
Direct and Indirect Influences
X may affect Y in two ways.
Directly: X has a direct (causal) influence on Y (or one mediated by other variables).
Indirectly: X affects Y via a mediating variable, M.
654
e.g. how does going to the pub affect comprehension on a summer school course on, say, regression?
[Diagram: having fun in the pub in the evening → not reading books on regression → less knowledge. Anything here (a direct path)?]
655
[Diagram: having fun in the pub in the evening → not reading books on regression and → fatigue, both → less knowledge. Is the direct path still needed?]
656
Mediators are needed to cope with more sophisticated theory in the social sciences: they make explicit the assumptions made about processes, and let us examine direct and indirect influences.
657
Detecting Mediation
658
“Classic Approach” 4 Steps
From Baron and Kenny (1986). To establish that the effect of X on Y is mediated by M:
1. Show that X predicts Y.
2. Show that X predicts M.
3. Show that M predicts Y, controlling for X.
4. If the effect of X controlling for M is zero, M is a complete mediator of the relationship.
(Steps 3 and 4 come from the same analysis.)
659
Example: book habits. Enjoy books, buy books, read books.
660
Three Variables
Enjoy: how much an individual enjoys books.
Buy: how many books an individual buys (in a year).
Read: how many books an individual reads (in a year).
662
The theory: enjoy → buy → read.
663
Step 1: Show that X (enjoy) predicts Y (read). b1 = 0.487, p < 0.001; standardised b1 = 0.732. OK.
664
Step 2: Show that X (enjoy) predicts M (buy). b1 = 0.974, p < 0.001; standardised b1 = 0.643. OK.
665
Step 3: Show that M (buy) predicts Y (read), controlling for X (enjoy). b1 = 0.206, p < 0.001; standardised b1 = 0.469. OK.
666
Step 4: If the effect of X controlling for M is zero, M is a complete mediator of the relationship. (Same analysis as step 3.) b2 = 0.287, p = 0.001; standardised b2 = 0.431. Hmmmm … significant, therefore not a complete mediator.
667
[Path diagram: enjoy → read = 0.287 (step 4); enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3)]
668
The Mediation Coefficient
Amount of mediation = Step 1 - Step 4 = 0.487 - 0.287 = 0.200
OR Step 2 × Step 3 = 0.974 × 0.206 = 0.200
669
SE of the Mediator
[Path diagram: a = enjoy → buy (from step 2), with sa = se(a); b = buy → read, with sb = se(b)]
Sobel test: the standard error of the mediation coefficient can be calculated from a and b and their standard errors (in the usual Sobel form, se(ab) = √(a²·sb² + b²·sa²)). Here a = 0.974, sa = 0.189; b = 0.206, sb = 0.054.
Indirect effect = 0.200; se = 0.056; t = 3.52, p = 0.001 (from an online Sobel test calculator).
672
Problems with the Sobel Test
Recently there has been a move in the methodological literature away from this conventional approach. Problems of power: there are several tests, all of which must be significant; the joint Type I error rate is 0.05 × 0.05 = 0.0025, which must affect power.
673
Distributional Assumption
We assume that the sampling distribution of the coefficient is normally distributed, with the standard error as its standard deviation. If a (x → m) is normal and not zero, and b (m → y) is normal and not zero, then a × b is not normally distributed: the assumption is violated and the test is incorrect.
674
Solution: Bootstrap
A computer-intensive, semi-parametric procedure which removes the distributional assumption; bootstrapping is suggested as the alternative. For Stata: … For SAS, SPSS: …
675
Cross-Sectional Bias
If everything is measured at one time, there is likely to be bias. Ideally: three variables, measured on three occasions.
676
[Diagram: x, m and y, each measured on three occasions, with cross-lagged paths x → m → y]
677
Collecting data on three occasions is kind of hard work. BUT: the stationarity assumption can save us.
[Diagram: the same x, m, y cross-lagged model]
678
We assume that the effect from M to Y is stable over time; then we only need two time points. Cole and Maxwell (2003).
679
Power in Mediation
Really hard to work out: you need to run simulations. Power depends on the size of a and the size of b. Fritz and MacKinnon (2007) give a table of power for different effects.
680
More Information on Mediation
MacKinnon, Fritz and Fairchild, Annual Review of Psychology.
MacKinnon, Introduction to Statistical Mediation Analysis.
Iacobucci, Mediation Analysis (little green book).
MacKinnon's website (Google: mackinnon mediation).
Facebook group (no, really).
681
Lesson 12: Moderators in Regression
“different slopes for different folks”
682
Introduction
Moderator relationships have many different names: interactions (from ANOVA), multiplicative, non-linear (just confusing), non-additive. All talking about the same thing.
683
A moderated relationship occurs
when the effect of one variable depends upon the level of another variable
684
Hang on … that seems very like a non-linear relationship.
Moderator: the effect of one variable depends on the level of another.
Non-linear: the effect of one variable depends on the level of itself.
Where there is collinearity it can be hard to distinguish between them (Paper B5); you should (usually) compare effect sizes.
685
e.g. How much it hurts when I drop a computer on my foot depends on
x1: how much alcohol I have drunk x2: how high the computer was dropped from but if x1 is high enough x2 will have no effect
686
e.g. Likelihood of injury in a car accident
depends on x1: speed of car x2: if I was wearing a seatbelt but if x1 is low enough x2 will have no effect
688
e.g. number of words (from a list) I can remember
depends on x1: type of words (abstract, e.g. ‘justice’, or concrete, e.g. ‘carrot’) x2: Method of testing (recognition – i.e. multiple choice, or free recall) but if using recognition x1: will not make a difference
689
We will look at three kinds of moderator:
alcohol × height = pain: continuous × continuous
speed × seatbelt = injury: continuous × categorical
word type × test type: categorical × categorical
We will look at them in reverse order.
690
How do we know to look for moderators?
Theoretical rationale: often the most powerful. Many theories predict additive/linear effects; fewer predict moderator effects.
Presence of heteroscedasticity: a clue that there may be a moderated relationship missing.
691
Two Categorical Predictors
692
Data
2 IVs: word type (concrete [e.g. carrot, table] vs abstract [e.g. love, justice]) and test method (multiple choice vs recall).
20 participants, in one of four groups (concrete/MC, concrete/recall, abstract/MC, abstract/recall), 5 per group.
lesson12.1-words.dta
694
Graph of means
695
Procedure for Testing
1: Convert to dummy coding (already done).
2: Calculate the interaction term: multiply the dummy codes together (you can also use xi: for this). Call the interaction mxc.
696
Interaction term (mxc): multiply the coded variables together.
697
3: Carry out the regression, hierarchically: linear effects in the first block, the interaction effect in the next block, as in the sketch below.
698
b0 (intercept) = 7.0: the mean score when mc = 0 and concrete = 0.
b1 (mc) = 8.6: the effect of mc when concrete is zero.
b2 (concrete) = 8.2: the effect of concrete when mc is zero.
699
b3 (mc × con) = -8.4.
Given the other estimates, what's the predicted mean of the concrete, MC group? 7.0 + 8.6 + 8.2 = 23.8. What is it actually? 15.4.
700
Have: 15.4. Expect: 23.8. Difference: -8.4.
701
Back to the Graph
Slope for concrete words: 15.2 - 15.4 = -0.2.
Slope for abstract words: -8.6.
Difference in slopes: -8.6 - (-0.2) = -8.4.
702
b associated with interaction
The difference in the slopes OR The change in slope, away from the average, associated with a 1 unit change in the moderating variable
703
Another way to look at it:
Y = 7 + 8.6m + 8.2c - 8.4mc
Examine the concrete words group (c = 1); substitute the values into the equation:
Y(conc) = 7 + 8.6m + 8.2(1) - 8.4m(1)
Y(conc) = 7 + 8.2 + 8.6m - 8.4m
Y(conc) = 15.2 + 0.2m
704
Categorical x Continuous
705
Note on Dichotomisation
It is very common to see people dichotomise a continuous variable because it makes the analysis easier. It is a very bad idea (Paper B6).
706
Data: a chain of 60 supermarkets; examining the relationship between profitability, shop size, and local competition.
2 IVs: shopsize; comp (local competition, 0 = no, 1 = yes)
Outcome: profit
707
Data, ‘lesson 12.2.dta’
708
1st Analysis: two IVs. R2 = 0.367, df = 2, 57, p < 0.001
Unstandardised estimates: b1 (shopsize), p = 0.001; b2 (comp), p < 0.001
Standardised estimates: b1 (shopsize) = 0.356; b2 (comp) = 0.448
709
Suspicions: the presence of competition is likely to have an effect,
and the residual plot shows a little heteroscedasticity.
711
Procedure for Testing: very similar to last time.
Convert 'comp' to dummy coding (if it's not already); compute the interaction term, comp x shopsize; run a hierarchical regression (see the sketch below).
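A minimal sketch in Stata, using the variable names given above for lesson 12.2.dta:

gen sxc = shopsize * comp            // interaction term
regress profit shopsize comp         // block 1: linear effects
regress profit shopsize comp sxc     // block 2: add the interaction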
712
Result. Estimates:
b1 (shopsize) = 0.12, SE = 0.03
b2 (comp) = -1.67, SE = 2.50
b3 (sxc) = -0.10, SE = 0.05
713
comp is now non-significant –
which shows the importance of entering the effects hierarchically, because competition obviously is important.
714
Interpretation: draw a graph with lines of best fit:
graph twoway (scatter profit shopsize if comp==1) (lfit profit shopsize if comp==1) (scatter profit shopsize if comp==0) (lfit profit shopsize if comp==0), legend(off)
716
Substitute into the equation to get the effect of size (we can ignore the constant):
Y = size(0.12) + comp(-1.67) + size x comp(-0.09)
Competition present (comp = 1):
Y = size(0.12) + (1)(-1.67) + size(1)(-0.09)
Y = size(0.12 - 0.09) - 1.67
Y = size(0.03): with competition, the effect of size is 0.03
717
Competition absent (comp = 0):
Y = size(0.12) + (0)(-1.67) + size(0)(-0.09)
Y = size(0.12): without competition, the effect of size is 0.12
718
Two Continuous Variables
719
Data: bank employees, only using clerical staff; 363 cases.
Predicting starting salary from previous experience, age, and age x experience (exercise 6.3).
720
Correlation matrix: only one correlation is significant.
721
Initial Estimates (no moderator, standardised):
R2 = 0.063, p < 0.001
Age at start = -0.37, p < 0.001
Previous experience = 0.36, p < 0.001
The predictors are suppressing each other: age and experience compensate for one another. Older with no experience: bad. Younger with experience: good.
722
The Procedure: very similar to the previous ones.
Create the multiplicative interaction term, BUT first center the variables (subtract the mean). Not always necessary, but it can make life easier.
723
Hierarchical regression:
the two linear effects in the first block, the moderator effect in the second (see the sketch below).
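A minimal sketch in Stata; agestart, prevexp and salbegin are the names used elsewhere in these slides, while c_age, c_exp and axe are names made up for this sketch:

summarize agestart
gen c_age = agestart - r(mean)        // center age at its mean
summarize prevexp
gen c_exp = prevexp - r(mean)         // center experience at its mean
gen axe = c_age * c_exp               // interaction term
regress salbegin c_age c_exp          // block 1: linear effects
regress salbegin c_age c_exp axe      // block 2: add the moderator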
724
Estimates (standardised):
Change in R2 = 0.085, p < 0.001
b1 (agestart) = -0.52
b2 (prevexp) = 0.93
b3 (age x exp) = -0.56
725
Interpretation 1: Pick-a-Point
A graph is tricky – we can't put two continuous predictors on one scatterplot. So choose specific points (pick-a-point) and graph the line of best fit of one variable at chosen values of the other. Two ways to pick a point:
1: Choose high (z = +1), medium (z = 0) and low (z = -1)
2: Choose 'sensible' values – age 20, 50, 80?
726
We know:
Y = 0.94e - 0.53a - 0.58ae
where a = agestart and e = previous experience (standardised).
We can rewrite this, taking a out of the brackets:
Y = (0.94e) + (-0.53 - 0.58e)a
The bracketed terms are the simple intercept and simple slope – the intercept and slope for agestart:
w0 = 0.94e
w1 = -0.53 - 0.58e
Y = w0 + w1a
727
Pick any value of e, and we know the slope for a. Standardised, so it's easy:
e = -1: w0 = (-1)(0.94) = -0.94; w1 = -0.53 + 0.58 = 0.05, so Y = -0.94 + 0.05a
e = 0: w0 = 0; w1 = -0.53, so Y = -0.53a
e = 1: w0 = 0.94; w1 = -0.53 - 0.58 = -1.11, so Y = 0.94 - 1.11a
728
Graph the Three Lines
729
Do This in Stata. The easy way: create some pseudo cases –
some fake people with sensible scores on the variables. The regression equation 'stays behind' after estimation, so you can calculate predicted scores with predict (the pseudo cases can even be in a new dataset). A sketch follows.
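A minimal sketch of the pseudo-case trick, run straight after a regression of salbegin on agestart, prevexp and their product axe (raw, uncentered variables; the agestart values 20 and 60 are illustrative assumptions):

set obs 369                              // append 6 pseudo cases to the 363 real ones
local i = 364
foreach e of numlist 0 85 170 {
    foreach a of numlist 20 60 {
        quietly replace prevexp  = `e' in `i'
        quietly replace agestart = `a' in `i'
        local ++i
    }
}
replace axe = agestart * prevexp in 364/369
predict pred                             // predicted salary for real and pseudo cases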
731
Then draw the graph:
drop if _n < 364
graph twoway (lfit pred agestart if prevexp==0, lcolor(red)) (lfit pred agestart if prevexp==85, lcolor(black)) (lfit pred agestart if prevexp==170, lcolor(green)), legend(off)
733
(Also works in SPSS; in SAS, use Proc Score)
734
Interpretation 2: P-Values and CIs
The second way; newer, and rarely done. Calculate CIs of the simple slope at any point, calculate p-values, and give ranges of significance.
735
What do you need? The variances and covariances of the estimates.
SPSS doesn't provide these for the intercept, so you need to do it manually: in options, exclude the intercept; create your own intercept variable – c = 1 – and use it in the regression. In Stata it's easier (see below).
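In Stata the full variance-covariance matrix of the estimates, intercept included, is available after any regression:

matrix list e(V)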
736
Enter the information into the web page
and get the results. The calculations are in Bauer and Curran (in press, Multivariate Behavioral Research); Paper B13.
738
Areas of Significance
739
Two complications:
1: The constant differed.
2: The outcome was logged, hence non-linear – the effect of 1 unit depends on where that unit is. See Paper A2.
740
Finally …
741
Unlimited Moderators: moderator effects are not limited to 2 variables,
or to linear effects.
742
Three Interacting Variables
Block 1: Age, Sex, Exp
Block 2: Age x Sex, Age x Exp, Sex x Exp
Block 3: Age x Sex x Exp
(see the sketch below)
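A minimal sketch of the three blocks in Stata; age, sex and exp are from the slide, while the outcome name salary is an assumption:

gen axs = age * sex
gen axe = age * exp
gen sxe = sex * exp
gen axsxe = age * sex * exp
regress salary age sex exp                       // block 1
regress salary age sex exp axs axe sxe           // block 2
regress salary age sex exp axs axe sxe axsxe     // block 3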
743
Results: all two-way interactions significant; the three-way is not.
The effect of age depends on sex; the effect of experience depends on sex; the size of the age x experience interaction does not depend on sex (phew!)
744
Moderated Non-Linear Relationships
Enter the non-linear effect, then enter non-linear effect x moderator. If significant, it indicates that the degree of non-linearity differs by moderator (see the sketch below).
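A minimal sketch, assuming a predictor x, a moderator m and an outcome y (all names are assumptions):

gen xsq = x * x               // non-linear (quadratic) effect
gen xm = x * m
gen xsqm = xsq * m            // non-linear effect x moderator
regress y x xsq m xm          // enter the non-linear effect
regress y x xsq m xm xsqm     // significant xsqm: curvature differs by m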
746
Lesson 13: Longitudinal Models
747
Advantages of Longitudinal Data
You get more data from the same number of people
You can test causal relationships – although you can't rule out alternative explanations
You can examine change
You can control for individual differences
748
Disadvantage of Longitudinal Data
It’s much harder to analyze
749
Longitudinal Research
For comparing repeated measures; clusters are people. Data are usually short and fat: one row per person (ID), one column per occasion (V1-V4).
750
Converting Data
Change the data to tall and thin: one row per person per occasion. Use reshape in Stata; use Data, Restructure in SPSS. Clusters are ID.
751
Predict Salary Change. Use exercise5.3-bank salary.dta.
Compare beginning salary and salary. We would normally use a paired samples t-test: difference = $17,403, 95% CIs $16,… , $18,…
752
Predict Salary Change. Don't take the difference between salary and salbegin.
Why not? reg salary agestart salbegin. Est: … , 95% CIs: … , …
753
Restructure the Data:
gen id = _n
rename salbegin sal1
rename salary sal2
reshape long sal, i(id) j(t)
replace t = t - 1
754
Restructure the Data. With the data tall and thin, do a regression. What do we find?
ID  Time  Cash
1   0     $18,750
1   1     $21,450
2   0     $12,000
2   1     $21,900
3   0     $13,200
3   1     $45,000
755
Results: we have violated the independence assumption,
so we get the wrong answer. The simplest way to solve it: regress sal t, cluster(id). This treats ID as just an irritant – rather inflexible.
756
However … that approach has one advantage:
missing data doesn't mean we exclude the case. If data are missing at random (or missing completely at random), estimates will be unbiased.
757
If everyone has a score at time 1 and a score at time 2, the analysis is easy.
But what if half the people have missing data – some have time 1 only, some time 2 only, some both?
758
If data are missing at random (MAR) or missing completely at random (MCAR) –
crappy names – there's no problem. (Example table of T1/T2 scores with some values missing.)
759
Interesting … well, that wasn't very interesting. What is more interesting is when:
We have missing data – which we won't talk about more (much)
We have multiple measurements of the same people – which we will talk about
760
Modelling Change Can plot and assess trajectories over time
How do people change? What predicts the rate of change?
761
Plotting Individuals (plot: Person 1's salary at T1 and T2)
762
Plotting Individuals (plot: salary trajectories for Persons 1, 2 and 3 from T1 to T2)
765
Estimation. Each individual has an intercept, sampled from a population of intercepts;
each individual has a slope, sampled from a population of slopes. Can we estimate the average of each, and a measure of their variance? Yes! With multilevel models.
766
Multilevel Models can do all kinds of clever things (we won't worry about most of them).
Used when Level 1 units (measures) are nested within Level 2 units (people): the same person measured twice violates independence.
767
Levels. In regression, everything is at one level. In multilevel models, we have multiple levels:
hierarchical levels (hence 'hierarchical linear models'), random effects ('random effects models'), mixed effects ('mixed models').
768
Levels. Level 1 units: the first level of measurement;
they are clustered within Level 2 units, the second level of measurement.
769
Some Equations (this is very hard; it's not important). In regression:
Y = b0 + b1X + e
If X is time and we have one person, index the occasions with i and call the time variable T:
Yi = b0 + b1Ti + ei
770
Single person equation: Yi = b0 + b1Ti + ei.
But what if we have lots of people? We've used i for time, so we'll use j for people:
Yij = b0 + b1Tij + eij
But everything is fixed; we want to have some random effects.
771
Let's make the intercepts random – everyone has their own intercept:
Yij = b0j + b1Tij + eij
Look! We added a little j to b0. Now it's a multilevel model, and we need an equation for the intercept.
772
Equation for each person's intercept.
Your intercept (b0j) is equal to the mean intercept, g0 (gamma), plus a residual for that individual, m0j (mu):
b0j = g0 + m0j
This is the level 2 model; the level 2 residuals are i.i.d., etc.
773
So now we have:
Yij = b0j + b1Tij + eij
b0j = g0 + m0j
Or, substituting:
Yij = g0 + m0j + b1Tij + eij
774
Make Time Random. The value of the time parameter can vary amongst people:
everyone can have a different effect of time.
775
Time is random: everyone has a slope parameter, b1j = g1 + m1j. So:
Yij = b0j + b1jTij + eij
776
Or, substituting:
Yij = g0 + g1Tij + m0j + m1jTij + eij
777
Time Invariant Covariates
Can be added at level 2: we can predict a person's intercept (starting point) and rate of change.
778
Employee Data
Level 1: pay measures (two of them), clustered within
Level 2: people
Level 2 measures: age, sex, job, etc.
779
Regression with Time. Do a regression analysis on one person,
with time as the predictor: you get a regression line for that person.
780
Fixed vs Random Effects
Fixed effects: the effect is the same across all clusters (people); variation is only measurement error.
Random effects: the effect varies across people; an additional parameter in the model, so less parsimonious.
781
Fixed vs Random Effects
If an effect has variance, it might have covariance with any other effects that also have variance: more parameters, less parsimony.
782
Covariates. Two kinds:
Time invariant – fixed for a person: age when the study started, sex.
Time variant – can change over time: time itself, marital status.
783
Time Invariant: look at the effect of age. Add age to the fixed effects.
Is it significant? Are the random effects (still) significant?
784
Multilevel Models in Stata
Use xtmixed. Stata 10 added xtmelogit and xtmepoisson, but continuous outcomes are hard enough. In SPSS: continuous outcomes only.
785
Multilevel Models in Stata
xtmixed sal t
does an ordinary regression. We need to tell it about the people:
xtmixed sal t ||id:
786
(xtmixed output: fixed-effect coefficients for t and _cons; random-effects parameters sd(_cons) and sd(Residual); LR test vs. linear regression)
787
Annotated output: the coefficient on t is the average slope; _cons is the average intercept; sd(_cons) is the SD of the intercepts; sd(Residual) is the residual SD.
788
This gives random intercepts only. Let's look at them:
predict rand_int, fitted
xtline rand_int, overlay t(t) i(id) legend(off)
790
Random Slopes. So far everyone has the same slope –
maybe that's not true. Make the slopes random:
xtmixed sal t ||id: t
791
(xtmixed output: coefficients for t and _cons; random-effects parameters – id: Independent – sd(t), sd(_cons) and sd(Residual))
792
predict rand_int_slope, fitted
xtline rand_int_slope , overlay t(t) i(id) legend(off)
794
Structure of the Covariances
We have been forcing the slopes and intercepts to be uncorrelated. Let's correlate them:
xtmixed sal t ||id: t, cov(un)
795
(xtmixed output: coefficients for t and _cons; random-effects parameters – id: Unstructured – sd(t), sd(_cons), corr(t,_cons) and sd(Residual))
796
Predicting Change. Does another variable moderate the effect of time?
This means that the effect of time varies as a function of that variable:
xi: xtmixed sal i.t*agestart ||id: t
797
(xtmixed output: coefficients for _It_1, agestart, _ItXagest_1 and _cons; random-effects parameters – id: Unstructured – sd(t), sd(_cons), corr(t,_cons) and sd(Residual))
798
Exercises 13.1, 13.2
799
Fixed Effects Models: a second way of looking at longitudinal data.
Multilevel (mixed) models assume that intercepts are random; fixed effects models assume they are fixed. If they are fixed, they can correlate with all of the other predictors.
800
Fixed Effects Models. Allowing the intercepts to correlate with the predictors
has the effect of controlling for ALL time invariant predictors – even those you didn't measure. Each person is their own control.
801
Fixed Effects Models
Regression asks: are people who are higher on x also higher on y?
Fixed effects asks: when a person is higher on x, are they also higher on y?
Effects are within people, not between people.
802
Fixed Effects in Stata. Make the data long, then:
xtreg sal t agestart, i(id) fe
803
(xtreg output: coefficients for t and _cons; sigma_u, sigma_e and rho – the fraction of variance due to u_i; F test that all u_i = 0: F(473, 473))
804
Fixed Effects Regression
Can we look at the effect of time invariant predictors?
xtreg sal t agestart, i(id) fe
No – why not?
805
Interactions. But we can look at interactions of time invariant predictors with time: xi: xtreg sal i.t*agestart, i(id) fe
806
(xtreg output: coefficients for _It_1 and _ItXagest_1; agestart is (dropped); sigma_u, sigma_e and rho – the fraction of variance due to u_i)
807
Exercises 13.3, 13.4
808
Bonus Lesson 1: Why Regression?
A little aside, where we look at why regression has such a curious name.
809
Regression: the act of regressing; reversion; return towards the mean; return to an earlier stage of development, as in an adult's or an adolescent's behaving like a child (from Latin gradi, to go). So why give this name to a statistical technique which is about prediction and explanation?
810
Francis Galton – Charles Darwin's cousin – was studying heritability.
Tall fathers have shorter sons; short fathers have taller sons: 'filial regression toward mediocrity' – regression to the mean.
811
Galton thought this was biological fact (an evolutionary basis?).
Then he did the analysis backward: tall sons have shorter fathers; short sons have taller fathers. Regression to the mean is not biological fact but statistical artefact.
812
Other Examples
Secrist (1933): The Triumph of Mediocrity in Business
Second albums tend not to be as good as the first
The sequel to a film is not as good as the original
The Sports Illustrated cover jinx
Parents think that punishing bad behaviour works, but rewarding good behaviour doesn't
813
Accident reduction schemes always reduce accidents
Poor radiologists improve after training
Any treatment for a cold will work – or for most illnesses
Deaths due to methadone in Utah were high last year – must take action!
814
Pair Link Diagram: an alternative to a scatterplot, with each case's x score on one side linked to its y score on the other.
815
(pair link diagram: r = 1.00)
816
(pair link diagram: r = 0.00)
817
From Regression to Correlation
Where do we predict an individual's score on y will be, based on their score on x? It depends on the correlation:
r = 1.00 – we know exactly where they will be
r = 0.00 – we have no idea
r = 0.50 – we have some idea
(see the simulation sketch below)
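A quick way to see this: simulate two standardised variables with a known correlation and regress one on the other (a sketch; all names are made up):

clear
set obs 1000
set seed 1234
gen x = rnormal()
gen y = 0.5*x + sqrt(1 - 0.5^2)*rnormal()   // corr(x, y) is about 0.50
regress y x      // slope is about 0.5: someone 1 SD above the mean on x
                 // is predicted to be only 0.5 SD above the mean on y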
818
(pair link diagram, r = 1.00: a case starts here on x, and will end up exactly here on y)
819
(pair link diagram, r = 0.00: a case starts here on x, and could end anywhere on y)
820
(pair link diagram, r = 0.50: a case starts here on x, and will probably end somewhere around here on y)
821
Galton Squeeze Diagram
Don't show individuals; show groups of individuals from the same (or similar) starting point. Shows regression to the mean.
822
(Galton squeeze diagram, r = 0.00: each group starts at its own point on x and ends at the mean of y)
823
(Galton squeeze diagram, r = 0.50)
824
(Galton squeeze diagram, r = 1.00)
825
Correlation is the amount of regression that doesn't occur:
move 1 unit on x, and the prediction moves r units on y.
826
(diagram: r = 1.00, no regression)
827
(diagram: r = 0.50, some regression)
828
(diagram: r = 0.00, lots of regression – the maximum)
829
Formula: predicted z(y) = r x z(x) – the predicted standardised score on y is r times the standardised score on x.
830
Conclusion. Regression towards the mean is a statistical necessity:
regression = perfection - correlation
Very non-intuitive. Interest in regression and correlation grew from examining the extent of regression towards the mean – by Pearson, who worked with Galton – and we are stuck with the curious name. See also Paper B3.
831
Correcting for regression to the mean is possible,
but makes lots of tricky assumptions. To appear to do well in your job / life: do something after someone has failed – you probably can't do worse. Hospital / school / department / class / study / experiment: if it fails, volunteer to take it over.
832
Bonus Lesson 2: Other Kinds of Regression
833
Introduction We’ve covered a few kinds of regression
There are many more, for specific types of outcomes
834
Beta Regression: used when the outcome variable is beta distributed –
rates and proportions, bounded by zero and one, with uniform or strangely shaped distributions.
835
Cox Proportional Hazards Regression
Type of survival model Used for time to an event When the event might not occur Developed for medical research
836
Cox Proportional Hazards Regression
E.g. how long does it take for a car to break down? If the car crashes and is scrapped first, we'll never know – but we still want to use that information. Discarding the data point would lead to bias.
837
Competing Risks Survival
Time to multiple events: several things are trying to kill you – which one succeeds?
838
Data Mining Techniques
Avoid the problems of stepwise regression; used as alternatives to logistic regression.
Boosted regression (Stata command: boost) – a semi-parametric alternative to logistic regression
Least Angle Regression (LARS)
839
Classification trees
840
Seemingly Unrelated Regression
Used for multiple outcomes With correlated error terms Some say it should be seemingly related Stata command: sureg
841
Instrumental Variables Regression
Used for mediator models Although economists don’t call them that. Use ivregress in Stata
842
Quantile Regression. Why do we always try to predict the mean?
What about predicting the median? Or the 25th percentile? That's quantile regression (see the sketch below).
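In Stata, for example (a sketch reusing the bank-salary variables from earlier):

qreg salary agestart                    // median (50th percentile) regression
qreg salary agestart, quantile(.25)     // 25th percentile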
843
Non-Parametric Regression
(Might also include LARS / lasso / boosting.) Don't force any functional form on the relationship: LO(W)ESS – locally weighted scatterplot smoothing – will find any relationship.
844
Robust Regression (careful: not the same as sandwich estimators)
Outliers are trimmed and estimates are bootstrapped. Lots of publications by Wilcox.
845
Censored Regression (Tobit regression): used when a measure is censored,
e.g. unemployed people work 0 hours.