
1 Theory of Regression

2 The Course 16 (or so) lessons Some flexibility Depends how we feel
What we get through

3 Part I: Theory of Regression
Models in statistics Models with more than one parameter: regression Why regression? Samples to populations Introducing multiple regression More on multiple regression

4 Part 2: Application of regression
Categorical predictor variables Assumptions in regression analysis Issues in regression analysis Non-linear regression Moderators (interactions) in regression Mediation and path analysis Part 3: Advanced Types of Regression Logistic Regression Poisson Regression Introducing SEM Introducing longitudinal multilevel models

5 House Rules
Jeremy must remember not to talk too fast. If you don't understand, ask – any time. If you think I'm wrong, ask (I'm not always right).

6 Learning New Techniques
Best kind of data to learn a new technique Data that you know well, and understand Your own data In computer labs (esp later on) Use your own data if you like My data I’ll provide you with Simple examples, small sample sizes Conceptually simple (even silly)

7 Computer Programs
SPSS (mostly); Excel (for calculations); GPower; Stata (if you like); R (because it's flexible and free); Mplus (SEM, ML?); AMOS (if you like)

8

9

10 Lesson 1: Models in statistics
Models, parsimony, error, mean, OLS estimators

11 What is a Model?

12 What is a model? Representation
Of reality Not reality Model aeroplane represents a real aeroplane If model aeroplane = real aeroplane, it isn’t a model

13 Statistics is about modelling
Representing and simplifying: sifting what is important from what is not important. In statistical models we seek parsimony – parsimony ≈ simplicity.

14 Parsimony in Science
A model should: 1) explain a lot; 2) use as few concepts as possible. The more it explains, the more you get; the fewer the concepts, the lower the price. Is it worth paying a higher price for a better model?

15 A Simple Model
Height of five individuals – these are our DATA: 1.40m, 1.55m, 1.80m, 1.62m, 1.63m

16 A Little Notation
Y – the (vector of) data that we are modelling. Yi – the ith observation in our data.

17 Greek letters represent the true value in the population.
β (beta) – the parameters in our model (population values). β1 – the value of the first parameter of our model in the population. βj – the value of the jth parameter of our model in the population. ε (epsilon) – the error in the population model.

18 Normal letters represent the values in our sample
These are sample statistics, which are used to estimate population parameters: b – a parameter in our model (a sample statistic); e – the error in our sample; Y – the data in our sample which we are trying to model.

19 Symbols on top change the meaning.
Y – the data in our sample which we are trying to model (repeated). Ŷi – the estimated value of Y for the ith case. Ȳ – the mean of Y.

20 I will use b1 (because it is easier to type)

21 Not always that simple: some texts and computer programs use b = the (unstandardised) parameter estimate, as we have used, and β (beta) = the standardised parameter estimate. SPSS does this.

22 A capital letter is the set (vector) of parameters/statistics – the set of all parameters (b0, b1, b2, b3 … bp). The rules are not used very consistently (even by me); don't assume you know what someone means without checking.

23 We want a model to represent those data.
Model 1: 1.40m, 1.55m, 1.80m, 1.62m, 1.63m. Not a model – a copy. VERY unparsimonious: data = 5 statistics, model = 5 statistics. No improvement.

24 Model 2: The mean (arithmetic mean) A one parameter model

25 Which, because we are lazy, can be written as

26 The Mean as a Model

27 The (Arithmetic) Mean We all know the mean The mean is: The ‘average’
Learned about it at school Forget (didn’t know) about how clever the mean is The mean is: An Ordinary Least Squares (OLS) estimator Best Linear Unbiased Estimator (BLUE)

28 Mean as OLS Estimator Going back a step or two
MODEL was a representation of DATA We said we want a model that explains a lot How much does a model explain? DATA = MODEL + ERROR ERROR = DATA - MODEL We want a model with as little ERROR as possible

29 What is error?
Data (Y)   Model (b0 = mean)   Error (e)
1.40       1.60                -0.20
1.55       1.60                -0.05
1.80       1.60                 0.20
1.62       1.60                 0.02
1.63       1.60                 0.03

30 How can we calculate the ‘amount’ of error?
Sum of errors

31 Knowledge about ERROR is useful
0 implies no ERROR Not the case Knowledge about ERROR is useful As we shall see later

32 Sum of absolute errors Ignore signs

33 Are small and large errors equivalent?
One error of 4 Four errors of 1 The same? What happens with different data? Y = (2, 2, 5) b0 = 2 Not very representative Y = (2, 2, 4, 4) b0 = any value from 2 - 4 Indeterminate There are an infinite number of solutions which would satisfy our criteria for minimum error

34 Sum of squared errors (SSE)

35 Determinate
If we minimise SSE we always get one answer: the mean. Shown in the graph: SSE plotted against b0 – the minimum value of SSE occurs when b0 = mean.
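A minimal Python sketch of that graph's logic, using the five heights from slide 15 (the code is an illustration, not part of the original handout):

```python
import numpy as np

# The five heights from the example data (slide 15)
y = np.array([1.40, 1.55, 1.80, 1.62, 1.63])

# Try a range of candidate values for the one-parameter model b0
candidates = np.arange(1.40, 1.81, 0.01)
sse = [np.sum((y - b0) ** 2) for b0 in candidates]

best = candidates[np.argmin(sse)]
print(best, y.mean())  # both are (approximately) 1.60: SSE is minimised at the mean
```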

36

37 The Mean as an OLS Estimate

38 Mean as OLS Estimate The mean is an Ordinary Least Squares (OLS) estimate As are lots of other things This is exciting because OLS estimators are BLUE Best Linear Unbiased Estimators Proven with Gauss-Markov Theorem Which we won’t worry about

39 BLUE Estimators
Best: minimum variance (of all possible unbiased estimators) – a narrower distribution than other estimators, e.g. the median or mode. Linear: linear predictions – for the mean, a linear (straight, flat) line.

40 Unbiased Estimators Also consistent
Centred around true (population) values Expected value = population value Minimum is biased. Minimum in samples > minimum in population Estimators Errrmm… they are estimators Also consistent Sample approaches infinity, get closer to population values Variance shrinks

41 SSE and the Standard Deviation
Tying up a loose end

42 SSE closely related to SD
Sample standard deviation – s: a biased estimator of the population SD. Population standard deviation – σ. You need to know the mean to calculate the SD; this reduces N by 1, hence divide by N − 1, not N – like losing one df.

43 Proof That the mean minimises SSE Available in Not that difficult
As statistical proofs go Available in Maxwell and Delaney – Designing experiments and analysing data Judd and McClelland – Data Analysis (out of print?)

44 What’s a df? The number of parameters free to vary
When one is fixed Term comes from engineering Movement available to structures

45 Fix 1 corner, the shape is fixed
0 df No variation available 1 df Fix 1 corner, the shape is fixed

46 Back to the Data Mean has 5 (N) df s has N –1 df 1st moment
Mean has been fixed 2nd moment Can think of as amount cases vary away from the mean

47 While we are at it … Skewness has N − 2 df – the 3rd moment. Kurtosis has N − 3 df – the 4th moment; the amount cases vary from s.

48 Parsimony and df Number of df remaining
Measure of parsimony Model which contained all the data Has 0 df Not a parsimonious model Normal distribution Can be described in terms of mean and s 2 parameters (z with 0 parameters)

49 Summary of Lesson 1 Statistics is about modelling DATA
Models have parameters Fewer parameters, more parsimony, better Models need to minimise ERROR Best model, least ERROR Depends on how we define ERROR If we define error as sum of squared deviations from predicted value Mean is best MODEL

50

51

52 Lesson 2: Models with one more parameter - regression

53 In Lesson 1 we said … Use a model to predict and describe data
Mean is a simple, one parameter model

54 More Models Slopes and Intercepts

55 More Models The mean is OK We often have more information than that
As far as it goes It just doesn’t go very far Very simple prediction, uses very little information We often have more information than that We want to use more information than that

56 House Prices In the UK, two of the largest lenders (Halifax and Nationwide) compile house price indices Predict the price of a house Examine effect of different circumstances Look at change in prices Guides legislation E.g. interest rates, town planning

57 Predicting House Prices

58 One Parameter Model The mean “How much is that house worth?” “£88,900”
Use 1 df to say that

59 Adding More Parameters
We have more information than this We might as well use it Add a linear function of number of bedrooms (x1)

60 Alternative Expression
Ŷ = b0 + b1x1 – the estimate of Y (the expected value of Y). Y = b0 + b1x1 + e – the value of Y itself.

61 Estimating the Model We can estimate this model in four different, equivalent ways Provides more than one way of thinking about it 1. Estimating the slope which minimises SSE 2. Examining the proportional reduction in SSE 3. Calculating the covariance 4. Looking at the efficiency of the predictions

62 Estimate the Slope to Minimise SSE

63 Estimate the Slope – Stage 1
Draw a scatterplot, with the x-axis at the mean (not at zero). Mark the errors on it – called 'residuals'. Sum and square these to find SSE.

64

65

66 Add another slope to the chart
Redraw residuals Recalculate SSE Move the line around to find slope which minimises SSE Find the slope

67 First attempt:

68 Any straight line can be defined with two parameters
The location (height) of the slope b0 Sometimes called a The gradient of the slope b1

69 Gradient b1 units 1 unit

70 Height b0 units

71 Height is defined as the point where the line hits the y-axis – the constant, or the y-intercept. If we fix the slope to zero, the height becomes the mean; hence the mean is b0.

72 Why 'the constant'? The model includes b0x0, where x0 is 1.00 for every case – i.e. x0 is constant. This is implicit in SPSS; some packages force you to make it explicit. (Later on we'll need to make it explicit.)

73 Why the intercept? Where the regression line intercepts the y-axis
Sometimes called y-intercept

74 Finding the Slope How do we find the values of b0 and b1?
To start with, we jiggle the values to find the best estimates – those which minimise SSE. An iterative approach: computer intensive, which used to matter but doesn't really any more (with fast computers and sensible search algorithms – more on that later).

75 Start with b0=88.9 (mean) b1=10 (nice round number)
SSE = … – worse than it was. Then: b0=86.9, b1=10, SSE=13828; b0=66.9, b1=10, SSE=7029; b0=56.9, b1=10, SSE=6628; b0=46.9, b1=10, SSE=8228; b0=51.9, b1=10, SSE=7178; b0=51.9, b1=12, SSE=6179; b0=46.9, b1=14, SSE=5957; ……..

76 Quite a long time later: b0 = 46.0, b1 = 14.79 (the values quoted later for these data), SSE = 5921. This gives the position of the regression line, or line of best fit. Better than guessing. Not necessarily the only method, but it is OLS, so it is the best (it is BLUE).
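The 'jiggling' can be sketched as a crude grid search. The original ten houses are not reproduced in the handout, so the numbers below are invented for illustration only:

```python
import numpy as np

# Hypothetical bedrooms/price data (prices in £000s) -- illustration only,
# not the handout's actual house-price dataset.
bedrooms = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 2])
price    = np.array([55, 70, 80, 95, 88, 100, 120, 105, 138, 75])

def sse(b0, b1):
    return np.sum((price - (b0 + b1 * bedrooms)) ** 2)

# Crude grid search standing in for the 'jiggle the values' approach
b0_grid = np.arange(0, 100, 0.5)
b1_grid = np.arange(0, 30, 0.5)
best = min((sse(b0, b1), b0, b1) for b0 in b0_grid for b1 in b1_grid)
print(best)  # smallest SSE and the b0, b1 that produce it
```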

77

78 We now know: a house with no bedrooms is worth ≈ £46,000 (??!), and adding a bedroom adds ≈ £15,000. This tells us two things: don't extrapolate to meaningless values of the x-axis; and the constant is not necessarily useful in itself, but it is necessary to estimate the equation.

79 Standardised Regression Line
One big but: Scale dependent Values change £ to €, inflation Scales change £, £000, £00? Need to deal with this

80 Don’t express in ‘raw’ units
Express in SD units sx1=1.72 sy=36.21 b1 = 14.79 We increase x1 by 1, and Ŷ increases by 14.79

81 Similarly, 1 unit of x1 = 1/1.72 SDs
Increase x1 by 1 SD Ŷ increases by  (1.72/1) = 8.60 Put them both together

82 The standardised regression line
Change (in SDs) in Ŷ associated with a change of 1 SD in x1 A different route to the same answer Standardise both variables (divide by SD) Find line of best fit
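Putting the two steps together, the standardised slope with the slides' values is:

\[ b_1^{std} = b_1 \times \frac{s_{x_1}}{s_y} = 14.79 \times \frac{1.72}{36.21} \approx 0.70 \]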

83 The Correlation Coefficient
The standardised regression line has a special name The Correlation Coefficient (r) (r stands for ‘regression’, but more on that later) Correlation coefficient is a standardised regression slope Relative change, in terms of SDs

84 Proportional Reduction in Error

85 Proportional Reduction in Error
We might be interested in the level of improvement of the model How much less error (as proportion) do we have Proportional Reduction in Error (PRE) Mean only Error(model 0) = 11806 Mean + slope Error(model 1) = 5921

86

87 But we squared all the errors in the first place
So we could take the square root (It’s a shoddy excuse, but it makes the point) This is the correlation coefficient Correlation coefficient is the square root of the proportion of variance explained
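With the numbers from slide 85:

\[ \mathrm{PRE} = \frac{SSE_0 - SSE_1}{SSE_0} = \frac{11806 - 5921}{11806} \approx 0.50, \qquad \sqrt{0.50} \approx 0.71 \approx r \]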

88 Standardised Covariance

89 Standardised Covariance
We are still iterating Need a ‘closed-form’ Equation to solve to get the parameter estimates Answer is a standardised covariance A variable has variance Amount of ‘differentness’ We have used SSE so far

90 SSE varies with N Divide by N The variance Same as SD2
Higher N, higher SSE Divide by N Gives SSE per person (Actually N – 1, we have lost a df to the mean) The variance Same as SD2 We thought of SSE as a scattergram Y plotted against X (repeated image follows)

91

92 Or we could plot Y against Y
Axes meet at the mean (88.9). Draw a square for each point; calculate the area of each square; sum the areas. Sum of areas = SSE. Sum of areas divided by N = variance.

93 Plot of Y against Y

94 Draw Squares: for the point at 138, 138 – 88.9 = 49.1, so area = 49.1 × 49.1 ≈ 2410.8; for the point at 35, 35 – 88.9 = -53.9, so area = (-53.9) × (-53.9) ≈ 2905.2.

95 What if we do the same procedure
Instead of Y against Y Y against X Draw rectangles (not squares) Sum the area Divide by N - 1 This gives us the variance of x with y The Covariance Shortened to Cov(x, y)

96

97 Draw Rectangles (Y against X): for the point at 138, 138 – 88.9 = 49.1 and 4 − 3 = 1, so area = 49.1 × 1 = 49.1; for the point at 55, 55 – 88.9 = -33.9 and 1 − 3 = -2, so area = (-33.9) × (-2) = 67.8.

98 More formally (and easily)
We can state what we are doing as an equation: Cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (N − 1), where Cov(x, y) is the covariance. Here Cov(x, y) = 44.2. What do points in different sectors do to the covariance?

99 Problem with the covariance
Tells us about two things The variance of X and Y The covariance Need to standardise it Like the slope Two ways to standardise the covariance Standardise the variables first Subtract from mean and divide by SD Standardise the covariance afterwards

100 First approach: much more computationally expensive – too much like hard work to do by hand; you need to standardise every value. Second approach: much easier – standardise the final value only. For this we need the combined variance: multiply the two variances and take the square root (they were multiplied in the first place).

101 Standardised covariance

102 The correlation coefficient
A standardised covariance is a correlation coefficient

103 Expanded …

104 This means … We now have a closed form equation to calculate the correlation Which is the standardised slope Which we can use to calculate the unstandardised slope

105 We know that:

106 So value of b1 is the same as the iterative approach
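Checking this with the summary statistics quoted on the slides (small differences are just rounding):

```python
# Reproducing the closed-form route with the slides' summary statistics
cov_xy = 44.2             # covariance of bedrooms and price (slide 98)
s_x, s_y = 1.72, 36.21    # SDs from slide 80

r = cov_xy / (s_x * s_y)  # standardised covariance = correlation
b1 = r * s_y / s_x        # unstandardised slope
print(round(r, 3), round(b1, 2))  # ~0.71 and ~14.9, matching r = 0.706, b1 = 14.79
```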

107 The variables are centred at zero
The intercept Just while we are at it The variables are centred at zero We subtracted the mean from both variables Intercept is zero, because the axes cross at the mean

108 Add the mean of y to the constant – this adjusts for centring y. Subtract the mean of x – but not the whole mean of x; it needs to be corrected for the slope. Naturally, the answer is the same.
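Equivalently (the standard way of writing the same adjustment):

\[ b_0 = \bar{Y} - b_1\,\bar{x}_1 \]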

109 Accuracy of Prediction

110 One More (Last One) We have one more way to calculate the correlation
Looking at the accuracy of the prediction Use the parameters b0 and b1 To calculate a predicted value for each case

111 Plot actual price against predicted price
From the model

112

113 r = 0.706 – the correlation. Seems a futile thing to do, and at this stage it is; but later on we will see why.

114 Some More Formulae For hand calculation Point biserial

115 Phi (φ): used for 2 dichotomous variables.
              Vote P   Vote Q
Homeowner     A: 19    B: …
Not homeowner C: 60    D: 53

116 Problem with the phi correlation
Unless Px= Py (or Px = 1 – Py) Maximum (absolute) value is < 1.00 Tetrachoric can be used Rank (Spearman) correlation Used where data are ranked

117 Summary
The mean is an OLS estimate, and OLS estimates are BLUE. The regression line is the best prediction of the DV from the IV – an OLS estimate (like the mean). The standardised regression line is a correlation.

118 Four ways to think about a correlation
1. Standardised regression line 2. Proportional Reduction in Error (PRE) 3. Standardised covariance 4. Accuracy of prediction

119

120

121 Lesson 3: Why Regression?
A little aside, where we look at why regression has such a curious name.

122 Regression: the (or an) act of regression; reversion; return towards the mean; return to an earlier stage of development, as in an adult's or an adolescent's behaving like a child (from Latin gradi, to go). So why give this name to a statistical technique which is about prediction and explanation?

123 Tall fathers have shorter sons Short fathers have taller sons
Francis Galton Charles Darwin’s cousin Studying heritability Tall fathers have shorter sons Short fathers have taller sons ‘Filial regression toward mediocrity’ Regression to the mean

124 Galton thought this was biological fact Then did the analysis backward
Evolutionary basis? Then did the analysis backward Tall sons have shorter fathers Short sons have taller fathers Regression to the mean Not biological fact, statistical artefact

125 Other Examples Secrist (1933): The Triumph of Mediocrity in Business
Second albums often tend to not be as good as first Sequel to a film is not as good as the first one ‘Curse of Athletics Weekly’ Parents think that punishing bad behaviour works, but rewarding good behaviour doesn’t

126 Pair Link Diagram An alternative to a scatterplot x y

127 r=1.00 x

128 r=0.00 x

129 From Regression to Correlation
Where do we predict an individual’s score on y will be, based on their score on x? Depends on the correlation r = 1.00 – we know exactly where they will be r = 0.00 – we have no idea r = 0.50 – we have some idea

130 r=1.00 x y Starts here Will end up here

131 Could end anywhere here
Starts here Could end anywhere here x y

132 Probably end somewhere here
Starts here x y

133 Galton Squeeze Diagram
Don’t show individuals Show groups of individuals, from the same (or similar) starting point Shows regression to the mean

134 r=0.00: group starts here … ends here

135 r=0.50 x y

136 r=1.00 x y

137 Correlation is the amount of regression that doesn't occur: start 1 unit from the mean on x, end up r units from the mean on y.

138 x y No regression r=1.00

139 x y Some regression r=0.50

140 r=0.00 – lots of (maximum) regression

141 Formula: regression = 1 − r

142 Conclusion
Regression towards the mean is a statistical necessity: regression = perfection − correlation. Very non-intuitive. Interest in regression and correlation came from examining the extent of regression towards the mean, by Pearson – who worked with Galton. We are stuck with the curious name. See also Paper B3.

143

144

145 Lesson 4: Samples to Populations – Standard Errors and Statistical Significance

146 The Problem In Social Sciences Theoretically
We investigate samples Theoretically Randomly taken from a specified population Every member has an equal chance of being sampled Sampling one member does not alter the chances of sampling another Not the case in (say) physics, biology, etc.

147 Population But it’s the population that we are interested in
Not the sample Population statistic represented with Greek letter Hat means ‘estimate’

148 Sample statistics (e.g. mean) estimate population parameters
Want to know Likely size of the parameter If it is > 0

149 Sampling Distribution
We need to know the sampling distribution of a parameter estimate How much does it vary from sample to sample If we make some assumptions We can know the sampling distribution of many statistics Start with the mean

150 Sampling Distribution of the Mean
Given a normal distribution, a random sample and continuous data, the mean has a known sampling distribution: repeated sampling will give a known distribution of means, centred around the true (population) mean (μ).

151 Analysis Example: Memory
Difference in memory for different words 10 participants given a list of 30 words to learn, and then tested Two types of word Abstract: e.g. love, justice Concrete: e.g. carrot, table

152

153 Confidence Intervals
If we know the mean in our sample, we can estimate where the mean in the population (μ) is likely to be, using the standard error (se) of the mean – which represents the standard deviation of the sampling distribution of the mean.

154 1 SD contains 68% Almost 2 SDs contain 95%

155 We know the sampling distribution of the mean
It is t distributed (normal with large N, > 30). We know the range within which means from other samples will fall, and therefore the likely range of μ.

156 Two implications of the equation
Increasing N decreases the SE (but only a bit); decreasing the SD decreases the SE. Calculate confidence intervals from standard errors: 95% is a standard level of CI; in 95% of samples the true mean will lie within the 95% CIs. In large samples: 95% CI = 1.96 × SE. In smaller samples: depends on the t distribution (df = N − 1 = 9).
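A sketch of the calculation in Python. The actual memory data are not in the handout, so the difference scores below are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical concrete-minus-abstract recall differences for 10 participants
diff = np.array([3, 1, 4, 2, 0, 5, 2, 3, 1, 4])

n = len(diff)
se = diff.std(ddof=1) / np.sqrt(n)      # standard error of the mean (SE = s / sqrt(N))
t_crit = stats.t.ppf(0.975, df=n - 1)   # ~2.26 with df = 9, not 1.96
ci = (diff.mean() - t_crit * se, diff.mean() + t_crit * se)
t = diff.mean() / se                    # test of H0: mu = 0
p = 2 * stats.t.sf(abs(t), df=n - 1)
print(se, ci, t, p)
```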

157

158

159 What is a CI? (For 95% CI): 95% chance that the true (population) value lies within the confidence interval? 95% of samples, true mean will land within the confidence interval?

160 Significance Test
Probability that μ is a certain value – almost always 0 (it doesn't have to be, though). We want to test the hypothesis that the difference is equal to 0, i.e. find the probability of this difference occurring in our sample IF μ = 0. (Not the same as the probability that μ = 0.)

161 Calculate SE, and then t t has a known sampling distribution
Can test probability that a certain value is included

162 Other Parameter Estimates
Same approach Prediction, slope, intercept, predicted values At this point, prediction and slope are the same Won’t be later on We will look at one predictor only More complicated with > 1

163 Testing the Degree of Prediction
Prediction is correlation of Y with Ŷ The correlation – when we have one IV Use F, rather than t Started with SSE for the mean only This is SStotal Divide this into SSresidual SSregression SStot = SSreg + SSres

164

165 Back to the house prices
Original SSE (SStotal) = 11806. SSresidual = 5921 – what is left after our model. SSregression = 11806 − 5921 = 5885 – what our model explains. Slope = 14.79, intercept = 46.0, r = 0.706.

166

167 F = 7.95, df = 1, 8, p = 0.02
H0: prediction is no better than chance. We can reject H0 – a significant effect.
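For the record, that F follows directly from the sums of squares on slide 165, with k = 1 predictor and N = 10 houses:

\[ F = \frac{SS_{reg}/k}{SS_{res}/(N-k-1)} = \frac{5885/1}{5921/8} \approx 7.95 \]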

168 Statistical Significance: What does a p-value (really) mean?

169 A Quiz Six questions, each true or false
Write down your answers (if you like) An experiment has been done. Carried out perfectly. All assumptions perfectly satisfied. Absolutely no problems. P = 0.01 Which of the following can we say?

170 1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

171 2. You have found the probability of the null hypothesis being true.

172 3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

173 4. You can deduce the probability of the experimental hypothesis being true.

174 5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

175 6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

176 OK, What is a p-value Cohen (1994)
“[a p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe it does” (p 997).

177 OK, What is a p-value Sorry, didn’t answer the question
It’s The probability of obtaining a result as or more extreme than the result we have in the study, given that the null hypothesis is true Not probability the null hypothesis is true

178 A Bit of Notation
Not because we like notation, but because we then have to say a lot less. Probability – P; the null hypothesis is true – H; result (data) – D; given – |.

179 What's a P Value?
P(D|H): the probability of the data occurring if the null hypothesis is true. Not P(H|D): the probability that the null hypothesis is true, given that we have the data. P(H|D) ≠ P(D|H).

180 What is probability you are prime minister
Given that you are british P(M|B) Very low What is probability you are British Given you are prime minister P(B|M) Very high P(M|B) ≠ P(B|M)

181 There's been a murder – someone bumped off a statto for talking too much. The police have DNA from the scene, and they have your DNA. They match(!). The DNA matches 1 in 1,000,000 people. What's the probability you didn't do the murder, given the DNA match – P(H|D)?

182 Police say: P(D|H) = 1/1,000,000. Luckily, you have Jeremy on your defence team. We say: P(D|H) ≠ P(H|D). The probability that someone who didn't do the murder matches the DNA is incredibly high.

183 Back to the Questions Haller and Kraus (2002)
Asked those questions of groups in Germany Psychology Students Psychology lecturers and professors (who didn’t teach stats) Psychology lecturers and professors (who did teach stats)

184 You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). 'True': 34% of students, 15% of professors/lecturers, 10% of professors/lecturers teaching statistics. False – we have found evidence against the null hypothesis.

185 You have found the probability of the null hypothesis being true.
32% of students 26% of professors/lecturers 17% of professors/lecturers teaching statistics False We don’t know

186 3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means). 20% of students 13% of professors/lecturers 10% of professors/lecturers teaching statistics False

187 You can deduce the probability of the experimental hypothesis being true.
59% of students 33% of professors/lecturers 33% of professors/lecturers teaching statistics False

188 You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. 'True': 68% of students, 67% of professors/lecturers, 73% of professors/lecturers teaching statistics. False. What can be worked out is P(replication).

189 You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. 'True': 41% of students, 49% of professors/lecturers, 37% of professors/lecturers teaching statistics. False – another tricky one; it can be worked out.

190 One Last Quiz
I carry out a study: all assumptions perfectly satisfied, a random sample from the population, and I find p = 0.05. You replicate the study exactly. What is the probability you find p < 0.05?

191 I carry out a study: all assumptions perfectly satisfied, a random sample from the population, and I find p = 0.01. You replicate the study exactly. What is the probability you find p < 0.05?

192 Significance testing creates boundaries and gaps where none exist.
Significance testing means that we find it hard to build upon knowledge we don’t get an accumulation of knowledge

193 Yates (1951): "the emphasis given to formal tests of significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the results of the tests of significance ... and too little to the estimates of the magnitude of the effects they are investigating".

194 Testing the Slope Same idea as with the mean
Estimate 95% CI of slope Estimate significance of difference from a value (usually 0) Need to know the sd of the slope Similar to SD of the mean

195

196 Similar to equation for SD of mean Then we need standard error
Similar (ish) When we have standard error Can go on to 95% CI Significance of difference

197

198 Confidence Limits
95% CI: the t distribution with N − k − 1 df gives 2.31. CI = 5.24 × 2.31 = 12.06 either side of the slope – the 95% confidence limits.
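The standard error of the slope comes from the usual OLS expression (a standard result, with MS_res the residual mean square); taking the slide's 5.24 as that standard error:

\[ se(b_1) = \sqrt{\frac{MS_{res}}{\sum_i (x_i - \bar{x})^2}}, \qquad b_1 \pm t_{0.975,\,N-k-1}\, se(b_1) \approx 14.79 \pm 12.06 \]

i.e. roughly 2.7 to 26.9 (£000 per extra bedroom) – wide, because the sample is small.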

199 Significance of difference from zero
i.e. probability of getting result if b=0 Not probability that b = 0 This probability is (of course) the same as the value for the prediction

200 Testing the Standardised Slope (Correlation)
Correlation is bounded between –1 and +1 Does not have symmetrical distribution, except around 0 Need to transform it Fisher z’ transformation – approximately normal

201 95% CIs: 0.879 − 1.96 × 0.38 = 0.13; 0.879 + 1.96 × 0.38 = 1.62

202 Transform back to correlation
95% CIs = 0.13 to 0.92 Very wide Small sample size Maybe that’s why CIs are not reported?

203 Using Excel
Functions in Excel: FISHER() – to carry out the Fisher transformation; FISHERINV() – to transform back to a correlation.
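The same two steps in Python, where numpy's arctanh/tanh play the role of FISHER and FISHERINV:

```python
import numpy as np

r, n = 0.706, 10                 # correlation and sample size from the slides
z = np.arctanh(r)                # Fisher z' (Excel's FISHER)
se = 1 / np.sqrt(n - 3)          # standard error of z'
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)   # FISHERINV to go back
print(z, se, lo, hi)  # z ~0.88, se ~0.38, and (to rounding) the 0.13-0.92 interval on the slides
```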

204 The Others Same ideas for calculation of CIs and SEs for
Predicted score Gives expected range of values given X Same for intercept But we have probably had enough

205 Lesson 5: Introducing Multiple Regression

206 Residuals
We said Y = b0 + b1x1. We could have said Yi = b0 + b1xi1 + ei. We ignored the i on the Y, and we ignored the ei – it's called error, after all. But it isn't just error; it is trying to tell us something.

207 What Error Tells Us: Error tells us that a case has a different score for Y than we predict – there is something about that case. It is called the residual: what is left over, after the model. It contains information – something is making the residual ≠ 0. But what?

208 Unpleasant neighbours
swimming pool Unpleasant neighbours

209 If all cases were equal on X
The residual (+ the mean) is the value of Y If all cases were equal on X It is the value of Y, controlling for X Other words: Holding constant Partialling Residualising Conditioned on

210

211 Sometimes adjustment is enough on its own Teenage pregnancy rate
Measure performance against criteria Teenage pregnancy rate Measure pregnancy and abortion rate in areas Control for socio-economic deprivation, and anything else important See which areas have lower teenage pregnancy and abortion rate, given same level of deprivation Value added education tables Measure school performance Control for initial intake

212 Control? In experimental research In non-experimental research
Use experimental control e.g. same conditions, materials, time of day, accurate measures, random assignment to conditions In non-experimental research Can’t use experimental control Use statistical control instead

213 Analysis of Residuals What predicts differences in crime rate
After controlling for socio-economic deprivation Number of police? Crime prevention schemes? Rural/Urban proportions? Something else This is what regression is about

214 Books and attend as IV, grade as DV
Exam performance Consider number of books a student read (books) Number of lectures (max 20) a student attended (attend) Books and attend as IV, grade as DV

215 First 10 cases

216 Use books as IV: R=0.492, F=12.1, df=1, 28, p=0.001; b0=52.1, b1=5.7 (the intercept makes sense). Use attend as IV: R=0.482, F=11.5, df=1, 38, p=0.002; b0=37.0, b1=1.9 (the intercept makes less sense).

217 (Scatterplot of Grade (out of 100) against Books)

218

219 Problem Use R2 to give proportion of shared variance
Books = 24% Attend = 23% So we have explained 24% + 23% = 47% of the variance NO!!!!!

220 Look at the correlation matrix
         BOOKS   ATTEND   GRADE
BOOKS    1
ATTEND   0.44    1
GRADE    0.49    0.48     1
The correlation of books and attend is (unsurprisingly) not zero. Some of the variance that books shares with grade is also shared by attend.

221 I have access to 2 cars; my wife has access to 2 cars. Do we have access to four cars? No – we need to know how many of my 2 cars are the same cars as her 2 cars. Similarly with regression. We can do this with the residuals: residuals are what is left after (say) books; see how much of the residual variance is explained by attend; then use this new residual variance to calculate SSres, SStotal and SSreg.

222 Assumes that the variables have a causal priority
Well. Almost. This would give us correct values for SS Would not be correct for slopes, etc Assumes that the variables have a causal priority Why should attend have to take what is left from books? Why should books have to take what is left by attend? Use OLS again

223 Simultaneously estimate 2 parameters
b1 and b2 Y = b0 + b1x1 + b2x2 x1 and x2 are IVs Not trying to fit a line any more Trying to fit a plane Can solve iteratively Closed form equations better But they are unwieldy

224 3D scatterplot (2points only) y x2 x1

225 b2 y b1 b0 x2 x1

226 (Really) Ridiculous Equations

227 The good news: there is an easier way. The bad news: it involves matrix algebra. We don't really need to know how to do it – we need to know it exists.

228 A Quick Guide to Matrix Algebra
(I will never make you do it again)

229 Very Quick Guide to Matrix Algebra
Why? Matrices make life much easier in multivariate statistics Some things simply cannot be done without them Some things are much easier with them If you can manipulate matrices you can specify calculations v. easily e.g. AA’ = sum of squares of a column Doesn’t matter how long the column

230 A scalar is a number, e.g. 4. A vector is a row or column of numbers: a row vector or a column vector.

231 A vector is described as rows × columns: a 1 × 4 vector; a 2 × 1 vector. A number (scalar) is a 1 × 1 vector.

232 A matrix is a rectangle, described as rows × columns – e.g. a 3 × 5 matrix. Matrices are referred to with bold capitals: A is a matrix.

233 Correlation matrices and covariance matrices are special
They are square and symmetrical Correlation matrix of books, attend and grade

234 Another special matrix is the identity matrix I
A square matrix, with 1 in the diagonal and 0 in the off-diagonal Note that this is a correlation matrix, with correlations all = 0

235 Matrix Operations Transposition
A matrix is transposed by putting it on its side Transpose of A is A’

236 Matrix multiplication
A matrix can be multiplied by a scalar, a vector or a matrix. It is not commutative: AB ≠ BA. To multiply AB, the number of columns in A must equal the number of rows in B.

237 Matrix by vector

238 Matrix by matrix

239 Multiplying by the identity matrix
Has no effect Like multiplying by 1

240 The inverse of a scalar J is 1/J: J × 1/J = 1. The same with matrices: matrices have an inverse; the inverse of A is A⁻¹; AA⁻¹ = I. Inverting matrices is dull – we will do it once. But first, we must calculate the determinant.

241 The determinant of A is |A| Determinants are important in statistics
(more so than the other matrix algebra) We will do a 2x2 Much more difficult for larger matrices

242

243 Determinants are important because
Needs to be above zero for regression to work Zero or negative determinant of a correlation/covariance matrix means something wrong with the data Linear redundancy Described as: Not positive definite Singular (if determinant is zero) In different error messages

244 Next, the adjoint Now

245 Find A-1

246 Matrix Algebra with Correlation Matrices

247 Determinants Determinant of a correlation matrix
The volume of ‘space’ taken up by the (hyper) sphere that contains all of the points

248

249

250 Negative Determinant Points take up less than no space
Correlation matrix cannot exist Non-positive definite matrix

251 Sometimes Obvious

252 Sometimes Obvious (If You Think)

253 Sometimes No Idea

254 Multiple R for Each Variable
Diagonal of inverse of correlation matrix Used to calculate multiple R Call elements aij

255 Regression Weights Where i is DV j is IV

256 Back to the Good News
We can calculate the standardised parameters as B = Rxx⁻¹ Rxy, where B is the vector of regression weights, Rxx⁻¹ is the inverse of the correlation matrix of the independent (x) variables, and Rxy is the vector of correlations of the x variables with y. Now do exercise 3.2.
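The same calculation can be sketched in a few lines of numpy, using the correlations quoted earlier for the books/attend example (the exact values are only as good as those rounded correlations):

```python
import numpy as np

# Correlations from the books/attend/grade example (slides 216 and 220)
Rxx = np.array([[1.00, 0.44],
                [0.44, 1.00]])   # books with attend
Rxy = np.array([0.492, 0.482])   # books with grade, attend with grade

B = np.linalg.inv(Rxx) @ Rxy     # standardised regression weights
R2 = B @ Rxy                     # proportion of variance explained
print(B, R2)                     # roughly [0.35, 0.33] and R2 of about 0.33 -- not 0.47
```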

257 One More Thing The whole regression equation can be described with matrices very simply

258 Go all the way back to our example
Where Y = vector of DV X = matrix of IVs B = vector of coefficients Go all the way back to our example

259

260 The constant – literally a constant
The constant – literally a constant. Could be any number, but it is most convenient to make it 1. Used to ‘capture’ the intercept.

261 The matrix of values for IVs (books and attend)

262 The parameter estimates. We are trying to find the best values of these.

263 Error. We are trying to minimise this

264 The DV - grade

265 Simple way of representing as many IVs as you like
Y = BX + E, i.e. Y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e.
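A minimal numpy sketch of the matrix form and the usual OLS solution B = (X'X)⁻¹X'y. The numbers are invented; the point is the shape of X, with its leading column of 1s:

```python
import numpy as np

# Hypothetical grades example: columns of X are the constant (x0 = 1),
# books and attend; y is grade. Values invented for illustration only.
X = np.array([[1, 0, 12],
              [1, 1,  8],
              [1, 2, 16],
              [1, 3, 10],
              [1, 4, 19]], dtype=float)
y = np.array([50, 55, 70, 62, 80], dtype=float)

B = np.linalg.inv(X.T @ X) @ (X.T @ y)   # ordinary least squares in matrix form
e = y - X @ B                            # the residual vector E
print(B, np.sum(e ** 2))                 # parameter estimates and SSE
```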

266

267 Generalises to Multivariate Case
Y=BX+E Y, B and E Matrices, not vectors Goes beyond this course (Do Jacques Tacq’s course for more) (Or read his book)

268

269

270

271 Lesson 6: More on Multiple Regression

272 Parameter Estimates Parameter estimates (b1, b2 … bk) were standardised Because we analysed a correlation matrix Represent the correlation of each IV with the DV When all other IVs are held constant

273 Can also be unstandardised
Unstandardised represent the unit change in the DV associated with a 1 unit change in the IV When all the other variables are held constant Parameters have standard errors associated with them As with one IV Hence t-test, and associated probability can be calculated Trickier than with one IV

274 Standard Error of Regression Coefficient
Standardised is easier R2i is the value of R2 when all other predictors are used as predictors of that variable Note that if R2i = 0, the equation is the same as for previous

275 Multiple R The degree of prediction
R (or Multiple R) No longer equal to b R2 Might be equal to the sum of squares of B Only if all x’s are uncorrelated

276 In Terms of Variance Can also think of this in terms of variance explained. Each IV explains some variance in the DV The IVs share some of their variance Can’t share the same variance twice

277 Variance in Y accounted for by x1: r²x1y = 0.36. The total variance of Y = 1.

278 In this model R² = r²yx1 + r²yx2 = 0.36 + 0.36 = 0.72. But if x1 and x2 are correlated, this is no longer the case.

279 The total variance of Y = 1. Variance in Y accounted for by x1: r²x1y = 0.36. Variance in Y accounted for by x2: r²x2y = 0.36. Some variance is shared between x1 and x2 (not equal to rx1x2).

280 So we can no longer simply sum the r². We need to sum them and subtract the shared variance – but it's not the correlation between them; it's the correlation between them as a proportion of the variance of Y. There are two different ways of doing this.

281 Based on estimates: if rx1x2 = 0, then bx1 = rx1y, and this is equivalent to r²yx1 + r²yx2.

282 Based on correlations: when rx1x2 = 0, this is equivalent to r²yx1 + r²yx2.

283 Can also be calculated using methods we have seen
Based on PRE Based on correlation with prediction Same procedure with >2 IVs

284 Adjusted R2 R2 is an overestimate of population value of R2
Any x will not correlate 0 with Y Any variation away from 0 increases R Variation from 0 more pronounced with lower N Need to correct R2 Adjusted R2

285 Calculation of Adj. R²: 1 − R² is the proportion of unexplained variance. We multiply this by an adjustment: more variables – greater adjustment; more people – less adjustment.
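The usual adjustment (a standard formula, not shown in the handout) makes this explicit – k variables increase it, N people reduce it:

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - k - 1} \]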

286 Shrunken R2 Some authors treat shrunken and adjusted R2 as the same thing Others don’t

287

288 Extra Bits: some stranger things that can happen – counter-intuitive.

289 Suppressor variables Can be hard to understand Definition
Very counter-intuitive Definition An independent variable which increases the size of the parameters associated with other independent variables above the size of their correlations

290 An example (based on Horst, 1941)
Success of trainee pilots Mechanical ability (x1), verbal ability (x2), success (y) Correlation matrix

291 Mechanical ability correlates 0.3 with success
Verbal ability correlates 0.0 with success What will the parameter estimates be? (Don’t look ahead until you have had a guess)

292 Mechanical ability: b = 0.4 – larger than r! Verbal ability: b = −0.2 – smaller than r!! So what is happening? You need verbal ability to do the test, but verbal ability is not related to mechanical ability – the measure of mechanical ability is contaminated by verbal ability.
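These betas can be reproduced from the correlations. The handout gives r(mech, success) = 0.3 and r(verbal, success) = 0.0; the mech–verbal correlation is not quoted here, so the sketch below assumes 0.5, which is the value that gives the slide's weights:

```python
import numpy as np

Rxx = np.array([[1.0, 0.5],   # assumed r(mech, verbal) = 0.5 -- not stated on the slide
                [0.5, 1.0]])
Rxy = np.array([0.3, 0.0])    # r with success, from the slide

B = np.linalg.inv(Rxx) @ Rxy
print(B)  # [ 0.4, -0.2]: mechanical is pushed above its correlation,
          # verbal gets a negative weight -- the suppressor effect
```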

293 High mech, low verbal: mech is positive; verbal is negative (because we are talking about standardised scores). Your mech is really high – you did well on the mechanical test without being good at the words. High mech, high verbal: well, you had a head start on mech because of verbal, and need to be brought down a bit.

294 Another suppressor? The only difference – the correlation of x2 with y changes to 0.2. What will b1 and b2 be? (Answer: b1 = 0.26, b2 = −0.06 – no suppressor effect.)

295 Another suppressor? b1 =0.26 b2 = -0.06

296 And another? What will b1 and b2 be? (Answer: b1 = 0.53, b2 = −0.47 – a big suppressor.)

297 And another? b1 = 0.53 b2 = -0.47

298 One more? What will b1 and b2 be? (Answer: b1 = 0.53, b2 = 0.47 – a suppressor again.)

299 One more? b1 = 0.53 b2 = 0.47

300 Suppression happens when two opposing forces are happening together
And have opposite effects Don’t throw away your IVs, Just because they are uncorrelated with the DV Be careful in interpretation of regression estimates Really need the correlations too, to interpret what is going on Cannot compare between studies with different IVs

301 Standardised Estimates > 1
Correlations are bounded -1.00 ≤ r ≤ +1.00 We think of standardised regression estimates as being similarly bounded But they are not Can go >1.00, <-1.00 R cannot, because that is a proportion of variance

302 Three measures of ability
Mechanical ability, verbal ability 1, verbal ability 2 Score on science exam Before reading on, what are the parameter estimates?

303 Mechanical About where we expect Verbal 1 Very high Verbal 2 Very low

304 What is going on? It's a suppressor again: an independent variable which increases the size of the parameters associated with other independent variables above the size of their correlations. Verbal 1 and verbal 2 are correlated so highly that they need to cancel each other out.

305 Variable Selection What are the appropriate independent variables to use in a model? Depends what you are trying to do Multiple regression has two separate uses Prediction Explanation

306 Prediction Explanation What will happen in the future?
Emphasis on practical application Variables selected (more) empirically Value free Explanation Why did something happen? Emphasis on understanding phenomena Variables selected theoretically Not value free

307 More on causality later on … Which are appropriate variables
Visiting the doctor Precedes suicide attempts Predicts suicide Does not explain suicide More on causality later on … Which are appropriate variables To collect data on? To include in analysis? Decision needs to be based on theoretical knowledge of the behaviour of those variables Statistical analysis of those variables (later) Unless you didn’t collect the data Common sense (not a useful thing to say)

308 Variable Entry Techniques
Entry-wise All variables entered simultaneously Hierarchical Variables entered in a predetermined order Stepwise Variables entered according to change in R2 Actually a family of techniques

309 Entrywise: all variables entered simultaneously, all treated equally. Hierarchical: entered in a theoretically determined order; the change in R² is assessed and tested for significance. E.g. sex and age should not be treated equally with other variables – sex and age MUST go first. (Not to be confused with hierarchical linear modelling.)

310 Stepwise Example Variables entered empirically
Variable which increases R2 the most goes first Then the next … Variables which have no effect can be removed from the equation Example IVs: Sex, age, extroversion, DV: Car – how long someone spends looking after their car

311 Correlation Matrix

312 Entrywise analysis r2 = 0.64

313 Stepwise Analysis Data determines the order
Model 1: Extroversion, R2 = 0.450 Model 2: Extroversion + Sex, R2 = 0.633

314 Hierarchical analysis
Theory determines the order Model 1: Sex + Age, R2 = 0.510 Model 2: S, A + E, R2 = 0.638 Change in R2 = 0.128, p = 0.001

315 Which is the best model? Entrywise – OK. Stepwise – excluded age, which did have a (small) effect. Hierarchical – the change in R² gives the best estimate of the importance of extroversion. Other problems with stepwise: F and df are wrong (it cheats with df); unstable results – small changes (sampling variance) give large differences in models.

316 Uses a lot of paper Don’t use a stepwise procedure to pack your suitcase

317 Is Stepwise Always Evil?
Yes All right, no Research goal is predictive (technological) Not explanatory (scientific) What happens, not why N is large 40 people per predictor, Cohen, Cohen, Aiken, West (2003) Cross validation takes place

318 A quick note on R2 R2 is sometimes regarded as the ‘fit’ of a regression model Bad idea If good fit is required – maximise R2 Leads to entering variables which do not make theoretical sense

319 Critique of Multiple Regression
Goertzel (2002) “Myths of murder and multiple regression” Skeptical Inquirer (Paper B1) Econometrics and regression are ‘junk science’ Multiple regression models (in US) Used to guide social policy

320 More Guns, Less Crime
Lott and Mustard: a 1% increase in gun ownership (controlling for other factors) gives a 3.3% decrease in murder rates. But: more guns in the rural Southern US; more crime in the urban North (a crack cocaine epidemic at the time of the data).

321 Executions Cut Crime: No difference between crime rates in US states with or without the death penalty. Ehrlich (1975) controlled all variables that affect crime rates; the death penalty had an effect in reducing the crime rate. No statistical way to decide who's right.

322 Legalised Abortion
Donohue and Levitt (1999): legalised abortion in the 1970s cut crime in the 1990s. Lott and Whitley (2001): "Legalising abortion increased murder rates by … 0.5 to 7 per cent." It's impossible to model these data – controlling for other historical events, crack cocaine (again).

323 Another Critique: Berk (2003), Regression Analysis: A Constructive Critique (Sage). Three cheers for regression as a descriptive technique; two cheers for regression as an inferential technique; one cheer for regression as a causal analysis.

324 Is Regression Useless? Do regression carefully Validate models
Don’t go beyond data which you have a strong theoretical understanding of Validate models Where possible, validate predictive power of models in other areas, times, groups Particularly important with stepwise

325 Lesson 7: Categorical Independent Variables

326 Introduction

327 Introduction So far, just looked at continuous independent variables
Also possible to use categorical (nominal, qualitative) independent variables e.g. Sex; Job; Religion; Region; Type (of anything) Usually analysed with t-test/ANOVA

328 Historical Note But these (t-test/ANOVA) are special cases of regression analysis Aspects of General Linear Models (GLMs) So why treat them differently? Fisher’s fault Computers’ fault Regression, as we have seen, is computationally difficult Matrix inversion and multiplication Unfeasible, without a computer

329 In the special cases where:
You have one categorical IV Your IVs are uncorrelated It is much easier to do it by partitioning of sums of squares These cases Very rare in ‘applied’ research Very common in ‘experimental’ research Fisher worked at Rothamsted agricultural research station Never have problems manipulating wheat, pigs, cabbages, etc

330 In psychology this led to a split between 'experimental' psychologists and 'correlational' psychologists; experimental psychologists (until recently) would not think in terms of continuous variables. It is still (too) common to dichotomise a variable because it is too difficult to analyse properly – equivalent to discarding 1/3 of your data.

331 The Approach

332 The Approach Recode the nominal variable Names are slightly confusing
Into one, or more, variables to represent that variable Names are slightly confusing Some texts talk of ‘dummy coding’ to refer to all of these techniques Some (most) refer to ‘dummy coding’ to refer to one of them Most have more than one name

333 If a variable has g possible categories it is represented by g-1 variables
Simplest case: Smokes: Yes or No Variable 1 represents ‘Yes’ Variable 2 is redundant If it isn’t yes, it’s no

334 The Techniques

335 We will examine two coding schemes
Dummy coding For two groups For >2 groups Effect coding Look at analysis of change Equivalent to ANCOVA Pretest-posttest designs

336 Dummy Coding – 2 Groups Also called simple coding by SPSS
A categorical variable with two groups One group chosen as a reference group The other group is represented in a variable e.g. 2 groups: Experimental (Group 1) and Control (Group 0) Control is the reference group Dummy variable represents experimental group Call this variable ‘group1’

337 For variable 'group1': 1 = 'Yes', 0 = 'No'

338 Some data Group is x, score is y

339 Control group = 0: the intercept = the score on Y when x = 0 = the mean of the control group. Experimental group = 1: b = the change in Y when x increases by 1 unit = the difference between the experimental group and the control group.

340 Gradient of slope represents difference between means

341 Dummy Coding – 3+ Groups
With three groups the approach is similar: g = 3, therefore g − 1 = 2 variables are needed. 3 groups: Control, Experimental Group 1, Experimental Group 2.

342 Recoded into two variables
Note – do not need a 3rd variable If we are not in group 1 or group 2 MUST be in control group 3rd variable would add no information (What would happen to determinant?)
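A sketch of this recoding in pandas (the group labels are invented for illustration):

```python
import pandas as pd

# Three groups: control is the reference, so g - 1 = 2 dummy variables
group = pd.Series(['control', 'exp1', 'exp2', 'control', 'exp1', 'exp2'])

dummies = pd.get_dummies(group, dtype=int)[['exp1', 'exp2']]  # drop the reference column
print(dummies)
#    exp1  exp2
# 0     0     0   <- control: 0 on both (the reference group)
# 1     1     0   <- experimental group 1
# 2     0     1   <- experimental group 2
# ...
```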

343 F and its associated p tests the H0 that all the group means are equal (b1 = b2 = 0). b1 and b2 and their associated p-values test the difference between each experimental group and the control group. To test the difference between the two experimental groups, we need to rerun the analysis.

344 One more complication: we have now run multiple comparisons, which increases α – i.e. the probability of a type I error. We need to correct for this: Bonferroni correction – multiply the given p-values by two/three (depending on how many comparisons were made).

345 Effect Coding Usually used for 3+ groups
Compares each group (except the reference group) to the mean of all groups Dummy coding compares each group to the reference group. Example with 5 groups 1 group selected as reference group Group 5

346 Each group (except reference) has a variable
1 if the individual is in that group 0 if not -1 if in reference group

347 Examples Dummy coding and Effect Coding
Group 1 chosen as reference group each time Data

348 Dummy coding (Group 1 = reference): Group 1 – dummy2 = 0, dummy3 = 0; Group 2 – dummy2 = 1, dummy3 = 0; Group 3 – dummy2 = 0, dummy3 = 1. Effect coding: Group 1 – effect2 = −1, effect3 = −1; Group 2 – effect2 = 1, effect3 = 0; Group 3 – effect2 = 0, effect3 = 1.

349 Dummy: R=0.543, F=5.7, df=2, 27, p=0.009; b0 = 52.4, b1 = 3.9 (p=0.100), b2 = 7.7 (p=0.002). Effect: R=0.543, F=5.7, df=2, 27, p=0.009; b0 = 56.27, b1 = 0.03 (p=0.980), b2 = 3.8 (p=0.007).

350 In SPSS SPSS provides two equivalent procedures for regression
Regression (which we have been using) GLM (which we haven’t) GLM will: Automatically code categorical variables Automatically calculate interaction terms GLM won’t: Give standardised effects Give hierarchical R2 p-values Allow you to not understand

351 ANCOVA and Regression

352 Test (which is a trick, but it's designed to make you think about it): use employee data.sav. Compare the pay rise (the difference between salbegin and salary) for ethnic minority and non-minority staff. What do you find?

353 ANCOVA and Regression Dummy coding approach has one special use
In ANCOVA, for the analysis of change Pre-test post-test experimental design Control group and (one or more) experimental groups Tempting to use difference score + t-test / mixed design ANOVA Inappropriate

354 Salivary cortisol levels
Used as a measure of stress Not absolute level, but change in level over day may be interesting Test at: 9.00am, 9.00pm Two groups High stress group (cancer biopsy) Group 1 Low stress group (no biopsy) Group 0

355 Correlation of AM and PM = 0.493 (p=0.008)
Has there been a significant difference in the rate of change of salivary cortisol? 3 different approaches

356 Approach 1 – find the differences, do a t-test
t = 1.31, df=26, p=0.203 Approach 2 – mixed ANOVA, look for interaction effect F = 1.71, df = 1, 26, p = 0.203 F = t2 Approach 3 – regression (ANCOVA) based approach

357 Regression (ANCOVA) approach – IVs: AM and group; DV: PM. b1 (group) = 3.59, standardised b1 = 0.432, p = 0.01. Why is the regression approach better? The other two approaches took the difference, which assumes that r = 1.00; any departure from r = 1.00 and you add error variance – subtracting error is the same as adding error.

358 Using regression Two effects
Ensures that all the variance that is subtracted is true Reduces the error variance Two effects Adjusts the means Compensates for differences between groups Removes error variance

359 In SPSS SPSS automates all of this Use Analyse, GLM, Univariate ANOVA
But you have to understand it, to know what it is doing Use Analyse, GLM, Univariate ANOVA

360 (SPSS GLM dialog: outcome here; categorical predictors here; continuous predictors here; click Options.)

361 Select parameter estimates

362 More on Change If difference score is correlated with either pre-test or post-test Subtraction fails to remove the difference between the scores If two scores are uncorrelated Difference will be correlated with both Failure to control Equal SDs, r = 0 Correlation of change and pre-score =0.707

363 Even More on Change A topic of surprising complexity
What I said about difference scores isn’t always true Lord’s paradox – it depends on the precise question you want to answer Collins and Horn (1993). Best methods for the analysis of change Collins and Sayer (2001). New methods for the analysis of change.

364 Lesson 8: Assumptions in Regression Analysis

365 The Assumptions
The distribution of residuals is normal (at each value of the dependent variable). The variance of the residuals for every set of values of the independent variable is equal – violation is called heteroscedasticity. The error term is additive – no interactions. At every value of the dependent variable the expected (mean) value of the residuals is zero – no non-linear relationships.

366 The expected correlation between residuals, for any two cases, is 0.
The independence assumption (lack of autocorrelation) All independent variables are uncorrelated with the error term. No independent variables are a perfect linear function of other independent variables (no perfect multicollinearity) The mean of the error term is zero.

367 What are we going to do … Deal with some of these assumptions in some detail Deal with others in passing only look at them again later on

368 Assumption 1: The Distribution of Residuals is Normal at Every Value of the Dependent Variable

369 Look at Normal Distributions
A normal distribution symmetrical, bell-shaped (so they say)

370 What can go wrong? Skew – non-symmetricality; one tail longer than the other. Kurtosis – too flat or too peaked (kurtosed). Outliers – individual cases which are far from the distribution.

371 Effects on the Mean: Skew biases the mean, in the direction of the skew. Kurtosis – the mean is not biased, but the standard deviation is, and hence so are standard errors and significance tests.

372 Examining Univariate Distributions
Histograms Boxplots P-P plots Calculation based methods

373 Histograms A and B

374 C and D

375 E & F

376 Histograms can be tricky ….

377 Boxplots

378 P-P Plots A & B

379 C & D

380 E & F

381 Calculation Based Skew and Kurtosis statistics
Outlier detection statistics

382 Skew and Kurtosis Statistics
Normal distribution skew = 0 kurtosis = 0 Two methods for calculation Fisher’s and Pearson’s Very similar answers Associated standard error can be used for significance of departure from normality not actually very useful Never normal above N = 400
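A quick sketch of these statistics in Python (the variable is simulated purely to illustrate the functions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(size=500)    # a positively skewed variable, for illustration

print(stats.skew(x))             # ~2 for an exponential; 0 for a normal distribution
print(stats.kurtosis(x))         # excess kurtosis (Fisher's definition); 0 if normal
print(stats.skewtest(x), stats.kurtosistest(x))  # tests of departure from normality --
                                                 # with large N these are 'significant'
                                                 # even for trivial departures
```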

383

384 Outlier Detection Calculate distance from mean; calculate influence
Distance from the mean: a z-score (number of standard deviations), or a deleted z-score (that case biased the mean, so remove it before calculating). Look up the expected distance from the mean (e.g. 1% at 3+ SDs). Influence: how much effect did that case have on the mean?

385 Non-Normality in Regression

386 Effects on OLS Estimates
The mean is an OLS estimate The regression line is an OLS estimate Lack of normality biases the position of the regression slope makes the standard errors wrong probability values attached to statistical significance wrong

387 Checks on Normality Check residuals are normally distributed
SPSS will draw histogram and p-p plot of residuals Use regression diagnostics Lots of them Most aren’t very interesting

388 Regression Diagnostics
Residuals standardised, unstandardised, studentised, deleted, studentised-deleted look for cases > |3| (?) Influence statistics Look for the effect a case has If we remove that case, do we get a different answer? DFBeta, Standardised DFBeta changes in b

389 Distances, DfFit, Covariance ratio
DfFit, Standardised DfFit – change in the predicted value. Covariance ratio – ratio of the determinants of the covariance matrices, with and without the case. Distances – measures of 'distance' from the centroid; some include the IV, some don't

390 More on Residuals Residuals are trickier than you might have imagined
Raw residuals OK Standardised residuals Residuals divided by SD

391 Leverage But that SD is wrong: the variance of the residuals is not equal – cases further from the centroid on the predictors have higher variance. We need a measure of this. Distance from the centroid is leverage, or h (or sometimes hii). With one predictor it is easy: hi = 1/n + (xi – x̄)² / Σ(xj – x̄)²

392 Minimum hi is 1/n, the maximum is 1 Except
SPSS uses standardised leverage - h* It doesn’t tell you this, it just uses it

393 Minimum 0, maximum (N-1)/N

394 Multiple predictors Calculate the hat matrix (H)
Leverage values are the diagonals of this matrix: H = X(X′X)⁻¹X′, where X is the augmented matrix of predictors (i.e. the matrix that includes the constant). Hence leverage hii – element ii of H
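For anyone following along in R rather than SPSS, a minimal sketch of the hat-matrix calculation (toy data invented here; hatvalues() on a fitted model gives the same diagonals):
x1 <- c(1, 2, 3, 4, 10)                 # toy predictor; the last case sits far from the rest
x2 <- c(2, 1, 4, 3, 9)
X  <- cbind(1, x1, x2)                  # augmented matrix: constant plus predictors
H  <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix H = X (X'X)^-1 X'
diag(H)                                 # leverage h_ii for each case
hatvalues(lm(rnorm(5) ~ x1 + x2))       # same leverages from a fitted lm object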

395 Example of calculation of hat matrix

396 Standardised / Studentised
Now we can calculate the standardised residuals: the residual divided by its standard deviation, corrected for leverage – ei / (s × √(1 – hii)). SPSS calls them studentised residuals; also called internally studentised residuals

397 Deleted Studentised Residuals
Studentised residuals do not have a known distribution Cannot use them for inference Deleted studentised residuals Externally studentised residuals Jackknifed residuals Distributed as t With df = N – k – 1

398 Testing Significance We can calculate the probability of a residual
Is it sampled from the same population BUT Massive type I error rate Bonferroni correct it Multiply p value by N
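A minimal R sketch of this test, using a built-in dataset purely for illustration (car::outlierTest() does the same job if that package is installed):
fit <- lm(mpg ~ wt + hp, data = mtcars)          # any fitted regression
t_i <- rstudent(fit)                             # deleted (externally) studentised residuals
p   <- 2 * pt(-abs(t_i), df.residual(fit) - 1)   # two-tailed p for each case
p_bonf <- pmin(p * nobs(fit), 1)                 # Bonferroni: multiply by N, cap at 1
head(sort(p_bonf), 3)                            # the most extreme cases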

399 Bivariate Normality We didn’t just say “residuals normally distributed” We said “at every value of the dependent variables” Two variables can be normally distributed – univariate, but not bivariate

400 Couple’s IQs male and female Seem reasonably normal

401 But wait!!

402 When we look at bivariate normality So plot X against Y
not normal – there is an outlier. So plot X against Y: a case may be OK for bivariate normality but still be a multivariate outlier. To check that we would need to draw the graph in 3+ dimensions, which we can't do – but we can look at the residuals instead …

403 IQ histogram of residuals

404 Multivariate Outliers …
Will be explored later in the exercises So we move on …

405 What to do about Non-Normality
Skew and Kurtosis Skew – much easier to deal with Kurtosis – less serious anyway Transform data removes skew positive skew – log transform negative skew - square

406 Transformation May need to transform IV and/or DV More often DV
time, income, symptoms (e.g. depression) all positively skewed can cause non-linear effects (more later) if only one is transformed alters interpretation of unstandardised parameter May alter meaning of variable May add / remove non-linear and moderator effects

407 Change measures; outliers
Change measures – increase sensitivity at the extremes of the range, avoiding floor and ceiling effects. Outliers can be tricky: why did the outlier occur? Error? Delete them. Weird person? Probably delete them. Normal person? Tricky.

408 Pedhazur and Schmelkin (1991)
You are trying to model a process is the data point ‘outside’ the process e.g. lottery winners, when looking at salary yawn, when looking at reaction time Which is better? A good model, which explains 99% of your data? A poor model, which explains all of it Pedhazur and Schmelkin (1991) analyse the data twice

409 We will spend much less time on the other 6 assumptions
Can do exercise 8.1.

410 Assumption 2: The variance of the residuals for every set of values for the independent variable is equal.

411 Heteroscedasticity This assumption is about heteroscedasticity of the residuals. Hetero = different, scedastic = scattered. We don't want heteroscedasticity – we want our data to be homoscedastic. Draw a scatterplot to investigate

412

413 Easy to get – use predicted values
A scatterplot only works with one IV – with more you would need every combination of IVs. Easier to use the predicted values and the residuals: plot predicted values against residuals (or standardised, deleted, standardised deleted, or studentised residuals). A bit like turning the scatterplot on its side

414 Good – no heteroscedasticity

415 Bad – heteroscedasticity

416 Testing Heteroscedasticity
White’s test Not automatic in SPSS (is in SAS) Luckily, not hard to do Do regression, save residuals. Square residuals Square IVs Calculate interactions of IVs e.g. x1•x2, x1•x3, x2 • x3

417 Use education and salbegin to predict salary (employee data.sav)
Run a regression using the squared residuals as the DV, and the IVs, squared IVs, and interactions as IVs. Test statistic = N × R², distributed as χ² with df = k (for the second regression). For education and salbegin predicting salary: R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.001
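In R the same steps can be scripted directly – a minimal sketch, assuming the employee data have been read into a data frame called emp with variables educ, salbegin and salary:
fit  <- lm(salary ~ educ + salbegin, data = emp)
aux  <- lm(resid(fit)^2 ~ educ + salbegin + I(educ^2) + I(salbegin^2) +
             I(educ * salbegin), data = emp)      # squared residuals on IVs, squares, interaction
stat <- nobs(aux) * summary(aux)$r.squared        # test statistic = N x R-squared
pchisq(stat, df = 5, lower.tail = FALSE)          # compare to chi-squared with df = k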

418 Plot of Pred and Res

419 Magnitude of Heteroscedasticity
Chop data into “slices” 5 slices, based on X (or predicted score) Done in SPSS Calculate variance of each slice Check ratio of smallest to largest Less than 10:1 OK

420 The Visual Bander New in SPSS 12

421 Variances of the 5 groups
We have a problem 3 / 0.2 ~= 15

422 Dealing with Heteroscedasticity
Use Huber-White estimates Very easy in Stata Fiddly in SPSS – bit of a hack Use Complex samples Create a new variable where all cases are equal to 1, call it const Use Complex Samples, Prepare for Analysis Create a plan file

423 Sample weight is const Finish Use Complex Samples, GLM Use plan file created, and set up model as in GLM (More on complex samples later) In Stata, do regression as normal, and click “robust”.
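In R, Huber-White standard errors are also only a couple of lines – a sketch assuming the sandwich and lmtest packages are installed, and the same hypothetical emp data frame as before:
library(sandwich)
library(lmtest)
fit <- lm(salary ~ educ + salbegin, data = emp)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # same estimates, robust SEs and p-values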

424 Heteroscedasticity – Implications and Meanings
What happens as a result of heteroscedasticity? Parameter estimates are correct not biased Standard errors (hence p-values) are incorrect

425 However …
If there is no skew in the predicted scores, p-values are only a tiny bit wrong. If skewed, p-values can be very wrong. Can do exercise

426 What is heteroscedasticity trying to tell us?
Meaning What is heteroscedasticity trying to tell us? Our model is wrong – it is misspecified Something important is happening that we have not accounted for e.g. amount of money given to charity (given) depends on: earnings degree of importance person assigns to the charity (import)

427 Do the regression analysis
R2 = 0.60, F = 31.4, df = 2, 37, p < 0.001 – seems quite good. b0 = 0.24, p = 0.97; b1 = 0.71, p < 0.001; b2 = 0.23, p = 0.031. White's test: χ² = 18.6, df = 5, p = 0.002. The plot of predicted values against residuals …

428 Plot shows heteroscedastic relationship

429 Which means … the effects of the variables are not additive
If you think that what a charity does is important you might give more money how much more depends on how much money you have

430

431 One more thing about heteroscedasticity
it is the equivalent of homogeneity of variance in ANOVA/t-tests

432 Assumption 3: The Error Term is Additive

433 Additivity What heteroscedasticity shows you
effects of variables need to be additive Heteroscedasticity doesn’t always show it to you can test for it, but hard work (same as homogeneity of covariance assumption in ANCOVA) Have to know it from your theory A specification error

434 Additivity and Theory Two IVs Alcohol has sedative effect
A bit makes you a bit tired A lot makes you very tired Some painkillers have sedative effect A bit of alcohol and a bit of painkiller doesn’t make you very tired Effects multiply together, don’t add together

435 So many possible non-additive effects
If you don’t test for it It’s very hard to know that it will happen So many possible non-additive effects Cannot test for all of them Can test for obvious In medicine Choose to test for salient non-additive effects e.g. sex, race

436 Assumption 4: At every value of the dependent variable the expected (mean) value of the residuals is zero

437 Linearity Relationships between variables should be linear
best represented by a straight line Not a very common problem in social sciences except economics measures are not sufficiently accurate to make a difference R2 too low unlike, say, physics

438 Relationship between speed of travel and fuel used

439 R2 = 0.938 BUT looks pretty good
Know speed, make a good prediction of fuel. BUT look at the chart: if we know speed we can make a perfect prediction of fuel used, so R2 should be 1.00

440 Detecting Non-Linearity
Residual plot just like heteroscedasticity Using this example very, very obvious usually pretty obvious

441 Residual plot

442 Linearity: A Case of Additivity
Linearity = additivity along the range of the IV Jeremy rides his bicycle harder Increase in speed depends on current speed Not additive, multiplicative MacCallum and Mar (1995). Distinguishing between moderator and quadratic effects in multiple regression. Psychological Bulletin.

443 The independence assumption (lack of autocorrelation)
Assumption 5: The expected correlation between residuals, for any two cases, is 0. The independence assumption (lack of autocorrelation)

444 Independence Assumption
Also: lack of autocorrelation Tricky one often ignored exists for almost all tests All cases should be independent of one another knowing the value of one case should not tell you anything about the value of other cases

445 How is it Detected? Can be difficult
need some clever statistics (multilevel models) Better off avoiding situations where it arises Residual Plots Durbin-Watson Test

446 Residual Plots Were data collected in time order?
If so plot ID number against the residuals Look for any pattern Test for linear relationship Non-linear relationship Heteroscedasticity

447

448 How does it arise? Two main ways time-series analyses
When cases are time periods weather on Tuesday and weather on Wednesday correlated inflation 1972, inflation 1973 are correlated clusters of cases patients treated by three doctors children from different classes people assessed in groups

449 Why does it matter? Standard errors can be wrong
therefore significance tests can be wrong Parameter estimates can be wrong really, really wrong from positive to negative An example students do an exam (on statistics) choose one of three questions IV: time DV: grade

450 Result, with line of best fit

451 Result shows that BUT … Look again
people who spent longer in the exam, achieve better grades BUT … we haven’t considered which question people answered we might have violated the independence assumption DV will be autocorrelated Look again with questions marked

452 Now somewhat different

453 Now, people that spent longer got lower grades
questions differed in difficulty do a hard one, get better grade if you can do it, you can do it quickly Very difficult to analyse well need multilevel models

454 Durbin Watson Test Not well implemented in SPSS
Depends on the order of the data Reorder the data, get a different result Doesn’t give statistical significance of the test
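If you want a p-value for the Durbin-Watson statistic, R's lmtest package (assumed installed) provides one – a sketch with a hypothetical exam data frame holding time and grade in collection order:
library(lmtest)
fit <- lm(grade ~ time, data = exam)   # data must be in their genuine time/collection order
dwtest(fit)                            # DW statistic plus a p-value for autocorrelation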

455 Assumption 6: All independent variables are uncorrelated with the error term.

456 Uncorrelated with the Error Term
A curious assumption: by definition, the sample residuals are uncorrelated with the independent variables (try it and see, if you like). The assumption is about the population error term – everything left out of the model must have no effect on the DV once the IVs are taken into account

457 OLS estimates will be (badly) biased in this case
Problem in economics Demand increases supply Supply increases wages Higher wages increase demand OLS estimates will be (badly) biased in this case need a different estimation procedure two-stage least squares simultaneous equation modelling

458 no perfect multicollinearity
Assumption 7: No independent variables are a perfect linear function of other independent variables no perfect multicollinearity

459 No Perfect Multicollinearity
IVs must not be linear functions of one another matrix of correlations of IVs is not positive definite cannot be inverted analysis cannot proceed Have seen this with age, age start, time working also occurs with subscale and total

460 Large amounts of collinearity
a problem (as we shall see) sometimes, but not an assumption

461 Assumption 8: The mean of the error term is zero.
You will like this one.

462 Mean of the Error Term = 0 Mean of the residuals = 0
That is what the constant is for if the mean of the error term deviates from zero, the constant soaks it up - note, Greek letters because we are talking about population values

463 Can do regression without the constant
Usually a bad idea E.g R2 = 0.995, p < 0.001 Looks good

464

465

466 Lesson 9: Issues in Regression Analysis
Things that alter the interpretation of the regression equation

467 The Four Issues Causality Sample sizes Collinearity Measurement error

468 Causality

469 What is a Cause? Debate about definition of cause
some statistics (and philosophy) books try to avoid it completely We are not going into depth just going to show why it is hard Two dimensions of cause Ultimate versus proximal cause Determinate versus probabilistic

470 Proximal versus Ultimate Why am I here?
I walked here because This is the location of the class because Eric Tanenbaum asked me because (I don’t know) because I was in my office when he rang because I am a lecturer at York because I saw an advert in the paper because

471 Proximal cause Ultimate cause I exist because My parents met because
My father had a job … Proximal cause the direct and immediate cause of something Ultimate cause the thing that started the process off I fell off my bicycle because of the bump I fell off because I was going too fast

472 Determinate versus Probabilistic Cause Why did I fall off my bicycle?
I was going too fast But every time I ride too fast, I don’t fall off Probabilistic cause Why did my tyre go flat? A nail was stuck in my tyre Every time a nail sticks in my tyre, the tyre goes flat Deterministic cause

473 Can get into trouble by mixing them together
Eating deep fried Mars Bars and doing no exercise are causes of heart disease “My Grandad ate three deep fried Mars Bars every day, and the most exercise he ever got was when he walked to the shop next door to buy one” (Deliberately?) confusing deterministic and probabilistic causes

474 Criteria for Causation
Association Direction of Influence Isolation

475 Association Correlation does not mean causation But
we all know. But causation does mean correlation. Need to show that two things are related – may be correlation, may be regression when controlling for a third (or more) factor

476 Relationship between price and sales
suppliers may be cunning when people want it more stick the price up So – no relationship between price and sales

477 But which variables do we enter?
Until (of course) we control for demand: b1 (Price) = …, b2 (Demand) = 0.94. But which variables do we enter?

478 Direction of Influence
Relationship between A and B three possible processes A B A causes B A B B causes A A B C C causes A & B

479 How do we establish the direction of influence?
Longitudinally? Barometer Drops Storm Now if we could just get that barometer needle to stay where it is … Where the role of theory comes in (more on this later)

480 Isolation Isolate the dependent variable from all other influences
as experimenters try to do Cannot do this can statistically isolate the effect using multiple regression

481 Role of Theory Strong theory is crucial to making causal statements
Fisher said: to make causal statements “make your theories elaborate.” don’t rely purely on statistical analysis Need strong theory to guide analyses what critics of non-experimental research don’t understand

482 S.J. Gould – a critic says correlate price of petrol and his age, for the last 10 years find a correlation Ha! (He says) that doesn’t mean there is a causal link Of course not! (We say). No social scientist would do that analysis without first thinking (very hard) about the possible causal relations between the variables of interest Would control for time, prices, etc …

483 Gould says “Most correlations are non-causal” (1982, p243)
Atkinson, et al. (1996) relationship between college grades and number of hours worked negative correlation Need to control for other variables – ability, intelligence Gould says “Most correlations are non-causal” (1982, p243) Of course!!!!

484 120 non-causal correlations
karaoke jokes (about statistics) vomit toilet headache sleeping equations (beermat) laugh thirsty fried breakfast no beer curry chips falling over lose keys curtains closed I drink a lot of beer 16 causal relations 120 non-causal correlations

485 Abelson (1995) elaborates on this
‘method of signatures’ A collection of correlations relating to the process the ‘signature’ of the process e.g. tobacco smoking and lung cancer can we account for all of these findings with any other theory?

486 The longer a person has smoked cigarettes, the greater the risk of cancer.
The more cigarettes a person smokes over a given time period, the greater the risk of cancer. People who stop smoking have lower cancer rates than do those who keep smoking. Smoker’s cancers tend to occur in the lungs, and be of a particular type. Smokers have elevated rates of other diseases. People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers. Non-smokers who live with smokers have elevated cancer rates. (Abelson, 1995: )

487 Failure to use theory to select appropriate variables
In addition, should be no anomalous correlations If smokers had more fallen arches than non-smokers, not consistent with theory Failure to use theory to select appropriate variables specification error e.g. in previous example Predict wealth from price and sales increase price, price increases Increase sales, price increases

488 Sometimes these are indicators of the process
e.g. barometer – stopping the needle won’t help e.g. inflation? Indicator or cause?

489 No Causation without Experimentation
Blatantly untrue I don’t doubt that the sun shining makes us warm Why the aversion? Pearl (2000) says problem is no mathematical operator No one realised that you needed one Until you build a robot

490 AI and Causality A robot needs to make judgements about causality
Needs to have a mathematical representation of causality Suddenly, a problem! Doesn’t exist Most operators are non-directional Causality is directional

491 “How many subjects does it take to run a regression analysis?”
Sample Sizes “How many subjects does it take to run a regression analysis?”

492 Introduction Social scientists don’t worry enough about the sample size required “Why didn’t you get a significant result?” “I didn’t have a large enough sample” Not a common answer More recently awareness of sample size is increasing use too few – no point doing the research use too many – waste their time

493 Research funding bodies Ethical review panels
both become more interested in sample size calculations We will look at two approaches Rules of thumb (quite quickly) Power Analysis (more slowly)

494 Rules of Thumb Lots of simple rules of thumb exist
10 cases per IV; >100 cases. Green (1991) is more sophisticated: to test the significance of R2, N ≥ 50 + 8k; to test the significance of slopes, N ≥ 104 + k. Rules of thumb don't take into account all the information that we have – power analysis does

495 Power Analysis Introducing Power Analysis Hypothesis test
tells us the probability of a result of that magnitude occurring, if the null hypothesis is correct (i.e. there is no effect in the population) Doesn’t tell us the probability of that result, if the null hypothesis is false

496 According to Cohen (1982) all null hypotheses are false
everything that might have an effect, does have an effect it is just that the effect is often very tiny

497 Type I error is false rejection of H0
Probability of making a Type I error: α – the significance cut-off, usually 0.05 (by convention). Always this value; not affected by sample size or type of test

498 Type II error is false acceptance of the null hypothesis
Type II errors Type II error is false acceptance of the null hypothesis Much, much trickier We think we have some idea we almost certainly don’t Example I do an experiment (random sampling, all assumptions perfectly satisfied) I find p = 0.05

499 Very hard to work out You repeat the experiment exactly
different random sample from same population What is probability you will find p < 0.05? ……………… Another experiment, I find p = 0.01 Probability you find p < 0.05? Very hard to work out not intuitive need to understand non-central sampling distributions (more in a minute)

500 Probability of type II error = beta (b)
same symbol as the population regression parameter (to be confusing). Power = 1 – β: the probability of getting a significant result

501 State of the world versus research findings
If H0 is false (there is an effect to be found) and we find no effect (p > 0.05), that is a Type II error, p = β; if we find an effect (p < 0.05), that is power = 1 – β. If H0 is true (no effect to be found) and we find an effect, that is a Type I error, p = α.

502 Four parameters in power analysis
α – prob. of Type I error; β – prob. of Type II error (power = 1 – β); effect size – size of the effect in the population; N. Know any three, and you can calculate the fourth. Look at them one at a time

503 a Probability of Type I error
Usually set to 0.05 Somewhat arbitrary sometimes adjusted because of circumstances rarely because of power analysis May want to adjust it, based on power analysis

504 b – Probability of type II error
Power (probability of finding a result) = 1 – b Standard is 80% Some argue for 90% Implication that Type I error is 4 times more serious than type II error adjust ratio with compromise power analysis

505 Effect size in the population
Most problematic to determine Three ways What effect size would be useful to find? R2 = no use (probably) Base it on previous research what have other people found? Use Cohen’s conventions small R2 = 0.02 medium R2 = 0.13 large R2 = 0.26

506 Effect size usually measured as f2
For R2: f² = R² / (1 – R²)

507 For (standardised) slopes
f² = sr² / (1 – R²), where sr² is the contribution to the variance accounted for by the variable of interest, i.e. sr² = R² (with variable) – R² (without) – the change in R² in hierarchical regression

508 N – the sample size usually use other three parameters to determine this sometimes adjust other parameters (a) based on this e.g. You can have 50 participants. No more.
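A minimal sketch of the same calculation in R, assuming the pwr package is installed; u is the number of predictors and v the error degrees of freedom, so the required N is roughly u + v + 1:
library(pwr)
f2 <- 0.13 / (1 - 0.13)                                   # Cohen's 'medium' R2 converted to f2
pwr.f2.test(u = 3, v = NULL, f2 = f2, sig.level = 0.05, power = 0.80)
# the v returned is the error df; add u + 1 to get the approximate sample size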

509 With power analysis program
Doing power analysis With power analysis program SamplePower, GPower, Nquery With SPSS MANOVA using non-central distribution functions Uses MANOVA syntax Relies on the fact you can do anything with MANOVA Paper B4

510 Underpowered Studies Research in the social sciences is often underpowered Why? See Paper B11 – “the persistence of underpowered studies”

511 Extra Reading Power traditionally focuses on p values What about CIs?
Paper B8 – “Obtaining regression coefficients that are accurate, not simply significant”

512 Collinearity

513 Collinearity as Issue and Assumption
Collinearity (multicollinearity) the extent to which the independent variables are (multiply) correlated If R2 for any IV, using other IVs = 1.00 perfect collinearity variable is linear sum of other variables regression will not proceed (SPSS will arbitrarily throw out a variable)

514 Four things to look at in collinearity
R2 < 1.00, but high other problems may arise Four things to look at in collinearity meaning implications detection actions

515 Meaning of Collinearity
Literally ‘co-linearity’ lying along the same line Perfect collinearity when some IVs predict another Total = S1 + S2 + S3 + S4 S1 = Total – (S2 + S3 + S4) rare

516 Less than perfect when some IVs are close to predicting
correlations between IVs are high (usually, but not always)

517 Implications Affects the stability of the parameter estimates
and so the standard errors of the parameter estimates, and so the significance – because there is shared variance, which the regression procedure doesn't know where to put

518 Red cars have more accidents than other coloured cars
because of the effect of being in a red car? because of the kind of person that drives a red car? we don't know. No way to distinguish between these three: Accidents = 1 × colour + 0 × person; Accidents = 0 × colour + 1 × person; Accidents = 0.5 × colour + 0.5 × person

519 Sex differences due to genetics? due to upbringing?
(almost) perfect collinearity statistically impossible to tell

520 When collinearity is less than perfect
increases variability of estimates between samples estimates are unstable reflected in the variances, and hence standard errors

521 Detecting Collinearity
Look at the parameter estimates large standardised parameter estimates (>0.3?), which are not significant be suspicious Run a series of regressions each IV as DV all other IVs as IVs for each IV

522 Ask for collinearity diagnostics
Sounds like hard work? SPSS does it for us! Ask for collinearity diagnostics. Tolerance – calculated for every IV (1 – R² of that IV regressed on the others). Variance Inflation Factor – 1 / tolerance; its square root is the amount the s.e. has been increased
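The same diagnostics in R – a sketch assuming the car package is installed, with the hypothetical emp data frame from earlier (jobtime is a made-up third IV):
library(car)
fit <- lm(salary ~ educ + salbegin + jobtime, data = emp)
v   <- vif(fit)    # variance inflation factor for each IV
1 / v              # tolerances
sqrt(v)            # how much each standard error has been inflated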

523 Actions What you can do about collinearity Get new data
“no quick fix” (Fox, 1991) Get new data avoids the problem address the question in a different way e.g. find people who have been raised as the ‘wrong’ gender exist, but rare Not a very useful suggestion

524 Remove / Combine variables
Collect more data – not different data, just more. Collinearity increases the standard error (se), and se decreases as N increases, so get a bigger N. Remove / combine variables: if an IV correlates highly with other IVs it is not telling us much new; if you have two (or more) IVs which are very similar – e.g. 2 measures of depression, socio-economic status, achievement, etc.

525 Use stepwise regression (or some flavour of)
sum them, average them, or remove one; with many measures, use principal components analysis to reduce them. Use stepwise regression (or some flavour of it) – see previous comments; can be useful in a theoretical vacuum. Ridge regression – not very useful, behaves weirdly

526 Measurement Error

527 What is Measurement Error
In social science, it is unlikely that we measure any variable perfectly measurement error represents this imperfection We assume that we have a true score T A measure of that score x

528 just like a regression equation
x = T + e, just like a regression equation. Standardise the parameters, and the proportion of variance in x which comes from T is the reliability. But, like a regression equation, we assume that e is random and has a mean of zero – more on that later

529 Simple Effects of Measurement Error
Lowers the measured correlation between two variables Real correlation true scores (x* and y*) Measured correlation measured scores (x and y)

530 Measured correlation of x and y: rxy; true correlation of x* and y*: rx*y*
Path diagram: true scores x* and y*, each measured with error e by x and y (reliabilities rxx and ryy); the measured correlation of x and y is rxy

531 Attenuation of correlation
Attenuation: rxy = rx*y* × √(rxx × ryy). Attenuation-corrected correlation: rx*y* = rxy / √(rxx × ryy)
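A tiny R sketch of these two formulas, with invented reliabilities:
r_true <- 0.60; rxx <- 0.80; ryy <- 0.70      # made-up true correlation and reliabilities
r_obs  <- r_true * sqrt(rxx * ryy)            # attenuated (measured) correlation
r_obs / sqrt(rxx * ryy)                       # attenuation-corrected value recovers 0.60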

532 Example

533 Complex Effects of Measurement Error
Really horribly complex Measurement error reduces correlations reduces estimate of b reducing one estimate increases others because of effects of control combined with effects of suppressor variables exercise to examine this

534 Dealing with Measurement Error
Attenuation correction very dangerous not recommended Avoid in the first place use reliable measures don’t discard information don’t categorise Age: 10-20, 21-30, …

535 Complications Assume measurement error is additive and linear
Additive – e.g. weight: people may under-report / over-report at the extremes. Linear – particularly an issue when using proxy variables

536 e.g. proxy measures Want to know effort on childcare, count number of children 1st child is more effort than last Want to know financial status, count income 1st £10 much greater effect on financial status than the 1000th.

537 Lesson 10: Non-Linear Analysis in Regression

538 Introduction Non-linear effect occurs Assumption is violated
when the effect of one independent variable is not consistent across the range of the IV Assumption is violated expected value of residuals = 0 no longer the case

539 Some Examples

540 A Learning Curve Skill Experience

541 Yerkes-Dodson Law of Arousal
Performance Arousal

542 Enthusiasm Levels over a
Lesson on Regression Enthusiastic Suicidal 3.5 Time

543 Learning Yerkes-Dodson Enthusiasm line changed direction once
line changed direction twice

544 Everything is Non-Linear
Every relationship we look at is non-linear, for two reasons Exam results cannot keep increasing with reading more books Linear in the range we examine For small departures from linearity Cannot detect the difference Non-parsimonious solution

545 Non-Linear Transformations

546 Bending the Line Non-linear regression is hard Transformations
We cheat and linearise the data, then do linear regression. Transformations: rather than estimating a curved line (which would be very difficult and may not work with OLS), we transform the data – take a straight line and bend it, or take a curved line and straighten it – and go back to linear (OLS) regression

547 We still do linear regression
Linear in the parameters Y = b1x + b2x2 + … Can do non-linear regression Non-linear in the parameters Much trickier Statistical theory either breaks down OR becomes harder

548 Linear transformations
multiply by a constant add a constant change the slope and the intercept

549 y=2x y=x + 3 y y=x x

550 Linear transformations are no use
alter the slope and intercept don’t alter the standardised parameter estimate Non-linear transformation will bend the slope quadratic transformation y = x2 one change of direction

551 Cubic transformation y = x2 + x3 two changes of direction

552 Quadratic Transformation
y= x + 1x2

553 Square Root Transformation
y= x + 5x

554 Cubic Transformation y = 3 - 4x + 2x x3

555 Logarithmic Transformation
y = x + 10log(x)

556 Inverse Transformation
y = x + 8(1/x)

557 To estimate a non-linear regression
we don’t actually estimate anything non-linear we transform the x-variable to a non-linear version can estimate that straight line represents the curve we don’t bend the line, we stretch the space around the line, and make it flat

558 Detecting Non-linearity

559 Draw a Scatterplot Draw a scatterplot of y plotted against x
see if it looks a bit non-linear e.g. Anscombe’s data e.g. Education and beginning salary from bank data drawn in SPSS with line of best fit

560 Anscombe (1973) For each dataset constructed a set of datasets
show the importance of graphs in regression/correlation. For each dataset: N = 11, mean of x = 9, mean of y = 7.5, equation of regression line y = 3 + 0.5x, sum of squares (X – mean) = 110, correlation coefficient 0.82, R2 = 0.67

561

562

563

564

565 A Real Example Starting salary and years of education
From employee data.sav

566 Expected value of error (residual) is > 0

567 Use Residual Plot Scatterplot is only good for one variable
use the residual plot (that we used for heteroscedasticity) Good for many variables

568 We want points to lie in a nice straight sausage

569 We don’t want a nasty bent sausage

570 Educational level and starting salary

571 Carrying Out Non-Linear Regression

572 Linear Transformation
Linear transformation doesn’t change interpretation of slope standardised slope se, t, or p of slope R2 Can change effect of a transformation

573 With others does have an effect
Actually more complex with some transformations can add a constant with no effect (e.g. quadratic) With others does have an effect inverse, log Sometimes it is necessary to add a constant negative numbers have no square root 0 has no log

574 Education and Salary Linear Regression
Saw previously that the assumption of expected errors = 0 was violated Anyway … R2 = 0.401, F=315, df = 1, 472, p < 0.001 salbegin =  educ Standardised b1 (educ) = 0.633 Both parameters make sense

575 Add this variable to the equation
Non-linear effect: compute a new quadratic variable, educ2 = educ², and add this variable to the equation. R2 = 0.585, p < 0.001; salbegin = … educ … educ². Slightly curious: standardised b1 (educ) = -2.4, b2 (educ2) = 3.1. What is going on?

576 Need hierarchical regression
Collinearity is what is going on: the correlation of educ and educ2 is r = 0.990, so the regression equation becomes difficult (impossible?) to interpret. Need hierarchical regression: what is the change in R2, and is that change significant? R2 (change) = 0.184, p < 0.001
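A minimal R sketch of the hierarchical test (again assuming the employee data sit in a data frame called emp):
m1 <- lm(salbegin ~ educ, data = emp)                # linear effect only
m2 <- lm(salbegin ~ educ + I(educ^2), data = emp)    # add the quadratic term
summary(m2)$r.squared - summary(m1)$r.squared        # change in R2
anova(m1, m2)                                        # F test for that change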

577 While we are at it, let’s look at the cubic effect
R2 (change) = 0.004, p = 0.045. salbegin = … e … e² … e³. Standardised: b1(e) = 0.04, b2(e²) = -2.04, b3(e³) = 2.71

578 Keep going while we are ahead
Fourth power: keep going while we are ahead – but it won't run. ??? Collinearity is the culprit: Tolerance (educ4) = , VIF = . The matrix of correlations of the IVs is not positive definite and cannot be inverted

579 Tricky, given that parameter estimates are a bit nonsensical
Interpretation Tricky, given that parameter estimates are a bit nonsensical Two methods 1: Use R2 change Save predicted values or calculate predicted values to plot line of best fit Save them from equation Plot against IV

580

581 Differentiate with respect to e We said:
s = b0 + b1·e + b2·e² + b3·e³, but first we will simplify it to the quadratic: s = b0 + b1·e + b2·e². Then ds/de = b1 + 2 × b2 × e

582 1 year of education at the higher end of the scale, better than 1 year at the lower end of the scale. MBA versus GCSE

583 Differentiate Cubic s = b0 + b1·e + b2·e² + b3·e³; ds/de = 103 – 206 × 2 × e + 12 × 3 × e². Can calculate slopes for the quadratic and cubic at different values

584

585 A Quick Note on Differentiation
For y = x^p, dy/dx = p·x^(p–1). For equations such as y = b1x + b2x^p, dy/dx = b1 + b2·p·x^(p–1). E.g. y = 3x + 4x², dy/dx = 3 + 4 × 2x = 3 + 8x

586 Many functions are simple to differentiate
y = b1x + b2x² + b3x³, dy/dx = b1 + b2 × 2x + b3 × 3 × x². E.g. y = 4x + 5x² + 6x³, dy/dx = 4 + 5 × 2 × x + 6 × 3 × x². Many functions are simple to differentiate – not all, though

587 Automatic Differentiation
If you Don’t know how to differentiate Can’t be bothered to look up the function Can use automatic differentiation software e.g. GRAD (freeware)

588

589 Lesson 11: Logistic Regression
Dichotomous/Nominal Dependent Variables

590 Introduction Often in social sciences, we have a dichotomous/nominal DV we will look at dichotomous first, then a quick look at multinomial Dichotomous DV e.g. guilty/not guilty pass/fail won/lost Alive/dead (used in medicine)

591 Why Won’t OLS Do?

592 Example: Passing a Test
Test for bus drivers pass/fail we might be interested in degrees of pass fail a company which trains them will not fail means ‘pay for them to take it again’ Develop a selection procedure Two predictor variables Score – Score on an aptitude test Exp – Relevant prior experience (months)

593 1st ten cases

594 Just consider score first
DV: pass (1 = Yes, 0 = No). Just consider score first: carry out a regression with Score as IV and Pass as DV. R2 = 0.097, F = 4.1, df = 1, 48, p = ; b0 = 0.190, b1 = 0.110, p = 0.028. Seems OK

595 Or does it? … 1st Problem – pp plot of residuals

596 2nd problem - residual plot

597 Problems 1 and 2 strange distributions of residuals
parameter estimates may be wrong standard errors will certainly be wrong

598 3rd problem – interpretation
I score 2 on aptitude: Pass = 0.190 + 0.110 × 2 = 0.41. I score 8 on the test: Pass = 0.190 + 0.110 × 8 = 1.07. Seems OK, but what does it mean? You cannot score 0.41 or 1.07 – you can only score 0 or 1. It cannot be interpreted; we need a different approach

599 A Different Approach Logistic Regression

600 Logit Transformation In lesson 10, transformed IVs
now transform the DV Need a transformation which gives us graduated scores (between 0 and 1) No upper limit we can’t predict someone will pass twice No lower limit you can’t do worse than fail

601 Step 1: Convert to Probability
First, stop talking about values talk about probability for each value of score, calculate probability of pass Solves the problem of graduated scales

602 probability of failure given a score of 1 is 0.7
probability of passing given a score of 5 is 0.8

603 Now a score of 0.41 has a meaning But a score of 1.07 has no meaning
This is better Now a score of 0.41 has a meaning a 0.41 probability of pass But a score of 1.07 has no meaning cannot have a probability > 1 (or < 0) Need another transformation

604 Step 2: Convert to Odds-Ratio
Need to remove upper limit Convert to odds Odds, as used by betting shops 5:1, 1:2 Slightly different from odds in speech a 1 in 2 chance odds are 1:1 (evens) 50%

605 Odds ratio = (number of times it happened) / (number of times it didn’t happen)

606 A probability of 0.8: 0.8/0.2 = 4, equivalent to 4:1 (odds on) – 4 times out of five. A probability of 0.2: 0.2/0.8 = 0.25, equivalent to 1:4 (4:1 against) – 1 time out of five

607 Now we have solved the upper bound problem
we can interpret 1.07, 2.07, But we still have the zero problem we cannot interpret predicted scores less than zero

608 Step 3: The Log Log10 of a number(x) log(10) = 1 log(100) = 2

609 log(1) = 0 log(0.1) = -1 log(0.00001) = -5

610 Natural Logs and e Don’t use log10 Natural log, ln
Use loge Natural log, ln Has some desirable properties, that log10 doesn’t For us If y = ln(x) + c dy/dx = 1/x Not true for any other logarithm

611 Be careful – calculators and stats packages are not consistent when they use log
Sometimes log10, sometimes loge Can prove embarrassing (a friend told me)

612 Take the natural log of the odds ratio. Goes from –∞ to +∞
can interpret any predicted value

613 Putting them all together
Logit transformation log-odds ratio not bounded at zero or one
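The whole transformation in one small R sketch – probability to odds to log-odds, and back again:
p     <- c(0.2, 0.5, 0.8)
odds  <- p / (1 - p)        # 0.25, 1, 4
logit <- log(odds)          # natural log: -1.39, 0, 1.39 – no upper or lower bound
plogis(logit)               # inverse logit takes us back to 0.2, 0.5, 0.8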

614

615 Probability gets closer to zero, but never reaches it as logit goes down.

616 Hooray! Problem solved, lesson over
errrmmm… almost Because we are now using log-odds ratio, we can’t use OLS we need a new technique, called Maximum Likelihood (ML) to estimate the parameters

617 Parameter Estimation using ML
ML tries to find estimates of model parameters that are most likely to give rise to the pattern of observations in the sample data All gets a bit complicated OLS is a special case of ML the mean is an ML estimator

618 Don’t have closed form equations
must be solved iteratively estimates parameters that are most likely to give rise to the patterns observed in the data by maximising the likelihood function (LF) We aren’t going to worry about this except to note that sometimes, the estimates do not converge ML cannot find a solution

619 Interpreting Output Using SPSS Overall fit for:
step (only used for stepwise), block (for hierarchical), model (always) – in our model, all are the same: χ² = 4.9, df = 1, p = 0.025 (the analogue of the F test in OLS)

620

621 Model summary -2LL (=c2/N) Cox & Snell R2 Nagelkerke R2
Different versions of R2 No real R2 in logistic regression should be considered ‘pseudo R2’

622

623 Classification Table predictions of model
based on cut-off of 0.5 (by default) predicted values x actual values

624

625 Model parameters B SE (B)
Change in the logged odds associated with a change of 1 unit in IV just like OLS regression difficult to interpret SE (B) Standard error Multiply by 1.96 to get 95% CIs

626

627 Constant i.e. score = 0 B = 1.314 Exp(B) = eB = e1.314 = 3.720
OR = 3.720, p = 1 – (1 / (OR + 1)) = 1 – (1 / (3.720 + 1)), so p = 0.788

628 Score 1 Constant b = 1.314 Score B = -0.467
Exp(1.314 – 0.467) = Exp(0.847) = 2.332. OR = 2.332, p = 1 – (1 / (2.332 + 1)) = 0.699
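You can check these conversions with R's inverse-logit function, plogis() – a small sketch using the estimates above:
plogis(1.314)                # constant only (score = 0): 0.788
plogis(1.314 - 0.467 * 1)    # score = 1: 0.699
exp(-0.467)                  # exp(B): each extra point multiplies the odds by about 0.63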

629 Standard Errors and CIs
SPSS gives B, SE B, exp(B) by default Can work out 95% CI from standard error B ± 1.96 x SE(B) Or ask for it in options Symmetrical in B Non-symmetrical (sometimes very) in exp(B)

630

631 The odds of passing the test are multiplied by 0.63 (CIs = 0.408, 0.962; p = 0.033) for every additional point on the aptitude test.

632 More on Standard Errors
In OLS regression, if a variable is added in a hierarchical fashion, the p-value associated with the change in R2 is the same as the p-value of the variable. Not the case in logistic regression – in our data the two values are  and 0.033. Wald standard errors make the p-value in the estimates wrong – too high (the CIs are still correct)

633 Two estimates use slightly different information
P-value says “what if no effect” CI says “what if this effect” Variance depends on the hypothesised ratio of the number of people in the two groups Can calculate likelihood ratio based p-values If you can be bothered Some packages provide them automatically

634 Probit Regression Very similar to logistic In SPSS:
much more complex initial transformation (to normal distribution) Very similar results to logistic (multiplied by 1.7) In SPSS: A bit weird Probit regression available through menus

635 However But requires data structured differently
Ordinal logistic regression is equivalent to binary logistic If outcome is binary SPSS gives option of probit

636 Results Estimate SE P Logistic (binary) Score 0.288 0.301 0.339 Exp
0.147 0.073 0.043 Logistic (ordinal) Logistic (probit) 0.191 0.178 0.282 0.090 0.042 0.033

637 Differentiating Between Probit and Logistic
Depends on shape of the error term Normal or logistic Graphs are very similar to each other Could distinguish quality of fit Given enormous sample size Logistic = probit x 1.7 Actually Probit advantage Understand the distribution Logistic advantage Much simpler to get back to the probability

638

639 Infinite Parameters Non-convergence can happen because of infinite parameters Insoluble model Three kinds: Complete separation The groups are completely distinct Pass group all score more than 10 Fail group all score less than 10

640 Quasi-complete separation
Separation with some overlap Pass group all score 10 or more Fail group all score 10 or less Both cases: No convergence Close to this Curious estimates Curious standard errors

641 Categorical Predictors
Can cause separation Esp. if correlated Need people in every cell Male Female White Non-White Below Poverty Line Above Poverty Line

642 Logistic Regression and Diagnosis
Logistic regression can be used for diagnostic tests For every score Calculate probability that result is positive Calculate proportion of people with that score (or lower) who have a positive result Calculate c statistic Measure of discriminative power %age of all possible cases, where the model gives a higher probability to a correct case than to an incorrect case

643 SPSS doesn’t do it automatically Save probabilities
Perfect c-statistic = 1.0 Random c-statistic = 0.5 SPSS doesn’t do it automatically But easy to do Save probabilities Use Graphs, ROC Curve Test variable: predicted probability State variable: outcome
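The c statistic can also be computed directly from the saved probabilities – a minimal R sketch with hypothetical vectors outcome (0/1) and phat (predicted probabilities); the pROC package, if installed, gives the same answer via roc()/auc():
c_stat <- function(outcome, phat) {
  pos  <- phat[outcome == 1]
  neg  <- phat[outcome == 0]
  comp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")   # ties count a half
  mean(comp)   # proportion of pairs where the positive case gets the higher probability
}
# c_stat(outcome, phat)   # 1.0 = perfect discrimination, 0.5 = chance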

644 Sensitivity and Specificity
Probability of saying someone has a positive result – If they do: p(pos)|pos Specificity Probability of saying someone has a negative result If they do: p(neg)|neg

645 Calculating Sens and Spec
For each value Calculate proportion of minority earning less – p(m) proportion of non-minority earning less – p(w) Sensitivity (value) P(m)

646 Salary and P(minority): 10 – .39; 20 – .31; 30 – .23; 40 – .17; 50 – .12; 60 – .09; 70 – .06; 80 – .04; 90 – .03

647 Using Bank Data Predict minority group, using salary (000s)
Logit(minority) = salary x –0.039 Find actual proportions

648 Area under curve is c-statistic

649 More Advanced Techniques
Multinomial Logistic Regression more than two categories in DV same procedure one category chosen as reference group odds of being in category other than reference Polytomous Logit Universal Models (PLUM) Ordinal multinomial logistic regression For ordinal outcome variables

650 Final Thoughts Logistic Regression can be extended Same issues as OLS
dummy variables non-linear effects interactions (even though we don’t cover them until the next lesson) Same issues as OLS collinearity outliers

651

652

653 Lesson 12: Mediation and Path Analysis

654 Introduction Moderator Mediator All relationships are really mediated
Level of one variable influences effect of another variable Mediator One variable influences another via a third variable All relationships are really mediated are we interested in the mediators? can we make the process more explicit

655 In examples with bank Why? beginning education salary
What is the process? Are we making assumptions about the process? Should we test those assumptions?

656 job skills expectations negotiating skills kudos for bank education beginning salary

657 Direct and Indirect Influences
X may affect Y in two ways Directly – X has a direct (causal) influence on Y (or maybe mediated by other variables) Indirectly – X affects Y via a mediating variable - M

658 e.g. how does going to the pub effect comprehension on a Summer school course
on, say, regression not reading books on regression Having fun in pub in evening less knowledge Anything here?

659 not reading books on regression
Having fun in pub in evening less knowledge fatigue Still needed?

660 Mediators needed to cope with more sophisticated theory in social sciences make explicit assumptions made about processes examine direct and indirect influences

661 Detecting Mediation

662 4 Steps From Baron and Kenny (1986)
To establish that the effect of X on Y is mediated by M: 1. Show that X predicts Y. 2. Show that X predicts M. 3. Show that M predicts Y, controlling for X. 4. If the effect of X controlling for M is zero, M is a complete mediator of the relationship (3 and 4 come from the same analysis)

663 Enjoy Books  Buy books Read Books
Example: Book habits Enjoy Books Buy books Read Books

664 Three Variables
Enjoy – how much an individual enjoys books. Buy – how many books an individual buys (in a year). Read – how many books an individual reads (in a year)

665

666 The Theory enjoy buy read

667 Show that X (enjoy) predicts Y (read) b1 = 0.487, p < 0.001
Step 1 Show that X (enjoy) predicts Y (read) b1 = 0.487, p < 0.001 standardised b1 = 0.732 OK

668 Show that X (enjoy) predicts M (buy) b1 = 0.974, p < 0.001
standardised b1 = 0.643 OK

669 3. Show that M (buy) predicts Y (read), controlling for X (enjoy)
b1 = 0.469, p < 0.001 standardised b1 = 0.206 OK

670 If effect of X controlling for M is zero, M is complete mediator of the relationship
(Same as analysis for step 3.) b2 = 0.287, p = 0.001 standardised b2 = 0.431 Hmmmm… Significant, therefore not a complete mediator

671 0.287 (step 4) enjoy read buy 0.206 (from step 3) 0.974 (from step 2)

672 The Mediation Coefficient
Amount of mediation = Step 1 – Step 4 = 0.487 – 0.287 = 0.200, OR Step 2 × Step 3 = 0.974 × 0.206 = 0.200

673 SE of Mediator enjoy read buy sa = se(a) sb = se(b) a b (from step 2)

674 Sobel test Standard error of mediation coefficient can be calculated a = sa = 0.189 b = sb = 0.054

675 Indirect effect = 0.200 Online Sobel test: se = 0.056
t =3.52, p = 0.001 Online Sobel test: (Won’t be there for long; probably will be somewhere else)
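A minimal R sketch of the Sobel calculation (the standard first-order formula; online calculators may use a slightly different variant). Here a and b are the step 2 and step 3 slopes, sa and sb their standard errors:
sobel <- function(a, b, sa, sb) {
  ab <- a * b                               # indirect (mediated) effect
  se <- sqrt(b^2 * sa^2 + a^2 * sb^2)       # first-order Sobel standard error
  z  <- ab / se
  c(indirect = ab, se = se, z = z, p = 2 * pnorm(-abs(z)))
}
# sobel(a, b, sa = 0.189, sb = 0.054)       # plug in the two slopes from steps 2 and 3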

676 A Note on Power Recently
Move in the methodological literature away from this conventional approach. Problems of power: several tests, all of which must be significant; Type I error rate = 0.05 × 0.05 = 0.0025, which must affect power. Bootstrapping suggested as an alternative. See Papers B7, A4, B9; B21 for SPSS syntax

677

678

679 Lesson 13: Moderators in Regression
“different slopes for different folks”

680 Introduction Moderator relationships have many different names
interactions (from ANOVA) multiplicative non-linear (just confusing) non-additive All talking about the same thing

681 A moderated relationship occurs
when the effect of one variable depends upon the level of another variable

682 Where there is collinearity
Hang on … That seems very like a nonlinear relationship Moderator Effect of one variable depends on level of another Non-linear Effect of one variable depends on level of itself Where there is collinearity Can be hard to distinguish between them Paper in handbook (B5) Should (usually) compare effect sizes

683 e.g. How much it hurts when I drop a computer on my foot depends on
x1: how much alcohol I have drunk x2: how high the computer was dropped from but if x1 is high enough x2 will have no effect

684 e.g. Likelihood of injury in a car accident
depends on x1: speed of car x2: if I was wearing a seatbelt but if x1 is low enough x2 will have no effect

685

686 e.g. number of words (from a list) I can remember
depends on x1: type of words (abstract, e.g. ‘justice’, or concrete, e.g. ‘carrot’) x2: Method of testing (recognition – i.e. multiple choice, or free recall) but if using recognition x1: will not make a difference

687 We looked at three kinds of moderator alcohol x height = pain
continuous x continuous speed x seatbelt = injury continuous x categorical word type x test type categorical x categorical We will look at them in reverse order

688 How do we know to look for moderators?
Theoretical rationale Often the most powerful Many theories predict additive/linear effects Fewer predict moderator effects Presence of heteroscedasticity Clue there may be a moderated relationship missing

689 Two Categorical Predictors

690 Data 2 IVs 20 Participants in one of four groups 5 per group
word type (concrete [1], abstract [2]) test method (recog [1], recall [2]) 20 Participants in one of four groups 1, 1 1, 2 2, 1 2, 2 5 per group lesson12.1.sav

691

692 Graph of means

693 ANOVA Results Standard way to analyse these data would be to use ANOVA
Words: F=6.1, df=1, 16, p=0.025 Test: F=5.1, df=1, 16, p=0.039 Words x Test: F=5.6, df=1, 16, p=0.031

694 Procedure for Testing 1: Convert to effect coding
can use dummy coding, collinearity is less of an issue doesn’t make any difference to substantive interpretation 2: Calculate interaction term In ANOVA interaction is automatic In regression we create an interaction variable

695 Interaction term (wxt)
multiply effect coded variables together

696 3: Carry out regression Hierarchical linear effects first
interaction effect in next block
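A minimal R sketch of these three steps, with a hypothetical data frame d holding score and the effect-coded words and test variables (-1 / 1):
d$wxt <- d$words * d$test                        # step 2: the interaction term
m1 <- lm(score ~ words + test, data = d)         # block 1: linear effects
m2 <- lm(score ~ words + test + wxt, data = d)   # block 2: add the interaction
anova(m1, m2)                                    # change in R2 / F for the interaction
summary(m2)                                      # b0, b1, b2, b3 as discussed below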

697 b0 (intercept) = predicted value of Y (score) when all X = 0
b1 (words) = -2.3, p=0.025 b2 (test) = -2.1, p=0.039 b3 (words x test) = -2.2, p=0.031 Might need to use change in R2 to test sig of interaction, because of collinearity What do these mean? b0 (intercept) = predicted value of Y (score) when all X = 0 i.e. the central point

698 b0 = 13.2 – the grand mean. b1 = -2.3 – the distance from the grand mean to the means of the two word types: 13.2 – (-2.3) = 15.5 and 13.2 + (-2.3) = 10.9

699 b2 = -2.1 – the distance from the grand mean to the recog and recall means. b3 = -2.2 – to understand b3 we need to look at predictions from the equation without this term:
Score = 13.2 + (-2.3) × w + (-2.1) × t

700 Score = (-2.3)  w + (-2.1)  t So for each group we can calculate an expected value

701 b1 = -2.3, b2 = -2.1; word: concrete = -1, abstract = 1; test: recog = -1, recall = 1
Expected values (relative to the grand mean): concrete/recog: (-2.3)(-1) + (-2.1)(-1) = 4.4; concrete/recall: (-2.3)(-1) + (-2.1)(1) = 0.2; abstract/recog: (-2.3)(1) + (-2.1)(-1) = -0.2; abstract/recall: (-2.3)(1) + (-2.1)(1) = -4.4

702 The exciting part comes when we look at the differences between the actual value and the value in the 2 IV model

703 Each difference = 2.2 (or –2.2) The value of b3 was –2.2
the interaction term is the correction required to the slope when the second IV is included

704 Examine the slope for word type
Gradient = ( ) / 2 = -2.1

705 Add the slopes for two test groups
Both word groups (-2.1) Concrete ( )/2 = 0.1 Abstract ( )/2 = -4.3

706 b associated with interaction
the change in slope, away from the average, associated with a 1 unit change in the moderating variable OR Half the difference in the slopes

707 Y(concrete) = 13.2 + (-2.3)(-1) + (-2.1)t + (-2.2)(-1)t
Another way to look at it: Y = 13.2 + (-2.3)w + (-2.1)t + (-2.2)wt. Examine the concrete words group (w = -1) and substitute values into the equation: Y(concrete) = 13.2 + 2.3 + (-2.1)t + 2.2t = 15.5 + 0.1t. The effect of changing test type for concrete words is 0.1 (the slope, which is half the actual difference)

708 Why go to all that effort? Why not do ANOVA in the first place?
That is what ANOVA actually does if it can handle an unbalanced design (i.e. different numbers of people in each group) Helps to understand what can be done with ANOVA SPSS uses regression to do ANOVA Helps to clarify more complex cases as we shall see

709 Categorical x Continuous

710 Note on Dichotomisation
Very common to see people dichotomise a variable Makes the analysis easier Very bad idea Paper B6

711 Data A chain of 60 supermarkets
examining the relationship between profitability, shop size, and local competition 2 IVs shop size comp (local competition, 0=no, 1=yes) DV profit

712 Data, ‘lesson 12.2.sav’

713 1st Analysis Two IVs R2=0.367, df=2, 57, p < 0.001
Unstandardised estimates b1 (shopsize) = (p=0.001) b2 (comp) = (p<0.001) Standardised estimates b1 (shopsize) = 0.356 b2 (comp) = 0.448

714 Suspicions Presence of competition is likely to have an effect
Residual plot shows a little heteroscedasticity

715 Procedure for Testing Very similar to last time
convert ‘comp’ to effect coding -1 = No competition 1 = competition Compute interaction term comp (effect coded) x size Hierarchical regression

716 Result Unstandardised estimates Standardised estimates
b1 (shopsize) = (p=0.006) b2 (comp) = (p = 0.506) b3 (sxc) = (p=0.050) Standardised estimates b1 (shopsize) = 0.306 b2 (comp) = b3 (sxc) =

717 comp now non-significant
shows importance of hierarchical it obviously is important

718 Interpretation Draw graph with lines of best fit
drawn automatically by SPSS Interpret equation by substitution of values evaluate effects of size competition

719

720 Effects of size in presence and absence of competition
(can ignore the constant) Y = x1(b1) + x2(-1.67) + x1·x2(-0.050). Competition present (x2 = 1): Y = x1(b1) + (-1.67) + x1(-0.050) = x1(b1 - 0.050) + (-1.67)

721 Competition absent (x2 = -1): Y = x1(b1) + (-1)(-1.67) + x1(-1)(-0.050) = x1(b1 + 0.050) + 1.67 = x1 × 0.121 (+ 1.67)

722 Two Continuous Variables

723 Data Bank Employees only using clerical staff 363 cases
predicting starting salary previous experience age age x experience

724 Correlation matrix only one significant

725 Initial Estimates (no moderator) (standardised)
R2 = 0.061, p<0.001 Age at start = -0.37, p<0.001 Previous experience = 0.36, p<0.001 Suppressing each other Age and experience compensate for one another Older, with no experience, bad Younger, with experience, good

726 The Procedure Very similar to previous
create multiplicative interaction term BUT Need to eliminate effects of means cause massive collinearity and SDs cause one variable to dominate the interaction term By standardising

727 Hint: automatic in SPSS in Descriptives
To standardise x, subtract mean, and divide by SD re-expresses x in terms of distance from the mean, in SDs ie z-scores Hint: automatic in SPSS in Descriptives Create interaction term of age and exp axe = z(age)  z(exp)

728 Hierarchical regression
two linear effects first moderator effect in second hint: it is often easier to interpret if standardised versions of all variables are used
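A minimal R sketch of the procedure, assuming a hypothetical data frame bank with salbegin, agestart and prevexp:
bank$z_sal <- as.numeric(scale(bank$salbegin))      # standardise DV and IVs first
bank$z_age <- as.numeric(scale(bank$agestart))
bank$z_exp <- as.numeric(scale(bank$prevexp))
bank$axe   <- bank$z_age * bank$z_exp               # interaction of the z-scores
m1 <- lm(z_sal ~ z_age + z_exp, data = bank)
m2 <- lm(z_sal ~ z_age + z_exp + axe, data = bank)
anova(m1, m2)                                       # significance of the moderator effect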

729 Estimates (standardised)
Change in R2 0.085, p<0.001 Estimates (standardised) b1 (exp) = 0.104 b2 (agestart) = -0.54 b3 (age x exp) = -0.54

730 Interpretation 1: Pick-a-Point
Graph is tricky can’t have two continuous variables Choose specific points (pick-a-point) Graph the line of best fit of one variable at others Two ways to pick a point 1: Choose high (z = +1), medium (z = 0) and low (z = -1) Choose ‘sensible’ values – age 20, 50, 80?

731 Bracketed terms are simple intercept and simple slope
We know: Y = e(0.10) + a(-0.54) + a·e(-0.54), where a = agestart and e = experience. We can rewrite this as: Y = (e × 0.10) + (a × -0.54) + (a × e × -0.54). Take a out of the brackets: Y = (e × 0.10) + (-0.54 + e × -0.54)a. The bracketed terms are the simple intercept and simple slope: ω0 = (e × 0.10), ω1 = (-0.54 + e × -0.54), so Y = ω0 + ω1·a

732 Pick any value of e, and we know the slope for a
Standardised, so it's easy. e = -1: ω0 = (-1 × 0.10) = -0.10, ω1 = (-0.54 + -1 × -0.54) = 0.00a. e = 0: ω0 = 0, ω1 = -0.54a. e = 1: ω0 = 0.10, ω1 = (-0.54 + 1 × -0.54) = -1.08a

733 Graph the Three Lines

734 Interpretation 2: P-Values and CIs
This second way is newer and rarely done: calculate CIs of the simple slope at any point, calculate the p-value, and give ranges of significance.

735 What do you need? The variance and covariance of the estimates
SPSS doesn't provide these for the intercept, so you need to do it manually: in Options, exclude the intercept; create your own intercept variable (c = 1); and use it in the regression.
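In R the variances and covariances come straight from vcov(), so the calculation can be sketched directly; this assumes the moderated model fit2 from the earlier sketch and uses the standard formula for the variance of a simple slope:

V  <- vcov(fit2)                                   # covariance matrix of the estimates
e  <- 1                                            # pick a point (z-score of experience)
w1 <- coef(fit2)["z_age"] + coef(fit2)["axe"] * e  # simple slope of age at that point
se <- sqrt(V["z_age", "z_age"] + e^2 * V["axe", "axe"] + 2 * e * V["z_age", "axe"])
2 * pnorm(-abs(w1 / se))                           # approximate p-value for the simple slope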

736 Enter information into web page:
(Again, may not be around for long) Get results Calculations in Bauer and Curran (in press: Multivariate Behavioral Research) Paper B13

737

738 Areas of Significance

739 2 complications 1: Constant differed
2: The DV was logged, hence non-linear – the effect of 1 unit depends on where that unit is. Can use SPSS to draw graphs showing lines of best fit for different groups. See paper A2.

740 Finally …

741 Unlimited Moderators Moderator effects are not limited to 2 variables
or to linear effects

742 Three Interacting Variables
Block 1: Age, Sex, Exp
Block 2: Age × Sex, Age × Exp, Sex × Exp
Block 3: Age × Sex × Exp
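A hedged sketch of the three blocks in R (the data frame and variable names are assumptions):

m1 <- lm(salary ~ age + sex + exp, data = bank)          # block 1: main effects
m2 <- update(m1, . ~ . + age:sex + age:exp + sex:exp)    # block 2: all two-way interactions
m3 <- update(m2, . ~ . + age:sex:exp)                    # block 3: the three-way interaction
anova(m1, m2, m3)                                        # test each block against the previous one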

743 Results All two way interactions significant Three way not significant
Effect of Age depends on sex Effect of experience depends on sex Size of the age x experience interaction does not depend on sex (phew!)

744 Moderated Non-Linear Relationships
Enter the non-linear effect, then enter the non-linear effect × moderator. If significant, this indicates that the degree of non-linearity differs by moderator.

745

746 Modelling Counts: Poisson Regression
Lesson 14

747 Counts and the Poisson Distribution
Von Bortkiewicz (1898): numbers of Prussian soldiers kicked to death by horses
Deaths:    0, 1, 2, 3, 4
Frequency: 109, 65, 22, 3, 1

748 The data fitted a Poisson probability distribution
When counts of events occur, the Poisson distribution is common – e.g. papers published by researchers, police arrests, numbers of murders, ship accidents. A common approach is to log transform and treat as normal. Problems: censored at 0, only integers allowed, heteroscedasticity.

749 The Poisson Distribution

750

751 In a poisson distribution
P(y) = (e^(-m) × m^y) / y!
where y is the count and m is the mean of the Poisson distribution
In a Poisson distribution the mean = the variance (hence the heteroscedasticity issue): m = σ2
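As a quick check, the horse-kick frequencies from the earlier slide can be compared with what a Poisson distribution with the same mean predicts (a short sketch in R):

deaths <- 0:4
freq   <- c(109, 65, 22, 3, 1)                # observed frequencies (von Bortkiewicz)
m      <- sum(deaths * freq) / sum(freq)      # sample mean, about 0.61
round(sum(freq) * dpois(deaths, m), 1)        # expected frequencies under a Poisson(m)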

752 Poisson Regression in SPSS
Not directly available. SPSS can be tweaked to do it in three ways: the general loglinear model (genlog); non-linear regression (CNLR), which gives bootstrapped p-values only – both of these are quite tricky; and, from SPSS 15, …

753 Example Using Genlog Number of shark bites on different colour surfboards 100 surfboards, 50 red, 50 blue Weight cases by bites Analyse, Loglinear, General Colour is factor

754 Results Correspondence Between Parameters and Terms of the Design
Parameter   Aliased   Term
1                     Constant
2                     [COLOUR = 1]
3           x         [COLOUR = 2]
Note: 'x' indicates an aliased (or a redundant) parameter. These parameters are set to zero.

755 Note: Intercept (param 1) is curious
(Output columns: Param, Est., SE, Z-value, Asymptotic CI Lower and Upper – values not reproduced.)
Note: the intercept (param 1) is curious. Param 2 is the difference in the means.

756 SPSS: Continuous Predictors
Bleedin’ nightmare

757 Poisson Regression in Stata
SPSS will save a Stata file Open it in Stata Statistics, Count outcomes, Poisson regression

758 Poisson Regression in R
R is a free program, similar to S-Plus. It has a steep learning curve to start with, but is much nicer for doing Poisson (and other) regression analyses.

759 Commands in R Stage 1: enter data Run analysis Get results
Stage 1: enter the data
colour <- c(1, 0, 1, 0, 1, 0 … 1)
bites <- c(3, 1, 0, 0, … )
Run the analysis
p1 <- glm(bites ~ colour, family = poisson)
Get the results
summary.glm(p1)

760 R Results Results for colour Same as SPSS
Coefficients table: Estimate, Std. Error, z value, Pr(>|z|) for (Intercept) * and colour ** (values not reproduced)
Results for colour: same as SPSS. For the intercept: different (weird SPSS).

761 Predicted Values Need to get exponential of parameter estimates
Like logistic regression Exp(0.555) = 1.74 You are likely to be bitten by a shark 1.74 times more often with a red surfboard
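In R, a short sketch of getting these quantities from the fitted model p1 above:

exp(coef(p1))                    # exponentiated estimates: the intercept gives the expected
                                 # bite count for the reference colour, and exp(0.555) = 1.74
                                 # is the rate ratio for the other colour
predict(p1, type = "response")   # predicted counts for each surfboard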

762 Checking Assumptions Was it really poisson distributed?
For Poisson, m = σ2: as the mean increases, the variance should also increase, and the residuals should be random. Overdispersion is a common problem (too many zeroes).
For blue: m = σ2 = exp(…) = 1.42
For red: m = σ2 = exp(…) = 2.48

763 Strictly:

764 Compare Predicted with Actual Distributions

765 Overdispersion
Problem in Poisson regression: too many zeroes. It causes χ2 inflation and standard error deflation, hence p-values that are too low and a higher type I error rate. Solution: negative binomial regression.
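Negative binomial regression is not shown in the slides' SPSS route, but as a hedged sketch it is a one-line change in R, using glm.nb() from the MASS package:

library(MASS)
nb1 <- glm.nb(bites ~ colour)    # negative binomial: adds a dispersion (theta) parameter
summary(nb1)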

766 Using R R can read an SPSS file
But you have to ask it nicely: click the Packages menu, Load package, and choose "foreign". Then click File, Change Dir, and change to the folder that contains your data.

767 More on R R uses objects Function is read.spss()
To place something into an object, use <-
X <- Y puts Y into X
The function is read.spss():
Mydata <- read.spss("spssfilename.sav")
Variables are then referred to as Mydata$VAR1
Note 1: R is case sensitive
Note 2: the SPSS variable names come through in capitals
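One detail worth knowing (check this against your own version of the foreign package): read.spss() returns a list unless you ask for a data frame, for example:

library(foreign)
Mydata <- read.spss("spssfilename.sav", to.data.frame = TRUE)   # a data frame, not a list
head(Mydata)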

768 GLM in R Command Output is a GLM object
glm(outcome ~ pred1 + pred2 + … + predk [, family = familyname])
If no familyname is given, the default is gaussian – i.e. OLS. Use binomial for logistic, poisson for Poisson.
The output is a GLM object, and you need to give it a name:
my1stglm <- glm(outcome ~ pred1 + pred2 + … + predk [, family = familyname])

769 Then need to explore the result To explore what it means
summary(my1stglm) To explore what it means Need to plot regressions Easiest is to use Excel

770

771 Introducing Structural Equation Modelling
Lesson 15

772 Introduction Related to regression analysis
All (OLS) regression can be considered as a special case of SEM Power comes from adding restrictions to the model SEM is a system of equations Estimate those equations

773 Regression as SEM Grades example
Grade = constant + books + attend + error Looks like a regression equation Also Books correlated with attend Explicit modelling of error

774 Path Diagram A system of equations is usefully represented in a path diagram: squares are measured variables (x), circles are unmeasured variables (e), single-headed arrows are regressions, and double-headed arrows are correlations.

775 Path Diagram for Regression
(Path diagram: Books and Attend predicting Grade, with an error term on Grade.) Must usually explicitly model the error; must explicitly model the correlation between Books and Attend.

776 Results Unstandardised

777 Standardised

778 Table

779 So What Was the Point? Regression is a special case
Lots of other cases. The power of SEM is the power to add restrictions to the model: restrict parameters to zero, to the value of other parameters, or to 1.

780 Restrictions Questions Each restriction adds 1 df
Is a parameter really necessary? Is a set of parameters necessary? Are parameters equal? Each restriction adds 1 df, and the model is tested with χ2.

781 The χ2 Test Can the model proposed have generated the data?
A test of the significance of the difference between the model and the data – a statistically significant result is bad. It is theoretically driven: start with the model, don't start with the data.

782 Regression Again Both estimates restricted to zero

783 This test is (asymptotically) equivalent to the F test in regression
Two restrictions, so 2 df for the χ2 test: χ2 = 15.9, p = …
This test is (asymptotically) equivalent to the F test in regression. We still haven't got any further.

784 Multivariate Regression
(Path diagram: predictors x1 and x2, each with a path to outcomes y1, y2 and y3.)

785 Test of all x’s on all y’s
(6 restrictions = 6 df)

786 Test of all x1 on all y's (3 restrictions)

787 Test of all x1 on all y1 (3 restrictions)

788 Test of all 3 partial correlations between y’s, controlling for x’s
(3 restrictions)

789 Path Analysis and SEM More complex models – can add more restrictions
E.g. mediator model 1 restriction No path from enjoy -> read

790 Result χ2 = 10.9, 1 df, p = 0.001 Not a complete mediator
Additional path is required

791 Multiple Groups Same model Equality constraints between groups
Different people Equality constraints between groups Means, correlations, variances, regression estimates E.g. males and females

792 Multiple Groups Example
Age; severity of psoriasis – SEVE (severity in emotional areas: hands, face, forearm) and SEVNONE (severity in non-emotional areas); Anxiety; Depression

793

794

795 Model

796 Females

797 Males

798 Constraint: sevnone -> dep
Constrained to be equal for males and females: 1 restriction, 1 df, χ2 = 1.3 – not significant
Then 4 restrictions: the paths from the 2 severity measures -> anx & dep

799 Parameters are not equal
4 restrictions, 4 df: χ2 = 1.3, p = 0.014

800 Missing Data: The big advantage
SEM programs tend to deal with missing data, using multiple imputation or Full Information (Direct) Maximum Likelihood – the two are asymptotically equivalent. Data can be MAR, not just MCAR.

801 Power: A Smaller Advantage
Power for regression gets tricky with large models With SEM power is (relatively) easy It’s all based on chi-square Paper B14

802 Lesson 16: Dealing with clustered data & longitudinal models

803 The Independence Assumption
In Lesson 8 we talked about independence: the residual of any one case should not tell you about the residual of any other case. This is particularly problematic when data are clustered on the predictor variable (e.g. the predictor is household size and the cases are members of a family; or the predictor is doctor training and the outcomes are that doctor's patients), and when data are longitudinal – people measured over time: it's the same person!

804 Clusters of Cases Problem with cluster (group) randomised studies
Or group effects Use Huber-White sandwich estimator Tell it about the groups Correction is made Use complex samples in SPSS

805 Complex Samples As with Huber-White for heteroscedasticity Run GLM
Add a variable that tells it about the clusters, and put it into Clusters. Run the GLM as before. Warning: you need about 20 clusters for the solutions to be stable.
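An alternative, sketched here in R rather than SPSS Complex Samples, is the Huber-White sandwich estimator clustered on the grouping variable; the data frame and variable names (cost, triage, week) are assumptions, not the course data:

library(sandwich)
library(lmtest)
m <- lm(cost ~ triage, data = dat)                  # ordinary regression fit
coeftest(m, vcov = vcovCL(m, cluster = ~ week))     # standard errors corrected for clustering by week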

806 Example People randomised by week to one of two forms of triage
Compare the total cost of treating each. Ignoring clustering: the difference is £2.40 per person, with 95% confidence interval £0.58 to £4.22, p = 0.010. Including clustering: the difference is still £2.40, with 95% CI -£0.85 to £5.65, and p = … Ignoring the clustering led to a type I error.

807 Longitudinal Research
For comparing repeated measures, the clusters are people: we can model the repeated measures over time. Data are usually short and fat – one row per ID, with columns V1, V2, V3, V4.

808 Converting Data
Change the data to tall and thin: one row per measurement occasion, with ID, time and value columns. Use Data, Restructure in SPSS. Clusters are ID.
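The same restructuring can be sketched in R with the base reshape() function (the values below are made up, just to show the shape of the call):

wide <- data.frame(id = 1:3,
                   v1 = c(2, 7, 5), v2 = c(3, 6, 4),
                   v3 = c(4, 8, 1), v4 = c(7, 5, 2))      # made-up wide (short and fat) data
long <- reshape(wide, direction = "long",
                varying = c("v1", "v2", "v3", "v4"),
                v.names = "v", timevar = "time", idvar = "id")
long[order(long$id), ]                                    # tall and thin: one row per id x time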

809 (Simple) Example Use employee data.sav
Compare beginning salary and current salary. Would normally use a paired-samples t-test. Difference = $17,403, 95% CI $16,… to $18,…

810 Restructure the Data Do it again Complex GLM with Time as factor
With the data tall and thin: complex GLM with Time as factor, ID as cluster. Difference = $17,430, 95% CIs = …
(Tall-and-thin example with columns ID, Time and Cash; values include $18,750, $21,450, $12,000, $21,900, $13,200, $45,000.)

811 Interesting … That wasn’t very interesting
What is more interesting is when we have multiple measurements of the same people Can plot and assess trajectories over time

812 Single Person Trajectory
(Plot: one person's measurements, plotted against time.)

813 Multiple Trajectories: What’s the Mean and SD?
(Plot: many people's trajectories plotted against time.)

814 Complex Trajectories An event occurs
Can have two effects: A jump in the value A change in the slope Event doesn’t have to happen at the same time for each person Doesn’t have to happen at all

(Diagram: slope 1 before the event, a jump when the event occurs, then slope 2 afterwards.)

816 Parameterising Time
(Table with columns Time, Event, Time2 and Outcome; the outcome values run 12, 13, 14, 15, 16, 10, 9, … across the time points.)

817 What are the parameter estimates?
Draw the Line What are the parameter estimates?

818 Main Effects and Interactions
Intercept differences Moderator effects Slope differences

819 Multilevel Models Fixed versus random effects Levels
Fixed effects are fixed across individuals (or clusters); random effects have variance. Level 1 – individual measurement occasions; Level 2 – higher-order clusters.

820 More on Levels NHS direct study Widowhood food study
Level 1 units: ……………. Level 2 units: …………… Widowhood food study Level 1 units …………… Level 2 units ……………

821 More Flexibility Three levels: Level 1: measurements Level 2: people
Level 3: schools

822 More Effects Variances and covariances of effects
Level 1 and level 2 residuals Makes R2 difficult to talk about Outcome variable Yij The score of the ith person in the jth group

(Example table of Y values, indexed by person i and group j.)

824 Notation Notation gets a bit horrid We used to have b0 and b1
This varies a lot between books and programs. If b0 and b1 are fixed, that's fine; if random, each person has their own intercept and slope.

825 Standard Errors Intercept has standard errors
Slopes have standard errors Random effects have variances Those variances have standard errors Is there statistically significant variation between higher level units (people)? OR Is everyone the same?

826 Programs SPSS can do this, since version 12
It can't do anything really clever, and the menus are completely unusable – you have to use syntax.

827 SPSS Syntax MIXED relfd with time /fixed = time
/random = intercept time | subject (id) covtype(un) /print = solution.

828 SPSS Syntax MIXED relfd with time – relfd is the outcome, and time is a continuous predictor (declared with ‘with’)

829 Must specify effect as fixed first
SPSS Syntax MIXED relfd with time /fixed = time

830 SPSS Syntax MIXED relfd with time /fixed = time
/random = intercept time | subject (id) covtype(un) Intercept and time are random Specify random effects SPSS assumes that your level 2 units are subjects, and needs to know the id variable

831 SPSS Syntax MIXED relfd with time /fixed = time
/random = intercept time | subject (id) covtype(un) Covariance matrix of random effects is unstructured. (Alternative is id – identity or vc – variance components).

832 SPSS Syntax MIXED relfd with time /fixed = time
/random = intercept time | subject (id) covtype(un) /print = solution. Print the answer
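For comparison, a hedged sketch of (roughly) the same model in R with lmer() from the lme4 package; the data frame name is an assumption:

library(lme4)
m <- lmer(relfd ~ time + (1 + time | id), data = dat)   # fixed effect of time; random intercept
                                                         # and slope per person, with an
                                                         # unstructured covariance between them
summary(m)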

833 The Output Information criteria We’ll come back

834 Fixed Effects Not useful here, useful for interactions

835 Estimates of Fixed Effects
Interpreted as regression equation

836 Covariance Parameters

837 Change Covtype to VC We know that this is wrong
The covariance of the effects was statistically significant Can also see if it was wrong by comparing information criteria We have removed a parameter from the model Model is worse Model is more parsimonious Is it much worse, given the increase in parsimony?

838 Compare the information criteria for the VC model and the UN model – lower is better.

839 Adding Bits So far, all a bit dull
We want some more predictors, to make it more exciting – e.g. female. Add: MIXED relfd with time female /fixed = time female time*female. What does the interaction term represent?

840 Extending Models Models can be extended Need a different program
Any kind of regression can be used Logistic, multinomial, Poisson, etc More levels Children within classes within schools Measures within people within classes within prisons Multiple membership / cross classified models Children within households and classes, but households not nested within class Need a different program E.g. MlwiN

841 MlwiN Example (very quickly)

842 Books Singer, JD and Willett, JB (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford, Oxford University Press. Examples at:

843 The End

