

2 Assumptions

3 Assumptions
Assumptions about the predictors:
Absence of Collinearity
No influential data points
Assumptions about the residuals:
Normality of Errors
Homoskedasticity of Errors
Independence

4 Assumptions
Absence of Collinearity
No influential data points
Normality of Errors
Homoskedasticity of Errors
Independence

6 Collinearity…
… generally occurs when predictors are correlated (however, it may also arise in more complex ways, through multi-collinearity among several predictors)
Demo

7 Absence of Collinearity
Baayen (2008: 182)

8 Collinearity Baayen (2008: 182)

9 “If collinearity is ignored, one is likely to end up with a confusing statistical analysis in which nothing is significant, but where dropping one covariate can make the others significant, or even change the sign of estimated parameters.” (Zuur, Ieno & Elphick, 2010: 9) Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

10 You check collinearity through variance inflation factors (VIFs)
library(car)
vif(xmdl)
Values > 10 are commonly regarded as dangerous; however, values substantially larger than 1 can already be problematic. I would definitely start worrying at around > 4.
INFORMALLY: look at pairwise correlations between predictors; definitely start worrying at around 0.8. (See the sketch below.)
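A minimal runnable sketch of this check, using simulated data (all variable names are hypothetical):

library(car)  # provides vif()
set.seed(42)
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.3)  # B is strongly correlated with A
y <- A + rnorm(100)
xmdl <- lm(y ~ A + B)
vif(xmdl)   # here the VIFs come out around 10, well past the worry threshold
cor(A, B)   # the informal check: this correlation is roughly 0.95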

11 Model comparison with separate models of collinear predictors
xmdl1 <- lm(y ~ A)
xmdl2 <- lm(y ~ B)
AIC(xmdl1)
AIC(xmdl2)
Akaike's information criterion (AIC) formalizes the trade-off between goodness of fit and number of parameters; lower values are better.
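A self-contained sketch of this comparison (simulated data; A, B, and y are hypothetical):

set.seed(1)
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.3)   # collinear with A
y <- 2 * A + rnorm(100)         # A is the true predictor
xmdl1 <- lm(y ~ A)
xmdl2 <- lm(y ~ B)
AIC(xmdl1)  # lower AIC = better trade-off between fit and parameter count
AIC(xmdl2)  # the model with the true predictor A should come out lower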

12 If the relative importance of (potentially) collinear predictors is of prime interest…
Random forests:
myforest = cforest(..., controls = data.controls)
my_varimp = varimp(myforest, conditional = T)
Check Stephanie Shih's tutorials; a sketch follows below.
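A hedged sketch of the random-forest route, following the slide's variable names; the formula, the data frame xdata, and the ntree/mtry settings are placeholders, not part of the original slide:

library(party)  # cforest(), varimp(), cforest_unbiased()
data.controls <- cforest_unbiased(ntree = 1000, mtry = 3)
myforest <- cforest(y ~ A + B + C, data = xdata, controls = data.controls)
my_varimp <- varimp(myforest, conditional = TRUE)  # conditional importance adjusts for correlated predictors
sort(my_varimp)  # larger values = more important predictors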

13 Assumptions
Absence of Collinearity
No influential data points
Normality of Errors
Homoskedasticity of Errors
Independence

15 Generated 500 x points and 500 y points, completely uncorrelated, by simply drawing them from a normal distribution

16 … then simply changing one value to (8, 8); see the sketch below
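A sketch of what this demo presumably looked like (the seed and the exact effect size are assumptions):

set.seed(7)
x <- rnorm(500)
y <- rnorm(500)        # completely uncorrelated with x
coef(lm(y ~ x))        # slope near 0
x[1] <- 8; y[1] <- 8   # move a single point to (8, 8)
coef(lm(y ~ x))        # the slope is now noticeably pulled away from 0 by one point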

17 Influence diagnostics
DFFit
DFBeta
Leverage
Cook's distance
Standardized residuals
Studentized residuals
… and more!

18 Code for doing DFBetas yourself
Perform leave-one-out influence diagnostics. General structure of the code:
all.betas = c()
for (i in 1:nrow(xdata)) {
  xmdl = lm(…, data = xdata[-i, ])  # refit the model without row i
  all.betas = c(all.betas, coef(xmdl)["slope_of_interest"])  # store the slope of interest
}
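Note that R also ships with these diagnostics built in, so the loop above is mainly didactic; assuming a fitted model xmdl:

dfbeta(xmdl)              # leave-one-out change in each coefficient
dfbetas(xmdl)             # the same, standardized
influence.measures(xmdl)  # DFFITS, DFBETAS, Cook's distance, leverage in one table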

19 Influence diagnostics abuse
Influence diagnostics are no justification for removing data points!!

20 Assumptions
Absence of Collinearity
No influential data points
Normality of Errors
Homoskedasticity of Errors
Independence

22 Q-Q plots
qqnorm(residuals(xmdl)); qqline(residuals(xmdl))

23 Assumptions
Absence of Collinearity
No influential data points
Normality of Errors
Homoskedasticity of Errors
Independence

24 Zuur, Ieno & Elphick (2010)
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

25 Residual plot (plot of residuals against fitted values)
plot(fitted(xmdl), residuals(xmdl))
(The slide shows an example residual plot with a clear violation: "This is really bad!!!")

26 Learning to interpret residual plots by simulating random data
You can type these lines into R again and again to train your eye:
## Good
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, rnorm(50))
## Weak non-constant variance
for (i in 1:9) plot(1:50, sqrt(1:50) * rnorm(50))
Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.

27 Learning to interpret residual plots by simulating random data
## Strong non-constant variance
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, (1:50) * rnorm(50))
## Non-linearity
for (i in 1:9) plot(1:50, cos((1:50) * pi / 25) + rnorm(50))
Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.

28 Emphasis on graphical tools
For now, forget about formal tests of deviations from normality and homogeneity; graphical methods are generally considered superior (Montgomery & Peck, 1992; Draper & Smith, 1998; Quinn & Keough, 2002; Läärä, 2009; Zuur et al., 2009).
Problems with formal tests:
Type II errors
hard cut-offs if used in a frequentist fashion
less information about the data and the model
they have assumptions themselves
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

29 If you have continuous data that exhibit heteroskedasticity…
… you can apply nonlinear transformations (e.g., a log transform)
… there are several variants of regression that can help you out: generalized least squares with gls(); White-Huber (heteroskedasticity-robust) standard errors via coeftest() with a vcovHC() covariance matrix, with bptest() to test for heteroskedasticity; bootstrapping using the "boot" package; etc. A sketch follows below.
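A sketch of these remedies, assuming a fitted lm model xmdl built from a data frame xdata (names hypothetical):

library(lmtest)     # bptest(), coeftest()
library(sandwich)   # vcovHC()
library(nlme)       # gls()
bptest(xmdl)                          # Breusch-Pagan test for heteroskedasticity
coeftest(xmdl, vcov = vcovHC(xmdl))   # White-Huber (robust) standard errors
xmdl.gls <- gls(y ~ A, data = xdata, weights = varPower())  # variance modeled as a power of the fitted values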

31 "A priori violations"
In the following cases, your data violate the normality and homoskedasticity assumptions on a priori grounds:
(1) count data
(2) binary data

32 "A priori violations"
In the following cases, your data violate the normality and homoskedasticity assumptions on a priori grounds:
(1) count data → Poisson regression
(2) binary data

33 "A priori violations"
In the following cases, your data violate the normality and homoskedasticity assumptions on a priori grounds:
(1) count data → Poisson regression
(2) binary data → logistic regression

34 General Linear Model → Generalized Linear Model

35 Generalized Linear Models: Ingredients
An error distribution (normal, Poisson, binomial)
A linear predictor (LP)
A link function (identity, log, logit)

36 Generalized Linear Models: Two important types
Poisson regression: a generalized linear model with Poisson error structure and log link function
Logistic regression: a generalized linear model with binomial error structure and logit link function
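In R, both are fitted with glm(); a minimal sketch with hypothetical variables and data frame:

xmdl.pois <- glm(count ~ A, data = xdata, family = poisson)    # log link is the default
xmdl.log <- glm(outcome ~ A, data = xdata, family = binomial)  # logit link is the default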

37 The Poisson Distribution
Mean = Variance
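You can verify this property of the Poisson distribution in a couple of lines of R:

x <- rpois(100000, lambda = 4)
mean(x); var(x)   # both approximately 4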

38 Hissing Koreans
Winter, B., & Grawunder, S. (2012). The phonetic profile of Korean formality. Journal of Phonetics, 40.

39 Rates can be misleading
N = Rate × Time (equivalently, Rate = N / Time)
A rate of 16/s vs. 0/s tells you nothing by itself: the observation window could be 1 millisecond or 10 years.

43 The basic GLM formula for Poisson
The basic LM formula: y = b0 + b1·x + ε, with ε ~ Normal(0, σ²)
The basic GLM formula for Poisson: y ~ Poisson(λ), with log(λ) = b0 + b1·x
b0 + b1·x is the linear predictor; log is the link function

45 Poisson model output
The estimates are log values; exponentiate them to get the predicted mean rate.
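Concretely, for a fitted Poisson model xmdl (hypothetical):

coef(xmdl)       # estimates on the log (linear predictor) scale
exp(coef(xmdl))  # intercept: predicted mean rate at baseline;
                 # slope: multiplicative change in the rate per unit of the predictor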

46 Poisson model in R
xmdl = glm(…, data = xdata, family = "poisson")
When using predict(), you have to additionally specify whether you want predictions in LP (link) space or response space:
preds = predict.glm(xmdl, newdata = mydata, type = "response", se.fit = T)
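One common pattern, sketched here as an assumption rather than part of the original slide: get predictions and standard errors on the link scale and only back-transform at the end, so that confidence bounds stay positive:

preds <- predict.glm(xmdl, newdata = mydata, type = "link", se.fit = T)
fit <- exp(preds$fit)                          # same as type = "response"
upper <- exp(preds$fit + 1.96 * preds$se.fit)  # approximate 95% bounds on the response scale
lower <- exp(preds$fit - 1.96 * preds$se.fit)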

47 The Poisson Distribution
Mean = Variance

48 The Poisson Distribution
Mean = Variance
If the variance > mean, you are dealing with overdispersion; use negative binomial regression:
library(MASS)
xmdl.nb = glm.nb(…)

49 Overdispersion test
library(pscl)
xmdl.nb = glm.nb(…)
odTest(xmdl.nb)
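A runnable sketch putting slides 48 and 49 together (the formula and data frame are hypothetical):

library(MASS)   # glm.nb()
library(pscl)   # odTest()
xmdl.nb <- glm.nb(count ~ A, data = xdata)
odTest(xmdl.nb)   # significant result: overdispersion present, prefer the negative binomial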

50 Generalized Linear Models: Two important types
Poisson regression: a generalized linear model with Poisson error structure and log link function
Logistic regression: a generalized linear model with binomial error structure and logit link function

55 The basic GLM formula for logistic regression
The basic GLM formula for Poisson regression: log(λ) = b0 + b1·x
The basic GLM formula for logistic regression: logit(p) = b0 + b1·x
logit(p) = log(p / (1 − p)) is the logit link function

56 The inverse logit function
The inverse logit (logistic) function maps the LP back to a probability: p = exp(LP) / (1 + exp(LP)); in R: plogis()
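A quick round trip between the two scales:

qlogis(0.8)          # logit: log(0.8 / 0.2) ≈ 1.386
plogis(qlogis(0.8))  # inverse logit recovers 0.8
plogis(0)            # log odds of 0 correspond to p = 0.5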

57 Odds and log odds examples
Probability   Odds    Log odds (= "logits")
0.1           0.111   -2.197
0.2           0.25    -1.386
0.3           0.428   -0.847
0.4           0.667   -0.405
0.5           1        0
0.6           1.5      0.405
0.7           2.33     0.847
0.8           4        1.386
0.9           9        2.197
So a probability of 80% of an event occurring means that the odds are "4 to 1" for it occurring.
What happens if the odds are 50 to 50? The ratio is 1 (log odds of 0).
If the probability of non-occurrence is higher than occurrence, the odds are fractions (negative log odds); if the probability of occurrence is higher, the odds exceed 1 (positive log odds).
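The table can be reproduced in a few lines of R:

p <- seq(0.1, 0.9, by = 0.1)
odds <- p / (1 - p)
data.frame(probability = p, odds = round(odds, 3), log_odds = round(log(odds), 3))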

58 Snijders & Bosker (1999: 212)

59 For probabilities: transform the entire LP with the logistic function
(The slide shows the summary output of a logistic model with predictor alc: both the intercept and the alc slope are significant.)
In R: plogis()
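A sketch of that transformation, reusing the slide's predictor name alc (the fitted model xmdl and the value 10 are arbitrary examples):

b <- coef(xmdl)                            # intercept and slope on the logit scale
plogis(b["(Intercept)"] + b["alc"] * 10)   # predicted probability at alc = 10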
