R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University

R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University Thomas.Baguley@ntu.ac.uk

2 0. Overview 1. What is R? 2. Why use R? 3. R basics 4. R graphics 5. Linear models in R 6. ANOVA and ANCOVA 6. Writing your own R functions 8. Simulation and bootstrapping

3 1. What is R? R is a software environment for statistical analysis “the lingua franca of computational statistics” De Leeuw & Mair (2007)

4 2. Why use R? It is: -free -open source -cross-platform (Mac, Linux, PC) It has: -excellent basic statistics functions -powerful and versatile graphics -hundreds of user-contributed packages -a large community of users

5 3. R basics Some common R objects: -characters (text strings e.g., 'a' or ’1’) -numbers (numbers like 2 or 1e+3) -vectors (1D set of numbers or other elements) -data frames (like vectors organized in columns) -matrices (r by c array of numbers) -lists (objects that contain other objects)

6 Assignment Input: >vector1 <- 6 >vector1 Output: [1] 6

7 Arithmetic >6 + 6 * 2 [1] 18 > vector1^2 [1] 36

8 Calling functions >c(1, 9, 25)> log(vector1) [1] 1 9 25[1] 1.791759 > rnorm(1, 100, 15)?rnorm [1] 123.5336help(rnorm)

9 Loading data > dat1 <- read.csv('pride.csv') > library(foreign) > h05 <- read.spss('Hayden_2005.sav', to.data.frame = TRUE)

10 Addressing the contents of an object > vector1[1] [1] 6 > days <- h05$days > days[1:9] [1] 2864 1460 2921 2921 2921 1460 2921 1460 31

11 Addressing the contents of an object > vector1[1] [1] 6 > days <- h05$days > days[1:9] [1] 2864 1460 2921 2921 2921 1460 2921 1460 31 Or combine the steps with h05$days[1:9]

12 Data frames > is.data.frame(h05) [1] TRUE > dim(h05) > names(h05) [1] 43 2 [1] "names" "days” -this data frame has 43 rows and 2 named columns -it can be helpful to think of a data frame as a set of named variables (vectors) bound into columns

13 Matrix objects (matrices) > cells <- c(3677,56,3924,31) > cat.names <- list(c("Before", "After"), c("Alive", "Dead")) > checklist <- matrix(cells, 2, 2, byrow=TRUE, dimnames= cat.names) > checklist Alive Dead Before 3677 56 After 3924 31

14 The power of objects > chisq.test(checklist) Pearson's Chi-squared test with Yates' continuity correction data: checklist X-squared = 8.1786, df = 1, p-value = 0.004239 > chisq.test(c(2, 18)) Chi-squared test for given probabilities data: c(2, 18) X-squared = 12.8, df = 1, p-value = 0.0003466

15 Calling functions: defaults and named arguments > chisq.test(checklist, correct=FALSE) Pearson's Chi-squared test data: checklist X-squared = 8.8072, df = 1, p-value = 0.003000 -the default argument (for matrix input is) correct=TRUE -setting correct=FALSE over-rides the default -because unnamed arguments to functions are matched in order, this argument must be named (otherwise naming the arguments is optional)

16 Exercise 1 (R Basics) -entering data -working with objects -simple statistical functions

17 4. R graphics n = 43 > boxplot(h05$days)

18 > hist(h05$days)> plot(density(h05$days))

19 Distribution functions e.g., Normal distribution dnorm(x, mean = 0, sd = 1) pnorm(q, mean = 0, sd = 1, lower.tail = TRUE) qnorm(p, mean = 0, sd = 1, lower.tail = TRUE) rnorm(n, mean = 0, sd = 1) > rdat <- rnorm(30, 100, 15)

20 > curve(dnorm(x), xlim=c(-4,4), col='purple') > curve(dt(x, 1), col='red', add=TRUE, lty=3)

22 Exercise 2 (R Graphics) -exploratory plots -plot parameters -plotting distribution functions -plotting a serial position curve (optional)

23 5. Linear models in R > expenses <- read.csv('expenses.csv') > t.test(majority ~ problem, data = expenses) Welch Two Sample t-test data: majority by problem t = -3.7477, df = 505.923, p-value = 0.000199 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -2044.457 -638.160 sample estimates: mean in group 0 mean in group 1 7092.712 8434.021 Data courtesy of Mark Thompson http://markreckons.blogspot.com/

24 > expenses <- read.csv('expenses.csv') > lm(majority ~ problem, data = expenses) Call: lm(formula = majority ~ problem, data = expenses) Coefficients: (Intercept) problem 7093 1341

25 > lm.out <- lm(majority ~ problem, data = expenses) > max(rstandard(lm.out)) [1] 2.629733 > AIC(lm.out) [1] 12674.85 > lm.out$coefficients (Intercept) problem 7092.712 1341.308> ?lm

26 > summary(lm.out) Call: lm(formula = majority ~ problem, data = expenses) Residuals: Min 1Q Median 3Q Max -8397.0 -3403.5 -260.9 3118.8 11543.3 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7092.7 218.9 32.397 < 2e-16 *** problem 1341.3 357.0 3.758 0.000187 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4395 on 644 degrees of freedom Multiple R-squared: 0.02145,Adjusted R-squared: 0.01993 F-statistic: 14.12 on 1 and 644 DF, p-value: 0.0001871

27 > glm(problem~I(majority/10000), family='binomial', data = expenses) Call: glm(formula = problem ~ I(majority/10000), family = "binomial", data = expenses) Coefficients: (Intercept) I(majority/10000) -1.0394 0.6878 Degrees of Freedom: 645 Total (i.e. Null); 644 Residual Null Deviance: 855.5 Residual Deviance: 841.6 AIC: 845.6

28 > lrm.out <- glm(problem ~ I(majority/10000), family='binomial', data = expenses) > plot(election$majority, lrm.out$fitted, col='dark green')

29 Exercise 3 (Linear models) - using a formula in a call to a model function -linear models -plotting a regression line -generalized linear models (optional) -plotting confidence bands (optional)

30 6. ANOVA > factor1 factor 1 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Levels: 1 2 3 4 > library(foreign) > diag.data diag.fit <- lm(descript ~ group, data=diag.data)

31 Regression model (dummy coding) > diag.fit Call: lm(formula = descript ~ group, data = diag.data) Coefficients: (Intercept) groupPicture groupFull Diagram groupSegmented 17.8 0.5 4.6 9.5

32 aov() > aov(diag.fit) Call: aov(formula = diag.fit) Terms: group Residuals Sum of Squares 583.7 2440.2 Deg. of Freedom 3 36

33 anova() > summary(aov(diag.fit)) > anova(diag.fit) Analysis of Variance Table Response: descript Df Sum Sq Mean Sq F value Pr(>F) group 3 583.7 194.567 2.8704 0.04977 * Residuals 36 2440.2 67.783 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

34 Factorial ANOVA and ANCOVA > lm(DV ~ factor1 + factor2 + factor1:factor2, data=data2) > lm(DV ~ factor1 * factor2, data=data2) > lm(descript ~ group + time, diag.data) > lm(descript ~ 0 + group + scale(time, scale=F), diag.data) > diag.ancov <- lm(descript ~ group + scale(time, scale=F), diag.data)

35 Type I and Type II Sums of squares > drop1(diag.ancov, test = 'F') Single term deletions Model: descript ~ group + scale(time, scale = F) Df Sum of Sq RSS AIC F value Pr(F) 2154.4 169.46 group 3 821.45 2975.9 176.38 4.4484 0.009478 ** scale(time, scale = F) 1 285.78 2440.2 172.44 4.6427 0.038149 * -sequential sums of squares (Type I) tests are given by default -hierarchical sums of squares (Type II) tests are given by the drop1() function [Software such as SAS and SPSS defaults to unique (Type III) SS]

36 Repeated measures (within-subjects) factors -a weakness in the base R installation, but easily done using packages such as ez, nlme or lme4 (the latter two able to fit a wider range of repeated measures models) > pride.long <- read.csv("pride_long.csv") > pride.mixed <- aov(accuracy ~ emotion*condition + Error(factor(participant)/emotion), pride.long) > summary(pride.mixed) [Software such as SAS and SPSS defaults to unique (Type III) SS]

37 Multiple comparisons and contrasts -you can run contrasts by changing default contrasts in R -Fisher LSD, Tukey’s HSD and p adjustment (e.g., Hochberg, Holm, Bonferroni, FDR) in R base installation -powerful modified Bonferroni (e.g., Shaffer, Westfall) and general linear contrasts available in multcomp package -flexible contrast and estimable functions in gmodels package

38 Exercise 4 (ANOVA) - factors -aov() and anova() -ANCOVA -repeated measures (within-subjects) -contrasts and multiple comparisons (optional)

39 7. Writing your own functions sd.pool <- function(sd.1, n.1, sd.2, n.2 = n.1){ num <- (n.1-1)*sd.1^2+(n.2-1)*sd.2^2 denom <- n.1+n.2-2 output <- sqrt(num/denom) output } > sd.pool(6.1, 20, 5.3, 25) [1] 5.66743

40 Exercise 5 (Write a function) -pick a simple statistical procedure -write a function to automate it e.g.,pooled standard deviation

41 8. Simulation and bootstrapping -distribution functions > rnorm(100, 10, 1) -R boot package -sample() and replicate()

42 Bootstrap CIs for a median or timmed mean

43 The bootstrap (and other Monte Carlo methods) Bootstrapping involves repeatedly resampling with replacement from a data set e.g., Data set = {0,1} 7 simulated samples: {1,0}, {0,1}, {0,0}, {0,0}, {0,1}, {0,0}, {1,1} Bootstrapping effectively treats the sample distribution as the population distribution > replicate(7, sample(x,2,replace=TRUE))

44 Bootstrapping using R > library(boot) > x1.boot <- boot(samp,function(x,i) median(x[i]),R=10000) > boot.ci(x1.boot) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 10000 bootstrap replicates CALL : boot.ci(boot.out = x1.boot) Intervals : Level Normal Basic 95% (-0.1747, 0.6948 ) (-0.1648, 0.6614 ) Level Percentile BCa 95% (-0.1538, 0.6724 ) (-0.2464, 0.6620 ) Calculations and Intervals on Original Scale Normal 95% CI for mean (using t) [-2.00, 1.33]

45 Exercise 6 (Bootstrapping) -resample data -construct a simple percentile bootstrap -run a BC a bootstrap using the boot package (optional)

47 9. Advanced statistical modelling in R Multilevel modeling nlme package lme4 package MCMCglmm package Meta-analysis meta package metafor package … and many, many more other packages

48 metafor package > m.e <- c(10.7, 16.24, 10.03, 3.65, 5.73) > n.e <- c(31, 57, 9, 17, 10) > m.c <- c(2.87, 6.83, 7.18, 4.65, 7.47) > n.c <- c(14, 52, 9, 18, 10) > sd.pooled <- c(7.88, 9.84, 8.67, 6.34, 5.74) > diff <- m.c - m.e > se.diffs <- sd.pooled * sqrt(1/n.e + 1/n.c)

49 > install.packages('metafor') > library(metafor) > meta.out <- rma(yi=diff, sei=se.diffs, method = 'FE') > meta.out Fixed-Effects Model (k = 5) Test for Heterogeneity: Q(df = 4) = 21.0062, p-val = 0.0003 Model Results: estimate se zval pval ci.lb ci.ub -4.1003 1.0750 -3.8141 0.0001 -6.2073 -1.9933 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

50 > funnel(trimfill(meta.out))

51 > forest(rma(yi=diff, sei=se.diffs))

R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University

Similar presentations

Presentation on theme: "R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University

Similar presentations

Presentation on theme: "R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University"— Presentation transcript:

Similar presentations

About project

Feedback