1 Introduction to R Workshop June 23-25, 2010 Southwest Fisheries Science Center 3333 North Torrey Pines Court La Jolla, CA 92037 Eric Archer eric.archer@noaa.gov 858-546-7121

2 Introduction to R 1) How R thinks Environment Data Structures Data Input/Output 2) Becoming a codeR Data Selection and Manipulation Data Summary Functions 3) Visualization and analysis Data Processing (‘apply’ family) Plotting & Graphics Statistical Distributions Statistical Tests Model Fitting Packages, Path, Options

3 “Programming ought to be regarded as an integral part of effective and responsible data analysis” - Venables and Ripley. 1999. S Programming S, S-Plus, R Why R? Free Open source Many packages Large support base Multi-platform Vectorization S Chambers, Becker, Wilks 1984: Bell Labs S-Plus 1988: Statistical Sciences 1993: MathSoft 2001: Insightful 2008: TIBCO R Ihaka & Gentleman 1996 (The R Project)

4 Workspace
Entering commands
commands and assignments are executed or evaluated immediately
separated by a new line (Enter/Return) or a semicolon
recall commands with ↑ or ↓
case sensitive
everything is some sort of function that does something
Getting help
> help(mean)
> ?median
> help("[")
> example(mean)
> help.search("regression")
> RSiteSearch("genetics")
http://www.r-project.org/

5 Workspace
ls() list objects in workspace
rm(…) remove objects from workspace
rm(list = ls()) remove all objects from workspace
save.image() saves workspace (to ".RData" by default)
load(".RData") loads saved workspace
history() view command history
loadhistory() load command history
savehistory() save command history
# comments

6 <- assign c(…) combine arguments into a vector seq(x) generate sequence from 1 to x seq(from,to,by) generate sequence with increment by from:to generate sequence from.. to rep(x,times) replicate x letters,LETTERS vector of 26 lower and upper case letters Assignment and data creation > x <- 1 > y <- "A" > my.vec <- c(1, 5, 6, 10) > my.nums <- 12:24 > x [1] 1 > y [1] "A" > my.vec [1] 1 5 6 10 > my.nums [1] 12 13 14 15 16 17 18 19 20 21 22 23 24

7 Data Structures Object modes (atomic structures) integer whole numbers (15, 23, 8, 42, 4, 16) numeric real numbers (double precision: 3.14, 0.0002, 6.022E23) character text string (“Hello World”, “ROFLMAO”, “A”) logical TRUE/FALSE or T/F Object classes vector object with atomic mode factor vector object with discrete groups (ordered/unordered) array multiple dimensions matrix 2-dimensional array list vector of components data.frame "matrix –like" list of variables of same # of rows Special Values NULL object of zero length, test with is.null(x) NA Not Available / missing value, test with is.na(x) NaN Not a number, test with is.nan(x) (e.g. 0/0, log(-1) ) Inf, -Inf Positive/negative infinity, test with is.infinite(x) (e.g. 1/0 )

8 Creation and info vector(mode,length) create vector length(x) number of elements names(x) get or set names Indexing (number, character (name), or logical) x[n] nth element x[-n] all but the nth element x[a:b] elements a to b x[-(a:b)] all but elements a to b x[c(…)] specific elements x[“name”] “name” element x[x > a] all elements greater than a x[x %in% c(…)] all elements in the set Vectors

9 Vectors
Create a vector
> x <- 1:10
Give the elements some names (only five names are given, so the remaining elements are named NA)
> names(x) <- c("first","second","third","fourth","fifth")
Select elements based on another vector
> i <- c(1,5)
> x[i]
first fifth
    1     5
> x[-c(i,8)]
second  third fourth   <NA>   <NA>   <NA>   <NA>
     2      3      4      6      7      9     10

10 Vectors
logical testing
== equals
>, < greater, less than
>=, <= greater, less than or equal to
! not
&, && and (single is element-by-element, double compares a single value)
|, || or
Select elements based on a condition
> x <- 1:10
> x < 5
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x[x < 5]
[1] 1 2 3 4
& vs &&
> x < 5 & x > 2
[1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x < 5 && x > 2
[1] FALSE
(Here && looks only at the first element of each side. Since R 4.3.0, && and || signal an error when either side has length greater than one.)
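A minimal sketch of how the element-wise operators combine with any() and all() to produce the single TRUE/FALSE that if() expects. The vector x is the one from the slide; the names in.range, some.in.range, all.in.range, and msg are made up for illustration.

```r
x <- 1:10

# Element-wise test: one logical value per element of x
in.range <- x > 2 & x < 5

# Collapse to a single TRUE/FALSE before using the result in if()
some.in.range <- any(in.range)
all.in.range <- all(in.range)

# && is safe here because both sides have length one
if (some.in.range && !all.in.range) {
  msg <- "some, but not all, elements are between 2 and 5"
}
```

Collapsing with any() or all() first keeps the scalar operators && and || away from vectors of length greater than one.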

11 Operator recycles smaller object enough times to cover larger object > x <- 4 > y <- c(5, 6, 7, 8, 9, 10) > z <- x + y > z [1] 9 10 11 12 13 14 > x <- c(3, 5) > z <- x + y > z [1] 8 11 10 13 12 15 > i <- 1:10 > j <- c(T, T, F) > i[j] [1] 1 2 4 5 7 8 10 Vectorization
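The recycling rule above can be sketched as follows, including the case where the shorter length does not divide the longer one. R still recycles in that case but issues a warning. The handler below is only there to capture the warning text for illustration; z.ok and z.warn are illustrative names.

```r
y <- c(5, 6, 7, 8, 9, 10)

# Length 2 divides length 6: recycled silently
z.ok <- c(3, 5) + y

# Length 4 does not divide length 6: still recycled, but R warns
z.warn <- withCallingHandlers(
  c(1, 2, 3, 4) + y,
  warning = function(w) {
    message("R warned: ", conditionMessage(w))
    invokeRestart("muffleWarning")  # continue after reporting the warning
  }
)
```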

12 Object Information
summary(x) generic summary of object
str(x) display object structure
mode(x) get or set storage mode
class(x) name of object class
is.<type>(x) test type of object (is.numeric, is.logical, etc.)
attr(x, which) get or set the attribute of an object
attributes(x) get or set all attributes of an object

13 Object Information
> y <- 1:10
> str(y)
 int [1:10] 1 2 3 4 5 6 7 8 9 10
> mode(y)
[1] "numeric"
> class(y)
[1] "integer"
> is.character(y)
[1] FALSE
> is.integer(y)
[1] TRUE
> is.double(y)
[1] FALSE
> is.numeric(y)
[1] TRUE

14 Object Information > x <- 1:4 > names(x) <- c("first","second","third","four") > x first second third four 1 2 3 4 > str(x) Named int [1:4] 1 2 3 4 - attr(*, "names")= chr [1:4] "first" "second" "third" "four" > attributes(x) $names [1] "first" "second" "third" "four" > attr(x, "notes") <- "This is a really important vector." > attributes(x) $names [1] "first" "second" "third" "four" $notes [1] "This is a really important vector." > attr(x, "date") <- 20090624 > attributes(x) $names [1] "first" "second" "third" "four" $notes [1] "This is a really important vector." $date [1] 20090624 > x first second third four 1 2 3 4 attr(,"notes") [1] "This is a really important vector." attr(,"date") [1] 20090624

15 coercion
as.<type>(x) coerces object x to <type> if possible
> x <- 1:10
> x.char <- as.character(x)
> as.numeric(x.char)
 [1] 1 2 3 4 5 6 7 8 9 10
> y <- letters[1:10]
> as.numeric(y)
 [1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
> z <- "1char"
> as.numeric(z)
[1] NA
Warning message:
NAs introduced by coercion
> logic.chars <- c("TRUE", "FALSE", "T", "F", "t", "f", "0", "1")
> as.logical(logic.chars)
[1] TRUE FALSE TRUE FALSE NA NA NA NA
> logic.nums <- c(-2, -1, 0, 1.5, 2, 100)
> as.logical(logic.nums)
[1] TRUE TRUE FALSE TRUE TRUE TRUE

16 Factors Discrete ordered or unordered data Internally represented numerically factor(x, levels, labels, exclude, ordered) levels(x) labels(x) is.factor(x),is.ordered(x)

17 Factors
> x <- c("b", "a", "a", "c", "B", "d", "a", "d")
> x.fac <- factor(x)
> x.fac
[1] b a a c B d a d
Levels: a b B c d
> str(x.fac)
 Factor w/ 5 levels "a","b","B","c",..: 2 1 1 4 3 5 1 5
> levels(x.fac)
[1] "a" "b" "B" "c" "d"
> labels(x.fac)
[1] "1" "2" "3" "4" "5" "6" "7" "8"
> as.numeric(x.fac)
[1] 2 1 1 4 3 5 1 5
> as.character(x.fac)
[1] "b" "a" "a" "c" "B" "d" "a" "d"

18 Factors
> x.fac.lvl <- factor(x, levels = c("a", "c"))
> x.fac.lvl
[1] <NA> a    a    c    <NA> <NA> a    <NA>
Levels: a c
> x.fac.exc <- factor(x, exclude = c("a", "c"))
> x.fac.exc
[1] b    <NA> <NA> <NA> B    d    <NA> d
Levels: b B d
> x.fac.lbl <- factor(x, labels = c("L1", "L2", "L3", "L4", "L5"))
> x.fac.lbl
[1] L2 L1 L1 L4 L3 L5 L1 L5
Levels: L1 L2 L3 L4 L5
> x.fac[2] < x.fac[1]
[1] NA
Warning message:
In Ops.factor(x.fac[2], x.fac[1]) : < not meaningful for factors
> x.ord <- factor(x, ordered = TRUE)
> x.ord
[1] b a a c B d a d
Levels: a < b < B < c < d
> x.ord[2] < x.ord[1]
[1] TRUE
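Because factors are stored as integer codes, as.numeric() on a factor returns the codes rather than the values printed on screen. The sketch below shows the usual workaround of going through as.character() first. The factor sizes and the names codes and values are invented for illustration.

```r
# A factor whose labels happen to look like numbers
sizes <- factor(c("10", "5", "20", "5"))

# as.numeric() on a factor returns the internal level codes ...
codes <- as.numeric(sizes)

# ... so convert to character first to recover the original numbers
values <- as.numeric(as.character(sizes))
```

The codes are positions in levels(sizes), so they depend on how the levels sort and not on the numeric value of each label.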

19 Arrays and Matrices
array(data, dim, dimnames) create array (filled column-wise: the first index varies fastest)
matrix(data, nrow, ncol, dimnames) create matrix
x[row, col] element at row, col
x[row, ] x[, col] vector of row and col
x["name", ] vector of row "name" etc.
dim(x) retrieve or set dimensions
nrow(x) number of rows
ncol(x) number of columns
dimnames(x) retrieve or set dimension names
rownames(x) retrieve or set row names
colnames(x) retrieve or set column names
cbind(…) create array from columns
rbind(…) create array from rows
t(x) transpose (matrices)

20 Create an array > x <- array(1:10, dim = c(4, 6)) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 5 9 3 7 1 [2,] 2 6 10 4 8 2 [3,] 3 7 1 5 9 3 [4,] 4 8 2 6 10 4 > str(x) int [1:4, 1:6] 1 2 3 4 5 6 7 8 9 10... > attributes(x) $dim [1] 4 6 > dim(x) [1] 4 6 > dimnames(x) NULL Arrays and Matrices
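The output above shows that array() fills column by column. As a small sketch, matrix() behaves the same way by default and fills row by row only when byrow = TRUE is given. The names by.col and by.row are illustrative.

```r
# Default fill: down the first column, then the second, and so on
by.col <- matrix(1:6, nrow = 2)

# byrow = TRUE fills across each row instead
by.row <- matrix(1:6, nrow = 2, byrow = TRUE)

by.col
by.row
```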

21 Set column or row names > colnames(x) <- c("col1", "col2", "col3", "col4", "5", "6") > x col1 col2 col3 col4 5 6 [1,] 1 5 9 3 7 1 [2,] 2 6 10 4 8 2 [3,] 3 7 1 5 9 3 [4,] 4 8 2 6 10 4 > colnames(x) <- c("column1", "column2") Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent > colnames(x)[1] <- "column1" > x column1 col2 col3 col4 5 6 [1,] 1 5 9 3 7 1 [2,] 2 6 10 4 8 2 [3,] 3 7 1 5 9 3 [4,] 4 8 2 6 10 4 Arrays and Matrices

22 Set row and columns names using dimnames > dimnames(x) <- list(c("first", "second", "third", "4"), NULL) > x [,1] [,2] [,3] [,4] [,5] [,6] first 1 5 9 3 7 1 second 2 6 10 4 8 2 third 3 7 1 5 9 3 4 4 8 2 6 10 4 Arrays and Matrices Setting dimension names > dimnames(x) <- list(my.rows = c("first", "second", "third", "4"), my.cols = NULL) > x my.cols my.rows [,1] [,2] [,3] [,4] [,5] [,6] first 1 5 9 3 7 1 second 2 6 10 4 8 2 third 3 7 1 5 9 3 4 4 8 2 6 10 4

23 Arrays
Change dimensionality of array
> dim(x) <- c(6, 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    7    3    9
[2,]    2    8    4   10
[3,]    3    9    5    1
[4,]    4   10    6    2
[5,]    5    1    7    3
[6,]    6    2    8    4
> dim(x) <- c(3, 4, 2)
> x
, , 1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8    1
[3,]    3    6    9    2
, , 2
     [,1] [,2] [,3] [,4]
[1,]    3    6    9    2
[2,]    4    7   10    3
[3,]    5    8    1    4

24 Bind several vectors into an array > i1 <- seq(from = 1, to = 20, length = 10) > i2 <- seq(from = 3.4, to = 25, length = 10) > i3 <- seq(from = 15, to = 25, length = 10) > i <- cbind(i1, i2, i3) > i i1 i2 i3 [1,] 1.000000 3.4 15.00000 [2,] 3.111111 5.8 16.11111 [3,] 5.222222 8.2 17.22222 [4,] 7.333333 10.6 18.33333 [5,] 9.444444 13.0 19.44444 [6,] 11.555556 15.4 20.55556 [7,] 13.666667 17.8 21.66667 [8,] 15.777778 20.2 22.77778 [9,] 17.888889 22.6 23.88889 [10,] 20.000000 25.0 25.00000 Arrays and Matrices

25 > j <- rbind(i1, i2, i3) > j [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] i1 1.0 3.111111 5.222222 7.333333 9.444444 11.55556 13.66667 15.77778 17.88889 i2 3.4 5.800000 8.200000 10.600000 13.000000 15.40000 17.80000 20.20000 22.60000 i3 15.0 16.111111 17.222222 18.333333 19.444444 20.55556 21.66667 22.77778 23.88889 [,10] i1 20 i2 25 i3 25 > i <- cbind(col1 = i1, col2 = i2, col3 = i3) Arrays and Matrices

26 Special vector Collection of elements of different modes Often used as return type of functions list(…), vector(“list”, length) create list x[i] list of element i x[[i]] element i x[“name”] list of element name x[[“name”]] or x$name element name unlist transform list to a vector Lists

27 Lists > x <- list(1:10, c("a", "b"), c(TRUE, TRUE, FALSE, TRUE), 5) > x [[1]] [1] 1 2 3 4 5 6 7 8 9 10 [[2]] [1] "a" "b" [[3]] [1] TRUE TRUE FALSE TRUE [[4]] [1] 5 > is.list(x) [1] TRUE > is.vector(x) [1] TRUE > is.numeric(x) [1] FALSE

28 Lists
What are the elements in a list?
> x[1]
[[1]]
 [1] 1 2 3 4 5 6 7 8 9 10
> str(x[1])
List of 1
 $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
> mode(x[1])
[1] "list"
> x[[1]]
 [1] 1 2 3 4 5 6 7 8 9 10
> str(x[[1]])
 int [1:10] 1 2 3 4 5 6 7 8 9 10
> mode(x[[1]])
[1] "numeric"

29 Lists
> y <- list(numbers = c(5, 10:25), initials = c("rnm", "fds"))
> y
$numbers
 [1] 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
$initials
[1] "rnm" "fds"
> y$initials
[1] "rnm" "fds"
> y["numbers"]
$numbers
 [1] 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> y$new.element <- "This is new"
> y
$numbers
 [1] 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
$initials
[1] "rnm" "fds"
$new.element
[1] "This is new"
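The difference between single and double brackets matters as soon as a list element is passed to a function. A short sketch, using an invented two-element list, where tryCatch() is only there to capture the error message:

```r
x <- list(1:10, c("a", "b"))

sub.list <- x[1]    # single bracket: a list of length 1
element <- x[[1]]   # double bracket: the integer vector itself

total <- sum(x[[1]])  # works on the vector

# sum() cannot add up a list, so the single-bracket form fails
err <- tryCatch(sum(x[1]), error = function(e) conditionMessage(e))
```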

30 Like matrices, but columns of different modes Organized list where components are columns of equal length rows x[[“name”]] or x$name column name x[row, column], etc. > age <- c(1:5) > color <- c("neonate", "two-tone", "speckled", "mottled", "adult") > juvenile <- c(TRUE, TRUE, FALSE, FALSE, FALSE) > spotted <- data.frame(age, color, juvenile) > spotted age color juvenile 1 1 neonate TRUE 2 2 two-tone TRUE 3 3 speckled FALSE 4 4 mottled FALSE 5 5 adult FALSE Data Frames

31 > is.matrix(spotted) [1] FALSE > is.array(spotted) [1] FALSE > is.list(spotted) [1] TRUE > is.data.frame(spotted) [1] TRUE > spotted$age [1] 1 2 3 4 5 > spotted$age[2] [1] 2 > spotted$color[2] [1] two-tone Levels: adult mottled neonate speckled two-tone > spotted[spotted$age < 3, ] age color juvenile 1 1 neonate TRUE 2 2 two-tone TRUE Data Frames

32 Forcing character columns > str(spotted) 'data.frame': 5 obs. of 3 variables: $ age : int 1 2 3 4 5 $ color : Factor w/ 5 levels "adult","mottled",..: 3 5 4 2 1 $ juvenile: logi TRUE TRUE FALSE FALSE FALSE > spotted2 <- data.frame(age.class = age, + color.pattern = color, juvenile.stat = juvenile, + stringsAsFactors = FALSE) > spotted2 age.class color.pattern juvenile.stat 1 1 neonate TRUE 2 2 two-tone TRUE 3 3 speckled FALSE 4 4 mottled FALSE 5 5 adult FALSE > str(spotted2) 'data.frame': 5 obs. of 3 variables: $ age.class : int 1 2 3 4 5 $ color.pattern: chr "neonate" "two-tone" "speckled" "mottled"... $ juvenile.stat: logi TRUE TRUE FALSE FALSE FALSE Data Frames
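The factor column shown for spotted reflects the default at the time of the workshop. Since R 4.0.0, data.frame() and read.table() default to stringsAsFactors = FALSE, so setting the argument explicitly is the portable choice. The sketch below reuses age and color from the slides; spotted.chr, spotted.fac, and color.fac are illustrative names.

```r
age <- 1:5
color <- c("neonate", "two-tone", "speckled", "mottled", "adult")

# State the behaviour explicitly so the code does not depend on the R version
spotted.chr <- data.frame(age, color, stringsAsFactors = FALSE)
spotted.fac <- data.frame(age, color, stringsAsFactors = TRUE)

# Convert a character column afterwards, keeping the developmental order
# of the colour patterns instead of the default alphabetical levels
spotted.chr$color.fac <- factor(spotted.chr$color, levels = color)

str(spotted.chr)
```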

33 Data Frames Deleting columns > spotted$age <- NULL > spotted color juvenile 1 neonate TRUE 2 two-tone TRUE 3 speckled FALSE 4 mottled FALSE 5 adult FALSE Creating new columns > spotted$freq <- c(0.3, 0.2, 0.2, 0.15, 0.15) > spotted$have.data <- TRUE > spotted color juvenile freq have.data 1 neonate TRUE 0.30 TRUE 2 two-tone TRUE 0.20 TRUE 3 speckled FALSE 0.20 TRUE 4 mottled FALSE 0.15 TRUE 5 adult FALSE 0.15 TRUE

34 Data Frames subset(x, subset, select) > subset(spotted, age >=3) age color juvenile 3 3 speckled FALSE 4 4 mottled FALSE 5 5 adult FALSE > subset(spotted, juvenile == FALSE & age <= 4) age color juvenile 3 3 speckled FALSE 4 4 mottled FALSE > subset(spotted, age <=2, select = c("color", "juvenile")) color juvenile 1 neonate TRUE 2 two-tone TRUE

35 Data Input/Output Directory management dir() list files in directory setwd(path) set working directory getwd() get working directory ?files File and Directory Manipulation Standard ASCII Format read.table creates a data frame from text file read.csv read comma-delimited file read.delim read tab-delimited file read.fwf read fixed width format write.table write data to text file write.csv write comma-delimited file R Binary Format save writes binary R objects save.image writes current environment in binary R load reload files written with save R Text Format dump creates text representation of R objects source accept input from text file (scripts)

36 Reading ASCII > sets <- read.csv("Sets_All.csv", header = TRUE) > sets$Ordered.Year <- ordered(sets$Year) > sets$SpotCd.Fac <- factor(sets$SpotCd, exclude = NULL) > spotted.sets <- sets[sets$Sp1Cd == 2, ] > write.table(spotted.sets, file = "spotted.txt", + row.names = FALSE) Reading R binary > save(spotted.sets, file = "spotted.RData") > rm(list = ls()) > load("spotted.RData") Reading R commands > positions <- spotted.sets[, c("Latitude", "Longitude")] > dump("positions", file = "set_positions.R") > rm(list = ls()) > source("set_positions.R") Data Input/Output

37 Writing Scripts
Text files containing commands and comments, written as if executed on the command line (usually end with .r)
From R GUI : File|New script
Any text editor (Notepad, Tinn-R, VEDIT, etc.)
Commands executed with:
source("filename.r")
Copy/paste
From R Editor : Edit|Run...

38 Exercise 1A : Assemble data frame
1. Assemble a data frame from “Homework 1” files with only these columns (use these names, in this order): boat (character), skipper (character), lat, lon, year, month, day, mammals, turtles, fish
2. Add a column classifying each trip by season: Winter: Dec – Feb, Spring: Mar – May, Summer: Jun – Aug, Fall: Sep – Nov
3. Add three columns classifying bycatch size for each of: fish : 200 (large) turtles : = 4 (large) mammals: = 2 (large)
4. Add a column indicating that the boat needs to be inspected if any bycatch class is “large”
5. Write your new data frame to a .csv file
End Day 1
Exercise 1B : Make a list
1. Read the .csv file from 1A into a clean R environment
2. Create a list with one element for the entire data set and one element per bycatch type (4 elements total). Each bycatch element should contain a named vector of the number of trips with small, medium, and large bycatches
3. How many trips needed to be inspected?
4. How many trips had no bycatch at all?
5. Save the list and the results from 3 & 4 in an R workspace

39 Data Selection and Manipulation
sample(x, size, replace, prob) take a random sample from x
cut(x, breaks, labels) divide vector into intervals
%in% return logical vector of matches
which(x) return indices of TRUE elements
all(…), any(…) return TRUE if all or any arguments are TRUE
unique(x) return unique observations in vector
duplicated(x) return logical vector flagging repeated observations
sort sort vector or factor
order return the indices that sort a vector (further arguments break ties)
merge() merge two data frames by common cols or rows
ceiling, floor, trunc, round, signif rounding functions

40 sample > x <- 1:5 Sample x (jumble or permute) > sample(x) [1] 2 1 4 5 3 Sample from x > sample(x, 3) [1] 2 4 3 Sample with replacement > sample(x, 10, replace = TRUE) [1] 2 3 5 3 3 4 2 1 4 4 Sample with modified probabilities > cars <- c("Ford", "GM", "Toyota", "VW", "Subaru", "Honda") > male.wts <- c(6, 5, 3, 1, 3, 3) > female.wts <- c(3, 3, 4, 8, 3, 6) > > male.survey <- sample(cars, 100, replace = TRUE, prob = male.wts) > female.survey <- sample(cars, 100, replace = TRUE, prob = female.wts)

41 cut
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3, ordered_result = FALSE, ...)
> y <- c(4, 5, 6, 10, 11, 30, 49, 50, 51)
Bins : 5 < y <= 10, 10 < y <= 30, 30 < y <= 50
> y.cut <- cut(y, breaks = c(5, 10, 30, 50))
> y.cut
[1] <NA>    <NA>    (5,10]  (5,10]  (10,30] (10,30] (30,50] (30,50] <NA>
Levels: (5,10] (10,30] (30,50]
> str(y.cut)
 Factor w/ 3 levels "(5,10]","(10,30]",..: NA NA 1 1 2 2 3 3 NA
Bins : 5 <= y <= 10, 10 < y <= 30, 30 < y <= 50
> cut(y, breaks = c(5, 10, 30, 50), include.lowest = TRUE)
[1] <NA>    [5,10]  [5,10]  [5,10]  (10,30] (10,30] (30,50] (30,50] <NA>
Levels: [5,10] (10,30] (30,50]
Bins : 5 <= y < 10, 10 <= y < 30, 30 <= y < 50
> cut(y, breaks = c(5, 10, 30, 50), right = FALSE)
[1] <NA>    [5,10)  [5,10)  [10,30) [10,30) [30,50) [30,50) <NA>    <NA>
Levels: [5,10) [10,30) [30,50)
Bins : 5 <= y < 10, 10 <= y < 30, 30 <= y <= 50
> cut(y, breaks = c(5, 10, 30, 50), include.lowest = TRUE, right = FALSE)
[1] <NA>    [5,10)  [5,10)  [10,30) [10,30) [30,50] [30,50] [30,50] <NA>
Levels: [5,10) [10,30) [30,50]
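Values outside the outermost breaks become NA, as in the examples above. One way to avoid that, sketched here with the same y, is to use -Inf and Inf as the outer breaks and supply readable labels. The class names small, medium, and large are chosen for illustration and are not the thresholds from the exercise.

```r
y <- c(4, 5, 6, 10, 11, 30, 49, 50, 51)

# Open-ended outer bins: y <= 10, 10 < y <= 30, y > 30
y.class <- cut(y, breaks = c(-Inf, 10, 30, Inf),
               labels = c("small", "medium", "large"))

# Count how many values fall in each class
class.counts <- table(y.class)
class.counts
```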

42 %in%, which > x <- sample(1:10, 20, replace = TRUE) > x [1] 4 10 2 3 4 3 6 4 7 3 9 1 3 4 7 1 3 2 8 [20] 5 > x %in% c(3, 10, 2, 1) [1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE [10] TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE [19] FALSE FALSE > x[x %in% c(3, 10, 2, 1)] [1] 10 2 3 3 3 1 3 1 3 2 > which(x %in% c(3, 10, 2, 1)) [1] 2 3 4 6 10 12 13 16 17 18 > which(x < 5) [1] 1 3 4 5 6 8 10 12 13 14 16 17 18 > x[which(x > 6)] [1] 10 7 9 7 8

43 any, all > x <- sample(1:10, 20, replace = TRUE) > x [1] 2 7 8 1 1 7 5 8 6 7 3 7 2 1 5 10 3 9 1 2 > any(x == 6) [1] TRUE > all(x < 5) [1] FALSE

44 unique, duplicated > x <- sample(1:10, 20, replace = TRUE) > x [1] 6 5 1 8 9 6 2 3 8 9 8 10 10 2 9 3 4 3 4 [20] 10 > unique(x) [1] 6 5 1 8 9 2 3 10 4 > duplicated(x) [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE [10] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE [19] TRUE TRUE

45 sort, order
> x <- sample(1:10, 20, replace = TRUE)
> x
 [1] 3 6 7 1 5 3 10 3 7 2 3 9 1 8 4 3 8 2
[19] 4 1
> sort(x)
 [1] 1 1 1 2 2 3 3 3 3 3 4 4 5 6 7 7 8 8
[19] 9 10
> sort(x, decreasing = TRUE)
 [1] 10 9 8 8 7 7 6 5 4 4 3 3 3 3 3 2 2 1
[19] 1 1
> order(x)
 [1] 4 13 20 10 18 1 6 8 11 16 15 19 5 2 3 9 14 17
[19] 12 7
> trips <- read.csv("homework 1a df.csv")
> month.sort <- trips[order(trips$month), ]
> month.days.sort <- trips[order(trips$month, trips$day), ]

46 merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"),...) > rm(list = ls()) > load("merge data.rdata") > str(cranial) 'data.frame': 20 obs. of 2 variables: $ id : Factor w/ 20 levels "Specimen-1","Specimen-12",..: 14 11 13 7 20 18 3 10 5 17... $ skull: num 260 266 259 273 262... > str(haps) 'data.frame': 20 obs. of 2 variables: $ id : Factor w/ 20 levels "Specimen-1","Specimen-10",..: 16 12 15 18 8 7 3 13 6 9... $ haps: Factor w/ 5 levels "A","B","C","D",..: 1 4 4 5 5 3 1 3 3 4... > merge(haps, cranial) id haps skull 1 Specimen-1 A 255.4461 2 Specimen-12 A 262.5730 3 Specimen-16 E 256.2258 4 Specimen-22 E 259.2000... merge

47 merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"),...) > str(sex) 'data.frame': 40 obs. of 2 variables: $ specimens: Factor w/ 40 levels "Specimen-1","Specimen- 10",..: 1 12 23 34 36 37 38 39 40 2... $ sex : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 2 2 2 1... > str(trials) 'data.frame': 30 obs. of 2 variables: $ id : Factor w/ 23 levels "Specimen-1","Specimen-18",..: 5 6 1 9 3 7 8 2 10 4... $ value: num 30.1 23.1 24.3 22.6 36.7... > merge(sex, trials, by.x = "specimens", by.y = "id") specimens sex value 1 Specimen-1 F 24.28745 2 Specimen-11 F 23.90455 3 Specimen-12 M 27.41010 4 Specimen-14 M 36.84547 5 Specimen-15 M 20.08898 merge
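With the default all = FALSE, merge() keeps only rows whose key appears in both data frames. The all, all.x, and all.y arguments in the signature above control what happens to unmatched rows. A minimal sketch with two invented three-row data frames (the workshop's "merge data.rdata" file is not used here):

```r
# Small made-up tables sharing an 'id' column
haps <- data.frame(id = c("S1", "S2", "S3"), hap = c("A", "B", "A"))
cranial <- data.frame(id = c("S2", "S3", "S4"), skull = c(260.1, 255.4, 262.6))

inner <- merge(haps, cranial)                # ids present in both: S2, S3
outer <- merge(haps, cranial, all = TRUE)    # every id; gaps filled with NA
left <- merge(haps, cranial, all.x = TRUE)   # every row of haps

outer
```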

48 nchar(x) number of characters in string substr(x, start, stop) extract or replace substrings strsplit(x, split) split string paste(..., sep, collapse) concatenate vectors format format object for printing grep, sub, gsub pattern matching and replacement String Manipulation

49 nchar, substr, strsplit
> x <- "This is a sentence."
> nchar(x)
[1] 19
> substr(x, 3, 9)
[1] "is is a"
> substr(x, 1, 4) <- "That"
> x
[1] "That is a sentence."
> strsplit(x, " ")
[[1]]
[1] "That" "is" "a" "sentence."
> strsplit(x, "a")
[[1]]
[1] "Th" "t is " " sentence."

50 paste > sites <- LETTERS[1:6] > paste("Site", sites) [1] "Site A" "Site B" "Site C" "Site D" "Site E" "Site F" > paste("Site", sites, sep = "-") [1] "Site-A" "Site-B" "Site-C" "Site-D" "Site-E" "Site-F" > paste("Site", sites, sep = "_", collapse = ",") [1] "Site_A,Site_B,Site_C,Site_D,Site_E,Site_F"

51 summary summarizes object – different for each class table create contingency table sum(x), prod(x) sum and product of vector cumsum(x) vector of cumulative sums rowSums, colSums compute row or column sums rowMeans, colMeans compute row or column means rowsum(x, group) compute column sums for a grouping variable Data Summary

52 table
> trips <- read.csv("homework 1a df.csv")
> table(season = trips$season)
season
  Fall Spring Summer Winter
  2503   2546   2336   2615
> table(season = trips$season, fish.class = trips$fish.class)
        fish.class
season   Large Medium Small
  Fall    1499    897   107
  Spring  1505    960    81
  Summer  1380    865    91
  Winter  1550    959   106
> turtle.class.table <- as.data.frame(table(turtle.class = trips$turtle.class))
> str(turtle.class.table)
'data.frame': 2 obs. of 2 variables:
 $ turtle.class: Factor w/ 2 levels "Large","Small": 1 2
 $ Freq : int 3443 6557
> turtle.class.table
  turtle.class Freq
1        Large 3443
2        Small 6557

53 > x <- matrix(1:18, nrow = 6, ncol = 3) > x [,1] [,2] [,3] [1,] 1 7 13 [2,] 2 8 14 [3,] 3 9 15 [4,] 4 10 16 [5,] 5 11 17 [6,] 6 12 18 > rowSums(x) [1] 21 24 27 30 33 36 > colMeans(x) [1] 3.5 9.5 15.5 > rowsum(x, c(1, 1, 2, 2, 3, 3)) [,1] [,2] [,3] 1 3 15 27 2 7 19 31 3 11 23 35 > rowsum(x, c("a", "a", "a", "b", "b", "b")) [,1] [,2] [,3] a 6 24 42 b 15 33 51 row/col sums/means

54 Data Summary
min, max return minimum or maximum values
range return a vector of minimum and maximum values
which.min, which.max return index of first minimum or maximum value
mean(x) arithmetic mean of vector
sd, var, cov, cor standard deviation, variance, covariance, correlation
median(x) median of vector
quantile(x, probs) give quantiles of vector
> x <- sample(1:100, 50, replace = TRUE)
> mean(x)
[1] 55.82
> median(x)
[1] 51.5
> range(x)
[1] 1 100
> quantile(x, probs = 0.1)
 10%
21.9
> quantile(x, probs = c(0.025, 0.5, 0.975))
  2.5%    50%  97.5%
 6.825 51.500 98.325

55 fun.name <- function(args) { statements x or return(x) } result of last statement is return value arguments(args) passed by value can give default arguments “…” passes unmatched arguments to other functions Functions

56 Functions
F2C <- function(faren) {
  # converts Fahrenheit to Celsius
  cels <- round((faren - 32) * 5/9, 2)
  paste(faren, "deg. Fahrenheit =", cels, "deg. Celsius", sep=" ", collapse="")
}
# fixed default argument
sample.mean <- function(x, sample.size = 10) {
  y <- sample(x, size = sample.size, replace = TRUE)
  mean(y)
}
# default argument computed from another argument
sample.mean <- function(x, sample.size = length(x)) {
  y <- sample(x, size = sample.size, replace = TRUE)
  mean(y)
}
# "..." passes unmatched arguments on to sample()
sample.mean <- function(x, ...) {
  y <- sample(x, ...)
  mean(y, na.rm = TRUE)
}
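A short usage sketch for the last sample.mean definition, showing how arguments that the function does not name itself travel through "..." to sample(). The function is repeated so the example runs on its own; set.seed() and the names m.perm and m.boot are additions for illustration.

```r
sample.mean <- function(x, ...) {
  y <- sample(x, ...)    # size, replace, prob are forwarded to sample()
  mean(y, na.rm = TRUE)
}

set.seed(1)  # only so that repeated runs give the same draw

# No extra arguments: sample() just permutes x, so the mean is unchanged
m.perm <- sample.mean(1:100)

# size and replace are not arguments of sample.mean; "..." passes them on
m.boot <- sample.mean(1:100, size = 20, replace = TRUE)
```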

57 Functions
if(cond) {statements} else {statements} evaluate condition
ifelse(test, yes, no) evaluate test element by element, return yes or no
for(var in seq) {statements} execute one loop for each var in seq
while(cond) {statements} execute loop as long as condition is true
repeat {statements} execute statements repeatedly until break
break exits loop
next moves to next iteration in loop
switch(EXPR,...) select from list of alternatives
print(x) prints object x to screen
stop("...") stop function and print error message
warning("...") generate warning message
stopifnot(cond) stop if cond not TRUE

58 if, ifelse
fishery.status.1 <- function(catch, catch.limit = 20) {
  result <- list(to.close = TRUE, remaining.catch = NA)
  if (catch < catch.limit) {
    result$to.close = FALSE
    result$remaining.catch = catch.limit - catch
  } else {
    result$to.close = TRUE
    result$remaining.catch = 0
  }
  result
}
fishery.status.2 <- function(catch, catch.limit = 20) {
  to.close <- catch >= catch.limit
  remaining.catch <- ifelse(catch < catch.limit, catch.limit - catch, 0)
  list(to.close = to.close, remaining.catch = remaining.catch)
}
> x <- c(TRUE, TRUE, FALSE)
> y <- c(FALSE, TRUE, FALSE)
> z <- c(TRUE, FALSE, FALSE)
> x & y
[1] FALSE TRUE FALSE
> x && y
[1] FALSE
> x && z
[1] TRUE
(&& compares only the first elements here. Since R 4.3.0 it signals an error when either side has length greater than one.)

59 for make.plates <- function(num.plates) { plate.vec <- vector("character", length = num.plates) for(i in 1:num.plates) { first.num <- sample(0:9, 1) chars <- sample(LETTERS, 3, replace = TRUE) chars <- paste(chars, collapse = "") last.nums <- sample(0:9, 3, replace = TRUE) last.nums <- paste(last.nums, collapse = "") plate.vec[i] <- paste(first.num, chars, last.nums, sep = "", collapse = "") } plate.vec } check.plates <- function(plates, reserved) { bad.plates <- vector("character") for(plate in plates) { plate.str <- substr(plate, 2, 4) if (plate.str %in% reserved) bad.plates <- c(bad.plates, plate) } bad.plates }

60 Question: How many trips had “small” bycatches for all categories? More importantly: What is the variance of this measure? bootstrap example trips <- read.csv("homework 1a df.csv") boot.bycatch <- function(trip.df, nrep) { obs.num.small <- num.all.small(trip.df) boot.results <- vector("numeric", nrep) for(i in 1:nrep) { boot.rows <- sample(1:nrow(trip.df), nrow(trip.df), rep = TRUE) boot.df <- trip.df[boot.rows, ] boot.results[i] <- num.all.small(boot.df) } list(observed = obs.num.small, boot.dist = boot.results) } num.all.small <- function(trip.df) { f.small <- trip.df$fish.class == "Small" t.small <- trip.df$turtle.class == "Small" m.small <- trip.df$mammal.class == "Small" sum(f.small & t.small & m.small) }
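To answer the variance question, the bootstrap distribution returned by boot.bycatch can be summarised directly. The sketch below repeats the two functions in slightly condensed form so it runs on its own, and it uses a made-up data frame (fake.trips, with randomly assigned classes) in place of "homework 1a df.csv", so the numbers it produces say nothing about the real data.

```r
num.all.small <- function(trip.df) {
  sum(trip.df$fish.class == "Small" &
      trip.df$turtle.class == "Small" &
      trip.df$mammal.class == "Small")
}

boot.bycatch <- function(trip.df, nrep) {
  boot.results <- vector("numeric", nrep)
  for (i in seq_len(nrep)) {
    # resample row numbers with replacement, same size as the data
    boot.rows <- sample(nrow(trip.df), replace = TRUE)
    boot.results[i] <- num.all.small(trip.df[boot.rows, ])
  }
  list(observed = num.all.small(trip.df), boot.dist = boot.results)
}

# Hypothetical stand-in for the homework data
set.seed(42)
n <- 200
classes <- c("Small", "Large")
fake.trips <- data.frame(
  fish.class = sample(classes, n, replace = TRUE),
  turtle.class = sample(classes, n, replace = TRUE),
  mammal.class = sample(classes, n, replace = TRUE)
)

boot <- boot.bycatch(fake.trips, nrep = 500)
boot.var <- var(boot$boot.dist)                               # bootstrap variance
boot.ci <- quantile(boot$boot.dist, probs = c(0.025, 0.975))  # percentile interval
```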

61 Exercise 2A : Reformat dates 1)Use “Homework 2 sets.csv” 2)Write function to split Date into Year, Month, Day 3)Save function as R object 4)Create numeric Year, Month, Day columns in data frame 5)Create new Date character column that is DD-MM-YY 6)Remove old Date column and save new data frame under new name Exercise 2B : Bootstrap fishery closures 1) Use “Homework 2 catches.txt" 2) Write and save a function that takes catch.data, a catch.limit, and a number of bootstrap replicates. The function should bootstrap the catch over all years and return two objects: 1) a distribution of the number of years with closures, and 2) a distribution of the average catch remaining. 3) Run bootstrap with catch limits of 20 and 50 at 1000 replicates each. Extra: Create a table showing the frequency distribution of the number of closures in the bootstrap result. End Day 2

62 lapply(X, FUN, …) apply function to list or vector sapply(X, FUN, …) simplified version of lapply apply(X, MARGIN, FUN, …) apply function to margins of array tapply(X, INDEX, FUN, …) apply function to ragged array by(data, INDICES, FUN,...) apply function to data frame aggregate(x, by, FUN,...) compute function for subsets of object Data Processing - ‘apply’ family

63 lapply
lapply returns list
> spring.trip <- trips$season == "Spring"
> spring.fish <- trips$fish[spring.trip & trips$fish > 0]
> spring.turtles <- trips$turtles[spring.trip & trips$turtles > 0]
> spring.mammals <- trips$mammals[spring.trip & trips$mammals > 0]
>
> spring <- list(fish = spring.fish, turtles = spring.turtles, mammals = spring.mammals)
>
> lapply(spring, length)
$fish
[1] 2525
$turtles
[1] 1274
$mammals
[1] 2119
> lapply(spring, mean)
$fish
[1] 250.2356
$turtles
[1] 5.49843
$mammals
[1] 3.050024

64 sapply sapply returns vector or matrix > sapply(spring, median) fish turtles mammals 250 5 3 > sapply(spring, function(i) sum(i > 5 & i < 20)) fish turtles mammals 63 623 0 > sapply(spring, function(i) c(n = length(i), mean = mean(i), var = var(i))) fish turtles mammals n 2525.0000 1274.00000 2119.000000 mean 250.2356 5.49843 3.050024 var 20785.6612 8.61783 1.953115

65 apply bycatch.df <- subset(trips,, c("fish", "turtles", "mammals")) Apply across columns > apply(bycatch.df, 2, mean) fish turtles mammals 248.6285 2.7283 2.5160 > apply(bycatch.df, 2, quantile, prob = c(0.025, 0.975)) fish turtles mammals 2.5% 8 0 0 97.5% 489 10 5 Apply across rows > bycatch.sum <- apply(bycatch.df, 1, sum) > range(bycatch.sum) [1] 0 512 > mean(bycatch.sum) [1] 253.8728

66 tapply apply function based on groups > tapply(trips$fish, trips$season, mean) Fall Spring Summer Winter 250.1322 248.1716 250.5051 245.9576 > tapply(trips$fish, list(season = trips$season, class = trips$fish.class), median) class season Large Medium Small Fall 354.0 112 6.0 Spring 354.0 107 5.0 Summer 353.5 111 3.0 Winter 348.0 108 5.5
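tapply() returns a named array, while aggregate() (listed on the 'apply' family slide) returns the same group summaries as a data frame, which is often easier to merge or write to file. A small sketch on an invented five-row data frame (trips.demo), not the workshop data:

```r
# Made-up data: two seasons, five trips
trips.demo <- data.frame(
  season = c("Fall", "Fall", "Spring", "Spring", "Spring"),
  fish = c(10, 20, 5, 15, 40)
)

# Named array: one mean per season
by.tapply <- tapply(trips.demo$fish, trips.demo$season, mean)

# Data frame with one row per season, using the formula interface
by.aggregate <- aggregate(fish ~ season, data = trips.demo, FUN = mean)

by.tapply
by.aggregate
```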

67 1) Rewrite bootstrap from Exercise 2B using apply family 2) Run bootstrap with catch limits of 10, 15, 20, 30, 50, 60. 3) Summarize mean and median of results for each catch limit in one object Exercise 3 : Bootstrap with apply

68 Create a function that simulates growth data according to a Gompertz model, The output should have two columns (age and length). Age should be rounded to two decimal places. Length should be rounded to one decimal place. Try to put in checks and traps for screwy input data. sim.growth.func <- function(age.range, L0, k, g, sd, sample.size) age.range is a two element vector giving min and max ages L0 is length at birth k, g are model rate parameters sd is the standard deviation for the error term sample.size is the number of samples to return Simulated growth data

70 Simulated growth data
# Gompertz growth function
gomp.func <- function(age.vec, LAB, k, g) {
  LAB * exp(k * (1 - exp(-g * age.vec)))
}
# A function to create simulated growth data according
# to a Gompertz equation
sim.growth.func <- function(age.range, LAB, k, g, std.dev, sample.size = 1000) {
  # Check to make sure age.range is a reasonable vector
  if (!is.numeric(age.range) || !is.vector(age.range)) stop("'age.range' is not a numeric vector")
  if (any(age.range < 0)) stop("'age.range' < 0")
  if (age.range[1] >= age.range[2]) stop("'age.range[1]' >= 'age.range[2]'")
  # Generate some random ages between min and max of age.range
  random.ages <- runif(sample.size, age.range[1], age.range[2])
  # Calculate the expected length for those ages from the Gompertz equation
  expected.length <- gomp.func(random.ages, LAB, k, g)
  # Add some error to the lengths and return a data frame with named columns
  length.err <- rnorm(sample.size, 0, std.dev)
  as.data.frame(cbind(age = random.ages, length = expected.length + length.err))
}
growth.df <- sim.growth.func(age.range = c(0, 65), LAB = 10, k = 2, g = 0.25, std.dev = 5)
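A quick check of what the Gompertz parameters mean, using the values passed to sim.growth.func above. At age 0 the curve returns LAB, and as age grows the inner exponential vanishes, so length approaches LAB * exp(k). The function is repeated so the sketch runs on its own; at.birth, asymptote, and near.max are illustrative names.

```r
gomp.func <- function(age.vec, LAB, k, g) {
  LAB * exp(k * (1 - exp(-g * age.vec)))
}

# At age 0, exp(0) = 1, so the exponent is 0 and length equals LAB
at.birth <- gomp.func(0, LAB = 10, k = 2, g = 0.25)

# For large ages exp(-g * age) tends to 0, so length tends to LAB * exp(k)
asymptote <- 10 * exp(2)
near.max <- gomp.func(65, LAB = 10, k = 2, g = 0.25)

c(at.birth = at.birth, near.max = near.max, asymptote = asymptote)
```

With std.dev = 5, the simulated lengths for the oldest animals should therefore scatter around roughly 74.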

71 Plot
plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL, log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ann = par("ann"), axes = TRUE, frame.plot = axes, panel.first = NULL, panel.last = NULL, col = par("col"), bg = NA, pch = par("pch"), cex = 1, lty = par("lty"), lab = par("lab"), lwd = par("lwd"), asp = NA, ...)
> plot(growth.df$age, growth.df$length, xlab = "Age (years)", ylab = "Length (cm)")

72 Hist
hist(x, breaks = "Sturges", freq = NULL, probability = !freq, include.lowest = TRUE, right = TRUE, density = NULL, angle = 45, col = NULL, border = NULL, main = paste("Histogram of", xname), xlim = range(breaks), ylim = NULL, xlab = xname, ylab, axes = TRUE, plot = TRUE, labels = FALSE, nclass = NULL, ...)
> hist(growth.df$age)
> hist(growth.df$age, breaks = c(0:5, seq(6, 12, 2), 15, 20, 40, max(growth.df$age)),
+   col = "black", border = "white")

73 Boxplot
boxplot(formula, data = NULL, ..., subset, na.action = NULL)
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL, log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5), horizontal = FALSE, add = FALSE, at = NULL)
> age.breaks <- hist(growth.df$age)$breaks
> binned.age <- cut(growth.df$age, breaks = age.breaks)
> boxplot(growth.df$length ~ binned.age, xlab = "Age bin", ylab = "Length")

74 Modifying Graphs
abline                   add straight lines to plot
lines                    join points at coordinates with lines
points                   place points on plot
title                    add labels to a plot
text                     write text on a plot
?plot.default            default plot options
par                      set or get graphical parameters
layout(mat, ...)         divide graphical screen into matrix
split.screen(figs, ...)  divide graphical screen into sub-screens
> newborns <- growth.df[growth.df$age <= 3, ]
> adults <- growth.df[growth.df$age > 3, ]
>
> plot(newborns$age, newborns$length, xlim = range(growth.df$age),
+   ylim = range(growth.df$length), xlab = "Age", ylab = "Length",
+   col = "blue", pch = 21)
>
> par(new = TRUE)
>
> plot(adults$age, adults$length, xlim = range(growth.df$age),
+   ylim = range(growth.df$length), xlab = "", ylab = "",
+   col = "red", pch = 21)
>
> abline(v = 3, col = "green")
>
> text(3, 80, "Transition", pos = 4)
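As an alternative to overlaying a second plot() with par(new = TRUE), the second group can be added to the existing axes with points(). The sketch below illustrates this on made-up data; toy.df, young, and old are hypothetical names, and the values are not the workshop's simulated data.

```r
# Toy data standing in for the simulated growth data (values are made up)
set.seed(10)
toy.df <- data.frame(age = runif(200, 0, 65))
toy.df$length <- 10 * exp(2 * (1 - exp(-0.25 * toy.df$age))) + rnorm(200, 0, 5)

young <- toy.df[toy.df$age <= 3, ]
old <- toy.df[toy.df$age > 3, ]

# Draw empty axes that span all of the data, then add each group
plot(toy.df$age, toy.df$length, type = "n", xlab = "Age", ylab = "Length")
points(young$age, young$length, col = "blue", pch = 21)
points(old$age, old$length, col = "red", pch = 21)

# Mark the transition age and label the groups
abline(v = 3, col = "green")
legend("bottomright", legend = c("age <= 3", "age > 3"),
       col = c("blue", "red"), pch = 21)
```

Because the axes are drawn once, there is no need to repeat xlim and ylim or to blank out the second set of axis labels.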

75 Modifying Graphs
> layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
> plot(growth.df$age, growth.df$length, xlab = "Age",
+   ylab = "Length", main = "Simulated growth data")
> age.breaks <- seq(0, max(growth.df$age) + 5, 5)
> binned.age <- cut(growth.df$age, age.breaks)
> hist(growth.df$age, age.breaks, xlab = "Age", main = "")
> boxplot(growth.df$length ~ binned.age,
+   names = age.breaks[-length(age.breaks)], xlab = "Age bin")

76 Curve
curve(expr, from, to, n = 101, add = FALSE, type = "l", ylab = NULL, log = NULL, xlim = NULL, ...)
> curve(sin, -10, 10)
> plot(growth.df$age, growth.df$length, xlab = "Age",
+   ylab = "Length", main = "")
> curve(10 * exp(2 * (1 - exp(-0.25 * x))),
+   add = TRUE, lty = "dashed", lwd = 2, col = "red")

77 Statistical Distributions
d  density
p  distribution function
q  quantile function
r  random number
dunif, dnorm, dgamma, dbeta, dchisq, etc.
> library(help = "stats")
> set.seed(x)   set random number seed
[Figure: three panels showing dnorm, pnorm, and qnorm]
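The four prefixes can be tried on the normal distribution as follows. This is a small illustrative sketch; the seed value and the variable names are arbitrary choices.

```r
# d: density of the standard normal at 0, equal to 1 / sqrt(2 * pi)
dnorm(0)

# p: distribution function, P(X <= 1.96), and q: its inverse
p <- pnorm(1.96)
q <- qnorm(p)          # recovers 1.96

# r: random draws; setting the same seed reproduces the same draws
set.seed(42)
x <- rnorm(5, mean = 0, sd = 1)
set.seed(42)
y <- rnorm(5, mean = 0, sd = 1)
identical(x, y)

# The same prefixes work for other distributions, e.g. the uniform
punif(0.25, min = 0, max = 1)
```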

78 Statistical Tests
binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
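A brief sketch of how these tests might be called. The counts are made-up numbers chosen for illustration, and the paired example uses the sleep data set that ships with R.

```r
# Exact binomial test: 7 successes in 10 trials against p = 0.5
coin.test <- binom.test(7, 10, p = 0.5)
coin.test$p.value

# Chi-squared goodness-of-fit: are three categories equally frequent?
counts <- c(20, 30, 50)
gof.test <- chisq.test(counts)
gof.test$statistic
gof.test$parameter     # degrees of freedom

# Paired t-test on the built-in sleep data (same 10 subjects, two drugs)
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
sleep.test <- t.test(drug1, drug2, paired = TRUE)
sleep.test$p.value
```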

79 t-test
> male.growth <- sim.growth.func(c(0, 65), 10, 2.05, 0.27, 5)
> female.growth <- sim.growth.func(c(0, 65), 10, 1.99, 0.23, 4)
> adult.males <- male.growth[male.growth$age > 18, ]
> adult.females <- female.growth[female.growth$age > 18, ]
> gender.test <- t.test(adult.males[, "length"], adult.females[, "length"])
> gender.test
        Welch Two Sample t-test
data:  adult.males[, "length"] and adult.females[, "length"]
t = 19.3369, df = 1427.025, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.146325 5.082547
sample estimates:
mean of x mean of y
 77.56675  72.95232
> str(gender.test)
List of 9
 $ statistic  : Named num 19.3
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 1427
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 3.56e-74
 $ conf.int   : atomic [1:2] 4.15 5.08
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 77.6 73.0
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "adult.males[, \"length\"] and adult.females[, \"length\"]"
 - attr(*, "class")= chr "htest"
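Because the result of t.test is a list of class "htest", its pieces can be pulled out by name. The sketch below uses two made-up samples (males and females are hypothetical vectors drawn with rnorm, not the simulated growth data).

```r
# Two made-up samples with different means
set.seed(1)
males <- rnorm(100, mean = 78, sd = 5)
females <- rnorm(100, mean = 73, sd = 4)

res <- t.test(males, females)

# Extract individual components of the "htest" list
res$p.value                      # p-value as a plain number
res$conf.int                     # confidence interval for the difference
res$estimate["mean of x"]        # mean of the first sample
unname(res$statistic)            # t statistic without its name

# Components can be used in further calculations
diff.means <- unname(res$estimate[1] - res$estimate[2])
diff.means
```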

80 Model Fitting
Linear Models
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)
Analysis of Variance Model
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
Generalized Linear Models
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = glm.control(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)
Nonlinear Least Squares
nls(formula, data, start, control, algorithm, trace, subset, weights, na.action, model, lower, upper, ...)
Non-Linear Minimization
nlm(f, p, hessian = FALSE, typsize = rep(1, length(p)), fscale = 1, print.level = 0, ndigit = 12, gradtol = 1e-6, stepmax = max(1000 * sqrt(sum((p/typsize)^2)), 1000), steptol = 1e-6, iterlim = 100, check.analyticals = TRUE, ...)
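A short sketch of aov, glm, and nlm calls, using the PlantGrowth and mtcars data sets that ship with R. The objective function sq.dist is a toy function invented for this example.

```r
# One-way analysis of variance: plant weight by treatment group
plant.aov <- aov(weight ~ group, data = PlantGrowth)
summary(plant.aov)

# Logistic regression (binomial GLM): probability of a manual
# transmission (am = 1) as a function of car weight
trans.glm <- glm(am ~ wt, family = binomial, data = mtcars)
coef(trans.glm)

# nlm minimizes a function of a parameter vector; this toy objective
# has its minimum at p = c(1, 2)
sq.dist <- function(p) sum((p - c(1, 2))^2)
fit.nlm <- nlm(sq.dist, p = c(0, 0))
fit.nlm$estimate
```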

81 lm
> sim.growth <- sim.growth.func(c(0, 65), 10, 2, 0.25, 5)
> juv <- as.data.frame(sim.growth[sim.growth[, "age"] < 10, ])
> juv.lm <- lm(length ~ age, juv)
> juv.lm
Call:
lm(formula = length ~ age, data = juv)
Coefficients:
(Intercept)          age
     12.438        5.584
> plot(juv.lm)
Waiting to confirm page change...
> plot(juv)
> abline(coef = juv.lm$coefficients, col = "red", lty = "dashed")

82 Model Fitting
fitted    extract fitted values for models
coef      extract model coefficients
resid     extract model residuals
deviance  extract deviances for models
logLik    calculate log-likelihood for model fit
AIC       calculate AIC for model fit
predict   predictions from model results
anova     calculate analysis of variance tables
> coef(juv.lm)
(Intercept)         age
      12.88        5.28
> logLik(juv.lm)
'log Lik.' -508 (df=3)
> AIC(juv.lm)
[1] 1023
> predict(juv.lm, data.frame(age = c(1, 5, 10)))
   1    2    3
18.2 39.3 65.6
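The extractor functions can also be tried on a reproducible fit. The sketch below uses the cars data set that ships with R instead of the randomly simulated growth data; cars.lm is a name chosen for this example.

```r
# Stopping distance as a linear function of speed
cars.lm <- lm(dist ~ speed, data = cars)

coef(cars.lm)                 # intercept and slope
head(fitted(cars.lm))         # fitted values
head(resid(cars.lm))          # residuals

# Fitted values plus residuals reproduce the observed response
all.equal(unname(fitted(cars.lm) + resid(cars.lm)), cars$dist)

# AIC is -2 * log-likelihood + 2 * (number of estimated parameters);
# here that is intercept, slope, and residual variance
ll <- logLik(cars.lm)
AIC(cars.lm)
-2 * as.numeric(ll) + 2 * attr(ll, "df")

# Predictions for new speeds, and the analysis of variance table
predict(cars.lm, data.frame(speed = c(10, 20)))
anova(cars.lm)
```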

83 nls
> gomp.form <- formula(length ~ LAB * exp(k * (1 - exp(-g * age))))
> growth.nls <- nls(gomp.form, sim.growth, start = c(LAB = 5, k = 5, g = 0.6))
> growth.nls
Nonlinear regression model
  model: length ~ LAB * exp(k * (1 - exp(-g * age)))
   data: sim.growth
   LAB      k      g
10.995  1.905  0.236
 residual sum-of-squares: 24793
Number of iterations to convergence: 6
Achieved convergence tolerance: 9.67e-06
> plot(sim.growth)
> age.vec <- 1:max(sim.growth$age)
> lines(age.vec, predict(growth.nls, list(age = age.vec)), col = "red",
+   lty = "dashed", lwd = 2)

84 Packages, Path, & Options
library()                  list available packages
library(package)           load package
library(help = "package")  list info about package (build, functions, etc.)
require(package)           loads package and returns FALSE if not present
attach(x, pos)             attach database (list, data frame, or file) to search path
detach(x)                  remove database from search path
search()                   list attached packages in search path
options(...)               set and examine global options
?Startup                   control initialization of R session
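A small sketch of how these functions behave in practice. The package name notARealPackage123 is hypothetical and is assumed not to be installed.

```r
# search() lists what is attached; the stats package is attached by default
"package:stats" %in% search()

# require() returns FALSE (with a warning) instead of stopping with an error,
# so it can be used to test whether a package is available
pkg <- "notARealPackage123"   # hypothetical name, assumed not installed
has.pkg <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
if (!has.pkg) message("Package '", pkg, "' is not available")

# options() returns the previous values, which makes it easy to restore them
old.opts <- options(digits = 4)
print(pi)                     # printed with 4 significant digits
options(old.opts)
getOption("digits")
```

Saving the return value of options() and passing it back afterwards leaves the session's settings as they were found.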
