Working with data in R 2 (Fish 552: Lecture 3)


1 Working with data in R 2 Fish 552: Lecture 3

2 Recommended Reading
An Introduction to R (R Development Core Team)
–http://cran.r-project.org/doc/manuals/R-intro.pdf
Chapter 4
Chapter 5: 5.1 – 5.4, 5.8
Chapter 6: 6.1, 6.2
Chapter 7

3 Matrices
Matrices represent another way to store collections of variables. Whereas data frames can store columns of different types (numeric, character, ...), a matrix must be of a single type, or R will coerce the values to a common type.
The matrix() statement
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE)
–data: a vector needs to be given or R will fill the matrix with NA (Not Available)
–nrow, ncol: only one of these arguments needs to be specified. If both are specified then length(data) should equal nrow * ncol
–byrow: R will fill the matrix by column unless byrow=TRUE is specified
> matrix(1:4, ncol=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
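
A short sketch of the two points on this slide, fill order and type coercion. The values are illustrative only and are not taken from the course data.

```r
# Fill order: by column (the default) versus by row
m.col <- matrix(1:6, nrow = 2)                # columns are filled first
m.row <- matrix(1:6, nrow = 2, byrow = TRUE)  # rows are filled first
print(m.col)
print(m.row)

# Mixing numbers and character strings coerces every value to character
m.mixed <- matrix(c(1, "a", 2, "b"), ncol = 2)
print(typeof(m.mixed))
```

Only nrow is given here; R works out ncol from the length of the data.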

4 Matrices
Create a matrix with the co2 data
> co2.mat <- matrix(c(years, co2), ncol=2, nrow=length(years))
> head(co2.mat, n=3)
     [,1]   [,2]
[1,] 1959 316.00
[2,] 1960 316.91
[3,] 1961 317.63
A fast way to create matrices is by binding the columns of two (or more) vectors (or matrices)
> co2.mat <- cbind(years,co2)
> head(co2.mat, n=3)
     years    co2
[1,]  1959 316.00
[2,]  1960 316.91
[3,]  1961 317.63

5 Matrices
rbind() forms a matrix by binding two (or more) rows of vectors (or matrices)
> co2.row.mat <- rbind(years,co2)
t() transposes a matrix
> t(co2.row.mat)
     years    co2
[1,]  1959 316.00
[2,]  1960 316.91
[3,]  1961 317.63
...
The same functions to extract dimensions for data frames work for matrices
> dim(co2.mat)
[1] 46 2
> dim(co2.row.mat)
[1] 2 46
> nrow(co2.mat)
[1] 46
> ncol(co2.mat)
[1] 2
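
The relationship between cbind(), rbind() and t() can be checked on a small scale. This sketch assumes only the first three years shown on the slides; the full series in the lecture has 46 values.

```r
# First three values from the slides; the lecture's vectors are longer
years <- 1959:1961
co2   <- c(316.00, 316.91, 317.63)

co2.mat     <- cbind(years, co2)   # 3 x 2: one column per vector
co2.row.mat <- rbind(years, co2)   # 2 x 3: one row per vector

print(dim(co2.mat))
print(dim(co2.row.mat))

# Transposing the row-bound matrix recovers the column-bound values
print(all(t(co2.row.mat) == co2.mat))
```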

6 Arrays
N-dimensional matrices
Create an array of dimension nyears x 2
> co2.array <- array(data=c(years,co2), dim=c(nyears,2))
> head(co2.array, n=3)
     [,1]   [,2]
[1,] 1959 316.00
[2,] 1960 316.91
[3,] 1961 317.63
A matrix is a special case of an array, but a data frame is not
> is.array(co2.array)
[1] TRUE
> is.array(co2.mat)
[1] TRUE
> is.array(co2.df)
[1] FALSE

7 Arrays
3-dimensional array
> array(1:24, dim=c(3,4,2))
, , 1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
, , 2
     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
R fills an array with the first index changing fastest: down each column, then across the columns, and then through the higher dimensions
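
Indexing the same 3 x 4 x 2 array makes the fill order concrete. This is a minimal sketch; the object name a is chosen here for illustration.

```r
a <- array(1:24, dim = c(3, 4, 2))

# Index order is [row, column, layer]
print(a[2, 1, 1])   # next row in the same column: 2
print(a[1, 2, 1])   # first row of the next column: 4
print(a[1, 1, 2])   # first cell of the second layer: 13

# Leaving an index empty returns a lower-dimensional slice
print(dim(a[, , 2]))  # a 3 x 4 matrix
```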

8 Lists
Most flexible structure
–Each element can be of a different mode and varying length
> description <- "Atmospheric CO2 concentrations (ppmv)
+ derived from in situ air samples
+ collected at Mauna Loa Observatory,
+ Hawaii. Source: C.D. Keeling"
> co2.list <- list(Meta=description, Years=nyears, data=co2.mat)

9 Lists
> co2.list
$Meta
[1] "Atmospheric CO2 concentrations (ppmv) derived from in situ air samples collected at Mauna Loa Observatory, Hawaii. Source: C.D. Keeling"
$Years
[1] 46
$data
     years    co2
[1,]  1959 316.00
[2,]  1960 316.91
[3,]  1961 317.63
...
The $ operation works for lists
> co2.list$Years
[1] 46
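
A scaled-down version of the list above, as a sketch: it assumes a shortened description string and only the first three years, and adds a hypothetical units element to show that lists can grow after creation.

```r
# Shortened stand-ins for the lecture objects
years <- 1959:1961
co2   <- c(316.00, 316.91, 317.63)

co2.list <- list(Meta  = "Mauna Loa CO2 (ppmv)",
                 Years = length(years),
                 data  = cbind(years, co2))

print(names(co2.list))   # element names used by $
print(co2.list$Years)    # a single number
print(class(co2.list$data))

# New elements can be added by assigning to a new name (hypothetical element)
co2.list$units <- "ppmv"
print(length(co2.list))
```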

10 Extracting elements (lists)
> dat.list <- as.list(dat.df)
> dat.list
$id
 [1] 31 62 50 99 53 75 54 58  4 74
$age
 [1] 12 18 20 17 14  8 12 24 24 21
$sex
 [1] M F F M F M M F F M
Levels: F M
R coerced the variable sex into a categorical variable (a factor) during the original data.frame statement, but it is not apparent until now

11 Extracting elements
[[ ]] extracts elements of lists; so does $ (but only if the item in the list is named)
> dat.list[[1]]
 [1] 31 62 50 99 53 75 54 58  4 74
> dat.list$id
 [1] 31 62 50 99 53 75 54 58  4 74
First element of the list [[ ]] and first element of the vector [ ]
> dat.list[[1]][1]
[1] 31
Mixed operators
> dat.list$id[1]
[1] 31

12 Extracting elements
> dat.list$id <- matrix(dat.list$id, ncol=2)
> dat.list
$id
     [,1] [,2]
[1,]   31   75
[2,]   62   54
[3,]   50   58
[4,]   99    4
[5,]   53   74
$age
 [1] 12 18 20 17 14  8 12 24 24 21
$sex
 [1] M F F M F M M F F M
Levels: F M
But now the elements of id are accessed as a matrix [,]
> dat.list[[1]][1,1]
[1] 31
> dat.list[[1]][1,]
[1] 31 75
object[[1st element]][element within 1st element]
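
One distinction the slides rely on but do not spell out is that single brackets on a list return a shorter list, while double brackets return the element itself. A minimal sketch, assuming a cut-down dat.list with the first three id and age values from the slides:

```r
# Cut-down version of dat.list (first three values only)
dat.list <- list(id = c(31, 62, 50), age = c(12, 18, 20))

sub  <- dat.list[1]     # still a list, of length 1
elem <- dat.list[[1]]   # the numeric vector stored in the list

print(class(sub))
print(class(elem))

# [[ ]] also accepts a name, which is equivalent to $
print(dat.list[["age"]][2])
print(identical(dat.list[["id"]], dat.list$id))
```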

13 Categorical variables in R
A factor in R is a vector object used to specify a discrete classification (grouping) of individual elements.
–Variable sex
Factors in R are used in
–Basic data manipulation
–Plotting routines
–Specifying statistical models
R will automatically specify categorical variables as factors when
–Creating data frames
–Reading in data
–Behind the scenes in other applications....
(Since R 4.0.0 the first two conversions are no longer the default; pass stringsAsFactors=TRUE to data.frame() or read.table() to get them.)

14 factor() function
We can specify a categorical or numeric variable as a factor with the factor() command
> zone <- c("demersal", "demersal", "demersal", "pelagic",..
> is.factor(zone)
[1] FALSE
> ( zone <- factor(zone) )
[1] demersal demersal demersal pelagic pelagic....
Levels: demersal pelagic

15 factor() function
Often a categorical variable is coded numerically in a database. In this case the labels argument may be specified
> zone <- c(1, 1, 1, 2, 2, 2, 1, 2, 2, 1)
> ( zone <- factor(zone, labels=c("demersal", "pelagic")) )
[1] demersal demersal demersal pelagic pelagic...
Levels: demersal pelagic
The labels argument may also be specified for character data whose values aren't informatively named.
–e.g. dem / pel
To list the levels of a factor when we do not know them all
> levels(zone)
[1] "demersal" "pelagic"
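
The numeric-coding case can be run end to end. This sketch uses the zone codes from the slide; the name zone.code and the explicit levels argument are additions here, included so the code-to-label pairing does not depend on sort order.

```r
# Numeric codes from the slide: 1 = demersal, 2 = pelagic
zone.code <- c(1, 1, 1, 2, 2, 2, 1, 2, 2, 1)

# levels says which codes exist; labels names them in the same order
zone <- factor(zone.code, levels = c(1, 2),
               labels = c("demersal", "pelagic"))

print(levels(zone))
print(table(zone))        # counts per level
print(as.integer(zone))   # the underlying integer codes
```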

16 Hands-on Exercise 1
Create 2 matrices, A and B, where A is a 2 x 2 matrix and B is a 2 x 3 matrix, and each is filled with unique values
Combine A and B into both a 2 x 5 matrix and a 5 x 2 matrix
Create a factor from the following vector such that 1 is female and 2 is male
sex <- c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1)
Create a list named data that contains the 2 matrices and the factor
Extract the first row of the matrix A from the list data
Using the matrix B in the list data, change the value in the 1st row and 1st column to NA

17 Reading in data
3 primary functions
scan()
–Most primitive
–Most flexible
–Reads into a vector
–Very fast, use for large or really messy data
read.table()
–Easiest to use
–Reads into a data frame
read.csv()
–Comma separated data, such as worksheets exported from Excel
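
The difference in return type between scan() and read.table() can be seen without any files by using the text argument that both functions accept. This is a sketch: the inline strings stand in for file contents and reuse the first rows shown on later slides.

```r
# Inline text standing in for file contents
txt <- "31 12 M\n62 18 F"

# scan() returns a plain vector by default
v <- scan(text = "316 316.91 317.63", quiet = TRUE)
print(v)

# With a list template in 'what', scan() returns a list of vectors
rec <- scan(text = txt, what = list(id = 0, age = 0, sex = ""), quiet = TRUE)
print(rec$sex)

# read.table() returns a data frame directly
df <- read.table(text = txt, col.names = c("id", "age", "sex"))
print(class(df))
```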

18 Entering Data
Data is usually read in from a *.txt or *.csv file, but can be entered manually from the keyboard
co2 <- scan()
–Hit enter twice to stop
Famous Mauna Loa CO2 data
co2 <- c(316, 316.91, 317.63, 318.46, …
There is also monthly Mauna Loa CO2 data in the included datasets:
        Jan    Feb    Mar    Apr    May    Jun    Jul…
1959 315.42 316.31 316.50 317.56 318.13 318.00…
1960 316.27 316.81 317.42 318.87 319.87 319.43…
…

19 Reading in data
?read.table
–Common options and defaults
header – first row has names for columns of data (FALSE)
sep – how entries are separated ("" – white space)
na.strings – what values are interpreted as NA ("NA")
skip – number of lines to skip before reading in data (0)
nrows – maximum number of rows to read (-1, meaning no limit)
col.names – names for columns ("V" followed by the column number)
6 data sets in varying formats
http://courses.washington.edu/fish552/data/
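
Several of these options can be combined in one call. The sketch below reads from an inline string rather than a course file; the semicolon separator, the note on the first line and the -99 missing-value code are assumptions made for illustration.

```r
# Hypothetical file contents: one note line, then semicolon-separated rows
raw <- "file written by hand
31;12;M
62;-99;F
50;20;F
99;17;M"

dat <- read.table(text = raw,
                  sep = ";",                      # entries separated by ;
                  skip = 1,                       # ignore the note line
                  nrows = 3,                      # read at most 3 rows
                  na.strings = c("NA", "-99"),    # both mean missing
                  col.names = c("id", "age", "sex"))
print(dat)
```

Because nrows = 3, the last row of the text is never read, and the -99 in the age column arrives as NA.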

20 Reading in data
Good format (dat_df1.dat)
Specify working directory (a folder on YOUR computer where the file is located)
wd <- "C:/data"
setwd(wd)
read.table("dat_df1.dat", header=TRUE)
Specify full path
read.table("C:/data/dat_df1.dat", header=TRUE)
Read directly from the web
read.table("http://courses.washington.edu/fish552/data/dat_df1.dat", header=TRUE)

21 Reading in data
Variables not named (dat_df2.dat)
> read.table("dat_df2.dat")
  V1 V2 V3
1 31 12  M
................
By default R names columns "V" followed by the column number
> read.table("dat_df2.dat", col.names=c("id", "age", "sex"))
  id age sex
1 31  12   M

22 Reading in data
Comments can precede the data with # (dat_df3.dat)
# Comments can precede the data
# Fake patient data
id age sex
31 12 M
Everything works as usual
read.table("dat_df3.dat", header=TRUE)

23 Reading in data
Specify what line to start reading data (dat_df4.dat)
> read.table("dat_df4.dat", header=TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 1 did not have 8 elements
read.table("dat_df4.dat", header=TRUE, skip=3)
–skip: number of lines to skip before reading the first row
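
The two cases, comment lines and uncommented preamble lines, can be compared side by side. The inline strings below are stand-ins: the three preamble lines are hypothetical and are not the actual contents of dat_df4.dat.

```r
# Lines starting with # are ignored by read.table() by default
with.comments <- "# Fake patient data
id age sex
31 12 M"

# Preamble lines without # must be skipped explicitly (hypothetical text)
with.preamble <- "Fake patient data
collected by hand
second note
id age sex
31 12 M"

a <- read.table(text = with.comments, header = TRUE)
b <- read.table(text = with.preamble, header = TRUE, skip = 3)

# Both calls yield the same data frame
print(all.equal(a, b))
```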

24 Reading in data
Reading in data saved from Excel as a comma-separated file (dat_df1.csv)
read.csv("dat_df1.csv", header=TRUE)

25 Reading in data
Data not separated by white space (dat_df5.dat)
read.table("dat_df5.dat", sep="/", header=TRUE)
What does *.csv stand for?
–Hint: read.table("dat_df1.csv", header=TRUE, sep=",")
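
The hint can be verified directly: for simple data, read.csv() and read.table() with sep="," give the same result. This sketch uses inline strings in place of dat_df1.csv and dat_df5.dat, whose exact contents are assumed here.

```r
# Inline stand-ins for a comma-separated and a slash-separated file
csv.txt   <- "id,age,sex\n31,12,M\n62,18,F"
slash.txt <- "id/age/sex\n31/12/M\n62/18/F"

from.csv   <- read.csv(text = csv.txt)
from.table <- read.table(text = csv.txt, header = TRUE, sep = ",")
from.slash <- read.table(text = slash.txt, header = TRUE, sep = "/")

# All three hold the same values; only the separator differed
print(all.equal(from.csv, from.table))
print(all.equal(from.csv, from.slash))
```

read.csv() also changes a few other defaults (for example header=TRUE), which is why it needs no extra arguments here.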

26 Hands-on Exercise 2
Read in the data in the file dat_df6.dat
–Examine the data first
Missing values are indicated by -99, -999 and NA
The last column should be read in as a factor
Be sure to give the variables names (id, age, sex)
If you get stuck, start with the help file
–?read.table

