Introduction to R: user-friendly and absolutely free
Ho Kim, School of Public Health, SNU
Useful sites
R is free software with powerful tools.
The Comprehensive R Archive Network (CRAN): http://cran.r-project.org/ -> Download R for Windows -> base -> Download R-3.2.2 for Windows
Textbook: Simple R by John Verzani, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
Features of R
R is free.
R is open-source and runs on UNIX, Windows and Macintosh.
R has an excellent built-in help system.
R has excellent graphing capabilities.
Students can easily migrate to the commercially supported S-Plus program if commercial software is desired.
R's language has a powerful, easy-to-learn syntax with many built-in statistical functions.
The language is easy to extend with user-written functions.
R is a computer programming language. For programmers it will feel more familiar than other statistical packages, and for new computer users the next leap to programming will not be so large.
Starting R
Data manipulation
Data input: Data types, Importing data, Exporting data, Viewing data, Value labels, Missing data
Data management: Variables, Operators, Sorting data, Merging data, Subsetting data
Source: http://www.statmethods.net/ (Quick-R)
Data types: vectors, factors, matrices [Data input]

Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
a[c(2,4)] # 2nd and 4th elements of vector

Factors
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender) # R now treats gender as a nominal variable
summary(gender)

Matrices
# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
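As a quick check on the data types above, the following sketch inspects each kind of object. The object names (num_vec, sex, m) are illustrative and do not come from the slides.

```r
# Illustrative objects; the names are assumptions, not from the slides
num_vec <- c(1, 2, 5.3, 6, -2, 4)
sex <- factor(c(rep("male", 2), rep("female", 3)))
m <- matrix(1:20, nrow = 5, ncol = 4)

class(num_vec)    # "numeric"
num_vec[c(2, 4)]  # 2 6
levels(sex)       # "female" "male" (levels are sorted alphabetically)
table(sex)        # counts per level
dim(m)            # 5 4
m[2:4, 1:3]       # rows 2 to 4 of columns 1 to 3, a 3 x 3 matrix
```

Note that the matrix is filled column by column, so m[ , 1] holds 1 to 5.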
Data types: data frames, lists [Data input]

Data frames
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names

Lists
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
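Once a data frame or list exists, its parts are usually reached with $, [ ] or [[ ]]. The sketch below rebuilds small versions of the objects above so it runs on its own; the shortened list is an assumption made for brevity.

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(ID = d, Color = e, Passed = f)

mydata$Color                              # one column as a vector
mydata[2, ]                               # second row
passed <- mydata[mydata$Passed, c("ID", "Color")]  # rows where Passed is TRUE

# Shortened list (no matrix component) for illustration only
w <- list(name = "Fred", mynumbers = c(1, 2, 5.3), age = 5.3)
w$name                # "Fred"
w[["mynumbers"]][2]   # 2
```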
Importing data [Data input]

From CSV file
malaria <- read.table("C:\\R_data\\malaria.csv", header=TRUE, sep=",")

From Excel
library(RODBC)
channel <- odbcConnectExcel("C:\\R_data\\malaria.xls")
malaria <- sqlFetch(channel, "mal")
* odbcConnectExcel is only usable with 32-bit Windows

From txt file
malaria <- read.table("C:\\R_data\\malaria.txt", header=TRUE, sep="\t")
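The import step can be tried without the course files. This sketch writes a few made-up rows to a temporary file and reads them back; the column names subject, mal and age and the values are assumptions for illustration, not the real malaria data.

```r
# Hypothetical rows written to a temporary file, then read back
tmp <- tempfile(fileext = ".csv")
writeLines(c("subject,mal,age", "1,0,5", "2,1,9", "3,1,7"), tmp)

# read.csv() is read.table() with header = TRUE and sep = "," as defaults
demo <- read.csv(tmp)
str(demo)     # 3 observations of 3 variables
unlink(tmp)   # remove the temporary file
```

On Windows, forward slashes such as "C:/R_data/malaria.csv" can be used in place of doubled backslashes.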
Exporting data [Data input]

To a CSV file
write.table(malaria, "C:\\R_data\\mal01.csv", sep=",", row.names=FALSE)

To a tab-delimited text file
write.table(malaria, "C:\\R_data\\mal02.txt", sep="\t", row.names=FALSE)
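A round trip is a simple way to confirm that an export worked. The sketch below uses write.csv(), a convenience wrapper around write.table() that always writes comma-separated output, and a temporary file in place of the C:\R_data folder. The data frame out is made up for illustration.

```r
# Made-up data frame standing in for malaria
out <- data.frame(subject = 1:3, mal = c(0L, 1L, 1L))

csv_file <- tempfile(fileext = ".csv")
write.csv(out, csv_file, row.names = FALSE)  # comma-separated, with header

back <- read.csv(csv_file)                   # read the file back in
all(back$mal == out$mal)                     # compare a column
unlink(csv_file)
```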
Viewing data
ls() # list objects in the working environment
names(malaria) # list the variables in malaria
str(malaria) # list the structure of malaria
malaria$v1 <- factor(malaria$mal) # create factor v1 from mal
levels(malaria$v1) # list levels of factor v1 in malaria
dim(malaria) # dimensions of malaria
class(malaria) # class of malaria (numeric, matrix, data frame, etc.)
malaria # print malaria
head(malaria, n=10) # print first 10 rows of malaria
tail(malaria, n=5) # print last 5 rows of malaria
summary(malaria)
Value labels
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
v1 <- c(1,1,1,2,2,3)
v2 <- factor(v1, levels = c(1,2,3), labels = c("red", "blue", "green"))
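To see what the labelled factor contains, the codes and labels can be inspected side by side. A minimal sketch reusing the same v1 and v2:

```r
v1 <- c(1, 1, 1, 2, 2, 3)
v2 <- factor(v1, levels = c(1, 2, 3), labels = c("red", "blue", "green"))

table(v2)         # red 3, blue 2, green 1
as.integer(v2)    # 1 1 1 2 2 3 (the underlying codes)
as.character(v2)  # the labels as plain text
```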
Missing data

Testing for missing values
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

Recoding values to missing
malaria[malaria$age==99,"age"] <- NA

Excluding missing values from analyses
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
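The recoding step above can be sketched on a small made-up data frame. Here 99 is assumed to be the code for an unknown age, and the column already contains one NA. Wrapping the comparison in which() is one way to keep the assignment working in that case, since a logical row index containing NA can fail in data frame assignment.

```r
# Hypothetical data; 99 is assumed to mean "age unknown"
demo <- data.frame(id = 1:5, age = c(4, 99, 12, NA, 99))

# which() keeps only the positions where the comparison is TRUE
demo[which(demo$age == 99), "age"] <- NA

sum(is.na(demo$age))           # 3
mean(demo$age, na.rm = TRUE)   # 8
complete <- na.omit(demo)      # keep only rows without any NA
nrow(complete)                 # 2
```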
Help
> help(mean)
> ?mean
Data manipulation
Data input: Data types, Importing data, Exporting data, Viewing data, Value labels, Missing data
Data management: Variables, Operators, Sorting data, Merging data, Subsetting data
Variables: recoding variables [Data management]

# create 2 age categories
malaria$agecat <- ifelse(malaria$age > 7, c("student"), c("baby"))

# the same recoding using attach()
attach(malaria)
malaria$agecat2[age > 7] <- "student"
malaria$agecat2[age <= 7] <- "baby"
detach(malaria)
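The same two categories can also be built with cut(), which avoids attach() and returns a factor. This is an alternative sketch rather than the method on the slide; the data frame demo and its ages are made up.

```r
# Made-up ages for illustration
demo <- data.frame(age = c(3, 7, 8, 15))

# ifelse() version, as above
demo$agecat <- ifelse(demo$age > 7, "student", "baby")

# cut() version: intervals are (-Inf, 7] and (7, Inf], so age 7 is "baby"
demo$agecat2 <- cut(demo$age, breaks = c(-Inf, 7, Inf),
                    labels = c("baby", "student"))

table(demo$agecat, demo$agecat2)  # the two codings agree
```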
Operators [Data management]

Comparison operators
== equals
!= not equals
<= less than or equals
>= greater than or equals
= assignment (same as the arrow '<-')

Logical operators
& and
| or
! not
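These operators work element by element on vectors, which is what makes them useful for recoding and subsetting. A short sketch with made-up values:

```r
# Made-up vectors for illustration
age <- c(3, 7, 8, 15)
mal <- c(1, 0, 1, 1)

age >= 7              # FALSE TRUE TRUE TRUE
age != 7              # TRUE FALSE TRUE TRUE
mal == 1 & age < 8    # TRUE FALSE FALSE FALSE
mal == 0 | age > 10   # FALSE TRUE FALSE TRUE
!(mal == 1)           # FALSE TRUE FALSE FALSE
```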
Sorting data [Data management]

# sort by mal
newdata <- malaria[order(malaria$mal),]
# sort by mal and age
newdata2 <- malaria[order(malaria$mal, malaria$age),]
# sort by mal (ascending) and age (descending)
newdata3 <- malaria[order(malaria$mal, -malaria$age),]

Avoid the "attach" command when sorting the data.
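The effect of order() is easier to see on a few rows. The data frame below is made up; order() returns row positions, which are then used to index the data frame.

```r
# Made-up data for illustration
demo <- data.frame(mal = c(1, 0, 1, 0), age = c(5, 9, 12, 3))

# mal ascending, then age descending within each mal group
by_mal_age <- demo[order(demo$mal, -demo$age), ]
by_mal_age$age   # 9 3 12 5

# a single key in descending order
oldest_first <- demo[order(demo$age, decreasing = TRUE), ]
oldest_first$age # 12 9 5 3
```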
Merging data [Data management]

Raw dataset
malaria2 <- read.table("C:\\R_data\\malaria.csv", header=TRUE, sep=",")

Adding rows
extra <- read.table("C:\\R_data\\extra15.csv", header=TRUE, sep=",")
malaria3 <- rbind(malaria2, extra)

Adding columns
region <- read.table("C:\\R_data\\region.csv", header=TRUE, sep=",")
malaria4 <- merge(malaria3, region, by="subject")
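The difference between adding rows and adding columns can be sketched with small made-up data frames. The names base, extra and region and their contents are assumptions. By default merge() keeps only subjects found in both data frames, while all.x = TRUE keeps every row of the first.

```r
# Made-up data frames for illustration
base   <- data.frame(subject = 1:3, mal = c(0, 1, 1))
extra  <- data.frame(subject = 4:5, mal = c(1, 0))
region <- data.frame(subject = c(1, 2, 3, 5),
                     region  = c("N", "S", "N", "E"))

stacked <- rbind(base, extra)   # adding rows: column names must match
inner <- merge(stacked, region, by = "subject")                # subject 4 is dropped
left  <- merge(stacked, region, by = "subject", all.x = TRUE)  # subject 4 kept, region NA

nrow(stacked)  # 5
nrow(inner)    # 4
nrow(left)     # 5
```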
Subsetting data [Data management]

mal.1 <- subset(malaria, mal==1)
summary(mal.1)
mal.baby <- subset(malaria, mal == 1 & age < 8)
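The same selection can be written with bracket indexing, and subset() can also pick columns through its select argument. A sketch on made-up data; the column sex is an assumption added for illustration.

```r
# Made-up data for illustration
demo <- data.frame(mal = c(1, 0, 1, 1),
                   age = c(5, 9, 12, 7),
                   sex = c("m", "f", "f", "m"))

s1 <- subset(demo, mal == 1 & age < 8)       # subset() version
s2 <- demo[demo$mal == 1 & demo$age < 8, ]   # bracket version, same rows
s3 <- subset(demo, mal == 1, select = c(age, sex))  # rows and columns at once

s1$age      # 5 7
names(s3)   # "age" "sex"
```

One difference to keep in mind: subset() drops rows where the condition is NA, while bracket indexing with a logical vector returns an all-NA row for each of them.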