Boost Your R Code with the data.table package

1 Boost Your R Code with the data.table package
Ian Price, PhD Pittsburgh useR group

2 Topics to Cover
What is data.table
Why use data.table
“Three” syntax fields
Sub-selecting columns
Sub-selecting rows
Special fields
Functions and adding, deleting, modifying cols by reference
More by reference
Merging with data.table
Map-Reduce
melt / dcast

3 What Is data.table?
They are data.frames!
data.table inherits from the data.frame class
they work everywhere that a data.frame object would
absolutely no reason not to start using data.table today
They are lists!
Should not be surprising, as data.frames are also lists
This feature comes up more as we go down the road
Why hasn’t R replaced data.frame with data.table? The dependency chain. Base R data.frame has incorporated some properties of data.table
How do you get started?
library(data.table)
my_data_structure_dt <- as.data.table(my_data_structure)
my_file_dt <- fread("myfile.csv")

4 Starting Code!
library(data.table)
data(iris)
iris_dt <- as.data.table(iris)
class(iris); class(iris_dt)
typeof(iris_dt)
tables()
head(iris)
head(iris_dt)

5 Why Use data.table?
Lightning-fast speed
A keyed index allows radix sorting and binary search; data.frame uses a vector scan
A small amount of time is dedicated to indexing a column
This is an up-front cost
Most worthwhile for large (> 100,000 obs) data sets
In data.frame, every new manipulation is a fresh vector scan!
Huge time savings from having a pre-sorted data.table
Increased memory efficiency
“Shallow copy”, not “deep copy”
R is a memory hog, and so is data.frame
every manipulation on a data.frame becomes a new data.frame in memory
even though you don’t see it, it adds a new data.frame
data.table uses more shallow-copy techniques
in particular, with data.table you can manipulate your object by reference
Better code practice
Vectorization
Easy, expressive syntax
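The copy-on-modify claim above can be checked with base R's tracemem(), which prints a message whenever an object is duplicated. A minimal sketch, not from the slides:

```r
library(data.table)

df <- data.frame(x = 1:5)
dt <- data.table(x = 1:5)

# Base R: adding a column copies the whole data.frame
tracemem(df)          # start watching df for duplications
df$y <- df$x * 2      # tracemem reports a copy here
untracemem(df)

# data.table: `:=` adds the column in place, no deep copy
dt[, y := x * 2]
```

Running this interactively, the data.frame assignment emits a tracemem message while the `:=` line does not, which is the shallow- vs deep-copy difference in action.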

6 Warming Up Code
iris_dt[Sepal.Length < 5.0]
iris[iris[, 'Sepal.Length'] < 5.0, ]
iris_dt[, `:=`(iris_index = seq_len(nrow(iris_dt)))]
setkey(iris_dt, "iris_index")
tables()

7 Speed-Up Example From the Documentation
grpsize <- ceiling(1e7/26^2)
t1 <- system.time(
  DF <- data.frame(x = rep(LETTERS, each = 26*grpsize),
                   y = rep(letters, each = grpsize),
                   v = runif(grpsize*26^2),
                   stringsAsFactors = FALSE))
t2 <- system.time(ans1 <- DF[DF$x == "R" & DF$y == "h", ])
DT <- as.data.table(DF)
s1 <- system.time(setkey(DT, x, y))
s2 <- system.time(ans2 <- DT[list("R", "h")])
identical(ans1$v, ans2$v)

8 “Three” Syntactical Fields
data.tables take three basic argument fields
First field: selection criteria for rows
Second field: select which columns, or perform column operations
Third field: special fields, e.g. by or with
This field can be used multiple times! e.g. by, .SDcols, etc.
Popular SQL comparison (not my favorite!)
this_data_table[WHERE, SELECT, GROUP BY]
New syntactical feature
If only using the first field, no need for more commas
Column names don’t need to be quoted, but must be in a list: list() or .()
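All three fields can appear in a single call; a small sketch on the built-in iris data (not from the slides):

```r
library(data.table)
data(iris)
iris_dt <- as.data.table(iris)

# First field: row selection; second: column operation; third: grouping.
# "For rows where Sepal.Length > 5, compute mean petal length per species"
res <- iris_dt[Sepal.Length > 5,                       # rows
               .(mean_petal = mean(Petal.Length)),     # columns / operation
               by = Species]                           # special field
res
```

The result has one row per species, with the aggregate computed only over the rows that passed the first-field filter.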

9 Sub-Selecting Columns
Columns are referred to by column name
Using column numbers creates fragile code, as column order can change
Columns are always referred to in the second field
If there is no argument in the first field, a leading comma is needed
Columns are referred to as lists, and without quotes
either normal R syntax, list(Col1, Col2, ...), or data.table notation, .(Col1, Col2, ...)
for special characters use the back-tick, e.g. dt[, .(`123`)]
If you want to use quoted column names, use the third field: with = FALSE
use vector rather than list notation: dt[, c("Col1", "Col2"), with = FALSE]

10 Code Time
iris_dt[, list(Sepal.Length, Sepal.Width)]
iris_dt[, .(Sepal.Length, Sepal.Width)]
iris_dt[, c("Sepal.Length", "Sepal.Width"), with = FALSE]
iris_dt[, names(iris), with = FALSE]
Interesting combinations are possible with: eval(), as.name(), substitute(), parse()

11 Sub-Selecting Rows
Straightforward: an integer vector or logical statement goes directly into the first field
No need to create another data.table to subset on in the first field
A greater feel of vectorization!
iris_dt[Species == 'setosa']
iris_dt[Species == 'setosa', .(Sepal.Length)]
iris_dt[Species == 'setosa', .(Sepal.Length, Sepal.Width)][order(Sepal.Length)]

12 Special Fields
There are a few special symbols in the data.table syntax
Many are not especially useful for the work I do
Examples:
.SD & .SDcols – very useful for condensing column names
.N returns the count – again very useful by groups; similar fields exist for aggregate functions
.GRP – similar use to by; I’ve not found a scenario where I couldn’t use by
.I – special symbol for the row index; I’ve not found a useful scenario for it
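For reference, a small sketch (not from the slides) showing each special symbol inside a grouped call on iris:

```r
library(data.table)
iris_dt <- as.data.table(iris)

# .N: row count per group
iris_dt[, .N, by = Species]

# .I: global row indices of the rows in each group
first_rows <- iris_dt[, .(first_row = .I[1]), by = Species]

# .GRP: integer id of the current group (1, 2, 3, ...)
grp_ids <- iris_dt[, .(grp = .GRP), by = Species]

# .SD with .SDcols: apply a function over a chosen subset of columns
iris_dt[, lapply(.SD, mean), by = Species,
        .SDcols = c("Sepal.Length", "Sepal.Width")]
```

Since iris is stored sorted by species in blocks of 50, the `.I` example returns rows 1, 51, and 101.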

13 Functions on Columns
Functions are generally applied to columns
This can also be done on a sub-selected portion of the rows
The output of the function can be read out to a new value
Or the function output can become part of the data.table (more later)
Functions can also be applied per grouping of the table
There are special built-in functions that make data.table easier to use!
.N gives the count
.SD references all available (or sub-selected) columns
A DT can become your all-in-one, easy-to-use pivot table!

14 Applying Functions to data.table!
iris_dt[, max(Sepal.Length)]
iris_dt[, max(Sepal.Length), by = "Species"]
iris_dt[, .N, by = "Species"] # compare with tapply!
sepal_dt <- iris_dt[, .(min = min(Sepal.Length),
                        mean = mean(Sepal.Length),
                        max = max(Sepal.Length)), by = "Species"]
setkey(sepal_dt, "Species")
mean_dt <- iris_dt[, lapply(.SD, mean), by = "Species"]

15 Adding, Deleting, Modifying Cols by Reference
This may seem a little odd at first, but I think it is one of DT’s biggest strengths.
What does “by reference” mean?
No copy is made at all, other than a temporary working copy
Similar to pandas’ inplace = True in Python
What are the implications?
This is what really allows us to make shallow copies rather than deep copies
Things go much faster, since we make fewer deep copies
No return value is visible
Stuff that happens to the DT in another scope still affects it!!!
What does it look like? The special operator `:=`
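The "another scope" warning deserves a demonstration; a hypothetical sketch (the function name is made up for illustration):

```r
library(data.table)

dt <- data.table(x = 1:3)

# A function that modifies its argument by reference
add_col <- function(d) {
  d[, y := x * 10]   # := returns invisibly; nothing is printed
  invisible(NULL)
}

add_col(dt)
dt   # dt now has column y, even though add_col "returned" nothing
```

This is exactly the behavior that makes `:=` fast, and exactly why functions that receive a data.table should use copy() first if they are not meant to mutate the caller's object.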

16 Using data.table by Reference
iris_dt[, `:=`(new_col = 0)]
iris_dt[, `:=`(new_col = NULL)]
iris_dt[, `:=`(this_col = 0, that_col = 1)]
iris_dt[, c("new_col") := 0]
iris_dt[, c("new_col", "this_col", "that_col") := NULL]
iris_dt[, `:=`(Sepal_ratio = Sepal.Length/Sepal.Width,
               Petal_ratio = Petal.Length/Petal.Width)]
iris_dt[(Sepal.Length > 7) & (Sepal_ratio > 2.5), c("Petal_ratio") := 3]

17 More by Reference
The `:=` operator is the heavy hitter, but DT does many things by reference!
Any DT function that starts with “set” works by reference:
setDT, setDF, setkey, setorder, setcolorder, setnames, setattr

18 More Code by Reference!
setDF(iris_dt)
setDT(iris)
setcolorder(iris, c("Species", "Sepal.Length", "Petal.Length",
                    "Sepal.Width", "Petal.Width"))
setorder(iris, Petal.Width)
setnames(iris, names(iris), tolower(names(iris)))
setkey(iris, "petal.width")

19 Merging With data.table
Merging columns in data.table is fast and easy with the native merge function.
Outer, inner, left, and right merges are easy and intuitive using the “all” arguments
If no merge key is given in the arguments, it will automatically merge on the keyed columns
merged_dt <- merge(x, y, by.x, by.y, all.x, all.y, suffixes = c(".x", ".y"))
Merging rows in data.table uses rbindlist
new_dt <- rbindlist(list(x1_dt, x2_dt))

20 Code to Merge data.table
dt1 <- data.table(A = letters[1:10], X = 1:10, key = "A") dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A") merge(dt1, dt2) # inner join merge(dt1, dt2, all = TRUE) # outer join merge(dt1, dt2, all.x = TRUE) # left join merge(dt1, dt2, all.y = TRUE) # right join rbindlist(list(dt1, dt2))

21 Map-Reduce
This is much easier and more useful with data.table than it might appear
lapply is a “map” function, and we are working with lists!
For an operation on each of N columns, produce N outputs
E.g. 1: change the class of multiple columns
E.g. 2: perform an operation on multiple columns
Reduce, a base R function, can be useful for big tables
For an operation on N columns, produce the desired number of outputs
E.g.: compute the sum of all, or a selection, of columns

22 Code to Practice Map – Reduce
inputs <- c(“sepal.length", “sepal.width") iris[, c(inputs) := lapply(.SD, function(x) 10*x), .SDcols = inputs] iris[, c("sum") := Reduce(`+`, .SD, 0), .SDcols = inputs] iris[, lapply(.SD, class)] iris[, c(inputs) := lapply(.SD, as.character), .SDcols = inputs]

23 melt / dcast
melt and dcast are two surprisingly useful functions for manipulating data.tables. Based on the reshape2 package.
dcast turns a long structure (many rows, few columns) into a wide one (few rows, many columns)
melt does the opposite, turning a wide structure (few rows, many columns) into a long one (many rows, few columns)
Data can be lost, or may need to be filled, depending on the situation

24 melt / dcast code
alphabet <- data.table(UC = rep(LETTERS, each = 26),
                       LC = rep(letters, 26),
                       value = rnorm(26*26))
sq_table <- dcast(alphabet, UC ~ LC, value.var = "value")
alphabet2 <- melt(sq_table, id.vars = "UC", measure.vars = letters)

25 And now it’s time to say good-bye
Thank You Any Questions?

