SparkR: Enabling Interactive Data Science at Scale


1 SparkR: Enabling Interactive Data Science at Scale
Hi everyone. My name is Zongheng, and I’m very excited to be here today to talk about SparkR, which is a research project Shivaram and I started at the UC Berkeley AMPLab a couple of months ago. A little bit about ourselves: I am a third-year undergraduate at Berkeley studying computer science and math, and I work in the AMPLab as a research assistant. Shivaram is a third-year CS PhD student who is also part of the UC Berkeley AMPLab. His research interests include topics in distributed computing and machine learning. OK, just so I can get a rough impression: I’m guessing most of us here have some experience with Spark, but how many of us have used or programmed in R before? OK, let’s jump right into the talk. Shivaram Venkataraman Zongheng Yang

2 Talk Outline Motivation Overview of Spark & SparkR API Live Demo: Digit Classification Design & Implementation Questions & Answers So much for the intro. Here’s the outline of the talk.

3 Fast! Scalable Flexible Motivation
Before introducing what SparkR is and its functionality, let me explain a little bit of the motivation behind the project. When users choose Spark, or when we think about Spark’s advantages, usually the first thing that comes to mind is that Spark is very fast. Furthermore, being a cluster computing engine, Spark is also scalable. One other advantage is that it is flexible: you can use its highly expressive APIs to write concise programs, and you have the choice of writing them in different languages. We have seen how these characteristics and other features have made Spark popular.

4 Statistical! Packages Interactive Now what about R?
Firstly, R is amazingly good at statistics and data analysis. In fact it is designed for statisticians and is very popular in related areas. Moreover, there’s an extensive list of mature packages available in R that are very useful and popular. Some examples include the ggplot2 package for plotting sophisticated graphs (where users get immense control over the layout), or the plyr package for manipulating and transforming data. Another characteristic is that R fits an interactive workflow well. For instance, you can load your datasets into the R shell, do some exploration, and quickly visualize your findings by plotting various graphs.

5 Fast! Statistical! Scalable Packages Flexible Interactive
However, there’s one drawback: traditionally, the R interpreter is single-threaded, and it is unclear how R programs can be effectively and concisely written to run on multiple machines. So, what if we could combine these two worlds? This is where SparkR comes in: it is a language binding that lets users write R programs, equipped with R’s nice statistics packages, and have them run on top of Spark.

6 RDD Transformations Actions Parallel Collection map filter groupBy …
count collect saveAsTextFile So when we started thinking about what kind of API SparkR should have, we looked at Spark’s API first. Most of the Spark API consists of operations on the RDD class. As some of you might know, an RDD supports two kinds of operations: transformations and actions. For instance, the map function is a transformation, and just lets you apply a custom function to all elements in the RDD. Actions are special operations that actually fire off computation: when you call saveAsTextFile(), you expect the call to immediately start saving whatever elements you have in the RDD to a text file. The same goes for count() and collect(). When designing SparkR’s API, one direct approach is to mimic these API functions, but let users call them inside R programs and perhaps on R datasets.
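To make the transformation/action distinction concrete, here is a hedged sketch in SparkR terms, using only functions that appear later in this talk and assuming a Spark context sc already exists:
lines <- textFile(sc, "hdfs://my_text_file")          # a parallel collection of strings
upper <- lapply(lines, function(line) toupper(line))  # transformation: lazy, nothing runs yet
n <- count(upper)                                     # action: fires off the actual computation
all <- collect(upper)                                 # action: brings the results back to the driver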

7 R + RDD = R2D2 Let me present the result of adding R and RDD: for the Star Wars fans out there, the correct answer is obviously R2D2 the robot!

8 R + RDD = RRDD lapply lapplyPartition groupByKey reduceByKey sampleRDD
collect cache broadcast includePackage textFile parallelize R + RDD = RRDD Actually, we came up with something cooler, which is RRDD. RRDD is a subclass of RDD that facilitates calling some of the familiar functions from inside R. Furthermore, we provide aliases for some of the functions, respecting idiomatic R practice. For instance, the RDD map() function is still available in SparkR, but we also provide an alias called lapply(). This is because a native lapply() call in R simply loops over a list of data, and an RRDD is conceptually a list of elements, so we chose this name. Besides supporting many of the essential RDD operations, such as the transformations here and Spark’s broadcast support, SparkR also includes some new features that attempt to fulfill our design motivation. For instance, we have introduced an includePackage() function, which simply takes a package name and marks it as included in the current environment of all worker nodes running SparkR. Using this function, users can use the functions of some nice, existing R package, or their own UDFs, in the closures of, say, lapply() or groupByKey().
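As a rough illustration (a hedged sketch: the includePackage call is written as includePackage(sc, ...) based on the description above, so treat its exact signature as an assumption):
includePackage(sc, plyr)                      # assumed form: mark plyr as loaded on every worker
rdd <- textFile(sc, "hdfs://my_text_file")
lengths <- lapply(rdd, function(line) {       # lapply is the SparkR alias for map
  nchar(line)                                 # the closure may call base R, functions from an
})                                            # included package, or the user's own UDFs
out <- collect(lengths)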

9 Getting Closer to Idiomatic R
Q: How can I use a loop to [...insert task here...] ? A: Don’t. Use one of the apply functions. However, we also consider idiomatic practice in R. For instance, it is very common in, say, Java to loop over a collection of data and perform some operation on each element. But R actually has a very nice family of functions called apply, such as lapply or sapply, that do similar things. So instead of explicitly using a for loop or a while loop, idiomatic R prefers these apply functions. (That question and answer is actually a quote from an article on idiomatic R.) The ultimate purpose of this design decision is to remove as much of the learning curve for R programmers as possible.
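For instance, in plain R (no SparkR involved), the loop-based and apply-based versions of squaring a list of numbers look like this:
xs <- list(1, 2, 3, 4)
# loop style: works, but not idiomatic R
squares <- vector("list", length(xs))
for (i in seq_along(xs)) {
  squares[[i]] <- xs[[i]]^2
}
# idiomatic R: let lapply do the looping
squares <- lapply(xs, function(x) x^2)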

10 Example: Word Count lines <- textFile(sc, "hdfs://my_text_file")
So much for the overview of SparkR’s current API. Let’s see how it works in action with an example. Of course, we are now tackling the most important problem in distributed computing: Word Count. I am going to walk through this very short R program line by line, and explain how it uses SparkR’s API as we go. In this program, the things marked in red are either SparkR or Spark concepts. The first step of the word count program is to read data from HDFS into an RDD. For this we use the textFile function, which takes a Spark context sc as its first argument and a path to an HDFS location as its second. By the way, if you start SparkR in the native R shell, we will automatically create this variable sc for you, just like the Spark shell does. This way, lines is conceptually an RDD of strings, ready to be further operated on inside R. For those of you who are already familiar with Spark, you will probably notice that this is very similar to the counterpart in the original API.

11 Example: Word Count
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
- serialize closures
Next step, we extract the actual words from each line. To do this, we use the Spark flatMap function, and feed the lines RDD and a closure into it. This closure uses the R function strsplit, which splits the line on spaces; taking the first element of the list it returns gives the vector of words for that line. The third step uses the SparkR lapply() function, an alias for map(), which maps over this words RDD and produces a (word, 1) pair for each word. So how does this actually get executed? Under the hood, SparkR will automatically fetch the dependencies of each closure and serialize the whole thing for you. It then gets shipped over the network to every worker.

12 Example: Word Count
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
- Use R primitive functions like “+”
To finish the word count program, we next call the Spark reduceByKey function, which takes the previous key-value pair RDD, an R primitive function (represented by the "+" here), and a number giving the number of partitions to use. Lastly, we call the collect() function on this counts RDD, getting back the final answer as a local R list.
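Putting the whole program together (the same code consolidated from the last three slides, with a final print added purely for illustration):
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
# each element of output is a list(word, count) pair
print(head(output))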

13 Live Demo For the next part of the talk, I want to do a live demo that shows off some of the advantages of SparkR. Specifically, we will tackle a machine learning problem in R, but make our program run faster by executing it on Spark.

14 MNIST The machine learning problem is digit recognition on the MNIST dataset. MNIST is very widely used and studied in the machine learning community, and is basically a set of images of hand-written digits. The problem is to train a machine learning model that recognizes the actual digits from these images.

15 Minimize ||Ax - b|| Here’s one formalization of this problem.
Basically, our high-level plan is to extract a feature matrix A from the input dataset, as well as a label vector b. The goal is to find the vector x such that the norm ||Ax - b|| is minimized. Therefore the program will just compute AᵀA and Aᵀb, and then solve the normal equations (AᵀA)x = Aᵀb for x. [BEGIN DEMO]
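A rough sketch of what that plan could look like with the SparkR functions from earlier slides (this illustrates the approach just described, not the actual demo code; data stands for an RDD holding the MNIST rows, and, as in the logistic-regression example later in the deck, each partition is assumed to arrive as a numeric matrix with the label in the first column):
pieces <- lapplyPartition(data, function(partition) {
  b <- partition[, 1]                      # labels
  A <- partition[, -1]                     # features
  list(t(A) %*% A, t(A) %*% b)             # this partition's contribution to AᵀA and Aᵀb
})
sums <- reduce(pieces, function(p, q) list(p[[1]] + q[[1]], p[[2]] + q[[2]]))
x <- solve(sums[[1]], sums[[2]])           # solve the normal equations (AᵀA) x = Aᵀb locally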

16 How does this work ? [END DEMO]
Hopefully you agree that this all seems pretty cool. As the next part of the talk, I’d like to discuss some of the details of SparkR’s design & implementation. Namely how we went about implementing all of this functionality under the hood. How does this work ?

17 Dataflow Local Worker Worker
The core internals of SparkR can be explained by an illustration of the dataflow of a computation in a SparkR job. Let’s consider a simple scenario where you launch SparkR from a local machine, and the cluster contains two workers.

18 Dataflow R Local Worker Worker
The first thing you do is to launch a normal R process, such as the R shell.

19 Dataflow R Local Worker Worker JNI Java Spark Context Spark Context
The next step is to launch SparkR by calling library(SparkR) from within that R process. What this does is that SparkR uses JNI to create a JavaSparkContext and holds on to a reference to it inside R. In the sparkR shell, for instance, this reference is accessible as the variable sc, as we have seen before.
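In a plain R session this would look roughly as follows (a sketch; sparkR.init() is written as in the SparkR-pkg README, so treat the exact name and arguments as an assumption):
library(SparkR)                      # loads the package and starts the JVM bridge
sc <- sparkR.init(master = "local")  # assumed entry point: creates the JavaSparkContext
                                     # and returns the R-side reference to it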

20 Dataflow R R R Local Worker Worker exec JNI exec Spark Executor Java
Spark Context Java Spark Context exec JNI Worker When an action actually takes place, the JavaSparkContext in the JVM will instruct the worker nodes to launch Spark executors, again inside their own JVMs. Each Spark executor then forks off a new R process that serves as an R worker. This R worker takes care of deserializing and loading the task closures, the list of R packages to include, broadcast variables, and so forth. The actual computation happens in this R worker process, and the results are communicated back to the Spark executor, which in turn communicates them back to the driver machine. Spark Executor R exec

21 One thing of particular interest here is the pipes we use for communication between the R worker processes and the executors running on the JVM. There are two parts to this communication: one is how we grab the dependencies of a closure (in other words, an anonymous function), and the other is how we serialize and deserialize these functions and their dependencies.

22 From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
The way we attack the first aspect is, of course, by traversing the old, lovely environment diagrams. Basically, in R, environment objects store the mappings between variable names and their values, and different environments are chained together by a parent relationship, as defined by R’s lexical scoping semantics. So if a closure uses a variable that is not defined in its own environment, we basically keep walking up this environment chain and grab the first value found.
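A small plain-R illustration of the environment chain in action:
x <- 10                   # defined in the global environment
make_adder <- function(y) {
  function(z) x + y + z   # x is not local: found by walking up to the global environment;
}                         # y is found in the enclosing function's environment
add <- make_adder(5)
add(1)                    # 16; SparkR must capture both x and y before shipping such a closure
environment(add)          # the closure's own environment, whose parent chain leads to globalenv()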

23 The second aspect of the communication issue is the way we communicate data.
Basically, we use R’s pretty robust native serialization function, which is called save(). It is particularly handy and mature; in fact, each time you finish playing around with an R shell and try to exit, you get a prompt asking whether or not to save the session, and this save() function does all the hard work of serializing all kinds of R objects to disk. In SparkR’s case, we feed the objects we want to serialize into this function, get back a byte array as the result, and then ship this byte array across the network.
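A minimal plain-R sketch of the round trip (the general idea, not SparkR's exact code):
serialize_to_bytes <- function(obj) {
  con <- rawConnection(raw(0), "wb")   # in-memory binary connection
  save(obj, file = con)                # save() can write any R object to a connection
  bytes <- rawConnectionValue(con)     # the resulting byte array, ready to ship over the network
  close(con)
  bytes
}
deserialize_from_bytes <- function(bytes) {
  con <- rawConnection(bytes, "rb")
  names <- load(con)                   # load() restores the saved object(s) into this frame
  close(con)
  get(names[1])                        # return the restored object
}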

24 Dataflow: Performance?
Local Worker Spark Executor R R Spark Context Java Spark Context exec JNI Worker Hopefully this explains the seemingly complicated communication process highlighted here. A natural follow-up question is: what if there are multiple transformations on an RDD? In Spark, transformations don’t fire off computation jobs because they have lazy semantics; when an action takes place, all previously uncomputed transformation functions are combined into one big function to ship over the network. In SparkR we want to follow the same semantics, partly to maintain similarity with Spark’s other APIs, and partly to avoid repeating the fairly costly serialization process described above every time a transformation is called on an RDD. Spark Executor R exec

25 …Pipeline the transformations!
words <- flatMap(lines, …) wordCount <- lapply(words, …) Spark Executor exec R SparkR’s solution is to introduce a pipelined RDD that does exactly that optimization. It combines the R functions from a series of transformations together and ships them all at once, so that Spark executors do not need to fork off a new R process for every transformation. This is one optimization that we currently do.
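Conceptually (a simplified sketch, not SparkR's actual internals), pipelining amounts to composing the per-partition R functions before shipping them:
# the closures the user wrote for two consecutive transformations
flatmap_fn <- function(part) unlist(lapply(part, function(line) strsplit(line, " ")[[1]]))
map_fn <- function(part) lapply(part, function(word) list(word, 1L))
# a pipelined RDD ships one composed function instead of two,
# so the executor launches a single R worker per partition
pipelined_fn <- function(part) map_fn(flatmap_fn(part))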

26 Alpha developer release
One line install! install_github("amplab-extras/SparkR-pkg", subdir="pkg")
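For context, install_github() comes from the devtools package, so a complete first-time setup would look roughly like this (a sketch; the package is loaded as SparkR, the name used in the repo):
install.packages("devtools")                               # install_github() lives in devtools
library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
library(SparkR)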

27 SparkR Implementation
Lightweight 292 lines of Scala code 1694 lines of R code 549 lines of test code in R …Spark is easy to extend!

28 In the Roadmap Calling MLLib directly within SparkR Data Frame support Better integration with R packages Performance: daemon R processes Speaking of extension….

29 On GitHub
EC2 setup scripts All Spark examples MNIST demo Hadoop2, Maven build

30 SparkR: Combine scalability & utility
RDD :: distributed lists Closures & Serialization Re-use R packages

31 Thanks! https://github.com/amplab-extras/SparkR-pkg
Shivaram Venkataraman Zongheng Yang Spark User mailing list

33 Pipelined RDD Here’s an illustration of the effect (diagram: Spark executors each exec an R worker process that runs the pipelined transformations).

34 SparkR | Processing Engine: Spark | Cluster Manager: Mesos / YARN / … | Storage: HDFS / HBase / Cassandra / …
Let’s take a high-level look at SparkR. This is a graph that shows a common Spark stack. At the top, you have the processing engine, which is Spark. Optionally, you can have a third-party cluster manager, such as Mesos or YARN, which basically just manages tasks across all the workers in a cluster. At the bottom is the storage layer: Spark supports reading from popular data formats and data sources such as HDFS, HBase, and Cassandra. So where does SparkR fit into this picture? Right here: SparkR lets users write R programs and provides an interface into Spark. In other words, your familiar R programs can be run on Spark, utilizing the power of cluster computing while retaining the aforementioned benefits of R.

36 Example: Logistic Regression
pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n=D, min = -1, max = 1)
# Logistic gradient
gradient <- function(partition) {
  Y <- partition[,1]; X <- partition[,-1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

37 Example: Logistic Regression
pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n=D, min = -1, max = 1)
# Logistic gradient
gradient <- function(partition) {
  Y <- partition[,1]; X <- partition[,-1]
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}
# Iterate
weights <- weights - reduce(
  lapplyPartition(pointsRDD, gradient), "+")
Write jobs in R. Use R shell. Support R packages
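To run gradient descent to completion, the update step would simply be wrapped in a loop, roughly like this (a sketch; the iteration count and step size alpha are illustrative additions, not part of the slide):
alpha <- 0.01                        # illustrative step size
for (i in 1:10) {
  grad <- reduce(lapplyPartition(pointsRDD, gradient), "+")
  weights <- weights - alpha * grad  # the gradient closure re-captures the updated weights each pass
}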

38 How does it work ? (diagram: R Shell connects to the Spark Context via rJava; functions are shipped as Array[Byte]; data is stored as RDD[Array[Byte]]; each Spark Executor runs an RScript process)

