1 Spark: Fast, Interactive, Language-Integrated Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica (UC Berkeley)

2 Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining
Enhance programmability:
- Integrate into the Scala programming language
- Allow interactive use from the Scala interpreter
Speaker notes: Point out that Scala is a modern PL, etc. Mention DryadLINQ (but we go beyond it with RDDs). Point out that interactive use and iterative use go hand in hand, because both require small tasks and dataset reuse.

3 Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
(Diagram: input flowing through map and reduce stages to output.)

4 Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
(Diagram: input flowing through map and reduce stages to output.)
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Speaker notes: Also applies to Dryad, SQL, etc. Benefits: easy to do fault tolerance and …

5 Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.

6 Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse.
Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability.
Support a wide range of applications.
Speaker notes: RDDs = a first-class way to manipulate and persist intermediate datasets.

7 Outline
- Spark programming model
- Implementation
- User applications

8 Programming Model
Resilient distributed datasets (RDDs):
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
- Can be cached for efficient reuse
Actions on RDDs: count, reduce, collect, save, … (see the sketch below)
Speaker notes: You write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel ops; accumulators are useful for sending information back, and cached variables are an optimization. Mention that cached variables are useful for some workloads that won't be shown here. Mention it's all designed to be easy to distribute in a fault-tolerant fashion.
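A minimal sketch of this model, as it might be written against a local SparkContext (modern org.apache.spark package names are used for concreteness; the talk's era used the plain spark package, and the data here is invented for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    // Local context for illustration; a real job would point at a cluster
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // An in-memory collection stands in for data in stable storage
    val nums = sc.parallelize(1 to 1000)

    // Transformations are lazy and return new immutable RDDs
    val squares = nums.filter(_ % 2 == 0).map(n => n * n)

    // cache() marks the RDD for in-memory reuse across actions
    squares.cache()

    // Actions force evaluation and return results to the driver
    println(squares.count())        // first action computes and caches the partitions
    println(squares.reduce(_ + _))  // second action reuses the cached data

    sc.stop()
  }
}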

9 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")             // base RDD
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count       // action
cachedMsgs.filter(_.contains("bar")).count
. . .

(Diagram: the driver ships tasks to workers; each worker caches its partition of the messages in memory, reading from HDFS blocks, and sends results back to the driver.)
Results: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
Speaker notes: Key idea: add "variables" to the "functions" in functional programming.

10 RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.
Example:
messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2))
(Diagram: HDFS file → filtered RDD via filter(func = _.contains(...)) → mapped RDD via map(func = _.split(...)).)
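Lineage can also be inspected programmatically; a small sketch assuming a SparkContext sc is in scope (for example in the Spark shell) and a log file at a hypothetical HDFS path:

// Build the same lineage chain as on the slide
val messages = sc.textFile("hdfs://namenode/logs/app.log")   // hypothetical path
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(2))

// toDebugString prints the chain of parent RDDs; if a cached partition is lost,
// Spark re-runs just these transformations to rebuild the missing partition
println(messages.toDebugString)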

11 Example: Logistic Regression
Goal: find the best line separating two sets of points.
(Diagram: two classes of points, with a random initial line converging to the target separator.)
Speaker notes: Note that the dataset is reused on each gradient computation.

12 Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)

Speaker notes: Key idea: add "variables" to the "functions" in functional programming.
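The slide's code assumes that readPoint, the Vector type, D, and ITERATIONS are defined elsewhere (spark here is the SparkContext). A minimal sketch of hypothetical definitions that would let the loop above compile, as they might be typed into the Spark shell; the names, input format, and dimensions are all invented for illustration:

import scala.math.exp
import scala.util.Random

val D = 10           // number of features (assumed)
val ITERATIONS = 5   // number of gradient steps (assumed)

// A tiny dense vector supporting just the operations the example needs
case class Vector(elems: Array[Double]) {
  def dot(other: Vector): Double = elems.zip(other.elems).map { case (a, b) => a * b }.sum
  def +(other: Vector): Vector = Vector(elems.zip(other.elems).map { case (a, b) => a + b })
  def -(other: Vector): Vector = Vector(elems.zip(other.elems).map { case (a, b) => a - b })
  def *(s: Double): Vector = Vector(elems.map(_ * s))
}
object Vector {
  def random(d: Int): Vector = Vector(Array.fill(d)(Random.nextDouble()))
}

// Lets the slide write scalar * vector as well as vector * scalar
implicit class ScalarOps(s: Double) {
  def *(v: Vector): Vector = v * s
}

case class Point(x: Vector, y: Double)   // y is the label, assumed +1 / -1

// Parse one "label f1 f2 ... fD" line into a Point (input format assumed)
def readPoint(line: String): Point = {
  val nums = line.split(' ').map(_.toDouble)
  Point(Vector(nums.tail), nums.head)
}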

13 Logistic Regression Performance
(Chart: Hadoop, 127 s per iteration; Spark, 174 s for the first iteration and 6 s for further iterations.)
This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

14 Spark Applications
- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- Collaborative filtering via matrix factorization
- …

15 Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
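A short sketch exercising a few of these operations, assuming a SparkContext sc is in scope (pair-RDD operations such as reduceByKey and join may require import org.apache.spark.SparkContext._ on older releases); the data is invented:

// Transformations build new RDDs lazily
val pets   = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
val owners = sc.parallelize(Seq(("cat", "alice"), ("dog", "bob")))

val counts = pets.reduceByKey(_ + _)       // ("cat", 3), ("dog", 1)
val joined = counts.join(owners)           // ("cat", (3, "alice")), ("dog", (1, "bob"))
val report = joined.mapValues { case (n, owner) => s"$owner has $n" }

// Actions return results to the driver program
println(report.collect().mkString(", "))
println(pets.count())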

16 Conviva GeoReport
Aggregations on many keys with the same WHERE clause.
(Chart: query time in hours.)
The 40× gain comes from:
- Not re-reading unused columns or filtered records
- Avoiding repeated decompression
- In-memory storage of deserialized objects

17 Implementation
Runs on Apache Mesos to share resources with Hadoop and other apps.
Can read from any Hadoop input source (e.g. HDFS).
NOT a variant of Hadoop; no changes to the Scala compiler.
(Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes.)

18 Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles (a sketch of such a job follows)
(Diagram: a DAG of RDDs A-G grouped into three stages connected by map, groupBy, union, and join; shaded boxes mark cached data partitions.)
Speaker notes: NOT a modified version of Hadoop.
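A sketch of a job with roughly the shape of the DAG above: the maps are pipelined with the reads inside a stage, while groupByKey and join introduce shuffle boundaries between stages. A SparkContext sc is assumed, and the paths and field layout are invented:

val clicks = sc.textFile("hdfs://.../clicks")     // hypothetical input
  .map(line => (line.split('\t')(0), 1))          // pipelined with the read in one stage
val users = sc.textFile("hdfs://.../users")
  .map(line => (line.split('\t')(0), line))

val grouped = clicks.groupByKey()                 // shuffle: starts a new stage
val counts  = grouped.mapValues(_.sum)            // pipelined within that stage
val joined  = counts.join(users)                  // another shuffle, unless already co-partitioned

joined.saveAsTextFile("hdfs://.../out")           // the action triggers the whole DAG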

19 Interactive Spark
Modified the Scala interpreter to allow Spark to be used interactively from the command line.
Required two changes:
- Modified wrapper code generation so that each line typed has references to objects for its dependencies
- Distribute generated classes over the network
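The result is that Spark programs can be built up a line at a time; a sketch of such a session, assuming the Spark shell provides a preconfigured SparkContext named sc (prompt and path are illustrative):

scala> val lines = sc.textFile("hdfs://.../logs")             // hypothetical path
scala> val errors = lines.filter(_.startsWith("ERROR")).cache()
scala> errors.count()                                         // closures are shipped to the workers
scala> errors.filter(_.contains("timeout")).count()           // reuses the cached data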

20 What is Spark Streaming?
Framework for large-scale stream processing:
- Scales to 100s of nodes
- Can achieve second-scale latencies
- Integrates with Spark's batch and interactive processing
- Provides a simple batch-like API for implementing complex algorithms
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

21 Motivation
Many important applications must process large streams of live data and provide results in near-real-time:
- Social network trends
- Website statistics
- Intrusion detection systems
- etc.
These applications require large clusters to handle the workloads and latencies of a few seconds.

22 Need for a framework
… for building such complex stream processing applications.
But what are the requirements from such a framework?

23 Requirements
- Scalable to large clusters
- Second-scale latencies
- Simple programming model

24 Case study: Conviva, Inc.
Real-time monitoring of online video metadata (HBO, ESPN, ABC, SyFy, …).
Two processing stacks:
- Custom-built distributed stream processing system: 1000s of complex metrics on millions of video sessions; requires many dozens of nodes for processing
- Hadoop backend for offline analysis: generating daily and monthly reports; similar computation as the streaming system

25 Case study: XYZ, Inc.
Any company that wants to process live streaming data has this problem:
- Twice the effort to implement any new function
- Twice the number of bugs to solve
- Twice the headache
Two processing stacks:
- Custom-built distributed stream processing system: 1000s of complex metrics on millions of video sessions; requires many dozens of nodes for processing
- Hadoop backend for offline analysis: generating daily and monthly reports; similar computation as the streaming system

26 Requirements
- Scalable to large clusters
- Second-scale latencies
- Simple programming model
- Integrated with batch & interactive processing

27 Stateful Stream Processing
Traditional streaming systems have an event-driven, record-at-a-time processing model:
- Each node has mutable state
- For each record, update state & send new records
State is lost if a node dies!
Making stateful stream processing fault-tolerant is challenging.
(Diagram: input records flowing through nodes 1-3, each holding mutable state.)
Speaker notes: Traditional streaming systems have what we call a "record-at-a-time" processing model. Each node in the cluster processing a stream has mutable state. As records arrive one at a time, the mutable state is updated, and a newly generated record is pushed to downstream nodes. Making this mutable state fault-tolerant is hard.

28 Requirements
- Scalable to large clusters
- Second-scale latencies
- Simple programming model
- Integrated with batch & interactive processing
- Efficient fault-tolerance in stateful computations

29 Spark Streaming Tathagata Das (TD) UC Berkeley

30 Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
(Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.)

31 Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
- Batch sizes as low as ½ second, latency ~1 second
- Potential for combining batch processing and streaming processing in the same system
(Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.)

32 Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream of data.
(Diagram: the Twitter Streaming API feeds the tweets DStream; each batch interval t, t+1, t+2 becomes an RDD stored in memory, immutable and distributed.)

33 Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

A transformation modifies the data in one DStream to create another DStream.
(Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream, e.g. [#cat, #dog, …]; new RDDs are created for every batch.)

34 Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

An output operation pushes data to external storage.
(Diagram: every batch of the hashTags DStream is saved to HDFS.)
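Putting the three lines together requires a StreamingContext around them; a sketch of a complete program, assuming the early alpha API shown on these slides (the StreamingContext constructor, twitterStream signature, and getTags helper are illustrative and changed in later releases; saveAsTextFiles is used because the saveAsHadoopFiles variant applies to key-value DStreams):

import spark.streaming.{Seconds, StreamingContext}

object TwitterHashTags {
  // Hypothetical helper from the slides: pull hashtag strings out of a tweet
  def getTags(status: twitter4j.Status): Seq[String] =
    status.getText.split(" ").filter(_.startsWith("#")).toSeq

  def main(args: Array[String]): Unit = {
    // 1-second batches; "local[2]" keeps one core free for the receiver
    val ssc = new StreamingContext("local[2]", "TwitterHashTags", Seconds(1))

    val tweets   = ssc.twitterStream("<Twitter username>", "<Twitter password>")
    val hashTags = tweets.flatMap(status => getTags(status))

    hashTags.saveAsTextFiles("hdfs://.../hashtags")   // one directory of files per batch

    ssc.start()   // nothing runs until the context is started
  }
}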

35 Java Example
Scala:
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java:
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });

The Function object defines the transformation.

36 Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.
Data lost due to worker failure can be recomputed from the replicated input data.
(Diagram: input data replicated in memory as the tweets RDD; a flatMap produces the hashTags RDD, whose lost partitions are recomputed on other workers.)

37 Key concepts
DStream – sequence of RDDs representing a stream of data: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another:
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
Output operations – send data to an external entity:
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
(A sketch combining these follows.)
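A small sketch tying the three concepts together on the hashTags stream from Example 1, again using the early alpha API (the foreach output operation was later renamed foreachRDD; paths are hypothetical, and incremental window operations need a checkpoint directory):

import spark.streaming.{Minutes, Seconds}

ssc.checkpoint("hdfs://.../checkpoints")                    // required for incremental window ops

// Stateful operation: hashtag counts over a 10-minute window, sliding every second
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

// Output operations push results out of Spark
tagCounts.saveAsTextFiles("hdfs://.../tagCounts")           // one directory of files per batch
tagCounts.foreach(rdd => rdd.take(10).foreach(println))     // e.g. print a sample every batch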

38 Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

(Diagram: in each batch t, t+1, t+2, tweets are transformed by flatMap into hashTags and then by map and reduceByKey into tagCounts, e.g. [(#cat, 10), (#dog, 25), …].)

39 Example 3 – Count the hashtags over last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

window is a sliding window operation: the first argument is the window length, the second the sliding interval.

40 Example 3 – Counting the hashtags over last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

(Diagram: a sliding window moves over the hashTags batches t-1 through t+3; countByValue counts over all the data in the window to produce tagCounts.)

41 Smart window-based countByValue
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

(Diagram: as the window slides, the counts from the new batch entering the window are added and the counts from the batch leaving the window are subtracted, rather than recounting the whole window.)
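The same add-and-subtract idea is exposed for general reductions through reduceByKeyAndWindow with an inverse reduce function; a sketch, assuming the hashTags stream and imports from the earlier sketches and, as before, that checkpointing is enabled:

val tagCounts = hashTags.map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add counts from the batch entering the window
    (a: Int, b: Int) => a - b,   // subtract counts from the batch leaving the window
    Minutes(10), Seconds(1))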

