A Spark in the Cloud: Iterative and Interactive Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley

Background
MapReduce and Dryad raised the level of abstraction in cluster programming by hiding scaling & faults
However, these systems provide a limited programming model: acyclic data flow
Can we design similarly powerful abstractions for a broader class of applications?

Spark Goals
Support applications with working sets (datasets reused across parallel operations)
»Iterative jobs (common in machine learning)
»Interactive data mining
Retain MapReduce's fault tolerance & scalability
Experiment with programmability
»Integrate into the Scala programming language
»Support interactive use from the Scala interpreter

Non-goals
Spark is not a general-purpose programming language
»One-size-fits-all architectures are also do-nothing-well architectures
Spark is not a scheduler, nor a resource manager

Mesos
»Generic resource scheduler with support for heterogeneous frameworks

Programming Model
Resilient distributed datasets (RDDs)
»Created from HDFS files or "parallelized" arrays
»Can be transformed with map and filter
»Can be cached across parallel operations
Parallel operations on RDDs
»Reduce, toArray, foreach
Shared variables
»Accumulators (add-only), broadcast variables
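
To tie these pieces together, here is a minimal sketch, assuming a spark context handle and the early Spark API used on the following slides (parallelize, filter, map, cache, reduce, foreach, accumulator, broadcast); it is illustrative only and not taken from the deck.

// RDD from a "parallelized" array, split into 10 slices
val nums = spark.parallelize(1 to 1000000, 10)
// Transform it and mark the result for caching across parallel operations
val evens = nums.filter(_ % 2 == 0).cache()
// Parallel operation: reduce
val sum = evens.map(_.toLong).reduce(_ + _)
// Shared variable: an add-only accumulator updated from within tasks
val count = spark.accumulator(0)
evens.foreach(_ => count += 1)
// Shared variable: a read-only broadcast value visible to all tasks
val labels = spark.broadcast(Map(0 -> "even", 1 -> "odd"))
println(sum + " " + count.value + " " + labels.value(0))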

Example: Log Mining
Load "error" messages from a log into memory, then interactively search for various queries

lines = spark.textFile("hdfs://...")              // Base RDD
errors = lines.filter(_.startsWith("ERROR"))      // Transformed RDD
messages = errors.map(_.split('\t')(1))
cachedMsgs = messages.cache()                     // Cached RDD

cachedMsgs.filter(_.contains("foo")).count        // Parallel operation
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver sends tasks to workers and collects results; each worker reads one HDFS block (Block 1-3) and keeps a cached partition (Cache 1-3) in memory]

RDD Representation
Each RDD object maintains a lineage that can be used to rebuild slices of it that are lost / fall out of cache

Ex: cachedMsgs = textFile("log").filter(_.contains("error"))
                                .map(_.split('\t')(1))
                                .cache()

[Diagram: lineage chain HdfsRDD (path: hdfs://...) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(...)) -> CachedRDD; getIterator(slice) serves a slice from the local cache or recomputes it from HDFS]
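
A toy, single-machine sketch of what such a lineage-based representation might look like; the class layout and the getIterator method mirror the diagram above, but everything here is an illustrative assumption rather than Spark's actual implementation.

import scala.collection.mutable

// Each RDD knows how to produce one slice of itself, either directly or via its parent.
abstract class RDD[T] {
  def getIterator(slice: Int): Iterator[T]
}

// Leaf RDD: reads a slice from storage (stubbed as an in-memory map standing in for HDFS).
class HdfsRDD(blocks: Map[Int, Seq[String]]) extends RDD[String] {
  def getIterator(slice: Int): Iterator[String] = blocks(slice).iterator
}

// Derived RDDs remember their parent and the function applied to it: that is the lineage.
class FilteredRDD[T](parent: RDD[T], f: T => Boolean) extends RDD[T] {
  def getIterator(slice: Int): Iterator[T] = parent.getIterator(slice).filter(f)
}
class MappedRDD[T, U](parent: RDD[T], f: T => U) extends RDD[U] {
  def getIterator(slice: Int): Iterator[U] = parent.getIterator(slice).map(f)
}

// A cached RDD serves a slice from its local cache when present; if the slice was lost
// or evicted, it recomputes the slice by walking back up the lineage.
class CachedRDD[T](parent: RDD[T]) extends RDD[T] {
  private val cache = mutable.Map[Int, Seq[T]]()
  def getIterator(slice: Int): Iterator[T] =
    cache.getOrElseUpdate(slice, parent.getIterator(slice).toSeq).iterator
}

In this toy model, the cachedMsgs chain above is just new CachedRDD(new MappedRDD(new FilteredRDD(new HdfsRDD(blocks), (_: String).contains("error")), (_: String).split('\t')(1))).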

Example: Logistic Regression
Goal: find the best line separating two sets of points
[Figure: a cloud of + and – points with the target separating line and a random initial line]

Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
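
The code above leaves readPoint, Vector, D, ITERATIONS and the unqualified exp undefined. The sketch below fills them in purely so the example can be read end to end; every name, value and input format here is an assumption made for illustration, not a definition from the deck.

import scala.math.exp                        // the slide code calls exp(...) unqualified

// Toy dense vector with just the operations the examples use.
class Vector(val elements: Array[Double]) {
  def dot(other: Vector): Double =
    elements.zip(other.elements).map { case (a, b) => a * b }.sum
  def +(other: Vector) = new Vector(elements.zip(other.elements).map { case (a, b) => a + b })
  def -(other: Vector) = new Vector(elements.zip(other.elements).map { case (a, b) => a - b })
  def *(s: Double)     = new Vector(elements.map(_ * s))
  override def toString = elements.mkString("(", ", ", ")")
}
object Vector {
  def random(d: Int) = new Vector(Array.fill(d)(scala.util.Random.nextDouble()))
  def zeros(d: Int)  = new Vector(Array.fill(d)(0.0))
}
// Lets "scale * p.x" (a Double times a Vector) compile as written on the slide.
implicit class DoubleTimesVector(s: Double) { def *(v: Vector): Vector = v * s }

val D = 10                                   // number of features (assumed)
val ITERATIONS = 20                          // iteration count (assumed)
case class Point(x: Vector, y: Double)       // y is the label, +1.0 or -1.0

// Assumed input format: one point per line, "label x1 x2 ... xD", whitespace-separated.
def readPoint(line: String): Point = {
  val parts = line.trim.split("\\s+").map(_.toDouble)
  Point(new Vector(parts.tail), parts.head)
}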

Logistic Regression Performance
[Chart: 127 s / iteration for the Hadoop baseline; with Spark, the first iteration takes 174 s and further iterations take 6 s once the data is cached]

Example: Collaborative Filtering
Predict movie ratings for a set of users based on their past ratings of other movies
[Figure: ratings matrix R with movies as rows and users as columns; a few known ratings (1-5) and many unknown entries shown as "?"]

Matrix Factorization Model
Model R as the product of user and movie matrices A and B of dimensions U×K and M×K: R ≈ A Bᵀ
Problem: given a subset of R, optimize A and B

Alternating Least Squares
Start with random A and B
Repeat:
1. Fixing B, optimize A to minimize error on scores in R
2. Fixing A, optimize B to minimize error on scores in R
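
To spell out what "minimize error on scores in R" means, here are the standard ALS objective and per-row update, written in LaTeX notation for reference (λ is an optional regularization weight); the slides themselves do not give these formulas.

\min_{A,B} \sum_{(i,j) \in \Omega} \left( R_{ij} - a_i^\top b_j \right)^2 + \lambda \Big( \sum_i \lVert a_i \rVert^2 + \sum_j \lVert b_j \rVert^2 \Big)

where \Omega is the set of observed entries. With B fixed, each user row a_i has the closed-form update

a_i = \left( B_i^\top B_i + \lambda I \right)^{-1} B_i^\top r_i

where B_i stacks the rows of B for the movies user i has rated and r_i holds those ratings; the step that fixes A and updates each movie row b_j is symmetric.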

Serial ALS

val R = readRatingsMatrix(...)
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}
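
updateUser and updateMovie are never defined in the deck. The hypothetical sketch below (the RatingsMatrix type, its ratingsByUser method and the value of K are all invented for illustration, and Vector is the toy class sketched after the logistic regression code) exists mainly to show what each update touches: one user index, the whole movie matrix B, and the ratings R, which is why re-sending R becomes the problem called out on the next slide. For brevity it takes a few gradient steps rather than the exact least-squares solve given above.

// Hypothetical container for the observed ratings; ratingsByUser(i) yields the
// (movie index, rating) pairs that user i actually rated.
case class RatingsMatrix(byUser: Map[Int, Seq[(Int, Double)]]) {
  def ratingsByUser(i: Int): Seq[(Int, Double)] = byUser.getOrElse(i, Nil)
}

val K = 20                                    // latent dimension (assumed value)

def updateUser(i: Int, B: IndexedSeq[Vector], R: RatingsMatrix): Vector = {
  var a = Vector.random(K)
  for (_ <- 1 to 10) {                        // a few gradient steps, not an exact solve
    var grad = Vector.zeros(K)
    for ((j, r) <- R.ratingsByUser(i))
      grad += ((a dot B(j)) - r) * B(j)       // gradient of this user's squared-error terms
    a -= grad * 0.01                          // small fixed step size
  }
  a
}

updateMovie is the mirror image, using the ratings for one movie and the user matrix A.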

Naïve Spark ALS

val R = readRatingsMatrix(...)
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R)).toArray()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R)).toArray()
}

Problem: R re-sent to all nodes in each parallel operation

Efficient Spark ALS

val R = spark.broadcast(readRatingsMatrix(...))
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value)).toArray()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value)).toArray()
}

Solution: mark R as a "broadcast variable"

How to Implement Broadcast?
Just using broadcast variables gives a significant performance boost, but not enough for all apps
Example: ALS broadcasts 100s of MB per iteration, which quickly bottlenecked our initial HDFS-based broadcast (36% of each iteration was spent on broadcast)

Broadcast Methods Explored

Method              Results
NFS                 Server becomes bottleneck
HDFS                Scales further than NFS, but limited
Chained Streaming   Initial results promising, but straggler nodes cause problems
BitTorrent          Off-the-shelf BT adds too much overhead in a data center environment
SplitStream         Scales well in theory, but needs to be modified for fault tolerance

Broadcast Results

ALS Performance with Chained Streaming Broadcast

Language Integration
Scala closures are serializable objects
»Serialize on driver, load & run on workers
Not quite enough
»Nested closures may reference entire outer scope
»May pull in non-serializable variables not used inside
»Solution: bytecode analysis + reflection
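
A small illustration of the capture problem described above; the class and field names are made up, but the behavior is standard Scala. The closure passed to filter only reads threshold, yet because threshold is a field it compiles to a reference to the enclosing this, so naively serializing the closure would also try to ship the non-serializable conn. The bytecode analysis + reflection pass is what strips such unused references before a closure is sent to workers.

// Stand-in for a resource that only makes sense on the driver and is not Serializable
// (think sockets, file handles, database connections).
class DatabaseConnection

class Analysis {
  val conn = new DatabaseConnection   // lives on the driver only
  val threshold = 42

  def bigValues(nums: Seq[Int]): Seq[Int] =
    // Uses only threshold, but captures this (and therefore conn) when serialized naively.
    nums.filter(n => n > threshold)
}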

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
»Modified wrapper code generation so that each "line" typed has references to objects for its dependencies
»Place generated classes in a distributed filesystem
Enables in-memory exploration of big data
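
For context on the wrapper-code change: the Scala interpreter compiles each typed line into a small generated object, roughly as sketched below (the Line1 / Line2 names are illustrative, not the interpreter's real naming scheme). A closure built on a later line therefore references earlier generated classes, which is why those classes must be loadable by the workers and are placed in a distributed filesystem rather than a driver-local directory.

// Typed at the prompt:   val threshold = 10
object Line1 { val threshold = 10 }

// Typed at the prompt:   val big = Seq(3, 12, 40).filter(_ > threshold)
object Line2 { val big = Seq(3, 12, 40).filter(_ > Line1.threshold) }

// The closure "_ > Line1.threshold" carries a reference to the generated Line1 class,
// so any worker that runs it must be able to load Line1 (and the other generated classes).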

Demo

Conclusions
Spark provides a limited but efficient set of fault-tolerant distributed memory abstractions
»Resilient distributed datasets (RDDs)
»Restricted shared variables
Planned extensions:
»More RDD transformations (e.g., shuffle)
»More RDD persistence options (e.g., disk + memory)
»Updatable RDDs (for incremental or streaming jobs)
»Data sharing across applications

Related Work
DryadLINQ
»Build queries through language-integrated SQL operations on lazy datasets
»Cannot have a dataset persist across queries
»No concept of shared variables for broadcast etc.
Pig and Hive
»Query languages that can call into Java/Python/etc. UDFs
»No support for caching datasets across queries
OpenMP
»Compiler extension for parallel loops in C++
»Annotate variables as read-only or accumulator above loop
»Cluster version exists, but not fault-tolerant
Twister and HaLoop
»Iterative MapReduce implementations using caching
»Can't define multiple distributed datasets, run multiple map & reduce pairs on them, or decide which operations to run next interactively

Questions?

Backup

Architecture
Driver program connects to Mesos and schedules tasks
Workers run tasks, report results and variable updates
Data shared with HDFS/NFS
No communication between workers for now
[Diagram: user code in the driver submits tasks through Mesos to workers and receives results; each worker holds a local cache and shares data via HDFS]

Mesos Architecture
[Diagram: a Mesos master coordinates several Mesos slaves; per-framework schedulers (Hadoop v19, Hadoop v20, MPI) register with the master, and each slave runs the matching executors (Hadoop v19 executor, Hadoop v20 executor, MPI executor) with their tasks]

Serial Version

val data = readData(...)
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}

println("Final w: " + w)

Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}

println("Final w: " + w)

Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  data.foreach(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  })
  w -= gradient.value
}

println("Final w: " + w)

Functional Programming Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  w -= data.map(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(_+_)
}

println("Final w: " + w)

Job Execution
[Diagram, Spark: the master sends the current parameter to slaves 1-4; each slave keeps its partition (R1-R4) of the big dataset cached, computes on it, and the master aggregates the results and updates the parameter for the next iteration]

Job Execution
[Diagram, Spark vs. Hadoop / Dryad: in Spark the same slaves reuse cached partitions R1-R4 across iterations, with a single aggregate / update-param step per iteration; in Hadoop / Dryad every iteration launches a new set of map tasks (Map 1-4, then Map 5-8) and a reduce step, re-reading the data and re-aggregating each time]