In-Memory Cluster Computing for Iterative and Interactive Applications


Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley

Environment
(Speaker notes: Point out that Scala is a modern programming language. Mention DryadLINQ, but note that we go beyond it with RDDs. Point out that interactive use and iterative use go hand in hand, because both require small tasks and dataset reuse.)

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
[Diagram: Input → Map → Reduce → Output, an acyclic flow]

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
[Diagram: Input → Map → Reduce → Output]
(Speaker notes: Also applies to Dryad, SQL, etc. Benefits: easy to do fault tolerance and …)

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  - Iterative algorithms (machine learning, graphs)
  - Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.
(Speaker notes: Point out that both types of apps are actually quite common / desirable in data analytics.)

Example: Iterative Apps
[Diagram: Input feeds iteration 1 → result 1, iteration 2 → result 2, iteration 3 → result 3, …]
Each iteration is, for example, a MapReduce job.

Goal: Keep Working Set in RAM
[Diagram: Input goes through one-time processing into distributed memory; iterations 1, 2, 3, … then read the working set from memory rather than from stable storage.]

Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing distributed storage abstractions have interfaces based on fine-grained updates:
  - Reads and writes to cells in a table
  - E.g. databases, key-value stores, distributed memory
They require replicating data or logs across nodes for fault tolerance → expensive!

Solution: Resilient Distributed Datasets (RDDs)
Provide an interface based on coarse-grained transformations (map, group-by, join, …)
Efficient fault recovery using lineage:
  - Log one operation to apply to many elements
  - Recompute lost partitions on failure
  - No cost if nothing fails

RDD Recovery
[Diagram: the same iterative in-memory flow as before — Input goes through one-time processing into distributed memory and the iterations read it from there — replayed to illustrate how a lost in-memory partition is recovered.]

Generality of RDDs
Despite their coarse-grained interface, RDDs can express surprisingly many parallel algorithms, because these naturally apply the same operation to many items.
Capture many current programming models:
  - Data flow models: MapReduce, Dryad, SQL, …
  - Specialized models for iterative apps: BSP (Pregel), iterative MapReduce, bulk incremental processing
Also support new apps that these models don't.
(Speaker notes: Key idea: add "variables" to the "functions" in functional programming.)

Outline: Programming interface, Applications, Implementation, Demo

Spark Programming Interface
Language-integrated API in Scala. Provides:
  - Resilient distributed datasets (RDDs): partitioned collections with controllable caching
  - Operations on RDDs: transformations (define RDDs) and actions (compute results)
  - Restricted shared variables (broadcast variables, accumulators)
You write a single program, similar to DryadLINQ.
(Speaker notes: Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations; accumulators are useful for sending information back, and cached variables are an optimization. Mention cached variables are useful for some workloads that won't be shown here. Mention it's all designed to be easy to distribute in a fault-tolerant fashion.)
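
As a concrete illustration of these pieces, here is a hedged sketch written in the style of the examples later in this deck. The handle spark, the input path, and the parsing logic are assumptions; the broadcast and accumulator calls follow the published Spark API (spark.broadcast, spark.accumulator) rather than anything specific to this talk.

  // Sketch only: `spark` stands for a SparkContext-like handle, as in the slides below.
  val lookup   = spark.broadcast(Map("a" -> 1, "b" -> 2))   // read-only shared variable
  val badLines = spark.accumulator(0)                        // counter written by tasks

  val lines  = spark.textFile("hdfs://.../input")            // RDD backed by stable storage
  val parsed = lines.map(_.split('\t')).filter { fields =>
    val ok = fields.length >= 2
    if (!ok) badLines += 1                                    // accumulator update on a worker
    ok
  }
  val keyed = parsed.map(f => (f(0), lookup.value.getOrElse(f(0), 0))).cache()

  println(keyed.count())                                      // action: triggers the job
  println("malformed lines: " + badLines.value)               // read back in the driver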

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

  lines = spark.textFile("hdfs://...")              // base RDD
  errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count        // action
  cachedMsgs.filter(_.contains("bar")).count
  . . .

[Diagram: the driver ships tasks to workers; each worker reads a block of the file (Block 1-3), caches its partition of the messages (Cache 1-3), and returns results to the driver.]

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
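
If you want to reproduce this example end to end, a self-contained sketch follows. It uses the packaging and constructor names of later Apache Spark releases (org.apache.spark, SparkConf), which differ slightly from the research prototype on the slide; the local[*] master and the input path are placeholders.

  import org.apache.spark.{SparkConf, SparkContext}

  object LogMining {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[*]"))

      val lines      = sc.textFile("hdfs://...")             // base RDD
      val errors     = lines.filter(_.startsWith("ERROR"))   // transformed RDD
      val messages   = errors.map(_.split('\t')(2))
      val cachedMsgs = messages.cache()                       // keep the working set in memory

      // Each action below reuses the cached messages instead of re-reading the log.
      println(cachedMsgs.filter(_.contains("foo")).count())
      println(cachedMsgs.filter(_.contains("bar")).count())

      sc.stop()
    }
  }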

Fault Tolerance
RDDs track lineage information that can be used to efficiently reconstruct lost partitions.
Ex:
  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))
[Lineage graph: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]

Example: Logistic Regression
Goal: find best line separating two sets of points.
[Scatter plot: two classes of points (+ and –), a random initial line, and the target separating line]
(Speaker notes: Note that the dataset is reused on each gradient computation.)

Example: Logistic Regression

  val data = spark.textFile(...).map(readPoint).cache()
  var w = Vector.random(D)

  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)
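
The slide's snippet assumes a readPoint helper and a Vector class with dot and random that aren't shown. The hedged variant below fills those gaps with plain Scala arrays and a made-up "label then features" text format, keeping the same gradient computation; sc is assumed to be an existing SparkContext.

  import scala.math.exp
  import scala.util.Random

  case class Point(x: Array[Double], y: Double)   // y is the label, +1.0 or -1.0

  // Hypothetical format: one point per line, "label f1 f2 ... fD", space-separated.
  def readPoint(line: String): Point = {
    val tokens = line.split(' ').map(_.toDouble)
    Point(tokens.tail, tokens.head)
  }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum

  val D = 10
  val ITERATIONS = 10
  val data = sc.textFile("hdfs://.../points").map(readPoint).cache()

  var w = Array.fill(D)(2 * Random.nextDouble() - 1)           // random initial plane
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map { p =>
      val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)                                        // per-point gradient contribution
    }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    w = w.zip(gradient).map { case (wi, gi) => wi - gi }        // w -= gradient
  }
  println("Final w: " + w.mkString(", "))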

Logistic Regression Performance
[Chart: running time per iteration — 127 s / iteration with Hadoop; with Spark, 174 s for the first iteration and 6 s for further iterations]
This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

Example: Collaborative Filtering
Goal: predict users' movie ratings based on past ratings of other movies.
[Matrix R of user-movie ratings: mostly unknown entries (?) with a few known ratings, e.g. 1, 3, 4, 5]

Model and Algorithm
Model R as the product of user and movie feature matrices A and B of size U×K and M×K: R ≈ A Bᵀ
Alternating Least Squares (ALS):
  - Start with random A and B
  - Optimize user vectors (A) based on movies
  - Optimize movie vectors (B) based on users
  - Repeat until converged
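
For reference, the least-squares objective that the alternating updates minimize can be written as below; the regularization term with λ is a standard addition and may not match what the talk used.

  \min_{A,B} \sum_{(i,j)\,:\,R_{ij}\ \text{known}} \bigl(R_{ij} - a_i^{\top} b_j\bigr)^2
    \;+\; \lambda \Bigl(\sum_{i} \lVert a_i \rVert^2 + \sum_{j} \lVert b_j \rVert^2\Bigr)

Here a_i and b_j are rows of A and B. Fixing B makes the problem separable into one small least-squares solve per user vector a_i (and symmetrically for B with A fixed), which is the work updateUser and updateMovie perform on the following slides.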

Serial ALS

  var R = readRatingsMatrix(...)
  var A = // array of U random vectors
  var B = // array of M random vectors

  for (i <- 1 to ITERATIONS) {
    A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U), (0 until M) are Range objects
    B = (0 until M).map(i => updateMovie(i, A, R))
  }
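
readRatingsMatrix, updateUser, and updateMovie are left undefined on the slide. The stand-ins below are hypothetical: they use a single gradient pass per vector instead of the exact per-row least-squares solve a real ALS implementation would perform, just to make the structure of the loop above concrete.

  type Vec = Array[Double]

  // Hypothetical layout: R(i)(j) is user i's rating of movie j, 0.0 meaning "unrated".
  def readRatingsMatrix(path: String): Array[Array[Double]] = ???   // parsing left out

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Simplified stand-in for ALS's least-squares solve: one gradient pass over
  // user i's observed ratings against the current movie factors B.
  def updateUser(i: Int, B: Seq[Vec], R: Array[Array[Double]]): Vec = {
    val lr = 0.01
    val a  = Array.fill(B.head.length)(0.1)
    for (j <- B.indices if R(i)(j) != 0.0) {
      val err = R(i)(j) - dot(a, B(j))
      for (k <- a.indices) a(k) += lr * err * B(j)(k)
    }
    a
  }

  // Symmetric stand-in for the movie-side update against the user factors A.
  def updateMovie(j: Int, A: Seq[Vec], R: Array[Array[Double]]): Vec = {
    val lr = 0.01
    val b  = Array.fill(A.head.length)(0.1)
    for (i <- A.indices if R(i)(j) != 0.0) {
      val err = R(i)(j) - dot(A(i), b)
      for (k <- b.indices) b(k) += lr * err * A(i)(k)
    }
    b
  }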

Naïve Spark ALS

  var R = readRatingsMatrix(...)
  var A = // array of U random vectors
  var B = // array of M random vectors

  for (i <- 1 to ITERATIONS) {
    A = spark.parallelize(0 until U, numSlices)
             .map(i => updateUser(i, B, R))
             .collect()
    B = spark.parallelize(0 until M, numSlices)
             .map(i => updateMovie(i, A, R))
             .collect()
  }

Problem: R is re-sent to all nodes in each iteration.

Efficient Spark ALS
Solution: mark R as a broadcast variable.

  var R = spark.broadcast(readRatingsMatrix(...))
  var A = // array of U random vectors
  var B = // array of M random vectors

  for (i <- 1 to ITERATIONS) {
    A = spark.parallelize(0 until U, numSlices)
             .map(i => updateUser(i, B, R.value))
             .collect()
    B = spark.parallelize(0 until M, numSlices)
             .map(i => updateMovie(i, A, R.value))
             .collect()
  }

Result: 3× performance improvement

Scaling Up Broadcast
[Chart: broadcast time for the initial HDFS-based version vs Cornet broadcast]

Cornet Performance
[Chart: time to broadcast 1 GB of data to 100 receivers] [Chowdhury et al, SIGCOMM 2011]

Spark Applications
  - EM algorithm for traffic prediction (Mobile Millennium)
  - Twitter spam classification (Monarch)
  - In-memory OLAP & anomaly detection (Conviva)
  - Time series analysis
  - Network simulation
  - …

Mobile Millennium Project
Estimate city traffic using GPS observations from probe vehicles (e.g. SF taxis).

Sample Data
[Map: one day's worth of SF taxi measurements]
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Challenge
Data is noisy and sparse (1 sample/minute).
Must infer the path taken by each vehicle, in addition to the travel time distribution on each link.

Solution
EM algorithm to estimate paths and travel time distributions simultaneously.
[Data flow: observations → flatMap → weighted path samples → groupByKey → link parameters → broadcast]
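
A hedged sketch of how one such EM iteration might be expressed with this data flow in Spark. The observation and sample types, the samplePaths and reestimate helpers, and the parameter representation are all hypothetical stand-ins for the actual Mobile Millennium code; sc is a SparkContext and observations an assumed cached RDD[Observation].

  case class Observation(vehicleId: String, gpsPoints: Seq[(Double, Double)])
  case class PathSample(linkId: Int, travelTime: Double, weight: Double)

  // Hypothetical helpers standing in for the real model:
  def samplePaths(obs: Observation, params: Map[Int, Double]): Seq[PathSample] = ???
  def reestimate(samples: Iterable[PathSample]): Double =                        // weighted mean
    samples.map(s => s.weight * s.travelTime).sum / samples.map(_.weight).sum

  var linkParams: Map[Int, Double] = Map.empty[Int, Double].withDefaultValue(60.0)  // initial guess

  for (iter <- 1 to 10) {
    val paramsBc = sc.broadcast(linkParams)                  // ship parameters once per iteration

    linkParams = observations                                // E-step: sample weighted paths
      .flatMap(obs => samplePaths(obs, paramsBc.value))
      .map(s => (s.linkId, s))
      .groupByKey()                                          // gather samples per road link
      .mapValues(reestimate)                                 // M-step: new travel-time parameter
      .collect()
      .toMap
  }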

Results [Hunter et al, SOCC 2011]
3× speedup from caching, 4.5× from broadcast.

Cluster Programming Models
RDDs can express many proposed data-parallel programming models:
  - MapReduce, DryadLINQ
  - Bulk incremental processing
  - Pregel graph processing
  - Iterative MapReduce (e.g. HaLoop)
  - SQL
Allow apps to efficiently intermix these models.
(Speaker notes: Say it's because these all do data-parallel operations.)

Models We Have Built
  - Pregel on Spark (Bagel): 200 lines of code
  - HaLoop on Spark: 200 lines of code
  - Hive on Spark (Shark): 3000 lines of code; compatible with Apache Hive; ML operators in Scala

Implementation
Spark runs on the Mesos cluster manager [NSDI 11], letting it share resources with Hadoop and other apps.
Can read from any Hadoop input source (HDFS, S3, …).
No changes to the Scala language or compiler.
[Diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes]
(Speaker notes: Mention it's designed to be fault-tolerant.)

Outline: Programming interface, Applications, Implementation, Demo

Conclusion
Spark's RDDs offer a simple and efficient programming model for a broad range of apps.
Solid foundation for higher-level abstractions.
Join our open source community: www.spark-project.org

Related Work
  - DryadLINQ, FlumeJava: similar "distributed collection" API, but cannot reuse datasets efficiently across queries
  - GraphLab, Piccolo, BigTable, RAMCloud: fine-grained writes requiring replication or checkpoints
  - Iterative MapReduce (e.g. Twister, HaLoop): implicit data sharing for a fixed computation pattern
  - Relational databases: lineage/provenance, logical logging, materialized views
  - Caching systems (e.g. Nectar): store data in files, no explicit control over what is cached

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
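
A short illustrative sketch exercising a few of these operations; the data is made up, sc is an existing SparkContext, and the call names follow the Apache Spark pair-RDD API.

  val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
  val clicks = sc.parallelize(Seq((1, "home"), (1, "search"), (3, "home")))

  // Transformations only define new RDDs; no cluster work happens yet.
  val clickCounts = clicks.mapValues(_ => 1).reduceByKey(_ + _)      // (userId, numClicks)
  val withNames   = users.join(clickCounts)                          // (userId, (name, numClicks))
  val report      = withNames.map { case (_, (name, n)) => name + ": " + n }

  // Actions trigger execution and return results to the driver program.
  println(report.collect().mkString(", "))
  println(clicks.count())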

Job Scheduler
  - Dryad-like task DAG
  - Reuses previously computed data
  - Partitioning-aware to avoid shuffles
  - Automatic pipelining
  - NOT a modified version of Hadoop
[Diagram: a job DAG over RDDs A-G built from map, union, groupBy, and join, divided into Stages 1-3 at shuffle boundaries; legend: previously computed partition.]
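
To make "partitioning-aware to avoid shuffles" concrete, here is a hedged sketch using the Apache Spark API (HashPartitioner, groupByKey with a partitioner); the file paths and key extraction are placeholders, and the shuffle-avoidance behavior described in the comments is that of Apache Spark rather than anything stated on the slide.

  import org.apache.spark.HashPartitioner

  // A job shaped roughly like the DAG on the slide: one input is grouped (Stage 1),
  // two others are mapped and unioned (Stage 2), and the results are joined (Stage 3).
  val grouped = sc.textFile("hdfs://.../a")
                  .map(line => (line.take(2), line))
                  .groupByKey(new HashPartitioner(8))   // Stage 1 output, hash-partitioned
                  .cache()                              // reuse these partitions across jobs

  val mappedC = sc.textFile("hdfs://.../c").map(line => (line.take(2), line))
  val mappedD = sc.textFile("hdfs://.../d").map(line => (line.take(2), line))
  val unioned = mappedC.union(mappedD)                  // Stage 2: maps and union pipeline together

  // Stage 3: `grouped` already matches the join's partitioning, so (in Apache Spark)
  // only the unioned side needs to be shuffled here; cached partitions of `grouped`
  // are reused if a later job runs another action over the same lineage.
  val joined = grouped.join(unioned)
  println(joined.count())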

Fault Recovery Results
[Chart: iteration times with and without a failure; these results are for k-means.]

Behavior with Not Enough RAM