
Spark System

Background: Matei Zaharia
[June 2010, HotCloud 2010] Spark: Cluster Computing with Working Sets
[April 2012, NSDI 2012, Best Paper Award] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
[June 2012, HotCloud 2012] Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (Streaming Spark)
[November 2013, SOSP 2013] Discretized Streams: Fault-Tolerant Streaming Computation at Scale

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Motivation
MapReduce greatly simplified "big data" analysis on large, unreliable clusters, but it is a poor fit for some applications:
- Iterative computation (multiple stages)
- Interactive ad-hoc queries
This led to specialized frameworks, e.g., Pregel for graph processing.

Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing.
In MapReduce, the only way to share data across jobs is stable storage, which is slow!

Examples
[Diagram: an iterative job writes its result to HDFS after each iteration and reads it back for the next; independent queries 1-3 each re-read the same input from HDFS to produce results 1-3.]
Slow due to replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing
[Diagram: the same workloads with in-memory sharing: iterations pass data through memory, and queries 1-3 run on data kept in memory after one-time processing of the input.]
The challenge: design a distributed memory abstraction that is both fault-tolerant and efficient.

Resilient Distributed Datasets (RDDs)
A restricted form of distributed shared memory:
- Immutable, partitioned collections of records
- Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Efficient fault recovery using lineage:
- Log one operation to apply to many elements
- Recompute lost partitions on failure
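To make the restrictions concrete, here is a minimal Scala sketch against the released Spark RDD API; the local SparkContext setup and the input path are illustrative assumptions, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local context for illustration only
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-sketch"))

// Each transformation is coarse-grained: one logged operation applies to every element.
val words = sc.textFile("hdfs://.../corpus")   // hypothetical path
  .flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
// There is no element-level mutation: "changing" data means deriving a new RDD,
// and a lost partition is rebuilt by replaying these logged transformations.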

RDD Recovery
[Diagram: the in-memory data-sharing pattern from the previous slide (one-time processing, queries 1-3, iterations), used to illustrate how lost RDD partitions are recomputed rather than restored from replicas.]

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms, and unify many current programming models:
- Data flow models: MapReduce, Dryad, SQL, ...
- Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...

Tradeoff Space
[Chart: granularity of updates (fine to coarse) vs. write throughput (low to high). Fine-grained systems (K-V stores, databases, RAMCloud) are best for transactional workloads, with writes bounded by network bandwidth; coarse-grained systems (HDFS, RDDs) are best for batch workloads, with writes running at memory bandwidth.]

Spark Programming Interface
A DryadLINQ-like API in the Scala language. Provides:
- Resilient distributed datasets (RDDs)
- Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
- Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
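A hedged sketch of those three knobs using the shipped Spark API, reusing the sc from the earlier sketch; the input path and partition count are illustrative assumptions:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs://.../events")             // transformation: lazy
  .map(line => (line.split('\t')(0), line))
val byKey = pairs.partitionBy(new HashPartitioner(100))  // control layout across nodes
byKey.persist(StorageLevel.MEMORY_AND_DISK)              // keep in RAM, spill to disk
val n = byKey.count()                                    // action: triggers the job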

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")             // base RDD
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count         // action
messages.filter(_.contains("bar")).count         // action

[Diagram: the master turns each action into tasks sent to workers holding blocks 1-3; workers cache message partitions (Msgs. 1-3) in memory and return results.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Lineage graph: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))]
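In released Spark you can print the lineage a recovery would replay via RDD.toDebugString; a small sketch with a hypothetical path:

val msgs = sc.textFile("hdfs://.../log")   // HadoopRDD
  .filter(_.contains("error"))             // FilteredRDD
  .map(_.split('\t')(2))                   // MappedRDD
println(msgs.toDebugString)                // prints the MappedRDD <- FilteredRDD <- HadoopRDD chain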

Fault Recovery Results
[Chart: per-iteration running times; the iteration in which the failure happens slows down while lost partitions are recomputed in parallel, then subsequent iterations return to normal speed.]

Example: PageRank
- Links and ranks are repeatedly joined; the Links RDD is reused across iterations
- Can co-partition them (e.g., hash both on URL) to avoid shuffles
- Problem: generates a lot of intermediate RDDs
[Dataflow: Links (url, neighbors) and Ranks_0 (url, rank) are joined to produce Contribs_0, which is reduced into Ranks_1, and so on for Contribs_1, Ranks_2, ...]
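A sketch of this loop on the RDD API, assuming a hypothetical input of "url neighbor" pairs; the co-partitioning and persist calls are the optimizations the slide names:

import org.apache.spark.HashPartitioner

val linkPart = new HashPartitioner(8)                      // assumed partition count
val links = sc.textFile("hdfs://.../links")                // hypothetical path
  .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
  .groupByKey()
  .partitionBy(linkPart)   // co-partition links and ranks by URL hash
  .persist()               // Links RDD is reused every iteration

var ranks = links.mapValues(_ => 1.0)   // Ranks_0

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) =>
      val n = neighbors.size
      neighbors.map(dst => (dst, rank / n))
  }
  // Reducing with the same partitioner keeps ranks co-located with links,
  // so the next join does not shuffle the large links dataset.
  ranks = contribs.reduceByKey(linkPart, _ + _).mapValues(v => 0.15 + 0.85 * v)
}

Because links and ranks share the same hash partitioner, each join moves only the small rank updates; the heavyweight links data stays put.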

PageRank Performance
[Chart: per-iteration running times for PageRank on Hadoop vs. Spark.]

Implementation
Runs on Mesos [NSDI '11] to share clusters with Hadoop.
Can read from any Hadoop input source (HDFS, S3, ...).
No changes to the Scala language or compiler.
[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes.]

Programming Models Implemented on Spark
RDDs can express many existing parallel models:
- MapReduce, DryadLINQ
- Pregel graph processing [200 LOC]
- Iterative MapReduce [200 LOC]
- SQL: Hive on Spark (Shark)
All are based on coarse-grained operations; this enables apps to efficiently intermix these models.

Behavior with Insufficient RAM
[Chart: iteration time as the fraction of the working set held in memory varies.]

Scalability
[Charts: scaling results for Logistic Regression and K-Means.]

Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications. They leverage the coarse-grained nature of many parallel algorithms for low-overhead fault recovery.

Discretized Streams Fault-Tolerant Streaming Computation at Scale

Motivation
Many important applications need to process large data streams arriving in real time:
- User activity statistics (e.g. Facebook's Puma)
- Spam detection
- Traffic estimation
- Network intrusion detection
Target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency.

Challenge
To run at large scale, the system has to be both:
- Fault-tolerant: recover quickly from failures and stragglers
- Cost-efficient: do not require significant hardware beyond that needed for basic processing
Existing streaming systems don't have both properties.

Traditional Streaming Systems
"Record-at-a-time" processing model:
- Each node has mutable state
- For each record, update state and send new records
[Diagram: input records pushed through nodes 1-3, each holding mutable state.]

Fault Tolerance in These Systems
Two approaches: replication and upstream backup.
[Diagram: replication runs synchronized copies (nodes 1'-3') of nodes 1-3; upstream backup keeps one standby node.]
- Replication: fast recovery, but 2x hardware cost
- Upstream backup: only needs one standby, but slow to recover

Observation
Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently:
- Divide the job into deterministic tasks
- Rerun failed/slow tasks in parallel on other nodes
Idea: run a streaming computation as a series of very small, deterministic batches:
- Same recovery schemes, at a much smaller timescale
- Work to make the batch size as small as possible
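A conceptual Scala sketch of that idea (not the real Spark Streaming scheduler; sc, the per-interval record collection, and the output path are assumptions):

// Each interval's records become one immutable RDD, processed by an
// ordinary deterministic batch job whose failed/slow tasks can simply be rerun.
def processInterval(records: Seq[String], t: Long): Unit = {
  val batch = sc.parallelize(records)                     // immutable input dataset
  val counts = batch.map(r => (r, 1)).reduceByKey(_ + _)  // deterministic tasks
  counts.saveAsTextFile(s"hdfs://.../out/batch-$t")       // hypothetical output path
}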

Discretized Stream
[Diagram: at t = 1, t = 2, ..., each batch operation pulls an input (an immutable dataset, stored reliably) and produces immutable datasets (output or state), stored in memory without replication.]

Parallel Recovery
If a node fails or straggles, recompute its dataset partitions in parallel on other nodes.
[Diagram: a map from an input dataset to an output dataset; lost output partitions are rebuilt in parallel.]
Faster recovery than upstream backup, without the cost of replication.

Speed Up Batch Jobs
Prototype built on Spark: processes 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency.
[Chart: max throughput within a given latency bound (1 or 2 s).]

Programming Model
A discretized stream (D-Stream) is a sequence of immutable, partitioned datasets; specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark.
Deterministic transformation operators produce new streams.

API
LINQ-like language-integrated API in Scala:

pageViews = readStream("...", "1s")
ones = pageViews.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Diagram: at t = 1, t = 2, ..., map turns each pageViews RDD into ones, and reduce folds them into counts; each stream is a sequence of RDDs, one box per partition.]
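The snippet above uses the paper's prototype names (readStream, runningReduce). As a hedged translation into the Spark Streaming (DStream) API that later shipped, with the socket source and checkpoint path as illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("pageviews")
val ssc = new StreamingContext(conf, Seconds(1))          // 1 s batches
ssc.checkpoint("hdfs://.../checkpoints")                  // required for stateful ops

val pageViews = ssc.socketTextStream("localhost", 9999)   // one URL per line (assumed source)
val ones = pageViews.map(url => (url, 1))
val counts = ones.updateStateByKey[Int] {                 // running count per URL, like runningReduce
  (batch: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + batch.sum)
}
counts.print()
ssc.start()
ssc.awaitTermination()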

Evaluation: comparison with Storm

Evaluation: failure recovery

Evaluation: speculative execution

Conclusion
- D-Streams forgo traditional streaming wisdom by batching data in small timesteps
- This enables an efficient new parallel recovery scheme
- Users can seamlessly intermix streaming, batch, and interactive queries