Resilient Distributed Datasets (NSDI 2012) A Fault-Tolerant Abstraction for In-Memory Cluster Computing Piccolo (OSDI 2010) Building Fast, Distributed.

Slides:



Advertisements
Similar presentations
MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO
Advertisements

Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University.
UC Berkeley a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
MapReduce Online Veli Hasanov Fatih University.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
Matei Zaharia University of California, Berkeley Spark in Action Fast Big Data Analytics using Scala UC BERKELEY.
Comp6611 Course Lecture Big data applications Yang PENG Network and System Lab CSE, HKUST Monday, March 11, 2013 Material adapted from.
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Spark: Cluster Computing with Working Sets
Spark Fast, Interactive, Language-Integrated Cluster Computing Wen Zhiguang
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Frameworks (and Stream Processing) Aditya Akella.
Overview of Spark project Presented by Yin Zhu Materials from Hadoop in Practice by A.
Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.
Spark Fast, Interactive, Language-Integrated Cluster Computing.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Discretized Streams Fault-Tolerant Streaming Computation at Scale Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Berkley Data Analysis Stack (BDAS)
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
MapReduce Online Tyson Condie and Neil Conway UC Berkeley Joint work with Peter Alvaro, Rusty Sears, Khaled Elmeleegy (Yahoo! Research), and Joe Hellerstein.
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
In-Memory Cluster Computing for Iterative and Interactive Applications
Distributed Computations
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Spark Resilient Distributed Datasets:
In-Memory Cluster Computing for Iterative and Interactive Applications
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Cluster Computing for Iterative and Interactive Applications
Distributed Computations MapReduce
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
MapReduce.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
MapReduce M/R slides adapted from those of Jeff Dean’s.
Spark Streaming Large-scale near-real-time stream processing
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Spark. Spark ideas expressive computing system, not limited to map-reduce model facilitate system memory – avoid saving intermediate results to disk –
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Lecture 7: Practical Computing with Large Data Sets cont. CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier Special thanks.
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.
Massive Data Processing – In-Memory Computing & Spark Stream Process.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
CSCI5570 Large Scale Data Processing Systems Distributed Data Analytics Systems Slide Ack.: modified based on the slides from Matei Zaharia James Cheng.
Spark: Cluster Computing with Working Sets
Hadoop.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Spark Presentation.
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Introduction to Spark.
MapReduce Simplied Data Processing on Large Clusters
湖南大学-信息科学与工程学院-计算机与科学系
CS110: Discussion about Spark
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
CS639: Data Management for Data Science
5/7/2019 Map Reduce Map reduce.
Fast, Interactive, Language-Integrated Cluster Computing
MapReduce: Simplified Data Processing on Large Clusters
Lecture 29: Distributed Systems
CS639: Data Management for Data Science
Presentation transcript:

Resilient Distributed Datasets (NSDI 2012) A Fault-Tolerant Abstraction for In-Memory Cluster Computing Piccolo (OSDI 2010) Building Fast, Distributed Programs with Partitioned Tables Discretized Streams (HotCloud 2012) An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters MapReduce Online(NSDI 2010)

Motivation MapReduce greatly simplified “big data” analysis on large, unreliable clusters But as soon as it got popular, users wanted more: »More complex, multi-stage applications (e.g. iterative machine learning & graph processing) »More interactive ad-hoc queries Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation Complex apps and interactive queries both two things that MapReduce lacks: Efficient primitives for data sharing A means for pipelining or continuous processing

MapReduce System Model Designed for batch-oriented computations over large data sets – Each operator runs to completion before producing any output – Barriers between stages – Only way to share data is by using stable storage Map output to local disk, reduce output to HDFS

Examples iter. 1 iter Input HDFS read HDFS write HDFS read HDFS write Input query 1 query 2 query 3 result 1 result 2 result 3... HDFS read Slow due to replication and disk I/O, but necessary for fault tolerance

iter. 1 iter Input Goal: In-Memory Data Sharing Input query 1 query 2 query 3... one-time processing × faster than network/disk, but how to get FT?

Challenge How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Approach 1: Fine-grained E.g., Piccolo (Others: RAMCloud; DSM) Distributed Shared Table Implemented as an in-memory (dist) key-value store

Kernel Functions Operate on in-memory state concurrently on many machines Sequential code that reads from and writes to distributed table

Using the store get(key) put(key,value) update(key,value) flush() get_iterator(partition)

User specified policies … For partitioning Helps programmers express data locality preferences Piccolo ensures all entries in a partition reside on the same machine E.g., user can locate kernel with partition, and/or co-locate partitions of different related tables

User-specified policies … for resolving conflicts (multiple kernels writing) User defines an accumulation function (works if results independent of update order) … for checkpointing and restore Piccolo stores global state snapshot; relies on user to check-point kernel execution state

Fine-Grained: Challenge Existing storage abstractions have interfaces based on fine-grained updates to mutable state Requires replicating data or logs across nodes for fault tolerance »Costly for data-intensive apps »10-100x slower than memory write

Coarse Grained: Resilient Distributed Datasets (RDDs) Restricted form of distributed shared memory »Immutable, partitioned collections of records »Can only be built through coarse-grained deterministic transformations (map, filter, join, …) Efficient fault recovery using lineage »Log one operation to apply to many elements »Recompute lost partitions on failure »No cost if nothing fails

Input query 1 query 2 query 3... RDD Recovery one-time processing iter. 1 iter Input

Generality of RDDs Despite their restrictions, RDDs can express surprisingly many parallel algorithms »These naturally apply the same operation to many items Unify many current programming models »Data flow models: MapReduce, Dryad, SQL, … »Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (Haloop), bulk incremental, … Support new apps that these models don’t

Memory bandwidth Network bandwidth Tradeoff Space Granularity of Updates Write Throughput Fine Coarse LowHigh K-V stores, databases, RAMCloud Best for batch workloads Best for transactional workloads HDFSRDDs

Spark Programming Interface DryadLINQ-like API in the Scala language Usable interactively from Scala interpreter Provides: »Resilient distributed datasets (RDDs) »Operations on RDDs: transformations (build new RDDs), actions (compute and output results) »Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc)

Spark Operations Transformations (define a new RDD) map filter sample groupByKey reduceByKey sortByKey flatMap union join cogroup cross mapValues Actions (return a result to driver program) collect reduce count save lookupKey

Task Scheduler Dryad-like DAGs Pipelines functions within a stage Locality & data reuse aware Partitioning-aware to avoid shuffles join union groupBy map Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: G: = cached data partition

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2)) messages.persist() Block 1 Block 2 Block 3 Worker Master messages.filter(_.contains(“foo”)).count messages.filter(_.contains(“bar”)).count tasks results Msgs. 1 Msgs. 2 Msgs. 3 Base RDD Transformed RDD Action Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data E.g.: messages = textFile(...).filter(_.contains(“error”)).map(_.split(‘\t’)(2)) HadoopRDD path = hdfs://… HadoopRDD path = hdfs://… FilteredRDD func = _.contains(...) FilteredRDD func = _.contains(...) MappedRDD func = _.split(…) MappedRDD func = _.split(…) Fault Recovery HadoopRDDFilteredRDDMappedRDD

Fault Recovery Results Failure happens

Example: PageRank 1. Start each page with a rank of 1 2. On each iteration, update each page’s rank to Σ i ∈ neighbors rank i / |neighbors i | links = // RDD of (url, neighbors) pairs ranks = // RDD of (url, rank) pairs

Example: PageRank 1. Start each page with a rank of 1 2. On each iteration, update each page’s rank to Σ i ∈ neighbors rank i / |neighbors i | links = // RDD of (url, neighbors) pairs ranks = // RDD of (url, rank) pairs for (i <- 1 to ITERATIONS) { ranks = links.join(ranks).flatMap { (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }.reduceByKey(_ + _) }

Optimizing Placement links & ranks repeatedly joined Can co-partition them (e.g. hash both on URL) to avoid shuffles Can also use app knowledge, e.g., hash on DNS name links = links.partitionBy( new URLPartitioner()) reduce Contribs 0 join Contribs 2 Ranks 0 (url, rank) Ranks 0 (url, rank) Links (url, neighbors) Links (url, neighbors)... Ranks 2 reduce Ranks 1

PageRank Performance

Programming Models Implemented on Spark RDDs can express many existing parallel models »MapReduce, DryadLINQ »Pregel graph processing [200 LOC] »Iterative MapReduce [200 LOC] »SQL: Hive on Spark (Shark) [in progress] Enables apps to efficiently intermix these models All are based on coarse-grained operations

Spark: Summary RDDs offer a simple and efficient programming model for a broad range of applications Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery Issues?

Discretized Streams Putting in-memory frameworks to work…

Motivation Many important applications need to process large data streams arriving in real time – User activity statistics (e.g. Facebook’s Puma) – Spam detection – Traffic estimation – Network intrusion detection Target: large-scale apps that must run on tens- hundreds of nodes with O(1 sec) latency

Challenge To run at large scale, system has to be both: – Fault-tolerant: recover quickly from failures and stragglers – Cost-efficient: do not require significant hardware beyond that needed for basic processing Existing streaming systems don’t have both properties

Traditional Streaming Systems “Record-at-a-time” processing model – Each node has mutable state – For each record, update state & send new records mutable state node 1 node 3 input records push node 2 input records

Traditional Streaming Systems Fault tolerance via replication or upstream backup: node 1 node 3 node 2 node 1’ node 3’ node 2’ synchronization node 1 node 3 node 2 standby input

Traditional Streaming Systems Fault tolerance via replication or upstream backup: node 1 node 3 node 2 node 1’ node 3’ node 2’ synchronization node 1 node 3 node 2 standby input Fast recovery, but 2x hardware cost Only need 1 standby, but slow to recover

Traditional Streaming Systems Fault tolerance via replication or upstream backup: node 1 node 3 node 2 node 1’ node 3’ node 2’ synchronization node 1 node 3 node 2 standby input Neither approach tolerates stragglers

Observation Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently – Divide job into deterministic tasks – Rerun failed/slow tasks in parallel on other nodes Idea: run a streaming computation as a series of very small, deterministic batches – Same recovery schemes at much smaller timescale – Work to make batch size as small as possible

Discretized Stream Processing t = 1: t = 2: stream 1 stream 2 batch operation pull input …… immutable dataset (stored reliably) immutable dataset (output or state); stored in memory without replication immutable dataset (output or state); stored in memory without replication …

Parallel Recovery Checkpoint state datasets periodically If a node fails/straggles, recompute its dataset partitions in parallel on other nodes map input dataset Faster recovery than upstream backup, without the cost of replication output dataset

Programming Model A discretized stream (D-stream) is a sequence of immutable, partitioned datasets – Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark Deterministic transformations operators produce new streams

API LINQ-like language-integrated API in Scala New “stateful” operators for windowing pageViews = readStream("...", "1s") ones = pageViews.map(ev => (ev.url, 1)) counts = ones.runningReduce(_ + _) t = 1: t = 2: pageViewsonescounts mapreduce... = RDD = partition Scala function literal sliding = ones.reduceByWindow( “5s”, _ + _, _ - _) Incremental version with “add” and “subtract” functions

Other Benefits of Discretized Streams Consistency: each record is processed atomically Unification with batch processing: – Combining streams with historical data pageViews.join(historicCounts).map(...) – Interactive ad-hoc queries on stream state pageViews.slice(“21:00”, “21:05”).topK(10)

D-Streams Summary D-Streams forgo traditional streaming wisdom by batching data in small timesteps Enable efficient, new parallel recovery scheme

MapReduce Online …..pipelining in map-reduce

Stream Processing with HOP Run MR jobs continuously, and analyze data as it arrives Map and reduce tasks run continuously Reduce function divides stream into windows – “Every 30 seconds, compute the 1, 5, and 15 minute average network utilization; trigger an alert if …” – Window management done by user (reduce)

Dataflow in Hadoop map reduc e Local FS HTTP GET

Hadoop Online Prototype HOP supports pipelining within and between MapReduce jobs: push rather than pull – Preserve simple fault tolerance scheme – Improved job completion time (better cluster utilization) – Improved detection and handling of stragglers MapReduce programming model unchanged – Clients supply same job parameters Hadoop client interface backward compatible – No changes required to existing clients E.g., Pig, Hive, Sawzall, Jaql – Extended to take a series of job

Pipelining Batch Size Initial design: pipeline eagerly (for each row) – Prevents use of combiner – Moves more sorting work to mapper – Map function can block on network I/O Revised design: map writes into buffer – Spill thread: sort & combine buffer, spill to disk – Send thread: pipeline spill files => reducers Simple adaptive algorithm

Pipeline request Dataflow in HOP Sched ule Schedule + Location map reduc e

Online Aggregation Traditional MR: poor UI for data analysis Pipelining means that data is available at consumers “early” – Can be used to compute and refine an approximate answer – Often sufficient for interactive data analysis, developing new MapReduce jobs,... Within a single job: periodically invoke reduce function at each reduce task on available data Between jobs: periodically send a “snapshot” to consumer jobs

Intra-Job Online Aggregation Approximate answers published to HDFS by each reduce task Based on job progress: e.g. 10%, 20%, … Challenge: providing statistically meaningful approximations – How close is an approximation to the final answer? – How do you avoid biased samples? Challenge: reduce functions are opaque – Ideally, computing 20% approximation should reuse results of 10% approximation – Either use combiners, or HOP does redundant work

Inter-Job Online Aggregation Write Answer HDFS map Job 2 Mappers reduc e Job 1 Reducers

Inter-Job Online Aggregation Like intra-job OA, but approximate answers are pipelined to map tasks of next job – Requires co-scheduling a sequence of jobs Consumer job computes an approximation – Can be used to feed an arbitrary chain of consumer jobs with approximate answers Challenge: how to avoid redundant work – Output of reduce for 10% progress vs. for 20%