Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Presentation transcript:

Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Presentation by Antonio Lupher [Thanks to Matei for diagrams & several of the nicer slides!] October 26, 2011

The world today… Most current cluster programming models are based on acyclic data flow from stable storage to stable storage. [Diagram: Input → Map → Reduce → Output, an acyclic flow]

The world today… Most current cluster programming models are based on acyclic data flow from stable storage to stable storage. Benefits: fault tolerance is easy, and the runtime can decide where to run tasks and automatically recover from failures. This also applies to Dryad, SQL engines, etc.

… but this is inefficient for applications that repeatedly reuse a working set of data: iterative machine learning and graph algorithms (PageRank, k-means, logistic regression, etc.) and interactive data mining tools (R, Excel, Python) that run multiple queries on the same subset of data. These applications must reload data from disk on each query or stage of execution. Both types of applications are quite common and desirable in data analytics.

Goal: Keep Working Set in RAM. [Diagram: one-time processing loads the input into distributed memory, which is then reused across iterations 1, 2, 3, …] Not necessarily all the data, just what you need from one computation to the next.

Requirements: a distributed memory abstraction must be both fault-tolerant and efficient on large commodity clusters. How do we provide fault tolerance efficiently?

Requirements: Existing distributed storage abstractions offer an interface based on fine-grained updates (reads and writes to cells in a table), e.g. key-value stores, databases, distributed memory. They have to replicate data or logs across nodes for fault tolerance, which is expensive for data-intensive apps with large datasets.

Resilient Distributed Datasets (RDDs): immutable, partitioned collections of records, with an interface based on coarse-grained transformations (e.g. map, groupBy, join). Efficient fault recovery using lineage: log one operation that applies to all elements, and re-compute lost partitions of the dataset on failure; there is no cost if nothing fails.

RDDs, cont'd: Users can control persistence (in RAM vs. on disk), tunable via a persistence priority: the user specifies which RDDs should spill to disk first. Users can also control the partitioning of data, e.g. hash-partitioning to place data in convenient locations for subsequent operations. Fine-grained reads are still supported.
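A minimal sketch of these two controls, written against today's Spark API (HashPartitioner and StorageLevel are the modern names); the input path, the tab-separated format, and the partition count 64 are placeholders, and "sc" is the SparkContext the interactive shell provides.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Hypothetical (key, value) pairs read from a tab-separated file.
val pairs = sc.textFile("hdfs://...")
  .map { line => val f = line.split('\t'); (f(0), f(1)) }

// Control partitioning: hash-partition by key so that later joins/groupBys
// on the same key find co-located data and can avoid a shuffle.
val partitioned = pairs.partitionBy(new HashPartitioner(64))

// Control persistence: keep partitions in RAM, spilling to disk under memory pressure.
partitioned.persist(StorageLevel.MEMORY_AND_DISK)

partitioned.count()   // the first action materializes and caches the partitions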

Implementation: Spark runs on Mesos, so it shares resources with Hadoop and other apps, and it can read from any Hadoop input source (HDFS, S3, …). [Diagram: Spark, Hadoop, and MPI running on Mesos across cluster nodes] It is designed to be fault-tolerant. (Though why do you need Hadoop if you have Spark?) The language-integrated API is in Scala, about 10,000 lines of code, with no changes to Scala itself, and it can be used interactively from the interpreter.

Spark Operations. Transformations create a new RDD by transforming data in stable storage using data flow operators (map, filter, groupBy, etc.). Transformations are lazy: they don't need to be materialized at all times, since lineage information is enough to compute their partitions from data in storage when needed.

Spark Operations. Actions return a value to the application or export data to storage (count, collect, save, etc.). Actions require a value to be computed from the elements in the RDD, which triggers an execution plan.

Spark Operations. Transformations (define a new RDD): map, flatMap, filter, sample, groupByKey, reduceByKey, union, join, cogroup, crossProduct, mapValues, sort, partitionBy. Actions (return a result to the driver program): count, collect, reduce, lookup, save.
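A small interpreter-style illustration (made-up data) of this split: the transformations only record lineage, and only the action at the end forces an execution plan to run.

// Transformations: each call just defines a new RDD; nothing executes yet.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(x => x.toLong * x)      // map
val evens   = squares.filter(_ % 2 == 0)       // filter

// Action: forces the lineage above to be evaluated and returns a value to the driver.
val total = evens.reduce(_ + _)
println(total)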

RDD Representation. Common interface: a set of partitions; preferred locations for each partition; a list of parent RDDs; a function to compute a partition given its parents; and optional partitioning info (order, etc.). This simple common interface captures a wide range of transformations: the scheduler doesn't need to know what each operation does, and users can easily add new transformations; most can be implemented in ≤ 20 lines.
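To make that list concrete, here is a hypothetical, heavily simplified rendering of the common interface in Scala; the real Spark classes differ in names and detail.

trait Partition   { def index: Int }
trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                      // set of partitions
  def preferredLocations(p: Partition): Seq[String]   // e.g. hosts holding the HDFS block
  def parents: Seq[SimpleRDD[_]]                      // lineage: list of parent RDDs
  def compute(p: Partition): Iterator[T]              // compute a partition from the parents
  def partitioner: Option[Partitioner]                // optional partitioning info
}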

RDD Representation: Lineage & Dependencies. Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map, filter). Narrow dependencies allow pipelined execution.

RDD Representation: Lineage & Dependencies. Wide dependencies: multiple child partitions may depend on a single parent RDD partition (e.g. join). Wide dependencies require data from all parent partitions and a shuffle.
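An illustrative pipeline (hypothetical input and format) showing both kinds of dependency: the map and filter steps are narrow and can be pipelined in one pass over each partition, while groupByKey is wide and forces a shuffle.

val hits = sc.textFile("hdfs://...")                 // hypothetical input path
  .map(line => (line.split('\t')(0), 1))             // narrow: map
  .filter { case (user, _) => user.nonEmpty }        // narrow: filter, pipelined with map

// Wide: a child partition may need data from every parent partition,
// so the records are shuffled by key here.
val countsPerUser = hits.groupByKey().mapValues(_.sum)

countsPerUser.take(10).foreach(println)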

Scheduler: builds a task DAG (like Dryad), pipelines functions within a stage, reuses previously computed data, and is partitioning-aware to avoid shuffles. [Diagram: a lineage graph of RDDs A through G with join, union, groupBy, and map operations split into Stages 1-3; shaded boxes mark previously computed partitions] It is NOT a modified version of Hadoop. The scheduler examines the lineage graph to build a DAG of stages to execute; each stage tries to maximize pipelined transformations with narrow dependencies, and stage boundaries are the shuffle operations required for wide dependencies. Shuffle/wide dependencies currently materialize records on the nodes holding the parent partitions (like MapReduce materializes map outputs) for fault recovery.

RDD Recovery: What happens if a task fails? Exploit the fact that operations are coarse-grained: they are deterministic and affect all elements of the collection, so just re-run the task on another node if the parents are available. Lineage is the graph of transformations, so it is easy to regenerate RDDs given their parent RDDs plus lineage. This avoids checkpointing and replication, but you might still want to (and can) checkpoint: a long lineage is expensive to recompute, and intermediate results may have disappeared and need to be regenerated. Use the REPLICATE flag to persist with replication.
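A sketch of these two escape hatches, using today's API names as stand-ins (StorageLevel.MEMORY_ONLY_2 for replicated persistence and checkpoint() for truncating a long lineage); the pipeline and paths are placeholders.

import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("hdfs://...")       // hypothetical start of a long pipeline
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

// Replicate the in-memory partitions on two nodes instead of relying on
// lineage alone (the analogue of the REPLICATE persistence flag).
cleaned.persist(StorageLevel.MEMORY_ONLY_2)

// Checkpoint to stable storage so a long lineage chain, or vanished
// intermediate results, never have to be recomputed from scratch.
sc.setCheckpointDir("hdfs://...")             // hypothetical checkpoint directory
cleaned.checkpoint()

cleaned.count()                               // an action triggers both the compute and the checkpoint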

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count       // action: driver ships tasks to workers, results come back
messages.filter(_.contains("bar")).count       // reuses the cached messages

[Diagram: the driver sends tasks to workers; each worker reads one block of the file and caches its partition of messages]

Key idea: add "variables" to the "functions" in functional programming. Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

Fault Recovery Results: these results are for k-means. [Chart omitted]

Performance: Spark outperforms Hadoop by up to 20x by avoiding I/O and Java object [de]serialization costs; some apps see a 40x speedup (Conviva). It can query a 1 TB dataset with 5-7 second latencies. You write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are fairly standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations; accumulators are useful for sending information back, and cached variables are an optimization useful for some workloads not shown here. All of this is designed to be easy to distribute in a fault-tolerant fashion.

PageRank Results: 2.4x speedup over Hadoop on 30 nodes; controlling the partitioning raises this to 7.4x. Linear scaling to 60 nodes.

Behavior with Not Enough RAM [Chart omitted: results are for logistic regression]

Example: Logistic Regression. Goal: find the best line separating two sets of points. [Diagram: a random initial line converging toward the target separator between + and - points] Note that the dataset is reused on each gradient computation.

Logistic Regression Code:

val points = spark.textFile(...).map(parsePoint).persist()   // load the points once and keep them in RAM
var w = Vector.random(D)                                     // random initial plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce((a, b) => a + b)
  w -= gradient
}
println("Final w: " + w)

Logistic Regression Performance: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for each further iteration. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

More Applications: expectation-maximization (EM) algorithm for traffic prediction (Mobile Millennium); in-memory OLAP & anomaly detection (Conviva); Twitter spam classification (Monarch); Pregel on Spark (Bagel); alternating least squares matrix factorization.

Mobile Millennium Estimate traffic using GPS on taxis

Conviva GeoReport: aggregations on many keys with the same WHERE clause. [Chart omitted: query time in hours] The 40× gain comes from not re-reading unused columns or filtered records, avoiding repeated decompression, and in-memory storage of deserialized objects.

SPARK: use transformations on RDDs instead of Hadoop jobs, and cache RDDs for similar future queries, since many queries re-use subsets of the data (drill-down, etc.). Scala makes integration with Hive (Java) easy… or at least easier (Cliff, Antonio, Reynold).

Comparisons. DryadLINQ, FlumeJava: similar language-integrated "distributed collection" APIs, but they cannot reuse datasets efficiently across queries. Piccolo, DSM, key-value stores (e.g. RAMCloud): fine-grained writes, but more complex fault recovery. Iterative MapReduce (e.g. Twister, HaLoop), Pregel: implicit data sharing for a fixed computation pattern. Relational databases: lineage/provenance, logical logging, materialized views. Caching systems (e.g. Nectar): store data in files, with no explicit control over what is cached. In short: cluster programming models treat collections as files on disk or ephemeral data, which is not as efficient as RDDs; key-value stores and DSM are lower-level; iterative MapReduce is limited to a fixed computation pattern rather than general-purpose; RDBMSs need fine-grained writes, logging, and replication; caching systems like Nectar reuse intermediate results; MapReduce and Dryad track similar lineage, but it is lost after a job ends.

Comparisons: RDDs vs. Distributed Shared Memory (DSM)
Reads: fine-grained (RDDs) | fine-grained (DSM)
Writes: bulk transformations (RDDs) | fine-grained (DSM)
Consistency: trivial, since RDDs are immutable | up to the app / runtime
Fault recovery: fine-grained and low-overhead using lineage | requires checkpoints and program rollback
Straggler mitigation: possible using speculative execution | difficult
Work placement: automatic, based on data locality | up to the app (but the runtime aims for transparency)
Behavior if not enough RAM: similar to existing data flow systems | poor performance (swapping?)

Summary: RDDs are a simple and efficient model that is widely applicable: they can express models that previously required a new framework, with the same efficiency and optimizations. They achieve fault tolerance efficiently by providing coarse-grained operations and tracking lineage, and they exploit persistent in-memory storage plus smart partitioning for speed.

Thoughts: Tradeoffs. RDDs allow no fine-grained modification of elements in a collection, so they are not the right tool for all applications: e.g. the storage system for a web site, a web crawler, or anything where you need incremental/fine-grained writes. The implementation is Scala-based (so you probably won't see Microsoft use it anytime soon), but the concept of RDDs is not language-specific; the abstraction doesn't even require a functional language.

Thoughts: Influence. Factors that could promote adoption: inherent advantages (in-memory = fast, RDDs = fault-tolerant); easy to use and extend (it already supports MapReduce and Pregel via Bagel); used widely at Berkeley, with more projects coming soon; used at Conviva and Twitter; and Scala means easy integration with existing Java applications and (subjective opinion) is more pleasant to use than Java.

Verdict Should spark enthusiasm in cloud crowds