
1 Presented by Peifeng Yu
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica (NSDI '12)
EECS 582 – F16
Presented by Peifeng Yu

2 Table of Contents
EECS 582 – F16
- Background
- RDD in a nutshell
- Spark: the implementation
- Evaluation
- What's new in Spark

3 Yet Another Cluster Computing Framework?

4 Related Work
General purpose: MapReduce, Dryad
Specialized: Pregel (iterative graph computation), HaLoop (loops of MapReduce steps)
In-memory storage: distributed shared memory, key-value stores / databases, Piccolo
Problem: external storage is needed for data reuse across computations (iterative algorithms, ad-hoc queries); the specialized systems don't generalize; efficient fault tolerance is hard to implement
- Why in-memory computation? Disk I/O is the bottleneck in disk-based (MapReduce) systems; iterative algorithms and interactive data mining tools can be sped up by an order of magnitude through data reuse in memory.
- Related systems:
  * MapReduce and Dryad: need to write results to external storage to enable data reuse between computations
  * Specialized systems (Pregel, HaLoop): don't generalize
  * In-memory storage on clusters: distributed shared memory, key-value stores, databases, Piccolo (shares distributed, mutable state via a key-value table interface); with fine-grained updates to mutable state, the only way to achieve fault tolerance is to replicate or log the data

5 RDD in a nutshell
EECS 582 – F16
Resilient Distributed Dataset: a general-purpose distributed memory abstraction
- In-memory
- Immutable
- Can only be created through deterministic operations (transformations)
- Atomic piece of data: the partition
- Fault-tolerant
* Only suitable for batch analytics; asynchronous applications are not suitable
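As a concrete illustration (not from the original slides), here is a minimal PySpark sketch of how RDDs come into existence only from stable storage or from other RDDs via deterministic transformations; the file path and sample data are made up for illustration:

from pyspark import SparkContext

sc = SparkContext("local", "RDDBasics")

# an RDD backed by data in stable storage (hypothetical path)
lines = sc.textFile("path/to/input/file")

# an RDD built from an in-memory collection, split into 2 partitions
nums = sc.parallelize([1, 2, 3, 4, 5], 2)

# new RDDs are derived only through deterministic transformations
errors = lines.filter(lambda line: "ERROR" in line)
squares = nums.map(lambda x: x * x)

Each of these RDDs is immutable: the transformations describe how to derive a new dataset rather than mutating data in place.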

6 RDD - Operations
EECS 582 – F16
Transformations: map, filter, union, join, etc.
Actions: count, collect, reduce, lookup, save
+ RDDs can only be created through deterministic operations (transformations) on data in stable storage or on other RDDs: map, filter, join, ...
+ Actions are used to get the final result out (more detail in the Spark section)
+ Immutability is not a problem: mutable state can be represented by multiple RDDs standing for multiple versions of a dataset
[Diagram: External Source -> (transformation) -> RDD1 -> (transformation) -> RDD2 -> (action) -> External Result]
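A small PySpark sketch (not from the slides; the data is invented) showing the split between lazy transformations and result-producing actions:

from pyspark import SparkContext

sc = SparkContext("local", "OpsDemo")

words = sc.parallelize(["spark", "rdd", "spark", "lineage"])

# transformations only describe new RDDs; nothing executes yet
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# actions trigger the actual computation and return results to the driver
print(counts.collect())  # e.g. [('rdd', 1), ('spark', 2), ('lineage', 1)]
print(counts.count())    # number of distinct words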

7 RDD - Fault Tolerance
EECS 582 – F16
Instead of replicating the actual data, store the lineage: how the partitions were derived from other datasets
Checkpoint if the lineage grows too long
+ Thanks to the coarse-grained operations, the operation log is significantly smaller than the actual data
+ An RDD need not be materialized at all times; it only needs to know how it was derived from other datasets
* Persistence and partitioning can be controlled
* Some dataset in the lineage can be made persistent, either by user request or, for long lineages, automatically by the runtime
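A quick way to see a lineage in PySpark (illustrative sketch, not from the slides) is toDebugString, which prints the chain of dependencies the runtime would use to recompute a lost partition:

from pyspark import SparkContext

sc = SparkContext("local", "LineageDemo")

base = sc.parallelize(range(1000), 4)
derived = base.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] > 100)

# the lineage records how 'derived' is computed from 'base'; a lost
# partition is recomputed from this recipe instead of from a replica
print(derived.toDebugString())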

8 RDD - Persistence & Partitioning
EECS 582 – F16
Persistence level: in-memory, disk-backed, replicated
Partitioning: default hashing or user-defined
Persistence levels (7 in total):
- MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
- MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
- MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). Generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
- DISK_ONLY: store the RDD partitions only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
Advantages: worthwhile when re-computation is more costly than storage I/O; helps improve data locality
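A hedged PySpark sketch (names and numbers are made up) of choosing a storage level and a partitioning explicitly:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "PersistDemo")

pairs = sc.parallelize([(i % 8, i) for i in range(1000)])

# pick a persistence level: MEMORY_AND_DISK spills partitions that do not
# fit in memory to disk instead of recomputing them
cached = pairs.persist(StorageLevel.MEMORY_AND_DISK)

# control partitioning: hash-partition into 8 partitions (the default
# partitioner), or supply a user-defined partition function
by_hash = pairs.partitionBy(8)
by_custom = pairs.partitionBy(8, lambda k: k % 4)

print(cached.count(), by_hash.getNumPartitions(), by_custom.getNumPartitions())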

9 RDD vs. DSM
EECS 582 – F16
Aspect: RDDs vs. distributed shared memory (DSM)
- Reads: coarse- or fine-grained vs. fine-grained
- Writes: coarse-grained vs. fine-grained
- Consistency: trivial (immutable) vs. up to the app / runtime
- Fault recovery: fine-grained and low-overhead using lineage vs. requires checkpoints and program rollback
- Straggler mitigation: possible using backup tasks vs. difficult
- Work placement: automatic based on data locality vs. up to the app (runtimes aim for transparency)
- Behavior if not enough RAM: similar to existing data flow systems vs. poor performance (swapping)
* Comparison with DSM: reads, writes, consistency, fault recovery, stragglers, work placement (the runtime can schedule tasks to improve data locality), out-of-memory behavior (degrades gracefully to a current data-parallel system)
* Only suitable for batch analytics; asynchronous applications are not suitable

10 Spark the implementation
EECS 582 – F16
- Spark exposes RDDs through a language-integrated API: datasets are objects, transformations are methods on them
* Actions: operations that return a value to the application or export data to a storage system. RDDs are lazily computed.
* Persistence: in-memory, spill to disk, replicas, priority
* Closures: transformations that take user functions ship those functions across the network as closures. Instead of moving the data around, the code is moved, which is much smaller than the data.

11 Spark the implementation
EECS 582 – F16
Spark provides:
- Job scheduler
- Memory manager
- Interactive interpreter (ships closures around)
Spark is not a cluster manager; it runs on Mesos, YARN, or in standalone mode (added later)
* Job scheduler: whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute
* Closures: transformations that take user functions ship those functions across the network as closures; instead of moving the data around, the code is moved, which is much smaller than the data (see the sketch below)
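An illustrative sketch (not from the slides) of what "shipping the closure" means: the lambda below and the driver-side variable it captures are serialized and sent to the workers, while the data partitions stay put. The variable name is invented:

from pyspark import SparkContext

sc = SparkContext("local", "ClosureDemo")

threshold = 100            # a driver-side variable captured by the closure
nums = sc.parallelize(range(1000))

# the lambda (code) plus the captured 'threshold' travel to the workers;
# the partitions of 'nums' are processed where they live
big = nums.filter(lambda x: x > threshold)

print(big.count())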

12 Spark The Scheduler
EECS 582 – F16
Maps the high-level logical RDD representation to low-level tasks
Builds a DAG of stages to execute based on the RDD's lineage (transformations are lazily computed)
Factors: data locality, pipelining, worker fault tolerance
* Data locality: if a task needs to process a partition that is available in memory on a node, send it to that node; otherwise, if the task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), send it to those nodes.

13 Revisit RDD Dependencies
EECS 582 – F16
Narrow dependencies: each parent partition is used by at most one child partition
- Pipelined execution, partition-wise
- Easy recovery
Wide dependencies: all parent partitions must be present to compute any child partition
- Full re-computation may be needed for recovery
[Diagram: examples of narrow dependencies vs. wide dependencies]
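A minimal PySpark sketch (invented data, not from the slides) contrasting the two dependency types: map and filter are narrow and can be pipelined in one stage, while reduceByKey is wide and forces a shuffle:

from pyspark import SparkContext

sc = SparkContext("local", "DepsDemo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

# narrow: each output partition depends on a single parent partition,
# so these two steps are pipelined within one stage
narrow = pairs.map(lambda kv: (kv[0], kv[1] * 10)).filter(lambda kv: kv[1] > 10)

# wide: reduceByKey needs data from all parent partitions,
# introducing a shuffle and hence a stage boundary
wide = narrow.reduceByKey(lambda a, b: a + b)

print(wide.collect())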

14 Spark The Scheduler
EECS 582 – F16
Stage: pipelined transformations with narrow dependencies
Stage boundaries:
- shuffle operations required by wide dependencies
- already computed partitions

15 Spark The Scheduler Fault tolerance
EECS 582 – F16
Fault tolerance:
- If a task fails, re-run it on another node, as long as its stage's parents are still available
- If some stages have become unavailable (e.g., an output from the "map side" of a shuffle was lost), resubmit tasks to compute the missing partitions in parallel
- Only worker failures are tolerated; the paper notes the scheduler does not yet tolerate its own failures, though replicating the RDD lineage graph would be straightforward. Scheduler (master) failure can be recovered with an additional service like ZooKeeper or a simple local-filesystem-based checkpoint.
Optimization for long lineages: checkpointing
- Left to the user to decide which RDDs to checkpoint (see the sketch below)
- For RDDs with narrow dependencies on data in stable storage, such as the points in the logistic regression example (§3.2.1) and the link lists in PageRank, checkpointing may never be worthwhile
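A hedged PySpark sketch (directory path and loop are invented) of user-driven checkpointing to truncate a long lineage:

from pyspark import SparkContext

sc = SparkContext("local", "CheckpointDemo")
sc.setCheckpointDir("path/to/checkpoint/dir")  # hypothetical directory

rdd = sc.parallelize(range(100))
for _ in range(50):              # build up a long chain of transformations
    rdd = rdd.map(lambda x: x + 1)

# materialize this RDD to stable storage and cut the lineage here;
# deciding which RDD is worth checkpointing is left to the user
rdd.checkpoint()
print(rdd.count())   # the action triggers the computation and the checkpoint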

16 Spark the Memory Manager
EECS 582 – F16
Options for storage of persistent RDDs:
- In-memory vs. on-disk
- Deserialized vs. serialized
- Single copy vs. replicas
Insufficient memory:
- LRU eviction (skipping the RDD currently being operated on)
- User-defined priority
* LRU: when a new RDD partition is computed but there is not enough space to store it, evict a partition from the least recently accessed RDD, unless it is the same RDD as the one with the new partition. In that case, keep the old partition in memory to prevent cycling partitions of the same RDD in and out.
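As an illustrative sketch (not from the slides; the RDD is invented), the user-facing side of this memory management is just persist and unpersist; eviction under pressure is handled by the LRU policy described above:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "MemoryDemo")

hot = sc.parallelize(range(1000000)).map(lambda x: x * x)

# keep a frequently reused RDD in memory across several actions...
hot.persist(StorageLevel.MEMORY_ONLY)
print(hot.count())
print(hot.sum())

# ...and release it explicitly once done, rather than waiting for the
# LRU policy to evict its partitions under memory pressure
hot.unpersist()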

17 Evaluation
EECS 582 – F16
- Iterative machine learning
- Limited memory
- Interactive data mining
- Others: refer to the paper for details
(Go through most of them quickly, only highlighting a few interesting points)

18 Evaluation Iterative machine learning
EECS 582 – F16
Iterative machine learning: logistic regression and k-means
Comparisons: Spark first vs. later iterations; Hadoop vs. Spark on the first iteration; HadoopBinMem vs. Spark on later iterations
Systems:
- Hadoop
- HadoopBinMem: a Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration (to eliminate text parsing in later ones) and stores it in an in-memory HDFS instance
- Spark
* Spark, first vs. later iterations: all systems read text input from HDFS in their first iterations
* Hadoop vs. Spark, first iteration: overhead of Hadoop's heartbeats between workers and master
* HadoopBinMem vs. Spark, later iterations: overhead of the Hadoop software stack, overhead of HDFS, and the deserialization cost of converting binary records into usable in-memory Java objects

19 Evaluation
EECS 582 – F16
Limited memory, logistic regression: graceful degradation
With less memory available, more partitions are saved to disk, resulting in longer execution time.

20 Evaluation Interactive data mining
EECS 582 – F16
Interactive data mining
- Not quite the kind of "instant feedback" you would get from Google Instant Search
- Still quite usable, compared to the several hundred seconds it takes when working from disk

21 Take Away
EECS 582 – F16
RDD:
- Immutable in-memory data partitions
- Fault tolerance using lineage, with optional checkpointing
- Lazily computed until the user requests a result
- Limited set of operations, but still quite expressive
Spark:
- Schedules computation tasks
- Moves data and code around the cluster
- Interactive interpreter

22 What’s new In Spark Language bindings Libraries built on top of Spark
EECS 582 – F16 Language bindings Java, Scala, Python, R Libraries built on top of Spark Spark SQL: working with structured data, mix SQL queries with Spark programs Spark Streaming: build scalable fault-tolerant streaming application MLlib: scalable machine learning library GraphX: API for graphs and graph-parallel computation SparkNet: distributed neural networks for Spark. Paper Accepted to Apache Incubator in 2013 Spark Streaming: use sliding to form mini batch, operation on higher level concept called DStream MLlib: high performance, because of Spark excels at iterative computation, announced to be 100x faster than MapReduce
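To make the Spark SQL bullet concrete, here is a tiny hedged sketch (table and data are invented; assumes a Spark 2.x SparkSession) of mixing SQL with regular Spark code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

# a small DataFrame standing in for structured data
df = spark.createDataFrame([(1, "spark"), (2, "rdd"), (3, "spark")],
                           ["id", "word"])
df.createOrReplaceTempView("words")

# an SQL query over the same data; the result is a normal DataFrame
top = spark.sql("SELECT word, COUNT(*) AS n FROM words GROUP BY word")
print(top.collect())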

23 Code Example EECS 582 – F16

24 Example - Inverted Index
EECS 582 – F16 Spark version (Python)

"""InvertedIndex.py"""
from pyspark import SparkContext

sc = SparkContext("local", "Inverted Index")

docFile = "path/to/input/file"  # should be some file on your system

# textFile yields one line per record; assume each line is
# "<docId>\t<docContent>" so a record carries both id and content
docData = sc.textFile(docFile).map(lambda line: line.split("\t", 1))

# split the content into words, producing <word, docId> pairs
docWords = docData.flatMap(lambda rec: [(wd, rec[0]) for wd in rec[1].split()])

# sort and then group by key; invIndex is of type <word, list<docId>>
invIndex = docWords.sortByKey().groupByKey()

# write the result to stable storage
invIndex.mapValues(list).saveAsTextFile("path/to/output/file")

25 Example - Inverted Index
EECS 582 – F16 MapReduce version (pseudo code)

map(String key, String value):
  // key: document id
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, key);

reduce(String key, Iterator values):
  // key: a word
  // values: a list of document ids
  sort(values)
  Emit(key, values)

26 What if
EECS 582 – F16
What if we want to do streaming data analytics: what is the best way, given a batch processing system like Spark?
What is the optimal partition function? Is it application-specific?

