Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica (NSDI '12)
Presented by Peifeng Yu, EECS 582 – F16

Table of Contents
- Background
- RDD in a nutshell
- Spark: the implementation
- Evaluation
- What's new in Spark

Yet Another Cluster Computing Framework?

Related Work
Why in-memory computation? Disk I/O is the bottleneck in disk-based (MapReduce) systems; for iterative algorithms and interactive data mining tools, keeping data in memory can boost performance by an order of magnitude. The key is data reuse.
- General-purpose systems (MapReduce, Dryad): results must be written to external storage to enable data reuse across computations (iterative algorithms, ad-hoc queries)
- Specialized systems (Pregel: iterative graph computation; HaLoop: loops of MapReduce steps): don't generalize
- In-memory storage on a cluster (distributed shared memory, key-value stores / databases, Piccolo's key-value table interface for sharing distributed, mutable state): fine-grained updates to mutable state make efficient fault tolerance hard to implement; replication or logging is the only option

RDD in a Nutshell
Resilient Distributed Dataset: a general-purpose distributed memory abstraction
- In-memory
- Immutable: can only be created through deterministic operations (transformations)
- Atomic piece of data: the partition
- Fault-tolerant
- Only suitable for batch analytics; asynchronous applications are not a good fit

RDD - Operations
- Transformations: map, filter, union, join, etc.
  + RDDs can only be created through deterministic transformations on data in stable storage or on other RDDs
- Actions: count, collect, reduce, lookup, save
  + Actions are used to get the final result out (more detail in the Spark section)
- Immutability is not a problem: mutable state can be represented by multiple RDDs holding successive versions of a dataset
- Data flow: External Source -> (Transformation) -> RDD1 -> (Transformation) -> RDD2 -> (Action) -> External Result (illustrated in the sketch below)
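To make the transformation/action distinction concrete, here is a minimal PySpark sketch; the input path and the "ERROR" filter predicate are placeholders, not from the original slides:

from pyspark import SparkContext

sc = SparkContext("local", "RDDOperationsDemo")

# Transformations are lazy: nothing runs when these lines execute.
lines = sc.textFile("path/to/input/file")        # external source -> RDD
errors = lines.filter(lambda l: "ERROR" in l)    # transformation: RDD -> RDD
words = errors.flatMap(lambda l: l.split())      # transformation: RDD -> RDD

# Actions trigger the actual computation and return results to the driver.
print(words.count())   # action: total number of words
print(words.take(5))   # action: first few elements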

RDD - Fault Tolerance
- Instead of storing the actual data, store the lineage: how the partitions were derived from other datasets
- Checkpoint if the lineage gets too long
- Thanks to coarse-grained operations, the operation log is significantly smaller than the actual data
- An RDD need not be materialized at all times; it only needs to know how it was derived from other datasets
- Persistence and partitioning can be controlled: a dataset in the lineage can be made persistent either by user request or, for long lineages, automatically by the runtime
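As a rough illustration of how lineage and checkpointing surface in the PySpark API (the checkpoint directory is a placeholder; this is a sketch, not the paper's code):

from pyspark import SparkContext

sc = SparkContext("local", "LineageDemo")
sc.setCheckpointDir("path/to/checkpoint/dir")  # placeholder directory

data = sc.parallelize(range(1000))
derived = data.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# toDebugString shows the lineage graph used for recovery.
print(derived.toDebugString())

# For long lineages, cut the chain by checkpointing to stable storage.
derived.checkpoint()
derived.count()  # the action forces computation; the checkpoint is written here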

RDD - Persistence & Partitioning
- Persistence levels: in-memory, disk-backed, replicated
- Partitioning: default hashing, or user defined
- Persistence levels (7 in total):
  + MEMORY_ONLY (default): store the RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are not cached and are recomputed on the fly each time they are needed
  + MEMORY_AND_DISK: as above, but partitions that do not fit in memory are stored on disk and read from there when needed
  + MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition); generally more space-efficient than deserialized objects, especially with a fast serializer, but more CPU-intensive
  + MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed on the fly each time they are needed
  + DISK_ONLY: store the RDD partitions only on disk
  + MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but each partition is replicated on two cluster nodes
- Advantages: useful when re-computation is more costly than storage I/O; helps improve data locality
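A brief PySpark sketch of choosing a storage level and a partitioning, assuming a toy pair RDD and a partition count picked only for illustration:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "PersistPartitionDemo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Keep the RDD in memory, spilling to disk if it does not fit.
cached = pairs.persist(StorageLevel.MEMORY_AND_DISK)

# Hash-partition into 4 partitions (a custom partitionFunc could be passed instead).
partitioned = cached.partitionBy(4)

print(partitioned.getNumPartitions())  # -> 4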

RDD vs. DSM (distributed shared memory)
- Reads: RDDs allow coarse- or fine-grained reads; DSM allows fine-grained reads
- Writes: RDDs allow only coarse-grained writes; DSM allows fine-grained writes
- Consistency: trivial for RDDs (immutable); up to the app / runtime for DSM
- Fault recovery: fine-grained and low-overhead for RDDs using lineage; DSM requires checkpoints and program rollback
- Straggler mitigation: possible for RDDs using backup tasks; difficult for DSM
- Work placement: automatic for RDDs, based on data locality; up to the app for DSM (runtimes aim for transparency)
- Behavior if not enough RAM: RDDs degrade gracefully, similar to existing data-flow systems; DSM suffers poor performance (swapping)
- Consequence: RDDs are only suitable for batch analytics; asynchronous applications are not a good fit

Spark: the Implementation
- Spark exposes RDDs through a language-integrated API: a dataset is an object, and transformations are methods on it
- Actions are operations that return a value to the application or export data to a storage system; RDDs are lazily computed
- Persistence: in-memory, spill to disk, replication, priority
- Closures: transformations that require user functions ship those functions over the network as closures; instead of moving the data around, the code is moved, which is far smaller than the data

Spark: the Implementation (components)
- Job scheduler: whenever the user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute
- Memory manager
- Interactive interpreter (ships closures around)
- Spark is not a cluster manager; it runs on Mesos, YARN, or in standalone mode (added later)
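A hedged sketch of how an application selects a cluster manager through SparkConf; "local[4]" and the alternatives in the comment are the standard master URL forms, and any host/port values would be deployment specific:

from pyspark import SparkConf, SparkContext

# Choose where the application runs: local threads, a standalone Spark
# cluster, YARN, or Mesos. The values below are placeholders.
conf = (SparkConf()
        .setAppName("MasterSelectionDemo")
        .setMaster("local[4]"))  # e.g. "spark://host:7077", "yarn", "mesos://host:5050"

sc = SparkContext(conf=conf)
print(sc.master)
sc.stop()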

Spark: the Scheduler
- Maps the high-level logical RDD representation to low-level tasks
- Builds a DAG of stages to execute based on the RDD's lineage; transformations are lazily computed
- Factors: data locality, pipelining, worker fault tolerance
- Data locality: if a task needs to process a partition that is available in memory on a node, it is sent to that node; otherwise, if the partition's RDD provides preferred locations (e.g., an HDFS file), the task is sent to those nodes

Revisit RDD Dependencies
- Narrow dependencies: pipelined, partition-wise execution; easy recovery
- Wide dependencies: all parent partitions must be present to compute any child partition; full re-computation may be needed for recovery
(narrow vs. wide is illustrated in the sketch below)
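A small PySpark sketch contrasting a narrow dependency (map) with a wide dependency (reduceByKey, which introduces a shuffle); the word list is made up for illustration:

from pyspark import SparkContext

sc = SparkContext("local", "DependencyDemo")

words = sc.parallelize(["spark", "rdd", "spark", "lineage"])

# Narrow dependency: each output partition depends on exactly one parent
# partition, so map can be pipelined within a stage.
pairs = words.map(lambda w: (w, 1))

# Wide dependency: reduceByKey needs data from all parent partitions for a
# given key, so it introduces a shuffle and a stage boundary.
counts = pairs.reduceByKey(lambda a, b: a + b)

# The lineage string shows the shuffle boundary between stages.
print(counts.toDebugString())
print(counts.collect())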

Spark: the Scheduler (stages)
- Stage: a set of pipelined transformations with narrow dependencies
- Stage boundaries: the shuffle operations required by wide dependencies, or already-computed partitions

Spark: the Scheduler (fault tolerance)
- If a task fails, it is re-run on another node, as long as its stage's parents are still available
- If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), tasks to compute the missing partitions are resubmitted in parallel
- Only worker failures are tolerated; scheduler (master) failure can be recovered with an additional service such as ZooKeeper or a simple local-filesystem-based checkpoint
- Optimization for long lineages: checkpointing; it is left to the user to decide which RDDs to checkpoint
- For RDDs with narrow dependencies on data in stable storage, such as the points in the logistic regression example (§3.2.1) or the link lists in PageRank, checkpointing may never be worthwhile

Spark: the Memory Manager
- Options for storing persistent RDDs: in-memory vs. on-disk, deserialized vs. serialized, single copy vs. replicated
- Insufficient memory: LRU eviction (skipping the RDD currently being operated on), plus user-defined priority
- LRU policy: when a new RDD partition is computed but there is not enough space to store it, a partition from the least recently accessed RDD is evicted, unless that is the same RDD as the one with the new partition; in that case the old partition is kept in memory to prevent cycling partitions of the same RDD in and out
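Besides the automatic LRU policy, an application can manage cached partitions explicitly; a minimal sketch with an illustrative dataset:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "MemoryManagerDemo")

hot = sc.parallelize(range(100000)).map(lambda x: x * x)

# Ask Spark to keep this RDD in memory across actions.
hot.persist(StorageLevel.MEMORY_ONLY)
hot.count()   # materializes and caches the partitions
hot.sum()     # reuses the cached partitions

# Explicitly release the cached partitions when they are no longer needed,
# rather than waiting for LRU eviction under memory pressure.
hot.unpersist()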

Evaluation
- Iterative machine learning
- Limited memory
- Interactive data mining
- Others: refer to the paper for details
(Going through most of them quickly, only highlighting a few interesting points)

Evaluation: Iterative Machine Learning
- Workloads: logistic regression and k-means
- Systems compared: Hadoop; HadoopBinMem (a Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance); Spark
- Spark, first vs. later iterations: all systems read text input from HDFS in their first iteration
- Hadoop vs. Spark, first iteration: overheads in Hadoop's heartbeat protocol between workers and master
- HadoopBinMem vs. Spark, later iterations: overhead of the Hadoop software stack, overhead of HDFS, and the deserialization cost of converting binary records to usable in-memory Java objects
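For context, the logistic regression workload benefits from caching the point dataset and reusing it across iterations. Below is a minimal PySpark sketch in that spirit; the parsing helper, dimensionality, iteration count, and input path are assumptions, not the paper's exact code:

import numpy as np
from pyspark import SparkContext

sc = SparkContext("local", "LogisticRegressionSketch")
D = 10           # feature dimension (assumed)
ITERATIONS = 10  # iteration count (assumed)

def parse_point(line):
    # Hypothetical format: label followed by D features, space-separated.
    parts = [float(x) for x in line.split()]
    return parts[0], np.array(parts[1:])

# The cached RDD is reused by every iteration; this data reuse is what makes
# later iterations much faster than the Hadoop-based baselines.
points = sc.textFile("path/to/points").map(parse_point).cache()

w = np.random.rand(D)
for _ in range(ITERATIONS):
    gradient = points.map(
        lambda yx: (1.0 / (1.0 + np.exp(-yx[0] * yx[1].dot(w))) - 1.0) * yx[0] * yx[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print(w)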

Evaluation: Limited Memory
- Logistic regression with varying amounts of memory: graceful degradation
- With less memory available, more partitions are spilled to disk, resulting in longer execution time

Evaluation: Interactive Data Mining
- Not really the kind of instant feedback you would get from Google Instant Search
- Still quite usable, compared to several hundred seconds when working from disk

Take Away
- RDD
  + Immutable in-memory data partitions
  + Fault tolerance using lineage, with optional checkpointing
  + Lazily computed until the user requests a result
  + Limited set of operations, but still quite expressive
- Spark
  + Schedules computation tasks
  + Moves data and code around the cluster
  + Interactive interpreter

What's New in Spark
- Language bindings: Java, Scala, Python, R
- Libraries built on top of Spark:
  + Spark SQL: work with structured data; mix SQL queries with Spark programs
  + Spark Streaming: build scalable, fault-tolerant streaming applications
  + MLlib: scalable machine learning library
  + GraphX: API for graphs and graph-parallel computation
  + SparkNet: distributed neural networks on Spark (see the paper)
- Accepted to the Apache Incubator in 2013
- Spark Streaming uses sliding windows to form mini-batches, operating on a higher-level concept called the DStream
- MLlib achieves high performance because Spark excels at iterative computation; announced as up to 100x faster than MapReduce
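As a taste of the higher-level APIs, a minimal Spark SQL sketch that mixes the DataFrame API with a SQL query; the rows and the temporary view name are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# A tiny DataFrame of made-up rows.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# DataFrame API and SQL can be mixed freely.
df.filter(df.age > 30).show()

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()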

Code Example

Example - Inverted Index (Spark version, Python)

"""InvertedIndex.py"""
from pyspark import SparkContext

sc = SparkContext("local", "Inverted Index")

# each record is <docId, docContent>; wholeTextFiles yields (filePath, fileContent)
# pairs, assuming one document per file in the input directory
docData = sc.wholeTextFiles("path/to/input/dir")

# split words, producing records of type <word, docId>
docWords = docData.flatMap(lambda kv: [(word, kv[0]) for word in kv[1].split()])

# sort and then group by key; invIndex is of type <word, list<docId>>
invIndex = docWords.sortByKey().groupByKey()

# persist the result
invIndex.saveAsTextFile("path/to/output/dir")

Example - Inverted Index (MapReduce version, pseudo code)

map(String key, String value):
  // key: document id
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, key);

reduce(String key, Iterator values):
  // key: a word
  // values: a list of document ids
  sort(values)
  Emit(key, values)

What If
- What if we want to do streaming data analytics? What is the best way to do that given a batch processing system like Spark?
- What is the optimal partition function? Is it application specific?