
1 CSCI5570 Large Scale Data Processing Systems: Distributed Data Analytics Systems. Slide Ack.: modified based on the slides from Matei Zaharia. James Cheng, CSE, CUHK

2 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. NSDI 2012

3 Motivation
MapReduce greatly simplified “big data” analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
– More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
– More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

4 Motivation
Complex apps and interactive queries require reuse of intermediate results across multiple computations (common in iterative ML, DM, and graph algorithms)
Thus, need efficient primitives for data sharing
MapReduce lacks an abstraction for leveraging distributed memory
The only way to share data across jobs is stable storage, which is slow!

5 Examples
[Figure: iterative jobs and ad-hoc queries sharing data through HDFS, with an HDFS read and write between every step]
Slow due to replication and disk I/O, but necessary for fault tolerance

6 Goal: In-Memory Data Sharing
[Figure: one-time processing of the input, after which iterative jobs and ad-hoc queries share intermediate data in memory instead of through HDFS]
10-100× faster than network/disk, but how to get fault tolerance (FT)?

7 Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

8 Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state
– RAMCloud, databases, distributed shared memory, Piccolo
Requires replicating data or logs across nodes for fault tolerance
– Costly for data-intensive apps
– 10-100x slower than memory write

9 Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory
– Immutable, partitioned collections of records
– Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage
– Log the transformations used to build a dataset (i.e., its lineage) rather than the actual data
– If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition
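
To make the coarse-grained model concrete, a minimal sketch in Spark's Scala API. It assumes a SparkContext named sc and a made-up HDFS path; it is illustrative only and not taken from the original slides.

    // Fine-grained, mutable-state style (what RDDs deliberately do NOT offer):
    //   table.put(key, newValue)   // would need replication or logging to survive failures
    // Coarse-grained RDD style: build new immutable datasets from old ones.
    val pages    = sc.textFile("hdfs://.../pages")        // base RDD from stable storage
    val parsed   = pages.map(line => line.split(' '))     // deterministic transformation
    val longOnes = parsed.filter(_.length > 3)            // another transformation
    // Only the lineage (textFile -> map -> filter) is logged; a lost partition of
    // `longOnes` is recomputed by re-running these steps on the lost input block.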

10 RDD Recovery
[Figure: the same in-memory sharing picture, used to show that a lost RDD can be recomputed from the one-time processed input via its lineage rather than by replicating the data]

11 Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
– These naturally apply the same operation to many items
Unify many current programming models
– Data flow models: MapReduce, Dryad, SQL, …
– Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …
Support new apps that these models don’t

12 What RDDs are good and not good for
Good for:
– batch applications that apply the same operation to all elements of a dataset
– RDDs can efficiently remember each transformation as one step in a lineage graph
– recovery of lost partitions without having to log large amounts of data
Not good for:
– applications that need asynchronous fine-grained updates to shared state

13 Tradeoff Space
[Figure: design space plotted as granularity of updates (fine to coarse) against write throughput (low to high). Fine-grained, high-throughput systems such as K-V stores, databases, and RAMCloud are best for transactional workloads; coarse-grained systems are best for batch workloads, with RDDs reaching memory bandwidth while HDFS is limited to network bandwidth]

14 How to represent RDDs
Need to track lineage across a wide range of transformations
A graph-based representation
A common interface with:
– a set of partitions: atomic pieces of the dataset
– a set of dependencies on parent RDDs
– a function for computing the dataset based on its parents
– metadata about its partitioning scheme and data placement
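
A rough Scala sketch of that common interface, using illustrative names rather than Spark's exact internal API (the real RDD base class in org.apache.spark.rdd differs in detail):

    import org.apache.spark.{Dependency, Partition, Partitioner}

    // Illustrative only: one member per item in the list above.
    trait SimpleRDD[T] {
      def partitions: Array[Partition]                             // atomic pieces of the dataset
      def dependencies: Seq[Dependency[_]]                         // how this RDD depends on its parents
      def compute(split: Partition): Iterator[T]                   // derive one partition from the parents
      def partitioner: Option[Partitioner] = None                  // metadata: partitioning scheme
      def preferredLocations(split: Partition): Seq[String] = Nil  // metadata: data placement hints
    }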

15 RDD Dependencies
Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
Good for pipelined execution on one cluster node, e.g., apply a map followed by a filter on an element-by-element basis.
Efficient recovery, as only the lost parent partitions need to be recomputed (in parallel on different nodes).
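
A small illustrative sketch of a narrow-dependency chain (assuming a SparkContext sc; not taken from the original slides):

    // map and filter each have a one-to-one (narrow) dependency on their parent,
    // so Spark pipelines them inside a single task, element by element, without
    // materializing the intermediate `doubled` RDD.
    val nums    = sc.parallelize(1 to 1000000, 8)   // 8 partitions
    val doubled = nums.map(_ * 2)
    val evens   = doubled.filter(_ % 4 == 0)
    // Losing one partition of `evens` only requires recomputing the matching
    // partition of `nums`, not the whole dataset.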

16 RDD Dependencies
Wide dependencies: each partition of the parent RDD is used by multiple child partitions
Require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
Expensive recovery, as a complete re-execution is needed.
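
And a matching sketch of a wide dependency (again assuming a SparkContext sc; illustrative only):

    // reduceByKey must see every record with a given key, so data from all parent
    // partitions is shuffled across the cluster before the child partitions exist.
    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2), ("c", 5)))
    val counts = pairs.reduceByKey(_ + _)
    counts.collect()   // action that forces the shuffle; losing a partition of
                       // `counts` may require recomputing many parent partitions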

17 Outline
Spark programming interface
Implementation
How people are using Spark

18 Outline
Spark programming interface
Implementation
How people are using Spark

19 Spark Programming Interface
DryadLINQ-like API in the Scala language
Used interactively from the Scala interpreter
Provides:
– Resilient distributed datasets (RDDs)
– Operations on RDDs: transformations & actions
– Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
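
A brief, hedged sketch of those last two controls, partitioning and persistence (the path and the 8-character date key are made up for illustration; HashPartitioner and StorageLevel are standard Spark classes):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs://.../events").map(line => (line.take(8), line)) // (date, record) pairs
    val byDate = events.partitionBy(new HashPartitioner(32))   // control layout across nodes
    byDate.persist(StorageLevel.MEMORY_AND_DISK)               // control where partitions are stored
    byDate.count()                                             // first action materializes the cache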

20 Spark Programming Interface
Transformations – lazy operations that define new RDDs
Actions – launch a computation to return a value to the program or write data to external storage
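
A minimal sketch of this split between lazy transformations and eager actions (assuming a SparkContext sc; paths are hypothetical):

    val nums  = sc.parallelize(1 to 1000)       // the lineage graph starts here; no work yet
    val big   = nums.filter(_ > 990)            // transformation: just another lineage entry
    val total = big.reduce(_ + _)               // action: a job is launched only at this point
    big.saveAsTextFile("hdfs://.../big-nums")   // action: writes results to external storage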

21 Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

22 Spark Operations

23 Programming in Spark
Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map, filter)
The RDDs can then be used in actions (e.g., count, collect, save)
Spark computes RDDs lazily, the first time they are used in an action

24 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")             // base RDD
    errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
    messages = errors.map(_.split('\t')(2))
    messages.persist()

    messages.filter(_.contains("foo")).count         // action
    messages.filter(_.contains("bar")).count

[Figure: the master ships tasks to workers holding HDFS blocks 1-3; each worker caches its partition of messages (Msgs. 1-3) in RAM and returns results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

25 Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))]

26 Fault Recovery Results
[Figure: iteration running times with a failure in one iteration; only that iteration slows down while the lost RDD partitions are rebuilt from lineage in parallel]

27 Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

    links = // RDD of (url, neighbors) pairs
    ranks = // RDD of (url, rank) pairs

    for (i <- 1 to ITERATIONS) {
      ranks = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }.reduceByKey(_ + _)
    }

28 Optimizing Placement
links & ranks are repeatedly joined
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name

    links = links.partitionBy(new URLPartitioner())

[Figure: iteration dataflow in which Links (url, neighbors) and Ranks_i (url, rank) are co-partitioned, so each join and reduce produces Contribs_i and Ranks_{i+1} without reshuffling Links]
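
URLPartitioner above is a slide-specific name; a hedged sketch of the same idea using Spark's built-in HashPartitioner, assuming rawLinks is the (url, neighbors) RDD from the previous slide:

    import org.apache.spark.HashPartitioner

    val part  = new HashPartitioner(64)
    val links = rawLinks.partitionBy(part).persist()  // partition once by URL, keep in memory
    var ranks = links.mapValues(_ => 1.0)             // inherits links' partitioner
    // Each links.join(ranks) in the PageRank loop now finds matching keys on the
    // same node, so the large `links` RDD is never reshuffled across iterations.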

29 PageRank Performance
[Figure: PageRank iteration time compared across Hadoop, basic Spark, and Spark with controlled partitioning]

30 Behavior with Insufficient RAM
[Figure: iteration time as the fraction of the working set that fits in memory decreases; performance degrades gracefully]

31 Scalability
[Figure: two panels, Logistic Regression and K-Means, showing iteration time as the number of machines increases]

32 Outline
Spark programming interface
Implementation
How people are using Spark

33 Implementation
Runs on the Mesos cluster manager [NSDI 11] to share clusters with Hadoop, MPI, etc.
Can read from any Hadoop input source (HDFS, S3, …)
[Figure: Spark, Hadoop, and MPI frameworks running side by side on Mesos across the cluster’s nodes]

34 Task Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Locality & data reuse aware
Partitioning-aware to avoid shuffles
[Figure: an example job DAG with RDDs A-G grouped into Stages 1-3 by operations such as map, groupBy, union, and join; shaded boxes mark cached data partitions]

35 Programming Models Implemented on Spark
RDDs can express many existing parallel models
– MapReduce, DryadLINQ
– Pregel graph processing [200 LOC]
– Iterative MapReduce [200 LOC]
– SQL: Hive on Spark (Shark)
Enables apps to efficiently intermix these models
All are based on coarse-grained operations
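
As one illustration of expressing an existing model on RDDs, a word-count sketch of the MapReduce pattern (assuming a SparkContext sc; paths are hypothetical and this is not code from the original slides):

    // "map" phase: emit (key, value) pairs; "reduce" phase: combine values per key.
    val docs    = sc.textFile("hdfs://.../docs")
    val mapped  = docs.flatMap(line => line.split(" ").map(word => (word, 1)))
    val reduced = mapped.reduceByKey(_ + _)
    reduced.saveAsTextFile("hdfs://.../word-counts")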

36 Outline
Spark programming interface
Implementation
How people are using Spark

37 Open Source Community
User applications:
– Data mining 40x faster than Hadoop (Conviva)
– Exploratory log analysis (Foursquare)
– Traffic prediction via EM (Mobile Millennium)
– Twitter spam classification (Monarch)
– DNA sequence analysis (SNAP)
– ...

38 Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications
Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

