Introduction to Spark Internals

Introduction to Spark Internals
Matei Zaharia, UC Berkeley
www.spark-project.org

Outline
Project goals
Components
Life of a job
Extending Spark
How to contribute

Project Goals
Generality: diverse workloads, operators, job sizes
Low latency: sub-second
Fault tolerance: faults shouldn't be a special case
Simplicity: often comes from generality

Codebase Size
Spark: 20,000 LOC
Hadoop 1.0: 90,000 LOC
Hadoop 2.0: 220,000 LOC
(non-test, non-example sources)

Codebase Details
Spark core: 16,000 LOC
  Operators: 2000
  Scheduler: 2500
  Block manager: 2700
  Networking: 1200
  Accumulators: 200
  Broadcast: 3500
Interpreter: 3300 LOC
Hadoop I/O: 400 LOC
Mesos backend: 700 LOC
Standalone backend: 1700 LOC

Outline: Project goals, Components, Life of a job, Extending Spark, How to contribute

Components
Spark client (app master): your program, the RDD graph, the Scheduler, and the Block and Shuffle trackers

    sc = new SparkContext
    f = sc.textFile("...")
    f.filter(...).count()

Spark workers: task threads and a Block manager on each
Cluster manager sits between the client and the workers
Storage: HDFS, HBase, ...

Example Job

    val sc = new SparkContext("spark://...", "MyJob", home, jars)
    val file = sc.textFile("hdfs://...")              // resilient distributed dataset (RDD)
    val errors = file.filter(_.contains("ERROR"))     // RDD
    errors.cache()
    errors.count()                                    // action

RDD Graph
Dataset-level view:
  file: HadoopRDD (path = hdfs://...)
  errors: FilteredRDD (func = _.contains(...), shouldCache = true)
Partition-level view: one task per partition (Task 1, Task 2, ...)

Data Locality
First run: data is not in cache, so use the HadoopRDD's locality preferences (from HDFS)
Second run: FilteredRDD is in cache, so use its locations
If something falls out of cache, go back to HDFS

In More Detail: Life of a Job

Scheduling Process
RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(...).filter(...)
DAGScheduler: split the DAG into stages of tasks and submit each stage as it becomes ready (agnostic to operators!)
TaskScheduler: launch tasks via the cluster manager and retry failed or straggling tasks (doesn't know about stages)
Worker: execute tasks in threads, store and serve blocks via the Block manager
Stage failures are reported back to the DAGScheduler, which resubmits them

RDD Abstraction
Goal: support a wide array of operators and let users compose them arbitrarily
Don't want to modify the scheduler for each one
How do we capture dependencies generically?

RDD Interface
Set of partitions ("splits")
List of dependencies on parent RDDs
Function to compute a partition given its parents
Optional preferred locations
Optional partitioning info (Partitioner)
This captures all current Spark operations!
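
To make that five-part interface concrete, here is a minimal Scala sketch of one way it could be written down. The names (SimpleRDD, Partition, Dependency, Partitioner, OneToOneDependency, ShuffleDependency) are simplified stand-ins for illustration and do not match Spark's actual classes.

    // Illustrative stand-ins for Spark's partition, dependency and partitioner concepts.
    trait Partition { def index: Int }
    trait Dependency
    trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }

    // The five parts listed above, in simplified form.
    abstract class SimpleRDD[T] {
      def partitions: Seq[Partition]                              // set of partitions ("splits")
      def dependencies: Seq[Dependency]                           // dependencies on parent RDDs
      def compute(part: Partition): Iterator[T]                   // compute a partition given its parents
      def preferredLocations(part: Partition): Seq[String] = Nil  // optional locality hints
      def partitioner: Option[Partitioner] = None                 // optional partitioning info
    }

    // Two dependency kinds, matching the "narrow" vs. "shuffle" split shown later.
    class OneToOneDependency[T](val parent: SimpleRDD[T]) extends Dependency
    class ShuffleDependency[K, V](val parent: SimpleRDD[(K, V)]) extends Dependency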

Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(partition) = read the corresponding block
preferredLocations(part) = HDFS block location
partitioner = none

Example: FilteredRDD
partitions = same as parent RDD
dependencies = "one-to-one" on parent
compute(partition) = compute the parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none
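
Continuing the SimpleRDD sketch from the RDD Interface slide (still illustrative, not Spark's real FilteredRDD), the one-to-one case looks roughly like this:

    // Same partitions as the parent, a one-to-one dependency, and a compute()
    // that filters whatever the parent produces for that partition.
    class SimpleFilteredRDD[T](parent: SimpleRDD[T], pred: T => Boolean) extends SimpleRDD[T] {
      def partitions: Seq[Partition] = parent.partitions
      def dependencies: Seq[Dependency] = Seq(new OneToOneDependency(parent))
      def compute(part: Partition): Iterator[T] = parent.compute(part).filter(pred)
      // No preferred locations of its own: defer to ("ask") the parent.
      override def preferredLocations(part: Partition): Seq[String] = parent.preferredLocations(part)
      // partitioner stays None (the default).
    }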

Example: JoinedRDD
partitions = one per reduce task
dependencies = "shuffle" on each parent
compute(partition) = read and join the shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
Spark will now know this data is hashed!
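
Hash partitioning itself is only a few lines; here is a sketch against the illustrative Partitioner trait from earlier (not Spark's real HashPartitioner): a key goes to the partition given by its hash code modulo the number of reduce tasks.

    class SimpleHashPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod   // % can be negative in Scala/Java
      }
    }

Because JoinedRDD records this partitioner, a later operation that groups or joins on the same keys with the same number of partitions can reuse the layout instead of shuffling again.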

Dependency Types
"Narrow" deps: map, filter; union; join with inputs co-partitioned
"Wide" (shuffle) deps: groupByKey; join with inputs not co-partitioned

DAG Scheduler
Interface: receives a "target" RDD, a function to run on each partition, and a listener for results
Roles:
  Build stages of Task objects (code + preferred locations)
  Submit them to the TaskScheduler as they become ready
  Resubmit failed stages if their outputs are lost
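
As a rough sketch of that interface (with invented type names, not Spark's actual API), building on the SimpleRDD sketch above:

    // Listener for results, as described on the slide.
    trait JobListener[U] {
      def taskSucceeded(partitionId: Int, result: U): Unit
      def jobFailed(exception: Exception): Unit
    }

    // Entry point: run func on the given partitions of the target RDD and report
    // to the listener. Internally: build stages of Task objects, submit each to
    // the TaskScheduler as it becomes ready, and resubmit failed stages whose
    // outputs are lost.
    trait SimpleDAGScheduler {
      def runJob[T, U](targetRdd: SimpleRDD[T],
                       partitions: Seq[Int],
                       func: Iterator[T] => U,
                       listener: JobListener[U]): Unit
    }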

Scheduler Optimizations
Pipelines narrow operations within a stage
Picks join algorithms based on partitioning (to minimize shuffles)
Reuses previously cached data
(Figure: an RDD graph with map, union, join and groupBy operators cut into three stages; previously computed partitions are skipped)
NOT a modified version of Hadoop

Task Details
Stage boundaries are only at input RDDs or "shuffle" operations
So each task looks like this: read from external storage and/or fetch map outputs, apply the pipelined functions f1, f2, ..., then write a map output file or return a result to the master
(Note: we write shuffle outputs to RAM/disk to allow retries)
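
A hedged sketch of that task shape, with invented names rather than Spark's internals:

    trait TaskContext { def stageId: Int; def partitionId: Int }

    abstract class SimpleTask[T, U] extends Serializable {
      def readInput(ctx: TaskContext): Iterator[T]               // external storage and/or fetched map outputs
      def pipeline(input: Iterator[T]): Iterator[U]              // f1, f2, ... pipelined within the stage
      def writeOutput(ctx: TaskContext, out: Iterator[U]): Unit  // map output file (RAM/disk) or result to master

      final def run(ctx: TaskContext): Unit =
        writeOutput(ctx, pipeline(readInput(ctx)))
    }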

Task Details
Each Task object is self-contained: it contains all transformation code up to its input boundary (e.g. HadoopRDD => filter => map)
This allows tasks on cached data to run even if that data falls out of cache
Design goal: any Task can run on any node
The only way a Task can fail is losing map output files

Event Flow
runJob(targetRDD, partitions, func, listener) enters the DAGScheduler, which handles the graph of stages, RDD partitioning and pipelining
The DAGScheduler calls submitTasks(taskSet) on the TaskScheduler, which handles task placement, retries on failure, speculation and inter-job policy
The TaskScheduler hands Task objects to a cluster or local runner; task-finish and stage-failure events flow back up to the DAGScheduler

TaskScheduler
Interface:
  Given a TaskSet (a set of Tasks), run it and report results
  Report "fetch failed" errors when shuffle output is lost
Two main implementations:
  LocalScheduler (runs locally)
  ClusterScheduler (connects to a cluster manager using a pluggable "SchedulerBackend" API)
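
A minimal sketch of that contract with placeholder types (not Spark's real signatures), reusing SimpleTask from the Task Details sketch:

    // A set of tasks produced by the DAGScheduler for one stage.
    case class SimpleTaskSet(stageId: Int, tasks: Seq[SimpleTask[_, _]])

    // Run a TaskSet and report results; lost shuffle output surfaces as "fetch failed".
    trait SimpleTaskScheduler {
      def submitTasks(taskSet: SimpleTaskSet): Unit
    }

    // The cluster implementation talks to a cluster manager through a pluggable backend.
    trait SimpleSchedulerBackend {
      def start(): Unit
      def stop(): Unit
      def reviveOffers(): Unit   // ask the cluster manager for resources to launch pending tasks
    }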

TaskScheduler Details
Can run multiple concurrent TaskSets, but currently does so in FIFO order
  Would be really easy to plug in other policies; if someone wants to suggest a plugin API, please do
Maintains one TaskSetManager per TaskSet that tracks its locality and failure info
Polls these for tasks in order (FIFO)

Worker
Implemented by the Executor class
Receives self-contained Task objects and calls run() on them in a thread pool
Reports results or exceptions to the master
  Special case: FetchFailedException for shuffle
Pluggable ExecutorBackend for cluster mode
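
A sketch of that behavior (illustrative only), again reusing the SimpleTask shape from the Task Details sketch:

    import java.util.concurrent.Executors

    // Runs self-contained tasks in a thread pool and reports each outcome;
    // a shuffle fetch failure would come back here as an exception.
    class SimpleExecutor(numCores: Int) {
      private val pool = Executors.newFixedThreadPool(numCores)

      def launchTask(task: SimpleTask[_, _], ctx: TaskContext,
                     report: Either[Throwable, Unit] => Unit): Unit = {
        pool.submit(new Runnable {
          def run(): Unit =
            try { task.run(ctx); report(Right(())) }          // success: report to the master
            catch { case e: Throwable => report(Left(e)) }    // failure, incl. fetch failures
        })
      }
    }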

Other Components: BlockManager
"Write-once" key-value store on each worker
Serves shuffle data as well as cached RDDs
Tracks a StorageLevel for each block (e.g. disk, RAM)
Can drop data to disk if running low on RAM
Can replicate data across nodes
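
A hedged sketch of that store's surface; the method and level names below are invented for illustration and are not Spark's actual BlockManager API.

    // Where a block may live; a StorageLevel is tracked per block.
    sealed trait SimpleStorageLevel
    case object MemoryOnly extends SimpleStorageLevel
    case object DiskOnly extends SimpleStorageLevel
    case object MemoryAndDisk extends SimpleStorageLevel

    // "Write-once" key-value store per worker, serving cached RDDs and shuffle data.
    trait SimpleBlockManager {
      def putBlock(blockId: String, bytes: Array[Byte], level: SimpleStorageLevel): Unit  // written once
      def getBlock(blockId: String): Option[Array[Byte]]        // served locally or fetched remotely
      def dropToDisk(blockId: String): Unit                     // when RAM runs low
      def replicate(blockId: String, peers: Seq[String]): Unit  // optional replication across nodes
    }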

Other Components: CommunicationManager
Asynchronous-IO-based networking library
Allows fetching blocks from BlockManagers
Allows prioritization / chunking across connections (would be nice to make this pluggable!)
Fetch logic tries to optimize for block sizes

Other Components: MapOutputTracker
Tracks where each "map" task in a shuffle ran
Tells reduce tasks the map locations
Each worker caches the locations to avoid refetching
A "generation ID" passed with each Task allows invalidating the cache when map outputs are lost
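
A rough sketch of the generation-ID idea, with invented names: a worker keeps a cached copy of map output locations and clears it whenever a task arrives carrying a newer generation.

    // Worker-side cache of map output locations, invalidated by generation ID.
    class SimpleMapOutputCache {
      private var generation = 0L
      private val locations = scala.collection.mutable.Map[Int, Array[String]]()  // shuffleId -> host per map task

      def getLocations(shuffleId: Int)(fetchFromMaster: Int => Array[String]): Array[String] =
        synchronized { locations.getOrElseUpdate(shuffleId, fetchFromMaster(shuffleId)) }

      // Called with the generation ID carried by each incoming task: a newer ID
      // means map outputs were lost somewhere, so throw away the cached locations.
      def updateGeneration(taskGeneration: Long): Unit = synchronized {
        if (taskGeneration > generation) {
          generation = taskGeneration
          locations.clear()
        }
      }
    }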

Outline: Project goals, Components, Life of a job, Extending Spark, How to contribute

Extension Points
Spark provides several places to customize functionality:
  Extending RDD: add new input sources or transformations
  SchedulerBackend: add new cluster managers
  spark.serializer: customize object storage
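
For example, the spark.serializer hook was selected through a Java system property in Spark of this era; the exact property value below follows the pre-1.0 class naming and should be treated as an assumption for any other version.

    // Swap the default Java serialization for Kryo (0.x-era configuration style);
    // set before creating the SparkContext.
    System.setProperty("spark.serializer", "spark.KryoSerializer")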

What People Have Done
New RDD transformations (sample, glom, mapPartitions, leftOuterJoin, rightOuterJoin)
New input sources (DynamoDB)
Custom serialization for memory and bandwidth efficiency
New language bindings (Java, Python)

Possible Future Extensions
Pluggable inter-job scheduler
Pluggable cache eviction policy (ideally with priority flags on StorageLevel)
Pluggable instrumentation / event listeners
Let us know if you want to contribute these!

As an Exercise
Try writing your own input RDD from the local filesystem (say, one partition per file)
Try writing your own transformation RDD (pick a Scala collection method not in Spark)
Try writing your own action (e.g. product())
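
One possible shape for the first exercise, sketched against the illustrative SimpleRDD interface from earlier rather than Spark's real RDD class: an input RDD over a local directory with one partition per file.

    import java.io.File
    import scala.io.Source

    // One partition per file in the directory.
    case class FilePartition(index: Int, path: String) extends Partition

    class LocalFilesRDD(dir: String) extends SimpleRDD[String] {
      def partitions: Seq[Partition] =
        new File(dir).listFiles().toSeq.sortBy(_.getName).zipWithIndex.map {   // assumes dir exists
          case (file, i) => FilePartition(i, file.getAbsolutePath)
        }
      def dependencies: Seq[Dependency] = Nil   // input source: no parent RDDs
      def compute(part: Partition): Iterator[String] =
        Source.fromFile(part.asInstanceOf[FilePartition].path).getLines()
    }

For the third exercise, an action such as product() can be built on an existing action: for a numeric RDD, rdd.reduce(_ * _) already computes the product.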

Outline: Project goals, Components, Life of a job, Extending Spark, How to contribute

Development Process
Issue tracking: spark-project.atlassian.net
Development discussion: spark-developers
Main work: "master" branch on GitHub
Submit patches through GitHub pull requests
Be sure to follow code style and add tests!

Build Tools
SBT and Maven currently both work (but we are switching to Maven only)
IntelliJ IDEA is the most common IDE; Eclipse may be made to work

Thanks! Stay tuned for future developer meetups.