12doesn’t know about stages Scheduling ProcessRDD ObjectsDAGSchedulersplit graph into stages of taskssubmit each stage as readyDAGTaskSchedulerTaskSetlaunch tasks via cluster managerretry failed or straggling tasksCluster managerWorkerexecute tasksstore and serve blocksBlock managerThreadsTaskrdd1.join(rdd2) .groupBy(…).filter(…)build operator DAGstage failedagnostic to operators!doesn’t know about stages
13RDD AbstractionGoal: wanted to support wide array of operators and let users compose them arbitrarily Don’t want to modify scheduler for each one How to capture dependencies generically?
14Captures all current Spark operations! RDD InterfaceSet of partitions (“splits”) List of dependencies on parent RDDs Function to compute a partition given parents Optional preferred locations Optional partitioning info (Partitioner)Captures all current Spark operations!
16Example: FilteredRDDpartitions = same as parent RDD dependencies = “one-to-one” on parent compute(partition) = compute parent and filter it preferredLocations(part) = none (ask parent) partitioner = none
17Spark will now know this data is hashed! Example: JoinedRDDpartitions = one per reduce task dependencies = “shuffle” on each parent compute(partition) = read and join shuffled data preferredLocations(part) = none partitioner = HashPartitioner(numTasks)Spark will now know this data is hashed!
18Dependency Types “Narrow” deps: “Wide” (shuffle) deps: groupByKey map, filterjoin with inputs co-partitionedunionjoin with inputs not co-partitioned
19DAG SchedulerInterface: receives a “target” RDD, a function to run on each partition, and a listener for resultsRoles:Build stages of Task objects (code + preferred loc.)Submit them to TaskScheduler as readyResubmit failed stages if outputs are lost
20Scheduler Optimizations Pipelines narrow ops. within a stage Picks join algorithms based on partitioning (minimize shuffles) Reuses previously cached datajoinuniongroupBymapStage 3Stage 1Stage 2A:B:C:D:E:F:G:TaskNOT a modified version of Hadoop= previously computed partition
21map output file or master Task DetailsStage boundaries are only at input RDDs or “shuffle” operationsSo, each task looks like this:(Note: we write shuffle outputs to RAM/disk to allow retries)external storageTaskf1 f2 …map output file or masterand/orfetch mapoutputs
22Design goal: any Task can run on any node Task DetailsEach Task object is self-containedContains all transformation code up to input boundary (e.g. HadoopRDD => filter => map)Allows Tasks on cached data to even if they fall out of cacheDesign goal: any Task can run on any nodeOnly way a Task can fail is lost map output files
23Event Flow DAGScheduler TaskScheduler graph of stagesRDD partitioningpipeliningrunJob(targetRDD, partitions, func, listener)DAGSchedulertask finish & stage failure eventssubmitTasks(taskSet)TaskSchedulertask placementretries on failurespeculationinter-job policyTask objectsCluster or local runner
24TaskScheduler Interface: Two main implementations: Given a TaskSet (set of Tasks), run it and report resultsReport “fetch failed” errors when shuffle output lostTwo main implementations:LocalScheduler (runs locally)ClusterScheduler (connects to a cluster manager using a pluggable “SchedulerBackend” API)
25TaskScheduler Details Can run multiple concurrent TaskSets, but currently does so in FIFO orderWould be really easy to plug in other policies!If someone wants to suggest a plugin API, please doMaintains one TaskSetManager per TaskSet that tracks its locality and failure infoPolls these for tasks in order (FIFO)
26Worker Implemented by the Executor class Receives self-contained Task objects and calls run() on them in a thread poolReports results or exceptions to masterSpecial case: FetchFailedException for shufflePluggable ExecutorBackend for cluster
27Other Components BlockManager “Write-once” key-value store on each workerServes shuffle data as well as cached RDDsTracks a StorageLevel for each block (e.g. disk, RAM)Can drop data to disk if running low on RAMCan replicate data across nodes
28Other Components CommunicationManager Asynchronous IO based networking libraryAllows fetching blocks from BlockManagersAllows prioritization / chunking across connections (would be nice to make this pluggable!)Fetch logic tries to optimize for block sizes
29Other Components MapOutputTracker Tracks where each “map” task in a shuffle ranTells reduce tasks the map locationsEach worker caches the locations to avoid refetchingA “generation ID” passed with each Task allows invalidating the cache when map outputs are lost
30OutlineProject goals Components Life of a job Extending Spark How to contribute
31Extension PointsSpark provides several places to customize functionality: Extending RDD: add new input sources or transformations SchedulerBackend: add new cluster managers spark.serializer: customize object storage
32What People Have DoneNew RDD transformations (sample, glom, mapPartitions, leftOuterJoin, rightOuterJoin) New input sources (DynamoDB) Custom serialization for memory and bandwidth efficiency New language bindings (Java, Python)
33Possible Future Extensions Pluggable inter-job scheduler Pluggable cache eviction policy (ideally with priority flags on StorageLevel) Pluggable instrumentation / event listeners Let us know if you want to contribute these!
34As an ExerciseTry writing your own input RDD from the local filesystem (say one partition per file) Try writing your own transformation RDD (pick a Scala collection method not in Spark) Try writing your own action (e.g. product())
35OutlineProject goals Components Life of a job Extending Spark How to contribute
36Development Process Issue tracking: spark-project.atlassian.net Development discussion: spark-developersMain work: “master” branch on GitHubSubmit patches through GitHub pull requestsBe sure to follow code style and add tests!
37Build ToolsSBT and Maven currently both work (but switching to only Maven) IDEA is the most common IDEA; Eclipse may be made to work