Making Sense of Spark Performance


1 Making Sense of Spark Performance
Kay Ousterhout, UC Berkeley
In collaboration with Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun
eecs.berkeley.edu/~keo/traces

2 About Me
PhD student in Computer Science at UC Berkeley
Thesis work centers around performance of large-scale distributed systems
Spark PMC member

3 About This Talk
Overview of how Spark works
How we measured performance bottlenecks
In-depth performance analysis for a few workloads
Demo of the performance analysis tool
This is a research project, but this talk will be more practically focused.

4 Count the # of words in the document
[Figure: a document split across a cluster of machines; each Spark (or Hadoop/Dryad/etc.) task counts the words in its partition (6, 6, 4, and 5 here), and the Spark driver sums them: 6 + 6 + 4 + 5 = 21.]
First, a quick refresher on how these frameworks work. Current execution engines are architected in a similar way. A job is some analysis on a large amount of data. The basic idea: split the job into increments of work called tasks; each task runs on one machine. To give you a sense of what one of these tasks might do, look at a simple example.

5 Count the # of occurrences of each word
[Figure: the same documents processed in a MAP phase and a REDUCE phase. Map tasks emit per-partition word counts ({I: 2, am: 2, …}, {Sam: 1, I: 1, …}, {Green: 1, eggs: 1, …}, {Thank: 1, eggs: 1, …}); reduce tasks aggregate the counts for each word ({I: 4, you: 2, …}, {am: 4, Green: 1, …}, {Sam: 4, …}, {Thank: 1, you: 1, …}).]
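To make these two slides concrete, here is a minimal Spark word-count sketch in Scala. It is not from the talk; the application name and input path are placeholders.

// Minimal word-count sketch (illustrative; app name and input path are made up).
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    val lines = sc.textFile("hdfs:///docs/green-eggs-and-ham.txt")

    // Slide 4: each task counts the words in its partition; the driver sums the counts.
    val totalWords = lines.flatMap(_.split(" ")).count()
    println(s"Total words: $totalWords")

    // Slide 5: map tasks emit (word, 1) pairs; reduce tasks aggregate per word across a shuffle.
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.collect().foreach(println)

    sc.stop()
  }
}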

6 Performance considerations
Caching input data
Scheduling: assigning tasks to machines
Straggler tasks
Network performance (e.g., during shuffle)

7 TONS of work on improving performance
Caching: PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14]
Scheduling: Sparrow [SOSP ’13], Apollo [OSDI ’14], Mesos [NSDI ’11], DRF [NSDI ’11], Tetris [SIGCOMM ’14], Omega [EuroSys ’13], YARN [SoCC ’13], Quincy [SOSP ’09], KMN [OSDI ’14]
Stragglers: Scarlett [EuroSys ’11], SkewTune [SIGMOD ’12], LATE [OSDI ’08], Mantri [OSDI ’10], Dolly [NSDI ’13], GRASS [NSDI ’14], Wrangler [SoCC ’14]
Network: VL2 [SIGCOMM ’09], Hedera [NSDI ’10], Sinbad [SIGCOMM ’13], Orchestra [SIGCOMM ’11], Baraat [SIGCOMM ’14], Varys [SIGCOMM ’14], PeriSCOPE [OSDI ’12], SUDO [NSDI ’12], Camdoop [NSDI ’12], Oktopus [SIGCOMM ’11], EyeQ [NSDI ’12], FairCloud [SIGCOMM ’12]
Generalized programming model: Dryad [EuroSys ’07], Spark [NSDI ’12]
Lots of valuable work here, and many papers cite impressive improvements. But what are the biggest problems? What do we do next? How do we go about this in a principled way, rather than playing whack-a-mole forever? Dogma has emerged from the optimizations people have illustrated, rather than from any kind of comprehensive performance study.

8 Network and disk I/O are bottlenecks; stragglers are a major issue with unknown causes
(Same landscape of caching, scheduling, straggler, and network work as the previous slide.) This dogma has emerged from the optimizations people have illustrated, rather than from any kind of comprehensive performance study.

9 This Work
(1) Methodology for quantifying performance bottlenecks
(2) Bottleneck measurement for 3 SQL workloads (TPC-DS and 2 others)

10 Network optimizations can reduce job completion time by at most 2%
CPU (not I/O) is often the bottleneck
Most straggler causes can be identified and fixed
Very different from the existing mentality! Why so few workloads?

11 Example Spark task
[Figure: a task's timeline broken into network read, compute, and disk write, with a marker for the time to handle one record.]
It is very tricky to understand performance here. Problem: existing traces have only a coarse-grained breakdown of how tasks spend their time; they show how long the entire task took and little else. Fine-grained instrumentation is needed to understand performance.

12 How much faster would a job run if the network were infinitely fast?
What’s an upper bound on the improvement from network optimizations?

13 How much faster could a task run if the network were infinitely fast?
[Figure: the original task runtime (network read, compute, disk write) annotated with the time blocked on network and the time blocked on disk, next to the same task's runtime with an infinitely fast network.]
Look at a single task: the part of the job that runs on just one machine. We added instrumentation to measure when the compute thread is blocked. The crucial point: if a disk (or network) read is pipelined with computation, there is no point in speeding it up; only the time the compute thread spends blocked can be eliminated.
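A minimal sketch of that per-task calculation in Scala, assuming metrics like these are available (the field names are illustrative, not the actual names recorded by the instrumentation):

// Sketch of the blocked-time idea for a single task; field names are assumptions.
case class TaskMetrics(runtimeMs: Long, blockedOnNetworkMs: Long, blockedOnDiskMs: Long)

// With an infinitely fast network, only the time the compute thread actually sat
// blocked on the network disappears; network time that was pipelined with
// computation was never on the critical path, so it does not count.
def runtimeWithFastNetwork(t: TaskMetrics): Long =
  t.runtimeMs - t.blockedOnNetworkMs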

14 How much faster would a job run if the network were infinitely fast?
[Figure: three tasks (Task 0, Task 1, Task 2) running on 2 slots, with the time each spends blocked on the network highlighted. to: original job completion time. tn: job completion time with an infinitely fast network, i.e., the same tasks replayed on the 2 slots with the blocked time removed.]
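One way to turn the per-task numbers into a job-level bound, sketched here under simplifying assumptions (tasks replayed greedily, in their original order, onto a fixed number of slots); this approximates the methodology rather than reproducing the actual analysis tool:

// Greedy replay of a job's tasks onto a fixed number of slots: each task runs on
// whichever slot frees up first, and the job finishes when the last slot drains.
def completionTime(taskDurationsMs: Seq[Long], slots: Int): Long = {
  val slotFreeAt = Array.fill(slots)(0L)
  for (d <- taskDurationsMs) {
    val next = slotFreeAt.indexOf(slotFreeAt.min)
    slotFreeAt(next) += d
  }
  slotFreeAt.max
}

// Upper bound on the improvement from an infinitely fast network: replay the job
// with each task's network-blocked time removed and compare completion times (to vs. tn).
def maxNetworkImprovement(runtimesMs: Seq[Long],
                          blockedOnNetworkMs: Seq[Long],
                          slots: Int): Double = {
  val tO = completionTime(runtimesMs, slots)
  val tN = completionTime(runtimesMs.zip(blockedOnNetworkMs).map { case (r, b) => r - b }, slots)
  1.0 - tN.toDouble / tO   // fractional reduction in job completion time
}

The same replay with the disk-blocked time removed gives the disk bound discussed later in the talk.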

15 SQL Workloads
TPC-DS (20 machines, 850GB; 60 machines, 2.5TB)
Big Data Benchmark (5 machines, 60GB): amplab.cs.berkeley.edu/benchmark
Databricks (9 machines, tens of GB): databricks.com
2 versions of each: in-memory and on-disk
Where possible, results were validated against larger workloads from Google / Microsoft / Facebook, but that is only possible at a coarser scale.

16 How much faster could jobs get from optimizing network performance?
[Figure: box plots showing the 5th, 25th, 50th, 75th, and 95th percentiles of the potential improvement.]
What are we looking at? The 3 workloads are on the x-axis, with one version of each where input/output is stored on disk and one where it is stored in memory, on a 1Gbps network link (nothing unusual). Median improvement: at most 2%.

17 How can we sanity check these numbers?
How do our workloads compare to larger-scale traces? As mentioned, those traces do not contain enough data to measure this directly.

18 How much data is transferred per CPU second?
Compare to Facebook, where we have per-job information: the data transferred per CPU second is generally much lower. For Microsoft and Google we only have aggregate information (total bytes written and total machine or task seconds, not CPU seconds, so the numbers would be slightly higher using the CPU metric), and it says nothing about tail behavior. Still, if anything, those workloads are much less network-bound than the ones we looked at.
Microsoft ’09–’10: 1.9–6.35 Mb / task second
Google ’04–’07: 1.34–1.61 Mb / machine second
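As a worked example of the metric being compared here (the totals below are made-up placeholders, not trace values):

// Illustrative "data transferred per CPU second" calculation in Scala.
val bytesTransferred: Long = 50L * 1000 * 1000 * 1000   // 50 GB moved over the network
val cpuSeconds: Double = 200000.0                       // total CPU seconds across all tasks
val mbPerCpuSecond = (bytesTransferred * 8 / 1e6) / cpuSeconds
println(f"$mbPerCpuSecond%.2f Mb / CPU second")         // 2.00 Mb / CPU second for these numbers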

19 Shuffle Data < Input Data
How can this be true? (Read a point off the graph first; the graph is confusing without knowing what the ratio is.) It is fairly intuitive: these queries are reducing data, so except for ETL-style jobs you usually end up with much less data than you read. A sort workload is not realistic, and neither is a workload that only does a shuffle.

20 What kind of hardware should I buy?
10Gbps networking hardware likely not necessary!

21 How much faster would jobs complete if the disk were infinitely fast?

22 How much faster could jobs get from optimizing disk performance?
This includes all disk and network time (it is impossible to separate the two), so it caps the improvement from flash, etc. Median improvement: at most 19%.

23 Disk Configuration
Our instances: 2 disks, 8 cores
Cloudera recommends at least 1 disk for every 3 cores, and as many as 2 disks per core
Our instances are under-provisioned, so our results are an upper bound

24 How much data is transferred per CPU second?
Based on these aggregate metrics, BDB and TPC-DS are, if anything, more I/O bound than the other workloads.
Google: MB / machine second
Microsoft: 7–11 MB / task second

25 What does this mean about Spark versus Hadoop?
[Figure: a "faster" axis with two points, serialized + compressed on-disk data and serialized + compressed in-memory data; the 19% measured in this work is the gap between them.]

26 This work says nothing about Spark vs. Hadoop!
[Figure: the same "faster" axis with three points: serialized + compressed on-disk data, serialized + compressed in-memory data, and deserialized in-memory data. The 19% from this work spans only the first two; the reported speedups of 6x or more (amplab.cs.berkeley.edu/benchmark/) and up to 10x on on-disk data (spark.apache.org) span out to deserialized in-memory data.]

27 What causes stragglers?
Takeaway: causes depend on the workload, but disk and garbage collection are common. Fixing straggler causes can speed up other (non-straggler) tasks too.

28 Live demo

29 eecs.berkeley.edu/~keo/traces

30 I want your workloads!
spark.eventLog.enabled true
keo@cs.berkeley.edu
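For context, event logging is typically turned on in conf/spark-defaults.conf before running the workload; the log directory below is just an example path, not a required value:

# conf/spark-defaults.conf
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-event-logs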

31 Network optimizations can reduce job completion time by at most 2%
CPU (not I/O) is often the bottleneck
19% reduction in completion time from optimizing disk
Many straggler causes can be identified and fixed
Project webpage (with links to paper and tool): eecs.berkeley.edu/~keo/traces
These are pretty different from what RR said.

32 Backup Slides

33 How do results change with scale?

34 How does the utilization compare?

