1 Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks
Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker

2 How can I make this faster?
Starting abstractly: we have a user; she's running a workload on some system and getting some result. Pretty quickly she asks: how can I make this faster?

3 How can I make this faster?
She has lots of choices for how to do so. Lots of knobs to turn! To give a few examples…

4 Should I use a different cloud instance type?
Change the instance type! This can have a significant impact (Shivaram's work: 1.9x).

5 Should I trade more CPU for less I/O by using better compression?
Change the software configuration. How do I know what the effect of a different tradeoff will be?

6 How can I make this faster?
??? User has no way to know how to turn these knobs – yet could get big performance improvements from doing so!

7 How can I make this faster?
??? What’s more, this is getting worse: with each new optimization, there are more knobs, and the systems are more complicated, so it’s harder to know how to turn them

8 How can I make this faster?
???

9 Architect for performance clarity
Make it easy to reason about performance. We should architect for performance clarity. What do I mean by clarity? Much more than just "what's the bottleneck". We want to know how to change the knobs; this requires knowing how the system performs under different conditions. No measure of utilization will tell me what happens if the network is twice as fast. This goes way beyond measurement. I'm not aware of any system that was designed to provide this kind of clarity. Enabling the user to turn the existing N knobs is much more useful than adding an (N+1)th one!

10 For data analytics frameworks: Is it possible to architect for performance clarity? Does doing so require sacrificing performance? Intuitively, making performance easier to reason about might mean making the system simpler, undoing performance improvements, and making it slower as a result.

11 Key idea: use single-resource monotasks
Reasoning about performance Today: why it’s hard Monotasks: why it’s easy Does using monotasks hurt performance? Using monotasks to predict job runtime

12 Split input file into words and emit count of 1 for each
Example Spark Job Word Count: spark.textFile("hdfs://…") \ .flatMap(lambda l: l.split(" ")) \ .map(lambda w: (w, 1)) \ .reduceByKey(lambda a, b: a + b) \ .saveAsTextFile("hdfs://…") Split input file into words and emit count of 1 for each User writes a high-level description of the job. Two parts; first…

13 Example Spark Job Word Count:
spark.textFile("hdfs://…") \ .flatMap(lambda l: l.split(" ")) \ .map(lambda w: (w, 1)) \ .reduceByKey(lambda a, b: a + b) \ .saveAsTextFile("hdfs://…") Split input file into words and emit count of 1 for each For each word, combine the counts, and save the output Second part
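The word-count pipeline above can be checked with a minimal pure-Python sketch (this is an illustration of what each stage computes, not Spark itself):

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split each input line into words
    words = [w for line in lines for w in line.split(" ") if w]
    # map: emit (word, 1) for each word
    pairs = [(w, 1) for w in words]
    # reduceByKey: combine the counts for each word
    counts = defaultdict(int)
    for w, n in pairs:
        counts[w] += n
    return dict(counts)

print(word_count(["the quick fox", "the fox"]))
```

In Spark, the same logic runs distributed: flatMap and map execute in the map stage, reduceByKey in the reduce stage.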

14 Spark Word Count Job: … …
spark.textFile("hdfs://…") .flatMap(lambda l: l.split(" ")) .map(lambda w: (w, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…") Map Stage: Split input file into words and emit count of 1 for each Reduce Stage: For each word, combine the counts, and save the output Worker 1 Worker n Worker 1 Worker n Tasks The framework's job is to handle the details of executing the job: break it into tasks that can be parallelized on a cluster, handle data transfer, etc. Aside: this is how Spark works, but many other frameworks work this way too!

15 Reduce Stage: For each word, combine the counts, and save the output
Spark Word Count Job: .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…") Reduce Stage: For each word, combine the counts, and save the output Worker 1 Worker n Focus on just the reduce stage

16 Reduce Stage: For each word, combine the counts, and save the output
Spark Word Count Job: .reduceByKey(lambda a, b: a + b) .saveAsTextFile("hdfs://…") Reduce task: Reduce Stage: For each word, combine the counts, and save the output Network Worker 1 Worker n CPU (aggregate) Disk write Tasks use pipelining to parallelize the use of many resources

17 Challenge: Tasks pipeline multiple resources,
Reduce task Network read CPU (filter) Disk write Task only using disk Task only using the network Task only using the CPU Challenge: Tasks pipeline multiple resources, resource use changes at fine time granularity Fine-grained pipelining makes it hard to reason about performance, because the task is doing different things at different times, and this can change at the granularity of the pipelining. This is how things have always been built: fine-grained pipelining is seen as fundamental to performance! But it's also a fundamental challenge to reasoning about performance. This results in a few issues…

18 4 concurrent tasks on a worker
First, tasks running concurrently! time

19 Concurrent tasks may contend for
the same resource (e.g., network) Task 7 Task 3 Task 4 Task 8 A single scheduler is responsible for scheduling these multi-resource tasks! Can’t avoid contention that occurs at fine time granularity time

20 Challenge: Resource use controlled by operating system
Task 18 Task 19 Reduce task Network read CPU (filter) Disk write Disk write controlled by OS buffer cache Alternate Disk write (data in buffer cache) Disk write controlled by OS! OS might reasonably hold all data in the cache, and actually write it to disk much later, when it will contend with other tasks running on the machine.
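The buffer-cache behavior above is visible from any language. A small Python sketch (file name is illustrative) of the explicit flush-and-fsync a framework would need before data is actually guaranteed to be on disk:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "part-0.out")
with open(path, "w") as f:
    f.write("word counts ...")
    # Without these two calls, the OS may keep the data in the
    # buffer cache and write it to disk much later, when it will
    # contend with whatever tasks are running on the machine then.
    f.flush()             # user-space buffer -> kernel buffer cache
    os.fsync(f.fileno())  # buffer cache -> disk, before close
print(os.path.getsize(path))
```

The point of the slide: because the flush timing is the OS's choice, the framework today neither controls nor observes when the disk is actually used.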

21 Time t: different tasks may be bottlenecked on different resources
What’s the bottleneck? Single task may be bottlenecked on different resources at different times Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 Time t: different tasks may be bottlenecked on different resources Now given what I’ve said so far, say you wanted to answer a simple question: what’s the bottleneck? This starts to look pretty hairy pretty quickly. Can *you* answer this from looking at this picture? If you look at one slice in time, different tasks may be bottlenecked on different things! You could think about this for one task: but the task has different bottlenecks at different times!

22 How much faster would my job be with 2x disk throughput?
Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 How much faster would my job be with 2x disk throughput? How would runtimes for these disk writes change? How would that change timing of (and contention for) other resources? What if you wanted to ask a more complicated question? Emphasize: no amount of additional logging is going to make it easy to answer this question today.

23 Fundamental challenge: tasks have non-uniform resource use
Concurrent tasks on a machine may contend Resource use is controlled outside of the framework No model for performance Contention exacerbated by the fact that resource use controlled outside of framework

24 Reasoning about performance Today: why it’s hard Monotasks: why it’s easy Does using monotasks hurt performance? Using monotasks to predict job runtime

25 Today: tasks use pipelining to parallelize multiple resources Proposal: build systems using monotasks that each consume just one resource

26 Today: Tasks have non-uniform resource use
Concurrent tasks may contend Resource use outside of framework No model for performance

27 Monotasks: Each task uses one resource
Today: Network read Today’s task: Tasks have non-uniform resource use CPU Disk write Task 1 Concurrent tasks may contend Monotasks: Each task uses one resource Network monotask Compute monotask Resource use outside of framework Disk monotask Monotasks don’t start until all dependencies complete No model for performance Dependencies: key to uniform resource use! Don’t want tasks to block.

28 Today: Monotasks: Dedicated schedulers control contention
Tasks have non-uniform resource use Monotasks for one of today’s tasks: Each task uses one resource Network scheduler Concurrent tasks may contend CPU scheduler: 1 monotask / core Resource use outside of framework Disk drive scheduler: 1 monotask / disk No model for performance

29 Per-resource schedulers have complete control Today: Monotasks:
Tasks have non-uniform resource use Each task uses one resource Network scheduler Concurrent tasks may contend Dedicated schedulers control contention CPU scheduler: 1 monotask / core Resource use outside of framework writes buffered Disk drive scheduler: 1 monotask / disk All writes flushed to disk No model for performance
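A toy simulation (task names and numbers are made up, and this is a sketch, not the monotasks implementation) of dedicated per-resource schedulers: each resource has its own queue and a fixed concurrency limit, e.g. one monotask per core and one per disk, so contention is controlled in one place:

```python
import heapq

def run_monotasks(tasks, concurrency):
    """tasks: list of (resource, duration); concurrency: resource -> slots.
    Simulates each resource's dedicated scheduler running its FIFO queue
    with a fixed number of slots. Returns each resource's finish time."""
    finish = {}
    for res, slots in concurrency.items():
        # Min-heap of slot-free times for this resource's scheduler.
        free = [0.0] * slots
        heapq.heapify(free)
        for r, dur in tasks:
            if r != res:
                continue
            start = heapq.heappop(free)      # earliest free slot
            heapq.heappush(free, start + dur)
        finish[res] = max(free)
    return finish

tasks = [("cpu", 2), ("cpu", 2), ("cpu", 2), ("cpu", 2),
         ("disk", 3), ("disk", 3)]
print(run_monotasks(tasks, {"cpu": 2, "disk": 1}))
```

With 2 CPU slots the four 2-second CPU monotasks finish at time 4; the single disk slot serializes the two writes, finishing at time 6.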

30 Monotask times can be used to model performance
Today: Monotasks: Monotask times can be used to model performance Tasks have non-uniform resource use Each task uses one resource Ideal CPU time: total CPU monotask time / # CPU cores Concurrent tasks may contend Dedicated schedulers control contention Resource use outside of framework Per-resource schedulers have complete control No model for performance

31 Monotask times can be used to model performance
Today: Monotasks: Monotask times can be used to model performance Tasks have non-uniform resource use Each task uses one resource Ideal CPU time: total CPU monotask time / # CPU cores Modeled job runtime: max of ideal times Concurrent tasks may contend Dedicated schedulers control contention Resource use outside of framework Per-resource schedulers have complete control Ideal network runtime Ideal disk runtime No model for performance

32 How much faster would the job be with 2x disk throughput?
Today: Monotasks: How much faster would the job be with 2x disk throughput? Tasks have non-uniform resource use Each task uses one resource Ideal CPU time: total CPU monotask time / # CPU cores Concurrent tasks may contend Dedicated schedulers control contention Ideal network runtime Resource use outside of framework Per-resource schedulers have complete control Ideal disk runtime No model for performance

33 How much faster would the job be with 2x disk throughput?
Today: Monotasks: How much faster would the job be with 2x disk throughput? Tasks have non-uniform resource use Each task uses one resource Ideal CPU time: total CPU monotask time / # CPU cores Concurrent tasks may contend Dedicated schedulers control contention Modeled new job runtime Ideal network runtime Resource use outside of framework Per-resource schedulers have complete control Ideal disk runtime (2x disk concurrency) Monotask times can be used to model performance No model for performance Emphasize: model simple (and not perfectly accurate!). But as I’ll show later, sufficiently accurate to answer what-if questions.
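The model on these slides fits in a few lines. A sketch (the monotask totals are made-up numbers) of computing the ideal per-resource times, taking the max as the modeled runtime, and answering the 2x-disk what-if:

```python
def modeled_runtime(monotask_seconds, slots):
    """monotask_seconds: resource -> total monotask time on that resource.
    slots: resource -> parallelism (# cores, # disks, ...).
    Ideal time per resource = total monotask time / parallelism;
    modeled job runtime = max of the ideal times (the bottleneck)."""
    ideal = {r: t / slots[r] for r, t in monotask_seconds.items()}
    return max(ideal.values()), ideal

# Made-up totals: 400s of CPU monotasks on 8 cores,
# 120s of network monotasks, 300s of disk monotasks on 1 disk.
times = {"cpu": 400.0, "network": 120.0, "disk": 300.0}
slots = {"cpu": 8, "network": 1, "disk": 1}
runtime, ideal = modeled_runtime(times, slots)

# What-if: 2x disk throughput, modeled as doubling disk concurrency.
whatif, _ = modeled_runtime(times, {**slots, "disk": 2})
print(runtime, whatif)
```

Here disk is the bottleneck (300s), so doubling disk throughput drops the modeled runtime to 150s, at which point disk is still the bottleneck, just barely ahead of the network.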

34 How does this decomposition work?
Today: Monotasks: 4 multi-resource tasks run concurrently Tasks have non-uniform resource use Each task uses one resource Concurrent tasks may contend Dedicated schedulers control contention How does this decomposition work? Resource use outside of framework Monotasks scheduled by per-resource schedulers Per-resource schedulers have complete control Monotask times can be used to model performance No model for performance Note that simple performance questions (e.g., bottleneck, time on each resource) are trivial

35 How does this decomposition work?
.reduceByKey(lambda a, b: a + b).saveAsTextFile("hdfs://…") Network monotasks: request remote data Today's reduce task: Network CPU Disk write Disk monotask: write output CPU monotask: deserialize, combine counts, serialize The framework can do this under the hood! Key insight: this is possible for the same reasons Spark is widely used: the high-level abstraction. API-compatible with Spark, implemented at the application layer
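The dependency structure described above can be sketched as a tiny DAG (the monotask names are hypothetical): each monotask lists the monotasks it waits for and runs only once all of them have completed, so network reads feed the CPU monotask, which feeds the disk write:

```python
def topo_run(deps):
    """deps: task -> list of tasks it depends on.
    Returns one valid execution order in which every monotask
    starts only after all of its dependencies have completed."""
    order, done = [], set()

    def run(t):
        if t in done:
            return
        for d in deps[t]:   # wait for all dependencies first
            run(d)
        done.add(t)
        order.append(t)

    for t in deps:
        run(t)
    return order

# Today's reduce task, decomposed into monotasks:
deps = {
    "net-read-1": [],                             # fetch remote shuffle data
    "net-read-2": [],
    "cpu-combine": ["net-read-1", "net-read-2"],  # deserialize, combine, serialize
    "disk-write": ["cpu-combine"],                # write output
}
print(topo_run(deps))
```

Because each monotask touches a single resource and starts only when its inputs are ready, none of them block mid-execution, which is what keeps resource use uniform.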

36 Reasoning about performance Today: why it’s hard Monotasks: why it’s easy Does using monotasks hurt performance? Using monotasks to predict job runtime

37 Does using monotasks hurt performance?
3 benchmark workloads: Big data benchmark (10 queries run with scale factor 5) Sort (600GB sorted using 20 machines) Block coordinate descent (ML workload, 16 machines) For all workloads, runtime comparable to Spark At most 9% slower, sometimes faster

38 How much faster would jobs run if…
Each machine had 2 disks instead of 1? Sort 600GB of key-value pairs on 20 machines Takeaway: the monotask model correctly predicts the new runtimes. Note that this is non-trivial (e.g., runtimes did not simply improve by a factor of 2, because something else became the bottleneck), and the monotasks model gets it correct in both cases. Predictions within 9% of the actual runtime

39 How much faster would job run if...
4x more machines? Input stored in memory (no disk read, no CPU time to deserialize)? Flash drives instead of disks (faster shuffle read/write time)? 10x improvement predicted with at most 23% error

40 Leveraging Performance Clarity to Automatically Improve Performance
Schedulers have complete visibility over resource use Network scheduler CPU scheduler: 1 monotask / core Can configure for best performance Disk drive scheduler: 1 monotask / disk

41 Leveraging Performance Clarity to Automatically Improve Performance
Monotask schedulers automatically select the ideal concurrency Monotasks allow you to get performance improvements that would be hard to get any other way

42 Why do we care about performance clarity?
By using single-resource monotasks, system can provide performance clarity without sacrificing performance Why do we care about performance clarity? Typical performance eval: group of experts Practical performance: 1 novice Insight from this talk: it *is* possible to architect for performance clarity, can do so with single-resource tasks Why do we care? We need to build systems that perform well for the average user

43 Reflecting on Monotasks
Painful to re-architect existing system to use monotasks Pipelining deeply integrated (>20K lines of code changed) Implemented at high layer of software stack Should clarity be provided by the operating system?

44 Goal: provide performance clarity
Only way to improve performance is to know what to speed up Using single-resource monotasks provides clarity without sacrificing performance With monotasks, it is easier to improve system performance The only way to know how to make things faster is to know what you need to speed up! Insight from this talk: it *is* possible to architect for performance clarity, and we can do so with single-resource tasks. With this one architectural change, we can make lots of things easy, both for the user and via automated tools. github.com/NetSys/spark-monotasks

