1 Varys: Efficient Coflow Scheduling
Mosharaf Chowdhury, Yuan Zhong, Ion Stoica (UC Berkeley)

Good morning. I'm Mosharaf from the AMPLab. Today I'm going to talk about how to perform application-aware network scheduling in data-parallel clusters using coflows. This is joint work with Yuan Zhong and Ion Stoica.
2 Communication is Crucial

Communication is crucial for analytics at scale. For example, in our earlier work we found that typical data-parallel analytics jobs at Facebook spend up to a third (33%) of their running time in shuffle, i.e., intermediate data transfers.1 As in-memory systems proliferate and disks are removed from the I/O pipeline, data-parallel applications will spend more and more of their time communicating over the network, and the network is likely to become the primary bottleneck. These are, of course, well-known concerns. The real question is: how do we go about optimizing communication performance?

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011
3 Optimizing Communication Performance, the Networking Approach: "Let systems figure it out"

So far, most of the research in the networking community has been done in the context of the flow abstraction. A flow is a sequence of packets between two endpoints, and it is the independent unit of allocation, sharing, load balancing, and prioritization. While this abstraction served us well for client-server and peer-to-peer communication, it is a poor fit for data-parallel applications.
4 Optimizing Communication Performance, the Systems Approach: "Let users figure it out"

Because we only have the application-agnostic flow abstraction, system designers expose a myriad of parameters in an attempt to optimize the performance of data-parallel applications. Examples include the number of parallel flows, the sizes of send/receive buffers, bytes in flight, and many others. To get a sense of how many parameters you might have to tune, a few weeks ago we counted the communication-related parameters in three of the most commonly used data-parallel frameworks:

  Framework   Version   # Comm. Params*
  Spark       1.0.1      6
  Hadoop      1.0.4     10
  YARN        2.3.0     20

*A lower bound: it does not include the many parameters that can indirectly impact communication (e.g., the number of reducers), and it excludes control-plane communication/RPC parameters.

These parameters are hard to understand, difficult to use, and they do not always produce the expected results.
5 Optimizing Communication Performance

Obviously, there is a massive gap between how the networking community and data-parallel applications view the problem of optimizing communication performance. Our goal is to bridge this gap. A data-parallel application is a sequence of computation and communication stages. If we focus on one of these communication stages, a shuffle in this case, we see that communication is highly structured:

- Each communication stage consists of a collection of parallel flows, with endpoints distributed across different machines.
- Each flow is independent: the input of one flow does not depend on another flow of the same stage.
- A communication stage cannot complete until all of its flows have completed, so its completion time depends on the last flow to complete.

As a result, completing individual flows quickly and minimizing flow completion times might not yield any application-level improvement.
6 Coflow1

We refer to such a collection of parallel flows as a coflow. Any data-parallel DAG typically consists of a sequence of coflows of different patterns, including many-to-many, many-to-one, one-to-many, all-to-all, and even a collection of parallel flows where each flow has distinct endpoints. Note that each individual flow is also a coflow.

1. Coflow: A Networking Abstraction for Cluster Applications, HotNets 2012
7 How to Schedule Coflows?

A data-parallel cluster is, of course, a shared pool of resources: many coflows, from many different jobs and frameworks, coexist on the shared network. If we consider all the machines in the cluster to be connected by a shared datacenter fabric with N ingress and N egress ports, the question we want to answer is how to schedule all of these coflows:

#1 ...for faster completion of coflows, which correlates more directly with job completion times?
#2 ...to meet more deadlines, i.e., to maximize the number of coflows that complete within their deadlines?

In this talk, I'll focus on making coflows complete faster.
8 Varys Enables Coflows in Data-Intensive Clusters

In this work we present Varys, a system that allows any data-parallel framework to take advantage of coflows:

- Simpler frameworks: zero user-side configuration using a simple coflow API.
- Better performance: faster and more predictable transfers through coflow scheduling.

Varys requires minor changes to the cluster computing frameworks, but no changes to the applications themselves. While we addressed the problem of scheduling an individual coflow in the past, Varys is about efficiently scheduling multiple coflows.
9 Benefits of Inter-Coflow Scheduling

First, let us see the potential of inter-coflow scheduling through a simple example. We have two coflows: coflow 1, in black, with one 3-unit flow on link 1; and coflow 2, in orange, with two flows, a (3-ε)-unit flow on link 1 and a 6-unit flow on link 2. Each block represents a unit of data, and it takes one unit of time to send each unit of data.

Fair sharing (what happens today): link 1 is shared almost equally between the two flows from the two coflows, and link 2 is used entirely by coflow 2's flow. After 6 time units, both coflows finish (coflow 1 completion time = 6; coflow 2 completion time = 6).

Flow-level prioritization1,2: recently there has been a lot of focus on minimizing flow completion times by prioritizing smaller flows. Here the orange (3-ε)-unit flow on link 1 is prioritized over the black flow and finishes after about 3 time units. But coflow 2 has not finished: it still has 3 more data units to move on link 2. When all flows finish, the coflow completion times are unchanged (6 and 6) even though a flow completion time improved.

The optimal: for application-level performance, the right choice is to let the black coflow finish first on link 1 (coflow 1 completion time = 3; coflow 2 completion time = 6). Coflow 1 improves without any impact on coflow 2. In fact, it is easy to show that significantly decreasing flow completion times may still not improve user experience at all.

1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013
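The arithmetic behind this example can be checked with a tiny simulation. This is my own illustrative sketch (not Varys code): flows on a link are served serially in a given priority order, each link transfers one data unit per time unit, and a coflow completes only when its last flow does.

```python
def cct(schedule):
    """schedule: {link: [(coflow_id, flow_size), ...]} with flows served
    serially in the given order. Returns each coflow's completion time,
    i.e., the finish time of its last flow across all links."""
    finish = {}
    for flows in schedule.values():
        t = 0.0
        for coflow, size in flows:
            t += size  # this flow finishes after everything queued before it
            finish[coflow] = max(finish.get(coflow, 0.0), t)
    return finish

eps = 0.25  # stands in for the slide's ε

# Flow-level prioritization: the smaller (3-ε)-unit orange flow preempts
# the black flow on link 1, yet both coflows still finish around time 6.
flow_prio = cct({1: [(2, 3 - eps), (1, 3)], 2: [(2, 6)]})

# Coflow-level scheduling (optimal here): the black coflow goes first.
coflow_opt = cct({1: [(1, 3), (2, 3 - eps)], 2: [(2, 6)]})
```

With flow-level prioritization, coflow 1 finishes at 6-ε and coflow 2 at 6; with the coflow-level order, coflow 1 finishes at 3 while coflow 2 still finishes at 6, halving coflow 1's completion time at no cost to coflow 2.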
10 Inter-Coflow Scheduling: Concurrent Open Shop Scheduling1

The inter-coflow scheduling problem has its roots in the concurrent open shop scheduling problem, where tasks run on independent machines; data-parallel job scheduling and caching blocks in memory also fall under its umbrella. Like many other scheduling problems, it is NP-hard, and the solution approach is similar to other scheduling heuristics: one must come up with an ordering of the coflows, and the effectiveness of that ordering heuristic determines the quality of the solution. Unlike this simplistic example, however, flows in a cluster do not run on independent links.

1. A note on the complexity of the concurrent open shop problem, Journal of Scheduling, 9(4):389–396, 2006
11 Inter-Coflow Scheduling on the Fabric

So we must bring the network fabric into the picture. In this example we have a 3-machine datacenter fabric; each machine's uplink and downlink appear as an ingress and an egress port of the fabric, and each ingress port has virtual output queues for each egress port. The flows on link 1 in our example may be going from ingress port 1 to egress port 1, and the flow on link 2 from ingress port 2 to egress port 2. In essence, a coflow's performance depends on its allocation at both the ingress and egress ports of the fabric.
12 Inter-Coflow Scheduling is NP-Hard: Concurrent Open Shop Scheduling with Coupled Resources

With flows on dependent links, we must consider both ordering and matching constraints. In fact, we have discovered a brand-new scheduling problem: concurrent open shop scheduling with coupled resources (COSS-CR). We have provided the first characterization of this problem, and we have proven that, due to the unique matching constraints imposed by the network, this is one of the rare scheduling problems where Graham's list scheduling method might not result in an optimal solution.
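The "coupled resources" constraint can be stated concretely: a feasible rate allocation must respect every ingress and egress port simultaneously, like a fractional matching. A small checker, with my own naming conventions, assuming flows are given as (src_port, dst_port, rate) triples:

```python
def feasible(flow_rates, capacity=1.0):
    """True iff the total rate through every ingress and egress port
    stays within the port capacity (the fabric's coupling constraint)."""
    load = {}
    for src, dst, rate in flow_rates:
        load[('in', src)] = load.get(('in', src), 0.0) + rate
        load[('out', dst)] = load.get(('out', dst), 0.0) + rate
    return all(r <= capacity + 1e-9 for r in load.values())

# Two flows from different ingress ports into the same egress port
# cannot both run at full rate: the shared egress port couples them.
ok = feasible([(1, 1, 0.5), (2, 1, 0.5)])
too_much = feasible([(1, 1, 1.0), (2, 1, 1.0)])
```

This coupling is exactly why per-link reasoning (and plain list scheduling) breaks down: speeding up one flow can only come at the expense of every other flow sharing either of its two ports.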
13 Varys Employs a Two-Step Algorithm to Minimize Coflow Completion Times

We propose a simple two-step greedy scheduling algorithm:

1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed.
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish in minimum time.

In the next few slides, I'll go over these two steps.
14 Ordering Heuristic: Smallest-Effective-Bottleneck-First (SEBF)

Let's take another example to weigh the pros and cons of different ordering schemes for the first step. We have two coflows: C1, in grey, with two flows, and C2, in orange, with three flows.

              C1   C2
  Length       3    4
  Width        2    3
  Size         5   12
  Bottleneck   5    4

Shortest-first: defining the length of a coflow as that of its longest flow, C1 has length 3 and C2 has length 4, so C1 goes first. But both grey flows are headed to egress port 1, so they compete with each other, and C1 takes 5 time units to finish. The orange flows make some progress in the meantime, but the one going to egress port 1 must wait until the grey flows have finished, so C2 finishes after 9 time units, for a total coflow completion time of 14.

Narrowest-first and smallest-first: instead of length, we could order by a coflow's width (its number of flows) or its total size; these are natural criteria because they capture how much of the fabric a coflow occupies. By either criterion, C1 would again be scheduled first.

Smallest-effective-bottleneck-first: observe, however, that a coflow's completion depends only on its bottleneck. C1's bottleneck, egress port 1, must receive 5 units of data, whereas C2's bottlenecks have to receive only 4 units each. Hence we propose the smallest-effective-bottleneck-first heuristic: C2 is scheduled first and finishes after 4 time units, and C1 then finishes at time 9, for a total of 13 time units. Although the improvement looks small in this particular example, as the number of concurrent jobs increases we get more opportunities to order coflows better, and our advantage grows. Note that in the single-link case SEBF reduces to classic SRTF, just as a coflow reduces to a flow.
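The SEBF comparison above can be sketched in a few lines. This is my own illustration, assuming flow sizes are known and each flow is a (src_port, dst_port, size) triple:

```python
def bottleneck(flows):
    """A coflow's effective bottleneck: the data volume through its
    most heavily loaded ingress or egress port."""
    load = {}
    for src, dst, size in flows:
        load[('in', src)] = load.get(('in', src), 0) + size
        load[('out', dst)] = load.get(('out', dst), 0) + size
    return max(load.values())

def sebf_order(coflows):
    """coflows: {name: [(src, dst, size), ...]}.
    Returns coflow names ordered smallest effective bottleneck first."""
    return sorted(coflows, key=lambda name: bottleneck(coflows[name]))

# The slide's example: C1's two flows (5 units total) both enter egress
# port 1; C2's three flows put 4 units on each port they touch.
coflows = {
    'C1': [(1, 1, 3), (2, 1, 2)],
    'C2': [(1, 2, 4), (2, 3, 4), (3, 1, 4)],
}
```

Shortest-first would pick C1 (length 3 vs. 4); SEBF picks C2 because its bottleneck ports carry only 4 units against C1's 5, which is exactly what yields the lower total completion time of 13 on the slide.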
16 Allocation Algorithm: MADD

Once we have determined an order of coflows, the next step is to set the rates of their individual flows. Recall that a coflow cannot finish before its very last flow, which means that finishing flows faster than the bottleneck cannot decrease the coflow's completion time. Hence we propose MADD, Minimum Allocation for Desired Duration: allocate the minimum rate to each flow of a coflow so that it finishes at the desired duration, for example at the bottleneck's completion time, or at the deadline if one is provided. We must also ensure that capacity is not wasted between the ordering and rate-allocation steps; I'll skip the details of work-conserving backfilling in this talk. The bigger question is: how can you use coflows?
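The allocation step can be sketched as follows. This is a hypothetical minimal version of my own, assuming flow sizes are known and the desired duration has already been computed from the coflow's bottleneck (bottleneck volume divided by port bandwidth) or from its deadline:

```python
def madd_rates(flow_sizes, duration):
    """flow_sizes: {flow_id: remaining data units}. Returns the minimum
    rate for each flow so that all flows finish together at `duration`,
    rather than racing ahead of the coflow's bottleneck."""
    return {f: size / duration for f, size in flow_sizes.items()}

# Example: the bottleneck flow needs 4 time units at full rate 1.0, so
# the 2-unit flow is slowed to rate 0.5 instead of finishing early; the
# freed bandwidth can be backfilled by flows of later coflows.
rates = madd_rates({'bottleneck_flow': 4, 'small_flow': 2}, duration=4)
```

The design point here is that slowing non-bottleneck flows costs the coflow nothing, since its completion time is fixed by the bottleneck anyway.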
17 Varys Enables Frameworks to Take Advantage of Coflow Scheduling

Varys exposes a simple coflow API that allows any data-parallel framework to express its communication requirements, and it enforces coflow scheduling through a centralized scheduler. The details are in the paper, but I want to stress that the API requires changes only in the framework: user jobs, whether already written or new, do not require any changes at all. Currently we assume that flow sizes are known, which is true for many common frameworks that write their intermediate data to disk.
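As a rough illustration of what "changes only in the framework" means, here is a hypothetical client-side sketch of such a coflow API (the class and method names are my own; consult the paper for the actual interface). A framework's driver registers a coflow, senders tag their data with its id via put, receivers fetch it via get, and the scheduler can therefore treat all of a coflow's flows as one entity.

```python
class CoflowClient:
    """Hypothetical in-process stand-in for a coflow-aware transfer
    service; a real client would talk to the centralized scheduler."""

    def __init__(self):
        self._store = {}
        self._next_id = 0

    def register(self):
        """Driver: create a new coflow and obtain its id."""
        self._next_id += 1
        return f"coflow-{self._next_id}"

    def put(self, coflow_id, key, data):
        """Sender (e.g., a map task): publish data under this coflow."""
        self._store[(coflow_id, key)] = data

    def get(self, coflow_id, key):
        """Receiver (e.g., a reduce task): fetch data; a real scheduler
        would pace this transfer according to the coflow's priority."""
        return self._store[(coflow_id, key)]

    def unregister(self, coflow_id):
        """Driver: mark the coflow complete so its state is released."""
        self._store = {k: v for k, v in self._store.items()
                       if k[0] != coflow_id}

client = CoflowClient()
cid = client.register()
client.put(cid, "map-0/part-1", b"intermediate bytes")
```

The key property is that only the framework's shuffle layer calls this API; user jobs running on top of the framework are untouched.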
18 Evaluation: A 3000-Node Trace-Driven Simulation Matched Against a 100-Node EC2 Deployment

Varys is written in about 4,500 lines of Scala, and we deployed it on 100 large-memory EC2 instances, each with a 1 Gbps NIC, for performance evaluation, matched against a 3000-node trace-driven simulation. You can find more details in the paper, but I want to touch on some high-level points in the next two slides, around the obvious questions: Does it improve performance? Can it beat non-preemptive solutions? And at what cost, i.e., will it starve coflows? The short answer to the first two is yes, and to the third, no.
19 Faster Jobs

In our EC2 experiments, compared to per-flow fair sharing, Varys improved the communication performance of data-intensive jobs by 1.85X on average and the corresponding job completion times by 1.25X. Many jobs are simply not communication-intensive; for communication-heavy jobs, the improvements are even higher:

               Comm. Heavy1          All Jobs
               Comm.    Job          Comm.    Job
  Avg.         3.16X    2.50X        1.85X    1.25X
  95th         3.84X    2.94X        1.74X    1.15X

We also found that recent techniques to optimize flow completion times can beat Varys at the flow level, but they are significantly worse than Varys at optimizing application-level performance; in fact, they are worse than TCP at minimizing coflow completion times.

1. 26% of jobs spend at least 50% of their duration in communication stages.
20 Better than Non-Preemptive Solutions; What About Perpetual Starvation?

Because Varys preempts large coflows in favor of smaller ones to minimize the average coflow completion time, there is a risk of starvation. One easy way to avoid starvation is FIFO scheduling. However, due to head-of-line blocking, FIFO-based schemes can be excessively bad, and Varys easily outperformed the FIFO-based schemes proposed several years ago1: by 5.65X on average and 7.70X at the 95th percentile.

So what about perpetual starvation? We observed none in our experiments, for two reasons. First, to avoid starvation we make sure every coflow makes progress by giving it a small share of the network. Second, and this one is a conjecture: it has been shown that SRTF does not cause significant starvation when task sizes follow heavy-tailed distributions. While we have not yet proven a similar result for coflows, we do observe heavy-tailed distributions in coflow sizes and bottlenecks, and empirically our starvation-prevention mechanism rarely had to kick in.

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011
21 Four Challenges

General-purpose coflow scheduling raises many challenges, and at least four remain unanswered in this work:

#1 Coflow dependencies: how to handle dependencies between the multiple stages of a data-parallel job, and between the multiple waves of a stage.
#2 Unknown flow information: how to emulate smallest-first scheduling without knowing flow sizes or the number of flows, e.g., under pipelining between stages, or task failures and restarts.
#3 Decentralized Varys: how to handle master failure and support low-latency analytics.

Baraat addresses the first three in the context of tasks, i.e., of point-to-multipoint or multipoint-to-point coflows. For data-parallel jobs, the more general scenario is multipoint-to-multipoint, where matching plays an important role in addition to ordering.
22 #4: Theory Behind "Concurrent Open Shop Scheduling with Coupled Resources"

We have introduced the COSS-CR problem, shown it to be strongly NP-hard, and proved that, unlike many scheduling problems, Graham's list scheduling approach is not guaranteed to work for it. Just last month, a group led by Prof. Yuan Zhong in Columbia University's operations research department found the first polynomial-time constant-factor approximation algorithm for COSS-CR. While the algorithm has high time complexity and the bound is loose (64/3), it gives an upper bound on how well we can do, and it introduces new techniques for analyzing this new scheduling problem. As the theory community picks it up, more results should follow.
23 Varys

- Greedily schedules coflows without worrying about flow-level metrics.
- Consolidates the network optimization of data-intensive frameworks.
- Improves job performance by addressing the COSS-CR problem.
- Increases predictability through informed admission control.

In conclusion, Varys enables general-purpose data-parallel frameworks to transparently take advantage of coflows, improving job-level performance and predictability. We have discovered a new scheduling problem and opened up a new avenue of scheduling research with some exciting results.

Mosharaf Chowdhury