
1 Coflow: Mending the Application-Network Gap in Big Data Analytics
Mosharaf Chowdhury, UC Berkeley

2 Big Data The volume of data businesses want to make sense of is increasing: an increasing variety of sources (web, mobile, wearables, vehicles, scientific, …); cheaper disks, SSDs, and memory; stalling processor speeds. The volume of data is growing faster than we can make sense of. This is fueled by three trends. First, we are seeing an ever greater variety of data sources, ranging from the traditional and mobile webs to connected devices and scientific experiments. The reason we are even attempting to make sense of all these datasets is that storage is becoming cheaper every day; even the prices of SSDs and DRAM are dropping. At the same time, processors are not getting any faster. Processing all the data we collect is becoming increasingly harder on a single machine, or even on a few machines.

3 Big Datacenters for Massive Parallelism
BlinkDB Storm Spark-Streaming Pregel GraphLab GraphX DryadLINQ Spark Dremel MapReduce Hadoop Dryad Hive 2005 2010 2015 So, we are building big datacenters with thousands of machines and a high-capacity network connecting them. And to make use of such large datacenters in an efficient, fault-tolerant, and scalable way, we are building distributed data analytics systems. It started with Google publishing the MapReduce paper in 2004, soon followed by Hadoop and Microsoft Dryad. Soon after, simple analytics were replaced by more complex, higher-level engines like Hive for Hadoop and DryadLINQ over Dryad. Around 2010, we open-sourced Spark for in-memory, iterative computing, which has grown to become a great success today. As time went on, we have seen specialized frameworks for graph queries, for streaming analytics, and for approximate queries. And this figure only shows the tip of the iceberg! We are seeing newer and newer programming models, resource schedulers, and datacenter designs, all aiming at different aspects of big data analytics.

4 Data-Parallel Applications
Multi-stage dataflow: computation interleaved with communication. Computation stage (e.g., Map, Reduce): distributed across many machines, tasks run in parallel. Communication stage (e.g., Shuffle): between successive computation stages. Yes, big data applications can differ a lot, ranging from SQL queries to machine learning algorithms. But they are quite similar as well. Deep down, most big data applications are multi-stage dataflows with interleaved computation and communication stages. Let's take the simplest example, a MapReduce application. It has two computation stages, Map and Reduce, and in between there is a communication stage called shuffle. Computation tasks run in parallel, distributed throughout the cluster. The shuffle stage transfers the outputs of all map tasks to the reduce tasks, creating a many-to-many communication pattern. Note that the shuffle has a unique collective behavior: it cannot complete until all the data have been transferred to the next computation stage. Modern data-parallel applications consist of many such stages, not just two. But the collective behavior of communication is still very common. A communication stage cannot complete until all the data have been transferred.

5 Communication is Crucial
Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication.1 As a result, their performance depends significantly on communication in many cases. By studying a month-long trace of 320,000 jobs and 150 million tasks from a 3000-machine production cluster, we found that, on average, Facebook jobs spend about 25% of their runtime transferring intermediate data. Similarly, we have found that many applications see communication as the primary bottleneck as they scale. This is especially true for broadcasts: the more machines a job uses, the more data it has to broadcast in total. As SSD-based and in-memory systems proliferate and disks are removed from the I/O pipeline, the network is likely to become the primary bottleneck. These are known problems. But how do we address them today? 1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

6 Faster Communication Stages: Networking Approach. “Configuration should be handled at the system level”
Flow: transfers data from a source to a destination; the independent unit of allocation, sharing, load balancing, and/or prioritization. The networking approach to this problem is to leave it to the system. The network provides a simple abstraction, the flow, which is just a sequence of packets between a source and a destination. It is a very simple abstraction but a very powerful one, and it has served us well as the Internet grew from an experiment into a necessity. Over the years we have used it to allocate resources, to perform load balancing, and to do prioritization inside the network. Everything we do today is still based on the flow abstraction.

7 Existing Solutions Per-Flow Fairness Flow Completion Time
WFQ CSFQ D3 DeTail PDQ pFabric GPS RED ECN XCP RCP DCTCP D2TCP FCP [Timeline: 1980s through 2015, from Per-Flow Fairness to Flow Completion Time] We have seen hundreds, if not thousands, of proposals that try to address these performance issues using the flow as the basic abstraction. In the early days, it was all about fair allocation. Recently, as datacenters have become more widely used, it is all about improving flow completion time. These are just a few; there are far more proposals than this slide can show. Today I am going to argue that while these are not incorrect, they are quite inefficient in the context of big data analytics. Independent flows fundamentally cannot capture the collective communication behavior common in data-parallel applications.

8 Why Do They Fall Short? [Figure: a shuffle with senders s1–s3 and receivers r1–r2, shown first as a logical plan and then mapped onto the datacenter network's input and output links] Let's take a simple example to see why flow-based approaches fall short. In this example, we have a shuffle with three senders and two receivers. The tasks are represented by circles, and the six arrows represent the six flows between these tasks. Each rectangle represents one unit of data; for example, sender s2 has two units of data to send to receiver r1 and two for receiver r2. This is a logical plan. But what does this shuffle look like in the datacenter network? Now consider a datacenter without any congestion in its core. We can assume this because modern datacenters have full bisection bandwidth, pushing resource contention to the edge links.

9 Why Do They Fall Short? [Figure: the same shuffle placed on the datacenter network, with the flows from s1–s3 contending on the edge links to r1 and r2]

10 Why Do They Fall Short? [Figure: per-flow fair sharing on the links to r1 and r2] Per-Flow Fair Sharing: Shuffle Completion Time = 5; Avg. Flow Completion Time = 3.66. Solutions focusing on flow completion time cannot further decrease the shuffle completion time.

11 Improve Application-Level Performance1
[Figure: Per-Flow Fair Sharing vs. Data-Proportional Allocation on the links to r1 and r2] Slow down faster flows to accelerate slower flows. Per-Flow Fair Sharing: Shuffle Completion Time = 5; Avg. Flow Completion Time = 3.66. Data-Proportional Allocation: Shuffle Completion Time = 4; Avg. Flow Completion Time = 4. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
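To make the idea concrete, here is a minimal sketch of data-proportional allocation on a single bottlenecked link: rather than splitting the link equally, every flow gets a rate proportional to its bytes so that all flows crossing the link finish together. The demand values below are illustrative, not the exact numbers behind the figure.

```scala
// Sketch of data-proportional rate allocation on one bottlenecked link:
// instead of splitting the link equally, give each flow a rate proportional
// to its remaining bytes so that all flows sharing the link finish together.
object DataProportional {
  // Returns (rate per flow, common finish time) for a link of capacity `cap`.
  def allocate(demands: Seq[Double], cap: Double): (Seq[Double], Double) = {
    val total  = demands.sum
    val finish = total / cap                   // everyone finishes together
    val rates  = demands.map(d => d / finish)  // rate_i = d_i * cap / total
    (rates, finish)
  }

  def main(args: Array[String]): Unit = {
    val demands = Seq(1.0, 2.0, 2.0)           // hypothetical flows on one link
    val (rates, finish) = allocate(demands, cap = 1.0)
    println(s"rates = $rates, all flows finish at t = $finish")
  }
}
```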

12 Faster Communication Stages: Systems Approach. “Configuration should be handled by the end users”
Applications know their performance goals, but they have no means to let the network know. The applications have not found a good way to address this problem either, so they expect their users to configure parameters themselves to improve performance. Even though the applications have all the information, they have no way to let the network know precisely what they actually want.

13 Holistic Approach: Applications and the Network Working Together
[Slide: the systems approach (“Configuration should be handled by the end users”) and the networking approach (“Configuration should be handled at the system level”), with MIND THE GAP between them] Clearly, there is a gap between the two approaches: the network does not know what the applications want, and the applications cannot address the problem by themselves. What we need is a holistic approach that brings applications and the network closer, so that the network can use application-level performance goals when making decisions.

14 Coflow Communication abstraction for data-parallel applications to express their performance goals: minimize completion times, meet deadlines, or perform fair allocation.

15 Broadcast, Single Flow, Aggregation, All-to-All, Parallel Flows, Shuffle
Using coflows as building blocks, we can now represent the entire dataflow as a sequence of computation stages and coflows. In this example, we have three coflows: a shuffle, an aggregation, and a broadcast. An all-to-all communication pattern, as seen in many graph analytics and linear algebra platforms, is also a coflow. Similarly, the parallel-flows pattern, used by replication when we write data to or read from remote machines, is also a coflow. In fact, the simple flow abstraction we have used forever is also a coflow, with just one flow. However, in a shared cluster there are many different jobs from many different users and frameworks coexisting. What happens when all of these applications start using coflows?
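As a rough sketch of why all of these patterns fit one abstraction, the data model below treats a coflow as nothing more than a named collection of flows. The type and method names are assumptions made up for illustration, not Varys code.

```scala
// Minimal data model (assumed names): every pattern on this slide is just a
// coflow, i.e., a collection of flows that complete together as one unit.
final case class Flow(src: String, dst: String, bytes: Long)
final case class Coflow(id: String, flows: Seq[Flow])

object Patterns {
  // Shuffle: every mapper sends to every reducer (many-to-many).
  def shuffle(mappers: Seq[String], reducers: Seq[String], bytes: Long): Coflow =
    Coflow("shuffle", for { m <- mappers; r <- reducers } yield Flow(m, r, bytes))

  // Broadcast: one source sends the same data to many destinations.
  def broadcast(src: String, dsts: Seq[String], bytes: Long): Coflow =
    Coflow("broadcast", dsts.map(d => Flow(src, d, bytes)))

  // A single flow is just a coflow with one flow.
  def single(src: String, dst: String, bytes: Long): Coflow =
    Coflow("single", Seq(Flow(src, dst, bytes)))

  def main(args: Array[String]): Unit = {
    val s = shuffle(Seq("m1", "m2"), Seq("r1", "r2"), bytes = 100)
    println(s"shuffle has ${s.flows.size} flows") // 4: every mapper to every reducer
  }
}
```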

16 How to schedule coflows online … #1 for faster completion of coflows? #2 to meet more deadlines? #3 for fair allocation of the network?
We end up with a datacenter carrying tens or hundreds of coflows instead of tens or hundreds of thousands of flows whose relationships the network knows nothing about. Given coflows from all these different frameworks, we can now reformulate network optimization as an application-aware network scheduling problem with different objectives. Today, I am going to talk about how to schedule coflows to minimize the average completion time, so that on average jobs complete faster.

17 Varys1 Enables coflows in data-intensive clusters.
Coflow Scheduler: faster, application-aware data transfers throughout the network. Global Coordination: consistent calculation and enforcement of scheduler decisions. The Coflow API: decouples network optimizations from applications, relieving developers and end users. To address these challenges, we have developed Varys, a system that allows applications to use coflows in a user-transparent way. Applications use coflows and offload network optimizations to a common network scheduler. Varys has three primary components. First, it contains a coflow scheduler to improve the overall performance of a data-intensive cluster. Second, it uses global coordination to consistently enforce the scheduler's decisions throughout the cluster. Finally, it provides a simple API so that developers can take advantage of coflows without requiring any changes from end users. 1. Efficient Coflow Scheduling with Varys, SIGCOMM'2014.

18 Coflow Communication abstraction for data-parallel applications to express their performance goals: the size of each flow, the total number of flows, and the endpoints of individual flows.

19 Benefits of Inter-Coflow Scheduling
Coflow 1 and Coflow 2 share Link 1 and Link 2 (flow sizes: 6 units, 3 units, 2 units). [Figure: three schedules over L1 and L2: Fair Sharing, Smallest-Flow First1,2, and The Optimal, each annotated with the two coflows' completion times] Let us see the potential of inter-coflow scheduling through a simple example. We have two coflows: coflow 1, in black, with one flow on link 1, and coflow 2 with two flows. Each block represents one unit of data, and assume it takes one unit of time to send each unit of data. Let's start by considering what happens today. In this plot, we have time on the X-axis and the links on the Y-axis. Link 1 will be almost equally shared between the two flows from the two different coflows, and the other link will be used entirely by coflow 2's flow. After 6 time units, both coflows will finish. Recently, there has been a lot of focus on minimizing flow completion times by prioritizing flows of smaller size. In that case, the orange flow on link 1 will be prioritized over the black flow and will finish after 3 time units. Note that coflow 2 has not finished yet, because it still has 3 more data units from its flow on link 2. Eventually, when all flows finish, the coflow completion times remain the same even though the flow completion times have improved. The optimal solution for minimizing application-level performance, in this case, is to let the black coflow finish first. As a result, we see an application-level performance improvement for coflow 1 without any impact on the other coflow. In fact, it is quite easy to show that significantly decreasing flow completion times might still not result in any improvement in user experience. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012. 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
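The key property behind this example can be stated in a couple of lines of code: a coflow's completion time is the maximum of its flows' finish times, so improving a non-bottleneck flow changes nothing at the coflow level. The finish times below are illustrative, loosely following the example above.

```scala
// A coflow completes only when its last flow completes, so improving the
// finish time of a non-bottleneck flow does not improve the coflow at all.
object CoflowCompletion {
  def cct(flowFinishTimes: Seq[Double]): Double = flowFinishTimes.max

  def main(args: Array[String]): Unit = {
    val fairSharing       = Seq(6.0, 6.0) // both flows of the coflow finish at 6
    val smallestFlowFirst = Seq(3.0, 6.0) // one flow finishes earlier under per-flow priorities
    println(cct(fairSharing))       // 6.0
    println(cct(smallestFlowFirst)) // still 6.0: the coflow gained nothing
  }
}
```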

20 Inter-Coflow Scheduling is NP-Hard
[Figure: the same two-coflow example: Fair Sharing (Coflow1 comp. time = 6, Coflow2 comp. time = 6), Flow-level Prioritization1,2 (Coflow1 comp. time = 6, Coflow2 comp. time = 6), The Optimal (Coflow1 comp. time = 3, Coflow2 comp. time = 6)] Concurrent Open Shop Scheduling:3 examples include job scheduling and caching blocks; solutions use an ordering heuristic. The inter-coflow scheduling problem has its roots in the concurrent open shop scheduling problem. In fact, data-parallel job scheduling and memory caching both fall under the umbrella of this problem. It has also been applied to schedule concurrent accesses to multiple memory controllers in chip multiprocessor systems. Like many other scheduling problems, it is NP-Hard, and the solution approach is similar to other scheduling heuristics: one must come up with an ordering of coflows, and the effectiveness of the ordering heuristic determines the goodness of the solution. Now, unlike this simplistic example, flows do not run on independent links in a cluster. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012. 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013. 3. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.

21 Inter-Coflow Scheduling is NP-Hard
[Figure: the same two coflows mapped onto a 3-machine datacenter fabric with input links and output links] Concurrent Open Shop Scheduling with Coupled Resources: examples include job scheduling and caching blocks; solutions use an ordering heuristic and must consider matching constraints. So, we must bring the network fabric into the picture. In this example, we have a 3-machine fabric. Each machine's incoming and outgoing links are shown as the ingress and egress ports of the fabric, and each input port has virtual output queues for each output port. So the flows on link 1 in our example may be going from input port 1 to output port 1, and the flow on link 2 is going from input port 2 to output port 2. In essence, the performance of a coflow depends on its allocation at both the input and output ports of the fabric. In fact, we have discovered a brand-new scheduling problem, concurrent open shop scheduling with coupled resources. We have also provided the first characterization of this problem, showing some of its unique characteristics. Quite unsurprisingly, it is also NP-Hard.

22 Varys Employs a two-step algorithm to minimize coflow completion times
We propose a simple two-step greedy scheduling algorithm that splits the problem at its natural boundary. As coflows arrive and leave the network online, we continuously maintain an ordered list of coflows and allocate rates to the individual flows of each coflow. In the next few slides, I'll go over this two-step algorithm. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed. Allocation algorithm: allocate the minimum required resources to each coflow to finish it in minimum time.

23 Ordering Heuristic: Shortest-First (Total CCT = 35) [Figure: coflows C1, C2, C3 with lengths 3, 5, 6 scheduled shortest-first across ports O1–O3; C1 ends at 3, C2 at 13, C3 at 19]

24 Ordering Heuristic: Narrowest-First (Total CCT = 41) [Figure: coflows C1, C2, C3 with widths 3, 2, 1; under Narrowest-First, C3 ends at 6, C2 at 16, C1 at 19; Shortest-First (35) shown for comparison]

25 Ordering Heuristic: Smallest-First (Total CCT = 34) [Figure: coflows C1, C2, C3 with sizes 9, 10, 6; under Smallest-First, C3 ends at 6, C1 at 9, C2 at 19; Shortest-First (35) and Narrowest-First (41) shown for comparison]

26 Ordering Heuristic: Smallest-Bottleneck (Total CCT = 31) [Figure: coflows C1, C2, C3 with bottlenecks 3, 10, 6; under Smallest-Bottleneck, C1 ends at 3, C3 at 9, C2 at 19; Shortest-First (35), Narrowest-First (41), and Smallest-First (34) shown for comparison]
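The totals on slides 23–26 can be reproduced with a small sketch, under the simplifying assumption that coflows run one at a time on a fully contended fabric, so each coflow occupies the fabric for exactly its bottleneck time; the per-coflow lengths, widths, sizes, and bottlenecks are the ones given on the slides.

```scala
// Sketch comparing the four ordering heuristics from slides 23-26.
// Simplifying assumption: coflows are scheduled one after another, so a
// coflow's completion time is the cumulative sum of bottlenecks before it,
// plus its own.
object OrderingHeuristics {
  final case class C(name: String, length: Int, width: Int, size: Int, bottleneck: Int)

  // Attributes as given on the slides.
  val coflows = Seq(C("C1", 3, 3, 9, 3), C("C2", 5, 2, 10, 10), C("C3", 6, 1, 6, 6))

  def totalCCT(order: Seq[C]): Int =
    order.scanLeft(0)(_ + _.bottleneck).tail.sum // sum of cumulative finish times

  def main(args: Array[String]): Unit = {
    println("Shortest-First:      " + totalCCT(coflows.sortBy(_.length)))     // 35
    println("Narrowest-First:     " + totalCCT(coflows.sortBy(_.width)))      // 41
    println("Smallest-First:      " + totalCCT(coflows.sortBy(_.size)))       // 34
    println("Smallest-Bottleneck: " + totalCCT(coflows.sortBy(_.bottleneck))) // 31
  }
}
```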

27 Allocation Algorithm Finishing flows faster than the bottleneck cannot decrease a coflow's completion time. Allocate minimum flow rates such that all flows of a coflow finish together, on time. A coflow cannot finish before its very last flow. Once we have determined an order of coflows, the next step is to determine the rates of their individual flows. Recall that a coflow cannot finish until all of its flows have finished, which means that finishing flows earlier than a coflow's bottleneck cannot decrease its completion time. Hence, we allocate the minimum rates to each flow of a coflow so that they all finish together with the bottleneck, or they all finish at the deadline if a deadline is provided. Similar to our earlier example, we take the extra bandwidth from one coflow and give it to the next coflow in our ordered list, so that the next one can also proceed toward its completion without hurting the completion time of the previous one. Note that we must ensure that capacity is not wasted between the ordering and rate-allocation steps; I'll skip the details of work-conserving backfilling in this talk.
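Here is a minimal sketch of that allocation step: compute the coflow's bottleneck completion time from its per-port demands, then give every flow just enough rate to finish exactly at that time. The helper names and the example demand matrix are assumptions for illustration.

```scala
// Sketch of the allocation step: compute the coflow's bottleneck completion
// time from its per-port demands, then give each flow the minimum rate that
// makes it finish exactly then (so no flow finishes needlessly early).
object MinimumAllocation {
  // demand(i)(j) = bytes from input port i to output port j; caps in bytes/sec.
  def allocate(demand: Array[Array[Double]], inCap: Double, outCap: Double)
      : (Array[Array[Double]], Double) = {
    val inLoad  = demand.map(_.sum)            // bytes leaving each input port
    val outLoad = demand.transpose.map(_.sum)  // bytes entering each output port
    val gamma   = math.max(inLoad.max / inCap, outLoad.max / outCap) // bottleneck time
    val rates   = demand.map(_.map(_ / gamma)) // rate = bytes / gamma
    (rates, gamma)
  }

  def main(args: Array[String]): Unit = {
    val demand = Array(Array(4.0, 0.0), Array(2.0, 2.0)) // illustrative 2x2 shuffle
    val (rates, gamma) = allocate(demand, inCap = 1.0, outCap = 1.0)
    println(s"coflow finishes at t = $gamma")            // 6.0: output port 0 is the bottleneck
    rates.foreach(r => println(r.mkString(", ")))
  }
}
```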

28 Varys Enables coflows in data-intensive clusters.
Coflow Scheduler: faster, application-aware data transfers throughout the network. Global Coordination: consistent calculation and enforcement of scheduler decisions. The Coflow API: decouples network optimizations from applications, relieving developers and end users. I have shown how coflow scheduling works in Varys. Next, I'll show how Varys enforces its decisions, and why it needs coordination in the first place.

29 The Need for Coordination
[Figure: coflows C1 and C2 with bottlenecks 4 and 5 scheduled across ports O1–O3; with coordination, C1 ends at 4 and C2 at 9] Scheduling with Coordination (Total CCT = 13).

30 The Need for Coordination
[Figure: the same coflows with and without coordination; with coordination C1 ends at 4 and C2 at 9 (Total CCT = 13), without coordination C1 ends at 7 and C2 at 12 (Total CCT = 19)] Uncoordinated local decisions interleave coflows, hurting performance.

31 Varys Architecture Centralized master-slave architecture.
Applications use a client library to communicate with the master; actual timing and rates are determined by the coflow scheduler. [Figure: senders, receivers, and the driver issue put/get/reg calls through the Varys client library to per-machine Varys daemons and the Varys master, which hosts the coflow scheduler alongside a network interface, topology monitor, and usage estimator, on top of a (distributed) file system] Varys's architecture is similar to that of many cluster frameworks in use today. It has a centralized master that reschedules on coflow arrivals and departures. While the senders and receivers express their intent, the actual timing and rates are determined by the master. The master can be a source of coordination overhead; however, our evaluation on clusters of up to 100 machines shows that we can easily handle even O(100) ms coflows. Beyond that, we would need switch support and changes to the entire network. This is, of course, a valid path, but I am not confident in the deployability of approaches that call for forklift upgrades of millions of dollars of investment.

32 Varys Enables coflows in data-intensive clusters.
Coflow Scheduler: faster, application-aware data transfers throughout the network. Global Coordination: consistent calculation and enforcement of scheduler decisions. The Coflow API: decouples network optimizations from applications, relieving developers and end users. Finally, let's take a look at how you would use coflows with Varys.

33 The Coflow API NO changes to user jobs. NO storage management. register, put, get, unregister.
@driver (JobTracker): b = register(BROADCAST); id = b.put(content); s = register(SHUFFLE); …; b.unregister(); s.unregister(). @mapper: b.get(id); ids = s.put(content). @reducer: s.get(ids). The coflow API consists of four methods: register and put, with the corresponding get and unregister. This is similar to the open, send, recv, and close calls of traditional BSD sockets. How does it help? Consider the simple classic MapReduce model with a single shuffle, and let's try to add a broadcast from the driver at the beginning of each job or iteration. It becomes extremely simple. Note that these changes are ONLY in the framework; ALL user jobs remain the same. At the same time, despite what the method names suggest, the actual data stays in user jobs; the network does not store any data.
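A rough Scala sketch of how a framework might drive these four calls for the broadcast-plus-shuffle example follows. The client-library class, its constructor, and the exact signatures are assumptions made up for illustration; only the four verbs come from the API above. Data itself stays in the framework; only metadata goes to the coflow scheduler.

```scala
// Sketch only: `VarysClient` and `CoflowHandle` are assumed stand-ins for the
// real client library, stubbed out so the call pattern is runnable end to end.
class VarysClient {
  private var counter = 0
  def register(kind: String): CoflowHandle = { counter += 1; new CoflowHandle(s"$kind-$counter") }
}

class CoflowHandle(val coflowId: String) {
  private var n = 0
  // put() announces a piece of data and returns an id that others can get().
  def put(content: Array[Byte]): String = { n += 1; s"$coflowId/obj$n" }
  def get(id: String): Array[Byte] = Array.emptyByteArray // would fetch over the network, paced by the scheduler
  def unregister(): Unit = ()
}

object MapReduceDriver {
  def main(args: Array[String]): Unit = {
    val client = new VarysClient
    // @driver: one coflow for the job's broadcast, one for its shuffle.
    val b  = client.register("BROADCAST")
    val id = b.put("job config".getBytes)
    val s  = client.register("SHUFFLE")

    // @mapper: read the broadcast, emit map output into the shuffle coflow.
    val config    = b.get(id)
    val partition = s.put("map output".getBytes)

    // @reducer: fetch its shuffle partitions.
    val input = s.get(partition)

    b.unregister(); s.unregister()
  }
}
```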

34 A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment
Evaluation. Does it improve performance? Can it beat non-preemptive solutions? Do we really need coordination? YES. Varys is written in about 4,500 lines of Scala, and we deployed it on 100 large-memory EC2 instances for performance evaluation; each machine had a 1 Gbps NIC. You can find more details in the paper, but I want to touch on some high-level points in the next two slides. The obvious questions are: Does it improve performance? At what cost, that is, will it cause starvation? And do we really need coordination in practice? The short answer to all three questions is yes.

35 Better than Per-Flow Fairness
[Table: communication and job-level improvements over per-flow fairness, in EC2 (3.16X, 1.85X, 2.50X, 1.25X) and in simulation (3.21X, 1.11X, 4.86X, 3.39X), with jobs bucketed by the fraction of time they spend in communication] In our EC2 experiments we observed that, in comparison to per-flow fair sharing, Varys improved the communication performance of data-intensive jobs by up to 1.85X and the corresponding job completion times by 1.25X. This is because many jobs are not communication-intensive; for communication-intensive jobs, the improvements are much higher.

36 Preemption is Necessary [Sim.]
[Figure: simulated comparison of Varys against Per-Flow Fairness and Prioritization2,3 baselines1,4] What about starvation? NO. Because Varys preempts large coflows with smaller ones to minimize the average CCT, there is a risk of starvation. One easy way to avoid starvation is to use FIFO scheduling; however, due to head-of-line blocking, FIFO-based schemes can be excessively bad, and Varys easily outperformed the FIFO-based schemes proposed several years ago. What about starvation, then? We observed no perpetual starvation in our experiments, for two reasons. First, to avoid starvation, we make sure that every coflow makes progress by giving it a small share. The second is a conjecture: it has been shown that SRTF does not cause significant starvation when task sizes follow heavy-tailed distributions. While we have not yet been able to prove a similar result for coflows, we do observe heavy-tailed distributions in coflow sizes and bottlenecks, and empirically the starvation-prevention mechanism rarely had to kick in. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011

37 Lack of Coordination Hurts [Sim.]
[Figure: simulated comparison of Varys against per-flow fairness, prioritization2,3, and decentralized coflow scheduling1,4] Smallest-flow-first (per-flow priorities) minimizes flow completion time. FIFO-LM4 performs decentralized coflow scheduling: it suffers due to local decisions, though it works well for small, similar coflows. We have also found that recent techniques for optimizing flow completion times are significantly worse than Varys at optimizing application-level performance; in fact, they can even be worse than TCP at minimizing coflow completion times. In the same vein, recent coflow-aware decentralized proposals fall short as well for data analytics jobs. Decentralized solutions work well for small coflows from user-facing requests, but balking at the price of coordination really hurts when large and small coflows intermix in an analytics cluster. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014

38 Coflow Communication abstraction for data-parallel applications to express their performance goals: the size of each flow, the total number of flows, and the endpoints of individual flows. But: pipelining between stages, speculative execution, and task failures and restarts. Of course, Varys performs well because everything is known. But in many cases, pipelining between successive computation stages means that flow sizes cannot be known. Similarly, speculative execution, a common technique to mitigate stragglers, means that the total number of flows can change as well. Finally, because tasks fail and are restarted, the endpoints of the flows are not known a priori either. How does the coflow abstraction help us then?

39 How to Perform Coflow Scheduling Without Complete Knowledge?
Meaning, how to efficiently schedule coflows without having complete knowledge?

40 Implications [Table: minimizing average completion time. Flows in a single link: with complete knowledge, Smallest-Flow-First; without complete knowledge, Least-Attained Service (LAS). Coflows in an entire datacenter: with complete knowledge, Ordering by Bottleneck Size + Data-Proportional Rate Allocation; without complete knowledge, ?] We have seen that Varys emulates smallest-remaining-time-first in the context of coflows with great success. When we do not have any information about flow sizes or bottlenecks, we cannot use that solution. In the traditional networking literature, least-attained service is used to minimize flow completion time in such cases; in fact, it has been shown that for heavy-tailed distributions of flow sizes, LAS can closely approximate the shortest-first heuristic. What is the equivalent in the case of coflows?

41 Revisiting Ordering Heuristics
[Figure: the four orderings from slides 23–26: Shortest-First (35), Narrowest-First (41), Smallest-First (34), and Smallest-Bottleneck (31)]

42 Coflow-Aware LAS (CLAS)
Set a priority that decreases with how much a coflow has already sent: the more a coflow has sent, the lower its priority, so smaller coflows finish faster. Use the total size of coflows to set priorities; this avoids the drawbacks of full decentralization. We propose Coflow-Aware LAS, where a coflow's priority decreases with the amount of data it has already sent. The more a coflow sends, the lower its priority; hence, smaller coflows finish faster and the average coflow completion time decreases. Note that on a single link this reduces to classical LAS.
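A minimal sketch of this policy, with assumed class and method names: track how much each coflow has sent and always serve the coflow that has sent the least so far.

```scala
// Sketch of Coflow-Aware LAS: a coflow's priority decreases with the total
// bytes it has already sent, so the coflow that has sent the least is served first.
import scala.collection.mutable

class ClasScheduler {
  private val bytesSent = mutable.Map.empty[String, Long].withDefaultValue(0L)

  def onBytesSent(coflowId: String, bytes: Long): Unit =
    bytesSent(coflowId) += bytes

  // Among coflows with pending data, pick the one that has sent the least so far.
  def nextToServe(pending: Set[String]): Option[String] =
    if (pending.isEmpty) None else Some(pending.minBy(bytesSent))
}

object ClasDemo {
  def main(args: Array[String]): Unit = {
    val sched = new ClasScheduler
    sched.onBytesSent("bigCoflow", 500)
    sched.onBytesSent("smallCoflow", 10)
    println(sched.nextToServe(Set("bigCoflow", "smallCoflow"))) // Some(smallCoflow)
  }
}
```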

43 Coflow-Aware LAS (CLAS)
Continuous priority reduces to fair sharing when similar coflows coexist (priority oscillation); FIFO works well for similar coflows and avoids the drawbacks of full decentralization. [Figure: two equal coflows; under oscillating priorities both finish at time 6, whereas FIFO finishes Coflow 1 at time 3 and Coflow 2 at time 6] However, setting priorities continuously can result in priority oscillation when coflows are of the same size. Take two simple coflows, each with three units of data to send. As before, assuming one unit of data can be sent per time unit, the black coflow sends one data unit, at which point the orange coflow has the higher priority; then the orange coflow sends and the priorities flip back. They keep alternating and, with a small enough prioritization granularity, this boils down to fair sharing. As a result, both coflows finish together with the same coflow completion time of 6 time units. In such cases, i.e., when two similar coflows coexist, FIFO scheduling gives the optimal solution. What we want is a combination of the two: FIFO for similar coflows and prioritization for dissimilar ones. It turns out we can have both if we discretize priorities.

44 Discretized Coflow-Aware LAS (D-CLAS)
Priority discretization: change a coflow's priority when its total size exceeds predefined thresholds. Scheduling policies: FIFO within the same queue; prioritization across queues; weighted sharing across queues guarantees starvation avoidance. [Figure: FIFO queues Q1 (highest priority) through QK (lowest priority)] Instead of changing priorities continuously, we change priorities at predefined size thresholds. We can think of these thresholds as forming a number of logical queues holding coflows. Coflows within the same thresholds have the same priority and are scheduled in FIFO order, whereas coflows in different queues have different priorities. Consequently, we combine FIFO and prioritization without complete knowledge of coflows. How do we determine the thresholds? It turns out that determining the optimal number of queues is difficult even for a single link, so we use exponentially spaced queue thresholds for practical reasons. First, it keeps the number of queues small. Second, it allows each slave to make local decisions, with only loose coordination to calculate global coflow sizes. Finally, small coflows are scheduled in FIFO order, avoiding global coordination overheads.

45 How to Discretize Priorities?
Exponentially spaced thresholds: K is the number of queues, A the threshold constant, E the threshold exponent. The highest-priority queue Q1 holds coflows below E^1·A, and the thresholds grow up to E^(K-1)·A for the lowest-priority queue QK. Loose coordination suffices to calculate global coflow sizes; slaves make independent decisions in between. Small coflows (smaller than E^1·A) do not experience coordination overheads!
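A small sketch of this discretization, using the K, E, and A parameters above; the concrete parameter values in the example are illustrative.

```scala
// Sketch of discretized priorities with exponentially spaced thresholds:
// queue 1 (highest priority) holds coflows that have sent less than E^1*A bytes,
// queue k holds those below E^k*A, and anything past E^(K-1)*A falls into
// queue K (lowest priority).
object DClasQueues {
  def queueIndex(bytesSent: Long, K: Int, E: Double, A: Double): Int = {
    // Smallest k in [1, K-1] with bytesSent < E^k * A; otherwise queue K.
    (1 until K).find(k => bytesSent < math.pow(E, k) * A).getOrElse(K)
  }

  def main(args: Array[String]): Unit = {
    val (k, e, a) = (10, 10.0, 10.0 * 1024 * 1024) // e.g., 10 queues, 10x spacing, 10 MB base
    println(queueIndex(1L * 1024 * 1024, k, e, a))        // 1: small coflow, highest priority
    println(queueIndex(5L * 1024 * 1024 * 1024, k, e, a)) // 3: larger coflow, lower-priority queue
  }
}
```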

46 Closely Approximates Varys [Sim. & EC2]
[Figure: scheduling without complete knowledge compared against Per-Flow Fairness and Prioritization2,3 baselines1,4] Put together, even without complete information we can achieve similar performance, and exploiting just one piece of information, how flows relate to each other, lets us beat all application-agnostic schemes. This makes coflows practical even in the presence of failures, speculation, scheduling uncertainties, and coflow dependencies. 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014

47 My Contributions
[Figure: Network-Aware Applications: Spark (NSDI'12; top-level Apache project) and Sinbad (SIGCOMM'13; merged at Facebook). Application-Aware Network Scheduling (Coflow): Orchestra (SIGCOMM'11; merged with Spark), Varys (SIGCOMM'14; open-source), and Aalo (SIGCOMM'15; open-source). Datacenter Resource Allocation: FairCloud (SIGCOMM'12; @HP), HARP (SIGCOMM'12; @Microsoft Bing), and ViNEYard (ToN'12; open-source)] Today, I have talked about application-aware network scheduling using coflows. I have talked about improving the performance of individual applications, which was part of the Orchestra project, and I have discussed how to perform inter-coflow scheduling to improve performance across multiple applications. Apart from mending the application-network gap, I have worked on the communication stack of data analytics frameworks and on the replication pipeline of distributed file systems like HDFS to improve their network performance. In the infrastructure layer, I have worked on resource allocation in public clouds in the FairCloud project, in private datacenters in the HARP project, and in virtualized environments in the ViNEYard project. My work on coflow scheduling has already seen some success and shows more promise. The part of the Orchestra project that focused on improving an individual coflow's performance has been merged with Apache Spark. Varys, on the other hand, has resulted in theoretical advances as well as practical ones: researchers at Columbia, Rice, USC, and in Hong Kong are working on coflow-aware routing, load balancing, and even optical network scheduling. The rest of my work has been well received as well. Spark is now a top-level Apache project, used by tens of companies in production and maintained by hundreds of developers. My work on virtual network embedding has inspired hundreds of resource allocation projects in diverse environments, including wireless networks, software-defined networks, sensor networks, grid computing, and others. Some of my other projects have been adopted and extended by large companies: I worked closely for six months with the Facebook HDFS team to incorporate my network-aware replication technique into their distributed file system, FairCloud inspired HP's ElasticSwitch project, and HARP was deployed in many of Bing's datacenters in 2012. Going forward, I want to maintain and expand these collaborations, not only to have impact but also to discover new and exciting problems.

48 Communication-First Big Data Systems
In-Datacenter Analytics: cheaper SSDs and DRAM, the proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck. Inter-Datacenter Analytics: bandwidth-constrained wide-area networks. End-User Delivery: faster, more responsive delivery of analytics results over the Internet for a better end-user experience. While I have worked on mending the gap between the network and data-parallel applications, I believe the future is all about systems that are communication-aware from their inception. We stand at the confluence of three trends that all point toward the elevated importance of network-aware designs within a datacenter. As SSDs become cheaper and resources become disaggregated, the network will feel the crunch more than ever before; at the same time, optical networks call for wavelength-aware algorithms to better exploit broadcast and multicast capabilities. In the wide area, inter-datacenter analytics is becoming more and more common, as pulling all the data into one place is becoming harder for regulatory reasons and because of bandwidth costs. Instead of dumping everything into the same datacenter, we have to design applications that run on data located across the globe. Given the much higher latencies over the Internet, many now-well-established techniques, for example straggler detection, will require rethinking. The last piece is delivering results to end users: after all the analysis, we have to provide better results to users, and we must do so at lower latency than today. Recent protocols like SPDY and HTTP/2 are already moving in coflow-like directions, multiplexing multiple streams to get better performance. I believe the key is to find clean interfaces that carry application-level goals to the network without completely tying the two together.

49 Systems and Networking: MIND THE GAP. Better capture application-level performance goals using coflows. Coflows improve application-level performance and usability. Extends the networking and scheduling literature. Coordination, even if not free, is worth paying for in many cases. In conclusion, I have tried to mend the gap between systems and networking, not by discarding all that we have done for the last four decades, but by extending it with a few extra bits of information so that the network has some context about what applications want. I have shown that coflows improve performance, predictability, and usability. This direction has also opened a new scheduling problem, with exciting results still to come. Finally, I have shown that, especially in controlled environments like datacenters, we should not shy away from coordination when necessary. Thank you.

50

51 Improve Flow Completion Times
[Figure: Per-Flow Fair Sharing vs. Smallest-Flow First1,2 on the links to r1 and r2] Per-Flow Fair Sharing: Shuffle Completion Time = 5; Avg. Flow Completion Time = 3.66. Smallest-Flow First: Shuffle Completion Time = 6; Avg. Flow Completion Time = 2.66. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012. 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.

52 Distributions of Coflow Characteristics

53 Traffic Sources Ingest and replicate new data
Read input from remote machines when needed. Transfer intermediate data. Write and replicate output. What is common across this landscape is that the network plays a crucial role throughout the entire lifetime of data: from ingesting new data into a datacenter and reading data for analysis, to intermediate data transfers, to eventually writing the results out, the network is there. [Figure: percentage of traffic by category at Facebook]

54 Distribution of Shuffle Durations
Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication. Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 320,000 jobs, 150 million tasks. Understandably, many data-intensive jobs depend on communication for faster end-to-end completion times. To understand the impact of communication, we analyzed more than 150 million tasks and 320 thousand jobs from Facebook's production cluster and looked at how much time they spend in communication. This plot shows what fraction of a job's lifetime is spent in the shuffle stage, the communication stage of MapReduce; it does not include time spent reading or replicating data. While the dependency on communication varies, on average a quarter of a job's lifetime is spent transferring data between distributed machines. [Table: jobs, mappers, reducers, and total tasks, for all jobs and for jobs with shuffles]

55 Theoretical Results Structure of optimal schedules
Permutation schedules might not always lead to the optimal solution. Approximation ratio of COSS-CR: a polynomial-time algorithm with constant approximation ratio (64/3).1 The need for coordination: fully decentralized schedulers can perform arbitrarily worse than the optimal. We have found some interesting theoretical results regarding the COSS-CR problem. First, unlike many scheduling problems, a permutation might not yield an optimal schedule: if we have N coflows and consider all N! possible orders, scheduling them one after another, the optimal schedule might not be among those N! schedules; instead it might involve interleaving coflows in some cases. The implication is that finding the approximation factor of our heuristic has so far been elusive. All is not lost, however: recently, our friends in Columbia Operations Research found a 64/3 constant-factor polynomial-time algorithm for this problem. This is mostly of theoretical interest so far, because the algorithm is computationally expensive and cannot be used in the online setting we want it for. Third, and the result with the most impact on our system design, is the need for coordination: we have proved that uncoordinated coflow scheduling can be arbitrarily far from the optimal. I will show empirical results suggesting the same as well. 1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014.

56 The Coflow API NO changes to user jobs. NO storage management. register, put, get, unregister, with their full arguments.
@driver (JobTracker): b = register(BROADCAST, numFlows); id = b.put(content, size); s = register(SHUFFLE, numFlows, {b}); …; b.unregister(); s.unregister(). @mapper: b.get(id); ids = s.put(content, size). @reducer: s.get(ids).
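For completeness, a short sketch of what the hinted calls might look like. The stub types and signatures are assumptions; only the extra arguments (number of flows, per-put sizes, and the third register argument {b}, which I read as the shuffle's dependency on the broadcast coflow b) come from the slide.

```scala
// Sketch only: assumed stubs showing the API calls with the optional hints
// (numFlows, size, and a dependency set), not the real Varys signatures.
object HintedCoflowApi {
  final case class Handle(id: String) {
    def put(content: Array[Byte], size: Long): String = s"$id/obj"
    def unregister(): Unit = ()
  }
  def register(kind: String, numFlows: Int, deps: Set[Handle] = Set.empty): Handle =
    Handle(s"$kind-$numFlows")

  def main(args: Array[String]): Unit = {
    val (numMappers, numReducers) = (4, 2)
    val b       = register("BROADCAST", numFlows = numMappers)
    val content = "job config".getBytes
    val id      = b.put(content, size = content.length.toLong)
    // The shuffle declares how many flows it will have and that it follows coflow b.
    val s = register("SHUFFLE", numFlows = numMappers * numReducers, deps = Set(b))
    b.unregister(); s.unregister()
  }
}
```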

57 Varys Employs a two-step algorithm to support coflow deadlines
How do we use coflows to support deadlines? We use admission control to make sure that no coflow is admitted unless its deadline guarantee can be met. After that, we set rates so that admitted coflows finish at their deadlines. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines. Allocation algorithm: allocate the minimum required resources to each coflow to finish it at its deadline.
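A sketch of the admission test under a deliberately simplified single-bottleneck-link model: a new coflow is admitted only if the bandwidth left after honoring existing deadline reservations still lets it finish by its own deadline. The names and the single-link model are assumptions for illustration.

```scala
// Sketch of deadline admission control on one bottleneck link: a new coflow is
// admitted only if sending its bottleneck demand with the bandwidth left over
// by already-admitted coflows still meets its deadline.
object DeadlineAdmission {
  final case class Coflow(bottleneckBytes: Double, deadlineSec: Double)

  def admit(admitted: Seq[Coflow], candidate: Coflow, linkCapacity: Double): Boolean = {
    // Bandwidth already promised so that each admitted coflow meets its deadline.
    val reserved = admitted.map(c => c.bottleneckBytes / c.deadlineSec).sum
    val residual = linkCapacity - reserved
    residual > 0 && candidate.bottleneckBytes / residual <= candidate.deadlineSec
  }

  def main(args: Array[String]): Unit = {
    val admitted = Seq(Coflow(bottleneckBytes = 50, deadlineSec = 100)) // reserves 0.5
    println(admit(admitted, Coflow(40, 100), linkCapacity = 1.0)) // true: needs 0.4 of the 0.5 left
    println(admit(admitted, Coflow(80, 100), linkCapacity = 1.0)) // false: needs 0.8, only 0.5 left
  }
}
```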

58 Facebook Trace Simulation
More Predictable. [Figure: fraction of coflows that met their deadline, were not admitted, or missed their deadline, for Varys and Earliest-Deadline First1, in a Facebook trace simulation and in the EC2 deployment] Finally, how does Varys perform for deadline-sensitive coflows? In simulation, we found that no admitted coflow misses its deadline under Varys; it is better than both fair sharing and earliest-deadline-first flow scheduling. In deployment, things are similar yet slightly different: Varys misses some deadlines of admitted coflows, due to coordination overheads in the presence of tight deadlines. 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012

59 “Let users figure it out”
Optimizing Communication Performance: Systems Approach. “Let users figure it out” [Table: number of communication parameters*: Spark v1.1.1: 6; Hadoop v1.2.1: 10; YARN v2.6.0: 20] While that was going on in the network, the systems folks have left it to the users, with too many parameters, many of which are hard to understand, difficult to tune, and do not always behave well when put together. I took a quick look at the numbers a week or so ago: the most popular data analytics frameworks have somewhere between 5 and 20 such parameters. And these do not include control parameters such as RPC timeouts, buffer sizes, and so on. *Lower bound. Does not include many parameters that can indirectly impact communication, e.g., the number of reducers. Also excludes control-plane communication/RPC parameters.

60 Experimental Methodology
Varys deployment in EC2: 100 m2.4xlarge machines, each with 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC (~900 Mbps/machine during all-to-all communication). Trace-driven simulation: detailed replay of a day-long Facebook trace (circa October 2010) from a 3000-machine, 150-rack cluster with 10:1 oversubscription.

