1 CS434/534: Topics in Network Systems
Cloud Data Centers Transport: MP-TCP, DC Transport Issues
Yang (Richard) Yang, Computer Science Department, Yale University, 208A Watson
Acknowledgement: slides include content from Hedera and MP-TCP authors

2 Outline
Admin and recap
Cloud data center (CDC) network infrastructure
  Background, high-level goals
  Traditional CDC vs the one-big-switch abstraction
  VL2 design and implementation: L2 semantics, VLB/ECMP
  Load-aware centralized load balancing (Hedera)
  Distributed load balancing by end hosts: MP-TCP
  DC transport issues

3 Admin
PS1 help by Jensen ()
Office hours on projects: Thursday, 1:30-3:30 pm

4 Recap: ECMP/VLB Hashing Collision Problem
[Figure: ECMP/VLB hashing example with senders and destinations (S4, D1-D4); hash collisions place multiple flows on the same link]

5 Recap: Centralized Scheduling of Flow Routing
Detect large flows
Estimate flow demands
Place flows: global first fit, simulated annealing

6 Recap: Distributed Scheduling using Endhosts
Instead of a central scheduler, end hosts compute flow rates in a distributed fashion using TCP.
[Figure: two separate paths vs. logically a single pool of capacity]
In a circuit-switched network there is a dedicated channel for each flow. This is rigid and inflexible: when one flow is silent, the other flow cannot fill in. Packet switching gives much more flexibility (whether used for ATM virtual circuits or for full-blown IP). Multipath brings flexibility of the same sort: two separate paths are rigid and inflexible, while a logically single pool is flexible. With packet switching, however, you need to be careful about how the flows share the link. Circuits make this easy, since they give strict isolation between flows; with packet switching you need a control plane. It could be ATM with admission control, or it could be TCP congestion control, which tells end systems how to adapt their rates so that the network shares its capacity fairly. What sort of control plane do we need to ensure that a multipath network works well?

7 Recap: Simple Model Driving TCP CC Design
[Figure: users 1..n with rates x1..xn sharing a resource; the network returns signal d indicating whether the sum of the xi exceeds Xgoal]
Flows observe the congestion signal d and locally take actions to adjust their rates.

8 Recap: Distributed Linear Control
Consider the simplest class of control strategies, linear controls (recalled below), to achieve fairness and efficiency.
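As a reminder of what this class looks like (a standard restatement of Chiu and Jain's linear-control formulation; the notation here is mine, not copied from the slide), each user updates its rate on every feedback signal d as follows:

$$
x_i \leftarrow
\begin{cases}
a_I + b_I\,x_i, & d = 0 \ \text{(no congestion: increase)}\\
a_D + b_D\,x_i, & d = 1 \ \text{(congestion: decrease)}
\end{cases}
$$

AIMD is the member of this class with $a_I > 0,\ b_I = 1$ (additive increase) and $a_D = 0,\ 0 < b_D < 1$ (multiplicative decrease), the combination that converges to the intersection of the efficiency and fairness lines shown on the next slide.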

9 AIMD: State Transition Trace
[Figure: (x1, x2) phase plane with fairness line x1 = x2, efficiency line x1 + x2 = C, overload and underload regions, and a trajectory starting from x0]

10 Recap: Mapping from Model to Protocol
Window-Based Mapping: assume the window size is cwnd segments, each of MSS bytes, and the round-trip time is RTT. Then the sending rate is x = cwnd * MSS / RTT bytes/sec. (MSS: Maximum Segment Size.)
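As a quick numeric illustration of this mapping (the example values below are made up; only the formula comes from the slide):

    # Window-to-rate mapping: rate = cwnd * MSS / RTT.
    cwnd = 10        # congestion window, in segments (hypothetical value)
    mss = 1460       # maximum segment size, in bytes (hypothetical value)
    rtt = 0.1        # round-trip time, in seconds (hypothetical value)

    rate = cwnd * mss / rtt
    print(rate)      # 146000.0 bytes/sec, i.e., about 1.17 Mb/s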

11 Discussion How can a sender know that there is congestion?

12 Approach 1: End Hosts Consider Loss as Congestion
[Figure: packets 1-7 are sent; the ACKs carry the next expected sequence number 2, 3, 4, 4, 4, 4, i.e., packet 4 is lost and the later packets generate duplicate ACKs]
Pros and cons of end hosts using loss as the congestion signal? (Assume loss => congestion.)

13 Approach 2: Network Feedback (ECN: Explicit Congestion Notification)
The network marks an ECN bit (1 bit) on a packet according to a local condition, e.g., queue length > K; the receiver bounces the mark back to the sender in the ACK message; the sender reduces its rate if ECN is received. Pros and cons of ECN?
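To make the marking rule concrete, here is a minimal sketch of a threshold-marking queue (the class and names are hypothetical illustrations; only the "mark when queue length > K" rule comes from the slide):

    from collections import deque

    class EcnQueue:
        """Toy switch queue that sets the ECN bit when the queue exceeds K packets."""
        def __init__(self, k):
            self.k = k              # marking threshold, in packets
            self.pkts = deque()

        def enqueue(self, pkt):
            if len(self.pkts) > self.k:
                pkt["ecn"] = 1      # mark: local condition (queue length > K) holds
            self.pkts.append(pkt)

        def dequeue(self):
            return self.pkts.popleft() if self.pkts else None

The receiver then echoes pkt["ecn"] back in its ACKs, and the sender treats the echoed mark like a loss signal (see the Reno rules on the next slide).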

14 TCP/Reno Full Alg
Initially:
  cwnd = 1;
  ssthresh = infinite (e.g., 64K);
For each newly ACKed segment:
  if (cwnd < ssthresh)  // slow start: MI
    cwnd = cwnd + 1;
  else                  // congestion avoidance: AI
    cwnd += 1/cwnd;
Triple-duplicate ACKs or ECN in an RTT:  // MD
  cwnd = ssthresh = cwnd/2;
Timeout:
  ssthresh = cwnd/2;
  cwnd = 1;  // reset
  (If already timed out, double the timeout value; this is called exponential backoff.)
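The same rules, written as a small self-contained Python sketch (illustrative only; the event-handler names are my own, and real Reno tracks much more state):

    class RenoWindow:
        """Minimal sketch of the TCP/Reno window rules from this slide."""
        def __init__(self):
            self.cwnd = 1.0
            self.ssthresh = float("inf")

        def on_new_ack(self):
            if self.cwnd < self.ssthresh:
                self.cwnd += 1.0               # slow start: MI
            else:
                self.cwnd += 1.0 / self.cwnd   # congestion avoidance: AI

        def on_triple_dupack_or_ecn(self):
            self.ssthresh = self.cwnd / 2.0    # multiplicative decrease
            self.cwnd = self.ssthresh

        def on_timeout(self):
            self.ssthresh = self.cwnd / 2.0
            self.cwnd = 1.0                    # reset; re-enter slow start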

15 Recap: EW-TCP for Multiple TCP Sharing Bottleneck
[Figure: a multipath TCP flow with two subflows sharing a bottleneck with a regular TCP flow]
Goal: the subflows of each flow share each bottleneck fairly with other flows.
Mechanism: each subflow uses basic TCP with an increase parameter a (instead of 1) per RTT. If n subflows share a bottleneck with a regular TCP flow, choose a = 1/n^2.
With loss rate p, $\text{mean } W = \sqrt{\frac{2a(1-p)}{p}} \approx \sqrt{\frac{2a}{p}}$.
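To see where a = 1/n^2 comes from (my reconstruction of the algebra, assuming all n subflows see the same loss rate p as the competing regular TCP flow):

$$
n \cdot \bar W_{\text{subflow}} = n\sqrt{\frac{2a}{p}} = \sqrt{\frac{2}{p}} = \bar W_{\text{TCP}}
\;\Longrightarrow\; n\sqrt{a} = 1
\;\Longrightarrow\; a = \frac{1}{n^2}.
$$

That is, the aggregate of the n subflows then gets exactly one regular TCP flow's share of the bottleneck.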

16 Discussion Issues of EW-TCP

17 Equal vs Non-Equal Share Example

18 Equal vs Non-Equal Share Example
What is the throughput of each flow, if each subflow shares its path with others equally?
[Figure: the labeled rates are 12 Mb/s, 8 Mb/s, 12 Mb/s, 8 Mb/s, 8 Mb/s, 12 Mb/s]

19 Equal vs Non-Equal Share Example
If each flow splits its traffic 2:1 ...
[Figure: the labeled rates become 12 Mb/s, 9 Mb/s, 12 Mb/s, 9 Mb/s, 9 Mb/s, 12 Mb/s]

20 Equal vs Non-Equal Share Example
If each flow splits its traffic ∞:1 ...
[Figure: every labeled rate is now 12 Mb/s]
Equal share may not be efficient: it is not even Pareto optimal!

21 Outline
Admin and recap
Cloud data center (CDC) networks
  Background, high-level goal
  Traditional CDC vs the one-big-switch abstraction
  VL2 design and implementation: L2 semantics, VLB/ECMP
  Load-aware centralized load balancing (Hedera)
  Distributed load balancing by end hosts: MP-TCP
    Intuition
    EW-TCP
    COUPLED

22 Design Option 2: Total Coupled TCP
For each ACK on subflow r, increase the window w_r by 1/w_total.
For each loss on subflow r, decrease the window w_r by w_total/2.
Q: what is the behavior for, say, two subflows with windows w1 and w2?
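A minimal sketch of this fully coupled rule (illustrative only; the per-subflow bookkeeping and the floor that keeps windows positive are my additions):

    class CoupledMptcp:
        """Sketch of the COUPLED design: increase and decrease both use w_total."""
        def __init__(self, num_subflows):
            self.w = [1.0] * num_subflows    # per-subflow windows w_r

        def w_total(self):
            return sum(self.w)

        def on_ack(self, r):
            self.w[r] += 1.0 / self.w_total()

        def on_loss(self, r):
            # Decrease subflow r by w_total / 2, floored so the window stays positive.
            self.w[r] = max(1.0, self.w[r] - self.w_total() / 2.0)

Because a single loss on a subflow subtracts half of the total window, windows on more-congested subflows are quickly driven to the floor, which is consistent with the theory on the next slide: traffic concentrates on the least-congested paths.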

23 Discussion
Theory (Kelly and Voice 2005; Han, Towsley et al. 2006): MPTCP should send all its traffic on its least-congested (lowest-loss-rate) paths.
What are the issues of sending all traffic to the least-congested (lowest-loss-rate) path?

24 Outline
Admin and recap
Cloud data center (CDC) networks
  Background, high-level goal
  Traditional CDC vs the one-big-switch abstraction
  VL2 design and implementation: L2 semantics, VLB/ECMP
  Load-aware centralized load balancing (Hedera)
  Distributed load balancing by end hosts: MP-TCP
    Intuition
    EW-TCP
    COUPLED TCP
    SEMI-COUPLED TCP

25 Design Option 3: Semi Coupled TCP
For each ACK on subflow r, increase the window w_r by a/w_total.
For each loss on subflow r, decrease the window w_r by w_r/2.
Q: what is the equilibrium sharing of bandwidth?

26 Semi-Coupled TCP
At equilibrium, the expected window change of subflow r is zero:
$\text{mean }\Delta W_r = (1-p_r)\,\frac{a}{W_t} + p_r\left(-\frac{W_r}{2}\right) = 0$
$(1-p_r)\,\frac{a}{W_t} = p_r\,\frac{W_r}{2}$
$W_r = \frac{2a(1-p_r)}{p_r W_t} \approx \frac{2a}{p_r W_t}$
$W_t = \sum_r W_r = \frac{2a}{W_t}\sum_r \frac{1}{p_r} \;\Longrightarrow\; W_t = \sqrt{2a\sum_r \frac{1}{p_r}}$
$W_r = \frac{\sqrt{2a}\,(1/p_r)}{\sqrt{\sum_s 1/p_s}}$

27 Outline
Admin and recap
Cloud data center (CDC) network infrastructure
  Background, high-level goal
  Traditional CDC vs the one-big-switch abstraction
  VL2 design and implementation: L2 semantics, VLB/ECMP
  Load-aware centralized load balancing (Hedera)
  Distributed load balancing by end hosts: MP-TCP
    Intuition
    Formal definition

28 Formal Requirement
So far we have focused on window size (w) and ignored that flows may have different RTTs, but flow rate is w/RTT. Formal requirement: the multipath flow should get at least as much throughput as a regular TCP flow would on the best of its paths:
$\sum_{r \in R} \frac{w_r}{RTT_r} \;\ge\; \max_{r \in R} \frac{w_r^{TCP}}{RTT_r}$

29 Design Option 4 Each ACK on subflow r, increase the window wr by a/wtotal Each loss on subflow r, decrease the window wr by wr/2 min( , 1/wr)

30 Deriving a
Requirement: $\sum_{r \in R} \frac{w_r}{RTT_r} \ge \max_{r \in R} \frac{w_r^{TCP}}{RTT_r}$
Per-subflow equilibrium:
$(1-p_r)\,\min\!\left(\frac{a}{w_{total}}, \frac{1}{w_r}\right) = p_r\,\frac{w_r}{2}$
$\min\!\left(\frac{a}{w_{total}\,w_r}, \frac{1}{w_r^2}\right) = \frac{p_r}{2(1-p_r)} \approx \frac{p_r}{2}$
Using $w_r^{TCP} = \sqrt{2/p_r}$:
$\min\!\left(\frac{a}{w_{total}\,w_r}, \frac{1}{w_r^2}\right) = \frac{1}{(w_r^{TCP})^2}$
$\max\!\left(\sqrt{\frac{w_{total}\,w_r}{a}},\; w_r\right) = w_r^{TCP}$
$\max\!\left(\sqrt{\frac{w_{total}\,w_r}{a\,RTT_r^2}},\; \frac{w_r}{RTT_r}\right) = \frac{w_r^{TCP}}{RTT_r}$

31 Deriving a
$\sum_{r \in R} \frac{w_r}{RTT_r} \;\ge\; \max_{r \in R} \frac{w_r^{TCP}}{RTT_r}$

32 Final MP-TCP Alg
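The final algorithm on this slide was shown as a figure. For reference, here is a minimal sketch of the standard linked-increases rule it corresponds to, with the increase parameter a computed as in RFC 6356 (this is my reconstruction under that assumption, not a verbatim copy of the slide):

    class MptcpLia:
        """Sketch of MPTCP coupled congestion control (linked increases, cf. RFC 6356)."""
        def __init__(self, num_subflows, rtts):
            self.w = [1.0] * num_subflows    # per-subflow windows w_r
            self.rtts = list(rtts)           # per-subflow RTT estimates, in seconds

        def alpha(self):
            # a = w_total * max_r(w_r / RTT_r^2) / (sum_r w_r / RTT_r)^2
            w_total = sum(self.w)
            best = max(w / (rtt ** 2) for w, rtt in zip(self.w, self.rtts))
            denom = sum(w / rtt for w, rtt in zip(self.w, self.rtts)) ** 2
            return w_total * best / denom

        def on_ack(self, r):
            # Design option 4: increase by min(a / w_total, 1 / w_r) per ACK on subflow r.
            self.w[r] += min(self.alpha() / sum(self.w), 1.0 / self.w[r])

        def on_loss(self, r):
            self.w[r] = max(1.0, self.w[r] / 2.0)   # halve the subflow window on loss

The min(..., 1/w_r) cap ensures a subflow is never more aggressive than a regular TCP flow on the same path, while the coupled a term shifts traffic away from more-congested subflows.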

33 MPTCP Evaluation
[Chart: throughput (% of optimal) vs. flow rank; FatTree, 128 nodes]
Simulations of a FatTree with 100 Mb/s links, a permutation traffic matrix, and one flow per host; TCP+ECMP versus MPTCP.

34 MPTCP vs Hedera
[Chart: throughput (% of optimal) for MPTCP vs. the Hedera first-fit heuristic]
Simulation of a FatTree with 128 hosts; permutation traffic matrix; closed-loop flow arrivals (one flow finishes, another starts); flow size distributions from the VL2 dataset.

35 MPTCP at Different Load
[Chart: ratio of throughputs, MPTCP/TCP, vs. connections per host]
Simulation of a FatTree-like topology with 512 nodes, but with 4 hosts for every up-link from a top-of-rack switch, i.e., the core is oversubscribed 4:1.
Permutation TM: each host sends to one other host, and each host receives from one other host.
Random TM: each host sends to one other host, but a host may receive from any number of hosts.
At low loads there are few collisions and the NICs are saturated, so TCP ≈ MPTCP. At high loads the core is severely congested and TCP can fully exploit all the core links, so TCP ≈ MPTCP. When the core is "right-provisioned", i.e., just saturated, MPTCP > TCP.

36 MPTCP at Different Load
[Chart: the same MPTCP/TCP throughput ratio, annotated with the underloaded, sweet-spot, and overloaded regions]

37 A Puzzle: cwnd and Rate of a TCP Session
Question: cwnd fluctuates widely (it is cut in half on each loss event); how can the sending rate stay relatively smooth?
Question: where does the loss rate come from?

38 TCP/Reno Queueing Dynamics
[Figure: the cwnd sawtooth over time during congestion avoidance, with ssthresh, the bottleneck bandwidth level, triple-duplicate (TD) loss events, and the buffer filling before each loss and draining after it]
If the buffer at the bottleneck is large enough, the buffer never empties (the link never goes idle) during the cut-to-half and grow-back process.
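A classical way to make "large enough" precise, for a single long-lived flow (background I am adding; it is not stated on the slide): at the moment of loss the window is roughly $C \cdot RTT + B$, where $C$ is the bottleneck capacity and $B$ the buffer size, and after the window is cut in half the sender must still be able to fill the pipe:

$$
\frac{C \cdot RTT + B}{2} \;\ge\; C \cdot RTT \quad\Longrightarrow\quad B \;\ge\; C \cdot RTT,
$$

i.e., a buffer of about one bandwidth-delay product keeps the bottleneck busy through the sawtooth.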

39 Discussion
If the buffer size at the bottleneck link is very small, what is the link utilization?
[Figure: the same cwnd sawtooth, ssthresh, and bottleneck-bandwidth level during congestion avoidance]

40 Exercise: Small Buffer
Assume: BW = 10 Gb/s, RTT = 100 ms, packet size = 1250 bytes.
BDP (full window size): 100,000 packets.
A loss can cut the window from 100,000 to 50,000 packets.
To fully grow back: 50,000 RTTs => 5000 seconds, about 1.4 hours.
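A quick script to check the arithmetic (the numbers are exactly those on the slide; the one-packet-per-RTT growth is the congestion-avoidance rate):

    bw_bits_per_sec = 10e9     # 10 Gb/s
    rtt_sec = 0.1              # 100 ms
    pkt_bytes = 1250

    bdp_packets = bw_bits_per_sec * rtt_sec / (pkt_bytes * 8)
    print(bdp_packets)                          # 100000.0 packets

    rtts_to_grow_back = bdp_packets / 2         # +1 packet per RTT after the cut to 50,000
    print(rtts_to_grow_back * rtt_sec)          # 5000.0 seconds
    print(rtts_to_grow_back * rtt_sec / 3600)   # about 1.39 hours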

41 Discussion
What do you like about MP-TCP?
What do you not like about MP-TCP?

42 Outline
Admin and recap
Cloud data center (CDC) network infrastructure
  Background, high-level goals
  Traditional CDC vs the one-big-switch abstraction
  VL2 design and implementation: L2 semantics, VLB/ECMP
  Load-aware centralized load balancing (Hedera)
  Distributed load balancing by end hosts: MP-TCP
  DC transport issues

43 Big Picture
[Figure: servers connected by the data center fabric, which also connects to the INTERNET]

44 Big Picture
[Figure: the INTERNET, the fabric, and servers running web, app, cache, database, map-reduce, HPC, and monitoring workloads]
Diverse workloads share the same infrastructure. While some of the traffic in data center networks is sent across the Internet, the majority of data center traffic is between servers within the data center and never leaves it: the servers host many cooperating internal services (web, app, cache, database, map-reduce, HPC, monitoring) that talk to one another.

45 Different Types of Workloads
Mice and elephants:
  Short messages (e.g., query, coordination) need low latency.
  Large flows (e.g., data update, backup) need high throughput.
The ultimate goal of data center transport is to complete these internal transactions or "flows" as quickly as possible. What makes this challenging is that the flows are quite diverse, and the different kinds of flows require different things to complete quickly: for the short flows, we need to provide low latency…

46 Implication: Mice Mix w/ Elephants

47 Discussion
[Figure: the TCP/Reno cwnd sawtooth, ssthresh, and bottleneck-bandwidth level during congestion avoidance]
What is the potential issue of TCP for such mixed mice-and-elephants traffic?

