
CS434/534: Topics in Network Systems
Cloud Data Centers Transport: MP-TCP, DC Transport Issues
Yang (Richard) Yang, Computer Science Department, Yale University
208A Watson; Email: yry@cs.yale.edu
http://zoo.cs.yale.edu/classes/cs434/
Acknowledgement: slides include content from the Hedera and MP-TCP authors

Outline Admin and recap Cloud data center (CDC) network infrastructure Background, high-level goals Traditional CDC vs the one-big switch abstraction VL2 design and implementation: L2 semantics, VLB/ECMP Load-aware centralized load balancing (Hedera) Distributed load balancing by end hosts: MP-TCP DC transport issues

Admin PS1 help by Jensen () Office hours on projects: Thursday, 1:30-3:30 pm

Recap: ECMP/VLB Hashing Collision Problem (figure: flows to destinations D1-D4 hashed onto colliding core links)

Recap: Centralized Scheduling of Flow Routing (Hedera control loop: detect large flows, estimate flow demands, place flows; placement heuristics: Global First Fit, Simulated Annealing)

Recap: Distributed Scheduling using Endhosts
Instead of a central scheduler, end hosts compute flow rates in a distributed fashion using TCP.
(figure: two separate paths vs. logically a single pool)
In a circuit-switched network there is a dedicated channel for each flow. This is rigid and inflexible: when one flow is silent, the other flow cannot fill in. Packet switching gives much more flexibility (whether used for ATM virtual circuits or for full-blown IP), and multipath brings flexibility of the same sort: separate paths are rigid, while a pooled set of paths is flexible. With packet switching, however, you need to be careful about how the flows share each link. Circuits made this easy, since they give strict isolation between flows; with packet switching you need a control plane. It could be ATM with admission control, or it could be TCP congestion control, which says how end systems should adapt their rates so that the network shares its capacity fairly. What sort of control plane do we need to ensure that a multipath network works well?

Recap: Simple Model Driving TCP CC Design (figure: users 1..n send at rates x1..xn into a shared resource; congestion signal d = 1 if sum xi > Xgoal). Flows observe the congestion signal d and locally take actions to adjust their rates.

Recap: Distributed Linear Control. Consider the simplest class of control strategies that can achieve fairness and efficiency.

AIMD: State Transition Trace (figure: phase plot in the (x1, x2) plane; fairness line x1 = x2; efficiency line x1 + x2 = C; states above the efficiency line are overload, below are underload; starting from x0, additive increase moves the state parallel to the fairness line and multiplicative decrease moves it toward the origin, so the trajectory converges to the fair, efficient point)
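
As a sanity check, here is a minimal simulation (illustrative parameters, not from the slides) of two AIMD flows sharing a link of capacity C; whatever the starting point, the rates converge toward equal shares:

# Two AIMD flows sharing a link of capacity C (illustrative sketch).
# Additive increase of 1 per round when underloaded; both flows halve
# when the link is overloaded, mirroring the state-transition trace.
def aimd(x1, x2, C=100.0, rounds=200):
    for _ in range(rounds):
        if x1 + x2 > C:              # overload: multiplicative decrease
            x1, x2 = x1 / 2, x2 / 2
        else:                        # underload: additive increase
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2

print(aimd(0.0, 80.0))   # starts very unfair; ends with x1 ~= x2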

Recap: Mapping from Model to Protocol: Window-Based Mapping. Assume the window size is cwnd segments, each of MSS bytes, with round-trip time RTT. Then rate x = cwnd * MSS / RTT bytes/sec. MSS: Maximum Segment Size.
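
For concreteness, a quick computation with illustrative values:

# Rate from window: x = cwnd * MSS / RTT (illustrative numbers).
cwnd = 10        # segments
MSS = 1500       # bytes per segment
RTT = 0.1        # seconds
print(cwnd * MSS / RTT)   # 150000.0 bytes/sec, i.e., 150 KB/s ~ 1.2 Mb/s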

Discussion: how can a sender detect congestion?

Approach 1: End Hosts Consider Loss as Congestion (figure: the sender transmits packets 1-7; the receiver ACKs the next expected sequence number: 2, 3, 4, 4, 4, 4. Packet 4 is lost, so packets 5-7 each trigger a duplicate ACK for 4.) Pros and cons of end hosts using loss as the congestion signal? (Assumes loss implies congestion.)
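
A minimal sketch of both sides of this signal (illustrative; not the full TCP state machine):

# Receiver: the cumulative ACK carries the next expected sequence number.
def acks_for(received):
    expected, acks = 1, []
    for seq in received:
        if seq == expected:
            expected += 1
        acks.append(expected)
    return acks

# Sender: three duplicate ACKs are taken as a loss (congestion) signal.
def loss_detected(acks, dup_threshold=3):
    dups, last = 0, None
    for a in acks:
        dups = dups + 1 if a == last else 0
        last = a
        if dups >= dup_threshold:
            return True
    return False

acks = acks_for([1, 2, 3, 5, 6, 7])    # packet 4 lost
print(acks, loss_detected(acks))       # [2, 3, 4, 4, 4, 4] True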

Approach 2: Network Feedback (ECN: Explicit Congestion Notification). The network marks an ECN bit (1 bit) on a packet according to local conditions, e.g., queue length > K; the receiver bounces the mark back to the sender in the ACK message; the sender reduces its rate if an ECN mark is received. Pros and cons of ECN?
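
A minimal sketch of the marking rule at a router queue (the threshold K and the queue model are assumptions for illustration):

# Mark the ECN bit when the instantaneous queue length exceeds K.
K = 20   # packets; marking threshold (assumed value)

def forward(queue_len, pkt):
    pkt["ecn"] = queue_len > K   # router marks under congestion
    return pkt

# The receiver echoes pkt["ecn"] back in the ACK; the sender then halves
# cwnd, just as it would on a triple-duplicate ACK.
print(forward(5, {})["ecn"], forward(25, {})["ecn"])   # False True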

TCP/Reno Full Alg
Initially: cwnd = 1; ssthresh = infinite (e.g., 64K).
For each newly ACKed segment:
  if (cwnd < ssthresh)   // slow start: MI
    cwnd = cwnd + 1;
  else                   // congestion avoidance: AI
    cwnd += 1/cwnd;
Triple-duplicate ACKs or ECN in an RTT:   // MD
  cwnd = ssthresh = cwnd/2;
Timeout:
  ssthresh = cwnd/2; cwnd = 1;   // reset (if already timed out, double the timeout value; this is called exponential backoff)
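
A runnable restatement of the pseudocode above (a sketch of the window logic only; timers and retransmission are abstracted away):

# TCP/Reno congestion-window logic, in segments.
class RenoWindow:
    def __init__(self):
        self.cwnd = 1.0
        self.ssthresh = 64.0   # stands in for "infinite" here

    def on_new_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0              # slow start: MI
        else:
            self.cwnd += 1.0 / self.cwnd  # congestion avoidance: AI

    def on_triple_dup_ack_or_ecn(self):
        self.ssthresh = self.cwnd / 2.0   # MD
        self.cwnd = self.ssthresh

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2.0
        self.cwnd = 1.0                   # reset; RTO doubling not modeled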

Recap: EW-TCP for Multiple TCP Subflows Sharing a Bottleneck
(figure: a multipath TCP flow with two subflows sharing a bottleneck link with a regular TCP flow)
Goal: the subflows of each flow share each bottleneck fairly with other flows.
Mechanism: each subflow uses basic TCP with an increase parameter $a$ (instead of 1) per RTT. For loss rate $p$, $\text{mean } W = \sqrt{\tfrac{2a(1-p)}{p}} \approx \sqrt{2a/p}$. If $n$ subflows share a bottleneck with a regular TCP flow, choose $a = 1/n^2$.
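
Why $a = 1/n^2$: each of the $n$ subflows achieves mean window $\sqrt{2a/p}$, so together they get $n\sqrt{2a/p}$; equating this with a single regular TCP's $\sqrt{2/p}$ gives
\[
n\sqrt{\frac{2a}{p}} = \sqrt{\frac{2}{p}} \;\Longrightarrow\; n\sqrt{a} = 1 \;\Longrightarrow\; a = \frac{1}{n^2}.
\]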

Discussion: what are the issues of EW-TCP?

Equal vs Non-Equal Share Example

Equal vs Non-Equal Share Example. What is the throughput of each flow if each subflow shares its path with the others equally? (figure: three flows over three 12 Mb/s links; with an equal split each flow gets 8 Mb/s. The numbers are consistent with a triangle topology in which each flow has a direct one-hop path and a two-hop path through the third node, so each link carries 3r/2 = 12, giving r = 8 Mb/s.)

Equal vs Non-Equal Share Example. If each flow splits its traffic 2:1 in favor of its direct path, each link carries 4r/3 = 12, so each flow gets 9 Mb/s (figure: 12 Mb/s links, 9 Mb/s per flow).

Equal vs Non-Equal Share Example. If each flow splits its traffic ∞:1 (all traffic on its direct path), each link carries r = 12, so each flow gets 12 Mb/s (figure: 12 Mb/s links, 12 Mb/s per flow). Equal share may not be efficient: it is not even Pareto optimal!
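
A small check of the three splits, under the assumed triangle topology (a fraction f of each flow on its direct link, 1-f on the two-hop path crossing the other two links):

# Per-flow throughput when each 12 Mb/s link carries f*r from its own
# flow plus (1-f)*r from each of the other two flows' two-hop subflows:
# (f + 2*(1-f)) * r = C.
def per_flow_throughput(f, C=12.0):
    return C / (f + 2 * (1 - f))

for f in (0.5, 2 / 3, 1.0):   # 1:1 split, 2:1 split, all-direct
    print(f, per_flow_throughput(f))   # -> 8.0, 9.0, 12.0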

Outline Admin and recap Cloud data center (CDC) networks Background, high-level goal Traditional CDC vs the one-big switch abstraction VL2 design and implementation: L2 semantics, VLB/ECMP Load-aware centralized load balancing (Hedera) Distributed load balancing by end hosts: MP-TCP Intuition EW-TCP COUPLED

Design Option 2: (Fully) Coupled TCP. Each ACK on subflow r: increase the window w_r by 1/w_total. Each loss on subflow r: decrease the window w_r by w_total/2. Q: what is the behavior, say with two subflows with windows w1 and w2?
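
A sketch of the expected per-RTT drift (this step is implicit in the slides): subflow r sends about $w_r$ packets per RTT, each ACKed packet adds $1/w_{total}$ and each lost packet costs $w_{total}/2$, so
\[
\mathbb{E}[\Delta w_r] \;\approx\; \frac{w_r(1-p_r)}{w_{total}} - w_r\,p_r\,\frac{w_{total}}{2}
\;=\; w_r\left[\frac{1-p_r}{w_{total}} - \frac{p_r\,w_{total}}{2}\right].
\]
The sign of the bracket depends on $p_r$ and $w_{total}$ but not on $w_r$, so as $w_{total}$ grows the bracket turns negative first on the higher-loss paths: their windows shrink toward zero and all traffic concentrates on the least-congested path.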

Discussion. Theory: MPTCP should send all its traffic on its least-congested (lowest-loss-rate) paths [Kelly and Voice 2005; Han, Towsley et al. 2006]. What are the issues with sending all traffic to the least-congested (lowest-loss-rate) path?

Outline Admin and recap Cloud data center (CDC) networks Background, high-level goal Traditional CDC vs the one-big switch abstraction VL2 design and implementation: L2 semantics, VLB/ECMP Load-aware centralized load balancing (Hedera) Distributed load balancing by end hosts: MP-TCP Intuition EW-TCP COUPLED TCP SEMI-COUPLED TCP

Design Option 3: Semi-Coupled TCP. Each ACK on subflow r: increase the window w_r by a/w_total. Each loss on subflow r: decrease the window w_r by w_r/2. Q: what is the equilibrium sharing of bandwidth?

Semi-Coupled TCP: equilibrium.
\[
\mathbb{E}[\Delta W_r] = (1-p_r)\frac{a}{W_t} + p_r\left(-\frac{W_r}{2}\right) = 0
\;\Longrightarrow\;
(1-p_r)\frac{a}{W_t} = p_r\frac{W_r}{2}
\]
\[
W_r = \frac{2a(1-p_r)}{p_r W_t} \approx \frac{2a}{p_r W_t}
\qquad
W_t = \sum_r W_r = \frac{2a}{W_t}\sum_r \frac{1}{p_r}
\;\Longrightarrow\;
W_t = \sqrt{2a\sum_r 1/p_r}
\]
\[
W_r = \sqrt{2a}\,\frac{1/p_r}{\sqrt{\sum_k 1/p_k}}
\]

Outline Admin and recap Cloud data center (CDC) network infrastructure Background, high-level goal Traditional CDC vs the one-big switch abstraction VL2 design and implementation: L2 semantics, VLB/ECMP Load-aware centralized load balancing (Hedera) Distributed load balancing by end hosts: MP-TCP Intuition Formal definition

Formal Requirement. So far we focused on window size (w) and ignored flows with different RTTs, but the flow rate is w/RTT. Formal requirement:
\[
\sum_{r \in R} \frac{w_r}{RTT_r} \;\ge\; \max_{r \in R} \frac{w_r^{TCP}}{RTT_r}
\]
i.e., the multipath flow should get at least the throughput a single TCP would get on its best path.

Design Option 4. Each ACK on subflow r: increase the window w_r by min(a/w_total, 1/w_r). Each loss on subflow r: decrease the window w_r by w_r/2.
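
A minimal sketch of this update rule in code (windows in segments; a is the increase parameter derived below):

# Design Option 4: coupled increase with a per-path TCP cap (sketch).
def on_ack(w, r, a):
    # The min(..) cap ensures subflow r is never more aggressive than a
    # regular TCP (which would add 1/w_r per ACK) on the same path.
    w_total = sum(w)
    w[r] += min(a / w_total, 1.0 / w[r])

def on_loss(w, r):
    w[r] /= 2.0   # ordinary per-subflow TCP halving

w = [10.0, 5.0]   # two subflow windows (illustrative)
on_ack(w, 0, a=0.5); on_loss(w, 1)
print(w)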

Deriving a.
\[
\sum_{r \in R} \frac{w_r}{RTT_r} \;\ge\; \max_{r \in R} \frac{w_r^{TCP}}{RTT_r}
\]
Equilibrium per subflow:
\[
(1-p_r)\min\!\left(\frac{a}{w_{total}}, \frac{1}{w_r}\right) = p_r\frac{w_r}{2}
\;\Longrightarrow\;
\min\!\left(\frac{a}{w_{total}\,w_r}, \frac{1}{w_r^2}\right) = \frac{p_r}{2(1-p_r)} \approx \frac{p_r}{2}
\]
With $w_r^{tcp} = \sqrt{2/p_r}$:
\[
\min\!\left(\frac{a}{w_{total}\,w_r}, \frac{1}{w_r^2}\right) = \frac{1}{(w_r^{tcp})^2}
\;\Longrightarrow\;
\max\!\left(\sqrt{\frac{w_{total}\,w_r}{a}},\, w_r\right) = w_r^{tcp}
\]
Dividing by $RTT_r$:
\[
\max\!\left(\sqrt{\frac{w_{total}\,w_r}{a\,RTT_r^2}},\, \frac{w_r}{RTT_r}\right) = \frac{w_r^{tcp}}{RTT_r}
\]

Deriving a (cont.): substitute the equilibrium windows into the requirement
\[
\sum_{r \in R} \frac{w_r}{RTT_r} \;\ge\; \max_{r \in R} \frac{w_r^{TCP}}{RTT_r}
\]
and choose a so that the requirement holds with equality.
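
The slides do not show the closed form here; for reference, this derivation leads to the Linked-Increases parameter standardized in RFC 6356:
\[
a \;=\; w_{total}\,\frac{\max_r \left( w_r / RTT_r^2 \right)}{\left( \sum_r w_r / RTT_r \right)^2}.
\]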

Final MP-TCP Alg: Design Option 4 with the increase parameter a set as derived above, recomputed as the windows and RTTs change.

MPTCP Evaluation (figure: throughput as % of optimal vs. flow rank; FatTree, 128 nodes). Simulations of FatTree with 100 Mb/s links, permutation traffic matrix, one flow per host; TCP+ECMP versus MPTCP.

MPTCP vs Hedera (figure: throughput [% of optimal], MPTCP vs Hedera's first-fit heuristic). Simulation of FatTree with 128 hosts; permutation traffic matrix; closed-loop flow arrivals (one flow finishes, another starts); flow size distributions from the VL2 dataset.

MPTCP at Different Load (figure: ratio of throughputs, MPTCP/TCP, vs. connections per host). Simulation of a FatTree-like topology with 512 nodes, but with 4 hosts for every up-link from a top-of-rack switch, i.e., the core is oversubscribed 4:1. Permutation TM: each host sends to one other host and receives from one other. Random TM: each host sends to one other host but may receive from any number. At low loads there are few collisions and the NICs are saturated, so TCP ≈ MPTCP. At high loads the core is severely congested and TCP can fully exploit all the core links, so TCP ≈ MPTCP. When the core is "right-provisioned", i.e., just saturated, MPTCP > TCP.

MPTCP at Different Load (figure: the same plot annotated with three regimes: underloaded, sweet spot, overloaded)

A Puzzle: cwnd and Rate of a TCP Session. Question: cwnd fluctuates widely (e.g., cut in half on each loss); how can the sending rate stay relatively smooth? Question: where does the loss rate come from?

TCP/Reno Queueing Dynamics (figure: cwnd over time during congestion avoidance, oscillating around the bottleneck bandwidth: while cwnd is above it the buffer fills; after the triple-duplicate (TD) cut to half, the buffer drains). If the buffer at the bottleneck is large enough, it never drains empty during the cut-to-half and grow-back process, so the link is never idle.

Discussion. If the buffer size at the bottleneck link is very small, what is the link utilization? (same cwnd-over-time figure: TD cut to half, ssthresh, congestion avoidance around the bottleneck bandwidth)

Exercise: Small Buffer. Assume BW = 10 Gb/s, RTT = 100 ms, packet = 1250 bytes. BDP (full window size) = 100,000 packets. A loss can cut the window from 100,000 to 50,000 packets; growing fully back at one packet per RTT needs 50,000 RTTs, i.e., 5000 seconds, about 1.4 hours.
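
The arithmetic, spelled out with the slide's numbers:

# Bandwidth-delay product and window-recovery time for the slide's numbers.
BW = 10e9              # bits/sec
RTT = 0.1              # seconds
PKT = 1250 * 8         # bits per packet
bdp_pkts = BW * RTT / PKT          # full window: 100,000 packets
rtts = bdp_pkts / 2                # regrow at 1 packet per RTT from half
print(bdp_pkts, rtts * RTT, rtts * RTT / 3600)   # 100000.0 5000.0 ~1.39 h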

Discussion. What do you like about MP-TCP? What do you not like about MP-TCP?

Outline Admin and recap Cloud data center (CDC) network infrastructure Background, high-level goals Traditional CDC vs the one-big switch abstraction VL2 design and implementation: L2 semantics, VLB/ECMP Load-aware centralized load balancing (Hedera) Distributed load balancing by end hosts: MP-TCP DC transport issues

Big Picture (figure: a data center fabric interconnecting servers, with links to the INTERNET)

Big Picture: diverse workloads sharing the same infrastructure (figure: the fabric hosting web, app, cache, database, map-reduce, HPC, and monitoring servers). While some of the traffic in data center networks is sent across the Internet, the majority of data center traffic is between servers within the data center and never leaves it.

Different Types of Workloads: Mice & Elephants. Mice: short messages (e.g., query, coordination), which need low latency. Elephants: large flows (e.g., data update, backup), which need high throughput. The ultimate goal of data center transport is to complete these internal transactions or "flows" as quickly as possible. What makes this challenging is that the flows are quite diverse, and different flows require different things to complete quickly: for the short flows, we need to provide low latency.

Implication: Mice Mix with Elephants

Discussion (same cwnd-over-time figure: TD, ssthresh, congestion avoidance around the bottleneck bandwidth). What is the potential issue of TCP for such mixed mice-and-elephants traffic?
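
One way to see the issue concretely (a back-of-the-envelope sketch with assumed numbers, not from the slides): elephants keep the bottleneck buffer nearly full, so every mouse packet waits behind a standing queue.

# Queueing delay a mouse sees behind a buffer kept full by elephants.
LINK = 10e9              # bits/sec (assumed 10G link)
BUFFER_PKTS = 1000       # standing queue built by long flows (assumed)
PKT = 1500 * 8           # bits per packet
delay_ms = BUFFER_PKTS * PKT / LINK * 1e3
print(delay_ms, "ms per hop")   # 1.2 ms per hop, vs microseconds when idle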