TCP & Data Center Networking


1 TCP & Data Center Networking
Overview: the TCP incast problem and possible solutions; DCTCP; MPTCP (Multipath TCP). Please read the following papers: [InCast], [DC-TCP], [MPTCP]. CSci5221: TCP and Data Center Networking

2 TCP Congestion Control: Recap
Designed to address the network congestion problem: reduce sending rates when the network is congested.
How do end systems detect network congestion? Assume packet losses (and re-ordering) signal congestion.
How are sending rates adjusted dynamically? AIMD (additive increase, multiplicative decrease): no packet loss in one RTT: W → W+1; packet loss in one RTT: W → W/2.
How is the initial sending rate determined? Probe the network's available bandwidth via "slow start": W := 1; no loss in one RTT: W → 2W.
Fairness: assume everyone runs the same algorithm.
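The recap can be turned into a toy simulation. This is an illustrative sketch only, not from the slides: real TCP operates per-ACK on segments, while this loop collapses each RTT into one step.

```python
# Toy TCP window evolution: slow start doubles W each loss-free RTT until
# the first loss, then AIMD takes over (W+1 per RTT, W/2 on loss).
def evolve_window(loss_rtts, num_rtts):
    """Return the cwnd (in packets) after each RTT.
    loss_rtts: set of RTT indices in which a packet loss occurred."""
    w = 1.0
    in_slow_start = True
    trace = []
    for rtt in range(num_rtts):
        if rtt in loss_rtts:
            w = max(1.0, w / 2)    # multiplicative decrease: W -> W/2
            in_slow_start = False  # leave slow start after the first loss
        elif in_slow_start:
            w = 2 * w              # slow start: W -> 2W per loss-free RTT
        else:
            w = w + 1              # additive increase: W -> W+1
        trace.append(w)
    return trace
```

For example, a loss in the fifth RTT gives the familiar sawtooth onset: 2, 4, 8, 16, then a halving to 8 and linear growth 9, 10, ...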

3 TCP Congestion Control: Devils in the Details
How to detect packet losses, as opposed to late-arriving packets? Estimate the (average) RTT and set a time-out threshold, the RTO (Retransmission Time-Out) timer; packets arriving very late are treated as if they were lost!
RTT and RTO estimation (Jacobson's algorithm): compute estRTT and devRTT using exponential smoothing:
estRTT := (1-a)·estRTT + a·sampleRTT (a > 0 small, e.g., a = 0.125)
devRTT := (1-a)·devRTT + a·|sampleRTT - estRTT|
Set RTO conservatively: RTO := max{minRTO, estRTT + 4·devRTT}, where minRTO = 200 ms.
Aside: there are many variants of TCP: Tahoe, Reno, Vegas, ...
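Jacobson's estimator can be sketched as follows. The gain a = 0.125 and the 200 ms floor are from the slide; the separate deviation gain b = 0.25 and the first-sample initialization are the conventional choices (the slide uses a single gain a for both).

```python
# Jacobson-style RTT/RTO estimator (times in seconds).
class RtoEstimator:
    def __init__(self, min_rto=0.200, a=0.125, b=0.25):
        self.min_rto = min_rto   # the 200 ms floor: the incast culprit
        self.a, self.b = a, b
        self.est_rtt = None
        self.dev_rtt = 0.0

    def update(self, sample_rtt):
        if self.est_rtt is None:
            # First sample initializes the estimates (conventional choice).
            self.est_rtt = sample_rtt
            self.dev_rtt = sample_rtt / 2
        else:
            self.dev_rtt = (1 - self.b) * self.dev_rtt + \
                self.b * abs(sample_rtt - self.est_rtt)
            self.est_rtt = (1 - self.a) * self.est_rtt + self.a * sample_rtt
        # RTO := max{minRTO, estRTT + 4 x devRTT}
        return max(self.min_rto, self.est_rtt + 4 * self.dev_rtt)
```

Note what happens with data-center RTTs of ~100 µs: estRTT + 4·devRTT is hundreds of microseconds, far below minRTO, so the 200 ms floor always dominates. This foreshadows the incast problem.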

4 But ... Internet vs. Data Center Network
Internet propagation delay: on the order of milliseconds. Data center propagation delay: about 0.1 ms. With a packet size of 1 KB and link capacity of 1 Gbps, the packet transmission time is about 0.01 ms.
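The slide's last number can be checked directly (using 1 KB = 1000 B; with 1024 B the answer is 0.0082 ms, still roughly the quoted 0.01 ms):

```python
# Serialization (transmission) time of one packet onto the wire.
packet_bits = 1000 * 8   # 1 KB packet
link_bps = 1e9           # 1 Gbps link capacity
tx_time_ms = packet_bits / link_bps * 1e3
# About 0.008 ms: orders of magnitude below Internet propagation delays,
# and still an order of magnitude below the 0.1 ms data-center
# propagation delay. Queueing, not the wire, dominates in the DC.
```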

5 What's Special about Data Center Transport?
Application requirements (in particular, low latency). Particular traffic patterns: customer-facing and internal traffic often co-exist; internal traffic includes, e.g., the Google File System and MapReduce. Commodity switches with shallow buffers. And time is money!

6 How Does Search Work? The Partition/Aggregate Application Structure
A query fans out from a top-level aggregator (TLA) to mid-level aggregators (MLAs) and on to worker nodes; partial results are aggregated back up. Each stage has a strict deadline (SLA), e.g., 250 ms at the TLA, 50 ms at the MLAs, 10 ms at the workers. Time is money: a missed deadline means a lower-quality result. There are many requests per query, so tail latency matters. (The slide's running example is a search for "Art is ..." whose workers return Picasso quotes, such as "Art is a lie that makes us realize the truth.")

7 Data Center Workloads
Partition/Aggregate (query): bursty, delay-sensitive. Short messages, 50 KB-1 MB (coordination, control state): delay-sensitive. Large flows, 1 MB-100 MB (data update): throughput-sensitive.

8 Flow Size Distribution
More than 65% of flows are smaller than 1 MB, yet more than 95% of bytes come from flows larger than 1 MB.

9 A Simple Data Center Network Model
N servers (1, 2, 3, ..., N) connect through a switch with a small buffer B and link capacity C (Ethernet, 1-10 Gbps) to an aggregator. A logical data block S (e.g., 1 MB) is striped across the servers; each server returns a Server Request Unit (SRU, e.g., 32 KB) in packets of size S_DATA. Round Trip Time (RTT): 10-100 µs.
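A back-of-the-envelope sketch of why this model collapses at small N. The 128 KB buffer value below is an assumption for illustration; the slide only says the buffer B is small.

```python
# Synchronized fan-in: all N servers fire one SRU into the same switch
# output port at once; losses begin once the burst exceeds the buffer.
def servers_until_overflow(buffer_bytes, sru_bytes):
    """Smallest N whose synchronized SRU burst no longer fits in buffer B."""
    return buffer_bytes // sru_bytes + 1

# With 32 KB SRUs and an assumed 128 KB shared buffer, just 5 synchronized
# senders already overflow the port, so drops (and 200 ms RTOs) set in
# at very small N.
```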

10 TCP Incast Problem
Vasudevan et al. (SIGCOMM'09). Synchronized fan-in congestion, caused by Partition/Aggregate: the aggregator sends requests to workers 1-4, and the responses are sent back, but packets 7-8 are dropped at the shared switch. Responses 1-6 complete; then the link sits idle while the senders wait out a TCP timeout (RTOmin = 200 ms) before 7-8 are resent.

11 TCP Throughput Collapse
Cluster setup: 1 Gbps Ethernet, unmodified TCP, an S50 switch, 1 MB block size. The y-axis shows throughput (goodput); the x-axis, the number of servers involved in the transfer. Initially throughput is about 900 Mbps, close to the maximum achievable in the network. As the number of servers scales up, by around 7 servers throughput collapses drastically, down to about 100 Mbps, an order of magnitude below the maximum. This TCP throughput collapse is called TCP incast, and its cause is coarse-grained TCP timeouts.

12 Incast in Bing: MLA Query Completion Time (ms)
1. Incast really happens: see this actual screenshot from a production tool. 2. People care: they have worked around it at the application layer by jittering requests. 3. They care about the 99.9th percentile, because of customers.

13 Problem Statement
How to provide high goodput for data center applications? TCP throughput degrades with N because of TCP retransmission timeouts, given: a high-speed, low-latency network (RTT ≤ 0.1 ms); a highly-multiplexed link (e.g., 1000 flows); highly-synchronized flows on the bottleneck link; and a limited switch buffer size (e.g., 32 KB).

14 One Quick Fix: µsecond Retransmission Timeouts (RTO)
RTO = max(minRTO, f(RTT)). The RTO is the max of two values, so both must be brought down to microseconds: lower the minRTO bound from 200 ms to 200 µs (or remove it entirely, i.e., 0), and track RTT in microseconds rather than milliseconds.

15 Solution: µsecond TCP + no minRTO
Using microsecond-granularity timers solves TCP throughput collapse. The plot compares unmodified TCP against the modified TCP stack (higher is better; red is the quick fix): the fix sustains high throughput for up to 47 servers, and in simulation the solution scales to thousands of servers.

16 TCP in the Data Center
TCP does not meet the demands of applications. It requires large queues for high throughput, which adds significant latency and wastes buffer space, especially bad with shallow-buffered switches. Operators work around TCP's problems with ad-hoc, inefficient, often expensive solutions, and with no solid understanding of the consequences and tradeoffs.

17 Data Center Workloads (recap)
Partition/Aggregate (query): bursty, delay-sensitive. Short messages, 50 KB-1 MB (coordination, control state): delay-sensitive. Large flows, 1 MB-100 MB (data update): throughput-sensitive.

18 Flow Size Distribution (recap)
More than 65% of flows are smaller than 1 MB, yet more than 95% of bytes come from flows larger than 1 MB.

19 Queue Buildup
Large flows build up queues, which increases latency for short flows: Sender 1's short flow to the receiver waits behind Sender 2's large flow at the shared queue. How is this supported by measurements? Measurements in a Bing cluster: for 90% of packets, RTT < 1 ms; for 10% of packets, 1 ms < RTT < 15 ms.

20 Data Center Transport Requirements
1. High burst tolerance: incast due to Partition/Aggregate is common. 2. Low latency: short flows, queries. 3. High throughput: continuous data updates, large file transfers. The challenge is to achieve all three together.

21 DCTCP: Main Idea
React in proportion to the extent of congestion: reduce the window size based on the fraction of ECN-marked packets. Where TCP cuts its window by 50% on any marks, DCTCP cuts by 40% when most of the packets in a window are marked, but by only 5% when few are. Start with: "How can we extract multi-bit information from a single-bit stream of ECN marks?"

22 DCTCP: Algorithm
Switch side: mark packets (ECN) when the queue length exceeds a threshold K; don't mark below K. This is a very simple marking mechanism, without all the tunings other AQMs have.
Sender side: maintain a running average α of the fraction of packets marked, estimated over the last RTT: there is more information in the stream of ECN marks coming back than in any single bit. (In TCP there is always a way to get the next RTT's worth of packets from the window size; this comes from TCP's self-clocking.) Use an adaptive window decrease, W → W·(1 - α/2), so the decrease factor lies between 1 and 2. This keeps rate variations smooth, which lets DCTCP operate well even with shallow buffers and only a few flows (no statistical multiplexing). Only the decrease changes; the increase is left to whatever the algorithm already does, so the idea is generic enough to apply to any algorithm (CTCP, CUBIC) for how it cuts its window. Have to be careful here.
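The sender side of the algorithm can be sketched directly from the slide. The estimation gain g = 1/16 is a typical value from the DCTCP work, not stated on the slide.

```python
# DCTCP sender update, run once per window of data (roughly one RTT).
def dctcp_update(alpha, cwnd, marked, total, g=1.0 / 16):
    """alpha: running estimate of the fraction of marked packets.
    marked, total: ECN-marked vs. all packets ACKed in the last window."""
    f = marked / total               # fraction marked in the last RTT
    alpha = (1 - g) * alpha + g * f  # exponential smoothing, as on the slide
    if marked > 0:
        cwnd = cwnd * (1 - alpha / 2)  # adaptive decrease: W -> W(1 - a/2)
    return alpha, cwnd

# alpha near 1 (persistent marking) cuts the window by ~50%, like TCP;
# alpha near 0.1 cuts it by only ~5%, preserving throughput.
```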

23 DCTCP vs TCP: Queue Length (KBytes)
Not ns-2: a real testbed. Setup: Windows 7 hosts, a Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, ECN marking threshold K = 30 KB.

24 Multi-path TCP (MPTCP)
In a data center with rich path diversity (e.g., Fat-Tree or BCube), can we use multipath to get higher throughput? Initially, there is one flow.

25 In a BCube data center, can we use multipath to get higher throughput?
Initially, there is one flow. A new flow starts. Its direct route collides with the first flow.

26 In a BCube data center, can we use multipath to get higher throughput?
Initially, there is one flow. A new flow starts. Its direct route collides with the first flow. But it also has longer routes available, which don’t collide.

27 The MPTCP Protocol
MPTCP is a replacement for TCP which lets you use multiple paths simultaneously: the sender stripes packets across paths, and the receiver puts the packets back in the correct order. MPTCP sits in the stack below the existing socket API and above TCP/IP, and it can use multiple addresses (addr1, addr2) on the same host.

28 Design Goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks
To be fair, a Multipath TCP flow (here, one with two subflows) should take only as much capacity as regular TCP at a bottleneck link, no matter how many paths it is using. This is the very first thing that comes to mind with multipath TCP, and something many other people have solved in different ways; it is just a warm-up. Design Goal 3 is a much "richer" generalization of this goal, accommodating different topologies and different RTTs, so there is no point giving an evaluation here. Strawman solution: run "½ TCP" on each path.

29 Design Goal 2: MPTCP should use efficient paths
Consider three flows on a triangle of 12 Mb/s links. Each flow has a choice of a 1-hop and a 2-hop path. How should it split its traffic?

30 Design Goal 2: MPTCP should use efficient paths
If each flow splits its traffic 1:1 between its 1-hop and 2-hop paths, each flow gets 8 Mb/s of the 12 Mb/s links.

31 Design Goal 2: MPTCP should use efficient paths
If each flow splits its traffic 2:1, each flow gets 9 Mb/s.

32 Design Goal 2: MPTCP should use efficient paths
If each flow splits its traffic 4:1, each flow gets 10 Mb/s.

33 Design Goal 2: MPTCP should use efficient paths
If each flow splits its traffic ∞:1, i.e., sends everything on its 1-hop path, each flow gets the full 12 Mb/s.

34 Design Goal 2: MPTCP should use efficient paths
Theoretical solution (Kelly+Voice 2005; Han, Towsley et al. 2006). Theorem: MPTCP should send all its traffic on its least-congested paths. This leads to the most efficient allocation possible, given a network topology and a set of available paths.
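The 8/9/10/12 Mb/s sequence from the preceding slides can be reproduced in a few lines. The triangle is symmetric, so each 12 Mb/s link carries one flow's 1-hop share x plus the 2-hop shares y of the two other flows, giving x + 2y = 12:

```python
# Per-flow throughput in the 3-flow triangle when each flow splits its
# traffic r:1 between its 1-hop path and its 2-hop path.
def per_flow_throughput(r, capacity=12.0):
    # Link load constraint: x + 2y = capacity, with x = r * y.
    y = capacity / (r + 2)  # 2-hop share
    x = r * y               # 1-hop share
    return x + y
```

This recovers the slide sequence: 1:1 gives 8 Mb/s, 2:1 gives 9, 4:1 gives 10, and as r grows the throughput approaches the full 12 Mb/s, i.e., pushing all traffic onto the least-loaded (1-hop) paths is most efficient.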

35 Design Goal 3: MPTCP should be fair compared to TCP
Consider a host with a WiFi path (high loss, small RTT) and a 3G path (low loss, high RTT). Design Goal 2 says to send all traffic on the least-congested path, in this case 3G; but 3G has a high RTT, hence it will give low throughput. Goal 3a: a Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths. Goal 3b: a Multipath TCP flow should take no more capacity on any link than a single-path TCP would.

36 How does MPTCP try to achieve all this?
Design goals: Goal 1, be fair to TCP at bottleneck links. Goal 2, use efficient paths; Goal 3, as much as we can while being fair to TCP. Goal 4, adapt quickly when congestion changes. Goal 5, don't oscillate. Goal 1 is redundant: "be fair to TCP" (Goal 3) means two things, fair to the user (the user doesn't suffer by switching from TCP to MPTCP) and fair to the network (the network doesn't suffer when users switch from TCP to MPTCP), so it subsumes Goal 1. The paper discusses Goal 4 at length. For Goal 5, we didn't see any oscillations in our evaluation, and theory papers predict no oscillation for an idealized model.

37 How does MPTCP congestion control work?
Maintain a congestion window w_r for each path, where r ∊ R ranges over the set of available paths. Increase w_r for each ACK on path r (in the published algorithm, by min(a/w_total, 1/w_r), where w_total is the total window over all subflows and a is a coupling parameter); decrease w_r for each drop on path r, by w_r/2. MPTCP works pretty much like TCP: the window increases and decreases on each path. The decrease is the same as in TCP, and when there is only one path available, the formula reduces to the TCP formula. We derived a throughput formula for this congestion control algorithm, and checked that it satisfies the design goals.

38 How does MPTCP congestion control work?
Maintain a congestion window w_r for each path, where r ∊ R ranges over the set of available paths; increase w_r for each ACK on path r, and decrease w_r for each drop on path r, by w_r/2. This addresses Design Goal 3: at any potential bottleneck S that path r might go through, look at the best that a single-path TCP could get, and compare it to what I am getting.

39 How does MPTCP congestion control work?
Maintain a congestion window w_r for each path, where r ∊ R ranges over the set of available paths; increase w_r for each ACK on path r, and decrease w_r for each drop on path r, by w_r/2. This addresses Design Goal 2: we want to shift traffic away from congestion, so windows are increased in proportion to their size, letting the less-congested paths grow fastest.
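As a sketch of the coupled control on these three slides: the decrease is the slide's w_r/2, while the increase rule and the choice of a follow the published algorithm (Wischik et al., later standardized in RFC 6356), since the slide's equation images are not in the transcript.

```python
# Coupled MPTCP congestion control: per-ACK window increase, per-drop halve.
def on_ack(windows, rtts, r):
    """windows: per-path cwnd in packets; rtts: per-path RTT in seconds;
    r: index of the path on which the ACK arrived."""
    w_total = sum(windows)
    # Coupling parameter from the published algorithm:
    # a = w_total * max_i(w_i / rtt_i^2) / (sum_i w_i / rtt_i)^2
    a = w_total * max(w / t ** 2 for w, t in zip(windows, rtts)) \
        / sum(w / t for w, t in zip(windows, rtts)) ** 2
    # Grow the aggregate no faster than one TCP, and no path faster than TCP.
    windows[r] += min(a / w_total, 1 / windows[r])

def on_drop(windows, r):
    windows[r] = max(1.0, windows[r] / 2)  # same multiplicative decrease as TCP
```

With a single path, a works out to 1 and the increase is 1/w per ACK: plain TCP, as the slides note.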

40 MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput.
MPTCP shifts its traffic away from the congested link. Initially, there is one flow. A new flow starts; its direct route collides with the first flow, but it also has longer routes available, which don't collide.

41 MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput.
Packet-level simulations of BCube (125 hosts, 25 switches, 100 Mb/s links) measured average throughput for three traffic matrices: a permutation traffic matrix, a sparse traffic matrix, and a local traffic matrix. For two of the traffic matrices, MPTCP and ½ TCP (the strawman) were equally good; for the third, MPTCP got 19% higher throughput.
