1 Data Center TCP (DCTCP) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan (Microsoft Research, Stanford University). Presented by Yi-Zong (Andrew) Ou. Credits: slides modified from the original presentation. 1

2 Data Center Packet Transport Vast data centers: a huge investment in R&D and business. Packet transport inside the DC: TCP rules (99.9% of traffic). How does TCP do in the DC? 2

3 TCP in the Data Center TCP does not meet demands in the DC: it suffers from bursty packet drops and incast [SIGCOMM '09], and it builds up large queues, which adds significant latency and wastes precious buffers (bad with shallow-buffered switches, while deep-buffered switches are expensive). Problems with current workarounds: ad-hoc, inefficient, often expensive solutions, with no solid understanding of their consequences and tradeoffs. 3

4 Case Study: Microsoft Bing Initial measurements from a 6,000-server production cluster. Instrumentation passively collects application-level, socket-level, and selected packet-level logs. More than 150 TB of compressed data over a month. Note: this environment is different from the testbed. 4

5 Partition/Aggregate Design Pattern Used in many large-scale web applications: web search, social network content composition, ad selection. Example: Facebook's Multiget is a Partition/Aggregate workload, with web servers as aggregators and memcached servers as workers. 5

6 Partition/Aggregate Application Structure TLA (top-level aggregator) → MLAs (mid-level aggregators) → worker nodes. (Figure: a query fans out down the tree and partial results are aggregated back up.) Time is money: strict deadlines (SLAs) at each level, e.g. 250 ms at the TLA, 50 ms at the MLAs, 10 ms at the workers. A missed deadline means a lower-quality result. 6
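To make the deadline behavior concrete, here is a minimal sketch (my own illustration, not code from the paper) of an aggregator that fans a query out to workers and returns whatever answers arrive before its deadline; the worker callables and timing values are assumptions.

```python
import concurrent.futures

def aggregate(query, workers, deadline_s=0.050):
    """Fan a query out to workers and keep only the answers that arrive
    before the deadline; late answers are dropped (lower-quality result)
    instead of stalling the whole request."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(len(workers), 1))
    futures = [pool.submit(w, query) for w in workers]
    done, not_done = concurrent.futures.wait(futures, timeout=deadline_s)
    for f in not_done:
        f.cancel()                  # best effort: ignore late workers
    pool.shutdown(wait=False)       # do not block on stragglers
    return [f.result() for f in done if f.exception() is None]

# Example: aggregate("cats", [lambda q: q.upper()] * 4) returns up to 4 answers.
```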

7 Workloads in the DC and Metrics Partition/Aggregate queries [2KB-20KB]: delay-sensitive. Short messages [50KB-1MB] (coordination, control-state updates): delay-sensitive. Large flows [1MB-50MB] (internal data updates): throughput-sensitive. 7

8 Switch Design and Performance Impairments Outline: switches, and the three impairments they suffer: incast, queue buildup, buffer pressure. 8

9 Switches Shared-memory switches gain performance by sharing a logically common packet buffer among all ports. Memory from the shared pool is dynamically allocated to arriving packets by an MMU; once an interface hits its memory limit, arriving packets are dropped. Question: could we just add more memory? There is a tradeoff between memory size and cost: deep-buffered switches are much more expensive, so shallow-buffered switches are more common, and shallow buffers give rise to the three impairments: incast, queue buildup, and buffer pressure. 9
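As a rough illustration of how a shared pool can be apportioned, here is a toy sketch of a dynamic-threshold style MMU; the class, the alpha parameter, and the policy details are assumptions for illustration, not the behavior of any specific switch.

```python
class SharedBufferSwitch:
    """Toy model of a shared-memory switch MMU (illustrative only).

    Each port draws buffer from a common pool; a simple dynamic-threshold
    policy (per-port limit = alpha * remaining free memory) keeps one
    congested port from starving the others."""

    def __init__(self, total_bytes, alpha=1.0):
        self.total = total_bytes
        self.used = 0
        self.per_port = {}          # port -> bytes currently queued
        self.alpha = alpha

    def enqueue(self, port, pkt_bytes):
        free = self.total - self.used
        limit = self.alpha * free   # dynamic per-port cap
        if self.per_port.get(port, 0) + pkt_bytes > limit:
            return False            # packet dropped: port exceeded its share
        self.per_port[port] = self.per_port.get(port, 0) + pkt_bytes
        self.used += pkt_bytes
        return True

    def dequeue(self, port, pkt_bytes):
        self.per_port[port] -= pkt_bytes
        self.used -= pkt_bytes
```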

11 Incast (Figure: workers 1-4 respond to an aggregator at the same time.) Synchronized responses, caused by the Partition/Aggregate pattern, overflow the switch buffer and trigger a TCP timeout (RTOmin = 300 ms). 11

12 Incast: Possible Solutions What are the possible solutions to the incast problem? Jittering responses by a random delay, or batching responses in small groups. Pros: avoids the incast problem. Cons: increases the median response time. 12
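A minimal sketch of the jittering idea (my own illustration, not code from the paper): each worker delays its response by a small random amount so that responses do not arrive at the aggregator in one synchronized burst; the 10 ms window matches the production setting described on the next slide.

```python
import random
import time

def send_response_jittered(send_fn, response, max_jitter_s=0.010):
    """Delay a worker's response by a uniform random amount within the
    jitter window before sending, de-synchronizing incast bursts at the
    cost of a higher median response time."""
    time.sleep(random.uniform(0, max_jitter_s))
    send_fn(response)
```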

14 Incast in the Production Environment Requests are jittered over a 10 ms window; jittering was switched off around 8:30 am. Jittering trades off the median against the high percentiles. (Figure: MLA query completion time in ms over the day.) 14

15 Queue Buildup (Figure: two senders share a switch port toward one receiver.) Big flows build up queues, which increases latency for short flows. Measurements in the Bing cluster: for 90% of packets, RTT < 1 ms; for 10% of packets, 1 ms < RTT < 15 ms. 15

16 Buffer Pressure Long, greedy TCP flows build up queues on their interfaces. Because the buffer is shared, this queue buildup reduces the amount of buffer available to absorb burst traffic from the Partition/Aggregate pattern. Result: packet loss and timeouts. 16

17 DCTCP Objectives Solve the three impairments: incast, queue buildup, and buffer pressure. Objectives: high burst tolerance, low latency for small flows, and high throughput for large flows, all with commodity hardware. 17

18 The DCTCP Algorithm 18

19 ECN: Explicit Congestion Notification ECN allows end-to-end notification of network congestion without dropping packets, achieved by setting a mark in the IP header to indicate congestion. ECN-Echo (ECE): signals the sender to reduce the amount of information it sends. Congestion Window Reduced (CWR): acknowledges that the congestion-indication echo was received. ECN is already in commodity hardware, and DCTCP relies on it. 19
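For reference, the standard ECN codepoints in the IP header and the related TCP flags are summarized below as a small Python constants block; this is general RFC 3168 background rather than anything specific to these slides.

```python
# ECN field: two bits in the IP header (RFC 3168)
ECN_NOT_ECT = 0b00   # sender is not ECN-capable
ECN_ECT_1   = 0b01   # ECN-Capable Transport, codepoint ECT(1)
ECN_ECT_0   = 0b10   # ECN-Capable Transport, codepoint ECT(0)
ECN_CE      = 0b11   # Congestion Experienced: set by a congested switch/router

# Related TCP header flags used by the end hosts
TCP_ECE = "ECN-Echo"                    # receiver -> sender: I saw CE marks
TCP_CWR = "Congestion Window Reduced"   # sender -> receiver: I have reacted
```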

20 DCTCP Overview Achieve the goal by reacting in proportion to the extent of congestion, not just its presence. At the switch, use a simple marking scheme: set the Congestion Experienced (CE) codepoint as soon as the queue occupancy exceeds a fixed, small threshold. At the source, react by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the larger the decrease. 20

21 Data Center TCP Algorithm Switch side: mark packets (set CE, Congestion Experienced) when the instantaneous queue length exceeds K. (Figure: queue of size B with marking threshold K; packets arriving above K are marked, below K are not.) Receiver side: in standard TCP/ECN, a receiver sets the ECN-Echo flag in a series of ACKs until it receives confirmation (the CWR flag) from the sender that the congestion notification was received. In DCTCP, the receiver instead conveys the exact sequence of marked packets back to the sender. 21
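Below is a minimal sketch of the two sides just described: switch-side threshold marking and a DCTCP-style receiver that echoes each packet's CE bit back one-for-one. It is my own illustration, simplified to one ACK per packet; the paper's receiver also handles delayed ACKs with a small state machine.

```python
def switch_mark(queue_len_bytes, K_bytes, packet):
    """Switch side: set CE on an arriving packet iff the instantaneous
    queue length already exceeds the marking threshold K."""
    packet["ce"] = queue_len_bytes > K_bytes
    return packet

def receiver_ack(packet):
    """Receiver side (simplified): echo the CE bit of the data packet in
    the ECE flag of its ACK, so the sender learns the exact sequence of
    marked packets rather than a sticky ECN-Echo signal."""
    return {"ack": packet["seq"] + packet["len"], "ece": packet["ce"]}
```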

22 DCTCP Algorithm Sender side: maintain a running average of the fraction of packets marked (α). In each RTT: α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last window of data and g is a fixed weight. Adaptive window decrease: Cwnd ← Cwnd × (1 − α/2). Note: the effective decrease factor is between 1 and 2. Highly congested (α = 1): Cwnd is cut in half. Lightly congested (small α): Cwnd decreases only a little. 22
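The sender-side update fits in a few lines. This sketch is my own rendering of the rules above, with g = 1/16 as an assumed typical weight and the per-RTT bookkeeping reduced to two counters.

```python
class DctcpSender:
    """Sketch of DCTCP sender-side congestion control (illustrative)."""

    def __init__(self, cwnd_pkts, g=1.0 / 16):
        self.cwnd = cwnd_pkts
        self.alpha = 0.0     # running estimate of the fraction of marked packets
        self.g = g           # EWMA weight

    def on_rtt_end(self, acked_pkts, marked_pkts):
        """Called once per RTT with counts of ACKed and ECE-marked packets."""
        F = marked_pkts / max(acked_pkts, 1)            # fraction marked this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked_pkts > 0:
            # Proportional decrease: a full halving only when alpha == 1.
            self.cwnd = max(1, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1                              # normal additive increase
```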

23 Does DCTCP solve the problems? 1. Queue buildup: DCTCP senders start reacting as soon as the queue length exceeds K → reduced queuing delay, and more buffer headroom left to absorb microbursts. 2. Buffer pressure: a congested port's queue cannot grow excessively large → a few congested ports cannot consume all the shared memory. 3. Incast: marking early and aggressively based on the instantaneous queue length → sources react quickly and shrink the size of follow-up bursts, preventing buffer overflow and timeouts. 23

24 Evaluation Implemented in the Windows stack; real hardware, 1 Gbps and 10 Gbps experiments on a 90-server testbed. Switches: Broadcom Triumph (48 1G ports, 4 MB shared memory, shallow), Cisco Cat4948 (48 1G ports, 16 MB shared memory, deep), Broadcom Scorpion (24 10G ports, 4 MB shared memory, shallow). Numerous micro-benchmarks: throughput and queue length, multi-hop, queue buildup, buffer pressure, fairness and convergence, incast, static vs. dynamic buffer management. Plus a cluster traffic benchmark. 24

25 Cluster Traffic Benchmark Emulate traffic within one rack of the Bing cluster: 45 1G servers plus a 10G server for external traffic. Generate query and background traffic; flow sizes and arrival times follow the distributions seen in Bing. Metric: flow completion time for queries and background flows. 25

26 Baseline (Figures: flow completion times for Background Flows and Query Flows.) Low latency for short flows. High throughput for long flows. High burst tolerance for query flows. 26

30 Scaled Background & Query 10x background, 10x query. (Figure: 95th-percentile flow completion times.) 30

31 Conclusions DCTCP satisfies all requirements for data center packet transport: it handles bursts well, keeps queuing delays low, and achieves high throughput. Features: a very simple change to TCP plus a single switch parameter, based on a mechanism (ECN) already available in commodity hardware, so there is no need to replace hardware. 31

32 Thank you 32

33 Data Center Transport Requirements 1. High burst tolerance: incast due to Partition/Aggregate is common. 2. Low latency: short flows, queries. 3. High throughput: continuous data updates, large file transfers. The challenge is to achieve all three together. 33

34 Tension Between Requirements (Figure: high burst tolerance, high throughput, and low latency pull in different directions; DCTCP sits at their intersection.) Deep buffers: queuing delays increase latency. Shallow buffers: bad for bursts and throughput. Reduced RTOmin (SIGCOMM '09): doesn't help latency. AQM (RED): the average queue is not fast enough for incast. Objective: low queue occupancy and high throughput. 34

35 Analysis How low can DCTCP maintain queues without loss of throughput? How do we set the DCTCP parameters? Need to quantify the queue size oscillations (stability). (Figure: window size sawtooth over time, oscillating between W* + 1 and (W* + 1)(1 − α/2); the packets sent in the RTT where the window exceeds W* are the ones that get marked.) 35

37 Small Queues & TCP Throughput: The Buffer Sizing Story Bandwidth-delay product rule of thumb: a single flow needs a buffer of B = C × RTT for 100% throughput. Appenzeller rule of thumb (SIGCOMM '04): with a large number N of flows, B = C × RTT / √N is enough. But we can't rely on that stat-mux benefit in the DC: measurements show typically 1-2 big flows at each server, at most 4. Real rule of thumb: low variance in sending rate → small buffers suffice. (Figure: throughput vs. buffer size, reaching 100% once the buffer exceeds B.) 37
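As a back-of-the-envelope illustration (the link speed, RTT, and flow count below are my own assumed values, not numbers from the slides), the two rules give very different buffer requirements at data center scale:

```latex
% Illustrative assumptions: C = 10\,\mathrm{Gbps}, RTT = 100\,\mu\mathrm{s}, N = 4 flows.
B_{\mathrm{BDP}} = C \cdot RTT
  = 10^{10}\,\mathrm{bit/s} \times 10^{-4}\,\mathrm{s}
  = 10^{6}\,\mathrm{bits} = 125\,\mathrm{KB}
B_{\mathrm{Appenzeller}} = \frac{C \cdot RTT}{\sqrt{N}}
  = \frac{125\,\mathrm{KB}}{\sqrt{4}} = 62.5\,\mathrm{KB}
% With only 1-4 large flows per server, \sqrt{N} is small, so the
% statistical-multiplexing discount barely helps inside the data center.
```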

41 Analysis How low can DCTCP maintain queues without loss of throughput? How do we set the DCTCP parameters? Need to quantify the queue size oscillations (stability). Result: 85% less buffer than TCP. 41
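If I recall the paper's parameter guideline correctly (worth verifying against the DCTCP paper before relying on it), the marking threshold K only needs to be a small fraction of the bandwidth-delay product; the worked numbers below use assumed values for illustration.

```latex
% Guideline from the DCTCP analysis (as I recall it; verify against the paper):
K > \frac{C \cdot RTT}{7}
% Worked example with assumed values: C = 1\,\mathrm{Gbps}, RTT = 100\,\mu\mathrm{s}
%   C \cdot RTT = 10^{9} \times 10^{-4} = 10^{5}\,\mathrm{bits} \approx 12.5\,\mathrm{KB}
%   so K need only be roughly 1.8 KB, far below the bandwidth-delay product.
```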

42 DCTCP in Action Setup: Windows 7 hosts, Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, K = 30 KB. 42

43 Review: The TCP/ECN Control Loop (Figure: senders 1 and 2 send through a switch that sets a 1-bit ECN mark; the receiver echoes it back.) ECN = Explicit Congestion Notification. 43

44 Does DCTCP meet the objectives? 1. High burst tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped. 2. Low latency: small buffer occupancies → low queuing delay. 3. High throughput: ECN averaging → smooth rate adjustments, low variance. 44

