1 Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center
Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda

2 Latency in Data Centers
Latency is becoming a primary performance metric in the data center. Low-latency applications: high-frequency trading, high-performance computing, large-scale web applications, and RAMClouds (which want < 10μs RPCs). What all these applications would like from the network is predictable low-latency delivery of individual packets. For instance, one of the goals in RAMCloud is to allow 5-10μs RPCs end-to-end across the fabric. But as I'll show, this is very difficult to do in today's networks.

3 Why Does Latency Matter?
Perhaps the easiest way to see this is through a comparison John Ousterhout makes to explain why large-scale web applications should move their data to RAMClouds. Before the web, a traditional application ran its logic and its data structures on a single machine, with well under 1μs between the application logic and its data. But that is limited to the capacity of one machine. To deploy a large-scale web application today, we take racks of machines and distribute the application logic on them (the app tier), and take other racks and distribute the data on them (the data tier). Now the application has to cross the data center fabric to access its data, with 0.5-10ms of latency: four to five orders of magnitude more than the single machine. If you ask how many reads of small objects each model can do, that one machine actually beats the thousands of machines. So while we have scaled compute and storage, we have not scaled data access rates. Latency limits the data access rate, and that fundamentally limits applications: a single operation can involve thousands of RPCs, so microseconds matter, even at the tail (e.g., the 99.9th percentile), and Facebook can only make a limited number of internal requests per page.

4 Reducing Latency
Software and hardware are improving: kernel bypass and RDMA (RAMCloud's software processing is ~1μs), and low-latency switches forward packets in a few hundred nanoseconds. A baseline fabric latency (propagation, switching) under 10μs is achievable. Queuing delay, however, is random and traffic dependent, and can easily reach hundreds of microseconds or even milliseconds; one 1500B packet takes 12μs to drain at 1Gbps, so even a modest queue hurts. Goal: reduce queuing delays to zero.
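To make the queuing numbers concrete, here is a minimal back-of-the-envelope calculation (the helper and the queue depth are illustrative, not from the talk):

    # Serialization delay of one packet, and the delay implied by a queue of packets ahead of it.
    def serialization_delay_us(packet_bytes, link_gbps):
        return packet_bytes * 8 / (link_gbps * 1e3)  # bits / (bits per microsecond)

    print(serialization_delay_us(1500, 1.0))         # one 1500B packet at 1 Gbps: 12.0 us
    print(100 * serialization_delay_us(1500, 1.0))   # 100 queued packets: 1200 us (1.2 ms)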

5 Low Latency AND High Throughput
Data center workloads mix short messages [100B-10KB], which need low latency, with large flows [1MB-100MB], which need high throughput. We want baseline fabric latency AND high throughput.

6 Why do we need buffers? Main reason: to create “slack”
They handle temporary oversubscription and absorb TCP's rate fluctuations as it discovers the path bandwidth. Example: the bandwidth-delay product rule of thumb says a single TCP flow needs C×RTT of buffering for 100% throughput (the slide plots throughput vs. buffer size B: 100% when B ≥ C×RTT, dropping off when B < C×RTT). A short worked example follows.
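As a rough illustration of the rule of thumb (the link speed and RTT below are my own example, not from the talk):

    # Bandwidth-delay product: a single TCP flow needs B >= C * RTT of buffering
    # to keep the link fully utilized.
    def bdp_bytes(link_gbps, rtt_us):
        return link_gbps * 1e9 * rtt_us * 1e-6 / 8   # (bits/s * s) -> bytes

    # e.g., a 10 Gbps link with a 100 us RTT calls for ~125 KB of buffer
    # for one flow to sustain 100% throughput.
    print(bdp_bytes(10, 100))   # 125000.0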

7 Overview of our Approach
Main idea: use "phantom queues" to signal congestion before any queuing occurs; use DCTCP [SIGCOMM'10] to mitigate the throughput loss that can occur without buffers; and use hardware pacers to combat burstiness due to offload mechanisms like LSO and interrupt coalescing.

8 Review: DCTCP
Mark Don’t Switch: Set ECN Mark when Queue Length > K. Source: React in proportion to the extent of congestion  less fluctuations Reduce window size based on fraction of marked packets. ECN Marks TCP DCTCP Cut window by 50% Cut window by 40% Cut window by 5% Start with: “How can we extract multi-bit information from single-bit stream of ECN marks?” - Standard deviation: TCP (33.6KB), DCTCP (11.5KB)

9 DCTCP vs TCP: Switch Queue Length (KBytes)
Setup: Windows 7 hosts, Broadcom 1Gbps switch (real hardware, not ns-2). Scenario: 2 long-lived flows, ECN marking threshold = 30KB. From Alizadeh et al. [SIGCOMM'10]. (The slide plots switch queue occupancy in KBytes over time for the two protocols.)

10 Achieving Zero Queuing Delay
The slide contrasts queuing at a switch port: with TCP, incoming traffic builds a standing queue and queuing delay is ~1-10ms; with DCTCP marking at threshold K on a link of capacity C, the queue hovers near K and delay drops to ~100μs. We want ~zero latency. How do we get this?

11 Phantom Queue Key idea:
Associate congestion with link utilization, not buffer occupancy; this is the virtual queue idea (Gibbens & Kelly 1999, Kunniyur & Srikant 2001). The phantom queue is a "bump on the wire" next to the switch (NetFPGA implementation): it simulates a queue that drains at γC, a fraction of the actual link speed C, and ECN-marks packets when this simulated queue exceeds the marking threshold. γ < 1 creates "bandwidth headroom" so real queues stay empty. A rough sketch of the marking logic follows this slide.
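A minimal sketch of phantom-queue marking as I read the mechanism (class and parameter names are illustrative, not the NetFPGA code):

    # The phantom queue is just a counter drained at gamma*C instead of C.
    # Packets are ECN-marked when the counter exceeds a threshold, so
    # congestion is signaled while the real switch queue is still empty.
    class PhantomQueue:
        def __init__(self, link_bps, gamma, mark_thresh_bytes):
            self.drain_Bps = gamma * link_bps / 8    # simulated drain rate, bytes/sec
            self.thresh = mark_thresh_bytes
            self.backlog = 0.0                       # simulated bytes in the phantom queue
            self.last_time = 0.0

        def on_packet(self, now, pkt_bytes):
            # Drain the simulated backlog for the elapsed time, then add this packet.
            self.backlog = max(0.0, self.backlog - (now - self.last_time) * self.drain_Bps)
            self.last_time = now
            self.backlog += pkt_bytes
            return self.backlog > self.thresh        # True -> set the ECN mark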

12 Throughput & Latency vs. PQ Drain Rate
(Figure: throughput and mean switch latency as the phantom queue drain rate γC is varied.)

13 The Need for Pacing TCP traffic is very bursty
Burstiness is made worse by CPU-offload optimizations like Large Send Offload and interrupt coalescing, and it causes spikes in queuing that increase latency. Example: a 1Gbps flow on a 10G NIC is sent as 65KB bursts every 0.5ms.
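A quick back-of-the-envelope check on that example (the arithmetic is mine, using the 65KB/0.5ms figures from the slide):

    burst_bytes = 65 * 1024
    period_s = 0.5e-3

    avg_gbps = burst_bytes * 8 / period_s / 1e9     # ~1.06 Gbps average rate
    burst_us = burst_bytes * 8 / 10e9 * 1e6         # each burst lasts ~53 us at 10G line rate

    # The average rate is the intended 1 Gbps, but each burst arrives at 10 Gbps
    # and can momentarily dump tens of KB into a downstream queue.
    print(avg_gbps, burst_us)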

14 Impact of Interrupt Coalescing
Receiver-side measurements as interrupt coalescing increases (rx-frames = frames received per interrupt):

    Setting          Receiver CPU (%)   Throughput (Gbps)   Burst Size (KB)
    disabled         99                 7.7                 67.4
    rx-frames=2      98.7               9.3                 11.4
    rx-frames=8      75                 9.5                 12.2
    rx-frames=32     53.2               (not given)         16.5
    rx-frames=128    30.7               (not given)         64.0

More interrupt coalescing gives lower CPU utilization and higher throughput, but more burstiness.

15 Hardware Pacer Module Algorithmic challenges: At what rate to pace?
The rate is found dynamically. Which flows to pace? The elephants: on each ACK with the ECN bit set, begin pacing that flow with some probability. In the pacer module, outgoing packets from the server are matched against a flow association table; paced flows go through a token bucket rate limiter (rate R, token bucket queue QTB), while unmatched traffic passes through the NIC un-paced. A small sketch of this logic follows.
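A minimal sketch of how such a pacer could behave (the association probability and the names are assumptions of mine, not the NetFPGA module):

    import random

    class Pacer:
        # ECN-marked ACKs probabilistically pull a flow into the flow association
        # table; associated flows are released by a token bucket at rate R,
        # everything else bypasses the pacer.
        def __init__(self, rate_bps, assoc_prob=0.125):
            self.rate_Bps = rate_bps / 8
            self.assoc_prob = assoc_prob    # illustrative value
            self.paced_flows = set()        # flow association table
            self.tokens = 0.0
            self.last_time = 0.0

        def on_ack(self, flow_id, ecn_marked):
            if ecn_marked and random.random() < self.assoc_prob:
                self.paced_flows.add(flow_id)         # start pacing this elephant

        def can_send(self, now, flow_id, pkt_bytes):
            if flow_id not in self.paced_flows:
                return True                           # un-paced traffic passes straight through
            self.tokens = min(pkt_bytes, self.tokens + (now - self.last_time) * self.rate_Bps)
            self.last_time = now
            if self.tokens >= pkt_bytes:
                self.tokens -= pkt_bytes
                return True
            return False                              # packet waits in QTB until tokens accrue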

16 Throughput & Latency vs. PQ Drain Rate
(Figure: throughput and mean switch latency vs. phantom queue drain rate, now with pacing enabled.)

17 No Pacing vs Pacing (Mean Latency)

18 No Pacing vs Pacing (99th Percentile Latency)

19 The HULL Architecture
The HULL architecture combines three components: phantom queues, DCTCP congestion control, and hardware pacers.

20 Implementation and Evaluation
The PQ, pacer, and latency-measurement modules are implemented in NetFPGA; DCTCP runs in Linux (patch available online). Evaluation: a 10-server testbed with numerous micro-benchmarks, static and dynamic workloads, a comparison with an 'ideal' 2-priority QoS scheme, different marking thresholds and switch buffer sizes, the effect of parameters, and large-scale ns-2 simulations.

21 Dynamic Flow Experiment 20% load
9 senders → 1 receiver (80% 1KB flows, 20% 10MB flows); load: 20%.

    Scheme               Switch Latency (μs)    10MB FCT (ms)
                         Avg      99th          Avg      99th
    TCP                  111.5    1,224.8       110.2    349.6
    DCTCP-30K            38.4     295.2         106.8    301.7
    DCTCP-6K-Pacer       6.6      59.7          111.8    320.0
    DCTCP-PQ950-Pacer    2.8      18.6          125.4    359.9

22 Conclusion The HULL architecture combines
phantom queues, DCTCP, and hardware pacing. We trade some bandwidth, a resource that is relatively plentiful in modern data centers, for significant latency reductions, often 10-40x compared to TCP and DCTCP.

23 Thank you!

