Presentation transcript:

Less is More: Trading a little Bandwidth for Ultra-Low Latency in the Data Center
Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda

Latency in Data Centers
Latency is becoming a primary performance metric in the data center. Low-latency applications include:
- High-frequency trading
- High-performance computing
- Large-scale web applications
- RAMClouds (want < 10μs RPCs)
What all these applications would like from the network is predictable, low-latency delivery of individual packets. For instance, one of the goals in RAMCloud is to allow 5-10μs RPCs end-to-end across the fabric. But as this talk shows, that is very difficult to do in today's networks.

Why Does Latency Matter?
(Slide contrasts a traditional application, with app logic and data structures on a single machine and << 1μs data access latency, against a large-scale web application, with an app tier and a data tier separated by the data center fabric and 0.5-10ms access latency.)
Perhaps the easiest way to see this is through a comparison John Ousterhout makes to explain why large-scale web applications should move their data to RAMClouds. Before the web, an application and its data structures ran together on one machine, but that limits you to the capacity of that one machine. To deploy a large web application today, we put the application logic on racks of machines and the data on other racks, and the application must go over the network to access its data. That access latency is 4-5 orders of magnitude larger than on the single machine. In terms of small-object reads per second, the one machine actually beats the thousands of machines: we have scaled compute and storage, but not data access rates. This is a big problem for large-scale applications; Facebook, for example, can only make 100-150 internal requests per page.
Latency limits the data access rate, which fundamentally limits applications (possibly 1000s of RPCs per operation). Microseconds matter, even at the tail (e.g., the 99.9th percentile).

Reducing Latency
Software and hardware are improving:
- Kernel bypass, RDMA; RAMCloud: software processing ~1μs
- Low-latency switches forward packets in a few 100ns
- Baseline fabric latency (propagation, switching) under 10μs is achievable.
Queuing delay is random and traffic dependent, and can easily reach 100s of microseconds or even milliseconds (one 1500B packet = 12μs @ 1Gbps).
Goal: Reduce queuing delays to zero.
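To put the per-packet number in context, here is a quick arithmetic check (a minimal Python sketch; the 1500B / 1Gbps figures are from the slide, the 100-packet backlog is an illustrative assumption):

```python
# Serialization delay of one full-size Ethernet frame at 1 Gbps,
# matching the "one 1500B packet = 12us @ 1Gbps" figure on the slide.
PACKET_BYTES = 1500
LINK_BPS = 1e9  # 1 Gbps

delay_us = PACKET_BYTES * 8 / LINK_BPS * 1e6
print(f"{delay_us:.1f} us per packet")  # -> 12.0 us

# An assumed backlog of 100 such packets already adds over a millisecond.
print(f"{100 * delay_us / 1000:.1f} ms for a 100-packet backlog")
```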

Low Latency AND High Throughput
Data center workloads:
- Short messages [100B-10KB] need low latency.
- Large flows [1MB-100MB] need high throughput.
We want baseline fabric latency AND high throughput.

Why do we need buffers?
Main reason: to create "slack":
- Handle temporary oversubscription
- Absorb TCP's rate fluctuations as it discovers path bandwidth
Example: the bandwidth-delay product rule of thumb. A single TCP flow needs C×RTT of buffering for 100% throughput. (Plot: throughput vs. buffer size B; throughput is 100% when B ≥ C×RTT and falls below 100% when B < C×RTT.)
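As a concrete illustration of the rule of thumb, the sketch below computes C×RTT for an assumed 10Gbps link and 100μs fabric RTT (both values are assumptions, not numbers from the talk):

```python
# Bandwidth-delay product rule of thumb: a single TCP flow needs
# B >= C x RTT of buffering for 100% throughput.
C_BPS = 10e9     # assumed link capacity: 10 Gbps
RTT_S = 100e-6   # assumed fabric RTT: 100 microseconds

bdp_bytes = C_BPS * RTT_S / 8
print(f"C x RTT = {bdp_bytes / 1024:.0f} KB "
      f"(~{bdp_bytes / 1500:.0f} full-size packets of buffering)")
```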

Overview of our Approach
Main idea:
- Use "phantom queues" to signal congestion before any queuing occurs.
- Use DCTCP [SIGCOMM'10] to mitigate the throughput loss that can occur without buffers.
- Use hardware pacers to combat burstiness due to offload mechanisms like LSO and interrupt coalescing.

Review: DCTCP
How can we extract multi-bit information from a single-bit stream of ECN marks?
Switch: set the ECN mark when queue length > K.
Source: react in proportion to the extent of congestion (less fluctuation); reduce the window size based on the fraction of marked packets.

ECN marks            | TCP               | DCTCP
1 0 1 1 1 1 0 1 1 1  | Cut window by 50% | Cut window by 40%
0 0 0 0 0 0 0 0 0 1  | Cut window by 50% | Cut window by 5%

Queue length standard deviation: TCP 33.6KB, DCTCP 11.5KB.
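For reference, the window update that produces this proportional reaction can be sketched as follows (a minimal Python sketch of the DCTCP control law from the SIGCOMM'10 paper; the gain g and the example numbers are illustrative):

```python
# DCTCP sender reaction: keep an EWMA estimate (alpha) of the fraction of
# ECN-marked packets, and cut the window in proportion to it.

def update_alpha(alpha, marked, total, g=1.0 / 16):
    """Update alpha once per window of data from the observed mark fraction."""
    frac = marked / total if total else 0.0
    return (1 - g) * alpha + g * frac

def on_congestion(cwnd, alpha):
    """Cut cwnd in proportion to the extent of congestion."""
    return max(1.0, cwnd * (1 - alpha / 2))

# Example: 4 of 10 packets marked in the last window.
alpha = update_alpha(alpha=0.5, marked=4, total=10)
print(on_congestion(cwnd=100.0, alpha=alpha))  # a mild cut, not TCP's 50%
```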

DCTCP vs TCP
(Figure: switch queue length in KBytes over time, from Alizadeh et al. [SIGCOMM'10].)
Setup: Win 7 hosts, Broadcom 1Gbps switch (a real testbed, not ns-2). Scenario: 2 long-lived flows, ECN marking threshold = 30KB.

Achieving Zero Queuing Delay
(Diagram: incoming traffic into a switch buffer. With TCP the buffer fills, giving ~1-10ms of queuing delay; with DCTCP the queue hovers around the marking threshold K, giving ~100μs.)
How do we get to ~zero latency?

Phantom Queue
Key idea: associate congestion with link utilization, not buffer occupancy, i.e., a virtual queue (Gibbens & Kelly 1999, Kunniyur & Srikant 2001).
The phantom queue is a "bump on the wire" (NetFPGA implementation) attached to a link of speed C: it simulates a queue draining at γC and ECN-marks packets once the simulated queue crosses the marking threshold. With γ < 1, this creates "bandwidth headroom", as sketched below.
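The following is a minimal software sketch of that counter, assuming a per-packet update; the link speed, γ, and marking threshold are illustrative values, not the ones used in the NetFPGA implementation:

```python
# Phantom queue sketch: a counter incremented by packet sizes and drained at
# gamma * C; packets are ECN-marked while the counter exceeds a threshold.
import time

class PhantomQueue:
    def __init__(self, link_bps=10e9, gamma=0.95, mark_thresh_bytes=6000):
        self.drain_Bps = gamma * link_bps / 8    # drain rate in bytes/sec
        self.mark_thresh = mark_thresh_bytes
        self.backlog = 0.0
        self.last = time.monotonic()

    def on_packet(self, size_bytes):
        now = time.monotonic()
        # Drain the virtual backlog for the elapsed time, then add the packet.
        self.backlog = max(0.0, self.backlog - self.drain_Bps * (now - self.last))
        self.last = now
        self.backlog += size_bytes
        return self.backlog > self.mark_thresh   # True => set the ECN mark

pq = PhantomQueue()
marks = sum(pq.on_packet(1500) for _ in range(100))   # a tight 100-packet burst
print(f"{marks} of 100 back-to-back packets would be ECN-marked")
```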

Throughput & Latency vs. PQ Drain Rate
(Figure: throughput and mean switch latency as the phantom queue drain rate γC is varied.)

The Need for Pacing
TCP traffic is very bursty, and it is made worse by CPU-offload optimizations like Large Send Offload and interrupt coalescing. The bursts cause spikes in queuing, increasing latency. Example: a 1Gbps flow on a 10G NIC shows up as 65KB bursts every 0.5ms.
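A quick check of the slide's example shows why such a flow still looks bursty even though its average rate is only about 1Gbps (a sketch using the slide's numbers):

```python
# The slide's example: a 1 Gbps flow on a 10G NIC appears as 65KB bursts
# every 0.5 ms. Check the average rate and how long each burst occupies
# the 10G link at line rate.
BURST_BYTES = 65 * 1024
PERIOD_S = 0.5e-3
NIC_BPS = 10e9

avg_gbps = BURST_BYTES * 8 / PERIOD_S / 1e9
burst_us = BURST_BYTES * 8 / NIC_BPS * 1e6
print(f"average rate ~{avg_gbps:.2f} Gbps")           # ~1 Gbps, as expected
print(f"each burst lasts ~{burst_us:.0f} us at 10G")  # ~52 us of line-rate traffic
# The downstream queue sees 10 Gbps for ~52 us and then silence, not a
# smooth 1 Gbps stream -- hence the queuing spikes.
```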

Impact of Interrupt Coalescing

Setting        | Receiver CPU (%) | Throughput (Gbps) | Burst Size (KB)
disabled       | 99               | 7.7               | 67.4
rx-frames=2    | 98.7             | 9.3               | 11.4
rx-frames=8    | 75               | 9.5               | 12.2
rx-frames=32   | 53.2             |                   | 16.5
rx-frames=128  | 30.7             |                   | 64.0

More interrupt coalescing gives lower CPU utilization and higher throughput, but more burstiness.

Hardware Pacer Module
Algorithmic challenges:
- At what rate to pace? The pacing rate is found dynamically.
- Which flows to pace? Elephants: on each ACK with the ECN bit set, begin pacing the flow with some probability.
(Diagram: outgoing packets from the server NIC pass through a flow association table; flows selected for pacing go through a token bucket rate limiter (rate R, queue QTB) before transmission, while un-paced traffic bypasses it.)
A minimal token-bucket sketch follows.
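The token-bucket element can be sketched in software as below (the actual HULL pacer is implemented in hardware; the pacing rate and bucket depth here are illustrative assumptions):

```python
# Software sketch of the pacer's token-bucket rate limiter.
import time

class TokenBucketPacer:
    def __init__(self, rate_bps=1e9, bucket_bytes=3000):
        self.rate_Bps = rate_bps / 8          # refill rate in bytes/sec
        self.bucket = bucket_bytes            # max burst released at once
        self.tokens = bucket_bytes
        self.last = time.monotonic()

    def send(self, size_bytes):
        """Wait (pace) until enough tokens accumulate, then release the packet."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.bucket,
                              self.tokens + self.rate_Bps * (now - self.last))
            self.last = now
            if self.tokens >= size_bytes:
                self.tokens -= size_bytes
                return                        # packet released at ~rate_bps
            time.sleep((size_bytes - self.tokens) / self.rate_Bps)

pacer = TokenBucketPacer(rate_bps=1e9)
start = time.monotonic()
for _ in range(100):
    pacer.send(1500)                          # 100 x 1500B at 1 Gbps ~ 1.2 ms
print(f"paced 100 packets in {(time.monotonic() - start) * 1e3:.2f} ms")
```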

Throughput & Latency vs. PQ Drain Rate (with Pacing)
(Figure: throughput and mean switch latency vs. phantom queue drain rate, now with the hardware pacer enabled.)

No Pacing vs Pacing (Mean Latency)

No Pacing vs Pacing (99th Percentile Latency)

The HULL Architecture
(Diagram: hosts run DCTCP congestion control with a hardware pacer at the NIC; phantom queues sit at the switch links.)

Implementation and Evaluation
Implementation: the PQ, pacer, and latency-measurement modules are implemented in NetFPGA; DCTCP runs in Linux (patch available online).
Evaluation:
- 10-server testbed
- Numerous micro-benchmarks: static & dynamic workloads; comparison with an 'ideal' 2-priority QoS scheme; different marking thresholds and switch buffer sizes; effect of parameters
- Large-scale ns-2 simulations

Dynamic Flow Experiment (20% load)
9 senders → 1 receiver (80% 1KB flows, 20% 10MB flows). Load: 20%.

                    Switch Latency (μs)    10MB FCT (ms)
Scheme              Avg      99th          Avg      99th
TCP                 111.5    1,224.8       110.2    349.6
DCTCP-30K           38.4     295.2         106.8    301.7
DCTCP-6K-Pacer      6.6      59.7          111.8    320.0
DCTCP-PQ950-Pacer   2.8      18.6          125.4    359.9

Conclusion
The HULL architecture combines:
- Phantom queues
- DCTCP
- Hardware pacing
We trade some bandwidth, which is a resource that is relatively plentiful in modern data centers, for significant latency reductions (often 10-40x compared to TCP and DCTCP).

Thank you!