HULL: High bandwidth, Ultra Low-Latency Data Center Fabrics
Mohammad Alizadeh, Stanford University
Joint with: Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, Masato Yasuda

Latency in Data Centers
Latency is becoming a primary metric in data centers. Operators worry about both the average latency and the high percentiles (99.9th or higher). High-level tasks (e.g., loading a Facebook page) may require thousands of low-level transactions, so we need to go after latency everywhere:
– End-host: software stack, NIC
– Network: queuing delay (the subject of this talk)

Example: Web Search
[figure: query fan-out — a top-level aggregator (TLA) sends the query "Picasso" to mid-level aggregators (MLAs), which fan out to worker nodes; the workers' candidate answers (Picasso quotes) are ranked and aggregated back up the tree]
– Strict deadlines (SLAs) at each level, shrinking down the tree: 250ms, 50ms, 10ms
– Missed deadline → lower-quality result
– Many RPCs per query → high percentiles matter

Roadmap: Reducing Queuing Latency
TCP (~1–10ms) → DCTCP (~100μs) → HULL (~zero queuing latency)
Baseline fabric latency (propagation + switching): ~10μs

Low Latency & High Throughput
Data center workloads need both:
– Short messages [50KB–1MB] (queries, coordination, control state) → low latency
– Large flows [1MB–100MB] (data updates) → high throughput
The challenge is to achieve both together.

TCP Buffer Requirement
Bandwidth-delay product rule of thumb: a single flow needs B = C×RTT of buffering for 100% throughput; this buffering is needed to absorb TCP's rate fluctuations.
[figure: throughput vs. buffer size — throughput holds at 100% when B ≥ C×RTT and falls below 100% when B < C×RTT]
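To make the rule of thumb concrete, here is a minimal sketch of the C×RTT computation (the 10Gbps/100μs numbers are illustrative assumptions, not from the talk):

```python
# Bandwidth-delay product: the buffer a single TCP flow needs for full throughput.
def bdp_bytes(link_gbps: float, rtt_us: float) -> float:
    """B = C x RTT, in bytes."""
    return (link_gbps * 1e9 / 8) * (rtt_us * 1e-6)

# Example: a 10 Gbps link with a 100 us datacenter RTT.
print(f"{bdp_bytes(10, 100) / 1e3:.0f} KB")  # -> 125 KB
```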

DCTCP: Main Idea
Switch: set the ECN mark when queue length > K (packets above the threshold are marked, packets below are not; total buffer size B).
Source: react in proportion to the extent of congestion — reduce window size based on the fraction of marked packets:

  ECN marks              TCP                  DCTCP
  most packets marked    cut window by 50%    cut window by 40%
  few packets marked     cut window by 50%    cut window by 5%
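A minimal sketch of this sender-side rule (following the DCTCP paper's alpha/g notation; illustrative pseudocode, not the authors' implementation):

```python
G = 1 / 16  # EWMA gain for the congestion estimate (a typical DCTCP setting)

def dctcp_window_update(cwnd: float, alpha: float, acked: int, marked: int):
    """Run once per window of ACKs.

    acked  -- packets ACKed in this window
    marked -- how many of them carried ECN marks
    """
    frac = marked / acked                # fraction of marked packets
    alpha = (1 - G) * alpha + G * frac   # smoothed extent of congestion
    if marked > 0:
        # Cut in proportion to congestion: alpha near 1 gives ~50% (TCP-like),
        # alpha near 0.1 gives ~5%.
        cwnd = cwnd * (1 - alpha / 2)
    return cwnd, alpha
```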

DCTCP vs TCP
Setup: Windows 7 hosts, Broadcom 1Gbps switch. Scenario: 2 long-lived flows; ECN marking threshold = 30KB.
[figure: instantaneous queue length (KBytes) over time for TCP and DCTCP]

HULL: Ultra Low Latency

What do we want?
TCP gives ~1–10ms of queuing delay and DCTCP ~100μs; we want ~zero latency. How do we get this?
[figure: incoming traffic feeding a queue drained at link speed C — with TCP the buffer fills; with DCTCP the queue hovers near the marking threshold K]

Phantom Queue
Key idea: associate congestion with link utilization, not buffer occupancy — a virtual queue (Gibbens & Kelly 1999; Kunniyur & Srikant 2001).
[figure: a "bump on the wire" beside the switch — the phantom queue drains at rate γC and marks packets that push its backlog past the marking threshold]
γ < 1 creates "bandwidth headroom".
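A minimal sketch of the idea: a counter fed by every packet on the link and drained at γC, with ECN marks whenever the counter exceeds a threshold (parameter choices here are assumptions, not the talk's):

```python
class PhantomQueue:
    """Virtual-queue 'bump on the wire': congestion = utilization above gamma*C."""

    def __init__(self, link_bps: float, gamma: float, mark_thresh_bytes: int):
        self.drain_Bps = gamma * link_bps / 8  # virtual drain rate, bytes/sec
        self.thresh = mark_thresh_bytes
        self.backlog = 0.0                     # virtual backlog, bytes
        self.last_t = 0.0

    def on_packet(self, t: float, size_bytes: int) -> bool:
        """Called for each packet; returns True if it should be ECN-marked."""
        # Drain for the elapsed time, then account for this packet.
        self.backlog = max(0.0, self.backlog - (t - self.last_t) * self.drain_Bps)
        self.last_t = t
        self.backlog += size_bytes
        return self.backlog > self.thresh
```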

Throughput & Latency vs. PQ Drain Rate
[figure: throughput and mean switch latency as a function of the phantom queue drain rate]

The Need for Pacing
TCP traffic is very bursty:
– Made worse by CPU-offload optimizations like Large Send Offload and Interrupt Coalescing
– Causes spikes in queuing, increasing latency
Example: a 1Gbps flow on a 10G NIC is transmitted as 65KB bursts every 0.5ms.
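Sanity check on that example: 65KB every 0.5ms is (65,000 × 8 bits) / 0.0005s ≈ 1.04 Gbps of average rate — but each 65KB burst leaves the NIC at the full 10Gbps line rate, which is what spikes downstream queues.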

Hardware Pacer Module
Algorithmic challenges:
– Which flows to pace? Elephants: begin pacing only if a flow receives multiple ECN marks.
– At what rate to pace? Found dynamically.
[figure: outgoing packets from the server pass a flow association table; flows selected for pacing go through a token bucket rate limiter (rate R, queue Q_TB), while un-paced traffic bypasses it straight to the NIC TX]
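A minimal sketch of the token-bucket stage (the names R and Q_TB follow the slide's diagram; the release loop is an illustrative simulation, not the hardware design):

```python
from collections import deque

class TokenBucketPacer:
    def __init__(self, rate_bps: float, bucket_bytes: int):
        self.rate_Bps = rate_bps / 8      # token fill rate R, bytes/sec
        self.bucket_cap = bucket_bytes    # small bucket => smooth, paced output
        self.tokens = float(bucket_bytes)
        self.q_tb = deque()               # Q_TB: packets awaiting tokens
        self.last_t = 0.0

    def enqueue(self, pkt_bytes: int):
        self.q_tb.append(pkt_bytes)

    def release(self, t: float):
        """Return the sizes of packets eligible to leave by time t."""
        self.tokens = min(self.bucket_cap,
                          self.tokens + (t - self.last_t) * self.rate_Bps)
        self.last_t = t
        out = []
        while self.q_tb and self.tokens >= self.q_tb[0]:
            pkt = self.q_tb.popleft()
            self.tokens -= pkt
            out.append(pkt)
        return out
```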

Throughput & Latency vs. PQ Drain Rate (with Pacing)
[figure: throughput and mean switch latency as a function of the phantom queue drain rate, with pacing enabled]

No Pacing vs Pacing (Mean Latency)
[figure: mean switch latency, no pacing vs. pacing]

No Pacing vs Pacing (99th Percentile Latency)
[figure: 99th percentile switch latency, no pacing vs. pacing]

The HULL Architecture
Phantom queues + Hardware pacers + DCTCP congestion control

More Details…
[figure: end-to-end path — Application → DCTCP CC → LSO → hardware Pacer in the NIC on the host; at the switch, a near-empty queue plus a phantom queue on the link (speed C) with ECN threshold γ×C; large flows take the pacer path, small flows bypass it]
– Hardware pacing happens after segmentation in the NIC, so it smooths LSO's large bursts.
– Mice flows skip the pacer and are not delayed.

Dynamic Flow Experiment: 20% load
Setup: multiple senders → 1 receiver (80% 1KB flows, 20% 10MB flows).

                      Switch latency (μs)    10MB FCT (ms)
                      Avg       99th         Avg      99th
  TCP                 111.5     …            …        …
  DCTCP-30K           …         …            …        …
  DCTCP-6K-Pacer      …         …            …        …
  DCTCP-PQ950-Pacer   …         …            …        …

Takeaway: ~93% decrease in switch latency for a ~17% increase in 10MB FCT.

Dynamic Flow Experiment: 40% load
Setup: multiple senders → 1 receiver (80% 1KB flows, 20% 10MB flows).

                      Switch latency (μs)    10MB FCT (ms)
                      Avg       99th         Avg      99th
  TCP                 329.3     …            …        …
  DCTCP-30K           …         …            …        …
  DCTCP-6K-Pacer      …         …            …        …
  DCTCP-PQ950-Pacer   …         …            …        …

Takeaway: ~91% decrease in switch latency for a ~28% increase in 10MB FCT.

Slowdown due to bandwidth headroom
Processor sharing model for elephants: on a link of capacity 1 carrying total load ρ, a flow of size x takes on average x/(1−ρ) to complete. With the drain rate reduced to γ < 1, the same flow takes x/(γ−ρ), so the slowdown is (1−ρ)/(γ−ρ).
Example (ρ = 40%, with 20% headroom, i.e., γ = 0.8): slowdown = 0.6/0.4 = 1.5 → 50%, not the 20% the headroom alone would suggest.
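A minimal sketch applying that formula to the drain rates compared on the next slide (800/900/950 Mbps on a 1Gbps link, i.e., γ = 0.8/0.9/0.95, with ρ = 0.4 as in the example):

```python
def slowdown(rho: float, gamma: float) -> float:
    """Processor-sharing slowdown from draining at rate gamma instead of 1."""
    return (1 - rho) / (gamma - rho)

rho = 0.4
for gamma in (0.95, 0.90, 0.80):
    print(f"gamma={gamma:.2f}: {100 * (slowdown(rho, gamma) - 1):.0f}% slower")
# gamma=0.95: 9% slower; gamma=0.90: 20% slower; gamma=0.80: 50% slower
```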

Slowdown: Theory vs Experiment
[figure: measured vs. predicted slowdown for DCTCP-PQ800, DCTCP-PQ900, and DCTCP-PQ950]

Summary
The HULL architecture combines:
– DCTCP
– Phantom queues
– Hardware pacing
A small amount of bandwidth headroom gives significant (often 10–40x) latency reductions, with a predictable slowdown for large flows.

Thank you!