Curbing Delays in Datacenters: Need Time to Save Time? Mohammad Alizadeh, Sachin Katti, Balaji Prabhakar (Insieme Networks / Stanford University)

Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency.

Datacenter Networks: 10,000s of server ports carrying web, app, db, map-reduce, HPC, monitoring, and cache traffic. Message latency is king: we need very high throughput and very low latency. Fabric links: 10-40Gbps, 1-5μs latency.

Transport in Datacenters. TCP is widely used, but has poor performance: it is buffer hungry and adds significant queuing latency. Queuing latency today: TCP ~1-10ms, DCTCP ~100μs; the target is ~zero queuing latency. How do we get there? (Baseline fabric latency: 1-5μs.)

Reducing Queuing: DCTCP vs. TCP. Experiment: 2 flows (Windows 7 stack), Broadcom 1Gbps switch, ECN marking threshold = 30KB. [Plot: queue occupancy in KBytes over time; senders S1…Sn.]
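The transcript does not spell out how DCTCP keeps the queue near the ECN marking threshold, so here is a minimal sketch of the DCTCP sender rule from the DCTCP paper (not from this talk); the gain g and initial window are illustrative:

```python
# Minimal sketch of the DCTCP sender rule that keeps the queue near the ECN
# marking threshold. The switch marks packets once its queue exceeds the
# threshold K (e.g., 30KB in the experiment above); the sender scales back
# in proportion to the fraction of marked packets instead of halving.

class DctcpSender:
    def __init__(self, cwnd=10.0, g=1 / 16):
        self.cwnd = cwnd      # congestion window (packets)
        self.alpha = 0.0      # running estimate of congestion extent
        self.g = g            # EWMA gain (illustrative value)

    def on_window_acked(self, acked_pkts, ecn_marked_pkts):
        # F = fraction of packets in the last window that carried ECN marks
        F = ecn_marked_pkts / max(acked_pkts, 1)
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if ecn_marked_pkts > 0:
            # Proportional backoff: gentle when few packets were marked
            self.cwnd *= (1 - self.alpha / 2)
        else:
            self.cwnd += 1    # standard additive increase per window
```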

Towards Zero Queuing. [Diagram: senders S1…Sn sharing a switch.]

Towards Zero Queuing. ns-2 simulation: 10 DCTCP flows, 10Gbps switch, ECN marking at 9Gbps (90% utilization). [Plot labels: target throughput; latency floor ≈ 23μs. Senders S1…Sn.]

Window-based Rate Control. One sender, one receiver; capacity C = 1 packet per time unit, RTT = 10 → C×RTT = 10 packets. With Cwnd = 1: throughput = Cwnd/(C×RTT) = 1/10 = 10% of capacity.

Window-based Rate Control. Same setup with RTT = 2 → C×RTT = 2 packets. With Cwnd = 1: throughput = 1/2 = 50% of capacity.

Window-based Rate Control. Same setup with RTT = 1.01 → C×RTT = 1.01 packets. With Cwnd = 1: throughput = 1/1.01 ≈ 99% of capacity.
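A small sketch reproducing the three examples above; the throughput_fraction helper is just an illustration of the Cwnd/(C×RTT) relationship, not code from the talk:

```python
# With C = 1 packet per time unit and Cwnd = 1, the achievable throughput is
# Cwnd / (C * RTT): it collapses when the RTT is large relative to the
# packet transmission time, and only approaches line rate as RTT shrinks.

def throughput_fraction(cwnd, capacity, rtt):
    """Fraction of link capacity achieved by a window of `cwnd` packets."""
    return min(1.0, cwnd / (capacity * rtt))

for rtt in (10, 2, 1.01):
    print(rtt, throughput_fraction(cwnd=1, capacity=1, rtt=rtt))
# -> RTT 10: 10%, RTT 2: 50%, RTT 1.01: ~99%
```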

Window-based Rate Control. Now two senders, each with Cwnd = 1, sharing the same receiver; RTT = 1.01 → C×RTT = 1.01 packets. As propagation time → 0, queue buildup is unavoidable: the two windows together keep 2 packets in flight against a bandwidth-delay product of only ~1 packet, so at least one packet must sit in the queue.

So What? Window-based rate control needs lag in the loop. A near-zero-latency transport must instead: 1. use timer-based rate control / pacing, and 2. use small packet sizes. Both increase CPU overhead (not practical in software); they are possible in hardware, but complex (e.g., HULL, NSDI'12). Or… change the problem!
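To make the CPU-overhead point concrete, here is a rough back-of-the-envelope (the 10Gbps line rate and packet sizes are assumptions for illustration) showing how often a software pacer would have to fire:

```python
# A software pacer must fire a timer once per packet; at datacenter line
# rates the inter-packet gap is far below typical OS timer granularity.

line_rate = 10e9                    # 10 Gbps (illustrative)
for pkt_bytes in (1500, 256, 64):
    gap_ns = pkt_bytes * 8 / line_rate * 1e9
    print(f"{pkt_bytes}B packets -> pace one packet every {gap_ns:.1f} ns")
# 1500B -> 1200 ns, 256B -> ~205 ns, 64B -> 51.2 ns: well below the timer
# resolution that is practical in general-purpose software.
```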

Changing the Problem… With a FIFO queue at the switch port, queue buildup is costly → precise rate control is needed. With a priority queue at the switch port, queue buildup is irrelevant → coarse rate control is OK.

pFabric

DC Fabric: Just a Giant Switch. [Diagram, built up over three slides: hosts H1–H9 connected through the fabric, then abstracted as a single giant switch with each host appearing on both the TX (ingress) and RX (egress) side.]

DC transport = flow scheduling on a giant switch, subject to ingress and egress capacity constraints. Objective: minimize average flow completion time (FCT).

“Ideal” Flow Scheduling. The problem is NP-hard [Bar-Noy et al.]; a simple greedy algorithm gives a 2-approximation.
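A hedged sketch of one natural reading of that greedy algorithm: at each instant, consider flows smallest-remaining-first and schedule a flow only if both its ingress and egress ports are still free. The function and data layout below are illustrative, not the paper's code:

```python
# Greedy matching on the "giant switch": smallest remaining flow first,
# subject to each ingress (source) and egress (destination) port serving
# at most one flow at a time.

def greedy_schedule(flows):
    """flows: list of (flow_id, src_port, dst_port, remaining_bytes).
    Returns the set of flow_ids scheduled for this instant."""
    busy_src, busy_dst = set(), set()
    scheduled = set()
    for fid, src, dst, remaining in sorted(flows, key=lambda f: f[3]):
        if src not in busy_src and dst not in busy_dst:
            scheduled.add(fid)
            busy_src.add(src)
            busy_dst.add(dst)
    return scheduled

# Example: two flows share source H1; the smaller one wins this instant.
print(greedy_schedule([("A", "H1", "H2", 100), ("B", "H1", "H3", 10)]))
```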

pFabric in 1 Slide. Packets carry a single priority number, e.g., prio = remaining flow size. pFabric switches: very small buffers (~10-20 packets for a 10Gbps fabric); send the highest-priority packet, drop the lowest-priority packets. pFabric hosts: send and retransmit aggressively; minimal rate control, just enough to prevent congestion collapse.

Key Idea: decouple flow scheduling from rate control. Switches implement flow scheduling via local mechanisms; hosts use simple window-based rate control (≈TCP) just to avoid high packet loss. Queue buildup does not hurt performance → window-based rate control is OK.

Switch Port  Priority Scheduling send highest priority packet first  Priority Dropping drop lowest priority packets first small “bag” of packets per-port 22 prio = remaining flow size H1H1 H1H1 H2H2 H2H2 H3H3 H3H3 H4H4 H4H4 H5H5 H5H5 H6H6 H6H6 H7H7 H7H7 H8H8 H8H8 H9H9 H9H9 pFabric Switch

pFabric Switch Complexity. Buffers are very small (~2×BDP per port): e.g., with C = 10Gbps and RTT = 15µs, the buffer is ~30KB, while today's switch buffers are 10-30x larger. Priority scheduling/dropping, worst case with minimum-size (64B) packets: 51.2ns to find the min/max of ~600 numbers; a binary comparator tree needs 10 clock cycles, and current ASICs have a ~1ns clock.
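The arithmetic behind those numbers, as a quick sanity check (the calculation is mine, not part of the talk; note that 2×BDP works out to ~37.5KB, which the slide quotes as roughly 30KB):

```python
# Sanity-check the slide's switch-complexity figures.
import math

C = 10e9                      # 10 Gbps link
rtt = 15e-6                   # 15 us RTT
bdp_bytes = C * rtt / 8       # bandwidth-delay product ~ 18.75 KB
buffer_bytes = 2 * bdp_bytes  # ~2 x BDP ~ 37.5 KB (slide quotes ~30 KB)

pkt = 64                                    # minimum-size packet (bytes)
time_per_pkt_ns = pkt * 8 / C * 1e9         # 51.2 ns per decision
entries = buffer_bytes / pkt                # ~600 packets to search
tree_depth = math.ceil(math.log2(entries))  # ~10 comparator levels ~ 10 cycles

print(time_per_pkt_ns, entries, tree_depth)
```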

Why does this work? Invariant for ideal scheduling: at any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch. Priority scheduling → high-priority packets traverse the fabric as quickly as possible. What about dropped packets? The lowest-priority packet is not needed until all other packets depart, and buffer > BDP means there is enough time (> one RTT) to retransmit it.
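A back-of-the-envelope for the retransmission argument, using the example numbers from the previous slide; this is my illustration, not part of the talk:

```python
# Why "buffer > BDP" leaves enough time to retransmit a dropped packet
# (C = 10 Gbps, RTT = 15 us, buffer ~ 30 KB, from the previous slide).

C = 10e9             # bits per second
rtt = 15e-6          # seconds
buffer_bytes = 30e3  # per-port buffer

drain_time = buffer_bytes * 8 / C   # time to drain a full buffer: 24 us
# A dropped (lowest-priority) packet is not needed until everything ahead
# of it has departed, i.e. for at least drain_time. Since drain_time > RTT,
# the sender has more than one round trip to detect the drop and retransmit.
print(drain_time > rtt)             # True: 24 us > 15 us
```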

Evaluation (144-port fabric; search traffic pattern). Recall that “Ideal” is really idealized: a centralized scheduler with a full view of flows, no rate-control dynamics, no buffering, no packet drops, and no load-balancing inefficiency.

Mice FCT (<100KB): average and 99th percentile. [Plots of flow completion time results.]

Conclusion. Window-based rate control does not work at near-zero round-trip latency. pFabric is simple, yet near-optimal: it decouples flow scheduling from rate control and allows the use of coarse window-based rate control. pFabric is within 10-15% of “ideal” for realistic DC workloads (SIGCOMM'13).

Thank You!