pFabric: Minimal Near-Optimal Datacenter Transport. Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker.

Presentation transcript:

pFabric: Minimal Near-Optimal Datacenter Transport
Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker
Stanford University / U.C. Berkeley & ICSI / Insieme Networks

Transport in Datacenters
– 1000s of server ports
– The DC network is the interconnect for distributed compute workloads (web app, db, map-reduce, HPC, monitoring, cache)
– Message latency is King; traditional “fairness” metrics are less relevant

Transport in Datacenters
Goal: complete flows quickly. This requires scheduling flows such that:
– High throughput for large flows
– Fabric latency (no queuing delays) for small flows
Prior work uses rate control to schedule flows: DCTCP [SIGCOMM’10], HULL [NSDI’12], D²TCP [SIGCOMM’12], D³ [SIGCOMM’11], PDQ [SIGCOMM’12], … These vastly improve performance, but are complex.

pFabric in 1 Slide
Packets carry a single priority number, e.g., prio = remaining flow size (a tagging sketch follows below).
pFabric switches:
– Very small buffers (20-30KB for a 10Gbps fabric)
– Send the highest-priority packet first; drop the lowest-priority packets
pFabric hosts:
– Send/retransmit aggressively
– Minimal rate control: just prevent congestion collapse
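Since the priority is just the flow's remaining size, the sender can stamp it on each packet as it transmits. A minimal illustrative sketch in Python (names and layout are assumptions, not the talk's code):

```python
# Sketch: stamp each outgoing packet with the flow's remaining bytes.
# pFabric itself carries this number in a packet header field that
# switches read; everything here is illustrative.

def packetize(flow_id, flow_bytes, mss=1460):
    """Yield (flow_id, priority, payload_len) for one flow.

    priority = remaining flow size, so a nearly finished flow's packets
    carry a smaller number (= higher priority) than a fresh large flow's.
    """
    remaining = flow_bytes
    while remaining > 0:
        payload = min(mss, remaining)
        yield (flow_id, remaining, payload)  # smaller number = more urgent
        remaining -= payload

# A 4 KB flow emits priorities 4096, 2636, 1176.
print([prio for _, prio, _ in packetize(flow_id=1, flow_bytes=4096)])
```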

CONCEPTUAL MODEL

DC Fabric: Just a Giant Switch
[Figure: hosts H1-H9 attached to the datacenter fabric]

DC Fabric: Just a Giant Switch
[Figure: the fabric abstracted as one big switch, with hosts H1-H9 on the TX side and the RX side]


DC transport = flow scheduling on a giant switch
– Objective? Minimize average FCT
– Constraint: ingress & egress port capacities
[Figure: the giant switch with TX hosts H1-H9 on one side and RX hosts H1-H9 on the other]

“Ideal” Flow Scheduling
Problem is NP-hard [Bar-Noy et al.]
– Simple greedy algorithm: 2-approximation (sketched below)
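The greedy algorithm can be read as: repeatedly grant the flow with the fewest remaining bytes whose ingress and egress ports are both still free. A rough Python sketch under that reading (the data layout and function names are assumptions):

```python
# Greedy maximal matching over (remaining_bytes, ingress, egress) flows,
# shortest-remaining-size first; the 2-approximation the slide cites
# from Bar-Noy et al. Illustrative only.

def greedy_schedule(flows):
    """flows: list of (remaining_bytes, ingress_port, egress_port).
    Returns the flows granted service in this scheduling round."""
    busy_in, busy_out, granted = set(), set(), []
    for rem, src, dst in sorted(flows):          # fewest remaining bytes first
        if src not in busy_in and dst not in busy_out:
            granted.append((rem, src, dst))
            busy_in.add(src)
            busy_out.add(dst)
    return granted

# The 2 KB flow wins ingress port 1; the 5 KB flow on the same port waits.
print(greedy_schedule([(5000, 1, 2), (2000, 1, 3), (7000, 4, 2)]))
```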

pFABRIC DESIGN

Key Insight
Decouple flow scheduling from rate control:
– Switches implement flow scheduling via local mechanisms
– Hosts implement simple rate control to avoid high packet loss

pFabric Switch
Each switch port keeps a small “bag” of packets and works on priorities (prio = remaining flow size):
– Priority scheduling: send the highest-priority packet first
– Priority dropping: drop the lowest-priority packets first
A toy model follows below.
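In this toy software model, dequeue removes the packet with the smallest priority number, and when the bag is full an arrival displaces the least urgent buffered packet (or is itself dropped). The linear scans stand in for the hardware min/max logic discussed on the next slide; all names are illustrative:

```python
# Toy model of a pFabric switch port queue. Not the authors' code.

class PFabricPort:
    def __init__(self, capacity_pkts):
        self.capacity = capacity_pkts
        self.bag = []                        # unordered list of (prio, pkt)

    def enqueue(self, prio, pkt):
        if len(self.bag) < self.capacity:
            self.bag.append((prio, pkt))
            return
        # Bag full: locate the least urgent packet (largest prio number).
        worst = max(range(len(self.bag)), key=lambda i: self.bag[i][0])
        if prio < self.bag[worst][0]:
            self.bag[worst] = (prio, pkt)    # drop the worst, keep arrival
        # else: the arrival is the least urgent, so drop it instead

    def dequeue(self):
        if not self.bag:
            return None
        best = min(range(len(self.bag)), key=lambda i: self.bag[i][0])
        return self.bag.pop(best)            # send the most urgent packet

port = PFabricPort(capacity_pkts=3)
for prio, pkt in [(900, "a"), (100, "b"), (500, "c"), (50, "d")]:
    port.enqueue(prio, pkt)                  # "d" displaces "a" (prio 900)
print(port.dequeue())                        # -> (50, 'd')
```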

pFabric Switch Complexity
Buffers are very small (~2×BDP per port)
– e.g., C = 10Gbps, RTT = 15µs → buffer ~ 30KB
– Today’s switch buffers are 10-30x larger
Priority scheduling/dropping, worst case: minimum-size packets (64B)
– 51.2ns to find the min/max of ~600 numbers
– Binary comparator tree: 10 clock cycles
– Current ASICs: clock ~ 1ns
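The timing budget works out as the slide claims; this snippet just reproduces the arithmetic (the 36KB buffer figure, ~2×BDP, is taken from the evaluation slide later in the deck):

```python
import math

line_rate_bps = 10e9                               # 10 Gbps port
slot_ns = 64 * 8 / line_rate_bps * 1e9             # time per 64B packet
buffered_pkts = 36 * 1024 // 64                    # ~576 min-size packets
tree_depth = math.ceil(math.log2(buffered_pkts))   # binary comparator tree

print(f"{slot_ns:.1f} ns per min-size packet")     # 51.2 ns budget
print(f"~{buffered_pkts} packets to scan")         # the "~600 numbers"
print(f"{tree_depth} comparator levels")           # 10 cycles at ~1ns/clock
```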

pFabric Rate Control
With priority scheduling/dropping, queue buildup doesn’t matter, which greatly simplifies rate control.
The only task left for rate control: prevent congestion collapse when elephants collide.
[Figure: two hosts sending elephant flows into the same port at line rate; without rate control, 50% of packets are lost]

pFabric Rate Control
A minimal version of the TCP algorithm (sketched below):
1. Start at line rate
– Initial window larger than BDP
2. No retransmission timeout estimation
– Fixed RTO at a small multiple of the round-trip time
3. Reduce window size upon packet drops
– Window increase same as TCP (slow start, congestion avoidance, …)
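A schematic of this minimal rate control in Python; the constants and class shape are assumptions for illustration, not the authors' implementation:

```python
# Minimal-TCP rate control per the slide: start above BDP, grow the
# window like TCP, halve on drops, retransmit on a fixed RTO.

BDP_PKTS = 12            # assumed bandwidth-delay product, in packets
FIXED_RTO = 3 * 15e-6    # 2. fixed RTO: small multiple of the fabric RTT

class MinimalRateControl:
    def __init__(self):
        self.cwnd = 2.0 * BDP_PKTS        # 1. start at line rate (window > BDP)
        self.ssthresh = float("inf")

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0              # slow start
        else:
            self.cwnd += 1.0 / self.cwnd  # congestion avoidance

    def on_drop(self):
        # 3. reduce the window on packet drops; the lost packets were
        # the least urgent ones and are retransmitted after FIXED_RTO.
        self.ssthresh = max(self.cwnd / 2.0, 1.0)
        self.cwnd = self.ssthresh
```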

Why does this work?
Key invariant for ideal scheduling: at any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch.
– Priority scheduling → high-priority packets traverse the fabric as quickly as possible
– What about dropped packets? The lowest-priority packets are not needed until all other packets depart, and buffer > BDP leaves enough time (> RTT) to retransmit them

Evaluation
ns2 simulations: 144-port leaf-spine fabric (9 racks; 10Gbps edge links, 40Gbps fabric links)
– RTT = ~14.6µs (10µs at hosts)
– Buffer size = 36KB (~2×BDP), RTO = 45µs (~3×RTT)
Random flow arrivals with realistic size distributions: web search (DCTCP paper) and data mining (VL2 paper)

Overall Average FCT
[Plot: overall average flow completion time for pFabric vs. baselines and “Ideal”]
Recall: “Ideal” is REALLY idealized!
– Centralized, with a full view of flows
– No rate-control dynamics, no buffering, no packet drops, no load-balancing inefficiency

Mice FCT (<100KB)
[Plots: average and 99th-percentile FCT for small flows; almost no jitter]

Conclusion
pFabric: simple, yet near-optimal
– Decouples flow scheduling from rate control
A clean-slate approach: requires new switches and minor host changes
Incremental deployment with existing switches is promising and ongoing work

Thank You!
