1 Curbing Delays in Datacenters: Need Time to Save Time? Mohammad Alizadeh, Sachin Katti, Balaji Prabhakar. Insieme Networks / Stanford University.

2 Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency.

3 Datacenter Networks: 1000s of server ports, 10-40 Gbps links, 1-5 μs latency, carrying web, app, db, map-reduce, HPC, monitoring, and cache workloads. Message latency is king → need very high throughput and very low latency.

4 Transport in Datacenters. TCP is widely used but has poor performance: it is buffer hungry and adds significant queuing latency. Queuing latency: TCP ~1-10 ms, DCTCP ~100 μs, versus a baseline fabric latency of 1-5 μs. How do we get to ~zero queuing latency?

5 Reducing Queuing: DCTCP vs TCP. Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB. [Plot: queue length (KBytes) over time; senders S1…Sn feeding one switch port.]

6 Towards Zero Queuing. [Diagram: senders S1…Sn through a switch with ECN marking at 90% utilization.]

7 Towards Zero Queuing. ns2 simulation: 10 DCTCP flows, 10 Gbps switch, ECN at 9 Gbps (90% utilization). [Plot: sustaining the target throughput requires a queuing-latency floor of ≈23 μs; setup as above, ECN at 90%.]

8 Window-based Rate Control: C = 1, RTT = 10 → C×RTT = 10 pkts. One sender, one receiver, Cwnd = 1: throughput = 1/RTT = 10%.

9 Window-based Rate Control: C = 1, RTT = 2 → C×RTT = 2 pkts. Cwnd = 1: throughput = 1/RTT = 50%.

10 Window-based Rate Control: C = 1, RTT = 1.01 → C×RTT = 1.01 pkts. Cwnd = 1: throughput = 1/RTT = 99%.

11 Window-based Rate Control: RTT = 1.01 → C×RTT = 1.01 pkts. Two senders, each with Cwnd = 1. As propagation time → 0, queue buildup is unavoidable.
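
A minimal sketch of the arithmetic behind slides 8-11 (units as on the slides: capacity in packets per unit time, RTT in packet transmission times; the helper function is illustrative, not from the talk):

```python
# Throughput of a fixed-window sender on a link of capacity C (packets per
# unit time) with round-trip time RTT (in packet transmission times).
# Utilization = cwnd / (C * RTT), capped at 100%; any excess sits in the queue.

def window_utilization(cwnd, C, rtt):
    """Return (link utilization, steady-state queue in packets)."""
    bdp = C * rtt                       # bandwidth-delay product in packets
    utilization = min(1.0, cwnd / bdp)  # cannot exceed line rate
    queue = max(0.0, cwnd - bdp)        # window beyond the BDP queues up
    return utilization, queue

for rtt in (10, 2, 1.01):               # the three cases on slides 8-10
    util, queue = window_utilization(cwnd=1, C=1, rtt=rtt)
    print(f"RTT={rtt:5}: utilization={util:.0%}, queue={queue:.2f} pkts")

# Slide 11: two senders, each with cwnd = 1, but BDP = 1.01 pkts, so roughly
# one packet is always queued; as propagation time -> 0, even the minimum
# window of one packet per sender forces queue buildup.
util, queue = window_utilization(cwnd=2, C=1, rtt=1.01)
print(f"2 senders: utilization={util:.0%}, queue={queue:.2f} pkts")
```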

12 So What? Window-based rate control needs lag in the loop. A near-zero-latency transport must (1) use timer-based rate control / pacing and (2) use small packet sizes. Both increase CPU overhead (not practical in software); they are possible in hardware, but complex (e.g., HULL, NSDI'12). Or… change the problem!

13 Changing the Problem… FIFO-queue switch port: queue buildup is costly → need precise rate control. Priority-queue switch port: queue buildup is irrelevant → coarse rate control is OK. [Diagram: a FIFO port and a priority port, each holding packets with priorities 1, 3, 4, 5, 7, 9.]
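
A toy illustration (not from the talk; the backlog below reuses the priorities drawn on the slide, with the prio-1 packet treated as a new arrival) of why buildup stops mattering once the port serves by priority: the backlog only delays packets that are less urgent anyway.

```python
import heapq
from collections import deque

backlog = [7, 9, 4, 3, 5]   # priorities already queued at the port (lower = more urgent)
arrival = 1                 # packet from a short, urgent flow arrives last

# FIFO port: the new packet waits behind the entire backlog.
fifo = deque(backlog)
fifo.append(arrival)
fifo_wait = list(fifo).index(arrival)       # packets served before it

# Priority port: the new packet is served first regardless of backlog size.
prio = list(backlog)
heapq.heapify(prio)
heapq.heappush(prio, arrival)
prio_wait = sorted(prio).index(arrival)     # popping a min-heap yields sorted order

print(f"FIFO: {fifo_wait} packets ahead; priority queue: {prio_wait} packets ahead")
```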

14 pFabric

15 DC Fabric: Just a Giant Switch. [Diagram: hosts H1–H9 interconnected by the fabric.]

16 DC Fabric: Just a Giant Switch. [Diagram: the fabric redrawn as a single switch with TX ports H1–H9 on one side and RX ports H1–H9 on the other.]

17 DC Fabric: Just a Giant Switch. [Diagram: the same giant-switch view, TX and RX ports H1–H9.]

18 DC transport = flow scheduling on a giant switch, with ingress and egress capacity constraints. Objective → minimize average FCT (flow completion time). [Diagram: TX and RX ports H1–H9.]

19 “Ideal” Flow Scheduling. The problem is NP-hard [Bar-Noy et al.]; a simple greedy algorithm gives a 2-approximation. [Diagram: flows 1, 2, 3 matched across ingress and egress ports.]
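
The slide only names the greedy 2-approximation; the sketch below is one plausible reading (assumed, not taken from the paper): in each round, scan active flows shortest-remaining-first and schedule a flow whenever its ingress and egress ports are both still free, which respects the giant switch's capacity constraints.

```python
def greedy_schedule(flows):
    """flows: list of (src_port, dst_port, size). Returns average FCT in rounds."""
    remaining = [size for _, _, size in flows]
    completion = [None] * len(flows)
    t = 0
    while any(r > 0 for r in remaining):
        t += 1
        busy_src, busy_dst = set(), set()
        # consider active flows in increasing order of remaining size
        active = sorted((i for i, r in enumerate(remaining) if r > 0),
                        key=lambda i: remaining[i])
        for i in active:
            src, dst, _ = flows[i]
            if src not in busy_src and dst not in busy_dst:
                busy_src.add(src)
                busy_dst.add(dst)
                remaining[i] -= 1            # transmit one unit this round
                if remaining[i] == 0:
                    completion[i] = t
    return sum(completion) / len(completion)

# Example: two flows share ingress H1, two share egress H4.
print(greedy_schedule([("H1", "H4", 1), ("H1", "H5", 2), ("H2", "H4", 3)]))  # ~2.67
```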

20 pFabric in 1 Slide. Packets carry a single priority number, e.g., prio = remaining flow size. pFabric switches: very small buffers (~10-20 pkts for a 10 Gbps fabric); send the highest-priority packet, drop the lowest-priority packets. pFabric hosts: send/retransmit aggressively; minimal rate control, just enough to prevent congestion collapse.

21 Key Idea: Decouple flow scheduling from rate control. Switches implement flow scheduling via local mechanisms; hosts use simple window-based rate control (≈TCP) to avoid high packet loss. Queue buildup does not hurt performance → window-based rate control is OK. [Diagram: hosts H1–H9 connected through the fabric.]

22 pFabric Switch. Each port keeps a small “bag” of packets, with prio = remaining flow size. Priority scheduling: send the highest-priority packet first. Priority dropping: drop the lowest-priority packets first. [Diagram: a switch port holding packets with priorities such as 1, 3, 4, 7, 9; hosts H1–H9.]
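
A rough sketch of the per-port behavior just described (class and method names are mine; the bag is kept as a plain unsorted list to mirror the small buffer plus min/max search that the next slide sizes):

```python
class PFabricPort:
    """Toy per-port packet bag with priority scheduling and priority dropping.
    Lower prio value = more urgent (prio = remaining flow size)."""

    def __init__(self, capacity=20):            # ~10-20 packets per port
        self.capacity = capacity
        self.bag = []                            # unsorted list of (prio, packet)

    def enqueue(self, prio, packet):
        if len(self.bag) < self.capacity:
            self.bag.append((prio, packet))
            return True
        # Priority dropping: evict the lowest-priority packet if the arriving
        # one is more urgent; otherwise drop the arriving packet.
        worst = max(range(len(self.bag)), key=lambda i: self.bag[i][0])
        if prio < self.bag[worst][0]:
            self.bag[worst] = (prio, packet)
            return True
        return False                             # arriving packet dropped

    def dequeue(self):
        if not self.bag:
            return None
        # Priority scheduling: send the highest-priority packet first.
        best = min(range(len(self.bag)), key=lambda i: self.bag[i][0])
        return self.bag.pop(best)[1]

port = PFabricPort(capacity=3)
for prio, pkt in [(7, "a"), (9, "b"), (4, "c"), (3, "d")]:  # bag is full on the 4th arrival
    port.enqueue(prio, pkt)
print(port.dequeue())   # "d": prio 3 is the most urgent packet left in the bag
```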

23 pFabric Switch Complexity. Buffers are very small (~2×BDP per port): e.g., C = 10 Gbps, RTT = 15 μs → buffer ≈ 30 KB, whereas today's switch buffers are 10-30x larger. Priority scheduling/dropping, worst case: minimum-size packets (64 B) leave 51.2 ns to find the min/max of ~600 numbers; a binary comparator tree does it in 10 clock cycles, and current ASIC clocks are ~1 ns.
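
A quick back-of-the-envelope check of this slide's numbers, assuming minimum-size 64 B packets and the stated C and RTT (2×BDP works out to ≈37.5 KB, the same ballpark as the ~30 KB quoted above):

```python
import math

C = 10e9                              # link speed: 10 Gbps
RTT = 15e-6                           # round-trip time: 15 us
bdp_bytes = C * RTT / 8               # bandwidth-delay product: 18,750 B
buffer_bytes = 2 * bdp_bytes          # ~2 x BDP per port

pkt = 64                              # worst case: minimum-size packets
budget_ns = pkt * 8 / C * 1e9         # 51.2 ns between back-to-back arrivals

entries = int(buffer_bytes // pkt)    # ~585 packets to search (slide says ~600)
tree_depth = math.ceil(math.log2(entries))   # binary comparator tree levels

print(f"BDP = {bdp_bytes/1e3:.2f} KB, buffer = {buffer_bytes/1e3:.1f} KB")
print(f"per-packet budget = {budget_ns:.1f} ns, "
      f"comparator tree depth = {tree_depth} cycles at ~1 ns/clock")
```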

24 Why does this work? Invariant for ideal scheduling: at any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch. Priority scheduling → high-priority packets traverse the fabric as quickly as possible. What about dropped packets? They are the lowest priority, so they are not needed until all other packets depart, and buffer > BDP leaves enough time (> RTT) to retransmit them.

25 Evaluation (144-port fabric; search traffic pattern). Recall that “Ideal” is really idealized: it is centralized with a full view of flows, and has no rate-control dynamics, no buffering, no packet drops, and no load-balancing inefficiency.

26 Mice FCT (<100 KB). [Plots: average and 99th-percentile FCT.]

27 Conclusion. Window-based rate control does not work at near-zero round-trip latency. pFabric is simple yet near-optimal: it decouples flow scheduling from rate control and allows the use of coarse window-based rate control. pFabric is within 10-15% of “ideal” for realistic DC workloads (SIGCOMM'13).

28 Thank You!
