Slide 1: Tutorial Survey of LL-FC Methods for Datacenter Ethernet
101 Flow Control
M. Gusat
Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries
26 Sept. 2006, IBM Zurich Research Lab

Slide 2: Outline
Part I
- Requirements of datacenter link-level flow control (LL-FC)
- Brief survey of the top 3 LL-FC methods
  - PAUSE, a.k.a. On/Off grants
  - Credit
  - Rate
- Baseline performance evaluation
Part II
- Selectivity and scope of LL-FC
- Per-what? LL-FC's resolution

Slide 3: Requirements of '.3x-prime', the Next Generation of Ethernet Flow Control for Datacenters
1. Lossless operation
   - No-drop expectation of datacenter apps (storage, IPC)
   - Low latency
2. Selective
   - Discrimination granularity: link, prio/VL, VLAN, VC, flow...?
   - Scope: backpressure upstream one hop, k hops, e2e...?
3. Simple... PAUSE-compatible!!

Slide 4: Generic LL-FC System
One link with 2 adjacent buffers: TX (SRC) and RX (DST)
- The round-trip time (RTT) per link is the system's time constant
LL-FC issues:
- Link traversal (channel bandwidth allocation)
- RX buffer allocation
- Pairwise communication between the channel's terminations
- Signaling overhead (PAUSE, credit, rate commands)
- Backpressure (BP): increase / decrease injections; stop-and-restart protocol

Slide 5: FC-Basics: PAUSE (On/Off Grants)
PAUSE BP semantics: STOP / GO / STOP / ...
When the RX buffer occupancy crosses the threshold ("over-run"), the receiver sends STOP on the FC return path; when it drains back below the threshold, it sends GO.
[Figure: crossbar (Xbar) with downstream links, TX queues, an RX buffer / OQ with Stop/Go thresholds, and the PAUSE FC return path]
* Note: Selectivity and granularity of FC domains are not considered here.
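A minimal sketch of the STOP/GO mechanism on the RX side, assuming a single FIFO buffer with high/low occupancy thresholds; the class and method names are illustrative, not taken from the slides:

```python
# Minimal sketch of On/Off (PAUSE-style) flow control on the receiver side.
# Assumption: one FIFO RX buffer with a stop threshold and a go threshold.

from collections import deque

class RxBuffer:
    def __init__(self, capacity, stop_threshold, go_threshold):
        self.q = deque()
        self.capacity = capacity
        self.stop_threshold = stop_threshold  # send STOP when occupancy reaches this
        self.go_threshold = go_threshold      # send GO when occupancy drains to this
        self.stopped = False

    def enqueue(self, pkt):
        if len(self.q) >= self.capacity:
            raise RuntimeError("overflow: STOP arrived too late (not enough RTT headroom)")
        self.q.append(pkt)

    def dequeue(self):
        return self.q.popleft() if self.q else None

    def fc_signal(self):
        """Return 'STOP', 'GO', or None depending on threshold crossings."""
        occ = len(self.q)
        if not self.stopped and occ >= self.stop_threshold:
            self.stopped = True
            return "STOP"
        if self.stopped and occ <= self.go_threshold:
            self.stopped = False
            return "GO"
        return None
```

In this reading, the receiver would evaluate fc_signal() whenever occupancy changes and send any non-None result back on the FC return path.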

Slide 6: FC-Basics: Credits
[Figure: credit-based FC on the same crossbar (Xbar) setup as the PAUSE slide]
* Note: Selectivity and granularity of FC domains are not considered here.
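As a companion to the PAUSE sketch, here is a minimal sketch of credit-based FC for a single TX/RX pair, assuming one credit per RX buffer location (as slide 7 states) and a credit return delay of one RTT; the class and field names are illustrative:

```python
# Minimal sketch of credit-based link-level flow control between one TX and one RX.
# Assumptions: one credit per RX buffer location; the credit return path is
# modeled with a fixed delay of rtt slots (data propagation omitted for brevity,
# so the credit return models the full loop).

from collections import deque

class CreditLink:
    def __init__(self, rx_buffer_size, rtt):
        self.credits = rx_buffer_size       # TX-side credit counter
        self.rx_buffer = deque()
        self.in_flight_credits = deque()    # arrival slots of credits on the return path
        self.rtt = rtt
        self.now = 0

    def tx_try_send(self, pkt):
        """TX sends only if it holds a credit, so the RX buffer can never overflow."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(pkt)
        return True

    def rx_drain(self):
        """RX consumes one packet and puts its freed credit on the return path."""
        if self.rx_buffer:
            self.rx_buffer.popleft()
            self.in_flight_credits.append(self.now + self.rtt)

    def tick(self):
        """Advance one slot and deliver credits whose return delay has elapsed."""
        self.now += 1
        while self.in_flight_credits and self.in_flight_credits[0] <= self.now:
            self.in_flight_credits.popleft()
            self.credits += 1
```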

Slide 7: Correctness: Minimum Memory for "No Drop"
"Minimum" = the smallest memory that still operates losslessly => O(RTT_link)
- Credit: 1 credit = 1 memory location
- Grant: 5 (= RTT + 1) memory locations
Credits:
- Under full load the single credit is constantly looping between RX and TX
- RTT = 4 => max. performance = f(up-link utilisation) = 25%
Grants:
- Determined by slow restart: if the last packet has left the RX queue, it takes an RTT until the next packet arrives
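The 25% figure follows from the credit loop: with M credits and an RTT of RTT packet times, at most M packets can be sent per RTT, so link utilisation is capped at min(1, M/RTT). A quick sanity check (the helper name is my own):

```python
# Quick check of the credit-limited utilisation bound min(1, M / RTT),
# assuming packets and credits each take a fixed number of slots per link hop.

def credit_utilisation(num_credits, rtt):
    """Upper bound on link utilisation with num_credits credits and RTT in packet times."""
    return min(1.0, num_credits / rtt)

print(credit_utilisation(1, 4))   # 0.25 -> the 25% from the slide (1 credit, RTT = 4)
print(credit_utilisation(4, 4))   # 1.0  -> an RTT's worth of credits keeps the link busy
print(credit_utilisation(5, 4))   # 1.0  -> RTT+1 locations, as in the grant comparison
```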

Slide 8: PAUSE vs. Credit @ M = RTT + 1
"Equivalent" = 'fair' comparison:
1. Credit scheme: 5 credits = 5 memory locations
2. Grant scheme: 5 (= RTT + 1) memory locations
The performance loss for PAUSE/grants comes from the lack of underflow protection: if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart).
For performance equivalent to credits, PAUSE requires M = 9.
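The M = 9 at RTT = 4 is consistent with a 2*RTT + 1 rule of thumb: after STOP is sent, up to an RTT's worth of in-flight packets can still arrive, and after GO it takes another RTT before fresh packets show up, during which the buffer must keep feeding the downstream link. This reading is my own back-of-the-envelope interpretation, not spelled out on the slide:

```python
# Back-of-the-envelope buffer sizing for On/Off (PAUSE) FC.
# Assumption (not from the slide): overflow headroom of RTT packets after STOP
# plus underflow headroom of RTT packets to bridge the restart gap after GO.

def pause_min_buffer(rtt):
    """Buffer locations needed so PAUSE neither drops after STOP nor idles after GO."""
    overflow_headroom = rtt       # packets still in flight when STOP takes effect
    underflow_headroom = rtt      # packets needed to keep the link busy until restart
    return overflow_headroom + underflow_headroom + 1

print(pause_min_buffer(4))        # 9, matching the M = 9 quoted for RTT = 4
```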

Slide 9: FC-Basics: Rate
Setup: RX queue capacity Qi = 1 (full capacity); the maximum flow (input arrivals) during one timestep (Dt = 1) is 1/8.
Goal: update the TX probability Ti of every sending node during the interval [t, t+1) to obtain the new Ti applied during [t+1, t+2).
Algorithm for obtaining Ti(t+1) from Ti(t)...
Scenario: initially the offered rate from source0 was set to .100 and from source1 to .025; all other processing rates were .125, hence all queues show low occupancy. At timestep 20 the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times the processing capacity.
Results: the average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
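The actual Ti update rule is elided on the slide. Purely to illustrate the general shape of rate-based FC (and explicitly not the algorithm referenced above), a transmit probability is typically raised while the observed RX queue has headroom and cut back when it fills; all constants below are arbitrary assumptions:

```python
# Illustrative rate-based FC update, NOT the (elided) algorithm from the slide.
# Assumes a per-source transmit probability Ti in [0, 1] that is adapted once
# per timestep from the observed RX queue occupancy; constants are arbitrary.

def update_tx_probability(ti, queue_occupancy, threshold=0.5,
                          increase=0.01, decrease=0.75):
    """Additive increase below the queue threshold, multiplicative decrease above it."""
    if queue_occupancy < threshold:
        ti = min(1.0, ti + increase)     # ramp up while the queue has headroom
    else:
        ti = ti * decrease               # back off when congestion builds
    return ti
```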

Slide 10: Conclusion Part I: Which Scheme is "Better"?
PAUSE
+ simple
+ scalable (lower signalling overhead)
- requires 2x the memory M
Credits (absolute or incremental)
+ always lossless, independent of the RTT and memory size
+ adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...)
- not trivial for buffer sharing
- protocol reliability
- scalability
At equal M = RTT, credits show 30+% higher throughput vs. PAUSE.
* Note: the stability of both schemes was formally proven (link in the original deck).
Rate: in between PAUSE and credits
+ adopted in adapters
+ a potentially good match for BCN (e2e CM)
- complexity (a concern for cheap, fast bridges)

Slide 11: Part II: Selectivity and Scope of LL-FC: "Per-Prio/VL PAUSE"
The FC-ed 'link' could be:
- a physical channel (e.g. 802.3x)
- a virtual lane (VL, e.g. IBA's 2-16 VLs)
- a virtual channel (VC; see the larger figure)
- ...
Per-Prio/VL PAUSE is the often-proposed PAUSE v2.0... yet is it good enough for the next decade of datacenter Ethernet?
Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (Prizma CI)

Slide 12: Already Implemented in IBA (and Other ICTNs...)
IBA has 15 FC-ed VLs for QoS
- SL-to-VL mapping is performed per hop, according to capabilities
However, IBA doesn't have VOQ-selective LL-FC
- "selective" = per switch (virtual) output port
So what?
- Hogging, a.k.a. buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0)
How can we prove that hogging really occurs in IBA?
- A. Back-of-the-envelope reasoning (see the toy model below)
- B. Analytical modeling of stability and work-conservation (papers available)
- C. Comparative simulations: IBA, PCI-AS etc. (next slides)
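As a back-of-the-envelope illustration of hogging (option A), the toy model below shares one FIFO RX buffer among all outputs under non-selective link-level FC; once packets headed to the slow output reach the head of the buffer, they hold up everyone else. This is my own minimal model, not the simulator behind the slides' results:

```python
# Toy model of hogging in a shared, non-selective FC buffer: packets to a slow
# output pile up at the head of the shared RX FIFO and block traffic to the
# fast outputs as well. Purely illustrative.

import random
from collections import deque

random.seed(1)
BUF_SIZE, SLOTS = 16, 10_000
buffer = deque()                    # shared RX buffer, FIFO, no per-output VOQs
delivered = [0, 0]                  # [slow output 0, fast outputs 1..15]

for t in range(SLOTS):
    # Link-level FC: the upstream may inject only if the shared buffer has room.
    if len(buffer) < BUF_SIZE:
        buffer.append(0 if random.random() < 1 / 16 else 1)  # uniform destinations
    # Outputs drain the head of the FIFO: output 0 accepts a packet only 10% of
    # the time (degraded sink), the fast outputs always accept.
    if buffer:
        dst = buffer[0]
        if dst != 0 or random.random() < 0.1:
            buffer.popleft()
            delivered[dst] += 1

# Roughly 0.6 here, well below the ~0.94 achievable if output 0 did not block the FIFO.
print("fast-output throughput:", delivered[1] / SLOTS)
```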

Slide 13: IBA SE Hogging Scenario
Simulation: parallel backup to a RAID across an IBA switch
TX / SRC:
- 16 independent IBA sources, e.g. 16 "producer" CPUs/threads
- SRC behavior: greedy, using any communication model (UD)
- SL: BE service discipline on a single VL (the other VLs suffer from their own ...)
Fabrics (single stage):
- 16x16 IBA generic SE
- 16x16 PCI-AS switch
- 16x16 Prizma CI switch
RX / DST:
- 16 HDD "consumers"
- t0: initially each HDD sinks data at the full 1x rate (100%)
- t_sim: during the simulation, HDD[0] enters thermal recalibration or sector remapping and progressively slows down its incoming link throughput: 90, 80, ..., 10%
A configuration sketch of this scenario follows below.
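A hypothetical way to capture the scenario above as a simulation configuration; the dataclass and every field name are my own, not from the deck:

```python
# Hypothetical configuration sketch of the hogging scenario (16 greedy sources,
# a 16x16 single-stage switch, 16 HDD sinks where HDD[0] progressively degrades).

from dataclasses import dataclass, field

@dataclass
class HoggingScenario:
    num_sources: int = 16                 # greedy "producer" CPUs/threads on one VL
    switch_radix: int = 16                # 16x16 single-stage fabric
    fabric: str = "IBA_generic_SE"        # or "PCI_AS_switch", "Prizma_CI_switch"
    sink_rate: float = 1.0                # each HDD initially sinks at full 1x link rate
    degraded_sink: int = 0                # HDD[0] slows down during the run
    degradation_steps: list = field(
        default_factory=lambda: [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

scenario = HoggingScenario()
print(scenario.degradation_steps)         # 90%, 80%, ..., 10% of link throughput
```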

Slide 14: First: Friendly Bernoulli Traffic
2 sources (A, B) sending @ (12x + 4x) to 16 * 1x end nodes (C..R)
[Figure (from the IBA spec): aggregate throughput vs. link 0 throughput reduction, showing the throughput loss between achievable performance and actual IBA performance]

Slide 15: Myths and Fallacies about Hogging
Isn't IBA's static rate control sufficient? No, because it is STATIC.
Aren't IBA's VLs sufficient? No.
- VLs and ports are orthogonal dimensions of LL-FC
- 1. VLs are for SL and QoS => VLs are assigned to prios, not to ports!
- 2. Max. no. of VLs = 15 << max(SE_degree x SL) = 4K
Can SE buffer partitioning solve hogging, blocking and sat_trees, at least in single-SE systems? No.
- 1. Partitioning makes sense only with status-based FC (per bridge output port; see PCIe/AS SBFC); IBA doesn't have a native status-based FC
- 2. Sizing becomes the issue => we need dedication per input and output port: M = O(SL * max{RTT, MTU} * N^2), a very large number!
- Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large
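To see how fast M = O(SL * max{RTT, MTU} * N^2) grows, here is a quick calculation; the parameter values (16 SLs, 2 KB per-partition allocation, a 64-port SE) are my assumptions, not figures from the slide:

```python
# Rough sizing of the dedicated-partition memory M = SL * max(RTT, MTU) * N^2.
# Assumed values: 16 SLs, 2 KB per dedicated partition (MTU-dominated), 64 ports.

def dedicated_buffer_bytes(num_sls, per_partition_bytes, num_ports):
    """Memory for one dedicated partition per (SL, input port, output port)."""
    return num_sls * per_partition_bytes * num_ports ** 2

m = dedicated_buffer_bytes(num_sls=16, per_partition_bytes=2048, num_ports=64)
print(f"{m / 2**20:.0f} MiB")   # 128 MiB of on-chip buffering, clearly impractical
```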

Slide 16: Conclusion Part II: Selectivity and Scope of LL-FC
- Despite 16 VLs, IBA/DCE is exposed to the "transistor effect": any single flow can modulate the aggregate throughput of all the others
- Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop)
- Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC
Q: Is a QoS violation the price of 'non-blocking' LL-FC?
Possible granularities of LL-FC queuing domains:
- A. In single-hop fabrics, CM can also serve as LL-FC
- B. Introduce VOQ-FC as an intermediate, coarser grain: no. of VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs
Alternative: 802.1p (map prios to 8 VLs) + .1q (map VLANs to 4K VCs)? This was proposed in 802.3ar...
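The 64K figure is just the product of the two ranges at their upper ends, read off the slide; a trivial check:

```python
# Upper bound on the number of VOQ-FC virtual channels from the slide's ranges:
# up to 4096 VOQs times up to 16 VLs.

max_voq, max_vl = 4096, 16          # upper ends of the 64..4096 and 2..16 ranges
print(max_voq * max_vl)             # 65536 = 64K VCs, the slide's upper bound
```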

Slide 17: Backup

Slide 18: LL-FC Between Two Bridges
[Figure: Switch[k] TX Port[k,j] holds VOQ[1]..VOQ[n], a TX scheduler and an LL-FC reception unit, and sends packets to Switch[k+1] RX Port[k+1,i]; there, the RX buffer, the RX management unit (buffer allocation) and the LL-FC TX unit form the return path of the LL-FC token]
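A minimal sketch of the TX side of this diagram: a scheduler that serves only VOQs whose downstream FC state allows sending. It assumes per-VOQ (i.e. VOQ-selective) LL-FC status, which the diagram implies but the deck leaves abstract; all names are illustrative:

```python
# Sketch of the TX-side behaviour in the figure: a round-robin scheduler over
# VOQs, gated by the LL-FC status received on the return path.

from collections import deque

class TxPort:
    def __init__(self, num_voqs):
        self.voqs = [deque() for _ in range(num_voqs)]   # VOQ[1]..VOQ[n] in the figure
        self.fc_ok = [True] * num_voqs                   # updated by the LL-FC reception unit
        self._rr = 0                                     # round-robin pointer

    def on_fc_token(self, voq, go):
        """LL-FC reception: record the token carried on the FC return path."""
        self.fc_ok[voq] = go

    def schedule(self):
        """Serve the next VOQ that is both non-empty and not backpressured."""
        n = len(self.voqs)
        for i in range(n):
            v = (self._rr + i) % n
            if self.voqs[v] and self.fc_ok[v]:
                self._rr = (v + 1) % n
                return self.voqs[v].popleft()            # "send packet" to Switch[k+1]
        return None
```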

