The Need for an Improved PAUSE
Mitch Gusat and Cyriel Minkenberg, IBM Zurich Research Lab GmbH
IEEE 802, Dallas, Nov. 2006

Outline
I) Overcoming PAUSE-Induced Deadlocks
- PAUSE exposed to circular dependencies
- Two deadlock-free PAUSE solutions
II) PAUSE Interaction with Congestion Management
III) Conclusions

PAUSE Issues
PAUSE-related issues interfere with BCN simulations.
Correctness
- Deadlocks
  - cycles in the routing graph (if multipath adaptivity is enabled); multiple solutions exist
  - circular dependencies (in bidirectional fabrics); BCN can't help here => solutions required
Performance (to be elaborated in a future report)
- Low-order HOL-blocking and memory hogging
  - non-selective PAUSE causes hogging, i.e., monopolization of common resources; e.g., a shared memory may be monopolized by frames for a congested port
  - consequences: at best, reduced throughput; at worst, unfairness, starvation, saturation trees, collapse
  - properly tuned, BCN can address this problem

A 3-Level Bidirectional Fat Tree, Unfolded
Using shared-memory switches with global PAUSE in a bidirectional fat-tree network can cause deadlock.
- Circular dependencies (CD) != loops in the routing graph (STD)
- Deadlocks were observed in BCN simulations
[Figure: the fat tree unfolded around its root, which serves as the 'hinge'.]

PAUSE-Caused Deadlocks in BCN Simulations
[Figure: simulation results for a 16-node, 5-stage fabric under Bernoulli traffic, comparing four configurations: shared memory (SM) without BCN, SM with BCN, partitioned memory with BCN, and partitioned memory without BCN.]

The Mechanism of PAUSE-Induced CD Deadlocks
When incorrectly implemented, PAUSE-based flow control can cause hogging and deadlocks.
PAUSE deadlock in shared-memory switches:
- Switches A and B are both full (within the granularity of an MTU or jumbo frame) => PAUSE thresholds exceeded
- All traffic from A is destined to B, and vice versa
- Neither can send; each waits on the other indefinitely: deadlock
- Note: traffic from A never takes the path from B back to A (and vice versa), due to shortest-path routing
The circular wait is sketched in the toy model below.
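
A minimal sketch of the circular wait, assuming a toy model of two shared-memory switches (not the authors' simulator; the class and field names are illustrative):

```python
# Toy model: two full shared-memory switches that PAUSE each other.
class Switch:
    def __init__(self, name, capacity):
        self.name, self.capacity = name, capacity
        self.buffer = []               # frames, each tagged with its next hop
        self.paused_by_peer = False    # PAUSE asserted by the downstream peer

    def full(self):
        return len(self.buffer) >= self.capacity   # PAUSE threshold exceeded

    def try_forward(self, peer):
        # A frame moves only if this switch is not PAUSEd and the peer has room.
        if self.buffer and not self.paused_by_peer and not peer.full():
            peer.buffer.append(self.buffer.pop(0))
            return True
        return False

a, b = Switch("A", 4), Switch("B", 4)
a.buffer = ["to-B"] * 4                # A is full, all frames headed to B
b.buffer = ["to-A"] * 4                # B is full, all frames headed to A
a.paused_by_peer = b.full()            # B asserts PAUSE toward A
b.paused_by_peer = a.full()            # A asserts PAUSE toward B

progress = a.try_forward(b) or b.try_forward(a)
print("progress:", progress)           # False -> circular wait, i.e., deadlock
```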

Two Solutions to Defeat the Deadlock
I. Architectural: assert PAUSE on a per-input basis
- No input is allowed to consume more than 1/N-th of the shared memory
- All traffic in B's input buffer for A is guaranteed to be destined to a port other than the one leading back to A (and vice versa)
- Hence, the circular dependency has been broken! Confirmed by simulations
- Assert PAUSE on input i: occ_mem >= T_h or occ[i] >= T_h/N
- Deassert PAUSE on input i: occ_mem < T_h and occ[i] < T_l/N
- Q_eq = M / (2N)
(A sketch of these rules follows after this slide.)
II. LL-FC: a bypass queue, PAUSEd separately
- Achieves a similar result as (I), plus it is:
  - independent of switch architecture (and implementation)
  - required for IPC traffic (LD/ST, request/reply)
  - compatible with PCIe (device-driver compatibility)
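
A minimal sketch of the per-input PAUSE rules of solution I, assuming a shared memory of M frames and N inputs; the threshold values and occupancies below are illustrative, only the comparisons come from the slide:

```python
# Per-input PAUSE with hysteresis: assert at T_h, deassert below T_l.
def update_pause(occ_mem, occ, paused, T_h, T_l, N):
    """occ_mem: total shared-memory occupancy; occ[i]: occupancy of input i;
    paused[i]: current PAUSE state. Returns the new per-input PAUSE states."""
    new_paused = list(paused)
    for i in range(N):
        if occ_mem >= T_h or occ[i] >= T_h / N:
            new_paused[i] = True         # assert PAUSE on input i
        elif occ_mem < T_h and occ[i] < T_l / N:
            new_paused[i] = False        # deassert PAUSE on input i
        # otherwise keep the current state (hysteresis between T_l and T_h)
    return new_paused

M, N = 1024, 8                           # illustrative memory size and inputs
T_h, T_l = 0.9 * M, 0.7 * M              # illustrative thresholds
Q_eq = M / (2 * N)                       # equilibrium share per input (slide)
occ = [200, 10, 10, 10, 10, 10, 10, 10]  # input 0 tries to hog the memory
print(update_pause(sum(occ), occ, [False] * N, T_h, T_l, N))
# -> [True, False, False, ...]: only the hogging input is PAUSEd
```

Because no input may exceed roughly its 1/N-th share, frames headed back toward the paused neighbor can no longer fill the whole memory, which is what breaks the circular dependency.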

Simulation of BCN with Deadlock-Free PAUSE
Observations
- Q_eq should be set so as to partition the shared memory
  - setting it higher promotes hogging
  - setting it lower wastes memory space
- BCN works best with large buffers per port
  - the buffer size per port should be significantly larger than the mean burst size (256 frames per port here)

PAUSE Interaction with Congestion Management
What is the effect of deadlock-free PAUSE on BCN?
- Memory partitioning 'stiffens' the feedback loop: PAUSE triggers the backpressure tree earlier
- The speed at which backpressure rolls back upstream depends not only on the available memory, but also on the switch service discipline
Next: a static analysis of PAUSE-BCN interference as a function of the switch service discipline.
(Note: the original slides use animation to visualize the analytical iterations.)

Simple Analytical Method
Method used in this presentation:
- explicit assumptions
- simple traffic scenario
- reduced MIN topology, with static/deterministic (fixed) routing
This 'model' considers:
- queuing in the Ethernet Channel Adapter (ECA) and switch element (SE)
- scheduling in the ECA and SE
- Ethernet's per-priority PAUSE-based LL-FC (aka backpressure, BP)
- reactive CM a la BCN
Linearization around steady state => tractable static analysis:
- salient transients will be mentioned, but not computed
- we compute the cumulative effects of scheduling, per-priority LL-FC backpressure (only one priority used here), and CM source throttling (rate adjustment)
- we do not compute formulas for the blocking probability per stage and SE, the variance of the service-time distribution, or Lyapunov stability

Model and Traffic Assumptions
Traffic = Σ (background + hot)
"A total of 50% of link rate is attempted from 9 queues (8 background + 1 hot) from each ECA."
Background traffic:
- 8 queues/ECA on the left; each of the 8 queues is connected to one of the 8 ECAs on the right => 64 flows (8 queues/ECA x 8 ECAs) on the left, each injecting packets
- "80% of these [total link rate] are background, that is 80% x 50% = 40% of link rate" => background traffic intensity λ = 0.4, uniformly distributed in space
Hot traffic:
- "20% of these are hot, so hot traffic is 20% x 50% = 10% of link rate"
(The load arithmetic is worked out below.)
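
A worked summary of the offered load, under the assumption (consistent with the congestion factor cf = 1.2 on the next slide) that the hot queues of all 8 ECAs target the same destination port:

```latex
% Offered load per ECA and the resulting hotspot load.
\begin{align*}
  \lambda_{\mathrm{bgnd}} &= 0.8 \times 0.5 = 0.4 \quad\text{(per link, uniform)}\\
  \lambda_{\mathrm{hot}}  &= 0.2 \times 0.5 = 0.1 \quad\text{(per ECA)}\\
  \text{hotspot load}     &= \lambda_{\mathrm{bgnd}} + 8\,\lambda_{\mathrm{hot}}
                           = 0.4 + 0.8 = 1.2
\end{align*}
```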

120% Link Load => 20% Overload. What Happens Next?
Hotspot arrival intensity: λ_bgnd + λ_hot = 0.4 + 0.8 = 1.2 > 1 => overload, [mild] congestion factor cf = 1.2 at SE(L2,S3)... what next?
BP and CM will react:
- if SE(L2,S3) is work-conserving, the 0.2 overload must be losslessly squelched by CM and BP
- the exact sequence depends on the actual traffic, the SE architecture, and the threshold settings; irrelevant for the static analysis, albeit important in operation
Separation of concerns -> study the independent effects of BP (first) and CM (second)
- iff the system is linear in steady state, superposition allows the effects to be composed
[Figure: the reduced MIN topology, stages S1-S3, levels L1-L4, with the CM and BP control loops marked; cf = 1.2 at SE(L2,S3).]

Link-Level FC Will Backpressure: Whom? How Much? Who First?
It depends on the SE's service discipline. The most well-understood and widely used disciplines:
1. Round-robin (RR); versions: strict (non-work-conserving) and work-conserving (skip invalid queues)
2. FIFO, aka FCFS, aka EDF (timestamps, aging)
3. Fair queuing: WRR, WFQ
A future 802.3x should standardize only the LL-FC, not its 'fairness'.
[Figure: as bgnd + hot = 1.2 > 1 arrives at the hotspot, buffers fill up; which upstream queue stops first (Stop1? Stop2?) depends on the discipline.]
(A toy round-robin scheduler follows below.)
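
To make the discipline dependence concrete, a toy work-conserving round-robin scheduler (illustrative; the queue names and contents are made up):

```python
# Work-conserving RR: serve one frame per slot, skipping empty (invalid) queues.
from collections import deque

def rr_schedule(queues, n_slots):
    """Return the service order over n_slots, cycling round-robin over queues."""
    order, i = [], 0
    for _ in range(n_slots):
        for _ in range(len(queues)):       # look for the next non-empty queue
            q = queues[i % len(queues)]
            i += 1
            if q:
                order.append(q.popleft())  # serve the head-of-line frame
                break
        else:
            break                          # all queues empty: nothing to serve
    return order

queues = [deque(["a1", "a2", "a3"]), deque(), deque(["c1"])]
print(rr_schedule(queues, 4))              # ['a1', 'c1', 'a2', 'a3']
```

A strict (non-work-conserving) RR would idle on the empty queue's slot instead of skipping it; as slide 15 notes, that is what makes its backpressure drastic and focused.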

EDF-Based BP: FCFS-Type Fairness (a Subset of Max-Min)
New EDF-fair TX rates are backpropagated:
- λ' = (1 - θ)·λ, with θ = 1 - μ_j / (Σ λ_ij)
- incremental upstream traversal rooted at SE(L2,S3); see the worked numbers below
Hint: subtract the background traffic λ = 0.4 from the EDF-fair rates and compare with the previous hot rates.
Obs.: under moderate-to-severe congestion, θ -> 1 => λ' -> 0: blocking spreads across all ingress branches => neither parking-lot 'unfairness' nor flow decoupling is possible (a wide-canopy saturation tree).
All flows sharing resources along the hot paths are backpressured proportionally to their respective contribution (not their traffic class). No flow isolation.
[Figure: the topology with BP propagating upstream from SE(L2,S3).]
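
The throttling factor worked out with the scenario's numbers, assuming a normalized service rate μ_j = 1 at the hot output:

```latex
% EDF-fair throttling at SE(L2,S3), with \mu_j = 1 and \sum_i \lambda_{ij} = 1.2.
\begin{align*}
  \theta   &= 1 - \frac{\mu_j}{\sum_i \lambda_{ij}}
            = 1 - \frac{1}{1.2} \approx 0.167,\\
  \lambda' &= (1-\theta)\,\lambda \approx 0.833\,\lambda
  \;\Rightarrow\;
  0.833 \times (0.4 + 0.8) = 1.0
  \quad\text{(the hot link is exactly filled).}
\end{align*}
```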

RR-Based BP: Proportional Fairness, Selective and Drastic
New RR-fair TX rates are iteratively computed and backpropagated:
1. identify the inputs exceeding the RR quota, as members of N' ≤ N
2. distribute the overload δ across N': δ_ij' = N·λ_ij − μ_j / (N·N'), with δ_ij' ≤ δ for work-conserving service
3. recompute the new admissible arrival rates λ_ij' = λ_ij − δ_ij' incrementally, by upstream traversal rooted at SE(L2,S3); a sketch follows below
3'. under strict RR, δ_ij' ≤ δ no longer holds => the BP effects are drastic and focused!
Hint: subtract the background traffic λ = 0.4 from the RR-fair rates and compare with the previous hot rates.
Obs. 1: only the selected branch is backpressured (discrimination) => RR-BP blocking always discriminates between ingress branches.
Obs. 2: under severe congestion and/or many hops, the selected branches will be swiftly choked down ('bonsai', i.e., narrow saturation trees).
[Figure: the topology with the RR-fair BP rates annotated along the selected branches.]
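
A sketch of the per-input RR-fair shares at one hot output. Note that this computes the classic equal-share water-filling that work-conserving RR approximates, not the slide's exact δ-iteration; the offered rates below are illustrative:

```python
# Equal-share water-filling: inputs below the RR quota keep their rate,
# the remaining capacity is split equally among the still-overloaded inputs.
def rr_fair(rates, mu):
    """rates: offered rate per input; mu: output service rate.
    Returns the admissible rate per input under equal-share RR."""
    admissible = dict(enumerate(rates))
    capacity, active = mu, set(admissible)
    while active:
        quota = capacity / len(active)            # current per-input quota
        under = {i for i in active if admissible[i] <= quota}
        if not under:                             # everyone exceeds the quota:
            for i in active:
                admissible[i] = quota             # clip them all equally
            break
        for i in under:                           # underloaded inputs keep
            capacity -= admissible[i]             # their full offered rate
        active -= under
    return [admissible[i] for i in sorted(admissible)]

# Two upstream branches, each offering 0.6 toward the hot output (mu = 1.0):
print(rr_fair([0.6, 0.6], 1.0))                   # [0.5, 0.5]
# Asymmetric case: the light branch is untouched, only the heavy one is cut,
# illustrating Obs. 1 (RR-BP discriminates between ingress branches).
print(rr_fair([0.3, 0.9], 1.0))                   # [0.3, 0.7]
```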

20% Overload: Reaction According to CM
What is the effect of CM only, with no LL-FC BP? Congestion factor cf = 1.2:
1. Marking by SE(L2,S3)
- done at flow resolution (queue connection here)
- based on SE queue occupancy and a set of thresholds (a single one here)
- if fair, with sampling probability p = 1%, BCN marking is pro-rated 33% (bgnd) + 67% (hot); see the arithmetic below
2. ECA sources adapt their injection rate
- per e2e flow
Desired result: convergence to proportionally fair stable rates, with λ_bgnd + λ_CM_hot on the order of the link rate; achievable by fair marking by CPID, proper tuning of the BCN parameters, and enhancements to self-increase (see the recent Stanford U. proposal).
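
The pro-rata marking shares follow directly from the offered loads at the hotspot:

```latex
% Fair marking is proportional to each class's contribution to cf = 1.2.
\begin{align*}
  \text{bgnd share} &= \frac{0.4}{1.2} \approx 33\%, &
  \text{hot share}  &= \frac{0.8}{1.2} \approx 67\%.
\end{align*}
```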

20% Overload: Reaction According to LL-FC
Strictly dependent on the service discipline. 802 shouldn't mandate scheduling to switch vendors, because:
- Round-robin (RR: strict or work-conserving)
  - strong/proportional fairness
  - decouples flows
  - simple and scalable
  - globally unfair (parking-lot problem)
- FIFO/EDF (timestamps)
  - temporally and globally fair: first-come-first-served
  - locally unfair => flow coupling (can't isolate across partitions and clients)
  - complex to scale
The discipline will impact the speed, strength, and locality (fairness) of the backpressure underlying CM, hence different behaviors of the CM loop.

Observations
PAUSE-induced deadlocks must be solved
- two solutions were proposed
PAUSE + BCN: two intercoupled feedback loops
- BP/LL-FC modulates CM's convergence: the +/- phase and amplitude depend on topology, RTTs, traffic, and the SE
Switch service disciplines impact (via PAUSE) BCN's stability margin and transient response
- switches with RR service may require higher gains for w and G_d, or a higher P_s, than switches using EDF... how to signal this?
CM should trigger earlier than BP => the two mechanisms, albeit 'independent', should be codesigned and co-tuned
- the choice of thresholds depends on the link and e2e RTTs

Instead of a Conclusion: Improved PAUSE
10GigE is a discontinuity in the Ethernet evolution
- an opportunity to address new needs and markets
- however, improvements are needed
Requirements for a next-generation PAUSE:
1. Correct by design, not by implementation
   - deadlock-free
   - no HOL_1-blocking and, possibly, reduced HOL_2-blocking (note: do not try to address high-order HOL-blocking at the link layer)
2. Configurable for both lossy and lossless operation
3. QoS / 802.1p support
4. Enables virtualization / 802.1q
5. Beneficial or neutral to CM schemes (BCN, TCP, ...)
6. Legacy PAUSE-compatible
7. Simple to understand and implement by designers
   - minimal number of flow-control domains: h/w queues and IDs in the Ethernet frame
8. Compelling to use => always enabled!