The Crosspoint Queued Switch. Yossi Kanizo (Technion, Israel). Joint work with Isaac Keslassy (Technion, Israel) and David Hay (Politecnico di Torino, Italy).
Typical Switch Architectures: IQ (Input Queued); CICQ (Combined Input and Crosspoint Queued). Assumes an instantaneous closed loop. [Diagram labels: Linecards, Switch Fabric.]
Single-Rack Router: an instantaneous closed loop works in a single rack. Problem: multi-rack routers. [Diagram labels: Linecards, Switch Fabric.]
Current Router Architectures [Source: N. McKeown]: is the closed loop still instantaneous?
Time Trends [figure; axis unit: ns]
Hiding Propagation Delays. Traditional solutions: (1) increase the time-slot, which gives poor switch performance; (2) hide propagation delays using buffers, which requires an impractical amount of buffering. Proposed solution: replace the closed loop with an open loop. Question: how much performance degrades vs. the instantaneous closed loop.
Outline: CQ, an open-loop switch architecture; performance evaluation (analytical results, simulations); CQ performance degradation is not significant.
Proposed Architecture: The Crosspoint-Queued (CQ) Switch. No queues in the linecards; buffering only inside the fabric; independent output schedulers; cells are dropped when a buffer is full. [Diagram labels: Switch Core, Linecards, 10s of meters.]
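The architecture above can be sketched as a small slot-by-slot simulation. This is a minimal sketch under stated assumptions, not the authors' simulator: the function name `simulate_cq`, the parameter names (`n`, `b`, `rho`, `slots`, `seed`), the uniform Bernoulli arrivals, the arrive-then-depart ordering within a slot, and the choice of longest-queue-first as the per-output scheduler are all illustrative choices.

```python
import random


def simulate_cq(n, b, rho, slots, seed=0):
    """Simulate an n x n crosspoint-queued switch (hypothetical sketch).

    n: number of inputs and outputs; b: cells per crosspoint buffer;
    rho: per-input arrival probability per time-slot (assumed traffic model).
    """
    rng = random.Random(seed)
    # queue[i][j] = number of cells buffered at crosspoint (input i, output j)
    queue = [[0] * n for _ in range(n)]
    arrived = dropped = departed = 0
    for _ in range(slots):
        # Arrival phase: open loop, inputs send without consulting the fabric.
        for i in range(n):
            if rng.random() < rho:
                j = rng.randrange(n)  # uniform destination (assumption)
                arrived += 1
                if queue[i][j] < b:
                    queue[i][j] += 1
                else:
                    dropped += 1  # full crosspoint buffer: the cell is lost
        # Departure phase: each output looks only at its own column.
        for j in range(n):
            longest = max(range(n), key=lambda i: queue[i][j])
            if queue[longest][j] > 0:
                queue[longest][j] -= 1
                departed += 1
    queued = sum(sum(row) for row in queue)
    return {"arrived": arrived, "dropped": dropped,
            "departed": departed, "queued": queued}
```

Calling `simulate_cq(n=8, b=1, rho=0.9, slots=20000)` and again with a larger `b` gives a feel for how the drop count shrinks as the crosspoint buffers grow; the fraction of cells not dropped, `1 - dropped / arrived`, is one way to read off throughput in this sketch.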
CQ Properties. Open loop: no communication overhead. No linecard queues: no linecard queue management. “Router on a chip”: buffering and switch fabric on the same chip.
Why not 10 years ago? No need: single rack. No technology: SRAM density (Moore’s law: density doubling every 2.5 years). Aggressive 128x128 CQ switch: 4 cells of 64 bytes per crosspoint → 64 cells today. Conservative buffer requirements: TCP Stanford model with smaller buffer needs [Appenzeller, Keslassy and McKeown ’04].
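The sizing argument can be checked with a few lines of arithmetic. This is a sketch of one reading of the slide: four density doublings over 10 years (one every 2.5 years) turn 4 cells per crosspoint into 64. The function names `cq_buffer_bytes` and `density_growth` are hypothetical helpers, not part of the talk.

```python
def cq_buffer_bytes(n_ports, cells_per_crosspoint, cell_bytes=64):
    """Total on-chip buffering of an n_ports x n_ports CQ switch, in bytes."""
    return n_ports * n_ports * cells_per_crosspoint * cell_bytes


def density_growth(years, doubling_period_years=2.5):
    """Density multiplier after `years`, doubling every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


growth = density_growth(10)          # four doublings in 10 years
cells_today = 4 * growth             # 4 cells per crosspoint scaled by the growth
print(growth, cells_today)
print(cq_buffer_bytes(128, 4) / 2**20, "MiB with 4 cells per crosspoint")
print(cq_buffer_bytes(128, 64) / 2**20, "MiB with 64 cells per crosspoint")
```

Under these assumptions the 128x128 switch needs 4 MiB of SRAM at 4 cells per crosspoint and 64 MiB at 64 cells per crosspoint.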
Outline: CQ, our open-loop switch architecture; performance evaluation (analytical results, simulations).
100% Throughput as B→∞. Buffer size B, LQF scheduling algorithm. Throughput bounds: OQ(2B-1) ≤ CQ(B) ≤ OQ(NB). Both bounds tend to 100% throughput as B→∞, and so does CQ(B).
Uniform Traffic, B=1. Uniform traffic model: at each time-slot, at each of the N inputs, Bernoulli IID packet arrivals with probability ρ; each packet is destined for one of the N outputs uniformly at random. Theorem: under uniform traffic and B=1, the performance of the switch is independent of the specific work-conserving scheduling algorithm. Intuition: symmetry.
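The independence claim can be illustrated empirically by simulating a single output column with B=1 under three different work-conserving schedulers. This is a minimal sketch, not the authors' proof or simulator: the names `simulate_column`, `lowest_index`, `uniform_random`, and `round_robin` are hypothetical, `rho` stands for the per-input arrival probability, and arrivals are assumed to be processed before the departure in each slot. Under the uniform model each crosspoint of the column receives a cell with probability `rho / n` per slot, independently of the others.

```python
import random


def lowest_index(occupied, last, n, rng):
    """Fixed priority: always serve the lowest-numbered occupied crosspoint."""
    return occupied[0]


def uniform_random(occupied, last, n, rng):
    """Serve an occupied crosspoint chosen uniformly at random."""
    return rng.choice(occupied)


def round_robin(occupied, last, n, rng):
    """Serve the first occupied crosspoint cyclically after the last served one."""
    return min(occupied, key=lambda i: (i - last - 1) % n)


def simulate_column(n, rho, slots, pick, seed=0):
    """Departure rate (cells per slot) of one output column with B=1."""
    arrivals_rng = random.Random(seed)
    sched_rng = random.Random(seed + 1)
    full = [False] * n          # full[i]: crosspoint of input i holds a cell
    last = -1                   # index served most recently (for round robin)
    departed = 0
    for _ in range(slots):
        for i in range(n):
            # A cell arriving at an already full crosspoint is dropped.
            if arrivals_rng.random() < rho / n and not full[i]:
                full[i] = True
        occupied = [i for i in range(n) if full[i]]
        if occupied:            # work-conserving: serve whenever possible
            last = pick(occupied, last, n, sched_rng)
            full[last] = False
            departed += 1
    return departed / slots


rates = {pick.__name__: simulate_column(8, 0.9, 40000, pick)
         for pick in (lowest_index, uniform_random, round_robin)}
print(rates)
```

The three printed rates should agree up to sampling noise, which matches the symmetry intuition: the number of occupied crosspoints in the column evolves the same way whichever occupied crosspoint is served. Individual inputs may still be treated differently, for example under fixed priority; the sketch only measures the aggregate rate.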
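Uniform Traffic, B=1. Theorem: the throughput and waiting time of a CQ switch with B=1 are: [formulas lost in extraction], where q = 1 - ρ/N and ρ is the per-input arrival probability. Proof: based on the Z-transform. The throughput goes to 100% as N goes to infinity.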
Models for Larger Buffers. Approximate performance analysis: a model for exhaustive round-robin scheduling, based on modifications to a polling system with zero switch-over times; a model for the random scheduling algorithm. Show 100% throughput as N→∞.
Trace-Driven Simulation: 32x32 CQ switch with different buffer sizes (in units of 64-byte packets). Buffers of size 64 suffice to ensure 99% throughput for N=32.
Conclusions: CQ is open loop, which allows multi-rack configurations; CQ provides easy scheduling; CQ is feasible to implement in a single chip; CQ shows good performance in simulations.
Thank You