High Speed Stable Packet Switches Shivendra S. Panwar Joint work with: Yihan Li, Yanming Shen and H. Jonathan Chao New York State Center for Advanced Technology.

Slides:



Advertisements
Similar presentations
1 Maintaining Packet Order in Two-Stage Switches Isaac Keslassy, Nick McKeown Stanford University.
Advertisements

1 Outline  Why Maximal and not Maximum  Definition and properties of Maximal Match  Parallel Iterative Matching (PIM)  iSLIP  Wavefront Arbiter (WFA)
Some Unsolved Problems in High Speed Packet Swtiching
DYNAMIC POWER ALLOCATION AND ROUTING FOR TIME-VARYING WIRELESS NETWORKS Michael J. Neely, Eytan Modiano and Charles E.Rohrs Presented by Ruogu Li Department.
Submitters: Erez Rokah Erez Goldshide Supervisor: Yossi Kanizo.
Nick McKeown CS244 Lecture 6 Packet Switches. What you said The very premise of the paper was a bit of an eye- opener for me, for previously I had never.
Frame-Aggregated Concurrent Matching Switch Bill Lin (University of California, San Diego) Isaac Keslassy (Technion, Israel)
Towards Simple, High-performance Input-Queued Switch Schedulers Devavrat Shah Stanford University Berkeley, Dec 5 Joint work with Paolo Giaccone and Balaji.
Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown Stanford University The Load-Balanced Router.
A Scalable Switch for Service Guarantees Bill Lin (University of California, San Diego) Isaac Keslassy (Technion, Israel)
Algorithm Orals Algorithm Qualifying Examination Orals Achieving 100% Throughput in IQ/CIOQ Switches using Maximum Size and Maximal Matching Algorithms.
Making Parallel Packet Switches Practical Sundar Iyer, Nick McKeown Departments of Electrical Engineering & Computer Science,
Fast Matching Algorithms for Repetitive Optimization Sanjay Shakkottai, UT Austin Joint work with Supratim Deb (Bell Labs) and Devavrat Shah (MIT)
1 Input Queued Switches: Cell Switching vs. Packet Switching Abtin Keshavarzian Joint work with Yashar Ganjali, Devavrat Shah Stanford University.
April 10, HOL Blocking analysis based on: Broadband Integrated Networks by Mischa Schwartz.
1 Comnet 2006 Communication Networks Recitation 5 Input Queuing Scheduling & Combined Switches.
The Concurrent Matching Switch Architecture Bill Lin (University of California, San Diego) Isaac Keslassy (Technion, Israel)
Scaling Internet Routers Using Optics Producing a 100TB/s Router Ashley Green and Brad Rosen February 16, 2004.
1 Architectural Results in the Optical Router Project Da Chuang, Isaac Keslassy, Nick McKeown High Performance Networking Group
Using Load-Balancing To Build High-Performance Routers Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown Stanford University.
1 ENTS689L: Packet Processing and Switching Buffer-less Switch Fabric Architectures Buffer-less Switch Fabric Architectures Vahid Tabatabaee Fall 2006.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion MSM.
CSIT560 by M. Hamdi 1 Course Exam: Review April 18/19 (in-Class)
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion The.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion Scaling.
The Crosspoint Queued Switch Yossi Kanizo (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) and David Hay (Politecnico di Torino, Italy)
1 Internet Routers Stochastics Network Seminar February 22 nd 2002 Nick McKeown Professor of Electrical Engineering and Computer Science, Stanford University.
1 EE384Y: Packet Switch Architectures Part II Load-balanced Switches Nick McKeown Professor of Electrical Engineering and Computer Science, Stanford University.
Maximum Size Matchings & Input Queued Switches Sundar Iyer, Nick McKeown High Performance Networking Group, Stanford University,
COMP680E by M. Hamdi 1 Course Exam: Review April 17 (in-Class)
1 Achieving 100% throughput Where we are in the course… 1. Switch model 2. Uniform traffic  Technique: Uniform schedule (easy) 3. Non-uniform traffic,
1 Netcomm 2005 Communication Networks Recitation 5.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion Maximal.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion Scheduling.
Distributed Scheduling Algorithms for Switching Systems Shunyuan Ye, Yanming Shen, Shivendra Panwar
Load Balancing and Switch Scheduling Xiangheng Liu EE 384Y Final Presentation June 4, 2003.
Pipelined Two Step Iterative Matching Algorithms for CIOQ Crossbar Switches Deng Pan and Yuanyuan Yang State University of New York, Stony Brook.
Localized Asynchronous Packet Scheduling for Buffered Crossbar Switches Deng Pan and Yuanyuan Yang State University of New York Stony Brook.
Load Balanced Birkhoff-von Neumann Switches
Nick McKeown CS244 Lecture 7 Valiant Load Balancing.
Belgrade University Aleksandra Smiljanić: High-Capacity Switching Switches with Input Buffers (Cisco)
A Cooperative MAC Protocol for Wireless LAN Pei Liu, Zhifeng Tao, Shivendra S. Panwar Motivation: In the legacy system, source station transmits.
Enabling Class of Service for CIOQ Switches with Maximal Weighted Algorithms Thursday, October 08, 2015 Feng Wang Siu Hong Yuen.
Summary of switching theory Balaji Prabhakar Stanford University.
EE384y EE384Y: Packet Switch Architectures Part II Scaling Crossbar Switches Nick McKeown Professor of Electrical Engineering and Computer Science,
Routers. These high-end, carrier-grade 7600 models process up to 30 million packets per second (pps).
ISLIP Switch Scheduler Ali Mohammad Zareh Bidoki April 2002.
Packet Forwarding. A router has several input/output lines. From an input line, it receives a packet. It will check the header of the packet to determine.
Abtin Keshavarzian Yashar Ganjali Department of Electrical Engineering Stanford University June 5, 2002 Cell Switching vs. Packet Switching EE384Y: Packet.
1 Performance Guarantees for Internet Routers ISL Affiliates Meeting April 4 th 2002 Nick McKeown Professor of Electrical Engineering and Computer Science,
Stress Resistant Scheduling Algorithms for CIOQ Switches Prashanth Pappu Applied Research Laboratory Washington University in St Louis “Stress Resistant.
Jon Turner Resilient Cell Resequencing in Terabit Routers.
Belgrade University Aleksandra Smiljanić: High-Capacity Switching Switches with Input Buffers (Cisco)
Buffered Crossbars With Performance Guarantees Shang-Tse (Da) Chuang Cisco Systems EE384Y Thursday, April 27, 2006.
SNRC Meeting June 7 th, Crossbar Switch Scheduling Nick McKeown Professor of Electrical Engineering and Computer Science, Stanford University
Improving Matching algorithms for IQ switches Abhishek Das John J Kim.
Topics in Internet Research: Project Scope Mehreen Alam
1 Buffering Strategies in ATM Switches Carey Williamson Department of Computer Science University of Calgary.
Reduced Rate Switching in Optical Routers using Prediction Ritesh K. Madan, Yang Jiao EE384Y Course Project.
A Load Balanced Switch with an Arbitrary Number of Linecards I.Keslassy, S.T.Chuang, N.McKeown ( CSL, Stanford University ) Some slides adapted from authors.
Input buffered switches (1)
scheduling for local-area networks”
EE384Y: Packet Switch Architectures Scaling Crossbar Switches
Balaji Prabhakar Departments of EE and CS Stanford University
Packet Forwarding.
Stability Analysis of MNCM Class of Algorithms and two more problems !
EE 122: Lecture 7 Ion Stoica September 18, 2001.
Balaji Prabhakar Departments of EE and CS Stanford University
Long Gong, Paul Tune, Liang Liu, Sen, Jun (Jim) Xu
EE384Y: Packet Switch Architectures II
Presentation transcript:

High Speed Stable Packet Switches Shivendra S. Panwar Joint work with: Yihan Li, Yanming Shen and H. Jonathan Chao New York State Center for Advanced Technology in Telecommunications (CATT) Electrical and Computer Engineering Dept. Polytechnic University, New York

2 Overview Switching technology continues to be one of the bottlenecks in the development of broadband networks Fixed-length switching technology achieves high switching efficiency for high-speed packet switches Virtual Output Queueing (VOQ) switches can achieve 100% throughput without speedup Two approaches to resolve the output contention in a VOQ switch  Matching algorithms  Load-balanced switch Switch performance  Throughput  Packet delay

3 Buffering in a Packet Switch High speed fixed-length packet switching  Input Queuing (IQ) Easy to implement HOL Blocking, throughput 58.6%  Output Queuing (OQ) 100% throughput Internal speedup of N  Virtual Output Queuing (VOQ) Overcome HOL blocking No speedup requirement

4 Matching Algorithms Scheduling in a VOQ switch Stable matching schemes for VOQ switching  Maximum Weight Matching (MWM)  Maximal Matching, iSLIP  Other algorithms with 100% throughput and no speedup Polling system based matching  Exhaustive Service Matching with Hamiltonian Walk  Limited Service Matching  Average delay analysis

5 Scheduling in a VOQ Switch Scheduling is needed to avoid output contention. A scheduling problem can be modeled as a matching problem in a bipartite graph.  An input and an output are connected by an edge if the corresponding VOQ is not empty.  Each edge may have a weight, which can be The length of the VOQ The age of the HOL cell

6 Maximum Weight Matching (MWM) MWM always finds a match with the maximum weight Complexity of O(N 3 ). Is stable with 100% throughput under all admissible traffic Weight of the match: 25

7 Maximal Matching  Add connections incrementally, without removing connections made earlier.  No more matches can be made trivially by the end of the operation.  Stable with a speedup of 2  Complexity O(NlogN) Multiple Iterative Matching  Use multiple iterations to converge to a maximal matching  iSLIP and DRRM complexity of each iteration is O(logN) O(logN) iterations are needed to converge on a maximal matching Weight of the match: 23

8 iSLIP Step 1: Request  Each input sends a request to every output for which it has a queued cell. Step 2: Grant  If an output receives multiple requests it chooses the one that appears next in a fixed round-robin schedule.  The output arbiter pointer is incremented by one location beyond the granted input if, and only if, the grant is accepted in step 3. Step 3: Accept  If an input receives multiple grants, it accepts the one that appears next in a fixed round-robin schedule.  The input arbiter pointer is incremented by one location beyond the accepted output. Input Output Request Grant Accept

9 Achieving 100% Throughput without Speedup Algorithms with memory  [Tassiulas] Compare the latest schedule to a randomly generated match, select the one with higher weight as the new match, complexity O(logN).  [Giaccone et al] Derandomized algorithm, using Hamiltonian walk, complexity O(logN). Other algorithms, with higher complexity, take into account the latest schedule, its neighbors, and the arrival pattern.  SERENA, complexity O(N). Polling system based matching algorithms  Low complexity: HE-iSLIP, O(logN).  Low packet delay Much lower than other O(logN) algorithms. Comparable to higher complexity algorithms.

10 The Architecture of a Cell-Based VOQ Switch Input 1 Input 2 Input 3 Input 4 Output 1 Output 2 Output 3 Output 4 Switch Fabric VOQ ISM ORM 1 N 1 N 1 N 1 N 1 N 1 N 1 N 1 N Input Segmentation Module (ISM): Segment packets to fixed-length cells. Output Reassembly Module (ORM): Reassemble cells into packets. Previous work  Try to find a good match in each (cell) time slot. Cells in the same packet are interrupted during transmission.  Considered cell delay, not packet delay.

11 Polling System Based Matching Exhaustive Service Matching  Inspired by exhaustive service polling systems.  All the cells in the corresponding VOQ are served after an input and an output are matched.  Slot times wasted to achieve an input-output match are amortized over all the cells waiting in the VOQ instead of only one.  Cells within the same packet are transferred continuously. Hamiltonian walk is used to guarantee stability.  Hamiltonian walk is a walk which visits every vertex of a graph exactly once.  For an NxN switch, each possible match is visited exactly once in every N! time slots.

12 Exhaustive Service Matching with Hamiltonian Walk (EMHW) EMHW  Let S(t) be the match at time t.  At time t+1, generate match Z(t+1) by the Exhaustive Service Matching algorithm based on S(t), and H(t+1) by Hamiltonian walk.  Let where is the weight of S at time t+1. Stability Theorem 2: An EMHW is stable under any admissible Bernoulli i.i.d. input traffic.

13 Implementation Complexity of EMHW Implementation complexity  EMHW: O(logN) for HE-iSLIP  HE-iSLIP: only one iteration is needed to achieve 100% throughput and low packet delay. Compare to the Derandomized Matching Algorithm (O(logN))  The weight of the schedule generated by EMHW is always larger than or equal to the schedule generated by the derandomized matching algorithm. Theorem: Suppose the schedule at time t is M(t), and at time t+1 the schedule by the derandomized matching algorithm and EMHW are S d (t+1) and S(t+1), respectively. Then it is always true that

14 Limited Service Matching An approximation to EMHW with lower complexity. When a VOQ is under service, a limit on the maximum number of cells that can be served continuously is enforced by means of a counter. No Hamiltonian walk is used. Limited Service Matching DRRM can be implemented in a distributed manner.

15 Simulated Delay Performance Packet delay: the sum of cell delay and reassembly delay Cell delay: measured from VOQ to destination output Reassembly delay: time spent in an ORM, often ignored in other work. Input 1 Input 2 Input 3 Input 4 Output 1 Output 2 Output 3 Output 4 Switch Fabric VOQ ISM ORM 1 N 1 N 1 N 1 N 1 N 1 N 1 N 1 N

16 Traffic Patterns in Simulations Uniform traffic  Pattern 1: packet size is 1 cell.  Pattern 2: packet size is 10 cells.  Pattern 3: packet size is varied and the average is 10 cells (Internet packet size distribution). Nonuniform traffic  Diagonal traffic, packet size is 1 cell.  Hotspot traffic, packet size is 1 cell.

17 Performance Summary schemescomplexitystablepacket delay performance HE-iSLIPO(logN)YesLowest when packet size is larger than 1 cell. iSLIPO(logN)NoAlways higher than HE-iSLIP. DERANDO(logN)YesHighest for all traffic patterns. SERENAO(N)YesLower than HE-iSLIP only under nonuniform diagonal traffic. MWMO(N 3 )YesLowest when packet size is 1 cell.

18 Packet Delay under Uniform Traffic Pattern 1: packet size is 1 cell. MWM HE-iSLIP SERENA iSLIP

19 Packet Delay under Uniform Traffic Pattern 2: packet length is 10 cells. Pattern 3: packet length is variable, the average is 10 cells. MWM HE-iSLIP MWM SERENA iSLIP SERENA

20 When Packet Length is Larger Than 1 Cell Why does HE-iSLIP have a lower packet delay than MWM? For example, when packet length is 10 cells:  Cell delay  Reassembly delay Low cell delay and low reassembly delay needed for low packet delay HE-iSLIP MWM HE-iSLIP MWM

21 Performance Analysis -- Average Delay of E-iSLIP (1) Exhaustive random polling system model  Symmetric system -- only consider one input  N VOQs per input, exhaustive service policy -- an exhaustive service polling system with N stations  The service order of the VOQs are not fixed -- random polling system, assume all station VOQs have the same probability of selection for service after a VOQ is served.

22 Performance Analysis -- Average Delay of E-iSLIP (2) Switch over time S Average delay T [Levy]

23 Performance Analysis -- Average Delay of E-iSLIP (3) When N is large

24 EMHW Summary Exhaustive Service Matching with Hamiltonian walk (EMHW)  Stable under any admissible Bernoulli i.i.d. input traffic.  HE-iSLIP, complexity O(logN) Packet delay performance  Compared to iSLIP (O(logN)), Derandomized Algorithm (O(logN)), SERENA (O(N)) and MWM (O(N 3 )), under uniform traffic, HE-iSLIP has lower packet delay than the Derandomized algorithm and SERENA. HE-iSLIP has lower packet delay than MWM for typical packet sizes.  HE-iSLIP has low cell delay and low reassembly delay, which lead to low packet delay.

25 Load-Balanced Switches Switch architecture Packet out-of-sequence issue Packet scheduling in the first-stage switch Packet delay performance Conclusions

26 Motivation Challenges in designing high-performance switch  Scalable Scale to high number of linecards and to high linecard speeds No centralized scheduler Low complexity  Provide performance guarantees 100% throughput guarantee no commercial switch today has a throughput guarantee. Load-balanced switch  100% throughput for broad class of traffic  No centralized scheduler needed, scalable

27 Work on Load-balanced Switches Original load-balanced switch (Computer Communications, Chang)  100% throughput  Unbounded out-of-sequence delay FCFS (First come first serve) (Computer Communications, Chang)  Jitter control mechanism  Increases the average delay EDF (Earliest deadline first) (Computer Communications, Chang)  Reduce the average delay  High complexity Mailbox switch (Infocom 2003, Chang)  Prevents packets from being out-of-sequence  Not 100% throughput FFF (Full frames first) (Infocom 2002, Mckeown)  Frame-based  No need for resequencing  Require multi-stage buffer communication FOFF (Full ordered frames first) (Sigcomm 2003, Mckeown)  Frame-based  Maximum resequencing delay N 2  Bandwidth wastage

28 Byte-Focal Switch Architecture Input VOQ Arrival 2nd stage switch fabric Second-stage VOQ Re-sequencing buffer i 1 N (1,1) (1,N) (1,k) (i,1) (i,k)... (i,N) … … (N,1) (N,k) (N,N)... (1,1) (1,k) (1,N) (j,1) 1 j N (j,k) (j,N) (N,1) (N,k) (N,N) 1st stage switch fabric … … 1 k N … 1 2 N … 1 2 N … 1 2 N … … 1 i N … …

29 Switching fabric Deterministic and periodic connection pattern At both stages, at time slot t, input i is connected to output j with j = (i+t) mod N t=0t=1 t=2t=3

30 Load-Balancing Second stage VOQ First stage VOQ(i,k)... Packets are sent in Round-Robin order to the second-stage (1,k) (2,k) (3,k) (N,k) 1st stage switch fabric 2nd stage switch fabric

31 Out-of-Sequence Problem … ith input port 1st stage switch fabric At time t jth central buffer (j+1) th central buffer Maximum re-sequencing delay=N 2 At time t+ 2nd stage switch fabric

32 Resequencing Buffer Design Input VOQArrival2nd stage switch fabricSecond-stage VOQ Re-sequencing buffer i 1 N (1,1) (1,N) (1,k) (i,1) (i,k)... (i,N) … … (N,1) (N,k) (N,N)... (1,1) (1,k) (1,N) (j,1) 1 j N (j,k) (j,N) (N,1) (N,k) (N,N) 1 st stage switch fabric … … 1 k N … 1 2 N … 1 2 N … 1 2 N … … 1 i N … … Resequencing buffer design

33 Resequencing Buffer Design Input port i 1 Input port i 2 … … Departure queue … … j 1 +2 j1j1 j 1 +1 The next packet via center stage queue j has not arrived j2j2 j 2 +1 j 2 +2 i2i2 i1i1 The HOF packet arrived The index joins the departure queue For i 2, j 2 is the HOF packet, then the index also joins the departure queue...

34 First Stage Scheduling Input VOQArrivalSwitch fabricSecond-stage VOQ Re-sequencing buffer i 1 N (1,1) (1,N) (1,k) (i,1) (i,k)... (i,N) … … (N,1) (N,k) (N,N)... (1,1) (1,k) (1,N) (j,1) 1 j N (j,k) (j,N) (N,1) (N,k) (N,N) Switch fabric … … 1 k N … 1 2 N … 1 2 N … 1 2 N … … 1 i N … … First stage scheduling algorithms

35 First Stage Scheduling jth input port of 2nd stage Arbiter input port i At time t, i connected to j Up to N HOL eligible cells Choose an eligible cell … … Problem: Which VOQ to serve? 1st stage switch fabric Eligible cells: cells that can be sent to jth input port of 2nd stage 2 nd stage switch fabric

36 First Stage Scheduling Round-robin  The arbiter at each input port selects VOQs in round-robin order  Unstable under non-uniform traffic Longest queue first  The arbiter at each input port chooses to serve the longest queue  High complexity, not practical Fixed threshold scheme  Set a threshold, N, a queue exceeds the threshold has a higher priority  Serve the high priority queue first Dynamic threshold scheme  Set the threshold as TH=Q(t)/N  Then same as fixed threshold scheme Longest queue first, fixed threshold and dynamic threshold schemes have been proved to be stable in our work.

37 Simulation Settings Uniform i.i.d: Diagonal i.i.d:, for j = (i +1) mod N. This is a very skewed loading, since input i has packets only for outputs i and (i + 1) mod N Hot-spot:,, for i≠j. This type of traffic is more balanced than diagonal traffic, but obviously more unbalanced than uniform traffic

38 Average Cell Delay At the low load, 2-stage switching has larger delay than iSlip, but smaller at high load FOFF shows large delay even at low load due to the bandwidth waste when transferring partial frames Average delay under uniform traffic with switch size of 32 LQF: Longest queue first FOFF: Full order frame first

39 Dynamic Threshold The LQF scheme has the best delay performance  not practical due to its high implementation complexity Dynamic threshold scheme performance is comparable with the LQF scheme Compared with fixed threshold scheme, adapt to the non-uniform input loadings, thus achieving a better delay performance, while maintaining low complexity Focus on the dynamic threshold scheme from now on

40 Dynamic Threshold As the input traffic changes from uniform to hotspot to diagonal (hence less balanced), the dynamic threshold scheme can achieve good performance, especially for the diagonal traffic. The Byte-Focal switch performs load-balancing at the first stage, thus achieving good performance even under extreme non-uniform traffic The average delay of the dynamic threshold scheme under different input traffic patterns with a switch size of 32

41 Delays with Different Switch Sizes The average delays are almost linearly dependent on the switch size.

42 3-stage Delays The three components of the total delay with switch size of 32 3 delay components: - first stage queueing delay - second stage queueing delay - resequencing delay The first stage queueing delay and the second stage queueing delay are comparable The resequencing delay is much smaller compared to the other two delays.

43 Variable Size Packet Scheduling Two approaches:  Cell-mode and packet-mode  The average delay is the same Combining the packet mode scheduling and the dynamic threshold scheme - the packet mode dynamic threshold algorithm: 1. At each time slot, if it is in the middle of a packet, keep serving this queue 2. If not, apply the dynamic threshold scheme The resequencing delay and the reassembly delay overlap  The sum of the resequencing delay and the reassembly delay is bounded by where is the maximum packet length  The additional delay due to packet reassembly is reduced

44 Packet Delay Used Internet packet length distribution: trimodal distribution As with the cell delay, the delay performance is not degraded under non-uniform traffic

45 Packet Delay Packet delays increase as the packet length increases, also increases with packet length variance But with a weak dependence

46 Conclusion The maximum resequencing delay is N 2 The dynamic threshold scheme is practical and has good delay performance The time complexity of the resequencing buffer is O(1) Does not need communications between linecards Achieve a uniformly good delay performance over a wide range of traffic matrices Achieves good performance with low complexity