Presentation on theme: "Some Unsolved Problems in High Speed Packet Switching" - Presentation transcript:
1 Some Unsolved Problems in High Speed Packet Switching
Shivendra S. Panwar
Joint work with: Yihan Li, Yanming Shen and H. Jonathan Chao
Polytechnic University, Brooklyn, NY
NY State Center for Advanced Technology in Telecommunications
2 Advice to Woodward and Bernstein: “Follow the money” -- Deep Throat (aka Mark Felt)
3 Advice to performance analysts: “Find the bottleneck”
5 Buffering in a Packet Switch
- Fixed-size packet switches
  - Operate in a time-slotted manner
  - The slot duration is equal to the cell transmission time
- Contention occurs when multiple inputs have arrivals destined to the same output
- Buffering is needed to avoid packet loss
- Buffering schemes in a packet switch
  - Output queueing (OQ)
  - Input queueing (IQ)
  - Virtual output queueing (VOQ) / combined input-output queueing (CIOQ)
6 Output Queuing (OQ)
- 100% throughput
- Internal speedup of N
- Impractical for large N
[Figure: 4x4 output-queued switch, Inputs 1-4 and Outputs 1-4]
7 Input Queuing (IQ)
- Easy to implement
- Head-of-line (HOL) blocking limits throughput to 58.6% (uniform traffic, large N)
[Figure: 4x4 input-queued switch illustrating head-of-line blocking]
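
The 58.6% figure is the classical saturation limit 2 - sqrt(2) for FIFO input queues under uniform random destinations as N grows. The sketch below is a rough Monte Carlo check, not part of the original slides: the function name, the random tie-breaking among contending inputs, and the port and slot counts are all assumptions chosen for illustration.

```python
import random

def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate the saturated throughput of a FIFO input-queued switch."""
    rng = random.Random(seed)
    # Every input is always backlogged; hol[i] is the destination
    # of the cell currently at the head of input i's FIFO.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group inputs by the output their head-of-line cell wants.
        contenders = {}
        for i, out in enumerate(hol):
            contenders.setdefault(out, []).append(i)
        # Each contended output serves exactly one input (assumed: random pick).
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next cell moves to the head
            served += 1
    return served / (n_ports * n_slots)

throughput = hol_saturation_throughput(n_ports=32, n_slots=5000)
print(round(throughput, 3))
```

For a finite switch the estimate sits slightly above the asymptotic limit and approaches it as the port count increases.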
8 Virtual Output Queuing (VOQ)
- Overcomes HOL blocking
- No speedup requirement
- Needs scheduling algorithms to resolve contention
  - Complexity
  - Performance guarantee
[Figure: VOQ switch]
9 Challenges in Switch Design
- Stability
  - 100% throughput
- Delay performance
- Scalability
  - Scale to a high number of linecards and to high linecard speeds
  - A distributed scheduler is more desirable than a centralized scheduler
  - Scheduler complexity
  - Pin count
10 High Speed Packet Switches
- VOQ switches and scheduling algorithms
- Buffered crossbar switch
- Load-balanced switch
- Multi-stage switch
12 Scheduling for VOQ Switch
- Scheduling is needed to avoid output contention
- A scheduling problem can be modeled as a matching problem in a bipartite graph
  - An input and an output are connected by an edge if the corresponding VOQ is not empty
  - Each edge may have a weight, which can be
    - The length of the VOQ
    - The age of the HOL cell
13 Maximum Weight Matching (MWM)
- MWM always finds a match with the maximum weight
- Stable under any admissible traffic
- Very high complexity: O(N^3), impractical
[Figure: weighted bipartite graph; weight of the match: 25]
References
- L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling for maximum throughput in multihop radio networks,” IEEE Transactions on Automatic Control, vol. 37, no. 12, December 1992.
- E. Leonardi, M. Mellia, F. Neri, and M. Ajmone Marsan, “On the stability of input-queued switches with speed-up,” IEEE/ACM Transactions on Networking, vol. 9, no. 1, February 2001.
- N. McKeown, V. Anantharam, and J. Walrand, “Achieving 100% throughput in an input-queued switch,” IEEE Transactions on Communications, vol. 47, no. 8, Aug. 1999.
- J. G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” IEEE INFOCOM 2000.
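
To make the definition concrete, here is a minimal sketch that finds a maximum weight match by brute force over all N! permutations, using VOQ length as the edge weight. It is only usable for tiny switches and is not the O(N^3) algorithm the slide refers to; the function name and the 3x3 queue matrix are hypothetical.

```python
from itertools import permutations

def max_weight_matching(q):
    """Return (match, weight), where match[i] is the output assigned to input i."""
    n = len(q)
    best_match, best_weight = None, -1
    for perm in permutations(range(n)):
        # Weight of a match = sum of the lengths of the matched VOQs.
        w = sum(q[i][perm[i]] for i in range(n))
        if w > best_weight:
            best_match, best_weight = perm, w
    return best_match, best_weight

# Hypothetical VOQ lengths: q[i][j] is the queue at input i for output j.
q = [[7, 0, 4],
     [3, 8, 0],
     [0, 5, 6]]
print(max_weight_matching(q))  # ((0, 1, 2), 21)
```

An edge with weight 0 corresponds to an empty VOQ, so pairing that input and output simply transfers no cell.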
14 Maximum Weight Matching
- The maximum weight matching algorithm is strongly stable under any admissible traffic pattern
  - Lyapunov function
  - Strongly stable
  - Admissible
References
- E. Leonardi, M. Mellia, F. Neri, and M. Ajmone Marsan, “On the stability of input-queued switches with speed-up,” IEEE/ACM Transactions on Networking, vol. 9, no. 1, February 2001.
- N. McKeown, V. Anantharam, and J. Walrand, “Achieving 100% throughput in an input-queued switch,” IEEE Transactions on Communications, vol. 47, no. 8, Aug. 1999.
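
The formulas for these three terms did not survive in the transcript. As a hedged reminder, the forms commonly used in the cited literature are sketched below; they are not necessarily the exact expressions on the slide. The symbol \lambda_{ij} (mean arrival rate from input i to output j) is an assumption, while q_{ij} follows the notation used later in the talk.

```latex
% Admissible traffic: no input and no output is overloaded
\sum_{i=1}^{N} \lambda_{ij} < 1 \quad \forall j, \qquad
\sum_{j=1}^{N} \lambda_{ij} < 1 \quad \forall i

% Strong stability: the expected total queue length stays bounded
\limsup_{t \to \infty} \mathbb{E}\left[ \lVert Q(t) \rVert \right] < \infty

% Quadratic Lyapunov function typically used in the stability proof
V(Q(t)) = \sum_{i,j} q_{ij}(t)^{2}
```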
15 Maximum Weight Matching
- Fluid model
- The maximum weight matching is rate stable if:
  - The arrival processes satisfy a strong law of large numbers (SLLN) with probability one, and
  - No input or output is oversubscribed
References
- J. G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” IEEE INFOCOM 2000.
16 Approximate MWM
- 1-APRX
  - A function f(.) is sub-linear if lim_{x→∞} f(x)/x = 0
  - Let W_B be the weight of the schedule obtained by a scheduling algorithm B
  - Let W* be the weight of the maximum weight match for the same switch state
  - If W_B ≥ W* - f(W*), then B is a 1-APRX to MWM
- B is stable if [condition not preserved in the transcript]
- Makes it possible to find stable matching algorithms with lower complexity than MWM
References
- D. Shah and M. Kopikare, “Delay bounds for approximate maximum weight matching algorithms for input-queued switches,” IEEE INFOCOM, New York, USA, June 2002.
17 Average Delay Bound
- Delay bound for MWM
  - Lyapunov function
References
- E. Leonardi, M. Mellia, F. Neri, and M. Ajmone Marsan, “Bounds on average delays and queue size averages and variances in input-queued cell-based switches,” Proceedings of IEEE INFOCOM, 2001.
18 Average Delay Bound (contd.)
- Delay bound for approximate MWM
  - Lyapunov function
  - C_b: weight difference to the MWM matching
- Under uniform traffic, the two bounds give the same result
References
- D. Shah and M. Kopikare, “Delay bounds for approximate maximum weight matching algorithms for input-queued switches,” IEEE INFOCOM, New York, USA, June 2002.
19 Open Issues
- In simulations, MWM has the best cell delay performance
- Average delay: if the weight of a queue is chosen as Q^a, the delay increases with a for a > 0
- Is MWM the optimal scheduling scheme for achieving the minimum average cell delay?
- What is the optimal scheduling scheme to achieve the minimum average packet delay (including reassembly delay)?
20 Maximal Matching
[Figure: maximal matching on a weighted bipartite graph; weight of the match: 23]
- Add connections incrementally, without removing connections made earlier
- No more matches can be made trivially by the end of the operation
- The solution may not be unique
- Complexity O(N log N)
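
A minimal sketch of the incremental idea follows: edges are considered one at a time and kept only if both endpoints are still free, so earlier connections are never removed. Visiting edges in decreasing weight order is an assumption made for this illustration; any order yields a maximal match. The function name and the 2x2 example are hypothetical.

```python
def greedy_maximal_matching(q):
    """Return {input: output}; no further edge can be added to the result."""
    n = len(q)
    # Only non-empty VOQs form edges; heaviest edges are tried first (assumed order).
    edges = sorted(
        ((q[i][j], i, j) for i in range(n) for j in range(n) if q[i][j] > 0),
        reverse=True,
    )
    match, used_outputs = {}, set()
    for _, i, j in edges:
        if i not in match and j not in used_outputs:
            match[i] = j          # add the connection and never revisit it
            used_outputs.add(j)
    return match

q = [[9, 8],
     [8, 0]]
match = greedy_maximal_matching(q)
print(match, sum(q[i][j] for i, j in match.items()))  # {0: 0} 9
```

In this example the maximal match has weight 9, while the maximum weight match (input 0 to output 1, input 1 to output 0) has weight 16, which shows why a maximal match is not necessarily a maximum one.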
21 Maximal Matching
- A maximal matching achieves 100% throughput with speedup S ≥ 2 under any admissible traffic pattern [Leonardi, ToN 2001]
  - 100% throughput if [formula not preserved in the transcript] with probability 1
- A maximal matching algorithm is rate stable with speedup S ≥ 2 [Dai, Infocom 2000]
References
- E. Leonardi, M. Mellia, F. Neri, and M. Ajmone Marsan, “On the stability of input-queued switches with speed-up,” IEEE/ACM Transactions on Networking, vol. 9, no. 1, February 2001.
- J. G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” IEEE INFOCOM 2000.
22 Multiple Iterative Matching
- Use multiple iterations to converge on a maximal matching
  - Parallel Iterative Matching (PIM)
  - iSLIP and DRRM
- The complexity of each iteration is O(log N)
- O(log N) iterations are needed to converge on a maximal matching (iSLIP)
- 100% throughput only under uniform traffic
23 iSLIP
- Step 1: Request
  - Each input sends a request to every output for which it has a queued cell.
- Step 2: Grant
  - If an output receives multiple requests, it chooses the one that appears next in a fixed round-robin schedule.
  - The output arbiter pointer is incremented by one location beyond the granted input if, and only if, the grant is accepted in step 3.
- Step 3: Accept
  - If an input receives multiple grants, it accepts the one that appears next in a fixed round-robin schedule.
  - The input arbiter pointer is incremented by one location beyond the accepted output.
[Figure: request, grant and accept phases between inputs and outputs]
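
The three steps above can be sketched as a single iSLIP iteration. This is an illustrative reading of the slide rather than a reference implementation: the function name, the use of sets for requests, and the in-place pointer lists are assumptions.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept round.

    requests[i] is the set of outputs for which input i has a queued cell.
    grant_ptr[j] and accept_ptr[i] are round-robin pointers, updated in place.
    Returns {input: output} for the accepted grants.
    """
    n = len(requests)
    # Step 2 (Grant): each output picks the requesting input that appears
    # next at or after its round-robin pointer.
    grants = {}
    for out in range(n):
        requesting = [i for i in range(n) if out in requests[i]]
        if requesting:
            chosen = min(requesting, key=lambda i: (i - grant_ptr[out]) % n)
            grants.setdefault(chosen, []).append(out)
    # Step 3 (Accept): each input picks the granting output that appears
    # next at or after its own pointer.
    match = {}
    for inp, outs in grants.items():
        chosen = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
        match[inp] = chosen
        accept_ptr[inp] = (chosen + 1) % n
        # The output pointer moves only because its grant was accepted.
        grant_ptr[chosen] = (inp + 1) % n
    return match

requests = [{0, 1}, {0}, {2}]
grant_ptr, accept_ptr = [0, 0, 0], [0, 0, 0]
print(islip_iteration(requests, grant_ptr, accept_ptr))  # {0: 0, 2: 2}
print(grant_ptr, accept_ptr)                             # [1, 0, 0] [1, 0, 0]
```

Output 1 granted input 0 but was not accepted, so its pointer stays put; this is the rule that desynchronizes the output arbiters over time.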
24 Achieving 100% Throughput without Speedup
- Matching algorithms using memory
- Polling system based matching
25 Low Complexity Algorithms with 100% Throughput
- Algorithms with memory
  - Use the previous schedule as a candidate
  References
  - L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input queued switches,” IEEE INFOCOM 1998, vol. 2, New York, 1998.
  - P. Giaccone, B. Prabhakar, and D. Shah, “Toward simple, high-performance schedulers for high-aggregate bandwidth switches,” IEEE INFOCOM 2002, New York, 2002.
- Polling system based matching algorithms
  - Improve the efficiency by using exhaustive service
  References
  - Y. Li, S. Panwar, and H. J. Chao, “Exhaustive service matching algorithms for input queued switches,” 2004 Workshop on High Performance Switching and Routing (HPSR 2004), April 2004.
  - Y. Li, S. Panwar, and H. J. Chao, “Performance analysis of a dual round robin matching switch with exhaustive service,” IEEE GLOBECOM 2002.
26 Matching Algorithms with Memory
- The queue length of each VOQ does not change much during successive time slots
  - In each time slot, at most one cell arrives at each input
  - In each time slot, at most one cell departs from each input
- It is likely that a busy connection will continue to be busy over a few time slots, if the queue length is used as the weight of a connection
- Use the match in the previous time slot as a candidate for the new match
- Important results:
  - Randomized algorithm with memory [Tassiulas 98]
  - Derandomized algorithm with memory [Giaccone 02]
  - With higher complexity: APSARA, LAURA, SERENA [Giaccone 02]
27 Notations
- For an NxN switch, there are N! possible matches
- Q(t) = [q_ij]_{NxN}, where q_ij is the queue length of VOQ_ij
- M(t): a match at time t
- The weight of M(t): W(t) = <M(t), Q(t)>, the sum of the lengths of all matched VOQs
28 Randomized algorithm with memory
- Let S(t) be the schedule used at time t
- At time t+1, uniformly select a match R(t+1) at random from the set of all N! possible matches
- Let S(t+1) = arg max over S in {S(t), R(t+1)} of <S, Q(t+1)>
- Stable under any Bernoulli i.i.d. admissible arrival traffic
- Very simple to implement, complexity O(log N)
- Delay performance is very poor
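
A minimal sketch of one time slot of this scheme: draw a uniformly random match, compare its weight with that of the previous schedule under the current queue lengths, and keep the heavier one. The function names and the example queue matrix are hypothetical, and the update rule is the usual reading of [Tassiulas 98].

```python
import random

def match_weight(match, q):
    """<M, Q>: sum of the VOQ lengths selected by match (match[i] = output of input i)."""
    return sum(q[i][match[i]] for i in range(len(q)))

def randomized_step(prev_match, q, rng):
    """Return S(t+1): the heavier of S(t) and a uniformly random match R(t+1)."""
    candidate = list(range(len(q)))
    rng.shuffle(candidate)  # uniform over all N! permutations
    # On a tie, max() keeps the first argument, i.e. the previous schedule.
    return max(prev_match, tuple(candidate), key=lambda m: match_weight(m, q))

q = [[7, 0, 4],
     [3, 8, 0],
     [0, 5, 6]]
rng = random.Random(1)
schedule = (1, 2, 0)  # arbitrary starting match
for _ in range(5):
    schedule = randomized_step(schedule, q, rng)
    print(schedule, match_weight(schedule, q))
```

With Q held fixed the weight never decreases from one slot to the next, which is the "memory" that the stability argument relies on; in a real switch Q changes every slot.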
29 Derandomized Algorithm with Memory
- Hamiltonian walk
  - A walk which visits every vertex of a graph exactly once
  - In an NxN switch there are N! vertices (possible schedules); a Hamiltonian walk visits each vertex once every N! time slots
  - H(t): the value of the vertex which is visited at time t
  - The complexity of generating H(t+1) when H(t) is known is O(1)
- Derandomized algorithm with memory
  - Use the match generated by the Hamiltonian walk instead of the random match
  - Similar performance to the randomized algorithm
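
The sketch below replaces the random candidate with a deterministic walk over all N! matches. As a stand-in for a true O(1)-per-step Hamiltonian walk it simply cycles through `itertools.permutations`, which visits every match once per N! slots but stores all of them in memory, so it only suits tiny examples. All names here are hypothetical.

```python
from itertools import cycle, permutations

def match_weight(match, q):
    """Sum of the VOQ lengths selected by match (match[i] = output of input i)."""
    return sum(q[i][match[i]] for i in range(len(q)))

def hamiltonian_walk(n):
    """Stand-in walk: endlessly repeats all n! matches in lexicographic order."""
    return cycle(permutations(range(n)))

def derandomized_step(prev_match, q, walk):
    """Return S(t+1): the heavier of S(t) and the walk's next match H(t+1)."""
    return max(prev_match, next(walk), key=lambda m: match_weight(m, q))

q = [[7, 0, 4],
     [3, 8, 0],
     [0, 5, 6]]
walk = hamiltonian_walk(3)
schedule = (1, 2, 0)
for _ in range(6):  # 3! = 6 slots cover every possible match once
    schedule = derandomized_step(schedule, q, walk)
print(schedule, match_weight(schedule, q))  # (0, 1, 2) 21
```

With Q held fixed, one full cycle of the walk is guaranteed to reach the maximum weight match, since the best match seen so far is always retained.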
30 Compared to MWM …
- Simple matching algorithms can achieve stability as MWM does
  - It is not necessary to find “the best match” in each time slot to achieve 100% throughput
- MWM has much better delay performance than randomized and derandomized matching
  - “Better” matches lead to better delay performance
31 With Higher Complexity and Lower Delay
- Introduce higher complexity for much lower delay than the randomized and derandomized algorithms
- APSARA: include the neighbors of the latest match as candidates
- LAURA: merge the latest match with a random match to remember the heavy edges
- SERENA: merge the latest match with the arrival graph
  - Arrival graph: generated from the current arrival pattern
  - Complexity O(N)
32 Polling System Based Matching
- Exhaustive Service Matching
  - Inspired by exhaustive service polling systems
  - All the cells in the corresponding VOQ are served after an input and an output are matched
  - Slot times wasted to achieve an input-output match are amortized over all the cells waiting in the VOQ instead of only one
  - Cells within the same packet are transferred continuously
- A Hamiltonian walk is used to guarantee stability
33 Exhaustive Service Matching with Hamiltonian Walk (EMHW)
- Let S(t) be the match at time t
- At time t+1, generate match Z(t+1) by the Exhaustive Service Matching algorithm based on S(t), and H(t+1) by Hamiltonian walk
- Let S(t+1) = arg max over S in {Z(t+1), H(t+1)} of <S, Q(t+1)>, where <S, Q(t+1)> is the weight of S at time t+1
- Stable under any admissible traffic
- Analyzed by an exhaustive service polling system
- Implementation complexity
  - HE-iSLIP: O(log N)
34 E-iSLIP Average Delay Analysis
- Exhaustive random polling system model
  - Symmetric system: only one input needs to be considered
  - N VOQs per input with an exhaustive service policy: an exhaustive service polling system with N stations
  - The service order of the VOQs is not fixed: a random polling system, assuming all VOQs (stations) have the same probability of being selected for service after a VOQ is served
- Switchover time S
- Average delay T [Levy and Kleinrock]
35 Delay Performance of HE-iSLIP
- Packet delay: the sum of cell delay and reassembly delay
  - Cell delay: measured from VOQ to destination output
  - Reassembly delay: time spent in an ORM, often ignored in other work
[Figure: 4x4 switch with Inputs 1-4 and Outputs 1-4, showing the ISM and VOQs on the input side, the switch fabric, and the ORM on the output side]
36 Performance Summary
Scheme       | Complexity | Stable | Packet delay performance
iSLIP        | O(log N)   | No     | Always higher than HE-iSLIP.
HE-iSLIP     | O(log N)   | Yes    | Lowest when packet size is larger than 1 cell.
Derandomized | O(log N)   | Yes    | Highest for all traffic patterns.
SERENA       | O(N)       | Yes    | Lower than HE-iSLIP only under nonuniform diagonal traffic.
MWM          | O(N^3)     | Yes    | Lowest when packet size is 1 cell.
37 Packet Delay under Uniform Traffic
- Pattern 1: packet size is 1 cell
[Figure: packet delay of SERENA, iSLIP, HE-iSLIP and MWM]
38 Packet Delay under Uniform Traffic
- Pattern 2: packet length is 10 cells
- Pattern 3: packet length is variable, with an average of 10 cells (Internet packet size distribution)
[Figures: packet delay of SERENA, iSLIP, MWM and HE-iSLIP under Pattern 2 and Pattern 3]
39 When packet length is larger than 1 cell
- Why does HE-iSLIP have a lower packet delay than MWM?
- For example, when packet length is 10 cells:
[Figures: cell delay and reassembly delay of HE-iSLIP and MWM]
- Low cell delay + low reassembly delay are needed for low packet delay
- Open Problem: Which scheduler minimizes packet delay?
40 Packet-Based Scheduling
- Packet-based scheduling algorithm
  - Once it starts transmitting the first cell of a packet to an output port, it continues the transmission until the whole packet is completely received at the corresponding output port
- Packet-based MWM is stable for any admissible Bernoulli i.i.d. traffic
  - Lyapunov function: M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri, “Packet Scheduling in Input-Queued Cell-Based Switches,” INFOCOM 2001.
- Packet-based MWM is stable under regenerative admissible input traffic
  - Fluid model: Y. Ganjali, A. Keshavarzian, and D. Shah, “Input Queued Switches: Cell switching v/s Packet switching,” Proceedings of INFOCOM, 2003.
  - Regenerative: let T be the time between two successive occurrences of the event that all ports are free, with E(T) being finite
- The modified waiting PB-MWM algorithm is stable under any admissible traffic
41 Buffered Crossbar Switch
- One buffer for each crosspoint
- Distributed arbitration for inputs and outputs
  - From each input, one cell can be sent to a crosspoint buffer if it has space
  - One cell can be sent to an output if at least one crosspoint buffer to that output is nonempty
References
- Y. Doi and N. Yamanaka, “A High-Speed ATM Switch with Input and Cross-Point Buffers,” IEICE Trans. Commun., vol. E76, no. 3, March 1993.
- R. Rojas-Cessa, E. Oki, Z. Jing, and H. J. Chao, “CIXB-1: Combined Input-One-Cell-Crosspoint Buffered Switch,” Proceedings of IEEE Workshop on High Performance Switching and Routing, 2001.
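
The two arbitration rules can be sketched as one time slot of a buffered crossbar. The slide does not say which VOQ an input prefers or which crosspoint an output serves first, so the policies below (longest VOQ first at the inputs, lowest-numbered input first at the outputs), the one-cell crosspoint capacity, and the function name are all assumptions for illustration.

```python
def buffered_crossbar_slot(voq, xbuf, capacity=1):
    """Advance one time slot; voq and xbuf are N x N cell counts, updated in place.

    Returns the list of (input, output) pairs whose cell left the switch.
    """
    n = len(voq)
    # Input arbitration: each input independently sends one cell to a
    # crosspoint buffer that still has space.
    for i in range(n):
        eligible = [j for j in range(n) if voq[i][j] > 0 and xbuf[i][j] < capacity]
        if eligible:
            j = max(eligible, key=lambda k: voq[i][k])  # assumed: longest VOQ first
            voq[i][j] -= 1
            xbuf[i][j] += 1
    # Output arbitration: each output independently drains one nonempty
    # crosspoint buffer in its column.
    departures = []
    for j in range(n):
        nonempty = [i for i in range(n) if xbuf[i][j] > 0]
        if nonempty:
            i = nonempty[0]  # assumed: lowest-numbered input first
            xbuf[i][j] -= 1
            departures.append((i, j))
    return departures

voq = [[2, 0], [1, 0]]
xbuf = [[0, 0], [0, 0]]
print(buffered_crossbar_slot(voq, xbuf))  # [(0, 0)]
print(voq, xbuf)                          # [[1, 0], [0, 0]] [[0, 0], [1, 0]]
```

Both inputs forward a cell for output 0 in the same slot without any central scheduler; the contention is absorbed by the crosspoint buffers and resolved later by the output arbiter.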
42 Birkhoff-von Neumann Switch
- When the traffic matrix is known
- Birkhoff-von Neumann decomposition
Reference
- C.-S. Chang, W.-J. Chen, and H.-Y. Huang, “On service guarantees for input buffered crossbar switches: a capacity decomposition approach by Birkhoff and von Neumann,” IEEE IWQoS '99, London, U.K., 1999.
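
A Birkhoff-von Neumann decomposition writes a doubly stochastic rate matrix as a weighted sum of permutation matrices; the switch can then apply each permutation for a share of time equal to its coefficient. The sketch below assumes the input is already doubly stochastic (the step that pads a sub-stochastic traffic matrix up to one is omitted) and uses a simple augmenting-path matching. Function names and the example matrix are hypothetical.

```python
from fractions import Fraction

def _perfect_matching(support):
    """Augmenting-path bipartite matching; support[i] lists the columns allowed for row i."""
    n = len(support)
    col_owner = [None] * n

    def try_row(i, seen):
        for j in support[i]:
            if j not in seen:
                seen.add(j)
                if col_owner[j] is None or try_row(col_owner[j], seen):
                    col_owner[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, set()):
            raise ValueError("no perfect matching: input is not doubly stochastic")
    perm = [None] * n
    for j, i in enumerate(col_owner):
        perm[i] = j
    return tuple(perm)

def bvn_decompose(rates):
    """Return [(coefficient, permutation)], with permutation[i] = output of input i."""
    n = len(rates)
    m = [[Fraction(x) for x in row] for row in rates]
    terms = []
    while any(x > 0 for row in m for x in row):
        # Find a permutation that uses only strictly positive entries.
        support = [[j for j in range(n) if m[i][j] > 0] for i in range(n)]
        perm = _perfect_matching(support)
        # Peel off as much of that permutation as the smallest entry allows.
        coeff = min(m[i][perm[i]] for i in range(n))
        terms.append((coeff, perm))
        for i in range(n):
            m[i][perm[i]] -= coeff
    return terms

half, quarter = Fraction(1, 2), Fraction(1, 4)
rates = [[half, half, 0],
         [quarter, quarter, half],
         [quarter, quarter, half]]
for coeff, perm in bvn_decompose(rates):
    print(coeff, perm)
```

Each pass zeroes at least one entry, so the loop ends after at most N^2 terms; exact fractions are used so that rounding cannot leave tiny positive residues behind.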
44 Load-Balanced Switch
- Convert the traffic to uniform, then use fixed switching
- 100% throughput for a broad class of traffic
- No centralized scheduler needed, scalable
[Figure: two-stage switch with N ports; a load-balancing stage followed by a switching stage]
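
"Fixed switching" means each stage follows a predetermined periodic sequence of connection patterns, so no scheduler has to inspect the queues. One common choice, assumed here since the slide does not spell it out, connects input i to port (i + t) mod N at time slot t; the function name is hypothetical.

```python
def lb_pattern(t, n):
    """Port that each input i is connected to at time slot t: (i + t) mod n."""
    return [(i + t) % n for i in range(n)]

n = 4
for t in range(n):
    print(t, lb_pattern(t, n))
# 0 [0, 1, 2, 3]
# 1 [1, 2, 3, 0]
# 2 [2, 3, 0, 1]
# 3 [3, 0, 1, 2]
```

Every pattern is a permutation, and over N consecutive slots each input is connected to every port exactly once, which is what spreads arriving traffic evenly across the intermediate ports.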
45 Original Work on LB Switch
- Stability: the load-balanced switch is stable
- Delay: burst reduction
- Problem: unbounded out-of-sequence delays
Reference
- C.-S. Chang, D.-S. Lee, and Y.-S. Jou, “Load balanced Birkhoff-von Neumann switches, Part I: one-stage buffering,” Computer Communications, vol. 25, 2002.
46 LB Switch Variants
- Solve the out-of-sequence problem
- FCFS (first come, first served)
  - Jitter control mechanism
  - Increases the average delay
- EDF (earliest deadline first)
  - Reduces the average delay
  - High complexity
- Mailbox switch
  - Prevents packets from being out-of-sequence
  - Does not achieve 100% throughput
References
- C.-S. Chang, D.-S. Lee, and C.-M. Lien, “Load balanced Birkhoff-von Neumann switches, Part II: multi-stage buffering,” Computer Communications, vol. 25, 2002.
- C.-S. Chang, D. Lee, and Y. J. Shih, “Mailbox switch: A scalable two-stage switch architecture for conflict resolution of ordered packets,” Proceedings of IEEE INFOCOM, Hong Kong, March 2004.
47 More LB Switch Variants
- FFF (full frames first) (Infocom 2002, McKeown)
  - Frame-based
  - No need for resequencing
  - Requires multi-stage buffer communication: high complexity
- FOFF (full ordered frames first) (Sigcomm 2003, McKeown)
  - Maximum resequencing delay N^2
  - Bandwidth wastage
References
- I. Keslassy and N. McKeown, “Maintaining packet order in two-stage switches,” Proc. of IEEE INFOCOM, June 2002.
- I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard, and N. McKeown, “Scaling Internet routers using optics,” ACM SIGCOMM '03, Karlsruhe, Germany, Aug. 2003.
49 Byte-Focal Switch
- Packet-by-packet scheduling
- Improves the average delay performance
- The maximum resequencing delay is N^2
- The time complexity of the resequencing buffer is O(1)
- Does not need communication between linecards
References
- Y. Shen, S. Jiang, S. S. Panwar, and H. J. Chao, “Byte-Focal: a practical load-balanced switch,” HPSR 2005, Hong Kong.
50 Multi-Stage Switches
- Single-stage switches (e.g., cross-point switch)
  - Single path between each input-output pair
  - Cannot meet the increasing demands of Internet traffic
  - No packets out-of-sequence
  - Easy to design
  - Lack of scalability
- Multi-stage switches (e.g., Clos-network switch)
  - Multiple paths between each input-output pair
  - Better tradeoff between switch performance and complexity
  - Highly scalable and fault tolerant
- Memory-less multi-stage switches
  - No packets out-of-sequence, but may encounter internal blocking
- Buffered multi-stage switches
  - Packets may be out-of-sequence, but scheduling is easy
53 Trueway Switch
- The switch fabric consists of multiple switching planes, each being a three-stage Clos network with m center modules
- Each input/output pair has multiple routing paths
- Highly scalable
[Figure: three-stage Clos network built from cross-point buffered memory modules]
54 Challenges in Multi-Stage Switching
- How to efficiently allocate and share the limited on-chip memory?
- How to schedule packets on multiple paths to maximize memory utilization and system performance?
- How to minimize link congestion and prevent buffer overflow (i.e., stage-to-stage flow control)?
- How to maintain cell/packet order if they are delivered over multiple paths (i.e., port-to-port flow control)?
- How to achieve 100% throughput?
55 Conclusion
- Introduced switch architecture trends
- Many open research problems
- Bottleneck keeps changing!