Packet Scheduling/Arbitration in Virtual Output Queues and Others

If anyone has additional comments, please speak up.

Head-of-Line Blocking
[Figure: head-of-line blocking; the blocked packets are marked "Blocked!"]

Crossbar Switches: Virtual Output Queues
At each input port, there are N queues, each associated with an output port.
Only one packet can go from an input port at a time.
Only one packet can be received by an output port at a time.
It retains the scalability of FIFO input-queued switches.
It eliminates the HoL problem of FIFO input queues.
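The per-input queue structure can be sketched in a few lines of Python. This is only an illustration: the class name `VOQInputPort` and its methods are hypothetical, not taken from the slides.

```python
from collections import deque


class VOQInputPort:
    """One input port holding N virtual output queues, one per output port."""

    def __init__(self, num_outputs):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, output):
        # A packet is stored in the queue of the output it is destined for.
        self.queues[output].append(packet)

    def requests(self):
        # Outputs for which this input currently has at least one packet.
        return [j for j, q in enumerate(self.queues) if q]

    def dequeue(self, output):
        # Called when the scheduler connects this input to `output`.
        return self.queues[output].popleft()


port = VOQInputPort(num_outputs=3)
port.enqueue("p0", output=2)
port.enqueue("p1", output=0)
port.enqueue("p2", output=2)
print(port.requests())   # [0, 2]
print(port.dequeue(0))   # p1
```

Packet p1 arrived after p0 but can still leave first, because it waits in a different queue. With a single FIFO it would have been stuck behind p0 whenever output 2 was busy.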

Virtual Output Queues

VOQs: How Packets Move
[Figure: VOQs and scheduler]

Crossbar Scheduler in VOQ Architecture
Memory bandwidth = 2R.
The scheduler can be quite complex!

Question: do more lanes help?
Answer: it depends on the scheduling.
VOQs with bad scheduling: head-of-line blocking.
Good scheduling? Ayalon: depends on traffic matrix…

Crossbar Scheduler in VOQ Architecture
Which packets can I send during each configuration of the crossbar?

Switch core architecture
[Figure: ports #1 to #256, each with a port processor and optics, attached over the LCS protocol (cell data, request, grant/credit) to the crossbar and the scheduler]

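Basic Switch Model
[Figure: N×N switch with crossbar configuration S(n); arrivals A1(n), ..., AN(n) at the inputs, split into per-queue arrivals A11(n), ..., ANN(n); queue occupancies L11(n), ..., LNN(n); departures D1(n), ..., DN(n) at the outputs]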

Some definitions
3. Queue occupancies: L11(n), ..., LNN(n)

Some possible performance goals
When traffic is admissible

VOQ Switch Scheduling
The VOQ switch scheduling can be represented by a bipartite graph.
The left-hand side nodes of the bipartite graph are the input ports.
The right-hand side nodes of the bipartite graph are the output ports.
The edges between the nodes are requests for packet transmission between input ports and output ports.
[Figure: bipartite request graph with input ports A to F on the left and output ports 1 to 6 on the right]
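As a small illustration, the request graph can be derived directly from the VOQ occupancies: there is an edge from input i to output j whenever that queue is non-empty. The function name `request_graph` and the sample matrix are made up for this sketch.

```python
def request_graph(occupancy):
    """Return the request edges (input, output) of every non-empty VOQ.

    occupancy[i][j] is the number of cells queued at input i for output j.
    """
    return [
        (i, j)
        for i, row in enumerate(occupancy)
        for j, length in enumerate(row)
        if length > 0
    ]


# Hypothetical 3x3 occupancy matrix.
L = [
    [2, 0, 1],
    [0, 0, 3],
    [0, 0, 0],
]
print(request_graph(L))  # [(0, 0), (0, 2), (1, 2)]
```

The scheduler then has to pick a matching in this graph, that is, a set of edges in which no input and no output appears twice.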

Maximum size bipartite match
Intuition: maximizes instantaneous throughput.
[Figure: "request" graph with an edge for every non-empty queue (L11(n) > 0, ..., LN1(n) > 0), and the maximum size bipartite match chosen from it]

Network flows and bipartite matching
[Figure: request graph with inputs A to F and outputs 1 to 6, extended with a source s and a sink t]
Finding a maximum size bipartite matching is equivalent to solving a network flow problem with capacities and flows of size "1".

Network Flows
[Figure: directed graph with source s, sink t, and intermediate nodes a, b, c, d; the edge capacities are 10 or 1]
Let G = [V,E] be a directed graph with capacity cap(v,w) on edge [v,w].
A flow is an (integer) function f, chosen for each edge so that f(v,w) <= cap(v,w).
We wish to maximize the total flow from s to t.

A maximum network flow example
[Figure: the same network, with each edge labelled "capacity, flow"]
Step 1 (by inspection): flow is of size 10.
Step 2: flow is of size 10 + 1 = 11.
Maximum flow (not obvious): flow is of size 10 + 2 = 12.

Ford-Fulkerson method of augmenting paths
Set f(v,w) = -f(w,v) on all edges.
Define a residual graph R, in which res(v,w) = cap(v,w) – f(v,w).
Find paths from s to t along which every edge has positive residual capacity.
Augment the flow along each such path by the minimum residual capacity on the path.
Keep augmenting until no augmenting path remains.
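A minimal sketch of the augmenting-path method follows. It assumes breadth-first search to choose each path (the shortest-augmenting-path variant); the function name `max_flow` and the small sample network are hypothetical and are not the networks drawn on the slides.

```python
from collections import defaultdict, deque


def max_flow(capacities, source, sink):
    """Augmenting-path maximum flow; paths are found by BFS.

    capacities maps a directed edge (v, w) to cap(v, w).
    """
    res = defaultdict(int)   # residual capacity res(v, w)
    adj = defaultdict(set)
    for (v, w), cap in capacities.items():
        res[(v, w)] += cap
        adj[v].add(w)
        adj[w].add(v)        # reverse edge, so earlier flow can be undone

    total = 0
    while True:
        # Search for a path from source to sink with positive residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            v = queue.popleft()
            for w in adj[v]:
                if w not in parent and res[(v, w)] > 0:
                    parent[w] = v
                    queue.append(w)
        if sink not in parent:
            return total     # no augmenting path left

        # Walk back from the sink to collect the edges of the path.
        path = []
        w = sink
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]

        # Augment by the minimum residual capacity along the path.
        bottleneck = min(res[edge] for edge in path)
        for v, w in path:
            res[(v, w)] -= bottleneck
            res[(w, v)] += bottleneck
        total += bottleneck


# Hypothetical network, chosen only to exercise the code.
caps = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(caps, "s", "t"))  # 5
```

Pushing flow along a reverse edge is what lets a later path cancel part of an earlier, poorly chosen one. That is the difference between this method and choosing paths by inspection.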

Example of Residual Graph
[Figure: flow of size 10 and its residual graph R, with res(v,w) = cap(v,w) – f(v,w) and an augmenting path marked]

Step 2: flow is of size 10 + 1 = 11.
[Figure: flow and residual graph, with the next augmenting path marked]

Step 3: flow is of size 10 + 2 = 12.
[Figure: flow and residual graph]

Another Example: Ford-Fulkerson method
[Figure: network G with nodes s, a, b, c, d, t and its residual graph Gf, redrawn after each augmentation; edges are labelled flow/capacity]
f = 0: find augmenting path p; it carries 4 units, so f = 4.
f = 4: find augmenting path p; it carries 12 units, so f = 4 + 12 = 16.
f = 16: find augmenting path p; it carries 7 units, so f = 16 + 7 = 23.
f = 23: no more augmenting paths. Maximum flow is 23.

An example for Flow: Obvious solution
[Figure: input graph G from S to T with edge capacities 10 and 9, its residual graph Gr, and flow graph Gf]
Total flow = 10. Sub-optimal solution!

Flow algorithm – Optimal version
[Figure: input graph G, residual graph Gr, and flow graph Gf after each augmentation]
Total flow = 19 units!

Complexity of network flow problems
In general, it is possible to find a solution by considering at most V·E augmenting paths, by picking the shortest augmenting path first.
There are many variations, such as picking the path with the largest augmentation first.
The complexity of the algorithm is lower when the graph is bipartite.
There are techniques other than the Ford-Fulkerson method.

Ford-Fulkerson Algorithm – 1
Network flows and bipartite matching
Finding a maximum size bipartite matching is equivalent to solving a network flow problem with capacities and flows of size "1".
[Figure: bipartite graph with nodes a to f and 1 to 6, plus a source and a sink]

Increasing the flow by 1.

Augmenting flow along the augmenting path.

Maximum flow found! Thus maximum matching found.

Network flows and bipartite matching: Another Example
[Figure sequence on the request graph with inputs A to F, outputs 1 to 6, source s and sink t:
residual graph for the first three paths;
residual graph for the next two paths;
residual graph for the next augmenting path;
residual graph for the last augmenting path;
flow graph]

Maximum Size Matching
[Figure: the resulting maximum size matching between inputs A to F and outputs 1 to 6]
The technique extends easily to maximum weight matching (where a weight is assigned to each link), but that is more complicated and more time consuming.
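Because every capacity is 1, the flow computation reduces to searching for augmenting paths directly in the request graph. The following sketch assumes a simple depth-first search per input; the function name and the 3×3 request pattern are hypothetical.

```python
def maximum_size_matching(requests, num_outputs):
    """Maximum size bipartite matching by repeated augmenting-path search.

    requests[i] lists the outputs requested by input i.
    Returns a dict mapping each matched input to its output.
    """
    input_of = [None] * num_outputs   # input currently matched to each output

    def try_augment(i, visited):
        for j in requests[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take output j if it is free, or if the input holding it
            # can be moved to another output (the augmenting step).
            if input_of[j] is None or try_augment(input_of[j], visited):
                input_of[j] = i
                return True
        return False

    for i in range(len(requests)):
        try_augment(i, set())
    return {i: j for j, i in enumerate(input_of) if i is not None}


# Hypothetical requests: input 0 wants outputs 0 and 1, input 1 only output 0.
m = maximum_size_matching([[0, 1], [0], [1, 2]], num_outputs=3)
print(sorted(m.items()))  # [(0, 1), (1, 0), (2, 2)]
```

Input 0 is first given output 0 and is later moved to output 1 so that input 1 can be served. This is exactly the removal of an earlier edge that maximal matching algorithms do not allow.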

Complexity of Maximum Matchings
Maximum Size/Cardinality Matchings: algorithm by Dinic, O(N^(5/2)).
Maximum Weight Matchings: algorithm by Kuhn, O(N^3 log N).
ftp://dimacs.rutgers.edu/pub/netflow/matching/ (contains code for maximum size/weight matching algorithms)
In general: hard to implement in hardware. Slooooow.

Maximum size bipartite match
Intuition: maximizes instantaneous throughput for uniform traffic.
[Figure: "request" graph with an edge for every non-empty queue (L11(n) > 0, ..., LN1(n) > 0), and the maximum size bipartite match chosen from it]

Why doesn’t maximizing instantaneous throughput give 100% throughput for non-uniform traffic?
Three possible matches, S(n):

Maximum weight matching
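S*(n)
Weight could be length of queue or age of packet.
Achieves 100% throughput under all admissible traffic patterns.
[Figure: basic switch model with arrivals A1(n), ..., AN(n) and A11(n), ..., ANN(n), occupancies L11(n), ..., LNN(n), and departures D1(n), ..., DN(n); the "request" graph carries weights L11(n), ..., LN1(n), from which the maximum weight bipartite match is chosen]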
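To make the definition concrete, here is a brute-force sketch that takes queue length as the weight and tries every permutation of outputs. It is only meant for tiny N and is not how a real scheduler would compute the match; the function name and the 2×2 occupancy matrix are hypothetical.

```python
from itertools import permutations


def maximum_weight_match(L):
    """Return the match S*(n) maximizing the sum of the selected queue lengths.

    L[i][j] is the occupancy of the VOQ at input i for output j.
    Exhaustive search over all N! permutations, so only usable for small N.
    """
    n = len(L)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(L[i][perm[i]] for i in range(n)),
    )
    # Connections to empty queues carry no cell, so leave them out.
    return [(i, j) for i, j in enumerate(best) if L[i][j] > 0]


# Hypothetical occupancies: one long queue and two short ones.
L = [
    [10, 1],
    [1, 0],
]
print(maximum_weight_match(L))  # [(0, 0)]
```

A maximum size match would instead serve the two short queues, (0, 1) and (1, 0), and leave the long queue waiting. The example shows only that the two criteria can pick different matches.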

Packet Scheduling/Arbitration in Virtual Output Queues: Maximal Matching Algorithms

Maximum Matching in VOQ Architecture
[Figure: 4-port example comparing a maximum size matching with a maximum weight matching; the edge weights shown are 8 and 6]

Complexity of Maximum Matchings
Maximum Size/Cardinality Matchings: algorithm by Dinic, O(N^(5/2)).
Maximum Weight Matchings: algorithm by Kuhn, O(N^3 log N).
In general: hard to implement in hardware. Slooooow.

Maximal Matching
A maximal matching is a matching in which each edge is added one at a time and is not later removed from the matching.
I.e., no augmenting paths are allowed (they remove edges added earlier); it is like matching by inspection.
No input or output is left unnecessarily idle.

Example of Maximal Size Matching
[Figure: the same request graph (inputs A to F, outputs 1 to 6) shown twice, once with a maximal matching and once with a maximum matching]

Comments on Maximal Matchings
In general, maximal matching is much simpler to implement and has a much faster running time.
A maximal size matching is at least half the size of a maximum size matching.
A maximal weight matching is defined in the obvious way.
A maximal weight matching has at least half the weight of a maximum weight matching.
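A greedy pass over the request edges is enough to build a maximal size matching. The sketch below is one hypothetical way to write it; the edge order and port names are arbitrary.

```python
def maximal_matching(edges):
    """Greedy maximal matching: add each edge if both ends are still free.

    Edges are never removed once added, so no augmenting paths are used.
    """
    used_inputs, used_outputs, match = set(), set(), []
    for i, j in edges:
        if i not in used_inputs and j not in used_outputs:
            match.append((i, j))
            used_inputs.add(i)
            used_outputs.add(j)
    return match


# Hypothetical request graph: A requests 1 and 2, B requests only 1.
edges = [("A", 1), ("A", 2), ("B", 1)]
print(maximal_matching(edges))  # [('A', 1)]
```

Here the greedy result has size 1, while the maximum matching {(A, 2), (B, 1)} has size 2. This is the worst case allowed by the factor-of-two bound.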

PIM Maximal Size Matching Algorithm: Performance and Properties
It is among the very first practical schedulers proposed for VOQ architectures (used by DEC).
It is based on having arbiters at the inputs and outputs.
It iterates the following steps until no more requests can be accepted (or for a given number of iterations):
Request: Each unmatched input sends a request to every output for which it has a queued cell.
Grant (outputs): If an unmatched output receives any requests, it grants one, selected uniformly at random over all requests.
Accept (inputs): If an unmatched input receives any grants, it accepts one, selected at random among the outputs that granted to this input.
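The three steps can be sketched as a short simulation of one cell time. The function name `pim`, its parameters, and the request pattern are assumptions for illustration; a hardware scheduler performs each step in parallel across ports rather than in loops.

```python
import random


def pim(requests, n, iterations, rng=random):
    """Parallel Iterative Matching for one cell time (software sketch).

    requests is a set of (input, output) pairs with at least one queued cell.
    Returns a dict mapping each matched input to its output.
    """
    match = {}
    for _ in range(iterations):
        matched_outputs = set(match.values())
        grants = {}   # input -> outputs that granted it in this iteration
        for j in range(n):
            if j in matched_outputs:
                continue
            # Request: unmatched inputs holding a cell for output j.
            reqs = [i for i in range(n) if i not in match and (i, j) in requests]
            # Grant: output j picks one request uniformly at random.
            if reqs:
                grants.setdefault(rng.choice(reqs), []).append(j)
        if not grants:
            break     # nothing more can be matched
        # Accept: each granted input picks one of its grants at random.
        for i, outs in grants.items():
            match[i] = rng.choice(outs)
    return match


# Hypothetical 4x4 request pattern.
reqs = {(0, 0), (0, 1), (1, 0), (2, 2), (3, 2), (3, 3)}
print(pim(reqs, n=4, iterations=4, rng=random.Random(0)))
```

Every iteration that issues a grant adds at least one new match, so N iterations always suffice to reach a maximal matching in this sketch.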

Implementation of the parallel maximal matching algorithms
[Figure: block diagram with the state of the input queues (N^2 bits), N grant arbiters, N request arbiters, and the decision register]

Implementation of the parallel maximal matching algorithms (another similar way)

PIM Maximal Size Matching Algorithm: Performance and Properties
PIM: 1st Iteration
[Figure: 4×4 example. Step 1: Request; Step 2: Grant (random selection); Step 3: Accept (random selection)]

PIM: 2nd Iteration
[Figure: the same 4×4 example. Step 1: Request; Step 2: Grant; Step 3: Accept]

Traffic Types to evaluate Algorithms
Uniform traffic
Unbalanced traffic
Hotspot traffic

Parallel Iterative Matching
PIM with a single iteration

Parallel Iterative Matching
PIM with 4 iterations

Parallel Iterative Matching Analytical Results
Number of iterations to converge:

It is a fair algorithm in servicing inputs.
It can have 100% throughput under uniform traffic.
It converges to a maximal size matching in O(log N) iterations on average.
It has very poor performance (63% throughput) with 1 iteration, because the random grants of the outputs are not coordinated: several outputs can grant the same input, which accepts only one of them.
It is not easy to build random arbiters in hardware.
The best iterative maximal size matching algorithm takes O(N^2 log N) serial or O(log N) parallel time steps.
If the number of iterations is constant, then it can be implemented in constant time (that is why it is practical); however, the hardware design is not trivial.
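The 63% figure can be recovered with a short calculation. It assumes a saturated switch in which every input always has a cell for every output, so each output grants one of the N inputs uniformly and independently; these assumptions are a standard way to model the single-iteration case and are not spelled out on the slide.

```latex
% Probability that a given input receives no grant from any of the N outputs:
\Pr[\text{input } i \text{ receives no grant}] = \left(1 - \frac{1}{N}\right)^{N}

% An input that receives at least one grant accepts exactly one, so the
% expected fraction of matched inputs (the throughput) is
\text{throughput} = 1 - \left(1 - \frac{1}{N}\right)^{N}
\;\xrightarrow[N \to \infty]{}\; 1 - \frac{1}{e} \approx 0.632
```

For N = 4, for example, the expression gives 1 - (3/4)^4 ≈ 0.684, and it decreases toward 63% as N grows.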

RRM Maximal Size Matching Algorithm: Performance and Properties
Round Robin Matching (RRM) is easier to implement than PIM (in terms of designing the I/O arbiters).
The pointers of the arbiters move in a straightforward way.
It iterates the following steps until no more requests can be accepted (or for a given number of iterations):
Request. Each input sends a request to every output for which it has a queued cell.
Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer g_i to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.

RRM Maximal Size Matching Algorithm: Performance and Properties
Accept. If an input receives a grant, it accepts the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The pointer ai to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted output. If no grant is received, the pointer stays unchanged.
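The request, grant, and accept steps of RRM can be sketched as follows, for a single iteration per cell time. The function names `rr_pick` and `rrm_slot` are hypothetical, ports are numbered from 0, and the 2×2 saturated load is an assumed test pattern.

```python
def rr_pick(candidates, pointer, n):
    """First candidate at or after `pointer` in round-robin order, else None."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None


def rrm_slot(requests, grant_ptr, accept_ptr):
    """One RRM iteration; the pointer lists are updated in place."""
    n = len(grant_ptr)
    grants = {}   # input -> set of outputs granting it
    for j in range(n):
        reqs = {i for i in range(n) if (i, j) in requests}
        i = rr_pick(reqs, grant_ptr[j], n)
        if i is not None:
            grants.setdefault(i, set()).add(j)
            # RRM moves the grant pointer whether or not the grant is accepted.
            grant_ptr[j] = (i + 1) % n
    match = {}
    for i, outs in grants.items():
        j = rr_pick(outs, accept_ptr[i], n)
        match[i] = j
        accept_ptr[i] = (j + 1) % n
    return match


# Assumed load: a 2x2 switch in which every VOQ always holds a cell.
full = {(i, j) for i in range(2) for j in range(2)}
g, a = [0, 0], [0, 0]
served = sum(len(rrm_slot(full, g, a)) for _ in range(100))
print(served)  # 100
```

Both outputs always grant the same input, so only one of the two possible connections is made in each cell time. That is 100 cells served out of a possible 200.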

RRM Maximal Matching Algorithm (1)
[Figure sequence: 4-port example with round-robin pointers over positions 0 to 3. Step 1: Request; Step 2: Grant; Step 3: Accept]

Poor performance of RRM Maximal Matching Algorithm
[Figure: example traffic pattern]
50% Throughput

iSLIP Maximal Size Matching Algorithm: Performance and Properties
It is a scheduler used in most VOQ switches (e.g., Cisco).
It is exactly like the RRM algorithm, with the following change:
Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer g_i to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in the Accept phase.
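The change to the grant pointer update can be sketched as follows, again for a single iteration per cell time. The names `rr_pick` and `islip_slot` are hypothetical, ports are numbered from 0, and the saturated 2×2 load is an assumed test pattern.

```python
def rr_pick(candidates, pointer, n):
    """First candidate at or after `pointer` in round-robin order, else None."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None


def islip_slot(requests, grant_ptr, accept_ptr):
    """One iSLIP iteration; the pointer lists are updated in place."""
    n = len(grant_ptr)
    grants = {}   # input -> set of outputs granting it
    for j in range(n):
        reqs = {i for i in range(n) if (i, j) in requests}
        i = rr_pick(reqs, grant_ptr[j], n)
        if i is not None:
            grants.setdefault(i, set()).add(j)   # pointer not moved yet
    match = {}
    for i, outs in grants.items():
        j = rr_pick(outs, accept_ptr[i], n)
        match[i] = j
        accept_ptr[i] = (j + 1) % n
        # Only the output whose grant was accepted advances its pointer.
        grant_ptr[j] = (i + 1) % n
    return match


# Assumed load: a 2x2 switch in which every VOQ always holds a cell.
full = {(i, j) for i in range(2) for j in range(2)}
g, a = [0, 0], [0, 0]
served = sum(len(islip_slot(full, g, a)) for _ in range(100))
print(served)  # 199
```

Only the first cell time loses a connection. After it, the two grant pointers differ, and both outputs are matched in every later cell time.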

iSLIP Maximal Size Matching Algorithm
iSLIP: 1st Iteration
[Figure: 4×4 example marking, for each arbiter, the original pointer, the selected one, and the updated pointer. Step 1: Request; Step 2: Grant; Step 3: Accept]

iSLIP Maximal Size Matching Algorithm
iSLIP: 2nd Iteration
[Figure: the same 4×4 example. Step 1: Request; Step 2: Grant; Step 3: Accept; the pointers show no change]

Simple Iterative Algorithms: iSLIP
[Figure sequence: 4-port example with round-robin pointers over positions 0 to 3. Step 1: Request; Step 2: Grant; Step 3: Accept]

iSLIP Implementation
[Figure: block diagram with labels: programmable priority encoder, state, N grant arbiters and N accept arbiters (log2 N bits each), decision]

Hardware Design: Layout of the 256-bit Priority Encoder

Hardware Design: Layout of the 256-bit Grant Arbiter

FIRM Maximal Size Matching Algorithm: Performance and Properties
It is exactly like iSLIP, with a very small yet significant modification.
Grant (outputs): If an unmatched output receives a request, it grants the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request is granted. If the input accepts, the pointer to the highest priority element of the round-robin schedule is incremented to one location beyond the granted input. If the input does not accept, the pointer is set at the granted one.
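The difference from iSLIP lies only in what an output does with its pointer when its grant is not accepted. The two helper functions below are hypothetical names used to contrast the rules as stated above, with ports numbered from 0.

```python
def islip_grant_pointer(pointer, granted, accepted, n):
    """iSLIP: advance past the granted input only if the grant was accepted."""
    return (granted + 1) % n if accepted else pointer


def firm_grant_pointer(pointer, granted, accepted, n):
    """FIRM: as iSLIP when accepted; otherwise point at the granted input."""
    return (granted + 1) % n if accepted else granted


# An output with pointer 0 grants input 2, but input 2 accepts another output.
print(islip_grant_pointer(0, granted=2, accepted=False, n=4))  # 0
print(firm_grant_pointer(0, granted=2, accepted=False, n=4))   # 2
```

Under FIRM the output keeps input 2 at the highest priority, so it offers the grant to the same input again in the next cell time.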

Simple Iterative Algorithms: FIRM
[Figure: Step 3: Accept, with round-robin pointers over positions 0 to 3]

Pointer Synchronization
Why this is good: this small change prevents the output arbiters from moving in lock-step (being synchronized, i.e. pointing to the same input), leading to a dramatic improvement in performance.
If several outputs grant the same input, then no matter how this input chooses, only one match can be made, and the other outputs will be idle.
To get as many matches as possible, it is better that each output grants a different input.
Since each output will select the highest priority input if a request is received from this input, it is better to keep the output pointers desynchronized (pointing to different locations).

iSLIP Maximal Matching Algorithm
[Figure: example traffic pattern]
100% Throughput

Pointer Synchronization: Differences between RRM, iSlip & FIRM

Differences between RRM, iSlip & FIRM
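Input (accept pointer), same for RRM, iSLIP and FIRM:
  No grant: unchanged
  Granted: one location beyond the accepted one
Output (grant pointer):
  No request: unchanged (RRM, iSLIP, FIRM)
  Grant accepted: one location beyond the granted one (RRM, iSLIP, FIRM)
  Grant not accepted:
    RRM: one location beyond the granted one
    iSLIP: one location beyond the previously granted one (i.e. unchanged)
    FIRM: the granted one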

General remarks
Since all of these algorithms try to approximate maximum size matching, they can be unstable under non-uniform traffic.
They can achieve 100% throughput under uniform traffic.
With a large number of iterations, their performance is similar.
They have similar implementation complexity.
