
1 Parallelism in Network Systems Joint work with Sundar Iyer
Nick McKeown, Professor of Electrical Engineering and Computer Science, Stanford University. HP Labs, 10 September 2003.

2 Background
Parallelism is an obvious way to improve performance: more raw cycles per second, and multiple threads of execution.
Parallelism in computer systems: optimize for the common case, and accept the occasional snafu. The emphasis is on correctness: cache coherency, data dependency, data bypass in a pipeline.
Parallelism in network systems: we need performance guarantees. Does my 100Gb/s router really operate at 100Gb/s? Is my pipeline screwed up downstream? Mis-sequencing packets is a no-no.

3 Outline
I'll be describing some examples of parallelism and load-balancing to scale packet switches:
- Simple example: the Parallel Packet Switch (PPS), multiple packet switches in parallel.
- General problem: Constraint Sets.
- More interesting example: the Distributed Shared Memory (DSM) router.
- If there's still time: parallel packet buffers; load-balanced 2-stage routers.
Problems we'll encounter: mis-sequencing packets, and resource conflicts.

4 Basic idea of parallelism
[Figure: a "serial system" passes a rate-R line through a single packet-processing function (e.g. header processor, address lookup, packet buffer, or a complete packet switch); a "parallel system" spreads the same rate-R line over k copies of the function, each running at rate R/k.]
Q: If packets are load-balanced randomly, what can we say about the performance of the resulting system?

5 Problem 1: Example of parallelism causing mis-sequencing
[Figure: packets 1, 2, 3 arrive on a rate-R line, are spread over two rate-R/2 paths, and are recombined onto a rate-R line.]
If packet 1 takes longer to process or store than packet 2, their departure order could be reversed.

6 Problem 2: Example of parallelism causing resource conflicts
[Figure: packets 1, 2, 3 are written across two rate-R/2 memories; the scheduler asks to "read packet 1, then 3, then 2".]
If packet 1 and packet 3 are in the same memory, they can't be read contiguously.

7 Outline
I'll be describing some examples of parallelism and load-balancing to scale packet switches:
- Simple example: the Parallel Packet Switch (PPS), multiple packet switches in parallel.
- General problem: Constraint Sets.
- More interesting example: the Distributed Shared Memory (DSM) router.
- If there's still time: parallel packet buffers; load-balanced 2-stage routers.
Problems we'll encounter: mis-sequencing packets, and resource conflicts.

8 Parallel Packet Switch (PPS)
Goals:
- Build a big packet switch out of lots of little packet switches.
- Each memory runs slower than the line rate.
- No packet mis-sequencing.
- Emulation of a first-come first-served (FCFS) output-queued (OQ) switch.

9 Architecture of a PPS
Definition: a PPS consists of multiple identical lower-speed packet switches operating independently and in parallel. An incoming stream of packets is spread, packet-by-packet, by a demultiplexor across the slower packet switches, then recombined by a multiplexor at the output.

10 Architecture of a PPS
[Figure: N = 4 external ports at rate R; each input's demultiplexor spreads packets over k = 3 internal OQ switches via links of rate R/k, and each output's multiplexor recombines them.]

11 Emulation of an ideal OQ Switch
[Figure: the PPS and an ideal rate-R OQ switch are fed the same arrivals; the question ("=?") is whether their departures match: yes or no.]

12 Emulation Scenario
[Figure: a worked example on the N = 4, k = 3 PPS with internal links of rate R/3, following numbered packets from the demultiplexors through the three layers to the multiplexors.]

13 Why is there no Choice at the Input?
[Figure: with internal links of rate R/k and one arriving cell per time slot, writing a cell into a layer occupies that layer's link for k time slots, so the k links stay busy in a fixed pipeline and each arriving cell has exactly one layer it can go to: the input has no choice.]

14 Result of no Choice
[Figure: the consequence of no input choice: successive packets (j, 4, 5) destined to output 1 are forced into the same layer, which the output cannot drain in FCFS order over its slow link.]

15 Increasing choice using speedup
[Figure: the same N = 4, k = 3 PPS, but with internal links sped up to 2R/3 (speedup S = 2), so each input and output sees more than one available layer per time slot.]

16 Effect of Speedup on Choice
[Figure: a speedup of S = 2 with k = 10 links: each internal link runs at 2R/k, so a cell occupies a layer's link for only k/2 time slots, leaving roughly half the layers available at any time.]

17 Available Input Link Set (AIL)
Definition: AIL(i, n) is the set of layers to which external input port i can start writing a packet at time slot n.

18 Departure Time of a Packet (n')
Definition: the departure time of a packet, n', is the time at which it would have departed from an equivalent FIFO OQ switch.
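Since emulation hinges on knowing n', it helps to see how a shadow FIFO OQ switch yields it. Below is a minimal Python sketch, assuming one virtual FIFO per output that drains one packet per time slot; the class and names are illustrative, not the paper's implementation.

```python
class ShadowFifoOQ:
    """Tracks the departure time n' each packet would have in an
    equivalent FIFO output-queued switch (one virtual FIFO per output)."""

    def __init__(self, num_outputs: int):
        # Earliest free departure slot per output.
        self.next_free = [0] * num_outputs

    def departure_time(self, output: int, arrival_slot: int) -> int:
        # A FIFO OQ switch sends one packet per output per time slot, so a
        # packet departs at the later of its arrival slot and the first
        # slot not already claimed by earlier packets to the same output.
        n_prime = max(arrival_slot, self.next_free[output])
        self.next_free[output] = n_prime + 1
        return n_prime

shadow = ShadowFifoOQ(num_outputs=4)
print(shadow.departure_time(output=2, arrival_slot=0))  # 0
print(shadow.departure_time(output=2, arrival_slot=0))  # 1 (queued behind)
```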

19 Available Output Link Set (AOL)
Definition: AOL(j, n') is the set of layers from which output port j can start reading a packet at time slot n'.

20 Observation
Inputs can only send to layers in the AIL set; outputs can only read from layers in the AOL set.
[Figure: the N = 4, k = 3 PPS with 2R/3 internal links, highlighting the layers currently in AIL for an input and in AOL for an output.]

21 Lower Bounds on Choice Sets
Minimum size of AIL and AOL, with internal speedup S:
|AIL(i, n)| ≥ k − ⌈k/S⌉ + 1, and |AOL(j, n')| ≥ k − ⌈k/S⌉ + 1.

22 Overcoming resource conflict
A packet arriving at input i at time n, and due to depart from output j at time n', must be sent to a layer that belongs to both sets: some layer l ∈ AIL(i, n) ∩ AOL(j, n'). Such a layer exists whenever |AIL| + |AOL| > k.

23 Parallel Packet Switch Results
If S > 2, each packet is guaranteed to find a layer that belongs to both the AIL and AOL sets; hence a PPS with S > 2 can precisely emulate a FIFO output-queued switch for all traffic patterns.
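To make the dispatch rule concrete, here is a minimal Python sketch of choosing a layer in AIL(i, n) ∩ AOL(j, n'). The busy-until bookkeeping is an assumption standing in for the real per-link schedules.

```python
def choose_layer(k, input_busy_until, output_busy_until, n, n_prime):
    """input_busy_until[l]: slot until which input i's link to layer l is
    busy writing; output_busy_until[l]: slot until which output j's link
    from layer l is busy reading. Returns a layer that is free both for a
    write starting at slot n and a read starting at slot n_prime."""
    ail = {l for l in range(k) if input_busy_until[l] <= n}
    aol = {l for l in range(k) if output_busy_until[l] <= n_prime}
    candidates = ail & aol
    if not candidates:
        # The slide's result: with enough speedup this never happens.
        raise RuntimeError("no common layer; speedup too small")
    return min(candidates)  # any member works; pick deterministically
```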

24 Outline
I'll be describing some examples of parallelism and load-balancing to scale packet switches:
- Simple example: the Parallel Packet Switch (PPS), multiple packet switches in parallel.
- General problem: Constraint Sets.
- More interesting example: the Distributed Shared Memory (DSM) router.
- If there's still time: parallel packet buffers; load-balanced 2-stage routers.
Problems we'll encounter: mis-sequencing packets, and resource conflicts.

25 Output-Queued Router
[Figure: each output is a G/D/1 queue.]
Work-conservation gives 100% throughput, minimizes delay, and makes delay guarantees possible.
Problem: memory bandwidth is not scalable.
Previous attempts: very complicated or ad hoc.
Our approach: deterministic parallelism.

26 Parallel Output-Queued Router
[Figure: a single output buffered across k parallel memories, generalized to N outputs sharing the same k memories.]

27 Parallel Output-Queued Router (may not be work-conserving)
[Figure: a three-slot example (time slots 1, 2, 3) with N = 3 ports and k = 3 memories, tracking packets A5-A8, B5, B6, C5, C6.]
Each memory performs at most two memory operations per time slot: 1 write and 1 read.

28 General Problem
Problem: can we design a parallel output-queued, work-conserving router from slower parallel memories?
Theorem 1 (sufficiency): a parallel output-queued router is work-conserving with 3N − 1 memories that can each perform at most one memory operation per time slot.

29 Re-stating the Problem
There are K cages, each able to hold an infinite number of pigeons. Assume time is slotted and, in any one time slot:
- at most N pigeons can arrive and at most N can depart;
- at most 1 pigeon can enter or leave a cage via a pigeon hole;
- we know when each pigeon will depart.
For any router: how many cages K do we need so that every arriving pigeon can immediately be placed in a cage, and can depart at the right time?

30 Intuition
[Figure: packets with departure times DT = t and DT = t + x are placed into memories. The constraint tightens in stages: only one packet can enter a memory at time t; then, only one packet can enter or leave a memory at time t; finally, only one packet can enter or leave a memory at any one time.]

31 Constraint Sets
When a packet arrives in a time slot, it must choose a memory not chosen by:
- the N − 1 other packets that arrive in that time slot;
- the N other packets that depart in that time slot;
- the N − 1 other packets that will depart at the same (future) time as this packet.
Proof: by the pigeonhole principle, 3N − 1 memories (each performing one memory operation per time slot) are sufficient for the router to be work-conserving (see the sketch below).
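A minimal Python sketch of that choice rule for one time slot, assuming K = 3N − 1 single-ported memories; the data structures are illustrative, not from the paper.

```python
def assign_memories(K, arrivals, departing_now, departures_by_slot):
    """arrivals: the departure slots n' of this slot's <= N arriving
    packets. departing_now: memories being read this slot (<= N).
    departures_by_slot: future slot -> set of memories already holding a
    packet that departs then. Returns the memory chosen per arrival."""
    used_now = set(departing_now)          # constraint (b): reads this slot
    chosen = []
    for n_prime in arrivals:
        forbidden = used_now | departures_by_slot.get(n_prime, set())
        # |forbidden| <= (N reads) + (N-1 earlier writes) + (N-1 packets
        # sharing slot n'), i.e. <= 3N-2 < K, so a free memory must exist.
        memory = next(m for m in range(K) if m not in forbidden)
        used_now.add(memory)               # constraint (a): writes this slot
        departures_by_slot.setdefault(n_prime, set()).add(memory)  # (c)
        chosen.append(memory)
    return chosen
```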

32 The Parallel Shared Memory Router
[Figure: arriving packets (flows A, B, C) at rate R are written into K = 8 shared memories and read back out at rate R; each memory performs at most one operation, a write or a read, per time slot. In this example k = 7 memories don't suffice, but 8 do.]
Appeal: simplicity. (Built around a single shared memory it is not scalable; Juniper tried to implement this.)

33 Outline
I'll be describing some examples of parallelism and load-balancing to scale packet switches:
- Simple example: the Parallel Packet Switch (PPS), multiple packet switches in parallel.
- General problem: Constraint Sets.
- More interesting example: the Distributed Shared Memory (DSM) router.
- If there's still time: parallel packet buffers; load-balanced 2-stage routers.
Problems we'll encounter: mis-sequencing packets, and resource conflicts.

34 Distributed Shared Memory Router
[Figure: N line cards, each with rate-R ports and local memories, connected by a switch fabric.]
- The central memories are distributed to the line cards and shared.
- Memory and line cards can be added incrementally.
- 3N − 1 memories, i.e. a total memory bandwidth of 3NR, suffice for the router to be work-conserving.

35 Corollary 1
Problem: what switch bandwidth does a work-conserving DSM router need?
Corollary (sufficiency): a switch bandwidth of 4NR is sufficient for a distributed shared memory router to be work-conserving.
Intuition: each packet makes at most 3 memory accesses and 1 port access.

36 Corollary 2
Problem: what switching algorithm makes a DSM router work-conserving?
- Bus: no algorithm needed, but impractical.
- Crossbar: an algorithm is needed, because only permutations are allowed.
Corollary 2: an edge-coloring algorithm can switch packets for a work-conserving distributed shared memory router.
Intuition: follows from König's theorem - any bipartite graph with maximum degree Δ has an edge coloring with Δ colors. (A sketch follows below.)
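For concreteness, here is a sketch of the Δ-coloring that König's theorem guarantees, using the standard alternating-path (Kempe chain) flip. The mapping to the router (line cards and memories as the two vertex sets) and all names here are assumptions.

```python
def edge_color_bipartite(edges, delta):
    """edges: list of (u, v) pairs of a bipartite graph with maximum
    degree delta; vertex names must differ across the two sides.
    Returns color[i] in 0..delta-1, proper at every vertex."""
    at = {}                      # vertex -> {color: edge index}
    color = [None] * len(edges)

    def free(v):                 # smallest color unused at v (< delta exist)
        return next(c for c in range(delta) if c not in at.setdefault(v, {}))

    for i, (u, v) in enumerate(edges):
        cu, cv = free(u), free(v)
        if cu != cv:
            # cu may be busy at v: flip the cu/cv alternating path from v.
            # In a bipartite graph this path cannot end at u (parity), so
            # cu stays free at u and becomes free at v.
            path, x, c = [], v, cu
            while c in at[x]:
                e = at[x][c]
                path.append(e)
                a, b = edges[e]
                x = b if a == x else a
                c = cv if c == cu else cu
            for e in path:       # unhook, then swap the two colors
                a, b = edges[e]
                old = color[e]
                del at[a][old], at[b][old]
                color[e] = cv if old == cu else cu
            for e in path:       # re-hook under the new colors
                a, b = edges[e]
                at[a][color[e]] = e
                at[b][color[e]] = e
        color[i] = cu
        at[u][cu] = i
        at[v][cu] = i
    return color

# Usage: a 2-regular bipartite graph needs only 2 colors (time slots).
print(edge_color_bipartite(
    [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2")], delta=2))
```

Each color class is a matching, i.e. a crossbar permutation, which is why Δ colors translate directly into Δ fabric configurations per scheduling round.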

37 Summary - Routers which give 100% throughput

| Router | Fabric | # Mem. | Mem. BW | Total Mem. BW | Switch BW | Switch Algorithm |
|---|---|---|---|---|---|---|
| Output-Queued | Bus | N | (N+1)R | N(N+1)R | NR | None |
| Shared Mem. | Bus | 1 | 2NR | 2NR | 2NR | None |
| Input Queued | Crossbar | N | 2R | 2NR | NR | MWM |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 2NR | Maximal |
| Time Reserve* | Crossbar | 2N | 3R | 6NR | 3NR | — |
| PSM | Bus | k | 3NR/k | 3NR | 3NR | C. Sets |
| DSM (Juniper) | Crossbar | N | 3R | 3NR | 4NR | Edge Color |
| LB (IQ*) | Xbar | N | 3R | 3NR | 6NR | C. Sets |
| LB (IQ*) | Xbar | N | 4R | 4NR | 4NR | C. Sets |
| PPS - OQ | Clos | Nk | 2R(N+1)/k | 2N(N+1)R | 4NR | C. Sets |
| PPS - OQ | Clos | Nk | 4NR/k | 4NR | 4NR | C. Sets |
| PPS - Shared Memory | Clos | Nk | 2NR/k | 2NR | 2NR | None |

38 Summary - Routers which give delay guarantees

| Router | Fabric | # Mem. | Mem. BW | Total Mem. BW | Switch BW | Switch Algorithm |
|---|---|---|---|---|---|---|
| Output-Queued | Bus | N | (N+1)R | N(N+1)R | NR | None |
| Shared Mem. | Bus | 1 | 2NR | 2NR | 2NR | None |
| Input Queued | Crossbar | N | 2R | 2NR | NR | — |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 2NR | Marriage |
| Time Reserve | Crossbar | 2N | 3R | 6NR | 3NR | — |
| PSM | Bus | k | 4NR/k | 4NR | 4NR | C. Sets |
| DSM (Juniper) | Crossbar | N | 4R | 4NR | 5NR | Edge Color |
| LB (IQ*) | Xbar | N | 4R | 4NR | 8NR | C. Sets |
| LB (IQ*) | Xbar | N | 6R | 6NR | 6NR | C. Sets |
| PPS - OQ | Clos | Nk | 3R(N+1)/k | 3N(N+1)R | 6NR | C. Sets |
| PPS - OQ | Clos | Nk | 6NR/k | 6NR | 6NR | C. Sets |
| PPS - Shared Memory | Clos | Nk | 2NR/k | 2NR | 2NR | — |

39 Outline
I'll be describing some examples of parallelism and load-balancing to scale packet switches:
- Simple example: the Parallel Packet Switch (PPS), multiple packet switches in parallel.
- General problem: Constraint Sets.
- More interesting example: the Distributed Shared Memory (DSM) router.
- If there's still time: parallel packet buffers; load-balanced 2-stage routers.
Problems we'll encounter: mis-sequencing packets, and resource conflicts.

40 Packet Buffering
[Figure: on an input or output line card, packets arrive and depart at line rate R through a buffer memory managed by a scheduler; the same structure appears in a shared memory buffer.]
Big: for TCP to work well, the buffer needs to hold one RTT (about 0.25 s) of data.
Fast: clearly, the buffer needs to store (retrieve) packets as fast as they arrive (depart).

41 An Example: packet buffers for a 40Gb/s line card
[Figure: a buffer manager sits in front of a 10 Gbit buffer memory; the write rate R is one 40B packet every 8 ns, the read rate R is one 40B packet every 8 ns, and scheduler requests cause random access.]
The problem is solved if a memory can be randomly accessed every 4 ns (one write and one read per 8 ns packet time) and can store 10 Gbits of data.
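The slide's numbers can be checked directly; the helper below is just that arithmetic, not a design tool.

```python
def line_card_buffer(rate_bps, rtt_s=0.25, pkt_bytes=40):
    """Buffer size and timing for a line card at the given rate."""
    size_bits = rate_bps * rtt_s            # one RTT of data (TCP rule)
    pkt_time = pkt_bytes * 8 / rate_bps     # arrival/departure spacing
    mem_op_time = pkt_time / 2              # one write + one read per packet
    return size_bits, pkt_time, mem_op_time

size, pkt_t, op_t = line_card_buffer(40e9)
print(size)   # 1e10  -> 10 Gbits of buffering
print(pkt_t)  # 8e-09 -> one 40B packet every 8 ns
print(op_t)   # 4e-09 -> a random memory access every 4 ns
```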

42 Available Memory Technology
Use SRAM? + Fast enough random access time, but - too low density to store 10 Gbits of data.
Use DRAM/RLDRAM? + High density means we can store the data, but - can't meet the random access time.

43 Problem
How can we design high-speed packet buffers from commodity memories?

44 Can't we just use lots of DRAMs in parallel?
[Figure: eight buffer memories in parallel behind a buffer manager; 40B packets arrive and depart every 8 ns at rate R while the manager transfers 320B words to and from the DRAMs, driven by scheduler requests.]

45 Works fine if there is only one FIFO queue
[Figure: the buffer manager (on-chip SRAM) collects arriving 40B packets; full 320B words are written to, and read from, all DRAMs in parallel.]
Aggregate 320B for the queue in fast SRAM, then read and write all DRAMs in parallel (sketched below).
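A minimal sketch of the write side of that aggregation, assuming b = 320 bytes per parallel DRAM operation as in the figure; the read side mirrors it, dequeuing a full 320B word and dribbling out 40B packets.

```python
class OneQueueBuffer:
    """Single FIFO staged through SRAM into wide parallel-DRAM words."""

    B = 320  # bytes moved per parallel DRAM write/read (8 DRAMs x 40B)

    def __init__(self):
        self.tail_sram = bytearray()   # packets waiting to fill a word
        self.dram_words = []           # stand-in for the DRAM body

    def write(self, pkt: bytes):
        self.tail_sram += pkt
        if len(self.tail_sram) >= self.B:
            # One wide operation touches all DRAMs at once, so each DRAM
            # only needs 1/8th of the line's random-access rate.
            self.dram_words.append(bytes(self.tail_sram[:self.B]))
            del self.tail_sram[:self.B]
```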

46 In practice, buffer holds many FIFOs
[Figure: the same parallel-DRAM arrangement, but with Q separate FIFOs striped across the 320B words; which word the scheduler needs next ("?B") is unpredictable.]
e.g. in an IP router Q might be 200; in an ATM switch Q might be 10^6.
We don't know which head-of-line packet the scheduler will request next.

47 Parallel Packet Buffer: Hybrid Memory Hierarchy
[Figure: a large DRAM holds the body of each of the Q FIFOs, accessed b bytes at a time (b = degree of parallelism). On the ASIC, a small tail SRAM caches the FIFO tails for arriving packets and a small head SRAM caches the FIFO heads for departing packets; the buffer manager moves b-byte blocks between SRAM and DRAM under scheduler requests.]

48 Re-stating the Problem
What is the minimum SRAM size needed so that every packet is available in the SRAM immediately when requested?
Theorem 3 (necessity): an SRAM of size Qw = Q(b − 1)(2 + ln Q) bytes is necessary, where w is the size of each per-queue cache.

49 Theorem (sufficiency)
An SRAM cache of size Qw = Qb(2 + ln Q) bytes is sufficient for every packet to be available in the SRAM immediately when requested.
Examples (computed below):
- 40Gb/s line card, b = 640, Q = 128: SRAM = 560 kBytes.
- 160Gb/s line card, b = 2560, Q = 512: SRAM = 10 MBytes.
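The two examples follow from plugging into Qb(2 + ln Q); ln is the natural log.

```python
import math

def sram_bytes(Q, b):
    """Sufficient head-cache size, Qb(2 + ln Q) bytes, from the theorem."""
    return Q * b * (2 + math.log(Q))

print(sram_bytes(128, 640) / 1e3)    # ~561 kB  (40 Gb/s line card)
print(sram_bytes(512, 2560) / 1e6)   # ~10.8 MB (160 Gb/s line card)
```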

50 Theorem
Problem: what is the minimum SRAM size so that every packet is available in the SRAM within a bounded pipeline latency of its request?
Theorem (necessity and sufficiency): an SRAM cache of size Qw = Q(b − 1) bytes is both necessary and sufficient if the pipeline latency is Q(b − 1) + 1 time slots.

51 Intuition
The algorithm replenishes the queue that is going to be in the most urgent need of replenishment. If we use a lookahead buffer to learn the requests in advance, we can identify the queue that will empty earliest (sketched below). This increases the pipeline latency from when a request is made until the byte is available.
Example: 160Gb/s line card, b = 2560, Q = 128: SRAM = 160 kBytes, latency is 8 ms.
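A sketch of that lookahead selection, paraphrasing the slide's intuition rather than reproducing the paper's exact algorithm; the names and the one-unit-per-request granularity are assumptions.

```python
def queue_to_replenish(head_counts, lookahead):
    """head_counts[q]: units of queue q currently in the head SRAM.
    lookahead: the next scheduler requests, e.g. [3, 3, 7, 1] (queue ids).
    Returns the queue that runs dry earliest under those requests."""
    remaining = dict(head_counts)
    for q in lookahead:
        remaining[q] -= 1          # one head unit consumed per request
        if remaining[q] < 0:
            return q               # q would run dry here: refill it now
    # No queue runs dry within the lookahead window; fall back to the
    # most deficient queue so the cache never falls behind.
    return min(remaining, key=remaining.get)
```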

52 Discussion
[Figure: for Q = 1000 and b = 10, required queue length w versus pipeline latency x, from the zero-latency case down to the maximum-latency case. The dotted curves plot P(i, x) for i = 1 … Q, where x is the lookahead (latency); the required queue length, i.e. the maximum deficit, at a given x is the upper envelope: the maximum of P(i, x) over all i.]

53 Outline
I'll be describing some examples of parallelism and load-balancing to scale packet switches:
- Simple example: the Parallel Packet Switch (PPS), multiple packet switches in parallel.
- General problem: Constraint Sets.
- More interesting example: the Distributed Shared Memory (DSM) router.
- If there's still time: parallel packet buffers; load-balanced 2-stage routers.
Problems we'll encounter: mis-sequencing packets, and resource conflicts.

54 References
All available from: http://www.stanford.edu/~nickm/papers
- Parallel Packet Switch (PPS): "Analysis of the Parallel Packet Switch Architecture", Sundar Iyer and Nick McKeown, IEEE/ACM Transactions on Networking, April 2003.
- Constraint Sets and DSM routers: "Routers with a Single Stage of Buffering", Sundar Iyer, Rui Zhang, and Nick McKeown, ACM SIGCOMM, Aug.
- Fast packet buffers: "Designing Packet Buffers for Router Line Cards", Sundar Iyer, R. R. Kompella, and Nick McKeown, paper in submission, October 2002.
- Load-balanced routers: "Scaling Internet Routers Using Optics", Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu, David Miller, Mark Horowitz, Olav Solgaard, and Nick McKeown, ACM SIGCOMM, Aug.

