Scaling Internet Routers Using Optics. UW, October 16th, 2003. Nick McKeown, joint work with the research groups of David Miller, Mark Horowitz, and Olav Solgaard.


1 Scaling Internet Routers Using Optics. UW, October 16th, 2003. Nick McKeown. Joint work with the research groups of David Miller, Mark Horowitz, Olav Solgaard. Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu. Department of Electrical Engineering, Stanford University. Paper: http://klamath.stanford.edu/~nickm/papers/sigcomm2003.pdf Web site: http://klamath.stanford.edu/or

2 Backbone router capacity. [Chart: router capacity per rack, from 1Gb/s to 1Tb/s, growing 2x every 18 months.]

3 Backbone router capacity. [Chart: the same capacity curve (2x every 18 months) with traffic overlaid, growing 2x every year.]

4 Extrapolating. [Chart: extrapolating router capacity (2x every 18 months, about 1Tb/s today) against traffic (2x every year, reaching about 100Tb/s); by 2015 the disparity is 16x.]

5 Consequence. Unless something changes, operators will need:
- 16 times as many routers, consuming
- 16 times as much space,
- 256 times the power,
- costing 100 times as much.
And actually they will need more than that.

6 Stanford 100Tb/s Internet Router. Goal: study scalability.
- Challenging, but not impossible
- Two orders of magnitude faster than deployed routers
- We will build components to show feasibility
[Diagram: electronic linecards #1 through #625 (line termination, IP packet processing, packet buffering) connected through an optical switch; link rates labeled 40Gb/s, 160Gb/s and 160-320Gb/s; 100Tb/s = 640 * 160Gb/s.]

7 Throughput guarantees. Operators increasingly demand throughput guarantees:
- to maximize use of expensive long-haul links,
- for predictability and planning.
Despite lots of effort and theory, no commercial router today has a throughput guarantee.

8 Requirements of our router.
- 100Tb/s capacity
- 100% throughput for all traffic
- Must work with any set of linecards present
- Use technology available within 3 years
- Conform to RFC 1812

9 What limits router capacity? [Chart: approximate power consumption per rack.] Power density is the limiting factor today.

10 Trend: multi-rack routers reduce power density. [Diagram: linecard racks connected through a separate switch (crossbar) rack.]

11 Alcatel 7670 RSP, Juniper TX8/T640, Chiaro, Avici TSR.

12 Limits to scaling.
- Overall power is dominated by the linecards:
  - sheer number,
  - optical WAN components,
  - per-packet processing and buffering.
- But power density is dominated by the switch fabric.

13 Trend: multi-rack routers reduce power density. [Diagram: separate switch and linecard racks.] The limit today is about 2.5Tb/s, set by:
- electronics,
- the scheduler, which scales at less than 2x every 18 months,
- opto-electronic conversion.

14 Multi-rack routers. [Diagram: WAN links enter and leave linecards housed in separate racks, interconnected by a central switch fabric.]

15 Question. Can we instead use an optical fabric at 100Tb/s with 100% throughput? The conventional answer is no:
- the switch would need to be reconfigured too often, and
- 100% throughput requires a complex electronic scheduler.

16 Outline.
- How to guarantee 100% throughput?
- How to eliminate the scheduler?
- How to use an optical switch fabric?
- How to make it scalable and practical?

17 100% throughput? [Diagram: N inputs and N outputs, each at rate R; router capacity = NR, switch capacity = N²R.]

18 If traffic is uniform. [Diagram: a fixed mesh of R/N channels between every input and every output; each input spreads its rate-R load as R/N per output.]

19 Real traffic is not uniform. [Diagram: the same fixed R/N mesh under non-uniform traffic, with a question mark: does it still give 100% throughput?]

20 Two-stage load-balancing switch. [Diagram: a load-balancing stage followed by a switching stage, each a fixed mesh of R/N channels.] 100% throughput for weakly mixing, stochastic traffic. [C.-S. Chang, Valiant]
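Because no scheduler is involved, the mechanism is easy to simulate. Below is a minimal slot-by-slot sketch (my own illustration, not the Stanford implementation; the Bernoulli traffic model, virtual-output-queue layout and parameter values are all assumptions): both stages walk a fixed round-robin permutation, the first stage spreads cells over the intermediate ports regardless of destination, and the middle-stage queues stay bounded even for non-uniform (but admissible) traffic.

```python
# Toy slot-by-slot simulation of a two-stage load-balanced switch.
# Illustrative sketch only: traffic model, queue layout and parameters are
# assumptions, not the Stanford design. Both stages follow a fixed
# round-robin permutation, so no centralized scheduler is needed.
import random
from collections import deque

N = 8            # ports
SLOTS = 20000    # simulated time slots
LOAD = 0.9       # offered load per input (fraction of line rate)

random.seed(1)

# Middle-stage virtual output queues: voq[k][d] holds cells parked at
# intermediate port k and destined for output d.
voq = [[deque() for _ in range(N)] for _ in range(N)]
arrived = delivered = 0

for t in range(SLOTS):
    # Stage 1 (load balancing): input i is connected to intermediate
    # port (i + t) mod N and forwards at most one cell per slot,
    # regardless of the cell's destination.
    for i in range(N):
        if random.random() < LOAD:
            # Non-uniform but admissible traffic: input i mostly targets
            # output (i + 1) mod N, so no output is oversubscribed.
            dest = (i + 1) % N if random.random() < 0.75 else random.randrange(N)
            mid = (i + t) % N
            voq[mid][dest].append(t)   # cell payload: its arrival slot
            arrived += 1

    # Stage 2 (switching): intermediate port k is connected to output
    # (k + t) mod N and sends one cell for that output if it has one.
    for k in range(N):
        out = (k + t) % N
        if voq[k][out]:
            voq[k][out].popleft()
            delivered += 1

backlog = sum(len(q) for row in voq for q in row)
print(f"arrived={arrived} delivered={delivered} backlog={backlog}")
```

The sketch deliberately ignores the packet mis-sequencing issue raised on slide 24.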

21 [Animation: packets, most of them destined to linecard 3, enter the first (load-balancing) stage and are spread across the intermediate linecards.]

22 [Animation, continued: the second (switching) stage forwards the spread packets to their destination.]

23 Chang's load-balanced switch: good properties.
1. 100% throughput for a broad class of traffic
2. No scheduler needed
=> Scalable

24 Chang's load-balanced switch: bad properties.
1. Packet mis-sequencing
2. Pathological traffic patterns: throughput can drop to 1/N-th of capacity
3. Uses two switch fabrics: hard to package
4. Doesn't work with some linecards missing: impractical
FOFF, a load-balancing algorithm (see paper for details): packet sequence maintained, no pathological patterns, 100% throughput always, delay within a bound of ideal.

25 Single mesh switch. [Diagram: the two stages combined into one mesh of 2R/N channels; one linecard highlighted.]

26 Packaging. [Diagram: the linecards connect through a backplane; channels labeled 2R/N and R/N.]

27 Many fabric options. The mesh of N channels, each at rate 2R/N, can be built from any permutation network:
- Space: full uniform mesh
- Time: round-robin crossbar
- Wavelength: static WDM (channels C_1, C_2, ..., C_N)

28 Static WDM switching: Arrayed Waveguide Grating Router (AWGR), passive and consuming almost zero power. [Diagram: 4 inputs each transmit 4 WDM channels labeled A, B, C, D, each at rate 2R/N; the AWGR delivers all the A channels to one output, all the B channels to another, and so on.]
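The sketch below (my illustration; the cyclic routing rule "a signal on wavelength k at input i exits at output (i + k) mod N" is the standard AWGR property, assumed here rather than taken from the slide) checks that giving the channel from input i to output j the wavelength (j - i) mod N is conflict-free, so a single passive device provides a dedicated static channel for every input-output pair.

```python
# Static WDM over an N x N arrayed waveguide grating router (AWGR), sketched.
# Assumption: the standard cyclic AWGR property, input i on wavelength k
# exits at output (i + k) mod N.
N = 4

def wavelength(i: int, j: int) -> int:
    """Wavelength input i must use so the AWGR routes it to output j."""
    return (j - i) % N

def awgr_output(i: int, k: int) -> int:
    """Cyclic routing of the passive AWGR."""
    return (i + k) % N

for i in range(N):
    lambdas = [wavelength(i, j) for j in range(N)]
    assert sorted(lambdas) == list(range(N))          # input i uses every wavelength once
    assert all(awgr_output(i, wavelength(i, j)) == j  # and each channel lands correctly
               for j in range(N))

for j in range(N):
    # Each output also hears each wavelength exactly once, so demuxing works.
    assert sorted(wavelength(i, j) for i in range(N)) == list(range(N))

print("conflict-free static WDM mesh over the AWGR for N =", N)
```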

29 Linecard dataflow. [Diagram: packets arriving at rate R are spread over N WDM channels into the mesh, and collected from N WDM channels on the way out, again at rate R.]

30 Problems of scale.
- For N < 64, WDM is a good solution.
- We want N = 640.
- So we need to decompose the mesh.

31 Decomposing the mesh. [Diagram: 8 linecards fully meshed with 2R/8 channels.]

32 Decomposing the mesh. [Diagram: the same 8 linecards arranged in groups; the 2R/8 channels are aggregated into 2R/4 channels between groups using TDM and WDM.]

33 When N is too large: decompose into groups (or racks) 1, 2, ..., G. [Diagram: Group/Rack 1 through Group/Rack G, each with links 1..2L at rate 2R, interconnected through an Arrayed Waveguide Grating Router (AWGR).]

34 When a linecard is missing.
- Each linecard spreads its data equally over every other linecard.
- Problem: if one linecard is missing or has failed, the spreading no longer works.

35 When a linecard fails. [Diagram: three linecards; with fixed 2R/3 channels a transmitter can reach only 2R/3 + 2R/3 = (4/3)R of its 2R when one linecard is gone; with finer 2R/6 sub-channels the freed capacity can be reassigned, giving 2R/3 + 2R/6 + 2R/3 + 2R/6 = 2R.] Solution:
1. Move light beams: replace the AWGR with a MEMS switch, reconfigured when a linecard is added, removed or fails.
2. Finer channel granularity: multiple paths.
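A tiny sketch of the rate arithmetic on this slide (illustrative only; the respread flag is my stand-in for reconfiguring the MEMS switch or subdividing channels, not the exact mechanism):

```python
# Injectable rate per transmitting linecard when some of the N linecards are
# missing, for the uniform 2R/N mesh (illustrative sketch).
from fractions import Fraction

def usable_rate(N: int, failed: int, R: Fraction, respread: bool) -> Fraction:
    """Rate a transmitter can actually inject, out of its 2R total.

    respread=False: channels are fixed at 2R/N each, so capacity pointed at
    missing linecards is lost (the problem on this slide).
    respread=True : the 2R is re-spread evenly over the survivors, i.e.
    2R/(N - failed) per surviving linecard.
    """
    alive = N - failed
    if respread:
        return 2 * R
    return alive * (2 * R / N)

R = Fraction(1)
print(usable_rate(3, 1, R, respread=False))  # 4/3 * R, as on the slide
print(usable_rate(3, 1, R, respread=True))   # 2R, full rate restored
```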

36 Solution: use transparent MEMS switches. [Diagram: Group/Rack 1 through Group/Rack G = 40, each with links 1..2L at rate 2R, connected through a bank of MEMS switches with G ports each.] Theorems:
1. Require L + G - 1 MEMS switches.
2. Polynomial-time reconfiguration algorithm.
The MEMS switches are reconfigured only when a linecard is added, removed or fails.

37 Hybrid Architecture: Logical View

38 Hybrid Electro-Optical Architecture

39 Number of MEMS switches. [Diagram: linecards interconnected through a crossbar versus through static MEMS, with all links at rate R.]

40 Number of MEMS switches. [Diagram: the same comparison when the linecard counts differ, with unequal link rates labeled R/3, 2R/3 and 4R/3.]

41 Number of MEMS needed for a schedule.
- L_i: number of linecards in group i, 1 ≤ i ≤ G; N = L_1 + ... + L_G.
- Group i needs to send to group j an amount of traffic proportional to L_i · L_j / N.
- Assume each group can send at most R to each MEMS. Then the number of MEMS needed between groups i and j is A_ij = ⌈L_i · L_j / N⌉.

42 Number of MEMS needed for a schedule.
- The number of MEMS needed for group i to send to group j is A_ij.
- The total number of MEMS needed for group i is the sum of the A_ij over all receiving groups j: A_i1 + A_i2 + ... + A_iG.

43 Constraints for the TDM schedule.
1. Latin square: in any period of N time-slots, each transmitting linecard is connected to each receiving linecard exactly once.
2. MEMS constraint: in any time-slot, there are at most A_ij connections between transmitting group i and receiving group j, where A_ij = ⌈L_i · L_j / N⌉.

44 Example.
- Assume L_1 = 3, L_2 = 2, L_3 = 1, so N = 6.
- Then A = [2 1 1; 1 1 1; 1 1 1].
- E.g., A_11 = 2: at most 2 packets from the first group to the first group in each time-slot.
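A small bookkeeping sketch (mine, not from the talk; the function names are made up) that reproduces this matrix from the A_ij formula of slide 43 and checks the full 100Tb/s configuration against the L + G - 1 bound of slide 36:

```python
# Bookkeeping for the number of MEMS switches in the hybrid architecture.
from math import ceil

def mems_matrix(L):
    """A_ij = ceil(L_i * L_j / N) for group sizes L, with N = sum(L)."""
    N = sum(L)
    return [[ceil(Li * Lj / N) for Lj in L] for Li in L]

def max_group_total(L):
    """Largest number of MEMS any single group needs (sum of a row of A)."""
    return max(sum(row) for row in mems_matrix(L))

# Slide 44 example: L1=3, L2=2, L3=1.
print(mems_matrix([3, 2, 1]))        # [[2, 1, 1], [1, 1, 1], [1, 1, 1]]

# 100Tb/s router: G=40 racks of L=16 linecards. Every A_ij is 1, and no group
# needs more than L + G - 1 = 55 MEMS switches (the bound from slide 36).
L, G = 16, 40
assert max_group_total([L] * G) <= L + G - 1
print(max_group_total([L] * G))      # 40
```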

45 Bad TDM transmit schedule (entry = receiving linecard for each transmitting linecard, per time-slot):

        t=0  t=1  t=2  t=3  t=4  t=5
  LC 1   1    2    3    4    5    6
  LC 2   6    1    2    3    4    5
  LC 3   5    6    1    2    3    4
  LC 4   4    5    6    1    2    3
  LC 5   3    4    5    6    1    2
  LC 6   2    3    4    5    6    1

Continuing the slide-44 example (groups {1,2,3}, {4,5}, {6}): at t = 2, linecards 1, 2 and 3 all send into group 1, exceeding A_11 = 2.

46 Good TDM transmit schedule:

        t=0  t=1  t=2  t=3  t=4  t=5
  LC 1   1    2    3    4    5    6
  LC 2   5    1    2    3    6    4
  LC 3   6    5    4    1    2    3
  LC 4   2    3    1    6    4    5
  LC 5   4    6    5    2    3    1
  LC 6   3    4    6    5    1    2
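A checker sketch for the two constraints of slide 43 (my own encoding: schedule[lc][t] is the 1-indexed receiving linecard, as on the slides; the helper names are made up). Applied to the two schedules above, it rejects the bad one and accepts the good one.

```python
# Check a TDM transmit schedule against the constraints of slide 43:
# (1) Latin square over the period, (2) at most A_ij = ceil(L_i*L_j/N)
# group-i -> group-j connections in any slot.
from math import ceil

def group_of(lc, L):
    """Group index of 1-indexed linecard lc, given group sizes L."""
    for g, size in enumerate(L):
        if lc <= size:
            return g
        lc -= size
    raise ValueError("linecard out of range")

def valid(schedule, L):
    N = sum(L)
    A = [[ceil(Li * Lj / N) for Lj in L] for Li in L]
    # Latin-square constraint: every row (one transmitter over the period) and
    # every column (one time-slot) is a permutation of the N receivers.
    cols = [[schedule[lc][t] for lc in range(N)] for t in range(N)]
    if any(sorted(seq) != list(range(1, N + 1)) for seq in list(schedule) + cols):
        return False
    # MEMS constraint: per-slot group-to-group connection counts <= A_ij.
    for t in range(N):
        counts = [[0] * len(L) for _ in L]
        for lc in range(N):
            i, j = group_of(lc + 1, L), group_of(schedule[lc][t], L)
            counts[i][j] += 1
            if counts[i][j] > A[i][j]:
                return False
    return True

L = [3, 2, 1]                                                          # slide 44
bad  = [[1, 2, 3, 4, 5, 6], [6, 1, 2, 3, 4, 5], [5, 6, 1, 2, 3, 4],
        [4, 5, 6, 1, 2, 3], [3, 4, 5, 6, 1, 2], [2, 3, 4, 5, 6, 1]]   # slide 45
good = [[1, 2, 3, 4, 5, 6], [5, 1, 2, 3, 6, 4], [6, 5, 4, 1, 2, 3],
        [2, 3, 1, 6, 4, 5], [4, 6, 5, 2, 3, 1], [3, 4, 6, 5, 1, 2]]   # slide 46
print(valid(bad, L), valid(good, L))                                  # False True
```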

47 Configuration algorithm.
1. Assign connections between groups so that the MEMS constraint is satisfied.
2. Assign group connections to specific linecards so that there is exactly one connection per linecard pair in the schedule.
Comments:
- The algorithm is surprisingly complex.
- Best running time so far: 40 seconds for 640 linecards.

48 Challenges. [Diagram: linecard dataflow at R = 160Gb/s, with address lookup, a packet switch/buffer, and WDM spreading over channels 1..G.] Two open questions: how to build a 250ms, 160Gb/s packet buffer, and how to do low-cost, low-power optoelectronic conversion.
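For scale, the buffering challenge is simple arithmetic (my back-of-the-envelope, not from the slide):

  160 Gb/s x 0.25 s = 40 Gbit = 5 GByte per linecard,

which is why slide 49 pairs the buffer-manager ASIC with 250ms of external DRAM.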

49 What we are building.
- Chip #1: 160Gb/s packet buffer CMOS ASIC: a 90nm buffer-manager ASIC with 250ms of DRAM, with links labeled 160Gb/s and 320Gb/s.
- Chip #2: 16 x 55 opto-electronic crossbar: 16 x 10Gb/s to the linecards and 55 x 10Gb/s to the optical fabric, with 1500nm optical source, optical modulator and optical detector.

50 100Tb/s load-balanced router. [Diagram: linecard racks 1 through G = 40, each with L = 16 linecards at 160Gb/s, connected to a rack of 40 x 40 MEMS switches; the MEMS switch rack consumes < 100W.]

