Big, Fast Routers Dave Andersen CMU CS 15-744.

Slides:



Advertisements
Similar presentations
Router Internals CS 4251: Computer Networking II Nick Feamster Spring 2008.
Advertisements

Router Internals CS 4251: Computer Networking II Nick Feamster Fall 2008.
IP Router Architectures. Outline Basic IP Router Functionalities IP Router Architectures.
A Search Memory Substrate for High Throughput and Low Power Packet Processing Sangyeun Cho, Michel Hanna and Rami Melhem Dept. of Computer Science University.
NetFPGA Project: 4-Port Layer 2/3 Switch Ankur Singla Gene Juknevicius
A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.
Computer Networking Lecture 9 – IP Packets, Routers.
Network Algorithms, Lecture 4: Longest Matching Prefix Lookups George Varghese.
1 o Two issues in practice – Scale – Administrative autonomy o Autonomous system (AS) or region o Intra autonomous system routing protocol o Gateway routers.
Bio Michel Hanna M.S. in E.E., Cairo University, Egypt B.S. in E.E., Cairo University at Fayoum, Egypt Currently is a Ph.D. Student in Computer Engineering.
© 2009 Cisco Systems, Inc. All rights reserved. SWITCH v1.0—4-1 Implementing Inter-VLAN Routing Deploying Multilayer Switching with Cisco Express Forwarding.
Based on slides from Dave Andersen and Nick Feamster
Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 1 High Speed Router Design Shivkumar Kalyanaraman Rensselaer Polytechnic Institute
Router Architecture : Building high-performance routers Ian Pratt
KARL NADEN – NETWORKS (18-744) FALL 2010 Overview of Research in Router Design.
Nick McKeown CS244 Lecture 6 Packet Switches. What you said The very premise of the paper was a bit of an eye- opener for me, for previously I had never.
What's inside a router? We have yet to consider the switching function of a router - the actual transfer of datagrams from a router's incoming links to.
Spring 2002CS 4611 Router Construction Outline Switched Fabrics IP Routers Tag Switching.
UCB Switches Jean Walrand U.C. Berkeley
CS 268: Router Design Ion Stoica March 1, 2004.
1 Architectural Results in the Optical Router Project Da Chuang, Isaac Keslassy, Nick McKeown High Performance Networking Group
EE 122: Router Design Kevin Lai September 25, 2002.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion Introduction.
CS 268: Lecture 12 (Router Design) Ion Stoica March 18, 2002.
Nick McKeown 1 Memory for High Performance Internet Routers Micron February 12 th 2003 Nick McKeown Professor of Electrical Engineering and Computer Science,
Katz, Stoica F04 EECS 122: Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion Maximal.
Router Architectures An overview of router architectures.
Router Architectures An overview of router architectures.
Router Design (Nick Feamster) February 11, Today’s Lecture The design of big, fast routers Partridge et al., A 50 Gb/s IP Router Design constraints.
Chapter 4 Queuing, Datagrams, and Addressing
1 IP routers with memory that runs slower than the line rate Nick McKeown Assistant Professor of Electrical Engineering and Computer Science, Stanford.
Computer Networks Switching Professor Hui Zhang
Paper Review Building a Robust Software-based Router Using Network Processors.
Professor Yashar Ganjali Department of Computer Science University of Toronto
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Univ. of TehranComputer Network1 Advanced topics in Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr. Nasser Yazdani.
CS 3035/GZ01: Networked Systems Kyle Jamieson Department of Computer Science University College London Inside Internet Routers.
CS 552 Computer Networks IP forwarding Fall 2005 Rich Martin (Slides from D. Culler and N. McKeown)
A 50-Gb/s IP Router 참고논문 : Craig Partridge et al. [ IEEE/ACM ToN, June 1998 ]
TO p. 1 Spring 2006 EE 5304/EETS 7304 Internet Protocols Tom Oh Dept of Electrical Engineering Lecture 9 Routers, switches.
Advance Computer Networking L-8 Routers Acknowledgments: Lecture slides are from the graduate level Computer Networks course thought by Srinivasan Seshan.
Shivkumar Kalyanaraman Rensselaer Polytechnic Institute 1 ECSE-6600: Internet Protocols Informal Quiz #14 Shivkumar Kalyanaraman: GOOGLE: “Shiv RPI”
Routers. These high-end, carrier-grade 7600 models process up to 30 million packets per second (pps).
ISLIP Switch Scheduler Ali Mohammad Zareh Bidoki April 2002.
Packet Forwarding. A router has several input/output lines. From an input line, it receives a packet. It will check the header of the packet to determine.
1 Performance Guarantees for Internet Routers ISL Affiliates Meeting April 4 th 2002 Nick McKeown Professor of Electrical Engineering and Computer Science,
Stress Resistant Scheduling Algorithms for CIOQ Switches Prashanth Pappu Applied Research Laboratory Washington University in St Louis “Stress Resistant.
CS 4396 Computer Networks Lab Router Architectures.
Efficient Cache Structures of IP Routers to Provide Policy-Based Services Graduate School of Engineering Osaka City University
High-Speed Policy-Based Packet Forwarding Using Efficient Multi-dimensional Range Matching Lakshman and Stiliadis ACM SIGCOMM 98.
CS 740: Advanced Computer Networks IP Lookup and classification Supplemental material 02/05/2007.
Winter 2006EE384x Handout 11 EE384x: Packet Switch Architectures Handout 1: Logistics and Introduction Professor Balaji Prabhakar
Switching and Router Design
Lecture Note on Switch Architectures. Function of Switch.
1 A quick tutorial on IP Router design Optics and Routing Seminar October 10 th, 2000 Nick McKeown
1 How scalable is the capacity of (electronic) IP routers? Nick McKeown Professor of Electrical Engineering and Computer Science, Stanford University
Packet Switch Architectures The following are (sometimes modified and rearranged slides) from an ACM Sigcomm 99 Tutorial by Nick McKeown and Balaji Prabhakar,
Network layer (addendum) Slides adapted from material by Nick McKeown and Kevin Lai.
scheduling for local-area networks”
IP Routers – internal view
Weren’t routers supposed
CS 268: Router Design Ion Stoica February 27, 2003.
Packet Forwarding.
Addressing: Router Design
Lecture 11 Switching & Forwarding
CS 31006: Computer Networks – The Routers
Advance Computer Networking
EE 122: Lecture 7 Ion Stoica September 18, 2001.
Project proposal: Questions to answer
Presentation transcript:

Big, Fast Routers Dave Andersen CMU CS 15-744

Router Architecture Data Plane Control Plane How packets get forwarded How routing protocols establish routes/etc.

Processing: Fast Path vs. Slow Path Optimize for common case BBN router: 85 instructions for fast-path code Fits entirely in L1 cache Non-common cases handled on slow path Route cache misses Errors (e.g., ICMP time exceeded) IP options Fragmented packets Mullticast packets

First Generation Routers Off-chip Buffer Shared Bus Route Table CPU Buffer Memory Line Interface MAC Typically <0.5Gb/s aggregate capacity CPU Memory Line Interface Line card DMAs into buffer, CPU examines header, has output DMA out

Second Generation Routers CPU Route Table Buffer Memory Bypasses memory bus with direct transfer over bus between line cards Moves forwarding decisions local to card to reduce CPU pain Punt to CPU for “slow” operations Line Card Line Card Line Card Buffer Memory Buffer Memory Buffer Memory Fwding Cache Fwding Cache Fwding Cache MAC MAC MAC Typically <5Gb/s aggregate capacity

Control Plane & Data Plane Control plane must remember lots of routing info (BGP tables, etc.) Data plane only needs to know the “FIB” (Forwarding Information Base) Smaller, less information, etc. Simplifies line cards vs the network processor

Bus-based Some improvements possible But shared bus was big bottleneck Cache bits of forwarding table in line cards, send directly over bus to outbound line card But shared bus was big bottleneck E.g., modern PCI bus (PCIx16) is only 32Gbit/sec (in theory) Almost-modern cisco (XR 12416) is 320Gbit/sec. Ow! How do we get there?

Third Generation Routers “Crossbar”: Switched Backplane Line Card CPU Card Line Card Local Buffer Memory Local Buffer Memory CPU Line Interface Routing Table Memory Fwding Table Fwding Table Periodic Control updates MAC MAC Typically <50Gb/s aggregate capacity

Crossbars N input ports, N output ports (One per line card usually) Every line card has its own forwarding table/classifier/etc – removes CPU bottleneck Scheduler Decides which input/output port to connect in a given time slot Crossbar constraint: If input I is connected to output j, no other input connected to j, no other output connected to input I Scheduling is bipartite matching…

What’s so hard here? Back-of-the-envelope numbers Line cards can be 40 Gbit/sec today (OC-768) Undoubtedly faster in a few more years, so scale these #s appropriately! To handle minimum-sized packets (~40b) 125 Mpps, or 8ns per packet But note that this can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM. Good lookup algos needed still. For every packet, you must: Do a routing lookup (where to send it) Schedule the crossbar Maybe buffer, maybe QoS, maybe filtering by ACLs

Routing Lookups Routing tables: 200,000 – 1M entries So, how to do it? Router must be able to handle routing table load 5 years hence. Maybe 10. So, how to do it? DRAM (Dynamic RAM, ~50ns latency) Cheap, slow SRAM (Static RAM, <5ns latency) Fast, $$ TCAM (Ternary Content Addressable Memory – parallel lookups in hardware) Really fast, quite $$, lots of power

Longest-Prefix Match Not just one entry that matches a dst 128.2.0.0/16 vs 128.2.153.0/24 Must take the “longest” (most specific) prefix

Method 1: Trie P1 = 10* P2 = 111* P3 = 11001* P4 = 1* P5 = 0* Sample Database Root P1 = 10* P2 = 111* P3 = 11001* P4 = 1* P5 = 0* P6 = 1000* P7 = 100000* P8 = 1000000* 1 P5 P4 1 P1 1 P2 P6 How fast does lookup need to be – 1000bit packets, 10Gbps  10million pps  100ns!  2 Ghz processor = 200 instructions, but memory speed? How many memory refs do we need… 1 P3 P7 P8

Skip Count vs. Path Compression (Skip count) Skip 2 or 11 (path compressed) 1 P1 1 1 P1 P2 1 P2 1 1 P4 P3 P4 P3 Removing one way branches ensures # of trie nodes is at most twice # of prefixes Using a skip count requires exact match at end and backtracking on failure  path compression simpler

LPM with PATRICIA Tries Traditional method – Patricia Tree Arrange route entries into a series of bit tests Worst case = 32 bit tests Problem: memory speed, even w/SRAM! Bit to test – 0 = left child,1 = right child Lookup 128.32.150.3 - simple Lookup 128.32.149.3 – follow to 128.32.150/24  no match  backtrack and look for parent leaves that match 10 default 0/0 16 128.2/16 19 128.32/16 128.32.130/240 128.32.150/24

How can we speed LPM up? Two general approaches: Shrink the table so it fits in really fast memory (cache) Degermark et al.; optional reading Complete prefix tree (node has 2 or 0 kids) can be compressed well. 3 stages: Match 16 bits; match next 8; match last 8 Drastically reduce the # of memory lookups WUSTL algorithm ca. same time (Binary search on prefixes)

TCAMs for LPM Content addressable memory (CAM) Hardware-based route lookup Input = tag, output = value Requires exact match with tag Multiple cycles (1 per prefix) with single CAM Multiple CAMs (1 per prefix) searched in parallel Ternary CAM (0,1,don’t care) values in tag match Priority (i.e., longest prefix) by order of entries Very expensive, lots of power, but fast! Some commercial routers use it

Skipping LPMs with caching Packet trains exhibit temporal locality Many packets to same destination. Problems? Cisco Express Forwarding

Problem 2: Crossbar Scheduling Find a bipartite matching In under 8ns First issue: Head-of-line blocking with input queues If only 1 queue per input Max throughput <= (2-sqrt(2)) =~ 58% Solution? Virtual output queueing In input line card, one queue per dst. Card Requires N queues; more if QoS The Way It’s Done Now ™

Head-of-Line Blocking Problem: The packet at the front of the queue experiences contention for the output queue, blocking all packets behind it. Input 1 Output 1 Input 2 Output 2 Input 3 Output 3 Maximum throughput in such a switch: 2 – sqrt(2)

Early Crossbar Scheduling Algorithm Wavefront algorithm Observation: 2,1 1,2 don’t conflict with each other (11 “cycles”) Slow! (36 “cycles”) Do in groups, with groups in parallel (5 “cycles”) Can find opt group size, etc. Problems: Fairness, speed, …

Alternatives to the Wavefront Scheduler PIM: Parallel Iterative Matching Request: Each input sends requests to all outputs for which it has packets Grant: Output selects an input at random and grants Accept: Input selects from its received grants Problem: Matching may not be maximal Solution: Run several times Problem: Matching may not be “fair” Solution: Grant/accept in round robin instead of random

iSLIP – Round-robin PIM Each input maintains round-robin list of outputs Each output maints round-robin list of inputs Request phase: Inputs send to all desired output Grant phase: Output picks first input in round-robin sequence Input picks first output from its RR seq Output updates RR seq only if it’s chosen Good fairness in simulation

100% Throughput? Why does it matter? Cool result: Guaranteed behavior regardless of load! Same reason moved away from cache-based router architectures Cool result: Dai & Prabhakar: Any maximal matching scheme in a crossbar with 2x speedup gets 100% throughput Speedup: Run internal crossbar links at 2x the input & output link speeds

Filling in the (painful) details Routers do more than just LPM & crossbar Packet classification (L3 and up!) Counters & stats for measurement & debugging IPSec and VPNs QoS, packet shaping, policing IPv6 Access control and filtering IP multicast AQM: RED, etc. (maybe) Serious QoS: DiffServ (sometimes), IntServ (not)

Other challenges in Routing Fast Classification src, dst, sport, dport, {other stuff} -> accept, deny, which class/queue, etc. Even with TCAMs, hard efficient use of limited entries, etc. Routing correctness We’ll get back to this a bit later Architecture w.r.t. management Also later. How do you manage a collection of 100s of routers?

Going Forward Today’s highest-end: Multi-rack routers Measured in Tb/sec One scenario: Big optical switch connecting multiple electrical switches Cool design: McKeown sigcomm 2003 paper BBN MGR: Normal CPU for forwarding Modern routers: Several ASICs