Towards 10Gb/s open-source routing

Towards 10Gb/s open-source routing
Olof Hagsand (KTH), Robert Olsson (Uppsala U), Bengt Görden (KTH)
Linuxkongress 2008

Introduction
- Investigate packet forwarding performance of new PC hardware: multi-core CPUs, multiple PCIe buses, 10G NICs
- Can we obtain enough performance to use open-source routing also in the 10Gb/s realm?

Measuring throughput
- Packets per second reflects per-packet costs: CPU processing, I/O and memory latency, clock frequency
- Bandwidth reflects per-byte costs: bandwidth limitations of bus and memory
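
For context, the theoretical per-port limits follow directly from Ethernet framing overhead. A quick back-of-the-envelope sketch (my own illustration, not from the slides), treating the quoted packet size as the full frame including the 4-byte FCS:

    # Theoretical line-rate packet rates at 10 Gb/s. Each frame occupies its
    # own size plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap on
    # the wire; the quoted frame size is assumed to include the 4-byte FCS.
    LINK_BPS = 10e9
    PER_FRAME_OVERHEAD = 8 + 12          # preamble/SFD + inter-frame gap, bytes

    def line_rate_pps(frame_bytes):
        return LINK_BPS / ((frame_bytes + PER_FRAME_OVERHEAD) * 8)

    for size in (64, 1500):
        print(f"{size:4d}-byte frames: {line_rate_pps(size) / 1e6:5.2f} Mpps")
    # -> ~14.88 Mpps at 64 bytes, ~0.82 Mpps at 1500 bytes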

Measuring throughput
[Figure: throughput vs. offered load; throughput follows the load up to a breakpoint (the capacity), after which drops appear and the device is in overload]

Inside a router, HW style
[Figure: a switched backplane connecting several line cards, each with buffer memory and a forwarder, and a CPU card holding the CPU and RIB]
- Multiple simultaneous transfers over the backplane
- Specialized hardware: ASICs, NPUs, backplane with switching stages or crossbars

Inside a router, PC-style
[Figure: a single CPU with RIB and buffer memory behind a shared bus backplane connecting the line cards]
- Every packet goes twice over the shared bus to the CPU
- Cheap, but low performance
- But let's increase the number of CPUs and the number of buses!
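
To make the shared-bus limitation concrete, a rough calculation under an assumed classic 32-bit/33 MHz PCI bus (an assumption for illustration only; the systems measured later use PCIe):

    # Illustrative only: assumes a classic 32-bit/33 MHz shared PCI bus,
    # not the PCIe hardware measured in this work.
    bus_bps = 32 * 33e6                  # ~1.06 Gb/s theoretical peak
    crossings = 2                        # NIC -> memory, then memory -> NIC
    print(f"max forwarding ~ {bus_bps / crossings / 1e9:.2f} Gb/s")   # ~0.53 Gb/s

Even in theory, forwarding tops out at roughly half the bus bandwidth, which is why multiple buses and CPUs are needed.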

Block hw structure

Hardware – Box (set 2)
- AMD Opteron 2356: one quad-core 2.3GHz Barcelona CPU on a TYAN 2927 motherboard (2U)

Hardware - NIC
- Intel 10G board, 82598 chipset
- Open chip specs. Thanks Intel!

Lab

Equipment summary
- Hardware needs to be carefully selected
- Bifrost Linux on kernel 2.6.24-rc7 with LC-trie forwarding; tweaked pktgen
- Set 1: AMD Opteron 2222 with two dual-core 3GHz CPUs on a Tyan Thunder n6650W (S2915) motherboard
- Set 2: AMD Opteron 2356 with one quad-core 2.3GHz Barcelona CPU on a TYAN 2927 motherboard (2U)
- Dual PCIe buses
- 10GE network interface cards: PCI Express x8, based on the Intel 82598 chipset

Experiments
- Transmission (TX): a bandwidth of 25.8 Gb/s and a packet rate of 10 Mp/s using four CPU cores and two PCIe buses
- Forwarding experiments:
  - A single IP flow forwarded using a single CPU
  - A realistic multiple-flow stream with varying destination addresses and packet sizes, using a single CPU
  - Multi-queues on the interface cards used to dispatch different flows to four different CPUs

Tx Experiments
- Goal: just to see how much the hardware can handle (an upper limit)
- Loopback tests over fibers
- Don't process RX packets, just let the MAC count them
- These numbers can give an indication of what forwarding capacity is possible
- Experiments: a single CPU transmitting on a single interface; four CPUs transmitting on one interface each (a pktgen sketch follows below)
[Figure: tested device with loopback fibers]
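
The transmit tests are driven by pktgen (the slides mention a tweaked version). As a rough sketch of how such a TX test can be set up through the standard pktgen /proc interface, assuming the pktgen module is loaded and root privileges; the interface name, destination address and MAC below are placeholders, and the actual tweaked configuration may differ:

    # Minimal pktgen TX sketch (assumes the pktgen module is loaded, run as
    # root). Device name, destination IP and next-hop MAC are placeholders.
    def pg(path, cmd):
        with open(path, "w") as f:
            f.write(cmd + "\n")

    DEV = "eth1"                                   # placeholder interface
    pg("/proc/net/pktgen/kpktgend_0", "rem_device_all")
    pg("/proc/net/pktgen/kpktgend_0", f"add_device {DEV}")

    dev = f"/proc/net/pktgen/{DEV}"
    pg(dev, "count 0")                             # 0 = run until stopped
    pg(dev, "pkt_size 64")                         # small-packet TX test
    pg(dev, "delay 0")                             # send as fast as possible
    pg(dev, "dst 10.0.0.2")                        # placeholder destination
    pg(dev, "dst_mac 00:11:22:33:44:55")           # placeholder next-hop MAC

    pg("/proc/net/pktgen/pgctrl", "start")         # blocks while the test runs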

Tx single sender: Packets per second

Tx single sender: Bandwidth

Tx - Four CPUs: Bandwidth
[Figure: summed (SUM) Tx bandwidth over four CPUs; packet length: 1500 bytes]

TX Experiments summary
- A single Tx sender is primarily limited by PPS, at around 3.5 Mpps
- A bandwidth of 25.8 Gb/s and a packet rate of 10 Mp/s using four CPU cores and two PCIe buses
- This shows that the hardware itself allows 10Gb/s performance
- We also see nice, symmetric Tx between the CPU cores

Forwarding experiments
- Goal: realistic forwarding performance
- Overload measurements
- Single forwarding path from one traffic source to one traffic sink
- A single IP flow forwarded using a single CPU
- A realistic multiple-flow stream with varying destination addresses and packet sizes, using a single CPU
- Multi-queues on the interface cards used to dispatch different flows to four different CPUs
[Figure: test generator -> tested device -> sink device]

Single flow, single CPU: Packets per second

Single flow, single CPU: Bandwidth

Single sender forwarding summary
- Virtually wire-speed for 1500-byte packets
- Little difference between forwarding on the same card (different ports) and between different cards
- Performance seems slightly better on the same card, but the difference is not significant
- The primary limiting factor is pps, around 900 Kpps
- TX has a small effect on overall performance

Introducing realistic traffic
- For the rest of the experiments we introduce a more realistic traffic scenario
- Multiple packet sizes: a simple model based on realistic packet size distribution data
- Multiple flows (multiple destination IPs)
- This is also necessary for the multi-core experiments, since NIC classification is done using a hash over the packet headers

Packet size distribution
[Figure: packet size distribution; real data from www.caida.org, WIDE trace, Aug 2008]

Flow distribution
- Flows have size and duration distributions
- 8000 simultaneous flows, each flow 30 packets long
- Mean flow duration is 258 ms
- 31000 new flows per second (measured by dst cache misses; see the consistency check below)
- Destinations spread randomly over 11.0.0.0/8
- FIB contains ~280K entries, 64K of them in 11.0.0.0/8
- This flow distribution is relatively aggressive
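
The figures above are mutually consistent; a quick check (my own arithmetic, not from the slides):

    # Consistency check of the flow-distribution figures above.
    simultaneous_flows = 8000
    mean_flow_duration_s = 0.258
    packets_per_flow = 30

    new_flows_per_s = simultaneous_flows / mean_flow_duration_s   # Little's law
    aggregate_pps = new_flows_per_s * packets_per_flow

    print(f"~{new_flows_per_s:.0f} new flows/s")            # ~31000, as on the slide
    print(f"~{aggregate_pps / 1e3:.0f} Kpps offered load")  # roughly 930 Kpps implied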

Multi-flow and single-CPU: PPS & BW
[Figure: max/min packet rate and bandwidth for Set 1 and Set 2, with a small routing table vs. 280K entries, and with no ipfilters vs. ipfilters enabled]

Multi-Q experiments
- Use more CPU cores to handle forwarding
- NIC classification (Receive Side Scaling, RSS) uses a hash algorithm to select the input queue (see the sketch below)
- Allocate several interrupt channels, one for each CPU
- Flows are distributed evenly between CPUs; this needs aggregated traffic with multiple flows
- Questions: Are flows evenly distributed? Will performance increase as CPUs are added?
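
To illustrate the dispatching idea, here is a simplified sketch of hash-based receive queue selection. This is my own illustration: real NICs such as the 82598 use a Toeplitz hash over the header fields plus an indirection table, not the ad-hoc hash below; only the principle (same flow, same queue, same CPU) carries over.

    # Simplified illustration of RSS-style receive queue selection.
    NUM_QUEUES = 4

    def flow_hash(src_ip, dst_ip, src_port, dst_port, proto):
        h = 0x811c9dc5                   # FNV-1a, purely illustrative
        for b in f"{src_ip}{dst_ip}{src_port}{dst_port}{proto}".encode():
            h = ((h ^ b) * 0x01000193) & 0xffffffff
        return h

    def rx_queue(pkt_headers):
        return flow_hash(*pkt_headers) % NUM_QUEUES

    # All packets of a flow land in the same queue; different flows spread out.
    print(rx_queue(("11.1.2.3", "192.168.0.1", 12345, 80, 6)))
    print(rx_queue(("11.9.8.7", "192.168.0.1", 23456, 80, 6)))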

Multi-flow and Multi-CPU (set 1)
[Figure: forwarding rate with 1 CPU vs. 4 CPUs, with the load across CPU #1 to CPU #4; only 64-byte packets]

Results MultiQ
- Packets are evenly distributed between the four CPUs
- But forwarding using one CPU is better than using four CPUs! Why is this?

Profiling

Single CPU:
samples   %        symbol name
396100    14.8714  kfree
390230    14.6510  dev_kfree_skb_irq
300715    11.2902  skb_release_data
156310     5.8686  eth_type_trans
142188     5.3384  ip_rcv
106848     4.0116  __alloc_skb
 75677     2.8413  raise_softirq_irqoff
 69924     2.6253  nf_hook_slow
 69547     2.6111  kmem_cache_free
 68244     2.5622  netif_receive_skb

Multiple CPUs:
samples   %        symbol name
1087576   22.0815  dev_queue_xmit
 651777   13.2333  __qdisc_run
 234205    4.7552  eth_type_trans
 204177    4.1455  dev_kfree_skb_irq
 174442    3.5418  kfree
 158693    3.2220  netif_receive_skb
 149875    3.0430  pfifo_fast_enqueue
 116842    2.3723  ip_finish_output
 114529    2.3253  __netdev_alloc_skb
 110495    2.2434  cache_alloc_refill

Multi-Q analysis
- With multiple CPUs, TX processing uses a large part of the CPU time, making the use of more CPUs sub-optimal
- It turns out that the Tx and Qdisc code needs to be adapted to scale up performance
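
A conceptual sketch (my own illustration in plain Python, not kernel code) of why the TX path limits scaling: with a single device-wide qdisc/TX lock the per-CPU RX queues all funnel into one serialized enqueue, while per-queue TX state lets each CPU transmit independently.

    # Conceptual model of the bottleneck: all forwarding CPUs contend on one
    # TX/qdisc lock, so added cores mostly wait on each other.
    import threading

    class SingleTxQueueDevice:
        def __init__(self):
            self.lock = threading.Lock()     # one qdisc/TX lock per device
            self.sent = 0

        def xmit(self, cpu_id, pkt):
            with self.lock:                  # every forwarding CPU serializes here
                self.sent += 1

    class MultiTxQueueDevice:
        def __init__(self, nqueues):
            self.locks = [threading.Lock() for _ in range(nqueues)]
            self.sent = [0] * nqueues

        def xmit(self, cpu_id, pkt):
            q = cpu_id % len(self.locks)     # per-CPU TX queue with its own lock
            with self.locks[q]:
                self.sent[q] += 1

    dev = MultiTxQueueDevice(nqueues=4)
    for cpu in range(4):
        dev.xmit(cpu, pkt=None)
    print(dev.sent)                          # each CPU used its own queue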

MultiQ: Updated drivers
- We recently made new measurements (not in the paper) using updated driver code
- We also used hardware set 2 (Barcelona) to get better results
- We now see an actual improvement when we add one processor (more to come)

Multi-flow and Multi-CPU (set 2)
[Figure: forwarding performance with 1, 2, and 4 CPUs]

Conclusions
- Tx and forwarding results approaching 10Gb/s performance using Linux and selected hardware
- >25Gb/s Tx performance
- Near 10Gb/s wire-speed forwarding for large packets
- Identified a bottleneck for multi-queue, multi-core forwarding; if it is removed, scaling performance to 10Gb/s and beyond using several CPU cores is possible