Passive monitoring of 10 Gb/s lines with PC hardware
Sven Ubik, Petr Žejdl, CESNET
TNC2008, Bruges, 19 May 2008

Active vs. passive monitoring

Active monitoring – a probe
- Uses test packets
- Results truly applicable to test packets only

Passive monitoring – a view
- Does not send anything, provides characteristics about real traffic
- Many characteristics are inherent to real traffic and cannot be obtained from test packets: traffic volume, protocol usage, burstiness, real packet loss, anomalies, security attacks, …

Hardware vs. software

- Hardware processing is often considered "fast" and software processing "slow"
- Software runs on top of hardware, and hardware is often programmed – there is no clear line between HW and SW processing
- Hardware programming is sometimes considered "design-time" and software programming "run-time", but dynamically reconfigurable HW exists and software also needs to be designed
- There is often a difference in flexibility of programming (SW better than HW)
- What is more "powerful": an FPGA, a network processor or a multi-core CPU?

NICs and monitoring cards

- 10GE NICs are now commonly available and relatively inexpensive: $1300 / port including XFP transceiver (was $ years ago)
- 10GE monitoring cards are few and expensive – DAG (Endace), Napatech, COMBO (Invea-Tech): ~6500 Euro / port and more, incl. XFP transceiver (was over Euro 2 years ago)

Two main differences between NICs and monitoring cards:
- some hardware acceleration (filtering, header classification, simple packet statistics)
- large packet buffer and block DMA transfer – the key difference (see the sketch below)
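A minimal sketch, in plain C with a hypothetical record layout (not the actual DAG or Napatech format), of what block-oriented consumption looks like on the host: the card DMAs a large buffer of back-to-back capture records into memory, and the CPU walks the whole block without per-packet copies or system calls.

```c
/* Illustrative only: consume one DMA'd block of capture records.
 * The record header below is hypothetical; real cards use their own
 * formats (e.g. ERF on DAG cards). */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

struct rec_hdr {                /* hypothetical capture-record header */
    uint64_t timestamp;
    uint16_t rec_len;           /* total record length, header included */
    uint16_t wire_len;          /* original frame length on the wire */
};

/* Walk all records in a block the card placed in host memory:
 * no copies, no per-packet system calls. */
static unsigned long process_block(const uint8_t *buf, size_t block_len)
{
    unsigned long packets = 0;
    size_t off = 0;

    while (off + sizeof(struct rec_hdr) <= block_len) {
        struct rec_hdr h;
        memcpy(&h, buf + off, sizeof h);   /* header may be unaligned in the block */
        if (h.rec_len < sizeof(struct rec_hdr) || off + h.rec_len > block_len)
            break;                         /* truncated record at the end of the block */
        /* ... inspect headers / update statistics here ... */
        packets++;
        off += h.rec_len;                  /* jump to the next record */
    }
    return packets;
}

int main(void)
{
    /* Tiny synthetic "block" with two header-only records, just to run the walker. */
    uint8_t block[2 * sizeof(struct rec_hdr)];
    struct rec_hdr h = { 0, sizeof(struct rec_hdr), 64 };
    memcpy(block, &h, sizeof h);
    memcpy(block + sizeof h, &h, sizeof h);
    printf("records in block: %lu\n", process_block(block, sizeof block));
    return 0;
}
```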

10 Gb/s cards that we tested

x8 PCI-Express NICs:
- Myricom Myri-10G
- Neterion Xframe II

64-bit/133 MHz PCI-X NICs:
- Intel PRO/10GbE
- Neterion Xframe

Monitoring card:
- DAG 8.2X (PCI-E)

Theoretical bus throughput:
- 20 Gb/s for x8 PCI-E
- 8 Gb/s for PCI-X

Test setup

- RFC 2544 – Benchmarking Methodology for Network Interconnect Devices
- Frame sizes: 1518, 1280, 1024, 512, 256, 128 and 64 bytes
- DUT – device under test; difficult to isolate in the case of a PC card

Test stack:
- tested card
- PC hardware
- NAPI driver for NICs
- Linux 2.6 with standard IP stack
- MAPI middleware
- test application (header filter and packet counter) – see the sketch below
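The MAPI test application itself is not shown on the slides; as a stand-in, the following does the same job (BPF header filter plus packet counter) with libpcap instead of MAPI. The interface name and filter expression are placeholders, and this illustrates the workload only, not the measured code.

```c
/* Stand-in for the test application: header filter + packet counter.
 * Uses libpcap rather than MAPI; "eth2" and the filter string are placeholders. */
#include <pcap/pcap.h>
#include <stdio.h>

static unsigned long pkt_count = 0;

static void on_packet(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    (void)user; (void)h; (void)bytes;
    pkt_count++;                          /* the whole per-packet "analysis" */
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_live("eth2", 96, 1, 1000, errbuf);   /* 96-byte snaplen: headers only */
    if (!p) { fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    struct bpf_program prog;
    if (pcap_compile(p, &prog, "tcp port 80", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(p, &prog) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(p));
        return 1;
    }

    pcap_loop(p, -1, on_packet, NULL);    /* count packets until interrupted */
    printf("packets matching filter: %lu\n", pkt_count);
    pcap_close(p);
    return 0;
}
```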

Processing throughput

- Maximum IP-layer throughput in Gb/s with zero-loss processing
- Myricom best among the NICs, but only marginally
- DAG: 100% line rate

[Chart: maximum load at zero loss [Gb/s] by packet size [B] and packets/s at 10 Gb/s, for Myricom, Intel, Xframe, Xframe II and DAG (for comparison)]

Processing frame rates

[Frame-rate graphs for each card:]
- Myricom Myri-10G (PCI-E)
- Neterion Xframe II (PCI-E)
- Neterion Xframe (PCI-X)
- Intel PRO/10GbE (PCI-X)

CPU load

- CPU load at maximum zero-loss throughput
- For all cards the CPU was not the bottleneck
- DAG anomaly for larger frames is being investigated

Traffic processing in a PC

Example of a modern mainboard: Supermicro X7DB8 with the Intel 5000P chipset.

In a modern PC, the bandwidth of the PCI bus, memory and FSB is sufficient for sustained processing of 10 Gb/s of data; the bottlenecks are the CPUs and the NICs. In our case the NICs were most likely the bottleneck, with a limit of ~1.3 million packets/s.

Cycles per packet

10 Gb/s in 64-byte packets = 14.8×10^6 packets / second

With 3 GHz CPUs (the arithmetic is worked in the sketch below):
- 4 cores – 806 cycles / packet
- 8 cores – 1612 cycles / packet
- 16 cores – 3224 cycles / packet
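The budget above is simple arithmetic: total CPU cycles per second divided by the packet rate. A few lines of C reproduce it; tiny differences from the slide (a cycle or two, since the slide scales the 806 figure for 8 and 16 cores) come from rounding.

```c
/* Per-packet cycle budget = cores * clock rate / packet rate.
 * 64-byte frames on 10 GbE arrive at ~14.88 Mpps (the slide rounds to 14.8e6). */
#include <stdio.h>

int main(void)
{
    const double clock_hz = 3.0e9;                    /* 3 GHz cores, as on the slide */
    const double pps = 10.0e9 / ((64 + 20) * 8.0);    /* ~14.88e6 packets/s at line rate */
    const int cores[] = { 4, 8, 16 };

    for (int i = 0; i < 3; i++)
        printf("%2d cores: %.0f cycles/packet\n", cores[i], cores[i] * clock_hz / pps);
    return 0;
}
```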

Packet sizes in live traffic

Example: GN2 – CESNET link:
- ~40% of packets near 64 bytes
- ~40% of packets near 1518 bytes
- ~20% of packets in between (~3% near 600 bytes)
- average packet size: 790 bytes

Traffic classification into application-layer protocols

- Based on MAPI and the trackflib library
- Each protocol requires a combination of header filtering and payload searching (see the sketch below)
- 2x dual-core 3 GHz Xeon: ~3.5 Gb/s of live traffic with zero-loss monitoring (=> 4x quad-core: ~14 Gb/s)
- Example application: ABW (3.6 Gb/s)
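The slides do not include trackflib code; to illustrate the payload-searching half of the work, here is a naive per-packet byte-signature scan of the kind such classifiers perform. The BitTorrent handshake prefix is used as an example signature; trackflib's actual rules are not reproduced here, and real classifiers use precompiled multi-pattern matchers rather than this O(n·m) loop.

```c
/* Illustrative payload search: does the payload contain a byte signature? */
#include <stddef.h>
#include <string.h>
#include <stdio.h>

static int payload_contains(const unsigned char *payload, size_t len,
                            const unsigned char *sig, size_t sig_len)
{
    if (sig_len == 0 || len < sig_len)
        return 0;
    for (size_t i = 0; i + sig_len <= len; i++)
        if (memcmp(payload + i, sig, sig_len) == 0)
            return 1;                    /* signature found somewhere in the payload */
    return 0;
}

int main(void)
{
    /* Example signature: BitTorrent handshake prefix (0x13 "BitTorrent protocol"). */
    static const unsigned char sig[] = "\x13" "BitTorrent protocol";
    static const unsigned char pkt[] = "....\x13" "BitTorrent protocol....";

    printf("match: %d\n",
           payload_contains(pkt, sizeof(pkt) - 1, sig, sizeof(sig) - 1));
    return 0;
}
```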

Many-core processing

- Tilera TILExpress-64 and TILExpress-20G cards: 64 cores, 1 or 2 XAUI connectors (Infiniband-style)
- Other many-core cards exist, but without a high-speed network interface (e.g., the 128-core NVIDIA Tesla C870 GPU processing board)

Distribution into multiple cores

1. In hardware: some monitoring cards have firmware that copies packets into multiple memory buffers based on user-defined load balancing (DSM – Data Stream Management in DAG cards, but more than two buffers are available only in NinjaBoxes)

2. In software: one core runs a packet scheduler that creates virtual buffers (packets are not copied) without splitting flows; the other cores serve the virtual buffers (in development – see the sketch below)
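A minimal sketch (not the authors' scheduler) of the flow-preserving part of software distribution: each packet is assigned to a per-core virtual buffer by hashing its 5-tuple, so all packets of one flow end up on the same core. The field names are placeholders; a real scheduler would also handle buffer backpressure and, if needed, make the hash symmetric so both directions of a flow map to the same core.

```c
/* Flow-consistent assignment of packets to worker cores: hash the 5-tuple so
 * that every packet of a flow is served by the same core (no flow splitting).
 * Illustration of the idea only, not the scheduler described on the slide. */
#include <stdint.h>
#include <stddef.h>

struct flow_key {                     /* placeholder 5-tuple taken from the headers */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a over a byte range; any reasonable hash would do. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h;
}

static uint32_t hash_key(const struct flow_key *k)
{
    uint32_t h = 2166136261u;         /* hash fields separately to skip struct padding */
    h = fnv1a(h, &k->src_ip,   sizeof k->src_ip);
    h = fnv1a(h, &k->dst_ip,   sizeof k->dst_ip);
    h = fnv1a(h, &k->src_port, sizeof k->src_port);
    h = fnv1a(h, &k->dst_port, sizeof k->dst_port);
    h = fnv1a(h, &k->proto,    sizeof k->proto);
    return h;
}

/* Index of the worker core (virtual buffer) that should process this packet. */
static unsigned pick_core(const struct flow_key *k, unsigned n_cores)
{
    return hash_key(k) % n_cores;
}
```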

Conclusion

Complex zero-loss processing of a 10 Gb/s packet stream is possible in a modern PC when two conditions are satisfied:
- Packets are copied from the network to the PC's memory efficiently (the CPU must not be loaded by this task); this is currently not possible with NICs, but it is possible with monitoring cards
- Packets are distributed among multiple cores

Thank you for your attention Questions?