Intel Research & Development
ETA: Experience with an IA Processor as a Packet Processing Engine
Greg Regnier, Intel Network Architecture Lab
HP Labs Computer Systems Colloquium, August 2003

ETA Overview (Embedded Transport Acceleration)
ETA Architectural Goals
– Investigate the requirements and attributes of an effective Packet Processing Engine (PPE)
– Define an efficient, asynchronous queuing model for Host/PPE communications
– Explore platform and OS integration of a PPE
ETA Prototype Goals
– Use as a development vehicle for measurement and analysis
– Understand the packet-processing capabilities of a general-purpose IA CPU

ETA System Architecture
[Block diagram: kernel applications (file system, IP storage driver) and user socket applications (via a socket proxy) sit above the ETA host interface; the Packet Processing Engine runs the network stack and connects to the network fabric, carrying LAN, storage, and IPC traffic.]
Highlighted elements: the network stack; virtualized, asynchronous queuing and event handling; engine architecture and platform integration.

Direct Transport Interface
[Diagram: an application (kernel or user) and its adaptation layer communicate with the ETA Packet Processing Engine through shared host memory. Each DTI consists of a Tx queue, an Rx queue, an event queue, and a doorbell; application buffers and an anonymous buffer pool also reside in shared host memory, and the PPE drives the NICs.]
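To make the queue layout concrete, here is a minimal C sketch of how one DTI might look in shared host memory. The slide only names the components; every type, field, and constant below (dti, dti_ring, dti_op_desc, and so on) is an illustrative assumption, not the prototype's actual definition.

```c
/* Hypothetical layout of one Direct Transport Interface (DTI) in shared
 * host memory.  All names are illustrative; the slide identifies only the
 * components (Tx/Rx/event queues, doorbell, buffer pools). */
#include <stdint.h>

enum dti_opcode { OP_SEND = 1, OP_RECV, OP_CONNECT /* ... */ };

struct dti_op_desc {             /* one posted operation */
    uint32_t opcode;             /* enum dti_opcode */
    uint32_t flags;
    uint64_t buf_addr;           /* application buffer in shared host memory */
    uint32_t buf_len;
    uint64_t app_context;        /* echoed back in the completion event */
};

struct dti_event {               /* completion posted by the PPE */
    uint32_t type;
    int32_t  status;
    uint32_t bytes;
    uint64_t app_context;
};

struct dti_ring {                /* single-producer, single-consumer ring */
    volatile uint32_t head;      /* written by the producer */
    volatile uint32_t tail;      /* written by the consumer */
    uint32_t          size;      /* entry count, power of two */
    void             *entries;   /* array of descriptors or events */
};

struct dti {
    struct dti_ring    txq;      /* host posts transmit operations */
    struct dti_ring    rxq;      /* host posts receive operations */
    struct dti_ring    eventq;   /* PPE posts completion events */
    volatile uint32_t *doorbell; /* host writes here to notify the PPE */
};
```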

DTI Operation Model
DTI operations:
– Connection requests (Connect, Listen, Bind, Accept, Close, …)
– Data transfer requests (Send, Receive)
– Misc. operations (Set/Get Options, …)
[Diagram: the host application's adaptation layer en-queues operation descriptors on the DTI Tx and Rx queues and rings the DTI doorbell; the PPE services the doorbell, de-queues the operation descriptor, processes the operation, posts a completion event to the event queue, and posts an ETA interrupt event if the host is waiting.]
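A hedged sketch of the host side of this flow, reusing the hypothetical structures from the previous sketch: post a send descriptor on the Tx queue, ring the doorbell, and later poll the event queue for the completion. Memory barriers and error handling are omitted, and none of these function names come from the slides.

```c
/* Builds on the struct sketch above.  Illustrative only: a real
 * implementation needs write barriers before publishing the head index
 * and before ringing the doorbell. */
static int dti_post_send(struct dti *d, const void *buf, uint32_t len,
                         uint64_t ctx)
{
    struct dti_op_desc *ring = d->txq.entries;
    uint32_t head = d->txq.head;

    if (head - d->txq.tail == d->txq.size)
        return -1;                                /* Tx queue is full */

    struct dti_op_desc *desc = &ring[head & (d->txq.size - 1)];
    desc->opcode      = OP_SEND;
    desc->buf_addr    = (uint64_t)(uintptr_t)buf; /* no copy: the PPE reads
                                                     the app buffer in place */
    desc->buf_len     = len;
    desc->app_context = ctx;

    d->txq.head  = head + 1;                      /* publish the descriptor */
    *d->doorbell = 1;                             /* notify the PPE */
    return 0;
}

static int dti_poll_event(struct dti *d, struct dti_event *ev)
{
    struct dti_event *ring = d->eventq.entries;

    if (d->eventq.tail == d->eventq.head)
        return 0;                                 /* nothing completed yet */

    *ev = ring[d->eventq.tail & (d->eventq.size - 1)];
    d->eventq.tail++;
    return 1;
}
```

The post call returns as soon as the descriptor is queued; the buffer belongs to the PPE until the matching completion event is reaped, which is what makes the interface asynchronous and copy-free.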

ETA Test Environment
[Diagram: an off-the-shelf dual-processor Linux server (two 2.4 GHz CPUs) with CPU 0 as the host, running a kernel test program over a kernel abstraction layer and the ETA host interface, and CPU 1 dedicated to the ETA PPE software; five gigabit NICs connect the PPE to the test clients, with queues and buffers in shared host memory.]

Transmit Performance
[chart]

Receive Performance
[chart]

Effect of Threads on TX
[chart]

Effect of Threads/Copy on RX
[chart]

Performance Analysis
Look at one data point
– The 1 KB transmit case (single-threaded)
– Compare SMP to ETA
Profile using VTune™
– Statistical sampling using instruction and cycle-count events
[chart: 1 KB XMIT]

2P SMP Profile
Processing requirements in multiple components
– TCP/IP is the largest single component, but is small compared to the total
– The copy overhead is required to support legacy (synchronous) socket semantics
– Interrupts and system calls are required in order to time-share the CPU resources
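For contrast, this is the legacy synchronous path that the copy-overhead bullet refers to, written against the standard Berkeley sockets API (ordinary Linux code, not ETA code): each recv() is a system call, and because the application only names its buffer at call time, data already staged in kernel buffers must be copied into it.

```c
/* Conventional synchronous receive: one system call per recv(), plus a
 * kernel-to-user copy of the received bytes into buf. */
#include <sys/types.h>
#include <sys/socket.h>

ssize_t read_exact(int sock, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = recv(sock, (char *)buf + done, len - done, 0);
        if (n <= 0)
            return n;          /* error, or the peer closed the connection */
        done += n;
    }
    return (ssize_t)done;
}
```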

ETA Profile (1 host CPU + 1 PPE)
Processing times are compressed
– Idle time represents CPU resource that is usable for applications
– The asynchronous queuing interface avoids copy overhead
– Interrupts are avoided by not time-sharing the CPU
– System calls are avoided by the ETA queuing model
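One way to read the last two bullets: on the fast path the host reaps completions by polling the DTI event queue in its own context (no system call, no interrupt), and only asks the PPE for an interrupt event when it decides to block, matching the "post ETA interrupt event (if waiting)" step in the DTI operation model. A sketch under the same assumptions as the earlier DTI sketches; handle_completion(), dti_request_interrupt(), wait_for_interrupt_event(), and SPIN_LIMIT are invented for illustration.

```c
/* Hypothetical host-side completion loop.  Fast path: plain memory reads
 * of the event queue.  Slow path: request an ETA interrupt event, then block. */
#define SPIN_LIMIT 1000                    /* arbitrary spin budget */

/* Placeholder helpers, not part of the real prototype: */
void handle_completion(const struct dti_event *ev);
void dti_request_interrupt(struct dti *d);
void wait_for_interrupt_event(struct dti *d);

void reap_completions(struct dti *d)
{
    struct dti_event ev;
    int idle = 0;

    for (;;) {
        if (dti_poll_event(d, &ev)) {      /* no system call, no interrupt */
            handle_completion(&ev);
            idle = 0;
        } else if (++idle > SPIN_LIMIT) {
            dti_request_interrupt(d);      /* PPE will post an interrupt event */
            wait_for_interrupt_event(d);   /* block until it does */
            idle = 0;
        }
    }
}
```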

Profile Comparisons: ETA vs. 2P SMP
[chart]

Normalized CPU Usage
[chart, normalized to the SMP rate]

Analysis
Partitioning the system in ETA allows us to optimize the PPE in ways that are not possible when sharing the CPU with applications and the OS.
– No kernel scheduling; NIC interrupts are not needed to preemptively schedule the driver and kernel
– ETA-optimized driver processing is less than half of the SMP version, by avoiding device register accesses (interrupt handling) and by doing educated prefetches
– Copies are avoided by queuing transmit requests and asynchronously reaping completions (asynchronous I/O is important)
– System calls are avoided because we are cheating (running the test in the kernel), but we expect the same result at user level given user-level queuing and an asynchronous sockets API
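A rough sketch of the kind of PPE main loop these bullets imply: the dedicated CPU polls DTI doorbells and NIC receive rings instead of taking interrupts, and prefetches the next descriptor while handling the current one. All of the ppe_*, dti_*, and nic_* helpers here are invented for illustration, not the prototype's real functions.

```c
/* Placeholder helpers and types, for illustration only: */
struct nic { void *regs; };
int  dti_doorbell_rung(struct dti *d);
struct dti_op_desc *dti_next_op(struct dti *d);
struct dti_op_desc *dti_peek_next_op(struct dti *d);
void ppe_process_op(struct dti *d, struct dti_op_desc *op);
void dti_post_event(struct dti *d, struct dti_op_desc *op);
void nic_poll_rx(struct nic *n);

/* Dedicated PPE loop: owns its CPU, never sleeps, never takes an interrupt. */
static void ppe_main_loop(struct dti *dtis, int num_dtis,
                          struct nic *nics, int num_nics)
{
    for (;;) {
        for (int i = 0; i < num_dtis; i++) {
            if (!dti_doorbell_rung(&dtis[i]))
                continue;                          /* poll, don't interrupt */

            struct dti_op_desc *op = dti_next_op(&dtis[i]);
            __builtin_prefetch(dti_peek_next_op(&dtis[i]));  /* educated prefetch */

            ppe_process_op(&dtis[i], op);          /* TCP/IP processing, DMA setup */
            dti_post_event(&dtis[i], op);          /* completion to the event queue */
        }

        /* Service NIC receive rings directly; no interrupt handler and no
         * device register reads on the common path. */
        for (int j = 0; j < num_nics; j++)
            nic_poll_rx(&nics[j]);
    }
}
```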

Analysis 2
ETA TCP/IP processing component is less than half of the SMP version
– Some path-length reduction (explicit scheduling, locking)
– Efficiencies gained from not being scheduled by the OS or interrupted by the NIC device, giving us better CPU pipeline and cache behavior
Further Analysis
– Based on a new reference TCP/IP stack optimized for the ETA environment (in development)

Futures
Scalable Linux Network Performance
– Joint project between HP Labs and Intel R&D
– Asynchronous sockets on ETA
Optimized Packet Processing Engine Stack
– Tuned to the ETA environment
– Greater concurrency to hide memory access latencies
Analysis
– Connection acceleration
– End-to-end latency measurement and analysis
Legacy Sockets Stack on ETA
– Legacy application enabling

Summary
Partitioning of processing resources à la ETA can greatly improve networking performance
– General-purpose CPUs can be used more efficiently for packet processing
An asynchronous queuing model for efficient Host/PPE communication is important
– Lessons learned in VI Architecture and IBA can be applied to streams and sockets

Acknowledgements
Dave Minturn, Annie Foong, Gary McAlpine, Vikram Saletore
Thank You.