Exploiting Task-level Concurrency in a Programmable Network Interface June 11, 2003 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture Group

2 Why a Programmable Network Interface?  More complex functionality on the network interface –TCP offload, iSCSI, etc.  Easier maintenance –Bug fixes, upgrades, customization, etc.  Performance? –51% lower web server throughput than an ASIC NIC –A big problem

3 Improving Performance  Increase clock speed and/or complexity –Typical solutions for general-purpose processors –Do not work for embedded processors  Design constraints: limited power and area –Power proportional to C V² f –Higher f requires higher V –Thus, power roughly proportional to f³ –Complexity increases C only for marginal gains  Implication: a simple, low-frequency processor
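A quick back-of-the-envelope version of the power argument above; the assumption that supply voltage scales roughly linearly with frequency is the slide's, and real devices only approximate it:

```latex
P = C V^{2} f, \qquad V \propto f \;\Longrightarrow\; P \propto C f^{3},
\qquad
\frac{P(2f)}{P(f)} = 2^{3} = 8
\quad\text{vs.}\quad
\frac{P_{\text{two cores at } f}}{P_{\text{one core at } f}} = 2 .
```

Doubling the clock of one core costs roughly eight times the power under this model, while adding a second identical core at the original clock only doubles it, which is the rationale for two simple, low-frequency cores.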

4 Use Parallel Programming  Use multiple programmable cores –Increase computational capacity –Achieve performance within the power limit: consumes far less power than a higher-frequency core  Improvements with two cores over a single core –65-157% for bidirectional traffic –27-51% for web server workloads –Web server throughput comparable to an ASIC NIC

5 Outline  Background –Tigon Programmable Gigabit Ethernet Controller –Network Interface Processing: Send/Receive  Parallelization of Firmware  Experimental Results  Conclusion

6 Tigon Gigabit Ethernet Controller  Two programmable cores –Based on MIPS, running at 88 MHz –Small on-chip memory (scratch pad) per core  Shared off-chip SRAM  Supports event-driven firmware  No interrupts –Event handlers run to completion –Handlers on the same core require no synchronization  Released firmware fully utilizes only one core  No previous Ethernet firmware utilizes two cores
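A minimal sketch of what the event-driven, run-to-completion dispatch described above might look like. The event bits, register address, and handler names are illustrative assumptions, not the Tigon's actual interface:

```c
#include <stdint.h>

/* Hypothetical event bits and register address; the real Tigon event
 * register and its encoding differ. */
#define EV_MAILBOX            (1u << 0)
#define EV_DMA_READ_COMPLETE  (1u << 1)
#define EV_SEND_DATA_READY    (1u << 2)
#define EV_DMA_WRITE_COMPLETE (1u << 3)

static volatile uint32_t *const event_reg = (volatile uint32_t *)0xf000;

/* Stub handlers; real firmware would program DMA channels, update
 * descriptor rings, and so on. */
static void handle_mailbox(void)            { /* host produced a descriptor */ }
static void handle_dma_read_complete(void)  { /* descriptor/packet arrived on NIC */ }
static void handle_send_data_ready(void)    { /* packet ready to transmit */ }
static void handle_dma_write_complete(void) { /* data written back to host memory */ }

/* One such loop runs on each core.  There are no interrupts: the loop
 * polls the event register and runs one handler to completion before
 * checking again, so handlers dispatched on the same core never
 * preempt each other and need no synchronization among themselves. */
void firmware_main_loop(void)
{
    for (;;) {
        uint32_t events = *event_reg;

        if (events & EV_MAILBOX)
            handle_mailbox();
        else if (events & EV_DMA_READ_COMPLETE)
            handle_dma_read_complete();
        else if (events & EV_SEND_DATA_READY)
            handle_send_data_ready();
        else if (events & EV_DMA_WRITE_COMPLETE)
            handle_dma_write_complete();
        /* otherwise: no pending event, keep polling */
    }
}
```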

7 Send Processing (diagram: host main memory and CPU connected over the PCI bus and bridge to the network interface card; descriptors and packets move via memory-mapped I/O, DMA, and an interrupt)
1. Host creates a buffer descriptor in main memory.
2. Host alerts the NIC via a memory-mapped I/O mailbox write: buffer descriptor produced.
3. NIC fetches the buffer descriptor via DMA.
4. NIC transfers the packet via DMA.
5. NIC transmits the packet onto the network.
6. NIC alerts the host via an interrupt: buffer descriptor consumed.
Tigon events involved: Mailbox, Send Buffer Descriptor Ready, DMA Read Complete, Send Data Ready, Update Send Consumer, DMA Write Complete.
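For steps 1 and 2, the host driver builds a descriptor in main memory and then writes a memory-mapped mailbox register so the NIC knows a descriptor was produced. A hedged sketch of that host-side posting; the descriptor layout, ring size, and mailbox semantics are illustrative assumptions, not the actual Tigon driver:

```c
#include <stdint.h>

/* Illustrative send buffer descriptor: where the packet lives in host
 * memory and how long it is.  The real Tigon descriptor format differs. */
struct send_bd {
    uint64_t dma_addr;   /* physical address of the packet buffer */
    uint16_t length;     /* packet length in bytes */
    uint16_t flags;      /* e.g. end-of-packet */
    uint32_t reserved;
};

#define SEND_RING_SIZE 256

struct send_ring {
    struct send_bd bd[SEND_RING_SIZE];  /* descriptor ring in host memory */
    uint32_t producer;                  /* next slot the host will fill */
};

/* Step 1: create a buffer descriptor describing the packet.
 * Step 2: alert the NIC through a memory-mapped mailbox register that a
 * new descriptor has been produced; the NIC then DMAs the descriptor
 * and the packet (steps 3-4) and transmits it (step 5). */
void post_send(struct send_ring *ring,
               volatile uint32_t *mailbox,   /* memory-mapped mailbox */
               uint64_t pkt_dma_addr, uint16_t pkt_len)
{
    uint32_t idx = ring->producer % SEND_RING_SIZE;

    ring->bd[idx].dma_addr = pkt_dma_addr;
    ring->bd[idx].length   = pkt_len;
    ring->bd[idx].flags    = 0;

    ring->producer++;

    /* Memory-mapped I/O write: tells the NIC the new producer index. */
    *mailbox = ring->producer;
}
```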

8 Receive Processing: Pre-allocation (diagram: host pre-posts receive buffers to the NIC)
1. Host allocates a receive buffer in main memory.
2. Host creates a buffer descriptor for it.
3. Host alerts the NIC via a memory-mapped I/O mailbox write: buffer descriptor produced.
4. NIC fetches the buffer descriptor via DMA.
Tigon events involved: Mailbox, Receive Buffer Descriptor Ready, DMA Read Complete.

9 Receive Processing: Actual Receive (diagram: an arriving packet flows from the NIC into a pre-allocated host buffer)
1. NIC stores the arriving packet.
2. NIC creates a buffer descriptor for it.
3. NIC transfers the packet into the pre-allocated receive buffer via DMA.
4. NIC transfers the buffer descriptor via DMA.
5. NIC alerts the host via an interrupt: buffer descriptor produced.
Tigon events involved: Receive Complete, DMA Write Complete, Update Receive Return Producer.
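Putting the two receive slides together, here is a sketch of what the NIC-side handler for a completed receive might do. The structure layouts, ring size, and the DMA/interrupt stand-ins are assumptions for illustration, not the real firmware:

```c
#include <stdint.h>

#define RECV_RING_SIZE 256

/* Hypothetical descriptor for a host receive buffer posted in advance
 * (slide 8), and the return descriptor the NIC produces per frame. */
struct recv_bd        { uint64_t host_dma_addr; uint16_t buf_len; };
struct recv_return_bd { uint64_t host_dma_addr; uint16_t pkt_len; };

/* Stand-ins for the DMA write channel and the host interrupt line;
 * real firmware would program hardware registers here. */
static void dma_write(uint64_t host_addr, const void *src, uint32_t len)
{ (void)host_addr; (void)src; (void)len; }
static void raise_host_interrupt(void) { }

/* "Actual receive" (slide 9): the frame is already in NIC memory
 * (step 1).  Build a return descriptor (step 2), DMA the packet into
 * the next pre-allocated host buffer (step 3), DMA the descriptor to
 * the host's return ring (step 4), then interrupt the host (step 5). */
void handle_receive_complete(const void *frame, uint16_t frame_len,
                             const struct recv_bd *free_bufs,
                             uint32_t *buf_consumer,
                             uint64_t return_ring_addr,
                             uint32_t *return_producer)
{
    const struct recv_bd *buf = &free_bufs[*buf_consumer % RECV_RING_SIZE];
    (*buf_consumer)++;

    struct recv_return_bd ret = { buf->host_dma_addr, frame_len };

    dma_write(buf->host_dma_addr, frame, frame_len);            /* step 3 */
    dma_write(return_ring_addr +
              (*return_producer % RECV_RING_SIZE) * sizeof(ret),
              &ret, sizeof(ret));                               /* step 4 */
    (*return_producer)++;

    raise_host_interrupt();                                     /* step 5 */
}
```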

10 Tigon Uniprocessor Performance (plot: UDP throughput of the Tigon with uniprocessor firmware vs. the Intel PRO/1000 MT and Netgear 622T ASIC NICs)
– The Intel NIC achieves 100% higher throughput than the Tigon.
– Maximum UDP throughput decreases due to network headers and per-frame overhead.

11 Outline  Background  Parallelization of Firmware –Principles –Resource Sharing Patterns –Partitioning Process  Experimental Results  Conclusion

12 Principles  Identify unit of concurrency –Event handler  Analyze resource sharing patterns  Profile uniprocessor firmware  Partition event handlers so as to –Balance load –Minimize synchronization –Maximize on-chip memory utilization

13 Resource Sharing Patterns (diagram: event handlers and the resources they share)
Event handlers: Mailbox, DMA Read Complete, Send Buffer Descriptor Ready, Send Data Ready, Update Send Consumer, Receive Buffer Descriptor Ready, Receive Complete, Update Receive Return Producer, DMA Write Complete.
Shared resources: shared data objects, shared DMA read channel, shared DMA write channel.
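Handlers placed on the same core run to completion and need no locks, but when handlers that use the same resource (for example, the DMA read channel) end up on different cores, access to it must be serialized. A minimal sketch of such a guard, using C11 atomics as a stand-in for whatever primitive the firmware actually uses on the Tigon's MIPS cores:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simple test-and-set spinlock; handlers are short, so contention is brief. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* busy-wait until the other core releases the channel */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static spinlock_t dma_read_lock = { ATOMIC_FLAG_INIT };  /* guards DMA read channel */

/* Hypothetical helper used by any handler, on either core, that needs
 * the shared DMA read channel. */
void dma_read_start(uint64_t host_addr, void *dst, uint32_t len)
{
    spin_lock(&dma_read_lock);
    /* ... program the DMA read channel registers here ... */
    (void)host_addr; (void)dst; (void)len;
    spin_unlock(&dma_read_lock);
}
```

The partitioning goal on the next slides is precisely to make such cross-core sharing rare, so this lock is the exception rather than the rule.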

14 Partitioning Process (diagram: the event handlers from slide 13, annotated with their profiled execution-time shares of 6%, 4%, 3%, 5%, 14%, 30%, 1%, and 31%, are assigned step by step to CPU A and CPU B around the shared data objects and DMA channels; intermediate per-CPU totals of 30%/31%, 52%/41%, and 47%/53% are shown along the way)
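The slide shows those profiled shares being grouped by hand into the final 47%/53% split while keeping handlers that share resources together. As a rough illustration of the load-balancing half of that process only, here is a greedy bin-packing sketch; it ignores the shared-resource constraints from slide 13 and is not the authors' actual procedure:

```c
#include <stdio.h>

int main(void)
{
    /* Profiled execution-time shares of the event handlers (slide 14);
     * which share belongs to which handler is not preserved here.
     * The remaining few percent are idle or miscellaneous time. */
    double share[] = { 6, 4, 3, 5, 14, 30, 1, 31 };
    double load[2] = { 0, 0 };

    /* Greedy load balancing: put each handler on the currently less
     * loaded core.  The real partition also keeps handlers that share
     * data objects or a DMA channel on the same core to avoid locking. */
    for (unsigned i = 0; i < sizeof share / sizeof share[0]; i++) {
        int cpu = (load[0] <= load[1]) ? 0 : 1;
        load[cpu] += share[i];
    }
    printf("CPU A: %.0f%%  CPU B: %.0f%%\n", load[0], load[1]);
    return 0;
}
```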

15 Final Partition (diagram: the final assignment of event handlers and shared resources to the two cores) CPU A: 47%, CPU B: 53%.

16 Outline  Background  Parallelization of Firmware  Experimental Results –Improved Maximum Throughput –Improved Web Server Throughput  Conclusion

17 Experimental Setup  Network interface card –3Com Gigabit Ethernet interface card based on the Tigon  Firmware versions –Uniprocessor firmware: from the original manufacturer –Parallel firmware: modified version of the uniprocessor firmware  Benchmarks –UDP bidirectional, unidirectional, and ping traffic –Web server (thttpd) and software router (Click)  Testbed –PC machines with AMD Athlon CPUs and 2 GB RAM –FreeBSD 4.7

18 Overall Improvements (plot: throughput of the parallel firmware vs. the uniprocessor firmware) Improvements range from 65% to 157%.

19 Sources of Improvements (plot) 37% improvement due to two processors; 70% improvement due to scratch pads.

20 Comparison to ASIC NICs (plot: 3Com Tigon with parallel firmware (controller from 1997), Intel PRO/1000 MT (Intel, 2002), Netgear GA622T (National Semiconductor, 2001), and 3Com Tigon with uniprocessor firmware) With the parallel firmware, the Intel NIC is only 21% faster than the Tigon.

21 Impact on Web Server Throughput (plot) Overall 27-51% improvement; comparable to ASIC NICs.

22 Parallelization Makes Programmability Viable  Programmability is useful for complex functions  Limited clock speed for embedded processors –Limits uniprocessor performance  Use multiple cores to improve performance  Two cores vs. a single core –65% increase in maximum throughput –51% increase in web server throughput –Web server throughput comparable to ASIC NICs

23

24 UDP Send: Overall Improvements

25 UDP Send: Sources of Improvements

26 UDP Receive: Overall Improvements

27 UDP Receive: Sources of Improvements

28 UDP Ping: Overall Improvements

29 UDP Ping: Sources of Improvements

30 Impact on Routing Throughput