An Efficient Programmable 10 Gigabit Ethernet Network Interface Card Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai.

Presentation transcript:

An Efficient Programmable 10 Gigabit Ethernet Network Interface Card Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai

1 Designing a 10 Gigabit NIC
- Programmability for performance
  - Computation offloading improves performance
- NICs have power and area concerns
  - Architecture solutions should be efficient
- Above all, must support 10 Gb/s links
  - What are the computation and memory requirements?
  - What architecture efficiently meets them?
  - What firmware organization should be used?

2 Mechanisms for an Efficient Programmable 10 Gb/s NIC
- A partitioned memory system
  - Low-latency access to control structures
  - High-bandwidth, high-capacity access to frame data
- A distributed task-queue firmware
  - Utilizes frame-level parallelism to scale across many simple, low-frequency processors
- New RMW instructions
  - Reduce firmware frame-ordering overheads by 50% and the clock frequency requirement by 17%

3 Outline
- Motivation
- How Programmable NICs Work
- Architecture Requirements, Design
- Frame-parallel Firmware
- Evaluation

4 How Programmable NICs Work
[Block diagram: processor(s) and memory on an internal bus, connected to the host through a PCI interface and to the network through an Ethernet interface]

5 Per-frame Requirements

              Instructions   Data Accesses
  TX Frame
  RX Frame         253             85

Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions.

6 Aggregate Requirements: 10 Gb/s, Max-Sized Frames

            Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
  TX Frame        229 MIPS                 2.6 Gb/s                19.75 Gb/s
  RX Frame        206 MIPS                 2.2 Gb/s                19.75 Gb/s
  Total           435 MIPS                 4.8 Gb/s                39.5 Gb/s

1514-byte frames at 10 Gb/s: 812,744 frames/s

7 Meeting 10 Gb/s Requirements with Hardware
- Processor Architecture
  - At least 435 MIPS within an embedded device
  - Does NIC firmware have ILP?
- Memory Architecture
  - Low-latency control data
  - High-bandwidth, high-capacity frame data
  - … both, how?

8 ILP Processors for NIC Firmware?
- ILP limited by data and control dependences
- Analysis of dynamic traces reveals dependences
[Chart: IPC of in-order vs. out-of-order issue under perfect branch prediction, a 1-bit branch predictor, and no branch prediction]

9 Processors: 1-Wide, In-order
- 2x performance is costly
  - Branch prediction, reorder buffer, renaming logic, wakeup logic
  - Overheads translate to greater than 2x core power and area costs
  - Great for a general-purpose processor; not for an embedded device
- Other opportunities for parallelism? YES!
  - Many steps to process a frame - run them simultaneously
  - Many frames need processing - process them simultaneously
- Use parallel single-issue cores

10 Memory Architecture
- Competing demands
  - Frame data: high bandwidth, high capacity for many offload mechanisms
  - Control data: low latency; coherence among processors, the PCI interface, and the Ethernet interface
- The traditional solution: caches
  - Advantages: low latency, transparent to the programmer
  - Disadvantages: hardware costs (tag arrays, coherence)
  - In many applications, the advantages outweigh the costs

11 Are Caches Effective?
[Chart: SMPCache trace analysis of a 6-processor NIC architecture]

12 Choosing a Better Organization
[Diagrams: a cache hierarchy vs. a partitioned organization]

13 Putting it All Together
[Block diagram: P CPUs, each with its own I-cache fed from a shared instruction memory, connected through a (P+4)x(S) 32-bit crossbar to S scratchpads; the PCI interface, Ethernet interface, and an external memory interface to off-chip DRAM also attach to the crossbar]
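The crossbar implies software-visible banking rather than hardware caching: firmware addresses are steered either to a scratchpad bank or to the off-chip frame-data DRAM. A minimal routing sketch, with an entirely illustrative address map and bank count (the real NIC's map is not given on the slide); consecutive words interleave across banks so crossbar conflicts are spread out:

```c
#include <stdint.h>

#define NUM_BANKS 4u              /* S scratchpad banks (illustrative)   */
#define SPAD_BASE 0x00000000u
#define SPAD_SIZE 0x00040000u     /* 256 KB of scratchpad (illustrative) */
#define DRAM_BASE 0x10000000u
#define EXT_DRAM  100             /* routing result: off-chip DRAM       */
#define BAD_ADDR  (-1)

/* Route a 32-bit word access: scratchpad accesses pick a bank from the
 * low-order word-address bits; addresses at or above DRAM_BASE go
 * through the external memory interface to the frame-data DRAM. */
static int route(uint32_t addr) {
    if (addr - SPAD_BASE < SPAD_SIZE)
        return (int)((addr >> 2) % NUM_BANKS);   /* bank number 0..S-1 */
    if (addr >= DRAM_BASE)
        return EXT_DRAM;
    return BAD_ADDR;
}
```

Because the partition is explicit, control data placed in scratchpads needs no tag arrays or coherence hardware; the firmware knows exactly where each structure lives.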

14 Parallel Firmware
- NIC processing steps are already well-defined
- Previous Gigabit NIC firmware divides the steps between 2 processors
- … but does this mechanism scale?

15 Task Assignment with an Event Register
[Diagram: an event register with a PCI read bit, a SW event bit, and other bits; hardware sets bits as work arrives (e.g. the PCI interface finishes work), and processors inspect the register to find work: inspecting transactions, enqueuing TX data, and passing data to the Ethernet interface]
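The dispatch idea above can be sketched as a polling loop: each processor reads the event register, claims a set bit, and runs the matching handler. This is a sketch only - the bit positions, handler names, and claim mechanism are illustrative, and a real multiprocessor NIC would claim bits atomically:

```c
#include <stdint.h>

enum { EVT_PCI_READ = 1u << 0, EVT_SW = 1u << 1 };  /* illustrative bits */

static uint32_t event_register;     /* set by the hardware interfaces */
static int pci_reads_handled, sw_events_handled;

static void handle_pci_read(void) { pci_reads_handled++; }
static void handle_sw_event(void) { sw_events_handled++; }

/* One dispatch turn on one processor: find a pending event bit,
 * clear it to claim the work, and run the matching handler. */
static void dispatch_once(void) {
    if (event_register & EVT_PCI_READ) {
        event_register &= ~EVT_PCI_READ;   /* claim the event */
        handle_pci_read();                 /* e.g. inspect fetched descriptors */
    } else if (event_register & EVT_SW) {
        event_register &= ~EVT_SW;
        handle_sw_event();                 /* e.g. enqueue TX data */
    }
}
```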

16 Task-level Parallel Firmware
[Timeline: PCI read bit and hardware status shown above processor activity. Processor 0 runs "Transfer DMAs 0-4" and then idles while processor 1 runs "Process DMAs 0-4"; processor 0 then runs "Transfer DMAs 5-9" while processor 1 idles, and so on - each processor is bound to one task and spends much of its time idle]

17 Frame-level Parallel Firmware
[Timeline: each processor carries a batch of frames through every step - processor 0 runs "Transfer DMAs 0-4", "Process DMAs 0-4", then "Build Event" while processor 1 independently runs "Transfer DMAs 5-9", "Process DMAs 5-9", then "Build Event" - leaving far less idle time than the task-level organization]
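The frame-level organization can be sketched as follows: rather than binding each processing step to a fixed processor, any idle processor claims the next batch of frames and carries it through every step. Function names, the batch size, and the claim counter are illustrative, not the NIC's actual firmware; on real hardware the claim would be protected by a lock or atomic operation:

```c
#define BATCH 5

static int next_frame;    /* shared claim counter */
static int frames_done;

static void transfer_dmas(int first, int n) { (void)first; (void)n; /* DMA frames in */ }
static void process_dmas(int first, int n)  { (void)first; frames_done += n; }

/* One turn on one processor: claim a batch, then run every step on it,
 * ending with the completion event for the hardware. */
static int run_batch(void) {
    int first = next_frame;          /* claim frames [first, first+BATCH) */
    next_frame += BATCH;
    transfer_dmas(first, BATCH);     /* "Transfer DMAs" */
    process_dmas(first, BATCH);      /* "Process DMAs"  */
    /* build_event(first, BATCH); */ /* "Build Event"   */
    return first;
}
```

Two processors calling run_batch concurrently would work on disjoint batches (frames 0-4 and 5-9), which is exactly the overlap the slide's timeline shows.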

18 Evaluation Methodology
- Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
  - Memory latency, bandwidth, and contention modeled precisely
  - Processors modeled in detail
  - NIC I/O (PCI, Ethernet interfaces) modeled in detail
  - Validated against the Tigon 2 Gigabit NIC (LCTES 2004)
- Idea: model everything inside the NIC
  - Gather performance and trace data

19 Scaling in Two Dimensions

20 Processor Performance

  IPC component
  Execution                   0.72
  Miss Stalls                 0.01
  Load Stalls                 0.12
  Scratchpad Conflict Stalls  0.05
  Pipeline Stalls             0.10
  Total                       1.00

- Achieves 83% of theoretical peak IPC
- Small I-caches work
- Sensitive to memory stalls
  - Half of loads are part of a load-to-use sequence
  - Conflict stalls could be reduced with more ports or more banks

21 Reducing Frame Ordering Overheads
- Firmware ordering is costly - 30% of execution
- Synchronization and bitwise check/updates occupy processors and memory
- Solution: atomic bitwise operations that also update a pointer according to the last set location
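One way to read the proposed operation: it atomically sets a frame's status bit and advances the in-order completion pointer past the contiguous run of set bits, so firmware no longer needs a lock plus an explicit scan. A software model of that semantics (a sketch only; the actual instruction encoding and the handoff to the Ethernet interface are not specified here, and a plain function stands in for the single atomic instruction):

```c
#include <stdint.h>

static uint32_t status_bits;   /* bit i set => frame i fully processed   */
static unsigned next_in_order; /* first frame not yet handed to hardware */

/* Model of the RMW instruction: mark one frame complete, then advance
 * the pointer past every contiguous completed frame, retiring each one
 * (here, retiring just clears its bit) in arrival order. */
static unsigned set_and_advance(unsigned frame) {
    status_bits |= 1u << frame;
    while (status_bits & (1u << next_in_order)) {
        status_bits &= ~(1u << next_in_order); /* hand frame to Ethernet IF */
        next_in_order++;
    }
    return next_in_order;
}
```

If frames 1 and 2 finish before frame 0, the pointer stays at 0; when frame 0 finally completes, one call retires all three in order.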

22 Maintaining Frame Ordering
[Diagram: a frame status array with one bit per frame index; CPUs A and B prepare frames and set their status bits, while CPU C detects completed frames by taking a LOCK, iterating over the contiguous run of set bits, notifying the hardware (Ethernet interface), and then UNLOCKing]

23 RMW Instructions Reduce Clock Frequency
- Performance: 6 cores at 166 MHz with RMW instructions match 6 cores at 200 MHz without
  - Performance is equivalent at all frame sizes
  - 17% reduction in the frequency requirement
- Dynamically tasked firmware balances the benefit
  - Send cycles reduced by 28.4%
  - Receive cycles reduced by 4.7%

24 Conclusions
- A Programmable 10 Gb/s NIC
- This NIC architecture relies on:
  - Data memory system - partitioned organization, not coherent caches
  - Processor architecture - parallel scalar processors
  - Firmware - frame-level parallel organization
  - RMW instructions - reduce ordering overheads
- A programmable NIC: a substrate for offload services

25 Comparing Frame Ordering Methods