I/O Acceleration in Server Architectures


1 I/O Acceleration in Server Architectures
Laxmi N. Bhuyan, University of California, Riverside

2 Acknowledgement Many slides in this presentation have been taken from (or modified from) Li Zhao's Ph.D. dissertation at UCR and Ravi Iyer's (Intel) presentation at UCR. The research has been supported by NSF, UC Micro, and Intel Research.

3 Enterprise Workloads: Key Characteristics
Throughput-oriented: lots of transactions, operations, etc. in flight; many VMs, processes, threads, fibers, etc. Scalability and adaptability are key.
Rich (I/O) content: TCP, SoIP, SSL, XML.
High throughput requirements: efficiency and utilization are key.

4 Rich I/O Content in the Enterprise
Trends: increasing layers of processing on I/O data; business-critical functions (TCP, IP storage, security, XML, etc.) independent of actual application processing; exacerbated by high network rates.
High rates of I/O bandwidth with new technologies: PCI-Express technology, 10 Gb/s to 40 Gb/s network technologies, and it just keeps going.
[Figure: I/O content stacked on the platform between the network data and the application: TCP/IP, iSCSI, SSL, XML, App.]

5 Network Protocols
TCP/IP protocols versus the OSI Reference Model (7 layers vs. 4 layers):

  OSI layer        Examples            TCP/IP layer    Examples
  7 Application    HTTP, Telnet        4 Application   HTTP, Telnet, SSL, XML
  6 Presentation   XML
  5 Session        SSL
  4 Transport      TCP, UDP            3 Transport     TCP, UDP
  3 Network        IP, IPSec, ICMP     2 Internet      IP, IPSec, ICMP
  2 Data Link      Ethernet, FDDI      1 Link          Ethernet, FDDI
  1 Physical       Coax, Signaling

Note: SSL and XML sit in the OSI Session/Presentation layers but fall under the Application layer in TCP/IP.

6 The situation is even worse with virtualization

7 Virtualization Overhead: Server Consolidation
Server consolidation is when both the guests run on same physical hardware Server-to-Server Latency & Bandwidth comparison under 10Gbps ===>

8 Communicating with the Server: The O/S Wall
Problems: the O/S overhead of moving a packet between the network and the application level, namely protocol stack (TCP/IP) processing, O/S interrupts, and data copying from kernel space to user space and vice versa. Oh, and the PCI bottleneck!
[Figure: CPU with user and kernel space on one side, the NIC on the other, connected over the PCI bus.]

9 The Send Operation
1. The application writes the transmit data to the TCP/IP sockets interface for transmission, in payload sizes ranging from 4 KB to 64 KB.
2. The data is copied from user space to kernel space.
3. The OS segments the data into maximum transmission unit (MTU)-size packets and then adds TCP/IP header information to each packet.
4. The OS copies the data onto the network interface card (NIC) send queue.
5. The NIC performs the direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts CPU activities to indicate completion of the transfer.
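For reference, the application only sees step 1 of this path: a write to a socket. Everything else happens inside the kernel and the NIC. A minimal sketch in C of that application side (the address 192.0.2.1, port 8080, and the 64 KB payload are illustrative assumptions, not values from the slides):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);     /* TCP socket */
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(8080);                 /* illustrative port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);
    connect(fd, (struct sockaddr *)&srv, sizeof srv);

    static char payload[64 * 1024];               /* 64 KB application buffer */
    memset(payload, 'x', sizeof payload);

    /* send() copies the buffer from user space into kernel socket buffers;
       the OS then segments it into MTU-size packets and the NIC DMAs them. */
    ssize_t sent = send(fd, payload, sizeof payload, 0);
    (void)sent;
    close(fd);
    return 0;
}
```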

10 Transmit/Receive data using a standard NIC
Note: the receive path is longer than the send path due to extra copying.

11 Where do the cycles go?

12 Network Bandwidth is Increasing
TCP processing requirement, rule of thumb: about 1 GHz of CPU for every 1 Gb/s of TCP throughput; at 10 Gb/s that already implies roughly 10 GHz devoted to the stack alone, more than a single core provides.
Network bandwidth outpaces Moore's Law: the gap between the rate at which network applications can be processed and the fast-growing network bandwidth keeps increasing.
[Figure: log-scale plot (0.01 to 1000, in GHz and Gb/s) versus time from 1990 to 2010; the network-bandwidth curve (10 to 40 Gb/s) pulls away from the Moore's Law curve.]

13 I/O Acceleration Techniques
TCP Offload: offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TCP offload engines, TOEs). A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
O/S Bypass: user-level software techniques to bypass the protocol stack, e.g., a zero-copy protocol (this needs a programmable device in the NIC for direct user-level memory access and virtual-to-physical memory mapping; example: VIA).
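As a software-side point of reference, Linux already exposes a limited zero-copy send path to applications through sendfile(), which moves file data to a socket inside the kernel without the intermediate user-space copy. A minimal sketch, assuming an already-connected TCP socket and a Linux host (the helper name is illustrative):

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a whole file over a connected TCP socket without copying it
   through user space. Returns 0 on success, -1 on error. */
int send_file_zero_copy(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* sendfile() moves data from the page cache to the socket entirely
           in the kernel: no user/kernel copy on the data path. */
        ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (n <= 0) {
            close(file_fd);
            return -1;
        }
    }
    close(file_fd);
    return 0;
}
```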

14 Comparing standard TCP/IP and TOE-enabled TCP/IP stacks

15 Design of a Web Switch Using IXP 2400 Network Processor
The same problems arise with AONs, programmable routers, and web switches: requests have to go up through the network, IP, and TCP layers before any application-level processing.
Our solution: bring TCP connection establishment and processing down to the network level using a network processor (NP).
[Figure: an HTTP request ("GET /cgi-bin/form HTTP/1.1", "Host: ...") arriving from the Internet and passing through IP, TCP, and application-level processing.]
Ref: L. Bhuyan, "A Network Processor Based, Content Aware Switch," IEEE Micro, May/June 2006 (with L. Zhao and Y. Luo).

16 Ex: Design of a Web Switch Using the Intel IXP 2400 NP (IEEE Micro, June/July 2006)
[Figure: the content-aware switch receives "GET /cgi-bin/form HTTP/1.1", "Host: ..." from the Internet, inspects it at the IP, TCP, and application-data levels, and forwards the request to the appropriate back end: application server, image server, or HTML server.]
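To make the content-aware dispatch concrete: once the switch has terminated the client's TCP connection and reassembled enough of the stream to see the request line, the routing decision amounts to inspecting the URL and picking a back end. A simplified C sketch of that decision (the URL prefixes and back-end names are illustrative assumptions, not taken from the paper):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical identifiers for the three server pools shown in the figure. */
enum backend { APP_SERVER, IMAGE_SERVER, HTML_SERVER };

/* Pick a back end by inspecting the HTTP request line (Layer-7 routing).
   The switch can only do this after it has terminated the TCP connection
   and seen the application-level data. */
enum backend route_request(const char *request_line) {
    if (strncmp(request_line, "GET /cgi-bin/", 13) == 0)
        return APP_SERVER;                    /* dynamic content */
    if (strstr(request_line, ".jpg ") || strstr(request_line, ".png "))
        return IMAGE_SERVER;                  /* static images */
    return HTML_SERVER;                       /* everything else */
}

int main(void) {
    const char *req = "GET /cgi-bin/form HTTP/1.1";
    printf("backend = %d\n", route_request(req));
    return 0;
}
```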

17 But Our Concentration in This Talk!
Server acceleration: design server (CPU) architectures to speed up protocol stack processing! Also, focus on TCP/IP.

18 Profile of a Packet
Simulation results: FreeBSD running on SimpleScalar.
[Figure: per-packet cycle breakdown (no system overheads) across descriptor & header accesses, IP processing, compute, TCB accesses, TCP processing, memory copy, and memory.]
Average clocks per packet: ~21K. Effective bandwidth: 0.6 Gb/s (1 KB receive).
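A quick back-of-envelope check of how the two numbers on this slide relate (the implied core clock below is derived from them, not stated on the slide):

```c
#include <stdio.h>

/* ~21K cycles per 1 KB received packet and 0.6 Gb/s effective bandwidth
   together imply a core clock of roughly 1.5 GHz; this is only a
   consistency check of the slide's figures, not a quoted number. */
int main(void) {
    const double cycles_per_packet = 21e3;
    const double packet_bits       = 1024.0 * 8.0;   /* 1 KB payload */
    const double bandwidth_bps     = 0.6e9;          /* 0.6 Gb/s */

    double packets_per_sec = bandwidth_bps / packet_bits;         /* ~73K pkt/s */
    double implied_clock   = packets_per_sec * cycles_per_packet; /* Hz */

    printf("packets/s     : %.0f\n", packets_per_sec);
    printf("implied clock : %.2f GHz\n", implied_clock / 1e9);
    return 0;
}
```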

19 Five Emerging Technologies
Optimized Network Protocol Stack (ISSS+CODES, 2003)
Cache Optimization (ISSS+CODES, 2003; ANCHOR, 2004)
Network Stack Affinity Scheduling
Direct Cache Access (DCA)
Lightweight Threading
Memory Copy Engine (ICCD 2005 and IEEE TC)

20 Cache Optimizations

21 Instruction Cache Behavior
The program structure places a higher requirement on L1 I-cache size; the workload benefits more from a larger line size and a higher degree of set associativity.

22 Execution Time Analysis
Given a total L1 cache size on the chip, more area should be devoted to the I-cache and less to the D-cache.

23 Direct Cache Access (DCA)
Normal DMA write: Step 1, the NIC DMA-writes the packet through the memory controller; Step 2, the write snoop-invalidates any cached copy; Step 3, the data is written to memory; Step 4, the CPU read then has to fetch it from memory.
Direct Cache Access: Step 1, the NIC DMA-writes the packet; Step 2, the memory controller updates the CPU cache with the data; Step 3, the CPU read hits in the cache.
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
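DCA itself is a platform/chipset feature, but a rough software analogue of the idea, warming the cache with the packet's header lines before the protocol stack touches them, can be sketched with compiler prefetch intrinsics (GCC/Clang builtin; the 64-byte line size and header span are illustrative assumptions, and this only hides latency rather than eliminating the memory accesses the way DCA does):

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* Issue prefetches for a freshly DMA'd packet's header so the lines are
   already in cache when TCP/IP processing starts. Real DCA achieves this
   in hardware by pushing the DMA write into the CPU cache. */
static inline void prefetch_packet_header(const void *pkt, size_t header_bytes) {
    const char *p = (const char *)pkt;
    for (size_t off = 0; off < header_bytes; off += CACHE_LINE)
        __builtin_prefetch(p + off, 0 /* read */, 3 /* high locality */);
}
```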

24 Memory Copy Engines. Ref: L. Bhuyan, with L. Zhao et al., "Hardware Support for Bulk Data Movement in Server Platforms," ICCD, October 2005 (IEEE TC, 2006).

25 Memory Overhead Simulation
[Figure: memory accesses broken down into NIC descriptors, mbufs, TCP/IP headers, and payload.]

26 Copy Engines
Copying is time-consuming because the CPU moves data at a small granularity, the source or destination is in memory (not cache), and the memory accesses clog up resources.
A copy engine can provide fast copies while reducing CPU resource occupancy, perform copies in parallel with CPU computation, and avoid cache pollution while reducing interconnect traffic.
Low-overhead communication between the engine and the CPU requires hardware support to let the engine run asynchronously with the CPU, hardware support to share the virtual address space between the engine and the CPU, and low-overhead signaling of completion.

27 Asynchronous Low-Cost Copy (ALCC)
Today, memory-to-memory data copies require CPU execution. Instead, build a copy engine and tightly couple it with the CPU: low communication overhead and asynchronous execution with respect to the CPU.
[Figure: timeline contrasting application processing that stalls for memory copies with application processing that continues while the copies proceed in parallel.]
Continue computing during memory-to-memory copies.
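There is no copy engine to program here, but the overlap this slide describes can be mimicked in software with a helper thread that performs the copy while the main thread keeps computing. A minimal sketch, purely illustrative of the asynchronous-copy idea (thread-based, not the hardware engine from the paper):

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Descriptor for one asynchronous copy request. */
struct copy_req {
    void       *dst;
    const void *src;
    size_t      len;
};

static void *copy_worker(void *arg) {
    struct copy_req *r = (struct copy_req *)arg;
    memcpy(r->dst, r->src, r->len);   /* the "engine" performs the copy */
    return NULL;
}

/* Start the copy and return immediately; the caller keeps computing and
   later calls pthread_join() as the completion signal. */
static pthread_t copy_async(struct copy_req *r) {
    pthread_t tid;
    pthread_create(&tid, NULL, copy_worker, r);
    return tid;
}

/* Usage:
 *   struct copy_req r = { dst_buf, src_buf, n };
 *   pthread_t t = copy_async(&r);
 *   do_application_work();          // overlaps with the copy
 *   pthread_join(t, NULL);          // wait for completion
 */
```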

28 Performance Evaluation

29 Total I/O Acceleration

30 Potential Efficiencies (10X)
Benefits of architectural techniques. Ref: Greg Regnier et al., "TCP Onloading for Data Center Servers," IEEE Computer, vol. 37, Nov. 2004. On the CPU, multi-gigabit, line-speed network I/O is possible.

31 CPU-NIC Integration

32 CPU-NIC Integration: Performance Comparison (RX) with Various Connection Counts on a Sun Niagara 2 Machine
INIC performs better than DNIC with more than 16 connections.

33 Latency Comparison: INIC achieves a lower latency, saving about 6 µs, due to the smaller latency of accessing I/O registers and the elimination of PCI-E bus latency.

34 Current and Future Work
Architectural characteristics and optimization: TCP/IP optimization, CPU+NIC integration, TCB cache design; anatomy and optimization of driver software; caching techniques, ISA optimization, data copy engines; simulator design. Similar analysis with virtualization and 10 GE on multi-core CPUs is ongoing with the Intel project.
Core scheduling in multicore processors: TCP/IP scheduling on multi-core processors; application-level cache-aware and hash-based scheduling; parallel/pipeline scheduling to simultaneously address throughput and latency; scheduling to minimize power consumption; similar research with virtualization.
Design and analysis of heterogeneous multiprocessors: heterogeneous chip multiprocessors using network processors, GPUs, and FPGAs.

35 THANK YOU

