1 Network Server Performance and Scalability
Scott Rixner
Rice Computer Architecture Group
June 9, 2005
http://www.cs.rice.edu/CS/Architecture/

2 Rice Computer Architecture
• Faculty
  – Scott Rixner
• Students
  – Mike Calhoun
  – Hyong-youb Kim
  – Jeff Shafer
  – Paul Willmann
• Research Focus
  – System architecture
  – Embedded systems
• http://www.cs.rice.edu/CS/Architecture/

3 Network Servers Today
• Content types
  – Mostly text, small images
  – Low-quality video (300-500 Kbps)
[Diagram: clients on ~3 Mbps links reach a network server with a 1 Gbps link across the Internet]

4 Network Servers in the Future
• Content types
  – Diverse multimedia content
  – DVD-quality video (10 Mbps)
[Diagram: clients on 100 Mbps links reach a network server with a 100 Gbps link across the Internet]

5 TCP Performance Issues
• Network interfaces
  – Limited flexibility
  – Serialized access
• Computation
  – Only about 3,000 instructions per packet
  – However, very low IPC and parallelization difficulties
• Memory
  – Large connection data structures (about 1 KB each; see the sketch below)
  – Low locality, high DRAM latency
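
To make the memory point concrete, here is a heavily simplified per-connection control block in C. It is a minimal sketch with illustrative field names, not the actual FreeBSD struct tcpcb; the takeaway is that identity, window, timer, and buffer state accumulate to roughly a kilobyte per connection, and a busy server touches thousands of such structures with little locality.

```c
#include <stdint.h>

/* Illustrative per-connection TCP state (not the real struct tcpcb).
 * Real control blocks carry the same categories of state and end up
 * on the order of 1 KB per connection. */
struct tcp_conn {
    /* Connection identity */
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;

    /* Send side: sequence numbers, send and congestion windows */
    uint32_t snd_una, snd_nxt, snd_wnd, snd_cwnd, snd_ssthresh;

    /* Receive side */
    uint32_t rcv_nxt, rcv_wnd;

    /* RTT estimation and retransmission timers */
    uint32_t srtt, rttvar, rto, retransmit_deadline;

    /* Per-connection buffer bookkeeping */
    void *snd_buf, *rcv_buf, *reassembly_queue;

    /* ... options, SACK blocks, timestamps, and state-machine flags
     * pad this out toward ~1 KB in practice ... */
    uint8_t state;
};
```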

6 Selected Research
• Network interfaces
  – Programmable NIC design
  – Firmware parallelization
  – Network interface data caching
• Operating systems
  – Connection handoff to the network interface
  – Parallelizing network stack processing
• System architecture
  – Memory controller design

7 Designing a 10 Gigabit NIC
• Programmability for performance
  – Computation offloading improves performance
• NICs have power and area concerns
  – Architectural solutions should be efficient
• Above all, must support 10 Gbps links
  – What are the computation and memory requirements?
  – What architecture efficiently meets them?
  – What firmware organization should be used?

8 Aggregate Requirements: 10 Gbps – Maximum-Sized Frames

             Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
TX Frame     229 MIPS                 2.6 Gbps                 19.75 Gbps
RX Frame     206 MIPS                 2.2 Gbps                 19.75 Gbps
Total        435 MIPS                 4.8 Gbps                 39.5 Gbps

1514-byte frames at 10 Gbps = 812,744 frames/s
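
The 812,744 frames/s figure follows from the wire overhead of a maximum-sized Ethernet frame. A minimal check, assuming the usual accounting of a 1514-byte frame plus 4 bytes of FCS, 8 bytes of preamble/SFD, and a 12-byte inter-frame gap (1538 bytes on the wire):

```c
#include <stdio.h>

int main(void) {
    /* A 1514-byte frame (14-byte Ethernet header + 1500-byte payload)
     * occupies 1514 + 4 (FCS) + 8 (preamble/SFD) + 12 (inter-frame gap)
     * = 1538 bytes of link time per frame. */
    const double wire_bytes   = 1514.0 + 4.0 + 8.0 + 12.0;
    const double link_bps     = 10e9;                  /* 10 Gbps */
    const double frames_per_s = link_bps / (wire_bytes * 8.0);

    printf("%.0f frames/s\n", frames_per_s);           /* ~812,744 */
    return 0;
}
```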

9 Meeting 10 Gbps Requirements
• Processor architecture
  – At least 435 MIPS within an embedded device
  – Limited instruction-level parallelism
  – Abundant task-level parallelism
• Memory architecture
  – Control data needs low latency, small capacity
  – Frame data needs high bandwidth, large capacity
  – Must partition storage

10 Processor Architecture

Relative performance:
                Perfect   1BP     No BP
In-order        1         0.87    —
Out-of-order    2         1.74    1.21

• 2x performance is costly
  – Branch prediction, reorder buffer, renaming logic, wakeup logic
  – Overheads translate to greater than 2x core power and area costs
  – Great for a general-purpose processor; not for an embedded device
• Are there other opportunities for parallelism?
  – Many steps to process a frame – run them simultaneously
  – Many frames need processing – process them simultaneously
• Solution: use parallel single-issue cores

11 Control Data Caching
[Figure: SMPCache trace analysis of a 6-processor NIC architecture]

12 A Programmable 10 Gbps NIC
[Block diagram: P CPUs, each with an instruction cache backed by a shared instruction memory; S scratchpad memories; a (P+4)x(S) 32-bit crossbar connecting the CPUs to the scratchpads, the PCI interface (to the PCI bus), the Ethernet interface, and an external memory interface to off-chip DRAM]

13 Network Interface Firmware
• NIC processing steps are well defined
• Must provide high latency tolerance
  – DMA to host
  – Transfer to/from network
• An event mechanism is the obvious choice
  – How do you process and distribute events?

14 Task Assignment with an Event Register
[Diagram: an event register containing a PCI Read bit, a SW event bit, and other bits (e.g., 00011). The PCI interface finishes its work and sets a bit; processors inspect the completed transactions, enqueue the TX data, and pass it to the Ethernet interface]
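
As a rough illustration of how firmware might consume such an event register, the loop below polls a memory-mapped status word and dispatches each set bit to a handler. The register address, bit layout, and handler names are assumptions made for the sketch, not the actual firmware interface; acknowledging and clearing events is also omitted.

```c
#include <stdint.h>

/* Assumed bit layout of the event register (illustrative only). */
#define EVT_PCI_READ_DONE   (1u << 0)   /* PCI interface finished a read DMA   */
#define EVT_PCI_WRITE_DONE  (1u << 1)   /* PCI interface finished a write DMA  */
#define EVT_ETH_RX          (1u << 2)   /* Ethernet interface received a frame */
#define EVT_SW_EVENT        (1u << 3)   /* event posted by firmware itself     */

/* Memory-mapped event register; the address is hypothetical. */
static volatile uint32_t * const event_reg =
    (volatile uint32_t *)0x10000000u;

void handle_pci_read_done(void);
void handle_pci_write_done(void);
void handle_eth_rx(void);
void handle_sw_event(void);

/* Each core runs this loop: read the pending event bits, dispatch handlers.
 * (Acknowledging/clearing serviced bits is omitted for brevity.) */
void firmware_event_loop(void)
{
    for (;;) {
        uint32_t events = *event_reg;    /* hardware ORs in completion bits */
        if (events == 0)
            continue;                    /* nothing pending; keep polling   */

        if (events & EVT_PCI_READ_DONE)  handle_pci_read_done();
        if (events & EVT_PCI_WRITE_DONE) handle_pci_write_done();
        if (events & EVT_ETH_RX)         handle_eth_rx();
        if (events & EVT_SW_EVENT)       handle_sw_event();
    }
}
```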

15 Task-Level Parallel Firmware
[Timeline: with tasks statically assigned, Proc 0 watches the PCI Read bit / hardware status and transfers DMAs 0-4 and then 5-9, while Proc 1 processes DMAs 0-4 and then 5-9; each processor idles while waiting on the other]

16 Frame-Level Parallel Firmware
[Timeline: Proc 0 transfers DMAs 0-4, builds the corresponding event, and processes DMAs 0-4, while Proc 1 does the same for DMAs 5-9; idle time shrinks compared with the task-level organization]

17 Scaling in Two Dimensions
[Chart: achievable throughput (Gbps) as the NIC design is scaled along two dimensions]

18 A Programmable 10 Gbps NIC
• This NIC architecture relies on:
  – Data memory system – partitioned organization, not coherent caches
  – Processor architecture – parallel scalar processors
  – Firmware – frame-level parallel organization
  – RMW instructions – reduce ordering overheads (see the sketch below)
• A programmable NIC: a substrate for offload services
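
One way to read the RMW point: with an atomic read-modify-write primitive, cores can claim the next frame to process without locks or a serializing owner. The fragment below illustrates the idea with C11 atomics; it is a generic sketch, not the NIC's instruction set or firmware.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 256

/* Minimal frame descriptor in a shared receive ring (illustrative). */
struct frame_desc {
    void    *data;
    uint16_t len;
};
extern struct frame_desc rx_ring[RING_SIZE];

/* Shared work counter: one atomic fetch-and-add replaces a lock/unlock pair. */
static _Atomic uint32_t next_frame;

void process_frame(struct frame_desc *f);

/* Each core repeatedly claims the next unprocessed frame and handles it
 * end to end, which is the frame-level parallel organization. */
void frame_worker(uint32_t frames_ready)
{
    for (;;) {
        uint32_t idx = atomic_fetch_add(&next_frame, 1);
        if (idx >= frames_ready)
            break;                       /* no more work in this batch */
        process_frame(&rx_ring[idx % RING_SIZE]);
    }
}
```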

19 NIC Offload Services
• Network interface data caching
• Connection handoff
• Virtual network interfaces
• …

20 Network Interface Data Caching
• Cache data in the network interface
• Reduces interconnect traffic
• Software-controlled cache (see the sketch below)
• Minimal changes to the operating system
• Prototype web server
  – Up to 57% reduction in PCI traffic
  – Up to 31% increase in server performance
  – Peak of 1571 Mbps of content throughput – breaks the PCI bottleneck
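
The following sketch illustrates the software-controlled flavor of the cache: the OS send path keeps a directory of which 4 KB blocks already reside in NIC memory and, on a hit, hands the NIC a block handle instead of pushing the payload across PCI again. All structure and function names here are hypothetical, not the prototype's interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NIC_BLOCK_SIZE 4096        /* 4 KB blocks, as on slide 22 */

struct nic_cache;                  /* host-side directory of resident blocks */

/* Returns true and fills *handle if (file_id, block_no) is already resident
 * in NIC memory; the directory is maintained entirely by host software. */
bool nic_cache_lookup(struct nic_cache *c, uint64_t file_id,
                      uint64_t block_no, uint32_t *handle);

/* DMA the block into NIC memory once and record it in the directory. */
uint32_t nic_cache_insert(struct nic_cache *c, uint64_t file_id,
                          uint64_t block_no, const void *data);

/* Queue a TCP segment whose payload comes from a cached block on the NIC. */
void nic_send_from_cache(uint32_t handle, size_t off, size_t len);

void send_file_block(struct nic_cache *c, uint64_t file_id,
                     uint64_t block_no, const void *data, size_t len)
{
    uint32_t handle;
    if (nic_cache_lookup(c, file_id, block_no, &handle)) {
        nic_send_from_cache(handle, 0, len);   /* hit: no payload crosses PCI */
    } else {
        handle = nic_cache_insert(c, file_id, block_no, data);
        nic_send_from_cache(handle, 0, len);   /* miss: pay the transfer once */
    }
}
```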

21 Results: PCI Traffic
[Chart annotations: ~1260 Mb/s is the limit; PCI saturated; 60% utilization; ~60% content traffic; 1198 Mb/s of HTTP content; 30% overhead]

22 Content Locality
• Block cache with 4 KB block size
• 8-16 MB caches capture locality

23 Results: PCI Traffic Reduction
• 36-57% reduction across the four traces
• Up to 31% performance improvement
[Chart annotations: low temporal reuse, low PCI utilization; good temporal reuse, CPU bottleneck]

24 Connection Handoff to the NIC
• No magic processor on the NIC
  – The OS must control the division of work between itself and the NIC
• Move established connections between the OS and the NIC
  – Connection: the unit of control
  – The OS decides when and what to hand off
• Benefits
  – Sockets are intact – no need to change applications
  – Zero-copy
  – No port allocation or routing on the NIC
  – Can adapt to route changes
[Diagram: the host stack (sockets, TCP, IP, Ethernet, driver) and the NIC stack (TCP, IP, Ethernet/lookup) connected by a handoff interface whose operations include: 1. handoff, 2. send, 3. receive, 4. ack, 5. …]
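
A rough sketch of what a driver-level handoff interface could look like, following the operation names on the slide (handoff, send, receive, ack). The message encoding and function signatures are assumptions for illustration, not the prototype's actual FreeBSD interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Commands the OS posts to the NIC for a handed-off connection; the names
 * follow the slide's handoff interface list, the encoding is hypothetical. */
enum handoff_op {
    HO_HANDOFF,     /* migrate an established connection's TCP state     */
    HO_SEND,        /* pass application data to the NIC for transmission */
    HO_RECEIVE,     /* NIC delivers in-order received data to the OS     */
    HO_ACK,         /* OS reports how much delivered data was consumed   */
    HO_RESTORE,     /* pull the connection back into the host stack      */
    HO_TEARDOWN     /* both sides deallocate connection state            */
};

/* Snapshot of TCP state transferred at handoff time (abridged). */
struct handoff_tcp_state {
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;
    uint32_t snd_una, snd_nxt, snd_wnd;
    uint32_t rcv_nxt, rcv_wnd;
    /* ... congestion state, timers, negotiated options ... */
};

struct handoff_msg {
    enum handoff_op          op;
    uint32_t                 conn_id;   /* NIC-side handle for the connection */
    struct handoff_tcp_state state;     /* valid for HO_HANDOFF only          */
    void                    *data;      /* buffer for HO_SEND / HO_RECEIVE    */
    size_t                   len;
};

/* Driver entry point: enqueue a message on the NIC's command ring. */
int nic_handoff_post(struct handoff_msg *msg);
```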

25 Connection Handoff
• Traditional offload
  – The NIC replicates the entire network stack
  – The NIC can limit the number of connections due to resource limitations
• Connection handoff
  – The OS decides which subset of connections the NIC should handle
  – NIC resource limitations limit the amount of offload, not the number of connections

26 Establishment and Handoff
• The OS establishes connections
• The OS decides whether or not to hand off each connection
[Diagram: (1) establish a connection in the OS; (2) hand the connection off to the NIC]

27 Data Transfer
• Offloaded connections require minimal support from the OS for data transfers
  – Socket layer for the interface to applications
  – Driver layer for interrupts and buffer management
[Diagram: (3) send, receive, ack, … data flows between the OS connection and the NIC connection]

28 Connection Teardown
• Teardown requires both the NIC and the OS to deallocate their connection data structures
[Diagram: (4) and (5) the NIC and the OS each deallocate their connection state]
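
Tying slides 26-28 together, a possible host-side flow for one connection is sketched below, reusing an abridged version of the hypothetical interface above; the policy hook and helper names are placeholders.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Abridged version of the hypothetical handoff interface sketched above. */
enum handoff_op { HO_HANDOFF, HO_SEND, HO_TEARDOWN };

struct handoff_msg {
    enum handoff_op op;
    uint32_t        conn_id;   /* NIC-side handle for the connection */
    const void     *data;      /* payload for HO_SEND                */
    size_t          len;
};

struct socket;                                      /* host stack socket       */
bool     offload_policy_wants(struct socket *so);   /* OS policy, placeholder  */
uint32_t conn_id_of(struct socket *so);             /* assign/look up handle   */
int      nic_handoff_post(struct handoff_msg *m);   /* enqueue on command ring */

/* Steps 1-2 (slide 26): the OS establishes the connection as usual and then
 * decides whether to hand it off to the NIC. */
void maybe_handoff(struct socket *so)
{
    if (!offload_policy_wants(so))
        return;                         /* connection stays in the host stack */
    struct handoff_msg m = { .op = HO_HANDOFF, .conn_id = conn_id_of(so) };
    /* In the full interface the message also carries the TCP state snapshot. */
    nic_handoff_post(&m);               /* NIC now runs TCP for this socket   */
}

/* Step 3 (slide 27): sends on an offloaded socket become HO_SEND messages;
 * the socket interface seen by the application is unchanged. */
void offloaded_send(struct socket *so, const void *buf, size_t len)
{
    struct handoff_msg m = { .op = HO_SEND, .conn_id = conn_id_of(so),
                             .data = buf, .len = len };
    nic_handoff_post(&m);
}

/* Steps 4-5 (slide 28): on close, both the NIC and the OS free their
 * per-connection state. */
void offloaded_close(struct socket *so)
{
    struct handoff_msg m = { .op = HO_TEARDOWN, .conn_id = conn_id_of(so) };
    nic_handoff_post(&m);
}
```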

29 Connection Handoff Status
• Working prototype built on FreeBSD
• Initial results for web workloads
  – Reductions in cycles and cache misses on the host
  – Transparently handles multiple NICs
  – Fewer messages on PCI: from 1.4 per packet to 0.6 per packet (socket-level instead of packet-level communication)
  – ~17% throughput increase (simulations)
• To do
  – Framework for offload policies
  – Test zero-copy and more workloads
  – Port to Linux

30 Virtual Network Interfaces
• Traditionally used for user-level network access
  – Each process has its own "virtual NIC"
  – Provides protection among processes
• Can we use this concept to improve network stack performance within the OS?
  – Possibly, but we first need to understand the behavior of the OS on networking workloads

31 Networking Workloads
• Performance is influenced by
  – The operating system's network stack
  – The increasing number of connections
  – Microprocessor architecture trends

32 Networking Performance
• Bound by TCP/IP processing
• 2.4 GHz Intel Xeon: 2.5 Gbps for one nttcp stream (Hurwitz and Feng, IEEE Micro 2004)

33 Throughput vs. Connections
• Faster links → more connections
• More connections → worse performance
[Chart: throughput vs. number of connections for the CS, IBM, NASA, and WC traces]

34 The End of the Uniprocessor?
• Uniprocessors have become too complicated
  – Clock speed increases have slowed down
  – Increasingly complicated architectures for performance
• Multi-core processors are becoming the norm
  – IBM POWER4 – 2 cores (2001)
  – Intel Pentium 4 – 2 hyperthreads (2002)
  – Sun UltraSPARC IV – 2 cores (2004)
  – AMD Opteron – 2 cores (2005)
  – Sun Niagara – 8 cores, 4 threads each (est. 2006)
• How do we use these cores for networking?

35 Parallelism with Data-Synchronized Stacks
[Diagram: multiple threads run the network stack concurrently and synchronize on shared connection data – the approach taken by Linux 2.4.20+ and FreeBSD 5+]

36 Parallelism with Control-Synchronized Stacks
[Diagram: each connection is serviced by a designated thread, and other threads hand it work via messages – the approach taken by DragonflyBSD and Solaris 10]

37 Parallelization Challenges
• Data-synchronous
  – Lots of thread parallelism
  – Significant locking overheads
• Control-synchronous
  – Reduces locking
  – Load-balancing issues
• Which approach is better? (contrasted in the sketch below)
  – Throughput? Scalability?
  – We're optimizing both schemes in FreeBSD 5 to find out
• Network interface
  – Serialization point
  – Can virtualization help?
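
To make the two synchronization models concrete, here is a schematic contrast in C. Both functions are toy illustrations rather than the FreeBSD 5 code paths being optimized: the data-synchronous path lets any thread process any packet under a per-connection lock, while the control-synchronous path forwards each packet to the one thread that owns the connection.

```c
#include <pthread.h>

struct pkt;
struct conn {
    pthread_mutex_t lock;      /* data-synchronous: protects shared TCP state */
    int             owner_cpu; /* control-synchronous: owning thread/CPU      */
    /* ... TCP state ... */
};

void tcp_input_locked(struct conn *c, struct pkt *p);         /* protocol work   */
void queue_to_thread(int cpu, struct conn *c, struct pkt *p); /* message passing */

/* Data-synchronized stack (Linux 2.4.20+ / FreeBSD 5 style): any thread may
 * run the stack for any packet, serialized by locks on shared state. */
void input_data_sync(struct conn *c, struct pkt *p)
{
    pthread_mutex_lock(&c->lock);          /* locking overhead on every packet */
    tcp_input_locked(c, p);
    pthread_mutex_unlock(&c->lock);
}

/* Control-synchronized stack (DragonflyBSD / Solaris 10 style): the packet is
 * handed to the connection's owning thread, so no per-packet lock is needed,
 * but balancing load across owners becomes the hard problem. */
void input_control_sync(int this_cpu, struct conn *c, struct pkt *p)
{
    if (c->owner_cpu == this_cpu)
        tcp_input_locked(c, p);              /* already on the right thread */
    else
        queue_to_thread(c->owner_cpu, c, p); /* message passing, no lock    */
}
```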

38 Memory Controller Architecture
• Improve DRAM efficiency
  – Memory access scheduling
  – Virtual channels
• Improve copy performance
  – 45-61% of kernel execution time can be copies
  – The best copy algorithm depends on copy size, cache residency, and cache state (see the sketch below)
  – Probe copy
  – Hardware copy acceleration
• Improve I/O performance…
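
As a simple illustration of why the choice of copy routine matters, the dispatcher below switches between a cache-friendly copy for small buffers and a streaming-style copy for large ones. The threshold and routines are placeholders; this is not the probe-copy technique or the hardware acceleration referred to on the slide.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder threshold: above roughly the cache size, copied data will not
 * be reused from the cache, so a streaming copy that avoids polluting it
 * tends to win. */
#define STREAM_COPY_THRESHOLD (256 * 1024)

/* Small copies: ordinary cached copy (memcpy stands in for an optimized
 * word-at-a-time routine). */
static void copy_cached(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Large copies: in a real kernel this might use non-temporal stores or a
 * copy engine; plain memcpy is used here only to keep the sketch runnable. */
static void copy_streaming(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Dispatch on size; a smarter policy would also consider whether the source
 * is already cache-resident, which is the motivation for probing before
 * copying. */
void kernel_copy(void *dst, const void *src, size_t len)
{
    if (len < STREAM_COPY_THRESHOLD)
        copy_cached(dst, src, len);
    else
        copy_streaming(dst, src, len);
}
```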

39 Summary
• Our focus is on system-level architectures for networking
• Network interfaces must evolve
  – No longer just a PCI-to-Ethernet bridge
  – Need to provide capabilities that help the operating system
• Operating systems must evolve
  – Future systems will have 10s to 100s of processors
  – Networking must be parallelized – many bottlenecks remain
• The synergy between the NIC and the OS cannot be ignored
• Memory performance is also increasingly a critical factor

