Implementing Convergent Networking: Partner Concepts
Uri Elzur, Director, Advanced Technology, Broadcom Corporation
Brian Hausauer, Chief Architect, NetEffect, Inc.
Convergence In The Data Center: Convergence Over IP
Uri Elzur, Director, Advanced Technology, Broadcom Corporation
Agenda
Application requirements in the Data Center
Data flows and server architecture
Convergence demo
Hardware and software challenges and advantages
Summary
Enterprise Network Today
IT: Get ready for tomorrow's Data Center, today
Multiple networks drive Total Cost of Ownership (TCO) up
Consolidation, convergence, and virtualization require flexible I/O
Higher speeds (2.5G, 10G) require more efficient I/O
Issue: best use of memory and CPU resources
Additional constraints: limited power, cooling, and smaller form factors
Convergence Over Ethernet
Multiple networks and multiple stacks in the OS are used to provide these services
Wire protocols, e.g., Internet Small Computer System Interface (iSCSI) and iWARP (Remote Direct Memory Access, RDMA), enable the use of Ethernet as the converged network
Direct Attach Storage migrates to networked storage
Proprietary clustering can now use RDMA over Ethernet
The OS supports one device servicing multiple stacks with the Virtual Bus Driver
To accommodate these new traffic types, Ethernet's efficiency must be optimal: CPU utilization, memory BW utilization, latency
Diagram: Windows converged stack – sockets applications through Windows Sockets and the WinSock switch to TCP/IP and NDIS; storage applications through the file system, partition and class drivers, and the iSCSI port driver (iscsiprt.sys) to an iSCSI NDIS miniport/HBA; user-mode RDMA providers through the RDMA driver and miniport; all sharing the NIC/RNIC
Data Center Application Characteristics
Diagram: data center topology – web servers, application servers, database cluster, DAS, and load balancers, with IP/Ethernet carrying storage, cluster, and management traffic
The Server In The Data Center
Server network requirements – Data, Storage, Clustering, and Management
Acceleration required for Data = TCP, Storage = iSCSI, Clustering = RDMA
Application requirements: more transactions per server, higher rate, larger messages (e.g., e-mail)
Diagram: web servers, application servers, database cluster, DAS, and load balancers on IP/Ethernet, carrying storage, cluster, and management traffic over a mix of long-lived and short-lived connections
Traditional L2 NIC Rx Flow And Buffer Management
1. Application pre-posts buffer
2. Data arrives at the Network Interface Card (NIC)
3. NIC Direct Memory Access (DMA) transfers data to driver buffers (kernel)
4. NIC notifies the driver after a frame is DMA'd (interrupt moderation per frame)
5. Driver notifies the stack
6. Stack fetches headers, processes TCP/IP, strips headers
7. Stack copies data from driver buffers to application buffers
8. Stack notifies the application
Minimum of one copy (a conceptual sketch of this path follows)
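A conceptual sketch in C of the receive path above; the names and header math are illustrative only, not actual NDIS driver code, but it shows where the per-frame CPU work and the unavoidable driver-to-application copy sit:

    /* Conceptual sketch of the L2 NIC receive path; illustrative names only. */
    #include <string.h>
    #include <stddef.h>

    #define MTU 1500

    struct frame { unsigned char data[MTU]; size_t len; };

    /* Step 3: the NIC DMAs the arriving frame into this kernel/driver buffer */
    static struct frame driver_buf;

    /* Steps 5-8: per-frame work done by the host CPU in the software stack */
    size_t stack_receive(unsigned char *app_buf, size_t app_len)
    {
        size_t hdr = 14 + 20 + 20;   /* Ethernet + IPv4 + TCP headers, no options */
        size_t payload = driver_buf.len > hdr ? driver_buf.len - hdr : 0;
        if (payload > app_len)       /* step 6: parse and strip the headers */
            payload = app_len;
        /* Step 7: copy from the driver buffer into the application buffer
         * pre-posted in step 1 -- this is the "minimum of one copy". */
        memcpy(app_buf, driver_buf.data + hdr, payload);
        /* Step 8: notify the application (e.g., complete its pending recv()) */
        return payload;
    }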
iSCSI
iSCSI provides a reliable, high-performance block storage service
Microsoft operating system support for iSCSI accelerates iSCSI's deployment: Microsoft iSCSI Software Initiator, iSCSI HBA
The iSCSI HBA provides for better performance, iSCSI Boot, and iSER enablement
Diagram: storage applications through the file system, partition manager, class driver, and iSCSI port driver (iscsiprt.sys) to the iSCSI miniport and HBA
The Value Of iSCSI Boot
Storage consolidation – lower TCO
Easier maintenance and replacement – no need to replace a server blade for a hard-disk failure
No disk on blade/motherboard – space and power savings, smaller blades, higher density, simpler board design with no HD-specific mechanical restraints
Higher reliability – hot replacement of disks if a disk fails, RAID protection of the boot disk, re-assign the disk to another server in case of server failure
WSD And RDMA
Kernel bypass is attractive for High Performance Computing (HPC), databases, and any sockets application
The WSD model supports RNICs with RDMA over Ethernet (a.k.a. iWARP)
As latency improvements are mainly due to kernel bypass, WSD is competitive with other RDMA-based technologies, e.g., InfiniBand
Diagram: traditional model vs. WSD model – in the traditional model the socket application goes through WinSock, the TCP/IP WinSock provider, the kernel TCP/IP transport driver, and the NDIS miniport to the NIC; in the WSD model the WinSock switch (at the WinSock SPI) routes traffic either down that same path or through the RDMA service provider, the Microsoft WSD module, and the OEM WSD software/RDMA provider driver over a private interface to the RNIC's SAN NDIS miniport (a minimal sockets example follows)
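A minimal sketch of why the WSD model matters to applications: an ordinary WinSock client like the one below needs no source changes, because the WinSock switch decides underneath it whether the connection runs over the TCP/IP provider or the RNIC's RDMA service provider. The address and port are placeholders; this is an illustration, not vendor sample code.

    /* Unmodified WinSock code: under WSD, the WinSock switch can route this
     * connection through the RNIC's RDMA provider with no application change.
     * 192.0.2.10:5001 is a placeholder peer used only for illustration. */
    #include <winsock2.h>
    #pragma comment(lib, "ws2_32.lib")

    int main(void)
    {
        WSADATA wsa;
        if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
            return 1;

        SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5001);
        peer.sin_addr.s_addr = inet_addr("192.0.2.10");

        if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) == 0) {
            const char msg[] = "same sockets code, possibly an RDMA path";
            send(s, msg, (int)sizeof(msg) - 1, 0);
        }
        closesocket(s);
        WSACleanup();
        return 0;
    }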
L2 Technology Can't Efficiently Handle iSCSI And RDMA
iSCSI HBA implementation concerns: iSCSI Boot; digest overhead – CRC-32C; copy overhead – zero copy requires iSCSI protocol processing
RDMA RNIC implementation concerns: throughput – high software overhead for RDMA processing; MPA – CRC-32C, markers every 512 B; DDP/RDMA – protocol processing, zero copy, user-mode interaction, special queues; minimal latency – software processing doesn't allow for kernel bypass
Thus, for optimal performance, specific offload is required (see the CRC-32C sketch below)
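Both the iSCSI data digest and the iWARP MPA CRC mentioned above use CRC-32C (Castagnoli). A minimal bitwise reference implementation is sketched below; an offload engine must produce this value for every PDU at wire speed, which is exactly the digest overhead an ordinary L2 NIC leaves on the host CPU.

    /* Reference bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78,
     * as used by the iSCSI data digest and the iWARP MPA CRC. A table-driven
     * or hardware implementation does the same computation much faster. */
    #include <stdint.h>
    #include <stddef.h>

    uint32_t crc32c(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
        return ~crc;
    }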
Convergence Over Ethernet: TOE, iSCSI, RDMA, Management
New model converges functions: multiple functions (SAN, LAN, IPC, Mgmt.) can be consolidated onto a single fabric type
Blade server storage connectivity (low cost)
Clustering (IPC, HPC)
Consolidates ports – leverage Ethernet pervasiveness, knowledge, cost leadership, and volume
Consolidate KVM over IP – leverage existing standard Ethernet equipment
Lower TCO – one technology for multiple purposes
Diagram: legacy model (one NIC per function, the full protocol stack from application layer to physical layer in the host) vs. new model (C-NIC with TOE and RSS carrying cluster/IPC, LAN, block and file storage, and remote management traffic over standard Ethernet)
C-NIC Demo
C-NIC Hardware Design – Advantages/Challenges
Performance – wire speed: find the right split between hardware and firmware
Hardware for speed – e.g., connection lookup, frame validity, buffer selection, and offset computation; hardware connection lookup is significantly more efficient than software, and the IPv6 address length (128 bits) exacerbates the cost (see the lookup sketch below)
Flexibility – firmware provides flexibility, but may be slower than hardware; a specially optimized RISC CPU – it's not about MHz; accommodates future protocol changes, e.g., TCP ECN
Minimal latency – from wire to application buffer (or from application to wire for Tx), without involving the CPU; flat ASIC architecture for minimal latency
Scalability – 1G, 2.5G, 10G
Zero-copy architecture – a match to server memory BW and latency; an additional copy or a few copies in any L2 solution
Power goals – under 5 W per 1G/2.5G, under 10 W per 10G (a CPU consumes about 90 W)
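A minimal sketch of the per-frame connection lookup that the slide argues belongs in hardware: hash the TCP/IP 4-tuple and chase a hash bucket to find the offloaded connection context. The structures and hash below are illustrative only; with IPv6, the key grows from 12 to 36 bytes, which is why software lookup gets even more expensive.

    /* Illustrative connection-context lookup on the receive path. A C-NIC
     * performs the equivalent in hardware per frame; in software this hash,
     * compare, and pointer chase costs host CPU cycles on every packet. */
    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    struct tuple4 {               /* IPv4 4-tuple: 12 bytes of key material;    */
        uint32_t saddr, daddr;    /* IPv6 replaces these with 16-byte addresses */
        uint16_t sport, dport;
    };

    struct conn_ctx {
        struct tuple4 key;
        struct conn_ctx *next;    /* hash-bucket chain */
        /* ... offloaded TCP state would live here ... */
    };

    #define HASH_BUCKETS 4096
    static struct conn_ctx *htable[HASH_BUCKETS];

    static uint32_t hash_tuple(const struct tuple4 *t)
    {
        uint32_t h = t->saddr ^ (t->daddr * 2654435761u);  /* simple mix, illustrative */
        h ^= ((uint32_t)t->sport << 16) | t->dport;
        return h & (HASH_BUCKETS - 1);
    }

    struct conn_ctx *conn_lookup(const struct tuple4 *t)
    {
        for (struct conn_ctx *c = htable[hash_tuple(t)]; c; c = c->next)
            if (memcmp(&c->key, t, sizeof(*t)) == 0)
                return c;         /* found the offloaded connection context */
        return NULL;              /* not offloaded; fall back to the host stack */
    }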
C-NIC Software Design – Advantages/Challenges
Virtual Bus Driver – reconciles requests from all stacks: Plug and Play, reset, network control and speed, power
Support of multiple stacks – resource allocation and management, resource isolation, run-time priorities, interface separation, interrupt moderation per stack, statistics (a resource-arbitration sketch follows)
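A hypothetical sketch (plain C, not actual Windows driver code) of the per-stack bookkeeping a virtual bus driver has to do when the L2, TOE, iSCSI, and RDMA stacks share one physical device; the names and quota values are invented for illustration.

    /* Hypothetical per-stack resource arbitration in a virtual bus driver.
     * A real driver would use the OS's PnP, power, and miniport interfaces;
     * this only illustrates resource isolation between client stacks. */
    #include <stdbool.h>

    enum stack_id { STACK_L2 = 0, STACK_TOE, STACK_ISCSI, STACK_RDMA, STACK_MAX };

    struct stack_quota {
        unsigned connections_max;    /* offloaded connections this stack may own */
        unsigned connections_used;
        unsigned intr_moderation_us; /* per-stack interrupt moderation setting   */
    };

    static struct stack_quota quotas[STACK_MAX] = {
        [STACK_TOE]   = { .connections_max = 1024, .connections_used = 0, .intr_moderation_us = 50 },
        [STACK_ISCSI] = { .connections_max =  256, .connections_used = 0, .intr_moderation_us = 20 },
        [STACK_RDMA]  = { .connections_max =  512, .connections_used = 0, .intr_moderation_us =  5 },
    };

    /* Grant or refuse a connection-offload request from one of the stacks,
     * keeping each stack isolated from the others' resource usage. */
    bool vbd_alloc_connection(enum stack_id id)
    {
        struct stack_quota *q = &quotas[id];
        if (q->connections_used >= q->connections_max)
            return false;            /* this stack hit its quota; others unaffected */
        q->connections_used++;
        return true;
    }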
Summary: C-NIC Advantages
TCP Offload Engine in hardware – for better application performance, lower CPU utilization, and improved latency
RDMA – for memory BW and ultimate latency
iSCSI – for networked storage and iSCSI Boot
Flexible and efficient I/O for the data center of today and tomorrow
Brian Hausauer, Chief Architect, NetEffect, Inc.
BrianH@NetEffect.com
WinHEC 2005
Today's Data Center
Diagram: three separate fabrics per server – the LAN (Ethernet networking adapter and switch) connecting users, NAS, and applications; the clustering fabric (Myrinet, Quadrics, InfiniBand, etc., clustering adapter and switch); and the SAN for block storage (Fibre Channel block storage adapter and switch)
Datacenter Trends: Traffic Increasing 3x Annually
2006 I/O traffic requirements, typical for each server:
Front-end web servers – Network: heavy (5-10 Gb/s); Storage: 1.5-4 Gb/s; Clustering IPC: none; Total: 6.5-14.0 Gb/s; Application requirements: pervasive standard, plug-n-play interop
Mid-tier application servers – Network: intermediate (200-500 Mb/s); Storage: <100 Mb/s; Clustering IPC: 2-4 Gb/s; Total: 2.3-4.6 Gb/s; Application requirements: concurrent access, high throughput, low overhead
Back-end database servers – Network: low (<200 Mb/s); Storage: 3-6 Gb/s; Clustering IPC: 2-4 Gb/s; Total: 5.2-10.2 Gb/s; Application requirements: fast access, low latency
Sources: 2006 IA Server I/O Analysis, Intel Corporation; Oracle
Scaling a 3-fabric infrastructure is expensive and cumbersome
Server density complicates connections to three fabrics
A successful solution must meet the different application requirements
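Each total is just the sum of that tier's three fabric figures; for the front-end tier, for example, 5-10 Gb/s (network) + 1.5-4 Gb/s (storage) + 0 (clustering) = 6.5-14.0 Gb/s. The mid-tier and back-end totals follow the same arithmetic, with the sub-200 Mb/s and sub-100 Mb/s entries counted at roughly their stated ceilings.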
High Performance Computing: Clusters Dominate
Clusters continue to grow in popularity and now dominate the Top 500 fastest computers – 294 clusters in the Top 500
Ethernet is the interconnect for over 50% of the top clusters
Ethernet continues to increase share as the cluster interconnect of choice for the top clusters in the world
Chart: clusters in Top 500 systems, 1997-2004, Ethernet-based clusters vs. all other clusters. Source: www.top500.org
Next-Generation Ethernet Can Be The Solution
Why Ethernet? Pervasive standard, multi-vendor interoperability, potential to reach high volumes and low cost, powerful management tools/infrastructure
Why not? Ethernet does not meet the requirements for all fabrics; Ethernet overhead is the major obstacle
The solution: iWARP extensions to Ethernet – industry-driven standards that address Ethernet's deficiencies, render Ethernet suitable for all fabrics at multi-Gb speeds and beyond, and reduce cost, complexity, and TCO
Overhead & Latency in Networking
Sources of overhead and latency in the traditional path: application-to-OS context switches and I/O command processing (about 40% of CPU overhead), intermediate buffer copies between the application buffer, OS buffer, and driver buffer (about 20%), and TCP/IP transport and packet processing (about 40%)
Solutions, each targeting one source: transport (TCP) offload to the I/O adapter removes the transport processing; RDMA/DDP removes the intermediate buffer copies by placing data directly into the application buffer; user-level direct access / OS bypass removes the application-to-OS context switches
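Using the slide's own percentages, the cumulative effect stacks up as follows: software-only processing carries 40% + 20% + 40% = 100% of the overhead; with transport (TCP) offload alone, 40% + 20% = 60% remains; adding RDMA/DDP leaves the 40% attributable to context switches; adding user-level direct access / OS bypass removes that remainder as well, which is how an "up to 100%" overhead reduction is built up.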
Introducing NetEffect's NE01 Ethernet Channel Adapter (ECA)
A single chip supports: transport (TCP) offload, RDMA/DDP, OS bypass / ULDA
Meets requirements for: clustering (HPC, DBC, …), storage (file and block), networking
Reduces overhead up to 100%
Strategic advantages: patent-pending virtual pipeline and RDMA architecture; one die for all chips enables unique products for dual 10Gb / dual 1Gb
Diagram: ECA block diagram – Ethernet ports (10 Gb or 1 Gb; copper or fibre) and MACs, transaction switch, protocol engine, DDR2 SDRAM controller to external DRAM with ECC, and host interface (PCI Express or PCI-X) to the server chipset, memory, and host CPUs
Future Server: Ethernet Channel Adapter (ECA) For A Converged Fabric
Diagram: clustering, block storage, and networking OS/driver software reach the ECA through both OS acceleration interfaces (RDMA accelerator for WSD, DAPL, VI, MPI; iSER; TCP accelerator) and existing interfaces (iSCSI, NIC), layered over iWARP and TOE onto the NetEffect ECA transaction switch and the Ethernet fabric(s)
NetEffect ECA delivers optimized file and block storage, networking, and clustering from a single adapter
NetEffect ECA Architecture
Diagram: host interface feeding the basic networking, accelerated networking, clustering, and block storage engines through a crossbar to the MACs
NetEffect ECA Architecture – Networking
Related software standards: sockets (SW stack), Microsoft WinSock Direct (WSD), Sockets Direct Protocol (SDP), TCP Accelerator
Interfaces: iWARP, TOE, basic networking
NetEffect ECA Architecture – Storage
Related software standards: file system – NFS, DAFS, R-NFS; block mode – iSCSI, iSER
Interfaces: iWARP, TOE, basic networking; the host connection-management path (iSER, R-NFS, iSCSI, NFS) is used for setup/teardown and exceptions only
NetEffect ECA Architecture – Clustering
Related software standards: MPI, DAPL API, IT API
Interfaces: RDMA accelerator over iWARP; TOE and basic networking: N/A; the host connection-management path (MPI, DAPL) is used for setup/teardown and exceptions only
Tomorrow's Data Center: Separate Fabrics For Networking, Storage, And Clustering
Diagram: each fabric migrates to iWARP Ethernet – the LAN (formerly plain Ethernet) connecting users, NAS, and applications; the storage SAN (formerly Fibre Channel); and the clustering fabric (formerly Myrinet, Quadrics, InfiniBand, etc.) – with each server attaching to each switch through converged networking/storage/clustering adapters
Fat Pipe For Blades & Stacks: Converged Fabric For Networking, Storage & Clustering
Diagram: a single networking/storage/clustering adapter per server connects through iWARP Ethernet switches to the LAN (users, NAS, applications), the storage SAN, and the clustering fabric
Take-Aways
Multi-gigabit networking is required for each tier of the data center
Supporting multiple incompatible network infrastructures is becoming increasingly difficult as budget, power, cooling, and space constraints tighten
With the adoption of iWARP, Ethernet for the first time meets the requirements for all connectivity within the data center
NetEffect is developing a high-performance iWARP Ethernet Channel Adapter that enables the convergence of clustering, storage, and networking
Call to Action
Deploy iWARP products for convergence of networking, storage, and clustering
Deploy 10 Gb Ethernet for fabric convergence
Develop applications to RDMA-based APIs for maximum server performance
Resources
NetEffect – www.NetEffect.com
iWARP Consortium – www.iol.unh.edu/consortiums/iwarp/
Open Group, authors of ITAPI, RNIC PI & Sockets API Extensions – www.opengroup.org/icsc/
DAT Collaborative – www.datcollaborative.org
RDMA Consortium – www.rdmaconsortium.org
IETF RDDP WG – www.ietf.org/html.charters/rddp-charter.html
Community Resources
Windows Hardware and Driver Central (WHDC) – www.microsoft.com/whdc/default.mspx
Technical Communities – www.microsoft.com/communities/products/default.mspx
Non-Microsoft Community Sites – www.microsoft.com/communities/related/default.mspx
Microsoft Public Newsgroups – www.microsoft.com/communities/newsgroups
Technical Chats and Webcasts – www.microsoft.com/communities/chats/default.mspx, www.microsoft.com/webcasts
Microsoft Blogs – www.microsoft.com/communities/blogs