
1 Towards a Common Communication Infrastructure for Clusters and Grids
Darius Buntinas, Argonne National Laboratory

2 Overview
Cluster Computing vs. Distributed Grids
InfiniBand
– IB for WAN
IP and Ethernet
– Improving performance
Other LAN/WAN Options
Summary

3 Cluster Computing vs. Distributed Grids
Typical clusters
– Homogeneous architecture
– Dedicated environments
Compatibility is not a concern
– Clusters can use high-speed LAN networks (e.g., VIA, Quadrics, Myrinet, InfiniBand)
– And specific hardware accelerators (e.g., protocol offload, RDMA)

4 Cluster Computing vs. Distributed Grids (cont'd)
Distributed environments
– Heterogeneous architecture
– Communication over WAN
– Multiple administrative domains
Compatibility is critical
– Most WAN stacks are IP/Ethernet
– Popular grid communication protocols: TCP/IP/Ethernet, UDP/IP/Ethernet
But what about performance?
– TCP/IP/Ethernet latency: tens of µs
– InfiniBand latency: a few µs
How do you maintain high intra-cluster performance while enabling inter-cluster communication?

5 Solutions
Use one network for LAN and another for WAN
– You need to manage two networks
– Your communication library needs to be multi-network capable
– May have an impact on performance or resource utilization
Maybe a better solution: a common network subsystem
– One network for both LAN and WAN
– Two popular network families: InfiniBand and Ethernet

6 InfiniBand
Initially introduced as a LAN
– Now expanding onto the WAN
Issues with using IB on the WAN
– IB copper cables have limited lengths
– IB uses end-to-end credit-based flow control

7 Cable Lengths
IB copper cabling
– Signal integrity decreases with length and data rate
– IB 4x QDR (32 Gbps) maximum cable length is < 1 m
Solution: optical cabling for IB, e.g., Intel Connects Cables
– Optical cables
– Electrical-to-optical converters at the ends (~50 ps conversion delay)
– Plug into existing copper-based adapters

8 End-to-End Flow Control
IB uses end-to-end credit-based flow control
– One credit corresponds to one buffer unit at the receiver
– The sender can send one unit of data per credit
– Long one-way latencies impact achievable throughput (WAN latencies are on the order of ms)
Solution: hop-by-hop flow control, e.g., Obsidian Networks Longbow switches
– Switches have internal buffering
– Link-level flow control is performed between node and switch
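To see why end-to-end credits become a problem over long links, here is a minimal back-of-the-envelope sketch. The credit count, buffer-unit size, and 2 ms round-trip time are illustrative assumptions, not figures from the slides:

```c
/* Sketch: throughput ceiling imposed by end-to-end credit-based flow
 * control. All numbers are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double credits   = 64;          /* buffer units advertised by the receiver */
    double buf_bytes = 8 * 1024;    /* bytes per buffer unit (assumed)         */
    double rtt_s     = 2e-3;        /* 2 ms WAN round-trip time (assumed)      */

    /* The sender can keep at most credits * buf_bytes in flight per RTT. */
    double max_bw = credits * buf_bytes / rtt_s;    /* bytes per second */
    printf("max throughput: %.1f MB/s (%.2f Gbit/s)\n",
           max_bw / 1e6, max_bw * 8 / 1e9);
    return 0;
}
```

With these assumed numbers the link is capped around 2 Gbit/s regardless of how fast the IB hardware is, which is why buffering closer to the link (hop-by-hop) helps.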

9 Effect of Delay on Bandwidth

Distance (km) | Delay (µs)
            1 |          5
            2 |         10
           20 |        100
          200 |      1,000
        2,000 |     10,000

Source: S. Narravula et al., "Performance of HPC Middleware over InfiniBand WAN," Ohio State University Technical Report OSU-CISRC-12/07-TR77, 2007.
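The hop-by-hop buffering mentioned on the previous slide has to cover at least the bandwidth-delay product of the WAN link to keep it full. A rough sizing sketch, assuming an 8 Gbit/s data rate and the ~5 µs/km propagation delay implied by the table above:

```c
/* Sketch: bandwidth-delay product, i.e. roughly how much buffering a
 * hop-by-hop flow-control device needs to keep a WAN link full.
 * The 8 Gbit/s data rate is an assumption; delays follow ~5 us/km. */
#include <stdio.h>

int main(void)
{
    double gbps = 8.0;                          /* assumed link data rate */
    double km[] = {1, 2, 20, 200, 2000};
    for (int i = 0; i < 5; i++) {
        double one_way_s = km[i] * 5e-6;        /* ~5 us per km in fiber  */
        double rtt_s     = 2 * one_way_s;
        double bdp_bytes = gbps * 1e9 / 8 * rtt_s;
        printf("%6.0f km: RTT %8.0f us, BDP %10.0f KB\n",
               km[i], rtt_s * 1e6, bdp_bytes / 1024.0);
    }
    return 0;
}
```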

10 IP and Ethernet
Traditionally
– IP/Ethernet is used for the WAN
– and as a low-cost alternative on the LAN
– Software-based TCP/IP stack implementation (software overhead limits performance)
Performance limitations
– Small 1500-byte maximum transfer unit (MTU)
– TCP/IP software stack overhead

11 Increasing the Maximum Transfer Unit
The Ethernet standard specifies a 1500-byte MTU
– Each packet requires hardware and software processing
– This processing is considerable at gigabit speeds
The MTU can be increased
– 9K jumbo frames
– Reduce per-byte processing overhead
But this is not compatible on the WAN
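A quick illustration of why the 1500-byte MTU hurts at gigabit speeds: the number of packets the host must process per second scales inversely with the MTU. The 10 Gbit/s link speed below is an assumption, and header overheads are ignored:

```c
/* Sketch: per-packet processing rate at a given link speed for
 * different MTUs. 10 Gbit/s is an assumed link speed. */
#include <stdio.h>

int main(void)
{
    double link_bps = 10e9;                   /* assumed 10 Gbit/s Ethernet */
    int mtus[] = {1500, 9000, 16384};
    for (int i = 0; i < 3; i++) {
        double pkts_per_s = link_bps / 8.0 / mtus[i];
        printf("MTU %5d bytes: ~%.0f packets/s to handle\n",
               mtus[i], pkts_per_s);
    }
    return 0;
}
```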

12 Large Segment Offload
Engine on the NIC, a.k.a. virtual MTU
Introduced by Intel and Broadcom
Allows the TCP/IP software stack to use 9K or 16K MTUs
– Reducing software overhead
Fragmentation is performed by the NIC
Standard 1500-byte MTU on the wire
– Compatible with upstream switches and routers
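A simplified sketch of the segmentation a large-segment-offload NIC performs: the host stack hands down one large virtual-MTU segment, and the NIC cuts it into standard 1500-byte wire frames. The 16K segment size is an illustrative assumption:

```c
/* Sketch of the segmentation done by a NIC with large segment offload:
 * one big segment from the host becomes many standard wire frames.
 * Sizes are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    int segment  = 16384;   /* virtual MTU seen by the TCP/IP stack */
    int wire_mtu = 1500;    /* MTU actually sent on the wire        */

    for (int off = 0, n = 1; off < segment; off += wire_mtu, n++) {
        int len = segment - off < wire_mtu ? segment - off : wire_mtu;
        printf("frame %2d: offset %5d, %4d bytes\n", n, off, len);
    }
    return 0;
}
```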

13 Offload Protocol Processing to the NIC
Handling packets at gigabit speeds requires considerable processing
– Even with a large MTU
– Uses CPU time that would otherwise be used by the application
Protocol Offload Engines (POEs)
– Perform communication processing on the NIC
– E.g., Myrinet, Quadrics, IB
A TCP Offload Engine (TOE) is a specific kind of POE
– E.g., Chelsio, NetEffect

14 TOE vs. Non-TOE: Latency
Source: P. Balaji, W. Feng, and D. K. Panda, "Bridging the Ethernet-Ethernot Performance Gap," IEEE Micro, Special Issue on High-Performance Interconnects, vol. 26, no. 3, pp. 24-40, May/June 2006.

15 TOE vs. Non-TOE: Bandwidth and CPU Utilization

16 TOE vs. Non-TOE: Bandwidth and CPU Utilization (9K MTU)

17 Other LAN/WAN Options
iWARP protocol offload
– Runs over IP
– Has functionality similar to TCP
– Adds RDMA
Myricom
– Myri-10G adapter
– Uses the 10G Ethernet physical layer
– POE
– Can handle both TCP/IP and MX
Mellanox
– ConnectX adapter
– Has multiple ports that can be configured for IB or Ethernet
– POE
– Can handle both TCP/IP and IB
Convergence in the software stack: OpenFabrics
– Supports IB and Ethernet adapters
– Provides a common API to the upper layer
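To illustrate the "common API" point, here is a minimal sketch against the OpenFabrics verbs interface (libibverbs): the same device-enumeration call reports InfiniBand HCAs and RDMA-capable Ethernet adapters alike. This is a generic verbs example, not code from the presentation:

```c
/* Minimal sketch using the OpenFabrics verbs API (libibverbs).
 * Lists whatever RDMA devices are present, whether IB or Ethernet.
 * Compile with: gcc list_devices.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}
```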

18 Summary
Clusters can take advantage of high-performance LAN NICs
– E.g., InfiniBand
Grids need interoperability
– TCP/IP is ubiquitous
– But there is a performance gap
Bridging the gap
– IB over the WAN
– POEs for Ethernet
Alternatives
– iWARP, Myricom's Myri-10G, Mellanox ConnectX

