HPS Switch and Adapter Architecture, Design & Performance
Rama K Govindaraju
IBM Systems & Tech. Group. Bangalore, India. HiPC 2004, Dec. 19-22. Copyright by IBM.

Rama K Govindaraju (ramag@us.ibm.com)
HiPC Conference, Bangalore, India, December 19-22, 2004

Team
- Architecture: Peter Hochschild, Don Grice, Kevin Gildea, Rama Govindaraju
- Hardware: Carl A Bender, Jay Herring, Piyush Chaudhary, Steven Martin, Jason Goscinski, John Houston, …
- Software: Chulho Kim, Robert Blackmore, Rajeev Sivaram, Hanhong Xue, …
- And many others contributed to this effort

Outline
- What is HPS?
- Example HPS customers
- Interconnect Historical Performance
- HPS switch architecture
- HPS adapter architecture
- HPS software architecture
- Transport Modes
- HPS Performance
- Lessons Learned and Future Work

What is HPS?
HPS (High Performance Switch)
- 4th generation switch and adapter to interconnect IBM’s Power processor based nodes (Power 4 and 5)
- To be used in many of the world’s fastest supercomputers
  - 20 of the top 100 today use HPS
- Addressing requirements of HPC labs, DOE, and others
  - Weather forecasting, petroleum sector, automotive and aerospace sector
  - NSA and DOD
- Core infrastructure for the 100TF ASCI Purple system to be delivered in June 2005

Example HPS Customers
- More than 30 and growing
- Several over 1000 CPUs
- Total: over 200TF

Historical Interconnect Performance
IBM developed Switch Interconnects and Adapters
- Year: 1993, 1996, 1998, 2000, 2004
- Adapter / Switch / Processor: TB2 / HPS / Power 2; TB3 / TBS / Power 2; TBMX / TBS / Power PC/3; Colony / SP-Switch2 / Power 3; HPS / Power 4
- Peak link bandwidth: 40MB/s, 150MB/s, 500MB/s, 2GB/s
- MPI bandwidth: 35MB/s, 110MB/s, 135MB/s, 375MB/s, 1.8-14GB/s
- MPI latency: 40us, 24us, 21us, 17us, <4.2us
- Links per node (server): 1; 1; 1; 1 or 2; 2, 4, 6 or 8

HPS Switch Fabric
- 4K end points, 59ns latency, 2GB/s bandwidth per link per direction

HPS Adapter Microcode Model

HPS Software Architecture
[Diagram. User space: APPLICATION, IBM’s MPI, LAPI, Parallel ESSL, ESSL, HAL. Kernel space: SOCKETS, TCP, UDP, IP, IF_LS, VSD, GPFS, DD, HYP. Hardware: HPS Adapter, HPS Switch Fabric. Management: HMC, FNM, Service Processor, LL, CSM.]

FIFO versus RDMA models
[Diagram. User space: MPI, LAPI, HAL, User Buffer, HAL Buffers. Kernel space: Sockets, TCP, UDP, IP Interface. Below both: Federation Adapter Interface Layer. Data paths shown: FIFO copy, FIFO DMA, RDMA.]

Supported Communication Modes
FIFO Mode
- Message chopped into 2K packet chunks on the host and copied by CPU
- Memory bus crossing depends on caching. At least 1 IO bus crossing
RDMA enablement
- No slave side protocol
- CPU offload
- Enhanced programming model
- 1 IO bus crossing
[Diagram: User Buffer, CPU, Network FIFO, Adapter; paths labeled Ld/St, DMA, RDMA.]
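The FIFO-mode chopping step can be pictured with a short sketch. This is only an illustration, not the HAL or LAPI implementation: the function name `fifo_packetize` is hypothetical, "2K" is assumed to mean 2048 bytes of payload, and packet headers are ignored.

```python
PACKET_PAYLOAD = 2048  # assumption: "2K" on the slide means 2048 payload bytes


def fifo_packetize(message: bytes, payload_size: int = PACKET_PAYLOAD):
    """Yield (offset, chunk) pairs, one per packet-sized piece of the message.

    Each slice below is a CPU copy, which mirrors the slide's point that in
    FIFO mode the host CPU does the chopping and copying.
    """
    for offset in range(0, len(message), payload_size):
        yield offset, message[offset:offset + payload_size]


# A 5000-byte message needs three packets; the last one is partially filled.
msg = bytes(5000)
packets = list(fifo_packetize(msg))
```

In RDMA mode, by contrast, the slides state that fragmentation and reassembly are offloaded to the adapter, so a loop like this would not run on the host CPU.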

RDMA value proposition
- Possible overlap of computation and communication
  - Fragmentation/reassembly offloaded to the adapter
  - Minimize packet arrival interrupts
  - Requires application to be written to take advantage of overlap
- One sided programming model
- Zero copy transport and reduced memory subsystem load
- Striping advantage
- KEY DIFFERENTIATOR: reliable RDMA protocol over unreliable datagram transport
  - Allows striping across multiple paths
  - Out of order arrival
  - Reduces hot spotting and contention
- Cons
  - Pinned memory usage
  - Resource management and fairness issues
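To see why tolerating out-of-order arrival makes striping across multiple paths possible, consider the following minimal sketch. It assumes each packet carries the offset of its payload in the destination buffer, which the slides do not state. The names `stripe` and `reassemble` are hypothetical, and a real reliable protocol would also need acknowledgements and retransmission timers, omitted here.

```python
import random


def stripe(message: bytes, num_links: int, payload_size: int = 2048):
    """Distribute packets round-robin over links; returns one queue per link."""
    links = [[] for _ in range(num_links)]
    for i, offset in enumerate(range(0, len(message), payload_size)):
        links[i % num_links].append((offset, message[offset:offset + payload_size]))
    return links


def reassemble(packets, total_len: int):
    """Place each packet at its offset; arrival order does not matter."""
    buf = bytearray(total_len)
    seen = set()
    received = 0
    for offset, chunk in packets:
        if offset in seen:  # duplicate, e.g. from a retransmission
            continue
        seen.add(offset)
        buf[offset:offset + len(chunk)] = chunk
        received += len(chunk)
    return bytes(buf), received == total_len


message = bytes(range(256)) * 40           # 10240 bytes, five 2K packets
arrivals = [p for link in stripe(message, 4) for p in link]
random.Random(0).shuffle(arrivals)         # simulate out-of-order arrival
arrivals.append(arrivals[0])               # simulate one duplicate packet
result, complete = reassemble(arrivals, len(message))
```

Because every packet is self-describing in this sketch, the receiver needs no per-path ordering, so packets can take whichever path is least congested.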

Federation Performance
Summary:
- Latency: Power 4, 1.9GHz, HPS
  - MPI latency: 4.34us
  - Interrupt latency: adds 10us
  - 8 task latency: adds 1us
- Bandwidth: Power 4, 1.9GHz, HPS
  - FIFO mode:
    - Unidirectional bandwidth: ~1.8GB/s
    - Bidirectional bandwidth: 2.1GB/s
  - RDMA mode:
    - Unidirectional bandwidth: ~1.8GB/s
    - Bidirectional bandwidth: ~3.0GB/s
    - Linear striping performance up to 8 links
      - Unidirectional: 14GB/s, Bidirectional: 24GB/s
These are preliminary measurements
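The striping figures follow from the single-link RDMA numbers if scaling is linear, as a quick check shows. The helper name `striped_bandwidth` is hypothetical, and the per-link values are the approximate ones quoted above.

```python
UNI_PER_LINK_GBS = 1.8    # RDMA unidirectional, approximate, from the summary
BIDIR_PER_LINK_GBS = 3.0  # RDMA bidirectional, approximate, from the summary


def striped_bandwidth(per_link_gbs: float, links: int) -> float:
    """Aggregate bandwidth under the assumption of perfectly linear scaling."""
    return per_link_gbs * links


for links in (1, 2, 4, 8):
    uni = striped_bandwidth(UNI_PER_LINK_GBS, links)
    bidir = striped_bandwidth(BIDIR_PER_LINK_GBS, links)
    print(f"{links} link(s): uni {uni:.1f} GB/s, bidir {bidir:.1f} GB/s")
```

At 8 links this gives about 14.4 GB/s unidirectional and 24 GB/s bidirectional, in line with the 14GB/s and 24GB/s reported.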

HPS: MPI Latency

Machine Type      Latency Measurement
1.9GHz, p690+     4.34us
1.7GHz, p690+     4.72us
1.7GHz, p655+     4.70us
1.5GHz, p690+     5.15us
1.3GHz, p690      5.5us

- All measurements taken using IBM’s thread safe MPI libraries
- 8 task latency adds approximately 1 additional microsecond
- Interrupt latency adds approximately 10-12 microseconds
- All measurements are preliminary

Unidirectional Bandwidth Peak

Machine Type      Peak Uni-dir Bandwidth
1.9GHz, p690+     1.800GB/s
1.7GHz, p690+     1.686GB/s
1.7GHz, p655+     1.800GB/s
1.5GHz, p690+     1.470GB/s
1.3GHz, p690      1.170GB/s

All measurements are preliminary

Unidirectional Bandwidth Profile
[Plot: Bandwidth (MB/s) versus Message Size (bytes); P655, 1.7GHz based system]
M1/2 = 32K, M3/4 = 128K (the message sizes at which half and three quarters of peak bandwidth are reached)

Bidirectional Bandwidth Profile
[Plot: Bandwidth (MB/s) versus Message Size (bytes); P655, 1.7GHz based system]
M1/2 = 16K, M3/4 = 64K

Striping Options
[Diagram: communication time by thread/task (T1, T2, T3) under three models]
a) Asynchronous Model
b) Synchronous Model
c) Aggregate Comm Thread Model

Striping Models
[Diagram: two stacks, each MPI Layer / LAPI Layer / HAL / ADAPTERS]
- Multiple threads doing copies model
- Single thread with pipelined RDMA model
Second approach:
- More elegant failover model
- Fewer synchronization issues and less CPU contention via RDMA

RDMA Unidirectional Bandwidth

RDMA Bidirectional Bandwidth

How can users exploit RDMA?
- Overlap computation and communication
  - Non blocking calls
  - Reuse communication buffers if possible
  - User exposed RDMA in 11/05
- Minimize interrupts for large transfers
- Reduce contention for memory
- Better raw bandwidth for messages over 80KB
- Possibility of overlapping collectives better (via striping)
- IP transport much more efficient (translates to improved GPFS performance)
- Select striping when sending large messages
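The size-based guidance above could be expressed as a simple selection rule. The sketch below is purely illustrative and does not describe how MPI or LAPI actually choose a transport: `choose_transport` is a hypothetical helper, "80KB" is assumed to mean 80 * 1024 bytes, and the 1MB striping cutoff is an invented placeholder, since the slides only say "large messages".

```python
RDMA_THRESHOLD = 80 * 1024      # assumption: "80KB" read as 80 * 1024 bytes
STRIPE_THRESHOLD = 1024 * 1024  # hypothetical cutoff; the slides give no number


def choose_transport(message_size: int, links_available: int = 1):
    """Return (mode, stripe) for a message of the given size in bytes."""
    mode = "RDMA" if message_size > RDMA_THRESHOLD else "FIFO"
    stripe = (
        mode == "RDMA"
        and links_available > 1
        and message_size >= STRIPE_THRESHOLD
    )
    return mode, stripe


small = choose_transport(4 * 1024)                             # short message
medium = choose_transport(100 * 1024, links_available=2)       # above 80KB
large = choose_transport(4 * 1024 * 1024, links_available=2)   # large message
```

Keeping the thresholds as named constants makes it easy to retune them against measured bandwidth profiles for a particular system.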

Future Work
- Enabling HPS for Power 5 based nodes
- Exploit SMT in Power 5 processor for FIFO mode
- Further attack MPI latency
- Use RDMA to improve MPI collectives performance
- Parallel file systems (GPFS): further exploitation of IP over RDMA
- Take lessons learned into the Percs project
