Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB
Advanced Computational Research Laboratory High Performance Computational Problem-Solving and Visualization Environment Computational experiments in multiple disciplines: CS, Science and Eng. 16-processor IBM SP3 Member of C3.ca Association, Inc.
Advanced Computational Research Laboratory Virendra Bhavsar, Director Eric Aubanel, Research Associate & Scientific Computing Support Sean Seeley, System Administrator
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
POWER chip: 1990 to 2003 –Performance Optimized With Enhanced RISC –Reduced Instruction Set Computer –Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate = 2 x MHz –Initially: 25 MHz (50 MFLOPS) and 64 KB data cache
POWER chip: 1990 to 2003 1993: SP1 –IBM’s first SP (Scalable POWERparallel) –Rack of standalone POWER processors (62.5 MHz) connected by internal switch network –Parallel Environment & system software
POWER chip: 1990 to 2003 1993: POWER2 –2 FMAs –Increased data cache size –66.5 MHz (254 MFLOPS) –Improved instruction set (incl. hardware square root) –SP2: POWER2 + higher-bandwidth switch for larger systems
POWER chip: 1990 to 2003 POWERPC: support for SMP 1996: P2SC (POWER2 Super Chip): clock speeds up to 160 MHz
POWER chip: 1990 to 2003 Feb. ‘99: POWER3 –Combined P2SC & POWERPC lines –64-bit architecture –Initially 2-way SMP, 200 MHz –Cache improvements, including an off-chip L2 cache –Instruction & data prefetch
POWER3+ chip: Feb. 2000 Winterhawk II –375 MHz, 4-way SMP –2 MULT/ADD (FMA) units, 1500 MFLOPS per processor –64 KB Level 1 cache, 3.2 GB/sec –8 MB Level 2 cache, 6.4 GB/sec –1.6 GB/s memory bandwidth –6 GFLOPS/node Nighthawk II –375 MHz, 16-way SMP –2 MULT/ADD (FMA) units, 1500 MFLOPS per processor –64 KB Level 1 cache, 3.2 GB/sec –8 MB Level 2 cache, 6.4 GB/sec –14 GB/s memory bandwidth –24 GFLOPS/node
The Clustered SMP ACRL’s SP: four 4-way SMP nodes Each node has its own copy of the OS Processors on the same node are more closely coupled than processors on different nodes
Power3 Architecture
POWER4: 32-way logical UMA SP High Node –L3 cache shared between all processors on node: 32 MB –Up to 32 GB main memory –Each processor: 1.1 GHz –140 GFLOPS total peak
Going to NUMA –NUMA up to 256 processors –Teraflops
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
Uni-processor Optimization Compiler options: –start with -O3 -qstrict, then -O3 -qarch=pwr3 Cache re-use Take advantage of superscalar architecture –give enough operations per load/store Use ESSL: optimization is already maximally exploited
Memory Access Times
Cache –128-byte cache line –L1 cache: 64 KB, 128-way set-associative –L2 cache: 8 MB total, 4-way set-associative (2 MB per way)
How to Monitor Performance? IBM’s hardware monitor: HPMCOUNT –Uses hardware counters on chip –Cache & TLB misses, fp ops, load-stores, … –Beta version –Available soon on ACRL’s SP
HPMCOUNT sample output

real*8 a(256,256), b(256,256), c(256,256)
common a,b,c
do j=1,256
   do i=1,256
      a(i,j)=b(i,j)+c(i,j)
   end do
end do
end

PM_TLB_MISS (TLB misses) :
Average number of loads per TLB miss :
Total loads and stores : M
Instructions per load/store :
Cycles per instruction :
Instructions per cycle :
Total floating point operations : M
Hardware floating point rate : Mflop/sec
HPMCOUNT sample output (padded arrays)

real*8 a(257,256), b(257,256), c(257,256)
common a,b,c
do j=1,256
   do i=1,257
      a(i,j)=b(i,j)+c(i,j)
   end do
end do
end

PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss :
Total loads and stores : M
Instructions per load/store :
Cycles per instruction :
Instructions per cycle :
Total floating point operations : M
Hardware floating point rate : Mflop/sec
ESSL Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers Fast! –560x560 real*8 matrix multiply: hand coding 19 Mflops, dgemm 1.2 GFlops Parallel (threaded and distributed) versions
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
ACRL’s IBM SP 4 Winterhawk II nodes –16 processors Each node has: –1 GB RAM –9 GB (mirrored) disk –switch adapter High Performance Switch Gigabit Ethernet (1 node) Control workstation Disk: SSA tower with GB disks
IBM Power3 SP Switch –Bidirectional multistage interconnection network (MIN) –300 MB/sec bidirectional –1.2 µsec latency
General Parallel File System –Three nodes: application + GPFS client over RVSD/VSD –One node: application + GPFS server over RVSD/VSD –All four nodes connected by the SP Switch
ACRL Software Operating System: AIX Compilers –IBM XL Fortran 7.1 (HPF not yet installed) –VisualAge C for AIX –VisualAge C++ Professional for AIX –IBM VisualAge Java (not yet installed) Job Scheduler: LoadLeveler 2.2 Parallel Programming Tools –IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2) Visualization: OpenDX (not yet installed) E-Commerce software (not yet installed)
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
Why Parallel Computing? Solve large problems in reasonable time Many algorithms are inherently parallel –image processing, Monte Carlo, simulations (e.g. CFD) High performance computers have parallel architectures –commercial off-the-shelf (COTS) components: Beowulf clusters, SMP nodes –improvements in network technology
NRL Layered Ocean Model at Naval Research Laboratory IBM Winterhawk II SP
Parallel Computational Models Data Parallelism –Parallel program looks like serial program parallelism in the data –Vector processors –HPF
Parallel Computational Models Message Passing (MPI) –Processes have only local memory but can communicate with other processes by sending & receiving messages –Data transfer between processes requires operations to be performed by both processes –Communication network not part of computational model (hypercube, torus, …)
Parallel Computational Models Shared Memory (threads) –P(osix)threads –OpenMP: higher level standard
Parallel Computational Models Remote Memory Operations –“One-sided” communication: MPI-2, IBM’s LAPI –One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory
Parallel Computational Models Combined: Message Passing & Threads –Driven by clusters of SMPs –Leads to software complexity!
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
Message Passing Interface MPI 1.0 standard in 1994 MPI 1.1 in 1995 –the version IBM supports MPI 2.0 in 1997 –includes 1.1 but adds new features: MPI-IO, one-sided communication, dynamic processes
Advantages of MPI Universality Expressivity –Well suited to formulating a parallel algorithm Ease of debugging –Memory is local Performance –Explicit association of data with process allows good use of cache
MPI Functionality Several modes of point-to-point message passing –blocking (e.g. MPI_SEND) –non-blocking (e.g. MPI_ISEND) –synchronous (e.g. MPI_SSEND) –buffered (e.g. MPI_BSEND) Collective communication and synchronization –e.g. MPI_REDUCE, MPI_BARRIER User-defined datatypes Logically distinct communicator spaces Application-level or virtual topologies
Simple MPI Example Two processes, My_Id = 0 and 1 –process 0 prints: This is from MPI process number 0 –process 1 prints: This is from MPI processes other than 0
Simple MPI Example

Program Trivial
   implicit none
   include "mpif.h"   ! MPI header file
   integer My_Id, Numb_of_Procs, Ierr
   call MPI_INIT ( Ierr )
   call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
   call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )
   print *, ' My_Id, Numb_of_Procs = ', My_Id, Numb_of_Procs
   if ( My_Id .eq. 0 ) then
      print *, ' This is from MPI process number ', My_Id
   else
      print *, ' This is from MPI processes other than 0 ', My_Id
   end if
   call MPI_FINALIZE ( Ierr )   ! bad things happen if you forget Ierr
   stop
   end
MPI Example with send/recv Two processes, My_Id = 0 and 1 –each sends to and receives from the other
MPI Example with send/recv

Program Simple
   implicit none
   include "mpif.h"
   integer My_Id, Other_Id, Nx, Ierr
   parameter ( Nx = 100 )
   real A ( Nx ), B ( Nx )
   call MPI_INIT ( Ierr )
   call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
   Other_Id = Mod ( My_Id + 1, 2 )
   A = My_Id
   call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
   call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Ierr )
   call MPI_FINALIZE ( Ierr )
   stop
   end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);
MPI Message Passing Modes Mode → underlying protocol –Ready → ready –Standard → eager (<= eager limit) or rendezvous (> eager limit) –Synchronous → rendezvous –Buffered → buffered Default eager limit on the SP is 4 KB (can be raised up to 64 KB)
MPI Performance Visualization ParaGraph –Developed by University of Illinois –Graphical display system for visualizing behaviour and performance of MPI programs
Message Passing on SMP Call MPI_SEND → data to send → buffer in memory → received data → Call MPI_RECEIVE, over the crossbar or switch export MP_SHARED_MEMORY=yes|no
Shared Memory MPI MP_SHARED_MEMORY= Latency (µsec) / Bandwidth (MB/sec) –between 2 nodes: –same node: 30 (no) / 80 (no) –same node: 10 (yes) / 270 (yes)
Message Passing off Node –MPI across all the processors –Many more messages going through the switch fabric
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
OpenMP 1997: group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms. OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
OpenMP All processors can access all the memory in the parallel system Parallel execution is achieved by generating threads which execute in parallel Overhead for SMP parallelization is large (µsec scale) –the parallel work construct must be significant enough to overcome the overhead
OpenMP 1. All OpenMP programs begin as a single process: the master thread 2. FORK: the master thread then creates a team of parallel threads 3. Parallel region statements are executed in parallel among the team threads 4. JOIN: threads synchronize and terminate, leaving only the master thread
OpenMP How is OpenMP typically used? OpenMP is usually used to parallelize loops: –Find your most time consuming loops. –Split them up between threads. Better scaling can be obtained using OpenMP parallel regions, but can be tricky!
OpenMP Loop Parallelization

!$OMP PARALLEL DO
do i=0,ilong
   do k=1,kshort
      ...
   end do
end do

#pragma omp parallel for
for(i=0; i <= ilong; i++)
   for(k=1; k <= kshort; k++) {
      ...
   }
Variable Scoping Most difficult part of Shared Memory Parallelization –What memory is Shared –What memory is Private - each processor has its own copy Compare MPI: all variables are private Variables are shared by default, except: –loop indices –scalars that are set and then used in loop
How Does Sharing Work?

THREAD 1: increment(x) { x = x + 1; }
10 LOAD A, (x address)
20 ADD A, 1
30 STORE A, (x address)

THREAD 2: increment(x) { x = x + 1; }
10 LOAD A, (x address)
20 ADD A, 1
30 STORE A, (x address)

Shared x initially 0; result could be 1 or 2 –need synchronization
False Sharing

!$OMP PARALLEL DO
do I=1,20
   A(I)=...
end do

Say thread 1 updates A(1-5) starting on a cache line; some of thread 2’s A(6-10) will fall on that same cache line, so the line bounces between processors and thread 2 cannot proceed until thread 1’s stores complete
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
Why Hybrid MPI-OpenMP? To optimize performance on “mixed-mode” hardware like the SP MPI is used for “inter-node” communication, and OpenMP is used for “intra-node” communication –threads have lower latency –threads can alleviate network contention of a pure MPI implementation
Hybrid MPI-OpenMP? Unless you are forced into it, for the hybrid model to be worthwhile: –there has to be obvious parallelism to exploit –the code has to be easy to program and maintain (it is easy to write bad OpenMP code) –it has to promise to perform at least as well as the equivalent all-MPI program Experience has shown that converting working MPI code to a hybrid model rarely results in better performance –especially true for applications with a single level of parallelism
Hybrid Scenario Thread the computational portions of the code that exist between MPI calls MPI calls are “single-threaded” and therefore use only a single CPU Assumes: –the application has two natural levels of parallelism –or that, when an MPI code with one level of parallelism is split among threads, communication between the resulting threads is little or none
Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)
MPI-IO Part of MPI-2 Resulted from work at IBM Research exploring the analogy between I/O and message passing –processes access a shared file the way they access memory See “Using MPI-2” by Gropp et al. (MIT Press)
Conclusion Don’t forget uni-processor optimization If you choose one parallel programming API, choose MPI Mixed MPI-OpenMP may be appropriate in certain cases –More work needed here Remote memory access model may be the answer