
1 Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB

2 Advanced Computational Research Laboratory
High Performance Computational Problem-Solving and Visualization Environment
Computational experiments in multiple disciplines: CS, Science and Engineering
16-Processor IBM SP3
Member of C3.ca Association, Inc. (http://www.c3.ca)

3 Advanced Computational Research Laboratory (www.cs.unb.ca/acrl)
Virendra Bhavsar, Director
Eric Aubanel, Research Associate & Scientific Computing Support
Sean Seeley, System Administrator


6 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

7 POWER chip: 1990 to 2003
1990
–Performance Optimized with Enhanced RISC
–Reduced Instruction Set Computer
–Superscalar: combined floating point multiply-add (FMA) unit, which allowed a peak MFLOPS rate = 2 x MHz
–Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

8 POWER chip: 1990 to 2003
1991: SP1
–IBM’s first SP (scalable power parallel)
–Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
–Parallel Environment & system software

9 POWER chip: 1990 to 2003
1993: POWER2
–2 FMAs
–Increased data cache size
–66.5 MHz (254 MFLOPS)
–Improved instruction set (incl. hardware square root)
–SP2: POWER2 + higher-bandwidth switch for larger systems

10 POWER chip: 1990 to 2003
1993: POWERPC – support for SMP
1996: P2SC (POWER2 Super Chip) – clock speeds up to 160 MHz

11 POWER chip: 1990 to 2003
Feb. ‘99: POWER3
–Combined P2SC & POWERPC
–64-bit architecture
–Initially 2-way SMP, 200 MHz
–Cache improvements, including an L2 cache of 1-16 MB
–Instruction & data prefetch

12 POWER3+ chip: Feb. 2000
Winterhawk II - 375 MHz, 4-way SMP
–2 MULT/ADD - 1500 MFLOPS
–64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
–8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
–1.6 GB/s memory bandwidth
–6 GFLOPS/node
Nighthawk II - 375 MHz, 16-way SMP
–2 MULT/ADD - 1500 MFLOPS
–64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
–8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
–14 GB/s memory bandwidth
–24 GFLOPS/node

13 The Clustered SMP
ACRL’s SP: four 4-way SMPs
Each node has its own copy of the O/S
Processors on the node are closer than those on different nodes

14 Power3 Architecture

15 Power4 - 32-way Logical UMA SP High Node
L3 cache shared between all processors on the node - 32 MB
Up to 32 GB main memory
Each processor: 1.1 GHz
140 GFLOPS total peak

16 Going to NUMA
NUMA up to 256 processors - 1.1 teraflops

17 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

18 Uni-processor Optimization
Compiler options:
–start with -O3 -qstrict, then -O3, -qarch=pwr3
Cache re-use (see the sketch below)
Take advantage of the superscalar architecture
–give enough operations per load/store
Use ESSL - optimization already maximally exploited
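
As an illustration of cache re-use (a sketch, not from the original slides): Fortran stores arrays in column-major order, so running the inner loop down a column touches memory with stride 1 and uses every element of each 128-byte cache line, while the interchanged loop order strides through memory and wastes most of each line.

      program loop_order
      implicit none
      integer n
      parameter ( n = 1000 )
      real*8 a(n,n), s
      integer i, j
      call random_number(a)
      s = 0.0d0
!     Good: stride-1 access, each cache line is fully used
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         end do
      end do
!     Poor (shown commented out): interchanging the loops gives a
!     stride of n*8 bytes, so nearly every access misses the cache
!     do i = 1, n
!        do j = 1, n
!           s = s + a(i,j)
!        end do
!     end do
      print *, 's = ', s
      end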

19 Memory Access Times

20 Cache
128-byte cache line
2 MB L2 cache: 4-way set-associative, 8 MB total
L1 cache: 128-way set-associative, 64 KB

21 How to Monitor Performance?
IBM’s hardware monitor: HPMCOUNT
–Uses hardware counters on the chip
–Cache & TLB misses, fp ops, load-stores, …
–Beta version
–Available soon on ACRL’s SP

22 HPMCOUNT sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware floating point rate : 2.749 Mflop/sec

23 HPMCOUNT sample output

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware floating point rate : 3.525 Mflop/sec

The only change from the previous slide is the padded leading dimension (257 instead of 256), which shifts the relative alignment of a, b and c and avoids the TLB conflicts seen above.

24 ESSL
Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
Fast!
–560x560 real*8 matrix multiply: hand coding 19 Mflops, dgemm 1.2 GFlops
Parallel (threaded and distributed) versions
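
As a sketch of what the dgemm call above looks like (assuming the standard BLAS-style DGEMM interface that ESSL provides; the program itself is not from the original slides):

      program essl_dgemm
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      call random_number(a)
      call random_number(b)
      c = 0.0d0
!     c = 1.0*a*b + 0.0*c using DGEMM ('N' = do not transpose)
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      print *, 'c(1,1) = ', c(1,1)
      end

Compile with xlf and link against ESSL (commonly -lessl, though the exact link options depend on the installation).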

25 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

26 ACRL’s IBM SP
4 Winterhawk II nodes
–16 processors
Each node has:
–1 GB RAM
–9 GB (mirrored) disk
–Switch adapter
High Performance Switch
Gigabit Ethernet (1 node)
Control workstation
Disk: SSA tower with 6 x 18.2 GB disks


28 IBM Power3 SP Switch
Bidirectional multistage interconnection network (MIN)
300 MB/sec bi-directional
1.2 μsec latency

29 General Parallel File System
(diagram) Three nodes run the application with a GPFS client over RVSD/VSD; one node runs the application with the GPFS server over RVSD/VSD; all four nodes are connected by the SP Switch.

30 ACRL Software
Operating System: AIX 4.3.3
Compilers
–IBM XL Fortran 7.1 (HPF not yet installed)
–VisualAge C for AIX, Version 5.0.1.0
–VisualAge C++ Professional for AIX, Version 5.0.0.0
–IBM VisualAge Java - not yet installed
Job Scheduler: LoadLeveler 2.2
Parallel Programming Tools
–IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
Visualization: OpenDX (not yet installed)
E-Commerce software (not yet installed)

31 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

32 Why Parallel Computing?
Solve large problems in reasonable time
Many algorithms are inherently parallel
–image processing, Monte Carlo
–simulations (e.g. CFD)
High performance computers have parallel architectures
–Commercial off-the-shelf (COTS) components: Beowulf clusters, SMP nodes
–Improvements in network technology

33 NRL Layered Ocean Model at Naval Research Laboratory IBM Winterhawk II SP

34 Parallel Computational Models
Data Parallelism
–Parallel program looks like a serial program; the parallelism is in the data
–Vector processors
–HPF

35 Parallel Computational Models
Message Passing (MPI)
–Processes have only local memory but can communicate with other processes by sending & receiving messages
–Data transfer between processes requires operations to be performed by both processes
–Communication network not part of computational model (hypercube, torus, …)
(diagram: one process sends, the other receives)

36 Parallel Computational Models
Shared Memory (threads)
–P(osix)threads
–OpenMP: higher-level standard
(diagram: several processes sharing one address space)

37 Parallel Computational Models
Remote Memory Operations
–“One-sided” communication: MPI-2, IBM’s LAPI
–One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory
(diagram: Put and Get operations)

38 Parallel Computational Models
Combined: Message Passing & Threads
–Driven by clusters of SMPs
–Leads to software complexity!
(diagram: several nodes, each with processes sharing an address space, connected by a network)

39 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

40 Message Passing Interface
MPI 1.0 standard in 1994
MPI 1.1 in 1995 - IBM support
MPI 2.0 in 1997
–Includes 1.1 but adds new features: MPI-IO, one-sided communication, dynamic processes

41 Advantages of MPI
Universality
Expressivity
–Well suited to formulating a parallel algorithm
Ease of debugging
–Memory is local
Performance
–Explicit association of data with process allows good use of cache

42 MPI Functionality
Several modes of point-to-point message passing
–blocking (e.g. MPI_SEND)
–non-blocking (e.g. MPI_ISEND)
–synchronous (e.g. MPI_SSEND)
–buffered (e.g. MPI_BSEND)
Collective communication and synchronization (see the sketch below)
–e.g. MPI_REDUCE, MPI_BARRIER
User-defined datatypes
Logically distinct communicator spaces
Application-level or virtual topologies
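
A minimal sketch (not from the original slides) of the collective operations named above: each process contributes a value, MPI_REDUCE sums the contributions on process 0, and MPI_BARRIER synchronizes all processes before the result is printed.

      program collect
      implicit none
      include "mpif.h"
      integer my_id, nprocs, ierr
      real local, total
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, nprocs, ierr )
!     Each process contributes its rank; the sum lands on process 0
      local = real ( my_id )
      call MPI_REDUCE ( local, total, 1, MPI_REAL, MPI_SUM, 0,
     &                  MPI_COMM_WORLD, ierr )
!     Wait until every process has reached this point
      call MPI_BARRIER ( MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' sum of ranks = ', total
      call MPI_FINALIZE ( ierr )
      end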

43 Simple MPI Example
Expected output: the process with My_Id 0 prints “This is from MPI process number 0”; the other processes print “This is from MPI processes other than 0”.

44 Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"   ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
      stop
      end

45 MPI Example with send/recv
(diagram: processes 0 and 1 each send an array to, and receive one from, the other)

46 MPI Example with send/recv

      Program Simple
      implicit none
      Include "mpif.h"
      Integer My_Id, Other_Id, Nx, Ierr
      Integer Status ( MPI_STATUS_SIZE )
      Parameter ( Nx = 100 )
      Real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &                MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &                MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end

47 What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

48 MPI Message Passing Modes
–Ready mode uses the ready protocol
–Standard mode uses the eager protocol for messages <= the eager limit, and rendezvous for messages > the eager limit
–Synchronous mode uses rendezvous
–Buffered mode uses the buffered protocol
Default eager limit on the SP is 4 KB (can be raised up to 64 KB)
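
One way to make the exchange on the “What Will Happen?” slide safe regardless of the eager limit is to use the non-blocking calls from slide 42. The sketch below (not from the original slides) posts both transfers and then waits for them, so neither process can block inside a send.

      Program Safe_Exchange
      implicit none
      Include "mpif.h"
      Integer My_Id, Other_Id, Nx, Ierr
      Parameter ( Nx = 100 )
      Integer Requests(2), Statuses(MPI_STATUS_SIZE,2)
      Real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
!     Post the send and the receive without blocking, then wait for
!     both to complete; this cannot deadlock, whatever the message size
      call MPI_ISEND ( A, Nx, MPI_REAL, Other_Id, 0,
     &                 MPI_COMM_WORLD, Requests(1), Ierr )
      call MPI_IRECV ( B, Nx, MPI_REAL, Other_Id, 0,
     &                 MPI_COMM_WORLD, Requests(2), Ierr )
      call MPI_WAITALL ( 2, Requests, Statuses, Ierr )
      call MPI_FINALIZE ( Ierr )
      end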

49 MPI Performance Visualization
ParaGraph
–Developed by the University of Illinois
–Graphical display system for visualizing the behaviour and performance of MPI programs


52 Message Passing on SMP
(diagram: one process calls MPI_SEND and another calls MPI_RECV; the data to send is copied through a buffer in memory, across the crossbar or switch, to the receiver)
export MP_SHARED_MEMORY=yes|no

53 Shared Memory MPI
Effect of MP_SHARED_MEMORY:
–between 2 nodes: 24 μsec latency, 133 MB/sec bandwidth
–same node, MP_SHARED_MEMORY=no: 30 μsec latency, 80 MB/sec bandwidth
–same node, MP_SHARED_MEMORY=yes: 10 μsec latency, 270 MB/sec bandwidth

54 Message Passing off Node
Running MPI across all the processors means many more messages going through the switch fabric

55 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

56 OpenMP
1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms (www.openmp.org)
OpenMP parallelism is specified through compiler directives which are embedded in C/C++ or Fortran source code
IBM does not yet support OpenMP for C++

57 OpenMP
All processors can access all the memory in the parallel system
Parallel execution is achieved by generating threads which execute in parallel
Overhead for SMP parallelization is large (100-200 μsec) - the parallel work construct must be large enough to overcome the overhead

58 OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread (see the sketch below)
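
A minimal fork-join sketch (not from the original slides; it assumes compilation with XL Fortran's OpenMP support, commonly -qsmp=omp): the master thread runs alone, the PARALLEL directive forks a team, each thread prints its number, and the team joins back into the master thread.

      program fork_join
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
!     Only the master thread executes the serial parts
      print *, 'before the parallel region (master thread only)'
!$OMP PARALLEL
!     FORK: every thread in the team executes this region
      print *, 'hello from thread', omp_get_thread_num(),
     &         ' of ', omp_get_num_threads()
!$OMP END PARALLEL
!     JOIN: back to the master thread only
      print *, 'after the parallel region (master thread only)'
      end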

59 OpenMP
How is OpenMP typically used?
OpenMP is usually used to parallelize loops:
–Find your most time-consuming loops
–Split them up between threads
Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!

60 OpenMP Loop Parallelization

!$OMP PARALLEL DO
      do i=0,ilong
         do k=1,kshort
            ...
         end do
      end do

#pragma omp parallel for
for (i = 0; i <= ilong; i++)
    for (k = 1; k <= kshort; k++) {
       ...
    }

61 Variable Scoping
Most difficult part of shared memory parallelization
–What memory is Shared
–What memory is Private - each processor has its own copy
Compare MPI: all variables are private
Variables are shared by default, except:
–loop indices
–scalars that are set and then used in the loop
(see the sketch below)
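
A sketch (not from the original slides) that makes the scoping explicit: the loop index and the temporary scalar are PRIVATE, everything else is SHARED.

      program scoping
      implicit none
      integer n
      parameter ( n = 1000 )
      real*8 a(n), b(n), temp
      integer i
      b = 1.0d0
!     a, b and n are shared; i and temp must be private, otherwise
!     the threads would overwrite each other's copies
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i, temp)
      do i = 1, n
         temp = 2.0d0 * b(i)
         a(i) = temp + b(i)
      end do
!$OMP END PARALLEL DO
      print *, 'a(1) = ', a(1)
      end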

62 How Does Sharing Work?

THREAD 1 and THREAD 2 both execute:
   increment(x) { x = x + 1; }
which the compiler turns into:
   10 LOAD A, (x address)
   20 ADD A, 1
   30 STORE A, (x address)

x is shared and initially 0; depending on how the two threads' loads and stores interleave, the result could be 1 or 2. Synchronization is needed (see the sketch below).
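
One way to get that synchronization in OpenMP (a sketch, not from the original slides) is the ATOMIC directive, which makes each increment indivisible:

      program shared_counter
      implicit none
      integer x, i
      x = 0
!$OMP PARALLEL DO SHARED(x) PRIVATE(i)
      do i = 1, 1000
!$OMP ATOMIC
         x = x + 1
      end do
!     With ATOMIC each update is indivisible, so x is always 1000;
!     without it, updates could be lost as on the slide above
      print *, 'x = ', x
      end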

63 False Sharing
(diagram: two processors each hold a copy of the same cache line, with its address tag, in their caches)

!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      enddo

Say A(1-5) starts on a cache line; then some of A(6-10) will be on that first cache line, so it won't be accessible until the first thread has finished with it (see the sketch below).
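
A common workaround (a sketch, not from the original slides) is to hand each thread a contiguous chunk that covers whole cache lines: with 128-byte lines and real*8 data, chunks of 16 iterations keep each thread's writes on its own lines, except possibly at chunk boundaries.

      program avoid_false_sharing
      implicit none
      integer n, i
      parameter ( n = 20 )
      real*8 a(n)
!     A 128-byte cache line holds 16 real*8 elements, so chunks of 16
!     iterations keep each thread's writes on separate lines
!     (assuming a is cache-line aligned; padding can help ensure this)
!$OMP PARALLEL DO SCHEDULE(STATIC,16)
      do i = 1, n
         a(i) = dble(i)
      end do
      print *, 'a(n) = ', a(n)
      end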

64 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

65 Why Hybrid MPI-OpenMP?
To optimize performance on “mixed-mode” hardware like the SP
MPI is used for “inter-node” communication, and OpenMP is used for “intra-node” communication
–threads have lower latency
–threads can alleviate the network contention of a pure MPI implementation

66 Hybrid MPI-OpenMP?
Unless you are forced against your will, for the hybrid model to be worthwhile:
–There has to be obvious parallelism to exploit
–The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
–It has to promise to perform at least as well as the equivalent all-MPI program
Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
–especially true for applications with a single level of parallelism

67 Hybrid Scenario
Thread the computational portions of the code that exist between MPI calls (see the sketch below)
MPI calls are “single-threaded” and therefore use only a single CPU
Assumes:
–the application has two natural levels of parallelism
–or that, when breaking up an MPI code with one level of parallelism, communication between the resulting threads is little or none
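
A minimal sketch of this scenario (not from the original slides): the reduction loop between the MPI calls is threaded with OpenMP, while the MPI calls themselves are made outside any parallel region, i.e. by a single thread per process.

      program hybrid
      implicit none
      include "mpif.h"
      integer my_id, nprocs, ierr, i
      real*8 local_sum, global_sum
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, nprocs, ierr )
      local_sum = 0.0d0
!     Computational portion between MPI calls: threaded with OpenMP
!$OMP PARALLEL DO REDUCTION(+:local_sum)
      do i = 1, 1000000
         local_sum = local_sum + dble ( i )
      end do
!     MPI call made by the master thread, outside the parallel region
      call MPI_REDUCE ( local_sum, global_sum, 1,
     &                  MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     &                  MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' global sum = ', global_sum
      call MPI_FINALIZE ( ierr )
      end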

68 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

69 MPI-IO
Part of MPI-2
Resulted from work at IBM Research exploring the analogy between I/O and message passing
See “Using MPI-2”, by Gropp et al. (MIT Press)
(diagram: data moves between the memory of several processes and a shared file; see the sketch below)
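
A minimal MPI-IO sketch (not from the original slides; the file name and sizes are made up): every process opens the same file collectively and writes its own block of reals at a byte offset computed from its rank.

      program io_example
      implicit none
      include "mpif.h"
      integer my_id, ierr, fh, nx, i
      parameter ( nx = 100 )
      integer (kind=MPI_OFFSET_KIND) disp
      real a ( nx )
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      do i = 1, nx
         a(i) = real ( my_id )
      end do
!     All processes open the same file collectively
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'out.dat',
     &     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )
!     Each process writes its block at its own byte offset
      disp = my_id * nx * 4
      call MPI_FILE_WRITE_AT ( fh, disp, a, nx, MPI_REAL,
     &                         MPI_STATUS_IGNORE, ierr )
      call MPI_FILE_CLOSE ( fh, ierr )
      call MPI_FINALIZE ( ierr )
      end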

70 Conclusion
Don’t forget uni-processor optimization
If you choose one parallel programming API, choose MPI
Mixed MPI-OpenMP may be appropriate in certain cases
–More work needed here
Remote memory access model may be the answer

