
1 Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB

2 Advanced Computational Research Laboratory
High Performance Computational Problem-Solving and Visualization Environment
Computational experiments in multiple disciplines: CS, Science and Engineering
16-Processor IBM SP3
Member of C3.ca Association, Inc. (http://www.c3.ca)

3 Advanced Computational Research Laboratory (www.cs.unb.ca/acrl)
Virendra Bhavsar, Director
Eric Aubanel, Research Associate & Scientific Computing Support
Sean Seeley, System Administrator


6 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

7 POWER chip: 1990 to 2003
1990
–Performance Optimized with Enhanced RISC
–Reduced Instruction Set Computer
–Superscalar: combined floating point multiply-add (FMA) unit, which allowed a peak MFLOPS rate = 2 x MHz
–Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

8 POWER chip: 1990 to 2003
1991: SP1
–IBM’s first SP (scalable power parallel)
–Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
–Parallel Environment & system software

9 POWER chip: 1990 to 2003
1993: POWER2
–2 FMAs
–Increased data cache size
–66.5 MHz (254 MFLOPS)
–Improved instruction set (incl. hardware square root)
–SP2: POWER2 + higher-bandwidth switch for larger systems

10 POWER chip: 1990 to 2003
1993: POWERPC – support for SMP
1996: P2SC (POWER2 Super Chip) – clock speeds up to 160 MHz

11 POWER chip: 1990 to 2003
Feb. ‘99: POWER3
–Combined P2SC & POWERPC
–64-bit architecture
–Initially 2-way SMP, 200 MHz
–Cache improvements, including an L2 cache of 1-16 MB
–Instruction & data prefetch

12 POWER3+ chip: Feb. 2000
Winterhawk II - 375 MHz, 4-way SMP
–2 MULT/ADD - 1500 MFLOPS
–64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
–8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
–1.6 GB/s memory bandwidth
–6 GFLOPS/node
Nighthawk II - 375 MHz, 16-way SMP
–2 MULT/ADD - 1500 MFLOPS
–64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
–8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
–14 GB/s memory bandwidth
–24 GFLOPS/node

13 The Clustered SMP
ACRL’s SP: four 4-way SMPs
Each node has its own copy of the O/S
Processors on the node are closer than those on different nodes

14 Power3 Architecture

15 Power4 - 32-way Logical UMA SP High Node
L3 cache shared between all processors on the node - 32 MB
Up to 32 GB main memory
Each processor: 1.1 GHz
140 GFLOPS total peak

16 Going to NUMA
NUMA up to 256 processors - 1.1 teraflops

17 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

18 Uni-processor Optimization
Compiler options:
–start with -O3 -qstrict, then -O3, -qarch=pwr3
Cache re-use (see the sketch below)
Take advantage of the superscalar architecture
–give enough operations per load/store
Use ESSL - optimization already maximally exploited
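
As an illustration of cache re-use (a sketch, not from the original slides): Fortran stores arrays in column-major order, so running the inner loop down a column touches memory with stride 1 and uses every element of each 128-byte cache line, while the interchanged loop order strides through memory and wastes most of each line.

      program loop_order
      implicit none
      integer n
      parameter ( n = 1000 )
      real*8 a(n,n), s
      integer i, j
      call random_number(a)
      s = 0.0d0
!     Good: stride-1 access, each cache line is fully used
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         end do
      end do
!     Poor (shown commented out): interchanging the loops gives a
!     stride of n*8 bytes, so nearly every access misses the cache
!     do i = 1, n
!        do j = 1, n
!           s = s + a(i,j)
!        end do
!     end do
      print *, 's = ', s
      end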

19 Memory Access Times

20 Cache
128-byte cache line
2 MB L2 cache: 4-way set-associative, 8 MB total
L1 cache: 128-way set-associative, 64 KB

21 How to Monitor Performance?
IBM’s hardware monitor: HPMCOUNT
–Uses hardware counters on the chip
–Cache & TLB misses, fp ops, load-stores, …
–Beta version
–Available soon on ACRL’s SP

22 HPMCOUNT sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware floating point rate : 2.749 Mflop/sec

23 HPMCOUNT sample output

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware floating point rate : 3.525 Mflop/sec

The only change from the previous slide is the padded leading dimension (257 instead of 256), which shifts the relative alignment of a, b and c and avoids the TLB conflicts seen above.

24 ESSL
Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
Fast!
–560x560 real*8 matrix multiply: hand coding 19 Mflops, dgemm 1.2 GFlops
Parallel (threaded and distributed) versions
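
As a sketch of what the dgemm call above looks like (assuming the standard BLAS-style DGEMM interface that ESSL provides; the program itself is not from the original slides):

      program essl_dgemm
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      call random_number(a)
      call random_number(b)
      c = 0.0d0
!     c = 1.0*a*b + 0.0*c using DGEMM ('N' = do not transpose)
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      print *, 'c(1,1) = ', c(1,1)
      end

Compile with xlf and link against ESSL (commonly -lessl, though the exact link options depend on the installation).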

25 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

26 ACRL’s IBM SP
4 Winterhawk II nodes
–16 processors
Each node has:
–1 GB RAM
–9 GB (mirrored) disk
–Switch adapter
High Performance Switch
Gigabit Ethernet (1 node)
Control workstation
Disk: SSA tower with 6 x 18.2 GB disks


28 IBM Power3 SP Switch
Bidirectional multistage interconnection network (MIN)
300 MB/sec bi-directional
1.2 μsec latency

29 General Parallel File System
(diagram) Three nodes run the application with a GPFS client over RVSD/VSD; one node runs the application with the GPFS server over RVSD/VSD; all four nodes are connected by the SP Switch.

30 ACRL Software
Operating System: AIX 4.3.3
Compilers
–IBM XL Fortran 7.1 (HPF not yet installed)
–VisualAge C for AIX, Version 5.0.1.0
–VisualAge C++ Professional for AIX, Version 5.0.0.0
–IBM VisualAge Java - not yet installed
Job Scheduler: LoadLeveler 2.2
Parallel Programming Tools
–IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
Visualization: OpenDX (not yet installed)
E-Commerce software (not yet installed)

31 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

32 Why Parallel Computing?
Solve large problems in reasonable time
Many algorithms are inherently parallel
–image processing, Monte Carlo
–simulations (e.g. CFD)
High performance computers have parallel architectures
–Commercial off-the-shelf (COTS) components: Beowulf clusters, SMP nodes
–Improvements in network technology

33 NRL Layered Ocean Model at Naval Research Laboratory IBM Winterhawk II SP

34 Parallel Computational Models
Data Parallelism
–Parallel program looks like a serial program; the parallelism is in the data
–Vector processors
–HPF

35 Parallel Computational Models
Message Passing (MPI)
–Processes have only local memory but can communicate with other processes by sending & receiving messages
–Data transfer between processes requires operations to be performed by both processes
–Communication network not part of computational model (hypercube, torus, …)
(diagram: one process sends, the other receives)

36 Parallel Computational Models
Shared Memory (threads)
–P(osix)threads
–OpenMP: higher-level standard
(diagram: several processes sharing one address space)

37 Parallel Computational Models
Remote Memory Operations
–“One-sided” communication: MPI-2, IBM’s LAPI
–One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory
(diagram: Put and Get operations)

38 Parallel Computational Models
Combined: Message Passing & Threads
–Driven by clusters of SMPs
–Leads to software complexity!
(diagram: several nodes, each with processes sharing an address space, connected by a network)

39 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

40 Message Passing Interface
MPI 1.0 standard in 1994
MPI 1.1 in 1995 - IBM support
MPI 2.0 in 1997
–Includes 1.1 but adds new features: MPI-IO, one-sided communication, dynamic processes

41 Advantages of MPI
Universality
Expressivity
–Well suited to formulating a parallel algorithm
Ease of debugging
–Memory is local
Performance
–Explicit association of data with process allows good use of cache

42 MPI Functionality
Several modes of point-to-point message passing
–blocking (e.g. MPI_SEND)
–non-blocking (e.g. MPI_ISEND)
–synchronous (e.g. MPI_SSEND)
–buffered (e.g. MPI_BSEND)
Collective communication and synchronization (see the sketch below)
–e.g. MPI_REDUCE, MPI_BARRIER
User-defined datatypes
Logically distinct communicator spaces
Application-level or virtual topologies
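
A minimal sketch (not from the original slides) of the collective operations named above: each process contributes a value, MPI_REDUCE sums the contributions on process 0, and MPI_BARRIER synchronizes all processes before the result is printed.

      program collect
      implicit none
      include "mpif.h"
      integer my_id, nprocs, ierr
      real local, total
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, nprocs, ierr )
!     Each process contributes its rank; the sum lands on process 0
      local = real ( my_id )
      call MPI_REDUCE ( local, total, 1, MPI_REAL, MPI_SUM, 0,
     &                  MPI_COMM_WORLD, ierr )
!     Wait until every process has reached this point
      call MPI_BARRIER ( MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' sum of ranks = ', total
      call MPI_FINALIZE ( ierr )
      end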

43 Simple MPI Example
Expected output: the process with My_Id 0 prints “This is from MPI process number 0”; the other processes print “This is from MPI processes other than 0”.

44 Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"   ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
      stop
      end

45 MPI Example with send/recv
(diagram: processes 0 and 1 each send an array to, and receive one from, the other)

46 MPI Example with send/recv

      Program Simple
      implicit none
      Include "mpif.h"
      Integer My_Id, Other_Id, Nx, Ierr
      Integer Status ( MPI_STATUS_SIZE )
      Parameter ( Nx = 100 )
      Real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &                MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &                MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end

47 What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

48 MPI Message Passing Modes
–Ready mode uses the ready protocol
–Standard mode uses the eager protocol for messages <= the eager limit, and rendezvous for messages > the eager limit
–Synchronous mode uses rendezvous
–Buffered mode uses the buffered protocol
Default eager limit on the SP is 4 KB (can be raised up to 64 KB)
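
One way to make the exchange on the “What Will Happen?” slide safe regardless of the eager limit is to use the non-blocking calls from slide 42. The sketch below (not from the original slides) posts both transfers and then waits for them, so neither process can block inside a send.

      Program Safe_Exchange
      implicit none
      Include "mpif.h"
      Integer My_Id, Other_Id, Nx, Ierr
      Parameter ( Nx = 100 )
      Integer Requests(2), Statuses(MPI_STATUS_SIZE,2)
      Real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
!     Post the send and the receive without blocking, then wait for
!     both to complete; this cannot deadlock, whatever the message size
      call MPI_ISEND ( A, Nx, MPI_REAL, Other_Id, 0,
     &                 MPI_COMM_WORLD, Requests(1), Ierr )
      call MPI_IRECV ( B, Nx, MPI_REAL, Other_Id, 0,
     &                 MPI_COMM_WORLD, Requests(2), Ierr )
      call MPI_WAITALL ( 2, Requests, Statuses, Ierr )
      call MPI_FINALIZE ( Ierr )
      end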

49 MPI Performance Visualization
ParaGraph
–Developed by the University of Illinois
–Graphical display system for visualizing the behaviour and performance of MPI programs


52 Message Passing on SMP
(diagram: one process calls MPI_SEND and another calls MPI_RECV; the data to send is copied through a buffer in memory, across the crossbar or switch, to the receiver)
export MP_SHARED_MEMORY=yes|no

53 Shared Memory MPI
Effect of MP_SHARED_MEMORY:
–between 2 nodes: 24 μsec latency, 133 MB/sec bandwidth
–same node, MP_SHARED_MEMORY=no: 30 μsec latency, 80 MB/sec bandwidth
–same node, MP_SHARED_MEMORY=yes: 10 μsec latency, 270 MB/sec bandwidth

54 Message Passing off Node
Running MPI across all the processors means many more messages going through the switch fabric

55 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

56 OpenMP
1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms (www.openmp.org)
OpenMP parallelism is specified through compiler directives which are embedded in C/C++ or Fortran source code
IBM does not yet support OpenMP for C++

57 OpenMP
All processors can access all the memory in the parallel system
Parallel execution is achieved by generating threads which execute in parallel
Overhead for SMP parallelization is large (100-200 μsec) - the parallel work construct must be large enough to overcome the overhead

58 OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread (see the sketch below)
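
A minimal fork-join sketch (not from the original slides; it assumes compilation with XL Fortran's OpenMP support, commonly -qsmp=omp): the master thread runs alone, the PARALLEL directive forks a team, each thread prints its number, and the team joins back into the master thread.

      program fork_join
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
!     Only the master thread executes the serial parts
      print *, 'before the parallel region (master thread only)'
!$OMP PARALLEL
!     FORK: every thread in the team executes this region
      print *, 'hello from thread', omp_get_thread_num(),
     &         ' of ', omp_get_num_threads()
!$OMP END PARALLEL
!     JOIN: back to the master thread only
      print *, 'after the parallel region (master thread only)'
      end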

59 OpenMP
How is OpenMP typically used?
OpenMP is usually used to parallelize loops:
–Find your most time-consuming loops
–Split them up between threads
Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!

60 OpenMP Loop Parallelization

!$OMP PARALLEL DO
      do i=0,ilong
         do k=1,kshort
            ...
         end do
      end do

#pragma omp parallel for
for (i = 0; i <= ilong; i++)
    for (k = 1; k <= kshort; k++) {
       ...
    }

61 Variable Scoping
Most difficult part of shared memory parallelization
–What memory is Shared
–What memory is Private - each processor has its own copy
Compare MPI: all variables are private
Variables are shared by default, except:
–loop indices
–scalars that are set and then used in the loop
(see the sketch below)
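
A sketch (not from the original slides) that makes the scoping explicit: the loop index and the temporary scalar are PRIVATE, everything else is SHARED.

      program scoping
      implicit none
      integer n
      parameter ( n = 1000 )
      real*8 a(n), b(n), temp
      integer i
      b = 1.0d0
!     a, b and n are shared; i and temp must be private, otherwise
!     the threads would overwrite each other's copies
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i, temp)
      do i = 1, n
         temp = 2.0d0 * b(i)
         a(i) = temp + b(i)
      end do
!$OMP END PARALLEL DO
      print *, 'a(1) = ', a(1)
      end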

62 How Does Sharing Work?

THREAD 1 and THREAD 2 both execute:
   increment(x) { x = x + 1; }
which the compiler turns into:
   10 LOAD A, (x address)
   20 ADD A, 1
   30 STORE A, (x address)

x is shared and initially 0; depending on how the two threads' loads and stores interleave, the result could be 1 or 2. Synchronization is needed (see the sketch below).
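
One way to get that synchronization in OpenMP (a sketch, not from the original slides) is the ATOMIC directive, which makes each increment indivisible:

      program shared_counter
      implicit none
      integer x, i
      x = 0
!$OMP PARALLEL DO SHARED(x) PRIVATE(i)
      do i = 1, 1000
!$OMP ATOMIC
         x = x + 1
      end do
!     With ATOMIC each update is indivisible, so x is always 1000;
!     without it, updates could be lost as on the slide above
      print *, 'x = ', x
      end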

63 False Sharing
(diagram: two processors each hold a copy of the same cache line, with its address tag, in their caches)

!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      enddo

Say A(1-5) starts on a cache line; then some of A(6-10) will be on that first cache line, so it won't be accessible until the first thread has finished with it (see the sketch below).
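
A common workaround (a sketch, not from the original slides) is to hand each thread a contiguous chunk that covers whole cache lines: with 128-byte lines and real*8 data, chunks of 16 iterations keep each thread's writes on its own lines, except possibly at chunk boundaries.

      program avoid_false_sharing
      implicit none
      integer n, i
      parameter ( n = 20 )
      real*8 a(n)
!     A 128-byte cache line holds 16 real*8 elements, so chunks of 16
!     iterations keep each thread's writes on separate lines
!     (assuming a is cache-line aligned; padding can help ensure this)
!$OMP PARALLEL DO SCHEDULE(STATIC,16)
      do i = 1, n
         a(i) = dble(i)
      end do
      print *, 'a(n) = ', a(n)
      end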

64 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

65 Why Hybrid MPI-OpenMP?
To optimize performance on “mixed-mode” hardware like the SP
MPI is used for “inter-node” communication, and OpenMP is used for “intra-node” communication
–threads have lower latency
–threads can alleviate the network contention of a pure MPI implementation

66 Hybrid MPI-OpenMP?
Unless you are forced against your will, for the hybrid model to be worthwhile:
–There has to be obvious parallelism to exploit
–The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
–It has to promise to perform at least as well as the equivalent all-MPI program
Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
–especially true for applications with a single level of parallelism

67 Hybrid Scenario
Thread the computational portions of the code that exist between MPI calls (see the sketch below)
MPI calls are “single-threaded” and therefore use only a single CPU
Assumes:
–the application has two natural levels of parallelism
–or that, when breaking up an MPI code with one level of parallelism, communication between the resulting threads is little or none
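
A minimal sketch of this scenario (not from the original slides): the reduction loop between the MPI calls is threaded with OpenMP, while the MPI calls themselves are made outside any parallel region, i.e. by a single thread per process.

      program hybrid
      implicit none
      include "mpif.h"
      integer my_id, nprocs, ierr, i
      real*8 local_sum, global_sum
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, nprocs, ierr )
      local_sum = 0.0d0
!     Computational portion between MPI calls: threaded with OpenMP
!$OMP PARALLEL DO REDUCTION(+:local_sum)
      do i = 1, 1000000
         local_sum = local_sum + dble ( i )
      end do
!     MPI call made by the master thread, outside the parallel region
      call MPI_REDUCE ( local_sum, global_sum, 1,
     &                  MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     &                  MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' global sum = ', global_sum
      call MPI_FINALIZE ( ierr )
      end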

68 Programming the IBM Power3 SP
History and future of POWER chip
Uni-processor optimization
Description of ACRL’s IBM SP
Parallel Processing
–MPI
–OpenMP
Hybrid MPI/OpenMP
MPI-I/O (one slide)

69 MPI-IO
Part of MPI-2
Resulted from work at IBM Research exploring the analogy between I/O and message passing
See “Using MPI-2”, by Gropp et al. (MIT Press)
(diagram: data moves between the memory of several processes and a shared file; see the sketch below)
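
A minimal MPI-IO sketch (not from the original slides; the file name and sizes are made up): every process opens the same file collectively and writes its own block of reals at a byte offset computed from its rank.

      program io_example
      implicit none
      include "mpif.h"
      integer my_id, ierr, fh, nx, i
      parameter ( nx = 100 )
      integer (kind=MPI_OFFSET_KIND) disp
      real a ( nx )
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      do i = 1, nx
         a(i) = real ( my_id )
      end do
!     All processes open the same file collectively
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'out.dat',
     &     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )
!     Each process writes its block at its own byte offset
      disp = my_id * nx * 4
      call MPI_FILE_WRITE_AT ( fh, disp, a, nx, MPI_REAL,
     &                         MPI_STATUS_IGNORE, ierr )
      call MPI_FILE_CLOSE ( fh, ierr )
      call MPI_FINALIZE ( ierr )
      end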

70 Conclusion
Don’t forget uni-processor optimization
If you choose one parallel programming API, choose MPI
Mixed MPI-OpenMP may be appropriate in certain cases
–More work needed here
Remote memory access model may be the answer

