Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB

Advanced Computational Research Laboratory: High Performance Computational Problem-Solving and Visualization Environment. Computational experiments in multiple disciplines: CS, Science and Engineering. 16-processor IBM SP3. Member of C3.ca Association, Inc.

Advanced Computational Research Laboratory Virendra Bhavsar, Director Eric Aubanel, Research Associate & Scientific Computing Support Sean Seeley, System Administrator

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

POWER chip: 1990 to 2003 –Performance Optimized With Enhanced RISC –Reduced Instruction Set Computer –Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 x the clock rate in MHz –Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

POWER chip: 1990 to 2003 1993: SP1 –IBM’s first SP (scalable POWERparallel) –Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network –Parallel Environment & system software

POWER chip: 1990 to 2003 1993: POWER2 –2 FMAs –Increased data cache size –66.5 MHz (254 MFLOPS) –Improved instruction set (incl. hardware square root) –SP2: POWER2 + higher-bandwidth switch for larger systems

POWER chip: 1990 to 2003 POWERPC: support for SMP 1996: P2SC (POWER2 Super Chip): clock speeds up to 160 MHz

POWER chip: 1990 to 2003 Feb. ‘99: POWER3 –Combined P2SC & POWERPC –64-bit architecture –Initially 2-way SMP, 200 MHz –Cache improvements, including an L2 cache –Instruction & data prefetch

POWER3+ chip: Feb. 2000
Winterhawk II: 375 MHz, 4-way SMP; 2 MULT/ADD (FMA) units per processor (1500 MFLOPS); 64 KB Level 1 cache (3.2 GB/sec); 8 MB Level 2 cache (6.4 GB/sec); 1.6 GB/s memory bandwidth; 6 GFLOPS/node
Nighthawk II: 375 MHz, 16-way SMP; 2 MULT/ADD (FMA) units per processor (1500 MFLOPS); 64 KB Level 1 cache (3.2 GB/sec); 8 MB Level 2 cache (6.4 GB/sec); 14 GB/s memory bandwidth; 24 GFLOPS/node

The Clustered SMP ACRL’s SP: four 4-way SMP nodes Each node has its own copy of the O/S Processors on the same node are “closer” (faster to communicate with) than processors on different nodes

Power3 Architecture

POWER4 32-way logical-UMA SP High Node L3 cache shared between all processors on the node: 32 MB Up to 32 GB main memory Each processor: 1.1 GHz 140 GFLOPS total peak

Going to NUMA: NUMA configurations with up to 256 processors; teraflops of peak performance

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

Uni-processor Optimization Compiler options: –start with -O3 -qstrict, then -O3 -qarch=pwr3 Cache re-use Take advantage of the superscalar architecture –give it enough operations per load/store Use ESSL - its routines are already fully optimized
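
As an illustration of “enough operations per load/store” (a sketch, not from the original slides): a dot product unrolled by four keeps the FMA pipelines busy with independent partial sums. The routine name and unroll factor are arbitrary, and -O3 will often perform this transformation by itself.

real*8 function ddot4(x, y, n)
implicit none
integer n, i, nr
real*8 x(n), y(n), s0, s1, s2, s3
s0 = 0.d0
s1 = 0.d0
s2 = 0.d0
s3 = 0.d0
nr = n - mod(n,4)
! unrolled by 4: four independent multiply-add chains per trip
do i = 1, nr, 4
   s0 = s0 + x(i  )*y(i  )
   s1 = s1 + x(i+1)*y(i+1)
   s2 = s2 + x(i+2)*y(i+2)
   s3 = s3 + x(i+3)*y(i+3)
end do
! remainder loop for the last few elements
do i = nr+1, n
   s0 = s0 + x(i)*y(i)
end do
ddot4 = (s0+s1) + (s2+s3)
end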

Memory Access Times

Cache 128-byte cache line L2 cache: 8 MB total, 4-way set-associative (2 MB per way) L1 data cache: 64 KB, 128-way set-associative
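
A hedged illustration of cache re-use (array size chosen arbitrarily): Fortran stores arrays column-major, so running the first index innermost walks through each 128-byte line (16 real*8 values) before moving on, while the reversed order touches a new line on almost every reference.

program loop_order
implicit none
integer, parameter :: n = 512
real*8 :: a(n,n), b(n,n)
integer :: i, j
b = 1.d0
! cache-friendly: unit stride, one new 128-byte line per 16 elements
do j = 1, n
   do i = 1, n
      a(i,j) = 2.d0*b(i,j)
   end do
end do
! cache-hostile: stride of n*8 bytes, a new line on nearly every access
do i = 1, n
   do j = 1, n
      a(i,j) = 2.d0*b(i,j)
   end do
end do
print *, a(n,n)
end program loop_order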

How to Monitor Performance? IBM’s hardware monitor: HPMCOUNT –Uses hardware counters on chip –Cache & TLB misses, fp ops, load-stores, … –Beta version –Available soon on ACRL’s SP

HPMCOUNT sample output

real*8 a(256,256), b(256,256), c(256,256)
common a, b, c
do j = 1, 256
   do i = 1, 256
      a(i,j) = b(i,j) + c(i,j)
   end do
end do
end

PM_TLB_MISS (TLB misses) :
Average number of loads per TLB miss :
Total loads and stores : M
Instructions per load/store :
Cycles per instruction :
Instructions per cycle :
Total floating point operations : M
Hardware floating point rate : Mflop/sec

HPMCOUNT sample output

real*8 a(257,256), b(257,256), c(257,256)
common a, b, c
do j = 1, 256
   do i = 1, 257
      a(i,j) = b(i,j) + c(i,j)
   end do
end do
end

PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss :
Total loads and stores : M
Instructions per load/store :
Cycles per instruction :
Instructions per cycle :
Total floating point operations : M
Hardware floating point rate : Mflop/sec

Padding the leading dimension from 256 to 257 breaks the power-of-two stride between the three arrays, so corresponding columns no longer collide in the TLB and caches, and the TLB miss count drops sharply.

ESSL Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers Fast! –560x560 real*8 matrix multiply Hand coding: 19 Mflops dgemm: 1.2 GFlops Parallel (threaded and distributed) versions
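
For reference, a hedged sketch of the 560x560 case above, assuming the BLAS-compatible DGEMM interface that ESSL provides (link with -lessl); the matrix contents and program name are illustrative.

program essl_matmul
implicit none
integer, parameter :: n = 560
real*8 :: a(n,n), b(n,n), c(n,n)
a = 1.d0
b = 2.d0
c = 0.d0
! c := 1.0*a*b + 0.0*c
call dgemm('N', 'N', n, n, n, 1.d0, a, n, b, n, 0.d0, c, n)
print *, 'c(1,1) =', c(1,1)   ! expect 2*560 = 1120
end program essl_matmul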

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

ACRL’s IBM SP 4 Winterhawk II nodes –16 processors Each node has: –1 GB RAM –9 GB (mirrored) disk –Switch adapter High Performance Switch Gigabit Ethernet (on 1 node) Control workstation Disk: SSA tower with GB disks

IBM Power3 SP Switch Bidirectional multistage interconnection network (MIN) 300 MB/sec bi-directional bandwidth 1.2 μsec latency

General Parallel File System (diagram: Nodes 2, 3 and 4 each run an application over a GPFS client on RVSD/VSD; Node 1 runs an application over the GPFS server on RVSD/VSD; all four nodes are connected by the SP Switch)

ACRL Software Operating System: AIX Compilers –IBM XL Fortran 7.1 (HPF not yet installed) –VisualAge C for AIX –VisualAge C++ Professional for AIX –IBM VisualAge Java (not yet installed) Job Scheduler: LoadLeveler 2.2 Parallel Programming Tools –IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2) Visualization: OpenDX (not yet installed) E-Commerce software (not yet installed)

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

Why Parallel Computing? Solve large problems in reasonable time Many algorithms are inherently parallel –image processing, Monte Carlo –simulations (e.g. CFD) High-performance computers have parallel architectures –Commercial off-the-shelf (COTS) components Beowulf clusters SMP nodes –Improvements in network technology

NRL Layered Ocean Model at Naval Research Laboratory IBM Winterhawk II SP

Parallel Computational Models Data Parallelism –Parallel program looks like a serial program; the parallelism is in the data –Vector processors –HPF

Parallel Computational Models Message Passing (MPI) –Processes have only local memory but can communicate with other processes by sending & receiving messages –Data transfer between processes requires operations to be performed by both processes –Communication network (hypercube, torus, …) not part of the computational model (diagram: Send on one process paired with Receive on another)

Parallel Computational Models Shared Memory (threads) –P(osix) threads –OpenMP: a higher-level standard (diagram: several processes sharing a single address space)

Parallel Computational Models Remote Memory Operations –“One-sided” communication MPI-2, IBM’s LAPI –One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory (diagram: Put and Get operations between two processes)
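
A hedged sketch of the one-sided idea in standard MPI-2 syntax (whether the installed Parallel Environment accepts these calls should be checked); run with exactly two processes, and the buffer names are illustrative.

program one_sided
implicit none
include 'mpif.h'
integer :: my_id, win, ierr
integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
real*8 :: buf, remote
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
remote = -1.d0
winsize = 8                  ! expose one real*8 on every process
call MPI_WIN_CREATE(remote, winsize, 8, MPI_INFO_NULL, &
                    MPI_COMM_WORLD, win, ierr)
call MPI_WIN_FENCE(0, win, ierr)
if (my_id == 0) then
   buf = 42.d0
   disp = 0
   ! put into process 1's window without its participation
   call MPI_PUT(buf, 1, MPI_DOUBLE_PRECISION, 1, disp, 1, &
                MPI_DOUBLE_PRECISION, win, ierr)
end if
call MPI_WIN_FENCE(0, win, ierr)
if (my_id == 1) print *, 'process 1 now has', remote
call MPI_WIN_FREE(win, ierr)
call MPI_FINALIZE(ierr)
end program one_sided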

Parallel Computational Models Combined: Message Passing & Threads –Driven by clusters of SMPs –Leads to software complexity! (diagram: several SMP nodes, each with its own address space and processes, connected by a network)

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

Message Passing Interface MPI 1.0 standard in 1994 MPI 1.1 in 1995 (the version IBM supports) MPI 2.0 in 1997 –Includes 1.1 but adds new features: MPI-IO, one-sided communication, dynamic processes

Advantages of MPI Universality Expressivity –Well suited to formulating a parallel algorithm Ease of debugging –Memory is local Performance –Explicit association of data with process allows good use of cache

MPI Functionality Several modes of point-to-point message passing –blocking (e.g. MPI_SEND) –non-blocking (e.g. MPI_ISEND) –synchronous (e.g. MPI_SSEND) –buffered (e.g. MPI_BSEND) Collective communication and synchronization –e.g. MPI_REDUCE, MPI_BARRIER User-defined datatypes Logically distinct communicator spaces Application-level or virtual topologies
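
To complement the point-to-point examples on the following slides, a hedged sketch of a collective call (variable names are illustrative): each process contributes its rank and rank 0 receives the sum.

program reduce_example
implicit none
include 'mpif.h'
integer :: my_id, numb_of_procs, total, ierr
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numb_of_procs, ierr)
! every process participates; only the root (rank 0) gets the result
call MPI_REDUCE(my_id, total, 1, MPI_INTEGER, MPI_SUM, 0, &
                MPI_COMM_WORLD, ierr)
if (my_id == 0) print *, 'sum of ranks =', total
call MPI_FINALIZE(ierr)
end program reduce_example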

Simple MPI Example (diagram: two processes, My_Id = 0 and 1; process 0 prints “This is from MPI process number 0”, the other prints “This is from MPI processes other than 0”)

Simple MPI Example

Program Trivial
implicit none
include "mpif.h"            ! MPI header file
integer My_Id, Numb_of_Procs, Ierr
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
if ( My_Id .eq. 0 ) then
   print *, ' This is from MPI process number ', My_Id
else
   print *, ' This is from MPI processes other than 0 ', My_Id
end if
call MPI_FINALIZE ( ierr )  ! bad things happen if you forget ierr
stop
end

MPI Example with send/recv (diagram: processes 0 and 1 each send an array to, and receive an array from, the other)

MPI Example with send/recv

Program Simple
implicit none
include "mpif.h"
integer My_Id, Other_Id, Nx, Ierr
integer Status ( MPI_STATUS_SIZE )
parameter ( Nx = 100 )
real A ( Nx ), B ( Nx )
call MPI_INIT ( Ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
Other_Id = Mod ( My_Id + 1, 2 )
A = My_Id
call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Status, Ierr )
call MPI_FINALIZE ( Ierr )
stop
end

What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);

MPI Message Passing Modes Four send modes (Ready, Standard, Synchronous, Buffered) map onto three underlying protocols (Eager, Rendezvous, Buffered). A Standard send uses the Eager protocol for messages <= the eager limit and the Rendezvous protocol for messages > the eager limit. The default eager limit on the SP is 4 KB (it can be raised to 64 KB).
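
So the send-then-receive exchange on the previous slide happens to complete only while the messages fit under the eager limit. A hedged sketch of a pattern that is safe at any message size uses MPI_SENDRECV (buffer names follow the earlier Program Simple; run with two processes).

program safe_exchange
implicit none
include 'mpif.h'
integer, parameter :: Nx = 100
integer :: My_Id, Other_Id, Ierr
integer :: Status(MPI_STATUS_SIZE)
real :: A(Nx), B(Nx)
call MPI_INIT(Ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, My_Id, Ierr)
Other_Id = mod(My_Id + 1, 2)
A = My_Id
! the combined send+receive cannot deadlock, whatever the message size
call MPI_SENDRECV(A, Nx, MPI_REAL, Other_Id, My_Id, &
                  B, Nx, MPI_REAL, Other_Id, Other_Id, &
                  MPI_COMM_WORLD, Status, Ierr)
call MPI_FINALIZE(Ierr)
end program safe_exchange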

MPI Performance Visualization ParaGraph –Developed by University of Illinois –Graphical display system for visualizing behaviour and performance of MPI programs

Message Passing on SMP (diagram: MPI_SEND copies the data to send into a buffer, either through shared memory or via the crossbar/switch, and MPI_RECV copies it out as the received data) export MP_SHARED_MEMORY=yes|no

Shared Memory MPI: effect of MP_SHARED_MEMORY on latency (μsec) and bandwidth (MB/sec) –between 2 nodes: –same node, MP_SHARED_MEMORY=no: 30 μsec, 80 MB/sec –same node, MP_SHARED_MEMORY=yes: 10 μsec, 270 MB/sec

Message Passing off Node MPI across all the processors: many more messages going through the switch fabric

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

OpenMP 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms. OpenMP parallelism is specified through compiler directives which are embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.

OpenMP All processors can access all the memory in the parallel system Parallel execution is achieved by generating threads which execute in parallel The overhead for SMP parallelization is large (on the order of μsec): the parallel work construct must be large enough to overcome the overhead

OpenMP 1.All OpenMP programs begin as a single process: the master thread 2.FORK: the master thread then creates a team of parallel threads 3.Parallel region statements executed in parallel among the various team threads 4.JOIN: threads synchronize and terminate, leaving only the master thread
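
A minimal hedged example of this fork/join pattern (not from the original slides); the number of threads depends on the environment, and the two library functions are declared directly rather than through a module.

program fork_join
implicit none
integer :: omp_get_thread_num, omp_get_num_threads
! FORK: the master thread creates a team; each member runs the region
!$OMP PARALLEL
print *, 'hello from thread', omp_get_thread_num(), &
         'of', omp_get_num_threads()
!$OMP END PARALLEL
! JOIN: only the master thread continues past this point
end program fork_join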

OpenMP How is OpenMP typically used? OpenMP is usually used to parallelize loops: –Find your most time consuming loops. –Split them up between threads. Better scaling can be obtained using OpenMP parallel regions, but can be tricky!

OpenMP Loop Parallelization

!$OMP PARALLEL DO
do i = 0, ilong
   do k = 1, kshort
      ...
   end do
end do

#pragma omp parallel for
for (i = 0; i <= ilong; i++)
   for (k = 1; k <= kshort; k++) {
      ...
   }

Variable Scoping The most difficult part of shared-memory parallelization –What memory is shared –What memory is private: each thread has its own copy Compare MPI: all variables are private Variables are shared by default, except loop indices; scalars that are set and then used inside the loop must be declared private
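
A hedged sketch of explicit scoping (loop bounds and array names are arbitrary): tmp is set and then used inside the loop, so it must be PRIVATE; a and b stay SHARED.

program scoping
implicit none
integer, parameter :: n = 1000
real*8 :: a(n,3), b(n), tmp
integer :: i, k
b = 1.d0
!$OMP PARALLEL DO PRIVATE(i, k, tmp) SHARED(a, b)
do i = 1, n
   tmp = 2.d0*b(i)           ! private: each thread has its own copy
   do k = 1, 3
      a(i,k) = tmp + k
   end do
end do
!$OMP END PARALLEL DO
print *, a(1,1), a(n,3)
end program scoping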

How Does Sharing Work?

Thread 1: increment(x) { x = x + 1; } compiles to:
   LOAD A, (x address)
   ADD A, 1
   STORE A, (x address)

Thread 2: increment(x) { x = x + 1; } compiles to:
   LOAD A, (x address)
   ADD A, 1
   STORE A, (x address)

With x shared and initially 0, the result could be 1 or 2: synchronization is needed.
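
One hedged way to fix the race above is an atomic update, which makes each read-modify-write indivisible (a critical section or a reduction would also work); the loop count is arbitrary.

program fix_race
implicit none
integer :: x, i
x = 0
!$OMP PARALLEL DO
do i = 1, 1000
!$OMP ATOMIC
   x = x + 1                 ! the increment is now indivisible
end do
!$OMP END PARALLEL DO
print *, 'x =', x            ! always 1000; without ATOMIC it may be less
end program fix_race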

False Sharing (diagram: two processors each caching the same block, with its cache line and address tag) Say A(1:5) starts at the beginning of a cache line; then some of A(6:10) will sit on that same first cache line, so the second thread cannot write its elements until the first thread has given up the line.

!$OMP PARALLEL DO
do I = 1, 20
   A(I) = ...
end do
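
One hedged way to avoid this, assuming real*8 elements and the 128-byte line described earlier, is to hand each thread chunks of 16 elements so that (if A starts on a line boundary) no two threads write the same cache line; padding the array or using larger chunks works as well.

program avoid_false_sharing
implicit none
integer, parameter :: n = 20
real*8 :: A(n)
integer :: I
! 16 real*8 = 128 bytes = one cache line per chunk
!$OMP PARALLEL DO SCHEDULE(STATIC, 16)
do I = 1, n
   A(I) = dble(I)
end do
!$OMP END PARALLEL DO
print *, A(1), A(n)
end program avoid_false_sharing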

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

Why Hybrid MPI-OpenMP? To optimize performance on “mixed-mode” hardware like the SP: MPI is used for “inter-node” communication, and OpenMP for “intra-node” communication –threads have lower latency –threads can alleviate the network contention of a pure MPI implementation

Hybrid MPI-OpenMP? Unless you are forced into it, for the hybrid model to be worthwhile: –There has to be obvious parallelism to exploit –The code has to be easy to program and maintain (it is easy to write bad OpenMP code) –It has to promise to perform at least as well as the equivalent all-MPI program Experience has shown that converting working MPI code to a hybrid model rarely results in better performance –especially true for applications with a single level of parallelism

Hybrid Scenario Thread the computational portions of the code that lie between MPI calls MPI calls are “single-threaded” and therefore use only a single CPU Assumes: –the application has two natural levels of parallelism –or that, when breaking up an MPI code with one level of parallelism, there is little or no communication between the resulting threads
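
A hedged skeleton of this scenario (problem size and the reduction are chosen arbitrarily): the compute loop between MPI calls is threaded with OpenMP, and the MPI calls themselves are made by the master thread outside any parallel region.

program hybrid_sketch
implicit none
include 'mpif.h'
integer, parameter :: n = 100000
real*8 :: a(n), local_sum, global_sum
integer :: i, my_id, ierr
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
a = dble(my_id + 1)
local_sum = 0.d0
! threaded compute section between MPI calls
!$OMP PARALLEL DO REDUCTION(+:local_sum)
do i = 1, n
   local_sum = local_sum + a(i)*a(i)
end do
!$OMP END PARALLEL DO
! single-threaded MPI call, made outside the parallel region
call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
if (my_id == 0) print *, 'global sum =', global_sum
call MPI_FINALIZE(ierr)
end program hybrid_sketch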

Programming the IBM Power3 SP History and future of POWER chip Uni-processor optimization Description of ACRL’s IBM SP Parallel Processing –MPI –OpenMP Hybrid MPI/OpenMP MPI-I/O (one slide)

MPI-IO Part of MPI-2 Resulted from work at IBM Research exploring the analogy between I/O and message passing See “Using MPI-2”, by Gropp et al. (MIT Press) (diagram: data moves between the memory of several processes and a single shared file)
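
A hedged sketch of the idea (file name, block size and offsets are illustrative, and explicit offsets are only one of several MPI-IO access styles): each process writes its own contiguous block of a single shared file.

program mpiio_sketch
implicit none
include 'mpif.h'
integer, parameter :: n = 1000
real*8 :: buf(n)
integer :: my_id, fh, ierr
integer(kind=MPI_OFFSET_KIND) :: offset
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
buf = dble(my_id)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.out', &
                   IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), &
                   MPI_INFO_NULL, fh, ierr)
offset = my_id * n * 8          ! 8 bytes per real*8
! each rank writes n values at its own offset in the shared file
call MPI_FILE_WRITE_AT(fh, offset, buf, n, MPI_DOUBLE_PRECISION, &
                       MPI_STATUS_IGNORE, ierr)
call MPI_FILE_CLOSE(fh, ierr)
call MPI_FINALIZE(ierr)
end program mpiio_sketch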

Conclusion Don’t forget uni-processor optimization If you choose one parallel programming API, choose MPI Mixed MPI-OpenMP may be appropriate in certain cases –More work needed here Remote memory access model may be the answer