
NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

Slide 1: Scaling Up User Codes on the SP
David Skinner, NERSC Division, Berkeley Lab

Slide 2: Motivation
NERSC's focus is on capability computation.
– Capability == jobs that use ¼ or more of the machine's resources
Scientists whose work involves large-scale computation or HPC should keep ahead of workstation-sized problems.
"Big Science" problems are more interesting!

Slide 3: Challenges
CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.
Vendors often have machines less than half the size of NERSC's machines, so system software may be operating in uncharted regimes:
– MPI implementation
– Filesystem metadata systems
– Batch queue system
Users need information on how to mitigate the impact of these issues for large-concurrency applications.

Slide 4: Seaborg.nersc.gov

MP_EUIDEVICE (switch fabric)   MPI Bandwidth (MB/sec)      MPI Latency (usec)
css0                           500 / 350                   8 / 16
css1
csss                           500 / 350 (single task)     8 / 16

Slide 5: Switch Adapter Performance
[Chart comparing MPI performance through the csss and css0 adapters]

Slide 6: Switch considerations
For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio); see the communicator sketch after this slide.
Use MP_SHAREDMEMORY to minimize switch traffic.
csss is most often the best route to the switch.
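A minimal sketch of grouping tasks by SMP node so a decomposition can follow node boundaries, assuming an MPI-1-era library (no MPI_Comm_split_type); the split color comes from a simple hash of the processor name, which is only exact when distinct hostnames hash to distinct values:

/* Sketch: group MPI tasks by SMP node so a domain decomposition can be
 * aligned with node boundaries. */
#include <mpi.h>
#include <stdio.h>

static int hostname_color(const char *name)
{
    /* Simple string hash; a collision would merge two nodes into one group. */
    unsigned int h = 5381;
    for (; *name; name++)
        h = h * 33u + (unsigned char)*name;
    return (int)(h & 0x7fffffff);
}

int main(int argc, char **argv)
{
    int world_rank, node_rank, node_size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(name, &len);

    /* Tasks on the same SMP node get the same color and land in one
     * communicator; intra-node exchanges then stay in shared memory. */
    MPI_Comm_split(MPI_COMM_WORLD, hostname_color(name), world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    printf("world rank %d is local rank %d of %d on %s\n",
           world_rank, node_rank, node_size, name);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}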

Slide 7: Synchronization
On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks.
A fully synchronizing MPI call requires everyone's attention.
By analogy, imagine trying to go to lunch with 1024 people.
The probability that everyone is ready at any given time scales poorly (a rough estimate follows this slide).
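A rough illustration, under the simplifying assumption that each task is independently ready to enter the collective at a given moment with probability p:

P(all N tasks ready) = p^N

Even with p = 0.99 per task, at N = 1024 this gives 0.99^1024 = exp(1024 ln 0.99) ≈ exp(-10.3) ≈ 3e-5, which is why fully synchronizing operations become painful at scale.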

Slide 8: Synchronization (continued)
MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above.
Use MPI_Bcast if possible; it is not fully synchronizing (see the sketch after this slide).
Remove unneeded MPI_Barrier calls.
Use asynchronous I/O when possible.
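A minimal sketch of the broadcast pattern: task 0 settles a few shared run parameters (the values here are purely illustrative) and MPI_Bcast delivers them, in place of a fully synchronizing collective such as MPI_Allreduce:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double params[3];   /* e.g. timestep, tolerance, box length (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Only one task reads or computes the shared inputs ... */
        params[0] = 1.0e-3;
        params[1] = 1.0e-8;
        params[2] = 64.0;
    }

    /* ... and a broadcast delivers them.  Non-root tasks wait only for the
     * root, not for every other task, so the call is less sensitive to
     * stragglers than MPI_Allreduce or MPI_Alltoall at high task counts. */
    MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* No MPI_Barrier is needed here: the broadcast itself orders the data. */
    MPI_Finalize();
    return 0;
}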

Slide 9: Load Balance
If one task lags the others in time to complete, synchronization suffers; e.g. a 3% slowdown in one task can mean a 50% slowdown for the code overall.
Seek out and eliminate sources of variation.
Distribute the problem uniformly among nodes/CPUs (a simple block decomposition is sketched after this slide).
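A small sketch of an even block decomposition, with nothing machine-specific assumed: N items over P tasks, where the first N mod P tasks take one extra item so counts differ by at most one:

/* Even block decomposition of n items over p tasks. */
#include <stdio.h>

static void block_range(long n, int p, int rank, long *first, long *count)
{
    long base = n / p;          /* every task gets at least this many */
    long extra = n % p;         /* the first 'extra' tasks get one more */

    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}

int main(void)
{
    long first, count;
    int p = 16, rank;

    /* Example: 1000 items over 16 tasks -> counts of 63 or 62. */
    for (rank = 0; rank < p; rank++) {
        block_range(1000, p, rank, &first, &count);
        printf("rank %2d: items [%ld, %ld)\n", rank, first, first + count);
    }
    return 0;
}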

Slide 10: Alternatives to MPI
CHARM++ and NAMD
– Spatially decomposed molecular dynamics with periodic load balancing; the data decomposition is adaptive
AMPI
– An automatic approach to load balancing
BlueGene/L-type machines with > 10K CPUs will need to re-examine these issues altogether.

Slide 11: Improving MPI Scaling on Seaborg

Slide 12: The SP switch
Use MP_SHAREDMEMORY=yes (default).
Use MP_EUIDEVICE=csss for 32-bit applications (default).
Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks (see the sanity-check sketch after this slide).
– MPI and LAPI versions available
– Host lists are useful in general
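A small sanity-check sketch, assuming (as is usual with POE) that the MP_* settings appear in each task's environment: task 0 echoes the adapter and shared-memory settings so the job output records what the run actually used:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const char *dev, *shm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* getenv() returns NULL when a variable is not set. */
        dev = getenv("MP_EUIDEVICE");
        shm = getenv("MP_SHAREDMEMORY");
        printf("MP_EUIDEVICE=%s MP_SHAREDMEMORY=%s\n",
               dev ? dev : "(unset)", shm ? shm : "(unset)");
    }

    MPI_Finalize();
    return 0;
}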

Slide 13: 32-bit MPI
32-bit MPI has inconvenient memory limits:
– 256 MB per task default and 2 GB maximum
– 1.7 GB can be used in practice, but this depends on MPI usage
– The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory "stolen" by MPI's internal buffers and pipes
64-bit MPI removes these barriers:
– But it must run on css0 only, with less switch bandwidth
Seaborg has 16, 32, and 64 GB per node available.

Slide 14: 64-bit MPI Howto
At compile time:
* module load mpi64
* compile with the "-q64" option using mpcc_r, mpxlf_r, or mpxlf90_r
At run time:
* module load mpi64
* use "network.MPI = css0,us,shared" in your job scripts; the multilink adapter "csss" is not currently supported
* run your POE code as you normally would

Slide 15: MP_LABELIO, phost
Labeled I/O will let you know which task generated the message "segmentation fault", gave the wrong answer, etc.
export MP_LABELIO=yes
Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks.
– MPI and LAPI versions available
– Host lists are useful in general

Slide 16: Core files
Core dumps don't scale (no parallel work).
MP_COREDIR=/dev/null -> no core file I/O
MP_COREFILE_FORMAT=light_core -> less I/O
LL script fragment to save just one full-fledged core file and throw the others away:
…
if [ "$MP_CHILD" -ne 0 ]; then
  export MP_COREDIR=/dev/null
fi
…

Slide 17: Debugging
In general, debugging at 512 tasks and above is error-prone and cumbersome; debug at a smaller scale when possible.
Use the shared-memory device of MPICH on a workstation with lots of memory to simulate 1024 CPUs.
For crashed jobs, examine the LL logs for memory-usage history.

Slide 18: Parallel I/O
Parallel I/O can be a significant source of variation in task completion prior to synchronization.
Limit the number of readers or writers when appropriate (a one-writer sketch follows this slide).
Pay attention to file creation rates.
Output reduced quantities when possible.
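A minimal one-writer sketch: local results are gathered to task 0, which does all of the file I/O; the buffer size and file name are illustrative:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1024   /* illustrative per-task result size */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    double local[NLOCAL], *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* ... fill local[] with this task's results (dummy values here) ... */
    for (i = 0; i < NLOCAL; i++)
        local[i] = rank + 0.001 * i;

    if (rank == 0)
        all = malloc((size_t)nprocs * NLOCAL * sizeof(double));

    /* One gather, one writer: nprocs tasks do not open nprocs files,
     * which keeps metadata traffic and file creation rates down. */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               all, NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("results.dat", "wb");   /* illustrative file name */
        if (fp) {
            fwrite(all, sizeof(double), (size_t)nprocs * NLOCAL, fp);
            fclose(fp);
        }
        free(all);
    }

    MPI_Finalize();
    return 0;
}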

Slide 19: OpenMP
Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off the MPI implementation; e.g. on Seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads.
Having hybrid code whose concurrencies can be tuned between MPI tasks and OpenMP threads has portability advantages (a minimal hybrid sketch follows this slide).
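A minimal hybrid sketch; the MPI task count and OMP_NUM_THREADS are left to the job script (e.g. 128 tasks with 16 threads each for a 2048-way run), and the loop body is purely illustrative:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* illustrative per-task problem size */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    double sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Coarse parallelism across MPI tasks, fine parallelism across the
     * OpenMP threads inside each SMP node.  Fewer MPI tasks means fewer
     * internal MPI buffers and less pressure on the switch. */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += 1.0 / (1.0 + (double)i + (double)rank * N);

    /* MPI is called only outside the threaded region. */
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d MPI tasks x %d threads, total = %f\n",
               nprocs, omp_get_max_threads(), total);

    MPI_Finalize();
    return 0;
}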

Slide 20: Summary
Resources are in place to face the challenges posed by scaling up MPI applications on Seaborg.
Scientists should expand their problem scopes to tackle increasingly challenging computational problems.
NERSC consultants can provide help in achieving scaling goals.
