NUMA Control for Hybrid Applications Kent Milfeld TACC May 5, 2015

Parallel Paradigms
Distributed and shared memory are the two parallel paradigms in HPC:
– MPI addresses data movement in distributed memory (between processes, i.e., executables).
– OpenMP addresses data access in shared memory (among threads within an executable).
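The distinction shows up already at launch time. A minimal sketch, assuming a generic mpirun launcher and an executable named ./a.out (both illustrative, not from the slides):

  # MPI: parallelism comes from multiple processes, each a copy of the executable
  mpirun -np 4 ./a.out

  # OpenMP: parallelism comes from threads inside a single process
  export OMP_NUM_THREADS=4
  ./a.out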

Hybrid Program Model
– Start with the special MPI initialization (e.g., MPI_Init_thread).
– Create OpenMP parallel regions within each MPI task (process).
– Serial regions are executed by the master thread, i.e., the MPI task.
– The MPI rank is known to all threads.
– Call the MPI library in serial or parallel regions.
– Finalize MPI and end the program.

Hybrid Applications
Typical definition of a hybrid application:
– Uses both message passing (MPI) and a form of shared-memory parallelism (OpenMP), e.g., the MPI task serves as a container for OpenMP threads.
– Runs on multicore systems.
Hybrid execution does not guarantee optimal performance:
– Multicore systems have a multilayered, complex memory architecture.
– Actual performance is heavily application dependent.
Non-Uniform Memory Access (NUMA):
– Shared memory with multiple underlying levels.
– Different access latencies for different levels.
– Complicated by asymmetries in multisocket, multicore systems.
– Puts more responsibility on the programmer to make the application efficient.

Modes of Hybrid Operation (on a 16-core node):
– Pure MPI: 16 MPI tasks, one task per core.
– 1 MPI task with 16 threads/task.
– 2 MPI tasks with 8 threads/task.
[Figure: node diagrams for each mode, marking the core of each MPI task and the master/slave threads within each task.]
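A minimal sketch of how these modes might be launched on a 16-core node, assuming a generic mpirun launcher (the Stampede-specific ibrun syntax appears in the batch-script slides below):

  # Pure MPI: 16 tasks, one per core
  mpirun -np 16 ./a.out

  # 1 MPI task, 16 threads/task
  export OMP_NUM_THREADS=16
  mpirun -np 1 ./a.out

  # 2 MPI tasks, 8 threads/task
  export OMP_NUM_THREADS=8
  mpirun -np 2 ./a.out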

Needs for NUMA Control
An asymmetric multicore configuration on a node requires better control of core affinity and memory policy.
– Load-balancing issues on the node: the slowest CPU/core may limit overall performance, so either use only balanced nodes or employ special in-code load-balancing measures.
Application performance can be enhanced by a specific arrangement of
– tasks (process affinity)
– memory allocation (memory policy)
[Figure: Stampede compute node with two sockets, 0 and 1.]

NUMA Operations
Each process/thread is executed by a core and has access to a certain memory space.
– The core is assigned by the process affinity.
– The memory allocation is assigned by the memory policy.
NUMA operations control the process affinity and memory policy. By default, NUMA control is managed by the kernel; the default settings can be overridden with numactl.
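Before overriding the kernel defaults, it helps to see what they are. Both commands below are standard numactl invocations, shown here as an illustration:

  # show the node's NUMA topology: sockets, their cores, and their memory
  numactl --hardware

  # show the affinity and memory policy the current shell would hand to a child process
  numactl --show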

NUMA Operations
Ways process affinity and memory policy can be managed (two are sketched below):
– dynamically, on a running process (knowing its process id)
– at process execution, with a wrapper command
– within the program, through the F90/C API
Users can alter the kernel policies by setting the process affinity and memory policy themselves:
– users can assign their own processes to specific cores
– avoid overlapping multiple processes on the same cores
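Two of these approaches, sketched with standard Linux tools (<pid> and the core list are placeholders):

  # at process execution, with a wrapper command
  numactl --cpunodebind=0 --membind=0 ./a.out

  # dynamically, on a running process: show its core affinity, then change it
  taskset -cp <pid>
  taskset -cp 0-7 <pid>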

numactl Syntax
Affinity and policy can be changed externally through numactl at the socket and core level:
– Process affinity: socket references or core references.
– Memory policy: socket references.
[Figure: Stampede compute node with two sockets; one socket holds cores 0,1,2,3,4,5,6,7 and the other holds cores 8,9,10,11,12,13,14,15.]
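A few examples of that syntax (the option names are standard numactl; the socket and core numbers simply follow the layout above):

  # socket references: run on socket 0's cores, allocate memory on socket 0
  numactl --cpunodebind=0 --membind=0 ./a.out

  # core references: restrict execution to cores 8-15
  numactl --physcpubind=8-15 ./a.out

  # memory policy only: interleave pages across both sockets
  numactl --interleave=all ./a.out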

numactl Options on Stampede
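Commonly used numactl flags, for reference (standard short/long forms; this list is an illustration, not copied from the slide):

  -N, --cpunodebind=<sockets>   run the process only on the cores of the given socket(s)
  -C, --physcpubind=<cores>     run the process only on the given core id(s)
  -m, --membind=<sockets>       allocate memory only on the given socket(s)
  -i, --interleave=<sockets>    interleave memory allocations across the given sockets
  -l, --localalloc              allocate memory on the socket the process is running on
  --preferred=<socket>          prefer the given socket, falling back elsewhere if it is full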

Affinity on Stampede
Each thread or process has an affinity map (bitmask) with one bit per CPU id; a set bit allows execution on that core.
– There is a map for each thread or process.
– A map can be created in code with sched_setaffinity.
– A map is often created with numactl.
The core-to-socket layout can be listed with: egrep -e '(processor|physical id)' /proc/cpuinfo
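One way to inspect such a map directly (standard Linux interfaces, shown as an illustration):

  # affinity bitmask and core list of the current shell
  grep Cpus_allowed /proc/self/status

  # the same information, per process id, via taskset
  taskset -cp $$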

General Tips for Process Affinity and Memory Policy
Memory policy:
– MPI: local allocation is best.
– SMP: interleave may be best for large, completely shared arrays.
– SMP: local may be best for private arrays.
– Once allocated, the placement of a memory structure is fixed.
Process affinity:
– MPI tasks should be evenly distributed across the sockets.
– The threads of each task should be evenly spread across the cores.
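Expressed with numactl (the flags are standard; pairing them with the tips above is the only interpretation added here):

  # MPI task: keep memory local to the socket it runs on
  numactl --cpunodebind=0 --localalloc ./a.out

  # SMP run with large, completely shared arrays: spread pages over both sockets
  numactl --interleave=all ./a.out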

Hybrid Runs with NUMA Control
A single MPI task (process) is launched and becomes the “master thread”. It uses any numactl options specified on the launch command. When a parallel region forks the slave threads, the slaves inherit the affinity and memory policy of the master thread (the launch process).

Hybrid Batch Script: 16 threads/task
Job script (Bourne shell):

  ...
  #SBATCH -N 6                   # number of MPI tasks per node = n/N;
  #SBATCH -n 6                   # here 6/6 = 1 MPI task on each node
  ...
  export OMP_NUM_THREADS=16      # number of OpenMP threads for each MPI task
  # Unset any MPI affinities
  export MV2_USE_AFFINITY=0
  export MV2_ENABLE_AFFINITY=0
  export VIADEV_USE_AFFINITY=0
  export VIADEV_ENABLE_AFFINITY=0
  ibrun numactl -i all ./a.out   # with one task spanning the node, numactl can control only the memory allocation

Hybrid Batch Script: 2 tasks, 8 threads/task
Job script (Bourne shell):

  ...
  #SBATCH -N 6
  #SBATCH -n 12                  # 2 MPI tasks per node (n/N = 2)
  ...
  export OMP_NUM_THREADS=8
  ibrun numa.sh ./a.out

numa.sh:

  #!/bin/bash
  # Unset any MPI affinities
  export MV2_USE_AFFINITY=0
  export MV2_ENABLE_AFFINITY=0
  export VIADEV_USE_AFFINITY=0
  export VIADEV_ENABLE_AFFINITY=0
  # Get the rank from the appropriate MPI API variable
  myrank=$(( ${PMI_RANK-0} + ${PMI_ID-0} + ${MPIRUN_RANK-0} + ${OMPI_COMM_WORLD_RANK-0} + ${OMPI_MCA_ns_nds_vpid-0} ))
  localrank=$(( $myrank % 2 ))
  socket=$localrank
  exec numactl --cpunodebind $socket --membind $socket ./a.out

Hybrid Batch Script with tacc_affinity
Job script (Bourne shell):

  ...
  #SBATCH -N 6
  #SBATCH -n 24                  # 4 MPI tasks per node (n/N = 4)
  ...
  export OMP_NUM_THREADS=4
  ibrun tacc_affinity ./a.out

This is a simple setup that ensures an evenly distributed core assignment for your hybrid runs. tacc_affinity is not a single magic solution for every application: you can modify the script, replace tacc_affinity with your own version for your code, or use a numa.sh script as above.
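To confirm what a rank actually receives, one quick check (an illustration, not part of the original scripts) is to launch numactl --show through the same wrapper; each rank then prints the cpubind and membind it inherited:

  ibrun tacc_affinity numactl --show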

Summary
NUMA control lets hybrid jobs run with the desired core affinity and memory policy. Users have global, socket-level, and core-level control over the arrangement of processes and threads. A small investment here can bring a large return by avoiding non-optimal core affinity and memory policies.

tacc_affinity

  #!/bin/bash
  # -*- shell-script -*-
  export MV2_USE_AFFINITY=0
  export MV2_ENABLE_AFFINITY=0
  export VIADEV_USE_AFFINITY=0
  export VIADEV_ENABLE_AFFINITY=0
  my_rank=$(( ${PMI_RANK-0} + ${PMI_ID-0} + ${MPIRUN_RANK-0} + ${OMPI_COMM_WORLD_RANK-0} + ${OMPI_MCA_ns_nds_vpid-0} ))
  # If running under "ibrun", TACC_pe_ppn will already be set;
  # else get the tasks-per-node count from SLURM_TASKS_PER_NODE
  if [ -z "$TACC_pe_ppn" ]; then
    myway=`echo $SLURM_TASKS_PER_NODE | awk -F '(' '{print $1}'`
  else
    myway=$TACC_pe_ppn
  fi

tacc_affinity (cont'd)

  local_rank=$(( $my_rank % $myway ))
  EvenRanks=$(( $myway % 2 ))
  if [ "$SYSHOST" = stampede ]; then
    # if 1 task/node, allow memory on both sockets
    if [ $myway -eq 1 ]; then
      numnode="0,1"
    # if 2 tasks/node, put the 1st task on socket 0 and the 2nd on socket 1
    elif [ $myway -eq 2 ]; then
      numnode="$local_rank"
    # if an even number of tasks per node, place memory on alternating sockets
    elif [ $EvenRanks -eq 0 ]; then
      numnode=$(( $local_rank % 2 ))
    # if an odd number of tasks per node, do nothing, i.e., allow memory on both sockets
    else
      numnode="0,1"
    fi
  fi
  exec numactl --cpunodebind=$numnode --membind=$numnode $*