Hybrid MPI and OpenMP Parallel Programming


Hybrid MPI and OpenMP Parallel Programming
Jemmy Hu
SHARCNET HPTC Consultant
July 8, 2015

Objectives
- the difference between the message-passing and shared-memory models (MPI, OpenMP)
- why, or why not, hybrid?
- a common model for utilizing both the MPI and OpenMP approaches to parallel programming
- example hybrid code; how to compile and execute hybrid code on SHARCNET clusters

Hybrid Distributed-Shared Memory Architecture
Employs both shared- and distributed-memory architectures.
- The shared-memory component is usually a cache-coherent SMP node. Processors on a given SMP node can address that node's memory as global.
- The distributed-memory component is the networking of multiple SMP nodes. SMPs know only about their own memory, not the memory on another SMP, so network communication is required to move data from one SMP to another.
- Current trends indicate that this type of memory architecture will continue to prevail: more CPUs per SMP node, but less memory and memory bandwidth per CPU.

MPI
- the standard for distributed-memory communication
- provides an explicit means to use message passing on distributed-memory clusters
- specializes in packing and sending complex data structures over the network
- the data goes to the process
- synchronization must be handled explicitly due to the nature of distributed memory
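
A minimal sketch of this explicit message-passing style (not from the original slides): rank 0 sends one value to rank 1, which must post a matching receive. The tag value 99 and the single-double payload are arbitrary choices for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double value = 3.14;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* the data goes to the process: an explicit send ... */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... must be matched by an explicit receive */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}

(Run with at least two processes, e.g. mpirun -np 2.)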

OpenMP
- a shared-memory paradigm with implicit intra-node communication
- efficient utilization of shared-memory SMP systems
- easy threaded programming, supported by most major compilers
- the process goes to the data; communication among threads is implicit
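
A minimal sketch of this implicit shared-memory style (not from the original slides): the threads split the loop iterations among themselves and "communicate" only by reading and writing the shared array. The array size is an arbitrary choice for illustration.

#include <stdio.h>
#include <omp.h>

#define N 1000000   /* arbitrary problem size for illustration */

static double a[N];

int main(void)
{
    int i;

    /* iterations are divided among the threads; the array a is shared,
       so no explicit data movement is needed */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f (max threads: %d)\n", a[N-1], omp_get_max_threads());
    return 0;
}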

MPI vs. OpenMP

Pure OpenMP Pros:
- Easy to implement parallelism
- Implicit communication
- Low latency, high bandwidth
- Dynamic load balancing

Pure OpenMP Cons:
- Runs only on shared-memory machines
- Scales only within one node
- Data placement problems

Pure MPI Pros:
- Portable to distributed- and shared-memory machines
- Scales beyond one node
- No data placement problem

Pure MPI Cons:
- Explicit communication
- High latency, low bandwidth
- Difficult load balancing

Why Hybrid: employ the best from both worlds
● MPI makes inter-node communication relatively easy
● MPI facilitates efficient inter-node scatters, reductions, and sending of complex data structures
● Since program state synchronization is done explicitly with messages, correctness issues are relatively easy to avoid
● OpenMP allows for high-performance, relatively straightforward intra-node threading
● OpenMP provides an interface for the concurrent utilization of each SMP's shared memory, which is much more efficient than using message passing
● Program state synchronization is implicit on each SMP node, which eliminates much of the overhead associated with message passing

Overall goal: to reduce communication needs and memory consumption, or to improve load balance

Why not Hybrid?
- the OpenMP part may perform worse than pure MPI code within a node
- all threads except one are idle during MPI communication
- data placement and cache coherence problems
- critical sections are needed for shared variables
- possible waste of effort

A Common Hybrid Approach
- From sequential code: parallelize with MPI first, then try to add OpenMP.
- From MPI code: add OpenMP.
- From OpenMP code: treat it as serial code.
- The simplest and least error-prone way is to use MPI outside parallel regions and allow only the master thread to communicate between MPI tasks (see the sketch below).
- MPI could also be used inside a parallel region with a thread-safe MPI.
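
A sketch of the "master thread only" pattern described above (not from the original slides): requesting MPI_THREAD_FUNNELED from MPI_Init_thread declares that only the thread that initialized MPI will make MPI calls. The actual computation and communication are omitted.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* FUNNELED: only the master thread will call MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: MPI provides thread support level %d only\n", provided);

    #pragma omp parallel
    {
        /* ... threaded computation, no MPI calls in here ... */
    }

    /* MPI communication outside the parallel region, i.e. on the master thread */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}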

Hybrid – Program Model

Program hybrid
  call MPI_INIT (ierr)                  ! start with MPI initialization
  call MPI_COMM_RANK (…)
  call MPI_COMM_SIZE (…)
  … some computation and MPI communication …
  !$OMP PARALLEL DO PRIVATE(i)          ! start OpenMP within the node
  !$OMP& SHARED(n)
  do i=1,n
     … computation
  enddo
  !$OMP END PARALLEL DO
  call MPI_FINALIZE (ierr)
end

• Start with MPI initialization
• Create OpenMP parallel regions within each MPI task (process).
  - Serial regions are executed by the master thread of the MPI task.
  - The MPI rank is known to all threads.
• Call the MPI library in serial and parallel regions.
• Finalize MPI

MPI vs. MPI+OpenMP (per node)

Pure MPI:    16 CPUs across 4 nodes, 16 MPI processes
MPI+OpenMP:  16 CPUs across 4 nodes, 1 MPI process with 4 threads per node

pi – MPI version

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>      /* MPI header file */

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int nprocs;
    int myid;
    double start_time, end_time;
    int i;
    double x, pi;
    double sum = 0.0;
    double step = 1.0/(double) NUM_STEPS;

    /* initialize for MPI */
    MPI_Init(&argc, &argv);   /* starts MPI */
    /* get number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    /* get this process's number (ranges from 0 to nprocs - 1) */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* do computation */
    for (i=myid; i < NUM_STEPS; i += nprocs) {   /* changed */
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    sum = step * sum;                            /* changed */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);  /* added */

    /* print results */
    if (myid == 0) {
        printf("parallel program results with %d processes:\n", nprocs);
        printf("pi = %g (%17.15f)\n", pi, pi);
    }

    /* clean up for MPI */
    MPI_Finalize();
    return 0;
}

OpenMP, reduction clause

#include <stdio.h>
#include <omp.h>

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int i;
    double x, pi;
    double sum = 0.0;
    double step = 1.0/(double) NUM_STEPS;

    /* do computation -- using all available threads */
    #pragma omp parallel
    {
        #pragma omp for private(x) reduction(+:sum) schedule(runtime)
        for (i=0; i < NUM_STEPS; ++i) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        #pragma omp master
        {
            pi = step * sum;
            printf("PI = %f\n", pi);
        }
    }
    return 0;
}

MPI_OpenMP version

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>      /* MPI header file */
#include <omp.h>      /* OpenMP header file */

#define NUM_STEPS 100000000
#define MAX_THREADS 4

int main(int argc, char *argv[])
{
    int nprocs, myid;
    int tid, nthreads, nbin;
    double start_time, end_time;
    double pi, Psum=0.0, sum[MAX_THREADS]={0.0};
    double step = 1.0/(double) NUM_STEPS;

    /* initialize for MPI */
    MPI_Init(&argc, &argv);   /* starts MPI */
    /* get number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    /* get this process's number (ranges from 0 to nprocs - 1) */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    nbin = NUM_STEPS/nprocs;

    #pragma omp parallel private(tid)
    {
        int i;
        double x;
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        for (i=nbin*myid+tid; i < nbin*(myid+1); i += nthreads) {   /* changed */
            x = (i+0.5)*step;
            sum[tid] += 4.0/(1.0+x*x);
        }
    }

    for (tid=0; tid<nthreads; tid++)   /* sum by each MPI process */
        Psum += sum[tid]*step;

    MPI_Reduce(&Psum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);  /* added */

    if (myid == 0) {
        printf("parallel program results with %d processes:\n", nprocs);
        printf("pi = %g (%17.15f)\n", pi, pi);
    }

    MPI_Finalize();
    return 0;
}
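
An alternative sketch (not from the original slides): instead of the fixed-size per-thread array sum[MAX_THREADS], which caps the thread count and can suffer false sharing, OpenMP's reduction clause can form the per-process partial sum before the MPI_Reduce. Everything else mirrors the hybrid version above.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int nprocs, myid, i, nbin;
    double x, pi, Psum, local = 0.0;
    double step = 1.0/(double) NUM_STEPS;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    nbin = NUM_STEPS/nprocs;

    /* OpenMP forms the per-process partial sum; no per-thread array is needed */
    #pragma omp parallel for private(x) reduction(+:local)
    for (i = nbin*myid; i < nbin*(myid+1); i++) {
        x = (i+0.5)*step;
        local += 4.0/(1.0+x*x);
    }
    Psum = local * step;

    /* MPI combines the per-process sums */
    MPI_Reduce(&Psum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("pi = %17.15f\n", pi);

    MPI_Finalize();
    return 0;
}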

Compile and Run

Compile (default Intel compilers on SHARCNET systems):
  mpicc -o pi-mpi pi-mpi.c
  cc -openmp -o pi-omp pi-omp.c
  mpicc -openmp -o pi-hybrid pi-hybrid.c

Run (sqsub):
  sqsub -q mpi -n 8 --ppn=4 -r 10m -o pi-mpi.log ./pi-mpi
  sqsub -q threaded -n 8 -r 10m -o pi-omp.log ./pi-omp
  sqsub -q mpi -n 8 --ppn=1 --tpp=4 -r 10m -o pi-hybrid.log ./pi-hybrid

Example codes and results are in: /home/jemmyhu/CES706/Hybrid/pi/
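
For comparison, on a generic cluster without SHARCNET's sqsub scheduler, the hybrid build and launch might look like the following sketch (assuming GNU compilers and a plain mpirun; flag names vary between MPI distributions and job schedulers):

  # compile with OpenMP enabled (GCC uses -fopenmp)
  mpicc -fopenmp -o pi-hybrid pi-hybrid.c

  # 2 MPI processes, 4 OpenMP threads per process
  export OMP_NUM_THREADS=4
  mpirun -np 2 ./pi-hybrid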

Results

MPI (8 processes):
  pi = 3.14159 (3.141592653589828)

OpenMP (8 threads):
  pi = 3.14159 (3.141592653589882)

Hybrid (2 MPI processes, 4 threads each):
  mpi process 0 uses 4 threads
  mpi process 1 uses 4 threads
  mpi process 1 sum is 1.287 (1.287002217586605)
  mpi process 0 sum is 1.85459 (1.854590436003132)
  Total MPI processes are 2
  pi = 3.14159 (3.141592653589738)

Summary
- Computer systems used in high-performance computing (HPC) feature a hierarchical hardware design: multi-core nodes connected via a network.
- OpenMP can take advantage of shared memory to reduce communication overhead.
- Pure OpenMP performing better than pure MPI within a node is a prerequisite for hybrid code outperforming pure MPI across nodes.
- Whether hybrid code performs better than pure MPI code depends on whether the communication savings outweigh the threading overhead.
- There are now more positive experiences of developing hybrid MPI/OpenMP parallel paradigms; adopting the hybrid paradigm in your own application is encouraged.

References
- http://openmp.org/sc13/HybridPP_Slides.pdf
- https://www.cct.lsu.edu/~estrabd/intro-hybrid-mpi-openmp.pdf
- http://www.cac.cornell.edu/education/Training/parallelMay2011/Hybrid_Talk-110524.pdf