Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
EuroPVM/MPI, 2/10/2003

Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions – Future Work

Introduction
- Motivation:
  - SMP clusters
  - Hybrid programming models
  - Mostly fine-grain MPI-OpenMP paradigms
  - Mostly DOALL parallelization

Introduction
- Contribution:
  - 3 programming models for the parallelization of nested loop algorithms:
    - pure MPI
    - fine-grain hybrid MPI-OpenMP
    - coarse-grain hybrid MPI-OpenMP
  - Advanced hyperplane scheduling:
    - minimizes synchronization needs
    - overlaps computation with communication

Introduction
Algorithmic Model:

FOR j_0 = min_0 TO max_0 DO
  ...
  FOR j_(n-1) = min_(n-1) TO max_(n-1) DO
    Computation(j_0, ..., j_(n-1));
  ENDFOR
  ...
ENDFOR

- Perfectly nested loops
- Constant flow data dependencies

Introduction
Target Architecture: SMP clusters

Pure MPI Model
- Tiling transformation groups iterations into atomic execution units (tiles)
- Pipelined execution
- Overlapping computation with communication
- Makes no distinction between inter-node and intra-node communication

Pure MPI Model
Example:

FOR j_1 = 0 TO 9 DO
  FOR j_2 = 0 TO 7 DO
    A[j_1, j_2] := A[j_1 - 1, j_2] + A[j_1, j_2 - 1];
  ENDFOR
ENDFOR
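
As a concrete illustration, here is a minimal C sketch of the tiling transformation applied to this example. The tile sizes T1 x T2 are illustrative assumptions, and the iteration space is shifted by one so that the j-1 accesses stay in bounds (row 0 and column 0 hold boundary values).

#define N1 10           /* matches j1 = 0..9 in the example */
#define N2 8            /* matches j2 = 0..7 in the example */
#define T1 2            /* illustrative tile size along j1 */
#define T2 4            /* illustrative tile size along j2 */

static double A[N1 + 1][N2 + 1];   /* row 0 / column 0 hold boundary values */

void tiled_example(void)
{
    for (int t1 = 1; t1 <= N1; t1 += T1)                        /* loops over tiles */
        for (int t2 = 1; t2 <= N2; t2 += T2)
            for (int j1 = t1; j1 < t1 + T1 && j1 <= N1; j1++)   /* loops inside a tile */
                for (int j2 = t2; j2 < t2 + T2 && j2 <= N2; j2++)
                    A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

Executed tile-by-tile in lexicographic order, the (1,0) and (0,1) dependencies of the example are respected; each such tile is the atomic unit that the pipelined execution below assigns and communicates.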

Pure MPI Model
[Figure: tile-to-process mapping for 4 MPI nodes (NODE0: CPU0/CPU1, NODE1: CPU0/CPU1)]

Pure MPI Model

tile_0 = nod_0;
...
tile_(n-2) = nod_(n-2);
FOR tile_(n-1) = 0 TO ... DO
  Pack(snd_buf, tile_(n-1) - 1, nod);
  MPI_Isend(snd_buf, dest(nod));
  MPI_Irecv(recv_buf, src(nod));
  Compute(tile);
  MPI_Waitall;
  Unpack(recv_buf, tile_(n-1) + 1, nod);
ENDFOR
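
A minimal C/MPI sketch of this per-tile pipeline is given below, assuming a one-dimensional process decomposition; pack_tile(), compute_tile() and unpack_tile() are hypothetical helpers (which also handle the boundary tiles at the ends of the pipeline), and buffer sizes and message tags are illustrative, not taken from the paper.

#include <mpi.h>

void pack_tile(double *buf, int tile);      /* hypothetical helpers */
void unpack_tile(double *buf, int tile);
void compute_tile(int tile);

void pure_mpi_pipeline(int num_tiles, double *snd_buf, double *rcv_buf, int buf_len)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int prev = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;   /* pipeline neighbours */
    int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int tile = 0; tile < num_tiles; tile++) {
        MPI_Request req[2];

        pack_tile(snd_buf, tile - 1);        /* boundary data of the previously computed tile */
        MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(rcv_buf, buf_len, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[1]);

        compute_tile(tile);                  /* overlaps with the pending communication */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_tile(rcv_buf, tile + 1);      /* data needed by the next tile */
    }
}

The non-blocking send/receive pair issued before the computation is what provides the computation/communication overlap mentioned above.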

Hyperplane Scheduling
- Implements coarse-grain parallelism assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads
- Barrier synchronization between threads

Hyperplane Scheduling
[Figure: 2 MPI nodes x 2 OpenMP threads (NODE0: CPU0/CPU1, NODE1: CPU0/CPU1)]

Hyperplane Scheduling

#pragma omp parallel
{
  group_0 = nod_0;
  ...
  group_(n-2) = nod_(n-2);
  tile_0 = nod_0 * m_0 + th_0;
  ...
  tile_(n-2) = nod_(n-2) * m_(n-2) + th_(n-2);
  FOR(group_(n-1)) {
    tile_(n-1) = group_(n-1) - ...;
    if (0 <= tile_(n-1) <= ...)
      compute(tile);
    #pragma omp barrier
  }
}
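
For the 2D example and the 2 nodes x 2 threads configuration of the previous figure, this schedule can be made concrete as in the following sketch. TILES1, TILES2 and compute_tile() are illustrative assumptions, and the range check on tile2 stands in for the elided bound expressions of the pseudocode above.

#include <omp.h>

#define TILES1 4                    /* illustrative: tiles along j1, split over nodes/threads */
#define TILES2 5                    /* illustrative: tiles along j2, swept group by group     */

void compute_tile(int tile1, int tile2);   /* hypothetical helper */

void hyperplane_schedule(int node, int threads_per_node)
{
    #pragma omp parallel num_threads(threads_per_node)
    {
        int th    = omp_get_thread_num();
        int tile1 = node * threads_per_node + th;      /* this thread's fixed tile coordinate */

        for (int group = 0; group < TILES1 + TILES2 - 1; group++) {
            int tile2 = group - tile1;                 /* hyperplane: tile1 + tile2 == group */
            if (tile2 >= 0 && tile2 < TILES2)          /* only tiles of the current group are valid */
                compute_tile(tile1, tile2);
            #pragma omp barrier                        /* all threads finish the group together */
        }
    }
}

Tiles on the same hyperplane (same group index) are mutually independent, so the threads of all nodes can execute them concurrently; the barrier separates successive groups.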

Fine-grain Model
- Incremental parallelization of the computationally intensive parts
- Relatively straightforward to derive from pure MPI
- Threads are (re)spawned at each computation phase
- Inter-node communication outside of the multi-threaded part
- Thread synchronization through the implicit barrier of the omp parallel directive

Fine-grain Model

FOR(group_(n-1)) {
  Pack(snd_buf, tile_(n-1) - 1, nod);
  MPI_Isend(snd_buf, dest(nod));
  MPI_Irecv(recv_buf, src(nod));
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    if (valid(tile, thread_id, group_(n-1)))
      Compute(tile);
  }
  MPI_Waitall;
  Unpack(recv_buf, tile_(n-1) + 1, nod);
}
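
A more concrete, hedged C sketch of this fine-grain structure follows; pack_tile(), unpack_tile(), compute_tile() and tile_is_valid() are hypothetical helpers, and buffer sizes and tags are again illustrative. The team of threads is created (and joined) once per group, and every MPI call stays outside the parallel region.

#include <mpi.h>
#include <omp.h>

void pack_tile(double *buf, int group);          /* hypothetical helpers */
void unpack_tile(double *buf, int group);
void compute_tile(int group, int thread_id);
int  tile_is_valid(int group, int thread_id);

void fine_grain_sweep(int num_groups, int prev, int next,
                      double *snd_buf, double *rcv_buf, int buf_len)
{
    for (int group = 0; group < num_groups; group++) {
        MPI_Request req[2];

        pack_tile(snd_buf, group - 1);
        MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(rcv_buf, buf_len, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[1]);

        #pragma omp parallel                      /* threads (re)spawned for every group */
        {
            int th = omp_get_thread_num();
            if (tile_is_valid(group, th))
                compute_tile(group, th);
        }                                         /* implicit barrier joins the threads */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_tile(rcv_buf, group + 1);
    }
}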

Coarse-grain Model
- SPMD paradigm
- Requires more programming effort
- Threads are only spawned once
- Inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE)
- Thread synchronization through an explicit barrier (omp barrier directive)
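
Since MPI calls are now issued from inside the parallel region, the library must be initialized with an appropriate thread-support level. Below is a minimal sketch of that initialization; requesting MPI_THREAD_MULTIPLE follows the slide, although MPI_THREAD_FUNNELED is the strict minimum when, as here, only the master thread communicates. The error handling is illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multi-threaded support, as stated on the slide. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Only the master thread issues MPI calls in the coarse-grain model,
     * so MPI_THREAD_FUNNELED is already sufficient in practice. */
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "insufficient MPI thread support: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... spawn the omp parallel region once and execute the tile groups ... */

    MPI_Finalize();
    return 0;
}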

Coarse-grain Model

#pragma omp parallel
{
  thread_id = omp_get_thread_num();
  FOR(group_(n-1)) {
    #pragma omp master
    {
      Pack(snd_buf, tile_(n-1) - 1, nod);
      MPI_Isend(snd_buf, dest(nod));
      MPI_Irecv(recv_buf, src(nod));
    }
    if (valid(tile, thread_id, group_(n-1)))
      Compute(tile);
    #pragma omp master
    {
      MPI_Waitall;
      Unpack(recv_buf, tile_(n-1) + 1, nod);
    }
    #pragma omp barrier   /* explicit: omp master has no implied barrier */
  }
}

Summary: Fine-grain vs Coarse-grain

Fine-grain:
- Threads re-spawned for every parallel computation
- Inter-node MPI communication outside of the multi-threaded region
- Intra-node synchronization through the implicit barrier of omp parallel

Coarse-grain:
- Threads are only spawned once
- Inter-node MPI communication inside the multi-threaded region, assumed by the master thread
- Intra-node synchronization through an explicit OpenMP barrier

Experimental Results
- 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel ...)
- MPICH v... (--with-device=ch_p4, --with-comm=shared)
- Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- FastEthernet interconnection
- ADI micro-kernel benchmark (3D)

Alternating Direction Implicit (ADI)
- Unitary data dependencies
- 3D Iteration Space (X x Y x Z)
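
The dependence pattern that matters for the parallelization is sketched below: a 3D sweep in which every point depends on its predecessor along each dimension. The update formula and the array sizes are illustrative only, not the actual ADI computation.

#define X 64          /* illustrative sizes; the experiments use up to 512 x 512 x 8192 */
#define Y 64
#define Z 64

static double A[X][Y][Z];

void adi_like_sweep(void)
{
    for (int i = 1; i < X; i++)
        for (int j = 1; j < Y; j++)
            for (int k = 1; k < Z; k++)
                /* unitary flow dependencies: (i-1, j, k), (i, j-1, k), (i, j, k-1) */
                A[i][j][k] += A[i-1][j][k] + A[i][j-1][k] + A[i][j][k-1];
}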

ADI – 4 nodes [results figure]

ADI – 4 nodes: X < Y, X > Y [results figures]

ADI X=512 Y=512 Z=8192 – 4 nodes [results figure]

ADI X=128 Y=512 Z=8192 – 4 nodes [results figure]

ADI X=512 Y=128 Z=8192 – 4 nodes [results figure]

ADI – 2 nodes [results figure]

ADI – 2 nodes: X < Y, X > Y [results figures]

ADI X=128 Y=512 Z=8192 – 2 nodes [results figure]

ADI X=256 Y=512 Z=8192 – 2 nodes [results figure]

ADI X=512 Y=512 Z=8192 – 2 nodes [results figure]

ADI X=512 Y=256 Z=8192 – 2 nodes [results figure]

ADI X=512 Y=128 Z=8192 – 2 nodes [results figure]

ADI X=128 Y=512 Z=8192 – 2 nodes: computation / communication breakdown [results figure]

ADI X=512 Y=128 Z=8192 – 2 nodes: computation / communication breakdown [results figure]

Conclusions
- Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure MPI paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI

Future Work
- Application of methodology to real applications and benchmarks
- Work balancing for coarse-grain model
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
- Generalization as compiler technique

Questions?