Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory

Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models: Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model
- Experimental Results
- Conclusions – Future Work

Motivation
- Active research interest in SMP clusters and hybrid programming models
- However, existing work focuses mostly on fine-grain hybrid paradigms (masteronly model) and on DOALL multi-threaded parallelization

Contribution
- Comparison of 3 programming models for the parallelization of tiled loop algorithms: pure message-passing, fine-grain hybrid, coarse-grain hybrid
- Advanced hyperplane scheduling that minimizes synchronization need, overlaps computation with communication, and preserves data dependencies

Algorithmic Model
Tiled nested loops with constant flow data dependencies:

    FORACROSS tile_0 DO
      …
        FORACROSS tile_{n-2} DO
          FOR tile_{n-1} DO
            Receive(tile);
            Compute(tile);
            Send(tile);
          END FOR
        END FORACROSS
      …
    END FORACROSS
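
As a concrete, purely illustrative instance of this class, the sketch below shows a 3D sweep in which every point depends on its immediate predecessor along each dimension; the array name A, the update formula, and the bounds N0, N1, N2 are assumptions made for the example, not values from the slides.

    /* Constant (unit) flow dependencies: each point needs its three "previous" neighbours. */
    for (int i = 1; i < N0; i++)
        for (int j = 1; j < N1; j++)
            for (int k = 1; k < N2; k++)
                A[i][j][k] = 0.25 * (A[i][j][k] + A[i-1][j][k] + A[i][j-1][k] + A[i][j][k-1]);

Tiling such a nest partitions the iteration space into blocks; the outer FORACROSS dimensions are executed across processes, while the innermost tile dimension is swept sequentially with the Receive/Compute/Send pattern above.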

Target Architecture: SMP clusters

Overview: Introduction, Pure Message-passing Model, Hybrid Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model), Experimental Results, Conclusions – Future Work

Pure Message-passing Model

    tile_0 = pr_0;
    …
    tile_{n-2} = pr_{n-2};
    FOR tile_{n-1} = 0 TO … DO
      Pack(snd_buf, tile_{n-1} - 1, pr);
      MPI_Isend(snd_buf, dest(pr));
      MPI_Irecv(recv_buf, src(pr));
      Compute(tile);
      MPI_Waitall;
      Unpack(recv_buf, tile_{n-1} + 1, pr);
    END FOR
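
Spelled out as C with actual MPI signatures, the per-tile pipeline might look roughly like the sketch below; the buffers, counts, message tag, and the pack/unpack/compute/dest/src helpers are hypothetical stand-ins for the slide's shorthand (assumes #include <mpi.h> and that these objects are declared elsewhere).

    /* Pack data produced for the previous tile, post non-blocking transfers,
       compute the current tile (intended to overlap with the transfers),
       then complete the transfers and unpack data needed by the next tile. */
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    for (int t = 0; t < num_tiles; t++) {          /* innermost tile dimension */
        pack(snd_buf, t - 1, pr);
        MPI_Isend(snd_buf, snd_count, MPI_DOUBLE, dest(pr), 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recv_buf, recv_count, MPI_DOUBLE, src(pr), 0, MPI_COMM_WORLD, &reqs[1]);
        compute(t, pr);
        MPI_Waitall(2, reqs, stats);
        unpack(recv_buf, t + 1, pr);
    }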

Pure Message-passing Model (figure)

Overview: Introduction, Pure Message-passing Model, Hybrid Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model), Experimental Results, Conclusions – Future Work

Hyperplane Scheduling
- Implements coarse-grain parallelism assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads
- Barrier synchronization between threads
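
A common way to realize such groups (an assumption consistent with standard wavefront/hyperplane formulations, not spelled out on this slide) is to give each tile a group index equal to the sum of its tile coordinates: all tiles on the same diagonal are then mutually independent. A minimal 2D OpenMP sketch, with TI, TJ and compute_tile() as hypothetical placeholders:

    /* Wavefront over a TI x TJ tile grid: tiles with i + j == g form group g and
       can run concurrently; the implicit barrier at the end of the parallel for
       keeps successive groups ordered. */
    for (int g = 0; g < TI + TJ - 1; g++) {
        #pragma omp parallel for
        for (int i = 0; i < TI; i++) {
            int j = g - i;
            if (j >= 0 && j < TJ)
                compute_tile(i, j);
        }
    }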

Hyperplane Scheduling (figure): each tile is identified by (mpi_rank, omp_tid, tile) and mapped to a group

Hyperplane Scheduling

    #pragma omp parallel
    {
      group_0 = pr_0;
      …
      group_{n-2} = pr_{n-2};
      tile_0 = pr_0 * m_0 + th_0;
      …
      tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2};
      FOR(group_{n-1}) {
        tile_{n-1} = group_{n-1} - …;
        if (0 <= tile_{n-1} <= …)
          compute(tile);
        #pragma omp barrier
      }
    }

Overview: Introduction, Pure Message-passing Model, Hybrid Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model), Experimental Results, Conclusions – Future Work

Fine-grain Model
- Incremental parallelization of computationally intensive parts
- Pure MPI + hyperplane scheduling
- Inter-node communication outside of the multi-threaded part (MPI_THREAD_MASTERONLY)
- Thread synchronization through the implicit barrier of the omp parallel directive

Fine-grain Model

    FOR(group_{n-1}) {
      Pack(snd_buf, tile_{n-1} - 1, pr);
      MPI_Isend(snd_buf, dest(pr));
      MPI_Irecv(recv_buf, src(pr));
      #pragma omp parallel
      {
        thread_id = omp_get_thread_num();
        if (valid(tile, thread_id, group_{n-1}))
          Compute(tile);
      }
      MPI_Waitall;
      Unpack(recv_buf, tile_{n-1} + 1, pr);
    }
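
Relative to the pure message-passing sketch given earlier, only the compute step changes in this model: it is wrapped in a parallel region that is re-created for every group, and each thread works on its own tile of the current group. A sketch (assumes #include <omp.h>; tile_is_valid() and compute_tile_of() are hypothetical helpers standing in for the slide's valid()/Compute()):

    /* Fine-grain (masteronly) compute phase: all MPI calls stay outside this region,
       so only the master thread ever talks to the MPI library. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tile_is_valid(tid, g))       /* does thread tid own a tile of group g? */
            compute_tile_of(tid, g);
    }                                    /* implicit barrier: all threads have finished */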

Overview: Introduction, Pure Message-passing Model, Hybrid Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model), Experimental Results, Conclusions – Future Work

Coarse-grain Model
- Threads are only initialized once
- SPMD paradigm (requires more programming effort)
- Inter-node communication inside the multi-threaded part (requires MPI_THREAD_FUNNELED)
- Thread synchronization through an explicit barrier (omp barrier directive)

Coarse-grain Model

    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      FOR(group_{n-1}) {
        #pragma omp master
        {
          Pack(snd_buf, tile_{n-1} - 1, pr);
          MPI_Isend(snd_buf, dest(pr));
          MPI_Irecv(recv_buf, src(pr));
        }
        if (valid(tile, thread_id, group_{n-1}))
          Compute(tile);
        #pragma omp master
        {
          MPI_Waitall;
          Unpack(recv_buf, tile_{n-1} + 1, pr);
        }
        #pragma omp barrier
      }
    }
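
Because the coarse-grain structure above issues MPI calls from inside the parallel region (by the master thread only), the library must be initialized with at least MPI_THREAD_FUNNELED support. A minimal sketch of requesting and checking that level is shown below; whether the particular MPICH build used in these experiments exposes MPI_Init_thread is not stated on the slides.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI provides thread level %d, MPI_THREAD_FUNNELED required\n",
                    provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... coarse-grain hybrid computation as sketched above ... */
        MPI_Finalize();
        return 0;
    }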

Overview: Introduction, Pure Message-passing Model, Hybrid Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model), Experimental Results, Conclusions – Future Work

Experimental Results
- 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel …)
- MPICH v… (--with-device=ch_p4, --with-comm=shared)
- Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- Fast Ethernet interconnection
- ADI micro-kernel benchmark (3D)

Alternating Direction Implicit (ADI)
- Stencil computation used for solving partial differential equations
- Unitary data dependencies
- 3D iteration space (X × Y × Z)

ADI – 2 dual SMP nodes (chart slides for the following configurations):
- ADI X=128 Y=512 Z=8192 – 2 nodes
- ADI X=256 Y=512 Z=8192 – 2 nodes
- ADI X=512 Y=512 Z=8192 – 2 nodes
- ADI X=512 Y=256 Z=8192 – 2 nodes
- ADI X=512 Y=128 Z=8192 – 2 nodes
- ADI X=128 Y=512 Z=8192 – 2 nodes (computation / communication breakdown)
- ADI X=512 Y=128 Z=8192 – 2 nodes (computation / communication breakdown)

Overview: Introduction, Pure Message-passing Model, Hybrid Models (Hyperplane Scheduling, Fine-grain Model, Coarse-grain Model), Experimental Results, Conclusions – Future Work

Conclusions
- Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure message-passing paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI

Future Work
- Application of the methodology to real applications and standard benchmarks
- Work balancing for the coarse-grain model
- Investigation of alternative topologies and irregular communication patterns
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)

Thank You! Questions?