Case Study: PRACE Autumn School in HPC Programming Techniques 25-28 November 2014 Athens, Greece Nikos Tryfonidis, Aristotle University of Thessaloniki.


You are given a scientific code, parallelized with MPI. Question: are there any possible performance benefits from a mixed-mode (MPI+OpenMP) implementation? We will go through the steps of preparing, implementing and evaluating the addition of threads to the code.

MG: a general-purpose computational fluid dynamics code (~20,000 lines), written in C and parallelized with MPI using a communication library written by the author. Developed by Mantis Numerics and provided by Prof. Sam Falle (director of the company and author of the code). MG has been used professionally for research in astrophysics and for simulations of liquid CO2 in pipelines, non-ideal detonations, groundwater flow, etc.

1. Preparation: code description, initial benchmarks.
2. Implementation: introduction of threads into the code, applying some interesting OpenMP concepts: parallelizing linked list traversals, OpenMP Tasks, avoiding race conditions.
3. Results and conclusion.

Code Description

 Step 1: Inspection of the code, discussion with the author.  Step 2: Run some initial benchmarks to get an idea of the program’s (pure MPI) runtime and scaling.  Step 3: Use profiling to gain some insight into the code’s hotspots/bottlenecks.

Computational domain: consists of cells (shown as yellow boxes in the slide figure) and joins (arrows). The code performs computational work by looping through all cells and joins. (Figure: 1D example, 1st cell, 2nd cell, ..., last cell, connected by 1st join, 2nd join, ..., last join.)

Cells are distributed to all MPI processes using a 1D decomposition (each process gets a contiguous group of cells and joins), with halo communication between neighbouring processes. (Figure: halo exchange between Proc. 1 and Proc. 2.)
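MG performs this halo exchange through the author's own communication library, whose interface is not shown here. Purely as an illustration of the pattern (all names below are hypothetical, not MG's), a generic 1D halo exchange with MPI could look like:

    #include <mpi.h>

    /* Illustration only: exchange one face of halo data with the left and
       right neighbours of a 1D decomposition. Buffer packing is omitted. */
    void exchange_halos(double *send_left, double *recv_left,
                        double *send_right, double *recv_right,
                        int face_count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send my leftmost face to the left neighbour,
           receive my right halo from the right neighbour */
        MPI_Sendrecv(send_left,  face_count, MPI_DOUBLE, left,  0,
                     recv_right, face_count, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);

        /* and the opposite direction */
        MPI_Sendrecv(send_right, face_count, MPI_DOUBLE, right, 1,
                     recv_left,  face_count, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }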

Computational hotspot of the code: the "step" function (~500 lines). "step" determines the stable time step and then advances the solution over that time step. It mainly consists of halo communication and multiple loops, separately over cells and over joins.

Structure of "step": the 1st-order step performs halo communication (calls to MPI) followed by loops through cells and joins (computational work); the 2nd-order step performs another halo communication followed by multiple loops through cells, halo cells and joins (heavier computational work).

Initial Benchmarks and Profiling

Initial benchmarks were run using a test case suggested by the code author. A 3D computational domain was used, with various domain sizes (100³, 200³ and 300³ cells), for 10 computational steps. Representative performance results will be shown here.

Figure 1: Execution time (in seconds) versus number of MPI Processes (size: 300³)

Figure 2: Speedup versus number of MPI Processes (all sizes).

 Profiling of the code was done using CrayPAT.  Four profiling runs were performed, with different numbers of processors (2, 4, 128, 256) and a grid size of 200³ cells.  Most relevant result of the profiling runs for the purpose of this presentation: percentage of time spent in MPI functions.

Figure 3: Percentage of time spent in MPI communication, for 2, 4, 128 and 256 processors (200³ cells)

The code's performance is seriously affected as the number of processors increases; beyond a certain point, performance actually becomes worse. Profiling shows that MPI communication dominates the runtime at high processor counts.

A smaller number of MPI processes means: fewer calls to MPI; cheaper MPI collective communications (MG uses a lot of these, implemented in its communication library); and fewer halo cells (less data communicated, less memory required). Note: the simple 1D decomposition of the domain requires more halo cells per MPI process than a 2D or 3D decomposition would, so mixed-mode, which needs fewer MPI processes and hence fewer halo cells in total, helps here.
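A rough, illustrative calculation (numbers not from the talk): in a 1D slab decomposition of an N³ grid, each interior process needs about 2 × N² halo cells no matter how many processes there are, so total halo storage and halo traffic grow roughly linearly with the number of MPI processes. For the 200³ case that is 2 × 200² = 80,000 halo cells per process; with 100 MPI processes that is about 8 million halo cells, as many as the real cells, while replacing every 4 MPI processes by 1 process with 4 threads (25 processes) cuts the total to about 2 million.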

On the other hand, adding OpenMP code may require additional synchronization (barriers, critical regions, etc.) for the threads, which is bad for performance. Also, with only one thread (the master) used for communication, we will not be using the system's maximum bandwidth potential.

The Actual Work

All loops in the "step" function are linked list traversals! Linked list traversal example (pseudocode):

    pointer = first cell
    while (pointer != NULL) {
        - Do Work on current pointer / cell -
        pointer = next cell
    }
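Rendered in C with a hypothetical cell type (the real MG cell structure is of course richer), such a traversal might look like:

    /* Hypothetical cell type, for illustration only. */
    typedef struct cell {
        double u;            /* stand-in for the per-cell data */
        struct cell *next;   /* next cell in the linked list */
    } cell_t;

    void traverse(cell_t *first_cell)
    {
        cell_t *c = first_cell;
        while (c != NULL) {
            c->u += 1.0;     /* do work on the current cell */
            c = c->next;     /* move to the next cell */
        }
    }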

Linked list traversals use a while loop: iterations continue until the final element of the list is reached. In other words, the next element the loop will work on is not known until the end of the current iteration, and there are no well-defined loop boundaries. We can't use a simple OpenMP "parallel for" directive to parallelize these loops!

Manual Parallelization of Linked List Traversals

A straightforward way to parallelize a linked list traversal: transform the while loop into a for loop, which can then be parallelized with an OpenMP "parallel for" directive.

1. Count the number of cells (one loop needed).
2. Allocate an array of pointers of the appropriate size.
3. Make the array elements point to every cell (one loop needed).
4. Rewrite the original while loop as a for loop over the pointer array.

BEFORE:

    pointer = first cell
    while (pointer != NULL) {
        - Do Work on current pointer / cell -
        pointer = next cell
    }

AFTER:

    counter = 0
    pointer = first cell
    while (pointer != NULL) {
        counter += 1
        pointer = next cell
    }

    Allocate pointer array (size of counter)

    pointer = first cell
    for (i = 0; i < counter; i++) {
        pointer_array[i] = pointer
        pointer = next cell
    }

    for (i = 0; i < counter; i++) {
        pointer = pointer_array[i]
        - Do Work -
    }

 After verifying that the code still produces correct results, we are ready to introduce OpenMP to the “for” loops we wrote.

Just as in a plain OpenMP code, we must pay attention to: the data scope of the variables, and data dependencies that may lead to race conditions.

    #pragma omp parallel shared(cptr_ptr, ...)          \
                         private(t, cptr, ...)          \
                         firstprivate(cptr_counter, ...) \
                         default(none)
    {
        #pragma omp for schedule(type, chunk)
        for (t = 0; t < cptr_counter; t++) {
            cptr = cptr_ptr[t];
            /* Do Work */
            /* (...) */
        }
    }
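Putting the pieces together, a minimal self-contained sketch of the manual approach (using the same hypothetical cell type as before and a stand-in for the real per-cell work; not the actual MG code) could look like this:

    #include <stdlib.h>

    /* Hypothetical cell type, as in the earlier sketch. */
    typedef struct cell {
        double u;
        struct cell *next;
    } cell_t;

    void process_cells(cell_t *first_cell)
    {
        /* 1. count the cells */
        int n = 0;
        for (cell_t *c = first_cell; c != NULL; c = c->next)
            n++;

        /* 2. allocate the pointer array, 3. point each element at a cell */
        cell_t **cptr = malloc(n * sizeof *cptr);
        int i = 0;
        for (cell_t *c = first_cell; c != NULL; c = c->next)
            cptr[i++] = c;

        /* 4. the traversal now has known bounds and can use a
              plain OpenMP worksharing loop */
        #pragma omp parallel for schedule(static) default(none) shared(cptr, n)
        for (int j = 0; j < n; j++) {
            cptr[j]->u += 1.0;   /* stand-in for the real per-cell work */
        }

        free(cptr);
    }

Compared with the Tasks version discussed later, this adds more code but avoids per-element tasking overhead.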

After introducing OpenMP into the code and verifying correctness, performance tests were run to evaluate its performance as a plain OpenMP code. Tests covered different problem sizes and different numbers of threads (1, 2, 4, 8).

Figure 4: Execution time versus number of threads, for the second-order step loops (size: 200³ cells)

Figure 5: Speedup versus number of threads, for the second-order step loops (size: 200³ cells)

Almost ideal speedup for up to 4 threads; with 8 threads, the two heaviest loops continue to show decent speedup. Similar results for the smaller problem size (100³ cells), only with less speedup. In mixed mode the cells will be distributed among MPI processes, so it will be interesting to see whether we still get speedup there.

Parallelization of Linked List Traversals Using OpenMP Tasks

OpenMP Tasks: a feature introduced with OpenMP 3.0. The Task construct basically wraps up a block of code and its corresponding data and schedules it for execution by a thread. OpenMP Tasks allow the parallelization of a wider variety of loops, making OpenMP more flexible.

 The Task construct is the right tool for parallelizing a “while” loop with OpenMP.  Each iteration of the “while” loop can be a Task.  Using Tasks is an elegant method for our case, leading to cleaner code with minimal additions.

BEFORE:

    pointer = first cell
    while (pointer != NULL) {
        - Do Work on current pointer / cell -
        pointer = next cell
    }

AFTER:

    #pragma omp parallel
    {
        #pragma omp single
        {
            pointer = first cell
            while (pointer != NULL) {
                #pragma omp task
                {
                    - Do Work on current pointer / cell -
                }
                pointer = next cell
            }
        }
    }
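In C, with the same hypothetical cell type as before (again a sketch, not the MG source), the task version might look like this. One detail worth noting: the list pointer is firstprivate to each task by default, so every task keeps working on the cell that was current when the task was created.

    void process_cells_tasks(cell_t *first_cell)
    {
        #pragma omp parallel
        {
            /* one thread walks the list and creates the tasks,
               the whole team executes them */
            #pragma omp single
            {
                cell_t *c = first_cell;
                while (c != NULL) {
                    #pragma omp task
                    {
                        /* c is firstprivate to the task by default */
                        c->u += 1.0;   /* stand-in for the real per-cell work */
                    }
                    c = c->next;
                }
            }
            /* all tasks are guaranteed to have completed at the implicit
               barrier that ends the single / parallel region */
        }
    }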

 Using OpenMP Tasks, we were able to parallelize the linked list traversal by just adding OpenMP directives!  Fewer additions to the code, elegant method.  Usual OpenMP work still applies: data scope and dependencies need to be resolved.

Figure 6: Execution time versus number of threads, for the second-order step loops, using Tasks (size: 200³ cells)

Figure 7: Speedup versus number of threads, for the second-order step loops, using Tasks (size: 200³ cells)

Figure 8: OpenMP Task creation and dispatch overhead versus number of threads¹.

1. J. M. Bull, F. Reid and N. McDonnell, "A Microbenchmark Suite for OpenMP Tasks", 8th International Workshop on OpenMP (IWOMP 2012), Rome, Italy, June 11-13, 2012, Proceedings.

For the current code, performance tests show that, on one thread, creating and dispatching the Tasks takes roughly as much time as completing them. With more threads it gets much worse (remember the logarithmic axis in the previous graph).

The problem: a very large number of Tasks, none of them heavy enough to justify the huge overheads. Despite being elegant and clear, OpenMP Tasks are clearly not the way to go here. We could try different strategies (e.g. grouping several Tasks together), but that would cancel the benefits of Tasks (elegance and clarity).

Manual parallelization of the linked list traversals will therefore be used for our mixed-mode MPI+OpenMP implementation of this particular code. It may be ugly and inelegant, but it gets things done. In defense of Tasks: if the code had been written with the intent of using OpenMP Tasks, things could have been different.

Avoiding Race Conditions Without Losing The Race

 Additional synchronization required by OpenMP can prove to be very harmful for the performance of the mixed-mode code.  While race conditions need to be avoided at all costs, this must be done in the least expensive way possible.

At a certain point, the code needs to find the maximum value of an array. While trivial in serial, with OpenMP this is a race condition waiting to happen: what happens if (when) two or more threads try to write to "max" at the same time? Part of the loop to be parallelized with OpenMP:

    for (i = 0; i < n; i++) {
        if (a[i] > max) {
            max = a[i];
        }
    }

Two ways to tackle this: 1. Critical Regions, 2. manually (temporary shared arrays).

With a Critical Region we can easily avoid the race condition; however, Critical Regions are very bad for performance. Question: put only the "find max" instruction inside the critical region, or the whole loop?

    for (i = 0; i < n; i++) {
        #pragma omp critical
        if (a[i] > max) {
            max = a[i];
        }
    }

Now only one thread at a time can be inside the critical block.
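As an aside not explored in the talk: OpenMP 3.1 and later also offer max as a built-in reduction operator in C/C++, which expresses the same idea without an explicit critical region. A minimal sketch, assuming a recent enough compiler:

    /* Requires the max reduction introduced in OpenMP 3.1. */
    double maxval = a[0];
    #pragma omp parallel for reduction(max:maxval)
    for (int i = 1; i < n; i++) {
        if (a[i] > maxval)
            maxval = a[i];
    }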

Temporary shared array method (slide example with a shared data array and 4 threads): each thread writes its own maximum into its corresponding element of a small temporary shared array, and a single thread then picks out the total maximum from that array.
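A sketch of this approach in C (illustrative only; the function and variable names are not MG's):

    #include <omp.h>
    #include <stdlib.h>

    /* Find the maximum of a[0..n-1] using one slot per thread. */
    double array_max(const double *a, int n)
    {
        double *thread_max;   /* temporary shared array, one slot per thread */
        int nthreads;
        double max;

        #pragma omp parallel shared(a, n, thread_max, nthreads)
        {
            #pragma omp single
            {
                nthreads = omp_get_num_threads();
                thread_max = malloc(nthreads * sizeof *thread_max);
            }   /* implicit barrier: thread_max is allocated before use */

            double local_max = a[0];

            #pragma omp for
            for (int i = 0; i < n; i++)
                if (a[i] > local_max)
                    local_max = a[i];

            /* each thread writes its own maximum to its own slot */
            thread_max[omp_get_thread_num()] = local_max;
        }   /* implicit barrier: all slots are written */

        /* a single thread picks out the total maximum */
        max = thread_max[0];
        for (int t = 1; t < nthreads; t++)
            if (thread_max[t] > max)
                max = thread_max[t];

        free(thread_max);
        return max;
    }

This is essentially a hand-rolled reduction: the only shared writes are to distinct array elements, so no critical region is needed. (Padding thread_max to cache-line size can avoid false sharing; omitted here for brevity.)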

Benchmarks were carried out, measuring execution time for the "find maximum" loop only. Three cases were tested: (1) a Critical Region with only the single "find max" instruction inside; (2) a Critical Region with the whole "find max" loop inside; (3) temporary arrays.

Figure 9: Execution time versus number of threads (size: 200³ cells).

Figure 10: Speedup versus number of threads (size: 200³ cells).

The temporary array method is clearly the winner. However, it needs additional code, and smaller problem sizes give smaller performance gains as more threads are used (nothing we can do about that, though).

Mixed-Mode Performance

 The code was tested in mixed-mode with 2, 4 and 8 threads per MPI Process.  Same variation in problem size as before (100³, 200³, 300³ cells).  Representative results will be shown here.

Figure 11: Time versus number of threads, 2 threads per MPI Proc.

Figure 12: Time versus number of threads, 4 threads per MPI Proc.

Figure 13: Time versus number of threads, 8 threads per MPI Proc.

Figure 14: Speedup versus number of threads, all combinations

Figure 15: Speedup versus number of threads, all combinations

Figure 16: Speedup versus number of threads, all combinations

Mixed-Mode outperforms the original MPI-only implementation for the higher processor numbers tested. MPI-only performs better than (or about the same as) mixed mode for the lower processor numbers tested. Mixed-Mode with 4 threads per MPI Process is the best choice for the problem sizes tested.

Figure 17: Memory usage versus number of PEs, 8 threads per MPI Process (200³ cells)

Was Mixed-Mode Any Good Here?

For the problem sizes and processor numbers tested, Mixed-Mode performed as well as or better than pure MPI. At the higher processor numbers, Mixed-Mode manages to achieve speedup where pure MPI slows down. Mixed-Mode also required significantly less memory.

Any possible performance benefits from a Mixed-Mode implementation for this code? Answer: yes. For larger numbers of processors (> 256), a mixed-mode implementation of this code provides speedup instead of slow-down, and uses less memory.