– 1 – Basic Machine Independent Performance Optimizations. Topics: load balancing (review, already discussed), in the context of OpenMP notation; performance optimizations by code restructuring, also in the context of OpenMP notation.

– 2 – OpenMP Implementation Overview. OpenMP is implemented by a combination of the compiler and a runtime library, unlike Pthreads, which is purely a library.
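A minimal sketch of that split (the printed message is only for illustration): the #pragma below is handled by the compiler, while omp_get_thread_num() is a routine supplied by the OpenMP runtime library.

    #include <stdio.h>
    #include <omp.h>                         /* interface to the runtime library */

    int main(void)
    {
        #pragma omp parallel                 /* directive: translated by the compiler */
        {
            int id = omp_get_thread_num();   /* call into the runtime library */
            printf("hello from thread %d\n", id);
        }
        return 0;
    }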

– 3 – OpenMP Example Usage (1 of 2). [Diagram: annotated source is fed to the OpenMP compiler; a compiler switch selects whether it is translated into a sequential program or a parallel program.]

– 4 – OpenMP Example Usage (2 of 2). If you compile with the sequential switch, the pragmas are ignored. If you compile with the parallel switch, the pragmas are read and cause translation into a parallel program. Ideally, a single source serves for both the sequential and the parallel program (a big maintenance plus).
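A minimal sketch of the single-source idea, assuming GCC as the compiler (other compilers use different switches); the _OPENMP macro is defined whenever the parallel switch is used, and the file name prog.c is hypothetical.

    /* one source file, e.g. prog.c:
     *   gcc -O2          prog.c   -> pragmas ignored, sequential program
     *   gcc -O2 -fopenmp prog.c   -> pragmas translated, parallel program
     */
    #include <stdio.h>

    int main(void)
    {
    #ifdef _OPENMP
        printf("built as a parallel program\n");     /* _OPENMP is set by the parallel switch */
    #else
        printf("built as a sequential program\n");
    #endif
        return 0;
    }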

– 5 – OpenMP Directives. Parallelization directives: parallel for. Data environment directives: shared, private, threadprivate, reduction, etc.
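A minimal sketch of the data environment clauses (the array a and its length n are assumed to exist elsewhere): a stays shared, t is private to each thread, and sum is combined across threads by the reduction clause.

    double sum = 0.0, t;
    #pragma omp parallel for shared(a) private(t) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        t = 2.0 * a[i];   /* t is private: each thread has its own copy    */
        sum += t;         /* per-thread partial sums are added at the end  */
    }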

– 6 – OpenMP Notation: Parallel For. With #pragma omp parallel for, a number of threads are spawned at entry to the loop. Each thread is assigned a set of iterations of the loop (e.g., a block or cyclic assignment of iterations to threads) and executes that code. Each thread waits at the end, very much like fork/join synchronization.
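A minimal sketch of the fork/join behaviour (the arrays x and y and the scalars alpha and n are assumed to exist):

    #pragma omp parallel for                 /* threads are spawned (forked) here        */
    for (int i = 0; i < n; i++)
        y[i] = y[i] + alpha * x[i];          /* iterations are divided among the threads */
    /* implicit barrier: every thread waits here before execution continues (join) */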

– 7 – API Semantics. Master thread executes sequential code. Master and slaves execute parallel code. Note: very similar to the fork-join semantics of the Pthreads create/join primitives.

– 8 – Scheduling of Iterations. Scheduling means assigning iterations to threads. OpenMP allows several scheduling strategies, such as block, cyclic, etc. For example, with 8 iterations and 2 threads, a block assignment gives thread 0 iterations 0-3 and thread 1 iterations 4-7, while a cyclic assignment gives thread 0 iterations 0, 2, 4, 6 and thread 1 iterations 1, 3, 5, 7.

– 9 – Scheduling of Iterations: Specification. #pragma omp parallel for schedule( type ), where type can be one of: block (default), cyclic.

– 10 – Example. Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0). [Figure: matrix A with zeros below the diagonal.]

– 11 – Sequential Matrix Multiply. Since A is upper-triangular, the standard sequential matrix multiply becomes:

    for( i=0; i<n; i++ )
      for( j=0; j<n; j++ ) {
        c[i][j] = 0.0;
        for( k=i; k<n; k++ )
          c[i][j] += a[i][k]*b[k][j];
      }

With a block distribution of the i-loop this causes load imbalance: the inner k-loop runs n-i iterations, so the threads that receive the later rows have much less work.

– 12 – OpenMP Matrix Multiply.

    #pragma omp parallel for schedule( cyclic )
    for( i=0; i<n; i++ )
      for( j=0; j<n; j++ ) {
        c[i][j] = 0.0;
        for( k=i; k<n; k++ )
          c[i][j] += a[i][k]*b[k][j];
      }
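The slides use the names block and cyclic; in standard OpenMP syntax these correspond to schedule(static) and schedule(static,1), so a compilable version of the loop above might look like the sketch below (the matrices a, b, c and the size n are assumed to be set up elsewhere).

    /* schedule(static,1) deals out iterations round-robin (cyclic), so the      */
    /* cheap rows at the bottom of the triangle are spread evenly over threads   */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = i; k < n; k++)          /* A is upper-triangular: k starts at i */
                c[i][j] += a[i][k] * b[k][j];
        }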

– 13 – Code Restructuring Optimizations: private variables, loop reordering, loop peeling.

– 14 – General Idea. Parallelism is limited by dependences. Restructure the code to eliminate or reduce dependences. The compiler is usually not able to do this, so it is good to know how to do it by hand.

– 15 – Example 1: Dependency on Scalar.

    for( i=0; i<n; i++ ) {
      tmp = a[i];
      a[i] = b[i];
      b[i] = tmp;
    }

Loop-carried dependence on tmp. Easily fixed by privatizing tmp.

– 16 – Fix: Scalar Privatization.

    f()
    {
      int tmp;  /* local allocation on stack */
      for( i=from; i<to; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
      }
    }

Removes the dependence on tmp.

– 17 – Fix: Scalar Privatization in OpenMP.

    #pragma omp parallel for private( tmp )
    for( i=0; i<n; i++ ) {
      tmp = a[i];
      a[i] = b[i];
      b[i] = tmp;
    }

Removes the dependence on tmp.

– 18 – Example 3: Induction Variable.

    for( i=0, index=0; i<n; i++ ) {
      index += i;
      a[i] = b[index];
    }

Dependence on index, but its value can be computed from the loop variable.

– 19 – Fix: Induction Variable Elimination.

    #pragma omp parallel for
    for( i=0; i<n; i++ ) {
      a[i] = b[i*(i+1)/2];
    }

The dependence is removed by computing the induction variable directly: after iteration i, index = 0 + 1 + ... + i = i*(i+1)/2.

– 20 – Example 4: Induction Variable.

    for( i=0, index=0; i<n; i++ ) {
      index += f(i);
      b[i] = g(a[index]);
    }

Dependence on the variable index, but there is no closed-form formula for its value.

– 21 – Fix: Loop Splitting.

    index[0] = f(0);
    for( i=1; i<n; i++ ) {
      index[i] = index[i-1] + f(i);   /* sequential prefix sum of f(i) */
    }
    #pragma omp parallel for
    for( i=0; i<n; i++ ) {
      b[i] = g(a[index[i]]);
    }

Loop splitting has removed the dependence from the second loop: the sequential first loop stores the running sum in an array, and the second loop, which does the real work, can then run in parallel.

– 22 – Example 5.

    for( k=0; k<n; k++ )
      for( i=0; i<n; i++ )
        for( j=0; j<n; j++ )
          a[i][j] += b[i][k] + c[k][j];

The dependence on a[i][j] prevents parallelization of the k-loop. No dependences are carried by the i- and j-loops.

– 23 – Example 5 Parallelization.

    for( k=0; k<n; k++ )
      #pragma omp parallel for
      for( i=0; i<n; i++ )
        for( j=0; j<n; j++ )
          a[i][j] += b[i][k] + c[k][j];

We can do better by reordering the loops.

– 24 – Optimization: Loop Reordering.

    #pragma omp parallel for
    for( i=0; i<n; i++ )
      for( j=0; j<n; j++ )
        for( k=0; k<n; k++ )
          a[i][j] += b[i][k] + c[k][j];

Larger parallel pieces of work: the parallel region is entered once instead of n times, and the dependence on a[i][j] is now carried only by the innermost, sequential k-loop.

– 25 – Example 6.

    #pragma omp parallel for
    for( i=0; i<n; i++ )
      a[i] = b[i];
    #pragma omp parallel for
    for( i=0; i<n; i++ )
      c[i] = b[i]*b[i];   /* square of b[i] */

Make the two parallel loops into one.

– 26 – Optimization: Loop Fusion.

    #pragma omp parallel for
    for( i=0; i<n; i++ ) {
      a[i] = b[i];
      c[i] = b[i]*b[i];   /* square of b[i] */
    }

Fusing the loops reduces loop startup overhead.

– 27 – Example 7: While Loops.

    while( *a ) {
      process(a);
      a++;
    }

The number of loop iterations is unknown.

– 28 – Special Case of Loop Splitting.

    for( count=0, p=a; *p; count++, p++ )
      ;                                 /* first count the number of loop iterations */
    #pragma omp parallel for
    for( i=0; i<count; i++ )
      process( &a[i] );

Count the number of loop iterations, then parallelize the loop over that count.

– 29 – Example 8.

    for( i=0, wrap=n; i<n; i++ ) {
      b[i] = a[i] + a[wrap];
      wrap = i;
    }

Dependence on wrap, but only the first iteration is special: it uses wrap = n, while every later iteration uses wrap = i-1.

– 30 – Loop Peeling.

    b[0] = a[0] + a[n];   /* peeled first iteration */
    #pragma omp parallel for
    for( i=1; i<n; i++ ) {
      b[i] = a[i] + a[i-1];
    }

– 31 – Example 10.

    for( i=0; i<n; i++ )
      a[i+m] = a[i] + b[i];

There is a dependence if m < n.

– 32 – Another Case of Loop Peeling.

    if( m > n ) {
      #pragma omp parallel for
      for( i=0; i<n; i++ )
        a[i+m] = a[i] + b[i];
    } else {
      /* … cannot be parallelized */
    }

– 33 – Summary. Reorganize code such that: dependences are removed or reduced; large pieces of parallel work emerge; loop bounds become known; … The code can become messy, and there is a point of diminishing returns.