Multi-core CPU Computing Straightforward with OpenMP

By: Maurice Peemen. Date: 25-9-2012.

Why parallel computing?
- We need to process more data, and process it in less time.
- CPU clock speeds no longer increase.
So: go parallel!

Parallel computing options:
- More instruction-level parallelism (ILP)? Easy, but limited by the amount of ILP in your code.
- A vector data path, i.e. SIMD via SSE instructions? Efficient, but complex for the programmer.
- Multi-core: divide the program over multiple cores.
How do you migrate single-core code to multi-core? In many cases quite easily, with OpenMP as the interface.

Multi-threaded hardware started with hyper-threading and moved on to multi-core. How do we utilize these cores with threads?

Fork and join programming model:
- Fork: the initial (master) thread spawns a team of collaborating worker threads.
- Each thread runs on its own CPU; all threads share one memory.
- Join: the team finishes and only the original (master) thread continues.
[Figure: a master thread forks into a team of threads across four CPUs attached to a shared memory, then joins back into a single thread.]

Fork and join example: speeding up parts of the application with parallelism. We use OpenMP to implement these fork and join operations.

What is OpenMP? An API for shared-memory parallel programming, in the form of:
- compiler directives, e.g. #pragma omp parallel
- library functions, e.g. omp_get_num_threads()
- environment variables, e.g. OMP_NUM_THREADS=4
It requires little additional parallelization effort for development, maintenance, etc., and is supported by mainstream C/C++ and Fortran compilers.
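
As a minimal sketch of all three mechanisms (not part of the original slides), the following program forks a team of threads with a directive, queries the team with library functions, and takes its thread count from the OMP_NUM_THREADS environment variable:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Fork a team of threads; the team size comes from OMP_NUM_THREADS
       unless overridden by a num_threads clause or a library call. */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();       /* library function */
        int total = omp_get_num_threads();   /* library function */
        printf("Hello from thread %d of %d\n", id, total);
    }
    return 0;
}

Compile with gcc -fopenmp hello.c -o hello and run with OMP_NUM_THREADS=4 ./hello.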

A simple example: the saxpy operation.

const int n = 10000;
float x[n], y[n], a;
int i;

for (i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

We want to parallelize this loop using OpenMP.

A simple example: saxpy with an OpenMP directive.

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for    /* OpenMP directive */
for (i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

The loop is parallelized. That's it!

The combined directive is shorthand for two separate directives: parallel creates a team of threads, and for divides the work over the threads. The number of threads can also be specified explicitly:

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel        /* creates a team of threads */
{
    #pragma omp for         /* divides the work over the threads */
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

#pragma omp parallel num_threads(3)   /* explicitly specify 3 threads */
{
    #pragma omp for
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

Why does this work?
- Loop index i is private (the OpenMP default): each thread maintains its own i value and range, and the private variable i becomes undefined after the parallel for.
- Everything else is shared (the OpenMP default): all threads update y, but at different memory locations; a, n, and x are read-only, so they are safe to share.
These OpenMP defaults can be changed.

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for
for (i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

More about the loop index: suppose we incorrectly use a shared loop index. Some compilers may complain, but others don't detect the error:

#pragma omp parallel for shared(i)
for (i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

$ gcc -fopenmp loop-index.c -o loop-index
$

Nested loop: by default only j is private, because only the j-loop is bound to the parallel for.

#pragma omp parallel for
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        // statement
    }
}

We want both i and j to be private, so we write:

#pragma omp parallel for private(i)
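
Putting both pieces together, a sketch of the corrected nested loop:

#pragma omp parallel for private(i)
for (j = 0; j < n; j++) {       /* j: private, as the parallelized loop index */
    for (i = 0; i < n; i++) {   /* i: private per thread via the clause */
        // statement
    }
}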

A more complicated example, built up step by step: compute pi. Single-thread processing time: 953.144 ms.
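
The transcript does not preserve the code on this slide; a minimal serial sketch, assuming the usual midpoint-rule integration of 4/(1+x^2) over [0,1] (which equals pi) and the step count from the next slide:

static long num_steps = 100000000;

double compute_pi(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;     /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}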

Single Program Multiple Data (SPMD): the total workload of 100,000,000 steps is divided over threads 1 through 4.
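
A sketch of the SPMD version this slide implies, reusing num_steps from the serial sketch: each thread accumulates its own stride of the iteration space into partial_sum[id], the array that triggers the problem on the next slide.

#include <omp.h>
#define NUM_THREADS 4

double compute_pi_spmd(void)
{
    double step = 1.0 / (double)num_steps;
    double partial_sum[NUM_THREADS];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        partial_sum[id] = 0.0;
        /* Cyclic distribution: thread id takes iterations id, id+nthreads, ... */
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            partial_sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    double sum = 0.0;
    for (int t = 0; t < NUM_THREADS; t++)
        sum += partial_sum[t];
    return step * sum;
}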

Create a team of threads with #define NUM_THREADS 4. Processing time with 4 threads: 4412 ms, against 953 ms for a single thread. Problem! The cause is false sharing: each thread has its own partial_sum[id], but defined as an array the partial sums sit in consecutive memory locations and can share a cache line, so every update invalidates that line in the other cores' caches.

Remove the false sharing: each thread accumulates into a private scalar and commits it to the shared sum under the critical compiler directive, which marks a critical region (check the learning material for details). Processing time with 4 threads: 253 ms, against 953 ms single-threaded.
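
A sketch of the fix as described: accumulate into a thread-private scalar (which lives on each thread's own stack, so no cache line is shared) and add it to the shared sum once, inside a critical region.

double compute_pi_critical(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        double local = 0.0;              /* thread-private accumulator */
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        #pragma omp critical             /* one thread at a time updates sum */
        sum += local;
    }
    return step * sum;
}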

Use the loop directive with the reduction clause (check the learning material for details). Processing time with 4 threads: 246 ms, against 953 ms single-threaded.
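
A sketch of the loop-directive version with a reduction clause: OpenMP gives every thread a private copy of sum and combines the copies when the loop ends.

double compute_pi_reduction(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}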

Other important content:
- variable types: shared, private, firstprivate, etc.
- synchronization: atomic, ordered, barrier, etc.
- scheduling: static, dynamic, guided
Compiling with OpenMP is very simple: with GCC, add the compiler flag -fopenmp and optionally add #include <omp.h> (required if you call the library functions).
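
For instance, a small sketch of the scheduling clause (the loop body work(i) is a hypothetical function, not from the slides):

/* Hand out chunks of 8 iterations to idle threads at run time;
   useful when iteration costs vary. */
#pragma omp parallel for schedule(dynamic, 8)
for (int i = 0; i < n; i++) {
    work(i);
}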

The tutorial application: underwater image correction, for hands-on experience with OpenMP. Tomorrow the same application returns with SIMD vectorization using SSE.

Application: underwater image correction. Diffusion of blue light is the main effect that distorts the underwater image; histogram adjustment yields much improvement.

Simplified correction pipeline: four simple steps that correct the diffusion by stretching the important part of the luminance channel:
1. RGB to YCbCr color conversion
2. Y-channel histogram
3. Histogram adjustment
4. YCbCr to RGB color conversion

RGB to YCbCr: from the RGB image to a YCbCr image with separate Y, Cb, and Cr channels.

Y  =  16 + 0.257*R + 0.504*G + 0.098*B
Cb = 128 - 0.148*R - 0.291*G + 0.439*B
Cr = 128 + 0.439*R - 0.368*G - 0.071*B
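
A sketch of how this per-pixel conversion could be parallelized with OpenMP; the interleaved-RGB input, planar output, and function name are assumptions, not from the slides.

void rgb2ycbcr(const unsigned char *rgb,
               float *Y, float *Cb, float *Cr, int npixels)
{
    #pragma omp parallel for
    for (int i = 0; i < npixels; i++) {
        float R = rgb[3*i], G = rgb[3*i+1], B = rgb[3*i+2];
        Y[i]  =  16.0f + 0.257f*R + 0.504f*G + 0.098f*B;
        Cb[i] = 128.0f - 0.148f*R - 0.291f*G + 0.439f*B;
        Cr[i] = 128.0f + 0.439f*R - 0.368f*G - 0.071f*B;
    }
}

Each pixel is independent, so the plain parallel for needs no further clauses.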

Y-channel histogram: construct the histogram of the image's Y channel.
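
A sketch of a parallel histogram (names assumed): the 256 bins are shared between threads, so each increment is protected with the atomic directive.

void y_histogram(const float *Y, int npixels, int hist[256])
{
    for (int b = 0; b < 256; b++)
        hist[b] = 0;
    #pragma omp parallel for
    for (int i = 0; i < npixels; i++) {
        int bin = (int)Y[i];
        if (bin < 0)   bin = 0;          /* clamp to the valid bin range */
        if (bin > 255) bin = 255;
        #pragma omp atomic               /* bins are shared state */
        hist[bin]++;
    }
}

Under heavy contention, giving each thread a private histogram and merging them afterwards usually scales better; atomic keeps the sketch short.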

Adjust the histogram: compute the cumulative distribution function (CDF) and use it to cut 1% from both sides of the histogram; the 1% and 99% points are then stretched over 0 to 255. The result is a LUT that stretches the Y channel.
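
A serial sketch of the LUT construction (the 256-bin scan is trivially cheap next to the pixel loops); the exact percentile handling is an assumption.

void build_stretch_lut(const int hist[256], int npixels,
                       unsigned char lut[256])
{
    /* Cumulative distribution function of the Y histogram. */
    int cdf[256], acc = 0;
    for (int b = 0; b < 256; b++) { acc += hist[b]; cdf[b] = acc; }

    /* Locate the 1% and 99% points. */
    int lo = 0, hi = 255;
    while (lo < 255 && cdf[lo] < npixels / 100) lo++;
    while (hi > 0 && cdf[hi] > npixels - npixels / 100) hi--;
    if (hi <= lo) { lo = 0; hi = 255; }  /* degenerate histogram: identity */

    /* Stretch [lo, hi] linearly over [0, 255], clipping outside. */
    for (int b = 0; b < 256; b++) {
        int v = (b <= lo) ? 0
              : (b >= hi) ? 255
              : (b - lo) * 255 / (hi - lo);
        lut[b] = (unsigned char)v;
    }
}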

Y channel improvement: before adjustment vs. after adjustment.

YCbCr to RGB, clipping the resulting RGB values to [0, 255]:

Y  = Y  - 16
Cb = Cb - 128
Cr = Cr - 128
R = 1.169*Y + 1.602*Cr
G = 1.169*Y - 0.393*Cb - 0.816*Cr
B = 1.169*Y + 2.025*Cb
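
A matching sketch of the inverse conversion with the [0, 255] clip, under the same assumed data layout as the forward step:

static unsigned char clip255(float v)
{
    return (unsigned char)(v < 0.0f ? 0.0f : (v > 255.0f ? 255.0f : v));
}

void ycbcr2rgb(const float *Y, const float *Cb, const float *Cr,
               unsigned char *rgb, int npixels)
{
    #pragma omp parallel for
    for (int i = 0; i < npixels; i++) {
        float y  = Y[i]  - 16.0f;
        float cb = Cb[i] - 128.0f;
        float cr = Cr[i] - 128.0f;
        rgb[3*i]   = clip255(1.169f*y + 1.602f*cr);             /* R */
        rgb[3*i+1] = clip255(1.169f*y - 0.393f*cb - 0.816f*cr); /* G */
        rgb[3*i+2] = clip255(1.169f*y + 2.025f*cb);             /* B */
    }
}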

Resulting image.