Multi-core CPU Computing Straightforward with OpenMP (presentation transcript)

1 Multi-core CPU Computing Straightforward with OpenMP
By: Maurice Peemen Date:

2 Why parallel computing?
Need to process more data
Process data in less time
CPU clock speed does not increase
Go parallel!

3 Parallel computing
More Instruction Level Parallelism (ILP)?
  Easy, but limited by the amount of ILP in your code
Vector data path: SIMD with SSE instructions?
  Efficient, but complex for the programmer
Multi-core: divide the program over multiple cores
  How to migrate single-core code to multi-core?
  In many cases quite easy, with OpenMP as the interface

4 Multi-Threaded
Started with hyper-threading, moved on to multi-core
Utilize these cores with threads?

5 Fork and join programming model
[Diagram: the initial (master) thread forks a team of collaborating worker threads; each thread runs on its own CPU and all share one memory. At the join, only the original master thread continues.]

6 Fork and join example
Speeding up parts of the application with parallelism
We use OpenMP to implement these operations

7 What is OpenMP?
API for shared-memory parallel programming, in the form of:
  Compiler directives: #pragma omp parallel
  Library functions: omp_get_num_threads()
  Environment variables: OMP_NUM_THREADS=4
No additional parallelization effort for development, maintenance, etc.
Supported by mainstream compilers for C/C++ and Fortran

8 A simple example: saxpy operation

  const int n = 10000;
  float x[n], y[n], a;
  int i;

  for (i = 0; i < n; i++) {
      y[i] = a * x[i] + y[i];
  }

We want to parallelize this loop using OpenMP

9 A simple example: saxpy operation

  const int n = 10000;
  float x[n], y[n], a;
  int i;

  #pragma omp parallel for   // OpenMP directive
  for (i = 0; i < n; i++) {
      y[i] = a * x[i] + y[i];
  }

The loop is parallelized. That's it!

10 A simple example: saxpy operation

#pragma omp parallel creates a team of threads; num_threads(3) explicitly specifies the number of threads:

  const int n = 10000;
  float x[n], y[n], a;
  int i;

  #pragma omp parallel num_threads(3)
  {
      #pragma omp for
      for (i = 0; i < n; i++) {
          y[i] = a * x[i] + y[i];
      }
  }

Without num_threads, the runtime chooses the team size:

  #pragma omp parallel
  {
      #pragma omp for
      for (i = 0; i < n; i++) {
          y[i] = a * x[i] + y[i];
      }
  }

The for directive divides the work over the threads.

11 Why does this work?

  const int n = 10000;
  float x[n], y[n], a;
  int i;

  #pragma omp parallel for
  for (i = 0; i < n; i++) {
      y[i] = a * x[i] + y[i];
  }

Loop index i is private (the OpenMP default)
  Each thread maintains its own i value and range
  The private variable i becomes undefined after the parallel for
Everything else is shared (the OpenMP default)
  All threads update y, but at different memory locations
  a, n, x are read-only (so they are safe to share)
The OpenMP defaults can be changed

12 More about the loop index

Suppose we incorrectly declare the loop index as shared:

  #pragma omp parallel for shared(i)
  for (i = 0; i < n; i++) {
      y[i] = a * x[i] + y[i];
  }

Some compilers may complain, but others don't detect the error:

  $ gcc -fopenmp loop-index.c -o loop-index
  $

13 Nested loop

  #pragma omp parallel for
  for (j = 0; j < n; j++) {
      for (i = 0; i < n; i++) {
          // statement
      }
  }

By default only j is private: the j-loop is bound to the parallel for.
If we want both i and j to be private:

  #pragma omp parallel for private(i)

14 A more complicated step-by-step example
Compute π
Processing time: ms

15 Single Program Multiple Data (SPMD)
The total workload (the number of steps) is divided over the threads
[Diagram: the steps split across thread1, thread2, thread3, thread4]

16 Create a team of threads

  #define numthreads 4

Processing time, 4 threads: 4412 ms? Problem!
Processing time, single thread: 953 ms
False sharing: each thread has its own partial_sum[id], but defined as an array the partial sums sit in consecutive memory locations, so they can share a cache line.

17 Remove false sharing

Processing time, 4 threads: 253 ms
Processing time, single thread: 953 ms
A compiler directive marks the shared update as a critical region. Check the learning material for details.

18 Use the loop directive

The reduction directive; check the learning material for details.
Processing time, 4 threads: 246 ms
Processing time, single thread: 953 ms

19 Other important contents

Variable types: shared, private, firstprivate, etc.
Synchronization: atomic, ordered, barrier, etc.
Scheduling: static, dynamic, guided
Compiling with OpenMP is very simple:
  GCC: add the compiler flag -fopenmp
  Optionally add #include "omp.h"

20 The tutorial application
Underwater image correction, for hands-on experience with OpenMP
Tomorrow also SIMD vectorization with SSE

21 Application: Underwater Image Correction
Effects that distort the underwater image: diffusion of blue light
Much improvement after histogram adjustment

22 Simplified correction pipeline
Four simple steps to correct the diffusion: stretch the important part of the luminance channel
1. RGB2YCbCr color conversion
2. Y histogram
3. Adjust histogram
4. YCbCr2RGB color conversion

23 RGB 2 YCbCr
RGB image → YCbCr image (Y, Cb, Cr channels)

  Y  =  16 + 0.257*R + 0.504*G + 0.098*B
  Cb = 128 - 0.148*R - 0.291*G + 0.439*B
  Cr = 128 + 0.439*R - 0.368*G - 0.071*B

24 Y channel histogram
Construct the Y channel histogram

25 Adjust histogram
Compute the cumulative distribution function (CDF)
Use it to cut 1% from both sides of the histogram
The range between the 1% and 99% points is stretched over 0 to 255
Build a LUT to stretch the Y channel

26 Y channel improvement
[Images: Y channel before and after adjustment]

27 YCbCr 2 RGB

  Y  = Y  - 16
  Cb = Cb - 128
  Cr = Cr - 128

  R = 1.169*Y + 1.602*Cr
  G = 1.169*Y - 0.394*Cb - 0.816*Cr
  B = 1.169*Y + 2.025*Cb

Clip the RGB values to [0, 255]

28 Resulting image

