1 0907532 Special Topics in Computer Engineering: OpenMP* Essentials (* Open Multi-Processing)

2 Topics:
 - Discussion of what OpenMP is and what it can do
 - Parallel regions
 - Work sharing: “parallel for”
 - Work queuing: “taskq”
 - Shared & private variables
 - Protection of shared data: critical sections, locks, etc.
 - Reduction clause

3 What is OpenMP? A collection of compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism. OpenMP was designed with the cooperation of many computer vendors, including Intel, HP, IBM, and SGI; for this reason it has become the standard (and therefore portable) way of programming SMPs. The Fortran directives are very similar to the C/C++ OpenMP directives.

4 OpenMP Programming Model Fork/join parallelism: the master thread spawns a team of threads as needed. Parallelism is added incrementally: the sequential program evolves into a parallel program. [Figure: fork/join threading model; the master thread forks a team at each parallel region and joins it afterward, repeated over time.]
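
A minimal sketch of the fork/join model in C (assuming an OpenMP-aware compiler, e.g. built with -fopenmp; the printf messages are purely illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
       printf("before the parallel region: only the master thread runs\n");

       #pragma omp parallel   /* fork: a team of threads is spawned */
       {
           printf("inside: one of %d threads\n", omp_get_num_threads());
       }                      /* join: the team disbands, the master continues */

       printf("after the parallel region: only the master thread runs\n");
       return 0;
   }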

5 Writing Parallel Code The section of code that is meant to run in parallel is marked with a compiler directive that causes the threads to form before the section is executed. Example: to parallelize a “for loop”, precede the loop with the compiler directive

   #pragma omp parallel for

or

   #pragma omp parallel
   #pragma omp for

6 Writing Parallel Code (continued) Each thread has an “id” attached to it. The thread id is an integer, and the master thread has an id of 0. After the execution of the parallelized code, the threads “join” back into the master thread, which continues onward to the end of the program.
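
For instance, a thread can query its own id with the standard library routine omp_get_thread_num() and branch on it; a brief sketch:

   #pragma omp parallel
   {
       int id = omp_get_thread_num();   /* the master thread gets id 0 */
       if (id == 0)
           printf("master thread (id 0)\n");
       else
           printf("worker thread, id %d\n", id);
   }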

7 Example Loop

   for (i = 1; i < 13; i++)
       c[i] = a[i] + b[i];

On a three-core chip, each thread is assigned 4 of the 12 iterations and executes on a different core:
   Thread 0 executes i = 1, 2, 3, 4
   Thread 1 executes i = 5, 6, 7, 8
   Thread 2 executes i = 9, 10, 11, 12

8 Work-sharing Construct Threads are assigned independent sets of iterations, and threads must wait at the end of the work-sharing construct (an implicit barrier).

   #pragma omp parallel
   #pragma omp for
   for (i = 1; i < 13; i++)
       c[i] = a[i] + b[i];

or

   #pragma omp parallel for
   for (i = 1; i < 13; i++)
       c[i] = a[i] + b[i];

[Figure: iterations i = 1 .. 12 divided among the threads, with an implicit barrier at the end of the construct.]
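
Putting the pieces together, a complete version of the 12-iteration example might look like the sketch below (the array contents are made up for illustration; the exact iteration-to-thread mapping is up to the runtime):

   #include <stdio.h>
   #include <omp.h>

   #define N 13

   int main(void)
   {
       int a[N], b[N], c[N], i;

       for (i = 1; i < N; i++) { a[i] = i; b[i] = 10 * i; }   /* fill inputs */

       #pragma omp parallel for
       for (i = 1; i < N; i++) {
           c[i] = a[i] + b[i];
           printf("i = %2d computed by thread %d\n", i, omp_get_thread_num());
       }
       /* implicit barrier: all threads finish before execution continues */

       return 0;
   }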

9 Race Condition A race condition is nondeterministic behavior caused by the times at which two or more threads access a shared variable.

10 Race Condition For example, suppose both Thread A and Thread B are executing the statement

   area += 4.0 / (1.0 + x*x);

If the variable area is shared, then we can run into a race condition: nondeterministic behavior caused by the times at which the threads access the shared variable area.

11 Two Timings (first timing: the updates do not overlap)

   Value of area   Thread A   Thread B
   11.667
                   +3.765
   15.432
                              +3.563
   18.995

12 Two Timings

   First timing (no overlap):
   Value of area   Thread A        Thread B
   11.667
                   +3.765
   15.432
                                   +3.563
   18.995

   Second timing (a data race):
   Value of area   Thread A        Thread B
   11.667          reads 11.667
                   +3.765          reads 11.667
   15.432          writes 15.432
                                   +3.563
   15.230                          writes 15.230

The order of thread execution causes nondeterministic behavior in a data race: in the second timing, Thread A's update is lost.
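
One standard fix for this particular pattern is the reduction clause from the topics list: each thread accumulates into its own private copy of area, and the copies are combined safely at the join. A hedged sketch (the loop bounds and the midpoint-rule details are assumptions for illustration, not from the slides):

   #include <stdio.h>

   int main(void)
   {
       const int n = 1000000;
       double area = 0.0, x;
       int i;

       #pragma omp parallel for private(x) reduction(+:area)
       for (i = 0; i < n; i++) {
           x = (i + 0.5) / n;             /* midpoint of strip i */
           area += 4.0 / (1.0 + x * x);   /* safe: each thread sums privately */
       }
       printf("pi is approximately %f\n", area / n);
       return 0;
   }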

13 The Private Clause Can you spot the race condition? (a and b here are assumed to be file-scope arrays.)

   void* work(float* c, int N)
   {
       float x, y;
       int i;
       #pragma omp parallel for
       for (i = 0; i < N; i++) {
           x = a[i];
           y = b[i];
           c[i] = x + y;
       }
   }

Since x and y are declared outside the loop, they are shared by all threads. Make x and y private to each thread to resolve the race condition:

   #pragma omp parallel for private(x, y)
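
An equivalent fix, worth noting, is to declare x and y inside the loop body; variables declared inside the parallel region are automatically private to each thread:

   #pragma omp parallel for
   for (i = 0; i < N; i++) {
       float x = a[i];   /* local to the loop body, hence private per thread */
       float y = b[i];
       c[i] = x + y;
   }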

14 Cache Effects Poor use of cache can degrade performance by a factor of 10 in some loops. Techniques for exploiting cache reuse:
 - Loop interchange
 - Cache blocking
A related pitfall to watch for is false sharing.

15 Loop Interchange

   for (i = 0; i < NUM; i++)
       for (j = 0; j < NUM; j++)
           for (k = 0; k < NUM; k++)
               c[i][j] = c[i][j] + a[i][k] * b[k][j];

16 Loop Interchange Interchanging the j and k loops makes the fast (innermost) loop index j match the unit-stride dimension of c and b:

   Before:
   for (i = 0; i < NUM; i++)
       for (j = 0; j < NUM; j++)
           for (k = 0; k < NUM; k++)
               c[i][j] = c[i][j] + a[i][k] * b[k][j];

   After:
   for (i = 0; i < NUM; i++)
       for (k = 0; k < NUM; k++)
           for (j = 0; j < NUM; j++)
               c[i][j] = c[i][j] + a[i][k] * b[k][j];

Non-unit-stride skipping through memory can cause cache thrashing, particularly for array sizes that are powers of two (2^n).
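
Cache blocking, the second technique on the earlier list, goes a step further: the loops are tiled so that each small sub-matrix is reused while it is still resident in cache. A minimal sketch (BLOCK is a tuning parameter, assumed here to divide NUM evenly and to be sized so three tiles fit in cache):

   #define BLOCK 64   /* tile size: an assumption, tune to the cache */

   for (ii = 0; ii < NUM; ii += BLOCK)
       for (kk = 0; kk < NUM; kk += BLOCK)
           for (jj = 0; jj < NUM; jj += BLOCK)
               /* multiply one tile; its data stays cache-resident */
               for (i = ii; i < ii + BLOCK; i++)
                   for (k = kk; k < kk + BLOCK; k++)
                       for (j = jj; j < jj + BLOCK; j++)
                           c[i][j] = c[i][j] + a[i][k] * b[k][j];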

17 Poor Cache Utilization, with Eggs In this analogy, a carton represents a cache line, the refrigerator represents main memory, and the table (next to a pan ready to fry eggs) represents the cache. A request for an egg not already on the table brings a whole new carton of eggs from the refrigerator, but the user fries only one egg from each carton. The user requests one specific egg, then a 2nd specific egg, then a 3rd; when the table fills up, an old carton is evicted and most of its eggs are wasted.

18 Good Cache Utilization, with Eggs A request for one egg still brings a new carton from the refrigerator, but now the user fries all the eggs in a carton before requesting an egg from the next carton, eventually asking for all the eggs. Just as for the previous user, cartons are evicted when the table fills up, but eviction no longer hurts: all the eggs in the cartons on the table have already been fried.
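
False sharing, the third item on the earlier cache list, is the inverse problem: two threads repeatedly write different variables that happen to sit in the same cache line (the same carton), so the line ping-pongs between cores. A hedged sketch of the symptom and the usual padding fix (the 64-byte line size is an assumption; it varies by processor):

   #define NUM_THREADS 4
   #define LINE 64   /* assumed cache-line size in bytes */

   /* symptom: adjacent per-thread counters share a cache line, so an
      update by one thread invalidates the other threads' cached copies */
   long sum[NUM_THREADS];

   /* fix: pad each counter so it occupies a cache line of its own;
      each thread then updates only sum_padded[id].value */
   struct padded { long value; char pad[LINE - sizeof(long)]; };
   struct padded sum_padded[NUM_THREADS];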

