1 Programming with Shared Memory

2 Content: Introduction, Cilk, TBB, OpenMP

3 Alternatives for Programming Shared Memory
- Using heavyweight processes.
- Using threads. Example: Pthreads.
- Using a completely new programming language for parallel programming (not popular). Example: Ada.
- Using library routines with an existing sequential programming language.
- Modifying the syntax of an existing sequential programming language to create a parallel programming language.
- Using an existing sequential programming language supplemented with compiler directives for specifying parallelism. Example: OpenMP.

4 Family Tree [Figure: lineage of task-parallel systems, grouped as Languages, Pragmas, and Libraries, on a timeline marked 1988, 1995, 2001, 2006. Nodes and their contributions: Chare Kernel (small tasks); Cilk (space-efficient scheduler, cache-oblivious algorithms); Threaded-C (continuation tasks, task stealing); OpenMP* (fork/join tasks); OpenMP taskqueue (while & recursion); JSR-166 FJTask (containers); STL (generic programming); STAPL (recursive ranges); ECMA .NET* (parallel iteration classes); all leading to Intel® TBB.]

5 Using Heavyweight Processes. Operating systems are often based on the notion of a process. Processor time is shared between processes by switching from one process to another, either at regular intervals or when an active process becomes delayed. This offers the opportunity to deschedule processes that are blocked from proceeding for some reason, e.g. waiting for an I/O operation to complete. The concept could be used for parallel programming. It is not much used because of the overhead, but fork/join concepts are used elsewhere.

6 FORK-JOIN construct

7 UNIX System Calls. SPMD model with different code for the master process and the forked slave process.
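The fork/join idea with UNIX system calls can be sketched as below. This is a minimal illustration, assuming a POSIX system; the function name `fork_join_demo` and the exit status 7 are made up for the example and do not come from the slides.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks one child, lets it run "slave" code, and joins it in the master.
// Returns the child's exit status, or -1 on error.
int fork_join_demo() {
    pid_t pid = fork();              // fork: both processes continue from here
    if (pid < 0) return -1;          // fork failed
    if (pid == 0) {
        // Slave (child) code: do some work, then terminate.
        _exit(7);                    // 7 is an arbitrary illustrative status
    }
    // Master (parent) code: join by waiting for the child to terminate.
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The same source text holds both roles, and the return value of fork() selects which branch a process executes.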

8 Differences between a process and threads

9 SAS (Shared Address Space) Programming Model [Figure: two threads (processes) issue read(X) and write(X) operations on a shared variable X held by the system.]

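10 Thread-Safe Routines. A routine is thread safe if it can be called from multiple threads simultaneously and always produces correct results. Standard I/O is thread safe: messages are printed without interleaving characters. System routines that return the time may not be thread safe. Routines that access shared data need special care to be made thread safe.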

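11 Accessing Shared Data. Consider two processes, each of which adds 1 to a shared variable x. Each process first reads x, then computes x + 1, and finally writes the result back.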

12 Conflict in accessing shared variable

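13 Critical Section. A critical section comprises the code and the resources that code touches. Establishing a critical section ensures that only one process accesses a particular resource at any time. This mechanism is also known as mutual exclusion.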

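14 Locks. The simplest mechanism for mutual exclusion is the lock. A lock is a 1-bit variable: 1 indicates that a process has entered the critical section, and 0 indicates that no process is in the critical section. It works like a door lock: a process arrives at the "door" of the critical section and, if it finds the door open, enters and locks the door behind it. When it has finished, it unlocks the door and leaves the critical section.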

15 Control of critical sections through busy waiting

16 Pthread Lock Routines. Locks are implemented in Pthreads with mutually exclusive lock variables, or "mutex" variables:
    pthread_mutex_lock(&mutex1);
        critical section
    pthread_mutex_unlock(&mutex1);
If a thread reaches a mutex lock and finds it locked, it waits for the lock to open. If more than one thread is waiting when the lock opens, the system selects one thread to proceed. Only the thread that locks a mutex can unlock it.
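A small sketch of the lock routines protecting the "add 1 to x" update from the earlier slides. The names `counter`, `WorkerArg`, `add_to_counter`, and `run_locked_counter` are hypothetical, and the cap of 16 threads is an arbitrary choice for the example.

```cpp
#include <pthread.h>

static long counter = 0;                                   // shared variable
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

struct WorkerArg { int iterations; };                      // hypothetical helper type

static void* add_to_counter(void* p) {
    int iterations = static_cast<WorkerArg*>(p)->iterations;
    for (int i = 0; i < iterations; ++i) {
        pthread_mutex_lock(&counter_mutex);
        ++counter;                                         // critical section
        pthread_mutex_unlock(&counter_mutex);
    }
    return nullptr;
}

// Starts up to 16 threads that each add `iterations` to the shared counter.
long run_locked_counter(int num_threads, int iterations) {
    counter = 0;
    WorkerArg arg{iterations};
    pthread_t threads[16];
    if (num_threads > 16) num_threads = 16;
    for (int t = 0; t < num_threads; ++t)
        pthread_create(&threads[t], nullptr, add_to_counter, &arg);
    for (int t = 0; t < num_threads; ++t)
        pthread_join(threads[t], nullptr);
    return counter;
}
```

Without the lock and unlock calls, the read, add, and write steps of different threads could interleave and some increments would be lost.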

17 Deadlock. Deadlock can occur when process P1 locks resource R1 and then requests resource R2, which is locked by P2, while P2 at the same time requests R1.

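18 Deadlock (deadly embrace). Deadlock can also occur in a circular fashion, where each process in a chain waits for a resource locked by the next.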
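One common way to avoid the two-lock deadlock is to acquire both locks as a single step. The sketch below uses the standard C++ `std::lock` for that; the `Account` type and the function names are assumptions made for illustration, not part of the slides.

```cpp
#include <mutex>
#include <thread>

struct Account {          // hypothetical resource guarded by its own lock
    std::mutex m;
    int balance;
};

// Locks both accounts without risking deadlock, whatever the argument order.
void transfer(Account& from, Account& to, int amount) {
    std::lock(from.m, to.m);   // acquires both using a deadlock-avoidance algorithm
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads request the same pair of locks in opposite orders.
int run_opposing_transfers() {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < 1000; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 1000; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If each thread instead locked its first account and then its second with two separate lock calls, the P1/P2 situation described above could arise.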

19 Pthreads. Offers one routine that can test whether a lock is actually closed without blocking the thread: pthread_mutex_trylock(). It locks an unlocked mutex and returns 0, or returns EBUSY if the mutex is already locked. It might find a use in overcoming deadlock.
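A minimal sketch of the non-blocking test described above. The wrapper name `try_enter` is hypothetical; only `pthread_mutex_trylock`, `pthread_mutex_unlock`, and `EBUSY` are real.

```cpp
#include <cerrno>
#include <pthread.h>

// Tries to enter a critical section without blocking.
// Returns true if the lock was obtained (and released again), false otherwise.
bool try_enter(pthread_mutex_t* m) {
    int rc = pthread_mutex_trylock(m);
    if (rc == EBUSY) return false;   // lock is held; the caller can do other work
    if (rc != 0) return false;       // some other error
    /* critical section would go here */
    pthread_mutex_unlock(m);
    return true;
}
```

A thread that gets `false` back can release the locks it already holds and retry later, which is one way to back out of a potential deadlock.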

20 Semaphores. A non-negative integer operated upon by two operations:
- P operation on semaphore s: waits until s is greater than zero, then decrements s by one and allows the process to continue.
- V operation on semaphore s: increments s by one and releases one of the waiting processes (if any).

21 P and V operations are performed indivisibly. The mechanism for activating waiting processes is also implicit in the P and V operations. Though the exact algorithm is not specified, it is expected to be fair. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.

22 Mutual exclusion of critical sections can be achieved with one semaphore having the value 0 or 1 (a binary semaphore), which acts as a lock variable, but the P and V operations include a process scheduling mechanism:
    Process 1              Process 2              Process 3
    Noncritical section    Noncritical section    Noncritical section
    ...                    ...                    ...
    P(s)                   P(s)                   P(s)
    Critical section       Critical section       Critical section
    V(s)                   V(s)                   V(s)
    ...                    ...                    ...
    Noncritical section    Noncritical section    Noncritical section

23 General semaphore (or counting semaphore). Can take on positive values other than zero and one. Provides, for example, a means of recording the number of "resource units" available or used, and can be used to solve producer/consumer problems.
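The P and V operations can be sketched on top of a mutex and a condition variable. This is an illustrative construction under that assumption, not the definition used by any particular system; the class name `Semaphore` and the `value()` accessor are added for the example, and no fairness guarantee is provided.

```cpp
#include <condition_variable>
#include <mutex>

class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) {}

    // P: wait until the count is greater than zero, then decrement it.
    void P() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    // V: increment the count and release one waiting thread, if any.
    void V() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }
    // Current count, exposed only so the example can be inspected.
    int value() {
        std::lock_guard<std::mutex> lock(m_);
        return count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};
```

Initialised to 1 it behaves as a binary semaphore guarding a critical section; initialised to the number of free buffer slots it can count "resource units" in a producer/consumer setting.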

24 Monitor. A suite of procedures that provides the only way to access a shared resource. Only one process can use a monitor procedure at any instant. Could be implemented using a semaphore or lock to protect entry, i.e.,
    monitor_proc1() {
        lock(x);
        .
        monitor body
        .
        unlock(x);
        return;
    }

25 Condition Variables. Often, a critical section is to be executed only if a specific global condition exists; for example, if a certain value of a variable has been reached. With locks, the global variable would need to be examined at frequent intervals ("polled") within a critical section. This is a very time-consuming and unproductive exercise. It can be overcome by introducing so-called condition variables.
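A sketch of waiting on a condition instead of polling, written with the standard C++ `std::condition_variable` (Pthreads offers an equivalent mechanism). The function name `wait_for_threshold` and the counter scenario are assumptions chosen for illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// One thread sleeps until a shared counter reaches `threshold`;
// another thread increments the counter and signals when the value is reached.
// Returns the counter value observed by the waiting thread when it wakes.
int wait_for_threshold(int threshold) {
    std::mutex m;
    std::condition_variable reached;
    int counter = 0;
    int observed = -1;

    std::thread waiter([&] {
        std::unique_lock<std::mutex> lock(m);
        // wait() releases the lock while sleeping and re-acquires it on wake-up.
        reached.wait(lock, [&] { return counter >= threshold; });
        observed = counter;
    });
    std::thread incrementer([&] {
        for (int i = 0; i < threshold; ++i) {
            std::lock_guard<std::mutex> lock(m);
            if (++counter >= threshold) reached.notify_one();
        }
    });
    waiter.join();
    incrementer.join();
    return observed;
}
```

The waiting thread consumes no processor time while the condition is false, which is the advantage over repeatedly locking and testing the variable.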

26 Language Constructs for Parallelism: Shared Data. Shared memory variables might be declared as shared with, say,
    shared int x;

27 par Construct. For specifying concurrent statements:
    par {
        S1;
        S2;
        .
        Sn;
    }

28 forall Construct. To start multiple similar processes together:
    forall (i = 0; i < n; i++) {
        S1;
        S2;
        .
        Sm;
    }
which generates n processes, each consisting of the statements forming the body of the for loop, S1, S2, …, Sm. Each process uses a different value of i.

29 Example forall (i = 0; i < 5; i++) a[i] = 0; clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
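The forall example can be imitated in standard C++ by starting one thread per index. This is only an emulation of the hypothetical language construct; `forall_clear` is a made-up name, and one thread per element is far too fine-grained for real use.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Emulates: forall (i = 0; i < n; i++) a[i] = 0;
void forall_clear(std::vector<int>& a) {
    std::vector<std::thread> workers;
    workers.reserve(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        workers.emplace_back([&a, i] { a[i] = 0; });   // each thread gets its own i
    for (std::thread& w : workers) w.join();            // wait for all bodies to finish
}
```

No lock is needed because each thread writes a different element.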

Design for Multithreading. Good design is critical. Bad multithreading can be worse than no multithreading: deadlocks, synchronization bugs, poor performance, etc.

Bad Multithreading [Figure: five threads, Thread 1 through Thread 5.]

Good Multithreading [Figure: a main thread, a game thread, and a rendering thread, with physics, animation/skinning, particle systems, networking, and file I/O as separate pieces of work.]

Another Paradigm: Cascades [Figure: Threads 1 to 5 run the stages Input, Physics, AI, Rendering, and Present, each working on a different frame (Frame 1 to Frame 4).]
Advantages: synchronization points are few and well-defined.
Disadvantages: increases latency (for constant frame rate); needs simple (one-way) data flow.

34 Multithreaded Programming in Cilk

35 Content: Introduction, Inlets, Abort

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
36 Cilk in One Slide. Extends the C language to support parallelism without changing the original serial semantics. The programming model is oriented toward fork-join task creation and is well suited to recursive algorithms (e.g. branch-and-bound). It rests on a solid theoretical foundation, so performance can be proven.
Keywords:
- cilk: marks a function as a "cilk" function that can be spawned.
- spawn: spawns a cilk function, at only 2 to 5 times the cost of a regular function call.
- sync: waits until immediate children spawned functions return.
Advanced keywords:
- inlet: defines a function to handle return values from a cilk task.
- cilk_fence: a portable memory fence.
- abort: terminates all currently existing spawned tasks.

37 Recursion Is at the Heart of Cilk. Spawning new tasks in Cilk is very convenient. Instead of loops, tasks are generated recursively. This creates a nested queue of tasks, and the scheduler uses work stealing to keep all cores busy. With Cilk, the programmer worries about expressing concurrency, not the details of how it is implemented.

38 Introduction. A Cilk program is a collection of procedures. A procedure is a sequence of threads. Cilk threads are represented by nodes in the dag.

39 Fibonacci: an example. Cilk provides no new data types.
C code:
    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }
Cilk code:
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

40 Basic Cilk Keywords
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
- cilk: declares a Cilk function or procedure. Such a procedure can be spawned in parallel.
- spawn: spawns a child thread. The child can execute in parallel with the parent.
- sync: control cannot pass this point until all spawned children have returned.

41 Dynamic Multithreading
    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }
The computation dag unfolds dynamically. Example: fib(4) [Figure: call tree of fib(4), with levels 4; 3, 2; 2, 1, 1, 0; 1, 0.]
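For readers without a Cilk compiler, the spawn/sync structure of fib can be approximated in standard C++ with `std::async`. This is an analogy only: it creates operating-system threads rather than Cilk's cheap tasks, so the sketch adds a `depth` parameter (an assumption of this example) to limit how many levels run in parallel.

```cpp
#include <future>

// Plain serial version, used below the parallel cutoff.
long fib_serial(int n) { return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2); }

// depth limits how many recursion levels create new threads (hypothetical knob).
long fib_parallel(int n, int depth = 3) {
    if (n < 2) return n;
    if (depth == 0) return fib_serial(n);
    // "spawn": run fib(n-1) in another thread.
    std::future<long> x = std::async(std::launch::async, fib_parallel, n - 1, depth - 1);
    long y = fib_parallel(n - 2, depth - 1);   // the parent continues with fib(n-2)
    return x.get() + y;                        // "sync": wait for the child
}
```

As in the Cilk version, the tree of calls is not known in advance; it unfolds as the recursion proceeds.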

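42 Multithreaded Computation. The directed acyclic graph G = (V, E) represents the parallel instruction stream. Each vertex v represents a (Cilk) thread: a maximal sequence of instructions that contains no parallel control instruction (spawn, sync, return). Each edge e is either a spawn edge, a return edge, or a continue edge. [Figure: dag with an initial thread, a final thread, and spawn, return, and continue edges.]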

43 Cactus Stack. Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.) Cilk's cactus stack supports several views in parallel. [Figure: procedure A spawns B and C, and C spawns D and E. Views of stack: A sees A; B sees A, B; C sees A, C; D sees A, C, D; E sees A, C, E.]

44 Operating on Returned Values. Example: x += spawn foo(a,b,c); Cilk achieves this functionality using an internal function, called an inlet, which is executed as a secondary thread on the parent frame when the child returns. The inlet keyword defines a void internal function to be an inlet.

45 Semantics of Inlets
    cilk int fib (int n) {
        int x = 0;
        inlet void summer (int result) {
            x += result;
            return;
        }
        if (n<2) return n;
        else {
            summer(spawn fib (n-1));
            summer(spawn fib (n-2));
            sync;
            return (x);
        }
    }

46 1. The Cilk procedure fib(i) is spawned. 2. Control passes to the next statement. 3. When fib(i) returns, summer() is invoked.

47 Semantics of Inlets. In the current implementation of Cilk, the inlet definition may not contain a spawn, and only the first argument of the inlet may be spawned at the call site.

48 Implicit Inlets
    cilk int wfib(int n) {
        if (n == 0) {
            return 0;
        } else {
            int i, x = 1;
            for (i=0; i<=n-2; i++) {
                x += spawn wfib(i);
            }
            sync;
            return x;
        }
    }
For assignment operators, the Cilk compiler automatically generates an implicit inlet.

49 Common Pattern for Cilk. Take a program that contains a loop and convert it to a recursive structure: split the range in half until each piece is small enough.
Loop version:
    void vadd (real *A, real *B, int n){
        int i;
        for(i=0; i<n; i++) A[i] += B[i];
    }
Recursive version:
    void vadd (real *A, real *B, int n){
        if (n<MIN) {
            int i;
            for(i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd(A, B, n/2);
            vadd(A+n/2, B+n/2, n-n/2);
        }
    }
Then add the Cilk keywords (cilk, spawn, spawn, sync):
    cilk void vadd (real *A, real *B, int n){
        if (n<MIN) {
            int i;
            for(i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd(A, B, n/2);
            spawn vadd(A+n/2, B+n/2, n-n/2);
            sync;
        }
    }

50 Computing a Product: $p = \prod_{i=0}^{n-1} A_i$
    int product(int *A, int n) {
        int i, p=1;
        for (i=0; i<n; i++) {
            p *= A[i];
        }
        return p;
    }
Optimization: if a partial product is 0, stop the computation.

51 Computing a Product: $p = \prod_{i=0}^{n-1} A_i$
    int product(int *A, int n) {
        int i, p=1;
        for (i=0; i<n; i++) {
            p *= A[i];
            if (p == 0) break;
        }
        return p;
    }
Optimization: if a partial product is 0, stop the computation.

52 Computing a Product in Parallel: $p = \prod_{i=0}^{n-1} A_i$
    cilk int product(int *A, int n) {
        int p = 1;
        if (n == 1) {
            return A[0];
        } else {
            p *= spawn product(A, n/2);
            p *= spawn product(A+n/2, n-n/2);
            sync;
            return p;
        }
    }
How can the computation be stopped early?

53 Cilk's Abort Feature. 1. Recode the implicit inlet to make it explicit.
    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

54 Cilk's Abort Feature. 2. Check for 0 within the inlet. (Code as on slide 53, before the change.)

55 Cilk's Abort Feature. 2. Check for 0 within the inlet.
    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;      /* Aborts existing children, */
            }               /* but not future ones.      */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

56 Cilk's Abort Feature. (Same code as slide 55: abort stops the children that already exist, but not children spawned afterwards.)

57 Cilk's Abort Feature
    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;      /* Aborts existing children, */
            }               /* but not future ones.      */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            if (p == 0) {   /* Don't spawn if we've */
                return 0;   /* already aborted!     */
            }
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }
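Standard C++ has no equivalent of Cilk's abort, but the early-exit idea can be imitated with a shared flag that every branch checks before doing more work. The sketch below is such an imitation under stated assumptions: cancellation is cooperative rather than forced, `found_zero` and `depth` are hypothetical parameters, and n is assumed to be at least 1.

```cpp
#include <atomic>
#include <functional>
#include <future>

// found_zero plays the role of "abort": once set, remaining work is skipped.
int product(const int* A, int n, std::atomic<bool>& found_zero, int depth = 2) {
    if (found_zero.load()) return 0;             // do not start work after an abort
    if (n == 1) {
        if (A[0] == 0) found_zero.store(true);
        return A[0];
    }
    if (depth == 0) {                            // deep enough: plain loop with early exit
        int p = 1;
        for (int i = 0; i < n && !found_zero.load(); ++i) {
            p *= A[i];
            if (p == 0) found_zero.store(true);
        }
        return found_zero.load() ? 0 : p;
    }
    // "spawn" the left half in another thread, compute the right half here.
    std::future<int> left = std::async(std::launch::async, product,
                                       A, n / 2, std::ref(found_zero), depth - 1);
    int right = product(A + n / 2, n - n / 2, found_zero, depth - 1);
    return left.get() * right;                   // "sync"
}
```

Unlike abort, a branch that is already in the middle of its loop only notices the flag at its next check, so the saving depends on how often the flag is tested.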

58 Mutual Exclusion. Cilk's solution to mutual exclusion is no better than anybody else's. Cilk provides a library of locks declared with Cilk_lockvar. To avoid deadlock with the Cilk scheduler, a lock should only be held within a Cilk thread, i.e., spawn and sync should not be executed while a lock is held.

59 Locking. Cilk_lockvar data type:
    #include <cilk-lib.h>
    :
    Cilk_lockvar mylock;
    :
    {
        Cilk_lock_init(mylock);
        :
        Cilk_lock(mylock);      /* begin critical section */
        :
        Cilk_unlock(mylock);    /* end critical section */
    }

60 Key Ideas. Cilk is simple: cilk, spawn, sync, SYNCHED, inlet, abort. JCilk is simpler. Work & span.

61 Intel's Threading Building Blocks

62 Threading Building Blocks: Library Characteristics
- C++ library.
- Targets threading for performance (designed to parallelize computationally intensive work).
- Is compatible with other threading packages.
- Emphasizes scalable data-parallel programming.
- Specifies templates and tasks instead of threads: the library schedules tasks onto threads and manages load balancing.

63 Components of TBB (version 2.1)
- Synchronization primitives: atomic operations, various flavors of mutexes (improved).
- Parallel algorithms: parallel_for (improved), parallel_reduce (improved), parallel_do (new), pipeline (improved), parallel_sort, parallel_scan.
- Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector (all improved).
- Task scheduler: with new functionality.
- Memory allocators: tbb_allocator (new), cache_aligned_allocator, scalable_allocator.
- Utilities: tick_count, tbb_thread (new).

64 C++ Review: Function Template. Type-parameterized function. Strongly typed. Obeys scope rules. Actual arguments evaluated exactly once. Not redundantly instantiated.
    template<typename T>
    void swap( T& x, T& y ) {
        T z = x;
        x = y;
        y = z;
    }
    void reverse( float* first, float* last ) {
        while( first<last-1 )
            swap( *first++, *--last );
    }
Compiler instantiates template swap with T=float. [first,last) defines a half-open interval.

65 Genericity of swap
    template<typename T>
    void swap( T& x, T& y ) {
        T z = x;    // Construct z
        x = y;      // Assignment
        y = z;
    }               // Destroy z
Requirements for T:
    T(const T&)                     Copy constructor
    void T::operator=(const T&);    Assignment
    ~T()                            Destructor

66 C++ Review: Template Class. Type-parameterized class.
    template<typename T, typename U>
    class pair {
    public:
        T first;
        U second;
        pair( const T& x, const U& y ) : first(x), second(y) {}
    };
    pair<string,int> x( "", 0 );    // pair has no default constructor
    x.first = "abc";
    x.second = 42;
Compiler instantiates template pair with T=string and U=int.

67 TBB Library Algorithm: parallel_for. parallel_for is a template function provided by the library.
    template <typename Range, typename Functor>
    void parallel_for(const Range& range, const Functor& func, partitioner);
Requirements for Range R:
    R(const R&)                     Copy a range
    R::~R()                         Destroy a range
    bool R::empty() const           Is range empty?
    bool R::is_divisible() const    Can range be split?
    R::R(R& r, split)               Split r into two subranges
The library provides blocked_range, blocked_range2d, and blocked_range3d. The programmer can define new kinds of ranges.

68 Requirements for Functor
    template <typename Range, typename Functor>
    void parallel_for(const Range& range, const Functor& func, partitioner);
Requirements for Functor func:
    F::F( const F& )                              Copy constructor
    F::~F()                                       Destructor
    void F::operator()(Range& subrange) const     Apply F to subrange

69 Example: parallel_for. Concurrently apply a function to each element in an array. Serial version:
    void SerialApplyFoo( float a[], size_t n ) {
        for( size_t i=0; i<n; ++i )
            Foo(a[i]);
    }
Iteration space is 0…(n-1).

70 TBB Library Algorithm: parallel_for, continued. The parallel version requires two steps. First, wrap the loop body in a class:
    #include "tbb/blocked_range.h"
    class ApplyFoo {
        float *const my_a;
    public:
        ApplyFoo( float a[] ) : my_a(a) {}
        void operator()( const blocked_range<size_t>& r ) const {
            float *a = my_a;
            for( size_t i=r.begin(); i!=r.end(); ++i )
                Foo(a[i]);
        }
    };

71 TBB Library Algorithm: parallel_for, continued. Second, call parallel_for:
    #include "tbb/parallel_for.h"
    void ParallelApplyFoo( float a[], size_t n ) {
        parallel_for(blocked_range<size_t>(0,n,IdealGrainSize), ApplyFoo(a) );
    }
- parallel_for breaks the iteration space into chunks, each of which is run on a separate thread.
- blocked_range<T>(begin, end, grainsize) is a recursively divisible struct.
- grainsize specifies the number of iterations for a "reasonable size" chunk to deal out to a processor. If the iteration space has more than grainsize iterations, parallel_for splits it into separate subranges that are scheduled separately.
- operator() processes a chunk.

72 TBB Library: Algorithms
- parallel_reduce: math operations over the elements of an array in parallel.
- parallel_do: loops over iteration spaces of indeterminate length.
- parallel_*: several others…

73 Work Stealing [Figure: four threads, each with its own deque and mailbox.] Order in which a thread chooses its next task:
0. Override: do explicitly specified task.
1. Locality: take youngest task from my deque.
2. Cache affinity: steal task advertised in mailbox.
3. Load balance: steal oldest task from random victim.
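Rules 1 and 3 can be illustrated with a tiny lock-protected deque. This is a teaching sketch only: the class name `TaskDeque` is made up, tasks are plain integers, and a real scheduler such as TBB's uses a far more refined data structure.

```cpp
#include <deque>
#include <mutex>

// A per-thread task deque. Tasks are plain ints here for illustration.
class TaskDeque {
public:
    void push(int task) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(task);
    }
    // Owner: take the youngest task (back of the deque). Returns false if empty.
    bool take_youngest(int& task) {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        task = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    // Thief: steal the oldest task (front of the deque). Returns false if empty.
    bool steal_oldest(int& task) {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        task = tasks_.front();
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<int> tasks_;
};
```

The owner works on what it created most recently, which is likely still in its cache, while a thief takes the oldest entry, which in a recursive split tends to be the largest remaining piece of work.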

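74 How This Works. Split range... recursively... until ≤ grainsize. [Figure: subranges being pushed onto a deque.]

75 Work Depth First; Steal Breadth First [Figure: a victim thread's task tree shown against the L1 and L2 caches.] Best choice for theft: a big piece of work whose data is far from the victim's hot data. The next level down is the second best choice.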
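The recursive splitting can be sketched without TBB. The `Range` struct below is a hypothetical stand-in for a blocked range, not the library type, and the sketch assumes grainsize is at least 1.

```cpp
#include <cstddef>
#include <vector>

struct Range {                       // hypothetical stand-in for a blocked range
    std::size_t begin, end, grainsize;
    bool is_divisible() const { return end - begin > grainsize; }
};

// Splits r recursively and appends the leaf chunks, in order, to out.
void split_recursively(const Range& r, std::vector<Range>& out) {
    if (!r.is_divisible()) {         // chunk has at most grainsize iterations
        out.push_back(r);
        return;
    }
    std::size_t mid = r.begin + (r.end - r.begin) / 2;
    split_recursively(Range{r.begin, mid, r.grainsize}, out);
    split_recursively(Range{mid, r.end, r.grainsize}, out);
}
```

For example, a range of 100 iterations with grainsize 30 is halved twice and ends up as four chunks of 25 iterations each. In a scheduler, each half would become a task instead of a direct recursive call.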

76 Parallel Sort Example (with work stealing): Quicksort, Step 1. tbb::parallel_sort(color, color+64); Thread 1 starts with the initial data:
32 44 9 26 31 57 3 19 55 29 27 1 20 5 42 62 25 51 49 15 54 6 18 48 10 2 60 41 14 47 24 36 37 52 22 34 35 11 28 8 13 43 53 23 61 38 56 16 59 17 50 7 21 45 4 39 33 40 58 12 30 0 46 63

77 Parallel Sort Example (with work stealing): Quicksort, Step 2. Thread 1 partitions/splits its data. [Figure: the data split around 37 into the values below 37 and the values above 37; Threads 2, 3, and 4 are idle.]

78 Parallel Sort Example (with work stealing): Quicksort, Step 2. Thread 2 gets work by stealing from Thread 1.

79 Parallel Sort Example (with work stealing): Quicksort, Step 3. Thread 1 partitions/splits its data. Thread 2 partitions/splits its data. [Figure: further splits around 7 and 49.]

80 Parallel Sort Example (with work stealing): Quicksort, Step 3. Thread 3 gets work by stealing from Thread 1. Thread 4 gets work by stealing from Thread 2.

81 Parallel Sort Example (with work stealing): Quicksort, Step 4. Thread 1 sorts the rest of its data. Thread 2 sorts the rest of its data. Thread 3 partitions/splits its data. Thread 4 sorts the rest of its data. [Figure: values 0 to 7 and 37 to 63 are in final order; Thread 3 splits around 18.]

82 Parallel Sort Example (with work stealing): Quicksort, Step 5. Thread 1 gets more work by stealing from Thread 3. Thread 3 sorts the rest of its data. [Figure: values 0 to 18 and 37 to 63 are in final order.]

83 Parallel Sort Example (with work stealing): Quicksort, Step 6. Thread 1 partitions/splits its data. [Figure: split around 27.]

84 Parallel Sort Example (with work stealing): Quicksort, Step 6. Thread 2 gets more work by stealing from Thread 1. Thread 1 sorts the rest of its data.

85 Parallel Sort Example (with work stealing): Quicksort, Step 7. Thread 2 sorts the rest of its data. DONE. [Figure: all values 0 to 63 in sorted order.]

86 TBB Library: Pipeline. TBB implements the pipeline pattern. Data flows through a series of pipeline stages, and each stage processes the data in some way.

87 Parallel Pipeline
- A parallel stage scales because it can process items in parallel or out of order.
- A serial stage processes items one at a time, in order. Items wait for their turn in a serial stage.
- Incoming items are tagged with sequence numbers, and the sequence numbers are used to recover order for a later serial stage.
- Excessive parallelism is controlled by limiting the total number of items flowing through the pipeline.
- Throughput is limited by the throughput of the slowest serial stage.

88 Example. Sample problem: read a text file (sequential), capitalize the first letter of each word (parallel), and write the modified text to a new file (sequential).

89 TBB Library: Pipeline, continued
    // Create the pipeline
    tbb::pipeline pipeline;
    // Create file-reading stage and add it to the pipeline
    MyInputFilter input_filter( input_file );
    pipeline.add_filter( input_filter );
    // Create capitalization stage and add it to the pipeline
    MyTransformFilter transform_filter;
    pipeline.add_filter( transform_filter );
    // Create file-writing stage and add it to the pipeline
    MyOutputFilter output_filter( output_file );
    pipeline.add_filter( output_filter );
    // Run the pipeline
    pipeline.run( MyInputFilter::n_buffer );
    // Must remove filters from pipeline before they are implicitly destroyed.
    pipeline.clear();

90 TBB: Pipeline, continued
    // Filter that writes each buffer to a file.
    class MyOutputFilter: public tbb::filter {
        FILE* my_output_file;
    public:
        MyOutputFilter( FILE* output_file );
        /*override*/void* operator()( void* item );
    };
    MyOutputFilter::MyOutputFilter( FILE* output_file ) :
        tbb::filter(serial), my_output_file(output_file) { }
    void* MyOutputFilter::operator()( void* item ) {
        MyBuffer& b = *static_cast<MyBuffer*>(item);
        fwrite( b.begin(), 1, b.size(), my_output_file );
        return NULL;
    }

91 TBB: Pipeline, continued
    // Changes the first letter of each word from lower case to upper case.
    class MyTransformFilter: public tbb::filter {
    public:
        MyTransformFilter();
        /*override*/void* operator()( void* item );
    };
    MyTransformFilter::MyTransformFilter() : tbb::filter(parallel) {}
    /*override*/void* MyTransformFilter::operator()( void* item ) {
        // a for loop and 'toupper()' go here…
    }
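The work done per item by the transform stage can be sketched as a standalone function. This is not the body of MyTransformFilter from the slides, which is left elided; `capitalize_words` is a hypothetical helper operating on a `std::string`, and it treats any non-letter character as a word separator.

```cpp
#include <cctype>
#include <string>

// Upper-cases the first letter of each word; a word starts after any non-letter.
std::string capitalize_words(std::string text) {
    bool previous_was_letter = false;
    for (char& c : text) {
        unsigned char uc = static_cast<unsigned char>(c);
        bool is_letter = std::isalpha(uc) != 0;
        if (is_letter && !previous_was_letter)
            c = static_cast<char>(std::toupper(uc));
        previous_was_letter = is_letter;
    }
    return text;
}
```

Because the function touches only the buffer it is given, many buffers can be processed at the same time, which is what makes this stage suitable for a parallel filter.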

92 TBB: Timing. The tick_count class:
    using namespace tbb;
    void Foo() {
        tick_count t0 = tick_count::now();
        ...action being timed...
        tick_count t1 = tick_count::now();
        printf("time for action = %g seconds\n", (t1-t0).seconds() );
    }

93 TBB: Task Scheduler. Intel Propaganda: The task scheduler is the engine that powers the loop templates. When practical, use the loop templates instead of the task scheduler because the templates hide the complexity of the scheduler. However, if you have an algorithm that does not naturally map onto one of the high-level templates, use the task scheduler. All of the scheduler functionality that is used by the high-level templates is available for you to use directly, so you can build new high-level templates that are just as powerful as the existing ones.

94 TBB: Task Scheduler. Maps tasks to threads. Handles load balancing and scheduling. Hides threading details: just think in terms of tasks. Any thread using the task scheduler must have an initialized tbb::task_scheduler_init object.
    #include "tbb/task_scheduler_init.h"
    using namespace tbb;
    int main() {
        task_scheduler_init init;
        ...
        return 0;
    }

95 Example: Naive Fibonacci Calculation
Recursion is typically used to calculate the Fibonacci number F(n) = F(n-1) + F(n-2).
    long SerialFib( long n ) {
        if( n<2 )
            return n;
        else
            return SerialFib(n-1) + SerialFib(n-2);
    }

96 Example: Naive Fibonacci Calculation
The Fibonacci computation can be envisioned as a task graph.
[Figure: task graph rooted at SerialFib(4), branching through SerialFib(3) and SerialFib(2) down to SerialFib(1) and SerialFib(0).]

97 Fibonacci: Task Spawning Solution
Use TBB tasks to thread the creation and execution of the task graph:
- Create a new root task
  - Allocate the task object
  - Construct the task
- Spawn (execute) the task and wait for completion
    long ParallelFib( long n ) {
        long sum;
        FibTask& a = *new(task::allocate_root()) FibTask(n,&sum);
        task::spawn_root_and_wait(a);
        return sum;
    }

98 Fibonacci: Task Spawning Solution
    class FibTask: public task {
    public:
        const long n;
        long* const sum;
        FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
        task* execute() { // Overrides virtual function task::execute
            if( n<CutOff ) {
                *sum = SerialFib(n);
            } else {
                long x, y;
                FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
                FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
                set_ref_count(3); // 3 = 2 children + 1 for wait
                spawn( b );
                spawn_and_wait_for_all( a );
                *sum = x+y;
            }
            return NULL;
        }
    };
- FibTask is derived from the TBB task class.
- The execute method does the computation of a task.
- The allocate_child() calls create new child tasks to compute the (n-1)th and (n-2)th Fibonacci numbers.
- The reference count is used to know when spawned tasks have completed. It is set before spawning any children.
- spawn( b ): spawn the task and return immediately. The task can be scheduled at any time.
- spawn_and_wait_for_all( a ): spawn the task and block until all children have completed execution.

99 Further Optimizations Enabled by Scheduler
- Recycle tasks
  - Avoid overhead of allocating/freeing a task
  - Avoid copying data and rerunning constructors/destructors
- Continuation passing
  - Instead of blocking, the parent specifies another task that will continue its work when the children are done
  - Further reduces stack space and enables bypassing the scheduler
- Bypassing scheduler
  - A task can return a pointer to the next task to execute
  - For example, the parent returns a pointer to its left child
  - See include/tbb/parallel_for.h for an example
  - Saves push/pop on the deque (and locking/unlocking it)

100 Concurrent Containers
- The TBB library provides highly concurrent containers
  - STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container
  - Standard practice is to wrap a lock around STL containers
    - This turns the container into a serial bottleneck
- The library provides fine-grained locking or lockless implementations
  - Worse single-thread performance, but better scalability
  - Can be used with the library, OpenMP, or native threads

101 TBB: Containers
- concurrent_hash_map<Key,T,HashCompare>
- concurrent_queue<T>
- concurrent_vector<T>
These containers support concurrent access, concurrent operations and parallel iteration. The TBB library retains control over memory allocation.

102 Concurrency-Friendly Interfaces
Some STL interfaces are inherently not concurrency-friendly. For example, suppose two threads each execute:
    extern std::queue<T> q;
    if(!q.empty()) {
        // At this instant, another thread might pop the last element.
        item=q.front();
        q.pop();
    }
Solution: concurrent_queue has pop_if_present, which tests and pops in one operation.
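The race above comes from splitting "is it empty?" and "remove the front" into separate calls. As a minimal sketch of the idea behind pop_if_present, using only the C++ standard library rather than TBB, the check and the removal can be placed in one critical section. The names LockedQueue, try_pop and locked_queue_demo are hypothetical and chosen for illustration; a single mutex like this is also the "serial bottleneck" approach that TBB's containers are meant to improve on.

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Hypothetical helper, not part of TBB: a std::queue guarded by one mutex.
template <typename T>
class LockedQueue {
public:
    void push(const T& value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push(value);
    }

    // Test and pop inside one critical section, so no other thread can
    // remove the last element between the emptiness check and the pop.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> items_;
};

// Small single-threaded walk-through of the interface.
inline void locked_queue_demo() {
    LockedQueue<int> q;
    int item = 0;
    assert(!q.try_pop(item));   // empty queue: nothing is popped
    q.push(7);
    assert(q.try_pop(item) && item == 7);
}
```

A caller replaces the empty()/front()/pop() sequence with a single `if (q.try_pop(item)) { ... }`.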

103 Concurrent Queue Container
concurrent_queue<T>
- Preserves local FIFO order
  - If one thread pushes two values and another thread pops those two values, they come out in the same order that they went in
- Method push(const T&) places a copy of the item on the back of the queue
- Two kinds of pops
  - Blocking: pop(T&)
  - Non-blocking: pop_if_present(T&)
- Method size() returns a signed integer: the difference between pushes and pops
  - If size() returns -n, it means n pops await corresponding pushes
- Method empty() returns size() == 0
  - May return true if the queue is empty, but there are pending pop() calls

104 Concurrent Queue Container Example
Simple example to enqueue and print integers.
    #include "tbb/concurrent_queue.h"
    #include <stdio.h>
    using namespace tbb;
    int main () {
        concurrent_queue<int> queue;   // constructor for queue
        int j;
        for (int i = 0; i < 10; i++)
            queue.push(i);             // push items onto queue
        while (!queue.empty()) {       // while more things on queue
            queue.pop(j);              // pop item off (pop takes a T&)
            printf("from queue: %d\n", j);   // print item
        }
        return 0;
    }

105 Concurrent Vector Container
concurrent_vector<T>
- Dynamically growable array of T
  - Method grow_by(size_type delta) appends delta elements to the end of the vector
  - Method grow_to_at_least(size_type n) adds elements until the vector has at least n elements
  - Method size() returns the number of elements in the vector
  - Method empty() returns size() == 0
- Never moves elements until cleared
  - Can concurrently access and grow
  - Method clear() is not thread-safe with respect to access/resizing

106 Concurrent Vector Container Example
Append a string to the array of characters held in a concurrent_vector:
- Grow the vector to accommodate the new string
  - grow_by() returns the old size of the vector (first index of the new elements)
- Copy the string into the vector
    void Append( concurrent_vector<char>& V, const char* string) {
        size_t n = strlen(string)+1;   // length including the terminating NUL
        memcpy( &V[V.grow_by(n)], string, n );
    }

107 Concurrent Hash Table Container
concurrent_hash_map<Key,T,HashCompare>
- Maps Key to an element of type T
- You define class HashCompare with two methods
  - hash() maps Key to a hashcode of type size_t
  - equal() returns true if two Keys are equal
- Enables concurrent find(), insert(), and erase() operations
  - find() and insert() set a "smart pointer" that acts as a lock on the item
  - accessor grants read-write access
  - const_accessor grants read-only access
  - The lock is released when the smart pointer is destroyed

108 Concurrent Hash Table Container Example
- User-defined method hash() takes a string as a key and maps it to an integer
- User-defined method equal() returns true if two strings are equal
    struct MyHashCompare {
        static size_t hash( const string& x ) {
            size_t h = 0;
            for( const char* s = x.c_str(); *s; s++ )
                h = (h*157)^*s;
            return h;
        }
        static bool equal( const string& x, const string& y ) {
            return x == y;   // strcmp() does not accept std::string arguments
        }
    };

109 Concurrent Hash Table Container Example: Key Insert
- If insert() returns true, a new string was inserted
  - The value is the key's place within the sequence of strings from getNextString()
- Otherwise, the string has been seen previously
    typedef concurrent_hash_map<string,int,MyHashCompare> myHash;
    myHash table;
    string newString;
    int place = 0;
    …
    while (getNextString(&newString)) {
        myHash::accessor a;
        if (table.insert( a, newString )) // new string inserted
            a->second = ++place;
    }

110 Concurrent Hash Table Container Example: Key Find
If find() returns true, the key was found within the hash table.
    myHash table;
    string s1, s2;
    int p1, p2;
    …
    {
        myHash::const_accessor a; // read lock
        myHash::const_accessor b;
        if (table.find(a,s1) && table.find(b,s2)) { // find strings
            p1 = a->second; p2 = b->second;
            if (p1 < p2)
                printf("%s came before %s\n", s1.c_str(), s2.c_str());
            else
                printf("%s came before %s\n", s2.c_str(), s1.c_str());
        }
        else
            printf("One or both strings not seen before\n");
    }

111 TBB: Allocation
- tbb_allocator<T>: allocates and frees memory via the TBB malloc library if available; otherwise it reverts to using malloc and free.
- scalable_allocator<T>: allocates and frees memory in a way that scales with the number of processors.
- others…

112 Scalable Memory Allocators
- Serial memory allocation can easily become a bottleneck in multithreaded applications
  - Threads require mutual exclusion into the shared heap
- TBB offers two choices for scalable memory allocation
  - Both are similar to the STL template class std::allocator
  - scalable_allocator<T>
    - Offers scalability, but not protection from false sharing
    - Memory is returned to each thread from a separate pool
  - cache_aligned_allocator<T>
    - Offers both scalability and false sharing protection

113 Methods for scalable_allocator
    #include "tbb/scalable_allocator.h"
    template<typename T> class scalable_allocator;
Scalable versions of malloc, free, realloc, calloc:
    void *scalable_malloc( size_t size );
    void scalable_free( void *ptr );
    void *scalable_realloc( void *ptr, size_t size );
    void *scalable_calloc( size_t nobj, size_t size );
STL allocator functionality:
    T* A::allocate( size_type n, void* hint=0 )    // allocate space for n values
    void A::deallocate( T* p, size_t n )           // deallocate n values from p
    void A::construct( T* p, const T& value )
    void A::destroy( T* p )

114 Scalable Allocators Example
    #include "tbb/scalable_allocator.h"
    typedef char _Elem;
    // Use TBB scalable allocator for STL basic_string class
    typedef std::basic_string<_Elem, std::char_traits<_Elem>,
                              tbb::scalable_allocator<_Elem> > MyString;
    ...
    {
        ...
        int *p;
        MyString str1 = "qwertyuiopasdfghjkl";
        MyString str2 = "asdfghjklasdfghjkl";
        // Use TBB scalable allocator to allocate 24 integers
        p = tbb::scalable_allocator<int>().allocate(24);
        ...
    }

115 TBB: Synchronization Primitives
- Parallel tasks must sometimes touch shared data
  - When data updates might overlap, use mutual exclusion to avoid a race
- High-level generic abstraction for HW atomic operations
  - Atomically protect the update of a single variable
- Critical regions of code are protected by scoped locks
  - The range of the lock is determined by its lifetime (scope)
  - Leaving the lock scope calls the destructor, making it exception safe
  - Minimizing lock lifetime avoids possible contention
  - Several mutex behaviors are available
    - Spin vs. queued ("are we there yet" vs. "wake me when we get there")
    - Writer vs. reader/writer (supports multiple readers/single writer)
    - Scoped wrapper of the native mutual exclusion function

116 Atomic Execution
atomic<T>
- T should be an integral type or pointer type
- Full type-safe support for 8, 16, 32, and 64-bit integers
    atomic<int> i;
    ...
    int z = i.fetch_and_add(2);
Operations:
    '= x' and 'x ='            read/write value of x
    x.fetch_and_store(y)       z = x, x = y, return z
    x.fetch_and_add(y)         z = x, x += y, return z
    x.compare_and_swap(y,p)    z = x, if (x==p) x=y; return z
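The three read-modify-write operations in the table have close counterparts in the C++ standard library. The sketch below is not TBB code: it uses std::atomic, where fetch_add plays the role of fetch_and_add and exchange plays the role of fetch_and_store. The free function compare_and_swap and the function atomic_demo are hypothetical helpers written for this illustration; they reproduce the "return the old value" convention of the table on top of compare_exchange_strong.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: stores y into x only if x currently equals p,
// and returns the value x held before the call in either case.
inline int compare_and_swap(std::atomic<int>& x, int y, int p) {
    int expected = p;
    // On failure, compare_exchange_strong writes the current value of x
    // into expected; on success expected still equals the old value p.
    x.compare_exchange_strong(expected, y);
    return expected;
}

// Single-threaded walk-through of the three operations.
inline void atomic_demo() {
    std::atomic<int> x(5);

    int z = x.fetch_add(2);            // z = x, x += 2
    assert(z == 5 && x.load() == 7);

    z = x.exchange(10);                // z = x, x = 10
    assert(z == 7 && x.load() == 10);

    z = compare_and_swap(x, 1, 3);     // x != 3, so x is left unchanged
    assert(z == 10 && x.load() == 10);

    z = compare_and_swap(x, 1, 10);    // x == 10, so x becomes 1
    assert(z == 10 && x.load() == 1);
}
```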

117 Summary
- Intel® Threading Building Blocks is a parallel programming model for C++ applications
  - Used for computationally intense code
  - A focus on data parallel programming
- Intel® Threading Building Blocks provides
  - Generic parallel algorithms
  - Highly concurrent containers
  - Low-level synchronization primitives
  - A task scheduler that can be used directly

118 Shared Memory Programming with OpenMP

119 Introduction to OpenMP (CS267 Lecture 6)
What is OpenMP?
- Open specification for Multi-Processing
- "Standard" API for defining multi-threaded shared-memory programs
- openmp.org: talks, examples, forums, etc.
High-level API:
- Preprocessor (compiler) directives (~80%)
- Library calls (~19%)
- Environment variables (~1%)

120 OpenMP* Overview (© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen)
OpenMP: an API for writing multithreaded applications
- A set of compiler directives and library routines for parallel application programmers
- Makes writing multi-threaded applications in Fortran, C and C++ as easy as we can make it
- Standardizes the last 20 years of SMP practice
[Figure: collage of sample OpenMP syntax]
    C/C++ directives: #pragma omp parallel for private(A, B) | #pragma omp critical
    Fortran directives: C$OMP parallel do shared(a, b, c) | C$OMP PARALLEL REDUCTION (+: A, B) | C$OMP DO lastprivate(XX) | C$OMP ORDERED | C$OMP SINGLE PRIVATE(X) | C$OMP SECTIONS | C$OMP MASTER | C$OMP ATOMIC | C$OMP FLUSH | C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C) | C$OMP THREADPRIVATE(/ABC/) | C$OMP PARALLEL COPYIN(/blk/) | !$OMP BARRIER
    Library calls: omp_set_lock(lck) | call OMP_INIT_LOCK (ilok) | call omp_test_lock(jlok) | CALL OMP_SET_NUM_THREADS(10) | Nthrds = OMP_GET_NUM_PROCS()
    Environment: setenv OMP_SCHEDULE "dynamic"
* The name "OpenMP" is the property of the OpenMP Architecture Review Board.

121 The Essence of OpenMP
- Create threads that execute in a shared address space:
  - The only way to create threads is with the "parallel construct"
  - Once created, all threads execute the code inside the construct
- Split up the work between threads by one of two means:
  - SPMD (Single Program Multiple Data): all threads execute the same code and you use the thread ID to assign work to a thread
  - Workshare constructs split up loops and tasks between threads
- Manage the data environment to avoid data access conflicts
  - Synchronization so correct results are produced regardless of how threads are scheduled
  - Carefully manage which data can be private (local to each thread) and shared

122 A Programmer's View of OpenMP
- OpenMP is a portable, threaded, shared-memory programming specification with "light" syntax
  - Exact behavior depends on the OpenMP implementation!
  - Requires compiler support (C or Fortran)
- OpenMP will:
  - Allow a programmer to separate a program into serial regions and parallel regions, rather than T concurrently-executing threads
  - Hide stack management
  - Provide synchronization constructs
- OpenMP will not:
  - Parallelize automatically
  - Guarantee speedup
  - Provide freedom from data races

123 Motivation: OpenMP
    #include <stdio.h>
    int main() {
        // Do this part in parallel
        printf( "Hello, World!\n" );
        return 0;
    }

124 Motivation: OpenMP
    #include <stdio.h>
    #include <omp.h>
    int main() {
        omp_set_num_threads(16);
        // Do this part in parallel
        #pragma omp parallel
        {
            printf( "Hello, World!\n" );
        }
        return 0;
    }

125 OpenMP Execution Model
Fork-Join Parallelism:
- The master thread spawns a team of threads as needed.
- Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program.
[Figure: fork-join timeline with the master thread in red, alternating sequential parts and parallel regions, including a nested parallel region.]

126 Programming Model: Concurrent Loops
- OpenMP easily parallelizes loops
  - Requires: no data dependencies (read/write or write/write pairs) between iterations!
- The preprocessor calculates loop bounds for each thread directly from the serial source
    #pragma omp parallel for
    for( i=0; i < 25; i++ ) {
        printf("Foo");
    }

127 Programming Model: Loop Scheduling
The schedule clause determines how loop iterations are divided among the thread team.
- schedule(static[, chunk]) divides iterations statically between threads
  - Each thread receives [chunk] iterations at a time, rounding as necessary to account for all iterations
  - Default [chunk] is ceil( # iterations / # threads )
- schedule(dynamic[, chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes
  - Forms a logical work queue, consisting of all loop iterations
  - Default [chunk] is 1
- schedule(guided[, chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation
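To make the static case concrete: with an explicit chunk size, OpenMP deals chunks of consecutive iterations to the threads round-robin in thread-number order. The sketch below models that mapping as plain arithmetic on the zero-based iteration number, so it runs without OpenMP. The function names static_owner and default_static_chunk are hypothetical and exist only for this illustration; the second one simply evaluates the ceil(# iterations / # threads) default quoted above.

```cpp
#include <cassert>

// Thread that executes zero-based iteration i under schedule(static, chunk):
// chunk 0 goes to thread 0, chunk 1 to thread 1, ..., wrapping around.
inline int static_owner(long i, long chunk, int nthreads) {
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return static_cast<int>((i / chunk) % nthreads);
}

// Default chunk size quoted above: ceil(iterations / threads).
inline long default_static_chunk(long iterations, int nthreads) {
    assert(iterations >= 0 && nthreads > 0);
    return (iterations + nthreads - 1) / nthreads;
}
```

For the 25-iteration loop on the previous slide and 4 threads, default_static_chunk(25, 4) gives 7, so threads 0 to 2 would take 7 iterations each and thread 3 the remaining 4 under this model.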

128 Programming Model: Data Sharing
- Parallel programs often employ two types of data
  - Shared data, visible to all threads, similarly named
  - Private data, visible to a single thread (often stack-allocated)
- PThreads:
  - Global-scoped variables are shared
  - Stack-allocated variables are private
- OpenMP:
  - shared variables are shared
  - private variables are private
PThreads version:
    // shared, globals
    int bigdata[1024];
    void* foo(void* bar) {
        // private, stack
        int tid;
        /* Calculation goes here */
    }
OpenMP version:
    int bigdata[1024];
    void* foo(void* bar) {
        int tid;
        #pragma omp parallel \
            shared ( bigdata ) \
            private ( tid )
        {
            /* Calc. here */
        }
    }

129 Programming Model: Synchronization
OpenMP synchronization:
- OpenMP critical sections
  - Named or unnamed
  - No explicit locks / mutexes
    #pragma omp critical
    { /* Critical code here */ }
- Barrier directives
    #pragma omp barrier
- Explicit lock functions
  - When all else fails; may require the flush directive
    omp_set_lock( &l );    /* l is an omp_lock_t set up with omp_init_lock( &l ) */
    /* Code goes here */
    omp_unset_lock( &l );
- Single-thread regions within parallel regions
  - master, single directives
    #pragma omp single
    { /* Only executed once */ }

130 Example Problem: Numerical Integration
Mathematically, we know that:
\[ \int_0^1 \frac{4.0}{1+x^2}\,dx = \pi \]
We can approximate the integral as a sum of N rectangles:
\[ \sum_{i=0}^{N-1} F(x_i)\,\Delta x \approx \pi, \qquad F(x) = \frac{4.0}{1+x^2} \]
where each rectangle has width \Delta x = 1/N and height F(x_i) at the middle of interval i, i.e. x_i = (i + 0.5)\,\Delta x.
[Figure: plot of F(x) = 4.0/(1+x^2) for x from 0.0 to 1.0.]

131 PI Program: An Example
    static long num_steps = 100000;
    double step;
    int main () {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        x = 0.5 * step;
        for (i=0; i< num_steps; i++){   // num_steps rectangles: i = 0 .. num_steps-1
            sum += 4.0/(1.0+x*x);
            x+=step;
        }
        pi = step * sum;
    }

132 PI Program: Identify Concurrency
    static long num_steps = 100000;
    double step;
    int main () {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        x = 0.5 * step;
        for (i=0; i< num_steps; i++){   // num_steps rectangles: i = 0 .. num_steps-1
            sum += 4.0/(1.0+x*x);
            x+=step;
        }
        pi = step * sum;
    }
Loop iterations can in principle be executed concurrently.

133 PI Program: Expose Concurrency, Part 1
    static long num_steps = 100000;
    double step;
    int main () {
        double pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        int i; double x;
        for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }
- Isolate data that must be shared (pi, sum, step) from data local to a task (i, x).
- Redefine x to remove the loop-carried dependence.
- The update of sum is called a reduction: results from each iteration are accumulated into a single global.

134 PI Program: Expose Concurrency, Part 2: Deal with the Reduction
    static long num_steps = 100000;
    #define NUM 4 //expected max thread count
    double step;
    int main () {
        double pi = 0.0, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
        int i, ID=0; double x;
        for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum[ID] += 4.0/(1.0+x*x);
        }
        for (i=0; i<NUM; i++)   // the original "int i=0, pi=0.0" declared a second, integer pi
            pi += step * sum[i];
    }
Common trick: promote the scalar "sum" to an array with one element per thread, indexed by thread ID, to create thread-local copies of shared data.

135 PI Program: Express Concurrency Using OpenMP
    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    int main () {
        double pi = 0.0, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
        #pragma omp parallel num_threads(NUM)
        {
            int i, ID; double x;
            ID = omp_get_thread_num();
            for (i=ID; i< num_steps; i+=NUM){
                x = (i+0.5)*step;
                sum[ID] += 4.0/(1.0+x*x);
            }
        }
        for (int i=0; i<NUM; i++)
            pi += step * sum[i];
    }
- The parallel directive creates NUM threads; each thread executes the code in the parallel block.
- A simple modification to the loop deals out iterations to threads.
- Variables defined inside a thread (i, ID, x) are private to that thread.
- Automatic variables defined outside a parallel region (pi, sum) are shared between threads.

136 PI Program: Fixing the NUM Threads Bug
    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    int main () {
        double pi = 0.0, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
        #pragma omp parallel num_threads(NUM)
        {
            int nthreads = omp_get_num_threads();
            int i, ID; double x;
            ID = omp_get_thread_num();
            for (i=ID; i< num_steps; i+=nthreads){
                x = (i+0.5)*step;
                sum[ID] += 4.0/(1.0+x*x);
            }
        }
        for (int i=0; i<NUM; i++)
            pi += step * sum[i];
    }
NUM is a requested number of threads, but an OS can choose to give you fewer. Hence, you need to add a bit of code to get the actual number of threads.

137 Incremental Parallelism
Software development with incremental parallelism:
- Behavior-preserving transformations to expose concurrency.
- Express concurrency incrementally by adding OpenMP directives. In a large program I can do this loop by loop to evolve my original program into a parallel OpenMP program.
- Build and time the program, and optimize as needed with behavior-preserving transformations until you reach the desired performance.

138 PI Program: Execute Concurrency
    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    int main () {
        double pi = 0.0, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
        #pragma omp parallel num_threads(NUM)
        {
            int nthreads = omp_get_num_threads();
            int i, ID; double x;
            ID = omp_get_thread_num();
            for (i=ID; i< num_steps; i+=nthreads){
                x = (i+0.5)*step;
                sum[ID] += 4.0/(1.0+x*x);
            }
        }
        for (int i=0; i<NUM; i++)
            pi += step * sum[i];
    }
Build this program and execute it on parallel hardware. The performance can suffer on some systems due to false sharing of sum[ID], i.e. independent elements of the sum array share a cache line and hence every update requires a cache line transfer between threads.
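One way to see what false sharing means is to give every per-thread accumulator a cache line of its own. The sketch below is an illustration only and is not the remedy the slides adopt next: it assumes a 64-byte cache line, which varies between processors, and the names kCacheLine, kMaxThreads, PaddedSum and padded_pi are hypothetical. The omp.h calls are guarded by _OPENMP so that the function also compiles and runs serially when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _OPENMP
#include <omp.h>
#endif

constexpr std::size_t kCacheLine = 64;   // assumed cache line size in bytes
constexpr int kMaxThreads = 4;           // plays the role of NUM above

// Each accumulator is aligned and padded to a full (assumed) cache line,
// so two threads never update doubles that live in the same line.
struct alignas(kCacheLine) PaddedSum {
    double value = 0.0;
};

inline double padded_pi(long num_steps) {
    assert(num_steps > 0);
    const double step = 1.0 / static_cast<double>(num_steps);
    PaddedSum sum[kMaxThreads];

    #pragma omp parallel num_threads(kMaxThreads)
    {
        int id = 0;
        int nthreads = 1;
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        // Cyclic distribution of iterations, as in the slide code.
        for (long i = id; i < num_steps; i += nthreads) {
            const double x = (i + 0.5) * step;
            sum[id].value += 4.0 / (1.0 + x * x);
        }
    }

    double pi = 0.0;
    for (int t = 0; t < kMaxThreads; ++t) pi += step * sum[t].value;
    return pi;
}
```

Padding trades memory for fewer cache line transfers; whether it pays off depends on the machine, so it is worth timing both variants.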

139 PI Program: Safe Update of Shared Data
    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    int main () {
        double pi, sum=0.0;
        step = 1.0/(double) num_steps;
        #pragma omp parallel num_threads(NUM)
        {
            int i, ID; double x, psum= 0.0;
            int nthreads = omp_get_num_threads();
            ID = omp_get_thread_num();
            for (i=ID; i< num_steps; i+=nthreads){
                x = (i+0.5)*step;
                psum += 4.0/(1.0+x*x);
            }
            #pragma omp critical
            sum += psum;
        }
        pi = step * sum;
    }
- Replace the array for sum with a local/private version of sum (psum): no more false sharing.
- Use a critical section so only one thread at a time can update sum, i.e. you can safely combine the psum values.

140 PI Program: Making Loop-Splitting and Reductions Even Easier
    #include <omp.h>
    static long num_steps = 100000;
    double step;
    int main () {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        #pragma omp parallel for private(i, x) reduction(+:sum)
        for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }
- The reduction clause is used to manage dependencies.
- The private clause creates data local to a thread.

141 Synchronization: Barrier
Barrier: each thread waits until all threads arrive.
    #pragma omp parallel shared (A, B, C) private(id)
    {
        id=omp_get_thread_num();
        A[id] = big_calc1(id);
        #pragma omp barrier
        #pragma omp for
        for(i=0;i<N;i++){C[i]=big_calc3(i,A);}     // implicit barrier at the end of a for worksharing construct
        #pragma omp for nowait
        for(i=0;i<N;i++){ B[i]=big_calc2(C, i); }  // no implicit barrier due to nowait
        A[id] = big_calc4(id);
    }                                              // implicit barrier at the end of a parallel region
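The reason the first loop must keep its implicit barrier can be shown with a small two-phase computation in which phase two reads elements that another thread may have written in phase one. This is a minimal sketch under assumed names (neighbour_sums and its arrays a and b are hypothetical); it uses only directives, so it also compiles and produces the same result when OpenMP is disabled.

```cpp
#include <cassert>
#include <vector>

// Phase 1 fills a[]; phase 2 combines each element with its right-hand
// neighbour (wrapping around), which may belong to a different thread.
inline std::vector<long> neighbour_sums(int n) {
    assert(n > 0);
    std::vector<long> a(n), b(n);

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = static_cast<long>(i) * i;
        // Implicit barrier: every a[i] is written before any thread goes on.

        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            b[i] = a[i] + a[(i + 1) % n];
        // nowait is safe here because nothing else in the region reads b.
    }   // implicit barrier at the end of the parallel region

    return b;
}
```

Adding nowait to the first loop would let a thread start phase two while a neighbouring a[i] is still unwritten, which is exactly the kind of race the barrier prevents.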

142 Putting the Master Thread to Work
The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no synchronization is implied).
    #pragma omp parallel
    {
        do_many_things();
        #pragma omp master
        { exchange_boundaries(); }
        #pragma omp barrier
        do_many_other_things();
    }

143 Runtime Library Routines and ICVs
To use a known, fixed number of threads in a program, (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.
    #include <omp.h>
    int main() {
        int num_threads;
        omp_set_dynamic( 0 );                         // disable dynamic adjustment of the number of threads
        omp_set_num_threads( omp_get_num_procs() );   // request as many threads as you have processors
        #pragma omp parallel
        {
            int id=omp_get_thread_num();
            #pragma omp single                        // protect this op since memory stores are not atomic
            num_threads = omp_get_num_threads();
            do_lots_of_stuff(id);
        }
    }
Internal Control Variables (ICVs) define the state of the runtime system to a thread. Consistent pattern: set with "omp_set" or an environment variable, read with "omp_get".

144 Optimizing Loop Parallel Programs
Short range force computation for a particle system using the cut-off method:
    #include <omp.h>
    #pragma omp parallel
    {
        // define neighborhood as the num_neigh particles
        // within "cutoff" of each particle "i".
        #pragma omp for
        for( int i = 0; i < n; i++ ) {
            Fx[i]=0.0; Fy[i]=0.0;
            for (int j = 0; j < num_neigh[i]; j++) {
                int neigh_ind = neigh[i][j];
                Fx[i] += forceX(i, neigh_ind);
                Fy[i] += forceY(i, neigh_ind);
            }
        }
    }
- Particles may be unevenly distributed, i.e. different particles have different numbers of neighbors.
- Evenly spreading out loop iterations may fail to balance the load among threads.
- We need a way to tell the compiler how to best distribute the load.

145 The Schedule Clause
The schedule clause affects how loop iterations are mapped onto threads.
- schedule(static[,chunk])
  - Deal out blocks of iterations of size "chunk" to each thread.
- schedule(dynamic[,chunk])
  - Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
- schedule(guided[,chunk])
  - Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
- schedule(runtime)
  - Schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or the runtime library, for OpenMP 3.0).

146 Optimizing Loop Parallel Programs
Short range force computation for a particle system using the cut-off method:
    #include <omp.h>
    #pragma omp parallel
    {
        // define neighborhood as the num_neigh particles
        // within "cutoff" of each particle "i".
        #pragma omp for schedule(dynamic, 10)
        for( int i = 0; i < n; i++ ) {
            Fx[i]=0.0; Fy[i]=0.0;
            for (int j = 0; j < num_neigh[i]; j++) {
                int neigh_ind = neigh[i][j];
                Fx[i] += forceX(i, neigh_ind);
                Fy[i] += forceY(i, neigh_ind);
            }
        }
    }
Divide the range of n into chunks of size 10. Each thread computes a chunk, then goes back to get its next chunk of 10 iterations. This dynamically balances the load between threads.
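The load-balancing claim can be checked on paper with a toy cost model. The sketch below does not run any threads: it assumes the cost of every iteration is known in advance (hypothetical data) and computes when the slowest thread would finish under a blocked static split versus a dynamic work queue in which the next chunk always goes to the thread that becomes free first. The function names static_makespan and dynamic_makespan are made up for this illustration, and real schedulers add overheads that the model ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Finish time when each thread gets one contiguous block of
// ceil(n / nthreads) iterations (the default static split).
inline long static_makespan(const std::vector<long>& cost, int nthreads) {
    assert(nthreads > 0);
    const long n = static_cast<long>(cost.size());
    const long block = (n + nthreads - 1) / nthreads;
    long worst = 0;
    for (int t = 0; t < nthreads; ++t) {
        long busy = 0;
        for (long i = t * block; i < std::min(n, (t + 1) * block); ++i)
            busy += cost[i];
        worst = std::max(worst, busy);
    }
    return worst;
}

// Finish time when chunks are handed out from a queue and the next chunk
// always goes to the thread that becomes free first.
inline long dynamic_makespan(const std::vector<long>& cost, int nthreads, long chunk) {
    assert(nthreads > 0 && chunk > 0);
    const long n = static_cast<long>(cost.size());
    std::vector<long> free_at(nthreads, 0);
    for (long start = 0; start < n; start += chunk) {
        auto next = std::min_element(free_at.begin(), free_at.end());
        for (long i = start; i < std::min(n, start + chunk); ++i)
            *next += cost[i];
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

With costs {8, 1, 1, 1, 1, 1, 1, 1} on 2 threads, the static split finishes at time 11 (one thread holds the expensive iteration plus three more), while the dynamic queue with chunk size 1 finishes at time 8.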

147 Loop Work-Sharing Constructs: The Schedule Clause
    Schedule clause   When to use
    STATIC            Pre-determined and predictable by the programmer
    DYNAMIC           Unpredictable, highly variable work per iteration
    GUIDED            Special case of dynamic to reduce scheduling overhead
- STATIC: least work at runtime; scheduling is done at compile time.
- DYNAMIC: most work at runtime; complex scheduling logic is used at run time.

148 Sections Work-Sharing Construct
The sections work-sharing construct gives a different structured block to each thread.
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            x_calculation();
            #pragma omp section
            y_calculation();
            #pragma omp section
            z_calculation();
        }
    }
By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
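As a concrete instance of the pattern, the sketch below runs three independent reductions over the same read-only vector, one per section. The struct Stats and the function summarize are hypothetical names chosen for this example. Each section writes a different field, so no synchronization beyond the barrier at the end of the sections construct is needed, and the code gives the same answer when compiled without OpenMP.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Stats {
    long min;
    long max;
    long sum;
};

// Three independent calculations, each a candidate for its own thread.
inline Stats summarize(const std::vector<long>& v) {
    assert(!v.empty());
    Stats s{0, 0, 0};

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            s.min = *std::min_element(v.begin(), v.end());

            #pragma omp section
            s.max = *std::max_element(v.begin(), v.end());

            #pragma omp section
            s.sum = std::accumulate(v.begin(), v.end(), 0L);
        }   // implicit barrier: all three fields are set past this point
    }

    return s;
}
```

With only three sections, at most three threads do useful work here, so sections suit a small, fixed number of coarse tasks rather than large loops.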

149 Single Work-Sharing Construct
The single construct denotes a block of code that is executed by only one thread. A barrier is implied at the end of the single block.
    #pragma omp parallel
    {
        do_many_things();
        #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }

150 Summary of OpenMP's Key Constructs
- The only way to create threads is with the parallel construct:
    #pragma omp parallel
- All threads execute the instructions in a parallel construct.
- Split work between threads by:
  - SPMD: use the thread ID to control execution
  - Worksharing constructs to split loops (simple loops only)
    #pragma omp for
  - Combined parallel/workshare as a shorthand
    #pragma omp parallel for
- High-level synchronization is safest:
    #pragma omp critical
    #pragma omp barrier
