Shared Memory Parallel Programming: An Introduction Kenjiro Taura.

1 Shared Memory Parallel Programming: An Introduction Kenjiro Taura

2 Agenda An example problem Pthreads OpenMP Common errors in shared memory programming

3 Goals (of this talk) Understand basics of shared memory programming, with Pthreads and OpenMP as specific examples (because of their popularity) Understand common issues in parallel programming (particularly in shared memory parallel programming) As a byproduct, understand the motivations of the two main topics of this lecture Model checking tools Transactional memory

4 An example problem Consider Nqueen problem Given N, answer how many ways N queens can be put in N-by-N chessboard Certainly a boring problem, but its structure is common in other problems

5 Serial code (bitmap tricks are by no means important)

    int nqs(unsigned int v, unsigned int l, unsigned int r, int row, int n) {
      if (row == n) {
        return 1;
      } else {
        int c = 0;
        unsigned int b = v | l | r;
        while (b + 1 < (1 << n)) {
          unsigned int o = (b + 1) & (~b);
          c += nqs(v | o,
                   ((l | o) << 1) & ((1 << n) - 1),
                   (r | o) >> 1,
                   row + 1, n);
          b |= o;
        }
        return c;
      }
    }

See nqs (SECTION 1) in the code

6 Some points to note Put a queen on each row, from top (0-th row) to the bottom v, l, and r are bitmaps that together indicate which columns (on that row) are not safe The entire code forms a natural recursion

7 Source of parallelism Recursions can be done in parallel, leading to a tree structured parallelism

8 Parallel recursion The structure is common in many algorithms game tree search combinatorial search (SAT, CSP, etc.) quicksort, mergesort So let's take nqueen as a template of many parallel algorithms Note: Nqueen is a particularly easy example that does almost nothing but recursive calls

9 Pthreads POSIX standard for shared memory programming Basic concepts Threads sharing memory (process address space) Threads created at arbitrary points at runtime Synchronization primitives

10 Pthreads In virtually all existing implementations, shared memory among threads is implemented by hardware Even the operating system essentially does nothing as far as shared memory (communication among threads) is concerned As such, virtually all Unix platforms support it, and many languages build similar thread libraries on top of Pthreads

11 Pthreads primitives Thread creation ( pthread_create ) Thread join ( pthread_join ) wait for termination of a specified thread

12 Parallelizing nqueens (only) by pthread_create and pthread_join Recursive call (up to a certain depth) → creation of thread that does recursive call Nothing fundamentally difficult Just tedious because pthread_create accepts only functions taking a single argument Naively parallelized version nqp in the code (SECTION 3)

13 Problems of this approach How many threads does this code end up with? It depends on up to which depth threads are created Say N = 20: 20 threads (up to depth = 1) is clearly not enough with today's large (say 32 core) machines Something like 20 * 17 (up to depth = 2) is already unnecessarily many

14 Effects of unnecessarily many threads Besides being pointless, they incur thread creation overhead and unnecessary context switches These are of less concern if task granularity is large enough, however The real issues are If each thread consumes lots of (temporary) memory, the overall memory consumption may be overwhelming If each thread accesses many memory locations with locality, overall memory locality will be lost

15 So we'd like to work with a fixed, arbitrary number of threads The idea: decouple "work" and "thread" run a fixed number of threads generate a unit of "work" (a small data structure representing what to do), not "a thread", on each recursive call each thread repeats fetching and executing a single unit of work

16 The simplest work generation strategy Prepare an empty queue (container) of works Run the nqueen recursion up to a specified depth Calls at that depth will package the given parameters into a work and put it into the queue See nqw (SECTION 5) in the code Then run all elements in the queue (Figure: a work queue, with work generated at depth = 2)

17 Work queue structure Many possibilities (linked list, dynamic array, etc.) For simplicity, assume dynamic arrays (e.g., Java's ArrayList), with fields array, sz, and capacity After generating all works, it amounts to executing all works in the array This particularly fits OpenMP (discussed next) See work_queue in the code

18 OpenMP Another widely accepted shared memory programming model gcc >= 4.2 supports OpenMP v2.5 Visit openmp.org for specs and available implementations Makes parallelization of regular loops like for (i = 0; i < n; i++)... particularly easy

19 The simplest illustrative example This is all you need to execute iterations of the for loop in parallel:

    main() {
      #pragma omp parallel for
      for (i = 0; i < n; i++) {
        a[i] = f(i);
      }
    }

You can also specify (and fix) the number of threads to create (OMP_NUM_THREADS)

20 Underlying execution model main() begins with a single thread (same as Pthreads) But at this point, the other N - 1 threads are (at least conceptually) created and waiting Two important concepts Parallel region (#pragma omp parallel) Worksharing constructs (#pragma omp for)

21 #pragma omp parallel Code qualified by this pragma will be executed by all the threads The standalone effect of this pragma is to let the same code be (perhaps redundantly) executed by all threads:

    #pragma omp parallel
    {
      printf("hello\n");
    }

Real "parallel execution" is achieved by worksharing constructs (like #pragma omp for)

22 #pragma omp for Code qualified by this pragma must be a regular for loop, typically in a parallel region:

    #pragma omp for
    for (i = 0; i < n; i++)...

Iterations will be distributed to the working threads i.e., units of work (iterations) are shared among threads, hence the name "worksharing" constructs See the spec for other worksharing constructs (sections)

23 What is OpenMP good for? It's particularly good for parallelizing regular for loops May not be very good at executing dynamically created work You cannot create threads at arbitrary points The v3.0 spec has a provision for this (task, taskwait) But are implementations out there yet? Also lacks some synchronization primitives (esp. condition variables)

24 Back to nqueen example Given a queue filled with work, it is only one step away to parallelize it with OpenMP See nqw_omp_master (SECTION 6) in the code

25 Writing this in Pthreads With a work queue wq, generate a fixed number of threads, passing wq to each of them Each thread:

    while (some more work in wq) {
      fetch a work from wq;
      execute it and write (partial) count somewhere;
    }

26 Does this work?

    nq_worker(wq) {
      while (wq->hd < wq->sz) {
        arg = wq->args + wq->hd;
        wq->hd++;
        arg->ans = nqw(wq, arg->v, arg->l, arg->r,...);
      }
    }

(wq->hd is an index that is meant to point to the next work to execute)

27 You need to remember In shared memory, anything beyond a single read/write may not be atomic (i.e., may be interleaved with other threads' access to memory)

28 How to achieve atomicity Locks Compare-and-swap or load-linked/store-conditional instructions for a read-modify-write to a single location They essentially allow us to "check" if another thread modified the location of interest (yet to be realized) Transactional memory for general multiple-word atomic transactions

29 Locks Basic concepts m = create_mutex() lock(m) unlock(m) Property: two threads cannot lock a single mutex at the same time The practice is to sandwich the code that must be executed atomically between lock(m) and unlock(m), and to do this consistently in your code

30 Locks in Pthreads

    pthread_mutex_t m;
    pthread_mutex_init(&m,...);
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);

31 Compare-and-swap An instruction supported by most modern CPUs. e.g., CAS [p],x,y does the following, but atomically:

    if (*p == x) swap(*p, y)

32 Atomic increment by compare-and-swap

    x = *p;
    y = x + 1;
    CAS [p],x,y

This will complete the job if *p was not modified during the three instructions Otherwise, it does nothing, and lets you know the result by leaving y intact

33 Atomic increment by compare-and-swap (continued)

    while (1) {
      x = *p;
      y = x + 1;
      CAS [p],x,y;
      if (x == y) break;
    }

i.e., repeat if somebody has overtaken us

34 Generality of compare-and-swap Obviously, it can be used to implement *p = E(*p) atomically for any expression E involving (only) x:

    x = *p;
    y = E(x);
    CAS [p],x,y

There is also CAS2, which can do this on two memory locations

35 What's wrong with locks? Pointless to use in cases where CAS can be used (single-word update) Overhead can be high, compared to the actual code to execute A lock makes you wait on "entry": slow processes block others even if they could finish quickly A lock may block processes unnecessarily, in case the threads do not actually access the same location With multiple locks, deadlocks may happen

36 Back to nqueen example You may fix this either by locks or by atomic increment of wq->hd, as long as the array is not updated See nq_worker (SECTION 5) in the code

    while (1) {
      nqueen_arg_t arg = 0;
      /* try to fetch work */
      pthread_mutex_lock(&wq->mx);
      {
        int hd = wq->hd;
        if (hd < wq->sz) {
          arg = wq->args + hd;
          wq->hd = hd + 1;
        }
      }
      pthread_mutex_unlock(&wq->mx);
      if (arg == 0) break; /* no work left, quit */
      arg->ans = nqw(wq, arg->v, arg->l, arg->r, arg->row, arg->n, arg->parallel_depth);
    }

37 Further generalization (and headaches ahead :-) So far all works have been generated before creating any thread 1. run nqueen up to a depth to generate work; 2. create threads; 3. wait for all threads to exit; A slight generalization is to allow works to be added at any time, by these threads

38 When do we need this? Time to generate ”all” work becomes significant compared to average work size (and number of processors) e.g., quicksort that takes time before recursive calls It's difficult to determine a point at which no further work generation is ever necessary e.g., highly unbalanced search tree Neither was the case in Nqueen, but we nevertheless use it as a running example

39 A simple fix Just do this 1. create threads; 2. run nqueen up to a depth to generate work; 3. wait for all threads to exit; rather than 1. run nqueen up to a depth to generate work; 2. create threads; 3. wait for all threads to exit;

40 Two things Adding a work to the queue now needs a lock Let's assume we have done that. Can you discover a subtle bug in nqw2_pth_master? The more significant problem is premature quit

41 How to avoid premature quit? What was wrong? Our method to detect "no work in the queue" neglected the fact that some work is "in progress" (and so may generate further work) One way to fix this is to explicitly mark a task as "finished": have a separate field in the work record, or use ans = -1 to indicate "not finished"

42 What should a worker do when it finds the queue (perhaps temporarily) empty?

    while (1) {
      nqueen_arg_t arg = 0;
      /* try to fetch work */
      pthread_mutex_lock(&wq->mx);
      {
        int hd = wq->hd;
        if (hd < wq->sz) {
          arg = wq->args + hd;
          wq->hd = hd + 1;
        }
      }
      pthread_mutex_unlock(&wq->mx);
      if (arg == 0) ??????????? /* what to do? */
      arg->ans = nqw(wq, arg->v, arg->l, arg->r, arg->row, arg->n, arg->parallel_depth);
    }

43 if (arg == 0) continue; /* ??? */

44

    if (arg == 0) {
      if (really_no_work(wq)) break;
      else continue;
    }

    int really_no_work(wq) {
      for (i = 0; i < wq->sz; i++) {
        if (wq->args[i].ans == -1) return 0;
      }
      return 1;
    }

45 More issues you may not want to discuss any more... When it turns out there is work in progress, the thread must wait Busy wait? Generally we need to block (use a condition variable) What if we'd like to have multiple queues to avoid the bottleneck of a single queue? Termination detection becomes further complicated

46 Motivations for model checking tools Enough about the many subtleties involved in parallel programming Message passing has similar issues Reasoning about correctness ultimately amounts to exploring all the ways these concurrent events may happen (interleavings) Humans are not good at this Model checking is about automating this "exhaustive" reasoning

