1 Haskell on a Shared-Memory Multiprocessor
Tim Harris, Simon Marlow, Simon Peyton Jones

2 Why now?
Shift in the balance:
–no more free sequential performance boosts
–SMP hardware will be the norm
–non-parallel programs will be frozen in performance
–even a modest parallel speedup is now worthwhile, because the other processors come for free
There is a race to produce good parallel languages.

3 The story so far…
Parallel FP research is not new, but
–it has mostly focussed on distributed memory, and hence separate heaps: communication is expensive, so careful tuning of work distribution is needed
–multi-core processors (for small N) will be shared memory, so we can use a single heap: almost zero communication overhead means better prospects for reliable speedup
The tradeoffs are likely to be quite different: less scalability beyond small N.

4 Concurrent Haskell
Concurrent programming in Haskell is exciting right now:
–STM means less error-prone concurrent programming
–we understand how Concurrent Haskell interacts with OS-level concurrency and the FFI
–lots of people are using it
Concurrent programs are parallel programs too:
–so we already have plenty of parallel programs to play with
–to say it another way: we can use Concurrent Haskell to write parallel programs (no need for parallel annotations like par straight away)
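For example, here is a minimal Concurrent Haskell program (a sketch: the two workloads are invented for illustration). On an SMP runtime the forked threads can run on separate processors, with no parallel annotations at all:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar

    main :: IO ()
    main = do
      done1 <- newEmptyMVar
      done2 <- newEmptyMVar
      -- Two ordinary Concurrent Haskell threads; ($!) forces the
      -- work to happen in the forked thread rather than in main.
      _ <- forkIO (putMVar done1 $! sum     [1 .. 1000000 :: Integer])
      _ <- forkIO (putMVar done2 $! product [1 .. 5000    :: Integer])
      s <- takeMVar done1
      p <- takeMVar done2
      print (s, p `mod` 997)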

5 So, what’s the problem?
Suppose we let 2 Haskell threads loose on a shared heap. What goes wrong?
–allocation: the threads had better have separate allocation areas
–immutable heap objects present no problems (and are common!)
–mutable objects: MVars, TVars. We had better make sure these are thread-safe (a small illustration follows this slide).
–shared data in the runtime: e.g. the scheduler’s run queue, the garbage collector’s remembered set. Access to these must be made thread-safe.
–but…
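To illustrate the mutable-object bullet: MVar operations are atomic, so concurrent updates through modifyMVar_ are never lost. A minimal sketch, with an invented counter workload:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Control.Monad (forM_, replicateM, replicateM_)

    main :: IO ()
    main = do
      counter <- newMVar (0 :: Int)
      dones   <- replicateM 4 newEmptyMVar
      forM_ dones $ \d -> forkIO $ do
        -- modifyMVar_ takes the MVar (excluding other threads),
        -- applies the update, and puts the result back, so no
        -- increment is lost.
        replicateM_ 1000 (modifyMVar_ counter (\n -> return $! n + 1))
        putMVar d ()
      mapM_ takeMVar dones
      readMVar counter >>= print   -- always 4000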

6 The real problem is Thunks!
let x = fac z in x * 2
[Diagram: allocation creates a THUNK for "fac z"; evaluation pushes an update frame on the stack; when the value is returned, the thunk is overwritten with an IND (indirection) to the value.]
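In source terms, the hazard looks like this sketch (fac is assumed to be an ordinary factorial, defined here so the fragment stands alone): two threads force the same shared thunk, and on SMP both may enter it before either has updated it:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Control.Exception (evaluate)

    fac :: Integer -> Integer
    fac n = product [1 .. n]

    main :: IO ()
    main = do
      let x = fac 5000          -- one shared thunk in the heap
      d1 <- newEmptyMVar
      d2 <- newEmptyMVar
      -- Both threads force the same thunk; they may both enter it
      -- before either overwrites it with an IND to the value.
      _ <- forkIO (evaluate x >> putMVar d1 ())
      _ <- forkIO (evaluate x >> putMVar d2 ())
      takeMVar d1
      takeMVar d2
      print (x * 2)             -- uses whichever update won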

7 Should we lock thunks?
Thunks are clearly shared mutable state, so we should protect against simultaneous access with a mutex, right?
[Diagram: a thunk, a THUNK header word followed by its free variables.]

8 Locks are v. expensive
A lock is implemented using a guaranteed-atomic instruction, such as compare-and-swap. These instructions are about 100x more expensive than ordinary instructions.
We measured adding two CAS instructions to every thunk evaluation; the result was about 50% worse performance.

9 Can we do it lock-free?
What would go wrong if we let both threads evaluate the thunk?
–they both compute the same value…
–…so the only cost is the extra work
–and most thunks are cheap
[Diagram: the shared thunk (THUNK header plus free variables) entered by two threads.]

10 Not quite that simple…
Race between update and entry:
[Diagram: one thread overwrites the THUNK with an IND to its value at the same moment as another thread enters it.]

11 Hardware re-ordering?
Not all processors guarantee strong memory ordering:
–no read ordering: a processor might observe the writes in a different order
–no write ordering: the header might be written before the value, or worse, the value itself might be written after the update
–happily, x86 currently guarantees both read & write ordering

12 Hardware re-ordering, cont.
No write ordering => we need a memory barrier (could be expensive!)
Write ordering but no read ordering:
[Diagram: the thunk is given a padding field, initialised to 0 at allocation.]

13 Can we reduce duplication?
Idea:
–periodically scan each thread’s stack
–attempt to claim exclusive access to each thunk under evaluation
–halt any duplicate evaluation
[Diagram: a thread’s stack, with an update frame pointing to the THUNK being evaluated.]

14 Claiming a thunk
Traverse a thread’s stack; when we reach an update frame, atomically swap the header word of the thunk with BLACKHOLE.
[Diagram: the update frame’s thunk, its THUNK header swapped for BLACKHOLE.]

15 Claiming a thunk
If the header was previously:
1. a THUNK, we have now claimed it
2. BLACKHOLE, another thread owns it
3. IND, another thread has already updated it
(A toy model of this step follows.)
[Diagram: an update frame pointing to a BLACKHOLE: duplicate evaluation detected.]
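Here is a toy model of the claiming step in Haskell itself, not GHC's actual (C) runtime code: the Header and Claim types are invented, and atomicModifyIORef' stands in for the runtime's atomic swap on the header word:

    import Data.IORef

    -- Invented model of a thunk's header word.
    data Header = THUNK | BLACKHOLE | IND Int

    data Claim = Claimed | OwnedByOther | AlreadyUpdated Int

    claimThunk :: IORef Header -> IO Claim
    claimThunk hdr = atomicModifyIORef' hdr $ \h -> case h of
      THUNK     -> (BLACKHOLE, Claimed)       -- case 1: we own it now
      BLACKHOLE -> (h, OwnedByOther)          -- case 2: another thread owns it
      IND v     -> (h, AlreadyUpdated v)      -- case 3: already updated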

16 What happens to the duplicate evaluation?
Well-known technique (Reid ’99), also used in asynchronous exceptions and STM.
[Diagram: an update frame pointing to a BLACKHOLE (“this thread has claimed this thunk”); the duplicate evaluation’s stack frames are captured in an AP_STACK object, linked via an IND.]

17 Stopping duplicate evaluation, cont.
The thread blocks until the BLACKHOLE has completed evaluation.
[Diagram: an update frame pointing to a BLACKHOLE (“another thread has claimed this thunk”); this thread blocks.]

18 Claiming thunks
Works like real locking for long-running thunks, and like lock-free execution for short-lived thunks: precisely what we want.
We must mark update frames for thunks we have claimed, so we don’t attempt to claim twice.
If a thread has claimed a thunk, this does not necessarily mean it is the only thread evaluating it: the other thread(s) may not have tried to claim it yet.

19 Evaluating a BLACKHOLE: blocking
What if a thread enters a BLACKHOLE, i.e. a claimed thunk? The thread must block.
In single-threaded GHC, we attached blocked threads to the BLACKHOLE itself:
–easy to find the blocked threads when updating the BLACKHOLE, but
–in a multi-threaded setting this leads to more race conditions on the thunk,
–so we must store the queue of blocked threads in a separate list, and check it periodically (see the sketch after this slide)
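Continuing the toy model (still an invented sketch, with the Header type repeated so the fragment stands alone): a thread that enters a BLACKHOLE waits, re-checking periodically until the owner has replaced the header with an indirection:

    import Control.Concurrent (threadDelay)
    import Data.IORef

    data Header = THUNK | BLACKHOLE | IND Int

    -- Poll the claimed thunk until its owner updates it; this mirrors
    -- keeping blocked threads on a separate list that is checked
    -- periodically, rather than attaching them to the BLACKHOLE.
    awaitBlackhole :: IORef Header -> IO Int
    awaitBlackhole hdr = do
      h <- readIORef hdr
      case h of
        IND v -> return v                               -- owner finished
        _     -> threadDelay 1000 >> awaitBlackhole hdr -- still claimed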

20 Black-holing
Black-holing has been around for a while. It also:
–fixes some space leaks
–catches some loops
We are just extending the existing black-holing technique to catch duplicate work in SMP-GHC.

21 Narrowing the window: grey-holing ToDo

22 More possibilities for duplication
z = let x = … expensive … in Just x
–two threads evaluate z simultaneously, creating two copies of x
–x is duplicated for ever
–we can try to catch this at the update: if we update an IND, then return the other value. Not foolproof.
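An illustration of the hazard (a sketch: Debug.Trace is used only to observe evaluation, and the sum stands in for the elided expensive computation):

    import Debug.Trace (trace)

    z :: Maybe Integer
    z = let x = trace "evaluating x" (sum [1 .. 1000000])
        in Just x

    -- Sequentially the trace fires once: both prints share one x.
    -- If two threads each built their own copy of the Just cell,
    -- each copy would carry its own x thunk, and consumers holding
    -- the two copies could evaluate x twice; once both copies are
    -- in the heap, the duplicates are never merged.
    main :: IO ()
    main = case z of
      Just x  -> print x >> print x
      Nothing -> return ()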

23 STM(?) ToDo

24 Measurements
[Chart: results using real locks.]

25 Measurements
[Chart: results for our lock-free implementation.]

26 Case study: parallelising GHC --make
GHC --make compiles multiple modules in dependency order.
.hi files for library modules are read once and shared by future compilations.
We want to parallelise compilations of independent modules, while synchronising access to the shared state.

27 Parallel compilation
[Diagram: a module dependency graph (A, B, C, Main); the independent modules A and B are compiled in parallel.]

28 GHC’s shared state
It’s a dataflow graph!
–one thread for each node, blocking until results are available from all its inputs
–parallel compilation happens automatically
–simple throttling prevents too many simultaneous compilations
(A sketch of this scheme follows.)
[Diagram: the same module graph (A, B, C, Main) viewed as a dataflow graph.]
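A hypothetical sketch of the scheme (the module graph and the compile stand-in are invented): one thread per module, an MVar per interface result, and a QSem throttling the number of simultaneous compilations:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Control.Concurrent.QSem
    import Control.Exception (bracket_)
    import qualified Data.Map.Strict as Map

    type ModName = String

    -- Invented module graph: each module lists its imports.
    deps :: Map.Map ModName [ModName]
    deps = Map.fromList
      [("A", []), ("B", []), ("C", ["A", "B"]), ("Main", ["B", "C"])]

    -- Stand-in for invoking the compiler on one module.
    compile :: ModName -> [String] -> IO String
    compile m ifaces = return ("iface:" ++ m ++ ":" ++ show (length ifaces))

    main :: IO ()
    main = do
      sem     <- newQSem 2                  -- at most 2 compilations at once
      results <- traverse (const newEmptyMVar) deps
      _ <- Map.traverseWithKey
             (\m ms -> forkIO $ do
                -- Block until every input module's interface exists.
                ifaces <- mapM (\d -> readMVar (results Map.! d)) ms
                iface  <- bracket_ (waitQSem sem) (signalQSem sem)
                                   (compile m ifaces)
                putMVar (results Map.! m) iface)
             deps
      mapM_ (\m -> readMVar (results Map.! m) >>= putStrLn) (Map.keys deps)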

29 Results: ideal
Two identical modules: a speedup of 1.3, using 1.5 CPUs.
Why not a speedup of 2?
–GC is single-threaded, and there is more GC when compiling in parallel (more live data)
–dependency analysis is single-threaded
–interface loading is shared
–increased load on the memory system
Discounting GC, we get a speedup of 1.54.

30 Results: compiling Happy
Modules are not completely independent, so the speedup drops to 1.2.

31 Results: compiling Anna
A larger program; make -j2 is now losing.
Better parallel speedup when optimising:
–probably a lower proportion of time is spent reading interface files,
–and there is proportionally lower contention for the shared state

32 Conclusion & what’s next?
Lock-free thunk evaluation looks promising.
Current issues:
–lock contention in the runtime
–lack of processor affinity
–the combination leads to dramatic slowdown for some examples, particularly concurrent programs
We are redesigning the scheduler to fix these issues.
Multithreaded GC:
–tricky, but well-understood
–benefits everyone on multi-core/multi-proc
Apps!
Full support for SMP is planned for GHC 6.6.

