Comparing and Optimising Parallel Haskell Implementations on Multicore
Jost Berthold, Simon Marlow, Abyd Al Zain, Kevin Hammond
The Parallel Haskell Landscape
Research into parallelism using Haskell has been ongoing since the late 1980s
– semi-implicit, deterministic programming model: par :: a -> b -> b
– strategies package up larger parallel computation patterns and separate algorithm from parallelism
– the GUM implementation ran on clusters or multiprocessors, using PVM
– successful: linear speedups on large clusters
Another Parallel Haskell variant: Eden
– more explicit than par: the programming model says where the evaluation happens
– also able to express parallel computation skeletons, e.g. parMap
– implementation based on GHC, runs on clusters and multiprocessors using PVM-based communication
– multiple heaps, not virtually-shared as in GUM (simpler implementation)
Several other Parallel/Distributed Haskell dialects, mostly research prototypes and all based on distributed heaps (some virtually-shared)
The Parallel Haskell Landscape
Recently (2005), shared-memory parallelism was added to GHC
– single shared heap
– programming models supported:
    pure:
    – par and Strategies
    – soon: Data Parallel Haskell
    impure, non-deterministic:
    – Concurrent Haskell, STM
– widely available, high-quality implementation
– very lightweight concurrency, we win concurrency benchmarks
– parallel GC added recently
This work:
– compare distributed and shared-heap models
– analyse performance of the shared-heap implementation
    – implement execution profiling
    – make improvements to the runtime
Shared vs. Distributed heaps
Why a shared heap?
– no communication overhead, hence easier to program
– good for fine-grained tasks with plenty of communication and sharing
Why a distributed heap?
– parallel GC is much easier
– no cache-coherency overhead
– no mutexes
The GpH programming model
par :: a -> b -> b
– stores a pointer to its first argument in a spark pool
– an idle CPU takes a spark from the spark pool and turns it into a thread
seq :: a -> b -> b
– used for sequential ordering

    parMap :: (a -> b) -> [a] -> [b]
    parMap f []     = []
    parMap f (x:xs) = let y  = f x
                          ys = parMap f xs
                      in y `par` (ys `seq` y:ys)
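To make the spark mechanism concrete, here is a minimal sketch of `par` used directly on two independent computations. It assumes `par` and `pseq` imported from `GHC.Conc` in base (the `parallel` package's `Control.Parallel` exports the same functions). It uses `pseq` where the slide uses `seq`, because `pseq` guarantees the evaluation order. The names `fib` and `parSum` are illustrative and not from the talk.

```haskell
import GHC.Conc (par, pseq)

-- Deliberately slow Fibonacci, used only to create some work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Spark `a`, evaluate `b` on the current thread, then combine.
-- If no CPU picks up the spark, `a` is simply evaluated by (+).
parSum :: Int -> Int -> Integer
parSum m n =
  let a = fib m
      b = fib n
  in  a `par` (b `pseq` (a + b))

main :: IO ()
main = print (parSum 25 26)
```

The result is the same whether or not the spark is ever turned into a thread, which is what makes the model deterministic.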
sumEuler benchmark

Sequential:

    sumEuler :: Int -> Int
    sumEuler n = sum (map phi [1..n])

    phi :: Int -> Int
    phi n = length (filter (relprime n) [1..(n-1)])

With parMap (phi unchanged):

    sumEuler :: Int -> Int
    sumEuler n = sum (parMap phi [1..n])

With chunking (phi unchanged):

    sumEuler :: Int -> Int
    sumEuler n = parChunkFoldMap (+) phi [1..n]

    parChunkFoldMap :: (b -> b -> b) -> (a -> b) -> [a] -> b
    parChunkFoldMap f g xs = foldl1 f (map (foldl1 f . map g) (splitAtN c xs)
                                         `using` parList rnf)
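The slide leaves `relprime`, `splitAtN` and the chunk size `c` undefined. The following is a self-contained sketch of the chunked version under stated assumptions: `relprime` is taken to be a `gcd` test, `splitAtN` is a plain chunking helper, and the chunk size of 100 is arbitrary. It also swaps the Strategies-based `using parList rnf` for direct `par`/`pseq` calls from `GHC.Conc`, so that it depends only on base. It is not the benchmark code used in the talk.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Assumed definition: two numbers are relatively prime if their gcd is 1.
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1

phi :: Int -> Int
phi n = length (filter (relprime n) [1 .. n - 1])

-- Assumed helper: split a list into chunks of at most c elements (c > 0).
splitAtN :: Int -> [a] -> [[a]]
splitAtN _ [] = []
splitAtN c xs = let (h, t) = splitAt c xs in h : splitAtN c t

-- Spark the sum of each chunk, then add up the partial sums.
parChunkSum :: Int -> (a -> Int) -> [a] -> Int
parChunkSum c g xs = go (map (foldl' (+) 0 . map g) (splitAtN c xs))
  where
    go []       = 0
    go (p : ps) = let rest = go ps
                  in  p `par` (rest `pseq` (p + rest))

sumEuler :: Int -> Int
sumEuler n = parChunkSum 100 phi [1 .. n]

main :: IO ()
main = print (sumEuler 1000)
```

To run on several cores, compile with `ghc -O2 -threaded` and start the program with `+RTS -N8`. GHC's `-A` RTS option sets the allocation area size, which is presumably how a larger young generation would be requested.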
sumEuler execution profile
1. Standard GHC, 8 CPUs (2 x quad-core)
2. Eden using PVM, 8 CPUs (2 x quad-core)
Analysis (1)
The shared-heap implementation was spending a lot of time at the GC barrier. It turned out that the GC barrier had a bug: it was stopping one CPU at a time. We fixed that. Also, reducing the number of barriers, by increasing the size of the young generations, helps a bit.
sumEuler execution profile (2)
1. Standard GHC, including fix for GC barrier and 5MB young generation
2. Eden using PVM, 8 CPUs (2 x quad-core)
Analysis (2)
Some of the gaps are due to poor load-balancing. The existing load-balancing strategy was based on pushing spare work to idle CPUs, so there could be a long delay between a CPU becoming idle and receiving work from another CPU. We implemented lock-free work-stealing queues for load-balancing of sparks.
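The idea behind work-stealing can be illustrated with a toy spark pool: the owning CPU pushes and pops at one end, and an idle CPU steals from the other end instead of waiting for work to be pushed to it. The sketch below is an assumption-laden illustration, not the runtime's data structure. It stores the pool as a list in an `IORef`, updates it with `atomicModifyIORef'` rather than a mutex, and accepts an O(n) steal for simplicity. The names `Pool`, `push`, `pop` and `steal` are hypothetical.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef)

-- A toy spark pool: the owner works at the front, thieves take from the back.
newtype Pool a = Pool (IORef [a])

newPool :: IO (Pool a)
newPool = Pool <$> newIORef []

-- The owner pushes new work at the front.
push :: Pool a -> a -> IO ()
push (Pool ref) x = atomicModifyIORef' ref (\xs -> (x : xs, ()))

-- The owner pops its most recently pushed item.
pop :: Pool a -> IO (Maybe a)
pop (Pool ref) = atomicModifyIORef' ref step
  where
    step []       = ([], Nothing)
    step (x : xs) = (xs, Just x)

-- An idle worker steals the oldest item from the other end.
steal :: Pool a -> IO (Maybe a)
steal (Pool ref) = atomicModifyIORef' ref step
  where
    step [] = ([], Nothing)
    step xs = (init xs, Just (last xs))

main :: IO ()
main = do
  pool <- newPool
  mapM_ (push pool) [1 .. 5 :: Int]
  mine   <- pop pool
  stolen <- steal pool
  print (mine, stolen)  -- prints (Just 5,Just 1)
```

Taking from opposite ends means the owner and a thief rarely compete for the same item, and an idle CPU acquires work as soon as it asks for it.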
sumEuler execution profile (3)
1. Standard GHC + GC barrier fixes + work-stealing
2. Eden using PVM, 8 CPUs (2 x quad-core)
Analysis (3)
High priority: implement per-CPU GC
– each CPU has a local heap that can be collected independently of the other CPUs
– single shared global heap, collected much less frequently using stop-the-world
– e.g. Concurrent Caml, Manticore
Lower the overhead for spark activation, by having a dedicated thread to run sparks
– this will make the implementation less sensitive to granularity: less need to group work into “chunks”, easier for programmers to get speedup
Matrix multiplication
Using strategies, we can parallelise matrix multiply either elementwise, by grouping rows or columns, or blockwise. In Eden, the matrix data is communicated between the processing elements, but no PE keeps a complete copy of the matrix.
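As an illustration of the row-wise variant for the shared-heap case, the sketch below sparks the evaluation of each result row. It is written with `par` and `pseq` from `GHC.Conc` rather than with the Strategies library, and it uses lists of lists for the matrix. Both choices are assumptions made to keep the example self-contained. The names `mulSeq`, `mulRows` and `forceRow` are hypothetical.

```haskell
import Data.List (transpose)
import GHC.Conc (par, pseq)

type Matrix = [[Int]]

-- Sequential product: each result element is a row-by-column dot product.
mulSeq :: Matrix -> Matrix -> Matrix
mulSeq a b = [[sum (zipWith (*) row col) | col <- cols] | row <- a]
  where
    cols = transpose b

-- Evaluate every element of a row.
forceRow :: [Int] -> ()
forceRow = foldr seq ()

-- Row-wise parallel product: one spark per result row.
mulRows :: Matrix -> Matrix -> Matrix
mulRows a b = go (mulSeq a b)
  where
    go []       = []
    go (r : rs) = d `par` (rs' `pseq` (d `pseq` (r : rs')))
      where
        d   = forceRow r  -- sparked, and also demanded before the row is returned
        rs' = go rs

main :: IO ()
main = print (mulRows [[1, 2], [3, 4]] [[5, 6], [7, 8]])
-- prints [[19,22],[43,50]]
```

An elementwise version would spark each dot product instead, and a blockwise version would spark groups of rows and columns. The trade-off is the usual one between the number of sparks and the amount of work per spark.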
Matrix multiplication
1. Standard GHC, 8 CPUs (2 x quad-core)
2. Standard GHC + GC barrier fix + work-stealing
3. Eden
Analysis (4)
The distributed-memory implementation suffers due to communication overhead. Also, the distributed-memory algorithm is more complex, because it tries to avoid copying the input data. We still have a way to go, though: GHC achieves a speedup of 5.6 on 8 CPUs.
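One way to read that figure is as parallel efficiency, the speedup $S$ divided by the number of CPUs $p$. Using only the numbers quoted on the slide:

```latex
E = \frac{S}{p} = \frac{5.6}{8} = 0.70
```

On this reading, roughly 30% of the available capacity is not turning into speedup. The slide does not break down where it goes.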
Further Challenges
Work duplication
– GHC doesn’t prevent multiple threads from duplicating a computation; instead, it tries to discover duplicated work in progress and halt one of the threads
– preventing duplication up-front is expensive: extra memory operations (black holes), or even atomic instructions
– we found that in some cases work duplication really is affecting scaling
– so we want to prevent duplication up-front for some computations
Further Challenges
Space leak in par
– “par e1 e2” stores a pointer to e1 in the spark pool before evaluating e2
– typically e2 and e1 share some computation
– if we don’t have enough processors, we might not evaluate e1 in parallel
– how do we know when we can discard that entry from the spark pool? If we don’t ever discard entries from the spark pool, we have a space leak.
– “when e2 has completed” doesn’t work, e.g. parMap
– “when e1 is evaluated” also doesn’t work: e1 itself isn’t shared, but it refers to shared computations
– “when e1 is disjoint from the program’s live data” is too hard to determine
– workaround: use only “par x e2” where x is shared with e2
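The workaround can be shown by contrasting two ways of writing the same spark. In the first, the sparked expression is a fresh thunk that only the spark pool refers to. In the second, the sparked value is a variable that the continuation also uses. Presumably this lets the runtime drop the spark once nothing else refers to that variable. This is a sketch with made-up names (`expensive`, `unshared`, `shared`), not code from the talk.

```haskell
import GHC.Conc (par, pseq)

-- Stand-in for some costly computation.
expensive :: Int -> Int
expensive n = sum [1 .. n]

-- Pattern to avoid: the sparked thunk `expensive a` is referenced only by
-- the spark pool, so its result can never be used by the continuation.
unshared :: Int -> Int -> Int
unshared a b = expensive a `par` (expensive b + 1)

-- Workaround pattern: spark a variable x that the continuation also uses.
shared :: Int -> Int -> Int
shared a b =
  let x = expensive a
      y = expensive b
  in  x `par` (y `pseq` (x + y))

main :: IO ()
main = do
  print (unshared 1000 2000)
  print (shared 1000 2000)
```

The parMap definition shown earlier already follows the second pattern: it sparks `y`, and `y` is also the head of the list it returns.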
Conclusions
The tradeoff between distributed and shared heaps is a complex one
– a distributed heap can give better performance
– but is harder to program against: the programmer must think about communication
– we believe a shared heap is the better model in the short term, but as we need to scale to larger numbers of cores or NUMA architectures, a distributed or hybrid model will become necessary
We have made significant improvements to the performance of parallel programs in GHC
– and identified several further areas for improvement
– GHC 6.10.1 (released next week) contains some of these improvements; download it and try it out!