Haskell on a Shared-Memory Multiprocessor
Tim Harris, Simon Marlow, Simon Peyton Jones

Why now? A shift in the balance:
– no more free sequential performance boosts
– SMP hardware will be the norm
– non-parallel programs will be frozen in performance
– even a modest parallel speedup is now worthwhile, because the other processors come for free
– hence a race to produce good parallel languages

The story so far… Parallel FP research is not new, but:
– it has mostly focussed on distributed memory, and hence separate heaps: communication is expensive, so careful tuning of work distribution is needed
– multi-core processors (for small N) will be shared memory, so we can use a single heap: almost zero communication overhead means better prospects for reliable speedup
– the tradeoffs are likely to be quite different
– less scalability beyond small N

Concurrent Haskell. Concurrent programming in Haskell is exciting right now:
– STM means less error-prone concurrent programming
– we understand how Concurrent Haskell interacts with OS-level concurrency and the FFI
– lots of people are using it
Concurrent programs are parallel programs too:
– so we already have plenty of parallel programs to play with
– to say it another way: we can use Concurrent Haskell to write parallel programs, with no need for parallel annotations like par straight away (see the sketch below)
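
A minimal Concurrent Haskell example of this point (our own illustration, not from the deck): two threads doing independent work, which an SMP-capable runtime can run on separate processors with nothing beyond forkIO.

    import Control.Concurrent

    -- Two independent computations in separate Haskell threads; no
    -- parallel annotations, just ordinary Concurrent Haskell.
    main :: IO ()
    main = do
      done <- newEmptyMVar
      _ <- forkIO $ do
             print (sum [1 .. 1000000 :: Int])
             putMVar done ()
      print (length (filter even [1 .. 1000000 :: Int]))
      takeMVar done   -- wait for the forked thread before exiting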

So, what's the problem? Suppose we let 2 Haskell threads loose on a shared heap. What goes wrong?
– allocation: the threads had better have separate allocation areas
– immutable heap objects present no problems (and are common!)
– mutable objects (MVars, TVars): we had better make sure these are thread-safe
– shared data in the runtime, e.g. the scheduler's run queue and the garbage collector's remembered set: access to these must be made thread-safe
– but …

The real problem is thunks! Consider:

    let x = fac z in x * 2

[Diagram: allocation creates a THUNK for fac z, capturing the free variable z; evaluation computes the value; the update overwrites the thunk with an IND to the value, which is returned.]
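
To make the allocation/evaluation/update cycle concrete, here is a toy model of a thunk as a mutable heap cell, written in plain Haskell rather than the runtime's actual code; Closure and force are names we have made up for illustration.

    import Data.IORef

    -- A heap closure: an unevaluated THUNK holding a computation over
    -- its free variables, or (after the update) an IND pointing at the
    -- computed value.
    data Closure a = Thunk (IO a) | Ind a

    force :: IORef (Closure a) -> IO a
    force ref = do
      c <- readIORef ref
      case c of
        Ind v    -> return v         -- already evaluated: follow the IND
        Thunk io -> do
          v <- io                    -- evaluation
          writeIORef ref (Ind v)     -- update: overwrite with an IND
          return v                   -- value returned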

Should we lock thunks? Thunks are clearly shared mutable state, so we should protect against simultaneous access with a mutex, right?
[Diagram: a THUNK closure and its free variables.]
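
What locking would look like in the toy model above (again our own sketch, not GHC's implementation): guard the closure with an MVar, so every entry takes and releases a lock.

    import Control.Concurrent.MVar

    data Closure a = Thunk (IO a) | Ind a

    -- Lock-based evaluation: modifyMVar takes a lock on every entry,
    -- so even cheap or already-evaluated thunks pay for atomic
    -- instructions.
    forceLocked :: MVar (Closure a) -> IO a
    forceLocked ref = modifyMVar ref $ \c -> case c of
      Ind v    -> return (Ind v, v)
      Thunk io -> do
        v <- io
        return (Ind v, v)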

Locks are very expensive. A lock is implemented using a guaranteed-atomic instruction, such as compare-and-swap (CAS). These instructions are roughly 100x more expensive than ordinary instructions. We measured adding two CAS instructions to every thunk evaluation: the result was about 50% worse performance.

Can we do it lock-free? What would go wrong if we let both threads evaluate the same thunk?
– they both compute the same value…
– it is just extra work
– most thunks are cheap

Not quite that simple… there is a race between update and entry.
[Diagram: one thread overwrites the THUNK with an IND pointing at the value while another thread is entering it.]

Hardware re-ordering? Not all processors guarantee strong memory ordering:
– no read ordering: a processor might observe the writes in a different order
– no write ordering: the header might be written before the value, or worse, the value itself might be written after the update
– happily, x86 currently guarantees both read and write ordering

Hardware re-ordering, cont.
– no write ordering => we need a memory barrier (could be expensive!)
– write ordering but no read ordering: initialise the thunk's padding field to 0
[Diagram: a THUNK with its free variables and a zero-initialised padding field.]

Can we reduce duplication? The idea:
– periodically scan each thread's stack
– attempt to claim exclusive access to each thunk under evaluation
– halt any duplicate evaluation

Claiming a thunk. Traverse a thread's stack; when we reach an update frame, atomically swap the header word of the thunk with BLACKHOLE.
[Diagram: an update frame pointing at a thunk whose header changes from THUNK to BLACKHOLE.]

Claiming a thunk, cont. If the header was previously:
1. a THUNK, we have now claimed it
2. a BLACKHOLE, another thread owns it, and our evaluation is a duplicate
3. an IND, another thread has already updated it
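
In the toy model, the claim can be pictured as an atomic operation on the header word; atomicModifyIORef' stands in for the real compare-and-swap, and the names are ours.

    import Data.IORef

    data Header = THUNK | BLACKHOLE | IND
      deriving (Eq, Show)

    -- Atomically blackhole the header if it is still a THUNK, and
    -- return what it was before. The caller branches on the result:
    --   THUNK     -> we have claimed it
    --   BLACKHOLE -> another thread owns it (halt duplicate evaluation)
    --   IND       -> another thread has already updated it
    claim :: IORef Header -> IO Header
    claim hdr = atomicModifyIORef' hdr $ \h -> case h of
      THUNK -> (BLACKHOLE, h)   -- claimed: install the BLACKHOLE
      _     -> (h, h)           -- leave BLACKHOLE/IND untouched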

What happens to the duplicate evaluation? It is suspended: a well-known technique (Reid '99), also used for asynchronous exceptions and in STM.
[Diagram: the duplicated computation's stack, up to its update frame, is packed into an AP_STACK closure; the thunk this thread had claimed is updated with an IND to it.]

Stopping duplicate evaluation, cont. If another thread has claimed the thunk, our thread blocks until the BLACKHOLE has been updated with a value.
[Diagram: the blocked thread's update frame points at a BLACKHOLE owned by the claiming thread.]

Claiming thunks works like real locking for long-running thunks and like lock-free execution for short-lived thunks: precisely what we want.
– we must mark update frames for thunks we have claimed, so that we don't attempt to claim the same thunk twice
– if a thread has claimed a thunk, that does not necessarily mean it is the only thread evaluating it: the other thread(s) may not have tried to claim it yet

Evaluating a BLACKHOLE: blocking. What if a thread enters a BLACKHOLE, i.e. a claimed thunk? The thread must block. In single-threaded GHC, we attached blocked threads to the BLACKHOLE itself:
– easy to find the blocked threads when updating the BLACKHOLE, but
– in a multi-threaded setting this leads to more race conditions on the thunk
– so we must store the queue of blocked threads in a separate list, and check it periodically (sketched below)
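
A sketch of the separate blocked-thread queue (our own model, not GHC's scheduler code): a thread that enters a BLACKHOLE parks on an MVar in a shared list, and the scheduler periodically wakes everything on the list so each thread can retry its thunk.

    import Control.Concurrent
    import Data.IORef

    type BlockedQueue = IORef [MVar ()]

    -- A thread that enters a BLACKHOLE parks itself on the queue.
    blockOnBlackHole :: BlockedQueue -> IO ()
    blockOnBlackHole q = do
      wake <- newEmptyMVar
      atomicModifyIORef' q (\ts -> (wake : ts, ()))
      takeMVar wake              -- sleep until the scheduler wakes us

    -- Run periodically: wake every blocked thread; each one re-enters
    -- the thunk it blocked on, and re-blocks if it is still a BLACKHOLE.
    wakeBlocked :: BlockedQueue -> IO ()
    wakeBlocked q = do
      ts <- atomicModifyIORef' q (\ts -> ([], ts))
      mapM_ (\m -> putMVar m ()) ts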

Black-holing. Black-holing has been around for a while. It also:
– fixes some space leaks
– catches some loops
We are just extending the existing black-holing technique to catch duplicate work in SMP GHC.

Narrowing the window: grey-holing ToDo

More possibilities for duplication. Consider:

    z = let x = … expensive … in Just x

– two threads evaluate z simultaneously, creating two copies of x
– x is duplicated forever
– we can try to catch this at the update: if we find ourselves updating an IND, return the other value instead. Not foolproof.
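
The update-time check, in the toy closure model from earlier (our own sketch): if another thread has already overwritten the closure with an IND, keep its value and discard our duplicate.

    import Data.IORef

    data Closure a = Thunk (IO a) | Ind a

    -- Duplicate-catching update: if the closure is already an IND we
    -- lost the race, so return the existing value rather than
    -- installing our own copy. Not foolproof, as the slide says.
    update :: IORef (Closure a) -> a -> IO a
    update ref v = atomicModifyIORef' ref $ \c -> case c of
      Ind v' -> (Ind v', v')    -- another thread got there first
      _      -> (Ind v, v)      -- normal update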

STM(?) ToDo

Measurements: using real locks

Measurements: our lock-free implementation

Case study: parallelising GHC --make
– GHC --make compiles multiple modules in dependency order
– .hi files for library modules are read once and shared by future compilations
– we want to parallelise the compilation of independent modules, while synchronising access to the shared state

Parallel compilation. [Diagram: module dependency graph: Main imports A and B, which both import C; A and B can be compiled in parallel.]

GHC's shared state: it's a dataflow graph!
– one thread for each node; each blocks until results are available from all of its inputs
– parallel compilation happens automatically
– simple throttling prevents too many simultaneous compilations
A sketch of the scheme follows.
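
A hypothetical sketch of the dataflow scheme in Concurrent Haskell (compileGraph and compile are made-up names; throttling is omitted): one thread per module, each blocking on its dependencies' result MVars, so independent modules compile in parallel automatically.

    import Control.Concurrent
    import Control.Monad (forM_)

    -- One thread per module: block until every dependency has finished,
    -- compile, then signal our own completion.
    compileGraph :: [(String, [String])] -> (String -> IO ()) -> IO ()
    compileGraph mods compile = do
      vars <- mapM (\(m, _) -> do v <- newEmptyMVar; return (m, v)) mods
      let var m = maybe (error ("unknown module: " ++ m)) id (lookup m vars)
      forM_ mods $ \(m, deps) -> forkIO $ do
        mapM_ (readMVar . var) deps   -- wait for all inputs
        compile m                     -- independent modules run in parallel
        putMVar (var m) ()            -- publish our result
      mapM_ (readMVar . snd) vars     -- wait for the whole graph

For the module graph in the diagram above, compileGraph [("C",[]),("A",["C"]),("B",["C"]),("Main",["A","B"])] (putStrLn . ("compiling " ++)) compiles A and B in parallel once C is done.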

Results: the ideal case (2 identical modules). Why not a speedup of 2?
– GC is single-threaded, and there is more GC when compiling in parallel (more live data)
– dependency analysis is single-threaded
– interface loading is shared
– increased load on the memory system
Discounting GC, we get a speedup of 1.3, using 1.5 CPUs.

Results: compiling Happy. Modules are not completely independent; the speedup drops to 1.2.

Results: compiling Anna, a larger program
– make -j2 is now losing
– better parallel speedup when optimising:
  – probably a lower proportion of time spent reading interface files,
  – and proportionally lower contention for the shared state

Conclusion & what's next?
– lock-free thunk evaluation looks promising
– current issues:
  – lock contention in the runtime
  – lack of processor affinity
  – the combination leads to a dramatic slowdown for some examples, particularly concurrent programs
– we are redesigning the scheduler to fix these issues
– multithreaded GC:
  – tricky, but well understood
  – benefits everyone on multi-core/multi-processor machines!
– full support for SMP is planned for GHC 6.6