A Block-structured Heap Simplifies Parallel GC
– Simon Marlow (Microsoft Research)
– Roshan James (U. Indiana)
– Tim Harris (Microsoft Research)
– Simon Peyton Jones (Microsoft Research)
Problem Domain
– Stop the world and collect using multiple threads.
  – We are not tackling GC running concurrently with program execution, for now.
  – We are not tackling independent GC in a program running on multiple CPUs (but plan to later).
– Our existing GC is quite complex:
  – Multi-generational
  – Arbitrary aging per generation
  – Eager promotion: promote an object early if it is referenced from an old generation.
  – Copying or compaction for the old generation (we parallelise copying only, for now)
– Typical allocation rate: 100 MB/s to 1 GB/s
Background: copying collection
– [Figure: the allocation area and to-space]
– Roots point to live objects.
– Copy live objects to to-space.
– Scan live objects for more roots.
– Complete when the scan pointer catches up with the allocation pointer.
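The copy-and-scan loop above can be sketched in a few lines. This is a minimal single-threaded illustration, not the collector from the talk: the object model (a dict from address to a list of fields, with `("ptr", addr)` tuples as references) and the names `cheney_collect` and `evacuate` are assumptions made for the example.

```python
def cheney_collect(from_space, roots):
    """Copy everything reachable from roots into a fresh to-space.

    from_space: dict mapping an address to a list of fields (hypothetical model).
    A field is either a plain value or a ("ptr", addr) reference.
    """
    to_space = []      # an object's index in this list is its new address
    forwarding = {}    # old address -> new address

    def evacuate(addr):
        # Copy the object the first time it is reached; reuse the copy afterwards.
        if addr not in forwarding:
            forwarding[addr] = len(to_space)   # len(to_space) is the allocation pointer
            to_space.append(list(from_space[addr]))
        return forwarding[addr]

    new_roots = [evacuate(r) for r in roots]

    scan = 0
    while scan < len(to_space):   # finished when scan catches up with allocation
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, tuple) and field[0] == "ptr":
                obj[i] = ("ptr", evacuate(field[1]))
        scan += 1
    return to_space, new_roots
```

Note that `to_space` doubles as the work queue: everything between `scan` and the end of the list is work still to be done.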
How can we parallelise this?
– The main problem is finding an effective way to partition the work, so we can keep N CPUs busy all the time.
– Static partitioning (e.g. partitioning the heap by address) isn’t good:
  – Live data might not be evenly distributed.
  – Synchronisation is needed when pointers cross partition boundaries.
Work queues
– So typically, we need dynamic partitioning for GC.
  – The available work (pointers to objects to be scanned) is kept on a queue.
  – CPUs remove items from the queue, scan the object, and add more roots to the queue.
  – e.g. Flood, Detlefs, Shavit, Zhang (2001)
– This gives good work partitioning, but needs separate work queues (in single-threaded GC, the to-space is the work queue), which require:
  – clever lock-free data structures
  – extra administrative overhead
  – some strategy for overflow (GC can’t use arbitrary extra memory!)
A block-structured heap
– The heap is divided into blocks, e.g. 4k each.
– Blocks can be linked together in lists.
– The GC sits on top of a block allocator, which manages a free list of blocks.
– Each block has a “block descriptor”: a small data structure including the link field, which generation it belongs to, …
– Getting to the block descriptor from an arbitrary address is a pure function (~6 instructions).
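To see why the descriptor lookup can be a pure function of the address, here is one possible layout, sketched with masks and shifts. The slides give only the 4k block size; the 1 MB aligned "megablock", the 64-byte descriptor size, and the name `block_descriptor_addr` are assumptions chosen for illustration.

```python
BLOCK_SHIFT = 12    # 4k blocks, as on the slide
MBLOCK_SHIFT = 20   # assumed: blocks are grouped into 1 MB aligned megablocks
BDESCR_SHIFT = 6    # assumed: each block descriptor is 64 bytes

BLOCK_MASK = (1 << BLOCK_SHIFT) - 1
MBLOCK_MASK = (1 << MBLOCK_SHIFT) - 1


def block_descriptor_addr(p):
    """Map any address inside a block to the address of that block's descriptor.

    Assumed layout: descriptors are stored as an array at the start of the
    megablock, so descriptor i sits at megablock_base + i * 64. (The first
    few blocks of each megablock are therefore taken up by the descriptors.)
    """
    mblock_base = p & ~MBLOCK_MASK                  # start of the enclosing megablock
    block_offset = p & MBLOCK_MASK & ~BLOCK_MASK    # byte offset of the block's start
    # Dividing by the block size and multiplying by the descriptor size
    # collapses into a single right shift.
    return mblock_base | (block_offset >> (BLOCK_SHIFT - BDESCR_SHIFT))
```

The whole computation is two masks, a shift, and an OR, with no memory access, which is in line with the handful of instructions the slide mentions.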
Block-structured heap
– Advantages:
  – Memory can be recycled quickly: less wastage, better cache behaviour.
  – Flexible: dynamic resizing of generations is easy.
  – Large objects can be stored in their own blocks and managed separately.
Best of all…
– Since to-space is a list of blocks, it is an ideal work queue for parallel GC.
  – No need for a separate work queue, and no extra admin overhead relative to single-threaded GC.
  – ~4k is large enough that contention for the global block queue should be low.
  – ~4k is small enough that we should still scale to large numbers of threads.
But what if…
– … there isn’t enough work to fill a block?
– E.g. if the heap consists of a single linked list of integers, the scan pointer will always be close to the allocation pointer, and we will never generate a full block of work.
  – But then there isn’t much available parallelism anyway!
Available parallelism
– There’s enough parallelism, at least in old-gen collections.
The details…
– GHC’s heap is divided into generations.
– Each generation is divided into “steps” for aging.
– The last generation has only one step.
Queues per step
– [Figure: Gen 0, step 1, with its work queue and done queue]
Inside a workspace…
– Objects copied to this step are allocated into the todo block (per-thread allocation!).
– Loop:
  – Grab a block to be scanned from the work queue on a step.
  – Scan it.
  – Push it back to the “done” list on the step.
  – When a todo block becomes full, move it to the global work queue for this step and grab an empty block.
– [Figure: a todo block and a scan block, with the scan pointer and alloc pointer; legend: free memory, not scanned, scanned]
Inside a workspace…
– [Figure: a todo block and a scan block, with the scan pointer and alloc pointer; legend: free memory, not scanned, scanned]
– When there are no full blocks of work left:
  – Make the todo block the scan block.
  – Scan until complete.
  – Look for more full blocks…
– We want to avoid fragmentation: never flush a partially full block to the step unless absolutely necessary; keep it as the todo block.
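The workspace loop can be sketched as a single-threaded simulation of one GC thread. Everything here is a simplification for illustration: the object model (a dict from object name to the names it points to), the tiny `BLOCK_CAPACITY`, the `copied` set standing in for forwarding pointers, and the names `Block` and `collect` are assumptions, not the structures used in the actual runtime.

```python
from collections import deque

BLOCK_CAPACITY = 4  # objects per block; assumed tiny so the trace is easy to follow


class Block:
    def __init__(self):
        self.objects = []  # copied objects, in allocation order
        self.scan = 0      # index of the first object not yet scanned


def collect(graph, roots):
    """graph maps an object name to the names it points to (hypothetical model)."""
    work_queue = deque()  # the step's queue of full blocks waiting to be scanned
    done = []             # the step's list of fully scanned blocks
    copied = set()        # stands in for forwarding pointers
    todo = Block()        # this workspace's private allocation block
    scanning = None       # the block currently being scanned, if any

    def evacuate(name):
        nonlocal todo
        if name in copied:
            return
        copied.add(name)
        if len(todo.objects) == BLOCK_CAPACITY:
            # A full todo block becomes shared work, unless this thread is
            # already scanning it in place.
            if todo is not scanning:
                work_queue.append(todo)
            todo = Block()
        todo.objects.append(name)

    for root in roots:
        evacuate(root)

    while True:
        if work_queue:
            scanning = work_queue.popleft()   # prefer full blocks of work
        elif todo.scan < len(todo.objects):
            scanning = todo                   # otherwise scan the todo block in place
        else:
            break                             # nothing left anywhere
        block = scanning
        while block.scan < len(block.objects):
            obj = block.objects[block.scan]
            block.scan += 1
            for child in graph[obj]:
                evacuate(child)
        if block is not todo:
            done.append(block)                # a partial block stays as the todo block
        scanning = None

    if todo.objects:
        done.append(todo)                     # flush the last partial block at the end
    return [block.objects for block in done]
```

In a parallel setting each thread would run this loop with its own `todo` block, sharing only `work_queue` and `done`, which is where synchronisation would be needed.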
Termination
– When a thread finds no work, it increments a semaphore.
– If it finds that the semaphore equals the number of threads, every thread is idle, so it exits.
– If there is work to do, it decrements the semaphore and continues (don’t remove the work from the queue until the semaphore has been decremented).
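The termination protocol can be sketched with a shared idle counter. This is an illustration under assumptions: a single lock protects both the queue and the counter (a real collector would likely use atomic operations instead), and the names `run_parallel` and `process` are hypothetical.

```python
import threading
import time
from collections import deque


def run_parallel(n_threads, initial_work, process):
    """Run process(item) over a shared queue; process returns a list of new items.

    Returns the number of items processed. Work items must not be None.
    """
    lock = threading.Lock()
    queue = deque(initial_work)
    idle = 0        # the "semaphore" from the slide: threads that found no work
    processed = 0

    def worker():
        nonlocal idle, processed
        while True:
            with lock:
                item = queue.popleft() if queue else None
            if item is not None:
                new_items = process(item)
                with lock:
                    queue.extend(new_items)
                    processed += 1
                continue
            with lock:
                idle += 1                     # found no work: count ourselves idle
            while True:
                with lock:
                    if idle == n_threads:     # everyone is idle, so no work can appear
                        return
                    if queue:
                        idle -= 1             # decrement first, then go and take work
                        break
                time.sleep(0)                 # yield to other threads while waiting

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Because a thread only counts itself idle after failing to find work, and leaves the idle state before taking any, the counter can reach the number of threads only when the queue is empty and nobody is still producing work.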
Optimisations…
– Keep a list of “done” blocks per workspace, avoiding contention for the global list. Concatenate them all at the end.
– Buffer the global work queue locally per workspace. A one-block buffer is enough to reduce contention significantly.
– Some objects don’t need to be scanned: copy them to a separate non-scanned block (single-threaded GC already does this).
– Keep the thread-local state structure (workspaces) in a register.
Forwarding pointers
– Must synchronise if two threads attempt to copy the same object, otherwise the object is duplicated.
– Use CAS to install the forwarding pointer; if another thread installs the pointer first, return it (don’t copy the object).
– One CAS per object!
– CAS on a constructor is not strictly necessary… just accept some duplication?
– [Figure: an object (header and payload) is copied into to-space, and the original is overwritten with a forwarding pointer (FWD)]
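The race on the forwarding pointer can be sketched as follows. Python has no hardware compare-and-swap, so a per-object lock stands in for the atomicity of the CAS instruction. The sketch also simplifies in one respect: it makes a speculative copy and discards it if another thread wins the race, whereas a real collector might claim the object first and so avoid the wasted copy. The names `HeapObject`, `cas_forward`, and `evacuate` are hypothetical.

```python
import threading


class HeapObject:
    def __init__(self, payload):
        self.payload = payload
        self.forward = None             # forwarding pointer; None means "not yet copied"
        self._lock = threading.Lock()   # stands in for the atomicity of a hardware CAS

    def cas_forward(self, expected, new):
        """Set forward to `new` only if it is still `expected`; return the value seen."""
        with self._lock:
            seen = self.forward
            if seen is expected:
                self.forward = new
            return seen


def evacuate(obj):
    """Return the to-space copy of obj, so that all threads agree on a single copy."""
    if obj.forward is not None:
        return obj.forward                  # already forwarded: no CAS needed
    copy = HeapObject(obj.payload)          # speculative copy (assumption, see above)
    seen = obj.cas_forward(None, copy)      # the one CAS per object
    return copy if seen is None else seen   # if we lost the race, use the winner's copy
```

Whichever thread's CAS succeeds defines the canonical copy; every other thread returns that same copy, so no pointer ever refers to a duplicate.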
Status
– First prototype completed by Roshan James as an intern project this summer. It works multi-threaded, but the speedup wasn’t quite what we hoped for (0%-30% on 2 CPUs).
– Rewrite in progress, currently working single-threaded. Even with one CAS per object, it is only very slightly slower than the existing single-threaded GC. I’m optimistic!
– We’re hooking up CPU performance counters to the runtime to see what’s really going on; I want to see if the cache behaviour can be tuned.
Further work
– Parallelise mark/compact too.
  – No CAS required when marking (no forwarding pointers).
  – Blocks make parallelising compaction easier: just statically partition the list of marked heap blocks, compact each segment, and concatenate the results.
– Independent minor GCs.
  – It is hard to parallelise minor GC: too quick, not enough parallelism.
  – Stopping the world for minor GC is a severe bottleneck in a program running on multiple CPUs.
  – So do per-CPU independent minor GCs.
  – Main technical problem: either track or prevent inter-minor-generation pointers (e.g. Doligez/Leroy (1993) for ML, Steensgaard (2001)).
– Can we do concurrent GC?
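The compaction idea (statically partition the block list, compact each segment, concatenate) might look roughly like this. It is a toy model under assumptions: a block is a list of slots in which `None` marks a dead object, pointer fix-up is ignored entirely, and the names `partition`, `compact_segment`, and `parallel_compact` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_CAPACITY = 4  # slots per block; assumed for the example


def partition(blocks, n_workers):
    """Split the block list into at most n_workers contiguous segments."""
    if not blocks:
        return []
    size = -(-len(blocks) // n_workers)   # ceiling division
    return [blocks[i:i + size] for i in range(0, len(blocks), size)]


def compact_segment(segment):
    """Slide the live objects of one segment together, preserving their order."""
    live = [obj for block in segment for obj in block if obj is not None]
    return [live[i:i + BLOCK_CAPACITY] for i in range(0, len(live), BLOCK_CAPACITY)]


def parallel_compact(blocks, n_workers):
    """Compact each segment independently, then concatenate the resulting blocks."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        compacted = list(pool.map(compact_segment, partition(blocks, n_workers)))
    return [block for segment in compacted for block in segment]
```

Because each segment is compacted independently, the workers never touch the same block, at the cost of possibly leaving one partially filled block per segment.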