A Block-structured Heap Simplifies Parallel GC Simon Marlow (Microsoft Research) Roshan James (U. Indiana) Tim Harris (Microsoft Research) Simon Peyton Jones (Microsoft Research)

Problem Domain
Stop the world and collect using multiple threads.
– We are not tackling the problem of GC running concurrently with program execution, for now.
– We are not tackling the problem of independent GC in a program running on multiple CPUs (but we plan to later).
Our existing GC is quite complex:
– Multi-generational
– Arbitrary aging per generation
– Eager promotion: promote an object early if it is referenced by an old generation
– Copying or compaction for the old generation (we parallelise copying only, for now)
– Typical allocation rate: 100MB-1GB/s

Background: copying collection
[Diagram: the allocation area (from-space) and to-space]
– Roots point to live objects.
– Copy live objects to to-space.
– Scan the copied objects for more roots.
– Complete when the scan pointer catches up with the allocation pointer.
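To make the to-space/scan-pointer idea concrete, here is a minimal single-threaded Cheney-style sketch in C. The object layout (a header word holding the field count, with all fields assumed to be pointers) and every name in it are illustrative, not GHC's actual runtime code.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t word;

/* Illustrative object layout: a header word holding the number of pointer
 * fields, followed by the fields.  A copied object's header is overwritten
 * with its new address, tagged with FWD_BIT. */
#define FWD_BIT ((word)1)

/* to_space must point at a buffer large enough for all live data. */
static word *to_space, *alloc_ptr, *scan_ptr;

static word *evacuate(word *obj) {
    if (obj[0] & FWD_BIT)                        /* already copied: follow it */
        return (word *)(obj[0] & ~FWD_BIT);
    size_t n = 1 + (size_t)obj[0];               /* header + fields, in words */
    word *copy = alloc_ptr;                      /* copy into to-space */
    memcpy(copy, obj, n * sizeof(word));
    alloc_ptr += n;
    obj[0] = (word)copy | FWD_BIT;               /* leave a forwarding pointer */
    return copy;
}

void copying_gc(word **roots, size_t n_roots) {
    scan_ptr = alloc_ptr = to_space;
    for (size_t i = 0; i < n_roots; i++)         /* roots point to live objects */
        roots[i] = evacuate(roots[i]);
    while (scan_ptr < alloc_ptr) {               /* scan copied objects for more roots */
        size_t n = 1 + (size_t)scan_ptr[0];
        for (size_t f = 1; f < n; f++)
            scan_ptr[f] = (word)evacuate((word *)scan_ptr[f]);
        scan_ptr += n;
    }                                            /* done: scan caught up with alloc */
}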

How can we parallelise this?
The main problem is finding an effective way to partition the problem, so that we can keep N CPUs busy all the time.
Static partitioning (e.g. partitioning the heap by address) isn't good:
– Live data might not be evenly distributed
– We need synchronisation when pointers cross partition boundaries

Work queues
So typically we need dynamic partitioning for GC:
– The available work (pointers to objects to be scanned) is kept on a queue.
– CPUs remove items from the queue, scan the object, and add more roots to the queue.
– e.g. Flood, Detlefs, Shavit, Zhang (2001)
– Good work partitioning, but separate work queues are needed (in single-threaded GC, the to-space itself is the work queue), which means:
  – clever lock-free data structures
  – extra administrative overhead
  – some strategy for overflow (the GC can't use arbitrary extra memory!)

A block-structured heap
– The heap is divided into blocks, e.g. 4KB each.
– Blocks can be linked together into lists.
– The GC sits on top of a block allocator, which manages a free list of blocks.
– Each block has a "block descriptor": a small data structure including the link field, which generation the block belongs to, …
– Getting to the block descriptor from an arbitrary address is a pure function (~6 instructions).
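As a rough illustration of why the descriptor lookup is so cheap, here is a sketch in C. It assumes one particular layout (4KB blocks grouped into 1MB-aligned "megablocks", with the descriptors stored in an array at the start of each megablock); the constants, field names and types are illustrative rather than GHC's actual definitions.

#include <stdint.h>

#define BLOCK_SHIFT  12                              /* 4KB blocks (assumed)      */
#define MBLOCK_SHIFT 20                              /* 1MB aligned megablocks    */
#define MBLOCK_MASK  (((uintptr_t)1 << MBLOCK_SHIFT) - 1)

typedef struct bdescr_ {
    void           *start;      /* first word of the block                 */
    void           *free;       /* next free word in the block             */
    struct bdescr_ *link;       /* link field: next block in a list        */
    uint16_t        gen_no;     /* which generation the block belongs to   */
    uint16_t        step_no;    /* which step within that generation       */
} bdescr;

/* Pure function from an arbitrary address to its block descriptor:
 * a few shifts and masks, no loads, no allocation. */
static inline bdescr *block_descr(void *p) {
    uintptr_t addr  = (uintptr_t)p;
    uintptr_t mbase = addr & ~MBLOCK_MASK;                   /* megablock base       */
    uintptr_t blk   = (addr & MBLOCK_MASK) >> BLOCK_SHIFT;   /* block index inside   */
    return (bdescr *)mbase + blk;                            /* descriptor array     */
}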

Block-structured heap
Advantages:
– Memory can be recycled quickly: less wastage, better cache behaviour.
– Flexible: dynamic resizing of generations is easy.
– Large objects can be stored in their own blocks and managed separately.

Best of all…
Since to-space is a list of blocks, it is an ideal work queue for parallel GC:
– No need for a separate work queue, and no extra administrative overhead relative to single-threaded GC.
– ~4KB is large enough that contention for the global block queue should be low.
– ~4KB is small enough that we should still scale to large numbers of threads.

But what if…
… there isn't enough work to fill a block? E.g. if the heap consists of a single linked list of integers, then the scan pointer will always be close to the allocation pointer, and we will never generate a full block of work.
– But then there isn't much available parallelism anyway!

Available parallelism There’s enough parallelism, at least in old-gen collections.

The details… GHC’s heap is divided into generations. Each generation is divided into “steps” for aging. The last generation has only one step.

Queues per step
[Diagram: each step (e.g. generation 0, step 1) has a work queue of blocks waiting to be scanned and a "done" queue of blocks that have been scanned]
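Putting the last two slides together, a hedged sketch of how the generation/step structure and the per-step queues might be represented, reusing the illustrative bdescr type from the block-descriptor sketch above (these are not GHC's actual definitions):

/* Illustrative only: each generation has some steps; each step keeps a
 * work queue of full to-space blocks awaiting scanning and a "done" queue
 * of blocks that have been scanned. */
typedef struct step_ {
    struct generation_ *gen;        /* the generation this step belongs to       */
    bdescr *blocks;                 /* blocks currently holding the step's data  */
    bdescr *work_queue;             /* full blocks waiting to be scanned         */
    bdescr *done_queue;             /* blocks that have been completely scanned  */
} step;

typedef struct generation_ {
    unsigned int no;                /* generation number                         */
    unsigned int n_steps;           /* the last generation has only one step     */
    step        *steps;             /* steps[0 .. n_steps-1]                     */
} generation;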

Workspaces
[Diagram: each GC thread (Thread 0, Thread 1, …) has one workspace per step: generation 0 steps 0 and 1, generation 1 steps 0 and 1, and generation 2 step 0]

Inside a workspace…
Objects copied to this step are allocated into the todo block (per-thread allocation!).
Loop:
– Grab a block to be scanned from the work queue on a step.
– Scan it.
– Push it back to the "done" list on the step.
– When a todo block becomes full, move it to the global work queue for this step and grab an empty block.
[Diagram: a workspace's scan block and todo block, with the scan pointer and alloc pointer marking scanned, not-yet-scanned, and free memory]

Inside a workspace…
[Diagram: the same scan block and todo block, with scan and alloc pointers]
When there are no full blocks of work left:
– Make the todo block the scan block.
– Scan until complete.
– Look for more full blocks…
We want to avoid fragmentation: never flush a partially full block to the step unless absolutely necessary; keep it as the todo block.
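Continuing the sketches above, the per-thread loop over a workspace might look roughly like this. The helper functions are assumed (only their intent is described in comments), and the real code iterates over all steps of all generations, handles large objects, and so on; this is just the shape of the algorithm from the two slides.

#include <stdbool.h>

typedef struct workspace_ {
    step   *stp;                 /* the step this workspace copies into       */
    bdescr *todo;                /* todo block: where objects are allocated   */
    bdescr *scan;                /* block currently being scanned             */
} workspace;

/* Assumed helpers, operating on the step's global queues: */
extern bdescr *pop_work_block(step *stp);          /* take a full block, or NULL  */
extern void    push_done_block(step *stp, bdescr *bd);
extern bool    todo_has_unscanned(workspace *ws);  /* scan ptr < alloc ptr?       */
extern void    scavenge_block(workspace *ws, bdescr *bd);
extern bool    try_find_more_work(workspace *ws);  /* termination (next slide)    */

static void gc_thread_loop(workspace *ws) {
    for (;;) {
        /* 1. Grab a full block from the step's work queue, scan it,
              and push it onto the step's "done" list. */
        bdescr *bd = pop_work_block(ws->stp);
        if (bd != NULL) {
            scavenge_block(ws, bd);     /* copies into ws->todo; when the todo
                                           block fills up it is pushed onto the
                                           step's work queue and replaced by an
                                           empty block from the block allocator */
            push_done_block(ws->stp, bd);
            continue;
        }
        /* 2. No full blocks left: scan our own partially filled todo block,
              but keep it as the todo block to avoid fragmentation. */
        if (todo_has_unscanned(ws)) {
            ws->scan = ws->todo;
            scavenge_block(ws, ws->scan);
            continue;
        }
        /* 3. Nothing local either: run the termination protocol. */
        if (!try_find_more_work(ws))
            return;
    }
}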

Termination
– When a thread finds no work, it increments a semaphore.
– If it finds the semaphore == number of threads, it exits.
– If there is work to do, it decrements the semaphore and continues (don't remove the work from the queue until the semaphore has been decremented).
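One way to write that protocol down, continuing the sketch above and using a C11 atomic counter as the "semaphore" (illustrative, not the actual runtime code):

#include <stdatomic.h>
#include <stdbool.h>

extern unsigned int n_gc_threads;      /* number of GC threads, fixed per GC        */
extern bool any_work(void);            /* peek at the queues without dequeuing      */

static atomic_uint idle_threads;       /* the "semaphore": how many threads are idle */

/* Returns false when every thread is idle, i.e. the collection is finished. */
bool try_find_more_work(workspace *ws) {
    (void)ws;                          /* not needed for the check itself */
    atomic_fetch_add(&idle_threads, 1);              /* I found no work */
    for (;;) {
        if (atomic_load(&idle_threads) == n_gc_threads)
            return false;                            /* everyone is idle: exit */
        if (any_work()) {
            /* Mark ourselves busy again before actually removing anything
               from a queue, so no thread can observe "all idle" while
               work still remains. */
            atomic_fetch_sub(&idle_threads, 1);
            return true;
        }
    }
}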

Optimisations…
– Keep a list of "done" blocks per workspace, avoiding contention for the global list; concatenate them all at the end.
– Buffer the global work queue locally per workspace: a one-block buffer is enough to reduce contention significantly.
– Some objects don't need to be scanned; copy them to a separate non-scanned block (the single-threaded GC already does this).
– Keep the thread-local state structure (the workspaces) in a register.

Forwarding pointers
– We must synchronise if two threads attempt to copy the same object, otherwise the object is duplicated.
– Use CAS to install the forwarding pointer; if another thread installs the pointer first, return it (and don't copy the object).
– One CAS per object!
– A CAS on a constructor is not strictly necessary… just accept some duplication?
[Diagram: the object (header + payload) is copied into to-space, and its old header is overwritten with a forwarding pointer (FWD) to the copy]
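A sketch of the copy-then-CAS idea using C11 atomics, reusing the illustrative object layout from the copying-collection sketch (header word = field count, low bit marks a forwarding pointer). The todo_alloc/todo_unalloc helpers are assumptions; GHC's real evacuation code handles many more object layouts.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t word;
#define FWD_BIT ((word)1)

/* Assumed helpers: bump-allocate in (or give back to) this thread's todo block. */
extern word *todo_alloc(workspace *ws, size_t n_words);
extern void  todo_unalloc(workspace *ws, size_t n_words);

static word *par_evacuate(workspace *ws, word *obj) {
    word hdr = atomic_load((_Atomic word *)&obj[0]);
    if (hdr & FWD_BIT)                              /* already evacuated          */
        return (word *)(hdr & ~FWD_BIT);

    size_t n   = 1 + (size_t)hdr;                   /* header + fields, in words  */
    word *copy = todo_alloc(ws, n);                 /* per-thread allocation      */
    memcpy(copy, obj, n * sizeof(word));

    word fwd = (word)copy | FWD_BIT;
    if (atomic_compare_exchange_strong((_Atomic word *)&obj[0], &hdr, fwd))
        return copy;                                /* we won: one CAS per object */

    /* Another thread installed a forwarding pointer first: discard our copy
       and follow theirs, so the object is not duplicated. */
    todo_unalloc(ws, n);
    return (word *)(hdr & ~FWD_BIT);
}

For immutable constructors the CAS (and the discard path) could be skipped, at the cost of occasionally duplicating an object, which is the trade-off the slide hints at.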

Status
– A first prototype was completed by Roshan James as an intern project this summer.
– It worked multi-threaded, but the speedup wasn't quite what we hoped for (0%-30% on 2 CPUs).
– A rewrite is in progress, currently working single-threaded. Even with one CAS per object, it is only very slightly slower than the existing single-threaded GC. I'm optimistic!
– We're hooking up CPU performance counters to the runtime to see what's really going on; I want to see if the cache behaviour can be tuned.

Further work
Parallelise mark/compact too:
– No CAS is required when marking (there are no forwarding pointers).
– Blocks make parallelising compaction easier: just statically partition the list of marked heap blocks, compact each segment, and concatenate the results.
Independent minor GCs:
– It is hard to parallelise minor GC: it is too quick, and there is not enough parallelism.
– Stopping the world for minor GC is a severe bottleneck in a program running on multiple CPUs.
– So do per-CPU independent minor GCs.
– The main technical problem: either track or prevent inter-minor-generation pointers (e.g. Doligez/Leroy (1993) for ML, Steensgaard (2001)).
Can we do concurrent GC?