1 Dynamic Memory Management: Challenges for today and tomorrow
Richard Jones, Computing Laboratory, University of Kent, Canterbury
SCIEnce Paris Workshop, January 2009, Paris
© Richard Jones, University of Kent 2009

2 Overview
Part 1: Introduction; basic algorithms
Part 2: GC for high performance
Part 3: Reference counting revisited
Part 4: Integration with the environment
Discussion: GC and CA

3 PART 1: Introduction
- Why garbage collect?
- Basic algorithms

4 Why garbage collect?
Obvious requirements:
- Finite, limited storage.
- Language requirement — objects may survive their creating method.
- The problem — hard/impossible to determine when something is garbage.
Programmers find it hard to get right:
- Too little collected? Memory leaks.
- Too much collected? Broken programs.
Good software engineering:
- Explicit memory management conflicts with the software engineering principles of abstraction and modularity.

5 Basic algorithms — the garage metaphor
- Reference counting: maintain a note on each object in your garage, indicating the current number of references to the object. When an object's reference count goes to zero, throw the object out (it's dead).
- Mark-sweep: put a note on objects you need (roots). Then recursively put a note on anything needed by a live object. Afterwards, check all objects and throw out objects without notes.
- Mark-compact: put notes on objects you need (as above). Move anything with a note on it to the back of the garage. Burn everything at the front of the garage (it's all dead).
- Copying: move objects you need to a new garage. Then recursively move anything needed by an object in the new garage. Afterwards, burn down the old garage (any objects in it are dead)!
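The mark-sweep half of the garage metaphor maps directly onto code. A minimal sketch (the object, heap, and root names are illustrative, not from the slides):

```python
# Minimal mark-sweep sketch: mark everything reachable from the
# roots, then "throw out" (drop) every unmarked object.

class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing pointers
        self.marked = False  # the "note"

def mark(obj):
    if obj.marked:
        return
    obj.marked = True
    for child in obj.refs:   # recursively note anything a live object needs
        mark(child)

def collect(heap, roots):
    for o in heap:
        o.marked = False
    for r in roots:          # put a note on the objects you need
        mark(r)
    return [o for o in heap if o.marked]   # unmarked objects are garbage

# a -> b is reachable from the roots; c is garbage
a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)
survivors = collect([a, b, c], roots=[a])
```

A real collector would use an explicit mark stack rather than recursion, to bound stack depth on deep object graphs.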

6 Generational GC
Weak generational hypothesis: "most objects die young" [Ungar, 1984]. Common for 80-95% of objects to die before a further MB of allocation.
Strategy:
- Segregate objects by age into generations (regions of the heap).
- Collect different generations at different frequencies — so pointers that cross generations must be "remembered".
- Concentrate on the nursery generation to reduce pause times: full-heap collection pauses are 5-50x longer than nursery collections.
[Figure: roots, old generation, young generation, remembered set]
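The "remember pointers that cross generations" step can be sketched as a write barrier that records old-to-young stores, so a nursery collection can treat the remembered set as extra roots. A hedged sketch (class and variable names are illustrative):

```python
# Generational sketch: a nursery collection scans only young objects,
# using the roots plus the remembered set of old->young pointers.

class Obj:
    def __init__(self, name, young=True):
        self.name, self.young, self.refs = name, young, []

def write(src, target, remset):
    """Pointer store src.refs += [target], with a generational write barrier."""
    src.refs.append(target)
    if not src.young and target.young:  # remember old->young crossings
        remset.add(target)

def collect_nursery(nursery, roots, remset):
    """Young objects survive if reachable from the roots or the remset."""
    live = set()
    stack = [o for o in list(roots) + list(remset) if o.young]
    while stack:
        o = stack.pop()
        if o in live:
            continue
        live.add(o)
        stack.extend(c for c in o.refs if c.young)
    return [o for o in nursery if o in live]

old = Obj("old", young=False)
y1, y2 = Obj("y1"), Obj("y2")
remset = set()
write(old, y1, remset)   # an old-gen object keeps y1 alive
survivors = collect_nursery([y1, y2], roots=[], remset=remset)
```

Without the barrier, y1 would wrongly be collected: no root reaches it, only the old-generation object does.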

7 Generational GC: a summary
Highly successful for a range of applications:
- acceptable pause times for interactive applications.
- reduces the overall cost of garbage collection.
- improves paging and cache behaviour.
Requires a low survival rate, infrequent major collections, and a low overall cost of the write barrier.
But generational GC is not a universal panacea. It improves the expected pause time but not the worst case:
- objects may not die sufficiently fast.
- applications may thrash the write barrier.
- too many old-young pointers may increase pause times.
- copying is expensive if the survival rate is high.

8 Can GC perform as well as malloc/free?
Test environment:
- Real Java benchmarks + JikesRVM + real collectors + real malloc/frees.
- Object reachability traces provide an 'oracle'.
- Dynamic SimpleScalar simulator: count cycles, cache misses, etc.
Quantifying the Performance of Garbage Collection v. Explicit Memory Management, Hertz and Berger, OOPSLA'05.
[Chart: best GC v. best malloc]

9 PART 2: GC for high performance systems
Assume: multiprocessor, many user threads, very large heaps.
Require: good throughput, good response time, but not hard real-time.
Typical configuration:
- A nursery region providing fast, bump-pointer allocation.
- A mature region managed by a mark-sweep collector, requiring occasional compaction.
Exploit parallel hardware:
- Concurrent allocation
- Concurrent marking
- Parallel marker and collector threads
- Concurrent sweeping
- Parallel & concurrent compaction

10 Parallelism
Avoid bottlenecks:
- Heap contention — allocation.
- Tracing, especially contention for the mark stack.
- Compaction — address fix-up must appear atomic.
- Use lock-free data structures.
Load balancing is critical:
- Work starvation vs. excessive synchronisation.
- Over-partition work, work-share.
- Marking, scanning card tables / remsets, sweeping, compacting.

11 Terminology
- Mutator: the user program.
- Collector: the garbage collection threads.
- Collection styles: single-threaded, parallel (multiple collector threads), incremental, and concurrent collection.
[Figure: mutator and collector interleavings for each collection style]

12 A. Concurrent allocation
Multiple user threads must avoid contention for the heap:
- Avoid locks; avoid atomic instructions (CAS).
[Figure: threads allocating from a shared free pointer, freep]

13 Local Allocation Blocks
- A thread contends for a contiguous block (LAB) — CAS.
- The thread then allocates within the LAB (bump a pointer) — no contention.
- Locality properties make this effective even for GCs that rarely move objects — needs variable-sized LABs.
[Figure: threads, nextLAB, LAB manager]
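The two-level scheme above can be sketched as follows. This is a minimal illustration, not any particular VM's allocator; a `threading.Lock` stands in for the CAS on the shared free pointer, and sizes are illustrative:

```python
# LAB sketch: only fetching a fresh LAB touches shared state; each
# allocation inside a LAB is an uncontended private pointer bump.
import threading

LAB_SIZE = 1024          # illustrative block size

class LABManager:
    """Hands out contiguous LABs from a shared free pointer."""
    def __init__(self, heap_size):
        self.free, self.limit = 0, heap_size
        self.lock = threading.Lock()   # stands in for a CAS on the free pointer

    def new_lab(self):
        with self.lock:                # the only contended step, and a rare one
            if self.free + LAB_SIZE > self.limit:
                raise MemoryError("heap exhausted: trigger a collection")
            base = self.free
            self.free += LAB_SIZE
            return base, base + LAB_SIZE

class ThreadAllocator:
    """Per-thread allocator: bump a private pointer inside the current LAB."""
    def __init__(self, mgr):
        self.mgr = mgr
        self.ptr, self.end = mgr.new_lab()

    def alloc(self, size):
        if self.ptr + size > self.end:               # LAB full: contend once
            self.ptr, self.end = self.mgr.new_lab()
        addr, self.ptr = self.ptr, self.ptr + size   # uncontended bump
        return addr

mgr = LABManager(heap_size=4096)
t = ThreadAllocator(mgr)
a0, a1 = t.alloc(24), t.alloc(24)
```

The design point is the ratio: thousands of cheap bumps per one synchronised LAB acquisition.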

14 B. Concurrent marking
Marking concurrently with user threads introduces a coherency problem: the mutator might hide pointers from the collector. Two write-barrier approaches, both "record & revisit":
- Incremental update (IU) write barrier — catch changes to connectivity.
- Snapshot at the beginning (SAB) write barrier — prevent loss of the original path.

15 Write barrier properties
Performance: no barriers on local variables.
Incremental update:
- The mutator can hide live objects from the collector.
- Needs a final, stop-the-world marking phase.
Snapshot at the beginning:
- Any object reachable at the start of the phase remains reachable, so no STW phase is required.
- But more floating garbage is retained than with incremental update.
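The difference between the two barriers is simply *which* pointer a store logs for the collector to revisit. A hedged sketch, with objects modelled as dicts and the log names invented for illustration:

```python
# On a pointer store obj[field] = new:
#  - incremental update (IU) logs the *new* target: a connectivity
#    change the collector must see;
#  - snapshot at the beginning (SAB) logs the *old* target: the path
#    that existed at the snapshot must not be lost.

iu_log = []    # grey set additions for an IU collector
sab_log = []   # grey set additions for an SAB collector

def write_iu(obj, field, new):
    iu_log.append(new)        # remember what just became reachable
    obj[field] = new

def write_sab(obj, field, new):
    old = obj.get(field)
    if old is not None:
        sab_log.append(old)   # remember what may become unreachable
    obj[field] = new

src = {"f": "o_old"}
write_iu(src, "f", "o_new")    # IU logs the newly installed pointer
write_sab(src, "f", "o_new2")  # SAB logs the pointer being overwritten
```

Logging the old value is why SAB retains more floating garbage: everything reachable at the snapshot survives the cycle, even if the mutator drops it immediately afterwards.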

16 C. Parallel markers
The goal is always to avoid contention (e.g. for the mark stack):
- Thread-local mark stacks, work-stealing.
- Grey packets.
(Stephen Thomas, Insignia Solutions)

17 Grey packets
- Marker threads acquire a full packet of marking work (grey references).
- They mark & empty this packet and fill a fresh (empty) packet with new work.
- The new, filled packet is returned to the pool.
Benefits:
- Avoids most contention; simple termination.
- Allows prefetching (unlike a traditional mark stack).
- Reduces processor weak-ordering problems (fences only around packet acquisition/disposal).
(Stephen Thomas, Insignia Solutions)
[Figure: marker threads exchanging packets with a grey packet pool]
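The acquire/mark/return cycle above can be sketched with a shared queue as the packet pool. This is a single-threaded illustration of the data flow (packet size and graph are invented); a real implementation would add a multi-thread-safe termination check rather than the simple `empty()` test here:

```python
# Grey-packet sketch: markers trade fixed-size packets of grey
# references with a shared pool instead of contending on one mark stack.
from queue import Queue

PACKET_SIZE = 4   # illustrative

def make_pool(grey_refs):
    pool = Queue()
    for i in range(0, len(grey_refs), PACKET_SIZE):
        pool.put(list(grey_refs[i:i + PACKET_SIZE]))
    return pool

def marker(pool, graph, marked):
    while not pool.empty():
        packet = pool.get()        # acquire a full packet: the only
        fresh = []                 # synchronisation point
        for ref in packet:         # mark & empty the packet, filling a
            if ref in marked:      # fresh one with newly found work
                continue
            marked.add(ref)
            fresh.extend(graph.get(ref, []))
        if fresh:
            pool.put(fresh)        # return the filled packet to the pool

graph = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
pool = make_pool(["r"])
marked = set()
marker(pool, graph, marked)        # several threads could each run marker()
```

Because a thread owns a whole packet at a time, the next references to scan are known in advance, which is what makes prefetching possible.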

18 E. Compaction
Without compaction, heaps tend to fragment over time.
Issues for compaction:
- Moving objects.
- Updating all references to a moved object.
- Doing both in parallel, and concurrently.

19 Compaction strategy
An old Lisp idea recently applied to concurrent systems. Divide the heap into a few, large regions:
- The marker constructs remsets — locations with references into each region.
- A heuristic chooses the condemned region (live volume/number of references).
- Use the remsets to fix up references.
[Figure: heap regions and their remsets]

20 Parallel compaction
Split the heap into:
- Many small blocks (e.g. 256 bytes).
- Fewer, larger target areas (e.g. 16 x processors, each 4MB+).
Each thread:
1. increments the index of the next block to compact (e.g. CAS).
2. claims a target area with a lower address (e.g. CAS).
3. moves objects/blocks in the next block en masse into the target area.
An Efficient Parallel Heap Compaction Algorithm, Abuaiadh et al, OOPSLA'04.
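The per-thread claiming protocol is hard to show briefly, but the core block-move step can be sketched: slide live blocks down in address order and build a forwarding table for the later reference fix-up. A single-threaded sketch (block size and function name are illustrative; the slides' CAS-based claiming is noted in comments only):

```python
# Block-based sliding compaction sketch: live blocks move en masse
# towards low addresses; a forwarding table maps old -> new addresses.

BLOCK = 256   # the heap is split into many small blocks

def plan_compaction(blocks_live):
    """Return (forwarding table, end of compacted data)."""
    forwarding = {}
    target = 0                      # next free offset in the target area;
    for b, live in enumerate(blocks_live):   # a real collector claims each
        if live:                             # block index with a CAS
            forwarding[b * BLOCK] = target
            target += BLOCK         # the whole block is moved en masse
    return forwarding, target

forwarding, end = plan_compaction([True, False, True])
```

Moving whole blocks rather than individual objects is what the results slide credits with the further 25% reduction in compaction time.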

21 Results
- Reduced compaction time (moving individual objects) from 1680 ms to 470 ms for a large 3-tier application suffering fragmentation problems.
- Moving blocks rather than individual objects further reduces compaction time by 25%, at a small increase in space cost (+4%).
- Compaction speed-up is linear in the number of threads.
- Throughput increased slightly (2%).

22 Read barrier methods
Alternatively: don't let the mutator see objects that the collector hasn't seen.
How? Trap mutator accesses and mark, or copy & redirect. Read barriers may use:
- software
- memory protection.

23 Memory protection
By protecting a grey page, any access attempt by the mutator is trapped.
[Figure: mutator access trapped by a read barrier on a protected page]

24 Mostly concurrent, incremental compaction
Phases:
1. Mark, and select a region for compaction.
2. Stop-the-world phase:
  - move the objects.
  - update addresses in roots (thread stacks, etc.).
  - read/write protect the whole heap.
3. Concurrent phase:
  - update addresses on access violations.
  - background fix-up threads.
Map the whole heap into two virtual address ranges; the second is used for address fix-up.
Results:
- Reduces the compaction pause time to 10-15% of mark time (rather than 10x mark time).
- Up to 10% throughput improvement.
Mostly Concurrent Compaction for Mark-Sweep GC, Ossia et al, ISMM'04.

25 Compressor: concurrent, incremental and parallel compaction
Fully concurrent (no stop-the-world phase), semispace sliding compaction. A single heap pass + a pass over a small mark-bit vector (cf. two full heap passes for most compactors).
Phases:
1. Mark live objects, constructing a mark-bit vector.
2. From the mark-bit vector, construct an offset table mapping fromspace blocks to tospace blocks.
3. mprotect tospace. Stop threads one at a time and update their references to tospace (using the offset table and mark-bit vector).
4. On access violations, move N pages and update references in the moved pages, then unprotect them. This needs each physical page mapped to two virtual pages.
The Compressor: concurrent, incremental and parallel compaction, Kermany & Petrank, PLDI'06; cf. Mostly Concurrent Compaction for Mark-Sweep GC, Ossia et al, ISMM'04.

26 Compressor results
- Throughput: better than a generational mark-sweep for the SPECjbb2000 server benchmark; worse for the DaCapo client benchmarks.
- Pauses: a delay before achieving the full allocation rate.
(SPECjbb2000: 2-way 2.4GHz Pentium III Xeon; DaCapo: 2.8GHz Pentium 4 uniprocessor.)

27 Don't move, unmap instead
Observations: objects tend to live and die in clusters, and moving objects is expensive.
Use memory mapping as the primary compaction technique:
- Parallel marking.
- Unmap pages with no live objects and remap them to give a new contiguous allocation space.
- Fall back to a 'perfect' compactor if you run out of space.
Works best in a 64-bit address space (otherwise you quickly run out of address space).
The Mapping Collector: virtual memory support for generational, parallel, and concurrent compaction, Wegiel & Krintz, ASPLOS'08.

28 Azul Pauseless GC
Mark:
- Parallel, concurrent; work-lists.
- Threads mark their own roots; parallel threads mark blocked threads' roots.
Relocate:
- Protect pages and concurrently relocate.
- Free physical but not virtual pages immediately (there might be stale refs).
- A side array holds forwarding pointers, also used by the read barrier.
- The mutator may do the copy on behalf of the GC.
Remap:
- GC threads traverse the object graph, tripping the read barrier to update stale refs.
Results:
- Substantially reduced transaction times; almost all pauses < 3ms.
- Reports other (incomparable) systems had 20+% of pauses > 3ms.
[Figure: overlapping mark, relocate and remap phases]

29 D. Concurrent sweeping
Sweeping concurrently with the mutator:
- The sweeper only modifies mark-bits and garbage; both are invisible to the mutator, so there is no interference.
- The parallel extension is also straightforward; the chief issue is load balancing.

30 PART 3: Reference counting revisited
Problems: cycles; overheads; atomic RC operations.
Benefits: distributed overheads; immediacy; recycling memory.
Observations:
- Practical solutions use deferred RC: don't count local variable operations; periodically scan the stack.
- RC and tracing GC are duals: GC traces live objects, RC traces dead objects (A Unified Theory of Garbage Collection, Bacon et al, OOPSLA'04).
Two solutions follow.
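Deferred RC can be sketched as: count only heap-slot stores, park objects whose count reaches zero in a zero-count table, and only reclaim them at a periodic stack scan if no stack still refers to them. A minimal sketch (the table and function names are illustrative):

```python
# Deferred RC sketch: local-variable operations are never counted;
# zero-count objects wait for a stack scan before being reclaimed.
from collections import defaultdict

rc = defaultdict(int)     # counts of heap references only
zct = set()               # zero-count table: candidates for reclamation

def heap_write(new, old=None):
    """Barrier on a heap slot store (old value overwritten by new)."""
    if new is not None:
        rc[new] += 1
        zct.discard(new)
    if old is not None:
        rc[old] -= 1
        if rc[old] == 0:
            zct.add(old)  # maybe dead: decide at the next stack scan

def periodic_scan(stacks):
    """Reclaim zero-count objects not referenced from any thread stack."""
    on_stacks = set().union(*stacks) if stacks else set()
    dead = {o for o in zct if o not in on_stacks}
    zct.difference_update(dead)
    return dead

heap_write("x")                  # a heap slot now refers to x
heap_write("y", old="x")         # the slot is overwritten: x drops to zero
dead = periodic_scan([set()])    # no stack keeps x alive
```

This is where RC's overhead moves from every pointer operation to heap stores plus an occasional scan, which is what makes it competitive.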

31 A. Concurrent, cyclic RC
RC contention: use a producer-consumer model.
- Mutator threads only add RC operations to local buffers.
- A collector thread consumes them to modify the reference counts.
Avoiding races between increments and decrements:
- Buffers are periodically turned over to the collector (epochs).
- Perform increments this epoch, decrements the next.
Cycles: use trial deletion.
- Trace suspect subgraphs.
- Remove RCs due to internal references.
- If all RCs = 0, the subgraph is garbage.
Results:
- Throughput close to a parallel mark-sweep collector for most applications.
- Maximum pause time: 2.6ms; smallest gap between pauses: 36ms.
Concurrent Cycle Collection in Reference Counted Systems, Bacon and Rajan, ECOOP'01.
[Figure: threads buffering RC operations across epochs 1-3]
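Trial deletion itself is compact enough to sketch: subtract from each suspect's count the references it receives from inside the suspect subgraph; if every count reaches zero, nothing external holds the cycle. A hedged sketch (a full implementation colours nodes and restores counts on failure, which is omitted here):

```python
# Trial-deletion sketch: is the suspect subgraph held only by its
# own internal references?

def trial_delete(suspects, rc, refs):
    trial = {o: rc[o] for o in suspects}
    for o in suspects:
        for child in refs.get(o, ()):
            if child in trial:
                trial[child] -= 1    # remove the internal contribution
    return all(count == 0 for count in trial.values())

refs = {"a": ["b"], "b": ["a"]}                          # a <-> b cycle
garbage = trial_delete({"a", "b"}, {"a": 1, "b": 1}, refs)
kept = trial_delete({"a", "b"}, {"a": 2, "b": 1}, refs)  # a also held externally
```

In the first case each count drops to zero, so the cycle is dead; in the second, the extra external reference to `a` keeps the whole cycle alive.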

32 B. Sliding views
Observation: when a pointer field is overwritten several times, naively we perform RC(o2)--, RC(o3)++, RC(o3)--, RC(o4)++, but only RC(o2)-- and RC(o4)++ are needed.
Sliding views:
- The mutator checks o1's dirty bit: if clear, set it and add the old values to a local buffer.
- Threads mark objects referenced from stacks, etc., and clear dirty bits.
- Local buffers are passed to the collector thread.
- The collector performs RC(o2)-- using the buffered values and RC(o4)++ using the current value.
Results vs. STW mark-sweep (Sun JVM 1.2.2), worst-case response times (SPECjbb):
- 1 thread: 16ms v. 7433ms (464x);
- 16 threads: 250ms v. 6593ms (26x).
An On-the-fly Reference Counting Garbage Collector for Java, Levanoni and Petrank, OOPSLA'01.

33 PART 4: Real-time GC
Region schemes:
- A cumbersome model, hard to program? Requires rewriting libraries.
- Better to infer regions (à la ML-kit)?
Real-time GC issues:
- Schedulability: guaranteed worst-case pause times; guaranteed minimum mutator utilisation.
- Managing large data structures.
- Dealing with fragmentation.

34 Metronome
Allocation: segregated free-lists, with a large number of size classes.
- Low internal fragmentation.
- External fragmentation removed by copying.
Mostly non-copying collection:
- Incremental mark, lazy sweep.
- Snapshot-at-the-beginning trace — avoids rescanning.
A Real-time GC with Low Overhead and Consistent Utilisation, Bacon et al, POPL'03.
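Segregated free-lists work by rounding each request up to the nearest size class and popping a cell from that class's list. A minimal sketch, not Metronome's actual class spacing (the classes and names below are invented for illustration):

```python
# Segregated free-list sketch: many size classes keep internal
# fragmentation (the round-up waste) low.

SIZE_CLASSES = [16, 24, 32, 48, 64, 96, 128]   # illustrative classes

def size_class(n):
    """Round a request up to the smallest class that fits it."""
    for c in SIZE_CLASSES:
        if n <= c:
            return c
    raise ValueError("large object: allocate it separately")

free_lists = {c: [] for c in SIZE_CLASSES}

def alloc(n):
    c = size_class(n)
    if free_lists[c]:
        return free_lists[c].pop()   # reuse a freed cell of this class
    return ("fresh", c)              # otherwise carve a new cell of size c

free_lists[32].append("recycled-cell")
first, second = alloc(30), alloc(30)
```

More, closely spaced classes mean less round-up waste per object but more lists to manage — the trade-off behind "large number of size classes".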

35 Metronome collection
Defragmentation:
- If there are sufficient free pages, allow the mutator to continue for another GC cycle without running out of memory.
- Otherwise, copy objects from a fragmented page to another page of the same size class.
- An indirection pointer in object headers means the mutator only sees ToSpace objects; the collector does all the work.
- Large arrays are broken into a sequence of power-of-2 sized arraylets, with compiler optimisations for fast access.
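Power-of-2 arraylets make element access cheap: the index splits into an arraylet number (a shift) and an offset within it (a mask). A sketch of the idea, not Metronome's actual layout (the chunk size and names are invented):

```python
# Arraylet sketch: a large array becomes a "spine" of fixed-size
# chunks, so no single large contiguous object need ever be moved.

ARRAYLET_BITS = 3                  # illustrative: 8 elements per arraylet
ARRAYLET_SIZE = 1 << ARRAYLET_BITS

def make_array(n):
    chunks = (n + ARRAYLET_SIZE - 1) // ARRAYLET_SIZE
    return [[0] * ARRAYLET_SIZE for _ in range(chunks)]

def get(spine, i):                 # a shift and a mask replace flat indexing
    return spine[i >> ARRAYLET_BITS][i & (ARRAYLET_SIZE - 1)]

def put(spine, i, v):
    spine[i >> ARRAYLET_BITS][i & (ARRAYLET_SIZE - 1)] = v

spine = make_array(20)             # 20 elements -> 3 arraylets
put(spine, 10, 7)
```

Since each arraylet is a small, same-sized object, defragmentation can move them individually within bounded work.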

36 Metronome performance
Scheduling: time-based. Work-based scheduling gave inconsistent utilisation, leading to infeasible scheduling.
Parameters:
- Max live memory: 20-34MB live.
- Max allocation rate: MB/s max.
Results:
- Min. mutator utilisation: 44.1%-44.6%.
- Pause time: ms max ( ms avg).
- Copied < 2% of traced data.
(JVM98 on 500MHz PowerPC.)

37 PART 5: New heap organisations
Older-first:
- Insight: give objects as long as possible to die, so collect older objects before younger ones.
- Slide a collection "window" from the old to the young end of the heap.
Beltway: a generalisation of copying collectors.

38 Beltway
Beltway combines the 5 key insights that shape copying garbage collection:
1. Most objects die young.
2. Avoid collecting old objects.
3. But give the youngest objects time to die.
4. Incrementality improves response time.
5. Copying GC can improve locality.
Beltway can be configured to be any other region-based copying collector... or new copying GCs.
Beltway: Getting Around Garbage Collection Gridlock, Blackburn et al, PLDI'02.

39 Beltway collector
Increment:
- A contiguous region of memory; the unit of collection.
- Increments give objects time to die.
Belt:
- A group of increments.
- Increments on a belt are collected in FIFO order.
- Belts are collected independently.
- A promotion policy decides to which belt survivors are copied.
More general than generations, which process all objects in a generation en masse.

40 Example: Beltway
- A single nursery increment is necessary for an efficient write barrier.
- A large increment (100% of the available heap) gives completeness: it is only ever collected when all data is in that increment.
- Reduce incrementality to gain completeness.
[Figure: promotion path between belts]

41 PART 4: Integration with the environment

42 GC without paging
Page-level locality:
- Collecting disrupts the program's working set — GC touches all pages.
- Paging kills performance, and it is unrealistic to overprovision memory.
Adaptive GC:
- Tune heap expansion v. GC decisions.
- Monitor paging behaviour.
Bookmarking GC:
- Low space consumption + elimination of GC-induced paging.
- Cooperates with the OS over page eviction/replacement.
- Full-heap, compacting GC, even when portions of the heap are evicted.

43 GC & the virtual memory manager
On eviction, GC and try to discard an empty but resident page:
- Select a victim page.
- Scan the page and remember its outgoing pointers ('bookmarks') — cf. remsets.
- Protect the page and inform the VMM that it can be evicted.
GC: as normal, but don't follow references to evicted pages; treat bookmarks as roots.
On return to main memory:
- Trap the mutator's access violation.
- Remove the bookmarks for this page.
Garbage Collection without Paging, Hertz et al, PLDI'05.
[Figure: GC selects and evacuates victim pages; the VMM sends eviction notifications]

44 Average pause times
[Chart: average pause time (ms) against available memory, PseudoJBB with a 77MB heap while paging; collectors: BC, BC with resizing only, GenCopy, GenMS, CopyMS, SemiSpace]

45 Today…
GC:
- Throughput competitive with explicit deallocation, but a space penalty.
- Not intrusive.
High-performance server GC:
- Heaps of several GB; multiprocessor machines; short pause times.
Real-time GC:
- Guaranteed performance, if you specify the parameters.
Reference counting:
- Renewed interest.
Integration with the environment:
- GC implementers are cache sensitive.
- Cooperation with the OS over paging.

46 The future
- "Server" GC will move onto the desktop.
- How to deal with massively multicore systems?
- Better cooperation with the operating system etc. The virtual machine communities should talk to each other.
- Power-aware garbage collection.
- Exploit program knowledge: object lifetimes have distinct patterns — can we exploit them? What if each Beltway belt represented a management policy (expected lifetime, promotion route, …)?
- Static analyses, maybe not to reclaim objects, but to give hints to the GC.

47 Questions?

48 Discussion
Your configurations:
- Work loads, object demographics: object sizes, heap sizes, lifetimes, allocation rate, mutation rate.
Requirements:
- Pause times v. throughput.
- Concurrency, parallelism.
- Fragmentation?
Bottlenecks?
- Cache, memory bandwidth; working set size.
Tools:
- GCspy.

