# Incorporating Generations into a Modern Reference Counting Garbage Collector

Hezi Azatchi. Advisor: Erez Petrank.


2 Outline. Background: garbage collection; reference counting; mark & sweep; improving tracing using generations; on-the-fly sliding-view garbage collectors. Our generational on-the-fly algorithms. Results. Summary.

3 The Reference Counting Algorithm [Collins 1960]. Each object has an RC field. New objects get o.RC := 1. When a pointer p that points to o1 is modified to point to o2, we do: o1.RC--, o2.RC++. If o1.RC == 0: delete o1, decrement o.RC for every child o of o1, and recursively delete objects whose RC is decremented to 0. (Figure: pointer p and objects o1, o2, o3, o4.) Background – Reference Counting
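The rules on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Obj` class, the `freed` flag, and the `release` and `update` helpers are hypothetical names, not part of the presented collector.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.rc = 1          # new objects get RC := 1 (the creating reference)
        self.children = []   # outgoing pointers
        self.freed = False   # hypothetical flag standing in for real deallocation


def release(o):
    """Drop one reference to o; when RC reaches 0, delete it and release its children."""
    o.rc -= 1
    if o.rc == 0:
        o.freed = True
        for child in o.children:
            release(child)
        o.children = []


def update(holder, index, new):
    """Redirect holder.children[index] from its old target to new."""
    old = holder.children[index]
    new.rc += 1              # o2.RC++ first, so self-assignment is safe
    holder.children[index] = new
    release(old)             # o1.RC--, with recursive deletion at 0
```

With a root slot p pointing to o1, and o1 pointing to o3, redirecting p to o2 drops o1 to zero and the deletion cascades to o3.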

4 Three years later… [McBeth 1963]. The reference counting algorithm does not reclaim cycles! But it turns out that “normal” programs do not use too many cycles, so other methods (such as mark and sweep) are run “seldom” to collect the cycles. (Figure: objects o1 and o2.)

5 Deferred Reference Counting. Problem: RC algorithms prescribe an action for each pointer operation. Solution [Deutsch & Bobrow, 1976]: don’t update RC for locals. Put objects with RC = 0 in a Zero-Count Table (ZCT). “Once in a while”: collect all the objects in the ZCT with o.RC = 0 that are not referenced from local roots. Deferred RC reduces overhead by 80%. Used in most modern RC systems.
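A minimal single-threaded sketch of the deferred scheme follows. It makes several assumptions that the slide does not state: counts cover heap references only, a new object starts at RC = 0 inside the ZCT, and the names `Heap`, `write_field`, and `collect` are hypothetical.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.rc = 0          # heap references only; local references are not counted
        self.fields = {}


class Heap:
    def __init__(self):
        self.zct = set()     # zero-count table
        self.freed = []      # names of reclaimed objects, in reclamation order

    def new(self, name):
        o = Obj(name)
        self.zct.add(o)      # assumption: new objects start with RC = 0
        return o

    def write_field(self, holder, field, new):
        """Heap pointer store: the only operation that touches reference counts."""
        old = holder.fields.get(field)
        if new is not None:
            new.rc += 1
            self.zct.discard(new)
        holder.fields[field] = new
        if old is not None:
            old.rc -= 1
            if old.rc == 0:
                self.zct.add(old)

    def collect(self, local_roots):
        """Reclaim ZCT entries with RC = 0 that no local root references."""
        pending = [o for o in self.zct if o not in local_roots]
        while pending:
            o = pending.pop()
            if o.rc != 0 or o in local_roots or o not in self.zct:
                continue
            self.zct.discard(o)
            self.freed.append(o.name)
            for child in o.fields.values():
                if child is not None:
                    child.rc -= 1
                    if child.rc == 0:
                        self.zct.add(child)
                        pending.append(child)
            o.fields.clear()
```

Objects held only by locals stay in the ZCT across collections and are reclaimed once the local reference disappears.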

6 The Mark-Sweep Algorithm [McCarthy 1960]. Traverse and mark live objects, starting from the roots. Objects left unmarked (white) can be reclaimed. (Figure labels: globals, roots.) Background – Mark and Sweep
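As a rough illustration, mark and sweep over a toy heap can be written as below. The heap representation (a dict from object name to the names it points to) and the function name `mark_sweep` are assumptions made for this sketch.

```python
def mark_sweep(heap, roots):
    """heap: dict mapping object name -> list of names it points to.
    roots: iterable of root object names.
    Returns (marked, reclaimed) and removes reclaimed objects from heap."""
    marked = set()
    stack = list(roots)
    while stack:                      # mark phase: traverse from the roots
        o = stack.pop()
        if o in marked:
            continue
        marked.add(o)
        stack.extend(heap[o])
    reclaimed = set(heap) - marked    # sweep phase: unmarked objects are garbage
    for o in reclaimed:
        del heap[o]
    return marked, reclaimed
```

Unlike plain reference counting, this reclaims an unreachable cycle: if c and d point only at each other, neither is marked and both are swept.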

7 Generational Garbage Collection [Ungar, 1984]. Weak generational hypothesis: “most objects die young”. Segregate objects by age into two or more regions of the heap called generations. Objects are first allocated in the youngest generation, and are promoted into an older generation if they survive long enough. Most pauses are short (young-generation collections). Collection effort is concentrated where the garbage is. Better locality. Background – Generational GC

8 Generational GC – Inter-Generational Pointers. Pointers from the old generation to the young generation must be part of the root set of the young generation. (Figure: stack, globals, old generation, young generation.)

9 Note Interesting Properties. Mark & sweep does well when the fraction of live objects is low, so it “fits” the young generation, where few objects are live. RC does not depend on the amount of live space, so it “fits” the old generation, which has a large amount of live space. Thus a combination of RC for the old generation and mark & sweep for the young generation may be good! This is exactly what we tried, on a modern platform (SMP), with advanced modern on-the-fly collectors. Background

10 Terminology. (Figure labels: mutators, collector threads.)

11 On-the-Fly Sliding-View Algorithms [Levanoni & Petrank, OOPSLA 2001]

12 Motivation for RC. Reference-counting work is proportional to the work on object creations and pointer modifications. Can tracing deal with tomorrow’s huge heaps? Reference counting has good locality. The challenge: RC write barriers seem too expensive, and RC seems impossible to “parallelize”. Levanoni Petrank Algorithms - Motivation

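13 Multithreaded RC? Problem 1: reference-count updates must be atomic. Problem 2: parallel updates confuse the counters. Example with objects A, B, C, D, where A.next initially points to B. Thread 1: read A.next (sees B); A.next ← C; B.RC--; C.RC++. Thread 2: read A.next (sees B); A.next ← D; B.RC--; D.RC++. B.RC is decremented twice although only one reference to B was removed, and both C.RC and D.RC are incremented although A.next ends up pointing to only one of them.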

14 First Multithreaded RC [DeTreville]. Lock the heap for each pointer modification. Each thread records its updates in a buffer. Once in a while (snapshot-like), the GC thread reads all buffers to update the reference counts, and reclaims all objects with RC 0 that are not referenced by locals.

15 To Summarize… The overhead of the write barrier is considered high, even with the deferred RC of Deutsch & Bobrow. Running reference counting concurrently with program threads seems to bear a high synchronization cost: a lock or a “compare & swap” for each pointer update.

16 Improving the write-barrier overhead. Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On. Out of the 2n operations O0.RC--; O1.RC++; O1.RC--; O2.RC++; O2.RC--; …; On.RC++, only two are needed: O0.RC-- and On.RC++. (Figure: pointer p and objects O0, O1, O2, O3, O4, …, On.) Levanoni Petrank Algorithms

The write barrier. Procedure Update(p: Pointer, new: Object): prev := *p; if !Dirty(p) then log (p, prev) into the local log buffer and set Dirty(p) := True; *p := new. Timeline between collections: p ← O1 (record p’s previous value O0); p ← O2 (do nothing); …; p ← On (do nothing). At collection time, for each modified slot p: read p to get On, read the records to get O0, then do O0.RC-- and On.RC++.
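The barrier and the collection-time fix-up can be sketched as follows. This is a single-threaded illustration only: the `Slot` class, the dict of counts, and the name `apply_log` are assumptions, and the real barrier must also cope with concurrent mutators.

```python
class Slot:
    def __init__(self, value=None):
        self.value = value
        self.dirty = False


def update(slot, new, log):
    """Write barrier: log the old value only on the first store since the last collection."""
    prev = slot.value
    if not slot.dirty:
        log.append((slot, prev))   # record p's previous value (O0)
        slot.dirty = True
    slot.value = new


def apply_log(log, rc):
    """Collection time: one decrement and one increment per modified slot."""
    for slot, prev in log:
        if prev is not None:
            rc[prev] -= 1          # O0.RC--
        if slot.value is not None:
            rc[slot.value] += 1    # On.RC++
        slot.dirty = False
    log.clear()
```

Storing O1, O2, O3 into the same slot produces a single log record, and the intermediate objects’ counts are never touched.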

18 The “Snapshot” (Concurrent) RC Algorithm. Use the write barrier with program threads. Take a snapshot: stop all threads; scan roots (locals); get the buffers with modified slots; clear all dirty bits; resume threads. Then run the collector: for each modified slot, decrease the RC of the previous snapshot value (read from the buffer) and increase the RC of the current snapshot value (“read heap”); reclaim non-local objects with RC 0.

19 The General Picture. Modifications are recorded between collections, and the list of modifications is used to update the reference counts. (Figure: slots p1 to p7 in the heap at collection k and in the heap at collection k+1.)

20 The “Snapshot” Tracing (Mark & Sweep) Collector. Use the write barrier with program threads. Take a snapshot: stop all threads; scan roots (locals); get the buffers with modified slots; clear all dirty bits; resume threads. Then run the collector. Mark via the current snapshot: for each reachable slot s, if s is not dirty then “read heap”, else “read buffer”, and recursively mark the value of s. Sweep all non-local objects that are not marked.
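The “read heap” versus “read buffer” choice during marking can be sketched as below. The data layout is an assumption made for the illustration: `heap` maps slot names to current values, `dirty` is the set of slots modified since the snapshot, and `buffer` maps each such slot to its logged snapshot-time value.

```python
def snapshot_value(slot, heap, dirty, buffer):
    """Return the value the slot held when the snapshot was taken."""
    if slot in dirty:
        return buffer[slot]    # modified since the snapshot: "read buffer"
    return heap[slot]          # untouched since the snapshot: "read heap"


def mark_snapshot(roots, slots_of, heap, dirty, buffer):
    """Mark every object reachable in the snapshot, ignoring later mutations."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj is None or obj in marked:
            continue
        marked.add(obj)
        for slot in slots_of.get(obj, ()):
            stack.append(snapshot_value(slot, heap, dirty, buffer))
    return marked
```

If a mutator redirects A.next from B to C after the snapshot, the trace still follows the logged value B, so the marking reflects the heap as it was at snapshot time.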

21 Intermediate Concurrent Algorithm. Properties: snapshot oriented, concurrent (not so bad…). Pause time: stop all threads, clear all dirty bits, and mark the roots of all threads. Pause-time goal: stop one thread at a time, only to mark its own local roots. The goal: an on-the-fly algorithm with a low throughput cost.

22 Collecting On-the-Fly: what if we stop one thread at a time? Take a sliding view: for each thread t, stop t, scan roots (locals), get the buffers with modified slots, and resume t; then clear all dirty bits. Then run the collector: for each modified slot, decrease the RC of the previous snapshot value (read from the buffer) and increase the RC of the current snapshot value (“read heap”); reclaim non-local objects with RC 0. Several problems remain to be solved…

23 The New Picture – using Sliding Views. Read the information from one thread at a time (while the other threads run): no snapshot. (Figure: slots p1 to p7 in the heap, the list of modifications, and the sliding views of the heap at collection k and at collection k+1.)

24 Danger in Sliding Views. The program does: P1 ← O; P2 ← O; P1 ← NULL. At the point where the sliding view reads P2 it sees NULL, and at the point where it reads P1 it also sees NULL. Problem: the reachability of O is not noticed! Solution: if a pointer to O has been stored during the sliding-view phase, do not reclaim O (or its descendants). (Figure: slots p1 to p7 in the heap.)

25 The Sliding Views Collector. Take a sliding view: start snooping; for each thread t, stop t, scan roots (locals), get the buffers with modified slots, and resume t; stop snooping; clear all dirty bits. Then run the collector: for each modified slot, decrease the RC of the previous snapshot value (read from the buffer) and increase the RC of the current snapshot value (“read heap”); reclaim non-local objects with RC 0.
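Snooping can be illustrated with a small sketch. The names `SnoopingBarrier` and `reclaimable` are hypothetical, and the logging and dirty-bit parts of the write barrier are left out so that only the snooping step is visible.

```python
class SnoopingBarrier:
    def __init__(self):
        self.snoop = False     # set by the collector while it takes the sliding view
        self.snooped = set()   # objects to which a pointer was stored meanwhile

    def update(self, heap, slot, new):
        """Pointer store; logging and dirty bits are omitted in this sketch."""
        heap[slot] = new
        if self.snoop and new is not None:
            self.snooped.add(new)


def reclaimable(rc, local_roots, snooped):
    """Objects with RC 0 that are neither locally referenced nor snooped."""
    return {o for o, n in rc.items()
            if n == 0 and o not in local_roots and o not in snooped}
```

In the scenario of the previous slide (P1 ← O; P2 ← O; P1 ← NULL during the sliding view), O may end up with a computed count of 0, but it sits in the snooped set and is therefore not reclaimed in this cycle.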

26 Implementation for Java. Based on Sun’s JDK 1.2.2 for Windows NT. Main features: a 2-bit RC field per object (à la [Wise et al.]); a custom allocator for on-the-fly RC. Server benchmarks: SPECjbb2000, which simulates business-like transactions in a large firm; MTRT, a multi-threaded ray tracer. Client benchmarks: SPECjvm98, a suite of mostly single-threaded client benchmarks.

27 Improved RC: how many RC updates are eliminated?

| Benchmark | No. of stores | No. of “first” stores | Ratio of “first” stores |
|---|---|---|---|
| jbb | 71,011,357 | 264,115 | 1/269 |
| Compress | 64,905 | 51 | 1/1273 |
| Db | 33,124,780 | 30,696 | 1/1079 |
| Jack | 135,174,775 | 1,546 | 1/87435 |
| Javac | 22,042,028 | 535,296 | 1/41 |
| Jess | 26,258,107 | 27,333 | 1/961 |
| mpegaudio | 5,517,795 | 51 | 1/108192 |

28 SPECjbb – max pause time (figure)

29 SPECjbb Throughput (figure)

30 MTRT Throughput (figure)

31 This Work: Sliding Views Algorithms with Generations

32 Motivation. Investigate how generations integrate with reference counting on a multiprocessor. Tracing work is proportional to the amount of live objects, and by the weak generational hypothesis “most objects die young”. RC does not depend on the amount of live space, and the old generation has a high fraction of live objects. The goal: higher throughput. Each algorithm matches its generation; work is concentrated where the garbage is; better locality, with a smaller working set. Note: similar pauses are expected. This Work - Generational Algorithms

33 Design issues. Two generations. Two collection types: minor and full. Each object that has survived a collection is promoted; this simplifies the implementation and lowers the overhead of handling inter-generational pointers. The heap is partitioned logically: in an on-the-fly collector, object copying is very difficult if not impossible, so an object is promoted by marking it as old.

34 Design issues. Promotion is done by the collector. Collection triggering: a minor collection is triggered every X bytes of allocation; a full collection is triggered when the heap occupancy grows to more than Y%. Each mutator has two local buffers: (1) a “young-objects” buffer holding pointers to new objects, and (2) an “old-objects” buffer. The young generation processed by a cycle consists of all local “young-objects” buffers from the previous cycle.
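The triggering policy can be sketched as a small decision function. The slide leaves X and Y unspecified, so both thresholds are parameters here. The function name, its parameter names, and the choice to give a full collection priority when both conditions hold are assumptions of this sketch.

```python
def choose_collection(bytes_since_minor, heap_used, heap_size,
                      minor_every_bytes, full_occupancy_pct):
    """Return "full", "minor", or None according to the two thresholds."""
    # Full collection: heap occupancy has grown beyond Y percent.
    if heap_used * 100 > full_occupancy_pct * heap_size:
        return "full"          # assumption: full takes priority over minor
    # Minor collection: X bytes have been allocated since the last minor cycle.
    if bytes_since_minor >= minor_every_bytes:
        return "minor"
    return None
```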

35 Log modified objects instead of modified slots. Example: Update(A.p1, C); Update(A.p2, C); Update(A.p2, D). (Figure: heap objects.)

36 Roles of the “young-objects” buffer and the “old-objects” buffer. Mutator steps: 1. o1.next := new(256); 2. Update(*o1.next, o1); 3. Update(o1.next, o2). (Figure: timeline t over cycles K, K+1, and K+2; heap objects o1, o2, and the new object; the mutator’s “young-objects” and “old-objects” buffers.)

37 Three On-the-Fly Generational Algorithms. (1) Reference counting for both collections. (2) Reference counting for the minor (young) collection, tracing for the major collection. (3) Reference counting for the major collection, tracing for the minor collection; expected to be the best.

38 Agenda. There is no time to present all the algorithms; only the major-RC algorithm (the best) will be presented in full. We go over several interesting difficulties. Issues for major RC collections: finding the inter-generational pointers efficiently; preparing the buffers for the major reference-counting collection. Issues for minor RC collections: efficient promotion with minor RC; snooping selectively; no need to update all objects’ RCs accurately.

39 Reference Counting for the Major Collection algorithm. Expected to be the best. Uses mark & sweep for the minor collections and RC for the major collections.

The minor collection – mark & sweep. Take a sliding view: start snooping; for each thread t, stop t, scan roots (locals), get the buffers with modified slots, and resume t; stop snooping; clear all dirty bits. Then run the collector: find the inter-generational pointers and add them to the root set; mark via the sliding view (for each reachable slot s, if s is not dirty then “read heap”, else “read buffer”, and recursively mark the value of s); sweep non-local, unmarked objects and promote the survivors; prepare the buffers for the major collection.

The major collection – RC. Take a sliding view: start snooping; for each thread t, stop t, scan roots (locals), get the buffers with modified slots, and resume t; stop snooping; clear all dirty bits. Then run the collector: for each modified slot (in the current sliding-view buffers or in the prepared major buffers), decrease the RC of the previous snapshot value (read from the buffer) and increase the RC of the current snapshot value (“read heap”); reclaim non-local objects with RC 0 and promote the survivors.

42 Issues for major RC collections. Young generation: how do we efficiently find the inter-generational pointers needed by the mark & sweep of the young generation? Old generation: provide the major RC collection with consistent buffers.

43 Inter-generational pointers are “given for free”. Observation: old objects that point to young objects must have been modified since the previous collection, because the young objects did not exist before. Thus all inter-generational pointers must be logged in the “old-objects” local buffers. Does this capture all inter-generational pointers? Not quite: some race conditions arise because the sliding view is not atomic.

The first race – “intra sliding-view update”. (Timeline figure over cycles K-1, K, and K+1, with the collector and mutators A and B.) The collector takes a sliding view, and each mutator in turn cooperates: stop, mark roots, read buffers, resume. A mutator executes p := new(16); that new object is logged to the young generation of the next cycle. A mutator then executes Update(o.next, *p). The “inter-generational pointer” is logged in the buffers of cycle K-1, and these buffers won’t be available in this cycle!

The second race – “update before clear”. (Timeline figure over cycles K-1, K, and K+1, with the collector and mutators A and B.) The collector takes a sliding view, each mutator cooperates in turn (stop, mark roots, read buffers, resume), and the collector then clears the dirty marks. A mutator executes p := new(16), and an update reads o.next = x and o.Dirty = true, so Update(o.next, *p) does not log o, while the collector’s Clear-Dirty-Marks step races with it. An inter-generational pointer was created and not logged to any buffer! If we do not traverse through the inter-generational pointer to the new object, the object might be swept mistakenly in this cycle.

46 Solution to both races. Record into an “IGPs_buffer” all objects that are involved in an update to a young object during the following uncertainty period: from the start of taking the sliding view until the end of clearing the dirty marks. The true inter-generational set is contained in the following set: {union over all mutators’ old-objects buffers} ∪ {IGPs_buffer}.
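The containment above suggests computing a conservative candidate set and then scanning it for pointers into the young generation. The sketch below assumes buffers are plain lists of object names; the function names `igp_candidates` and `young_targets` are hypothetical.

```python
def igp_candidates(old_objects_buffers, igps_buffer):
    """Union of all mutators' old-objects buffers and the IGPs_buffer."""
    candidates = set(igps_buffer)
    for buffer in old_objects_buffers:
        candidates.update(buffer)
    return candidates


def young_targets(candidates, fields, young):
    """Young objects referenced from candidate objects: extra roots for the minor trace."""
    return {child
            for parent in candidates
            for child in fields.get(parent, ())
            if child in young}
```

The candidate set is a superset of the true inter-generational set, so some candidates (such as an old object that points only to old objects) contribute no roots.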

47 Full RC collection: buffer preparation. The mutators log objects to their local “young-objects” and “old-objects” buffers. The collector logs some of these objects to the “major-new-objects” buffer and to the “major-old-objects” buffer.

48 Which objects to log to the major buffers? Only objects that will be alive in the next major collection. Use an OldDirty flag (for each logged object) to avoid logging the same object multiple times. Logging to the “major-new-objects” buffer: log only objects that were promoted (this is known at the young-generation sweep phase). The children of such an object are not logged, because the object did not exist in the previous major cycle and thus its children did not reference any object.

49 Logging to the “major-old-objects” buffer. The parent objects that are logged in the young generation’s “old-objects” buffers are old. (Why?) Thus they can be logged to the major buffer (they will survive). Their children may be swept, so log only the children that were promoted (only after the sweep phase).

50 Issues for RC minor collections (the reference-counting-for-the-young-generation algorithm). Efficient promotion with minor RC collections. Advantage: the RC field need not be accurate. Selectively snoop only young objects.

51 Efficient promotion with minor RC collections. Recall reclamation with deferred RC: go over the suspects (the young generation); if RC = 0 and the object is not local, reclaim it (recursively); otherwise, should it be promoted? A problem: we cannot promote objects at this point, because recursive reclamation may delete them later. Thus we may promote only at the end of the reclamation phase. A bad (simple) solution: traverse the young generation twice.

52 Efficient promotion with minor RC collections. Our solution: each object in the young generation whose RC > 0 is marked as a “pendingPromotion” object, which is still treated as young. Zero the “pendingPromotion” bitmap at the end, thereby promoting all the surviving objects.
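A single-pass version of this idea might look like the sketch below. It is an assumption-laden illustration: the function and parameter names are hypothetical, the “pendingPromotion” bitmap is modelled as a Python set, and treating locally referenced objects with RC = 0 as pending survivors is a choice made here, not something the slide specifies.

```python
def minor_rc_collect(young, rc, children, local_roots):
    """One pass over the young generation; returns (promoted, reclaimed)."""
    pending = set()      # stands in for the "pendingPromotion" bitmap
    reclaimed = set()

    def reclaim(obj):
        stack = [obj]
        while stack:
            x = stack.pop()
            if x in reclaimed:
                continue
            reclaimed.add(x)
            pending.discard(x)   # a pending object is still young, so it can die
            for c in children.get(x, ()):
                rc[c] -= 1
                if rc[c] == 0 and c in young and c not in local_roots:
                    stack.append(c)

    for o in young:
        if o in reclaimed:
            continue
        if rc[o] == 0 and o not in local_roots:
            reclaim(o)
        else:
            pending.add(o)       # survivor so far; promotion is deferred

    promoted = set(pending)      # "zeroing the bitmap" promotes all survivors at once
    return promoted, reclaimed
```

An object visited early with RC > 0 can still be reclaimed later in the same pass if its only referrer turns out to be garbage, which is why promotion waits until the end.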

53 Reference counting for the young generation. No need to decrement the RCs of previous values: those objects were pointed to at the previous collection, so each was either promoted or reclaimed. Thus the young-generation collection should not fix their RCs; the full tracing collection will handle them. Further improvement: after the sliding view has been taken, there is no need to log an object’s children. Should we snoop selectively in a minor collection? Yes: the minor collection can reclaim only young objects.

Implementation. The original collector and all three generational garbage collectors were implemented in Jikes, a research Java virtual machine and JIT compiler developed at the IBM T.J. Watson Research Center. The entire system, including the collector itself, is written in Java (extended with unsafe primitives to access raw memory). The reference collector: SIGPLAN-2001, “Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector” by David F. Bacon et al. Measurements were taken on a 4-way IBM Netfinity 8500R server with 550 MHz Intel Pentium III Xeon processors and 2 GB of physical memory.

Best algorithm - “RC for full” SPECjvm98 on Multiprocessor

Best algorithm - “RC for full” SPECjvm98 on uniprocessor

Best algorithm - “RC for full” SPECjbb2000 on Multiprocessor

Best algorithm - “RC for full” _227_mtrt on Multiprocessor

59 Best algorithm-“RC for minor” Max Pause Time

Second algorithm -“RC for minor” _227_mtrt on Multiprocessor

Second algorithm -“RC for minor” SPECjbb00 on Multiprocessor

62 Summary. We have presented (for the first time) an incorporation of generations into an on-the-fly reference counting algorithm. We have implemented three algorithms in Jikes, IBM’s research JVM with a JIT compiler. It turns out that incorporating generations (for all algorithms) improves efficiency and keeps the short pause times of the original. The algorithm that uses reference counting in the old generation did better than the others (as expected). This Work - Summary