Garbage Collection Advantage: Improving Program Locality


1 Garbage Collection Advantage: Improving Program Locality
Xianglong Huang (UT), Stephen M. Blackburn (ANU), Kathryn S. McKinley (UT), J. Eliot B. Moss (UMass), Zhenlin Wang (MTU), Perry Cheng (IBM). Special thanks to the authors for providing many of these slides.

2 Motivation
The memory gap problem
Object-oriented programs are increasingly popular
OO programs exacerbate the memory gap problem: automatic memory management, pointer data structures, many small methods
Goal: improve OO program locality

3 Cache Performance Matters

4 Opportunity
A generational copying garbage collector reorders objects at runtime

5 Copying of Linked Objects
[Diagram: a linked object graph with objects 1-7, copied in breadth-first order]

6 Copying of Linked Objects
[Diagram: the same linked objects 1-7 copied in breadth-first order versus depth-first order]

7 Copying of Linked Objects
[Diagram: the same linked objects 1-7 copied in breadth-first order, depth-first order, and the order produced by Online Object Reordering]
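To make the contrast concrete, here is a minimal Java sketch of the two fixed traversal orders the diagram compares; the Node and CopyOrder classes and the example graph are invented for illustration, not the paper's collector code. Breadth-first (Cheney-style) copying groups siblings together, while depth-first copying keeps pointer-chasing paths contiguous.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical heap object with outgoing references.
class Node {
    final int id;
    final List<Node> children = new ArrayList<>();
    Node(int id) { this.id = id; }
}

class CopyOrder {
    // Breadth-first (Cheney-style) order: siblings are copied together,
    // so parents can end up far from the children they actually reference.
    static List<Node> breadthFirst(Node root) {
        List<Node> order = new ArrayList<>();
        Map<Node, Boolean> seen = new IdentityHashMap<>();
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root); seen.put(root, true);
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            order.add(n);                                  // "copy" = append to to-space order
            for (Node c : n.children)
                if (seen.putIfAbsent(c, true) == null) queue.add(c);
        }
        return order;
    }

    // Depth-first order: each object tends to be followed by one of its children,
    // keeping pointer-chasing paths contiguous in to-space.
    static List<Node> depthFirst(Node root) {
        List<Node> order = new ArrayList<>();
        Map<Node, Boolean> seen = new IdentityHashMap<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root); seen.put(root, true);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            order.add(n);
            for (int i = n.children.size() - 1; i >= 0; i--) {  // push in reverse so the first child is copied next
                Node c = n.children.get(i);
                if (seen.putIfAbsent(c, true) == null) stack.push(c);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // An arbitrary small object graph (not necessarily the slide's figure).
        Node[] n = new Node[8];
        for (int i = 1; i <= 7; i++) n[i] = new Node(i);
        n[1].children.add(n[2]); n[1].children.add(n[4]);
        n[2].children.add(n[3]);
        n[4].children.add(n[5]); n[4].children.add(n[6]);
        n[6].children.add(n[7]);
        System.out.println("breadth-first: " + breadthFirst(n[1]).stream().map(x -> x.id).toList());
        System.out.println("depth-first:   " + depthFirst(n[1]).stream().map(x -> x.id).toList());
    }
}
```

Online Object Reordering, sketched after slide 18, departs from both fixed orders by letting hot fields decide which child is copied next.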

8 Outline Motivation Online Object Reordering (OOR) Methodology
Experimental Results Conclusion

9 OOR System Overview
Records object accesses in each method (excludes cold basic blocks)
Finds hot methods by adaptive sampling
Reorders objects with hot fields in the older generation during GC
Copies hot objects into a separate region

10 Online Object Reordering
Where are the cache misses? How to identify hot field accesses at runtime? How to reorder the objects?

11 Where Are The Cache Misses?
[Diagram: heap structure showing VM objects, the stack, the older generation, and the nursery; not to scale]

12 Where Are The Cache Misses?

13 Where Are The Cache Misses?
Two opportunities to reorder objects in the older generation:
Promotion of nursery objects
Full-heap collection

14 How to Find Hot Fields?
Runtime information (intercept every read)?
Compiler analysis?
Runtime information + compiler analysis
Key: low-overhead estimation

15 Which Classes Need Reordering?
Step 1: Compiler analysis excludes cold basic blocks and identifies field accesses
Step 2: JIT adaptive sampling identifies hot methods; field accesses in hot methods are marked hot (see the sketch below)
Key: low-overhead estimation
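A minimal sketch of this two-step estimate, using hypothetical types rather than the Jikes RVM API: the optimizing compiler records the field accesses found in a method's hot basic blocks, and when adaptive sampling later declares the method hot, those accesses are registered as hot fields for the collector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in application class, mirroring the slide's example type A with fields b and c.
class A { int b; int c; }

// A recorded field access, e.g. "A.b".
class FieldAccess {
    final Class<?> declaringClass;
    final String fieldName;
    FieldAccess(Class<?> c, String f) { declaringClass = c; fieldName = f; }
    @Override public String toString() { return declaringClass.getSimpleName() + "." + fieldName; }
}

class HotFieldRegistry {
    // Step 1 output: per-method access lists collected at JIT time.
    // Accesses in cold basic blocks are simply never recorded.
    private final Map<String, List<FieldAccess>> accessLists = new HashMap<>();
    // Fields the copying collector will treat as hot.
    final Set<String> hotFields = new HashSet<>();

    void recordCompileTimeAccess(String method, FieldAccess fa) {
        accessLists.computeIfAbsent(method, k -> new ArrayList<>()).add(fa);
    }

    // Step 2: adaptive sampling decides a method is hot; promote its recorded accesses.
    void onMethodSampledHot(String method) {
        for (FieldAccess fa : accessLists.getOrDefault(method, List.of()))
            hotFields.add(fa.toString());
    }

    public static void main(String[] args) {
        HotFieldRegistry reg = new HotFieldRegistry();
        reg.recordCompileTimeAccess("Foo", new FieldAccess(A.class, "b")); // hot basic block in Foo
        reg.onMethodSampledHot("Foo");      // adaptive sampling later finds Foo hot
        System.out.println(reg.hotFields);  // prints [A.b]
    }
}
```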

16 Example: Compiler Analysis
[Diagram: in method Foo, the read of a.b sits in a hot basic block (the try body) and the read of a.c in a cold basic block (the catch handler); the compiler collects access information only for the hot block, producing the access list: 1. A.b, 2. ...]

17 Example: Adaptive Sampling
[Diagram: adaptive sampling determines that Foo is hot; Foo's access list records A.b, so field b is marked hot in class A's type information]

18 Copying of Linked Objects
[Diagram: guided by per-type hot-field information, Online Object Reordering copies the linked objects so that hot objects land in a hot space and the remaining objects in a cold space; a sketch of this copy decision follows]
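A minimal sketch (hypothetical data structures, not MMTk code) of how hot-field advice can steer the copy order: when an object is copied, referents reached through fields marked hot in its type information are copied first and into a separate hot region, so hot parents and their hot children end up adjacent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical object representation: each field has a referent and a hot/cold
// flag taken from the declaring class's type information.
class OORCopy {
    static class Obj {
        final List<Obj> fields = new ArrayList<>();
        final List<Boolean> fieldIsHot = new ArrayList<>();  // parallel to 'fields'
        boolean copied = false;
    }

    final List<Obj> hotSpace = new ArrayList<>();
    final List<Obj> coldSpace = new ArrayList<>();

    // "Copy" an object into the hot or cold region, then visit its hot fields
    // first so their targets land next to it; cold fields are visited afterwards.
    void copy(Obj o, boolean hot) {
        if (o.copied) return;                       // already in to-space
        o.copied = true;
        (hot ? hotSpace : coldSpace).add(o);        // append = allocate in that region
        for (int i = 0; i < o.fields.size(); i++)
            if (o.fieldIsHot.get(i)) copy(o.fields.get(i), true);
        for (int i = 0; i < o.fields.size(); i++)
            if (!o.fieldIsHot.get(i)) copy(o.fields.get(i), false);
    }
}
```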

19 OOR System Overview
[Diagram: source code is compiled by the baseline and optimizing compilers; the optimizing compiler adds entries to an access-info database; adaptive sampling identifies hot methods, looks up their access info, and registers hot field accesses as advice to the GC; the GC copies objects accordingly, which improves the locality of the executing code. The legend distinguishes Jikes RVM components from OOR additions, and input/output from advice.]

20 Outline Motivation Online Object Reordering Methodology
Experimental Results Conclusion

21 Methodology: Virtual Machine
Jikes RVM: a high-performance VM written in Java, with timer-based adaptive sampling and dynamic optimization
Experimental setup: pseudo-adaptive compilation, 2nd iteration [Eeckhout et al.]
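The second-iteration methodology can be illustrated with a small, self-contained harness (an assumed sketch, not the actual experimental scripts): the first iteration triggers compilation under the replayed, pseudo-adaptive plan, and only the second iteration is timed.

```java
// Assumed harness: iteration 1 triggers compilation, iteration 2 is the timed
// run with (mostly) stable code, so compilation cost is excluded from the measurement.
public class SecondIterationHarness {
    interface Benchmark { void run(); }

    static long timeSecondIteration(Benchmark b) {
        b.run();                             // 1st iteration: warm up / compile
        long start = System.nanoTime();
        b.run();                             // 2nd iteration: measured
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long ns = timeSecondIteration(() -> {
            // Placeholder workload standing in for a SPEC/DaCapo benchmark.
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += i;
            if (sum == 42) System.out.println("unreachable");  // keep the loop live
        });
        System.out.printf("2nd iteration: %.3f ms%n", ns / 1e6);
    }
}
```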

22 Methodology: Memory Management
Memory Management Toolkit (MMTk): allocators and garbage collectors; multi-space heap (boot image, large object space (LOS), immortal space)
Experimental setup: generational copying GC with a 4 MB bounded nursery (a sketch follows)
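A minimal sketch (not MMTk's allocator) of what a bounded nursery means for the setup above: objects are bump-allocated in a fixed 4 MB region, a failed allocation triggers a minor collection that promotes survivors to the older generation, and the nursery is then reset and reused.

```java
// Not MMTk code: a toy bounded nursery with bump-pointer allocation.
public class BoundedNursery {
    static final int NURSERY_BYTES = 4 * 1024 * 1024;   // the 4 MB bound from the setup
    private int cursor = 0;                              // bump pointer (offset into the region)

    // Returns the offset of the new object, or -1 when the nursery is full
    // and a minor collection must run first.
    int allocate(int sizeBytes) {
        int aligned = (sizeBytes + 7) & ~7;              // 8-byte alignment
        if (cursor + aligned > NURSERY_BYTES) return -1;
        int result = cursor;
        cursor += aligned;
        return result;
    }

    // After survivors are promoted to the older generation, the whole nursery is reused.
    void resetAfterMinorGC() { cursor = 0; }
}
```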

23 Overhead: OOR Analysis Only
Benchmark    Base Execution Time (sec)    w/ only OOR Analysis (sec)    Overhead
jess                 4.39                          4.43                   0.84%
jack                 5.79                          5.82                   0.57%
raytrace             4.63                          4.61                  -0.59%
mtrt                 4.95                          4.99                   0.70%
javac               12.83                         12.70                  -1.05%
compress             8.56                          8.54                   0.20%
pseudojbb           13.39                         13.43                   0.36%
db                  18.88                                                -0.03%
antlr                0.94                          0.91                  -2.90%
hsqldb             160.56                        158.46                  -1.30%
ipsixql             41.62                         42.43                   1.93%
jython              37.71                         37.16                  -1.44%
ps-fun             129.24                        128.04                  -1.03%
Mean                                                                     -0.19%

24 Detailed Experiments
Separate application and GC time
Vary thresholds for method heat
Vary thresholds for cold basic blocks
Three architectures: x86, AMD, PowerPC
x86 performance counters: DL1, trace cache, L2, DTLB, ITLB

25 Discussion
What would be the effect of cache affinity on multicore systems?
Would memory affinity benefit the system more?
Do we need to modify the algorithm to increase cache affinity per core?
Would that significantly improve performance?

26 Performance Implications of Cache Affinity on Multicore Processors
Vahid Kazempour, Alexandra Fedorova, and Pouya Alagheband, Euro-Par 2008
“We hypothesized that cache affinity does not affect performance on multicore processors: on multicore uniprocessors — because reloading the L1 cache state is cheap, and on multicore multiprocessors – because L2 cache affinity is generally low due to cache sharing.”
“Even though upper-bound performance improvements from exploiting cache affinity on multicore multiprocessors are lower than on unicore multiprocessors, they are still significant: 11% on average and 27% maximum. This merits consideration of affinity awareness on multicore multiprocessors.”

27 Performance javac

28 Performance db

29 Performance jython Any static ordering leaves you vulnerable to pathological cases.

30 Phase Changes

31 Conclusion
Static traversal orders show up to 25% variation
OOR improves on or matches the best static ordering
OOR has very low overhead
The past predicts the future

32 Discussion
In the experiments, OOR gives a performance benefit of 10-40% over a full-heap mark-sweep collector. Is the comparison valid, given that OOR runs with a generational copying collector while mark-sweep collects the full heap each time?
How much of the benefit comes from the generational copying collector itself? In the "Myths and Realities" paper, GenMS is 30-45% better than full-heap MS.

33 Questions? Thank you!

