
1 A Study of Garbage Collector Scalability on Multicores. Lokesh Gidra, Gaël Thomas, Julien Sopena and Marc Shapiro, INRIA / University of Paris 6

2 Garbage collection on multicore hardware. 14 of the 20 most popular languages have GC, but GC doesn't scale on multicore hardware. [Figure: Parallel Scavenge/HotSpot scalability on a 48-core machine; GC throughput (GB/s, higher is better) vs. number of GC threads, SPECjbb2005 with 48 application threads and a 3.5 GB heap. Throughput degrades after 24 GC threads.]

3 GC scalability is a bottleneck. By adding new cores, the application creates more garbage per unit of time, and without a scalable GC the time spent in collection grows. [Figure: Lusearch [PLOS'11]: ~50% of execution time is spent in the GC at 48 cores.]

4 Where is the problem? Probably not GC design: the problem exists in ALL the GCs of HotSpot 7 (both stop-the-world and concurrent GCs). What has really changed: multicores are now distributed architectures, no longer centralized ones.

5 From centralized architectures to distributed ones. A few years ago: uniform memory access machines, where all cores reach memory through one system bus. Now: non-uniform memory access machines, where cores are grouped into nodes (Node 0 to Node 3 in the diagram), each node has its own local memory, and nodes communicate over an interconnect.

6 From centralized architectures to distributed ones. Our machine: an AMD Magny-Cours with 8 nodes and 48 cores, i.e. 6 cores and 12 GB of memory per node. Local access: ~130 cycles; remote access: ~350 cycles.

7-10 Microbenchmark on this machine: time to perform a fixed number of reads in parallel, with #cores = #threads (lower completion time is better). With purely local accesses, completion time stays low as cores are added. With random accesses spread over all nodes, it rises moderately. With all threads reading from a single node's memory, completion time degrades sharply: one node's memory cannot serve all 48 cores.
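The transcript doesn't include the benchmark code; the following is a minimal sketch, in C with Linux's libnuma and pthreads, of the worst case above (every thread reading a buffer bound to node 0). The thread count, buffer size, and read count are illustrative assumptions, not the talk's actual parameters. Compile with gcc -O2 bench.c -lnuma -lpthread.

    /* Single-node-access microbenchmark sketch: all threads read memory
     * bound to node 0, as in the worst curve of the talk's chart.
     * Assumes Linux + libnuma. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 48                 /* one thread per core */
    #define NREADS   (1L << 24)         /* fixed read count per thread */
    #define NWORDS   (1L << 22)         /* 32 MB buffer of longs */

    static void *worker(void *arg) {
        long id = (long)arg;
        numa_run_on_node(id % (numa_max_node() + 1)); /* spread threads */
        /* Bind the buffer to node 0, then write-touch it so its pages
         * are really allocated there (reads alone may be served by the
         * shared zero page). */
        volatile long *buf = numa_alloc_onnode(NWORDS * sizeof(long), 0);
        for (long i = 0; i < NWORDS; i++) buf[i] = i;
        long sum = 0;
        for (long i = 0; i < NREADS; i++)
            sum += buf[(i * 64) % NWORDS];  /* stride past cache lines */
        numa_free((void *)buf, NWORDS * sizeof(long));
        return (void *)sum;
    }

    int main(void) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }
        pthread_t t[NTHREADS];
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("completion: %.0f ms\n", (b.tv_sec - a.tv_sec) * 1e3
                                      + (b.tv_nsec - a.tv_nsec) / 1e6);
        return 0;
    }

Swapping numa_alloc_onnode(..., 0) for numa_alloc_local() approximates the local-access curve of the chart.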

11-13 Parallel Scavenge heap space. The kernel uses a lazy first-touch page allocation policy: a physical page is allocated on the node of the thread that first touches it. The initial application thread runs alone during the sequential startup phase, so most heap pages get mapped on its node. And the mapping then persists for the whole execution, because the GC keeps reusing the same virtual address space. This is a severe problem for generational GCs.
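For readers unfamiliar with first-touch, here is a minimal sketch (C, Linux + libnuma) showing that a page's home node is fixed by whichever thread touches it first. The program is illustrative and not from the talk; compile with gcc demo.c -lnuma.

    /* First-touch demo sketch: the page lands on the node of the first
     * writer and stays there afterwards.  Assumes Linux + libnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        if (numa_available() < 0) return 1;
        char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        numa_run_on_node(0);  /* act as the initial application thread */
        page[0] = 1;          /* first touch: the kernel places the page now */
        void *p = page;
        int status = -1;
        /* With a NULL node list, numa_move_pages only queries placement. */
        numa_move_pages(0, 1, &p, NULL, &status, 0);
        printf("page lives on node %d\n", status);  /* prints 0 */
        /* Later touches from other nodes will NOT move the page; likewise,
         * the GC reuses the heap's virtual space, so the heap stays where
         * the initial thread first touched it. */
        return 0;
    }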

14 Parallel Scavenge heap analysis. Its first-touch allocation policy yields bad balance (95% of the heap pages end up on a single node) and bad locality. [Figure: GC throughput (GB/s, higher is better) vs. GC threads for PS on SPECjbb.]

15 NUMA-aware heap layouts. Two alternatives to Parallel Scavenge's first-touch policy: an Interleaved layout, with a round-robin page allocation policy that targets balance, and a Fragmented layout, with node-local object allocation and copying that targets locality. [Figure: the same PS curve on SPECjbb, for reference.]

16 Interleaved heap layout analysis. Round-robin allocation turns bad balance into perfect balance, but locality stays bad: with 8 nodes, 7 out of 8 accesses are remote. [Figure: GC throughput vs. GC threads, PS vs. Interleaved on SPECjbb.]
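An interleaved layout corresponds to what libnuma's interleaved allocation provides. A minimal sketch (C, Linux + libnuma; the heap size matches the talk's 3.5 GB, everything else is illustrative):

    /* Interleaved heap sketch: pages are bound round-robin over all nodes
     * before first touch, giving perfect balance but mostly remote access.
     * Assumes Linux + libnuma; compile with gcc inter.c -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) return 1;
        size_t heap_size = 3584UL << 20;       /* 3.5 GB, as in the talk */
        void *heap = numa_alloc_interleaved(heap_size);
        if (heap == NULL) { fprintf(stderr, "alloc failed\n"); return 1; }
        /* Page i lives on node i % nnodes: on 8 nodes, a GC thread's
         * access is remote 7 times out of 8. */
        printf("interleaved heap of %zu bytes at %p\n", heap_size, heap);
        numa_free(heap, heap_size);
        return 0;
    }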

17 Fragmented heap layout analysis. Node-local allocation and copying gives good balance (it only degrades if a single thread allocates on behalf of all the others) and average locality: scans are still 7/8 remote, but copies are 100% local. Summary of the three layouts:
- Parallel Scavenge: first-touch policy; bad balance (95% on a single node); bad locality.
- Interleaved: round-robin policy; perfect balance; bad locality (7/8 remote accesses).
- Fragmented: node-local allocation and copy; good balance; average locality (7/8 remote scans, 100% local copies).
[Figure: GC throughput vs. GC threads, PS vs. Interleaved vs. Fragmented on SPECjbb.]
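A minimal sketch of the fragmented idea (C, Linux + libnuma): the heap is split into per-node fragments, and every thread bumps the fragment of the node it runs on, so allocations and GC copies stay local. The fragment size and bump-pointer structure are illustrative assumptions, not HotSpot's real data structures.

    /* Fragmented heap sketch: one fragment per NUMA node; a thread
     * allocates (and, during GC, copies) into its own node's fragment.
     * Assumes Linux + libnuma; compile with gcc -O2 frag.c -lnuma. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define FRAG_SIZE (64UL << 20)   /* illustrative 64 MB per node */
    #define MAX_NODES 64

    static struct {
        char *base;
        _Atomic size_t top;          /* bump pointer, shared per node */
    } frag[MAX_NODES];

    static void init_fragments(void) {
        for (int n = 0; n <= numa_max_node(); n++) {
            frag[n].base = numa_alloc_onnode(FRAG_SIZE, n);
            atomic_init(&frag[n].top, 0);
        }
    }

    /* Node-local bump allocation: the object lands on the caller's node,
     * so copies are 100% local even when scans are remote. */
    static void *alloc_local(size_t size) {
        int n = numa_node_of_cpu(sched_getcpu());
        size_t off = atomic_fetch_add(&frag[n].top, size);
        return off + size <= FRAG_SIZE ? frag[n].base + off : NULL;
    }

    int main(void) {
        if (numa_available() < 0) return 1;
        init_fragments();
        void *obj = alloc_local(128);
        printf("allocated %p on node %d\n", obj,
               numa_node_of_cpu(sched_getcpu()));
        return 0;
    }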

18 Synchronization optimizations: removed a barrier between the GC phases, and replaced the GC task-queue with a lock-free one. These optimizations only pay off under high contention, i.e. at high GC thread counts. [Figure: GC throughput vs. GC threads: PS, Interleaved, Fragmented, Fragmented + synchro on SPECjbb.]
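The talk does not show the queue itself, and this is not HotSpot's actual GC task queue; as one illustration of the pattern, here is a minimal lock-free task-claiming sketch in C11, where workers claim tasks from a shared array with a single fetch-and-add instead of taking a lock. It assumes all tasks are published before the workers start, as in a stop-the-world phase.

    /* Lock-free task-claiming sketch (C11 atomics).  A producer publishes
     * tasks into a fixed array; GC workers then claim them with one
     * atomic fetch-and-add each, with no lock. */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_TASKS 4096

    typedef void (*gc_task_fn)(void *);
    struct task { gc_task_fn fn; void *arg; };

    static struct task tasks[MAX_TASKS];
    static _Atomic size_t published;   /* tasks made visible so far */
    static _Atomic size_t claimed;     /* next task index to hand out */

    /* Single producer, called before workers start; the release store
     * makes the task contents visible before the new count. */
    static void push_task(gc_task_fn fn, void *arg) {
        size_t i = atomic_load_explicit(&published, memory_order_relaxed);
        tasks[i].fn = fn;
        tasks[i].arg = arg;
        atomic_store_explicit(&published, i + 1, memory_order_release);
    }

    /* Any number of workers race here; fetch-and-add gives each task to
     * exactly one worker.  Overshooting past `published` is benign
     * because nothing is pushed once the workers are running. */
    static int pop_task(struct task *out) {
        size_t limit = atomic_load_explicit(&published, memory_order_acquire);
        size_t i = atomic_fetch_add_explicit(&claimed, 1, memory_order_relaxed);
        if (i >= limit) return 0;      /* queue drained */
        *out = tasks[i];
        return 1;
    }

    static void trace(void *arg) { printf("task %ld\n", (long)arg); }

    int main(void) {
        for (long i = 0; i < 4; i++) push_task(trace, (void *)i);
        struct task t;
        while (pop_task(&t)) t.fn(t.arg);  /* each worker runs this loop */
        return 0;
    }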

19 Effect of the optimizations on the application (GC excluded). Good balance improves application time a lot; locality has only a marginal effect on the application, even though the fragmented layout gives the application better locality than the interleaved one (recently allocated objects are the most frequently accessed). [Figure: application time (lower is better) vs. GC threads, PS vs. the other heap layouts, XML Transform from SPECjvm.]

20 Overall effect (GC and application combined). The optimizations double the application throughput of SPECjbb, and pause time is cut in half (from 105 ms to 49 ms). [Figure: application throughput (ops/ms, higher is better) vs. GC threads: PS, Interleaved, Fragmented, Fragmented + synchro.]

21 The GC scales well for memory-intensive applications. [Figure: GC throughput of PS vs. Fragmented + synchro across benchmarks with heap sizes from 512 MB to 3.5 GB.]

22-23 Take away. Previous GCs do not scale because they are not NUMA-aware. Existing mature GCs can scale using standard parallel-programming techniques, and NUMA-aware memory layouts should be useful for all GCs, concurrent GCs included. The most important NUMA effects:
1. Balancing memory accesses.
2. Memory locality, which only helps at high core count.
Thank you.

