A Study of Garbage Collector Scalability on Multicores
Lokesh Gidra, Gaël Thomas, Julien Sopena and Marc Shapiro
INRIA / University of Paris 6


Garbage collection on multicore hardware
14 of the 20 most popular languages have GC, but GC doesn't scale on multicore hardware.
[Figure: Parallel Scavenge/HotSpot GC throughput (GB/s) vs. number of GC threads on a 48-core machine, SPECjbb2005 with 48 application threads and a 3.5 GB heap, higher is better. Throughput degrades after 24 GC threads.]

GC scalability is a bottleneck
By adding new cores, the application creates more garbage per time unit, and without GC scalability, the time spent in GC increases.
Example: Lusearch [PLOS'11] spends ~50% of its time in the GC at 48 cores.

Where is the problem?
Probably not in GC design: the problem exists in all the GCs of HotSpot 7 (both stop-the-world and concurrent GCs).
What has really changed: multicores are distributed architectures, not centralized ones anymore.

From centralized architectures to distributed ones
A few years ago: uniform memory access (UMA) machines, where all cores reach a single memory over a shared system bus.
Now: non-uniform memory access (NUMA) machines, made of several nodes (cores plus local memory) connected by an interconnect.

From centralized architectures to distributed ones
Our machine: an AMD Magny-Cours with 8 nodes and 48 cores (6 cores and 12 GB of memory per node).
Local memory access: ~130 cycles. Remote memory access: ~350 cycles.

From centralized architectures to distributed ones
Microbenchmark on this machine: time to perform a fixed number of reads in parallel, with #cores = #threads (completion time in ms, lower is better), comparing three memory placements: local access, random access across all nodes, and access to a single node. A sketch of such a benchmark follows.
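The slide does not give the benchmark's code; the following is a minimal sketch of the same idea, assuming Linux with libnuma (compile with -lnuma -pthread). The buffer size, read count, and stride are illustrative values, not the ones behind the slide, and an interleaved placement stands in for the slide's random-access case.

```cpp
#include <numa.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <functional>
#include <thread>
#include <vector>

static constexpr size_t kBufBytes = 64UL << 20;  // 64 MB touched by each thread
static constexpr size_t kReads    = 50'000'000;  // fixed amount of work per thread
static std::atomic<long> g_sink{0};              // keeps the reads from being optimized away

static void read_loop(const long* buf, size_t words) {
  long sum = 0;
  for (size_t i = 0; i < kReads; ++i)
    sum += buf[(i * 8191) % words];              // strided walk, mostly cache misses
  g_sink.fetch_add(sum, std::memory_order_relaxed);
}

static void run(const char* name, const std::function<void*(size_t)>& alloc) {
  unsigned nthreads = std::thread::hardware_concurrency();
  std::vector<std::thread> threads;
  auto start = std::chrono::steady_clock::now();
  for (unsigned t = 0; t < nthreads; ++t) {
    threads.emplace_back([&alloc] {
      void* mem = alloc(kBufBytes);
      std::memset(mem, 1, kBufBytes);            // first touch: physical pages get placed now
      read_loop(static_cast<long*>(mem), kBufBytes / sizeof(long));
      numa_free(mem, kBufBytes);
    });
  }
  for (auto& th : threads) th.join();
  std::chrono::duration<double, std::milli> ms =
      std::chrono::steady_clock::now() - start;
  std::printf("%-12s %10.0f ms\n", name, ms.count());
}

int main() {
  if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
  run("local",       [](size_t s) { return numa_alloc_local(s); });        // memory on the reader's node
  run("interleaved", [](size_t s) { return numa_alloc_interleaved(s); });  // spread over all nodes
  run("single-node", [](size_t s) { return numa_alloc_onnode(s, 0); });    // everything on node 0
  return 0;
}
```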

Parallel Scavenge heap space
Parallel Scavenge relies on the kernel's lazy first-touch page allocation policy: a physical page is placed on the node of the thread that first touches it.
⇒ The initial sequential phase of the application maps most heap pages on the node of the initial application thread.
And during the whole execution the mapping remains as it is, because the GC keeps reusing the same virtual address space. This is a severe problem for generational GCs. (The sketch below illustrates the first-touch effect.)
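Below is a minimal sketch of the first-touch behavior described above, not HotSpot code: one thread pinned to a node first-touches the pages of a freshly reserved range, and the kernel is then asked where those pages actually live. It assumes Linux, libnuma, and a machine with at least two nodes.

```cpp
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>
#include <cstdio>
#include <thread>

int main() {
  if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
  const size_t page = 4096, npages = 8;
  // "Virtual heap space": reserved here, but no physical pages yet.
  char* heap = static_cast<char*>(mmap(nullptr, npages * page,
      PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  if (heap == MAP_FAILED) return 1;

  // A thread pinned to node 1 (assumes the machine has a node 1) first-touches
  // every page: the kernel backs them with physical memory on node 1.
  std::thread toucher([&] {
    numa_run_on_node(1);
    for (size_t i = 0; i < npages; ++i) heap[i * page] = 1;
  });
  toucher.join();

  // Ask the kernel where each page lives (nodes == nullptr turns
  // move_pages into a query instead of a migration).
  void* pages[npages];
  int   status[npages];
  for (size_t i = 0; i < npages; ++i) pages[i] = heap + i * page;
  move_pages(0, npages, pages, nullptr, status, 0);
  for (size_t i = 0; i < npages; ++i)
    std::printf("page %zu is on node %d\n", i, status[i]);
  return 0;
}
```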

Parallel Scavenge heap space analysis
With the first-touch allocation policy, the heap ends up with bad balance (95% of the pages on a single node) and bad locality.
[Figure: GC throughput (GB/s) vs. GC threads for Parallel Scavenge (PS) on SPECjbb, higher is better.]

NUMA-aware heap layouts
- Parallel Scavenge: first-touch allocation policy; bad balance (95% on a single node) and bad locality.
- Interleaved: round-robin allocation policy; targets balance.
- Fragmented: node-local object allocation and copy; targets locality.

Interleaved heap layout analysis
Round-robin page allocation gives perfect balance, but bad locality: with 8 nodes, 7/8 of the accesses are remote. A sketch of this layout follows the figure note.
[Figure: GC throughput (GB/s) vs. GC threads for PS and Interleaved on SPECjbb, higher is better.]
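A minimal sketch of how such an interleaved heap range can be set up with libnuma, assuming Linux; this is an illustration of the idea, not HotSpot's implementation.

```cpp
#include <numa.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
  if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
  const size_t heap_bytes = 1UL << 30;                       // a 1 GB "heap"
  void* heap = mmap(nullptr, heap_bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (heap == MAP_FAILED) return 1;

  // Round-robin page placement over every node the process may use:
  // physical pages are spread across nodes as they are first touched,
  // whatever thread touches them. Balance is perfect, but any given
  // object is local to only one of the nodes, hence ~7/8 remote accesses
  // on an 8-node machine.
  numa_interleave_memory(heap, heap_bytes, numa_all_nodes_ptr);

  // The GC and the application can now allocate anywhere in this range;
  // page n of the heap ends up on node n % numa_num_configured_nodes().
  return 0;
}
```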

Fragmented heap layout analysis
Node-local object allocation and copy give good balance (though balance becomes bad if a single thread allocates for all the others) and average locality: 7/8 of the scans are remote, but 100% of the copies are local. A sketch of this layout follows the figure note.
[Figure: GC throughput (GB/s) vs. GC threads for PS, Interleaved, and Fragmented on SPECjbb, higher is better.]
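A minimal sketch of the fragmented idea with libnuma, again an illustration rather than HotSpot's implementation, assuming Linux.

```cpp
#include <numa.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
  if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
  const int    nodes      = numa_max_node() + 1;
  const size_t heap_bytes = 1UL << 30;
  const size_t frag_bytes = heap_bytes / nodes;
  char* heap = static_cast<char*>(mmap(nullptr, heap_bytes,
      PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  if (heap == MAP_FAILED) return 1;

  // The heap is cut into one fragment per NUMA node, and fragment n is
  // backed by physical memory on node n.
  for (int n = 0; n < nodes; ++n)
    numa_tonode_memory(heap + n * frag_bytes, frag_bytes, n);

  // A GC thread pinned to node n would bump-allocate its copies from
  // [heap + n*frag_bytes, heap + (n+1)*frag_bytes): 100% local copies,
  // while scanning objects in other fragments remains remote.
  return 0;
}
```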

Synchronization optimizations
Removed a barrier between the GC phases and replaced the GC task-queue with a lock-free one. The synchronization optimizations only have an effect under high contention. A sketch of a lock-free task queue follows the figure note.
[Figure: GC throughput (GB/s) vs. GC threads for PS, Interleaved, Fragmented, and Fragmented + synchro on SPECjbb, higher is better.]
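The slide does not show the lock-free task-queue itself; the following is a minimal sketch in that spirit, a Treiber-style stack based on compare-and-swap, with the GCTask type and its fields invented for the example. It deliberately leaks nodes to sidestep the ABA problem; a real collector would recycle them carefully.

```cpp
#include <atomic>
#include <cstdint>

struct GCTask {           // hypothetical task: e.g. an object range to scan
  uintptr_t from, to;
};

class LockFreeTaskStack {
  struct Node { GCTask task; Node* next; };
  std::atomic<Node*> head_{nullptr};

 public:
  void push(const GCTask& t) {
    Node* n = new Node{t, nullptr};
    n->next = head_.load(std::memory_order_relaxed);
    // Retry until head_ is swung from the value we read to the new node.
    while (!head_.compare_exchange_weak(n->next, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {}
  }

  bool pop(GCTask& out) {
    Node* n = head_.load(std::memory_order_acquire);
    // Retry until we either see an empty stack or detach the current head.
    while (n && !head_.compare_exchange_weak(n, n->next,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {}
    if (!n) return false;   // empty: the GC worker can go steal elsewhere
    out = n->task;
    // Node intentionally leaked: never reusing nodes avoids ABA in this sketch.
    return true;
  }
};
```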

Effect of the optimizations on the application (GC excluded)
Good balance improves application time a lot, whereas locality has only a marginal effect on the application, even though the fragmented space gives the application better locality than the interleaved space (recently allocated objects are the most accessed).
[Figure: application time vs. GC threads for PS and the other heap layouts, XML Transform from SPECjvm, lower is better.]

Overall effect (both GC and application)
The optimizations double the application throughput of SPECjbb, and the pause time is divided in half (from 105 ms to 49 ms).
[Figure: application throughput (ops/ms) vs. GC threads for PS, Interleaved, Fragmented, and Fragmented + synchro on SPECjbb, higher is better.]

The GC scales well with memory-intensive applications
[Figure: GC throughput of PS vs. Fragmented + synchro across several applications, with heap sizes ranging from 512 MB to 3.5 GB.]

Take away
Previous GCs do not scale because they are not NUMA-aware. Existing mature GCs can scale with standard parallel-programming techniques, and NUMA-aware memory layouts should be useful for all GCs (concurrent GCs included).
Most important NUMA effects:
1. Balancing memory accesses.
2. Memory locality, which only helps at high core count.
Thank you.