Garbage Collection for Large Scale Multiprocessors
(Funded by ANR projects: Prose and ConcoRDanT)
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro
Regal-LIP6/INRIA


1 Garbage Collection for Large Scale Multiprocessors
(Funded by ANR projects: Prose and ConcoRDanT)
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro
Regal-LIP6/INRIA

2 Introduction
Why?
– Heavy use of Managed Runtime Environments
   Application servers
   Scientific applications
   Example: JBoss, Sunflow, etc.
– Hardware is more and more multi-resourced.
– GC performance is critical.
– Existing GCs developed for SMPs.
What?
– Assess GC scalability: empirical results.
– Possible factors affecting GC scalability.
– Our approach to fixing them.

3 Contemporary Architecture
[Figure: two nodes (Node 0, Node 1), each showing cores C0, C1, ... C5, L2, L3, and DRAM]
Our machine has 8 such nodes with 6 cores each.
Non Uniform Memory Access (NUMA): Remote access >> Local access

4 GC Scalability (Lusearch)
Pause time increases with GC threads: negative scalability!
[Figure: HotSpot JVM's Garbage Collectors; axis labels: Pause Time, GC Threads, Application Threads, Application Time]

5 Trivial Bottleneck
Scalable synchronization primitives are vital.
GC task queue uses a monitor
– Unnecessarily blocks GC threads.
Replaced with lock-free version.
No barrier for GC threads after GC completion.
Trivial but very important: Up to 80% improvement.
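The monitor-to-lock-free swap can be sketched as follows. This is only an illustration, not HotSpot's actual task-queue code: the class name LockFreeTaskQueue and the drain helper are hypothetical, and java.util.concurrent.ConcurrentLinkedQueue stands in for whatever lock-free structure the authors used. The point it shows is that an idle worker gets null back immediately instead of blocking on a monitor.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a GC task queue (not HotSpot's implementation).
class LockFreeTaskQueue {
    // ConcurrentLinkedQueue is a non-blocking queue: offer/poll never take a lock.
    private final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();

    void add(Runnable task) {
        tasks.offer(task);
    }

    // Returns null when the queue is empty instead of blocking the caller.
    Runnable poll() {
        return tasks.poll();
    }

    // Starts `workers` threads that drain the queue; returns how many tasks ran.
    static int drain(LockFreeTaskQueue queue, int workers) {
        AtomicInteger executed = new AtomicInteger();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                Runnable task;
                while ((task = queue.poll()) != null) {
                    task.run();
                    executed.incrementAndGet();
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return executed.get();
    }

    public static void main(String[] args) {
        LockFreeTaskQueue queue = new LockFreeTaskQueue();
        AtomicInteger sum = new AtomicInteger();
        for (int i = 1; i <= 1000; i++) {
            final int value = i;
            queue.add(() -> sum.addAndGet(value));
        }
        int ran = drain(queue, 4);
        System.out.println(ran + " tasks, sum = " + sum.get());
    }
}
```

Each task is handed to exactly one worker, so the sketch runs all 1000 tasks once regardless of how the four threads interleave.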

6 Main Bottleneck
Remote access and … Remote access!
7 out of 8 accesses are remote
– When scanning an object (87.7% remote)
– When copying an object (82.7% remote)
– When stealing for load balancing (2-4 bus ops/steal)
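The "7 out of 8" figure is what one would expect under a simple assumption that the slide does not state: objects are spread uniformly over the machine's N = 8 nodes, and a GC thread touches them without regard to where they live. Under that assumption the expected remote fraction is:

```latex
P(\text{remote}) = \frac{N - 1}{N} = \frac{8 - 1}{8} = \frac{7}{8} = 87.5\%
```

The measured 87.7% for scanning is close to this uniform estimate, while the 82.7% for copying is somewhat lower.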

7 Our Approach: Big Picture
Improve GC locality
– Local Scan
– Local Copy
– Local Stealing
Tradeoff:
– Locality vs. Load Balance
Fix young generation of ParallelScavenge.

8 Avoid Remote Access
[Figure: From and To spaces split across Node 0 and Node 1; objects a, b, c, d, e, f; GC threads GC0 and GC1; reference queue (Ref. Q) from Node 0 to Node 1]
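One way to read the figure is that each GC thread scans only objects on its own node and hands references to remote objects over to the owning node through a queue. The sketch below simulates that idea in a single thread. Everything in it is an assumption made for illustration: the class LocalScanSketch, the Obj type, the sample object graph, and the simplification of merging each node's local work list and its incoming reference queues into one deque.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LocalScanSketch {
    // Hypothetical heap object: a name, its home NUMA node, and outgoing references.
    static final class Obj {
        final String name;
        final int node;
        final List<String> refs;

        Obj(String name, int node, String... refs) {
            this.name = name;
            this.node = node;
            this.refs = List.of(refs);
        }
    }

    // Simulates one GC thread per node, run round-robin in a single thread.
    // Returns, for each node, the names of the objects that node's thread scanned.
    static List<List<String>> scan(Map<String, Obj> heap, List<String> roots, int nodes) {
        List<Deque<Obj>> work = new ArrayList<>(); // work.get(n): objects node n must scan
        List<List<String>> scanned = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            work.add(new ArrayDeque<>());
            scanned.add(new ArrayList<>());
        }
        for (String root : roots) {
            Obj o = heap.get(root);
            work.get(o.node).add(o);
        }
        Set<String> visited = new HashSet<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int n = 0; n < nodes; n++) {
                Obj o = work.get(n).poll();
                if (o == null) continue;
                progress = true;
                if (!visited.add(o.name)) continue; // already scanned
                scanned.get(n).add(o.name);
                for (String ref : o.refs) {
                    Obj target = heap.get(ref);
                    // The thread never dereferences a remote object itself: it only
                    // enqueues the reference for the node that owns the target.
                    work.get(target.node).add(target);
                }
            }
        }
        return scanned;
    }

    public static void main(String[] args) {
        // Hypothetical graph: a and c live on node 0, b and d on node 1.
        Map<String, Obj> heap = Map.of(
                "a", new Obj("a", 0, "b", "c"),
                "b", new Obj("b", 1, "d"),
                "c", new Obj("c", 0),
                "d", new Obj("d", 1));
        System.out.println(scan(heap, List.of("a"), 2));
    }
}
```

In this sketch every object is scanned by the thread on its home node, at the cost of one queue hand-off per cross-node reference.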

9 Heap Partitioning
Baseline design: space = n MB. Collect when full.
NUMA-aware space: Chunk 0 = n/2 MB (only ¼ full), Chunk 1 = n/2 MB (full).
Problem: Collect more often when even 1 chunk is full.

10 Heap Partitioning: Our Approach
Chunk 0 + Chunk 1 = n MB
Collect when total = n MB
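The difference between the two collection triggers can be sketched as two predicates. The class and method names are hypothetical, and the 64 MB capacity is an arbitrary value chosen to make the arithmetic concrete. The scenario is the one from the slides: two chunks, one only a quarter full and the other full.

```java
class HeapPartitionPolicy {
    // Fixed-size chunks: collect as soon as any single chunk reaches capacity / chunks.
    static boolean collectFixedChunks(long[] usedPerChunk, long capacity) {
        long chunkSize = capacity / usedPerChunk.length;
        for (long used : usedPerChunk) {
            if (used >= chunkSize) return true;
        }
        return false;
    }

    // Collect only when the chunks together reach the full capacity.
    static boolean collectOnTotal(long[] usedPerChunk, long capacity) {
        long total = 0;
        for (long used : usedPerChunk) {
            total += used;
        }
        return total >= capacity;
    }

    public static void main(String[] args) {
        long capacityMb = 64;       // assumed n = 64 MB
        long[] usedMb = {8, 32};    // chunk 0 is 1/4 of 32 MB, chunk 1 is a full 32 MB
        System.out.println(collectFixedChunks(usedMb, capacityMb)); // true
        System.out.println(collectOnTotal(usedMb, capacityMb));     // false
    }
}
```

With these numbers the fixed-chunk policy collects while only 40 of 64 MB are in use, whereas the total-based policy keeps allocating.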

11 Load Balancing
NUMA-aware work stealing
– A thread only steals from local threads on the same node.
What about inter-node imbalance?
– Apps with master-slave design cause this
   Example: h2 database
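The victim-selection rule for NUMA-aware stealing can be sketched as below. The names are hypothetical, and the mapping of threads to nodes (thread t runs on node t / threadsPerNode) is an assumption made for the example, not something the slides specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class NumaAwareStealing {
    // Assumption: GC threads are numbered so that thread t runs on node t / threadsPerNode.
    static int nodeOf(int thread, int threadsPerNode) {
        return thread / threadsPerNode;
    }

    // All threads the thief may steal from: same node, excluding itself.
    static List<Integer> candidates(int thief, int totalThreads, int threadsPerNode) {
        List<Integer> result = new ArrayList<>();
        int node = nodeOf(thief, threadsPerNode);
        for (int t = 0; t < totalThreads; t++) {
            if (t != thief && nodeOf(t, threadsPerNode) == node) {
                result.add(t);
            }
        }
        return result;
    }

    // Picks a random local victim, or -1 if the thief is alone on its node.
    static int pickVictim(int thief, int totalThreads, int threadsPerNode, Random rng) {
        List<Integer> local = candidates(thief, totalThreads, threadsPerNode);
        return local.isEmpty() ? -1 : local.get(rng.nextInt(local.size()));
    }

    public static void main(String[] args) {
        // 8 nodes with 6 cores each, as on the machine described earlier.
        System.out.println(candidates(7, 48, 6)); // [6, 8, 9, 10, 11]
    }
}
```

Because a thread never looks beyond its own node, an idle node cannot pick up work from a busy one, which is the inter-node imbalance the slide raises.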

12 [Figure: From and To spaces split across Node 0 and Node 1; objects a, b, c, d; GC threads GC0 and GC1; reference queue (Ref Q) from Node 0 to Node 1; roots in the master's stack and some slave's stack]

13 Conclusion and Future Work
Remote access hinders the scalability of GC.
Tradeoff: Locality vs. Load Balance
– Inter-node imbalance acts as a hurdle.
Using all the cores is sub-optimal
– Hits the memory wall.
Adaptive resizing of NUMA-aware generation costs more!
Up to 65% on scalable benchmarks of DaCapo.

