Exploring Multi-Threaded Java Application Performance on Multicore Hardware
Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
OOPSLA 2012 presentation, October 24th, 2012
Modern Software & Hardware
- Managed languages
  - Ubiquitous, but added runtime layer
  - Many service threads interact with application
  - JIT compilation, on-stack replacement, collector
  - Stop the application, possibly critical
  - Share hardware resources
- Multicore with multiple sockets
  - How do we schedule threads with constrained resources?
  - Scale core frequency for power
  - Use caches of all sockets, or limit communication
p. 2
Extensive Performance Study
- Multi-threaded Java application on multicore, multi-socket hardware
- Large space to explore
  - Number of threads
  - Thread-to-core/socket mapping
  - Pairing or isolating application and JVM threads
  - Pinning
  - Impact of frequency scaling
  - Difference between startup and steady state
- How do choices with scheduling and hardware resources affect performance?
p. 3
Experimental Machine: Nehalem Scale frequency per socket to or GHz p. 4
Gain Insight on Scheduling
- Application
- Java Virtual Machine
  - Garbage collector
  - Just-in-time compiler with on-stack replacement
- Cao et al. [ISCA 2012] studied JVM amenability to heterogeneity by measuring service threads’ performance per energy
- We study end-to-end performance
p. 5
Roadmap
1. Cost of Isolation
2. Frequency Scaling
3. Pairing Threads
p. 6
Experimental Methodology
- Jikes Research Virtual Machine (Dec 2011)
  - Generational Immix collector
  - 1.5, 2, and 3x minimum heap sizes
- Multithreaded DaCapo benchmarks, 9.12-bach
  - avrora, lusearch (with fix), pmd, sunflow, xalan
  - Also, pseudojbb2005
- Timed 10 invocations
  - Steady state: measure 15th iteration
  - Startup: measure 1st iteration
p. 7
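The startup versus steady-state distinction above can be sketched as a small timing harness. This is a minimal illustration, not the harness used in the study: the class name `IterationTimer`, its methods, and the stand-in workload are all hypothetical, and it assumes that "1st iteration" and "15th iteration" refer to iterations of the benchmark within a single JVM invocation.

```java
import java.util.function.IntConsumer;

final class IterationTimer {
    /**
     * Runs `workload` for `iterations` iterations inside the current JVM and
     * returns the wall-clock time of each iteration in nanoseconds.
     */
    static long[] timeIterations(IntConsumer workload, int iterations) {
        if (iterations < 1) {
            throw new IllegalArgumentException("need at least one iteration");
        }
        long[] nanos = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            workload.accept(i);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Startup: the first iteration, which includes class loading and compilation. */
    static long startup(long[] nanos) {
        return nanos[0];
    }

    /** Steady state: the last iteration, after the JVM has warmed up. */
    static long steadyState(long[] nanos) {
        return nanos[nanos.length - 1];
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for one benchmark iteration.
        IntConsumer workload = i -> {
            long sum = 0;
            for (int k = 0; k < 1_000_000; k++) {
                sum += k ^ i;
            }
            if (sum == 42) {
                System.out.println(sum); // uses the result so the loop is not dead code
            }
        };
        long[] nanos = timeIterations(workload, 15);
        System.out.println("startup (1st iteration): " + startup(nanos) + " ns");
        System.out.println("steady state (15th iteration): " + steadyState(nanos) + " ns");
    }
}
```

Repeating the whole run over several fresh JVM invocations, as the slide's "10 invocations" suggests, would be done outside this class, for example from a shell loop.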
Baseline Setup
[Diagram: Nehalem cores 0-7 across Socket 0 and Socket 1; legend: application threads, JVM service threads (collection, compilation)]
Pin application & collection threads
p. 8
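Pinning is typically expressed to the operating system as a CPU affinity bitmask, with bit i set for core i. The sketch below only computes such masks; it does not pin anything, and it is not the mechanism used in the study, which the slides do not describe. The class name `AffinityMask` is hypothetical, and the core-to-socket layout (cores 0-3 on Socket 0, cores 4-7 on Socket 1) is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class AffinityMask {
    /** Returns a bitmask with bit i set for every core id i in `cores`. */
    static long maskFor(List<Integer> cores) {
        long mask = 0L;
        for (int core : cores) {
            if (core < 0 || core >= 64) {
                throw new IllegalArgumentException("core id out of range: " + core);
            }
            mask |= 1L << core;
        }
        return mask;
    }

    public static void main(String[] args) {
        // Assumed layout, not confirmed by the slides:
        // cores 0-3 on Socket 0, cores 4-7 on Socket 1.
        List<Integer> socket0 = Arrays.asList(0, 1, 2, 3);
        List<Integer> socket1 = Arrays.asList(4, 5, 6, 7);
        System.out.println("socket 0 mask: 0x" + Long.toHexString(maskFor(socket0)));
        System.out.println("socket 1 mask: 0x" + Long.toHexString(maskFor(socket1)));
    }
}
```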
Boosting Socket Frequency
[Diagram: Nehalem cores 0-7 across two sockets, with per-socket frequency labels in GHz]
27-50% improvement in execution time
p. 9
Exploring The Cost of Isolation
[Diagram: Nehalem cores 0-7 across Socket 0 and Socket 1; label: collection threads]
p. 10
Isolating Collection Threads
Isolating the collector does not significantly hurt performance
p. 11
Exploring The Cost of Isolation
[Diagram: Nehalem cores 0-7 across Socket 0 and Socket 1; label: compiler thread]
p. 12
Isolating Compiler Thread at Startup
Isolating the compiler at startup has little impact
p. 13
Isolating On-Stack-Replace at Startup
Isolating OSR at startup improves performance
p. 14
Exploring The Cost of Isolation
[Diagram: Nehalem cores 0-7 across Socket 0 and Socket 1; label: all JVM service threads]
p. 15
Isolating All JVM Threads
Isolating service threads significantly hurts only one benchmark
p. 16
Exploring Frequency Scaling
[Diagram: Nehalem cores 0-7 across Socket 0 and Socket 1]
Baseline: JVM service threads isolated, all cores at highest frequency
p. 17
Exploring Frequency Scaling
[Diagrams: two configurations of Nehalem cores 0-7, compared side by side]
Lower frequency of application threads versus lower frequency of JVM service threads
p. 18
Lower Frequency: Collector vs App
Lowering the collector's frequency affects performance 5x less than lowering the application's
p. 19
Lower Freq at Startup: Compiler vs App
Lowering the compiler's frequency is not detrimental compared to lowering the application's
p. 20
Lower Frequency: JVM vs App
Lowering the JVM's frequency affects performance 5x less than lowering the application's
p. 21
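One way to read a "5x less" comparison is as a ratio of relative slowdowns against the same baseline. The sketch below assumes that interpretation; the slides do not give the exact formula. The class name `FrequencySensitivity` and the execution times in `main` are hypothetical values chosen for illustration, not measurements from the study.

```java
final class FrequencySensitivity {
    /** Fractional slowdown relative to the baseline (0.25 means 25% slower). */
    static double slowdown(double baselineTime, double scaledTime) {
        if (baselineTime <= 0) {
            throw new IllegalArgumentException("baseline time must be positive");
        }
        return (scaledTime - baselineTime) / baselineTime;
    }

    /**
     * How many times larger the slowdown from lowering application frequency is
     * than the slowdown from lowering JVM service thread frequency.
     * Returns infinity if lowering the JVM frequency causes no slowdown at all.
     */
    static double sensitivityRatio(double baselineTime, double appLoweredTime, double jvmLoweredTime) {
        return slowdown(baselineTime, appLoweredTime) / slowdown(baselineTime, jvmLoweredTime);
    }

    public static void main(String[] args) {
        // Hypothetical execution times in seconds, for illustration only.
        double baseline = 100.0;
        double appLowered = 150.0;
        double jvmLowered = 110.0;
        System.out.println("application slowdown: " + slowdown(baseline, appLowered));
        System.out.println("JVM slowdown: " + slowdown(baseline, jvmLowered));
        System.out.println("ratio: " + sensitivityRatio(baseline, appLowered, jvmLowered));
    }
}
```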
Exploring Pairing Threads
[Diagram: Nehalem cores 0-7 across Socket 0 and Socket 1]
Pair application and collection threads
p. 22
Pairing App & Collector, 2 Sockets
For all benchmarks except avrora, pairing application and collector threads performs best
p. 23
Overall Performance Comparison
Either use 1 socket, or isolate the compiler thread
p. 24
Conclusions: Scheduling Insights
- 1 socket: # application threads = # collection threads
- 2 sockets: isolate compilation thread
- Pair application and collection threads
- Set # application threads = # cores, fewer collection threads
- Increasing application frequency is more important than increasing JVM service thread frequency
- Analyzed Java performance given hardware resources
p. 25
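The scheduling insights above can be encoded as a simple configuration check. This is a sketch under stated assumptions, not a rule set published by the authors: the class name `SchedulingGuidelines` and its parameters are hypothetical, the pairing and thread-count guidelines are assumed to apply to the 2-socket case, and `cores` is assumed to mean the number of cores available to application threads.

```java
final class SchedulingGuidelines {
    /**
     * Returns true if a configuration matches the guidelines as interpreted here.
     * Assumption: for 2 sockets, all three of isolation, pairing, and the
     * thread-count rule are required together.
     */
    static boolean followsGuidelines(int sockets, int cores, int appThreads, int gcThreads,
                                     boolean compilerIsolated, boolean appAndGcPaired) {
        if (sockets < 1 || cores < 1 || appThreads < 1 || gcThreads < 1) {
            throw new IllegalArgumentException("all counts must be at least 1");
        }
        if (sockets == 1) {
            // 1 socket: as many collection threads as application threads.
            return appThreads == gcThreads;
        }
        // 2 sockets: isolate the compiler, pair application and collection threads,
        // one application thread per core, and fewer collection threads.
        return compilerIsolated
                && appAndGcPaired
                && appThreads == cores
                && gcThreads < appThreads;
    }

    public static void main(String[] args) {
        System.out.println(followsGuidelines(1, 4, 4, 4, false, false));
        System.out.println(followsGuidelines(2, 4, 4, 2, true, true));
        System.out.println(followsGuidelines(2, 4, 4, 4, true, true));
    }
}
```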