Exploring Multi-Threaded Java Application Performance on Multicore Hardware. Ghent University, Belgium. OOPSLA 2012 presentation – October 24th, 2012. Jennifer.

Presentation transcript:

Exploring Multi-Threaded Java Application Performance on Multicore Hardware
Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
OOPSLA 2012 presentation – October 24th, 2012

Modern Software & Hardware
- Managed languages: ubiquitous, but with an added runtime layer
- Many service threads interact with the application
  - JIT compilation, on-stack replacement, collector
  - Stop the application, possibly critical
  - Share hardware resources
- Multicore with multiple sockets
- How do we schedule threads with constrained resources?
  - Scale core frequency for power
  - Use caches of all sockets, or limit communication

Extensive Performance Study
- Multi-threaded Java application on multicore, multi-socket hardware
- Large space to explore
  - Number of threads
  - Thread-to-core/socket mapping
  - Pairing or isolating application and JVM threads
  - Pinning
  - Impact of frequency scaling
  - Difference between startup and steady state
- How do choices with scheduling and hardware resources affect performance?
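To give a feel for how quickly this space grows, here is a minimal sketch that enumerates it as a cross product. The dimension names follow the slide, but the specific option values (thread counts, labels such as `one_socket`) are hypothetical placeholders, not the settings used in the study.

```python
from itertools import product

# Hypothetical option values: the slide names the dimensions,
# but not these exact settings.
DIMENSIONS = {
    "threads": [2, 4],
    "mapping": ["one_socket", "two_sockets"],
    "service_placement": ["paired", "isolated"],
    "pinned": [True, False],
    "frequency": ["high", "low"],
    "phase": ["startup", "steady_state"],
}

def configurations(dimensions):
    """Yield one dict per point in the cross product of all dimensions."""
    names = list(dimensions)
    for values in product(*(dimensions[name] for name in names)):
        yield dict(zip(names, values))

configs = list(configurations(DIMENSIONS))
print(len(configs), "configurations, e.g.", configs[0])
```

Even with only two values per dimension, six dimensions already produce 64 experiments per benchmark, which is why the study has to be selective about which combinations it reports.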

Experimental Machine: Nehalem
Scale frequency per socket to [...] or [...] GHz

Gain Insight on Scheduling
- Application
- Java Virtual Machine
  - Garbage collector
  - Just-in-time compiler with on-stack replacement
- Cao et al. [ISCA 2012] studied JVM amenability to heterogeneity by measuring service threads' performance per energy
- We study end-to-end performance

Roadmap
1. Cost of Isolation
2. Frequency Scaling
3. Pairing Threads
[Diagrams: Socket 0 and Socket 1 shown for each roadmap item]

Experimental Methodology
- Jikes Research Virtual Machine (Dec 2011)
- Generational Immix collector
- 1.5x, 2x, and 3x minimum heap sizes
- Multithreaded DaCapo 9.12-bach benchmarks: avrora, lusearch (with fix), pmd, sunflow, xalan
- Also pseudojbb2005
- Timed 10 invocations
  - Steady state: measure the 15th iteration
  - Startup: measure the 1st iteration
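The startup versus steady-state measurement can be sketched as a small timing harness. This is only an illustration under stated assumptions: the real study launches a fresh JVM per invocation, whereas this sketch repeats an arbitrary in-process callable, and averaging over invocations is an assumption (the slide only says that 10 invocations were timed). The names `time_iterations`, `startup_and_steady`, and `measure` are hypothetical.

```python
import time

def time_iterations(workload, iterations=15):
    """Run `workload` back to back; return per-iteration wall-clock seconds."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return times

def startup_and_steady(times):
    """Startup is the 1st iteration; steady state is the last (15th by default)."""
    return times[0], times[-1]

def measure(workload, invocations=10, iterations=15):
    """Repeat the whole experiment and average each phase (assumed aggregation)."""
    startup, steady = [], []
    for _ in range(invocations):
        first, last = startup_and_steady(time_iterations(workload, iterations))
        startup.append(first)
        steady.append(last)
    return sum(startup) / len(startup), sum(steady) / len(steady)

# Stand-in workload; a real run would execute one benchmark iteration here.
startup_s, steady_s = measure(lambda: sum(range(100_000)), invocations=2, iterations=3)
print(f"startup: {startup_s:.6f} s, steady state: {steady_s:.6f} s")
```

Reporting the two phases separately matters because the first iteration includes class loading and compilation work that has largely disappeared by the last one.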

Baseline Setup
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1; legend: application threads, JVM service threads (collection, compilation)]
Pin application & collection threads

Boosting Socket Frequency
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1, labeled with socket frequencies in GHz]
27-50% improvement in execution time

Exploring the Cost of Isolation
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1; label: collection threads]

Isolating Collection Threads
Isolating the collector does not significantly hurt performance.

Exploring the Cost of Isolation
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1; label: compiler thread]

Isolating Compiler Thread at Startup
Isolating the compiler at startup has little impact.

Isolating On-Stack-Replace at Startup
Isolating OSR at startup improves performance.

Exploring the Cost of Isolation
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1; label: all JVM service threads]

Isolating All JVM Threads
Isolating the service threads significantly hurts only one benchmark.

Exploring Frequency Scaling
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1]
Baseline: JVM service threads isolated, all cores at highest frequency

Exploring Frequency Scaling
[Diagram: two configurations of Nehalem Cores 0-7, side by side]
Lower frequency of application threads versus lower frequency of JVM service threads

Lower Frequency: Collector vs App
Lowering the collector's frequency affects performance 5x less than lowering the application's.

Lower Freq at Startup: Compiler vs App
Lowering the compiler's frequency is not detrimental, compared to lowering the application's.

Lower Frequency: JVM vs App
Lowering the JVM's frequency affects performance 5x less than lowering the application's.
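A comparison such as "5x less" can be read as a ratio of slowdowns relative to the all-cores-at-highest-frequency baseline. The sketch below shows that arithmetic; the execution times are made-up numbers chosen to yield a ratio of 5, not measurements from the study, and the function names are hypothetical.

```python
def slowdown(baseline_seconds, scaled_seconds):
    """Fractional increase in execution time relative to the baseline."""
    return (scaled_seconds - baseline_seconds) / baseline_seconds

def relative_sensitivity(baseline, app_scaled, jvm_scaled):
    """How many times more the run slows when the application's socket is
    scaled down than when the JVM service threads' socket is scaled down."""
    jvm_slowdown = slowdown(baseline, jvm_scaled)
    if jvm_slowdown == 0:
        raise ValueError("JVM-side scaling produced no slowdown; ratio undefined")
    return slowdown(baseline, app_scaled) / jvm_slowdown

# Hypothetical execution times in seconds, for illustration only.
baseline = 10.0      # all cores at highest frequency
app_lowered = 15.0   # application socket at lower frequency (50% slower)
jvm_lowered = 11.0   # service-thread socket at lower frequency (10% slower)

ratio = relative_sensitivity(baseline, app_lowered, jvm_lowered)
print(f"application is {ratio:.1f}x more sensitive to frequency")
```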

Exploring Pairing Threads
[Diagram: Nehalem Cores 0-7 on Socket 0 and Socket 1]
Pair application and collection threads

Pairing App & Collector, 2 Sockets
For all benchmarks but avrora, pairing application and collector threads performs best.
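The difference between isolating and pairing can be made concrete as two thread-to-core plans. This is a sketch under assumptions: the layout (cores 0-3 on socket 0, cores 4-7 on socket 1), the thread names, and the round-robin assignment are all hypothetical, and the code only computes a plan. Actually applying one requires pinning support in the VM or operating system, which is not shown.

```python
# Assumed layout: two sockets with four cores each.
SOCKETS = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def isolate(n_app, n_gc):
    """Application threads on socket 0, collector threads on socket 1."""
    plan = {}
    for i in range(n_app):
        plan[f"app-{i}"] = SOCKETS[0][i % len(SOCKETS[0])]
    for i in range(n_gc):
        plan[f"gc-{i}"] = SOCKETS[1][i % len(SOCKETS[1])]
    return plan

def pair(n_pairs):
    """Application thread i and collector thread i share one core,
    with pairs spread over the cores of both sockets."""
    cores = SOCKETS[0] + SOCKETS[1]
    plan = {}
    for i in range(n_pairs):
        core = cores[i % len(cores)]
        plan[f"app-{i}"] = core
        plan[f"gc-{i}"] = core
    return plan

isolated_plan = isolate(n_app=4, n_gc=4)
paired_plan = pair(n_pairs=8)
print("isolated:", isolated_plan)
print("paired:  ", paired_plan)
```

In the paired plan a collector thread runs on the same core, and therefore behind the same caches, as the application thread it is matched with; in the isolated plan the two kinds of threads never share a socket.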

Overall Performance Comparison
Either use 1 socket, or isolate the compiler thread.

Conclusions: Scheduling Insights
- 1 socket: # application threads = # collection threads
- 2 sockets: isolate the compilation thread
- Pair application and collection threads
- Set # application threads = # cores, with fewer collection threads
- Increasing application frequency is more important than increasing the frequency of JVM service threads
- Analyzed Java performance given hardware resources