Slide 1: Microarchitectural Characterization of Production JVMs and Java Workload (work in progress)
Jungwoo Ha (UT Austin), Magnus Gustafsson (Uppsala Univ.), Stephen M. Blackburn (Australian Nat'l Univ.), Kathryn S. McKinley (UT Austin)

Slide 2: Challenges of JVM Performance Analysis
- Controlling nondeterminism
  - Just-In-Time compilation driven by nondeterministic sampling
  - Garbage collectors
  - Other helper threads
- Production JVMs are not created equal
  - Thread model (kernel vs. user threads)
  - Type of helper threads
- We need a solid measurement methodology that isolates each JVM part

Slide 3: Forest and Trees
- Which performance metrics explain performance differences and bottlenecks?
  - Cache misses? L1 or L2?
  - TLB misses?
  - Number of instructions?
- Inspecting one or two metrics is not always enough
- Hardware performance counters expose only a small number of counters at a time, so multiple invocations per measurement are inevitable (a small PAPI sketch below illustrates this)
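
To make the constraint concrete, here is a minimal sketch using the classic PAPI high-level C API of that era (illustrative only, not the instrumentation used in the study; run_benchmark_iteration is a hypothetical placeholder for one benchmark iteration). With only two hardware counters, a single run can record at most two events.

    /* Sketch: one measured run records only two hardware events at a time. */
    #include <stdio.h>
    #include <papi.h>

    extern void run_benchmark_iteration(void);         /* hypothetical workload hook */

    int main(void) {
        int events[2] = { PAPI_L1_ICM, PAPI_L1_DCM };  /* this run's pair of metrics */
        long long values[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;                                  /* PAPI failed to initialize */

        PAPI_start_counters(events, 2);                /* only 2 counters on the Pentium-M */
        run_benchmark_iteration();                     /* the workload being measured */
        PAPI_stop_counters(values, 2);                 /* read and stop the counters */

        printf("L1 I-miss: %lld  L1 D-miss: %lld\n", values[0], values[1]);
        return 0;
    }

Covering 40 metrics at 2 counters per run therefore takes roughly 20 measured runs per benchmark and JVM, which is why the methodology on slide 10 cycles the measured metric across iterations.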

Slide 4: Case Study: jython - application performance (cycles) [chart]

Slide 5: Case Study: jython - L1 instruction cache misses per cycle [chart]

Slide 6: Case Study: jython - L1 data cache misses per cycle [chart]

Slide 7: Case Study: jython - total instructions executed (retired) [chart]

Slide 8: Case Study: jython - L2 data cache misses per cycle [chart]

Slide 9: Project Status
- Established a methodology to characterize application code performance
  - Large number of metrics (40+) measured from hardware performance counters
  - Apples-to-apples comparison of JVMs using standard interfaces (JVMTI, JNI)
- Simulator data for detailed analysis
  - Limit studies, e.g., what if the L1 cache had no misses? (a rough first-order estimate is sketched below)
  - Richer performance metrics, e.g., micro-operation mix
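
The perfect-L1 question is answered with the simulator; as a sanity check, a crude first-order estimate can also be formed from the counters alone. This back-of-envelope model is our illustration, not the paper's method, and every number below is hypothetical, chosen only to show the arithmetic:

    perfect-L1 cycles ≈ measured cycles - (L1 misses × per-miss penalty)
    e.g. 1.0e9 cycles - (5e6 misses × 10 cycles/miss) ≈ 0.95e9 cycles

Such an estimate ignores overlap between misses and other work, which is exactly why simulated perfect-cache runs (slides 24 and 25) are the more trustworthy limit study.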

Slide 10: Performance Counter Methodology
[Timeline: the JVM is invoked y times; within one invocation, iterations 1 through x warm up the JVM, at iteration x+1 the JIT is stopped and a full-heap GC is forced, and iterations x+2 through x+2+(n/p)k are the measured runs, changing the measured metric each iteration]
- Collecting n metrics
  - x warmup iterations (x = 10)
  - p performance counters (at most p metrics can be measured per iteration)
  - n/p iterations needed for measurement
  - k redundant measurements for statistical validation (k = 1)
- The workload must be held constant across the multiple measurements (the arithmetic is worked out below)
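
Plugging in the constants from this slide (a sketch of the arithmetic; the single extra iteration for stopping the JIT and forcing the full-heap GC is inferred from the timeline above):

    n = 40 metrics, p = 2 counters, k = 1, x = 10 warmup iterations
    measured iterations = (n/p) × k = (40/2) × 1 = 20
    iterations per JVM invocation ≈ x + 1 + 20 = 31

With multiplexing (18 logical counters, per slide 13) the measured phase would shrink to about ceil(40/18) = 3 iterations, at the cost of counting precision.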

Slide 11: Performance Counter Methodology (continued)
- Stop-the-world garbage collector
  - No concurrent marking
- One perfctr instance per pthread
  - JVM internal threads run in different pthreads from the application
- JVMTI callbacks (a sketch of such an agent follows this slide)
  - Thread start: start counter
  - Thread finish: stop counter
  - GC start: pause counter, only for user-level threads
  - GC stop: resume counter, only for user-level threads
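
Below is a minimal sketch of how a JVMTI agent could wire up these callbacks (in C, the language JVMTI agents are written in). It is an illustration under stated assumptions, not the authors' agent: the my_counter_* functions are hypothetical placeholders for the per-pthread perfctr start/stop/pause/resume operations.

    #include <jvmti.h>

    /* Hypothetical perfctr hooks (placeholders, not a real API). */
    extern void my_counter_start(void), my_counter_stop(void);
    extern void my_counter_pause(void), my_counter_resume(void);

    static void JNICALL on_thread_start(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread) {
        my_counter_start();   /* bind a counter instance to this pthread */
    }
    static void JNICALL on_thread_end(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread) {
        my_counter_stop();    /* read and stop this thread's counters */
    }
    static void JNICALL on_gc_start(jvmtiEnv *jvmti) {
        my_counter_pause();   /* needed only when GC shares the application's pthread */
    }
    static void JNICALL on_gc_finish(jvmtiEnv *jvmti) {
        my_counter_resume();  /* resume counting after collection */
    }

    JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
        jvmtiEnv *jvmti;
        jvmtiCapabilities caps = {0};
        jvmtiEventCallbacks callbacks = {0};

        (*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_0);

        caps.can_generate_garbage_collection_events = 1;   /* required for GC events */
        (*jvmti)->AddCapabilities(jvmti, &caps);

        callbacks.ThreadStart = on_thread_start;
        callbacks.ThreadEnd = on_thread_end;
        callbacks.GarbageCollectionStart = on_gc_start;
        callbacks.GarbageCollectionFinish = on_gc_finish;
        (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(callbacks));

        (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_THREAD_START, NULL);
        (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_THREAD_END, NULL);
        (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_GARBAGE_COLLECTION_START, NULL);
        (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_GARBAGE_COLLECTION_FINISH, NULL);
        return JNI_OK;
    }

Such an agent is loaded with -agentpath: on the JVM command line; the thread events give per-thread counter scoping, and the GC events bracket collector work so that it can be excluded under user-level thread models.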

Slide 12: Methodology Limitations
- Cannot factor out memory barrier overhead
  - Mitigation: use the garbage collector with the least application overhead
- If a helper thread runs in the same pthread as the application (user-level threading), it will perturb the counts
  - No evidence of this in J9, HotSpot, or JRockit
- Instrumented code overhead
  - Must be included in the measurement

Slide 13: Experiment
- Performance counter experiment
  - Pentium-M uniprocessor
  - 32 KB 8-way L1 caches (data and instruction)
  - 2 MB 4-way L2 cache
  - 2 hardware counters (18 if multiplexed)
  - 1 GB memory
  - 32-bit Linux with the perfctr patch
  - PAPI library
- Simulator experiment
  - PTLsim (x86 simulator, http://...)
  - 64-bit AMD Athlon

Slide 14: Experiment
- 3 production JVMs × 2 versions
  - IBM J9, Sun HotSpot, JRockit (perfctr only)
  - Versions 1.5 and 1.6
- Heap size = max(16 MB, 4 × minimum heap size) (worked example below)
- 18 benchmarks
  - 9 DaCapo benchmarks
  - 8 SPECjvm98 benchmarks
  - 1 PseudoJBB
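
For example (the 30 MB figure is hypothetical, chosen only to illustrate the sizing rule): a benchmark whose minimum heap size is 30 MB runs with a heap of max(16 MB, 4 × 30 MB) = 120 MB, so the 16 MB floor only matters for benchmarks whose minimum heap is below 4 MB.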

Slide 15: Experiment (continued)
- 40+ metrics
  - 40 distinct metrics from the performance counters
    - L1 and L2 cache misses (instruction, data, read, write)
    - I-TLB misses
    - Branch predictions
    - Resource stalls
  - Richer metrics from the simulator
    - Micro-operation mix
    - Load to store

Slide 16: Performance Counter Results (Cycle Counts)
- Benchmarks shown: PseudoJBB, pmd, jython, jess [charts]

Slide 17: Performance Counter Results (Cycle Counts)
- Benchmarks shown: jack, hsqldb, compress, db [charts]

Slide 18: Performance Counter Results
- IBM J9 1.6 performed better than Sun HotSpot 1.6 on average
- JRockit shows the most variation in performance
- Full results
  - ~800 graphs
  - Full jython results in the paper
  - Or search for the author's name (Jungwoo Ha) online

Slide 19: Future Work
- JVM activity characterization
  - Garbage collector
  - JIT compiler
- Statistical analysis of performance metrics
  - Metric correlation
- Methodology to identify performance bottlenecks
- Multicore performance analysis

Slide 20: Conclusions
- A methodology for comparing production JVMs
- Performance evaluation data
- Simulator results for deeper analysis

Slide 21: Thank you!


Slide 23: Simulation Results

Slide 24: Perfect Cache - compress [chart]

Slide 25: Perfect Cache - db [chart]