
1 CS380 C lecture 20
Last time
–Linear scan register allocation
–Classic compilation techniques
–On to a modern context
Today
–Jenn Sartor
–Experimental evaluation for managed languages with JIT compilation and garbage collection

2 Wake Up and Smell the Coffee: Performance Analysis Methodologies for the 21st Century
Kathryn S. McKinley
Department of Computer Sciences, The University of Texas at Austin

3 Shocking News!
In 2000, Java overtook C and C++ as the most popular programming language [TIOBE]

4 Systems Research in Industry and Academia
ISCA papers use C and/or C++
–5 papers are orthogonal to the programming language
–2 papers use specialized programming languages
–2 papers use Java and C from SPEC
–1 paper uses only Java from SPEC

5 What is Experimental Computer Science?

6
An idea
An implementation in some system
An evaluation

7 The success of most systems innovation hinges on evaluation methodologies.
1. Benchmarks reflect current and, ideally, future reality
2. Experimental design is appropriate
3. Statistical data analysis

8 The success of most systems innovation hinges on experimental methodologies.
1. Benchmarks reflect current and, ideally, future reality [DaCapo Benchmarks 2006]
2. Experimental design is appropriate
3. Statistical data analysis [Georges et al. 2006]
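A minimal sketch of the statistical treatment being advocated here: report a mean with a confidence interval over repeated runs instead of a single best number. The timing values are made up, and the 1.96 multiplier assumes enough runs for a normal approximation; this is an illustration, not the methodology of any particular paper.

import java.util.Arrays;

// Sketch: summarize repeated measurements as mean +/- 95% confidence interval,
// instead of reporting a single "best" run.
public class RunStats {
    // times in milliseconds, one entry per benchmark execution (made-up values)
    static double[] times = {612.0, 598.4, 605.1, 601.7, 617.3, 599.8};

    public static void main(String[] args) {
        double mean = Arrays.stream(times).average().orElse(Double.NaN);
        double var = Arrays.stream(times)
                           .map(t -> (t - mean) * (t - mean))
                           .sum() / (times.length - 1);       // sample variance
        double stderr = Math.sqrt(var / times.length);        // standard error of the mean
        double half = 1.96 * stderr;   // 95% interval; 1.96 assumes enough runs for a normal approx.
        System.out.printf("%.1f ms +/- %.1f ms (95%% CI, n=%d)%n",
                          mean, half, times.length);
    }
}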

9 Experimental Design
We’re not in Kansas anymore!
–JIT compilation, GC, dynamic checks, etc.
Methodology has not adapted
–Needs to be updated and institutionalized
“…this sophistication provides a significant challenge to understanding complete system performance, not found in traditional languages such as C or C++” [Hauswirth et al., OOPSLA ’04]

10 Experimental Design
Comprehensive comparison
–3 state-of-the-art JVMs
–Best of 5 executions
–19 benchmarks
–Platform: 2GHz Pentium-M, 1GB RAM, Linux

11 Experimental Design

12 Experimental Design

13 Experimental Design

14 Experimental Design (results for first, second, and third iterations)

15 Experimental Design: Another Experiment
Compare two garbage collectors
–Semispace Full Heap Garbage Collector
–Marksweep Full Heap Garbage Collector

16 Experimental Design: Another Experiment
Compare two garbage collectors
–Semispace Full Heap Garbage Collector
–Marksweep Full Heap Garbage Collector
Experimental design
–Same JVM, same compiler settings
–Second iteration for both
–Best of 5 executions
–One benchmark: SPEC 209_db
–Platform: 2GHz Pentium-M, 1GB RAM, Linux
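A minimal sketch of what “best of 5 executions” can look like when each run gets a fresh JVM. The java command line and the -X:gc=... flags below are placeholders, not real options of any particular VM; substitute whatever mechanism your VM uses to select the Semispace and Marksweep collectors and to launch the benchmark.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: "best of 5" wall-clock time for two collector configurations,
// each run in a fresh JVM so one execution cannot warm up the next.
public class BestOfFive {
    static long timeOneRun(String gcFlag) throws Exception {
        long start = System.nanoTime();
        Process p = new ProcessBuilder(
                "java", gcFlag, "-jar", "benchmark.jar")   // placeholder command line
                .inheritIO().start();
        p.waitFor();
        return (System.nanoTime() - start) / 1_000_000;    // milliseconds
    }

    static long bestOf(int n, String gcFlag) throws Exception {
        List<Long> times = new ArrayList<>();
        for (int i = 0; i < n; i++) times.add(timeOneRun(gcFlag));
        return Collections.min(times);                      // "best of n"
    }

    public static void main(String[] args) throws Exception {
        System.out.println("semispace best: " + bestOf(5, "-X:gc=SemiSpace") + " ms");  // placeholder flag
        System.out.println("marksweep best: " + bestOf(5, "-X:gc=MarkSweep") + " ms");  // placeholder flag
    }
}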

17 Marksweep vs Semispace

18 Marksweep vs Semispace

19 Marksweep vs Semispace (curves for Semispace and Marksweep)

20 Experimental Design

21 Experimental Design: Best Practices
–Measuring JVM innovations
–Measuring JIT innovations
–Measuring GC innovations
–Measuring architecture innovations

22 JVM Innovation Best Practices
Examples:
–Thread scheduling
–Performance monitoring
Workload triggers differences
–Real workloads & perhaps microbenchmarks
–e.g., force frequency of thread switching
Measure & report multiple iterations
–Start-up
–Steady state (aka server mode)
–Never configure the VM to use completely unoptimized code!
Use a modest or multiple heap sizes computed as a function of the maximum live size of the application
Use & report multiple architectures
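A tiny illustration of sizing heaps relative to the application’s maximum live size rather than in absolute megabytes; the live size and multipliers are made-up values, not a prescription from the talk.

// Sketch: derive the heap sizes to test from the application's maximum live size,
// rather than picking absolute numbers. Live size and multipliers are illustrative.
public class HeapSizes {
    public static void main(String[] args) {
        double maxLiveMB = 38.0;                           // measured for the benchmark (made up here)
        double[] multipliers = {1.5, 2.0, 3.0, 4.0, 6.0};  // modest through generous heaps
        for (double m : multipliers) {
            System.out.printf("heap = %.1fx live = %.0f MB%n", m, m * maxLiveMB);
        }
    }
}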

23 Best Practices (results on Pentium M, AMD Athlon, and SPARC)

24 JIT Innovation Best Practices
Example: new compiler optimization
–Code quality: Does it improve the application code?
–Compile time: How much compile time does it add?
–Total time: compiler and application time together
–Problem: adaptive compilation responds to compilation load
–Question: How do we tease all these effects apart?

25 JIT Innovation Best Practices
Teasing apart compile time and code quality requires multiple experiments
Total time: mix methodology
–Run the adaptive system as intended; result: mixture of optimized and unoptimized code
–First & second iterations (that include compile time)
–Set and/or report the heap size as a function of the maximum live size of the application
–Report the average and show statistical error
Code quality
–OK: run iterations until performance stabilizes on “best”, or
–Better: run several iterations of the benchmark, turn off the compiler, and measure a run guaranteed to have no compilation
–Best: replay mix compilation
Compile time
–Requires the compiler to be deterministic
–Replay mix compilation
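One way to realize the mix methodology inside a harness: time each in-process iteration and report the first (which includes most of the JIT work) separately from later, mostly compiled iterations. The Workload interface and the toy loop are stand-ins for whatever iteration entry point a real benchmark harness provides.

// Sketch: separate start-up (first iteration, includes most JIT work) from
// steady state (later iterations). `Workload` is a hypothetical stand-in for
// a benchmark's iteration hook.
interface Workload { void iterate(); }

public class MixTiming {
    static void measure(Workload w, int iterations) {
        for (int i = 1; i <= iterations; i++) {
            long start = System.nanoTime();
            w.iterate();
            long ms = (System.nanoTime() - start) / 1_000_000;
            String phase = (i == 1) ? "start-up" : "steady state";
            System.out.printf("iteration %d (%s): %d ms%n", i, phase, ms);
        }
    }

    public static void main(String[] args) {
        measure(() -> {                        // toy workload standing in for a real benchmark
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) sum += i;
            if (sum == 42) System.out.println();   // keep the loop live
        }, 5);
    }
}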

26 Replay Compilation
Force the JIT to produce a deterministic result
Make a compilation profiler & replayer
Profiler
–Profile first or later iterations with the adaptive JIT; pick best or average
–Record profiling information used in compilation decisions, e.g., dynamic profiles of edges, paths, &/or the dynamic call graph
–Record compilation decisions, e.g., compile method bar at level two, inline method foo into bar
–Mix of optimized and unoptimized, or all optimized/unoptimized
Replayer
–Reads in the profile
–As the system loads each class, apply profile +/- innovation
Result
–Controlled experiments with deterministic compiler behavior
–Reduces statistical variance in measurements
Still not a perfect methodology for inlining
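A minimal sketch of the profiler/replayer split described above: the profiling run records each method’s final optimization level, and the replay run looks that decision up instead of re-deciding. The one-line-per-method “signature level” file format and the levelFor hook are invented for illustration; real systems (e.g., the advice files used in some research JVMs) define their own formats and record much richer information such as inlining decisions and edge profiles.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Sketch of the replay idea: a profiling run records compilation decisions,
// a replay run applies them deterministically. Format and hooks are hypothetical.
public class ReplayProfile {
    private final Map<String, Integer> optLevel = new HashMap<>();

    // Profiler side: record a decision the adaptive system just made.
    public void record(String methodSignature, int level) {
        optLevel.put(methodSignature, level);
    }

    public void save(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        optLevel.forEach((m, l) -> sb.append(m).append(' ').append(l).append('\n'));
        Files.writeString(file, sb.toString());
    }

    // Replayer side: load the profile and answer "how should this method be compiled?"
    public static ReplayProfile load(Path file) throws IOException {
        ReplayProfile p = new ReplayProfile();
        for (String line : Files.readAllLines(file)) {
            int split = line.lastIndexOf(' ');
            p.optLevel.put(line.substring(0, split),
                           Integer.parseInt(line.substring(split + 1)));
        }
        return p;
    }

    // -1 means "not in the profile: leave it baseline-compiled / unoptimized".
    public int levelFor(String methodSignature) {
        return optLevel.getOrDefault(methodSignature, -1);
    }
}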

27 GC Innovation Best Practices
Requires more than one experiment…
Use & report a range of fixed heap sizes
–Explore the space-time tradeoff
–Measure heap size with respect to the maximum live size of the application
–VMs should report total memory, not just application memory: different GC algorithms vary in the meta-data they require, and the JIT and VM use memory too
Measure time with a constant workload
–Do not measure throughput
Best: run two experiments
–Mix with adaptive methodology: what users are likely to see in practice
–Replay: hold the compiler activity constant; choose a profile with “best” application performance in order to keep from hiding mutator overheads in bad code
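The “report total memory, not just application memory” point can be made concrete with the standard java.lang.management API, which already distinguishes heap usage from non-heap usage (JIT-compiled code, VM metadata). This is only one easy data point, not a full accounting of VM memory.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: report heap AND non-heap (JIT code, VM metadata) memory, since GC
// algorithms differ in the meta-data they need and the JIT also consumes space.
public class MemoryReport {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        System.out.printf("heap used:     %d MB%n", heap.getUsed() >> 20);
        System.out.printf("non-heap used: %d MB%n", nonHeap.getUsed() >> 20);
        System.out.printf("total used:    %d MB%n",
                          (heap.getUsed() + nonHeap.getUsed()) >> 20);
    }
}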

28 Architecture Innovation Best Practices
Requires more than one experiment…
Use more than one VM
Set a modest heap size and/or report heap size as a function of maximum live size
Use a mixture of optimized and uncompiled code
–The simulator needs the “same” code in many cases to perform comparisons
Best for microarchitecture-only changes:
–Multiple traces from a live system with the adaptive methodology
–Start-up and steady state with the compiler turned off: what users are likely to see in practice
Won’t work if the architecture change requires recompilation, e.g., a new sampling mechanism
–Use replay to make the code as similar as possible

29 There are lies, damn lies, and statistics [Disraeli] … and benchmarks
Quotes from recent research papers:
“sometimes more than twice as fast”
“our …. is better or almost as good as …. across the board”
“garbage collection degrades performance by 70%”
“speedups of 1.2x to 6.4x on a variety of benchmarks”
“our prototype has usable performance”
“the overhead …. is on average negligible”
“…demonstrating high efficiency and scalability”
“our algorithm is highly efficient”
“can reduce garbage collection time by 50% to 75%”
“speedups…. are very significant (up to 54-fold)”
“speed up by 10-25% in many cases…”
“…about 2x in two cases…”
“…more than 10x in two small benchmarks”
“…improves throughput by up to 41x”

30 Conclusions
Methodology includes
–Benchmarks
–Experimental design
–Statistical analysis [OOPSLA 2007]
Poor methodology
–Can focus or misdirect innovation and energy
We have a unique opportunity
–Transactional memory, multicore performance, dynamic languages
What we can do
–Enlist VM builders to include replay
–Fund and broaden participation in benchmarking: research and industrial partnerships; funding through NSF, ACM, SPEC, industry, or ??
–Participate in building community workloads

CS380 C
More on Java benchmarking
–Alias analysis
Read: A. Diwan, K. S. McKinley, and J. E. B. Moss, Using Types to Analyze and Optimize Object-Oriented Programs, ACM Transactions on Programming Languages and Systems, 23(1): 30–72, January 2001.

32 Suggested Readings: Performance Evaluation of JVMs
How Java Programs Interact with Virtual Machines at the Microarchitectural Level, Lieven Eeckhout, Andy Georges, and Koen De Bosschere, The 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'03), October 2003.
Method-Level Phase Behavior in Java Workloads, Andy Georges, Dries Buytaert, Lieven Eeckhout, and Koen De Bosschere, The 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'04), October 2004.
Myths and Realities: The Performance Impact of Garbage Collection, S. M. Blackburn, P. Cheng, and K. S. McKinley, ACM SIGMETRICS Conference on Measurement & Modeling of Computer Systems, New York, NY, June 2004.
The DaCapo Benchmarks: Java Benchmarking Development and Analysis, S. M. Blackburn et al., The ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Portland, OR, October 2006.
Statistically Rigorous Java Performance Evaluation, A. Georges, D. Buytaert, and L. Eeckhout, The ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Montreal, Canada, October 2007. To appear.