Fast and Efficient Partial Code Reordering. Xianglong Huang (UT Austin, Adverplex), Stephen M. Blackburn (Intel), David Grove (IBM), Kathryn McKinley (UT Austin)

Presentation transcript:

1 Fast and Efficient Partial Code Reordering Xianglong Huang (UT Austin, Adverplex) Stephen M. Blackburn (Intel) David Grove (IBM) Kathryn McKinley (UT Austin)

2 Software Trends By 2008, 80% of software will be written in Java or C#. [Gartner report] Java and C# are coming to your OS soon - Jnode, Singularity Advantages of modern programming languages: –Productivity, security, reliability… Performance?

3 Hardware Trends [Figure: performance over time, up to 2005. Labels: CPU (2X/1.5 yr), DRAM (2X/10 yrs), cache.] Processor-Memory Performance Gap: grows 50% / year
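The growth rates on this slide can be checked with a little arithmetic. The sketch below assumes the usual reading of the figure (CPU performance doubles every 1.5 years, DRAM every 10 years); the class and method names are illustrative only. Under that reading the gap grows by roughly 48% per year, which matches the slide's rounded 50%.

```java
final class PerformanceGap {
    /** Annual growth factor for a quantity that doubles every `years` years. */
    public static double annualFactor(double years) {
        return Math.pow(2.0, 1.0 / years);
    }

    public static void main(String[] args) {
        double cpu = annualFactor(1.5);   // assumed: CPU doubles every 1.5 years
        double dram = annualFactor(10.0); // assumed: DRAM doubles every 10 years
        double gap = cpu / dram;          // how fast CPU pulls ahead of DRAM each year

        System.out.printf("CPU  grows %.1f%% per year%n", (cpu - 1.0) * 100.0);
        System.out.printf("DRAM grows %.1f%% per year%n", (dram - 1.0) * 100.0);
        System.out.printf("Gap  grows %.1f%% per year%n", (gap - 1.0) * 100.0);
    }
}
```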

4 Improvement Potential Base case: JikesRVM default with a separate code space. Cache configuration: 32K direct-mapped IL1, 512K L2 (small programs on a big cache)

5 New and Better Opportunities Virtual machine monitors application behavior at runtime Dynamic recompilation –With dynamic feedback –Allocates instructions at runtime

6 Previous Work on Instruction Locality Static schemes –Statically profile call correlation and reorder code at compile and link time [Pettis and Hansen 90] –Cache coloring [Hashemi et al. 97] –Profile procedure interleaving [Gloy et al. 99] –Static schemes are not flexible Dynamic scheme –JIT code reordering [Chen et al. 97] –Used as our base case

7 Optimizations in Virtual Machine Static instruction allocation used at runtime –e.g. just-in-time compilation –Invocation order [Diagram labels: Compiler, Memory Manager, Runtime, Static Optimizations]

8 Optimizations in Virtual Machine Dynamic instruction allocation/reordering adapts to the program behavior with low overhead [Diagram labels: Compiler, Memory Manager, Runtime, Static Optimization]

9 Opportunity for Instruction Locality Dynamic detection of hot methods, hot basic blocks Dynamic recompilation relocates methods at runtime

10 PCR Optimizations Reduce instruction capacity misses –Code space –Method separation –Code splitting Reduce instruction conflict misses –Code padding

11 PCR System [Diagram. Legend: JikesRVM component, Input/Output, Optimized method, Baseline method, Data. Labels: Baseline Compiler, Source Code, Executing Code, Adaptive Sampler, Optimizing Compiler, Hot Methods]

12 PCR System: Method Separation [Diagram. Legend: hot method (optimized code), cold method (baseline code), data. Layout labels: Code, Data; Hot Methods, Cold Methods, Code, Data]
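Method separation can be pictured as two allocation regions inside the code space: optimized (hot) methods go to one, baseline (cold) methods to the other, so hot code ends up contiguous. The following is only a minimal sketch of that idea, not the Jikes RVM implementation; `SeparatedCodeSpace`, its bump-pointer cursors, and the base addresses are all hypothetical, and region overflow is not checked.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SeparatedCodeSpace {
    private final long hotBase;
    private long hotCursor;   // next free address in the hot region
    private long coldCursor;  // next free address in the cold region
    private final Map<String, Long> addresses = new LinkedHashMap<>();

    public SeparatedCodeSpace(long hotBase, long coldBase) {
        this.hotBase = hotBase;
        this.hotCursor = hotBase;
        this.coldCursor = coldBase;
    }

    /** Places compiled code for a method and returns its start address. */
    public long place(String method, int sizeBytes, boolean optimized) {
        long address;
        if (optimized) {          // optimizing compiler output: hot region
            address = hotCursor;
            hotCursor += sizeBytes;
        } else {                  // baseline compiler output: cold region
            address = coldCursor;
            coldCursor += sizeBytes;
        }
        addresses.put(method, address); // a recompiled method simply gets a new address
        return address;
    }

    /** Bytes of hot code placed so far, i.e. the footprint competing for the I-cache. */
    public long hotBytes() {
        return hotCursor - hotBase;
    }

    public static void main(String[] args) {
        SeparatedCodeSpace space = new SeparatedCodeSpace(0x10000L, 0x80000L);
        space.place("A", 400, false); // first compiled by the baseline compiler
        space.place("B", 300, false);
        space.place("A", 200, true);  // A becomes hot and is recompiled
        System.out.println("hot bytes: " + space.hotBytes());
    }
}
```

Keeping the two cursors separate means that recompiling a method moves it into the hot region without disturbing the layout of methods already there.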

13 PCR System: Code Splitting An online edge profile identifies hot basic blocks in a method. Code reordering moves hot basic blocks to the beginning of a method. Code splitting separates hot and cold basic blocks inside the heap. [Diagram: Method A, with hot basic blocks and cold basic blocks]
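The partitioning step of code splitting can be sketched as follows. This is a simplified illustration under stated assumptions: `Block`, `Split`, and the fixed `hotThreshold` are hypothetical, execution counts are assumed to come from the edge profile, and the branch fix-ups a real compiler needs after moving blocks are omitted.

```java
import java.util.ArrayList;
import java.util.List;

final class CodeSplitter {
    /** A basic block with the execution count attributed to it by the profile. */
    static final class Block {
        final String name;
        final long count;
        Block(String name, long count) {
            this.name = name;
            this.count = count;
        }
    }

    /** Result of a split: hot blocks stay together, cold blocks are placed elsewhere. */
    static final class Split {
        final List<Block> hot = new ArrayList<>();
        final List<Block> cold = new ArrayList<>();
    }

    /** Partitions blocks by count, keeping the original order within each group. */
    public static Split split(List<Block> blocks, long hotThreshold) {
        Split result = new Split();
        for (Block b : blocks) {
            if (b.count >= hotThreshold) {
                result.hot.add(b);
            } else {
                result.cold.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Block> methodA = new ArrayList<>();
        methodA.add(new Block("entry", 1000));
        methodA.add(new Block("error", 0));
        methodA.add(new Block("loop", 5000));
        methodA.add(new Block("exit", 1000));

        Split s = split(methodA, 100);
        for (Block b : s.hot) System.out.println("hot:  " + b.name);
        for (Block b : s.cold) System.out.println("cold: " + b.name);
    }
}
```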

14 PCR System: Code Splitting [Diagram. Legend: hot methods (optimized code), cold methods (baseline code), data, hot basic blocks, cold basic blocks. Layout labels: Data, Hot Methods, Cold Methods; Data, Hot Blocks, Cold Methods, Cold Blocks]

15 PCR Optimizations Reduce instruction capacity misses –Code Space –Method separation –Code splitting Reduce instruction conflict misses –Code padding

16 PCR System: Code Padding [Diagram. Legend: JikesRVM component, Input/Output. Labels: Baseline Compiler, Source Code, Binary Code, Adaptive Sampler, Optimizing Compiler, Hot Methods, Dynamic Call Graph]

17 PCR System: Code Padding Method A() { … classC.B(); … } [Diagram labels: A, B, Conflict, Dynamic Call Graph]
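To make the conflict between A and B concrete, here is a minimal sketch of how two methods can collide in a direct-mapped instruction cache and how padding can move one of them. It is not the authors' algorithm. The 32K cache size comes from the slides; the 64-byte line size, the class `CodePadding`, and the linear search for a pad are assumptions made for illustration.

```java
final class CodePadding {
    static final int CACHE_BYTES = 32 * 1024; // 32K direct-mapped IL1, as in the slides
    static final int LINE_BYTES = 64;          // assumed line size, not given in the slides
    static final int NUM_LINES = CACHE_BYTES / LINE_BYTES;

    /** Cache line that an address maps to in a direct-mapped cache. */
    public static int lineIndex(long address) {
        return (int) ((address / LINE_BYTES) % NUM_LINES);
    }

    /** Marks every cache line touched by code occupying [start, start + size). */
    static boolean[] footprint(long start, int size) {
        boolean[] used = new boolean[NUM_LINES];
        for (long a = start - start % LINE_BYTES; a < start + size; a += LINE_BYTES) {
            used[lineIndex(a)] = true;
        }
        return used;
    }

    /** True if the two methods share at least one cache line. */
    public static boolean conflicts(long aStart, int aSize, long bStart, int bSize) {
        boolean[] a = footprint(aStart, aSize);
        boolean[] b = footprint(bStart, bSize);
        for (int i = 0; i < NUM_LINES; i++) {
            if (a[i] && b[i]) return true;
        }
        return false;
    }

    /** Smallest line-multiple pad before B that removes the conflict, or -1 if none exists. */
    public static int paddingFor(long aStart, int aSize, long bStart, int bSize) {
        for (int pad = 0; pad < CACHE_BYTES; pad += LINE_BYTES) {
            if (!conflicts(aStart, aSize, bStart + pad, bSize)) return pad;
        }
        return -1;
    }

    public static void main(String[] args) {
        long a = 0x0000L;  // caller A, 256 bytes
        long b = 0x8000L;  // callee B, 128 bytes, exactly one cache size away from A
        System.out.println("conflict: " + conflicts(a, 256, b, 128));
        System.out.println("padding:  " + paddingFor(a, 256, b, 128) + " bytes");
    }
}
```

In this example B starts exactly one cache size after A, so both map to line 0; shifting B past A's four lines removes the overlap.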

18 Methodology Java virtual machine: Jikes RVM Various Architectures –x86 (Pentium 4) –PowerPC –Simulator: Dynamic SimpleScalar Use direct-mapped I-cache –Shorter latency –More conflict misses

19 PCR Results: jess on x86

20 PCR Results: fop on x86

21 Impact of Code Padding Base case: JikesRVM default + a separate code space. Cache configuration: 32K direct-mapped IL1, 512K L2

22 Conclusion A separate code space improves program performance by 6% (up to 30%) on Pentium 4. PCR has negligible overhead. PCR shows no obvious performance improvement: –On Pentium 4, no improvement on average –In simulation, PCR gains 14% for one program, but results are not consistent and show no improvement on average. Potential opportunities remain for dynamic optimizations.

23 Thank you! Questions? Compiler Garbage collector Runtime Static Optimization

24 Cache: Small vs. Large [Table, partly lost in extraction. Columns: IL1 (Size, Assoc, Latency), DL1 (Size, Assoc, Latency), L2 (Size, Assoc, Latency). Surviving row fragments: 8K K25; 16K K28; 64K K210] Cacti, 90nm technology, 3GHz frequency

25 Cache-Size Comparison

26 Direct-mapped vs. Two-way [Figure: cycles (10^6)] Cacti, 90nm technology, 3GHz

27 Improving Performance Classic optimizations not sufficient! Different programming styles –Automatic memory management –Pointer data structures –Many small methods Optimization costs incurred at runtime Virtual Machine (VM) adds complexity –Class loading, memory management, just-in-time compiler…

28 Instruction Locality Instructions have better locality? –More instruction accesses –About same # of data cache misses Penalty in pipelined processor –Create bubbles in the pipeline Instruction locality can be more critical

29 Locality Impact On Performance Locality is key to performance [Figure: Execution Time Distribution, geometric mean of five Java programs. Values shown: 23.2%, 40.1%, 25.1%, 48.3%]