1 Fast and Efficient Partial Code Reordering Xianglong Huang (UT Austin, Adverplex) Stephen M. Blackburn (Intel) David Grove (IBM) Kathryn McKinley (UT Austin)

2 Software Trends By 2008, 80% of software will be written in Java or C#. [Gartner report] Java and C# are coming to your OS soon - Jnode, Singularity Advantages of modern programming languages: –Productivity, security, reliability… Performance?

3 Hardware Trends [Figure: performance vs. year, 1980 to 2005; curves labeled CPU (2X/1.5 yr) and DRAM (2X/10 yrs); label: cache] Processor-Memory Performance Gap: (grows 50% / year)
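
As a rough check of the gap figure, assuming the "2X/1.5yr" label belongs to processor performance and "2X/10 yrs" to DRAM performance, the two doubling times can be converted to annual growth factors and divided:

```latex
\begin{align*}
g_{\mathrm{CPU}}  &= 2^{1/1.5} \approx 1.587 && (\approx 59\%\ \text{per year}) \\
g_{\mathrm{DRAM}} &= 2^{1/10}  \approx 1.072 && (\approx 7\%\ \text{per year}) \\
\frac{g_{\mathrm{CPU}}}{g_{\mathrm{DRAM}}} &= 2^{\,1/1.5 - 1/10} = 2^{17/30} \approx 1.48
\end{align*}
```

Under that reading the gap widens by roughly 48% per year, which is consistent with the slide's rounded "grows 50% / year".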

4 Improvement Potential Base case: JikesRVM default with separate code space. Cache configuration: 32K direct-mapped IL1, 512K L2 (small programs on a big cache)

5 New and Better Opportunities Virtual machine monitors application behavior at runtime Dynamic recompilation –With dynamic feedback –Allocates instructions at runtime

6 Previous Work on Instruction Locality Static schemes –Static profile calling correlation and reorder code at compile and link time [Pettis and Hansen 90] –Cache coloring [Hashemi et al 97] –Profile procedure interleaving [Gloy et al. 99] –Static schemes are not flexible Dynamic scheme –JIT code reordering [Chen et al. 97] –Used as our base case

7 Optimizations in Virtual Machine Static instruction allocation used at runtime, –e.g. Just-in-time compilations –Invocation order Compiler Memory Manager Runtime Static Optimizations

8 Optimizations in Virtual Machine Dynamic instruction allocation/reordering adapts to the program behavior with low overhead Compiler Memory Manager Runtime Static Optimization

9 Opportunity for Instruction Locality Dynamic detection of hot methods, hot basic blocks Dynamic recompilation relocates methods at runtime

10 PCR Optimizations Reduce instruction capacity misses –Code space –Method separation –Code splitting Reduce instruction conflict misses –Code padding

11 PCR System JikesRVM component Input/Output Optimized method Baseline method Data Baseline Compiler Source Code Executing Code Adaptive Sampler Optimizing Compiler Hot Methods

12 PCR System: Method Separation Hot method (optimized code) Cold method (baseline code) Data Code Data Hot Methods Cold Methods Code
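
The method-separation idea on this slide can be sketched as two bump-pointer regions, one for optimized (hot) code and one for baseline (cold) code. This is only an illustration: the `Region` and `CodeSpace` classes, the addresses, and the method sizes are hypothetical and are not taken from Jikes RVM.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A contiguous code region with a bump-pointer allocator (hypothetical)."""
    base: int
    size: int
    cursor: int = 0
    placed: dict = field(default_factory=dict)

    def allocate(self, name: str, nbytes: int) -> int:
        if self.cursor + nbytes > self.size:
            raise MemoryError(f"region at {self.base:#x} is full")
        addr = self.base + self.cursor
        self.cursor += nbytes
        self.placed[name] = (addr, nbytes)
        return addr


class CodeSpace:
    """Keeps optimized (hot) and baseline (cold) code in separate regions."""

    def __init__(self, hot_base: int, cold_base: int, region_size: int):
        self.hot = Region(hot_base, region_size)
        self.cold = Region(cold_base, region_size)

    def install(self, name: str, nbytes: int, optimized: bool) -> int:
        # Assumption: "compiled by the optimizing compiler" stands in for "hot".
        region = self.hot if optimized else self.cold
        return region.allocate(name, nbytes)

    def hot_footprint(self) -> int:
        """Bytes of hot code, packed contiguously from the hot base."""
        return self.hot.cursor


space = CodeSpace(hot_base=0x10000, cold_base=0x80000, region_size=0x10000)
space.install("A", 400, optimized=False)  # first compiled by the baseline compiler
space.install("B", 300, optimized=False)
space.install("A", 256, optimized=True)   # later recompiled as a hot method
print(hex(space.hot.placed["A"][0]), space.hot_footprint())
```

Because hot methods are packed together, the instruction working set covers fewer cache lines than it would if hot and cold code were interleaved in invocation order.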

13 PCR System: Code Splitting Online edge profile identifies hot basic blocks in a method Code reordering moves hot basic blocks to the beginning of a method Code splitting to separate hot/cold basic blocks inside the heap Cold basic blocks Hot basic blocks Method A:
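
A minimal sketch of the splitting step described above, assuming the edge profile is available as a mapping from (source, destination) block pairs to counts. The function names, the block names, and the fixed hotness threshold are illustrative assumptions; the slides do not say how PCR decides that a block is hot.

```python
def block_counts(edges, entry, entry_count):
    """Derive per-block execution counts from edge counts.

    A block's count is the sum of its incoming edge counts; the entry
    block additionally receives the number of method invocations.
    """
    counts = {entry: entry_count}
    for (src, dst), n in edges.items():
        counts[dst] = counts.get(dst, 0) + n
        counts.setdefault(src, 0)
    return counts


def split_hot_cold(layout, counts, threshold):
    """Partition blocks into hot and cold lists, preserving original order."""
    hot = [b for b in layout if counts.get(b, 0) >= threshold]
    cold = [b for b in layout if counts.get(b, 0) < threshold]
    return hot, cold


# Hypothetical method with a rarely taken error path.
layout = ["entry", "check", "error", "loop", "exit"]
edges = {
    ("entry", "check"): 100,
    ("check", "error"): 1,
    ("check", "loop"): 99,
    ("loop", "loop"): 900,
    ("loop", "exit"): 99,
    ("error", "exit"): 1,
}
counts = block_counts(edges, entry="entry", entry_count=100)
hot, cold = split_hot_cold(layout, counts, threshold=10)
print(hot)   # hot blocks are placed at the beginning of the method
print(cold)  # cold blocks are moved away from the hot code
```

A real compiler would also have to fix up branches between the two parts after the blocks move; that is omitted here.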

14 PCR System: Code Splitting Data Hot Blocks Cold Methods Cold Blocks Hot methods (optimized code) Cold methods (baseline code) Data Cold basic blocks Hot basic blocks Data Hot Methods Cold Methods

15 PCR Optimizations Reduce instruction capacity misses –Code Space –Method separation –Code splitting Reduce instruction conflict misses –Code padding

16 PCR System: Code Padding Baseline Compiler Source Code Binary Code Adaptive Sampler Optimizing Compiler Hot Methods Dynamic Call Graph JikesRVM component Input/Output

17 PCR System: Code Padding Method A() { … classC.B(); … } A B Conflict AB Dynamic Call Graph
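
The conflict between caller A and callee B can be illustrated with a small model of a direct-mapped instruction cache. The 32K capacity matches the configuration quoted in the slides; the 64-byte line size, the method addresses and sizes, and the linear search for a pad are assumptions made for this sketch, not the PCR implementation.

```python
CACHE_BYTES = 32 * 1024   # 32K direct-mapped IL1, as in the slides
LINE_BYTES = 64           # assumption: line size is not given in the slides
NUM_SETS = CACHE_BYTES // LINE_BYTES


def cache_sets(addr, nbytes):
    """Set indices touched by code occupying [addr, addr + nbytes)."""
    first = addr // LINE_BYTES
    last = (addr + nbytes - 1) // LINE_BYTES
    return {line % NUM_SETS for line in range(first, last + 1)}


def conflicts(a, b):
    """True if two (addr, nbytes) code ranges map to a common cache set."""
    return bool(cache_sets(*a) & cache_sets(*b))


def padding_for(caller, callee_addr, callee_bytes):
    """Smallest line-aligned pad that moves the callee off the caller's sets.

    Returns None when no pad smaller than the cache size removes the overlap.
    """
    for pad in range(0, CACHE_BYTES, LINE_BYTES):
        if not conflicts(caller, (callee_addr + pad, callee_bytes)):
            return pad
    return None


# Hypothetical placement: B starts exactly one cache size after A,
# so both map to the same sets even though they are far apart in memory.
method_a = (0x0000, 256)
method_b = (0x8000, 128)
print(conflicts(method_a, method_b))
pad = padding_for(method_a, *method_b)
print(pad)
```

In this toy placement the two methods collide, and shifting B by 256 bytes is enough to separate them. The dynamic call graph matters because only caller/callee pairs that actually execute together are worth padding.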

18 Methodology Java virtual machine: Jikes RVM Various Architectures –x86 (Pentium 4) –PowerPC –Simulator: Dynamic SimpleScalar Use direct-mapped I-cache –Shorter latency –More conflict misses
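
To make the "more conflict misses" point concrete, here is a small set-associative cache model with LRU replacement. It is a toy written for this transcript, not Dynamic SimpleScalar; the 64-byte line size and the two-address trace are assumptions.

```python
from collections import OrderedDict


def count_misses(trace, cache_bytes, line_bytes, ways):
    """Count misses for an address trace in a set-associative LRU cache.

    ways=1 models a direct-mapped cache.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        lines = sets[line % num_sets]
        if line in lines:
            lines.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[line] = True
    return misses


# Two instruction addresses exactly one cache size apart, fetched alternately.
trace = [0x0000, 0x8000] * 100
direct = count_misses(trace, cache_bytes=32 * 1024, line_bytes=64, ways=1)
two_way = count_misses(trace, cache_bytes=32 * 1024, line_bytes=64, ways=2)
print(direct, two_way)
```

With one way the two lines keep evicting each other, so all 200 accesses miss; with two ways only the first access to each line misses.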

19 PCR Results: jess on x86

20 PCR Results: fop on x86

21 Impact of Code Padding Base case: JikesRVM default + a separate code space. Cache configuration: 32K direct-mapped IL1, 512K L2

22 Conclusion A separate code space improves program performance by 6% (up to 30%) on Pentium 4. PCR has negligible overhead. PCR shows no obvious performance improvement –On Pentium 4, no improvement on average –In simulation, PCR yields a 14% improvement for one program, but the results are not consistent and there is no improvement on average. Potential opportunities for dynamic optimizations

23 Thank you! Questions? Compiler Garbage collector Runtime Static Optimization

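24 Cache: Small vs. Large
IL1: Size 8K, Assoc 1, Latency 2; DL1: Assoc 2, Latency 2; L2: Size 128K, Assoc 2, Latency 5
IL1: Size 16K, Assoc 1, Latency 2; DL1: Assoc 2, Latency 3; L2: Size 256K, Assoc 2, Latency 8
IL1: Size 64K, Assoc 1, Latency 4; DL1: Assoc 2, Latency 4; L2: Size 512K, Assoc 2, Latency 10
Cacti, 90nm technology, 3GHz frequency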

25 Cache-Size Comparison

26 Direct-mapped vs. Two-way [Figure: cycles (10^6)] Cacti, 90nm technology, 3GHz

27 Improving Performance Classic optimizations not sufficient! Different programming styles –Automatic memory management –Pointer data structures –Many small methods Optimization costs incurred at runtime Virtual Machine (VM) adds complexity –Class loading, memory management, Just-in-time compiler…

28 Instruction Locality Instructions have better locality? –More instruction accesses –About same # of data cache misses Penalty in pipelined processor –Create bubbles in the pipeline Instruction locality can be more critical

29 Locality Impact On Performance Locality is key to performance [Figure: Execution Time Distribution, geometric mean of five Java programs; labeled values 23.2%, 40.1%, 25.1%, 48.3%]
