rePLay: A Hardware Framework for Dynamic Optimization

rePLay: A Hardware Framework for Dynamic Optimization
Paper by: Sanjay J. Patel, Member, IEEE, and Steven S. Lumetta, Member, IEEE
Presentation by: Alex Rodionov

Outline
- Motivation, Introduction, Basic Concepts
- Frame Constructor
- Optimization Engine
- Frame Cache
- Frame Sequencer
- Simulation results
- Conclusion

Motivation
- Want to make programs run faster
- One way: code optimization
  - Done by the compiler
  - Ex: automatic loop unrolling, common sub-expression elimination
- Compiler optimizations are conservative...
  - Optimized code must still be correct
  - No knowledge of dynamic runtime behavior
  - Handling pointer aliasing is complicated

rePLay Framework
(Figure: a "frame" is carved out of the instruction stream and turned into an optimized frame)

rePLay Framework
- Performs code optimization at runtime
  - In hardware
  - With access to dynamic behavior
  - Speculatively; potentially unsafe optimizations
- Consists of:
  - A software-programmable optimization engine
  - Hardware to identify, cache, and sequence blocks of program code for optimization
  - A recovery mechanism to undo speculative execution
- Integrates into an existing micro-architecture

rePLay Framework

Frames
- One or more consecutive basic blocks from original program flow:

Frames
- Begin at branch targets
- End at erratically-behaving branches
- Include well-behaved branches. They:
  - Are kept inside the frame
  - Allow the frame to span multiple basic blocks
  - Are converted into assertion instructions

Assertions

Assertions
- Ensure that the frame executes completely
- Evaluate the same condition as the branches they replace
- Force execution to restart at the beginning of the frame if the condition evaluates to false
  - Will re-execute using original code, not the frame
- Can be inserted later to verify other speculations besides branches (ex: data values)
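
Below is a minimal sketch (in Python, not from the paper) of the assertion idea: an assertion re-checks the condition of the branch it replaced, and if the check fails the whole frame is squashed and architectural state rolls back so execution can resume on the original, unoptimized code. The frame representation and function names are illustrative assumptions.

```python
# Sketch of assertion semantics: a frame is a straight-line list of
# operations; an "assert" op re-evaluates the condition of the branch it
# replaced.  If the condition fails, the frame is squashed and the saved
# state is restored (the caller would then re-execute the original code).

class AssertionFired(Exception):
    """Raised when an assertion disagrees with the promoted branch outcome."""

def execute_frame(frame, state):
    saved = dict(state)                  # checkpoint architectural state
    try:
        for kind, payload in frame:
            if kind == "assert":
                if not payload(state):   # the branch would have gone the other way
                    raise AssertionFired()
            else:                        # ordinary instruction
                payload(state)
        return state                     # frame commits atomically
    except AssertionFired:
        return saved                     # roll back to the checkpoint

# Frame built from "if x > 0: x += 1; y = x * 2", where the branch was
# promoted because it is almost always taken.
frame = [
    ("assert", lambda s: s["x"] > 0),             # replaces the conditional branch
    ("op",     lambda s: s.update(x=s["x"] + 1)),
    ("op",     lambda s: s.update(y=s["x"] * 2)),
]

print(execute_frame(frame, {"x": 5, "y": 0}))    # assertion holds: frame commits
print(execute_frame(frame, {"x": -3, "y": 0}))   # assertion fires: state rolled back
```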

Frames - Summary
- Built from speculatively sequential basic blocks
- Form the scope/boundary for optimizations
- Include assertion instructions to verify speculations during execution

Frame Construction

Frame Construction
- Frames are built from already-executed instructions over time
- As conditional branches are promoted to assertions, the frame grows
- Fired assertions can be demoted back to branches
- Un-promoted control instructions terminate a frame
- Once a frame contains enough instructions (> threshold), it is done
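
The following is a simplified sketch, under assumed names and data formats, of how a frame constructor could grow frames from the retired instruction stream. The thresholds are illustrative parameters, and `is_promoted` stands in for the branch bias table described on the next slide.

```python
# Assumed-structure sketch of a frame constructor: it watches retired
# instructions, converts promoted branches into assertions so the frame can
# keep growing, and closes the frame at an un-promoted control instruction
# or once the frame is large enough.

MIN_FRAME_SIZE = 8       # illustrative parameter: discard frames smaller than this
TARGET_FRAME_SIZE = 32   # illustrative parameter: stop growing once this big

def build_frames(retired_stream, is_promoted):
    """retired_stream yields (pc, opcode, is_control) in retirement order;
    is_promoted(pc) asks the branch bias table whether this branch is promoted."""
    frame = []
    for pc, opcode, is_control in retired_stream:
        if is_control and is_promoted(pc):
            frame.append((pc, "assert"))          # promoted branch -> assertion
            if len(frame) >= TARGET_FRAME_SIZE:
                yield frame                       # big enough: hand it to the optimizer
                frame = []
        elif is_control:
            if len(frame) >= MIN_FRAME_SIZE:
                yield frame                       # un-promoted branch ends the frame
            frame = []
        else:
            frame.append((pc, opcode))            # ordinary instruction joins the frame
```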

Frame Construction
- Use a branch bias table to promote branches to assertions
  - Count the number of times a branch had the same outcome
- Use two such tables:
  - One for conditional branches (T vs. NT)
  - One for indirect branches (arbitrary target)
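
A minimal sketch of one bias table, assuming a run-length counter per branch; the promotion threshold is an illustrative value, not the paper's. In the constructor sketch above, its `is_promoted` method would serve as the `is_promoted` predicate, and a fired assertion would call `demote`.

```python
# Assumed organization of one branch bias table: each entry remembers the
# branch's last outcome and how many times in a row it has repeated.  A long
# enough run promotes the branch; a fired assertion demotes it.

PROMOTION_THRESHOLD = 32     # illustrative; the paper evaluates several values

class BranchBiasTable:
    def __init__(self):
        self.entries = {}    # pc -> {"outcome": last, "count": run length, "promoted": bool}

    def update(self, pc, outcome):
        """Called on every retired branch with its outcome (taken/not-taken,
        or the target address for the indirect-branch table)."""
        e = self.entries.setdefault(pc, {"outcome": outcome, "count": 0, "promoted": False})
        if e["outcome"] == outcome:
            e["count"] += 1
            if e["count"] >= PROMOTION_THRESHOLD:
                e["promoted"] = True             # bias is strong: promote to assertion
        else:
            e.update(outcome=outcome, count=1, promoted=False)

    def demote(self, pc):
        """Called when an assertion built from this branch fires."""
        if pc in self.entries:
            self.entries[pc].update(count=0, promoted=False)

    def is_promoted(self, pc):
        return self.entries.get(pc, {}).get("promoted", False)
```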

Branch Promotion/Demotion

Results

Results
- Bias tables: 64KB for conditional branches, 10KB for indirect branches

Frame Construction - Summary
- We desire:
  - Construction of large frames
  - Promotion of consistently-behaving branches
- Parameters to play with:
  - Branch promotion threshold
  - Minimum frame size
  - Branch history length
  - Size of branch bias tables

Optimization Engine

Optimization Engine
- Performs code optimization within frames
- Is software-programmable, with its own instruction set and local memory
- Optimizes frames in parallel with execution of the program
- Can make speculative and unsafe optimizations, as long as assertions are inserted
- Design is open – no implementation details proposed
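
Since the engine's design is left open, the following is only an assumed outer-loop sketch showing where it sits in the flow: it drains frames from the constructor, applies optimization passes (such as the value-speculation sketch after the next slide), and installs the results into the frame cache, off the critical path of normal execution. All names here are hypothetical.

```python
# Assumed outer loop of the optimization engine (the paper deliberately
# leaves its internal design open): take frames produced by the frame
# constructor, run a list of passes over each, and install the optimized
# frame into the frame cache.

def optimization_engine(frame_queue, passes, frame_cache):
    """frame_queue: iterable of frames handed over by the frame constructor.
    passes: functions that rewrite a frame and may insert extra assertions.
    frame_cache: object exposing insert(start_pc, instructions)."""
    for frame in frame_queue:
        start_pc = frame[0][0]          # frames are indexed by their starting PC
        optimized = frame
        for run_pass in passes:
            optimized = run_pass(optimized)
        frame_cache.insert(start_pc, optimized)
```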

Possible Optimizations
- Value speculation
- Pointer aliasing speculation
- Eliminating stack operations across function call boundaries
- Anything else a compiler does, plus what it is afraid to do
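
As one concrete illustration (not taken from the paper), here is a hypothetical value-speculation pass over a frame: a load whose result is highly predictable is shadowed by its predicted constant so dependent work need not wait, and an assertion checks the real loaded value so a wrong guess squashes the frame instead of corrupting state. The frame encoding and opcode names are assumptions.

```python
# Sketch of a value-speculation pass over a frame represented as a list of
# (dest, op, operands) tuples.  The assertion makes the speculation safe:
# if it fires, the whole frame is squashed and the original code re-runs.

def value_speculate(frame, predicted):
    """frame: list of (dest, op, operands) tuples in frame order.
    predicted: maps a load's destination register to its predicted value."""
    optimized = []
    for dest, op, operands in frame:
        if op == "load" and dest in predicted:
            value = predicted[dest]
            # Keep the load so there is something to check against...
            optimized.append((dest, "load", operands))
            # ...assert the speculation was right (fires -> frame squashed)...
            optimized.append((None, "assert_eq", (dest, value)))
            # ...and let later uses of `dest` see the predicted constant
            # immediately (further constant folding not shown).
            optimized.append((dest, "move_imm", (value,)))
        else:
            optimized.append((dest, op, operands))
    return optimized
```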

Frame Cache

Frame Cache
- Delivers optimized frames for execution
- Can increase instruction delivery throughput even without optimization
- Does not replace the regular instruction cache
- Must hold all cache lines of a frame
  - May lead to cache fragmentation
- Fired assertion -> eviction from cache

Frame Cache Implementation
(Figure: frame B occupying two cache lines and frame C occupying four consecutive cache lines)

Frame Cache Implementation
- Frames span multiple consecutive cache lines
- Frames are indexed by their starting PC, which maps to the first cache line of the frame
- The last cache line of a frame has a termination bit
- Cache is 4-way set associative
- Further implementation details are lacking
  - The authors' model is a bit unrealistic: cache size is measured in # of any-sized frames
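
A behavioral sketch of the frame cache under these assumptions (line size, replacement policy, and method names are all invented for illustration): frames are keyed by their starting PC, stored across consecutive fixed-size lines, the last line carries the termination bit, and a fired assertion evicts the frame.

```python
# Behavioral sketch of the frame cache.  Associativity and replacement are
# simplified away; the point is the starting-PC index and termination bit.

LINE_SIZE = 8   # instructions per cache line (illustrative)

class FrameCache:
    def __init__(self, capacity_frames=256):
        self.capacity = capacity_frames
        self.frames = {}                    # starting PC -> list of (line, terminates)

    def insert(self, start_pc, instructions):
        if len(self.frames) >= self.capacity:
            self.frames.pop(next(iter(self.frames)))   # crude replacement policy
        lines = [instructions[i:i + LINE_SIZE]
                 for i in range(0, len(instructions), LINE_SIZE)]
        # Model the termination bit: tag every line, set it only on the last.
        self.frames[start_pc] = [(line, idx == len(lines) - 1)
                                 for idx, line in enumerate(lines)]

    def fetch(self, start_pc):
        """Return the whole frame by following lines until the termination bit."""
        if start_pc not in self.frames:
            return None                     # miss: fall back to the normal i-cache
        out = []
        for line, terminates in self.frames[start_pc]:
            out.extend(line)
            if terminates:
                break
        return out

    def evict_on_fired_assertion(self, start_pc):
        self.frames.pop(start_pc, None)
```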

Effect on Frame Size
- Larger cache may hold larger frames

Effect on Frame Code Coverage
- Cache misses mean no frame is fetched

Effect on Frame Completion

Frame Cache - Summary
- Having a finite-sized frame cache does not severely affect:
  - Code coverage by frames
  - Instructions per frame
  - Successful frame completion

Frame Sequencer

Frame Sequencer
- Augments a standard branch predictor with a frame predictor
- The frame predictor predicts which frame to fetch from the frame cache
- A selector chooses the final prediction:
  - Execute an optimized frame (frame predictor)
  - Execute an unoptimized basic block (regular branch predictor)
  - The selector is history-based or confidence-based

Frame Sequencer

Frame Predictor
- Uses a table:
  - Indexed by path history (same as in the frame constructor)
  - Outputs a frame's starting PC
- Entries are added/removed when frames enter/leave the frame cache
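
A sketch of one plausible organization (the slides do not specify the selector mechanism, and the hashing and confidence details here are assumptions): the predictor table is indexed by a hash of the recent branch-target history and returns a cached frame's starting PC, and a confidence-based selector decides whether to follow it or the conventional branch predictor. The 16K entries and path history length of 6 match the configuration reported later.

```python
# Sketch of a frame predictor plus a confidence-based selector.  The real
# selector and index function are not described; these are stand-ins.

TABLE_ENTRIES = 16 * 1024    # matches the 16K-entry configuration in the slides
HISTORY_LENGTH = 6           # path history length used in the evaluation

class FramePredictor:
    def __init__(self):
        self.table = {}                       # index -> (frame start PC, confidence)
        self.history = []                     # last HISTORY_LENGTH branch targets

    def _index(self):
        return hash(tuple(self.history)) % TABLE_ENTRIES

    def record_target(self, target_pc):
        self.history = (self.history + [target_pc])[-HISTORY_LENGTH:]

    def train(self, frame_start_pc, completed):
        idx = self._index()
        pc, conf = self.table.get(idx, (frame_start_pc, 0))
        if pc == frame_start_pc and completed:
            self.table[idx] = (pc, min(conf + 1, 3))      # 2-bit confidence counter
        else:
            self.table[idx] = (frame_start_pc, 0)          # replace or reset on failure

    def predict(self):
        return self.table.get(self._index())               # (frame PC, confidence) or None

def select_next_fetch(frame_predictor, branch_predictor_pc, confidence_needed=2):
    """Confidence-based selector: prefer the frame cache only when confident."""
    prediction = frame_predictor.predict()
    if prediction is not None and prediction[1] >= confidence_needed:
        return ("frame", prediction[0])          # fetch the optimized frame
    return ("basic_block", branch_predictor_pc)  # fall back to the normal predictor
```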

Predictor Accuracy Results
- 16K-entry frame predictor
- Unknown selector mechanism
- Low prediction percentages are compensated by the reduction in total branch count

Frame Sequencer – Summary
- Even if frames, once started, complete without firing assertions, we still need to know when to start a frame:
  - Choose a frame based on previous branch target history
  - Choose when to initiate this frame vs. listening to the conventional branch predictor

Putting it All Together

Putting it All Together
Configuration:
- Branch Bias Table
  - Direct-mapped
  - 64KB for conditional branches
  - 10KB for indirect branches
  - Path history length of 6 (?)
- Frame Cache
  - 256 frames (of arbitrary size)
  - 4-way set associative
- Frame Predictor
  - 16K entries
  - Path history length of 6

Putting it All Together
- 8 SPECint95 benchmarks
- Trace-driven simulator based on SimpleScalar
- Alpha AXP ISA

Putting it All Together
Results:
- Avg. frame size: 88 instructions
- Frame coverage: 68% of instruction stream
- Frame completion rate: 97.81%
- Frame predictor accuracy: 81.26%

Conclusion
- The rePLay Framework provides a system to perform risky dynamic code optimizations in a speculative manner
- Even with no optimizations, you still get:
  - Increased fetch bandwidth
  - Reduction in the number of branches to execute