rePLay: A Hardware Framework for Dynamic Optimization Paper by: Sanjay J. Patel, Member, IEEE, and Steven S. Lumetta, Member, IEEE Presentation by: Alex Rodionov
Outline Motivation, Introduction, Basic Concepts Frame Constructor Optimization Engine Frame Cache Frame Sequencer Simulation results Conclusion
Motivation Want to make programs run faster One way: code optimization Done by compiler Ex: automatic loop unrolling, common sub-expr. elimination Compiler optimizations are conservative... Optimized code must still be correct No knowledge of dynamic runtime behavior Handling pointer aliasing is complicated
rePLay Framework [Figure: instruction stream → “frame” → optimized frame]
rePLay Framework Performs code optimization at runtime: In hardware With access to dynamic behavior Speculatively; potentially unsafe optimizations Consists of: A software-programmable optimization engine Hardware to identify, cache, and sequence blocks of program code for optimization A recovery mechanism to undo speculative execution Integrates into an existing microarchitecture
rePLay Framework
Frames One or more consecutive basic blocks from original program flow:
Frames Begin at branch targets End at erratically-behaving branches Include well-behaving branches. They: Are kept inside frame Allow frame to span multiple basic blocks Are converted into assertion instructions
Assertions
Assertions Ensure that the frame executes completely Evaluate the same condition as the branches they replace Force execution to restart at the beginning of the frame if the condition evaluates to false Will re-execute using original code, not the frame Can be inserted later to verify other speculations besides branches (ex: data values)
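The recovery semantics above can be sketched in a few lines. This is a minimal, hypothetical model (the names `AssertionFired`, `run`, and the dictionary-based state are ours, not the paper's): a frame executes speculatively against a checkpoint, and a fired assertion discards its work and re-executes the original code.

```python
# Sketch of rePLay assertion/recovery semantics (illustrative names only).

class AssertionFired(Exception):
    """Raised when an assertion (a promoted branch) evaluates false."""

def run(frame, original_code, state):
    snapshot = dict(state)          # checkpoint architectural state
    try:
        frame(state)                # speculative execution of the frame
    except AssertionFired:
        state.clear()
        state.update(snapshot)      # undo all speculative work
        original_code(state)        # re-execute the original, unoptimized code
    return state

# Example frame: the replaced branch was strongly biased toward x > 0,
# so the frame asserts that condition instead of branching on it.
def frame(s):
    if not (s["x"] > 0):            # assertion instruction
        raise AssertionFired
    s["y"] = s["x"] * 2             # frame body past the old branch

def original(s):
    s["y"] = s["x"] * 2 if s["x"] > 0 else -1
```

When the assertion holds, the frame completes as one straight-line unit; when it fires, execution falls back to the original control flow, so correctness never depends on the speculation being right.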
Frames - Summary Built from speculatively sequential basic blocks Form the scope/boundary for optimizations Include assertion instructions to verify speculations during execution
Frame Construction
Frame Construction Frames are built from already-executed instructions over time As conditional branches are promoted to assertions, the frame grows Fired assertions can be demoted back to branches Un-promoted control instructions terminate a frame Once a frame contains enough instructions (>threshold), it is done
Frame Construction Use branch bias table to promote branches to assertions Count number of times branch had same outcome Use two such tables: One for conditional branches (T vs. NT) One for indirect branches (arbitrary target)
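The bias-table mechanism can be sketched as follows. This is an assumed implementation, not the paper's: the threshold value, the tuple-per-entry layout, and the method names are ours; the paper only specifies that entries count consecutive identical outcomes per branch and path history.

```python
# Hypothetical branch bias table: promote a branch to an assertion once it
# produces the same outcome enough times in a row; demote on a fired assertion.

PROMOTE_THRESHOLD = 32   # assumed value; consecutive identical outcomes needed

class BiasTable:
    def __init__(self):
        self.entries = {}    # (branch PC, path history) -> (last_outcome, count)

    def update(self, pc, history, outcome):
        key = (pc, history)
        last, count = self.entries.get(key, (outcome, 0))
        if outcome == last:
            count += 1
        else:
            last, count = outcome, 1    # bias broken: restart the count
        self.entries[key] = (last, count)
        return count >= PROMOTE_THRESHOLD   # True => promote to assertion

    def demote(self, pc, history):
        # a fired assertion demotes the branch back to an ordinary branch
        self.entries.pop((pc, history), None)
```

The same structure handles both tables from the slide: for conditional branches the outcome is taken/not-taken, while for indirect branches it is the target address.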
Branch Promotion/Demotion
Results
Results 64KB bias table for conditional branches, 10KB for indirect branches
Frame Construction - Summary We desire: Construction of large frames Promotion of consistently-behaving branches Parameters to play with: Branch promotion threshold Minimum frame size Branch history length Size of branch bias tables
Optimization Engine
Optimization Engine Performs code optimization within frames Is software-programmable, has its own instruction set and local memory Optimizes frames in parallel with execution of the program Can make speculative and unsafe optimizations, as long as assertions are inserted Design is open – no implementation details proposed
Possible Optimizations Value speculation Pointer aliasing speculation Eliminating stack operations across function call boundaries Anything else a compiler does, plus what it is afraid to do
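Value speculation, the first item above, is a good illustration of a "risky" optimization made safe by an inserted assertion. The sketch below is hypothetical (the speculated value, function names, and dict-as-memory model are ours): the optimizer has observed a load returning the same value on past runs, so the frame hardcodes it and verifies the guess.

```python
# Hypothetical value speculation inside a frame. Original code:
#   t = load(addr); return t + 1
# The frame replaces the load with a profiled constant, guarded by an assertion.

SPECULATED = 42   # value observed on past executions (assumed for this example)

class AssertionFired(Exception):
    pass

def optimized_frame(memory, addr):
    t = SPECULATED                  # load replaced by a constant;
                                    # dependent work no longer waits on memory
    if memory[addr] != SPECULATED:  # inserted assertion verifies the guess
        raise AssertionFired        # recovery re-executes the original code
    return t + 1

def original_code(memory, addr):
    return memory[addr] + 1         # the safe fallback path
```

Because the assertion forces re-execution of the original code on a mismatch, the transformation need not be provably correct, only usually correct, which is exactly what a conservative compiler cannot assume.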
Frame Cache
Frame Cache Delivers optimized frames for execution Can increase instruction delivery throughput even without optimization Does not replace regular instruction cache Must hold all cache lines of a frame May lead to cache fragmentation Fired assertion -> eviction from cache
Frame Cache Implementation [Figure: frames B and C occupying consecutive cache lines]
Frame Cache Implementation Frames span multiple consecutive cache lines Frames are indexed by their starting PC, which maps to the first cache line of the frame Last cache line of a frame has a termination bit Cache is 4-way set associative Further implementation details are lacking Authors' model is a bit unrealistic: cache size is measured in number of frames, regardless of frame size
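The cache behavior described above can be sketched as follows. This is our simplification, mirroring the authors' model of counting capacity in frames rather than bytes; the LRU replacement policy and method names are assumptions, not details from the paper.

```python
# Hypothetical frame cache: capacity counted in frames (as the authors model
# it), indexed by starting PC, with whole-frame eviction.

from collections import OrderedDict

class FrameCache:
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.frames = OrderedDict()   # starting PC -> list of cache lines

    def insert(self, start_pc, lines):
        if start_pc in self.frames:
            self.frames.move_to_end(start_pc)
        self.frames[start_pc] = lines
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)   # evict the LRU frame whole

    def fetch(self, start_pc):
        lines = self.frames.get(start_pc)
        if lines is not None:
            self.frames.move_to_end(start_pc)  # mark as recently used
        return lines                           # None => frame cache miss

    def on_assertion_fired(self, start_pc):
        self.frames.pop(start_pc, None)        # fired assertion evicts the frame
```

Evicting a frame as a unit reflects the constraint that all cache lines of a frame must be resident, which is also the source of the fragmentation concern on the slide.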
Effect on Frame Size Larger cache may hold larger frames
Effect on Frame Code Coverage Cache misses mean no frame is fetched
Effect on Frame Completion
Frame Cache - Summary Having a finite-sized frame cache does not severely affect Code coverage by frames Instructions per frame Successful frame completion
Frame Sequencer
Frame Sequencer Augments a standard branch predictor with a frame predictor Frame predictor predicts which frame to fetch from the frame cache A selector chooses final branch prediction: Execute optimized frame (frame predictor) Execute unoptimized basic block (regular branch predictor) History-based or confidence-based
Frame Sequencer
Frame Predictor Uses a table Indexed by path history (same as in the frame constructor) Outputs a frame's starting PC Entries are added/removed when frames enter/leave the frame cache
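A minimal sketch of such a predictor table follows. The hashing scheme, table size, and method names are our assumptions; the paper specifies only that the table is indexed by path history and tracks frame-cache contents.

```python
# Hypothetical frame predictor: path history of recent branch targets maps
# to the starting PC of a cached frame (or None on a predictor miss).

HISTORY_LENGTH = 6   # matches the path history length used elsewhere

class FramePredictor:
    def __init__(self, entries=16384):     # 16K entries, as in the evaluation
        self.entries = entries
        self.table = {}

    def _index(self, history):
        # fold the last HISTORY_LENGTH branch targets into a table index
        h = 0
        for target in history[-HISTORY_LENGTH:]:
            h = (h * 31 + target) % self.entries
        return h

    def train(self, history, frame_start_pc):
        self.table[self._index(history)] = frame_start_pc  # frame entered cache

    def evict(self, history):
        self.table.pop(self._index(history), None)         # frame left cache

    def predict(self, history):
        return self.table.get(self._index(history))
```

A separate selector (history- or confidence-based, per the previous slide) would then decide whether to act on this prediction or fall back to the conventional branch predictor.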
Predictor Accuracy Results 16K-entry frame predictor Unknown selector mechanism Low prediction percentages are compensated for by the reduction in total branch count
Frame Sequencer – Summary Even if frames, once started, complete without fired assertions, we still need to know when to start a frame Choose a frame based on previous branch target history Choose when to initiate that frame vs. listening to the conventional branch predictor
Putting it All Together
Putting it All Together Configuration: Branch Bias Table Direct-mapped 64KB for conditional branches 10KB for indirect branches Path history length of 6 (?) Frame Cache 256 frames (of arbitrary size) 4-way set associative Frame Predictor 16K entries Path history length of 6
Putting it All Together 8 SPECint95 benchmarks Trace-driven simulator based on SimpleScalar Alpha AXP ISA
Putting it All Together Results: Avg. frame size: 88 instructions Frame coverage: 68% of instruction stream Frame completion rate: 97.81% Frame predictor accuracy: 81.26%
Conclusion The rePLay Framework provides a system to perform risky dynamic code optimizations in a speculative manner Even with no optimizations, you still get: Increased fetch bandwidth Reduction in number of branches to execute