rePLay: A Hardware Framework for Dynamic Optimization

Presentation on theme: "rePLay: A Hardware Framework for Dynamic Optimization"— Presentation transcript:

1 rePLay: A Hardware Framework for Dynamic Optimization
Paper by: Sanjay J. Patel, Member, IEEE, and Steven S. Lumetta, Member, IEEE
Presentation by: Alex Rodionov

2 Outline
Motivation, Introduction, Basic Concepts
Frame Constructor
Optimization Engine
Frame Cache
Frame Sequencer
Simulation results
Conclusion

3 Motivation
Want to make programs run faster
One way: code optimization
Done by the compiler
Ex: automatic loop unrolling, common subexpression elimination
Compiler optimizations are conservative...
Optimized code must still be correct
No knowledge of dynamic runtime behavior
Handling pointer aliasing is complicated

4 rePLay Framework
(Figure: a “frame” is extracted from the instruction stream and replaced by an optimized frame)

5 rePLay Framework
Performs code optimization at runtime:
In hardware
With access to dynamic behavior
Speculatively; potentially unsafe optimizations
Consists of:
A software-programmable optimization engine
Hardware to identify, cache, and sequence blocks of program code for optimization
A recovery mechanism to undo speculative execution
Integrates into an existing micro-architecture
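To make the dataflow concrete before diving into each piece, here is a minimal Python sketch of how the components above could fit together. All of the interfaces (constructor.observe, optimizer.optimize, cache.insert/fetch, sequencer.predict_frame, sequencer.trusts_frame) are hypothetical placeholders for illustration, not names from the paper.

def replay_pipeline_step(retired_block, fetch_pc, constructor, optimizer, cache, sequencer):
    # Back end: grow frames out of retired basic blocks.
    finished_frame = constructor.observe(retired_block)
    if finished_frame is not None:
        cache.insert(finished_frame.start_pc, optimizer.optimize(finished_frame))

    # Front end: decide what to fetch next.
    predicted = sequencer.predict_frame(fetch_pc)
    frame = cache.fetch(predicted) if predicted is not None else None
    if frame is not None and sequencer.trusts_frame():
        return ("frame", frame)           # fetch the optimized frame (speculative path)
    return ("basic_block", fetch_pc)      # fall back to conventional fetch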

6 rePLay Framework

7 Frames One or more consecutive basic blocks from the original program flow:

8 Frames
Begin at branch targets
End at erratically-behaving branches
Include well-behaved branches, which:
Are kept inside the frame
Allow the frame to span multiple basic blocks
Are converted into assertion instructions

9 Assertions

10 Assertions
Ensure that the frame executes completely
Evaluate the same condition as the branches they replace
Force execution to restart at the beginning of the frame if the condition evaluates to false
On restart, the original code is re-executed, not the frame
Can be inserted later to verify other speculations besides branches (e.g., data values)
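A minimal Python sketch of the assertion semantics described on this slide; the Op representation, the entry_pc/fallthrough_pc arguments, and the dictionary-based architectural state are all invented for illustration.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Op:
    kind: str                        # "alu" or "assert"
    effect: Callable[[dict], object] # updates state, or returns the assertion outcome

def execute_frame(ops, entry_pc, fallthrough_pc, arch_state):
    speculative = dict(arch_state)              # results are buffered until the frame commits
    for op in ops:
        if op.kind == "assert":
            if not op.effect(speculative):      # same test as the branch this assertion replaced
                return arch_state, entry_pc     # fired: discard buffer, re-fetch original code
        else:
            op.effect(speculative)              # ordinary instruction updates the buffer
    arch_state.clear()
    arch_state.update(speculative)              # frame completed: commit atomically
    return arch_state, fallthrough_pc

# Example: the frame assumed a branch like "beq r1, exit" falls through, so it asserts r1 != 0.
frame = [Op("assert", lambda s: s["r1"] != 0),
         Op("alu",    lambda s: s.update(r2=s["r1"] * 4))]
state, next_pc = execute_frame(frame, 0x400, 0x440, {"r1": 3})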

11 Frames - Summary
Built from speculatively sequential basic blocks
Form the scope/boundary for optimizations
Include assertion instructions to verify speculations during execution

12 Frame Construction

13 Frame Construction
Frames are built over time from already-executed instructions
As conditional branches are promoted to assertions, the frame grows
Branches whose assertions fire can be demoted back to normal branches
Un-promoted control instructions terminate a frame
Once a frame contains enough instructions (> threshold), it is done
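A rough Python sketch of this growth loop, under assumed data structures (block.instructions and block.terminating_branch are hypothetical); undersized frames are simply dropped here, which is one possible policy rather than the paper's.

FRAME_SIZE_THRESHOLD = 32   # "minimum frame size" parameter; the value is illustrative

def construct_frames(retired_blocks, is_promoted):
    frame, frames = [], []
    for block in retired_blocks:                       # basic blocks arrive in retirement order
        frame.extend(block.instructions)
        if is_promoted(block.terminating_branch):      # promoted branch -> assertion, frame grows
            if len(frame) >= FRAME_SIZE_THRESHOLD:     # large enough: the frame is done
                frames.append(frame)
                frame = []
            continue
        # un-promoted control instruction: the frame must end here
        if len(frame) >= FRAME_SIZE_THRESHOLD:
            frames.append(frame)
        frame = []                                     # undersized frames are simply dropped
    return frames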

14 Frame Construction
A branch bias table is used to promote branches to assertions
It counts the number of times a branch had the same outcome
Two such tables are used:
One for conditional branches (taken vs. not-taken)
One for indirect branches (arbitrary target)
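A Python sketch of one plausible bias-table organization; the table size, promotion threshold, and PC-only indexing are illustrative simplifications (the real design also folds in branch history).

PROMOTION_THRESHOLD = 32        # illustrative; one of the tuning parameters on the later slides
TABLE_ENTRIES       = 16 * 1024 # illustrative size

class BranchBiasTable:
    """One instance for conditional branches (outcome = taken/not-taken) and a
    separate, smaller one for indirect branches (outcome = target address)."""

    def __init__(self):
        self.entries = [(None, 0)] * TABLE_ENTRIES     # (last outcome, consecutive-repeat count)

    def update(self, branch_pc, outcome):
        i = branch_pc % TABLE_ENTRIES                  # real indexing would also use path history
        last, count = self.entries[i]
        count = count + 1 if outcome == last else 1
        self.entries[i] = (outcome, count)
        return count >= PROMOTION_THRESHOLD            # True -> promote the branch to an assertion

    def demote(self, branch_pc):
        last, _ = self.entries[branch_pc % TABLE_ENTRIES]
        self.entries[branch_pc % TABLE_ENTRIES] = (last, 0)   # fired assertion: restart the count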

15 Branch Promotion/Demotion

16 Results

17 Results Branch bias tables: 64 KB for conditional branches, 10 KB for indirect branches

18 Frame Construction - Summary
We desire:
Construction of large frames
Promotion of consistently-behaving branches
Parameters to play with:
Branch promotion threshold
Minimum frame size
Branch history length
Size of the branch bias tables

19 Optimization Engine

20 Optimization Engine
Performs code optimization within frames
Is software-programmable, with its own instruction set and local memory
Optimizes frames in parallel with execution of the program
Can make speculative and unsafe optimizations, as long as assertions are inserted
Design is open – no implementation details proposed

21 Possible Optimizations
Value speculation
Pointer aliasing speculation
Eliminating stack operations across function call boundaries
Anything else a compiler does, plus what it is afraid to do
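As one example of a speculative optimization with assertion insertion, here is a Python sketch of value speculation over a toy instruction list; the dictionary IR, the movi/assert_mem_eq opcodes, and the confidence threshold are all invented for illustration.

def speculate_load_values(frame_ops, observed_value, confidence, threshold=0.95):
    """frame_ops: list of dicts such as {"op": "load", "dst": "r3", "addr": 0x1000}."""
    optimized = []
    for op in frame_ops:
        key = (op.get("op"), op.get("addr"))
        if op.get("op") == "load" and confidence.get(key, 0.0) >= threshold:
            value = observed_value[key]
            optimized.append({"op": "movi", "dst": op["dst"], "imm": value})  # use the predicted value now
            optimized.append({"op": "assert_mem_eq", "addr": op["addr"],      # verify it; a mismatch fires
                              "imm": value})                                  # the assertion and rolls back
        else:
            optimized.append(op)
    return optimized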

22 Frame Cache

23 Frame Cache
Delivers optimized frames for execution
Can increase instruction delivery throughput even without optimization
Does not replace the regular instruction cache
Must hold all cache lines of a frame, which may lead to cache fragmentation
A fired assertion causes the frame to be evicted from the cache

24 Frame Cache Implementation
(Figure: frames B and C occupying consecutive cache lines)

25 Frame Cache Implementation
Frames span multiple consecutive cache lines
Frames are indexed by their starting PC, which maps to the first cache line of the frame
The last cache line of a frame has a termination bit
The cache is 4-way set associative
Further implementation details are lacking
The authors' model is a bit unrealistic: cache size is measured in the number of frames, regardless of frame size
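A loose Python sketch of the organization described above; line size, set count, and the replacement policy are illustrative, and set-associativity is modeled per frame rather than per line, which only roughly matches the authors' frame-granularity model.

LINE_INSTRS = 8      # instructions per frame-cache line (illustrative)
NUM_SETS    = 64     # illustrative
WAYS        = 4      # 4-way set associative, as on the slide

class FrameCache:
    def __init__(self):
        self.sets = {s: {} for s in range(NUM_SETS)}            # set index -> {starting PC: lines}

    def insert(self, start_pc, instructions):
        # A frame occupies consecutive lines; the last line would carry the termination bit.
        lines = [instructions[i:i + LINE_INSTRS]
                 for i in range(0, len(instructions), LINE_INSTRS)]
        ways = self.sets[start_pc % NUM_SETS]
        if start_pc not in ways and len(ways) >= WAYS:
            ways.pop(next(iter(ways)))                          # naive replacement: evict some victim
        ways[start_pc] = lines

    def fetch(self, start_pc):
        frame = self.sets[start_pc % NUM_SETS].get(start_pc)    # indexed by the frame's starting PC
        return None if frame is None else [ins for line in frame for ins in line]

    def evict(self, start_pc):
        self.sets[start_pc % NUM_SETS].pop(start_pc, None)      # e.g., after the frame's assertion fires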

26 Effect on Frame Size Larger cache may hold larger frames

27 Effect on Frame Code Coverage
Cache misses mean no frame is fetched

28 Effect on Frame Completion

29 Frame Cache - Summary
Having a finite-sized frame cache does not severely affect:
Code coverage by frames
Instructions per frame
Successful frame completion

30 Frame Sequencer

31 Frame Sequencer
Augments a standard branch predictor with a frame predictor
The frame predictor predicts which frame to fetch from the frame cache
A selector chooses the final prediction:
Execute an optimized frame (frame predictor)
Execute an unoptimized basic block (regular branch predictor)
The selector is history-based or confidence-based

32 Frame Sequencer

33 Frame Predictor
Uses a table:
Indexed by path history (same as in the frame constructor)
Outputs the starting PC of a frame
Entries are added/removed when frames enter/leave the frame cache
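A Python sketch of the frame predictor together with the selector from slide 31; the hashing scheme, table management, and the frame_confident flag are assumptions for illustration.

HISTORY_LENGTH    = 6        # same path-history depth used by the frame constructor
PREDICTOR_ENTRIES = 16 * 1024

class FramePredictor:
    def __init__(self):
        self.table = {}                                  # history index -> frame starting PC

    @staticmethod
    def _index(path_history):
        return hash(tuple(path_history[-HISTORY_LENGTH:])) % PREDICTOR_ENTRIES

    def train(self, path_history, frame_start_pc):       # called when a frame enters the frame cache
        self.table[self._index(path_history)] = frame_start_pc

    def invalidate(self, path_history):                  # called when a frame leaves the frame cache
        self.table.pop(self._index(path_history), None)

    def predict(self, path_history):
        return self.table.get(self._index(path_history))

def select_next_fetch(frame_predictor, path_history, branch_pred_target, frame_confident):
    """The selector launches the predicted frame only when one exists and the
    (history- or confidence-based) confidence check passes; otherwise it follows
    the conventional branch predictor."""
    frame_pc = frame_predictor.predict(path_history)
    if frame_pc is not None and frame_confident:
        return ("frame", frame_pc)
    return ("basic_block", branch_pred_target)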

34 Predictor Accuracy Results
16K-entry frame predictor
Selector mechanism is unspecified
Low prediction percentages are compensated by the reduction in the total branch count

35 Frame Sequencer – Summary
Even if frames complete without firing assertions once started, we still need to know when to start a frame
Choose a frame based on the recent branch-target (path) history
Choose when to initiate this frame vs. listening to the conventional branch predictor

36 Putting it All Together

37 Putting it All Together
Configuration:
Branch bias table: direct-mapped, 64 KB for conditional branches, 10 KB for indirect branches, path history length of 6 (?)
Frame cache: 256 frames (of arbitrary size), 4-way set associative
Frame predictor: 16K entries, path history length of 6

38 Putting it All Together
8 SPECint95 benchmarks
Trace-driven simulator based on SimpleScalar
Alpha AXP ISA

39 Putting it All Together
Results:
Average frame size: 88 instructions
Frame coverage: 68% of the instruction stream
Frame completion rate: 97.81%
Frame predictor accuracy: 81.26%

40 Conclusion
The rePLay Framework provides a system to perform risky dynamic code optimizations in a speculative manner
Even with no optimizations, you still get:
Increased fetch bandwidth
A reduction in the number of branches to execute

