Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *


1 Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *

2 Outline
• Background and Motivation
• VPC (Virtual Program Counter) Prediction
• Results
• Conclusion

3 Direct vs. Indirect Branch
Conditional (direct) branch: br.cond TARGET at address A goes to TARG if taken (T) or to A+1 if not taken (N).
Indirect branch: R1 = MEM[R2]; branch R1 at address A has an unknown target (?).
• Indirect branches are costly for processor performance
  - Much more difficult to predict than conditional (direct) branches: multiple target addresses
  - An indirect branch predictor requires a large structure

4 Source Code Examples
• Switch structures
• Virtual function calls
Source code:
    Shape *s = …;
    a = s->area();   // virtual function call
Static assembly code:
    R1 = MEM[R2]     // function address lookup
    call R1          // a register-indirect call

5 Indirect Branch Mispredictions
[Chart: data from an Intel Core Duo processor]

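6 [Diagram: conventional branch prediction. Labels: PC (address 0x0800), GHR, Hash, Branch Predictor, Direction Predictor, Branch Target Buffer (BTB), Indirect Branch Predictor; candidate targets PC+1, TARG1, TARG2, selected by "Direct Branch?" and "Indirect Branch?"; predicted target: TARG2.]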

7 Outline
• Background and Motivation
• VPC (Virtual Program Counter) Prediction
• Results
• Conclusion

8 VPC Prediction: Basic Idea
• Key idea: Treat an indirect branch as multiple “virtual” conditional branches
  - Only for prediction purposes
• Use the conditional branch predictor

9 [Diagram: VPC prediction. Labels: PC (address 0x0800), GHR, Hash, VPC, Branch Predictor, Direction Predictor, Branch Target Buffer; virtual PCs VPC1 and VPC2; targets TARG1 and TARG2; output: predicted target.]

10 VPC Prediction: Basic Idea
• Key idea: Treat an indirect branch as multiple “virtual” conditional branches
  - Only for prediction purposes
• Use the conditional branch predictor
• Benefits:
  - No separate complex structure
  - Can be applied to any conditional branch prediction algorithm
  - Improving the conditional branch prediction algorithm will also improve indirect branch prediction accuracy

11 Inspiration: Static Devirtualization
Source code:
    Shape *s = …;
    a = s->area();                  // an indirect call
Optimized source code:
    Shape *s = …;
    if (s->type == Rectangle)       // a conditional branch at PC: X
        a = Rectangle::area();
    else if (s->type == Circle)     // a conditional branch at PC: Y
        a = Circle::area();
    else
        a = s->area();              // an indirect call at PC: Z
Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)
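The hand-devirtualized code on this slide can be written as compilable C++. The sketch below is only an illustration under stated assumptions: the `Kind` tag, the `Triangle` class, the member names, and the helper `area_devirtualized` are hypothetical; only `Shape`, `Rectangle`, `Circle`, `area()`, and the `type` field appear on the slide.

```cpp
#include <cassert>

// Hypothetical type tag; the slide only shows a `type` field being compared.
enum class Kind { Rectangle, Circle, Triangle };

struct Shape {
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;  // dispatched through the vtable: an indirect call
    Kind type;
};

struct Rectangle : Shape {
    Rectangle(double width, double height)
        : Shape(Kind::Rectangle), w(width), h(height) {}
    double area() const override { return w * h; }
    double w, h;
};

struct Circle : Shape {
    explicit Circle(double radius) : Shape(Kind::Circle), r(radius) {}
    double area() const override { return 3.14159265358979 * r * r; }
    double r;
};

// Hypothetical third class that is NOT special-cased below.
struct Triangle : Shape {
    Triangle(double base, double height)
        : Shape(Kind::Triangle), b(base), h(height) {}
    double area() const override { return 0.5 * b * h; }
    double b, h;
};

// Devirtualized call site: frequent receiver types are tested with
// conditional branches and called directly; everything else falls
// back to the original indirect call.
double area_devirtualized(const Shape& s) {
    if (s.type == Kind::Rectangle) {
        assert(dynamic_cast<const Rectangle*>(&s) != nullptr);
        // Qualified name: a direct (non-virtual) call.
        return static_cast<const Rectangle&>(s).Rectangle::area();
    } else if (s.type == Kind::Circle) {
        assert(dynamic_cast<const Circle*>(&s) != nullptr);
        return static_cast<const Circle&>(s).Circle::area();
    }
    return s.area();  // remaining indirect call
}
```

Calling `area_devirtualized` on a `Triangle` takes the fallback path, which is the case the slide labels "an indirect call at PC: Z".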

12 VPC Prediction
Source code:
    Shape *s = …;
    a = s->area();   // an indirect call
Static assembly code:
    R1 = MEM[R2]
    call R1          // PC: L
Dynamic virtual branches (for prediction purposes):
    conditional jump TARGET1   // virtual PC = L
    conditional jump TARGET2   // virtual PC = L XOR HASHVAL[1]
    conditional jump TARGET3   // virtual PC = L XOR HASHVAL[2]
    conditional jump TARGET4   // virtual PC = L XOR HASHVAL[3]

13 Virtual PC Address Generation
• Use the original PC address and the iteration counter value
[Diagram: the iteration counter value selects an entry of the hash value table (0xabcd, 0x018a, 0x7a9c, 0x…); the PC and the selected hash value produce the virtual PC.]
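The address generation step is small enough to sketch directly. This is a minimal illustration, assuming a four-entry table and 1-based iteration numbers: `kMaxIter`, `kHashVal`, and `virtual_pc` are hypothetical names, and placing the figure's three visible constants at indices 1 to 3 (with 0 at index 0 so the first virtual branch uses the real PC) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed maximum number of virtual branches per indirect branch.
constexpr std::size_t kMaxIter = 4;

// Illustrative hash value table. Entry 0 is zero so that the first
// virtual branch uses the unmodified PC; the other three constants are
// borrowed from the slide's figure, but their positions are an assumption.
const std::array<std::uint32_t, kMaxIter> kHashVal = {
    {0x0000, 0xabcd, 0x018a, 0x7a9c}};

// Virtual PC for the given iteration (1-based).
std::uint32_t virtual_pc(std::uint32_t pc, std::size_t iter) {
    assert(iter >= 1 && iter <= kMaxIter);
    return pc ^ kHashVal[iter - 1];
}
```

For example, `virtual_pc(0x0800, 1)` is the real PC `0x0800`, while later iterations give addresses that differ from it and from each other, so each virtual branch gets its own predictor and BTB entries.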

14 VPC Prediction Process-I
Real instruction:
    call R1              // PC: L
Virtual instructions:
    cond. jump TARG1     // VPC: L
    cond. jump TARG2     // VPC: VL2
    cond. jump TARG3     // VPC: VL3
    cond. jump TARG4     // VPC: VL4
Iteration 1: PC = L, GHR = 1111. BTB: TARG1. Direction predictor: not taken. Next iteration.

15 VPC Prediction Process-II
Real and virtual instructions: same as on slide 14.
Iteration 2: VPC = VL2, VGHR = 1110. BTB: TARG2. Direction predictor: not taken. Next iteration.

16 VPC Prediction Process-III
Real and virtual instructions: same as on slide 14.
Iteration 3: VPC = VL3, VGHR = 1100. BTB: TARG3. Direction predictor: taken. Predicted target = TARG3.

17 VPC Prediction Algorithm
• Access the conditional branch predictor and the BTB with VPCA and VGHR
• Compute VPCA and VGHR for the next iteration
  - VPCA = PC XOR HASHVAL[iter]
  - VGHR = VGHR << 1
• Predicted not taken: move to the next iteration
• Predicted taken: use the target in the BTB as the target of the indirect branch
• Give up and stall if the iteration count > MAX_ITER or on a BTB miss
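The prediction loop can be sketched as a small software model. This is not the authors' hardware: `ToyPredictor`, `Prediction`, `vpc_predict`, the table sizes, and the hash constants are all hypothetical, and a table of 2-bit counters indexed by VPCA XOR VGHR stands in for the perceptron predictor used in the evaluation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr std::size_t kMaxIter = 4;      // assumed MAX_ITER
constexpr std::size_t kCounters = 4096;  // assumed direction-table size

// Illustrative hash values; entry 0 is zero so iteration 1 uses the real PC.
const std::array<std::uint32_t, kMaxIter> kHashVal = {
    {0x0000, 0xabcd, 0x018a, 0x7a9c}};

struct ToyPredictor {
    // BTB modeled as a map from (virtual) PC to target address.
    std::unordered_map<std::uint32_t, std::uint32_t> btb;
    // 2-bit saturating counters, zero-initialized (strongly not taken).
    std::array<std::uint8_t, kCounters> counters{};

    static std::size_t index(std::uint32_t vpca, std::uint32_t vghr) {
        return (vpca ^ vghr) % kCounters;
    }
    bool predict_taken(std::uint32_t vpca, std::uint32_t vghr) const {
        return counters[index(vpca, vghr)] >= 2;
    }
};

struct Prediction {
    bool stalled;          // true: no prediction, the front end stalls
    std::uint32_t target;  // valid only when !stalled
    std::size_t iter;      // 1-based iteration that produced the target
};

Prediction vpc_predict(const ToyPredictor& p, std::uint32_t pc,
                       std::uint32_t ghr) {
    assert(kHashVal[0] == 0);  // iteration 1 must use the real PC
    std::uint32_t vpca = pc;
    std::uint32_t vghr = ghr;
    for (std::size_t iter = 1; iter <= kMaxIter; ++iter) {
        const auto entry = p.btb.find(vpca);
        if (entry == p.btb.end()) break;  // BTB miss: give up
        if (p.predict_taken(vpca, vghr)) {
            // Predicted taken: the BTB target is the indirect target.
            return Prediction{false, entry->second, iter};
        }
        // Predicted not taken: set up the next virtual branch.
        if (iter < kMaxIter) {
            vpca = pc ^ kHashVal[iter];
            vghr <<= 1;  // shift in a not-taken outcome
        }
    }
    return Prediction{true, 0u, 0u};  // iteration limit or BTB miss
}
```

With three targets installed and only the third virtual branch's counter in the taken range, the loop walks two not-taken iterations and returns the third target, mirroring the Process-I to Process-III walkthrough.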

18 VPC Training Algorithm
• An iterative process performed when an indirect branch is retired (not on the critical path)
• Update the conditional branch predictor
  - Virtual branch has the correct target: taken
  - Virtual branch has a wrong target: not taken
• Update the replacement policy bits of the correct target in the BTB
• Insert the correct target into the BTB if no virtual branch holds it
  - Conditional branch predictor: taken
  - Replace the least frequently used (LFU) target
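The training bullets can be sketched the same way. This is a simplified model under assumptions, not the authors' design: `ToyPredictor`, `BtbEntry`, `vpc_train`, the 2-bit counters, and the per-entry `uses` count (standing in for the replacement policy bits) are hypothetical, and the choice to prefer an empty BTB slot over an LFU victim is one possible reading of the insert step.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr std::size_t kMaxIter = 4;      // assumed MAX_ITER
constexpr std::size_t kCounters = 4096;  // assumed direction-table size
const std::array<std::uint32_t, kMaxIter> kHashVal = {
    {0x0000, 0xabcd, 0x018a, 0x7a9c}};   // illustrative hash values

struct BtbEntry {
    std::uint32_t target;
    std::uint32_t uses;  // stand-in for the replacement policy bits
};

struct ToyPredictor {
    std::unordered_map<std::uint32_t, BtbEntry> btb;
    std::array<std::uint8_t, kCounters> counters{};  // 2-bit counters

    static std::size_t index(std::uint32_t vpca, std::uint32_t vghr) {
        return (vpca ^ vghr) % kCounters;
    }
    void train(std::uint32_t vpca, std::uint32_t vghr, bool taken) {
        std::uint8_t& c = counters[index(vpca, vghr)];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
    }
};

// Called when an indirect branch retires with a known correct target.
void vpc_train(ToyPredictor& p, std::uint32_t pc, std::uint32_t ghr,
               std::uint32_t correct) {
    std::size_t victim = kMaxIter;  // 0-based iteration to insert into
    std::uint32_t victim_uses = 0;
    bool victim_is_miss = false;
    for (std::size_t i = 0; i < kMaxIter; ++i) {
        const std::uint32_t vpca = pc ^ kHashVal[i];
        const std::uint32_t vghr = ghr << i;
        const auto it = p.btb.find(vpca);
        if (it == p.btb.end()) {
            // Empty slot: preferred insertion point (first one wins).
            if (!victim_is_miss) { victim = i; victim_is_miss = true; }
            continue;
        }
        if (it->second.target == correct) {
            p.train(vpca, vghr, true);  // correct target: taken
            ++it->second.uses;          // update replacement info
            return;
        }
        p.train(vpca, vghr, false);     // wrong target: not taken
        if (!victim_is_miss &&
            (victim == kMaxIter || it->second.uses < victim_uses)) {
            victim = i;                 // least frequently used so far
            victim_uses = it->second.uses;
        }
    }
    // No virtual branch held the correct target: insert it.
    assert(victim < kMaxIter);
    const std::uint32_t vpca = pc ^ kHashVal[victim];
    p.btb[vpca] = BtbEntry{correct, 0u};
    p.train(vpca, ghr << victim, true);
}
```

Training twice with the same target first inserts it at the real PC and then reinforces it; training with a new target afterwards pushes the first virtual branch toward not taken and installs the new target at the next virtual PC.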

19 Hardware Cost and Complexity
[Diagram labels: PC, iteration counter, hash function, VPCA, GHR, VGHR, branch direction predictor (BP), BTB, taken/not taken, direct/indirect, "Predict?", target address.]

20 Outline
• Background and Motivation
• VPC Prediction
• Results
• Conclusion

21 Simulation Methodology
• Pin-based x86 simulator
• Processor configuration
  - 4K-entry BTB
  - 64KB perceptron conditional branch predictor
  - Minimum 30-cycle branch misprediction penalty
  - 8-wide, 512-entry instruction window
  - Less aggressive processor (in the paper)
  - Gshare, O-GEHL conditional branch predictors
• Indirect branch intensive benchmarks
  - 5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++
  - IBM server benchmarks (OLTP) (in the paper)

22 VPC MPKI

23 VPC Performance

24 Different Direction Predictors
[Chart: conditional branch accuracy (%) of the direction predictors: 98%, 98.3%, 99%]
• Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!

25 VPC vs. Static Devirtualization
• Advantages of static devirtualization
  - Enables other compiler optimizations (function inlining)
  - Can reduce the number of mispredictions
• Disadvantages/limitations of static devirtualization
  - Not all indirect branches can be statically devirtualized
  - Requires extensive static analysis/profiling
  - Lacks adaptivity to run-time input set and phase behavior
• VPC prediction can be used with statically devirtualized binaries
  - 10% improvement on top of static devirtualization

26 Outline
• Background and Motivation
• VPC Prediction
• Results
• Conclusion

27 Conclusion
• VPC prediction dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware
• VPC prediction reduces the branch misprediction penalty without significant extra hardware storage
  - Baseline: 26% IPC improvement
  - O-GEHL: 31% IPC improvement
• VPC can be an enabler, encouraging programmers to use object-oriented programming styles

28 Thank you! Questions?

29 VPC vs. Cascaded IBP

30 VPC vs. Other Indirect BP

                             gcc       crafty    eon       perlbmk
Tagged Target Cache (TTC)    12KB      1.5KB     >192KB    1.5KB
Cascaded                     >176KB    2.8KB     >176KB    2.8KB

TTC: Chang et al. (’96); Cascaded: Driesen and Hölzle (’98)

31 Iterative prediction
• It doesn’t hurt performance significantly
• Why? Most predictions are made within a few iterations.

32 VPC Hit Iteration Counter

33 Can the BTB be pipelined?
• Yes
• The next iteration of VPC prediction can be started without knowing the outcome of the previous iteration in the pipeline.
• Consecutive VPC prediction iterations can simply be pipelined.
• If an iteration turns out not to be needed, its prediction is simply discarded.

34 Is a 4K-entry BTB too large?
• Pentium 4 has a 4K-entry BTB
• IBM Z series (z990) has an 8K-entry BTB
• AMD Athlon and Hammer have 2K-entry BTBs

35 BTB Size Effects

36 VPC Prediction Accuracy

37 Target Distribution

38 VPC vs. Tagged Target Cache

39 VPC Prediction Delay Effects

40 VPC with O-GEHL BP

41 VPC with a Less Aggressive Processor

42 Server Benchmarks

43 Server Benchmarks (VPC vs. TTC)

44 VPC Prediction vs. Compiler-Based Devirtualization (With TTC)

45 Conditional Br. Prediction Effects
• VPC prediction reduces the accuracy of conditional branch direction prediction, but only slightly.

46 Indirect Branch Mispredictions

47 VPC Prediction with Static Devirtualization
• VPC prediction can be used with statically devirtualized binaries.
  - Not all indirect branches can be devirtualized

48 VPC Training: Correct Prediction
Retirement (real instruction): call R1 // PC: L
Known: correctly predicted, predicted iter = 3
Update the BTB replacement counter

Iter  VPCA  VGHR    Direction BP  BTB
1     L     GHR     Not-taken     -
2     VL2   GHR<<1  Not-taken     -
3     VL3   GHR<<2  Taken         Update replacement

49 VPC Training: Misprediction
Retirement (real instruction): call R1 // PC: L
Known: mispredicted, correct target address
Update the BTB replacement counter

Iter  VPCA  VGHR    BTB Access        Train Direction BP  Train BTB
1     L     GHR     TARG != Correct   Not-taken           -
2     VL2   GHR<<1  TARG != Correct   Not-taken           -
3     VL3   GHR<<2  Target = Correct  Taken               Update replacement

50 VPC Training: Misprediction
Retirement (real instruction): call R1 // PC: L
Known: mispredicted, correct target address

Iter  VPCA  VGHR    BTB Access       Train Direction BP  Train BTB
1     L     GHR     TARG != Correct  Not-taken           -
2     VL2   GHR<<1  TARG != Correct  Not-taken           -
3     VL3   GHR<<2  TARG != Correct  Not-taken           -

Result: No Target

51 VPC Training: Misprediction
Retirement (real instruction): call R1 // PC: L
Known: mispredicted, correct target address

Iter  VPCA  VGHR    BTB Access       Repl. counter  Train BP   Train BTB
1     L     GHR     TARG != Correct  3              Not-taken  -
2     VL2   GHR<<1  TARG != Correct  1              Not-taken  Nothing
3     VL3   GHR<<2  TARG != Correct  8              Not-taken  -

Overlay values on the slide (row not recoverable from the transcript): Replacement, Taken, Insert, 0

52 Does VPC need an extra BTB port?
• No
• A read from the BTB is only needed when a branch is mispredicted.
• 95% of branches are correctly predicted with VPC.
• The read is performed only when there is an available BTB port.

