Published by John Richardson. Modified over 9 years ago.
1
Dynamic Branch Prediction During Context Switches
Jonathan Creekmore, Nicolas Spiegelberg
2
Overview
- Branch Prediction Techniques
- Context Switching
- Compression of Branch Tables
- Simulation
- Hardware Model
- Results
- Analysis
3
Case for Branch Prediction
- Multiple instructions at one time
  - Between 15 and 20
- Branches occur every 5 instructions
  - if, while, for, function calls, etc.
- Stalling pipeline is unacceptable
  - Lose all advantage of multiple instruction issue
4
Context Switch Time
- A context switch causes program execution to be paused
  - State of program is saved
  - New program is executed
- Eventually, original program begins executing again
- Not all of the CPU state is saved
  - Such as the branch predictor tables
5
Context Switch Time
- One set of branch predictor state
- Context switch causes a new application to use the previous application’s branch predictor state
  - Degrades performance for all applications
- Solution: Save the state of the branch predictor at context switch time
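The idea of saving and restoring predictor state can be sketched in a few lines. This is only an illustration, not the presenters' implementation: the class name `BimodalPredictor`, its methods, the indexing by `pc >> 2`, and the weakly-not-taken initial value are all assumptions made for the example.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low bits of the branch address."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [1] * entries  # assumed initial state: weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

    def save_state(self):
        return list(self.table)

    def restore_state(self, state):
        self.table = list(state)


# Program A trains a branch as taken, then is switched out.
bp = BimodalPredictor()
for _ in range(4):
    bp.update(0x400100, taken=True)
saved = bp.save_state()

# Program B has a branch that maps to the same entry and is never taken.
for _ in range(4):
    bp.update(0x400100, taken=False)
polluted_prediction = bp.predict(0x400100)

# Program A resumes with its own predictor state restored.
bp.restore_state(saved)
restored_prediction = bp.predict(0x400100)
```

Without the restore step, program A would resume with program B's counter value and mispredict until the counter warms up again.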
6
Saving Branch State Table
- Simple branch predictors still have a large number of bits
- Storing and restoring the branch predictor should not take too long
  - Lose the gain of storing/restoring if it takes longer than the “warm-up” time of the branch predictor
7
Compression
- Compression is the key
  - Requires less storage
- Needs to be done carefully
  - Some lossless compression schemes can inflate number of bits
  - Luckily, lossy compression is acceptable
8
Semi-Lossy Compression
- Applies to 2-bit predictors
- Key is to store just taken/not-taken state
  - Ignores strong/weak
[Figure: 2-bit predictor states marked S (strong) and W (weak), with T and NT labels]
9
Semi-Lossy Decompression
[Figure: 2-bit predictor state diagram with states SNT, WNT, WT, and ST and transitions labeled T and NT]
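A minimal sketch of the semi-lossy round trip follows. The counter encoding (0=SNT, 1=WNT, 2=WT, 3=ST) and the choice to restore each entry to the weak state of its direction are assumptions for illustration; the slides do not state which state decompression restores.

```python
# Assumed encoding of a 2-bit counter: 0=SNT, 1=WNT, 2=WT, 3=ST.
WEAK_NOT_TAKEN, WEAK_TAKEN = 1, 2


def semi_lossy_compress(counters):
    """Keep one bit per counter: 1 if it predicts taken, 0 otherwise."""
    return [1 if c >= 2 else 0 for c in counters]


def semi_lossy_decompress(bits):
    """Rebuild 2-bit counters; restoring to the weak state is an assumption."""
    return [WEAK_TAKEN if b else WEAK_NOT_TAKEN for b in bits]


table = [0, 1, 2, 3, 3, 0]
bits = semi_lossy_compress(table)       # [0, 0, 1, 1, 1, 0]
restored = semi_lossy_decompress(bits)  # [1, 1, 2, 2, 2, 1]
```

The predicted direction of every entry survives the round trip; only the strong/weak distinction is lost, which halves the storage.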
10
Lossy Compression
- Branch prediction is just an educated guess
- Achieve higher compression ratio if some information is lost
- Majority rules
  - Used by correlating branch predictor
11
Lossy Compression
[Figure: groups of T/NT predictions, each reduced to a single value, labeled 4x]
12
Lossy Decompression
- Reinitialize all elements for an address to the stored value
- Best case: all elements are correct
- Worst case: 50% of elements are correct
- Remember: Branch predictors are just educated guesses
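The majority-rule scheme can be sketched as below. Grouping consecutive entries, the default group size of 4, and breaking ties in favor of taken are assumptions made for this example, not details confirmed by the slides.

```python
def lossy_compress(directions, group=4):
    """Store one majority bit per group of taken (1) / not-taken (0) predictions."""
    out = []
    for start in range(0, len(directions), group):
        chunk = directions[start:start + group]
        # Ties resolve to taken here; that tie-break is an assumption.
        out.append(1 if 2 * sum(chunk) >= len(chunk) else 0)
    return out


def lossy_decompress(majority_bits, group=4):
    """Reinitialize every element of a group to the stored majority value."""
    return [bit for bit in majority_bits for _ in range(group)]


directions = [1, 1, 1, 0,  0, 0, 0, 0,  1, 0, 1, 0]
packed = lossy_compress(directions)    # [1, 0, 1]
unpacked = lossy_decompress(packed)
correct = sum(a == b for a, b in zip(directions, unpacked))  # 9 of 12
```

The three groups show the range described above: a unanimous group is restored exactly, a 3-to-1 group loses one entry, and an evenly split group gets only half of its entries right.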
13
Simulation
- Modified SimpleScalar’s sim-bpred to support context switching
  - Not necessary to actually switch between programs
  - On context switch, corrupt branch predictor table according to a “dirty” percentage to simulate another program running
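The corruption step might look roughly like the following. This is a standalone sketch, not code from sim-bpred: the function `corrupt_table`, its parameters, and the choice to overwrite dirty entries with uniformly random 2-bit values are all assumptions.

```python
import random


def corrupt_table(table, dirty_fraction, rng):
    """Return a copy of `table` with a fraction of entries overwritten at random."""
    n_dirty = round(dirty_fraction * len(table))
    dirty_indices = rng.sample(range(len(table)), n_dirty)
    corrupted = list(table)
    for i in dirty_indices:
        corrupted[i] = rng.randrange(4)  # any 2-bit counter value
    return corrupted, sorted(dirty_indices)


rng = random.Random(0)   # fixed seed so runs are repeatable
table = [3] * 2048       # a fully trained 2048-entry table, for illustration
corrupted, touched = corrupt_table(table, 0.25, rng)
```

Entries outside `touched` keep their trained values, which models another program that only disturbed part of the table before the original program resumed.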
14
Simulation
- Testing compression/decompression becomes simple
  - Instead of corrupting branch predictor table, replace entries with the value after compression/decompression
  - Testing with:
    - 2-bit semi-lossy compression
    - 4-bit lossy compression
    - 8-bit lossy compression
15
Hardware Model
- Compression and decompression blocks are fully pipelined
- Compression and decompression blocks can handle n bits of compressed data at a time
- Compression and decompression occur simultaneously
16
Hardware Model
- Utilize data independence
  - Compress 128 bits into 64 bits at one time
  - Pipeline overhead should be minimal compared to clock cycle savings
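As a software analogy for the 128-to-64-bit step, the sketch below packs the direction bit of 64 two-bit counters into one 64-bit word. It assumes that counters are stored in adjacent bit pairs and that the high bit of each pair encodes taken/not-taken; neither detail is given in the slides.

```python
def compress_word(word128):
    """Pack the high (direction) bit of 64 two-bit counters into one 64-bit word."""
    out = 0
    for i in range(64):
        direction = (word128 >> (2 * i + 1)) & 1
        out |= direction << i
    return out


# Two counters in the low bits: counter 0 = 0b11 (ST), counter 1 = 0b01 (WNT).
word = 0b0111
packed = compress_word(word)

# A word in which every counter is strongly taken.
all_taken = compress_word((1 << 128) - 1)
```

Each output bit depends on exactly one input bit, so in hardware the loop would be 64 independent wires rather than sequential steps, which is the data independence the slide refers to.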
17
Programs Simulated
- Several SPEC2000 CINT2000 programs simulated
  - 164.gzip: Compression
  - 175.vpr: FPGA Place and route
  - 181.mcf: Combinatorial Optimization
  - 197.parser: Word Processing
  - 256.bzip2: Compression
18
Predictor Types
- 2048 entry bimodal predictor (4096 bits)
- 4096 entry bimodal predictor (8192 bits)
- 1024 entry two-level predictor with 4-bit history size (16384 bits)
- 4096 entry two-level predictor with 8-bit history size (1048576 bits)
- 8192 entry two-level predictor with 8-bit history size (2097152 bits)
19
2048 Entry Bimodal Predictor
22
4096 Entry Bimodal Predictor
25
1024 entry two-level predictor with 4-bit history size
28
4096 entry two-level predictor with 8-bit history size
31
8192 entry two-level predictor with 8-bit history size
34
Timing Comparison
- Miss penalty: 10 clock cycles
- Bandwidth: 64 bits per clock cycle
35
Timing Equations
- General timing equation
- Special case for ratio of 0
36
Timing Comparison
- Miss penalty: 15 clock cycles
- Bandwidth: 64 bits per clock cycle
37
Timing Comparison
- Miss penalty: 10 clock cycles
- Bandwidth: 128 bits per clock cycle
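To get a feel for these parameters, here is a back-of-the-envelope cost model. It is not the presentation's timing equation, which is not preserved in this transcript: the functions `transfer_cycles` and `break_even_mispredictions` are hypothetical, a `ratio` of 1 is taken to mean an uncompressed table, and a single pass over the table is assumed.

```python
import math


def transfer_cycles(table_bits, ratio, bandwidth_bits_per_cycle):
    """Cycles to move the compressed table once (ratio=1 means uncompressed)."""
    return math.ceil(table_bits / ratio / bandwidth_bits_per_cycle)


def break_even_mispredictions(cycles, miss_penalty):
    """Number of mispredictions whose penalty equals the transfer cost."""
    return cycles / miss_penalty


# 2048-entry bimodal predictor (4096 bits) at 64 bits per clock cycle.
uncompressed = transfer_cycles(4096, 1, 64)   # 64 cycles
semi_lossy = transfer_cycles(4096, 2, 64)     # 32 cycles

# 4096-entry two-level predictor (1048576 bits), 4x lossy, 128 bits per cycle.
large_lossy = transfer_cycles(1048576, 4, 128)  # 2048 cycles

# With a 10-cycle miss penalty, moving the uncompressed 4096-bit table
# costs as much as 6.4 mispredictions.
break_even = break_even_mispredictions(uncompressed, 10)
```

Under this simple model, saving state only pays off when it avoids more mispredictions than the break-even count, which is why the larger tables depend on higher compression ratios.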
38
Summary
- Dynamic branch prediction is necessary for modern high-performance processors
- Context switches reduce the effectiveness of dynamic branch prediction
- Naïvely saving the branch predictor state is costly
39
Summary
- Compression can be used to reduce the cost of saving branch predictor state
- Higher compression ratios reduce the fixed save/restore time at the cost of more mispredictions
  - For low-frequency context switches, this yields an improvement in performance
40
Questions