Chang Joo Lee Hyesoon Kim* Onur Mutlu** Yale N. Patt

Performance-Aware Speculation Control using Wrong Path Usefulness Prediction
HPS Research Group, University of Texas at Austin; *School of Computer Science, Georgia Institute of Technology; **Microsoft Research
My name is Chang Joo Lee. I am going to talk about our work, Performance-Aware Speculation Control using Wrong Path Usefulness Prediction. I worked on this study with Hyesoon Kim, Onur Mutlu, and my advisor, Yale Patt.

Outline Motivation Mechanism Experimental Evaluation Conclusion
2. First, I will motivate the problem we are trying to solve.

Fetch Gating (Pipeline Gating)
Proposed by Manne et al. [ISCA98]. Stops fetching instructions on the wrong path to save energy. Assumes wrong-path instructions do not contribute to performance and only consume energy. Various fetch gating mechanisms: Baniasadi and Moshovos [ISLPED01], Karkhanis et al. [ISLPED02], Aragon et al. [HPCA03], Buyuktosunoglu et al. [GLSVLSI03], Collins et al. [MICRO04].
3. Fetch gating, or pipeline gating, was first proposed by Manne et al. in ISCA 1998. It tries to stop fetching instructions on the wrong path, assuming that wrong-path instructions do not contribute to performance and only consume energy. It stops fetching instructions if the number of low-confidence branches is greater than a certain static threshold. After this work, a number of papers proposed various fetch gating mechanisms.

Limitations of Previous Mechanisms
Hardware complexity: branch confidence estimator, changes to critical/power-hungry structures. Additional hardware can offset the energy savings due to fetch gating. Assumption: wrong-path execution consumes energy but is useless for performance.
4. The previous mechanisms have two limitations. First of all, (CLICK) many proposals add significant hardware complexity. For example, Manne’s mechanism needs a branch confidence estimator to decide whether a branch is low confidence or not. Some other mechanisms require changes to critical or power-hungry structures. (CLICK) More importantly, they assume that wrong-path instructions only consume energy without providing any performance benefit. Is this assumption always right? I will talk about this problem in the next slides.

Is Wrong Path Execution Really Useless?
Perfect fetch gating. parser: energy consumption decreases by 28% but performance degrades by 5%. mcf: performance degrades by 30% and energy consumption increases by 15%. Performance of most benchmarks increases with perfect fetch gating.
6. This graph shows the benefits we can achieve with perfect fetch gating compared to the baseline for the SPEC2000 integer benchmarks. Using oracle information, perfect fetch gating stops fetching instructions after every branch misprediction. The yellow bar indicates the change in IPC and the green bar indicates the change in energy. For most benchmarks, perfect fetch gating increases performance in addition to saving energy. This is because perfect fetch gating eliminates all the cache pollution from wrong-path execution. However, (CLICK) mcf shows the opposite behavior. Perfect fetch gating degrades its performance by 30% and also increases its energy consumption by 15%. For parser, (CLICK) perfect fetch gating reduces energy consumption by 28% but also degrades performance by 5%. I will explain why perfect fetch gating degrades performance significantly for these benchmarks.

Why Does Performance Degrade with Perfect Fetch Gating?
[Figure: breakdown of L2 cache fills into correct-path fills, used wrong-path fills, and unused wrong-path fills, per benchmark. Annotations: mcf MPKI 36.6; parser MPKI 1.5.]
parser: 37% of L2 fills are used wrong-path fills, 14% are unused wrong-path fills → 5% performance degradation with perfect fetch gating. mcf: almost all wrong-path L2 fills are used, memory intensive (MPKI: 36.6) → 30% performance degradation with perfect fetch gating. Wrong-path execution can prefetch useful data: Butler [Thesis93], Pierce and Mudge [IPPS94, MICRO96], Mutlu et al. [IEEE TC05].
7. The graph shows the breakdown of all L2 cache fills by correct-path and wrong-path execution in the baseline processor. The green portion indicates the cache line fills from correct-path instructions, so they are all used. The orange and yellow portions indicate the cache line fills that are generated by wrong-path instructions. The orange ones are never used, which indicates the cache pollution effect. The yellow ones are eventually used by the correct path, which is the positive prefetching effect of wrong-path execution. For mcf, (CLICK) almost all the wrong-path cache fills are used by correct-path instructions. Therefore, eliminating all the wrong-path instructions from mcf degrades performance by 30%, as shown in the previous graph. The performance degradation is so significant because mcf is very memory intensive. The increased execution time also increases energy consumption for mcf by 15%. On the other hand, (CLICK) for parser, 37% of the total L2 fills are generated by the wrong path and used by the correct path, while 14% of the total L2 fills are generated by the wrong path but never used, which means it has both wrong-path cache pollution and a positive prefetching effect. Because the prefetching effect is larger, eliminating all the wrong-path instructions with perfect fetch gating degrades performance by 5%, as shown in the previous slide. (CLICK) As shown for mcf and parser, wrong-path instructions are not always useless. In fact, wrong-path execution can be very useful because it prefetches useful data. I am going to explain why wrong-path execution is sometimes useful with an example.

Why Can Wrong Path Execution Be Useful?
From mcf. Hammock structure within a frequently executed loop. BR in BB2 is frequently mispredicted. Since memory latency is large, the wrong-path prefetching benefit can be significant. Taking into account wrong-path usefulness is important.
[Figure: control flow graph from mcf. BB2 ends in branch BR; the not-taken path is BB3 (Load A, Load B, JMP), the taken path is BB4 (Load A, Load B), followed by BB5 (Load C). Wrong-path loads miss the L2 cache; after misprediction recovery the correct-path loads hit the cache.]
8. This slide shows a control flow graph from mcf. Based on our experiments, we found that many benchmarks take advantage of this structure, especially parser. We simplified the control flow graph for ease of explanation. The control flow graph shows a hammock structure, usually within a frequently executed loop, and the branch in basic block 2 is frequently mispredicted.
1) (CLICK) Suppose the processor executes this control flow graph.
2) Let’s say the branch in basic block 2 is mispredicted as not-taken (CLICK). The processor will execute the wrong-path instructions through the not-taken path.
3) (CLICK) Let’s say all the load instructions in basic blocks 3 and 5 miss the L2 cache. Then memory requests for addresses A, B and C will be generated, and the data for them will be brought into the cache while the processor continues executing wrong-path instructions.
4) (CLICK) Later, the processor resolves the branch instruction in basic block 2 and performs branch misprediction recovery.
5) Because the data for addresses A, B and C was filled into the cache by the wrong-path instructions, (CLICK) Load A, B and C on the correct path will now hit the cache.
This is the prefetching effect of wrong-path execution, (CLICK) and mcf and parser take advantage of it. Since the memory latency is large compared to the processor cycle time, the wrong-path benefit is significant for mcf and parser. Therefore, a smarter fetch gating mechanism has to take into account the performance benefits of wrong-path execution to avoid significant performance loss.

Now, I am going to talk about our mechanism that takes into account the usefulness of wrong-path execution.

Our Solution: Performance-Aware Speculation Control
Hardware complexity: simple low-cost fetch gating mechanism. Wrong-path usefulness: low-cost Wrong Path Usefulness Predictor (WPUP). Fetch gate only when wrong-path execution is useless.
[Figure: block diagram of Performance-Aware Speculation Control, with signals Branch Count, Lookup, Useful, and Gate Enable connecting the Fetch Engine, Fetch Gating, and WPUP blocks.]
10. We are proposing performance-aware speculation control, which can solve the limitations of previous fetch gating mechanisms. Our performance-aware speculation control consists of two parts. (CLICK) One is a very low-cost branch count-based fetch gating mechanism that does not need significant hardware. (CLICK) The other one is the wrong-path usefulness predictor, WPUP. (CLICK) With the help of WPUP, we fetch gate only when wrong-path execution is predicted to be useless. For example, if the fetch gating mechanism wants to fetch gate, the processor looks up WPUP. If WPUP predicts that the current path will be useful, we continue to fetch.

Our Fetch Gating Mechanism
Branch-count based mechanism. More branches → higher chance of misprediction. Fetch gate if (# of branches) > threshold. Mispredictions show phase behavior. Threshold is determined by branch prediction accuracy for a certain period. Higher accuracy → higher threshold. No need for complex logic (e.g. confidence estimator).
[Figure: Performance-Aware Speculation Control block diagram (Fetch Engine, Fetch Gating, WPUP).]
11. I am going to explain the first component, that is, our branch count-based fetch gating mechanism. (CLICK) The basic idea of our fetch gating is that as more branch instructions are outstanding in the pipeline, the probability of having a misprediction goes up. We fetch gate if the number of branches in flight is greater than a certain threshold. (CLICK) Because branch mispredictions show phase behavior, we dynamically adjust the threshold based on the branch prediction accuracy, which is updated periodically. For example, if the branch prediction accuracy was high for the last period, we apply a higher threshold for the current period. This is very simple and cost effective because it requires only three counters to count the number of branches and calculate the branch prediction accuracy.
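The branch count-based gating described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware: the threshold values are taken from the table on the Simulation Methodology slide, while the class and method names, the period length, the starting threshold, the handling of range boundaries, and the omission of pipeline flushes are all hypothetical simplifications.

```python
# Hypothetical sketch of branch count-based fetch gating.
# Threshold values come from the talk's methodology table; everything else
# (names, period length, boundary handling, initial threshold) is assumed.

# (lower bound of accuracy range in %, gating threshold), highest range first
THRESHOLDS = [(99, 18), (97, 16), (95, 13), (93, 12), (90, 11), (85, 7), (0, 3)]


def threshold_for(accuracy_pct):
    """Map the last period's branch prediction accuracy to a gating threshold."""
    for lower_bound, threshold in THRESHOLDS:
        if accuracy_pct >= lower_bound:
            return threshold
    return THRESHOLDS[-1][1]


class BranchCountGate:
    """Three counters: branches in flight, resolved branches, mispredictions."""

    def __init__(self, period=1000):
        self.period = period              # assumed: resolved branches per period
        self.in_flight = 0
        self.resolved = 0
        self.mispredicted = 0
        self.threshold = THRESHOLDS[-1][1]  # assumed conservative starting value

    def on_fetch_branch(self):
        self.in_flight += 1

    def on_resolve_branch(self, mispredicted):
        # Simplification: flushing of younger branches is not modeled.
        self.in_flight -= 1
        self.resolved += 1
        self.mispredicted += int(mispredicted)
        if self.resolved == self.period:
            accuracy = 100.0 * (self.resolved - self.mispredicted) / self.resolved
            self.threshold = threshold_for(accuracy)
            self.resolved = 0
            self.mispredicted = 0

    def wants_to_gate(self):
        """Fetch gate if (# of branches in flight) > threshold."""
        return self.in_flight > self.threshold
```

With a high measured accuracy the gate tolerates many outstanding branches (threshold 18), whereas a poorly predicted phase drops the threshold to 3 and gates much earlier.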

Performance-Aware Speculation Control
Two WPUP mechanisms: branch PC-based WPUP (fine grained) and phase-based WPUP (coarse grained). Can be combined with other fetch gating mechanisms.
[Figure: Performance-Aware Speculation Control block diagram (Fetch Engine, Fetch Gating, WPUP).]
12. Now I will explain our two different WPUP mechanisms. (CLICK) One is branch PC-based WPUP and the other one is phase-based WPUP. (CLICK) Note that these WPUP mechanisms can be combined with other fetch gating mechanisms to take advantage of the wrong-path prefetching effect and minimize performance degradation. For example, our WPUP mechanisms can be applied to Manne’s fetch gating to make it wrong-path usefulness aware.

Branch PC-based WPUP Basic idea
Identifies and records conditional branch PCs that lead to useful wrong-path memory references. If the fetched branch is recorded as useful, do not fetch gate.
13. Let me talk about the branch PC-based WPUP first. Branch PC-based WPUP identifies and records the PCs of conditional branches that lead to useful wrong-path memory references. If the fetched branch is recorded as useful, we do not fetch gate.

Branch PC-based WPUP Implementation Fetch Engine
Latest Branch PC Register (LBPC, 16 bits): LBPC value carried through the pipeline. Miss Status Holding Registers (MSHR): Branch ID field (BID, 10 bits), already used for branch misprediction recovery; Branch PC field (BPC, 16 bits); Wrong Path field (WP, 1 bit). WPUP cache: 4-way set-associative, no data store, LRU.
13. This mechanism needs some additional hardware in the existing processor. (CLICK) To keep track of the most recently fetched branch PC, we need a latest branch PC register, or LBPC, in the fetch engine. It is updated whenever a new conditional branch is fetched. (CLICK) Each MSHR entry needs additional fields, which are the Branch ID, Branch PC and Wrong Path fields. MSHRs are Miss Status Holding Registers, which keep track of the state of all memory requests in the memory system. The Branch ID is already used to flush instructions that are younger than the mispredicted branch in the processor. (CLICK) Finally, a cache structure is needed to store branch PCs. It is a simple tag store structure with few entries and an LRU replacement policy. I will explain how the branch PC-based WPUP works with this hardware in the next two slides.

Branch PC-Based WPUP (Training)
[Figure: the control flow graph from mcf again. BB2 ends in branch BR 2 (PC2, BID 2); the not-taken path is BB3 (Load A, Load B, JMP), the taken path is BB4 (Load A, Load B), followed by BB5 (Load C). LBPC holds PC 2. Wrong-path loads A, B and C carry PC 2 and BID 2 and miss the L2 cache.]
MSHR contents after misprediction recovery:
Addr  BID  BPC  WP
A     2    PC2  1
B     2    PC2  1
C     2    PC2  1
MSHR hit; wrong path was useful. BPC 2 is stored in the WPUP cache.
14. Let me explain how we detect the latest branch PC leading to useful wrong-path memory references and how we update WPUP. Let’s look at the control flow graph from the previous slide. (CLICK) Let’s say the processor executes the branch in basic block 2. LBPC in the fetch engine is updated with the PC of the branch in basic block 2. (CLICK) BID 2 is assigned for branch misprediction recovery. (CLICK) The branch instruction in basic block 2 is mispredicted. (CLICK) Load A in basic block 3 (CLICK) goes through the pipeline with branch ID 2 and its branch PC 2. (CLICK) (CLICK) Load A misses the cache, (CLICK) so a memory request is generated by allocating an MSHR entry for address A. We also update the BID field with BID 2 and the BPC field with PC 2 from the instruction. The wrong path field is set to zero because the processor does not know yet whether or not the request is generated on the wrong path. (CLICK) Load B (CLICK) misses the cache, so the processor allocates an MSHR entry for it accordingly. (CLICK) Load C (CLICK) misses the cache, so the processor generates another request as well. (CLICK) After some time, the branch misprediction is recovered. (CLICK) The BID from the branch resolution unit is sent to the MSHRs. (CLICK) Any memory requests that are younger than the branch with BID 2 are marked as wrong-path memory requests by setting the wrong path field to 1. (CLICK) Now the processor executes the correct-path instructions. (CLICK) Load A in basic block 4 misses the cache again because the data for A is not in the cache yet. But (CLICK) it hits the MSHR entry with address A. A useful wrong-path memory reference following branch PC 2 has now been detected. PC 2 from the MSHR entry is stored into the WPUP cache.
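The training walkthrough on this slide can be mimicked with a small software model. The sketch below is an assumption-laden illustration rather than the actual hardware: the class and method names are hypothetical, branch IDs are assumed to increase monotonically (no wrap-around), a plain set stands in for the WPUP cache, and every later miss that hits a wrong-path MSHR entry is treated as a correct-path reference.

```python
from dataclasses import dataclass


@dataclass
class MshrEntry:
    addr: int
    bid: int          # branch ID carried by the load that allocated the entry
    bpc: int          # branch PC copied from LBPC when the load was fetched
    wp: bool = False  # wrong-path bit, unknown (0) at allocation time


class WpupTrainer:
    """Hypothetical model of branch PC-based WPUP training."""

    def __init__(self):
        self.mshr = {}           # addr -> MshrEntry
        self.useful_pcs = set()  # stand-in for the WPUP cache

    def on_l2_miss(self, addr, bid, bpc):
        entry = self.mshr.get(addr)
        if entry is None:
            # First miss to this address: allocate an entry with WP = 0.
            self.mshr[addr] = MshrEntry(addr, bid, bpc)
        elif entry.wp:
            # MSHR hit on a wrong-path request: the wrong path was useful,
            # so record the branch PC stored in the entry.
            self.useful_pcs.add(entry.bpc)

    def on_misprediction_recovery(self, mispredicted_bid):
        # Requests younger than the mispredicted branch are wrong-path.
        # Assumes IDs grow with program order and never wrap.
        for entry in self.mshr.values():
            if entry.bid >= mispredicted_bid:
                entry.wp = True

    def on_fill(self, addr):
        # Data returned from memory: free the MSHR entry.
        self.mshr.pop(addr, None)
```

Replaying the slide's example (loads A, B and C allocated with BID 2 and PC 2, recovery for BID 2, then a correct-path miss to A) leaves PC 2 recorded as useful.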

Branch PC-Based WPUP (Prediction)
[Figure: the same control flow graph with LBPC holding PC 2 and a WPUP cache (fields: Addr, LRU) that contains PC2. Fetch gate? WPUP cache hit; do not fetch gate.]
15. Now, the processor executes this control flow again some time later. (CLICK) (CLICK) When the branch instruction in basic block 2 is fetched, LBPC is updated with PC 2. (CLICK) Let’s say BR 2 is mispredicted again this time, (CLICK) and now the fetch gating mechanism wants to fetch gate. (CLICK) The WPUP cache is looked up with the current LBPC value in the fetch engine. (CLICK) It hits in the WPUP cache, so the processor does not fetch gate. (CLICK) The processor continues executing the wrong path to take advantage of the wrong-path prefetching effect.
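The prediction step is a lookup in a tag-only cache followed by a veto of the gating request. A minimal sketch, assuming a software model: the 4-way associativity and LRU policy come from the implementation slide, but the number of sets, the modulo indexing, and all names are hypothetical.

```python
from collections import OrderedDict


class WpupCache:
    """Tag-only set-associative store of useful branch PCs with LRU replacement."""

    def __init__(self, num_sets=8, ways=4):
        self.num_sets = num_sets  # assumed; the talk only says "few entries"
        self.ways = ways          # 4-way, as on the implementation slide
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, pc):
        return self.sets[pc % self.num_sets]  # assumed index function

    def insert(self, pc):
        s = self._set_for(pc)
        if pc in s:
            s.move_to_end(pc)      # refresh LRU position
            return
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict the least recently used tag
        s[pc] = True

    def lookup(self, pc):
        s = self._set_for(pc)
        if pc in s:
            s.move_to_end(pc)
            return True
        return False


def should_fetch_gate(gate_requested, lbpc, wpup_cache):
    """Gate only if the gating mechanism asks for it and LBPC misses in WPUP."""
    return gate_requested and not wpup_cache.lookup(lbpc)
```

Because `should_fetch_gate` only consumes a boolean request, the same veto could sit behind any gating policy, which matches the talk's point that WPUP can be combined with other fetch gating mechanisms.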

Phase-based WPUP Basic idea
Predict if the current phase will provide useful wrong-path memory references If so, do not fetch gate In this slide, I will explain the other WPUP mechanism, that is, phase-based WPUP. The basic idea of the phase-based WPUP is if a phase is detected as useful, we do not fetch gate. (CLICK) This WPUP exploits the strong phase behavior of useful wrong-path references for mcf and parser, as shown in the graph. The x axis is time and y axis is the number of useful wrong-path references.

Phase-based WPUP Implementation
Wrong Path Usefulness Counter (WPUC, 5bits) Incremented for each useful wrong-path memory reference Reset periodically Do not fetch gate if WPUC > threshold BPC fields or WPUP cache not needed To implement this WPUP, we use a counter to count the number of useful wrong-path references for a certain time interval. The counter is reset at the start of every time interval. If the wrong-path usefulness counter value is greater than a threshold, the processor continues fetching. We use the wrong-path detection logic that was explained in branch PC-based WPUP. But we do not need any branch PC fields or WPUP cache since we are only counting the number of useful wrong-path memory references. Therefore, the hardware cost for phase-based WPUP is much less than branch PC-based one.
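The phase-based predictor reduces to one counter and a comparison, which a short sketch can make concrete. The 5-bit width comes from the slide; the threshold value, the interval length, the saturating behavior, and the names below are assumptions made for illustration.

```python
class PhaseWpup:
    """Hypothetical model of the phase-based WPUP."""

    def __init__(self, threshold=4, interval=100_000, bits=5):
        self.threshold = threshold        # assumed value
        self.interval = interval          # assumed: cycles per time interval
        self.max_count = (1 << bits) - 1  # 5-bit counter; saturation is assumed
        self.wpuc = 0                     # Wrong Path Usefulness Counter
        self.cycles = 0

    def on_useful_wrong_path_reference(self):
        # Called when a correct-path miss hits a wrong-path MSHR entry.
        self.wpuc = min(self.wpuc + 1, self.max_count)

    def tick(self, cycles=1):
        # Reset the counter at the start of every time interval.
        self.cycles += cycles
        if self.cycles >= self.interval:
            self.cycles = 0
            self.wpuc = 0

    def predicts_useful(self):
        """Do not fetch gate while WPUC > threshold."""
        return self.wpuc > self.threshold
```

Compared with the branch PC-based scheme, this model needs no per-branch state at all, which is the source of the lower hardware cost mentioned in the talk.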

18. Now, let’s look at our experimental results.

Simulation Methodology
Alpha ISA execution-driven simulator. Baseline processor configuration: 2GHz, 8-wide issue, out-of-order, 128-entry ROB; hybrid branch predictor (64K-entry gshare and 64K-entry PAs); 11 stages (minimum branch misprediction penalty); 1MB, 8-way unified L2 cache; 32 L2 MSHRs, 300-cycle memory latency; stream prefetcher. Wattch power model: 100 nm, 1.2V technology. Manne’s fetch gating: gating threshold of 3 low-confidence branches; JRS confidence estimator (4K-entry, 4-bit MDC); tuned for the best energy-delay product. Branch count-based fetch gating: dynamic thresholds as in the table below.
BP Acc (%)   100~99  99~97  97~95  95~93  93~90  90~85  85~0
Threshold    18      16     13     12     11     7      3
19. We used an Alpha ISA execution-driven simulator. (CLICK) The baseline processor is a 2GHz, 8-wide issue, out-of-order processor with a 128-entry ROB. We used a hybrid branch predictor with an 11-cycle minimum branch misprediction penalty. (CLICK) Wattch is used for our power model. (CLICK) For Manne’s fetch gating, we used a 4K-entry JRS branch confidence estimator. (CLICK) The dynamic thresholds for our branch count-based fetch gating are as shown in the table.

Branch-Count Based Fetch Gating
Let’s look at the results of our branch count-based fetch gating compared to Manne’s. The top graph shows the change in performance and the bottom one shows the change in energy compared to the baseline. The green bar indicates perfect gating, the yellow bar is Manne’s fetch gating and the orange bar indicates our fetch gating. (CLICK) For gzip, vpr, parser and twolf, our fetch gating degrades performance less while it saves more energy. Thanks to that, on average (CLICK) our fetch gating mechanism performs better and also saves more energy. (CLICK) However, both our and Manne’s fetch gating mechanisms degrade performance for mcf and parser, by 9% and 5% respectively. Especially for mcf, the energy consumption is increased by the increased execution time, as with perfect gating. Therefore, we apply our WPUP mechanisms to our fetch gating mechanism.
Manne’s and our fetch gating degrade performance of mcf and parser. Performance and energy savings are higher than Manne’s.

WPUP Mechanisms
Improves performance and energy savings compared to Manne’s. Improves performance of mcf and parser.
21. The graph shows the performance and energy consumption of our two WPUP mechanisms on top of our fetch gating. As you can see, for mcf, (CLICK) both branch PC-based and phase-based WPUP recover the performance loss due to our fetch gating by 8%. The energy change goes from positive to negative as well. Please note that the WPUP mechanisms don’t significantly affect the energy savings of the other benchmarks. For mcf and parser, (CLICK) phase-based WPUP recovers more of the performance degradation than the branch PC-based one, at the cost of more energy consumption. On average, (CLICK) both WPUP mechanisms improve performance, and phase-based WPUP improves performance a little bit more while saving a little bit less energy.

Performance-Aware Speculation Control vs. Manne’s Fetch Gating
Hardware cost
                  Fetch Gating  WPUP  Total
Manne             2049B         -     2049B
FG-BR/PC-WPUP     6B            260B  266B
FG-BR/PHASE-WPUP  6B            45B   51B
23. The table shows the cost of Manne’s fetch gating and our two speculation control mechanisms. On top of our simple branch count-based fetch gating, the total costs of our mechanisms are 266B and 51B for branch PC-based and phase-based WPUP, respectively. This is much less than Manne’s fetch gating, which needs about 2KB. Furthermore, Manne’s fetch gating is not wrong-path usefulness aware.

Comparison with Manne’s Fetch Gating
WPUPs improve performance and energy efficiency of Manne’s. 2.5% less performance degradation, 1.0% more energy savings.
24. In this slide, we applied our branch PC-based WPUP mechanism to Manne’s fetch gating. The green bar shows Manne’s fetch gating, the yellow bar is Manne’s fetch gating with our branch PC-based WPUP, and the pink bar indicates our speculation control with branch PC-based WPUP. (CLICK) As shown in the graph, by applying our branch PC-based WPUP to Manne’s fetch gating, we recover the performance loss for mcf and parser, (CLICK) which means our WPUP mechanism makes Manne’s fetch gating more energy-efficient. (CLICK) Comparing our speculation control with branch PC-based WPUP to Manne’s fetch gating without WPUP, we can see that our mechanism achieves higher performance and consumes less energy. Again, note that Manne’s mechanism needs 2KB of storage while our performance-aware control mechanisms need much less storage.

Improves Energy-Delay Product (2.6% compared to Manne’s)
25. Because of the higher performance and lower energy consumption of our mechanisms, (CLICK) the energy-delay product is lower than that of Manne’s mechanism, which indicates our mechanisms are more energy-efficient.

Conclusion Performance-Aware Speculation Control
Branch count-based fetch gating Simple and low cost. Introduced Wrong Path Usefulness Prediction Recovers performance loss due to fetch gating by executing useful wrong-path instructions. Can be combined with other fetch gating mechanisms. Reduces performance loss due to fetch gating and also saves energy. 26. We proposed performance-aware speculation control mechanisms which take into account the usefulness of wrong-path instructions. (CLICK) Our branch count-based fetch gating works well with very little hardware. (CLICK) We introduced wrong path usefulness prediction to recover the performance loss due to fetch gating. (CLICK) Our evaluation shows that our comprehensive mechanism reduces performance loss due to fetch gating and also saves energy.

Questions? Thank you. I will answer questions.
