CS 7810 Lecture 7 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching E. Rotenberg, S. Bennett, J.E. Smith Proceedings of MICRO-29.

1 CS 7810 Lecture 7
Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching
E. Rotenberg, S. Bennett, J.E. Smith
Proceedings of MICRO-29, 1996

2 Fetching Multiple Blocks
Aggressive o-o-o processors will perform poorly if they only fetch a single basic block every cycle
Solution:
- Predict multiple branches and targets in a cycle
- Fetch multiple cache lines in the cycle
- Initiate the next set of fetches in the next cycle
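The fetch-bandwidth problem above can be sketched in a few lines. This is a toy model (the instruction stream and 16-wide fetch limit are illustrative, not from the paper): single-block fetch stops at the first branch each cycle, while multi-block fetch continues across predicted branches.

```python
# Toy model: fetch until we consume max_branches branches or hit the
# per-cycle fetch width (16 instructions here, an assumed figure).
def fetch_blocks(stream, max_branches):
    fetched, branches = [], 0
    for instr in stream:
        fetched.append(instr)
        if instr.endswith("br"):          # end of a basic block
            branches += 1
            if branches == max_branches:
                break
        if len(fetched) == 16:
            break
    return fetched

# Hypothetical instruction stream with three short basic blocks.
stream = ["i1", "i2", "br", "i3", "br", "i4", "i5", "br", "i6"]
one   = fetch_blocks(stream, max_branches=1)  # one basic block per cycle
three = fetch_blocks(stream, max_branches=3)  # up to three blocks per cycle
```

With one branch per cycle, only 3 instructions are fetched; allowing three branches fetches 8, which is the bandwidth gap the slide is pointing at.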

3 Without the Trace Cache
Stage 1 requires identification of predictions and target addresses
Stage 2 requires multi-ported access of the I-cache
Stage 3 requires shifting and alignment

4 Trace Cache
Takes advantage of temporal locality and biased branches
Does not require multiple I-cache accesses
[Figure: control-flow graph with block A branching to B and C, which branch to D, E, F, G; example traces A-B-D (outcomes 0,0), A-B-E (0,1), A-C-F (1,0)]

5 Base Case
In each cycle, fetch up to three sequential basic blocks

6 Multiple Branch Predictor
[Figure: pattern history table (PHT) indexed by a k-bit global history; muxes select the later predictions using k-1 (and fewer) history bits]
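A simplified reading of the slide's mux diagram, shown for two predictions per cycle (widths and counter initialization are illustrative assumptions): the first branch is predicted with the full k-bit history, and the second is read for both possible outcomes of the first, with a mux picking one.

```python
# Sketch of multiple branch prediction from one PHT (illustrative, not
# the exact hardware): k = 8 history bits, 2-bit saturating counters.
K = 8
pht = [2] * (2 ** K)          # all counters start weakly taken (assumed)

def predict_two(ghr):
    p1 = pht[ghr] >= 2                         # first branch: full k-bit history
    # Speculative histories for the second branch, one per outcome of p1.
    h_taken    = ((ghr << 1) | 1) & (2 ** K - 1)
    h_nottaken = (ghr << 1) & (2 ** K - 1)
    # Mux: the first prediction selects which PHT reading to use.
    p2 = (pht[h_taken] if p1 else pht[h_nottaken]) >= 2
    return p1, p2
```

Both PHT readings for the second branch happen in parallel; only the final mux waits on the first prediction, which keeps the multi-prediction within one cycle.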

7 Trace Cache Design
The branch predictions can be used to index into the trace cache or for tag comparison (Fig. 4)
Keep track of the next address for both taken and not-taken outcomes
A line buffer and merge logic assemble traces

8 Trace Cache

9 Design Alternatives
Associativity (including path associativity)
Partial matches – use all instructions up to the first branch whose prediction disagrees with the stored trace
Multiple line-fill buffers
Trace selection to reduce conflicts
Multi-cycle trace caches?
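The partial-matching alternative can be sketched as follows (a toy model with made-up trace contents): on a tag hit where only a prefix of the embedded branch outcomes agrees with the predictions, supply instructions up to and including the first disagreeing branch.

```python
# Sketch of partial matching: trace is a list of (instr, is_branch);
# trace_outcomes / predicted give per-branch directions.
def partial_match(trace, trace_outcomes, predicted):
    usable, b = [], 0
    for instr, is_branch in trace:
        usable.append(instr)
        if is_branch:
            if trace_outcomes[b] != predicted[b]:
                break            # paths diverge after this branch
            b += 1
    return usable

# Hypothetical trace A -> b1 -> B -> b2 -> D with both branches not-taken.
trace = [("A", False), ("b1", True), ("B", False), ("b2", True), ("D", False)]
full    = partial_match(trace, (0, 0), (0, 0))  # predictions agree fully
partial = partial_match(trace, (0, 0), (0, 1))  # diverge at second branch
```

The full match yields all five instructions; the partial match stops after the second branch and discards D, instead of treating the whole access as a miss.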

10 Branch Address Cache
The BTB maintains 14 addresses (a tree of basic blocks)
Based on the branch predictions, three addresses are forwarded to the I-cache
A BTB extension that allows multiple target predictions:
- adds pipeline stages
- can still suffer I-cache bank contention

11 Collapsing Buffer
Can detect taken branches within a single cache line
Also suffers from merge logic complexity and bank contention

12 Methodology
Very aggressive o-o-o processor – large window (2048 instrs), unlimited resources, no artificial dependences, no cache misses
SPEC92-Int and Instruction Benchmark Suite (IBS)
Trace cache – 64 entries, 16 instrs and 3 branches per entry – 712 bytes of tags and 4KB worth of instructions – I-cache is 128KB
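The slide's sizing is easy to verify (assuming 4-byte instructions, which the 4KB figure implies): 64 entries of 16 instructions each.

```python
# Check the trace cache instruction-storage figure from the slide.
entries, instrs_per_entry, bytes_per_instr = 64, 16, 4
instruction_bytes = entries * instrs_per_entry * bytes_per_instr
# 64 * 16 * 4 = 4096 bytes = 4KB, matching the slide.
```

The 712 bytes of tags are additional bookkeeping (start address, branch flags, next addresses) on top of this 4KB.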

13 Results
Fetching three sequential basic blocks (SEQ.3) is not much more complex than fetching one – IPC improvement of ~15%
The trace cache outperforms the BAC and CB – note that the latter two cannot handle all kinds of trace patterns and suffer from I-cache bank contention
TC outperforms SEQ.3 by 12%
BAC and CB do worse than SEQ.3 if they increase front-end latency

14 Ideal Fetch
The trace cache is within 20% of ideal fetch
The trace miss rate is fairly high – 18-76%
Up to 60% of instructions do not come from the trace cache
A larger trace cache comes within 10% of ideal fetch – note that the front-end is the bottleneck in this processor
