
Slide 1: Trace Caches
Michele Co, CS 451

Slide 2: Motivation
- High-performance superscalar processors
- High instruction throughput
- Exploit ILP
  - Wider dispatch and issue paths
- Execution units designed for high parallelism
  - Many functional units
  - Large issue buffers
  - Many physical registers
- Fetch bandwidth becomes the performance bottleneck

Slide 3: Fetch Performance Limiters
- Cache hit rate
- Branch prediction accuracy
- Branch throughput
  - Need to predict more than one branch per cycle
- Non-contiguous instruction alignment
- Fetch unit latency

Slide 4: Problems with Traditional Instruction Cache
- Contains instructions in compiled (static) order
- Works well for sequential code with little branching, or for code with large basic blocks

Slide 5: Suggested Solutions
- Multiple branch target address prediction
- Branch address cache (Yeh, Marr, Patt, 1993)
  - Provides quick access to multiple target addresses
  - Disadvantages: complex alignment network, additional latency

Slide 6: Suggested Solutions (cont'd)
- Collapsing buffer: multiple accesses to the BTB (Conte, Mills, Menezes, Patel, 1995)
  - Allows fetching non-adjacent cache lines
  - Disadvantages: bank conflicts; poor scalability for interblock branches; significant logic added before and after the instruction cache
- Fill unit (Melvin, Shebanow, Patt, 1988)
  - Caches RISC-like instructions derived from the CISC instruction stream

Slide 7: Problems with Prior Approaches
- Pointers for all non-contiguous instruction blocks must be generated BEFORE fetching can begin
  - Extra stages, additional latency
- Complex alignment network necessary
- Multiple simultaneous accesses to the instruction cache
  - Multiporting is expensive
- Sequencing
  - Additional stages, additional latency

Slide 8: Potential Solution: Trace Cache
- Rotenberg, Bennett, Smith (1996)
- Advantages
  - Caches dynamic instruction sequences, fetching past multiple branches
  - No additional fetch unit latency
- Disadvantages
  - Redundant instruction storage, both between the trace cache and the instruction cache and within the trace cache itself

Slide 9: Trace Cache Details
- Trace: a sequence of instructions, potentially containing branches and their targets
  - Terminates on branches with an indeterminate number of targets: returns, indirect jumps, traps
- Trace identifier: start address + branch outcomes
- Trace cache line:
  - Valid bit
  - Tag
  - Branch flags
  - Branch mask
  - Trace fall-through address
  - Trace target address
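The line fields above can be sketched in code. This is a minimal, illustrative model, not the Rotenberg et al. implementation: it assumes a direct-mapped cache, treats the branch mask as a simple branch count, and uses made-up sizes (64 sets, one bit per branch outcome). A trace hits only when the tag matches and the branch predictor's outcome bits agree with the branch flags stored with the trace.

```python
class TraceCacheLine:
    """One trace cache line, with the fields named on the slide."""
    def __init__(self):
        self.valid = False
        self.tag = 0            # high bits of the trace start address
        self.branch_flags = 0   # taken/not-taken bit per branch in the trace
        self.branch_mask = 0    # number of branches in the trace (slide's "branch mask")
        self.fall_through = 0   # next fetch address if the last branch is not taken
        self.target = 0         # next fetch address if the last branch is taken
        self.insts = []         # the cached dynamic instruction sequence

class TraceCache:
    NUM_SETS = 64               # illustrative size, direct-mapped

    def __init__(self):
        self.lines = [TraceCacheLine() for _ in range(self.NUM_SETS)]

    def lookup(self, start_addr, predicted_outcomes):
        """Return the trace's instructions on a hit, else None.

        A hit requires: valid line, matching tag, and predicted branch
        outcomes equal to the stored branch flags (trace identifier =
        start address + branch outcomes)."""
        line = self.lines[start_addr % self.NUM_SETS]
        if not line.valid or line.tag != start_addr // self.NUM_SETS:
            return None
        # Compare only as many outcome bits as the trace has branches.
        mask = (1 << line.branch_mask) - 1
        if (predicted_outcomes & mask) != (line.branch_flags & mask):
            return None
        return line.insts
```

On a hit the fetch unit consumes the whole trace in one cycle, past multiple branches; on a miss it falls back to the conventional instruction cache.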


Slide 11: Next Trace Prediction (NTP)
- History register
- Correlating table
  - Complex history indexing
- Secondary table
  - Indexed by the most recently committed trace ID
- Index-generating function
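The index-generating function can be sketched as follows. This is a hedged illustration of the general idea, not the published hash: recent trace IDs contribute more bits to the index than older ones, and a secondary table indexed by the last committed trace ID alone serves as a fallback. The bit widths (14-bit table index, 8 bits for the newest ID) are assumptions for the example.

```python
TABLE_BITS = 14  # illustrative correlating-table size: 2**14 entries

def ntp_index(history):
    """Fold a history of recent trace IDs (oldest first) into one
    correlating-table index; newer traces contribute more bits."""
    index = 0
    for depth, trace_id in enumerate(reversed(history)):
        bits = max(1, 8 - 2 * depth)            # fewer bits for older traces
        index = (index << bits) ^ (trace_id & ((1 << bits) - 1))
    return index & ((1 << TABLE_BITS) - 1)

def secondary_index(history):
    """Secondary table: indexed by the most recently committed trace ID."""
    return history[-1] & ((1 << TABLE_BITS) - 1)
```

The correlating table is consulted first; when its entry has low confidence (or misses), the predictor falls back to the secondary table.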

Slide 12: NTP Index Generation

Slide 13: Return History Stack

Slide 14: Trace Cache vs. Existing Techniques

Slide 15: Trace Cache Optimizations
- Performance
  - Partial matching [Friendly, Patel, Patt, 1997]
  - Inactive issue [Friendly, Patel, Patt, 1997]
  - Trace preconstruction [Jacobson, Smith, 2000]
- Power
  - Sequential access trace cache [Hu et al., 2002]
  - Dynamic direction prediction based trace cache [Hu et al., 2003]
  - Micro-operation cache [Solomon et al., 2003]

Slide 16: Trace Processors
- Trace processor architecture
  - Processing elements (PEs)
    - Trace-sized instruction buffer
    - Multiple dedicated functional units
    - Local register file
    - Copy of global register file
  - Uses hierarchy to distribute execution resources
- Addresses superscalar processor issues
  - Complexity
    - Simplified multiple branch prediction (next trace prediction)
    - Elimination of local dependence checking (local register file)
    - Decentralized instruction issue and result bypass logic
  - Architectural limitations
    - Reduced bandwidth pressure on the global register file (local register files)

Slide 17: Trace Processor

Slide 18: Trace Cache Variations
- Block-based trace cache (BBTC): Black, Rychlik, Shen (1999)
  - Needs less storage capacity

Slide 19: Trace Table: BBTC Trace Prediction

Slide 20: Block Cache

Slide 21: Rename Table

Slide 22: BBTC Optimization
- Completion-time multiple branch prediction (Rakvic et al., 2000)
  - Improves over trace table predictions

Slide 23: Tree-based Multiple Branch Prediction

Slide 24: Tree-PHT

Slide 25: Tree-PHT Update

Slide 26: Trace Cache Variations (cont'd)
- Software trace cache: Ramirez, Larriba-Pey, Navarro, Torrellas (1999)
  - Profile-directed code reordering to maximize sequentiality
    - Convert taken branches to not-taken
    - Move unused basic blocks out of the execution path
    - Inline frequent basic blocks
    - Map the most popular traces to a reserved area of the i-cache
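The core reordering step can be sketched as a greedy layout pass over a profiled control-flow graph. This is a simplified illustration of profile-directed code reordering in general, not the authors' exact algorithm: starting from the entry block, it repeatedly places the hottest not-yet-placed successor next in memory, so the frequent branch at the end of each block becomes a not-taken fall-through, while cold blocks drift to the end of the layout. The data structures are hypothetical.

```python
def layout(entry, succ_counts):
    """Greedy profile-directed block layout.

    entry: name of the entry basic block.
    succ_counts: {block: {successor: profiled execution count}}.
    Returns the blocks in their new memory order."""
    placed, order = set(), []
    worklist = [entry]
    while worklist:
        block = worklist.pop(0)
        if block in placed:
            continue
        # Grow a trace: keep following the hottest unplaced successor,
        # so the hot path becomes straight-line (not-taken) code.
        while block is not None and block not in placed:
            placed.add(block)
            order.append(block)
            succs = succ_counts.get(block, {})
            # Remember colder successors; they start later traces.
            worklist.extend(s for s in succs if s not in placed)
            hot = max(succs, key=succs.get, default=None)
            block = hot if hot not in placed else None
    return order
```

With a profile where A→B→D dominates and C is rare, the hot path A, B, D is laid out contiguously and C is appended afterward, which is exactly the "maximize sequentiality" effect the slide describes.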

