
1 Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida State University June 8-16, 2007

2 Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)
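
The two packing styles above can be sketched as a small lookup model. This is only an illustration, not the hardware or the authors' encoding: the `IRF` contents, the `Fetched` type, and the `expand` helper are hypothetical names, and the sketch assumes that a loosely packed IRF instruction follows its host instruction.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical IRF contents; in the real scheme the compiler chooses them.
IRF = ["addu r2, r2, r3", "lw r4, 0(r5)", "addiu r5, r5, 4", "sll r6, r4, 2"]

@dataclass
class Fetched:
    """One word fetched from the instruction cache (illustrative model only)."""
    base: Optional[str]   # ordinary instruction, or None for a tightly packed word
    irf_refs: List[int]   # IRF indices carried by this word

def expand(word: Fetched) -> List[str]:
    """Return the instructions that one fetched word hands to the decoder."""
    out = [] if word.base is None else [word.base]
    # Assumption: referenced IRF instructions follow the host instruction.
    out.extend(IRF[i] for i in word.irf_refs)
    return out

tight = Fetched(base=None, irf_refs=[1, 0, 2])           # multiple IRF references
loose = Fetched(base="sw r2, 8(r29)", irf_refs=[3])      # one piggybacked reference

print(expand(tight))
print(expand(loose))
```

In this model a single fetch of `tight` yields three instructions and a single fetch of `loose` yields two, which is the source of the reduced fetch count.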

3 Execution of IRF Instructions
[Figure: Executing a Tightly Packed Param4c Instruction. Labels: PC, Instruction Cache, IF/ID, IRF, IMM, IRWP; packed instruction fields insn1, insn2, imm3, insn3, insn4; stages: Instruction Fetch Stage, First Half of Instruction Decode Stage; output: To Instruction Decoder.]

4 Outline
- Introduction
- IRF and Instruction Packing Overview
- Integrating an IRF with an L0 I-Cache
- Decoupling Instruction Fetch
- Experimental Evaluation
- Related Work
- Conclusions & Future Work

5 MIPS+IRF Instruction Formats
[Figure: instruction formats T-type, R-type, I-type, J-type. Field labels: opcode, inst1 to inst5, param, s, inst, rs, rt, rd, shamt, function, immediate, win. Width labels: 6 bits, 5 bits, 11 bits, 1 bit, 2 bits, 24 bits.]

6 Previous Work in IRF
- Register Windowing + Loop Cache (MICRO 2005)
- Compiler Optimizations (CASES 2006)
  - Instruction Selection
  - Register Renaming
  - Instruction Scheduling

7 Integrating an IRF with an L0 I-Cache
- L0 or Filter Caches
  - Small and direct-mapped
    - Fast hit time
    - Low energy per access
    - Higher miss rate than L1
  - 256B L0 I-cache, 8B line size [Kin97]
    - Fetch energy reduced 68%
    - Cycle time increased 46%!!!
- IRF reduces code size, while L0 only focuses on energy reduction at the cost of performance
- IRF can alleviate the performance penalty associated with L0 cache misses, due to overlapping fetch
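
As a rough illustration of why such a small direct-mapped cache misses more often than an L1, the sketch below maps fetch addresses onto a 256B cache with 8B lines (32 lines). The function name `l0_miss_rate`, the 4-byte instruction size, and the loop trace are assumptions made for the example; they are not taken from the talk.

```python
LINE_BYTES = 8                          # line size quoted on the slide
CACHE_BYTES = 256                       # capacity quoted on the slide
NUM_LINES = CACHE_BYTES // LINE_BYTES   # 32 lines

def l0_miss_rate(addresses):
    """Simulate a direct-mapped cache over fetch addresses; return the miss rate."""
    tags = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        block = addr // LINE_BYTES      # which memory line the address falls in
        index = block % NUM_LINES       # direct-mapped: exactly one candidate slot
        tag = block // NUM_LINES
        if tags[index] != tag:          # cold or conflict miss
            misses += 1
            tags[index] = tag
    return misses / len(addresses)

# Hypothetical trace: a 16-instruction loop (assumed 4-byte instructions), run 10 times.
loop = [0x400000 + 4 * i for i in range(16)]
trace = loop * 10

print(NUM_LINES, l0_miss_rate(trace))
```

With this trace the loop fits in 8 lines, so only the first pass misses. A loop body larger than 256 bytes would start evicting its own lines, which is where a smaller packed code footprint can help.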

8 L0 Cache Miss Penalty
[Figure: pipeline timing diagram over cycles 1 to 9 for Insn1, Insn2, Insn3 and Insn4, with stages IF, ID, EX, M, WB.]

9 Overlapping Fetch with an IRF
[Figure: pipeline timing diagram over cycles 1 to 9 for Insn1, Pack2a, Pack2b and Insn3, with stages IF, ID, EX, M, WB; stage labels carry subscripts a and b for the two packed instructions.]

10 Decoupling Instruction Fetch
- Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, …)
  - Artificially limits the effective design space
- Front-end throttling improves energy utilization by reducing the fetch bandwidth in areas of low ILP
- IRF can provide virtual front-end throttling
  - Fetch fewer instructions every cycle, but allow multiple issue of packed instructions
  - Areas of high ILP are often densely packed
  - Lower ILP for infrequently executed sections of code
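
The idea of a narrow fetch feeding a wider issue stage can be sketched with a toy cycle count. This is a deliberately simplified model, not the simulator configuration used in the talk: `cycles_to_issue` is a hypothetical helper, and dependences, branches, cache misses, and queue limits are all ignored.

```python
from collections import deque

def cycles_to_issue(pack_sizes, fetch_width, issue_width):
    """Count cycles until every instruction has issued in a toy front end.

    pack_sizes lists how many instructions each fetched word carries
    (1 means an ordinary, unpacked instruction). Each cycle fetches up to
    fetch_width words and then issues up to issue_width instructions.
    """
    pending = deque(pack_sizes)
    queued = 0
    cycles = 0
    while pending or queued:
        cycles += 1
        for _ in range(min(fetch_width, len(pending))):
            queued += pending.popleft()
        queued -= min(issue_width, queued)
    return cycles

print(cycles_to_issue([1] * 12, fetch_width=1, issue_width=4))  # unpacked, narrow fetch
print(cycles_to_issue([4] * 3, fetch_width=1, issue_width=4))   # packed, narrow fetch
print(cycles_to_issue([1] * 12, fetch_width=4, issue_width=4))  # unpacked, wide fetch
```

Under these assumptions, twelve instructions packed four per word keep a 4-wide issue stage busy with a fetch width of one, matching the wide-fetch case, while the unpacked narrow-fetch case is limited to one instruction per cycle.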

11 Out-of-order Pipeline Configurations

12 Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
  - Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)
- VPO: Very Portable Optimizer targeted for SimpleScalar MIPS/PISA

13 L0 Study Configuration Data
Parameter              Low-Power In-order Embedded Processor
I-Fetch Queue          4 entries
Branch Predictor       Bimodal, 128 entries, 3 cycle penalty
Fetch/Decode/Issue     Single instruction
RUU size               8
LSQ size               8
L1 Data Cache          16 KB, 256 lines, 16B line, 4-way s.a., 1 cycle hit
L1 Instruction Cache   16 KB, 256 lines, 16B line, 4-way s.a., 1 / 2 cycle hit
L0 Instruction Cache   256 B, 32 lines, 8B line, direct mapped, 1 cycle hit
Memory Latency         32 cycles
IRF/IMM                4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

14 Execution Efficiency for L0 I-Caches

15 Energy Efficiency for L0 I-Caches

16 Decoupled Fetch Configurations
Parameter                    High-end Out-of-order Embedded Processor
I-Fetch Queue                4/8 entries
Branch Predictor             Bimodal, 2048 entries, 3 cycle penalty
Fetch Width                  1 / 2 / 4
Decode/Issue/Commit Width    1 / 2 / 3 / 4
RUU size                     16
LSQ size                     8
L1 Data Cache                32 KB, 512 lines, 16B line, 4-way s.a., 1 cycle hit
L1 Instruction Cache         32 KB, 512 lines, 16B line, 4-way s.a., 1 cycle hit
Unified L2 Cache             256 KB, 1024 lines, 64B line, 4-way s.a., 6 cycle hit
Memory Latency               32 cycles
IRF/IMM                      4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

17 Execution Efficiency for Asymmetric Pipeline Bandwidth

18 Energy Efficiency for Asymmetric Pipeline Bandwidth

19 Energy-Delay² for Asymmetric Pipeline Bandwidth
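
For readers unfamiliar with the metric named in this slide title, the energy-delay² product weights execution time more heavily than energy: it is the energy multiplied by the square of the delay. The sketch below only shows the arithmetic; the function name and the normalized values are hypothetical and are not results from the talk.

```python
def energy_delay_squared(energy, delay):
    """Return the energy-delay^2 product, E * D**2, for one configuration."""
    return energy * delay ** 2

# Hypothetical values normalized to a baseline of 1.0 (not data from the slides).
baseline = energy_delay_squared(1.0, 1.0)
candidate = energy_delay_squared(0.8, 1.1)   # 20% less energy, 10% slower

print(baseline, candidate, candidate < baseline)
```

Because the delay is squared, a configuration that saves energy but runs noticeably slower can still come out worse than the baseline on this metric.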

20 Related Work
- L-caches: subdivide instruction cache, such that one portion contains the most frequently accessed code
- Loop Caches: capture simple loop behaviors and replay instructions
- Zero Overhead Loop Buffers (ZOLB)
- Pipeline gating / Front-end throttling: stall fetch when in areas of low IPC

21 Conclusions and Future Work
- IRF can alleviate fetch bottlenecks from L0 I-Cache misses or branch mispredictions
  - Increased IPC of L0 system by 6.75%
  - Further decreased energy of L0 system by 5.78%
- Decoupling fetch provides a wider spectrum of design points to be evaluated (energy/performance)
- Future Topics
  - Can we pack areas where L0 is likely to miss?
  - IRF + encrypted or compressed I-Caches
  - IRF + asymmetric frequency clustering (of pipeline backend functional units)

22 The End
Questions ???

24 Energy Consumption

25 Static Code Size

26 Conclusions & Future Work
- Compiler optimizations targeted specifically for IRF can further reduce energy (12.2% → 15.8%), code size (16.8% → 28.8%) and execution time
- Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques

28 Instruction Redundancy
- Profiled largest benchmark in each of six MiBench categories
- Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions
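
A coverage figure of this kind can be computed from an instruction listing roughly as follows. This is a minimal sketch, not the authors' profiling tool: `top_n_coverage` and the tiny trace are hypothetical, and instructions are compared as plain strings.

```python
from collections import Counter

def top_n_coverage(instructions, n=32):
    """Fraction of the listing covered by its n most frequent distinct instructions."""
    counts = Counter(instructions)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(instructions)

# Hypothetical dynamic trace: one entry per executed instruction.
trace = ["lw r4, 0(r5)"] * 6 + ["addu r2, r2, r4"] * 3 + ["jr r31"]

print(top_n_coverage(trace, n=1))
print(top_n_coverage(trace, n=2))
```

Feeding the function an execution trace gives a dynamic figure, while feeding it the instructions as they appear once each in the binary gives a static figure.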

29 Compilation Framework
