Half-Price Architecture
Ilhyun Kim and Mikko H. Lipasti, PHARM Team, University of Wisconsin-Madison

Motivations
Processors are designed to handle 0-, 1- and 2-source instructions at equal cost
  Satisfy the worst-case requirements of instructions
  No resource arbitration / pipeline stalls in handling source operands
  Simple control over instruction and data streams
Handling source operands requires 2x machine bandwidth
  e.g. 2 read ports / 1 write port per instruction
  Heavily multi-ported structures in many pipeline stages
[Figure: pipeline stages Fetch, Decode, Rename, Queue, Sched, Disp, RF, Exe, Retire, Commit, annotated with the two-source costs: map table reads, dependence checks, ready state checks, two operand wakeups, two RF read port accesses, bypass to FU's two input ports]
June 9, 2003, Ilhyun Kim and Mikko Lipasti, UW-MADISON PHARM Team, ISCA-30

Making the common case faster
2x HW configuration assumes 2 source operands are common
  18~36% of instructions have 2 source operands
But structures for 2 source operands are not fully utilized
  Scheduler
    4~16% of instructions need two wakeups
    Less than 3% of instructions handle 2 wakeups in the same clock cycle
  Register file
    0.64 read ports per instruction
    Less than 4% of instructions need two register read ports
Handling 2 source operands may NOT be the common case
-> Why not build a pipeline optimized for 1-source instructions?

Half-price Architecture
Restrict the processor's capability to handle 2 source operands
  0- or 1-source instructions are processed without any restriction
  2-source instructions may execute more slowly
  But they are not the common case
-> Reduce hardware complexity incurred by 2 source operands
  ½ technique in scheduler: Sequential wakeup
  ½ technique in RF: Sequential register access
[Figure: instruction formats ordered by the hardware they need (Opcode; Opcode Rdst / Rsrc; Opcode Rdst Rsrc 1; Opcode Rdst Rsrc 1 Rsrc 2), contrasting the half-price architecture design point with the HW design point that matches the worst-case requirements]

2-source-format instructions
18~36% of dynamic instructions have 2-source format (excluding stores)
[Chart: fraction of 2-src-format insts]

Target identification: 2-source instructions
6~23% of instructions are 2-source instructions
  2 unique source operands with dependences
Dynamic behaviors of 2-source instructions will expose greater opportunities
[Chart: 2-src-format insts vs. 2-src insts]

Outline
Motivations
Half-price architecture
Reducing scheduler complexity
  Sequential wakeup
Reducing register file complexity
Conclusions & Future work

Scheduler complexity
Overdesign in wakeup logic
  Tag comparators for two source operands
  Tag broadcast is expensive
  Delay is a function of # tag comparators and bus length
Speeding up the scheduler
  Clustered scheduler (Palacharla et al.)
  Making a small window look bigger (Michaud et al.)
  Hierarchical scheduler (Lebeck et al., Hrishikesh et al.)
  Reducing wakeup bus load capacitance
    Tag elimination & last-tag speculation (Ernst & Austin)
    Half-price technique: sequential wakeup
[Figure: conventional wakeup logic. Tags 1 … W are broadcast to every entry, where two comparators (=) match them against tagL and tagR and set readyL and readyR]

Last-tag speculation (Ernst & Austin, ISCA02)
Only the last-arriving operand initiates instruction issue
  Remove tag comparison logic for the early-arriving operand
  Fewer tag comparators -> reduced load on the bus + compact wakeup logic -> scheduling logic cycle time improvement
A scoreboard checks correctness of scheduling
  May hurt performance due to its speculative nature
  Implementation issue w/ broadcast-based selective recovery
-> Our technique exploits last-arriving operands non-speculatively, achieving similar benefits

2-pending-source instructions
Many operands are already ready at insert time
4~16% of instructions have 2-pending-source operands, requiring two wakeup signals before being issued
[Chart: 2-src insts vs. 2-pending-src insts, 4-wide and 8-wide]

Slack between two wakeups
Many 2-pending-source instructions have wakeup slack
  Less than 3% of instructions have 0-slack wakeups
-> Exploit wakeup slack to prioritize operand wakeups
[Chart: 2-pending-src insts vs. simultaneous wakeup (0 slack), 4-wide and 8-wide]

½ technique - Sequential wakeup
Sequentially wake up ½ operands during wakeup slack
  Decouples half of tag comparators -> reduced load on the bus
  Flexible routing in slow wakeup bus -> compact fast wakeup logic
No recovery, lower misprediction penalty (1-cycle issue delay)
  Instructions are issued non-speculatively in terms of operand readiness
Simultaneous (0-slack) wakeups always incur penalty
  But they are less than 3% of instructions
[Figure: sequential wakeup logic. Tags 1 … W drive the fast wakeup bus and, through a latch, the slow wakeup bus (1 clk behind). Each entry puts the tag predicted to be last-arriving on the fast bus; readyL/tagL and readyR/tagR each keep one comparator. Timing diagram: fast wakeup and select fit in clock t, while the slow wakeup of the same broadcast completes in clock t+1]
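The timing behavior described on this slide can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the authors' simulator: `Entry`, `ready_cycle`, and `conventional_ready_cycle` are hypothetical names, tags are plain strings, and the slow bus is modeled simply as the fast bus delayed by one cycle.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Entry:
    """One scheduler entry: one operand per wakeup bus (hypothetical model)."""
    fast_tag: str             # tag predicted to arrive last, watched on the fast bus
    slow_tag: str             # other tag, watched on the slow bus (1 clock behind)
    fast_ready: bool = False
    slow_ready: bool = False


def ready_cycle(entry: Entry, broadcasts: Dict[int, Set[str]], max_cycle: int = 64) -> int:
    """Return the first cycle in which both operands are seen as ready."""
    for cycle in range(max_cycle):
        fast_bus = broadcasts.get(cycle, set())
        slow_bus = broadcasts.get(cycle - 1, set())  # latched copy of the previous cycle
        if entry.fast_tag in fast_bus:
            entry.fast_ready = True
        if entry.slow_tag in slow_bus:
            entry.slow_ready = True
        if entry.fast_ready and entry.slow_ready:
            return cycle
    raise RuntimeError("operands never became ready")


def conventional_ready_cycle(sources: Set[str], broadcasts: Dict[int, Set[str]]) -> int:
    """Baseline: both comparators sit on one bus, so readiness is the last broadcast."""
    return max(cycle for cycle, tags in broadcasts.items() if tags & sources)


# r1 is broadcast in cycle 2, r2 in cycle 5 (3 cycles of wakeup slack).
slack = {2: {"r1"}, 5: {"r2"}}
correct = ready_cycle(Entry(fast_tag="r2", slow_tag="r1"), slack)
mispredicted = ready_cycle(Entry(fast_tag="r1", slow_tag="r2"), slack)

# Both tags broadcast in the same cycle (0 slack).
zero_slack = {4: {"r1", "r2"}}
simultaneous = ready_cycle(Entry(fast_tag="r1", slow_tag="r2"), zero_slack)
```

In this model a correct prediction hides the slow bus entirely, a wrong prediction costs one cycle without any recovery, and a 0-slack wakeup costs one cycle whichever operand is placed on the fast bus.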

Machine models
Simplescalar-Alpha-based, 12-stage, 4/8-wide OoO + speculative scheduling
  Alpha-style squashing scheduling recovery invalidates all issued instructions (dependent / independent) behind the miss
  4-wide: 64 RUUs, 32 LSQs, 2 memory ports
  8-wide: 128 RUUs, 64 LSQs, 4 memory ports
  64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
  Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
Sequential wakeup
  Last-arriving operand predictor: 1k-entry, PC-direct-mapped, 2-bit bimodal
Last-tag speculation
  Same predictor
  Scoreboard located next to the scheduler
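A last-arriving operand predictor of the kind listed above (1k entries, direct-mapped by PC, 2-bit bimodal counters) might look like the following sketch. The class name, the counter encoding (0-1 means "left operand last", 2-3 means "right operand last"), the initial counter value, and the index function that drops two low PC bits are all assumptions made for illustration; the slides do not specify them.

```python
class LastArrivingPredictor:
    """PC-indexed table of 2-bit saturating counters (hypothetical sketch).

    Counter values 0-1 predict that the left operand arrives last,
    values 2-3 predict that the right operand arrives last.
    """

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.table = [1] * entries  # assumed initial state: weakly "left"

    def _index(self, pc: int) -> int:
        # Assumption: fixed 4-byte instructions, so the two low PC bits carry no information.
        return (pc >> 2) % self.entries

    def predict_right_last(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, right_was_last: bool) -> None:
        # Saturating increment toward "right", decrement toward "left".
        i = self._index(pc)
        if right_was_last:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


predictor = LastArrivingPredictor()
pc = 0x1000
history = []
for outcome in (True, True, False, False):
    history.append(predictor.predict_right_last(pc))
    predictor.update(pc, outcome)
history.append(predictor.predict_right_last(pc))
```

The prediction would select which source tag an entry places on the fast wakeup bus. Because a wrong prediction only delays issue by one cycle, a small table with aliasing may be acceptable, which fits the slide's remark that sequential wakeup is relatively insensitive to predictor accuracy.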

Sequential wakeup performance
[Chart: sequential wakeup performance, 4-wide and 8-wide]
Sequential wakeup slowdown is slight: avg 0.4 / 0.6%, worst 2.1%
  Less than 4% of instructions incur penalty
  Sequential wakeup is relatively insensitive to predictor accuracy
-> Sequential wakeup can reduce wakeup logic delay with a minimal performance impact

Outline
Motivations
Half-price architecture
Reducing scheduler complexity
Reducing register file complexity
  Sequential register file access
Conclusions & Future work

Register file complexity
Overdesign in register file
  2x read ports for two source operands
  Superscalar processors need RF to be heavily multiported
  Area increases quadratically, latency increases linearly
Two read ports are not fully utilized
  0- / 1-source instructions do not require two read ports
  Many instructions frequently get values off the bypass path
  0.64 read ports / instruction (Balasubramonian et al., ISCA01)
Speeding up the RF
  Reducing the number of register entries
    Hierarchical register file (Cruz, Borch, Balasubramonian, …)
  Reducing the number of ports
    Fewer RF ports + crossbar (Balasubramonian et al., Park et al., …)
    Half-price technique: Sequential RF access

Two RF read port accesses
Less than 4% of instructions need 2 read port accesses
  Many 2-source instructions read at least one value off the bypass path
[Chart: 2-src insts vs. insts that require 2 read ports, 4-wide and 8-wide]

½ technique – Sequential RF access
Remove ½ register read ports
  Only a single read port per issue slot
  0- or 1-source instructions are processed without any restriction
  Sequentially access a single port twice for 2 values if needed (the execution latency increases by 1 clock cycle)
However, speculative scheduling does not allow variable-latency operations (Implementing optimizations at decode time, ISCA02)
  Load latency misprediction -> scheduling recovery
  Variable RF latency -> scheduling recovery, too
-> Sequential RF access should be reflected in scheduling
  How to detect if source values will be read off the bypass path?
  How to schedule dependent instructions accordingly?
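As a rough worked example of what removing half the read ports could buy, the sketch below combines two statements from the slides: a conventional design provides 2 read ports and 1 write port per instruction, and register file area grows quadratically with port count. Treating area as exactly proportional to the square of the total port count is a simplifying assumption for illustration, not a result reported in the talk; the function names are hypothetical.

```python
def rf_ports(issue_width: int, read_ports_per_slot: int, write_ports_per_slot: int = 1) -> int:
    """Total register file ports for a machine with the given issue width."""
    return issue_width * (read_ports_per_slot + write_ports_per_slot)


def relative_area(ports: int, baseline_ports: int) -> float:
    """Area relative to the baseline, assuming area is proportional to ports squared."""
    return (ports / baseline_ports) ** 2


for width in (4, 8):
    full = rf_ports(width, read_ports_per_slot=2)   # conventional: 2 reads + 1 write per slot
    half = rf_ports(width, read_ports_per_slot=1)   # half-price: 1 read + 1 write per slot
    print(width, full, half, round(relative_area(half, full), 2))
```

Under this assumption the port count drops from 12 to 8 on the 4-wide machine and from 24 to 16 on the 8-wide machine, so the modeled area falls to roughly 44% of the baseline in both cases.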

Scheduling in sequential RF access
Back-to-back issue == reading values off the bypass
  Back-to-back issue makes dependent instructions fall within the bypass window
  Non-back-to-back issue or 2 ready sources at insert time incur sequential RF access (assuming a 1-clk-cycle bypass window)
Scheduler considerations
  !(wakeup && selected) in the same cycle -> sequential RF access
  Delay tag broadcast by 1 clock cycle
  Block the issue slot (only the one w/ seq RF access) for 1 cycle for non-pipelined RF access operation
[Figure: Queue, Wakeup/Select scheduling loop, Payload RAM, single-read-ported Reg File (Rd/Wr), MUX, and FU. On a sequential register access the scheduler disables select for 1 CLK, sequences the register accesses, and forwards the first value through the bypass network]
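The detection rule on this slide can be written as a small predicate. The sketch below is one reading of "!(wakeup && selected) in the same cycle", assuming a 1-cycle bypass window: a source whose wakeup arrives in the cycle the instruction is selected is taken off the bypass, and any other source needs a register file read. The function names and the per-source flag list are hypothetical.

```python
from typing import Sequence


def needs_seq_rf_access(woken_in_select_cycle: Sequence[bool]) -> bool:
    """Decide whether an instruction must use its single read port twice.

    woken_in_select_cycle holds one flag per register source operand; True means
    that source's wakeup arrived in the cycle the instruction was selected, so
    its value comes off the bypass path instead of the register file.
    """
    rf_reads = sum(1 for on_bypass in woken_in_select_cycle if not on_bypass)
    return rf_reads >= 2


def extra_latency(woken_in_select_cycle: Sequence[bool]) -> int:
    """Cycles added to the tag broadcast and to the blocked issue slot."""
    return 1 if needs_seq_rf_access(woken_in_select_cycle) else 0


back_to_back = needs_seq_rf_access([True, False])    # one bypassed value, one RF read
both_ready_early = needs_seq_rf_access([False, False])  # two RF reads through one port
one_source = needs_seq_rf_access([False])            # 0-/1-source: never restricted
```

Because the decision is known at select time, the scheduler can delay the destination tag broadcast by the same one cycle, so dependents see a fixed latency and no scheduling recovery is needed.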

Machine models
Simplescalar-Alpha-based, 12-stage, 4/8-wide OoO + speculative scheduling (same as before)
  Alpha-style squashing scheduling recovery invalidates all issued instructions (dependent / independent) behind the miss
  4-wide: 64 RUUs, 32 LSQs, 2 memory ports
  8-wide: 128 RUUs, 64 LSQs, 4 memory ports
  64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
  Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
Sequential RF access
  ½ read-ported RF (1 read port / issue slot)
Comparison cases
  Pipelined RF (1 extra RF stage)
  ½ read-ported RF (same as sequential RF access) + crossbar

Sequential RF access performance
[Chart: sequential RF access performance, 4-wide and 8-wide]
Seq RF access slowdown is slight: avg 1.1 / 0.7%, worst 2.2%
  1-extra RF stage requires extra bypass paths
  ½ read ports + crossbar almost achieves base performance
    Crossbar complexity, global RF port arbitration
-> Sequential RF access reduces the number of RF read ports with a minimal performance impact

Sequential wakeup + RF access
Performance degradation: avg 2.2%, worst 4.8%
  Reduced wakeup bus load capacitance, fewer read ports of RF
-> Half-price techniques reduce HW complexity, reaping most of the performance of a conventional pipeline

Conclusions & Future work
Processors are overdesigned to process 0-, 1- and 2-source instructions at equal cost
  Handling 2-source instructions may not be the common case
  Only a small fraction of instructions utilize overdesigned hardware
Reduce HW complexity by restricting the processor's capability of handling 2-source instructions
  Sequential wakeup, sequential RF access
  The performance impact is minimal
The basic concept can be extended to all pipeline stages
  Register rename, ready information check, bypass logic…
  Changing the pipeline design from instruction- to operand-granularity

Questions?

Last-arriving operand predictor accuracy
[Chart: last-arriving operand predictor accuracy, series labeled 128, 512, 1k, 4k, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr; 4-wide and 8-wide]

½ technique - Sequential wakeup
Sequential wakeup example
Instruction sequence:
  ADD r1, r2, r3
  SUB r3, r4, r5
  XOR r5, 1, r6
[Figure: scheduler entries for ADD, SUB and XOR over cycles 1 to 3. Each entry shows its slow-bus and fast-bus source tags, OP, dest and rdy fields; the fast and slow buses carry the tags r1, r4 and then r3 as instructions issue]

½ technique – Sequential RF access
Scheduler changes for sequential RF access
[Figure: scheduler entry with tagL/tagR comparators on the wakeup bus, readyL/readyR bits, non-sticky nowL/nowR bits and a 2-src bit. These produce the request and seq_reg_access signals for the select logic; when a granted instruction has seq_reg_access set, its dest tag broadcast takes an extra delay and select is disabled for 1 CLK (bubble)]
Sequential RF access example
Instruction sequence:
  ADD r1, r2, r3
  SUB r3, r4, r5
  XOR r5, 1, r6
[Figure: two timelines over cycles 1 to 6. With sequential RF access: select only ADD, sequential read of r1 and r2 with a bubble, SUB waits, then wakes up and reads r4, XOR needs no register read, execute/bypass. Without it: ADD reads r1 and r2 in a single register read cycle, followed by the same SUB and XOR steps]
