
CS 7810 Lecture 5 The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler,


1 CS 7810 Lecture 5 The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays M.S. Hrishikesh, N.P. Jouppi, K.I. Farkas, D. Burger, S.W. Keckler, P. Shivakumar UT-Austin and Compaq ISCA’02

2 Improvements in Clock Speed
[Figure: clock speed (33MHz, 66MHz, 100MHz, 200MHz, 450MHz, 1GHz, 2GHz) plotted against feature size, from 1000nm to 130nm]

3 Definitions
Clock period = φ_logic + φ_latch + φ_skew + φ_jitter
φ_logic: the actual work being done in one stage
φ_latch: data has to be saved in latch registers at the end of each pipeline stage (1 FO4 = 36ps at 100nm)
φ_skew: two parts of the circuit may receive their clocks through different paths, resulting in a slight phase difference (0.3 FO4)
φ_jitter: unpredictable variations (0.5 FO4)
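The clock-period definition above can be turned into a small calculator. This is a minimal sketch, not the authors' model: it assumes the slide's parenthetical means a latch overhead of 1 FO4, and the function and constant names are illustrative only.

```python
# Overheads in FO4 inverter delays. LATCH_FO4 = 1.0 is an assumption based on
# reading the slide's "(1 FO4 = 36ps at 100nm)" as the latch overhead.
LATCH_FO4 = 1.0
SKEW_FO4 = 0.3
JITTER_FO4 = 0.5
PS_PER_FO4 = 36.0  # 1 FO4 = 36 ps at 100nm, as quoted on the slide


def clock_period_fo4(logic_fo4):
    """Clock period = logic + latch + skew + jitter, all in FO4."""
    return logic_fo4 + LATCH_FO4 + SKEW_FO4 + JITTER_FO4


def clock_ghz(logic_fo4):
    """Clock frequency in GHz for a given useful logic depth per stage."""
    period_ps = clock_period_fo4(logic_fo4) * PS_PER_FO4
    return 1000.0 / period_ps


for logic in (2, 4, 6, 8, 16):
    period = clock_period_fo4(logic)
    print(f"{logic:>2} FO4 logic: period {period:.1f} FO4, {clock_ghz(logic):.2f} GHz")
```

Under these assumptions the overhead is a fixed 1.8 FO4 per stage, so its share of the clock period grows as the logic depth shrinks.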

4 Processor Model
An Alpha-like processor with latencies updated for 100nm.
Simplification: the study is insensitive to the technology generation.
Note that all structures are perfectly pipelined: this is a "Limit of Pipelining" study.

5 Effect of Deep Pipelining
Logic latencies: Add = 16 FO4, Mpred = 128 FO4, Load from mem = 400 FO4, Mult = 160 FO4. Overhead = 2 FO4.
Clock period   add             mpred          load   mult   Total FO4s               Cycles   Clock speed
18 FO4         16+2 = 18       8x18 = 144     400    180    18+144+400+180 = 742     42       1.54GHz
10 FO4         8+2+8+2 = 20    16x10 = 160    400    200    20+160+400+200 = 780     78       2.78GHz
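The arithmetic on this slide can be reproduced with a short script. This is a sketch under stated assumptions: the memory load is treated as a fixed 400 FO4 that is not re-pipelined, 1 FO4 is taken as 36 ps (the 100nm figure from the Definitions slide), and `summarize` is a hypothetical helper name.

```python
import math

PS_PER_FO4 = 36.0   # assumption: 1 FO4 = 36 ps at 100nm
OVERHEAD_FO4 = 2    # per-stage overhead used on the slide

# Useful logic per operation, in FO4 (values from the slide).
LOGIC_FO4 = {"add": 16, "mpred": 128, "mult": 160}
LOAD_FO4 = 400      # memory load: fixed latency, unaffected by pipelining


def summarize(period_fo4):
    """Return (total FO4, total cycles, clock in GHz) for a clock period in FO4."""
    logic_per_stage = period_fo4 - OVERHEAD_FO4
    total_fo4 = LOAD_FO4
    cycles = math.ceil(LOAD_FO4 / period_fo4)
    for logic in LOGIC_FO4.values():
        stages = math.ceil(logic / logic_per_stage)
        total_fo4 += stages * period_fo4  # each stage costs a full clock period
        cycles += stages
    ghz = 1000.0 / (period_fo4 * PS_PER_FO4)
    return total_fo4, cycles, ghz


for period in (18, 10):
    total, cycles, ghz = summarize(period)
    print(f"{period} FO4 clock: {total} FO4, {cycles} cycles, {ghz:.2f} GHz")
```

The deeper pipeline takes more FO4 in total (780 versus 742) for the same sequence of operations, even though its clock is faster.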

6 Yet, Performance Increases…
Deepening a car assembly line -> more cars being made at the same time -> a new car rolls out at twice the frequency.
Independent instructions benefit from deep pipelining.
Dependent instructions are slowed down.
The latter dominates when pipelining overhead is a large fraction of the clock period.
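A toy calculation illustrates the two effects. It is only a sketch: it reuses the 16 FO4 add and 2 FO4 overhead from the earlier example, assumes independent adds complete one per cycle while each dependent add waits for the full latency of the previous one, and `time_per_add` is a hypothetical name.

```python
import math

ADD_LOGIC_FO4 = 16  # add latency from the earlier example
OVERHEAD_FO4 = 2    # per-stage overhead from the earlier example


def time_per_add(period_fo4, dependent):
    """FO4 delays between successive add results for a given clock period."""
    stages = math.ceil(ADD_LOGIC_FO4 / (period_fo4 - OVERHEAD_FO4))
    if dependent:
        # A dependent add cannot start until the previous add has finished.
        return stages * period_fo4
    # Independent adds overlap in the pipeline: one result per cycle.
    return period_fo4


for period in (18, 10, 6, 4):
    indep = time_per_add(period, dependent=False)
    dep = time_per_add(period, dependent=True)
    print(f"{period:>2} FO4 clock: independent {indep} FO4/add, dependent {dep} FO4/add")
```

In this model the independent stream keeps speeding up as the clock period shrinks, while the dependent chain slows from 18 to 32 FO4 per add because every extra stage pays the overhead again.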

7 Example Latencies
Logic delay   L1D   IssueQ   Int-Add
2 FO4         16    9        9
4 FO4         9     5        5
8 FO4         5     3        3
16 FO4        3     2        2

8 In-Order Processors
With no overhead, when φ_logic (the logic delay per stage) reduces from 8 FO4 to 4 FO4, performance could go up by 100% (as in the car assembly line), but it only goes up by 18%.
With overhead, maximum performance is seen at 6 FO4 for all three benchmark classes.
For the Cray, the optimal pipeline depth was 10.9 FO4 (Int) and 5.4 FO4 (vector).
Degree of parallelism: vector > int-programs-today > int-programs-before (no caches!)

9 Out-of-Order Processors
Optimal logic delay for integer is 6 FO4, for FP non-vector is 5 FO4, for FP vector is 4 FO4.
These results are insensitive to overhead costs and microarchitecture optimizations.
P.S. The effect of out-of-order execution on performance:
Non-vector FP: 0.5 -> 1.0
Integer: 0.8 -> 1.8
Vector FP: 0.9 -> 3.5

10 Out-of-Order Processors

11 Increased Pipeline Depth
Reasons for IPC decrease:
Longer ALU latencies (not quantified)
Longer load latency (~25% for 6-cycle increase)
Longer branch mispredict cost (~10%)
Longer wakeup+select (~55%)

12 Pipelining Wakeup
It takes a long time to broadcast tags across the entire issue queue.
Hence, wake the first eight instructions in the first cycle, wake the next eight in the second, and so on.
This works well if most ready instructions are in the first stage: a 10-stage pipeline worsens performance by only 11%.
Will this change the optimal logic depth?
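The segmented wakeup described above can be sketched as a mapping from queue position to the cycle in which an entry sees the broadcast tag. This is an illustrative model, not the paper's circuit: positions are assumed to be 0-based with the oldest entry first, and the helper names are hypothetical.

```python
SEGMENT_SIZE = 8  # entries woken per cycle, as described on the slide


def wakeup_cycle(position):
    """Cycle (1-based) in which the entry at a 0-based queue position sees the tag."""
    return position // SEGMENT_SIZE + 1


def average_wakeup_delay(ready_positions):
    """Mean number of extra cycles (beyond the first) over the given entries."""
    delays = [wakeup_cycle(p) - 1 for p in ready_positions]
    return sum(delays) / len(delays)


# Ready instructions clustered near the head of the queue see little delay.
print(average_wakeup_delay([0, 1, 2, 9]))      # mostly in the first segment
print(average_wakeup_delay([16, 20, 25, 31]))  # spread across later segments
```

The comparison shows why the scheme depends on ready instructions sitting in the first segment: entries further back pay one extra cycle per segment.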

13 Instruction Select
Stage 1 only goes through one arbiter.
Stages 2-4 have a pre-select and go through two arbiters.
Does well if most ready instructions are in stage 1 (4% loss).
[Figure: issue queue stages 1 to 4 feeding 16-input arbiters]

14 IssueQ Compaction
Both techniques work well only if instructions move up to occupy empty slots.
Compaction wastes energy and increases complexity.
Correctness problems: what if an instruction misses the tag while in transit?

15 Conclusions
Logic per stage will only shrink by a factor of two, which limits clock speed improvements in the future.
Pipelining wakeup+select has the biggest impact on IPC.

16 Related Work
Hartstein and Puzak (IBM): most programs have an optimal pipeline depth between 13 and 30 stages, corresponding to FO4 delays of 4-8.
Sprangle and Carmean (Intel): optimum pipeline depth is 50-60 stages, corresponding to FO4 delays of 4-5.
