A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures.


1 A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures

2 MIPS R10000-Like Design
Based on:
– Complexity-Effective Superscalar Processors
– S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97

3 Fetch Phase
Fetch:
– Read instructions from I-Cache
– Predict branches
– Pass on to Decode phase

4 Decode Phase
Decode:
– Parse instruction
– Shuffle opcode parts to appropriate ports for rename

5 Renaming Phase
Rename:
– Map architectural registers to physical registers
– Eliminate false dependences
– Pass renamed instructions to the scheduler (called Dispatch)

6 Scheduling Phase
Wakeup:
– Instructions check whether they become ready
– From Writeback: physical register names
Select:
– Amongst the ready instructions, select those to execute
– Structural hazards
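
The wakeup and select steps above can be sketched in a few lines of Python. This is a minimal illustration only: the `WindowEntry` class, its field names, and the oldest-first select policy are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    """One scheduler entry (hypothetical layout for illustration)."""
    name: str
    src_tags: set          # physical registers this instruction still waits on
    issued: bool = False

    @property
    def ready(self):
        # Ready once no source tag is pending and it has not issued yet.
        return not self.src_tags and not self.issued


def wakeup(window, broadcast_tags):
    """Each entry compares its pending source tags with the broadcast tags."""
    for entry in window:
        entry.src_tags -= set(broadcast_tags)


def select(window, free_units):
    """Grant at most `free_units` ready entries, lowest index first (assumed policy)."""
    granted = [e for e in window if e.ready][:free_units]
    for e in granted:
        e.issued = True
    return [e.name for e in granted]


window = [
    WindowEntry("add p5 <- p1, p2", {"p1"}),
    WindowEntry("sub p6 <- p5, p3", {"p5"}),
    WindowEntry("mul p7 <- p1, p4", {"p1"}),
]
wakeup(window, ["p1"])        # writeback broadcasts tag p1
first = select(window, 1)     # structural hazard: only one unit is free
```

Here both `add` and `mul` wake up on `p1`, but only `add` is selected because a single unit is free; `mul` stays ready for a later cycle.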

7 Register File Read Phase
Read source operands

8 Bypass and Execute Phase

9 Data Cache Access Phase

10 Writeback Phase
Write result to register file
Broadcast tag in order to wake up waiting instructions
– Notice that the tag broadcast should happen TWO cycles in advance of the result production

11 Reservation Station Model
Used by Pentium Pro, PowerPC 604
Re-order buffer holds values
Renaming points to re-order buffer entries
– Tomasulo-like

12 Physical Register File vs. Reservation Station
Physical Register File
– Values reside in the register file
– At writeback instructions broadcast the register name
Reservation Stations:
– Values reside:
– In the register file upon commit (non-speculative)
– In reservation stations prior to commit (speculative)

13 Quantifying Complexity
Critical path delay as a function of architectural parameters
– Instruction window size (WinSize)
– Issue width (IW)
Full-custom implementations
– Study the critical path
– Delay model
– Extrapolate how it will scale with “future” technologies

14 Renaming
Inputs:
– IW instructions
– Up to 2 x input register names
– Up to 1 x output register name
Outputs:
– 2 x input physical registers
– 1 x new output physical register
– 1 x previous physical register name for checkpointing
– Updated rename table
Superscalar issue complicates things a bit

15 Renaming One Instruction
[Figure: the RAT (entries p0 to p31) is read for s1, s2 and the old mapping of d, and written with a new register for d taken from the free list. The old mapping of d is kept for misspeculation recovery.]
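
Renaming a single instruction can be sketched as below. This is an illustrative model only: the table sizes, the initial identity mapping, and the function names `make_rename_state` and `rename_one` are assumptions, not part of the slides.

```python
from collections import deque

NUM_ARCH = 32  # assumed number of architectural registers


def make_rename_state(num_phys=64):
    """Build a RAT with an identity mapping and a free list of the spare registers."""
    rat = list(range(NUM_ARCH))                 # arch reg i -> phys reg i
    free_list = deque(range(NUM_ARCH, num_phys))
    return rat, free_list


def rename_one(rat, free_list, s1, s2, d):
    """Rename one instruction 'd <- s1 op s2'."""
    ps1, ps2 = rat[s1], rat[s2]   # read the two source mappings
    old_d = rat[d]                # read the previous mapping (kept for recovery)
    new_d = free_list.popleft()   # allocate a new physical register
    rat[d] = new_d                # write the new mapping
    return ps1, ps2, old_d, new_d


rat, free_list = make_rename_state()
r1 = rename_one(rat, free_list, s1=1, s2=2, d=3)   # r3 <- r1 op r2
r2 = rename_one(rat, free_list, s1=3, s2=3, d=3)   # r3 <- r3 op r3
```

The second instruction reads the mapping written by the first, so its sources and its old destination all name the register allocated one step earlier.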

16 Renaming Two Instructions
[Figure: two instructions, each with s1, s2, d and a new d, read the RAT; cross-bundle dependency check logic then produces ps1, ps2, old d and new d for each instruction.]

17 Renaming More Instructions
Dependency checking logic for instruction i must match against all preceding destinations
If there are multiple matches it must enforce priority:
– Pick the one closest to this instruction
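
The priority rule for a rename group can be illustrated with a short sketch. It assumes all table reads see the RAT as it was before the group and that new destination registers are handed out in program order; `rename_group` and its argument layout are hypothetical names chosen for the example.

```python
def rename_group(rat, free_regs, group):
    """Rename a group of (s1, s2, d) triples given in program order."""
    # Table reads happen in parallel, so they all see the pre-group RAT.
    table_reads = [(rat[s1], rat[s2], rat[d]) for s1, s2, d in group]
    new_dests = free_regs[:len(group)]

    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        ps1, ps2, old_d = table_reads[i]
        # Scan earlier instructions oldest to youngest: a later match
        # overwrites an earlier one, so the closest preceding writer wins.
        for j in range(i):
            dest_j = group[j][2]
            if dest_j == s1:
                ps1 = new_dests[j]
            if dest_j == s2:
                ps2 = new_dests[j]
            if dest_j == d:
                old_d = new_dests[j]
        renamed.append((ps1, ps2, old_d, new_dests[i]))

    # Update the table in program order: the last writer of a register wins.
    for (_, _, d), new_d in zip(group, new_dests):
        rat[d] = new_d
    return renamed


rat = list(range(8))
group = [(1, 2, 3), (3, 4, 3), (3, 3, 5)]
out = rename_group(rat, [8, 9, 10], group)
```

The third instruction sees two earlier writers of register 3 and picks the closer one, which received physical register 9.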

18 RAT: SRAM Implementation
[Figure: decoder, SRAM cells, bitlines and sense amps; indexed by arch reg, outputs phys reg; #ARCH REGS rows by lg(#PHYS REGS) columns.]

19 SRAM RAT cell

20 RAT: CAM Implementation
[Figure: CAM cells and encoder; matched on arch reg, outputs phys reg; #PHYS REGS rows by lg(#ARCH REGS) columns, plus an active bit.]
One CAM entry per physical register
Active bit indicates the current map
New version by setting active bit

21 CAM Cell

22 SRAM vs. CAM
SRAM:
– Arch reg rows
– lg(phys reg) cols
– SRAM read/write
CAM:
– Phys reg rows
– lg(arch reg) cols
– CAM match
– Update: reset previous valid bit, set current valid bit
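
The two organizations can be contrasted with a small behavioral model. It is a sketch under assumptions: the class names, the identity initial mapping, and the linear search standing in for the parallel CAM match are illustrative choices, not the circuits from the slides.

```python
class SramRat:
    """One row per architectural register; each row stores a physical register."""

    def __init__(self, num_arch):
        self.rows = list(range(num_arch))

    def lookup(self, arch):
        return self.rows[arch]          # indexed read

    def update(self, arch, phys):
        self.rows[arch] = phys          # indexed write


class CamRat:
    """One row per physical register; each row stores an arch register and an active bit."""

    def __init__(self, num_arch, num_phys):
        self.arch = [None] * num_phys
        self.active = [False] * num_phys
        for a in range(num_arch):
            self.arch[a] = a
            self.active[a] = True

    def lookup(self, arch):
        # Associative match: software loop standing in for a parallel compare.
        matches = [p for p, a in enumerate(self.arch)
                   if self.active[p] and a == arch]
        assert len(matches) == 1, "exactly one active mapping expected"
        return matches[0]

    def update(self, arch, phys):
        self.active[self.lookup(arch)] = False   # reset previous active bit
        self.arch[phys] = arch
        self.active[phys] = True                 # set current active bit


sram, cam = SramRat(4), CamRat(4, 8)
sram.update(2, 6)
cam.update(2, 6)
```

In the CAM model the old row for architectural register 2 keeps its contents and only loses its active bit.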

23 Scheduler: Part #1 - Wakeup

24 Scheduler: Part #2 - Select
For a single FU
Tree of arbiters
– REQ signals, GRANT signals
– Anyreq raised if any req is active
– Grant issued if arbiter enabled
– Root enabled if FU available
Location-based select policy
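
A tree of arbiters for a single FU can be modeled recursively as below. This is a behavioral sketch: the fan-in of 4 and the lowest-index-first priority inside each arbiter are assumptions for the example, and `select_one` and `grant` are hypothetical names.

```python
def grant(requests, lo, hi, enable, fan_in):
    """Propagate a grant down the subtree covering requests[lo:hi]."""
    anyreq = any(requests[lo:hi])       # anyreq: raised if any request is active
    if not enable or not anyreq:
        return None                     # no grant unless this arbiter is enabled
    if hi - lo == 1:
        return lo                       # leaf: this window entry is granted
    step = -(-(hi - lo) // fan_in)      # ceiling division: entries per child
    for child_lo in range(lo, hi, step):
        child_hi = min(child_lo + step, hi)
        if any(requests[child_lo:child_hi]):
            # Location-based priority: the first requesting child is enabled.
            return grant(requests, child_lo, child_hi, True, fan_in)
    return None


def select_one(requests, fu_available, fan_in=4):
    """The root arbiter is enabled only if the functional unit is available."""
    return grant(requests, 0, len(requests), fu_available, fan_in)


requests = [False] * 16
requests[5] = requests[11] = True
winner = select_one(requests, fu_available=True)
blocked = select_one(requests, fu_available=False)
```

With 16 entries and a fan-in of 4 the grant passes through two levels of arbiters, which matches the logarithmic depth of such a tree.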

25 Select for More Than One FU
Handling multiple FUs of the same type:
– Stack select logic blocks in series (hierarchy)
– Mask the request granted to the previous unit
– NOT feasible for more than 2 FUs
Alternative:
– Statically partition the issue window among FUs
– MIPS R10000, HP PA 8000
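
The two approaches can be compared with a small sketch. It assumes a simple lowest-index-first select block and a window size divisible by the number of FUs; the function names are hypothetical.

```python
def first_request(requests):
    """Stand-in for one select block: grant the lowest-index active request."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_stacked(requests, num_fus):
    """Select blocks in series: each block sees the requests minus earlier grants."""
    remaining = list(requests)
    grants = []
    for _ in range(num_fus):
        g = first_request(remaining)
        if g is None:
            break
        grants.append(g)
        remaining[g] = False      # mask the request granted to the previous unit
    return grants


def select_partitioned(requests, num_fus):
    """Static partition: each FU only selects from its own slice of the window."""
    size = len(requests) // num_fus
    grants = []
    for fu in range(num_fus):
        g = first_request(requests[fu * size:(fu + 1) * size])
        grants.append(None if g is None else fu * size + g)
    return grants


requests = [False, True, True, False, False, False, False, False]
stacked = select_stacked(requests, 2)
partitioned = select_partitioned(requests, 2)
```

In this example the stacked scheme issues both ready instructions, while the partitioned scheme leaves the second FU idle because both requests fall in the first half of the window. The stacked blocks operate in series, which is why the slide limits that approach to two FUs.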

26 Datapath and Bypass
Commonly used layout: 1 bit-slice
Turn on Tri-State A to pass result of FU1 to left operand of FU0

27 Complexity Analysis
Critical path delay as a function of:
– Issue Width
– Window Size
Register Renaming Table
Wakeup and Select
Bypass paths

28 Methodology
A representative CMOS design is selected from published alternatives
Circuits implemented for 3 technologies:
– 0.8 micron, 0.35 micron and 0.18 micron
Optimized for speed
Wire parasitics in delay model
– Rmetal, Cmetal

29 Methodology
Feature size scaling: 1 / S
Voltage scaling: 1 / U
Logic delay = $(C_L \times V) / I$
Capacitive load: $C_L$: $1 \rightarrow 1/S$
Supply voltage: $V$: $1 \rightarrow 1/U$
Average charge/discharge current: $I$: $1 \rightarrow 1/U$
So, logic delay scales as $(1/S \times 1/U) / (1/U) = 1/S$

30 Wire Delay
L: wire length
Intrinsic RC delay: $0.5 \times R_{metal} \times C_{metal} \times L^2$
Rmetal: resistance per unit length
Cmetal: capacitance per unit length
0.5: 1st order approximation of distributed RC model
– Uniformly distributed R & C

31 Wire Delay Scaling
Metal thickness doesn’t scale much
– Width ~ 1/S
– Rmetal ~ S
Fringe capacitance dominates in smaller feature sizes
– Edges to parallel wires and the substrate
– Parallel plate scales with 1/S
– Cmetal ~ S
Length scales with 1/S
Overall scale factor: $S \times S \times (1/S)^2 = 1$
Wire delay remains constant
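
The scaling arguments for logic and wire delay can be checked numerically. The sketch below simply multiplies the scale factors quoted on the slides; the value chosen for U is an arbitrary assumption, and S is taken as the ratio between the 0.8 micron and 0.18 micron feature sizes.

```python
def logic_delay_scale(S, U):
    """Scale factor of logic delay (C_L * V) / I."""
    c_load = 1 / S     # capacitive load scales as 1/S
    voltage = 1 / U    # supply voltage scales as 1/U
    current = 1 / U    # average charge/discharge current scales as 1/U
    return c_load * voltage / current


def wire_delay_scale(S):
    """Scale factor of intrinsic wire delay 0.5 * Rmetal * Cmetal * L^2."""
    r_metal = S        # resistance per unit length grows with S
    c_metal = S        # capacitance per unit length grows with S
    length = 1 / S     # wire length shrinks with feature size
    return r_metal * c_metal * length ** 2   # the constant 0.5 cancels out


S = 0.8 / 0.18         # moving from 0.8 micron to 0.18 micron
U = 2.0                # assumed voltage scaling factor, for illustration only
logic = logic_delay_scale(S, U)
wire = wire_delay_scale(S)
wire_vs_logic = wire / logic   # wire delay relative to logic delay
```

Under these factors the logic delay shrinks by 1/S while the wire delay stays constant, so wire delay measured in units of gate delay grows by a factor of S.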

32 Register Renaming Table

33 Dependency Checking Logic
Accessed in parallel with map table
Every logical reg compared against logical dest regs of current rename group
For IW=2,4,8, delay less than map table

34 Renaming Delay
SRAM scheme
Delay components:
– Time to decode the arch reg index
– Time to drive wordline
– Time to pull down bit line
– Time for sense amp to detect pull-down
– MUX time ignored as control from dep. check logic comes in advance

35 Renaming Circuit

36 Decoder Delay

37 Decoder Delay
Predecoding for speed
Length of predecode lines:
– Cellheight: height of single cell excluding wordlines
– Wordline spacing
– NVREG: # of virtual registers
– x3: 3-operand instructions

38 Decoder Delay
Tnand: fall delay of NAND
Tnor: rise delay of NOR
Req: Rnandpd (NAND pull-down channel resistance) + predecode line metal resistance
Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance

39 Decoder Delay
Substituting the predecode line length, Req and Ceq:
– c2: intrinsic RC delay of predecode line
– c2 very small
– Decoder delay ~linearly dependent on IW

40 Rename Delay
Wordline:
– c2: intrinsic RC delay of wordline
– c2 very small, so wordline delay is ~linearly dependent on IW

41 Rename Delay
Bitline:
– c2 very small
– Bitline delay ~linearly dependent on IW
SenseAmp delay ~linearly dependent on IW

42 Rename Logic Delay Scaling
As feature size decreases, the increase in bitline and wordline delay with increasing IW grows
– 0.8um: IW 2 → 8: bitline delay +37%
– 0.18um: IW 2 → 8: bitline delay +53%
Total delay increases linearly with IW
Each component shows linear increase with IW
Bitline delay > wordline delay
– Bitline length ~ # of logical registers
– Wordline length ~ width of physical reg designator
IW impact on delay worsens with decreasing feature size

43 Wakeup Delay
Critical path: mismatch → pull ready signal low
Delay components:
– Tag drivers → drive tag lines (vertical)
– Mismatched bit: pull-down stack → pull matchline low (horizontal)
– Final OR gate → OR all the matchlines of an operand tag
Ttagdrive ~ driver pull-up R & tagline length & tagline load C
Quadratic component significant for IW>2 & 0.18um

44 Wakeup Delay
Quadratic component small for both cases
Both delays ~linearly dependent on IW

45 Wakeup Delay: IW and Window Size
0.18um process
Quadratic dependence
Issue width has greater effect → increases all 3 delay components
As IW and WinSize increase together, delay changes as marked on the plot
[Figure: wakeup delay vs. IW and WinSize]

46 Wakeup Delay: Window Size
8-way & 0.18um process
Tag drive delay increases rapidly with increasing WinSize
Match OR delay constant

47 Wakeup Delay: Feature size
8-way & 64-entry window
Tag drive and tag match delays do not scale as well as match OR delay
– Match OR → logic delay
– Others → also have wire delays

48 Selection Logic and Bypass Delay
Selection:
– Logarithmically dependent on WinSize
Bypass:
– Delay dependent on $IW^2$

