Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I 1 Duke Compsci 220 / ECE 252 Advanced Computer Architecture I Prof. Alvin R. Lebeck Dynamic Scheduling.

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I 1 Duke Compsci 220 / ECE 252 Advanced Computer Architecture I Prof. Alvin R. Lebeck Dynamic Scheduling I Slides developed by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood. Slides enhanced by Milo Martin, Mark Hill, Alvin Lebeck, Dan Sorin, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood

Administrative Homework #2 due (upload through blackboard) Homework #3 up Midterm moved to October 19. Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I2

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling3 Review: Superblocks First trace scheduling construct: superblock  Use when branch is highly biased  Fuse blocks from most frequent path: A,C,D  Schedule  Create repair code in case real path was A,B,D 0: ldf Y(r1),f2 1: fbne f2,4 4: stf f0,Y(r1) 5: ldf X(r1),f4 6: mulf f4,f2,f6 7: stf f6,Z(r1) 2: ldf W(r1),f2 3: jump 5 NT=5%T=95% A BC D

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling4 Review: ISA Support for Load-Branch Speculation IA-64: change insn 2 to speculative load ldf.s  “Speculative” means advanced past some unknown branch  Processor keeps exception bit with register f8  Inserted insn chk.s checks exception bit  If exception, jump to yet more repair code (arghhh…) IA-64 also contains ldf.sa 0: ldf Y(r1),f2 2: ldf.s W(r1),f8 5: ldf X(r1),f4 6: mulf f4,f8,f6 1: fbne f2,4 8: chk.s f8 7: stf f6,Z(r1) 4: stf f0,Y(r1) 6’: mulf f4,f2,f6 7’: stf f6,Z(r1) Superblock Repair code Repair code 2

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling5 Non-Biased Branches: Use Predication 0: ldf Y(r1),f2 1: fbne f2,4 4: stf f0,Y(r1) 5: ldf X(r1),f4 6: mulf f4,f2,f6 7: stf f6,Z(r1) 2: ldf W(r1),f2 3: jump 5 NT=50%T=50% A BC D 0: ldf Y(r1),f2 1: fspne f2,p1 2: ldf.p p1,W(r1),f2 4: stf.np p1,f0,Y(r1) 5: ldf X(r1),f4 6: mulf f4,f2,f6 7: stf f6,Z(r1)  Using Predication

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling6 Predication Conventional control  Conditionally executed insns also conditionally fetched Predication  Conditionally executed insns unconditionally fetched  Full predication (ARM, IA-64)  Can tag every insn with predicate, but extra bits in instruction  Conditional moves (Alpha, IA-32)  Construct appearance of full predication from one primitive cmoveq r1,r2,r3 // if (r1==0) r3=r2; –May require some code duplication to achieve desired effect +Only good way of adding predication to an existing ISA If-conversion: replacing control with predication +Good if branch is unpredictable (save mis-prediction) –But more instructions fetched and “executed”

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling7 ISA Support for Predication IA-64: change branch 1 to set-predicate insn fspne Change insns 2 and 4 to predicated insns  ldf.p performs ldf if predicate p1 is true  stf.np performs stf if predicate p1 is false 0: ldf Y(r1),f2 1: fspne f2,p1 2: ldf.p p1,W(r1),f2 4: stf.np p1,f0,Y(r1) 5: ldf X(r1),f4 6: mulf f4,f2,f6 7: stf f6,Z(r1)

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling8 Hyperblock Scheduling Second trace scheduling construct: hyperblock  Use when branch is not highly biased  Fuse all four blocks: A,B,C,D  Use predication to conditionally execute insns in B and C  Schedule 0: ldf Y(r1),f2 1: fbne f2,4 4: stf f0,Y(r1) 5: ldf X(r1),f4 6: mulf f4,f2,f6 7: stf f6,Z(r1) 2: ldf W(r1),f2 3: jump 5 NT=50%T=50% A BC D 0: ldf Y(r1),f2 1: fspne f2,p1 2: ldf.p p1,W(r1),f2 4: stf.np p1,f0,Y(r1) 5: ldf X(r1),f4 6: mulf f4,f2,f6 7: stf f6,Z(r1)

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling9 Static Scheduling Summary Goal: increase scope to find more independent insns Loop unrolling +Simple –Expands code size, can’t handle recurrences or non-loops Trace scheduling  Superblocks and hyperblocks +Works for non-loops –More complex, requires ISA support for speculation and predication –Requires nasty repair code

Compsci 220 / ECE 252 (Lebeck): Multiple Issue & Static Scheduling10 Multiple Issue Summary Problem spots  Wide fetch + branch prediction  trace cache?  N 2 dependence cross-check  N 2 bypass  clustering? Implementations  Statically scheduled superscalar  VLIW/EPIC What’s next:  Finding more ILP by relaxing the in-order execution requirement

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I11 Now: Dynamic Scheduling I Dynamic scheduling  Out-of-order execution Scoreboard  Dynamic scheduling with WAW/WAR Tomasulo’s algorithm  Add register renaming to fix WAW/WAR Next unit  Adding speculation and precise state  Dynamic load scheduling Application OS FirmwareCompiler CPUI/O Memory Digital Circuits Gates & Transistors

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I12 The Problem With In-Order Pipelines 12345678910111213141516 addf f0,f1,f2 FDE+ W mulf f2,f3,f2 FDd* E* W subf f0,f1,f4 Fp* DE+ W What’s happening in cycle 4?  mulf stalls due to RAW hazard  OK, this is a fundamental problem  subf stalls due to pipeline hazard  Why? subf can’t proceed into D because addf is there  That is the only reason, and it isn’t a fundamental one Why can’t subf go into D in cycle 4 and E+ in cycle 6?

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I13 regfile D$ I$ BPBP insn buffer SD add p2,p3,p4 sub p2,p4,p5 mul p2,p5,p6 div p4,4,p7 Ready Table P2P3P4P5P6P7 Yes div p4,4,p7 mul p2,p5,p6 sub p2,p4,p5 add p2,p3,p4 and Dynamic Scheduling: The Big Picture Instructions fetch/decoded/renamed into Instruction Buffer  Also called “instruction window” or “instruction scheduler” Instructions (conceptually) check ready bits every cycle  Execute when ready Time

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I14 Register Renaming To eliminate WAW and WAR hazards Example  Names: r1,r2,r3  Locations: p1,p2,p3,p4,p5,p6,p7  Original mapping: r1  p1, r2  p2, r3  p3, p4 – p7 are “free”  Renaming +Removes WAW and WAR dependences +Leaves RAW intact! MapTableFreeListRaw insnsRenamed insns r1r2r3 p1p2p3p4,p5,p6,p7add r2,r3,r1add p2,p3,p4 p4p2p3p5,p6,p7sub r2,r1,r3sub p2,p4,p5 p4p2p5p6,p7mul r2,r3,r3mul p2,p5,p6 p4p2p6p7div r1,4,r1div p4,4,p7

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I15 Dynamic Scheduling - OoO Execution Dynamic scheduling  Totally in the hardware  Also called “out-of-order execution” (OoO) Fetch many instructions into instruction window  Use branch prediction to speculate past (multiple) branches  Flush pipeline on branch misprediction Rename to avoid false dependencies (WAW and WAR) Execute instructions as soon as possible  Register dependencies are known  Handling memory dependencies more tricky (much more later) Commit instructions in order  Anything strange happens before commit, just flush the pipeline Current machines: 64-100+ instruction scheduling window

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I16 Static Instruction Scheduling Issue: time at which insns execute Schedule: order in which insns execute  Related to issue, but the distinction is important Scheduling: re-arranging insns to enable rapid issue  Static: by compiler  Requires knowledge of pipeline and program dependences  Pipeline scheduling: the basics  Requires large scheduling scope full of independent insns  Loop unrolling, software pipelining: increase scope for loops  Trace scheduling: increase scope for non-loops Anything software can do … hardware can do better

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I17 Motivation Dynamic Scheduling Dynamic scheduling (out-of-order execution)  Execute insns in non-sequential (non-VonNeumann) order… +Reduce RAW stalls +Increase pipeline and functional unit (FU) utilization Original motivation was to increase FP unit utilization +Expose more opportunities for parallel issue (ILP) Not in-order  can be in parallel  …but make it appear like sequential execution  Important –But difficult  Next unit

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I18 Before We Continue If we can do this in software… …why build complex (slow-clock, high-power) hardware? +Performance portability  Don’t want to recompile for new machines +More information available  Memory addresses, branch directions, cache misses +More registers available (??)  Compiler may not have enough to fix WAR/WAW hazards +Easier to speculate and recover from mis-speculation  Flush instead of recover code –But compiler has a larger scope  Compiler does as much as it can (not much)  Hardware does the rest

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I19 Going Forward: What’s Next We’ll build this up in steps over the next several lectures  “Scoreboarding” - first OoO, no register renaming  “Tomasulo’s algorithm” - adds register renaming  Handling precise state and speculation  P6-style execution (Intel Pentium Pro)  R10k-style execution (MIPS R10k)  Handling memory dependencies  Conservative and speculative Let’s get started!

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I20 Dynamic Scheduling as Loop Unrolling Three steps of loop unrolling  Step I: combine iterations  Increase scheduling scope for more flexibility  Step II: pipeline schedule  Reduce impact of RAW hazards  Step III: rename registers  Remove WAR/WAW violations that result from scheduling

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I21 Loop Example: SAX (SAXPY – PY) SAX (Single-precision A X)  Only because there won’t be room in the diagrams for SAXPY for (i=0;i<N;i++) Z[i]=A*X[i]; 0: ldf X(r1),f1 // loop 1: mulf f0,f1,f2 // A in f0 2: stf f4,Z(r1) 3: addi r1,4,r1 // i in r1 4: blt r1,r2,0 // N*4 in r2  Consider two iterations, ignore branch ldf, mulf, stf, addi, ldf, mulf, stf

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I22 New Pipeline Terminology In-order pipeline  Often written as F,D,X,W (multi-cycle X includes M)  Example pipeline: 1-cycle int (including mem), 3-cycle pipelined FP regfile D$ I$ BPBP

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I23 New Pipeline Diagram InsnDXW ldf X(r1),f1c1c2c3 mulf f0,f1,f2c3c4+c7 stf f2,Z(r1)c7c8c9 addi r1,4,r1c8c9c10 ldf X(r1),f1c10c11c12 mulf f0,f1,f2c12c13+c16 stf f2,Z(r1)c16c17c18 Alternative pipeline diagram  Down: insns  Across: pipeline stages  In boxes: cycles  Basically: stages  cycles  Convenient for out-of-order

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I24 The Problem With In-Order Pipelines In-order pipeline  Structural hazard: 1 insn register (latch) per stage  1 insn per stage per cycle (unless pipeline is replicated)  Younger insn can’t “pass” older insn without “clobbering” it Out-of-order pipeline  Implement “passing” functionality by removing structural hazard regfile D$ I$ BPBP

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I25 Instruction Buffer Trick: insn buffer (many names for this buffer)  Basically: a bunch of latches for holding insns  Implements iteration fusing … here is your scheduling scope Split D into two pieces  Accumulate decoded insns in buffer in-order  Buffer sends insns down rest of pipeline out-of-order regfile D$ I$ BPBP insn buffer D2D1

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I26 Dispatch and Issue Dispatch (D): first part of decode  Allocate slot in insn buffer –New kind of structural hazard (insn buffer is full)  In order: stall back-propagates to younger insns Issue (S): second part of decode  Send insns from insn buffer to execution units +Out-of-order: wait doesn’t back-propagate to younger insns regfile D$ I$ BPBP insn buffer SD

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I27 Dispatch and Issue with Floating-Point regfile D$ I$ BPBP insn buffer SD F-regfile E/ E+E+ E+E+ E*

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I28 Dynamic Scheduling Algorithms Three parts to loop unrolling  Scheduling scope: insn buffer  Pipeline scheduling and register renaming: scheduling algorithm Look at two register scheduling algorithms  Register scheduler: scheduler based on register dependences  Scoreboard  No register renaming  limited scheduling flexibility  Tomasulo  Register renaming  more flexibility, better performance Big simplification in this unit: memory scheduling  Pretend register algorithm magically knows memory dependences  A little more realism next unit

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I29 Scheduling Algorithm I: Scoreboard Scoreboard  Centralized control scheme: insn status explicitly tracked  Insn buffer: Functional Unit Status Table (FUST) First implementation: CDC 6600 [1964]  16 separate non-pipelined functional units (7 int, 4 FP, 5 mem)  No register bypassing Our example: “Simple Scoreboard”  5 FU: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I30 Scoreboard Data Structures FU Status Table  FU, busy, op, R, R1, R2: destination/source register names  T: destination register tag (FU producing the value)  T1,T2: source register tags (FU producing the values) Register Status Table  T: tag (FU that will write this register) Tags interpreted as ready-bits  Tag == 0  Value is ready in register file  Tag != 0  Value is not ready, will be supplied by T Insn status table  S,X bits for all active insns (issue & execute)

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I31 Simple Scoreboard Data Structures  Insn fields and status bits  Tags  Values FU Status R1R2 XS Insn value FU T T2T1Top == Reg Status Fetched insns Regfile R T == CAMs

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I32 Scoreboard Pipeline New pipeline structure: F, D, S, X, W  F (fetch)  Same as it ever was  D (dispatch)  Structural or WAW hazard ? stall : allocate scoreboard entry  S (issue)  RAW hazard ? wait : read registers, go to execute  X (execute)  Execute operation, notify scoreboard when done  W (writeback)  WAR hazard ? wait : write register, free scoreboard entry  W and RAW-dependent S in same cycle  W and structural-dependent D in same cycle

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I33 Scoreboard Dispatch (D) Stall for WAW or structural (Scoreboard, FU) hazards  Allocate scoreboard entry  Copy Reg Status for input registers  Set Reg Status for output register FU Status R1R2 XS Insn value FU T T2T1Top == Reg Status Fetched insns Regfile R T ==

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I34 Scoreboard Issue (S) Wait for RAW register hazards  Read registers FU Status R1R2 XS Insn value FU T T2T1Top == Reg Status Fetched insns Regfile R T ==

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I35 Issue Policy and Issue Logic Issue  If multiple insns ready, which one to choose? Issue policy  Oldest first? Safe  Longest latency first? May yield better performance  Select logic: implements issue policy  W  1 priority encoder  W: window size (number of scoreboard entries)

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I36 Scoreboard Execute (X)  Execute insn FU Status R1R2 XS Insn value FU T T2T1Top == Reg Status Fetched insns Regfile R T ==

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I37 Scoreboard Writeback (W) Wait for WAR hazard  Write value into regfile, clear Reg Status entry  Compare tag to waiting insns input tags, match ? clear input tag  Free scoreboard entry FU Status R1R2 XS Insn value FU T T2T1Top == Reg Status Fetched insns Regfile R T ==

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I38 Scoreboard Data Structures Insn Status InsnDSXW ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1 f2 r1 FU Status FUbusyopRR1R2T1T2 ALUno LDno STno FP1no FP2no

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I39 Scoreboard: Cycle 1 Insn Status InsnDSXW ldf X(r1),f1c1 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2 r1 FU Status FUbusyopRR1R2T1T2 ALUno LDyesldff1-r1-- STno FP1no FP2no allocate

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I40 Scoreboard: Cycle 2 Insn Status InsnDSXW ldf X(r1),f1c1c2 mulf f0,f1,f2c2 stf f2,Z(r1) addi r1,4,r1 ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 r1 FU Status FUbusyopRR1R2T1T2 ALUno LDyesldff1-r1-- STno FP1yesmulff2f0f1-LD FP2no allocate

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I41 Scoreboard: Cycle 3 Insn Status InsnDSXW ldf X(r1),f1c1c2c3 mulf f0,f1,f2c2 stf f2,Z(r1)c3 addi r1,4,r1 ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 r1 Functional unit status FUbusyopRR1R2T1T2 ALUno LDyesldff1-r1-- STyesstf-f2r1FP1- yesmulff2f0f1-LD FP2no allocate

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I42 Scoreboard: Cycle 4 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4 stf f2,Z(r1)c3 addi r1,4,r1c4 ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 r1ALU FU Status FUbusyopRR1R2T1T2 ALUyesaddir1 --- LDno STyesstf-f2r1FP1- yesmulff2f0f1-LD FP2no allocate free f0 ( LD ) is ready  issue mulf f1 written  clear

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I43 Scoreboard: Cycle 5 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4c5 stf f2,Z(r1)c3 addi r1,4,r1c4c5 ldf X(r1),f1c5 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 r1ALU FU Status FUbusyopRR1R2T1T2 ALUyesaddir1 --- LDyesldff1-r1-ALU STyesstf-f2r1FP1- yesmulff2f0f1-- FP2no allocate

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I44 Scoreboard: Cycle 6 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4c5+ stf f2,Z(r1)c3 addi r1,4,r1c4c5c6 ldf X(r1),f1c5 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 r1ALU FU Status FUbusyopRR1R2T1T2 ALUyesaddir1 --- LDyesldff1-r1-ALU STyesstf-f2r1FP1- yesmulff2f0f1-- FP2no D stall: WAW hazard w/ mulf ( f2 ) How to tell? RegStatus[f2] non-empty

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I45 Scoreboard: Cycle 7 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4c5+ stf f2,Z(r1)c3 addi r1,4,r1c4c5c6 ldf X(r1),f1c5 mulf f0,f1,f2 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 r1ALU FU Status FUbusyopRR1R2T1T2 ALUyesaddir1 --- LDyesldff1-r1-ALU STyesstf-f2r1FP1- yesmulff2f0f1-- FP2no W wait: WAR hazard w/ stf ( r1 ) How to tell? Untagged r1 in FuStatus Requires CAM

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I46 Scoreboard: Cycle 8 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4c5+c8 stf f2,Z(r1)c3c8 addi r1,4,r1c4c5c6 ldf X(r1),f1c5 mulf f0,f1,f2c8 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP1 FP2 r1ALU FU Status FUbusyopRR1R2T1T2 ALUyesaddir1 --- LDyesldff1-r1-ALU STyesstf-f2r1FP1- no FP2yesmulff2f0f1-LD allocate free f1 ( FP1 ) is ready  issue stf first mulf done ( FP1 ) W wait

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I47 Scoreboard: Cycle 9 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4c5+c8 stf f2,Z(r1)c3c8c9 addi r1,4,r1c4c5c6c9 ldf X(r1),f1c5c9 mulf f0,f1,f2c8 stf f2,Z(r1) Reg Status RegT f0 f1LD f2FP2 r1ALU FU Status FUbusyopRR1R2T1T2 ALUno LDyesldff1-r1-ALU STyesstf-f2r1-- FP1no FP2yesmulff2f0f1-LD D stall: structural hazard FuStatus[ST] r1 written  clear free r1 ( ALU ) is ready  issue ldf

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I48 Scoreboard: Cycle 10 Insn Status InsnDSXW ldf X(r1),f1c1c2c3c4 mulf f0,f1,f2c2c4c5+c8 stf f2,Z(r1)c3c8c9c10 addi r1,4,r1c4c5c6c9 ldf X(r1),f1c5c9c10 mulf f0,f1,f2c8 stf f2,Z(r1)c10 Reg Status RegT f0 f1LD f2FP2 r1 FU Status FUbusyopRR1R2T1T2 ALUno LDyesldff1-r1-- STyesstf-f2r1FP2- FP1no FP2yesmulff2f0f1-LD W & structural-dependent D in same cycle free, then allocate

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I49 In-Order vs. Scoreboard Big speedup? –Only 1 cycle advantage for scoreboard  Why? addi WAR hazard  Scoreboard issued addi earlier ( c8  c5 )  But WAR hazard delayed W until c9  Delayed issue of second iteration In-OrderScoreboard InsnDXWDSXW ldf X(r1),f1c1c2c3c1c2c3c4 mulf f0,f1,f2c3c4+c7c2c4c5+c8 stf f2,Z(r1)c7c8c9c3c8c9c10 addi r1,4,r1c8c9c10c4c5c6c9 ldf X(r1),f1c10c11c12c5c9c10c11 mulf f0,f1,f2c12c13+c16c8c11c12+c15 stf f2,Z(r1)c16c17c18c10c15c16c17

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I50 In-Order vs. Scoreboard II: Cache Miss Assume  5 cycle cache miss on first ldf  Ignore FUST structural hazards –Little relative advantage  addi WAR hazard ( c7  c13 ) stalls second iteration In-OrderScoreboard InsnDXWDSXW ldf X(r1),f1c1c2+c7c1c2c3+c8 mulf f0,f1,f2c7c8+c11c2c8c9+c12 stf f2,Z(r1)c11c12c13c3c12c13c14 addi r1,4,r1c12c13c14c4c5c6c13 ldf X(r1),f1c14c15c16c5c13c14c15 mulf f0,f1,f2c16c17+c20c6c15c16+c19 stf f2,Z(r1)c20c21c22c7c19c20c21

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I51 Scoreboard Redux The good +Cheap hardware  InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area +Pretty good performance  1.7X for FORTRAN (scientific array) programs The less good –No bypassing  Is this a fundamental problem? –Limited scheduling scope  Structural/WAW hazards delay dispatch –Slow issue of truly-dependent (RAW) insns  WAR hazards delay writeback  Fix with hardware register renaming

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I 1 Duke Compsci 220 / ECE 252 Advanced Computer Architecture I Prof. Alvin R. Lebeck Dynamic Scheduling.

Similar presentations

Presentation on theme: "Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I 1 Duke Compsci 220 / ECE 252 Advanced Computer Architecture I Prof. Alvin R. Lebeck Dynamic Scheduling."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I 1 Duke Compsci 220 / ECE 252 Advanced Computer Architecture I Prof. Alvin R. Lebeck Dynamic Scheduling.

Similar presentations

Presentation on theme: "Compsci 220 / ECE 252 (Lebeck): Dynamic Scheduling I 1 Duke Compsci 220 / ECE 252 Advanced Computer Architecture I Prof. Alvin R. Lebeck Dynamic Scheduling."— Presentation transcript:

Similar presentations

About project

Feedback