A. Moshovos ©ECE1773 - Spring ‘02 ECE Toronto Instruction Level Parallel Processing Sequential Execution Semantics Superscalar Execution –Interruptions.

Instruction Level Parallel Processing
Sequential Execution Semantics
Superscalar Execution
–Interruptions
Out-of-Order Execution
–How it can help
–Issues: Maintaining Sequential Semantics, Scheduling
–Scoreboard
–Register Renaming
Initially, we’ll focus on registers; memory later on

Sequential Semantics - Review
Instructions appear as if they executed:
–In the order they appear in the program
–One after the other
Pipelining: Partial Overlap of Instructions
–Initiate one instruction per cycle
–Subsequent instructions overlap partially
–Commit one instruction per cycle

Can we do better than pipelining?
Example: sum += a[i--]
loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
[Timing diagrams: ld, add, sub and bne passing through fetch, decode and execute, first under pipelining, then under superscalar execution]

Superscalar - In-order (initial def.)
Two or more consecutive instructions (in the original program order) can execute in parallel
Is this much better than pipelining?
–What if all instructions were dependent? Superscalar buys us nothing
Again, the key is typical program behavior
–Some parallelism exists
–Pipelining “drains” on dependences
–Superscalar consumes “fill-up” time

Practicalities
Issue mechanism
–At decode check: dependences, input operand availability
–Check against instructions: simultaneously decoded, and in progress in the pipeline (i.e., previously issued)
–Recall the register vector from pipelining
Increasingly complex with degree of superscalarity
–2-way, 3-way, …, n-way

Issue Rules
Stall at decode if:
–RAW dependence and no data available
–WAR or WAW dependence
–No resource available
This check is done in program order
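
The decode-time checks on this slide can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the `Instr` record, the `stall_reason` function and the unit names are hypothetical, and the WAR check is deliberately conservative (every older in-flight instruction is treated as a possible pending reader).

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    tgt: Optional[str]        # target register, None for stores/branches
    srcs: Tuple[str, ...]     # source registers
    unit: str                 # functional unit class it needs (assumed naming)

def stall_reason(instr, older, available, free_units):
    """Return why `instr` must stall at decode, or None if it may issue.

    older      -- earlier instructions (program order) not yet written back
    available  -- set of registers whose values are available
    free_units -- set of functional unit classes that are free
    """
    if any(src not in available for src in instr.srcs):
        return "RAW"
    for prev in older:
        if instr.tgt is not None and instr.tgt == prev.tgt:
            return "WAW"
        if instr.tgt is not None and instr.tgt in prev.srcs:
            return "WAR"
    if instr.unit not in free_units:
        return "resource"
    return None

# The loop from the earlier slide: ld r2,10(r1); add r3,r3,r2; sub r1,r1,1
ld = Instr("ld", "r2", ("r1",), "mem")
add = Instr("add", "r3", ("r3", "r2"), "alu")
sub = Instr("sub", "r1", ("r1",), "alu")
avail = {"r1", "r3"}          # r2 is still being loaded
print(stall_reason(add, [ld], avail, {"alu"}))   # add needs r2
print(stall_reason(sub, [ld], avail, {"alu"}))   # sub overwrites r1, which ld reads
```

Because the checks walk `older` in program order, the first hazard found is the one that blocks issue, which mirrors the "check is done in program order" rule.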

Issue Mechanism
Assume 2 sources & 1 target max per instr.
–Comparators for 2-way: 3 for tgt and 2 for src
–Comparators for 4-way:
  2nd instr: 3 tgt and 2 src
  3rd instr: 6 tgt and 4 src
  4th instr: 9 tgt and 6 src
Simplifications may be possible; resource checking not shown
[Diagram: tgt and src fields of each instruction compared against those of earlier instructions, in program order]
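
The comparator counts above follow a simple pattern, sketched here in Python. The function names are hypothetical, and the breakdown is one reading of the slide: each earlier target is compared against the later instruction's two sources (RAW) and its target (WAW), and each earlier source is compared against the later instruction's target (WAR).

```python
def comparators_for_slot(k, srcs=2, tgts=1):
    """Comparators needed by the k-th instruction (1-based) of an issue group.

    Returns (comparisons against earlier targets, comparisons against earlier sources).
    """
    earlier = k - 1
    vs_tgt = earlier * tgts * (srcs + tgts)   # RAW per source + WAW per target
    vs_src = earlier * srcs * tgts            # WAR: own target vs. earlier sources
    return vs_tgt, vs_src

def total_comparators(n):
    """Total comparators for an n-way issue group."""
    return sum(sum(comparators_for_slot(k)) for k in range(1, n + 1))

for k in (2, 3, 4):
    print(k, comparators_for_slot(k))
print(total_comparators(2), total_comparators(4))
```

Under these assumptions the total grows as 5·n·(n−1)/2, i.e., quadratically with issue width, which is why the check becomes increasingly complex for wider machines.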

Implications
Need to multiport some structures
–Register File: multiple reads and writes per cycle
–Register Availability Vector: multiple reads and writes per cycle, from Decode and Commit!
–Also need to worry about WAR and WAW
Resource tracking
–Additional issue conditions
Many superscalars had additional restrictions
–E.g., execute one integer and one floating-point op
–One branch, or one store/load

Preserving Sequential Semantics
In principle not much different than pipelining
Program order is preserved in the pipeline
Some instructions proceed in parallel
–But order is clearly defined
Defer interrupts to commit stage (i.e., writeback)
–Flush all subsequent instructions; may include instructions committing simultaneously
–Allow all preceding instructions to commit
Recall comparisons are done in program order
Must have sufficient time in clock cycle to handle these

Interrupts Example
[Timing diagrams: ld, add, div and bne in flight, with the points labelled “Exception raised” and “Exception taken”]

Superscalar vs. Pipelining
In principle they are orthogonal
–Superscalar non-pipelined machine
–Pipelined non-superscalar
–Superscalar and pipelined (common)
Additional functionality needed by superscalar:
–Another bound on clock cycle
–At some point it limits the number of pipeline stages

Superscalar vs. Superpipelining
Superpipelining:
–Vaguely defined as deep pipelining, i.e., lots of stages
Superscalar issue may prevent superpipelining
How do they compare?
[Timing diagram: instructions passing through fetch, decode and execute]

Superscalar vs. Superpipelining
[Timing diagrams: superscalar and superpipelining]
Limit: all instructions independent
Difference is initiation interval

Superscalar vs. Superpipelining
[Timing diagram]
Every stall introduces an additional “initiation interval” difference between the two

Case Study: Alpha 21164

21164: Int. Pipe

21164: Memory Pipeline

21164: Floating-Point Pipe

80486 Pipeline
Fetch
–Load 16 bytes into prefetch buffer
Decode 1
–Determine instruction length and type
Decode 2
–Compute memory address
–Generate immediate operands
Execute
–Register read
–ALU
–Memory read/write
Write-back
–Update register file
(source: CS740 CMU, ’97)

80486 Pipeline detail
Fetch
–Moves 16 bytes of instruction stream into code queue
–Not required every time
–About 5 instructions fetched at once (avg. length 2.5 bytes)
–Only useful if don’t branch
–Avoids need for separate instruction cache
D1
–Determine total instruction length
–Signals code queue aligner where next instruction begins
–May require two cycles when multiple operands must be decoded; about 6% of “typical” DOS program

80486 Pipeline
D2
–Extract memory displacements and immediate operands
–Compute memory addresses: add base register, and possibly scaled index register
–May require two cycles if index register involved, or both address & immediate operand; approx. 5% of executed instructions
EX
–Read register operands
–Compute ALU function
–Read or write memory (data cache)
WB
–Update register result

Out-of-Order Execution
Also known as dynamic scheduling
–Compilers do static scheduling
We will start by considering registers only
–Register interface helps a lot
Later on we will expand to memory
–Tricky: memory interface is more powerful than registers
–Makes it harder to figure out dependences
In principle the same method will be used for both

Beyond Superscalar Execution
Example: sum += a[m]; m--
loop: ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
[Timing diagrams: superscalar vs. out-of-order execution of ld, add, sub and bne]

Sequential Semantics?
Execution does NOT adhere to sequential semantics
–To be precise: eventually it may
Simplest solution: define the problem away
–Imprecise interrupts
–On interrupt some instr. committed, some not
–Software will have to figure out what is going on
–Horrible for debugging and programming
[Timing diagram: ld, add, sub and bne executing out of order; the state seen at the interrupt is inconsistent]

Interrupt
Recall we use the term interrupt to signify the need to observe the machine’s state after any instruction
–This can indeed be the result of an interrupt in the classical sense
–But it could also be a debugger (still uses interrupts)

Out-of-Order vs. Pipelining and Superscalar
Definition: two or more instructions can execute in any order if they have no dependences (RAW, WAW, WAR)
–Beware of transitive dependences
Is this better than pipelining or superscalar exec?
–If all are independent: no
–If all are dependent: no
–Programs have some parallelism
–Pipelining: “drain” and “fill-up” overheads
–Superscalar: parallelism only when adjacent
–OoO exploits parallelism even when not adjacent
OoO is orthogonal to pipelining and superscalar

Out-of-order Execution Issues
Preserving sequential semantics
Stalling instructions w/ dependences
Issuing instructions when dependences are satisfied

Back to Sequential Semantics
Instr. exec. in 3 phases:
–In-progress, Completed (NEW), Committed
–OOO for In-progress and Completed
–In-order commits
Completed - out-of-order:
–Results visible to subsequent instructions
–Results not visible to outsiders
–On interrupt, completed results are discarded
Committed - in-order:
–Results visible to subsequent instructions
–Results visible to outsiders
–On interrupt, committed results are preserved
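
The three phases can be illustrated with a small Python sketch. The `Window` class and its method names are hypothetical; the sketch only tracks instruction names, assuming instructions may complete in any order but leave the window strictly from the oldest end.

```python
from collections import deque

class Window:
    """Instructions in program order; oldest at the left."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, name):
        self.entries.append({"name": name, "done": False})   # in-progress

    def complete(self, name):
        for entry in self.entries:                            # any order
            if entry["name"] == name:
                entry["done"] = True

    def commit(self):
        """Commit completed instructions from the head only (in order)."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed

    def interrupt(self):
        """Discard everything that has not committed, completed or not."""
        squashed = [entry["name"] for entry in self.entries]
        self.entries.clear()
        return squashed

w = Window()
for name in ("div", "add1", "add2"):
    w.dispatch(name)
w.complete("add1")        # completes before the older div
print(w.commit())         # nothing commits: div is still in progress
w.complete("div")
print(w.commit())         # div and add1 commit together, in order
print(w.interrupt())      # add2 is discarded
```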

How Completes Help w/ Performance
Example: DIV R3, _, _; ADD R1, _, _; ADD _, R1, _
[Timing diagrams: in-order completes with in-order commits vs. out-of-order completes with in-order commits, with the complete and commit points marked]

Implementing Completes/Commits
Key idea:
–Maintain sufficient state around to be able to roll back when necessary
–Roll-back: discard (aka squash) all not committed
One solution (conceptual):
–Upon Complete, instruction records previous value of target register
–Upon Discard, instruction restores target value
–Upon Commit, nothing to do
We will return to this shortly
Focus on scheduling mechanisms
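
The conceptual solution above amounts to an undo log. A minimal Python sketch, with a hypothetical `HistoryFile` class and sequence numbers standing in for instructions; it assumes writes to the same register complete in program order (the slides stall on WAW).

```python
class HistoryFile:
    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []                      # (seq, reg, previous value), completion order

    def complete(self, seq, reg, value):
        # Record the previous value of the target before overwriting it.
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def commit(self, seq):
        # Nothing to restore; the saved value is no longer needed.
        self.log = [entry for entry in self.log if entry[0] != seq]

    def squash(self):
        # Undo every uncommitted write, youngest completion first.
        while self.log:
            _, reg, old = self.log.pop()
            self.regs[reg] = old

h = HistoryFile({"r1": 5, "r3": 7})
h.complete(0, "r3", 100)   # older instruction
h.complete(1, "r1", 6)     # younger instruction
h.commit(0)                # the older one commits
h.squash()                 # interrupt: the younger one is rolled back
print(h.regs)
```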

Out-of-Order Execution: the Big Picture
[Diagram. Program form: static program, dynamic inst. stream (trace), execution window, completed instructions. Processing phase: dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit]

Out-of-Order Execution: Stages
Fetch: get instruction from memory
Decode/Dispatch: what is it? What are the dependences?
Issue: go – all dependences satisfied
Execute: perform operation
Complete: result available to other insts.
Commit: result available to outsiders
We’ll start w/ Decode/Dispatch
Then we’ll consider Issue

OOO Scheduling
Instruction @ Decode:
–Do I have dependences yet to be satisfied?
–Yes: stall until they are
–No: clear to issue
Wakeup of stalled instructions:
–Dependences satisfied
–Allow instruction to issue
Dependence:
–(later instruction, earlier instruction) & type
We’ll first consider RAW and then move on to WAW and WAR

Stalling @ Decode for RAW
Are there unsatisfied dependences?
–RAW: have to wait for register value
–We don’t really care who is producing the value
–Only whether it is available
Can use the Register Availability Vector as in pipelining/superscalar
–Also known as scoreboard
On decode
–Check all bits for source regs: if any is 0, stall
–Reset bit corresponding to your target
At writeback
–Set bit corresponding to your target
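
A minimal Python sketch of the Register Availability Vector described above. The class and method names are hypothetical; sources are checked before the target bit is reset, so an instruction such as add r3, r3, r2 does not stall on its own target.

```python
class RegisterAvailabilityVector:
    def __init__(self, regs):
        self.bits = {r: 1 for r in regs}      # 1 = value available

    def decode(self, srcs, tgt):
        """Return True if the instruction must stall; otherwise mark tgt pending."""
        if any(self.bits[s] == 0 for s in srcs):
            return True
        if tgt is not None:
            self.bits[tgt] = 0
        return False

    def writeback(self, tgt):
        self.bits[tgt] = 1

    def as_string(self):
        return "".join(str(bit) for bit in self.bits.values())

rav = RegisterAvailabilityVector(["r1", "r2", "r3", "r4"])
print(rav.decode(("r4",), "r2"), rav.as_string())        # ld r2, 10(r4) proceeds
print(rav.decode(("r3", "r2"), "r3"), rav.as_string())   # add r3, r3, r2 stalls on r2
rav.writeback("r2")
print(rav.decode(("r3", "r2"), "r3"), rav.as_string())   # add proceeds after writeback
```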

Issuing Instructions: Scheduling
Determine when an instruction can issue
–Ignore resources for the time being
–Stalled because of RAW w/ preceding instruction
Concept:
–Producer (write) notifies consumers (read)
Requirements:
–Consumers need to be able to identify producer
–The register name is one possible link
Mechanism
–Consumers placed in a reservation station
–Producers on complete broadcast their identity
–Waiting instructions observe
–Update operand availability
–Issue if all operands now available

Reservation Station
State pertaining to an instruction
–What registers it reads
–Whether they are available
–What is the destination register
–What state the instruction is in: Waiting, Executing
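
The reservation station state and the producer-notifies-consumers mechanism can be sketched in Python as below. The names `ReservationStation` and `broadcast` are hypothetical, and the register name is used as the link between producer and consumer, which is one of the options the slides mention.

```python
class ReservationStation:
    def __init__(self, op, srcs, tgt, available):
        self.op = op
        self.tgt = tgt                                       # destination register
        self.src_ready = {s: s in available for s in srcs}   # per-source availability
        self.state = "Waiting"

    def ready(self):
        return all(self.src_ready.values())

    def observe(self, reg):
        # A producer announced `reg`; update operand availability.
        if reg in self.src_ready:
            self.src_ready[reg] = True

def broadcast(stations, reg):
    """A producer completes `reg`; return the ops that become ready to issue."""
    woken = []
    for rs in stations:
        was_ready = rs.ready()
        rs.observe(reg)
        if rs.state == "Waiting" and not was_ready and rs.ready():
            woken.append(rs.op)
    return woken

available = {"r1", "r3", "r4"}                 # r2 is still being loaded
add = ReservationStation("add", ("r3", "r2"), "r3", available)
sub = ReservationStation("sub", ("r1",), "r1", available)
print(add.ready(), sub.ready())                # add waits, sub can go ahead
print(broadcast([add, sub], "r2"))             # the load completes and wakes add
```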

Out-Of-Order Exec. Example
loop: ld  r2, 10(r4)    (4 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      add r4, r4, 4
      bne r1, r0, loop
Window entries are shown as register/availability bit. RAV bits are listed in the order r1 r2 r3 r4.

Cycle 0 (initial)
RAV: 1111
op   src1  src2  tgt   status
(window empty)

Cycle 0
RAV: 1011
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/0  Rdy

Cycle 1
RAV: 1001
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/0  Exec
add  r3/1  r2/0  r3/0  Wait

Cycle 2
RAV: 0001
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/0  Exec
add  r3/1  r2/0  r3/0  Wait
sub  r1/1  NA/1  r1/0  Rdy

Cycle 3 (this slide annotates the ld with “5 cycles lat”)
RAV: 0000
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/0  Exec
add  r3/1  r2/0  r3/0  Wait
sub  r1/1  NA/1  r1/0  Exec
add  r4/1  NA/1  r4/0  Rdy

Cycle 4
RAV: 1101
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/1  Comp
add  r3/1  r2/1  r3/0  Exec
sub  r1/1  NA/1  r1/1  Comp
add  r4/1  NA/1  r4/0  Exec
bne  r1/1  NA/1        Rdy

Cycle 5
RAV: 1111
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/1  Comm
add  r3/1  r2/1  r3/1  Comp
sub  r1/1  NA/1  r1/1  Comp
add  r4/1  NA/1  r4/1  Comp
bne  r1/1  NA/1        Exec

Cycle 7
RAV: 1110
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/1  Comm
add  r3/1  r2/1  r3/1  Comm
sub  r1/1  NA/1  r1/1  Comm
add  r4/1  NA/1  r4/1  Comm
bne  r1/1  NA/1        Comp

Cycle 8
RAV: 1111
op   src1  src2  tgt   status
ld   r4/1  NA/1  r2/1  Comm
add  r3/1  r2/1  r3/1  Comm
sub  r1/1  NA/1  r1/1  Comm
add  r4/1  NA/1  r4/1  Comm
bne  r1/1  NA/1        Comm

Notifying Consumers
Identity of producer
Uniquely identify the instruction
Easily retrievable @ decode by others
–Target register (recall we stall on WAR or WAW)
–Functional unit (if not pipelined)
–Place in instruction window
–PC? No. Why?

Name Dependences and OOO
WAW or WAR: we need to update a register but others are still using it
–add r1, r1, 10
–sw r1, 20(r2)
–add r1, r3, 30
–sub r2, r1, 40
There is only one r1
–sw needs to see the value of the 1st add
–sub needs to wait for the 2nd add and not the 1st
Solution: stall decode when WAW or WAR

Detecting WAW and WAR
WAW? Look at scoreboard
–If bit is 0 then there is a pending write
–Stall
WAR? Need to know whether all preceding consumers have read the value
–Keep a count per register
–Increase at decode for all reads
–Decrease on issue
More elegant solution via register renaming
–Soon
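
The WAW and WAR detection rules can be sketched in Python with one pending-write flag and one reader count per register. The `HazardTracker` class is hypothetical; the walk-through uses the add/sw/add sequence from the name-dependences slide.

```python
class HazardTracker:
    def __init__(self, regs):
        self.write_pending = {r: False for r in regs}   # scoreboard bit is 0
        self.readers = {r: 0 for r in regs}             # decoded but not yet issued reads

    def can_decode(self, tgt):
        """A writer may decode only if no WAW and no WAR on its target."""
        if tgt is None:
            return True
        return not self.write_pending[tgt] and self.readers[tgt] == 0

    def decode(self, srcs, tgt):
        for s in srcs:
            self.readers[s] += 1        # increase at decode for all reads
        if tgt is not None:
            self.write_pending[tgt] = True

    def issue(self, srcs):
        for s in srcs:
            self.readers[s] -= 1        # decrease on issue

    def writeback(self, tgt):
        self.write_pending[tgt] = False

t = HazardTracker(["r1", "r2", "r3"])
t.decode(("r1",), "r1")            # add r1, r1, 10
t.decode(("r1", "r2"), None)       # sw  r1, 20(r2)
print(t.can_decode("r1"))          # add r1, r3, 30 must stall (WAW and WAR)
t.issue(("r1",))
t.writeback("r1")                  # first add is done: WAW gone, sw still has to read
print(t.can_decode("r1"))
t.issue(("r1", "r2"))              # sw has read r1
print(t.can_decode("r1"))
```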

Scoreboarding
Schedule based on RAW dependences
WAW and WAR cause stalls
–WAW at decode
–WAR at writeback
Implemented in the CDC 6600 in ‘64
–18 non-pipelined FUs
–4 FP: 2 mul, 1 add, 1 div
–7 MEM: 5 load, 2 store
–7 INT: add, shift, logical etc.
Centralized control scheme
–Controls all instruction issue
–Detects all hazards

DLX w/ Scoreboarding
[Block diagram: register file, functional units (FP mul, FP divide, FP add, integer), scoreboard]

Scoreboarding Overview
Ignore IF and MEM for simplicity
4-stage execution
–Issue: check for structural hazards; check for WAW hazards; stall until all clear
–ReadOp: check for RAW hazards; wait until all operands ready; read registers
–Execute: execute operations; notify scoreboard when complete
–Write: check for WAR hazards; stall write until all clear
A completing instruction cannot write dest if an earlier instruction has not read dest.

Scoreboarding Optimizations/Tricks
WAW as in original OOO
WAR is optimized
–Second producer is allowed to execute up to complete
–It is stalled there until preceding consumers complete
No commit
–No precise interrupts
Window is implemented in the scoreboard
One entry per functional unit
–Recall not pipelined
–Instructions identified by FU id

Scoreboarding Organization
Three structures
–Instruction Status
–Functional Unit Status
–Register Result Status
Instruction Status
–Which stage the instruction is currently in
Functional Unit Status: scheduling
–Busy
–OP: operation
–Fi: dest. reg.
–Fj, Fk: source regs
–Qj, Qk: FUs producing sources
–Rj, Rk: ready bits for sources
Register Result Status: dep. determination
–Which FU will produce a register

Scoreboarding explained
Register status reg:
–Which FU produces the register
Use at decode
–Source reg match is a RAW
–Target reg match is a WAW: stall

Functional Unit Status
Busy:
–Resource allocation
OP:
–What to do once issued (e.g., add, sub)
Dest. Reg.:
–Where to write result
–To find WAR
Fj, Fk: source regs
–For WAR: can’t write if consumers pending for previous value of register (if FU not the same)
Qj, Qk: FUs producing sources
–To wait for appropriate producer
Rj, Rk: ready bits for sources
–To determine when ready: all ready

Scoreboarding Example

Example: Cycle 0

Example, contd.
The rest you’ll find on the web site
Go through it
Source: Patterson
Summary:
–Execution proceeds in an order dictated by dependences
–RAW, WAR and WAW force ordering
–Tricks may be possible

Beyond Simple OoO
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
[Dependence graph with nodes A, B, C, D, E]
E will wait for B, C and D
–WAR w/ C and D
–WAW w/ B
Can we do better?

What if we had infinite registers
Original:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
With a new name for E’s target:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F9, F7, F4
No false dependences anymore
Since we do not reuse a name we can’t have WAW and WAR

Why we can’t have Infinite Registers
False/Name dependences (WAR and WAW)
–Artifact of having finite registers
There is no such thing as infinite
There is no such thing as large enough
–Well there is (in a sec.)
–Computers execute billions of instructions per sec.
–Even a multi-billion register file would soon be exhausted
Want to exploit parallelism across several instances of the same code
–Loops, recursive functions (most frequent part)

Yes, there is “large enough”
At any given point there will be a finite number of instructions in the window
–If each instruction has a single register target
–If there are N instructions
How many registers do we need?
–N? N + X?

Register Renaming
Register version
–Every write creates a new version
–Uses read the last version
–Need to keep a version until all uses have read it
Register renaming:
–Architectural vs. physical registers: more phys. than arch.
–Maintain a map of arch. to phys. regs.
–Use in-order decoding to properly identify dependences
–Instructions wait only for input op. availability
–Only last version is written to reg. file

Register Renaming
Instruction             Renamed (tgt, src1, src2)
A: DIVF F3, F1, F0      r1, -, -
B: SUBF F2, F1, F0      r2, -, -
C: MULF F0, F2, F4      r3, r2, -
D: SUBF F6, F2, F3      r4, r2, r1
E: ADDF F2, F5, F4      r5, -, -
F: ADDF F0, F0, F2      r6, r3, r5
Need more physical registers than architectural
Ignore control flow for the time being.

Register Renaming Process
Only need to remember last producer of each architectural register
–Vector
At decode
–Find the most recent producers for all source registers
–After: declare self as most recent producer of target register
Complication:
–May have to retract
–Need to be able to restore the mapping state

Register Renaming Support Structures
Register Rename Table
–f(aR) = pR
–One entry per architectural register
Free Register List
–Lists unused physical registers
At decode
–Grab a new register from the free list
–Change mapping in rename table
At commit
–Release register? No… Why?
–Could release previous version
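
A minimal Python sketch of the rename table and free list, applied to the six-instruction DIVF/SUBF/MULF example from the earlier Register Renaming slide. The `Renamer` class is hypothetical; sources with no in-flight producer are shown as "-", and releasing the previous version at commit is left out.

```python
class Renamer:
    def __init__(self, arch_regs, num_phys):
        self.table = {a: None for a in arch_regs}     # f(aR) = pR; None: no in-flight producer
        self.free = [f"r{i}" for i in range(1, num_phys + 1)]

    def rename(self, tgt, srcs):
        # Look up sources first, then claim a new physical register for the target.
        phys_srcs = tuple(self.table[s] or "-" for s in srcs)
        if not self.free:
            raise RuntimeError("no free physical register: stall decode")
        new = self.free.pop(0)
        self.table[tgt] = new
        return (new,) + phys_srcs

program = [
    ("A", "DIVF", "F3", ("F1", "F0")),
    ("B", "SUBF", "F2", ("F1", "F0")),
    ("C", "MULF", "F0", ("F2", "F4")),
    ("D", "SUBF", "F6", ("F2", "F3")),
    ("E", "ADDF", "F2", ("F5", "F4")),
    ("F", "ADDF", "F0", ("F0", "F2")),
]
renamer = Renamer([f"F{i}" for i in range(8)], num_phys=8)
renamed = {label: renamer.rename(tgt, srcs) for label, _, tgt, srcs in program}
for label, regs in renamed.items():
    print(label, ", ".join(regs))
```

Looking up the sources before updating the target mapping matters for instruction F, which both reads and writes F0: it must see the version produced by C, not its own.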

How Many Physical Registers?
Correctness:
–At least as many as architectural, plus?
Performance:
–As many as possible
–Not correctness
–Recall not all instructions produce register results: stores and branches

Dynamic Scheduling
Instruction             Renamed (tgt, src1, src2)
A: DIVF F3, F1, F0      r1, -, -
B: SUBF F2, F1, F0      r2, -, -
C: MULF F0, F2, F4      r3, r2, -
D: SUBF F6, F2, F3      r4, r2, r1
E: ADDF F2, F5, F4      r5, -, -
F: ADDF F0, F0, F2      r6, r3, r5
[Diagram: each result carries a Name and a Value]
–Values and names flow together
–Writeback specifies both value and name
–A waiting instruction inspects all results
–It is allowed to execute when all inputs are available

Physical Registers
Physical register file is just one option
What we need is separate storage
–Consumers could keep values in their reservation station
–Tomasulo’s next

Tomasulo’s Algorithm
IBM 360/91 - fast 360 for scientific code
–Completed in 1967
–Dynamic scheduling
–Predates cache memories
Pipelined FUs
–Adder: up to 3 instructions
–Multiplier: up to 2 instructions
Tomasulo vs. Scoreboard
–Distributed hazard detection and control
–Results are bypassed to FUs
–Common Data Bus (CDB) for results
–All results visible to all instead of via a register

DLX w/ Tomasulo
Tomasulo’s Algorithm
–Use “tags” to identify data values
–Reservation stations: distributed control
–CDB broadcasts all results to all RSs
Extend DLX as example
–Assume multiple FUs rather than pipelined
–Main difference is register-memory insts., i.e., DLX does not have them
–But that’s really a detail :-)
Physical registers?
–Not really! What we need is different storage and name for every version
–Here it’s the producing reservation station

Dynamic DLX
[Block diagram: operation stack, registers, load buffers, store buffers, reservation stations (RS), adders, multipliers, CDB]

Tomasulo’s Algorithm
3 major steps
–Dispatch: get instruction from fetch queue. ALU op: check for available RS. Load: check for available load buffer. If available: dispatch and copy read regs to RS or load buffer. If not: stall - structural hazard
–Issue: if all ops are available, issue; if not, monitor CDB for operands
–Complete: if CDB available, broadcast result; else stall

Tomasulo’s Algorithm contd.
Reservation stations
–Handle distributed hazard detection and instruction control
Everything receiving data gets its tag
–4-bit tag specifies reservation station or load buffer
–Also which FU will produce result
Register specifier is used to assign tags
–Then they are discarded
–Input register specifiers are ONLY used in dispatch (rename table)
Common Data Bus:
–Value + “tag” = where this comes from
–vs. typical bus: value + “tag” = where this goes to

Tomasulo’s Algorithm Contd.
Reservation Stations
–Op: opcode
–Qj, Qk: tag fields (source ops)
–Vj, Vk: operand values (source ops)
–Busy: currently in use
Register file and Store Buffer
–Qi: tag field
–Busy: currently in use
–Vi: value
Load Buffers
–Busy: currently in use

Tomasulo’s: Understanding Speculative vs. Architectural State
add r1, r2, 10
sub r4, r1, 20
add r1, r3, 30
Register file, indexed by arch. reg. name. “Where is the register?” can be: “I have it” or a reservation station id.
reg  value        where
r1   Value of r1  I have it
r2   Value of r2  I have it
r3   Value of r3  I have it
r4   Value of r4  I have it
Reservation stations (tgt, src1, src2): all empty; each source holds a value field and a where field (NA)

Renaming 1st Instruction: add r1, r2, 10
Read sources (r2); rename r1 to RS0
Register file
reg  value        where
r1   -----        RS0
r2   Value of r2  I have it
r3   Value of r3  I have it
r4   Value of r4  I have it
Reservation stations
id   tgt  src1 value / where     src2 value / where
RS0  r1   Value of r2 / I have it  10 / I have it

Renaming 2nd Instruction: sub r4, r1, 20
Sources: r1 in RS0, NYA; rename r4 to RS1
Register file
reg  value        where
r1   -----        RS0
r2   Value of r2  I have it
r3   Value of r3  I have it
r4   ----         RS1
Reservation stations
id   tgt  src1 value / where     src2 value / where
RS0  r1   Value of r2 / I have it  10 / I have it
RS1  r4   ---- / RS0               20 / I have it

Renaming 3rd Instruction: add r1, r3, 30
Sources: r3 avail.; rename r1 to RS2
Register file
reg  value        where
r1   -----        RS2
r2   Value of r2  I have it
r3   Value of r3  I have it
r4   ----         RS1
Reservation stations
id   tgt  src1 value / where     src2 value / where
RS0  r1   Value of r2 / I have it  10 / I have it
RS1  r4   ---- / RS0               20 / I have it
RS2  r1   Value of r3 / I have it  30 / I have it
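
The three renaming steps, followed by a result broadcast, can be sketched in Python. The `MiniTomasulo` class is hypothetical and heavily simplified: one register source plus an immediate per instruction, no functional units, and results supplied by hand. A tag of None stands for “I have it”; the initial register values are made up for the example.

```python
class MiniTomasulo:
    def __init__(self, regs):
        self.value = dict(regs)
        self.tag = {r: None for r in regs}   # None: the register file has the value
        self.rs = {}                         # reservation stations by id

    def dispatch(self, rs_id, op, tgt, src, imm):
        producer = self.tag[src]
        self.rs[rs_id] = {
            "op": op,
            "Vj": self.value[src] if producer is None else None,
            "Qj": producer,                  # tag of the producing station, if any
            "Vk": imm,                       # immediates are always available
        }
        self.tag[tgt] = rs_id                # rename the target to this station

    def broadcast(self, rs_id, result):
        """CDB: value + tag saying where the result comes from."""
        del self.rs[rs_id]
        for entry in self.rs.values():
            if entry["Qj"] == rs_id:
                entry["Vj"], entry["Qj"] = result, None
        for reg in self.tag:
            if self.tag[reg] == rs_id:       # only the latest version reaches the file
                self.value[reg], self.tag[reg] = result, None

m = MiniTomasulo({"r1": 1, "r2": 2, "r3": 3, "r4": 4})
m.dispatch("RS0", "add", "r1", "r2", 10)
m.dispatch("RS1", "sub", "r4", "r1", 20)
m.dispatch("RS2", "add", "r1", "r3", 30)
m.broadcast("RS0", 12)                       # first add completes
print(m.rs["RS1"])                           # sub captured the value from the CDB
print(m.tag["r1"], m.value["r1"])            # r1 still waits for RS2
```

The first add's result never reaches the register file in this sketch because r1 was already renamed to RS2, which is how the WAW between the two adds disappears.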

Example…
Check the web site… Too much for in-class
Summary:
–Execution proceeds in any order that does not violate RAW dependences
–WAR and WAW are removed

Tomasulo’s vs. Scoreboard
–In-order issue
–Out-of-order execution
–Out-of-order completion
Scoreboard:

Tomasulo’s
Out-of-order loads and stores?
–What about WAW, RAW and WAR?
–Compare all load addresses against the addresses of all preceding store buffers
–Stall if they match
CDB is a bottleneck
–One write per cycle
–Could duplicate
–But it comes at a cost
–Datapath + duplicated tags and control
Complex implementation
–Scalability?
–All results to all sources
–What if we want 128 instrs?

Tomasulo’s
Advantages
–Distribution of hazard detection
–Elimination of WAR and WAW stalls
Common Data Bus
–Broadcasts result to multiple instrs (+)
–Bottleneck
Register renaming
–Removes WAR and WAW hazards
–More interesting when same code appears twice; think of loops
–More on this later
–BUT: associative lookups
–RECALL: direct map is faster
