1
COSC121: Computer Systems. ISA and Performance
Jeremy Bolton, PhD, Assistant Teaching Professor. Constructed using materials from: Patt and Patel, Introduction to Computing Systems (2nd); Patterson and Hennessy, Computer Organization and Design (4th). A special thanks to Eric Roberts and Mary Jane Irwin.
2
Notes Project 3 Due soon. Next HW posted soon. Read PH.1 and PH.4
3
Outline
ISA and performance: CISC, RISC
Details of pipelining:
  Avoiding hazards (and avoiding stalls)
  Data hazards: stalls and no-ops, forwarding
  Branch hazards: branch delay scheduling, prediction schemes, loop unrolling
4
This week … our journey takes us …
COSC 121: Computer Systems covers the middle layers of the stack:
  Application (Browser)
  Operating System (Win, Linux)  -- COSC 255: Operating Systems
  Software: Compiler, Assembler, Drivers
  Instruction Set Architecture
  Hardware: Processor, Memory, I/O system
  Datapath & Control, Digital Design  -- COSC 120: Computer Hardware
  Circuit Design, transistors
5
Evaluating ISAs
Design-time metrics: Can it be implemented? In how long, at what cost? Can it be programmed? Ease of compilation?
Static metrics: How many bytes does the program occupy in memory?
Dynamic metrics: How many instructions are executed? How many bytes does the processor fetch to execute the program? How many clocks are required per instruction? How "lean" a clock is practical?
Best metric: time to execute the program! Execution time is the product of instruction count, CPI, and cycle time, and depends on the instruction set, the processor organization, and compilation techniques.
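The "best metric" above can be sketched as a one-line calculation. The machine parameters below are hypothetical, purely for illustration:

```python
# Sketch of the slide's "best metric": execution time as the product of
# instruction count, CPI, and cycle time.  All numbers are made up.

def exec_time(inst_count, cpi, cycle_time_ns):
    """CPU time = instruction count x cycles per instruction x cycle time."""
    return inst_count * cpi * cycle_time_ns

# Two hypothetical machines running the same program: fewer, slower
# instructions vs. more, faster instructions.
cisc_like = exec_time(inst_count=50_000, cpi=4.0, cycle_time_ns=2.0)   # 400000.0 ns
risc_like = exec_time(inst_count=100_000, cpi=1.0, cycle_time_ns=1.5)  # 150000.0 ns
print(cisc_like, risc_like)
```

Note how the lower instruction count does not win by itself: CPI and cycle time matter just as much.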
6
RISC vs CISC Ideologies for ISA design Two extremes:
Build very complex instructions that can execute multiple or complex operations as 1 instruction (CISC) Build very simple instructions that execute quickly (RISC)
7
CISC Architecture The simplest way to examine the advantages and disadvantages of RISC architecture is by contrasting it with its predecessor: CISC (Complex Instruction Set Computer) architecture.
8
Multiplying Two numbers in Memory
On the right is a diagram representing the storage scheme for a generic computer. The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4.
9
Multiplying Two numbers in Memory
The execution unit is responsible for carrying out all computations. However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).
10
Multiplying Two numbers in Memory
Let's say we want to find the product of two numbers - one stored in location 2,3 and another stored in location 5,2 - and then store the product back in the location 2,3.
11
The CISC Approach The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations.
12
The CISC Approach For this particular task, a CISC processor would come prepared with a specific instruction (we'll call it "MULT"). When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register. Thus, the entire task of multiplying two numbers can be completed with one instruction: MULT 2:3, 5:2
13
The CISC Approach MULT is what is known as a "complex instruction."
It operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions. It closely resembles a command in a higher level language. For instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then this command is identical to the C statement "a = a * b."
14
The CISC Approach One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the length of the code is relatively short, very little RAM is required to store instructions. The emphasis is put on building complex instructions directly into the hardware.
15
The RISC Approach RISC processors use only simple instructions that can be executed within one clock cycle (amortized via pipelining). Thus, the "MULT" command described above could be divided into three separate commands: "LOAD," which moves data from the memory bank to a register; "PROD," which finds the product of two operands located within the registers; and "STORE," which moves data from a register to the memory banks.
16
The RISC Approach In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly: LOAD A, 2:3 LOAD B, 5:2 PROD A, B STORE 2:3, A
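The four-line sequence can be checked with a toy interpreter for the slide's hypothetical machine. The memory addressing ("row:col" keys) and the LOAD/PROD/STORE semantics are modeled from the slide's description, not from any real ISA:

```python
# Toy interpreter for the slide's hypothetical RISC machine.  Memory
# locations are keyed "row:col"; registers are single letters (A-F).

def run(program, memory):
    regs = {}
    for line in program:
        op, args = line.split(None, 1)
        a, b = (s.strip() for s in args.split(","))
        if op == "LOAD":       # LOAD reg, mem   : memory -> register
            regs[a] = memory[b]
        elif op == "PROD":     # PROD rd, rs     : rd = rd * rs
            regs[a] = regs[a] * regs[b]
        elif op == "STORE":    # STORE mem, reg  : register -> memory
            memory[a] = regs[b]
    return memory

mem = {"2:3": 6, "5:2": 7}
run(["LOAD A, 2:3", "LOAD B, 5:2", "PROD A, B", "STORE 2:3, A"], mem)
print(mem["2:3"])   # 42 - same net effect as the single CISC MULT 2:3, 5:2
```

The end state of memory is identical to what the one-instruction CISC version would produce; only the number of instructions differs.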
17
The RISC Approach At first, this may seem like a much less efficient way of completing the operation. Because there are more lines of code, more RAM is needed to store the assembly level instructions. The compiler must also perform more work to convert a high-level language statement into code of this form.
18
The RISC Approach …However, the RISC strategy also brings some very important advantages. Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command. These RISC "reduced instructions" require fewer transistors of hardware space than the complex instructions, leaving more room for general-purpose registers. And because all of the instructions execute in a uniform amount of time (i.e., one clock), pipelining is possible and effective.
19
Morgan Kaufmann Publishers
17 September, 2018 Pipeline Performance: single-cycle (Tc = 800 ps) vs. pipelined (Tc = 200 ps, set by the slowest stage) Chapter 4 — The Processor
20
Pipeline Speedup
If all stages are balanced (i.e., all take the same time):
  Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
If the stages are not balanced, the speedup is less.
The speedup is due to increased throughput; latency (the time for each instruction) does not decrease.
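The balanced-stage relation, and why the earlier 800 ps / 200 ps numbers give a speedup of 4x rather than the ideal 5x, can be sketched directly:

```python
# The ideal-speedup relation: with perfectly balanced stages, the time
# between pipelined instructions is the non-pipelined time divided by
# the number of stages.

def pipelined_period(nonpipelined_ps, stages):
    return nonpipelined_ps / stages

ideal = pipelined_period(800, 5)   # 160.0 ps if all 5 stages were balanced
# The real MIPS pipeline is limited by its slowest stage (200 ps),
# so the actual speedup is 800/200 = 4x, not 5x.
actual_speedup = 800 / 200
print(ideal, actual_speedup)
```

The gap between 160 ps and 200 ps is exactly the "if not balanced, speedup is less" caveat on the slide.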
21
CISC vs. RISC
1. CISC: Emphasis on hardware.                         RISC: Emphasis on software.
2. CISC: Includes multi-clock complex instructions.    RISC: Single-clock, reduced instructions only.
3. CISC: Memory-to-memory; "LOAD" and "STORE"          RISC: Register-to-register; "LOAD" and "STORE"
   incorporated in instructions.                             are independent instructions.
4. CISC: Small code sizes, high cycles per             RISC: Low cycles per instruction, large
   instruction.                                              code sizes.
5. CISC: Transistors used for storing complex          RISC: Spends more transistors on
   instructions.                                             registers.
22
MIPS (RISC) Design Principles
Simplicity favors regularity: fixed-size instructions; small number of instruction formats
Smaller is faster: limited instruction set; limited number of registers in the register file
Make the common case fast: arithmetic operands from the register file (load-store machine); allow instructions to contain immediate operands
Good design demands good compromises: three instruction formats
23
MIPS Arithmetic Instructions
MIPS assembly language arithmetic statements:
  add $t0, $s1, $s2
  sub $t0, $s1, $s2
Each arithmetic instruction performs one operation and specifies exactly three operands, all contained in the datapath's register file ($t0, $s1, $s2): destination, source1, op source2.
Instruction format (R format): register designators (numbers) in decimal; op codes and function fields in hex (0x designation).
24
MIPS Instruction Fields
MIPS fields are given names to make them easier to refer to: op, rs, rt, rd, shamt, funct.
  op (6 bits): opcode that specifies the operation
  rs (5 bits): register file address of the first source operand
  rt (5 bits): register file address of the second source operand
  rd (5 bits): register file address of the result's destination
  shamt (5 bits): shift amount (for shift instructions)
  funct (6 bits): function code augmenting the opcode
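The six fields pack into a 32-bit word, which can be sketched with plain shifts and masks (the field widths are exactly the ones listed above):

```python
# Packing and unpacking the six R-format fields.
# Widths: op 6, rs 5, rt 5, rd 5, shamt 5, funct 6 = 32 bits total.

def encode_r(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def decode_r(word):
    return {"op":    (word >> 26) & 0x3F,
            "rs":    (word >> 21) & 0x1F,
            "rt":    (word >> 16) & 0x1F,
            "rd":    (word >> 11) & 0x1F,
            "shamt": (word >> 6)  & 0x1F,
            "funct": word & 0x3F}

# add $t0, $s1, $s2 -> op=0, rs=17 ($s1), rt=18 ($s2), rd=8 ($t0), funct=0x20
word = encode_r(0, 17, 18, 8, 0, 0x20)
print(hex(word))             # 0x2324020
print(decode_r(word)["rd"])  # 8
```

Decoding is just the inverse shift-and-mask, which is why a small, fixed number of formats keeps the decode hardware simple.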
25
MIPS Register File
Holds thirty-two 32-bit registers, with two read ports and one write port (5-bit src1, src2, and dst addresses; 32-bit data paths; a write control signal).
Registers are:
  Faster than main memory
  Easier for a compiler to use – e.g., (A*B) – (C*D) – (E*F) can do the multiplies in any order
  Able to hold variables so that code density improves (registers are named with fewer bits than a memory location)
All machines since 1975 have used general-purpose registers.
26
Aside: MIPS Register Convention
Name       Number  Usage                    Preserve on call?
$zero      0       constant 0 (hardware)    n.a.
$at        1       reserved for assembler   n.a.
$v0 - $v1  2-3     returned values          no
$a0 - $a3  4-7     arguments                yes
$t0 - $t7  8-15    temporaries              no
$s0 - $s7  16-23   saved values             yes
$t8 - $t9  24-25   temporaries              no
$gp        28      global pointer           yes
$sp        29      stack pointer            yes
$fp        30      frame pointer            yes
$ra        31      return addr              yes
By standard: $t registers are caller-save, $s registers are callee-save.
27
Review: Why Pipeline? For Performance!
(Diagram: five instructions, Inst 0 through Inst 4, flowing through the IM, Reg, ALU, and DM stages in overlapped fashion along the time axis.)
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1. The first few cycles are the time to fill the pipeline.
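The "CPI = 1 once full" claim gives a simple closed form for total cycles, which a quick sketch makes concrete:

```python
# Total cycles for n instructions on a k-stage pipeline with no hazards:
# k cycles to complete the first instruction (filling the pipeline),
# then one more instruction completes every cycle after that.

def pipeline_cycles(n_instructions, stages=5):
    return stages + (n_instructions - 1)

print(pipeline_cycles(5))     # 9 cycles for the 5 instructions pictured
print(pipeline_cycles(1000))  # 1004 - CPI approaches 1 as n grows
```

For large n the fill cost is negligible, which is why the steady-state CPI is quoted as 1.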
28
Review: MIPS Pipeline Data and Control Paths
(Diagram: the MIPS pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; control signals PCSrc, RegWrite, ALUSrc, MemtoReg, MemRead, MemWrite, RegDst, ALUOp; the branch adder; and 16-to-32-bit sign extension.)
How many bits wide is each pipeline register?
  PC – 32 bits
  IF/ID – 64 bits
  ID/EX – 147 bits
  EX/MEM – 107 bits
  MEM/WB – 71 bits
29
Review: Can Pipelining Get Us Into Trouble?
Yes: pipeline hazards.
  Structural hazards: an attempt to use the same resource by two different instructions at the same time
  Data hazards: an attempt to use data before it is ready – an instruction's source operand(s) are produced by a prior instruction still in the pipeline
  Control hazards: an attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated (branch and jump instructions, exceptions)
Note that data hazards can come from R-type instructions or lw instructions. Pipeline control must detect each hazard and then take action to resolve it.
30
Review: Register Usage Can Cause Data Hazards
Read-before-write data hazard on $1, written by the add and read by every instruction that follows:
  add $1, ...
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5
(Diagram: each later instruction's register read overlaps the add before it writes $1 back.)
31
One Way to “Fix” a Data Hazard
Can fix a data hazard by waiting – stalling – but this impacts CPI:
  add $1, ...
  stall
  stall
  sub $4,$1,$5
  and $6,$1,$7
32
Another Way to “Fix” a Data Hazard
Fix data hazards by forwarding results as soon as they are available to where they are needed:
  add $1, ...
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5
Notes: Forwarding paths are valid only if the destination stage is later in time than the source stage. Forwarding is harder if there are multiple results to forward per instruction, or if a result needs to be written early in the pipeline. For now the forwarded data is shown coming out of the ALU; after looking at the problem more closely, we will see that it is really supplied by the pipeline register EX/MEM or MEM/WB, and will depict it as such.
33
Data Forwarding (aka Bypassing)
Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle. For the ALU, the inputs can come from any pipeline register rather than just from ID/EX, by:
  adding multiplexors to the inputs of the ALU
  connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
  adding the proper control hardware to drive the new muxes
Other functional units may need similar forwarding logic (e.g., the DM). With forwarding, a CPI of 1 can be achieved even in the presence of data dependencies.
34
Data Forwarding Control Conditions
EX Forward Unit:
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
Forwards the result from the previous instruction to either input of the ALU.
MEM Forward Unit:
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
      and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
      and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
Forwards the result from the second-previous instruction to either input of the ALU.
35
Forwarding Illustration
  add $1, ...
  sub $4,$1,$5   (EX forwarding)
  and $6,$7,$1   (MEM forwarding)
Now we see that the forwarded data is supplied by the pipeline register EX/MEM or MEM/WB.
36
Yet Another Complication!
Another potential data hazard can occur when there is a conflict between the result of the WB-stage instruction and the MEM-stage instruction – which should be forwarded?
  add $1,$1,$2
  add $1,$1,$3
  add $1,$1,$4
(The most recent result – the one in EX/MEM – must win, which is what the corrected conditions on the next slide ensure.)
37
Corrected Data Forwarding Control Conditions
EX Forward Unit:
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
Forwards the result from the previous instruction to either input of the ALU.
MEM Forward Unit:
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
      and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
      and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0)
      and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
      and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
Forwards the result from the previous or second-previous instruction to either input of the ALU.
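The corrected conditions can be sketched as a function. The dict field names follow the slide (RegWrite, RegisterRd); the "EX/MEM.RegisterRd != ID/EX.RegisterRs" clause is modeled here by only letting MEM forwarding fire when EX forwarding did not, which has the same effect:

```python
# Sketch of the corrected forwarding conditions.  Mux codes follow the
# slide: "10" = forward from EX/MEM, "01" = from MEM/WB, "00" = none.

def forward_controls(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    fwd_a = fwd_b = "00"
    # EX hazard: result of the previous instruction, in EX/MEM
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex_rs:
            fwd_a = "10"
        if ex_mem["RegisterRd"] == id_ex_rt:
            fwd_b = "10"
    # MEM hazard: second-previous instruction, only if EX/MEM
    # is not already forwarding the more recent value
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0:
        if fwd_a == "00" and mem_wb["RegisterRd"] == id_ex_rs:
            fwd_a = "01"
        if fwd_b == "00" and mem_wb["RegisterRd"] == id_ex_rt:
            fwd_b = "01"
    return fwd_a, fwd_b

# add $1,$1,$2 / add $1,$1,$3 / add $1,$1,$4: both older instructions
# wrote $1, but the most recent result (EX/MEM) must win for input A.
print(forward_controls({"RegWrite": True, "RegisterRd": 1},
                       {"RegWrite": True, "RegisterRd": 1}, 1, 4))
```

Run on the three-add example from the previous slide, this returns ("10", "00"): EX/MEM wins over MEM/WB, resolving the WB-vs-MEM conflict.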
38
Datapath with Forwarding Hardware
(Diagram: the forwarding datapath – a Forward Unit drives new muxes on the ALU inputs, fed from the EX/MEM and MEM/WB pipeline registers.)
How many bits wide is each pipeline register now?
  PC – 32 bits
  IF/ID – 64 bits
  ID/EX – 157 bits
  EX/MEM – 107 bits
  MEM/WB – 71 bits
Control line inputs to the Forward Unit (EX/MEM.RegWrite and MEM/WB.RegWrite) are not shown on the diagram; its other inputs are EX/MEM.RegisterRd, MEM/WB.RegisterRd, ID/EX.RegisterRt, and ID/EX.RegisterRs.
39
Memory-to-Memory Copies
For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input. This would need a Forward Unit and a mux added to the MEM stage.
  lw $1,4($2)
  sw $1,4($3)
What if the lw were replaced with add $1, ... – is forwarding still needed? From where, to where? What if $1 were used to compute the effective address? (That would be a load-use data hazard and would require a stall inserted between the lw and sw.)
40
Forwarding with Load-use Data Hazards
  lw  $1,4($2)
  stall
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5
The one case where forwarding cannot save the day is when an instruction tries to read a register immediately following a load instruction that writes the same register: one stall cycle is still needed even with forwarding, because the loaded data is not available until after the DM stage.
41
Load-use Hazard Detection Unit
Need a Hazard Detection Unit in the ID stage that inserts a stall between the load and its use.
ID Hazard Detection Unit:
  if (ID/EX.MemRead
      and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
        or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
    stall the pipeline
The first line tests whether the instruction now in the EX stage is a lw; the next two lines check whether the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction). After this one-cycle stall, the forwarding logic can handle the remaining data hazards. (Rescheduling can help here – more later.)
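The detection condition is a one-line predicate, sketched here with register numbers standing in for the pipeline-register fields:

```python
# The ID-stage load-use hazard check: stall when the instruction in EX
# is a load (MemRead) whose destination (rt) matches either source
# register of the instruction in ID.

def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# lw $1,4($2) followed by sub $4,$1,$5: the lw's rt is $1, which the
# sub reads as rs -> one stall cycle needed even with forwarding.
print(load_use_stall(True, 1, 1, 5))    # True
print(load_use_stall(False, 1, 1, 5))   # False: not a load, forwarding suffices
```

An R-type producer in EX (MemRead false) never triggers a stall: its result is ready at the end of EX and the forwarding unit handles it.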
42
Hazard/Stall Hardware
Along with the Hazard Unit, we have to implement the stall:
  Prevent the instructions in the IF and ID stages from progressing down the pipeline – done by preventing the PC register and the IF/ID pipeline register from changing. The Hazard Unit controls the writing of the PC (PC.Write) and IF/ID (IF/ID.Write) registers.
  Insert a "bubble" between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage), i.e., insert a noop into the execution stream: set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0. The Hazard Unit controls the mux that chooses between the real control values and the 0's.
  Let the lw instruction and the instructions after it in the pipeline (before it in the code) proceed normally down the pipeline.
43
Adding the Hazard/Stall Hardware
(Diagram: the Hazard Unit takes ID/EX.MemRead, ID/EX.RegisterRt, and the IF/ID source registers as inputs, and drives PC.Write, IF/ID.Write, and the control-zeroing mux on ID/EX.)
In reality, only the RegWrite and MemWrite signals need to be 0; the other control signals can be don't-cares. Another consideration is energy – this is where clock gating is called for.
44
Control Hazards Occur when the flow of instruction addresses is not sequential (i.e., the next PC is not PC + 4); incurred by change-of-flow instructions:
  Unconditional branches (j, jal, jr)
  Conditional branches (beq, bne)
  Exceptions
Possible approaches:
  Stall (impacts CPI)
  Move the decision point as early in the pipeline as possible, reducing the number of stall cycles
  Delay the decision and schedule around it (requires compiler support); out-of-order execution
  Predict and hope for the best!
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards.
45
Control Hazards: Mitigations
  Stall / flush
  Add hardware / optimize the ISA: determine the branch condition and target as early as possible
  VLIW
  Scheduling / out-of-order execution: static (by the compiler) or dynamic (needs hardware, e.g., register renaming)
  Predict the branch: static or dynamic; don't commit until the branch outcome is determined
  Loop unrolling (done by the compiler)
46
Datapath Branch and Jump Hardware
(Diagram: the jump hardware – a shift-left-2 of the 26-bit field concatenated with PC+4[31-28] – and the branch hardware – a shift-left-2 and an adder – feed the PCSrc mux alongside the pipelined, forwarding datapath.)
47
Jumps Incur One Stall Jumps are not decoded until ID, so one flush is needed. To flush, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop).
  j target
  flush
  target: ...
Fortunately, jumps are very infrequent – only 3% of the SPECint instruction mix.
48
Two “Types” of Stalls
Noop (bubble) insertion between two instructions in the pipeline (as done for load-use situations):
  Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle ("stall" them in place with write control signals)
  Insert the noop by zeroing the control bits in the pipeline register at the appropriate stage
  Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
  Result: all operations in the pipeline are simply delayed.
Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a noop (as done for instructions located sequentially after j instructions):
  Zero the control bits for the instruction to be flushed
  Result: the flushed instruction is "clobbered" – never executed.
49
Supporting ID Stage Jumps
(Diagram: the jump target – PC+4[31-28] concatenated with the shifted 26-bit field – is selected in the ID stage, ahead of the branch PCSrc mux on the forwarding datapath.)
50
Review: Branch Instr’s Cause Control Hazards
Dependencies backward in time cause hazards:
  beq ...
  lw  ...
  Inst 3
  Inst 4
(Diagram: the lw and the following instructions are fetched before the beq's outcome is known.)
51
One Way to “Fix” a Branch Control Hazard
Fix the branch hazard by waiting – flushing – but this affects CPI:
  beq ...
  flush
  flush
  flush
  target: Inst 3
(With the branch decision made late in the pipeline, three instructions must be flushed before fetching resumes at the branch target.)
52
Another Way to “Fix” a Branch Control Hazard
Move the branch decision hardware back to as early in the pipeline as possible – i.e., during the decode cycle:
  beq ...
  flush
  target: Inst 3
Another "solution" is to add enough extra hardware to test registers, calculate the branch address, and update the PC during the second stage of the pipeline. That reduces the number of stalls to only one.
53
Reducing the Delay of Branches
One option: move the branch decision hardware back to the EX stage.
  Reduces the number of stall (flush) cycles to two
  Adds an AND gate and a 2x1 mux to the EX timing path
Another option: add hardware to compute the branch target address and evaluate the branch decision in the ID stage.
  Reduces the number of stall (flush) cycles to one (as with jumps)
  But now forwarding hardware is needed in the ID stage: a branch that depends on a result still in the pipeline (one of the source operands of the comparison logic) must be forwarded from the EX/MEM or MEM/WB pipeline latches, along with more hazard detection hardware
  Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
  Comparing the registers can't be done until after the RegFile read, so the compare and PC update add a mux, a comparator, and an AND gate to the ID timing path
For deeper pipelines, the branch decision point can be even later in the pipeline, incurring more stalls; we want a small branch penalty.
54
Supporting ID Stage Branches
(Diagram: ID-stage branch hardware – a comparator on the RegFile read ports, the branch-target adder, and IF.Flush – added to the forwarding datapath.)
Now IF.Flush is generated by the Hazard Unit for both jumps and taken branches.
(Speaker note: the book claims you have to forward from the MEM/WB pipeline latch, but with RegFile write-before-read, that may not be the case.)
55
Delayed Branches and (Static) Scheduling
If the branch hardware has been moved to the ID stage, then we can eliminate all branch stalls with delayed branches, defined as always executing the next sequential instruction after the branch instruction – the branch takes effect after that next instruction. The MIPS compiler moves an instruction that is not affected by the branch (a safe instruction) to immediately after the branch, thereby hiding the branch delay.
With deeper pipelines, the branch delay grows, requiring more than one delay slot. Delayed branches have therefore lost popularity compared to more expensive but more flexible (dynamic) hardware branch prediction; growth in available transistors has made hardware branch prediction relatively cheaper. No processor uses delayed branches of more than 1 cycle – for longer branch delays, hardware-based branch prediction is used.
56
Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction. C code: A = B + E; C = B + F;
Before scheduling (13 cycles, two stalls):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  (stall)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  (stall)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
After scheduling (11 cycles, no stalls):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
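The cycle counts can be checked with a small counter that charges one bubble per load-use pair on a 5-stage pipeline. The tuple encoding of the instructions (destination, source list, is-load flag) is a modeling choice for this sketch, not MIPS syntax:

```python
# Cycle counter for a 5-stage pipeline with forwarding: base cost is
# stages + (n - 1), plus one stall whenever a load's destination is
# read by the very next instruction (load-use hazard).
# Instructions are modeled as (dest, [sources], is_load) tuples.

def cycles(prog, stages=5):
    total = stages + len(prog) - 1
    for (dest, _, is_load), (_, srcs, _) in zip(prog, prog[1:]):
        if is_load and dest in srcs:
            total += 1          # one bubble per load-use hazard
    return total

# The slide's sequences; stores are modeled as reading their source reg.
unscheduled = [("t1", ["t0"], True), ("t2", ["t0"], True),
               ("t3", ["t1", "t2"], False), ("m1", ["t3"], False),
               ("t4", ["t0"], True), ("t5", ["t1", "t4"], False),
               ("m2", ["t5"], False)]
scheduled = [("t1", ["t0"], True), ("t2", ["t0"], True),
             ("t4", ["t0"], True), ("t3", ["t1", "t2"], False),
             ("m1", ["t3"], False), ("t5", ["t1", "t4"], False),
             ("m2", ["t5"], False)]
print(cycles(unscheduled), cycles(scheduled))   # 13 11
```

Moving the third lw up breaks both load-use pairs, which is exactly the two-cycle saving the slide quotes.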
57
Scheduling Branch Delay Slots
A. From before the branch:
     add $1,$2,$3
     if $2 = 0 then
       (delay slot)
   becomes
     if $2 = 0 then
       add $1,$2,$3
B. From the branch target:
     sub $4,$5,$6
     ...
     add $1,$2,$3
     if $1 = 0 then
       (delay slot)
   becomes
     add $1,$2,$3
     if $1 = 0 then
       sub $4,$5,$6
C. From the fall-through:
     add $1,$2,$3
     if $1 = 0 then
       (delay slot)
     sub $4,$5,$6
   becomes
     add $1,$2,$3
     if $1 = 0 then
       sub $4,$5,$6
Notes:
  A is the best choice: it fills the delay slot and reduces instruction count.
  In B and C, the use of $1 in the branch condition prevents the add from being moved into the delay slot; the sub may need to be copied (increasing IC) because it could be reached by another path, and it must be okay to execute the sub when the branch goes the other way.
  B is preferred when the branch is taken with high probability (such as loop branches).
  Limitations on delayed-branch scheduling come from (1) restrictions on the instructions that can be moved/copied into the delay slot and (2) limited ability to predict at compile time whether a branch is likely to be taken or not.
58
Dynamic Scheduling A major limitation of the simple pipelining techniques is in-order execution: if an instruction is stalled in the pipeline, all the instructions behind it must wait, even if there are enough hardware resources to execute them.
Solution: let the instructions behind the stalled instruction proceed. Split the Instruction Decode phase of the pipeline into:
  Issue: decode the instruction and check for structural hazards
  Read operands: wait until there are no data hazards, then read the operands
This gives out-of-order execution and out-of-order completion of instructions.
59
Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order. Example:
  lw   $t0, 20($s2)
  addu $t1, $t0, $t2
  sub  $s4, $s4, $t3
  add  $t5, $s4, $s4
Can start the sub while the addu is waiting for the lw. Issue out of order, e.g.:
  lw   $t0, 20($s2)
  sub  $s4, $s4, $t3
  add  $t5, $s4, $s4
  addu $t1, $t0, $t2
60
Why Do Dynamic Scheduling?
Why not just let the compiler schedule code?
  Not all stalls are predictable
  Can't always schedule around branches: the branch outcome is determined dynamically
  Different implementations of an ISA have different latencies and hazards
E.g., scoreboarding and Tomasulo's algorithm. An example is upcoming if time permits (see appendix).
61
(Static) Branch Prediction
Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome.
Predict not taken – always predict that branches will not be taken and continue to fetch from the sequential instruction stream; the pipeline stalls only when a branch is taken. If taken:
  flush the instructions after the branch (earlier in the pipeline): in IF, ID, and EX if the branch logic is in MEM – three stalls; in IF and ID if the branch logic is in EX – two stalls; in IF if the branch logic is in ID – one stall
  ensure that the flushed instructions haven't changed machine state – automatic in the MIPS pipeline, since state-changing operations (MemWrite in MEM, RegWrite in WB) are at the tail end of the pipeline
  restart the pipeline at the branch destination
This is a static scheme, since the same decision (not taken, or taken) is always made.
62
Flushing with Misprediction (Not Taken)
  4:  beq $1,$2,2
  8:  sub $4,$1,$5   (flushed)
  16: and $6,$1,$7
  20: or  $8,$1,$9
Note the branch address is PC-relative: 4 + 4 + 2*4 = 16.
To flush the IF-stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a noop).
63
Branching Structures Predict-not-taken works well for "top of the loop" branching structures:
  Loop: beq $1,$2,Out
        1st loop instr
        ...
        last loop instr
        j Loop
  Out:  fall-out instr
But such loops have a jump at the bottom to return to the top – and incur the jump-stall overhead.
Predict-not-taken doesn't work well for "bottom of the loop" branching structures:
  Loop: 1st loop instr
        2nd loop instr
        ...
        last loop instr
        bne $1,$2,Loop
        fall-out instr
64
Branch Prediction, con’t
Resolve branch hazards by assuming a given outcome and proceeding.
Predict taken – predict that branches will always be taken. Predict taken always incurs at least one stall cycle, even if the branch-destination hardware has been moved up to the ID stage; predict not taken is easier, since the sequential instruction address can be computed in the IF stage. Is there a way to "cache" the address of the branch target instruction?
As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance. With more hardware, it is possible to predict branch behavior dynamically during program execution.
Dynamic branch prediction – predict branches at run time using run-time information.
65
Dynamic Branch Prediction
A branch prediction buffer (aka branch history table, BHT) in the IF stage, addressed by the lower bits of the PC, contains bit(s) passed to the ID stage through the IF/ID pipeline register that tell whether the branch was taken the last time it was executed.
The prediction bit may predict incorrectly (it may be wrong for this branch this iteration, or it may belong to a different branch with the same low-order PC bits), but this doesn't affect correctness, just performance.
The branch decision occurs in the ID stage, after determining that the fetched instruction is a branch and checking the prediction bit(s). If the prediction is wrong, flush the incorrect instruction(s) in the pipeline, restart the pipeline with the right instruction, and invert the prediction bit(s).
With a 4096-entry BHT, misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%. A 4096-entry table performs about as well as an infinite one, but it is a lot of hardware.
66
Branch Target Buffer The BHT predicts whether a branch is taken, but does not tell where it is taken to!
A branch target buffer (BTB) in the IF stage caches the branch target address – but we also need to fetch the next sequential instruction. The prediction bit in IF/ID selects which "next" instruction will be loaded into IF/ID at the next clock edge.
  This would need a two-read-port instruction memory; or the BTB can cache the branch-taken instruction while the instruction memory fetches the next sequential instruction.
  Except for the first time the branch is encountered, when the branch instruction is not yet loaded into the BTB.
If the prediction is correct, stalls can be avoided no matter which direction the branch goes.
(Speaker note: it's not quite this simple – what if the BTB instruction is for the wrong branch? – but close enough at this level.)
67
1-bit Prediction Scheme
A 1-bit predictor will be incorrect twice per run of a loop. Assume predict_bit = 0 to start (predicting not taken) and loop control at the bottom of the loop:
  The first time through the loop, the predictor mispredicts, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1).
  As long as the branch is taken (looping), the prediction is correct.
  Exiting the loop, the predictor again mispredicts, since this time the branch is not taken, falling out of the loop; invert the prediction bit (predict_bit = 0).
  Loop: 1st loop instr
        2nd loop instr
        ...
        last loop instr
        bne $1,$2,Loop
        fall-out instr
For 10 times through the loop, we get an 80% prediction accuracy for a branch that is taken 90% of the time.
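The 80%-on-a-90%-taken-branch figure can be reproduced with a few lines of simulation:

```python
# Simulating the slide's scenario: a 1-bit predictor on a loop branch
# that is taken 9 times and falls through on the 10th iteration.

def one_bit_accuracy(outcomes, predict_taken=False):
    correct = 0
    for taken in outcomes:
        if taken == predict_taken:
            correct += 1
        else:
            predict_taken = taken    # flip the single bit on every miss
    return correct / len(outcomes)

outcomes = [True] * 9 + [False]      # taken 9x, then not taken once
print(one_bit_accuracy(outcomes))    # 0.8 - wrong on entry and on exit
```

Both misses come from the bit flipping exactly when the branch behavior is about to flip back, which is what motivates the 2-bit scheme on the next slide.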
68
2-bit Predictors
A 2-bit scheme can give higher accuracy, since a prediction must be wrong twice before the prediction is changed. For the loop example (bne $1,$2,Loop at the bottom of the loop), the predictor is right 9 times and wrong only on the loop fall-out; it is then right again on the 1st iteration of the next loop.
[Figure: 2-bit predictor FSM. States 11 and 10 predict taken; states 01 and 00 predict not taken. A taken branch moves the state toward 11; a not-taken branch moves it toward 00.]
Scenario: consecutive loops. When the first loop ends, the branch prediction will likely fail, but the prediction strategy will not change; thus the predict-taken strategy at the beginning of the next loop will likely succeed (with 1 bit, this prediction would fail). In a counter implementation, the counters are incremented when a branch is taken and decremented when not taken (saturating at 00 or 11). Since we read the prediction bits on every cycle, a 2-bit predictor needs both a read and a write access port (for updating the prediction bits). The BHT stores the FSM state for each entry.
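A sketch of the 2-bit saturating counter in Python (the `run_2bit` helper and the back-to-back-loops outcome sequence are illustrative), showing the consecutive-loops scenario from the slide:

```python
# 2-bit saturating counter: counter values 0..3 encode states 00..11;
# predict taken when the counter is 2 or 3. Run over two back-to-back
# loops to show the predictor stays "taken" across each loop fall-out.

def run_2bit(outcomes, counter=3):        # start strongly taken (11)
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct, len(outcomes)

two_loops = ([True] * 9 + [False]) * 2    # two 10-iteration loops
correct, total = run_2bit(two_loops)
print(correct, total)                     # 18 20 -- wrong only on each fall-out
```

Unlike the 1-bit scheme, the single not-taken outcome at each fall-out only weakens the counter (11 to 10), so the first iteration of the next loop is still predicted correctly.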
69
Speculation Speculation is used to allow execution of future instr’s that (may) depend on the speculated instruction Speculate on the outcome of a conditional branch (branch prediction) Speculate that a store (for which we don’t yet know the address) that precedes a load does not refer to the same address, allowing the load to be scheduled before the store (load speculation) Must have (hardware and/or software) mechanisms for Checking to see if the guess was correct Recovering from the effects of the instructions that were executed speculatively if the guess was incorrect In a VLIW processor the compiler can insert additional instr’s that check the accuracy of the speculation and can provide a fix-up routine to use when the speculation was incorrect In SS, the processor buffers the speculative results until it knows they are no longer speculative, then allows the instructions to complete by allowing the contents of the buffers to be written to the registers or memory
70
Multiple-Issue Processor Styles
Static multiple-issue processors (aka VLIW) Decisions on which instructions to execute simultaneously are being made statically (at compile time by the compiler) E.g., Intel Itanium and Itanium 2 for the IA-64 ISA – EPIC (Explicit Parallel Instruction Computer) 128-bit “bundles” containing three instructions, each 41-bits plus a 5-bit template field (which specifies which FU each instruction needs) Five functional units (IntALU, Mmedia, Dmem, FPALU, Branch) Extensive support for speculation and predication Dynamic multiple-issue processors (aka superscalar) Decisions on which instructions to execute simultaneously (in the range of 2 to 8) are being made dynamically (at run time by the hardware) E.g., IBM Power series, Pentium 4, MIPS R10K, AMD Barcelona
71
Multiple-Issue Datapath Responsibilities
Must handle, with a combination of hardware and software fixes, the fundamental limitations of: how many instructions to issue in one clock cycle (issue slots); storage (data) dependencies, aka data hazards (a limitation more severe in an SS/VLIW processor); procedural dependencies, aka control hazards; and resource conflicts, aka structural hazards. An SS/VLIW processor has a much larger number of potential resource conflicts; functional units may have to arbitrate for result buses and register-file write ports. Register renaming and reservation stations can help; pipelining a functional unit is much less expensive than duplicating it.
72
Static Multiple Issue Machines (VLIW)
Static multiple-issue processors (aka VLIW) use the compiler (at compile-time) to statically decide which instructions to issue and execute simultaneously Issue packet – the set of instructions that are bundled together and issued in one clock cycle – think of it as one large instruction with multiple operations The mix of instructions in the packet (bundle) is usually restricted – a single “instruction” with several predefined fields The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards VLIW’s have Multiple functional units Multi-ported register files Wide program bus
73
An Example: A VLIW MIPS Consider a 2-issue MIPS with a 64-bit, 2-instruction bundle: one ALU op (R format) or branch (I format) paired with one load or store (I format). Instructions are always fetched, decoded, and issued in pairs. If one instruction of the pair cannot be used, it is replaced with a noop. This requires 4 register-file read ports, 2 write ports, and a separate memory-address adder.
74
A MIPS VLIW (2-issue) Datapath
[Figure: 2-issue MIPS VLIW datapath, with two sign-extend units, extra register-file read/write ports, and a separate adder for memory-address calculation.] Assume forwarding hardware as necessary. What is the top mux input to the ALU doing?
75
The University of Adelaide, School of Computer Science
17 September 2018 Register Renaming Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14   # antidependence on F8 (the ADD.D reads F8)
MUL.D F6,F10,F8    # antidependence on F6 (the S.D reads F6); name dependence with F6 (and F8)
Chapter 2 — Instructions: Language of the Computer
76
Register Renaming Example (after renaming):
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
Now only RAW hazards remain, and these can be rescheduled.
77
Register Renaming Name dependency but no true data dependency. Register renaming is provided by reservation stations (RS), which contain: the instruction, buffered operand values (when available), and the reservation-station number of the instruction providing the operand values. An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file). Pending instructions designate the RS to which they will send their output. Result values are broadcast on a result bus, called the common data bus (CDB). Only the last output updates the register file. As instructions are issued, the register specifiers are renamed with the reservation station; there may be more reservation stations than registers. Reservation stations and the reorder buffer effectively provide register renaming.
78
Morgan Kaufmann Publishers
17 September, 2018 Loop Unrolling Replicate the loop body to expose more parallelism; this also reduces loop-control overhead. Use different registers per replication (called "register renaming") to avoid loop-carried "anti-dependences": a store followed by a load of the same register, aka a "name dependence" (reuse of a register name). Chapter 4 — The Processor
79
Code Scheduling Example (with VLIW)
Consider the following loop code:
lp: lw   $t0,0($s1)    # $t0 = array element
    addu $t0,$t0,$s2   # add value in $s2
    sw   $t0,0($s1)    # store result
    addi $s1,$s1,-4    # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0
We must "schedule" the instructions to avoid pipeline stalls: instructions in one bundle must be independent, and load-use instructions must be separated from their loads by one cycle. Notice that the first two instructions have a load-use dependency, while the next two and the last two have data dependencies. Assume branches are perfectly predicted by the hardware.
80
The Scheduled (out-of-order) Code (Not Unrolled with VLIW)
CC | ALU or branch slot | Data transfer slot
 1 |                    | lw   $t0,0($s1)
 2 | addi $s1,$s1,-4    |
 3 | addu $t0,$t0,$s2   |
 4 | bne  $s1,$0,lp     | sw   $t0,4($s1)
Note that the displacement of the sw has to be adjusted (from 0 to 4) because the addi has been scheduled before it rather than after it as in the original code. The load-use hazard is resolved since the addu issues two cycles after its corresponding load. Four clock cycles to execute 5 instructions gives a CPI of 0.8 (versus the best case of 0.5), i.e., an IPC of 1.25 (versus the best case of 2.0); noops don't count towards performance!
81
Loop Unrolling Loop unrolling: multiple copies of the loop body are made, and instructions from different iterations are scheduled together as a way to increase ILP. Apply loop unrolling (4 times for our example) and then schedule the resulting code: eliminate unnecessary loop-overhead instructions, and schedule so as to avoid load-use hazards. During unrolling, the compiler applies register renaming to eliminate all data dependencies that are not true data dependencies.
82
Unrolled Code Example
lp: lw   $t0,0($s1)    # $t0 = array element
    lw   $t1,-4($s1)   # load the next three elements
    lw   $t2,-8($s1)
    lw   $t3,-12($s1)
    addu $t0,$t0,$s2   # add scalar in $s2 to each
    addu $t1,$t1,$s2
    addu $t2,$t2,$s2
    addu $t3,$t3,$s2
    sw   $t0,0($s1)    # store the results
    sw   $t1,-4($s1)
    sw   $t2,-8($s1)
    sw   $t3,-12($s1)
    addi $s1,$s1,-16   # decrement pointer (4 elements)
    bne  $s1,$0,lp     # branch if $s1 != 0
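To see the same transformation outside of assembly, here is a sketch in Python (the function names and the add-a-scalar loop mirror the MIPS example but are otherwise illustrative):

```python
# The add-scalar loop before and after 4x unrolling. Unrolling removes
# three of every four pointer-update/branch pairs and exposes four
# independent operations per iteration for the scheduler.

def add_scalar(a, s):
    for i in range(len(a)):          # one branch + index update per element
        a[i] += s

def add_scalar_unrolled(a, s):
    assert len(a) % 4 == 0           # a compiler would handle leftovers
    for i in range(0, len(a), 4):    # one branch per FOUR elements
        a[i]     += s                # four independent ops, analogous to
        a[i + 1] += s                # using separate registers $t0..$t3
        a[i + 2] += s                # in the MIPS version (register
        a[i + 3] += s                # renaming removes name dependences)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
add_scalar_unrolled(xs, 10)
print(xs)                            # [11, 12, 13, 14, 15, 16, 17, 18]
```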
83
The Scheduled Code (Unrolled)
CC | ALU or branch slot | Data transfer slot
 1 | addi $s1,$s1,-16   | lw   $t0,0($s1)
 2 |                    | lw   $t1,12($s1)
 3 | addu $t0,$t0,$s2   | lw   $t2,8($s1)
 4 | addu $t1,$t1,$s2   | lw   $t3,4($s1)
 5 | addu $t2,$t2,$s2   | sw   $t0,16($s1)
 6 | addu $t3,$t3,$s2   | sw   $t1,12($s1)
 7 |                    | sw   $t2,8($s1)
 8 | bne  $s1,$0,lp     | sw   $t3,4($s1)
Eight clock cycles to execute 14 instructions gives a CPI of 0.57 (versus the best case of 0.5) and an IPC of 1.75, about 1.8 (versus the best case of 2.0).
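The throughput numbers quoted on the slide follow from simple arithmetic (assuming CPI = cycles / instructions and IPC = instructions / cycles):

```python
# Check the unrolled schedule's CPI and IPC: 14 instructions in 8 cycles.
cycles, instrs = 8, 14
cpi = cycles / instrs
ipc = instrs / cycles
print(round(cpi, 2), round(ipc, 2))   # 0.57 1.75 (the slide rounds IPC to 1.8)
```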
84
Summary All modern-day processors use pipelining for performance (a CPI of 1 and a fast CC). The pipeline clock rate is limited by the slowest pipeline stage, so designing a balanced pipeline is important. Hazards must be detected and resolved: structural hazards are resolved by designing the pipeline correctly; data hazards by stalling (impacts CPI) or forwarding (requires hardware support); control hazards by putting the branch-decision hardware in as early a pipeline stage as possible, delaying the decision (requires compiler support), or static and dynamic prediction (requires hardware support). Scheduling and speculation can reduce stalls. Multiple issue (superscalar and VLIW) can improve ILP.
85
Jeremy Bolton, PhD Assistant Teaching Professor
Appendix Jeremy Bolton, PhD Assistant Teaching Professor Constructed using materials: - Patt and Patel Introduction to Computing Systems (2nd) - Patterson and Hennessy Computer Organization and Design (4th) **A special thanks to Eric Roberts and Mary Jane Irwin
86
Dynamic Scheduling Algorithm: Tomasulo Algorithm
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
(after register renaming)
Tomasulo's algorithm is implemented through reservation stations (rs) per functional unit. An rs buffers an operand as soon as it is available, avoiding WAR hazards. Pending instructions designate the rs that will provide their inputs, and only the last write in a sequence of writes to the same register actually updates the register, avoiding WAW hazards. Hazard detection and execution control are decentralized. Instruction results are passed directly to the FUs from the rs, rather than through the registers, over the common data bus (CDB). Nov. 2, 2004 Lec. 7
87
FP unit and load-store unit using Tomasulo’s alg.
88
Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU.] The issue unit preserves dependencies; reservation stations hold pending operands; results are also sent to any waiting reservation stations; the reorder buffer holds register writes and can supply operands for issued instructions.
89
Three Stages of Tomasulo Algorithm
1. Issue: get the instruction from the FP op queue. Stall on a structural hazard, i.e., no space in the rs. If a reservation station (rs) is free, the issue logic issues the instruction to the rs and reads operands into the rs if ready (register renaming: solves WAR). Mark the destination register as waiting for this latest instruction even if a previous instruction writing to that register hasn't completed (solves WAW hazards).
2. Execute: operate on the operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result (solves RAW).
3. Write result: finish execution (WB). Write on the common data bus to all awaiting units and mark the reservation station available. Write the result into the destination register if its status still points to this rs (solves WAW).
A normal data bus carries data + destination (a "go to" bus); the CDB carries data + source (a "come from" bus): 64 bits of data plus 4 bits of functional-unit source address. A unit captures the value if it matches the functional unit it expects to produce the result; the CDB broadcasts to all.
90
Reservation Station Components
Op: the operation to perform in the unit (e.g., + or -). Vj, Vk: the values of the source operands. Qj, Qk: the names of the reservation stations that will provide the source operands; a value of zero means the source operand is already available in Vj or Vk, or is not necessary. Busy: indicates that the reservation station or FU is busy. Register file status Qi: indicates which functional unit will write each register, if one exists; blank (0) when no pending instruction will write that register, meaning the value is already available.
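A sketch of these fields as Python dataclasses (the class names and the tiny usage scenario are illustrative; the field names follow the slide: Op, Vj/Vk, Qj/Qk, Busy, Qi):

```python
# Reservation-station and register-status bookkeeping, as named above.

from dataclasses import dataclass, field

@dataclass
class ReservationStation:
    busy: bool = False
    op: str = ""          # operation to perform, e.g. "+" or "-"
    vj: float = 0.0       # value of source operand j (valid when qj == 0)
    vk: float = 0.0       # value of source operand k (valid when qk == 0)
    qj: int = 0           # RS number producing operand j (0 = value ready)
    qk: int = 0           # RS number producing operand k (0 = value ready)

@dataclass
class RegisterStatus:
    qi: dict = field(default_factory=dict)  # reg -> RS that will write it

# Suppose F0 will be written by reservation station 3 (a pending DIV.D):
status = RegisterStatus()
status.qi["F0"] = 3
# An ADD.D reading F0 records Qj = 3 instead of a stale register value,
# and will capture the value when RS 3 broadcasts it on the CDB:
rs = ReservationStation(busy=True, op="+", qj=status.qi.get("F0", 0))
print(rs.qj)   # 3
```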
91
Tomasulo Loop Example
Loop: LD    F0,0(R1)
      MULTD F4,F0,F2
      SD    F4,0(R1)
      SUBI  R1,R1,#8
      BNEZ  R1,Loop
Assume a multiply takes 4 clocks. Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit). To be clear, clocks will be shown for SUBI and BNEZ; in reality, the integer instructions would run ahead.
92
Loop Example
93
Loop Example Cycle 1
94
Loop Example Cycle 2
95
Loop Example Cycle 3 Implicit renaming sets up “DataFlow” graph
96
Loop Example Cycle 4 Dispatching SUBI Instruction
97
Loop Example Cycle 5 And, BNEZ instruction
98
Loop Example Cycle 6 Notice that F0 never sees Load from location 80
99
Loop Example Cycle 7 Register file completely detached from computation. First and second iterations completely overlapped.
100
Loop Example Cycle 8
101
Loop Example Cycle 9 Load1 completing: who is waiting?
Note: Dispatching SUBI
102
Loop Example Cycle 10 Load2 completing: who is waiting?
Note: Dispatching BNEZ
103
Loop Example Cycle 11 Next load in sequence
104
Loop Example Cycle 12 Why not issue third multiply?
105
Loop Example Cycle 13
106
Loop Example Cycle 14 Mult1 completing. Who is waiting?
107
Loop Example Cycle 15 Mult2 completing. Who is waiting?
108
Loop Example Cycle 16
109
Loop Example Cycle 17
110
Loop Example Cycle 18
111
Loop Example Cycle 19
112
Loop Example Cycle 20
113
Why can Tomasulo overlap iterations of loops?
Register renaming: multiple iterations use different physical destinations for their registers (dynamic loop unrolling). Reservation stations: permit instruction issue to advance past integer control-flow operations, and also buffer old values of registers, totally avoiding the WAR stalls we saw with the scoreboard. Other idea: Tomasulo builds the "dataflow" graph on the fly.
114
Register Renaming Reservation stations and the reorder buffer effectively provide register renaming. On instruction issue to a reservation station: if an operand is available in the register file or reorder buffer, it is copied to the reservation station and is no longer required in the register, which can then be overwritten; if an operand is not yet available, it will be provided to the reservation station by a functional unit, and a register update may not be required.