1 DLX computer Electronic Computers M

2 RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)
RISC architectures
- In CISC architectures about 10% of the instructions are used in 90% of cases: a waste of silicon.
- Bottleneck: the bus.
- Mid '80s, a new architecture: RISC.
- Solution: reduce the number and complexity of the instructions (fewer, simpler machine instructions).
- Fixed instruction format (simpler instruction decoders).
- Simpler control logic network, which allows increasing the number of on-chip registers.
- Reduction of bus/memory accesses.
- More machine instructions are needed for a given job, but in many cases this is more than compensated (in terms of time) by the reduction of bus accesses.
- CISC and RISC are each the best solution in different application fields.
- Nowadays both architectures coexist in the same processor: analysis at the end of the course.
- A simplified RISC architecture: DLX (closely modeled on real MIPS processors such as the R4000).

3 DLX (fixed) instruction format
Format R: Op-code (6 bit) | Ra (5 bit) | Rb (5 bit) | Rc (5 bit) | Op-code extension (11 bit)
  Arithmetic or logic instructions as Rd <= RS1 op RS2, or Set Condition between registers.
Format I: Op-code (6 bit) | Ra (5 bit) | Rb (5 bit) | 16 bit immediate operand
  Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand.
  In LD and ALU instructions RS2 = destination; in ST, RS2 = source. RS1 is used as base address or as ALU value for the immediate instructions.
Format J: Op-code (6 bit) | 26 bit (PC relative) offset
  Direct and unconditional control transfer (J and JAL).
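A minimal Python sketch of how the fields of the three formats could be extracted from a 32-bit word. The slide gives only the field widths; the bit positions used here (op-code in bits 31..26, then Ra, Rb, Rc from most to least significant) follow the left-to-right order of the slide and are an assumption, as is the sample op-code value `0x20`, and `encode_i` is a hypothetical helper used only for the demonstration.

```python
def decode(word: int) -> dict:
    """Split a 32-bit instruction word into the fields of formats R, I and J."""
    word &= 0xFFFFFFFF
    return {
        "opcode": (word >> 26) & 0x3F,   # 6 bits, common to all formats
        "ra": (word >> 21) & 0x1F,       # 5 bits (formats R and I)
        "rb": (word >> 16) & 0x1F,       # 5 bits (formats R and I)
        "rc": (word >> 11) & 0x1F,       # 5 bits (format R only)
        "ext": word & 0x7FF,             # 11-bit op-code extension (format R only)
        "imm16": word & 0xFFFF,          # 16-bit immediate (format I only)
        "offset26": word & 0x3FFFFFF,    # 26-bit offset (format J only)
    }


def encode_i(opcode: int, ra: int, rb: int, imm16: int) -> int:
    """Hypothetical helper: pack a format I word (assumed field order)."""
    return (opcode << 26) | (ra << 21) | (rb << 16) | (imm16 & 0xFFFF)


fields = decode(encode_i(0x20, 3, 5, 0xFFA7))
print(fields["opcode"], fields["ra"], fields["rb"], hex(fields["imm16"]))
```

Decoding every field unconditionally mirrors what a fixed format allows in hardware: the register numbers can be read before the op-code is known.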

4 DLX non floating-point instructions
(31 x 32-bit registers R31...R1; R0 = 0 fixed; Ra and Rb can be any of the 32 registers)

Data Transfer
  LW Ra, offset(Rb)     LB Ra, offset(Rb)     LBU Ra, offset(Rb)
  LH Ra, offset(Rb)     LHU Ra, offset(Rb)
  SW Ra, offset(Rb)     SH Ra, offset(Rb)     SB Ra, offset(Rb)
  LHI Ra, value

Arithmetic/Logic
  ADD Ra,Rb,Rc     ADDI Ra,Rb,value     ADDU Ra,Rb,Rc     ADDUI Ra,Rb,value
  SUB Ra,Rb,Rc     SUBI Ra,Rb,value     SUBU Ra,Rb,Rc     SUBUI Ra,Rb,value
  DIV Ra,Rb,Rc     DIVI Ra,Rb,value     MULU Ra,Rb,Rc     MULI Ra,Rb,value
  SLL Ra,Rb,Rc     SLLI Ra,Rb,value     SHR Ra,Rb,Rc      SHRI Ra,Rb,value
  SLA Ra,Rb,Rc     SLAI Ra,Rb,value
  OR Ra,Rb,Rc      ORI Ra,Rb,value      XOR Ra,Rb,Rc      XORI Ra,Rb,value
  AND Ra,Rb,Rc     ANDI Ra,Rb,value

Control
  SETx Ra,Rb,Rc    SETIx Ra,Rb,value
  BEQZ Ra, offset  BNEQZ Ra, offset
  J offset         JR Ra
  JL offset        JLR Ra

N.B.
- Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE.
- JL and JLR (jump and link, direct or via register) save the PC in R31.
- Offset is a value within the instruction.
- Postfix I means "immediate" (value within the instruction).
- Postfix A means "arithmetic" (sign extension).
- Postfix U means "unsigned".
- Value is the immediate within the instruction.
- No STACK registers.

5 DLX ALU operations
Two data inputs, one data output plus flags.
S1, S2: ALU inputs (32 bit). OUT: ALU output (32 bit).
Operations: S1 + S2; S1 - S2; S1 and S2; S1 or S2; S1 exor S2; left shift of S1 by S2 positions; right shift of S1 by S2 positions; arithmetic right shift of S1 by S2 positions; S1; S2; 0; 1.
Output flags: zero, negative (sign).
The ALU is a combinatorial circuit!

6 Abstract instruction execution
PC is the Program Counter. A and B are two internal scratchpad registers unknown to the programmer.
INSTRUCTION FETCH (once memory is ready): [REG INSTR] <= M[PC]
INSTRUCTION DECODE: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]
INSTRUCTION EXECUTION: one of Data transfer, ALU, Set, Jump, Branch

7 Example: LB (LOAD BYTE, format I)
LB Ra, offset(Rb)
Format: Op-code | RS1 | RS2 | 16 bit immediate operand (bit 31 = MSbit, bit 0 = LSbit)
INSTR <= M[PC]
[PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]
LOAD Byte:
  Addr <= [B] + (Instr[15])^16 ## Instr[15..0]    (byte address computation: instruction bit 15, the sign, is extended 16 times to the left)
  [Ra] <= (M[Addr][7])^24 ## M[Addr][7..0]         (byte into register with sign extension; destination register = RS2)
Then: next instruction.
## is the JOIN (concatenation) operator. Instr[15..0] is the offset held in the instruction. The address is always 32 bit.
Sign extension example: M[Addr][7..0] = A7H = 10100111b, so the sign-extended value is FFFFFFA7H.

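8 Sign extension
(IR[15])^16 ## IR[15..0]
[Figure: sign-extension circuit. IR bits 15-0 form the low half of the result; bits 31-16 are produced through tri-state devices driven from the Control Unit.]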
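The sign-extension rule (Instr 15)^16 ## Instr 15..0 and the byte example A7H can be checked with a short Python sketch. The function names `sign_extend` and `effective_address` are illustrative, not taken from the slides, and the base and offset values in the example are arbitrary.

```python
def sign_extend(value: int, bits: int, width: int = 32) -> int:
    """Replicate bit (bits-1) of value into the upper (width-bits) bits."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):  # sign bit set: fill the upper bits with ones
        value |= ((1 << width) - 1) & ~((1 << bits) - 1)
    return value


def effective_address(base: int, imm16: int) -> int:
    """Addr <= [B] + sign-extended 16-bit offset, truncated to 32 bits."""
    return (base + sign_extend(imm16, 16)) & 0xFFFFFFFF


print(hex(sign_extend(0xA7, 8)))                # LB of the byte A7H
print(hex(effective_address(0x1000, 0xFFFC)))   # offset -4 from base 1000H
```

An unsigned load (LBU, LHU) would skip the fill step and simply keep the masked value.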

9 Data transfer instructions (I format)
Examples: LW Ra, offset(Rb); LB Ra, offset(Rb); LBU Ra, offset(Rb) (unsigned); LHU Ra, offset(Rb) (unsigned); SW Ra, offset(Rb)
Address (all cases): Addr <= [B] + (Instr[15])^16 ## Instr[15..0]
LB  (byte, signed):        [Ra] <= (M[Addr][7])^24 ## M[Addr][7..0]
LBU (byte, unsigned):      [Ra] <= (0)^24 ## M[Addr][7..0]
LH  (half word, signed):   [Ra] <= (M[Addr][15])^16 ## M[Addr][15..0]
LHU (half word, unsigned): [Ra] <= (0)^16 ## M[Addr][15..0]
LW  (word):                [Ra] <= M[Addr]
SW  (word):                M[Addr] <= [A]

10 ALU instructions: examples
(T is a temporary hidden register unknown to the programmer; A and B are the generic registers RS1, RS2)
Second operand:
  Register (format R):  [T] <= [Rc]
  Immediate (format I): [T] <= (Instr[15])^16 ## Instr[15..0]
Operation:
  ADD: [Ra] <= [B] + [T]
  SUB: [Ra] <= [B] - [T]
  AND: [Ra] <= [B] and [T]
  OR:  [Ra] <= [B] or [T]
  XOR: [Ra] <= [B] xor [T]
The same scheme applies to the shifts etc. Register content is treated as signed in arithmetic operations.
Examples: ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value; ...

11 SET instructions (see branch)
Example: SLT Ra,Rb,Rc sets Ra = 1 if Rb is less than Rc, otherwise Ra = 0.
Second operand:
  Register (format R):  [T] <= [Rc]
  Immediate (format I): [T] <= (Instr[15])^16 ## Instr[15..0]
Condition:
  SEQ: [Ra] = 1 if [Rb] = [T]
  SNE: [Ra] = 1 if [Rb] != [T]
  SLT: [Ra] = 1 if [Rb] < [T]
  SGT: [Ra] = 1 if [Rb] > [T]
  SLE: [Ra] = 1 if [Rb] <= [T]
  SGE: [Ra] = 1 if [Rb] >= [T]
Register content is treated as signed.
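Because the comparison treats register content as signed, a 32-bit pattern such as FFFFFFFFH compares as -1. A small Python sketch of the set-condition semantics follows; `to_signed` and `set_condition` are illustrative names, not part of the slides.

```python
import operator

# Postfix -> comparison, as listed for the SETx instructions
SET_OPS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "GT": operator.gt,
    "LE": operator.le, "GE": operator.ge,
}


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def set_condition(postfix: str, rb: int, t: int) -> int:
    """Return the value written to Ra: 1 if the condition holds, else 0."""
    return int(SET_OPS[postfix](to_signed(rb), to_signed(t)))


print(set_condition("LT", 0xFFFFFFFF, 1))  # -1 < 1
```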

12 JUMP instructions
J offset    (jump address, format J)
JR Ra       (jump register, format I)
JL offset   (jump and link address, format J; also written JAL)
JLR Ra      (jump and link register, format I; also written JALR)
JAL, JALR only (for saving [PC] in R31): [T] <= [PC]; [R31] <= [T]
J, JAL (format J):   [PC] <= [PC] + (Instr[25])^6 ## Instr[25..0]
JR, JALR (format I): [PC] <= [A]

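13 Branch instructions
BEQZ: the branch is taken if [A] = 0. BNEZ: the branch is taken if [A] != 0.
Branch taken (YES): [PC] <= [PC] + (Instr[15])^16 ## Instr[15..0]
Branch not taken (NO): the PC keeps its already incremented value.
Example: BNEQZ R5, 100 jumps to PC+100 if R5 is not equal to 0.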

14 The Pipelining Principle
Pipelining is the main technique used for speeding up a CPU. The key idea is general and is applied in several industrial fields (production lines, oil pipelines, ...).
A system S must operate N times on a task A_i, producing result R_i:
  A_1, A_2, A_3 ... A_N  ->  S  ->  R_1, R_2, R_3 ... R_N
Latency: time elapsed between the beginning and the end of task A (T_A).
Throughput: rate at which tasks are completed.

15 The Pipelining Principle
1) Sequential system: tasks A_1, A_2, A_3 ... A_N are executed one after the other along the time axis t. Latency (execution time of a single instruction) = T_A.
2) Pipelined system: each instruction is subdivided into n stages (4 in this example), each lasting 1/n of the entire instruction, and instructions overlap. The system S is split into stages S_1, S_2, S_3, S_4 (S_i: pipeline stage) and task A into the corresponding phases P_1, P_2, P_3, P_4.

16 The Pipelining Principle
[Figure: tasks A_1, A_2, A_3, A_4 ... A_n, each made of phases P_1 to P_4 executed on stages S_1 to S_4, start one pipeline cycle apart and overlap along the time axis t.]
T_P: pipeline cycle. Each cycle one instruction terminates.
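The gain from overlapping can be put in numbers with a short Python sketch. It assumes an ideal pipeline with no stalls and equal-length stages; the figure of n + N - 1 cycles for N tasks on n stages is the standard fill-then-stream count, and the function names are illustrative.

```python
def sequential_time(n_tasks: int, t_a: float) -> float:
    """Sequential system: each task takes the full latency T_A."""
    return n_tasks * t_a


def pipelined_time(n_tasks: int, n_stages: int, t_p: float) -> float:
    """Ideal pipeline: n_stages cycles for the first task, then one per cycle."""
    return (n_stages + n_tasks - 1) * t_p


# Four tasks on a 4-stage pipeline, T_A = 4 time units, T_P = T_A / 4 = 1
print(sequential_time(4, 4.0), pipelined_time(4, 4, 1.0))
```

The latency of a single task is unchanged (n * T_P = T_A); only the throughput improves, approaching one task per T_P as N grows.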

17 Instruction stages
IF: Instruction fetch (from memory)
ID: Instruction decode
EX: Instruction execution (ALU)
MEM: Data memory access (if needed)
WB: Write-back (if needed)

18 Pipelining of a CPU (DLX)
Instruction sequence: I_1, I_2, I_3 ... I_N
Instruction j goes through IF, ID, EX, MEM, WB over time t. Clocks per instruction (CPI) = 1 (ideally!).
CPU (datapath): IF | IF/ID | ID | ID/EX | EX | EX/MEM | MEM | MEM/WB | WB
  IF, ID, EX, MEM, WB: combinatorial circuits
  IF/ID, ID/EX, EX/MEM, MEM/WB: pipeline registers (D flip-flops)
Pipeline cycle = clock cycle, set by the delay of the slowest stage.

19 DLX Pipeline
[Figure: Instr i, i+1, i+2, i+3, i+4 each go through IF ID EX MEM WB, each starting one clock cycle after the previous one.]
CPI (ideally) = 1
Clock cycle: T_clk = T_d + T_P + T_su
  T_d: switch delay of the input stage register
  T_P: delay of the slowest combinatorial stage
  T_su: set-up time of the output stage register
T_d and T_su are the overhead introduced by the pipeline registers.
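A small Python sketch of the clock-cycle formula T_clk = T_d + T_P + T_su. The stage delays and register timings below are invented values in picoseconds, chosen only to illustrate the calculation.

```python
def clock_period(stage_delays, t_d, t_su):
    """T_clk = T_d + T_P + T_su, where T_P is the slowest combinatorial stage."""
    return t_d + max(stage_delays) + t_su


# Hypothetical delays (ps) for IF, ID, EX, MEM, WB and for the registers
stages = [200, 150, 250, 200, 100]
t_clk = clock_period(stages, t_d=20, t_su=10)
print(t_clk)             # pipelined clock period
print(sum(stages))       # delay of the same logic without pipeline registers
```

With these numbers the register overhead (30 ps) and the unbalanced stages keep the speed-up below the ideal factor of 5.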

20 [Figure: a combinatorial circuit between two D registers. T_d: switch delay of the input stage register; T_P: delay of the slowest combinatorial stage; T_su: set-up time of the output stage register.]

21 Pipeline implementation requirements
- Each stage is active at each clock cycle.
- The PC is incremented in the IF stage, so an ADDER should be introduced in the IF stage (PC <= PC + 4, since one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), so the two least significant bits of the PC are always 0. Therefore only a 30 bit register is used (a programmable counter, to allow jumps), incremented by 1 each clock cycle.
- Two Memory Data Registers are required (referred to as LMDR and SMDR). When a LOAD is immediately followed by a STORE, the WB and MEM stages overlap, so two data are waiting to be written at the same time (one to memory, the other to a register of the RF).
- Each clock cycle up to 2 memory accesses must be executed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a "Harvard" architecture.
- The CPU clock is determined by the slowest stage.
- Pipeline registers store both data and control information ("distributed" control unit).

22 DLX Pipelined Datapath
[Figure: stages IF, ID, EX, MEM, WB separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, ADD (+4), INSTR MEM, DEC, RF (inputs Ra, Rb, RD, D; outputs RS1, RS2 to the scratchpad registers), SE, MUXes, ALU, "=0?" test, DATA MEM.]
Annotations:
- SE: sign extension, for operations with immediates and for computing the new PC value on a branch.
- RD: destination register number (1-31) in case of LOAD and ALU instructions.
- D: data (from a register, from memory, or the PC for the link).
- PC: actually a programmable counter, loaded in case of a jump.
- "=0?": for Branch; for Set Condition (also 0) it acts on the output.
- JL and JLR: PC saved in R31.

23 ID stage (N.B. stage layout different from previous slide!)
[Figure: the IF/ID register (IR, PC) feeds DEC, RF and SE; the results are latched in ID/EX (A, B, PC, IR).]
- RF inputs: Ra, Rb; DR: number of the destination register (from WB stage); D: data (from WB stage).
- DEC: receives the IR opcode field.
- Sign extension: IR 15-0 (offset/immediate, e.g. LB, SW, Branch) is extended with IR 15 over bits 31-16; the 26 bit jump field (J and JL) is extended with IR 25 over bits 31-26.
- PC 31-0 is passed on (needed by JL and JLR).
- Info travelling with the instruction: the IR, including the destination register field of R instructions.

24 DLX Pipelined Datapath
[Figure: the same datapath with register names. IF/ID holds IR1, PC1; ID/EX holds A, B, IR2, PC2; EX/MEM holds X, COND, SMDR, IR3, PC3; MEM/WB holds Y, LMDR, IR4, PC4. IM: instruction memory; DM: data memory (Address, Data).]
X: computed data, memory address or branch address
Y: computed data from the previous stage
SMDR: Store Memory Data Register
LMDR: Load Memory Data Register
IRi: Instruction Register i
RF inputs: Ra, Rb, DR (destination register number), D
"=0?": for Branch; for Set Condition (also 0) it acts on the output
JL, JLR: PC saved in R31

25 Pipelined execution of an "ALU" instruction
X: "ALUOUTPUT" (in EX/MEM); Y: "ALUOUTPUT1"
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:  X <= A op B  or  X <= A op [(IR2[15])^16 ## IR2[15..0]]; [IR3 <= IR2]; [PC3 <= PC2]
MEM: Y <= X (temporary storage for WB); [IR4 <= IR3]; [PC4 <= PC3]
WB:  RD <= Y
The decoded opcode travels through all stages.
NOTE: the IRi bits are dropped stage by stage when no longer needed, for all instructions. Why?

26 Pipelined execution of a "MEM" instruction
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:  MAR <= A op (IR2[15])^16 ## IR2[15..0]; SMDR <= B; [IR3 <= IR2]; [PC3 <= PC2]
MEM: LMDR <= M[MAR] (if LOAD) or M[MAR] <= SMDR (if STORE); [IR4 <= IR3]; [PC4 <= PC3]
WB:  RD <= LMDR (if LOAD) [sign ext.]
The decoded opcode travels through all stages.

27 Pipelined execution of a "BRANCH" instruction (normally after a SCn instruction, see later)
X: "BTA" (BRANCH TARGET ADDRESS)
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:  X <= PC2 op (IR2[15])^16 ## IR2[15..0] (computed new PC address); Cond <= A op 0 (branch on Reg A value, 0/1); [IR3 <= IR2]; [PC3 <= PC2]
MEM: if (Cond) PC <= X; [IR4 <= IR3]; [PC4 <= PC3]
WB:  (NOP)
The decoded opcode travels through all stages.
The new value is loaded into the PC at the end of the MEM interval. When the branch is taken, 3 new unwanted instructions have already been started.

28 Pipelined execution of a "JR" instruction
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:  X <= A (new PC address); [IR3 <= IR2]; [PC3 <= PC2]
MEM: PC <= X; [IR4 <= IR3]; [PC4 <= PC3]
WB:  (NOP)
The decoded opcode travels through all stages.
The new value is loaded into the PC at the end of the MEM interval. When the jump is executed, 3 new unwanted instructions have already been started.
Which would be the stage sequence for a J instruction?

29 Pipelined execution of a "JL or JLR" instruction
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:  X <= A (if JLR) or X <= PC2 + (IR2[25])^6 ## IR2[25..0] (if JL); PC3 <= PC2; [IR3 <= IR2]
MEM: PC <= X; PC4 <= PC3; [IR4 <= IR3]
WB:  R31 <= PC4
The decoded opcode travels through all stages. In this case the PCi values are used.
NOTE: the write to R31 CANNOT be performed on the fly, since it could overlap with another register write.
The new value is loaded into the PC at the end of the MEM interval. When the jump is executed, 3 new unwanted instructions have already been started.

30 Which would be the sequence in case of SCn (e.g. SLT R1,R2,R3)?
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:  ?
MEM: ?
WB:  ?

31 Pipeline Hazards
A "hazard" occurs when an instruction currently in a pipeline stage cannot be executed in the current clock cycle.
- Structural hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
- Data hazards: due to dependencies between instructions. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
- Control hazards: the instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled ("pipeline stall" or "pipeline bubbling"), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).

32 Hazards and stalls
[Figure: instructions I_i-3, I_i-2, I_i-1, I_i, I_i+1 over Clk 1 to Clk 12; I_i and I_i+1 are held for three stall cycles (S).]
Consequence of a data hazard: if instruction I_i needs the result of instruction I_i-1 (data are read in the ID stage), it must wait until after the WB of I_i-1.
Stall: the clock signal for I_i, I_i+1 etc. is blocked for three periods.
T_5 = 8 * CLK = (5 + 3) * CLK
T_5 = 5 * (1 + 3/5) * CLK
Normally the three stalled instructions are transformed into NOPs, to avoid blocking the clock.
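The two expressions for T_5 say the same thing: stalls add cycles, which can be read as an increase of the effective CPI. A minimal Python sketch follows, counting (as the slide does) one cycle per instruction plus the stall cycles; the function names are illustrative.

```python
def total_cycles(n_instructions: int, n_stalls: int) -> int:
    """Cycles counted as on the slide: one per instruction plus the stalls."""
    return n_instructions + n_stalls


def effective_cpi(n_instructions: int, n_stalls: int) -> float:
    """CPI = 1 + stalls per instruction (ideal CPI of 1 plus the penalty)."""
    return 1 + n_stalls / n_instructions


print(total_cycles(5, 3))    # the (5 + 3) * CLK of the slide
print(effective_cpi(5, 3))   # the factor (1 + 3/5)
```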

33 Forwarding
Forwarding eliminates almost all RAW hazards of the pipeline without stalling it. (NOTE: in DLX, registers are modified only in the WB stage.)
Instruction        Stages IF ID EX MEM WB in   Note
ADD R3, R1, R4     Clk 1-5                     writes R3 in WB
SUB R7, R3, R5     Clk 2-6                     hazard
OR  R1, R3, R5     Clk 3-7                     hazard
LW  R6, 100(R3)    Clk 4-8                     hazard
AND R9, R5, R3     Clk 5-9                     no hazard
For the LW too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB while the register value is read in ID!
Data are read from registers in the ID stage.
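Which instructions of the example hit a RAW hazard can be worked out mechanically. The Python sketch below assumes, as stated in the slide, that registers are read in ID and written at the end of WB, so an instruction is at risk when it reads a register written by one of the three instructions before it. The tuple encoding of the program and the name `raw_hazards` are illustrative.

```python
def raw_hazards(program, window=3):
    """Return (producer, consumer) index pairs that hit a RAW hazard.

    program: list of (text, dest_register, source_registers).
    window:  how many following instructions read too early (3 without
             forwarding: read in ID, write at the end of WB).
    """
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i][1]
            # R0 is fixed to 0, so writes to it never create a dependency
            if dest is not None and dest != 0 and dest in sources:
                hazards.append((i, j))
    return hazards


PROGRAM = [
    ("ADD R3, R1, R4", 3, (1, 4)),
    ("SUB R7, R3, R5", 7, (3, 5)),
    ("OR R1, R3, R5", 1, (3, 5)),
    ("LW R6, 100(R3)", 6, (3,)),
    ("AND R9, R5, R3", 9, (5, 3)),
]

for producer, consumer in raw_hazards(PROGRAM):
    print(PROGRAM[consumer][0], "depends on", PROGRAM[producer][0])
```

The AND is four instructions after the ADD, so its ID stage falls after the ADD's WB and it is not reported.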

34 Forwarding implementation
[Figure: a forwarding unit (FU) drives the MUXes at the ALU inputs (A, B, offset). Bypass paths reach them from the EX/MEM and MEM/WB registers (ALU result and memory data); IR3 and IR4 travel with those registers. A further bypass MUX sits at the RF output, before ID/EX.]
- FU inputs: RS1/RS2 and OPCODE from ID/EX; RD1 (destination register) and OpCode from EX/MEM; RD2 and OpCode from MEM/WB.
- The FU is combinatorial: it compares RS1, RS2 with RD1, RD2 and the opcodes.
- RF bypass MUX: often implemented inside the RF, it allows "the anticipation" of the register value onto ID/EX. MUX control: IF/ID opcode and comparison of RD with RS1 and RS2.
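The comparison performed by the forwarding unit can be sketched in Python as a selection function for one ALU operand. The priority given to the most recent result (EX/MEM before MEM/WB) is the usual choice and is assumed here; the opcode checks are reduced to "the instruction writes a register" (a destination of `None` means it does not). All names are illustrative.

```python
def forward_select(rs, rd1_ex_mem, rd2_mem_wb):
    """Choose the source of an ALU operand.

    rs:          source register number (RS1 or RS2) of the instruction in EX
    rd1_ex_mem:  destination register of the instruction in MEM, or None
    rd2_mem_wb:  destination register of the instruction in WB, or None
    """
    if rs != 0 and rs == rd1_ex_mem:
        return "EX/MEM"   # newest value: result computed one cycle ago
    if rs != 0 and rs == rd2_mem_wb:
        return "MEM/WB"   # value about to be written back
    return "ID/EX"        # no dependency: use the value read from the RF


# SUB R7, R3, R5 in EX right after ADD R3, R1, R4: R3 comes from EX/MEM
print(forward_select(3, rd1_ex_mem=3, rd2_mem_wb=None))
```

Register R0 is excluded because it is fixed to 0 and never needs forwarding.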

35 Data hazard due to LOAD instructions
NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXes between memory and ALU, which adds delay!).
Sequence:
  LW  R1,32(R6)
  ADD R4,R1,R7
  SUB R5,R1,R8
  AND R6,R1,R7
The pipeline needs to be stalled:
  LW  R1,32(R6):  IF ID EX MEM WB
  ADD R4,R1,R7:   IF ID S EX MEM
  SUB R5,R1,R8:   IF ID EX
  AND R6,R1,R7:   IF ID
The stalled slot is transformed into a NOP and the fetch is repeated (PC <= PC - 4), so that a NOP (IF ID EX MEM WB) travels between the LW and the ADD.
From the end of the MEM stage of the LW onwards: standard forwarding.
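Detecting this case amounts to checking whether the instruction right after a load reads the loaded register. A Python sketch under that assumption follows; the set of load mnemonics comes from the instruction list of the deck, while the tuple encoding and the function name are illustrative.

```python
LOADS = {"LW", "LB", "LBU", "LH", "LHU"}


def load_use_stalls(program):
    """Return the indices of instructions that need one stall cycle.

    program: list of (mnemonic, dest_register, source_registers).
    An instruction stalls if the one just before it is a load whose
    destination it reads (forwarding covers every other distance).
    """
    stalled = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = program[i - 1]
        sources = program[i][2]
        if prev_op in LOADS and prev_dest != 0 and prev_dest in sources:
            stalled.append(i)
    return stalled


PROGRAM = [
    ("LW", 1, (6,)),     # LW  R1,32(R6)
    ("ADD", 4, (1, 7)),  # ADD R4,R1,R7
    ("SUB", 5, (1, 8)),  # SUB R5,R1,R8
    ("AND", 6, (1, 7)),  # AND R6,R1,R7
]
print(load_use_stalls(PROGRAM))
```

Only the ADD is flagged: by the time the SUB and the AND reach EX, the loaded value can be forwarded.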

36 Delayed load
In many RISC CPUs the hazard associated with the LOAD instruction is not handled in hardware by stalling the pipeline, but in software by the compiler (delayed load):
  LOAD instruction
  delay slot
  next instruction
The compiler tries to fill the delay slot with a "useful" instruction (worst case: NOP).
Example sequence:
  LW  R1,32(R6)
  LW  R3,10(R4)
  ADD R5,R1,R3
  LW  R6,20(R7)
  LW  R8,40(R9)

37 Control Hazards
Address   Instruction
PC        BEQZ R4, 200
PC+4      SUB R7, R3, R5
PC+8      OR R1, R3, R5
PC+12     LW R6, 100(R8)
BTA       AND R9, R5, R3
Next instruction address: R4 = 0: Branch Target Address (taken); R4 != 0: PC+4 (not taken).
[Figure: BEQZ, SUB, OR and LW each go through IF ID EX MEM WB over Clk 1 to Clk 8, one clock apart.]
The new PC value is computed in the EX stage of the BEQZ (Aluout). The new value is in the PC one clock later (it must be clocked into the PC), and only then does the fetch with the new PC take place.

38 DLX Pipelined Datapath (Branch or JMP): BEQZ R4, 200
[Figure: datapath with stages Instruction Fetch, Instruction Decode, Execute, Memory, Write Back and registers IF/ID, ID/EX, EX/MEM, MEM/WB; the new PC is fed back from EX/MEM to the PC MUX.]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output rather than from ALUOUT, only two stalls would be required, but the clock would be slower!

39 Handling the Control Hazards
Always Stall: after BEQZ R4,200 a three-clock block is propagated (S S S), then the fetch at the new PC takes place.
  Real situation: the IF that follows the branch occurs when the previous instruction has not yet been decoded, so it is turned into a stall and the IF is repeated (PC <= PC - 4).
Predict Not Taken: BEQZ R4, 200; SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100(R8) are fetched in sequence.
  - The new value of the PC is computed in the EX stage of the branch and sampled by the PC at branch completion.
  - If the branch is taken: flush. The three following instructions become NOPs. There is no problem, because none of them has reached the WB stage: no data has been written yet.
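The cost of the two policies can be compared with a simple CPI estimate. The Python sketch below uses the usual first-order model (ideal CPI of 1 plus stall cycles per instruction); the branch frequency and taken ratio in the example are invented numbers, and the default penalty of 3 cycles is the one discussed for this datapath.

```python
def cpi_always_stall(branch_fraction, penalty=3):
    """Every branch costs the full penalty, taken or not."""
    return 1 + branch_fraction * penalty


def cpi_predict_not_taken(branch_fraction, taken_fraction, penalty=3):
    """Only taken branches cost the penalty (the fetched instructions are flushed)."""
    return 1 + branch_fraction * taken_fraction * penalty


# Hypothetical workload: 20% branches, half of them taken
print(cpi_always_stall(0.2))
print(cpi_predict_not_taken(0.2, 0.5))
```

Passing `penalty=1` models a design in which the target is available one cycle after the fetch of the branch.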

40 Stalls with jumps (1/3)
[Figure: DLX pipelined datapath (IF, ID, EX, MEM, WB); the PC MUX is driven from EX/MEM ("PC if jump", "=0?"), and NOPs are forced into IF/ID, ID/EX and EX/MEM when the jump is taken.]
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.
Three NOPs MUST replace the 3 unwanted instructions already started.

41 Stalls with jumps (2/3)
[Figure: the same datapath, with the PC MUX driven directly from the EX stage; NOPs are forced into two pipeline registers when the jump is taken.]
NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval.
Two NOPs MUST replace the 2 unwanted instructions already started.

42 Stalls with jumps (3/3)
[Figure: the same datapath, with the PC MUX driven within the same period in which the condition is detected; a NOP is forced into one pipeline register when the jump is taken.]
NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected.
A NOP MUST replace the unwanted instruction already started.
Very slow solution!

43 Delayed branch
Similarly to the LOAD case, in several RISC CPUs the BRANCH hazard is handled in software by the compiler (delayed branch):
  BRANCH instruction
  delay slot(s)
  next instruction
The compiler tries to fill the delay slots with "useful" instructions (worst case: NOP).

44 Delayed branch/jump
Original:
  Add R5, R4, R3
  Sub R6, R5, R2
  Or  R14, R6, R21
  Sne R1, R8, R9    ; branch condition
  Br  R1, +100
Compiled:
  Sne R1, R8, R9    ; branch condition
  Br  R1, +100
  Add R5, R4, R3    ; these three are executed in both cases
  Sub R6, R5, R2
  Or  R14, R6, R21
Obviously there must be no jumps in this group of instructions!
When no suitable instructions are available to be "postponed", the compiler inserts NOPs instead.

45 Independent adder for BRANCH/JMP
To reduce the number of stalls, an additional full adder (independent of the ALU) computes the target address in the ID stage.
IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:  A <= Ra; B <= Rb; PC2 <= PC1; ID/EX <= decode; ID/EX <= opcode ext.
     BTA <= PC1 + (IR[15])^16 ## IR[15..0] (branch) or PC1 + (IR[25])^6 ## IR[25..0] (jump)
     if Branch: if (RS1 op 0) PC <= BTA
     if JMP: always PC <= BTA
EX, MEM, WB: as before (the new fetch costs only one stall)
NOTE: in this case there is only one stall, since the new value is loaded into the PC on the positive clock edge that ends the ID stage, while in the previous case it was loaded after the MEM stage, that is, two clocks later!

46 BRANCH/JMP: 1 stall
[Figure: IF and ID stages with PC, ADDER (+4), IM, IF/ID (IR1, PC1), DEC, RF, SE, a second adder and ID/EX (A, B, PC2). The PC MUX inputs are the standard increment and the sum of the PC of the branch instruction (PC1) and the sign-extended branch offset (displacement of the branch instruction).]
The source of the next PC is selected according to the opcode and, for branches, the value of the branch test register ("= 0?").
NOTE: for "unconditional jump" instructions the situation is similar. We only need to provide further inputs to the MUXes of the PC, taking either the RS1 register (JR and JLR) or the 26 least significant bits of the IR with sign extension (J and JL), to be added to the PC of the instruction (not the current PC).

47 Handling the Control Hazards
Dynamic prediction: Branch Target Buffer => no stall (almost...)
[Figure: buffer with columns TAGS, T/NT and Predicted PC; the PC is compared (=) with the tags.]
HIT: fetch with the predicted PC. MISS: fetch with PC + 4.
Correct prediction: no stalls. Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before).
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.

48 Prediction buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed. When one outcome predominates, each occurrence of the opposite outcome causes two consecutive errors.
[Figure: two loops, Loop1 and Loop2.]
When the program exits Loop2 the prediction fails (the branch is predicted taken but is actually not taken); it then fails again when Loop2 is entered once more, because the branch is now predicted not taken.
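The double error can be reproduced with a few lines of Python. The sketch models a single branch with a one-bit history; the outcome sequence (a loop branch taken nine times and then not taken, executed three times in a row) is an invented example.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Count wrong predictions of a one-bit predictor for a single branch.

    outcomes: iterable of booleans, True when the branch is taken.
    initial:  prediction used the first time the branch is met.
    """
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the bit always records the last outcome
    return misses


# Inner loop of 10 iterations, run 3 times: taken 9 times, then not taken
outcomes = ([True] * 9 + [False]) * 3
print(one_bit_mispredictions(outcomes))
```

Apart from the very first pass, every execution of the loop costs two errors: one at the exit and one at the next entry.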

49 Usually two bits are used.
[Figure: state diagram of the two-bit predictor, with transitions labelled TAKEN and UNTAKEN.]
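One common two-bit scheme is a saturating counter, sketched below in Python. The slide's state diagram is not recoverable from the transcript, so the exact transitions here (count up on taken, down on untaken, predict taken in the two upper states) are an assumption; the outcome sequence is an invented example of a loop branch taken nine times and then not taken, executed three times.

```python
def two_bit_mispredictions(outcomes, initial_state=3):
    """Count wrong predictions of a two-bit saturating counter.

    States 0..3: 0 and 1 predict not taken, 2 and 3 predict taken.
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Saturating update: move one step towards the actual outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


outcomes = ([True] * 9 + [False]) * 3
print(two_bit_mispredictions(outcomes))
```

A single untaken outcome no longer flips the prediction, so in this example only the loop exits are mispredicted.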