1 DLX computer Electronic Computers M. 2 RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer RISC architectures In CISC.


1 DLX computer (Electronic Computers M)

2 RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)
RISC architectures
In CISC architectures 10% of the instructions are used in 90% of cases: waste of silicon. Bottleneck: the bus.
Mid '80s, a new architecture: RISC.
Solution:
- Reduction of instruction number and complexity (fewer, simpler machine instructions)
- Fixed instruction format (simpler instruction decoders)
- Simpler control logic network; increased number of on-chip registers
- Reduction of bus/memory accesses
The number of machine instructions needed for a job increases, but in many cases this is more than compensated (in terms of time) by the reduction of bus accesses.
CISC and RISC are each the best solution in different application fields.
Nowadays both architectures coexist in the same processor: analysis at the end of the course.
A simplified RISC architecture: DLX (implemented as a real processor in the '80s as the R4000).

3 DLX (fixed) instruction format
R format: Op-code (6 bit, bits 31..26) | Ra (5 bit, bits 25..21) | Rb (5 bit, bits 20..16) | Rc (5 bit, bits 15..11) | Op-code extension (11 bit, bits 10..0)
  Arithmetic or logic instructions such as Rd <= RS1 op RS2, or Set Condition between registers.
J format: Op-code | 26 bit (PC relative) offset
  Direct and unconditional control transfer (J and JAL).
I format: Op-code | Ra | Rb | 16 bit immediate operand
  Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand.
  In LD and ALU instructions RS2 = destination; in ST, RS2 = source. RS1 is used as base address or as ALU value for the immediate instructions.
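The field boundaries above can be illustrated with a minimal Python sketch. The function name `decode_fields`, the dictionary keys and the example word are hypothetical and chosen only for illustration; the op-code value 0 and extension 0x20 are not taken from the slides.

```python
def decode_fields(word):
    """Split a 32-bit word into the fields used by the R, I and J formats."""
    word &= 0xFFFFFFFF
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26, common to all formats
        "ra": (word >> 21) & 0x1F,       # bits 25..21, R and I formats
        "rb": (word >> 16) & 0x1F,       # bits 20..16, R and I formats
        "rc": (word >> 11) & 0x1F,       # bits 15..11, R format only
        "ext": word & 0x7FF,             # bits 10..0, R format op-code extension
        "imm16": word & 0xFFFF,          # bits 15..0, I format immediate
        "offset26": word & 0x3FFFFFF,    # bits 25..0, J format offset
    }


# Hypothetical R-format word: opcode 0, Ra = 1, Rb = 2, Rc = 3, extension 0x20
example = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
fields = decode_fields(example)
print(fields)
```

Which of the returned fields is meaningful depends on the format selected by the op-code; the sketch simply extracts all of them.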

4 DLX non floating-point instructions
31 x 32 bit registers R31…R1; R0 = 0 fixed. Ra and Rb can be any of the 32 registers.
Data Transfer
  LW Ra, offset(Rb)     LB Ra, offset(Rb)     LBU Ra, offset(Rb)
  LHU Ra, offset(Rb)    LH Ra, offset(Rb)
  SW Ra, offset(Rb)     SH Ra, offset(Rb)     SB Ra, offset(Rb)
  LHI Ra, value
Arithmetic/Logic
  ADD Ra,Rb,Rc     ADDI Ra,Rb,value     ADDU Ra,Rb,Rc     ADDUI Ra,Rb,value
  SUB Ra,Rb,Rc     SUBI Ra,Rb,value     SUBU Ra,Rb,Rc     SUBUI Ra,Rb,value
  DIV Ra,Rb,Rc     DIVI Ra,Rb,value     MULU Ra,Rb,Rc     MULI Ra,Rb,value
  SLL Ra,Rb,Rc     SLLI Ra,Rb,value     SHR Ra,Rb,Rc      SHRI Ra,Rb,value
  SLA Ra,Rb,Rc     SLAI Ra,Rb,value
  OR Ra,Rb,Rc      ORI Ra,Rb,value      XOR Ra,Rb,Rc      XORI Ra,Rb,value
  AND Ra,Rb,Rc     ANDI Ra,Rb,value
Control
  SETx Ra,Rb,Rc    SETIx Ra,Rb,value
  BEQZ Ra, offset  BNEQZ Ra, offset
  J offset         JR Ra
  JL offset        JLR Ra
N.B.
- Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE.
- JL (via register or not) means jump and link: the PC is saved in R31.
- Offset is a value within the instruction.
- Postfix I means «immediate» (value within the instruction).
- Postfix A means «arithmetic» (sign extension).
- Postfix U means «unsigned».
- Value is the immediate within the instruction.
- No STACK registers.

5 DLX ALU operations
Two data inputs, one data output plus flags. S1, S2: ALU inputs (32 bit).
Operations: S1 + S2; S1 - S2; S1 and S2; S1 or S2; S1 exor S2; left shift of S1 by S2 positions; right shift of S1 by S2 positions; arithmetic right shift of S1 by S2 positions; S1; S2; 0; 1.
Output flags: Zero, Negative (sign).
The ALU is a combinatorial circuit!
[Figure: ALU block with 32 bit inputs S1 and S2, output OUT and Flags]

6 Abstract instruction execution
PC is the Program Counter. A and B are two internal scratchpad registers unknown to the programmer.
INSTRUCTION FETCH (when Ready): [REG INSTR] <= M[PC]
INSTRUCTION DECODE: [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]
INSTRUCTION EXECUTION: one of Data transfer, ALU, Set, Jump, Branch

7 Example: LB (LOAD BYTE, format I)
LB Ra, offset(Rb)
Instruction format: Op-code (bits 31..26) | RS1 (bits 25..21) | RS2 (bits 20..16) | 16 bit immediate operand (bits 15..0). Dest. reg. = RS2; [A] = [RS1].
Execution steps:
  INSTR <= M[PC]
  [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]
  Addr <= [B] + (Instr_15)^16 ## Instr_15..0     (byte address computation, with sign extension)
  [Ra] <= (M[Addr]_7)^24 ## M[Addr]_7..0         (byte in register, sign extended)
  Next instruction
Notation: ## is the JOIN operator; x_n is bit n of x; (b)^k is bit b repeated k times. Instr_15..0 is the instruction offset: instruction bit 15 (sign) is extended to the left 16 times. The address is always 32 bits (bit 31 = MSbit, bit 0 = LSbit).
Sign extension example: M[Addr]_7..0 = A7 H = (10100111) b, so the sign-extended value is FFFFFFA7 H.
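The sign extension and the LB data path can be sketched in a few lines of Python. This is an illustrative model only: `sign_extend`, `load_byte` and the dictionary used as byte-addressed memory are hypothetical names, not part of the slides.

```python
def sign_extend(value, bits):
    """Replicate the sign bit of a `bits`-wide value over a 32-bit word."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    # (value ^ sign) - sign yields the signed interpretation of the field
    return ((value ^ sign) - sign) & 0xFFFFFFFF


def load_byte(memory, base, offset16):
    """Model of LB: Addr <= base + sign-extended offset, then extend the byte."""
    addr = (base + sign_extend(offset16, 16)) & 0xFFFFFFFF
    return sign_extend(memory[addr], 8)


# Byte A7 H stored at address 0x1004 (assumed memory content)
memory = {0x1004: 0xA7}
result = load_byte(memory, 0x1000, 0x0004)
print(hex(result))
```

With the byte A7 H from the slide example, the value written to the destination register is FFFFFFA7 H.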

8 Sign extension: (IR_15)^16 ## IR_15..0
[Figure: IR bits 15-0 and the extended bits 31..16, driven through tri-state devices enabled by the Control Unit]

9 Data transfer instructions (I format)
Examples: LW Ra, offset(Rb); LB Ra, offset(Rb); LBU Ra, offset(Rb) (unsigned); LHU Ra, offset(Rb) (unsigned); SW Ra, offset(Rb)
Address computation (all cases): Addr <= [B] + (Instr_15)^16 ## Instr_15..0
  LB  (byte, signed):        [Ra] <= (M[Addr]_7)^24 ## M[Addr]_7..0
  LBU (byte, unsigned):      [Ra] <= (0)^24 ## M[Addr]_7..0
  LH  (half word, signed):   [Ra] <= (M[Addr]_15)^16 ## M[Addr]_15..0
  LHU (half word, unsigned): [Ra] <= (0)^16 ## M[Addr]_15..0
  LW  (word):                [Ra] <= M[Addr]
  SW  (word):                M[Addr] <= [A]

10 ALU instruction examples
T is a temporary hidden register unknown to the programmer. A and B are generic registers (RS1, RS2).
Second operand:
  Register (format R):  [T] <= [Rc]
  Immediate (format I): [T] <= (Instr_15)^16 ## Instr_15..0
Operation:
  ADD: [Ra] <= [B] + [T]
  SUB: [Ra] <= [B] - [T]
  AND: [Ra] <= [B] and [T]
  OR:  [Ra] <= [B] or [T]
  XOR: [Ra] <= [B] xor [T]
The same scheme applies to the shifts etc. Register contents are treated as signed in arithmetic operations.
Examples: ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value; …

11 SET instructions (see branch)
Ex. SLT Ra,Rb,Rc: set Ra = 1 if Rb is less than Rc, otherwise Ra = 0.
Second operand:
  Register (format R):  [T] <= [Rc]
  Immediate (format I): [T] <= (Instr_15)^16 ## Instr_15..0
Condition:
  SEQ: [Ra] = 1 if [Rb] = [T]
  SNE: [Ra] = 1 if [Rb] != [T]
  SLT: [Ra] = 1 if [Rb] < [T]
  SGT: [Ra] = 1 if [Rb] > [T]
  SLE: [Ra] = 1 if [Rb] <= [T]
  SGE: [Ra] = 1 if [Rb] >= [T]
Register contents are treated as signed.
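A minimal Python sketch of the set-condition semantics follows, assuming 32-bit two's-complement register contents as stated on the slide. The names `CONDITIONS`, `to_signed32` and `set_condition` are hypothetical helpers, not DLX terminology.

```python
import operator

# Postfix of the SET instruction mapped to the comparison it performs
CONDITIONS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "GT": operator.gt,
    "LE": operator.le, "GE": operator.ge,
}


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def set_condition(cond, rb, t):
    """Return 1 if the signed comparison rb <cond> t holds, else 0."""
    return 1 if CONDITIONS[cond](to_signed32(rb), to_signed32(t)) else 0


# FFFFFFFF H is -1 when read as signed, so it is less than 1
print(set_condition("LT", 0xFFFFFFFF, 1))
```

Treating the operands as signed is what makes FFFFFFFF H compare as smaller than 1; an unsigned comparison would give the opposite answer.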

12 JUMP instructions
  J offset (jump address), JL offset (jump and link address): format J
  JR Ra (jump register), JLR Ra (jump and link register): format I
(JL and JLR are also written JAL and JALR.)
For saving [PC] in R31 (JAL and JALR only): [T] <= [PC]; [R31] <= [T]
New PC:
  J, JAL:   [PC] <= [PC] + (Instr_25)^6 ## Instr_25..0
  JR, JALR: [PC] <= [A]

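13 Branch instructions
BEQZ: branch if [A] = 0. BNEZ: branch if [A] != 0.
If the branch is taken (YES): [PC] <= [PC] + (Instr_15)^16 ## Instr_15..0. Otherwise (NO) execution continues with the next instruction.
Ex. BNEQZ R5, 100: jump to PC+100 if R5 is not equal to 0.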

14 The Pipelining Principle
Pipelining is the main technique used for speeding up a CPU. The key idea is general and is applied in several industrial fields (production lines, oil pipelines, …).
A system S must operate N times on a task A_i, producing result R_i: A_1, A_2, A_3 … A_N -> S -> R_1, R_2, R_3 … R_N
Latency: time elapsing between the beginning and the end of task A (T_A).
Throughput: frequency of task completion.

15 The Pipelining Principle
1) Sequential system: tasks A_1, A_2, A_3 … A_N are executed one after the other, each taking T_A. Latency (execution time of a single instruction) = T_A.
2) Pipelined system: instructions are subdivided into stages, each stage lasting one n-th (one fourth in this example) of the entire instruction, and instructions overlap. System S is split into stages S_1, S_2, S_3, S_4 (S_i: pipeline stage), which execute phases P_1, P_2, P_3, P_4 of task A.

16 The Pipelining Principle
[Figure: tasks A_1, A_2, A_3, A_4 … A_n over time t, each split into phases P_1, P_2, P_3, P_4 of duration T_P executed by stages S_1, S_2, S_3, S_4 of system S; each task starts one cycle after the previous one]
T_P: pipeline cycle. Each cycle one instruction terminates.
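The gain can be made concrete with a small Python sketch comparing the two systems under ideal conditions. The assumptions are mine: no register overhead, no stalls, and T_P = T_A / n. The numbers (100 tasks, T_A = 4 time units) are illustrative and the function names are hypothetical.

```python
def sequential_time(n_tasks, latency):
    """Tasks run one after the other, each taking the full latency T_A."""
    return n_tasks * latency


def pipelined_time(n_tasks, n_stages, cycle):
    """First task needs n_stages cycles; then one task completes per cycle."""
    return (n_stages + n_tasks - 1) * cycle


T_A = 4.0              # latency of one task (assumed value)
stages = 4             # four stages, as in the slide example
T_P = T_A / stages     # ideal pipeline cycle

seq = sequential_time(100, T_A)
pipe = pipelined_time(100, stages, T_P)
print(seq, pipe, seq / pipe)
```

The latency of a single task is unchanged (4 cycles of T_P), while for long task sequences the throughput approaches one task per T_P, i.e. a speed-up close to the number of stages.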

17 Instruction stages
  IF:  Instruction fetch (from memory)
  ID:  Instruction decode
  EX:  Instruction execution (ALU)
  MEM: Data memory access (if needed)
  WB:  Write-back (if needed)

18 Pipelining of a CPU (DLX)
Instruction sequence: I_1, I_2, I_3 … I_N. Instruction j goes through IF, ID, EX, MEM, WB over time t.
Clocks per instruction = 1 (ideally!)
CPU (datapath): the stages IF, ID, EX, MEM, WB are combinatorial circuits separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB (D flip-flops).
Pipeline cycle = clock cycle: the delay of the slowest stage.

19 DLX Pipeline
[Figure: Instr i, Instr i+1, Instr i+2, Instr i+3, Instr i+4, each going through IF ID EX MEM WB and starting one clock cycle after the previous one]
CPI (ideally) = 1
Clock cycle: T_clk = T_d + T_P + T_su
  T_d:  switch delay of the input stage register
  T_P:  delay of the slowest combinatorial stage
  T_su: set-up time of the output stage register
T_d and T_su are the overhead introduced by the pipeline registers.

20 [Figure: two D registers with a combinatorial circuit between them, annotated with the switch delay of the input stage register, the delay T_P of the slowest combinatorial stage, and the set-up time of the output stage register]

21 Pipeline implementation requirements
- Each stage is active at each clock cycle.
- The PC is incremented in the IF stage.
- An ADDER should be introduced in the IF stage (PC <= PC + 4: one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), so bits 1 and 0 of the PC are always 0. Therefore only a 30 bit register (a programmable counter, loadable for jumps) is used for PC bits 31..2, incremented by 1 each clock cycle.
- Two Memory Data Registers are required (referred to as LMDR and SMDR). When a LOAD is immediately followed by a STORE, the WB and MEM stages overlap: two data items are waiting to be written, one to memory and the other to a register of the RF.
- Each clock cycle, up to 2 memory accesses must be executed (IF, MEM): separate Instruction Memory (IM) and Data Memory (DM), i.e. a “Harvard” architecture.
- The CPU clock is determined by the slowest stage.
- Pipeline registers store both data and control information (“distributed” control unit).

22 DLX Pipelined Datapath
[Figure: five-stage datapath IF | ID | EX | MEM | WB separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, ADD (+4), INSTR MEM, DEC, RF, SE, MUXes, ALU, =0?, DATA MEM]
Annotations:
- PC: actually a programmable counter (loaded if jump).
- SE: sign extension, for operations with immediates.
- RF: Ra and Rb select RS1 and RS2 (read into the scratchpad registers); RD is the destination register number (1-31) in case of LOAD and ALU instructions; D is the data (from register, from memory, or the PC for link).
- ALU: also used for computing the new PC value when branching. For Set Condition (also 0) [it acts on the output].
- =0?: for Branch.
- JL and JLR: PC saved in R31.

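23 ID stage (N.B. stage layout different from the previous slide!)
[Figure: ID stage between the IF/ID and ID/EX registers, with IR, PC, DEC, SE and RF (inputs Ra, Rb, DR, D; outputs A, B)]
Annotations:
- IR_31..26 (op-code) goes to DEC.
- IR_25..21 and IR_20..16 select the registers read from the RF into A and B.
- DR: number of the destination register (from WB stage). D: data (from WB stage).
- Sign extension of IR_15..0 (offset/immediate) onto bits 31..16, replicating IR_15: immediates and branches (e.g. LB, SW). Bits 15..11 are the destination register in R instructions.
- Sign extension of IR_25..0 onto bits 31..26, replicating IR_25: jumps (J and JL).
- IR_10..0: R instructions.
- PC_31..0: needed for JL and JLR.
- Info travelling with the instruction is passed on to ID/EX.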

24 DLX Pipelined Datapath
[Figure: five-stage datapath IF | ID | EX | MEM | WB with blocks PC, ADD (+4), IM, DEC, RF (Ra, Rb, DR, D), SE, MUXes, ALU, =0?, DM, and pipeline registers IF/ID (IR1, PC1), ID/EX (A, B, IR2, PC2), EX/MEM (COND, X, SMDR, IR3, PC3), MEM/WB (Y, LMDR, IR4, PC4)]
Legend:
- X: computed data, or memory address, or branch address.
- Y: computed data from the previous stage.
- SMDR: Store Memory Data Register.
- LMDR: Load Memory Data Register.
- IRi: Instruction Register i.
- DR: destination register number. D: data.
- For Set Condition (also 0) [it acts on the output].
- =0?: for Branch.
- JL and JLR: PC saved in R31.

25 Pipelined execution of an “ALU” instruction
X: “ALUOUTPUT” (in EX/MEM), Y: “ALUOUTPUT1”
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  X <= A op B, or X <= A op [(IR2_15)^16 ## IR2_15..0]; [PC3 <= PC2]; [IR3 <= IR2]
  MEM: Y <= X (temporary storage for WB); [PC4 <= PC3]; [IR4 <= IR3]
  WB:  RD <= Y
The decoded op-code travels through all stages.
NOTE: for all instructions, IRi bits are dropped stage by stage when no longer needed. Why?

26 Pipelined execution of a “MEM” instruction
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  MAR <= A op (IR2_15)^16 ## IR2_15..0; SMDR <= B; [PC3 <= PC2]; [IR3 <= IR2]
  MEM: LMDR <= M[MAR] (if LOAD) or M[MAR] <= SMDR (if STORE); [PC4 <= PC3]; [IR4 <= IR3]
  WB:  RD <= LMDR (if LOAD) [sign ext.]
The decoded op-code travels through all stages.

27 Pipelined execution of a “BRANCH” instruction (normally after a SCn instruction, see later)
X: “BTA (BRANCH TARGET ADDRESS)”
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  X <= PC2 op (IR_15)^16 ## IR_15..0 (computed new PC address); Cond <= A op 0 (branch on Reg A value, 0/1); [PC3 <= PC2]; [IR3 <= IR2]
  MEM: if (Cond) PC <= X (new value in PC in this interval); [PC4 <= PC3]; [IR4 <= IR3]
  WB:  (NOP)
The decoded op-code travels through all stages.
When the branch is taken, 3 new unwanted instructions have already been started.

28 Pipelined execution of a “JR” instruction
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  X <= A (new PC address); [PC3 <= PC2]; [IR3 <= IR2]
  MEM: PC <= X (new value in PC in this interval); [PC4 <= PC3]; [IR4 <= IR3]
  WB:  (NOP)
The decoded op-code travels through all stages.
When the jump is executed, 3 new unwanted instructions have already been started.
Which would be the stage sequence for a J instruction?

29 Pipelined execution of a “JL or JLR” instruction
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  PC3 <= PC2; X <= A (if JLR); X <= PC2 + (IR_25)^6 ## IR_25..0 (if JL); [IR3 <= IR2]
  MEM: PC <= X (new value in PC in this interval); PC4 <= PC3; [IR4 <= IR3]
  WB:  R31 <= PC4
The decoded op-code travels through all stages. In this case the PCi values are used.
When the jump is executed, 3 new unwanted instructions have already been started.
NOTE: the write to R31 CANNOT be performed on the fly, since it could overlap with another register write.

30 Which would be the sequence in case of SCn (e.g. SLT R1,R2,R3)?
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  ?
  MEM: ?
  WB:  ?

31 Pipeline Hazards
A “hazard” occurs when an instruction currently in a pipeline stage cannot be executed in the current clock cycle.
- Structural hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
- Data hazards: due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
- Control hazards: the instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled (“pipeline stall” or “pipeline bubbling”), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).

32 Hazards and stalls
[Figure: pipeline diagram over Clk 1 … Clk 12 for instructions I_i-3, I_i-2, I_i-1, I_i, I_i+1, with three stall cycles (S) inserted for I_i and the following instructions]
T_5 = 8 * CLK = (5 + 3) * CLK
T_5 = 5 * (1 + 3/5) * CLK
Stall: the clock signal for I_i, I_i+1, etc. is blocked for three periods.
The consequence of a data hazard: if instruction I_i needs the result of instruction I_i-1 (data are read in the ID stage), it must wait until after the WB of I_i-1.
Normally the three stalled instructions are transformed into NOPs to avoid blocking the clock.
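The two expressions for T_5 say the same thing: stall cycles add to the instruction count, which is equivalent to raising the effective CPI above 1. A minimal Python sketch follows, using the slide's accounting (pipeline fill time ignored); the function names are hypothetical.

```python
def total_cycles(n_instructions, n_stall_cycles):
    """Clock cycles spent, one per instruction plus one per stall cycle."""
    return n_instructions + n_stall_cycles


def effective_cpi(n_instructions, n_stall_cycles):
    """Average clock cycles per instruction once stalls are included."""
    return total_cycles(n_instructions, n_stall_cycles) / n_instructions


# Slide example: 5 instructions and 3 stall cycles
cycles = total_cycles(5, 3)
cpi = effective_cpi(5, 3)
print(cycles, cpi)
```

For the slide's numbers this gives 8 clock cycles, i.e. an effective CPI of 1 + 3/5.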

33 Forwarding
Forwarding eliminates almost all RAW hazards without stalling the pipeline. (NOTE: in DLX, registers are modified only in the WB stage.) Data are read from registers in the ID stage.
  ADD R3, R1, R4
  SUB R7, R3, R5      hazard
  OR  R1, R3, R5      hazard
  LW  R6, 100(R3)     hazard: here too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID!)
  AND R9, R5, R3      no hazard
[Figure: pipeline diagram over Clk 1 … Clk 9, each instruction going through IF ID EX MEM WB and starting one clock after the previous one]
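The hazard pattern of this example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: registers are read in ID and written at the end of WB, so a producer up to three instructions earlier is still a hazard; the tuple encoding of instructions and the name `raw_hazards` are hypothetical, and redefinitions of a register inside the window are not treated specially.

```python
def raw_hazards(program, window=3):
    """Return (producer, consumer) pairs with a read-after-write dependence
    at a distance of at most `window` instructions."""
    hazards = []
    for i, (name, _, sources) in enumerate(program):
        for j in range(max(0, i - window), i):
            producer, dest, _ = program[j]
            # R0 is fixed to 0, so writing it never creates a dependence
            if dest is not None and dest != "R0" and dest in sources:
                hazards.append((producer, name))
    return hazards


# (mnemonic, destination register, source registers) for the slide sequence
program = [
    ("ADD", "R3", ("R1", "R4")),
    ("SUB", "R7", ("R3", "R5")),
    ("OR",  "R1", ("R3", "R5")),
    ("LW",  "R6", ("R3",)),
    ("AND", "R9", ("R5", "R3")),
]
hazards = raw_hazards(program)
print(hazards)
```

SUB, OR and LW are flagged against ADD, while AND, four instructions later, reads R3 after it has been written.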

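34 Forwarding implementation
[Figure: forwarding unit (FU) connected to the ID/EX, EX/MEM and MEM/WB registers; MUXes on the ALU inputs (A, B, Offset) select between the register values and the bypassed values from the ALU output (IR3) and from memory (IR4); further MUXes feed the RF and the PC]
Annotations:
- FU inputs: RS1/RS2 and OPCODE from ID/EX; RD1 (destination register/op-code) and RD2/op-code from the following pipeline registers.
- The FU is combinatorial! It compares RS1 and RS2 with RD1 and RD2, together with the op-codes.
- Bypass MUX at the RF: often implemented inside the RF. It allows “the anticipation” of the register on ID/EX. MUX control: IF/ID op-code and comparison of RD with RS1 and RS2.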

35 Data hazard due to LOAD instructions
  LW  R1,32(R6)
  ADD R4,R1,R7
  SUB R5,R1,R8
  AND R6,R1,R7
NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXes between memory and ALU, which adds delays!).
The pipeline needs to be stalled: the ADD is held for one cycle (S), a NOP takes its place in the pipeline, and the fetch is repeated (PC <= PC - 4). From the end of the MEM stage of the LW onwards: standard forwarding.
[Figure: pipeline diagrams of the sequence without and with the stall; in the second, LW is followed by a NOP and then by ADD, SUB, AND]

36 Delayed load
In many RISC CPUs, the hazard associated with the LOAD instruction is not handled in HW by stalling the pipeline but in software by the compiler (delayed load):
  LOAD instruction
  delay slot
  Next instruction
The compiler tries to fill the delay slot with a “useful” instruction (worst case: NOP).
  LW R1,32(R6)
  LW R3,10(R4)
  ADD R5,R1,R3
  LW R6,20(R7)
  LW R8,40(R9)

  LW R1,32(R6)
  LW R3,10(R4)
  ADD R5,R1,R3
  LW R6,20(R7)
  LW R8,40(R9)

37 Control Hazards
  PC           BEQZ R4, 200
  PC+4         SUB R7, R3, R5
  PC+8         OR R1, R3, R5
  PC+12        LW R6, 100(R8)
  PC+4+200     AND R9, R5, R3     (BTA)
Next instruction address:
  R4 = 0:  Branch Target Address (taken)
  R4 != 0: PC+4 (not taken)
[Figure: pipeline diagram over Clk 1 … Clk 8 for BEQZ, SUB, OR, LW and the instruction fetched with the new PC]
The new PC value is computed in the EX stage of the branch (ALUOUT). The new value is in the PC one clock later (it must be clocked into the PC), and only then does the fetch with the new PC take place.

38 DLX Pipelined Datapath (Branch or JMP): BEQZ R4, 200
[Figure: datapath with stages Instruction Fetch | Instruction Decode | Execute | Memory | Write Back, blocks PC, ADD (+4), IM, DEC, RF, SE, MUXes, ALU, =0?, DM, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, only two stalls would be required, at the price of a slower clock!

39 Handling the Control Hazards
Always Stall (three-clock block being propagated)
  BEQZ R4,200: IF ID EX MEM WB, followed by three stall cycles (S S S) and then the fetch at the new PC.
  Real situation: an IF is performed while the previous instruction has not yet been decoded, so the IF is repeated (PC <= PC - 4).
Predict Not Taken
  BEQZ R4, 200; SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100(R8) proceed through the pipeline until branch completion.
  The new value of the PC is computed in one clock cycle and sampled by the PC in the next one.
  If the branch is taken: flush. The following instructions become NOPs. No data has been written yet, and there is no problem because none of those instructions is in the WB stage.
[Figure: pipeline diagrams over Clk 1 … Clk 8 for the two policies]

40 Stalls with jumps (1/3)
[Figure: five-stage datapath IF | ID | EX | MEM | WB (PC, ADD +4, INSTR MEM, DEC, RF with RS1, RS2, DR, D, SE, MUXes, ALU, =0?, DATA MEM; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB), with a forced NOP injected into three pipeline registers when a jump occurs]
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.
Three NOPs MUST replace the 3 unwanted instructions already started.

41 Stalls with jumps (2/3)
[Figure: same five-stage datapath, with a forced NOP injected into two pipeline registers when a jump occurs]
NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval.
Two NOPs MUST replace the 2 unwanted instructions already started.

42 Stalls with jumps (3/3)
[Figure: same five-stage datapath, with the PC MUX driven directly from the condition detection and a forced NOP injected into one pipeline register when a jump occurs]
NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected.
A NOP MUST replace the unwanted instruction already started.
Very slow solution!

43 Delayed branch
Similarly to the LOAD case, in several RISC CPUs the BRANCH instruction hazard is handled in SW by the compiler (delayed branch):
  BRANCH instruction
  delay slot
  delay slot
  Next instruction
The compiler tries to fill the delay slots with “useful” instructions (worst case: NOP).

44 Delayed branch/jump
Original:
  Add R5, R4, R3
  Sub R6, R5, R2
  Or R14, R6, R21
  Sne R1, R8, R9      ; branch condition
  Br R1, +100
Compiled:
  Sne R1, R8, R9      ; branch condition
  Br R1, +100
  Add R5, R4, R3
  Sub R6, R5, R2
  Or R14, R6, R21
The Add, Sub and Or instructions are executed in both cases (branch taken or not). Obviously there must be no jumps in this group of instructions!
When no suitable instructions are available, the compiler inserts NOPs instead of one or more “postponed” instructions.

45 Independent Adder for BRANCH/JMP
To reduce the number of stalls:
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= Ra; B <= Rb; PC2 <= PC1; ID/EX <= decode; ID/EX <= op-code ext.
       BTA <= PC1 + (IR_15)^16 ## IR_15..0 (branch) or PC1 + (IR_25)^6 ## IR_25..0 (jump), computed by an additional full adder
       if Branch: if (RS1 op 0) PC <= BTA
       if JMP: always PC <= BTA
  EX, MEM, WB: nothing further to do for the PC (new fetch after only one stall)
NOTE: in this case there is only one “stall”, since the new value is inserted in the PC on the positive clock edge that ends the ID stage, while in the previous case it was inserted after the MEM stage, that is, two clocks later!
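The next-PC selection performed by the additional adder can be sketched in Python for the conditional branches. This is an illustrative model, not the hardware: `next_pc`, `sign_extend` and the mnemonic strings are hypothetical, and the addresses used are arbitrary.

```python
def sign_extend(value, bits):
    """Signed interpretation of a `bits`-wide field."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def next_pc(pc1, imm16, rs1_value, mnemonic):
    """Return the next fetch address for a conditional branch.

    pc1 is the already incremented PC (PC + 4) travelling with the branch.
    """
    bta = (pc1 + sign_extend(imm16, 16)) & 0xFFFFFFFF
    if mnemonic == "BEQZ":
        taken = rs1_value == 0
    elif mnemonic == "BNEZ":
        taken = rs1_value != 0
    else:
        raise ValueError("only BEQZ and BNEZ are modelled in this sketch")
    return bta if taken else pc1


# BEQZ R4, 200 fetched at 0x1000 (assumed address), so PC1 = 0x1004
print(hex(next_pc(0x1004, 200, 0, "BEQZ")))   # R4 = 0: taken
print(hex(next_pc(0x1004, 200, 7, "BEQZ")))   # R4 != 0: not taken
```

Because the offset is added to PC1 rather than to the current PC, the target of a taken branch is PC+4+200, as in the control hazard example.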

46 BRANCH/JMP with 1 stall
[Figure: IF and ID stages with PC, ADDER (+4), IM, IF/ID (IR1, PC1), DEC, RF, SE, ##, a second adder for the branch target, the = 0? test, MUXes selecting the next PC, and ID/EX (A, B, PC2)]
Annotations:
- Standard increment: PC + 4.
- For branches: branch offset with sign extension (the displacement of the branch instruction), added to the PC of the branch instruction.
- = 0?: for branches. The source of the next PC is selected according to the op-code and the value of the branch test register.
NOTE: for “Unconditional Jump” instructions the situation is similar: we only need to provide further inputs to the MUXes of the PC, considering either the RS1 register (JR and JLR) or the 26 least significant bits of the IR with sign extension (J and JL), to be added to the instruction PC (not the current PC).

47 Handling the Control Hazards
Dynamic prediction: Branch Target Buffer => no stall (almost…)
[Figure: table with columns TAGS, Predicted PC and T/NT; the PC is compared (=) with the tags]
  HIT:  fetch with the predicted PC
  MISS: fetch with PC + 4
  Correct prediction: no stalls
  Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.

48 Prediction Buffer
The simplest implementation uses a single bit that records what happened the last time the branch was executed. When one outcome predominates, each occurrence of the opposite outcome causes two consecutive errors.
Example with nested loops (Loop1 containing Loop2): when the program ends Loop2, the prediction fails (the branch is predicted as taken but is actually untaken); then it fails again when Loop2 is entered once more, because the branch is now predicted as untaken.

49 Usually two bits.
[Figure: four-state prediction diagram with TAKEN / UNTAKEN transitions between the states]
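The benefit of the second bit can be checked with a small Python simulation of the nested-loop case. The sketch assumes a 2-bit saturating counter, which is one common scheme and may differ in detail from the state diagram on the slide; the function names, the initial predictor states and the loop sizes (9 taken iterations, 5 entries into the inner loop) are illustrative assumptions.

```python
def one_bit_mispredictions(outcomes, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses


def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 untaken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Inner-loop branch: taken 9 times, then untaken on exit; loop entered 5 times
outcomes = ([True] * 9 + [False]) * 5
one_bit = one_bit_mispredictions(outcomes)
two_bit = two_bit_mispredictions(outcomes)
print(one_bit, two_bit)
```

In this run the 1-bit predictor misses twice per inner-loop execution after the first one (on exit and on re-entry), while the 2-bit counter misses only on exit.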

