1 DLX computer (Electronic Computers M)

2 RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)
CISC architectures:
- About 10% of the instructions are used in 90% of cases: a waste of silicon
- Bottleneck: the bus
RISC architectures (a new architecture of the mid '80s):
- Solution: reduce the number and the complexity of the instructions (fewer, simpler machine instructions)
- Fixed instruction format (simpler instruction decoders)
- Simpler control logic network
- More on-chip registers
- Fewer bus/memory accesses
- More machine instructions are needed for a given job, but in many cases this is more than compensated (in terms of time) by the reduction in bus accesses
CISC and RISC are each the best solution in different application fields.
Nowadays both architectures coexist in the same processor: analysis at the end of the course.
A simplified RISC architecture: DLX (modelled on real MIPS processors such as the R4000)

3 DLX (fixed) instruction format
R format: Op-code (6 bit) | RS1 (5 bit) | RS2 (5 bit) | Rd (5 bit) | Op-code extension (11 bit)
  Arithmetic or logic instructions of the form Rd <= RS1 op RS2, or Set Condition between registers.
I format: Op-code (6 bit) | RS1 (5 bit) | RS2 (5 bit) | 16-bit immediate operand
  Load, Store, conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand. In load and ALU instructions RS2 is the destination; in stores RS2 is the source. RS1 is used as the base address, or as the ALU operand for the immediate instructions.
J format: Op-code (6 bit) | 26-bit (PC-relative) offset
  Direct and unconditional control transfer (J and JAL).
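The three formats can be illustrated with a small field extractor. This is only a sketch: it assumes the fields are packed from the most significant bit downwards in the order listed on the slide (op-code in bits 31..26, and so on), and the function name `decode_fields` is hypothetical.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit instruction word into the fields of the R, I and J formats.

    Assumption: fields are packed MSB-first in the order shown on the slide.
    """
    word &= 0xFFFFFFFF
    return {
        "opcode": (word >> 26) & 0x3F,   # 6 bits, common to all formats
        "rs1": (word >> 21) & 0x1F,      # 5 bits (R and I formats)
        "rs2": (word >> 16) & 0x1F,      # 5 bits (R and I formats)
        "rd": (word >> 11) & 0x1F,       # 5 bits (R format only)
        "ext": word & 0x7FF,             # 11-bit op-code extension (R format only)
        "imm16": word & 0xFFFF,          # 16-bit immediate (I format only)
        "offset26": word & 0x3FFFFFF,    # 26-bit offset (J format only)
    }


# Build an I-format word by hand: opcode 5, RS1 = R3, RS2 = R4, immediate 0x1234.
example = (5 << 26) | (3 << 21) | (4 << 16) | 0x1234
fields = decode_fields(example)
```

Which of the returned fields is meaningful depends on the format selected by the op-code; the decoder simply exposes all views of the same 32 bits.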

4 DLX non-floating-point instructions
Registers: 31 general-purpose 32-bit registers R31...R1, plus R0 fixed at 0. Ra, Rb and Rc are any of the 32 registers.
Data transfer:
  LW Ra, offset(Rb)    LB Ra, offset(Rb)    LBU Ra, offset(Rb)
  LH Ra, offset(Rb)    LHU Ra, offset(Rb)
  SW Ra, offset(Rb)    SH Ra, offset(Rb)    SB Ra, offset(Rb)
  LHI Ra, value
Arithmetic/Logic:
  ADD Ra,Rb,Rc     ADDI Ra,Rb,value     ADDU Ra,Rb,Rc     ADDUI Ra,Rb,value
  SUB Ra,Rb,Rc     SUBI Ra,Rb,value     SUBU Ra,Rb,Rc     SUBUI Ra,Rb,value
  DIV Ra,Rb,Rc     DIVI Ra,Rb,value     MULU Ra,Rb,Rc     MULI Ra,Rb,value
  SLL Ra,Rb,Rc     SLLI Ra,Rb,value     SHR Ra,Rb,Rc      SHRI Ra,Rb,value
  SLA Ra,Rb,Rc     SLAI Ra,Rb,value
  OR Ra,Rb,Rc      ORI Ra,Rb,value      XOR Ra,Rb,Rc      XORI Ra,Rb,value
  AND Ra,Rb,Rc     ANDI Ra,Rb,value
Control:
  SETx Ra,Rb,Rc    SETIx Ra,Rb,value
  BEQZ Ra, offset  BNEQZ Ra, offset
  J offset         JR Ra                JL offset         JLR Ra
N.B.
- x can be LT, GT, LE, GE, EQ, NE; postfix "x" is the set condition.
- JL and JLR: jump and link, saving the PC in R31.
- Offset is a value within the instruction.
- Postfix I means "immediate" (value within the instruction); value is the immediate within the instruction.
- Postfix A means "arithmetic" (sign extension).
- Postfix U means "unsigned".

5 DLX ALU operations
S1, S2: ALU inputs (32 bit). Available outputs:
- S1 + S2
- S1 - S2
- S1 and S2
- S1 or S2
- S1 exor S2
- Left shift of S1 by S2 positions
- Right shift of S1 by S2 positions
- Arithmetic right shift of S1 by S2 positions
- S1
- S2
- 0
- 1
Output flags: Zero, Negative (sign).
The ALU is a combinational circuit!

6 Abstract instruction execution
A and B: two internal registers unknown to the programmer.
INSTRUCTION FETCH: INSTR <= M[PC] (repeated until the memory is Ready)
INSTRUCTION DECODE: PC <= PC + 4; A <= RS1; B <= RS2
Execution then continues according to the instruction class: Data transfer, ALU, Set, Jump or Branch.

7 LOAD. Example: LB (LOAD BYTE, format I)
LB Ra, offset(Rb) is encoded as: Op-code | RS1 | RS2 | 16-bit immediate operand
Fetch and decode: INSTR <= M[PC]; PC <= PC + 4; A <= RS1; B <= RS2
Address <= A + (Instr[15])^16 ## Instr[15..0]      (A = (RS1); Instr[15..0] is the instruction offset, sign extended)
Dest. Reg. <= (M[Addr][7])^24 ## M[Addr][7..0]     (Dest. Reg. = RS2; the memory byte is sign extended)
Then: next instruction.
## is the JOIN operator; (x)^n means bit x repeated n times. The address is 32 bits wide (bit 31 = MSbit, bit 0 = LSbit).
Sign extension example: M[Addr][7..0] = A7H => (10100111)b, so Dest. Reg. <= FFFFFFA7H

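8 Sign extension
(IR[15])^16 ## IR[15..0]
Figure: sign-extension network. IR bits 15..0 pass through unchanged and bit IR[15] is replicated onto bits 31..16; the result is driven onto bus S1 or S0 under control of the Control Unit.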
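The replication of the top bit can be sketched in a few lines. The helper name `sign_extend` and its parameters are hypothetical; the function reproduces (x[bits-1])^(32-bits) ## x for any field width, so it covers both the 16-bit immediate and the 8-bit memory byte of the LB example.

```python
def sign_extend(value: int, bits: int, width: int = 32) -> int:
    """Extend the `bits`-wide field `value` to `width` bits by replicating its top bit."""
    field_mask = (1 << bits) - 1
    value &= field_mask
    if value >> (bits - 1):                      # top bit of the field is 1
        value |= ((1 << width) - 1) & ~field_mask  # fill the upper bits with 1s
    return value


# The LB example: byte A7H becomes FFFFFFA7H.
extended_byte = sign_extend(0xA7, 8)
```

A field whose top bit is 0 is returned unchanged, which corresponds to filling the upper bits with zeros.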

9 Data transfer instructions (I format)
Examples: LW Ra, offset(Rb); LB Ra, offset(Rb); LBU Ra, offset(Rb); LHU Ra, offset(Rb); SW Ra, offset(Rb)
Address computation (all instructions): Addr <= A + (Instr[15])^16 ## Instr[15..0]   (A unsigned)
  LB:  Dest <= (M[Addr][7])^24 ## M[Addr][7..0]      (signed)
  LBU: Dest <= (0)^24 ## M[Addr][7..0]               (unsigned)
  LH:  Dest <= (M[Addr][15])^16 ## M[Addr][15..0]    (signed)
  LHU: Dest <= (0)^16 ## M[Addr][15..0]              (unsigned)
  LW:  Dest <= M[Addr]
  SW:  M[Addr] <= B
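The difference between the signed and unsigned loads can be checked with a toy memory. This is a sketch under stated assumptions: memory is modelled as a Python `bytes` object, multi-byte values are assumed to be stored big-endian, and the function name `load` is hypothetical.

```python
def load(mnemonic: str, mem: bytes, addr: int) -> int:
    """Return the 32-bit value written to the destination register by a load.

    Assumptions: byte-addressable memory, big-endian byte order.
    """
    size = {"LB": 1, "LBU": 1, "LH": 2, "LHU": 2, "LW": 4}[mnemonic]
    raw = int.from_bytes(mem[addr:addr + size], "big")
    signed = mnemonic in ("LB", "LH")
    if signed and raw >> (8 * size - 1):
        # Replicate the top bit of the loaded datum into the upper register bits.
        raw |= 0xFFFFFFFF & ~((1 << (8 * size)) - 1)
    return raw


memory = bytes([0xA7, 0x80, 0x01, 0x02])
results = {m: load(m, memory, 0) for m in ("LB", "LBU", "LH", "LHU", "LW")}
```

With this memory content LB and LBU read the same byte A7H but produce different register values, because only LB replicates bit 7.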

10 ALU instructions: examples
A and B are generic registers (RS1, RS2); T is a temporary hidden register unknown to the programmer.
Operand selection:
  Register (format R):  T <= B
  Immediate (format I): T <= (Instr[15])^16 ## Instr[15..0]
Operation:
  ADD: Dest <= A + T
  SUB: Dest <= A - T
  AND: Dest <= A and T
  OR:  Dest <= A or T
  XOR: Dest <= A xor T
The same scheme applies to the shifts, etc. Register contents are treated as signed in arithmetic operations.
Examples: ADD Ra,Rb,Rc; ADDI Ra,Rb,value; ADDU Ra,Rb,Rc; ADDUI Ra,Rb,value; ...

11 SET instructions (see branch)
Example: SLT R1,R2,R3 sets R1 = 1 if R2 is less than R3.
Operand selection:
  Register (format R):  T <= B
  Immediate (format I): T <= (Instr[15])^16 ## Instr[15..0]
Condition (register contents are treated as signed):
  SEQ: Dest = 1 if A = T
  SNE: Dest = 1 if A != T
  SLT: Dest = 1 if A < T
  SGE: Dest = 1 if A >= T
  SGT: Dest = 1 if A > T
  SLE: Dest = 1 if A <= T

12 JUMP instructions
Mnemonics: J offset; JR Ra; JL offset; JLR Ra (the link variants are labelled JAL and JALR in the flowchart).
  J, JAL (format J):   PC <= PC + (Instr[25])^6 ## Instr[25..0]
  JR, JALR (format I): PC <= A
For JAL and JALR the PC is saved in R31 first: T <= PC, then R31 <= T.

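13 Branch instructions
  BEQZ: branch if A = 0
  BNEZ: branch if A != 0
If the condition holds (YES): PC <= PC + (Instr[15])^16 ## Instr[15..0]; otherwise (NO) execution continues with the next instruction.
Example: BNEQZ R5, 100 jumps to PC + 100 if R5 is not equal to 0.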
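The jump and branch transfers can be collected into one next-PC function. This is a minimal sketch: `next_pc` and its parameters are hypothetical names, `pc` is taken to be the PC value at the moment the transfer is executed, and `a` is the content of register A (RS1).

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend a `bits`-wide field to 32 bits."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value |= MASK32 & ~((1 << bits) - 1)
    return value


def next_pc(mnemonic: str, pc: int, a: int, instr: int) -> int:
    """Return the PC after a jump or branch (pc: PC when the transfer executes)."""
    if mnemonic in ("J", "JAL"):            # format J: 26-bit PC-relative offset
        return (pc + sign_extend(instr, 26)) & MASK32
    if mnemonic in ("JR", "JALR"):          # format I: target taken from register A
        return a & MASK32
    if mnemonic == "BEQZ":
        taken = a == 0
    elif mnemonic == "BNEZ":
        taken = a != 0
    else:
        raise ValueError(f"not a control instruction: {mnemonic}")
    # Format I: 16-bit PC-relative offset, applied only if the branch is taken.
    return (pc + sign_extend(instr, 16)) & MASK32 if taken else pc


taken_target = next_pc("BNEZ", 0x1004, 5, 100)      # R5 != 0: branch taken
fall_through = next_pc("BNEZ", 0x1004, 0, 100)      # R5 == 0: not taken
```

Saving the return address in R31 for the link variants is left out; only the PC update is modelled.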

14 The Pipelining Principle
Pipelining is the main technique used for speeding up a CPU. The key idea is general and is applied in several industrial fields (production lines, oil pipelines, ...).
A system S must execute a task A N times: the inputs A1, A2, A3, ..., AN enter S, which produces the results R1, R2, R3, ..., RN.
Latency: the time between the beginning and the end of task A (TA).
Throughput: the frequency at which tasks are completed.

15 The Pipelining Principle
1) Sequential system: tasks A1, A2, A3, ..., AN are executed one after the other along the time axis t. Latency (execution time of a single instruction) = TA.
2) Pipelined system: each instruction is subdivided into n stages (4 in this example), each stage lasting one nth of the entire instruction. Task A is split into parts P1, P2, P3, P4, executed by stages S1, S2, S3, S4 of the system S, and instructions overlap.
Si: pipeline stage.

16 The Pipelining Principle
Figure: timing diagram of the pipelined system S (stages S1, S2, S3, S4). Tasks A1, A2, A3, A4, ..., An each go through parts P1, P2, P3, P4, and each task starts one pipeline cycle after the previous one.
TP: pipeline cycle.
Each cycle one instruction terminates.
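The overlap in the timing diagram can be reproduced with a tiny scheduler. The sketch assumes an ideal pipeline with no stalls, where every stage takes exactly one pipeline cycle; the function names are hypothetical.

```python
def pipeline_schedule(n_tasks: int, n_stages: int) -> list:
    """For each task, list the cycle (1-based) in which it occupies each stage."""
    return [[task + stage + 1 for stage in range(n_stages)]
            for task in range(n_tasks)]


def total_cycles(n_tasks: int, n_stages: int) -> int:
    """Cycles needed to complete all tasks: fill the pipeline once, then one task per cycle."""
    return n_stages + n_tasks - 1


schedule = pipeline_schedule(4, 4)   # tasks A1..A4 through stages S1..S4
cycles = total_cycles(4, 4)
```

The first task finishes after 4 cycles and every later task finishes one cycle after its predecessor, whereas a sequential system would need 4 x 4 = 16 stage times for the same four tasks.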

17 Instruction stages
  IF:  Instruction fetch (from memory)
  ID:  Instruction decode
  EX:  Instruction execution (ALU)
  MEM: Data memory access (if needed)
  WB:  Instruction write-back (if needed)

18 Pipelining of a CPU (DLX)
Instruction sequence: I1, I2, I3, ..., IN. Each instruction goes through IF, ID, EX, MEM, WB along the time axis t.
CPI = 1 (ideally!)
Figure: the CPU datapath as a chain of combinational circuits (IF, ID, EX, MEM, WB) separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB (D flip-flops).
Pipeline cycle = clock cycle, set by the delay of the slowest stage.

19 DLX Pipeline
Figure: instructions i, i+1, i+2, i+3, i+4 each go through IF, ID, EX, MEM, WB, each starting one clock cycle after the previous one.
CPI (ideally) = 1
Clock cycle: Tclk = Td + TP + Tsu
  Td:  switch delay of the input stage register
  TP:  delay of the slowest combinational stage
  Tsu: set-up time of the output stage register
Td and Tsu are the overhead introduced by the pipeline registers.
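A quick numeric check of the clock-cycle formula. The delay values below are invented for illustration only (in picoseconds) and do not come from the slides; `clock_period` is a hypothetical helper.

```python
def clock_period(stage_delays_ps: list, t_d_ps: int, t_su_ps: int) -> int:
    """Tclk = Td + TP + Tsu, where TP is the delay of the slowest combinational stage."""
    return t_d_ps + max(stage_delays_ps) + t_su_ps


# Hypothetical delays of the IF, ID, EX, MEM and WB combinational circuits.
stage_delays = [800, 600, 1000, 900, 500]
t_clk = clock_period(stage_delays, t_d_ps=50, t_su_ps=30)
```

Here the EX stage (1000 ps) sets the clock, and the two register overheads add a further 80 ps to every cycle regardless of which stage is slowest.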

20 Figure: a combinational circuit between two D registers. The clock cycle is made up of the switch delay of the input stage register, the delay of the slowest combinational stage (Tp) and the set-up time of the output stage register.

21 Pipeline implementation requirements
- Each stage is active at each clock cycle.
- The PC is incremented in the IF stage, so an ADDER must be introduced there (PC <= PC + 4). Since instructions are aligned, the two least significant bits of the PC are always 0, so a 30-bit register (counter) is used, incremented by 1 each clock cycle.
- Two Memory Data Registers are required (referred to as LMDR and SMDR). When a LOAD is immediately followed by a STORE, their WB and MEM stages overlap, so two data items are waiting to be written at the same time (one to memory, the other to the RF).
- Each clock cycle, two memory accesses may have to be executed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a "Harvard" architecture.
- The CPU clock is determined by the slowest stage.
- Pipeline Registers store both data and control information ("distributed" control unit).

22 DLX Pipelined Datapath
Figure: five-stage datapath (IF, ID, EX, MEM, WB) separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Blocks: PC, ADD (+4), INSTR MEM, DEC, RF (RS1, RS2, RD, D), SE, MUXes, ALU, =0?, DATA MEM.
Annotations:
- PC: actually a programmable counter, loaded if jump.
- SE: sign extension, for operations with immediates.
- RD, D: destination register number and data; the destination register number is used for LOAD and ALU instructions.
- ALU input MUX: selects the PC for computing the new PC value on a branch.
- =0?: for Branch. For Set Condition (also 0) it acts on the output.
- JL and JLR: the PC is saved in R31.

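23 ID stage (N.B. stage layout different from previous slide!)
Figure: the ID stage between the IF/ID and ID/EX registers. Blocks: IR, RF (RS1, RS2, RD, D), SE, DEC, PC.
Annotations:
- RF write port: number of the destination register and data, both coming from the WB stage.
- Sign extension of IR[15..0] (bit IR[15] replicated onto bits 31..16) for immediates and branches, e.g. LB, SW; IR[15..0] carries the offset/immediate, or the destination register in R instructions.
- Sign extension of the 26-bit field (bit IR[25] replicated onto bits 31..26) for jumps (J and JL).
- PC[31..0] is passed on for JL and JLR.
- DEC decodes the IR op-code; A and B are loaded from the RF.
- IR, PC, A and B are information travelling with the instruction.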

24 DLX Pipelined Datapath
Figure: the five-stage datapath (IF, ID, EX, MEM, WB) with the contents of the pipeline registers:
- IF/ID: IR1, PC1
- ID/EX: A, B, IR2, PC2
- EX/MEM: COND, X, SMDR, IR3, PC3
- MEM/WB: Y, LMDR, IR4, PC4
Blocks: PC, ADD (+4), IM, DEC, RF, SE, MUXes, ALU, =0?, DM (address and data inputs).
Legend:
- X: computed data, or memory address, or branch address
- Y: computed data from the previous stage
- SMDR: Store Memory Data Register
- LMDR: Load Memory Data Register
- IRi: Instruction Register i
- =0?: for Branch; for Set Condition (also 0) it acts on the output
- JL and JLR: the PC is saved in R31; the destination register number travels with the instruction

25 Pipelined execution of an "ALU" instruction
X: "ALUOUTPUT" (in EX/MEM), Y: "ALUOUTPUT1"
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  X <= A op B, or X <= A op [(IR2[15])^16 ## IR2[15..0]]; [PC3 <= PC2]; [IR3 <= IR2]
  MEM: Y <= X (temporary storage for WB); [PC4 <= PC3]; [IR4 <= IR3]
  WB:  RD <= Y
The decoded op-code travels through all stages.
NOTE: for all instructions, IRi bits are dropped stage by stage when they are no longer needed. Why?

26 Pipelined execution of a "MEM" instruction
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  MAR <= A op (IR2[15])^16 ## IR2[15..0]; SMDR <= B; [PC3 <= PC2]; [IR3 <= IR2]
  MEM: LMDR <= M[MAR] (if LOAD), or M[MAR] <= SMDR (if STORE); [PC4 <= PC3]; [IR4 <= IR3]
  WB:  RD <= LMDR (if LOAD) [sign extension]
The decoded op-code travels through all stages.

27 Pipelined execution of a "BRANCH" instruction (normally after an SCn instruction, see later)
X: "BTA" (BRANCH TARGET ADDRESS)
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  X <= PC2 op (IR[15])^16 ## IR[15..0]; Cond <= A op 0; [PC3 <= PC2]; [IR3 <= IR2]
  MEM: if (Cond) PC <= X; [PC4 <= PC3]; [IR4 <= IR3]
  WB:  (NOP)
The decoded op-code travels through all stages. The branch depends on the value of register A (0/1).
The new value is in the PC during the WB interval: when the branch is taken, 3 new unwanted instructions have already been started.

28 Pipelined execution of a "JR" instruction
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  X <= A; [PC3 <= PC2]; [IR3 <= IR2]
  MEM: PC <= X; [PC4 <= PC3]; [IR4 <= IR3]
  WB:  (NOP)
The decoded op-code travels through all stages.
The new value is in the PC during the WB interval: when the jump is executed, 3 new unwanted instructions have already been started.
Which would be the stage sequence for a J instruction?

29 Pipelined execution of a "JL or JLR" instruction
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  PC3 <= PC2; X <= A (if JLR), or X <= PC2 + (IR[25])^6 ## IR[25..0] (if JL); [IR3 <= IR2]
  MEM: PC <= X; PC4 <= PC3; [IR4 <= IR3]
  WB:  R31 <= PC4
The decoded op-code travels through all stages. In this case the PCi values are used.
NOTE: the write to R31 CANNOT be performed on the fly, since it could overlap with another register write.
The new value is in the PC during the WB interval: when the jump is executed, 3 new unwanted instructions have already been started.

30 Which would be the sequence in the case of SCn (e.g. SLT R1,R2,R3)?
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
  EX:  ?
  MEM: ?
  WB:  ?

31 Pipeline Hazards
A "hazard" occurs when an instruction currently in a pipeline stage cannot be executed in that clock cycle.
- Structural hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.
- Data hazards: these are due to instruction dependencies. For example, an instruction needs to read a register not yet written by a previous instruction (Read After Write).
- Control hazards: the instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled ("pipeline stall" or "pipeline bubbling"), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).

32 Hazards and stalls
The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it must wait until after the WB of Ii-1.
Figure: pipeline timing diagram over Clk 1 to Clk 12 for instructions Ii-3, Ii-2, Ii-1, Ii and Ii+1; S marks a stall cycle.
Stall: the clock signal for Ii, Ii+1, etc. is blocked for three periods.
  T5 = 8 * CLK = (5 + 3) * CLK      (5 instructions, 3 stalls)
  T5 = 5 * (1 + 3/5) * CLK
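The two expressions for T5 are the same quantity written as "instructions plus stalls" and as "instructions times effective CPI". A short sketch with exact fractions, ignoring the pipeline fill time as the slide does; the function names are hypothetical.

```python
from fractions import Fraction


def execution_cycles(n_instructions: int, n_stalls: int) -> int:
    """Clock cycles spent: one per instruction (ideal CPI = 1) plus the stall cycles."""
    return n_instructions + n_stalls


def effective_cpi(n_instructions: int, n_stalls: int) -> Fraction:
    """Average clocks per instruction once stalls are included: 1 + stalls / instructions."""
    return 1 + Fraction(n_stalls, n_instructions)


cycles = execution_cycles(5, 3)
cpi = effective_cpi(5, 3)
```

Multiplying the effective CPI by the instruction count gives back the total number of cycles, which is exactly the second form of T5.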

33 Forwarding
Forwarding allows eliminating almost all RAW hazards of the pipeline without stalling the pipeline. (NOTE: in DLX, registers are modified only in the WB stage; data are read from the registers in the ID stage.)
  ADD R3, R1, R4
  SUB R7, R3, R5       hazard
  OR  R1, R3, R5       hazard
  LW  R6, 100(R3)      hazard
  AND R9, R5, R3       no hazard
For the LW, too, the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID!).
Figure: pipeline timing diagram (Clk 1 to Clk 9) for the five instructions.
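The hazard column of the example can be derived mechanically. The sketch below assumes the timing stated on the slide (write at the end of WB, read in ID), which makes any of the three instructions following a producer a potential RAW victim. It is a simplification: it reports every producer inside the window rather than only the nearest one, and the tuple layout and function name are hypothetical.

```python
def raw_hazards(program: list, window: int = 3) -> list:
    """Find RAW hazards without forwarding.

    program: list of (text, dest_reg, source_regs) tuples.
    Returns (consumer_index, producer_index, register) triples.
    """
    hazards = []
    for i, (_, _, sources) in enumerate(program):
        for j in range(max(0, i - window), i):
            dest = program[j][1]
            # R0 is fixed at 0, so writing it never creates a dependency.
            if dest is not None and dest != 0 and dest in sources:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("ADD R3, R1, R4", 3, (1, 4)),
    ("SUB R7, R3, R5", 7, (3, 5)),
    ("OR R1, R3, R5", 1, (3, 5)),
    ("LW R6, 100(R3)", 6, (3,)),
    ("AND R9, R5, R3", 9, (5, 3)),
]
found = raw_hazards(program)
```

SUB, OR and LW all depend on the R3 written by ADD, while AND is far enough away to read the updated register file, matching the annotations on the slide.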

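34 Forwarding implementation
Figure: the forwarding unit FU1 sits around the EX stage. Bypass MUXes in front of the ALU inputs (A, B, Offset) select between the ID/EX values and the results held in EX/MEM (ALU output, IR3) and MEM/WB (memory or ALU data, IR4).
- FU1 inputs: RS1/RS2 and OPCODE from ID/EX; RD1 (destination register/OpCode) from EX/MEM; RD2/OpCode from MEM/WB.
- FU1 compares RS1 and RS2 with RD1 and RD2, together with the op-codes, and drives the bypass MUXes.
- A further MUX on the RF outputs allows "anticipating" the register value into ID/EX. Its control comes from the IF/ID op-code and the comparison of RD with RS1 and RS2. This bypass is often performed inside the RF.
- Alternatively: SPLIT-CYCLE (see next slide), write before read.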

35 Data hazard due to LOAD instructions
  LW  R1, 32(R6)
  ADD R4, R1, R7
  SUB R5, R1, R8
  AND R6, R1, R7
NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXes between memory and ALU: delays!).
The pipeline needs to be stalled:
  LW  R1, 32(R6)    IF ID EX MEM WB
  ADD R4, R1, R7       IF ID S  EX  MEM
  SUB R5, R1, R8          IF S  ID  EX
  AND R6, R1, R7             S  IF  ID
Actually the clock signal is not generated. The clock block is propagated along the pipeline one stage at a time. From the end of the stalled stage onwards: standard forwarding.
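A load-use check can be sketched as follows. It assumes full forwarding is present, so the only remaining data stall is a load immediately followed by an instruction that reads the loaded register (one stall cycle each). The tuple layout, the set of load mnemonics and the function name are assumptions made for the example.

```python
LOADS = {"LW", "LH", "LHU", "LB", "LBU"}


def load_use_stalls(program: list) -> int:
    """Count one-cycle stalls caused by a load followed directly by a consumer.

    program: list of (mnemonic, dest_reg, source_regs) tuples.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        mnemonic, dest, _ = previous
        if mnemonic in LOADS and dest in current[2]:
            stalls += 1
    return stalls


program = [
    ("LW", 1, (6,)),      # LW  R1, 32(R6)
    ("ADD", 4, (1, 7)),   # ADD R4, R1, R7  (needs R1 one cycle too early)
    ("SUB", 5, (1, 8)),   # SUB R5, R1, R8
    ("AND", 6, (1, 7)),   # AND R6, R1, R7
]
stalls = load_use_stalls(program)
```

Only the ADD is counted: SUB and AND also read R1, but by the time they reach EX the loaded value can be forwarded normally.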

36 Delayed load
In many RISC CPUs, the hazard associated with the LOAD instruction is not handled in hardware by stalling the pipeline but in software by the compiler (delayed load):
  LOAD instruction
  delay slot
  Next instruction
The compiler tries to fill the delay slot with a "useful" instruction (worst case: NOP).
Example sequence (shown twice on the slide):
  LW  R1, 32(R6)
  LW  R3, 10(R4)
  ADD R5, R1, R3
  LW  R6, 20(R7)
  LW  R8, 40(R9)

37 Control Hazards
  PC      BEQZ R4, 200
  PC+4    SUB R7, R3, R5
  PC+8    OR R1, R3, R5
  PC+12   LW R6, 100(R8)
  BTA     AND R9, R5, R3
Next instruction address:
  R4 = 0:  Branch Target Address (taken)
  R4 != 0: PC + 4 (not taken)
Figure: pipeline timing diagram (Clk 1 to Clk 8). The new PC value is computed by the ALU (Aluout), is in the PC one clock later, and only then is the fetch with the new PC performed; meanwhile SUB, OR and LW have entered the pipeline.

38 DLX Pipelined Datapath (Branch or JMP)
Figure: the datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write Back) executing BEQZ R4, 200, with the feedback path from EX/MEM to the PC MUX.
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, only two stalls would be required, at the price of a slower clock!

39 Handling the Control Hazards
Example: BEQZ R4, 200 followed by SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100(R8).
Always Stall: after the branch the pipeline is stalled (a three-clock block is propagated) until the fetch at the new PC. In the real situation the instruction after the branch is fetched once anyway, because during that IF the previous instruction has not yet been decoded; that IF is then repeated (PC <= PC - 4).
Predict Not Taken: the following instructions are fetched normally. At branch completion the new value is sampled by the PC and, if the branch is taken, the instructions already started are flushed: they become NOPs. This causes no problem because none of them is in the WB stage.
Figure: two pipeline timing diagrams (Clk 1 to Clk 8), one per strategy.

40 Stalls with jumps (1/3)
Figure: the pipelined datapath (IF, ID, EX, MEM, WB) with forced NOPs injected into IF/ID, ID/EX and EX/MEM on a jump.
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.
Three NOPs MUST replace the 3 unwanted instructions already started.

41 Stalls with jump (2/3)
Figure: the pipelined datapath (IF, ID, EX, MEM, WB) with forced NOPs injected into two pipeline registers on a jump.
NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval.
Two NOPs MUST replace the 2 unwanted instructions already started.

42 Stalls with jump (3/3)
Figure: the pipelined datapath (IF, ID, EX, MEM, WB) with a single forced NOP on a jump.
NOTE: in this case the jump condition and the new PC act on the PC MUX in the same period in which the condition is detected.
A NOP MUST replace the unwanted instruction already started.
Very slow solution!

43 Delayed branch
Similarly to the LOAD case, in several RISC CPUs the BRANCH instruction hazard is handled in software by the compiler (delayed branch):
  BRANCH instruction
  delay slot
  delay slot
  Next instruction
The compiler tries to fill the delay slots with "useful" instructions (worst case: NOP).

44 Delayed branch/jump
Original:
  Add R5, R4, R3
  Sub R6, R5, R2
  Or  R14, R6, R21
  Sne R1, R8, R9    ; branch condition
  Br  R1, +100
Compiled:
  Sne R1, R8, R9    ; branch condition
  Br  R1, +100
  Add R5, R4, R3    ; these three instructions are executed in both cases
  Sub R6, R5, R2
  Or  R14, R6, R21
Obviously there must be no jumps in this group of instructions!
Instead of one or more "postponed" instructions, the compiler inserts NOPs when no suitable instructions are available.

45 Independent adder for BRANCH/JMP
To reduce the number of stalls, an additional full adder (independent of the ALU) is used in the ID stage:
  IF:  IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
  ID:  A <= RS1; B <= RS2; PC2 <= PC1; ID/EX <= Decode; ID/EX <= Opc ext.
       BTA <= PC1 + (IR[15])^16 ## IR[15..0]   (Branch), or PC1 + (IR[25])^6 ## IR[25..0]   (JMP)
       if Branch: if (RS1 op 0) PC <= BTA
       if JMP: always PC <= BTA
  EX, MEM, WB: as before; the new fetch costs only one stall.
NOTE: in this case there is only one "stall", since the new value is inserted in the PC on the positive clock edge that ends the ID stage, while in the previous case it was inserted after the MEM stage, that is, two clocks later!

46 BRANCH/JMP: 1 stall
Figure: IF and ID stages with the extra ADDER. The PC MUX selects between the standard increment (PC + 4) and the output of the adder, which sums the PC of the branch instruction (PC1) and the sign-extended branch offset (SE ##), i.e. the displacement of the branch instruction. The =0? test on the register is used for branches.
The source of the next PC is selected according to the op-code and the value of the branch test register.
NOTE: for "unconditional jump" instructions the situation is similar: we only need to provide further inputs to the MUXes of the PC, considering either the RS1 register (JR and JLR) or the 26 least significant bits of the IR with sign extension (J and JL), to be added to the PC of the instruction (not the current PC).

47 Handling the Control Hazards
Dynamic prediction: Branch Target Buffer => no stall (almost...)
Figure: a table with columns TAGS, Predicted PC and T/NT; the current PC is compared (=) with the tags.
  HIT:  fetch with the predicted PC
  MISS: fetch with PC + 4
Correct prediction: no stalls.
Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before).
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.

48 Prediction Buffer
The simplest implementation uses a single bit that indicates what happened the last time the branch occurred. When one outcome is predominant, each occurrence of the opposite outcome causes two consecutive errors.
Example: Loop2 nested inside Loop1. When the program ends Loop2 the prediction fails (the branch is predicted as taken but it is actually untaken); it then fails again when it predicts untaken while the program enters Loop2 once more.

49 Usually two bits are used.
Figure: state diagram of the two-bit predictor, with transitions labelled TAKEN and UNTAKEN between its four states.
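The benefit of the second bit can be seen by simulating the nested-loop case. The sketch assumes a saturating-counter scheme, which is one common two-bit variant; the exact state diagram of the slide is not recoverable from the transcript, so this may differ in detail. The function name and the initial "taken" state are also assumptions.

```python
def mispredictions(outcomes: list, bits: int) -> int:
    """Count wrong predictions of an n-bit saturating-counter predictor.

    Assumption: the counter starts in the strongest 'taken' state.
    """
    top = (1 << bits) - 1           # highest counter value
    threshold = 1 << (bits - 1)     # counter >= threshold means 'predict taken'
    counter = top
    misses = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            misses += 1
        # Move towards 'taken' or 'untaken', saturating at both ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return misses


# Inner-loop branch: taken 3 times, then untaken at loop exit; the loop runs 4 times.
outcomes = ([True] * 3 + [False]) * 4
one_bit = mispredictions(outcomes, bits=1)
two_bit = mispredictions(outcomes, bits=2)
```

With one bit, every loop exit after the first costs two errors (the exit itself and the next entry); with two bits a single untaken outcome does not flip the prediction, so only the exit is mispredicted.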

