Computer Architecture and Parallel Computing 并行结构与计算 Lecture 2 - Pipelining Peng Liu Dept. of Info. Sci. & Elec. Engg. Zhejiang University


1 Computer Architecture and Parallel Computing 并行结构与计算 Lecture 2 - Pipelining Peng Liu Dept. of Info. Sci. & Elec. Engg. Zhejiang University liupeng@zju.edu.cn May 12, 2014

2 Microcoding became less attractive as the gap between RAM and ROM speeds narrowed
Complex instruction sets were difficult to pipeline, so it was difficult to increase performance as gate counts grew
The Iron Law explains the architecture design space
–Trade off instructions/program, cycles/instruction, and time/cycle
Load-store RISC ISAs were designed for efficient pipelined implementations
–Very similar to vertical microcode
–Inspired by earlier Cray machines (more on these later)

3 An Ideal Pipeline
All objects go through the same stages
No sharing of resources between any two stages
Propagation delay through all pipeline stages is equal
The scheduling of an object entering the pipeline is not affected by the objects in other stages
[Figure: stage 1 → stage 2 → stage 3 → stage 4]
These conditions generally hold for industrial assembly lines, but instructions depend on each other!

4 Pipelined MIPS
To pipeline MIPS:
First, build MIPS without pipelining, with CPI = 1
Next, add pipeline registers to reduce cycle time while maintaining CPI = 1

5 Unpipelined Datapath for MIPS
[Figure: single-cycle datapath with PC, instruction memory, GPRs, immediate extender, ALU, and data memory; control signals RegWrite, MemWrite, WBSrc, RegDst, BSrc, ExtSel, OpSel, and PCSrc (pc+4 / br / rind / jabs)]

6 Hardwired Control Table
Opcode     ExtSel  BSrc  OpSel  MemW  RegW  WBSrc  RegDst  PCSrc
ALU        *       Reg   Func   no    yes   ALU    rd      pc+4
ALUi       sExt16  Imm   Op     no    yes   ALU    rt      pc+4
ALUiu      uExt16  Imm   Op     no    yes   ALU    rt      pc+4
LW         sExt16  Imm   +      no    yes   Mem    rt      pc+4
SW         sExt16  Imm   +      yes   no    *      *       pc+4
BEQZ z=0   sExt16  *     0?     no    no    *      *       pc+4
BEQZ z=1   sExt16  *     0?     no    no    *      *       br
J          *       *     *      no    no    *      *       jabs
JAL        *       *     *      no    yes   PC     R31     jabs
JR         *       *     *      no    no    *      *       rind
JALR       *       *     *      no    yes   PC     R31     rind
BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs

7 Pipelined Datapath
[Figure: datapath divided into five phases: fetch, decode & register fetch, execute, memory, and write-back]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
t_C > max {t_IM, t_RF, t_ALU, t_DM, t_RW} (= t_DM, probably)
However, CPI will increase unless instructions are pipelined
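The clock-period bound on this slide can be illustrated with a minimal sketch. The delay values below are hypothetical (the lecture gives no numbers), and the function names are made up for this example; the point is only that one phase per cycle lets the slowest phase, rather than the sum of all phases, set the clock.

```python
# Hypothetical phase delays in nanoseconds; the lecture gives no numeric values.
PHASE_DELAY_NS = {"t_IM": 2.0, "t_RF": 1.0, "t_ALU": 2.0, "t_DM": 2.5, "t_RW": 1.0}

def single_cycle_period(delays):
    """One long cycle must cover every phase back to back."""
    return sum(delays.values())

def multi_cycle_period(delays):
    """With one phase per cycle, the slowest phase bounds the clock period."""
    return max(delays.values())

print("single-cycle t_C >", single_cycle_period(PHASE_DELAY_NS), "ns")
print("per-phase    t_C >", multi_cycle_period(PHASE_DELAY_NS), "ns")
```

With these assumed delays the data memory is the slowest phase, matching the slide's remark that the bound is probably t_DM. Pipeline register overhead is ignored here.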

8 “Iron Law” of Processor Performance
Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)
–Instructions per program depends on source code, compiler technology, and ISA
–Cycles per instruction (CPI) depends on the ISA and the microarchitecture
–Time per cycle depends on the microarchitecture and the base technology
Microarchitecture          CPI   Cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
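As a rough illustration of the Iron Law, the sketch below multiplies the three factors for the three microarchitectures in the table. The instruction count, CPI values, and cycle times are hypothetical, chosen only to mirror the table's ">1 / 1" and "short / long" entries; `time_per_program` is a name made up for this example.

```python
def time_per_program(instructions, cpi, cycle_time_ns):
    """Iron Law: instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical design points (not measurements from the lecture).
DESIGNS = {
    "microcoded": {"cpi": 7.0, "cycle_time_ns": 2.0},
    "single-cycle unpipelined": {"cpi": 1.0, "cycle_time_ns": 10.0},
    "pipelined": {"cpi": 1.0, "cycle_time_ns": 2.0},
}

INSTRUCTIONS = 1_000_000  # same program on every design

for name, design in DESIGNS.items():
    total_ns = time_per_program(INSTRUCTIONS, design["cpi"], design["cycle_time_ns"])
    print(f"{name}: {total_ns} ns")
```

Under these assumptions the pipelined design wins because it combines the low CPI of the single-cycle design with the short cycle of the microcoded one.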

9 CPI Examples
Microcoded machine: the three instructions take 7, 5, and 10 cycles; 3 instructions, 22 cycles, CPI = 7.33
Unpipelined machine: 3 instructions, 3 cycles, CPI = 1
Pipelined machine: 3 instructions, 3 cycles, CPI = 1

10 Technology Assumptions
A small amount of very fast memory (caches) backed up by a large, slower memory
Fast ALU (at least for integers)
Multiported register files (slower!)
Thus, the following timing assumption is reasonable:
t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW
A 5-stage pipeline will be the focus of our detailed design; some commercial designs have over 30 pipeline stages to do an integer add!

11 5-Stage Pipelined Execution
Stages: I-Fetch (IF), Decode and Reg. Fetch (ID), Execute (EX), Memory (MA), Write-Back (WB)
time          t0   t1   t2   t3   t4   t5   t6   t7   ...
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
[Figure: datapath with PC, instruction memory, IR, GPRs, immediate extender, ALU, and data memory, divided into the five stages]

12 5-Stage Pipelined Execution: Resource Usage Diagram
Stages: I-Fetch (IF), Decode and Reg. Fetch (ID), Execute (EX), Memory (MA), Write-Back (WB)
time  t0  t1  t2  t3  t4  t5  t6  t7  ...
IF    I1  I2  I3  I4  I5
ID        I1  I2  I3  I4  I5
EX            I1  I2  I3  I4  I5
MA                I1  I2  I3  I4  I5
WB                    I1  I2  I3  I4  I5
[Figure: datapath with PC, instruction memory, IR, GPRs, immediate extender, ALU, and data memory, divided into the five stages]
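The resource usage diagram follows a simple rule in the hazard-free case: instruction i occupies stage s during cycle i + s. The sketch below generates the diagram from that rule; `resource_usage` and `format_table` are names invented for this illustration, and no stalls or kills are modelled.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def resource_usage(num_instructions, stages=STAGES):
    """Map each stage to a per-cycle list of instruction numbers (None = idle)."""
    total_cycles = num_instructions + len(stages) - 1
    table = {}
    for s, stage in enumerate(stages):
        row = [None] * total_cycles
        for i in range(num_instructions):
            row[i + s] = i + 1  # instruction i+1 is in stage s at cycle i+s
        table[stage] = row
    return table

def format_table(table):
    """Render the table as text, one row per stage."""
    lines = []
    for stage, row in table.items():
        cells = [f"I{n}" if n is not None else "  " for n in row]
        lines.append(f"{stage}  " + "  ".join(cells))
    return "\n".join(lines)

print(format_table(resource_usage(5)))
```

Five instructions finish in 5 + 5 - 1 = 9 cycles, so once the pipeline is full one instruction completes per cycle.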

13 Pipelined Execution: ALU Instructions
[Figure: pipelined datapath with pipeline registers PC, IR, A, B, Y, R, MD1, and MD2 between the stages]
Not quite correct! We need an Instruction Register (IR) for each stage

14 Pipelined MIPS Datapath without jumps
[Figure: pipelined datapath with stages F, D, E, M, W and an IR for each stage; control signals ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, and RegWrite]
Control Points Need to Be Connected

15 Instructions interact with each other in the pipeline
An instruction in the pipeline may need a resource being used by another instruction in the pipeline ⇒ structural hazard
An instruction may depend on something produced by an earlier instruction
–Dependence may be for a data value ⇒ data hazard
–Dependence may be for the next instruction’s address ⇒ control hazard (branches, exceptions)

16 Resolving Structural Hazards
A structural hazard occurs when two instructions need the same hardware resource at the same time
–Can resolve in hardware by stalling the newer instruction until the older instruction has finished with the resource
A structural hazard can always be avoided by adding more hardware to the design
–E.g., if two instructions both need a port to memory at the same time, could avoid the hazard by adding a second port to memory
Our 5-stage pipeline has no structural hazards by design
–Thanks to the MIPS ISA, which was designed for pipelining

17 Data Hazards
...
r1 ← r0 + 10
r4 ← r1 + 17
...
[Figure: pipelined datapath with "r1 ← …" in the execute stage and "r4 ← r1 …" in the decode stage]
r1 is stale. Oops!

18 Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages ⇒ interlocks

19 Feedback to Resolve Hazards
[Figure: stage 1 → stage 2 → stage 3 → stage 4, with feedback blocks FB1 to FB4]
Later stages provide dependence information to earlier stages, which can stall (or kill) instructions
Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)

20 Interlocks to resolve Data Hazards
...
r1 ← r0 + 10
r4 ← r1 + 17
...
[Figure: pipelined datapath with a Stall Condition block and a nop input to the execute-stage IR]

21 Stalled Stages and Pipeline Bubbles
time                 t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← (r0) + 10  IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17       IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                           IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                               IF4  ID4  EX4  MA4  WB4
(I5)                                                    IF5  ID5  EX5  MA5  WB5
The repeated ID2 and IF3 entries are stalled stages.
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5
nop ⇒ pipeline bubble
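The stalled-stage pattern above can be reproduced with a small scheduling sketch. It assumes an in-order pipeline in which an instruction enters IF when its predecessor enters ID, and enters ID when its predecessor enters EX; the `schedule` function and its `stall_cycles` argument (extra cycles each instruction waits in ID) are hypothetical names for this illustration.

```python
def schedule(stall_cycles):
    """Return, per instruction, the cycles it spends in each of the five stages.

    stall_cycles[i] is the number of extra cycles instruction i is held in ID
    by the interlock (an input to this sketch, not derived from register use).
    """
    timeline = []
    prev_id = prev_ex = None
    for stalls in stall_cycles:
        # IF becomes free when the previous instruction moves on to ID.
        if_start = 0 if prev_id is None else prev_id
        # ID becomes free when the previous instruction moves on to EX.
        id_start = if_start + 1 if prev_ex is None else max(if_start + 1, prev_ex)
        ex_start = id_start + 1 + stalls
        timeline.append({
            "IF": list(range(if_start, id_start)),
            "ID": list(range(id_start, ex_start)),
            "EX": [ex_start],
            "MA": [ex_start + 1],
            "WB": [ex_start + 2],
        })
        prev_id, prev_ex = id_start, ex_start
    return timeline

# I2 waits three extra cycles in ID for I1 to write back r1.
for number, stages in enumerate(schedule([0, 3, 0, 0, 0]), start=1):
    print(f"I{number}", stages)
```

Passing three stall cycles for I2 yields ID2 in cycles t2 to t5 and IF3 in cycles t2 to t5, as in the diagram.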

22 Interlock Control Logic
[Figure: pipelined datapath with a C_stall block that compares rs and rt of the decode-stage instruction against ws of the later stages and drives stall]
Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

23 Interlock Control Logic (ignoring jumps & branches)
[Figure: pipelined datapath with C_dest blocks producing ws and we for each later stage, a C_re block producing re1 and re2, and C_stall producing stall]
Should we always stall if the rs field matches some rd?
–Not every instruction writes a register ⇒ we
–Not every instruction reads a register ⇒ re

24 Source & Destination Registers
R-type: op rs rt rd func
I-type: op rs rt immediate16
J-type: op immediate26
Instruction  Operation                              Source(s)  Destination
ALU          rd ← (rs) func (rt)                    rs, rt     rd
ALUi         rt ← (rs) op imm                       rs         rt
LW           rt ← M[(rs) + imm]                     rs         rt
SW           M[(rs) + imm] ← (rt)                   rs, rt
BZ           cond (rs)  true:  PC ← (PC) + imm      rs
                        false: PC ← (PC) + 4        rs
J            PC ← (PC) + imm
JAL          r31 ← (PC), PC ← (PC) + imm                       31
JR           PC ← (rs)                              rs
JALR         r31 ← (PC), PC ← (rs)                  rs         31

25 Deriving the Stall Signal
C_dest:
ws = Case opcode
  ALU → rd
  ALUi, LW → rt
  JAL, JALR → R31
we = Case opcode
  ALU, ALUi, LW → (ws ≠ 0)
  JAL, JALR → on
  ... → off
C_re:
re1 = Case opcode
  ALU, ALUi, LW, SW, BZ, JR, JALR → on
  J, JAL → off
re2 = Case opcode
  ALU, SW → on
  ... → off
C_stall:
stall = ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
      + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D
This is not the full story!
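The stall equation can be written out as a short predicate. This is a sketch only: the `Dest` record and the argument names are invented for this example, and the inputs (ws, we, re1, re2) are assumed to have been produced already by the C_dest and C_re logic on the slide.

```python
from dataclasses import dataclass

@dataclass
class Dest:
    """Destination info (outputs of C_dest) for the instruction in one later stage."""
    ws: int    # register number the instruction will write
    we: bool   # True if the instruction writes a register at all

def stall(rs_d, rt_d, re1_d, re2_d, e, m, w):
    """Stall if a source read in decode matches a pending write in E, M, or W."""
    later = (e, m, w)
    rs_match = any(rs_d == s.ws and s.we for s in later)
    rt_match = any(rt_d == s.ws and s.we for s in later)
    return (rs_match and re1_d) or (rt_match and re2_d)

IDLE = Dest(ws=0, we=False)
# I1: r1 <- r0 + 10 is in E; I2: r4 <- r1 + 17 is in D (ALUi reads rs only).
print(stall(rs_d=1, rt_d=4, re1_d=True, re2_d=False, e=Dest(1, True), m=IDLE, w=IDLE))
```

The re1 and re2 terms matter: without them, an instruction whose rt field happens to equal a pending destination would stall even though it never reads rt.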

26 Hazards due to Loads & Stores
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
[Figure: pipelined datapath with the Stall Condition block]
Is there any possible data hazard in this instruction sequence?
What if (r1)+7 = (r3)+5?

27 Load & Store Hazards
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
(r1)+7 = (r3)+5 ⇒ data hazard
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself.
More on this later in the course.

28 Resolving Data Hazards (2)
Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage ⇒ bypass

29 Bypassing
Each stall or kill introduces a bubble in the pipeline ⇒ CPI > 1
With interlocks:
time               t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← r0 + 10  IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17       IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                         IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                             IF4  ID4  EX4
(I5)                                                  IF5  ID5
The repeated ID2 and IF3 entries are stalled stages.
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input
With the bypass:
time               t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← r0 + 10  IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17       IF2  ID2  EX2  MA2  WB2
(I3)                         IF3  ID3  EX3  MA3  WB3
(I4)                              IF4  ID4  EX4  MA4  WB4
(I5)                                   IF5  ID5  EX5  MA5  WB5

30 Adding a Bypass
[Figure: pipelined datapath with an ASrc mux that routes the ALU output back to the ALU's A input; (I1) r1 ← r0 + 10 is in the execute stage and (I2) r4 ← r1 + 17 is in the decode stage]
When does this bypass help?
r1 ← r0 + 10; r4 ← r1 + 17: yes
r1 ← M[r0 + 10]; r4 ← r1 + 17: no
JAL 500; r4 ← r31 + 17: no

31 The Bypass Signal: Deriving it from the Stall Signal
stall = ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
      + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D
ws = Case opcode
  ALU → rd
  ALUi, LW → rt
  JAL, JALR → R31
we = Case opcode
  ALU, ALUi, LW → (ws ≠ 0)
  JAL, JALR → on
  ... → off
ASrc = (rs_D = ws_E).we_E.re1_D
Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass
Split we_E into two components: we-bypass, we-stall

32 Bypass and Stall Signals
Split we_E into two components: we-bypass, we-stall
we-bypass_E = Case opcode_E
  ALU, ALUi → (ws ≠ 0)
  ... → off
we-stall_E = Case opcode_E
  LW → (ws ≠ 0)
  JAL, JALR → on
  ... → off
ASrc = (rs_D = ws_E).we-bypass_E.re1_D
stall = ((rs_D = ws_E).we-stall_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
      + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D
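The split of we_E into we-bypass and we-stall can be sketched as three small predicates. Opcode classes are represented as plain strings ("ALU", "ALUi", "LW", "JAL", "JALR"), which is an assumption of this example, as are the function names.

```python
def we_bypass(opcode_e, ws_e):
    """Execute-stage result is available at the ALU output: ALU and ALUi only."""
    return opcode_e in ("ALU", "ALUi") and ws_e != 0

def we_stall(opcode_e, ws_e):
    """Execute-stage instruction writes a register, but this bypass cannot supply it."""
    if opcode_e == "LW":
        return ws_e != 0
    return opcode_e in ("JAL", "JALR")

def a_src_bypass(rs_d, re1_d, opcode_e, ws_e):
    """ASrc selects the bypass when decode reads rs and E is producing it via the ALU."""
    return rs_d == ws_e and we_bypass(opcode_e, ws_e) and re1_d

# r1 <- r0 + 10 in E, r4 <- r1 + 17 in D: the bypass applies.
print(a_src_bypass(rs_d=1, re1_d=True, opcode_e="ALUi", ws_e=1))
# r1 <- M[r0 + 10] in E: the value is not at the ALU output, so stall instead.
print(a_src_bypass(rs_d=1, re1_d=True, opcode_e="LW", ws_e=1), we_stall("LW", 1))
```

For any execute-stage instruction at most one of the two predicates is true, so a matching rs either takes the bypass or contributes to the stall, never both.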

33 Fully Bypassed Datapath
[Figure: pipelined datapath with ASrc and BSrc bypass muxes fed from the E, M, and W stages, plus a path carrying PC for JAL, ...]
Is there still a need for the stall signal?
stall = (rs_D = ws_E).(opcode_E = LW).(ws_E ≠ 0).re1_D
      + (rt_D = ws_E).(opcode_E = LW).(ws_E ≠ 0).re2_D
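With full bypassing, the only remaining stall is the load-use case. The sketch below encodes the equation directly; opcodes are again represented as strings and `load_use_stall` is a name made up for this example.

```python
def load_use_stall(rs_d, rt_d, re1_d, re2_d, opcode_e, ws_e):
    """Stall only when decode reads a register that a load in E has not yet fetched."""
    load_pending = opcode_e == "LW" and ws_e != 0
    reads_rs = rs_d == ws_e and re1_d
    reads_rt = rt_d == ws_e and re2_d
    return load_pending and (reads_rs or reads_rt)

# r1 <- M[r0 + 10] in E, r4 <- r1 + 17 in D: one stall cycle is still required.
print(load_use_stall(rs_d=1, rt_d=4, re1_d=True, re2_d=False, opcode_e="LW", ws_e=1))
# r1 <- r0 + 10 in E: the bypass covers it, no stall.
print(load_use_stall(rs_d=1, rt_d=4, re1_d=True, re2_d=False, opcode_e="ALUi", ws_e=1))
```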

34 Resolving Data Hazards (3)
Strategy 3: Speculate on the dependence. Two cases:
Guessed correctly ⇒ do nothing
Guessed incorrectly ⇒ kill and restart
We’ll later see examples of this approach in more complex processors.

35 Control Hazards
What do we need to calculate the next PC?
–For Jumps
»Opcode, offset, and PC
–For Jump Register
»Opcode and register value
–For Conditional Branches
»Opcode, PC, register (for condition), and offset
–For all other instructions
»Opcode and PC (have to know it’s not one of the above!)

36 Opcode Decoding Bubble (assuming no branch delay slots for now)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 ← (r0) + 10  IF1  ID1  EX1  MA1  WB1
(I2) r3 ← (r2) + 17       IF2  IF2  ID2  EX2  MA2  WB2
(I3)                                IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                          IF4  IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   nop  I2   nop  I3   nop  I4
ID         I1   nop  I2   nop  I3   nop  I4
EX              I1   nop  I2   nop  I3   nop  I4
MA                   I1   nop  I2   nop  I3   nop  I4
WB                        I1   nop  I2   nop  I3   nop  I4
nop ⇒ pipeline bubble

37 Speculate next address is PC+4
I1  096  ADD
I2  100  J 304
I3  104  ADD   (kill)
I4  304  ADD
[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br) and a Jump? check; I2 in decode, I1 in execute, PC = 104]
A jump instruction kills (not stalls) the following instruction. How?

38 Pipelining Jumps
I1  096  ADD
I2  100  J 304
I3  104  ADD   (kill)
I4  304  ADD
[Figure: same datapath with an IRSrc_D mux in front of the decode-stage IR; after the jump, PC = 304 and a nop follows I2 down the pipeline]
To kill a fetched instruction: insert a mux before IR
IRSrc_D = Case opcode_D
  J, JAL → nop
  ... → IM
Any interaction between stall and jump?

39 Jump Pipeline Diagrams
time             t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096: ADD    IF1  ID1  EX1  MA1  WB1
(I2) 100: J 304       IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD              IF3  nop  nop  nop  nop
(I4) 304: ADD                   IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5
nop ⇒ pipeline bubble

40 Pipelining Conditional Branches
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108  …
I4  304  ADD
[Figure: fetch, decode, and execute stages with the PCSrc mux (pc+4 / jabs / rind / br), the IRSrc_D mux, and the ALU zero? output; BEQZ (I2) in decode, I1 in execute, PC = 104]
Branch condition is not known until the execute stage. What action should be taken in the decode stage?

41 Pipelining Conditional Branches
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108  …
I4  304  ADD
[Figure: same datapath one cycle later; BEQZ (I2) in execute, I3 in decode, PC = 108]
If the branch is taken
–kill the two following instructions
–the instruction at the decode stage is not valid ⇒ stall signal is not valid

42 Pipelining Conditional Branches
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
    108  …
I4  304  ADD
[Figure: same datapath with both IRSrc_D and IRSrc_E muxes; BEQZ (I2) in execute, I3 in decode, PC = 108]
If the branch is taken
–kill the two following instructions
–the instruction at the decode stage is not valid ⇒ stall signal is not valid

43 New Stall Signal
stall = ( ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
        + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D )
        . !((opcode_E = BEQZ).z + (opcode_E = BNEZ).!z)
Don’t stall if the branch is taken. Why? The instruction at the decode stage is invalid

44 Control Equations for PC and IR Muxes
PCSrc = Case opcode_E
  BEQZ.z, BNEZ.!z → br
  ... → Case opcode_D
          J, JAL → jabs
          JR, JALR → rind
          ... → pc+4
IRSrc_D = Case opcode_E
  BEQZ.z, BNEZ.!z → nop
  ... → Case opcode_D
          J, JAL, JR, JALR → nop
          ... → IM
IRSrc_E = Case opcode_E
  BEQZ.z, BNEZ.!z → nop
  ... → stall.nop + !stall.IR_D
Give priority to the older instruction, i.e., the execute-stage instruction over the decode-stage instruction
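The three mux equations can be sketched as nested conditionals, with the execute-stage check first so that the older instruction takes priority. Opcodes and mux selections are represented as strings, which is an assumption of this example; the function names are invented for illustration.

```python
def branch_taken(opcode_e, z):
    """True when the execute-stage instruction is a taken conditional branch."""
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)

def pc_src(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):          # older instruction wins
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"

def ir_src_d(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):
        return "nop"                       # kill the instruction just fetched
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"                       # kill the instruction after a jump
    return "IM"

def ir_src_e(opcode_e, z, stall):
    if branch_taken(opcode_e, z):
        return "nop"                       # kill the instruction in decode
    return "nop" if stall else "IR_D"

# Taken BEQZ in execute with a jump behind it in decode: the branch wins.
print(pc_src("BEQZ", True, "J"), ir_src_d("BEQZ", True, "J"), ir_src_e("BEQZ", True, False))
```

The last call shows why the priority matters: the jump in decode is itself on the wrong path, so it must be killed rather than allowed to redirect the PC.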

45 Branch Pipeline Diagrams (resolved in execute stage)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ +200       IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  nop  nop  nop
(I4) 108:                           IF4  nop  nop  nop  nop
(I5) 304: ADD                            IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5
nop ⇒ pipeline bubble

46 Reducing Branch Penalty (resolve in decode stage)
One pipeline bubble can be removed if an extra comparator is used in the Decode stage
–But might elongate cycle time
[Figure: fetch and decode stages with zero detect on the register file output feeding the PCSrc mux (pc+4 / jabs / rind / br)]
Pipeline diagram now same as for jumps

47 Branch Delay Slots (expose control hazard to software)
Change the ISA semantics so that the instruction that follows a jump or branch is always executed
–gives the compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD   (delay slot instruction, executed regardless of branch outcome)
I4  304  ADD
Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later

48 Branch Pipeline Diagrams (branch delay slot)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ +200       IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  EX3  MA3  WB3
(I4) 304: ADD                       IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4
ID         I1   I2   I3   I4
EX              I1   I2   I3   I4
MA                   I1   I2   I3   I4
WB                        I1   I2   I3   I4

49 Why an Instruction may not be dispatched every cycle (CPI > 1)
Full bypassing may be too expensive to implement
–typically all frequently used paths are provided
–some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
Loads have two-cycle latency
–Instruction after load cannot use load result
–MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard)
»MIPS: “Microprocessor without Interlocked Pipeline Stages”
–Removed in MIPS-II (pipeline interlocks added in hardware)
Conditional branches may cause bubbles
–kill following instruction(s) if no delay slots

50 Iron Law with Software-Visible NOPs
Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)
If software has to insert NOP instructions for hazard avoidance, instructions/program increases
–average cycles/instruction decreases: doing nothing fast is easy!
But performance (time/program) is worse than or the same as if hardware instead uses interlocks to avoid the hazard
–Hardware-generated interlocks (bubbles) don’t change instructions/program, but only add to cycles/instruction
–Hardware interlocks don’t take space in the instruction cache
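A small numeric sketch makes the point concrete. The instruction and hazard counts are hypothetical, pipeline fill time is ignored, and every NOP or bubble is assumed to cost exactly one cycle.

```python
import math

def time_per_program(instructions, cpi, cycle_time_ns):
    """Iron Law: instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns

USEFUL = 100        # hypothetical number of useful instructions
HAZARD_SLOTS = 20   # hypothetical cycles that must be filled by a NOP or a bubble
CYCLE_NS = 1.0      # hypothetical cycle time

# Software NOPs: more instructions, each taking one cycle.
sw_instructions = USEFUL + HAZARD_SLOTS
sw_cpi = 1.0

# Hardware interlocks: same instruction count, bubbles raise the CPI.
hw_instructions = USEFUL
hw_cpi = (USEFUL + HAZARD_SLOTS) / USEFUL

sw_time = time_per_program(sw_instructions, sw_cpi, CYCLE_NS)
hw_time = time_per_program(hw_instructions, hw_cpi, CYCLE_NS)
print(f"software NOPs:       CPI={sw_cpi}, time={sw_time} ns")
print(f"hardware interlocks: CPI={hw_cpi}, time={hw_time} ns")
```

The NOP version reports the lower CPI, yet both versions take the same time, so CPI alone is a misleading comparison when instruction counts differ.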

51 Exceptions: altering the normal flow of control
[Figure: program instructions I(i-1), I(i), I(i+1), with control transferring to exception handler instructions HI1, HI2, ..., HIn]
An exception transfers control to special handler code run in privileged mode. Exceptions are usually unexpected or rare from the program’s point of view.

52 Causes of Exceptions
Exception: an event that requests the attention of the processor
Asynchronous: an external interrupt
–input/output device service request
–timer expiration
–power disruptions, hardware failure
Synchronous: an internal exception (a.k.a. traps)
–undefined opcode, privileged instruction
–arithmetic overflow, FPU exception
–misaligned memory access
–virtual memory exceptions: page faults, TLB misses, protection violations
–software exceptions: system calls, e.g., jumps into kernel

53 History of Exception Handling
First system with exceptions was Univac-I, 1951
–Arithmetic overflow would either
»1. trigger the execution of a two-instruction fix-up routine at address 0, or
»2. at the programmer's option, cause the computer to stop
–Later Univac 1103, 1955, modified to add external interrupts
»Used to gather real-time wind tunnel data
First system with I/O interrupts was DYSEAC, 1954
–Had two program counters, and an I/O signal caused a switch between the two PCs
–Also, first system with DMA (direct memory access by I/O device)
[Courtesy Mark Smotherman]

54 DYSEAC, first mobile computer!
Carried in two tractor trailers, 12 tons + 8 tons
Built for US Army Signal Corps
[Courtesy Mark Smotherman]

55 Asynchronous Interrupts: invoking the interrupt handler
An I/O device requests attention by asserting one of the prioritized interrupt request lines
When the processor decides to process the interrupt
–It stops the current program at instruction I(i), completing all the instructions up to I(i-1) (a precise interrupt)
–It saves the PC of instruction I(i) in a special register (EPC)
–It disables interrupts and transfers control to a designated interrupt handler running in kernel mode

56 MIPS Interrupt Handler Code
Saves EPC before re-enabling interrupts to allow nested interrupts
–need an instruction to move EPC into GPRs
–need a way to mask further interrupts at least until EPC can be saved
Needs to read a status register that indicates the cause of the interrupt
Uses a special indirect jump instruction RFE (return-from-exception) to resume user code. This:
–enables interrupts
–restores the processor to user mode
–restores hardware status and control state

57 Synchronous Exception
A synchronous exception is caused by a particular instruction
In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
–requires undoing the effect of one or more partially executed instructions
In the case of a system call trap, the instruction is considered to have been completed
–syscall is a special jump instruction involving a change to privileged kernel mode
–Handler resumes at the instruction after the system call

58 Exception Handling 5-Stage Pipeline
[Figure: 5-stage pipeline (PC, Inst. Mem, D, Decode, E, +, M, Data Mem, W) with exception sources: PC address exception, illegal opcode, overflow, data address exceptions, and asynchronous interrupts]
How to handle multiple simultaneous exceptions in different pipeline stages?
How and where to handle external asynchronous interrupts?

59 Exception Handling 5-Stage Pipeline
[Figure: same pipeline with exception flags and PCs carried along with each instruction (Exc_D/PC_D, Exc_E/PC_E, Exc_M/PC_M) to the commit point, where Cause and EPC are written; outputs are Select Handler PC, Kill F Stage, Kill D Stage, Kill E Stage, and Kill Writeback]

60 Exception Handling 5-Stage Pipeline
Hold exception flags in pipeline until commit point (M stage)
Exceptions in earlier pipe stages override later exceptions for a given instruction
Inject external interrupts at commit point (override others)
If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
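The commit-point rules can be sketched as a single decision function. Everything concrete here is an assumption made for illustration: the `HANDLER_PC` value, the stage letters used as dictionary keys, the cause strings, and the shape of the returned record are not taken from the lecture.

```python
HANDLER_PC = 0x80000080  # hypothetical handler address, for illustration only

STAGE_ORDER = ("F", "D", "E", "M")  # earliest pipeline stage first

def commit(exc_flags, interrupt_pending, pc_m):
    """Decide what happens when an instruction reaches the commit point (M stage).

    exc_flags maps the stage where an exception was detected to its cause, for
    the instruction now in M, e.g. {"E": "overflow"}. Returns None when the
    instruction commits normally.
    """
    if interrupt_pending:
        cause = "interrupt"  # external interrupts override other exceptions
    else:
        # The exception from the earliest pipeline stage wins.
        cause = next((exc_flags[s] for s in STAGE_ORDER if s in exc_flags), None)
    if cause is None:
        return None
    return {
        "cause": cause,                  # written to the Cause register
        "epc": pc_m,                     # written to the EPC register
        "kill_stages": ("F", "D", "E"),  # younger instructions are discarded
        "kill_writeback": True,          # the faulting instruction must not write state
        "next_pc": HANDLER_PC,           # injected into the fetch stage
    }

print(commit({"E": "overflow"}, interrupt_pending=False, pc_m=0x60))
```

Deferring the decision to one commit point is what keeps exceptions precise: an exception raised early by a younger instruction cannot overtake one raised later by an older instruction.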

61 Speculating on Exceptions
Prediction mechanism
–Exceptions are rare, so simply predicting no exceptions is very accurate!
Check prediction mechanism
–Exceptions detected at end of instruction execution pipeline, special hardware for various exception types
Recovery mechanism
–Only write architectural state at commit point, so can throw away partially executed instructions after exception
–Launch exception handler after flushing pipeline
Bypassing allows use of uncommitted instruction results by following instructions

62 Exception Pipeline Diagram
time                     t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) 096: ADD            IF1  ID1  EX1  MA1  nop                       overflow!
(I2) 100: XOR                 IF2  ID2  EX2  nop  nop
(I3) 104: SUB                      IF3  ID3  nop  nop  nop
(I4) 108: ADD                           IF4  nop  nop  nop  nop
(I5) Exc. Handler code                       IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   nop  nop  nop  I5
WB                        nop  nop  nop  nop  I5

63 Acknowledgements
UCB material derived from course CS152
Harvard University material derived from course CS246

64 Readings
Computer Architecture: A Quantitative Approach, 5th Edition (2012)
D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th Edition, 2013.

