
1 EE30332 Ch6 DP.1 Ch 6: Pipelining Modified from Dave Patterson’s notes  Laundry Example  Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold  Washer takes 30 minutes  Dryer takes 30 minutes  “Folder” takes 30 minutes  “Stasher” takes 30 minutes to put clothes into drawers  [Figure: the four loads A, B, C, D]

2 EE30332 Ch6 DP.2 Sequential Laundry  Sequential laundry takes 8 hours for 4 loads  If they learned pipelining, how long would laundry take? [Timing diagram: loads A–D in task order, 30-minute slots from 6 PM to 2 AM]

3 EE30332 Ch6 DP.3 Pipelined Laundry: Start work ASAP  Pipelined laundry takes 3.5 hours for 4 loads! [Timing diagram: loads A–D overlapped in 30-minute slots, 6 PM to 9:30 PM]
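The slide's numbers check out with a quick back-of-envelope calculation (a sketch; the four 30-minute stages and four loads are taken from the slides):

```python
# Four 30-minute stages (wash, dry, fold, stash) and four loads (A-D),
# as given on the slides.
stages = 4
stage_min = 30
loads = 4

# Sequential: each load finishes all stages before the next starts.
sequential_min = loads * stages * stage_min        # 480 min = 8 hours

# Pipelined: fill the pipe once, then one load completes per stage time.
pipelined_min = (stages + loads - 1) * stage_min   # 210 min = 3.5 hours

print(sequential_min / 60, pipelined_min / 60)     # 8.0 3.5
```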

4 EE30332 Ch6 DP.4 Pipelining Lessons  Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload  Multiple tasks operate simultaneously using different resources  Potential speedup = number of pipe stages  Pipeline rate is limited by the slowest pipeline stage  Unbalanced lengths of pipe stages reduce speedup  Time to “fill” the pipeline and time to “drain” it reduce speedup  Stall for dependences [Timing diagram: loads A–D overlapped in 30-minute slots, 6 PM to 9 PM]

5 EE30332 Ch6 DP.5 Pipelining  Improve performance by increasing instruction throughput.  To increase throughput, minimize the duration of each individual stage.  One natural way to minimize stage duration is to split an instruction into more stages.  One disadvantage of more stages is branching, because a branch alters the otherwise sequential instruction flow.  What do we need to add to actually split the datapath into stages? Ans: storage elements (pipeline registers) between the stages.

6 EE30332 Ch6 DP.6 The Five Stages of Load  Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory  Reg/Dec: Registers Fetch and Instruction Decode  Exec: Calculate the memory address  Mem: Read the data from the Data Memory  Wr: Write the data back to the register file [Diagram: Load occupies Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1–5]

7 EE30332 Ch6 DP.7 Conventional Pipelined Execution Representation [Diagram: six instructions, each passing through IFetch, Dcd, Exec, Mem, WB, staggered one cycle apart; program flow runs down, time runs right]
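The staggered layout can be generated with a few lines (an illustrative helper, not part of the original slides):

```python
# Each instruction passes through the same five stages; instruction i
# starts i cycles after instruction 0, producing the staircase diagram.
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_rows(n):
    """Row i: i empty cells (the one-cycle stagger), then the five stages."""
    return [[""] * i + STAGES for i in range(n)]

for row in pipeline_rows(3):
    print(" ".join(f"{cell:>6}" for cell in row).rstrip())
```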

8 EE30332 Ch6 DP.8 Single Cycle, Multiple Cycle, vs. Pipeline [Timing diagram: Single-cycle: Load then Store each take one long cycle, with wasted time for the shorter R-type. Multicycle: Load takes Ifetch, Reg, Exec, Mem, Wr over Cycles 1–5, Store follows. Pipeline: Load, Store, and R-type each start one cycle apart, overlapping Ifetch, Reg, Exec, Mem, Wr]

9 EE30332 Ch6 DP.9 Why Pipeline?  Suppose we execute 100 instructions  Single Cycle Machine 45 ns/cycle x 1 CPI x 100 inst = 4500 ns  Multicycle Machine 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns  Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
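The slide's arithmetic can be reproduced directly (the cycle times and CPIs are the slide's own numbers):

```python
n_inst = 100

single_cycle_ns = 45 * 1 * n_inst          # 45 ns/cycle x CPI 1:   4500 ns
multicycle_ns = round(10 * 4.6 * n_inst)   # 10 ns/cycle x CPI 4.6: 4600 ns
pipelined_ns = 10 * (1 * n_inst + 4)       # 10 ns/cycle, 4-cycle drain: 1040 ns

print(single_cycle_ns, multicycle_ns, pipelined_ns)
```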

10 EE30332 Ch6 DP.10 Why Pipeline? Because the resources are there! [Diagram: instructions 0–4 in program order flowing through Im, Reg, ALU, Dm, Reg, staggered one clock cycle apart]

11 EE30332 Ch6 DP.11 Can pipelining get us into trouble?  Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time -E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV) data hazards: attempt to use an item before it is ready -E.g., one sock of a pair in the dryer and one in the washer; can’t fold until the sock gets from the washer through the dryer -an instruction depends on the result of a prior instruction still in the pipeline control hazards: attempt to make a decision before the condition is evaluated -E.g., washing football uniforms and need to get the proper detergent level; need to see after the dryer before the next load goes in -branch instructions  Can always resolve hazards by waiting: pipeline control must detect the hazard and take action (or delay action) to resolve it

12 EE30332 Ch6 DP.12 Single Memory is a Structural Hazard [Diagram: Load and Instrs 1–4 all using one Mem for both instruction fetch and data access; Load’s Mem stage collides with Instr 3’s instruction fetch] Detection is easy in this case! (right half highlight means read, left half write)

13 EE30332 Ch6 DP.13 Structural Hazards limit performance  Example: if 1.3 memory accesses per instruction and only one memory access per cycle then average CPI ≥ 1.3 otherwise the resource is more than 100% utilized
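The bound is simple arithmetic: memory is needed 1.3 times per instruction but serves only one access per cycle, so an instruction cannot complete, on average, in fewer than 1.3 cycles.

```python
# Numbers from the slide; one memory port serving both fetch and data.
mem_accesses_per_inst = 1.3
accesses_per_cycle = 1

cpi_floor = mem_accesses_per_inst / accesses_per_cycle
print(cpi_floor)   # 1.3
```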

14 EE30332 Ch6 DP.14 Control Hazard Solutions  Stall: wait until the decision is clear It’s possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read  Impact: 2 clock cycles per branch instruction => slow [Diagram: Add, Beq, Load; Load’s fetch waits two cycles for Beq to resolve]

15 EE30332 Ch6 DP.15 Control Hazard Solutions  Predict: guess one direction, then back up if wrong Predict not taken  Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ~50% of the time)  More dynamic scheme: history of 1 branch (~90%) [Diagram: Add, Beq, Load issued back-to-back; Load is squashed and refetched if the prediction was wrong]
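The per-branch costs on this slide are expected values over the prediction accuracy (a sketch using the slide's numbers: 1 cycle when the guess is right, 2 when it is wrong):

```python
def expected_branch_cycles(p_right, right=1, wrong=2):
    """Average clock cycles per branch for a given prediction accuracy."""
    return p_right * right + (1 - p_right) * wrong

print(expected_branch_cycles(0.5))   # predict-not-taken, ~50% right: 1.5
print(expected_branch_cycles(0.9))   # 1-branch history, ~90% right: ~1.1
```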

16 EE30332 Ch6 DP.16 Control Hazard Solutions  Redefine branch behavior (the branch takes place after the next instruction): “delayed branch”  Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the “slot” (~50% of the time)  As machines launch more instructions per clock cycle, this becomes less useful [Diagram: Add, Beq, Misc, Load; Misc fills the delay slot after Beq]
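Putting the three solutions side by side with the slides' numbers (stall: always 2 cycles; predict: right ~50% of the time; delayed: slot filled ~50% of the time, and assuming an unfilled slot costs 1 cycle for a nop):

```python
stall_cost = 2.0                     # always wait until the branch resolves
predict_cost = 0.5 * 1 + 0.5 * 2     # 1 cycle if right, 2 if wrong: 1.5
delayed_cost = 0.5 * 0 + 0.5 * 1     # 0 if slot filled, 1 for a nop: 0.5

print(stall_cost, predict_cost, delayed_cost)   # 2.0 1.5 0.5
```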

17 EE30332 Ch6 DP.17 Data Hazard on r1 add r1, r2, r3 sub r4, r1, r3 and r6, r1, r7 or r8, r1, r9 xor r10, r1, r11

18 EE30332 Ch6 DP.18 Data Hazard on r1:  Dependencies backwards in time are hazards [Diagram: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 — each reads r1 in its Reg stage before add writes r1 in WB]

19 EE30332 Ch6 DP.19 Data Hazard Solution:  “Forward” the result from one stage to another [Diagram: same five instructions; add’s ALU result is forwarded to sub, and, and or] The “or” is OK if register read/write timing is defined properly (write in the first half of the cycle, read in the second half)

20 EE30332 Ch6 DP.20 Forwarding (or Bypassing): What about Loads?  Dependencies backwards in time are hazards  Can’t solve with forwarding: must delay/stall the instruction dependent on the load [Diagram: lw r1,0(r2) followed by sub r4,r1,r3 — the loaded data is not available from Mem until after sub’s EX stage needs it]
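The load-use case above can be detected with a simple check; this sketch uses an assumed (op, dest, sources) tuple encoding, not anything from the slides:

```python
def needs_stall(prev, curr):
    """Stall if the previous instruction is a load whose destination
    register is a source of the current instruction: forwarding cannot
    deliver the loaded value in time."""
    op, dest, _sources = prev
    return op == "lw" and dest in curr[2]

lw_r1 = ("lw", "r1", ["r2"])
sub = ("sub", "r4", ["r1", "r3"])
add = ("add", "r5", ["r2", "r3"])

print(needs_stall(lw_r1, sub))   # True: sub reads r1 one cycle too early
print(needs_stall(lw_r1, add))   # False: add does not read r1
```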

21 EE30332 Ch6 DP.21 Designing a Pipelined Processor  Go back and examine your datapath and control diagram  associate resources with states  ensure that flows do not conflict, or figure out how to resolve conflicts  assert control in the appropriate stage

22 EE30332 Ch6 DP.22 Pipelined Processor (almost) for slides  What happens if we start a new instruction every cycle? [Datapath diagram: PC and Next PC feed the Inst. Mem and IR; Reg File read (A, B) in decode; Exec stage (S); Data Mem access (M); Reg File write-back; pipelined control fields (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl) and valid bits travel with each instruction’s IR]

23 EE30332 Ch6 DP.23 Control and Datapath IR <- Mem[PC]; PC <- PC+4; A <- R[rs]; B <- R[rt]; S <- A + B; R[rd] <- S; S <- A + SX; M <- Mem[S]; R[rd] <- M; S <- A or ZX; R[rt] <- S; S <- A + SX; Mem[S] <- B; if cond, PC <- PC+SX [Datapath diagram: PC, Next PC, Inst. Mem, IR; Reg File (A, B); Exec (S); Data Mem (M); Reg File write-back; Equal comparator for branches]

24 EE30332 Ch6 DP.24 Pipelining the Load Instruction  The five independent functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File’s read ports (busA and busB) for the Reg/Dec stage ALU for the Exec stage Data Memory for the Mem stage Register File’s write port (busW) for the Wr stage [Timing diagram: 1st, 2nd, and 3rd lw start one cycle apart, each taking Ifetch, Reg/Dec, Exec, Mem, Wr over Cycles 1–7]

25 EE30332 Ch6 DP.25 The Four Stages of R-type  Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory  Reg/Dec: Registers Fetch and Instruction Decode  Exec: ALU operates on the two register operands Update PC  Wr: Write the ALU output back to the register file [Diagram: R-type occupies Ifetch, Reg/Dec, Exec, Wr in Cycles 1–4]

26 EE30332 Ch6 DP.26 Pipelining the R-type and Load Instruction  We have a pipeline conflict, or structural hazard: Two instructions try to write to the register file at the same time! Only one write port Oops! We have a problem! [Timing diagram: R-type, R-type, Load, R-type, R-type issued on consecutive cycles; the Load’s Wr collides with a later R-type’s Wr in the same clock cycle]

27 EE30332 Ch6 DP.27 Important Observation  Each functional unit can only be used once per instruction  Each functional unit must be used at the same stage for all instructions: Load uses the Register File’s write port during its 5th stage R-type uses the Register File’s write port during its 4th stage  2 ways to solve this pipeline hazard.

28 EE30332 Ch6 DP.28 Solution 1: Insert “Bubble” into the Pipeline  Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle The control logic can be complex. We lose an instruction fetch and issue opportunity.  No instruction is started in Cycle 6! [Timing diagram: R-type, R-type, Load, R-type, R-type; a bubble delays the fourth instruction’s issue so its Wr no longer collides with the Load’s Wr]

29 EE30332 Ch6 DP.29 Solution 2: Delay R-type’s Write by One Cycle  Delay R-type’s register write by one cycle: Now R-type instructions also use the Reg File’s write port at Stage 5 The Mem stage is a NOOP stage: nothing is being done. [Timing diagram: every instruction now has 5 stages (Ifetch, Reg/Dec, Exec, Mem, Wr), so writes never collide]

30 EE30332 Ch6 DP.30 The Four Stages of Store  Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory  Reg/Dec: Registers Fetch and Instruction Decode  Exec: Calculate the memory address  Mem: Write the data into the Data Memory [Diagram: Store occupies Ifetch, Reg/Dec, Exec, Mem in Cycles 1–4; Wr is unused]

31 EE30332 Ch6 DP.31 The Three Stages of Beq  Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory  Reg/Dec: Registers Fetch and Instruction Decode  Exec: compare the two register operands, select the correct branch target address, and latch it into the PC [Diagram: Beq occupies Ifetch, Reg/Dec, Exec in Cycles 1–3; Mem and Wr are unused]

32 EE30332 Ch6 DP.32 Control Diagram IR <- Mem[PC]; PC <- PC+4; A <- R[rs]; B <- R[rt]; S <- A + B; R[rd] <- S; S <- A + SX; M <- Mem[S]; R[rd] <- M; S <- A or ZX; R[rt] <- S; S <- A + SX; Mem[S] <- B; M <- S; if cond, PC <- PC+SX [Datapath diagram: as on slide 23, with the pipeline register M <- S carrying the ALU result through the Mem stage]

33 EE30332 Ch6 DP.33 Let’s Try it Out 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal

34 EE30332 Ch6 DP.34 Start: Fetch 10 [Datapath: PC = 10; lw r1, r2(35) is being fetched; the rest of the pipeline is empty]

35 EE30332 Ch6 DP.35 Fetch 14, Decode 10 [Datapath: PC = 14; lw r1, r2(35) in Decode]

36 EE30332 Ch6 DP.36 Fetch 20, Decode 14, Exec 10 [Datapath: PC = 20; lw r1 in Exec with operands r2 and 35; addI r2, r2, 3 in Decode]

37 EE30332 Ch6 DP.37 Fetch 24, Decode 20, Exec 14, Mem 10 [Datapath: PC = 24; lw r1 in Mem at address r2+35; addI r2, r2, 3 in Exec; sub r3, r4, r5 in Decode]

38 EE30332 Ch6 DP.38 Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 [Datapath: PC = 30; lw writes back r1 = M[r2+35]; addI’s result r2+3 in Mem; sub reading r4, r5 in Exec; beq r6, r7, 100 in Decode]

39 EE30332 Ch6 DP.39 Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14 [Datapath: PC = 34; addI writes back r2 = r2+3; sub’s result r4-r5 in Mem; beq comparing r6 and r7 in Exec; ori r8, r9, 17 in Decode]

40 EE30332 Ch6 DP.40 Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20 [Datapath: branch resolved, PC = 100; sub writes back r3 = r4-r5; beq in Mem; ori r8, r9, 17 in Exec; add r10, r11, r12 in Decode] Oops, we should have only one delayed instruction!

41 EE30332 Ch6 DP.41 Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24 [Datapath: PC = 104; beq in WB; ori’s result r9 | 17 in Mem; add r10, r11, r12 in Exec; and r13, r14, 15 in Decode] Squash the extra instruction in the branch shadow!

42 EE30332 Ch6 DP.42 Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30 [Datapath: PC = 110 (octal); ori writes back r8 = r9 | 17; the squashed add’s result r11+r12 in Mem; and r13 in Exec] Squash the extra instruction in the branch shadow!

43 EE30332 Ch6 DP.43 Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34 [Datapath: PC = 114; the squashed add r10 reaches WB but performs no write-back and raises no overflow; and r13’s result r14 & r15 in Mem] Squash the extra instruction in the branch shadow!

44 EE30332 Ch6 DP.44 Summary: Pipelining  What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores  What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction  We’ll build a simple pipeline and look at these issues  We’ll talk about modern processors and what really makes it hard: exception handling trying to improve performance with out-of-order execution, etc.

45 EE30332 Ch6 DP.45 Summary  Pipelining is a fundamental concept multiple steps using distinct resources  Utilize capabilities of the Datapath by pipelined instruction processing start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards

46 EE30332 Ch6 DP.46 Ch6 Supplementary Branch prediction  Branch prediction is critical for superpipelined and superscalar computers  The more instructions issued at the same time, the larger the penalty of hazards  Statistically, about 60% of conditional branches are taken  Higher-level (and more powerful) instruction sets need fewer conditional branches, e.g., those that support variable-length operands  Conditional branching can be classified into two types Program loops Random decision making

47 EE30332 Ch6 DP.47 Static branch prediction  Static and Dynamic Branch Predictions  Static predictions are determined by the compiler for conditional branches  Dynamic predictions are generated at run time (during execution)  Static – good for looping Loop Exit at Loop Start – Predict continue Loop Exit at Loop End – Predict branching Random decision – guess branch to be taken

48 EE30332 Ch6 DP.48 Dynamic branch prediction  Dynamic – data sensitive, at run-time (during execution) One-bit dynamic branch prediction -Predict as the previous record shows Two-bit dynamic branch prediction -If the previous two are the same, predict the same -If the previous two are alternating, predict alternating Branch, branch – predict branch Not branch, not branch – predict not branch Branch, not branch – predict branch Not branch, branch – predict not branch
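The two-bit scheme described above can be modeled directly from its four rules (an illustrative sketch; the initial history of two taken branches is an assumption): if the last two outcomes agree, predict that outcome again; if they alternate, predict the alternation continues.

```python
class TwoBitHistoryPredictor:
    def __init__(self):
        # assumed initial history: two taken branches
        self.older = True
        self.last = True

    def predict(self):
        if self.older == self.last:
            return self.last       # same twice: predict the same again
        return not self.last       # alternating: continue the alternation

    def update(self, taken):
        self.older, self.last = self.last, taken

# On a strictly alternating branch it locks on after one miss:
p = TwoBitHistoryPredictor()
hits = 0
outcomes = [True, False, True, False, True, False]
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes))   # 5 of 6
```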

49 EE30332 Ch6 DP.49 Branch prediction cont.  Branch prediction cache A cache with entries for the instruction addresses that host branch instructions, and a bit to indicate whether the branch was previously taken [Diagram: branch instruction address compared (=?) against the cache tag; Valid AND Matched selects the stored bits: last B or NB, and previous-to-last B or NB. B = Branch, NB = Not Branch]
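A minimal model of such a cache (a sketch: a Python dict stands in for the tag-matched cache, and predicting not-taken on a miss is an assumption, not stated on the slide):

```python
class BranchPredictionCache:
    def __init__(self):
        self.table = {}   # branch instruction address -> last outcome

    def predict(self, addr):
        # valid AND matched: use the stored bit; on a miss, assume not taken
        return self.table.get(addr, False)

    def update(self, addr, taken):
        self.table[addr] = taken

bpc = BranchPredictionCache()
print(bpc.predict(0x24))   # False (miss: assumed not taken)
bpc.update(0x24, True)
print(bpc.predict(0x24))   # True (last outcome was a branch)
```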

50 EE30332 Ch6 DP.50 Other schemes for minimizing incorrect branch prediction penalty  Speculative execution Execute first, with a way to roll back -By doing the store-back on a shadow copy -By keeping a backup copy to undo the roll  Conditional execution To minimize conditional branching for some common actions such as clear, set-to-1, move, or add  Delayed (conditional) branching Always execute the next instruction and then do the conditional branching -Saves a cycle or more

51 EE30332 Ch6 DP.51 Branching  Call and Return are branching instructions through the Stack  Software interrupts are implicit branches and “transparent” to the program even though their actions are carried out. They are treated as parts of the program.

52 EE30332 Ch6 DP.52 Superscalar  Superscalar Fetch and execute two or more instructions concurrently Achieving CPI < 1 requires superscalar execution Dynamic issue – decided at run-time to schedule two or more instructions for concurrent execution Requires multiple copies of functional units such as instruction fetch, arithmetic execution, etc., and multi-port register files and caches (or cache buffers) There are more potential data dependency hazards, resource hazards, and control hazards The penalty for incorrect branch prediction is BIG

53 EE30332 Ch6 DP.53 VLIW  Very Long Instruction Word A VLIW instruction is machine (implementation) dependent A VLIW instruction consists of various fields Each field specifies the operation of a functional unit, such as Ifetch (instruction fetch), Idecode, Ofetch (operand fetch), EX (integer execute), FPA (floating-point add), FPMUL, and FPDIV Static issue – instructions are scheduled by the compiler Multiple instructions are issued and executed at the same time

54 EE30332 Ch6 DP.54 VLIW advantages  Code is generated statically (by the compiler) Compilers can take a lot of time to pack the VLIW instructions; otherwise the work must be done dynamically by a hardware instruction scheduler (circuitry that analyzes and schedules the functional units)  Easier to power down individual functional units when they are not used, and easier for compilers to deliberately arrange functional unit execution to minimize power consumption  Can execute different architectures’ instruction sets on one machine through respective compilers. However, the functional units must be constructed to support those instruction sets and architectures.

55 EE30332 Ch6 DP.55 VLIW disadvantages  Compilers are hard to build  Machine dependent – must have different compilers for different machines of the same architecture  Binary incompatible – must have different binary codes for different machines of the same architecture  Cannot see input data when compiling; must prepare for all possible cases of input data  Difficult to recover from compiler mistakes, and the time penalty can be BIG  Difficult to debug  Non-VLIW machines can also power down individual functional units when not used


