
1 Embedded Computer Architectures Hennessy & Patterson Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation Gerard Smit (Zilverling 4102), André Kokkeler (Zilverling 4096),

2 Contents Introduction Hazards <= dependencies Instruction Level Parallelism; Tomasulo’s approach Branch prediction

3 Dependencies True Data dependency Name dependency —Antidependency —Output dependency Control dependency

4 Data Dependency [Figure: Inst i, Inst i+1 and Inst i+2; a result passed from one instruction to a later one, with the arrows labelled "Data Dep"] Two instructions are data dependent => risk of RAW hazard

5 Name Dependency Antidependence: Inst i reads a register or memory location that Inst j later writes. Two instructions are antidependent => risk of WAR hazard. Output dependence: Inst i and Inst j both write the same register or memory location. Two instructions are output dependent => risk of WAW hazard
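The three dependence types on slides 4 and 5 can be told apart by comparing the registers each instruction reads and writes. The sketch below is only an illustration: the dictionary encoding of an instruction and the function name `classify_hazards` are assumptions made for this example, not part of the slides.

```python
def classify_hazards(first, second):
    """Return the hazards possible when `second` follows `first` in program order.

    Each instruction is a dict with a 'writes' set and a 'reads' set of
    register names (a hypothetical encoding chosen for this sketch).
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # true data dependence
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # antidependence
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # output dependence
    return hazards


add = {"writes": {"F0"}, "reads": {"F2", "F4"}}    # F0 = F2 + F4
mul = {"writes": {"F6"}, "reads": {"F0", "F8"}}    # F6 = F0 * F8
sub = {"writes": {"F8"}, "reads": {"F10", "F12"}}  # F8 = F10 - F12

print(classify_hazards(add, mul))  # mul reads F0, which add writes
print(classify_hazards(mul, sub))  # sub writes F8, which mul reads
```

Only the RAW case reflects a real flow of data; WAR and WAW arise from reuse of a register name, which is why renaming can remove them.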

6 Control Dependency Branch condition determines whether instruction i is executed => i is control dependent on the branch

7 Instruction Level Parallelism Pipelining = ILP Other approach: Dynamic scheduling => out of order execution Instruction Decode stage split into —Issue (decode, check for structural hazards) —Read Operands

8 Instruction Level Parallelism Scoreboard: —Sufficient resources —No data dependencies Tomasulo’s approach: —Minimize RAW hazards —Register renaming to minimize WAW and WAR hazards [Figure: issue and read-operands stages, with a Reservation Station that parks instructions while they wait for operands]

9 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Register F0 start of instruction register use of instruction Time

10 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Register F0 Time Problems if arrows cross

11 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Register F0 Time Instr 2, 3,… will be stalled. Note that Instr 2 and 3 are stalled only because Instr 1 is not ready. If not for Instr 1, they could be executed earlier

12 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Instr 3.Register F0 Instr 1.Register F0 How is it arranged that value is written into Instr 3. Register F0 and not in Instr 1. Register F0?

13 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Instr 3.Register F0 Instr 3.F0Source Instr 1.Register F0 Instr 1.F0Source Instr. k Instr. 2 The result of Instr 2 is labelled with ‘Instr. 2’. Hardware checks whether there is an instruction waiting for this result (by checking the F0Source fields of the instructions) and places the result in the correct place.

14 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Instr 3.Register F0 Instr 3.F0Source Instr. 2 Entry fields: F0Data | F0Source | operation (read)

15 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Entry fields, one entry per read: F0Data | F0Source | operation (read)

16 Tomasulo’s approach Register Renaming 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Entry fields, one entry per read: F0Data | F0Source | operation (read); write entries: operation (write) [Figure labels: Reservation Station; Issue; Filled during execution; Filled during Issue]

17 Tomasulo’s approach Effects —Register Renaming: prevents WAW and WAR hazards —Execution starts when operands are available (datafields are filled): prevents RAW

18 Tomasulo’s approach Issue in more detail (issue is done sequentially) 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Reservation Station entries so far: read1 (data: Empty, source: ?????), write1 (data: Empty), read2. Entry format fields: data, label, operation (read or write), source. This is the only information you have: during issue, you have to keep track of which instruction changed F0 last!

19 Tomasulo’s approach Issue in more detail 1.Read F0 2.Write F0 3.Read F0 4.Write F0 Reservation Station entries: read1 (data: Empty, source: ?????), write1 (data: Empty), read2 (data: Empty, source: Write1), write2. Entry format fields: data, label, operation (read or write), source. Register status of F0 during issue: ????, then write1, then write2. Keeping track of register status during issue is done for every register
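The bookkeeping on slides 18 and 19 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tuple format of `program`, the function name `issue`, and the placeholder source `"regfile"` (used where the slide shows "?????") are invented for this example.

```python
def issue(program):
    """Issue instructions in order and record where each read gets its value.

    `program` is a list of (label, op, register) tuples, op being "read"
    or "write" (a hypothetical format for this sketch).
    Returns (sources, register_status).
    """
    register_status = {}  # register -> label of the last issued writer
    sources = {}          # label of a read -> label of its producer
    for label, op, reg in program:
        if op == "read":
            # No pending writer known: the value comes from the register file.
            sources[label] = register_status.get(reg, "regfile")
        else:
            register_status[reg] = label
    return sources, register_status


program = [
    ("read1", "read", "F0"),
    ("write1", "write", "F0"),
    ("read2", "read", "F0"),
    ("write2", "write", "F0"),
]
sources, status = issue(program)
print(sources)  # read2 waits for write1, not for whatever is in F0 now
print(status)   # F0 was last changed by write2
```

Because read2 is tied to the label write1 rather than to the register name F0, write2 can no longer disturb it.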

20 Tomasulo’s approach Definitions for the MIPS —For each reservation station: Name, Busy, Operation, Vj, Vk, Qj, Qk, A. Name = label; Busy = in execution or not; Operation = instruction; V = operand value; Q = operand source; A = memory address (Load, Store)

21 Tomasulo’s approach; hardware view. Issue hardware: takes instructions from the instruction queue, performs register renaming and fills the reservation stations. Reservation Station. “Execution Control Hardware”: for which instructions are the operands and the corresponding execution units available? => Transport operands to the execution unit. Execution Units. Common Data Bus: carries results plus the identification of the instruction producing the result. “Reservation Fill Hardware”: puts data in the correct place in the reservation station
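The reservation-station fields from slide 20 and the role of the common data bus can be sketched as follows. This is an illustration, not the hardware: the class and function names are assumptions, the A field is left out for brevity, and the station names `Add1` and `Mult1` are example labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                   # label used to tag the result
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # labels of the producers still awaited
    qk: Optional[str] = None

    def ready(self):
        """Execution may start when no operand is still awaited."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Common data bus: deliver `value` to every operand waiting on `tag`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None


mult1 = ReservationStation("Mult1", busy=True, op="MUL", vj=2.0, vk=3.0)
add1 = ReservationStation("Add1", busy=True, op="ADD", vj=1.5, qk="Mult1")
stations = [mult1, add1]

print(add1.ready())  # still waiting for the result labelled Mult1
broadcast(stations, "Mult1", mult1.vj * mult1.vk)
print(add1.ready(), add1.vk)
```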

22 Branch prediction Data Hazards => Tomasulo’s approach Branch (control) hazards => Branch prediction —Goal: Resolve outcome of branch early => prevent stalls because of control hazards

23 Branch prediction; 1 history bit Example: Outerloop: … R=10 Innerloop: … R=R-1 BNZ R, Innerloop … Branch Outerloop. History bit: was the branch taken previously or not: - predict taken: fetch from ‘Innerloop’ - predict not taken: fetch next instr. Actual outcome of branch: - taken: set history bit to ‘taken’ - not taken: set history bit to ‘not taken’. In this situation: correct prediction in 80 % of branch evaluations (each pass through the inner loop mispredicts the first and the last of its 10 branch evaluations)
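The 80 % figure can be checked with a small simulation of the inner-loop branch. This is a sketch under assumptions: the function name, the number of outer iterations and the initial value of the history bit are chosen for this example.

```python
def one_bit_accuracy(inner_iterations=10, outer_iterations=100):
    """Fraction of correct predictions for the inner-loop branch, 1 history bit."""
    predict_taken = False  # assumed initial state of the history bit
    correct = total = 0
    for _ in range(outer_iterations):
        r = inner_iterations            # R=10
        while True:
            r -= 1                      # R=R-1
            taken = r != 0              # BNZ R, Innerloop
            correct += predict_taken == taken
            total += 1
            predict_taken = taken       # the bit records the last outcome
            if not taken:
                break
    return correct / total


print(one_bit_accuracy())  # 0.8
```

The two misses per inner loop are the exit (predicted taken, actually not taken) and the re-entry (predicted not taken because of the previous exit).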

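24 Branch prediction; 2 history bits Example: Outerloop: … R=10 Innerloop: … R=R-1 BNZ R, Innerloop … Branch Outerloop. [Figure: state machine for the 2 history bits, with ‘predict taken’ and ‘predict not taken’ states and transitions on ‘taken’ and ‘not taken’] In this application: correct prediction in 90 % of branch evaluations (only the last evaluation of each pass through the inner loop is mispredicted)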
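The 90 % figure can be reproduced in the same way. The sketch assumes the common 2-bit saturating counter (states 0 and 1 predict not taken, states 2 and 3 predict taken); the slide's state machine may differ in detail. The function name and the starting state are also assumptions.

```python
def two_bit_accuracy(inner_iterations=10, outer_iterations=100, state=3):
    """Fraction of correct predictions for the inner-loop branch, 2 history bits.

    `state` is a saturating counter: 0 and 1 predict not taken,
    2 and 3 predict taken. Starting at 3 models the steady state.
    """
    correct = total = 0
    for _ in range(outer_iterations):
        r = inner_iterations            # R=10
        while True:
            r -= 1                      # R=R-1
            taken = r != 0              # BNZ R, Innerloop
            correct += (state >= 2) == taken
            total += 1
            state = min(state + 1, 3) if taken else max(state - 1, 0)
            if not taken:
                break
    return correct / total


print(two_bit_accuracy())  # 0.9
```

A single not-taken outcome at loop exit only moves the counter from 3 to 2, so the prediction on re-entry is still "taken".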

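25 Branch prediction; Correlating branch predictors if (aa == 2) aa=0; if (bb == 2) bb=0; if (aa != bb) The results of the first two branches are used in the prediction of the third branch. Example: suppose aa == 2 and bb == 2; both are then set to 0, so the condition for the last ‘if’ is always false => if the previous two branches are not taken, the last branch is taken. (This assumes each ‘if’ is compiled into a branch that skips the body when the condition is false, so ‘not taken’ means the condition held.)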

26 Branch prediction; Correlating branch predictors Mechanism: Suppose the results of the 3 previous branches are used to influence the decision. 8 possible sequences:
br-3 | br-2 | br-1 | br
NT | NT | NT | T
NT | NT | T | NT
… | … | … | …
T | T | T | T
br is the branch under consideration. For the sequence (NT NT NT) the prediction is that the branch will be taken => fetches from branch destination. Dependent on the outcome of the branch under consideration the prediction is changed: —1 bit history: (3,1) predictor

27 Branch prediction; Correlating branch predictors Mechanism: Suppose the results of the 3 previous branches are used to influence the decision. 8 possible sequences:
br-3 | br-2 | br-1 | br
NT | NT | NT | T
NT | NT | T | NT
… | … | … | …
T | T | T | T
br is the branch under consideration. For the sequence (NT NT NT) the prediction is that the branch will be taken => fetches from branch destination. Dependent on the outcome of the branch under consideration the prediction is changed: —1 bit history: (3,1) predictor —2 bit history: (3,2) predictor. With 2 bits, each prediction is represented by 2 bits: 2 combinations indicate predict taken, 2 combinations indicate predict not taken; updated by means of a state machine
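An (m, n) predictor as described above can be sketched as a table of n-bit counters selected by the last m branch outcomes. The class below is an illustration under assumptions: the names, the use of a Python dictionary as the table, saturating counters as the state machine, and the simplified three-branch workload (b3 is taken exactly when b1 and b2 were both not taken, as in the aa/bb example) are all choices made for this example.

```python
class CorrelatingPredictor:
    """(m, n) predictor: the last m outcomes select one of 2**m n-bit counters."""

    def __init__(self, m=3, n=2):
        self.m = m
        self.max_count = 2 ** n - 1
        self.history = 0    # last m outcomes as bits, 1 = taken
        self.counters = {}  # (branch, history) -> saturating counter

    def predict(self, pc):
        count = self.counters.get((pc, self.history), 0)
        return count > self.max_count // 2  # upper half: predict taken

    def update(self, pc, taken):
        key = (pc, self.history)
        count = self.counters.get(key, 0)
        if taken:
            self.counters[key] = min(count + 1, self.max_count)
        else:
            self.counters[key] = max(count - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)


predictor = CorrelatingPredictor(m=2, n=2)
b3_mispredictions = 0
for round_number in range(10):
    for b1 in (False, True):
        for b2 in (False, True):
            b3 = not b1 and not b2
            for pc, outcome in (("b1", b1), ("b2", b2), ("b3", b3)):
                warmed_up = round_number >= 3
                if pc == "b3" and warmed_up and predictor.predict(pc) != outcome:
                    b3_mispredictions += 1
                predictor.update(pc, outcome)

print(b3_mispredictions)  # 0
```

After a short warm-up, b3 is predicted without error because its outcome is fully determined by the two outcomes held in the history.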

28 Branch Target Buffer Problem: even with a good prediction, we don’t know where to branch to until the branch target has been computed, and by then we’ve already fetched the next instruction. Solutions: —Delayed Branch —Branch Target buffer

29 Branch Target Buffer [Figure: the Program Counter addresses the Memory (instruction cache) and, in parallel, a table with the columns Address (addresses of branch instructions) and Branch Target (corresponding branch targets); other labels: Hit?, Select, From Instruction Decode hardware] After IF stage, branch address already in PC
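Functionally, the branch target buffer is a lookup from the address of a branch instruction to its predicted target, consulted during instruction fetch. The sketch below assumes a policy in which only taken branches are kept in the buffer and instructions are 4 bytes long; the class and method names are invented for this example.

```python
class BranchTargetBuffer:
    """Maps addresses of branch instructions to their predicted targets."""

    def __init__(self):
        self.entries = {}  # address of branch instruction -> branch target

    def next_fetch_address(self, pc, instruction_size=4):
        """On a hit, fetch from the stored target; otherwise fall through."""
        return self.entries.get(pc, pc + instruction_size)

    def update(self, pc, taken, target):
        """Called once the branch has been resolved later in the pipeline."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumed policy: keep taken branches only


btb = BranchTargetBuffer()
print(hex(btb.next_fetch_address(0x100)))  # miss: next sequential instruction
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.next_fetch_address(0x100)))  # hit: predicted target
```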

30 Branch Folding [Figure: the Program Counter addresses the Memory (instruction cache) and, in parallel, a table with the columns Address (addresses of branch instructions) and Instruction at target (corresponding instructions at the branch targets); label: Hit?] Unconditional Branches: Effectively removing Branch instruction (penalty of -1)

31 Return Address Predictors Indirect branches: branch address known only at run time. 80% of time: return instructions. Small fast stack of return addresses: pushed on Procedure Call, popped on Procedure Return (RET)
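The small fast stack can be sketched as follows. The depth of 8 entries, the overflow policy (drop the oldest entry) and all names are assumptions made for this illustration.

```python
class ReturnAddressStack:
    """Predicts the targets of return instructions with a small stack."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        """Procedure call: remember where the matching return will go."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        """Procedure return: predict the most recently pushed address."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x104)  # outer call
ras.on_call(0x204)  # nested call
print(hex(ras.predict_return()))  # inner return goes back to 0x204
print(hex(ras.predict_return()))  # outer return goes back to 0x104
```

As long as calls are not nested deeper than the stack, every return is predicted correctly.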

32 Multiple Issue Processors Goal: issue multiple instructions in a clock cycle. Superscalar: issues a varying number of instructions per clock —Statically scheduled —Dynamically scheduled. VLIW: issues a fixed number of instructions per clock —Statically scheduled

33 Multiple Issue Processors Example
Instruction type | Pipe stages
Integer | IF ID EX MEM WB
FP | IF ID EX WB
Integer | IF ID EX MEM WB
FP | IF ID EX WB
Integer | IF ID EX MEM WB
FP | IF ID EX WB
Integer | IF ID EX MEM WB
FP | IF ID EX

34 Hardware Based Speculation Multiple Issue Processors => nearly 1 branch every clock cycle Dynamic scheduling + branch prediction: fetch+issue Dynamic scheduling + branch speculation: fetch+issue+execution KEY: Do not perform updates that cannot be undone until you’re sure the corresponding operation really should be executed.

35 Hardware Based Speculation Tomasulo: Branch (Predict Not Taken) Register File Operation i Operation k Operations beyond this point are finished Issued Operation k: -Operand available -Execution postponed until clear whether branch is taken

36 Hardware Based Speculation Tomasulo: Branch (Predict Not Taken) Register File Operation i Operation k Issued Finished Dependent on outcome branch: -Flush reservation stations -Start execution

37 Hardware Based Speculation Speculation: Branch (Predict Not Taken) Register File Operation i Operation k Results of operations beyond this point are committed (from reorder buffer to register file) Issued Operation k: -Operand available and executed Reorder Buffer Commit: sequentially

38 Hardware Based Speculation Speculation: Branch (Predict Not Taken) Register File Operation i Operation k Issued Operation k: -Operand available and executed Reorder Buffer Commit: sequentially Committed

39 Hardware Based Speculation Speculation: Branch (Predict Not Taken) Register File Operation i Operation k Operation k: -Operand available and executed Reorder Buffer Commit: sequentially Committed

40 Hardware Based Speculation Speculation: Branch (Predict Not Taken) Register File Operation i Operation k Operation k: -Operand available and executed Reorder Buffer Commit: sequentially Committed
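The role of the reorder buffer on slides 37 to 40 can be sketched as a queue that commits strictly in program order. This is an illustration under assumptions: the class, its method names and the dictionary layout of an entry are invented for this example, and a branch is modelled as an entry without a destination register.

```python
from collections import deque


class ReorderBuffer:
    """Holds results until they can be committed in program order."""

    def __init__(self):
        self.entries = deque()   # oldest instruction on the left
        self.register_file = {}

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Execution finished: the result waits in the buffer."""
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Retire finished instructions from the head, strictly in order."""
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.register_file[entry["dest"]] = entry["value"]

    def flush_after(self, entry):
        """Misprediction: discard every instruction younger than `entry`."""
        while self.entries and self.entries[-1] is not entry:
            self.entries.pop()


rob = ReorderBuffer()
op_i = rob.issue("F0")
branch = rob.issue(None)
op_k = rob.issue("F2")     # speculative: issued after the branch

rob.complete(op_k, 7.0)    # executed early, but not yet committed
rob.commit()
print(rob.register_file)   # nothing committed: op_i is still pending

rob.complete(op_i, 1.0)
rob.commit()
print(rob.register_file)   # only F0: commit stops at the unresolved branch

rob.flush_after(branch)    # the prediction turned out to be wrong
rob.complete(branch, None)
rob.commit()
print(rob.register_file)   # F2 was never written
```

Operation k executed speculatively, yet its result never reached the register file, which is the property stated as the key point on slide 34.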

41 Hardware Based Speculation Some aspects —Speculated instructions that cause a lot of work may turn out not to have been needed => restrict the actions allowed in speculative mode —ILP of a program is limited —Realistic branch predictors: easier to implement => less efficient

42 Pentium Pro Implementation Pentium Family
Processor | Year | Clock Rate (MHz) | L1 Cache (instr, data) | L2 Cache (instr, data)
Pentium Pro | | | KB, 8 KB | 256 KB, 1024 KB
Pentium I | | | KB, 16 KB | 256 KB, 512 KB
Pentium II Xeon | | | KB, 16 KB | 512 KB, 2 MB
Celeron | | | KB, 16 KB | 128 KB
Pentium III | | | KB, 16 KB | 256 KB, 512 KB
Pentium III Xeon | | | KB, 16 KB | 1 MB, 2 MB

43 Pentium Pro Implementation i486: CISC => problems with pipelining. 2 observations —CISC instructions are translated into a sequence of microinstructions —Microinstructions are of equal length. Solution: pipelining microinstructions

44 Pentium Pro Implementation... Jump to Indirect or Execute... Jump to Execute... Jump to Fetch Jump to Op code routine... Jump to Fetch or Interrupt... Jump to Fetch or Interrupt Fetch cycle routine Indirect Cycle routine Interrupt cycle routine Execute cycle begin AND routine ADD routine Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program

45 Pentium Pro Implementation

46 All RISC features are implemented on the execution of microinstructions instead of machine instructions —Microinstruction-level pipeline with dynamically scheduled microoperations –Fetch machine instruction (3 stages) –Decode machine instruction into microinstructions (2 stages) –Issue microinstructions (2 stages; register renaming and reorder buffer allocation are performed here) –Execution of microinstructions (1 stage; floating point units pipelined; execution takes between 1 and 32 cycles) –Write back (3 stages) –Commit (3 stages) —Superscalar: can issue up to 3 microoperations per clock cycle —Reservation stations (20 of them) and multiple functional units (5 of them) —Reorder buffer (40 entries) and speculation used

47 Pentium Pro Implementation Execution units have the following numbers of stages:
Integer ALU | 1
Integer Load | 3
Integer Multiply | 4
FP add | 3
FP multiply | 5 (partially pipelined: multiplies can start every other cycle)
FP divide | 32 (not pipelined)

48 Thread-Level Parallelism ILP: on instruction level. Thread-Level Parallelism: on a higher level —Server applications —Database queries. Thread: has all information (instructions, data, PC, register state etc.) to allow it to execute —On a separate processor —As a process on a single processor

49 Thread-Level Parallelism Potentially high efficiency Desktop applications: —Costly to switch to ‘thread-level reprogrammed’ applications. —Thread level parallelism often hard to find => ILP continues to be focus for desktop-oriented processors (for embedded processors, the situation is different)

