The University of Adelaide, School of Computer Science


1 The University of Adelaide, School of Computer Science
Computer Architecture: A Quantitative Approach, Fifth Edition
10 June 2018
Chapter 3: Instruction-Level Parallelism and Its Exploitation
Chapter 2 — Instructions: Language of the Computer

2 Introduction
Pipelining became a universal technique in 1985. It overlaps the execution of instructions, exploiting "instruction-level parallelism" (ILP).
There are two main approaches:
Hardware-based dynamic approaches: used in server and desktop processors such as the Intel Core, and in recent versions of ARM processors.
Compiler-based static approaches: used in PMD processors such as the Cortex-A8; not as successful outside of scientific applications.

3 Instruction-Level Parallelism
When exploiting instruction-level parallelism, the goal is to minimize CPI:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
Structural hazard: an instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.
Data hazard: an instruction cannot execute in the proper clock cycle because data needed to execute the instruction is not yet available.
Control hazard: an instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed.
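The CPI equation above is a simple sum, which a short sketch can make concrete. The stall figures used below are illustrative assumptions, not measurements from any real machine:

```python
# Pipeline CPI as the sum of the ideal CPI and the three stall components
# from the equation above. All numbers here are made-up illustrations.
def pipeline_cpi(ideal, structural, data, control):
    return ideal + structural + data + control

# e.g. an ideal CPI of 1 plus 0.4 data-hazard and 0.2 control stalls per instruction
cpi = pipeline_cpi(ideal=1.0, structural=0.0, data=0.4, control=0.2)
```

Minimizing CPI then means driving each stall term toward zero with the techniques this chapter covers.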

4 Hazard Mitigation Methods

5 What is ILP? Parallelism within a basic block (no branches) is limited because instructions are interdependent. Typical basic-block size is 3-6 instructions, so ILP must be exploited across branches. Loop-level parallelism: unroll the loop statically or dynamically, or use SIMD vector extensions.

6 Data Dependences and Hazards
Independent instructions are good for parallelism.
Instruction j is data dependent on instruction i if:
instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

7 Data Dependence
Dependences are a property of programs; the pipeline organization (interlocks and depth) determines whether a dependence is detected and causes a stall.
A data dependence conveys:
the possibility of a hazard,
the order in which results must be calculated, and
an upper bound on exploitable instruction-level parallelism.
Stalls caused by data dependences can be overcome by:
instruction reordering (software and hardware), and
register renaming (Tomasulo).

8 Name Dependence
Two instructions use the same register or memory location (a "name"), but there is no flow of information between them.
Antidependence: instruction j writes a register or memory location that instruction i reads; the initial ordering must be preserved (e.g., S.D and DADDIU in the earlier example).
Output dependence: instructions i and j write the same register or memory location.
To resolve name dependences, use renaming techniques.
Data hazards: conflicts from out-of-order manipulation of dependent data.
Read after write (RAW): caused by a true data dependence; the most common.
Write after write (WAW): caused by an output dependence; occurs when multiple writes are present.
Write after read (WAR): caused by an antidependence; occurs less often because reads happen early, in the ID (instruction decode) stage, and writes happen late, in the WB (write back) stage.
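The three hazard classes follow mechanically from the read and write sets of an earlier instruction i and a later instruction j. A minimal sketch (the set-based instruction encoding is an assumption made for illustration):

```python
# Classify the hazards between an earlier instruction i and a later
# instruction j, given the register names each one reads and writes.
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    hazards = set()
    if i_writes & j_reads:   # j reads a value i produces: true dependence
        hazards.add("RAW")
    if i_writes & j_writes:  # both write the same name: output dependence
        hazards.add("WAW")
    if i_reads & j_writes:   # j overwrites a name i still reads: antidependence
        hazards.add("WAR")
    return hazards

# ADD.D F6,F0,F8 followed by MUL.D F6,F10,F8: both write F6
print(classify_hazards({"F0", "F8"}, {"F6"}, {"F10", "F8"}, {"F6"}))
```

Only RAW reflects a true data dependence; WAW and WAR come from name reuse and are the ones renaming can remove.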

9 Compiler Techniques for Exposing ILP: 1) Pipeline Scheduling
Move independent instructions together, and separate a dependent instruction from its source instruction by the pipeline latency of the source instruction.
Example:
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;

10 Unscheduled Stalls
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall               ; (assume integer load latency is 1)
      BNE    R1,R2,Loop
9 cycles/iteration

11 Pipeline Scheduling
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
7 cycles/iteration (overhead: 2 stalls + 2 loop-overhead cycles)

12 2) Loop Unrolling
Unroll by a factor of 4 (assume the number of array elements is divisible by 4) and eliminate the now-unnecessary instructions:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)    ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)   ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1) ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
27 cycles / 4 iterations = 6.75 cycles/iteration
Note the number of live registers versus the original loop.
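At the source level, the transformation above corresponds to processing four elements of x[i] = x[i] + s per trip and paying the index update and loop test only once. A sketch of the same unrolling, under the same divisible-by-4 assumption:

```python
# Source-level view of unrolling x[i] = x[i] + s by a factor of 4,
# walking the array from the top as the assembly does.
# Assumes len(x) is divisible by 4, as the slide does.
def add_scalar_unrolled(x, s):
    i = len(x) - 1
    while i >= 0:
        x[i]     = x[i]     + s   # iteration 1
        x[i - 1] = x[i - 1] + s   # iteration 2 (loop overhead dropped)
        x[i - 2] = x[i - 2] + s   # iteration 3
        x[i - 3] = x[i - 3] + s   # iteration 4
        i -= 4                    # one index update per four elements
    return x
```

Each unrolled body is independent of the others, which is what lets the compiler later interleave the loads, adds, and stores.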

13 Loop Unrolling + Pipeline Scheduling
Pipeline-schedule the unrolled loop:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      DADDUI R1,R1,#-32
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,32(R1)
      S.D    F8,24(R1)
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
14 cycles / 4 iterations = 3.5 cycles/iteration

14 Loop Unrolling Decisions
Unrolling requires understanding how one instruction depends on another and how instructions can be changed or reordered given those dependences:
Determine that unrolling is useful by finding that the loop iterations are independent (except for the loop-maintenance code).
Use different registers to avoid unnecessary constraints forced by reusing the same registers for different computations.
Eliminate the extra test and branch instructions, and adjust the loop-termination and iteration code.
Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent; this requires analyzing memory addresses and finding that they do not refer to the same address.
Schedule the code, preserving any dependences needed to yield the same result as the original code.

15 Loop Unrolling Limitations
The overhead amortized by each additional unrolling decreases.
Code size grows (more I-cache misses).
Compiler limitations (register pressure).

16 Dynamic Scheduling
Hardware rearranges the order of instruction execution to reduce stalls while maintaining data flow and exception behavior.
Advantages:
the compiler does not need knowledge of the microarchitecture;
handles cases where dependences are unknown at compile time;
allows the processor to tolerate delays (e.g., cache misses) by executing other code.
Disadvantage: a substantial increase in hardware complexity.

17 Dynamic Scheduling
Dynamic scheduling implies out-of-order execution and out-of-order completion.
Out-of-order execution may create the possibility of WAR and WAW hazards.
Out-of-order completion may result in imprecise exceptions:
completing instructions that are later than the excepting instruction, or
not completing instructions that are earlier than the excepting instruction.

18 Dynamic Scheduling
Out-of-order execution splits the ID stage into:
Issue: decode and check for structural hazards.
Read operands: wait for data hazards to clear, then read operands.
Dynamic scheduling = in-order issue + stall or bypass in operand read (out-of-order execution).
Out-of-order execution mechanisms:
Scoreboarding: used in the 2-issue ARM A8.
Tomasulo's approach: used in the 4-issue Intel Core i7. It tracks when operands are available and uses register renaming to handle antidependences and output dependences; an extended version can speculate, to handle control dependences.
Copyright © 2012, Elsevier Inc. All rights reserved.

19 Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8    ; data dependent on DIV.D (F0)
S.D   F6,0(R1)
SUB.D F8,F10,F14  ; antidependence with ADD.D (F8)
MUL.D F6,F10,F8   ; antidependence with S.D (F6); output dependence with ADD.D (F6)

20 Register Renaming
With temporaries S and T:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
All subsequent uses of F8 must be replaced by T. With the name dependences gone, the code can be reordered:
DIV.D F0,F2,F4
SUB.D T,F10,F14
ADD.D S,F0,F8
MUL.D F6,F10,T
S.D   S,0(R1)
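The substitution of S and T can be mechanized: give every register write a fresh name and make later reads use the latest name. A minimal SSA-style sketch of this idea (unlike the slide, which keeps F0 and the final F6 architectural, this sketch renames every write, including DIV.D's):

```python
# SSA-style renaming: each write gets a fresh name (S, T, U, ...),
# later reads pick up the latest name. This removes WAR and WAW
# hazards while preserving true (RAW) dependences.
def rename(code):
    fresh = iter("STUVWXYZ")
    current = {}                              # architectural reg -> latest name
    renamed = []
    for op, dest, *srcs in code:
        srcs = tuple(current.get(r, r) for r in srcs)
        if op == "S.D":                       # stores read their "dest" operand
            renamed.append((op, current.get(dest, dest), *srcs))
        else:
            current[dest] = next(fresh)       # fresh name per register write
            renamed.append((op, current[dest], *srcs))
    return renamed

code = [("DIV.D", "F0", "F2", "F4"),
        ("ADD.D", "F6", "F0", "F8"),
        ("S.D",   "F6", "0(R1)"),
        ("SUB.D", "F8", "F10", "F14"),
        ("MUL.D", "F6", "F10", "F8")]
print(rename(code))
```

After renaming, SUB.D no longer conflicts with ADD.D, so a scheduler is free to hoist it, exactly as in the reordered listing above.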

21 Register Renaming
Register renaming is provided by reservation stations (RS):
As instructions are issued, registers are renamed with reservation-station identifiers.
The RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file).
Pending instructions designate the RS to which they will send their output.
Result values are broadcast on a result bus, called the common data bus (CDB).
Only the final output updates the register file.

22 Tomasulo's Algorithm
Load and store buffers act like reservation stations.
Top-level design: store buffers, reservation stations, operation units, and the FP registers are all connected to the CDB.

23 Tomasulo's Algorithm
Three steps:
Issue (dispatch): get the next instruction from the FIFO queue. If an RS is available, issue the instruction to the RS with its operand values (if available) and rename the registers to prevent WAR and WAW hazards. If no RS is available, stall the instruction. If operand values are not available, wait for the functional units to produce them.
Execute: when an operand becomes available, store it in all waiting reservation stations. When all operands are ready, execute the instruction. No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
Write result: write the result on the CDB into the reservation stations, register file, and store buffers. (Stores must wait until both address and value are received.)

24 Register Tags
Tags are attached to reservation stations, register-file entries, and load/store buffers.
A tag identifies the reservation station that will produce the operand; a zero tag means the operand is already available.
One cycle of latency separates producing a result and consuming it, because results are broadcast during the Write Result stage.

25 Reservation Station Fields
Op: the operation to perform on source operands S1 and S2.
Qj, Qk: tags of the reservation stations that will produce S1/S2; zero means the operand is already available in Vj/Vk.
Vj, Vk: the operand values.
A: the memory address for a load or store.
Busy: the RS and its corresponding functional unit are busy.
Register file field:
Qi: the tag of the RS that will produce the result for this register; zero means the register contents are valid.
Load/store buffer field:
A: holds the effective address once it has been calculated.
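The field lists above map directly onto two small records. A data-structure sketch using the slide's field names (the concrete types and zero-tag defaults are assumptions for illustration):

```python
# The reservation-station and register-status fields from the slide,
# as plain records. Tag 0 means "operand/register value already valid".
from dataclasses import dataclass

@dataclass
class ReservationStation:
    Op: str = ""       # operation to perform on S1, S2
    Qj: int = 0        # tag of the RS producing S1; 0 = value is in Vj
    Qk: int = 0        # tag of the RS producing S2; 0 = value is in Vk
    Vj: float = 0.0    # value of the first source operand
    Vk: float = 0.0    # value of the second source operand
    A: int = 0         # effective address for loads/stores
    Busy: bool = False # RS and its functional unit are occupied

@dataclass
class RegisterStatus:
    Qi: int = 0        # tag of the RS that will write this register; 0 = valid
```

An issued instruction fills Op, copies any ready operands into Vj/Vk, and records producer tags in Qj/Qk otherwise.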

26 Eliminating WAR between ADD & DIV; Eliminating WAW between ADD & LD

27

28 Tomasulo’s algorithm: the details

29 Tomasulo’s Algorithm: a Loop example
Multiplying an array by a scalar: Branch is assumed taken. Dynamic loop unrolling

30 Tomasulo’s Algorithm Requires comprehensive and complex hardware
Was rarely used before 1990 Is widely adopted in advanced multiple issue processors Do not require complex compilers.

31 Control Dependences
The ordering of instructions with respect to a branch: S1 is control dependent on p1, and S2 on p2.
An instruction control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
An instruction not control dependent on a branch cannot be moved after the branch, so that its execution would be controlled by the branch.

32 Advanced Branch Prediction
Basic 2-bit predictor — a (0,2) predictor:
For each branch, predict taken or not taken.
If the prediction is wrong two consecutive times, change the prediction.
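The 2-bit scheme is a saturating counter: two of the four states predict taken, two predict not taken, and a single misprediction cannot flip the prediction. A minimal sketch (the weakly-taken initial state is an assumption):

```python
# A 2-bit saturating-counter predictor for one branch.
# States 0..3; predict taken when the counter is 2 or 3.
class TwoBitPredictor:
    def __init__(self, state=2):          # start weakly taken (assumption)
        self.state = state
    def predict(self):
        return self.state >= 2            # True = predict taken
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)   # saturate at strongly not taken
```

From the strongly-taken state, it takes two consecutive not-taken outcomes before the predictor flips, which is exactly the behavior the slide describes.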

33 Correlating Predictor (m,n)
Multiple n-bit predictors for each branch: one for each possible combination of the outcomes of the preceding m branches.
Number of overhead bits: 2^m × n × (number of entries in the branch table).
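The storage cost of an (m,n) predictor follows from keeping 2^m separate n-bit counters per branch-table entry. A one-line sketch of that standard formula:

```python
# Storage for an (m, n) correlating predictor: 2**m separate n-bit
# predictors for each of the `entries` slots in the branch table.
def predictor_bits(m, n, entries):
    return (2 ** m) * n * entries

# e.g. a (2,2) predictor with a 1K-entry table uses 2**2 * 2 * 1024 bits
print(predictor_bits(2, 2, 1024))
```

Note that a (0,2) predictor with the same table size uses a quarter of that: history multiplies the cost by 2^m.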

34 Correlated prediction

35 Tournament Predictor
Uses multiple predictors: one local and one global.
A separate 1- or 2-bit predictor selects between the local and global predictors.
The global predictor is selected about 40% of the time for integer benchmarks and about 15% of the time for FP benchmarks.

36 Intel Core i7 Branch Predictor
A simple 2-bit predictor.
A global-history predictor.
A target-address buffer.

37 Branch-Target Buffer
A next-PC prediction buffer, indexed by the current PC, used when a branch instruction is encountered.

38 Typical delayed branch penalty = 0.5 cycles

39 Hardware-Based Speculation
Combines three key ideas:
dynamic branch prediction,
speculation, to allow predicted branches to be executed, and
dynamic scheduling, to schedule across multiple basic blocks.
Instructions execute along predicted paths, but results commit only if the prediction was correct.
Instruction commit: allowing an instruction to update the register file once the instruction is no longer speculative.
An additional piece of hardware, the reorder buffer (ROB), prevents any irrevocable action until an instruction commits.

40 Reorder Buffer
The reorder buffer holds the result of an instruction between completion and commit.
Four fields:
Instruction type: branch / store / register (load or ALU).
Destination field: register number (or memory address for stores).
Value field: the output value.
Ready field: has the instruction completed execution?
Reservation stations are modified so that an operand's source can be a reorder-buffer entry instead of a functional unit.

41 Basic structure of the FP unit with Tomasulo's algorithm and speculation

42 Four-Step Instruction Execution
Issue: get the instruction if the RS and ROB both have empty slots. Read operands from the registers or the ROB. Record which ROB entry will hold the result.
Execute: execute the operation when the operands are ready.
Write results: put the result on the CDB, tagged with the destination ROB entry. Store values are also written into their ROB entry.
Commit: when the instruction reaches the head of the ROB and its result is present, update the register or memory. If the instruction is a mispredicted branch, flush the ROB and restart execution at the correct successor of the branch.
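The commit step above is the only in-order part of the machine: entries leave strictly from the head of the ROB, and a mispredicted branch at the head discards everything behind it. A minimal sketch of that discipline (the dict-based entry layout is an assumption for illustration):

```python
# In-order commit from the head of a reorder buffer. Each entry is a
# dict with 'ready', optional 'mispredicted' (branches), 'dest', 'value'.
from collections import deque

def commit(rob):
    """Commit ready entries from the head; on a mispredicted branch,
    flush the speculative tail. Returns the (dest, value) updates made."""
    committed = []
    while rob and rob[0]["ready"]:
        entry = rob.popleft()
        if entry.get("mispredicted"):
            rob.clear()                 # discard all speculative work
            break
        committed.append((entry["dest"], entry["value"]))
    return committed
```

An entry that is not yet ready blocks everything behind it, even if later instructions finished first — this is what makes exceptions precise.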

43

44

45

46 Speculation algorithm - 1

47 Speculation algorithm - 2

48 Multiple Issue and Static Scheduling
To achieve CPI < 1, multiple instructions must complete per clock. Approaches:
Statically scheduled superscalar processors: in-order execution of a variable number of instructions.
VLIW (very long instruction word) processors: a fixed number of instructions issued as a packet.
Dynamically scheduled superscalar processors: out-of-order execution of a variable number of instructions.

49 VLIW Processors
Package multiple operations into one instruction. An example VLIW processor with 5 operation slots:
one integer operation (or branch),
two independent floating-point operations, and
two independent memory references.
There must be enough parallelism in the code to fill the available slots, found via:
loop unrolling,
local scheduling (a single basic block), and
global scheduling (across branched basic blocks).

50 7 results in 9 cycles = 1.29 cycles per result
This can be improved to 1 cycle per result if enough registers are available.

51 Multiple Issue
Limit the number of instructions of a given class that can be issued in a "bundle" (e.g., one FP, one integer, one load, one store).
Examine all the dependences among the instructions in the bundle; if dependences exist within the bundle, encode them in the reservation stations.
Multiple completion/commit ports are also needed.

52 Example
Loop: LD     R2,0(R1)     ;R2 = array element
      DADDIU R2,R2,#1     ;increment R2
      SD     R2,0(R1)     ;store result
      DADDIU R1,R1,#8     ;increment pointer
      BNE    R2,R3,LOOP   ;branch if not last element
Create a table of the first 3 iterations for speculative and non-speculative multiple-issue processors. Assume that up to 2 instructions can commit per clock.

53 Answer (No Speculation)

54 Example (With Speculation)

55 Integrated Instruction Fetch Unit
A single-pipeline IF stage is not sufficient for multiple-issue processors. Instead, design a monolithic unit that performs:
branch prediction,
instruction prefetch (fetch ahead), and
instruction memory access and buffering (dealing with fetches that cross cache lines).

56 Speculation Issues: Register Renaming vs. ROB
Instead of virtual registers from reservation stations and the reorder buffer, create a single register pool containing physical registers and architectural registers. A hardware map renames registers during issue; speculation recovery occurs by copying during commit. A ROB-like queue is still needed to update the renaming table in order.
This simplifies commit:
record that the mapping between an architectural register and its physical register is no longer speculative;
the physical register holds the final value;
free the physical registers used to hold older values.
In other words: swap physical registers on commit.

57 Multiple Issue and Renaming
Combining instruction issue with register renaming:
Issue logic pre-reserves enough physical registers for the bundle.
Issue logic finds dependences within the bundle and maps the pre-reserved registers as necessary.
Issue logic finds dependences between the current bundle and bundles already in flight, and maps architectural registers as necessary.

58 The ARM Cortex-A8: dual issue, 13-stage pipeline

59 The ARM Cortex-A8
512-entry, 2-way branch-target buffer.
4K-entry global-history branch predictor; if the branch-target buffer misses, the global predictor's prediction is used to calculate the target address.
8-entry return-address stack.

60 The ARM Cortex-A8

61 A8 performance

62 Intel Core i7
14-stage pipeline; 15-cycle misprediction penalty.
Fetch: branch-target buffer; fetches 16 bytes from the I-cache.
Predecode: breaks the 16 bytes into x86 instructions, fusing compare & branch instruction pairs.
Micro-op decode: decodes x86 instructions into MIPS-like micro-ops; micro-op fusion combines load/store and ALU operations.
Register renaming and ROB allocation; a 36-entry reservation station; 6 functional units.

63 Core-i7 performance

