1
Chapter 3: ILP and Its Exploitation
Review of the simple static pipeline
ILP overview
Dynamic branch prediction
Dynamic scheduling, out-of-order execution
Multiple issue (superscalar)
Hardware-based speculation
ILP limitations
Intel Core i7 and Cortex-A8
CDA5155, Fall 2016
2
Dynamic Scheduling
If an instruction is stalled, later instructions that do not depend on any stalled instruction need not stall; they can execute out of order.
Example:
DIVD F0,F2,F4 (long-running)
ADDD F10,F0,F8 (depends on DIVD)
SUBD F12,F8,F14 (independent of both)
The ADDD is stalled before execution, but the SUBD can go ahead: out-of-order execution!
Out-of-order execution introduces WAW and WAR hazards.
3
The University of Adelaide, School of Computer Science
22 September 2018
Dynamic Scheduling
Rearrange the order of instructions to reduce stalls while maintaining data flow.
Advantages:
  The compiler doesn’t need knowledge of the microarchitecture
  Handles cases where dependences are unknown at compile time
Disadvantages:
  Substantial increase in hardware complexity
  Complicates exceptions
Copyright © 2012, Elsevier Inc. All rights reserved.
4
Dynamic Scheduling
Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion
This creates the possibility of WAR and WAW hazards.
Tomasulo’s Approach
  Tracks when operands are available
  Introduces register renaming in hardware
  Minimizes WAW and WAR hazards
5
Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
Name dependences on F6 and F8:
  Antidependence (WAR): ADD.D reads F8 before SUB.D writes it; S.D reads F6 before MUL.D writes it
  Output dependence (WAW): ADD.D and MUL.D both write F6
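The dependences in this example can be found mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and `find_dependences` are hypothetical names, and the check is a simple pairwise comparison that ignores intervening writes, which is enough for a fragment this short.

```python
from collections import namedtuple

# Hypothetical record type: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")

# The five-instruction example. S.D writes memory, so it has no register destination.
PROGRAM = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F6", ("F0", "F8")),
    Instr("S.D",   None, ("F6", "R1")),
    Instr("SUB.D", "F8", ("F10", "F14")),
    Instr("MUL.D", "F6", ("F10", "F8")),
]

def find_dependences(program):
    """Return (kind, earlier_index, later_index, register) for every pair."""
    found = []
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            # RAW: the later instruction reads what the earlier one writes.
            if earlier.dest is not None and earlier.dest in later.srcs:
                found.append(("RAW", i, j, earlier.dest))
            # WAR: the later instruction writes what the earlier one reads.
            if later.dest is not None and later.dest in earlier.srcs:
                found.append(("WAR", i, j, later.dest))
            # WAW: both instructions write the same register.
            if later.dest is not None and later.dest == earlier.dest:
                found.append(("WAW", i, j, later.dest))
    return found

DEPENDENCES = find_dependences(PROGRAM)
for kind, i, j, reg in DEPENDENCES:
    print(f"{kind}: {PROGRAM[i].op} -> {PROGRAM[j].op} on {reg}")
```

Only the WAR and WAW entries are name dependences; the RAW entries are true data dependences and survive renaming.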
6
Register Renaming
Example, with F6 renamed to S and F8 renamed to T:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
Now only RAW hazards remain, which can be strictly ordered.
In dynamic scheduling, the reservation stations perform this renaming.
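A renaming pass can be sketched in a few lines. This is an illustrative assumption rather than the slide's method: it gives every destination a fresh name (`T0`, `T1`, ...) instead of renaming only F6 and F8, and the helper names `rename` and `name_dependences` are hypothetical.

```python
from itertools import count

def rename(program):
    """Give every register write a fresh name; reads use the newest name."""
    fresh = (f"T{n}" for n in count())
    newest = {}      # architectural register -> current renamed register
    renamed = []
    for op, dest, srcs in program:
        # Look up sources before updating the map, so an instruction that
        # reads and writes the same register reads the old value.
        new_srcs = tuple(newest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            newest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

def name_dependences(program):
    """Collect the WAR and WAW pairs that renaming is meant to remove."""
    pairs = []
    for j, (_, dest_j, _) in enumerate(program):
        if dest_j is None:
            continue
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_j in srcs_i:
                pairs.append(("WAR", i, j, dest_j))
            if dest_j == dest_i:
                pairs.append(("WAW", i, j, dest_j))
    return pairs

PROGRAM = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
RENAMED = rename(PROGRAM)
print(name_dependences(PROGRAM))   # three name dependences before renaming
print(name_dependences(RENAMED))   # none after renaming
```

In this sketch the final MUL.D reads `T2`, the renamed result of SUB.D, so the RAW chain is preserved while the reuse of F6 and F8 no longer constrains ordering.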
7
Tomasulo’s Algorithm
Key differences from scoreboarding (Appendix C):
  Hazard detection and instruction issue are done per execution unit
  Data results go straight to where they are needed, over the Common Data Bus (CDB)
  Loads and stores get their own execution units
  Reservation stations provide register renaming (an effect like loop unrolling)
[Block diagram labels: Instruction Fetch, Instruction Queue, Issue Logic / Control Unit, Register File, Reservation Station, Execution unit 1, Reservation Station, Execution unit 2, …, Common Data Bus (CDB)]
8
Tomasulo’s Algorithm
Load and store buffers
  Contain data and addresses; act like reservation stations
Top-level design (integer unit not shown)
9
Major Steps in Tomasulo
Instruction Fetch
  Fetch instruction, branch prediction, etc.
Issue
  Get the next instruction from the FP instruction queue.
  If a slot in the appropriate RS (or load-store buffer) is available, send the instruction there; otherwise stall it (structural hazard).
  Send operand values to the RS if they are already available; otherwise, record the names of the RSs that will produce them.
Execute
  While operands are not yet available, monitor the CDB for them.
  When all operands are in the RS, begin executing the instruction.
Write result
  When the result is available and the CDB is free, write the result to the CDB, and from there to the registers and to any RS or store-buffer slots waiting for it.
  Update the register status and the RS’s values, flags, busy state, etc.
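The issue step can be sketched as a single function. This is a simplified illustration under stated assumptions: stations are plain dictionaries, there is one pool of stations (real hardware picks a pool by operation type), and the names `issue`, `reg_status`, and `reg_file` are hypothetical. The register values are made up.

```python
def issue(instr, stations, reg_status, reg_file):
    """Try to issue `instr`; return the station name used, or None to stall."""
    op, dest, src1, src2 = instr
    free = next((name for name, rs in stations.items() if not rs["busy"]), None)
    if free is None:
        return None                       # structural hazard: no free station
    rs = stations[free]
    rs.update(busy=True, op=op, Vj=None, Vk=None, Qj=None, Qk=None)
    for src, v, q in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        producer = reg_status.get(src)
        if producer is None:
            rs[v] = reg_file[src]         # value is ready: copy it now
        else:
            rs[q] = producer              # value pending: note which station produces it
    reg_status[dest] = free               # later readers of dest wait on this station
    return free

# Assumed state: F2 is still being loaded by "Load2"; F6 already holds a value.
stations = {"Add1": {"busy": False}, "Add2": {"busy": False}}
reg_status = {"F2": "Load2"}
reg_file = {"F2": 0.0, "F6": 1.5, "F8": 0.0}

first = issue(("SUBD", "F8", "F6", "F2"), stations, reg_status, reg_file)
second = issue(("ADDD", "F6", "F8", "F2"), stations, reg_status, reg_file)
third = issue(("ADDD", "F10", "F6", "F2"), stations, reg_status, reg_file)
print(first, second, third)
```

The third call returns `None` because both stations are busy, which corresponds to stalling on a structural hazard.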
10
Components of a Tomasulo Unit
Reservation stations (RSs)
  Buffer pending instructions and their operands until all operands are available and the instruction can enter its execution unit.
Issue logic
  Redirects (renames) instructions’ register outputs to reservation-station slots.
  Results go directly to the RSs rather than through the register file.
Distributed hazard detection
  Handled separately by each functional unit
Load and store buffers (can be combined with the RSs)
  Queue up memory access requests
11
Reservation Station (RS) Fields
In each RS slot:
  Op - the operation to perform on source operands S1 and S2
  Qj, Qk - the RS slots that will produce S1 and S2
  Vj, Vk - the values of S1 and S2
  Busy - the RS and its execution unit are occupied
In register file (status) entries and store buffer slots:
  Qi - the RS slot containing the operation whose result should be stored here
In load and store buffers (combined with the RS):
  A - holds the effective address for a load or store
See Figure 3.8. See Figure 3.9 for detailed operations.
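These fields map naturally onto a small record type. The sketch below is one possible encoding, assuming `None` means "no pending producer" for the Q fields and "no value yet" for the V fields; the class name and the `ready` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None      # operation to perform
    Vj: Optional[float] = None    # value of the first source operand
    Vk: Optional[float] = None    # value of the second source operand
    Qj: Optional[str] = None      # station that will produce the first operand
    Qk: Optional[str] = None      # station that will produce the second operand
    A: Optional[int] = None       # effective address (load/store buffers only)

    def ready(self) -> bool:
        """An occupied slot may start executing once no operand is pending."""
        return self.busy and self.Qj is None and self.Qk is None

# MULTD F0,F2,F4 waiting for F2 from the (assumed) station "Load2".
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", Vk=4.0, Qj="Load2")
waiting = mult1.ready()

# The awaited value arrives: store it and clear the pending tag.
mult1.Vj, mult1.Qj = 2.0, None
runnable = mult1.ready()
print(waiting, runnable)
```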
12
Example for Tomasulo’s Algorithm
We will go through the same code fragment to see how Tomasulo’s Algorithm handles out-of-order execution.
1. LD F6,34(R2)
2. LD F2,45(R3)
3. MULTD F0,F2,F4
4. SUBD F8,F6,F2
5. DIVD F10,F0,F6
6. ADDD F6,F8,F2
[Figure legend: data dependence, antidependence, output dependence]
13
Tomasulo Example
[Figure labels: instruction stream; 3 load buffers; 3 FP adder R.S.; 2 FP mult R.S.; FU countdown; register status (Qi); clock cycle counter]
14
Cycle 1 (Rename F6)
15
Cycle 2 Note: Can have multiple loads outstanding
16
Cycle 3 Note: register names are removed (“renamed”) in the reservation stations; MULT issued. Load1 completing (1 cycle addgen, 1 cycle load); what is waiting for Load1?
17
Cycle 4 Load2 completing; what is waiting for Load2?
18
Cycle 5 Timers start counting down for Add1 and Mult1
19
Cycle 6 Issue ADDD here despite name dependency on F6?
20
Cycle 7 Add1 (SUBD) completing; what is waiting for it?
21
Cycle 8
22
Cycle 9 Out of order
23
Cycle 10 Add2 (ADDD) completing; what is waiting for it?
24
Cycle 11 Write result of ADDD here?
All quick instructions complete in this cycle!
25
Cycle 12
26
Cycle 13
27
Cycle 14
28
Cycle 15 Mult1 (MULTD) completing; what is waiting for it?
29
Cycle 16 Just waiting for Mult2 (DIVD) to complete
30
Cycle 55 (after skipping cycles…)
31
Cycle 56 Mult2 (DIVD) is completing; what is waiting for it?
32
Cycle 57 Once again: In-order issue, out-of-order execution, and out-of-order completion.
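The cycle numbers in this walkthrough can be reproduced with a small timing model. This is a sketch under assumptions that are not all stated on the slides: one instruction issues per cycle, execution starts the cycle after the last needed result is broadcast, results are written the cycle after execution completes, there are no reservation-station or CDB conflicts, and the latencies are 2 cycles for loads and adds, 10 for multiply, and 40 for divide (values chosen to match the cycle annotations above).

```python
# Assumed execution latencies in cycles.
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers); integer base registers are assumed ready.
PROGRAM = [
    ("LD",    "F6",  ()),
    ("LD",    "F2",  ()),
    ("MULTD", "F0",  ("F2", "F4")),
    ("SUBD",  "F8",  ("F6", "F2")),
    ("DIVD",  "F10", ("F0", "F6")),
    ("ADDD",  "F6",  ("F8", "F2")),
]

def schedule(program, latency):
    """Return one (issue, exec_complete, write_result) tuple per instruction."""
    last_writer = {}     # register -> index of the newest instruction writing it
    timeline = []
    for index, (op, dest, srcs) in enumerate(program):
        issue = index + 1                         # in-order, one issue per cycle
        # Each source waits on the producer recorded at issue time (renaming),
        # so DIVD waits on the LD of F6, not on the later ADDD.
        waits = [timeline[last_writer[s]][2] for s in srcs if s in last_writer]
        start = max([issue] + waits) + 1          # cycle after the last broadcast
        complete = start + latency[op] - 1
        write = complete + 1
        timeline.append((issue, complete, write))
        last_writer[dest] = index
    return timeline

TIMELINE = schedule(PROGRAM, LATENCY)
for (op, dest, _), times in zip(PROGRAM, TIMELINE):
    print(op, dest, times)
```

Under these assumptions the ADDD writes its result in cycle 11 while the DIVD issued before it writes in cycle 57, which is the out-of-order completion the walkthrough ends on.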
33
Tomasulo’s Two Major Advantages
1. Distribution of the hazard detection logic
  Achieved through distributed reservation stations and the CDB.
  If multiple instructions are waiting on a single result, and each already has its other operand, they can all be released simultaneously by the broadcast on the CDB.
  If a centralized register file were used instead, the units would have to read their results from the registers when register buses are available.
2. Elimination of stalls for WAW and WAR hazards
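The simultaneous release on a CDB broadcast can be illustrated as follows. This is a minimal sketch with hypothetical names and made-up operand values; the station contents are assumed to resemble the state just before Load2 broadcasts F2 in the running example.

```python
def broadcast(tag, value, stations):
    """Deliver (tag, value) to every waiting operand slot; return stations now ready."""
    released = []
    for name, rs in stations.items():
        hit = False
        for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
            if rs[q] == tag:
                rs[v], rs[q] = value, None    # capture the value, clear the tag
                hit = True
        if hit and rs["Qj"] is None and rs["Qk"] is None:
            released.append(name)
    return released

stations = {
    "Mult1": {"Vj": None, "Vk": 4.0,  "Qj": "Load2", "Qk": None},     # MULTD F0,F2,F4
    "Add1":  {"Vj": 6.0,  "Vk": None, "Qj": None,    "Qk": "Load2"},  # SUBD F8,F6,F2
    "Mult2": {"Vj": None, "Vk": 6.0,  "Qj": "Mult1", "Qk": None},     # DIVD F10,F0,F6
}
released = broadcast("Load2", 2.0, stations)
print(released)
```

Both Mult1 and Add1 become ready from the same broadcast, while Mult2 keeps waiting on Mult1; no register-file read is involved.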
34
Elimination of WAR Hazards
Note the potential WAR hazard between DIVD and ADDD involving F6. But as soon as DIVD enters the RS (in order), it becomes independent of the ADDD! Its 2nd source operand no longer refers to F6; instead, the RS stores the value of F6 produced earlier by the LD. If the LD had not yet completed, the 2nd operand would refer to the LD’s R.S., but still not to F6! So ADDD can write its new value for F6 before DIVD executes, without corrupting DIVD’s operand.
35
Elimination of WAW Hazards
Note the potential WAW hazard between the first LD and the final ADDD involving F6. But as soon as the ADDD is issued, the register status table is updated with F6 assigned to “adder2”. So if the LD completes after that point, it will not update F6, which eliminates the WAW hazard.
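The register-status check that removes the WAW hazard can be sketched like this. The scenario is hypothetical: it assumes the load is slow enough to finish after the ADDD has issued, and the names `write_back`, `reg_status`, and the data values are illustrative only.

```python
def write_back(station, value, reg_status, reg_file):
    """Update only the registers still waiting on `station`; return their names."""
    updated = [reg for reg, tag in reg_status.items() if tag == station]
    for reg in updated:
        reg_file[reg] = value
        reg_status[reg] = None            # the register now holds a real value
    return updated

reg_file = {"F6": 0.0}
reg_status = {}

reg_status["F6"] = "Load1"    # LD F6,34(R2) issues: F6 will come from Load1
reg_status["F6"] = "Add2"     # ADDD F6,F8,F2 issues later and takes over F6

late_load = write_back("Load1", 10.0, reg_status, reg_file)   # stale write is dropped
after_load = reg_file["F6"]
add_done = write_back("Add2", 7.0, reg_status, reg_file)      # the newest writer wins
print(late_load, after_load, add_done, reg_file["F6"])
```

Instructions that issued between the LD and the ADDD would still receive the load's value through their reservation stations, so dropping the register write loses nothing.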
36
Tomasulo Drawbacks
Complexity
  Delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
  Many associative stores (CDB) at high speed
Performance limited by the Common Data Bus
  Each CDB must go to multiple functional units: high capacitance, high wiring density
  Number of functional units that can complete per cycle is limited to one!
  Multiple CDBs require more FU logic for parallel associative stores
Non-precise interrupts! (out-of-order completion)
  This will be addressed later
37
Overlap Loop Interactions
Register renaming
  Multiple iterations use different physical destinations for registers (accomplishing dynamic loop unrolling).
Reservation stations
  Permit instruction issue to advance past integer control flow operations
  Also buffer old values of registers, totally avoiding the WAR stall
Another perspective: Tomasulo builds the data flow dependency graph on the fly.
Note: branch prediction is still needed!
38
Dynamic Loop Scheduling
Loop example:
Loop: LD F0,0(R1)
      MULTD F4,F0,F2
      SD 0(R1),F4
      SUBI R1,R1,#8
      BNEZ R1,Loop
Note that data dependences can span loop iterations. But using Tomasulo and predict-taken, multiple iterations can issue and begin execution simultaneously! This is like dynamic loop unrolling by the hardware.
39
Dynamic Loop-Unrolling with Out-of-order Execution
Loop: load F0,0(R1)
      add F4,F0,F2
      store F4,0(R1)
      addui R1,R1,#-8
      bne R1,R2,Loop
[Figure annotations: data dependence; control dependence (branch prediction); register renaming: R1, F0, F4, R1]
Loop body repeated (unrolled):
      load F0,0(R1)
      add F4,F0,F2
      store F4,0(R1)
      addui R1,R1,#-8
      bne R1,R2,Loop
Note: the hardware discovers the ILP; this is the most flexible approach.
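One way to see why the hardware can overlap iterations is to build the true-dependence (RAW) graph across two copies of the loop body, as renaming hardware effectively does at issue. The sketch below is an illustration with hypothetical names; it assumes the branch is predicted taken so that the second iteration follows the first directly.

```python
# (op, destination, source registers) for one iteration of the loop above.
LOOP_BODY = [
    ("load",  "F0", ("R1",)),
    ("add",   "F4", ("F0", "F2")),
    ("store", None, ("F4", "R1")),
    ("addui", "R1", ("R1",)),
    ("bne",   None, ("R1", "R2")),
]

def dataflow_edges(body, iterations):
    """Return RAW edges as (producer, consumer, register), nodes being (iteration, op)."""
    producer_of = {}     # register -> (iteration, op) of its newest writer
    edges = []
    for it in range(iterations):
        for op, dest, srcs in body:
            for src in srcs:
                if src in producer_of:
                    edges.append((producer_of[src], (it, op), src))
            if dest is not None:
                producer_of[dest] = (it, op)
    return edges

EDGES = dataflow_edges(LOOP_BODY, 2)
CROSS = [e for e in EDGES if e[0][0] != e[1][0]]   # edges between iterations
for producer, consumer, reg in CROSS:
    print(producer, "->", consumer, "via", reg)
```

In this model the only values flowing from the first iteration into the second travel through R1; the reuse of F0 and F4 is a name dependence only, which is why renaming lets the second iteration's load and add proceed without waiting on the first iteration's F0 and F4.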
40
Check Figure 3.10