Lecture 12 Reorder Buffers

1 Lecture 12 Reorder Buffers
CSCE 513 Computer Architecture
Topics: Tomasulo’s loop example, speculation, reorder buffers
Readings:
October 16, 2017

2 Overview
Last Time
- Control Hazards: Lecture 7 slides 27-32
- Data Hazards Review
- Tomasulo Overview, examples
New
- Tomasulo Overview, examples revisited: Figures 2.10 (right one), 2.11
- Tomasulo’s Algorithm details: fig 2.12
- Tomasulo + ReOrder Buffer (ROB): fig 2.14, 2.15, 2.16
References
- Chapter 2 section 2.6
Test 1

3 Dynamic Scheduling
The University of Adelaide, School of Computer Science, 18 September 2018, Branch Prediction
Dynamic scheduling implies:
- Out-of-order execution
- Out-of-order completion
Creates the possibility for WAR and WAW hazards
Tomasulo’s Approach
- Tracks when operands are available
- Introduces register renaming in hardware
- Minimizes WAW and WAR hazards
Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer

4 Register Renaming
Example:
    DIV.D F0,F2,F4
    ADD.D F6,F0,F8
    S.D   F6,0(R1)
    SUB.D F8,F10,F14
    MUL.D F6,F10,F8
- Antidependence (WAR) on F8: ADD.D reads F8 before SUB.D writes it
- Antidependence (WAR) on F6: S.D reads F6 before MUL.D writes it
- Plus a name dependence (output dependence, WAW) with F6: ADD.D and MUL.D both write it
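The renaming idea behind the example on slide 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temporary names `T0`, `T1`, ... and the `rename` helper are hypothetical, and every destination is given a fresh name, whereas real hardware renames through reservation stations or a physical register file.

```python
from itertools import count


def rename(program):
    """Give every written register a fresh name (hypothetical T0, T1, ...)."""
    fresh = (f"T{i}" for i in count())
    mapping = {}                      # architectural name -> current name
    renamed = []
    for op, dest, srcs in program:
        # Sources read whichever name currently holds the register's value.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = dest
        if dest is not None:          # stores have no register destination
            new_dest = next(fresh)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# The five instructions from the slide, as (opcode, destination, sources).
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, SUB.D and MUL.D write names that no earlier instruction reads or writes, so only the true (RAW) dependences remain: ADD.D on DIV.D, S.D on ADD.D, and MUL.D on SUB.D.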

5 Figure 2.9 Tomasulo CDB Register Renaming

6 Tomasulo’s Algorithm
Three steps:
Issue
- Get the next instruction from the FIFO queue
- If a reservation station (RS) is available, issue the instruction to the RS, with operand values if they are available
- If operand values are not available, the instruction waits in its RS, which records the units that will produce them (issue itself stalls only when no RS is free)
Execute
- When an operand becomes available, store it in any reservation stations waiting for it
- When all operands are ready, start executing the instruction
- Loads and stores are maintained in program order through the effective address calculation
- No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
- Write the result on the CDB into reservation stations and store buffers (stores must wait until both address and value are received)
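The issue and write-result steps above can be sketched as follows. This is a minimal sketch, not a cycle-accurate simulator: the `Station` class, the `issue` and `write_result` helpers, and the station names `Mult1` and `Add1` are all hypothetical, and functional-unit latencies, structural hazards, and memory ordering are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str                     # tag broadcast on the CDB, e.g. "Mult1"
    op: str
    vj: Optional[float] = None    # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None      # tags of the stations producing them
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def issue(name, op, dest, src_j, src_k, regs, reg_status):
    """Issue: copy operand values if available, else record the producer tag."""
    rs = Station(name, op)
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        if src in reg_status:                 # value is still being computed
            setattr(rs, q_field, reg_status[src])
        else:
            setattr(rs, v_field, regs[src])
    reg_status[dest] = name                   # dest will come from this station
    return rs


def write_result(tag, value, stations, regs, reg_status):
    """Write result: the CDB delivers (tag, value) to every waiting consumer."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        regs[reg] = value
        del reg_status[reg]


regs = {"F2": 6.0, "F4": 3.0, "F8": 1.0}
reg_status = {}
div = issue("Mult1", "DIV.D", "F0", "F2", "F4", regs, reg_status)
add = issue("Add1", "ADD.D", "F6", "F0", "F8", regs, reg_status)
print(div.ready(), add.ready())   # DIV.D has both operands; ADD.D waits on Mult1
write_result("Mult1", regs["F2"] / regs["F4"], [div, add], regs, reg_status)
print(add.ready(), add.vj)        # ADD.D picked up F0's value from the CDB
```

The Qj/Qk tags are what make the renaming implicit: ADD.D never names F0 after issue, only the station that will produce it.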

7 Example (new and improved in 5th edition)

8 Figure 3.8 3

9 Data-Flow graph

10 Figure 3.9.a Tomasulo Issue

11 Figure 3.9.b Tomasulo Execute

12 Figure 3.9.c Tomasulo Write Result

13 Tomasulo Loop Example
Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
Dynamic loop unrolling of floating point/LD operations

14 Observations on Tomasulo’s Alg
- Tomasulo designed it for the IBM 360/91
- Does not require the compiler to do all of the work
- Changes to hardware (e.g., adding another multiplier) do not require changes to the compiler
- Designed before caches, but OoOE really helps with cache misses
- Dynamic scheduling is required for “speculation”

15 Figure 3.12 Tomasulo + ROB example

16 Figure 3.10 - Two active Iterations of loop

17 Reorder Buffers

18

19 Speculation
Steps: Issue, Execute, Write result, Commit
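The extra Commit step is what separates Tomasulo + ROB from plain Tomasulo: results may be produced out of order, but they update the register file only from the head of the reorder buffer, in program order. A minimal sketch, assuming a hypothetical `ReorderBuffer` class with dictionary entries (stores, exceptions, and the reservation stations themselves are omitted):

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()    # oldest instruction at the left (the head)
        self.regs = {}            # architectural register file

    def issue(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # The result is held in the ROB; the register file is not touched yet.
        entry["value"] = value
        entry["ready"] = True

    def commit(self):
        """Retire ready instructions from the head only, in program order."""
        retired = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.regs[entry["dest"]] = entry["value"]
            retired.append(entry["instr"])
        return retired

    def flush(self):
        """Discard all uncommitted work, e.g. after a mispredicted branch
        reaches the head of the buffer."""
        self.entries.clear()


rob = ReorderBuffer()
mul = rob.issue("MUL.D F0,F2,F4", "F0")
add = rob.issue("ADD.D F6,F8,F10", "F6")
rob.write_result(add, 3.0)        # the younger ADD.D finishes first
first = rob.commit()              # nothing retires: MUL.D at the head is not ready
rob.write_result(mul, 8.0)
second = rob.commit()             # both retire, in program order
print(first, second, rob.regs)
```

Because speculative results live only in the buffer until commit, flushing the buffer is enough to undo them.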

20 Koren’s Tools Again

21 Figure 2.15 Tomasulo + ROB example

22 Figure 2.16 Tomasulo + ROB example

23 Fig 2.17a Tomasulo+ROB Details

24 Fig 2.17b Tomasulo+ROB Execute

25 Fig 2.17c Tomasulo+ROB Write-result

26 Fig 2.17d Tomasulo+ROB Commit

27 Figure 2.18 Multiple Issue Approaches

28 Unrolling for VLIW
For i=1,10000
    x[i] = x[i] + c
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Registers for
Load  Sum
F0    F4
F6    F8
F10   F12
F14   F16
F18   F20
F22   F24
F26   F28
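At source level, the unrolling on slide 28 might look like the sketch below. The factor of 7 is an assumption read off the seven Load/Sum register pairs listed on the slide, the function names are hypothetical, and the Python loop counts upward where the assembly decrements R1.

```python
def add_scalar(x, c):
    """Rolled loop: one load, one add, one store per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + c


def add_scalar_unrolled(x, c, factor=7):
    """Unrolled loop: `factor` independent load/add/store groups per iteration."""
    n = len(x)
    main = n - n % factor
    for i in range(0, main, factor):
        loaded = [x[i + k] for k in range(factor)]   # all loads first
        sums = [v + c for v in loaded]               # then all adds
        for k in range(factor):                      # then all stores
            x[i + k] = sums[k]
    for i in range(main, n):                         # leftover iterations
        x[i] = x[i] + c


a = [float(i) for i in range(10000)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled(b, 2.5)
print(a == b)
```

Grouping the loads, adds, and stores exposes independent operations that a VLIW compiler can pack into the same wide instruction; each group needs its own register pair, which is why the slide lists so many registers.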

29 Figure 2.19 VLIW

30 Advanced Techniques for Instruction Delivery and Speculation
- Increasing Instruction Fetch Bandwidth
- Branch Target Buffers

31 When is the Branch Target Address available?
Fig ? Appendix A

32 Figure A.24 – getting the branch target quicker

33 When is the Branch Target Address available?

34

35

36 Pentium 4 (sec 2.10)
- Front-end decoder: IA32 instructions are translated into micro-ops (uops), which are RISC-like
- 3 IA32 instructions can be decoded per cycle, up to 6 uops
- Uops are executed using an out-of-order speculative pipeline (using reg. renaming instead of ROB)
- Pentium 3 required at least 11 cycles for an instruction to go from fetch to “retire”
- Pentium 4 pipeline depth continued to increase:
  - 21 cycles, allowing 1.5 GHz
  - 31 cycles, allowing 3.2 GHz

37 Figure 2-26 Pentium 4 (Prescott)

38 Figure 2-27 Pentium 4 (Prescott)

39 Tomasulo + Re-Order Buffer (ROB)
Configuration: defaults except:
- F0  F15 in operands for the loads
- FU latencies: FP-Adder: 2, FP-Multiplier: 6, FP-Divider: 12
- Load latency: 2
Start simulation, then Clock+1 to step through

40-50 ROB Example pp 108
[Figure, repeated on slides 40 through 50: Tomasulo + ROB datapath. Reorder buffer columns: Dest, Instr, Value, Ready. Memory buffer columns: Dest, Addr. Reservation stations for the FP adder and FP multiplier, columns: Dest, Op, Vj/Qj, Vk/Qk. Register rows ROB and Busy for F0, F2, ..., F14. Paths: From Memory, Registers, To Memory.]

51 Tomasulo Example Page 98

52 Tomasulo’s Example pp 98
Instructions (columns: Issue, Execute, WriteResult): L.D F6, 32(R2); MUL.D F0, F2, F4
[Figure: Cycle counter. Memory buffer columns: Dest, Addr. Reservation stations for the FP adder and FP multiplier, columns: Busy, Op, Vj/Qj, Vk/Qk. Register row Qj for F0, F2, ..., F20.]

53 Power Wall
- ~125W CPU is near the limit for “air cooled”
- Water cooled

