1
Lecture 12 Reorder Buffers
CSCE 513 Computer Architecture
Lecture 12 Reorder Buffers
Topics
  Tomasulo’s Loop example
  Speculation
  Reorder Buffers
Readings:
October 16, 2017
2
Overview
Last Time
  Control Hazards: Lecture 7 slides 27-32
  Data Hazards Review
  Tomasulo Overview, examples
New
  Tomasulo Overview, examples revisited
  Figures 2.10 (right one), 2.11
  Tomasulo’s Algorithm details, fig 2.12
  Tomasulo + ReOrder Buffer (ROB), fig 2.14, 2.15, 2.16
References
  Chapter 2 section 2.6
Test 1
3
The University of Adelaide, School of Computer Science
18 September 2018
Branch Prediction
Dynamic Scheduling
Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion
Creates the possibility for WAR and WAW hazards
Tomasulo’s Approach
  Tracks when operands are available
  Introduces register renaming in hardware
  Minimizes WAW and WAR hazards
Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer
4
Register Renaming
Example:
  DIV.D F0,F2,F4
  ADD.D F6,F0,F8
  S.D F6,0(R1)
  SUB.D F8,F10,F14
  MUL.D F6,F10,F8
Antidependence (WAR) on F8: ADD.D reads F8 before SUB.D writes it
Antidependence (WAR) on F6: S.D reads F6 before MUL.D writes it
Plus a name dependence (WAW) on F6: ADD.D and MUL.D both write F6
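The renaming idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the hardware mechanism itself: the function name `rename` and the fresh names `T0`, `T1`, ... are assumptions made up for the example. Every write gets a new name, and every read uses the most recent name of its source register, so the WAR and WAW conflicts on F6 and F8 disappear while the true (RAW) dependences remain.

```python
from itertools import count


def rename(program):
    """Give every destination register a fresh name (hypothetical T0, T1, ...).

    program: list of (op, dest, src1, src2); dest is None for stores.
    """
    fresh = (f"T{i}" for i in count())
    latest = {}    # architectural register -> most recent renamed register
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources read the most recent name, preserving true (RAW) dependences.
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        d = None
        if dest is not None:
            # A new name per write removes WAR and WAW hazards.
            d = next(fresh)
            latest[dest] = d
        renamed.append((op, d, s1, s2))
    return renamed


program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("S.D", None, "F6", "R1"),      # the store reads F6 and R1, writes no register
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]

for instr in rename(program):
    print(instr)
```

After renaming, SUB.D writes a different name than the F8 that ADD.D reads, and MUL.D writes a different name than the one S.D reads, so those instructions may complete in any order.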
5
Figure 2.9 Tomasulo CDB Register Renaming
6
Tomasulo’s Algorithm
Three Steps:
Issue
  Get next instruction from FIFO queue
  If a reservation station (RS) is available, issue the instruction to the RS with operand values if available
  If no RS is available, stall the instruction; if operand values are not available, the RS records which units will produce them
Execute
  When an operand becomes available, store it in any reservation stations waiting for it
  When all operands are ready, begin executing the instruction
  Loads and stores are maintained in program order through the effective address calculation
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
  Write result on CDB into reservation stations and store buffers
  (Stores must wait until address and value are received)
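The three steps can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the 360/91 design: the class names `Station` and `Machine`, the method names, and the register values are all hypothetical, latencies and structural limits are ignored, and execute and write result are folded into one call. It shows the Vj/Vk (value) and Qj/Qk (producer tag) fields and the CDB broadcast.

```python
from dataclasses import dataclass
from typing import Optional

OPS = {"ADD.D": lambda a, b: a + b, "MUL.D": lambda a, b: a * b}


@dataclass
class Station:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # stations that will produce missing operands
    qk: Optional[str] = None

    def ready(self):
        return self.qj is None and self.qk is None


class Machine:
    def __init__(self, regs):
        self.regs = dict(regs)
        self.status = {}     # register -> station that will write it
        self.stations = []

    def issue(self, name, op, dest, src1, src2):
        rs = Station(name, op)
        # Take the value if it is available, otherwise remember the producer.
        if src1 in self.status:
            rs.qj = self.status[src1]
        else:
            rs.vj = self.regs[src1]
        if src2 in self.status:
            rs.qk = self.status[src2]
        else:
            rs.vk = self.regs[src2]
        self.status[dest] = name   # dest is renamed to this station
        self.stations.append(rs)

    def execute_and_write(self):
        """Run the first ready station and broadcast its result on the CDB."""
        for rs in self.stations:
            if rs.ready():
                break
        else:
            return None
        value = OPS[rs.op](rs.vj, rs.vk)
        self.stations.remove(rs)
        for other in self.stations:            # CDB broadcast to waiting stations
            if other.qj == rs.name:
                other.vj, other.qj = value, None
            if other.qk == rs.name:
                other.vk, other.qk = value, None
        for reg, producer in list(self.status.items()):
            if producer == rs.name:            # only the latest producer writes
                self.regs[reg] = value
                del self.status[reg]
        return rs.name, value


m = Machine({"F0": 2.0, "F2": 3.0, "F4": 0.0, "F6": 1.0})
m.issue("Mult1", "MUL.D", "F4", "F0", "F2")
m.issue("Add1", "ADD.D", "F6", "F4", "F6")   # waits on Mult1 for F4
print(m.execute_and_write())
print(m.execute_and_write())
```

In this run Add1 is issued with Qj = Mult1 and picks up the product from the broadcast, so it never reads F4 from the register file.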
7
Example (new and improved in 5th edition)
8
Figure 3.8 3
9
Data-Flow graph
10
Figure 3.9.a Tomasulo Issue
11
Figure 3.9.b Tomasulo Execute
12
Figure 3.9.c Tomasulo Write Result
13
Tomasulo Loop Example
Loop: L.D F0, 0(R1)
      MUL.D F4, F0, F2
      S.D F4, 0(R1)
      DADDIU R1, R1, -8
      BNE R1, R2, Loop
Dynamic loop unrolling of floating-point and load (LD) operations
14
Observations on Tomasulo’s Alg
Tomasulo’s algorithm was designed for the IBM 360/91
Does not require the compiler to do all of the work
Changes to hardware (e.g., adding another multiplier) do not require changes to the compiler
Designed before caches, but out-of-order execution (OoOE) really helps with cache misses
Dynamic scheduling is required for “speculation”
15
Figure 3.12 Tomasulo + ROB example
16
Figure 3.10 - Two active Iterations of loop
17
Reorder Buffers
19
Speculation
Four steps: Issue, Execute, Write result, Commit
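The extra Commit step can be sketched with a queue. This is a minimal model under assumptions, not the book's full design: the class `ReorderBuffer`, its method names, and the values used are hypothetical, and reservation stations are left out. The point it illustrates is that results may arrive out of order, but the register file is updated only from the head of the buffer, in program order.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order; the head is the oldest entry
        self.regs = {}           # architectural register file

    def issue(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # The result is held in the ROB entry, not yet in the register file.
        entry["value"] = value
        entry["ready"] = True

    def commit(self):
        """Retire finished instructions from the head only, in program order."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            if e["dest"] is not None:
                self.regs[e["dest"]] = e["value"]
            committed.append(e["instr"])
        return committed

    def flush(self):
        """Discard all uncommitted (speculative) entries after a misprediction."""
        self.entries.clear()


rob = ReorderBuffer()
mul = rob.issue("MUL.D F4,F0,F2", "F4")
add = rob.issue("ADD.D F6,F8,F10", "F6")
rob.write_result(add, 5.0)      # the younger instruction finishes first
print(rob.commit())             # nothing retires: the head is not ready
rob.write_result(mul, 6.0)
print(rob.commit())             # both retire, in program order
```

Because nothing reaches the register file before commit, discarding the buffer is enough to undo work done past a mispredicted branch.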
20
Koren’s Tools Again
21
Figure 2.15 Tomasulo + ROB example
22
Figure 2.16 Tomasulo + ROB example
23
Fig 2.17a Tomasulo+ROB Details
24
Fig 2.17b Tomasulo+ROB Execute
25
Fig 2.17c Tomasulo+ROB Write-result
26
Fig 2.17d Tomasulo+ROB Commit
27
Figure 2.18 Multiple Issue Approaches
28
Unrolling for VLIW
For i=1,10000
  x[i] = x[i] + c
Loop: L.D F0, 0(R1)
      ADD.D F4, F0, F2
      S.D F4, 0(R1)
      DADDUI R1, R1, -8
      BNE R1, R2, Loop
Registers for:
  Load  Sum
  F0    F4
  F6    F8
  F10   F12
  F14   F16
  F18   F20
  F22   F24
  F26   F28
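A source-level sketch of the same transformation, written in Python for readability. The function names and the default `factor=7` are assumptions; seven is chosen only because the slide lists fourteen registers, which suggests seven load/sum pairs. Grouping the loads, then the adds, then the stores mirrors how the unrolled copies expose independent operations that a VLIW compiler can pack into wide instructions.

```python
def add_scalar(x, c):
    """Reference loop: x[i] = x[i] + c."""
    for i in range(len(x)):
        x[i] = x[i] + c


def add_scalar_unrolled(x, c, factor=7):
    """The same loop unrolled `factor` times, with a cleanup loop for leftovers."""
    n = len(x)
    i = 0
    while i + factor <= n:
        loads = x[i:i + factor]          # all loads of this unrolled body
        sums = [v + c for v in loads]    # all adds, independent of each other
        x[i:i + factor] = sums           # all stores
        i += factor                      # one index update per unrolled body
    for j in range(i, n):                # iterations that do not fill a body
        x[j] = x[j] + c


a = [float(i) for i in range(20)]
b = list(a)
add_scalar(a, 1.5)
add_scalar_unrolled(b, 1.5)
print(a == b)
```

Unrolling also amortizes the loop overhead: one index update and one branch now serve seven element updates instead of one.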
29
Figure 2.19 VLIW
30
Advanced Techniques for Instruction Delivery and Speculation
Increasing Instruction Fetch Bandwidth
Branch Target Buffers
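A branch target buffer can be pictured as a small table indexed by the fetch address. The sketch below is illustrative only: the class name, the table size of 16, the direct-mapped indexing, and the 4-byte instruction size are all assumptions, and only taken branches are stored. During fetch the table is consulted with the PC; a hit supplies the predicted target in the same cycle, and a miss falls through to PC + 4.

```python
class BranchTargetBuffer:
    def __init__(self, size=16):
        self.size = size
        self.table = [None] * size     # each slot holds (branch_pc, target)

    def _index(self, pc):
        return (pc >> 2) % self.size   # assumes 4-byte aligned instructions

    def predict(self, pc):
        """Return the predicted next PC for the instruction fetched at pc."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]             # hit: predicted taken, target known
        return pc + 4                  # miss: assume sequential fetch

    def update(self, pc, taken, target):
        """Record the branch outcome once it has been resolved."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None       # not taken: remove the entry


btb = BranchTargetBuffer()
print(hex(btb.predict(0x40)))          # no entry yet, sequential fetch
btb.update(0x40, taken=True, target=0x20)
print(hex(btb.predict(0x40)))          # now predicted taken
```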
31
When is the Branch Target Address available?
Fig ? Appendix A
32
Figure A.24 – getting the branch target quicker
33
When is the Branch Target Address available?
36
Pentium 4 (sec 2.10)
Front end: the decoder translates IA32 instructions into micro-ops (uops), which are RISC-like
3 IA32 instructions can be decoded per cycle (up to 6 uops)
Uops are executed using an out-of-order speculative pipeline (using register renaming instead of ROB)
Pentium 3 required at least 11 cycles for an instruction to go from fetch to “retire”
Pentium 4 pipeline depth continued to increase
  21 cycles allowing 1.5GHz
  31 cycles allowing 3.2GHz
37
Figure 2-26 Pentium 4 (Prescott)
38
Figure 2-27 Pentium 4 (Prescott)
39
Tomasulo + Re-Order Buffer (ROB)
Configuration
Defaults except:
  F0 F15 in operands for the loads
  FU latencies: FP-Adder: 2, FP-Multiplier: 6, FP-Divider: 12
  Load latency: 2
Start simulation, then Clock+1 to step through
40
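ROB Example pp 108
[Figure: blank Tomasulo + ROB worksheet. Panels: From Memory, Registers, To Memory, FP adder, FP multiplier. Column groups: Dest / Instr / Value / Ready (reorder buffer); Dest / Addr; Dest / Op / Vj/Qj / Vk/Qk (one table each for the FP adder and FP multiplier reservation stations); register row F0 through F14 with ROB and Busy fields.]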
51
Tomasulo Example Page 98
52
Tomasulo’s Example pp 98
[Figure: blank Tomasulo worksheet. Instruction status table (Instruction, Issue, Execute, Write Result) listing L.D F6, 32(R2) and MUL.D F0, F2, F4; Cycle counter; Memory buffers (Dest, Addr); FP adder and FP multiplier reservation stations (Busy, Op, Vj/Qj, Vk/Qk); register row F0 through F20 with Qj field.]
53
Power Wall
  ~125W CPU is near the limit for “air cooled”
  Water cooled