Lecture 12 Reorder Buffers

CSCE 513 Computer Architecture
Lecture 12 Reorder Buffers
Topics
  Tomasulo’s Loop example
  Speculation
  Reorder Buffers
Readings:
October 16, 2017

Overview
Last Time
  Control Hazards: Lecture 7 slides 27-32
  Data Hazards Review
  Tomasulo Overview, examples
New
  Tomasulo Overview, examples revisited
  Figures 2.10 right one, 2.11
  Tomasulo’s Algorithm details fig 2.12
  Tomasulo + ReOrder Buffer (ROB) fig 2.14, 2.15, 2.16
References
  Chapter 2 section 2.6
Test 1

The University of Adelaide, School of Computer Science
Dynamic Scheduling
Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion
Creates the possibility for WAR and WAW hazards
Tomasulo’s Approach
  Tracks when operands are available
  Introduces register renaming in hardware
  Minimizes WAW and WAR hazards
18 September 2018 Branch Prediction Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 — Instructions: Language of the Computer

Register Renaming
Example:
  DIV.D F0,F2,F4
  ADD.D F6,F0,F8
  S.D F6,0(R1)
  SUB.D F8,F10,F14
  MUL.D F6,F10,F8
Antidependence on F8 (ADD.D reads it, SUB.D later writes it)
Antidependence on F6 (S.D reads it, MUL.D later writes it)
+ name (output) dependence with F6 (ADD.D and MUL.D both write it)
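The renaming idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of instructions, the helper names `rename` and `name_hazards`, and the fresh names P0, P1, ... are all hypothetical and are not taken from the textbook or from any real hardware. The sketch gives every destination a fresh name and makes each source read the newest name of its register, then checks that the WAR and WAW pairs have disappeared.

```python
from itertools import count

# (opcode, destination or None, source registers); hypothetical encoding
PROGRAM = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

def name_hazards(program):
    """List (kind, earlier index, later index, register) for WAR and WAW pairs."""
    found = []
    for j, (_, dest_j, _) in enumerate(program):
        if dest_j is None:
            continue
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_j in srcs_i:          # later write to an earlier source
                found.append(("WAR", i, j, dest_j))
            if dest_j == dest_i:          # two writes to the same register
                found.append(("WAW", i, j, dest_j))
    return found

def rename(program):
    """Give each destination a fresh name; sources read the newest name."""
    fresh = (f"P{i}" for i in count())
    latest = {}                            # architectural register -> current name
    renamed = []
    for op, dest, srcs in program:
        new_srcs = [latest.get(s, s) for s in srcs]   # read before redefining
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

print(name_hazards(PROGRAM))
print(name_hazards(rename(PROGRAM)))
```

The true data dependences (ADD.D on DIV.D, S.D on ADD.D, MUL.D on SUB.D) survive renaming because sources are rewritten to the producer's new name; only the name dependences go away.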

Figure 2.9 Tomasulo CDB Register Renaming

Tomasulo’s Algorithm
Three Steps:
Issue
  Get next instruction from FIFO queue
  If a reservation station (RS) is available, issue the instruction to the RS with operand values if available
  If operand values are not available, the instruction waits in the RS until they are broadcast
Execute
  When an operand becomes available, store it in any reservation stations waiting for it
  When all operands are ready, issue the instruction for execution
  Loads and stores are maintained in program order through effective address
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
  Write result on CDB into reservation stations and store buffers
  (Stores must wait until address and value are received)
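A minimal Python sketch of the Issue and Write result steps may help make the Vj/Qj bookkeeping concrete. It is an assumption-laden toy, not the textbook's pseudocode: the class names `Station` and `Machine`, the method names, and the station tags such as "Div1" are hypothetical, each station holds one instruction, and execution latency is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations that will produce them
    qk: Optional[str] = None

    def ready(self):
        return self.qj is None and self.qk is None

class Machine:
    def __init__(self, regs):
        self.regs = dict(regs)   # register -> committed value
        self.status = {}         # register -> tag of its pending producer
        self.stations = {}       # tag -> busy Station

    def _operand(self, reg):
        tag = self.status.get(reg)
        if tag is not None:
            return None, tag             # value not ready: remember the tag
        return self.regs[reg], None      # value ready: copy it now

    def issue(self, name, op, dest, src1, src2):
        if name in self.stations:        # station busy: structural stall
            return False
        vj, qj = self._operand(src1)
        vk, qk = self._operand(src2)
        self.stations[name] = Station(op, vj, vk, qj, qk)
        self.status[dest] = name         # renaming: dest now means this station
        return True

    def write_result(self, name, value):
        """Broadcast (tag, value) on the CDB and free the station."""
        del self.stations[name]
        for rs in self.stations.values():
            if rs.qj == name:
                rs.vj, rs.qj = value, None
            if rs.qk == name:
                rs.vk, rs.qk = value, None
        for reg, tag in list(self.status.items()):
            if tag == name:              # only the latest producer updates the register
                self.regs[reg] = value
                del self.status[reg]

m = Machine({"F2": 6.0, "F4": 3.0, "F8": 1.0})
m.issue("Div1", "DIV.D", "F0", "F2", "F4")
m.issue("Add1", "ADD.D", "F6", "F0", "F8")
print(m.stations["Add1"])        # waits on tag Div1
m.write_result("Div1", 2.0)
print(m.stations["Add1"])        # both operands now present
```

Because the register status table keeps only the most recent producer per register, an older result that arrives late does not overwrite a newer one, which is how WAW hazards are avoided in this scheme.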

Example (new and improved in 5th edition)

Figure 3.8

Data-Flow graph

Figure 3.9.a Tomasulo Issue

Figure 3.9.b Tomasulo Execute

Figure 3.9.c Tomasulo Write Result

Tomasulo Loop Example
Loop: L.D F0, 0(R1)
      MUL.D F4, F0, F2
      S.D F4, 0(R1)
      DADDIU R1, R1, -8
      BNE R1, R2, Loop
Dynamic loop unrolling of floating point/LD operations
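The "dynamic unrolling" effect can be illustrated with a short Python sketch. It assumes, for illustration only, that the branch is predicted taken and that the integer instructions (DADDIU, BNE) are handled elsewhere, so only the three floating-point and memory instructions of the body are issued. The names `BODY`, `issue_iterations`, and the tags Load1, Mult1, Store1, ... are hypothetical.

```python
from collections import defaultdict

# (opcode, station kind, destination or None, sources); hypothetical encoding
BODY = [
    ("L.D",   "Load",  "F0", ["R1"]),
    ("MUL.D", "Mult",  "F4", ["F0", "F2"]),
    ("S.D",   "Store", None, ["F4", "R1"]),
]

def issue_iterations(n):
    """Issue n copies of the loop body and record which tags each one waits on."""
    status = {}                   # register -> tag of its latest producer
    counters = defaultdict(int)   # station kind -> number allocated so far
    issued = []
    for _ in range(n):
        for op, kind, dest, srcs in BODY:
            counters[kind] += 1
            tag = f"{kind}{counters[kind]}"
            waits_on = [status[s] for s in srcs if s in status]
            issued.append((tag, op, waits_on))
            if dest is not None:
                status[dest] = tag    # later readers of dest see this tag
    return issued

for row in issue_iterations(2):
    print(row)
```

The second MUL.D waits on Load2 rather than Load1, so the two iterations share no floating-point register dependence even though both are written in terms of F0 and F4. The hardware has effectively unrolled the loop through renaming.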

Observations on Tomasulo’s Alg
Tomasulo’s algorithm was designed for the IBM 360/91
Does not require the compiler to do all of the work
Changes to hardware (e.g., adding another multiplier) do not require changes to the compiler
Designed before caches, but out-of-order execution (OoOE) really helps with cache misses
Dynamic scheduling is required for “speculation”

Figure 3.12 Tomasulo + ROB example

Figure 3.10 - Two active Iterations of loop

Reorder Buffers

Speculation
  Issue
  Execute
  Write result
  Commit
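The four steps can be sketched as a tiny reorder buffer in Python. This is a simplified model under explicit assumptions: the class `ReorderBuffer`, its method names, and the dictionary fields are hypothetical, execution itself is not modeled, and a mispredicted branch simply flushes every younger entry when it reaches the head. The point it illustrates is that results may be written out of order but architectural state changes only at commit, in program order.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest entry at the left
        self.regs = {}           # architectural (committed) register state

    def issue(self, instr, dest):
        entry = {"instr": instr, "dest": dest, "value": None,
                 "ready": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        # Completion can happen in any order; nothing architectural changes yet.
        entry["value"] = value
        entry["mispredicted"] = mispredicted
        entry["ready"] = True

    def commit(self):
        """Retire ready entries from the head only; return their names."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            head = self.entries.popleft()
            committed.append(head["instr"])
            if head["dest"] is not None:
                self.regs[head["dest"]] = head["value"]
            if head["mispredicted"]:
                self.entries.clear()     # discard all speculative younger work
        return committed

rob = ReorderBuffer()
mul = rob.issue("MUL.D", "F0")
add = rob.issue("ADD.D", "F6")
rob.write_result(add, 3.0)       # younger instruction finishes first
print(rob.commit())              # nothing commits: MUL.D is still at the head
rob.write_result(mul, 8.0)
print(rob.commit())              # both commit, in program order
```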

Koren’s Tools Again

Fig 2.17a Tomasulo+ROB Details

Fig 2.17b Tomasulo+ROB Execute

Fig 2.17c Tomasulo+ROB Write-result

Fig 2.17d Tomasulo+ROB Commit

Figure 2.18 Multiple Issue Approaches

Unrolling for VLIW
For i=1,10000
  x[i] = x[i] + c
Loop: L.D F0, 0(R1)
      ADD.D F4, F0, F2
      S.D F4, 0(R1)
      DADDUI R1, R1, -8
      BNE R1, R2, Loop
Registers for
  Load: F0 F6 F10 F14 F18 F22 F26
  Sum:  F4 F8 F12 F16 F20 F24 F28
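At the source level, the unrolling on this slide looks roughly like the Python sketch below. The unroll factor of seven is taken from the seven Load/Sum register pairs listed on the slide; the function names and the cleanup loop for leftover elements are assumptions added for illustration, not part of the slide.

```python
def add_scalar_rolled(x, c):
    """Reference version: one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + c
    return x

def add_scalar_unrolled(x, c):
    """Seven independent load/add/store triples per iteration."""
    n = len(x)
    main = n - n % 7                 # largest multiple of 7 not exceeding n
    for i in range(0, main, 7):
        # No statement here depends on another, so a VLIW compiler is free
        # to pack their loads, adds, and stores into wide instruction words.
        x[i]     = x[i]     + c
        x[i + 1] = x[i + 1] + c
        x[i + 2] = x[i + 2] + c
        x[i + 3] = x[i + 3] + c
        x[i + 4] = x[i + 4] + c
        x[i + 5] = x[i + 5] + c
        x[i + 6] = x[i + 6] + c
    for i in range(main, n):         # cleanup for the n % 7 leftover elements
        x[i] = x[i] + c
    return x

data = [float(i) for i in range(10000)]
same = add_scalar_unrolled(list(data), 2.5) == add_scalar_rolled(list(data), 2.5)
print(same)
```

Unrolling also amortizes the loop overhead: the index update and branch run once per seven elements instead of once per element.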

Figure 2.19 VLIW

Advanced Techniques for Instruction Delivery and Speculation
Increasing Instruction Fetch Bandwidth
  Branch Target Buffers
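A branch target buffer can be pictured as a small cache, indexed by the fetch address, that returns a predicted next PC before the instruction is even decoded. The Python sketch below is a toy model under stated assumptions: a direct-mapped table, 4-byte instructions, entries kept only for branches last seen taken, and a hypothetical class name `BranchTargetBuffer` with hypothetical `predict` and `update` methods. Real designs differ in size, associativity, and prediction state.

```python
class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries    # each slot: (branch_pc, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        """Return the predicted next PC for the instruction fetched at pc."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:   # full PC match acts as the tag
            return slot[1]                       # hit: predict taken to target
        return pc + 4                            # miss: fall through

    def update(self, pc, taken, target):
        """Record the actual outcome once the branch resolves."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None                 # not taken: drop the entry

btb = BranchTargetBuffer()
print(hex(btb.predict(0x40)))    # first encounter: no entry yet
btb.update(0x40, taken=True, target=0x20)
print(hex(btb.predict(0x40)))    # next fetch of the same branch
```

Checking the full PC on lookup keeps a different instruction that maps to the same slot from being treated as a taken branch.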

When is the Branch Target Address available?
Fig ? Appendix A

Figure A.24 – getting the branch target quicker

When is the Branch Target Address available?

Pentium 4 (sec 2.10)
Front-end decoder: IA32 instructions are translated into micro-ops (uops), which are RISC-like
  3 IA32 instructions can be decoded per cycle, up to 6 uops
Uops are executed using an out-of-order speculative pipeline (using reg. renaming instead of ROB)
Pentium 3 required at least 11 cycles for an instruction to go from fetch to “retire”
Pentium 4 pipeline depth continued to increase
  21 cycles allowing 1.5GHz
  31 cycles allowing 3.2GHz

Figure 2-26 Pentium 4 (Prescott)

Figure 2-27 Pentium 4 (Prescott)

Tomasulo + Re-Order Buffer (ROB)
Configuration
Defaults except:
  F0  F15 in operands for the loads
  FU latencies:
    FP-Adder: 2
    FP-Multiplier: 6
    FP-Divider: 12
  Load latency: 2
Start simulation, then Clock+1 to step through

ROB Example pp 108
[Diagram: reorder buffer (Dest, Instr, Value, Ready); load buffers from memory (Dest, Addr); FP adder and FP multiplier reservation stations (Dest, Op, Vj/Qj, Vk/Qk); registers and to-memory path; register status (Reg F0 F2 F4 F6 F8 F10 F12 F14, ROB, Busy)]

Tomasulo Example Page 98

Tomasulo’s Example pp 98
[Diagram: instruction status (Instruction, Issue, Execute, WriteResult) for L.D F6, 32(R2) and MUL.D F0, F2, F4, with cycle counter; memory buffers (Dest, Addr); FP adder and FP multiplier reservation stations (Busy, Op, Vj/Qj, Vk/Qk); register status (Reg F0 F2 F4 F6 F8 F10 F12 F14 F16 F18 F20, Qj)]

Power Wall
  ~125W CPU near limit for “air cooled”
  Water cooled
