Final touches on Out-of-Order execution: Review, Tclk, Superscalar, Looking back, Looking forward

Final touches on Out-of-Order execution: Review, Tclk, Superscalar, Looking back, Looking forward

Times and Dates and Places
Office hours tomorrow
– 10:30-noon
– May move from my office; see note on door.
Exam Q&A
– Saturday 3-5pm, 1670 Beyster
– 90% will be old exam questions, so bring paper copies of the old exams.
– A similar Q&A in class on Monday.
Midterm 6-8pm Monday
– Dow 1006, 1010, 1014
– Seat yourself; I may move people around.

Terminology and background
Branch RAT (BRAT)
– Sometimes called GC (don't know why)
– Called the "branch stack" in the MIPS R10K paper
Retirement RAT (RRAT)
– Also called the retirement map table or the architected map table

What do people really use?
Option 1:
– Track the value you overwrote in the map table and store it in the ROB. You can undo the operation (slowly) by rolling back.
Option 2:
– Keep state for each branch in the BRAT
– Can recover more quickly
The R10K chose to use the BRAT to recover from branches (at the head of the ROB, as far as I can tell) and to recover from exceptions using rollback.
– It's not clear why there is no RRAT.
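Option 1 can be sketched in a few lines: each ROB entry logs the RAT mapping it overwrote, and recovery walks the ROB from the tail back, undoing one entry at a time. This is a minimal illustration, not any real design; all names are made up.

```python
# Sketch of "Option 1": each ROB entry logs the RAT mapping it
# overwrote; recovery pops entries youngest-first and restores them.
rat = {"r1": "p1", "r2": "p2"}          # arch reg -> phys reg
rob = []                                 # in-order list of entries

def dispatch(dest_arch, new_phys):
    # Remember the old mapping so it can be restored on rollback.
    rob.append({"dest": dest_arch, "old_phys": rat[dest_arch]})
    rat[dest_arch] = new_phys

def rollback(n_entries):
    # Undo the youngest n_entries instructions, newest first.
    for _ in range(n_entries):
        entry = rob.pop()
        rat[entry["dest"]] = entry["old_phys"]

dispatch("r1", "p5")
dispatch("r2", "p6")
rollback(2)          # squash both: RAT is back to its original state
print(rat)           # {'r1': 'p1', 'r2': 'p2'}
```

The "slow" part is visible in the loop: rollback cost is proportional to the number of squashed instructions, which is why a BRAT checkpoint recovers faster.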

Optimizing CPU Performance
Golden Rule: t_CPU = N_inst * CPI * t_CLK
Given this, what are our options?
– Reduce the number of instructions executed
– Reduce the cycles to execute an instruction
– Reduce the clock period
Our first focus: reducing CPI
– Approach: Instruction Level Parallelism (ILP)
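The golden rule is easy to sanity-check with a quick calculation. The numbers below are hypothetical, just to show how a CPI improvement translates directly into speedup:

```python
# Iron law of performance: t_CPU = N_inst * CPI * t_CLK
# All numbers here are illustrative, not from any real machine.
def cpu_time(n_inst, cpi, t_clk_ns):
    """Total execution time in seconds."""
    return n_inst * cpi * t_clk_ns * 1e-9

base = cpu_time(n_inst=1e9, cpi=2.0, t_clk_ns=1.0)    # 2.0 s
with_ilp = cpu_time(n_inst=1e9, cpi=1.2, t_clk_ns=1.0)  # 1.2 s
print(base / with_ilp)  # ~1.67x speedup purely from lowering CPI
```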

t_CLK
Recall: t_CPU = N_inst * CPI * t_CLK
What defines t_CLK?
– Critical path latency (= logic + wire latency)
– Latch latency
– Clock skew
– Clock period design margins
In current and future generation designs:
– Wire latency is becoming the dominant latency of the critical path, due to growing side-wall capacitance
– This brings a spatial dimension to architecture optimization, e.g., how long are the wires that will connect these two devices?
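A back-of-the-envelope way to see why wire latency matters: the delay of an unbuffered wire with distributed resistance and capacitance grows roughly quadratically with its length (Elmore delay, approximately 0.4 * r * c * L^2, with r and c per unit length). The r and c values below are made up for illustration:

```python
# Elmore delay of a distributed RC wire: t ~ 0.4 * r * c * L**2
# r, c are per-mm resistance and capacitance; values are illustrative.
def wire_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm):
    return 0.4 * r_ohm_per_mm * (c_ff_per_mm * 1e-15) * length_mm**2 * 1e12

near = wire_delay_ps(100, 200, 1.0)   # a 1 mm wire
far = wire_delay_ps(100, 200, 4.0)    # a 4 mm wire
print(far / near)                     # 16.0: 4x the length, 16x the delay
```

The quadratic growth is the point: logic delay scales down with feature size, but long cross-chip wires do not, which is what gives architecture its spatial dimension.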

Determining the Latency of a Wire
[Figure: wire geometry versus process scale; as the scale shrinks, wire delay grows.]

But reality is worse… (fringe capacitance)
[Figure: wire cross-section for the Intel 0.25u process, W ~= 0.64, T ~= 0.48, H …]

Moral of the "t_CLK" story
As we shrink, wire delay starts to dominate.
– Agarwal et al., "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," ISCA 2000.
– That paper argues this will be the end of Moore's law.
– Long buses and large fan-outs become difficult.
– Wire delay still limits logic, but it limits clock distribution even more.
Is this the end of Moore's law? I don't think so.
– Power is more of an issue.
– There are lots of potential solutions here (ballistic conductors and superconductors being two possibilities).

And reducing the number of instructions executed…
Sorry, wrong class.
– Compilers can help with this (a lot in some cases)
– So can ISA design, but making instructions too complex hurts ILP and t_CLK
– It's not clear there is a lot of room for improvement here.
So on the whole, reducing the number of instructions doesn't look viable.
– ILP would seem to be "where it's at."

Optimizing CPU Performance
Golden Rule: t_CPU = N_inst * CPI * t_CLK
Given this, what are our options?
– Reduce the number of instructions executed
– Reduce the cycles to execute an instruction
– Reduce the clock period
Our first focus: reducing CPI
– Approach: Instruction Level Parallelism (ILP)

BRAT
Simple idea:
– When we mispredict, we need to recover state to what it was when the branch finished issuing.
RAT:
– Just make a copy
– The free list is non-trivial
RS:
– Can we just make a copy? Not really; some operations have already completed, etc. Also, the checkpoint would be really big (each RS holds a lot of data).
ROB:
– Easy money.
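The "just make a copy" idea for the RAT can be sketched as: checkpoint the map table (and free-list state) when a branch issues, and restore the whole thing in one shot on a mispredict. This is a toy illustration with made-up names, not the R10K's actual hardware structure.

```python
# Sketch of the BRAT idea: checkpoint the RAT (and free list) at each
# branch; on a mispredict, restore the checkpoint all at once.
import copy

rat = {"r1": "p1", "r2": "p2"}
free_list = ["p5", "p6", "p7"]
brat = {}                       # branch tag -> saved checkpoint

def checkpoint(branch_tag):
    brat[branch_tag] = (copy.deepcopy(rat), list(free_list))

def mispredict(branch_tag):
    global rat, free_list
    rat, free_list = brat.pop(branch_tag)

checkpoint("b0")
rat["r1"] = free_list.pop(0)    # wrong-path rename: r1 -> p5
mispredict("b0")
print(rat["r1"], free_list)     # back to p1, free list fully restored
```

Note the contrast with rollback: recovery cost here is one copy, independent of how many wrong-path instructions issued, at the price of storing a checkpoint per in-flight branch.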

Superscalar OoO
Out-of-order and superscalar make a nice combination.
– The whole point of OoO is to find something to do; superscalar provides the resources to do it.
Out-of-order scales pretty nicely.
– Dependencies are resolved at rename
– True dependencies are already dealt with by rename and the general OoO model.
So we've already done a lot of the work.

But more to go
To be superscalar, one needs to be able to complete more than one instruction per cycle in a sustained way.
– This means fetch, rename, issue, execute, CDB broadcast, and retire must all be able to handle two instructions at once.
– It is mostly a matter of scale.

Fetch
Performing more than one fetch seems straightforward.
– Just grab PC and PC+4
– It can be complicated by hardware restrictions: say the two instructions span a cache line
– Branches also cause problems: what if PC+4 is wrong?
– But as long as you can usually fetch two, life is good.
And we can add tricks to handle these problems:
– Trace cache, multi-branch predictor, I-cache annotations
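The cache-line restriction above is just an alignment check. A minimal sketch, assuming fixed 4-byte instructions and a hypothetical 32-byte I-cache line:

```python
# Does a 2-wide fetch of PC and PC+4 cross an I-cache line boundary?
# Assumes 4-byte instructions and a hypothetical 32-byte line size.
LINE_BYTES = 32

def spans_line(pc):
    """True if PC and PC+4 fall in different cache lines."""
    return (pc // LINE_BYTES) != ((pc + 4) // LINE_BYTES)

print(spans_line(0x1000))  # False: both words sit in one line
print(spans_line(0x101C))  # True: 0x101C and 0x1020 are in different lines
```

When this returns True, a single-ported I-cache can only deliver one of the two instructions that cycle, which is exactly why fetch width is "usually/often" rather than always achieved.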

Decode
Just have two of them.
– For x86 or other CISC ISAs this might be unreasonable; a trace cache, or otherwise caching decoded instructions, might help here.

Rename
One change is that we need more ports into the RAT.
The other (bigger) issue is making sure dependencies inside the group are dealt with:
R1 = …
… = R1
– How do we handle this? Basically something similar to "register forwarding" inside the register file: the younger instruction's source must come from the physical register just allocated to the older instruction's destination, not from the stale RAT entry.
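The intra-group check can be sketched as a bypass on the RAT read: both instructions read the RAT in parallel, and then instruction 2's source is overridden if it names instruction 1's destination. Register and tag names below are illustrative.

```python
# Sketch of 2-wide rename with intra-group dependency bypass.
rat = {"r1": "p1", "r2": "p2", "r3": "p3"}   # arch -> phys mapping
free_list = ["p10", "p11"]

def rename_pair(i1, i2):
    """Each instruction is (dest_arch, src_arch); i1 is older."""
    s1 = rat[i1[1]]              # both RAT reads happen "in parallel"
    s2 = rat[i2[1]]
    d1 = free_list.pop(0)        # allocate a new PRF entry per dest
    d2 = free_list.pop(0)
    if i2[1] == i1[0]:           # i2 reads what i1 writes:
        s2 = d1                  # bypass i1's new tag, not the RAT value
    rat[i1[0]] = d1
    rat[i2[0]] = d2
    return (d1, s1), (d2, s2)

# i1: r1 = f(r2); i2: r3 = g(r1), so i2 depends on i1
out = rename_pair(("r1", "r2"), ("r3", "r1"))
print(out)  # (('p10', 'p2'), ('p11', 'p10'))
```

In hardware this bypass is a comparator per younger source against each older destination in the group, which is why rename width is costly to grow.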

RAT
[Diagram: a RAT with architectural write ports AW1, AW2; architectural read ports AR1a, AR1b, AR2a, AR2b; physical write ports PW1, PW2; and physical read ports PR1a, PR1b, PR2a, PR2b.]
Situation: two instructions (1 and 2) come into the RAT. The RAT renames two source registers per instruction (ARXa and ARXb) and allocates a new PRF entry for each of the two destinations.

Dispatch
Need to be able to send more than one instruction to the ROB and RS per cycle.
– Just more ports in the RS and ROB.
– A bit more complexity with "ships passing in the night."
– More read ports in the PRF (R10K), or in the ARF/ROB (P6).

Wake-up/select
We've already been allowing more than one instruction to go to an execution unit per cycle.
– No real change here.

Execute complete/CDB
As two instructions need to be able to complete per cycle, we need more than one CDB.
– In general, everyone has to listen to everyone.
– Could try to partition the CDBs, but this is tricky.
– Extra CDBs make the RSs bigger and probably slower.
The ROB needs yet more ports.

Commit
In the R10K this isn't a big deal.
– But we need to recognize when more than one instruction at the head of the ROB is done (more ports) and must be able to complete them (maybe more ports).
In the P6, you've got to do more copies.
– Multiple read ports (ROB); multiple write ports (ARF).

LSQ issues
Load/Store Queue
– It is pretty tricky to get right.
Send instructions to the LSQ at issue. Does this replace the RS?
– Maybe… probably a (circular) store queue plus a load buffer.
Loads need to note which stores they care about:
– The ones that were in the queue when the load issued
– Need to not get caught by "wrap-around" of the store queue
Loads need to check for what, exactly?

So what to do?
You have a lot of options on load launch:
– Conservative: launch loads at the head of the LSQ (bad)
– Moderate: launch loads when no conflicting/unknown stores exist in front of you (OK)
– Aggressive: launch loads ASAP, but fix things up if wrong. Lots of potential issues: imagine you launched a load to memory and then also satisfied it by forwarding; what happens when the memory access returns?
And store forwarding might be tricky.
– Can you identify the situation in which you can forward? If so, can you write Verilog code for it?
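The "moderate" policy check can be sketched as a scan over the older store-queue entries. This version is deliberately conservative (it stalls on any older store with an unknown address, even when a younger matching store would make forwarding safe); all names are illustrative.

```python
# Sketch of the "moderate" load-launch check: scan the stores older
# than the load, oldest-to-youngest. Stall on any unknown address;
# otherwise forward from the youngest matching store, or go to memory.
def can_launch(load_addr, older_stores):
    """older_stores: oldest-to-youngest list of (addr_or_None, value)."""
    fwd = None
    for addr, value in older_stores:
        if addr is None:
            return ("stall", None)   # unknown store address ahead: wait
        if addr == load_addr:
            fwd = value              # keep scanning: youngest match wins
    if fwd is not None:
        return ("forward", fwd)
    return ("launch", None)          # no conflict: safe to go to memory

print(can_launch(0x40, [(0x80, 1), (0x40, 7)]))   # ('forward', 7)
print(can_launch(0x40, [(None, 0), (0x40, 7)]))   # ('stall', None)
print(can_launch(0x40, [(0x80, 1)]))              # ('launch', None)
```

Note the priority logic: the load must take the value of the *youngest* older matching store, which is exactly the part that is awkward to express (and verify) as a Verilog priority encoder.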

Non-LSQ options
Just launch loads from the ROB when they hit the head (easy, poor performance).
As above, but prefetch the data into the cache ASAP.
– This might actually work well; you probably need a non-direct-mapped cache, though.
Use the ROB to track when a load has no conflicting/unknown stores in front of it.
– Seems annoying; might be easy. Still poorish performance.

More details
RS
– We've been doing generic RSs.
– Could also dedicate groups of RSs to a single execution unit (or a group of similar execution units).
This may make the RSs simpler.
It will result in needing more total RSs to get the same performance.
Everyone still needs to listen to the CDB.
For the project, it means you have a bunch of similar code, which is often a bad idea when fixing bugs.

Reading the register file on the way to EX
The idea is to not read the PRF or ROB or CDB for the value, only for the fact that the value is available; grab the value itself on the way to the EX unit.
– Advantages:
No CDB broadcast of values
Don't need to look in the PRF/ARF/ROB for values at issue
– Disadvantages:
Wake-up/select/dispatch, already slow, now gets slower
(But as we may be pipelining this anyway, not too bad.)

Back-to-back dependent instructions
What has to happen to get them to go back to back?
– Why is this hard?
– How is it solved in the real world?

Looking back
Keep the big picture in mind:
– OoO is about finding ILP; if you don't need ILP, don't bother.
Why might you not need ILP?
– The application doesn't need it
– TLP instead
In many ways this is about finding work to do while high-latency instructions run.
– If you fix the memory problem, it isn't clear that OoO makes any sense.

Looking back
Renaming is distinct from OoO.
– You can rename without OoO (not obviously useful!)
– You can have OoO without renaming: in fact the first OoO processor, the CDC 6600, used scoreboarding, which had no renaming but is out-of-order.

Static vs. Dynamic reordering
Some reordering is better done statically:
– The compiler has a global view (an infinite window)
– It has access to the original source, which may tell you about intent or even behavior:
Array behavior may make load/store conflicts easy to identify
Regular code structures may lend themselves to optimal ordering (software pipelining)
– It's just a one-time compile cost.
– Doing the reordering in software saves power and hardware cost!

Static vs. Dynamic
Some things are better done dynamically:
– The hardware has live data: actual case rather than worst case
Load/store ordering is possible to get right without being paranoid about the worst case
Program behavior may change based on the data set (branch prediction in particular)
– The hardware can speculate: static ordering specifies program behavior, so it's much harder to speculate in the compiler.

Looking forward
There is a LOT more to architecture than out-of-order execution.
– Memory: if OoO is mostly about reordering around high-latency loads, memory sounds important.
– Power: modern processors eat huge amounts of power and we can't cool them. What can an architect do?
– Multi-processors: one way to get performance is to have many processors working on a task.
– Static reordering: as noted, it saves power over dynamic reordering, and we might be able to use both together to good effect.