Computer Architecture 2011 – Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz

Presentation transcript:


Slide 2: What's Next

- Remember our goal: minimize CPU time
    CPU Time = clock cycle × CPI × IC
- So far we have learned:
  - Minimize clock cycle → add more pipe stages
  - Minimize CPI → use a pipeline
  - Minimize IC → architecture
- In a pipelined CPU:
  - CPI without hazards is 1
  - CPI with hazards is > 1
- Adding more pipe stages reduces the clock cycle but increases CPI
  - Higher penalty due to control hazards
  - More data hazards
- Beyond some point, adding more pipe stages does not help
- What can we do? Further reduce the CPI!
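The trade-off between clock cycle and CPI can be seen by plugging numbers into the CPU time formula. The sketch below is only an illustration: the function name `cpu_time` and all three configurations are hypothetical values chosen to show the trend, not measurements.

```python
def cpu_time(instruction_count, cpi, clock_cycle_ns):
    """CPU time in nanoseconds: clock cycle x CPI x IC."""
    return instruction_count * cpi * clock_cycle_ns


# Hypothetical numbers, chosen only to illustrate the trade-off.
IC = 1_000_000
shallow = cpu_time(IC, cpi=1.2, clock_cycle_ns=1.0)   # fewer pipe stages
deeper = cpu_time(IC, cpi=1.5, clock_cycle_ns=0.7)    # faster clock, more hazard stalls
too_deep = cpu_time(IC, cpi=2.2, clock_cycle_ns=0.6)  # CPI growth outweighs the clock gain

for name, t in [("shallow", shallow), ("deeper", deeper), ("too deep", too_deep)]:
    print(f"{name:9s} {t / 1e6:.2f} ms")
```

With these assumed values the deeper pipeline wins over the shallow one, while the deepest one loses because its CPI grows faster than its clock cycle shrinks.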

Slide 3: A Superscalar CPU

- Duplicating HW in one pipe stage won't help
  - e.g., with 2 ALUs, the bottleneck moves to other stages
- Getting IPC > 1 requires fetching, decoding, executing, and retiring more than one instruction per clock

[Figure: pipeline stages IF, ID, EXE, MEM, WB]

Slide 4: The Pentium® Processor

- Fetches and decodes 2 instructions per cycle
- Before register file read, decides on pairing: can the two instructions be executed in parallel?
- The pairing decision is based on:
  - Data dependencies: the instructions must be independent
  - Resources:
    - Some instructions use resources from both pipes
    - The second pipe can execute only a subset of the instructions

[Figure: IF, ID, pairing, U-pipe, V-pipe]

Slide 5: Misprediction Penalty in a Superscalar CPU

- MPI: misses per instruction
    MPI = (# incorrectly predicted branches) / (total # of instructions)
        = MPR × (# predicted branches) / (total # of instructions)
- MPI correlates well with performance. E.g., assume:
  - MPR = 5%, %branches = 20% → MPI = 1%
  - Without hazards, IPC = 2 (2 instructions per cycle)
  - Flush penalty of 5 cycles
- We get:
  - MPI = 1% → a flush every 100 instructions
  - IPC = 2 → a flush every 100/2 = 50 cycles
  - 5 cycles of flush penalty every 50 cycles → 10% loss in performance
- For IPC = 1 we would get:
  - 5 cycles of flush penalty per 100 cycles → 5% loss in performance
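The arithmetic above can be reproduced with a few lines of Python. This is a sketch of the slide's back-of-the-envelope model; the function name `flush_overhead` is made up for illustration, and the overhead is measured relative to the hazard-free cycle count, as on the slide.

```python
def flush_overhead(mpr, branch_fraction, ipc, flush_penalty):
    """Return (MPI, extra cycles per hazard-free cycle) for the slide's model."""
    mpi = mpr * branch_fraction                   # mispredictions per instruction
    instructions_between_flushes = 1 / mpi
    cycles_between_flushes = instructions_between_flushes / ipc
    return mpi, flush_penalty / cycles_between_flushes


for ipc in (2, 1):
    mpi, overhead = flush_overhead(mpr=0.05, branch_fraction=0.20,
                                   ipc=ipc, flush_penalty=5)
    print(f"IPC={ipc}: MPI={mpi:.2%}, flush overhead={overhead:.0%}")
```

The same misprediction rate costs twice as much at IPC = 2 as at IPC = 1, because the wider machine reaches the next mispredicted branch in half the cycles.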

Slide 6: Is Superscalar Good Enough?

- A superscalar processor can fetch, decode, execute, and retire 2 instructions in parallel
  - Only independent instructions can be executed in parallel
- But adjacent instructions are usually dependent
  - The utilization of the second pipe is usually low
  - There are algorithms in which both pipes are highly utilized
- Solution: out-of-order execution
  - Execute instructions based on "data flow" rather than program order
  - Still need to keep the semantics of the original program

Slide 7: Out Of Order Execution

- Look ahead in a window of instructions and find instructions that are ready to execute
  - They do not depend on data from previous instructions that have not yet executed
  - Resources are available
- Out-of-order execution
  - Start executing an instruction before a previous instruction has executed
- Advantages:
  - Helps exploit Instruction Level Parallelism (ILP)
  - Helps cover latencies (e.g., L1 data cache miss, divide)
- Can compilers do the work?
  - Compilers can statically reschedule instructions
  - Compilers do not have run-time information:
    - Conditional branch direction → limited to basic blocks
    - Data values, which may affect calculation time and control
    - Cache miss / hit

Slide 8: Data Flow Analysis

- Example:
    (1) r1 ← r4 / r7    ; assume divide takes 20 cycles
    (2) r8 ← r1 + r2
    (3) r5 ← r5 + 1
    (4) r6 ← r6 - r3
    (5) r4 ← r5 + r6
    (6) r7 ← r8 * r4

[Figure: timelines for in-order execution and out-of-order execution, and the data flow graph with edges labeled r1, r5, r6, r4, r8]
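A minimal sketch of the data flow analysis for this example follows. It makes several assumptions that are not on the slide: every instruction other than the divide takes 1 cycle, execution units are unlimited, only true (RAW) dependencies constrain the out-of-order case, the in-order case starts at most one instruction per cycle, and the second operand of instruction 6 is taken to be r4 (matching the r4 edge in the data flow graph). The name `schedule` is illustrative.

```python
DIV_LATENCY = 20  # from the slide; all other latencies are assumed to be 1 cycle

# (destination, sources, latency in cycles), in program order
PROGRAM = [
    ("r1", ("r4", "r7"), DIV_LATENCY),  # (1)
    ("r8", ("r1", "r2"), 1),            # (2)
    ("r5", ("r5",), 1),                 # (3)
    ("r6", ("r6", "r3"), 1),            # (4)
    ("r4", ("r5", "r6"), 1),            # (5)
    ("r7", ("r8", "r4"), 1),            # (6) second operand assumed to be r4
]


def schedule(program, in_order):
    """Return the cycle in which each instruction finishes."""
    ready_at = {}      # register -> cycle at which its newest value is available
    finish_times = []
    prev_start = -1
    for dest, srcs, latency in program:
        # An instruction can start once all of its source values are ready.
        start = max(ready_at.get(s, 0) for s in srcs)
        if in_order:
            # In-order: also wait until the previous instruction has started.
            start = max(start, prev_start + 1)
        prev_start = start
        finish = start + latency
        ready_at[dest] = finish
        finish_times.append(finish)
    return finish_times


print("in-order:    ", schedule(PROGRAM, in_order=True))
print("out-of-order:", schedule(PROGRAM, in_order=False))
```

Under these assumptions, instructions 3, 4 and 5 complete while the divide is still running in the out-of-order case, so only the chain 1 → 2 → 6 determines the total time.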

Slide 9: OOOE – General Scheme

- Fetch and decode instructions in parallel but in order, to fill the instruction pool
- Execute ready instructions from the instruction pool
  - All the data required by the instruction is ready
  - Execution resources are available
- Once an instruction is executed
  - Signal all dependent instructions that the data is ready
- Commit instructions in parallel but in order
  - An instruction can commit only after all preceding instructions (in program order) have committed

[Figure: Fetch & Decode, Instruction pool, Execute, Retire (commit); labels: In-order, Out-of-order]
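The in-order commit rule can be sketched with a queue that holds instructions in program order. The queue is called `rob` here as shorthand for a reorder buffer, which is an assumed name for the mechanism (the slides only speak of an instruction pool and retirement); `retire_in_order` and the completion order in the usage example are likewise hypothetical.

```python
from collections import deque


def retire_in_order(program_order, completion_order):
    """For each completion event, list the instructions that retire as a result."""
    rob = deque(program_order)   # oldest instruction at the left
    done = set()                 # instructions that finished executing
    retired_per_step = []
    for inst in completion_order:
        done.add(inst)
        retired_now = []
        # Only the oldest instruction may retire, and only once it has executed.
        while rob and rob[0] in done:
            retired_now.append(rob.popleft())
        retired_per_step.append(retired_now)
    return retired_per_step


# Instructions 3, 4, 5 finish early but must wait for 1 and 2 to retire first.
steps = retire_in_order([1, 2, 3, 4, 5, 6], completion_order=[3, 4, 5, 1, 2, 6])
print(steps)
```

Several instructions can retire in the same step, which corresponds to committing in parallel but in order.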

Slide 10: Out Of Order Execution – Example

- Assume that executing a divide operation takes 20 cycles
    (1) r1 ← r5 / r4
    (2) r3 ← r1 + r8
    (3) r8 ← r5 + 1
    (4) r3 ← r7 - 2
    (5) r6 ← r6 + r7
- Inst2 has a RAW dependency on r1 with Inst1
  - It cannot be executed in parallel with Inst1
- Can successive instructions pass Inst2?
  - Inst3 cannot, since Inst2 must read r8 before Inst3 writes to it
  - Inst4 cannot, since it must write to r3 after Inst2
  - Inst5 can

Slide 11: False Dependencies

- OOOE creates new dependencies
  - WAR: write to a register which is read by an earlier instruction
      (1) r3 ← r2 + r1
      (2) r2 ← r4 + 3
  - WAW: write to a register which is written by an earlier instruction
      (1) r3 ← r1 + r2
      (2) r3 ← r4 + 3
- These are false dependencies
  - There is no missing data
  - They still prevent executing instructions out of order
- Solution: Register Renaming
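The three dependency kinds (RAW, WAR, WAW) can be checked mechanically from the destination and source registers of two instructions. The sketch below applies such a check to the five-instruction example with the 20-cycle divide (r1 ← r5 / r4, and so on); the tuple encoding and the name `classify_dependencies` are assumptions made for illustration.

```python
def classify_dependencies(older, younger):
    """Each instruction is (dest, sources); `older` precedes `younger` in program order."""
    o_dest, o_srcs = older
    y_dest, y_srcs = younger
    deps = set()
    if o_dest in y_srcs:
        deps.add("RAW")   # younger reads what older writes (true dependency)
    if y_dest in o_srcs:
        deps.add("WAR")   # younger overwrites a register older still has to read
    if y_dest == o_dest:
        deps.add("WAW")   # both write the same register
    return deps


I1 = ("r1", ("r5", "r4"))   # (1) r1 <- r5 / r4
I2 = ("r3", ("r1", "r8"))   # (2) r3 <- r1 + r8
I3 = ("r8", ("r5",))        # (3) r8 <- r5 + 1
I4 = ("r3", ("r7",))        # (4) r3 <- r7 - 2
I5 = ("r6", ("r6", "r7"))   # (5) r6 <- r6 + r7

for name, inst in [("Inst3", I3), ("Inst4", I4), ("Inst5", I5)]:
    print(name, "vs Inst2:", classify_dependencies(I2, inst) or "independent")
```

Only the RAW case carries data between instructions; the WAR and WAW cases come from register reuse, which is why renaming can remove them.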

Slide 12: Register Renaming

- Hold a pool of physical registers
- Map architectural registers into physical registers
  - Before an instruction can be sent for execution:
    - Allocate a free physical register from the pool
    - The physical register points to the architectural register
  - When an instruction writes a result:
    - Write the result value to the physical register
  - When an instruction needs data from a register:
    - Read the data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
    - If no such instruction exists, read directly from the architectural register
  - When an instruction commits:
    - Move the value from the physical register to the architectural register it points to

Slide 13: OOOE with Register Renaming: Example

         Original           Renamed
    (1)  r1 ← mem1          r1' ← mem1
    (2)  r2 ← r2 + r1       r2' ← r2 + r1'
    (3)  r1 ← mem2          r1'' ← mem2
    (4)  r3 ← r3 + r1       r3' ← r3 + r1''
    (5)  r1 ← mem3          r1''' ← mem3
    (6)  r4 ← r5 + r1       r4' ← r5 + r1'''
    (7)  r5 ← 2             r5' ← 2
    (8)  r6 ← r5 + 2        r6' ← r5' + 2

[Figure annotations: cycle 1 and cycle 2; WAW and WAR dependency arrows]

- Register renaming benefits:
  - Removes false dependencies
  - Removes the architectural limit on the number of registers
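The renaming rules can be sketched as a single pass over the program that keeps, for each architectural register, the name of the physical register holding its newest value. This is a simplified model: the physical register pool is assumed to be unbounded, physical registers are named by appending primes to the architectural name to mirror the example, and `rename` is a hypothetical helper, not a description of real hardware.

```python
def rename(program):
    """program: list of (dest, sources); sources may be non-register tokens such as 'mem1' or '2'."""
    latest = {}     # architectural register -> physical register with its newest value
    versions = {}   # architectural register -> physical registers allocated so far
    renamed = []
    for dest, srcs in program:
        # Read sources first: use the latest writer, else the architectural register itself.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        # Then allocate a fresh physical register for the destination.
        versions[dest] = versions.get(dest, 0) + 1
        phys = dest + "'" * versions[dest]
        latest[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed


PROGRAM = [
    ("r1", ("mem1",)), ("r2", ("r2", "r1")),
    ("r1", ("mem2",)), ("r3", ("r3", "r1")),
    ("r1", ("mem3",)), ("r4", ("r5", "r1")),
    ("r5", ("2",)),    ("r6", ("r5", "2")),
]

for dest, srcs in rename(PROGRAM):
    print(f"{dest} <- {', '.join(srcs)}")
```

After renaming, the three writes to r1 target three different physical registers, so the loads no longer have to wait for each other or for earlier readers of r1.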

Slide 14: Executing Beyond Branches

- So far we do not look for instructions ready to execute beyond a branch
  - Limited to the parallelism within a basic block
  - A basic block is ~5 instructions long
      (1) r1 ← r4 / r7
      (2) r2 ← r2 + r1
      (3) r3 ← r2 - 5
      (4) beq r3,0,300
      (5) r8 ← r8 + 1
    If the beq is predicted not taken (NT), Inst 5 can be speculatively executed
- We would like to look beyond branches
  - But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path?
- Solution: Speculative Execution

Slide 15: Speculative Execution

- Execution of instructions from a predicted (yet unsure) path
  - The path may eventually turn out to be wrong
- Implementation:
  - Hold a pool of all not-yet-executed instructions
  - Fetch instructions into the pool from a predicted path
  - Instructions for which all operands are ready can be executed
  - An instruction may change the processor state (commit) only when it is safe
    - An instruction commits only when all previous (in-order) instructions have committed → instructions commit in order
    - Instructions which follow a branch commit only after the branch commits
    - If a branch prediction is wrong, all the instructions which follow the branch are flushed
- Register renaming helps speculative execution
  - Renamed registers are kept until speculation is verified to be correct

Slide 16: Speculative Execution – Example

         Original               Renamed
    (1)      r1 ← mem1          r1' ← mem1
    (2)      r2 ← r2 + r1       r2' ← r2 + r1'
    (3)      r1 ← mem2          r1'' ← mem2
    (4)      r3 ← r3 + r1       r3' ← r3 + r1''
    (5)      jmp cond L2        predicted taken to L2
    (6) L2:  r1 ← mem3          r1''' ← mem3
    (7)      r4 ← r5 + r1       r4' ← r5 + r1'''
    (8)      r5 ← 2             r5' ← 2
    (9)      r6 ← r5 + 2        r6' ← r5' + 2

[Figure annotations: cycle 1 and cycle 2; WAW and WAR dependency arrows; speculative execution region]

- Instructions 6-9 are speculatively executed
- If the prediction turns out to be wrong:
  - Flush the instructions following the mispredicted branch
  - Restore the renaming table to its pre-flush state
    - Have each architectural register be pointed to by the last physical register that writes to it prior to the speculative section (e.g., r1 → r1'' and not r1 → r1''')
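One way to restore the renaming table is to snapshot it when a branch is predicted and copy the snapshot back on a misprediction. The slides do not say how the restore is implemented, so the snapshot approach, the class `RenameTable`, and its method names are all assumptions made for this sketch.

```python
class RenameTable:
    """Maps each architectural register to the physical register with its newest value."""

    def __init__(self):
        self.latest = {}

    def write(self, arch, phys):
        self.latest[arch] = phys

    def lookup(self, arch):
        # With no in-flight writer, the value comes from the architectural register.
        return self.latest.get(arch, arch)

    def snapshot(self):
        return dict(self.latest)       # copy taken at the predicted branch

    def restore(self, saved):
        self.latest = dict(saved)      # drop all speculative mappings


table = RenameTable()
for arch, phys in [("r1", "r1'"), ("r2", "r2'"), ("r1", "r1''"), ("r3", "r3'")]:
    table.write(arch, phys)            # instructions 1-4

at_branch = table.snapshot()           # instruction 5: predicted taken

for arch, phys in [("r1", "r1'''"), ("r4", "r4'"), ("r5", "r5'"), ("r6", "r6'")]:
    table.write(arch, phys)            # instructions 6-9, speculative
speculative_r1 = table.lookup("r1")

table.restore(at_branch)               # the prediction turned out to be wrong
print(speculative_r1, "->", table.lookup("r1"))
```

After the restore, r1 maps to r1'' again and r5 falls back to the architectural register, since its only writer was on the flushed path.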

Slide 17: OOO Requires Accurate Branch Predictor

- An accurate branch predictor increases the effective scheduling window size
  - Speculate across multiple branches (a branch every 5 – 10 instructions)

[Figure: instruction pool divided by branches; instructions before the first branches have high chances to commit, instructions after many branches have low chances to commit]
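A rough way to quantify the effect of predictor accuracy: if each branch is predicted correctly with probability p, and predictions are treated as independent (a simplifying assumption that is not on the slide), an instruction that sits behind k unresolved branches commits with probability p^k. The helper `branches_within_confidence` and the 50% threshold below are hypothetical.

```python
def branches_within_confidence(accuracy, min_probability):
    """Number of branches that can be speculated across while the chance
    of committing stays at or above `min_probability`."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    depth = 0
    probability = accuracy          # chance that the first branch is correct
    while probability >= min_probability:
        depth += 1
        probability *= accuracy     # one more branch must also be correct
    return depth


for accuracy in (0.90, 0.95, 0.99):
    depth = branches_within_confidence(accuracy, min_probability=0.5)
    print(f"accuracy {accuracy:.0%}: {depth} branches deep, "
          f"about {5 * depth}-{10 * depth} instructions")
```

The instruction estimate in the printout uses the slide's figure of one branch every 5 to 10 instructions.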

Slide 18: Interrupts and Faults Handling

- Complications for pipelined and OOO execution
  - Interrupts occur in the middle of an instruction
  - A speculative instruction can get a fault (divide by 0, page fault)
- Faults are served in program order, at retirement only
  - Mark an instruction that takes a fault at execution
  - Instructions older than the faulting instruction are retired
  - Only when the faulting instruction retires, handle the fault:
    - Flush subsequent instructions
    - Initiate the fault handling code according to the fault type
    - Restart the faulting and/or subsequent instructions
- Interrupts are served when the next instruction retires
  - Let the instruction in the current cycle retire
  - Flush subsequent instructions and initiate the interrupt service code
  - Fetch the subsequent instructions
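The fault rule can be sketched as a retirement loop over instructions held in program order: a fault is only recorded at execution and acted upon when the faulting instruction reaches retirement. The dictionary layout and the name `retire_with_faults` are assumptions made for illustration.

```python
def retire_with_faults(rob):
    """rob: list of dicts in program order, each with 'id', 'done', 'fault'.
    Returns (retired ids, (faulting id, fault type) or None, flushed ids)."""
    retired = []
    for index, entry in enumerate(rob):
        if not entry["done"]:
            return retired, None, []            # oldest instruction not executed yet
        if entry["fault"] is not None:
            # The faulting instruction reached retirement: flush everything younger.
            flushed = [e["id"] for e in rob[index + 1:]]
            return retired, (entry["id"], entry["fault"]), flushed
        retired.append(entry["id"])
    return retired, None, []


rob = [
    {"id": 1, "done": True,  "fault": None},
    {"id": 2, "done": True,  "fault": "divide by 0"},
    {"id": 3, "done": True,  "fault": "page fault"},   # younger; its fault is never served
    {"id": 4, "done": False, "fault": None},
]
print(retire_with_faults(rob))
```

In this example the page fault on instruction 3 is discarded together with the instruction, because the older divide-by-zero is handled first and flushes everything after it.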

Slide 19: Out Of Order Execution – Summary

- Advantages
  - Helps exploit Instruction Level Parallelism (ILP)
  - Helps cover latencies (e.g., cache miss, divide)
  - Superior/complementary to the compiler scheduler
    - Dynamic instruction window
    - Register renaming: can use more registers than the number of architectural registers
- Complex micro-architecture
  - Complex scheduler
  - Requires a reordering mechanism (retirement) in the back-end for:
    - Precise interrupt resolution
    - Misprediction/speculation recovery
    - Memory ordering
- Speculative Execution
  - Advantage: larger scheduling window → reveals more ILP
  - Issues: misprediction cost and misprediction recovery