Published by Tyson Sweeting
Modified about 1 year ago
Lecture 9 – OOO execution © Avi Mendelson, 5/
MAMAS – Computer Architecture
Lecture 9 – Out Of Order (OOO)
Dr. Avi Mendelson
Some of the slides were taken from: (1) Lihu Rapoport (2) Randy Katz and (3) Patterson
Agenda
  – Introduction
  – Data flow execution
  – Out of order execution – principles
  – Pentium-(II, III and 6) example
Introduction
The goal is to increase the performance of a system. To achieve that, so far we have discussed:
  – Adding pipe stages in order to increase the clock rate
  – Increasing the potential parallelism within the pipe by fetching, decoding and executing several instructions in parallel
Are there any other options?
Data flow vs. conventional computers
In theory, “data flow machines” have the best performance:
  – They view the program as a set of parallel operations waiting to be executed (demonstrated on the next slide)
  – They execute an instruction as soon as its inputs are ready
So why are computers von Neumann based and not data-flow based?
  – Data-flow programs are hard to debug
  – Data-flow programs are hard to write (they need a special programming language in order to be efficient)
Data flow execution – a different approach for high performance computers
Data flow execution is an alternative to von Neumann execution. Here, the instructions are executed in the order of their input dependencies and not in the order they appear in the program.
Example: assume that we have as many execution units as we need:
  (1) r1 <- r4 / r7
  (2) r8 <- r1 + r2
  (3) r5 <- r5 + 1
  (4) r6 <- r6 - r3
  (5) r4 <- r5 + r6
  (6) r7 <- r8 * r
[Figure: Data Flow Graph]
We could execute it in 3 cycles.
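The 3-cycle claim can be checked with a small sketch. It is not part of the original slides: it assumes unit latency for every operation and unlimited execution units, and it tracks only true (RAW) dependencies. Instruction (6) appears with its second source operand cut off in the text, so the sketch models only its known source r8; any register produced by instructions (1) to (5) is available by the end of cycle 2, so the omitted operand cannot push instruction (6) past cycle 3. The names `PROGRAM` and `dataflow_levels` are illustrative.

```python
# Program from the slide as (destination, sources). Only RAW (true)
# dependencies are tracked, as in a pure data flow machine.
PROGRAM = [
    ("r1", ["r4", "r7"]),  # (1) r1 <- r4 / r7
    ("r8", ["r1", "r2"]),  # (2) r8 <- r1 + r2
    ("r5", ["r5"]),        # (3) r5 <- r5 + 1
    ("r6", ["r6", "r3"]),  # (4) r6 <- r6 - r3
    ("r4", ["r5", "r6"]),  # (5) r4 <- r5 + r6
    ("r7", ["r8"]),        # (6) r7 <- r8 * (operand truncated in the source, omitted)
]


def dataflow_levels(program):
    """Return the earliest cycle (1-based) in which each instruction can run.

    Assumes unit latency and unlimited execution units.
    """
    last_writer = {}  # register -> index of the latest earlier writer
    levels = []
    for index, (dest, sources) in enumerate(program):
        # An instruction waits only for earlier instructions that produce its inputs.
        producers = [last_writer[s] for s in sources if s in last_writer]
        level = 1 + max((levels[p] for p in producers), default=0)
        levels.append(level)
        last_writer[dest] = index
    return levels


levels = dataflow_levels(PROGRAM)
total_cycles = max(levels)
print(levels, total_cycles)
```

Under these assumptions, instructions (1), (3) and (4) run in the first cycle, (2) and (5) in the second, and (6) in the third.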
Data flow execution (cont.)
Can we build a machine that will execute the “data flow graph”?
In the early 70s, several machines were built to work according to the data-flow graph. They were called “data flow machines”. They vanished for the reasons we mentioned before.
Solution: let the user think he/she is using a von Neumann machine, and let the system work in “data-flow mode”.
OOOE – General Scheme
[Figure: Fetch & Decode (in-order), Instruction pool, Execute (out-of-order), Retire (commit) (in-order)]
Most modern computers use OOO execution. Most of them fetch and retire IN-ORDER, but execute OUT-OF-ORDER.
Out Of Order Execution
Basic idea:
  – The fetch is done in program order (in-order), fast enough to fill a window of instructions.
  – Out of the instruction window, the system forms a data flow graph and looks for instructions that are ready to be executed:
      All the data the instruction depends on is ready
      Resources are available
  – As soon as an instruction is executed, it signals all the instructions that depend on it that it has generated a new input.
  – The instructions are committed in program order to preserve the “user view”.
Advantages:
  – Helps exploit Instruction Level Parallelism (ILP)
  – Helps cover latencies (e.g., cache miss, divide)
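A minimal simulation sketch of this idea follows. It is an illustration, not the mechanism of any particular processor: it assumes the registers are already renamed (so only RAW dependencies exist), that the whole program fits in the window, that up to `units` instructions can start per cycle with the oldest ready ones chosen first, and that any number of instructions may commit per cycle. The program, the latencies and the name `simulate_ooo` are all hypothetical.

```python
def simulate_ooo(program, latency, units=2):
    """Return (start_cycles, commit_cycles) for a list of (op, dest, sources)."""
    # Build the data flow graph: for each instruction, the indices it waits for.
    producer = {}
    deps = []
    for index, (op, dest, sources) in enumerate(program):
        deps.append([producer[s] for s in sources if s in producer])
        producer[dest] = index

    count = len(program)
    start = [None] * count
    done = [None] * count  # last cycle in which the instruction is executing
    cycle = 0
    while any(s is None for s in start):
        cycle += 1
        issued = 0
        for i in range(count):  # oldest first
            if issued == units:
                break
            inputs_ready = all(done[d] is not None and done[d] < cycle for d in deps[i])
            if start[i] is None and inputs_ready:
                start[i] = cycle
                done[i] = cycle + latency[program[i][0]] - 1
                issued += 1

    # Commit in program order: never before an older instruction has committed.
    commit = []
    previous = 0
    for i in range(count):
        previous = max(previous, done[i] + 1)
        commit.append(previous)
    return start, commit


# Hypothetical renamed program: a long divide followed by independent work.
program = [
    ("div", "t1", ["r4", "r7"]),
    ("add", "t2", ["t1", "r2"]),  # must wait for the divide
    ("add", "t3", ["r5"]),        # independent of the divide
    ("sub", "t4", ["r6", "r3"]),  # independent of the divide
    ("add", "t5", ["t3", "t4"]),
]
latency = {"div": 4, "add": 1, "sub": 1}
start, commit = simulate_ooo(program, latency, units=2)
print(start, commit)
```

In this run the three instructions that do not depend on the divide start in cycles 1 to 3, while the divide is still executing, yet none of them commits before the divide does in cycle 5. This shows both the latency hiding and the in-order commit described above.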
How to convert “in-order” instruction flow into “data flow”
The problems:
  1. Data flow has only RAW dependencies, while OOOE also has WAR and WAW dependencies
  2. How to guarantee in-order completion
The solutions:
  1. Register renaming (based on Tomasulo's algorithm) solves the WAR and WAW dependencies
  2. We need to “enumerate” the instructions at decode time (in order) so we know in what order to retire them
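To make the three dependency types concrete, here is a small sketch that classifies them for a short instruction sequence. It is a simplification for illustration: it compares every pair of instructions and does not account for an intervening write ending a dependence. The function name `find_hazards` and the three-instruction program are assumptions, not taken from the slides.

```python
def find_hazards(program):
    """Classify dependencies between every pair of instructions.

    program is a list of (destination, sources) in program order.
    Returns tuples (earlier, later, kind, register) with 1-based numbering.
    """
    hazards = []
    for i, (dest_i, sources_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, sources_j = program[j]
            if dest_i in sources_j:   # later reads what earlier wrote
                hazards.append((i + 1, j + 1, "RAW", dest_i))
            if dest_j in sources_i:   # later overwrites what earlier reads
                hazards.append((i + 1, j + 1, "WAR", dest_j))
            if dest_i == dest_j:      # both write the same register
                hazards.append((i + 1, j + 1, "WAW", dest_i))
    return hazards


# Hypothetical example program.
program = [
    ("r1", ["mem1"]),      # (1) r1 <- mem1
    ("r2", ["r2", "r1"]),  # (2) r2 <- r2 + r1
    ("r1", ["mem2"]),      # (3) r1 <- mem2
]
hazards = find_hazards(program)
print(hazards)
```

Only the RAW dependence between (1) and (2) carries data. The WAW and WAR dependences on r1 exist only because the register name is reused, which is what renaming removes.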
Register Renaming
Hold a pool of physical registers. Architectural registers are mapped into physical registers.
  – When an instruction writes to an architectural register:
      A free physical register is allocated from the pool
      The physical register points to the architectural register
      The instruction writes the value to the physical register
  – When an instruction reads from an architectural register:
      It reads the data from the latest instruction that writes to the same architectural register and precedes the current instruction
      If no such instruction exists, it reads directly from the architectural register
  – When an instruction commits:
      It moves the value from the physical register to the architectural register it points to
OOOE with Register Renaming: Example
      Before renaming      After renaming
(1)   r1 <- mem1           t1 <- mem1
(2)   r2 <- r2 + r1        t2 <- r2 + t1
(3)   r1 <- mem2           t3 <- mem2
(4)   r3 <- r3 + r1        t4 <- r3 + t3
(5)   r1 <- mem3           t5 <- mem3
(6)   r4 <- r5 + r1        t6 <- r5 + t5
(7)   r5 <- 2              t7 <- 2
(8)   r6 <- r5 + 2         t8 <- t7 + 2
[Figure labels: WAW and WAR dependencies marked on the code before renaming]
After renaming, all the false dependencies (WAW and WAR) are removed.
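The renaming in this example can be reproduced with a short sketch. It is a simplified illustration of the rules on the previous slide: every instruction gets a fresh name t1, t2, ... (an unlimited pool is assumed), a mapping table records the latest name for each architectural register, and a source with no in-flight producer keeps its architectural name. The function name `rename` and the tuple encoding are assumptions.

```python
def rename(program):
    """Rename destinations to fresh tags t1, t2, ... and redirect sources.

    program is a list of (destination, sources) in program order; sources
    may be registers, memory operands or immediates, all given as strings.
    """
    mapping = {}  # architectural register -> latest tag that writes it
    renamed = []
    for number, (dest, sources) in enumerate(program, start=1):
        # Read from the latest earlier writer, or keep the original operand.
        new_sources = [mapping.get(s, s) for s in sources]
        tag = f"t{number}"
        mapping[dest] = tag
        renamed.append((tag, new_sources))
    return renamed


before = [
    ("r1", ["mem1"]),
    ("r2", ["r2", "r1"]),
    ("r1", ["mem2"]),
    ("r3", ["r3", "r1"]),
    ("r1", ["mem3"]),
    ("r4", ["r5", "r1"]),
    ("r5", ["2"]),
    ("r6", ["r5", "2"]),
]
after = rename(before)
for tag, sources in after:
    print(tag, "<-", " + ".join(sources))
```

The printed result matches the “After renaming” column: the three writes to r1 become t1, t3 and t5, so the loads no longer conflict with each other or with the adds that read the earlier values.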
Executing Beyond Branches
The scheme we saw so far does not search beyond a branch.
It is limited to the parallelism within a basic block.
  – A basic block is ~5 instructions long
      (1) r1 <- r4 / r7
      (2) r2 <- r2 + r1
      (3) r3 <- r2 - 5
      (4) beq r3,0,300
      (5) r8 <- r8 + 1
    If the beq is predicted NT (not taken), inst 5 can be speculatively executed.
We would like to look beyond branches.
  – But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path?
Solution: Speculative Execution
Speculative Execution
Execution of instructions from a predicted (yet unsure) path. Eventually, the path may turn out to be wrong.
Implementation:
  – Hold a pool of all “not yet executed” instructions
  – Fetch instructions into the pool from a predicted path
  – Instructions for which all operands are “ready” can be executed
  – An instruction may change the processor state (commit) only when it is safe
      An instruction commits only when all previous (in-order) instructions have committed
      Instructions commit in-order
      Instructions which follow a branch commit only after the branch commits
      If a predicted branch is wrong, all the instructions which follow it are flushed
Register renaming helps speculative execution
  – Renamed registers are kept until speculation is verified to be correct
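The commit rules above can be sketched with a small in-order buffer. This is a hedged illustration, not a description of real hardware: it assumes the misprediction is acted on when the branch reaches the head of the buffer and commits, and the class name `ReorderBuffer`, its methods and the instruction labels are all hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Instructions enter in program order and commit only from the head."""

    def __init__(self):
        self.entries = deque()
        self.committed = []

    def allocate(self, name, is_branch=False):
        entry = {"name": name, "done": False,
                 "is_branch": is_branch, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, mispredicted=False):
        # Execution may finish in any order.
        entry["done"] = True
        entry["mispredicted"] = mispredicted

    def retire(self):
        """Commit finished instructions in order; return the flushed names."""
        flushed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            self.committed.append(head["name"])
            if head["is_branch"] and head["mispredicted"]:
                # Everything younger than the branch came from the wrong path.
                flushed = [e["name"] for e in self.entries]
                self.entries.clear()
        return flushed


rob = ReorderBuffer()
names = ["div", "add", "sub", "beq", "spec_add"]
entries = [rob.allocate(n, is_branch=(n == "beq")) for n in names]

rob.complete(entries[4])          # the speculative instruction finishes first
flushed_early = rob.retire()      # nothing commits: the head is not done yet
committed_early = list(rob.committed)

for entry in entries[:3]:
    rob.complete(entry)
rob.complete(entries[3], mispredicted=True)
flushed_late = rob.retire()
print(rob.committed, flushed_late)
```

The speculative instruction finishes first but never changes the committed state: it waits behind the branch and is discarded once the branch turns out to be mispredicted.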
The magic of the modern X86 architectures (Intel, AMD, etc.)
The user view of the X86 machine is a CISC architecture. The machine supports this view by keeping the in-order parts as close as possible to the X86 view.
While moving from the in-order part (front-end) to the OOO part (execution), the hardware translates each X86 instruction into a set of uops, which are the internal machine operations. These operations are RISC-like (load-store based).
During this translation, the hardware performs the register renaming, so at execution time it uses internal registers and not the X86 ones. The number of these registers can change from one generation to another.
While moving back from the OOO part (execution) to the in-order part (commit), the hardware translates the registers back to the X86 registers, in order to keep a coherent picture for the user.
OOOE Architecture: based on Pentium-II
[Figure: block diagram with the Bus Interface Unit, instruction cache, data cache, IFU (Instr. Fetch), ID (Instr. Decode and rename), RAT, ROB, RS (arithmetic operations), MOB (load/store operations), retire (commit) logic and the write back bus]
In-order front-end
  1. Fetch from the instruction cache, based on branch prediction.
  2. Decode and rename:
      Translate to uops
      Use the RAT table for renaming
      Put ALL instructions in the ROB
      Put all “arithmetic instructions” in the RS queue
      Put all load/store instructions in the MOB
Out-of-order execution
  3. Do in parallel:
      Load and store operations are executed based on MOB information
      Arithmetic operations are executed based on RS information
  4. All results are written back to the ROB, while the RS and MOB “steal” the values they need
In-order completion (retirement)
  5. The retire logic (commit logic) moves instructions out of the ROB and updates the architectural registers
Re-order Buffer (ROB)
Mechanism for keeping the in-order view of the user.
Basic ROB functions:
  – Provides a large physical register space for register renaming
  – Keeps intermediate results; some of them may never be committed if the branch prediction was wrong (we will discuss this mechanism later on)
  – Keeps the information on which “real register” the commit needs to update
Reservation station (RS)
Pool of all “not yet executed” uops
  – Holds both the uop attributes and the values of the input data
For each operand, it keeps an indication of whether it is ready
  – An operand that needs to be retrieved from the RRF is always ready
  – An operand that waits for another uop to generate its value will “listen” to the WB bus. When the value appears on the bus (the value is always associated with the ROB number it needs to update), all RS entries that need to consume this value “steal” it from the bus and mark the input as ready (this is done in parallel to the ROB update).
  – Uops whose operands are all ready can be dispatched for execution
  – The dispatcher chooses which of the ready uops to execute next. It can also do “forwarding”, i.e., schedule the instruction in the same cycle the information is written to the RS entry.
As soon as a uop completes its execution, it is deleted from the RS.
If the RS is full, it stalls the decoder.
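The wake-up mechanism can be sketched as follows. This is a simplified model under stated assumptions: operands are either a captured value or a pending ROB tag, the dispatcher picks the oldest ready uop, and the entry is removed at dispatch rather than at completion. The class `ReservationStation`, its method names and the register values are hypothetical.

```python
class ReservationStation:
    """Pool of waiting uops; operands are captured from a broadcast bus."""

    def __init__(self):
        self.entries = []  # kept in allocation order

    def insert(self, dest_tag, op, operands):
        # Each operand is ("value", v) if already ready,
        # or ("tag", rob_id) if it waits for another uop's result.
        sources = [
            {"ready": kind == "value",
             "value": payload if kind == "value" else None,
             "tag": payload if kind == "tag" else None}
            for kind, payload in operands
        ]
        self.entries.append({"dest": dest_tag, "op": op, "sources": sources})

    def broadcast(self, tag, value):
        """A result appears on the write back bus, labeled with its ROB tag."""
        for entry in self.entries:
            for source in entry["sources"]:
                if not source["ready"] and source["tag"] == tag:
                    source["value"] = value  # capture the value from the bus
                    source["ready"] = True

    def dispatch(self):
        """Remove and return the oldest uop whose operands are all ready."""
        for entry in self.entries:
            if all(source["ready"] for source in entry["sources"]):
                self.entries.remove(entry)
                return entry
        return None


rs = ReservationStation()
rs.insert("RB1", "mov", [("value", 7)])                  # RB1 <- R3 (R3 = 7, assumed)
rs.insert("RB2", "add", [("tag", "RB0"), ("value", 5)])  # RB2 <- RB0 + R0 (R0 = 5, assumed)

first = rs.dispatch()       # the move is ready immediately
second = rs.dispatch()      # the add still waits for RB0
rs.broadcast("RB0", 10)     # a load result arrives on the bus (value assumed)
third = rs.dispatch()       # now the add can go
print(first["dest"], second, [s["value"] for s in third["sources"]])
```

Capturing values in the RS entry itself is what lets a waiting uop proceed without re-reading a register file once its last operand arrives.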
Memory Order Buffer (MOB)
Goal
  – Manages the load and store operations. If possible, it allows out-of-order execution among memory operations.
Structure similar in concept to the ROB
  – Every memory uop allocates a new entry in order. The address needs to be updated when known.
Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
  – store r1,a; load r2,b       the load can advance before the store
  – store r1,[r3]; load r2,b    the load should wait until r3 is known
In most modern processors, loads may pass loads/stores, but stores must be executed in order (among stores).
For simplicity, this course assumes that all MOB operations are executed in order.
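The two store/load cases above can be expressed as a conservative check. This sketch is an assumption-laden illustration: addresses are plain integers, an unknown store address is represented by `None`, and a load that matches an older store simply waits (store-to-load forwarding is not modeled). The function name `load_may_bypass` and the addresses are hypothetical.

```python
def load_may_bypass(load_address, older_store_addresses):
    """Decide whether a load may execute before the older, pending stores.

    older_store_addresses holds one entry per older store that has not yet
    written memory; None means the store address is not known yet.
    """
    for store_address in older_store_addresses:
        if store_address is None:
            return False  # unknown address: it might alias, so wait
        if store_address == load_address:
            return False  # same location: the load needs the store's data
    return True


A, B = 0x1000, 0x2000  # hypothetical addresses of a and b

known_different = load_may_bypass(B, [A])   # store r1,a; load r2,b
unknown_store = load_may_bypass(B, [None])  # store r1,[r3]; load r2,b, r3 unknown
same_address = load_may_bypass(B, [B])      # the store writes b itself
print(known_different, unknown_store, same_address)
```

A real design may speculate instead of waiting in the unknown-address case; the check here takes the conservative choice described on the slide.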
An example of OOO Execution
[Figure sequence: eight frames of one diagram with the blocks Instruction Q, RAT (R0, R1, R2, R3), MOB, RS, ROB, Execute and Retire]
Instruction Q: LD R1,X; R2 <- R3; R1 <- R1+R0; I4; I5
  – LD R1,X is allocated: the RAT maps R1 to RB0 and MOB entry M0 holds LD RB0,X (“Takes 3 cycles”)
  – R2 <- R3 is allocated: the RAT maps R2 to RB1 and RS entry RS0 holds RB1 <- R3
  – R1 <- R1+R0 is allocated: the RAT maps R1 to RB2 and RS entry RS1 holds RB2 <- RB0+R0
  – R2 <- R3 is marked O.K (“Got the value now”), while RB2 <- RB0+R0 waits (“Cannot execute since the data is not ready yet”)
  – LD R1,X is marked OK; I4 and I5 enter the RS (RS2, RS3)
  – In the last frame I4, I5 and I6 occupy the RS entries (rs2, rs3, rs0)
Backup