1 Lecture 5: Overview of Superscalar Techniques
CprE 581 Computer Systems Architecture, Fall 2009, Zhao Zhang
Reading: Textbook Ch. 2.1; “Complexity-Effective Superscalar Processors”, PhD thesis by Subbarao Palacharla, Ch. 1

2 Sequential Execution Model
Any program execution is “correct” if the final architectural state (register and memory contents) is the same as under sequential execution.
A single-cycle implementation is intuitively correct.
If instructions are not executed sequentially, what is “correct” execution?

3 Out-of-order Execution
Compared with a sequential execution, an out-of-order execution may:
1. Fetch and execute instructions that should not be executed
2. Execute instructions in a different order

4 Sequential Execution Model
A program execution is correct if:
1. The same set of instructions writes to user-visible registers and memory;
2. Each instruction receives the same operands as in the sequential execution; and
3. Any register or memory word receives the value of the last write, as in the sequential execution.

5 Dependences and Correctness
Three types of dependences between instructions:
Control dependence
Data dependence
Name dependence
Why do we care about dependences? Processor hardware can observe those dependences, and by handling them correctly, the three correctness conditions will hold.

6 Data Dependence
LD F2,0(R3)
MULTI F0,F2,F4
LD F6,0(R2)
SUBD F8,F6,F2
DIVD F10,F0,F6
ADD F12,F8,F2
(Figure: dependence graph over the nodes LD1, LD2, MULTI, SUBD, DIVD, ADD)
Note: no branch in this code

7 Data Dependences
Instruction J is dependent on instruction I if
I’s output is used by J, or
J is dependent on K, and K is dependent on I
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,LOOP
(Figure: data dependence graph over L.D, ADD.D, S.D, DADDUI, BNE)
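The definition above can be sketched in a few lines of Python (not from the lecture; the encoding of each instruction as a (name, dest, sources) tuple is an assumption for illustration). Tracking the last writer of each register yields exactly the direct true-dependence edges of the graph:

```python
# Sketch: build a RAW (true) dependence graph for straight-line code.
# Each instruction is (name, dest, [sources]); a dest of None means "no
# register result" (stores, branches).
def raw_dependence_graph(insts):
    last_writer = {}  # register -> index of its most recent producer
    edges = set()     # (producer_index, consumer_index)
    for i, (_, dest, srcs) in enumerate(insts):
        for s in srcs:
            if s in last_writer:          # read after write: true dependence
                edges.add((last_writer[s], i))
        if dest is not None:
            last_writer[dest] = i         # this inst is now the last writer
    return edges

# The loop body from the slide (address operands listed as register sources):
code = [
    ("L.D",    "F0", ["R1"]),
    ("ADD.D",  "F4", ["F0", "F2"]),
    ("S.D",    None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNE",    None, ["R1", "R2"]),
]
print(sorted(raw_dependence_graph(code)))
# → [(0, 1), (1, 2), (3, 4)]  i.e. L.D→ADD.D (F0), ADD.D→S.D (F4), DADDUI→BNE (R1)
```

Transitive dependences (the "J depends on K, K depends on I" clause) are then just reachability in this graph.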

8 Data Dependences
Dependences through registers:
ADD r8, r9, r10
BEQ r8, r11, loop
Dependences through memory:
SW r8, 100(r9)
LW r10, 100(r9)
(Figure: register-file datapath carrying the register dependence; memory datapath with Load, ALU, Br, and Store units carrying the memory dependence)

9 Name Dependences
Antidependence (WAR): one instruction overwrites a register or memory location that a prior instruction reads
LW R1, 100(R2)
ADD R2, R3, R4
Output dependence (WAW): two instructions write the same register or memory location
LW R1, 100(R2)
ADD R2, R1, R2
ADD R1, R3, R4
These dependences can be removed
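As a hedged illustration of how the three hazard classes follow from the register sets of an ordered instruction pair, here is a small Python sketch (the (dest, sources) encoding and the helper name are assumptions, not from the lecture):

```python
# Sketch: classify the dependences between I (earlier) and J (later).
# Each instruction is (dest, [sources]); dest may be None.
def classify(i, j):
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    kinds = set()
    if i_dest is not None and i_dest in j_srcs:
        kinds.add("RAW")   # true dependence: J reads what I wrote
    if j_dest is not None and j_dest in i_srcs:
        kinds.add("WAR")   # antidependence: J overwrites what I reads
    if i_dest is not None and i_dest == j_dest:
        kinds.add("WAW")   # output dependence: both write the same location
    return kinds

# The slide's examples:
print(classify(("R1", ["R2"]), ("R2", ["R3", "R4"])))  # LW R1,100(R2); ADD R2,R3,R4
# → {'WAR'}
print(classify(("R1", ["R2"]), ("R1", ["R3", "R4"])))  # LW R1,…; ADD R1,R3,R4
# → {'WAW'}
```

Only RAW is a true dependence; WAR and WAW are name dependences and can be removed by renaming, as the later slides show.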

10 Dependences vs. Hazards
Dependences are properties of programs
Hazards are properties of pipelines
Dependences indicate the potential for hazards; the pipeline implementation determines the actual hazards and the length of any stall
What hazards are exposed by the MIPS 5-stage pipeline?

11 Dynamic Scheduling
General idea: when an instruction stalls, look for independent instructions following it
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
Instruction window: how far to look ahead
Out-of-order execution must respect data dependences
What hazards would be exposed?
(Figure: dependence graph — ADD.D depends on DIV.D; SUB.D is independent)

12 Data Dependence between Operations
ALU to ALU:
SUBD F8,F6,F2
ADD F6,F8,F2
SUBD: IF ID EX WB
ADD:     IF ID -- EX WB   (stall until SUBD's result is available)
Load and other insts:
LD F2,0(R3)
MULTI F0,F2,F4
LD:    IF ID EX MEM WB
MULTI:    IF ID --  EX WB   (stall until the load's result is available)

13 Dependences between Operations
Store to load: // R3+100 == R4?
S.D F6,100(R3)
L.D F2,0(R4)
S.D: IF ID EX MEM WB
L.D:    IF ID EX --  MEM
Register dependences can be detected by matching register indices; detecting memory dependences is more difficult
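The difficulty is that the dependence exists only if the two effective addresses match, which is not known until the address arithmetic is done. A minimal sketch, assuming illustrative register contents (R3 = 100 and R4 = 200, so R3+100 == R4 as the slide's comment asks):

```python
# Sketch: a store and a later load conflict only if their effective
# addresses match. Register values here are assumptions for illustration.
regs = {"R3": 100, "R4": 200}

def effective_addr(base_reg, offset):
    # base-plus-offset addressing, computed in the EX stage
    return regs[base_reg] + offset

store_addr = effective_addr("R3", 100)   # S.D F6,100(R3)
load_addr  = effective_addr("R4", 0)     # L.D F2,0(R4)
print(store_addr == load_addr)
# → True: the addresses alias, so the load must wait for (or forward from) the store
```

A register dependence, by contrast, is visible from the instruction encoding alone: the index comparison needs no computed values.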

14 Dynamic Scheduling
L.D F2,0(R3)
MULTI F0,F2,F4
L.D F6,0(R2)
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F12,F8,F2
(Figure: dependence graph over LD1, LD2, MULTI, SUBD, DIVD, ADD)
How to schedule pipeline operations?

15 Is This Working?
Inst  | IF | ID | Schd | EXE | MEM | WB
L.D   |
MULT  |
L.D   |
SUB.D |
DIV.D |
ADD.D |
Assume (1) two-way issue; (2) FU delays as implied

16 Dynamic Scheduling Implementation
Scoreboarding: introduced in the CDC 6600 (1964)
Tomasulo's algorithm: three years later, in the IBM 360/91
Introduced register renaming
Uses tag-based instruction wakeup
(Figure: wakeup/select logic — entries I1, I2, I3, …, I_k feed SELECT, and selected instructions go to the FUs)
Adapted from UCB CS252 S98, Copyright 1998 UCB
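Tag-based wakeup can be sketched as a tag broadcast over issue-queue entries: each entry waits on the tags of its source operands, and a completing producer broadcasts its destination tag to all entries. The class and tag names below are illustrative assumptions, not the 360/91 design:

```python
# Sketch of tag-based wakeup: each issue-queue entry tracks the source
# tags it is still waiting on; a broadcast marks matching operands ready.
class IQEntry:
    def __init__(self, name, src_tags):
        self.name = name
        self.waiting = set(src_tags)   # tags not yet produced

    def wakeup(self, tag):
        self.waiting.discard(tag)      # operand becomes ready on a tag match

    def ready(self):
        return not self.waiting        # all operands available

queue = [IQEntry("ADD.D", {"P6"}), IQEntry("SUB.D", {"P7", "P8"})]

for tag in ["P6", "P7"]:               # producers finish and broadcast
    for entry in queue:
        entry.wakeup(tag)

print([e.name for e in queue if e.ready()])
# → ['ADD.D']  (SUB.D still waits on P8)
```

The select logic then picks among the ready entries, subject to functional-unit availability; in hardware the tag match is a CAM comparison done in every entry in parallel.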

17 Name Dependences and Register Renaming
Original code:
ADD R3, R1, R2
SUB R4, R4, R3
ADD R3, R6, R7
SUB R3, R3, R4
What prevents parallelism?
Renamed code (destinations R3, R4, R3, R3 renamed to P6, P7, P8, P9 in order):
ADD P6, R1, R2
SUB P7, R4, P6
ADD P8, R6, R7
SUB P9, P8, P7
Finally R3 <= P9, R4 <= P7
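The renaming on this slide can be reproduced mechanically: every destination gets a fresh physical register, and every source reads the current mapping. A Python sketch (starting the free list at P6 to match the slide; the tuple encoding is an assumption):

```python
# Sketch: register renaming for straight-line code.
# Each instruction is (op, dest, src1, src2).
def rename(insts, first_phys=6):
    mapping = {}                       # arch reg -> current phys reg
    out, next_p = [], first_phys
    for op, dest, s1, s2 in insts:
        r1 = mapping.get(s1, s1)       # sources read the current mapping
        r2 = mapping.get(s2, s2)
        p = f"P{next_p}"               # fresh dest removes WAR and WAW
        next_p += 1
        mapping[dest] = p
        out.append((op, p, r1, r2))
    return out, mapping

code = [("ADD", "R3", "R1", "R2"),
        ("SUB", "R4", "R4", "R3"),
        ("ADD", "R3", "R6", "R7"),
        ("SUB", "R3", "R3", "R4")]
renamed, final_map = rename(code)
for inst in renamed:
    print(inst)
print(final_map)
# → {'R3': 'P9', 'R4': 'P7'}: "finally R3 <= P9, R4 <= P7" as on the slide
```

After renaming, only the true dependences remain (P6 feeds the first SUB, P7 and P8 feed the last), so the two ADDs can execute in parallel.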

18 Register Renaming and Correctness
1. The same set of instructions writes to user-visible registers and memory;
2. Each instruction receives the same operands as in the sequential execution; and
3. Any register or memory word receives the value of the last write, as in the sequential execution.

19 Renaming Implementation
First proposed by Tomasulo (1967):
Uses a register status table
Registers are renamed to reservation stations
In the Pentium III:
Uses a register alias table
Architectural registers are renamed to physical registers
Data is copied back to the architectural registers
In other processors (e.g. Alpha 21264, Intel Pentium 4):
Use a register mapping table
No separate architectural/physical register files; no copy-back
(Figure: renaming maps Rd, Rs, Rt to Pd, Ps, Pt)

20 Branch Prediction and Speculative Execution
Modern processors must speculate!
Branch prediction: SPEC2K INT has one branch per seven instructions!
Precise interrupts
Memory disambiguation
More performance-oriented speculations
Two distinct but connected issues:
1. How to make the best prediction
2. What to do when the speculation is wrong

21 Branch Prediction and Speculative Execution
Review the three conditions of correctness:
1. The processor commits the same set of instructions as executed by a sequential processor
2. Any committed instruction receives the same operands (from its parents) as in the sequential execution
3. Any register or memory word receives the value of the last write from the committed instructions, as in the sequential execution

22 Branch Prediction and Speculative Execution
Branch prediction – control speculation
Must predict on branches
What to predict:
Branch direction
Branch target address
What info can be used:
PC value
Previous branch outcomes (complex branch predictors also use branch history patterns)
What building blocks are needed:
Branch history table (BHT), branch target buffer (BTB), pattern registers, and some logic
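As an illustrative sketch (the table size, starting state, and PC value are assumptions, not from the lecture), here is a BHT of 2-bit saturating counters indexed by low-order PC bits:

```python
# Sketch: branch history table of 2-bit saturating counters.
# Counter states 0,1 predict not-taken; states 2,3 predict taken.
class BHT:
    def __init__(self, entries=1024):
        self.entries = entries
        self.ctr = [1] * entries          # start weakly not-taken

    def index(self, pc):
        return (pc >> 2) % self.entries   # drop byte offset, keep low bits

    def predict(self, pc):
        return self.ctr[self.index(pc)] >= 2   # True means "taken"

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.ctr[i] = min(3, self.ctr[i] + 1)
        else:
            self.ctr[i] = max(0, self.ctr[i] - 1)

bht = BHT()
pc = 0x4000
correct = 0
for taken in [True, True, False, True, True]:   # a mostly-taken loop branch
    if bht.predict(pc) == taken:
        correct += 1
    bht.update(pc, taken)
print(correct)
# → 3: the 2-bit counter absorbs the single not-taken outcome without
#      flipping its prediction to not-taken
```

A direction predictor alone is not enough to redirect fetch, which is why a BTB is also listed: it supplies the predicted target address in the same cycle.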

23 Generic Superscalar Processor Models
Issue-queue based: Fetch → Rename → Wakeup/Select (schedule) → Regfile → FU / bypass → D-cache; execute, then commit
Reservation-station based: Fetch → Rename → Reg → ROB with Wakeup/Select → FU / bypass → D-cache; execute, then commit
Revised from Palacharla's PhD thesis, 1998