Superscalar processors Review

Dependence graph
Nodes: instructions
Edges: ordered relations among the instructions (e.g., S1 → S2)
Any ordering-based transformation that does not change the dependencies of the program is guaranteed not to change the result of the program.
Example:
S1: Load R1, A    /* R1 ← Memory(A) */
S2: Add R2, R1    /* R2 ← R2 + R1 */

Data dependency
Flow dependence: statement S2 uses a variable computed by statement S1; S1 must store/send the variable before S2 fetches it (S1 → S2, read-after-write).
Output dependence: S1 and S2 both compute the same variable, and S2's value must be stored/sent after S1's (write-after-write).
Antidependence: S1 reads from a location into which S2 later stores, so S1 must read before S2 writes (write-after-read).
Example:
S1: Load R1, A     /* R1 ← Memory(A) */
S2: Add R2, R1     /* R2 ← R2 + R1 */
S3: Move R1, R3    /* R1 ← R3 */
S4: Store B, R1    /* Memory(B) ← R3 */
[Dependence graph figure for S1–S4]
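The three dependence kinds can be recovered mechanically from each instruction's register read and write sets. The short Python sketch below is not from the slides; the read/write sets are written out by hand for the example above. It is a naive pairwise check, so it also reports S1 → S4 even though S3 redefines R1 in between.

# A naive pairwise dependence classifier for the example above.
# Register read/write sets for each statement are written out by hand.

instructions = {
    "S1": {"writes": {"R1"}, "reads": set()},          # Load  R1, A
    "S2": {"writes": {"R2"}, "reads": {"R2", "R1"}},   # Add   R2, R1
    "S3": {"writes": {"R1"}, "reads": {"R3"}},         # Move  R1, R3
    "S4": {"writes": set(),  "reads": {"R1"}},         # Store B, R1
}
order = ["S1", "S2", "S3", "S4"]

def dependences(order, instructions):
    """Yield (earlier, later, kind) for every pairwise register dependence."""
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            wa, ra = instructions[a]["writes"], instructions[a]["reads"]
            wb, rb = instructions[b]["writes"], instructions[b]["reads"]
            if wa & rb:
                yield a, b, "flow (RAW)"
            if wa & wb:
                yield a, b, "output (WAW)"
            if ra & wb:
                yield a, b, "anti (WAR)"

for a, b, kind in dependences(order, instructions):
    print(f"{a} -> {b}: {kind}")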

Example
How long would the following sequence of instructions take to execute on a superscalar processor with two execution units, each of which can execute any instruction? Load operations have a latency of two cycles, and all other operations have a latency of one cycle. Assume that the pipeline depth is 5 stages.
LD  r1, (r2)
ADD r3, r1, r4
SUB r5, r6, r7
MUL r8, r9, r10

Example (cont.)
In-order execution
There are five pipeline stages, and a load has a latency of 2 clock cycles.
The pipeline stages are Fetch, Decode, Execute, Memory access, and Write back.
The total number of cycles is 8.
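One plausible cycle-by-cycle schedule consistent with this total (an assumption, since the original figure is not reproduced here: the load's result becomes available to dependent instructions after its Memory-access stage, and in-order issue prevents SUB and MUL from bypassing the stalled ADD):

                 1    2    3    4    5    6    7    8
LD  r1, (r2)     IF   ID   EX   MEM  WB
ADD r3, r1, r4   IF   ID   --   --   EX   MEM  WB
SUB r5, r6, r7        IF   ID   --   EX   MEM  WB
MUL r8, r9, r10       IF   ID   --   --   EX   MEM  WB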

Example (cont.)
Out-of-order execution
There are five pipeline stages, and a load has a latency of 2 clock cycles.
The pipeline stages are Fetch, Decode, Execute, Memory access, and Write back.
The total number of cycles is 7.
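One plausible schedule for the out-of-order case, under the same assumptions; the independent SUB and MUL now issue ahead of the stalled ADD, using both execution units in cycle 4:

                 1    2    3    4    5    6    7
LD  r1, (r2)     IF   ID   EX   MEM  WB
ADD r3, r1, r4   IF   ID   --   --   EX   MEM  WB
SUB r5, r6, r7        IF   ID   EX   MEM  WB
MUL r8, r9, r10       IF   ID   EX   MEM  WB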

Register renaming
On an out-of-order superscalar processor with 8 execution units, what is the execution time of the following sequence with and without register renaming, if any execution unit can execute any instruction and the latency of all instructions is one cycle? Assume that the hardware register file contains enough registers to remap each destination register to a different hardware register and that the pipeline depth is 5 stages.
LD  r7, (r8)
MUL r1, r7, r2
SUB r7, r4, r5
ADD r9, r7, r8
LD  r8, (r12)
DIV r10, r8, r10

Solution
In this example, WAR dependencies are a significant limitation on parallelism, forcing the DIV to issue 3 cycles after the first LD, for a total execution time of 8 cycles (the MUL and the SUB can execute in parallel, as can the ADD and the second LD). After register renaming, the program becomes
LD  hw7, (hw8)
MUL hw1, hw7, hw2
SUB hw17, hw4, hw5
ADD hw9, hw17, hw8
LD  hw18, (hw12)
DIV hw10, hw18, hw10
(Again, all of the renaming register choices are arbitrary.) With register renaming, the program has been broken into three sets of two dependent instructions (LD and MUL, SUB and ADD, LD and DIV). The SUB and the second LD instruction can now issue in the same cycle as the first LD. The MUL, ADD, and DIV instructions all issue in the next cycle, for a total execution time of 6 cycles.
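A minimal renaming sketch in Python, assuming a simple "allocate a fresh hardware register for every destination" policy. The hardware-register names it picks differ from the slide's (which are arbitrary anyway), but the resulting dependence structure is the same: LD → MUL, SUB → ADD, and LD → DIV are the only remaining dependences.

def rename(program, num_arch_regs=16):
    # Map every architectural register rN to hardware register hwN initially.
    rename_map = {f"r{i}": f"hw{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs

    def lookup(src):
        # Memory operands like "(r8)" carry the register inside parentheses.
        if src.startswith("(") and src.endswith(")"):
            return "(" + rename_map[src[1:-1]] + ")"
        return rename_map[src]

    renamed = []
    for op, dst, srcs in program:
        new_srcs = [lookup(s) for s in srcs]      # read through the current mapping
        new_dst = f"hw{next_free}"                # fresh register for the destination
        next_free += 1
        rename_map[dst] = new_dst                 # later readers see the new name
        renamed.append((op, new_dst, new_srcs))
    return renamed

# The sequence from the slide as (opcode, destination, sources) tuples.
program = [
    ("LD",  "r7",  ["(r8)"]),
    ("MUL", "r1",  ["r7", "r2"]),
    ("SUB", "r7",  ["r4", "r5"]),
    ("ADD", "r9",  ["r7", "r8"]),
    ("LD",  "r8",  ["(r12)"]),
    ("DIV", "r10", ["r8", "r10"]),
]

for op, dst, srcs in rename(program):
    print(op, dst + ",", ", ".join(srcs))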

Example
The figure on the next slide shows an example of a superscalar processor organization. The processor can issue two instructions per cycle if there is no resource conflict and no data dependence problem. There are essentially two pipelines, with four processing stages (fetch, decode, execute, and store). Each pipeline has its own fetch, decode, and store units. Four functional units (multiplier, adder, logic unit, and load unit) are available for use in the execute stage and are shared by the two pipelines on a dynamic basis. The two store units can be dynamically used by the two pipelines, depending on availability at a particular cycle. There is a lookahead window with its own fetch and decoding logic; this window is used for instruction lookahead for out-of-order instruction issue.
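As a rough illustration of the issue rules just described, the sketch below is a toy model (purely an assumption, not Stallings' figure or a real simulator): it issues up to two instructions per cycle in program order, blocking on a busy functional unit (structural conflict) or on a source register that is not yet available (data dependence). The lookahead window and out-of-order issue are not modelled, and the demo program and its instruction names are hypothetical.

# Toy dual-issue model: in-order issue, shared functional units, unit-latency results.
FUNCTIONAL_UNITS = {"mul": 1, "add": 1, "logic": 1, "load": 1, "store": 2}

def issue_schedule(program):
    """program: list of (name, unit, dest_reg_or_None, [source_regs])."""
    ready_at = {}                      # register -> cycle its value is available
    schedule = []                      # (cycle, instruction name)
    cycle, i = 1, 0
    while i < len(program):
        units_left = dict(FUNCTIONAL_UNITS)
        issued = 0
        # In-order issue: stop at the first instruction that cannot go this cycle.
        while i < len(program) and issued < 2:
            name, unit, dest, srcs = program[i]
            operands_ready = all(ready_at.get(r, 0) <= cycle for r in srcs)
            if units_left[unit] > 0 and operands_ready:
                units_left[unit] -= 1
                if dest is not None:
                    ready_at[dest] = cycle + 1   # result usable next cycle
                schedule.append((cycle, name))
                issued += 1
                i += 1
            else:
                break
        cycle += 1
    return schedule

# Hypothetical program: two adds competing for the single adder,
# then a multiply that depends on the first add's result.
demo = [
    ("I1", "add",   "r1",  ["r2", "r3"]),
    ("I2", "add",   "r4",  ["r5", "r6"]),   # structural conflict with I1 (one adder)
    ("I3", "mul",   "r7",  ["r1", "r8"]),   # data dependence on I1
    ("I4", "load",  "r9",  []),
    ("I5", "logic", "r10", ["r9"]),
    ("I6", "store", None,  ["r7"]),
]
print(issue_schedule(demo))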

Example (cont.)
What dependencies exist in the program?
Show the pipeline activity for this program on the processor using in-order issue with in-order completion policies, using a presentation similar to the figure.
Repeat for in-order issue with out-of-order completion.
Repeat for out-of-order issue with out-of-order completion.