Two-issue Super Scalar CPU

CPU structure, what did we have to deal with:
- double clock generation
- double-port instruction cache
- double-port instruction fetch (bubble handling)
- decode stage (instruction handling, scoreboard implemented)
- execute stage (doubled execution unit, forwarding, branch resolving, write-back ports)
- load-store stage (memory access handling, doubled write-back signal)

Top level model
- Global 50 MHz clock connected to a DLL component, which performs clock frequency doubling
- Doubled clock needed to implement the 4-port Block RAM
[Diagram labels: performance counter, CPU, chipset, DLL, CLK, IO interface, CLK0, CLK2x]
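The role of the doubled clock can be illustrated with a toy model: a RAM with two physical ports, ticked twice per CLK0 period, serves four accesses per period. This is only a sketch. The class `DualPortRam`, the function `slow_cycle`, the request tuples and the split of requests across the two CLK2x ticks are all assumptions made for illustration, not the design from the slides.

```python
class DualPortRam:
    """Toy model of a two-port RAM: at most two accesses per clock tick."""

    def __init__(self, size):
        self.mem = [0] * size

    def tick(self, requests):
        # Each request is ("r", addr) or ("w", addr, value).
        assert len(requests) <= 2, "only two physical ports"
        # Reads observe the contents from before this tick's writes.
        results = [self.mem[req[1]] if req[0] == "r" else None
                   for req in requests]
        for req in requests:
            if req[0] == "w":
                self.mem[req[1]] = req[2]
        return results


def slow_cycle(ram, requests):
    """Serve up to four requests in one CLK0 period using two CLK2x ticks."""
    assert len(requests) <= 4
    first = ram.tick(requests[:2])    # first CLK2x tick (assumed ordering)
    second = ram.tick(requests[2:4])  # second CLK2x tick
    return first + second


ram = DualPortRam(8)
print(slow_cycle(ram, [("w", 1, 11), ("w", 2, 22), ("r", 1), ("r", 2)]))
```

In this model the two writes land in the first fast tick, so the reads in the second fast tick already see them within the same CLK0 period.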

Instruction cache
- Block RAM extended to a two-port implementation
- Cache miss and hit tests for both ports
- One memory port
- The FSM responsible for memory access is switched between the two requests from instruction fetch
[Diagram labels: first port, second port, Block RAM, FSM, Memory Access]
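Switching a single memory port between two cache-miss requests could look roughly like the state machine below. The state names, the `next_state` function and the fixed priority for the first port are assumptions; the slides only say that the FSM is switched between the two requests.

```python
IDLE, FILL_1, FILL_2 = "IDLE", "FILL_1", "FILL_2"


def next_state(state, miss1, miss2, mem_done):
    """One transition of a hypothetical cache-fill arbiter.

    miss1/miss2: a miss is pending on the first/second cache port.
    mem_done:    the memory access currently in flight has finished.
    """
    if state == IDLE:
        if miss1:            # assumed: the first port has priority
            return FILL_1
        if miss2:
            return FILL_2
        return IDLE
    if not mem_done:         # keep the memory port until the fill completes
        return state
    if state == FILL_1 and miss2:
        return FILL_2        # hand the memory port over to the second request
    return IDLE


print(next_state(IDLE, True, True, False))
print(next_state(FILL_1, False, True, True))
```

Giving the first port priority keeps the older instruction of a pair ahead of the younger one, which fits in-order delivery, but other arbitration schemes would also satisfy the description.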

Instruction fetch
- Fetches two instructions from the cache
- Bubble insertion for each instruction stream
- Instructions are passed to the output in order
[Diagram labels: two instruction cache ports, Instruction Fetch, two decode stage ports, branch request, bubble1, bubble2]

Decode stage
- Decodes two instructions
- Quad-port Block RAM inferred
- Takes advantage of the doubled clock: double write-back handling
- Scoreboard implemented: a set of conditions for checking data dependencies
- Bubble generation
- Instruction stream prepared for the load-store stage
[Diagram labels: two instruction fetch ports, two execute stage ports, Scoreboard, Block RAM, Write-back, Instruction decoding, Write-back, Previous Instr.]

Scoreboard
- Simplification of a full scoreboard unit
- Introduced as a set of conditions implemented in the decode stage
- Used for bubble insertion of both types (concurrent and consecutive instructions) and for separating memory access instructions
- Represented by an abstract instruction table consisting of two rows, which in practice correspond to the outputs of instruction fetch

Nr | Instruction | Idx_d | Idx_a | Idx_b | Executability
1  | MUL         |       |       |       |
2  | ST          |       |       |       |

And a few examples:
First, normal operation without any bubble insertion: the two instructions are fully independent.
[Diagram labels: Write-back, two instruction fetch ports, two execute stage ports, Block RAM, Instruction decoding, Scoreboard, Previous Instr.]

Bubble insertion caused by data dependencies between concurrent instructions.
[Diagram labels: two instruction fetch ports, two execute stage ports, Block RAM, Instruction decoding, Write-back, Scoreboard, Previous Instr.]

Bubble insertion caused by data dependencies between a load instruction and arbitrary consecutive instructions.
[Diagram labels: two execute stage ports, Block RAM, Instruction decoding, Write-back, Instr, Instr $1,$0, LD $0, Instr, Scoreboard, Previous Instr.]

Bubble insertion introduced to split two memory-access instructions.
[Diagram labels: two execute stage ports, Block RAM, Instruction decoding, Write-back, LD, ST, Instr, Scoreboard, Previous Instr.]
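The three bubble cases above can be written as a small set of conditions over the two-row instruction table. This is a minimal sketch, not the actual decode logic: the `Instr` record, the helper `reads_result_of`, the `executability` function and the `prev_loads` argument are hypothetical names, and only read-after-write dependencies are checked. Stalling both slots on a load-use hazard in the first slot is also an assumption, chosen to keep issue in order.

```python
from dataclasses import dataclass
from typing import Optional

MEM_OPS = {"LD", "ST"}


@dataclass(frozen=True)
class Instr:
    op: str
    idx_d: Optional[int] = None  # destination register index
    idx_a: Optional[int] = None  # first source register index
    idx_b: Optional[int] = None  # second source register index

    def sources(self):
        return {r for r in (self.idx_a, self.idx_b) if r is not None}


def reads_result_of(consumer, producer):
    """True if consumer reads the register that producer writes."""
    return producer.idx_d is not None and producer.idx_d in consumer.sources()


def executability(first, second, prev_loads=()):
    """Return (issue_first, issue_second); False means a bubble in that slot."""
    # Consecutive case: the first slot needs data from a load still in flight.
    if any(reads_result_of(first, ld) for ld in prev_loads):
        return (False, False)
    issue_second = True
    # Concurrent case: the second slot reads what the first slot writes.
    if reads_result_of(second, first):
        issue_second = False
    # Consecutive case for the second slot.
    if any(reads_result_of(second, ld) for ld in prev_loads):
        issue_second = False
    # Only one memory access instruction may be issued at a time.
    if first.op in MEM_OPS and second.op in MEM_OPS:
        issue_second = False
    return (True, issue_second)


mul = Instr("MUL", idx_d=1, idx_a=2, idx_b=3)
add = Instr("ADD", idx_d=4, idx_a=1, idx_b=5)
print(executability(mul, add))
```

Here `add` reads register 1, which `mul` writes in the same pair, so only the first slot is issued and the second receives a bubble.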

Execute stage
- Doubled ALU
- Resolving of branch priority
- Forwarding from both instruction streams
- Write-back generation
[Diagram labels: two decode stage ports, two load store stage ports, Data forwarding, ALU, Register, branch request]
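Forwarding from both streams and branch priority can be sketched as two small selection functions. The priorities used here are assumptions derived from program order: the second stream holds the younger instruction, so its result is the most recent value of a register, and the older branch wins when both slots request one. The names `forward` and `resolve_branch` are hypothetical.

```python
def forward(reg, regfile_value, result1, result2):
    """Pick the operand value for source register `reg`.

    result1/result2: (dest_index, value) produced by the previous pair in
    stream 1 / stream 2, or None if that slot wrote nothing.
    """
    # Assumed priority: stream 2 is younger, so its value is the most recent.
    for res in (result2, result1):
        if res is not None and res[0] == reg:
            return res[1]
    return regfile_value  # no match: use the value read from the register file


def resolve_branch(request1, request2):
    """Each request is a branch target address, or None if no branch is taken.

    Assumed priority: the older instruction (stream 1) wins.
    """
    return request1 if request1 is not None else request2


print(forward(3, 10, (3, 20), (3, 30)))
print(resolve_branch(0x40, 0x80))
```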

Load-store stage
- It is ensured that only one memory access instruction is passed to the load-store unit
- The memory access process is switched to the right instruction
- Write-back signals are generated
[Diagram labels: write back signals, write back from execute, memory access, write back multiplexing, memory ports]
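A rough model of the write-back multiplexing: whichever slot carries the memory access uses the single memory port, and each slot still produces its own write-back signal. The slot dictionaries, their keys and the function `load_store_stage` are assumptions made for this sketch.

```python
def load_store_stage(slot1, slot2, memory):
    """Return one write-back per slot: (dest, value), or None for no write.

    Each slot is a dict with "op" and, depending on the op, "dest",
    "value" (execute result or store data) and "addr".
    """
    slots = (slot1, slot2)
    mem_slots = [s for s in slots if s["op"] in ("LD", "ST")]
    # Decode is expected to have split pairs of memory access instructions.
    assert len(mem_slots) <= 1, "only one memory access per cycle"
    write_backs = []
    for s in slots:
        if s["op"] == "LD":
            write_backs.append((s["dest"], memory[s["addr"]]))
        elif s["op"] == "ST":
            memory[s["addr"]] = s["value"]
            write_backs.append(None)
        elif s.get("dest") is not None:
            # Non-memory instruction: pass the execute-stage result through.
            write_backs.append((s["dest"], s["value"]))
        else:
            write_backs.append(None)
    return write_backs


memory = {16: 99}
alu_slot = {"op": "ADD", "dest": 1, "value": 7}
load_slot = {"op": "LD", "dest": 2, "addr": 16}
print(load_store_stage(alu_slot, load_slot, memory))
```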

In action

Performance (1) - blinking LEDs
Additional parameters:
- Number of simulated cycles:
- Execution frequency of memory access instructions, compared with the number of all instructions:
  - Super Sc: 0.29
  - SIMD: 0.24
- ALU instructions:
  - Super Sc: 0.14
  - SIMD: 0.13
[Chart: instructions per cycle; labels as extracted: SIMD, Super scalar, SIMD; values 0.5 and 0.42]

Performance (2) - apfel
Additional parameters:
- Execution frequency of memory access instructions:
  - for both: 0.2
- ALU instructions:
  - both: 0.4
The measured instruction execution frequencies are surprising, probably because many memory access instructions are executed at the beginning of the program (the longer the simulation runs, the better the results should get).
[Chart: instructions per cycle; labels as extracted: SIMD, Super scalar, SIMD; values 0.56 and 0.45]

Synthesis
- The last version seen working on the XCV300 was the 2-way SIMD (MUCH faster than the HaPra CPU!)
- The 4-way SIMD and Super Scalar versions are too big for XCV and, for unknown reasons, don't work in the XCV800
- Probably severe timing issues: running at 25 MHz instead of 50 MHz doesn't help (but the 4-way SIMD should work anyway!)
- All we've got is a fully working simulation