Two-issue Super Scalar CPU

CPU structure, what did we have to deal with:
- double clock generation
- double-port instruction cache
- double-port instruction fetch (bubble handling)
- decode stage (instruction handling, scoreboard implemented)
- execute stage (doubled execution unit, forwarding, branch resolving, write-back ports)
- load-store stage (memory access handling, doubled write-back signal)

Top level model
- Global 50 MHz clock connected to a DLL component, which performs clock frequency doubling
- Doubled clock needed to implement the 4-port Block RAM
[Diagram: DLL (CLK, CLK0, CLK2x), CPU, chipset, performance counter, IO interface]

Instruction cache
- Block RAM extended to a two-port implementation
- Cache miss and hit tests for two ports
- One memory port: the FSM responsible for memory access is switched between the two requests from instruction fetch
[Diagram: first port, second port, Block RAM, FSM, Memory Access]
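The switching of the single memory port between the two cache ports can be sketched as a small state machine. This is only an illustration in Python, not the project's VHDL: the state names, the `mem_done` signal and the rule that the first port wins when both miss at once are assumptions.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    FILL_PORT1 = 1   # memory port is serving a miss on the first cache port
    FILL_PORT2 = 2   # memory port is serving a miss on the second cache port


def next_state(state, miss1, miss2, mem_done):
    """One step of a hypothetical arbiter FSM for the single memory port."""
    if state is State.IDLE:
        if miss1:
            return State.FILL_PORT1   # assumption: first port has priority
        if miss2:
            return State.FILL_PORT2
        return State.IDLE
    # A fill is in progress: hold the memory port until the access completes.
    if not mem_done:
        return state
    # Access finished: switch to the other port if it is still waiting.
    if state is State.FILL_PORT1 and miss2:
        return State.FILL_PORT2
    if state is State.FILL_PORT2 and miss1:
        return State.FILL_PORT1
    return State.IDLE
```

With both ports missing in the same cycle, the sketch serves the first port, then the second, then returns to idle.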

Instruction fetch
- Fetching two instructions from the cache
- Bubble insertion for each instruction stream
- Instructions passed to the output in order
[Diagram: two instruction cache ports, Instruction Fetch, two decode stage ports, branch request, bubble1, bubble2]

Decode stage
- Decoding two instructions
- Quad-port Block RAM inferred
- Taking advantage of the doubled clock: double write-back handling
- Scoreboard implemented: a set of conditions for checking data dependencies
- Bubble generation
- Instruction stream prepared for the load-store stage
[Diagram: two instruction fetch ports, two execute stage ports, Scoreboard, Block RAM, Instruction decoding, Write-back (x2), Previous Instr.]

Scoreboard
- Simplification of a full scoreboard unit
- Introduced as a set of conditions implemented in the decode stage
- Used for bubble insertion of both types (concurrent and consecutive instructions) and for separating memory access instructions
- Represented by an abstract instruction table consisting of two lines
- In practice the table corresponds to the outputs of instruction fetch
[Table: columns Nr, Instruction, Idx_d, Idx_a, Idx_b, Executability; rows 1 and 2 hold MUL and ST; remaining cell values: 012 21 - 1 0]

And a few examples:
First, normal operation without any bubble insertion: the two instructions are fully independent.
[Diagram: two instruction fetch ports, two execute stage ports, Block RAM, Instruction decoding, Scoreboard, Write-back, Previous Instr.]

Bubble insertion caused by data dependencies between concurrent instructions.
[Diagram: two instruction fetch ports, two execute stage ports, Block RAM, Instruction decoding, Write-back, Scoreboard, Previous Instr.]

Bubble insertion caused by data dependencies between a load instruction and arbitrary consecutive instructions.
[Diagram: two execute stage ports, Block RAM, Instruction decoding, Write-back, Scoreboard, Previous Instr.; instruction labels: Instr, Instr $1,$0, LD $0, Instr]

Bubble insertion introduced to split two memory-access instructions.
[Diagram: two execute stage ports, Block RAM, Instruction decoding, Write-back, Scoreboard, Previous Instr.; instruction labels: LD, ST, Instr]
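The scoreboard cases shown in these examples can be sketched as a set of conditions over one pair of decoded instructions. This is an illustrative Python model, not the VHDL from the project. The field names follow the Idx_d/Idx_a/Idx_b columns of the instruction table, but `Instr`, `executability`, `prev_load_dest`, the LD/ST memory-op set and the rule that the second slot never issues ahead of a stalled first slot are assumptions. For simplicity a store keeps its data register among the sources and leaves `idx_d` empty.

```python
from dataclasses import dataclass
from typing import Optional

MEMORY_OPS = {"LD", "ST"}   # assumption: only these two opcodes access memory


@dataclass
class Instr:
    op: str
    idx_d: Optional[int] = None   # destination register index
    idx_a: Optional[int] = None   # first source register index
    idx_b: Optional[int] = None   # second source register index

    def sources(self):
        return {r for r in (self.idx_a, self.idx_b) if r is not None}


def reads(instr, reg):
    """True if `instr` uses register `reg` as a source operand."""
    return reg is not None and reg in instr.sources()


def executability(first, second, prev_load_dest=None):
    """Return (first_issues, second_issues); False means a bubble is inserted."""
    # Consecutive dependency: a load issued in the previous cycle has not
    # delivered its result yet.
    if reads(first, prev_load_dest):
        return (False, False)   # assumption: in-order issue stalls both slots
    if reads(second, prev_load_dest):
        return (True, False)
    # Concurrent dependency: the second instruction reads what the first writes.
    if reads(second, first.idx_d):
        return (True, False)
    # Only one memory access instruction may go to the load-store stage.
    if first.op in MEMORY_OPS and second.op in MEMORY_OPS:
        return (True, False)
    return (True, True)
```

For example, `executability(Instr("MUL", 0, 1, 2), Instr("ADD", 3, 0, 4))` issues only the first instruction, because the second one reads register 0, which the first writes. The sketch covers only the cases listed on the slides; other hazards are left out.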

Execute stage
- Doubled ALU
- Resolving of branch priority
- Forwarding from both instruction streams
- Write-back generation
[Diagram: two decode stage ports, two load-store stage ports, Data forwarding, ALU, Register, branch request]
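Forwarding from both streams and branch priority can be sketched as follows. This is a Python illustration under assumptions, not the project's implementation: the function names are hypothetical, results are assumed to be listed from oldest to newest, and the branch of the first (older) instruction is assumed to win when both streams request a branch.

```python
def forward_operand(reg, regfile_value, results):
    """Pick the newest available value for source register `reg`.

    `results` lists (dest_reg, value) pairs produced by both instruction
    streams, ordered from oldest to newest (an assumption of this sketch).
    """
    value = regfile_value
    for dest, val in results:
        if dest == reg:
            value = val   # a newer result overrides the register file value
    return value


def resolve_branch(taken1, target1, taken2, target2):
    """Return the branch target to request, or None if no branch is taken."""
    if taken1:
        return target1   # assumption: the older instruction has priority
    if taken2:
        return target2
    return None
```

For instance, `forward_operand(3, 10, [(3, 20), (5, 7)])` returns 20 instead of the stale register file value 10.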

Load-store stage
- It is ensured that only one memory access instruction is passed to the load-store unit
- The memory access process is switched to the right instruction
- Write-back signals are generated
[Diagram: write-back signals, write-back from execute, memory access, write-back multiplexing, memory ports]
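The write-back multiplexing can be sketched as below: the slot holding the memory access instruction gets the memory port, and the other slot passes its execute result straight through. This is a Python illustration only. The dictionary layout of a slot (`op`, `dest`, `value`, `addr`) and the function name are assumptions, and memory is modelled as a plain dict.

```python
def load_store_stage(slot1, slot2, memory):
    """Return the two write-backs, each (dest_reg, value) or None."""
    mem_slots = [s for s in (slot1, slot2) if s["op"] in ("LD", "ST")]
    if len(mem_slots) > 1:
        # The decode stage is expected to have split such a pair already.
        raise ValueError("only one memory access instruction per cycle")
    write_backs = []
    for slot in (slot1, slot2):
        if slot["op"] == "LD":
            # Write-back comes from the memory access.
            write_backs.append((slot["dest"], memory[slot["addr"]]))
        elif slot["op"] == "ST":
            memory[slot["addr"]] = slot["value"]
            write_backs.append(None)   # a store writes no register
        else:
            # Write-back is passed through from the execute stage.
            write_backs.append((slot["dest"], slot["value"]))
    return write_backs
```

A pair such as an ALU instruction next to a load therefore yields two write-backs in one cycle, one from execute and one from memory.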

In action

Performance (1) – blinking LEDs
Additional parameters:
- Number of simulated cycles: 124988
- Execution frequency of memory access instructions compared with the number of all instructions:
  - Super scalar: 0,29
  - SIMD: 0,24
- ALU instructions:
  - Super scalar: 0,14
  - SIMD: 0,13
[Table: Instruction/cycle; labels SIMD, Super scalar, SIMD; values 0,5 and 0,42]

Performance (2) – apfel
Additional parameters:
- Execution frequency of memory access instructions, for both: 0,2
- ALU instructions, for both: 0,4
- The measured instruction execution frequencies are surprising, probably because many memory access instructions are executed at the beginning of the program (the longer the simulation time, the better the results we should get)
[Table: Instruction/cycle; labels SIMD, Super scalar, SIMD; values 0,56 and 0,45]

Synthesis
- The last version seen working on the XCV300 was the 2-way SIMD (MUCH faster than the HaPra CPU!)
- The 4-way SIMD and Super Scalar versions are too big for the XCV300...
- ...and for unknown reasons don't work on the XCV800
- Probably severe timing issues: running at 25 MHz instead of 50 MHz doesn't help (but the 4-way SIMD should work anyway!)
- All we've got is a fully working simulation
