Introduction to VLSI Programming Lecture 8: High Performance (DLX)

(course 2IN30)
Prof. dr. ir. Kees van Berkel
Dr. Johan Lukkien

Time table 2005
date      class | lab    subject
Aug. 30   2 | 0 hours    intro; VLSI
Sep. 6    3 | 0 hours    handshake circuits
Sep. 13                  handshake circuits assignment
Sep. 20                  Tangram
Sep. 27                  no lecture
Oct. 4
Oct. 11   1 | 2 hours    demo, fifos, registers | deadline assignment
Oct. 18                  design cases;
Oct. 25                  DLX introduction
Nov. 1                   low-cost DLX
Nov. 8                   high-speed DLX
Dec. 13                  deadline final report
12/31/2018 Kees van Berkel

Lecture 8 Outline:
Recapitulation of Lecture 7
VLSI programming for high performance:
  parallelism: expressions, commands, loops, pipelining
  pipelining the DLX
Lab work: improve performance of Tangram DLX by introducing pipelining

DLX instruction formats
R-type: Opcode | rs1 | rs2 | rd | function
  Reg-reg ALU operations
I-type: Opcode | rs1 | rd | Immediate
  loads, stores, conditional branch, ..
J-type: Opcode | offset
  Jump, jump and link, trap, return from exception
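The three formats can be illustrated with a small field extractor. This is only a sketch: the field widths (6-bit opcode, 5-bit register fields, 11-bit function, 16-bit immediate, 26-bit offset) are an assumption taken from the textbook DLX encoding and are not stated on the slide, and the helper names `bits`, `sign_extend`, `decode_r`, `decode_i`, and `decode_j` are hypothetical.

```python
# Sketch of DLX field extraction. Bit positions are ASSUMED from the
# textbook DLX encoding (32-bit word, bit 31 = most significant bit).

def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit word as an unsigned int."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret an unsigned `width`-bit value as two's complement."""
    return value - (1 << width) if value & (1 << (width - 1)) else value

def decode_r(word):
    """R-type: opcode, rs1, rs2, rd, function."""
    return {"opcode": bits(word, 31, 26), "rs1": bits(word, 25, 21),
            "rs2": bits(word, 20, 16), "rd": bits(word, 15, 11),
            "function": bits(word, 10, 0)}

def decode_i(word):
    """I-type: opcode, rs1, rd, 16-bit immediate (returned sign-extended)."""
    return {"opcode": bits(word, 31, 26), "rs1": bits(word, 25, 21),
            "rd": bits(word, 20, 16),
            "immediate": sign_extend(bits(word, 15, 0), 16)}

def decode_j(word):
    """J-type: opcode, 26-bit offset (returned sign-extended)."""
    return {"opcode": bits(word, 31, 26),
            "offset": sign_extend(bits(word, 25, 0), 26)}

# Example word built by hand: rs1=1, rs2=2, rd=3, function=0x20.
r_word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
print(decode_r(r_word))
```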

Example instructions

DLX interface, state
[Figure: DLX CPU with its interface and state. State: pc and register file Reg (r0, r1, r2, ... r31). Interface labels: Instruction memory (address, instruction), Mem (Data memory) (address, data, r/w), clock, interrupt.]

VLSI programming for …
Low costs: introduce resource sharing.
Low delay (high throughput): introduce parallelism.
Low energy (low power): reduce activity; …

VLSI programming for high performance
Keep it simple!!
Make the analysis; focus on bottlenecks.
Introduce parallelism: expressions, commands, loops, pipelining.
Enable parallelism by reducing dependencies such as resource sharing.

Expression-level parallelism
Examples:
  balancing: (v+w)+(x+y) is faster than v+w+x+y
  substitution: z:=g(f(x)) is faster than y:= f(x) ; z:= g(y)
  carry-select adder
  carry-save multiplier
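The balancing example can be checked by counting operator levels. The sketch below assumes that v+w+x+y is evaluated left to right as ((v+w)+x)+y and that every adder has the same delay; the tuple representation and the name `depth` are illustrative choices, not part of the lecture.

```python
def depth(expr):
    """Number of operator levels on the longest path of an expression tree.

    A leaf is a variable name (str); an inner node is (op, left, right).
    """
    if isinstance(expr, str):
        return 0
    _op, left, right = expr
    return 1 + max(depth(left), depth(right))

# v+w+x+y, assumed to be evaluated left to right: ((v+w)+x)+y
chain = ("+", ("+", ("+", "v", "w"), "x"), "y")
# (v+w)+(x+y): the two inner additions can proceed in parallel
balanced = ("+", ("+", "v", "w"), ("+", "x", "y"))

print(depth(chain), depth(balanced))  # 3 2
```

With equal adder delays, the balanced form has a critical path of two additions instead of three.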

Command-level parallelism
If S2 does not depend on the outcome of S1, then S1 ; S2 can be transformed into S1 || S2.
(dependencies: data, sharing, synchronization)
This reduces computation time t, unless ordering is enforced through external synchronization.
  t(S1 ; S2)  = t(;) + t(S1) + t(S2)
  t(S1 || S2) = t(||) + max(t(S1), t(S2))
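The two timing rules can be evaluated numerically. This is a minimal sketch: the function names `time_seq` and `time_par` and the overhead parameter `t_op` (standing for the delay of the ";" or "||" operator itself) are assumptions made for illustration, and the numbers are arbitrary.

```python
def time_seq(t_s1, t_s2, t_op=0):
    """Computation time of S1 ; S2: operator overhead plus both commands."""
    return t_op + t_s1 + t_s2

def time_par(t_s1, t_s2, t_op=0):
    """Computation time of S1 || S2: operator overhead plus the slower command."""
    return t_op + max(t_s1, t_s2)

# Arbitrary example delays: S1 takes 3 units, S2 takes 5, the operator 1.
print(time_seq(3, 5, t_op=1))  # 9
print(time_par(3, 5, t_op=1))  # 6
```

The gain is largest when the two commands take about equal time; when one dominates, parallel composition saves little.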

Exposure of cmd-level parallelism
Let *[S] be a shorthand for forever do S od.
Assume S0 must precede S1 and S1 must precede S2.
How to speed up *[ S0 ; S1 ; S2 ] ?
  *[ S0 ; S1 ; S2 ]
= { loop unfolding }
  S0 ; *[ S1 ; S2 ; S0 ]
= { S0 does not depend on S2 }
  S0 ; *[ S1 ; (S2 || S0) ]
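The effect of this transformation can be sketched with a simple schedule. The code below assigns start and end times to each command in S0 ; *[ S1 ; (S2 || S0) ], cut off after a fixed number of rounds, and checks that S0 still precedes S1 and S1 still precedes S2 within each iteration. The delays, the neglect of operator overhead, and the names `schedule` and `respects_order` are assumptions for illustration.

```python
def schedule(t0, t1, t2, rounds):
    """Start/end times for S0 ; *[ S1 ; (S2 || S0) ], truncated after `rounds`.

    Returns (events, total_time); events maps (name, iteration) -> (start, end).
    Operator overhead is ignored (an assumption of this sketch).
    """
    events = {("S0", 0): (0, t0)}
    now = t0
    for k in range(rounds):
        events[("S1", k)] = (now, now + t1)
        now += t1
        events[("S2", k)] = (now, now + t2)       # S2 of iteration k ...
        events[("S0", k + 1)] = (now, now + t0)   # ... overlaps S0 of iteration k+1
        now += max(t2, t0)
    return events, now

def respects_order(events, rounds):
    """Check S0 before S1 before S2 within every iteration."""
    return all(
        events[("S0", k)][1] <= events[("S1", k)][0]
        and events[("S1", k)][1] <= events[("S2", k)][0]
        for k in range(rounds)
    )

events, total = schedule(2, 3, 4, rounds=3)
print(total, respects_order(events, 3))  # 23 True
```

For these delays each round takes 3 + max(4, 2) = 7 time units instead of 2 + 3 + 4 = 9 for the original loop.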

wagging
  *[a?x ; b!f(x)]
= { loop unrolling, renaming }
  *[a?x ; b!f(x) ; a?y ; b!f(y)]
= { loop folding }
  a?x ; *[b!f(x) ; a?y ; b!f(y) ; a?x]
  { increases slack by 1 }
  a?x ; *[(b!f(x) || a?y) ; (b!f(y) || a?x)]
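A sequential simulation can confirm that the wagged loop produces the same output sequence on b as the original loop. This is a sketch only: the input channel a is modelled as a finite Python iterable, the parallel pairs are executed one after the other, and the names `plain` and `wagged` are hypothetical.

```python
def plain(inputs, f):
    """Model of *[a?x ; b!f(x)] on a finite input sequence."""
    out = []
    for x in inputs:          # a?x
        out.append(f(x))      # b!f(x)
    return out

def wagged(inputs, f):
    """Model of a?x ; *[(b!f(x) || a?y) ; (b!f(y) || a?x)].

    The two variables x and y are used alternately; each parallel pair
    is simulated sequentially, which does not change the output order.
    """
    source = iter(inputs)
    out = []
    try:
        x = next(source)          # a?x
        while True:
            out.append(f(x))      # b!f(x)
            y = next(source)      # a?y
            out.append(f(y))      # b!f(y)
            x = next(source)      # a?x
    except StopIteration:         # finite input exhausted
        pass
    return out

data = [1, 2, 3, 4, 5]
print(plain(data, lambda v: v * v) == wagged(data, lambda v: v * v))  # True
```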

Parallel reads from REG file
Let RF be a register file. Then x:= RF[i] ; y:= RF[j] cannot be parallelized. (Register files have a single read port.)
Parallel read actions can be realized by doubling the register file:
  << RF[i] , RG[i] >> := << z , z >>     { write }
and
  << x , y >> := << RF[i] , RG[j] >>     { read }
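The doubling idea can be modelled in a few lines. In this sketch each copy stands for a single-read-port memory; the class name `DualReadRegisterFile`, its method names, and the default size of 32 registers are assumptions for illustration.

```python
class DualReadRegisterFile:
    """Two identical copies (RF and RG) so that two reads need one port each."""

    def __init__(self, size=32):
        self.rf = [0] * size   # copy serving the first read
        self.rg = [0] * size   # copy serving the second read

    def write(self, i, z):
        """<< RF[i] , RG[i] >> := << z , z >>: keep both copies identical."""
        self.rf[i] = z
        self.rg[i] = z

    def read2(self, i, j):
        """<< x , y >> := << RF[i] , RG[j] >>: one read per copy."""
        return self.rf[i], self.rg[j]

regs = DualReadRegisterFile()
regs.write(1, 10)
regs.write(2, 20)
print(regs.read2(1, 2))  # (10, 20)
```

The price of the parallel read is twice the storage and a write that updates both copies.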

Pipelining in Tangram
Compare three programs:
P0: *[ a?x0 ; b!f2(f1(f0(x0))) ]
P1: *[ a?x0 ; x1:= f0(x0) ; x2:= f1(x1) ; b!f2(x2) ]
P2: *[ a?x0 ; a1!f0(x0) ]
 || *[ a1?x1 ; a2!f1(x1) ]
 || *[ a2?x2 ; b!f2(x2) ]
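The structure of P2 can be imitated with three threads connected by queues. This is an approximation, not Tangram semantics: `queue.Queue(maxsize=1)` buffers one value, whereas a handshake channel is a rendezvous, and the sentinel `DONE` and the helpers `stage`, `run_p0`, and `run_p2` are hypothetical names introduced so that the simulation can terminate on finite input.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a finite input stream (assumption)

def stage(f, inp, out):
    """One pipeline stage: *[ inp?x ; out!f(x) ], stopping at DONE."""
    while True:
        x = inp.get()
        if x is DONE:
            out.put(DONE)
            return
        out.put(f(x))

def run_p0(inputs, f0, f1, f2):
    """P0: *[ a?x0 ; b!f2(f1(f0(x0))) ] on a finite input list."""
    return [f2(f1(f0(x))) for x in inputs]

def run_p2(inputs, f0, f1, f2):
    """P2: three stages in parallel, connected by channels a, a1, a2, b."""
    a, a1, a2, b = (queue.Queue(maxsize=1) for _ in range(4))

    def feed():
        for x in inputs:
            a.put(x)
        a.put(DONE)

    workers = [threading.Thread(target=stage, args=args)
               for args in ((f0, a, a1), (f1, a1, a2), (f2, a2, b))]
    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:
        y = b.get()
        if y is DONE:
            break
        results.append(y)
    for w in workers:
        w.join()
    return results

inc, dbl, neg = (lambda v: v + 1), (lambda v: 2 * v), (lambda v: -v)
print(run_p2([1, 2, 3], inc, dbl, neg) == run_p0([1, 2, 3], inc, dbl, neg))  # True
```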

Pipelining in Tangram (cntd)
Output sequence b identical for P0, P1, and P2.
P0 and P1 have same communication behavior; P1 is larger, slower, and warmer.
P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on (relative) complexity of f0, f1, f2.
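The "up to 3 times" claim follows from a simple cycle-time model. The sketch below assumes that communication and control overhead are negligible, so that P1 needs the sum of the three function delays per item while P2 is limited by its slowest stage; the function names and delay values are illustrative.

```python
def cycle_time_p1(t0, t1, t2):
    """P1 evaluates f0, f1, f2 one after the other for each item."""
    return t0 + t1 + t2

def cycle_time_p2(t0, t1, t2):
    """P2 overlaps the stages; the slowest stage sets the pace."""
    return max(t0, t1, t2)

def speedup(t0, t1, t2):
    """Throughput of P2 relative to P1 (overheads ignored)."""
    return cycle_time_p1(t0, t1, t2) / cycle_time_p2(t0, t1, t2)

print(speedup(1, 1, 1))  # 3.0  (balanced stages)
print(speedup(1, 1, 4))  # 1.5  (one dominant stage)
```

The factor of 3 is reached only when f0, f1, and f2 are equally complex.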

DLX: 5-step sequential execution
[Figure: datapath for five-step sequential execution. Steps: IF, ID, EX, MM, WB. Labels: Instr. mem, Reg, Mem; registers pc, npc, ir, A, B, Imm, aluo, cond, lmd; constant 4; zero test 0?]

DLX: pipelined execution
[Figure: pipeline diagram. Horizontal axis: time [in clock cycles]; vertical axis: program execution [instructions]. Each instruction passes through IF, ID, EX, MM, WB.]

DLX: pipelined execution
[Figure: pipelined datapath with stages Instruction Fetch, Inst. Decode, EXecute, Memory, Write Back. Labels: pc, constant 4, Instr. mem, Reg, zero test 0?, Mem.]

Lab work
Assignment 5:
Create a 2-stage pipelined dlx2.tg. Throughput must exceed 5 MIPS (benchmark = GCD).
Design a reduced-cost version dlx2s.tg.
Note: use of shared variables is not allowed.
Let command S1 || S2 be part of your DLX. When S1 has write access to variable x, S2 may neither read nor write x (and vice versa).
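The shared-variable rule amounts to a check on read and write sets. The sketch below is a hand-rolled illustration, not a feature of the Tangram tools: the function name `may_run_in_parallel` and the example variable names are assumptions.

```python
def may_run_in_parallel(reads1, writes1, reads2, writes2):
    """True if S1 || S2 obeys the rule: a variable written by one command
    is neither read nor written by the other."""
    r1, w1, r2, w2 = set(reads1), set(writes1), set(reads2), set(writes2)
    return not (w1 & (r2 | w2)) and not (w2 & (r1 | w1))

# S1 writes ir while S2 reads ir: not allowed.
print(may_run_in_parallel({"pc"}, {"ir"}, {"ir"}, {"A"}))     # False
# Both commands only read pc and write different variables: allowed.
print(may_run_in_parallel({"pc"}, {"ir"}, {"pc"}, {"npc"}))   # True
```

Concurrent reads of the same variable are permitted; only a write on one side creates a conflict.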

Next week: lecture 9
Outline:
Pipelining the DLX, using branch-delay slots.
Lab work: Assignment 6 (3-stage DLX)

DLX system organization
[Figure: system_dlx(…) with a system boundary around dlx(…), rom(…), and ram(…). Channels: ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM. File: gcd.bin. Files: RAMin, RAMout.]

dlx0.ht
#include types.ht
& dlx0 : export proc (
    ROMaddr!chan adtype
  & ROMdata?chan word
  & RAMaddr!chan rwadtype
  & datatoRAM!chan S30
  & datafromRAM?chan S30
  ) .
  begin
    …
    RF: ram array U5 of S30
  end

system_dlx0.ht
#include "dlx0.ht"
& dlx0 : proc (
    ROMaddr!chan adtype
  & ROMdata?chan word
  & RAMaddr!chan rwadtype
  & datatoRAM!chan S30
  & datafromRAM?chan S30
  ) . import
& env_dlx4 : main proc (
  & ROMfile? chan word
  & RAMinfile? chan S30
  & RAMfile! chan S30   /* <<address,data>> */
  ) .
  begin
    next slide
  end

system_dlx0.ht : main body
begin
  & ROMaddr : chan adtype
  & ROMdata : chan word
  & RAMaddr : chan rwadtype
  & datatoRAM : chan S30
  & datafromRAM: chan S30
  …
  & ROMinterface : proc() . begin .. end
  & RAMinterface : proc() . begin .. end
| initialise() ;
  ROMinterface() || RAMinterface() ||
  dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM )
end

script
htcomp -B system_dlx0
htsim -limit 1000 system_dlx0 gcd.bin RAMin RAMout
htview system_dlx0
