

Introduction to Silicon Programming in the Tangram/Haste language Material adapted from lectures by: Prof.dr.ir Kees van Berkel [Dr. Johan Lukkien] [Dr.ir. Ad Peeters] at the Technical University of Eindhoven, the Netherlands

Philips Research, Kees van Berkel, Ad Peeters, TU/e
VLSI programming for …
Low costs:
- introduce resource sharing.
Low delay (high throughput):
- introduce parallelism.
Low energy (low power):
- reduce activity; …

VLSI programming for high performance
Keep it simple!
Make the analysis; focus on bottlenecks.
Introduce parallelism: expressions, commands, loops, pipelining.
Enable parallelism by reducing dependencies such as resource sharing.

Expression-level parallelism
Examples:
- balancing: (v+w)+(x+y) is faster than v+w+x+y
- substitution: z:= g(f(x)) is faster than y:= f(x) ; z:= g(y)
- carry-select adder
- carry-save multiplier
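The balancing example can be made concrete by counting adders on the critical path. The sketch below is only an illustration: it assumes every adder has the same delay and that v+w+x+y is evaluated left to right as ((v+w)+x)+y. The function names are hypothetical.

```python
def chain_depth(n_operands: int) -> int:
    """Adders on the critical path of a left-to-right sum of n operands."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands: int) -> int:
    """Adders on the critical path of a balanced adder tree.

    Equals ceil(log2(n)); computed with integers to avoid rounding issues.
    """
    return (n_operands - 1).bit_length() if n_operands > 1 else 0


# v+w+x+y needs three adders in series; (v+w)+(x+y) needs only two levels.
print(chain_depth(4), balanced_depth(4))
```

Under these assumptions the gain grows with the number of operands: the chain is linear in the operand count, the balanced tree logarithmic.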

Command-level parallelism
If S2 does not depend on the outcome of S1, then S1 ; S2 can be transformed into S1 || S2 (dependencies: data, sharing, synchronization).
This reduces the computation time $\tau$, unless ordering is enforced through external synchronization:
$\tau(S1 ; S2) = \tau(;) + \tau(S1) + \tau(S2)$
$\tau(S1 || S2) = \tau(||) + \max(\tau(S1), \tau(S2))$
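A small numeric sketch of the two timing rules on this slide. The function names and the overhead parameters (standing for the cost of the ; and || operators themselves) are hypothetical, and the time unit is arbitrary.

```python
def time_sequential(t_s1: float, t_s2: float, t_seq_op: float = 0.0) -> float:
    """Computation time of S1 ; S2: operator overhead plus both commands."""
    return t_seq_op + t_s1 + t_s2


def time_parallel(t_s1: float, t_s2: float, t_par_op: float = 0.0) -> float:
    """Computation time of S1 || S2: operator overhead plus the slower command."""
    return t_par_op + max(t_s1, t_s2)


# With S1 taking 3 units, S2 taking 5 units and 1 unit of operator overhead:
print(time_sequential(3, 5, 1))  # 9
print(time_parallel(3, 5, 1))    # 6
```

The parallel form only pays off when the overhead of || does not exceed the time of the faster command plus the overhead of ;.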

Exposure of command-level parallelism
Let *[S] be a shorthand for forever do S od.
Assume S0 must precede S1 and S1 must precede S2. How to speed up *[ S0 ; S1 ; S2 ]?
*[ S0 ; S1 ; S2 ]
= { loop unfolding }
S0 ; *[ S1 ; S2 ; S0 ]
= { S0 does not depend on S2 }
S0 ; *[ S1 ; (S2 || S0) ]

Wagging
*[ a?x ; b!f(x) ]
= { loop unrolling, renaming }
*[ a?x ; b!f(x) ; a?y ; b!f(y) ]
= { loop folding }
a?x ; *[ b!f(x) ; a?y ; b!f(y) ; a?x ]
{ increases slack by 1 }
a?x ; *[ (b!f(x) || a?y) ; (b!f(y) || a?x) ]

Parallel reads from REG file
Let RF be a register file. Then x:= RF[i] ; y:= RF[j] cannot be parallelized. (Register files have a single read port.)
Parallel read actions can be realized by doubling the register file: each write updates both copies { write }, and two simultaneous reads take one value from each copy { read }.
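The doubling idea can be sketched in software as follows. This is a behavioural illustration only, not Tangram code: the class and method names are hypothetical, and each Python list stands for one single-read-port register file.

```python
class DoubledRegisterFile:
    """Two identical copies, so that two reads can be served at the same time."""

    def __init__(self, size: int = 32) -> None:
        self._copy0 = [0] * size  # serves the first read port
        self._copy1 = [0] * size  # serves the second read port

    def write(self, index: int, value: int) -> None:
        # Every write goes to both copies, keeping them identical.
        self._copy0[index] = value
        self._copy1[index] = value

    def read_pair(self, i: int, j: int) -> tuple[int, int]:
        # Each copy is read once, so neither needs a second read port.
        return self._copy0[i], self._copy1[j]


rf = DoubledRegisterFile()
rf.write(1, 12)
rf.write(2, 18)
x, y = rf.read_pair(1, 2)
print(x, y)  # 12 18
```

The price is twice the storage and a write that touches both copies, which matches the general trade of area and energy for delay.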

Pipelining in Tangram
Compare three programs:
P0: *[ a?x0 ; b!f2(f1(f0(x0))) ]
P1: *[ a?x0 ; x1:= f0(x0) ; x2:= f1(x1) ; b!f2(x2) ]
P2: *[ a?x0 ; a1!f0(x0) ] || *[ a1?x1 ; a2!f1(x1) ] || *[ a2?x2 ; b!f2(x2) ]

Pipelining in Tangram (cntd)
Output sequence b is identical for P0, P1, and P2.
P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
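The "up to 3 times" figure can be illustrated with a deliberately simple model. It assumes that the sequential program P1 needs the sum of the three function delays per output, that the pipelined program P2 is limited by its slowest stage, and that communication overhead is negligible. These assumptions and the function names are not from the slides.

```python
def sequential_period(stage_times: list[float]) -> float:
    """Time per output when the stages run one after another (as in P1)."""
    return sum(stage_times)


def pipelined_period(stage_times: list[float]) -> float:
    """Time per output when the stages run concurrently (as in P2)."""
    return max(stage_times)


def throughput_gain(stage_times: list[float]) -> float:
    """Ratio of the sequential period to the pipelined period."""
    return sequential_period(stage_times) / pipelined_period(stage_times)


print(throughput_gain([1, 1, 1]))  # balanced stages: 3.0
print(throughput_gain([1, 1, 4]))  # one dominant stage: 1.5
```

In this model the full factor of 3 is reached only when f0, f1, and f2 are equally complex; a single slow stage bounds the gain.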

A Processor Example: DLX (“Deluxe”)
(AMD 29K + DECstation HP850 + IBM801 + Intel i860 + MIPS M/120A + MIPS M/ Motorola 88K + RISC I + SGI 4D/60 + SPARCstation-1 + Sun 4/110 + Sun-4/260) / 13 = DLX
Other RISC examples include: Cray-1,2,3, AMD2900, DEC Alpha, ARM.

DLX instruction formats
Bits:    31-26    25-21    20-16    15-11    10-0
I-type:  Opcode   rs1      rd       Immediate (15-0)
R-type:  Opcode   rs1      rs2      rd       function
J-type:  Opcode   offset (25-0)
I-type: loads, stores, conditional branch, ...
R-type: reg-reg ALU operations
J-type: jump, jump and link, trap, return from exception
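The field boundaries on this slide translate directly into shift-and-mask operations. The sketch below extracts the fields of a 32-bit instruction word for each format; the function names are hypothetical, the immediate and offset are returned as raw unsigned values (no sign extension), and the opcode value in the example is arbitrary rather than a real DLX encoding.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of a 32-bit word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_i_type(word: int) -> dict[str, int]:
    return {"opcode": bits(word, 31, 26), "rs1": bits(word, 25, 21),
            "rd": bits(word, 20, 16), "immediate": bits(word, 15, 0)}


def decode_r_type(word: int) -> dict[str, int]:
    return {"opcode": bits(word, 31, 26), "rs1": bits(word, 25, 21),
            "rs2": bits(word, 20, 16), "rd": bits(word, 15, 11),
            "function": bits(word, 10, 0)}


def decode_j_type(word: int) -> dict[str, int]:
    return {"opcode": bits(word, 31, 26), "offset": bits(word, 25, 0)}


# Arbitrary example word: opcode 5, rs1 3, rd 7, immediate 0x1234.
example = (5 << 26) | (3 << 21) | (7 << 16) | 0x1234
print(decode_i_type(example))
```

All three formats share the opcode position, so a decoder can inspect bits 31-26 first and only then choose how to interpret the remaining 26 bits.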

Example instructions

GCD in DLX assembler
pre:    LW    R1,4(R0)       R1:=Mem[4+0]
        LW    R2,8(R0)       R2:=Mem[8+0]
loop:   SUB   R3,R1,R2       R3:=R1-R2
        BEQZ  R3,"exit"      if (R3=0) then PC:="exit"
        SLT   R4,R1,R2       R4:=(R1<R2)
        BEQZ  R4,"pos2"      if (R4=0) then PC:="pos2"
pos1:   SUB   R2,R2,R1       R2:=R2-R1
        J     "loop"         PC:="loop"
pos2:   SUB   R1,R1,R2       R1:=R1-R2
        J     "loop"         PC:="loop"
exit:   SW    20(R0),R1      Mem[20+0]:=R1
        HLT
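For readers less used to assembler, the same control flow can be written in Python. This is a line-by-line paraphrase of the program on this slide, with the memory modelled as a dictionary keyed by address (an assumption made for illustration). Like the assembler version, it terminates only for positive inputs.

```python
def gcd_dlx_style(mem: dict[int, int]) -> dict[int, int]:
    """Subtractive GCD of mem[4] and mem[8]; the result is stored in mem[20]."""
    r1 = mem[4]                      # pre:  LW R1,4(R0)
    r2 = mem[8]                      #       LW R2,8(R0)
    while True:
        r3 = r1 - r2                 # loop: SUB R3,R1,R2
        if r3 == 0:                  #       BEQZ R3,"exit"
            break
        r4 = 1 if r1 < r2 else 0     #       SLT R4,R1,R2
        if r4 == 0:                  #       BEQZ R4,"pos2"
            r1 = r1 - r2             # pos2: SUB R1,R1,R2
        else:
            r2 = r2 - r1             # pos1: SUB R2,R2,R1
    mem[20] = r1                     # exit: SW 20(R0),R1
    return mem


memory = gcd_dlx_style({4: 12, 8: 18})
print(memory[20])  # 6
```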

DLX interface, state
[Figure: the DLX CPU, holding the program counter pc and the register file Reg (r0, r1, r2, …, r31), connected to an instruction memory (address, instruction) and a data memory Mem (address, data, r/w), with clock and interrupt inputs.]

DLX: “Moore machine” (ignoring interrupts)
⟨Reg[0], pc⟩ := ⟨0, 0⟩ ;
do ⟨Mem[Reg[rs1]+immediate], pc, Reg[rd]⟩ := ⟨
    if SW → Reg[rd] fi,
    if J → pc+4+offset
    [] BEQZ → if Reg[rs1]=0 → pc+4+immediate [] Reg[rs1]#0 → pc+4 fi
    [] else → pc+4
    fi,
    if LW → Mem[Reg[rs1]+immediate]
    [] ADD → ALU(add, Reg[rs1], Reg[rs2])
    fi ⟩
od
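One iteration of this loop can be sketched as a Python function that computes the new memory word, the new pc, and the new register value from the old state and only then commits them. The instruction representation is a hypothetical simplification: tuples (op, rd, rs1, x), where x is rs2 for ADD and the immediate or offset otherwise, stored in a list indexed by pc // 4.

```python
def step(pc: int, reg: list[int], mem: dict[int, int], program: list[tuple]) -> int:
    """Execute one instruction and return the new pc; reg and mem are updated in place."""
    op, rd, rs1, x = program[pc // 4]
    new_pc = pc + 4          # default: next instruction
    mem_update = None        # (address, value) for SW
    reg_update = None        # (register index, value) for LW and ADD

    if op == "SW":
        mem_update = (reg[rs1] + x, reg[rd])
    elif op == "LW":
        reg_update = (rd, mem.get(reg[rs1] + x, 0))
    elif op == "ADD":
        reg_update = (rd, reg[rs1] + reg[x])
    elif op == "BEQZ":
        if reg[rs1] == 0:
            new_pc = pc + 4 + x
    elif op == "J":
        new_pc = pc + 4 + x

    # Commit all updates together, as in the multiple assignment.
    if mem_update is not None:
        mem[mem_update[0]] = mem_update[1]
    if reg_update is not None and reg_update[0] != 0:  # Reg[0] stays 0
        reg[reg_update[0]] = reg_update[1]
    return new_pc


program = [("LW", 1, 0, 4), ("LW", 2, 0, 8), ("ADD", 3, 1, 2), ("SW", 3, 0, 20)]
reg, mem, pc = [0] * 32, {4: 5, 8: 7}, 0
while pc // 4 < len(program):
    pc = step(pc, reg, mem, program)
print(mem[20])  # 12
```

Keeping the computation of the three new values separate from the commit mirrors the state-machine view: the next state is a pure function of the current state and the fetched instruction.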

DLX: 5-step sequential execution
[Figure: datapath with the register file Reg, the registers pc, npc, ir, A, B, Imm, aluo, cond, and lmd, the instruction memory, and the data memory Mem, divided into the five steps IF, ID, EX, MM, WB.]

DLX: pipelined execution
[Figure: successive instructions, each passing through the steps IF, ID, EX, MM, WB, overlapping in time. Horizontal axis: time in clock cycles. Vertical axis: program execution in instructions.]
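The benefit of overlapping the five steps can be estimated with the usual ideal cycle count. The sketch assumes one clock cycle per step, a new instruction entering the pipeline every cycle, and no stalls or hazards; these idealizations and the function names are not from the slides.

```python
def sequential_cycles(n_instructions: int, n_steps: int = 5) -> int:
    """Cycles when each instruction completes all steps before the next starts."""
    return n_instructions * n_steps


def pipelined_cycles(n_instructions: int, n_steps: int = 5) -> int:
    """Cycles for an ideal pipeline: fill time plus one cycle per further instruction."""
    if n_instructions == 0:
        return 0
    return n_steps + n_instructions - 1


# Six instructions through IF, ID, EX, MM, WB:
print(sequential_cycles(6), pipelined_cycles(6))  # 30 10
```

For long instruction sequences the ratio approaches the number of steps, which is the same bound as for the three-stage Tangram pipeline discussed earlier.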

DLX: pipelined execution
[Figure: pipelined datapath with pc, the instruction memory, the register file Reg, and the data memory Mem, divided into the stages Instruction Fetch, Instruction Decode, EXecute, Memory, Write Back.]

DLX system organization
[Figure: system_dlx(…) encloses dlx(…), rom(…), and ram(…) within the system boundary. Channels: ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM. File: gcd.bin. Files: RAMin, RAMout.]

dlx0.ht
#include types.ht
& dlx0 : export proc (
    ROMaddr!chan adtype
  & ROMdata?chan word
  & RAMaddr!chan rwadtype
  & datatoRAM!chan S30
  & datafromRAM?chan S30
  ).
begin
  …
  RF: ram array U5 of S30
end

system_dlx0.ht
#include "dlx0.ht"
& dlx0 : proc (
    ROMaddr!chan adtype
  & ROMdata?chan word
  & RAMaddr!chan rwadtype
  & datatoRAM!chan S30
  & datafromRAM?chan S30
  ). import
& env_dlx4 : main proc (
  & ROMfile? chan word
  & RAMinfile? chan S30
  & RAMfile! chan S30 /* > */
  ).
begin
  next slide
end

system_dlx0.ht : main body
begin
  & ROMaddr : chan adtype
  & ROMdata : chan word
  & RAMaddr : chan rwadtype
  & datatoRAM : chan S30
  & datafromRAM : chan S30
  …
  & ROMinterface : proc(). begin .. end
  & RAMinterface : proc(). begin .. end
|
  initialise() ;
  ROMinterface() || RAMinterface() || dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM )
end

script
htcomp system_dlx0
htsim -limit 1000 system_dlx0 RAMin RAMout
htview system_dlx0
htmap system_dlx0

DLX0: instruction loop
do -halted then
    ROMaddr!PC ; ROMdata?ir ; PC:=PC+4 {auxPC:=PC+4 ; PC:=PCaux}
  ; case (ir cast Itype.0) is
       > then LW()
    or > then SW()
    or > then if (ir cast Rtype.4 = 1) then SLT() fi
    or > then BEQZ()
    or > then J()
    or > then halted:=true
    si
od