Processor Architectures and Program Mapping. H. Corporaal, J. van Meerbergen, and B. Mesman. Platform-based Design 5kk70, 2007.




1 Processor Architectures and Program Mapping. H. Corporaal, J. van Meerbergen, and B. Mesman. Platform-based Design 5kk

2 Flexibility versus efficiency (figure). Architectures shown, from most flexible to most efficient: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor.

3 Flexibility versus efficiency (figure; both axes run low, medium, high). Design points shown: ASIC, GP proc, FPGA, DSP, ASIP.

4 Programmable CPU cores
introduction
architecture of the MIPS core, discussed as an example
pipelining
application examples
software issues
comparison between different CPU cores
towards application specific architectures
discussion

5 Introduction
rationale: general-purpose -> large market
consequence: often handcrafted design, optimised for clock rate
problem: fast changes in the IC process technology
examples, embedded:
  MIPS (first one, licensing the instruction set architecture)
  ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, licensing also the micro-architecture as hard or soft IP)
examples, derivatives from general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC

6 Introduction: instruction set architectures
implicit operands: stack machines (e.g. ST20), accumulator machines
explicit operands: general purpose registers, either register-memory or register-register (= load-store)

7 Architecture of the MIPS core [Hennessy & Patterson]
(Datapath figure: PC and instruction memory delivering the instruction fields Rd, Rs, Rt and Imm; a register file of 32-bit registers with ports Rw, Ra, Rb; data memory with data address, data in and data out; all clocked by Clk.)

8 MIPS instruction formats (32 bits) [Hennessy & Patterson]
R-type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
J-type: op (6 bits) | target address (26 bits)
op: operation of the instruction
rs, rt, rd: source and destination registers
shamt: shift amount
funct: operation of the instruction, part 2
imm: for program constants
addr: target address of a jump
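The field layout above can be checked with a small decoder. This is a minimal sketch, not part of the slides: the function name `decode` and the dictionary keys are illustrative, and the two sample words are the standard encodings of `add $3, $1, $2` and `lw $2, 4($1)`.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    return {
        "op":     (word >> 26) & 0x3F,   # 6 bits, present in every format
        "rs":     (word >> 21) & 0x1F,   # 5 bits, R- and I-type
        "rt":     (word >> 16) & 0x1F,   # 5 bits, R- and I-type
        "rd":     (word >> 11) & 0x1F,   # 5 bits, R-type only
        "shamt":  (word >> 6) & 0x1F,    # 5 bits, R-type only
        "funct":  word & 0x3F,           # 6 bits, R-type only
        "imm":    word & 0xFFFF,         # 16 bits, I-type only
        "target": word & 0x03FFFFFF,     # 26 bits, J-type only
    }

r = decode(0x00221820)   # add $3, $1, $2  (R-type)
i = decode(0x8C220004)   # lw  $2, 4($1)   (I-type)
print(r["rs"], r["rt"], r["rd"], hex(r["funct"]))
print(hex(i["op"]), i["rs"], i["rt"], i["imm"])
```

Which fields are meaningful depends on the format selected by `op`; the sketch simply extracts all of them.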

9 Example 1: R-type: add instruction [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
add rd, rs, rt
  mem[PC]                  (fetch the instruction)
  R[rd] = R[rs] + R[rt]
  PC = PC + 4
(Figure: register file with ports Rw, Ra, Rb addressed by Rd, Rs, Rt; 32-bit buses BusA and BusB feed the ALU controlled by ALUctr; the result returns on BusW and is written under control of RegWr.)

10 Critical path of an R-type operation [Hennessy & Patterson]
(Figure: the path through PC, instruction memory and register file highlighted in the datapath.)

11 Example 2: I-type: load word [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
lw rs, rt, imm16
  mem[PC]                      (fetch the instruction)
  addr = R[rs] + ext[imm16]
  R[rt] = mem[addr]
  PC = PC + 4
(Figure: datapath extended with an extender for Imm and a data memory; control signals RegDst, RegWr, ALUSrc, ExtOp, ALUctr, MemWr and MemtoReg; the Rt read port is a don't care.)

12 Example 3: I-type: branch [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
beq rs, rt, imm16
  mem[PC]                      (fetch the instruction)
  cond = R[rs] - R[rt]
  if cond = 0
    PC = PC + 4 + ext(imm16)*4
  else
    PC = PC + 4
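The branch semantics can be written out as a small function. This is a sketch under one stated assumption: the branch offset is taken relative to PC + 4, as in the standard MIPS definition. The helper names `sign_extend16` and `next_pc_beq` are illustrative.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a two's complement number."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc_beq(pc, rs_val, rt_val, imm16):
    """Next PC for beq. Assumption: the offset is relative to PC + 4."""
    cond = (rs_val - rt_val) & 0xFFFFFFFF     # the ALU subtracts; zero means equal
    if cond == 0:
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF

print(hex(next_pc_beq(0x100, 5, 5, 3)))       # taken: three instructions past PC + 4
print(hex(next_pc_beq(0x100, 5, 6, 3)))       # not taken: falls through
```

Multiplying the extended immediate by 4 corresponds to appending the two zero bits to the sign-extended offset.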

13 Example 3: I-type: branch [Hennessy & Patterson]
(Figure: the ALU compares BusA and BusB and produces Zero; the next address logic combines Zero, Branch and Imm to drive the PC, which addresses the instruction memory. Other control signals as before: RegDst, RegWr, ALUSrc, ExtOp, ALUctr.)

14 Example 3: I-type: branch [Hennessy & Patterson]
(Figure: next address logic. A 30-bit PC is incremented by "1"; when Branch and Zero are both set, a multiplexer adds the sign-extended Imm; "00" is appended to form the 32-bit instruction address.)

15 Architecture of the MIPS core
problem: long critical path, defined by the slowest instruction (load)
solution? = pipelining
  break the instruction into smaller steps
  all steps have about the same critical path
E.g. load, 5 stages: Ifetch (cycle 1), RF read (cycle 2), ALU (cycle 3), dmem (cycle 4), RF write (cycle 5)

16 Pipelining lw instructions [Hennessy & Patterson]
(Figure: three lw instructions, each passing through Ifetch, RF read, ALU, dmem, RF write and each starting one cycle after the previous one; cycles 1 to 7.)
One instruction enters the pipeline every clock cycle.
One instruction leaves the pipeline every clock cycle.
=> CPI = 1 (Cycles per Instruction)
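The timing of such an ideal pipeline can be tabulated with a few lines of code. This is a minimal sketch assuming no stalls or hazards; the function names are illustrative.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]

def pipeline_table(n_instr, stages=STAGES):
    """Return {cycle: [(instruction index, stage), ...]} for an ideal pipeline."""
    table = {}
    for i in range(n_instr):
        for s, name in enumerate(stages):
            # instruction i enters in cycle i + 1 and advances one stage per cycle
            table.setdefault(i + s + 1, []).append((i, name))
    return table

def total_cycles(n_instr, n_stages=len(STAGES)):
    """Fill the pipeline once, then one instruction completes per cycle."""
    return n_stages + n_instr - 1

table = pipeline_table(3)
print(total_cycles(3))             # three lw instructions occupy cycles 1..7
print(total_cycles(1000) / 1000)   # cycles per instruction approaches 1
```

For a long instruction stream the fill time of the pipeline becomes negligible, which is why CPI tends to 1.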

17 Pipelining lw instructions
(Figure: instruction and data memories with several instructions in flight, each in a different one of the stages I, R, A, M, W during the current CPU cycle.)

18 4 stages of an R-type instruction [Hennessy & Patterson]
E.g. ADD: Ifetch (cycle 1), RF read (cycle 2), ALU (cycle 3), RF write (cycle 4)

19 Pipelining lw and R-type instructions [Hennessy & Patterson]
(Figure: a 5-stage lw followed by a 4-stage add; cycles 1 to 7.)
Resource conflict on the write port of the register file: both instructions reach RF write in the same cycle.

20 Solution: stretch R-type to 5 stages [Hennessy & Patterson]
The R-type instruction gets a dummy operation (noop) in the dmem stage, so that lw and add both pass through Ifetch, RF read, ALU, dmem, RF write.
(Figure: lw and add instructions overlapped over cycles 1 to 7 without a write-port conflict.)

21 Pipelined datapath [Hennessy & Patterson]
(Figure: stages Ifetch, Reg/dec, exec, mem, wr. Blocks: program memory with PC + 4 and next PC logic, register file (Ra, Rb, Rw, Di addressed by Rs, Rt, Rd), extender for Imm16, ALU with flags, data memory. Control signals: RegDst, ALUSrc, ExtOp, ALUop, branch, MemWr, MemtoReg, RegWr.)

22 Data dependencies: R-type instructions [Hennessy & Patterson]
(Figure: five overlapped instructions, each passing through IM, RF, ALU, DM, RF. The first instruction computes R1 = ...; following instructions read that result before it has been written back.)

23 Data dependencies: R-type instructions [Hennessy & Patterson]
(Same figure as the previous slide.)
Solution: bypasses

24 Bypasses [Hennessy & Patterson]
(Figure: pipelined datapath with bypass paths added around the ALU and the data memory.)

25 Data dependencies: load instruction [Hennessy & Patterson]
(Figure: four overlapped instructions, each passing through IM, RF, ALU, DM, RF. The first is R1 = lw ...; a following instruction reads the loaded value.)

26 Data dependencies: load instruction [Hennessy & Patterson]
(Same figure, with two following instructions reading the loaded value.)
Bypass is no solution for the instruction directly after the load: the loaded value leaves the data memory only after that instruction needs it in its ALU stage.

27 Data dependencies: load instruction [Hennessy & Patterson]
Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared.
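The interlock condition can be sketched as a check on adjacent instructions. This is a simplified model, not the hardware from the slides: it assumes a 5-stage pipeline with full bypassing, so that only a load followed directly by a consumer of its result costs one stall cycle. The tuple layout `(op, dest, sources)` is an illustrative choice.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, sources). Returns the number of stall cycles.

    Assumption: with bypassing in place, the only remaining hazard is a lw
    whose destination is read by the very next instruction (one bubble).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls

program = [
    ("lw",  "R1", ("R2",)),
    ("add", "R3", ("R1", "R4")),   # needs R1 one cycle too early: stall
    ("sub", "R5", ("R3", "R1")),   # far enough from the load: bypass suffices
]
print(count_load_use_stalls(program))
```

A compiler can often avoid the stall by scheduling an independent instruction directly after the load.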

28 Application examples (1)
(Figure: FIR filter. Inputs x0 to x4 along a delay line of Z^-1 elements, coefficients c0 to c4, five multipliers and an adder producing y.)

29 Application examples (1)
19 instructions per tap!!

30 Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!!
Very simple in hardware

31 Application examples (2)
Bit level operations: DES example
    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25,
(Figure: bits of the source register ($2) rearranged into the destination register ($24).)

32 Application examples (2)
Bit level operations: A5 example (GSM encryption)
    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12,
(Figure: selected bits of $5 combined by xor into a single bit in $13.)
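The instruction sequence above computes the xor of four bits of one register. A minimal sketch of the same computation follows; it assumes that the final `andi` keeps only the lowest bit, which the transcript does not show, and the name `feedback_bit` is illustrative.

```python
def feedback_bit(reg):
    """Xor of bits 18, 17, 16 and 13 of reg, as in the shift/xor sequence above.

    Assumption: the last instruction masks the result down to one bit.
    """
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1

print(feedback_bit(1 << 18))                # a single selected bit set
print(feedback_bit((1 << 18) | (1 << 13)))  # two selected bits cancel
```

In hardware this is three xor gates on fixed wires, which is the contrast the slide draws with the eight instructions needed on a CPU.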

33 Application examples: conclusions
CPUs offer flexibility, but are
  not efficient in performance
  not efficient in code size
  not efficient in power consumption

34 Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design.
Solution in the direction of:
  hardware acceleration
  instruction level parallelism instead of clock speed
  code size efficiency
source: ISSCC 2001, Patrick Gelsinger, Intel

35 Amdahl's law
The impact of an improvement on the execution time of a program depends on 2 parameters:
  f = fraction of the original computation time that is affected by the improvement
  s = speedup factor (local)
exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
if s >> 1 then speedup_overall approaches 1 / (1 - f)
Example: 40 % of the program can be executed 10 x faster
speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
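The formula is easy to evaluate directly. The following sketch reproduces the example from the slide; the function name mirrors the symbol used there.

```python
def speedup_overall(f, s):
    """Amdahl's law: f = affected fraction of the original time, s = local speedup."""
    return 1.0 / ((1.0 - f) + f / s)

print(speedup_overall(0.4, 10))     # the example above: 1 / 0.64
print(speedup_overall(0.4, 1e9))    # s >> 1: bounded by 1 / (1 - f)
```

Even an unlimited local speedup of 40 % of the program cannot push the overall speedup beyond 1 / 0.6, about 1.67.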

36 Conclusions
Programmable CPU cores are important for the control parts of the application.
They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
Keep it Simple heuristic (RISC vs. CISC):
  Make frequent cases fast and rare cases correct.
  Regular (orthogonal) instruction set.
  No special features that match a high level language construct.
  At least 16 registers to ease register allocation.
Embedded cores are often light cores, which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).

37 Programmable Digital Signal Processors
real-time worst-case processing = need for more compute power
  sec/prog = (instr/prog) * (cycles/instr) * (sec/cycle), with CPI = 1
instruction level parallelism (ILP)
hardware support for loop control
attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
difficult to compare architectures
  e.g. DIT, DIF, radix 2/4, FFT
  loop unrolling, scaling, shuffling, initialisation ... can be included or forgotten
benchmarking: Berkeley Design Technology Inc (BDTi) (compare to SpecInt benchmarks for CPUs)

38 Outline
architectures for programmable DSPs
  multiplier-accumulator
  modified Harvard architecture
  extension with an ALU (decision making)
  controller architectures
  examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
  examples: C6 and TM

39 Sum of products Σ c(i) * x(i) = basic operation for correlation, filtering, spectral analysis ... (linear expr.)
Goal = 1 cycle per iteration
(Figure: multiplier-accumulator. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace ...), followed by the product register PR, the ADDER and the accumulator register ACR; clock, P_reg and control inputs.)
Modifications:
  position ACR (1 or 2)
  adder/subtractor
  extra pipelines
  asymmetric inputs
  multi-precision
  extra inputs/outputs

40 DSP data types
not every signal requires 32 bits
2 types of DSP: floating point and integer
advantage FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
disadvantage FP: cost (area, speed, power)
wanted: type of the output of an operation = type of the input (because both are stored in RAM)
  no problem for FP, but a problem for integer: integer multiplication doubles the number of bits, n * n => 2n
What about fractional numbers?

41 DSP data types
integer and fractional numbers are a special case of fixed point: fix with parameters p and q (ART designer & SystemC)
2's complement representation, with a negative weight for the most significant bit
if q = 0 then integer, e.g. int
if q = p-1 then fractional, e.g. int
(Figure: a fixed point word annotated with p and q, a scale factor of 1/8 and the quantization error.)
The same ALU handles all fix types.
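Fixed point quantization can be illustrated in a few lines. The sketch below rests on an assumption about the notation: p is taken to be the total word length and q the number of fractional bits, which is consistent with q = 0 giving integers and q = p-1 giving fractional numbers in [-1, 1). The function names and the choice to saturate on overflow are illustrative.

```python
import math

def to_fix(x, p, q):
    """Quantize x to a p-bit two's complement word with q fractional bits.

    Assumption: p = word length, q = fractional bits, scale factor 1 / 2**q.
    Rounds to nearest and saturates on overflow.
    """
    raw = math.floor(x * (1 << q) + 0.5)
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    return max(lo, min(hi, raw))

def from_fix(raw, q):
    """Real value represented by the stored integer raw."""
    return raw / (1 << q)

raw = to_fix(0.3, 4, 3)              # 4-bit word, scale factor 1/8
print(raw, from_fix(raw, 3))         # nearest representable value
print(0.3 - from_fix(raw, 3))        # quantization error
```

The stored word is an ordinary integer, so the same integer ALU can add values of any fix type as long as both operands share the same q.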

42 DSP data types
continue (after multiplication) with msb only
  represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
more efficient solution: continue with msb + lsb
  sum-of-product operations generate accumulative noise at the 32nd vs. the 16th bit
Still overflow for addition => overflow bits
double precision accumulator + extra overflow bits + shift, round, truncate unit
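The difference between the two options can be made visible with a small experiment. This sketch assumes 16-bit operands with 15 fractional bits; Python integers are unbounded, so the wide accumulator here never overflows, which stands in for the extra overflow bits. Function names are illustrative.

```python
def mac_truncate_each(c, x, frac_bits=15):
    """Continue with the msb part only: every product is truncated before adding."""
    acc = 0
    for ci, xi in zip(c, x):
        acc += (ci * xi) >> frac_bits
    return acc

def mac_wide_accumulator(c, x, frac_bits=15):
    """Accumulate full double-width products, then round once at the end."""
    acc = 0
    for ci, xi in zip(c, x):
        acc += ci * xi
    return (acc + (1 << (frac_bits - 1))) >> frac_bits

c = [100] * 8
x = [100] * 8
print(mac_truncate_each(c, x))      # every small product is truncated away
print(mac_wide_accumulator(c, x))   # the exact sum is 80000 / 32768, about 2.44
```

Truncating each product loses the contribution of all eight taps, while the wide accumulator keeps the error below one least significant bit of the final result.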

43 (Figure: multiplier-accumulator extended with a shift/round/truncate unit. Inputs c(i) and x(i), multiplier MPY (Booth, Wallace ...), product register PR, ADDER, accumulator register ACR, SHIFT ROUND TRUNCATE block; clock, P_reg and control inputs.)

44 Memory architectures for Σ c(i) * x(i); goal = 1 cycle per iteration
Von Neumann (sequential): one prog/data memory connected to the EXU
Harvard: separate prog mem. and data mem. connected to the EXU
Modified Harvard: prog mem., data mem. 1 and data mem. 2 connected to the EXU

45 (Figure: DSP architecture. Controller with PC (+1), stack, interrupt address, reset, program memory and IR; control bus; two data paths, each with an address computation unit and address register (ACU_A/AR_A, ACU_B/AR_B), a RAM (RAM_A, RAM_B) and a data register (DR_A, DR_B); MAC unit with Rfile.)

46 Σ c_i * x_i
(Figure: FIR filter with inputs x1 to x5 along a delay line of Z^-1 elements, coefficients c1 to c5, multipliers and an adder producing y; a time loop around a filter loop over i.)
How to update the delay line?
1 cycle/tap?

47 Solution 2: indirect addressing
use a pointer to mark the beginning of the delay line
update the pointer instead of moving the data
problem: trashing of the whole memory
solution: modulo addressing
need for a register to store the pointer

48 ACU architecture and instruction set
(Figure: address register A and step register S feeding an adder, followed by a modulo unit and an output register driving the RAM address.)
instruction   output to RAM   A reg   S
Read_A        A               A       S
Read_S        S               A       S
incA          A+1             A+1     S
decA          A-1             A-1     S
Step          A+S             A+S     S
Inc_step      S+1             A       S+1
Modulo can be implemented as a mask operation if the size is 2^k.
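Pointer-based delay line updating with modulo addressing can be sketched as follows. This is an illustration of the idea, not the ACU from the slides: the class name `CircularDelayLine`, the choice to decrement the pointer on each new sample, and the power-of-two size restriction (so that modulo becomes a mask) are assumptions of the example.

```python
class CircularDelayLine:
    """Delay line that moves a pointer instead of the data."""

    def __init__(self, size):
        # size must be 2**k so that modulo reduces to a mask operation
        assert size > 0 and size & (size - 1) == 0
        self.mem = [0] * size
        self.mask = size - 1
        self.ptr = 0                     # marks the newest sample

    def push(self, sample):
        self.ptr = (self.ptr - 1) & self.mask    # modulo update of the pointer
        self.mem[self.ptr] = sample              # overwrites the oldest sample

    def tap(self, i):
        """Sample delayed by i steps (i = 0 is the newest)."""
        return self.mem[(self.ptr + i) & self.mask]

def fir(coeffs, line):
    """One output sample: sum of c(i) * x(i) over the delay line."""
    return sum(c * line.tap(i) for i, c in enumerate(coeffs))

line = CircularDelayLine(4)
for s in [1, 2, 3, 4, 5]:
    line.push(s)
print([line.tap(i) for i in range(4)])
print(fir([1, 10, 100, 1000], line))
```

Each new sample costs one memory write and one pointer update, independent of the filter length, and the mask keeps the pointer inside the buffer.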

49 Addressing modes
mode        example            meaning
register    ADD R4, R3         R[R4] = R[R4] + R[R3]
immediate   ADD R4, #3         R[R4] = R[R4] + #3
direct      ADD R4, (100)      R[R4] = R[R4] + Mem[100]
indirect    ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec  ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
indexed     ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]
Remarks
direct = for static data
indirect = for arrays
inc/dec = for stepping through arrays, e.g. Σ x(n)
index = for stepping through arrays, e.g. Σ x(2n)

50 Addressing modes: extra for DSP
8 ARs (address or auxiliary registers) available
extra indirect modes:
  circular:     *ARn±%       post inc/dec by 1, circular
                *ARn±AR0%    post inc/dec by AR0, circular
  bit reverse:  *ARn±AR0B    post inc/dec by AR0, bit reversed
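Bit-reversed addressing is commonly described as an addition with the carry propagating in the reverse direction. The sketch below models that description in software; it is an assumption about how the mode behaves, not taken from the slides, and the function names are illustrative.

```python
def bit_reverse(value, bits):
    """Reverse the order of the lowest `bits` bits of value."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out

def reverse_carry_add(addr, step, bits):
    """Add step to addr with the carry running from msb to lsb."""
    total = (bit_reverse(addr, bits) + bit_reverse(step, bits)) & ((1 << bits) - 1)
    return bit_reverse(total, bits)

def bit_reversed_sequence(n_bits):
    """Addresses visited when post-incrementing by half the buffer size."""
    addr, step, seq = 0, 1 << (n_bits - 1), []
    for _ in range(1 << n_bits):
        seq.append(addr)
        addr = reverse_carry_add(addr, step, n_bits)
    return seq

print(bit_reversed_sequence(3))
```

With the step register set to half the FFT size, the pointer visits the data in the shuffled order that radix-2 FFT algorithms need, without any extra instructions.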

51 (Figure: DSP architecture extended with an ALU. Controller with PC (+1), stack, interrupt address, reset, program memory and IR; control bus; ACU_A/AR_A with RAM_A and DR_A; ACU_B/AR_B with RAM_B and DR_B; MAC and ALU with Rfile.)

52 Σ c(i) * x(i): first solution
6 clock cycles/sample
limit: pipelines in the controller
(Figure: schedule of resources against time (cc). Not shown: coefficient RAM + ACU.)

53 Loop folding (software pipelining)
Original loop:
    for i = 0 to n
        b(i) = f(a(i))
        c(i) = g(b(i))
        d(i) = h(c(i))
Folded loop:
    for i = 2 to n
        b(i) = f(a(i))
        c(i-1) = g(b(i-1))
        d(i-2) = h(c(i-2))
(Figure: the chains a -> f -> b -> g -> c -> h -> d for iterations 0, 1 and 2, and the folded loop body in which f, g and h work on three different iterations at the same time.)
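The folded loop needs a preamble to fill the software pipeline and a postamble to drain it. The following sketch checks that the folded version computes the same result as the original loop; the function names and the sample functions f, g, h are illustrative.

```python
def plain_loop(a, f, g, h):
    """Original loop: f, g and h are applied one after another per iteration."""
    return [h(g(f(ai))) for ai in a]

def folded_loop(a, f, g, h):
    """Folded loop: the steady-state body works on iterations i, i-1 and i-2."""
    n = len(a)
    if n < 2:
        return plain_loop(a, f, g, h)
    b, c, d = [None] * n, [None] * n, [None] * n
    # preamble: fill the software pipeline
    b[0] = f(a[0])
    b[1] = f(a[1]); c[0] = g(b[0])
    # steady state: three independent operations that can run in parallel
    for i in range(2, n):
        b[i] = f(a[i])
        c[i - 1] = g(b[i - 1])
        d[i - 2] = h(c[i - 2])
    # postamble: drain the software pipeline
    c[n - 1] = g(b[n - 1]); d[n - 2] = h(c[n - 2])
    d[n - 1] = h(c[n - 1])
    return d

f = lambda v: v + 1
g = lambda v: v * 2
h = lambda v: v - 3
data = list(range(6))
print(plain_loop(data, f, g, h) == folded_loop(data, f, g, h))
```

Inside the steady-state body no operation depends on a result produced in the same pass, which is what allows a processor with parallel units to issue all three at once.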

54 Loop folding (software pipelining)
Σ c(i) * x(i)
4 clock cycles/sample
pre- and postamble

55 Hardware support for loop control
Σ c(i) * x(i)
1 clock cycle/sample
repeat instruction and repeat block

56 TMS320C5000
(Figure: data path. T register, sign control, 17 * 17 multiplier with fractional mode, 40-bit adder with zero/saturate/round, 40-bit ALU, two 40-bit accumulators A and B, barrel shifter, MSW/LSW select, COMP/TRN/TC unit, input multiplexers selecting among T, A, B, C, D and P.)

57 Motorola 56K family
(Figure: block diagram.)
address ALU generating the X, Y and P addresses; external address switch with a 16-bit address bus
X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
program memory: 2,048-by-24-bit ROM
X data, Y data, P data and global data buses; internal data-bus switch; external data-bus switch with a 24-bit data bus
data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
program controller: clock (2 bits), interrupt (3 bits)
on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O (24 bits I/O ports), bus control (7 bits)

58 R.E.A.L.
(Figure: block diagram.)
X data memory, 16-bit bus
Y data memory, 16-bit bus
program memory (Z data), 16-bit bus
two address computation units
program control unit; instruction decoder, 96-bit instructions
two 16-by-16-bit multipliers (inputs X, Y0, Y1)
two 40-bit arithmetic-logic units, with scale and saturation
four 40-bit accumulators, with saturation/scale shift

59 Code generation
source -> front end -> intermediate machine independent representation -> code generation -> code
front end: lexical analysis, syntax analysis, semantic analysis
code generation: code selection, register allocation, scheduling
  1 instr = parallel ops
  order of instr

60 Intermediate machine independent representation
    t1 := a * b
    t2 := c + d
    t3 := t1 + c
    out := t2 * t3
(Figure: the corresponding data flow graph with inputs a, b, c, d and intermediate values t1, t2, t3, next to a control flow graph of basic blocks BBi, BBj, BBk.)

61 Code selection example: ADSP [Analog Devices]
(Figure: data path. ALU with input registers ax, ay and result registers ar, af; MAC with input registers mx, my and result registers mr, mf; d memory and p memory.)

62 Example of code selection = covering of the intermediate representation with RTPs
(Figure: the data flow graph with inputs a, b, c, d and values t1, t2, t3, covered by three patterns numbered 1 to 3.)
loads: mx := dmem, my := pmem, ax := dmem, ay := pmem, mr := dmem
patterns:
    ar := ax + ay
    my := ar
    mr := mr * my
    mr := mr + (mx * my)

63 Problems
local decisions which have a global impact
phase coupling, example:
  asap schedule, maximal freedom for scheduling
  code selection during scheduling
  register allocation comes afterwards
  can lead to infeasible solutions

64 Phase coupling: discussion
It is very difficult and almost impossible to develop robust and efficient DSP compilers.
Current DSP practice = programming in assembler.
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
   efficiency = exploit instruction level parallelism (ILP)
   compilation = systematic positioning of registers and regular interconnect
   = VLIW = Very Long Instruction Word

65 Will embedded CPUs and DSPs converge?
Converging forces
  both include a hardware multiplier
  trend in DSPs towards caches and RTK
  trend in DSPs towards C/C++
  common trend towards VLIW
Diverging forces
  deeply embedded code (DSP) vs. end-user SW (CPU)
  different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
Conclusions VLIW
  good balance between hw and sw, and between efficiency (ILP) and cost
  fundamental problems: code size, interruptability

