Embedded Processor Architecture 5kk73. Henk Corporaal / Bart Mesman

1 Embedded Processor Architecture 5kk73

2 [Figure: flexibility versus efficiency, showing in order: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]

3 Application examples (1)
[Figure: FIR filter structure with delay elements Z^-1, samples x0 to x4, coefficients c0 to c4, one multiplier per tap, and an adder producing output y]

4 Application examples (1)
19 instructions per tap!!
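
To see where the per-tap instruction count comes from, here is a minimal sketch of one filter output computation. The function names `fir_output` and `shift_delay_line` are illustrative and not from the slides; the comments mark the kinds of work a general-purpose CPU has to spend separate instructions on for every tap.

```python
def fir_output(coeffs, delay_line):
    """Compute one output sample y = sum(c[i] * x[i]) over all taps."""
    acc = 0
    for i in range(len(coeffs)):   # loop control: increment, compare, branch
        c = coeffs[i]              # address computation + load of coefficient
        x = delay_line[i]          # address computation + load of sample
        acc += c * x               # multiply, then accumulate
    return acc


def shift_delay_line(delay_line, new_sample):
    """Insert the newest sample in front and drop the oldest one."""
    return [new_sample] + delay_line[:-1]
```

A DSP aims to collapse the whole loop body (two loads, two address updates, multiply, accumulate, loop test) into a single cycle per tap.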

5 Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!!
Very simple in hardware

6 Application examples (2)
Bit level operations: DES example
source register ($2), destination register ($24)
    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25, ...

7 Application examples (2)
Bit level operations: A5 example (GSM encryption)
    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12, 1
[Figure: shift register $5 whose tapped bits are XORed into the single bit $13]
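
The instruction sequence on this slide computes one feedback bit by XORing bits 18, 17, 16, and 13 of the register. A minimal sketch of the same computation follows. The 19-bit register width, the left shift direction, and the names `feedback_bit` and `lfsr_step` are assumptions for illustration and are not stated on the slide.

```python
REGISTER_BITS = 19  # assumed width; the slide only shows taps up to bit 18
REGISTER_MASK = (1 << REGISTER_BITS) - 1


def feedback_bit(state):
    """XOR of bits 18, 17, 16 and 13, as in the srl/xor/andi sequence."""
    return ((state >> 18) ^ (state >> 17) ^ (state >> 16) ^ (state >> 13)) & 1


def lfsr_step(state):
    """Shift left by one (assumed direction) and insert the feedback bit."""
    return ((state << 1) & REGISTER_MASK) | feedback_bit(state)
```

In hardware this is four wires and three XOR gates, which is why the slide contrasts it with the eight-instruction software version.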

8 Application examples: conclusions
CPUs offer flexibility, but they are
- not efficient in performance
- not efficient in code size
- not efficient in power consumption

9 Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design.
Solutions point in the direction of:
- hardware acceleration
- instruction level parallelism instead of clock speed
- code size efficiency
Source: ISSCC 2001, Patrick Gelsinger, Intel

10 Amdahl's law
The impact of an improvement on the execution time of a program depends on two parameters:
- f = fraction of the original computation time that is affected by the improvement
- s = speedup factor (local)
exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
If s >> 1, then speedup_overall ≈ 1 / (1 - f).
Example: 40% of the program can be executed 10x faster:
speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
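
The formula is easy to check numerically. The sketch below uses the slide's symbols f and s; the function names `amdahl_speedup` and `amdahl_limit` are illustrative.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the run time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)


def amdahl_limit(f):
    """Upper bound on the overall speedup as s grows without bound."""
    return 1.0 / (1.0 - f)


print(round(amdahl_speedup(0.4, 10), 2))  # 1.56, the example on the slide
print(round(amdahl_limit(0.4), 2))        # 1.67, the best case for f = 0.4
```

Even an infinitely fast accelerator for 40% of the program cannot reach a speedup of 2, which is why the unaccelerated fraction dominates.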

11 Conclusions
Programmable CPU cores are important for the control parts of the application. They are well supported by tools for developing end-user software (vs. deeply embedded software).
Keep-it-simple heuristic (RISC vs. CISC):
- Make frequent cases fast and rare cases correct.
- Regular (orthogonal) instruction set.
- No special features that match a high-level language construct.
- At least 16 registers to ease register allocation.
Embedded cores are often light cores that compromise between performance, area, and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).

12 Programmable Digital Signal Processors
- real-time worst-case processing = need for more compute power
  sec/prog = (instr/prog) * (cycles/instr) * (sec/cycle)
  - CPI = 1
  - instruction level parallelism (ILP)
  - hardware support for loop control
- attention for high-level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
- difficult to compare architectures
  - e.g. DIT, DIF, radix 2/4 FFT
  - loop unrolling, scaling, shuffling, initialisation, ... can be included or forgotten
  - benchmarking: Berkeley Design Technology Inc. (BDTI) (compare to SPECint benchmarks for CPUs)

13 Outline
- architectures for programmable DSPs
  - multiplier-accumulator
  - modified Harvard architecture
  - extension with an ALU (decision making)
  - controller architectures
  - examples: TI, Motorola, Philips
- code generation
- developments: VLIW (Very Long Instruction Word)
  - examples: C6 and TM

14 DSP data types
- not every signal requires 32 bits
- 2 types of DSP: floating point and integer
- advantage of FP: most specs are in FP (conversion to integer is time consuming since the behavior may change)
- disadvantage of FP: cost (area, speed, power)
- integer multiplication doubles the number of bits: n * n => 2n
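
The n * n => 2n rule can be verified by checking the extreme products of two n-bit two's-complement operands. This is a small sketch; the helper names `bits_needed_signed` and `worst_case_product_bits` are illustrative.

```python
def bits_needed_signed(value):
    """Smallest two's-complement width that can represent value."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


def worst_case_product_bits(n):
    """Width needed for any product of two n-bit signed integers."""
    lo = -(1 << (n - 1))        # most negative n-bit value
    hi = (1 << (n - 1)) - 1     # most positive n-bit value
    extremes = [lo * lo, lo * hi, hi * hi]
    return max(bits_needed_signed(p) for p in extremes)


print(worst_case_product_bits(16))  # 32
```

The full 2n bits are only needed for the single case where both operands are the most negative value; every other product fits in 2n - 1 bits.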

15 [Figure: multiplier-accumulator datapath. Labels: inputs c(i) and x(i), MPY (Booth, Wallace, ...), P_reg, PR, ADDER, ACR, SHIFT / ROUND / TRUNCATE, clock, control]

16 Σ c(i) * x(i)
Goal = 1 cycle per iteration
- Von Neumann (sequential): one prog/data memory, EXU
- Harvard: prog mem., data mem., EXU
- Modified Harvard: prog mem., data mem. 1, data mem. 2, EXU

17 [Figure: DSP architecture. Labels: Program Memory, PC (+1), Stack, Interrupt address, Reset, IR, Control, Bus, Rfile, ACU_A, AR_A, RAM_A, DR_A, ACU_B, AR_B, RAM_B, DR_B, MAC]

18 Σ c_i * x_i
[Figure: FIR filter with delay elements Z^-1, samples x1 to x5, coefficients c1 to c5, and output y; annotated with a time loop and a filter loop over i]
How to update the delay line?
1 cycle/tap?

19 Solution 2: indirect addressing
- use a pointer to mark the beginning of the delay line
- problem: trashing of the whole memory
- solution: modulo addressing
- need for a register to store the pointer

20 ACU architecture and instruction set
[Figure: ACU with registers A and S, a modulo unit, and output to RAM]
Instruction   Output   A reg   S reg
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1
Modulo can be implemented as a mask operation if the size is 2^k.
mask = hold
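
Modulo addressing with a power-of-two buffer size can be sketched as a circular delay line in which wrap-around is a single AND with a mask. The class name `CircularDelayLine` and its methods are hypothetical and only illustrate the idea; they do not model a specific ACU.

```python
class CircularDelayLine:
    """Delay line stored in a buffer of size 2**k with a moving start pointer."""

    def __init__(self, size_log2):
        self.size = 1 << size_log2
        self.mask = self.size - 1      # size is 2**k, so modulo is an AND
        self.mem = [0] * self.size
        self.head = 0                  # pointer register: begin of delay line

    def push(self, sample):
        """Store the newest sample by moving the pointer, not the data."""
        self.head = (self.head - 1) & self.mask
        self.mem[self.head] = sample

    def tap(self, i):
        """Return the sample that is i steps old (0 = newest)."""
        return self.mem[(self.head + i) & self.mask]
```

Pushing a sample costs one pointer update instead of shifting every element, and the pointer never leaves the buffer, which addresses the problem of walking through the whole memory.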

21 Addressing modes
Mode         Example             Effect
register     ADD R4, R3          R[R4] = R[R4] + R[R3]
immediate    ADD R4, #3          R[R4] = R[R4] + #3
direct       ADD R4, (100)       R[R4] = R[R4] + Mem[100]
indirect     ADD R4, (R3)        R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec   ADD R4, (R3)±       R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
indexed      ADD R4, (R3±R2)     R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]
Remarks
- direct = for static data
- indirect = for arrays
- inc/dec = for stepping through arrays, e.g. Σ x_n
- index = for stepping through arrays, e.g. Σ x_2n
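
The register-transfer descriptions of the addressing modes translate directly into a toy simulator. This is a sketch under stated assumptions: the register file `R` is a dictionary, `Mem` is a list, and the mode names and the functions `add` and `sum_array` are made up for illustration.

```python
def add(R, Mem, dst, mode, operand, step=1):
    """Execute ADD dst, <operand> under the given addressing mode."""
    if mode == "register":
        R[dst] += R[operand]
    elif mode == "immediate":
        R[dst] += operand
    elif mode == "direct":
        R[dst] += Mem[operand]
    elif mode == "indirect":
        R[dst] += Mem[R[operand]]
    elif mode == "postinc":
        R[dst] += Mem[R[operand]]   # use the address first ...
        R[operand] += step          # ... then update the address register
    else:
        raise ValueError(mode)


def sum_array(Mem, base, count, step=1):
    """Sum count elements starting at base, stepping the pointer each time."""
    R = {3: base, 4: 0}             # R3 = pointer, R4 = accumulator
    for _ in range(count):
        add(R, Mem, 4, "postinc", 3, step)
    return R[4]
```

With `step=1` this sums x_n, and with `step=2` it sums x_2n, matching the two remarks about stepping through arrays.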

22 Addressing modes: extra for DSP
8 ARs (address or auxiliary registers) available
Extra indirect modes:
- circular: *ARn±%          post inc/dec by 1, circular
- circular: *ARn±AR0%       post inc/dec by AR0, circular
- bit reverse: *ARn±AR0B    post inc/dec by AR0, bit reversed
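
Bit-reversed addressing exists because radix-2 FFTs read or write their data in bit-reversed index order. The sketch below only shows that index order in software; it does not model how a particular DSP generates it in hardware, and the function name `bit_reverse` is illustrative.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit to the top
        index >>= 1
    return result


# Access order for an 8-point FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this in the address unit removes a separate reordering pass over the data.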

23 [Figure: DSP architecture. Labels: Program Memory, PC (+1), Stack, Interrupt address, Reset, IR, Control, Bus, Rfile, ACU_A, AR_A, RAM_A, DR_A, ACU_B, AR_B, RAM_B, DR_B, MAC, ALU]

24 Σ c(i) * x(i)
First solution: 6 clock cycles/sample
Limit: pipelines in the controller
[Figure: schedule of resources versus time (cc); not shown: coefficient RAM + ACU]

25 Loop folding (software pipelining)
Original loop:
    for i = 0 to n
        b_i = f(a_i)
        c_i = g(b_i)
        d_i = h(c_i)
Folded loop:
    for i = 2 to n
        b_i = f(a_i)
        c_{i-1} = g(b_{i-1})
        d_{i-2} = h(c_{i-2})
[Figure: iterations 0, 1, 2 of the chain a_i -> f -> b_i -> g -> c_i -> h -> d_i, and the folded loop body in which f, g, and h work on iterations i, i-1, and i-2]
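
The folded loop only produces the same results as the original if a preamble fills the pipeline and a postamble drains it. The sketch below makes both explicit. The stage functions `f`, `g`, and `h` are arbitrary placeholders, the loops run over indices 0 to n-1 (Python convention) rather than 0 to n, and `folded_loop` assumes at least two elements.

```python
def f(a): return a + 1
def g(b): return b * 2
def h(c): return c - 3


def plain_loop(a):
    """Original loop: the three stages of one iteration run back to back."""
    return [h(g(f(x))) for x in a]


def folded_loop(a):
    """Folded loop: the stages in the body belong to iterations i, i-1, i-2."""
    n = len(a)                      # assumes n >= 2
    b, c, d = [None] * n, [None] * n, [None] * n
    # Preamble: fill the pipeline.
    b[0] = f(a[0])
    b[1] = f(a[1]); c[0] = g(b[0])
    # Steady state: three independent operations per iteration.
    for i in range(2, n):
        b[i] = f(a[i])
        c[i - 1] = g(b[i - 1])
        d[i - 2] = h(c[i - 2])
    # Postamble: drain the pipeline.
    c[n - 1] = g(b[n - 1]); d[n - 2] = h(c[n - 2])
    d[n - 1] = h(c[n - 1])
    return d
```

Inside the steady-state body no operation depends on another one from the same iteration, so a machine with enough parallel resources can issue all three in the same cycle.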

26 Loop folding (software pipelining)
Σ c(i) * x(i)
Pre- and postamble
4 clock cycles/sample

27 Σ c(i) * x(i)
Hardware support for loop control: repeat instruction and repeat block
1 clock cycle/sample

28 TMS320C5000
[Figure: datapath. Labels: T register, sign ctr, multiplier (17*17), fractional, adder (40), ZERO / SAT / ROUND, ALU (40), accumulators A(40) and B(40), barrel shifter, MSW/LSW select, COMP, TRN, TC, and multiplexers selecting among T, A, B, C, D, P]

29 Motorola 56K family
[Figure: block diagram. Labels:
- X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
- Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
- program memory: 2,048-by-24-bit ROM
- address ALU; X address, Y address, P address; external address switch; address bus, 16 bits
- X data, Y data, P data, global data; internal data-bus switch; external data-bus switch; data bus, 24 bits
- data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
- program controller
- on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O
- bus control / 2 bits / clock / 3 bits / interrupt / 24 bits / I/O ports / 7 bits]

30 R.E.A.L.
[Figure: block diagram. Labels:
- X data memory, 16-bit bus; Y data memory, 16-bit bus; program memory (Z data), 16-bit bus
- buses for X, Y, and Z data
- two address computation units
- program control unit; instruction decoder, 96-bit instructions
- two 16-by-16-bit multipliers with inputs X, Y0, Y1
- two 40-bit arithmetic-logic units
- four 40-bit accumulators
- saturation, scale, saturation/scale shift, POP1]

31 [Figure: compiler flow. source -> front end (lexical analysis, syntax analysis, semantic analysis) -> intermediate machine independent representation -> code generation (code selection, register allocation, scheduling) -> code. Annotations: 1 instr = // ops; order of instr]

32 Intermediate machine independent representation
    t1 := a * b
    t2 := c + d
    t3 := t1 + c
    out := t2 * t3
[Figure: expression graph over a, b, c, d with intermediate values t1, t2, t3; basic blocks BBi, BBj, BBk]
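
The four assignments on this slide form a small three-address program, which can be represented as data and interpreted without reference to any target machine. The tuple layout `(dest, op, left, right)` and the names `PROGRAM`, `OPS`, and `run` are assumptions made for this sketch.

```python
import operator

OPS = {"*": operator.mul, "+": operator.add}

# The slide's basic block as (dest, op, left, right) tuples.
PROGRAM = [
    ("t1", "*", "a", "b"),
    ("t2", "+", "c", "d"),
    ("t3", "+", "t1", "c"),
    ("out", "*", "t2", "t3"),
]


def run(program, inputs):
    """Evaluate the statements in order and return all computed values."""
    env = dict(inputs)
    for dest, op, left, right in program:
        env[dest] = OPS[op](env[left], env[right])
    return env


result = run(PROGRAM, {"a": 2, "b": 3, "c": 4, "d": 5})
print(result["out"])  # 90
```

Code generation then has to map each tuple, or a group of tuples, onto instructions and registers of the actual processor.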

33 Code selection example
ADSP [Analog Devices]
[Figure: datapath with d memory and p memory; ALU (+, -) with inputs x, y and registers ax, ay, ar, af; MAC (*) with inputs x, y and registers mx, my, mr, mf]

34 Example of code selection = covering of intermediate representation with RTPs
[Figure: expression graph over a, b, c, d with intermediate values t1, t2, t3, covered by patterns labelled 1, 2, 3]
Register transfers shown:
    mx := dmem    my := pmem    ax := dmem    ay := pmem    mr := dmem
    ar := ax + ay
    my := ar
    mr := mr * my
    mr := mr + (mx * my)

35 Problems
- local decisions that have a global impact
- phase coupling, example:
  - ASAP schedule
  - maximal freedom for scheduling
  - code selection during scheduling
  - register allocation comes afterwards
  - can lead to infeasible solutions

36 Phase coupling: discussion
It is very difficult, and almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice = programming in assembler.
Solution:
1. Solve code generation for DSPs.
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler.
   - efficiency = exploit instruction level parallelism (ILP)
   - compilation = systematic positioning of registers and regular interconnect
   - = VLIW = Very Long Instruction Word

