Platform Design TU/e 5kk70 Henk Corporaal Bart Mesman ASIP Application Specific Instruction-set Processor.


1 Platform Design TU/e 5kk70 Henk Corporaal Bart Mesman ASIP Application Specific Instruction-set Processor

2 7/2/2015 Platform Design H.Corporaal and B. Mesman
Application domain specific processors (ADSP or ASIP)
[Figure: flexibility versus efficiency. Labels: Programmable CPU, Programmable DSP, Application domain specific, Application specific processor]

3 Application domain specific processors (ADSP or ASIP)
- takes a well defined application domain as a starting point
- exploits characteristics of the domain (computation kernels)
- still programmable within the domain
- e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc.
[Figure: application domain mapped to a GP implementation and to an ADSP implementation. Annotations:]
- performance: clock speed + ILP; ILP, DLP, tuning to domain
- flexible dev. (new apps.); cost effective (high volume)
- problems: specification; design time and effort
- manual design, large effort => synthesized cores

4 www.adelantetech.com

5 Design process
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
Fast, accurate and early feedback
[Flowchart: application(s); processor model (parameters; instance, e.g. VLIW with shared RFs); SW (code generation) with estimations of cycles/alg and occupation; HW design with estimations of nsec/cycle, area, power/instr; OK? (yes/no); more appl.? (yes/no); go to phase 2]

6 Problem statement
A compiler is retargetable if it can generate code for a ‘new’ processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP, e.g.
a := b + c | instr = xxxx0101
GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
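The GRTP notion can be made concrete with a small sketch. It assumes that the control bits are written as a string in which `x` means "don't care", as in the slide's `xxxx0101`, and that two GRTPs conflict when they fix the same instruction bit to different values. The `GRTP` class, the helper names, and the `mov` and `sub` patterns are hypothetical and only serve as illustration; they are not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    rt: str    # register transfer pattern, e.g. "a := b + c"
    bits: str  # control bits of the instruction word, 'x' = don't care


def compatible(p: GRTP, q: GRTP) -> bool:
    """True if no instruction bit is fixed to different values by p and q."""
    return len(p.bits) == len(q.bits) and all(
        a == "x" or b == "x" or a == b for a, b in zip(p.bits, q.bits)
    )


def merge_bits(p: GRTP, q: GRTP) -> str:
    """Control bits of an instruction word that executes both patterns."""
    if not compatible(p, q):
        raise ValueError("inter-RT conflict: patterns cannot share an instruction")
    return "".join(b if a == "x" else a for a, b in zip(p.bits, q.bits))


add = GRTP("a := b + c", "xxxx0101")  # pattern from the slide
mov = GRTP("d := e", "01xxxxxx")      # hypothetical second pattern
sub = GRTP("a := b - c", "xxxx0110")  # hypothetical, uses the same bit field as add

print(compatible(add, mov), merge_bits(add, mov))
print(compatible(add, sub))
```

Because the conflict information is carried by the bit patterns themselves, a code generator working on GRTPs would not need a separate resource model to decide which transfers may be issued together.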

7 Problem statement
[Figure: Algorithm spec -> FE -> CDFG -> Code Generation -> Machine code; Processor spec (instance) -> ISE -> GRTP -> Code Generation]
Annotation: in ch. 4 this is part of the code generator

8 Example: Simple processor [Leupers]
[Figure: datapath with PC, +1 incrementer, IM, RAM, REG, Inp and outp; instruction word I.(20:0) with fields I.(20:13), I.(12:5), I.(4), I.(3:2), I.(1:0)]

9 Example: Simple processor [Leupers]

10 ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with VLIW processors of ch. 4:
1. Parallel FUs
   - ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT etc.
   - larger grain size, more heterogeneous, more pipelines
2. Rfiles
   - many Rfiles (>5 vs. 1 or 2)
   - limited # ports (3 vs. 15)
   - limited size (<16 vs. 128)
3. Issue slots
   - all in parallel vs. 5

11 [Figure: register files RF1 to RF8, functional units FU1 to FU4, instruction registers IR1 to IR4, Instruction memory, Control flags]

12 ASIP/VLIW architectures
Additional characteristics of the A|RT designer template:
- interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
- memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
- each FU can generate one or more flags
- instruction format (per issue slot): read address RF 1, write address RF 1, read address RF 2, write address RF 2, mux 1, mux 2, control FU, output drivers

13 ASIP/VLIW architectures: example
[Figure: ALU and MAC connected through bus1 and bus2 to register files RF1, RF2, RF3, RF4]
Instruction word, bits 0 to 9: mux 2, read RF1, write RF1, read RF2, write RF2, ALU instr.
Instruction word, bits 10 to 19: mux 3, read RF4, write RF4, read RF3, write RF3, MAC instr.
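How such a per-slot instruction format could be packed into one word is sketched below. The field order follows the slide, and the bit positions 0, 9, 10 and 19 are read here as two 10-bit issue slots. The individual field widths are not given in the slides; the widths and the field names in `SLOT_FIELDS` are invented purely so that one slot adds up to 10 bits.

```python
# Field order as on the slide; the widths are hypothetical (sum = 10 bits).
SLOT_FIELDS = [("mux", 1), ("read_a", 2), ("write_a", 2),
               ("read_b", 2), ("write_b", 2), ("op", 1)]
SLOT_WIDTH = sum(width for _, width in SLOT_FIELDS)


def pack_slot(values):
    """Pack one issue slot, first field in the least significant bits."""
    word, shift = 0, 0
    for name, width in SLOT_FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack_slot(word):
    """Inverse of pack_slot."""
    fields, shift = {}, 0
    for name, width in SLOT_FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


def pack_instruction(alu_slot, mac_slot):
    """ALU slot in the low 10 bits, MAC slot in the next 10 bits."""
    return pack_slot(alu_slot) | (pack_slot(mac_slot) << SLOT_WIDTH)


alu = {"mux": 1, "read_a": 2, "write_a": 0, "read_b": 3, "write_b": 1, "op": 1}
mac = {"mux": 0, "read_a": 1, "write_a": 3, "read_b": 0, "write_b": 2, "op": 0}
word = pack_instruction(alu, mac)
print(f"{word:020b}")
```

Every slot carries its own register addresses and mux settings, which is why the control can change every clock cycle and why the instruction word grows with the number of issue slots.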

14 ASIP/VLIW architectures: example

15 ASIP/VLIW architectures: design flow
[Flowchart: Algorithm spec; Datapath synthesis; Controller synthesis; RTs; Estimations (area, power, timing); OK? (no: Change pragmas; yes)]
Example RT:
RF1 : x = RF2 : y, RF3 : z | ALU = ADD, Inmux = bus2
Example pragmas:
assign ( a+b, ALU, fu_alu1)
assign ( a+_, ALU, fu_alu2)
assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection possible

16 ASIP/VLIW architectures: list scheduling
[Figure: data-flow graph with operations *1, +2, *3, *4, *5, +6, +7, *8, *9, +10 scheduled on the resources IPB, OPB, ALU and MULT over cycles 0 to 5. Each scheduling step shows the candidate list, the conflict and priority computation, and the scheduled operation]
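A minimal list scheduler illustrates the loop named in the figure: build a candidate list, order it by priority, resolve resource conflicts, and issue. The sketch assumes unit latency for every operation and uses the longest path to a sink as priority, which is one common choice and not necessarily the one used in the course tools. The data-flow graph, the operation names, and the function name `list_schedule` are hypothetical, since the edges of the slide's graph cannot be recovered from the transcript.

```python
def list_schedule(ops, deps, units):
    """ops: name -> unit type; deps: name -> set of predecessor names;
    units: unit type -> number of instances. Returns name -> start cycle."""
    succs = {n: [] for n in ops}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)

    # Priority: length of the longest path from the operation to a sink.
    height = {}

    def longest_path(n):
        if n not in height:
            height[n] = 1 + max((longest_path(s) for s in succs[n]), default=0)
        return height[n]

    for n in ops:
        longest_path(n)

    done, start, cycle = set(), {}, 0
    while len(done) < len(ops):
        # Candidate list: unscheduled operations whose predecessors finished.
        ready = sorted(
            (n for n in ops if n not in done and deps.get(n, set()) <= done),
            key=lambda n: (-height[n], n),
        )
        free = dict(units)
        issued = []
        for n in ready:
            if free.get(ops[n], 0) > 0:  # resource conflict check
                free[ops[n]] -= 1
                start[n] = cycle
                issued.append(n)
        if not issued:
            raise ValueError("no progress: cyclic dependencies or missing unit")
        done.update(issued)
        cycle += 1
    return start


# Hypothetical graph: a3 = m1 + m2, a5 = a3 + m4, on one ALU and one MULT.
ops = {"m1": "MULT", "m2": "MULT", "a3": "ALU", "m4": "MULT", "a5": "ALU"}
deps = {"a3": {"m1", "m2"}, "a5": {"a3", "m4"}}
units = {"ALU": 1, "MULT": 1}
schedule = list_schedule(ops, deps, units)
print(schedule)
```

With a single multiplier the three multiplications serialize, and the priority function decides that `m4` waits because it is less critical than `m1` and `m2`.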

17 ASIP/VLIW architectures: feedback

18 Low power aspects
[Figure: Implementation Independent Design Database; Estimation (area, speed, power); Estimation Database + Architecture; Mistral2]

19 GSM viterbi decoder: default solution
Cycle count: 13750

EXU         ACTIV   AREA    POWER
alu_1       96%     3469    46196
romctrl_1   48%     39      259
acu_1       26%     327     1209
ipb_1       5%      131     105
opb_1       23%     1804    5801
ctrl                9821    135035
total               15591   188605

- controller responsible for 70% of power consumption
  - maximum resource-sharing
  - heavy decision-making: “main” loop with 16 metrics-computations per iteration
- EXU-numbers include Registers for local storage

20 GSM viterbi decoder: no loop-folding
Cycle count: 14247

EXU         ACTIV   AREA    POWER
alu_1       92%     3411    45073
romctrl_1   45%     39      255
acu_1       25%     294     1087
ipb_1       5%      107     86
opb_1       22%     1661    5340
ctrl                4919    70087
total               10431   121928

- area down by 33%
- power down by 35%
- next step: reduce # of program-steps with second ALU

21 GSM viterbi decoder: 2 ALUs
Cycle count: 9739

EXU         ACTIV   AREA    POWER
alu_1       69%     1797    12248
alu_2       65%     1393    8916
romctrl_1   67%     39      255
acu_1       37%     294     1087
ipb_1       8%      149     119
opb_1       33%     2136    6871
ctrl                8957    87235
total               14766   116731

- cycle count down 30%
- area up 42%
- power down by 5%
- next step: introduce ASU to reduce ALU-load

22 GSM viterbi decoder: 1 x ACS-ASU
Cycle count: 1930

EXU         ACTIV   AREA    POWER
alu_1       20%     261     105
acs_asu_1   83%     2382    3816
or_asu_1    10%     611     122
romctrl_1   16%     65      21
acu_1       36%     294     205
ipb_1       20%     107     43
opb_1       11%     163     35
ctrl                1864    3597
total               5747    7944

func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS = if ( M1+d > M2-d ) -> ( M1+d) || ( M2-d) fi;
  MS8 = if ( M1- d > M2+d) -> ( M1- d) || ( M2+d) fi;
end;

- cycle count down 5X
- power down 20X !
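The ACS (add-compare-select) function that the ASU implements in hardware can be written out in software as follows. This is a direct transcription of the slide's `func ACS`, assuming that `||` separates the "then" value from the "else" value of the guarded `if ... fi` expression. The Python name `acs` and the sample inputs are illustrative only.

```python
def acs(m1, m2, d):
    """Add-compare-select: returns the two updated path metrics (MS, MS8)."""
    ms = m1 + d if m1 + d > m2 - d else m2 - d
    ms8 = m1 - d if m1 - d > m2 + d else m2 + d
    return ms, ms8


print(acs(10, 12, 3))
```

On an ALU each call costs several additions, comparisons and selects in sequence. Folding them into one application-specific unit is what removes most of the cycles and, with them, most of the controller activity.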

23 GSM viterbi decoder: 4 x ACS-ASU
Cycle count: 425

EXU          ACTIV   AREA    POWER
alu_1        94%     243     97
acs_asu_1    95%     1041    420
acs_asu_2    95%     1041    420
acs_asu_3    95%     1041    420
acs_asu_4    95%     1041    420
split_asu_1  47%     90      18
or_asu_1     47%     592     118
romctrl_1    28%     48      6
acu_1        98%     212     85
ipb_1        23%     60      6
opb_1        50%     369     80
ctrl                 1306    555
total                7084    2645

- cycle count down another 5X
- area up 23%
- power down another 3X !

24 GSM viterbi example: summary
[Figure: Implementation Independent Design Database; Mistral2]
72x !
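The step-by-step gains can be recomputed from the totals on the preceding slides. In the sketch below, the bare numbers 13750, 14247, 9739, 1930 and 425 are read as cycle counts, an assumption that matches the "cycle count down" remarks. The area and power totals are the `total` rows of the tables. The variable and function names are illustrative.

```python
# (label, cycles, total area, total power) as read from the slides
steps = [
    ("default", 13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs", 9739, 14766, 116731),
    ("1 x ACS-ASU", 1930, 5747, 7944),
    ("4 x ACS-ASU", 425, 7084, 2645),
]


def ratios(before, after):
    """Improvement factors (cycles, area, power); values below 1 are regressions."""
    return tuple(round(b / a, 1) for b, a in zip(before[1:], after[1:]))


for prev, cur in zip(steps, steps[1:]):
    print(f"{prev[0]} -> {cur[0]}: {ratios(prev, cur)}")

overall = ratios(steps[0], steps[-1])
print("overall:", overall)
```

The overall power ratio from these totals comes out at about 71, close to the 72x on the summary slide. What exactly the 72x figure measures cannot be recovered from the transcript.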

25 Discussion: phase 3
[Flowchart in two parts. Exploration phase: application(s); processor model; HW design; SW (code generation); OK? (yes/no); more appl.? (yes/no); Freeze processor model. Application software development (constraint driven compilation): application(s); SW (code generation); OK? (yes/no)]

26 Discussion: problems with VLIWs
Code size and instruction bandwidth:
- code compaction = reduce code size after scheduling
  - possible compaction ratio? e.g. $p_0 = 0.9$ and $p_1 = 0.1$
  - information content (entropy) $= -\sum_i p_i \log_2 p_i = 0.47$
  - maximum compression factor $\approx 2$
- control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
- architecture: reduce number of control bits for operand addresses
  - e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only => use stacks and fifos
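Both numbers on this slide can be checked with a few lines. The entropy formula is the one from the slide. Reading the maximum compression factor as one over the entropy per instruction bit is an assumption, as is the split of the 28 address bits into four 7-bit register addresses per issue slot; the slide only states the total. The function names are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits per symbol: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def address_bits(num_regs, operands_per_slot):
    """Instruction bits spent on register addresses in one issue slot."""
    return operands_per_slot * math.ceil(math.log2(num_regs))


h = entropy([0.9, 0.1])  # instruction bits that are 0 with probability 0.9
max_factor = 1 / h       # upper bound on lossless compression of such bits

print(f"entropy = {h:.2f} bit per instruction bit")
print(f"maximum compression factor = {max_factor:.1f}")
print("address bits per issue slot:", address_bits(128, 4))
```

The second result shows why large register files are expensive in code size: every operand address costs 7 bits, whereas stacks and FIFOs are addressed implicitly and need no address field at all.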

27 [Figure: register files RF1 to RF4, functional units FU1 to FU4, instruction registers IR1 to IR4, Instruction memory, Control flags]

28 Conclusions
- ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
- The methodology is interesting for IP creation.
- The key problem is retargetable compilation.
- A (distributed) VLIW model is a good compromise between HW and SW.
- Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.

