Platform Design (TU/e 5kk70)
Henk Corporaal, Bart Mesman
ASIP: Application Specific Instruction-set Processor
6/25/2015

Presentation transcript:

1 Platform Design, TU/e 5kk70. Henk Corporaal, Bart Mesman. ASIP: Application Specific Instruction-set Processor

2 Application domain specific processors (ADSP or ASIP)
[Figure: flexibility vs. efficiency axis with Programmable CPU, Programmable DSP, Application domain specific processor, Application specific processor]

3 Application domain specific processors (ADSP or ASIP)
- takes a well-defined application domain as a starting point
- exploits characteristics of the domain (computation kernels)
- still programmable within the domain
- e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc.
[Figure: GP implementation vs. ADSP implementation of an application domain]
- performance: clock speed + ILP vs. ILP + tuning to domain
- flexible dev. (new apps.) vs. cost effective (high volume)
problems:
- specification
- design time and effort
manual design, large effort => synthesized cores

4 www.adelantetech.com

5 Outline
- design process
- retargetable code generation (problem statement)
- ADSP/VLIW architectures (Mistral 2 / A|RT designer)
- low power aspects (Mistral 2 / A|RT designer)
- discussion
- conclusion

6 Design process
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
Fast, accurate and early feedback
[Flowchart elements: application(s); processor model (parameters, instance, e.g. VLIW with shared RFs); SW (code generation) with estimations of cycles/alg and occupation; HW design with estimations of nsec/cycle, area, power/instr; decisions "OK?" and "more appl.?" (yes/no); go to phase 2]

7 Problem statement
A compiler is retargetable if it can generate code for a 'new' processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP.
  a := b + c | instr = xxxx0101
GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
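The claim that GRTPs carry all inter-RT-conflict information can be illustrated with a small sketch. It assumes (this is an interpretation, not something the slide spells out) that two register transfers may share an instruction word exactly when their guards never demand opposite values for the same bit. The `GRTP` class, the helper names, and the `move` and `sub` patterns are hypothetical; only `a := b + c | instr = xxxx0101` comes from the slide. ISE itself, i.e. deriving the patterns from a processor description, is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern plus the instruction bits that enable it."""
    transfer: str  # e.g. "a := b + c"
    guard: str     # one character per instruction bit: '0', '1' or 'x' (don't care)


def compatible(p, q):
    """True if no instruction bit is required to be 0 by one guard and 1 by the other."""
    if len(p.guard) != len(q.guard):
        raise ValueError("guards must describe the same instruction word")
    return all(a == b or "x" in (a, b) for a, b in zip(p.guard, q.guard))


def merged_guard(p, q):
    """Instruction bits needed to issue both transfers in the same word."""
    if not compatible(p, q):
        raise ValueError("conflicting control bits")
    return "".join(b if a == "x" else a for a, b in zip(p.guard, q.guard))


add = GRTP("a := b + c", "xxxx0101")   # pattern from the slide
move = GRTP("d := e", "10xxxxxx")      # hypothetical: uses other control bits
sub = GRTP("a := b - c", "xxxx0110")   # hypothetical: same bits, other opcode

print(compatible(add, move), merged_guard(add, move))
print(compatible(add, sub))
```

Under this reading, `add` and `move` can be issued together because their guards touch disjoint bits, while `add` and `sub` conflict on the opcode bits.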

8 Problem statement
[Figure: Algorithm spec -> FE -> CDFG -> Code Generation -> Machine code; Processor spec (instance) -> ISE -> GRTP -> Code Generation]
in ch 4 this is part of the code generator

9 Example: Simple processor [Leupers]
[Figure: datapath with PC, +1 incrementer, IM, RAM, REG, Inp and outp; the instruction word I.(20:0) is split into the fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0)]
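The field labels in this figure describe a 21-bit instruction word split into five control fields (8 + 8 + 1 + 2 + 2 bits). The sketch below only shows how such fields would be sliced out of a word; the transcript does not say what each field controls, so the fields are named by their bit ranges, and the example word is made up.

```python
# (high bit, low bit) for each field labelled in the figure.
FIELDS = [(20, 13), (12, 5), (4, 4), (3, 2), (1, 0)]
WORD_BITS = 21  # I.(20:0)


def extract(word, hi, lo):
    """Return bits hi..lo (inclusive) of word as an unsigned integer."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def decode(word):
    """Split a 21-bit instruction word into its labelled fields."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("instruction word must fit in 21 bits")
    result = {}
    for hi, lo in FIELDS:
        name = f"I.({hi})" if hi == lo else f"I.({hi}:{lo})"
        result[name] = extract(word, hi, lo)
    return result


# Hypothetical instruction word, written field by field.
example_word = 0b10101010_11001100_1_01_10
print(decode(example_word))
```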

10 Example: Simple processor [Leupers]

11 ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with VLIW processors of ch. 4:
1. Parallel FUs
   - ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT etc.
   - larger grain size, more heterogeneous, more pipelines
2. Rfiles
   - many Rfiles (>5 vs. 1 or 2)
   - limited # ports (3 vs. 15)
   - limited size (<16 vs. 128)
3. Issue slots
   - all in parallel vs. 5

12 [Figure: register files RF1-RF8, functional units FU1-FU4, instruction registers IR1-IR4, instruction memory, control flags]

13 ASIP/VLIW architectures
Additional characteristics of the A|RT designer template:
- interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
- memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
- each FU can generate one or more flags
- instruction format (per issue slot): read address RF 1, write address RF 1, read address RF 2, write address RF 2, mux 1, mux 2, control FU, output drivers

14 ASIP/VLIW architectures: example
[Figure: ALU and MAC with register files RF1-RF4, connected through bus1 and bus2. Instruction word with bit positions 0, 9, 10 and 19 marked; ALU part: mux 2, read RF1, write RF1, read RF2, write RF2, ALU instr.; MAC part: mux 3, read RF4, write RF4, read RF3, write RF3, MAC instr.]

15 ASIP/VLIW architectures: example

16 ASIP/VLIW architectures: design flow
[Flowchart elements: Algorithm spec; Datapath synthesis; RTs; Controller synthesis; Estimations (area, power, timing); OK? (yes/no); Change pragmas]
RT example:
  RF1 : x = RF2 : y, RF3 : z | ALU = ADD, Inmux = bus2
Pragma examples:
  assign ( a+b, ALU, fu_alu1)
  assign ( a+_, ALU, fu_alu2)
  assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection possible

17 ASIP/VLIW architectures: list scheduling
[Figure: data-flow graph with operations *1, +2, *3, *4, *5, +6, +7, *8, *9, +10 scheduled onto IPB, OPB, ALU and MULT over cycles 0-5; in each cycle the Candidate LIST passes through Conflict & Priority Comp. to give the Scheduled Operation]
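The loop this figure depicts (candidate list, conflict and priority computation, scheduled operation) can be sketched as follows. This is a generic list scheduler under stated assumptions, not the A|RT designer algorithm: every operation takes one cycle, priority is the longest path to a sink, and the dependence graph is invented for illustration because the transcript does not preserve the edges of the slide's graph. Only the operation-naming style (`*1`, `+2`, ...) and the ALU/MULT unit names follow the slide.

```python
def list_schedule(ops, deps, units):
    """Greedy list scheduling with unit-latency operations.

    ops:   {op name: unit type}
    deps:  {op name: [predecessor op names]}
    units: {unit type: number of instances}
    Returns {op name: start cycle}.
    """
    for unit in ops.values():
        if units.get(unit, 0) < 1:
            raise ValueError(f"no unit available for {unit}")

    # Priority: length of the longest path from the op to any sink.
    succs = {op: [] for op in ops}
    for op, preds in deps.items():
        for pred in preds:
            succs[pred].append(op)
    prio = {}

    def height(op):
        if op not in prio:
            prio[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return prio[op]

    for op in ops:
        height(op)

    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        # Candidate list: unscheduled ops whose predecessors finished earlier.
        candidates = [
            op for op in ops
            if op not in schedule
            and all(p in schedule and schedule[p] < cycle for p in deps.get(op, []))
        ]
        candidates.sort(key=lambda op: (-prio[op], op))
        free = dict(units)
        for op in candidates:
            if free[ops[op]] > 0:  # resource conflict check
                free[ops[op]] -= 1
                schedule[op] = cycle
        cycle += 1
    return schedule


# Hypothetical data-flow graph: four multiplies and two adds.
example_ops = {"*1": "MULT", "+2": "ALU", "*3": "MULT",
               "*4": "MULT", "*5": "MULT", "+6": "ALU"}
example_deps = {"*3": ["*1", "+2"], "*4": ["*3"], "+6": ["*3", "*5"]}
example_schedule = list_schedule(example_ops, example_deps, {"ALU": 1, "MULT": 1})
print(example_schedule)
```

With a single multiplier, the four multiplies serialize, so the example needs four cycles; `*5` is ready from cycle 0 but is deferred twice because higher-priority multiplies win the conflict.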

18 ASIP/VLIW architectures: feedback

19 Outline
- design process
- retargetable code generation (problem statement)
- ADSP/VLIW architectures (Mistral 2 / A|RT designer)
- low power aspects (Mistral 2 / A|RT designer)
- discussion
- conclusion

20 Low power aspects
[Figure: Mistral2 with Implementation Independent Design Database and Estimation Database + Architecture; estimation of area, speed, power]

21 GSM viterbi decoder: default solution
Cycle count: 13750

EXU         ACTIV    AREA    POWER
alu_1         96%    3469    46196
romctrl_1     48%      39      259
acu_1         26%     327     1209
ipb_1          5%     131      105
opb_1         23%    1804     5801
ctrl                 9821   135035
total               15591   188605

controller responsible for 70% of power consumption
- maximum resource-sharing
- heavy decision-making: "main" loop with 16 metrics-computations per iteration
EXU-numbers include Registers for local storage

22 GSM viterbi decoder: no loop-folding
Cycle count: 14247

EXU         ACTIV    AREA    POWER
alu_1         92%    3411    45073
romctrl_1     45%      39      255
acu_1         25%     294     1087
ipb_1          5%     107       86
opb_1         22%    1661     5340
ctrl                 4919    70087
total               10431   121928

- area down by 33%
- power down by 35%
- next step: reduce # of program-steps with second ALU

23 GSM viterbi decoder: 2 ALUs
Cycle count: 9739

EXU         ACTIV    AREA    POWER
alu_1         69%    1797    12248
alu_2         65%    1393     8916
romctrl_1     67%      39      255
acu_1         37%     294     1087
ipb_1          8%     149      119
opb_1         33%    2136     6871
ctrl                 8957    87235
total               14766   116731

- cycle count down 30%
- area up 42%
- power down by 5%
- next step: introduce ASU to reduce ALU-load

24 GSM viterbi decoder: 1 x ACS-ASU
Cycle count: 1930

EXU         ACTIV    AREA    POWER
alu_1         20%     261      105
acs_asu_1     83%    2382     3816
or_asu_1      10%     611      122
romctrl_1     16%      65       21
acu_1         36%     294      205
ipb_1         20%     107       43
opb_1         11%     163       35
ctrl                 1864     3597
total                5747     7944

func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
  MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
end;

- cycle count down 5X
- power down 20X !
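The ACS (add-compare-select) function on this slide is the operation that the dedicated ASU executes in hardware. A direct Python transcription is given below as a reading aid; it assumes that the guarded `if ... -> x || y fi` form selects `x` when the condition holds and `y` otherwise, and the input values in the example are arbitrary.

```python
def acs(m1, m2, d):
    """Add-compare-select: return (MS, MS8) as defined by the slide's func ACS."""
    # MS: larger of the two candidate metrics M1+d and M2-d.
    ms = m1 + d if m1 + d > m2 - d else m2 - d
    # MS8: larger of the two candidate metrics M1-d and M2+d.
    ms8 = m1 - d if m1 - d > m2 + d else m2 + d
    return ms, ms8


# Arbitrary example inputs.
print(acs(10, 12, 3))
```

On a plain ALU, each call costs four additions or subtractions, two comparisons and two selects, which is why folding it into a single ASU operation shortens the inner loop so much.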

25 GSM viterbi decoder: 4 x ACS-ASU
Cycle count: 425

EXU         ACTIV    AREA    POWER
alu_1         94%     243       97
acs_asu_1     95%    1041      420
acs_asu_2     95%    1041      420
acs_asu_3     95%    1041      420
acs_asu_4     95%    1041      420
split_asu_1   47%      90       18
or_asu_1      47%     592      118
romctrl_1     28%      48        6
acu_1         98%     212       85
ipb_1         23%      60        6
opb_1         50%     369       80
ctrl                 1306      555
total                7084     2645

- cycle count down another 5X
- area up 23%
- power down another 3X !

26 GSM viterbi example: summary
[Figure: Mistral2 with Implementation Independent Design Database; 72x !]
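The step-by-step gains quoted on the preceding slides can be recomputed from the table totals. The sketch below uses the total area and power figures from slides 21 to 25; treating the otherwise unlabelled numbers 13750, 14247, 9739, 1930 and 425 as cycle counts is an assumption, as is reading the 72x headline as the overall power reduction.

```python
# (design point, cycles [assumed], total area, total power) from slides 21-25.
STEPS = [
    ("default solution", 13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs", 9739, 14766, 116731),
    ("1 x ACS-ASU", 1930, 5747, 7944),
    ("4 x ACS-ASU", 425, 7084, 2645),
]


def change(before, after):
    """Relative change in percent; negative means a reduction."""
    return 100.0 * (after - before) / before


for prev, cur in zip(STEPS, STEPS[1:]):
    print(f"{prev[0]} -> {cur[0]}: "
          f"cycles {change(prev[1], cur[1]):+.0f}%, "
          f"area {change(prev[2], cur[2]):+.0f}%, "
          f"power {change(prev[3], cur[3]):+.0f}%")

overall_cycles = STEPS[0][1] / STEPS[-1][1]
overall_power = STEPS[0][3] / STEPS[-1][3]
print(f"overall: {overall_cycles:.1f}x fewer cycles, {overall_power:.1f}x less power")
```

The overall power ratio comes out at roughly 71x, close to the 72x shown in the summary figure, alongside a cycle-count reduction of about 32x.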

27 Discussion: phase 3
Application software development: constraint driven compilation
[Flowchart: Exploration phase (application(s), processor model, HW design, SW (code generation), OK?, more appl.?), then Freeze processor model, then application software development (application(s), SW (code generation), OK?)]

28 Discussion: problems with VLIWs
code size and instruction bandwidth
- code compaction = reduce code size after scheduling
  - possible compaction ratio? e.g. p0 = 0.9 and p1 = 0.1
  - information content (entropy) = $-\sum_i p_i \log_2 p_i = 0.47$
  - maximum compression factor ≈ 2
- control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
- architecture: reduce number of control bits for operand addresses
  - e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only => use stacks and fifos
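Both numbers on this slide follow from short calculations, sketched below. The entropy part uses the slide's p0 = 0.9 and p1 = 0.1. For the 28 address bits, the sketch assumes four register address fields per operation (for example two sources, one destination and one guard); the slide gives only the register count and the total, so that breakdown is an assumption.

```python
import math


def entropy(probs):
    """Shannon entropy in bits per symbol: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def address_bits(num_registers, fields_per_op):
    """Bits spent on register addresses in one issue slot."""
    bits_per_field = (num_registers - 1).bit_length()
    return bits_per_field * fields_per_op


h = entropy([0.9, 0.1])
max_compression = 1.0 / h  # one raw bit carries only h bits of information
print(f"entropy = {h:.2f} bits per instruction bit")
print(f"maximum compression factor = {max_compression:.2f}")
print(f"address bits per issue slot = {address_bits(128, 4)}")
```

The entropy evaluates to about 0.47 bits per bit, so ideal coding could shrink such code by a factor of just over 2, matching the slide. With 128 registers, each address field needs 7 bits, and four fields give the quoted 28 bits.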

29 [Figure: register files RF1-RF4, functional units FU1-FU4, instruction registers IR1-IR4, instruction memory, control flags]

30 Conclusions
- ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
- The methodology is interesting for IP creation.
- The key problem is retargetable compilation.
- A (distributed) VLIW model is a good compromise between HW and SW.
- Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.

