Introducing the ConnX D2 DSP Engine Introduced: August 24, 2009.

1 Introducing the ConnX D2 DSP Engine Introduced: August 24, 2009

2 Copyright © 2009, Tensilica, Inc. Fastest Growing Processor / DSP IP Company
Customizable Dataplane Processor/DSP IP Licensing
– Leading provider of customizable Dataplane Processor Units (DPUs)
– Unique combination of processor & DSP IP cores + software design tools
– Customization enables improved power, cost, performance
– Standard DPU solutions for audio, video/imaging & baseband comms
– Dominant patent portfolio for configurable processor technology
Broad-Based Success
– 150+ licensees, including 5 of the top 10 semiconductor companies
– Shipping in high volume today (>200M/yr rate)
– Fastest growing semiconductor processor IP company (per Gartner, Jan-09)
– 21% revenue growth in 2007, 25% in

3 Focus: Dataplane Processing Units (DPUs)
[Figure: processor roles in a system. Labels: Main Applications CPU; Embedded Controller For Dataplane Processing; Dataplane Processors (Tensilica focus)]
DPUs: Customizable CPU+DSP delivering 10 to 100x higher performance than CPU or DSP and providing better flexibility & verification than RTL

4 Communications DSP Trends / Challenges
Markets Changing Faster
– Market requirements in flux as economy wobbles
– Emerging standards evolve faster in the Internet age
Development Teams Shrink
– SOC development schedules tightening
– Tightening resource constraints (do more with less)
Code Size Increases
– Communications standards growing in number & complexity
– DSP algorithm code heavily integrated with more (and more complex) control code
– Maintenance and flexibility push DSP algorithms towards C code

5 Trends Within Licensable DSP Architectures
1st Generation Licensable DSP Cores
– Modest/medium performance (single/dual MAC)
– Simple architecture (single issue, compound instructions)
– Limited or no compiler support (mostly hand coded)
2nd Generation Licensable DSP Cores
– Added RISC-like architecture features (register arrays)
– Improved compiler targets, but still assembly
– Some offer wide VLIW for performance
  – Large area; code bloat
– Some offer wide SIMD for performance
  – Good area/performance tradeoff
  – No performance gain when vectorization fails

6 Vectorization Benefits (SIMD)
– Loop counts can be reduced
– Data computation can be done in parallel
– Cheapest (hardware cost) method to get higher performance
Example: 2-way SIMD performance benefit
[Figure: before vectorization, single execution processes Data0 through Data7 one element at a time; after vectorization, 2-way SIMD execution processes Data0, Data2, Data4, Data6 in one lane and Data1, Data3, Data5, Data7 in the other]
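The loop-count reduction claimed on this slide can be sketched in portable C. This is only an illustration of the idea, not ConnX D2 code: the function names are hypothetical, the two "lanes" are ordinary scalar accumulators, and the element count is assumed to be even.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference: one element per loop iteration. */
int32_t energy_scalar(const int16_t *x, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)x[i] * x[i];
    return sum;
}

/* 2-way form: each iteration handles two adjacent elements, so the loop
   runs n/2 times. Assumption: n is even. On real SIMD hardware the two
   multiply-accumulates would be a single instruction. */
int32_t energy_2way(const int16_t *x, size_t n) {
    int32_t sum_even = 0, sum_odd = 0;   /* one accumulator per lane */
    for (size_t i = 0; i < n; i += 2) {
        sum_even += (int32_t)x[i]     * x[i];
        sum_odd  += (int32_t)x[i + 1] * x[i + 1];
    }
    return sum_even + sum_odd;
}
```

Both functions return the same result; the second simply makes the halved trip count and the per-lane accumulators explicit.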

7 VLIW Technology
– Parallel execution of instructions
– Effective use of multiple ALUs/MACs
– Compiler allocates instructions to VLIW slots
– Orthogonal allocation yields more flexibility
[Figure: single-ALU execution issues Instructions #1 to #4 one after another; VLIW execution issues Instructions #1 and #3 on ALU1 and Instructions #2 and #4 on ALU2]
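A toy model can make the slot-allocation idea concrete. The sketch below is an assumption-laden simplification, not how the Tensilica compiler schedules code: `insn_t` and `vliw2_cycles` are hypothetical names, instructions are paired only with their immediate successor, and the only hazard checked is the second instruction reading or rewriting the first one's destination register.

```c
#include <stddef.h>

/* Hypothetical three-register instruction: dst = op(src1, src2). */
typedef struct { int dst, src1, src2; } insn_t;

/* Returns the number of issue cycles when consecutive instructions are
   paired into 2-slot bundles. Two instructions share a bundle only if the
   second neither reads nor rewrites the first one's destination. */
size_t vliw2_cycles(const insn_t *code, size_t n) {
    size_t cycles = 0, i = 0;
    while (i < n) {
        if (i + 1 < n &&
            code[i + 1].src1 != code[i].dst &&
            code[i + 1].src2 != code[i].dst &&
            code[i + 1].dst  != code[i].dst) {
            i += 2;          /* both slots filled */
        } else {
            i += 1;          /* second slot left empty */
        }
        cycles++;
    }
    return cycles;
}
```

With four independent instructions the model issues two bundles, matching the figure; with a chain of four dependent instructions it falls back to four cycles, which is why VLIW helps only when independent operations are available.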

8 Ideal 3rd Generation Licensable DSP
Ideal Characteristics
– VLIW capability for good performance on general code
  – Parallelization of independent operations
– SIMD capability for good performance on loop code
  – Data parallel execution
– Good C compiler target
  – Reduce or eliminate the need for assembly programming
  – Productivity benefit
– Small, compact size
  – Keep costs down in brutally competitive markets

9 Tensilica - the Stealth DSP Company
[Figure: Tensilica DSP product map across Comms, Audio, Video, and other Xtensa markets, scaled from Single MAC through Dual MAC and Quad MAC to 8 MAC and more. Labels: DSP Building Blocks (MAC16, MUL32, DIV32, Single Precision Floating Point Unit, Double Precision Acceleration Floating Point HW); HiFi 2; 388VDO; ConnX D2; ConnX Vectra LX; ConnX 545CK DSP; ConnX BBE 16 MAC; Custom DSPs (8 MAC and more, Xtensa TIE)]

10 ConnX D2 DSP Engine - Overview
Dual 16b MAC Architecture with Hybrid SIMD / VLIW
– Optimum performance on a wide range of algorithms
– SIMD offers high data computation rate for DSP algorithms
– 2-way VLIW allows parallel instruction execution on SIMD and scalar code
“Out of the Box” industry standard software compatibility
– TI C6x fixed-point C intrinsics supported
  – Fully bit-for-bit equivalent with TI C6x
– ITU reference code fixed-point C intrinsics directly supported
Goals: Ease of Use, Low Area/Cost
– Click and go “Out of the Box” performance from standard C code
– Standard C and fixed-point data types: 16-bit, 32-bit and 40-bit
– Advanced optimizing, vectorizing compiler
– Less than 70K gates (under 0.2 mm² in 65nm)

11 Target Applications: ConnX D2
– Embedded control
– VoIP gateways, voice-over-networks (including VoIP codecs)
– Femto-cell and pico-cell base stations
– Next generation disk drives, data storage
– Mobile terminals and handsets
– Home entertainment devices
– Computer peripherals, printers
General purpose 16-bit DSP for a wide range of applications

12 ConnX D2 DSP: An ingredient of an Xtensa DPU
Hardware Use Model
– Click-button configuration option within Xtensa LX core
– Part of the Tensilica configurable core deliverable package
Two reference configurations
– Typical DSP solution for high performance
– Small size for cost and power sensitive applications
Full tool support from Tensilica
– High level simulators (SystemC), ISS and RTL
– Debugger and Trace
– Compiler, IDE and Operating Systems

13 ConnX D2 Processor Block Diagram (Typical)

14 ConnX D2 Engine Architecture
[Figure: engine datapath. Labels: AR Register Bank (32 bits); XDU Alignment Registers (4 x 32 bits); XDD Register File (8 x 40-bits); Overflow State; Carry State; Hi / Lo 16-bit select; Load Store Unit; Local Memory and/or Cache; 32-bit and 16-bit data paths. Data formats shown: 40-bit, 32-bit & 16-bit fixed; 40-bit, 32-bit & 16-bit integer; 16-bit vector; 8-bit; 16-bit real and 16-bit imaginary]
Addressing Modes
– Immediate
– Immediate updating
– Indexed
– Indexed updating
– Aligning updating
– Circular (instruction)
– Bit-reversed (instruction)
DSP specific instructions
– Add-Bit-Reverse-Base and Add-Subtract: useful for FFT implementation
– Add-Compare-Exchange: useful for Viterbi implementation
– Add-Modulo: circular buffer implementation, useful for FIR implementation
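The circular and bit-reversed address updates named on this slide can be modeled in plain C. These are generic textbook definitions with hypothetical function names, offered as a sketch of what such instructions typically compute; the exact operand formats and semantics of the ConnX D2 instructions are not given in the slides.

```c
#include <stdint.h>

/* Add-modulo: advance an index by step and wrap it inside [0, len).
   Assumptions: idx < len and step <= len, so one subtraction suffices. */
uint32_t add_modulo(uint32_t idx, uint32_t step, uint32_t len) {
    uint32_t next = idx + step;
    return next >= len ? next - len : next;
}

/* Bit-reverse the low `bits` bits of idx, as used to reorder the
   inputs or outputs of a radix-2 FFT of size 2^bits. */
uint32_t bit_reverse(uint32_t idx, unsigned bits) {
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (idx & 1u);   /* shift the lowest bit of idx into r */
        idx >>= 1;
    }
    return r;
}
```

For an 8-entry FIR delay line, `add_modulo(6, 3, 8)` wraps to index 1; for an 8-point FFT, `bit_reverse(1, 3)` maps index 1 to index 4. Doing either update in a single instruction removes the compare-and-branch or bit loop from the inner loop.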

15 ConnX D2: Instruction Allocation Options
[Figure: instruction formats. 16-bit instructions: base ISA. 24-bit instructions: base ISA or ConnX D2. VLIW instructions (64 bits): Slot 0 holds ConnX D2 or base ISA; Slot 1 holds ConnX D2 or base ISA (register moves & C ops on register data)]
Flexible allocation of instructions available to compiler
– Optimum use of VLIW slots (ConnX D2 or base ISA instructions)
– Improved performance and no code bloat (reduced NOPs)
– Reduce code size when algorithm is less performance intensive
– Modeless switching between instruction formats

16 ConnX D2: SIMD with VLIW – Extra Performance
Example: Energy Calculation (128 iteration C algorithm)

$$A = \sum_{n=0}^{127} X_n \cdot X_n$$

Combining SIMD and VLIW can give 6 times the performance
– Vectorization and SIMD give double the data computation performance
– VLIW gives 2 pipeline executions (one is SIMD) with auto-increment loads
– ConnX D2 architecture gives this combination and performance

Base Xtensa configuration: 416 cycles
    loopgtz a3,.LBB52_energy    # [3]
    l16si   a3,a2,2             # [0*II+0] id:16 a+0x0
    l16si   a5,a2,4             # [0*II+1] id:16 a+0x0
    l16si   a6,a2,6             # [0*II+2] id:16 a+0x0
    l16si   a7,a2,8             # [0*II+3] id:16 a+0x0
    mul16s  a3,a3,a3            # [0*II+4]
    mul16s  a5,a5,a5            # [0*II+5]
    mul16s  a6,a6,a6            # [0*II+6]
    mul16s  a7,a7,a7            # [0*II+7]
    addi.n  a2,a2,8             # [0*II+8]
    add.n   a3,a4,a3            # [0*II+9]
    add.n   a3,a3,a5            # [0*II+10]
    add.n   a3,a3,a6            # [0*II+11]
    add.n   a4,a3,a7            # [0*II+12]

ConnX D2: 64 cycles, one instruction (64-bit VLIW instruction) per loop iteration
    loop {    # format XD2_FLIX_FORMAT
        xd2_la.d16x2s.iu xdd0,xdu0,a4,4;          # Slot0
        xd2_mulaa40.d16s.ll.hh xdd1,xdd0,xdd0     # Slot1, SIMD computation
    }
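The two cycle counts on this slide follow from simple arithmetic, sketched below. The helper name `loop_cycles` is hypothetical, and the model assumes one instruction (or one VLIW bundle) issues per cycle while ignoring loop setup and stalls. The base loop body lists 13 instructions that together consume four 16-bit samples; the ConnX D2 loop body is one bundle that consumes two.

```c
/* Cycles spent in a loop body, assuming one instruction (or one VLIW
   bundle) issues per cycle and ignoring loop setup and pipeline stalls. */
unsigned loop_cycles(unsigned samples, unsigned samples_per_iter,
                     unsigned issues_per_iter) {
    return (samples / samples_per_iter) * issues_per_iter;
}
```

Under these assumptions `loop_cycles(128, 4, 13)` gives 416 and `loop_cycles(128, 2, 1)` gives 64, a ratio of 6.5 that the slide summarizes as "6 times".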

17 When Vectorization is Not Possible
Performance for scalar code bases
int energy(short *a, int col, int cols, int rows) {
    int i;
    int sum=0;
    for (i=0; i

18 When Vectorization is Not Possible
Performance for scalar code bases
– Confirmed that the ConnX D2 and TI C6x compilers cannot vectorize this code
– The ConnX D2 compiler can, however, use VLIW to increase performance
int energy(short *a, int col, int cols, int rows) {
    int i;
    int sum=0;
    for (i=0; i

19 Optimization with ITU / TI Intrinsics
Performance for generic code bases
Energy calculation loop, 1000 iterations, using the L_mac ITU intrinsic

#define ASIZE 1000
extern int a[ASIZE];
extern int red;
void energy() {
    int i;
    int red_0 = red;
    for (i = 0; i < ASIZE; i++) {
        red_0 = L_mac(red_0, a[i], a[i]);
    }
    red = red_0;
}

Generated Assembly Code
    entry   a1,32
    l32r    a2,.LC1_40_18
    l32r    a5,.LC0_40_17
    xd2_l.d16x2s.iu xdd0,a2,4       # test_arr_1+0x0
    l32i.n  a3,a5,0                 # test_global_red_0+0x0
    {   # format XD2_ARUSEDEF_FORMAT
        xd2_mov.d32.a32s xdd1,a3
        movi a3,499
    }
    loopgtz a3,
    {   # format XD2_FLIX_FORMAT
        xd2_l.d16x2s.iu xdd0,a2,4;
        xd2_mulaa.fs32.d16s.ll.hh xdd1,xdd0,xdd0
    }

– L_mac maps to one ConnX D2 instruction
– Compiler further optimizes by using SIMD to accelerate the loop
– VLIW further accelerates the loop with parallel loads
– 1000 iteration C algorithm optimized to a 500 cycle loop
– Sustained 3 operations / cycle
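For readers unfamiliar with the ITU basic operators, the arithmetic that `L_mac` performs can be sketched as a portable C reference model. The `_ref` names and the `sat32` helper are hypothetical and exist only for this illustration; the ITU reference library also records an overflow flag, which this sketch omits.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* L_mult: fractional multiply, 2 * a * b with saturation
   (only -32768 * -32768 actually saturates). */
int32_t L_mult_ref(int16_t a, int16_t b) {
    return sat32(2 * (int64_t)a * b);
}

/* L_mac: acc + L_mult(a, b), saturating again on the addition. */
int32_t L_mac_ref(int32_t acc, int16_t a, int16_t b) {
    return sat32((int64_t)acc + L_mult_ref(a, b));
}
```

The doubling and the two saturation steps are what make a bit-exact hardware mapping valuable: a plain integer multiply-accumulate would give different results at the extremes of the range.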

20 “Out of the Box” Performance - Results
Comparison to TI C55x (TI C55x is an industry benchmark Dual-MAC, 2-way VLIW)
– 20% more performance (256 point complex FFT)
[Figure: bar chart of cycle count (lower is better), ConnX D2 "Out of the Box" C code vs. TI C55x optimized assembly]
Comparison to other DSP IP vendors
– Almost twice the performance

Required MHz for AMR-NB (VAD2) Encode + Decode
  ConnX D2 (Out of the Box ITU reference code) #      27.7 MHz
  CEVA-X1620 (Out of the Box ITU reference code) *    48 MHz
  * From CEVA published whitepaper
  # - Dec 2008,

Why better?
– FFT specific instructions; dual write to register files; advanced compiler; SIMD and VLIW performance
– 1 to 1 mapping of ITU intrinsics; SIMD and VLIW performance; flexibility in VLIW allocation; VLIW performance for scalar code

21 Small, Low Power, & High Performance
Optimized for low area / low cost applications
– Less than 70,000 gates
– 0.18 mm² in 65nm GP *
Low power
– 52 µW/MHz power consumption
– 65nm GP, measured running AMR-NB algorithm
Very high performance
– 600 MHz in 65nm GP **
* After full place and route, when optimized for area/power. Size is for the full Xtensa core including the D2 DSP option
** After full place and route, when optimized for speed

22 Flexible and Customizable
Configure memory subsystems to exact requirements
– Up to 4 local memories
  – Instruction memory, data memory
  – RAM and ROM options
  – DMA path into these memories
– Instruction and data cache configurations
– MMU and memory region protection
– Memory port interface
– Option of dual load/store architecture
Full customization
– Instruction set extensions
– Custom I/O interfaces
  – TIE Ports, Queues and Lookup Memory interfaces

23 ConnX D2 DSP Engine: Summary
– Small size
– Low power
– Excellent performance on wide range of code
– Easy to use – C programming centric
  – “Out of the Box” performance
  – Reduce development time – reduced cost
– ITU and TI C intrinsic support – large existing code base
– Bit equivalent to TI C6x
  – Take current TI code, port it, and get the same functionality on ConnX D2
– Flexible & customizable

