
1. Fundamental Change in MPSOC: A Fifteen-Year Outlook
Chris Rowen, President and CEO, Tensilica, Inc. (The Configurable Processor Company)
© 2003 Tensilica Inc.


2. Moore's Law: Opportunity, Crisis and ROI
[Chart: Moore's Law: standard cell density and speed. Series: gates, clock. Axes: year (2001 to 2015), density (Kgates/mm²), ASIC clock (MHz).]
[Chart: Design Productivity Crisis (SRC 1997): potential design complexity and designer productivity. Axes: year (1981 to 2009), logic transistors per chip (M), productivity (K transistors per staff-month). Complexity growth rate: 58%/yr compounded. Productivity growth rate: 21%/yr compounded.]
Source: ITRS 2001, Moore 1965, Tensilica
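The two compounded rates on this slide (58%/yr complexity, 21%/yr productivity) imply a gap that widens every year. A minimal Python sketch of that arithmetic follows; the function names and the choice of horizons are illustrative assumptions, not part of the slide.

```python
# Rates quoted on the slide (SRC 1997 chart).
COMPLEXITY_RATE = 0.58    # logic transistors per chip, per year
PRODUCTIVITY_RATE = 0.21  # transistors per staff-month, per year


def compound(rate_per_year, years):
    """Growth factor after `years` of compounding at `rate_per_year`."""
    return (1.0 + rate_per_year) ** years


def gap_after(years):
    """How many times faster complexity has grown than productivity."""
    return compound(COMPLEXITY_RATE, years) / compound(PRODUCTIVITY_RATE, years)


if __name__ == "__main__":
    for years in (1, 5, 10, 15):
        print(f"after {years:2d} years the gap factor is {gap_after(years):.1f}x")
```

The ratio of the two growth factors is what a fixed-size design team would have to absorb through reuse or higher abstraction.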

3. ROI Goal: One Design, Many Design-ins
One chip, many system designs: low-end still camera, high-end still camera, video camcorder.
[Chart: SOC Flexibility = Cost Reduction. Axes: system designs per chip design vs. total per-unit cost. Model: 100K and 1M system volumes.]
Assumptions: $10M design cost, $15 manufacturing cost, 5% premium for programmability.
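The cost chart can be approximated with a simple amortization model. The sketch below uses the three figures from the slide; how they combine (design cost spread evenly over all units, premium applied to manufacturing cost only) is an assumption made for illustration, and the function name is hypothetical.

```python
DESIGN_COST = 10_000_000   # dollars per chip design (from the slide)
MANUF_COST = 15.0          # dollars per unit (from the slide)
PROGRAMMABILITY_PREMIUM = 0.05  # 5% premium (from the slide)


def cost_per_unit(systems_per_chip, volume_per_system, programmable=True):
    """Total per-unit cost when one chip design serves several systems.

    ASSUMPTION: the design cost is amortized evenly over every unit shipped,
    and the programmability premium scales the manufacturing cost only.
    """
    unit = MANUF_COST * (1.0 + PROGRAMMABILITY_PREMIUM) if programmable else MANUF_COST
    return DESIGN_COST / (systems_per_chip * volume_per_system) + unit


if __name__ == "__main__":
    for volume in (100_000, 1_000_000):
        for n in (1, 2, 4, 8):
            print(f"volume {volume:>9,}, {n} designs/chip: ${cost_per_unit(n, volume):.2f}")
```

Under these assumptions the premium is a fixed per-unit adder, while the amortized design cost falls with every additional system that reuses the chip.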

4. Configurable Processors' Role
[Diagram: performance vs. flexibility. Regions: application-specific logic, configurable processors, general-purpose processors.]

5. Configurable Processors Enable New Roles: Taking Performance to a New Level
[Chart: benchmark performance, with "~energy efficiency" annotation. Source: EEMBC]

6. Automatic Generation of Processors Achieves Required Performance Faster
[Diagram: Electronic Specification, Processor Generator, Hardware Design (RISC, DSP, OCD, Timer, FPU, Designer-Defined, Cache), Customized Software.]
- Build using any IC process
- Design processor in one hour

7. Processors as Basic Building Block
[Diagram: SOC block diagram with CPU, signal processing, protocol processing, I/O (x4), memory, application accelerator, encrypt, imaging and audio blocks. Legend: DSP, application-specific logic, configurable processor.]

8. Flexibility is the Key to ROI
- Flexibility means more systems per design
- Programmability means more hot features available
- Little impact on chip cost: pennies per processor
- Automatically generated configurable processors reduce design time, team size and re-spin risk

9. Example: NEC TCP/IP Offload Engine (TOE) Platform
NEC TOE achieves full wire speed with eight parallel Tensilica cores plus two management and dispatch cores (10 in total) for high-performance IP-based network storage (NAS and IP-SAN).
[Diagram labels: 8 parallel Xtensa processors; 200 MHz; Gigabit Ethernet x 2 ports; 200 MHz; MAC.]

10. Implications of Multiprocessor SOC
- Designers will routinely "waste" processors to get other benefits
  - Greater speed to market and certainty of success
  - Higher abstraction in design
- Tremendous creativity and diversity in on-chip communications
  - Topologies: buses, hierarchies of buses, cross-bars, systolic arrays, pipelines
  - New issues and methods: reliability, redundancy, asynchrony, QoS
  - Programming models for large numbers of tasks: finding parallelism
- Software languages displace hardware languages
  - C/C++, not Verilog, VHDL, SystemVerilog, etc.
- Changing demographics of complex SOC design
  - Broader population of engineers and programmers capable of SOC design
  - Unified hardware-software design user interface "cockpit"

11. New Types of Processors Exploiting Latent Parallelism
[Chart: Multiple processors vs. multiple-issue in network processors. Axes: operations per cycle, number of processors. Contours: 8, 16 and 64 instrs/cycle. Source: K. Keutzer, UCB]
Data points: Xelerated, Intel IXP1200, Broadcom BCM1250, Cognigine RCU/RSF, Cisco PXF, EZchip NP-1, IBM PowerNP, Lexra NetVortex, Motorola C-5, BRECIS, AMCC np7120, Clearwater CNP810, Vitesse IQ2x00, Agere PayloadPlus, Alchemy, Mindspeed CX27470.
- Very small processors
  - Modest extensions
  - High task-level parallelism
- High-performance processors
  - VLIW, SIMD and application-specific extensions
  - High data- and instruction-level parallelism

12. Projected Processor Speed and Density
                          2009        2016
Geometry                  50 nm       22 nm
Clock                     1.8 GHz     5.7 GHz
Small proc area (mm²)     0.08        0.016
Small proc/chip           240         1400
High perf proc/chip       10          15
MIPS/chip                 600,000     11,000,000
40 mm² die size for consumer SOC

13. The Law of SOC Processor Scaling
- Processors/chip: up to 30% per year
- Total MIPS: 65% per year
Tensilica model based on ITRS 2001, 140 mm² die size
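The per-year rates quoted here can be compared with the 2009 and 2016 projections on the previous slide by computing the compound annual growth each pair of figures implies. This is a minimal Python sketch; the helper name is hypothetical, and because the two slides quote different die sizes (40 mm² and 140 mm²) the implied rates need not match the stated ones exactly.

```python
def annual_growth(start, end, years):
    """Compound annual growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1.0 / years) - 1.0


YEARS = 2016 - 2009

# Figures taken from the "Projected Processor Speed and Density" slide.
small_proc_rate = annual_growth(240, 1400, YEARS)          # small processors per chip
mips_rate = annual_growth(600_000, 11_000_000, YEARS)      # MIPS per chip
clock_rate = annual_growth(1.8, 5.7, YEARS)                # clock in GHz

if __name__ == "__main__":
    print(f"small processors/chip: {small_proc_rate:.1%} per year")
    print(f"MIPS/chip:             {mips_rate:.1%} per year")
    print(f"clock:                 {clock_rate:.1%} per year")
    # Projecting the stated scaling law over the talk's fifteen-year horizon:
    print(f"30%/yr over 15 years:  {1.30 ** 15:.0f}x processors")
    print(f"65%/yr over 15 years:  {1.65 ** 15:.0f}x MIPS")
```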

14. Enablers for Large Scale MPSOC: What is Tensilica Working On?
- Performance
  - Throughput and efficiency mean more opportunities for application-specific processors over RTL
  - New processor interfaces enable greater parallelism
- Insight
  - Unified hardware-software development environment
  - Performance and cost-oriented analysis
- Automation
  - Automatic generation of compilers, RTOS, MP models
  - Hands-free instruction set optimization

15. Performance: FLIX
- FLIX = Flexible Length Instruction Xtensions
- FLIX freely intermixes 16-, 24-, and 64-bit instructions
  - No code bloat
  - No modes
  - Full backwards code compatibility with current Xtensa ISA
  - Long instructions implement complex extensions
  - Fast and parallel code when needed, else very compact code
- Arbitrary instruction field specification
  - Multiple independent operations packed into a wide instruction word
- Multiple load/store units
- Minimal overhead
  - ~5000 gates added control logic
[Diagram: instruction packing in memory (little endian shown), with 64-, 24- and 16-bit instructions packed back to back.]

16. Performance: Pushing to new levels of throughput
256-pt FFT (Radix-4):
Simple RISC task engine   Minimal-configuration Xtensa processor (18K gates)   155,389 cycles
Scalar performance        Base Xtensa processor with MUL32 option               23,633 cycles
SIMD performance          Xtensa processor with 4-way SIMD Vectra DSP engine     3,055 cycles
FLIX performance          Conexant Testarossa DSP with 4-way SIMD and FLIX       1,063 cycles
FLIX: average of 6% larger code on complex code sets
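The four cycle counts are easier to compare as speedups over the minimal configuration. The short Python sketch below does only that division; the dictionary keys are abbreviated labels chosen for illustration.

```python
# Cycle counts for the 256-point radix-4 FFT, as quoted on the slide.
FFT_CYCLES = {
    "minimal Xtensa (18K gates)": 155_389,
    "base Xtensa + MUL32": 23_633,
    "4-way SIMD Vectra": 3_055,
    "Testarossa (4-way SIMD + FLIX)": 1_063,
}

BASELINE = FFT_CYCLES["minimal Xtensa (18K gates)"]

# Speedup of each configuration relative to the minimal processor.
speedups = {name: BASELINE / cycles for name, cycles in FFT_CYCLES.items()}

if __name__ == "__main__":
    for name, factor in speedups.items():
        print(f"{name:32s} {factor:7.1f}x")
```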

17. Performance: A Complex FLIX Example
[Diagram: register file (16 x 256) with ports A, B, C, D, E; two 4K x 256 RAMs with address inputs; In-Q and Out-Q; 64-bit instruction word with fields MemA, InQ, WrtA, ExA/B, ExC/D, ExE, WrtB, OutQ, MemB.]
- 9 independent operation fields
- Multiple load/store
- Input/output queues

18. Performance: Writing TIE for FLIX
    length l64 64 { InstBuf[3:0] == 14 }
    format flix64 l64
    slot slot0 flix64[*]
    slot slot1 flix64[*]
    opcode L32I slot0
    opcode S32I slot0
    opcode ADD slot0
    opcode NOP slot0
    opcode ADD slot1
    opcode ADDI slot1
    opcode SUB slot1
    opcode NOP slot1
All components of the processor solution are automatically generated from the TIE code in <2 hours:
- RTL & HW flow scripts
- Toolchain
- System models
- Operating system support
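The `length` rule above says an instruction is 64 bits wide when `InstBuf[3:0] == 14`. As a rough illustration of how a fetch stage could use such a rule to walk a mixed-length stream, here is a minimal Python sketch. Only the 64-bit rule comes from the slide; the 16-/24-bit split, the rejection of the value 15, the little-endian byte order and both function names are assumptions made for this example.

```python
def instruction_length_bytes(first_byte):
    """Length of the instruction whose lowest-addressed byte is `first_byte`.

    ASSUMPTION: little-endian packing, so InstBuf[3:0] is the low nibble
    of the first byte in memory.
    """
    op0 = first_byte & 0xF
    if op0 == 14:
        return 8   # from the slide: length l64 64 { InstBuf[3:0] == 14 }
    if op0 == 15:
        raise ValueError("op0 == 15 is left unassigned in this sketch")
    if op0 >= 8:
        return 2   # ASSUMPTION: values 8..13 mark 16-bit instructions
    return 3       # ASSUMPTION: values 0..7 mark 24-bit instructions


def split_instructions(code):
    """Split a byte string into individual instructions, with no mode bits."""
    instructions = []
    pos = 0
    while pos < len(code):
        size = instruction_length_bytes(code[pos])
        if pos + size > len(code):
            raise ValueError(f"truncated instruction at offset {pos}")
        instructions.append(code[pos:pos + size])
        pos += size
    return instructions


# One 64-bit, one 16-bit and one 24-bit instruction, back to back.
stream = bytes([0x0E, 0, 0, 0, 0, 0, 0, 0, 0x08, 0x00, 0x00, 0x00, 0x00])

if __name__ == "__main__":
    print([len(insn) * 8 for insn in split_instructions(stream)])
```

Because the length is decided from the first few bits of each instruction, wide and narrow instructions can sit next to each other in memory without a mode switch.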

19. Performance: Conexant DSP Architectural Requirements
- VLIW-SIMD programming model
- 16- and 24-bit scalar instructions
- 64-bit instructions with multiple operations
- 2 or 4 16x16 MAC units
- 6R/3W Conexant-defined register file
- At least two load/store units
- 7-stage pipe with 2 cycles for I/D memory access
- Stall on memory bank conflicts
- Backward compatibility with previous Conexant DSPs via Translation Instruction Set (a sub-operation of the 64-bit instructions)

20. Performance: Conexant Testarossa Encoding
[Diagram: instruction encoding with three operation groups, listed in slide order.]
- Testarossa load/store (vector load/store, scalar load/store, unaligned load/store) and Xtensa core instructions (load/store, branch, ALU): 234 operations
- Complex multiply, real multiply, select: 24 operations
- ALU, shift, 2nd load/store: 52 operations

21. Insight: The Multiple Core SOC Design Problem
Three Skill Sets, Three Environments?
- Software Development Environment
  - Tasks: C code development, debugging, C project management, code profiling, tuning
  - Tools: GNU-based Tensilica software development tools, Xtensa C/C++ compiler, Xtensa Instruction Set Simulator
  - Environment: command line interface; partner-provided software IDEs (WindRiver, ATI/Mentor, MontaVista)
- Processor Optimization Environment
  - Tasks: TIE code development for extensions, configuration option management
  - Tools: web-based Xtensa Processor Generator, TIE Compiler, single-source TIE file for processor extension
  - Environment: command line interface; web browser interface
- SOC System Architecture Exploration
  - Tasks: system modeling and simulation, multiple core debug
  - Tools: Xtensa Modeling Protocol (XTMP), bus functional models for co-simulation/co-verification, EDA tools
  - Environment: command line interface; EDA partners' system analysis/debug environments

22. Insight: Xtensa Xplorer
- Software Development Environment
  - C code development
  - Debugging
  - C project management
  - Code profiling, tuning
- Processor Optimization Environment
  - TIE code development for extensions
  - Configuration option management
- SOC System Architecture Exploration
  - System modeling and simulation
  - Multiple core debug

23. Insight: Develop and Manage Processor Configurations
- Manage complexity of growing variety of processor optimization choices
- Software and processor optimization within same IDE
- Gate count estimate: per instruction, per register file, per user state
- Interactive display of instruction operands, pipelining, semantics
- Interactive TIE editor: language-sensitive editing and help

24. Insight: Create, Analyze & Tune ISA Extensions (TIE)
- Profile and visualize performance impact of custom instructions
- Pipeline Viewer shows instruction flow of disassembled code
- Static analysis of pipeline stalls pinpoints areas for fine tuning
- Highlight instructions with variable latency (e.g. cache misses)
- Interlocks on deep TIE pipelines fully modeled and explained

25. Insight: Analyze and Select Caches to Meet Speed/Area Goals
- Automatically profile code across range of cache configuration options
- Performance charts visually compare different configurations

26. Insight: Chip-level Software and Simulation for MPSOC
- Manage system memory maps & link/load for multiple-core SOCs
- Develop, run and debug multiple-core simulations using Xtensa Modeling Protocol (XTMP)
- Auto-generated XTMP model based on memory maps
- Specify chip-level memory maps for shared/private memories
- Place interrupt and reset vectors
- Assign code/data to distributed memories

27. Automation: The Next Generation
[Diagram labels: Application Source Code; NEW Automation Tool; Electronic Specification; Xtensa Processor Generator; Complete Hardware Design (ALU, DSP, OCD, Timer, FPU, Register File, Cache); Customized Software Tools; Any Fab.]
Application source code shown on the slide (fragment, cut off on the slide):
    int main() {
      int i;
      short c[100];
      for (i=0;i<N;i++) {
        c[i] = 0;
      }
      for (i=0;i<N;i++)

28. Automation: Goals for Processor Extension
- Flexibility
  - Application code might be written/modified after tape-out
  - Generated TIE must be sufficiently general purpose so that small changes to application code do not degrade performance
- Control
  - Full automation
    - C/C++ in, TIE out
    - C/C++ plus generated TIE in, binary code out
  - Optional full control by user
    - Guide the tool and/or select instructions
    - Add to or change generated TIE
    - Tune application to better take advantage of TIE
- Speed: minutes, not days

29. Automation: Basic Operation - Fusion
Original C code:
    int *a, *b, *c;
    for (int i=0; i<n; i++)
      c[i] = (a[i] + b[i]) >> 2;
Complete TIE code:
    operation add_shift (out AR c, in AR a, in AR b) {
      wire t[31:0] = a+b;
      // arithmetic shift right by 2: replicate the sign bit t[31] above t[31:2]
      assign c = {{2{t[31]}}, t[31:2]};
    }
[Diagram: the "+" and ">> 2" operators fused into one operation.]
Combined add-shift operator automatically used wherever equivalent expression occurs in source.
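The fused operation is meant to compute `(a + b) >> 2` on 32-bit values in a single instruction. As a sanity check of the bit-level form, here is a minimal Python model. It assumes wrap-around 32-bit addition and an arithmetic (sign-preserving) right shift, which in bit-vector terms is two copies of bit 31 placed above bits 31..2. All helper names other than `add_shift` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of `value` as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def add_shift(a, b):
    """Bit-level model of the fused add + shift-right-by-2 operation."""
    t = (a + b) & MASK32              # 32-bit sum, wraps on overflow
    sign = (t >> 31) & 1              # bit 31 of the sum
    top = 0b11 if sign else 0b00      # two copies of the sign bit
    c = (top << 30) | (t >> 2)        # sign copies above bits 31..2 of the sum
    return to_signed32(c)


def reference(a, b):
    """What the C expression computes, assuming an arithmetic right shift."""
    return to_signed32(a + b) >> 2


if __name__ == "__main__":
    a = [5, -5, 100, -7]
    b = [6, -6, 23, 2]
    print([add_shift(x, y) for x, y in zip(a, b)])
```

The model and the reference agree for positive and negative sums, which is the property a compiler relies on when it substitutes the fused operation for the original expression.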

30. Automation: Basic Operation - Multiple Ops in FLIX
Original C code:
    for (int i=0; i<n; i++)
      c[i] = (a[i] + b[i]) >> 2;
Complete TIE code (64-bit instruction with 3 slots: S0, S1, S2):
    length l 64 { InstBuf[3:0] == 14 }
    format f l
    slot slot0 f[*] ADDI, NOP
    slot slot1 f[*] ADD, SRAI, NOP
    slot slot2 f[*] L32I, S32I, NOP
Generated assembly:
    loop:
      {addi a9,a9,4;   add a12,a10,a8;  l32i a8,a9,0}
      {addi a11,a11,4; srai a12,a12,2;  l32i a10,a11,0}
      {addi a13,a13,4; nop;             s32i a12,a13,0}
Original C compiled to 3 cycles/iteration.
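Each slot in the TIE format accepts only the opcodes listed for it, so every bundle in the generated loop has to respect that table. The following Python sketch checks the three bundles from the slide against the slot definitions; the data layout and function names are illustrative assumptions, not how the Tensilica tools represent them.

```python
# Opcodes allowed in each slot, as listed in the TIE code on the slide.
SLOT_OPCODES = [
    {"addi", "nop"},          # slot0
    {"add", "srai", "nop"},   # slot1
    {"l32i", "s32i", "nop"},  # slot2
]

# The three bundles of the generated loop body, one tuple per 64-bit instruction.
LOOP_BODY = [
    ("addi a9,a9,4", "add a12,a10,a8", "l32i a8,a9,0"),
    ("addi a11,a11,4", "srai a12,a12,2", "l32i a10,a11,0"),
    ("addi a13,a13,4", "nop", "s32i a12,a13,0"),
]


def opcode_of(operation):
    """Return the mnemonic of an assembly operation such as 'addi a9,a9,4'."""
    return operation.split()[0].lower()


def is_legal_bundle(bundle):
    """True when every operation sits in a slot that accepts its opcode."""
    if len(bundle) != len(SLOT_OPCODES):
        return False
    return all(opcode_of(op) in allowed for op, allowed in zip(bundle, SLOT_OPCODES))


if __name__ == "__main__":
    for bundle in LOOP_BODY:
        print(bundle, "legal" if is_legal_bundle(bundle) else "illegal")
    print("cycles per iteration:", len(LOOP_BODY))
```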

31. Automation: Basic Operation - SIMD/Vector
Original C code:
    short *a, *b, *c;
    for (int i=0; i<n; i++)
      c[i] = a[i] + b[i];
Complete TIE code:
    regfile vec 64 16 v;
    operation add16x4(out vec c, in vec a, in vec b) {
      assign c = {a[63:48]+b[63:48], a[47:32]+b[47:32],
                  a[31:16]+b[31:16], a[15:0]+b[15:0]};
    }
[Diagram: vector registers a + b = c.]
Four iterations in parallel.
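The `add16x4` operation treats a 64-bit register as four independent 16-bit lanes. A minimal Python model of that behavior is sketched below, assuming each lane wraps modulo 2^16 and carries never cross lane boundaries; `pack` and `unpack` are hypothetical helpers added for the example.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 in bits 15..0."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def add16x4(a, b):
    """Lane-wise 16-bit addition; each lane wraps independently."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane_sum << shift
    return result


if __name__ == "__main__":
    a = pack([1, 2, 3, 4])
    b = pack([10, 20, 30, 40])
    print(unpack(add16x4(a, b)))
```

One call to `add16x4` stands in for four iterations of the scalar loop, which is where the "four iterations in parallel" claim comes from.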

32. Automation: Processor Extension Step 1
Compile the C/C++ application code
- Designer specifies compiler optimization flag
  - Compiler generates comments to help user tune code
  - Optimized code yields better results
- Compiler generates information from application
  - Feedback optimization ranks code regions by frequency
  - Vectorizer determines which loops can be vectorized
  - Fuser generates dataflow graphs for important regions
  - Operation counts for each type of opcode for every region

33. Automation: Processor Extension Step 2
Generated information used to select and generate TIE:
- For each code region, generate many potential sets of TIE extensions (configurations)
  - Vectorize by 1, 2, 4, 8
  - Add FLIX functional units
  - Add fusions
  - Generation guided by estimated performance
- Evaluate all generated configurations across all regions
- Find best set of merged configurations given budget

34. Automation: Processor Extension Step 3
Use the TIE with a C/C++ or assembly application
- Compiler reads TIE (automatically or manually generated) and generates code
  - FLIX slot/format TIE specification mapped to resource tables
  - Generalized graph matcher generates dataflow graphs from TIE
  - Vectorizer vectorizes a loop and checks if all required operations are available in TIE
- User free to tune the code in ANSI C/C++ or assembly
- Simulator, assembler, debugger, RTOS support generated directly from TIE

35. Automation: Example: Sum-of-Absolute Differences Search
Generated configuration parameters:
Config  Speedup  Gates added (K)  SIMD factor  FLIX width (slots)  Load/store units  Fusion
1       8.7x     74               8            3                   2                 Yes
2       8.1x     57               8            2                   2                 Yes
3       7.6x     46               4            3                   2                 Yes
4       7.6x     37               8            2                   1                 Yes
5       6.8x     33               4            2                   2                 Yes
6       6.8x     26               8            1                   1                 Yes
7       6.1x     18               4            2                   1                 Yes
8       5.1x     12               4            1                   1                 Yes
9       4.3x     8                2            2                   1                 Yes
10      3.4x     5                2            1                   1                 Yes
11      1.4x     0.3              1            1                   1                 Yes
[Chart: speedup vs. gates added, with points labeled by configuration number.]
Wide range of choices of performance increase versus hardware cost.
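With a list of generated configurations in hand, picking one for a given gate budget is a small search. The Python sketch below uses speedup and gate figures as read from this slide's flattened table; the selection rule (highest speedup, ties broken by fewer gates) and the function name are assumptions for illustration and not a description of the actual tool.

```python
# (configuration number, speedup, gates added in thousands), read from the slide.
CONFIGS = [
    (1, 8.7, 74), (2, 8.1, 57), (3, 7.6, 46), (4, 7.6, 37),
    (5, 6.8, 33), (6, 6.8, 26), (7, 6.1, 18), (8, 5.1, 12),
    (9, 4.3, 8), (10, 3.4, 5), (11, 1.4, 0.3),
]


def best_under_budget(budget_kgates):
    """Return the configuration with the highest speedup that fits the budget.

    ASSUMPTION: ties on speedup are broken in favor of fewer added gates.
    Returns None when nothing fits.
    """
    feasible = [c for c in CONFIGS if c[2] <= budget_kgates]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c[1], -c[2]))


if __name__ == "__main__":
    for budget in (5, 10, 20, 40, 80):
        print(f"budget {budget:>2}K gates ->", best_under_budget(budget))
```

A real flow would score merged configurations across all code regions; this single-table version only shows the budget trade-off the slide's chart is making.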

36. Automation: Application Examples
Application                  Speedup  Original code size     Code size after  Code size on MIPS32  Configurations  Run time to generate
                                      (before acceleration)  acceleration     (using gcc -O2)      visited         configurations
Radix-4 FFT                  10.6x    1.5 KB                 3.6 KB           4.4 KB               175,796         3 minutes
GSM Encoder                  3.9x     17 KB                  20 KB            38 KB                576,722         15 minutes
GSM Encoder (using FFT TIE)  1.8x     17 KB                  19 KB            38 KB                N/A
MPEG4 Encoder                3.3x     111 KB                 136 KB           356 KB               1,340,312       30 minutes
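The code-size columns can be read as two ratios: how much the code grows after acceleration, and how the accelerated code compares with the MIPS32 build. A minimal Python sketch of that arithmetic, using the figures from the table (sizes in KB); the function names are hypothetical.

```python
# application: (speedup, original KB, accelerated KB, MIPS32 gcc -O2 KB)
APPS = {
    "Radix-4 FFT": (10.6, 1.5, 3.6, 4.4),
    "GSM Encoder": (3.9, 17, 20, 38),
    "GSM Encoder (using FFT TIE)": (1.8, 17, 19, 38),
    "MPEG4 Encoder": (3.3, 111, 136, 356),
}


def code_growth(app):
    """Accelerated code size divided by original code size."""
    _, before, after, _ = APPS[app]
    return after / before


def size_vs_mips32(app):
    """Accelerated code size divided by the MIPS32 code size."""
    _, _, after, mips32 = APPS[app]
    return after / mips32


if __name__ == "__main__":
    for app, (speedup, *_sizes) in APPS.items():
        print(f"{app:28s} speedup {speedup:4.1f}x, "
              f"growth {code_growth(app):.2f}x, "
              f"vs MIPS32 {size_vs_mips32(app):.2f}x")
```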

37. Conclusion
- MPSOC represents a new medium of implementation:
  - Opportunity: cost, power, bandwidth potential of semiconductors
  - Challenge: return on investment for design of complex chips
- The transition to MPSOC will drive
  - new parallel architectures (focus becomes interconnect, not ISA)
  - shift from hardwired design to programmable design
  - new class of hardware/software environments for processor and SOC generation, integration and use
  - rapid growth in processor counts and aggregate performance
- Important historical parallel between the integrated circuit (many small transistors per chip) and MPSOC (many small processors per chip)

38. Key Research Directions
- Tool environments for identification/exploitation of latent parallelism
- Unified programming model for MP
- Technical and economic tools for optimizing efficiency vs. flexibility (spectrum of early vs. late binding)
- Application-specific interconnect topologies and generators
- Role for hardware-centric programmability (FPGA) vs. software-centric programmability (processor)
- Vision: from a set of communicating tasks + chip interface specification + performance constraints to a set of program binaries + chip GDSII
- Profile-based automation:
  - Generation of ISA
  - Assignment of tasks to processors [1-to-n, n-to-1; static vs. dynamic allocation]
  - Profile-based implementation of messaging mechanism and physical interconnect
  - Memory configuration, memory map, shared code and data section allocation
- University Program: free license to tools and models for MPSOC design using extensible processors. Contact: Steve Roddy, roddy@tensilica.com

