How to realize high-performance compute with Multicore DSP


1 How to realize high-performance compute with Multicore DSP

2 C667x Target Applications (Non-Telecom)
Application areas: Mission Critical; Test and Automation; HPC, Imaging and Medical; Video Infrastructure; Audio Infrastructure; Emerging Broadband; Others.

C6472 Target Application Areas: Meeting the needs of today's leading-edge, high-performance applications, TI's highly power-efficient C6472 was designed to support applications that drive many channels, applications that demand maximum performance density, and breakthrough applications for which designers must have access to sophisticated functions. These devices are ideal for high-performance applications such as high-end industrial, mission critical, test and measurement, communication, high-end imaging and video, blade server and cloud computing, and ATM/currency verification.

3 RF and Communication Applications
Segments: Military & Defense; Avionics; Govt & Public Safety.
Applications:
- ISR (Intelligence/Surveillance/Reconnaissance)
- SIGINT/COMINT/Signal Generators
- Military Communications
- SDR (JTRS): Manpack/LMR/Fixed
- Comm. Infra: VoIP/Video Gateways
- Satellite/Avionics Communications
- Ground Receiver/Repeaters
- Weather Radar
- FAA: Civil Aviation/Govt Comm.
- Conventional PS: TETRA/APCO/E911
- Wireless Infrastructure
- Emerging Broadband (OFDM/LTE/WiMAX)
- Utilities/Transport/Smart Grid
Key customer careabouts:
- Long-term partnership
- Financial stability
- Strong roadmap and R&D
- Floating-point performance
- Size, Weight, and Power (SWaP)
- I/O bandwidth
- Longevity of supply (10+ years)

4 RF and Comm. Product Requirements
DSP requirements:
- Raw performance in terms of MIPS/GHz/MMACS
- Floating-point capable ISA to achieve "precision" and high GFLOPS
- Large on-chip RAM to reduce accesses to slow external memory
- High-speed external memory interface
- Large addressable memory
- Efficient DMA architecture
- Wireless-specific accelerators and TCP/IP offload
End product needs:
- Support multiple waveforms
- Common platform for TDMA/CDMA/OFDMA
- Multi-channel VoIP/video capability
- Support FEC and modulation
- TCP/IP networking support
Common DSP requirements:
- Highest levels of raw performance (MIPS), e.g. image processing/analytics
- Integrated fixed- and floating-point capability, e.g. radar/sonar/precision guidance applications
- Large on-chip memory, large addressable external memory space, high-BW EMIF, e.g. electro-optical imaging apps
- Memory ECC for system reliability
- Multiple high-BW I/O for on-board and backplane connectivity, FPGA connectivity, and transporting raw I/O data, e.g. phased-array radar input
- Scalable H/W and S/W solution, e.g. COTS cards
- Efficiency: low mW per unit of performance, e.g. UAV electronics, avionics, handheld SDR
- Ease of use: S/W development tools, a rich selection of easily available S/W IP, and easily available experts

5 Imaging Product Requirements
End product need: high-BW interface to the RF front end and telecom ports; connecting multiple DSPs on a board, e.g. in an ATCA card; high-BW backplane and network connectivity.
DSP requirement: multiple high-speed interfaces (PCIe, Serial RapidIO, OBSAI/CPRI interface, Gigabit Ethernet, etc.).

End product need: reliability in mission-critical designs; low-power design.
DSP requirement: memory Error Correction & Checking (ECC); efficient low-power DSPs; support for extended temperature ranges from -40°C to 105°C and others.

End product need: ease of use.
DSP requirement: dev and debug tools; multicore S/W frameworks; signal/image processing functions; VoIP library; audio/video codecs.

6 KEYSTONE Architecture
Introducing the "KeyStone Architecture" (C66x): the best combination of performance (GHz) and power consumption in the industry.
- 16 GFLOPS and 32 GMACS per 1 GHz
- Next-generation C66x DSP core: fixed and floating point, 1.25 GHz
- 4x the C64x+ MACs (32 per cycle) and 4x the C67x floating-point MACs (8 per cycle); 16 FLOP/cycle compared to 6 FLOP/cycle
- The 8-core C6678, based on the C66x core, delivers 320 GMACS / 160 GFLOPS at 1.25 GHz per core (effectively a 10 GHz DSP)
- 100% code compatible with all C64x (fixed-point) and C67x (floating-point) devices
- Similar power profile to the C64x core
- Supported by the Code Composer Studio IDE
Core lineage:
- C64x+ core (fixed point): lowest-power, highest-performance fixed-point DSP core
- C67x core (floating point): industry's lowest-power floating-point DSP core; high precision and wide dynamic range
- C66x: new multicore DSP core combining fixed and floating point
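The headline figures on this slide follow from multiplying core count, clock rate, and operations per cycle. The sketch below reproduces that arithmetic; the function and constant names are illustrative, and the results are theoretical peaks, not measured throughput.

```python
# Theoretical peak throughput = cores x clock (GHz) x operations per cycle.
# Per-cycle figures are the ones quoted on the slide for one C66x core.
MACS_PER_CYCLE = 32    # fixed-point MACs per cycle
FLOPS_PER_CYCLE = 16   # floating-point operations per cycle


def peak_giga_ops(cores, clock_ghz, ops_per_cycle):
    """Return the theoretical peak in giga-operations per second."""
    return cores * clock_ghz * ops_per_cycle


# One core at 1 GHz: the "32 GMACS and 16 GFLOPS per 1 GHz" claim.
per_ghz_gmacs = peak_giga_ops(1, 1.0, MACS_PER_CYCLE)
per_ghz_gflops = peak_giga_ops(1, 1.0, FLOPS_PER_CYCLE)

# Eight cores at 1.25 GHz: the C6678 figures.
c6678_gmacs = peak_giga_ops(8, 1.25, MACS_PER_CYCLE)
c6678_gflops = peak_giga_ops(8, 1.25, FLOPS_PER_CYCLE)
c6678_aggregate_ghz = 8 * 1.25  # the "effectively a 10 GHz DSP" figure

print(per_ghz_gmacs, per_ghz_gflops)
print(c6678_gmacs, c6678_gflops, c6678_aggregate_ghz)
```

The "10 GHz" number is simply the sum of the eight core clocks, so it only applies to workloads that spread evenly across all cores.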

7 Unmatched Performance
BDTImark2000(TM) score charts: BDTI score for floating-point processors; BDTI score for fixed-point processors. BDTI numbers are based on a 1.5 GHz 6672 platform (dual-core C66x DSP running at 1.5 GHz). Data is available on BDTI's website.

Core-to-core performance comparison (the 6678 has 8 C66x cores):

Algorithm | 300MHz | C64x+ @1.2GHz | Gain
Single-precision floating-point FFT, 2048 pt, radix 4 | 86.84 us | 14.00 us* | ~600%
Fixed-point FFT, 2048 pt, radix 4 | 8.23 us | 4.46 us* | ~200%
FIR filter, 40 samples, 40 taps | 0.69 us | 0.34 us* |
Matrix multiply, 32 x 32 | 17.92 us | 6.16 us* | ~300%
Matrix inverse, 4 x 4 | 0.53 us | 0.13 us* | ~400%

* TI internal benchmarks. Full utilization; memory impacts not comprehended.
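The "Gain" column can be checked against the two timing columns by dividing the slower time by the faster one. The sketch below does that for the five rows; the dictionary layout and names are illustrative, and the timings are copied from the slide.

```python
# (time in the 300 MHz column, time in the 1.2 GHz column), both in microseconds.
timings_us = {
    "SP floating-point FFT, 2048 pt, radix 4": (86.84, 14.00),
    "Fixed-point FFT, 2048 pt, radix 4": (8.23, 4.46),
    "FIR filter, 40 samples, 40 taps": (0.69, 0.34),
    "Matrix multiply, 32 x 32": (17.92, 6.16),
    "Matrix inverse, 4 x 4": (0.53, 0.13),
}


def speedup(slow_us, fast_us):
    """Ratio of execution times; 2.0 means the faster column takes half as long."""
    return slow_us / fast_us


speedups = {name: speedup(slow, fast) for name, (slow, fast) in timings_us.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out close to 6.2x, 1.8x, 2.0x, 2.9x and 4.1x, so the slide's percentages read as these ratios rounded to the nearest whole multiple. The FIR row has no gain figure in the transcript; the ratio of its two timings is about 2x.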

8 TI Multicore KeyStone Architecture
Building blocks: C66x and ARM processing cores; Multicore Shared Memory Controller with shared memory; TeraNet 2 network on chip; Multicore Navigator; application accelerators; high-speed I/O; HyperLink 50; system management (debug, clocking, power).
Benefits:
- Highest integration: cost and power
- Common architecture: portable software; scalable, tailored solutions
- Innovative: Multicore Navigator; multicore floating point
- Development time: tools and debugging
- R&D efficiency: quality software solutions and libraries
The first network-on-chip infrastructure to unleash full multicore entitlement.

9 Product Highlights: C6670 and C6678
C6670 (performance optimized):
- Next-generation C66x core: 4 C66x cores, from 1 GHz (up to 1.2 GHz)
- Memory architecture: 4 MB local L2 (1 MB per core); 2 MB multicore shared memory
- Communication accelerators: TCP3e (turbo encode), up to 550 Mbps; TCP3d (turbo decode), up to 600 Mbps; FFTC, one 2048-point FFT every 4.6 µs; VCP2 for voice channel decoding
- Block diagram: C66x DSP CorePacs with L1 and L2; Multicore Navigator; TeraNet; peripherals and I/O (SRIO x4, PCIe x2, AIF2 x6, I2C, SPI, UART, SGMII); communications coprocessors (4x VCP2, 3x TCP3d, 2x RAC, 1x TAC, 3x FFTC, BCP); network coprocessors (crypto, packet accelerator); memory subsystem (Multicore Shared Memory Controller (MSMC), 2 MB shared memory, DDR3 64b); system elements (EDMA, SysMon, power management, debug); HyperLink

C6678 (power optimized):
- Next-generation C66x core: up to 8 C66x cores, 1 GHz to 1.25 GHz; available options: 1-, 2-, 4-, and 8-core devices
- Memory architecture: 4 MB local L2 (512 KB per core); 4 MB multicore shared memory
- Power-optimized core: <10 W at 1 GHz, nominal temperature
- Block diagram: 8 x C66x DSP CorePac with L1 and L2; Multicore Navigator; TeraNet; peripherals and I/O (SRIO x4, PCIe x2, EMIF 16, TSIP, I2C, SPI, UART); IP interfaces (GbE switch, SGMII); network coprocessors (crypto, packet accelerator); memory subsystem (MSMC, 4 MB shared memory, DDR3 64b); system elements (EDMA, SysMon, power management, debug); HyperLink

Notes: The four-core C6670 performance-optimized device (due to its accelerators) goes up to 1.2 GHz and enables 150 GMACS of fixed-point theoretical performance. Compare this to the 1.2 GHz C GMACs, and there is roughly a 5x theoretical performance improvement. TI has put the C66x cores together with a variety of peripherals, accelerators and on-chip infrastructure to enable a high-performance SoC.

Memory architecture: compared to the C6474, the C6670 now also supports 2 MB of shared memory, in addition to the 1 MB of dedicated L2 memory. There were also significant enhancements to the memory architecture that now enable very high-speed memory access to both internal and external memory through a DDR MHz. Also, the C6670 addressable memory space is 8 GB.

Whether it is on-chip connectivity, managing traffic and flows through the device, or high-speed communications, the C66x core provides improvements on all fronts. The 2 Tb TeraNet switch fabric provides high-bandwidth on-chip communication. The Multicore Navigator also helps streamline and manage efficient data transfer between the various on-chip components. When it comes to data going off-chip, multiple lanes of Serial RapidIO and PCI Express provide a very fat pipe for chip-to-chip or chip-to-backplane communication. Two Gigabit Ethernet ports provide a mechanism to transfer data as well as an additional debug and boot mechanism. With six lanes, each at 6 Gbps, the antenna interface (AIF2 x6 in the block diagram) provides another high-bandwidth interface to the outside world. HyperLink 50 is TI's new approach to providing a very high-speed, very low-latency interface directly to the switch fabric of C66x-based devices. This SerDes-based interface has ~50 Gbps of data bandwidth at full line rate and is the ideal connection to FPGAs and other HyperLink 50 enabled devices.

Acceleration: the C66x SoCs enable TCP/IP packet (L1/L2/L3) processing and offload it from the DSP cores. The cryptographic engine (available in the C6670 only, and located in the network coprocessor block) supports AES/DES/3DES/Snow/Kasumi. To ease software development on such a multicore device, the C66x devices include hardware IP such as the Multicore Navigator, hardware semaphores, and embedded debug capability such as trace.

TI Confidential – NDA Restrictions
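As a rough check on the notes, the C6670's "150 GMACS" figure and the six-lane interface's aggregate rate can both be derived from numbers quoted in the deck. This is a sketch with illustrative variable names; the 32 MACs per cycle value comes from the KeyStone architecture slide, and the lane figure is a raw line rate that ignores protocol overhead.

```python
# C6670 theoretical fixed-point peak: 4 cores x 1.2 GHz x 32 MACs per cycle.
C6670_CORES = 4
C6670_CLOCK_GHZ = 1.2
MACS_PER_CYCLE = 32  # per C66x core, from the architecture slide

c6670_gmacs = C6670_CORES * C6670_CLOCK_GHZ * MACS_PER_CYCLE

# Six-lane interface from the notes: 6 lanes x 6 Gbps each (raw line rate).
LANES = 6
GBPS_PER_LANE = 6

six_lane_line_rate_gbps = LANES * GBPS_PER_LANE

print(f"C6670 peak: {c6670_gmacs:.1f} GMACS")
print(f"Six-lane aggregate: {six_lane_line_rate_gbps} Gbps")
```

The computed peak is slightly above 150 GMACS, which suggests the notes round the figure down.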

10 Multicore Shared Memory
Innovation & Integration via C6678 DSP: Highlights
- C66x core: next-generation fixed/floating-point DSP core with clock speeds ranging from 1 GHz to 1.25 GHz and up to 8-core options.
- Multicore Navigator: data transfer engine architected to move data between various system elements without any CPU overhead, so that maximum system efficiency is achieved.
- Memory architecture: 0.5 MB of local memory per core; 4 MB of shared memory. Enhanced memory architecture through an enhanced Multicore Shared Memory Controller. Bottleneck-free, fast on- and off-chip memory access, including a DDR3-1333MHz (64-bit) interface. L1/L2/L3 ECC.
- Network coprocessor and accelerators: a cost-effective implementation to offload TCP/IP and secure networking functions from the DSP.
- TeraNet: switch fabric with 2 terabits of bandwidth, which allows maximum data transfer between system components to realize full system entitlement.
- Improved debug: S/W development and debug support leveraged by CCS.
- Peripherals and I/O interfaces: high-bandwidth peripherals that operate independently (not shared), allowing simultaneous data transfer to prevent bottlenecks, featuring RapidIO v2.1 (5 Gbps with 1x, 2x and 4x support) and PCIe x2 (2 lanes, running independently of RapidIO).
- HyperLink: ultra-high-speed (up to 50 Gbaud), low-latency serial interface that connects to other DSPs and FPGAs in the system.

Block diagram: 8 x C66x DSP CorePac with L1 and L2; Multicore Navigator; TeraNet; peripherals and I/O (SRIO x4, PCIe x2, EMIF 16, TSIP, I2C, SPI, UART); IP interfaces (GbE switch, SGMII); network coprocessors (crypto, packet accelerator); memory subsystem (Multicore Shared Memory Controller (MSMC), 4 MB shared memory, DDR3 64b); system elements (EDMA, SysMon, power management, debug); HyperLink.

Notes: TI has put the C66x cores together with a variety of peripherals, accelerators and on-chip infrastructure to enable a high-performance SoC.

Memory architecture: there were significant enhancements to the memory architecture that now enable very high-speed memory access to both internal and external memory through a DDR MHz with addressable memory space up to 8 GB.

Whether it is on-chip connectivity, managing traffic and flows through the device, or high-speed communications, the C66x core provides improvements on all fronts. The 2 Tb TeraNet switch fabric provides high-bandwidth on-chip communication. The Multicore Navigator also helps streamline and manage efficient data transfer between the various on-chip components. When it comes to data going off-chip, multiple lanes of Serial RapidIO and PCI Express provide a very fat pipe for chip-to-chip or chip-to-backplane communication. Two Gigabit Ethernet ports provide a mechanism to transfer data as well as an additional debug and boot mechanism. With six lanes each at 6 Gbps, this provides another high-bandwidth interface to the outside world. HyperLink 50 is TI's new approach to providing a very high-speed, very low-latency interface directly to the switch fabric of C66x-based devices. This SerDes-based interface has ~50 Gbps of data bandwidth at full line rate and is the ideal connection to FPGAs and other HyperLink 50 enabled devices.

Acceleration: the C66x SoCs enable TCP/IP packet (L1/L2/L3) processing and offload it from the DSP cores. The cryptographic engine (available in the C6670 only, and located in the network coprocessor block) supports AES/DES/3DES/Snow/Kasumi. To ease software development on such a multicore device, the C66x devices include hardware IP such as the Multicore Navigator, hardware semaphores, and embedded debug capability such as trace.
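The memory figures on the two product slides can be totalled per device. The sketch below adds per-core L2 to the shared memory; the function name is illustrative, and L1 memory is left out because the deck gives no L1 sizes.

```python
def on_chip_memory_mb(cores, l2_per_core_mb, shared_mb):
    """Total on-chip L2 plus shared memory in MB (L1 is not counted)."""
    return cores * l2_per_core_mb + shared_mb


# C6678: 8 cores x 0.5 MB L2 each, plus 4 MB of shared memory.
c6678_total_mb = on_chip_memory_mb(cores=8, l2_per_core_mb=0.5, shared_mb=4)

# C6670: 4 cores x 1 MB L2 each, plus 2 MB of shared memory.
c6670_total_mb = on_chip_memory_mb(cores=4, l2_per_core_mb=1.0, shared_mb=2)

print(f"C6678: {c6678_total_mb} MB, C6670: {c6670_total_mb} MB")
```

The C6678 total of 8 MB matches the "On-Chip RAM up to 8MB" figure quoted in the competitive analysis slide.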

11 Value Prop against FPGA Value Prop against other DSPs
Competitive Analysis
Value proposition against FPGAs:
- C66x performance: 320 GMACS / 160 GFLOPS
- Baseband on a chip: handles multiple waveforms supporting OFDM, CDMA, TDM
- L1/L2/L3 processing capability
- Wireless accelerators (VCP/TCP/FFT)
- Software programmability
- Time to market
- Smaller package (more DSPs per board)
- Lower power: smaller battery, simpler cooling
- Low cost: MIPS/$
Value proposition against other DSPs:
- C66x fixed and floating point
- Industry's fastest DSP at 10 GHz
- On-chip RAM up to 8 MB
- DDR3 1600 MHz, 64-bit, 8 GB address space
- Multiple independent high-speed I/O: 4x sRIO v2.1, 2x PCIe Gen II, 2x SGMII, 2x TSIP
- High-BW FPGA connectivity: 50 Gbps
- 1/2/4/8-core options (pin compatible)
- L1/L2/L3 memory ECC for system reliability
- Low power per GFLOPS and GMACS
- Extended temperature support: -40°C to 105°C
- CCS tools + S/W collateral
- 3rd-party network

12 TMDXEVM6678L EVM Single wide AMC form factor
Code Composer Studio™ IDE (CCSv5): design, code and build, debug, analyze, tune. Allows designers of all experience levels to move quickly through application development. Time-limited free evaluation versions are available for download. Includes a C667x simulator.
EVM kit includes:
- BIOS 6.x, BIOS-MCSDK / LINUX-MCSDK 2.0 (NDK, PDK, LIB, etc.)
- Sample program and out-of-box (OOB) demo, e.g. I/O Benchmark, Image Processing Pipeline, and High Performance DSP Utility Application (HUA)
- User guide, starter guide, technical reference guide, app notes, etc.
H/W development tools:
- TMDXEVM6678L: EVM with XDS100 emulation, $399
- TMDXEVM6678LE: EVM with XDS560V2 emulation, $599
- TMDXEVM6678LXE: EVM with XDS560V2 emulation, encryption enabled, $599
- TMDSEMU560v2STM-UE: XDS560v2 System Trace Emulator with 128 Mb system trace buffer and Ethernet/USB support
- Optional PCIe adapter card to connect the C6678 EVM to a standard PCI header of a desktop
Notes:
- Low-cost EVM starting at $399 (the variants differ in the emulation technology used).
- Standard AMC form factor card (MicroTCA chassis), but it can be used standalone.
- The board has a Xilinx FPGA.
- All the interfaces have been brought out, either through individual connectors or through the backplane connector.
- A separate PCIe adapter card is available for connecting this board to a PC.
- The EVM comes with the Multicore Software Development Kit for quick startup.

13 TI’s Multicore Hardware Ecosystem
Hardware categories: Standardized Boards (PCI Express with Gen 2; Advanced Mezzanine Card (AMC); ATCA); Custom; Chassis / System; Others.

14 TI’s Multicore Software Ecosystem
Software stack elements: Customer Application; Layer 2+; Multicore Entitlement; IP Network Stack; Layer 1 UMTS; Layer 1 LTE; TI Runtime; TI's Device Entitlement Libraries; TI Layer 1 Libraries; TI BIOS, Linux, OSE(ck).

15 Multicore Tools and Software (MC-SDK)
Tools: Codegen with OpenMP support; emulator/debugger; simulator; profiler / DVT; 3rd-party tools.
Software: BIOS/Linux SDK; multicore demonstration; DSP BIOS 6.x; platform abstraction; basic networking; inter-core communication.
Application-specific libraries: audio/video codecs; VoIP components; WiMAX Toolkit, LTE Toolkit, DSPLIB, others.
Diagram, host computer side: Code Composer Studio™ (Eclipse-based editor/IDE, compiler and linker (Codegen), profiler, debugger, remote debug, SoC Analyzer); third-party plug-ins (Polycore, ENEA Optima, 3L).
Diagram, target board side: DSP customer application; Multicore Software Development Kit with demo apps for multicore BIOS and multicore Linux; DSPLIB, IMGLIB, speech codec, audio codec, video codec, NDK; operating system with boot loader (BIOS, Linux); Platform Development Kit; inter-core communication; multicore entitlement and full silicon entitlement.
Emulation: XDS 560 V2; XDS 560 Trace.

16 KeyStone Multicore Software – Libraries & Codecs
Libraries:
- Digital signal processing: FFT; adaptive filtering; filtering and convolution; others. Available free from TI.
- Image processing: edge detection; boundary; morphology; others. Available free from TI.
- Voice and fax: line echo cancellation; voice activity detection; others. Available free from TI.
- Vision Lib (object only), 50+ royalty-free kernels: background modeling and subtraction; object feature extraction; tracking, recognition; low-level pixel processing.
- MATLAB: image processing; math operations; vision analytics.
- Security/cryptography: AES, SHA1, 3DES.
Codecs:
- Voice: G.711, G.722, G.723, G.729, CDMA, AMR (NB/WB), EVRC-B, others.
- Video: H.263, H.264, MPEG2, MPEG4, VC1/WMV9 decode, others.
- Audio: MPEG1 Layer2, AAC LC/HE, AC3 2.0/5.1, sample rate conversion.
- Fax: T.38, fax modem.

17 High-Performance and Multicore Processor
High value: KeyStone architecture; high performance at the right power and price; low-cost EVM; open and affordable tools.
Easy to use: training; product collateral; drivers and example code; user community.
Quick to market: delivering an affordable, out-of-the-box experience with SW enablers for fast product development.
Enabler software: quick-start hardware; benchmarks and functional understanding; frameworks and abstraction; generic libraries; application libraries.

18 Getting Started – More Information/Links
Product folders:
- C66x informational wiki page
- All C6000 multicore DSPs
- TMS320C6670
- TMS320C6678
EVMs and software tools:
- TMS320C6678 EVM
- TMS320C6670 EVM
- AMC to PCIe adapter card
- Multicore Software Development Kit for BIOS & Linux
- MCSDK wiki
- CCS v5 wiki
- C66x Linux wiki
- DSP Signal Processing Library (DSPLIB)
- Image and Video Processing Library (IMGLIB)
- LTE/WiMAX Toolkit: discuss with BDM
Technical support:
- TI E2E Community (online support)
- Product training
Notes: This slide gives the links and a brief description of all the EVMs TI provides through TI.com for the different high-performance DSPs. The most notable point is how aggressively TI is trying to make really low-cost EVMs (see $ EVM) available to its customers. Many TI third parties are developing different types of hardware platforms based on TI DSPs as well. You can find the names of some of those third parties by visiting TI's website at

19 Online Video Training http://focus. ti

20 Mission Critical DSP Market “What Customers Like about TI”
Revenue, 2002 to 2009 (chart):
- Undisputed #1 DSP and SoC supplier
- Strong growth for 8 years in a row, even in 2009
- Higher R&D spending than the DSP revenue of most competitors
- KeyStone SoC architecture secures future success
Rich product portfolio and strong roadmap:
- 2 families with multiple devices and growing: Nyquist (6670), Shannon (6678/4/2)
- 40 nm -> 28 nm
- Tools/software and compilers
- 3rd-party ecosystem
- Multiple design wins pre-announcement
Secure supply:
- No DSP product discontinuation (end of life)
- History of delivering on promises (power, GHz, ...)
Field experience: completeness of system analysis, architecture, internal switch, ...
Customer support business model: long-term relationships with key customers; actively seeking and incorporating customer feedback in roadmap devices.
TI SoC architecture (diagram): Layer 1 (PHY); Layer 2 (MAC); Layer 3+ (Layer 3, 4); Radio; IP Network; Macro, Pico, Femto; Software.

21 Backup Slides Product Details

22 C6678 (Shannon) “Lightning” Half-Length PCIe Card Feature Set
- 4 x TI TMS320C6678 (8-core)
- C66x core frequency: 1.25 GHz
- DDR3 memory: data frequency 1600 MHz; data bus width 64-bit
- Serial RapidIO Gen-2 interface
- PCIe Gen-2 interface
- 10/100/1000 Mbps Ethernet with SGMII
- Hyperlink50 interface
- 1024 MB DDR on board
- PLX PEX8624 PCIe Gen-2 switch
- Serial RapidIO daisy-chain
- Ethernet daisy-chain
- Each DSP device is linked to the PCIe switch by x2 lanes
- Dual DSPs linked by Hyperlink50
- Power: max 54 W
Notes: Now we are going to talk about the Mirage family. The first member of this family is "Mirage I". It has 2 Shannon devices and 1 P2010 PowerPC, as well as the MMC controller.
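Combining this feature list with the per-device peak from the KeyStone architecture slide gives a rough theoretical ceiling for the whole card. This is a sketch under stated assumptions: the names are illustrative, the 160 GFLOPS per device is the deck's theoretical figure at 1.25 GHz, and the 54 W is the card's quoted maximum, so the efficiency number is an estimate rather than a measurement.

```python
DSPS_PER_CARD = 4
CORES_PER_DSP = 8
PEAK_GFLOPS_PER_DSP = 160.0   # theoretical, 8 cores at 1.25 GHz
CARD_MAX_POWER_W = 54.0       # "Power: max 54 W" from the feature list

card_cores = DSPS_PER_CARD * CORES_PER_DSP
card_peak_gflops = DSPS_PER_CARD * PEAK_GFLOPS_PER_DSP
card_gflops_per_watt = card_peak_gflops / CARD_MAX_POWER_W

print(f"{card_cores} cores, {card_peak_gflops:.0f} GFLOPS peak, "
      f"{card_gflops_per_watt:.2f} GFLOPS/W at maximum power")
```

Sustained throughput will be lower, since the per-device figure assumes full utilization with no memory stalls.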

23 What is Hyperlink? “high-speed, low-latency, and low-pin-count communication interface”
- Low pin count (24 pins)
- Point-to-point connection: interconnect DSP-to-DSP, DSP-to-FPGA
- SerDes for data transfer: x1 and x4 modes for Tx and Rx; 12.5 GBaud/lane; effectively 8b9b encoding
- LVCMOS sideband signals for flow control and power management; errors/events/timeouts
- Simple packet-based transfer protocol for memory-mapped access
- Read/write to DSP/FPGA local memory: discrete memory access of any byte-aligned width up to 64 bits; burst transfer modes
- Write (maximum burst size 256 bytes): Write Request ---> Data Packet --->
- Read (maximum burst size 256 bytes): Read Request ---> Read Response
- Interrupt Request <-->
- Up to 64 memory-mapped regions, each region up to 256 MB
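The figures in this list lend themselves to a few quick calculations: aggregate line rate, payload rate after encoding, total mapped address space, and how a large transfer breaks into bursts. The sketch below is a host-side model only, not HyperLink driver code. All names are hypothetical, the payload estimate assumes 8b9b encoding means 8 data bits per 9 line bits, and the burst splitter ignores any alignment rules the hardware may impose.

```python
LANES = 4                  # x4 mode
GBAUD_PER_LANE = 12.5
MAX_BURST_BYTES = 256
REGIONS = 64
REGION_SIZE_MB = 256

# Aggregate line rate and estimated payload rate after 8b9b encoding.
line_rate_gbaud = LANES * GBAUD_PER_LANE
payload_gbps = line_rate_gbaud * 8 / 9

# Total address space reachable through all memory-mapped regions.
mapped_window_mb = REGIONS * REGION_SIZE_MB


def split_into_bursts(offset, length):
    """Split a transfer into (offset, size) bursts of at most 256 bytes."""
    bursts = []
    while length > 0:
        size = min(length, MAX_BURST_BYTES)
        bursts.append((offset, size))
        offset += size
        length -= size
    return bursts


bursts = split_into_bursts(offset=0, length=600)

print(f"line rate {line_rate_gbaud} GBaud, payload ~{payload_gbps:.1f} Gbps")
print(f"mapped window {mapped_window_mb} MB")
print(bursts)
```

A 600-byte write, for example, becomes two full 256-byte bursts followed by one 88-byte burst. The 64 regions together cover 16384 MB (16 GB) of remote address space.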

24 Universal Parallel Port (uPP)
What is it?
- Parallel bus with two independent channels (separate data buses)
- I/O speeds up to 75 MHz with 8- to 16-bit data width per channel
- 1- or 2-channel parallel interface operating in RX, TX or FD mode
- Supports double data rate mode of operation (bandwidth does not change/increase)
Application:
- Each channel can interface cleanly with high-speed ADCs and/or DACs with up to 16-bit data width (per channel)
- Useful as a low-cost interface with FPGAs; can run up to 120 MByte/s per channel in single-channel or bi-directional mode (240 MByte/s for both channels in unidirectional mode)
- Can also be used to interface two C6655/57 devices, or to connect a C6655/57 with the C674x or OMAP-L13x family of devices
Other benefits:
- Internal DMA: leaves the CPU EDMA free
- Simple protocol with few control pins (configurable: 2-4 per channel)
- Multiple data packing formats for 9- to 15-bit data widths
- Interleave mode (single channel only)
- Simple interface: I/O queued by software
Throughput estimates: Note: max. clock of 50 MHz in (*) configuration

25 Thank You

