Xilinx Core Solutions Group

Name: Xilinx Core Solutions Group
Uploaded: 2017-09-09T08:25:39+00:00
Duration: PTM31S46
Channel: Angel Ramsey
Description: Xilinx Core Solutions Group

Xilinx Core Solutions Group
DSP: Xilinx Core Solutions Group

Why is an FPGA vendor talking to you about DSP? What is it that we have to say? Well, I’m going to give you an overview of Xilinx DSP, and by the end of this presentation you’re going to have a new perspective on solving high-performance DSP problems.

Traditional DSP: DSP Processors
Programmable; off-the-shelf, standard part; hardware multiplier; one MAC (multiply-accumulate), time-shared; sequential processing; performance ceiling.
[Diagram: a single MAC built from one multiplier and one adder.]

If you look at the performance that a DSP microprocessor can deliver, there is an inverse relationship between how many operations the processor can perform on a data sample and the sample rate at which the processor can operate. While a processor can perform simple tasks very fast, this performance drops off quite quickly as the work per sample grows. Adding more processors to a system helps increase the performance capability, but as you can see the increase is not that significant when the cost of adding these processors is considered. Bear in mind that this is more than just component cost: the cost of writing complex multi-processor DSP code is easily underestimated.
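The inverse relationship described above can be sketched with a few lines of arithmetic. This is a minimal model, not a benchmark: the function name is hypothetical, the 400 million MACs per second figure is the peak rate quoted later in this presentation, and the linear scaling with processor count is an optimistic assumption that ignores inter-processor overhead.

```python
def max_sample_rate_hz(mac_rate_hz: float, macs_per_sample: int, processors: int = 1) -> float:
    """Upper bound on the sample rate when every MAC for a sample is time-shared.

    Assumption: ideal scaling across processors (no communication overhead).
    """
    if macs_per_sample <= 0 or processors <= 0:
        raise ValueError("macs_per_sample and processors must be positive")
    return processors * mac_rate_hz / macs_per_sample


if __name__ == "__main__":
    PEAK_MAC_RATE_HZ = 400e6  # peak processor rate quoted in the presentation
    for taps in (8, 32, 128, 512):
        rate = max_sample_rate_hz(PEAK_MAC_RATE_HZ, taps)
        print(f"{taps:4d} MACs per sample -> at most {rate / 1e6:7.2f} Msamples/s")
```

Doubling the work per sample halves the achievable sample rate, which is the performance ceiling the slide refers to.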

Xilinx DSP High Performance Alternative - Parallel Processing
Programmable; off-the-shelf, standard part; many multiplies in one clock cycle; extends the performance of DSP processors; multiple MACs, parallel processing.
[Diagram: four multipliers and four adders operating side by side.]

Xilinx DSP Solution
CORE Generator; DSP LogiCOREs; system-level tools; tools integration.

Existing Xilinx DSP Design Methodology
CORE Generator: parameterize DSP LogiCOREs. Connect the cores with HDL or schematic. M1. XC4000X/Spartan/Virtex.

Addition of DSP System Level Tool
DSP system-level tools: used by all DSP systems engineers; 100,000-copy installed base; fit into the existing DSP environment; connect through the CORE Generator SystemLINX interface.
[Flow: system-level tools, CORE Generator, M1.]

Performance: XC4085XL > 10x Faster than 320C6x
[Chart: 16-bit FIR filter benchmark, in billions of MACs per second (axis 1 to 5), for the 320C6x, 4005XL, 4013XL, 4036XL, 4062XL, and 4085XL.]

First, performance. At a peak rate of 400 million multiply-accumulates per second, the best DSP processors available today deliver slightly more performance in data processing applications than a small Xilinx FPGA. Using a larger device and adding more logic, and with it more parallel processing power, increases the performance that Xilinx can deliver. The Xilinx XC4085XL, which is the largest FPGA shipping today, provides more than ten times the data throughput rate of the fastest processor currently available. Choose the horsepower you need for your application and add an FPGA, not more processors.

120 Million Samples per Second 512-Tap Decimating FIR
[Diagram: 10-bit input registers feed 32-tap FIR cores working in parallel; an adder tree combines their outputs into an 18-bit output register. Callouts: 3.8 billion MACs; >10 DSP µPs; 5,120 flip-flops just for the data buffer; XC4085XL; 150,000 gates.]

The implementation is based on sixteen 32-tap FIR filters all working in parallel. All of the cores are generated by the Xilinx CORE Generator and tied together in a top-level design, which makes the process of implementing the design simplicity itself. The results, however, are staggering. The design delivers almost 4 billion multiply-accumulate operations per second, performance that would require more than 10 high-performance processors to match. The data buffer alone would require about 5,000 flip-flops, more than are available in most FPGAs. The use of distributed RAM instead of flip-flops makes this design possible in an FPGA. The equivalent gate count for this design is approximately 150,000 gates.
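The headline numbers on this slide can be cross-checked with a short calculation. This is a sketch under one stated assumption: the decimation factor is taken to be 16 (one output sample for every 16 inputs, matching the sixteen sub-filters), which is inferred from the 3.8 billion MACs figure and is not stated in the presentation. The function and key names are hypothetical.

```python
def decimating_fir_budget(total_taps: int, subfilters: int,
                          input_rate_hz: float, data_bits: int) -> dict:
    """Rough throughput and storage budget for a decimating FIR built from parallel sub-filters."""
    if total_taps % subfilters != 0:
        raise ValueError("total_taps must divide evenly across the sub-filters")
    # Assumption (not from the slides): decimation factor equals the sub-filter count.
    output_rate_hz = input_rate_hz / subfilters
    return {
        "taps_per_subfilter": total_taps // subfilters,
        "output_rate_hz": output_rate_hz,
        # Every tap contributes one MAC per output sample.
        "macs_per_second": total_taps * output_rate_hz,
        # One stored data word per tap in the delay line.
        "buffer_bits": total_taps * data_bits,
    }


if __name__ == "__main__":
    budget = decimating_fir_budget(total_taps=512, subfilters=16,
                                   input_rate_hz=120e6, data_bits=10)
    for name, value in budget.items():
        print(f"{name}: {value:,.0f}")
```

Under that assumption the model gives 3.84 billion MACs per second and a 5,120-bit data buffer, in line with the 3.8 billion MACs and 5,120 flip-flops quoted on the slide.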

Price per Million MACs per Second
[Chart: price per million MACs per second, on a scale from $0.05 to $0.25, comparing the lowest-cost C6x with the Xilinx XC4000XL.]

The Xilinx 4000XL family is based on a 0.35 µm process technology and is very cost effective. The latest generations of DSP microprocessors have done a lot to reduce the cost of high-performance devices, but even when compared to the cheapest member of the C6x family, a programmable solution from Xilinx can be up to one fifth the cost. Further cost reductions can be achieved by migrating the design to a HardWire device, which we’ll talk about more later. This can reduce the cost to less than a penny per million multiply-accumulates per second. Add an FPGA, not more processors.

DSP LogiCOREs Exploit FPGA Architecture
Matrix of 16 by 1 RAM primitives; look-up-table logic; FIFOs, shift registers, …; multiple small memories; 10,000 RAM primitives on a chip; regular, monolithic, scalable structure; efficient: million MACs per CLB.
[Diagram labels: 16-word RAM, F/F.]

Distributed RAM & Distributed Arithmetic (DA): Perfect Match
Basic DA structure matches the XC4000 architecture. DA algorithms: 4-input look-up tables (LUTs), scaled with adders. For higher performance, use more LUTs: more LUTs = more parallelism. Efficiency similar to a custom solution is achievable with LUT logic: more ASIC gate equivalents, more cost effective.
[Diagram: 4-input LUTs feeding an N-bit ADD or ACC.]
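To make the LUT-plus-adder idea concrete, here is a small behavioral sketch of bit-serial distributed arithmetic for a 4-tap sum of products. It is an illustration under assumptions, not the LogiCORE implementation: the function names are hypothetical, data is treated as two's-complement integers, and a Python list stands in for the 4-input LUT.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1."""
    taps = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << taps)]


def da_dot_product(samples, coeffs, data_bits):
    """Compute sum(coeffs[k] * samples[k]) with one LUT lookup per data bit."""
    if len(samples) != len(coeffs):
        raise ValueError("need one sample per coefficient")
    low, high = -(1 << (data_bits - 1)), (1 << (data_bits - 1)) - 1
    if any(not low <= x <= high for x in samples):
        raise ValueError("sample does not fit in data_bits")
    lut = build_da_lut(coeffs)
    mask = (1 << data_bits) - 1
    words = [x & mask for x in samples]  # two's-complement bit patterns
    acc = 0
    for bit in range(data_bits):
        # The same bit position of every sample forms the LUT address.
        addr = 0
        for k, word in enumerate(words):
            addr |= ((word >> bit) & 1) << k
        partial = lut[addr] << bit  # scale by the weight of this bit
        if bit == data_bits - 1:
            acc -= partial  # the sign bit carries negative weight
        else:
            acc += partial
    return acc


if __name__ == "__main__":
    x, c = [3, -5, 7, -1], [2, -3, 4, 1]
    print(da_dot_product(x, c, data_bits=4), sum(a * b for a, b in zip(x, c)))
```

No multiplier appears anywhere: each data bit costs one table lookup and one add or subtract, which is why the structure maps onto 4-input LUTs and adders.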

Common DSP Functions
Filters: FIR, IIR
Transforms: FFT, DCT
Modulation: multipliers, SIN tables
Basics: multiply / add, storage

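FIR Filter
[Diagram: FIR filter. Sample data, N bits wide, enters a delay line K taps long. Each delayed sample X0, X1, X2, … is multiplied by its coefficient C0, C1, C2, …, and the K products are summed to form the output data.]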
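The structure in the diagram, a delay line whose taps are multiplied by coefficients and summed, can be written as a short direct-form sketch. This is a plain software model for illustration (the function name is hypothetical, and samples before the start of the input are assumed to be zero); it says nothing about how the LogiCOREs are built.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k], with x[n] = 0 for n < 0."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the newest sample in and drop the oldest one.
        delay_line = [x] + delay_line[:-1]
        outputs.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return outputs


if __name__ == "__main__":
    # An impulse at the input reads the coefficients back out.
    print(fir_filter([1, 0, 0, 0], [1, 2, 3]))
```

A K-tap filter costs K multiply-accumulates per output sample, which is the quantity the MACs-per-second figures in this presentation count.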

FIR Filter LogiCOREs Two Basic Types:
1. Serial Distributed Arithmetic FIR: SDA FIR - Single Channel; SDA FIR - Dual Channel
2. Parallel Distributed Arithmetic FIR
Combine basic PDA or SDA FIR cores to solve many problems.

Serial Distributed Arithmetic
SDA FIR filters: parallel in, parallel out, bit-serial internally. All taps are processed in parallel. Full precision through the entire core. One clock cycle is required for each data bit, plus one additional clock cycle for symmetric filters.

EXAMPLE: 10-bit data, 80 taps, symmetrical FIR. For a bit-level clock of 90 MHz, the maximum sample rate is 90 MHz / 11 clocks = 8.2 million samples per second, so the core processes 80 taps every 122 ns. That is 656 million MACs per second in 257 CLBs, or about 2.5 million MACs per second per CLB.
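The worked example follows directly from the rule of one clock per data bit plus one extra clock for symmetric filters. A minimal sketch of that arithmetic is below; the function and key names are hypothetical. Note that the slide rounds the sample rate to 8.2 million before multiplying by 80 taps, which is why it quotes 656 million MACs where the unrounded figure is closer to 654.5 million.

```python
def sda_fir_throughput(bit_clock_hz: float, data_bits: int, taps: int, symmetric: bool) -> dict:
    """Throughput of a bit-serial (SDA) FIR: one clock per data bit, plus one if symmetric."""
    clocks_per_sample = data_bits + (1 if symmetric else 0)
    sample_rate_hz = bit_clock_hz / clocks_per_sample
    return {
        "clocks_per_sample": clocks_per_sample,
        "sample_rate_hz": sample_rate_hz,
        "sample_period_ns": 1e9 / sample_rate_hz,
        "macs_per_second": taps * sample_rate_hz,
    }


if __name__ == "__main__":
    result = sda_fir_throughput(bit_clock_hz=90e6, data_bits=10, taps=80, symmetric=True)
    clbs = 257  # CLB count quoted on the slide for this configuration
    print(f"sample rate : {result['sample_rate_hz'] / 1e6:.2f} Msamples/s")
    print(f"period      : {result['sample_period_ns']:.1f} ns")
    print(f"throughput  : {result['macs_per_second'] / 1e6:.1f} million MACs/s")
    print(f"per CLB     : {result['macs_per_second'] / 1e6 / clbs:.2f} million MACs/s")
```

The same rule with an 80 MHz bit clock gives 8.0 MHz for 10-bit non-symmetric data, so the sample rate depends only on the data width and the symmetry option, not on the number of taps.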

SDA FIR Properties For a Given # of Taps:
Coefficient bit-width determines size: the number of CLBs is a function of the D.A. LUT width. Data bit-width determines the maximum sample rate: one serial clock per bit. Output data width does not affect the CLB count.

What to Ask
Data sample rate; number of taps; data word width; coefficient width; coefficient symmetry; whether the input and output sample rates are the same; number of CLBs.

Serial Distributed Arithmetic FIR Filters
Data word = coefficient size. Entries are the number of CLBs; "-" marks a cell with no value in the transcript.

Taps  Type   5 bit  8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
8     Symm   -      33     36      39      42      45      52      55
8     Non    -      46     54      59      64      69      77      85
16    Symm   53     61     69      71      76      81      96      102
16    Non    -      80     95      104     112     123     138     142
24    Symm   80     89     101     108     116     127     146     154
24    Non    -      101    114     127     140     153     174     187
32    Symm   93     107    118     126     137     148     175     182
32    Non    -      -      -       -       -       -       -       -
40    Symm   116    138    154     165     179     191     226     239
40    Non    -      -      -       -       -       -       -       -
48    Symm   -      158    173     187     202     217     246     261
64    Symm   -      197    215     233     250     268     305     323
80    Symm   -      236    257     278     299     320     364     385

Maximum sample rate (MHz), XC4000E-1:

Type   5 bit  8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
Symm   13.3   8.9    7.3     6.2     5.3     4.7     4.2     3.8
Non    16.0   10.0   8.0     6.7     5.7     5.0     4.4     4.0

Distributed RAM is More Efficient
For SDA FIR filters, build the time-skew buffer with distributed RAM, not flip-flops.
[Diagram: a 16 x 1 shift register built from one 16x1 RAM cell primitive takes 1 logic cell; the same 16 x 1 shift register built from flip-flops takes 16 logic cells.]
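The 16-to-1 saving on the slide can be turned into a rough sizing rule. This is a back-of-the-envelope sketch with stated assumptions, not a resource estimate from the tools: it assumes one flip-flop per logic cell, one 16x1 RAM primitive per logic cell (both as drawn on the slide), and a time-skew buffer made of one shift register of `data_bits` bits per tap. All names are hypothetical.

```python
import math

RAM_BITS_PER_CELL = 16  # one 16x1 RAM primitive per logic cell, as on the slide


def shift_register_cells(depth_bits: int, use_distributed_ram: bool) -> int:
    """Logic cells needed for a 1-bit-wide shift register of the given depth."""
    if depth_bits <= 0:
        raise ValueError("depth_bits must be positive")
    if use_distributed_ram:
        return math.ceil(depth_bits / RAM_BITS_PER_CELL)
    return depth_bits  # assumption: one flip-flop per logic cell


def time_skew_buffer_cells(taps: int, data_bits: int, use_distributed_ram: bool) -> int:
    """Assumed buffer layout: one shift register of data_bits bits for every tap."""
    return taps * shift_register_cells(data_bits, use_distributed_ram)


if __name__ == "__main__":
    for use_ram in (False, True):
        cells = time_skew_buffer_cells(taps=80, data_bits=10, use_distributed_ram=use_ram)
        print("distributed RAM" if use_ram else "flip-flops     ", cells, "logic cells")
```

Under these assumptions an 80-tap, 10-bit buffer drops from 800 logic cells to 80, which illustrates why the presentation treats distributed RAM as the enabling feature for large filters.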

Best Device Utilization Distributed RAM well suited to DSP
[Chart: SDA FIR filters, device size in logic cells (axis 400 to 1600), block RAM versus Xilinx distributed RAM, for four filters: 16 taps 8 bits, 16 taps 16 bits, 64 taps 9 bits, 64 taps 16 bits. Caption: Xilinx distributed RAM uses one third the area.]

Xilinx FPGAs implement DSP functions more efficiently than other FPGA architectures. Let’s look at a benchmark for serial distributed arithmetic FIR filters to highlight the advantages of Xilinx’ distributed RAM over block-RAM-based architectures. As you would expect, it takes more logic to build a bigger filter, but with Xilinx FPGAs you get a two to three times area saving for the same functions. This means you can use a cheaper device to implement the function.

Parallel Distributed Arithmetic FIR Filters
PDA FIR filter core: fully parallel implementation. All taps are processed in parallel (same as SDA), and all bits are processed in parallel. Up to 100 million samples per second; 2 billion MACs per 20-tap core.
[Diagram: PDA FIR core symbol with clock, input, output, and cascade pins: Data_IN, DATA_OUT, CK, Data_Out, Mid_Out, Mid_In, C_M_OUT, C_M_IN, C_D_OUT.]

The high data sample rate solution
PDA FIR filters are parameterized. Input data: 4 to 24 bits. Coefficients: 4 to 24 bits. Symmetric, non-symmetric, or negative symmetry. Output data: 2 to 31 bits. Taps: 2 to 20 per core. Automatically trims unused coefficient ROMs. Supports cascading multiple filter cores.
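The parameter ranges and the cascading rule lend themselves to a quick planning check. The sketch below is a hypothetical helper, not part of the CORE Generator: it validates a request against the ranges listed on the slide, estimates how many cascaded cores a longer filter needs (assuming cores are simply filled to 20 taps each), and reproduces the 2 billion MACs figure from 20 taps at 100 million samples per second.

```python
import math

# Ranges as listed on the slide.
DATA_BITS_RANGE = (4, 24)
COEFF_BITS_RANGE = (4, 24)
OUTPUT_BITS_RANGE = (2, 31)
TAPS_PER_CORE_RANGE = (2, 20)


def check_pda_parameters(data_bits: int, coeff_bits: int, output_bits: int, taps: int) -> list:
    """Return a list of human-readable problems; an empty list means the request fits."""
    checks = [
        ("input data bits", data_bits, DATA_BITS_RANGE),
        ("coefficient bits", coeff_bits, COEFF_BITS_RANGE),
        ("output data bits", output_bits, OUTPUT_BITS_RANGE),
        ("taps per core", taps, TAPS_PER_CORE_RANGE),
    ]
    return [f"{name} = {value} is outside {low}..{high}"
            for name, value, (low, high) in checks if not low <= value <= high]


def pda_cores_needed(total_taps: int) -> int:
    """Cascaded cores for a long filter, assuming each core is filled to 20 taps."""
    return math.ceil(total_taps / TAPS_PER_CORE_RANGE[1])


def pda_macs_per_second(taps: int, sample_rate_hz: float) -> float:
    """Fully parallel: every tap performs one MAC per sample."""
    return taps * sample_rate_hz


if __name__ == "__main__":
    print(check_pda_parameters(data_bits=10, coeff_bits=12, output_bits=18, taps=20))
    print(pda_cores_needed(48), "cores for 48 taps")
    print(f"{pda_macs_per_second(20, 100e6):.1e} MACs per second per 20-tap core")
```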

CORE Generator Software
SystemLINX: ability to call the CORE Generator from third-party tools.
AllianceCORE: data sheets.
LogiCORE: web mechanism to download new cores.

One line Documentation

CORE Generator Methodology
1. Select a core
2. Enter parameters
3. Generate the core

LogiCORE - SDA Filter Filter Design Package 160 CLB HOW ?

DSP CORE Generator Outputs
Example: 32-tap FIR filter. Inputs: FIR filter recipe and DSP CORE Generator parameters. Outputs: schematic symbol; VHDL or Verilog HDL instantiation code; simulation model; design netlist with constraints. Result: 20 rows by 9 columns, 160 CLBs used. Predictable performance regardless of the number of cores.

Predictable Size & Performance
Built for system performance, not benchmarks. Generated as an RPM (Relationally Placed Macro).

RPM macro-level and system-level advantages: predictable size; close proximity of communicating elements; alignment of critical paths; accessible I/O signals; improved density; rapid progress for automatic and manual design methods (1 macro, NOT 100’s of elements!); consistent performance anywhere on the die; very high packing density; adequate set-up times.

Filling a device with Xilinx cores does not reduce performance.

Performance Independent of core location
[Diagram: the same core installed in different locations, running at 80 MHz in each.]
Xilinx LogiCOREs deliver the same performance for any placement. FPGAs with non-segmented routing can’t do this.

Performance Independent of Device Utilization
[Diagram: four cores in one device, each running at 80 MHz.]
Xilinx performance is independent of the number of cores added. FPGAs with non-segmented routing can’t do this.

Best FPGA Performance Xilinx is more Predictable
[Chart: 12x12 area-efficient multiplier, speed in MHz (axis 40 to 80) versus number of instances (1, 2, 3, 4, 8), Xilinx segmented versus non-segmented. Caption: Segmented = more predictable and repeatable.]

Another benchmark, based on 12 x 12 multipliers, highlights Xilinx’ performance advantage over competitors’ FPGAs. As you add more instances of a core to a design based on a non-segmented architecture, the performance drops off at an unpredictable rate. If you do the same in Xilinx, the performance is essentially the same regardless of how many instances you add. This gives you higher predictability and good repeatability from one design iteration to the next. This is in part due to the segmented routing architecture, but the software also plays a part, and we’ll talk about this next.

Performance Independent of Device Size
[Diagram: the same core in three device sizes, running at 80 MHz in each.]
Same performance for a 4005 or a 4085. FPGAs with non-segmented routing can’t do this.

Design Flow
[Block diagram: complex demodulator. 20 MHz input; mixer with 4 multipliers; COS and SIN sources; 4K x 16 RAM; 48-tap FIR and 32-tap FIR low-pass filters; 4:1 decimation on the I and Q paths; 5 MHz output to the base-band processor.]
Generate each module. Use schematic or HDL at a system level.

Implementing the Mixer
This mixer supports sample rates in excess of 85 MHz. It even supports sample rates up to 45.6 MHz using the slowest Xilinx device (E-4).

Joining the Cores
Here VHDL is used to link the cores into a system. Schematic symbols may also be used. All component declaration and port map code is provided by Coregen.

    skip_value: skip_val
      -- The integrator for skipping through the Sine table with forcing constant
      port map (cb => skip_constant);

    skip_integrater: skip_int
      port map (b  => skip_constant,
                s  => skip_integrate,
                l  => GND,
                ce => VCC,
                c  => clk);

    form_sine_address: for i in 0 to 6 generate
      -- extract 7 bits required to address look-up table
      -- MSB is not used as this represents overflow.
      -- Lower bits are internal precision for integrator.
      skip_address(i) <= skip_integrate(i+10);
    end generate form_sine_address;

    sine_table: sine_lut  -- sine wave look-up table
      port map (theta  => skip_address,
                output => sine_wave,
                ctrl   => VCC,  -- select SINE output when high
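The VHDL above wires a phase integrator to a sine look-up table, and its behavior can be modeled in a few lines of Python. This is a behavioral sketch under assumptions, not the generated cores: the 7 address bits taken from integrator bits 10 to 16 come from the VHDL comments, while the 17-bit phase width, the full-cycle 128-entry table, and the amplitude of 127 are illustrative choices. The Python names reuse the VHDL signal names only for readability.

```python
import math

ADDRESS_BITS = 7    # from the VHDL: 7 bits address the look-up table
FRACTION_BITS = 10  # from the VHDL: address bit i is skip_integrate(i + 10)
PHASE_BITS = ADDRESS_BITS + FRACTION_BITS  # assumption: overflow beyond bit 16 is discarded


def make_sine_table(amplitude: int = 127) -> list:
    """Assumption: the table stores one full sine cycle in 2**ADDRESS_BITS entries."""
    size = 1 << ADDRESS_BITS
    return [round(amplitude * math.sin(2 * math.pi * i / size)) for i in range(size)]


def sine_generator(skip_constant: int, n_samples: int) -> list:
    """Integrate skip_constant every clock and look up the top 7 phase bits."""
    table = make_sine_table()
    phase_mask = (1 << PHASE_BITS) - 1
    skip_integrate = 0
    samples = []
    for _ in range(n_samples):
        skip_address = (skip_integrate >> FRACTION_BITS) & ((1 << ADDRESS_BITS) - 1)
        samples.append(table[skip_address])
        skip_integrate = (skip_integrate + skip_constant) & phase_mask
    return samples


if __name__ == "__main__":
    # A skip constant of 2**15 advances a quarter cycle per clock.
    print(sine_generator(skip_constant=1 << 15, n_samples=8))
```

In this model the output frequency is the clock frequency times skip_constant / 2**17, so the lower 10 bits of the integrator set the frequency resolution without enlarging the table.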

Power Dissipation Advantage Often the Limiting Factor In DSP
Xilinx advantage over competitive FPGAs: segmented routing is essential in DSP applications. Altera runs 3X hotter than Xilinx! Xilinx advantage over DSP processors: the TI 320C6 runs 2X hotter. Independent study by Stanford.

Segmented Interconnect Yields Lower Power
[Chart: power in watts (axis 5 to 10) versus clock frequency in MHz (axis 20 to 100) for non-segmented and Xilinx segmented routing, with the package thermal limits for ceramic and plastic packages marked. Caption: Segmented = lower power, faster operation.]

Power dissipation is an important factor in high-speed DSP applications. Xilinx has a significant advantage here over other FPGAs due to our use of segmented interconnect to implement routing inside the device. Every device package has a thermal limit. A ceramic package can handle a lot of power; plastic packages, less so. Due to higher power dissipation, a design implemented with a non-segmented routing architecture hits the wall sooner than the same design implemented in Xilinx devices, which use segmented routing. This means that Xilinx DSP can operate at higher clock frequencies for any given package, or will dissipate less power for a given application.

Where to find opportunities
Look for high-performance applications: multiple DSP processors; fixed-function DSP parts; gate array / custom DSP; data rates typically above 1 MHz; multiple channels required.
FIR filter core: 100 million samples per second.

DSP Applications
Image & Video Processing: medical imaging, copiers, cameras, security systems, video editors, inspection systems, fingerprint ID.
Communications: wireless comm, cellular / PCS, modems, satellite, cable, ADSL, telephone test.
Industrial, Military: motor control, numerical control, test equipment, vibration analysis, power supplies, radar, secure comm.

Where FPGA Solutions Fit
Audio: kHz sample rates, single channel; processors, fixed-point arithmetic.
RF, video, multiple channels: MHz sample rates; FPGAs, fixed-point arithmetic.
Processors: floating-point arithmetic.
FPGAs are ideal for high sample rates and computational intensity.
