Section I Introduction to Programmable Logic Devices


Programmable Logic Device Families
Logic
  Standard Logic
  ASIC
    Programmable Logic Devices (PLDs): SPLDs (PALs), CPLDs, FPGAs
    Gate Arrays
    Cell-Based ICs
    Full Custom ICs
Source: Dataquest
Acronyms
  SPLD = Simple Programmable Logic Device
  PAL = Programmable Array of Logic
  CPLD = Complex PLD
  FPGA = Field Programmable Gate Array
Common Resources
  Configurable Logic Blocks (CLB): memory look-up table, AND-OR planes, simple gates
  Input / Output Blocks (IOB): bidirectional, latches, inverters, pullup/pulldowns
  Interconnect or Routing: local, internal feedback, and global

CPLDs and FPGAs
                CPLD                                FPGA
Full name       Complex Programmable Logic Device   Field-Programmable Gate Array
Architecture    PAL/22V10-like                      Gate array-like
                More combinational                  More registers + RAM
Density         Low-to-medium                       Medium-to-high
                0.5-10K logic gates                 1K to 500K system gates
Performance     Predictable timing                  Application dependent
                Up to 200 MHz today                 Up to 135 MHz today
Interconnect    “Crossbar”                          Incremental
Not shown: Simple PLD (SPLD) architecture

PLD Industry Growth

Programmable Logic vs. Semi-Custom ASIC Market
Segment                        1996            2001
Mask Programmed Gate Arrays    $5.6B (59%)     $7.4B (47%)
Standard Logic                 $2.0B (21%)     $2.6B (16%)
Programmable Logic             $1.9B (20%)     $5.8B (37%)
Total market                   $9.5B           $15.8B
Source: Dataquest, May 1997

Foundation and Alliance Series
Who is Xilinx?
World’s leading innovator of complete programmable logic solutions
Inventor of the Field Programmable Gate Array
$600M annual revenues; 35+% annual growth
Fabless* semiconductor and software company
  UMC (Taiwan) (*Xilinx acquired an equity stake in UMC in 1996)
  Yamaha (Japan)
  Seiko Epson (Japan)
Products
  Programmable logic chips
  Foundation and Alliance Series design software

Xilinx vs. Competitors 1997 Calendar Year Revenues
$ Millions Source: Company reports & In-Stat. Includes SPLD, CPLD, FPGA revenues.

FPGA Market Share Q4 1997 Source: In-Stat Research, March 1998
Altera number includes both 8K and 10K families

Process & Density Leadership
[Chart: transistor count (millions; axis ticks 7.5, 25, 50, 75) by quarter from 4Q97 to 4Q98, 0.25u process]
XC40125XV: industry’s 1st 0.25u PLD, ~250K gates, 5 LM
XC40150XV
XC40250XV: ~500K gates
Virtex: 1 million gates

Xilinx Integrated Circuit Products
XC9500: Flash-based in-system programmable CPLDs
  Lowest price, best pin locking, K gates
XC4000: Industry’s largest & fastest FPGAs
  XC4000E: 0.5u, 5V, 5K - 40K gates
  XC4000EX: 0.5u, 5V, 45K - 60K gates
  XC4000XL: 0.35u, 3.3V devices, 5V compatible I/O, 3K - 180K gates
  XC4000XV: 0.25u, 2.5V / 3.3V, 5V compatible I/O, 250K - 500K gates
Spartan: 0.5u, 5V, low cost, 10K - 40K gates
Virtex: New FPGA architecture in 1998
  0.25u, 5LM, 250K-1M gates, Select & Block-RAM
XC6200: Reconfigurable Processing Unit
  Dynamically and partially reconfigurable
Low-cost solutions (Industry): XC3000 (no RAM), XC5200 (no RAM), HardWire
[Table columns on the slide: Core Class, Upper Level Class, Research; the per-family marks are not recoverable]
* Gates are in terms of system-level gates

XC9500 CPLDs
[Block diagram: JTAG port, JTAG controller, in-system programming controller, I/O blocks, FastCONNECT switch matrix, function blocks 1 to 4, 3 global clocks, 1 global set/reset, 2 or 4 global tri-states]
5 volt in-system programmable (ISP) CPLDs
5 ns pin-to-pin
36 to 288 macrocells (6400 gates)
Industry’s best pin-locking architecture
10,000 program/erase cycles
Complete IEEE JTAG capability

Xilinx XC4000 Architecture
High density -> 1M system gates
SRAM based
LUT for synchronous dual-port RAM or logic
ASIC-like array structure
Built-in tri-states
Infinite reconfigurations, downloaded from PC or workstation in ~1 second
[Diagram labels: Configurable Logic Blocks (CLBs), I/O Blocks (IOBs), Programmable Interconnect]

XC6200 Reconfigurable Processing Unit
1000x improvement in reconfiguration time from external memory
FastMAP(TM) assures high speed direct access to all internal registers
Microprocessor interface built in: “XC6200 is memory mapped to look like SRAM to a host processor”
All registers accessed via built-in low-skew FastMAP(TM) busses
High capacity distributed memory permits allocation of chip resources to logic or memory: 256 kbits in XC6264
Ultrafast partial reconfiguration (40 ns to 100’s of usec)
Up to 100,000 gates
[Diagram: CPU, memory, and I/O connected to the XC6200 RPU]

Exponential Growth in Density
[Chart: logic cells (1,000 to 1,000,000) and logic gates (12K to 12M) by year, 1994 to 2002]
1 logic cell = 4-input LUT + FF
Nov shipping world’s largest FPGA, XC40125XV (10,982 logic cells, K System Gates)
175,000 logic cells = 2.0M logic gates in 2001

Design Flow
1. Design Entry in schematic, ABEL, VHDL, and/or Verilog. Vendors include Synopsys, Aldec (Xilinx Foundation), Mentor, Cadence, Viewlogic, and 35 others.
2. Implementation includes Placement & Routing and bitstream generation using Xilinx’s M1 Technology. Also, analyze timing, view layout, and more.
3. Download directly to the Xilinx hardware device(s) with unlimited reconfigurations*.
*XC9500 has 10,000 write/erase cycles

Foundation Series Delivers Value & Ease of Use
Complete, ready-to-use software solution Simple, easy-to-use design environment Easy-to-learn schematic, state-diagram, ABEL, VHDL, & Verilog design Synopsys FPGA Express Integration*

The Xilinx Student Edition
Prentice Hall’s most requested new engineering product in Q1 ’98!
Complete, affordable, and practical digital design course environment for all students
Predeveloped and tested lab-based course
  Includes Foundation Series 1.3 for students’ computers
  Practical Xilinx Designer lab tutorial book
  Coupon for XS40-005XL and XS boards ($129)
Sold through bookstores by Prentice Hall and listed at $79 (ISBN )
Integrated tutorial projects cover: TTL, Boolean logic, state machines, memories, flip-flops, timing, 4-bit and 8-bit processors
Upgradeable for free to F1.4 Express with VHDL & Verilog, 40K gates, VHDL labs on the web

Section II Basic PLD Architecture

Section II Agenda Basic PLD Architecture
XC9500 and XC4000 Hardware Architectures Foundation and Alliance Series Software

Section II Basic PLD Architecture XC9500 and XC4000 Hardware Architectures


XC9500 - Architectural Features
Uniform, all pins fast, PAL-like architecture
FastCONNECT switch matrix provides 100% routing with 100% utilization
Flexible function block
  36 inputs with 18 outputs
  Expandable to 90 product terms per macrocell
  Product term and global three-state enables
  Product term and global clocks
  Product term and global set/reset signals
3.3V/5V I/O operation
Complete IEEE JTAG interface

XC9500 Function Block
Each function block is like a 36V18!
[Block diagram: 36 inputs from the FastCONNECT switch matrix feed the AND array; the product-term allocator feeds macrocells 1 to 18; macrocell outputs return to FastCONNECT and go to the I/O blocks; 3 global clocks and 2 or 4 global tri-state signals]

XC9500 Product Family
                9536   9572   95108  95144  95216  95288
Macrocells      36     72     108    144    216    288
Usable Gates    800    1600   2400   3200   4800   6400
tPD (ns)        5      7.5    7.5    7.5    10     10
Registers       36     72     108    144    216    288
Max I/O         34     72     108    133    166    192
Packages across the family: VQ44, PC44, PC84, TQ100, PQ100, PQ160, HQ208, BG352

XC4000 Architecture Programmable Interconnect I/O Blocks (IOBs)
Configurable Logic Blocks (CLBs)

XC4000E/X Configurable Logic Blocks
[CLB block diagram: F, G, and H function generators; inputs F1-F4, G1-G4, and C1-C4 (H1, DIN, S/R, EC); clock K; outputs X, XQ, Y, YQ]
2 four-input function generators (Look Up Tables)
  16x1 RAM or logic function
2 registers
  Each can be configured as flip-flop or latch
  Independent clock polarity
  Synchronous and asynchronous Set/Reset

Look Up Tables
Combinatorial logic is stored in 16x1 SRAM Look Up Tables (LUTs) in a CLB
Example: inputs A, B, C, D form a 4-bit address into the look-up table, which returns Z
A 4-input LUT can implement any of 2^(2^4) = 64K functions
Capacity is limited by number of inputs, not complexity
Choose to use each function generator as 4-input logic (LUT) or as high-speed synchronous dual-port RAM
[Diagram: G function generator with inputs G1-G4 and WE]
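The look-up table idea can be sketched in a few lines of Python. This is only a software model, not Xilinx code: the helper names `make_lut` and `read_lut` and the example function are hypothetical, chosen to show that any 4-input function costs the same 16 storage bits.

```python
from itertools import product

def make_lut(func, n_inputs=4):
    """Precompute the truth table of func, as configuration would load a LUT."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def read_lut(table, *bits):
    """Use the input bits as an address into the table (first input is the MSB)."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit
    return table[address]

# Hypothetical example function: Z = (A AND B) XOR (C OR D)
lut = make_lut(lambda a, b, c, d: (a & b) ^ (c | d))

num_entries = len(lut)            # 16 storage bits for 4 inputs
num_functions = 2 ** num_entries  # distinct functions one 4-input LUT can hold
```

A more complicated function of the same four inputs would fill the same 16 entries, which is the sense in which capacity depends on input count rather than complexity.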

XC4000X I/O Block Diagram Shaded areas are not included in XC4000E family.

Xilinx FPGA Routing
1) Fast Direct Interconnect: CLB to CLB
2) General Purpose Interconnect: uses switch matrix
3) Long Lines
  Segmented across chip
  Global clocks, lowest skew
  2 tri-states per CLB for busses
Other routing types in CPLDs and XC6200
[Diagram labels: CLB, Switch Matrix]

Other FPGA Resources
Tri-state buffers for busses (BUFTs)
Global clock & high speed buffers (BUFGs)
Wide decoders (DECODEx)
Internal oscillator (OSC4)
Global reset to all flip-flops, latches (STARTUP)
CLB special resources
  Fast carry logic built into CLBs
  Synchronous dual-port RAM
Boundary scan

What’s Really In that Chip?
Programmable Interconnect Points, PIPs (White) Switch Matrix Routed Wires (Blue) Direct Interconnect (Green) CLB (Red) Long Lines (Purple)

XC4000XL Family
Devices: 4005XL 4010XL 4013XL 4020XL 4028XL
Logic Cells: ,368 1,862 2,432
Typ Gate Range* (Logic + Select-RAM): K 7-20K 10-30K 13-40K 18-50K
Max. RAM bits (no Logic): 6K 13K 18K 25K 33K
I/O:
Initial Packages: PC84 PC84 PQ100 PQ100 PQ160 PQ160 PQ160 PQ160 PQ208 PQ208 PQ208 PQ208 HQ208 PQ240 PQ240 HQ240 BG256 BG256 BG256 BG352 BG352
Devices: 4036XL 4044XL 4052XL 4062XL 4085XL XV
Logic Cells: 3,078 3,800 4,598 5,472 7,448 10,982
Typ Gate Range*: K K K K K K
Max. RAM bits: 42K 51K 62K 74K 100K 158K
I/O:
Initial Packages: HQ208 HQ240 HQ240 HQ240 HQ240 BG352 BG432 BG432 BG432 BG432 PG411 PG411 PG411 PG475 PG559 PG559 BG560 BG560 BG560 BG560
* 20-25% of CLBs as RAM
* 25-30% of CLBs as RAM

HardWire(TM)
Unique no-risk 100% compatible mask-programmed cost reduction of Xilinx FPGA
Cost-effective for volume applications
  Savings of 40% to 70%
Architecture-equivalent mask-programmed version of any FPGA
Requires virtually no customer engineering resources, test vectors, or simulation
ALL FPGA features (e.g., Configuration, Power-On Reset, JTAG, etc.) are fully supported
[Diagram: FPGA to HardWire]

HardWire Methodology vs. Gate Array Conversion
[Diagram: typical gate array design phases (capture, verification, place and route, test development, prototypes) along the gate array redesign path, compared with the Xilinx HardWire methodology, in which the FPGA design’s .LCA file is converted to a production-ready physical database with Xilinx ATPG]

Cost Reduction & Density Increases
[Chart: cost versus density, 1996 to 1998, for XC5200, XC4000E, XC4000EX, XC4000XL, XC4000XV, Virtex Series, and HardWire; density axis 5,000 / 36,000 / 85,000 / 250,000 logic gates, equal to 0.4K / 3K / 7.5K / 20K logic cells; marked devices XC4036EX, XC4085XL, XC40250XV (500K system-level gates), and 1M gates*]
* Starting with Virtex, Xilinx numbering scheme reflects approximate Logic + RAM gates rather than Logic gates only.

CPLD or FPGA?
CPLD
  Non-volatile
  JTAG testing
  Wide fan-in
  Fast counters, state machines
  Combinational logic
  Small student projects, lower level courses
FPGA
  SRAM reconfiguration
  Excellent for computer architecture, DSP, registered designs
  ASIC-like design flow
  Great for first year to graduate work
  More common in schools
  PROM required for non-volatile operation

Section II Basic PLD Architecture Foundation and Alliance Series Software

Xilinx M1-Based Software
Foundation Series
  Complete, ready-to-use
  Includes schematic, simulation, VHDL and Verilog synthesis
ALLIANCE Series
  Libraries and interfaces for leading EDA vendors
Software backplane (both series)
  Core implementation software: map, place, route, bitstream generation, and analysis
Graphical User Interface is very similar to XACTStep v.6.0

Design Tools Standard CAE entry and verification tools
Xilinx Implementation software implements the design
  The design is optimized for best performance and minimal size
  Graphical User Interface and Command Line Interface
  Easy access to other Xilinx programs
  Manages and tracks design revisions
[Flow diagram, Foundation or Alliance: Design Entry (schematic, state machine, HDL code, LogiBLOX, CORE Gen) -> Functional Simulation -> Xilinx Design Implementation (M1 Design Manager) -> Verification (static timing analysis, in-circuit testing), with back annotation to the simulator]

Multi-Source Integration Mixed-Level Flows
HDL Schematic Enables multiple sources and multiple EDA vendors in the same flow Allows team development Reduces design source translations Design the way you are used to Enables rapid, accurate iterations Works well within existing ASIC flows Facilitates Design Reuse Existing Designs Cores Design Source Integration EDIF VHDL Verilog SDF Standards Based Check Point Verification Knowledge Driven Implementation

3rd Party Support & Libraries
Xilinx 3rd Party Design Entry & Simulation Support
  Synopsys, Cadence, Mentor Graphics, Aldec (Foundation), Viewlogic, Synplicity, OrCad, Model Technologies, Synario, Exemplar and others supply libraries & interfaces
Industry standard file formats
  VHDL, Verilog, and EDIF netlist formats
  SDF standard delay files
  VITAL library support
Xilinx Libraries
  Optimized components for use in any Xilinx FPGA or CPLD
  Wide range of functions
    Comparators, arithmetic functions, memory
    DSP and PCI interfaces
  Easy to use with ABEL, VHDL, Verilog, schematic entry

Libraries, Macros & Attributes
Libraries are common design sets for all design entry tools (e.g., text, schematic, Foundation, Synopsys, Viewlogic, etc.)
Library “interfaces” are specific to each front end
Attributes are library element properties
Online “Libraries Guide” has full listings and descriptions
Unified Libraries: Boolean functions, TTL, flip-flops, adders, RAM, small functions
LogiBlox Libraries: variable size blocks of adders, registers, RAM, ROM, etc.
  Properties defined as attributes

Core Design Technology Optimal Core Creation & Flexible Core Delivery
Data sheets Parameterizable Cores CoreLINX: Web Mechanism to Download New Cores SystemLINX: Third Party System Tools Directly Linked With Core Generator

Foundation Series Express Overview
Easy to use, yet powerful
Based on industry standards, not proprietary languages
Features:
  Schematic (partnership with Aldec)
  IEEE VHDL, Verilog, ABEL
  State Diagram Editor
  Interactive simulation
Exclusive partnership with Synopsys, the synthesis leader
[Logos: Synopsys, Aldec, Xilinx]

Foundation Project Manager
Integrates all tools into one environment

Schematic Entry

ABEL and VHDL Text Entry
From schematic menu (or via HDL Editor), select Hierarchy -> New Symbol Wizard… to create symbol.
Select HDL Editor & Language Assistant to learn by example, then define block.
Synthesize to EDIF.

State Machine Graphical Editor
Graphical editor synthesizes into ABEL or VHDL code

Simulation - Easy to Use and Learn
Generate stimulus easily and quickly Keyboard toggling Simple clock stimulus Custom formulas Easy debugging Waveform viewer Signals easily added and removed Simulator access from schematic Color-coded values on schematic Script Editor

Foundation Express 1.4 Features
Express Technology
  Optimizes the design for Xilinx architectures
  Optimized arithmetic functions
  Automatic global signal mapping
  Automatic I/O pad mapping
  Resource sharing
  Hierarchy control
Source code compatible with Synopsys Design Compiler and FPGA Compiler
Verilog (IEEE 1364) and VHDL (IEEE ) support
Easy, graphical constraint entry
F1.4 is stand-alone
F1.5: Sept / Oct ’98
  Integrated into Foundation Project Manager
  Replaces Metamor

Xilinx-Express Design Flow
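[Flow diagram: Foundation design entry tools (HDL Editor, State Diagram Editor, Schematic Capture, DSP COREGen & LogiBLOX Module Generator) produce VHDL (.VHD) and Verilog (.V) sources plus EDIF/XNF netlists; Express takes the HDL sources and timing requirements and writes an .XNF netlist and reports; the Xilinx Implementation Tools take the netlists (.XNF, EDIF, .NGO) and a .UCF file and produce BIT, JEDEC, and SDF outputs plus behavioral simulation models; HDL simulation and gate-level simulation verify the design; .VEI and .VHI files also appear in the flow]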

Express Input and Output
Input files may be VHDL or Verilog format
  Mixed Verilog/VHDL modules are accepted
Schematics may also be used, but should not be input into Express
  Schematic files in XNF or EDIF format will be merged into the design in Xilinx Design Manager
Output netlists are in XNF format
Timing specifications may be specified in Express
  Timing specifications are not used during synthesis
  Timing specifications can be included in the output netlist
[Diagram: VHDL, Verilog, and timing requirements enter Express, which produces .XNF and reports]

Express Design Process
1. Analyze: syntax check
2. Implement: create generic logic design (elaborate)
3. Enter constraints and options
4. Synthesize: optimize the design for specific device
5. Export XNF netlist
6. Implement layout with Xilinx Design Manager

Implementation - M1 Design Manager
Manages design data Access reports Supports CPLDs, FPGAs Flow Engine Timing Analyzer PROM File Formatter Hardware Debugger EPIC Design Editor

Terminology
Project
  Source file; has a defined working directory and family
Version
  A Xilinx netlist translation of the schematic
  Multiple versions result from iterative schematic changes
Revision
  An implementation of a Xilinx netlist
  Multiple revisions typically result from different options
Part type
  Specified at translation; can be changed in a new revision

Toolbox Programs
Flow Engine
  Controls start/stop points and custom options
Timing Analyzer
  Reports on net and path delays
PROM File Formatter
  Creates file to program configuration file into PROM
Hardware Debugger
  Downloads configuration file with XChecker, serial, or JTAG cable
EPIC Design Editor
  Device-level view of routing

Flow Engine
View status of tools
Control tool options
Implements design to the bitstream

Section III Advanced Hardware Design Techniques

Section III Agenda Advanced Hardware Design Techniques
General Hardware Information
Combinational Logic Design (Look Up Tables and Other Resources)
Synchronous Logic (Flip-Flops and Latches)
Memory Design (RAM and ROM)
Input / Output Design

Section III Advanced Hardware Design Techniques General Hardware Information

Resource Estimation
Find comparable functions in macro library and XAPP application notes
Or, use other designs to estimate device utilization
Or, quickly implement a design and view the MAP report file
  Select Utilities -> Report Browser -> Map Report
  IOBs, CLBs, Global Buffers, and other components listed separately
For unfinished designs
  Use save flags on unconnected nets, or
  Deselect “Trim Unconnected Logic” in Implementation Options

Performance Estimation
Use block delays as estimate of net delays
Use desired clock frequency to determine allowed CLB depth
Compare to functional requirements and modify design to meet performance needs
Example for 50 MHz clock frequency in XC4000XL-3:
  Clock period: 20 ns
  One level: 8 ns (tCO + tNET + tSU)
  Delay allowance: 12 ns (20 ns - 8 ns)
  Each added level: ~6 ns (tPD + tNET)
  Added levels of logic allowed: 2
[Diagram: CLB-to-CLB path annotated with tCO, tNET, tPD, tSU]
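The 50 MHz example can be reproduced with a small calculation. This is a sketch, not a timing tool: the function name is hypothetical, and the 8 ns first-level delay is inferred from the slide's 20 ns period and 12 ns delay allowance.

```python
def added_levels_allowed(clock_mhz, first_level_ns, added_level_ns):
    """Return how many extra logic levels fit in one clock period.

    first_level_ns covers tCO + tNET + tSU for the first CLB level;
    added_level_ns covers tPD + tNET for each further level.
    """
    period_ns = 1000.0 / clock_mhz
    allowance_ns = period_ns - first_level_ns
    if allowance_ns < 0:
        return 0  # even a single level does not fit in the period
    return int(allowance_ns // added_level_ns)

# Slide example: 50 MHz in XC4000XL-3, 8 ns first level (inferred), 6 ns per added level
levels = added_levels_allowed(50, 8, 6)
```

With these figures the allowance is 12 ns, so two added levels fit, matching the slide.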

Power Consumption
Xilinx FPGAs have flexible routing
  Power consumption can be half that of FPGAs with less flexible routing channels
Power = kCV^2F
  How many nodes change state (hard to estimate)
  Capacitive loading on CLB and IOB outputs (known)
Power consumption is not a concern in regular course labs
Power estimation methods
  See application notes under
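The formula shows why supply voltage matters most: power scales with the square of V. The sketch below treats k as a dimensionless activity factor, which is an assumption, and the capacitance and frequency values are placeholders chosen only for illustration.

```python
def dynamic_power(k, c_farads, v_volts, f_hertz):
    """Evaluate Power = k * C * V^2 * F."""
    return k * c_farads * v_volts ** 2 * f_hertz

# Placeholder values (not from any data sheet): same k, C, and F at two supply voltages
p_5v = dynamic_power(1.0, 1e-12, 5.0, 50e6)
p_3v3 = dynamic_power(1.0, 1e-12, 3.3, 50e6)

ratio = p_3v3 / p_5v  # depends only on (3.3 / 5.0) ** 2
```

Because k, C, and F cancel in the ratio, moving the same design from 5 V to 3.3 V cuts dynamic power to a bit under half, whatever the placeholder values are.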

XC4000XL 3.3 V, 0.35u, 5 Volt Compatible
[Diagram: any 5 V device on a 5 V supply connected to an XC4000XL FPGA (0.35u, 3.3 V logic, 3.3 V I/O) on a 3.3 V supply; 5 V tolerant inputs; outputs meet TTL levels]
Accepts 5 Volt inputs
Drives standard TTL levels
Totally compatible in 5 Volt environment
0.25u XV family is also 5 Volt TTL compatible when used with 3.3 Volt I/O supply, 2.5 Volt core supply

XC4000XV & Virtex 2.5 V, 0.25u, 5 Volt Compatible
Devices with 5V, 3.3V, and 2.5V power supplies can be interfaced

Section III Advanced Hardware Design Techniques Combinational Logic Design (Look Up Tables and Other Resources)

XC4000X Configurable Logic Blocks
[CLB block diagram: F, G, and H function generators; inputs F1-F4, G1-G4, and C1-C4 (H1, DIN, S/R, EC); clock K; outputs X, XQ, Y, YQ]
G, F, H function generators
2 flip-flops
  Individual clock polarity
  Synchronous and asynchronous Set/Reset
Delay from F1 to Y in the XC4000X-1 is ~1 ns


16-bit Adder Examples
Many choices for implementing an adder
Speed vs. density trade-off controlled by user and PLD features
Family      Type          CLBs   Levels    AppLINX
XC3000A     Bit-Serial    16     16        XAPP 022
XC3000A     Parallel      24     8         XAPP 022
XC3000A     Lookahead     30     6         XAPP 022
XC3000A     Conditional   41     3         XAPP 022
XC4000E-3   Carry         8      10.1 ns   XAPP 018
XC5200-5    Carry         8      20 ns     5200 DataSheet

Arithmetic Functions
Arithmetic macros are optimized for density and speed with dedicated carry logic in CLBs
  Example: each CLB can form a two-bit full adder
Carry logic components have vertical orientation
  Needed for speed and utilization
  Known as RPM or “Relationally Placed Macro”
Examples:
  ADDx adders
  ADSUx adder/subtractors
  CCx counters
  COMPMCx magnitude comparators
[Symbol: ADD4 with inputs A<3:0> and B<3:0> and outputs Z<3:0>]
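The "two bits per CLB" figure explains why a 16-bit carry-logic adder is listed at 8 CLBs in the adder table. The following is a behavioral sketch only: the function names are hypothetical, and it counts data slices without modeling any extra CLB a real macro may use for carry-in or carry-out handling.

```python
def two_bit_adder(a, b, carry_in):
    """One CLB's worth of arithmetic: add two 2-bit values plus a carry."""
    total = (a & 0b11) + (b & 0b11) + (carry_in & 1)
    return total & 0b11, total >> 2  # (2-bit sum, carry out)

def add_with_clbs(a, b, width=16):
    """Chain 2-bit slices through the carry path and count the slices used."""
    carry, result, clbs = 0, 0, 0
    for shift in range(0, width, 2):
        slice_sum, carry = two_bit_adder(a >> shift, b >> shift, carry)
        result |= slice_sum << shift
        clbs += 1
    return result, carry, clbs

total, carry_out, clbs_used = add_with_clbs(1234, 4321)
```

For a 16-bit width the loop runs eight times, so `clbs_used` is 8 regardless of the operands.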

Three-State Buffers
Each CLB is associated with two three-state buffers (BUFT)
BUFTs are used independently of LUTs and flip-flops
Three-state library components:
  Three-state buffers: BUFT, BUFT4, BUFT8, BUFT16
  Wired AND (open drain): WAND1, WAND4, WAND8, WAND16
  Two-input OR driving wired AND: WOR2AND
Delay varies per family
  3.7 ns in the XC4005XL (-1)
  13.6 ns in the XC4085XL (-1)

Use BUFT for Buses
Use to multiplex signals onto long routing lines to use as buses
[Diagram: A3-A0 and B3-B0 drive BUS<3> to BUS<0> through BUFTs controlled by _ENABLE_A and _ENABLE_B]
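A behavioral model makes the bus rule concrete: at most one source may drive a line at a time. The sketch below assumes an active-low enable (suggested by the underscore-prefixed `_ENABLE_A` and `_ENABLE_B` names), represents high impedance as `None`, and uses hypothetical helper names.

```python
def buft(data, t):
    """Three-state buffer: drive data when t is 0, otherwise float (None)."""
    return data if t == 0 else None

def resolve_bus(drivers):
    """Resolve one bus line from the outputs of all buffers attached to it."""
    active = [value for value in drivers if value is not None]
    if not active:
        return None  # nothing drives the line
    if len(set(active)) > 1:
        raise ValueError("bus contention: drivers disagree")
    return active[0]

# Source A enabled, source B disabled: the bus carries A's bit
bus_bit = resolve_bus([buft(1, 0), buft(0, 1)])
```

Enabling both sources with different data raises `ValueError`, which is the condition the enable logic in a real design has to rule out.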

BUFTs for Multiplexers
BUFTs can be used to build large MUXes
  Large MUXes composed of LUTs need multiple levels of logic
  Large MUXes composed of BUFTs have only one level of logic
  CLB resources are not used
  Use of BUFTs constrains placement
Multiplexer macros use lookup tables
  Example: M4_1E
Create BUFT macros from three-state buffer components
  BUFT, BUFT4, BUFT8, BUFT16

Wide Decoders
The Wide Decoder is a dedicated wired-AND
  Useful for address decoding
IOBs or CLBs can drive the Wide Decoder
Located along the periphery of the die
  All IOB drivers must be on same edge as the decoder
  Four decoder lines per edge
Use DECODE macro
  DECODE4/8/16/24
  Must use a PULLUP primitive
[Symbol: DECODE8 with inputs A0-A7, output O, and a PULLUP on the output]
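The wired-AND behavior, and the reason a PULLUP is required, can be modeled as a line that rests high and is pulled low by any driver. This is a sketch with hypothetical function names; the per-bit match logic stands in for the inverters a design would place ahead of the decoder inputs.

```python
def wired_and(inputs):
    """Pulled-up line: stays 1 unless at least one driver pulls it to 0."""
    line = 1  # level supplied by the pull-up when no driver pulls low
    for bit in inputs:
        if bit == 0:
            line = 0
    return line

def decode_address(address, match, width=8):
    """Assert the output only when every address bit equals the match pattern."""
    bits = [
        1 if ((address >> i) & 1) == ((match >> i) & 1) else 0
        for i in range(width)
    ]
    return wired_and(bits)

hit = decode_address(0xA5, 0xA5)
miss = decode_address(0xA4, 0xA5)
```

All eight inputs are evaluated on one line, which is why the decoder adds no extra logic levels as the address gets wider.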

CLB Mapping Control in Schematic
Allows user to force mapping of logic from schematic into a single CLB
XC3000
  CLBMap can specify entire CLB
XC4000/XC5000
  FMap specifies a function generator in a CLB
  HMap specifies an XC4000 H function generator in a CLB
[Symbol: FMAP with inputs I1-I4 connected to nets A0, B0, A2, B2 and output O connected to net C0]

Section III Advanced Hardware Design Techniques Synchronous Logic (Flip-Flops and Latches)

CLB Registers
Each register can be configured as a flip-flop or latch
Independent clock polarity
Asynchronous set or reset
Clock enable
Direct input from CLB input (connections bypass LUTs)
FPGAs are rich in flip-flops. The smallest 4000XL (XC4005XL) has 486 flip-flops. The largest (XC4085XL) has 7,448 flip-flops.
[Diagram: register with D input selected from DIN, F, G, or H; clock K; clock enable EC; S/R control driving SET and RESET; outputs QX and QY]

Library offerings
“Unified” library contains many standard functions
  Pre-defined size and functionality
LogiBLOX templates are available
  Can be customized for bus size and function
Types of LogiBLOX register functions
  Shift registers: left/right, arithmetic, logical, circular
  Clock dividers: output duty cycle
  Counters: LFSR, binary, one-hot, carry logic
  Accumulators
Xilinx CORE Generator recommended for very complex functions (DSP, FFT, UARTs, multipliers...)

Naming Conventions
Flip-flop example: FDPE_1
  FD: flip-flop; type is D-Type (D), JK-Type (JK), or Toggle-Type (T)
  P: Asynchronous Preset (P), Asynchronous Clear (C), Synchronous Set (S), or Synchronous Reset (R)
  E: Clock Enable
  _1: Inverted Clock
Latch example: LDCE_1
  LD: Transparent D Latch
  C: Asynchronous Preset (P) or Asynchronous Clear (C)
  E: Gate Enable
  _1: Inverted Gate
Sized example: FD16RE
  FD: Flip-Flop, D Type
  16: Size
  R: Synchronous Reset
  E: Clock Enable

Counters
Libraries support a wide variety of fast and efficient counters
Counters offer trade-offs between speed, density, and complexity
Example: LogiBlox counter styles
  Binary: predictable outputs, uses carry logic
  Johnson: fastest practical counter, but uses more flip-flops; glitch-free decoding
  LFSR: fast & dense, but pseudo-random outputs
  One-Hot: useful for generating series of enables
  Carry Chain: high speed and density
The LogiBlox synthesizer will automatically pick the best implementation based on your design, or you can force an implementation with the STYLE parameter (schematic).
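To see what "pseudo-random outputs" means for the LFSR style, here is a small behavioral model. The tap choice (x^4 + x^3 + 1, a standard maximal-length polynomial) and the function names are illustrative assumptions; this is not the logic LogiBlox generates.

```python
def lfsr4_step(state):
    """Advance a 4-bit LFSR: shift left and feed back bit3 XOR bit2."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0xF

def lfsr_period(seed=1):
    """Count steps until the register returns to its seed value."""
    state, count = lfsr4_step(seed), 1
    while state != seed:
        state = lfsr4_step(state)
        count += 1
    return count

sequence = [1]
for _ in range(5):
    sequence.append(lfsr4_step(sequence[-1]))
```

The register needs only one XOR gate ahead of a shift register, which is why it is fast and dense, but the states appear in a scrambled order and the all-zero state is never reached.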

16 Bit Counter Examples The following are implemented in XC4000XL-3
Macro CLBs Clock CB16CLE/D ns CC16CLED ns CC16CLE ns X-BLOX: LFSR ns Simpler functions are faster and smaller Carry Logic Counters are generally faster (depends on size)

Global Clock Buffers
Clock buffers are low-skew, high-drive buffers
  Also known as global buffers
  Drive low-skew, high-speed long line resources
  Drive all flip-flops and latches in FPGA
  Can also be used for high-fanout signals
Additional clocks and high-fanout signals can be routed on long lines
Instantiation: if the BUFG component is instantiated, software will select one of these buffers based on the design
Synthesis: clocks are identified by different means depending on vendor
  Example: Synopsys FPGA Compiler connects clock buffers to all fan-in of clock pins
  Control clock buffer insertion with separate commands
  Consult synthesis interface guide or vendor

Global Buffer Types BUFGLS is used by default in the Xilinx software if a BUFG component is specified in the design

Generating Clock On-Chip
Internal configuration clock available after configuration
Use OSC4 primitive
  Nominal values (approximately): 8 MHz, 500 kHz, 16 kHz, 490 Hz, 15 Hz
  Very limited accuracy (+/- 50%)
[Symbol: OSC4 with outputs F8M, F500K, F16K, F490, F15; one output drives a BUFGS]

Global Reset
All flip-flops are initialized during power up via Global Set/Reset network
You can access Global Set/Reset network by instantiating the STARTUP primitive
  Assert GSR for global set or reset
GSR is automatically connected to all CLB flip-flops using dedicated routing resources
  Saves general use routing resources for your design
  DO NOT CONNECT GSR to set/reset inputs on flip-flops
Any signal can source the global set/reset, but the source must be defined in the design
Use Global Reset as much as possible
  Limit the number of flip-flops with an asynchronous reset
  Extra routing resources are used
[Symbol: STARTUP with inputs GR/GSR, GTS, CLK and outputs Q1, Q2, Q3, Q4, DoneIn]

Avoid Gated-Clock or Asynch. Reset
Move gating from clock pin to prevent glitch from affecting logic.
Poor design: the terminal count (TC) of a binary counter (Q0, Q1, Q2) clocks the flip-flop directly
  TC and Q may glitch during the transition of Q<0:2> from 011 to 100
Improved design: decode one count earlier (Carry-1), register it, and use it as the clock enable (CE) of a flip-flop on the common clock CK
  TC will not glitch during the transition of Q<0:2> from 011 to 100
Or use MUXed data when using only 1-2 logic inputs

Shift Registers are Fast & Dense
The CLB can handle two bits of a shift register Fast and dense independent of size Fast connections between adjacent lookup tables D Q Left/Right Qi Qi+1 Qi-1 Qi+2 EC

Prescale Non-Loadable Counters
Counter speed is determined by the carry delay from LSB to MSB Non-loadable counters can use prescaling Pre-scaling restricts load timing Fast Small Counter Large Dense Counter with Slower Carry TC CE

Use One-Hot Encoding for State Machines
Shift register is always fast and dense
“One-hot” uses one flip-flop for each count
  Useful for state machine encoding in FPGAs
Another alternative is a Johnson counter
  Inverted output of last stage drives input of first stage
  Doubles the number of states versus one-hot
Binary encoding is best for CPLDs
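The state counts of the two shift-register styles can be checked with a short model. Both step functions and the helper `count_states` are hypothetical names; the Johnson step feeds the inverted last stage into the first stage, as the slide describes.

```python
def one_hot_step(state, n):
    """Rotate a single 1 through n flip-flops."""
    last = (state >> (n - 1)) & 1
    return ((state << 1) | last) & ((1 << n) - 1)

def johnson_step(state, n):
    """Shift left, feeding the inverted last stage into the first stage."""
    last = (state >> (n - 1)) & 1
    return ((state << 1) | (last ^ 1)) & ((1 << n) - 1)

def count_states(step, start, n):
    """Follow the sequence from start until a state repeats."""
    seen, state = [], start
    while state not in seen:
        seen.append(state)
        state = step(state, n)
    return len(seen)

one_hot_states = count_states(one_hot_step, 0b0001, 4)
johnson_states = count_states(johnson_step, 0b0000, 4)
```

With four flip-flops the one-hot ring visits four states and the Johnson counter visits eight, which is the doubling mentioned above.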

State Machine Design Tips
Split complex states
Need to minimize number of inputs, not number of flip-flops, in FPGAs
Use one-hot encoding for medium-size state machines (~8-16 states)
Complex states may be improved by breaking up into additional simpler states
[Diagram: State A, with several conditions leading to State B, is split into State A1 and State A2]

Use binary sequence only if necessary
CLB can generate any sequence desired at same speed
Use pre-scaling on non-loadable counters to increase speed
  LSBs toggle quickly
  See Application Notes XAPP001 and XAPP014
Use Gray code counters if decoding outputs
  One bit changes per transition
Consider Linear Feedback Shift Register for speed when terminal count is all that is needed
  Or when any regular sequence is acceptable (e.g., FIFO)
[Diagrams: fast small counter whose TC drives CE of a large dense counter with slower carry; 10-bit SR with outputs Q0, Q6, Q9]
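The "one bit changes per transition" property of Gray code is easy to verify with the standard binary-to-Gray conversion. This is a generic illustration in Python with hypothetical helper names, not a counter implementation.

```python
def to_gray(n):
    """Convert a binary count to its reflected Gray code."""
    return n ^ (n >> 1)

def bits_changed(a, b):
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# 4-bit Gray sequence, and the bit changes between neighbors (including wrap-around)
sequence = [to_gray(i) for i in range(16)]
changes = [bits_changed(sequence[i], sequence[(i + 1) % 16]) for i in range(16)]

# For comparison: plain binary 0111 -> 1000 flips all four bits
binary_worst_case = bits_changed(0b0111, 0b1000)
```

Since only one output changes at a time, a decoder watching the Gray sequence never sees an intermediate code, whereas a binary counter can momentarily present several wrong values during a multi-bit transition.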

Pipeline for Speed
Register-rich FPGAs encourage pipelining
Pipelining improves speed
  Consider wherever latency is not an issue
  Use for terminal counts, carry lookahead, etc.
How to estimate the clock period
  2 x (number of combinatorial levels) x (speed grade)
  XC4000XL-3: 3 levels x 2 x 3 ns = 18 ns clock period
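The rule of thumb above is simple enough to tabulate. The sketch below uses a hypothetical function name and treats the speed grade as a delay in nanoseconds, as the slide's example does; it shows how cutting the levels between registers shortens the estimated period.

```python
def estimated_clock_period_ns(levels, speed_grade_ns):
    """Rule of thumb: 2 x (combinatorial levels) x (speed grade)."""
    return 2 * levels * speed_grade_ns

# Slide example: 3 levels in an XC4000XL-3
unpipelined_ns = estimated_clock_period_ns(3, 3)

# Same logic with a register after every level (one level per pipeline stage)
pipelined_ns = estimated_clock_period_ns(1, 3)
```

The estimate drops from 18 ns to 6 ns per stage, at the cost of two extra clock cycles of latency.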

Section III Advanced Hardware Design Techniques Memory Design (RAM and ROM)

ROM is Equivalent to Logic
Using ROM simply defines logic functions in a look-up table format
  Memory might be an easier way to define logic
  Xilinx provides ROM library cells
FPGA lookup tables are essentially blocks of RAM
  Data is written during configuration
  Data is read after configuration
  Effectively operate as a ROM
Example: O = I1*I2
  As gates: AND gate with inputs I1, I2 and output O
  As ROM: address A0, A1 with DATA(0)=0, DATA(1)=0, DATA(2)=0, DATA(3)=1, output DOUT

RAM Provides 16X the Storage of Flip-Flops
32 bits versus 2 bits of storage
Two 16x1 RAMs or one 32x1 single-port RAM fit in one CLB
One 16x1 dual-port RAM fits in one CLB
32x8 shift register with RAM = 11 CLBs
  Using flip-flops, takes 128 CLBs for data alone
  Address decoders not included
[Diagram: CLB as a 32-bit RAM with address A0-A4, WE, and output O1, versus a CLB holding 2 bits in flip-flops with D1, D2, CLK and outputs Q1, Q2]
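The CLB counts on this slide follow from bits per CLB. The sketch below uses a hypothetical helper and counts data storage only; the remaining CLBs in the slide's 11-CLB RAM figure presumably hold addressing logic, which is an assumption and is not modeled here.

```python
import math

def clbs_for_storage(words, width, bits_per_clb):
    """Return the number of CLBs needed to hold words x width bits of data."""
    return math.ceil(words * width / bits_per_clb)

# 32x8 storage built from flip-flops (2 bits per CLB) versus RAM (32 bits per CLB)
flip_flop_clbs = clbs_for_storage(32, 8, 2)
ram_clbs = clbs_for_storage(32, 8, 32)

density_gain = flip_flop_clbs // ram_clbs
```

The 16x ratio in the slide title is just 32 bits divided by 2 bits per CLB.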

RAM Types
Synchronous RAM (SYNC_RAM)
  Synchronous write operation
Synchronous Dual-Port (DP_RAM)
  Can read & write to different addresses simultaneously
[Symbols: single-port RAM with Data, Write Enable, Write Clock, Address, and Output; dual-port RAM with Data, Write Enable, Write Clock, Write Address / Single-Port Read Address, Dual-Port Read Address, and outputs SP Output and DP Output]

RAM Guidelines
Less than 32 words is best
  32x1 or 16x2 per RAM requires only one CLB
  Delays are short (one level of logic)
Data and output MUXes are required to expand depth
  Less than 256 words recommended per RAM
  Use external memory for 256 words or more
Width easily expanded
  Connect the address lines to multiple blocks
Recommendation: use less than 1/2 of max memory resources
  Maximum memory uses all logic resources of CLBs

Memory Use
Most synthesis tools can synthesize ROM from behavioral HDL code, but RAMs must be instantiated
Use library primitives and macros for standard size memory
  RAM/ROM16X1S to 32X8S
  Use S suffix for synchronous RAM
  Use D suffix for dual-port RAM
Use LogiBlox to generate arbitrary size memories
[Symbol: RAM32X1S with inputs D, WE, A0-A4 and output O]

How to Generate Memory
Use LogiBlox utility to create arbitrary size RAM or ROM
  Select type: ROM, synchronous, asynchronous, or dual-port RAM
  Specify depth: number of words must be a multiple of 16, ranging from 16 to 256 words
  Specify width: word size ranges from 1 to 64 bits
  Specify initialization values with attribute file
LogiBLOX also creates RAM interface
  Entity and component declaration: cut and paste into the design (VHDL designs)
  Module declaration (Verilog designs)
  Symbol graphic (schematic entry designs)
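The depth and width limits quoted above are easy to check before opening the tool. The function below is a hypothetical pre-check written for illustration; it is not part of LogiBLOX and only encodes the two rules stated on this slide.

```python
def check_memory_size(depth, width):
    """Return a list of problems with a requested memory size (empty if valid)."""
    errors = []
    if depth % 16 != 0 or not 16 <= depth <= 256:
        errors.append("depth must be a multiple of 16 between 16 and 256 words")
    if not 1 <= width <= 64:
        errors.append("width must be between 1 and 64 bits")
    return errors

ok = check_memory_size(32, 8)        # valid request
bad_depth = check_memory_size(40, 8)  # 40 is not a multiple of 16
bad_both = check_memory_size(512, 65)
```

A request that fails the depth check can usually be rounded up to the next multiple of 16, leaving the extra words unused.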

Memory Generator Dialog
Specify memory type, size, name and function in the LogiBLOX GUI Instance Name example LogiBLOX function Memory Function Data file for initialization

Section III Advanced Hardware Design Techniques Input / Output Design

XC4000X IOB Block Diagram Shaded areas are not included in XC4000E family.

How to specify IO blocks - Schematic
User explicitly defines what resources in the IOB are to be used
I/Os are defined with
  1 pad primitive
  At least 1 function primitive: buffer, flip-flop, or latch
  1 input element, 1 output element, or both
Inverters may also be pulled into IOBs
IOBs are named by net between pad and function primitives
[Examples: IPAD -> net IN1_PAD -> IBUF forms IOB IN1_PAD; IPAD -> net IN2_PAD -> ILD forms IOB IN2_PAD]

Primary and Secondary Global Buffers
Eight global buffers per FPGA Four primary (BUFGP), Four secondary (BUFGS) Primary buffers must be driven by a semi-dedicated IOB Secondary buffers can be driven by a semi-dedicated IOB or internal logic and have more routing flexibility Use BUFGS if extra 1-2ns of delay is acceptable Use generic BUFG primitive in your design Allows software to choose best type of buffer Allows easy migration across families IPAD BUFG D

I/O Logic 4000E families have no boolean logic other than inverters in the IOBs XC4000EX adds optional output logic Can be used as a generic two-input function generator or MUX One input can be driven by IOB output clock signal Driving from FastCLK buffer provides less than 6 ns pin-to-pin delay Requires library components beginning with “O” IPAD F OPAD BUFFCLK FROM INTERNAL LOGIC FAST OAND2

Use Pull-ups/Pull-downs to Prevent Floating
Unused IOBs: Outputs of unused IOBs are automatically disabled Pull-ups are automatically connected on unused IOBs Used IOBs: A PULLUP or PULLDOWN primitive can be connected to used IOBs Inputs should not be left floating Add a pull-up to design inputs that may be left floating to reduce power and noise

Output Three-State Control
Output enable may be inverted Use OBUFE macro for active-high enable Use OBUFT primitive for active-low enable Three-state control also via a dedicated global net Controlled by same STARTUP primitive All I/O disabled during configuration OE OBUFE T OBUFT STARTUP GTS

Fast Capture Latch Additional latch on input driven by output’s clock signal Allows capture of input by very fast clock Followed by standard I/O storage element for synchronization to internal logic Very fast setup (6.8 ns for 4000EX-3), 0 ns hold Available on 4000X, not 4000E family Example: ILDFFDX macro includes Fast Capture Latch and IFDX Connect BUFGE to fast capture latch Opposite edge of same clock via BUFGLS drives IFDX [Diagram: IPAD data and clock inputs, BUFGE and BUFGLS clock buffers, ILDFFDX macro with D, GF, CE, and Q pins driving internal logic]

Decrease Setup Time with NODELAY
NODELAY attribute Removes the delay element to the IFD or ILD Decreases setup time, but creates a hold-time requirement Available on IFD/ILD macros in XC5200 and XC4000E/X families [Diagram: pad, input buffer, delay element, and IOB storage element (D, Q), with the external clock reaching the IOB through routing]

Output MUX OMUX2 Fast output signal (from output clock pin) MUXes IOB output or clock enable pins to pad Effectively doubles the number of device outputs without requiring a larger, more expensive package Pin-to-pin delay is less than 6 ns D0 D1 S0 O OMUX2 OPAD

Slew Rate Control Slew rate controls output speed Two slew rates
Default slow slew rate reduces noise Use fast slew rate wherever speed is important FAST Slew rates are approximately 2x faster than SLOW slew rates Slew rate specification Instantiation: in the user constraint file: INST $1I87/obuf SLOW; Synthesis: vendor dependent Output drive varies by family 4KEX/XL families have 12 mA drive OPAD OBUF FAST

Choose TTL or CMOS Thresholds
Threshold is selected during configuration Default is TTL Global selection on inputs or outputs Change to CMOS in Configuration Template 3V devices need TTL threshold when interfacing to 5V devices

Section IV Advanced Software Design with Xilinx M1-Based Software

Section IV Agenda Design Entry Tips Library Types
FPGA Express for VHDL & Verilog M1-Based Software Flow Implementation Options Design Verification PLD Configuration Settings Design Constraints

Section IV Advanced Software Design with Xilinx M1-Based Software Design Entry Tips

Design Entry Tip - Label Nets
Label as many nets as possible Net names are passed to report files Eases debugging Names may change due to hierarchy or optimization An IOB is named by the net between the pad and I/O function primitives A CLB is named by the net on the output Flip-flops are always outputs [Diagram: net IN1 names IOB IN1; flip-flop output net Q2 names CLB Q2]

Use Legal and Readable Names
Allowable characters Alphanumeric: A - Z, a - z, 0 - 9 Underline _, Dash - Reserved characters Angle brackets for buses <> Slash / for hierarchy Dollar sign $ for reference designators Names must contain at least one non-digit Avoid using names that correspond to device resources CLB row/column locations: AA, AB, etc. IOB pin locations: P1, P2, etc.
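The naming rules above lend themselves to a small lint check. The Python sketch below is an illustration, not a Xilinx utility: `check_net_name` is a hypothetical helper, and the patterns for resource-like names (two capital letters for CLB locations, `P` plus digits for pins) are assumptions based only on the examples given on this slide.

```python
import re

# Allowable characters: alphanumerics, underline, dash
LEGAL = re.compile(r"[A-Za-z0-9_-]+")
# Assumed shapes of device-resource names, from the examples above
CLB_LOCATION = re.compile(r"[A-Z]{2}")   # e.g. AA, AB
IOB_PIN = re.compile(r"P[0-9]+")         # e.g. P1, P2

def check_net_name(name):
    """Return a list of rule violations for a net name (empty if none)."""
    problems = []
    if not LEGAL.fullmatch(name):
        problems.append("contains characters other than A-Z, a-z, 0-9, _ or -")
    if re.fullmatch(r"[0-9]+", name):
        problems.append("must contain at least one non-digit")
    if CLB_LOCATION.fullmatch(name) or IOB_PIN.fullmatch(name):
        problems.append("looks like a device resource location")
    return problems

for name in ["DATA_IN", "123", "AA", "P12", "BUS<3>"]:
    print(name, check_net_name(name) or "OK")
```

Names using the reserved characters (`<>`, `/`, `$`) are reported as illegal here, since those characters are meant for buses, hierarchy, and reference designators rather than ordinary net names.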

Component Naming Conventions
Common component names, pin names and functions for all families Basic format is <function><width><control_inputs> CB4CLE = Counter, Binary, 4 bits, Clear, Load, Enable FD16RE = Flip-flops, D-type, 16 bits, Reset, Enable Control inputs are referenced by a single letter C = asynchronous Clear, R = synchronous Reset Listed in order of precedence
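The `<function><width><control_inputs>` format can be decoded mechanically. The Python sketch below covers only the function codes and control letters spelled out on this slide; the lookup tables are deliberately incomplete and `decode_component` is a hypothetical name, so other library components will be rejected rather than guessed at.

```python
import re

# Only the codes explained in the examples above
FUNCTIONS = {"CB": "Counter, Binary", "FD": "Flip-flop, D-type"}
CONTROLS = {
    "C": "asynchronous Clear",
    "R": "synchronous Reset",
    "L": "Load",
    "E": "Enable",
}

def decode_component(name):
    """Split a name such as CB4CLE into function, width, and control inputs."""
    match = re.fullmatch(r"([A-Z]+)([0-9]+)([A-Z]*)", name)
    if match is None:
        raise ValueError(f"{name!r} does not follow <function><width><controls>")
    function, width, controls = match.groups()
    if function not in FUNCTIONS:
        raise ValueError(f"function code {function!r} is not in this small table")
    unknown = [c for c in controls if c not in CONTROLS]
    if unknown:
        raise ValueError(f"control letters {unknown} are not in this small table")
    return {
        "function": FUNCTIONS[function],
        "width": int(width),
        # Letters appear in order of precedence, so the list keeps that order
        "controls": [CONTROLS[c] for c in controls],
    }

print(decode_component("CB4CLE"))
print(decode_component("FD16RE"))
```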

Use Hierarchy in Design
Adds structure to design Eases debug Users can build libraries of common functions Allows each design portion to be entered by most efficient method Facilitates incremental design and floorplanning Supports team design

Section IV Advanced Software Design with Xilinx M1-Based Software Library Types

Xilinx Libraries Overview
Libraries contain descriptions of each component with pin names, functionality, timing, etc. There are two libraries: The Unified Library contains “ready made” components with non-variable function and size The LogiBLOX Library contains templates which can be customized for function and size Both libraries allow easy design migration across Xilinx devices and families

LogiBLOX templates and GUI
LogiBLOX is composed of two parts: LogiBLOX Library containing templates of VARIABLE SIZE Templates are expanded or customized (Counters, Adders, Registers, RAM, ROM) Templates have many implementations (e.g. Binary, Johnson, LFSR counters) LogiBLOX GUI and Synthesizer to create A design file for implementation Symbol for schematic capture tool HDL code for instantiation in your design Functional simulation model

Generic LogiBLOX Functions
One generic model per function type (e.g. counter) - Attributes can be specified, e.g. bus width, load, clock enable, etc. Arithmetic: COUNTER, ADDER, SUBTRACTOR, ACCUMULATOR Storage: SHIFT, DATA_REG, PROM, SRAM, DRAM Logic: ANDBUS, ORBUS, MUXBUS, DECODE, TRISTATE, COMPARATOR I/O: INPUTS, OUTPUTS, BIDIR_IO DSP and other complex functions are also available through CORE Generator

LogiBLOX Module Selector
Simple Combinatorial Logic Bus size from 2 to 32 bits Supports AND, Invert, NAND, NOR, OR, XNOR, XOR Any of the inputs or output can be inverted independently Use Decode or MASK function Three-State Drivers Optional pull-up resistors Constants Allows signals to be tied high or low

How to use LogiBLOX in HDL code
If a LogiBLOX function is inferred, there is nothing more to do! Check with the synthesis vendor. Most synthesis tools infer simple LogiBLOX components automatically Example: Synthesis tools will infer an adder for X <= A + B; To instantiate a LogiBLOX function, or if the synthesis tool does not infer LogiBLOX automatically: Use LogiBLOX GUI from command-line in “stand-alone” mode: %lbgui -vendor Creates a LogiBLOX module for simulation Creates an entity or module declaration

Section IV Advanced Software Design with Xilinx M1-Based Software FPGA Express for VHDL & Verilog Design

Section Agenda Overview Design Flow Instantiation Guidelines
Coding Style Guidelines

Overview Xilinx leads in FPGAs - 55% market share
Synopsys leads in VHDL/Verilog synthesis - 80% market share One result of long term technology partnership is FPGA Express Xilinx is only silicon supplier with right to distribute FPGA Express technology Integration into Foundation Series

Foundation Express 1.4 Features
Express Technology Optimizes the design for Xilinx Architectures Optimized arithmetic functions Automatic Global Signal Mapping Automatic I/O Pad Mapping Resource Sharing Hierarchy Control Source Code Compatible With Synopsys Design Compiler and FPGA Compiler Verilog (IEEE 1364) and VHDL (IEEE ) Support Easy, graphical constraint entry F1.4 is stand-alone F1.5: Sept / Oct ’98 Integrated into Foundation Project Manager Replaces Metamor

Xilinx-Express Design Flow
[Flow diagram. Labels recoverable from the figure: Foundation Design Entry Tools (HDL Editor, State Diagram Editor, Schematic Capture); DSP COREGen & LogiBLOX Module Generator; files .V, .VHD, .VEI, .VHI, .UCF, .XNF, .NGO, and EDIF; Express with Timing Requirements and Reports; Xilinx Implementation Tools producing BIT, JEDEC, and SDF; Behavioral Simulation Models; HDL Simulation; Gate Level Simulator]

Express Input and Output
Input files may be VHDL or Verilog format Mixed Verilog/VHDL modules are accepted Schematics may also be used, but should not be input into Express Schematic files in XNF or EDIF format will be merged into the design in Xilinx Design Manager Output netlists are in XNF format Timing Specifications may be specified in Express Timing Specifications are not used during Synthesis Timing Specifications can be included in the output netlist VHDL Verilog Timing Requirements Express .XNF Reports

Analyze the Design (1) “Analyze” checks the HDL code for syntax errors
Also creates internal files Files are automatically analyzed when selected for a project Do not select XNF or EDIF files Will be merged into the design by Design Manager Synthesis -> Identify Sources

Analyze the Design (2) As the design blocks are analyzed, status is displayed: In this example, all blocks were analyzed successfully No Errors or Warnings Out of Date Warnings Errors Main Window

Implement the Design Express Implementation maps the HDL code to standard logic, creating a generic netlist. At this stage, the design has not been optimized To implement a design, select only the top level block, and then select the Implement icon Main Window

Check for Errors and Warnings
After implementation is complete, the chip symbol plus status is displayed View errors, warnings, and messages Right click inside window to save information to a text file

Constraint Entry Constraints are NOT applied to Synthesis
Constraints are written to the output netlist (XNF) file for use by Design Manager (Xilinx Implementation Tools) Timing constraints control path delay Specify paths with timing groups, or groups of IO or sequential elements The INPUT Group includes all input ports at the top level of the design The OUTPUT Group includes all output ports at the top level of the design All flip-flops clocked by the same edge of a common clock belong to a group To define constraints: select Synthesis -> Edit Constraints forms

Synthesis -> Edit Constraints -> Clocks -> Define
Define Clock Period Enter Period, Rise, and Fall Time Select Clock entry -> Define Synthesis -> Edit Constraints -> Clocks Synthesis -> Edit Constraints -> Clocks -> Define

Define Global Synchronous Delays
The clock period creates 3 types of global constraints with the same default value: (1) All input ports to sequential Elements Setup of flip-flop or latch is included (2) Sequential Element to all output ports Flip-Flop Clock to Q delay is included (3) Sequential Element to Sequential Element 3 Clock period logic D Q 1 2 Synthesis -> Edit Constraints -> Paths form

Define Individual Synchronous Delays
Default delay from Clock specification is used in the Paths form Individual, or path-specific, delays can be defined on the Ports form Port delays override the global delays from the Paths form The input shown here arrives 20 ns before the rising edge of the clock. Synthesis -> Edit Constraints -> Ports

Define Key Port Features
Global Buffer defines the type of Clock Distribution network - Use BUFG for most applications(default) Resistance specifies use of pullup or pulldown resistor on unused pads Reduces power consumption and noise Use IO Reg allows use of sequential elements within IO Blocks to minimize Input or Output delay (default) Dependent on device type Pad Location is used to specify pin number of the IO pad Synthesis -> Edit Constraints -> Ports

Control the Hierarchy Eliminate (default) or save hierarchical boundaries Flat designs yield best results because more merging and sharing of boolean logic occurs However, small blocks are easier to debug Easier to match source HDL code to synthesized design Synthesis goals (Speed or Area) and Effort level can be defined for each module Synthesis -> Edit Constraints -> Modules (implemented design)

Optimize the Design Optimization minimizes the design for speed or area Select the implementation, and then select the Optimize icon After Optimization, check for errors and warnings again Main Window

View Results Select File -> Project Report to generate a report
Report file contains: Files and libraries used Settings for Synthesis Chip type and speed grade Estimated Timing Warning: Circuit timing estimates tend to be optimistic. Run timing analysis after routing for most accurate timing analysis. Report.txt file

Verify Results (1) After Optimization, open Synthesis -> Edit Constraints to verify that correct constraints were specified Results are based on estimated routing delays Synthesis -> Edit Constraints -> Paths (for an optimized design)

Verify Results (2) Review size of the design
Resource use is displayed for each hierarchical block Resources used per hierarchical block Black Box instantiations cannot be analyzed by Express Synthesis -> Edit Constraints -> Modules (Optimized Design)

Export Netlist Create the output netlist for use with the Xilinx Design Manager (Xilinx Implementation Tools) Select the optimized design, then select Synthesis -> Export Netlist to create the file Output file format is XNF Enable Export Timing Specifications to include constraints in the output netlist Synthesis -> Export Netlist

Simulation Not covered in this workshop Free VHDL / Verilog simulators
See Active VHDL Simulator, by Aldec (Most Recommended) VHDL Tools from RASSP Accolade Design Automation demo VHDL Simulator SimuCAD Silos III (Recommended for Verilog) Wellspring Verilog Simulator Model Technology Inc. (MTI) and major CAD vendors sell other HDL simulators

Instantiation Guidelines

Instantiation and Hierarchy
Hierarchy is created when one design is instantiated into another design All components in the Unified and LogiBLOX Libraries may be instantiated Unified library components are described in the Libraries Guide LogiBLOX components are described in the LogiBLOX Reference/User Guide Cells that must be instantiated with Express Synthesis RAM/ROM Readback OSC Bscan WOR WAND OAND…(all IOB combinatorial logic)

Black Box Instantiation
What is a black box? Any element not analyzed by Express. Examples: Existing Design Modules or Elements (XNF, EDIF, .ngo) LogiBLOX Components Pre-Optimized Netlists (PCI Cores or LogiCOREs)
Procedure for using a black box: Create a placeholder in the HDL code Synthesize the design without the XNF, EDIF, or NGO files The Xilinx Implementation Tools will resolve (link in) all black box references
Limitations: Express cannot check timing constraints through a black box. Express cannot include black box resources in its reports. GSR nets are not automatically inferred within black boxes Instantiate STARTUP and explicitly connect GSR ports in HDL
Notes: XNF attributes are stripped when an XNF file is read into Express. For this reason, it may be more appropriate to treat the XNF file as a black box and let NGDBUILD merge the final netlist.

LogiBLOX & CORE Generator Functions
For HDL designs, LogiBLOX and CORE Gen generate: Behavioral VHDL or Verilog model - for simulation only VHDL/Verilog Template - for component instantiation NGO file - for Xilinx implementation Most LogiBLOX functions can be inferred. Exceptions include READBACK and RAM blocks. Instantiation may provide better control of design implementation M1 - Introduction

How to Use LogiBLOX
1. Invoke LogiBLOX from Foundation
2. Select Setup
   a. Specify VHDL or Verilog Template in the LogiBLOX Setup form
   b. Other setup options may also be required*
3. Specify component features
4. Select OK to create component
5. (VHDL/Verilog) Use template file (.vhi / .vei) to easily instantiate the component
   Verilog - Add empty interface file to define busses.
6. Compile as usual
*To access Verilog options, invoke LogiBLOX directly from Start -> Programs -> Xilinx Foundation Series -> LogiBLOX

RAM Example Code is shown in the following slides: VHDL instantiation:
Component and entity declarations were copied into the top level design file from the LogiBLOX VHI file Verilog instantiation: Module declaration is copied into the top level design file from the LogiBLOX VEI file Additional empty file is required to specify pin type (input or output) Do not try to Analyze the VHD or VEI file from LogiBLOX, but DO Analyze the top level design file Verilog users will synthesize the additional empty Verilog file

RAM Instantiation (VHDL)
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.STD_LOGIC_UNSIGNED.all;

entity top is
  port (NOTCLR, CLKEN, NOTLD, UPCNT: in STD_LOGIC;
        CNT_DI, RAM_DI: in STD_LOGIC_VECTOR (7 downto 0);
        QO_LO: out STD_LOGIC_VECTOR (7 downto 0));
end top;
. . .
component ram256x8
  PORT( A: IN std_logic_vector(7 DOWNTO 0);
        DI: IN std_logic_vector(7 DOWNTO 0);
        WR_EN: IN std_logic;
        WR_CLK: IN std_logic;
        DO: OUT std_logic_vector(7 DOWNTO 0));
end component;

Callouts: Top level entity and RAM component declaration. The component declaration is copied from the VHI file.

RAM Instantiation (VHDL) (2)
begin
  U1: OSC4 port map (OSC_CK);
  U2: BUFG port map (OSC_CK, CLK);
  U3: CB8CLED port map (CLK, NOTCLR, CLKEN, NOTLD, UPCNT, CNT_DI, ADDR);
  xram: ram256x8 port map (A => ADDR, DI => RAM_DI, WR_EN => CLKEN,
                           WR_CLK => CLK, DO => QO_LO);
end cr;

Callouts: Last part of the top level architecture. The component instantiation is copied from the VHI file, and the instance name is entered.

Coding Style Guidelines

Coding for Performance
FPGAs require better coding styles and more effective design methodologies
  Pipelining techniques allow FPGAs to reach gate array system speeds
Gate Arrays can tolerate poor coding styles and design practices
  66 MHz is easy for a Gate Array
Designs coded for a Gate Array tend to perform 3x slower when converted to an FPGA
  Not uncommon to see up to 30 layers of logic and MHz FPGA designs
  6-8 FPGA Logic Levels = 50 MHz
Think Hardware

Case vs If-Then-Else (Verilog)
// case statement: infers a single 4-to-1 multiplexer
module mux (in0, in1, in2, in3, sel, mux_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output mux_out;
  reg mux_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    case (sel)
      2'b00: mux_out = in0;
      2'b01: mux_out = in1;
      2'b10: mux_out = in2;
      default: mux_out = in3;
    endcase
  end
endmodule
[Diagram: one 4-to-1 MUX with inputs in0-in3, select sel, and output mux_out]

Think Hardware

// if-then-else chain: infers a priority encoder
module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output p_encoder_out;
  reg p_encoder_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    if (sel == 2'b00) p_encoder_out = in0;
    else if (sel == 2'b01) p_encoder_out = in1;
    else if (sel == 2'b10) p_encoder_out = in2;
    else p_encoder_out = in3;
  end
endmodule
[Diagram: chain of selections on sel=10, sel=01, and sel=00, with in3 entering first and in0 last, driving p_encoder_out]

Reduce Logical Levels of Critical Path (Verilog)
module critical_bad (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;
endmodule
[Diagram: critical enters at the first level of logic, ahead of in2 and in3]

Think Hardware

module critical_good (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;
endmodule
[Diagram: critical enters at the last level of logic, next to out]
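A restructuring like this should leave the logic function unchanged, and with five inputs that is cheap to check exhaustively. The Python sketch below is an illustration added here: it compares the two expressions as printed above and reports the input combinations on which they disagree. It also checks `candidate`, a hypothetical alternative form (an assumption, not the author's code) that keeps `critical` close to the output while matching `critical_bad` on every input.

```python
from itertools import product

def critical_bad(in0, in1, in2, in3, critical):
    return bool((((in0 and in1) and not critical) or not in2) and not in3)

def critical_good(in0, in1, in2, in3, critical):
    return bool(((in0 and in1) or not in2) and not in3 and not critical)

def candidate(in0, in1, in2, in3, critical):
    # Hypothetical rewrite: critical gates a precomputed term, then one OR
    return bool(((in0 and in1 and not in3) and not critical)
                or (not in2 and not in3))

def mismatches(f, g):
    """All (in0, in1, in2, in3, critical) tuples where f and g differ."""
    return [bits for bits in product([False, True], repeat=5)
            if f(*bits) != g(*bits)]

print("bad vs good:     ", len(mismatches(critical_bad, critical_good)), "mismatches")
print("bad vs candidate:", len(mismatches(critical_bad, candidate)), "mismatches")
```

The printed expressions differ whenever `critical` is high while `in2` and `in3` are low, so it is worth running a check of this kind (or a formal equivalence tool) before relying on any hand-restructured critical path.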

Resource Sharing (Verilog)
// Two adders followed by a multiplexer
module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel) sum = a1 + b1;
    else sum = a0 + b0;
  end
endmodule
[Diagram: adders a0 + b0 and a1 + b1 feed a MUX controlled by sel that drives sum]

Think Hardware

// Operands are multiplexed first, so only one adder is needed
module good_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  reg a_temp, b_temp;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel) begin
      a_temp = a1;
      b_temp = b1;
    end
    else begin
      a_temp = a0;
      b_temp = b0;
    end
    sum = a_temp + b_temp;
  end
endmodule
[Diagram: sel chooses between a0/a1 and between b0/b1; a single adder drives sum]

Register Duplication to Reduce Fan-Out (Verilog)
// One enable register drives all 24 tri-state buffers
module high_fanout(in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en;
  always @(posedge clk) tri_en <= en;
  always @(tri_en or in) begin
    if (tri_en) out = in;
    else out = 24'bZ;
  end
endmodule
[Diagram: en registered by clk into tri_en, which enables [23:0]in onto [23:0]out; 24 loads]

Think Hardware

// The enable register is duplicated; each copy drives 12 buffers
module low_fanout(in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en1, tri_en2;
  always @(posedge clk) begin
    tri_en1 <= en;
    tri_en2 <= en;
  end
  always @(tri_en1 or in) begin
    if (tri_en1) out[23:12] = in[23:12];
    else out[23:12] = 12'bZ;
  end
  always @(tri_en2 or in) begin
    if (tri_en2) out[11:0] = in[11:0];
    else out[11:0] = 12'bZ;
  end
endmodule
[Diagram: en registered by clk into tri_en1 and tri_en2, each enabling half of [23:0]in onto [23:0]out; 12 loads each]

Design Partition - Reg at Boundary (Verilog)
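// Registers sit inside the module, ahead of the adder
module reg_in_module(a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  reg a0_temp, a1_temp;
  always @(posedge clk) begin
    a0_temp <= a0;
    a1_temp <= a1;
  end
  always @(a0_temp or a1_temp) begin
    sum = a0_temp + a1_temp;
  end
endmodule
[Diagram: a0 and a1 are registered by clk, then added to form sum]

Think Hardware

// The register sits at the module boundary, on the output
module reg_at_boundary (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  always @(posedge clk) begin
    sum <= a0 + a1;
  end
endmodule
[Diagram: a0 and a1 are added, then the result is registered by clk to form sum]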

Managing FPGA Speed Booster Pipeline (Verilog)
// Multiply and add must both complete within one clock cycle
module no_pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp;
  always @(posedge clk) begin
    out <= (a_temp * b_temp) + c_temp;
    a_temp <= a;
    b_temp <= b;
    c_temp <= c;
  end
endmodule
[Diagram: a * b + c reaches out in 1 cycle]

Think Hardware

// Multiply in one cycle, add in the next
module pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp, mult_temp;
  always @(posedge clk) begin
    mult_temp <= a_temp * b_temp;
    a_temp <= a;
    b_temp <= b;
    out <= mult_temp + c_temp;
    c_temp <= c;
  end
endmodule
[Diagram: a * b + c reaches out in 2 cycles]
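The effect of the extra pipeline register can be seen in a cycle-by-cycle model. The Python sketch below is an illustration under two assumptions: every assignment in the clocked block behaves as a register that samples the values held before the clock edge, and the signals are treated as plain integers rather than single bits. The function names are hypothetical.

```python
def simulate_no_pipeline(samples):
    """samples: one (a, b, c) tuple per clock edge. Returns out after each edge."""
    a_temp = b_temp = c_temp = out = 0
    trace = []
    for a, b, c in samples:
        # Right-hand sides use the register values from before this edge
        out, a_temp, b_temp, c_temp = a_temp * b_temp + c_temp, a, b, c
        trace.append(out)
    return trace

def simulate_pipeline(samples):
    """Same interface, with mult_temp as an extra register after the multiplier."""
    a_temp = b_temp = c_temp = mult_temp = out = 0
    trace = []
    for a, b, c in samples:
        out, mult_temp, a_temp, b_temp, c_temp = (
            mult_temp + c_temp, a_temp * b_temp, a, b, c)
        trace.append(out)
    return trace

samples = [(2, 3, 1), (4, 5, 10), (0, 0, 100), (0, 0, 0)]
print("no_pipeline:", simulate_no_pipeline(samples))
print("pipeline:   ", simulate_pipeline(samples))
```

In this model the pipelined version delays the product by one more cycle than `c`, so each product is summed with the `c` value of the following sample. If the operands must stay aligned, `c` needs a matching extra register stage.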

When to Use Tri-state Buffers (BUFTs)
BUFTs can be used to implement: Internal Tri-state busses Muxes greater than 4-to-1 or Multiplexed Buses BUFTs can be inferred: Tri-states are inferred when a ‘Z’ can be assigned to a signal BUFTs can be instantiated: BUFT components LogiBLOX Tri-State Buffers Within a wide MUX: LogiBLOX Wired-AND MUX
Notes: Tri-state buffers are present on the silicon whether you choose to use them or not, so using them may improve device utilization. Since tri-state buffers must drive horizontal long-lines, they can also have a major impact on the placement of the design. They also make it easier for the designer to consider the floorplanning of a design. Although the place and route software performs basic checks on the logic of the design, it is still possible for the designer to arrange for two or more tri-state buffers to drive the long-line simultaneously. Contention between two tri-state buffers will not cause a problem, even if it lasts for a long time. Contention among many buffers (more than 4 or 5 buffers driving a node high and 4 or 5 buffers driving it low) may cause metal migration.

4-to-1 Tri-State MUX Before (VHDL)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity TST is
  port( DATA: in std_logic_vector(3 downto 0);
        SEL: in integer;
        SIG: out std_logic );
end TST;

architecture BEH of TST is
begin
  LOOP1: for I in 0 to 3 generate
    SIG <= DATA(I) when (SEL = I) else 'Z';
  end generate;
end BEH;

[Diagram: four tri-state buffers with data inputs DATA(0)-DATA(3) and enables SEL(0)-SEL(3), all driving SIG]
Is there a problem with this example?

4-to-1 Tri-State MUX After (VHDL)
How can this code be improved?
Default integer is 32 bits
Define a limit

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity TST is
  port( DATA: in std_logic_vector(3 downto 0);
        SELECTOR: in integer range 0 to 3;
        SELECTION: out std_logic );
end TST;
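The saving from constraining the integer is simple arithmetic. The Python sketch below is an illustration with a hypothetical helper name; it computes how many bits an unsigned select signal needs for a given range and compares that with the 32 bits quoted above for an unconstrained integer.

```python
def bits_for_range(low, high):
    """Bits needed to encode every value of a non-negative range low..high."""
    if low < 0 or high < low:
        raise ValueError("expects 0 <= low <= high")
    # bit_length() of the largest value; a constant 0 still needs one wire
    return max(1, high.bit_length())

UNCONSTRAINED_BITS = 32  # width quoted above for a default integer

needed = bits_for_range(0, 3)
print(f"integer range 0 to 3 needs {needed} bits instead of {UNCONSTRAINED_BITS}")
```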

Flip-Flop Examples (VHDL)
Flip-Flop inference is driven by 'event in VHDL

-- D flip-flop
FF: process (CLOCK)
begin
  if (CLOCK'event and CLOCK='1') then
    A_Q_OUT <= D_IN;
  end if;
end process; -- End FF

-- Flip-flop with asynchronous preset and clock enable
FF_CLOCK_ENABLE: process (ENABLE, PRESET, CLOCK)
begin
  if (PRESET = '1') then
    D_Q_OUT <= " ";
  elsif (CLOCK'event and CLOCK='1') then
    if (ENABLE='1') then
      D_Q_OUT <= D_IN;
    end if;
  end if;
end process; -- End FF_CLOCK_ENABLE

Callouts: Produces registered output; Generates async preset; Generates clock enable

Flip-Flops vs. Latches Latch inference does not include an edge ('event or posedge) Latches are generated when: A signal is assigned in one branch of an if statement or case statement, but not all branches An if or case statement does not define all possible conditions Does not apply to case statements in VHDL Use Synopsys parallel_case and full_case directives for Verilog to avoid latches Or, include a default assignment before the if statement

Global SET/RESET All Xilinx FPGAs have a built-in global asynchronous set/reset facility Global SET/RESET sets or resets every sequential element in the FPGA GSR signal is accessed by instantiating the STARTUP block. GSR will be inferred when the design has a net that sets / resets all sequential elements in the design Additionally, sequential elements may be set or reset individually These global nets exist outside of the general purpose routing within the device.

How to access Global SET/RESET
The Global Set/Reset (GSR) signal is accessed by instantiating the STARTUP block. Polarity may be inferred GSR will be inferred when the design has a net that sets / resets all sequential elements in the design Use of the global reset signal will reduce the burden on the FPGA’s routing resources significantly. The limitation with this global net is that it will reset (or set) every flip-flop in the device. If a flip-flop must be immune to the effects of the global reset net, then the general purpose routing must be used to distribute this signal to every other flip-flop in the design.

State Machine Encoding
For FPGAs, use one-hot encoding for complex state machines
  Works well in Xilinx’ register-rich FPGAs
  Uses fewer wide-input functions
  Generally produces fast state machines
For CPLDs, use binary encoding
One-hot and binary encoding can be selected in Express at Synthesis -> Options -> Project
Other types of encoding such as BCD or Gray may be specified in the HDL code
It’s best to break up large state machines into smaller ones
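The register cost behind this trade-off is easy to quantify. The Python sketch below is an illustration with hypothetical function names: one-hot encoding spends one flip-flop per state, while binary encoding needs only enough flip-flops to count the states, at the price of wider next-state decoding.

```python
def flip_flops_needed(num_states):
    """Flip-flop count for one-hot and binary encodings of a state machine."""
    if num_states < 1:
        raise ValueError("a state machine needs at least one state")
    return {
        "one_hot": num_states,
        # Smallest bit count that can number states 0 .. num_states - 1
        "binary": max(1, (num_states - 1).bit_length()),
    }

def one_hot_codes(num_states):
    """State codes with exactly one bit set, as bit strings."""
    return [format(1 << i, f"0{num_states}b") for i in range(num_states)]

for n in (4, 9, 20):
    print(n, "states:", flip_flops_needed(n))
print(one_hot_codes(4))
```

In a one-hot machine each state is identified by a single flip-flop output, which is why the decoding logic stays narrow in a register-rich FPGA.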

Address Range Identification
For the inequality operators, synthesis will infer two 12-bit comparators
VHDL Example:
  if ADDRESS(31 downto 20) <= " " and ADDRESS(31 downto 20) >= " " then
More address ranges are synthesized to more comparators
Better solution: look for patterns in address bits that can eliminate need for comparators
  if (ADDRESS(31 downto 23) = " ") and (ADDRESS(22 downto 20) /= "111") and (ADDRESS(22 downto 20) /= "000") then
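The pattern-matching form is only valid when it selects exactly the same addresses as the range comparison. The Python sketch below checks that over the whole 12-bit field ADDRESS(31 downto 20). The constants in the slide are not available, so `PREFIX` is a hypothetical 9-bit value chosen purely for illustration, and the bounds are derived from it.

```python
PREFIX = 0b101000000            # hypothetical value of ADDRESS(31 downto 23)
LOW = (PREFIX << 3) | 0b001     # lowest field value that is in range
HIGH = (PREFIX << 3) | 0b110    # highest field value that is in range

def in_range_by_comparators(field):
    """Two magnitude comparisons on the 12-bit field."""
    return LOW <= field <= HIGH

def in_range_by_pattern(field):
    """Equality on the upper 9 bits plus two exclusions on the lower 3 bits."""
    upper = field >> 3
    lower = field & 0b111
    return upper == PREFIX and lower != 0b111 and lower != 0b000

same = all(in_range_by_comparators(f) == in_range_by_pattern(f)
           for f in range(1 << 12))
print("pattern form matches range form:", same)
```

The trick works here because the range starts just above and ends just below an 8-value block boundary; a range without that kind of alignment would need a different decomposition.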

Arithmetic and Comparison Operators
Use arithmetic and comparison operators whenever possible. Example: if (Y > Z) then X <= A + B; Arithmetic and comparison operators give Express the most flexibility to optimize: Multiplier Adder, Subtracter, and Adder/Subtracter Incrementer, Decrementer, and Incrementer/Decrementer Comparator Multiplexer (select operator) Operators can be instantiated, but generally you will get the best performance with operator inference

Last but not least…. Expressions
Use parentheses to indicate precedence.
Replace repetitive expressions with function calls or continuous assignments
VHDL generate statements can cause long compile times unfolding the logic - Use wisely
Be careful with generate statements nested in loops or within generate statements
Generate example:

-- Generate 3 instances of ALU2
GEN1: for N in 0 to 2 generate
  ALU2_X3: ALU2 port map (
    CTL(2 + N*3 downto N*3),
    A(7 + N*8 downto N*8),
    Y(7 + N*8 downto N*8));
end generate;

Resources Support Resources On-Line Documentation
( Answers Search) Express Expert Journal Synthesis Design Guides On-Line Documentation START -> Programs -> Xilinx Foundation Series -> VHDL Reference Manual START -> Programs -> Xilinx Foundation Series -> Verilog Reference Manual START -> Programs -> Xilinx Foundation Series -> On-Line Books -> Express User’s Guide and Express Application Supplement

Section IV Advanced Software Design with Xilinx M1-Based Software M1-Based Software Flow

Logical Design Files Logical Design Files describe your design, and are composed of logical components Typically a netlist, generated by Schematic Capture or Synthesis Composed of Boolean Gates, FIFOs, RAMs Netlist input to XACT-Step M1 is in EDIF format XNF files are also accepted EDIF format files are translated to (Native Generic Design) NGD format NGD files have varying extensions Ex: NGD, NGM, NGA, NGO NGD files can be translated to other formats for simulation

Physical Design Files Physical design files are composed of components found in a Xilinx FPGA such as look-up tables and flip-flops Physical design files have .ncd extension Map creates an NCD file from an NGD file NCD files contain varying pieces of information Mapping, placement, and routing tools each concatenate data to the bottom of the NCD file

M1-Based Design Flow
Inputs: .XNF or EDIF netlist, plus UCF (User Constraint File)
NGDBUILD: flattens the hierarchical design; writes .NGD
MAP: logical to physical translation, groups LUTs and FFs into CLBs; writes .NCD and .PCF
TRCE: static timing estimates
PAR: layout of the physical design, routes the physical design; writes .NCD
TRCE: static timing analysis
BITGEN: generates the configuration file; writes .BIT
*Design entry tool flows to M1 are shown in the Appendix.

Design Flow Programs (1)
NGDBUILD Merges hierarchical EDIF or XNF files into one hierarchical file Creates internal netlist .ngd(Native Generic Design) files Contains logical components: combinatorial gates, RAMS, flip-flops, etc. MAP Maps logical components to physical components found in Xilinx FPGA: look up tables, Flip-Flops, three state buffers, etc. Packs physical components into COMPS Creates internal .ncd (Native Circuit Design) file Translate Map Place & Route Configure

Design Flow Programs (2)
TRCE: Analyzes timing
  Use before PAR to analyze constraints
  Use after PAR to check delays
PAR: Places COMPS on FPGA
  Routes the FPGA
NGDANNO: Back-annotates timing delays for simulation
BITGEN: Creates file to configure FPGA
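The hand-off between the main programs can be summarized as data. The Python sketch below is only a bookkeeping aid, not a way of running the tools: it lists the file type each program reads and writes, as described on these slides, and checks that every output is accepted by the next stage. TRCE and NGDANNO are left out because they analyze or annotate a design rather than pass it on; the helper name is hypothetical.

```python
# (program, accepted input formats, output format), per the slides above
FLOW = [
    ("NGDBUILD", {"edif", "xnf"}, "ngd"),
    ("MAP", {"ngd"}, "ncd"),
    ("PAR", {"ncd"}, "ncd"),
    ("BITGEN", {"ncd"}, "bit"),
]

def check_chain(flow):
    """Raise ValueError if a stage's output is not accepted by the next stage."""
    for (tool, _, output), (next_tool, accepted, _) in zip(flow, flow[1:]):
        if output not in accepted:
            raise ValueError(f"{tool} writes .{output}, which {next_tool} does not read")
    return True

check_chain(FLOW)
print(" -> ".join(f"{tool} (.{out})" for tool, _, out in FLOW))
```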

Key M1 Browser Reports Map Report
Displays result of DRC (Design Rule Check) Indicates if the design will fit into the specified part Identifies ways to improve the design Reports nets with no source or load Logic Level Timing Report provides delay estimates Reports longest paths in the design Created before placement Based on block delays and minimum net delays

Key Report Files Placement and Routing Report includes resource summary Indicates the percentage of utilization The number of I/O and flip-flops is specified Reports if the design routed Gives an overall timing score Score of zero indicates all timing specifications were met Post Layout Timing Report Based on block delays and net delays after routing Used for detailed delay analysis after implementation Pad report Cross reference of Input/Output components and package pins

BEL and Comp Terminology
XACTstep M1 uses two new terms for FPGA resources: “Comps” and “Bels” A comp may refer to a CLB, IOB, TBUF, or Decoder A BEL may refer to the contents of a comp, such as F_LUT, H_LUT, FFX, FFY, RAM, or PAD The Graphic Design Editor (EPIC) and TRCE timing reports will refer to BELs The COMP shown here is a 4000X CLB, which contains BELs: F_LUT, G_LUT, H_LUT, FFX, and FFY

Section IV Advanced Software Design with Xilinx M1-Based Software Implementation Options

Main Implementation Menu Options
Guide Option Use a previous implementation as template for current implementation Specify constraint file (optional) MAP, PAR, and configuration options Implementation has four sub-menus: Optimize and Map, Place and Route, Timing, and Interface Select Flow Engine -> Setup -> Options

Optimization and Map Options (1)
Map optimizes your design before it is partitioned into LUTs, Flip-Flops, etc. The GUI includes these options: Trim Unconnected Signals (default is On) Trims all fan-out/fan-in from unconnected pins Turn off to implement hierarchical blocks separately Replicate Logic (default is On) Duplicates logic with high fan-out Increases utilization, decreases delay

Optimization and Map Options (2)
Optimization Strategy (default is Off) Minimizes logic to optimize logic for speed, area, or both Synthesized designs have been optimized already Packing Strategy (default is minimum density) Informs Map of how to pack COMPS with logic Minimum Density - Map only puts related logic into the same COMP Fit Device - packs components more tightly into COMPS Can adversely affect timing and routability Generate 5-I/P Functions Reduces block levels but increases area

Place and Route Options (1)
Runtime (default is 2): trades off placement effort versus CPU time
Router Passes (default is Auto): the router runs until no further improvement is made toward meeting timing constraints. Specify a number to avoid very long run times for difficult designs; start with 3 passes.
Utilities -> Template Manager -> Edit Implementation Template -> Place and Route

Place and Route Options (2)
Workstation users may run PAR LOOP on multiple workstations simultaneously Create a list of available workstations One name per line, no comments Include the file name in the Nodelist field Many other options for advanced users, not shown here

Implementation Options for Fast Runtime versus PAR Effort
Select fast placement option, 1-2 routing passes, 0 clean-up passes, and deselect “Use Timing Constraints” 1 Deselect these 3 checkboxes Other hints: - 4KX and 9500 families give fastest runtimes. - Save this as an implementation template

Timing Report Options Enable the creation of the Timing Report
Logic Level Timing Report is created before PAR Has minimal net delays Used to predict realistic constraints Post Layout Timing Report is created after PAR Verify that the design meets constraints

Timing Report Options (2)
These options limit the information placed in the report file All options list paths in order of delay length; longest paths are listed first Design Performance Summary (Default) Displays longest clock-to-setup, pad-to-setup, and setup-to-pad delays for each clock in the design Default Timing Constraints Lists longest Flip-Flop-to-Flip-Flop, Pad- to-Flip-Flop, and Flip-Flop-to-Pad paths User Timing Constraints Report longest paths for each constraint Design -> Implement -> Options -> Edit Template -> Timing

Controlling the Back Annotation Netlist Format
Format options: - VHDL - Verilog - XNF - EDIF EDIF formats: - Standard (2.0.0) - Viewlogic - Mentor EDIF - LogicModelling

How to Start and Stop the Flow Engine
Select Flow Engine -> Setup Advanced to select the starting state Select Flow Engine -> Setup -> Stop After to set stopping point

Create a Script from the GUI
M1 can create a script file from the GUI session Available from the Flow Engine or Design Manager Select Utilities -> Command History -> Command Line Select Utilities -> Project Notes Copy, paste, and save text from Command History Window

The Guide Option Allows use of a previously placed and routed design to guide a new placement Can be useful if there are few design changes Guide is used for Map, Place, and Route Map may take much longer to execute, but PAR will be faster Recommended alternative is to use location constraints in design Previous Design New Design Guide Place & Route

Effective use of Guide
Guide uses signal and component names to determine edited parts of the design
Name all nets
Do not change names
Minimize changes to the design
Any new hierarchy changes all names below it
Avoid any changes to synthesized logic
Synthesis users: try to freeze the design with “set_dont_touch” or a similar command; otherwise, the guide option may not be useful

Section IV Advanced Software Design with Xilinx M1-Based Software Design Verification

Recommended Verification Flow
Netlist FUNCTIONAL SIMULATION Implement TIMING SIMULATION Timing Analysis Bitgen Prom File Formatter Download IN-CIRCUIT VERIFICATION

Timing Analyzer Analyze delays before and after implementation

Timing Analyzer Benefits
Combines block delays from data book with net delays from implementation files Quickly identifies critical paths and timing hazards Report shows all elements in path, each element's delay, and cumulative delay Can determine if slow paths are due to block delays (design) or net delays (implementation) Element Delay Total PAD to IOB.I IOB.I to CLB1.F CLB1.F1 to CLB1.X CLB1.X to CLB2.F CLB2.F3 to Clock IOB CLB1 D Q CLB2 I F1 X F3 block net block net block net block
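The cumulative-delay column of such a report can be sketched as a running sum over alternating block and net delays. This is only an illustration: the element names are adapted from the slide, and every delay value below is a hypothetical placeholder, not a data book number.

```python
from itertools import accumulate

# (element, kind, delay_ns); all delay values are made up for illustration
path = [
    ("PAD to IOB.I", "block", 1.5),
    ("IOB.I to CLB1.F1", "net", 2.0),
    ("CLB1.F1 to CLB1.X", "block", 1.2),
    ("CLB1.X to CLB2.F3", "net", 2.5),
    ("CLB2.F3 to Clock", "block", 1.8),
]

# Running total after each element, as in the report's "Total" column
totals = list(accumulate(delay for _, _, delay in path))

# Split the path delay into its design (block) and implementation (net) parts
block_ns = sum(delay for _, kind, delay in path if kind == "block")
net_ns = sum(delay for _, kind, delay in path if kind == "net")

for (name, kind, delay), total in zip(path, totals):
    print(f"{name:<20} {kind:<5} {delay:5.1f} {total:6.1f}")
print(f"block delays: {block_ns:.1f} ns, net delays: {net_ns:.1f} ns")
```

Comparing the block and net subtotals is one way to judge whether a slow path is dominated by the design or by the implementation.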

Output files for Simulation
ngdanno ngd2xxx EDIF XNF Verilog / SDF VHDL Before implementation, output netlist has unit delays, no back-annotation (use for functional simulation) After implementation, post-route delays are back-annotated EDIF or XNF output files include back-annotated delays SDF files are created in addition to Verilog & VHDL netlists VHDL and Verilog output netlists do not contain delays

M1 HDL Simulation Flow

VHDL & Verilog Simulation Libraries
UNISIM New for A1.4 allowing RTL and post-synthesis simulation SIMPRIM Family/architecture independent models Used for Post-M1 simulation including full timing VHDL and Verilog Standard Delay Format (SDF) files Separate file used to specify design timing (delays) to VHDL and Verilog simulators Xilinx software version 1.4 supports SDF version 2.1

Hardware Configuration Readback
Can occur while FPGA runs Requires XChecker cable Readback Trigger input starts serial readback XC3000 controlled via Bitstream Generator Default is enabled Data and trigger connected to Mode pins XC4/5000 controlled via schematic and Bitstream Generator Include Readback symbol in schematic Connect TRIG and DATA to I/O pins Can use MD0 and MD1 See Appendix for more information IPAD IBUF OPAD OBUF (MD0) (MD1) CLK TRIG DATA RIP READBACK XChecker RT RD

Section IV Advanced Software Design with Xilinx M1-Based Software PLD Configuration Settings

Bitstream Generator Options - Configuration
Controlled via Configuration Template Increase Configuration Rate if not concerned about compatibility with earlier families Add Pull-Up or Pull-Down to avoid having to connect external resistors All configuration controls are set in template.

Bitstream Generator Options - Startup
The “Start-Up Clock” switch enables the designer to synchronize startup with the FPGA’s own configuration clock or an external clock signal. Start-up can also begin when the “Done” pin goes high. To program the “Output Events”, refer to the Implementation Options section of the “Design Manager User Guide” included on the Documentation CD.

Bitstream Generator Options - Readback
The Hardware Debugger can verify the downloaded configuration and probe the internal states of the device by using the Readback feature. To use this feature, check the “Enable Bitstream Verification” box, connect the XChecker cable to your device, and insert the “Readback” symbol into your design. For more information, refer to the Xilinx Data Book and the Hardware Debugger Reference Guide on the Documentation CD.

Choose a Configuration Method
M[2:0] pins control configuration mode setting.

Section IV Advanced Software Design with Xilinx M1-Based Software Design Constraints

Section Agenda Overview Location and implementation constraints
General timing constraints Specific timing constraints Path and block specific constraints Path and block grouping Advanced constraint commands Priority

Constraint Entry Overview
All constraints can be entered in User Constraint File (UCF) Maximum allowable delay Placement of package pins Implementation Options Bitstream Generation / Prom Configuration Timing constraints may also be defined in schematic Advantage: Easy entry for hierarchical blocks UCF files must have hierarchical net and component names Disadvantage: Not all constraints are supported See Libraries guide for schematic syntax and availability Some synthesis tools allow entry of constraints Constraint files may be generated by the synthesis tools, or constraints may be written in output netlist FPGA Express puts constraints into XNF file

UCF Syntax Use uppercase letters for keywords
Keywords include names used in constraints, such as: AFTER OFFSET PERIOD BEFORE NET LOC IN OUT Use quotes around names with non-alphanumeric characters Two types of wildcards may be used: “?” is a wildcard for a single character “*” is a wildcard for any number of characters A complete list of keywords is given in the Libraries Guide, Chapter 13, Syntax Summary. The Libraries Guide is available in the DynaText Browser (On-line documents).
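The two wildcards can be tried out with a small sketch. It uses Python's `fnmatch` module as a stand-in, on the assumption that “?” and “*” behave like shell-style wildcards; the instance names are hypothetical, and `fnmatch` also treats square brackets specially, which the Xilinx tools may not.

```python
from fnmatch import fnmatchcase

def ucf_match(pattern: str, name: str) -> bool:
    """Match a name against '?' (one character) and '*' (any number of characters)."""
    return fnmatchcase(name, pattern)

# Hypothetical instance names, for illustration only
instances = ["SLOWFF1", "SLOWFF2", "FASTFF1", "FASTFF12", "REG1"]

print([n for n in instances if ucf_match("SLOWFF*", n)])
print([n for n in instances if ucf_match("FASTFF?", n)])
```

Note that “FASTFF?” does not select FASTFF12, because “?” stands for exactly one character.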

Pin Location, Implementation Constraints
Pads can be assigned to a package pin Ex: Assign a bus signal to pin 32 INST “QOUT<3>” LOC = P32; Physical Implementation may be controlled in the UCF file, such as: FAST: Set fast I/O slew rate Example: INST “$1I87/OBUF” FAST; PART: Define part type to be used Example: CONFIG PART=4005E-PQ160C-5; To assign a pin location in a netlist, assign attribute or parameter “LOC = P32” directly to PAD symbol.

Simple Combinatorial Path
Consider the following path: Assume system requirements dictate a delay of 27 ns for all input to output pins The TIMESPEC constraint communicates this requirement to software: TIMESPEC TS01 = FROM PADS TO PADS 27 NS; PAD-to-PAD TIMESPECS constrain the delay of input and output pads, and all net and block delays in the path 2 levels of logic A OUT2 B<9:0> 27 NS

Synchronous I/O Constraints
Timing requirements for the design are described by defining system delays. Defining system delays means answering these questions:
What is the clock period?
When do inputs arrive at IC2?
When must outputs be stable to meet setup at IC3?
[Figure: IC1, IC2 (FPGA under development), and IC3 sharing a common CLOCK]

Input Arrival Calculation
Inputs are constrained by their arrival time. Example: When does data arrive at pin D1?
After the clock trigger, the data delay is Tcko + Tnet + Tpad + Tc1
Delay Tc1: net delays, or other combinatorial elements on the board
Delay Tcd is the delay through the FPGA clock distribution network
Tarrival = Tcko + Tnet + Tpad + Tc1
[Figure: IC1 driving input D1 of IC2 (device under development), with Tcko, Tnet, Tpad, Tc1, and Tcd labeled; CLK waveform with Tarrival marked]

Output Stability Calculation
When does output data need to be stable? Data must be stable in order to meet the setup requirement for IC3.
How long must the data be stable before data is latched in IC3?
Tstable = Tc3 + Tpad + Tnet + Tc4 + Tsetup
Tcd is the delay through the clock distribution network
[Figure: output O1 of IC2 (device under development) driving IC3, with Tc3, Tpad, Tnet, Tc4, and Tsetup labeled; CLK waveform with Tstable marked]
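The arrival and stability sums can be written as two small helper functions. This is a minimal sketch: the function names and all delay values are hypothetical, chosen only to show the arithmetic.

```python
def t_arrival(t_cko: float, t_net: float, t_pad: float, t_c1: float) -> float:
    """Input arrival time after the clock edge: Tcko + Tnet + Tpad + Tc1 (ns)."""
    return t_cko + t_net + t_pad + t_c1

def t_stable(t_c3: float, t_pad: float, t_net: float, t_c4: float, t_setup: float) -> float:
    """Time the output must be stable before the clock: Tc3 + Tpad + Tnet + Tc4 + Tsetup (ns)."""
    return t_c3 + t_pad + t_net + t_c4 + t_setup

# Hypothetical board delays in ns
print(t_arrival(t_cko=3.0, t_net=2.0, t_pad=4.0, t_c1=5.0))
print(t_stable(t_c3=2.0, t_pad=4.0, t_net=2.0, t_c4=1.0, t_setup=3.0))
```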

Period and Offset Constraints
Two commands are used to describe synchronous delays
PERIOD defines the clock
OFFSET constraints define input arrival time and output stability time relative to the clock
Xilinx software determines internal FPGA delays from PERIOD and OFFSET constraints
Syntax:
NET clock_name PERIOD = some_delay time_unit;
NET input_name OFFSET = IN Tarrival time AFTER clock_name;
NET output_name OFFSET = OUT Tstable BEFORE clock_name;
(input_name and output_name are the names of nets connecting to the I/O pad)
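As an illustration of the syntax, the sketch below formats the three constraint lines from a clock frequency and the two board-level delays. The helper names, net names, and numbers are hypothetical; the output only mirrors the syntax shown on the slide.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency given in MHz."""
    return 1000.0 / freq_mhz

def ucf_lines(clock, freq_mhz, in_net, t_arrival_ns, out_net, t_stable_ns):
    """Return PERIOD and OFFSET constraint lines in the slide's syntax."""
    return [
        f'NET "{clock}" PERIOD = {period_ns(freq_mhz):g} NS;',
        f'NET "{in_net}" OFFSET = IN {t_arrival_ns:g} AFTER {clock};',
        f'NET "{out_net}" OFFSET = OUT {t_stable_ns:g} BEFORE {clock};',
    ]

# Hypothetical design: 25 MHz clock, input arrives 10 ns after the clock,
# output must be stable 8 ns before the clock
lines = ucf_lines("SYSCLK", 25, "DIN", 10, "DOUT", 8)
print("\n".join(lines))
```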

Clock Constraint Example
Use the Period Command to define the clock Given that the clock frequency is 20 MHz for the example: NET “CLK” PERIOD = 50 ns; 50 100 Example waveform for CLK

Synchronous Constraint Example
OFFSET defines the delay of a signal external to the chip, relative to a clock. Internal clock delays are determined by the software.
[Figure: CLK waveform with a 40 ns period; ADD0_IN arrives 14 ns after CLK (Tarrival) and the output must be stable 12 ns before CLK (Tstable); the delays into FF1 and out of FF2 are determined by the software]
NET “CLK” PERIOD = 40;
NET “ADD0_IN” OFFSET = IN 14 AFTER CLK;
NET “ADD0_OUT” OFFSET = OUT 12 BEFORE CLK;
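One way to read these numbers is as a budget: whatever part of the clock period the board does not consume is left for the paths inside the FPGA. The sketch below is a simplified view that ignores the clock distribution delay; the function name is hypothetical.

```python
def internal_budgets(period_ns: float, t_arrival_ns: float, t_stable_ns: float):
    """Return (pad-to-setup budget, clock-to-pad budget) in ns.

    Simplified: the input path gets the period minus the external arrival
    time, and the output path gets the period minus the external stable time.
    """
    return period_ns - t_arrival_ns, period_ns - t_stable_ns

# Values from the example: 40 ns period, 14 ns arrival, 12 ns stable
in_budget, out_budget = internal_budgets(40, 14, 12)
print(in_budget, out_budget)
```

With the example values, the input path has 26 ns and the output path has 28 ns available inside the device.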

Constraint Recommendations
Use a given TIMESPEC name for only one path Keep constraints in one source Either UCF file or in schematics, but not both Avoid OVER-constraining the design Design Performance suffers Critical timing paths get the best placement and fastest routing options As the number of critical paths increases, routability decreases Run times increase More information in the On-Line Docs: Libraries Guide Development Systems Reference Guide, Using Timing Constraints, UCF sections Schematic users: for path-specific constraints, vendor documentation may be necessary

Question Given the following: Clock Frequency = 20 MHz
Tarrival = 31 ns = delay from CLK to input pin D1 of IC2
Tstable = 27 ns = delay (including setup) from O1 to D pin of FF3 (IC3)
[Figure: IC1, IC2 (device under development, input D1, output O1), and IC3, all clocked by CLK]
Fill in the constraints below:
NET _____ PERIOD = _____ NS;
NET _____ OFFSET = IN _____ AFTER CLK;
NET _____ OFFSET = OUT _____ BEFORE CLK;

Question Given the following: Clock Frequency = 20 MHz
Tarrival = 31 ns = delay from CLK to input pin D1 of IC2
Tstable = 27 ns = delay (including setup) from O1 to D pin of FF3 (IC3)
[Figure: IC1, IC2 (device under development, input D1, output O1), and IC3, all clocked by CLK]
Answers:
NET CLK PERIOD = 50 NS;
NET D1 OFFSET = IN 31 AFTER CLK;
NET O1 OFFSET = OUT 27 BEFORE CLK;

Path and Block Specific Constraints
Why use path or block specific constraints?
- To relax speed requirements wherever possible
- To increase routability and overall speed of the design
- To decrease software run-time
General methodology:
- Use PERIOD and OFFSET to constrain the design globally
- Use specific “FROM-TO” constraints to modify timing for specific blocks or paths

“FROM-TO” Constraint Example
Consider the example shown below with TIMESPEC:
TIMESPEC TS01 = FROM PADS TO PADS 21;
TS01 is applied to both paths, Y to OUT1 and Z to OUT2. TS01 over-constrains the path from Z to OUT2. Tight constraints decrease routability and increase run time.
[Figure: pad-to-pad paths Y to OUT1 and Z<0:31> to OUT2, one with 1 level of logic and one with 2 levels of logic, both constrained to 21 ns; X feeds FF1 and FF2, clocked by CLK]

“FROM-TO” Constraints
The two paths could be constrained with two commands:
TIMESPEC TS01 = FROM PADS(Y) TO PADS(OUT1) 21;
TIMESPEC TS02 = FROM PADS(Z) TO PADS(OUT2) 28;
“FROM:TO” constraints can start and stop at Flip-Flops (use “FFS”), LATCHES, PADS, or RAMS
Examples:
Constrain all inputs to all Flip-Flops in block NEWFIE:
TIMESPEC TS03 = FROM PADS TO FFS(NEWFIE) 18 ns;
Constrain all Flip-Flop to Flip-Flop paths in the design:
TIMESPEC TS04 = FROM FFS TO FFS 15 ns;
Constrain all Flip-Flop to output paths in the design:
TIMESPEC TS05 = FROM FFS TO PADS 25 ns;

Creating Groups with TNM
The TNM constraint creates a group of individual components
Example: divide Flip-Flops into two groups based on instance name
INST SLOWFF* TNM = SLO;
INST FASTFF* TNM = FST;
TIMESPECs are assigned to the new groups:
TIMESPEC TS14 = FROM FFS TO SLO 40 NS;
TIMESPEC TS15 = FROM FFS TO FST 20 NS;
Greater flexibility in routing is achieved by creating a different timing requirement for these two groups
[Figure: flip-flops SLOWFF1, SLOWFF2, FASTFF1, and FASTFF2 with REG1, REG2, and COMB3]

Pre-Scaled Counter Example
[Figure: prescaler block PRE2 (outputs Q0, Q1) whose terminal count TC drives the clock enable CE of counter block COUNT12 (outputs Q2 to Q13)]
Highest speed is required in the pre-scaled block
Constrain the two counter blocks separately to avoid over-constraining COUNT12
Define two groups for use in TIMESPEC. Example UCF file:
INST FFS(PRE2) TNM = PRE;
INST COUNT12 TNM = UPPER;
TIMESPEC TS_PRE = FROM PRE TO PRE 60 MHZ;
TIMESPEC TS_TC2CE = FROM PRE TO UPPER 60 MHZ;
TIMESPEC TS_UPPER = FROM UPPER TO UPPER 15 MHZ;
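To see how much slack the slower group gains, the frequency-based specifications can be converted to the equivalent path delays. This is a small sketch; the helper name is hypothetical, and the TIMESPEC names and frequencies are taken from the example UCF file.

```python
def mhz_to_ns(freq_mhz: float) -> float:
    """Maximum path delay in ns that corresponds to a frequency in MHz."""
    return 1000.0 / freq_mhz

# TIMESPEC name -> required frequency in MHz, from the example UCF file
specs = {"TS_PRE": 60, "TS_TC2CE": 60, "TS_UPPER": 15}

for name, mhz in specs.items():
    print(f"{name}: {mhz} MHz = {mhz_to_ns(mhz):.2f} ns")
```

The paths inside COUNT12 are allowed four times the delay of the prescaler paths, which is the point of constraining the two blocks separately.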

Creating Groups with TIMEGRP
Another way to constrain this design is by creating smaller groups of endpoints: The TIMEGRP constraint is used to create new groups from other groups. FFS, LATCHES, RAMS, and PADS are predefined groups Example: ALL_FFS group contains all Flip-Flops whose instance name begins with SLOWFF or FASTFF: INST SLOWFF* TNM = SLO; INST FASTFF* TNM = FST; TIMEGRP ALL_FFS = FFS (FST* : SLO*) ;

Select One Path From Many Paths
Use to constrain one path among several parallel paths First identify the path to be constrained with TPTHRU, then use THRU in Timespec constraint Example: constrain the path through component ABC NET RED TPTHRU = ABC; TIMESPEC TS_FIFOS = FROM RAMS(FIFORAM) THRU ABC TO FFS(MY_REG*) 25; my_reg00 fiforam RED TPTHRU=ABC my_reg01 my_reg02 my_reg03

Forward Tracing
Forward tracing occurs when a constraint is assigned to a net
The constraint is applied to all global endpoints driven by the net
Example: constrain nets driven by DATA0 to Flip-Flops in block CNT25:
NET “DATA0” TNM = MYBUS;
TIMESPEC TS_REGCNT = FROM MYBUS TO FFS(CNT25) 30 NS;
[Figure: net DATA0, elements BONE, CHEW, and BARK, and block CNT25, with TS_REGCNT marked]

Ignoring Paths with TIG and NET
Timespec Ignore, “TIG”, attribute ignores a TIMESPEC for a specific path or net Ex: Assume that net DOG_SLOW was constrained by 2 constraints, TS01 and TS02. The following specification ignores TS01. TS02 only is applied to DOG_SLOW. NET “DOG_SLOW” TIG = TS01; Example to ignore a slow path between registers: INST REGA* TNM = REGA; INST REGB* TNM = REGB; TIMESPEC TS_TIG01 = FROM FFS (REGA) TO FFS(REGB) TIG; TIG improves software run-time and routability of the design

Other Constraint Constructs
Use “EXCEPT” to filter a group of endpoints:
INST FASTFF* TNM = FST;
TIMEGRP SLO = FFS EXCEPT FST;
TPSYNC allows definition of end points that are not FFS, RAMS, PADS, or LATCHES:
NET “BLUE” TPSYNC = BLUE_S;
TIMESPEC TS_1A = FROM FFS TO BLUE_S 15 NS;
Signal skew for logic driven by clocks can be constrained using the MAXSKEW constraint:
NET “$1I3245/$SIG_6” MAXSKEW = 3;
This specifies a maximum difference of 3 ns between the arrival times at all destinations of net $1I3245/$SIG_6.
Skew of global nets cannot be constrained (skew is fixed)

Constraint Priority All constraints are not created equal
Highest priority
- Timing ignores (TIG)
- FROM:THRU:TO specs
- FROM:TO specs
- PERIOD specs
Lowest priority
“FROM:TO” constraints are further prioritized:
Highest: FROM PATH-SPECIFIC TO PATH-SPECIFIC
FROM PATH-SPECIFIC TO GLOBAL
Lowest: FROM GLOBAL TO GLOBAL
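The ordering above can be modeled as a simple ranking. The sketch below only illustrates the priority list; it is not how the Xilinx tools resolve constraints internally, and the constraint names in the example are hypothetical.

```python
# Lower rank means higher priority, following the list on the slide
PRIORITY = {"TIG": 0, "FROM:THRU:TO": 1, "FROM:TO": 2, "PERIOD": 3}

def winning_constraint(candidates):
    """Pick the highest-priority constraint among those covering one path.

    candidates: list of (name, kind) tuples, where kind is a key of PRIORITY.
    """
    return min(candidates, key=lambda c: PRIORITY[c[1]])

# Hypothetical path covered by a global PERIOD and a FROM:TO spec
covering = [("TS_CLK", "PERIOD"), ("TS04", "FROM:TO")]
print(winning_constraint(covering))
```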

Section V Special Topics

Section V Agenda DSP Design with FPGAs
New Developments in Programmable Logic Virtex, XC6200 and Reconfigurable Logic FPGA versus ASIC costs Xilinx Student Edition Xilinx University Program participation

Section V Special Topics DSP Design with FPGAs

FPGAs Provide Outstanding DSP Performance
Processor: sequential processing; fixed architecture; complex real-time software
FPGA: parallel processing; configurable to specific needs; no software programming
[Figure: a processor with a single multiplier and adder versus an FPGA with N parallel multipliers and adders]

FPGAs Lower the Cost of High Performance DSP
500 • µP/PDSP 400 300 $ FPGA-Based DSP 200 100 • • 5 10 15 20 Relative Performance

Customer Successes TIM40 Module using FPGAs (XC4010)
3 times the price at 175 times the TI TMS320C40 performance DNA Matching (XC4010) Similar performance at 1/20th price 128-Track Audio Recording Studio (XC3190) 3 times the functionality at 1/10th the price

FIR Filter Example
Sum of Products Equation
Sample data: X0, X1, X2, ... (N bits wide)
Coefficients: C0, C1, C2, ... (K coefficients, K taps long)
Each output requires K multiplies and K sums
Clock = multiply time; sample rate = clock rate
[Figure: samples X0, X1, X2 multiplied by coefficients C0, C1, C2, with the products summed to form the output data]
IMPLEMENTATION ???
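The sum of products can be written out directly. The sketch below is a plain software reference for a K-tap FIR filter, assuming samples before the start of the input are zero; the function name and the test data are hypothetical.

```python
def fir(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]  # one multiply and one sum per tap
        out.append(acc)
    return out

# A 3-tap filter with all coefficients 1 produces a running sum of 3 samples
print(fir([1, 2, 3, 4], [1, 1, 1]))
```

Each output costs K multiply-accumulate steps, which is the cost the following slides compare across implementations.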

Traditional FIR Filter Implementation
General-Purpose DSP
PERFORMANCE = 1 / (MAC cycle time x Number of Taps)
TMS320: MAC cycle time = one clock cycle
10-bit, 20-tap filter with 50 MHz TMS320 = 2.5 MHz
Additional filter taps slow performance
Pentium: MAC cycle time = 11 clock cycles
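The performance figure follows from dividing the clock rate by the number of MAC cycles needed per output sample. A minimal sketch, with a hypothetical function name:

```python
def mac_limited_rate_mhz(clock_mhz: float, taps: int, cycles_per_mac: int = 1) -> float:
    """Sample rate in MHz when one processor performs every MAC sequentially."""
    return clock_mhz / (cycles_per_mac * taps)

# 20-tap filter on a 50 MHz processor with a one-cycle MAC
print(mac_limited_rate_mhz(50, 20))
# Doubling the number of taps halves the sample rate
print(mac_limited_rate_mhz(50, 40))
```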

Distributed Arithmetic (DA) Filter Design
[Figure: sample data enters a parallel-in, serial-out shift register; the serial bits address an 8-word x N-bit look-up table, whose output feeds an adder, a register, and a 2^-1 scaler to produce the filtered data out]
Look-up table contents by address:
000: 0
001: C0
010: C1
011: C1 + C0
100: C2
101: C2 + C0
110: C2 + C1
111: C2 + C1 + C0
PERFORMANCE = Clock Frequency / Number of Bits in Sample
10-bit, 20-tap filter using XC4000 at 50 MHz = 5 MHz

Distributed Arithmetic - 3 bit Example
Data x Coefficient:
D2 x C2 = 100 x 110
D1 x C1 = 111 x 101
D0 x C0 = 011 x 100
Rewritten as Coefficient x Data:
C2 x D2 = 110 x 100
C1 x D1 = 101 x 111
C0 x D0 = 100 x 011
One bit from each data word, taken in the order D2, D1, D0, forms the LUT address: 0 1 1 ==> (C1 + C0) from previous slide
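The 3-bit example can be checked with a short bit-serial sketch. It assumes unsigned values and processes the least significant bit first, weighting each look-up result by a left shift (the hardware on the previous slide uses a 2^-1 scaler instead); the function names are hypothetical.

```python
def build_lut(coeffs):
    """LUT[address] = sum of coeffs[k] for every set bit k of the address."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]

def da_sum_of_products(data, coeffs, bits):
    """Distributed arithmetic: one table look-up per data bit, no multipliers."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):  # one clock per bit of the sample, LSB first
        # Bit b of every data word forms the LUT address (D0 is address bit 0)
        addr = sum(((d >> b) & 1) << k for k, d in enumerate(data))
        acc += lut[addr] << b  # weight the partial sum by 2**b
    return acc

data = [0b011, 0b111, 0b100]    # D0, D1, D2 from the slide
coeffs = [0b100, 0b101, 0b110]  # C0, C1, C2 from the slide

result = da_sum_of_products(data, coeffs, bits=3)
direct = sum(d * c for d, c in zip(data, coeffs))
print(result, direct)
```

The first look-up uses address 011 and returns C1 + C0, and the accumulated result equals the direct sum of products after three clocks, one per data bit.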

Resource Tradeoffs for Higher Performance
CLBs Double-Rate Distributed Arithmetic 66 MHz 16.2 MHz 400 Fully-Parallel Distributed Arithmetic 300 8.1 MHz Bit-Serial Distributed Arithmetic 200 100 100 Hz to 100 kHz Serial Sequential Number of Filter Taps

XC4085XL 10 Times Faster Than TMS320C6x
[Figure: 16-bit FIR filter benchmark in billions of multiply-accumulates (MACs) per second, on a scale of 1 to 8, comparing a 0.25 µm, 200 MHz part with the XC4085XL (XC4000XL using 80 MHz clock rate)]

Price per Million MACs per Second - 16-bit word
FPGA DSP is Lower Cost $0.25 $0.20 $0.15 $0.10 $0.05 TMS320C6x (25,000 pcs) Xilinx FPGA (25,000 pcs) Price per Million MACs per Second - 16-bit word

Where FPGA-Based DSP is Used
High Data Rates: 1 to 70 M samples/sec
High Complexity: 10’s to 100’s of MACs in a single chip
Fixed-Point Data: Audio, Video, Radio & Voiceband Modems, HDTV
[Figure: sample rate (samples per second, 1k to 1G) versus algorithm complexity (less complex to more complex), showing the regions covered by MPU/MCU, single-chip DSP, multiple DSP cores or chips, FPGA-based DSP, and ASIC]

DSP / FPGA Design Methodology
THIRD-PARTY DSP SOFTWARE CORE Generator Xilinx CORE Generator 1.4 available now! Coefficients Instantiate into schematic or HDL PLACE AND ROUTE POST ROUTE SIMULATION BIT STREAM FOR DOWNLOAD CABLE, OR EPROM

XC4000 Resource Cross Reference Chart (Bit-Serial Implementation)
TAPS 8 16 24 32 40 48 56 NUMBER OF XC4000 CLBs WORD SIZE M samples/sec @50MHz

Section V Special Topics
Team-Based Modular 10 Million 0.15u Cores 300Mhz HDL 1 Million 150Mhz 0.18u Schematic 500k 133Mhz 0.25u 100K 100Mhz 0.35u 25K 50 Mhz 0.5u Density/System Gates The Road Ahead New Developments in Programmable Logic Process Technology Performance

Process Technology and Supply Voltage
[Figure: feature size (µm, 0.2 to 1.2) and supply voltage (5 V, 3.3 V, 2.5 V, 1.8 V, 1.3 V) versus year, 1990 to 2002, with “Today” marked]
Trend: lower cost, faster speed, higher density, lower power
Xilinx leads PLD industry in fab technology. Fab partners use FPGAs to drive their process.

Advanced Process Technology
0.5u Process             0.25u UMC Process
- LOCOS isolation        - shallow trench isolation
- bird's beak            - 0.9u metal pitch
- no planarization       - CMP
- only contact plug      - plug for all vias

Process & Density Leadership
10M Gates In 2002 10M Virtex II 2M Density (system gates) Virtex 75+M Transistors 1M XC40250XV 500k XC40125XV Industry’s 1st 0.25u PLD, 25M Transistors, 5LM 250k 180k XC4085XL 10 Million System Gates in 2002!

Architecture Innovation & Leadership
Reconfigurable Logic On-Chip AD/DA Embedded Functions 1GHz Diff. Interface Built-in Logic Analyzer Block Dual Port RAM Multiple Standard I/O Vector Based Interconnect Phase Locked Loops 66 MHz 64-Bit PCI Features Distributed Dual Port RAM IO Registers Internal Bussing 5V Tolerant I/O 3.3V and 5V PCI

Performance Leadership
220 240 260 280 300 MHz 233 MHz UP 300 MHz RAM I/F 133 MHz PCI 20 40 60 80 100 120 140 160 180 200 133 MHz SDRAM I/F 155 MHz SONET 66 MHz PCI System Clock Rate* (MHz) 100 MHz SDRAM I/F 100 MHz DSP for Wireless Base Station 33 MHz PCI 1995 1996 1997 1998 1999 2000 2001 2002 * 1/(Tsetup+Tclock-to-out)
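The footnote defines the system clock rate as 1/(Tsetup + Tclock-to-out). A minimal sketch of that arithmetic follows; the function name and the timing values are hypothetical and are not taken from any data sheet.

```python
def system_clock_mhz(t_setup_ns: float, t_clock_to_out_ns: float) -> float:
    """System clock rate in MHz from chip-to-chip I/O timing, both in ns."""
    return 1000.0 / (t_setup_ns + t_clock_to_out_ns)

# Hypothetical I/O timing: 2.5 ns setup plus 5.0 ns clock-to-out
print(f"{system_clock_mhz(2.5, 5.0):.1f} MHz")
```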

Packaging Leadership Pins Flip Chip Technology 1000 Chip Scale
Fine Pitch BGA 700 SBGA <0.8mm 1.0mm BGA 500 HQFP 1.27mm PQFP 300 PGA PLCC 100

Compile Time Leadership
Minutes* Release * 100k System gate designs (200MHz Pentium) With Faster CPUs Faster Compile Times Modular Compile 1999 Goal: 1 Million Gates in 45 minutes!

F1.5 Features Tight integration Improved ease of use
FPGA Express inside Foundation Project Manager Single Project Management / Flow Engine environment Improved ease of use Complete pushbutton New Virtex, XC9500XL support Improved FPGA Express synthesis runtimes & performance Improved PAR runtimes and performance

Xilinx Smart-IP Delivers...
Architectures tailored to cores Intelligent Software Implementation Flexible Core Technology Xilinx Smart-IP Technology High Flexibility High Predictability High Performance Performance + Time to Market

Leader in Core Solutions Xilinx and Partners’ COREs
82xx, UARTs, DMA, 66 MHz DRAM/SDRAM I/F Memory (RAM, ROM, FIFO) Micro Sequencer (2901) Proprietary RISC Processors Microprocessor I/Fs 8051/8031 IEEE 1284 MIPS 133+ MHz SDRAM I/F Base Level Functions Advanced processors ATM Cell Assembly/Delineation CRC-16/32 T1 Framer HDLC Reed-Solomon, Viterbi UTOPIA, 25/33/50 MHz 10/100 Ethernet 1Gb Ethernet ADSL, HDSL, XDSL ATM/IP Over SONET SONET OC3/12 Modems SONET OC48 Emerging Telecom and Networking Standards Communication & Networking Add, Subtract, Integrate Correlators Filters: FIR, Comb Multipliers Transforms: FFT, DFT Sin/Cos DCT Cordic DES Divider JPEG NCO DSP Processor I/Fs DSP Functions > 200 MSPS Programmable DSP Engines QAM Functions DSP Satellite decoders Speech Recognition CardBus FireWire( Mbps) PCI 64bit/66MHz PC104 VME Standard Bus Interfaces CAN Bus ISA PnP I2C PCI 32bit Emerging High- Speed Standard Interfaces PCMCIA USB 1998 1999 2000 By 2002: Virtually All Functions Available as Cores

Architecture Tailored to Cores Segmented Architecture
Segmented Routing Non-Segmented Routing Core1 Core2 Wasted Routing Unpredictable Timing High Power Consumption Efficient Routing Predictable Timing Low Power Consumption

Architecture Tailored to Cores Distributed RAM
RAM Available Locally To The Core Portable RAM Based Cores Improves Logic Efficiency by 16X High Performance Cores

Intelligent Software Pre-defined Placement & Routing
Fixed Placement & Pre-defined Routing Fixed Placement I/Os Relative Placement Guarantees Performance Guarantees I/O & Logic Predictability Other Logic Has No Effect on the Core Enhances Performance & Predictability

Smart-IP Delivers Performance
12x12 Multiplier 80 Xilinx Segmented 70 Speed(MHz) 60 Non-Xilinx Non-Segmented 50 1 2 4 8 Number of Cores Smart-IP Performance Is Independent of Number of Cores in a Design

Smart-IP Delivers Portability
80 MHZ 80 MHZ 80 MHZ 80 MHZ Smart-IP Performance Is Independent of a Core’s Placement in the Device

Smart-IP Delivers Transportability
80 MHZ 80 MHZ 80 MHZ Non-Segmented Architecture May Experience 30% Performance Degradation Smart-IP Performance is Independent of Device Size

Xilinx Architecture for Fastest Performance
Xilinx Segmented Interconnect Non-segmented Interconnect Across Chip Across Chip Logic Block 1 Logic Block 2 ... ... Logic Block 3 Logic Block n Logic Block 1 Logic Block 2 Logic Block n 1x 1x 4x 4x 3x 4x 1x 6x Logic Block (next row) Logic Block (next row) Segmented Interconnect Structure Provides Faster Logic Cell Connections

High Value Cores with Spartan
Core Function                            XCS30XL Price*   Percentage of Device Used   Effective Function Cost
UART                                     $6.95            17%                         $1.20
16-bit RISC Processor                    $6.95            36%                         $2.50
16-bit, 16-tap Symmetrical FIR Filter    $6.95            27%                         $1.90
Reed-Solomon Encoder                     $6.95            6%                          $0.40
PCI Interface (w/ faster speed grade)    $12.00           45%                         $5.40
*100,000 units, mid-1999 projection
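The effective function cost appears to be the device price multiplied by the fraction of the device the core occupies. The sketch below reproduces the slide's figures under the assumption that they were rounded to the nearest $0.10; that rounding rule is an inference, not something the slide states.

```python
# (core, device price in dollars, percentage of device used), from the slide
cores = [
    ("UART", 6.95, 17),
    ("16-bit RISC Processor", 6.95, 36),
    ("16-bit, 16-tap Symmetrical FIR Filter", 6.95, 27),
    ("Reed-Solomon Encoder", 6.95, 6),
    ("PCI Interface", 12.00, 45),
]

def effective_cost(price: float, percent_used: float) -> float:
    """Share of the device price attributable to one core."""
    return price * percent_used / 100

# Round to the nearest ten cents (assumed rounding rule)
costs = [round(effective_cost(price, pct), 1) for _, price, pct in cores]

for (name, _, _), cost in zip(cores, costs):
    print(f"{name}: ${cost:.2f}")
```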

Section V Special Topics Virtex, XC6200 and Reconfigurable Logic

Virtex Enables System on a Programmable Chip
VHDL Design Environment Verilog Design Environment CoreGen New Modules FIFO Designer #1 Designer #2 DSP IP Modules CPU Design Reuse AllianceCore Virtex 133Mhz SDRAM Gbit Ethernet LogiCore 160 MHz I/O Performance 133 MHz Memory Performance 1 Million System Gates 66Mhz PCI

Virtex Series Overview
New FPGA architecture, similar to XC4000 0.25 and 0.18 micron 5LM process Segmented routing SelectRAM+ offers 3 types of RAM Distributed SelectRAM Block SelectRAM (new) High-speed access to external memory (new) Traditional and Low Voltage support CMOS, TTL LVTTL, LVCMOS, GTL+, and SSTL3 250K - 1M system gates in 1998 Some XC6200-like features Ideal for Reconfigurable Logic Dynamic & Partial reconfiguration

Virtex Functional Block Diagram
[Figure: Virtex functional block diagram showing Phase Locked Loop (PLL), CLBs, segmented routing, vector based interconnect (delay = f(vector)), SelectI/O pins (66 MHz PCI, SSTL3), Block SelectRAM memory, and Distributed SelectRAM memory]

Xilinx 0.25 µm, 5 Volt-Compatible FPGAs
[Figure: a Virtex or XC4000XV device (2.5 V logic supply, 3.3 V I/O supply) connected to any 5 V device (XC4000E) and any 3.3 V device (XC4000XL); it accepts 5 V levels and meets TTL levels]
Xilinx is the first FPGA vendor to address the issues associated with 2.5 volt devices. Our customers have told us that we must provide a path back to 5 volts. The 4000XV family and the new Virtex devices are fabricated in a 0.25 micron process which uses 2.5 volts for its basic transistor logic. Our solution is shown above: a split supply between the logic core (which must operate at 2.5 volts) and the I/O ring (which operates at 3.3 volts). The same 5 volt tolerant I/O structure used on the 3.3 volt XL devices is used on these 2.5 volt devices, allowing a mixture of 3.3 volt TTL and 5 V TTL devices to be connected to a Virtex FPGA device. Xilinx will use this strategy on all future generations: each will be directly compatible with the previous process generation and tolerant of voltages from two prior process generations. Another example of Xilinx innovation.
4KXL / 4KXV family migration is possible if you plan for:
- Additional power/ground pins
- Dedicated clock and configuration pins
Voltage migration guide to help users

Virtex FPGA Performance
100+ MHz internal speeds
- 155 MHz SONET data stream processing
- 100+ MHz pipelined multipliers
- 66 MHz PCI
100+ MHz system interface speeds
- Tco (output register), Tsu (input register), and Th (input register) are specified in ns, without PLL and with PLL
- Max I/O performance: 110 MHz without PLL, 160 MHz with PLL
With the fast I/Os and clocking in these Virtex devices, we will be able to achieve over 100 MHz system clock rates through all parts of the chip. The internal logic can operate at 100 MHz through 3 to 4 levels of logic, depending on routing. The I/Os can operate at 110 MHz without the PLL, and the setup and clock-to-out times shown also meet the 66 MHz PCI specification. With the PLL, the I/O performance increases to 160 MHz, opening up many new applications previously not possible using FPGA technology.

Segmented Routing Interconnect
Fast local routing within CLBs
General purpose routing between CLBs
Fast interconnect: 8 ns across 250,000 system gates
Predictable for early design analysis
Optimized for five layer metal process
[Figure: a CLB with two pairs of logic cells (2 LCs each), carry chains, 3-state busses, and a switch matrix]
This slide shows the relationship between the logic and the interconnect in the Virtex devices. A configurable logic block, or CLB, contains 4 logic cells organized in two pairs. All four look-up tables in the CLB can connect to each other through fast local interconnect that provides known delays between LUTs in a given CLB. To connect to other CLBs, signals go in and out of a CLB through the switch matrix. There is no direction dependence in this switch matrix: inputs can come from any of the 4 directions and outputs can go out in any of the 4 directions. The hierarchical general purpose routing was designed to be scalable across the wide range of densities this family will have, and it provides excellent utilization even on the largest parts, very fast routing delay times, and excellent predictability.

Virtex Configurable Logic Block
Polarity of all control signals selectable
Fast arithmetic and multiplier circuitry
Optimized for synthesis
[Figure: CLB made up of two pairs of logic cells (2 LCs each); each logic cell has a 4-input LUT (I0 to I3, with WI and DI), carry and control logic (CI, CO), and a register (D, CE, CLK, PR, RS, Q) driving output O]

SelectRAM+ Memory Features
Distributed SelectRAM Memory
- Pioneered in XC4000 family
- 16x1 synchronous SRAM implemented in LUT
- Ideal for DSP applications
- Access over 100 billion bytes/sec
Block SelectRAM Memory
- Up to 32 4,096-bit blocks of dual port synchronous SRAM
- Configurable widths of 1, 2, 4, 8, and 16
- Ideal for data buffers and FIFOs
- Up to 17 gigabytes/sec access
Fast Access to External RAM
- Direct interface to SSTL3, 3.3V synchronous DRAM standard
- 133 MHz
The Virtex devices will have several features that help solve system-level design issues. We talked about the two carry chains in each CLB. Like the 4000X family, these carry chains are uni-directional (up only) and are very fast. The clocking and PLL support was designed to offer I/O performance in line with the fast internal logic speeds the 0.25 micron process will give us. We’ll examine this and the RAM hierarchy in the next few slides. We address RAM capability in the Virtex devices at three levels: 1. We have SelectRAM that is fully compatible with the 4000X family. 2. There is dedicated block RAM on-chip that is not part of the CLB array. 3. For larger amounts of RAM, we have the I/O speed to interface to external RAM at better than 100 MHz.

Block RAM Configure as: 4096 bits with variable aspect ratio
8-32 blocks across family devices
True dual-port, fully synchronous operation
Cycle time <10 ns
Flexible block RAM configuration:
5 blocks: 2K x 10 video line buffer
1 block: x 8 ATM buffer (9 frames)
4 blocks: 2K x 8 FIFO
9 blocks: 4K x 9 FIFO with parity
[Figure: RAMB4 dual-port block RAM symbol with port A signals WEA, ENA, CLKA, ADDRA, DINA, DOA and port B signals WEB, ENB, CLKB, ADDRB, DINB, DOB.]
The block RAM in the Virtex devices is in addition to the SelectRAM that is part of each CLB. Each block holds 4K bits and can be configured with a width from 1 to 16 bits. The number of blocks varies with device size: for example, there are between 8 and 30 blocks in the devices from 20K to 200K logic gates. The block RAMs are fully synchronous and have two fully independent read/write ports. Cycle time of the block RAM is less than 10 nanoseconds. As you can see on the slide, many popular memory configurations are possible using a combination of blocks. NOTE: The block RAM will not be able to do asynchronous logic. We believe in letting the synthesis tool work with the LUTs and the local interconnect within the CLB for logic and keeping the RAM focused on very high performance in RAM applications.
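The block counts on the slide follow from the 4,096-bit block size and the allowed widths of 1, 2, 4, 8, and 16. The sketch below is one way to reproduce them, assuming every block in a memory uses the same aspect ratio; the function name `blocks_needed` is made up for illustration and is not part of any Xilinx tool.

```python
import math

BLOCK_BITS = 4096                 # bits per block RAM
BLOCK_WIDTHS = (1, 2, 4, 8, 16)   # configurable data widths


def blocks_needed(depth, width):
    """Smallest number of blocks that builds a depth x width memory.

    Tries each aspect ratio, tiling blocks side by side for width and
    stacking them for depth, and keeps the cheapest arrangement.
    """
    best = None
    for block_width in BLOCK_WIDTHS:
        block_depth = BLOCK_BITS // block_width
        count = math.ceil(width / block_width) * math.ceil(depth / block_depth)
        if best is None or count < best:
            best = count
    return best


video_line_buffer = blocks_needed(2048, 10)  # 2K x 10
fifo = blocks_needed(2048, 8)                # 2K x 8
parity_fifo = blocks_needed(4096, 9)         # 4K x 9
```

With these inputs the results are 5, 4, and 9 blocks, matching the three fully specified examples on the slide.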

XC6200 Reconfigurable Processing Unit
1000x improvement in reconfiguration time from external memory CPU Memory I/O FastMAPtm assures high speed access to all internal registers Microprocessor interface built-in XC6200 RPU All registers accessed via built-in low-skew FastMAPtm busses High capacity distributed memory permits allocation of chip resources to logic or memory Ultrafast Partial Reconfiguration fully supported I/O XC Up to 100,000 gates

XC6200 Architecture 16x16 Tile 4x4 Block    FastMAPtm Interface  
User I/Os    FastMAPtm Interface   Function Cell User I/Os  User I/Os  Address   Data Control    User I/Os *Number of tiles varies between devices in family

How Dynamic Reconfiguration Helps Example: DSP
3D Graphics Reconfiguration- DSP Algorithms PDSP FPGA Optimized FPGAs - Texture - Shadow - Reflections - Perspective - Edge One function at a time Two or more functions at a time All functions done in time Some functions run while others are loading Reconfiguration Advantages: Lower cost by reusing silicon for multiple functions over time OR 10-500x performance increase in hardware versus software implementation

Reconfigurable Logic - Research vs. Component $
Reconfigurable Logic research has typically focused on reconfigurable computing (1). But there are really two potential markets: high-end embedded computing (2) and the low-cost embedded market (3).
[Figure: performance versus problem size, ranging from embedded microprocessor to computer. Labels: "Zillions of Research Dollars" (1); "?" (2); "Zillions of Component Dollars" (3). Graph courtesy of Nick Tredennick.]

XC6200 Dynamic & Partial Reconfiguration
Design Swapping XC4013 200us 250ms Block Swapping XC6216 Circuit Updates Rewiring 40ns ns us ms s

Directions in Reconfigurable Logic
XC6200 was first Xilinx product to XC6200 chips & XACT6000 software are available, but no further product development Divergent architecture and incomplete tools support XUP support for Research only, not classes: Adaptive or Reconfigurable Logic, Place & Route algorithms Key XC6200 features brought into mainstream families (Virtex)! Dynamic & Partial reconfiguration Full industry and software support Easier to design to New Rec.Logic curriculum should use Virtex Virtex-ready PCI board available from Virtual Computer Corp. Further info:

Section V Special Topics FPGA versus ASIC Costs

FPGA Cost = Gate Array Cost
Pad-Limited Die Size core-limited pad-limited Core Core I/O pads I/O pads Mid-high density: Gate count determines die size Low Density: I/O count determines die size As Processes Migrate FPGA Cost = Gate Array Cost

FPGA Price Leadership Without Compromises
Pricing competitive with ASICs High Performance On-chip SelectRAMTM Spartan $395 SpartanXL $295 0.5 3LM More Features 5 Volt Price Spartan-II < $200 Spartan Next Generation < $150 0.35 5LM 3.3 Volt 0.25 5LM 0.18 2.5 Volt 1.8 Volt 2002 *Prices are for 5K system gates, 100K units, -3 speed, Lowest Cost Package

CPLD Price Leadership Without Compromises Flexible ISP
Highest Performance Pin-Locking Full JTAG $15 XC95216 Price $9 $1.80 XC9536 $0.80 1998 1999 2000 2001 2002 * Prices are based on 100Ku+, slowest speed grade, lowest cost package

Priced for High-Volume Leadership
200K $20 New Applications Set Top Box DVD Digital Camera PC Peripherals Consumer Electronics 100K 100K $10 Density (System Gates) 60K 60K 40K 25K $20 15K 10K gates/$ in 2002! $10 1997 1998 1999 2000 2001 2002 100K unit volume price projections

The Real Cost of Ownership
Even in mid & high density, FPGAs often have cost advantage FPGA vs ASIC goes far beyond obvious unit costs calculations Real Comparison includes Real factors Programmable FPGA Gate Array (Application Specific Integrated Circuit) Higher unit cost Standard Product Off the shelf delivery Fast Time to Market No Non-Recurring Eng. Fee No inventory risk Fully factory tested Simulation helpful In-Circuit verification (-) (+) Lower unit cost Custom Product Months to manufacture Slow Time to Market NRE+ Customer specific User Test Development Simulation Critical No In-Circuit verification

Cost Calculations - Basic Model
Breakeven - Solve for X (units)
ASIC Cost = FPGA Cost
$25K NRE + $79K Engineering & Tools + X * $10 = $0 NRE + $25K Engineering & Tools + X * $30
X = 54K / 20
X = 2,700 units
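The breakeven step is simple enough to check in code. The sketch below is a generic helper, not the Xilinx estimator; the name `breakeven_units` is illustrative. It assumes the reading implied by the slide's final step, where the fixed-cost gap is $54K, i.e. $79K of total ASIC fixed cost against $25K for the FPGA.

```python
def breakeven_units(fixed_a, unit_a, fixed_b, unit_b):
    """Volume at which option A (higher fixed cost, lower unit cost)
    costs the same as option B (lower fixed cost, higher unit cost)."""
    if unit_b <= unit_a:
        raise ValueError("option B must have the higher unit cost")
    return (fixed_a - fixed_b) / (unit_b - unit_a)


# Assumed inputs: $79K ASIC fixed cost, $10/unit; $25K FPGA fixed cost, $30/unit.
basic_breakeven = breakeven_units(79_000, 10, 25_000, 30)
```

Under those assumptions the result is 2,700 units, the figure shown on the slide. Below that volume the FPGA is the cheaper option.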

Cost Calculations - Market Model
Being late to market costs real money.
Total ASIC Development = 32 weeks
Total FPGA Development = 11 weeks
% of Lost Revenue = (Delay * (3W - Delay) / 2W^2) * 100 = (5.25 (3* ) / 36^2) * 100 = 19.75%
Net Profit = Volume * (System Price - System Cost) = ($2K - $1.1K) * (1K + 12K + 5K) = $16,200,000
Lost Profit = 19.75% * $16.2M = $3.2M
ASIC Cost = $25K NRE + $79K Engineering + $3.2M Lost Profit + X * $10
FPGA Cost = $25K Engineering + X * $30
Breakeven, X = 162,700 units
[Figure: revenue versus time over a product life of 2W, comparing maximum available revenue with maximum revenue from delayed entry.]
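The market model can be sketched the same way. The lost-revenue formula and the profit inputs come from the slide; the function names are illustrative. Because the slide's worked substitution is incomplete, the 19.75% figure is taken as given rather than recomputed, and the breakeven again assumes $79K of total ASIC fixed cost.

```python
def lost_revenue_fraction(delay, w):
    """Fraction of revenue lost to a late start: D * (3W - D) / (2 * W^2).

    delay and w must use the same time unit; the product life is 2 * w.
    """
    if not 0 <= delay <= w:
        raise ValueError("delay must lie between 0 and w")
    return delay * (3 * w - delay) / (2 * w ** 2)


def net_profit(volume, system_price, system_cost):
    return volume * (system_price - system_cost)


profit = net_profit(1_000 + 12_000 + 5_000, 2_000, 1_100)
lost_profit = 0.1975 * profit  # the slide's percentage, taken as given

# Breakeven with the slide's rounded $3.2M lost profit added to the ASIC side.
market_breakeven = (79_000 + 3_200_000 - 25_000) / (30 - 10)
```

The profit comes to $16.2M, the lost profit to roughly $3.2M, and the breakeven to 162,700 units, in line with the slide. As a sanity check on the formula, a delay of zero loses nothing and a delay equal to W loses all of the revenue.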

Hardwire Technology Model
ASIC Re-spin delay & expense risk 30% PLD price reductions 25% vs. 5% per year Hardwire Technology lowers FPGA cost 40-60% No additional design work or test vectors Preserves nets, placement, routing All FPGA characteristics maintained Total ASIC Cost = $25K + $79K + $5.3M + $22.8K K + X * $10 FPGA/HWire Cost = $25K Engineering + 1K*$30 + $18K NRE + (X-1K) Units * $18 Breakeven, X = 674,000 units !!! Download the Xilinx ASIC Estimator program at to compare costs or learn more.
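The Hardwire comparison has a two-stage cost on the FPGA side: the first 1K units ship as FPGAs at $30, and later units ship as Hardwire parts at $18 after an $18K NRE. A hedged sketch, with illustrative function names and with the garbled "$22.8K K" term read as $22.8K (an assumption):

```python
ASIC_FIXED = 25_000 + 79_000 + 5_300_000 + 22_800  # terms as printed on the slide


def asic_cost(units):
    return ASIC_FIXED + units * 10


def fpga_hardwire_cost(units, fpga_units=1_000):
    """First fpga_units at $30 each, the rest as Hardwire parts at $18 each."""
    early = min(units, fpga_units)
    later = max(units - fpga_units, 0)
    hardwire_nre = 18_000 if later > 0 else 0
    return 25_000 + early * 30 + hardwire_nre + later * 18


# Above 1K units both costs are straight lines, so the crossover is closed-form.
fpga_fixed = fpga_hardwire_cost(1_000) + 18_000 - 1_000 * 18
hardwire_breakeven = (ASIC_FIXED - fpga_fixed) / (18 - 10)
```

With these rounded terms the crossover lands at about 671,000 units, close to the 674,000 quoted on the slide. The small gap presumably comes from rounding in the printed terms, but the source does not say.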

B = Basic analysis M = Market model H = Hardwire model

Section V Special Topics Xilinx Student Edition

The Xilinx Student Edition
Prentice Hall’s most requested new engineering product in Q1 ‘98 ! Complete, affordable, and practical digital design course environment for all students Predeveloped and tested lab-based course Includes Foundation Series 1.3 for students’ computers Practical Xilinx Designer lab tutorial book Coupon for XS40-005XL and XS boards ($129) Sold through bookstores by Prentice Hall and listed at $79 (ISBN ) Integrated tutorial projects cover: TTL, Boolean Logic, State Machines, Memories, Flip Flops, Timing, 4-bit and 8-bit processors Upgradeable for free to F1.4 Express with VHDL & Verilog, 40K gates, VHDL labs on the web Aug.1

The Practical Xilinx Designer
The Digital Design Process - basic concepts and TTL logic
Programmable Logic Design Techniques - programmable logic introduction and Foundation tutorial
Programmable Logic Architectures - XC9500 CPLD and XC4000 FPGA
Combinatorial Logic Design - LED decoder circuit with both CPLDs and FPGAs
Modular Designs and Hierarchy - step-wise refinement using Foundation
Electrical Characteristics of Programmable Logic - I/O drivers, timing/delay models, and power consumption
Flip-Flops - introduces sequential logic
State Machine Design - design examples for counters, drink machine, etc.
Memories - how to build memory with flip-flops and logic gates
The GNOME Microcomputer - construction and improvement of a simple 8-bit microcomputer

Xilinx Student Edition Development Boards

Section V Special Topics Xilinx University Program Participation

Section Agenda Course recommendations How to learn more
Contacts & Support Why use Xilinx? Products & Ordering Software Hardware

Course Recommendations
See

Trends in Teaching with PLDs
Increasing density and Cores enable System-level design and test on an FPGA LogiCOREs available to all universities PCI, DSP, math, other complex functions VHDL or Verilog design is commonplace PLDs in many subjects beyond Digital Design and Computer Engineering System Level Design and Test Dynamically Reconfigurable Logic Digital Signal or Video Processing Network Design Prevalent usage in required EE, CS, CE courses Students use their own computers

How To Learn More (1) AppLinx CD / Xilinx data book
On-line books, On-line Help Excellent on-line tutorials in Foundation & Express Xilinx Web Site Application notes Latest technical information and status Fast Technical Help Whatever it is, it’s probably there! Subscribe to XCELL Journal Xilinx Student Edition is great practical guide

XUP Contacts & Support XUP Staff:
Jason Feinsmith, XUP Manager USA ) Anna Acevedo, XUP Coordinator USA ) Chris Grundy, XUP European Liaison UK ) XUP Website: Xilinx commercial or university distributors Channel for product distribution, updates for listing of commercial distributors Europractice, Chip Implementation Center (Taiwan ROC), IDEC (S.Korea), Canadian MicroElectronics Corp. Technical Support Answers Database For Instructors: USA

Xilinx Donation Policy
“If a new or expanded course with lab or a research project is being added and funding is not adequate to purchase the required products at the University Program discounts, Xilinx encourages any university or college to submit a donation request.” To Purchase or To Request a Donation - What's Practical for you? If you have sufficient budget to purchase Xilinx software, development boards, and/or chips, then we encourage you to do so. We offer significant discounts for Xilinx software and Xilinx development boards. However, we recognize that very often, schools simply do not have the funding even for the discounted products. In some cases, a school might have some funding, but not enough to obtain everything that is needed for the lab. We encourage you to make the choice that you feel is right for your situation. Most importantly, if money is any barrier to your immediate use of Xilinx products, you should request a donation for what you need.

Why Xilinx? Xilinx is world’s leading Programmable Logic innovator with 55% commercial FPGA marketshare Xilinx is nearly twice as popular in the academic market as its nearest competitor Best PLD Software: Foundation; Alliance; & Synopsys partnership Best PLD hardware architectures Xilinx FPGAs and CPLDs all Reprogrammable In-System. Tri-state and dual port RAMs in FPGAs are best for computer structures, DSP, research, etc. Only vendor with dynamically & partially reconfigurable RPU’s Prentice Hall / Xilinx Student Edition includes best tools on the market with fully integrated hardware environment If you don’t have the budget, request a donation. FPGA 3K, 5K, 4K CPLD 9500 Functionality / Course Level Complexity Speed Exciting Research areas: Reconfigurable Computing Virtex, XC6200 Digital Signal Processing XC4000X Networking, PCI, Computer Architectures, Neural Nets, etc.

Computer Lab Requirements
Win ‘95, Win NT, HP, Sun, Solaris, use Xilinx software version 1.3, available now
Foundation Series Express recommended for all PC users
Other design entry tools OK too, especially on workstation v1.4
              RAM      Hard Drive   Processor
Minimum       32MB     200MB        486DX2
Much Better   32+MB    500MB        Pentium 120+

Typical Lab Setup Primary and Additional licenses *
Cables vs. PROM Programmers
Foundation Series Express package recommended for lab
Software updates
Full range of devices supported
Additional license scheme
1 US-FND-EXP-PC Primary Foundation package
9 UA-FND-EXP-PC Additional FND licenses
10 XS40-010XL XC4010XL FPGA board & cable
2 XS XS9500 CPLD board & cable
* Workstation users, use Ux-ALI-STD-WS, and substitute these for 10 XS40-010XL’s:
10 UW-FPGABOARD 3K/4K Development boards
10 UW-XCHCBL-PC XChecker cables

CPLD or FPGA? CPLD Non-volatile JTAG Testing Wide fan-in
Fast counters, state machines Combinational Logic Small student projects, lower level courses FPGA More common in schools Great for first year to graduate work Excellent for computer architecture, DSP, registered designs ASIC like design flow SRAM reconfiguration PROM required for non-volatile operation Since the software is integrated, you can teach with both !

Hardware Boards for PCs
XSTEND - Plug-in extension for XS40 & XS95’s - Purchase from XESS Corp. XS40 & XS95 Boards - Purchase from XESS or donation from Xilinx Access to I/O Pins for easy prototyping

Hardware Boards (2) H.O.T. II PCI Board1 UW-FPGABOARD2 Access to
Battery not included! Access to I/O Pins for easy prototyping (1) Purchase HOT II from VCC (2) Most popular board for the workstation. Purchase or donation from Xilinx

Summary Enhance Your Lab Curriculum with Xilinx
Students get better job offers Great products for your lab Leading, industry standard software IEEE Standard VHDL & Verilog Innovative hardware solutions Ideal from intro to graduate courses Great publications from Prentice Hall Areas of strength for research DSP, Reconfigurable Logic Xilinx = Long term Programmable Logic Solutions Leader

Appendix A: Xilinx Configurable Logic Blocks

XC4000 CLB
[Figure: XC4000 CLB block diagram. Labels: function generators F (inputs F1-F4), G (inputs G1-G4), and H (inputs F', G', H1); control inputs C1-C4 (H1, DIN, S/R, EC); clock K; S/R control; flip-flops with D, Q, SD, RD, EC; outputs X, Y, XQ, YQ.]

XC4000X I/O Block Diagram Shaded areas are not included in XC4000E family.

XC CPLDs Function Block 1 JTAG Controller Block 2 I/O Block 4 3 Global Tri-States 2 or 4 Block 3 In-System Programming Controller FastCONNECT Switch Matrix JTAG Port Global Set/Reset Global Clocks Blocks 1

XC9500 Function Block Macrocell 1 Product- AND Term Array Allocator
FastCONNECT From 2 or 4 3 Global Tri-State Clocks I/O 36 Product- Term Allocator Macrocell 1 AND Array Macrocell 18

XC9500 Function Block (2nd View)
36 Inputs Fixed Output Pin D/T Q FastCONNECT Switch Matrix Function Block Logic

Appendix B: FPGA Family Comparisons

Xilinx Spartan Series
5 Volt            XCS05     XCS10     XCS20     XCS30     XCS40
3.3 Volt          XCS05XL   XCS10XL   XCS20XL   XCS30XL   XCS40XL
System Gates      2K-5K     3K-10K    7K-20K    10K-30K   13K-40K
Logic Cells
Max Logic Gates   3,000     5,000     10,000    13,000    20,000
Flip-Flops
Max RAM bits      3,200     6,272     12,800    18,432    25,088
Max I/O
Performance       80MHz     80MHz     80MHz     80MHz     80MHz

XC4000E 5V FPGA Family 4003E 4005E 4006E 4008E 4010E 4013E 4020E 4025E
Logic Cells ,368 1,862 2,432 Max Logic Gates 3K 5K 6K 8K 10K 13K 20K 25K Typ Gate Range* 2-5K 3-9K 4-12K 6-15K 7-20K 10-30K 13-40K 15-45K (Logic + Select-RAM) Max I/O Packages: PC84 PC84 PC84 PC84 PC84 TQ100 PQ100 PQ100 TQ144 TQ144 PQ160 PQ160 PQ160 PQ160 PQ208 PQ208 PQ208 PQ208 PQ208 HQ208 HQ240 HQ240 HQ240 HQ304 PG120 PG156 PG156 PG191 PG191 PG223 PG223 PG223 BG225 BG225 PG299 100% Footprint Compatible * 20-25% of CLBs as RAM

Spartan/XC4000E/XC5200 Density
Xilinx FPGA Overview 4/30/98 Spartan/XC4000E/XC5200 Density Spartan/XL XC4000E XC5200 Logic Cells , , ,936 Typ Gate Range 2, ,000 2, ,000 2, ,000 (Logic + SelectRAM) I/O Number of Devices 5 8 5 Power Supply 5V / 3.3V 5V 5V I/O Interface 5V / 3.3V 5V 5V The medium-density Xilinx FPGAs include the Spartan Series, XC4000E family, and XC5200 family. All are available in similar density ranges. The best measure of capacity is the Logic Cell, which is a combination of a four-input lookup table and a flip-flop. The high end of the system gate range includes about 20% of the CLBs used as RAM (except for the XC5200, which does not have RAM). The XC4000E provides high speed in the most popular FPGA architecture, while the XC5200 provides low cost with a gates-only architecture. The new Spartan Series is a no-compromises solution providing high speed, on-chip RAM, and low cost. The Spartan Series will provide both 5V and 3.3V solutions. 2

XC4000X Series Density XC4000EX XC4000XL XC4000XV
Logic Cells 2, , ,448 10, ,102
Typ Gate Range 18, ,000 1, , , ,000 (Logic + SelectRAM)
I/O
Number of Devices
Power Supply 5V 3.3V 3.3V + 2.5V
I/O Interface 5V 5V / 3.3V 5V / 3.3V / 2.5V
The XC4000 architecture has been extended to higher densities with the XC4000X Series. The series consists of three families: the XC4000EX at 5V, the XC4000XL at 3.3V, and the XC4000XV at 2.5V. The XC4000XV offers the option of a 3.3V I/O supply, and Xilinx 3.3V I/Os are completely compatible with 5V logic. In addition to lowering the power supply, each successive family offers higher density ranges.

Common Features
                               Spartan   XC4000    XC5200
Function Generators/CLB        3         3         4
Flip-flops/CLB                 2         2         4
Global Nets                    8         8         4
Global Three-State Control     Yes       Yes       Yes
Carry Logic                    Yes       Yes       Yes
Internal Three-State Buffers   Yes       Yes       Yes
Boundary Scan Logic            Yes       Yes       Yes
Output Drive (Sink)            12 mA     12 mA     8 mA
All families of Xilinx FPGAs are based on Configurable Logic Blocks (CLBs) that contain four-input lookup tables and dedicated flip-flops. Global nets are used to clock the flip-flops or drive high fanout signals. Carry logic increases the speed and density of arithmetic functions.

Differentiating Features
            Spartan   XC4000         XC5200
LCs/CLB
RAM         Sync.     Sync./Async.   None
PCI         Yes       Yes            No
Decode      No        Yes            No
Wired-AND   No        Yes            No
I/O FFs     Yes       Yes            No
Config      Ser       Par/Ser        Par/Ser
Packages
These are some of the key features that differentiate the three basic families of Xilinx FPGAs. The Spartan architecture is based on the XC4000 Series but has been streamlined for low cost. LCs/CLB is Logic Cells per Configurable Logic Block; the Spartan and XC4000 CLBs have a three-input lookup table that is counted as logic cells. PCI support requires not only high speed, but an architecture that provides the necessary functionality. The XC5200 does not have on-chip RAM or flip-flops in the I/O blocks. The Spartan Series offers two serial configuration modes, while the other families also offer parallel configuration modes.
Complete pinout compatibility within Spartan Series
Not directly pinout-compatible with XC4000/XC5200
- Spartan has only one MODE pin
- Mode pin cannot be used as I/O

Xilinx XC4000-based Architecture Comparison
                       Spartan/XL   XC4000X   XC4000E
Extended Routing       No           Yes       No
Fast Capture Latch     No           Yes       No
Global Early Buffers   No           Yes       No
Output Mux             No           Yes       No
CLB Latches            No           Yes       No
Asynchronous RAM       No           Yes       Yes
Edge Decoders          No           Yes       Yes
Wired-AND Function     No           Yes       Yes
The Spartan and XC4000X families are both based on the highly popular XC4000E architecture. The Spartan series provides a streamlined architecture to reduce overall costs. The XC4000X architecture adds several new features to increase its ability to meet high-density design needs.

Density Comparison

Xilinx University Workshops Appendix C
Design Tool Flows

Foundation Design Entry Tools Xilinx Implementation Tools
Xilinx-Express Design Flow .VEI .VHI .UCF Reports DSP COREGen & LogiBLOX Module Generator XNF .NGO HDL Editor State Diagram Editor VHDL Verilog .V .VHD Foundation Design Entry Tools Gate Level Simulator Schematic Capture EDIF Timing Requirements Express EDIF/XNF .XNF BIT JDEC SDF Xilinx Implementation Tools H D L S I M U A T O N Behavioral Simulation Models

Xilinx Design Manager Flow 1.4 FPGA Implementation

Xilinx Design Manager Flow 1.4 CPLD Implementation

M1 Design Flow
[Figure: design flow with Cadence tools. Labels: Design Entry (Concept), mixed-level schematic/HDL; Functional Simulation (Verilog-XL, Leapfrog) with simulation libraries; Design Synthesis & Retargetability (Synergy HDL/VHDL synthesis) with synthesis libraries; Functional Simulation / Verification (OpenSIM BackPlane); Design Optimization / Partitioning for PLDs (PLD Designer); Design Optimization for FPGAs (FPGA Designer); Netlist Creation (*EDIF, XNF, VHDL, Verilog; VerilogLink/VHDLLink); Place & Route (Implementation Tools); Timing Simulation (Verilog-XL, Leapfrog) with simulation libraries; Post-Implementation Netlist & SDF (Verilog, VHDL, **SDF, *EDIF); Schematic Redraw (PLD & FPGA Designer); Timing Backannotation; Device Programming Files. *Standard Interface Netlist Format. **Standard Delay Format.]
while in M The focus is to provide the interface kit directly from Xilinx. This would be the greatest benefit, since Cadence has not been very proactive in its support of Xilinx customers. Again, the Cadence support for M1.0 is focused on providing a mixed schematic and HDL design environment, functional-level Verilog and VHDL simulation support, netlisting via EDIF into the Xilinx tools, and timing simulation with Verilog-XL (directly from the Place & Route tools, thus removing the need for translators and architecture-specific Verilog models). The HDL support with Synergy synthesis will be supported by Cadence. Customer Benefit: The transfer of development, distribution, and support of the Cadence interface to Xilinx will result in a significant improvement in our support of the Xilinx/Cadence customer base. Direct generation of the Verilog netlist from the core Place & Route tools helps phase out a tactical Verilog solution (ES-Verilog) and all the other scripts used in the release. Again, the main benefit to the customer is a significant reduction in design time, seamless support for Verilog and VHDL simulation, and increased productivity.

M1 Design Flow Implementation Tools while in M1.0.....
ABEL HDL LogiBlox LogiCores Optional VHDL Entry & Compile Behavioral Simulation Viewlogic ViewSyn Viewlogic Speedwave VHDL Synthesis Viewlogic ViewSyn Structural Simulation / Functional Simulation Schematic Entry / View Schematic Viewlogic ViewSim Viewlogic ViewDraw Netlist (XNF or *EDIF) Waveform Analysis Viewlogic ViewTrace Netlist Launcher NGDBUILD while in M M1.0 support for Viewlogic includes support for the latest WorkView office version 7.2 which supports both WIN 95 and NT, EDIF netlisting instead of wir2xnf in Pro-Series (standards), Xilinx certified VHDL timing simulation with Speedwave . Customer Benefit: The main benefit to the customer is the support of a far superior product, WorkView Office, an improved synthesis solution with Viewsynthesis which will result in significant improvement in design cycle. Addition of Speedwave will make it easy to do a mix level (schematics & HDL) design in Viewlogic environment. VHDL, *XNF Timing Simulation **SDF Viewlogic ViewSim Place & Route Timing Annotated EDIF Netlist Implementation Tools PAR (Place & Route) *Standard Interface Netlist Format ** Standard Delay Format Device Programming Files 4 5

M1 Design Flow HDL Design Flow Schematic Design Flow while in M1.0....
[Figure: two flows, each launched from the Mentor Design Manager. HDL design flow: VHDL / Verilog HDL entry (Notepad / QuickHDL), with optional ABEL HDL, LogiBlox, and LogiCores; Functional Simulation (QuickHDL); Synthesis & Optimization (Autologic II); *EDIF; Place & Route (Implementation Tools); VHDL or Verilog with *SDF; Timing Simulation (QuickHDL); Device Programming Files. Schematic design flow: Design Entry (Design Architect), with optional LogiBlox and LogiCores; Simulation Preparation (Design View Editor); Functional Simulation (QuickSim II); *EDIF; Place & Route (Implementation Tools); *EDIF w/ Timing and *SDF; Timing Simulation (QuickSim II); Device Programming Files. *Standard Interface Netlist Format. **Standard Delay Format.]
while in M The focus is to provide a mixed design environment with HDL and schematics. This will include mixed design entry with EDIF netlist, VHDL & Verilog HDL code, top-level XNF, and schematics; certified VHDL and Verilog timing simulation with QuickHDL Pro; and backannotation of timing with SDF. The design will be debugged with schematics generated with Mentor’s schematics generator, which is far superior to the gen_sch8 program currently available with the XACT 5.2 kit. There are a number of issues related to using XBLOX which have been addressed with the new LogiBLOX methodology. Customer Benefits: The main benefit is support of standards like EDIF, Verilog & VHDL, and SDF, which fits in nicely with the existing mixed design environments of our high-end Mentor customers. Schematics currently generated with gen_sch8 in the kit are not readable and force the customer to read the XNF file in order to debug the design. This and the new LogiBLOX will significantly reduce the design cycle, which is very important for the high-end customer base in 4KEX and after.

Post-layout Verification
M1 Synopsys Design Compiler Design Flow Xilinx Unified Libraries VHDL/VERILOG Models HDL Source File (VHDL or Verilog HDL) Functional Simulation Synopsys VHDL System Simulator or 3rd Party VHDL/VERILOG Simulator Synthesis Synthesis Library LogiBlox Synopsys FPGA Compiler or Design Compiler Simulation Library LogiCores Netlist (XNF or *EDIF) Optional Post-layout Verification Static Timing Verification Synopsys VSS Simulator Netlist Launcher NGDBUILD Constraints File Static Timing Report while in M1.0…. Designers will be able to take advantage of the Xilinx Designware Libraries. These libraries of optimized, pre-compiled datapath functions provide the designer with higher quality and decreased runtimes because the logic is merged rather than compiled. Designers will also be able to take advantage of the High Level Design Link (HLDL) to pass timing constraints prior to synthesis. Designers will benefit from more accurate and traceable constraints passing. Place & Route Timing Simulation VHDL, VERILOG, *SDF Synopsys VSS Simulator or 3rd Party VHDL/VERILOG Simulator Implementation Tools PAR (Place & Route) Device Programming Files *Standard Interface Netlist Format ** Standard Delay Format 8 3

M1 Design Flow Implementation Tools While in M1.0
XNF modules (Created by HDL Synthesis tools) ABEL HDL Schematic Entry OrCAD/ESP Design Environment Functional Simulation OrCAD Simulate XSimMake LogiBlox Netlist (XNF or *EDIF) LogiCores Optional Netlist Launcher NGDBUILD VHDL, *XNF While in M1.0 Orcad has taken over the distribution of Xilinx support in their product (FPGA pack) and is shipping the product for last 6 months. The new Capture and Simulate tools are Windows based, support WIN 95 and NT with VHDL simulation. Orcad has also announced the shipment of Synthesis tool and thus is moving up to support the Top-Down designs we expect our customers to transition to. The M1.0 support will include support of standard design flows with EDIF and VHDL. The benefit to our users will be improved design performance and increased productivity. Place & Route *SDF Implementation Tools PAR (Place & Route) Device Programming Files *Standard Interface Netlist Format ** Standard Delay Format 10 4

Synplicity Design Flow

Xilinx University Workshop Appendix D XChecker Cable
and Configuration *Note: Although differences are very minimal, this information has not been updated to reflect M1 information.

Use XChecker Cable to Simplify Verification
Downloading allows quick verification of design in circuit Bitstream downloaded via computer’s serial port directly into FPGA No PROM programming required Design changes and verifications made quickly Readback sends configuration data and flip-flop values back out of chip Verifies correct configuration Allows in-circuit “probing” of all signals Can occur while the FPGA is running Uses no CLBs or routing resources

Enabling Configuration Readback
Readback Trigger input starts serial readback XC3000 controlled via Bitstream Generator Default is enabled Data and trigger connected to Mode pins XC4/5000 controlled via schematic and Bitstream Generator Include Readback symbol in schematic Connect TRIG and DATA to I/O pins Can use MD0 and MD1 CLK DATA XChecker RD READBACK OPAD TRIG RIP OBUF XChecker RT (MD1) IPAD IBUF (MD0)

Available Readback Data
Data includes all storage elements in device XC4000/XC5000 readback data includes all outputs of CLBs and IOBs XC4000/XC5000 data is captured when readback is triggered XC3000 data is captured as readback progresses May want to stop system clock for logic verification Requires XChecker control of system clock

Cable Setup (XACT™step v6)
Connect cable to computer’s serial port Power up cable via target’s VCC and GND Hardware Debugger should find cable automatically Cable -> Communications... allows change to port

Downloading a Design (XACT™step v6)
Connect cable to target VCC, GND, PROG, DONE, DIN, CCLK Put device into slave mode Select Download -> Download Design or the “lightning” toolbar icon Can verify configuration with Download -> Verify Bitstream or “checkmark” toolbar icon Requires configuration readback to be enabled

Control Panel Defines Debug Session (XACT™step v6)
Opens automatically for Debug Allows direct control of: System clock source definition and application Readback trigger source definition and application Number of readbacks Display options

Use XChecker or System as Clock Source (XACT™step v6)
Can use one of four clock speeds in cable, or route system clock into cable for user control

Specify When Readback is to be Triggered (XACT™step v6)
Can trigger on external event Specify number of clocks between multiple readbacks

Choose Signals to Display (XACT™step v6)
Select Display... in Control Panel Can define groups first with Groups... button Can select flip flops in XC3000, RAM bits in XC4000, and combinatorial outputs in XC4000/XC5000 Filter for desired signals

Activating Readback (XACT™step v6)
Select Read button in Control Panel Waveform opens automatically Note waveform reflects several steady state conditions, not timing

How to Use Programmable Logic to Build Fast and Efficient DSP Functions
XUP Workshop Appendix E
Originally created by: Greg Goslin Xilinx, Corporate Applications

Constraint Driven Design Methodology
[Figure: constraints (system requirements and hardware limitations) feeding constraint-driven design methodologies. Labels: Data Rate (inputs, outputs, multi-channel I/O); Quality (number of bits/taps, number of operations, error tolerance); Processor Power; Clock Rate; Options; Performance; Efficiency.]
Constraint Driven Design Methodology: In a “Constraint Driven Design” the design engineer will know or derive the following initial parameters for the design: 1> Data Rate: 2> Quality: 3> Processor Power & Clock Rate: The constraints are defined by the system in which the design is to be integrated. (Non-engineering example: If you have a fireplace with an 18 inch width and 16 inch depth, the logs had better be shorter than 18 inches.) These factors define the physical hardware requirements for the system. If one or more of these parameters is not met, the design is of zero value. (Any excess is of zero added value.) The FPGA is adaptive to these constraints. The use of Distributed Arithmetic will be described later to show how to design using the least amount of resources for a given performance specification. Time-multiplexed circuit design for low-rate data allows for circuit reduction. High-rate systems might require pipelining and block separation, which use more resources. The quality also affects the size of the circuit. An example would be a filter. As the filter specifications get tighter, the order (number of taps) increases, as does the number of bits per coefficient. This will also affect the performance characteristics.

Building Fast and Efficient Filters in FPGAs
Efficient Filter Algorithms for FPGAs
Distributed Arithmetic: Bit-Serial, n-Bit Parallel
Using Distributed Arithmetic for Filter Designs
Serial FIR Filter Example
Two-Bit Parallel FIR Example
Full Parallel FIR Example
This is where the fun begins. Now we will look at design implementation techniques for the FPGA. (Dismiss all managers and non-technical attendees from the seminar!) We will walk through a filter design and show how to build efficient blocks in the FPGA. We will look at: a. Serial Distributed Arithmetic, Mid-Range Data Rate (5MHz) b. 2-Bit Parallel Distributed Arithmetic, Higher Rate (10MHz) c. Full Parallel Distributed Arithmetic, The Highest Rate (70MHz) Building functional blocks and expanding the range of functionality.

FIR FILTER EXAMPLE
Sum of Products Equation: K Multiplies, K Sums. SAMPLE DATA N BITS WIDE; K TAPS LONG (X0, X1, X2, ...); K COEFFICIENTS (C0, C1, C2, ...). CLOCK = Multiply Time; Sample Rate = Clock Rate. IMPLEMENTATION ???
This foil outlines the basic structure of a Finite Impulse Response (FIR) Filter, or moving average (MA) filter. The theory behind the FIR Filter isn’t really important here; what matters is understanding the design flow. FIR Filters go by many names across many application areas, and some of those names also refer to other functional implementations, which makes the subject look complicated on the surface. Bartlett, Blackman, Hamming, Hanning, Kaiser, and Rectangular are a few FIR Filter design methods.
Common FIR Applications:
Hi-Fi Audio (Echo, Equalizers, delta-sigma Codecs)
Modem, Digital Radio, ISDN & HDSL (Equalizer, Echo Canceler, Raised Cosine & Hilbert Filters)
Radar (Pulse Compression matched filters)
Video & HDTV
The real action of the FIR Filter is to produce a weighted average: the sum of each Tap (a sample stored per unit time) times a coefficient (the sample weighting). Changing the coefficient values changes the characteristics of the filter. The number of Taps also affects the filter characteristics. It is beyond the scope of this seminar to teach filter theory, and filter theory is not required to understand how to build the functions in the FPGA. Let the Customer (Design Engineer) specify the filter or DSP design through a block diagram. With experience you will gain more and more insight into the design.
It is critical to understand the way in which the data flows through the FIR process. New Data samples are loaded into the upper left Tap in the diagram. The previously stored sample is shifted to the next Tap. This process continues through all the Taps. (Note that the number of taps can range from one to thousands.) Once the new sample data is loaded, each Tap is multiplied by a Coefficient. The product terms are then added together to give the final filtered data output.
In a GP-DSP this process requires K multiplies and K additions to perform a K-Tap filter function. With a single-clock-cycle MAC, the GP-DSP needs K clock cycles per output sample. Hence, as K increases the data rate decreases. How do you implement this in an FPGA, and why? The next few foils will show just that!
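The data flow described above can be sketched in a few lines of Python. This is only an illustrative software model of the sum of products, output = C0*X0 + C1*X1 + ... + C(K-1)*X(K-1); the function name `fir_direct` and the list-based tap line are assumptions made for the sketch, not part of the seminar material.

```python
def fir_direct(samples, coeffs):
    """Direct-form FIR: one multiply-accumulate (MAC) per tap per output."""
    taps = [0] * len(coeffs)          # tap delay line, taps[0] is the newest sample
    out = []
    for x in samples:
        taps = [x] + taps[:-1]        # load the new sample, shift the others one tap
        acc = 0
        for d, c in zip(taps, coeffs):
            acc += d * c              # K multiplies and K additions per output
        out.append(acc)
    return out


if __name__ == "__main__":
    # An impulse reads the coefficients back out, one per sample period.
    print(fir_direct([1, 0, 0, 0], [3, -2, 5]))
```

The inner loop is the part a single-MAC processor has to execute serially, which is why the achievable sample rate falls as the number of taps grows.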

2’s Complement Math
The 2’s Complement of a number: Invert (1’s Complement), then Add 1.
(-6) the 2’s Comp. is (Invert) , (Add 1) Equals: (+6)
Leading 1’s and 0’s are only place holders (sign extending a 2’s Comp. number doesn’t change its value):
XMSB ... X2 X1 X0 equals XMSB XMSB XMSB ... X2 X1 X0
The following 2’s Complement pairs are the same: FFFF = FF, = 01, = 1101
Adding 2’s Complement numbers: Sign Extend. The MSB (sign bit) must be extended to allow for word growth. (Note: Ignore Overflow)
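As a quick check of the two rules on this foil, here is a small Python sketch. The helper names (`twos_complement`, `sign_extend`, `to_int`) and the 4-bit and 8-bit word sizes are assumptions chosen for the illustration.

```python
def twos_complement(value, bits):
    """Negate a bit pattern: invert every bit (1's complement), then add 1."""
    mask = (1 << bits) - 1
    return ((value ^ mask) + 1) & mask


def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit (MSB) into the new leading bit positions."""
    if (value >> (from_bits - 1)) & 1:
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


def to_int(value, bits):
    """Interpret a bit pattern as a signed 2's complement integer."""
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value


if __name__ == "__main__":
    minus6 = twos_complement(0b0110, 4)            # negate +6 in a 4-bit word
    print(format(minus6, "04b"), to_int(minus6, 4))
    wide = sign_extend(minus6, 4, 8)               # same value in an 8-bit word
    print(format(wide, "08b"), to_int(wide, 8))
```

Sign extension leaves the numeric value unchanged, which is what lets an adder be made one bit wider to absorb word growth.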

8-Bit X 8-Bit Signed Multiply
Multiplicand: B7B6B5B4B3B2B1B0. Multiplier: A7A6A5A4A3A2A1A0.
Partial products (each row SIGN EXTENDed and offset one bit further left than the row above):
A0(B7B6B5B4B3B2B1B0)
A1(B7B6B5B4B3B2B1B0)
A2(B7B6B5B4B3B2B1B0)
A3(B7B6B5B4B3B2B1B0)
A4(B7B6B5B4B3B2B1B0)
A5(B7B6B5B4B3B2B1B0)
A6(B7B6B5B4B3B2B1B0)
+ A7(B7B6B5B4B3B2B1B0)
= S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
First, let’s review binary multiplication. If we multiply two signed binary numbers of n bits each, the product is 2*n bits wide (equal to the sum of the number of bits in each word being multiplied). The process multiplies (an AND operation) each multiplier bit (A) by the multiplicand word (B). Each multiplier bit is bit weighted to maintain the bit order of the resulting partial product Sum term (S). Starting with the Least Significant Bit of the multiplier word, each bit is multiplied with the multiplicand word and added to the product sum.
Note: Moving from LSB to MSB, each multiplier bit (A) carries twice the bit weight of the bit before it. Hence, a one-bit shift between consecutive product terms (equivalently, shifting the running sum right one bit before adding in the next term) maintains the correct bit weighting for the product Sum (S). A Booth Multiplier does just that, performing the multiplication through a series of cascaded adders and AND-gated buses. This circuit gets quite large, though it can be fast with pipelining. For DSP functions with parallel MACs, the Booth-based multiplier circuit grows too fast for an efficient implementation. This is where Distributed Arithmetic comes to play a key role in design implementation.

8-Bit X 8-Bit Signed Multiply
Multiplicand: B7B6B5B4B3B2B1B0. Multiplier: A7A6A5A4A3A2A1A0.
The same partial products listed from MSB to LSB with their bit weights (SE = SIGN EXTEND):
SE(B7B6B5B4B3B2B1B0)*A7 2^7
SE(B7B6B5B4B3B2B1B0)*A6 2^6
SE(B7B6B5B4B3B2B1B0)*A5 2^5
SE(B7B6B5B4B3B2B1B0)*A4 2^4
SE(B7B6B5B4B3B2B1B0)*A3 2^3
SE(B7B6B5B4B3B2B1B0)*A2 2^2
SE(B7B6B5B4B3B2B1B0)*A1 2^1
+ SE(B7B6B5B4B3B2B1B0)*A0 2^0
= S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0
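The partial product scheme on these two foils can be modeled in Python as follows. This is a sketch under stated assumptions: the function name `signed_multiply` is hypothetical, Python's unbounded integers stand in for the sign extension (SE), and the partial product gated by the multiplier's sign bit is subtracted because that bit carries weight -2^(n-1) in 2's complement, as the minus sign on the A3 term of the 4-bit tree multiplier foil suggests.

```python
def signed_multiply(a, b, n=8):
    """Multiply two n-bit 2's complement integers by summing partial products."""
    a_bits = a & ((1 << n) - 1)            # multiplier bit pattern A(n-1) ... A0
    product = 0
    for i in range(n):
        if (a_bits >> i) & 1:              # AND gate: B passes only when Ai is 1
            partial = b << i               # bit weight 2^i, sign extended by Python
            if i == n - 1:
                partial = -partial         # sign bit of the multiplier: weight -2^(n-1)
            product += partial
    return product


if __name__ == "__main__":
    print(signed_multiply(-6, 7))          # same result as -6 * 7
```

Each loop pass corresponds to one row of the partial product array, so a fully parallel version needs one gated bus and one adder stage per multiplier bit.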

4-Bit Signed Tree Multiplier
[Figure: 4-Bit Signed Tree Multiplier. Sign-extended, gated copies of B (B * A3, B * A2, B * A1, B * A0) are registered and summed in a tree of signed adders with registers.]
{ 1/2 B*A2 - B*A3 } (1 CARRY IN): -A3 *{ B3B3B2B1B0 } +A2 *{ B3B3B3B2B1B0 } = { P7P6P5P4P3P2 }
{ 1/2 B*A0 - B*A1 } (1 CARRY IN): +A1 *{ B3B3B2B1B0 } +A0 *{ B3B3B3B2B1B0 } = { P5P4P3P2P1P0 }
Full product: -A3 *{ B3B3B2B1B0 } +A2 *{ B3B3B3B2B1B0 } +A1 *{ B3B3B3B3B2B1B0 } +A0 *{ B3B3B3B3B3B2B1B0 } = { P7P6P5P4P3P2P1P0 }
CLB usage: 16 Gated Bits and Reg = 8 CLBs; 5-bit Signed Adder & Reg = 3 CLBs (two of these); 6-bit Signed Adder & Reg = 4 CLBs; Total = 18 CLBs.

D.A. ONE TAP FIR FILTER = D0 C0
REDUCES TO MULTIPLYING A VARIABLE TIMES A CONSTANT
2 WORD X N BIT LOOK UP TABLE: address A[0] = 0 holds zero, address A[0] = 1 holds C0.
[Figure: SAMPLE DATA (N BITS WIDE) is loaded into the shift register (Xn ... X3 X2 X1 X0). X0 drives the LUT address (A0), the LUT DATA drives the B input of the Scaling Accum., and the registered output is fed back through a 2^-1 scaling to produce FILTERED DATA OUT. The partial products X0(B7...B0) through X7(B7...B0) are added one per clock; the sum grows from S9...S0 after the X1 term to S15...S0 after the X7 term.]
This foil is the starting point for Distributed Arithmetic. The circuit represents a single-Tap filter, or a multiplier block. Refer to the white paper for more details on the specifics.
Basic concepts of Distributed Arithmetic: let’s look at the data flow (we will look at the actual implementation in the FPGA later).
Data-in is loaded into the Parallel Load, Serial out, Shift Register (Xn, ..., X1, X0). This data is the multiplicand word. The Multiplier (Coefficient) is stored in a Look Up Table (LUT). For this example the LUT is two words by n bits, so the address space is [0, 1]. The LSB of the multiplicand is present at the end of the Parallel Load, Serial out, Shift Register. Because it is a one-bit value, it is always a one or a zero.
When the Parallel Load, Serial out, Shift Register is loaded, the output register on the Scaling Accumulator is cleared. The LSB (X0) is used to address the LUT [A0], so X0 acts as a binary mask via the LUT. If the LSB is a zero, the partial product of zero times the multiplier is zero; hence the value in the LUT at address zero [0] is zero. If the LSB is a one, the partial product of one times the multiplier is the n-bit multiplier word; hence the value in the LUT at address one [1] is the multiplier word (Coefficient).
The output from the LUT is placed on the input to the Scaling Accumulator and added to the partial product sum. Because the Scaling Accumulator’s output was cleared before the first bit is processed, the first LUT output is simply registered at the output of the Scaling Accumulator. This represents the value at the top of the summation tree shown in the diagram (X0 times B). This value is then shifted right one bit (a divide by two) to adjust for the bit weighting (scaling) of the next incoming partial sum.
On the next clock cycle the second input bit is shifted out to address the LUT. The LUT performs the same function as before, its output is added in the Scaling Accumulator, and the Scaling Accumulator scales the result as before. This process continues until the MSB is processed. Once the MSB is processed, the output from the Scaling Accumulator represents the full product term (S).
It can be seen from this example that it takes n clock cycles to process an n-bit word. Note that this is independent of the size of the coefficients. The maximum clock frequency can easily be calculated by analyzing the carry-chain of the Scaling Accumulator (the indirect influence of the coefficient), which will most likely be the most significant delay path in the design. The carry-chain depends only on the number of bits in the LUT word. This makes it easy to estimate the performance of a design in an FPGA. (Maximum carry-chain delay, times the number of clock cycles required.)
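The one-tap data flow can be modeled in Python as a minimal sketch. Several things here are assumptions made for the illustration rather than details from the foil: the function name `da_multiply`, the pre-alignment of each LUT word by (n - 1) bits so that the right shifts never discard bits (a software stand-in for keeping the shifted-out LSBs), and the subtraction of the final partial product so that signed samples work (the sign bit of a 2's complement word has negative weight).

```python
def da_multiply(sample, coeff, n=8):
    """Bit-serial multiply of an n-bit signed sample by a constant coefficient."""
    lut = [0, coeff]                          # 2-word LUT: address 0 -> zero, address 1 -> C0
    shift_reg = sample & ((1 << n) - 1)       # parallel load of the sample bit pattern
    acc = 0                                   # scaling accumulator cleared on load
    for cycle in range(n):
        term = lut[shift_reg & 1] << (n - 1)  # LUT output addressed by the current LSB
        shift_reg >>= 1                       # serial shift: next bit moves to the output
        if cycle == n - 1:
            term = -term                      # last bit is the sign bit (assumption, see text)
        acc = (acc >> 1) + term               # scale the running sum by 1/2, add the new term
    return acc


if __name__ == "__main__":
    print(da_multiply(-100, 37))              # same result as -100 * 37
```

The loop runs n times whatever the value or width of the coefficient, which matches the observation that the cycle count depends only on the sample word size.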

D.A. TWO TAP FIR FILTER = D0 C0 + D1 C1
4 WORD X N BIT LOOK UP TABLE, addressed by A[10]:
A1, A0 : VALUE
0, 0 : zero
0, 1 : C0
1, 0 : C1
1, 1 : {C1 + C0}
[Figure: SAMPLE DATA (N BITS WIDE) is loaded into the D0 shift register (XN ... X2 X1 X0), which cascades into the D1 shift register. The serial outputs drive LUT addresses A0 and A1; the LUT DATA feeds the B input of the Scaling Accum., whose registered output is fed back through a 2^-1 scaling to produce FILTERED DATA OUT. The partial sums (X0,i, X1,i)(B7...B0) for i = 0 to 7 are added one per clock; the sum grows from S9...S0 to S15...S0.]
Now let’s add an additional Tap to the Filter design. The process is the same; it just has two address inputs (Taps or Coefficients).
Data-in is loaded into the Parallel Load, Serial out, Shift Register (Xn, ..., X1, X0) and the previous Data Sample is now in the D1 Register. The D1 Register is actually loaded during the serial bit shift: as each bit is processed, the output of the top shift register is loaded into the cascaded shift register below. Thus, when the Parallel Load, Serial out, Shift Register has completely shifted out the loaded data sample, the following cascaded Serial in, Serial out Shift Register has been loaded with the previous data sample. These two data samples are the multiplicand words.
The Multiplier values (Coefficients) are stored in a Look Up Table (LUT) just like before. For this example the LUT has four words by n bits, so the address space is [00, 01, 10, 11]. The LSBs of the two multiplicands are present at the ends of the Parallel Load, Serial out, Shift Register and the Serial in, Serial out Shift Register. Because each is a one-bit value, each address line is always a one or a zero.
When the Parallel Load, Serial out, Shift Register is loaded, the output register on the Scaling Accumulator is cleared. The LSB (X0) from the Parallel Load, Serial out, Shift Register is used to address the LUT [A0]. The LSB (X0) from the Serial in, Serial out, Shift Register is used to address the LUT [A1]. If an LSB is a zero, the partial product of zero times the multiplier is zero. If an LSB is a one (“true”), the partial product of one times the multiplier is the n-bit multiplier word. The output value of the LUT at address [A1, A0] is the sum of the “true” multiplier words (Coefficients), as listed in the table above.
The output from the LUT is placed on the input to the Scaling Accumulator and added to the partial product sum. Because the Scaling Accumulator’s output was cleared before the first bits are processed, the first LUT output is simply registered at the output of the Scaling Accumulator. This value is shifted right one bit (a divide by two) to adjust for the bit weighting, as discussed earlier. On the next clock cycle the second input bit of each sample is shifted out to address the LUT, and the output of the LUT is added in the Scaling Accumulator. This process continues until the MSB is processed. Once the MSB is processed, the output from the Scaling Accumulator represents the full sum of products (S).
It can be seen from this example that it takes n clock cycles to process an n-bit word, as before. Note that this is independent of the number of coefficients or Taps. The number of CLBs used for the LUTs does not change between one and four input values (address bits). We will look at the actual design implementation later.

D.A. THREE TAP FIR FILTER = D0 C0 + D1 C1 + D2 C2
8 WORD X N BIT LOOK UP TABLE, addressed by A[210]:
A2, A1, A0 : VALUE
0, 0, 0 : zero
0, 0, 1 : C0
0, 1, 0 : C1
0, 1, 1 : {C1 + C0}
1, 0, 0 : C2
1, 0, 1 : {C2 + C0}
1, 1, 0 : {C2 + C1}
1, 1, 1 : {C2 + C1 + C0}
[Figure: SAMPLE DATA (N BITS WIDE) is loaded into the D0 shift register, which cascades into D1 and D2. The three serial outputs drive LUT addresses A0, A1 and A2; the LUT DATA feeds the B input of the Scaling Accum., whose registered output is fed back through a 2^-1 scaling to produce FILTERED DATA OUT. The partial sums (X0,i, X1,i, X2,i)(B7...B0) for i = 0 to N are added one per clock; the sum grows from S9...S0 up to S(N+M)...S0.]
Now let’s add an additional Tap to the Filter design. The process is the same; it just has three address inputs (Taps or Coefficients). The same old story...
Data-in is loaded into the Parallel Load, Serial out, Shift Register (Xn, ..., X1, X0) and the previous Data Samples move into the next cascaded Registers (D0 >> D1 and D1 >> D2). These three data samples are the multiplicand words.
The Multiplier values (Coefficients) are stored in a Look Up Table (LUT) just like before. For this example the LUT has eight words, so the address space is [000, 001, 010, 011, 100, 101, 110, 111]. The LSBs of the three multiplicands are present at the ends of the Parallel Load, Serial out, Shift Register and the two Serial in, Serial out Shift Registers. Because each is a one-bit value, each of the three address lines is always a one or a zero.
When the Parallel Load, Serial out, Shift Register is loaded, the output register on the Scaling Accumulator is cleared. The LSB (X0) from the Parallel Load, Serial out, Shift Register addresses the LUT [A0], the LSB (X0) from the first Serial in, Serial out, Shift Register addresses the LUT [A1], and the LSB (X0) from the second Serial in, Serial out, Shift Register addresses the LUT [A2]. If an LSB is a zero, the partial product of zero times the multiplier is zero. If an LSB is a one (“true”), the partial product of one times the multiplier is the n-bit multiplier word. The output value of the LUT at address [A2, A1, A0] is the sum of the “true” multiplier words (Coefficients), as listed in the table above.
The output from the LUT is placed on the input to the Scaling Accumulator and added to the partial product sum. Because the Scaling Accumulator’s output was cleared before the first bits are processed, the first LUT output is simply registered at the output of the Scaling Accumulator. This value is shifted right one bit (a divide by two) to adjust for the bit weighting, as discussed earlier. On the next clock cycle the second input bit of each sample is shifted out to address the LUT, and the output of the LUT is added in the Scaling Accumulator. This process continues until the MSB is processed. Once the MSB is processed, the output from the Scaling Accumulator represents the full sum of products (S).
It can be seen from this example that it takes n clock cycles to process an n-bit word, as before. Note that this is independent of the number of coefficients or Taps. This process can be continued for as many Taps as needed. The details of the design implementation will now be discussed through an example.
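The multi-tap scheme generalizes directly. The Python sketch below builds the LUT of coefficient sums and runs the bit-serial accumulation for any number of taps. The names `build_lut` and `da_fir_output` are hypothetical, and, as an assumption for signed samples, the partial sum addressed by the sign bits is subtracted on the final cycle; each LUT word is pre-aligned by (n - 1) bits so the right shifts lose nothing.

```python
def build_lut(coeffs):
    """Word at address A holds the sum of every coefficient whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_fir_output(taps, coeffs, n=8):
    """One filter output from K stored n-bit signed samples, in n clock cycles."""
    lut = build_lut(coeffs)
    regs = [d & ((1 << n) - 1) for d in taps]     # one serial shift register per tap
    acc = 0                                       # scaling accumulator cleared on load
    for cycle in range(n):
        addr = 0
        for k, r in enumerate(regs):
            addr |= ((r >> cycle) & 1) << k       # bit `cycle` of tap k drives address line Ak
        term = lut[addr] << (n - 1)
        if cycle == n - 1:
            term = -term                          # sign bits: subtract the final partial sum
        acc = (acc >> 1) + term                   # scale by 1/2 and accumulate
    return acc


if __name__ == "__main__":
    print(build_lut([1, 10, 100]))                # easy-to-read stand-ins for C0, C1, C2
    print(da_fir_output([5, -3, 7], [2, -4, 6]))  # same as 5*2 + (-3)*(-4) + 7*6
```

The number of loop passes depends only on n, while the LUT doubles in size with every added tap, which is the trade-off the following foils work around.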

Distributed Arithmetic FIR Filter
The Development of a Distributed Arithmetic FIR Filter: 10-Bit, 10-Tap, XC4000 Family Example
Now that we have gone through the theory behind Distributed Arithmetic, let’s look at building a 10-Tap, 10-Bit FIR Filter using this technique. Through this example we will look at optimizing the design implementation for the XC4000 architecture. For the initial implementation we will focus on building a circuit that uses a minimum amount of resources and still offers good performance. We will look at increasing the performance after the basic design is done.

10 BIT 10 TAP SYMMETRICAL FIR FILTER
[Figure: 10 BIT 10 TAP SYMMETRICAL FIR FILTER. SAMPLE DATA enters a PARALLEL IN, SERIAL OUT register followed by a 100 BIT SHIFT REGISTER (D0 ... D9). Serial Adders combine D0/D9, D1/D8, D2/D7, D3/D6 and D4/D5 to form the addresses A0 ... A4 of a 32 X 10 MEMORY (320 BITS). The LUT DATA passes through an XOR stage (COMPLEMENT ON LAST BIT & ADD 1, via C_I) into the 11-bit Scaling Accum. (B(9:0) with SIGN EXT to B10, LOAD ON FIRST BIT), which produces FILTERED DATA OUT on SUM(10,1). SUM(0) shifts into the OPTIONAL DOUBLE PRECISION Shift Reg. holding the Least Significant BYTE.]
Look Up Table is only 32 words by 10 bits.
This foil schematically represents a completed 10-Bit, 10-Tap Symmetrical FIR Filter. There are no assumptions other than the following:
a. 10-bit, fixed-point Data and Coefficients (2’s Complement format).
b. Coefficients are constant (static) during run time.
c. Coefficients are symmetrical (not required). For this example, C9=C0, C8=C1, C7=C2, C6=C3, C5=C4.
Now let’s look at each functional block within the application block:
1. The Parallel Load, Serial out, Shift Register and the Serial in, Serial out Shift Registers.
2. The Serial Adders (used only with symmetrical coefficients).
3. LUT-based coefficient multiplication and accumulate block.
4. Complementing block.
5. Scaled Accumulator block.
We will look at each individually and optimize its functionality inside the XC4000 FPGA architecture.

SERIAL TIME SKEW BUFFER
SAMPLE DATA WORD SIZE = N BITS, NUMBER OF TAPS = K, # OUTPUTS = # TAPS.
Register version: a PARALLEL IN, SERIAL OUT N BIT SHIFT REGISTER drives D_0, followed by One N Bit Shift Register Per Tap (D_1 ... D_k-1). 10 BIT 10 TAP = 50 CLBs.
RAM version (SHIFT REGISTER IMPLEMENTED IN RAM): Use 4000 RAM to build the Shift Register. Each RAM16X1R has DATA_I, DATA_O, A3 A2 A1 A0, WR and CLK. One 16 Bit Shift Register Per 1/2 CLB. 10 BIT 10 TAP = 10 CLBs.
The Serial Time Skew Buffer is nothing more than a cascade of serial shift registers. The design was specified as a 10-Bit, 10-Tap FIR Filter. Therefore, we need to store ten 10-Bit words. There are several ways to store data in the FPGA, so let’s look at the requirements to better understand the options we have in implementing each stage.
First, let’s look at the first shift register. The Data is loaded in parallel and shifted out serially. This implies that every bit must be accessible during the load. The only way to build this block efficiently is with Data Registers (D Flip-Flops). The Registers are also tied together in a serial cascade to shift data bits from one register to the next. (This requires a 2:1 MUXed input to each of the registers, controlled by the load signal. During a load the MUX passes the input Data word to the Parallel load, Serial out, Shift Registers. Once the Data is loaded in parallel, the MUX is switched to pass the data bits through the cascade of registers.)
The Serial in, Serial out Shift Registers are set up such that only one bit changes at a time. Because of this there is no need to access all of the bits simultaneously; the design only needs to access one of n bits. This sounds like a RAM function. The XC4000 family offers 16 times the density in RAM over Registers: each CLB can store two bits of data in Registers, or two 16-Bit words in its two LUTs (RAMs). The additional overhead of a four-bit counter is trivial in comparison.
The overall CLB utilization for the input and Data Storage Registers is easy to compute:
1. The Parallel load, Serial out, Shift Registers: (1/2-CLBs) per Bit. (First Tap only.) (e.g. 10-Bits per Data Sample = 5-CLBs)
2. The Serial in, Serial out Shift Registers: (1/2-CLBs) per (16-Bits or less) or (1-CLBs) per (32-Bits or less), times (number of Taps - 1). (All but first Tap.) (e.g. 10-Bits per Data Sample, 9-Taps = 4.5 CLBs)
Total = 5 + 4.5 = 9.5; call it an even 10-CLBs (the Counter for control will be absorbed in the control logic).
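The idea of building a serial shift register from RAM plus a small address counter can be illustrated with a short Python model. The class name `RamShiftRegister` and the read-then-write ordering within one clock are assumptions made for this sketch; the actual RAM16X1R timing is not described in the notes.

```python
class RamShiftRegister:
    """Serial delay line built from a 16 x 1 RAM and a modulo-N address counter."""

    def __init__(self, length):
        assert 1 <= length <= 16              # one 16 x 1 RAM holds at most 16 bits
        self.ram = [0] * 16
        self.length = length
        self.addr = 0                         # counter value driving A3..A0

    def clock(self, bit_in):
        bit_out = self.ram[self.addr]         # read the oldest stored bit
        self.ram[self.addr] = bit_in          # overwrite it with the newest bit
        self.addr = (self.addr + 1) % self.length
        return bit_out


if __name__ == "__main__":
    sr = RamShiftRegister(10)                 # one 10-bit tap
    word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    out = [sr.clock(b) for b in word + [0] * 10]
    print(out[10:] == word)                   # each bit reappears 10 clocks later
```

Because every tap advances on the same clock, a single counter can address all of the tap RAMs, which is why its cost is described as trivial.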

Serial Adder
[Figure: Serial Adders pairing D2 with D7, D3 with D6, and D4 with D5. Each ADD block computes A+B+Carry; the SUM and Carry outputs are held in D flip-flops (FF) clocked by Clk, with CLR and CNT=10 control signals and a Carry In feedback. 1 CLB Per 2 Taps.]
Because the design is Symmetrical, pairs of Taps share the same coefficient. This means the design can be downsized by reducing the size of the LUTs used to store the Coefficients. To do this, the following circuitry must be added: Half Adders, or serial adders, add the two Taps that share a coefficient before the Data is multiplied by the Coefficients. There are Pros and Cons to this.
Pros:
1. The LUTs are half as large. The LUTs grow exponentially; this will be discussed in the next foil.
2. The reduction in the number of LUTs can also reduce the number of Adders required to add the fanout of Taps. This will be seen in following slides.
Cons:
1. The overall process will take one more clock cycle. This equates to (n+1) clock cycles for an n-bit word. The extra clock cycle is needed to push out of the Half Adder the carryout (overflow) left over from the addition of the two MSBs. (The Half Adder pushes the carryout into the next addition. After adding the two MSBs, zeros are applied to the A & B inputs to flush out the carryout.)
The added circuitry is equal to (1/2-CLBs) per Tap. (e.g. 10-Taps = 5-CLBs. The equivalent design using LUTs would use (1-CLBs) per Coefficient Bit plus 5-Taps and a 12-Bit Adder. Or, (10-Bits + 2) + 6 = 18-CLBs.)
If speed is more important than size, don’t use the Half Adders and gain one clock cycle in performance. Otherwise, reduce the logic count.
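A bit-serial adder of this kind is easy to model in Python. The sketch below follows the description above for unsigned words: one sum bit per clock, a carry flip-flop between clocks, and an extra clock with zeros on both inputs to flush the last carry. The function name `serial_add` is hypothetical. For signed 2's complement samples the extra clock would presumably need the sign bits repeated instead of zeros; the notes do not spell that out, so it is left out of this sketch.

```python
def serial_add(a, b, n=10):
    """Add two n-bit unsigned words one bit per clock, LSB first, in n + 1 clocks."""
    carry = 0                                   # carry flip-flop, cleared before each word
    total = 0
    for cycle in range(n + 1):
        # On the extra clock both inputs are zero, which flushes the final carry.
        bit_a = (a >> cycle) & 1 if cycle < n else 0
        bit_b = (b >> cycle) & 1 if cycle < n else 0
        s = bit_a ^ bit_b ^ carry               # sum bit sent on to the LUT address line
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        total |= s << cycle                     # collected here only so the result can be checked
    return total


if __name__ == "__main__":
    print(serial_add(1023, 1023))               # needs the 11th clock to deliver the carry
```

In the filter the sum bits are never collected into a word; each one goes straight to a LUT address line on the clock it is produced.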

DISTRIBUTED ARITHMETIC
LOOK-UP TABLE
32 X 10 MEMORY (320 BITS), addressed by A0 ... A4.
LOOK UP TABLE HOLDS ALL PARTIAL PRODUCTS
LUT IS AS WIDE AS COEFF
CAN USE MEMGEN TO BUILD LUT
The LUT is somewhat straightforward, once you get the concept. The LUT holds all partial sums of the Coefficients. This is controlled through the addressing of the RAM: each address bit represents one coefficient, and each “true” address bit includes the corresponding coefficient in the sum. Refer to the white paper for more details.
A3, A2, A1, A0 : VALUE
0, 0, 0, 0 : zero
0, 0, 0, 1 : C0
0, 0, 1, 0 : C1
0, 0, 1, 1 : {C1 + C0}
0, 1, 0, 0 : C2
0, 1, 0, 1 : {C2 + C0}
0, 1, 1, 0 : {C2 + C1}
0, 1, 1, 1 : {C2 + C1 + C0}
1, 0, 0, 0 : C3
1, 0, 0, 1 : {C3 + C0}
1, 0, 1, 0 : {C3 + C1}
1, 0, 1, 1 : {C3 + C1 + C0}
1, 1, 0, 0 : {C3 + C2}
1, 1, 0, 1 : {C3 + C2 + C0}
1, 1, 1, 0 : {C3 + C2 + C1}
1, 1, 1, 1 : {C3 + C2 + C1 + C0}
The number of CLBs is easy to figure out. Worst case for an n-Bit Coefficient is 4-Taps per 1/2-CLBs times (n+2)-Bits. The extra two bits are for word growth. The number of CLBs is fixed in multiples of four Taps; that is, anything less than four Taps uses the same LUT resources as four.
This 10-Tap example, using five unique Taps, has the following structure. The five outputs from the Half Adders are used to address the LUTs. Because the CLB structure in the XC4000 offers 4- or 5-Input LUTs, we use the 5-Input LUT. This allows 1-CLB per bit for all five (ten Symmetrical) Taps. To allow for bit growth we would estimate the design at (10+2)-Bits times 1-CLBs = 12-CLBs. Because the coefficients used for Filter designs are typically a mix of positive and negative values, fewer bits are required in the end design. In the reference design only 10-Bits are required, for a total of 10-CLBs. This design can use any 10-Bit combination of five Coefficients.
MEMGEN is used to build the LUTs from an ASCII file. The file can be changed as needed to allow the filter characteristics to be changed, during development or in the field.
Design size estimate:
(n-Bits+2) times 1/2-CLBs per 4-Taps (or less)
(n-Bits+2) times 1-CLBs per 8-Taps (or less)
Note: Additional Adders will be required to add the partial sum outputs from multiple LUTs. That is, once the design exceeds 5-Taps, more than one LUT is required to represent the partial sum of the Data times the Coefficients. This adder tree will be shown in later foils, or you can reference the white paper.
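The note about multiple LUTs can be illustrated with a short Python sketch: the taps are split into groups, each group gets its own small table of coefficient sums, and an adder combines the table outputs. The function names, the group size of four address bits, and the sample coefficient values are assumptions for the illustration; the sketch covers one clock cycle only.

```python
def build_lut(coeffs):
    """Word at address A holds the sum of every coefficient whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def partial_sum(address_bits, coeffs, lut_inputs=4):
    """Partial sum for one clock using several small LUTs and an adder."""
    total = 0
    for start in range(0, len(coeffs), lut_inputs):
        lut = build_lut(coeffs[start:start + lut_inputs])   # one small LUT per group of taps
        addr = 0
        for k, bit in enumerate(address_bits[start:start + lut_inputs]):
            addr |= bit << k
        total += lut[addr]                                  # adder combining the LUT outputs
    return total


if __name__ == "__main__":
    coeffs = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]              # made-up 10-tap coefficient set
    bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]                   # one bit from each of the 10 taps
    print(partial_sum(bits, coeffs))
    print(len(build_lut(coeffs[:4])), "words per 4-input LUT versus", 1 << len(coeffs))
```

Three 16-word tables plus an adder replace a single 1024-word table here, which is the motivation for keeping each LUT to four or five address inputs.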

1’s COMPLEMENTER
[Figure: data bits D0, D1, ... each pass through an INVERT stage into a D flip-flop (D, Q).]
INVERTS DATA ON LAST CYCLE
2 BITS PER CLB
The complementing circuit (2’s Complement) is actually a two-step process. The 1’s Complement is calculated first, in the circuit above. This is done while the MSBs are being processed. If the MSB of any of the input Data Samples is a one, the output must be inverted, because an MSB of one represents a negative number in 2’s Complement format. The output of the LUT is signed data as well, and must be inverted when multiplied by a negative number. The 1’s Complement is calculated, and a one is then added through the carry_in of the adder in the Scaled Accumulator. This will be seen in the next foil.
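The two-step complement can be checked with a few lines of Python: inverting the data word and forcing a carry-in of one turns the adder into a subtractor. The function name `subtract_via_complement` and the 11-bit word size in the example are assumptions for this sketch.

```python
def subtract_via_complement(acc, data, bits):
    """Compute acc - data as acc + (1's complement of data) + carry-in of 1."""
    mask = (1 << bits) - 1
    inverted = (data ^ mask) & mask        # XOR with all ones inverts every bit
    return (acc + inverted + 1) & mask     # the forced carry-in supplies the "+ 1"


if __name__ == "__main__":
    bits = 11
    mask = (1 << bits) - 1
    result = subtract_via_complement(300 & mask, 45 & mask, bits)
    print(result)                          # bit pattern of 300 - 45 in an 11-bit word
```

Only a row of XOR gates and the adder's existing carry input are needed, so no second adder is spent on the "+ 1".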

SCALING ACCUMULATOR
ADDS DATA TO (1/2) *(SUMOUT)
2 BITS PER CLB
NEED N+1 BITS
DOUBLE PRECISION WITH SR
CAN USE XBLOX FOR RPM
[Figure: 11-bit Scaling Accum. The 10-bit DATA enters on B(9:0) with SIGN EXT to B10; the registered SUM OUT (Most Significant BYTE) is fed back scaled by 1/2. C_I: FORCE CARRY-IN ON LAST BIT. LD: LOAD ON FIRST BIT. SUM(0) feeds DIN of the OPTIONAL DOUBLE PRECISION Shift Reg. (Least Significant BYTE).]
The last stage is the Scaling Accumulator. This is a simple block: it consists of an Adder with a registered output. The Output is shifted down one bit and fed back into the Adder during the accumulation. The binary shift is the scaling function of the block. It maintains the correct bit weighting between consecutive accumulations of partial sums.
The Output Register is cleared at the beginning of the Data processing. Each partial sum is accumulated and stored in the output register. The LSB is shifted down and out of the output Register. This LSB can be stripped, or stored in a serial in, parallel out shift register for double precision Output Data (the lower bits). The feedback into the Adder and the Partial Sum Data Inputs are sign extended to one bit larger than the Partial Sum Data. This preserves the 2’s Complement Data, preventing overflow conditions. Again, note that there is a forced carry_in on the addition of the 1’s Complement of the MSBs’ Partial Sum Data.
Predicting the size of this circuit is done as follows:
Scaled Accumulator: (n-Bits per Coefficient + 3) times 1/2-CLBs
Double Precision: (n-Bits per Coefficient) times 1/2-CLBs
(e.g. Scaled Accumulator: (10+1)*1/2 = 5.5-CLBs, Double Precision: 10*1/2 = 5-CLBs, Total = 10.5-CLBs or an even 11-CLBs)
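The shift-and-feed-back behaviour, including the optional double precision register that captures the bits shifted out, can be modeled in Python. This is a sketch under assumptions: the function name `scaling_accumulator` is hypothetical, the partial sums are passed in LSB-weighted first, and the caller is expected to have already negated the last partial sum where the data is signed.

```python
def scaling_accumulator(partial_sums):
    """Accumulate LSB-first partial sums; return (upper word, captured low bits)."""
    acc = 0                                   # output register, cleared at the start
    low = 0                                   # serial-in register for the shifted-out LSBs
    for cycle, p in enumerate(partial_sums):
        if cycle > 0:
            low |= (acc & 1) << (cycle - 1)   # LSB leaving the output register
            acc >>= 1                         # scale the fed-back sum by 1/2
        acc += p                              # add the new partial sum
    return acc, low


if __name__ == "__main__":
    sums = [3, -5, 7, 2]                      # made-up partial sums, weights 1, 2, 4, 8
    upper, lower = scaling_accumulator(sums)
    full = (upper << (len(sums) - 1)) + lower
    print(upper, lower, full)                 # full equals 3*1 + (-5)*2 + 7*4 + 2*8
```

Dropping `lower` gives the single precision result; keeping it reconstructs the exact sum, which is the choice the foil labels optional double precision.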

10 BIT 10 TAP SYMMETRICAL FIR FILTER
[Figure: the complete 10 BIT 10 TAP SYMMETRICAL FIR FILTER, with the same blocks as on the earlier foil: PARALLEL IN, SERIAL OUT sample register; 100 BIT SHIFT REGISTER (RAM), D0 ... D9; Serial Adders forming A0 ... A4; 32 X 10 MEMORY (320 BITS); XOR stage (COMPLEMENT ON LAST BIT & ADD 1); 11-bit Scaling Accum. with LOAD ON FIRST BIT; OPTIONAL DOUBLE PRECISION Shift Reg.; FILTERED DATA OUT.]
Let’s put it all back together and... here it is. This example can be scaled to any number of Taps and any number of Data and Coefficient Bits. Again, the only assumption is that the coefficients are symmetrical, and even that is not required. What are some of the problems with this design? Speed! Perhaps it is not fast enough. We will show how to make it faster in a few slides...
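Putting the blocks together, the whole symmetrical filter can be modeled in Python as a rough sketch. The names are hypothetical, and two modeling choices are assumptions rather than details from the foils: the serial adders repeat the sign bits on the extra (n+1)th clock so that signed samples add correctly, and each LUT word is pre-aligned by n bits so the accumulator's right shifts discard nothing.

```python
def build_lut(coeffs):
    """Word at address A holds the sum of every coefficient whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def symmetric_da_fir(taps, half_coeffs, n=10):
    """One output of a symmetrical FIR from signed n-bit taps, in n + 1 clocks."""
    k = len(taps)
    lut = build_lut(half_coeffs)               # 32 words for five unique coefficients
    carries = [0] * len(half_coeffs)           # one carry flip-flop per serial adder
    acc = 0
    for cycle in range(n + 1):
        pos = min(cycle, n - 1)                # extra clock: repeat the sign bits (assumption)
        addr = 0
        for j in range(len(half_coeffs)):
            a = (taps[j] >> pos) & 1           # tap j and its mirror share a coefficient
            b = (taps[k - 1 - j] >> pos) & 1
            addr |= (a ^ b ^ carries[j]) << j  # serial adder sum bit drives address Aj
            carries[j] = (a & b) | (carries[j] & (a ^ b))
        term = lut[addr] << n
        if cycle == n:
            term = -term                       # complement on last bit and add 1
        acc = (acc >> 1) + term                # scaling accumulator
    return acc


if __name__ == "__main__":
    half = [1, -2, 3, -4, 5]                   # made-up C0..C4; C5..C9 mirror them
    taps = [511, -512, 100, -7, 0, 33, -1, 256, -300, 12]
    direct = sum(t * c for t, c in zip(taps, half + half[::-1]))
    print(symmetric_da_fir(taps, half) == direct)
```

The loop body uses only single-bit logic, one table lookup and one addition per clock, regardless of how many taps are folded into the table.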

NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE
CLB count per block (50 MHz CLK):
SERIAL TIME SKEW BUFFER (RAM BASED SHIFT REGISTER): 10 CLBs
FIVE 2 BIT ADDERS (2 TO 1 REDUCTION DUE TO SYMMETRY): 5 CLBs
FIR FILTER COEFFICIENTS AND MULTIPLY LOOK UP (RAM OR ROM LOOK UP TABLE, 32 X 10): 10 CLBs
TIMING AND CONTROL (CNTEQ9, CNTEQ10): 7 CLBs
1’S COMPLEMENT (XOR, COMPLEMENT ON LAST CYCLE): 5 CLBs
SCALING ACCUMULATOR (ADDER and REGISTER, 9 Most Significant Bits to FILTER OUT): 7 CLBs
TOTAL OF 44 CLBS: FITS IN A 4002A (WITH 20 CLBS EXTRA FOR SYSTEM DESIGN)
ABOUT 1300 EQUIVALENT GATES - LITTLE INTERCONNECT BETWEEN BLOCKS
This foil shows an overview of the number of CLBs used for each block in the design. The chart at the bottom shows how many Filter blocks can fit in each of the devices listed, from smallest to largest. How’s that for expansion capability...
NUMBER OF 10 BIT 10 TAP SYMMETRICAL FIR FILTERS PER XC4000 DEVICE
XC4000 PART: 4002A A A A
NUMBER OF INSTANCES: 1 2 3 5 6 8 10 15 23

PERFORMANCE FIR Filter Macro
FIR10B10T MACRO CAN BE CLOCKED AT 66 MHZ
10 BIT WORD REQUIRES 11 CLOCKS; 10 BIT SAMPLE WORD RATE IS 6 MHZ
8 BIT WORD REQUIRES 9 CLOCKS, ETC; 8 BIT SAMPLE WORD RATE IS 8 MHZ
[Figure: FIR Filter Macro FIR10B10T, a Relatively Placed Macro, with ports DATA IN (DIN_), DATA OUT (DOUT_), BIT_CLK, 10X_CLK, CLK_OUT and WORD_CLK.]
[Table: WORD SIZE (BITS) versus SAMPLE RATE (MSPS)]
Performance! The number one issue in design is speed. Again, if the design doesn’t meet speed, it is of zero value. Any excess is of no added value. We want to design the smallest circuit for a given speed.
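The relationship between bit clock and sample rate for the serial macro is simple enough to state as a one-line function. This is a sketch assuming the (n + 1) clocks per word given on the foil and a 66 MHz bit clock; the function name `serial_sample_rate_mhz` is hypothetical.

```python
def serial_sample_rate_mhz(bit_clock_mhz, word_bits):
    """Sample rate of the bit-serial filter: one word every (word_bits + 1) clocks."""
    return bit_clock_mhz / (word_bits + 1)


if __name__ == "__main__":
    print(serial_sample_rate_mhz(66, 10))   # 10-bit words need 11 clocks each
```

The same function shows how quickly the sample rate recovers as the word size shrinks, since the tap count does not enter the formula at all.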

Double-Rate DA FIR Filters
Now that we have looked at Distributed Arithmetic using the Bit Serial approach, let’s speed the process up with a Bit Parallel approach. We will start with “Two-Bit Parallel” Distributed Arithmetic and move from there to “Fully Parallel” Distributed Arithmetic.

Two Bit Parallel Distributed Arithmetic FIR Filter
[Figure: Two-Bit Parallel Distributed Arithmetic FIR filter. Sample data registers (X0, X1, X2 ... XN, N bits wide) shift out two bits per clock to address 16-word x N-bit look-up tables (address A[3:0]); the LUT outputs are added and passed through a scaling accumulator and register to produce the filtered data out. Recoverable LUT entries: 1000 = 2C1, 1001 = 2C1 + C0, 1010 = 2C1 + 2C0, 1011 = 2C1 + 3C0.]
Process 2 bits per clock
# of clocks = (N/2) + 1
Twice as fast
The Two-Bit Parallel Distributed Arithmetic implementation is the same as that used with Serial Distributed Arithmetic (SDA), with only a few slight changes. First, let's look at the input data and the parallel-load, serial-shift-out register. The load function is the same. The output serial shift is now a two-bit shift versus the one-bit shift previously described for SDA. Therefore, two bits are processed every clock cycle rather than one.
The LUTs are shown here as being addressed by two bits from each of two data samples. This is only done to simplify the drawing. The most efficient approach is to use the same bit-weighted values on a common LUT, that is, to have all bits at weight zero of four data samples address the same LUT. This reduces the number of CLBs required, and it will be much more obvious in the “Fully Parallel” Distributed Arithmetic implementation. For more details related to the LUTs, refer to the white paper.
The scaled accumulator performs the same function, but with a two-bit scaling, or divide by four, because the weighting of each two-bit pair gives a two-bit offset during the process. Because each data sample is processed at a rate of two bits per clock cycle, the process requires half as many cycles as SDA.
The circuit grows in the following ways. The number of serial I/O shift data sample registers doubles. The number of LUTs doubles, along with the adder tree. The circuit is much larger than that of the SDA, but it is twice as fast. This “bit parallel” concept can be extended to increase the performance of the design by increasing the number of parallel bits being processed. This gives a lot of design flexibility to keep the design as small as possible for a given performance.
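The two-bits-per-clock idea described above can be sketched in software. This is a minimal behavioral model, not the hardware design from the slides: it assumes unsigned samples, builds one LUT addressed by the same-weight digit of every sample, and applies the weighting by shifting left instead of the accumulator's divide-by-four (the two differ only by a fixed scale). The function names `build_lut` and `da_fir_output` are hypothetical, and the extra "+ 1" cycle from the slide is not modeled.

```python
from itertools import product


def build_lut(coeffs, bits_per_clock=2):
    """Precompute sum(c_k * digit_k) for every combination of per-tap digits."""
    radix = 1 << bits_per_clock
    return {
        digits: sum(c * d for c, d in zip(coeffs, digits))
        for digits in product(range(radix), repeat=len(coeffs))
    }


def da_fir_output(samples, coeffs, word_bits=8, bits_per_clock=2):
    """Return (dot product, number of LUT passes) for unsigned samples."""
    lut = build_lut(coeffs, bits_per_clock)
    mask = (1 << bits_per_clock) - 1
    acc = 0
    passes = 0
    for shift in range(0, word_bits, bits_per_clock):
        # Take the same-weight digit from every sample to form the LUT address.
        digits = tuple((x >> shift) & mask for x in samples)
        # Weight the partial sum by 2**shift (software stand-in for the
        # scaling accumulator).
        acc += lut[digits] << shift
        passes += 1
    return acc, passes


result, passes = da_fir_output([200, 17], [3, 5])
print(result, passes)  # 685 4
```

With two taps and two bits per clock the LUT has 16 entries, matching the 16-word table on the slide; setting `bits_per_clock=1` gives the serial case with twice as many passes.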

Double Sample Rate D.A. FIR Filters
Twice the I/O data sample rate
Two taps require a 4-input LUT without symmetry
Four taps require a 4-input LUT with a symmetrical FIR
Time skew buffer uses twice as many CLBs
LUTs are the same if equal bit weights are used to address the LUTs.
[Table: 2-Bit PDA Performance, Clocked at 66. Columns: word size (bits), sample rate in MSPS (double precision). Values not recoverable.]

Full Parallel D.A. FIR Filters
One 8-bit tap requires two 4-input LUTs and an adder with an offset for bit weighting.
Time skew buffer must use REGs
Maximum I/O data sample rate
Pipelining can further increase the sample rate
LUTs are the same if equal bit weights are used to address the 4 coefficients in the LUT.
[Table: Full PDA Performance in an XC4000E-3/-2, MHz. Columns: word size (bits), sample rate in MSPS (double precision). Values not recoverable.]

FPGA-Based DSP Coprocessor Design Implementation
Performance:
Programmable DSP (DSP56300): 24 clock cycles
FPGA-Based Coprocessor: 9 clock cycles
Results: 37.5% of original processing time, 2.67X increase in throughput
System Requirements: Before: 4 DSPs, 12 RAMs. After: 2 DSPs, 6 RAMs, 1 XC4013E
Viterbi Case Study with FPGA: The Viterbi was implemented in the FPGA, giving an overall smaller hardware solution with 2.67 times the throughput. Read the white paper for more details.
Again: The schematic drawing shown above is a functional block diagram for a Viterbi decoder. The design was initially implemented in two Motorola GP-DSPs (the slide lists the DSP56300). The system required two Viterbi decoder blocks, hence four GP-DSPs in total. The GP-DSP solution offered a maximum data rate of 33 MHz. This is due to the two-clock-cycle memory read or write command (no-op state). The process required 360 nsec of computation [(24 clock cycles) * (15 nsec)].
The FPGA-accelerated design allowed the computationally intensive function calls to be offloaded from the GP-DSP to the FPGA. The FPGA was able to process the Viterbi algorithm in 9 clock cycles, resulting in a total processing time of 135 nsec at 66 MHz in an XC4000E-3 [(9 clock cycles) * (15 nsec)].
The design has three inputs and four outputs, with three additional prestate buffer outputs, totaling ten I/Os. The I/Os share a common I/O bus, which required that they be MUXed on the same bus. If the bus were split, the data rate could be increased by a factor of 2 or more, because the inputs could be written at the same time as the outputs are being read (resulting in a four-clock-cycle process).
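The figures quoted for the Viterbi case study can be checked with a few lines of arithmetic. The sketch below only reuses the cycle counts and the 15 nsec clock period given in the notes; the helper name `processing_time_ns` is made up for illustration.

```python
CLOCK_NS = 15  # clock period quoted in the notes (about 66 MHz)


def processing_time_ns(clock_cycles, clock_ns=CLOCK_NS):
    """Processing time for a fixed number of clock cycles."""
    return clock_cycles * clock_ns


dsp_ns = processing_time_ns(24)    # GP-DSP implementation
fpga_ns = processing_time_ns(9)    # FPGA coprocessor
fraction_of_original = fpga_ns / dsp_ns
throughput_gain = dsp_ns / fpga_ns

print(dsp_ns, fpga_ns, fraction_of_original, round(throughput_gain, 2))
# 360 135 0.375 2.67
```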

8 Bit Word FIR Filter Structures
[Figure: number of CLBs (axis marked 100, 200, 300) versus number of taps for four 8-bit FIR structures: Parallel Distributed Arithmetic, 55 MHz; Two-Bit Parallel Distributed Arithmetic, 16 MHz; Serial Distributed Arithmetic, 8 MHz; Serial Sequential Distributed Arithmetic, 1000 to 50 KHz.]

FIR Filter Implementation Options
8 Bit Word Example Serial* Parallel* Serial* Distributed Distributed Sequential Arithmetic Arithmetic 8 Taps 16 Taps 32 Taps 48 Taps 64 Taps 36 CLBs CLBs CLBs 1.08 MHz MHz MHz 36 CLBs CLBs 400 CLBs 0.46 MHz MHz MHz 44 CLBs CLBs 0.23 MHz MHz 62 CLBs CLBs 0.15 MHz MHz 70 CLBs CLBs 0.11 MHz MHz * Note: These designs are NOT Pipelined

Lower Sample Rate Applications:
Efficient CLB Counts Large Number of TAPs Moderate Sample Rates Non Symmetrical FIR OK Serial Sequential Architecture

Serial Sequential - FIR Filter
32 Tap, 8 Bit Example
[Figure: serial sequential FIR filter clocked at 50 MHz. Sample data enters a sample data buffer (3 CLBs) addressed by a 5-bit counter; a 32 x 8 coefficient table LUT (8 CLBs) is addressed by the coefficient select; a parallel-to-serial converter (4 CLBs) feeds the serial multiplier (ADD, 5 CLBs, with a 2^-1 scale and register); the accumulator and output register produce the filtered data out. 24 CLBs total.]
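As a rough behavioral illustration of the serial sequential architecture, the sketch below processes one tap at a time with a shift-and-add multiplier that consumes one sample bit per clock. It is an assumption-laden software model (unsigned samples, no scaling or truncation), not a description of the actual CLB implementation, and the function names are hypothetical.

```python
def serial_multiply(sample, coeff, word_bits=8):
    """Shift-and-add multiply, consuming one sample bit per clock."""
    product = 0
    for bit in range(word_bits):
        if (sample >> bit) & 1:
            product += coeff << bit
    return product


def serial_sequential_fir(samples, coeffs, word_bits=8):
    """Return (filter output, clocks used), one tap after another."""
    acc = 0
    clocks = 0
    for sample, coeff in zip(samples, coeffs):
        acc += serial_multiply(sample, coeff, word_bits)
        clocks += word_bits  # each tap occupies the multiplier for word_bits clocks
    return acc, clocks


print(serial_sequential_fir([200, 17, 3], [3, 5, -2]))  # (679, 24)
```

Because a single multiplier is shared by every tap, the clock count grows with taps times word size, which is why this structure suits lower sample rates.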

64-TAP Serial Sequential FIR Filter
[Figure: block diagram. Labels: Sample Data, Sample Data Buffer, Serial Multiply, Coefficient Select, ACC, REG, ADD, Register.]

Serial Sequential - FIR Filter: Number of CLBs vs. Taps / Word Size
[Figure: serial sequential FIR block diagram (sample data, sample data buffer, coefficient select, serial multiply, ACC, REG, filtered data out) with a chart of CLB count for 8, 16, 32, 48, 64, 80, 96, and 128 taps at word sizes of 8, 10, 12, and 14 bits (a further word size is missing its value).]
Device capacities: 4002 = 64 CLBs, 4005 = 196 CLBs, 4013 = 576 CLBs, 4025 = CLBs

Serial Sequential - FIR Filter: Maximum Sample Rate / Word Size
[Figure: serial sequential FIR block diagram (sample data, sample data buffer, coefficient select, serial multiply, ACC, REG, filtered data out).]
Maximum sample rate by number of taps (three word-size columns; only the first column's values survive):
8 Tap: 781 kHz
16 Tap: 390 kHz
32 Tap: 195 kHz
48 Tap: 130 kHz
64 Tap: 97 kHz
80 Tap: 78 kHz
96 Tap: 65 kHz
128 Tap: 48 kHz
Serial multiplier limitations
Can use multiple 16-tap building blocks: 8X faster at 128 taps
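The surviving sample-rate figures are consistent with a simple model in which the single serial multiplier needs one clock per bit per tap. The sketch below assumes that model, the 50 MHz clock shown on the 32-tap example slide, and an 8-bit word; the formula is an inference from the numbers, not something the slides state.

```python
def max_sample_rate_hz(taps, word_bits, clock_hz=50_000_000):
    """Assumed model: one clock per sample bit per tap."""
    return clock_hz / (taps * word_bits)


for taps in (8, 16, 32, 48, 64, 80, 96, 128):
    print(taps, "taps:", int(max_sample_rate_hz(taps, 8) / 1000), "kHz")
```

Under the same model, splitting a 128-tap filter into eight 16-tap blocks running in parallel would raise the rate eightfold, in line with the "8X faster at 128 taps" remark.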

13.5MHz Median Filter, 5-Point, 2-Bit PDA
[Figure: five 8-bit inputs A, B, C, D, E are loaded into shift registers clocked at 4xCLK. Bit pairs (A1 B1 C1 D1 E1 and A0 B0 C0 D0 E0) address two copies of a 32-word x 12-bit look-up table (LUT-A); the 11-bit LUT data outputs are registered, sign extended, and added with 2x and 4x weighting (12-bit, then 14-bit) to produce M(A,B,C,D,E).]
M = (A + B + C + D + E)/5
CLB estimate:
Shift registers: 4 CLBs per 8-bit shift register, 4 x 5 ea = 20 CLBs
LUTs: 1 CLB per bit, 12-bit partial sums (MSB bit weight = 1), 12 x 2 ea = 24 CLBs
12-bit partial-sum add: 6 CLBs for the add plus 1 CLB for [carryout + LSB], 6 + 1 = 7 CLBs
14-bit partial-product sum add: 7 CLBs (no carryout, and LSBs are dropped)
Total: 58 CLBs for function plus about 10 CLBs for control = 68 CLBs

Design the following Application:
Equations:
Y(R,G,B) = 0.299*R *G *B
U(R,G,B) = *R *G *B
V(R,G,B) = 0.500*R *G *B
R, G, B data is 8 bits at 13.5 MHz. The circuit already has a 2x Clk (27 MHz).
Draw a functional schematic diagram of the circuit.
How do you implement the three multipliers or MACs?
What is the estimated size of the final design?
What is the estimated speed of the final design?
How long would it take to turn over this design?

Video Coding Application with 4x Clock
[Figure: R, G, and B samples (8 bits wide) are parallel loaded into 2-bit shift registers clocked at 4xCLK. Bit pairs (R1 G1 B1 and R0 G0 B0) address two identical 8-word x 10-bit look-up tables (LUT-A); the 10-bit LUT outputs are combined with 2x weighting in a 10-bit adder plus register, then sign extended and accumulated with 4x weighting in a 12-bit adder to produce Y(R,G,B), U(R,G,B), and V(R,G,B).]
Look-up table f(RGB), 8 words x 10 bits:
000: 0
001: CB
010: CG
011: CG + CB
100: CR
101: CR + CB
110: CR + CG
111: CR + CG + CB
CLB estimate:
Parallel-load 2-bit shift registers: 4 CLBs each = 12 CLBs
LUTs (all the same): 5 CLBs each = 10 CLBs
10-bit adder + register: 5.5 CLBs
12-bit adder: 6 CLBs
Y = *R *G *B
U = *R *G *B
V = *R *G *B
The total design would use about 110 CLBs with control logic.

Video Coding Application with 2x Clock
[Figure: R, G, and B samples (8 bits wide) are parallel loaded into 4-bit shift registers clocked at 2xCLK. Four bit groups (R3 G3 B3, R2 G2 B2, R1 G1 B1, R0 G0 B0) address four identical look-up tables (LUT-A, 10-bit data); pairs of LUT outputs are combined with 2x weighting in 10-bit adders plus registers, the two results are sign extended and combined with 4x weighting in a 12-bit adder, and a 14-bit adder accumulates with 16x weighting to produce Y(R,G,B), U(R,G,B), and V(R,G,B).]
CLB estimate:
Parallel-load 4-bit shift registers: 4 CLBs each = 12 CLBs
LUTs (all four are the same): 5 CLBs each = 20 CLBs
10-bit adder + register: 5.5 CLBs each = 11 CLBs
12-bit adder + 2 registers: 7 CLBs
14-bit adder: 7 CLBs
The total design would use about 180 CLBs with control logic.
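Both video coding variants rely on the same 8-word look-up table addressed by one bit each of R, G, and B. The sketch below models that table and the bit-parallel accumulation in software. It makes several assumptions that are not in the slides: integer coefficients scaled by 1000, unsigned 8-bit samples, and luma weights of 0.299, 0.587, and 0.114 (only 0.299 survives in the transcript; the other two are commonly used values chosen here for illustration). The function names are hypothetical.

```python
def build_rgb_lut(cr, cg, cb):
    """8-word LUT; address bits are (R bit, G bit, B bit), R as the MSB."""
    return [
        cr * ((addr >> 2) & 1) + cg * ((addr >> 1) & 1) + cb * (addr & 1)
        for addr in range(8)
    ]


def da_dot_rgb(r, g, b, lut, word_bits=8, bits_per_clock=2):
    """Return (cr*r + cg*g + cb*b, clocks) using LUT lookups only."""
    acc = 0
    clocks = 0
    for base in range(0, word_bits, bits_per_clock):
        # Each parallel bit uses its own copy of the same LUT.
        for k in range(bits_per_clock):
            i = base + k
            addr = (((r >> i) & 1) << 2) | (((g >> i) & 1) << 1) | ((b >> i) & 1)
            acc += lut[addr] << i
        clocks += 1
    return acc, clocks


# Assumed luma weights, scaled by 1000 to stay in integers.
luma_lut = build_rgb_lut(299, 587, 114)
print(da_dot_rgb(255, 255, 255, luma_lut))                  # (255000, 4)
print(da_dot_rgb(10, 20, 30, luma_lut, bits_per_clock=4))   # (18150, 2)
```

Two bits per clock needs four clocks per 8-bit sample (the 4x clock case with two LUTs), while four bits per clock needs only two clocks (the 2x clock case with four LUTs), which is the size-versus-clock trade-off the two slides illustrate.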

Xilinx Introduces First Fully Programmable System Solution
First FPGA Architecture Designed for Intellectual Property

FPGA Technology Roadmap
[Figure: density/performance versus year, 1995 to 1999.]
XC4000E: largest device XC4025, 0.5 µm
XC4000EX: largest device XC4036EX, 0.5 µm
XC4000XL: largest device XC4085XL, 0.35 µm
XC4000XV: largest device XC40250XV, 0.25 µm
Generation 3 architecture: 1 Million+ system gates, system solution, 0.25/0.18 µm

Process Technology and Supply Voltage
Virtex FPGAs Leverage Xilinx Process Technology Leadership
[Figure: feature size (µm, axis from 0.2 to 1.2) and supply voltage (5, 3.3, 2.5, 1.8, 1.3 V) versus year, 1990 to 2002, with the point at which Virtex FPGAs ship marked. Benefits listed: lower cost, faster speed, higher density, lower power.]
The company that leads in process by definition owns performance and density. The same company also has a large advantage in price. This is why process is crucial to Xilinx and our customers in the future. Xilinx is committed to always having the most advanced process technology available. One of the major advantages of being a fabless semiconductor company is that we can tailor our fab partnerships to provide access to leading-edge fab processes.
This graph shows how processes have migrated from 1.2u in 1990 to 0.5u today while maintaining 5 volt logic levels. As process geometries shrink below 0.5u, the smallest transistors cannot withstand 5 volts without damage. This leads to the voltage staircase shown above, where each successive process generation must use a lower supply voltage than the previous generation. To reap the benefits of increased performance, density, lower price, and lower power dissipation, customers must be willing to migrate their designs down this voltage staircase. Xilinx is taking a lead in actively working with customers to help plan an orderly migration from one voltage standard to the next.
Another interesting fact is that our process technology partners recognize that FPGAs are ideal process drivers for a fab. The XC4062XL has 3x as many transistors as the Pentium PRO and, being SRAM-based, makes it easy to isolate defects and faults in the chip. This provides the fab with a useful diagnostic tool to improve the fab defect density while providing Xilinx with access to leading-edge technology.
Note that Xilinx will continue to ship 5 volt devices, 3.3 volt devices, and each new generation for some time, on the order of 10 years. Although we encourage our customers to migrate their designs to lower voltage processes as they become available, we know that many designs will continue to run in production for many years on older processes, and Xilinx is committed to supporting those designs.

Voltage and Family Migration
Virtex FPGAs and XC4000XV share a common process (0.25 µm)
2.5 V logic, 3.3 V I/O with 5 V tolerance
Family migration from XC4000XL possible
Voltage migration guide will assist users
Design with XC4000XL now and plan ahead for XC4000XV and Virtex FPGAs
This foil is an overview of the power supply issues for the Virtex FPGA devices. The Virtex devices share a common process with the 4000XV Series. This 0.25 micron process uses 2.5 volts for the logic and 3.3 volts for the I/O and is 5-volt compatible and 5-volt tolerant like the 4000XV Series. Xilinx is about to release a voltage migration guide to assist users in planning for systems that will use 4000XL devices now but which may need to be upgraded to 4000XV or Virtex devices later.
NOTE: Don’t spend a lot of time on this foil. Most of the details are much more obvious on the next foil. Spend more time talking to the drawing on the next foil.

Xilinx 0.25 µm, 5 Volt-Compatible FPGAs
[Figure: supply and interface levels. Any 5 V device (XC4000E) runs from 5 V; any 3.3 V device (XC4000XL) runs from 3.3 V; Virtex and XC4000XV use a 2.5 V logic supply and a 3.3 V I/O supply, accept 5 V levels, and meet TTL levels.]
Family migration possible if you plan for:
Additional power/ground pins
Dedicated clock and configuration pins
Voltage migration guide to help users
Xilinx is the first FPGA vendor to address the issues associated with 2.5 volt devices. Our customers have told us that we must provide a path back to 5 volts. The 4000XV family and the new Virtex devices are fabricated in a 0.25 micron process which uses 2.5 volts for its basic transistor logic. Our solution, shown above, is a split supply between the logic core (which must operate at 2.5 volts) and the I/O ring (which operates at 3.3 volts). The same 5 volt tolerant I/O structure used on the 3.3 volt XL devices is used on these 2.5 volt devices, allowing a mixture of 3.3 volt TTL and 5 V TTL devices to be connected to a Virtex FPGA device. Xilinx will use this strategy on all future generations: each will be directly compatible with the previous process generation and tolerant of voltages from two prior process generations. Another example of Xilinx innovation.

System Level Design Trend
[Figure labels: PC Board; Scratch Pad SRAM; DSP; RAM I/F; PCI Bus I/F; Custom Logic; High-Density, High-Performance Custom Device.]

Introducing Xilinx Virtex FPGAs
World’s first fully programmable system-level architecture
Segmented routing, 4-input LUT FPGA architecture
Fast, flexible I/Os
System building blocks
Software
IP
Leading-edge process technology
The Virtex family of FPGAs from Xilinx was designed to address more than just lots of gates at high frequencies. The silicon technology will allow this family to go from below 20K gates to 400K gates, or from 1,500 to 32,000 logic cells, with performance well above 100 MHz through all parts of the device, including the I/Os, the logic cells, and the RAM. Much attention was paid to system performance, including the raw speed of the I/Os, the inclusion of PLLs, and RAM performance at 3 levels of hierarchy. System interface needs are met through the ability to mix 3V and 5V I/O standards freely, as with the 4000X Series. Software for these Virtex devices will be an ideal fit into an ASIC-like synthesis design flow, and place-and-route tools will allow excellent user control through constraint files. Considerable input from intellectual property users, both external and internal to Xilinx, has resulted in an architecture and software tool set that will provide good support for CORE-based design.

Advanced Process Technology
0.5u Process: LOCOS isolation; bird's beak; no planarization; only contact plug
0.25u UMC Process: shallow trench isolation; 0.9u metal pitch; CMP; plug for all vias
At this slide, we were asking that Mr. Brooks expand on the technologies in this 0.25 micron technology. It would also be appropriate for Mr. Brooks to go into detail about the semiconductor creation process by explaining the new challenges at this level of technology versus those at 0.5 micron and older.

Family Overview
0.25 µm, 5-layer metal process
Density: 50 thousand to 1 million system gates
Performance: 100+ MHz performance through 3 to 4 LUT levels; 160 MHz system performance (clock to output + input setup)
First device in 2Q98: 250,000 system gates
One million system gate device by end of 1998

Virtex FPGA Performance
100+ MHz internal speeds
155 MHz SONET data stream processing
100+ MHz pipelined multipliers
66 MHz PCI
100+ MHz system interface speeds
[Table: I/O timing without PLL and with PLL. Rows: Tco (output register), Tsu (input register), and Th (input register), in ns, values not recoverable; max I/O performance is 110 MHz without the PLL (per the notes) and 160 MHz with the PLL.]
With the fast I/Os and clocking in these Virtex devices, we will be able to achieve over 100 MHz system clock rates through all parts of the chip. The internal logic can operate at 100 MHz through 3 to 4 levels of logic, depending on routing. The I/Os can operate at 110 MHz without the PLL, and the setup and clock-to-out times shown also meet the 66 MHz PCI specification. With the PLL, the I/O performance increases to 160 MHz, opening up many new applications not previously possible using FPGA technology.

Functional Block Diagram
[Figure labels: PLL; CLB; segmented routing; vector-based interconnect, delay = f(vector); SelectI/O pins (66 MHz PCI, SSTL3); Block SelectRAM memory; Distributed SelectRAM memory.]

Virtex Clocking

Clocking and PLL
4 low-skew clock resources
3 ns setup, 0 ns hold: clock pad -> IOB input FF
6 ns clock to out: clock pad -> IOB output FF
24 additional low-skew globals: clocks, enables, resets, etc.; faster than the 4KXL secondary global buffer
PLL for system clock deskew and fast clock to out
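The setup and clock-to-out numbers above can be turned into a rough chip-to-chip clock rate, using the "clock to output + input setup" definition of system performance given in the family overview. This sketch assumes zero board delay and zero clock skew, which a real design would have to add; the function names are illustrative only.

```python
def max_io_mhz(tco_ns, tsu_ns):
    """Maximum clock rate if the period only has to cover Tco + Tsu."""
    return 1000.0 / (tco_ns + tsu_ns)


def timing_budget_ns(target_mhz):
    """Total Tco + Tsu allowed at a target clock rate."""
    return 1000.0 / target_mhz


print(round(max_io_mhz(6, 3), 1))  # 111.1
print(timing_budget_ns(160))       # 6.25
```

The 6 ns plus 3 ns case lands near the 110 MHz quoted for I/Os without the PLL, and reaching 160 MHz would require Tco + Tsu of about 6.25 ns, which is where the PLL's faster clock to out comes in.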

Virtex CLB

Segmented Routing Interconnect
Fast local routing within CLBs
General purpose routing between CLBs
Fast interconnect: 8 ns across 250,000 system gates
Predictable for early design analysis
Optimized for five-layer metal process
[Figure: a CLB containing two pairs of logic cells (2 LCs each) with carry chains, connected through a switch matrix to general purpose routing and 3-state busses.]
This slide shows the relationship between the logic and the interconnect in the Virtex devices. A configurable logic block, or CLB, contains 4 logic cells organized in two pairs. All four look-up tables in the CLB can connect to each other through fast local interconnect that provides known delays between LUTs in a given CLB. To connect to other CLBs, signals go in and out of a CLB through the switch matrix. There is no direction dependence in this switch matrix: inputs can come from any of the 4 directions and outputs can go out in any of the 4 directions. The hierarchical general purpose routing was designed to be scalable across the wide range of densities this family will have, and it provides excellent utilization even on the largest parts, very fast routing delay times, and excellent predictability.

Virtex Configurable Logic Block
Polarity of all control signals selectable
Fast arithmetic and multiplier circuitry
Optimized for synthesis
[Figure: a CLB made of two pairs of logic cells (2 LCs each). Each logic cell has a 4-input LUT (inputs I0 to I3), carry and control logic (CI, CO), and a register (D, CE, CLK, PR, RS, Q), with WI and DI inputs and output O.]

Virtex IO

Simplified IOB
Fast I/O drivers
Registered input, output, 3-state enable control
Programmable slew rate, pull-up, input delay, etc.
Selectable I/O standards: SSTL, GTL, LVTTL...
[Figure: IOB with DFF/latch registers (D, CE, S/R, Q) on the input, output, and 3-state enable paths, connected to the pad.]
This slide shows a simplified view of the IOB. The IOB is designed for speed and flexibility. The I/O drivers are very fast, and we’ll examine the performance in a few slides. There are registers (flip-flop or latch) on the input, output, and 3-state enable pins. The new register on the 3-state enable will allow very rapid turn-on or turn-off of busses. The IOB registers have a common clock and separate clock enable inputs for all three registers. The IOB also has programmable control of slew rates, pull-up, pull-down, and input delay like the 4000X family.
NOTE: There is no fast capture latch or output mux (OMUX) in Virtex devices. We should not mention this unless asked.
NOTE: If asked about I/O standards other than 3 V TTL (such as GTL+ or SSTL) we should say that we plan on supporting the most popular of these new I/O standards but aren’t ready to supply details at this time. We are looking for customer input on which standards they plan on using. Customers asking this level of question should be handled under NDA for further discussions.

Virtex Memory

SelectRAM+ Memory Features
Distributed SelectRAM Memory: pioneered in the XC4000 family; 16x1 synchronous SRAM implemented in a LUT; ideal for DSP applications; access over one hundred billion bytes/sec
Block SelectRAM Memory: 4096-bit blocks of dual-port synchronous SRAM; configurable widths of 1, 2, 4, 8, and 16; ideal for data buffers and FIFOs; up to 17 gigabytes/sec access
Fast Access to External RAM: direct interface to SSTL3, the 3.3V synchronous DRAM standard; 133 MHz
The Virtex devices will have several features that help solve system-level design issues. We talked about the two carry chains in each CLB. Like the 4000X family, these carry chains are uni-directional (up only) and are very fast. The clocking and PLL support was designed to offer I/O performance in line with the fast internal logic speeds the 0.25 micron process will give us. We’ll examine this and the RAM hierarchy in the next few slides.
We address RAM capability in the Virtex devices at three levels:
1. We have SelectRAM that is fully compatible with the 4000X family.
2. There is dedicated block RAM on-chip that is not part of the CLB array.
3. For larger amounts of RAM, we have the I/O speed to interface to external RAM at better than 100 MHz.

Block RAM
Configure as: 4096 bits with variable aspect ratio
8-32 blocks across family devices
True dual-port, fully synchronous operation
Cycle time <10 ns
Flexible block RAM configuration:
5 blocks: 2K x 10 video line buffer
1 block: x 8 ATM buffer (9 frames)
4 blocks: 2K x 8 FIFO
9 blocks: 4K x 9 FIFO with parity
[Figure: RAMB4 symbol with port A signals WEA, ENA, CLKA, ADDRA, DINA, DOA and port B signals WEB, ENB, CLKB, ADDRB, DINB, DOB.]
The block RAM in the Virtex devices is in addition to the SelectRAM that is part of each CLB. The block RAM blocks are 4K bits each with variable width from 1 to 16 bits. The number of blocks varies with device size: for example, there are between 8 and 30 blocks in the devices from 20K to 200K logic gates. The block RAMs are fully synchronous and have two ports that are fully independent read/write ports. Cycle time of the block RAM is less than 10 nanoseconds. As you can see on the slide, many popular configurations of memory are possible using a combination of blocks.
NOTE: The block RAM will not be able to do asynchronous logic. We believe in letting the synthesis tool work with the LUTs and the local interconnect within the CLB for logic and keeping the RAM focused on very high performance in RAM applications.
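The block counts in the configuration examples follow from the 4096-bit block size and the allowed widths of 1, 2, 4, 8, and 16. The sketch below is one simple way to estimate the count: try each block width, tile the requested memory, and keep the smallest result. It ignores any extra logic for address decoding or output multiplexing, and `blocks_needed` is a hypothetical helper, not a vendor tool.

```python
import math

BLOCK_BITS = 4096
BLOCK_WIDTHS = (1, 2, 4, 8, 16)


def blocks_needed(depth, width):
    """Smallest number of 4096-bit blocks that tile a depth x width memory."""
    best = None
    for block_width in BLOCK_WIDTHS:
        block_depth = BLOCK_BITS // block_width
        count = math.ceil(width / block_width) * math.ceil(depth / block_depth)
        if best is None or count < best:
            best = count
    return best


print(blocks_needed(2048, 10))  # 5  (2K x 10 video line buffer)
print(blocks_needed(2048, 8))   # 4  (2K x 8 FIFO)
print(blocks_needed(4096, 9))   # 9  (4K x 9 FIFO with parity)
```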

Real Time Video Processor
[Figure: video data in passes through a Virtex FPGA to processed video out. Frame data is held in high-speed synchronous DRAM (Mbytes), line data in Block SelectRAM memory (Kbytes), and pixel data in Distributed SelectRAM memory (bytes), feeding the video pixel processing function (logic).]
Hierarchy of RAM provides efficient and very high bandwidth data processing

Virtex FPGA Summary
First fully programmable system solution
1 Million+ system gates
100+ MHz performance from all devices
Building blocks for system level design
ASIC design flow software
Platform for CORE reuse
