Tools - Hardware Optimization - Chapter 12 slide 1 Version 1.5 FPGA Tools Training Class Hardware Optimization.

In This Chapter, You Will Learn
Design techniques to optimize performance
–Logic techniques
–Special Xilinx hardware features
Topics apply to both synthesis and schematic users

Outline
CLB Combinatorial Logic
CLB Register Resources
Memory Usage
Input/Output Block Usage
Tips and Guidelines
Summary

Combinatorial Resource Review
How is a 9-input AND gate implemented in a CLB?
–Three stages explain the mapping process
[Figure: CLB with flip-flops FFX and FFY and combinatorial output O]
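The mapping question above can be explored with a small behavioral model. This is only a sketch: it assumes an XC4000-style CLB with two 4-input function generators (F and G) feeding a 3-input function generator (H), and the helper names `make_lut`, `read_lut`, and `and9` are hypothetical.

```python
def make_lut(func, n_inputs):
    """Build the truth table (LUT contents) of an n-input boolean function."""
    table = []
    for addr in range(2 ** n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]
        table.append(int(func(bits)))
    return table


def read_lut(table, bits):
    """Look up the output for the given input bits (bit 0 first)."""
    addr = sum(b << i for i, b in enumerate(bits))
    return table[addr]


AND4 = make_lut(all, 4)  # contents assumed for the F and G function generators
AND3 = make_lut(all, 3)  # contents assumed for the H function generator


def and9(inputs):
    """9-input AND: F covers inputs 0-3, G covers 4-7, H combines F, G and input 8."""
    assert len(inputs) == 9
    f = read_lut(AND4, inputs[0:4])
    g = read_lut(AND4, inputs[4:8])
    return read_lut(AND3, [f, g, inputs[8]])
```

Under these assumptions the whole function fits in the combinatorial part of one CLB, at the cost of two lookup stages (F/G, then H).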

Three-State MUX
Wide MUXes implemented in LUTs have many levels of logic
–BUFT multiplex function uses SRAMs to decode select signals and internal tri-state buffers
–Fewer CLBs are used and routing congestion is decreased
BUFT delay varies with size of FPGA
Small 4-to-1 MUX example: BUFT implementation
[Figure: 4-to-1 MUX built from four BUFTs with data inputs D driving a common output O; select inputs S are decoded by a D2_4E decoder]

BUFT Multiplexers
BUFT can be used to build large MUXes
–Wide MUXes composed of LUTs need multiple levels of logic
–Wide MUXes composed of BUFTs use SRAMs to decode select signals and internal tri-state buffers
MUX should be built across one row of CLBs
Standard library multiplexer macros use Look-Up Tables
–Example: 4-to-1 MUX with enable, M4_1E, is built with CLBs
LogiCORE MUXes with Style = WAND use BUFTs
Xilinx Unified library BUFT components
–BUFT, BUFT4, BUFT8, BUFT16
Synthesis tools: a BUFT MUX is generated in all synthesizers whenever an IF-THEN type statement drives a high-Z value; otherwise CLB MUXes are generated.
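A behavioral sketch of the BUFT multiplexer idea: the select value is decoded to one-hot enables, each tri-state buffer either drives the shared line or leaves it floating, and the line takes the value of the single active driver. The names `Z`, `decode_one_hot`, `buft`, `resolve`, and `buft_mux` are hypothetical, and the buffer is modeled with a generic active-high enable rather than the pin polarity of any particular library component.

```python
Z = None  # hypothetical marker for a high-impedance (undriven) value


def decode_one_hot(select, n):
    """Decode a select value into n enables with exactly one asserted."""
    return [1 if i == select else 0 for i in range(n)]


def buft(data, enable):
    """Tri-state buffer model: pass data when enabled, otherwise drive nothing."""
    return data if enable else Z


def resolve(drivers):
    """Resolve a shared line: Z if undriven, error if drivers disagree."""
    active = [d for d in drivers if d is not Z]
    if not active:
        return Z
    if len(set(active)) > 1:
        raise ValueError("bus contention")
    return active[0]


def buft_mux(data_inputs, select):
    """n-to-1 MUX built from one tri-state buffer per data input."""
    enables = decode_one_hot(select, len(data_inputs))
    return resolve([buft(d, e) for d, e in zip(data_inputs, enables)])
```

The width of the MUX grows by adding buffers to the line, not by adding logic levels, which is the reason given above for preferring BUFTs in wide MUXes.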

Carry Logic
Each CLB contains dedicated arithmetic logic for fast carry and borrow signals
–Carry logic is associated with F and G function generators
Carry logic components have a vertical orientation
–Needed for speed and utilization
–Known as RPM or “Relationally Placed Macro”
–Examples:
*ADDx adders
*ADSUx adder/subtractors
*CCx counters
*COMPMCx magnitude comparators
[Figure: ADD4 adder with four A/B input pairs and output Z]

Counters
Libraries support a wide variety of fast and efficient counters
–Counters offer trade-offs between speed, utilization, and complexity
–Example: LogiBLOX counter styles
*Binary: slow and large
*Johnson: fastest practical counter, uses few Flip-Flops
*LFSR: fast and dense, but pseudo-random outputs
*One-Hot: useful for generating series of enables
*Carry Chain: high speed and utilization
–Synthesis tools select a component based on the design, or the designer can instantiate a component using LogiBLOX.
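To make one of the listed styles concrete, here is a minimal behavioral sketch of a Johnson counter: a shift register whose last bit is inverted and fed back to the first. The function names are hypothetical and the model says nothing about how LogiBLOX actually builds its counters.

```python
def johnson_step(state):
    """Shift by one position, feeding the inverted last bit back into the first."""
    return (1 - state[-1],) + state[:-1]


def johnson_sequence(n_bits):
    """Return the full state sequence starting from all zeros."""
    state = (0,) * n_bits
    seq = []
    while state not in seq:
        seq.append(state)
        state = johnson_step(state)
    return seq


if __name__ == "__main__":
    for s in johnson_sequence(4):
        print("".join(str(b) for b in s))
```

An n-bit Johnson counter in this model visits 2n states, and exactly one bit changes on each step, so the next-state logic stays trivial regardless of length.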

Global Clock Buffers
Global Buffers are low-skew, high-drive buffers
–Drive low-skew, high-speed long line resources
–Drive all Flip-Flops and Latches in FPGA
–Can also be used for high-fanout non-clock signals
–Check device for number of clocks
To use the global buffer, instantiate the BUFG component
For synthesis: Clocks are identified by different means depending on vendor
–Example: Synopsys FPGA Compiler connects clock buffers to all fan-in of clock pins
*Control clock buffer insertion with separate commands
*Consult synthesis interface guide or vendor

CLB Registers
Each register can be configured as a Flip-Flop or Latch
Independent clock polarity
Asynchronous Preset or Clear
Synchronous Set or Reset
Clock Enable
Direct input from CLB input (connections bypass LUTs)
[Figure: CLB register block diagram: inputs DIN, F, G, H, S/R, K (clock), and EC (clock enable); two flip-flops with S/R control, outputs QX and QY]

CLB Set and Reset Capabilities
CLB Flip-Flop features include Asynchronous Preset/Clear or Synchronous Set/Reset
–Synchronous Set/Reset is implemented in LUT
–Asynchronous Clear/Preset has two sources
*Dedicated Global Set/Reset (GSR) net
*Local Asynchronous Preset/Clear
[Figure: FDC flip-flop with local asynchronous preset/clear and GSR, and a flip-flop with synchronous set/reset implemented in a LUT]

Global Reset (1)
All Flip-Flops are always initialized during power up
–Via the Global Set/Reset network
You can access this network by instantiating the STARTUP primitive
–GSR is automatically connected to all CLB Flip-Flops using dedicated routing resources; in general you don’t need to connect STARTUP to Flip-Flops
–GSR, GTS, and Clock can be driven by internal signals or pins
–Assert GSR for global set or reset; GTS controls the tri-state buffers in IOBs
–Saves general use routing resources for the design
[Figure: STARTUP primitive with pins GSR, GTS, CLK, Q2, Q3, Q1Q4, and DoneIn]

Global Reset (2)
Use Global Reset whenever possible
–Local asynchronous reset is routed on general purpose interconnects
–Global Set/Reset is routed on dedicated interconnects
–Any signal or pin can drive the global set/reset pin
To use global reset network, Register Reset and Startup RST pin must be driven by the same signal.
Examples:
–Bad example: general purpose routing is used
–Improved example: general purpose routing is not used
*Good for simulation; extra connections will be trimmed by Design Manager
[Figure: reset signal routed to the Flip-Flops directly versus through the GSR pin of STARTUP]

Flip-Flop Clock Enable (1)
Register output does not change when clock enable is disabled
Allows synchronous design
Use instead of gating the clock signal
Clock enable is implemented in two ways:
–Directly inside the flip-flop via dedicated CE pin
–In a Look-Up Table
[Figure: flip-flop with D, EC, SET, and RESET inputs and output QX]

Clock Enable Example
Use clock enable when using most of or all logic inputs
–Avoid gating of clock signal directly
Use MUXed data when using only 1-2 logic inputs or for a gated clock enable
–Or, when two different clock enables must drive Flip-Flops in one CLB
[Figure: Use Clock Enables Instead of Gating Clock; FDxE flip-flop with D, Q, and CE]
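The two ways of implementing a clock enable can be compared with a small behavioral model. This is a sketch only, and the class names `FlipFlopCE` and `FlipFlopMuxed` are hypothetical: the first models a register with a dedicated CE pin, the second a plain D flip-flop whose data input comes from a 2-to-1 MUX (which would occupy LUT inputs) that recirculates the current output.

```python
class FlipFlopCE:
    """Register with a dedicated clock-enable pin."""

    def __init__(self, init=0):
        self.q = init

    def clock(self, d, ce):
        # On the clock edge, capture d only when ce is asserted.
        if ce:
            self.q = d
        return self.q


class FlipFlopMuxed:
    """Plain D flip-flop; a MUX in front selects new data or the fed-back output."""

    def __init__(self, init=0):
        self.q = init

    def clock(self, d, ce):
        mux_out = d if ce else self.q  # this MUX would be built in a LUT
        self.q = mux_out               # the flip-flop itself captures every edge
        return self.q


def run(ff, stimulus):
    """Apply (d, ce) pairs on successive clock edges and collect the outputs."""
    return [ff.clock(d, ce) for d, ce in stimulus]
```

Both models produce the same output sequence; the difference in hardware is whether the enable consumes a dedicated pin or logic inputs, which is the trade-off the guideline above describes. In neither case is the clock itself gated.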

Minimize the Number of Clocks
Use clock enable to reduce the number of clocks.
Example with two clocks:
Consider using clock enable instead of a clock
–Useful when:
*CLK2 is much slower than CLK1
*Or, CLK1 and CLK2 have a definite phase relationship
[Figure: FF1 clocked by CLK1 and FF2 clocked by CLK2, versus both flip-flops clocked by CLK1 with CLK2 replaced by a clock enable CE on FF2]
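A minimal sketch of replacing a slow second clock with a clock enable, assuming the slow clock would have been CLK1 divided by a fixed integer. The function name and the `divide_by` parameter are hypothetical; the model keeps every register on CLK1 and asserts the enable for one CLK1 cycle out of every `divide_by`.

```python
def capture_with_enable(data, divide_by):
    """Simulate a register clocked by CLK1 that captures only when CE is asserted.

    data[i] is the value present at the register input on CLK1 cycle i.
    Returns a list of (cycle, captured_value) pairs.
    """
    count = 0
    captured = []
    for cycle, value in enumerate(data):
        ce = (count == divide_by - 1)    # enable pulse, one CLK1 cycle wide
        if ce:
            captured.append((cycle, value))
        count = 0 if ce else count + 1   # divider counter, also clocked by CLK1
    return captured


if __name__ == "__main__":
    print(capture_with_enable(list(range(8)), 4))
```

With eight cycles and a divide ratio of four, the register captures on cycles 3 and 7, the same rate a divided clock would give, but with a single clock domain.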

RAM Provides 16X the Storage of Flip-Flops
32 bits versus 2 bits of storage
–Two 16x1 RAMs or one 32x1 Single-Port RAM fit in one CLB
–One 16x1 Dual-Port RAM fits in one CLB
32x8 shift register with RAM = 11 CLBs
–Using Flip-Flops, takes 128 CLBs for data alone
[Figure: CLB used as 32 bits of RAM (address inputs A0-A4, data inputs D1 and D2, WE, CLK, output O1) versus 2 bits of flip-flop storage (outputs Q1 and Q2)]
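The CLB counts quoted above can be reproduced with a short calculation. The breakdown of the 11 CLBs is an assumption, not something the slide states: one 32x1 RAM per data bit, plus an address counter built from flip-flops at two per CLB. The function names are hypothetical, and the RAM estimate only holds for depths that are a multiple of 32.

```python
import math

FFS_PER_CLB = 2        # from the slide: 2 bits of flip-flop storage per CLB
RAM_BITS_PER_CLB = 32  # from the slide: one 32x1 single-port RAM per CLB


def clbs_with_flip_flops(depth, width):
    """CLBs needed to hold depth x width bits in flip-flops (data only)."""
    return math.ceil(depth * width / FFS_PER_CLB)


def clbs_with_ram(depth, width):
    """Assumed breakdown: RAM for the data plus a flip-flop address counter."""
    data_clbs = math.ceil(depth / RAM_BITS_PER_CLB) * width
    addr_bits = (depth - 1).bit_length()          # 5 address bits for depth 32
    counter_clbs = math.ceil(addr_bits / FFS_PER_CLB)
    return data_clbs + counter_clbs


if __name__ == "__main__":
    print(clbs_with_flip_flops(32, 8), clbs_with_ram(32, 8))
```

For a 32x8 shift register this gives 128 CLBs with flip-flops and 8 + 3 = 11 CLBs with RAM, matching the figures on the slide under the stated assumption.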

General RAM Guidelines
32 words or fewer gives fastest performance
–32x1 or 16x2 RAM fits in one CLB
*Delays are short (one level of logic)
–Data and output MUXes are required to expand depth
Less than 256 words recommended per RAM
–Exceptions include T1 framers, which use RAMs as a shift register
Width easily expanded
–Connect the address lines to multiple blocks
Recommendation: Use less than 1/2 of max memory resources
–Maximum memory uses all logic resources of CLBs
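Depth expansion can be sketched behaviorally: the extra address bit steers the write enable to one of two 32x1 blocks and selects between their outputs with a MUX. The class names are hypothetical, and the model ignores timing; it only shows where the additional decode and MUX logic comes from.

```python
class Ram32x1:
    """Behavioral model of a single 32-word by 1-bit RAM block."""

    def __init__(self):
        self.cells = [0] * 32

    def write(self, addr, data, we):
        if we:
            self.cells[addr] = data

    def read(self, addr):
        return self.cells[addr]


class Ram64x1:
    """64x1 RAM built from two 32x1 blocks (depth expansion)."""

    def __init__(self):
        self.banks = [Ram32x1(), Ram32x1()]

    def write(self, addr, data):
        bank, low = addr >> 5, addr & 0x1F
        # Write-enable decode: only the addressed bank is written.
        for i, ram in enumerate(self.banks):
            ram.write(low, data, we=(i == bank))

    def read(self, addr):
        bank, low = addr >> 5, addr & 0x1F
        outputs = [ram.read(low) for ram in self.banks]
        return outputs[bank]  # output MUX selected by the upper address bit
```

Width expansion needs none of this: each extra data bit is another block sharing the same address lines, which is why the guideline calls width "easily expanded".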

Memory Use
Most synthesis tools can synthesize ROM from behavioral HDL code
RAMs may be synthesized
–Synplicity can synthesize RAMs
Use library primitives and macros for standard size memory
–RAM/ROM16X1S to 32X8S
–Use S suffix for Synchronous RAM
–Use D suffix for Dual-Port RAM
Use LogiBLOX to generate custom size memories
[Figure: RAM16X8S with address inputs A0-A3, data inputs D[7:0], WE, WCLK, and outputs O[7:0]]

IOB Block Diagram
Three-state output
Registered Input or Output
Bi-directional I/O
Output Slew Rate control
Programmable setup/hold delay
[Figure: IOB block diagram: output flip-flop, input flip-flop or latch, fast latch, input delay, slew rate control, pull-up and pull-down, PAD]

IOB Flip-Flops and Latches
Synthesis tools and Design Manager can move internal registers into IOBs to meet timing constraints
Flip-Flops and Latches can be used in unbonded IOBs
Use IOB Flip-Flops:
–When all CLB Flip-Flops are used
–To minimize the Flip-Flop-to-PAD delay
–To minimize skew between outputs
IO Blocks contain minimal combinatorial logic
–IOB Flip-Flops can be used as part of an internal shift register
–Do not use IOB Flip-Flops as part of a pipeline
Input library components begin with I. Examples: ILD, IFD16
Output library components begin with O. Examples: OFD, OFDT16

Output Three-State Control
Instantiation: Use OBUFE and OBUFT components
–OBUFT output is in the high-impedance state when OE is low
Synthesis: If-Then statements driving a Hi-Z value onto an output may be synthesized into an OBUFE or OBUFT
Three-state control also via a dedicated global net
–Needed for configuration
–Also controlled by GTS on STARTUP primitive
[Figure: OBUFE driven by OE, with truth table]
IN  T  OUT
X   1  Z
IN  0  IN

Output Combinatorial Logic
Small functions can be built into the IOB
–Can be used as a generic two-input function generator or MUX
–One input can be driven by IOB output clock signal
–Requires library components beginning with “O”.
*Examples: OAND, OMUX
–F input pin is faster than IO pin
–Does not apply to all FPGAs
[Figure: OAND2 with inputs F (fast) and IO, driving an OPAD]

Guidelines for IOB Use
Unused IOBs:
–Outputs of unused IOBs are automatically disabled
–Pull-ups are automatically connected on unused IOBs
Used IOBs:
–A PULLUP or PULLDOWN primitive can be connected to used IOBs
–Inputs should not be left floating
*Add a pull-up to design inputs that may be left floating to reduce power and noise
Output drive
–12 mA sink current per output on most families
–Two adjacent outputs can be tied together to double the drive off chip

Pipeline for Speed
Use synchronous design
Pipelining improves speed
–Consider wherever latency is not an issue
–Use for terminal counts, carry lookahead, etc.
How to estimate the number of logic levels per stage
Example for 100 MHz clock frequency in XC4013XL-09:
Clock period                           10 ns
One level                            - 4.1 ns  (tCO + tNET + tSU ~= 0.9 + 1.2 + 2 ns)
Delay allowance                        5.9 ns
Each added level                     / 3.2 ns  (tPD + tNET ~= 1.2 + 2 ns)
Additional levels of logic allowed     1 CLB   (5.9 / 3.2 ~= 1.8, rounded down)
–Why isn’t the SRAM in the CLB included in the delay calculation?
[Figure: register-to-register path through CLBs, labeled tCO, tNET, tPD, tNET, tPD, tNET, tSU]
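The level-budget estimate can be written as a two-line calculation. This sketch takes the per-level sums quoted above (4.1 ns for the first level, 3.2 ns for each added level) as inputs; the function name and parameter names are hypothetical.

```python
import math


def additional_levels(clock_period_ns, first_level_ns, per_level_ns):
    """Return (delay allowance, whole additional logic levels that fit in it)."""
    allowance = clock_period_ns - first_level_ns
    return allowance, math.floor(allowance / per_level_ns)


if __name__ == "__main__":
    allowance, levels = additional_levels(10.0, 4.1, 3.2)
    print(f"allowance = {allowance:.1f} ns, additional levels = {levels}")
```

With a 10 ns period and those sums, the subtraction leaves 5.9 ns, which is room for one additional level at 3.2 ns each. Substituting the speed-grade numbers from a data sheet for another device gives the corresponding budget.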

Pipeline Example
Break up combinatorial logic into separate stages
–Clock frequency increases
–Latency also increases: extra cycle(s) are added
Example: Frequency can double by adding another stage, but an extra cycle is added
[Figure: a * b + c computed in one stage, versus the same function with a register between the multiplier and the adder]
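A cycle-by-cycle sketch of the latency cost, assuming the example function is a * b + c with a register inserted between the multiplier and the adder. The function names are hypothetical, and the model tracks only values per clock edge, not delays.

```python
def combinational(samples):
    """Reference result: a * b + c for each input sample."""
    return [a * b + c for a, b, c in samples]


def pipelined(samples):
    """Two-stage pipeline: stage 1 registers a*b and c, stage 2 registers the sum.

    Returns the output register value after each clock edge. One extra
    all-zero sample is appended so the last real result is flushed out.
    """
    prod_reg = c_reg = None  # stage-1 registers (None = not yet loaded)
    outputs = []
    for a, b, c in samples + [(0, 0, 0)]:
        # All registers update on the same edge, using values from before it.
        next_out = prod_reg + c_reg if prod_reg is not None else None
        prod_reg, c_reg = a * b, c
        outputs.append(next_out)
    return outputs


if __name__ == "__main__":
    data = [(1, 2, 3), (4, 5, 6)]
    print(combinational(data), pipelined(data))
```

The pipelined results are the same values shifted by one clock edge: each stage now contains only the multiplier or only the adder, so the clock can run faster, and the price is the extra cycle.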

Keep Related Logic Together (1)
No Hierarchy in Combinational Path
Example: Optimization is limited because hierarchical boundaries prevent sharing of common terms
The path from Reg A to Reg C is divided between three different block descriptions
[Figure: blocks A, B, and C, with combinational logic from Reg A to Reg C crossing all three blocks]

Keep Related Logic Together (2)
Good Example
Related combinational logic drives registers in the same block
No hierarchical boundaries between combinational logic and registers
–Allows for improved sequential mapping
[Figure: block A containing Reg A, and a merged block B & C containing the combinational logic and Reg C]

Register All Block Outputs
Align block boundaries on register outputs
–Helps floorplanning
Poor partitioning
–Sum is not registered, and may become a critical path.
Good partitioning
–Why is performance improved when combinatorial logic drives a register in the same CLB?
[Figure: adder of a0 and a1 with unregistered sum output, versus the same adder with sum registered on clk inside the block]

Duplicate Registers to Reduce Fanout
Why does fanout reduction improve performance?
Register has 24 loads
Each Register has 12 loads
[Figure: one register on en and clk driving [23:0]out, versus two duplicated registers each driving half of the loads]

Counter Tips (1)
Do not use binary sequence if unnecessary
Consider higher performance or smaller counter types
–Examples: LFSR, Pre-scaled, Gray
Use Pre-Scaling on non-loadable counters to increase speed
–LSBs toggle quickly
–See Application Notes XAPP001 and XAPP014
[Figure: fast small counter whose terminal count TC drives the clock enable CE of a large dense counter with slower carry]
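Pre-scaling can be sketched as two counters on the same clock: a small, fast counter holds the low bits, and its terminal count enables a larger counter that holds the high bits. The function name and the bit-width parameters are hypothetical; the model only checks that the combined value still counts in plain binary order.

```python
def prescaled_counter(n_cycles, low_bits, high_bits):
    """Return the combined count value seen on each of n_cycles clock cycles."""
    low = high = 0
    values = []
    for _ in range(n_cycles):
        values.append((high << low_bits) | low)
        tc = (low == (1 << low_bits) - 1)  # terminal count of the fast counter
        low = (low + 1) % (1 << low_bits)
        if tc:                             # slow counter is enabled only on TC
            high = (high + 1) % (1 << high_bits)
    return values


if __name__ == "__main__":
    print(prescaled_counter(10, 2, 3))
```

Because the large counter changes only once every 2^low_bits cycles, its slower carry path has several clock periods to settle in a real design (given suitable timing constraints), while only the small counter has to run at full speed.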

Counter Tips (2)
Use Gray code counters if decoding outputs
–Glitch free, because only one bit changes per transition
Consider Linear Feedback Shift Register for speed when terminal count is all that is needed
–Or when any regular sequence is acceptable (e.g., FIFO)
[Figure: 10-bit shift register, Q0 through Q9, with feedback taken from Q6 and Q9]
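Both counter styles can be checked with a few lines. This is a sketch under assumptions: the LFSR uses XOR feedback from Q9 and Q6 shifted into Q0 (the tap positions are read from the figure labels, and a hardware implementation might use XNOR instead), and the Gray code is the standard reflected binary code. The function names are hypothetical.

```python
def lfsr_step(state):
    """Advance a 10-bit LFSR: bit i of state is Qi, feedback = Q9 XOR Q6 into Q0."""
    feedback = ((state >> 9) ^ (state >> 6)) & 1
    return ((state << 1) | feedback) & 0x3FF


def lfsr_period(seed=1):
    """Count steps until the LFSR returns to its seed (seed must be non-zero)."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        steps += 1
    return steps


def to_gray(n):
    """Convert a binary count to reflected Gray code."""
    return n ^ (n >> 1)


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")
```

With these taps the LFSR cycles through all 1023 non-zero states using a single XOR gate as next-state logic, and successive Gray codes differ in exactly one bit, which is why decoded Gray outputs do not glitch. The all-zero state is a lock-up state for XOR feedback and has to be avoided at reset.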

State Machine Design Tips (1)
Use One-Hot Encoding for small state machines
–Shift-register like structure
–One Flip-Flop is assigned to each state
–Works well in Xilinx “register-rich” FPGAs
–Number of required Flip-Flops may be higher than other state machines, but logic to generate state is less complex
–RAMs can be used to encode large state machine
Prototype OHE State Machine: Qx, Qy, and Qz are composed of state variables from previous states
[Figure: three flip-flops for states Qn, Qn+1, and Qn+2, each with inputs I1 through In and next-state logic Qx, Qy, and Qz]
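A minimal sketch of one-hot encoding, using a made-up three-state machine (IDLE, RUN, DONE) with hypothetical inputs `start` and `finish`. Each state has its own flip-flop, and each next-state equation only looks at the few states and inputs that can lead into it.

```python
def one_hot_step(state, start, finish):
    """Compute the next one-hot state (idle, run, done) from the current one."""
    idle, run, done = state
    next_idle = (idle and not start) or done            # stay idle, or return from DONE
    next_run = (idle and start) or (run and not finish)  # enter RUN, or keep running
    next_done = run and finish                           # leave RUN when finished
    return (int(bool(next_idle)), int(bool(next_run)), int(bool(next_done)))


def run_machine(inputs, state=(1, 0, 0)):
    """Apply (start, finish) pairs on successive clocks and return every state."""
    trace = [state]
    for start, finish in inputs:
        state = one_hot_step(state, start, finish)
        trace.append(state)
    return trace
```

Each equation has only a handful of inputs, so it fits in very little logic per flip-flop, which is the trade described above: more flip-flops, simpler next-state logic.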

State Machine Design Tips (2)
Split complex states
Need to minimize number of inputs, not number of Flip-Flops, in FPGAs
–Use One-Hot encoding for medium to large state machines (greater than 12 states)
Complex states may be improved by breaking up into additional simpler states
[Figure: State A leading to State B on cond1, versus State A split into State A1 and State A2 before State B]

State Machine Design Tips (3)
Consider a pipeline: break the state machine into two or more clock cycles
–Two clock cycles for a state is better than having to slow the clock for the entire state machine
–In practice, this means breaking up wide input equations using intermediate nodes in the state diagram.
[Figure: transition from State A to State C, versus the same transition passing through an intermediate State B]

Summary
Use Tri-state buffers for multiplexing
Carry Logic is not the only way to create fast arithmetic functions
Use the GSR net to save routing resources and use global routing resources
Use Clock Enable port on registers to design synchronously and save logic
Best memories are <=32 words
Use LogiBLOX to customize memories
Use IOB registers for modules that do not require logic, such as shift registers
Refer to LogiBLOX or Design Manager Help for more information on LogiBLOX

Questions (1)
What problem may occur in this circuit?
How can the circuit be improved?
[Figure: binary counter with outputs Q0, Q1, Q2 and terminal count TC, and a D flip-flop with clock input CK]

Questions (2)
What does GSR stand for?
–What component sources the GSR net?
–When should the GSR net be used?
What component is instantiated to use the Global Clock?
Can the Global Clock be synthesized?

Questions (3)
How many global clocks can be used in an XC4085XL-3?
–See the data sheet for the XC4000XL family, available on the Web or the AppLINX CD.
Why is one-hot encoding a good way to encode a small state machine?
When should IOB registers be used? When should they be avoided?
