Automated Extra Pipeline Analysis of Applications mapped to Xilinx UltraScale+ FPGAs

Ilya Ganusov (1), Henri Fraisse (1), Aaron Ng (1), Rafael Trapani Possignolo (2), Sabya Das (1)
(1) Xilinx Inc., {ilya.ganusov}@xilinx.com
(2) University of California, Santa Cruz, {rpossign}@ucsc.edu

Agenda
- Introduction
- Pipeline analysis tool in Vivado
- Automatic pipeline insertion
- Experimental data
- Conclusion

Introduction

How does pipelining help?
- Pipelining can reduce the length of critical paths
- Improves FMax
- Adds extra cycles of latency
- Pipeline registers need to be balanced to preserve functionality
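
The balancing requirement can be illustrated with a small check: every path from the primary inputs to a given node must cross the same number of inserted registers, otherwise values from different cycles meet at that node. The sketch below assumes a toy netlist given as a list of edges; the names `path_latencies`, `is_balanced`, and `EDGES` are illustrative and not part of any Vivado interface.

```python
def path_latencies(edges, registered):
    """Map each node to the set of register counts seen over all input paths.

    `edges` is a list of (src, dst) pairs forming a DAG (a toy netlist).
    `registered` is the set of edges that carry an inserted pipeline register.
    """
    nodes = {n for edge in edges for n in edge}
    preds = {n: [] for n in nodes}
    for src, dst in edges:
        preds[dst].append(src)
    memo = {}

    def visit(node):
        if node not in memo:
            if not preds[node]:
                memo[node] = {0}  # primary input: no registers crossed yet
            else:
                memo[node] = {
                    count + (1 if (src, node) in registered else 0)
                    for src in preds[node]
                    for count in visit(src)
                }
        return memo[node]

    return {n: visit(n) for n in nodes}


def is_balanced(edges, registered):
    """True if every node sees a single latency on all of its input paths."""
    return all(len(c) == 1 for c in path_latencies(edges, registered).values())


# Toy netlist: x depends on a and b; y depends on a and x.
EDGES = [("a", "x"), ("b", "x"), ("a", "y"), ("x", "y")]
unbalanced = is_balanced(EDGES, {("a", "x")})
balanced = is_balanced(EDGES, {("a", "x"), ("b", "x"), ("a", "y")})
```

Registering only the a-to-x edge delays one operand of x but not the other, so the check fails; registering every edge that leaves the inputs adds one cycle on all paths and the check passes.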

Limitations of pipelining
- Loops cannot be pipelined
- FMax improvement is limited by the slowest loop
- The initial state in the presence of loops needs to be adjusted
- In most practical applications this requires minor modifications to the design
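
The loop limit can be put into numbers with a rough sketch. It assumes each loop is summarised by its total combinational delay and the number of registers already inside it, and that those registers could be spread ideally around the loop; both are simplifying assumptions, and `loop_limited_fmax_mhz` is a hypothetical helper, not a tool function.

```python
def loop_limited_fmax_mhz(loops):
    """Upper bound on FMax (MHz) imposed by feedback loops.

    `loops` is a list of (total_delay_ns, register_count) pairs, a
    hypothetical summary of each loop. Pipelining cannot add registers to a
    loop, so the best period a loop allows is its total delay divided by the
    registers it already contains.
    """
    if not loops:
        return float("inf")  # no loops: pipelining is not loop limited
    worst_period_ns = max(delay / regs for delay, regs in loops)
    return 1000.0 / worst_period_ns


# Invented example: three loops; the second one is the slowest.
loops = [(4.0, 2), (2.5, 1), (3.0, 3)]
bound = loop_limited_fmax_mhz(loops)
```

However many pipeline stages are added elsewhere, this example design cannot run faster than the bound set by its 2.5 ns single-register loop.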

Automatic pipeline analysis in Vivado

Automatic Pipeline Analysis in Vivado
- New automatic pipeline analysis
- Automatically analyses the design at any SPR stage
- Enables rapid design exploration
- Suggests the most efficient places to insert registers in RTL
- Backward-compatible: 7-series, UltraScale, UltraScale+
[Figure: design flow (Synthesis, Place, Route, Report timing, Bit-stream) with pipeline analysis]

Automatic Pipeline Analysis in Vivado
- The tool interface is a Tcl command:
  report_pipeline_analysis [-cell args] [-clocks args] [-max_added_latency arg] [-report_loops]
- Reports several key metrics (FMax increase, WNS, register count) for each clock domain and for each added stage of latency

+-----------------+---------------+------------------+------------------+------------------+-----------+----------------+----------------+
| Clock           | Added Latency | Ideal Fmax (MHz) | Ideal Delay (ns) | Requirement (ns) | WNS (ns)* | Added Pipe Reg | Total Pipe Reg |
+-----------------+---------------+------------------+------------------+------------------+-----------+----------------+----------------+
| SYS_CLK         | 0             | 349.00           | 2.87             | 2.31             | -0.56     | n/a            | 0              |
| SYS_CLK         | 1             | 354.69           | 2.82             | 2.31             | -0.51     | 7693           | 7693           |
| SYS_CLK         | 2             | 356.97           | 2.80             | 2.31             | -0.49     | 5213           | 12906          |
| SYS_CLK         | 3             | 364.78           | 2.74             | 2.31             | -0.43     | 3613           | 16519          |
| SYS_CLK         | 4             | 366.12           | 2.73             | 2.31             | -0.42     | 7348           | 23867          |
| SYS_CLK         | 5             | 537.24           | 1.86             | 2.31             | 0.45      | 20348          | 44215          |
+-----------------+---------------+------------------+------------------+------------------+-----------+----------------+----------------+
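
The columns of the report are related by simple arithmetic, which a short sketch can check: WNS is the requirement minus the ideal delay, and the total register count is the running sum of the added registers. The rows below are copied from the report; the helper names are illustrative, and the "n/a" entry for zero added latency is treated as 0 registers.

```python
# (added latency, ideal delay ns, requirement ns, added pipeline registers)
# Values copied from the report; "n/a" in the first row is entered as 0.
ROWS = [
    (0, 2.87, 2.31, 0),
    (1, 2.82, 2.31, 7693),
    (2, 2.80, 2.31, 5213),
    (3, 2.74, 2.31, 3613),
    (4, 2.73, 2.31, 7348),
    (5, 1.86, 2.31, 20348),
]


def summarize(rows):
    """Return (latency, wns_ns, total_registers) for each report row."""
    out, total = [], 0
    for latency, delay_ns, req_ns, added in rows:
        total += added
        out.append((latency, round(req_ns - delay_ns, 2), total))
    return out


def min_latency_meeting_timing(rows):
    """Smallest added latency whose WNS is non-negative, or None."""
    for latency, wns, _total in summarize(rows):
        if wns >= 0:
            return latency
    return None


summary = summarize(ROWS)
best = min_latency_meeting_timing(ROWS)
```

For this clock domain the first four added stages barely move the slack; only the fifth stage, which costs by far the most registers, closes timing.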

Automatic pipeline insertion

Automatic Pipeline Insertion
[Flowchart boxes: build graph, find loops, time loops, build a pipeline stage, insert pipeline registers, optimize]
[Figure: design flow (Synthesis, Place, Route, Report timing, Bit-stream) with a pipeline insertion step]

Building a pipeline stage
1. Select all pins
2. Sort pins by criticality
3. Add the most critical pin to the stage
4. Discard pins in its transitive fan-in
5. Discard pins in its transitive fan-out
6. #pins > 0? Yes: go back to step 3. No: insert pipeline registers on all stage pins

Building a pipeline stage (animated example on a LUT network with inputs I1 to I6 and outputs O1 to O3; figure not reproduced)
1. Select the most critical legal pin, the one that improves slack the most if pipelined
2. Mark all pins in its transitive fan-in
3. Mark all pins in its transitive fan-out
4. Select the next unmarked critical pin
5. Mark all pins in its transitive fan-in
6. Mark all pins in its transitive fan-out
7. Select the next unmarked critical pin
8. Mark all pins in its transitive fan-in
9. Select the next unmarked critical pin
10. Mark all pins in its transitive fan-in
11. Mark all pins in its transitive fan-out
12. Extract the cut
13. Insert pipeline registers
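
The greedy selection walked through above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's implementation: the netlist is a toy DAG of pins given as a fan-out map, criticality is a precomputed number per pin (higher means more critical), and the legality check and the final cut extraction from the slides are not modelled. All names are hypothetical.

```python
def reachable(start, adjacency):
    """All pins reachable from `start` by following `adjacency` edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def build_pipeline_stage(fanout, criticality):
    """Greedily pick pins so that no chosen pin is upstream of another.

    `fanout` maps a pin to the pins it drives; `criticality` maps every pin
    to a score. Both are assumed inputs for this sketch.
    """
    # Reverse the fan-out edges to obtain fan-in edges.
    fanin = {}
    for src, dsts in fanout.items():
        for dst in dsts:
            fanin.setdefault(dst, []).append(src)

    # Most critical first; the pin name breaks ties deterministically.
    order = sorted(criticality, key=lambda pin: (-criticality[pin], pin))
    discarded, stage = set(), []
    for pin in order:
        if pin in discarded:
            continue
        stage.append(pin)
        discarded |= reachable(pin, fanin)   # transitive fan-in
        discarded |= reachable(pin, fanout)  # transitive fan-out
    return stage


# Toy network: a and b drive c, c and d drive e, d also drives f.
FANOUT = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e", "f"]}
CRITICALITY = {"a": 0.5, "b": 0.4, "c": 0.9, "d": 0.7, "e": 0.6, "f": 0.3}
stage = build_pipeline_stage(FANOUT, CRITICALITY)
```

Discarding the fan-in and fan-out of each chosen pin is what keeps a path from crossing two registers of the same stage. Whether every path crosses exactly one register depends on the later cut extraction step, which this sketch leaves out.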

Experimental results

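Experimental setup
- Used Xilinx standard QoR suite
- Used post-place pipeline insertion

+-------------+--------+---------+---------+
| Metric      | Min    | Max     | Avg     |
+-------------+--------+---------+---------+
| clk domains | 1      | 67      | 3       |
| FMax        | 10 MHz | 760 MHz | 300 MHz |
| LUT         | 8200   | 464200  | 129600  |
| FF          | 3392   | 586475  | 123163  |
| BRAM        |        | 1152    | 187     |
| DSP         |        | 2700    | 195     |
+-------------+--------+---------+---------+
Total designs: 93

[Figure: design flow (Synthesis, Place, Route, Report timing, Bit-stream) with pipeline insertion]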

Computing potential FMax gain of pipelining
- Considered two loop limits
- Initial: critical loop after default P&R flows
- Tight: critical loop after loop-only P&R flow
- Distribution of gain is bimodal
- About half of the designs are loop limited
- About 30% of designs can be improved by more than 50%
- Initial loop: 18% Gmean FMax
- Tight loop: 29% Gmean FMax
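
Reading "Gmean FMax" as the geometric mean of the per-design FMax ratios (an interpretation of the slide, not something it spells out), the aggregate can be computed as below. The FMax values are invented for illustration and are not taken from the QoR suite.

```python
import math


def gmean_gain_percent(baseline_mhz, improved_mhz):
    """Geometric-mean FMax gain, in percent, over a set of designs."""
    ratios = [new / old for old, new in zip(baseline_mhz, improved_mhz)]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


# Invented example: one loop-limited design (no gain) and two that can
# reach 1.5x their baseline FMax.
baseline = [100.0, 200.0, 400.0]
loop_limit = [100.0, 300.0, 600.0]
gain = gmean_gain_percent(baseline, loop_limit)
```

A geometric mean is the usual choice for averaging ratios, because a few designs with very large gains do not dominate the result the way they would in an arithmetic mean.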

Achieved FMax gain with automatic pipelining
- FMax improvement greater than the initial loop limit in 50% of cases
- FMax improvement close to the tight loop limit in most cases
- Current limiting factors: DSP cascades, BRAM cascades, not using SRL for balancing

Register utilization for pipelining
- In more than 95% of cases the FF:LUT ratio after pipelining is below 2, which is the ratio the Xilinx architecture offers
[Figure annotations: "DSP based", "Architecture ratio"]

Pipelining data across different architectures
- Current Xilinx architectures respond well to highly pipelined designs
- Time-borrowing should favor UltraScale+ over UltraScale

Conclusion

Concluding Remarks
- Presented the pipeline analysis tool implemented in Vivado
- Implemented and evaluated automatic pipeline insertion
- Demonstrated its potential on a representative set of designs
- Results are within 10% of the theoretical optimum
- Showed that the UltraScale / UltraScale+ architectures handle highly pipelined designs well