1 WireMap FPGA Technology Mapping for Improved Routability Stephen Jang, Xilinx Inc. Billy Chan, Xilinx Inc. Kevin Chung, Xilinx Inc. Alan Mishchenko,

Presentation transcript:

1 WireMap FPGA Technology Mapping for Improved Routability Stephen Jang, Xilinx Inc. Billy Chan, Xilinx Inc. Kevin Chung, Xilinx Inc. Alan Mishchenko, UC Berkeley

2 Outline
1. Motivation and algorithm overview
2. Review of area recovery
3. Algorithm details
4. Results and summary

3 Motivation
- Generic
  - Cut-based mapping algorithms do well at minimizing logic level and area (LUT count)
  - Could we change cut-based mapping to improve the netlist for packing, placement, and routing?
- Specific 1
  - Fewer pin-to-pin connections should make the design easier to place and route
  - Could we come up with a mapping algorithm that minimizes the total number of connections in a design?
- Specific 2
  - Newer FPGAs allow two outputs per LUT
  - Could we produce a mapping that packs better into these dual-output LUTs?

4 Area Recovery Overview
1. Perform delay-optimal mapping first
   - Not all paths are critical
2. Perform area recovery on non-critical paths
   - Consider all nodes with positive slack
   - For each node, look for a different cut that reduces area
Area recovery heuristics
- Area flow (global view): chooses cuts with better logic sharing
- Exact local area (local view): minimizes the number of LUTs by looking at one node at a time
- Both are important

5 Edge Recovery Overview
- Find a simple-to-compute metric that minimizes edge count and creates smaller LUTs
  - Definition: edge = pin-to-pin connection between LUTs
- Cut-based area recovery algorithms can be extended to minimize edges!
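As a rough illustration of the edge definition on this slide, the sketch below counts pin-to-pin connections in a tiny mapped netlist. The netlist, the dictionary representation, and both function names are hypothetical. The slide does not say whether connections driven by primary inputs count as edges, so both variants are shown.

```python
# Hypothetical mapped netlist: each LUT name maps to the signals on its input pins.
# Names that never appear as keys (a..e) are treated as primary inputs.
netlist = {
    "x": ["a", "b", "c"],
    "y": ["x", "d"],
    "z": ["x", "y", "e"],
}

def count_edges(luts):
    """Count every driven LUT input pin as one edge (assumption: PI drivers included)."""
    return sum(len(fanins) for fanins in luts.values())

def count_lut_to_lut_edges(luts):
    """Count only connections whose driver is itself a LUT."""
    return sum(1 for fanins in luts.values() for f in fanins if f in luts)

print(count_edges(netlist))             # 8
print(count_lut_to_lut_edges(netlist))  # 3
```

Under either convention, a mapper that prefers cuts with fewer leaves lowers this count, which is the quantity edge recovery targets.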

6 Edge Flow Cost Functions
- Edge flow phase
  - Use edge flow to minimize global edge count
- Exact local edge phase
  - Exactly minimize edge count within MFFCs (maximum fanout-free cones)

7 WireMap Algorithm
Input: And-Inverter Graph
1. Compute K-feasible cuts for each node
2. Compute the best arrival time at each node
   - In topological order (from PI to PO)
   - Compute the depth of all cuts and choose the best one
3. Perform area and edge recovery
   - Using area flow and edge flow
   - Using exact local area and exact local edge
4. Choose the best cover
Output: Mapped netlist
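Steps 1 and 2 of the algorithm can be sketched as follows. This is a minimal illustration of textbook K-feasible cut enumeration and unit-delay arrival times, not the ABC implementation. The AIG, the node names, and the helper names are assumptions made for the example, and inverters are ignored.

```python
from itertools import product

def enumerate_cuts(aig, order, k):
    """aig maps each AND node to its two fanins; names absent from aig are primary inputs.
    order must be topological (fanins before fanouts)."""
    cuts = {}
    for n in order:
        trivial = frozenset([n])
        if n not in aig:
            cuts[n] = {trivial}
            continue
        f0, f1 = aig[n]
        # Merge every pair of fanin cuts and keep the unions with at most k leaves.
        merged = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1]) if len(c0 | c1) <= k}
        cuts[n] = merged | {trivial}
    return cuts

def best_arrival(aig, order, cuts):
    """Unit-delay model: arrival(node) = 1 + max arrival over the leaves of its best cut."""
    arrival, best_cut = {}, {}
    for n in order:
        if n not in aig:
            arrival[n] = 0
            continue
        options = [c for c in cuts[n] if c != frozenset([n])]
        best_cut[n] = min(options, key=lambda c: (max(arrival[l] for l in c), sorted(c)))
        arrival[n] = 1 + max(arrival[l] for l in best_cut[n])
    return arrival, best_cut

# Hypothetical AIG: g3 = AND(AND(a, b), AND(c, d))
aig = {"g1": ("a", "b"), "g2": ("c", "d"), "g3": ("g1", "g2")}
order = ["a", "b", "c", "d", "g1", "g2", "g3"]

cuts3 = enumerate_cuts(aig, order, 3)
arr3, _ = best_arrival(aig, order, cuts3)
arr4, best4 = best_arrival(aig, order, enumerate_cuts(aig, order, 4))
print(len(cuts3["g3"]), arr3["g3"], arr4["g3"])  # 4 2 1
```

With K = 3 the cut {a, b, c, d} is infeasible, so g3 needs two LUT levels. With K = 4 the whole cone fits into one LUT.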

8 Algorithm – Edge Flow
1. Do delay-optimal mapping
2. Compute slack at each node
3. Do area recovery with area flow
   - Visit nodes in topological order from PI to PO
   - Choose cuts that do not exceed the slack budget and have the smallest area flow
   - If two cuts have the same area flow, choose the cut with the lower edge flow
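The selection rule in step 3 can be sketched as a lexicographic comparison. The area-flow formula follows the appendix slide. The edge-flow formula is an assumption made by analogy (number of leaves plus the leaves' edge flow divided by their fanout), which is consistent with the "3 + [...]" form in the example slide. All node names and numbers below are hypothetical.

```python
def flow(cut, local_cost, leaf_flow, fanout):
    """Generic flow: local cost of the cut plus each leaf's flow shared over its fanouts."""
    return local_cost + sum(leaf_flow[l] / fanout[l] for l in cut)

def choose_cut(cuts, required, arrival, af, ef, fanout):
    """Pick the cut within the slack budget with the smallest (area flow, edge flow)."""
    feasible = [c for c in cuts if 1 + max(arrival[l] for l in c) <= required]
    return min(
        feasible,
        key=lambda c: (flow(c, 1, af, fanout), flow(c, len(c), ef, fanout)),
    )

# Hypothetical per-leaf data; a..d are primary inputs, p, q, r are already-mapped nodes.
arrival = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1, "r": 2}
af = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1, "r": 1}
ef = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 2, "q": 3, "r": 4}
fanout = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 1, "r": 2}

cuts = [("a", "b", "p"), ("c", "d", "q"), ("a", "r")]

# Required time 2: ("a", "r") is too deep; the other two tie on area flow (2.0),
# so the lower edge flow (5.0 vs 6.0) decides.
print(choose_cut(cuts, 2, arrival, af, ef, fanout))  # ('a', 'b', 'p')
# Required time 3: the deeper cut becomes legal and wins on area flow (1.5).
print(choose_cut(cuts, 3, arrival, af, ef, fanout))  # ('a', 'r')
```

Because flows are floating-point values, a practical implementation would likely compare area flows with a small tolerance before falling back to edge flow. The exact tuple comparison here only works because the example values are exactly representable.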

9 Algorithm - Exact Local Edges
1. After optimization with area flow + edge flow
   - Described on the previous slide
2. Do edge recovery with exact edges
   - Visit nodes in topological order from PI to PO
   - Among all cuts within the slack budget, choose the cut with the smallest area; break ties by choosing the cut with fewer edges
Note: Unlike edge flow, no estimation is involved

10 Experimental Setup
- Implemented WireMap in ABC
- Compared WireMap against two algorithms in ABC
  - Baseline: basic mapping with area recovery
  - Mapping with Structure Choices (MSC): mapping with area recovery for several netlists produced by synthesis
- WireMap was implemented on top of MSC
- Used VPR to place and route each design and measure wirelength and critical path delay
  - Single-LUT cluster, single-length wire segment model
- Used SIS to pack single-output LUTs into dual-output LUTs using maximum cardinality matching

11 Results Summary
- MSC is superior to baseline mapping
  - Single-output LUT count reduced by 9.1%
  - Edge count reduced by 8.1%
  - Dual-output LUT count reduced by 7.7%
- WireMap further reduces edges by 9.3% and dual-output LUT count by 9.4% versus MSC
  - Single-output LUT count reduced by only 1.3% relative to MSC
- WireMap's reduction of edges and dual-output LUTs is not directly related to single-output LUT reduction

12 Comparison of Area Recovery and Area/Edge Recovery Flow Mapping (K = 6)
[Table not recovered; column groups were labeled "Area recovery" and "Area/Edge recovery".]
- WireMap reduces dual-output LUT count by 9.4%
- WireMap further reduces edges by 9.3%

13 Wirelength, Channel Width, and Critical Path Delay Comparison
twl = total wire length; mcw = minimum channel width required to route in VPR; cpd = critical path delay at minimum channel width, across the three implementations
[Table not recovered; column groups were labeled "Area recovery" and "Area/Edge recovery".]
- Wirelength reduced by 8.5% vs. MSC
- Minimum channel width reduced by 6%
- Critical path delay reduced by 2.3%

14 WireMap Results – LUT Packing
The histogram shows how the single-output LUT size distribution is affected, leading to a 9.4% reduction in dual-output LUT6s.

LUT Distribution: MSC vs. WireMap (% of LUTs)
           LT2      LT3      LT4      LT5      LT6
MSC        4.71%    8.00%    15.87%   23.49%   47.93%
WireMap    10.12%   12.66%   17.89%   20.19%   39.14%

The share of LT2 to LT4 increased; the share of LT5 and LT6 was reduced.
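The shift toward smaller LUTs can be quantified directly from the histogram values. The short calculation below is only arithmetic on the numbers shown on this slide. It assumes that LT2 through LT6 denote LUTs with two through six inputs and that the values are listed in that order.

```python
sizes = ["LT2", "LT3", "LT4", "LT5", "LT6"]
msc = [4.71, 8.00, 15.87, 23.49, 47.93]        # % of LUTs, MSC
wiremap = [10.12, 12.66, 17.89, 20.19, 39.14]  # % of LUTs, WireMap

# Change in share per LUT size, in percentage points (WireMap minus MSC).
delta = {s: round(w - m, 2) for s, m, w in zip(sizes, msc, wiremap)}

# Combined share of LUTs with at most four inputs.
small_msc = round(sum(msc[:3]), 2)
small_wiremap = round(sum(wiremap[:3]), 2)

print(delta)
print(small_msc, small_wiremap)  # 28.58 40.67
```

Under these assumptions, LUTs with four or fewer inputs grow from about 28.6% to about 40.7% of all LUTs, which is the kind of distribution that leaves more room for pairing two functions in one dual-output LUT.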

15 Summary
- Presented cut-based structural mapping that minimizes the number of edges
- Extended area recovery to perform edge recovery
  - Area flow -> Edge flow
  - Exact local area -> Exact local edges
- Experimental results
  - Reduced the total number of pin-to-pin connections
  - Improved QoR after place-and-route
  - Improved packing by increasing the ratio of smaller LUTs

16 Backup Material

17 Backup Material: Technology Mapping
- Delay-optimal mapping
  - Delay-optimal mapping for all nodes
- Area recovery
  - Global area recovery
  - Local (exact) area recovery
[Figure: example network with cut size K = 3. Cut {pqr} of node f has arrival time 3; cut {stu} of node f has arrival time 2.]

18 Appendix - How to Measure Area?
Suppose we use the naïve definition:
  Area(cut) = 1 + [ Σ area(fanin) ]
- Area of cut {pcd} = 1 + [ ] = 2
- Area of cut {abq} = 1 + [ ] = 2
- The naïve definition says both cuts are equally good in area
- The naïve definition ignores sharing due to multiple fanouts
[Figure: the same network shown twice, once with cut {pcd} and once with cut {abq}.]

19 Appendix - Area-flow
  area-flow(cut) = 1 + [ Σ ( area-flow(fanin) / fanout_num(fanin) ) ]
- Area-flow of cut {pcd} = 1 + [ ] = 2
- Area-flow of cut {abq} = 1 + [ 0/1 + 0/1 + ½ ] = 1.5
- Area-flow “correctly” accounts for sharing
- Area-flow recognizes that cut {abq} is better
[Figure: the same network shown twice, once with cut {pcd} and once with cut {abq}.]
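The area-flow formula above is easy to check numerically. The sketch below reproduces the 1.5 for cut {abq} from the terms shown on the slide (a and b contribute 0/1 each, q contributes ½, read here as an area flow of 1 shared over two fanouts). The per-leaf values for cut {pcd} are assumptions chosen only so that the total matches the stated 2, since the slide's bracketed terms did not survive.

```python
def naive_area(cut, leaf_area):
    """Naive definition: 1 + sum of fanin areas, ignoring sharing."""
    return 1 + sum(leaf_area[f] for f in cut)

def area_flow(cut, leaf_area_flow, fanout_num):
    """area-flow(cut) = 1 + sum(area-flow(fanin) / fanout_num(fanin))."""
    return 1 + sum(leaf_area_flow[f] / fanout_num[f] for f in cut)

# a, b, c, d: primary inputs (cost 0). q: cost 1 with two fanouts (from the slide's 1/2 term).
# p: cost 1 with one fanout (assumed, to match the stated total of 2).
leaf_cost = {"a": 0, "b": 0, "c": 0, "d": 0, "p": 1, "q": 1}
fanout_num = {"a": 1, "b": 1, "c": 1, "d": 1, "p": 1, "q": 2}

print(naive_area(["p", "c", "d"], leaf_cost), naive_area(["a", "b", "q"], leaf_cost))  # 2 2
print(area_flow(["p", "c", "d"], leaf_cost, fanout_num))  # 2.0
print(area_flow(["a", "b", "q"], leaf_cost, fanout_num))  # 1.5
```

The naive measure cannot separate the two cuts, while area flow charges only half of q to this cut because q's cost is shared with its other fanout.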

20 Appendix - Exact Local Area
  Exact-local-area(cut) = 1 + [ Σ exact-local-area(fanin with no other fanout) ]
- Cut {stq}
  - Area flow = 1 + [ ] = 2.5
  - Exact area = 2 (due to q)
  - Area flow will choose this cut
- Cut {pef}
  - Area flow = 1 + [ ( )/2 ] = 2.75
  - Exact area = [value not recovered] (p is used elsewhere)
  - Exact area will choose this cut
[Figure: the same network shown twice, once with cut {stq} and once with cut {pef}.]

21 Example
  Exact-local-area(cut) = 1 + [ Σ exact-local-area(fanin with no other fanout) ]
- Cut {stq}
  - Area flow = 1 + [ ] = 2.5
  - Edge flow = 3 + [ 2 + 4(0.25) ] = 6
  - Exact area = 2 (due to q)
  - Exact edge = 5 (q is MFFC)
- Cut {pef}
  - Area flow = 1 + [ ( )/2 ] = 2.75
  - Edge flow = 3 + [ ( )/2 ] = 8.5
  - Exact area = [value not recovered] (p is used elsewhere)
  - Exact edge = [value not recovered] (p is NOT MFFC)
[Figure: the same network shown twice, once with cut {stq} and once with cut {pef}.]
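Exact local area and exact local edges can both be computed with the usual reference-counting walk over the current mapping: reference the cut, add the cost of every leaf cone that was previously unused, then undo the change. The sketch below is one common way to do this, not necessarily how WireMap implements it. The network is hypothetical and only loosely mirrors the slide: q is a two-input node used by no other LUT, and p is already used elsewhere. The helper names `g`, `a` to `d`, and `exact_local_cost` are made up for the example.

```python
def exact_local_cost(cut, best_cut, refs, cost):
    """Cost added to the mapping if `cut` is selected.
    best_cut: chosen cut of each internal node; refs: current reference counts;
    cost(c): 1 for area, len(c) for edges."""
    def ref(c):
        total = cost(c)
        for leaf in c:
            if leaf in best_cut:              # internal node, not a primary input
                if refs[leaf] == 0:           # unused so far: its LUT must be added
                    total += ref(best_cut[leaf])
                refs[leaf] += 1
        return total

    def deref(c):
        for leaf in c:
            if leaf in best_cut:
                refs[leaf] -= 1
                if refs[leaf] == 0:
                    deref(best_cut[leaf])

    total = ref(cut)
    deref(cut)                                # restore the reference counts
    return total

# Hypothetical state with the root's current cut already dereferenced.
best_cut = {"q": ("a", "b"), "p": ("c", "d")}
refs = {"q": 0, "p": 1}                       # q unused elsewhere, p used elsewhere

area = lambda c: 1
edges = lambda c: len(c)

print(exact_local_cost(("s", "t", "q"), best_cut, refs, area))   # 2
print(exact_local_cost(("s", "t", "q"), best_cut, refs, edges))  # 5
print(exact_local_cost(("p", "e", "g"), best_cut, refs, area))   # 1
print(exact_local_cost(("p", "e", "g"), best_cut, refs, edges))  # 3
```

The first two results agree with the slide's exact area of 2 and exact edge count of 5 for cut {stq}. The values for the second cut come from the assumed network, not from the slide. Picking the minimum of the pair (area, edges) gives the tie-breaking rule of the exact local edge phase.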

22 Appendix - Tuning Mapping for Placement
- Placement-aware priority cost function
  - The total number of edges in a mapped network
- Advantages
  - Correlates with the total wire length after placement
  - Easy to take into account during area recovery
- Treat “edges” as “area”, resulting in
  - Edge flow (similar to area flow)
  - Exact local edges (similar to exact local area)
- WireMap
  - New placement-aware mapping algorithm

23 Edge Recovery Overview
Key: find a simple-to-compute cut metric that minimizes edge count and creates more small LUTs
1. Edge flow phase: use the edge flow cost function to minimize global edge count
2. Exact edge phase: use an optimal algorithm to minimize edge count within MFFCs

24 Appendix – Additional VPR Results
- VPR results for a 4-LUT cluster (resembles a commercial FPGA SLICE structure)
[Table not recovered; columns: Baseline, MSC, WireMap; rows: CW, TWL.]