
1
Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates
Yu Hu (1), Satyaki Das (2), Steve Trimberger (2), and Lei He (1)
(1) Electrical Engineering Dept., UCLA
(2) Research Labs, Xilinx Inc.
Presented by Yu Hu
Address comments to lhe@ee.ucla.edu

2
Outline
- Introduction
- Design of the Macro-Gates
- Synthesis for the Proposed FPGA Architecture
- Comparison of Heterogeneous FPGA Architectures
- Conclusions and Future Work

3
Heterogeneity in FPGA Architectures
- Heterogeneity among SLICEs
  - Programmable logic and routing tiles are not identical
  - E.g., heterogeneous soft logic fabric [Kaviani, FPGA'96]; hard structures [Jamieson, FPL'05]
  - Dedicated hard structures, e.g., DSP blocks and memory blocks
- Heterogeneity within a SLICE
  - Programmable logic and routing tiles (SLICEs) are identical
  - Different logic types exist within a SLICE, e.g., LUTs of different sizes [Cong, FPGA'99], mixed PLAs and LUTs [Cong, TODAES'05], or mixed macro-gates and LUTs
(Figure source: Jamieson, FPL'05)

4
Heterogeneous FPGA with Macro-Gates
- There is a trade-off between programmability and cost when choosing between LUTs and macro-gates
  - Xilinx Virtex-4 (V4) benefits from the small gates (MUX2, XOR2) built into its SLICEs
- The benefit of wider macro-gates
  - The effectiveness of incorporating wider logic functions (macro-gates) is not clear
- Our contributions
  - Design a new FPGA architecture with mixed LUTs and macro-gates
  - Propose a new automatic synthesis flow for mapping a circuit to the proposed FPGA architecture
  - Evaluate the architecture and show that it reduces delay and area by 16.5% and 30%, respectively, compared to the LUT-only architecture

5
Outline
- Introduction
- Design of the Macro-Gates
- Synthesis for the Proposed FPGA Architecture
- Comparison of Heterogeneous FPGA Architectures
- Conclusions and Future Work

6
Overview of Macro-Gate Design
- Key problem: select the logic functions for the macro-gate
- Problem formulation
  - Input: a set of training circuits that have been mapped to K-input LUTs
  - Output: N K-input Boolean functions f_1, …, f_N
  - Objective: maximize the number of logic functions (in the training circuit set) that can be implemented by f_1, …, f_N
- Proposed solution: rank the logic functions over a set of training circuits

7
NPN-Class Diagram: Organization of Logic Functions
- A canonical and efficient representation of all NPN classes
  - NPN-equivalent: functionally equivalent under input negation, input permutation, or output negation
  - E.g., f(a,b,c) = a + bc and g(a,b,c) = b'a + b'c are NPN-equivalent
  - NPN-cofactor relationships are indicated by the edges
  - As a DAG, it is easy to manipulate
- It becomes impractical to compute for functions with more than 6 inputs!
- Solution: Utilization NPN-Class Diagram
[Figure: levels of the diagram, Level 0 (constant), Level 1 (1-input), Level 2 (2-input), Level 3 (3-input), with wider-input functions at higher levels]
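
A brute-force way to test NPN equivalence for small functions is to map each truth table to the minimum over all input-negation/permutation transforms, each with and without output negation; two functions are NPN-equivalent exactly when their minima match. The helper names below are hypothetical (this is not the authors' tool), and the sketch only illustrates why building the full diagram blows up: there are 2^(2^n) functions and n!·2^(n+1) transforms per function.

```python
from itertools import permutations, product

def truth_table(func, n):
    """Pack func's outputs into an int; bit i holds func applied to the bits of i (input k = bit k)."""
    tt = 0
    for i in range(1 << n):
        if func(*[(i >> k) & 1 for k in range(n)]):
            tt |= 1 << i
    return tt

def npn_canonical(tt, n):
    """Smallest truth table reachable by negating/permuting inputs and optionally negating the output."""
    full = (1 << (1 << n)) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in product((0, 1), repeat=n):
            t = 0
            for i in range(1 << n):
                # Source minterm j: new input k, after optional negation, feeds old input perm[k].
                j = 0
                for k in range(n):
                    j |= (((i >> k) & 1) ^ neg[k]) << perm[k]
                if (tt >> j) & 1:
                    t |= 1 << i
            for cand in (t, t ^ full):  # without / with output negation
                if best is None or cand < best:
                    best = cand
    return best

f_tt = truth_table(lambda a, b, c: a | (b & c), 3)                    # f = a + bc
g_tt = truth_table(lambda a, b, c: ((1 - b) & a) | ((1 - b) & c), 3)  # g = b'a + b'c
print(npn_canonical(f_tt, 3) == npn_canonical(g_tt, 3))  # True
```

Running the same canonicalization over all 256 three-input truth tables yields the 14 NPN classes of 3-input functions.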

8
UND: Utilization NPN-Class Diagram
- UND is a DAG, a subgraph of the NPN-class diagram (NCD)
- It helps with scoring and ranking functions
[Figure: example UND over ab'c'+a'bc', abc, ab'+a'b, ab, a, and the constant -0-; each node is labeled functionality / appearance frequency / implementation capability: ab'c'+a'bc' / 1 / xx%, abc / 1 / xx%, ab / 0 / xx%, ab'+a'b / 0 / xx%, a / 0 / xx%, -0- / 0 / xx%]

9
UND: Utilization NPN-Class Diagram
[Figure: the same UND with ab'+a'b and a now also marked (ab'+a'b / 1 / xx%, a / 1 / xx%)]

10
UND: Utilization NPN-Class Diagram
Calculate Implementation Capability
[Figure: UND annotated with implementation capability: ab'c'+a'bc' / 1 / 75%, abc / 1 / 50%, ab'+a'b / 1 / 50%, ab / 0 / 25%, a / 1 / 25%, -0- / 0 / xx%; the fanout cone of ab'c'+a'bc' is highlighted]
- The topological property (DAG) of the UND lets us efficiently explore different metrics for function ranking, e.g., utilization rate.
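
The percentages on this slide are consistent with one reading: a node's implementation capability is the fraction of appearing functions (middle field 1) that lie in its cofactor cone, itself included. The sketch below assumes that reading; the edges are our reconstruction (each function points to the NPN classes of its single-variable cofactors), and the names `CHILDREN`, `APPEARS`, and `implementation_capability` are hypothetical.

```python
# Hypothetical reconstruction of the example UND: node -> NPN classes of its cofactors.
CHILDREN = {
    "ab'c'+a'bc'": ["ab'+a'b", "ab", "0"],  # c=0 gives XOR, a=0 gives bc' (AND class), c=1 gives 0
    "abc": ["ab", "0"],
    "ab'+a'b": ["a"],
    "ab": ["a", "0"],
    "a": ["0"],
    "0": [],
}
# Appearance flags from the slide (the middle field of each node label).
APPEARS = {"ab'c'+a'bc'": True, "abc": True, "ab'+a'b": True,
           "ab": False, "a": True, "0": False}

def cofactor_cone(node):
    """All nodes reachable from node (including node) along cofactor edges."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(CHILDREN[n])
    return seen

def implementation_capability(node):
    """Fraction of appearing functions that node can also implement via its cone."""
    total = sum(APPEARS.values())
    covered = sum(1 for n in cofactor_cone(node) if APPEARS[n])
    return covered / total

for f in CHILDREN:
    print(f"{f:>12}: {implementation_capability(f):.0%}")
```

This prints 75%, 50%, 50%, 25%, 25%, and 0% in dictionary order, matching the annotated values on the slide; because the UND is a DAG, each cone is a single graph traversal.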

11
Recap: Overall Flow for Macro-Gate Design
Map with LUT-N → Extract logic functions → Generate Utilization NPN Diagram → Calculate scores for logic functions → Rank logic functions
[Figure: example LUT-mapped circuit (signals a to h, output F) with gates and2(3), inv(1), nand2(2); the extracted functions as 16-bit truth tables (0000001000000000, 0000010000000000, 0000100000000000, 0001000000000000, 0010000000000000, 0100000000000000, ……); the annotated UND; and the scores 1+1*1/2=1.5, 1, 1*1/2=0.5, 1+1*1/3=1.33, 1+1*2/3+1*1/3=2]
Best function: ab'c'+a'bc'

12
Proposed Macro-Gates and FPGA Architecture
- For the IWLS'05 benchmarks, the following four 6-input functions have the highest ranks:
  - GI1 = abcdef (AND-6)
  - GI2 = a'b'c' + bcf' + bc'd' + b'ce (MUX-4)
  - GI3 = ab'cd'e + bcef + def
  - GI4 = ab' + a'cd' + b'c' + e' + f'
- The resulting macro-gate can implement over 50% of the logic functions in the IWLS'05 benchmarks.
[Figure: architecture of the proposed macro-gate and FPGA SLICE]
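
The "(MUX-4)" label on GI2 can be checked by exhaustive simulation: with b and c as select lines, GI2 routes a' (b'c'), e (b'c), d' (bc'), or f' (bc) to its output. The helper names `gi2`, `mux4`, and `is_mux4_view` are hypothetical.

```python
from itertools import product

def gi2(a, b, c, d, e, f):
    """GI2 = a'b'c' + bcf' + bc'd' + b'ce on 0/1 integers."""
    na, nb, nc, nd, nf = 1 - a, 1 - b, 1 - c, 1 - d, 1 - f
    return (na & nb & nc) | (b & c & nf) | (b & nc & nd) | (nb & c & e)

def mux4(s1, s0, d0, d1, d2, d3):
    """4:1 multiplexer: returns data input number 2*s1 + s0."""
    return (d0, d1, d2, d3)[2 * s1 + s0]

def is_mux4_view(a, b, c, d, e, f):
    # Selects (b, c); data inputs a', e, d', f' for (b, c) = 00, 01, 10, 11.
    return gi2(a, b, c, d, e, f) == mux4(b, c, 1 - a, e, 1 - d, 1 - f)

print(all(is_mux4_view(*v) for v in product((0, 1), repeat=6)))  # True
```

The four product terms use disjoint (b, c) combinations, which is why no term overlaps and the function behaves as a pure multiplexer with inverted data on three inputs.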

13
Outline
- Design of the Embedded Macro-Gates
- Synthesis for the Proposed FPGA Architecture
  - Technology Mapping for Heterogeneous FPGAs
  - SAT-based Packing
  - Placement and Routing
- Comparison of Heterogeneous FPGA Architectures
- Conclusions and Future Work

14
Functional & Structural Cut Enumeration
[Figure: example network with inputs x, y, w, z and internal nodes a, b, c, d]
- a = (x+y)', b = y + wz, so d = ab = (x+y)'(y+wz) = x'y'(y+wz) = x'y'wz
- Is x'y'wz in the library? The 4-input macro-gate library is stored as 16-bit truth tables (0000001000000000, 0000010000000000, 0000100000000000, 0001000000000000, 0010000000000000, 0100000000000000, ……). Yes.
- Phase 1: enumerate and label cuts from PIs to POs
  - Check the feasibility of each cut w.r.t. the macro-gate
- Phase 2: select the best choice from POs to PIs
- A general yet efficient solution is SAT-based Boolean matching: "Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping," Session 5C.1, ICCAD'07
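
Phase 1 can be sketched with standard k-feasible cut enumeration: each node's cuts are its trivial cut plus every union of one cut per fanin that has at most k leaves; the function of a cut is then obtained by simulating its cone. Assumptions: the network is our reading of the figure (the intermediate node c = wz is a guess), `LIBRARY` is a hypothetical stand-in holding the 16 single-minterm 4-input functions, and exact truth-table lookup replaces the SAT-based Boolean matching the slide cites (which also handles input permutation and negation).

```python
from itertools import product

K = 4
PIS = ["x", "y", "w", "z"]
# Hypothetical gate-level network in topological order: node -> (fanins, function).
NODES = {
    "c": (("w", "z"), lambda w, z: w & z),
    "b": (("y", "c"), lambda y, c: y | c),
    "a": (("x", "y"), lambda x, y: 1 - (x | y)),
    "d": (("a", "b"), lambda a, b: a & b),
}
# Hypothetical library: one-hot 16-bit truth tables (AND4 with any input inversions).
LIBRARY = {1 << m for m in range(16)}

def enumerate_cuts(k=K):
    """Phase 1: k-feasible cuts of every node, computed from PIs towards POs."""
    cuts = {pi: {frozenset([pi])} for pi in PIS}
    for n, (fanins, _) in NODES.items():
        node_cuts = {frozenset([n])}  # trivial cut
        for combo in product(*(cuts[f] for f in fanins)):
            leaves = frozenset().union(*combo)
            if len(leaves) <= k:
                node_cuts.add(leaves)
        cuts[n] = node_cuts
    return cuts

def evaluate(node, values):
    """Evaluate node recursively; values holds the cut leaves and memoizes inner nodes."""
    if node not in values:
        fanins, fn = NODES[node]
        values[node] = fn(*(evaluate(f, values) for f in fanins))
    return values[node]

def cut_truth_table(root, leaves):
    """Truth table of root over the sorted cut leaves (leaf k = bit k of the minterm)."""
    order = sorted(leaves)
    tt = 0
    for i in range(1 << len(order)):
        values = {leaf: (i >> k) & 1 for k, leaf in enumerate(order)}
        if evaluate(root, values):
            tt |= 1 << i
    return tt

cuts = enumerate_cuts()
for cut in sorted(cuts["d"], key=lambda c: (len(c), sorted(c))):
    print(sorted(cut))
tt = cut_truth_table("d", frozenset("xywz"))
print(f"{tt:016b}", tt in LIBRARY)  # 0000001000000000 True
```

Node d ends up with seven cuts, and its function over the leaves {w, x, y, z} is the single minterm x'y'wz, whose truth table happens to be the first library row shown on the slide.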

15
Key in Technology Mapping: Balance Resource Utilization
- An asymmetric architecture causes resource-utilization problems: exclusive use of one logic resource leaves much of the fabric unused
- Simple yet effective solution: change the LUT-MG ratio by adjusting the area weights of LUTs and macro-gates
  - Precise calibration is hard to achieve with this approach
[Figure: LUT-MG ratio (LUT#/MG#) obtained by weight adjustment; objective architecture LUT6 : MacroGate6 = 1:1, so the best LUT-MG ratio is 1:1; annotations "Total# too large!" and "Hard to obtain precise calibration"]

16
Post-Mapping Area Recovery (motivation example)
- Given:
  - Target architecture = LUT6 + MG6
  - LUT-MG ratio in the target architecture = 1:1
  - LUT# < MG# in the mapped design
  - Intrinsic delay (LUT6 : MG6) = 5 : 4
- Objective: balance the LUT and MG counts without increasing delay
[Figure: mapped netlist from PI to PO with one LUT6 and several MG6 nodes, each annotated arrival/required time: 5/5, 4/5, 9/9, 9/13, 8/9, 13/13, 17/17]

17
Post-Mapping Area Recovery (motivation example, cont.)
[Figure: the same netlist after one MG6 with slack is remapped to a LUT6; the node annotated 9/13 becomes 10/13, while the PO stays at 17/17]

18
Post-Mapping Area Recovery (motivation example, cont.)
[Figure: a second MG6 is also remapped to a LUT6; the annotations become 5/5, 9/9, 10/9, 10/13, 14/13, and 18/17 at the PO]
- Timing target violation! Timing slack budgeting is necessary!
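
The motivation example reduces to slack arithmetic: replacing an MG6 (delay 4) with a LUT6 (delay 5) costs one time unit, so every node with slack of at least 1 is a candidate on its own, but candidates on the same path draw on the same slack. The four-node netlist below is hypothetical (it is not the one in the figure); it keeps the 5:4 delay ratio and uses the original critical-path delay as the timing target.

```python
LUT_DELAY, MG_DELAY = 5, 4

# Hypothetical mapped netlist in topological order: node -> fanins ([] = driven by a PI).
FANINS = {"p": [], "q": ["p"], "r": [], "s": ["r"]}
OUTPUTS = ["q", "s"]
KIND = {"p": "MG", "q": "MG", "r": "LUT", "s": "MG"}

def delay(node, kind):
    return LUT_DELAY if kind[node] == "LUT" else MG_DELAY

def arrival_times(kind):
    at = {}
    for n, fanins in FANINS.items():  # dict order is topological
        at[n] = max((at[f] for f in fanins), default=0) + delay(n, kind)
    return at

def required_times(kind, target):
    req = {n: (target if n in OUTPUTS else float("inf")) for n in FANINS}
    for n in reversed(list(FANINS)):  # reverse topological order
        for f in FANINS[n]:
            req[f] = min(req[f], req[n] - delay(n, kind))
    return req

def slacks(kind, target):
    at, req = arrival_times(kind), required_times(kind, target)
    return {n: req[n] - at[n] for n in FANINS}

def worst_arrival(kind):
    at = arrival_times(kind)
    return max(at[o] for o in OUTPUTS)

def with_luts(nodes):
    """Copy of KIND with the given nodes remapped to LUT6."""
    return {n: ("LUT" if n in nodes else k) for n, k in KIND.items()}

target = worst_arrival(KIND)   # keep the original critical-path delay
print(slacks(KIND, target))    # {'p': 1, 'q': 1, 'r': 0, 's': 0}
for swap in (["p"], ["q"], ["p", "q"]):
    t = worst_arrival(with_luts(swap))
    print(swap, t, "OK" if t <= target else "timing violation")
```

Remapping p or q alone keeps the worst arrival at 9, but remapping both pushes it to 10, the same failure mode as the 18/17 annotation above.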

19
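Post-Mapping Area Recovery by Timing Budgeting
- Formulated as an Integer Linear Programming (ILP) problem, where $m_j = 1$ if node $j$ is implemented by a LUT6, $d_j$ is the node's MG6 delay, and $b_j$ is the extra delay budget assigned to it
  - Objective (minimize the gap between the target and actual LUT-MG ratios): $\min \left| m_2 + \dots + m_7 - \tfrac{7}{2} \right|$
  - Arrival time constraints: $a_i + d_j + b_j \le a_j$ for each edge $(i, j)$
  - Clock period target: $a_i \le 17$
  - LUT assignment with the given timing slack: $(5 - 4)\, m_j \le b_j$, $m_j \in \{0, 1\}$
[Figure: example netlist from PI to PO with one LUT6 and several MG6 nodes, labeled with arrival times a_1 to a_7]
- Easily generalized to architectures with multiple macro-gates that have different numbers of input pins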
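
For a handful of nodes the ILP can be checked by exhaustive search over the $m_j$. Setting each budget to its minimum, $b_j = (5-4)\,m_j$, loses nothing, because a larger budget only delays arrival times further. The seven-node netlist, in which every node starts as an MG6, is hypothetical; `recover` enumerates assignments by brute force and is a stand-in for an ILP solver, not the authors' implementation.

```python
from itertools import product

LUT_DELAY, MG_DELAY, CLOCK = 5, 4, 17

# Hypothetical mapped netlist in topological order; every node is an MG6 before recovery.
FANINS = {
    "n1": [], "n2": ["n1"], "n3": ["n1"], "n4": ["n2", "n3"],
    "n5": [], "n6": ["n5"], "n7": ["n4", "n6"],
}

def arrival(m):
    """a_j = max over fanins i of a_i, plus d_j + b_j, with b_j = (LUT_DELAY - MG_DELAY) * m_j."""
    a = {}
    for j, fanins in FANINS.items():
        b_j = (LUT_DELAY - MG_DELAY) * m[j]
        a[j] = max((a[i] for i in fanins), default=0) + MG_DELAY + b_j
    return a

def recover():
    """Return (|sum m_j - n/2|, m) for the best assignment meeting a_j <= CLOCK."""
    nodes = list(FANINS)
    best = None
    for bits in product((0, 1), repeat=len(nodes)):
        m = dict(zip(nodes, bits))
        if max(arrival(m).values()) > CLOCK:
            continue  # violates the clock period target
        cost = abs(sum(bits) - len(nodes) / 2)
        if best is None or cost < best[0]:
            best = (cost, m)
    return best

cost, m = recover()
print(cost, [j for j, v in m.items() if v])
```

All-MG arrival at n7 is 16, leaving one unit of slack on the critical path; the search finds a feasible assignment with three LUT6 nodes, the closest a 7-node design can get to a 1:1 ratio. A real flow would linearize the absolute value and hand the model to an ILP solver.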

20
Outline
- Design of the Embedded Macro-Gates
- Synthesis for the Proposed FPGA Architecture
  - Technology Mapping for Heterogeneous FPGAs
  - SAT-based Packing
- Comparison of Heterogeneous FPGA Architectures
- Conclusions and Future Work

21
SAT-Based Packing
- Motivation
  - Traditional packing tools, e.g., T-VPack, hard-code the architecture specification of a SLICE, so they must be re-implemented from scratch whenever the architecture changes
  - We propose a unified packer implementation for different architectures, which makes architecture exploration easy
- The architecture-dependent subproblem in packing: checking the structural feasibility of packing a subcircuit into a SLICE
- Solution
  - Treat validation of a SLICE packing as a local place-and-route problem
  - Use a SAT solver to carry out the validation check

22
Example of SAT-Based SLICE Packing
Examples of constraints (one per constraint class):
- Placement and routing choice variables: X@A, X@B, U5@N10
- Exclusivity constraint: (¬X@A) ∨ (¬X@B)
- Presence constraint: (X@A) ∨ (¬X@B)
- Input/output constraint: X@A → U5@N10
- Routing constraint: (G0→out ∧ U5@N10) → U5@N12
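
A minimal sketch of the validation check turns the exclusivity, input/output, and routing constraints above into CNF clauses and tests satisfiability. Assumptions: the variable names follow the slide, but their reading (cell X at site A, pin U5 on routing node N10, G0 driving the output) and the two query scenarios are hypothetical. The presence clause is omitted because the query pins X to site A anyway, and a brute-force search stands in for a real SAT solver.

```python
from itertools import product

VARS = ["X@A", "X@B", "U5@N10", "U5@N12", "G0->out"]

# A clause is a list of (variable, polarity) literals; it holds if any literal holds.
BASE_CLAUSES = [
    [("X@A", False), ("X@B", False)],                           # exclusivity
    [("X@A", False), ("U5@N10", True)],                         # X@A -> U5@N10
    [("G0->out", False), ("U5@N10", False), ("U5@N12", True)],  # (G0->out & U5@N10) -> U5@N12
]

def satisfiable(clauses):
    """Brute-force SAT check over VARS; returns a satisfying model or None."""
    for bits in product((False, True), repeat=len(VARS)):
        model = dict(zip(VARS, bits))
        if all(any(model[v] == pol for v, pol in clause) for clause in clauses):
            return model
    return None

# Hypothetical packing query: X is placed at site A and its output path G0->out is used.
query = BASE_CLAUSES + [[("X@A", True)], [("G0->out", True)]]
print(satisfiable(query) is not None)                            # True: packing is valid
# Same query, but routing node N12 is already taken by another net.
print(satisfiable(query + [[("U5@N12", False)]]) is not None)    # False: packing rejected
```

Because the SLICE structure lives entirely in the generated clauses, retargeting the packer to a new SLICE only changes clause generation, which is the motivation stated on the previous slide.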

23
Recap: Overall Synthesis Flow
Area weight setting → Cut-based mapping → Area-balance trade-off? (Y/N) → Post-mapping area recovery → Packing
[Figure: the example circuit (signals a to h, output F) mapped to LUTs and macro-gates and packed into SLICEs, each containing a LUT6 and an MG6]

24
Outline
- Motivation and Objectives
- Methodology for Logic Function Exploration
- Technology Mapping for Heterogeneous FPGAs
- Evaluation of Heterogeneous FPGA Architectures
- Conclusions and Future Work

25
Experimental Setting
- Design library parameters: [Cong, TODAES'05]
- Benchmark set: IWLS 2005
- Four architectures are compared: LUT4, LUT4 + macro-gate, LUT6, and LUT6 + macro-gate
- The proposed macro-gate is synthesized with SIS 1.2
- Delay and area model: interconnect delay is ignored

26
Delay Comparisons
- Compared to LUT4, LUT4+MG reduces both logic depth and delay by 9.2%.
- Compared to LUT6, LUT6+MG reduces delay by 30% while increasing logic depth by 36.5%.
  - Logic depth increases because a LUT6 can implement more logic functions than a macro-gate.

27
Logic Area Comparisons
- Compared to LUT4, LUT4+MG reduces logic area by 12.5%.
- Compared to LUT6, LUT6+MG reduces logic area by 16.9%.

28
Outline
- Motivation and Objectives
- Methodology for Logic Function Exploration
- Technology Mapping for Heterogeneous FPGAs
- Comparison of Heterogeneous FPGA Architectures
- Conclusions and Future Work

29
Conclusions
- A novel FPGA architecture with mixed LUTs and macro-gates is proposed
- A synthesis flow for the proposed architecture is implemented
- Preliminary experimental results show that the proposed architecture is effective at reducing area and delay
Future Work
- Perform physical design for the synthesized circuits and compare routing costs; evaluate the architecture with interconnect delay included
- Study how effectively the proposed architecture reduces power
- Examine macro-gates with wider inputs
