Published by Winfred Blake Bruce. Modified about 1 year ago.

1
Sungmin Bae, Hyung-Ock Kim, Jungyun Choi, and Jaehong Park
Design Technology, Infrastructure Design Center, System-LSI Business Division
Warning: This document is intended only for the recipients designated by Samsung Electronics Co. Ltd. (“Samsung”). As it contains the trade secrets and confidential information of Samsung which are protected by Competition Law, Trade Secrets Protection Act and other related laws, this document may not be, in part or in whole, directly or indirectly publicized, distributed, photocopied or used (including in a posting on the Internet where unspecified individuals may access it) by any unauthorized third party. Samsung reserves its right to take legal measures and claim damages against any party that misappropriates Samsung’s trade secrets or confidential information.

2
1. Motivation
2. Design flow
3. Parallel multiplier
4. Coarse-grained structural placement methodology
5. Experimental results
6. Future work

3
Data-flow (design structure) awareness is crucial to enhancing physical design quality: timing, area, congestion, power, etc.
Structured datapath placement is mostly done manually.
- It is generally thought that placement tools do not perform well on datapath designs.
- Design effort: days to weeks.
[Figure: control granularity from coarser to finer: floorplan, memory macro placement, structured datapath placement. Example datapath: Sum = A + B.]

4
We have added another methodology to data-flow-aware physical design.
- Automated extraction and mapping of a synthesized parallel multiplier guides structural placement during global placement.
[Figure: control granularity from coarser to finer: floorplan, memory macro placement, coarse-grained structured datapath placement, structured datapath placement. The coarse-grained step is driven by logic synthesis, automated datapath extraction and mapping, and a datapath template. Example datapath: Sum = A * B.]

5
The flow performs the following steps:
1. Identify cells of a synthesized parallel multiplier to be structurally placed
2. Extract the inherent structural locations of the cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement
[Figure: design flow chart with these blocks: user, RTL code, technology library, timing/area constraints; logic synthesis (parsing/elaboration, arithmetic operation extraction, high-level arithmetic optimizations, non-arithmetic logic, high-level optimizations, datapath generator, structural templates (multiplier), technology independent and dependent optimizations); optimized gate-level netlist; structure extraction and mapping (dataflow analysis, structural location inference/cell mapping, physical-aware bit-slice alignment); structural relative placement directives; coarse-grained structural placement in global placement; "Result satisfactory?" decision (yes/no).]

6
A parallel multiplier is one of the most abundant arithmetic circuits in today’s multimedia-feature-intensive SoCs.
A parallel multiplier largely consists of three parts:
- Partial product generation
- Partial product reduction
- Carry-propagate adder (final adder)
[Figure: 4x4 multiplication in dot notation: multiplicand X3..X0, multiplier Y3..Y0, partial products XiYj, final product S7..S0. Block diagram: multiplicand and multiplier, partial product reduction, final adder, final product.]
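
The three parts can be illustrated with a small behavioral model. This is only a sketch of the arithmetic, not of the gate-level netlist the slides work on; the function names `partial_products`, `column_sums`, and `multiply` are made up for this example.

```python
def partial_products(x: int, y: int, n: int = 4):
    """Non-Booth generation: pp[j][i] = X_i AND Y_j for n-bit unsigned operands."""
    return [[((x >> i) & 1) & ((y >> j) & 1) for i in range(n)] for j in range(n)]


def column_sums(pp):
    """Count the partial-product bits in each weight column.

    Bit (i, j) has weight 2**(i + j), so it belongs to column i + j.
    A real reduction tree compresses each column to two rows; here we
    simply count the bits per column.
    """
    n = len(pp)
    cols = [0] * (2 * n - 1)
    for j, row in enumerate(pp):
        for i, bit in enumerate(row):
            cols[i + j] += bit
    return cols


def multiply(x: int, y: int, n: int = 4) -> int:
    """Final addition: weight each column count by 2**column and sum."""
    cols = column_sums(partial_products(x, y, n))
    return sum(count << weight for weight, count in enumerate(cols))
```

For two 4-bit operands with all bits set, the column heights are 1, 2, 3, 4, 3, 2, 1, which is the diamond shape of the dot-notation figure.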

7
Partial product generation
- Non-Booth: generates the logical product (AND) of each multiplicand bit and multiplier bit.
- Booth (radix-4): reduces the number of partial products by half.
Partial product reduction
- Carry-save adder tree: reduces every column to two output rows using compressor cells.
Carry-propagate adder (final adder)
- Carry-lookahead adder: adds the two output rows.
[Figure: Booth and non-Booth partial product generation cells (inputs X_i, Y_j, output PP_ij); a 3:2 compressor with C_in, C_out, and Sum; a carry-propagate adder built from full adders and a carry-lookahead unit.]
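
To see why radix-4 Booth halves the number of partial products, the standard recoding can be sketched as below. The function name `booth_radix4_digits` is made up for this example, and the sketch assumes a two's-complement multiplier with an even bit width; it is textbook Booth recoding, not something taken from the slides.

```python
def booth_radix4_digits(y: int, n: int):
    """Recode an n-bit two's-complement multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2} and selects one partial product,
    so n multiplier bits give n // 2 partial products instead of n.
    """
    assert n % 2 == 0, "this sketch assumes an even bit width"
    y &= (1 << n) - 1          # keep the low n bits (two's complement)
    digits = []
    prev = 0                   # implicit bit y[-1] = 0
    for k in range(0, n, 2):
        b0 = (y >> k) & 1
        b1 = (y >> (k + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_value(digits):
    """Reassemble the signed multiplier value from its radix-4 digits."""
    return sum(d * 4 ** k for k, d in enumerate(digits))
```

For example, the 4-bit multiplier 6 recodes to the two digits -2 and 2, since -2 + 2 * 4 = 6.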

8
The methodology performs the following steps:
1. Identify cells of a synthesized parallel multiplier to be structurally placed
   - The PI cells from the partial product generation
   - The PO cells from the final adder
2. Extract the inherent structural locations of the cells
   - Tag structural locations for the PI and PO cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement

9
The PI cells come from the partial product generation.
- The PI cells are retrieved as the immediate fan-out cone cells of the input nets.
- The set of input nets used to collect the PI cells depends on the type of partial product generation:
  - Non-Booth: multiplicand and multiplier input nets
  - Booth: multiplicand input nets
[Figure: Booth and non-Booth partial product generation cells; 4x4 dot-notation example.]
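
A minimal sketch of this retrieval, assuming the netlist is available as a dictionary from net name to the cells it drives. The data layout, the function name `collect_pi_cells`, and the net and cell names in the usage note are all hypothetical; the slides do not show the actual Tcl implementation.

```python
def collect_pi_cells(fanout, multiplicand_nets, multiplier_nets, pp_type):
    """Return the cells directly driven by the relevant multiplier input nets.

    fanout  : dict mapping a net name to the list of cells it drives
    pp_type : "non-booth" seeds from both operands, "booth" only from the
              multiplicand, as stated on the slide
    """
    if pp_type == "non-booth":
        seed_nets = list(multiplicand_nets) + list(multiplier_nets)
    elif pp_type == "booth":
        seed_nets = list(multiplicand_nets)
    else:
        raise ValueError(f"unknown partial product type: {pp_type}")
    cells = set()
    for net in seed_nets:
        cells.update(fanout.get(net, ()))   # immediate fan-out only
    return cells
```

With `fanout = {"X0": ["and00", "and01"], "Y0": ["and00"], "Y1": ["and01", "enc0"]}`, the non-Booth seed set collects `and00`, `and01`, and `enc0`, while the Booth seed set collects only `and00` and `and01`.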

10
After extraction, the PI cells are tagged with 2-D locations: a partial product row and column.
Row inference
- The row of a PI cell can be inferred from its topologically closest multiplier input bits.
- i indicates the i-th row of the partial product generator.
Notation
- PI_row(C_k): the row number of the PI cell C_k
- PI_col(C_k): the column number of the PI cell C_k
- B_md(C_k): the closest multiplicand bit of C_k
- B_mr(C_k): the closest multiplier bit of C_k
- PP_type: the partial product type
[Figure: Booth and non-Booth partial product generation cells with inputs X_i, Y_j and output PP_ij.]
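
The row-inference formula itself did not survive in this transcript. The sketch below is therefore an assumption based on how the two generator types are normally structured: a non-Booth array has one partial product row per multiplier bit, and a radix-4 Booth array has one row per pair of multiplier bits. The function name `infer_pi_row` is hypothetical.

```python
def infer_pi_row(closest_multiplier_bit: int, pp_type: str) -> int:
    """Guess PI_row(C_k) from B_mr(C_k), the closest multiplier bit index.

    Assumed rule (not taken from the slides):
      non-Booth     : one row per multiplier bit       -> row = bit
      radix-4 Booth : one row per two multiplier bits  -> row = bit // 2
    """
    if pp_type == "non-booth":
        return closest_multiplier_bit
    if pp_type == "booth":
        return closest_multiplier_bit // 2
    raise ValueError(f"unknown partial product type: {pp_type}")
```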

11
Column inference
- The column of a PI cell can be inferred from its topologically closest, bit-slice-aligned multiplier output bit.
- Topological order propagation is restricted to follow the same-weight bit-slice along the CSA tree, ignoring the carry-out pins of the compressor cells.
- The result is the topologically closest, bit-slice-aligned output bit.
[Figure: a 3:2 compressor spanning Column[i] and Column[i+1]; 4x4 dot-notation example.]
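
One way to realize this restricted propagation is a breadth-first search that never leaves a cell through a carry-out pin. This is a sketch under assumptions: the graph layout, the pin name `"CO"`, and the function name `infer_pi_column` are illustrative, not from the slides.

```python
from collections import deque


def infer_pi_column(start, fanout, po_bit, carry_pins=frozenset({"CO"})):
    """Return the output bit of the closest PO cell reachable without carries.

    start      : name of the PI cell
    fanout     : dict cell -> list of (output_pin, sink_cell)
    po_bit     : dict PO cell -> multiplier output bit index
    carry_pins : output pin names treated as carry-outs and not followed
    """
    seen = {start}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell in po_bit:                  # first PO reached is the closest
            return po_bit[cell]
        for pin, sink in fanout.get(cell, ()):
            if pin in carry_pins or sink in seen:
                continue                    # stay in the same-weight bit-slice
            seen.add(sink)
            queue.append(sink)
    return None                             # no bit-slice-aligned output found
```

Because a carry-out moves a signal to the next higher weight, skipping it keeps the search inside one column, so the first tagged PO cell found gives the column number.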

12
The PO cells are part of the final carry-propagate adder.
- The PO cells are retrieved as the immediate fan-in cone cells of the output nets.
- Each PO cell is tagged with its corresponding multiplier output bit.
[Figure: carry-propagate adder built from full adders and a carry-lookahead unit.]

13
The methodology performs the following steps:
1. Identify cells of a parallel multiplier to be structurally placed
2. Extract the inherent structural locations of the cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement

14
Data-flow can be analyzed from a global placement.
- Data-flow can be estimated from the relative locations of the input- and output-related cells.
A method for the data-flow analysis:
- Use linear regression to fit lines through the input- and output-related cells.
- Analyze the relation between the input and output lines: any overlap between the lines, the angle of the overlap, etc.
- Determine the flow direction (top to bottom, left to right, etc.) and the bit order (MSB to LSB or LSB to MSB).
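
The slide leaves the analysis open, so the following is only one possible reading. It assumes each input- or output-related cell comes with a bit index and an (x, y) location from global placement, with y increasing upward. The flow direction is taken from the input and output centroids, and the bit order from the sign of a least-squares slope of position against bit index. All names are hypothetical.

```python
def centroid(points):
    """Mean (x, y) of a list of (x, y) points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def ls_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs (0.0 if xs is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx if sxx else 0.0


def estimate_dataflow(inputs, outputs):
    """inputs/outputs: lists of (bit_index, x, y). Returns (flow, bit_order)."""
    ix, iy = centroid([(x, y) for _, x, y in inputs])
    ox, oy = centroid([(x, y) for _, x, y in outputs])
    dx, dy = ox - ix, oy - iy
    bits = [b for b, _, _ in outputs]
    if abs(dx) >= abs(dy):                  # mostly horizontal flow
        flow = "left-to-right" if dx >= 0 else "right-to-left"
        slope = ls_slope(bits, [y for _, _, y in outputs])   # bits spread along y
    else:                                   # mostly vertical flow
        flow = "bottom-to-top" if dy >= 0 else "top-to-bottom"
        slope = ls_slope(bits, [x for _, x, _ in outputs])   # bits spread along x
    # Bit order is reported along the increasing coordinate of the bit axis.
    bit_order = "LSB-to-MSB" if slope >= 0 else "MSB-to-LSB"
    return flow, bit_order
```

The overlap and angle checks mentioned on the slide are not modeled here; they would need the fitted input and output lines themselves rather than just centroids and one slope.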

15
The methodology performs the following steps:
1. Identify cells of a parallel multiplier to be structurally placed
2. Extract the inherent structural locations of the cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
   - Using the inferred row and column numbers
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement

16
The PI cells are mapped onto a logical 2-D array according to their tagged row and column numbers.
- However, the number of cells inferred to be at the same location can be uneven due to the local nature of logic synthesis optimizations.
- If enough slots are allocated for all the cells, the 2-D array may have an uncontrollable aspect ratio, which may degrade placement quality.
The maximum number of columns is constrained to control the array dimensions.
- The number of rows is fixed.
- Some mis-mappings are allowed.
- Slots can be shared between adjacent columns.
There is spacing between the rows of the 2-D array.
- This lets non-guided cells be placed close to their inherent structural locations.

17
Min-cost max-flow based cell mapping maximizes the number of mapped PI cells with minimum mis-mapping cost for a given 2-D array.
- An initial 2-D slot array may not fully contain all the PI cells.
- Empty slots can be shared between adjacent bit-slice columns.
- Dummy (empty) column slots are iteratively added at the columns with the worst mis-mapping costs during the mapping.
The slots of each column are divided into three types with different mapping cost weights:
- Non-shared: mapping weight γ_own
- Shared: mapping weight γ_shared
- Dummy: mapping weight γ_dummy
Mis-mapping cost: γ_x * |row_cell - row_slot|
[Figure: flow network for Column[i-1], Column[i], and Column[i+1]. The PI cells of a column connect to its m own slots (capacity m, costs Cost[0,0]..Cost[0,n]), to j shared slots (capacity j, cost Cost_SH), and to k dummy slots (capacity k, cost Cost_DS).]
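
The mapping can be sketched as a small min-cost max-flow problem. This is an illustration under assumptions, not the authors' formulation: every slot has capacity 1, a cell may use any slot in its own column, shared slots are also open to the two adjacent columns, and the iterative insertion of dummy columns is left out. The names `min_cost_max_flow` and `map_cells` are made up for this example.

```python
def min_cost_max_flow(n, edges, s, t):
    """Successive shortest paths with Bellman-Ford. edges: (u, v, capacity, cost)."""
    inf = float("inf")
    graph = [[] for _ in range(n)]
    to, cap, cost = [], [], []
    for u, v, c, w in edges:        # forward edge at even index, reverse at odd
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    flow = total = 0
    while True:
        dist, prev = [inf] * n, [-1] * n
        dist[s] = 0
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]], prev[to[e]], changed = dist[u] + cost[e], e, True
            if not changed:
                break
        if dist[t] == inf:          # no augmenting path left
            break
        push, v = inf, t
        while v != s:               # bottleneck capacity on the path
            push = min(push, cap[prev[v]])
            v = to[prev[v] ^ 1]
        v = t
        while v != s:               # augment along the path
            e = prev[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        flow += push
        total += push * dist[t]
    return flow, total, [cap[2 * i + 1] for i in range(len(edges))]


def map_cells(cells, slots, gamma):
    """cells: {name: (row, col)}; slots: [(row, col, kind)]; gamma: {kind: weight}."""
    names = list(cells)
    src, sink = 0, 1
    cell_id = {c: 2 + i for i, c in enumerate(names)}
    slot_base = 2 + len(names)
    edges = [(src, cell_id[c], 1, 0) for c in names]
    pairs = []
    for c in names:
        crow, ccol = cells[c]
        for j, (srow, scol, kind) in enumerate(slots):
            shared_neighbor = kind == "shared" and abs(scol - ccol) == 1
            if scol == ccol or shared_neighbor:
                pairs.append((c, j))    # mis-mapping cost: gamma_x * |row_cell - row_slot|
                edges.append((cell_id[c], slot_base + j, 1, gamma[kind] * abs(crow - srow)))
    edges += [(slot_base + j, sink, 1, 0) for j in range(len(slots))]
    _, total, used = min_cost_max_flow(slot_base + len(slots), edges, src, sink)
    mapping = {c: slots[j] for k, (c, j) in enumerate(pairs) if used[len(names) + k]}
    return mapping, total
```

As a usage example, two cells tagged (0, 0) competing for one own slot, next to a column that offers a shared slot and an own slot at row 1, end up with all three cells mapped: the overflow cell takes the neighbor's shared slot at a mis-mapping cost of γ_shared * 1.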

18
HPWL is considered as a tiebreaker to compensate for the net-connection blindness of the mapping.
- A linear program minimizes the weighted sum of the min-cost max-flow cost Cost_MA(c_i) and the HPWL cost Cost_HPWL(n_i).
- Cost_MA(c_i): weighted sum of the mis-mapping cost of cell c_i
- Cost_HPWL(n_i): half-perimeter wirelength (HPWL) cost of net n_i
- Dummy column slots are added gradually at the columns with the worst mis-mapping cost, and the linear program is solved iteratively.
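
The formulation itself is not in the transcript. A plausible shape of the objective, written with the slide's cost names, is sketched below. The weights α and β, the slot notation s(c_i), and the pin coordinates (x_p, y_p) are assumptions introduced only for illustration; a tiebreaker role for HPWL would suggest β much smaller than α.

```latex
\min \;\; \alpha \sum_{i} \mathrm{Cost}_{MA}(c_i) \;+\; \beta \sum_{i} \mathrm{Cost}_{HPWL}(n_i)

\mathrm{Cost}_{MA}(c_i) = \gamma_x \,\bigl|\, \mathrm{row}_{cell}(c_i) - \mathrm{row}_{slot}(s(c_i)) \,\bigr|,
\qquad \gamma_x \in \{\gamma_{own},\ \gamma_{shared},\ \gamma_{dummy}\}

\mathrm{Cost}_{HPWL}(n_i) = \Bigl(\max_{p \in n_i} x_p - \min_{p \in n_i} x_p\Bigr)
                          + \Bigl(\max_{p \in n_i} y_p - \min_{p \in n_i} y_p\Bigr)
```

The max and min terms of HPWL are usually linearized with one upper-bound and one lower-bound variable per net and axis, which keeps the whole problem a linear program.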

19
The methodology performs the following steps:
1. Identify cells of a parallel multiplier to be structurally placed
2. Extract the inherent structural locations of the cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement

20
The logically mapped PI and PO cells are then bit-slice aligned with respect to their physical dimensions.
- Strict bit-slice alignment: each column width is set by the widest cell in the column, so the overall alignment size is uncontrollable.
- Compression alignment: generates a compact cell cluster, but cannot ensure vertical bit-slice alignment.
[Figure: the same array of cells C_i,j under strict bit-slice alignment and under compression alignment.]
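
The trade-off between the two alignments can be made concrete with cell widths. The sketch below assumes a rectangular array `widths[r][c]` holding the width of the cell in row r, column c (0 for an empty slot), and measures misalignment as the spread of left edges within a column. These conventions and the function names are assumptions for illustration.

```python
def strict_alignment(widths):
    """Every column is as wide as its widest cell.

    Returns (column x-offsets, total array width).
    """
    ncols = len(widths[0])
    col_w = [max(row[c] for row in widths) for c in range(ncols)]
    offsets = [sum(col_w[:c]) for c in range(ncols)]
    return offsets, sum(col_w)


def compression_alignment(widths):
    """Pack each row left to right with no gaps.

    Returns (per-row x-offsets, cluster width).
    """
    row_offsets = []
    for row in widths:
        x, offs = 0, []
        for w in row:
            offs.append(x)
            x += w
        row_offsets.append(offs)
    return row_offsets, max(sum(row) for row in widths)


def column_misalignment(row_offsets):
    """Spread (max - min) of the cells' left edges in each column."""
    return [max(col) - min(col) for col in zip(*row_offsets)]
```

For `widths = [[2, 4, 2], [3, 2, 2], [2, 2, 5]]`, strict alignment needs a total width of 12 with perfectly aligned columns, while compression fits in 9 but leaves the three columns misaligned by 0, 1, and 2 units.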

21
Our method combines the advantages of the two methods above.
- Align the columns within a maximum width constraint.
- Minimize bit-slice misalignment while ensuring a maximum alignment width.
[Figure: cells C_i,j aligned under a maximum width constraint, with the misalignment at each column marked.]

22
The methodology performs the following steps:
1. Identify cells of a parallel multiplier to be structurally placed
2. Extract the inherent structural locations of the cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
   - The relative row and column locations of the cells
   - The column spaces between the cells
7. Guide structural placement during global placement

23
After the bit-slice alignment, the structural locations and the cell spacings are transformed into structural relative placement directives:
- Relative row and column locations of the cells
- Spaces between the cells
To accommodate the cell spaces, the number of array columns is set to twice that of the logical 2-D array.
- Compression-based alignment is used to align the cells.
An estimated data-flow direction is used to set the initial orientations of the arrays for global placement.
[Figure: array of cells C_i,j built from cell slots and space slots, with the cell spacing marked.]
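
A sketch of how the doubled column count could be used: each logical column c becomes a cell slot at physical column 2c followed by a space slot at column 2c + 1. The interleaving, the tuple format of the directives, and the name `make_directives` are assumptions; the real directives would be written in the syntax of the P&R tool in use, which the slides do not name.

```python
def make_directives(array, spacing):
    """Turn a logical 2-D array into generic relative placement directives.

    array[r][c]   : cell name, or None for an empty slot
    spacing[r][c] : space to leave after that slot (0 for none)
    Returns tuples ("cell", row, column, name) and ("space", row, column, size),
    where column indexes the doubled physical array.
    """
    directives = []
    for r, row in enumerate(array):
        for c, cell in enumerate(row):
            if cell is not None:
                directives.append(("cell", r, 2 * c, cell))        # even column: cell slot
            if spacing[r][c] > 0:
                directives.append(("space", r, 2 * c + 1, spacing[r][c]))  # odd column: space slot
    return directives
```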

24
The methodology performs the following steps:
1. Identify cells of a parallel multiplier to be structurally placed
2. Extract the inherent structural locations of the cells
3. Analyze the data-flow of the multiplier
4. Structurally map the cells onto a logical 2-D array
5. Physically bit-slice align the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement

25
Structural relative placement directives hold the locations of the PI and PO cells.
- Non-guided cells are attracted to the PI and PO cells.
[Figures: placements of a 13*12 non-Booth multiplier and a 32*16 Booth multiplier.]

26
We implemented the proposed methodology in Tcl, with CLP as the linear program solver.
- Commercial logic synthesis and P&R tools were used with industrial designs.
- On average, critical path delay (CPD), total negative slack (TNS), and total wirelength improved by about 2%, 42%, and 2% respectively.
- D11 showed degraded physical implementation quality: about 25% of its inputs were pruned by constant propagation, so it was not sufficiently regular for the approach.

Design  # Mults  Area ratio  CPD   TNS   Wirelength
D1      7        0.49        0.94  0.02  0.99
D2      8        0.17        1.00  0.82  0.98
D3      6        0.33        1.00  0.74  0.95
D4      4        0.32        0.97  0.00  0.98
D5      3        0.30        0.99  0.97  1.00
D6      1        0.25        0.98  0.91  0.95
D7      9        0.21        0.98  0.28  0.94
D8      2        0.21        0.99  0.82  0.99
D9      8        0.18        0.99  0.58  1.00
D10     16       0.09        0.96  0.14  0.99
D11     1        0.40        1.03  1.10  1.02
Ave.    6        0.27        0.98  0.58  0.98
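
The quoted improvements follow from the per-design ratios in the table. The short check below recomputes the averages, reading CPD, TNS, and wirelength as ratios to the baseline flow (an interpretation suggested by the 0.98, 0.58, 0.98 averages). The variable and function names are made up for this example.

```python
# Per-design ratios for D1..D11, copied from the results table.
CPD = [0.94, 1.00, 1.00, 0.97, 0.99, 0.98, 0.98, 0.99, 0.99, 0.96, 1.03]
TNS = [0.02, 0.82, 0.74, 0.00, 0.97, 0.91, 0.28, 0.82, 0.58, 0.14, 1.10]
WIRELENGTH = [0.99, 0.98, 0.95, 0.98, 1.00, 0.95, 0.94, 0.99, 1.00, 0.99, 1.02]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def improvement_percent(ratios):
    """Average improvement in percent, assuming each ratio is relative to the baseline."""
    return round((1.0 - mean(ratios)) * 100)


summary = {
    "CPD": (round(mean(CPD), 2), improvement_percent(CPD)),
    "TNS": (round(mean(TNS), 2), improvement_percent(TNS)),
    "Wirelength": (round(mean(WIRELENGTH), 2), improvement_percent(WIRELENGTH)),
}
```

The rounded averages reproduce the "Ave." row, and the improvement figures come out at about 2%, 42%, and 2%.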

27
A snapshot of D10
[Figure: placement snapshot of design D10.]

28
To further automate the method, awareness of the surroundings (placement blockages, macros, data-flow, etc.) is needed.
- The multipliers were required to be “naturally” placed in a narrow macro channel, whereas the structural placement method may prevent this kind of placement.

29
Future work will focus on:
- Extending the methodology to other synthesized datapath circuits.
- Developing regularity-measuring methods to avoid structurally mapping insufficiently regular multipliers.
- Adding more awareness of the surroundings to further automate the methodology.
