Reinventing The Wheel: Developing a New Standard-Cell Synthesis Flow Alan Mishchenko University of California, Berkeley.


1 Reinventing The Wheel: Developing a New Standard-Cell Synthesis Flow Alan Mishchenko University of California, Berkeley

2 Outline
- Motivation
- The flow
- Technology-independent synthesis
- Technology mapping
- Buffering
- Sizing
- Experimental results
- Conclusion

3 Motivation
Synthesis tools are out there, but they are
- slow
- suboptimal
- complicated
- expensive

4 ABC
- It is a public-domain tool developed by our research group since 2005
- It addresses both synthesis and verification of synchronous hardware
- It is based on years of experience in developing efficient data structures and algorithms
- It is used in industry and academia
- For more information, visit https://bitbucket.org/alanmi/abc

5 The Flow
- Technology-independent synthesis
- Technology mapping
- Buffering
- Sizing
These steps are not disconnected; they overlap:
- Synthesis talks to mapping through structural choices
- Mapping talks to buffering through fanout estimations
- Buffering and sizing can be interleaved

6 Synthesis: Old and New
Old: “AIG rewriting”
- Delay/area costs: AND2 levels/nodes
- Restructuring: for all 4-input cuts, try all AIG subgraphs and choose the one with the fewest nodes under the delay constraint
- Results: acceptable quality, acceptable runtime
- Problems: “over-re-structuring”; slow for large, deep logic
New: “AIG reshaping”
- Delay/area cost: user-specified cost for n-input AND/XOR/MUX/MAJ
- Restructuring: iterate “mapping” and “unmapping” several times
- Results: comparable quality, 3-10x faster
- Problems: none so far

7 Mapping: Old and New
Old: “Traditional” cut-based mapping
- iterate over the subject graph
- re-compute priority cuts
- use structural or functional matching (ICCAD’97)
- For standard-cell mapping: use a gain-based library; map both phases (positive and negative) of each node into gates; select best cuts (gates)
- Results: acceptable quality, tolerable runtime
New: “Improved” cut-based mapping
- pre-compute priority cuts
- iterate over the subject graph
- evaluate cuts using different costs
- use structural or functional matching
- For standard-cell mapping: use a gain-based library; map into NPN classes of functions from the library; select best cuts (NPN classes); perform phase assignment and determine gates during buffering
- Results: quality not known yet; runtime is expected to be 3-10x faster

8 Buffering: Old and New
Old:
- Enumerating buffer tree topologies
- Buffering for near-continuous libraries
- Other incremental local fanout optimization methods
New: several ideas tried, none is a clear winner
- “Technology-independent” buffering after the gain-based library
- Buffer-tree construction given required times and loads of the fanouts
- Incremental buffering interleaved with incremental sizing
- Results are mixed

9 Incremental Buffering Illustrated
- Growing
- Bypassing

10 Sizing: Old and New
Old:
- Non-linear programming
- Linear programming
- Lagrangian multipliers
New: incremental sizing
- find critical region
- find best gates to resize
- perform the resizing
- incrementally update timing
- iterate until no improvement
- can be combined with incremental buffering
Results:
- Reasonable
- Surprisingly fast
- If an optimum solution is known, seems to converge to it
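The incremental sizing loop on slide 10 can be illustrated with a toy model. The sketch below is not ABC's implementation: the `Gate` class, the delay formula (1 + load / drive), the constants, and the single-chain circuit are all assumptions chosen to keep the example small. The whole chain is treated as the critical region, and timing is recomputed from scratch rather than updated incrementally.

```python
from dataclasses import dataclass

MAX_SIZE = 8        # assumed largest available drive strength
OUTPUT_LOAD = 16.0  # assumed fixed load at the end of the chain


@dataclass
class Gate:
    size: int = 1   # drive strength; 1 is the minimum size


def chain_delay(gates):
    """Toy timing: each stage costs 1 + (load it drives) / (its own size)."""
    total = 0.0
    for i, gate in enumerate(gates):
        load = gates[i + 1].size if i + 1 < len(gates) else OUTPUT_LOAD
        total += 1.0 + load / gate.size
    return total


def incremental_sizing(gates):
    """Greedily upsize one gate at a time until no move reduces the delay."""
    best = chain_delay(gates)
    while True:
        best_gate = None
        # Critical region: here simply every gate of the chain.
        for gate in gates:
            if gate.size >= MAX_SIZE:
                continue
            gate.size += 1               # try the move
            delay = chain_delay(gates)   # full recompute (toy simplification)
            gate.size -= 1               # undo the move
            if delay < best - 1e-9:
                best, best_gate = delay, gate
        if best_gate is None:            # no improving move is left
            return best
        best_gate.size += 1              # commit the best move
```

Calling `incremental_sizing([Gate(), Gate(), Gate()])` modifies the gate sizes in place and returns the final chain delay. The loop terminates because every committed move increases one bounded size.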

11 Commands of The Flow
- read_lib, write_lib, print_lib
- read_scl, write_scl
- dump_genlib
- print_gs
- stime
- buffer, unbuffer
- minsize, maxsize, upsize, dnsize
- print_buf
- read_constr, print_constr, reset_constr

12 Experimental Setting
- 19 OpenCore designs were synthesized and mapped by an industrial tool using the public library vsclib013.lib from http://www.vlsitechnology.org/
- Delay, area, and runtime were collected and used as a reference
- Sizing was tested by applying min-sizing, followed by re-sizing
- Buffering was tested by un-buffering and min-sizing, followed by re-buffering and re-sizing
- The flow was tested by restructuring the design, followed by mapping, buffering, and sizing
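As a rough illustration, the sizing and buffering tests described on slide 12 can be written as sequences of the commands listed on slide 11. The helper names below are hypothetical, and the mapping of "re-sizing" onto `upsize` followed by `dnsize` is an assumption, not something the slides state. Commands that load the library and the mapped design are left out, and no command options are shown.

```python
def abc_script(*commands):
    """Join ABC command names into one semicolon-separated script string."""
    return "; ".join(commands)


# Assumption: "re-sizing" is taken to mean upsizing followed by downsizing.
RESIZE = ("upsize", "dnsize")


def sizing_test():
    """Min-size the mapped design, re-size it, then report timing."""
    return abc_script("minsize", *RESIZE, "stime")


def buffering_test():
    """Strip buffers, min-size, re-buffer, re-size, then report timing."""
    return abc_script("unbuffer", "minsize", "buffer", *RESIZE, "stime")
```

Here `sizing_test()` returns `"minsize; upsize; dnsize; stime"`. A string like this could be appended to a script that first reads a library with `read_lib` and loads a mapped design.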

13 Comments on The Table
- The first column “Gate” shows the number of gates produced by the industrial tool
- The other “Gate” columns show the percentage change in the number of gates relative to the result produced by the tool
- Similarly, columns “Area” and “Delay” show the percentage change in area and delay, respectively
- Runtimes are in seconds on a Linux workstation
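The relative columns described on slide 13 are plain percentage changes against the industrial tool's result. A minimal sketch follows; the function name and the gate counts in the usage note are made up for illustration and are not taken from the tables.

```python
def percent_change(reference, value):
    """Percentage change of `value` relative to `reference` (the tool's result)."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return 100.0 * (value - reference) / reference
```

For example, with a hypothetical reference of 1000 gates and a new result of 810 gates, `percent_change(1000, 810)` gives -19.0, which would be reported as -19% gates.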

14 Original Statistics

15 Comparing Two Sizing Options

16 Comparing Full Flow

17 Full Flow with Improvements

18 Two Larger Designs

19 Experimental Results
The following notation is used below:
- ToolD = industrial tool run in delay mode
- ToolA = industrial tool run in area mode
- AbcD = ABC run in delay mode
- AbcDF = ABC run in delay mode with novel fast synthesis feature
- AbcA = ABC run in area mode
Gate counts include buffers and inverters.
(1.1) AbcD has -19% gates, -13% area, and +3% delay, compared to ToolD.
(1.2) AbcDF has -23% gates, -17% area, and +10% delay, compared to ToolD.
(1.3) AbcA has -16% gates, +2% area, and -2x delay, compared to ToolA.
AbcDF (1.2) runs about 2x faster than AbcD (1.1).
AbcA (1.3) runs about 5x faster than AbcD (1.1).
The same flow produces the following results on the public 130nm library:
(2.1) AbcD has +31% gates, +16% area, and -15% delay, compared to ToolD.
(2.3) AbcA has +18% gates, +11% area, and -65% delay, compared to ToolA.

20 Potential Issues
- Not specifying input driving cells and output loads
  (this was addressed and experiments show it is fine)
- Over-tuning for one particular library
  (not sure heuristics will hold for submicron libraries)
- Not looking at power
- Not taking high and low Vt cells into account
- Not mapping into multi-output cells
- Not mapping sequential elements
- Not considering multiple clock domains

21 Conclusion
- A new synthesis flow is being developed and implemented in ABC
- It is an opportunity to rethink some of the classical problems, improve on some of the known solutions, and come up with a new public implementation
- Results are encouraging:
  delay (in delay-oriented synthesis) is within 5-15%
  area (in area-oriented synthesis) is within 1-3%
  runtime is about 20-50x better

22 Abstract
This presentation focuses on adding new capabilities for synthesizing standard-cell designs to the public-domain synthesis/verification tool ABC. An optimization flow has been developed, which includes gain-based technology mapping, fanout optimization by buffering and gate duplication, and gate sizing. Novel heuristic algorithms have been proposed for several well-known optimization steps. For example, buffer-tree construction can be performed not as a separate step, but concurrently with gate sizing by reshaping initial well-balanced buffer trees. Each tree-reshaping and each gate-resizing transform is evaluated for delay/area improvement using a common cost function, and the most promising one is selected. The delay is measured by a lookup-table-based delay model, which computes the delay of a gate from its input slew and output capacitance. Experiments show that the flow produces results that are within 10% of those of industrial tools while running 20x faster.
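The lookup-table delay model mentioned in the abstract can be sketched as bilinear interpolation over a two-dimensional table indexed by input transition time (slew) and output capacitance. This is a generic illustration, not ABC's code: the function names, the table layout, and the choice of bilinear interpolation with linear extrapolation outside the table are assumptions.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Index i such that axis[i] and axis[i + 1] bracket x, clamped to the table."""
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup_delay(slews, loads, table, slew, load):
    """Bilinear interpolation; table[i][j] is the delay at slews[i] and loads[j].

    Both axes must be sorted in increasing order and have at least two points.
    Queries outside the table are extrapolated from the nearest cell.
    """
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])
    low = table[i][j] * (1 - tl) + table[i][j + 1] * tl
    high = table[i + 1][j] * (1 - tl) + table[i + 1][j + 1] * tl
    return low * (1 - ts) + high * ts
```

With the made-up table `[[1.0, 3.0], [2.0, 4.0]]` over slews `[0.0, 1.0]` and loads `[0.0, 2.0]`, a query at slew 0.5 and load 1.0 lands in the middle of the cell and returns 2.5.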

