Global Clustering-Based Performance-Driven Circuit Partitioning Jason Cong University of California Los Angeles Chang Wu Aplus Design.


1 Global Clustering-Based Performance-Driven Circuit Partitioning Jason Cong University of California Los Angeles cong@cs.ucla.edu Chang Wu Aplus Design Technologies Los Angeles changwu@aplus-dt.com

2 Problem Definition
Problem: k-way circuit partitioning and retiming with balanced area for delay minimization
  - Delay minimization with consideration of cutsize
  - Retiming is performed simultaneously with partitioning for the best possible delay reduction
  - Generic delay model: node delay, intra-block delay, inter-block delay
[Figure: delay model showing node delay $d_v$, inter-block delay $D$, and intra-block delay $d$, with $D > d$]
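As a rough illustration of the generic delay model, the sketch below computes the longest-path delay of a purely combinational DAG for a given block assignment. It ignores flip-flops and retiming, and the function name `circuit_delay`, its parameters, and the example values are all hypothetical, not taken from the slides.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def circuit_delay(node_delay, edges, block, d_intra, d_inter):
    """Longest-path delay of a combinational DAG (no flip-flops, an assumption).

    node_delay: dict node -> d_v
    edges:      list of (u, v) pairs, signal flows u -> v
    block:      dict node -> block id
    d_intra:    delay d of an edge inside one block
    d_inter:    delay D of an edge crossing blocks (D > d)
    """
    preds = {v: [] for v in node_delay}
    for u, v in edges:
        preds[v].append(u)

    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        latest_input = 0
        for u in preds[v]:
            wire = d_intra if block[u] == block[v] else d_inter
            latest_input = max(latest_input, arrival[u] + wire)
        arrival[v] = latest_input + node_delay[v]
    return max(arrival.values())


# Hypothetical three-gate chain a -> b -> c with unit node delays.
delays = {"a": 1, "b": 1, "c": 1}
chain = [("a", "b"), ("b", "c")]
same_block = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 0}, 1, 5)
cut_b_c = circuit_delay(delays, chain, {"a": 0, "b": 0, "c": 1}, 1, 5)
```

Cutting the edge between `b` and `c` replaces one intra-block delay with the larger inter-block delay, which is why placing critical-path nodes in the same block matters.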

3 Existing Approaches
Clustering-based approaches
  - PRIME: group nodes into clusters with a given area bound
    - Quasi-optimal delay solution with node duplication
    - Huge cutsize (3X)
Partitioning-based approaches
  - Partition circuits into k blocks and then iteratively move nodes to improve further
  - Cutsize minimization: hMetis
    - Multi-level partitioning, very fast, excellent cutsize, fair circuit delay
  - Delay minimization: HPM
    - Performance-driven clustering + cutsize-driven partitioning, tradeoff between delay and cutsize

4 Existing Approaches (cont)
Clustering-based approaches
  - Delay optimization with node duplication is solved optimally
  - Duplication-free clustering is NP-complete, but fairly good results are obtained by resolving duplications heuristically
  - Huge cutsize
Partitioning-based approaches
  - Very good cutsize
  - Difficulty with delay minimization: updating delay after each node move is too costly (linear time)
  - hMetis: does not consider delay directly; gradual coarsening is difficult to target at delay
  - HPM: clustering and partitioning are separate; clustering does not know its impact on cutsize, and partitioning has little control over delay

5 HPM: Combination of Clustering and Partitioning
HPM by Cong, et al. [DAC99]
  - Clustering followed by partitioning
    - Good balance between delay and cutsize
  - Clustering and partitioning are two completely separate steps
    - Clustering uses a very small, fixed area bound (10) on each block: much less than A/K, where A is the circuit area
    - Achieves inferior delay compared with clustering under a cluster area bound of A/K (delay is ~23% larger)
    - Achieves larger cutsize than hMetis because the clustering constraints reduce the cutsize-reduction capability of partitioning
  - A better solution is needed

6 Multi-Level Partitioning for Cutsize
hMetis by Karypis, et al. [DAC97]
  - Gradual coarsening to group tightly connected nodes together
  - Gradual uncoarsening, reducing cutsize by moving clusters
    - Fast algorithm: reduced solution space at each level, as many nodes are grouped and moved together
    - Smaller cutsize: a more thorough search is possible in the reduced solution space
    - Hyperedge-based coarsening is very suitable for cutsize
  - Delay is completely ignored
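For readers unfamiliar with the cutsize objective, here is a minimal sketch assuming the common hyperedge-cut definition: a net is cut when its pins fall in more than one block. The slides do not state which cutsize metric is used, so this definition and the name `cutsize` are assumptions.

```python
def cutsize(nets, block):
    """Count nets (hyperedges) whose pins span more than one block.

    nets:  iterable of node tuples, one tuple per net
    block: dict node -> block id
    """
    return sum(1 for net in nets if len({block[v] for v in net}) > 1)


# Hypothetical netlist: only the three-pin net crosses the block boundary.
example_nets = [("a", "b"), ("b", "c", "d"), ("c", "d")]
example_blocks = {"a": 0, "b": 0, "c": 1, "d": 1}
cut = cutsize(example_nets, example_blocks)
```

Moving a whole cluster at a coarse level changes the block of many nodes at once, which is what lets a multi-level scheme search this objective quickly.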

7 Existing Multi-level Optimization Engine
V-shape multi-level optimization used in hMetis
  - Not very suitable for delay minimization
    - Gradual coarsening has difficulty predicting its impact on delay

8 MLPR: Performance-Driven Multi-Level Partitioning and Retiming
K-way partitioning algorithm for performance optimization
  - Retiming is performed during partitioning for the best possible circuit delay
  - Cutsize reduction is also considered
MLPR
  - Clustering with an area bound of $A/K$, where $A$ is the circuit area
  - Partitioning of clusters into $K$ blocks
  - For level from 1 to $\log(A/K)$
    - Clustering with an area bound of $A/(K \cdot 2^{\text{level}})$
      - Each cluster is bounded by the block it belongs to
    - Moving clusters to reduce cutsize while preserving circuit delay
  - Final movement of individual nodes for the best solution
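The declustering schedule can be made concrete with a small sketch that lists the area bound used at each level. It assumes the logarithm is base 2 and is rounded down, which the slide does not state; `area_bounds` is a hypothetical helper name.

```python
import math


def area_bounds(total_area, k):
    """Area bound per level: A/K first, then A/(K * 2**level).

    Assumes a base-2 logarithm rounded down for the number of levels.
    """
    top = total_area / k                 # coarsest bound, A/K
    levels = int(math.log2(top))         # level runs from 1 to log(A/K)
    return [top] + [top / 2**level for level in range(1, levels + 1)]


# Hypothetical circuit of area 64 split into 4 blocks.
schedule = area_bounds(64, 4)
```

Under these assumptions the bound halves at every level until clusters shrink to unit area, after which individual nodes are moved.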

9 Our Contribution: Global Clustering Based Multi-Level Optimization Engine
  - Start directly from the coarsest level with global clustering for the best possible delay
  - Clustering-based gradual declustering to increase the freedom for refinement
  - Retiming is considered simultaneously during clustering and partitioning for smaller delay

10 Global Clustering for Delay Minimization
Clustering: group nodes into clusters with area no more than a given bound
CLUS by Pan, et al. [TCAD98]
PRIME by Cong, et al. [DAC99]
  - Quasi-optimal clustering with retiming for delay minimization
By setting the area bound to $A/K$, clustering can compute a partitioning solution with quasi-optimal delay
  - Existing coarsening algorithms that consider only local node connectivity cannot predict circuit delay
Theorem: Let $\phi_c$ be the circuit delay of a clustering solution. For any partitioning solution $P$ on the clusters, its delay is less than or equal to $\phi_c$
  - Clustering can compute an upper bound on circuit delay after partitioning

11 Global Clustering-Based Optimization Engine
Start from the coarsest level with clustering to define a good circuit delay
  - Comparison: coarsening with gradually increasing cluster size has difficulty predicting circuit delay after partitioning on clusters
Clustering with a gradually reduced area bound to decluster at each level
  - Nodes on a critical path will be grouped together and will NOT be split across different partitions
  - Avoid delay increase during partitioning refinement as much as possible
Partition-bounded clustering to guarantee consistent solution improvement and algorithm convergence
  - Guarantees a better solution at a finer level than at a coarser level

12 Partitioning with Retiming
Retiming is considered during clustering and partitioning at each level for the best possible circuit delay
  - Sequential arrival time: $a_v = \sum l(e)$, where $l(e) = d_v + d_e - \phi \cdot w_e$ for a given target clock period $\phi$; $d_v$ is the node delay of $v$, $d_e$ is the edge delay, and $w_e$ is the number of FFs on edge $e$ from $u$ to $v$
  - Theorem [Pan98]: if $\max(a_{po}) \le \phi$, the minimum circuit delay after retiming is no more than $\phi + D$
  - Timing analysis in both clustering and partitioning is based on sequential arrival time
  - Binary search to get the minimum clock period after retiming
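The binary search over the target clock period can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: arrival times are taken as the maximum over paths of the summed $l(e)$, primary inputs start at 0, cycles are handled with Bellman-Ford style relaxation, and the function names and the example circuit are hypothetical.

```python
def arrival_times(nodes, edges, node_delay, inputs, phi):
    """Sequential arrival times for target period phi.

    edges: list of (u, v, d_e, w_e); l(e) = d_v + d_e - phi * w_e.
    Returns None if the values never settle (a positive-weight cycle).
    """
    neg_inf = float("-inf")
    a = {v: neg_inf for v in nodes}
    for v in inputs:
        a[v] = 0.0  # assumption: primary inputs arrive at time 0
    for _ in range(len(nodes)):
        changed = False
        for u, v, d_e, w_e in edges:
            if a[u] == neg_inf:
                continue
            candidate = a[u] + node_delay[v] + d_e - phi * w_e
            if candidate > a[v] + 1e-12:
                a[v] = candidate
                changed = True
        if not changed:
            return a
    return None  # still changing after |V| passes


def min_clock_period(nodes, edges, node_delay, inputs, outputs, lo, hi, tol=1e-6):
    """Smallest phi in [lo, hi] with max arrival time at outputs <= phi."""
    def feasible(phi):
        a = arrival_times(nodes, edges, node_delay, inputs, phi)
        return a is not None and max(a[o] for o in outputs) <= phi

    if not feasible(hi):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical circuit: in -> g1 -> (one FF) -> g2 -> out, zero edge delays.
nodes = ["in", "g1", "g2", "out"]
node_delay = {"in": 0, "g1": 3, "g2": 1, "out": 0}
edges = [("in", "g1", 0, 0), ("g1", "g2", 0, 1), ("g2", "out", 0, 0)]
phi_min = min_clock_period(nodes, edges, node_delay, ["in"], ["out"], 0.0, 10.0)
```

In this example the output arrival time is $4 - \phi$, so the test $4 - \phi \le \phi$ holds from $\phi = 2$ upward and the search converges to about 2. Feasibility is monotone in $\phi$ because raising $\phi$ lowers every $l(e)$ while loosening the bound, which is what makes binary search applicable.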

13 Test Results
[Charts: 16-way partitioning; bi-partitioning]

14 Conclusion
  - Global clustering is more suitable for delay minimization
  - The global clustering-based multi-level optimization engine achieves good delay and cutsize
  - Retiming further helps delay reduction
    - Simultaneous retiming with partitioning achieves better results than performing partitioning and retiming separately
    - Retiming is not required by the main algorithm and can be disabled

