Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning

Presentation transcript:

Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning Andrew B. Kahng and Xu Xu UCSD CSE and ECE Depts. Work supported in part by MARCO GSRC

Outline
– Motivation (current section)
– Performance driven bipartition problem
– New bipartitioning algorithm
– Experimental results
– Conclusion and future work

Partitioning and Performance The hypergraph partitioning problem is to divide the nodes of a hypergraph into roughly equal parts; the traditional objective is to minimize cutsize. In performance-driven partitioning, we also seek to minimize path delay on timing paths.

Previous Work (I) [Cong et al. ISPD-2002]
– Global clustering based algorithm with retiming
– Stages shown in the slide diagram: Min-delay Clustering w/ retiming; De-clustering and refinement; Min-cutsize Clustering
– Reduces delay by 16% while increasing cutsize by 17%
– Requires substantial gate replication

Previous Work (II) [Ababei et al. ICCAD-2002]
– Reweighting based method
– Stages shown in the slide diagram: Input; Global timing analysis; Find critical paths; Reweighting (path based or net based); cutsize oriented partitioner, such as hMetis or MLPart
– 14% reduction of delay with 10% increase in cutsize
– 139% increase in runtime compared with hMetis

Motivating Questions
– Can we avoid global timing analysis?
   – Global timing analysis is extremely time-consuming
– Can we improve path delay without significantly degrading cutsize?
   – Need a smooth tradeoff between delay and cutsize
– Can we reduce implementation overheads?
   – Previous methods store thousands of critical paths and continuously update them

Outline (current section: Performance driven bipartition problem)

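Delay Model
Delay = hop_delay + node_delay, where each hop (cut crossing) on the path contributes hop_delay and each node on the path contributes node_delay
– [Cong et al. ISPD-2002]: hop_delay = 5, node_delay = 1 ⇒ Delay = 3×5 + 5×1 = 20 for the path shown (3 hops, 5 nodes)
– [Ababei et al. ICCAD-2002]: hop_delay = Elmore delay, node_delay = constant
[Figure: a path through FF nodes and combinational nodes that crosses the cut between Part 0 and Part 1; each crossing is a hop]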
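The arithmetic on this slide can be checked with a minimal sketch. The function name `path_delay` and its parameter names are illustrative only; the default values are the hop_delay = 5 and node_delay = 1 quoted for the Cong et al. model.

```python
def path_delay(num_hops, num_nodes, hop_delay=5, node_delay=1):
    """Delay of one path: every cut crossing costs hop_delay,
    every node on the path costs node_delay."""
    return num_hops * hop_delay + num_nodes * node_delay


# Slide example: a path with 3 hops and 5 nodes.
example_delay = path_delay(num_hops=3, num_nodes=5)
print(example_delay)  # 20
```

Under the Ababei et al. model the same function shape would apply, with hop_delay replaced by a per-hop Elmore delay rather than a constant.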

Performance Driven Bipartition Problem
Given:
– Hypergraph H = (V, E)
– Area balance tolerance s (0 < s < 1), a parameter to control allowable slack in the area constraint
– α, a given parameter which captures the tradeoff between cutsize and path delay (hopcount)
Find: A bipartition (V_0 | V_1) which satisfies the area balance constraint and minimizes α(cutsize) + (1 − α)(Max_hopcount)
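As a rough illustration of how the weighted objective trades cutsize against hopcount, the sketch below evaluates it for two made-up candidate solutions. The function name, the check that α lies in [0, 1], and the candidate numbers are assumptions for illustration, not values from the slides.

```python
def objective(cutsize, max_hopcount, alpha):
    """alpha * cutsize + (1 - alpha) * Max_hopcount.
    Restricting alpha to [0, 1] is an assumption of this sketch."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * cutsize + (1.0 - alpha) * max_hopcount


# Two hypothetical solutions: (cutsize, max_hopcount)
low_cut = (100, 5)
low_hop = (110, 3)

for alpha in (0.5, 0.1):
    a = objective(*low_cut, alpha)
    b = objective(*low_hop, alpha)
    print(alpha, "prefers", "low_cut" if a < b else "low_hop")
```

With these numbers the larger α favors the low-cutsize solution and the smaller α favors the low-hopcount one, which is the kind of tradeoff the parameter is meant to expose.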

Outline (current section: New bipartitioning algorithm)

Unidirectional Partition
– Path delay is minimized, with hopcount = 1, if the partition is unidirectional (“acyclic”), that is, all cuts are in the same direction
– Problems: high cutsize; a unidirectional solution may not exist
– Can we achieve a “locally unidirectional” partition?
[Figure: two example partitions across Part 0 and Part 1, one with Max hopcount = 5 and one with Max hopcount = 3]

V-Shaped Nodes
A combinational node v is a V-shaped node if there exist v_j and v_t in the other part and a path from v_j to v_t that includes only v.
[Figure: V-shaped node v, with v_j and v_t in the other part]
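The definition can be tested directly on a small directed graph. The sketch below is one possible reading of it: v is V-shaped when it is combinational and has both a direct predecessor and a direct successor in the other part. The data layout (`succ` as an adjacency dict, `part` mapping node to 0 or 1) and all names are assumptions for illustration.

```python
def is_v_shaped(v, succ, part, combinational):
    """True if combinational node v has a predecessor v_j and a successor
    v_t that both lie in the part opposite to v."""
    if v not in combinational:
        return False
    other = 1 - part[v]
    pred_in_other = any(part[u] == other
                        for u, outs in succ.items() if v in outs)
    succ_in_other = any(part[w] == other for w in succ.get(v, ()))
    return pred_in_other and succ_in_other


# vj -> v -> vt, with v alone in Part 0
succ = {"vj": ["v"], "v": ["vt"], "vt": []}
part = {"vj": 1, "v": 0, "vt": 1}
print(is_v_shaped("v", succ, part, combinational={"vj", "v", "vt"}))  # True
```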

V-Shaped Nodes in Critical Paths
Empirical observations from study of partitioning solutions:
– There are V-shaped nodes in the partitioning solutions
– Every V-shaped node is included in many critical paths
– Every critical path contains several V-shaped nodes
For testcase 1:
– Number of nets: [value missing]
– Number of critical paths: [value missing]
– On average, one critical path contains 27.6 nodes
– On average, one critical path contains 3.4 V-nodes
– On average, one V-node belongs to [value missing] critical paths

Key Idea: V-Shaped Nodes Elimination
Move V-shaped node “b” to reduce path hopcount.
Before the move:
– PATH: a → b → c, hopcount = 2
– PATH: d → b → c, hopcount = 1
– PATH: e → b → c, hopcount = 1
After the move:
– PATH: a → b → c, hopcount = 0
– PATH: d → b → c, hopcount = 1
– PATH: e → b → c, hopcount = 1
[Figure: nodes a, b, c, d, e, f across Part 0 and Part 1, before and after moving b]
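The hopcounts listed on this slide can be reproduced with a short sketch. The part assignment below (a and c on one side; b, d and e on the other) is inferred from the listed hopcounts and is an assumption, as is the function name; node f is omitted because the transcript does not say where it sits.

```python
def hopcount(path, part):
    """Number of consecutive node pairs on the path that straddle the cut."""
    return sum(1 for u, w in zip(path, path[1:]) if part[u] != part[w])


paths = [["a", "b", "c"], ["d", "b", "c"], ["e", "b", "c"]]

before = {"a": 0, "c": 0, "b": 1, "d": 1, "e": 1}
after = dict(before, b=0)  # move the V-shaped node b to the other part

hops_before = [hopcount(p, before) for p in paths]
hops_after = [hopcount(p, after) for p in paths]
print(hops_before, hops_after)  # [2, 1, 1] [0, 1, 1]
```

The maximum hopcount over these three paths drops from 2 to 1, while the paths through d and e keep one crossing each (it shifts from the b → c edge to the edge entering b).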

Distance-k V-Shaped Nodes Elimination
– k = 2: moving the V_2 nodes “b, c” reduces the path hopcount from 2 to 0
– Problems with large k:
   – Cutsize may be greatly increased
   – Delay of one path is reduced while the delay of other paths is increased
[Figure: nodes a, b, c, d across Part 0 and Part 1, before and after moving b and c]

New Gain Function
– g(v): traditional FM gain
– r_j(v): reduction of V_j nodes after moving v
– Gain(v) = δ(0)·g(v) + δ(1)·r_1(v)
[Figure: node v before and after the move]
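A minimal sketch of the biased gain, reading it as δ(0) times the FM gain plus one δ(j)·r_j(v) term per distance j. This reading, the extension to several r_j terms, and the function name are assumptions; the δ values in the example are the ones quoted on the results slides.

```python
def biased_gain(g, r, delta):
    """delta[0] * g + sum over j >= 1 of delta[j] * r[j - 1].

    g     : traditional FM gain of the node
    r     : [r_1, ..., r_k], reduction in V_j nodes if the node moves
    delta : [delta(0), delta(1), ..., delta(k)]
    """
    if len(r) != len(delta) - 1:
        raise ValueError("need exactly one r_j per delta(j), j >= 1")
    return delta[0] * g + sum(d * rj for d, rj in zip(delta[1:], r))


# delta(0) = 1, delta(1) = 10: a move that worsens cutsize by 1 but
# removes a V-shaped node outranks a plain cutsize gain of 2.
print(biased_gain(-1, [1], [1, 10]))  # 9
print(biased_gain(2, [0], [1, 10]))   # 2
```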

Distance-k Unidirectional Algorithm
Calculate initial gains for all nodes and store the gains
Select the node v with maximum gain
/* CLIP-like method: move the cluster that v belongs to */
Reset the gains of all nodes to zero
Move v and update the gains of v and its neighbors
While (∃ one node not moved)
    Select one node v with the maximum updated gain
    Move v and update the related gains
Find the point in the move sequence at which the sum of gains is maximum; undo all moves after this point
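The move-then-roll-back skeleton of the pass can be sketched as follows. This is a simplified illustration, not the authors' implementation: it recomputes gains from scratch instead of updating them incrementally, omits the CLIP-like gain reset and the area balance constraint, and takes the gain function as a parameter. All names are hypothetical.

```python
def fm_style_pass(part, gain):
    """Move every node once in order of maximum current gain, then undo
    all moves after the prefix with the largest cumulative gain.
    `part` maps node -> 0/1 and is modified in place."""
    unlocked = set(part)
    moves, gains = [], []
    while unlocked:
        candidates = [n for n in part if n in unlocked]  # stable order
        v = max(candidates, key=lambda n: gain(n, part))
        gains.append(gain(v, part))
        part[v] = 1 - part[v]
        unlocked.remove(v)
        moves.append(v)
    best_total, best_len, running = 0, 0, 0
    for i, g in enumerate(gains, start=1):
        running += g
        if running > best_total:
            best_total, best_len = running, i
    for v in moves[best_len:]:  # roll back the tail of the sequence
        part[v] = 1 - part[v]
    return best_total


# Toy usage with a plain cutsize gain on 2-pin nets (an assumption).
edges = [("a", "b"), ("b", "c")]

def cut_gain(v, part):
    total = 0
    for u, w in edges:
        if v in (u, w):
            other = w if v == u else u
            total += 1 if part[other] != part[v] else -1
    return total

part = {"a": 0, "b": 1, "c": 0}
total = fm_style_pass(part, cut_gain)
print(total, part)
```

Because balance is ignored here, the toy run simply pulls b across the cut; a real pass would reject moves that violate the area constraint, and would plug in the biased gain in place of `cut_gain`.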

Outline (current section: Experimental results)

Experimental Setup
– Four industry testcases obtained as LEF/DEF
– Model of Ababei et al. (ICCAD-2002) used to calculate delay
– Partitioning solutions compared to results of MLPart
   – strongest multilevel netlist partitioning code
   – website:
– All tests on 600MHz Intel Pentium-III Xeon

Biasing against V_1 Nodes vs. MLPart
δ(0) = 1, δ(1) = 10
[Table: cutsize, hdelay and time(s) per testcase, for MLPart and for MLPart + V-shaped nodes removal; the data rows are missing]
– Reduction of delay: 4.5%-24.4%, average: 15.1%
– Increase of cutsize: 3.0%-10.0%, average: 4.9%
– Increase of runtime: 6.3%-11.4%, average: 9.7%
Using the delay model in Cong et al. ISPD-2002:
– Reduction of delay: 4.3%-21.2%, average: 14.7%

Biasing against V_2 Nodes vs. MLPart
δ(0) = 1, δ(1) = 30, δ(2) = 3
[Table: cutsize, hdelay and time(s) per testcase, for MLPart and for MLPart + V_k (k = 2) nodes removal; the data rows are missing]
– Reduction of delay: 8.9%-30.0%, average: 18.7%
– Increase of cutsize: 3.1%-7.2%, average: 3.5%
– Increase of runtime: 11.9%-15.9%, average: 13.1%
Using the delay model in Cong et al. ISPD-2002:
– Reduction of delay: 8.3%-28.7%, average: 17.3%

Outline (current section: Conclusions and future work)

Conclusions
– Simple yet efficient timing-driven partitioning that does not require global timing analysis
– Negligible implementation and runtime overhead
– Significantly reduces path delay, with cutsize and runtime almost the same as leading-edge MLPart
– Similar improvements observed with different path delay metrics
– Future work
   – Impact of new partitioner on placement
   – Efficient methods for biasing δ(k), k > 2

Thank you!


Why Performance Driven Partitioning?
– Achieving timing closure becomes increasingly difficult in deep-submicron technologies due to non-ideal scaling of interconnect delay
– Routing alone can no longer solve the timing problem, even with aggressive optimizations (buffer insertion, buffer/wire sizing, …)
   ⇒ Timing needs to be addressed at all design stages
– Partitioning is a critical step in defining interconnect timing properties, but is traditionally driven by the cutsize objective

Previous Work (I)
– With Logic Replication
   – Retiming
   – Replication graph
– Without Logic Replication
   – Net based reweighting
   – Path based reweighting

FM Partitioning and Gain Function
Gain(v) = Reduction of cutsize after moving v
– Start with random partition
– Move the node with the max gain and lock it
– Keep moving until all nodes are locked
– Find the best point in the move sequence
[Figure: node v before and after the move between Part 0 and Part 1; in the example shown, Gain(v) = -1]
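For a hypergraph, the cutsize gain of a single node can be computed net by net. The sketch below uses the usual rule (a net becomes uncut if v is its only pin on v's side, and becomes cut if all of its pins are currently on v's side); the data layout and names are assumptions for illustration, and the small example is constructed to give a gain of -1 rather than taken from the slide figure.

```python
def fm_gain(v, nets, part):
    """Reduction in cutsize if v alone is moved to the other part."""
    gain = 0
    for net in nets:
        if v not in net:
            continue
        same = sum(1 for u in net if u != v and part[u] == part[v])
        other = sum(1 for u in net if part[u] != part[v])
        if same == 0 and other > 0:
            gain += 1   # v is the only pin on its side: net becomes uncut
        elif other == 0 and same > 0:
            gain -= 1   # net lies entirely on v's side: it becomes cut
    return gain


nets = [{"v", "a"}, {"v", "b", "c"}]
part = {"v": 0, "a": 0, "b": 0, "c": 1}
print(fm_gain("v", nets, part))  # -1
```

In the example, net {v, a} is uncut and would become cut, while net {v, b, c} is cut both before and after the move, so the total is -1.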

Procedure to Calculate r_j(v)
– Delete all FF nodes and their related edges
– In the remaining graph, BFS from v
– For each level j from 1 to k:
   – If v is a V_j node before moving, r_j' = 1 (otherwise 0)
   – If v is a V_j node after moving, r_j'' = 1 (otherwise 0)
   – r_j = r_j' - r_j'', so that r_j is the reduction in V_j nodes caused by the move
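For the simplest case, j = 1, the procedure can be sketched as below. The sketch takes the reduction as "before minus after", so that eliminating a V-shaped node yields +1, in line with the description of r_j(v) as a reduction; that sign convention, the data layout and the names are assumptions. FF nodes are skipped rather than physically deleted, and only v's own status is examined, as on the slide.

```python
def is_v1(v, succ, part, ff_nodes):
    """V_1 test on the graph with FF nodes (and their edges) ignored."""
    other = 1 - part[v]
    preds = [u for u, outs in succ.items() if v in outs and u not in ff_nodes]
    succs = [w for w in succ.get(v, ()) if w not in ff_nodes]
    return (any(part[u] == other for u in preds)
            and any(part[w] == other for w in succs))


def r1(v, succ, part, ff_nodes=frozenset()):
    """+1 if moving v removes its V_1 status, -1 if the move creates it."""
    if v in ff_nodes:
        return 0
    before = int(is_v1(v, succ, part, ff_nodes))
    moved = dict(part)
    moved[v] = 1 - part[v]
    after = int(is_v1(v, succ, moved, ff_nodes))
    return before - after


succ = {"vj": ["v"], "v": ["vt"], "vt": []}
print(r1("v", succ, {"vj": 1, "v": 0, "vt": 1}))  # 1
print(r1("v", succ, {"vj": 0, "v": 0, "vt": 0}))  # -1
```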

CLIP Algorithm
Reminiscent of CLIP (Deng et al. DAC 1996) in how it induces movement of clusters across the cutline.
[Figure: node v and its cluster, before and after a CLIP-style move]

Distance-k V-Shaped Nodes
Distance-k V-shaped nodes (V_k-nodes): k combinational nodes v_{i,1}, …, v_{i,k} are distance-k V-shaped nodes if:
– v_{i,1}, …, v_{i,k} are in the same part
– ∃ v_j, v_t in the other part
– ∃ a path from v_j to v_t that passes only through v_{i,1}, …, v_{i,k}
[Figure: chain v_{i,1}, …, v_{i,k} in one part, with v_j and v_t in the other part]
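One way to enumerate such chains is a depth-first search that starts at every combinational node with a predecessor in the other part and extends along same-part successors until exactly k nodes are collected. This is an illustrative sketch under assumed data structures (adjacency dict `succ`, `part` mapping node to 0 or 1); it can take exponential time on dense graphs, and all names are hypothetical.

```python
def find_vk_chains(succ, part, combinational, k):
    """Return every chain of exactly k same-part combinational nodes that
    is entered from, and exits to, the other part."""
    pred = {}
    for u, outs in succ.items():
        for w in outs:
            pred.setdefault(w, []).append(u)

    chains = []

    def extend(chain):
        side, last = part[chain[0]], chain[-1]
        if len(chain) == k:
            if any(part[w] != side for w in succ.get(last, ())):
                chains.append(chain)
            return
        for w in succ.get(last, ()):
            if w in combinational and part[w] == side and w not in chain:
                extend(chain + [w])

    for v in part:
        if v in combinational and any(part[u] != part[v]
                                      for u in pred.get(v, ())):
            extend([v])
    return chains


# a -> b -> c -> d with b, c on the opposite side from a, d
succ = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
part = {"a": 0, "b": 1, "c": 1, "d": 0}
print(find_vk_chains(succ, part, set(part), k=2))  # [['b', 'c']]
```

With k = 1 the same graph yields no chain, since neither b nor c has both a predecessor and a successor in the other part.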

Notation
– H(V, E) = circuit hypergraph
– V = set of nodes representing components of the circuit
– E = set of signal nets
– A bipartition (V_0 | V_1) of H(V, E) divides V into two disjoint subsets s.t. V = V_0 ∪ V_1, which are called Part 0 and Part 1
– A = the total area of all the nodes in V
– A_0 = the area of all the nodes in V_0