Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction Yu Hu 1, Yan Lin 1, Lei He 1 and Tim Tuan 2 1 EE Department, UCLA 2 Xilinx.

Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction Yu Hu 1, Yan Lin 1, Lei He 1 and Tim Tuan 2 1 EE Department, UCLA 2 Xilinx Research Lab Presented by Yu Hu Partially supported by NSF. Address comments to

Outline
- Background, Motivation and Problem Formulation
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
- Simultaneous Vdd Level Assignment and Retiming
- Experimental Results
- Conclusions

Background
- Existing FPGAs are power inefficient compared to ASICs.
- Interconnect is the dominant component of FPGA power dissipation (dynamic and leakage). [Li, TCAD'05]
- Power-aware FPGA architectures and CAD algorithms have been studied extensively:
  - CAD algorithms to minimize power-delay product [Lamoureux, ICCAD'03]
  - Configuration inversion for leakage reduction [Anderson, FPGA'04]
  - Vdd-programmable FPGA logic blocks [Li, FPGA'04] [Li, DAC'04]
  - Vdd-programmable FPGA interconnects [Li, ICCAD'04] [Gayasen, FPL'04] [Anderson, ICCAD'04] [Lin, DAC'05]

Vdd Programmable Interconnect Arch.
- Island style with mixed wire segment lengths.
- Routing switch/connection block: two PMOS power transistors, M3 and M4, are inserted between the tri-state buffer and the VddH and VddL power rails, respectively. [Li, ICCAD'04]
- Level-converter-free routing tree (guarantees that no VddL switch drives a VddH switch) with the LEAST area and power penalty. [Lin, TCAD'06]

Limitation of Existing Approaches
- The most recent work on programmable dual-Vdd FPGAs [Lin, DAC'05] uses timing slack budgeting to maximize power reduction.
  - It assumes uniform wire segment length and cannot be extended to mixed wire segments.
  - Vdd level assignment is performed within combinational sub-circuits only.
- Simultaneous retiming and timing budgeting has been studied to reduce area or improve performance. [Yeh, DAC'03] [Yeh, ICCAD'03]
  - Power reduction has not been considered.
  - Post-layout flip-flop binding constraints were not addressed.

Call for Simultaneous Vdd Assignment and Retiming
- [Lin, DAC'05] performs Vdd level assignment within combinational sub-circuits, which limits the search space.
- Simultaneous retiming and Vdd assignment explores a larger search space and extracts more useful timing slack for VddL switch insertion.
[Figure labels: Combinational assignment; Sequential assignment; All VddH switches!; VddL switch inserted!; Movable with retiming!; 2 units of slack needed for a VddL switch insertion; Interconnect delay; Timing slack]

Major Contributions
- Power-aware post-layout re-synthesis processes: sequential vs. simultaneous.
- Vdd level assignment for FPGAs with mixed wire segments. 53% interconnect power reduction is achieved compared to single-Vdd designs.
- Simultaneous retiming and interconnect Vdd assignment with flip-flop binding constraints. Up to 20% further interconnect power reduction is achieved compared to the sequential flow.
[Flow diagram labels: Synthesis/placement/routing; Simultaneous retiming and interconnect Vdd assignment; CLB-level Vdd assignment; Min-clock retiming; Interconnect Vdd assignment; Global refinement]

Problem Formulations
[Dual-Vdd Level Assignment Problem]
- Given: placement and routing results of an FPGA design
- Find: a Vdd-level assignment for each interconnect switch
- Objective: minimize interconnect (dynamic and leakage) power
- Constraints:
  - Meet the delay target $T_{spec}$
  - Vdd-level converters are inserted ONLY at CLB inputs/outputs
[Simultaneous Retiming and Dual-Vdd Level Assignment Problem]
- Same as the dual-Vdd level assignment problem, with the addition of:
  - Retiming as an extra design freedom
  - Satisfying post-layout flip-flop binding constraints

Outline
- Background, Motivation and Problem Formulation
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
  - Interconnect Power Reduction Estimation
  - LP Based Vdd-level Assignment Algorithm
- Simultaneous Vdd Level Assignment and Retiming
- Experimental Results
- Conclusions

Delay and Power Model for Interconnect
- Delay model
  - The intrinsic delay and effective driving resistance of each switch are pre-characterized using SPICE.
  - Elmore delay is used to calculate routing delay.
- Interconnect power model
  - Dynamic power: $P_d(V_{dd,j}) = 0.5\, f_{clk}\, C\, V_{dd,j}^2$
  - Leakage power $P_l(V_{dd,j})$ is pre-characterized using SPICE.
- Interconnect power reduction estimation is the essential part of the dual-Vdd assignment algorithm.
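As a rough illustration of the two analytic pieces of this model, the sketch below evaluates the dynamic power formula and a textbook Elmore delay for a simple RC ladder. The function names and the ladder topology are assumptions made for illustration; the talk itself relies on SPICE-characterized switch parameters that are not given in the slides.

```python
def dynamic_power(f_clk, cap, vdd):
    """Dynamic power as written on the slide: 0.5 * f_clk * C * Vdd^2."""
    return 0.5 * f_clk * cap * vdd ** 2


def elmore_delay_chain(resistances, capacitances):
    """Elmore delay to the far end of an RC ladder (hypothetical topology).

    resistances[k] is the series resistance of stage k and capacitances[k]
    is the capacitance at the node right after it. Each resistor is charged
    with all capacitance at or downstream of its node.
    """
    delay = 0.0
    for k, r in enumerate(resistances):
        delay += r * sum(capacitances[k:])
    return delay


# Supply levels quoted in the experimental setting slide.
VDD_H, VDD_L = 1.3, 0.8

# Ratio of dynamic power at VddL vs. VddH for the same switch and clock.
ratio = dynamic_power(1.0, 1.0, VDD_L) / dynamic_power(1.0, 1.0, VDD_H)
print(f"VddL/VddH dynamic power ratio: {ratio:.3f}")

# Two-stage ladder: R1*(C1 + C2) + R2*C2
print(elmore_delay_chain([1.0, 2.0], [3.0, 4.0]))
```

With the quoted supply levels, the quadratic dependence alone accounts for a dynamic power ratio of (0.8/1.3)^2 per switch moved to VddL, before leakage is considered.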

Review of Vdd Level Assignment Algorithm [Lin, DAC'05]
- The net-level bottom-up Vdd assignment guarantees that the final solution is legal. [Lin, DAC'05]
- All extra slack is leveraged with VddL switches. [Lin, DAC'05]
- Remaining problem: how to calculate the VddL possibility for mixed wire segments?
[Figure labels: timing slack assigned at sinks (S1 = 1, S2 = 3; switches b1, b2, b3, b4); VddL possibility for switches; power reduction estimation; Vdd assignment based on estimation; interconnect power reduction estimation]

VddL Possibility Calculation
- Represent timing slack as a number of switches: $s_i = L_i \cdot (S_i / D_i)$
  - $s_i$ is the number of switches that can be turned to VddL along the path from the source to the $i$-th sink of the routing tree, for the given timing slack $S_i$.
  - $L_i$ is the number of switches along this path.
  - $D_i$ is the delay increase of this path when all of its switches are set to VddL.
- VddL possibility for switch $j$ on the path to sink $i$, based on load capacitance: $f(i,j) = s_i \cdot (c_{ij} / C_i)$
  - Key idea: distribute timing slack to each switch in proportion to its capacitance.
- Example: $L_2 = 3$, $D_2 = 12$, $S_2 = 10$, so $s_2 = 3 \cdot (10/12) = 5/2$; $f(2,2) = 1$, $f(2,3) = 1$, $f(2,4) = 1/2$.
[Figure labels: $S_1 = 6$, $S_2 = 10$; buffers b1 (8x), b2 (8x), b3 (16x), b4 (16x)]
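The two formulas on this slide can be sketched in a few lines. The function names are hypothetical, and the load capacitances below are made-up values in a 2:2:1 ratio, chosen only so that the result matches the f values printed on the slide. $C_i$ is assumed to be the total load capacitance along the path, which the slide does not state.

```python
def vddl_switch_budget(num_switches, slack, full_vddl_delay_increase):
    """s_i = L_i * (S_i / D_i): timing slack expressed as a switch count."""
    return num_switches * slack / full_vddl_delay_increase


def vddl_possibility(budget, switch_caps):
    """f(i, j) = s_i * (c_ij / C_i) for every switch j on the path to sink i.

    Assumption: C_i is the sum of the per-switch load capacitances.
    """
    total_cap = sum(switch_caps)
    return [budget * c / total_cap for c in switch_caps]


# Numbers from the slide: L_2 = 3, S_2 = 10, D_2 = 12.
s2 = vddl_switch_budget(3, 10, 12)
print(s2)

# Hypothetical capacitances for switches b2, b3, b4 (ratio 2:2:1).
print(vddl_possibility(s2, [2.0, 2.0, 1.0]))
```

Under the total-capacitance assumption the possibilities along a path always sum to $s_i$, so the budget is fully distributed across the switches.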

Power Reduction Estimation for Mixed Wire Segments
- The lower bound estimation [Lin, DAC'05] for interconnect power reduction is no longer valid for mixed wire segments.
- Our solution: develop an upper bound estimation of the number of VddL switches.
  - Consistent upper bound of power reduction
  - Removes the non-linear "min" term and the corresponding extra LP constraints from the lower bound estimation
- Example: slack $S = 2.7$; b1 (16x) needs 1.8 slack and b2 (8x) needs 1.0 slack; $f_n(i,1) = 0.9$, $f_n(i,2) = 0.5$.
  - Summing all VddL possibilities gives a lower bound of $0.9 + 0.5 = 1.4$ VddL switches.
  - Actual assignment: b2 consumes 1.0, leaving 1.7 slack, but b1 needs 1.8. Only 1.0 VddL switch is assigned.
  - Problem: lower bound > actual number!
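The mismatch in this example can be checked numerically. The sketch below assumes switches are tried in the order b2 then b1, and that a switch is moved to VddL only if the remaining slack covers its full requirement; the function name is hypothetical.

```python
def greedy_vddl_count(slack, needs):
    """Assign VddL switch by switch until the next one no longer fits.

    needs lists the slack each switch requires, in assignment order.
    Returns (number of VddL switches, slack left over).
    """
    count = 0
    for need in needs:
        if need > slack:
            break
        slack -= need
        count += 1
    return count, slack


# Values from the slide.
possibilities = [0.9, 0.5]            # f_n(i,1), f_n(i,2)
estimate = sum(possibilities)         # the "lower bound" of 1.4 switches

actual, left = greedy_vddl_count(2.7, [1.0, 1.8])   # b2 first, then b1

print(f"estimated VddL switches: {estimate:.1f}")
print(f"actual VddL switches:    {actual} (slack left: {left:.1f})")
print("estimate exceeds actual:", estimate > actual)
```

Because the 16x and 8x buffers need different amounts of slack, summing fractional possibilities overstates what a whole-switch assignment can achieve, which is why the estimate cannot serve as a lower bound here.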

LP Formulation for Dual-Vdd Level Assignment
[Equation figure not recovered. Labeled components: objective function (dynamic power reduction upper bound, leakage power reduction upper bound); slack constraints (slack upper bound, slack non-negative); basic timing constraints (arrival time constraints, arrival time for primary inputs, arrival time for primary outputs)]

Outline
- Motivation
- Problem Formulations
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
- Simultaneous Vdd Level Assignment and Retiming
  - MILP formulation for retiming FPGA circuits
  - Extra constraints for post-layout FPGA retiming
  - Link between MILP retiming and timing budgeting
- Experimental Results
- Conclusions

Retiming for FPGA
- The retiming graph is a directed cyclic graph.
- Given the retiming graph $G = (V, E)$, a retiming is an integer-valued vertex labeling $r: V \to \mathbb{Z}$.
- A weight $w(u,v)$ associated with edge $e(u,v)$ denotes the number of FFs on that edge.
- After retiming (re-labeling of vertices): $w'(u,v) = w(u,v) + r(v) - r(u)$
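A minimal sketch of the re-labeling rule, using a hypothetical three-vertex cycle. The dictionary representation and the legality check (no edge may end up with a negative FF count) are illustrative choices rather than anything taken from the slides.

```python
def retime(edges, r):
    """Apply w'(u,v) = w(u,v) + r(v) - r(u) to every edge.

    edges maps (u, v) to the FF count on that edge; r maps each vertex to
    its integer retiming label. Raises ValueError if an edge would end up
    with a negative number of FFs.
    """
    retimed = {}
    for (u, v), w in edges.items():
        w_new = w + r[v] - r[u]
        if w_new < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} gets {w_new} FFs")
        retimed[(u, v)] = w_new
    return retimed


# Hypothetical cycle a -> b -> c -> a with one FF on edge (a, b).
edges = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 0}
r = {"a": 0, "b": -1, "c": 0}

retimed = retime(edges, r)
print(retimed)   # the FF moves from edge (a, b) to edge (b, c)
print(sum(edges.values()) == sum(retimed.values()))   # FFs on the cycle are preserved
```

Around any cycle the labels cancel, so retiming moves flip-flops along the cycle without changing their total number.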

Link between MILP retiming & timing budgeting
- Extend the MILP formulation in [Leiserson, Algorithmica'91] to link arrival time with retiming labeling.
- The real value $a(v)$ assigned to node $v$ is its arrival time after retiming.
- Linearize with $R(v) = r(v) + a(v)/c$.
- Timing slack on edge $(u,v)$ is represented by: [equation not recovered]
- Timing slack on the connection from sink $S_k$ to the source of routing tree $R_i$: [equation not recovered]

Retiming Constraints 1: Placement and Flip-Flop Binding Constraints
- Keep both placement and routing unchanged after retiming.
  - No FFs in global interconnect (inter-CLB)
  - No FFs in local interconnect (intra-CLB and inter-SLICE)
- Within a single SLICE, only FF_NODE → SUBBLK_OPIN edges allow FF insertion.
- Extra constraints in the MILP formulation: [equations not recovered]
[Figure labels: No way to assign this FF in any SLICE physically!; FF# can be further reduced!; The only timing edge that can insert FFs]
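One way to picture the binding rule is as a check on the retimed FF counts: any edge that carries flip-flops must be of the FF_NODE → SUBBLK_OPIN kind. The sketch below is an after-the-fact checker, not the MILP constraints from the talk (those equations are not in the transcript). The node names and the CLB_IPIN and LUT_NODE type labels are hypothetical.

```python
# The only (source type, sink type) pair allowed to hold FFs, per the slide.
FF_EDGE = ("FF_NODE", "SUBBLK_OPIN")


def binding_violations(retimed_weights, node_type):
    """Return edges that hold FFs after retiming but are not FF_NODE -> SUBBLK_OPIN."""
    bad = []
    for (u, v), ffs in retimed_weights.items():
        if ffs > 0 and (node_type[u], node_type[v]) != FF_EDGE:
            bad.append((u, v))
    return bad


# Hypothetical timing graph fragment.
node_type = {
    "ff0": "FF_NODE",
    "out0": "SUBBLK_OPIN",
    "in1": "CLB_IPIN",    # hypothetical type name
    "lut1": "LUT_NODE",   # hypothetical type name
}
weights = {("ff0", "out0"): 1, ("out0", "in1"): 0, ("in1", "lut1"): 1}

print(binding_violations(weights, node_type))   # the FF on (in1, lut1) cannot be bound
```

In an MILP setting the same rule would typically be imposed up front by forcing the retimed weight of every other edge to zero, so that placement and routing stay untouched.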

Retiming Constraints 2: LUT Delay and FF Setup & Hold Time Constraints
- Delay constraints for timing edges within a SLICE: [equations not recovered]
[Figure labels: LUT delay; FF setup time + LUT delay; FF hold time; FF# in edge (e)]

Outline
- Motivation
- Problem Formulations
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
- Simultaneous Vdd Level Assignment and Retiming
- Experimental Results
  - Dual-Vdd Assignment for FPGAs with Mixed Wire Segments
  - Simultaneous Vdd Level Assignment and Retiming
  - A Runtime-Efficient Post-Layout Re-Synthesis CAD Flow
- Conclusions

Experimental Setting
- Cluster-based island-style FPGA structure
  - Size-10 cluster and size-4 LUT
  - 100% buffered interconnects, subset switch block
  - 60% length-4 and 40% length-8 wire segments
  - 25x buffer for length-4 and 10x buffer for length-8
- ITRS 100nm technology, 1.3 V for VddH and 0.8 V for VddL
- VPR [Betz-Rose-Marquardt] is used for placement and routing
- fpgaEva-LP2 [Lin et al., FPGA'05] is used for power calculation
  - Considers short-circuit power, glitch power and input vectors
  - 8% average error compared to SPICE simulation
- The 10 biggest sequential MCNC benchmarks are tested
- MOSEK [student license] is used to solve the LP and MILP

Experimental Results: Dual-Vdd Assignment for FPGAs with Mixed Wire Segments
- The EdTLC-LP algorithm achieves 85% VddL assignment.
- The EdTLC-LP algorithm achieves 53% interconnect power reduction on average for mixed-length interconnect wires.

Experimental Results: Simultaneous vs. Sequential
- Simultaneous retiming and slack budgeting vs. the sequential approach (delay-optimal retiming + slack budgeting). [Circuits with VddL < 85% are selected]
- The simultaneous approach gains 5% on average and up to 20% further power reduction compared to the sequential one.

Runtime Efficient CAD Flow
- The simultaneous approach has 10x more runtime overhead compared to place & route.
- There is NO need to perform the simultaneous approach for every single design.
- SOLUTION: do the simultaneous procedure only when necessary.
- Indicators for simultaneous gain:
  - A high percentage of VddL assignment will not lead to gain from the simultaneous approach.
  - Little gain from min-clock retiming indicates little room for improvement by the simultaneous approach.
[Flow diagram labels: Synthesis/placement/routing; Simultaneous retiming and interconnect Vdd assignment; CLB-level Vdd assignment; Min-clock retiming; Interconnect Vdd assignment; Global refinement]
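The two indicators can be read as a simple gate in front of the expensive simultaneous step. The sketch below is only one possible reading: the 85% cutoff mirrors the circuit-selection criterion quoted in the results, while the minimum retiming gain, the function name and the parameter names are hypothetical and would need tuning.

```python
def should_run_simultaneous(vddl_fraction, retiming_gain,
                            vddl_threshold=0.85, min_retiming_gain=0.02):
    """Decide whether the costlier simultaneous flow looks worthwhile.

    vddl_fraction: share of switches already at VddL after the sequential flow.
    retiming_gain: relative clock period improvement from min-clock retiming.
    min_retiming_gain is a hypothetical threshold, not a value from the talk.
    """
    if vddl_fraction >= vddl_threshold:
        return False   # most switches are already VddL, little left to gain
    if retiming_gain < min_retiming_gain:
        return False   # retiming barely helps, so extra slack is unlikely
    return True


print(should_run_simultaneous(0.90, 0.10))   # high VddL share
print(should_run_simultaneous(0.70, 0.10))   # room for improvement
print(should_run_simultaneous(0.70, 0.00))   # retiming gives no gain
```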

Outline
- Motivation
- Problem Formulations
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
- Simultaneous Vdd Level Assignment and Retiming
- Experimental Results
- Conclusions

Conclusions
- A chip-level dual-Vdd assignment algorithm for mixed-length wire segments. Experimental results show that it reduces interconnect power by 53% on average compared to single-Vdd FPGA designs.
- An MILP-based simultaneous timing budgeting and retiming formulation, which further reduces interconnect power by up to 20% compared to min-clock retiming followed by Vdd assignment.
- A runtime-efficient post-layout re-synthesis CAD flow.
  - Do the simultaneous procedure only when necessary.

Thank you! Q/A

Extra Slides for Q/A

EdTLC-LP: Net-level Bottom-up Assignment
- Theorem: the bottom-up assignment is optimal.
- Perform bottom-up assignment within each tree to leverage the allocated slacks.
- Bottom-up assignment
  - Assign low Vdd to switches in the routing tree in a bottom-up fashion.
  - Slack is reduced by [amount not recovered] in each step.
  - Stop when no slack is left.
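The bottom-up procedure can be sketched on a small routing tree. This is an illustrative reading, not the EdTLC-LP implementation: it assumes each leaf switch drives exactly one sink, that moving a switch to VddL costs a fixed amount of slack at every sink below it, and that a switch may go to VddL only if all switches it drives already have (which keeps the tree free of level converters). All names and numbers are hypothetical.

```python
def bottom_up_assign(children, cost, sink_slack, root):
    """Assign VddL bottom-up in a routing tree.

    children:   switch -> list of switches it drives (leaves are absent or empty)
    cost:       switch -> slack consumed when that switch is set to VddL
    sink_slack: leaf switch -> slack allocated to the sink it drives
    Returns (set of VddL switches, remaining slack per sink).
    """
    slack = dict(sink_slack)
    low = set()

    def leaves(node):
        kids = children.get(node, [])
        if not kids:
            return [node]
        found = []
        for kid in kids:
            found.extend(leaves(kid))
        return found

    def visit(node):
        kids = children.get(node, [])
        for kid in kids:
            visit(kid)
        # Legal only if no VddL switch would drive a VddH switch.
        if all(kid in low for kid in kids):
            sinks = leaves(node)
            if all(slack[s] >= cost[node] for s in sinks):
                for s in sinks:
                    slack[s] -= cost[node]
                low.add(node)

    visit(root)
    return low, slack


# Hypothetical tree: b1 drives b2 and b3; b3 drives b4.
children = {"b1": ["b2", "b3"], "b3": ["b4"]}
cost = {"b1": 1.0, "b2": 1.0, "b3": 1.0, "b4": 1.0}
low, remaining = bottom_up_assign(children, cost, {"b2": 1.0, "b4": 3.0}, "b1")
print(sorted(low), remaining)
```

In this example b1 stays at VddH because the sink behind b2 has no slack left, which is legal since a VddH switch may drive VddL switches.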

Major Contributions
- Present a tight estimation of the power reduction upper bound for mixed-length interconnect in FPGAs.
- Develop a linear programming (LP) based slack budgeting and Vdd level assignment algorithm for mixed-length interconnect FPGAs.
  - The experimental results show 53% interconnect power reduction on average compared to single-Vdd interconnects.
- Propose a mixed integer linear programming (MILP) based simultaneous retiming and slack budgeting for power reduction while considering placement and flip-flop (FF) binding constraints.
  - The experimental results show up to 20% interconnect power reduction compared to the sequential approach (retiming followed by slack budgeting).

Vdd Programmable Interconnect Arch.
- Island style routing architecture.
- Mixed wire length (60% length-4 wires and 40% length-8 wires).
- Routing switch/connection block: two PMOS power transistors, M3 and M4, are inserted between the tri-state buffer and the VddH and VddL power rails, respectively.
- Level converter free (guarantees that no VddL switch drives a VddH switch).

Timing Slack vs. VddL Switch Number
- The timing slack $S_{ij}$ of the connection between the source and the $j$-th sink in routing tree $R_i$ is the amount of delay that could be added to this connection without increasing the cycle time $T_{spec}$.
- Timing slack $S_{ij}$ indicates the number of VddL switches.
- Useful slack: $S_{ij}$ is bounded by the number of switches in the connection between the source and the $j$-th sink in routing tree $R_i$. Extra slack will NOT lead to more VddL switches!
- Timing slack bounding constraint: $0 \le S_{ij} \le D_{ij}$, where $D_{ij}$ is the delay increase of the path from the source to the $j$-th sink when all switches on this path are set to VddL.
[Figure labels: switches b1, b2, b3, b4; s1 = 1, s2 = 1; s1 = 2, s2 = 4; useful slack = 3; 1 unit of slack is needed for VddL]
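The bounding constraint amounts to clamping the slack to the range [0, D_ij]. The short sketch below uses a hypothetical function name and assumes a path with three switches that each need 1 unit of slack, which is consistent with the "useful slack = 3" label in the figure.

```python
def useful_slack(slack, full_vddl_delay_increase):
    """Clamp slack to [0, D_ij]; slack beyond D_ij cannot buy more VddL switches."""
    return min(max(slack, 0.0), full_vddl_delay_increase)


# Assumed path: three switches needing 1 unit of slack each, so D = 3.
print(useful_slack(4.0, 3.0))    # one unit of the raw slack is wasted
print(useful_slack(2.0, 3.0))    # all of the slack is usable
```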

Retiming for LUT based FPGA
- The retiming graph is a directed cyclic graph.
- Given the retiming graph $G = (V, E)$, a retiming is an integer-valued vertex labeling $r: V \to \mathbb{Z}$.
- A weight $w(u,v)$ associated with edge $e(u,v)$ denotes the number of FFs on that edge.
- After retiming (re-labeling of vertices): $w'(u,v) = w(u,v) + r(v) - r(u)$
[Figure labels: r(G) = 1, r(D) = 1, r(F) = 1]

MILP Based Retiming Formulation
- Extended from the MILP formulation in [Leiserson, Algorithmica'91].
- Let $G = (V, E, d, w)$ be a synchronous circuit, and let $c$ be a positive real number. Then there exists a retiming $r$ of $G$ such that $\Phi(G_r) \le c$ if and only if there exists an assignment of a real value $a(v)$ and an integer value $r(v)$ to each vertex $v$ such that the following conditions are satisfied: [conditions not recovered]
- Let $R(v) = r(v) + a(v)/c$; then the conditions can be rewritten as: [equations not recovered]
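For orientation, the classical conditions from Leiserson and Saxe, on which this slide builds, are usually stated as below. This is a recollection of the original result written in the slide's notation, with $a(v)$ as the arrival time. The talk describes its formulation as an extension, so the conditions actually shown on the slide may differ, for example by including interconnect delays.

```latex
\begin{aligned}
a(v) &\ge d(v) && \forall v \in V \\
a(v) &\le c && \forall v \in V \\
r(u) - r(v) &\le w(e) && \forall e = (u,v) \in E \\
a(u) - a(v) &\le -d(v) && \forall e = (u,v) \in E \text{ with } r(u) - r(v) = w(e)
\end{aligned}
```

Substituting $R(v) = r(v) + a(v)/c$ removes the conditional in the last line and gives a mixed-integer linear system, with $r(v)$ integer and $R(v)$ real:

```latex
\begin{aligned}
r(v) - R(v) &\le -d(v)/c && \forall v \in V \\
R(v) - r(v) &\le 1 && \forall v \in V \\
r(u) - r(v) &\le w(e) && \forall e = (u,v) \in E \\
R(u) - R(v) &\le w(e) - d(v)/c && \forall e = (u,v) \in E
\end{aligned}
```

The third inequality is the non-negativity of the retimed weight $w'(u,v) = w(u,v) + r(v) - r(u)$.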

Runtime Efficient CAD Flow
- Runtime overhead of post-layout re-synthesis processes
- A high percentage of VddL assignment won't lead to gain from RTSB.
- Little gain from min-clock retiming indicates little room for improvement by RTSB.