An Efficient Chiplevel Time Slack Allocation Algorithm for Dual-Vdd FPGA Power Reduction Yan Lin 1, Yu Hu 1, Lei He 1 and Vijay Raghunathan 2 1 EE Department,

Presentation transcript:

An Efficient Chiplevel Time Slack Allocation Algorithm for Dual-Vdd FPGA Power Reduction Yan Lin 1, Yu Hu 1, Lei He 1 and Vijay Raghunathan 2 1 EE Department, UCLA 2 Purdue University Partially supported by NSF. Address comments to

Outline Background, Motivation and Problem Formulation Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments, Hu et al, DAC’06] Network Flow Based Vdd Level Assignment Formulation Experimental Results Conclusions

Background
Existing FPGAs are power inefficient compared to ASICs. Interconnect is the dominant component of FPGA power dissipation (dynamic and leakage) [Li, TCAD'05]. Power-aware FPGA architectures and CAD algorithms have been studied extensively:
- CAD algorithms to minimize power-delay product [Lamoureux, ICCAD'03]
- Configuration inversion for leakage reduction [Anderson, FPGA'04]
- Vdd-programmable FPGA logic blocks [Li, FPGA'04] [Li, DAC'04]
- Vdd-programmable FPGA interconnects [Li, ICCAD'04] [Gayasen, FPL'04] [Anderson, ICCAD'04] [Lin, DAC'05]

Vdd-Programmable Interconnect Architecture
- Island style with mixed wire segment lengths.
- Routing switch/connection block: two PMOS power transistors, M3 and M4, are inserted between the tri-state buffer and the VddH and VddL power rails, respectively [Li, ICCAD'04].
- Level-converter-free routing trees (guaranteeing that no VddL switch drives a VddH switch) with the least area and power penalty [Lin, TCAD'06].

Limitations of Existing Approaches
- Uniform wire segment length was assumed, and the approach cannot be extended directly to mixed wire segments.
- The LP-based formulation is time consuming and computationally unstable.
  - Computational instability: even a small circuit can need a long runtime.
  - Time consuming: runtime grows quickly for large circuits.

Problem Formulation [Dual-Vdd Level Assignment Problem]
- Given: placement and routing results of an FPGA design
- Find: a Vdd-level assignment for each interconnect switch
- Objective: minimize interconnect (dynamic and leakage) power
- Constraints:
  - Meet the delay target $T_{spec}$
  - Vdd-level converters are inserted ONLY at CLB inputs/outputs

Outline
- Background, Motivation and Problem Formulation
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments, Hu et al, DAC'06]
  - Interconnect Power Reduction Estimation
  - LP Based Vdd-level Assignment Algorithm
- Network Flow Based Vdd Level Assignment Formulation
- Experimental Results
- Conclusions

Delay and Power Model for Interconnect
Delay model:
- The intrinsic delay and effective driving resistance of each switch have been pre-characterized using SPICE.
- Elmore delay is used to calculate routing delay.
Interconnect power model:
- Dynamic power: $P_d(Vdd_j) = 0.5 \, f_{clk} \cdot C \cdot Vdd_j^2$
- Leakage power $P_l(Vdd_j)$ is pre-characterized using SPICE.
Interconnect power reduction estimation is the essential part of the dual-Vdd assignment algorithm.
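The two models on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function names, the RC-chain representation, and the clock frequency and capacitance values are assumptions made for the example. Only the formula $P_d = 0.5 f_{clk} C Vdd^2$ and the 1.3 V / 0.8 V supply levels come from the slides.

```python
def dynamic_power(f_clk, cap, vdd):
    """Dynamic power from the slide: P_d = 0.5 * f_clk * C * Vdd^2."""
    return 0.5 * f_clk * cap * vdd ** 2


def elmore_delay(stages):
    """Elmore delay of a simple RC chain.

    stages: list of (R, C) pairs ordered from source to sink. Each
    resistance is multiplied by all capacitance at or after its own stage.
    (Hypothetical representation; a routing tree would need a tree walk.)
    """
    downstream = sum(c for _, c in stages)
    total = 0.0
    for r, c in stages:
        total += r * downstream
        downstream -= c
    return total


# Supply levels from the experimental setting; f_clk and cap are made-up values.
VDD_H, VDD_L = 1.3, 0.8
f_clk, cap = 100e6, 1e-12

p_high = dynamic_power(f_clk, cap, VDD_H)
p_low = dynamic_power(f_clk, cap, VDD_L)
saving = 1.0 - p_low / p_high  # depends only on (VDD_L / VDD_H)^2, roughly 62%

chain_delay = elmore_delay([(1.0, 2.0), (3.0, 4.0)])  # 1*(2+4) + 3*4
```

Because the dynamic term is quadratic in Vdd, the per-switch saving is independent of the frequency and capacitance chosen above. Leakage is not modeled here, since the slides state that it is pre-characterized with SPICE.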

Review of Vdd Level Assignment Algorithm [Lin, DAC'05]
- The net-level bottom-up Vdd assignment guarantees that the final solution is legal.
- All extra slack is leveraged with VddL switches.
[Figure: routing tree with switches b1 to b4 and sink slacks S1 = 1, S2 = 3. Labels: timing slack assigned at sinks; VddL possibility for switches; interconnect power reduction estimation; Vdd assignment based on estimation.]
Remaining problem: how to calculate the VddL possibility for mixed wire segments?

VddL Possibility Calculation
Represent timing slack as a number of switches: $s_i = L_i \cdot (S_i / D_i)$
- $s_i$ is the number of switches that can be turned to VddL along the path from the source to the i-th sink of the routing tree, for the given timing slack $S_i$.
- $L_i$ is the number of switches along this path.
VddL possibility for switch j on the path to sink i, based on load capacitance: $f(i,j) = s_i \cdot (c_{ij} / C_i)$
- Key idea: distribute timing slack to each switch based on its capacitance.
Example (from the figure; buffers b1 8x, b2 8x, b3 16x, b4 16x; $S_1 = 6$, $S_2 = 10$): $L_2 = 3$, $D_2 = 12$, so $s_2 = 3 \cdot (10/12) = 5/2$, with $f(2,2) = 1$, $f(2,3) = 1$, $f(2,4) = 1/2$.
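The two formulas can be checked with exact arithmetic. The sketch below is illustrative only. It assumes that $D_i$ is the source-to-sink path delay and that $C_i$ is the sum of the per-switch capacitances on the path; neither is spelled out in the transcript. The relative capacitance values are hypothetical and were picked so that the output matches the figure's numbers, because the real capacitances are not given.

```python
from fractions import Fraction


def vddl_switch_budget(num_switches, slack, delay):
    """s_i = L_i * (S_i / D_i): timing slack expressed as a switch count."""
    return Fraction(num_switches) * Fraction(slack) / Fraction(delay)


def vddl_possibility(s_i, caps):
    """f(i, j) = s_i * (c_ij / C_i) for every switch j on the path to sink i.

    caps maps switch id -> capacitance c_ij. C_i is taken to be their sum,
    which is an assumption of this sketch.
    """
    total = sum(caps.values())
    return {j: s_i * Fraction(c) / total for j, c in caps.items()}


# Numbers from the slide's example: L_2 = 3, S_2 = 10, D_2 = 12.
s2 = vddl_switch_budget(3, 10, 12)

# Hypothetical relative capacitances for switches 2, 3 and 4.
f2 = vddl_possibility(s2, {2: 2, 3: 2, 4: 1})
```

By construction the possibilities along a path always sum to $s_i$, so the slack budget is fully distributed regardless of the capacitance values used.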

Power Reduction Estimation for Mixed Wire Segments
The lower bound estimation [Y. Lin, DAC'05] for interconnect power reduction is no longer valid for mixed wire segments.
Our solution: develop an upper bound estimation of the number of VddL switches.
- Consistent upper bound of power reduction
- Removes the non-linear term "min" and the corresponding extra LP constraints from the lower bound estimation
Example (from the figure): the path has slack S = 2.7. Switch b1 (16x) needs 1.8 slack and switch b2 (8x) needs 1.0 slack, with $f_n(i,1) = 0.9$ and $f_n(i,2) = 0.5$. Summing all VddL possibilities gives a "lower bound" of 0.9 + 0.5 = 1.4 VddL switches. In reality, assigning VddL to b2 consumes 1.0 and leaves 1.7 slack, but b1 needs 1.8, so only 1.0 VddL switch can be assigned.
Problem here: lower bound > actual number!

LP Formulation for Dual-Vdd Level Assignment
[Equations not preserved in the transcript; the labelled parts of the formulation are:]
- Objective function: dynamic power reduction upper bound plus leakage power reduction upper bound
- Slack constraints: slack upper bound; slack non-negative
- Basic timing constraints: arrival time for primary inputs; arrival time for primary outputs; arrival time constraints
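Since the LP equations themselves are missing from the transcript, the sketch below only illustrates the kind of quantity the timing constraints refer to: arrival times propagated from primary inputs to primary outputs, and the slack left at each output for a target $T_{spec}$. The graph, the delays, and the function names are all hypothetical, and this is ordinary longest-path timing analysis rather than the authors' LP.

```python
from collections import defaultdict


def arrival_times(edges):
    """Longest-path arrival time of every node in a timing DAG.

    edges: list of (u, v, delay). Nodes without predecessors are treated
    as primary inputs with arrival time 0.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))

    ready = [n for n in nodes if indeg[n] == 0]
    arrival = {n: 0.0 for n in ready}
    while ready:  # process nodes in topological order
        u = ready.pop()
        for v, d in succ[u]:
            arrival[v] = max(arrival.get(v, 0.0), arrival[u] + d)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return arrival


def sink_slacks(edges, t_spec):
    """Slack S_i = T_spec - arrival(i) at every node without successors."""
    arrival = arrival_times(edges)
    sources = {u for u, _, _ in edges}
    return {n: t_spec - a for n, a in arrival.items() if n not in sources}


# Hypothetical timing graph and delay target.
edges = [("in", "a", 2.0), ("in", "b", 1.0),
         ("a", "out1", 3.0), ("b", "out1", 1.0), ("b", "out2", 2.0)]
slacks = sink_slacks(edges, t_spec=6.0)
```

A non-negative slack at every primary output corresponds to meeting the delay target; the budgeting problem then decides how that slack is spent on VddL switches.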

Outline
- Motivation
- Problem Formulations
- Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments, Hu et al, DAC'06]
- Network Flow Based Vdd Level Assignment Formulation
  - Overview of network flow based timing slack budgeting
  - Primal-dual reformulation
- Experimental Results
- Conclusions

Network Flow Based Timing Slack Budgeting
Motivated by [Ghiasi, ICCAD'04] for logic-level optimization.
- Step 1: Reorganize the objective function. [equation not preserved]
- Step 2: Eliminate the timing slack variables (by substitution). [equation not preserved]

Network Flow Based Timing Slack Budgeting (cont.)
- Step 3: Reorganize the objective function by timing nodes (constant terms are removed; constant coefficients are kept). [equation not preserved]
- Step 4: Generate the dual problem. [equation not preserved]
[Annotation labels on both steps: node by node; edge by edge.]

Link Induced Network from Timing Graph
[Figure: induced network with arc weights (values not preserved). Legend: flow in forward arcs (solid segments); flow in backward arcs (dotted segments); demand at node i.]
- No negative-weight cycle exists in the induced network, so a min-cost flow can always be found.
- A shortest-path-based algorithm is used to produce the solution of the primal problem.
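Why the absence of negative-weight cycles matters can be seen with a generic Bellman-Ford sketch. This is not the authors' solver and the small network below is hypothetical. It only shows that shortest-path distances are well defined exactly when no negative cycle is reachable, and that the resulting distances satisfy $dist[v] - dist[u] \le w(u,v)$ on every arc, which is the form of constraint that timing-style node potentials must obey.

```python
def bellman_ford(num_nodes, arcs, source):
    """Shortest distances from `source`, or None if a negative cycle is reachable.

    arcs: list of (u, v, weight) with nodes numbered 0..num_nodes-1.
    """
    inf = float("inf")
    dist = [inf] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in arcs:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    # One more pass: any further improvement implies a negative cycle.
    for u, v, w in arcs:
        if dist[u] + w < dist[v]:
            return None
    return dist


# Hypothetical network with one negative arc but no negative cycle.
arcs = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, -2.0), (1, 3, 1.0)]
potentials = bellman_ford(4, arcs, source=0)

# Adding arc 1 -> 2 with weight 1 closes a cycle of total weight -1.
with_cycle = bellman_ford(4, arcs + [(1, 2, 1.0)], source=0)
```

In the first case the distances can serve as node potentials; in the second no finite solution exists, which is the situation the slide rules out for the induced network.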

Outline Motivation Problem Formulations Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments, Hu et al, DAC’06] Network Flow Based Formulation Experimental Results Conclusions

Experimental Setting
- Cluster-based island-style FPGA structure
  - Size-10 cluster and size-4 LUT
  - 100% buffered interconnects, subset switch block
  - 60% length-4 and 40% length-8 wire segments
  - 25x buffer for length-4 and 10x buffer for length-8
- ITRS 100nm technology, 1.3 V for VddH and 0.8 V for VddL
- VPR [Betz-Rose-Marquardt] is used for placement and routing
- fpgaEva-LP2 [Lin et al, FPGA'05] is used for power calculation
  - Considers short-circuit power, glitch power and input vectors
  - 8% average error compared to SPICE simulation
- The 20 biggest sequential MCNC benchmarks are tested
- LPsolver is used to solve the LP

Dual-Vdd Assignment for FPGAs with Mixed Wire Segments
Both the LP-based and the netflow-based algorithms achieve 85% VddL assignment on average.

Interconnect Power Reduction 52% total interconnect power reduction is achieved!

Runtime Comparison
- The netflow-based algorithm achieves a consistent speedup and a stable runtime.
- A more significant speedup is expected for larger circuits.

Outline Motivation Problem Formulations Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments] Network Flow Based Formulation Experimental Results Conclusions

Conclusions
- A min-cost network flow based timing budgeting formulation speeds up the budgeting procedure and the overall design flow by up to 6000x and 20x, respectively, compared to the LP-based one.
- Both chip-level dual-Vdd assignment algorithms handle mixed-length wire segments.
- Experimental results show an interconnect power reduction of 53% on average compared to single-Vdd FPGA designs.

Thank you! Q/A