Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Yan Lin and Lei He EE Department, UCLA Partially supported.


Presentation transcript:

Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction
Yan Lin and Lei He
EE Department, UCLA
Partially supported by NSF. Address comments to

Outline
- Review and Motivation
- Chip-level Vdd-level Assignment Algorithms
- Experimental Results
- Conclusions

FPGA Power Reduction
- Existing FPGAs are power inefficient compared to ASICs [Kusse, ISLPED'98]
- Power aware FPGA CAD algorithms for existing FPGA architectures
  - CAD algorithms to minimize power-delay product [Lamoureux et al, ICCAD'03]
  - Configuration inversion for leakage reduction [Anderson et al, FPGA'04]
- Power efficient FPGA circuits and architectures
  - Dual-Vdd and Vdd-programmable FPGA logic blocks [Li et al, FPGA'04] [Li et al, DAC'04]
  - Vdd-programmable FPGA interconnects [Li et al, ICCAD'04] [Gayasen et al, FPL'04] [Anderson et al, ICCAD'04]

Vdd-programmable Interconnects [Li et al, ICCAD'04]
- Vdd-programmable switch
  - Vdd selection for used switch
  - Power-gating unused switch
    - Reduce leakage by 300X
  - Configurable Vdd-level conversion
    - Avoid excessive leakage when low-Vdd switch drives high-Vdd switches
- Segment based Vdd-level converter insertion (SLC)
  - Area overhead
    - 35% area overhead for MCNC benchmark circuits
  - Leakage overhead
    - 29% leakage overhead for MCNC benchmark circuits
[Figure: conventional routing switch and Vdd-programmable switch with power transistor]

Previous Approaches w/o LCs
- [Gayasen et al, FPL'04]
  - Level converters inserted at CLB inputs (outputs)
  - All the routing trees driven by (driving) the source (sink) CLB have the same Vdd-level as the source (sink) CLB
    - Lacking in flexibility
  - A path-based Vdd-level assignment is performed for CLBs and interconnects
- [Anderson et al, ICCAD'04]
  - V_T drop of NMOS is used to generate low-Vdd
  - Positive feedback PMOS is used to tolerate low-Vdd switch driving high-Vdd switches
    - Alternative design of level converter
    - Still has delay and power penalty

Our Major Contributions
- Proposed two ways to avoid using level converters in interconnects
  - Tree based level converter insertion (TLC)
    - All the switches in one routing tree have the same Vdd-level
  - Dual-Vdd tree based level converter insertion (dTLC)
    - Only a high-Vdd switch may drive low-Vdd switches in one tree
- Proposed three Vdd-level assignment algorithms
  - Sensitivity based algorithms TLC-S and dTLC-S for TLC and dTLC, respectively
  - Linear programming (LP) based algorithm dTLC-LP for dTLC

Problem Formulations
- Assign a Vdd-level to each interconnect switch to minimize interconnect power
  - Meet the delay target T_spec
  - Vdd-level converters are removed within interconnects; they are inserted at CLB inputs/outputs and can be used when needed
- Tree based LC insertion (TLC)
  - Allows one type of Vdd-level within one routing tree
- Dual-Vdd tree based LC insertion (dTLC)
  - Allows a high-Vdd switch to drive low-Vdd switches, but not vice versa
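The two legality rules can be written down as small checks. This is a minimal sketch, assuming a routing tree is given as a hypothetical `children` map from each switch to the switches it drives and a `level` map from switch to "high" or "low"; neither data structure comes from the slides.

```python
def is_valid_tlc(level):
    """TLC: every switch in one routing tree uses the same Vdd-level."""
    return len(set(level.values())) <= 1


def is_valid_dtlc(children, level):
    """dTLC: a low-Vdd switch must never drive a high-Vdd switch."""
    return all(
        not (level[parent] == "low" and level[child] == "high")
        for parent, kids in children.items()
        for child in kids
    )


# Hypothetical routing tree: b1 drives b2 and b3, b3 drives b4.
children = {"b1": ["b2", "b3"], "b3": ["b4"]}
mixed = {"b1": "high", "b2": "low", "b3": "high", "b4": "low"}
print(is_valid_tlc(mixed), is_valid_dtlc(children, mixed))
```

Any assignment that is legal under TLC is also legal under dTLC, which is why dTLC can only expose more low-Vdd opportunities.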


Delay & Power Model with Dual-Vdd
- To incorporate dual-Vdd into timing analysis
  - Pre-characterize the intrinsic delay and effective driving resistance of each switch using SPICE
  - Calculate routing delay using the Elmore delay model
- Interconnect power
  - Dynamic power
  - Leakage power is pre-characterized using SPICE
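As an illustration of the timing model, the sketch below computes Elmore delays on a small RC tree driven by one switch, with the switch's intrinsic delay and driving resistance looked up by Vdd-level. The `SWITCH_MODEL` numbers, the `RCNode` type and the function names are hypothetical stand-ins for the SPICE-characterized data, not values from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class RCNode:
    name: str
    r: float  # resistance of the edge from the parent into this node
    c: float  # capacitance lumped at this node
    children: list = field(default_factory=list)


def downstream_cap(node):
    """Total capacitance at and below this node."""
    return node.c + sum(downstream_cap(ch) for ch in node.children)


def elmore_delays(node, upstream=0.0, out=None):
    """Elmore delay: each edge resistance times the capacitance it charges."""
    if out is None:
        out = {}
    delay = upstream + node.r * downstream_cap(node)
    out[node.name] = delay
    for ch in node.children:
        elmore_delays(ch, delay, out)
    return out


# Hypothetical (intrinsic delay, driving resistance) per Vdd-level.
SWITCH_MODEL = {"high": (1.0, 2.0), "low": (1.5, 3.0)}


def switch_delays(vdd, wire_children, c_out=1.0):
    """Delay from switch input to every node of the wire tree it drives."""
    intrinsic, r_drv = SWITCH_MODEL[vdd]
    root = RCNode("out", r_drv, c_out, wire_children)
    return {n: intrinsic + d for n, d in elmore_delays(root).items()}


wires = [RCNode("a", 1.0, 2.0), RCNode("b", 3.0, 1.0)]
print(switch_delays("high", wires))
print(switch_delays("low", wires))
```

Switching the driver to low-Vdd raises both its intrinsic delay and its driving resistance, so every downstream node gets slower by an amount that depends on the load.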

Chip-level Assignment Algorithms
- Tree based level converter insertion (TLC)
  - Sensitivity based algorithm TLC-S
- Dual-Vdd tree based level converter insertion (dTLC)
  - Sensitivity based algorithm dTLC-S
  - Linear programming (LP) based algorithm dTLC-LP

Sensitivity Based Algorithm TLC-S
- Iterative assignment
  - Assign low-Vdd to the 'untried' tree with maximum power sensitivity in each iteration
  - Reject the assignment if the critical path delay increases
  - Iteration terminates after all trees are 'tried'
- Power sensitivity
  - The power reduction obtained by changing Vdd from high-Vdd to low-Vdd
  - Power includes both dynamic and leakage power
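The TLC-S loop can be sketched on a toy timing model. The model here is an assumption made only for illustration: each timing path is a list of routing trees, each tree has one delay per Vdd-level, and the critical path is the largest path sum. The names `critical_path` and `tlc_s` are hypothetical.

```python
def critical_path(paths, delay, level):
    """Largest sum of tree delays over all timing paths (toy model)."""
    return max(sum(delay[t][level[t]] for t in path) for path in paths)


def tlc_s(trees, paths, delay, sensitivity):
    """Greedy per-tree assignment in the spirit of TLC-S."""
    level = {t: "high" for t in trees}
    target = critical_path(paths, delay, level)
    # Sensitivities are fixed per tree in this sketch, so one sorted pass
    # visits the untried tree with maximum sensitivity at every step.
    for t in sorted(trees, key=lambda t: sensitivity[t], reverse=True):
        level[t] = "low"
        if critical_path(paths, delay, level) > target:
            level[t] = "high"  # reject: the critical path got longer
    return level


trees = ["A", "B", "C", "D"]
paths = [["A", "B"], ["C", "D"]]
delay = {
    "A": {"high": 4, "low": 5},
    "B": {"high": 4, "low": 5},
    "C": {"high": 3, "low": 6},
    "D": {"high": 2, "low": 3},
}
sensitivity = {"A": 5, "B": 3, "C": 4, "D": 1}
print(tlc_s(trees, paths, delay, sensitivity))
```

In this toy input only tree C ends up at low-Vdd: A and B sit on the critical path, and D would push the second path past the original critical delay once C has consumed the slack.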

Sensitivity Based Algorithm dTLC-S
- A "candidate switch" is defined as
  - A switch that does not drive any other switch, or
  - A switch whose fanout switches have all been assigned low-Vdd
- Iterative assignment
  - Assign low-Vdd to the candidate switch with maximum power sensitivity in each iteration
  - Reject the assignment if the critical path delay increases
  - Iteration terminates when there is no candidate switch
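A minimal sketch of the candidate-switch loop on a single routing tree follows. It assumes a hypothetical `children` map, one delay value per Vdd-level shared by all switches, and a fixed delay `budget` standing in for the critical-path check; none of these come from the slides.

```python
def leaf_delays(children, root, delay, level):
    """Source-to-leaf delay as the sum of switch delays along each path."""
    out, stack = {}, [(root, 0.0)]
    while stack:
        node, acc = stack.pop()
        acc += delay[level[node]]
        kids = children.get(node, [])
        if not kids:
            out[node] = acc
        stack.extend((k, acc) for k in kids)
    return out


def dtlc_s(children, root, delay, sensitivity, budget):
    """Greedy per-switch assignment in the spirit of dTLC-S."""
    nodes = [root]
    for n in nodes:  # breadth-first collection of all switches
        nodes.extend(children.get(n, []))
    level = {n: "high" for n in nodes}
    tried = set()
    while True:
        # Candidate: untried switch whose fanout switches are all low-Vdd
        # (trivially true for a switch that drives no other switch).
        cands = [n for n in nodes if n not in tried
                 and all(level[k] == "low" for k in children.get(n, []))]
        if not cands:
            return level
        best = max(cands, key=sensitivity.get)
        tried.add(best)
        level[best] = "low"
        if max(leaf_delays(children, root, delay, level).values()) > budget:
            level[best] = "high"  # reject: timing would be violated


children = {"b1": ["b2", "b3"], "b3": ["b4"]}
result = dtlc_s(children, "b1", {"high": 1.0, "low": 2.0},
                {"b1": 4, "b2": 3, "b3": 2, "b4": 1}, budget=5.0)
print(result)
```

Because a switch only becomes a candidate after all of its fanout switches are low-Vdd, a rejected switch blocks its ancestors, and the result never has a low-Vdd switch driving a high-Vdd one.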

LP Based Algorithm dTLC-LP: Overview
Single-Vdd placed and routed netlist
-> Chip-level Time Slack Allocation
-> Net-level Bottom-up Assignment
-> Refinement
-> Dual-Vdd netlist

dTLC-LP: Single-Net Estimation
- Slack is represented in multiples of the delay increase of an interconnect segment caused by changing Vdd from high-Vdd to low-Vdd
[Figure: an example routing tree with switches b1 to b4 and sinks sink1 and sink2, shown for the slack allocations (s1=1, s2=1), (s1=2, s2=1) and (s1=2, s2=3)]

dTLC-LP: Single-Net Estimation (Cont.)
- s_ik: slack for the k-th sink in the i-th routing tree
- l_ik: number of switches in the path from the source to the k-th sink in the i-th tree
- SL_ij: set of sinks in the fanout cone of the j-th switch in the i-th tree
- Given the allocated slacks, estimate the number of low-Vdd switches
- Theorem: The estimation gives a lower bound on the number of low-Vdd switches that can be achieved
[Figure: an example routing tree from the source to three sinks, with the paths annotated s_1/l_1, s_2/l_2 and s_3/l_3 and a shared switch annotated Min(s_k/l_k)]
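One reading of the s_k/l_k and Min(s_k/l_k) annotations is that each switch contributes the minimum of s_k/l_k over the sinks in its fanout cone, and the per-tree estimate is the sum over switches. The sketch below implements that reading; the summation, the assumption that every leaf switch feeds exactly one sink, and the assumption that slack never exceeds the path length are mine, not statements from the slides.

```python
from fractions import Fraction


def estimate_low_vdd(children, root, sink_slack):
    """Estimate the low-Vdd switch count of one routing tree.

    sink_slack maps each leaf switch (one sink per leaf, an assumption)
    to its slack, in units of the per-switch delay increase.
    """
    path_len = {}      # l_k: switches on the path from source to sink k
    fanout_sinks = {}  # SL_j: sinks in the fanout cone of switch j

    def visit(node, depth):
        kids = children.get(node, [])
        if not kids:
            path_len[node] = depth
            fanout_sinks[node] = [node]
        else:
            fanout_sinks[node] = [s for k in kids for s in visit(k, depth + 1)]
        return fanout_sinks[node]

    visit(root, 1)
    return sum(
        min(Fraction(sink_slack[k], path_len[k]) for k in sinks)
        for sinks in fanout_sinks.values()
    )


# Hypothetical tree: b1 drives b2 and b3, b3 drives b4; sinks sit after b2 and b4.
children = {"b1": ["b2", "b3"], "b3": ["b4"]}
print(estimate_low_vdd(children, "b1", {"b2": 1, "b4": 1}))  # 3/2
print(estimate_low_vdd(children, "b1", {"b2": 2, "b4": 3}))  # 4
```

With one unit of slack at each sink the estimate is 3/2, while an actual assignment can set both leaf switches to low-Vdd, which is consistent with the estimate being a lower bound.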

dTLC-LP: Full-chip Time Slack Allocation
- Objective function
  - f_s(i): transition density of the i-th tree
  - F_n(i): estimated number of low-Vdd switches in the i-th tree
  - Directly minimizes dynamic power
  - May help minimize leakage power, which depends exponentially on Vdd-level
- Constraints
  - Net-based timing constraints
    - For edges other than routing
    - For edges corresponding to routing
    - For PIs and POs
  - Constraints due to transforming the min function to a linear function
  - Upper bound for useful slack
- Theorem: The time slack allocation problem is an LP problem
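The equations themselves are not in the transcript. The block below is one plausible way to write an LP with the listed ingredients, offered as a sketch under assumptions rather than the authors' exact formulation. The auxiliary variable f_n(i,j), the arrival time a(.), the edge delay d(u,v) and the per-segment delay increase \Delta d are hypothetical names; the min is linearized with the standard trick of bounding an auxiliary variable by every term of the min while maximizing it.

```latex
\begin{aligned}
\text{maximize}\quad
  & \sum_i f_s(i)\, F_n(i), \qquad F_n(i) = \sum_j f_n(i,j) \\
\text{subject to}\quad
  & f_n(i,j) \le \frac{s_{ik}}{l_{ik}}
      && \forall k \in SL_{ij} \quad \text{(linearized min)} \\
  & 0 \le s_{ik} \le l_{ik}
      && \text{(upper bound for useful slack)} \\
  & a(u) + d(u,v) \le a(v)
      && \text{for edges other than routing} \\
  & a(u) + d(u,v) + s_{ik}\,\Delta d \le a(v)
      && \text{for the routing edge to sink } k \text{ of tree } i \\
  & a(\mathrm{PI}) = 0, \quad a(\mathrm{PO}) \le T_{spec}
      && \text{for PIs and POs}
\end{aligned}
```

Every constraint and the objective are linear in the variables s_ik, f_n(i,j) and a(.), which is what makes the allocation solvable as an LP.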


dTLC-LP: Net-level Bottom-up Assignment
- Perform bottom-up assignment within each tree to leverage the allocated slacks
- Bottom-up assignment
  - Assign low-Vdd to switches in the routing tree in a bottom-up fashion
  - Slack is reduced by the delay increase of one segment in each step
  - Stop when no slack is left
- Theorem: The bottom-up assignment is optimal
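A minimal sketch of a bottom-up pass over one routing tree is shown below. It assumes integer slacks in units of the per-switch delay increase, one sink per leaf switch, and a hypothetical `children` map; a switch goes to low-Vdd only if all of its fanout switches are already low-Vdd and every sink in its fanout cone still has a unit of slack to spend.

```python
def bottom_up_assign(children, root, sink_slack):
    """Assign Vdd-levels from the leaves toward the source."""
    slack = dict(sink_slack)  # remaining slack per sink
    level = {}

    def visit(node):
        kids = children.get(node, [])
        # Children are fully processed before this switch is decided.
        sinks = [node] if not kids else [s for k in kids for s in visit(k)]
        kids_low = all(level[k] == "low" for k in kids)
        if kids_low and all(slack[s] >= 1 for s in sinks):
            level[node] = "low"
            for s in sinks:
                slack[s] -= 1  # one unit of delay added on every path through node
        else:
            level[node] = "high"
        return sinks

    visit(root)
    return level


# Hypothetical tree: b1 drives b2 and b3, b3 drives b4; sinks sit after b2 and b4.
children = {"b1": ["b2", "b3"], "b3": ["b4"]}
print(bottom_up_assign(children, "b1", {"b2": 1, "b4": 3}))
```

For the slacks above, b2, b4 and b3 become low-Vdd and b1 stays high-Vdd because the sink behind b2 has no slack left, so the result respects the dTLC rule that only high-Vdd switches drive low-Vdd ones.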


Outline
- Review and Motivation
- Modeling and Problem Formulations
- Chip-level Vdd-level Assignment Algorithms
- Experimental Results
- Conclusions

Experimental Setting
- Cluster-based island style FPGA structure
  - 100% buffered interconnects, subset switch block
  - Uniform length 4 for all wire segments
- ITRS 100nm technology
- Use VPR [Betz-Rose-Marquardt] for placement and routing
- Use fpgaEva-LP2 [Lin et al, FPGA'05] for power calculation
  - Considering short-circuit power, glitch power and input vector
  - 8% average error compared to SPICE simulation

Interconnect Power Comparison between TLC-S, dTLC-S and dTLC-LP
- dTLC-S and dTLC-LP achieve 6.7% and 6.9% less interconnect power compared to TLC-S, respectively
- Interconnect power breakdown
  - TLC-S, dTLC-S and dTLC-LP have almost the same leakage
  - dTLC-S and dTLC-LP achieve 13.8% and 15.8% less interconnect dynamic power compared to TLC-S, respectively
[Figure: interconnect power (watt) of TLC-S, dTLC-S and dTLC-LP, broken down into leakage power and dynamic power]

dTLC-LP compared to SLC and h2lLCi
- SLC [Li et al, ICCAD'04]
  - Segment based level converter inserted in interconnects
  - Sensitivity based assignment algorithm
- h2lLCi [Gayasen et al, FPL'04]
  - All the routing trees driven by the source CLB have the same Vdd-level as the source CLB
  - Path based assignment algorithm
- dTLC-LP, SLC and h2lLCi achieve 77.54%, 74.70% and 41.80% low-Vdd switches w/o relaxing T_spec
- At different delays, dTLC-LP achieves
  - The highest number of low-Vdd switches
  - The lowest power consumption
[Figure: % of VddL switches versus critical path delay (ns) for dTLC-LP, SLC and h2lLCi]
[Figure: interconnect power (watt) versus critical path delay (ns) for h2lLCi, dTLC-LP and SLC, with annotations 64% and 19%]

Runtime Comparison between TLC-S, dTLC-S and dTLC-LP
- TLC-S runs the fastest
- dTLC-S versus dTLC-LP
  - dTLC-S runs 3X faster than dTLC-LP
  - But achieves similar power consumption
[Figure: runtime (s) of TLC-S, dTLC-S and dTLC-LP on the MCNC benchmarks alu4, apex2, apex4, elliptic, ex1010, frisc, pdc, s38417 and s38584]

Conclusions and Future Work
- Proposed two ways to avoid using level converters in Vdd-programmable interconnects
  - Tree based level converter insertion (TLC)
  - Dual-Vdd tree based level converter insertion (dTLC)
- Developed chip-level dual-Vdd assignment algorithms w/o level converters
  - Sensitivity based algorithms TLC-S and dTLC-S
  - LP based algorithm dTLC-LP
- Developed dTLC-LP, which reduces interconnect power by 64%
- Developed dTLC-S, which obtains a slightly smaller power reduction with a 3X speedup compared to dTLC-LP
- Future work
  - Extend chip-level Vdd-level assignment to interconnects using wire segments of different lengths
  - Allocate time slack to logic blocks and interconnects in a uniform fashion

Thank you!