Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Yan Lin and Lei He EE Department, UCLA Partially supported by NSF. Address comments to lhe@ee.ucla.edu

Outline
  - Review and Motivation
  - Chip-level Vdd-level Assignment Algorithms
  - Experimental Results
  - Conclusions

FPGA Power Reduction
Existing FPGAs are power inefficient compared to ASICs [Kusse, ISLPED’98]
Power-aware FPGA CAD algorithms for existing FPGA architectures
  - CAD algorithms to minimize power-delay product [Lamoureux et al, ICCAD’03]
  - Configuration inversion for leakage reduction [Anderson et al, FPGA’04]
Power-efficient FPGA circuits and architectures
  - Dual-Vdd and Vdd-programmable FPGA logic blocks [Li et al, FPGA’04] [Li et al, DAC’04]
  - Vdd-programmable FPGA interconnects [Li et al, ICCAD’04] [Gayasen et al, FPL’04] [Anderson et al, ICCAD’04]

Vdd-programmable Interconnects [Li et al, ICCAD’04]
Vdd-programmable switch
  - Vdd selection for used switches
  - Power gating for unused switches, reducing leakage by 300X
Configurable Vdd-level conversion
  - Avoids excessive leakage when a low-Vdd switch drives high-Vdd switches
Segment based Vdd-level converter insertion (SLC)
  - Area overhead: 35% for MCNC benchmark circuits
  - Leakage overhead: 29% for MCNC benchmark circuits
[Figure: Vdd-programmable switch, showing the power transistor and the conventional routing switch]

Previous Approaches w/o LCs
[Gayasen et al, FPL’04]
  - Level converters inserted at CLB inputs (outputs)
  - All the routing trees driven by (driving) the source (sink) CLB have the same Vdd-level as the source (sink) CLB
  - A path-based Vdd-level assignment is performed for CLBs and interconnects
  - Lacking in flexibility
[Anderson et al, ICCAD’04]
  - V_T drop of NMOS is used to generate low-Vdd
  - Positive feedback PMOS is used to tolerate a low-Vdd switch driving high-Vdd switches
  - Alternative design of level converter; still has delay and power penalty

Our Major Contributions
Proposed a few Vdd-level assignment algorithms
  - Sensitivity based algorithms TLC-S and dTLC-S for TLC and dTLC, respectively
  - Linear programming (LP) based algorithm dTLC-LP for dTLC
Proposed two ways to avoid using level converters in interconnects
  - Tree based level converter insertion (TLC): all the switches in one routing tree have the same Vdd-level
  - Dual-Vdd tree based level converter insertion (dTLC): only a high-Vdd switch drives low-Vdd switches in one tree

Problem Formulations
Assign a Vdd-level to each interconnect switch to minimize interconnect power
  - Meet the delay target T_spec
  - Vdd-level converters are removed within interconnects; they are inserted at CLB inputs/outputs and can be used when needed
Tree based LC insertion (TLC)
  - Allows one type of Vdd-level within one routing tree
Dual-Vdd tree based LC insertion (dTLC)
  - Allows a high-Vdd switch to drive low-Vdd switches, but not vice versa
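
The two formulations differ only in which Vdd patterns they allow inside one routing tree. A minimal Python sketch of both checks follows. It assumes a routing tree is given as a hypothetical `parent` map from each switch to the switch that drives it (`None` for the switch at the source) and a `vdd` map from each switch to "H" or "L". These names and the data layout are illustrative and do not come from the slides.

```python
def is_valid_tlc(vdd):
    """TLC: every switch in the routing tree uses the same Vdd-level."""
    return len(set(vdd.values())) <= 1


def is_valid_dtlc(parent, vdd):
    """dTLC: a low-Vdd switch must never drive a high-Vdd switch."""
    for switch, driver in parent.items():
        if driver is not None and vdd[driver] == "L" and vdd[switch] == "H":
            return False
    return True


# Hypothetical tree: b1 drives b2 and b3, and b3 drives b4.
parent = {"b1": None, "b2": "b1", "b3": "b1", "b4": "b3"}
mixed = {"b1": "H", "b2": "L", "b3": "H", "b4": "L"}  # high drives low only
bad = {"b1": "L", "b2": "L", "b3": "H", "b4": "L"}    # low b1 drives high b3

print(is_valid_tlc(mixed), is_valid_dtlc(parent, mixed))  # False True
print(is_valid_dtlc(parent, bad))                         # False
```

Any assignment that is valid for TLC is also valid for dTLC, which is why dTLC can only expose more low-Vdd opportunities.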

Outline
  - Review and Motivation
  - Chip-level Vdd-level Assignment Algorithms
  - Experimental Results
  - Conclusions

Delay & Power Model with Dual-Vdd
To incorporate dual-Vdd into timing analysis
  - Pre-characterize the intrinsic delay and effective driving resistance of each switch using SPICE
  - Calculate routing delay using the Elmore delay model
Interconnect power
  - Dynamic power
  - Leakage power is pre-characterized using SPICE
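
The delay model above can be sketched in Python for a chain of buffered segments. Each stage uses the standard Elmore expression for a driver with resistance R_d driving a wire (R_w, C_w) and a load C_L: intrinsic delay + R_d (C_w + C_L) + R_w (C_w / 2 + C_L). The table `MODELS`, its numeric values, and the function names are placeholders invented for illustration; the slides obtain the real per-Vdd values from SPICE.

```python
from dataclasses import dataclass


@dataclass
class SwitchModel:
    intrinsic_delay: float   # seconds, pre-characterized per Vdd-level
    drive_resistance: float  # ohms, effective driving resistance


# Placeholder values only; real numbers would come from SPICE.
MODELS = {
    "H": SwitchModel(intrinsic_delay=50e-12, drive_resistance=1000.0),
    "L": SwitchModel(intrinsic_delay=80e-12, drive_resistance=1500.0),
}


def stage_delay(vdd, r_wire, c_wire, c_load):
    """Elmore delay of one buffered segment driven at the given Vdd-level."""
    m = MODELS[vdd]
    return (m.intrinsic_delay
            + m.drive_resistance * (c_wire + c_load)
            + r_wire * (c_wire / 2 + c_load))


def path_delay(vdd_levels, r_wire, c_wire, c_load):
    """Sum of stage delays along a path of identical buffered segments."""
    return sum(stage_delay(v, r_wire, c_wire, c_load) for v in vdd_levels)


all_high = path_delay(["H", "H"], r_wire=100.0, c_wire=100e-15, c_load=10e-15)
one_low = path_delay(["H", "L"], r_wire=100.0, c_wire=100e-15, c_load=10e-15)
print(all_high < one_low)  # True: the low-Vdd stage is slower
```

Summing per-stage delays relies on the interconnects being fully buffered, so that each segment is isolated from the next by its driver.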

Chip-level Assignment Algorithms
Tree based level converter insertion (TLC)
  - Sensitivity based algorithm TLC-S
Dual-Vdd tree based level converter insertion (dTLC)
  - Sensitivity based algorithm dTLC-S
  - Linear programming (LP) based algorithm dTLC-LP

Sensitivity Based Algorithm TLC-S
Iterative assignment
  - Assign low-Vdd to the ‘untried’ tree with maximum power sensitivity in each iteration
  - Reject the assignment if the critical path delay increases
  - Iteration terminates after all trees are ‘tried’
Power sensitivity
  - The power reduction obtained by changing Vdd from high-Vdd to low-Vdd
  - Power includes both dynamic and leakage power
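
The iteration above can be sketched as a greedy loop. This is an illustrative Python sketch, not the authors' implementation. It assumes each tree's power sensitivity is a fixed number, so visiting trees in descending order of sensitivity is the same as repeatedly picking the untried tree with maximum sensitivity. It also assumes a caller-supplied `critical_path_delay` function standing in for a real timing analyzer. The toy timing model and all numbers are hypothetical.

```python
def tlc_s(trees, power_saving, critical_path_delay):
    """Greedy TLC-S sketch: returns the set of trees assigned low-Vdd."""
    low = set()
    baseline = critical_path_delay(low)
    for tree in sorted(trees, key=lambda t: power_saving[t], reverse=True):
        low.add(tree)                          # tentatively assign low-Vdd
        if critical_path_delay(low) > baseline:
            low.remove(tree)                   # reject: critical path grew
    return low


# Toy timing model (hypothetical): each path crosses some routing trees,
# and every low-Vdd tree on a path adds a fixed delay penalty.
PATHS = {"A": (10.0, ["t1", "t2"]), "B": (6.0, ["t3"])}
PENALTY = 2.0


def toy_critical_path_delay(low):
    return max(base + PENALTY * sum(t in low for t in trees)
               for base, trees in PATHS.values())


savings = {"t1": 5.0, "t2": 3.0, "t3": 4.0}
result = tlc_s(savings.keys(), savings, toy_critical_path_delay)
print(result)  # {'t3'}: only the tree off the critical path can go low
```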

Sensitivity Based Algorithm dTLC-S
A “candidate switch” is defined as either
  - A switch that does not drive any other switch, or
  - A switch whose fanout switches have all been assigned low-Vdd
Iterative assignment
  - Assign low-Vdd to the candidate switch with maximum power sensitivity in each iteration
  - Reject the assignment if the critical path delay increases
  - Iteration terminates when there is no candidate switch
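
A minimal Python sketch of this loop follows, under stated assumptions: `fanout` is a hypothetical map from every switch to the list of switches it drives, a rejected switch is marked as tried and never reconsidered, and `critical_path_delay` is a stand-in for a real timing analyzer. The example tree, delays, and power numbers are invented for illustration.

```python
def dtlc_s(fanout, power_saving, critical_path_delay):
    """Greedy dTLC-S sketch: returns the set of switches assigned low-Vdd."""
    low, tried = set(), set()
    baseline = critical_path_delay(low)
    while True:
        # Candidate: untried switch whose fanout switches are all low-Vdd
        # (vacuously true for a switch that drives no other switch).
        candidates = [s for s in fanout
                      if s not in tried and all(f in low for f in fanout[s])]
        if not candidates:
            return low
        best = max(candidates, key=lambda s: power_saving[s])
        tried.add(best)
        low.add(best)                          # tentatively assign low-Vdd
        if critical_path_delay(low) > baseline:
            low.remove(best)                   # reject: critical path grew


# Hypothetical tree: b1 drives b2 and b3, b3 drives b4.
fanout = {"b1": ["b2", "b3"], "b2": [], "b3": ["b4"], "b4": []}
# Toy timing: (base delay, switches on the path); 10.0 is a fixed critical path.
PATHS = [(10.0, []), (9.5, ["b1", "b2"]), (8.5, ["b1", "b3", "b4"])]


def toy_delay(low):
    return max(base + 0.5 * sum(s in low for s in path) for base, path in PATHS)


savings = {"b1": 4.0, "b2": 1.0, "b3": 2.0, "b4": 3.0}
low_switches = dtlc_s(fanout, savings, toy_delay)
print(sorted(low_switches))  # ['b2', 'b3', 'b4']
```

Because a switch only becomes a candidate once everything it drives is already low-Vdd, the result never contains a low-Vdd switch driving a high-Vdd switch.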

LP Based Algorithm dTLC-LP: Overview
Input: single-Vdd placed and routed netlist
  1. Chip-level time slack allocation
  2. Net-level bottom-up assignment
  3. Refinement
Output: dual-Vdd netlist

dTLC-LP: Single-Net Estimation
Slack is represented in multiples of the delay increase of an interconnect segment when its Vdd is changed from high-Vdd to low-Vdd
[Figure: an example routing tree with switches b1 to b4 and sinks sink1 and sink2, shown for the slack allocations (s1, s2) = (1, 1), (2, 1) and (2, 3)]

dTLC-LP: Single-Net Estimation (Cont.)
Given the allocated slacks, estimate the number of low-Vdd switches
  - s_ik: slack for the k-th sink in the i-th routing tree
  - l_ik: number of switches in the path from the source to the k-th sink in the i-th tree
  - SL_ij: set of sinks in the fanout cone of the j-th switch in the i-th tree
[Figure: an example routing tree rooted at the source]

dTLC-LP: Single-Net Estimation (Cont.)
Given the allocated slacks, estimate the number of low-Vdd switches
[Figure: an example routing tree rooted at the source; each sink k contributes s_k / l_k along its path, and a switch shared by several sinks takes Min(s_k / l_k)]
Theorem: The estimation gives a lower bound on the number of low-Vdd switches that can be achieved
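
One way to read the estimate is sketched below in Python. Each switch contributes the minimum of s_k / l_k over the sinks in its fanout cone, and the contributions are summed over the tree. Summing over switches and capping each contribution at 1 are assumptions made here (the cap mirrors the "upper bound for useful slack" idea); the slides only show the per-switch Min(s_k / l_k). The tree and slack values are hypothetical.

```python
def estimate_low_vdd(paths, slack):
    """Estimate the number of low-Vdd switches in one routing tree.

    paths: sink -> list of switches from the source to that sink (l_k = len)
    slack: sink -> allocated slack, in units of the segment delay increase
    """
    cone = {}  # switch -> sinks in its fanout cone (SL in the slides)
    for sink, path in paths.items():
        for switch in path:
            cone.setdefault(switch, []).append(sink)
    return sum(
        min(1.0, min(slack[k] / len(paths[k]) for k in sinks))
        for sinks in cone.values()
    )


# Hypothetical tree: b1 drives b2 (to sink1) and b3, b3 drives b4 (to sink2).
paths = {"sink1": ["b1", "b2"], "sink2": ["b1", "b3", "b4"]}
slack = {"sink1": 1, "sink2": 3}

estimate = estimate_low_vdd(paths, slack)
print(estimate)  # 3.0 = 0.5 (b1) + 0.5 (b2) + 1.0 (b3) + 1.0 (b4)
```

In this example a bottom-up assignment can set b2, b3 and b4 to low-Vdd, so three low-Vdd switches are achievable, which is consistent with the estimate acting as a lower bound.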

dTLC-LP: Full-chip Time Slack Allocation
Objective function
  - f_s(i): transition density of the i-th tree
  - F_n(i): estimated number of low-Vdd switches in the i-th tree
  - Directly minimizes dynamic power
  - May help minimize leakage power, which depends exponentially on Vdd-level
Constraints
  - Net-based timing constraints
      For edges other than routing
      For edges corresponding to routing
      For PIs and POs

dTLC-LP: Full-chip Time Slack Allocation (Cont.)
Constraints
  - Constraints due to transforming the min function to a linear function
  - Upper bound for useful slack
Theorem: The time slack allocation problem is an LP problem
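
The slide equations did not survive extraction, so the following is only a sketch of how such a problem can be written as an LP, not the authors' exact formulation. It assumes the objective maximizes the transition-density-weighted estimate of low-Vdd switches, and it introduces an auxiliary variable y_ij per switch (a name made up here) to replace the min function by linear inequalities. The net-based timing constraints are omitted because their form is not recoverable from the slides.

```latex
\begin{aligned}
\text{maximize}\quad   & \sum_i f_s(i)\, F_n(i), \qquad F_n(i) = \sum_j y_{ij} \\
\text{subject to}\quad & y_{ij} \le \frac{s_{ik}}{l_{ik}} \quad \forall k \in SL_{ij}
                         && \text{(linearized min)} \\
                       & 0 \le s_{ik} \le l_{ik}
                         && \text{(upper bound for useful slack)}
\end{aligned}
```

Because the objective pushes every y_ij upward with a non-negative weight, each y_ij settles at the smallest s_ik / l_ik in its fanout cone at the optimum, which is the min it stands for. The objective and all constraints are linear in s_ik and y_ij.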

dTLC-LP: Overview
Input: single-Vdd placed and routed netlist
  1. Chip-level time slack allocation
  2. Net-level bottom-up assignment
  3. Refinement
Output: dual-Vdd netlist

dTLC-LP: Net-level Bottom-up Assignment
Perform bottom-up assignment within each tree to leverage the allocated slacks
Bottom-up assignment
  - Assign low-Vdd to switches in the routing tree in a bottom-up fashion
  - Slack is reduced by one segment delay increase in each step
  - Stop the process when no slack is left
Theorem: the bottom-up assignment is optimal
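
The bottom-up pass can be sketched in Python as a post-order traversal. This is an illustrative reading of the slide, not the authors' code. It assumes integer slacks counted in units of the segment delay increase, and it assumes a switch may go low-Vdd only if every switch it drives is already low-Vdd and every sink in its fanout cone still has at least one unit of slack. The maps `children` and `sinks_of` and the example tree are hypothetical.

```python
def bottom_up_assign(children, sinks_of, root, slack):
    """Assign low-Vdd bottom-up within one routing tree.

    children: switch -> switches it drives
    sinks_of: switch -> sinks it drives directly
    slack:    sink -> allocated slack in units of the segment delay increase
    Returns (set of low-Vdd switches, remaining slack per sink).
    """
    slack = dict(slack)  # work on a copy
    low = set()

    def visit(switch):
        cone = list(sinks_of.get(switch, []))
        fanout_all_low = True
        for child in children.get(switch, []):
            cone += visit(child)               # children are decided first
            fanout_all_low = fanout_all_low and child in low
        if fanout_all_low and all(slack[k] >= 1 for k in cone):
            low.add(switch)
            for k in cone:
                slack[k] -= 1                  # each step consumes one unit
        return cone

    visit(root)
    return low, slack


# Hypothetical tree: b1 drives b2 (to sink1) and b3, b3 drives b4 (to sink2).
children = {"b1": ["b2", "b3"], "b3": ["b4"]}
sinks_of = {"b2": ["sink1"], "b4": ["sink2"]}

low, remaining = bottom_up_assign(children, sinks_of, "b1",
                                  {"sink1": 1, "sink2": 3})
print(sorted(low), remaining)  # ['b2', 'b3', 'b4'] {'sink1': 0, 'sink2': 1}
```

Here b1 stays at high-Vdd because sink1 has no slack left after b2 is lowered, even though sink2 still has one unit to spare.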

dTLC-LP: Overview
Input: single-Vdd placed and routed netlist
  1. Chip-level time slack allocation
  2. Net-level bottom-up assignment
  3. Refinement
Output: dual-Vdd netlist

Outline
  - Review and Motivation
  - Modeling and Problem Formulations
  - Chip-level Vdd-level Assignment Algorithms
  - Experimental Results
  - Conclusions

Experimental Setting
Cluster-based island style FPGA structure
  - 100% buffered interconnects, subset switch block
  - Uniform length 4 for all wire segments
ITRS 100nm technology
Use VPR [Betz-Rose-Marquardt] for placement and routing
Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation
  - Considers short-circuit power, glitch power and input vector
  - 8% average error compared to SPICE simulation

Interconnect Power Comparison between TLC-S, dTLC-S and dTLC-LP
dTLC-S and dTLC-LP achieve 6.7% and 6.9% less interconnect power compared to TLC-S, respectively
Interconnect power breakdown
  - TLC-S, dTLC-S and dTLC-LP have almost the same leakage
  - dTLC-S and dTLC-LP achieve 13.8% and 15.8% less interconnect dynamic power compared to TLC-S, respectively
[Figure: interconnect power (watt) for TLC-S, dTLC-S and dTLC-LP, broken down into leakage power and dynamic power]

dTLC-LP compared to SLC and h2lLCi
SLC [Li et al, ICCAD’04]
  - Segment based level converters inserted in interconnects
  - Sensitivity based assignment algorithm
h2lLCi [Gayasen et al, FPL’04]
  - All the routing trees driven by a source CLB have the same Vdd-level as the source CLB
  - Path based assignment algorithm
dTLC-LP, SLC and h2lLCi achieve 77.54%, 74.70% and 41.80% low-Vdd switches w/o relaxing T_spec
At different delays, dTLC-LP achieves
  - The highest number of low-Vdd switches
  - The lowest power consumption
[Figure: two plots against critical path delay (ns) for dTLC-LP, SLC and h2lLCi: % of VddL switches, and interconnect power (watt); the power plot carries the annotations 64% and 19%]

Runtime Comparison between TLC-S, dTLC-S and dTLC-LP
TLC-S runs the fastest
dTLC-S versus dTLC-LP
  - dTLC-S runs 3X faster than dTLC-LP
  - But achieves similar power consumption
[Figure: runtime (s) of TLC-S, dTLC-S and dTLC-LP on the MCNC benchmarks alu4, apex2, apex4, elliptic, ex1010, frisc, pdc, s38417 and s38584]

Conclusions and Future Work
Proposed two ways to avoid using level converters in Vdd-programmable interconnects
  - Tree based level converter insertion (TLC)
  - Dual-Vdd tree based level converter insertion (dTLC)
Developed chip-level dual-Vdd assignment algorithms w/o level converters
  - Sensitivity based algorithms TLC-S and dTLC-S
  - LP based algorithm dTLC-LP
Developed dTLC-LP that reduces interconnect power by 64%
Developed dTLC-S that obtains slightly smaller power reduction with 3X speedup compared to dTLC-LP
Future work
  - Extend chip-level Vdd-level assignment to interconnects using wire segments of different lengths
  - Allocate time slack to logic blocks and interconnects in a uniform fashion

Thank you!
