Chapter 3b Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Prof. Lei He Electrical Engineering Department.

1 Chapter 3b Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Prof. Lei He Electrical Engineering Department University of California, Los Angeles URL: eda.ee.ucla.edu

2 Outline
Review and Motivation
Chip-level Vdd-level Assignment Algorithms
Experimental Results
Conclusions

3 FPGA Power Reduction
Existing FPGAs are power inefficient compared to ASICs [Kusse, ISLPED’98]
Power-aware FPGA CAD algorithms for existing FPGA architectures
  CAD algorithms to minimize power-delay product [Lamoureux et al, ICCAD’03]
  Configuration inversion for leakage reduction [Anderson et al, FPGA’04]
Power-efficient FPGA circuits and architectures
  Dual-Vdd and Vdd-programmable FPGA logic blocks [Li et al, FPGA’04][Li et al, DAC’04]
  Vdd-programmable FPGA interconnects [Li et al, ICCAD’04] [Gayasen et al, FPL’04] [Anderson et al, ICCAD’04]

4 Vdd-programmable Interconnects [Li et al, ICCAD’04]
[Figure: conventional routing switch and Vdd-programmable switch with power transistors]
Vdd-programmable switch
  Vdd selection for used switches
  Power gating for unused switches
  Reduces leakage by 300X
Configurable Vdd-level conversion
  Avoids excessive leakage when a low-Vdd switch drives high-Vdd switches
  Segment based Vdd-level converter insertion (SLC)
Area overhead
  35% area overhead for MCNC benchmark circuits
Leakage overhead
  29% leakage overhead for MCNC benchmark circuits

5 Previous Approaches w/o LCs
[Gayasen et al, FPL’04]
  Level converters are inserted at CLB inputs (outputs)
  All the routing trees driven by (driving) the source (sink) CLB have the same Vdd-level as the source (sink) CLB
  Lacks flexibility
  A path-based Vdd-level assignment is performed for CLBs and interconnects
[Anderson et al, ICCAD’04]
  The VT drop of an NMOS is used to generate low-Vdd
  A positive-feedback PMOS is used to tolerate a low-Vdd switch driving high-Vdd switches
  This is an alternative design of a level converter
  Still has a delay and power penalty

6 Our Major Contributions
Proposed two ways to avoid using level converters in interconnects
  Tree based level converter insertion (TLC)
    All the switches in one routing tree have the same Vdd-level
  Dual-Vdd tree based level converter insertion (dTLC)
    Only a high-Vdd switch may drive low-Vdd switches in one tree
Proposed several Vdd-level assignment algorithms
  Sensitivity based algorithms TLC-S and dTLC-S for TLC and dTLC, respectively
  Linear programming (LP) based algorithm dTLC-LP for dTLC

7 Problem Formulations
Tree based LC insertion (TLC)
  Allows only one Vdd-level within one routing tree
Dual-Vdd tree based LC insertion (dTLC)
  Allows a high-Vdd switch to drive low-Vdd switches, but not vice versa
Assign a Vdd-level to each interconnect switch to
  Minimize interconnect power
  Meet the delay target Tspec
Vdd-level converters
  Are removed within interconnects
  Are inserted at CLB inputs/outputs and can be used when needed

8 Outline
Review and Motivation
Chip-level Vdd-level Assignment Algorithms
Experimental Results
Conclusions

9 Delay & Power Model with Dual-Vdd
To incorporate dual-Vdd into timing analysis
  Pre-characterize the intrinsic delay and effective driving resistance of each switch using SPICE
  Calculate routing delay using the Elmore delay model
Interconnect power
  Dynamic power
  Leakage power is pre-characterized using SPICE
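The timing model on this slide can be illustrated with a small sketch. It is not the authors' implementation: it assumes each buffered switch is reduced to one lumped RC stage (for which the Elmore delay is simply R times C), ignores distributed wire resistance, and uses made-up characterization numbers. The names `CHARACTERIZED`, `stage_delay` and `arrival_times` are hypothetical.

```python
# Hypothetical pre-characterized values per Vdd level:
# (intrinsic delay, effective driving resistance). Units are arbitrary.
CHARACTERIZED = {"high": (1.0, 2.0), "low": (1.5, 3.0)}


def stage_delay(vdd, c_load):
    """Delay of one buffered switch: intrinsic delay plus R*C (lumped Elmore term)."""
    intrinsic, r_drive = CHARACTERIZED[vdd]
    return intrinsic + r_drive * c_load


def arrival_times(fanout, vdd, c_load, root):
    """Accumulate stage delays from the root switch down to every switch.

    fanout: switch -> list of switches it drives
    vdd:    switch -> "high" or "low"
    c_load: switch -> capacitance seen at that switch's output
    """
    arrival = {}

    def visit(switch, arrival_at_input):
        arrival[switch] = arrival_at_input + stage_delay(vdd[switch], c_load[switch])
        for child in fanout[switch]:
            visit(child, arrival[switch])

    visit(root, 0.0)
    return arrival


# Toy routing tree: b1 drives b2 and b3; only b2 runs at low-Vdd.
fanout = {"b1": ["b2", "b3"], "b2": [], "b3": []}
vdd = {"b1": "high", "b2": "low", "b3": "high"}
c_load = {"b1": 1.0, "b2": 1.0, "b3": 1.0}
delays = arrival_times(fanout, vdd, c_load, "b1")
```

In this toy tree the low-Vdd switch b2 arrives later than its high-Vdd sibling b3, which is the delay increase the assignment algorithms have to absorb with available slack.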

10 Chip-level Assignment Algorithms
Tree based level converter insertion (TLC)
  Sensitivity based algorithm TLC-S
Dual-Vdd tree based level converter insertion (dTLC)
  Sensitivity based algorithm dTLC-S
  Linear programming (LP) based algorithm dTLC-LP

11 Sensitivity Based Algorithm TLC-S
Power sensitivity
  The power reduction obtained by changing Vdd from high-Vdd to low-Vdd
  Power includes both dynamic and leakage power
Iterative assignment
  Assign low-Vdd to the ‘untried’ tree with maximum power sensitivity in each iteration
  Reject the assignment if the critical path increases
  Iteration terminates after all trees are ‘tried’
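The iteration described for TLC-S can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the authors' code: sensitivities are taken as fixed per tree, "critical path increases" is read as the critical path delay exceeding the target `t_spec`, and the timing analysis is abstracted into a caller-supplied function. All names here are hypothetical.

```python
def tlc_s(trees, sensitivity, critical_path_delay, t_spec):
    """Greedy tree-level assignment: try trees in order of decreasing power sensitivity.

    trees:               iterable of routing-tree ids
    sensitivity:         tree -> power saved by moving the whole tree to low-Vdd
    critical_path_delay: function(set of low-Vdd trees) -> critical path delay
    t_spec:              delay target
    """
    low_vdd = set()
    # With fixed sensitivities, visiting trees in sorted order is the same as
    # picking the untried tree with maximum sensitivity in each iteration.
    for tree in sorted(trees, key=lambda t: sensitivity[t], reverse=True):
        low_vdd.add(tree)
        if critical_path_delay(low_vdd) > t_spec:
            low_vdd.remove(tree)  # reject: timing target would be violated
    return low_vdd


# Toy timing model (assumption): every low-Vdd tree adds its penalty to one path.
penalty = {"a": 1, "b": 2, "c": 1}
toy_delay = lambda low: 10 + sum(penalty[t] for t in low)
result = tlc_s(["a", "b", "c"], {"a": 5, "b": 3, "c": 1}, toy_delay, t_spec=12)
```

Here tree b is rejected because accepting it after a would push the toy delay to 13, while c still fits within the target.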

12 Sensitivity Based Algorithm dTLC-S
A “candidate switch” is defined as either
  A switch that does not drive any switch, or
  A switch whose fanout switches have all been assigned low-Vdd
Iterative assignment
  Assign low-Vdd to the candidate switch with maximum power sensitivity in each iteration
  Reject the assignment if the critical path increases
  Iteration terminates when there is no candidate switch
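The candidate-switch rule can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: a rejected switch is assumed to stay at high-Vdd and is never retried, "critical path increases" is read as exceeding the target `t_spec`, and timing analysis is abstracted into a caller-supplied function. All names are hypothetical.

```python
def dtlc_s(fanout, sensitivity, critical_path_delay, t_spec):
    """Greedy switch-level assignment that never lets low-Vdd drive high-Vdd.

    fanout:              switch -> list of switches it drives
    sensitivity:         switch -> power saved by moving it to low-Vdd
    critical_path_delay: function(set of low-Vdd switches) -> delay
    """
    low_vdd, rejected = set(), set()
    while True:
        # Candidate: untried switch with no fanout, or whose fanout is all low-Vdd.
        candidates = [
            s for s in fanout
            if s not in low_vdd and s not in rejected
            and all(f in low_vdd for f in fanout[s])
        ]
        if not candidates:
            return low_vdd
        best = max(candidates, key=lambda s: sensitivity[s])
        low_vdd.add(best)
        if critical_path_delay(low_vdd) > t_spec:
            low_vdd.remove(best)
            rejected.add(best)  # stays high-Vdd, which also blocks its drivers


# Toy tree: root drives a and b; each low-Vdd switch adds one delay unit.
fanout = {"root": ["a", "b"], "a": [], "b": []}
toy_delay = lambda low: 10 + len(low)
result = dtlc_s(fanout, {"root": 10, "a": 2, "b": 1}, toy_delay, t_spec=12)
```

Although root has the largest sensitivity, it only becomes a candidate after both a and b are low-Vdd, and by then the toy slack is used up.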

13 LP Based Algorithm dTLC-LP: Overview
Single-Vdd placed and routed netlist -> Chip-level Time Slack Allocation -> Net-level Bottom-up Assignment -> Refinement -> Dual-Vdd netlist

14 dTLC-LP: Single-Net Estimation
Slack is represented in multiples of a unit delay
  The unit is the delay increase of an interconnect segment when its Vdd changes from high-Vdd to low-Vdd
[Figure: example routing tree with switches b1, b2, b3, b4 and sinks sink1 and sink2, shown for slack allocations (s1=2, s2=1), (s1=2, s2=3) and (s1=1, s2=1)]

15 dTLC-LP: Single-Net Estimation (Cont.)
Given the allocated slacks, estimate the number of low-Vdd switches
  sik: slack for the kth sink in the ith routing tree
  lik: number of switches on the path from the source to the kth sink in the ith tree
  SLij: set of sinks in the fanout cone of the jth switch in the ith tree
[Figure, built up over slides 15 to 19: example routing tree from the source, with switches annotated s1/l1, s2/l2, s3/l3 and Min(sk/lk)]
Theorem: the estimate is a lower bound on the number of low-Vdd switches that can be achieved
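The estimation formula itself did not survive in this transcript, so the sketch below is a reconstruction from the symbol definitions and the figure labels (s1/l1 and Min(sk/lk)), not the authors' exact expression. It assumes each switch contributes the minimum of sk/lk over the sinks in its fanout cone, capped at 1, and that the tree estimate is the sum over switches. The function and argument names are hypothetical.

```python
def estimate_low_vdd_switches(slack, path_len, fanout_sinks):
    """Estimate the number of low-Vdd switches in one routing tree.

    slack:        sink -> allocated slack, in units of the per-segment delay increase
    path_len:     sink -> number of switches on the source-to-sink path
    fanout_sinks: switch -> set of sinks in that switch's fanout cone
    """
    total = 0.0
    for sinks in fanout_sinks.values():
        # A switch is limited by the tightest sink it feeds (assumed form),
        # and one switch can count at most once (assumed cap).
        total += min(1.0, min(slack[k] / path_len[k] for k in sinks))
    return total


# Toy tree: b1 feeds both sinks, b2 feeds sink 1, b3 and b4 feed sink 2.
fanout_sinks = {"b1": {1, 2}, "b2": {1}, "b3": {2}, "b4": {2}}
path_len = {1: 2, 2: 3}   # b1-b2 and b1-b3-b4
slack = {1: 1, 2: 3}
estimate = estimate_low_vdd_switches(slack, path_len, fanout_sinks)
```

Because the expression is built only from divisions, a min and a sum, it can be folded into a linear program once the min is linearized, which is what the slack allocation step relies on.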

20 dTLC-LP: Full-chip Time Slack Allocation
Objective function
  fs(i): transition density of the ith tree
  Fn(i): estimated number of low-Vdd switches in the ith tree
  Directly minimizes dynamic power
  May also help minimize leakage power, which depends exponentially on the Vdd-level
Constraints
  Net-based timing constraints
    For PIs and POs
    For edges corresponding to routing
    For edges other than routing

21 dTLC-LP: Full-chip Time Slack Allocation
Objective function
  fs(i): transition density of the ith tree
  Fn(i): estimated number of low-Vdd switches in the ith tree
Constraints
  Upper bound for useful slack
  Constraints due to transforming the min function into a linear function
Theorem: the time slack allocation problem is an LP problem
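The objective and constraint formulas are missing from this transcript. As a hedged sketch only, the standard way to turn a min into linear constraints is shown below. It assumes the estimate has the form Fn(i) = Σ_j min(1, min_{k ∈ SLij} sik/lik) and that the objective maximizes the transition-density-weighted estimate; the auxiliary variable f_ij is introduced here for illustration and does not appear on the slides.

```latex
\begin{aligned}
\max_{s,\,f}\quad & \sum_{i} f_s(i) \sum_{j} f_{ij} \\
\text{s.t.}\quad  & f_{ij} \le \frac{s_{ik}}{l_{ik}} \qquad \forall k \in SL_{ij}, \\
                  & 0 \le f_{ij} \le 1, \\
                  & \text{net-based timing constraints, linear in the slacks } s_{ik}.
\end{aligned}
```

Since each f_ij has a non-negative coefficient in a maximization, it rises at the optimum until it meets its tightest upper bound, so it equals the capped min without any nonlinear term. Every constraint is then linear in the variables, which is consistent with the slide's statement that slack allocation is an LP problem.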

22 dTLC-LP: Overview
Single-Vdd placed and routed netlist -> Chip-level Time Slack Allocation -> Net-level Bottom-up Assignment -> Refinement -> Dual-Vdd netlist

23 dTLC-LP: Net-level Bottom-up Assignment
Perform bottom-up assignment within each tree to leverage the allocated slacks
Bottom-up assignment
  Assign low-Vdd to switches in the routing tree in a bottom-up fashion
  Slack is reduced by one unit (the delay increase of one segment moved to low-Vdd) in each step
  Stop when no slack is left
Theorem: the bottom-up assignment is optimal
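A minimal sketch of the bottom-up pass within one tree follows. It is an interpretation under stated assumptions rather than the authors' procedure: slack is an integer count of unit delay increases per sink, moving a switch to low-Vdd consumes one unit from every sink in its fanout cone, and a switch may go low only when all switches it drives are already low (the dTLC rule). All names are hypothetical.

```python
def bottom_up_assign(fanout, fanout_sinks, slack, root):
    """Assign low-Vdd from the leaves toward the source of one routing tree.

    fanout:       switch -> list of switches it drives
    fanout_sinks: switch -> set of sinks in its fanout cone
    slack:        sink -> allocated slack in whole delay units
    """
    remaining = dict(slack)  # do not mutate the caller's allocation
    low_vdd = set()

    def visit(switch):
        for child in fanout[switch]:
            visit(child)  # post-order: children are decided first
        drives_only_low = all(child in low_vdd for child in fanout[switch])
        has_slack = all(remaining[k] >= 1 for k in fanout_sinks[switch])
        if drives_only_low and has_slack:
            low_vdd.add(switch)
            for k in fanout_sinks[switch]:
                remaining[k] -= 1

    visit(root)
    return low_vdd


# Toy tree: b1 drives b2 (sink 1) and b3; b3 drives b4 (sink 2).
fanout = {"b1": ["b2", "b3"], "b2": [], "b3": ["b4"], "b4": []}
fanout_sinks = {"b1": {1, 2}, "b2": {1}, "b3": {2}, "b4": {2}}
assigned = bottom_up_assign(fanout, fanout_sinks, {1: 1, 2: 3}, "b1")
```

In the toy tree, b1 stays at high-Vdd because sink 1 has no slack left after b2 is lowered, even though sink 2 still has one unit to spare.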

24 dTLC-LP: Overview
Single-Vdd placed and routed netlist -> Chip-level Time Slack Allocation -> Net-level Bottom-up Assignment -> Refinement -> Dual-Vdd netlist

25 Outline
Review and Motivation
Modeling and Problem Formulations
Chip-level Vdd-level Assignment Algorithms
Experimental Results
Conclusions

26 Experimental Setting
Cluster-based island style FPGA structure
  100% buffered interconnects, subset switch block
  Uniform length 4 for all wire segments
  ITRS 100nm technology
Use VPR [Betz-Rose-Marquardt] for placement and routing
Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation
  Considers short-circuit power, glitch power and input vectors
  8% average error compared to SPICE simulation

27 Interconnect Power Comparison between TLC-S, dTLC-S and dTLC-LP
[Figure: bar chart of interconnect power (watt), split into leakage power and dynamic power, for TLC-S, dTLC-S and dTLC-LP]
dTLC-S and dTLC-LP achieve 6.7% and 6.9% less interconnect power than TLC-S, respectively
Interconnect power breakdown
  TLC-S, dTLC-S and dTLC-LP have almost the same leakage
  dTLC-S and dTLC-LP achieve 13.8% and 15.8% less interconnect dynamic power than TLC-S, respectively

28 dTLC-LP compared to SLC and h2lLCi
[Figure: percentage of VddL switches versus critical path delay (ns) for dTLC-LP, SLC and h2lLCi]
[Figure: interconnect power (watt) for dTLC-LP, SLC and h2lLCi, annotated with 64% and 19%]
SLC [Li et al, ICCAD ’04]
  Segment based level converters inserted in interconnects
  Sensitivity based assignment algorithm
h2lLCi [Gayasen et al, FPL’04]
  All the routing trees driven by a source CLB have the same Vdd-level as the source CLB
  Path based assignment algorithm
dTLC-LP, SLC and h2lLCi achieve 77.54%, 74.70% and 41.80% low-Vdd switches w/o relaxing Tspec
At different delays, dTLC-LP achieves
  The highest number of low-Vdd switches
  The lowest power consumption

29 Runtime Comparison between TLC-S, dTLC-S and dTLC-LP
[Figure: runtime (s) of TLC-S, dTLC-S and dTLC-LP on the MCNC benchmarks alu4, apex2, apex4, elliptic, ex1010, frisc, pdc, s38417 and s38584]
TLC-S runs the fastest
dTLC-S versus dTLC-LP
  dTLC-S runs 3X faster than dTLC-LP
  But achieves similar power consumption

30 Conclusions and Future Work
Proposed two ways to avoid using level converters in Vdd-programmable interconnects
  Tree based level converter insertion (TLC)
  Dual-Vdd tree based level converter insertion (dTLC)
Developed chip-level dual-Vdd assignment algorithms w/o level converters
  Sensitivity based algorithms TLC-S and dTLC-S
  LP based algorithm dTLC-LP
Developed dTLC-LP that reduces interconnect power by 64%
Developed dTLC-S that obtains slightly smaller power reduction with 3X speedup compared to dTLC-LP
Future work
  Extend chip-level Vdd-level assignment to interconnects using wire segments of different lengths
  Allocate time slack to logic blocks and interconnects in a uniform fashion

31 Thank you!

