Device and Architecture Co-Optimization for FPGA Power Reduction
Lerong Cheng, Phoebe Wong, Fei Li, Yan Lin, and Prof. Lei He, EE Department, UCLA

Presentation transcript:

Device and Architecture Co-Optimization for FPGA Power Reduction
Lerong Cheng, Phoebe Wong, Fei Li, Yan Lin, and Prof. Lei He
EE Department, UCLA
Partially supported by NSF CAREER award CCR / and NSF grant CCR
Address comments to

Outline
- Background and motivation
- Trace-based power and delay estimation
- Device and architecture co-optimization
- Conclusion

Evaluation of Conventional FPGA Architecture
- LUT size and cluster size have been evaluated for conventional FPGAs:
  - performance and area [Ahmed et al., ISFPGA'00]
  - power and performance [Li et al., ISFPGA'03]
  - Architecture tuning leads to a 2.8X energy difference and a 1.5X delay difference.
[Figure: island-style FPGA architecture with logic blocks, I/O pads, switch boxes, and connection boxes]
[Figure: evaluation result, critical path delay (ns) vs. total FPGA energy (nJ/cycle), with points labeled by architecture pairs from (6, 3) to (12, 7)]

Evaluation of Low-Power FPGA Architecture
- Field-programmable dual-Vdd for power reduction [Lin et al., ISFPGA'05]
  - Applying field-programmable dual-Vdd reduces the energy-delay product by 49%.
[Figure: conventional FPGA vs. Vdd-programmable FPGA, with high-Vdd logic blocks, low-Vdd logic blocks, and Vdd-programmable logic blocks]

Evaluation Methodology
[Flow diagram]
- Inputs: benchmark circuits, architecture specification (Arch Spec)
- Steps: logic optimization (SIS), technology mapping (RASP), timing-driven packing (T-VPack), placement and routing (VPR), parasitic extraction, cycle-accurate power simulator (Psim)
- Outputs: delay, area, power

Impact of Device Tuning
- All previous work considers only architecture tuning.
- Device tuning leads to an 84X power difference and a 12X delay difference.
- Device tuning and architecture tuning therefore need to be performed simultaneously.

Challenge of Device and Architecture Co-Optimization
- We consider the following architecture and device parameters during co-optimization:
  - Architecture parameters: cluster size (N), LUT size (K)
  - Device parameters: supply voltage (Vdd), threshold voltage (Vt)
- A hyper-architecture (hyper-arch) is a combination of device and architecture parameters.
- The number of hyper-arch combinations is large.
- VPR and Psim are too slow to handle such a large number of experiments.
- A fast yet accurate power and delay estimator is needed.

Outline
- Background and motivation
- Trace-based power and delay estimation
  - Trace collection
  - Trace-based power and delay model
  - Accuracy and efficiency verification of the trace-based estimator
- Device and architecture co-optimization
- Conclusion

Trace Collection
[Diagram: VPR and Psim produce the trace used by Ptrace; trace items shown are switching activity, short-circuit power ratio, circuit element statistics, critical path structure, and area]
- Assumption: trace information remains the same when the device setting changes.

Trace-Based Estimation (Ptrace) Framework
- Inputs to Ptrace: the trace (device independent) and circuit-level delay and power (device dependent)
- Output: chip-level delay, power, and area

Delay Model in VPR
- Delay is calculated for each path p as
  $$D_p = \sum_i N_i^p D_i$$
  where $N_i^p$ is the number of type-i elements in the path and $D_i$ is the delay of a type-i element.
  - Delay of the logic elements is measured by SPICE simulation.
  - Elmore delay is used for interconnect wire segments.
- The critical path is the path with the longest delay.

Delay in Ptrace
- Obtain the path structure of a set of longest circuit paths.
- Assume that when the device setting changes, the new critical path is still among this set of longest paths.
- Delay computation: the path structure is trace information; the element delays are device-dependent parameters.
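The trace-based delay computation can be sketched as follows. This is a minimal illustration, assuming Ptrace reuses the per-path form stated for VPR (element counts times element delays) and takes the maximum over the stored paths. The element type names, counts, and delay values are hypothetical and do not come from the slides.

```python
from typing import Dict, List

# A path is described by how many elements of each type it contains.
# This is the trace information, assumed unchanged across device settings.
PathStructure = Dict[str, int]


def path_delay(path: PathStructure, element_delay: Dict[str, float]) -> float:
    """Sum of (number of type-i elements) * (delay of a type-i element)."""
    return sum(count * element_delay[kind] for kind, count in path.items())


def critical_path_delay(paths: List[PathStructure],
                        element_delay: Dict[str, float]) -> float:
    """Longest delay among the stored set of longest paths."""
    return max(path_delay(p, element_delay) for p in paths)


# Hypothetical trace: two candidate paths recorded during trace collection.
trace_paths = [
    {"lut": 4, "wire_segment": 10},
    {"lut": 6, "wire_segment": 5},
]

# Hypothetical element delays (ns) under two device settings.
delays_a = {"lut": 1.0, "wire_segment": 0.25}
delays_b = {"lut": 0.25, "wire_segment": 0.5}

print(critical_path_delay(trace_paths, delays_a))
print(critical_path_delay(trace_paths, delays_b))
```

With these made-up numbers the second path is critical under the first setting and the first path under the second, which is why a set of longest paths is kept rather than a single path.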

Dynamic Power Model
- Psim
  - Switch power: switching activity is measured by timing simulation for each node; S_i is the average switching activity.
  - Short-circuit power: α_sc is calculated for each node.
- Ptrace
  - Switch power
  - Short-circuit power: α_sc is the average short-circuit power ratio for the whole circuit.
- Legend: trace information; device-dependent parameters
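The equations on this slide are not preserved, so the following is only a sketch under stated assumptions, not the authors' model. It uses the textbook CMOS switching power expression (0.5 × activity × capacitance × Vdd² × frequency) summed per element type, and it assumes α_sc is the ratio of short-circuit power to switch power. The class name, field names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElementType:
    count: int            # trace: number of elements of this type
    activity: float       # trace: average switching activity (transitions per cycle)
    capacitance_f: float  # device dependent: switched capacitance per element (farads)


def switch_power(elements: List[ElementType], vdd: float, freq_hz: float) -> float:
    """Textbook CMOS switching power, summed over element types (watts)."""
    return sum(
        0.5 * e.count * e.activity * e.capacitance_f * vdd ** 2 * freq_hz
        for e in elements
    )


def dynamic_power(elements: List[ElementType], vdd: float,
                  freq_hz: float, alpha_sc: float) -> float:
    """Switch power plus short-circuit power, assumed to be alpha_sc * switch power."""
    p_switch = switch_power(elements, vdd, freq_hz)
    return p_switch + alpha_sc * p_switch


# Hypothetical numbers, for illustration only.
elements = [ElementType(count=1000, activity=0.5, capacitance_f=2e-15)]
p_dyn = dynamic_power(elements, vdd=1.0, freq_hz=1e8, alpha_sc=0.1)
print(f"{p_dyn:.3e} W")
```

In this form only `vdd`, `freq_hz`, and `capacitance_f` change with the device setting, while the counts and activities come from the trace.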

Static Power Model
- Psim
  - Without power gating
  - With power gating
- Ptrace
  - Without power gating
  - With power gating
- Legend: trace information; device-dependent parameters

Experiment Setting
- Collect the trace using ITRS 70nm technology, but apply it to both 100nm and 70nm technologies.
- 20 MCNC benchmarks
- Assume each benchmark works at its highest possible frequency.
- Power and delay are computed as the geometric mean over the 20 benchmarks.
- Evaluation range:

  Vdd (V)   Vt (V)    LUT size (K)   Cluster size (N)
  0.8~1.1   0.2~0.4   3~7            6~12
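The geometric-mean aggregation over benchmarks can be sketched as below. The function name and the three sample values are hypothetical; the slides do not list per-benchmark numbers.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of strictly positive values, computed in log space."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark delays (ns) for one hyper-arch.
delays_ns = [1.0, 4.0, 16.0]
print(geometric_mean(delays_ns))
```

Unlike the arithmetic mean, the geometric mean keeps one large benchmark from dominating the aggregate, since each benchmark contributes through its ratio rather than its absolute value.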

Accuracy
- Average power error is 3.4%.
- Average delay error is 6.4%.
  - The delay error arises because Ptrace ignores the impact of path branches, which VPR considers.

Runtime
- VPR and Psim for one device setting: five days on eight 1.2GHz Intel Xeon servers
- Ptrace for 20 device settings: 80 seconds on one 1.2GHz Intel Xeon server
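As a rough worked comparison of the two runtimes, the figures can be converted to server-seconds per device setting. This sketch assumes all eight servers are busy for the full five days and leaves out the cost of collecting the trace; both are simplifications not stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
vpr_psim_days = 5
vpr_psim_servers = 8
ptrace_total_seconds = 80
ptrace_device_settings = 20

# Assumption: every server runs for the whole five days.
vpr_psim_server_seconds = vpr_psim_days * SECONDS_PER_DAY * vpr_psim_servers
ptrace_seconds_per_setting = ptrace_total_seconds / ptrace_device_settings
ratio = vpr_psim_server_seconds / ptrace_seconds_per_setting

print(f"VPR + Psim: {vpr_psim_server_seconds} server-seconds per device setting")
print(f"Ptrace:     {ptrace_seconds_per_setting} seconds per device setting")
print(f"Ratio:      {ratio:.0f}x")
```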

Outline
- Background and motivation
- Trace-based power and delay estimation
- Device and architecture co-optimization
  - Energy and delay tradeoff
  - ED and area tradeoff
  - Comparison between classes
  - Comparison between device tuning and architecture tuning
- Conclusion

Architecture Classes to Be Evaluated
- Baseline case
  - Vdd suggested by ITRS
  - Architecture same as Xilinx Virtex-II™
  - Vt optimized by our method with respect to the above architecture and Vdd
- Hyper-architecture classes:

  Hyper-arch class   Vt
  Homo-Vt            Homogeneous Vt
  Hetero-Vt          Heterogeneous Vt
  Homo-Vt+G          Homogeneous Vt + power gating
  Hetero-Vt+G        Heterogeneous Vt + power gating

- Parameters: Vdd, Vt, LUT size (K), cluster size (N)

Energy and Delay Tradeoff
- Dominant hyper-arch
  - Hyper-arch B is inferior to A if A has less energy and smaller delay than B.
  - Dominant hyper-archs (dom-archs) are the hyper-archs that are NOT inferior to any other hyper-arch.
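The dominance definition translates directly into a filter over candidate hyper-archs. The sketch below is a straightforward quadratic-time version; the `HyperArch` type and the four sample points are hypothetical.

```python
from typing import List, NamedTuple


class HyperArch(NamedTuple):
    name: str
    energy: float  # nJ per cycle
    delay: float   # ns


def is_inferior(b: HyperArch, a: HyperArch) -> bool:
    """B is inferior to A if A has less energy and smaller delay than B."""
    return a.energy < b.energy and a.delay < b.delay


def dominant_archs(archs: List[HyperArch]) -> List[HyperArch]:
    """Keep the hyper-archs that are not inferior to any other hyper-arch."""
    return [b for b in archs if not any(is_inferior(b, a) for a in archs)]


# Hypothetical candidates, for illustration only.
candidates = [
    HyperArch("A", energy=1.0, delay=10.0),
    HyperArch("B", energy=2.0, delay=12.0),  # inferior to A
    HyperArch("C", energy=3.0, delay=8.0),
    HyperArch("D", energy=0.5, delay=15.0),
]

print([arch.name for arch in dominant_archs(candidates)])
```

The comparison is strict in both energy and delay, following the wording of the definition, so two hyper-archs that tie on one metric do not eliminate each other.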

Energy and Delay Tradeoff (continued)
- Hetero-Vt can reduce power.
- Power gating reduces more leakage power than hetero-Vt.
- Hetero-Vt has less impact when power gating is applied.

Min-ED Hyper-Arch
- To achieve the best energy and delay tradeoff, we find the hyper-arch with the minimum energy-delay product (ED).

  Hyper-arch class   (N, K)
  Baseline           (8, 4)
  Homo-Vt            (6, 7)
  Hetero-Vt          (8, 4)
  Homo-Vt+G          (12, 4)
  Hetero-Vt+G        (8, 4)

  Remaining columns (values missing): Vdd (V), CVt (V), IVt (V), ED (nJ·ns), ED reduction (%)

- Compared to the baseline, the min-ED hyper-arch of the conventional FPGA (Homo-Vt) reduces ED by 13.4%.
- For the Hetero-Vt class, ED is reduced by 20.5%.
- If power gating is applied, ED can be reduced by up to 59.0%.

ED and Area Tradeoff
- Architecture tuning has a great impact on area.
- To achieve the best area and ED tradeoff, we find the hyper-arch with the minimum product of area, energy, and delay (AED).
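Selecting the min-ED and min-AED hyper-archs amounts to minimizing two different products over the same candidate set. The sketch below shows this with three hypothetical candidates; the names and values are made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    energy: float  # nJ
    delay: float   # ns
    area: float    # normalized to the baseline


def ed(c: Candidate) -> float:
    """Energy-delay product."""
    return c.energy * c.delay


def aed(c: Candidate) -> float:
    """Area-energy-delay product."""
    return c.area * c.energy * c.delay


# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("X", energy=2.0, delay=10.0, area=1.0),
    Candidate("Y", energy=1.5, delay=12.0, area=1.25),
    Candidate("Z", energy=2.0, delay=10.5, area=0.75),
]

min_ed = min(candidates, key=ed)
min_aed = min(candidates, key=aed)
print("min-ED: ", min_ed.name, ed(min_ed))
print("min-AED:", min_aed.name, ed(min_aed), min_aed.area)
```

In this example the two objectives pick different candidates: the min-AED choice gives up some ED in exchange for a smaller area.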

ED-Area Tradeoff for Classes without Power Gating
- Compared to the min-ED hyper-arch, the min-AED hyper-arch significantly reduces area with a small ED increase.

Sleep Transistor Size Tuning
- When power gating is applied, sleep transistors may increase area.
- The larger the sleep transistor, the smaller the delay.
- Sleep transistor size tuning:
  - The area overhead introduced by the sleep transistors of logic blocks is negligible.
  - We consider 2X, 4X, 7X, and 10X PMOS as the sleep transistor for the switch buffer.

ED-Area Tradeoff for Classes with Power Gating
- The area reduction achieved by device and architecture co-optimization compensates for the area overhead introduced by sleep transistors.

Min-AED Hyper-Arch

  Hyper-arch class   (N, K)
  Baseline           (8, 4)
  Homo-Vt            (6, 4)
  Hetero-Vt          (12, 4)
  Hetero-Vt+G        (12, 4)
  Hetero-Vt+G        (12, 4)

  Remaining columns (values missing): Vdd (V), CVt (V), IVt (V), sleep transistor size, ED (nJ·ns), normalized area, AED reduction (%)

- Compared to the baseline, the min-AED hyper-arch in the conventional FPGA class can reduce area by 20% and ED by 12.3%.
- In the Hetero-Vt class, ED is reduced by 20.8% and area is reduced by 23% compared to the baseline.
- If power gating is applied, ED is reduced by 54.6% and area is reduced by 8.3%.

Comparison Between Classes in Similar Performance Range
[Tables, values missing: Homo-Vt and Homo-Vt+G with columns Vdd, Vt, (N, K), E (nJ), D (ns), ED (nJ·ns); Hetero-Vt and Hetero-Vt+G with columns Vdd, CVt, IVt, (N, K), E (nJ), D (ns), ED (nJ·ns)]
- Vt for the logic block is lower than Vt for the interconnect.
- Vt for classes with power gating is lower.

Dom-Archs under Different Device Settings
- For a given device setting, architecture tuning changes delay and energy within a smaller range.
- Device tuning has a much larger impact on delay and energy.

Conclusion and Discussion
- The trace-based estimator provides efficient and accurate FPGA power and delay estimation.
  - Average power error is 3.4% and average delay error is 6.1%.
- Device and architecture co-optimization reduces ED by 20.5% and area by 23.3% when there is no power gating.
- With power gating, device and architecture co-optimization reduces ED by 54.6% and area by 8.3%.
- Device tuning has a more significant impact on delay and power than architecture tuning does.
- In recent research, Ptrace has been extended to consider leakage and timing yield with process variations.