Xiao Patrick Dong Supervisor: Guy Lemieux. Goal: Reduce critical path  shorter period Decrease dynamic power 2.

Slides:



Advertisements
Similar presentations
1 COMP541 Flip-Flop Timing Montek Singh Oct 6, 2014.
Advertisements

1/48 ENERGY OPTIMIZATION TECHNIQUES: FPGA GLITCH REDUCTION Patrick Cooke and Elizabeth Graham.
Mapping for Better Than Worst-Case Delays In LUT-Based FPGA Designs Kirill Minkovich and Jason Cong VLSI CAD Lab Computer Science Department University.
June 6, Using Negative Edge Triggered FFs to Reduce Glitching Power in FPGA Circuits Tomasz S. Czajkowski and Stephen D. Brown Department of Electrical.
Leakage and Dynamic Glitch Power Minimization Using MIP for V th Assignment and Path Balancing Yuanlin Lu and Vishwani D. Agrawal Auburn University ECE.
Cross-layer Optimized Placement and Routing for FPGA Soft Error Mitigation Keheng Huang 1,2, Yu Hu 1, and Xiaowei Li 1 1 Key Laboratory of Computer System.
A Useful Skew Tree Framework for Inserting Large Safety Margins Rickard Ewetz and Cheng-Kok Koh School of Electrical and Computer Engineering, Purdue University.
Clustering of Large Designs for Channel-Width Constrained FPGAs Marvin TomGuy Lemieux University of British Columbia Department of Electrical and Computer.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
Yan Lin, Fei Li and Lei He EE Department, UCLA
1 DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs Deming Chen, Jacon Cong ICCAD 2004 Presented by: Wei Chen.
An Algorithm to Minimize Leakage through Simultaneous Input Vector Control and Circuit Modification Nikhil Jayakumar Sunil P. Khatri Presented by Ayodeji.
TH EDA NTHU-CS VLSI/CAD LAB 1 Re-synthesis for Reliability Design Shih-Chieh Chang Department of Computer Science National Tsing Hua University.
Pipelining and Retiming 1 Pipelining  Adding registers along a path  split combinational logic into multiple cycles  increase clock rate  increase.
Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction Yu Hu 1, Yan Lin 1, Lei He 1 and Tim Tuan 2 1 EE Department, UCLA 2 Xilinx.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
CSE241 L4 System.1Kahng & Cichy, UCSD ©2003 CSE241A VLSI Digital Circuits Winter 2003 Lecture 04: Static Timing Analysis.
An Efficient Chiplevel Time Slack Allocation Algorithm for Dual-Vdd FPGA Power Reduction Yan Lin 1, Yu Hu 1, Lei He 1 and Vijay Raghunathan 2 1 EE Department,
Continuous Retiming EECS 290A Sequential Logic Synthesis and Verification.
CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.
Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Yan Lin and Lei He EE Department, UCLA Partially supported.
Chapter #6: Sequential Logic Design 6.2 Timing Methodologies
ELEN 468 Lecture 221 ELEN 468 Advanced Logic Design Lecture 22 Timing Verification.
HARP: Hard-Wired Routing Pattern FPGAs Cristinel Ababei , Satish Sivaswamy ,Gang Wang , Kia Bazargan , Ryan Kastner , Eli Bozorgzadeh   ECE Dept.
Trace-Based Framework for Concurrent Development of Process and FPGA Architecture Considering Process Variation and Reliability 1 Lerong Cheng, 1 Yan Lin,
Yehdhih Ould Mohammed Moctar1 Nithin George2 Hadi Parandeh-Afshar2
DELAY INSERTION METHOD IN CLOCK SKEW SCHEDULING BARIS TASKIN and IVAN S. KOURTEV ISPD 2005 High Performance Integrated Circuit Design Lab. Department of.
A New Methodology for Reduced Cost of Resilience Andrew B. Kahng, Seokhyeong Kang and Jiajia Li UC San Diego VLSI CAD Laboratory.
Power Reduction for FPGA using Multiple Vdd/Vth
Coarse and Fine Grain Programmable Overlay Architectures for FPGAs
Un/DoPack: Re-Clustering of Large System-on-Chip Designs with Interconnect Variation for Low-Cost FPGAs Marvin Tom* Xilinx Inc.
Channel Width Reduction Techniques for System-on-Chip Circuits in Field-Programmable Gate Arrays Marvin Tom University of British Columbia Department of.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR FPGA Fabric n Elements of an FPGA fabric –Logic element –Placement –Wiring –I/O.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Safe Overclocking Safe Overclocking of Tightly Coupled CGRAs and Processor Arrays using Razor © 2012 Guy Lemieux Alex Brant, Ameer Abdelhadi, Douglas Sim,
Digital System Clocking: High-Performance and Low-Power Aspects Vojin G. Oklobdzija, Vladimir M. Stojanovic, Dejan M. Markovic, Nikola M. Nedovic Wiley-Interscience.
Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.
Skewed Flip-Flop Transformation for Minimizing Leakage in Sequential Circuits Jun Seomun, Jaehyun Kim, Youngsoo Shin Dept. of Electrical Engineering, KAIST,
The Effects of Operating Conditions on Speed and Power of Replica – Based SRAM Circuits Nika Sharifvaghefi Nicholas Kumar EE241 - Spring 2012.
1 A Min-Cost Flow Based Detailed Router for FPGAs Seokjin Lee *, Yongseok Cheon *, D. F. Wong + * The University of Texas at Austin + University of Illinois.
Timing-Driven Routing for FPGAs Based on Lagrangian Relaxation
Introduction to Clock Tree Synthesis
FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.
An Improved “Soft” eFPGA Design and Implementation Strategy
March 28, Glitch Reduction for Altera Stratix II devices Tomasz S. Czajkowski PhD Candidate University of Toronto Supervisor: Professor Stephen D.
1 Area-Efficient FPGA Logic Elements: Architecture and Synthesis Jason Anderson and Qiang Wang 1 IEEE/ACM ASP-DAC Yokohama, Japan January 26-28,
In-Place Decomposition for Robustness in FPGA Ju-Yueh Lee, Zhe Feng, and Lei He Electrical Engineering Dept., UCLA Presented by Ju-Yueh Lee Address comments.
1ISPD'03 Process Variation Aware Clock Tree Routing Bing Lu Cadence Jiang Hu Texas A&M Univ Gary Ellis IBM Corp Haihua Su IBM Corp.
1 COMP541 Sequential Logic Timing Montek Singh Sep 30, 2015.
12006 MAPLD International ConferenceSpaceWire 101 Seminar Data Strobe (DS) Encoding Sam Stratton 2006 MAPLD International Conference.
Clocking System Design
1 WireMap FPGA Technology Mapping for Improved Routability Stephen Jang, Xilinx Inc. Billy Chan, Xilinx Inc. Kevin Chung, Xilinx Inc. Alan Mishchenko,
DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs Deming Chen, Jason Cong , Computer Science Department , UCLA Presented.
SEMI-SYNTHETIC CIRCUIT GENERATION FOR TESTING INCREMENTAL PLACE AND ROUTE TOOLS David GrantGuy Lemieux University of British Columbia Vancouver, BC.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
Congestion-Driven Re-Clustering for Low-cost FPGAs MASc Examination Darius Chiu Supervisor: Dr. Guy Lemieux University of British Columbia Department of.
EEL 5722 FPGA Design Fall 2003 Digit-Serial DSP Functions Part I.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Unified Adaptivity Optimization of Clock and Logic Signals Shiyan Hu and Jiang Hu Dept of Electrical and Computer Engineering Texas A&M University.
Time-borrowing platform in the Xilinx UltraScale+ family of FPGAs and MPSoCs Ilya Ganusov, Benjamin Devlin.
MAPLD 2005 Reduced Triple Modular Redundancy for Tolerating SEUs in SRAM based FPGAs Vikram Chandrasekhar, Sk. Noor Mahammad, V. Muralidharan Dr. V. Kamakoti.
The University of British Columbia
Timing Analysis 11/21/2018.
An Active Glitch Elimination Technique for FPGAs
FPGA Glitch Power Analysis and Reduction
Off-path Leakage Power Aware Routing for SRAM-based FPGAs
FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux
Fast Min-Register Retiming Through Binary Max-Flow
Chapter 3b Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction Prof. Lei He Electrical Engineering Department.
Presentation transcript:

Xiao Patrick Dong Supervisor: Guy Lemieux

Goal: Reduce critical path  shorter period Decrease dynamic power 2

Goal: Reduce critical path  shorter period Decrease dynamic power Approach: Add clock skew at FFs  clock skew scheduling (CSS) Relax CSS constraints  delay padding (DP) Reduce power due to glitching  GlitchLess (GL) 3

Goal: Reduce critical path  shorter period Decrease dynamic power Approach: Add clock skew at FFs  clock skew scheduling (CSS) Relax CSS constraints  delay padding (DP) Reduce power due to glitching  GlitchLess (GL) Implementation: One architectural change, to satisfy all 3 above Add programmable delay elements (PDE) to clocks For every FF (best QoR) For every CLB (best area) 4

One architectural change, to satisfy CSS DP GlitchLess Delay Padding for FPGAs – first time Improved glitch modelling GlitchLess allow period increase Investigates period, power and area tradeoffs PDE sharing Paper accepted to FPT 2009 This presentation Considers GlitchLess only, or CSS/DP only 5

Introduction/Motivation Concept Implementation Results Conclusion Future Work 6

Before: 14-ns critical path delay 7 FF 1 clk FF 2 clk FF 3 clk Min: 5ns Max: 14ns Min: 6ns Max: 6ns local path Alocal path B

Before: 14-ns critical path delay After: 10-ns critical delay – borrowed time 8 FF 1 clk FF 2 clk FF 3 clk 4ns local path Alocal path B Min: 5ns Max: 14ns Min: 6ns Max: 6ns

How to implement CSS? 9 FPGA 2002: Brown

How to implement CSS? 10 FPGA 2002: Brown FPGA 2005: Sadowska

How to implement CSS? Our 2 approaches 11 FPGA 2002: Brown 1 PDE for every FF FPGA 2005: Sadowska

How to implement CSS? Our 2 approaches 12 FPGA 2002: Brown 1 PDE for every FF 1 PDE for every CLB FPGA 2005: Sadowska

CSS constraints on permissible range of skew settings for Xi, Xj Increase permissible range Cannot decrease Dmax Increase Dmin 13 FF 1 clk FF 2 clk Xi=0nsXj=4ns Dmin=4ns Dmax

14 +d d FF 1 clk FF 2 clk Xi=0nsXj=4ns Dmin=4ns Dmax CSS constraints on permissible range of skew settings for Xi, Xj Increase permissible range Cannot decrease Dmax Increase Dmin by d

15 Feature ISCAS 1994 (Sapatnekar) FPGA 2002 (Brown) FPGA 2005 (Sadowska) FPL 2007 (Bazargan) ISPD 2005 (Kourtev) DAC 2005 (LU) FPT 2009 (Our Approach) CSS PlatformASICFPGAASIC FPGA Delayscontinuousdiscretecontinuous discrete Variation modeling DP PlatformASIC FPGA Delayscontinuous discrete Variation modeling Algorithmgraph LPgraph

Output fluctuate due to different input arrival 16 TVLSI 2008: Lamoureux Past approach Our approach

Introduction/Motivation Concept Implementation Results Conclusion Future work 17

1 PDE per FF 20 PDEs (2 FFs per LUT) ~10% area cost CSS – add δ to FF clock DP – rerouting GL – insert FF on path 18

19 Objective: save area 1 PDE per CLB Share PDE with all FFs ~0.5% area cost CSS – add δ to FF clock DP – rerouting GL – insert FF on path

20 Two choices: Choice “1”: GlitchLess Only Choice “2”: CSS+DP Choice “3”: CSS+DP+GlitchLess

Introduction/Motivation Concept Implementation Results Conclusion Future work 21

10 largest MCNC sequential circuits VPR 5.0: timing driven place and route route_chan_width = 104 Architecture 65nm technology k=4, N=10, I=22 (k=6, N=10, I=33 not shown) Glitch estimation: Modified ACE pseudo-random input vectors 22

All saving percentages are % of original period CSS geomean 13% 23

All saving percentages are % of original period CSS geomean 13% CSS+DP geomean 18% (up to 32%) Delay padding benefits 4 circuits (up to 23%) 24

All saving percentages are % of original period CSS geomean 13% CSS+DP geomean 18% (up to 32%) Delay padding benefits 4 circuits (up to 23%) 25 1 PDE per CLB restriction: DP not achievable Geomean 10%

PDEs need power – clock has activity = 1 ! 1 PDE per CLB – significantly lower power 26

PDEs need power – clock has activity = 1 ! 4-clk: power overhead with 3 extra global clocks 27

PDE settings aggregated over all circuits Skew is relatively “spread out” 28

Select nodes above “threshold” Power of node with most glitching = 1.0 Threshold filter selects nodes with most glitching Few (~ 10) high glitch power nodes Most nodes w/ small glitch power Threshold < 0.2 PDE power overhead swamps glitch savings 29 Excludes PDE power Includes PDE power Blue lines  20 PDE per CLB Green lines  1 PDE per CLB

Introduction/Motivation Concept Implementation Results Conclusion Future Work 30

20 PDEs per CLB CSS+DP speedup k=4: geomean 18% (up to 32%) k=6: geomean 20% (up to 38%) Dynamic power reduction Best case savings k=4: average 3% (up to 14%) k=6: average 1% (up to 8%) Swamped by PDE power  need low-power PDE Area penalty k=4: 11.7% k=6: 7.6% 31 1 PDE per CLB CSS speedup k=4: geomean 10% (up to 27%) k=6: geomean 10% (up to 38%) Can’t do delay padding Dynamic power reduction Similar Area Penalty k=4: 0.6% k=6: 0.4%

Improve glitch power estimation Done: fast glitches, analog behavior on single net To do: propagate analog glitches through LUTs Reduce PDE power overhead Low-power PDE (circuit design) Newer benchmarks Bigger, more recent circuits 32

THANK YOU! Questions? THANK YOU! Questions? 20 PDEs per CLB CSS+DP speedup k=4: geomean 18% (up to 32%) k=6: geomean 20% (up to 38%) Dynamic power reduction Best case savings k=4: average 3% (up to 14%) k=6: average 1% (up to 8%) Swamped by PDE power  need low-power PDE Area penalty k=4: 11.7% k=6: 7.6% 33 1 PDE per CLB CSS speedup k=4: geomean 10% (up to 27%) k=6: geomean 10% (up to 38%) Can’t do delay padding Dynamic power reduction Similar Area Penalty k=4: 0.6% k=6: 0.4% THANK YOU! Questions? THANK YOU! Questions? THANK YOU! Questions? THANK YOU! Questions?

PDE adapted from GlitchLess (TVLSI 2008) 2 n delay values Fast path – min size 34 clk Vdd GND Increasing channel length (resistance) SRAM Fast path Slow path Delay Fast path state 000on on on 001on on off 010on off on 011on off off

Need good activity estimator for good power estimates Previous work: ACE 2.0 Uses threshold to determine glitch propagation Threshold = one length-4 segment Question: can the glitch get through the segment? Real glitches have analog behavior Short pulses GRADUALLY damps out 35 Glitch pulse width YES: Propagates indefinitely NO: Does not propagate

Real glitches have analog behavior Short pulses GRADUALLY damps out Group pulse widths into bins – X axis 36

circuitk = 4K = 6 BinsOriginal% diffBinsOriginal% diff bigkey clma diffeq dsip elliptic frisc s s s tseng Positive = original ACE underestimates More underestimates for k=4  arrival time differences for smaller LUTs are smaller

38 iteration = 0; solution[iteration] = CSS ( Pmax, Pmin ); num_edges = find_critical_hold_edges ( edges[iteration] ); //delete edges while (num_edges > 0) { find_deleted_edge_nodes ( edges[iteration] ); //for delay padding later recalculate_binary_bound ( &Pmax, &Pmin ); iteration ++; solution[iteration] = CSS ( Pmax, Pmin ); num_edges = find_critical_hold_edges ( edges[iteration]) ; } while (iteration >= 0) { //in case delay padding fails for current iteration success = delay_padding ( edges[iteration], solution[iteration] ); if (success) break; iteration –= 1; }

During delay padding for each edge: 39 for each node “n” on deleted edge “iedge” { max_padding = get_max_padding( n ); skew = roundup ( fanin  arrival + Ts + MARGIN + fanin_delay(n, fanin), PRECISION ); //for early clock delay = skew – fanin  arrival – fanin_delay( n, fanin ); needed_slack = delay + MARGIN; //for late clock while (delay < needed_delay && needed_slack <= max_padding) increment skew, delay and needed_slack by PRECISION; needed_delay –= delay; if ( needed_delay <= 0.0 ) {done = 1; break;} } if ( done ) check_other_paths(); //check other paths with same source/sink else success = 0;

Similar to delay padding 40 for each level in breadth-first timing graph { rank_nodes ( &list, threshold ); //only nodes with glitch power > threshold for each node “n” in list { skew = roundup ( n  arrival + Ts + MARGIN, PRECISION ); //for early clock needed_slack = skew – n  arrival + MARGIN; //for late clock if ( needed_slack < n  slack ) { for each fanin “f” of node “n” { needed_delay = n  arrival – f  arrival – fanin_delay( n, f ); fanin_delay( n, f ) += needed_delay + needed_slack ; }