FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux

FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux
Period and Glitch Reduction Via Clock Skew Scheduling, Delay Padding and GlitchLess FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux

Introduction/Motivation
Goal: Reduce critical path  shorter period Decrease dynamic power

Goal: Reduce critical path  shorter period Decrease dynamic power Approach: Add clock skew at FFs  clock skew scheduling (CSS) Relax CSS constraints  delay padding (DP) Reduce power due to glitching  GlitchLess (GL)

Goal: Reduce critical path  shorter period Decrease dynamic power Approach: Add clock skew at FFs  clock skew scheduling (CSS) Relax CSS constraints  delay padding (DP) Reduce power due to glitching  GlitchLess (GL) Implementation: One architectural change, to satisfy all 3 above Add programmable delay elements (PDE) to clocks For every FF (best QoR) For every CLB (best area)

Contributions One architectural change, to satisfy
CSS DP GlitchLess Delay Padding for FPGAs – first time Improved glitch modelling GlitchLess allows period increase Investigates period, power and area tradeoffs PDE sharing This presentation Considers GlitchLess only, or CSS/DP only Unified architectural change

Outline Introduction/Motivation Concept Implementation Results
Conclusion Future Work

Concept – CSS Before: 14-ns critical path delay local path A
local path B FF 1 Min: 5ns Max: 14ns FF 2 clk Min: 6ns Max: 6ns FF 3 clk clk

Concept – CSS Before: After: 14-ns critical path delay
10-ns critical delay – borrowed time local path A local path B FF 1 Min: 5ns Max: 14ns FF 2 clk Min: 6ns Max: 6ns FF 3 With the addition of the 4ns delay, there’s effectively more time for the data to travel from FF1 to FF2, so the critical path becomes 10ns. however, this actually comes at the expense of the FF2 to FF3 path, because the signal now leaves FF2 4ns late and has less time to get to FF3 in time. clk clk 4ns

Concept – CSS How to implement CSS? FPGA 2002: Brown
There are several past attempts to do CSS. The first one uses several versions of the global clocks each at a different skew. The user FFs can choose any one of these. The second approach inserts PDEs at the branching points in the H-tree. And the delays add up The third one insert several pdes at each rib in the clock tree, so each rib would actually contain 4 skews clock lines For our approach, there are 2 alternatives: 1 pde of each FF, or 1 pde for each CLB

Concept – CSS How to implement CSS? FPGA 2005: Sadowska
FPGA 2002: Brown There are several past attempts to do CSS. The first one uses several versions of the global clocks each at a different skew. The user FFs can choose any one of these. The second approach inserts PDEs at the branching points in the H-tree. And the delays add up The third one insert several pdes at each rib in the clock tree, so each rib would actually contain 4 skews clock lines For our approach, there are 2 alternatives: 1 pde of each FF, or 1 pde for each CLB

Concept – CSS How to implement CSS? Our 2 approaches
FPGA 2005: Sadowska FPGA 2002: Brown There are several past attempts to do CSS. The first one uses several versions of the global clocks each at a different skew. The user FFs can choose any one of these. The second approach inserts PDEs at the branching points in the H-tree. And the delays add up The third one insert several pdes at each rib in the clock tree, so each rib would actually contain 4 skews clock lines For our approach, there are 2 alternatives: 1 pde of each FF, or 1 pde for each CLB 1 PDE for every FF

Concept – CSS How to implement CSS? Our 2 approaches
FPGA 2005: Sadowska FPGA 2002: Brown 1 PDE for every CLB There are several past attempts to do CSS. The first one uses several versions of the global clocks each at a different skew. The user FFs can choose any one of these. The second approach inserts PDEs at the branching points in the H-tree. And the delays add up The third one insert several pdes at each rib in the clock tree, so each rib would actually contain 4 skews clock lines For our approach, there are 2 alternatives: 1 pde of each FF, or 1 pde for each CLB 1 PDE for every FF

Concept - DP CSS constraints on permissible range of skew settings for Xi, Xj Increase permissible range Cannot decrease Dmax Increase Dmin FF 1 clk FF 2 Xi=0ns Xj=4ns Dmin=4ns Dmax CSS has constraints. In fact, the difference between the skew assigned to FF2 and FF1, xj and xi, has upper and lower bound defined by these 2 contraints

Concept - DP CSS constraints on permissible range of skew settings for Xi, Xj Increase permissible range Cannot decrease Dmax Increase Dmin by d By inserting the delay of d, we relax the hold time constraint, making skew assignment more flexible. Also, not that clocking faster than dmax means wave pipelining. Dmin=4ns Dmax d +d FF 1 clk FF 2 clk Xi=0ns Xj=4ns

Concept – Feature Comparison
ISCAS 1994 (Sapatnekar) FPGA 2002 (Brown) FPGA 2005 (Sadowska) FPL 2007 (Bazargan) ISPD 2005 (Kourtev) DAC 2005 (LU) FPT 2009 (Our Approach) CSS Platform ASIC FPGA Delays continuous discrete Variation modeling DP Algorithm graph LP

Concept - GlitchLess Output fluctuate due to different input arrival
Past approach Our approach TVLSI 2008: Lamoureux

Conclusion Future work Call it implementation.

Architecture 1 1 PDE per FF CSS – add δ to FF clock DP – rerouting
20 PDEs (2 FFs per LUT) ~10% area cost CSS – add δ to FF clock DP – rerouting GL – insert FF on path For css, we just add the delay to the FF For delay padding, we may want to delay the input signal, so we reroute it to an unused FF in the CLB vis the fast local routing, then back to the LUT input. By properly adjusting the skew the FF, we can insert any amount of delay we want. For glitchless, the signal originally goes thru the MUX in combinational mode. We route it to the FF instead, and use clock skew to hold the fluctuating data back – skew assigned is actually the arrival time of the node.

Architecture 2 Objective: save area CSS – add δ to FF clock
1 PDE per CLB Share PDE with all FFs ~0.5% area cost CSS – add δ to FF clock DP – rerouting GL – insert FF on path 19 19

Algorithm – Overall Two choices: Choice “1”: GlitchLess Only
Choice “2”: CSS+DP Choice “3”: CSS+DP+GlitchLess We obtain the net delays in vpr first then use it to obtain circuit activities, which we use to do glitchless. Now, Ace output depend on del input.. After glitchelss is done the delays change, so the activities change as well so we have to input the new delays to ACE

Conclusion Future work

Results – Benchmarking
10 largest MCNC sequential circuits VPR 5.0: timing driven place and route route_chan_width = 104 Architecture 65nm technology k=4, N=10, I=22 (k=6, N=10, I=33 not shown) Glitch estimation: Modified ACE 2.0 5000 pseudo-random input vectors

Results – CSS+DP Only All saving percentages are % of original period
CSS geomean 13%

CSS geomean 13% CSS+DP geomean 18% (up to 32%) Delay padding benefits 4 circuits (up to 23%)

CSS geomean 13% CSS+DP geomean 18% (up to 32%) Delay padding benefits 4 circuits (up to 23%) 1 PDE per CLB restriction: DP not achievable Geomean 10%

Results – CSS+DP Power Implications
PDEs need power – clock has activity = 1 ! 1 PDE per CLB – significantly lower power PDEs need power, since power is proportional to activity, the PDE power overhead is quite high due to the high activity of the clock, which is 1.

Results – CSS+DP Power Implications
PDEs need power – clock has activity = 1 ! 4-clk: power overhead with 3 extra global clocks We compared the power overhead of our approach to the estimated power overhead using 4 skewed version of the global clock. We can see that the power overad is roughly the same magnitude.

Results – Skew Distribution
PDE settings aggregated over all circuits Skew is relatively “spread out”

Results – GlitchLess Only
Select nodes above “threshold” Power of node with most glitching = 1.0 Threshold filter selects nodes with most glitching Few (~ 10) high glitch power nodes Most nodes w/ small glitch power Threshold < 0.2 PDE power overhead swamps glitch savings Blue lines  20 PDE per CLB Green lines  1 PDE per CLB Includes PDE power For glitchless, the y axis is the dynamic power normalized to the result of original VPR solution. the x axis is called threshold. The trend is gradual for threshold < 0.2 because there are only a handful of nodes in the circuit with high glitching power. The majority of the nodes are not very glitchy, and the savings obtained from doing glitchless on them are pretty much overwhelmed by the PDE overhead Excludes PDE power

Conclusion Future Work

Conclusion 20 PDEs per CLB 1 PDE per CLB CSS+DP speedup
k=4: geomean 18% (up to 32%) k=6: geomean 20% (up to 38%) Dynamic power reduction Best case savings k=4: average 3% (up to 14%) k=6: average 1% (up to 8%) Swamped by PDE power  need low-power PDE Area penalty k=4: 11.7% k=6: 7.6% 1 PDE per CLB CSS speedup k=4: geomean 10% (up to 27%) k=6: geomean 10% (up to 38%) Can’t do delay padding Dynamic power reduction Similar Area Penalty k=4: 0.6% k=6: 0.4%

Future Work Improve glitch power estimation Reduce PDE power overhead
Done: fast glitches, analog behavior on single net To do: propagate analog glitches through LUTs Reduce PDE power overhead Low-power PDE (circuit design) Newer benchmarks Bigger, more recent circuits What we have already done is the modification to ACE 2.0 that models glitches with short pulse width and the pulse widths shrinking behavior on the routing track. Still have to model glitch propagation through nets and luts.

THANK YOU! Questions? THANK YOU! Questions? THANK YOU! Questions?
Conclusion 20 PDEs per CLB CSS+DP speedup k=4: geomean 18% (up to 32%) k=6: geomean 20% (up to 38%) Dynamic power reduction Best case savings k=4: average 3% (up to 14%) k=6: average 1% (up to 8%) Swamped by PDE power  need low-power PDE Area penalty k=4: 11.7% k=6: 7.6% 1 PDE per CLB CSS speedup k=4: geomean 10% (up to 27%) k=6: geomean 10% (up to 38%) Can’t do delay padding Dynamic power reduction Similar Area Penalty k=4: 0.6% k=6: 0.4% THANK YOU! Questions? THANK YOU! Questions? THANK YOU! Questions?

Architecture – PDE PDE adapted from GlitchLess (TVLSI 2008)
2n delay values Fast path – min size clk Vdd GND Increasing channel length (resistance) SRAM Fast path Slow path Delay Fast path state 000 on on on 001 on on off 010 on off on 011 on off off Put into back up section

Glitch Estimation Need good activity estimator for good power estimates Previous work: ACE 2.0 Uses threshold to determine glitch propagation Threshold = one length-4 segment Question: can the glitch get through the segment? Real glitches have analog behavior Short pulses GRADUALLY damps out Glitch pulse width YES: Propagates indefinitely NO: Does not propagate Put into backup section

Glitch Estimation Real glitches have analog behavior
Short pulses GRADUALLY damps out Group pulse widths into bins – X axis

Glitch Estimation Positive = original ACE underestimates
More underestimates for k=4  arrival time differences for smaller LUTs are smaller circuit k = 4 K = 6 Bins Original % diff bigkey 913 471 48.4 560 629 -12.4 clma 3794 3407 10.2 2955 3303 -11.8 diffeq 136 129 4.9 63 58 6.7 dsip 698 512 26.6 574 557 3.2 elliptic 11607 10462 9.9 6408 6944 -8.3 frisc 1185 1096 7.5 1045 1088 -4.1 s298 5350 3906 27 4956 5585 -12.7 s38417 29292 19195 34.5 7036 8111 -15.3 s 10455 9246 11.6 4052 4395 -8.5 tseng 1334 1326 0.6 590 608 -3.2

Algorithm – CSS+DP Top Level
iteration = 0; solution[iteration] = CSS ( Pmax, Pmin ); num_edges = find_critical_hold_edges ( edges[iteration] ); //delete edges while (num_edges > 0) { find_deleted_edge_nodes ( edges[iteration] ); //for delay padding later recalculate_binary_bound ( &Pmax, &Pmin ); iteration ++; num_edges = find_critical_hold_edges ( edges[iteration]) ; } while (iteration >= 0) { //in case delay padding fails for current iteration success = delay_padding ( edges[iteration], solution[iteration] ); if (success) break; iteration –= 1; Remove to backup secton

Algorithm – DP During delay padding for each edge:
for each node “n” on deleted edge “iedge” { max_padding = get_max_padding( n ); skew = roundup ( faninarrival + Ts + MARGIN + fanin_delay(n, fanin), PRECISION ); //for early clock delay = skew – faninarrival – fanin_delay( n, fanin ); needed_slack = delay + MARGIN; //for late clock while (delay < needed_delay && needed_slack <= max_padding) increment skew, delay and needed_slack by PRECISION; needed_delay –= delay; if ( needed_delay <= 0.0 ) {done = 1; break;} } if ( done ) check_other_paths(); //check other paths with same source/sink else success = 0;

Algorithm – GlitchLess
Similar to delay padding for each level in breadth-first timing graph { rank_nodes ( &list, threshold ); //only nodes with glitch power > threshold for each node “n” in list { skew = roundup ( narrival + Ts + MARGIN, PRECISION ); //for early clock needed_slack = skew – narrival + MARGIN; //for late clock if ( needed_slack < nslack ) { for each fanin “f” of node “n” { needed_delay = narrival – farrival – fanin_delay( n, f ); fanin_delay( n, f ) += needed_delay + needed_slack ; }

FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux

Similar presentations

Presentation on theme: "FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux

Similar presentations

Presentation on theme: "FPT 2009 Xiao Patrick Dong Supervisor: Guy Lemieux"— Presentation transcript:

Similar presentations

About project

Feedback