Published by Maliyah Holdway. Modified over 2 years ago.

1
Mapping for Better Than Worst-Case Delays in LUT-Based FPGA Designs
Kirill Minkovich and Jason Cong
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
Supported by the National Science Foundation under grant CCF-0530261

2
Variation and its effects
Environmental variation
- Causes: overheating and voltage fluctuations
- Addressed (in part) by: cooling and better power supplies
Process variation
- Causes: dopant density, edge geometry, stress during manufacturing, and much more
- Addressed (in part) by: adding a slack of as much as 3-sigma for delay variation
Data variation
- Causes: output stabilization varying greatly between different data
- Addressed by: highly restrictive asynchronous designs and the Razor architecture
Solutions
- Speed binning and more accurate estimates
  - Only deals with process variation
- Variable clocking (Razor architecture)
  - Deals with all 3 variations!
[Figure: intra-die variations in ILD thickness]

3
High performance circuits
Worst-case delay minimization
- Hitting a wall due to feature size limits
  - Can't keep up with Moore's law
  - Conservative timing due to variation
Typical-case delay minimization
- Defined: delay for expected data to propagate through the circuit
  - Usually much smaller than worst-case delay
- Harder to optimize circuits
  - Change thinking about circuit optimization
- Requires special architecture, like the Razor architecture (MICRO '03)
UCLA VLSICAD LAB

4
Razor flip-flop implementation
[Figure: Razor flip-flop between logic stages L1 and L2. The main flip-flop is clocked by clk, the shadow latch by clk_del, and an error comparator drives Error_L.]
Slide borrowed from the Razor (MICRO '03) presentation
Main flip-flop
- Clocked faster than worst-case delay
Shadow latch
- Clocked with a delayed clock to catch any errors
Error
- Occurs when the main flip-flop and the shadow latch differ
- On the next clock cycle, the shadow latch value moves into the main flip-flop
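The detect-and-recover behaviour on this slide can be sketched in a few lines of Python. This is a minimal behavioural model, not the Razor circuit: the class name `RazorFlipFlop`, its methods, and the choice to model the shadow latch as a second sample taken at `t_clk_del` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RazorFlipFlop:
    """Toy model: two samples of one signal, taken at different times."""
    main: int = 0
    shadow: int = 0

    def clock(self, old_value, new_value, arrival, t_clk, t_clk_del):
        """Sample a signal that changes from old_value to new_value at `arrival`.

        Returns True when the two samples disagree (a timing error).
        """
        # The main flip-flop only sees the new value if it arrived in time.
        self.main = new_value if arrival <= t_clk else old_value
        # The shadow latch samples later, so it tolerates a later arrival.
        self.shadow = new_value if arrival <= t_clk_del else old_value
        return self.main != self.shadow

    def recover(self):
        """Next cycle: the shadow latch value moves into the main flip-flop."""
        self.main = self.shadow


ff = RazorFlipFlop()
late = ff.clock(old_value=0, new_value=1, arrival=5, t_clk=4, t_clk_del=6)
if late:
    ff.recover()
```

In this run the data arrives after the overclocked edge (5 > 4) but before the delayed edge (5 <= 6), so the comparison flags an error and `recover()` restores the correct value.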

5
Razor timing error detection
Second sample of the logic value is used to validate the earlier sample
Key design issues:
- Maintaining forward progress
- Short-path impact on the shadow latch
- Overhead of error detection and correction
[Figure: timing example with main flip-flops (clk), a shadow latch (clk_del), and a MEM stage]
Slide borrowed from the Razor (MICRO '03) presentation

6
FSM to Razor transformation
Possible to convert most circuits to Razor
[Figure: an FSM (input registers, FSM combinational logic, state registers, output registers) converted into a Razor black box. Labels: Razor (input) registers, Razor (state) registers, Razor (output) registers, Razor (stabilization) registers, combinational logic, stallable buffer, Data, Data Valid, Data Ready.]

7
Problem formulation
Definitions
- Maximum depth
  - Shadow latch (worst-case delay)
- Target depth
  - Main (overclocked) flip-flop
Performance measuring
- Can't use the clock due to errors!
  - Errors due to overclocking (any switching between target depth and max depth)
- So we have to use expected delay instead of delay
  - ExpDelay(d) = d * Pr(no error | clock d) + (d + t_recover) * Pr(error | clock d)
  - d is the target depth and t_recover is the time to recover from an error
Find d
- Linear search!
  - BestExpDelay = min(ExpDelay(d) | max_depth/2 ≤ d < max_depth)
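The expected-delay formula and the linear search over target depths translate directly into code. The sketch below assumes integer depths and takes the error probabilities as given; the function names and the example error profile are hypothetical, chosen only to show that the best target depth need not be either end of the range.

```python
import math


def expected_delay(d, p_error, t_recover):
    """ExpDelay(d) = d * Pr(no error) + (d + t_recover) * Pr(error)."""
    return d * (1.0 - p_error) + (d + t_recover) * p_error


def best_expected_delay(p_error_at, max_depth, t_recover):
    """Linear search over max_depth/2 <= d < max_depth.

    p_error_at maps each candidate target depth d to Pr(error | clock d).
    Returns (best expected delay, target depth that achieves it).
    """
    candidates = range(math.ceil(max_depth / 2), max_depth)
    return min((expected_delay(d, p_error_at[d], t_recover), d)
               for d in candidates)


# Hypothetical error profile for a circuit with max_depth = 8.
p_error_at = {4: 0.60, 5: 0.25, 6: 0.10, 7: 0.02}
best, target = best_expected_delay(p_error_at, max_depth=8, t_recover=8)
print(f"best expected delay {best:.2f} at target depth {target}")
```

With these made-up numbers, clocking at depth 6 wins: depth 7 has fewer errors but a longer cycle, and depth 5 pays for too many recoveries.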

8
Optimization goals
The expected delay
- Total delay for data to propagate and recover from any errors
  - Reduce the probability of an error
Straightforward, if we are given a target depth
  - Minimize the probability of switching after the target depth
What can we do without the target depth?
- We try to get the switching to occur as early as possible
- Extra area overhead
- Hard to compare solutions (a special cost function is needed)
[Figure labels: Clock, Switching Activity]

9
BTWMap algorithm overview
1. Decompose into 2-input gates
2. Cut selection (repeated 400 times)
   - Simulate 256 random input values over all cuts
   - Assign cost based on switching and depth
   - Choose cuts to minimize cost
   - Save the scaled simulation data for the next iteration
3. Area recovery
   - Target clock assignment
   - Area/performance tradeoff
4. Done!
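The simulation step can be illustrated with a small unit-delay simulator that applies random input vectors and records the depth at which an output last switches. This is a sketch under stated assumptions, not BTWMap's implementation: the three-gate netlist, the unit-delay model, and all function names are hypothetical.

```python
import random

# Hypothetical netlist of 2-input gates in topological order:
# node -> (gate function, fanin names).
NETLIST = {
    "n1": (lambda a, b: a & b, ("a", "b")),
    "n2": (lambda a, b: a ^ b, ("n1", "c")),
    "y": (lambda a, b: a | b, ("n2", "d")),
}
INPUTS = ("a", "b", "c", "d")


def step(state):
    """Advance one unit of delay: every gate reads its fanins' old values."""
    nxt = dict(state)
    for node, (fn, fanins) in NETLIST.items():
        nxt[node] = fn(*(state[f] for f in fanins))
    return nxt


def settle(vec):
    """Steady state of the network for one input vector."""
    state = {node: 0 for node in NETLIST}
    state.update(vec)
    for _ in range(len(NETLIST)):  # enough steps to cover the depth
        state = step(state)
    return state


def last_switch_depth(prev_vec, next_vec, output="y"):
    """Depth at which `output` last changes after the inputs switch (0 = never)."""
    state = settle(prev_vec)
    state.update(next_vec)
    last = 0
    for t in range(1, len(NETLIST) + 1):
        nxt = step(state)
        if nxt[output] != state[output]:
            last = t
        state = nxt
    return last


def error_probability(target_depth, samples=256, seed=0):
    """Fraction of random input changes whose output switches after target_depth."""
    rng = random.Random(seed)

    def vec():
        return {pi: rng.randint(0, 1) for pi in INPUTS}

    late = sum(last_switch_depth(vec(), vec()) > target_depth
               for _ in range(samples))
    return late / samples
```

Calling `error_probability(d)` for each candidate depth `d` gives the kind of per-depth switching profile that the cost function and the expected-delay calculation work from.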

10
Cut selection
Cut ranking
- Can't look at just switching activity for each depth
- For example, cut 2 is better than cut 1

Probability of switching
Depth  Cut 1  Cut 2
3      3%     4%
2      50%    5%
1      70%

Expired simulation data
- Keep the old data
- Assume the previous iteration's costs are still valid, but scale them down
- Allows the algorithm to converge on a solution
- Keeping the old data decreases Pr(error) by an average of 3.5%
  - Huge improvement, since for us Pr(error) <= 5%

11
BTWMap - area recovery (target depth)
Idea
- Find a target depth
  - How much to overclock
- Ignore the switching that happens below this depth
Implementation
- Set the outputs' target depth
- Select cuts PO->PI while propagating the target depth
  - Works similarly to worst-case depth, but calculated PO->PI using MIN instead of PI->PO using MAX
Benefits
- Moderate reduction in area
[Figure: example network with node labels 2, 1, 3, 3 and target depths 2, 2, 1, 1, 0, 0]
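The contrast between the two traversals can be sketched as follows. The example graph, the function names, and the unit-delay rule (each fanin's target is one less than its fanout's) are assumptions for illustration; the slide only states the direction of traversal and the use of MIN versus MAX.

```python
# Hypothetical mapped network in topological order: node -> fanins.
FANINS = {
    "g1": ("a", "b"),
    "g2": ("g1", "c"),
    "y": ("g1", "g2"),
}
PRIMARY_INPUTS = ("a", "b", "c")


def worst_case_depth(fanins, primary_inputs):
    """PI -> PO traversal: a node's depth is 1 + MAX of its fanin depths."""
    depth = {pi: 0 for pi in primary_inputs}
    for node, ins in fanins.items():
        depth[node] = 1 + max(depth[i] for i in ins)
    return depth


def propagate_target_depth(fanins, output_targets):
    """PO -> PI traversal: a node's target is the MIN over its fanouts of
    (fanout target - 1), assuming one unit of delay per node."""
    target = dict(output_targets)
    for node in reversed(list(fanins)):
        if node not in target:
            continue  # node does not reach any output with a target
        for i in fanins[node]:
            target[i] = min(target.get(i, float("inf")), target[node] - 1)
    return target


depth = worst_case_depth(FANINS, PRIMARY_INPUTS)
target = propagate_target_depth(FANINS, {"y": 2})  # overclock: 2 < depth["y"]
```

Here `g1` feeds both `y` (which would allow a target of 1) and `g2` (which requires 0), so MIN keeps the tighter value. A negative target at a primary input marks a path longer than the target depth, which is where overclocking errors can originate.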

12
BTWMap - area-performance tradeoff
Idea
- Relax the minimum switching cost of each gate
- Give area recovery techniques room to work
Implementation
- Set outputs to the initial amount they can be relaxed
- Make a relaxation and propagate what your inputs can change using:
  - Depth of the inputs
  - How much switching slack is left
  - Input to output switching correlation
    - For example, Pr(y switching | x1 switched) = 75% while Pr(y switching | x2 switched) = 50%
Benefits
- Accurate relaxation estimates
- Large reduction in area
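One way to obtain an input-to-output switching correlation such as Pr(y switching | x1 switched) is to enumerate every pair of input vectors in which that input changes. The sketch below does this exactly for a single gate, assuming all input vectors are equally likely; the function name is hypothetical, and BTWMap itself may well estimate these values from its simulation data instead.

```python
from itertools import product


def switch_correlation(fn, n_inputs, i):
    """Pr(output switches | input i switched), uniform over all vector pairs."""
    hits = total = 0
    for prev in product((0, 1), repeat=n_inputs):
        for nxt in product((0, 1), repeat=n_inputs):
            if prev[i] == nxt[i]:
                continue  # condition on input i actually switching
            total += 1
            hits += fn(*prev) != fn(*nxt)
    return hits / total


def and_gate(x1, x2):
    return x1 & x2


def ignores_x2(x1, x2):
    return x1


p_and = switch_correlation(and_gate, 2, 0)
p_buf = switch_correlation(ignores_x2, 2, 0)
```

An input with a low correlation to the output has little influence on when the output switches, which is the kind of information the relaxation step uses when deciding how much slack to pass to each input.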

13
BTWMap results example
BTWMap mapping comparison
Test circuit is PDC from the MCNC benchmark suite
Comparing 4 methods:
A. Depth-optimal mapping with depth relaxation on non-critical paths for area saving
B. Depth-optimal mapping without depth relaxation
C. BTWMap
D. BTWMap with area recovery

14
What circuits can't be optimized
Maximum Razor clock = (max depth)/2
Already good = switching < 2% at maximum Razor clock
- Very low switching at maximum Razor clock
- 4 of the MCNC suite
Too bad = switching > 90% at max depth
- All the switching happens at the very last depth
- Very hard to optimize
- Have to reduce the switching activity a minimum of 20x at that depth
- 5 of the MCNC suite
Easy to test and exclude
- Map using ABC and check the switching probabilities
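The exclusion test can be written as a simple filter over the switching probabilities measured on an ABC mapping. The 2% and 90% thresholds come from the slide; the function names, the argument names, and the label for circuits that pass both tests are assumptions.

```python
def max_razor_clock(max_depth):
    """Shortest main-clock depth considered: half of the maximum depth."""
    return max_depth / 2


def classify(p_switch_at_max_razor_clock, p_switch_at_max_depth):
    """Label a circuit from probabilities expressed as fractions in [0, 1]."""
    if p_switch_at_max_razor_clock < 0.02:
        return "already good"  # almost nothing switches late
    if p_switch_at_max_depth > 0.90:
        return "too bad"  # nearly all switching is at the last depth
    return "candidate"  # worth running through BTWMap


labels = [
    classify(0.01, 0.00),
    classify(0.95, 0.93),
    classify(0.60, 0.45),
]
```

The three calls return "already good", "too bad", and "candidate" respectively.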

15
Sample results
The example below is for the MCNC benchmark SEQ.

                     ABC     BTWMap  BTWMap+area
Area                 1000    1258    1111
Max depth            6       6       6
Ave delay            5.09    3.80    4.28
Best pipeline delay  6.00    5.23    5.30

Switch = switching probability (%), Error = probability of error (%), ExpDel = expected delay

       ABC                       BTWMap                    BTWMap+area
Depth  Switch  Error   ExpDel    Switch  Error   ExpDel    Switch  Error   ExpDel
5      45.49   45.49   7.73      3.85    3.85    5.23      4.99    4.99    5.30
4      50.03   95.52   9.73      17.53   21.38   5.28      47.04   52.02   7.12
3      72.51   100.00  9.00      45.91   67.30   7.04      52.80   100.00  9.00
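The SEQ numbers can be cross-checked with a short script. Two things here are inferences from the figures rather than statements on the slide: the error probability at a depth appears to be the running sum of the switching probabilities at that depth and deeper, capped at 100%, and the expected delays appear to use a recovery time of 6 (the maximum depth). The names below are hypothetical.

```python
T_RECOVER = 6  # assumption: recovery time equal to SEQ's maximum depth

# Switching probability (%) per depth, as listed for SEQ.
SWITCH_PCT = {
    "ABC": {5: 45.49, 4: 50.03, 3: 72.51},
    "BTWMap": {5: 3.85, 4: 17.53, 3: 45.91},
    "BTWMap+area": {5: 4.99, 4: 47.04, 3: 52.80},
}


def error_pct_by_depth(switch_pct):
    """Running sum from the deepest row upward, capped at 100%."""
    running, out = 0.0, {}
    for depth in sorted(switch_pct, reverse=True):
        running = min(100.0, running + switch_pct[depth])
        out[depth] = running
    return out


def expected_delay(depth, error_pct, t_recover=T_RECOVER):
    """ExpDelay(d) = d * Pr(no error) + (d + t_recover) * Pr(error)."""
    p = error_pct / 100.0
    return depth * (1 - p) + (depth + t_recover) * p


for name, switch_pct in SWITCH_PCT.items():
    errors = error_pct_by_depth(switch_pct)
    delays = {d: round(expected_delay(d, e), 2) for d, e in errors.items()}
    print(name, delays)
```

Under these assumptions the script matches the listed expected delays to two decimals, and the smallest value for each of BTWMap and BTWMap+area is its reported best pipeline delay.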

16
Results - expected delay and area
- Performance improvement: 13% with BTWMap and 11% after area recovery
- The area recovery saves over 16% of the lost area
- In the best case (ignoring switching), we're still 3% away from ABC
  - Trading 7% for much better switching activity

          Expected delay              Ratio to ABC          Area                        Area increase
Circuit   ABC  BTWMap  BTWMap+area    BTWMap  BTWMap+area   ABC   BTWMap  BTWMap+area   BTWMap  BTWMap+area
alu4      7.0  6.3     6.4            90%     92%           722   922     776           28%     8%
apex2     6.8  6.4     6.6            94%     97%           972   1200    1040          24%     7%
apex4     7.0  6.1     6.1            87%     87%           793   1028    842           30%     6%
clma      8.4  8.2     8.0            97%     95%           4216  5674    5091          35%     21%
misex3    6.0  6.0     6.0            100%    100%          742   899     819           21%     10%
pdc       8.5  6.5     7.2            77%     85%           2234  2934    2498          31%     12%
s298      2.3  2.0     2.0            87%     87%           44    50      49            14%     11%
s38417    8.0  6.0     6.0            75%     75%           2909  3972    3234          37%     11%
s38584.1  6.4  5.3     5.3            83%     84%           3592  4543    3856          27%     7%
seq       6.0  4.9     5.3            81%     88%           1000  1258    1111          26%     11%
spla      7.3  6.5     6.7            89%     92%           2127  2735    2399          29%     13%
Geomean                               87.0%   88.8%                                     26.4%   10.1%
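The area geomeans can be recomputed from the per-circuit areas. The sketch below assumes the reported figures are geometric means of the per-circuit percentage increases; that reading is an inference from the numbers, not something the slide states, and the variable names are hypothetical.

```python
from statistics import geometric_mean

# circuit: (ABC area, BTWMap area, BTWMap+area area), as listed in the results.
AREAS = {
    "alu4": (722, 922, 776),
    "apex2": (972, 1200, 1040),
    "apex4": (793, 1028, 842),
    "clma": (4216, 5674, 5091),
    "misex3": (742, 899, 819),
    "pdc": (2234, 2934, 2498),
    "s298": (44, 50, 49),
    "s38417": (2909, 3972, 3234),
    "s38584.1": (3592, 4543, 3856),
    "seq": (1000, 1258, 1111),
    "spla": (2127, 2735, 2399),
}


def increase_pct(base, new):
    """Area increase over the ABC baseline, in percent."""
    return 100.0 * (new - base) / base


btwmap_geomean = geometric_mean(
    [increase_pct(abc, btw) for abc, btw, _ in AREAS.values()])
recovered_geomean = geometric_mean(
    [increase_pct(abc, rec) for abc, _, rec in AREAS.values()])
print(f"BTWMap: {btwmap_geomean:.1f}%  BTWMap+area: {recovered_geomean:.1f}%")
```

Under that assumption the script reproduces the 26.4% and 10.1% figures, a gap of about 16 percentage points between the two mappings.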

17
Conclusion
BTWMap work includes:
- Methodologies for measuring performance on circuits optimized for average-case delay
- Algorithms for optimizing circuits for average-case delay
- Implementation and release of these tools (alpha version): http://cadlab.cs.ucla.edu/software_release/btwmap/
Results summary: BTWMap (and the area recovery version)
- 14% (and 8%) average delay reduction
- 13% (and 11%) pipeline improvement
- 26% (and 10%) area increase
