Continuing Challenges in Static Timing Analysis
Tom Spyrou, TAU 2013, March 2013
Goal of this talk

- Higher level than the latest trends
- Remind ourselves of the trade-offs we have made as an industry to get a workable solution for STA: signoff, and STA embedded in design synthesis and optimization
- Plenty of discussion exists on new effects; let's discuss core STA
- Explain the basis of industrial algorithms to the academic community
- Challenge ourselves to look at the issues again, in light of technology trends in design and compute
Why Static Timing Analysis

- Dynamic simulation is impossible for even a small chip
- Assume combinational logic only: 100 inputs implies 2^100 vectors needed to verify timing, which is about 10^30 vectors
- If a simulator could process 10^6 vectors per second, this works out to a simulation time of about 10^19 days, or roughly 10^16 years (the arithmetic is checked below); talk about a verification bottleneck!
- Now add in state elements and the problem of making sure the critical path is actually in the vector set
- STA can analyze such a design in 1 minute
- There are some issues, but they can be mitigated
- STA's quality of result is not dependent on the quality of the vector set
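The numbers on this slide can be checked directly; a minimal sketch in Python, assuming the slide's figure of 10^6 vectors per second:

```python
# Back-of-the-envelope check of the exhaustive-simulation argument,
# assuming 100 primary inputs and 1e6 vectors/second throughput.
vectors = 2 ** 100            # exhaustive input patterns, ~1.27e30
seconds = vectors / 1e6       # at one million vectors per second
days = seconds / 86_400       # ~1.5e19 days
years = days / 365.25         # ~4e16 years
print(f"{vectors:.2e} vectors -> {days:.1e} days -> {years:.1e} years")
```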
What is the trade-off / core issues?

- These have been unchanged for a long time
- A different kind of setup: the result depends on the quality of the constraints and exceptions
  - If all storage elements are clocked and the I/Os are constrained, the analysis is generally safe
- Less accurate delay analysis: the exact path is not really known, as it is with event-driven simulation
  - When STA was first introduced this was less of an issue; path-based analysis (PBA) is now essential
- Introduction of false paths, because the analysis is topological, not functional
  - Users have to specify these manually
- Multiple circuit modes take extra effort, not just more vectors
- Loops and level-sensitive latches add complexity
Analysis

- Every circuit looks the same to STA, since it ignores the function of the logic
Topological analysis

- Simplifies the problem, at the cost of possibly reporting false paths (a concrete example follows)
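To make the false-path point concrete, here is a minimal sketch (my illustration, not from the talk) of the classic reconvergent-mux case: two muxes share a select, so the timing graph contains a path that no input vector can ever sensitize:

```python
# Two 2:1 muxes share select s. The path in1 -> mux1 -> mux2 -> out uses
# mux1's s==1 input and then mux2's s==0 input, a contradiction: the path
# exists topologically but is functionally false.
from itertools import product

def mux(a, b, s):
    """Return a when s == 0, b when s == 1."""
    return b if s else a

def sensitized(in1, other, s):
    # Toggle in1 and see whether the change ever reaches the output.
    baseline = mux(mux(other, in1, s), other, s)
    flipped  = mux(mux(other, 1 - in1, s), other, s)
    return baseline != flipped

print(any(sensitized(i, o, s) for i, o, s in product([0, 1], repeat=3)))
# -> False: no vector sensitizes the topologically longest path.
```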
What do recent trends mean?

Design

- Hyper-optimization means accuracy is critical
  - When a chip is designed at a bleeding-edge technology it will be pushed on all dimensions of power, performance, and area
  - Simulation-based delay calculation
  - Path-based analysis
- Design size means memory use is the #1 problem
  - The largest chips are approaching 1 TB of RAM needed for flat runs
  - Hierarchical / parallel solutions must prioritize memory use on compute nodes
  - Runtime also needs to be faster, but the first step is to run on machines of reasonable cost
  - A recent design used 750+ GB of RAM for single-mode/corner STA

Compute

- CPU is cheap, data movement is expensive
  - Whenever you hear "it's an expensive calculation", don't avoid it
- Parallel computing must not only improve performance but also accuracy and features
  - Don't just make the same problem go faster, or just divide the data
If you ask a designer what doesn't work well

- Hierarchical timing in the final verification loop
- SI calculations are very conservative
- SDCs are large and hard to verify
- Worst-case timing is done, and process variation is modeled very pessimistically
- Block-based analysis loses too much accuracy
- True-delay reporting (examining the combinational logic to prove a path true) is slow and can't run during optimization
- Libraries limit the flexibility of analysis
STA Industry and Academia

- STA technology has been innovated inside industry much more than in academia
- The key approaches are not documented
  - There is no open-source reference to build from
  - Industry protects the core concepts as trade secrets
- Academia rarely publishes on STA beyond single-clock designs or delay calculation
- We need a book on the core search algorithm
Example: Veritime, from the '90s

- An STA engine that required vectors for the clock
  - Dynamic simulation of the clock
  - Period, multi-cycle paths, and clock-to-clock false paths automatically determined
  - STA for the data portion
- Absorbed by Cadence and forgotten, since at the time SDCs were a lot easier to hand-inspect
Requirements of an STA Engine

I would like to begin by documenting the basics that everyone in industry knows. These are not company-specific trade secrets.

- Must run in memory and runtime linear in circuit size, number of clocks, number of exceptions, and number of storage elements
  - Touch each vertex only once (maybe twice, to simplify pre-processing), not once per clock or exception
- Must support SDC timing constraints
  - Clocks, clock-tree assumptions, multi-cycle paths, false paths, path delays, cases, and modes
- Must be nearly SPICE-accurate in delays and support path-based analysis
- Must be incremental enough
  - Netlist changes with a full retrace on one extreme
  - Query-based incremental timing with limited tracing on the other
The Basic Search

- The graph
  - Startpoints are the inputs of the circuit and the clock inputs of storage elements
  - Endpoints are the outputs of the circuit and the data inputs of storage elements
- Propagate the clocks
  - For each clock input, BFS to all clocked data pins
  - Offset startpoint arrival times and endpoint required times with information from the clock propagation and cycle accounting
- Propagate the data (sketched below)
  - Use a BFS from startpoints to endpoints
  - Use multiple timing totals at every pin to account for multiple clocks and exceptions
  - Can optionally store back-pointers to record the K most critical paths, but this time/memory is wasted on optimization programs and should be left to a reporting phase
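A minimal sketch of the data-propagation phase, in plain Python; the graph structure, delay model, and names are illustrative assumptions, not any vendor's implementation:

```python
# Levelized (topological) BFS over the timing graph: each vertex is
# touched once, and the worst arrival time is kept at each pin.
from collections import defaultdict, deque

def propagate_arrivals(fanout, delay, start_arrival):
    """fanout: pin -> list of fanout pins; delay: (u, v) -> arc delay;
    start_arrival: startpoint pin -> arrival offset from clock propagation.
    Assumes every pin is reachable from a startpoint."""
    indeg = defaultdict(int)
    for u in fanout:
        for v in fanout[u]:
            indeg[v] += 1
    arrival = dict(start_arrival)
    ready = deque(start_arrival)          # startpoints have no timing fanin
    while ready:
        u = ready.popleft()
        for v in fanout.get(u, []):
            cand = arrival[u] + delay[(u, v)]
            arrival[v] = max(arrival.get(v, cand), cand)  # keep worst arrival
            indeg[v] -= 1
            if indeg[v] == 0:             # all fanins seen: safe to expand
                ready.append(v)
    return arrival
```

Required times propagate the same way in reverse from the endpoints, and slack at each pin is required time minus arrival time.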
Multiple Timing Totals with Partial Path

- A simplistic implementation gives each clock and each exception its own total
  - Simultaneously, or via separate traces
  - Memory and/or runtime increase quickly
    - Occurrence pins are the most common netlist object
    - There can be thousands of exceptions
- At timing endpoints, like totals can be combined and evaluated
- At timing endpoints, point-to-point exceptions can be evaluated
Multiple Timing Totals

(figure)
Multiple Timing Totals with path completion data

- A BFS has no information about paths
- However, timing exceptions are specified in terms of -from, -through, and -to points, with a boolean expression of pins
  - Example: Mcp -from a -through {b c} -to d
  - From a, through b or c, and also through d
- Each total can carry a small state machine recording which exception points it has seen (a sketch follows the figure below)
- At timing endpoints, like totals with like exception-point data can be combined; totals whose exception data differ must not be combined
Through exceptions

(figure)
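A minimal sketch of the tag-and-merge idea, assuming (my names, not the talk's) a tag made of the launching clock plus the set of through-points seen so far; totals at a pin merge only when their tags match:

```python
# Each timing total carries a tag: its clock and a tiny state machine
# (here just a frozenset) of exception points crossed so far.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    clock: str
    through_seen: frozenset = frozenset()

def advance(tag, pin, through_points=frozenset({"b", "c", "d"})):
    """Update the exception state machine when the total crosses a pin."""
    if pin in through_points:
        return Tag(tag.clock, tag.through_seen | {pin})
    return tag

def merge(totals):
    """At a pin or endpoint, keep one worst arrival per distinct tag;
    totals: list of (tag, arrival) pairs."""
    worst = {}
    for tag, arrival in totals:
        worst[tag] = max(worst.get(tag, arrival), arrival)
    return list(worst.items())
```

At an endpoint, a total whose state shows it passed through {b or c} and d would receive the multicycle adjustment; totals that missed the through-points keep the default cycle accounting.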
Framework can be used for Clock Pessimism

(figure: a clock tree with arc delays d1, d2, d3; arrivals are tagged with the clock-tree delays they accumulated, e.g. Arr 1 for pins through {d1, d2} and Arr 2 for pins through {d1, d2, d3})
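The same tagging framework supports clock pessimism removal. A minimal sketch, under the assumption (mine, not the slide's) that each arrival tag records the ordered clock-tree arcs it traversed and each arc carries early/late delays:

```python
# Pessimism credit: the late-minus-early delay over the clock path that
# launch and capture share, which worst-case analysis would otherwise
# count at opposite corners on the same physical wires.
def cppr_credit(launch_arcs, capture_arcs, delays):
    """launch/capture_arcs: ordered clock-tree arcs each path traversed;
    delays: arc -> (early, late) delay pair."""
    credit = 0.0
    for a, b in zip(launch_arcs, capture_arcs):
        if a != b:
            break                     # paths diverge; no more common arcs
        early, late = delays[a]
        credit += late - early        # counted twice pessimistically; give it back
    return credit

# Launch through d1,d2,d3 and capture through d1,d2: the common prefix
# is {d1, d2}, so the credit is (1.2-1.0) + (0.6-0.5) = ~0.3.
print(cppr_credit(["d1", "d2", "d3"], ["d1", "d2"],
                  {"d1": (1.0, 1.2), "d2": (0.5, 0.6), "d3": (0.3, 0.4)}))
```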
Delay Calculation and Multiple Timing Totals

- Worst-case slew merging is pessimistic, but allows delay calculation to be a pre-processing step
- If delay calculation is done inside the BFS, critical slew merging can be done instead
- It is also possible for each timing total to carry its own slew, improving accuracy (both policies are sketched below)
- Loops can be auto-detected and dynamically broken, avoiding accidental critical-path breaks
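A minimal sketch contrasting the two policies, with an assumed arc_delay(slew) -> (delay, output_slew) model (names and numbers are illustrative):

```python
def propagate_arc(totals, arc_delay):
    """totals: list of (arrival, slew) pairs at the arc input.
    Policy 1: worst-case slew merging, so delay calculation can be a
    pre-process; every total gets the same pessimistic delay and slew.
    Policy 2: each total carries its own slew for accuracy, at the cost
    of memory and of doing delay calculation inside the BFS."""
    worst_slew = max(s for _, s in totals)
    d, out_slew = arc_delay(worst_slew)
    merged = [(a + d, out_slew) for a, _ in totals]
    exact = [(a + arc_delay(s)[0], arc_delay(s)[1]) for a, s in totals]
    return merged, exact

# Toy linear delay model: slower input slew -> larger delay and slew.
merged, exact = propagate_arc([(1.0, 0.10), (1.2, 0.05)],
                              lambda s: (0.5 + s, 0.8 * s))
print(merged)  # [(1.6, 0.08), (1.8, 0.08)]  <- pessimistic
print(exact)   # [(1.6, 0.08), (1.75, 0.04)] <- per-total slews
```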
Incremental Timing

- Netlist edits with a full retime
- Netlist edits with a fanout-cone retime
- Netlist edits with a query-based retime
- The choice of how incremental to go depends on the optimization approach
  - More global cost functions require less incrementality
  - More locally greedy approaches require more
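A minimal sketch of the two extremes (illustrative, not any tool's API):

```python
from collections import deque

def retime_fanout_cone(changed_pin, fanout, recompute):
    """Eager extreme: walk forward from the edited pin, recomputing every
    pin in its fanout cone. A production engine would levelize this walk
    so each vertex is recomputed only once."""
    work, seen = deque([changed_pin]), set()
    while work:
        u = work.popleft()
        if u in seen:
            continue
        seen.add(u)
        recompute(u)                      # update arrival/slew totals at u
        work.extend(fanout.get(u, []))

def query_arrival(pin, fanin, arrival, dirty, recompute):
    """Lazy extreme: nothing is retimed until a query arrives; then pull
    only the dirty portion of the queried pin's fanin cone."""
    if pin in dirty:
        for f in fanin.get(pin, []):
            query_arrival(f, fanin, arrival, dirty, recompute)
        arrival[pin] = recompute(pin)
        dirty.discard(pin)
    return arrival[pin]
```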
STA needs innovation

- Increased sharing with academia
- Increased research on the problems that are still problems
- Redirected solutions in light of the design and compute trends
- There is a lot of interesting work to do!
Some ideas

- A new constraint language that is more functional
  - Try to propagate the function along with the delays
  - Some combination with cycle-based simulation
  - Constraint language enhancements
- Library-less delay models
- A new data model that is stage-based, with a focus on data locality
- A hierarchical timing model that is truly context-independent, within acceptable limitations
- Constraint improvements that help constrain blocks more accurately