Fast Min-Register Retiming Through Binary Max-Flow
Aaron Hurst, Alan Mishchenko, Robert Brayton
FMCAD 2007
Retiming
- Retiming is the structural relocation of registers such that output functionality is preserved.
- Registers can be relocated to one of several ends:
  - minimizing worst-case delay
  - minimizing the number of registers
  - either of the above under constraints
  - optimally or heuristically
  - other…?
- In this work, we look at optimal register minimization without delay constraints: “min-register” retiming.
Why Min-Register?
- Register count is critical in sequential verification: it can have a “make or break” effect.
- Low investment.
- Evidence of net utility:
  - academic investigation: Cabodi, Quer, Somenzi, DAC '01
  - commercial sequential verification tools!
- Effect of a lower register count:
  - state representation: linear decrease
  - total state space: exponential decrease
  - reachable state space: potential decrease
Orientation
- Consider one combinational frame of the circuit: a single directed acyclic graph (DAG) of combinational logic.
- Nodes: logic gates
- Edges: pairwise net connections
- Inputs: register outputs, primary inputs
- Outputs: register inputs, primary outputs
[Figure: one combinational frame, with primary inputs and register outputs on the input side and register inputs and primary outputs on the output side]
Cuts in a Frame
- Momentarily ignore all primary I/Os and their transitive fan-in/fan-out.
- A retiming = a complete cut of the DAG.
- Number of registers = width of the cut.
- The problem consists of finding a minimum cut.
Max-Flow Formulation
- Min-cut/max-flow duality:
  - edges in the graph are assigned a capacity
  - min-cut width = max-flow through the graph
- Compute the flow from source to sink.
- Partition the graph into {S, R}, with S ∩ R = ∅:
  - S = {s : an augmenting path exists from the source to s}
  - R = {r : no augmenting path exists from the source to r}
- In other words: reachable versus unreachable in the residual graph.
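The partition step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a plain Edmonds-Karp max-flow, and the function name `max_flow_partition`, the dict-of-dicts capacity map, and the toy graph are all hypothetical. After the flow saturates, S is simply the set of nodes the last breadth-first search could still reach.

```python
from collections import deque

def max_flow_partition(cap, source, sink):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity map (assumed format).

    Returns (flow_value, S), where S is the set of nodes still reachable
    from the source in the final residual graph.
    """
    # Residual capacities, with a zero-capacity entry for every reverse arc.
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)  # no augmenting path is left

        # Walk back from the sink, find the bottleneck, and push flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

# Toy frame with unit capacities: two inputs converge on c, which feeds the sink.
cap = {
    "s": {"a": 1, "b": 1},
    "a": {"c": 1},
    "b": {"c": 1},
    "c": {"t": 1},
}
flow, S = max_flow_partition(cap, "s", "t")
R = {"s", "a", "b", "c", "t"} - S
print(flow, sorted(S), sorted(R))  # 1 ['a', 'b', 'c', 's'] ['t']
```

In this toy graph the minimum cut has width 1 and lies on the single edge from c to the sink, so everything except the sink ends up in S.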
Closest Min-Cut
- Insert registers between a node and any fan-out that lies in the other partition.
- Fast: remove the old registers; insert the new ones.
- The min-cut is not unique.
- The closest min-cut gives the minimum movement of registers.
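As a small illustration of the insertion rule above, the following sketch lists the connections that receive a register once a partition is known. The helper name `register_positions`, the fan-out table, and the hand-written partition `S` are hypothetical; in practice S would come from the max-flow computation.

```python
def register_positions(fanouts, S):
    """List the (driver, fanout) connections that cross from S into R."""
    return sorted(
        (u, v)
        for u, outs in fanouts.items() if u in S
        for v in outs if v not in S
    )

# Assumed toy netlist: a and b drive c, and c drives d and e.
fanouts = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
S = {"a", "b", "c"}  # assumed source-side partition
positions = register_positions(fanouts, S)
print(positions)  # [('c', 'd'), ('c', 'e')]
```

Both crossing connections share the driver c, so a netlist writer could place a single register on c's output and fan it out.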
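Unconstrained Flow
- Width of the minimum cut = total capacity of the crossing edges.
- Effect of unconstrained (infinite-capacity) edges? They restrict the location of the finite cut.
- A useful tool…
[Figure: example flow graph mixing unit-capacity and unconstrained edges]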
Reverse Edges
- A min-cut guarantees that every path will be cut at least once.
- Retiming requires that every path is cut exactly once.
- A cut must be crossed by a reverse edge to have a path with more than one crossing.
- Solution: use unconstrained flow to prevent reverse edges.
[Figure: registers R1, R2, R3 and R1’, R2’, R3’]
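A toy example may help show why the reverse edges matter. The sketch below is an assumption-laden illustration rather than the authors' code: the graph, its capacities, and the helper names `max_flow_partition` and `build` are made up, and the max-flow routine is a generic Edmonds-Karp. Without reverse edges the minimum cut is crossed backwards by the arc b→c, so the path s→a→b→c→t is cut twice; pairing every arc with an infinite-capacity arc in the opposite direction makes that cut infinitely expensive.

```python
from collections import deque

INF = float("inf")

def max_flow_partition(cap, source, sink):
    """Generic Edmonds-Karp; returns (flow, nodes reachable in the residual)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

def build(arcs, reverse_edges):
    """Capacity map; optionally pair each arc with an unconstrained reverse arc."""
    cap = {}
    for (u, v), c in arcs.items():
        cap.setdefault(u, {})[v] = c
        if reverse_edges:
            cap.setdefault(v, {})[u] = INF
    return cap

# Assumed toy graph; the capacities are chosen only to expose the problem.
arcs = {("s", "a"): 3, ("a", "b"): 1, ("b", "c"): 1,
        ("b", "t"): 3, ("s", "c"): 3, ("c", "t"): 1}

flow_plain, S_plain = max_flow_partition(build(arcs, False), "s", "t")
flow_rev, S_rev = max_flow_partition(build(arcs, True), "s", "t")
print(flow_plain, sorted(S_plain))  # 2 ['a', 'c', 's']  (b->c crosses backwards)
print(flow_rev, sorted(S_rev))      # 4 ['a', 's']       (every path cut once)
```

The legal cut is wider (4 instead of 2), which is expected: the narrower cut did not correspond to a valid retiming.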
Fanout Sharing
- The flow graph is composed of arcs: a false model of register count.
- Only one register is needed per hyperedge (“fanout sharing”).
- Introduce a structure to simulate fanout sharing.
[Figure: fanout-sharing structure with unit-capacity edges]
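One way such a structure could look is sketched below; the exact construction used by the authors is not given on the slide, so treat this as an assumption. Each node u gets a hypothetical companion node `u_out`, reached through a single unit-capacity arc, and the arcs from `u_out` to the fan-outs are unconstrained. The cut then pays once per node output rather than once per fan-out arc. The netlist, the helper names, and the generic Edmonds-Karp routine are illustrative only.

```python
from collections import deque

INF = float("inf")

def max_flow(cap, source, sink):
    """Generic Edmonds-Karp; returns the max-flow value (= min-cut width)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

def arc_model(fanouts, inputs):
    """One unit of capacity per fan-out arc (over-counts shared registers)."""
    cap = {"s": {x: INF for x in inputs}}
    for u, outs in fanouts.items():
        for v in outs:
            cap.setdefault(u, {})[v] = 1
    return cap

def shared_model(fanouts, inputs):
    """One unit of capacity per node output; fan-out arcs are unconstrained."""
    cap = {"s": {x: INF for x in inputs}}
    for u, outs in fanouts.items():
        cap.setdefault(u, {})[u + "_out"] = 1
        for v in outs:
            cap.setdefault(u + "_out", {})[v] = INF
    return cap

# Assumed toy netlist: two register outputs feed g, which fans out three ways.
fanouts = {"x1": ["g"], "x2": ["g"], "g": ["h1", "h2", "h3"],
           "h1": ["t"], "h2": ["t"], "h3": ["t"]}
arc_width = max_flow(arc_model(fanouts, ["x1", "x2"]), "s", "t")
shared_width = max_flow(shared_model(fanouts, ["x1", "x2"]), "s", "t")
print(arc_width, shared_width)  # 2 1
```

Under the arc model, moving the registers past g looks like it costs three registers, so the cut stays at g's two inputs. With sharing modeled, a single register on g's output is correctly seen as cheaper.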
Single Iteration
- What does the final flow graph look like?
  - reverse edges
  - fanout sharing
  - one constraint per node output (not per edge)
- Unitary flow simplification:
  - binary marking scheme
  - flow computed on the original graph
Primary Inputs/Outputs
- Synchronization with the environment is flexible: registers can be absorbed from / donated to the environment… or not.
- Constrain the growth of additional initialization logic.
- “Switching off” desynchronization: exclude the TFO of primary I/Os.
[Figure: logic between source and sink with a PI and a PO; forward retiming past the marked node increases latency through the PI by 1]
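Finding the region to exclude is a plain graph traversal. The sketch below computes the transitive fan-out (TFO) of a set of primary inputs; the function name `transitive_fanout` and the toy netlist are hypothetical, and how the excluded nodes are then wired into the flow graph is left open here.

```python
from collections import deque

def transitive_fanout(fanouts, roots):
    """All nodes reachable from `roots` along fan-out edges (roots included)."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        for v in fanouts.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# Assumed toy netlist: one primary input and two register outputs.
fanouts = {"pi": ["g1"], "r1": ["g1", "g2"], "r2": ["g2"],
           "g1": ["g3"], "g2": ["g3"], "g3": []}
blocked = transitive_fanout(fanouts, ["pi"])
movable = set(fanouts) - blocked
print(sorted(blocked), sorted(movable))  # ['g1', 'g3', 'pi'] ['g2', 'r1', 'r2']
```

Registers may then move forward only across nodes outside `blocked`, which keeps the latency seen through the primary input unchanged.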
Multiple Frames
- Thus far, we have only considered retiming within one combinational frame: at most one register is moved across a node.
- The global min-cut may stretch across multiple combinational frames and may require moving multiple registers across a node.
- Solution: repeat over a single frame; terminate when there is no further change.
Forward and Backward
- Backward retiming must also be considered.
- The solution for the sequential core has already been found.
- Considers retiming in the fan-out cone of the inputs.
- Backward retiming is identical, with the roles of PIs, POs, sources, and sinks reversed.
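Overall Algorithm
1. Start with forward retiming. Optionally block the fan-out cone of the PIs.
2. Compute max-flow. If there is an improvement, implement the min-cut and repeat this step.
3. If there is no improvement, move on to backward retiming. Optionally block the fan-in cone of the POs.
4. Compute max-flow. If there is an improvement, implement the min-cut and repeat this step.
5. If there is no improvement, done.
Forward retiming is preferred because of initial state computation.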
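The control flow of the overall algorithm can be sketched as a small driver loop. Everything here is a stand-in: `min_register_retime`, the step callbacks, and the dictionary "circuit" are hypothetical, and a real step would build the flow graph, compute the max-flow, and implement the min-cut. The optional `max_iters` bound (per direction) is an added assumption, included to show how an intermediate result could be accepted early.

```python
def min_register_retime(circuit, forward_step, backward_step,
                        count_registers, max_iters=None):
    """Run forward steps, then backward steps, until neither improves."""
    for step in (forward_step, backward_step):
        iters = 0
        while max_iters is None or iters < max_iters:
            candidate = step(circuit)
            if count_registers(candidate) >= count_registers(circuit):
                break            # no improvement: this direction is finished
            circuit = candidate  # accept the new register positions
            iters += 1
    return circuit

# Toy stand-ins: each step saves registers while some "slack" remains.
def toy_forward(c):
    if c["fwd"] > 0:
        return {**c, "fwd": c["fwd"] - 1, "regs": c["regs"] - 2}
    return c

def toy_backward(c):
    if c["bwd"] > 0:
        return {**c, "bwd": c["bwd"] - 1, "regs": c["regs"] - 1}
    return c

start = {"regs": 10, "fwd": 2, "bwd": 1}
result = min_register_retime(start, toy_forward, toy_backward,
                             lambda c: c["regs"])
print(result["regs"])  # 5
```

Because every accepted step strictly lowers the register count, stopping early with `max_iters` still returns a valid, if suboptimal, result.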
Asymptotic Analysis
- Minimum-register retiming can also be solved using other methods.
- Original formulation: LP
- The competition: min-cost flow using cost scaling [Goldberg 97]
- Our algorithm:
  - a single iteration is limited by maximum flow
  - the total number of iterations is bounded by |R|
  - or, using unitary flow simplification…
Scalability
- Each iteration is strictly better than the previous.
- Runtime can be bounded and any intermediate result accepted, at the low, low cost of suboptimality.
- The “real” improvement over previous techniques.
[Figure: Register Savings per Iteration]
Experimental Results
- Applied to OpenCores designs.
- Reduction in number of registers: average = -11%, maximum = -62.5%
- Cost in delay: average = +25.8%
- Runtime is 5x faster than the minimum-cost formulation.
- <0.01s for 70% of benchmarks.
Conclusions
- Max-flow approach to minimizing register count
- Optimal solution with minimum register movement
- Handles different models of environment synchronization
- Faster than existing methods:
  - algorithmically and practically easier network problem
  - allows simplification via binary marking
- Scalable