EECS 150 - Components and Design Techniques for Digital Systems Lec 23 – Optimizing State Machines David Culler Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~culler http://inst.eecs.berkeley.edu/~cs150
Datapath vs Control
[Block diagram: datapath and controller, connected by control points and signals]
Datapath: storage, functional units, and interconnect sufficient to perform the desired functions; its inputs are control points and its outputs are signals back to the controller.
Controller: a state machine that orchestrates operation of the datapath, based on the desired function and those signals.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Control Points Discussion
Control points on a bus? Control points on a register? Control points on a function unit? Signals? Relationship to the STG? The STT?
[Figure: generic state bubble with Input A, Input B, and State / Output]
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Sequential Logic Optimization State Minimization Algorithms for State Minimization State, Input, and Output Encodings Minimize the Next State and Output logic Delay optimizations Retiming Parallelism and Pipelining (time permitting) 11/16/2018 EECS 150, Fa07, Lec 23-optimize
FSM Optimization in Context Understand the word specification Draw a picture Derive a state diagram and Symbolic State Table Determine an implementation approach (e.g., gate logic, ROM, FPGA, etc.) Perform STATE MINIMIZATION Perform STATE ASSIGNMENT Map Symbolic State Table to Encoded State Tables for implementation (INPUT and OUTPUT encodings) You can specify a specific state assignment in your Verilog code through parameter settings 11/16/2018 EECS 150, Fa07, Lec 23-optimize
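As a minimal sketch of the last point above, state assignment can be pinned down in Verilog through parameters so synthesis uses exactly the codes you chose. The FSM and encoding below are illustrative only, not an example from the lecture.

module fsm_example (input clk, rst, in, output out);
  // Explicit 2-bit state assignment chosen by the designer:
  parameter IDLE = 2'b00, BUSY = 2'b01, DONE = 2'b11;
  reg [1:0] state, next;

  always @(posedge clk)
    state <= rst ? IDLE : next;

  always @(*)
    case (state)
      IDLE:    next = in ? BUSY : IDLE;
      BUSY:    next = in ? BUSY : DONE;
      DONE:    next = IDLE;
      default: next = IDLE;
    endcase

  assign out = (state == DONE);   // Moore output decoded from the state
endmodule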
Finite State Machine Optimization State Minimization Fewer states require fewer state bits Fewer bits require fewer logic equations Encodings: State, Inputs, Outputs State encoding with fewer bits has fewer equations to implement However, each may be more complex State encoding with more bits (e.g., one-hot) has simpler equations Complexity directly related to complexity of state diagram Input/output encoding may or may not be under designer control 11/16/2018 EECS 150, Fa07, Lec 23-optimize
FSM Optimization
State Reduction. Motivation: lower cost
  fewer flip-flops in one-hot implementations
  possibly fewer flip-flops in encoded implementations
  more don't cares in NS logic
  fewer gates in NS logic
It is simpler to design with extra states and then reduce later.
Example: odd parity checker. Two machines, identical behavior.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Algorithmic Approach to State Minimization
Goal: identify and combine states that have equivalent behavior.
Equivalent states: same outputs, and for all input combinations they transition to the same or equivalent states.
Algorithm sketch:
  1. Place all states in one set.
  2. Initially partition the set based on output behavior.
  3. Successively partition the resulting subsets based on next-state transitions.
  4. Repeat (3) until no further partitioning is required; states left in the same set are equivalent.
Polynomial-time procedure.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Minimization Example
Sequence detector for 010 or 110.

  Input Sequence | Present State | Next State (X=0, X=1) | Output (X=0, X=1)
  Reset          | S0            | S1   S2               | 0  0
  0              | S1            | S3   S4               | 0  0
  1              | S2            | S5   S6               | 0  0
  00             | S3            | S0   S0               | 0  0
  01             | S4            | S0   S0               | 1  0
  10             | S5            | S0   S0               | 0  0
  11             | S6            | S0   S0               | 1  0

[State diagram: S0 through S6 with Mealy transition labels such as 1/0, 0/0, 0/1]
Note: Mealy machine; the table above is an alternative STT representation.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Method of Successive Partitions
(Same state transition table as on the previous slide.)
  ( S0 S1 S2 S3 S4 S5 S6 )
  ( S0 S1 S2 S3 S5 ) ( S4 S6 )
  ( S0 S1 S2 ) ( S3 S5 ) ( S4 S6 )
  ( S0 ) ( S1 S2 ) ( S3 S5 ) ( S4 S6 )
S1 is equivalent to S2; S3 is equivalent to S5; S4 is equivalent to S6.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Minimized FSM
State-minimized sequence detector for 010 or 110.

  Input Sequence | Present State | Next State (X=0, X=1) | Output (X=0, X=1)
  Reset          | S0            | S1'  S1'              | 0  0
  0 + 1          | S1'           | S3'  S4'              | 0  0
  X0             | S3'           | S0   S0               | 0  0
  X1             | S4'           | S0   S0               | 1  0

[State diagram: S0, S1', S3', S4' with transitions labeled X/0, 1/0, 0/1, 0/0]
7 states reduced to 4 states; a 3-bit encoding is replaced by a 2-bit encoding.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
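A sketch of this minimized machine in Verilog, following the table above (Mealy output); the particular 2-bit codes are one arbitrary choice, and S1/S3/S4 below stand for S1'/S3'/S4'.

module detect_x10 (input clk, reset, x, output reg z);
  parameter S0 = 2'b00, S1 = 2'b01, S3 = 2'b10, S4 = 2'b11;
  reg [1:0] state, next;

  always @(posedge clk)
    state <= reset ? S0 : next;

  always @(*) begin
    z = 1'b0;
    case (state)
      S0: next = S1;                 // first bit can be 0 or 1
      S1: next = x ? S4 : S3;        // remember the middle bit
      S3: next = S0;                 // middle bit was 0: never accept
      S4: begin
            next = S0;
            z = ~x;                  // 010 or 110 complete when last bit is 0
          end
      default: next = S0;
    endcase
  end
endmodule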
Another Example – Row Matching Method 4-Bit Sequence Detector: output 1 after each 4-bit input sequence consisting of the binary strings 0110 or 1010 11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Transition Table
Group states with the same next states and the same outputs. [Table on slide; the matched rows are merged into state S'10.]
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Iterate the Row Matching Algorithm
[Table on slide; the next round of matching produces merged state S'7.]
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Iterate One Last Time
[Table on slide; the final round produces merged states S'3 and S'4.]
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Final Reduced State Machine 15 states (min 4 FFs) reduced to 7 states (min 3 FFs) 11/16/2018 EECS 150, Fa07, Lec 23-optimize
More Complex State Minimization
Multiple-input example: the inputs here are the 2-bit combinations 00, 01, 10, 11.
[State diagram: S0 [1], S1 [0], S2 [1], S3 [0], S4 [1], S5 [0]]

Symbolic state transition table:
  present | next state for input 00 01 10 11 | output
  S0      | S0  S1  S2  S3                   | 1
  S1      | S0  S3  S1  S4                   | 0
  S2      | S1  S3  S2  S4                   | 1
  S3      | S1  S0  S4  S5                   | 0
  S4      | S0  S1  S2  S5                   | 1
  S5      | S1  S4  S0  S5                   | 0
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Reduction Limits
The "row matching" method is not guaranteed to result in the optimal solution in all cases, because it only looks at pairs of states (example on slide). Another, more complicated, method guarantees the optimal solution: the "implication table" method (cf. Mano, chapter 9). What "rule of thumb" heuristics apply?
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Implication Chart Method
Build a table of all pairs of states. First eliminate incompatible pairs based on outputs; fill each remaining entry with the implied state pairs based on next states; then cross out cells whose implied entries are themselves crossed out, repeating until nothing changes.
(Symbolic state transition table as on the previous slide.)
[Implication chart: rows S1-S5, columns S0-S4. For example, the (S0, S2) cell contains S0-S1, S1-S3, S2-S2, S3-S4, and the (S0, S4) cell contains S0-S0, S1-S1, S2-S2, S3-S5.]
Result: S0 == S4 and S3 == S5, giving the minimized state table:
  present | next state for input 00 01 10 11 | output
  S0'     | S0'  S1   S2   S3'               | 1
  S1      | S0'  S3'  S1   S3'               | 0
  S2      | S1   S3'  S2   S0'               | 1
  S3'     | S1   S0'  S0'  S3'               | 0
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Minimizing Incompletely Specified FSMs
Equivalence of states is transitive when the machine is fully specified, but it is not transitive when don't cares are present. For example:
  state | output
  S0    | – 0
  S1    | 1 –
  S2    | – 1
S1 is compatible with both S0 and S2, but S0 and S2 are incompatible.
No polynomial-time algorithm exists for determining the best grouping of states into compatible sets that yields the smallest number of final states.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Minimizing States May Not Yield Best Circuit
Example: edge detector, which outputs 1 when the input changes from 0 to 1 (i.e., the last two inputs were 0 then 1).
  X Q1 Q0 | Q1+ Q0+
  0  0  0 |  0   0
  0  0  1 |  0   0
  0  1  1 |  0   0
  1  0  0 |  0   1
  1  0  1 |  1   1
  1  1  1 |  1   1
  –  1  0 |  0   0
[State diagram: 00 [0], 01 [1], 11 [0]; X' returns to 00]
Q1+ = X (Q1 xor Q0)
Q0+ = X Q1' Q0'
11/16/2018 EECS 150, Fa07, Lec 23-optimize
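A Verilog sketch built directly from the state table above (the 10 code is treated as an unused don't-care state); the output is 1 only in state 01, the "just saw a 0-to-1 change" state. This is an illustrative behavioral description, not the gate-level circuit on the slide.

module edge_detect (input clk, rst, x, output out);
  reg [1:0] q;                        // {Q1, Q0}

  always @(posedge clk)
    if (rst)      q <= 2'b00;
    else if (!x)  q <= 2'b00;         // any 0 returns to the start state
    else case (q)                     // x == 1
      2'b00:   q <= 2'b01;            // first 1 after a 0: assert output
      2'b01:   q <= 2'b11;            // further 1s: hold, no output
      2'b11:   q <= 2'b11;
      default: q <= 2'b00;            // 10 is the unused (don't-care) code
    endcase

  assign out = (q == 2'b01);
endmodule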
Another Implementation of Edge Detector
"Ad hoc" solution: not minimal, but cheap and fast.
[State diagram: four states 00 [0], 01 [1], 10 [0], 11 [0], with transitions on X and X']
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Announcements
Reading: K&B 8.1-2. HW 9 due Wednesday; the last HW will go out next week.
TAs will be in lab this week as much as possible, rather than holding official lab meetings.
Nov 29: bring your questions on a sheet of paper.
Down to the final stretch.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Assignment
Choose bit vectors to assign to each "symbolic" state.
With n state bits for m states there are 2^n! / (2^n – m)! possible assignments (where log2 m <= n <= m): 2^n codes are possible for the 1st state, 2^n – 1 for the 2nd, 2^n – 2 for the 3rd, ...
A huge number even for small values of n and m; intractable for state machines of any size, so heuristics are necessary for practical solutions.
Optimize some metric for the combinational logic:
  size (amount of logic and number of FFs)
  speed (depth of logic and fanout)
  dependencies (decomposition)
11/16/2018 EECS 150, Fa07, Lec 23-optimize
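As a worked instance of this count, take the minimized detector from earlier (m = 4 states). With the minimum n = 2 bits there are 2^n! / (2^n – m)! = 4! / 0! = 24 possible assignments; the unminimized 7-state machine with n = 3 bits already has 8! / 1! = 40320.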
State Assignment Strategies Possible Strategies Sequential – just number states as they appear in the state table Random – pick random codes One-hot – use as many state bits as there are states (bit=1 –> state) Output – use outputs to help encode states Heuristic – rules of thumb that seem to work in most cases No guarantee of optimality – another intractable problem 11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Maps
"K-maps" are used to help visualize good encodings; adjacent states in the STD should be made adjacent in the map.
  Assignment 1:          Assignment 2:
  State  q2 q1 q0        State  q2 q1 q0
  S0     0  0  0         S0     0  0  0
  S1     1  0  1         S1     0  0  1
  S2     1  1  1         S2     0  1  0
  S3     0  1  0         S3     0  1  1
  S4     0  1  1         S4     1  1  1
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Maps and Counting Bit Changes
Bit-change heuristic: count how many state bits change on each transition under each assignment.
[State transition diagram: S0 through S4]
  Transition   Assignment 1   Assignment 2
  S0 -> S1     2              1
  S0 -> S2     3              1
  S1 -> S3     3              1
  S2 -> S3     2              1
  S3 -> S4     1              1
  S4 -> S1     2              2
  Total        13             7
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Assignment Alternative heuristics based on input and output behavior as well as transitions: Adjacent assignments to: states that share a common next state (group 1's in next state map) states that share a common ancestor state (group 1's in next state map) states that have common output behavior (group 1's in output map) 11/16/2018 EECS 150, Fa07, Lec 23-optimize
Heuristics for State Assignment Successor/Predecessor Heuristics High Priority: S’3 and S’4 share common successor state (S0) Medium Priority: S’3 and S’4 share common predecessor state (S’1) Low Priority: 0/0: S0, S’1, S’3 1/0: S0, S’1, S’3, S’4 11/16/2018 EECS 150, Fa07, Lec 23-optimize
Heuristics for State Assignment 11/16/2018 EECS 150, Fa07, Lec 23-optimize
Another Example High Priority: S’3, S’4 S’7, S’10 Medium Priority: S1, S2 2 x S’3, S’4 S’7, S’10 Low Priority: 0/0: S0, S1, S2, S’3, S’4, S’7 1/0: S0, S1, S2, S’3, S’4, S’7, S10 11/16/2018 EECS 150, Fa07, Lec 23-optimize
Example Continued
Choose the assignment S0 = 000. Place the high-priority adjacency state pairs into the state map, then repeat for the medium-priority pairs, and finally for any left-over states using the low-priority scheme. Two alternative assignments are shown at the left.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Why Do These Heuristics Work? Attempt to maximize adjacent groupings of 1’s in the next state and output functions 11/16/2018 EECS 150, Fa07, Lec 23-optimize
General Approach to Heuristic State Assignment All current methods are variants of this 1) Determine which states “attract” each other (weighted pairs) 2) Generate constraints on codes (which should be in same cube) 3) Place codes on Boolean cube so as to maximize constraints satisfied (weighted sum) Different weights make sense depending on whether we are optimizing for two-level or multi-level forms Can't consider all possible embeddings of state clusters in Boolean cube Heuristics for ordering embedding To prune search for best embedding Expand cube (more state bits) to satisfy more constraints 11/16/2018 EECS 150, Fa07, Lec 23-optimize
One-hot State Assignment
Simple: easy to encode and debug.
Small logic functions: each state function requires only its predecessor state bits as input.
Good for programmable devices: lots of flip-flops readily available, and simple functions with small support (the signals they depend on).
Impractical for large machines: too many states require too many flip-flops; decompose FSMs into smaller pieces that can be one-hot encoded.
Many slight variations to one-hot, e.g., one-hot + all-0.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
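A one-hot sketch of the minimized 010/110 detector from earlier: one flip-flop per state, and each next-state equation needs only its predecessor state bits and the input. The bit ordering here is an arbitrary illustrative choice.

module detect_x10_onehot (input clk, reset, x, output z);
  reg [3:0] s;                       // {S4', S3', S1', S0}

  always @(posedge clk)
    if (reset) s <= 4'b0001;         // start in S0
    else begin
      s[0] <= s[3] | s[2];           // S0  <- S3' or S4'
      s[1] <= s[0];                  // S1' <- S0 (either input value)
      s[2] <= s[1] & ~x;             // S3' <- S1' on 0
      s[3] <= s[1] &  x;             // S4' <- S1' on 1
    end

  assign z = s[3] & ~x;              // Mealy output: 010 or 110 complete
endmodule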
Output-Based Encoding
Reuse outputs as state bits: use the outputs to help distinguish states. Why create new functions for state bits when the outputs can serve as well? Fits in nicely with synchronous Mealy implementations.
  Inputs (C TL TS) | Present State | Next State | Outputs (ST  H  F)
  0 – –            | HG            | HG         | 0  00  10
  – 0 –            | HG            | HG         | 0  00  10
  1 1 –            | HG            | HY         | 1  00  10
  – – 0            | HY            | HY         | 0  01  10
  – – 1            | HY            | FG         | 1  01  10
  1 0 –            | FG            | FG         | 0  10  00
  0 – –            | FG            | FY         | 1  10  00
  – 1 –            | FG            | FY         | 1  10  00
  – – 0            | FY            | FY         | 0  10  01
  – – 1            | FY            | HG         | 1  10  01
HG = ST' H1' H0' F1 F0' + ST H1 H0' F1' F0
HY = ST H1' H0' F1 F0' + ST' H1' H0 F1 F0'
FG = ST H1' H0 F1 F0' + ST' H1 H0' F1' F0'
FY = ST H1 H0' F1' F0' + ST' H1 H0' F1' F0
Output patterns are unique to states, so we do not need ANY state bits: implement 5 functions (one for each output) instead of 7 (outputs plus 2 state bits).
11/16/2018 EECS 150, Fa07, Lec 23-optimize
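A Verilog sketch of this output-based encoding for the traffic-light controller above: the state register is simply {H, F}, so no separate state bits are needed. The light codes (00 = green, 01 = yellow, 10 = red) and the structure of ST are read off the table; treat them as an assumed interpretation rather than the lecture's exact circuit.

module tlc_output_encoded (input clk, reset, C, TL, TS,
                           output ST, output [1:0] H, F);
  localparam HG = 4'b0010, HY = 4'b0110, FG = 4'b1000, FY = 4'b1001;
  reg [3:0] state, next;

  always @(posedge clk)
    state <= reset ? HG : next;

  always @(*)
    case (state)
      HG:      next = (C & TL) ? HY : HG;   // cars waiting and long timeout
      HY:      next = TS ? FG : HY;
      FG:      next = (~C | TL) ? FY : FG;
      FY:      next = TS ? HG : FY;
      default: next = HG;
    endcase

  assign {H, F} = state;                    // the outputs ARE the state bits
  assign ST = (next != state);              // assert on every light change
endmodule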
Current State Assignment Approaches
For tight encodings using close to the minimum number of state bits:
  the best of 10 random assignments seems to be adequate (on average, random does as well as the heuristics)
  heuristic approaches are not even close to optimality
  used in custom chip design
One-hot encoding:
  easy for small state machines
  generates small equations with easy-to-estimate complexity
  common in FPGAs and other programmable logic
Output-based encoding:
  ad hoc, no tools
  most common approach taken by human designers
  yields very small circuits for most FSMs
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Sequential Logic Implementation Summary Implementation of sequential logic State minimization State assignment Implications for programmable logic devices When logic is expensive and FFs are scarce, optimization is highly desirable (e.g., gate logic, PLAs, etc.) In Xilinx devices, logic is bountiful (4 and 5 variable TTs) and FFs are many (2 per CLB), so optimization is not so crucial an issue as in other forms of programmable logic This makes sparse encodings like One-Hot worth considering 11/16/2018 EECS 150, Fa07, Lec 23-optimize
Improving Cycle Time Retiming Parallelism Pipelining 11/16/2018 EECS 150, Fa07, Lec 23-optimize
Example: Vending Machine State Machine
Moore machine: outputs associated with states.
[Moore state diagram: 0¢ [0], 5¢, 10¢, 15¢ [1]; arcs labeled N, D, N+D, N'D' + Reset, Reset]
Mealy machine: outputs associated with transitions.
[Mealy state diagram: same states; arcs labeled (N'D' + Reset)/0, N/0, D/0, D/1, N+D/1, N'D'/0, Reset'/1, Reset/0]
11/16/2018 EECS 150, Fa07, Lec 23-optimize
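A sketch of the vending machine with both output styles side by side: a Moore output decoded from the state and a Mealy output decoded from state and inputs. The encoding and the priority given to D when both coins arrive at once are illustrative choices.

module vending (input clk, reset, N, D, output open_moore, open_mealy);
  localparam C0 = 2'd0, C5 = 2'd1, C10 = 2'd2, C15 = 2'd3;
  reg [1:0] state, next;

  always @(posedge clk)
    state <= reset ? C0 : next;

  always @(*)
    case (state)
      C0:      next = D ? C10 : (N ? C5  : C0);
      C5:      next = D ? C15 : (N ? C10 : C5);
      C10:     next = (N | D) ? C15 : C10;
      C15:     next = C15;                      // wait for reset
      default: next = C0;
    endcase

  assign open_moore = (state == C15);           // asserted while in 15 cents
  assign open_mealy = (next  == C15);           // asserted as the last coin arrives
endmodule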
State Machine Retiming
Moore vs. (async) Mealy machine, vending machine example:
  Moore: open asserted only while in state 15¢.
  Mealy: open asserted as the last coin is inserted, on the transition leading to state 15¢.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Machine Retiming
Retiming the Moore machine gives faster generation of the output; synchronizing the Mealy machine adds a FF that delays the output. These two implementations have identical timing behavior.
Push the AND gate through the state FFs and synchronize with an output FF; this is like computing open in the prior state and delaying it one state time.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
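Continuing the vending-machine sketch, a retimed (synchronous Mealy) version: the Mealy expression is captured in its own flip-flop, so open appears right at the clock edge instead of after the output-decode delay. Again, the coding details are illustrative.

module vending_retimed (input clk, reset, N, D, output reg open);
  localparam C0 = 2'd0, C5 = 2'd1, C10 = 2'd2, C15 = 2'd3;
  reg [1:0] state, next;

  always @(posedge clk) begin
    state <= reset ? C0 : next;
    open  <= ~reset & (next == C15);   // compute open in the prior state, delay one cycle
  end

  always @(*)
    case (state)
      C0:      next = D ? C10 : (N ? C5  : C0);
      C5:      next = D ? C15 : (N ? C10 : C5);
      C10:     next = (N | D) ? C15 : C10;
      C15:     next = C15;
      default: next = C0;
    endcase
endmodule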
State Machine Retiming
Effect on the timing of the open signal (Moore case):
[Timing diagram: after the clock edge, the FF propagation delay brings the new state; the Moore open output follows after the additional output-logic propagation delay, plus setup time for whatever samples it. The retimed open changes right at the clock edge, because its calculation overlaps the next-state calculation in the previous cycle.]
11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Machine Retiming
Timing behavior is the same, but are the implementations really identical? Compare the FF input in the retimed Moore implementation with the FF input in the synchronous Mealy implementation: the only difference is in the don't-care case of a nickel and a dime inserted at the same time.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Parallelism
Doing more than one thing at a time: optimization in hardware often involves using parallelism to trade between cost and performance.
Example: student final grade calculation:
  read mt1, mt2, mt3, project;
  grade = 0.2 mt1 + 0.2 mt2 + 0.2 mt3 + 0.4 project;
  write grade;
High-performance hardware implementation: as many operations as possible are done in parallel.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Parallelism Is there a lower cost hardware implementation? Different tree organization? Can factor out multiply by 0.2: How about sharing operators (multipliers and adders)? 11/16/2018 EECS 150, Fa07, Lec 23-optimize
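A sketch of two datapath organizations for the grade computation above. It assumes 8-bit scores and weights scaled by 10 (0.2 becomes 2, 0.4 becomes 4) so the arithmetic stays in integers; the constant divide at the end restores the scale. These modules illustrate the structural trade-off, not a recommended implementation.

module grade_parallel (input [7:0] mt1, mt2, mt3, project,
                       output [7:0] grade);
  // Direct form: four constant multipliers and three adders, all in parallel.
  wire [11:0] sum = 2*mt1 + 2*mt2 + 2*mt3 + 4*project;
  assign grade = sum / 10;
endmodule

module grade_factored (input [7:0] mt1, mt2, mt3, project,
                       output [7:0] grade);
  // Factored form: the common 0.2 weight is applied once after the adds,
  // trading multipliers for a slightly longer adder chain.
  wire [9:0]  mts = mt1 + mt2 + mt3;
  wire [11:0] sum = 2*mts + 4*project;
  assign grade = sum / 10;
endmodule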
Pipelining Principle
Pipelining review from CS61C, by analogy to washing clothes:
  step 1: wash (20 minutes)
  step 2: dry (20 minutes)
  step 3: fold (20 minutes)
Sequential: 60 minutes × 4 loads = 4 hours.
[Chart: wash, dry, and fold stages overlapped across load1-load4]
Overlapped, one 20-minute step at a time: 2 hours for the same 4 loads.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Pipelining
[Chart: wash, dry, and fold stages overlapped across load1-load4]
As the number of loads increases, the average time per load approaches 20 minutes.
Latency (time from start to end) for one load = 60 min.
Throughput = 3 loads/hour.
Pipelined throughput ≈ (# of pipe stages) × un-pipelined throughput.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Pipelining
General principle: cut the CL block into pieces (stages) and separate them with registers.
Un-pipelined: assume T = 8 ns and TFF (setup + clk-to-Q) = 1 ns, so F = 1/(8 ns + 1 ns) = 111 MHz.
Pipelined into two stages with T1 = T2 = 4 ns: latency T' = 4 ns + 1 ns + 4 ns + 1 ns = 10 ns, but F = 1/(4 ns + 1 ns) = 200 MHz.
The CL block produces a new result every 5 ns instead of every 9 ns.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
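A minimal structural sketch of this cut: one pipeline register splits the long combinational path in two. The two "half" functions are placeholders standing in for the hypothetical 4 ns halves of the original block.

module pipe2 (input clk, input [7:0] x, output reg [7:0] y);
  reg [7:0] mid;

  wire [7:0] stage1 = x + 8'd3;        // placeholder for the first half of the CL
  wire [7:0] stage2 = mid ^ 8'h5a;     // placeholder for the second half

  always @(posedge clk) begin
    mid <= stage1;                     // the new register splits the long path
    y   <= stage2;
  end
endmodule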
Limits on Pipelining
Without FF overhead, the throughput improvement is proportional to the number of stages; after many stages are added, FF overhead begins to dominate. (FF "overhead" is the setup and clk-to-Q times.)
Other limiters to effective pipelining:
  clock skew contributes to clock overhead
  unequal stages
  FFs dominate cost
  clock distribution power consumption
  feedback (dependencies between loop iterations)
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Pipelining Example
Computation graph: F(x): yi = a·xi^2 + b·xi + c, where x and y are assumed to be "streams".
Divide into 3 (nearly) equal stages; insert pipeline registers at the dashed lines.
Can we pipeline basic operators?
11/16/2018 EECS 150, Fa07, Lec 23-optimize
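A sketch of one possible three-stage cut of this computation. The widths and the exact placement of the register cuts are illustrative; one new x is accepted every clock and each y emerges three clocks later.

module quad_pipe #(parameter W = 8)
                 (input clk,
                  input  [W-1:0] x, a, b, c,
                  output reg [3*W-1:0] y);
  // Stage 1 registers: form x*x and b*x, carry a and c along.
  reg [2*W-1:0] xx_q, bx_q;
  reg [W-1:0]   a_q, c_q;
  // Stage 2 registers: form a*x^2, and b*x + c.
  reg [3*W-1:0] axx_q, bxc_q;

  always @(posedge clk) begin
    // Stage 1
    xx_q  <= x * x;
    bx_q  <= b * x;
    a_q   <= a;
    c_q   <= c;
    // Stage 2
    axx_q <= a_q * xx_q;
    bxc_q <= bx_q + c_q;
    // Stage 3
    y     <= axx_q + bxc_q;
  end
endmodule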
Example: Pipelined Adder Possible, but usually not done … (arithmetic units can often be made sufficiently fast without internal pipelining) 11/16/2018 EECS 150, Fa07, Lec 23-optimize
State Machine Retiming Summary
Retiming (vending machine example): the output function was very simple in this particular case, but if the output takes a long time to compute relative to the next-state computation, retiming can be used to "balance" these calculations and reduce the cycle time.
Parallelism: tradeoffs in cost and performance; time-multiplexed reuse of hardware reduces cost but sacrifices performance.
Pipelining: introduce registers to split the computation, reducing cycle time and allowing parallel computation; trade latency (number of stage delays) for cycle-time reduction.
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Row Matching Example
State Transition Table:
  PS | NS (x=0  x=1) | output (x=0  x=1)
  a  | a   b         | 0  0
  b  | c   d         | 0  0
  c  | a   d         | 0  0
  d  | e   f         | 0  1
  e  | a   f         | 0  1
  f  | g   f         | 0  1
  g  | a   f         | 0  1
11/16/2018 EECS 150, Fa07, Lec 23-optimize
Row Matching Example (cont)
After matching rows e and g (g is equivalent to e):
  PS | NS (x=0  x=1) | output (x=0  x=1)
  a  | a   b         | 0  0
  b  | c   d         | 0  0
  c  | a   d         | 0  0
  d  | e   f         | 0  1
  e  | a   f         | 0  1
  f  | e   f         | 0  1
Rows d and f now match (f is equivalent to d), giving the reduced state transition table:
  PS | NS (x=0  x=1) | output (x=0  x=1)
  a  | a   b         | 0  0
  b  | c   d         | 0  0
  c  | a   d         | 0  0
  d  | e   d         | 0  1
  e  | a   d         | 0  1
11/16/2018 EECS 150, Fa07, Lec 23-optimize