# 1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis.

## Presentation on theme: "1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis."— Presentation transcript:

1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis

2 First Phase: Lexical Analysis (Scanning) Scanner Maps stream of characters into tokens –Basic unit of syntax Characters that form a word are its lexeme Its syntactic category is called its token Scanner discards white space and comments Scanner works as a subroutine of the parser Source code Scanner IR Parser Errors token get next token

3 Lexical Analysis Specify tokens using Regular Expressions Translate Regular Expressions to Finite Automata Use Finite Automata to generate tables or code for the scanner Scanner Generator specifications (regular expressions) source code tokens tables or code

4 Consider the problem of recognizing register names in an assembler Register  R (0|1|2| … |9) (0|1|2| … |9) * Allows registers of arbitrary number Requires at least one digit RE corresponds to a recognizer (or DFA) Example S0S0 S2S2 S1S1 R (0|1|2| … 9) accepting state (0|1|2| … 9) Recognizer for Register initial state

5 RDigit Digit * allows arbitrary numbers Accepts R00000 Accepts R99999 What if we want to limit it to R0 through R31 ? Write a tighter regular expression –Register  R ( (0|1|2) ( 0|1|2| … | 9 | ) | (4|5|6|7|8|9) | (3 (0|1| )) ) –Register  R0|R1|R2| … |R31|R00|R01|R02| … |R09 Produces a more complex DFA Has more states Same cost per transition Same basic implementation What if we need a tighter specification for registers?

6 Tighter register specification The DFA for Register  R ( (0|1|2) ( 0|1|2| … | 9 | ) | (4|5|6|7|8|9) | (3 (0|1| )) ) Accepts a more constrained set of registers Same set of actions, more states S0S0 S5S5 S1S1 R S4S4 S3S3 S6S6 S2S2 0,1,2 3 0,1 4,5,6,7,8,9 (0|1|2| … 9)

7 Tighter register specification To implement the recognizer Use the same code skeleton Use transition table and final states for the new RE Bigger tables, more space, same asymptotic costs Better syntax checking at the same cost Final = { s 2, s 3, s 4, s 5, s 6 }

8 Non-deterministic Finite Automata Non-deterministic Finite Automata (NFA) for the RE ( a | b ) * abb This is a little different S 0 has a transition on (empty string) – - transitions are allowed S 1 has two transitions on “a” –Transition function  : S    2 S maps (state, symbol) pairs to sets of states This is a non-deterministic finite automaton ( NFA ) a | b S0S0 S1S1 S4S4 S2S2 S3S3 abb

9 Non-deterministic Finite Automata An NFA accepts a string x iff there exists a path though the transition graph from s 0 to a final state and the edge labels spell x Transitions on consume no input To “run” (simulate) the NFA, –Start in s 0 and take all the transitions for each character –At each iteration add the states reachable by -transitions Why study NFAs? They are the key to automating the RE  DFA construction We can paste together NFAs with -transitions NFA becomes an NFA

10 NFA Simulation Two key functions Move(q, a) is set of states reachable by “a” from states in q -closure(q) is set of states reachable by from states in q states = -closure ( {s 0 } ); char = get_next_char(); while (char != EOF) { states = -closure(Move (states,char)); char = get_next_char(); } if (states  Final is not empty) report acceptance; else report failure;

11 Relationship between NFA s and DFA s DFA is a special case of an NFA DFA has no - transitions DFA ’s transition function is single-valued Same rules will work DFA can be simulated with an NFA –Obviously NFA can be simulated with a DFA –Less obvious Simulate sets of possible states that NFA can reach with states of the DFA Possible exponential blowup in the state space Still, one state per character in the input stream

12 Automating Scanner Construction To build a scanner: 1Write down the RE that specifies the tokens 2Translate the RE to an NFA 3Build the DFA that simulates the NFA 4Systematically shrink the DFA 5Turn it into code or table Scanner generators Lex, Flex, Jlex, … (inside JavaCC) work along these lines Algorithms are well-known and well-understood Interface to parser is important

13 Automating Scanner Construction RE  NFA ( Thompson’s construction ) Build an NFA for each term Combine them with -moves NFA  DFA ( subset construction ) Build the simulation DFA  Minimal DFA Hopcroft’s algorithm DFA  RE All pairs, all paths problem Union together paths from s 0 to a final state minimal DFA RENFADFA The Cycle of Constructions

14 RE  NFA using Thompson’s Construction Key idea NFA pattern for each symbol & each operator Join them with moves in precedence order S0S0 S1S1 a NFA for a S0S0 S1S1 a S3S3 S4S4 b NFA for ab NFA for a | b S0S0 S1S1 S2S2 a S3S3 S4S4 b S5S5 S0S0 S1S1 S3S3 S4S4 NFA for a * a Ken Thompson, C ACM, 1968 Subsection 3.8.1 in the textbook

15 Example: Thompson’s Construction Let’s try a ( b | c ) * 1. a, b, & c 2. b | c 3. ( b | c ) * S0S0 S1S1 a S0S0 S1S1 b S0S0 S1S1 c S1S1 S2S2 b S3S3 S4S4 c S0S0 S5S5 S2S2 S3S3 b S4S4 S5S5 c S1S1 S6S6 S0S0 S7S7

16 Example: Thompson’s Construction 4. a ( b | c ) * S0S0 S1S1 a S4S4 S5S5 b S6S6 S7S7 c S3S3 S8S8 S2S2 S9S9 S0S0 S1S1 a b | c But, we can automate production of the more complex one... Given a regular expressions r the generated NFA is of size |N| = O(|r|) At most two new states are created at each step Each state has at most two incoming and two outgoing transitions Simulating an NFA constructed with Thompson’s construction on a string x takes O(|N|  |x|) Of course, a human would design something simpler...

17 NFA  DFA with Subset Construction Need to simulate the NFA with the DFA The algorithm States of the DFA correspond to sets of states of NFA –DFA is keeping track of all the states the NFA can reach after reading an input string Start state of the DFA derived from start state of the NFA –Take -closure of the start state of the NFA Work forward, trying each    and taking its -closure Iterative algorithm that halts when the states wrap back on themselves

18 NFA  DFA with Subset Construction s 0 = -closure(q 0 ); Initialize S with {s 0 }; while (there exists an unmarked s i  S) { mark s i ; for each  in  { s ? = -closure(move(s i,,  )) if ( s ? is not in S ) add s ? to S as an unmarked state s j ; T[s i,  ] = s j ; } Why does this algorithm work? The algorithm halts: 1. S contains no duplicates (tests before adding) 2. 2 Q is finite (Q is the set of states of the NFA) 3. while loop adds to S, but does not remove from S (monotone)  the loop halts S contains all the reachable NFA states It tries each character in each s i. It builds every possible NFA configuration.  S and T form the DFA initial state of NFA initial state of DFA S is the set of states of the constructed DFA T is the transition function of the DFA A DFA state is an accepting state if it contains an accepting NFA state

19 NFA  DFA with Subset Construction Example of a fixed-point computation Monotone construction of some finite set Halts when it stops adding to the set Proofs of halting and correctness are similar These computations arise in many contexts Other fixed-point computations Canonical construction of sets of LR(1) items –Quite similar to the subset construction Data-flow analysis algorithms for compiler optimization Automated verification –Verification of an invariant condition can be written as a fixpoint computation

20 Example: NFA  DFA with Subset Construction Remember ( a | b ) * abb ? Applying the subset construction: a | b q0q0 q1q1 q4q4 q2q2 q3q3 abb contains q 4 (final state) Iteration 4 does not add a new state to S and all the states are marked, so the algorithm halts

21 Example: NFA  DFA with Subset Construction The DFA for ( a | b ) * abb Not much bigger than the original All transitions are deterministic Use same code skeleton as before s0s0 a s1s1 b s3s3 b s4s4 s2s2 a b b a a a b

22 DFA Minimization Important Result: Every regular language is recognized by a minimum-state DFA that is unique up to state names Main ideas in DFA minimization Discover sets of equivalent states Represent each such set with just one state Two states are equivalent if and only if: The set of paths leading to them are equivalent    , transitions on  lead to equivalent states transitions to distinct sets  states must be in distinct sets A partition P of S Each s  S is in exactly one set p i  P The algorithm iteratively partitions the DFA ’s states

23 DFA Minimization Details of the algorithm Group states into maximal size sets, optimistically Iteratively subdivide those sets, look for states that can be distinguished from each other States that remain grouped together are equivalent Initial partition, P 0, has two sets {F} & {Q-F} (D =(Q, , ,q 0,F)) Splitting a set Split a set S into smaller groups if states in S have transition to different sets for the same input Assume q 1, q 2, q 3  S, and α  Σ  (q 1, α) = q x,  (q 2, α) = q y, and  (q 3, α) = q z If q x, q y, and q z are not in the same set, then S must be split One state in the final DFA cannot have two transitions on α

24 DFA Minimization The algorithm P = { F, {Q-F}}; while ( P is still changing) { T = { } ; for each set S  P { for each    { partition S by  into S 1, S 2, …, S k ; T = T  S 1, S 2, …, S k ; } if (T != P) P = T; } Why does this work? Partition P  2 Q Start off with 2 subsets of Q {F} and {Q-F} While loop takes P i  P i+1 by splitting 1 or more sets P i+1 is at least one step closer to the partition with |Q | sets Maximum of |Q | splits Note that Partitions are never combined Initial partition ensures that final states are intact This is a fixed-point algorithm! Subsection 3.8.3 in the textbook

25 Example: DFA Minimization Enough theory, does this stuff work? –Recall our example: ( a | b) * abb s0s0 a s1s1 b s3s3 b s4s4 s2s2 a b b a a a b s 0, s 2 a s1s1 b s3s3 b s4s4 b a a a b final state start state

26 Example: DFA Minimization What about a ( b | c ) * ? First, the subset construction: q0q0 q1q1 a q4q4 q5q5 b q6q6 q7q7 c q3q3 q8q8 q2q2 q9q9 s3s3 s2s2 s0s0 s1s1 c b a b b c c Final states

27 Example: DFA Minimization Then, apply the minimization algorithm To produce the minimal DFA s3s3 s2s2 s0s0 s1s1 c b a b b c c s0s0 s1s1 a b | c Previously noted that a human would design a simpler automaton than Thompson’s construction did. The algorithms produce that same DFA ! final states start state

28 Last Link in the Cycle: DFA to RE Translation Basic idea is to eliminate the states of DFA one by one and replace them by regular expressions on the transitions. First add a new start and a new final state and connect them to the start and final states of the input DFA with -transitions –These are the only states which will not be eliminated, the regular expression on the transition between them will be the final result Delete states of the input DFA one by one using: qiqi qdqd qjqj R4R4 R1R1 R3R3 R2R2 qiqi qjqj (R 1 )(R 2 )*(R 3 ) | (R 4 ) Before After eliminating state q d Apply this rule for each q i and q j (Also for cases where q i = q j )

29 Example: DFA to RE Translation 1 2 a a a b b 3 1 a a a b b S 3 2 F S a ba | ab 3 2 F b b aa | b ab bb Initial automaton Step 1) After adding new start and final states Step 2) After deleting state 1 S a(aa|b)* (ba|a)(aa|b)* | a(aa|b)*ab | b 3 F (ba|a)(aa|b)*ab | bb Step 3) After deleting state 2 S F (a(aa|b)*ab|b)((ba|a)(aa|b)*ab|bb)*((ba|a)(aa|b)*| )|a(aa|b)* Step 4) After deleting state 3 We get the result

30 Equivalence of FAs and REs RE  NFA ( Thompson’s construction ) Build an NFA for each term Combine them with -moves NFA  DFA ( subset construction ) Build the simulation DFA  Minimal DFA Hopcroft’s algorithm DFA  RE All pairs, all paths problem Union together paths from s 0 to a final state minimal DFA RENFADFA The Cycle of Constructions

Download ppt "1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis."

Similar presentations