Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 3 Chang Chi-Chung 2007.4.12. The Role of the Lexical Analyzer Lexical Analyzer Parser Source Program Token Symbol Table getNextToken error.

Similar presentations


Presentation on theme: "Chapter 3 Chang Chi-Chung 2007.4.12. The Role of the Lexical Analyzer Lexical Analyzer Parser Source Program Token Symbol Table getNextToken error."— Presentation transcript:

1 Chapter 3 Chang Chi-Chung 2007.4.12

2 The Role of the Lexical Analyzer Lexical Analyzer Parser Source Program Token Symbol Table getNextToken error

3 The Reason for Using the Lexical Analyzer Simplifies the design of the compiler  A parser that had to deal with comments and white space as syntactic units would be more complex.  If lexical analysis is not separated from parser, then LL(1) or LR(1) parsing with 1 token lookahead would not be possible (multiple characters/tokens to match) Compiler efficiency is improved  Systematic techniques to implement lexical analyzers by hand or automatically from specifications  Stream buffering methods to scan input Compiler portability is enhanced  Input-device-specific peculiarities can be restricted to the lexical analyzer.

4 Lexical Analyzer Lexical analyzer are divided into a cascade of two process.  Scanning Consists of the simple processes that do not require tokenization of the input.  Deletion of comments.  Compaction of consecutive whitespace characters into one.  Lexical analysis The scanner produces the sequence of tokens as output.

5 Tokens, Patterns, and Lexemes Token ( 符號單元 )  A pair consisting of a token name and optional arrtibute value.  Example: num, id Pattern ( 樣本 )  A description of the form for the lexemes of a token.  Example: “non-empty sequence of digits”, “letter followed by letters and digits” Lexeme ( 詞 )  A sequence of characters that matches the pattern for a token.  Example: 123, abc

6 Examples: Tokens, Patterns, and Lexemes TokenPatternLexeme ifcharacters i fif elsecharacters e l s eelse comparison or = or == or !=<=, != idletter followed by letters and digits pi, score, D2 numberany numeric constant3.14, 0, 6.23 literal anything but “, surrounded by “ ’s “core dump”

7 An Example E = M * C ** 2 A sequence of pairs by lexical analyzer

8 Input Buffering E=M*C**2 eof lexemeBeginforward Sentinels

9 Lookahead Code with Sentinels switch (*forward++) { case eof: if (forward is at end of first buffer) { reload second buffer; forward = beginning of second buffer; } else if (forward is at end of second buffer) { reload first buffer; forward = beginning of first buffer; } else /* eof within a buffer marks the end of inout */ terminate lexical anaysis; break; cases for the other characters; }

10 Strings and Languages Alphabet  An alphabet  is a finite set of symbols (characters) String  A string is a finite sequence of symbols from   s  denotes the length of string s  denotes the empty string, thus  = 0 Language  A language is a countable set of strings over some fixed alphabet  Abstract Language Φ {ε}

11 String Operations Concatenation ( 連接 )  The concatenation of two strings x and y is denoted by xy Identity ( 單位元素 )  The empty string is the identity under concatenation.   s = s  = s Exponentiation  Define s 0 =  s i = s i-1 s for i > 0  By Define s 1 = s s 2 = ss

12 Language Operations Union L  M = { s  s  L or s  M } Concatenation L M = { xy  x  L and y  M} Exponentiation L 0 = {  } L i = L i-1 L Kleene closure ( 封閉包 ) L * = ∪ i=0,…,  L i Positive closure L + = ∪ i=1,…,  L i

13 Regular Expressions  A convenient means of specifying certain simple sets of strings.  We use regular expressions to define structures of tokens.  Tokens are built from symbols of a finite vocabulary. Regular Sets  The sets of strings defined by regular expressions.

14 Regular Expressions Basis symbols:   is a regular expression denoting language L(  ) = {  }  a   is a regular expression denoting L(a) = { a } If r and s are regular expressions denoting languages L(r) and M(s) respectively, then  r  s is a regular expression denoting L(r)  M(s)  rs is a regular expression denoting L(r)M(s)  r * is a regular expression denoting L(r) *  (r) is a regular expression denoting L(r) A language defined by a regular expression is called a regular set.

15 Operator Precedence OperatorPrecedenceAssociative *highestleft concatenationSecondleft |lowestleft

16 Algebraic Laws for Regular Expressions LawDescription r | s = s | r| is commutative r | ( s | t ) = ( r | s ) | t| is associative r(st) = (rs)tconcatenation is associative r(s|t) = rs | rt (s|t)r = sr | tr concatenation distributes over | εr = rε = rε is the identity for concatenation r* = ( r |ε)*ε is guaranteed in a closure r** = r** is idempotent

17 Regular Definitions If Σ is an alphabet of basic symbols, then a regular definitions is a sequence of definitions of the form: d 1  r 1 d 2  r 2 … d n  r n  Each d i is a new symbol, not in Σ and not the same as any other of d’s.  Each r i is a regular expression over the alphabet   {d 1, d 2, …, d i-1 } Any d j in r i can be textually substituted in r i to obtain an equivalent set of definitions

18 Example: Regular Definitions Regular Definitions letter_  A | B | … | Z | a | b | … | z | _ digit  0 | 1 | … | 9 id  letter_ ( letter_ | digit ) * Regular definitions are not recursive digits  digit digits  digit wrong

19 Extensions of Regular Definitions One or more instance  r + = rr * = r * r  r * = r + | ε Zero or one instance  r? = r | ε Character classes  [a-z] = a  b  c  …  z  [A-Za-z] = A|B|…|Z|a|…|z Example  digit  [0-9]  num  digit + (. digit + )? ( E (+  -)? digit + )?

20 Regular Definitions and Grammars Context-Free Grammars stmt  if expr then stmt  if expr then stmt else stmt   expr  term relop term  term term  id  num Regular Definitions digit  [0-9] letter  [A-Za-z] if  if then  then else  else relop   >  >=  = id  letter ( letter | digit ) * num  digit + (. digit + )? ( E (+ | -)? digit + )? ws  ( blank | tab | newline ) +

21 LEXEMESTOKEN NAMEATTRIBUTE VALUE Any ws -- if - then - else - Any idid Pointer to table entry Any numbernumber Pointer to table entry <relop LT <=relop LE =relop EQ <>relop NE >relop GT >=relop GE

22 Transition Diagrams 0 2 1 6 3 4 5 7 8 return(relop, LE ) return(relop, NE ) return(relop, LT ) return(relop, EQ ) return(relop, GE ) return(relop, GT ) start < = > = > = other * * relop   >  >=  =

23 Transition Diagrams 9 start letter 10 11 * other letter or digit return (getToken(), installID() ) id  letter ( letter | digit ) *

24 An Example: Implement of RELOP TOKEN getRelop() { TOKEN retToken = new(RELOP); while (1) { case 0: c = nextChar(); if (c == ‘<‘) state = 1; else if (c == ‘=‘) state= 5; else if (c == ‘>‘) state= 6; else fail(); break; case 1:...... case 8: retract(); retToken.attribute = GT; return(retTOKEN); }

25 Finite Automata Finite Automata are recognizers.  FA simply say “Yes” or “No” about each possible input string.  A FA can be used to recognize the tokens specified by a regular expression  Use FA to design of a Lexical Analyzer Generator Two kind of the Finite Automata  Nondeterministic finite automata (NFA)  Deterministic finite automata (DFA) Both DFA and NFA are capable of recognizing the same languages.

26 NFA Definitions NFA = { S, , , s 0, F }  A finite set of states S  A set of input symbols Σ input alphabet, ε is not in Σ  A transition function   : S    S  A special start state s 0  A set of final states F, F  S (accepting states)

27 Transition Graph for FA is a state is a transition is a the start state is a final state

28 Example 012 3 a bc c a This machine accepts abccabc, but it rejects abcab. This machine accepts (abc + ) +.

29 Transition Table 0 start a 1 3 2 bb a b STATEabε 0{0, 1}{0}- 1-{2}- 2-{3}- 3--- The mapping  of an NFA can be represented in a transition table  (0, a ) = {0,1}  (0, b ) = {0}  (1, b ) = {2}  (2, b ) = {3}

30 DFA DFA is a special case of an NFA  There are no moves on input ε  For each state s and input symbol a, there is exactly one edge out of s labeled a. Both DFA and NFA are capable of recognizing the same languages.

31 Simulating a DFA Input  An input string x terminated by an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move. Output  Answer “yes” if D accepts x ; “no” otherwise. s = s 0 c = nextChar(); while ( c != eof ) { s = move(s, c); c = nextChar(); } if (s is in F ) return “yes”; else return “no”;

32 NFA vs DFA 0 start a 1 3 2 bb a b S = {0,1,2,3}  = { a, b } s 0 = 0 F = {3} 012 3 abb b a a a (a | b)*abb

33 The Regular Language The regular language defined by an NFA is the set of input strings it accepts.  Example: (a  b)*abb for the example NFA An NFA accepts an input string x if and only if  there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph  A state transition from one state to another on the path is called a move.

34 Theorem The followings are equivalent  Regular Expression  NFA  DFA  Regular Language  Regular Grammar

35 Convert Concept Regular Expression Nondeterministic Finite Automata Deterministic Finite Automata Minimization Deterministic Finite Automata

36 Construction of an NFA from a Regular Expression Use Thompson ’ s Construction s | t N(s)N(s) N(t)N(t)     s t N(s)N(s)N(t)N(t) s*s* N(s)N(s)     a a ε

37 Example ( a | b )* a b b r 11 r8r8 r 10 r7r7 r9r9 r6r6 r5r5 * r4r4 a b b ( r3r3 ) r2r2 r1r1 ab | r 3 = r 4

38 0 start a 1 10 2 b b a b 3 45 6789         ( a | b )* a b b Example

39 Conversion of an NFA to a DFA The subset construction algorithm converts an NFA into a DFA using the following operation. OperationDescription ε- closure(s) Set of NFA states reachable from NFA state s on ε- transitions alone. ε- closure(T) Set of NFA states reachable from some NFA state s in set T on ε-transitions alone. = ∪ s in T ε- closure(s) move(T, a) Set of NFA states to which there is a transition on input symbol a from some state s in T

40 Subset Construction(1) Initially,  -closure(s0) is the only state in Dstates and it is unmarked; while (there is an unmarked state T in Dstates) { mark T; for (each input symbol a   ) { U =  -closure ( move (T, a) ); if ( U is not in Dstates) add U as an unmarked state to Dstates Dtran[T, a] = U } }

41 Computing ε- closure(T)

42 A start B C D E b b b b b a a a a a 0 a 1 10 2 b b a b 3 45 6789         NFA StateDFA Stateab {0,1,2,4,7}ABC {1,2,3,4,6,7,8}BBD {1,2,4,5,6,7}CBC {1,2,4,5,6,7,9}DBE {1,2,3,5,6,7,10}EBC Example ( a | b )* a b b

43 Example 2 a 1 6 a 3 45 bb 8 b 7 a b 0 start    Dstates A = {0,1,3,7} B = {2,4,7} C = {8} D = {7} E = {5,8} F = {6,8} a abb a*b + 0137247 68 7 858 a b b bb b a b

44 Simulation of an NFA Input  An input string x terminated by an end-of-file character eof. An NFA N with start state s0, accepting states F, and transition function move. Output  Answer “yes” if N accepts x ; “no” otherwise. S = ε-closure( s 0 ) c = nextChar(); while ( c != eof ) { S = ε-closure( s 0 ) c = nextChar(); } if (S ∩ F != ψ ) return “yes”; else return “no”;

45 Minimizing the DFA Step 1  Start with an initial partition II with two group: F and S-F (aceepting and nonaccepting) Step 2  Split Procedure Step 3  If ( II new = II ) II final = II and continue step 4 else II = II new and go to step 2 Step 4  Construct the minimum-state DFA by II final group.  Delete the dead state

46 Split Procedure Initially, let II new = II for ( each group G of II ) { Partition G into subgroup such that two states s and t are in the same subgroup if and only if for all input symbol a, states s and t have transition on a to states in the same group of II. /* at worst, a state will be in a subgroup by itself */ replace G in II new by the set of all subgroup formed }

47 Example initially, two sets {1, 2, 3, 5, 6}, {4, 7}. {1, 2, 3, 5, 6} splits {1, 2, 5}, {3, 6} on c. {1, 2, 5} splits {1}, {2, 5} on b.

48 Minimizing the DFA Major operation: partition states into equivalent classes according to  final / non-final states  transition functions ( A B C D E ) ( A B C D ) ( E ) ( A B C ) ( D ) ( E ) ( A C ) ( B ) ( D ) ( E )

49 Important States of an NFA The “ important states ” of an NFA are those without an  -transition, that is  if move({s}, a)   for some a then s is an important state The subset construction algorithm uses only the important states when it determines  -closure ( move(T, a) ) Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#

50 Converting a RE Directly to a DFA Construct a syntax tree for ( r ) # Traverse the tree to construct functions nullable, firstpos, lastpos, and followpos Construct DFA D by algorithm 3.62

51 Function Computed From the Syntax Tree nullable(n)  The subtree at node n generates languages including the empty string firstpos(n)  The set of positions that can match the first symbol of a string generated by the subtree at node n lastpos(n)  The set of positions that can match the last symbol of a string generated be the subtree at node n followpos(i)  The set of positions that can follow position i in the tree

52 Rules for Computing the Function Node n nullable(n)firstpos(n)lastpos(n) A leaf labeled by  true  A leaf with position i false{i}{i}{i}{i} n = c 1 | c 2 nullable(c 1 ) or nullable(c 2 ) firstpos(c 1 )  firstpos(c 2 )lastpos(c 1 )  lastpos(c 2 ) n = c 1 c 2 nullable(c 1 ) and nullable(c 2 ) if ( nullable(c 1 ) ) firstpos(c 1 )  firstpos(c 2 ) else firstpos(c 1 ) if ( nullable(c 2 ) ) lastpos(c 1 )  lastpos(c 2 ) else lastpos(c 2 ) n = c 1 * truefirstpos(c 1 )lastpos(c 1 )

53 Computing followpos for (each node n in the tree) { //n is a cat-node with left child c1 and right child c2 if ( n == c1 . c2) for (each i in lastpos(c1) ) followpos(i) = followpos(i)  firstpos(c2); else if (n is a star-node) for ( each i in lastpos(n) ) followpos(i) = followpos(i)  firstpos(n); }

54 Converting a RE Directly to a DFA Initialize Dstates to contain only the unmarked state firstpos (n 0 ), where n 0 is the root of syntax tree T for (r)#; while ( there is an unmarked state S in Dstates ) { mark S; for ( each input symbol a   ) { let U be the union of followpos ( p ) for all p in S that correspond to a; if ( U is not in Dstates ) add U as an unmarked state to Dstates Dtran[S,a] = U ; } }

55 Example ( a | b )* a b b # nullable(n) = false firstpos(n) = { 1, 2, 3 } lastpos(n) = { 3 } followpos(1) = {1, 2, 3 } ○ b # ○ ○ b ○ a * 4 5 6 | ab 3 21 n n = ( a | b )* a

56 Example {6}{1, 2, 3} {5}{1, 2, 3} {4}{1, 2, 3} {3}{1, 2, 3} {1, 2} * | {1} a {2} b {3} a {4} b {5} b {6} # nullable firstposlastpos 12 3 4 5 6 ( a | b )* a b b #

57 Example 1,2,3 a 1,2, 3,4 1,2,3,6 1,2, 3,5 bb bb a a a Nodefollowpos 1{1, 2, 3} 2 3{4} 4{5} 5{6} 6- 1 2 3456 ( a | b )* a b b #

58 Time and Space Complexity Automaton Space (worst case) Time (worst case) NFA O(r)O(r)O(  r  x  ) DFAO(2 |r| ) O(x)O(x)


Download ppt "Chapter 3 Chang Chi-Chung 2007.4.12. The Role of the Lexical Analyzer Lexical Analyzer Parser Source Program Token Symbol Table getNextToken error."

Similar presentations


Ads by Google