# LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.



Faculty of IT - HCMUT

## Outline

- Introduction to Lexical Analysis
- Token specification
  - Language
  - Regular Expressions (REs)
- Token recognition
  - REs → NFA (Thompson's construction, Algorithm 3.3)
  - NFA → DFA (subset construction, Algorithm 3.2)
  - DFA → minimal DFA (Algorithm 3.6)
- Programming

## Introduction

The lexical analyzer:

- reads the input characters
- produces as output a sequence of tokens
- eliminates white space and comments

[Figure: the source program feeds the lexical analyzer, which hands a token to the parser each time the parser issues "get next token"; both consult the symbol table.]

## Why?

Separating lexical analysis from parsing helps to:

- simplify the design
- improve compiler efficiency
- enhance compiler portability

## Tokens, Patterns, Lexemes

| Token | Sample lexemes | Informal description of pattern |
|---|---|---|
| const | const | const |
| if | if | if |
| relation | <, <=, =, <>, >, >= | < or <= or = or <> or > or >= |
| id | pi, count, x2 | letter followed by letters or digits |
| num | 3.14, 25, 6.02E3 | any numeric constant |
| literal | "core dumped" | any characters between " and " except " |
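The token/pattern/lexeme distinction can be made concrete with a small sketch. It assumes Python's standard `re` module and writes each informal pattern from the table as a regular expression; the exact expressions (for instance the form of `num`) and the `skip` entry for white space are illustrative assumptions, not part of the slides.

```python
import re

# (token name, pattern) pairs; order matters: keywords are tried before id.
# The concrete regular expressions are assumptions made for this sketch.
TOKEN_SPEC = [
    ("num", r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("relation", r"<=|<>|>=|<|>|="),
    ("literal", r'"[^"]*"'),
    ("const", r"\bconst\b"),
    ("if", r"\bif\b"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("skip", r"\s+"),  # white space is matched but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokens(text):
    """Return (token, lexeme) pairs for every match found in text."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "skip"]


print(tokens('if pi >= 6.02E3 "core dumped"'))
```

Each pair couples a token name with the lexeme that matched its pattern. Note that `finditer` silently skips characters no pattern matches, so a real scanner would need explicit error handling.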


## Alphabet, Strings and Languages

- Alphabet ∑: any finite set of symbols
  - The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  - The binary alphabet {0, 1}
  - The ASCII alphabet
- String: a finite sequence of symbols drawn from ∑
  - Length |s| of a string s: the number of symbols in s
  - The empty string, denoted ε, has |ε| = 0
- Language: any set of strings over ∑
  - Two special cases: ∅, the empty set, and {ε}, the set containing only the empty string

## Examples of Languages

- ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  - The Vietnamese language
- ∑ = {0, 1}
  - A string is an instruction
  - The set of Pentium instructions
- ∑ = the ASCII set
  - A string is a program
  - The set of C programs

## Terms (Fig. 3.7)

| Term | Definition |
|---|---|
| prefix of s | a string obtained by removing 0 or more trailing symbols of s; e.g. ban is a prefix of banana |
| suffix of s | a string formed by deleting 0 or more leading symbols of s; e.g. na is a suffix of banana |
| substring of s | a string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana |
| proper prefix, suffix or substring of s | any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x |
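As a quick check of these definitions, the following sketch enumerates the prefixes, suffixes and substrings of a string with Python slices. The function names are hypothetical helpers chosen for this illustration.

```python
def prefixes(s):
    """All strings obtained by removing 0 or more trailing symbols of s."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All strings obtained by deleting 0 or more leading symbols of s."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All strings obtained by deleting a prefix and a suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts, s):
    """Keep only the nonempty members of parts that differ from s."""
    return {x for x in parts if x and x != s}


print("ban" in prefixes("banana"))    # True
print("na" in suffixes("banana"))     # True
print("nan" in substrings("banana"))  # True
```

With these definitions both ε and the whole string count as a prefix, a suffix and a substring; `proper` removes exactly those two cases.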

## String operations

- String concatenation
  - If x and y are strings, xy is the string formed by appending y to x. E.g.: x = hom, y = nay ⇒ xy = homnay
  - ε is the identity: εy = y; xε = x
- String exponentiation
  - s⁰ = ε
  - sⁱ = sⁱ⁻¹s
  - E.g. for s = 01: s⁰ = ε, s² = 0101, s³ = 010101
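The recursive definition of exponentiation translates almost word for word into code. This is a minimal sketch in which the empty Python string stands for ε; the name `power` is an assumption of the example.

```python
def power(s, i):
    """Return s concatenated with itself i times, following s^0 = ε, s^i = s^(i-1) s."""
    if i == 0:
        return ""  # the empty string plays the role of ε
    return power(s, i - 1) + s


print(power("01", 0) == "")  # True
print(power("01", 2))        # 0101
print(power("01", 3))        # 010101
```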

## Language Operations (Fig. 3.8)

| Operation | Definition |
|---|---|
| union: L ∪ M | L ∪ M = { s \| s ∈ L or s ∈ M } |
| concatenation: LM | LM = { st \| s ∈ L and t ∈ M } |
| Kleene closure: L\* | L\* = L⁰ ∪ L ∪ LL ∪ LLL ∪ …, where L⁰ = {ε}; 0 or more concatenations of L |
| positive closure: L⁺ | L⁺ = L ∪ LL ∪ LLL ∪ …; 1 or more concatenations of L |

## Examples

L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}

| Example | Language |
|---|---|
| L ∪ D | letters and digits |
| LD | strings consisting of a letter followed by a digit |
| L⁴ | all four-letter strings |
| L\* | all strings of letters, including ε |
| L(L ∪ D)\* | all strings of letters and digits beginning with a letter |
| D⁺ | all strings of one or more digits |
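The language operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the helper names are hypothetical, the two-symbol alphabets are shrunk for readability, and because L* is infinite the closure is truncated at a chosen number of concatenations.

```python
def concat(L, M):
    """LM: every string st with s in L and t in M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i, with L^0 = {ε} (ε is the empty string here)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure_up_to(L, n):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ … ∪ L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}  # stand-in for the letters
D = {"0", "1"}  # stand-in for the digits

print(sorted(L | D))           # union
print(sorted(concat(L, D)))    # a letter followed by a digit
print(len(power(L, 4)))        # number of four-letter strings over {a, b}
print(sorted(closure_up_to(L, 2)))
```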

## Regular Expressions (REs) over Alphabet ∑

Inductive base:

1. ε is a RE, denoting the regular language (RL) {ε}
2. a ∈ ∑ is a RE, denoting the RL {a}

Inductive step: suppose r and s are REs, denoting the languages L(r) and L(s). Then

3. `(r)|(s)` is a RE, denoting the RL L(r) ∪ L(s)
4. `(r)(s)` is a RE, denoting the RL L(r)L(s)
5. `(r)*` is a RE, denoting the RL (L(r))\*
6. `(r)` is a RE, denoting the RL L(r)

## Precedence and Associativity

- Precedence:
  - `*` has the highest precedence
  - concatenation has the second highest precedence
  - `|` has the lowest precedence
- Associativity:
  - all are left-associative
- Unnecessary parentheses can be removed. E.g.: `(a)|((b)*(c))` can be written as `a|b*c`

## Example

∑ = {a, b}

1. `a|b` denotes {a, b}
2. `(a|b)(a|b)` denotes {aa, ab, ba, bb}
3. `a*` denotes {ε, a, aa, aaa, aaaa, …}
4. `(a|b)*` denotes ?
5. `a|a*b` denotes ?

## Notational Shorthands

- One or more instances `+`: `r+` = `rr*`
  - denotes the language (L(r))⁺
  - has the same precedence and associativity as `*`
- Zero or one instance `?`: `r?` = `r|ε`
  - denotes the language L(r) ∪ {ε}
- Character classes
  - `[abc]` denotes `a|b|c`
  - `[A-Z]` denotes `A|B|…|Z`
  - `[a-zA-Z_][a-zA-Z0-9_]*` denotes ?
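These shorthands exist in most regular-expression libraries, so the identities can be spot-checked. The sketch below assumes Python's `re` module and compares two patterns on a handful of sample strings; agreement on samples is evidence, not a proof of equivalence, and the helper name `same_on_samples` is hypothetical.

```python
import re

SAMPLES = ["", "a", "aa", "aaa", "b", "ab", "c", "abc"]


def same_on_samples(p, q, samples=SAMPLES):
    """True if patterns p and q accept exactly the same sample strings."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
               for s in samples)


print(same_on_samples(r"a+", r"aa*"))        # r+ = rr*
print(same_on_samples(r"a?", r"(?:a|)"))     # r? = r|ε (empty alternative)
print(same_on_samples(r"[abc]", r"a|b|c"))   # character class

# The last character-class pattern, tried on a few candidate strings.
IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"
for candidate in ["x2", "_tmp", "2x", ""]:
    print(repr(candidate), bool(re.fullmatch(IDENT, candidate)))
```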


## Overview

[Figure: RE → NFA (Algorithm 3.3) → DFA (Algorithm 3.2) → minimal DFA (Algorithm 3.6), plus RE → DFA directly (Algorithm 3.5).]

## Nondeterministic finite automata

A nondeterministic finite automaton (NFA) is a mathematical model that consists of

- a finite set of states S
- a set of input symbols ∑
- a transition function move: S × (∑ ∪ {ε}) → 2^S, mapping a state and an input symbol (or ε) to a set of states
- a start state s₀
- a finite set of final or accepting states F

## Transition graph

[Figure: legend showing how a state, a transition labeled a from state A to state B, a start state, and a final state are drawn.]

## Transition table

| State | Input symbol a | Input symbol b |
|---|---|---|
| 0 | {0, 1} | {0} |
| 1 | - | {2} |
| 2 | - | {3} |

## Acceptance

An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.

Example, for an automaton with states A and B, an edge labeled 0 from A to B, and an edge labeled 1 from B to A:

- 01010: A → B → A → B → A → B
- 01011: A → B → A → B → A → ? (error: no edge for the final 1)
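Acceptance can be checked mechanically by tracking the set of states the NFA could be in after each symbol. The sketch below uses the transition table from the earlier slide (states 0 to 2 on inputs a and b); taking state 0 as the start state and state 3 as the only accepting state is an assumption, since the slides do not say so explicitly, and the NFA has no ε-transitions.

```python
# Transition table: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0        # assumed start state
ACCEPTING = {3}  # assumed accepting state


def accepts(x):
    """True if some path from START spelling out x ends in an accepting state."""
    current = {START}
    for ch in x:
        # Follow every edge labeled ch out of every state we might be in.
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
    return bool(current & ACCEPTING)


for word in ["abb", "aabb", "babb", "ab", ""]:
    print(repr(word), accepts(word))
```

Under these assumptions the automaton accepts exactly the strings over {a, b} that end in abb.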

## Deterministic finite automata

A deterministic finite automaton (DFA) is a special case of NFA in which

1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

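## Thompson's construction of NFA from REs

The construction is guided by the syntactic structure of the RE r.

- For ε: an NFA with start state i, accepting state f, and a single edge from i to f labeled ε
- For a in ∑: an NFA with start state i, accepting state f, and a single edge from i to f labeled a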

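## Thompson's construction (cont'd)

Suppose N(s) and N(t) are NFAs for REs s and t.

- For `s|t`: a new start state i has ε-edges to the start states of N(s) and N(t), and the accepting states of N(s) and N(t) have ε-edges to a new accepting state f
- For `st`: the accepting state of N(s) is merged with the start state of N(t); the start state of N(s) becomes i and the accepting state of N(t) becomes f
- For `s*`: a new start state i has ε-edges to the start state of N(s) and to a new accepting state f; the accepting state of N(s) has ε-edges back to its own start state and to f
- For `(s)`: use N(s) itself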
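The construction can be sketched as a recursive walk over the RE's syntax tree. Everything named here is an assumption of the example: REs are given as nested tuples (`("sym", "a")`, `("alt", r, s)`, `("cat", r, s)`, `("star", r)`, `("eps",)`) rather than parsed from text, `None` labels ε-edges, and concatenation links the two fragments with an ε-edge instead of merging states, which is a common variant that accepts the same language.

```python
from itertools import count

EPS = None  # label for ε-edges (assumption of this sketch)


class NFA:
    def __init__(self):
        self.edges = {}  # (state, label) -> set of target states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add_edge(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)


def build(nfa, node):
    """Add the fragment N(node) to nfa and return its (start, accept) states."""
    kind = node[0]
    if kind in ("eps", "sym"):
        i, f = nfa.new_state(), nfa.new_state()
        nfa.add_edge(i, EPS if kind == "eps" else node[1], f)
        return i, f
    if kind in ("alt", "cat"):
        si, sf = build(nfa, node[1])
        ti, tf = build(nfa, node[2])
        if kind == "cat":
            nfa.add_edge(sf, EPS, ti)  # variant: ε-edge instead of merging sf and ti
            return si, tf
        i, f = nfa.new_state(), nfa.new_state()
        for src, dst in ((i, si), (i, ti), (sf, f), (tf, f)):
            nfa.add_edge(src, EPS, dst)
        return i, f
    if kind == "star":
        si, sf = build(nfa, node[1])
        i, f = nfa.new_state(), nfa.new_state()
        for src, dst in ((i, si), (i, f), (sf, si), (sf, f)):
            nfa.add_edge(src, EPS, dst)
        return i, f
    raise ValueError(f"unknown node kind: {kind}")


def eps_closure(nfa, states):
    """All states reachable from states using ε-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        for t in nfa.edges.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(nfa, start, accept, text):
    current = eps_closure(nfa, {start})
    for ch in text:
        moved = set()
        for s in current:
            moved |= nfa.edges.get((s, ch), set())
        current = eps_closure(nfa, moved)
    return accept in current


# (a|b)*abb as a syntax tree
a, b = ("sym", "a"), ("sym", "b")
TREE = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)
nfa = NFA()
start, accept = build(nfa, TREE)
print(accepts(nfa, start, accept, "babb"))  # True
print(accepts(nfa, start, accept, "ab"))    # False
```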


## Subset construction

| Operation | Description |
|---|---|
| ε-closure(s) | Set of NFA states reachable from state s on ε-transitions alone |
| ε-closure(T) | Set of NFA states reachable from some state s in T on ε-transitions alone |
| move(T, a) | Set of NFA states to which there is a transition on input a from some state s in T |

Here s is an NFA state and T is a set of NFA states.

## Subset construction (cont'd)

Let s₀ be the start state of the NFA. Initially, ε-closure(s₀) is the only state in Dstates, and it is unmarked.

    while there is an unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
            U := ε-closure(move(T, a));
            if U is not in Dstates then
                add U as an unmarked state to Dstates;
            DTran[T, a] := U
        end
    end

## DFA

Let (∑, S, T, F, s₀) be the original NFA. The DFA is:

- The alphabet: ∑
- The states: all states in Dstates
- The transitions: DTran
- The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
- The start state: ε-closure(s₀)
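The subset construction and the resulting DFA can be sketched together. The example assumes the NFA is a dictionary from (state, label) to a set of states with `None` as the ε label, and it reuses the NFA from the transition-table slide with state 0 as start and state 3 as the only accepting state (both assumptions). DFA states are represented as frozensets of NFA states.

```python
EPS = None  # label for ε-edges (assumption of this sketch)

NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
ACCEPTING = {3}  # assumed NFA accepting states


def eps_closure(nfa, states):
    """NFA states reachable from states on ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        for t in nfa.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def move(nfa, states, a):
    """NFA states reachable on input a from some state in states."""
    result = set()
    for s in states:
        result |= nfa.get((s, a), set())
    return result


def subset_construction(nfa, start, alphabet):
    d_start = frozenset(eps_closure(nfa, {start}))
    dstates, unmarked, dtran = {d_start}, [d_start], {}
    while unmarked:
        T = unmarked.pop()  # popping T marks it
        for a in alphabet:
            U = frozenset(eps_closure(nfa, move(nfa, T, a)))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return d_start, dstates, dtran


d_start, dstates, dtran = subset_construction(NFA, 0, "ab")


def dfa_accepts(text):
    state = d_start
    for ch in text:
        state = dtran[(state, ch)]
    # A DFA state accepts if it contains an accepting NFA state.
    return bool(state & ACCEPTING)


print(sorted(sorted(s) for s in dstates))
print(dfa_accepts("aabb"), dfa_accepts("abab"))
```

For NFAs where some symbol leads nowhere, U can be the empty set; it then acts as a dead state with transitions to itself.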


## Minimise a DFA

Initially, create two states (each a group of states of the original DFA):

1. one is the set of all final states: F
2. the other is the set of all non-final states: S - F

Then split the groups repeatedly:

    while (more splits are possible) {
        Let S = {s1, …, sn} be a state and c be any char in ∑
        Let t1, …, tn be the successor states to s1, …, sn under c
        if (t1, …, tn don't all belong to the same state) {
            Split S into new states so that si and sj remain in the
            same state iff ti and tj are in the same state
        }
    }

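## Example

[Figure: the original DFA with states A, B, C, D, E and edges labeled a and b.]

- Step 1: start from {A,B,C,D} {E}. For a, the successors of {A,B,C,D} are {B,B,B,B}; for b, they are {C,D,C,E}. Split: {A,B,C} {D} {E}
- Step 2: for b, the successors of {A,B,C} are {C,D,C}. Split: {A,C} {B} {D} {E}
- Step 3: for a, the successors of {A,C} are {B,B}; for b, they are {C,C}. Terminate.

[Figure: the minimal DFA with states A, B, D, E.]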
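The splitting procedure can be replayed in code on this example. The sketch below is a partition-refinement variant that, in each round, splits every group on all input symbols at once rather than one symbol at a time, so its rounds may differ from the steps above while the final partition is the same. The transitions of A to D are read off the successor lists in the example; E being the only final state and E's own transitions (a to B, b to C) are assumptions, since the figure did not survive.

```python
STATES = "ABCDE"
ALPHABET = "ab"
FINALS = {"E"}  # assumed: E is the only final state
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",  # assumed, not recoverable from the text
}


def minimize(states, alphabet, delta, finals):
    """Return the groups of equivalent states as a list of sets."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        refined = []
        for group in partition:
            # States stay together iff their successors fall in the same groups.
            buckets = {}
            for s in sorted(group):
                key = tuple(group_of[delta[(s, c)]] for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):  # no group was split
            return refined
        partition = refined


groups = minimize(STATES, ALPHABET, DELTA, FINALS)
print(sorted("".join(sorted(g)) for g in groups))
```

The resulting groups are {A,C}, {B}, {D} and {E}, matching the partition reached in Step 3.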


## Input Buffering

[Figure: the scanner reading a two-half input buffer holding begin…, with eof marking the end of the input.]

Advancing the forward pointer:

    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        forward++

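## Input Buffering (cont'd)

[Figure: the scanner reading a two-half input buffer holding begin…, with eof marking the end of the input.]

With an eof sentinel at the end of each half, the common case needs only one test per character:

    forward = forward + 1
    if (forward↑ = eof) {
        if (forward at end of first half) {
            reload second half
            forward++
        } else if (forward at end of second half) {
            reload first half
            forward = 0
        } else
            terminate the analysis
    }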
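The sentinel scheme can be sketched as a small class. This is an illustration under several assumptions: the class name `TwoHalfBuffer`, the use of `"\0"` as the eof sentinel (which must not occur in the source text), and the tiny half size are all made up for the example, and retracting across a half boundary is not handled.

```python
import io

EOF = "\0"  # sentinel; assumed never to appear in the source text


class TwoHalfBuffer:
    """Layout: [half 0][EOF][half 1][EOF], each half holding `half` characters."""

    def __init__(self, stream, half=4):
        self.stream = stream
        self.n = half
        self.buf = [EOF] * (2 * half + 2)
        self._load(0)
        self.forward = -1

    def _load(self, which):
        start = which * (self.n + 1)
        data = self.stream.read(self.n)
        for i in range(self.n):
            # Unused positions are filled with EOF so that end of input is visible.
            self.buf[start + i] = data[i] if i < len(data) else EOF

    def next_char(self):
        """Advance forward and return the next character, or None at end of input."""
        self.forward += 1
        ch = self.buf[self.forward]
        if ch == EOF:  # the single test made in the common case
            if self.forward == self.n:              # sentinel ending the first half
                self._load(1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:    # sentinel ending the second half
                self._load(0)
                self.forward = 0
            else:                                   # eof inside a half: real end
                self.forward -= 1
                return None
            ch = self.buf[self.forward]
        return None if ch == EOF else ch


def read_all(text, half=4):
    buffer = TwoHalfBuffer(io.StringIO(text), half)
    chars = []
    while True:
        ch = buffer.next_char()
        if ch is None:
            return "".join(chars)
        chars.append(ch)


print(read_all("begin := 1"))
```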

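## Transition Diagrams

relop → `<` | `<=` | `<>`

| From state | On input | To state | Action in the target state |
|---|---|---|---|
| 0 | < | 1 | |
| 1 | = | 2 | return(relop, LE) |
| 1 | > | 3 | return(relop, NE) |
| 1 | other | 4 | return(relop, LT) |

id → letter(letter\|digit)\*

| From state | On input | To state | Action in the target state |
|---|---|---|---|
| 5 | letter | 6 | |
| 6 | letter or digit | 6 | |
| 6 | other | 7 | return(id, lexeme) |

A transition diagram is a DFA in which no edge leaves a final state.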

## Implementation

    Token nexttoken() {
        while (true) {
            switch (state) {
            case 0:                              // start of the relop diagram
                c = nextchar();
                if (c == '<') state = 1;
                else state = fail(0);            // not a relop: try the next diagram
                break;
            case 1:
                c = nextchar();
                if (c == '=') state = 2;
                else if (c == '>') state = 3;
                else state = 4;
                break;
            case 2: retract(0); return new Token(relop, "<=");
            case 3: retract(0); return new Token(relop, "<>");
            case 4: retract(1); return new Token(relop, "<");   // give back the lookahead
            case 5:                              // start of the id diagram
                c = nextchar();
                if (Character.isLetter(c)) state = 6;
                else state = fail(5);
                break;
            case 6:
                c = nextchar();
                if (Character.isLetter(c) || Character.isDigit(c)) continue;  // stay in state 6
                else state = 7;
                break;
            case 7: retract(1); return new Token(id, getLexeme());
            }
        }
    }

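## Implementation (cont'd)

    int fail(int current_state) {
        forward = beginning;             // restart at the beginning of the lexeme
        switch (current_state) {
        case 0: return 5;                // relop diagram failed: try the id diagram
        case 5: error();                 // no diagram matches
        }
        return -1;                       // reached only if error() returns
    }

    void retract(int flag) {
        if (flag == 1) move forward back one character   // undo the lookahead
        get lexeme from beginning to forward
        move forward onward one character
        beginning = forward
        state = 0
    }

Buffer contents: b│e│g│i│n│:│=│ │ │…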
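For comparison, here is a compact runnable sketch of the same two transition diagrams. It is not a translation of the slide code: positions in a string replace the `beginning` and `forward` pointers, retraction is expressed by simply not advancing past the lookahead character, and the white-space skipping in `tokenize` is an added assumption.

```python
def next_token(text, pos):
    """Run the relop diagram, then the id diagram, starting at text[pos].

    Returns (token, lexeme, position after the lexeme).
    """
    state, start = 0, pos
    while True:
        c = text[pos] if pos < len(text) else ""  # "" acts as 'other' at end of input
        if state == 0:
            if c == "<":
                state, pos = 1, pos + 1
            else:
                state = 5  # like fail(0): retry from the same start with the id diagram
        elif state == 1:
            if c == "=":
                return ("relop", "<=", pos + 1)   # state 2
            if c == ">":
                return ("relop", "<>", pos + 1)   # state 3
            return ("relop", "<", pos)            # state 4: lookahead not consumed
        elif state == 5:
            if c.isalpha():
                state, pos = 6, pos + 1
            else:
                raise ValueError(f"no token starts at position {start}")
        elif state == 6:
            if c.isalnum():
                pos += 1                          # stay in state 6
            else:
                return ("id", text[start:pos], pos)  # state 7: lookahead not consumed


def tokenize(text):
    result, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # assumption: white space separates tokens
            pos += 1
            continue
        token, lexeme, pos = next_token(text, pos)
        result.append((token, lexeme))
    return result


print(tokenize("<= count <> x2 <"))
```

Returning the new position instead of mutating shared pointers keeps the sketch free of global state, at the cost of not modelling the input buffer itself.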
