Lexical Analysis Cheng-Chia Chen.


Outline 1. The goal and niche of lexical analysis in a compiler 2. Lexical tokens 3. Regular expressions (RE) 4. Use regular expressions in lexical specification 5. Finite automata (FA): DFA and NFA; from RE to NFA; from NFA to DFA; from DFA to optimized DFA 6. Lexical-analyzer generators

1. The goal and niche of lexical analysis Source (char stream) -> Lexical Analysis -> Tokens (token stream) -> Parsing -> Interm. Language -> Optimization -> Code Gen. -> Machine Code. Goal of lexical analysis: breaking the input into individual words or “tokens”.

Lexical Analysis What do we want to do? Example: if (i == j) z = 0; else z = 1; The input is just a sequence of characters: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; Goal: Partition the input string into substrings, and determine the categories (token types) to which the substrings belong.

2. Lexical Tokens What’s a token ? Token attributes Normal token and special tokens Example of tokens and special tokens.

What’s a token? a sequence of characters that can be treated as a unit in the grammar of a PL. Output of lexical analysis is a stream of tokens Tokens are partitioned into categories called token types. ex: In English: book, students, like, help, strong,… : token - noun, verb, adjective, … : token type In a programming language: student, var34, 345, if, class, “abc” … : token ID, Integer, IF, WHILE, Whitespace, … : token type Parser relies on the token type instead of token distinctions to analyze: var32 and var1 are treated the same, var32(ID), 32(Integer) and if(IF) are treated differently.

Token attributes token type : category of the token; used by syntax analysis. ex: identifier, integer, string, if, plus, … token value : semantic value used in semantic analysis. ex: [integer, 26], [string, “26”] token lexeme (member, text): textual content of a token [while, “while”], [identifier, “var23”], [plus, “+”], [integer, “26”],… positional information: file + start/end line/position of the textual content in the source program.

Notes on Token attributes Token types affect syntax analysis Token values affect semantic analysis lexeme and positional information affect error handling Only token type information must be supplied by the lexical analyzer for syntax analysis. Any program performing lexical analysis is called a scanner (lexer, lexical analyzer).

Aspects of Token types Language view: A token type is the set of all lexemes of all its token instances. ID = {a, ab, … } – {if, do,…}. Integer = { 123, 456, …} IF = {if}, WHILE={while}; STRING={“abc”, “if”, “WHILE”,…} Pattern (regular expression): a rule defining the language of all instances of a token type. WHILE: w h i l e ID: letter (letters | digits )* ArithOp: + | - | * | /

Lexical Analyzer: Implementation An implementation must do two things: Recognize substrings corresponding to lexemes of tokens Determine token attributes type is necessary value depends on the type/application, lexeme/positional information depends on applications (eg: debug or not).

Example input lines: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; Token-lexeme pairs returned by the lexer: [Whitespace, “\t”] [if, - ] [OpenPar, “(“] [Identifier, “i”] [Relation, “==“] [Identifier, “j”] …

Normal Tokens and special Tokens Kinds of tokens normal tokens: needed for later syntax analysis and must be passed to parser. special tokens skipped tokens (or nontoken): do not contribute to parsing, discarded by the scanner. Examples: Whitespace, Comments why need them ? Question: What happens if we remove all whitespace and all comments prior to scanning?

Lexical Analysis in FORTRAN FORTRAN rule: Whitespace is insignificant E.g., VAR1 is the same as VA R1 Footnote: FORTRAN whitespace rule motivated by inaccuracy of punch card operators

A terrible design! Example: Consider DO 5 I = 1,25 versus DO 5 I = 1.25 The first is DO 5 I = 1 , 25 (a DO loop); the second is DO5I = 1.25 (an assignment). Reading left-to-right, we cannot tell whether DO5I is a variable or a DO statement until the “,” (or “.”) is reached.

Lexical Analysis in FORTRAN. Lookahead. Two important points: The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time “Lookahead” may be required to decide where one token ends and the next token begins Even our simple example has lookahead issues i vs. if = vs. ==

Some token types of a typical PL Examples:
  ID             foo  n14  last
  NUM (literal)  73  0  00  515  082
  REAL (literal) 66.1  .5  10.  1e67  1.5e-10
  IF             if
  COMMA          ,
  NOTEQ          !=
  LPAREN         (
  RPAREN         )

Some Special Tokens Tokens 1 and 5 are skipped; 2 and 3 need preprocessing; 4 needs to be expanded. 1. comment /* … */ // … 2. preprocessor directive #include <stdio.h> 3. macro definition #define NUMS 5,6 4. macro use NUMS 5. blanks, tabs, newlines \t \n

3. Regular expressions and Regular Languages

The geography of lexical tokens Inside the set of all strings: ID: var1, last5, … NUM: 23 56 0 000 REAL: 12.35 2.4e-10 IF: if LPAREN: ( RPAREN: ) special tokens: \t \n /* … */ …

Issues Definition problem: how to define (formally specify) the set of strings(tokens) belonging to a token type ? => regular expressions (Recognition problem) How to determine which set (token type) an input string belongs to? => DFA!

Languages Def. Let Σ be a set of symbols (or characters). A language over Σ is a set of strings of characters drawn from Σ (Σ is called the alphabet).

Examples of Languages Alphabet = English characters Language = English words Not every string on English characters is an English word likes, school,… beee,yykk,… Alphabet = ASCII Language = C programs Note: ASCII character set is different from English character set

Regular Expressions A language (metalanguage) for representing (or defining) languages (sets of words). Definition: Let Σ be an alphabet. The set of regular expressions (RegExpr) over Σ is defined recursively as follows: (Atomic RegExpr): 1. any symbol c ∈ Σ is a RegExpr. 2. ε (the empty string) is a RegExpr. (Compound RegExpr): if A and B are RegExpr, then so are 3. (A | B) (alternation) 4. (A · B) (concatenation) 5. A* (repetition)
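The recursive definition above transcribes almost literally into code. A minimal Python sketch (the class names Sym, Eps, Alt, Cat, Star are my own, not from the slides):

```python
from dataclasses import dataclass
from typing import Any

# Atomic regular expressions
@dataclass(frozen=True)
class Sym:        # 1. a single symbol c from the alphabet
    c: str

@dataclass(frozen=True)
class Eps:        # 2. the empty string epsilon
    pass

# Compound regular expressions
@dataclass(frozen=True)
class Alt:        # 3. (A | B), alternation
    a: Any
    b: Any

@dataclass(frozen=True)
class Cat:        # 4. (A . B), concatenation
    a: Any
    b: Any

@dataclass(frozen=True)
class Star:       # 5. A*, repetition
    a: Any

# The expression (1|0)*1 used in the NFA examples later:
regexp = Cat(Star(Alt(Sym("1"), Sym("0"))), Sym("1"))
```

Every regular expression is a finite tree built from these five constructors, which is exactly what makes the later structural translation to NFAs possible.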

Semantics (Meaning) of regular expressions For each regular expression A, we use L(A) to denote the language defined by A. I.e., L is the function: L: RegExpr(Σ) -> the set of languages over Σ, with L(A) = the language denoted by the RegExpr A. The meaning of a RegExpr can be made clear by explicitly defining L.

Atomic Regular Expressions 1. Single symbol: c L(c) = { c } (for any c ∈ Σ) 2. Epsilon (empty string): ε L(ε) = { ε }

Compound Regular Expressions 3. Alternation (or union or choice): L( (A | B) ) = { s | s ∈ L(A) or s ∈ L(B) } 4. Concatenation: A · B (where A and B are reg. expr.) L( (A · B) ) = L(A) · L(B) =def { a · b | a ∈ L(A) and b ∈ L(B) } Note: Parentheses enclosing (A|B) and (A·B) can be omitted if there is no worry of confusion. M · N (set concatenation) and a · b (string concatenation) will be abbreviated to MN and ab, respectively. AA and L(A) · L(A) are abbreviated as A² and L(A)², respectively.

Examples if | then | else -> { if, then, else } 0 | 1 | … | 9 -> { 0, 1, …, 9 } (0 | 1) (0 | 1) -> { 00, 01, 10, 11 }

More Compound Regular Expressions 5. Repetition (or iteration): A* L(A*) = { ε } ∪ L(A) ∪ L(A)² ∪ L(A)³ ∪ … Examples: 0* : { ε, 0, 00, 000, … } 10* : strings starting with 1 and followed by 0’s. (0|1)*0 : binary representations of even numbers. (a|b)*aa(a|b)* : strings of a’s and b’s containing consecutive a’s. (a|b)*aa(a|b)* b*(abb*)*(a|ε) : strings of a’s and b’s with no consecutive a’s.
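Since L(A*) is an infinite union, it cannot be printed in full, but it can be sampled: enumerate all strings up to a bounded length and keep those the expression matches. A sketch using Python's re module, whose |, * and parentheses happen to coincide with the notation here for these simple cases (lang_upto is my own helper name):

```python
import re
from itertools import product

def lang_upto(rexp, alphabet, n):
    """All strings over `alphabet` of length <= n that are in L(rexp)."""
    pat = re.compile(rexp)
    out = []
    for k in range(n + 1):
        for chars in product(alphabet, repeat=k):
            s = "".join(chars)
            if pat.fullmatch(s):
                out.append(s)
    return out

# (0|1)*0 : binary representations of even numbers
print(lang_upto("(0|1)*0", "01", 2))        # ['0', '00', '10']
# b*(abb*)*(a|) : no consecutive a's -- '(a|)' plays the role of (a|epsilon)
print(lang_upto("b*(abb*)*(a|)", "ab", 2))  # ['', 'a', 'b', 'ab', 'ba', 'bb']
```

Note that 'aa' is correctly absent from the second sample, matching the "no consecutive a's" description.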

Example: Keyword Keyword: else or if or begin … else | if | begin | …

Example: Integers Integer: a non-empty string of digits ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )* problem: reuse complicated expression improvement: define intermediate reg. expr. digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 number = digit digit* Abbreviation: A+ = A A*

Regular Definitions Names for regular expressions: d1 = r1 d2 = r2 ... dn = rn where each ri is a regular expression over the alphabet Σ ∪ { d1, d2, ..., di−1 }. Note: recursion is not allowed.

Example Identifier: strings of letters or digits, starting with a letter digit = 0 | 1 | ... | 9 letter = A | … | Z | a | … | z identifier = letter (letter | digit) * Is (letter* | digit*) the same as (letter | digit)* ?
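The closing question can be answered by experiment: (letter* | digit*) only allows a pure run of letters or a pure run of digits, so it is not the same language as (letter | digit)*. A sketch using Python's re (the character classes stand in for letter and digit; full is my own helper):

```python
import re

def full(pattern, s):
    """Does `pattern` match the whole string s?"""
    return re.fullmatch(pattern, s) is not None

letter, digit = "[A-Za-z]", "[0-9]"

star_of_union  = f"({letter}|{digit})*"    # (letter | digit)*
union_of_stars = f"{letter}*|{digit}*"     # (letter* | digit*)

# "a1" mixes letters and digits: in the first language but not the second.
print(full(star_of_union, "a1"))    # True
print(full(union_of_stars, "a1"))   # False
```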

Example: Whitespace Whitespace: a non-empty sequence of blanks, tabs, newlines and CRLFs WS = ( blank | \t | \n | \r\n )+

Example: Email Addresses Consider chencc@cs.nccu.edu.tw Σ = letters ∪ { . , @ } name = letter+ address = name ‘@’ name (‘.’ name)*

Notational Shorthands One or more instances: r+ = r r* , r* = (r+ | ε) Zero or one instance: r? = (r | ε) Character classes: [abc] = a | b | c [a-z] = a | b | ... | z [ac-f] = a | c | d | e | f . = Σ (any symbol) [^ac-f] = Σ − [ac-f] [^.] = ∅

Summary Regular expressions describe many useful languages. A regular definition is a language specification. We still need an implementation. Problem: Given a string s and a rexp R, is s ∈ L(R) ?

4. Use Regular expressions in lexical specification

Goal Specifying lexical structure using regular expressions

Regular Expressions in Lexical Specification Last lecture: the specification of all lexemes in a token type using regular expression. But we want a specification of all lexemes of all token types in a programming language. Which may enable us to partition the input into lexemes We will adapt regular expressions to this goal

Regular Expressions => Lexical Spec. (1) Select a set of token types Number, Keyword, Identifier, ... Write a rexp for the lexemes of each token type Number = digit+ Keyword = if | else | … Identifier = letter (letter | digit)* LParen = ‘(‘ …

Regular Expressions => Lexical Spec. (2) Construct R, matching all lexemes for all tokens R = Keyword | Identifier | Number | … = R1 | R2 | R3 | … Facts: If s ∈ L(R) then s is a lexeme. Furthermore s ∈ L(Ri) for some “i”. This “i” determines the token type that is reported.

Regular Expressions => Lexical Spec. (3) 4. Let the current input be x1…xn (x1 ... xn are symbols in the language alphabet). For 1 ≤ i ≤ n, check x1…xi ∈ L(R) ? 5. It must be the case that there is one i such that x1…xi ∈ L(Rj) for some j. 6. Remove t = x1…xi from the input; if t is a normal token, pass it to the parser // else it is whitespace or a comment, just skip it! 7. Go to (4).

Ambiguities (1) There are ambiguities in the algorithm. How much input is used? What if x1…xi ∈ L(R) and also x1…xk ∈ L(R) for some i ≠ k ? Rule: Pick the longest possible substring. The longest-match principle!

Ambiguities (2) Which token is used? What if x1…xi ∈ L(Rj) and also x1…xi ∈ L(Rk) for j ≠ k ? Rule: use the rule listed first (j if j < k). Earlier rule first! Example: R1 = Keyword and R2 = Identifier. “if” matches both. Treat “if” as a keyword, not an identifier.
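The two disambiguation rules can be sketched as a maximal-munch loop. This is an illustrative Python sketch, not the lecture's code; the rule table (IF before ID, etc.) and the token names are my own:

```python
import re

# Rules in priority order: at each position the longest match wins,
# and on a length tie the earlier rule wins.
RULES = [
    ("IF",     r"if"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("NUM",    r"[0-9]+"),
    ("RELOP",  r"==|!="),
    ("ASSIGN", r"="),
    ("SKIP",   r"[ \t\n]+"),        # whitespace: a special (skipped) token
]

def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        best = None                 # (length, type, lexeme)
        for ttype, pat in RULES:
            m = re.compile(pat).match(src, i)
            # strictly longer match wins; on a tie the earlier rule is kept
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), ttype, m.group())
        if best is None:
            raise SyntaxError(f"no rule matches at position {i}")
        length, ttype, lexeme = best
        if ttype != "SKIP":         # special tokens are discarded here
            tokens.append((ttype, lexeme))
        i += length
    return tokens

print(tokenize("if iffy == 42"))
# [('IF', 'if'), ('ID', 'iffy'), ('RELOP', '=='), ('NUM', '42')]
```

Note how "if" is reported as IF (rule priority) while "iffy" is reported as ID (longest match beats the shorter IF prefix).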

Error Handling What if no rule matches any prefix of the input? Problem: we can’t just get stuck … Solution: write a rule matching all “bad” strings, and put it last. Lexer tools allow the writing of: R = R1 | ... | Rn | Error. Token Error matches if nothing else matches.

Summary Regular expressions provide a concise notation for string patterns Use in lexical analysis requires small extensions To resolve ambiguities To handle errors Efficient algorithms exist (next) Require only single pass over the input Few operations per character (table lookup)

5. Finite Automata Regular expressions = specification. Finite automata = implementation. A finite automaton consists of: An input alphabet Σ A finite set of states S A start state n A set of accepting states F ⊆ S A set of transitions: state ->input state If the automaton is for recognizing a token type, then this type should be associated with the machine.

Finite Automata Transition s1 ->a s2 is read: In state s1 on input “a” go to state s2. At end of input (or if no transition is possible): if in an accepting state => accept; otherwise => reject.

Finite Automata State Transition Graphs Notation: the start state; an accepting state, labeled T [ T is the token type ]; a transition labeled a.

A Simple Example A finite automaton that accepts only “1”.

Another Simple Example A finite automaton accepting any number of 1’s followed by a single 0. Alphabet: {0,1}. Accepted inputs: 1*0.

And Another Example Alphabet: {0,1}. What language does this recognize? [state diagram omitted] Accepted inputs: to be answered later!

And Another Example Alphabet still { 0, 1 }. The operation of the automaton is not completely defined by the input: on input “11” the automaton could be in either state.

Epsilon Moves Another kind of transition: ε-moves. A ->ε B : the machine can move from state A to state B without reading input.

Deterministic and Nondeterministic Automata Deterministic Finite Automata (DFA): One transition per input per state; no ε-moves. Nondeterministic Finite Automata (NFA): Can have multiple transitions for one input in a given state; can have ε-moves. Finite automata can have only a finite number of states.

Execution of Finite Automata A DFA can take only one path through the state graph, completely determined by the input. NFAs can choose: whether to make ε-moves; which of multiple transitions for a single input to take.

Acceptance of NFAs An NFA can get into multiple states on the same input. [diagram omitted] Rule: an NFA accepts if it can get into a final state.

Acceptance of a Finite Automata A FA (DFA or NFA) accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s

NFA vs. DFA (1) NFAs and DFAs recognize the same set of languages (regular languages) DFAs are easier to implement NFA are easier for specification and converting lexical specification to FA

NFA vs. DFA (2) For a given language the NFA can be simpler than the DFA. [NFA and DFA diagrams omitted] The DFA can be exponentially larger than the NFA.

Operations on NFA states ε-closure(s): set of NFA states reachable from NFA state s on ε-transitions alone. ε-closure(S): set of NFA states reachable from some NFA state s in S on ε-transitions alone. move(S, c): set of NFA states to which there is a transition on input symbol c from some NFA state s in S. Notes: ε-closure(S) = ∪s∈S ε-closure(s); ε-closure(s) = ε-closure({s}).

Computing ε-closure Input. An NFA and a set S of NFA states. Output. E = ε-closure(S).
begin
  push all states in S onto stack;
  T := S;
  while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u labeled ε do
      if u is not in T then begin
        add u to T;
        push u onto stack
      end
  end;
  return T
end.
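The same worklist algorithm in runnable form. A Python sketch (the dict-of-sets encoding of the ε-edges is my own; the sample edges mirror one possible NFA for (1|0)*1 with state names as in the later example):

```python
def eps_closure(states, eps_edges):
    """ε-closure(S): every NFA state reachable from S on ε-transitions alone."""
    closure = set(states)                  # T := S
    stack = list(states)                   # push all states in S
    while stack:
        t = stack.pop()                    # pop the top element
        for u in eps_edges.get(t, ()):     # edges from t labeled ε
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

# Illustrative ε-edges (states A..J as in the (1|0)*1 example)
eps = {"A": {"B", "H"}, "B": {"C", "D"}, "H": {"I"},
       "E": {"G"}, "F": {"G"}, "G": {"A"}}
print(sorted(eps_closure({"A"}, eps)))     # ['A', 'B', 'C', 'D', 'H', 'I']
```

Each state is pushed at most once, so the running time is linear in the number of ε-edges.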

Simulating an NFA (for recognizing a token) Input. An input string ended with eof, and an NFA with start state s0 and final states F. Output. The answer “yes” if the NFA accepts the input, “no” otherwise.
begin
  S := ε-closure({s0});
  c := next_symbol();
  while c != eof do begin
    S := ε-closure(move(S, c));
    c := next_symbol()
  end;
  if S ∩ F ≠ ∅ then return “yes” else return “no”
end.
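A runnable Python version of this simulation, applied to one possible Thompson-style NFA for (1|0)*1 (the dict encodings and the exact edge set are my own illustrative choices; state names echo the example below):

```python
def eps_closure(states, eps_edges):
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in eps_edges.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c, edges):
    """States reachable on one c-transition from any state in `states`."""
    out = set()
    for s in states:
        out |= edges.get((s, c), set())
    return out

def nfa_accepts(inp, start, finals, edges, eps_edges):
    S = eps_closure({start}, eps_edges)
    for c in inp:
        S = eps_closure(move(S, c, edges), eps_edges)
    return bool(S & finals)                # "yes" iff S ∩ F ≠ ∅

# One possible NFA for (1|0)*1, states A..J, accepting state J
eps   = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"},
         "F": {"G"}, "G": {"A"}, "H": {"I"}}
edges = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}
print(nfa_accepts("101", "A", {"J"}, edges, eps))  # True  (ends in 1)
print(nfa_accepts("10",  "A", {"J"}, edges, eps))  # False
```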

Simulating an NFA (for recognizing a sequence of tokens) Input. An input string ended with eof, and an NFA with start state s0 and a set F of final states, each marked with a token type. Output: a token sequence (possibly ended with an error token).

L0: length := 0; buf := new ArrayList(); type := -1;
    S := ε-closure({s0}); c := next_symbol();
    while ( c != eof ) {
      S := ε-closure(move(S, c));
      if ( S == ∅ ) {                 // cannot make a c-transition!!
        if ( length == 0 ) { output error-token; exit(); }
        else {
          output token(type, buf[0:length-1]);
          push back buf[length:] and c into the input;
          goto L0;
        }
      }
      buf.add(c);
      F1 := S ∩ F;
      if ( F1 != ∅ ) {                // c reaches some final states!
        length := buf.size();
        type := minTypeOf(F1);
      }
      c := next_symbol();
    }
    // eof reached
    if ( length == buf.size() ) { if ( length > 0 ) output token(type, buf); }
    else { output token(type, buf[0:length-1]); output error-token; }

Regular Expressions to Finite Automata High-level sketch: Lexical Specification -> Regular expressions -> NFA -> DFA -> Optimized DFA -> Table-driven Implementation of DFA.

Regular Expressions to NFA (1) For each kind of rexp, define an NFA. Notation: NFA for rexp A. For ε: a start state with an ε-edge to an accepting state. For input a: a start state with an a-edge to an accepting state.

Regular Expressions to NFA (2) For A · B: connect the accepting state of A’s NFA to the start state of B’s NFA by an ε-edge. For A | B: a new start state with ε-edges into the NFAs for A and B, and ε-edges from their accepting states into a new accepting state.

Regular Expressions to NFA (3) For A*: a new start state with an ε-edge into A’s NFA and an ε-edge to a new accepting state; A’s accepting state has ε-edges back to the start and to the new accepting state.

Example of RegExp -> NFA conversion Consider the regular expression (1|0)*1 The NFA has states A–J with start state A and accepting state J: ε-edges A->B, A->H, B->C, B->D, E->G, F->G, G->A, H->I; labeled edges C -1-> E, D -0-> F, I -1-> J.

NFA to DFA

NFA to DFA. The Trick Simulate the NFA. Each state of the DFA = a non-empty subset of the states of the NFA. Start state = the set of NFA states reachable through ε-moves from the NFA start state. Add a transition S ->a S’ to the DFA iff S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well.

NFA -> DFA Example Applying the construction to the NFA for (1|0)*1: start state = ε-closure({A}) = ABCDHI. ABCDHI -1-> EJGABCDHI and ABCDHI -0-> FGABCDHI; both EJGABCDHI and FGABCDHI go to EJGABCDHI on 1 and to FGABCDHI on 0. EJGABCDHI contains J, so it is the accepting DFA state.

NFA to DFA. Remark An NFA may be in many states at any time. How many different states? If there are N states, the NFA must be in some subset of those N states. How many non-empty subsets are there? 2^N − 1, i.e., finitely many.

From an NFA to a DFA Subset construction algorithm. Input. An NFA N. Output. A DFA D with states S and transition table mv.
begin
  add ε-closure(s0) as an unmarked state to S;
  while there is an unmarked state T in S do begin
    mark T;
    let TokenType(T) = min{ type(s) | s ∈ T ∩ F };
    for each input symbol a do begin
      U := ε-closure(move(T, a));
      if U is not in S then add U as an unmarked state to S;
      mv[T, a] := U
    end
  end
end.
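A Python sketch of the subset construction, run on the same illustrative (1|0)*1 NFA encoding used earlier (token-type bookkeeping is omitted to keep it short; dead/empty subsets are simply dropped):

```python
def subset_construction(start, finals, edges, eps_edges, alphabet):
    """NFA -> DFA: each DFA state is a frozenset of NFA states."""
    def closure(states):
        out, stack = set(states), list(states)
        while stack:
            t = stack.pop()
            for u in eps_edges.get(t, ()):
                if u not in out:
                    out.add(u)
                    stack.append(u)
        return frozenset(out)

    def move(states, c):
        out = set()
        for s in states:
            out |= edges.get((s, c), set())
        return out

    d_start = closure({start})              # ε-closure(s0), initially unmarked
    dfa_states, work, mv = {d_start}, [d_start], {}
    while work:                             # while an unmarked state T exists
        T = work.pop()                      # mark T
        for a in alphabet:
            U = closure(move(T, a))
            if not U:                       # skip the empty (dead) state
                continue
            mv[(T, a)] = U
            if U not in dfa_states:
                dfa_states.add(U)
                work.append(U)
    d_finals = {S for S in dfa_states if S & finals}
    return d_start, d_finals, mv

eps   = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"},
         "F": {"G"}, "G": {"A"}, "H": {"I"}}
edges = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}
d_start, d_finals, mv = subset_construction("A", {"J"}, edges, eps, "01")
print(sorted(d_start))                      # ['A', 'B', 'C', 'D', 'H', 'I']
```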

Implementation A DFA can be implemented by a 2D table T: one dimension is “states”, the other is “input symbols”. For every transition Si ->a Sk define mv[i,a] = k. DFA “execution”: if in state Si on input a, read mv[i,a] (= k) and move to state Sk. Very efficient.

Table Implementation of a DFA Example: a three-state DFA with states S, T, U over inputs 0 and 1:
        0   1
  S     T   U
  T     T   U
  U     T   U

Simulation of a DFA Input. An input string ended with eof, and a DFA with start state s0 and final states F. Output. The answer “yes” if the DFA accepts the input, “no” otherwise.
begin
  s := s0;
  c := next_symbol();
  while c <> eof do begin
    s := mv(s, c);
    c := next_symbol()
  end;
  if s is in F then return “yes” else return “no”
end.
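The loop above, table-driven in Python. The three-state table (states S, T, U) is the DFA obtained for (1|0)*1, with U taken as the accepting state; the encoding as a dict keyed by (state, symbol) is my own:

```python
def run_dfa(inp, start, finals, mv):
    """One table lookup per input character; reject on a missing transition."""
    s = start
    for c in inp:
        if (s, c) not in mv:
            return False                 # no transition possible: reject
        s = mv[(s, c)]
    return s in finals                   # accept iff we end in a final state

mv = {("S", "0"): "T", ("S", "1"): "U",
      ("T", "0"): "T", ("T", "1"): "U",
      ("U", "0"): "T", ("U", "1"): "U"}
print(run_dfa("101", "S", {"U"}, mv))    # True  (ends in 1)
print(run_dfa("100", "S", {"U"}, mv))    # False
```

Note the contrast with the NFA simulation: here each character costs a single lookup, with no set arithmetic.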

Simulation of a DFA (for recognizing a token sequence) Input. An input string ended with eof, and a DFA with start state s0 and a set F of final states, each with a type. Output. A sequence of tokens, possibly ended with an error token.
begin
  length := 0; buf := new ArrayList(); type := -1;
  s := s0; c := next_symbol();
  while ( c <> eof ) {
    s := mv(s, c);
    if ( s is the error state ) {
      if ( length == 0 ) { output error-token; exit(); }
      output token(type, buf[0:length-1]);
      push back buf[length:] and c into the input;
      length := 0; type := -1; buf := new ArrayList(); s := s0;
    } else {
      buf.add(c);
      if ( s is final ) { length := buf.size(); type := type(s); }
    }
    c := next_symbol();
  }
  // eof reached
  if ( length == buf.size() ) { if ( length > 0 ) output token(type, buf); }
  else { output token(type, buf[0:length-1]); output error-token; }
end.

Implementation (Cont.) NFA -> DFA conversion is at the heart of tools such as flex But, DFAs can be huge DFA => optimized DFA : try to decrease the number of states. not always helpful! In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

Time-Space Tradeoffs RE to NFA, simulate NFA: time O(|r|·|x|), space O(|r|). RE to NFA, NFA to DFA, simulate DFA: time O(|x|), space O(2^|r|). Lazy transition evaluation: transitions are computed as needed at run time; computed transitions are stored in a cache for later use.

DFA to optimized DFA

Motivations Problems: 1. Given a DFA M with k states, is it possible to find an equivalent DFA M’ (i.e., L(M) = L(M’)) with fewer than k states? 2. Given a regular language A, how do we find a machine with the minimum number of states? Ex: A = L((a+b)*aba(a+b)*) can be accepted by a 4-state NFA (states s, t, u, v with s -a-> t, t -b-> u, u -a-> v, and a,b self-loops on s and v). By applying the subset construction, we can construct a DFA M2 with 2^4 = 16 states, of which only 6 are accessible from the initial state {s}.

Inaccessible states A state p ∈ Q is said to be inaccessible (or unreachable) [from the initial state] if there exists no path from the initial state to it. If a state is not inaccessible, it is accessible. Inaccessible states can be removed from the DFA without affecting the behavior of the machine. Problem: Given a DFA (or NFA), how do we find all inaccessible states?

Finding all accessible states (like ε-closure) Input. An FA (DFA or NFA). Output. The set A of all accessible states.
begin
  push all start states onto stack;
  add all start states into A;
  while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u do
      if u is not in A then begin
        add u to A;
        push u onto stack
      end
  end;
  return A
end.

Minimization process Minimization process for a DFA: 1. Remove all inaccessible states. 2. Collapse all equivalent states. What does it mean that two states are equivalent? Both states have the same observable behaviors, i.e., there is no way to distinguish them. More formally, p and q are not equivalent (i.e., distinguishable) iff there is a string x ∈ Σ* s.t. exactly one of Δ(p,x) and Δ(q,x) is a final state, where Δ(p,x) is the ending state of the path from p with x as the input. Equivalent states can be merged to form a simpler machine.

Example: [a 5-state DFA in which states 1, 2 are equivalent and states 3, 4 are equivalent; merging them yields a 3-state DFA with states {1,2}, {3,4}, {5}]

Quotient Construction M = (Q, Σ, δ, s, F): a DFA. ≈ : a relation on Q defined by: p ≈ q <=> for all x ∈ Σ*, Δ(p,x) ∈ F iff Δ(q,x) ∈ F. Property: ≈ is an equivalence relation. Hence it partitions Q into equivalence classes [p] = { q ∈ Q | p ≈ q } for p ∈ Q, and the quotient set Q/≈ = { [p] | p ∈ Q }. Every p ∈ Q belongs to exactly one class [p], and p ≈ q iff [p] = [q]. Define the quotient machine M/≈ = <Q’, Σ, δ’, s’, F’> where Q’ = Q/≈ ; s’ = [s]; F’ = { [p] | p ∈ F }; and δ’([p], a) = [δ(p, a)] for all p ∈ Q and a ∈ Σ.

Minimization algorithm Input: a DFA. Output: an optimized DFA. 1. Write down a table of all pairs {p,q}, initially unmarked. 2. Mark {p,q} if p ∈ F and q ∉ F, or vice versa. 3. Repeat until no more change: 3.1 if there is an unmarked pair {p,q} s.t. {move(p,a), move(q,a)} is marked for some a ∈ Σ, then mark {p,q}. 4. When done, p ≈ q iff {p,q} is not marked. 5. Merge all equivalent states into one class and return the resulting machine. Note: for recognizing multiple token types, step 2 changes to 2’: mark {p,q} if type(p) ≠ type(q) [assume all non-final states have the same type].
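A Python sketch of the table-filling steps 1-4. The sample DFA is my own small example, not the one on the following slides: over {a,b}, states 1, 2, 3 with final state 3, where states 1 and 2 turn out to be equivalent:

```python
from itertools import combinations

def unmarked_equiv_pairs(states, finals, mv, alphabet):
    """Table-filling: returns the set of equivalent (never-marked) pairs."""
    marked = set()
    for p, q in combinations(states, 2):          # steps 1 and 2
        if (p in finals) != (q in finals):
            marked.add(frozenset((p, q)))
    changed = True
    while changed:                                # step 3: repeat to fixpoint
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                succ = frozenset((mv[(p, a)], mv[(q, a)]))
                if len(succ) == 2 and succ in marked:
                    marked.add(pair)              # 3.1: successors distinguish p,q
                    changed = True
                    break
    # step 4: p ≈ q iff {p,q} was never marked
    return {frozenset((p, q)) for p, q in combinations(states, 2)
            if frozenset((p, q)) not in marked}

mv = {(1, "a"): 2, (1, "b"): 3,
      (2, "a"): 2, (2, "b"): 3,
      (3, "a"): 2, (3, "b"): 3}
print(unmarked_equiv_pairs([1, 2, 3], {3}, mv, "ab"))  # {frozenset({1, 2})}
```

Step 5 would then merge {1, 2} into one state, yielding a 2-state DFA for the same language.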

An Example: The DFA (> marks the start state, F marks final states):
  state   a   b
  >0      1   2
  1F      3   4
  2F      5   5
  [rows for states 3-5 omitted]

Initial Table A triangular table of all pairs {p,q}, p ≠ q, over the states; all entries initially unmarked.

After step 2 Every pair consisting of one final and one non-final state is marked M.

After first pass of step 3 Pairs whose successor pairs are already marked become marked.

2nd pass of step 3 No further changes. The result: 1 ≈ 2 and 3 ≈ 4.