
Automata and Regular Expression Discrete Mathematics and Its Applications Baojian Hua

1 Automata and Regular Expression Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn

2 MiniC Formal Grammar
    prog -> stm prog | stm
    stm  -> id = exp; | print (exp);
    exp  -> exp + exp | exp - exp | exp * exp | exp / exp | id | num | (exp)
Nonterminal symbols: prog, stm, exp. Terminal symbols: id, num, print, and the operator and punctuation characters.

3 For most programming languages, the terminal symbols represent the basic punctuation symbols, keywords, and operators. ident and intconst are special in that they represent (infinite) sets of terminal symbols:
    id     ::= letter idRest
    letter ::= _ | a | … | z | A | … | Z
    idRest ::= ε | letter idRest | digit idRest
    digit  ::= 0 | 1 | 2 | … | 9
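The identifier grammar above translates almost rule for rule into a recognizer. The following is a minimal sketch, not code from the slides: the function names `is_letter`, `is_digit`, `is_id_rest`, and `is_id` are hypothetical, and the input is assumed to be a NUL-terminated ASCII string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* letter ::= _ | a | ... | z | A | ... | Z */
static bool is_letter(char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* digit ::= 0 | 1 | ... | 9 */
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* idRest ::= epsilon | letter idRest | digit idRest */
static bool is_id_rest(const char *s) {
    if (*s == '\0') return true;                  /* epsilon case */
    if (is_letter(*s) || is_digit(*s)) return is_id_rest(s + 1);
    return false;
}

/* id ::= letter idRest (hypothetical entry point) */
bool is_id(const char *s) {
    assert(s != NULL);
    return is_letter(*s) && is_id_rest(s + 1);
}
```

With these definitions, `is_id("_tmp1")` holds while `is_id("1x")` and `is_id("")` do not. The recursion mirrors the grammar for clarity; a production lexer would more likely use a loop.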

4 Lexical Analysis
The lexical analyzer translates the source program into a stream of lexical tokens.
Source program: stream of (ASCII or Unicode) characters.
Lexical token: internal data structure that represents the occurrence of a terminal symbol.

5 Example
Source program:
    x = 11;
    y = 2;
    z = x + y;
    print (z);
Token stream after lexical analysis:
    IDENT(x) ASSIGN INTCONST(11) SEMICOLON NEWLINE
    IDENT(y) ASSIGN INTCONST(2) SEMICOLON NEWLINE
    IDENT(z) ASSIGN IDENT(x) PLUS IDENT(y) SEMICOLON NEWLINE
    PRINT LPAREN IDENT(z) RPAREN SEMICOLON EOF

6 Practice Issues
What are tokens? Recall the tagged union in our previous slides.
What if the input characters are illegal? The lexer performs only limited checking of the grammatical structure of the input: it only checks that the input stream can be viewed as a stream of terminal symbols.

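7 Lexical Errors
    x = 11
    y = = = = = = 2;
    z =@ x + y;
    print (#z);
Not lexical errors: the missing semicolon after 11 and the repeated = signs (every character still belongs to a valid token; these are caught by later phases).
Lexical errors: @ and #, which belong to no token.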

8 Position Info
For the purpose of later phases, it is useful to attach position information to each token. We will see how to make use of this information in later slides.
    LPAREN(1,4) IDENT(x,4,5) MINUS(6,7) …

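9 Tokens in C
    #ifndef TOKEN_H
    #define TOKEN_H

    enum tokenKind {ID, NUM, ASSIGN, LPAREN, …};

    /* a token is a pointer to a tokenStruct */
    typedef struct tokenStruct *token;

    struct tokenStruct {
      enum tokenKind kind;  /* which terminal symbol this token is */
      union {…} u;          /* kind-specific payload (elided on the slide) */
      int line;             /* position info */
      int column;
    };

    #endif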

10 Lexer Interface
    #ifndef LEXER_H
    #define LEXER_H

    #include "token.h"

    /* returns the next token read from fileName */
    token nextToken (char *fileName);

    #endif

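11 Client Code
    #include "lexer.h"

    int main()
    {
      // we want to analyze file "test.c"
      token t = nextToken ("test.c");
      // loop until the end-of-file token; this assumes EOF is one of the
      // elided tokenKind values
      while (t->kind != EOF)
      {
        …
        t = nextToken ("test.c");
        …
      }
      return 0;
    }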

12 Finite-state Automata

13 Finite-state Automata (FAs)
Input string → M → {Yes, No}
$M = (\Sigma, S, s_0, F, f)$
    $\Sigma$: input alphabet
    $S$: state set
    $s_0$: initial state
    $F$: final states
    $f$: transition function

14 Transition Functions
A deterministic finite automaton (DFA) has the transition function $f : S \times \Sigma \to S$,
which can be extended to strings, $f' : S \times \Sigma^* \to S$, in an inductive form:
$f'(q, \varepsilon) = q$
$f'(q, a\omega) = f'(f(q, a), \omega)$

15 DFA Example
Which strings of a's and b's are accepted?
Transition function:
    { (s0,a) → s1, (s0,b) → s0,
      (s1,a) → s2, (s1,b) → s1,
      (s2,a) → s2, (s2,b) → s2 }
[State diagram: states 0, 1, 2 with the transitions listed above]
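The extended transition function can be computed by a simple loop over the input. The sketch below encodes the transition function of this DFA example as a matrix; the names `f`, `run_dfa`, `NUM_STATES`, and `NUM_SYMBOLS` are illustrative choices, and returning -1 for characters outside {a, b} is an assumption made here, not something the slides specify.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_STATES = 3, NUM_SYMBOLS = 2 };

/* f[state][symbol]: column 0 is 'a', column 1 is 'b' (taken from the slide) */
static const int f[NUM_STATES][NUM_SYMBOLS] = {
    {1, 0},   /* (s0,a) -> s1, (s0,b) -> s0 */
    {2, 1},   /* (s1,a) -> s2, (s1,b) -> s1 */
    {2, 2},   /* (s2,a) -> s2, (s2,b) -> s2 */
};

/* Iterative form of f'(q, w): the state reached from q after reading w.
   Returns -1 if w contains a character outside the alphabet {a, b}. */
int run_dfa(int q, const char *w) {
    assert(w != NULL && q >= 0 && q < NUM_STATES);
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return -1;
        q = f[q][*w - 'a'];
    }
    return q;
}
```

For instance, `run_dfa(0, "abba")` ends in state 2 and `run_dfa(0, "bbb")` stays in state 0. The final states are marked only in the slide's diagram, so this sketch stops at reporting the reached state; if s2 were the only final state, the accepted strings would be exactly those containing at least two a's.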

16 Nondeterministic FAs (NFAs)
NFAs can transition to more than one state on any input: $f : S \times \Sigma \to \mathcal{P}(S)$
As before, we can extend: $f' : S \times \Sigma^* \to \mathcal{P}(S)$
Inductively:
$f'(q, \varepsilon) = \{q\}$
$f'(q, a\omega) = \bigcup_{p \in f(q, a)} f'(p, \omega)$

17 NFA Example
Transition function:
    { (s0,a) → {s0,s1}, (s0,b) → {s1},
      (s1,a) → ∅, (s1,b) → {s0,s1} }
[State diagram: states 0 and 1 with the transitions listed above]
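One common way to run an NFA is to track the whole set of current states at once. The sketch below does this for the NFA example, representing a set of states as a bitmask (bit i set means s_i is in the set). The names `nfa_f` and `run_nfa` are hypothetical, and the iterative set-at-a-time loop is an equivalent reformulation of the inductive definition, not the slides' own code.

```c
#include <assert.h>
#include <stddef.h>

enum { NFA_STATES = 2 };

/* nfa_f[state][symbol] is a bitmask of successor states;
   column 0 is 'a', column 1 is 'b' (taken from the slide). */
static const unsigned nfa_f[NFA_STATES][2] = {
    { 0x3u, 0x2u },   /* s0: a -> {s0,s1}, b -> {s1}    */
    { 0x0u, 0x3u },   /* s1: a -> {},      b -> {s0,s1} */
};

/* Returns the set of states reachable from `set` after reading w.
   Characters outside {a, b} lead to the empty set (an assumption). */
unsigned run_nfa(unsigned set, const char *w) {
    assert(w != NULL);
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return 0u;
        unsigned next = 0u;
        for (int q = 0; q < NFA_STATES; q++)
            if (set & (1u << q))
                next |= nfa_f[q][*w - 'a'];   /* union over all current states */
        set = next;
    }
    return set;
}
```

Starting from {s0} (mask 1), `run_nfa(1u, "ba")` yields the empty set because the only state reached on b is s1, which has no a-transition. A string is accepted when the resulting set contains at least one final state.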

18 Regular Expression

19 Regular Expressions
A regular language can always be described using a regular expression.
Examples:
    (01)* 00
    (a|b)*ab
    this | that | theother
    0*1*2*
    01*|0 = 01*
    00*11*22* = 0+1+2+   (here + is the "one or more" postfix operator)
    (1|0)*00(0|1)*

20 Regular Expressions and Tokens
Regular expressions are convenient for describing lexical tokens:
    intconst: [0-9][0-9]*
    ident:    [_a-zA-Z][_a-zA-Z0-9]*
    others:   = | print | + | …
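The two token patterns above can be matched by hand with a pair of loops. The following is a rough sketch under stated assumptions: the `enum kind` values and the `scan` function are hypothetical names, the input is a NUL-terminated string in the default C locale, and keywords such as print are not distinguished from identifiers here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum kind { K_INTCONST, K_IDENT, K_ERROR };

/* Classifies the longest token starting at s and stores its length in *len. */
enum kind scan(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    size_t n = 0;
    unsigned char c = (unsigned char)s[0];
    if (isdigit(c)) {                        /* intconst: [0-9][0-9]* */
        while (isdigit((unsigned char)s[n])) n++;
        *len = n;
        return K_INTCONST;
    }
    if (c == '_' || isalpha(c)) {            /* ident: [_a-zA-Z][_a-zA-Z0-9]* */
        while (s[n] == '_' || isalnum((unsigned char)s[n])) n++;
        *len = n;
        return K_IDENT;
    }
    *len = 0;                                /* no token starts here */
    return K_ERROR;
}
```

Calling `scan("123abc", &n)` reports an integer constant of length 3, leaving `abc` for the next call. A full lexer would typically look up each identifier in a keyword table afterwards so that `print` becomes its own token kind.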

28 Regular Expressions
Let $\Sigma = \{a, b\}$.
$\varepsilon$ is a regular expression: $L_\varepsilon = \{\varepsilon\}$
$\emptyset$ is a regular expression: $L_\emptyset = \{\}$
$a$ is a regular expression: $L_a = \{a\}$
$R|S$ is a regular expression if $R$ and $S$ are: $L_{R|S} = L_R \cup L_S$

29 Regular Expressions
Let $\Sigma = \{a, b\}$.
$\varepsilon$ is a regular expression
$\emptyset$ is a regular expression
$a$ is a regular expression
$R|S$ is a regular expression if $R$ and $S$ are
$RS$ is a regular expression if $R$ and $S$ are: $L_{RS} = \{uv \mid u \in L_R \text{ and } v \in L_S\}$

30 Regular Expressions
Let $\Sigma = \{a, b\}$.
$\varepsilon$ is a regular expression
$\emptyset$ is a regular expression
$a$ is a regular expression
$R|S$ is a regular expression if $R$ and $S$ are
$RS$ is a regular expression if $R$ and $S$ are
$R^*$ is a regular expression if $R$ is: $L_{R^*} = \bigcup_{0 \le i} L_R^i$
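As a small worked instance of the three operators (an illustration added here, not taken from the slides), take $R = a$ and $S = b$, so that $L_R = \{a\}$ and $L_S = \{b\}$:

\[
\begin{aligned}
L_{R|S} &= L_R \cup L_S = \{a, b\} \\
L_{RS} &= \{uv \mid u \in L_R \text{ and } v \in L_S\} = \{ab\} \\
L_{R^*} &= \bigcup_{0 \le i} L_R^i = \{\varepsilon, a, aa, aaa, \ldots\}
\end{aligned}
\]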

31 Regular Expressions
The language described by a regular expression can be accepted by an FA: RE → ε-NFA → NFA → DFA
A regular grammar can always be described using a regular expression: RG → RE
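The NFA → DFA step is usually done with the subset construction, in which each DFA state is a set of NFA states. The sketch below applies it to the two-state NFA from the NFA Example slide, assuming no ε-moves. The names `delta`, `move`, and `subset_construction` are hypothetical, and subsets are encoded as bitmasks so that a subset doubles as its own DFA state number.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 2, SUBSETS = 1 << N, SYMS = 2 };

/* NFA transitions as bitmasks; column 0 is 'a', column 1 is 'b'. */
static const unsigned delta[N][SYMS] = {
    { 0x3u, 0x2u },   /* s0: a -> {s0,s1}, b -> {s1}    */
    { 0x0u, 0x3u },   /* s1: a -> {},      b -> {s0,s1} */
};

/* Union of the successors of every NFA state in `set` on symbol `sym`. */
static unsigned move(unsigned set, int sym) {
    unsigned next = 0u;
    for (int q = 0; q < N; q++)
        if (set & (1u << q)) next |= delta[q][sym];
    return next;
}

/* Fills dfa[S][sym] for every subset S reachable from `start` and marks
   those subsets in reachable[]. Returns the number of reachable subsets. */
int subset_construction(unsigned start, unsigned dfa[SUBSETS][SYMS],
                        int reachable[SUBSETS]) {
    assert(start < SUBSETS && dfa != NULL && reachable != NULL);
    unsigned work[SUBSETS];          /* each subset is pushed at most once */
    int top = 0, count = 1;
    for (int i = 0; i < SUBSETS; i++) reachable[i] = 0;
    reachable[start] = 1;
    work[top++] = start;
    while (top > 0) {
        unsigned s = work[--top];
        for (int sym = 0; sym < SYMS; sym++) {
            unsigned t = move(s, sym);
            dfa[s][sym] = t;
            if (!reachable[t]) {
                reachable[t] = 1;
                work[top++] = t;
                count++;
            }
        }
    }
    return count;
}
```

Starting from {s0}, all four subsets turn out to be reachable, including the empty set, which acts as a dead state. A DFA state would be final whenever its subset contains an NFA final state.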

32 Building FAs
An FA is a directed graph.
How large is the input alphabet? How many states? How fast must it run? How to get the lowest constant factor? How to minimize space?
Representations: matrix, array of lists, hashtable, switch statement.
For simplicity, we recommended this method in the assignment.
Transition table (row = state, columns = next state on a and b):
    state  a  b
    0      1  1
    1      2  3
    2      1  0
    3      2  3
    4      1  4
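To make two of the listed representations concrete, the sketch below encodes the same five-state table once as a matrix and once as a switch statement. It assumes the table reads as state, next state on a, next state on b; the names `table`, `step_matrix`, and `step_switch` are hypothetical, and -1 for characters outside {a, b} is a convention chosen here.

```c
#include <assert.h>

enum { STATES = 5 };

/* Matrix representation: row = state, column 0 = 'a', column 1 = 'b'. */
static const int table[STATES][2] = {
    {1, 1}, {2, 3}, {1, 0}, {2, 3}, {1, 4},
};

/* One transition via table lookup. */
int step_matrix(int state, char c) {
    assert(state >= 0 && state < STATES);
    if (c != 'a' && c != 'b') return -1;
    return table[state][c - 'a'];
}

/* The same transition written as a switch over the current state. */
int step_switch(int state, char c) {
    assert(state >= 0 && state < STATES);
    if (c != 'a' && c != 'b') return -1;
    switch (state) {
    case 0: return 1;                   /* both a and b lead to state 1 */
    case 1: return c == 'a' ? 2 : 3;
    case 2: return c == 'a' ? 1 : 0;
    case 3: return c == 'a' ? 2 : 3;
    case 4: return c == 'a' ? 1 : 4;
    default: return -1;
    }
}
```

The matrix keeps the automaton as data, which is convenient for generated lexers, while the switch form puts the automaton directly in code and is easy to write by hand for small machines.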

33 Lex -- Automatic Lexer Generation Tools

34 History
Lexical analysis was once a performance bottleneck (certainly not true today!). As a result, early research investigated methods for efficient lexical analysis. While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use.

35 History: A long-standing Goal
In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler):
declarative compiler specification → compiler-compiler → compiler

36 History: Unix and C
Starting in 1969 at Bell Labs, Ritchie and others were developing Unix. A key part of this project was the development of C and a compiler for it. Johnson et al., in 1968, proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968]. Lesk and Schmidt later developed Lex at Bell Labs. Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers.

37 The Lex tool
lexical analyzer specification → Lex → fast lexical analyzer
The original Lex generated lexers written in C. Today every major language has its own lex tool(s): flex, sml-lex, ocamllex, JLex, JFlex, C#Flex, …

