# Scanner

許良全, Associate Professor, Computer Center, Chung Cheng Institute of Technology. Copyright © 1998 by LCH. Compiler Design.



## Overview of Scanning

- The purpose of a scanner is to group input characters into tokens.
- A scanner is sometimes called a lexical analyzer.
- A precise definition of tokens is necessary to ensure that lexical rules are properly enforced.
  - Scanners normally seek to make a token as long as possible. E.g., ABC is scanned as one identifier rather than three.
- All scanners perform much the same function.
  - Using a scanner generator limits the effort of building a scanner from scratch.
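The "as long as possible" rule can be illustrated with a minimal Python sketch. The token names and patterns below are assumptions made for this example only; they do not come from the slides.

```python
import re

# Hypothetical token patterns, chosen only for illustration.
TOKEN_PATTERNS = [
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("ASSIGN", re.compile(r":=")),
    ("COLON", re.compile(r":")),
]

def scan(text):
    """Return (kind, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # white space separates tokens
            pos += 1
            continue
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            # Strict ">" keeps the earlier pattern when two matches tie.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens

print(scan("ABC := 42"))  # [('ID', 'ABC'), ('ASSIGN', ':='), ('NUM', '42')]
```

Here `ABC` becomes one identifier rather than three, and `:=` wins over `:` because it is the longer match.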

## Finite State Systems

- The finite state automaton is a mathematical model of a system with discrete inputs and outputs.

## Examples of Finite State Systems

- Elevators
  - do not remember all previous requests for service, but only the current floor, the direction of motion, and the collection of not yet satisfied requests for service
- Vending machines
  - insert enough coins and you'll get a Pepsi eventually
- Computers
  - the state of the CPU, main memory, and auxiliary storage at any time is one of a very large but finite number of states
- Human brains
  - $2^{35}$ cells or neurons at most

## Definition of Finite Automata

- A finite automaton (FA) is an idealized computer that recognizes strings belonging to regular sets. It is described by a 5-tuple $(Q, \Sigma, \delta, q_0, F)$:
  - A finite set of states, $Q$.
  - A finite input alphabet, $\Sigma$, or vocabulary, $V$.
  - A special start, or initial, state $q_0$, with $q_0 \in Q$.
  - A set of final, or accepting, states $F$, with $F \subseteq Q$.
  - A transition function, $\delta$, that maps $Q \times \Sigma$ to $Q$.
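The 5-tuple can be written down directly as data. The sketch below assumes a tiny hypothetical automaton (two states, alphabet `{a, 1}`) that accepts strings starting with `a`. It stores $\delta$ as a partial table and treats a missing entry as rejection, which is equivalent to an implicit dead state.

```python
def accepts(fa, string):
    """Run the finite automaton fa = (Q, Sigma, delta, q0, F) on string."""
    states, alphabet, delta, start, finals = fa
    state = start
    for ch in string:
        if (state, ch) not in delta:     # no transition: reject
            return False
        state = delta[(state, ch)]
    return state in finals

# Hypothetical example automaton, not taken from the slides.
Q = {"start", "id"}
SIGMA = {"a", "1"}
DELTA = {
    ("start", "a"): "id",
    ("id", "a"): "id",
    ("id", "1"): "id",
}
FA = (Q, SIGMA, DELTA, "start", {"id"})

print(accepts(FA, "a1a"))  # True
print(accepts(FA, "1a"))   # False
```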

## Regular Expressions

- The languages accepted by finite automata are easily described by simple expressions called regular expressions.
- Strings are built from characters in $V$ via catenation, e.g., `!=`, `for`, `while`.
- An empty or null string, denoted by $\lambda$, is allowed.
- The characters `(`, `)`, `'`, `*`, `+`, and `|` are called meta-characters. They must be quoted when used as ordinary characters in order to avoid ambiguity. E.g., `Delim = ('('|')'|:=|;|,|'+'|-|'*'|/|=|$$$)`

## Definition of Regular Expression

- A regular expression denotes a set of strings:
  - $\emptyset$ is a regular expression denoting the empty set (the set containing no strings).
  - $\lambda$ is a regular expression denoting the set that contains only the empty string.
    - Note that this set contains one element.
  - A string $s$ is a regular expression denoting a set containing only $s$. If $s$ contains meta-characters, $s$ can be quoted to avoid ambiguity.
  - If $A$ and $B$ are regular expressions, then $A \mid B$, $AB$, and $A^*$ are also regular expressions, corresponding to alternation, catenation, and Kleene closure respectively.

## Properties of Regular Expressions

Let $P$ and $Q$ be sets of strings.

- The string $s \in (P \mid Q)$ iff $s \in P$ or $s \in Q$.
- The string $s \in P^*$ iff $s$ can be broken into zero or more pieces, $s = s_1 s_2 s_3 \dots s_n$, such that each $s_i \in P$.
- $P^+$ denotes all strings consisting of one or more strings in $P$ catenated together.
- $P^* = (P^+ \mid \lambda)$ and $P^+ = PP^* = P^*P$.
- If $A$ is a set of characters, $\text{Not}(A)$ denotes $(V - A)$: all characters in $V$ not included in $A$.
- If $k$ is a constant, the set $A^k$ represents all strings formed by catenating $k$ strings from $A$, i.e., $A^k = (AAA\dots)$ ($k$ copies).
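These set operations can be tried out on small finite sets. The following Python sketch is only an illustration: the helper names `power` and `star_up_to` are made up here, and since $A^*$ is infinite the code computes just a finite slice of it.

```python
from itertools import product

def power(A, k):
    """A^k: all strings formed by catenating k strings from A."""
    return {"".join(parts) for parts in product(A, repeat=k)}

def star_up_to(A, n):
    """Finite slice of A*: catenations of at most n strings from A."""
    result = set()
    for k in range(n + 1):
        result |= power(A, k)
    return result

P = {"a", "bc"}
print(sorted(power(P, 2)))       # ['aa', 'abc', 'bca', 'bcbc']
print(power(P, 0))               # {''}  (only the empty string)
print("" in star_up_to(P, 3))    # True: the star includes the empty string
```

Note that $A^0$ contains exactly the empty string, which is why $P^*$ differs from $P^+$ by $\lambda$ (unless $P$ itself already contains it).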

## Examples of Regular Expressions

Let $D = (0 \mid \dots \mid 9)$ and $L = (A \mid \dots \mid Z)$.

- A comment that begins with `--` and ends with Eol
  - $\text{Comment} = \texttt{--}\ \text{Not}(\text{Eol})^{*}\ \text{Eol}$
- A fixed decimal literal
  - $\text{Lit} = D^{+}\,.\,D^{+}$
- An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores
  - $\text{ID} = L\,(L \mid D)^{*}\,(\_\,(L \mid D)^{+})^{*}$
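The three definitions translate almost symbol for symbol into Python's `re` syntax. The translation below is a sketch that assumes Eol is the newline character and that the decimal point in Lit is a literal `.` character.

```python
import re

COMMENT = re.compile(r"--[^\n]*\n")                 # -- Not(Eol)* Eol
LIT = re.compile(r"[0-9]+\.[0-9]+")                 # D+ . D+
ID = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")     # L(L|D)*(_(L|D)+)*

print(bool(COMMENT.fullmatch("-- a comment\n")))    # True
print(bool(LIT.fullmatch("3.14")))                  # True
print(bool(LIT.fullmatch("3.")))                    # False: digits needed after the point
print(bool(ID.fullmatch("A_B1")))                   # True
print(bool(ID.fullmatch("A__B")))                   # False: consecutive underscores
print(bool(ID.fullmatch("A_")))                     # False: ends with an underscore
```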

## Using a Scanner Generator: Lex

- Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Labs, written in C, running under UNIX.
- Lex produces an entire scanner module that can be compiled and linked with other compiler modules.
- Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed.
- A typical Lex program contains three sections separated by `%%` delimiters.

## First Section of Lex

- The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67)
  - `[]` delimits character classes.
  - `-` denotes ranges: `[xyz]` is equivalent to `[x-z]`.
  - `\` denotes the escape character, as in C.
  - `^` complements a character class (Not): `[^xy]` denotes all characters except x and y.
  - `|`, `*`, and `+` (alternation, Kleene closure, and positive closure) are provided.
  - `()` can be used to control grouping of subexpressions.
  - `(expr)?` is equivalent to `(expr)|λ`, i.e., it matches expr zero times or once.
  - `{}` signals the macro expansion of a symbol defined in the first section.

## First Section of Lex, cont.

- Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.
  - `[ab][cd]` will match any of `ad`, `ac`, `bc`, and `bd`.
  - `begin` is equivalent to `"begin"` and to `[b][e][g][i][n]`.
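Most of this notation is shared by Python's `re` module, so it can be tried without Lex installed. The sketch below is an approximation, not Lex itself: in particular, `re` has no `{name}` macro expansion, so that feature is emulated here with `str.format` and a hypothetical `definitions` table.

```python
import re

# Character classes and ranges: [xyz] and [x-z] accept the same characters.
assert all(re.fullmatch(r"[xyz]", c) and re.fullmatch(r"[x-z]", c) for c in "xyz")

# ^ complements a class: [^xy] accepts any single character except x and y.
assert re.fullmatch(r"[^xy]", "z") and not re.fullmatch(r"[^xy]", "x")

# (expr)? matches expr zero times or once.
assert re.fullmatch(r"(ab)?c", "c") and re.fullmatch(r"(ab)?c", "abc")

# Catenation by juxtaposition: [ab][cd] matches exactly four strings.
candidates = ["ac", "ad", "bc", "bd", "ab", "cd"]
matched = [s for s in candidates if re.fullmatch(r"[ab][cd]", s)]
print(matched)  # ['ac', 'ad', 'bc', 'bd']

# Emulating Lex's {name} macro expansion with str.format (an assumption of
# this sketch; it is not a feature of the re module).
definitions = {"L": "[A-Za-z]", "D": "[0-9]"}
ident = re.compile("{L}({L}|{D})*".format(**definitions))
print(bool(ident.fullmatch("x25")))  # True
```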

## Second Section of Lex

- The second section of Lex defines a table of regular expressions and corresponding commands.
  - When an expression is matched, its associated command is executed.
    - Auxiliary functions may be defined in the third section.
  - Input that is matched is stored in the string variable `yytext`, whose length is `yyleng`.
  - Lex creates an integer function `yylex()` that may be called from the parser.
    - The value returned is usually the token code of the token scanned by Lex.
  - When `yylex()` encounters end of file, it calls a user-supplied integer function named `yywrap()` to wrap up input processing.
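The interplay of the rule table, `yytext`, `yyleng`, `yylex()`, and `yywrap()` can be imitated in a few lines of Python. This is a toy model for illustration, not how Lex is implemented: the class `MiniLex`, the token codes, and the rules are all hypothetical, and the sketch simply returns 0 after calling `yywrap()` instead of honoring its return value.

```python
import re

# Hypothetical token codes; a real parser would supply its own.
ID, NUM = 1, 2

class MiniLex:
    """Toy imitation of a Lex-generated scanner."""

    def __init__(self, text, rules, yywrap=lambda: 1):
        self.text, self.pos = text, 0
        self.rules = [(re.compile(p), act) for p, act in rules]
        self.yywrap = yywrap
        self.yytext, self.yyleng = "", 0

    def yylex(self):
        while self.pos < len(self.text):
            best, best_act = None, None
            for pattern, act in self.rules:
                m = pattern.match(self.text, self.pos)
                # Longest match wins; on a tie the earlier rule is kept.
                if m and m.end() > (best.end() if best else self.pos):
                    best, best_act = m, act
            if best is None:
                raise ValueError(f"no rule matches at position {self.pos}")
            self.yytext, self.yyleng = best.group(), len(best.group())
            self.pos = best.end()
            code = best_act(self)       # run the command attached to the rule
            if code is not None:        # None means "no token, keep scanning"
                return code
        self.yywrap()                   # end of input: let the user wrap up
        return 0                        # 0 signals end of file to the caller

rules = [
    (r"[A-Za-z][A-Za-z0-9]*", lambda lx: ID),
    (r"[0-9]+", lambda lx: NUM),
    (r"\s+", lambda lx: None),          # skip white space
]

lexer = MiniLex("count 42", rules)
tokens = []
while (code := lexer.yylex()) != 0:
    tokens.append((code, lexer.yytext, lexer.yyleng))
print(tokens)  # [(1, 'count', 5), (2, '42', 2)]
```

The loop at the end plays the role of the parser, which calls `yylex()` repeatedly and reads `yytext` when it needs the matched text.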

## Dealing with Multiple Input Files

`yylex()` uses three user-defined functions to handle character I/O:

- `input()`: retrieve a single character, 0 on EOF
- `output(c)`: write a single character to the output
- `unput(c)`: put a single character back on the input to be re-read

## Translating Regular Expressions into Finite Automata

- Remember the relationship between RE and FA.
- The main job of a scanner generator program is to transform a regular expression definition into an equivalent (D)FA.
- A regular expression is first translated into a nondeterministic finite automaton (NFA), and the NFA is then translated into a DFA (2 steps).
- An NFA, when reading a particular input, is not required to make a unique (deterministic) choice of which state to visit.

## Translating RE into NFA

- Any regular expression can be transformed into an NFA with the following properties:
  - There is a unique final state.
  - The final state has no successors.
  - Every other state has either one or two successors.
- Regular expressions are built out of the atomic regular expressions $a$ (where $a$ is a character in $V$) and $\lambda$ by using the three operations $AB$, $A \mid B$, and $A^*$.

## An NFA for A B

[Figure: NFA diagram]

## An NFA for $A^*$

[Figure: NFA diagram]
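One common way to obtain an NFA with a unique final state, no successors on that final state, and one or two successors everywhere else is a Thompson-style construction. The Python sketch below is such a construction under stated assumptions: it may differ in detail from the diagrams on the slides, it takes the regular expression as a nested tuple rather than parsing it, and all names (`NFA`, `build`, `closure`, `matches`) are invented for this example. A label of `None` stands for a $\lambda$ transition.

```python
from itertools import count

class NFA:
    def __init__(self):
        self.edges = {}              # state -> list of (label, target)
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))

def build(nfa, expr):
    """Return (start, final) of an NFA fragment recognizing expr."""
    kind = expr[0]
    if kind == "char":                       # atomic expression a
        s, f = nfa.new_state(), nfa.new_state()
        nfa.add(s, expr[1], f)
        return s, f
    if kind == "cat":                        # AB
        s1, f1 = build(nfa, expr[1])
        s2, f2 = build(nfa, expr[2])
        nfa.add(f1, None, s2)
        return s1, f2
    if kind == "alt":                        # A|B
        s, f = nfa.new_state(), nfa.new_state()
        s1, f1 = build(nfa, expr[1])
        s2, f2 = build(nfa, expr[2])
        nfa.add(s, None, s1)
        nfa.add(s, None, s2)
        nfa.add(f1, None, f)
        nfa.add(f2, None, f)
        return s, f
    if kind == "star":                       # A*
        s, f = nfa.new_state(), nfa.new_state()
        s1, f1 = build(nfa, expr[1])
        nfa.add(s, None, s1)
        nfa.add(s, None, f)
        nfa.add(f1, None, s1)
        nfa.add(f1, None, f)
        return s, f
    raise ValueError(f"unknown expression kind {kind!r}")

def closure(nfa, states):
    """All states reachable from `states` using only lambda transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa.edges[s]:
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def matches(expr, text):
    nfa = NFA()
    start, final = build(nfa, expr)
    current = closure(nfa, {start})
    for ch in text:
        moved = {t for s in current for label, t in nfa.edges[s] if label == ch}
        current = closure(nfa, moved)
    return final in current

# (a|b)*c written as a nested tuple
EXPR = ("cat", ("star", ("alt", ("char", "a"), ("char", "b"))), ("char", "c"))
print(matches(EXPR, "abac"))  # True
print(matches(EXPR, "ab"))    # False
```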

## Translating NFA into DFA

- Each state of the DFA ($M$) corresponds to a set of states of the NFA ($N$).
  - Transforming $N$ to $M$ is done by subset construction.
- $M$ will be in state $\{x, y, z\}$ after reading a given input string if and only if $N$ could be in any of the states $x$, $y$, or $z$, depending on the transitions it chooses.
  - $M$ keeps track of all the possible routes $N$ might take and runs them in parallel.
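A minimal sketch of subset construction in Python follows. The NFA here is a small hand-written example (strings over `{a, b}` that end in `ab`), not one from the slides, and the function names are hypothetical. Each DFA state is a `frozenset` of NFA states, and a label of `None` would denote a $\lambda$ transition.

```python
def closure(edges, states):
    """All NFA states reachable from `states` using only lambda (None) edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in edges.get(s, []):
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def subset_construction(edges, start, final, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    start_set = frozenset(closure(edges, {start}))
    delta = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        S = worklist.pop()
        for ch in alphabet:
            moved = {t for s in S for label, t in edges.get(s, []) if label == ch}
            T = frozenset(closure(edges, moved))
            if not T:                    # no NFA route survives: no DFA edge
                continue
            delta[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    accepting = {S for S in seen if final in S}
    return start_set, delta, accepting

def run(dfa, text):
    state, delta, accepting = dfa
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting

# Hand-written NFA: state 0 loops on a and b, then a -> 1, b -> 2 (final).
NFA_EDGES = {0: [("a", 0), ("b", 0), ("a", 1)], 1: [("b", 2)], 2: []}
DFA = subset_construction(NFA_EDGES, start=0, final=2, alphabet="ab")

print(run(DFA, "aab"))  # True
print(run(DFA, "aba"))  # False
```

For this NFA the construction reaches three DFA states, `{0}`, `{0, 1}`, and `{0, 2}`, and only `{0, 2}` is accepting because it contains the NFA's final state.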

