CPSC 325 - Compiler
Tutorial 2: Scanner & Lex
Tokens
Token stream: each significant lexical chunk of the input program is represented by a token.
  Operators & punctuation: { } ! + - = * ; : …
  Keywords: if while return goto
  Identifiers: id & actual name
  Constants: kind & value; int, floating-point, character, string, …
Token – example 1
Input text:
  if( x >= y ) y = 10;
Token stream:
  IF LP ID(x) GEQ ID(y) RP ID(y) Assign INT(10) SEMI
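One possible way to represent such tokens in C is sketched below: a kind plus the extra information that identifiers and constants carry. The names TokenKind, Token and token_to_string are illustrative assumptions, not part of the tutorial.

```c
#include <stdio.h>

/* Token kinds from the example stream; all names here are illustrative. */
typedef enum { TOK_IF, TOK_LP, TOK_RP, TOK_GEQ, TOK_ASSIGN, TOK_SEMI,
               TOK_ID, TOK_INT } TokenKind;

typedef struct {
    TokenKind kind;
    const char *name;   /* identifier spelling, used when kind == TOK_ID */
    int value;          /* constant value, used when kind == TOK_INT */
} Token;

/* Write the printable form of a token, e.g. "ID(x)" or "INT(10)", into buf. */
static void token_to_string(Token t, char *buf, size_t size) {
    /* Order matches the first six enumerators above. */
    static const char *names[] = { "IF", "LP", "RP", "GEQ", "Assign", "SEMI" };
    if (t.kind == TOK_ID)
        snprintf(buf, size, "ID(%s)", t.name);
    else if (t.kind == TOK_INT)
        snprintf(buf, size, "INT(%d)", t.value);
    else
        snprintf(buf, size, "%s", names[t.kind]);
}
```

Calling token_to_string on each token of the example, in order, would reproduce the token stream shown on the slide.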
Parser
Tokens:
  IF LP ID(x) GEQ ID(y) RP ID(y) Assign INT(10) SEMI
Resulting tree:
  IfStmt
    >=
      ID(x)
      ID(y)
    assign
      ID(y)
      INT(10)
Sample Grammar
  program    ::= statement | program statement
  statement  ::= assignStmt | ifStmt
  assignStmt ::= id = expr ;
  ifStmt     ::= if ( expr ) statement
  expr       ::= id | int | expr + expr
  id         ::= a | b | … | y | z
  int        ::= 1 | 2 | … | 9 | 0
a, b, 1, 2, 0 are terminal symbols; program, statement, id are non-terminal symbols.
Why Separate the Scanner and Parser?
Simplicity & separation of concerns
  Scanner hides details from the parser (comments, whitespace, input files, etc.)
  Parser is easier to build; it has a simpler input stream
Efficiency
  Scanner can use a simpler, faster design
  (But it still often consumes a surprising amount of the compiler’s total execution time)
Principle of Longest Match
In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice.
Example:
  return apple != banana;
Should be recognized as 5 tokens:
  return ID(apple) NEQ ID(banana) SEMI
Not more (not parts of words or identifiers, and not ! and = as separate tokens).
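A minimal sketch of longest match for operators: every candidate lexeme is tried against the input and the longest one that fits wins, so != is never split into ! and =. The function name longest_operator and the contents of the table are assumptions made for this illustration.

```c
#include <string.h>

/* Return the length of the longest operator lexeme that is a prefix of s,
   or 0 if none matches. On success *name receives the token name. */
static size_t longest_operator(const char *s, const char **name) {
    static const struct { const char *lexeme; const char *token; } table[] = {
        { "!", "NOT" }, { "!=", "NEQ" }, { "<", "LESS" }, { "<=", "LEQ" },
        { "=", "Assign" }, { ";", "SEMI" },
    };
    size_t best = 0;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        size_t n = strlen(table[i].lexeme);
        /* Keep a candidate only if it matches and beats the best so far. */
        if (n > best && strncmp(s, table[i].lexeme, n) == 0) {
            best = n;
            *name = table[i].token;
        }
    }
    return best;
}
```

On the input "!= banana;" both ! and != match at the start, and the function reports the two-character NEQ because it is longer.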
Scanner DFA Example (1)
Start state: loops on white space or comments
  end of input -> state 1: Accept EOF
  (            -> state 2: Accept LP
  )            -> state 3: Accept RP
  ;            -> state 4: Accept SEMI
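This part of the DFA could be hand-coded roughly as follows. The slide does not say what a comment looks like, so the sketch assumes // comments that run to the end of the line; the names Tok and next_simple_token are also assumptions.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { T_EOF, T_LP, T_RP, T_SEMI, T_UNKNOWN } Tok;

/* Skip white space and comments, then classify the next character.
   *pos is advanced past whatever was consumed. */
static Tok next_simple_token(const char *src, size_t *pos) {
    for (;;) {
        while (isspace((unsigned char)src[*pos]))
            (*pos)++;
        if (src[*pos] == '/' && src[*pos + 1] == '/') {  /* assumed comment syntax */
            while (src[*pos] != '\0' && src[*pos] != '\n')
                (*pos)++;
            continue;               /* there may be more white space or comments */
        }
        break;
    }
    switch (src[*pos]) {
    case '\0': return T_EOF;                /* end of input; nothing consumed */
    case '(':  (*pos)++; return T_LP;
    case ')':  (*pos)++; return T_RP;
    case ';':  (*pos)++; return T_SEMI;
    default:   return T_UNKNOWN;            /* left for other parts of the DFA */
    }
}
```

Repeated calls on the same buffer, with pos starting at 0, yield one token per call until T_EOF is returned.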
Scanner DFA Example (2)
Start state: loops on white space or comments
  !       -> state 5
    =     -> state 6: Accept NEQ
    other -> state 7: Accept NOT
  <       -> state 8
    =     -> state 9: Accept LEQ
    other -> state 10: Accept LESS
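The two-character operators can be sketched with one character of lookahead. On "other", the lookahead character is not consumed, so it remains available for the next token. The state numbers in the comments follow one reading of the diagram, and the names OpTok and scan_operator are assumptions.

```c
#include <stddef.h>

typedef enum { OP_NONE, OP_NOT, OP_NEQ, OP_LESS, OP_LEQ } OpTok;

/* Run the operator part of the DFA starting at src[*pos]. */
static OpTok scan_operator(const char *src, size_t *pos) {
    char first = src[*pos];
    if (first != '!' && first != '<')
        return OP_NONE;                 /* not handled by this part of the DFA */
    (*pos)++;                           /* state 5 after '!', state 8 after '<' */
    if (src[*pos] == '=') {
        (*pos)++;                       /* state 6 (NEQ) or state 9 (LEQ) */
        return first == '!' ? OP_NEQ : OP_LEQ;
    }
    /* "other": accept the one-character token, leave the lookahead unread. */
    return first == '!' ? OP_NOT : OP_LESS;   /* state 7 or state 10 */
}
```

For the input "!x" the function returns OP_NOT and leaves pos at the x, which the identifier part of the scanner would then pick up.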
Scanner DFA Example (3)
Start state: loops on white space or comments
  [0-9]   -> state 11 (loops on [0-9])
    other -> state 12: Accept INT
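A small sketch of the integer part of the DFA: digits are consumed in a loop and the first non-digit ends the token without being consumed. The name scan_int is an assumption, and overflow of very long digit strings is deliberately not handled here.

```c
#include <ctype.h>
#include <stddef.h>

/* Consume [0-9]+ starting at src[*pos] and store its value in *value.
   Returns 1 on success, 0 if src[*pos] is not a digit.
   Note: no overflow check, this is only a sketch. */
static int scan_int(const char *src, size_t *pos, int *value) {
    if (!isdigit((unsigned char)src[*pos]))
        return 0;
    int v = 0;
    while (isdigit((unsigned char)src[*pos])) {   /* loop on [0-9] */
        v = v * 10 + (src[*pos] - '0');
        (*pos)++;
    }
    *value = v;                                   /* "other" seen: accept INT */
    return 1;
}
```

On the input "10;" this accepts INT(10) and stops at the semicolon.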
Scanner DFA Example (4)
Start state: loops on white space or comments
  [a-zA-Z] -> state 13 (loops on [a-zA-Z])
    other  -> state 14: Accept ID or keyword
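The "ID or keyword" decision is commonly made after the whole word has been read, by looking it up in a keyword table. The sketch below does that with the keywords listed earlier in the tutorial; the name scan_word and its parameters are assumptions.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Consume [a-zA-Z]+ at src[*pos], copy it into buf (capacity size) and set
   *is_keyword. Returns the lexeme length, or 0 if no word starts here or
   the word does not fit in buf (in which case *pos is left unchanged). */
static size_t scan_word(const char *src, size_t *pos, char *buf, size_t size,
                        int *is_keyword) {
    static const char *keywords[] = { "if", "while", "return", "goto" };
    size_t start = *pos;
    while (isalpha((unsigned char)src[*pos]))     /* loop on letters */
        (*pos)++;
    size_t n = *pos - start;
    if (n == 0 || n >= size) {
        *pos = start;
        return 0;
    }
    memcpy(buf, src + start, n);
    buf[n] = '\0';
    *is_keyword = 0;
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(buf, keywords[i]) == 0)
            *is_keyword = 1;                      /* whole word equals a keyword */
    return n;
}
```

Because the lookup happens only after the longest run of letters has been read, a word such as iffy is reported as an identifier and not as the keyword if followed by fy.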
Lex/Flex
Use Flex instead of Lex
Use Bison instead of yacc
When compiling, link against the Lex library (-ll; with Flex this is often -lfl):
  flex file.lex
  gcc -o object lex.yy.c -ll
  ./object
Lex - Structure
  Declarations/Definitions
  %%
  Rules/Productions, each made up of:
    - Lex expression
    - white space
    - C statement (optional)
  %%
  Additional Code/Subroutines
Lex – Basic operators
  *   - zero or more occurrences
  .   - "ANY" character (except newline)
  .*  - matches any sequence of characters (up to the end of the line)
  |   - separator between alternatives
  +   - one or more occurrences (a+ ::= aa*)
  ?   - zero or one of something (b? ::= (b | null))
  [ ] - choice, so [12345] is the same as (1|2|3|4|5)
        (Note: [*+] represents a choice between star and plus; inside brackets they lose their special meaning.)
  -   - range: [a-zA-Z] means a to z and A to Z, all the letters
  \   - escape: \* matches *, and \. matches a period or decimal point
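These operators can be tried out from C with the POSIX regex functions regcomp and regexec in extended mode. This is only an approximation: POSIX extended syntax is not identical to Lex syntax (Lex adds quoted strings and named definitions, for instance), but the operators listed above behave the same way in the cases shown. The helper name full_match is an assumption.

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole of text matches pattern (POSIX extended syntax),
   0 if it does not, and -1 if the pattern is too long or fails to compile. */
static int full_match(const char *pattern, const char *text) {
    char anchored[256];
    regex_t re;
    /* Anchor the pattern so that it must cover the entire text. */
    int len = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (len < 0 || (size_t)len >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, full_match("[a-zA-Z]+", "Lex") reports a match, while full_match("[*+]", "a") does not, since inside brackets the star and plus stand only for themselves.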