Compilation 0368-3133 (Semester A, 2013/14) Lecture 2: Lexical Analysis Modern Compiler Design: Chapter 2.1 Noam Rinetzky 1.



Admin Slides: –All info: Moodle Classroom: Trubowicz 101 (Law school) –Except 5 Nov 2013 Mobiles... 2

What is a Compiler? “A compiler is a computer program that transforms source code written in a programming language (source language) into another language (target language). The most common reason for wanting to transform source code is to create an executable program.” --Wikipedia 3

Conceptual Structure of a Compiler Executable code exe Source text txt Semantic Representation Backend Compiler Frontend Lexical Analysis Syntax Analysis Parsing Semantic Analysis Intermediate Representation (IR) Code Generation 4

Conceptual Structure of a Compiler Executable code exe Source text txt Semantic Representation Backend Compiler Frontend Lexical Analysis Syntax Analysis Parsing Semantic Analysis Intermediate Representation (IR) Code Generation wordssentences 5

What does Lexical Analysis do? Language: fully parenthesized expressions Expr → Num | LP Expr Op Expr RP Num → Dig | Dig Num Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’ LP → ‘(’ RP → ‘)’ Op → ‘+’ | ‘*’ ( ( ) * 19 ) 6

What does Lexical Analysis do? Language: fully parenthesized expressions Context free language Regular languages ( ( ) * 19 ) Expr → Num | LP Expr Op Expr RP Num → Dig | Dig Num Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’ LP → ‘(’ RP → ‘)’ Op → ‘+’ | ‘*’ 7


What does Lexical Analysis do? Language: fully parenthesized expressions Context free language Regular languages ( ( ) * 19 ) LP LP Num Op Num RP Op Num RP Expr → Num | LP Expr Op Expr RP Num → Dig | Dig Num Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’ LP → ‘(’ RP → ‘)’ Op → ‘+’ | ‘*’ 10

What does Lexical Analysis do? Language: fully parenthesized expressions Context free language Regular languages ( ( ) * 19 ) LP LP Num Op Num RP Op Num RP Kind Value Expr → Num | LP Expr Op Expr RP Num → Dig | Dig Num Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’ LP → ‘(’ RP → ‘)’ Op → ‘+’ | ‘*’ 11

What does Lexical Analysis do? Language: fully parenthesized expressions Context free language Regular languages ( ( ) * 19 ) LP LP Num Op Num RP Op Num RP Kind Value Expr → Num | LP Expr Op Expr RP Num → Dig | Dig Num Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’ LP → ‘(’ RP → ‘)’ Op → ‘+’ | ‘*’ Token … 12

Partitions the input into stream of tokens –Numbers –Identifiers –Keywords –Punctuation Usually represented as (kind, value) pairs –(Num, 23) –(Op, ‘*’) “word” in the source language “meaningful” to the syntactical analysis What does Lexical Analysis do? 13
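The (kind, value) pairs described above can be sketched as a tiny Java class. This is an illustration only; the kind names and fields are assumptions, not taken from the course code:

```java
// A token as a (kind, value) pair, as described above.
// The Kind names here are illustrative assumptions.
public class Token {
    public enum Kind { NUM, OP, ID, LPAREN, RPAREN }

    public final Kind kind;
    public final String value;   // the matched lexeme

    public Token(Kind kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    @Override
    public String toString() {
        return "(" + kind + ", " + value + ")";
    }
}
```

So (Num, 23) from the slide would be `new Token(Token.Kind.NUM, "23")`.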

From scanning to parsing ((23 + 7) * x) program text → Lexical Analyzer → token stream: LP LP Num(23) Op(+) Num(7) RP Op(*) Id(x) RP → Parser Grammar: E → ... | Id Id → ‘a’ |... | ‘z’ → Abstract Syntax Tree: Op(*) over ( Op(+) over ( Num(23), Num(7) ), Id(x) ) valid / syntax error 14

Why Lexical Analysis? Simplifies the syntax analysis (parsing) –And language definition Modularity Reusability Efficiency 15

Lecture goals Understand role & place of lexical analysis Lexical analysis theory Using program generating tools 16

Lecture Outline Role & place of lexical analysis What is a token? Regular languages Lexical analysis Error handling Automatic creation of lexical analyzers 17

What is a token? (Intuitively) A “word” in the source language –Anything that should appear in the input to syntax analysis Identifiers Values Language keywords Usually, represented as a pair of (kind, value) 18

Example Tokens
Type	Examples
ID	foo, n_14, last
NUM	73, 00, 517, 082
REAL	66.1, .5, 5.5e-10
IF	if
COMMA	,
NOTEQ	!=
LPAREN	(
RPAREN	)
19

Example Non Tokens
Type	Examples
comment	/* ignored */
preprocessor directive	#include, #define NUMS 5.6
macro	NUMS
whitespace	\t, \n, \b, ‘ ‘
20

Some basic terminology Lexeme (aka symbol) - a series of letters separated from the rest of the program according to a convention (space, semicolon, comma, etc.) Pattern - a rule specifying a set of strings. Example: “an identifier is a string that starts with a letter and continues with letters and digits” –(Usually) a regular expression Token - a pair of (pattern, attributes) 21

Example
void match0(char *s) /* find a zero */
{
  if (!strncmp(s, "0.0", 3))
    return 0. ;
}
Tokens: VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF
22

Example Non Tokens
Type	Examples
comment	/* ignored */
preprocessor directive	#include, #define NUMS 5.6
macro	NUMS
whitespace	\t, \n, \b, ‘ ‘
Lexemes that are recognized but get consumed rather than transmitted to parser –if –i/*comment*/f
23

How can we define tokens? Keywords – easy! –if, then, else, for, while, … Identifiers? Numerical Values? Strings? Characterize unbounded sets of values using a bounded description? 24

Lecture Outline Role & place of lexical analysis What is a token? Regular languages Lexical analysis Error handling Automatic creation of lexical analyzers 25

Regular languages Formal languages –Σ = finite set of letters –Word = sequence of letters –Language = set of words Regular languages defined equivalently by –Regular expressions –Finite-state automata 26

Common format for reg-exps
Є	the empty word
a	the letter a
R1|R2	alternation: a word from R1 or from R2
R1R2	concatenation: a word from R1 followed by a word from R2
R*	zero or more repetitions of R
R+	one or more repetitions of R
R?	zero or one occurrence of R
(R)	grouping
27

Examples ab*|cd? = { a, ab, abb, …, c, cd } (a|b)* = all words over {a, b}, including the empty word (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)* = all (possibly empty) sequences of digits 28

Escape characters What is the expression for one or more + symbols? –(+)+ won’t work –(\+)+ will A backslash \ before an operator turns it into a standard character –\*, \?, \+, a\(b\+\*, (a\(b\+\*)+, … Double quotes surround literal text –“a(b+*”, “a(b+*”+ 29

Shorthands Use names for expressions –letter = a | b | … | z | A | B | … | Z –letter_ = letter | _ –digit = 0 | 1 | 2 | … | 9 –id = letter_ (letter_ | digit)* Use hyphen to denote a range –letter = a-z | A-Z –digit = 0-9 30

Examples if = if then = then relop = < | > | <= | >= | = | <> digit = 0-9 digits = digit+ 31

Example A number is number = ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+ ( Є | . ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+ ( Є | E ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+ ) ) Using shorthands it can be written as number = digits ( Є | . digits ( Є | E (Є|+|-) digits ) ) 32
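As a quick sanity check, the shorthand form translates almost directly into a java.util.regex pattern. This is a sketch, not course code; note the Є alternatives become optional groups (and the sign part, if wanted, would become an optional [+-]):

```java
import java.util.regex.Pattern;

public class NumberPattern {
    // digits ( Є | . digits ( Є | E digits ) ), written as a Java regex:
    public static final Pattern NUMBER =
        Pattern.compile("[0-9]+(\\.[0-9]+(E[+-]?[0-9]+)?)?");

    public static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }
}
```

For example, 517, 5.5 and 5.5E-10 match, while .5 does not (the expression requires digits before the point).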

Exercise 1 - Question Language of Java identifiers –Identifiers start with either an underscore ‘_’ or a letter –Continue with either underscore, letter, or digit 33

Exercise 1 - Answer Language of Java identifiers –Identifiers start with either an underscore ‘_’ or a letter –Continue with either underscore, letter, or digit – (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)* – Using shorthand macros First= _|a|b|…|z|A|…|Z Next= First|0|…|9 R= First Next* 34
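The First/Next macros above map directly onto character classes; a minimal Java sketch (class and method names are my own, for illustration):

```java
import java.util.regex.Pattern;

public class JavaIdent {
    // First = _|a|…|z|A|…|Z ; Next = First|0|…|9 ; R = First Next*
    public static final Pattern ID =
        Pattern.compile("[_a-zA-Z][_a-zA-Z0-9]*");

    public static boolean isId(String s) {
        return ID.matcher(s).matches();
    }
}
```

So _x1 and foo are accepted, while 1x is rejected because a digit cannot come first.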

Exercise 1 - Question Language of rational numbers in decimal representation (no leading, ending zeros) –0 – – –Not 007 –Not

Exercise 1 - Answer Language of rational numbers in decimal representation (no leading, ending zeros) – Digit = 1|2|…|9 Digit0 = 0|Digit Num = Digit Digit0* Frac = Digit0* Digit Pos = Num | .Frac | 0.Frac | Num.Frac PosOrNeg = (Є|-)Pos R = 0 | PosOrNeg 36
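The macros in the answer can be merged into one Java regex; this sketch folds the four Pos alternatives into two branches (Num with an optional fraction, and an optional leading 0 before .Frac):

```java
import java.util.regex.Pattern;

public class RationalPattern {
    // Digit = 1-9, Digit0 = 0-9, Num = Digit Digit0*, Frac = Digit0* Digit
    // Pos = Num | .Frac | 0.Frac | Num.Frac ; R = 0 | (Є|-) Pos
    public static final Pattern R = Pattern.compile(
        "0|-?([1-9][0-9]*(\\.[0-9]*[1-9])?|0?\\.[0-9]*[1-9])");

    public static boolean matches(String s) {
        return R.matcher(s).matches();
    }
}
```

It accepts 0, -1.5 and .5, but rejects 007 (leading zeros) and 2.50 (trailing zero, since Frac must end in 1-9).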

Exercise 2 - Question Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], … 37

Exercise 2 - Answer Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], … Not regular Context-free Grammar: S ::= [] | [S] 38

Challenge: Ambiguity if = if id = letter_ (letter_ | digit)* “if” is a valid word in the language of identifiers… so what should it be? How about the identifier “iffy”? Solution –Always find longest matching token –Break ties using order of definitions… first definition wins (=> list rules for keywords before identifiers) 39
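The two tie-breaking rules above (longest match wins; on equal length, the first rule in the spec wins) can be sketched in a few lines of Java. The rule set here is a toy assumption with just IF and ID:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TieBreaking {
    // Rules in priority order: keywords listed before identifiers.
    private static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*"),
    };
    private static final String[] NAMES = { "IF", "ID" };

    // Kind of the longest token starting at position 0; because we only
    // replace the best on a strictly longer match, ties keep the earlier rule.
    public static String firstToken(String input) {
        int bestLen = 0;
        String bestKind = null;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            if (m.lookingAt() && m.end() > bestLen) {
                bestLen = m.end();
                bestKind = NAMES[i];
            }
        }
        return bestKind;
    }
}
```

On "if (x)" both rules match "if" with length 2, so IF wins by order; on "iffy = 1" the identifier rule matches 4 characters and wins by length.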

Creating a lexical analyzer Given a list of token definitions (pattern name, regex), write a program such that –Input: String to be analyzed –Output: List of tokens How do we build an analyzer? 40

Lecture Outline Role & place of lexical analysis What is a token? Regular languages Lexical analysis Error handling Automatic creation of lexical analyzers 41

A Simplified Scanner for C
Token nextToken() {
  char c ;
loop:
  c = getchar();
  switch (c) {
    case ' ': goto loop ;
    case ';': return SemiColumn;
    case '+':
      c = getchar() ;
      switch (c) {
        case '+': return PlusPlus ;
        case '=': return PlusEqual;
        default: ungetc(c); return Plus;
      };
    case '<': …
    case 'w': …
  }
}
42

But we have a much better way! Generate a lexical analyzer automatically from token definitions Main idea: Use finite-state automata to match regular expressions –Regular languages defined equivalently by Regular expressions Finite-state automata 43

Reg-exp vs. automata Regular expressions are declarative –Offer compact way to define a regular language by humans –Don’t offer direct way to check whether a given word is in the language Automata are operative –Define an algorithm for deciding whether a given word is in a regular language –Not a natural notation for humans 44

Overview Define tokens using regular expressions Construct a nondeterministic finite-state automaton (NFA) from regular expression Determinize the NFA into a deterministic finite-state automaton (DFA) DFA can be directly used to identify tokens 45

Finite-State Automata Deterministic automaton M = (Σ, Q, δ, q0, F) –Σ – alphabet –Q – finite set of states –q0 ∈ Q – initial state –F ⊆ Q – final states –δ : Q × Σ → Q – transition function 46

Finite-State Automata Non-Deterministic automaton M = (Σ, Q, δ, q0, F) –Σ – alphabet –Q – finite set of states –q0 ∈ Q – initial state –F ⊆ Q – final states –δ : Q × (Σ ∪ {Є}) → 2^Q – transition function Possible Є-transitions For a word w, M can reach a number of states or get stuck. If some state reached is final, M accepts w. 47

Deterministic Finite automata start a b b c accepting state start state transition An automaton is defined by states and transitions 48

Accepting Words Words are read left-to-right cba start a b b c 49

Accepting Words cba start a b b c Words are read left-to-right 50

Accepting Words cba start a b b c Words are read left-to-right 51

Accepting Words word accepted cba start a b b c Words are read left-to-right 52

Rejecting Words cbb start a b b c 53

Rejecting Words Missing transition means non-acceptance cbb start a b b c 54
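A DFA run like the ones pictured is just a loop over a transition function, with a missing transition meaning rejection. This sketch uses an invented two-state DFA for identifiers (letter followed by letters or digits), not the automaton from the slide:

```java
public class DfaSim {
    private static final int DEAD = -1;

    // Hypothetical DFA for letter (letter|digit)*.
    private static int step(int state, char c) {
        boolean letter = Character.isLetter(c);
        boolean digit = Character.isDigit(c);
        if (state == 0 && letter) return 1;            // start -> in identifier
        if (state == 1 && (letter || digit)) return 1; // stay in identifier
        return DEAD;                                   // missing transition = reject
    }

    public static boolean accepts(String w) {
        int state = 0;                                 // initial state
        for (char c : w.toCharArray()) {
            state = step(state, c);
            if (state == DEAD) return false;
        }
        return state == 1;                             // 1 is the only accepting state
    }
}
```

Words are read left-to-right exactly as in the slides; "x1" is accepted, "1x" gets stuck on the first character.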

Non-deterministic automata Allow multiple transitions from given state labeled by same letter start a a b c c b 55

NFA run example cba start a a b c c b 56

NFA run example Maintain set of states cba start a a b c c b 57

NFA run example cba start a a b c c b 58

NFA run example Accept word if any of the states in the set is accepting cba start a a b c c b 59

NFA+Є automata Є transitions can “fire” without reading the input Є start a b c 60

NFA+Є run example cba Є start a b c 61

NFA+Є run example Now Є transition can non-deterministically take place cba Є start a b c 62

NFA+Є run example cba Є start a b c 63

NFA+Є run example cba Є start a b c 64

NFA+Є run example cba Є start a b c 65

NFA+Є run example cba Word accepted Є start a b c 66
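The set-of-states runs shown in the last slides can be coded directly: keep a set of current states, expand it through Є-transitions, and accept if any state in the set is final. A hedged Java sketch (the map encoding, start state 0 and the EPS marker are my own assumptions):

```java
import java.util.*;

public class NfaSim {
    public static final char EPS = 'ε';   // marker for Є-transitions

    // trans.get(state).get(symbol) = set of successor states
    private final Map<Integer, Map<Character, Set<Integer>>> trans;
    private final Set<Integer> accepting;

    public NfaSim(Map<Integer, Map<Character, Set<Integer>>> trans, Set<Integer> accepting) {
        this.trans = trans;
        this.accepting = accepting;
    }

    private Set<Integer> targets(int s, char c) {
        return trans.getOrDefault(s, Map.of()).getOrDefault(c, Set.of());
    }

    // Add every state reachable through Є-transitions.
    private Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> result = new HashSet<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : targets(s, EPS))
                if (result.add(t)) work.push(t);
        }
        return result;
    }

    // Maintain the set of possible states; accept if any final state is reached.
    public boolean accepts(String w) {
        Set<Integer> current = closure(Set.of(0));   // 0 assumed to be the start state
        for (char c : w.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) next.addAll(targets(s, c));
            current = closure(next);
        }
        for (int s : current) if (accepting.contains(s)) return true;
        return false;
    }
}
```

For instance, with transitions 0 –a→ 1, 1 –Є→ 2, 2 –b→ 3 and accepting state 3, the word ab is accepted and a is not.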

From regular expressions to NFA Step 1: assign expression names and obtain pure regular expressions R 1 …R m Step 2: construct an NFA M i for each regular expression R i Step 3: combine all M i into a single NFA Ambiguity resolution: prefer longest accepting word 67

From reg. exp. to automata Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression Proof: by induction on the structure of the regular expression start 68

Basic constructs R = Є R = ∅ R = a 69

Composition R = R1 | R2: M1 and M2 combined in parallel, entered and exited via Є-transitions R = R1R2: M1 followed by M2, connected by an Є-transition 70

Repetition R = R1*: M1 wrapped with Є-transitions – an Є bypass around M1 and an Є edge looping back for repetition 71

What now? Naïve approach: try each automaton separately Given a word w: –Try M 1 (w) –Try M 2 (w) –… –Try M n (w) Requires resetting after every attempt 72

Combine automata 1 2 a a 3 a 4 b 5 b 6 abb 7 8 b a*b+ b a 9 a 10 b 11 a 12 b 13 abab 0     a abb a*b+ abab combines 73

Ambiguity resolution Recall… Longest word Tie-breaker based on order of rules when words have same length Recipe –Turn NFA to DFA –Run until stuck, remember last accepting state, this is the token to be returned 74
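The recipe above can be sketched without building the DFA explicitly: try each rule's regex at the current position, keep the strictly longest match, and let ties fall to the rule listed first. The rules a, abb and a*b+ are the ones from the combined-automaton example; the code itself is an illustration, not the course implementation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunch {
    // Rules in spec order: name, regex.
    private static final String[][] RULES = {
        { "a",    "a" },
        { "abb",  "abb" },
        { "a*b+", "a*b+" },
    };

    // Scan one token starting at pos; longest match wins, ties go to the
    // earlier rule. Returns "name:lexeme", or null if the scanner is stuck.
    public static String scanOne(String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (String[] rule : RULES) {
            Matcher m = Pattern.compile(rule[1])
                               .matcher(input).region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                best = rule[0] + ":" + m.group();
            }
        }
        return best;
    }
}
```

This reproduces the examples on the next slides: on abaa the token is ab with pattern a*b+; on abba the token is abb, and the abb rule wins the tie because it comes first in the spec.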

Corresponding DFA The combined NFA determinized: each DFA state is a set of NFA states, and each accepting DFA state is tagged with the pattern it matches (abb, a*b+, abab, …) 75

Examples abaa: gets stuck after aba in state 12, backs up to state (5 8 11), pattern is a*b+, token is ab abba: stops after second b in (6 8), token is abb because it comes first in spec 76

Summary of Construction Describe tokens as regular expressions –and decide which attributes (values) are saved for each token Regular expressions are turned into a DFA –which describes the expressions and specifies which attributes (values) to keep Lexical analyzer simulates the run of the automaton with the given transition table on any input string 77

A Few Remarks Turning an NFA to a DFA is expensive, but –Exponential in the worst case – In practice, works fine The construction is done once per-language –At Compiler construction time –Not at compilation time 78

Implementation by Example
if	{ return IF; }
[a-z][a-z0-9]*	{ return ID; }
[0-9]+	{ return NUM; }
[0-9]"."[0-9]*|[0-9]*"."[0-9]+	{ return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)	{ ; }
.	{ error(); }
79

int edges[][256] = {
  /* …, 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
  /* state 0 */  {0, …, 0, 0, …, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
  /* state 1 */  {13, …, 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4, …, 13, 13},
  /* state 2 */  {0, …, 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, …, 0, 0},
  /* state 3 */  {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
  /* state 4 */  {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, …, 0, 0},
  /* state 5 */  {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0},
  /* state 6 */  {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
  /* state 7 */  …
  /* state … */  ...
  /* state 13 */ {0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}
};
80

Pseudo Code for Scanner
Token nextToken() {
  lastFinal = 0;
  currentState = 1 ;
  inputPositionAtLastFinal = input;
  currentPosition = input;
  while (not(isDead(currentState))) {
    nextState = edges[currentState][*currentPosition];
    if (isFinal(nextState)) {
      lastFinal = nextState ;
      inputPositionAtLastFinal = currentPosition;
    }
    currentState = nextState;
    advance currentPosition;
  }
  input = inputPositionAtLastFinal ;
  return action[lastFinal];
}
81

Example Input: “if --not-a-com” 82

Trace (last final state, current state, remaining input): (0, 1, “if --not-a-com”), … – after reading “if” the scanner returns IF 83

found whitespace Trace continues on “ --not-a-com”: the whitespace lexeme is matched and consumed, leaving “--not-a-com” 84

Trace on “--not-a-com”: the scanner advances through the comment states (9, 10), but the comment rule requires a closing newline, so it gets stuck – error 85

Trace on “-not-a-com”: after the single “-” no rule can match – error 86

Efficient Scanners Efficient state representation Input buffering Using switch and gotos instead of tables 87

DFSM from Specification Create a non-deterministic automaton (NDFA) from every regular expression Merge all the automata using epsilon moves (like the | construction) Construct a deterministic finite automaton (DFA) –State priority Minimize the automaton –start by separating accepting states by token kinds 88

NDFA Construction if { return IF; } [a-z][a-z0-9]* { return ID; } [0-9]+ { return NUM; } 89

DFA Construction 90

Minimization 91

Lecture Outline Role & place of lexical analysis What is a token? Regular languages Lexical analysis Error handling Automatic creation of lexical analyzers 92

Errors in lexical analysis pi = Illegal token pi = 3oranges Illegal token pi = oranges3 93

Error Handling Many errors cannot be identified at this stage Example: “fi (a==f(x))”. Should “fi” be “if”? Or is it a routine name? –We will discover this later in the analysis –At this point, we just create an identifier token Sometimes the lexeme does not match any pattern –Easiest: eliminate letters until the beginning of a legitimate lexeme –Alternatives: eliminate/add/replace one letter, replace order of two adjacent letters, etc. Goal: allow the compilation to continue Problem: errors that spread all over 94

Lecture Outline Role & place of lexical analysis What is a token? Regular languages Lexical analysis Error handling Automatic creation of lexical analyzers 95

Use of Program-Generating Tools Automatically generated lexical analyzer –Specification → Part of compiler –Compiler-Compiler Diagram: regular expressions → JFlex → scanner; input program → scanner → stream of tokens 96

Use of Program-Generating Tools Input: regular expressions and actions –Action = Java code Output: a scanner program that –Produces a stream of tokens –Invokes actions when a pattern is matched Diagram: regular expressions → JFlex → scanner; input program → scanner → stream of tokens 97

Line Counting Example Create a program that counts the number of lines in a given input text file 98

Creating a Scanner using Flex
int num_lines = 0;
%%
\n      ++num_lines;
.       ;
%%
main() {
  yylex();
  printf("# of lines = %d\n", num_lines);
}
99

Creating a Scanner using Flex
(automaton: from the initial state, \n loops back counting a newline, any other character is ignored)
int num_lines = 0;
%%
\n      ++num_lines;
.       ;
%%
main() {
  yylex();
  printf("# of lines = %d\n", num_lines);
}
100

JFLex Spec File
User code: copied directly to the Java file –Possible source of javac errors down the road
%%
JFlex directives: macros, state names, e.g.
DIGIT= [0-9]
LETTER= [a-zA-Z]
%%
Lexical analysis rules: –Optional state, regular expression, action –How to break input to tokens –Action when token matched, e.g.
<YYINITIAL> {LETTER} ({LETTER}|{DIGIT})*
101

Creating a Scanner using JFlex
import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}
%eofval{
  System.out.println("line number=" + lineCounter);
  return new Symbol(sym.EOF);
%eofval}
NEWLINE=\n
%%
{NEWLINE}    { lineCounter++; }
[^{NEWLINE}] { }
102

Catching errors What if input doesn’t match any token definition? Trick: Add a “catch-all” rule that matches any character and reports an error –Add after all other rules 103

A JFlex specification of C Scanner
import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}
Letter= [a-zA-Z_]
Digit= [0-9]
%%
"\t"    { }
"\n"    { lineCounter++; }
";"     { return new Symbol(sym.SemiColumn); }
"++"    { return new Symbol(sym.PlusPlus); }
"+="    { return new Symbol(sym.PlusEq); }
"+"     { return new Symbol(sym.Plus); }
"while" { return new Symbol(sym.While); }
{Letter}({Letter}|{Digit})*  { return new Symbol(sym.Id, yytext()); }
"<="    { return new Symbol(sym.LessOrEqual); }
"<"     { return new Symbol(sym.LessThan); }
104

A Simplified Scanner for C
Token nextToken() {
  char c ;
loop:
  c = getchar();
  switch (c) {
    case ' ': goto loop ;
    case ';': return SemiColumn;
    case '+':
      c = getchar() ;
      switch (c) {
        case '+': return PlusPlus ;
        case '=': return PlusEqual;
        default: ungetc(c); return Plus;
      };
    case '<': …
    case 'w': …
  }
}
105

Automatic Construction of Scanners Construction is done automatically by common tools lex is your friend –Automatically generates a lexical analyzer from a declaration file Advantages: short declaration file, easily checked, easily modified and maintained Diagram: declaration file → lex → Lexical Analysis; characters → Lexical Analysis → tokens Intuitively: –Lex builds a DFA table –The analyzer simulates (runs) the DFA on a given input 106

Lecture Outline Role & place of lexical analysis What is a token? Regular languages Lexical analysis Error handling Automatic creation of lexical analyzers 107

Start States It may be hard to specify regular expressions for certain constructs –Examples Strings Comments Writing automata may be easier Can combine both Specify partial automata with regular expressions on the edges –No need to specify all states –Different actions at different states 108

Missing Creating a lexical analyzer by hand Table compression Symbol Tables Nested Comments Handling Macros 109

What does Lexical Analysis do? Input: program text (file) Output: sequence of tokens Read input file Identify language keywords and standard identifiers Handle include files and macros Count line numbers Remove whitespaces Report illegal symbols [Produce symbol table] 110

Lexical Analysis – Impl. Lexical analyzer –Turns character stream into token stream –Tokens defined using regular expressions –Regular expressions -> NFA -> DFA construction for identifying tokens –Automated constructions of lexical analyzer using lex 111

Automatic Construction of Scanners For most programming languages lexical analyzers can be easily constructed automatically –Exceptions: Fortran PL/1 Lex/Flex/Jlex/JFlex are useful tools 112

Next Week Syntax analysis (aka parsing) Lecture in this room –Trubowicz 101 (Law school) 113


NFA vs. DFA Example: (a|b)*a(a|b)(a|b)…(a|b) (n times)
Automaton	SPACE	TIME
NFA	O(|r|)	O(|r|·|w|)
DFA	O(2^|r|)	O(|w|)
115

From NFA+Є to DFA Construction requires O(n) states for a reg-exp of length n Running an NFA+Є with n states on a string of length m takes O(m·n²) time Solution: determinization via subset construction –Number of states worst-case exponential in n –Running time O(m) 116

Subset construction For an NFA+Є with states M={s1,…,sk} Construct a DFA with one state per set of states of the corresponding NFA –M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …} Simulate transitions between individual states for every letter a NFA+Є transitions s1 –a→ s2 and s4 –a→ s7 become the DFA transition [s1,s4] –a→ [s2,s7] 117

Subset construction For an NFA+Є with states M={s1,…,sk} Construct a DFA with one state per set of states of the corresponding NFA –M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …} Extend macro states by states reachable via Є-transitions An Є-transition s1 –Є→ s4 extends the DFA state [s1,s2] to [s1,s2,s4] 118
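The two rules above can be sketched as a compact Java determinizer: each DFA state is an Є-closed set of NFA states, explored with a worklist. The map-based NFA encoding and the choice of 0 as the start state are assumptions for illustration:

```java
import java.util.*;

public class Determinize {
    public static final char EPS = 'ε';   // marker for Є-transitions

    // nfa.get(state).get(symbol) = successor states; EPS marks Є-edges.
    static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> nfa,
                                Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> result = new HashSet<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : nfa.getOrDefault(s, Map.of()).getOrDefault(EPS, Set.of()))
                if (result.add(t)) work.push(t);
        }
        return result;
    }

    // Subset construction: returns the DFA transition table, keyed by macro state.
    public static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Character> alphabet) {
        Set<Integer> start = closure(nfa, Set.of(0));
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>(List.of(start));
        while (!work.isEmpty()) {
            Set<Integer> macro = work.pop();
            if (dfa.containsKey(macro)) continue;
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char a : alphabet) {
                Set<Integer> next = new HashSet<>();
                for (int s : macro)
                    next.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
                next = closure(nfa, next);
                // Missing transitions (empty sets) are simply omitted: dead state.
                if (!next.isEmpty()) { row.put(a, next); work.push(next); }
            }
            dfa.put(macro, row);
        }
        return dfa;
    }
}
```

On the tiny NFA 0 –a→ 1, 1 –Є→ 2, 2 –b→ 3 this yields three DFA states: [0] –a→ [1,2] –b→ [3].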