 # Topic #3: Lexical Analysis

CSC 338 – Compiler Design and Implementation
Dr. Mohamed Ben Othman

Lexical Analyzer and Parser

Why Separate?

Reasons to separate lexical analysis from parsing:
- Simpler design
- Improved efficiency
- Portability
- Tools exist to help implement lexical analyzers and parsers independently

Tokens, Lexemes, and Patterns

- Tokens include keywords, operators, identifiers, constants, literal strings, and punctuation symbols
- A lexeme is a sequence of characters in the source program representing a token
- A pattern is a rule describing the set of lexemes that can represent a particular token

Technically speaking, lexical analyzers usually provide a single attribute per token (might be pointer into symbol table)
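As a rough illustration of the "single attribute per token" idea, a token can be modeled as a kind plus one integer attribute. This is only a sketch: the names `TokenKind`, `Token`, and `make_token` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Token kinds for the small language used later in the deck (names are illustrative). */
typedef enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_ID, TOK_NUM, TOK_RELOP, TOK_KIND_COUNT } TokenKind;

/* A token carries exactly one attribute. Here it is an int that is read as a
   symbol-table index for TOK_ID/TOK_NUM, an operator code for TOK_RELOP,
   and is unused for keywords. */
typedef struct {
    TokenKind kind;
    int attribute;
} Token;

/* Build a token value from its kind and its single attribute. */
Token make_token(TokenKind kind, int attribute) {
    Token t;
    assert(kind >= TOK_IF && kind < TOK_KIND_COUNT);
    t.kind = kind;
    t.attribute = attribute;
    return t;
}
```

A parser would typically look only at `kind`, while later phases follow `attribute` into the symbol table.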

Buffer

- Most lexical analyzers use a buffer
- Often buffers are divided into two N-character halves
- Two pointers are used to indicate the start and end of the lexeme
- If a pointer walks past the end of either half of the buffer, the other half is reloaded
- A sentinel character can be used to decrease the number of checks necessary
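The two-half buffer with sentinels might be sketched as follows. This is a simplified illustration under several assumptions: the input comes from an in-memory string rather than a file, the sentinel is `'\0'` and never occurs in the source text, only the forward pointer is tracked (the lexeme-start pointer is omitted), and the names `Scanner`, `load_half`, `scanner_init`, and `next_char` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HALF 4            /* N: kept tiny so that reloads happen often */
#define SENTINEL '\0'     /* assumed never to occur inside the source text */

typedef struct {
    const char *src;              /* unread input (stands in for a file) */
    char buf[2 * (HALF + 1)];     /* two halves, each followed by a sentinel slot */
    char *forward;                /* scanning pointer */
} Scanner;

/* Copy up to HALF characters into the given half, then append a sentinel.
   If fewer than HALF characters remain, the sentinel lands inside the half
   and marks the real end of input. */
void load_half(Scanner *s, char *half) {
    size_t n = strlen(s->src);
    if (n > HALF) n = HALF;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    s->src += n;
}

void scanner_init(Scanner *s, const char *text) {
    assert(s != NULL && text != NULL);
    s->src = text;
    load_half(s, s->buf);
    s->forward = s->buf;
}

/* Return the next character, or -1 at end of input.
   The common path performs a single comparison (against the sentinel);
   the buffer-boundary checks run only after a sentinel is seen. */
int next_char(Scanner *s) {
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == s->buf + HALF) {                 /* end of first half */
            load_half(s, s->buf + HALF + 1);
            s->forward = s->buf + HALF + 1;
        } else if (s->forward == s->buf + 2 * HALF + 1) {  /* end of second half */
            load_half(s, s->buf);
            s->forward = s->buf;
        } else {
            return -1;                                     /* sentinel inside a half: end of input */
        }
    }
}
```

Without the sentinel, every call would have to test both "end of half?" and "which character?"; with it, the boundary tests are paid only once per half.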

Strings and Languages

- Alphabet – any finite set of symbols (e.g. ASCII, binary alphabet, or a set of tokens)
- String – a finite sequence of symbols drawn from an alphabet
- Language – a set of strings over a fixed alphabet
- Other terms relating to strings: prefix; suffix; substring; proper prefix, suffix, or substring (non-empty, not entire string); subsequence

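Operations on Languages

For languages L and M:
- Union: L ∪ M = {s | s is in L or s is in M}
- Concatenation: LM = {st | s is in L and t is in M}
- Kleene closure: L* = L⁰ ∪ L¹ ∪ L² ∪ …, zero or more concatenations of L
- Positive closure: L+ = L¹ ∪ L² ∪ …, one or more concatenations of L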

Regular Expressions

Defined over an alphabet Σ:
- ε represents {ε}, the set containing the empty string
- If a is a symbol in Σ, then a is a regular expression denoting {a}, the set containing the string a
- If r and s are regular expressions denoting the languages L(r) and L(s), then:
  - (r)|(s) is a regular expression denoting L(r) ∪ L(s)
  - (r)(s) is a regular expression denoting L(r)L(s)
  - (r)* is a regular expression denoting (L(r))*
  - (r) is a regular expression denoting L(r)
- Precedence: * (left associative), then concatenation (left associative), then | (left associative)

Regular Definitions

- Can give “names” to regular expressions
- Convention: names in boldface (to distinguish them from symbols)

For example:

    letter → A | B | … | Z | a | b | … | z
    digit  → 0 | 1 | … | 9
    id     → letter ( letter | digit )*
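The definition of id translates almost directly into a hand-written matcher. The sketch below is illustrative only: the function name `match_id` is hypothetical, and it relies on `isalpha`/`isalnum` from `<ctype.h>`, which match exactly A–Z, a–z, and 0–9 in the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the length of the longest prefix of s that matches
   letter ( letter | digit )*  , or 0 if s does not start with a letter. */
size_t match_id(const char *s) {
    size_t i;
    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return 0;                          /* first symbol must be a letter */
    i = 1;
    while (isalnum((unsigned char)s[i]))   /* then zero or more letters or digits */
        i++;
    return i;
}
```

For the input `count1 = 0` the matched lexeme is `count1`; the matcher stops at the first character that is neither a letter nor a digit.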

Notational Shorthands

- One or more instances: r+ denotes rr*
- Zero or one instance: r? denotes r|ε
- Character classes: [a-z] denotes a|b|…|z

Regular definitions using these shorthands:

    digit             → [0-9]
    digits            → digit+
    optional_fraction → ( . digits )?
    optional_exponent → ( E ( + | - )? digits )?
    num               → digits optional_fraction optional_exponent
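To see how the optional parts of num behave, here is a small hand-written matcher for digits optional_fraction optional_exponent. It is a sketch, not the deck's implementation; the name `match_num` is hypothetical. Each optional part is committed only if it matches completely, so a trailing `.` or `E` is left in the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int is_digit_at(const char *s, size_t i) {
    return isdigit((unsigned char)s[i]);
}

/* Return the length of the longest prefix of s matching
   digit+ ( . digit+ )? ( E ( + | - )? digit+ )?  , or 0 if there is none. */
size_t match_num(const char *s) {
    size_t i = 0, j;
    assert(s != NULL);

    while (is_digit_at(s, i)) i++;            /* digits */
    if (i == 0) return 0;

    if (s[i] == '.' && is_digit_at(s, i + 1)) {   /* optional_fraction */
        i += 2;
        while (is_digit_at(s, i)) i++;
    }

    if (s[i] == 'E') {                         /* optional_exponent */
        j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (is_digit_at(s, j)) {
            while (is_digit_at(s, j)) j++;
            i = j;                             /* commit only a complete exponent */
        }
    }
    return i;
}
```

For instance, `6.02E+23` is matched in full, while for `12.` only `12` is taken because the fraction requires at least one digit after the point.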

Limitations

- Cannot describe balanced or nested constructs
  - Example: all valid strings of balanced parentheses
  - This can be done with a CFG
- Cannot describe repeated strings
  - Example: {wcw | w is a string of a’s and b’s}
  - This cannot be denoted with a CFG either!

Grammar Fragment (Pascal)

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | num

Related Regular Definitions

    if    → if
    then  → then
    else  → else
    relop → < | <= | = | <> | > | >=
    id    → letter ( letter | digit )*
    num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
    delim → blank | tab | newline
    ws    → delim+

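Tokens and Attributes

| Regular Expression | Token | Attribute Value |
|---|---|---|
| ws | - | - |
| if | if | - |
| then | then | - |
| else | else | - |
| id | id | pointer to entry |
| num | num | pointer to entry |
| `<` | relop | LT |
| `<=` | relop | LE |
| `=` | relop | EQ |
| `<>` | relop | NE |
| `>` | relop | GT |
| `>=` | relop | GE |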

Transition Diagrams

- A stylized flowchart
- Transition diagrams consist of states connected by edges
- Edges leaving a state s are labeled with input characters that may occur after reaching state s
- Assumed to be deterministic
- There is one start state and at least one accepting (final) state
- Some states may have associated actions
- At some final states, need to retract a character

Transition Diagram for “relop”
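A transition diagram for relop can be hand-coded roughly as below. This is a sketch under assumptions: the enum `Relop` and the function `match_relop` are hypothetical names, the attribute codes follow the LT/LE/EQ/NE/GT/GE values from the tokens-and-attributes table, and "retracting" is modeled by simply not counting the lookahead character in the lexeme length.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Recognize a relational operator at the start of s.
   Stores the lexeme length in *len (0 if no operator is found). */
Relop match_relop(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1;              /* retract: the lookahead is not part of the lexeme */
        return LT;
    case '=':
        *len = 1;
        return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1;              /* retract */
        return GT;
    default:
        *len = 0;
        return NOT_RELOP;
    }
}
```

The two "retract" branches correspond to accepting states that are reached only after reading one character too many.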

Identifiers and Keywords

- Share a transition diagram
- After reaching the accepting state, code determines if the lexeme is a keyword or an identifier
- Easier than encoding exceptions in the diagram
- A simple technique is to appropriately initialize the symbol table with keywords
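The "initialize the symbol table with keywords" technique might look like the sketch below. Everything here is illustrative: the fixed-size linear table, the token codes, and the names `init_keywords`, `lookup_or_install`, and `symbol_count` are assumptions, and a real lexer would more likely use a hash table.

```c
#include <assert.h>
#include <string.h>

enum { TOK_IF = 256, TOK_THEN, TOK_ELSE, TOK_ID };

#define MAX_SYMS 64
#define MAX_LEN  32

typedef struct { char lexeme[MAX_LEN]; int token; } Entry;

static Entry table[MAX_SYMS];
static int nsyms = 0;

/* Append an entry and return its index. */
int insert_symbol(const char *lexeme, int token) {
    assert(nsyms < MAX_SYMS && strlen(lexeme) < MAX_LEN);
    strcpy(table[nsyms].lexeme, lexeme);
    table[nsyms].token = token;
    return nsyms++;
}

/* Pre-load the keywords so they are found before any identifier is installed. */
void init_keywords(void) {
    nsyms = 0;
    insert_symbol("if", TOK_IF);
    insert_symbol("then", TOK_THEN);
    insert_symbol("else", TOK_ELSE);
}

/* Called after the shared diagram accepts a lexeme: return the stored token
   if the lexeme is already known (keyword or earlier identifier), otherwise
   install it as a new identifier. */
int lookup_or_install(const char *lexeme) {
    int i;
    for (i = 0; i < nsyms; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return table[i].token;
    return table[insert_symbol(lexeme, TOK_ID)].token;
}

int symbol_count(void) { return nsyms; }
```

Because keywords are ordinary table entries, the diagram needs no special cases: `then` comes back as `TOK_THEN`, while `count` is installed once and returned as `TOK_ID` on every later lookup.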

Numbers

Order of Transition Diagrams

- Transition diagrams are tested in order
- Diagrams with low-numbered start states are tried before diagrams with high-numbered start states
- Order influences the efficiency of the lexical analyzer

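Trying Transition Diagrams

    /* Advance `start` to the start state of the next transition diagram.
       `start`, recover(), and error() are defined elsewhere in the analyzer. */
    int next_td(void) {
        switch (start) {
        case 0:  start = 9;  break;
        case 9:  start = 12; break;
        case 12: start = 20; break;
        case 20: start = 25; break;
        case 25: recover();  break;   /* no diagram left to try */
        default: error("invalid start state");
        }
        /* Possibly additional actions here */
        return start;
    }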

Finding the Next Token

    token nexttoken(void) {
        while (1) {
            switch (state) {
            case 0:
                c = nextchar();
                if (c == ' ' || c == '\t' || c == '\n') {
                    /* Whitespace: stay in state 0 and move the lexeme start forward */
                    state = 0;
                    lexeme_beginning++;
                }
                else if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else state = next_td();   /* not a relop: try the next diagram */
                break;
            … /* 27 other cases here */
            }
        }
    }

The End of a Token

    token nexttoken(void) {
        while (1) {
            switch (state) {
            … /* First 19 cases */
            case 19:
                retract();        /* give back the extra character that was read */
                install_num();
                return(NUM);
            … /* Final 8 cases */
            }
        }
    }

Finite Automata

- Generalized transition diagrams that act as a “recognizer” for a language
- Can be nondeterministic (NFA) or deterministic (DFA)
- NFAs can have ε-transitions, DFAs cannot
- NFAs can have multiple edges with the same symbol leaving a state, DFAs cannot
- Both can recognize exactly what regular expressions can denote

NFAs

- A set of states S
- A set of input symbols Σ (input alphabet)
- A transition function move that maps (state, symbol) pairs to sets of states
- A single start state s0
- A set of accepting (or final) states F
- An NFA accepts a string s if and only if there exists a path from the start state to an accepting state such that the edge labels spell out s

Transition Tables

| State | Input symbol a | Input symbol b |
|---|---|---|
| 0 | {0,1} | {0} |
| 1 | --- | {2} |
| 2 | --- | {3} |
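The transition table above can be simulated by tracking the set of states the NFA could be in after each input symbol. The sketch below encodes state sets as bitmasks. Two things are assumptions not stated in the table itself: state 0 is taken as the start state and state 3 as the only accepting state (with no outgoing edges). The names `move_tbl` and `nfa_accepts` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSTATES 4

/* move_tbl[state][symbol] is a bitmask of next states; symbol 0 is 'a', 1 is 'b'. */
static const unsigned move_tbl[NSTATES][2] = {
    { (1u << 0) | (1u << 1), (1u << 0) },  /* state 0: a -> {0,1}, b -> {0} */
    { 0,                     (1u << 2) },  /* state 1: a -> ---,   b -> {2} */
    { 0,                     (1u << 3) },  /* state 2: a -> ---,   b -> {3} */
    { 0,                     0         },  /* state 3: assumed to have no outgoing edges */
};

/* Return true if some path from the start state to an accepting state spells s. */
bool nfa_accepts(const char *s) {
    unsigned current = 1u << 0;            /* assumed start state: 0 */
    int q, sym;
    assert(s != NULL);
    for (; *s; s++) {
        unsigned next = 0;
        if (*s != 'a' && *s != 'b')
            return false;                  /* symbol outside the alphabet */
        sym = (*s == 'b');
        for (q = 0; q < NSTATES; q++)      /* union of moves from every current state */
            if (current & (1u << q))
                next |= move_tbl[q][sym];
        current = next;
    }
    return (current & (1u << 3)) != 0;     /* assumed accepting set F = {3} */
}
```

Under these assumptions the automaton accepts exactly the strings of a's and b's that end in `abb`.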

DFAs

- No state has an ε-transition
- For each state s and input symbol a, there is at most one edge labeled a leaving s
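Because a DFA has at most one edge per state and symbol, simulating it needs only a single current state and one table lookup per input character. The sketch below is an illustration, not taken from the slides: the table `delta` is an assumed DFA for strings over {a, b} that end in `abb`, with start state 0 and accepting state 3, and the name `dfa_accepts` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* delta[state][symbol] is the single next state; symbol 0 is 'a', 1 is 'b'.
   Assumed DFA for strings ending in "abb" (not from the slides). */
static const int delta[4][2] = {
    { 1, 0 },   /* state 0: nothing useful seen yet */
    { 1, 2 },   /* state 1: seen "a"   */
    { 1, 3 },   /* state 2: seen "ab"  */
    { 1, 0 },   /* state 3: seen "abb" (accepting) */
};

bool dfa_accepts(const char *s) {
    int state = 0;                         /* start state */
    assert(s != NULL);
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b')
            return false;                  /* symbol outside the alphabet */
        state = delta[state][*s == 'b'];   /* exactly one edge to follow */
    }
    return state == 3;
}
```

Compared with tracking a set of states, as an NFA simulation must, each step here costs one array access.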