Chapter 3. Chang Chi-Chung, 2007.4.12.



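The Role of the Lexical Analyzer
[Figure: block diagram with the labels Source Program, Lexical Analyzer, Parser, Symbol Table, Token, getNextToken, and error. The parser issues getNextToken and the lexical analyzer answers with a token.]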

The Reason for Using the Lexical Analyzer
Simplifies the design of the compiler
- A parser that had to deal with comments and white space as syntactic units would be more complex.
- If lexical analysis were not separated from parsing, LL(1) or LR(1) parsing with one token of lookahead would not be possible, because several characters or tokens would have to be matched at once.
Improves compiler efficiency
- Systematic techniques exist for implementing lexical analyzers by hand or automatically from specifications.
- Stream buffering methods speed up scanning of the input.
Enhances compiler portability
- Input-device-specific peculiarities can be restricted to the lexical analyzer.

Lexical Analyzer
The lexical analyzer is divided into a cascade of two processes.
- Scanning: the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
- Lexical analysis proper: the more complex portion, which produces the sequence of tokens as output.

Tokens, Patterns, and Lexemes
Token
- A pair consisting of a token name and an optional attribute value.
- Example: num, id
Pattern
- A description of the form that the lexemes of a token may take.
- Example: "non-empty sequence of digits", "letter followed by letters and digits"
Lexeme
- A sequence of characters that matches the pattern for a token.
- Example: 123, abc

Examples: Tokens, Patterns, and Lexemes
Token        Pattern                                  Lexeme
if           characters i, f                          if
else         characters e, l, s, e                    else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14, 0, 6.23
literal      anything but ", surrounded by "s         "core dump"

An Example
For the input E = M * C ** 2, the lexical analyzer produces a sequence of pairs, each consisting of a token name and an optional attribute value.
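
The pairs for this statement can be sketched with a small regex-driven tokenizer. This is a minimal illustration, not the lexer from the slides: the token names (number, id, exp_op, mult_op, assign_op), the list-based symbol table, and the function name tokenize are all assumptions made for the example.

```python
import re

# Token names below are illustrative assumptions, not taken from the slides.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op", r"\*\*"),    # listed before mult_op so "**" is not split into two "*"
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (tokens, symtab); each token is a (name, attribute) pair."""
    symtab = []   # hypothetical symbol table: one entry per distinct identifier
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                                     # whitespace yields no token
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append((kind, symtab.index(lexeme)))  # attribute: table index
        elif kind == "number":
            tokens.append((kind, int(lexeme)))           # attribute: numeric value
        else:
            tokens.append((kind, None))                  # operators carry no attribute
    return tokens, symtab


tokens, symtab = tokenize("E = M * C ** 2")
for tok in tokens:
    print(tok)
```

Each identifier token carries an index into symtab as its attribute, standing in for the "pointer to table entry" used later in the slides.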

Input Buffering
[Figure: input buffer holding E = M * C ** 2 followed by an eof sentinel, with the pointers lexemeBegin and forward.]

Lookahead Code with Sentinels
    switch (*forward++) {
        case eof:
            if (forward is at end of first buffer) {
                reload second buffer;
                forward = beginning of second buffer;
            }
            else if (forward is at end of second buffer) {
                reload first buffer;
                forward = beginning of first buffer;
            }
            else /* eof within a buffer marks the end of input */
                terminate lexical analysis;
            break;
        cases for the other characters;
    }
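
The sentinel scheme above can be tried out with a runnable sketch. It is a simplified model under stated assumptions: the class name TwoBufferReader, the half-buffer size N = 4, and the use of "\0" as the eof sentinel are invented for the example, and the lexemeBegin pointer is left out.

```python
import io

EOF = "\0"   # assumed sentinel; it must never occur in the source text
N = 4        # tiny half-buffer size, chosen only to force reloads in the demo


class TwoBufferReader:
    def __init__(self, stream):
        self.stream = stream
        # Two halves of N characters, each followed by one sentinel slot.
        self.buf = [EOF] * (2 * (N + 1))
        self._load(0)
        self.forward = 0

    def _load(self, start):
        chunk = self.stream.read(N)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # A short read leaves a sentinel inside the half: the real end of input.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch                        # common case: a single test per character
        if self.forward == N + 1:            # sentinel at the end of the first half
            self._load(N + 1)
            return self.next_char()
        if self.forward == 2 * (N + 1):      # sentinel at the end of the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        self.forward -= 1                    # sentinel within a half: end of input
        return None


def read_all(text):
    reader = TwoBufferReader(io.StringIO(text))
    out = []
    ch = reader.next_char()
    while ch is not None:
        out.append(ch)
        ch = reader.next_char()
    return "".join(out)


print(read_all("E=M*C**2"))
```

The point of the sentinel is visible in next_char: ordinary characters cost one comparison, and the buffer-boundary checks run only when the sentinel is seen.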

Strings and Languages
Alphabet
- An alphabet Σ is a finite set of symbols (characters).
String
- A string is a finite sequence of symbols from Σ.
- |s| denotes the length of string s.
- ε denotes the empty string, thus |ε| = 0.
Language
- A language is a countable set of strings over some fixed alphabet Σ.
- Abstract languages: ∅ and {ε}.

String Operations
Concatenation
- The concatenation of two strings x and y is denoted by xy.
Identity
- The empty string is the identity under concatenation: εs = sε = s.
Exponentiation
- Define s^0 = ε and s^i = s^(i-1) s for i > 0.
- By definition, s^1 = s and s^2 = ss.

Language Operations
Union:            L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation:    LM = { xy | x ∈ L and y ∈ M }
Exponentiation:   L^0 = {ε},  L^i = L^(i-1) L
Kleene closure:   L* = ∪ (i = 0, …, ∞) L^i
Positive closure: L+ = ∪ (i = 1, …, ∞) L^i
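
These operations can be tried on small finite languages. The sketch below represents a language as a Python set of strings, with "" standing for ε. Because L* and L+ are infinite in general, the helper closure_up_to (a name invented here) only forms the union of the powers up to a chosen bound.

```python
def concat(L, M):
    """LM = { xy | x in L and y in M }."""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^0 = {ε}, L^i = L^(i-1) L."""
    result = {""}                 # the empty string plays the role of ε
    for _ in range(i):
        result = concat(result, L)
    return result


def closure_up_to(L, n, start=0):
    """Union of L^i for i = start..n: a finite slice of L* (start=0) or L+ (start=1)."""
    out = set()
    for i in range(start, n + 1):
        out |= power(L, i)
    return out


L = {"a", "b"}
M = {"0", "1"}
print(sorted(L | M))                    # union
print(sorted(concat(L, M)))             # concatenation
print(sorted(power(L, 2)))              # L^2
print(sorted(closure_up_to(L, 2)))      # strings of L* with length at most 2
print(sorted(closure_up_to(L, 2, 1)))   # strings of L+ with length at most 2
```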

Regular Expressions
- A convenient means of specifying certain simple sets of strings.
- We use regular expressions to define the structure of tokens.
- Tokens are built from symbols of a finite vocabulary.
Regular Sets
- The sets of strings defined by regular expressions.

Regular Expressions
Basis symbols:
- ε is a regular expression denoting the language L(ε) = {ε}
- a ∈ Σ is a regular expression denoting L(a) = {a}
If r and s are regular expressions denoting the languages L(r) and L(s) respectively, then
- r | s is a regular expression denoting L(r) ∪ L(s)
- rs is a regular expression denoting L(r)L(s)
- r* is a regular expression denoting (L(r))*
- (r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.

Operator Precedence
Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left

Algebraic Laws for Regular Expressions
Law                                      Description
r | s = s | r                            | is commutative
r | (s | t) = (r | s) | t                | is associative
r(st) = (rs)t                            concatenation is associative
r(s | t) = rs | rt; (s | t)r = sr | tr   concatenation distributes over |
εr = rε = r                              ε is the identity for concatenation
r* = (r | ε)*                            ε is guaranteed in a closure
r** = r*                                 * is idempotent

Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
    d_1 → r_1
    d_2 → r_2
    …
    d_n → r_n
- Each d_i is a new symbol, not in Σ and not the same as any other d.
- Each r_i is a regular expression over the alphabet Σ ∪ {d_1, d_2, …, d_(i-1)}.
Any d_j in r_i can be textually substituted in r_i to obtain an equivalent set of definitions.

Example: Regular Definitions
    letter_ → A | B | … | Z | a | b | … | z | _
    digit → 0 | 1 | … | 9
    id → letter_ ( letter_ | digit )*
Regular definitions are not recursive. For example, the following is wrong:
    digits → digit digits | digit

Extensions of Regular Definitions
One or more instances
- r+ = rr* = r*r
- r* = r+ | ε
Zero or one instance
- r? = r | ε
Character classes
- [a-z] = a | b | c | … | z
- [A-Za-z] = A | B | … | Z | a | … | z
Example
- digit → [0-9]
- num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
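
The num definition translates almost symbol for symbol into a pattern for Python's re module. A minimal sketch: the helper name is_num is made up for the example, and fullmatch is used so that the whole lexeme must fit the pattern.

```python
import re

# digit -> [0-9]
# num   -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def is_num(lexeme):
    """True if the whole lexeme matches the num definition."""
    return NUM.fullmatch(lexeme) is not None


for lexeme in ["3.14", "0", "6.02E23", "1E-5", "3.", ".5", "1E"]:
    print(lexeme, is_num(lexeme))
```

The last three inputs are rejected because the definition requires at least one digit on both sides of the dot and after E.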

Regular Definitions and Grammars
Context-Free Grammars
    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | num
Regular Definitions
    digit → [0-9]
    letter → [A-Za-z]
    if → if
    then → then
    else → else
    relop → < | <= | <> | > | >= | =
    id → letter ( letter | digit )*
    num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
    ws → ( blank | tab | newline )+

Lexemes      Token name   Attribute value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE

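Transition Diagrams
relop → < | <= | <> | > | >= | =
[Figure: transition diagram for relop. From the start state, < followed by = gives return(relop, LE), < followed by > gives return(relop, NE), and < followed by any other character gives return(relop, LT); = gives return(relop, EQ); > followed by = gives return(relop, GE), and > followed by any other character gives return(relop, GT). The two states reached on "other" are marked *, meaning the input is retracted by one character.]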

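Transition Diagrams
id → letter ( letter | digit )*
[Figure: transition diagram for id. Start state 9 moves on a letter to a state that loops on letter or digit; on any other character it moves to an accepting state marked *, which executes return(getToken(), installID()).]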

An Example: Implementation of RELOP
    TOKEN getRelop()
    {
        TOKEN retToken = new(RELOP);
        while (1) {
            switch (state) {
                case 0: c = nextChar();
                        if (c == '<') state = 1;
                        else if (c == '=') state = 5;
                        else if (c == '>') state = 6;
                        else fail();
                        break;
                case 1: …
                …
                case 8: retract();
                        retToken.attribute = GT;
                        return(retToken);
            }
        }
    }
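
A complete, runnable version of this recognizer can be sketched in Python. The cases the slide leaves out are filled in here from the relop attribute table (LT, LE, EQ, NE, GT, GE), so they are an assumption rather than the slide's own code. The function name get_relop and the convention of returning the next input position instead of calling retract() are also choices made for the example.

```python
def get_relop(text, pos=0):
    """Return ((token_name, attribute), next_pos), or None if no relop starts at pos."""
    state = 0
    while True:
        # The empty string stands for "other" once the input is exhausted.
        c = text[pos] if pos < len(text) else ""
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), pos + 1
            elif c == ">":
                state = 6
            else:
                return None                    # fail(): the lexeme is not a relop
            pos += 1
        elif state == 1:                       # "<" has been read
            if c == "=":
                return ("relop", "LE"), pos + 1
            if c == ">":
                return ("relop", "NE"), pos + 1
            return ("relop", "LT"), pos        # retract: c is not consumed
        elif state == 6:                       # ">" has been read
            if c == "=":
                return ("relop", "GE"), pos + 1
            return ("relop", "GT"), pos        # retract: c is not consumed


for sample in ["<", "<=", "<>", "=", ">", ">=", "<x", "a"]:
    print(repr(sample), get_relop(sample))
```

Returning pos unchanged in the "other" branches plays the role of retract(): the lookahead character stays available for the next token.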

Finite Automata
Finite automata are recognizers.
- An FA simply says "yes" or "no" about each possible input string.
- An FA can be used to recognize the tokens specified by a regular expression.
- FAs are used in the design of a lexical analyzer generator.
There are two kinds of finite automata:
- Nondeterministic finite automata (NFA)
- Deterministic finite automata (DFA)
Both DFAs and NFAs are capable of recognizing the same languages.

NFA Definitions
NFA = (S, Σ, δ, s_0, F)
- S: a finite set of states
- Σ: a set of input symbols (the input alphabet); ε is not in Σ
- δ: a transition function that maps a state and a symbol in Σ ∪ {ε} to a set of next states
- s_0: a special start state, s_0 ∈ S
- F: a set of final (accepting) states, F ⊆ S

Transition Graph for FA
[Figure: legend showing the symbols for a state, a transition, the start state, and a final state.]

Example
[Figure: transition graph with edges labeled a, b, c, c, and a.]
This machine accepts abccabc, but it rejects abcab. The language it accepts is (abc+)+.

Transition Table
The mapping δ of an NFA can be represented in a transition table.
[Figure: NFA with start state 0 and edges labeled a and b.]
STATE   a        b      ε
0       {0, 1}   {0}    -
1       -        {2}    -
2       -        {3}    -
δ(0, a) = {0, 1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}

DFA
A DFA is a special case of an NFA:
- There are no moves on input ε.
- For each state s and input symbol a, there is exactly one edge out of s labeled a.
Both DFAs and NFAs are capable of recognizing the same languages.

Simulating a DFA
Input
- An input string x terminated by an end-of-file character eof.
- A DFA D with start state s_0, accepting states F, and transition function move.
Output
- Answer "yes" if D accepts x; "no" otherwise.
    s = s_0;
    c = nextChar();
    while (c != eof) {
        s = move(s, c);
        c = nextChar();
    }
    if (s is in F) return "yes";
    else return "no";
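
The simulation loop above maps directly onto a table lookup. The sketch below runs it on the five-state DFA for (a | b)*abb, with states A to E and accepting state E, as tabulated in the subset-construction example later in these slides. The dictionary encoding of move and the name dfa_accepts are assumptions made for the example, and the end of the Python string replaces the explicit eof character.

```python
# Transition table of the DFA for (a|b)*abb; E is the only accepting state.
MOVE = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}


def dfa_accepts(x, move=MOVE, start="A", accepting=frozenset({"E"})):
    """Run the DFA over x; the end of the string plays the role of eof."""
    s = start
    for c in x:
        s = move[s][c]        # exactly one successor per state and symbol
    return s in accepting


for x in ["abb", "aabb", "babb", "ab", "abba", ""]:
    print(repr(x), "yes" if dfa_accepts(x) else "no")
```

Each input character costs one lookup, which is why the running time is proportional to the length of x.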

NFA vs DFA
Both automata recognize (a | b)*abb.
[Figure: an NFA with S = {0, 1, 2, 3}, Σ = {a, b}, s_0 = 0, F = {3}, shown next to a DFA for the same language.]

The Regular Language
The regular language defined by an NFA is the set of input strings it accepts.
- Example: (a | b)*abb for the example NFA.
An NFA accepts an input string x if and only if
- there is some path in the transition graph from the start state to some accepting state whose edge labels, in sequence, spell out x.
- A transition from one state to another on the path is called a move.

Theorem
The following are equivalent:
- Regular expression
- NFA
- DFA
- Regular language
- Regular grammar

Conversion Concept
Regular Expression → Nondeterministic Finite Automaton → Deterministic Finite Automaton → Minimized Deterministic Finite Automaton

Construction of an NFA from a Regular Expression
Use Thompson's construction.
[Figure: NFA fragments that combine N(s) and N(t) with ε-transitions for s | t, st, and s*, together with the basis fragments for a symbol a and for ε.]

Example: ( a | b )* a b b
[Figure: parse tree for ( a | b )* a b b with subexpressions r_1 through r_11; the slide notes r_3 = r_4.]

Example: ( a | b )* a b b
[Figure: NFA with ε-transitions for ( a | b )* a b b, starting at state 0.]

Conversion of an NFA to a DFA
The subset construction algorithm converts an NFA into a DFA using the following operations.
Operation      Description
ε-closure(s)   Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)   Set of NFA states reachable from some NFA state s in set T on ε-transitions alone; equal to the union of ε-closure(s) over all s in T.
move(T, a)     Set of NFA states to which there is a transition on input symbol a from some state s in T.

Subset Construction (1)
    Initially, ε-closure(s_0) is the only state in Dstates and it is unmarked;
    while (there is an unmarked state T in Dstates) {
        mark T;
        for (each input symbol a ∈ Σ) {
            U = ε-closure(move(T, a));
            if (U is not in Dstates)
                add U as an unmarked state to Dstates;
            Dtran[T, a] = U;
        }
    }

Computing ε-closure(T)

Example: ( a | b )* a b b
[Figure: the NFA with states 0 to 10 and the DFA with states A to E obtained from it.]
NFA states           DFA state   a   b
{0,1,2,4,7}          A           B   C
{1,2,3,4,6,7,8}      B           B   D
{1,2,4,5,6,7}        C           B   C
{1,2,4,5,6,7,9}      D           B   E
{1,2,4,5,6,7,10}     E           B   C
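
The ε-closure computation and the subset construction can be sketched together and run on this example. The NFA figure did not survive in the transcript, so the ε- and symbol-transitions below are an assumption: they encode the usual Thompson-style NFA for (a | b)*abb with states 0 to 10, chosen to be consistent with the state sets in the table. The function and variable names are likewise illustrative.

```python
def eps_closure(T, eps):
    """All NFA states reachable from the states in T on ε-transitions alone."""
    stack = list(T)
    closure = set(T)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)


def move(T, a, delta):
    """NFA states reachable on symbol a from some state in T."""
    return {u for s in T for u in delta.get((s, a), ())}


def subset_construction(start, alphabet, delta, eps):
    d0 = eps_closure({start}, eps)
    dstates = [d0]      # DFA states in order of discovery
    unmarked = [d0]
    dtran = {}
    while unmarked:
        T = unmarked.pop(0)                      # mark T
        for a in alphabet:
            U = eps_closure(move(T, a, delta), eps)
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran


# Assumed NFA for (a|b)*abb: ε-edges and labeled edges, start state 0, final state 10.
EPS = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}
DELTA = {(2, "a"): [3], (4, "b"): [5], (7, "a"): [8], (8, "b"): [9], (9, "b"): [10]}

dstates, dtran = subset_construction(0, "ab", DELTA, EPS)
names = {state: "ABCDE"[i] for i, state in enumerate(dstates)}
for state in dstates:
    print(names[state], sorted(state),
          names[dtran[(state, "a")]], names[dtran[(state, "b")]])
```

With this NFA the construction discovers five DFA states; the one containing NFA state 10 is the accepting state.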

Example 2
[Figure: NFA with start state 0 and ε-transitions into sub-automata for the patterns a, abb, and a*b+, shown with the DFA obtained from it.]
Dstates:
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}

Simulation of an NFA
Input
- An input string x terminated by an end-of-file character eof.
- An NFA N with start state s_0, accepting states F, and transition function move.
Output
- Answer "yes" if N accepts x; "no" otherwise.
    S = ε-closure(s_0);
    c = nextChar();
    while (c != eof) {
        S = ε-closure(move(S, c));
        c = nextChar();
    }
    if (S ∩ F != ∅) return "yes";
    else return "no";
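
A runnable sketch of this simulation, tried on the four-state NFA for (a | b)*abb from the Transition Table slide (δ(0, a) = {0, 1}, δ(0, b) = {0}, δ(1, b) = {2}, δ(2, b) = {3}, F = {3}). That NFA has no ε-transitions, so the ε table is empty here. The dictionary encoding and the function names are assumptions made for the example.

```python
def eps_closure(T, eps):
    """All NFA states reachable from the states in T on ε-transitions alone."""
    stack = list(T)
    closure = set(T)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure


def move(S, c, delta):
    """NFA states reachable on symbol c from some state in S."""
    return {u for s in S for u in delta.get((s, c), ())}


def nfa_accepts(x, start, accepting, delta, eps):
    S = eps_closure({start}, eps)
    for c in x:                                   # the end of the string plays the role of eof
        S = eps_closure(move(S, c, delta), eps)   # step the whole set of current states
    return bool(S & accepting)                    # S ∩ F != ∅


DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
EPS = {}          # this particular NFA has no ε-moves
F = {3}

for x in ["abb", "aabb", "babb", "ab", "abba"]:
    print(repr(x), "yes" if nfa_accepts(x, 0, F, DELTA, EPS) else "no")
```

The simulation tracks a set of states rather than a single state, which is the same idea the subset construction applies ahead of time.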

Minimizing the DFA
Step 1
- Start with an initial partition Π with two groups: F and S − F (the accepting and nonaccepting states).
Step 2
- Apply the split procedure to obtain Π_new.
Step 3
- If Π_new = Π, let Π_final = Π and continue with step 4; otherwise let Π = Π_new and go to step 2.
Step 4
- Construct the minimum-state DFA with one state for each group of Π_final.
- Delete the dead state.

Split Procedure
    Initially, let Π_new = Π;
    for (each group G of Π) {
        partition G into subgroups such that two states s and t are in the same subgroup
        if and only if, for all input symbols a, states s and t have transitions on a
        to states in the same group of Π;
        /* at worst, a state will be in a subgroup by itself */
        replace G in Π_new by the set of all subgroups formed;
    }

Example
Initially there are two sets: {1, 2, 3, 5, 6} and {4, 7}.
- {1, 2, 3, 5, 6} splits into {1, 2, 5} and {3, 6} on c.
- {1, 2, 5} splits into {1} and {2, 5} on b.

Minimizing the DFA
Major operation: partition the states into equivalence classes according to
- final / non-final states
- transition functions
(A B C D E)
(A B C D) (E)
(A B C) (D) (E)
(A C) (B) (D) (E)
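
The refinement sequence above can be reproduced with a short sketch of the partitioning step. It assumes the five-state DFA for (a | b)*abb with states A to E and accepting state E, encoded as a dictionary. The function name minimize_partition is invented for the example, and the final steps (building the reduced DFA and deleting a dead state) are left out.

```python
MOVE = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}


def minimize_partition(states, alphabet, move, accepting):
    """Refine {F, S - F} until no group can be split any further."""
    accepting = frozenset(accepting)
    partition = [g for g in (accepting, frozenset(states) - accepting) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for g in partition:
            # States stay together only if, for every symbol, they move into the same group.
            subgroups = {}
            for s in sorted(g):
                key = tuple(group_of[move[s][a]] for a in alphabet)
                subgroups.setdefault(key, set()).add(s)
            new_partition.extend(frozenset(sub) for sub in subgroups.values())
        if len(new_partition) == len(partition):   # nothing was split: Π_new = Π
            return new_partition
        partition = new_partition


final = minimize_partition("ABCDE", "ab", MOVE, {"E"})
print(sorted(sorted(g) for g in final))
```

Since a refinement step can only split groups, comparing the number of groups is enough to detect that the partition has stopped changing.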

Important States of an NFA
The "important states" of an NFA are those with a non-ε out-transition, that is,
- if move({s}, a) ≠ ∅ for some a, then s is an important state.
The subset construction algorithm uses only the important states when it determines ε-closure(move(T, a)).
Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#.

Converting a RE Directly to a DFA
- Construct a syntax tree for (r)#.
- Traverse the tree to compute the functions nullable, firstpos, lastpos, and followpos.
- Construct the DFA D by Algorithm 3.62.

Functions Computed From the Syntax Tree
nullable(n)
- True if the subtree at node n generates a language that includes the empty string.
firstpos(n)
- The set of positions that can match the first symbol of a string generated by the subtree at node n.
lastpos(n)
- The set of positions that can match the last symbol of a string generated by the subtree at node n.
followpos(i)
- The set of positions that can follow position i in the tree.

Rules for Computing the Functions
n is a leaf labeled ε
- nullable(n) = true
- firstpos(n) = ∅
- lastpos(n) = ∅
n is a leaf with position i
- nullable(n) = false
- firstpos(n) = {i}
- lastpos(n) = {i}
n = c_1 | c_2
- nullable(n) = nullable(c_1) or nullable(c_2)
- firstpos(n) = firstpos(c_1) ∪ firstpos(c_2)
- lastpos(n) = lastpos(c_1) ∪ lastpos(c_2)
n = c_1 c_2
- nullable(n) = nullable(c_1) and nullable(c_2)
- firstpos(n) = if nullable(c_1) then firstpos(c_1) ∪ firstpos(c_2) else firstpos(c_1)
- lastpos(n) = if nullable(c_2) then lastpos(c_1) ∪ lastpos(c_2) else lastpos(c_2)
n = c_1*
- nullable(n) = true
- firstpos(n) = firstpos(c_1)
- lastpos(n) = lastpos(c_1)

Computing followpos
    for (each node n in the tree) {
        /* n is a cat-node with left child c1 and right child c2 */
        if (n == c1 . c2)
            for (each i in lastpos(c1))
                followpos(i) = followpos(i) ∪ firstpos(c2);
        else if (n is a star-node)
            for (each i in lastpos(n))
                followpos(i) = followpos(i) ∪ firstpos(n);
    }

Converting a RE Directly to a DFA
    Initialize Dstates to contain only the unmarked state firstpos(n_0),
    where n_0 is the root of the syntax tree T for (r)#;
    while (there is an unmarked state S in Dstates) {
        mark S;
        for (each input symbol a ∈ Σ) {
            let U be the union of followpos(p) for all p in S that correspond to a;
            if (U is not in Dstates)
                add U as an unmarked state to Dstates;
            Dtran[S, a] = U;
        }
    }

Example: ( a | b )* a b b #
[Figure: syntax tree for ( a | b )* a b b # with numbered leaf positions; node n is the cat-node for ( a | b )* a.]
For n = ( a | b )* a:
- nullable(n) = false
- firstpos(n) = {1, 2, 3}
- lastpos(n) = {3}
- followpos(1) = {1, 2, 3}

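Example: ( a | b )* a b b #
[Figure: syntax tree annotated with firstpos and lastpos at each node.]
- The leaves a, b, a, b, b, # have positions 1 to 6, with firstpos = lastpos = {i} for the leaf at position i.
- The | node and the * node have firstpos = lastpos = {1, 2}.
- The cat-nodes, from the lowest up to the root, all have firstpos = {1, 2, 3}; their lastpos sets are {3}, {4}, {5}, and {6}.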

Example: ( a | b )* a b b #
Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
[Figure: DFA with states {1,2,3}, {1,2,3,4}, {1,2,3,5}, and {1,2,3,6}, with transitions on a and b.]
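
The whole direct construction for this example can be sketched in a few dozen lines. The tuple encoding of syntax-tree nodes, the helper names analyze and build_dfa, and the omission of ε-leaves (the example has none) are assumptions made for the illustration, not part of the slides.

```python
# Nodes: ("leaf", symbol, position), ("or", left, right), ("cat", left, right), ("star", child)

def analyze(n, followpos):
    """Return (nullable, firstpos, lastpos) for node n and fill in followpos."""
    kind = n[0]
    if kind == "leaf":
        i = n[2]
        followpos.setdefault(i, set())
        return False, {i}, {i}
    if kind == "or":
        n1, f1, l1 = analyze(n[1], followpos)
        n2, f2, l2 = analyze(n[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(n[1], followpos)
        n2, f2, l2 = analyze(n[2], followpos)
        for i in l1:                      # positions ending c1 are followed by firstpos(c2)
            followpos[i] |= f2
        return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)
    if kind == "star":
        _, f1, l1 = analyze(n[1], followpos)
        for i in l1:                      # a star lets its last positions loop back
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind}")


def build_dfa(tree, symbol_at, alphabet):
    followpos = {}
    _, first, _ = analyze(tree, followpos)
    start = frozenset(first)
    dstates, unmarked, dtran = [start], [start], {}
    while unmarked:
        S = unmarked.pop(0)               # mark S
        for a in alphabet:
            U = frozenset(q for p in S if symbol_at[p] == a for q in followpos[p])
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(S, a)] = U
    return dstates, dtran, followpos


def leaf(sym, pos):
    return ("leaf", sym, pos)


# Syntax tree for (a|b)*abb# with leaf positions 1..6.
TREE = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", leaf("a", 1), leaf("b", 2))),
        leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}

dstates, dtran, followpos = build_dfa(TREE, SYMBOL_AT, "ab")
for i in sorted(followpos):
    print(i, sorted(followpos[i]))
for S in dstates:
    print(sorted(S), sorted(dtran[(S, "a")]), sorted(dtran[(S, "b")]))
```

A DFA state is accepting when it contains the position of the end marker #, which is position 6 in this tree.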

Time and Space Complexity
Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)