Lexical Analysis (2 Lectures). CSE244 Compilers 2 Overview Basic Concepts Regular Expressions –Language Lexical analysis by hand Regular Languages Tools.

Presentation transcript:

Lexical Analysis (2 Lectures)

CSE244 Compilers 2 Overview
Basic Concepts
Regular Expressions
  –Language
Lexical analysis by hand
Regular Languages
Tools
  –NFA
  –DFA
Scanning tools
  –Lex / Flex / JFlex / ANTLR

CSE244 Compilers 3 Scanning Perspective
Purpose
  –Transform a stream of symbols
  –Into a stream of tokens

CSE244 Compilers 4 Lexical Analyzer Responsibilities
Lexical analyzer [Scanner]
  –Scan input
  –Remove white spaces
  –Remove comments
  –Manufacture tokens
  –Generate lexical errors
  –Pass token to parser

CSE244 Compilers 5 Modular design
Rationale
  –Separate the two analyses
      High cohesion / Low coupling
  –Improve efficiency
  –Improve portability / maintainability
  –Enable integration of third-party lexers [lexer = lexical analysis tool]

CSE244 Compilers 6 Terminology
Token
  –A classification for a common set of strings
  –Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, ...
Pattern
  –The rules that characterize the set of strings for a token
  –Examples: [0-9]+
Lexeme
  –Actual sequence of characters that matches a pattern and has a given Token class.
  –Examples:
      Identifier: Name, Data, x
      Integer: 345, 2, 0, 629, ...
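The three terms can be seen side by side in a small sketch. The class name TokenDemo and the identifier pattern are assumptions made for illustration; only the integer pattern [0-9]+ comes from the slide.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: token classes, their patterns, and sample lexemes.
final class TokenDemo {
    // Pattern for the Integer token, taken from the slide.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // Assumed pattern for the Identifier token: a letter, then letters or digits.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    // Returns the token class whose pattern matches the whole lexeme.
    static String classify(String lexeme) {
        if (INTEGER.matcher(lexeme).matches()) return "Integer";
        if (IDENTIFIER.matcher(lexeme).matches()) return "Identifier";
        return "Unknown";
    }

    public static void main(String[] args) {
        // Each string below is a lexeme; classify reports its token class.
        for (String lexeme : new String[] {"Name", "Data", "x", "345", "0"}) {
            System.out.println(lexeme + " -> " + classify(lexeme));
        }
    }
}
```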

CSE244 Compilers 7 Examples

CSE244 Compilers 8 Lexical Errors
Error handling is very localized with respect to the input source
Example: fi(a==f(x)) … generates no lexical error in C (fi is a valid identifier)
In what situations do errors occur?
  –Prefix of remaining input doesn’t match any defined token
Possible error recovery actions:
  –Deleting or inserting input characters
  –Replacing or transposing characters
  –Or, skip over to next separator to ignore problem

CSE244 Compilers 9 Basic Scanning technique
Use 1 character of look-ahead
  –Obtain char with getc()
Do a case analysis
  –Based on lookahead char
  –Based on current lexeme
Outcome
  –If char can extend lexeme, all is well, go on.
  –If char cannot extend lexeme:
      Figure out what the complete lexeme is and return its token
      Put the lookahead back into the symbol stream
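A minimal sketch of this technique, assuming the only token is an integer and using Java's PushbackReader in place of getc(). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LookaheadDemo {
    // Reads one integer lexeme using a single character of lookahead.
    static String readInteger(PushbackReader in) throws IOException {
        StringBuilder lexeme = new StringBuilder();
        while (true) {
            int la = in.read();                  // obtain the lookahead character
            if (la >= '0' && la <= '9') {
                lexeme.append((char) la);        // char extends the lexeme: go on
            } else {
                if (la != -1) in.unread(la);     // put the lookahead back
                return lexeme.toString();        // the lexeme is complete
            }
        }
    }

    // Convenience wrapper: returns the lexeme and the next character still in the stream.
    static String[] scan(String input) {
        try (PushbackReader in = new PushbackReader(new StringReader(input))) {
            String lexeme = readInteger(in);
            int next = in.read();
            return new String[] {lexeme, next == -1 ? "" : String.valueOf((char) next)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = scan("629+x");
        System.out.println("lexeme: " + result[0] + ", next char: " + result[1]);
    }
}
```

Because the lookahead is pushed back, the character that ended the integer is still available to start the next token.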

CSE244 Compilers 10 Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.

Alphabet                     Language
{0,1}                        {0,10,100,1000,10000,…}
                             {0,1,100,000,111,…}
{a,b,c}                      {abc,aabbcc,aaabbbccc,…}
{A…Z}                        {TEE,FORE,BALL,…}
                             {FOR,WHILE,GOTO,…}
{A…Z,a…z,0…9,+,-,…,,…}       {All legal PASCAL progs}
                             {All grammatically correct English sentences}

Special Languages:
  Φ – the empty language
  {ε} – contains the empty string ε only

CSE244 Compilers 11 Formal Language Operations

CSE244 Compilers 12 Examples

CSE244 Compilers 13 Regular Languages
All examples above are
  –Quite expressive
  –Simple languages
But also...
  –Belong to a special class: regular languages
A regular expression is a set of rules / techniques for constructing sequences of symbols (strings) from an alphabet.
Let Σ be an alphabet and r a regular expression. Then L(r) is the language that is characterized by the rules of r.

CSE244 Compilers 14 Rules
Fix an alphabet Σ.
ε is a regular expression denoting {ε}
If a is in Σ, a is a regular expression that denotes {a}
Let r and s be R.E. for L(r) and L(s). Then
  (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
  (b) (r)(s) is a regular expression denoting L(r) L(s)
  (c) (r)* is a regular expression denoting (L(r))*
  (d) (r) is a regular expression denoting L(r)
All are left-associative. Parentheses are dropped as allowed by precedence.
Precedence: * binds tightest, then concatenation, then |

CSE244 Compilers 15 Example revisited

CSE244 Compilers 16 Algebraic Properties

CSE244 Compilers 17 More Examples
All strings that start with “tab” or end with “bat”:
  tab{A,…,Z,a,…,z}*|{A,…,Z,a,…,z}*bat
All strings in which {1,2,3} exist in ascending order:
  {A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*
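The two expressions can be tried out with java.util.regex. This is a sketch that assumes the set notation {A,…,Z} maps onto the character class [A-Z]; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class MoreExamplesDemo {
    // Starts with "tab" or ends with "bat", over the letters A-Z and a-z.
    static final Pattern TAB_OR_BAT = Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");
    // The digits 1, 2, 3 appear once each, in ascending order, among capital letters.
    static final Pattern ASCENDING = Pattern.compile("[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*");

    static boolean startsTabOrEndsBat(String s) {
        return TAB_OR_BAT.matcher(s).matches();   // matches() tests the whole string
    }

    static boolean ascending123(String s) {
        return ASCENDING.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsTabOrEndsBat("table"));
        System.out.println(startsTabOrEndsBat("acrobat"));
        System.out.println(ascending123("A1B2C3"));
    }
}
```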

CSE244 Compilers 18 Tokens as R.E.
[Figure not recoverable; only the operators “+” and “?” survive]

CSE244 Compilers 19 Tokens as Patterns Patterns are ??? Tokens are ???

CSE244 Compilers 20 Throw Away Tokens
Fact
  –Some languages define tokens that carry no meaning
  –Example: in C, whitespace, tabs, carriage returns, and comments can be discarded without affecting the program’s meaning.

CSE244 Compilers 21 Automaton A tool to specify a token

CSE244 Compilers 22 A More Complex Automaton

CSE244 Compilers 23 Two More...

CSE244 Compilers 24 What about keywords ?
Easy!
  –Use the “Identifier” token
  –After a match, look up the keyword table
      If found, return a token for the matched keyword
      If not, return a token for the true identifier
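The lookup step might look like the sketch below. The keyword table contents and the token names are assumptions chosen for illustration.

```java
import java.util.Map;

final class KeywordDemo {
    // Hypothetical keyword table: lexeme -> keyword token name.
    static final Map<String, String> KEYWORDS =
        Map.of("if", "IF", "while", "WHILE", "for", "FOR");

    // Called after the Identifier automaton has matched a lexeme.
    static String tokenFor(String lexeme) {
        // If found, return the keyword token; if not, it is a true identifier.
        return KEYWORDS.getOrDefault(lexeme, "IDENTIFIER");
    }

    public static void main(String[] args) {
        System.out.println(tokenFor("while"));
        System.out.println(tokenFor("fi"));
    }
}
```

With this scheme the scanner needs no separate automaton per keyword, and a misspelling such as fi simply comes back as an identifier.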

CSE244 Compilers 25 Yes... But how to scan?
Remember the algorithm?
  –Acquire 1 character of lookahead
  –Case analysis based
      On lookahead
      On state of automaton

CSE244 Compilers 26 Scanner code

class Scanner {
    InputStream _in;
    char _la;        // The lookahead character
    char[] _window;  // lexeme window
    int _state;      // current state of the automaton

    Token nextToken() {
        startLexeme();  // reset window at start
        _state = 0;
        while (true) {
            switch (_state) {
                case 0: {
                    _la = getChar();
                    if (_la == '<') _state = 1;
                    else if (_la == '=') _state = 5;
                    else if (_la == '>') _state = 6;
                    else failure(_state);
                } break;
                // states 1 to 5 are not shown on the slide
                case 6: {
                    _la = getChar();
                    if (_la == '=') _state = 7;
                    else _state = 8;
                } break;
                case 7:
                    return new Token(GEQUAL);
                case 8:
                    pushBack(_la);  // the lookahead belongs to the next token
                    return new Token(GREATER);
            }
        }
    }
}

CSE244 Compilers 27 Handling Failures
Meaning
  –The automaton for this token failed
Solution
  –If another automaton is available
      “Rewind” the input to the beginning of last lexeme
      Jump to start state of next automaton
      Start recognizing again
  –If no other automaton
      This is a true lexical error.
      Discard lexeme (or at least first char of lexeme)
      Start from state 0 again

CSE244 Compilers 28 Overview
Basic Concepts
Regular Expressions
  –Language
Lexical analysis by hand
Regular Languages
Tools
  –NFA / DFA
Scanning with DFAs
Scanning tools
  –Lex / Flex / JFlex

CSE244 Compilers 29 Automata & Language Theory
Terminology
  –FSA
      A recognizer that takes an input string and determines whether it’s a valid string of the language.
  –Non-Deterministic FSA (NFA)
      Has several alternative actions for the same input symbol
  –Deterministic FSA (DFA)
      Has at most 1 action for any given input symbol
Bottom Line
  –expressive power(NFA) == expressive power(DFA)
  –Conversion can be automated

CSE244 Compilers 30 NFA
An NFA is a mathematical model that consists of:
  S, a set of states
  Σ, the symbols of the input alphabet
  move, a transition function
      move(state, symbol) → set of states
      move : S × (Σ ∪ {ε}) → Pow(S)
  s0 ∈ S, the start state
  F ⊆ S, a set of final or accepting states

CSE244 Compilers 31 Representing NFA
Transition Diagrams:
  –Number states (circles), arcs, final states, …
Transition Tables:
  –More suitable to representation within a computer
We’ll see examples of both !

CSE244 Compilers 32 Example NFA
S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }
[Transition diagram: start → 0; 0 loops on a and b; 0 –a→ 1; 1 –b→ 2; 2 –b→ 3 (final)]
What language is defined ?
What is the transition table ?

          input
state     a          b
0         { 0, 1 }   { 0 }
1         --         { 2 }
2         --         { 3 }

ε (null) moves possible: i –ε→ j
Switch state but do not use any input symbol

CSE244 Compilers 33 Epsilon-Transitions Given the regular expression : (a (b*c)) | (a (b | c+)?) Find a transition diagram NFA that recognizes it. Solution ?

CSE244 Compilers 34 NFA Construction Automatic construction example a(b*c) a(b|c+)? Build a Disjunction

CSE244 Compilers 35 Resulting NFA

CSE244 Compilers 36 Working NFA
[Transition diagram: the example NFA with states 0 to 3]
Given an input string, we trace moves
If no more input & in final state, ACCEPT
EXAMPLE: Input: ababb
  move(0, a) = 1
  move(1, b) = 2
  move(2, a) = ? (undefined)
  REJECT !
-OR-
  move(0, a) = 0
  move(0, b) = 0
  move(0, a) = 1
  move(1, b) = 2
  move(2, b) = 3
  ACCEPT !

CSE244 Compilers 37 Handling Undefined Transitions
We can handle undefined transitions by defining one more state, a “death” state, and redirecting every previously undefined transition to this death state.
[Transition diagram: the example NFA extended with death state 4, which loops to itself on a, b]

CSE244 Compilers 38 Worse still...
Not all paths result in acceptance!
[Transition diagram: the example NFA with states 0 to 3]
aabb is accepted along path: 0 → 0 → 1 → 2 → 3
BUT… it is not accepted along the valid path: 0 → 0 → 0 → 0 → 0

CSE244 Compilers 39 The NFA “Problem”
Two problems
  –Valid input may not be accepted
  –Non-deterministic behavior from run to run...
Solution?

CSE244 Compilers 40 The DFA Saves the Day
A DFA is an NFA with a few restrictions
  –No epsilon transitions
  –For every state s, there is only one transition (s,x) from s for any symbol x in Σ
Corollaries
  –Easy to implement a DFA with an algorithm!
  –Deterministic behavior
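A table-driven simulation shows why a DFA is easy to implement. The table below is a hand-built DFA, assumed for this sketch, that accepts the same strings as the example NFA (strings over {a, b} ending in abb); the class name is hypothetical.

```java
final class DfaDemo {
    // TRANSITION[state][symbol]: column 0 is 'a', column 1 is 'b'.
    static final int[][] TRANSITION = {
        {1, 0},  // state 0: nothing useful seen yet
        {1, 2},  // state 1: seen "a"
        {1, 3},  // state 2: seen "ab"
        {1, 0},  // state 3: seen "abb" (accepting)
    };
    static final int ACCEPTING = 3;

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;  // symbol outside the alphabet
            state = TRANSITION[state][c - 'a'];      // exactly one move per symbol
        }
        return state == ACCEPTING;
    }

    public static void main(String[] args) {
        System.out.println(accepts("ababb"));
        System.out.println(accepts("abab"));
    }
}
```

Each input symbol costs one array lookup, and the same input always follows the same path.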

CSE244 Compilers 41 NFA vs. DFA
NFA
  –Smaller number of states |Q_nfa|
  –Simulating it requires a |Q_nfa| computation for each input symbol.
DFA
  –Larger number of states |Q_dfa|
  –Simulating it requires a constant computation for each input symbol.
Caveat: for the generic NFA => DFA construction, |Q_dfa| ~ 2^|Q_nfa|
But: DFAs are perfectly optimizable! (i.e., you can find the smallest possible |Q_dfa|)
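One way to simulate an NFA directly is to track the set of all states it could be in, which is where the per-symbol cost that grows with the number of NFA states comes from. The sketch below hard-codes the example NFA (S = {0, 1, 2, 3}, start 0, final 3); the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class NfaDemo {
    // move(state, symbol) for the example NFA; an empty set means "undefined".
    static Set<Integer> move(int state, char symbol) {
        Set<Integer> next = new HashSet<>();
        if (state == 0 && symbol == 'a') { next.add(0); next.add(1); }
        if (state == 0 && symbol == 'b') next.add(0);
        if (state == 1 && symbol == 'b') next.add(2);
        if (state == 2 && symbol == 'b') next.add(3);
        return next;
    }

    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);                              // start state
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                next.addAll(move(s, c));             // follow every alternative at once
            }
            current = next;
        }
        return current.contains(3);                  // accept if some path ends in a final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("ababb"));
        System.out.println(accepts("abab"));
    }
}
```

Following all paths together removes the run-to-run variation: the input is accepted exactly when at least one path reaches the final state.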

CSE244 Compilers 42 One catch... NFA-DFA comparison

CSE244 Compilers 43 NFA to DFA Conversion
Idea
  –Look at the states reachable without consuming any input
  –Aggregate them in macro states

CSE244 Compilers 44 Final Result
A macro state is final
  –IFF one of the NFA states it contains was final

CSE244 Compilers 45 Preliminary Definitions
NFA N = ( S, Σ, s0, F, MOVE )
ε-Closure(s) : s ∈ S
  –Set of states in S that are reachable from s via ε-moves of N that originate from s.
ε-Closure(T) : T ⊆ S
  –NFA states reachable from some t ∈ T on ε-moves only.
move(T,a) : T ⊆ S, a ∈ Σ
  –Set of states to which there is a transition on input a from some t ∈ T

CSE244 Compilers 46 Algorithm
Computing the ε-closure

forall (t in T) push(t);
initialize ε-closure(T) to T;
while stack is not empty do begin
    t = pop();
    for each u ∈ S with edge t→u labeled ε
        if u is not in ε-closure(T) then begin
            add u to ε-closure(T);
            push u onto stack
        end
end
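The same stack-based algorithm can be sketched in Java. The representation of ε-edges as a map from a state to its ε-successors is an assumption of this example, as are the class and method names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class EpsilonClosureDemo {
    // epsilonEdges maps a state to the states reachable from it by one ε-move.
    static Set<Integer> closure(Set<Integer> t, Map<Integer, List<Integer>> epsilonEdges) {
        Deque<Integer> stack = new ArrayDeque<>(t);   // push every state of T
        Set<Integer> result = new HashSet<>(t);       // initialize the closure to T
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : epsilonEdges.getOrDefault(s, List.of())) {
                if (result.add(u)) {                  // add returns true only for a new state
                    stack.push(u);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical ε-edges: 0 -> 1, 0 -> 2, 1 -> 3, 3 -> 0 (a cycle).
        Map<Integer, List<Integer>> edges =
            Map.of(0, List.of(1, 2), 1, List.of(3), 3, List.of(0));
        System.out.println(closure(Set.of(0), edges));
    }
}
```

Checking membership before pushing is what guarantees termination when the ε-edges form a cycle.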

CSE244 Compilers 47 DFA construction
Computing
  –The set of states
  –The transitions

let Q = ε-closure(s0);
D = { Q };
enQueue(Q);
while queue not empty do
    X = deQueue();
    for each a ∈ Σ do
        Y := ε-closure(move(X,a));
        T[X,a] := Y;
        if Y is not in D then
            D = D ∪ { Y };
            enQueue(Y)
        end
    end
end
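A compact Java sketch of this construction follows. The nested-map NFA representation, the use of the character '\0' to stand for ε, and all names are assumptions of the example; the NFA it builds is the four-state example from the earlier slides.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstructionDemo {
    static final char EPS = '\0';  // stands for an ε label (assumption of this sketch)

    // Example NFA: delta.get(state).get(symbol) is the set of next states.
    static Map<Integer, Map<Character, Set<Integer>>> exampleNfa() {
        Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
        delta.put(0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)));
        delta.put(1, Map.of('b', Set.of(2)));
        delta.put(2, Map.of('b', Set.of(3)));
        return delta;
    }

    // move(T, a): states reachable from some t in T on symbol a.
    static Set<Integer> move(Map<Integer, Map<Character, Set<Integer>>> delta,
                             Set<Integer> t, char a) {
        Set<Integer> out = new TreeSet<>();
        for (int s : t) {
            out.addAll(delta.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        }
        return out;
    }

    static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> delta,
                                Set<Integer> t) {
        Set<Integer> result = new TreeSet<>(t);
        Deque<Integer> stack = new ArrayDeque<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : move(delta, Set.of(s), EPS)) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    // Returns T: each DFA state (a set of NFA states) mapped to its transitions.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            Map<Integer, Map<Character, Set<Integer>>> delta, int start, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> table = new LinkedHashMap<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> q = closure(delta, Set.of(start));
        table.put(q, new HashMap<>());
        queue.add(q);
        while (!queue.isEmpty()) {
            Set<Integer> x = queue.remove();
            for (char a : alphabet) {
                Set<Integer> y = closure(delta, move(delta, x, a));
                table.get(x).put(a, y);               // T[X,a] := Y
                if (!table.containsKey(y)) {          // Y is a new DFA state
                    table.put(y, new HashMap<>());
                    queue.add(y);
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        build(exampleNfa(), 0, new char[] {'a', 'b'})
            .forEach((state, moves) -> System.out.println(state + " " + moves));
    }
}
```

The keys of the returned table play the role of D in the pseudocode, so a separate set of discovered states is not needed.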

CSE244 Compilers 48 Summary
We can
  –Specify tokens with R.E.
  –Use DFA to scan an input and recognize token
  –Transform an NFA into a DFA automatically
What we are missing
  –A way to transform an R.E. into an NFA
Then, we will have a complete solution
  –Build a big R.E.
  –Turn the R.E. into an NFA
  –Turn the NFA into a DFA
  –Scan with the obtained DFA

CSE244 Compilers 49 R.E. To NFA
Process
  –Inductive definition
      Use the structure of the R.E.
      Use atomic automata for atomic R.E.
      Use composition rules for each R.E. expression
Recall
  RE ::= ε
     ::= x in Σ
     ::= rs
     ::= r | s
     ::= r*

CSE244 Compilers 50 Epsilon Construction RE::= ε

CSE244 Compilers 51 Symbol Construction RE::= x in Σ

CSE244 Compilers 52 Chaining Construction RE::= rs

CSE244 Compilers 53 Branching Construction RE::= r | s

CSE244 Compilers 54 Kleene-Closure Construction RE::= r*

CSE244 Compilers 55 NFA Construction Example
R.E.
  –(ab*c) | (a(b|c*))
Parse Tree:
[Figure: parse tree of the R.E. with subexpressions labeled r0 to r13 and r13 at the root]

CSE244 Compilers 56 NFA Construction Example 2
r3: a
r0: b
r2: c
r1: r0* (that is, b*)
r4: r1 r2
r5: r3 r4
[Figure: NFA fragments for each subexpression, joined by ε-moves]

CSE244 Compilers 57 NFA Construction Example 3
r11: a
r7: b
r6: c
r8: r6* (that is, c*)
r9: r7 | r8
r10: r9
r12: r11 r10
[Figure: NFA fragments for each subexpression, joined by ε-moves]

CSE244 Compilers 58 NFA Construction Example 4
r13: r5 | r12
[Figure: the complete NFA, with ε-moves from a new start state into the fragments for r5 and r12]
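The construction rules can be sketched as a small builder that is then applied to the example (ab*c) | (a(b|c*)). The slide diagrams are not recoverable, so the fragment shapes here follow the usual Thompson-style construction and may differ in detail from the slides; in particular, concatenation joins two fragments with an extra ε-move rather than merging states. All names are hypothetical, and '\0' stands for ε.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ThompsonDemo {
    static final char EPS = '\0';                   // stands for an ε label
    final List<int[]> edges = new ArrayList<>();    // each edge is {from, label, to}
    int states = 0;

    // An NFA fragment with one start state and one accepting state.
    static final class Frag {
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    void link(int from, int to) { edges.add(new int[] {from, EPS, to}); }

    Frag symbol(char x) {                           // RE ::= x in Σ (or ε when x is EPS)
        Frag f = new Frag(states++, states++);
        edges.add(new int[] {f.start, x, f.accept});
        return f;
    }

    Frag concat(Frag r, Frag s) {                   // RE ::= rs
        link(r.accept, s.start);
        return new Frag(r.start, s.accept);
    }

    Frag union(Frag r, Frag s) {                    // RE ::= r | s
        Frag f = new Frag(states++, states++);
        link(f.start, r.start); link(f.start, s.start);
        link(r.accept, f.accept); link(s.accept, f.accept);
        return f;
    }

    Frag star(Frag r) {                             // RE ::= r*
        Frag f = new Frag(states++, states++);
        link(f.start, r.start); link(f.start, f.accept);
        link(r.accept, r.start); link(r.accept, f.accept);
        return f;
    }

    Set<Integer> closure(Set<Integer> t) {          // repeat until no ε-edge adds a state
        Set<Integer> result = new HashSet<>(t);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : edges) {
                if (e[1] == EPS && result.contains(e[0]) && result.add(e[2])) changed = true;
            }
        }
        return result;
    }

    boolean accepts(Frag nfa, String input) {       // assumes input never contains '\0'
        Set<Integer> current = closure(Set.of(nfa.start));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int[] e : edges) {
                if (e[1] == c && current.contains(e[0])) next.add(e[2]);
            }
            current = closure(next);
        }
        return current.contains(nfa.accept);
    }

    // Builds the NFA for (ab*c) | (a(b|c*)) and runs it on the input.
    static boolean matches(String input) {
        ThompsonDemo b = new ThompsonDemo();
        Frag r5 = b.concat(b.symbol('a'), b.concat(b.star(b.symbol('b')), b.symbol('c')));
        Frag r12 = b.concat(b.symbol('a'), b.union(b.symbol('b'), b.star(b.symbol('c'))));
        return b.accepts(b.union(r5, r12), input);
    }

    public static void main(String[] args) {
        System.out.println(matches("abbc"));
        System.out.println(matches("abcc"));
    }
}
```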

CSE244 Compilers 59 Overall Summary
How does this all fit together ?
  –Reg. Expr. → NFA construction
  –NFA → DFA conversion
  –DFA simulation for lexical analyzer
Recall Lex Structure
  –Pattern Action
  –… …
Each pattern recognizes lexemes
Each pattern described by regular expression
[Figure: a start state with ε-moves into the NFAs for (abc)*ab, (a | b)*abb, etc., forming one recognizer]

CSE244 Compilers 60 Moral?
All of this can be automated with a tool!
  –LEX: the first lexical analyzer tool for C
  –FLEX: a newer/faster implementation, C / C++ friendly
  –JFLEX: a lexer for Java, based on the same principles
  –JavaCC
  –ANTLR

CSE244 Compilers 61 Ahead...
Grammars
Parsing
  –Bottom Up
  –Top Down