Lexical Analysis (Dragon Book, Chapter 3)


Compiler structure
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program.
The symbol table and the error handling are shared by all phases.

Compiler structure
The syntax analyzer sends a "get next token" request to the lexical analyzer; the lexical analyzer reads the source program and returns the next token. Both phases use the symbol table and the error handler.

Tokens in programming languages
Token   Sample instances                          Description
if      if                                        keyword
rel     <, <=, <>, >=, >                          relation
id      count, length, point2                     variable
num     3.1415927, 7, 145e-3                      numerical constant
str     "abc", "some space", "\7\" is a char"     constant string

Tokens may be difficult to recognize
Fortran: DO 5 I=1.25 versus DO 5 I=1,25 (spaces do not count, so the scanner cannot tell the assignment from the loop header until it sees the '.' or the ',').
PL/I: IF THEN THEN THEN=ELSE; ELSE ELSE=THEN; (there are no reserved keywords).
PL/I: PR1(2, 7, 18, D*3, 175.14)=3 (procedure call or array reference?).

Strings, languages
A string is a sequence of characters over some alphabet, e.g., 0100110 over {0, 1}. In computers the alphabet is usually ASCII or EBCDIC.
Length of a string: the number of characters in it.
Empty string: ε (length 0).
Concatenation: putting one string after another. X=dog, Y=house, XY=doghouse (also written X.Y).
Prefix: ban is a prefix of banana. Suffix: ana is a suffix of banana.

Language: a set of strings
The alphabet is a language: L = {A, B, …, Z, a, b, …, z}.
Constant languages: X = {ab, ba}, Y = {a}.
Concatenation: X.Y = {aba, baa}, Y.X = {aab, aba}.
Union: X ∪ Y = X + Y = X | Y = {ab, ba, a}.
Exponentiation: X³ = X.X.X.
Star: X* = zero or more occurrences. L* = all words with letters from L; L+ = all words with one or more letters from L.

Regular expressions
X | Y = X ∪ Y = { s | s ∈ X or s ∈ Y }.
X.Y = { x.y | x ∈ X and y ∈ Y }.
X* = X⁰ ∪ X¹ ∪ X² ∪ … (the union of Xⁱ for all i ≥ 0).
X+ = X¹ ∪ X² ∪ … (the union of Xⁱ for all i ≥ 1).

Examples
a|b = {a, b}.
(a|b).(a|b) = {aa, ab, ba, bb}.
a* = {ε, a, aa, aaa, …}.
(a|b)* = {ε, a, b, ab, ba, aa, aba, …}.

Defining tokens
digit → [0-9]
digits → digit+
fraction → . digits | ε
exponent → E ( + | - | ε ) digits | ε
const → digits fraction exponent
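As a rough sketch, these definitions could be written almost verbatim in a Lex/Flex specification; the action and message below are illustrative, not part of the slides:

```lex
digit       [0-9]
digits      {digit}+
fraction    ("."{digits})?
exponent    (E("+"|"-")?{digits})?
%%
{digits}{fraction}{exponent}   { printf("const: %s\n", yytext); }
[ \t\n]+                       { /* skip white space */ }
.                              { printf("other: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { return yylex(); }
```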

Not everything is regular!
All the words of the form w c w, where w is a word and c is a letter.
The syntax of a program, e.g., the recursive definition of if-then-else:
stmt → if expr then stmt else stmt.

Reading the input
If a>8 then goto nextloop else begin while z>8 do …
The scanner keeps two positions in the input: where the current token starts, and the last character read.
It sometimes needs to look ahead, for example to identify the variable done (rather than the keyword do), and may then need to "unread" a character.
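A minimal C sketch of lookahead and "unreading" using getchar/ungetc on standard input (real scanners usually manage their own buffer and pointers); the > versus >= decision is just an illustration:

```c
#include <stdio.h>

/* Peek at the next character without consuming it. */
static int peek(void)
{
    int c = getchar();
    if (c != EOF)
        ungetc(c, stdin);      /* "unread" the character */
    return c;
}

int main(void)
{
    /* Decide between the tokens ">" and ">=" by looking one character ahead. */
    int c = getchar();
    if (c == '>') {
        if (peek() == '=') { getchar(); printf("op: >=\n"); }
        else               { printf("op: >\n"); }
    }
    return 0;
}
```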

Returning: token + attributes
For the input if xyz > 11 then the scanner returns:
if: keyword
xyz: id, value = xyz
>: op, value = ">"
11: const, value = 11
then: keyword
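One possible way to represent "token + attributes" in C; the names and field sizes here are illustrative, not taken from the slides:

```c
#include <stdio.h>

enum token_kind { TK_IF, TK_THEN, TK_ID, TK_OP, TK_CONST };

struct token {
    enum token_kind kind;
    union {
        char name[32];   /* TK_ID: the lexeme, e.g. "xyz"        */
        char op;         /* TK_OP: the operator, e.g. '>'        */
        long value;      /* TK_CONST: the numeric value, e.g. 11 */
    } attr;              /* keywords such as "if" carry no attribute */
};

int main(void)
{
    struct token t = { .kind = TK_CONST, .attr.value = 11 };
    printf("kind=%d value=%ld\n", t.kind, t.attr.value);
    return 0;
}
```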

Finite Automata
A finite automaton includes:
States: {s1, s2, …, s5}.
Initial states: {s1}.
Accepting states: {s3, s5}.
Alphabet: {a, b, c}.
Transitions: {(s1, a, s2), (s2, a, s3), …}.
[Diagram: a five-state automaton over {a, b, c}.] Is it deterministic?

Automaton: what is the language?
[Diagram: a two-state automaton over {a, b} with states s0 and s1; see the runs on the next slide.]
Formally: an input is a word over the alphabet Σ. A run over a word is an alternating sequence of states and letters, starting from the initial state. An accepting run ends with an accepting state.

Example
[The same automaton: s0 is initial and loops on a; b moves to s1, which loops on b; a moves back from s1 to s0; s1 is accepting.]
Input: aabbb. Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts.
Input: aba. Run: s0 a s0 b s1 a s0. Does not accept.
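A small C sketch that simulates the runs above, assuming (as the two runs suggest) that s0 is the initial state and s1 is the only accepting state:

```c
#include <stdio.h>

/* Transition table reconstructed from the runs: on 'a' go to s0, on 'b' go to s1,
   from either state. */
static const int delta[2][2] = {
    /* s0: a, b */ { 0, 1 },
    /* s1: a, b */ { 0, 1 },
};

static int accepts(const char *word)
{
    int state = 0;                              /* start in s0 */
    for (const char *p = word; *p; p++) {
        if (*p != 'a' && *p != 'b') return 0;   /* not in the alphabet */
        state = delta[state][*p == 'b'];
    }
    return state == 1;                          /* accepting run ends in s1 */
}

int main(void)
{
    printf("aabbb: %s\n", accepts("aabbb") ? "accepts" : "does not accept");
    printf("aba:   %s\n", accepts("aba")   ? "accepts" : "does not accept");
    return 0;
}
```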

Automaton: what is the language?
[Diagram: another two-state automaton over {a, b}.]

Automaton: what is the language?
[Diagram: a third two-state automaton over {a, b}.]

Identifying tokens
[Diagram: a transition diagram that spells out the keywords IF, THEN and ELSE, and recognizes identifiers of the form letter (letter|digit)*.]
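The diagram itself is not recoverable here, but a common alternative to hard-wiring it is sketched below in C: scan a lexeme of the form letter (letter|digit)* and then look it up in a keyword table. The function names and the lowercase keywords are illustrative:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "if", "then", "else" };

static const char *classify(const char *lexeme)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return "keyword";
    return "id";
}

/* Read one word of the form letter (letter|digit)* from standard input. */
static int scan_word(char *buf, size_t size)
{
    int c = getchar();
    while (c != EOF && !isalpha(c))          /* skip anything before the word */
        c = getchar();
    if (c == EOF) return 0;

    size_t n = 0;
    while (c != EOF && (isalpha(c) || isdigit(c))) {
        if (n + 1 < size) buf[n++] = (char)c;
        c = getchar();
    }
    buf[n] = '\0';
    if (c != EOF) ungetc(c, stdin);          /* put back the lookahead character */
    return 1;
}

int main(void)
{
    char lexeme[64];
    while (scan_word(lexeme, sizeof lexeme))
        printf("%s: %s\n", classify(lexeme), lexeme);
    return 0;
}
```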

Nondeterministic automata
[Diagram: an automaton over {0, 1}; s0 loops on 0 and 1, and a path labelled 1, 0, 0 leads to the accepting state s3.]
A nondeterministic automaton:
allows more than a single transition from a state with the same label;
does not need a transition from every state with every label;
allows multiple initial states;
allows ε-transitions.

Nondeterministic runs
[The same automaton: s0 loops on 0 and 1; s0 –1→ s1 –0→ s2 –0→ s3, and s3 is accepting.]
Input: 0100.
Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept.
Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts.
The word is accepted when there exists an accepting run.
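A C sketch of "accepts when there exists an accepting run": instead of guessing a run, it tracks the whole set of reachable states as a bit mask. The transition table is reconstructed from the two runs above (s0 loops on 0 and 1, and the path s0 –1→ s1 –0→ s2 –0→ s3 leads to the accepting state s3):

```c
#include <stdio.h>

#define NSTATES 4   /* s0..s3; s3 is the accepting state */

/* delta[q][c] = bit mask of states reachable from q on symbol c (bit i = state si) */
static const unsigned delta[NSTATES][2] = {
    /* s0 */ { 1u << 0, (1u << 0) | (1u << 1) },   /* on 0: stay; on 1: stay or move to s1 */
    /* s1 */ { 1u << 2, 0 },                       /* on 0: move to s2 */
    /* s2 */ { 1u << 3, 0 },                       /* on 0: move to s3 */
    /* s3 */ { 0, 0 },
};

static int accepts(const char *word)
{
    unsigned current = 1u << 0;                    /* reachable states, initially {s0} */
    for (const char *p = word; *p; p++) {
        int c = *p - '0';
        if (c < 0 || c > 1) return 0;              /* not in the alphabet */
        unsigned next = 0;
        for (int q = 0; q < NSTATES; q++)
            if (current & (1u << q))
                next |= delta[q][c];
        current = next;
    }
    return (current & (1u << 3)) != 0;             /* some run reached s3 */
}

int main(void)
{
    printf("0100: %s\n", accepts("0100") ? "accepts" : "does not accept");
    printf("0110: %s\n", accepts("0110") ? "accepts" : "does not accept");
    return 0;
}
```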

Determinizing automata
Each state of the deterministic automaton D is a set of states of the nondeterministic automaton N.
S –a→ T when T = { t | s ∈ S and s –a→ t }.
The initial state of D is the set of all initial states of N.
Accepting states of D are those that include at least one accepting state of N.
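The same set-tracking idea, run offline, gives the subset construction itself. The C sketch below enumerates every subset reachable from {s0} for the automaton of the previous slides; each printed set (bit i = state si) is one state of D:

```c
#include <stdio.h>

#define NSTATES 4   /* NFA states s0..s3 from the previous slides */
#define NSYMS   2   /* alphabet {0, 1} */

static const unsigned delta[NSTATES][NSYMS] = {
    /* s0 */ { 1u << 0, (1u << 0) | (1u << 1) },
    /* s1 */ { 1u << 2, 0 },
    /* s2 */ { 1u << 3, 0 },
    /* s3 */ { 0, 0 },
};

/* T = { t | s in S and s --c--> t } */
static unsigned move(unsigned S, int c)
{
    unsigned T = 0;
    for (int q = 0; q < NSTATES; q++)
        if (S & (1u << q))
            T |= delta[q][c];
    return T;
}

int main(void)
{
    unsigned worklist[1 << NSTATES];
    int seen[1 << NSTATES] = { 0 };
    int n = 1;

    worklist[0] = 1u << 0;        /* initial DFA state: the set {s0} */
    seen[1u << 0] = 1;

    /* Every subset reachable from {s0} becomes one state of D. */
    for (int i = 0; i < n; i++)
        for (int c = 0; c < NSYMS; c++) {
            unsigned T = move(worklist[i], c);
            printf("%#x --%d--> %#x%s\n", worklist[i], c, T,
                   (T & (1u << 3)) ? "  (accepting)" : "");
            if (!seen[T]) { seen[T] = 1; worklist[n++] = T; }
        }

    printf("%d reachable DFA states\n", n);
    return 0;
}
```

For this NFA it finds the four states {s0}, {s0,s1}, {s0,s2} and {s0,s3}, matching the next slide.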

Determinization
[Diagram: applying the construction to the automaton above gives the deterministic states {s0}, {s0, s1}, {s0, s2} and {s0, s3}, with {s0, s3} accepting.]

Determinization
[Diagram: in the worst case the deterministic automaton needs a state for every subset of the nondeterministic states, shown here as the bit vectors 000, 001, …, 111.]

Translating regular expressions into automata
[Diagrams: the standard ε-transition constructions for union L1 | L2, concatenation L1.L2, and star L*.]

Automatic translation
(a|b).(a|b) = (a ∪ b)(a ∪ b) = (a+b).(a+b) = …
[Diagram: the ε-NFA obtained by composing the constructions above.]
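A C sketch of the construction, building NFA fragments with ε-edges for a single symbol, concatenation, union and star, and composing them for the example (a|b).(a|b). The fragment representation (at most two outgoing edges per state) is an assumption of this sketch, not taken from the slides:

```c
#include <stdio.h>
#include <stdlib.h>

#define EPS (-1)   /* edge label used for epsilon edges */

/* Each state has at most two outgoing edges; an edge exists only when its
   target pointer is non-NULL. */
typedef struct State {
    int c1, c2;                /* edge labels: a character or EPS */
    struct State *o1, *o2;     /* edge targets                    */
} State;

typedef struct { State *start, *accept; } Frag;

static State *new_state(void) { return calloc(1, sizeof(State)); }

/* Single character c:  start --c--> accept */
static Frag sym(int c)
{
    Frag f = { new_state(), new_state() };
    f.start->c1 = c;  f.start->o1 = f.accept;
    return f;
}

/* Concatenation a.b: an epsilon edge links a's accept state to b's start state. */
static Frag cat(Frag a, Frag b)
{
    a.accept->c1 = EPS;  a.accept->o1 = b.start;
    return (Frag){ a.start, b.accept };
}

/* Union a|b: a new start state branches to both fragments; both accept states
   lead to a new accept state (all by epsilon edges). */
static Frag alt(Frag a, Frag b)
{
    Frag f = { new_state(), new_state() };
    f.start->c1 = EPS;  f.start->o1 = a.start;
    f.start->c2 = EPS;  f.start->o2 = b.start;
    a.accept->c1 = EPS; a.accept->o1 = f.accept;
    b.accept->c1 = EPS; b.accept->o1 = f.accept;
    return f;
}

/* Star a*: skip the fragment or repeat it, again with epsilon edges. */
static Frag star(Frag a)
{
    Frag f = { new_state(), new_state() };
    f.start->c1 = EPS;   f.start->o1 = a.start;
    f.start->c2 = EPS;   f.start->o2 = f.accept;
    a.accept->c1 = EPS;  a.accept->o1 = a.start;
    a.accept->c2 = EPS;  a.accept->o2 = f.accept;
    return f;
}

int main(void)
{
    /* (a|b).(a|b), the example from the slide */
    Frag nfa = cat(alt(sym('a'), sym('b')), alt(sym('a'), sym('b')));
    printf("(a|b).(a|b): start=%p accept=%p\n",
           (void *)nfa.start, (void *)nfa.accept);

    Frag rep = star(alt(sym('a'), sym('b')));   /* (a|b)*, for good measure */
    printf("(a|b)*:      start=%p accept=%p\n",
           (void *)rep.start, (void *)rep.accept);
    return 0;
}
```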

Determinization with ε-transitions
Add to each set the states reachable using ε-transitions.
[Diagram: for the ε-NFA above, the resulting deterministic states are {s0, s1, s2}, {s3, s5, s6, s7, s8}, {s4, s5, s6, s7, s8}, {s9, s11} and {s10, s11}, connected by a- and b-transitions.]
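The extra step is the ε-closure. A C sketch over a small, made-up ε-NFA (the slide's twelve-state automaton is not reproduced here):

```c
#include <stdio.h>

#define N 6   /* states s0..s5 of a small, made-up epsilon-NFA */

/* eps[q] = bit mask of states reachable from q by a single epsilon edge (bit i = state si) */
static const unsigned eps[N] = {
    /* s0 */ (1u << 1) | (1u << 2),
    /* s1 */ (1u << 3),
    /* s2 */ 0,
    /* s3 */ (1u << 5),
    /* s4 */ (1u << 2),
    /* s5 */ 0,
};

/* Add to the set every state reachable using epsilon edges. */
static unsigned closure(unsigned set)
{
    unsigned result = set;
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int q = 0; q < N; q++)
            if ((result & (1u << q)) && (eps[q] & ~result)) {
                result |= eps[q];
                changed = 1;
            }
    }
    return result;
}

int main(void)
{
    printf("closure({s0}) = %#x\n", closure(1u << 0));   /* expect {s0,s1,s2,s3,s5} = 0x2f */
    return 0;
}
```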

Minimization
[Diagram: a five-state automaton p0–p4 over {a, b}.]
1. Group all the states together.
2. Separate states according to their available exit transitions.
3. Split a set in two if, from some of its states, one can reach another given set while from the others one cannot. Repeat until no set can be split.

Minimization
[Diagram: step 1 on the automaton p0–p4.]
Group all the states together.

Minimization
[Diagram: step 2 on the automaton p0–p4.]
Separate states according to their available exit transitions.

Minimization
[Diagram: step 3 on the automaton p0–p4.]
Split a set in two if, from some of its states, one can reach another given set while from the others one cannot. Repeat until no set can be split.

Can minimize now
[Diagram: the minimized automaton over {a, b}, with one state per set of the final partition.]
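A C sketch of the refinement procedure from the previous slides. The slides' diagram is not recoverable, so the five-state transition table below is made up, and the sketch uses the common variant that first splits accepting from non-accepting states rather than splitting by available exit transitions:

```c
#include <stdio.h>
#include <string.h>

#define N 5   /* states p0..p4 */
#define S 2   /* symbols a (column 0) and b (column 1) */

/* Illustrative DFA: accepts words over {a,b} ending in "ab";
   p3 and p4 are deliberately redundant copies of p1 and p0. */
static const int delta[N][S] = {
    /* p0 */ { 1, 4 },
    /* p1 */ { 3, 2 },
    /* p2 */ { 1, 0 },
    /* p3 */ { 1, 2 },
    /* p4 */ { 3, 0 },
};
static const int accepting[N] = { 0, 0, 1, 0, 0 };

int main(void)
{
    int cls[N], next[N];

    /* Initial split: accepting vs. non-accepting states. */
    for (int q = 0; q < N; q++) cls[q] = accepting[q];

    /* Refine: two states stay in the same set only if, for every symbol,
       their successors are currently in the same set.  Repeat until stable. */
    for (;;) {
        int nclasses = 0, changed = 0;
        for (int q = 0; q < N; q++) next[q] = -1;
        for (int q = 0; q < N; q++) {
            if (next[q] != -1) continue;
            next[q] = nclasses;
            for (int r = q + 1; r < N; r++) {
                if (next[r] != -1 || cls[r] != cls[q]) continue;
                int same = 1;
                for (int c = 0; c < S; c++)
                    if (cls[delta[q][c]] != cls[delta[r][c]]) { same = 0; break; }
                if (same) next[r] = nclasses;
            }
            nclasses++;
        }
        for (int q = 0; q < N; q++) if (next[q] != cls[q]) changed = 1;
        memcpy(cls, next, sizeof cls);
        if (!changed) break;
    }

    for (int q = 0; q < N; q++)
        printf("p%d is in set %d\n", q, cls[q]);
    return 0;
}
```

For this table the result is three sets: {p0, p4}, {p1, p3} and {p2}.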

Lex
A Lex source file has three sections:
Declarations
%%
Translation rules
%%
Auxiliary procedures
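A minimal, self-contained Lex source file following that layout (the token set and messages are illustrative):

```lex
%{
#include <stdio.h>
%}
letter  [A-Za-z]
digit   [0-9]
%%
if|then|else                  { printf("keyword: %s\n", yytext); }
{letter}({letter}|{digit})*   { printf("id: %s\n", yytext); }
{digit}+                      { printf("num: %s\n", yytext); }
[ \t\n]+                      { /* skip white space */ }
.                             { printf("op: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { return yylex(); }
```

It can be built roughly with lex scanner.l and then cc lex.yy.c -o scanner (or flex instead of lex); since main and yywrap are supplied, no scanner library is needed. Note that "if" matches both the keyword rule and the identifier rule with the same length, and Lex resolves the tie in favor of the earlier rule.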

Lex behavior
The Lex source program lex.l is given to Lex, which produces lex.yy.c; a C compiler turns lex.yy.c into a.out; running a.out on an input stream produces the output tokens.

Lex behavior
Lex translates the definitions into an automaton.
The automaton looks for the longest matching string.
A rule either returns some value to the calling program (the parser) or falls through, and the scanner looks for the next token.
Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
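A tiny illustration of the lookahead (trailing context) operator; the rules are made up:

```lex
%%
ab/c        { printf("token ab (only because c follows)\n"); }
[abc]       { printf("single char: %s\n", yytext); }
[ \t\n]+    { /* skip white space */ }
%%
int yywrap(void) { return 1; }
int main(void)   { return yylex(); }
```

On the input abc the first rule fires with yytext equal to "ab" and the c is scanned again by the next rule; on abd the first rule does not match at all.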

Lex project
Project collection date: Feb 11th. Work in pairs (or alone).
Use Lex to read a text and check whether the number of opening parentheses of any kind equals the number of closing parentheses.
Exception: parentheses inside quotes do not count, and \" is not a closing quote.
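One possible starting point, not a complete solution; it uses Flex's exclusive start conditions (%x) to ignore quoted text, and the bracket classes and messages are illustrative:

```lex
%{
#include <stdio.h>
/* Count opening and closing brackets of any kind, ignoring quoted text. */
int open_count = 0, close_count = 0;
%}
%x QUOTED
%%
\"              { BEGIN(QUOTED);   /* enter a quoted string */ }
<QUOTED>\\\"    { /* \" inside quotes is not a closing quote */ }
<QUOTED>\"      { BEGIN(INITIAL);  /* the real closing quote */ }
<QUOTED>.       { /* ignore quoted text */ }
<QUOTED>\n      { /* ignore newlines inside quotes too */ }
[([{]           { open_count++; }
[)\]}]          { close_count++; }
.|\n            { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main(void)
{
    yylex();
    printf(open_count == close_count ? "balanced\n" : "not balanced\n");
    return 0;
}
```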