Lexical Analysis. Textbook: Modern Compiler Design, Chapter 2.1

A motivating example: create a program that counts the number of lines in a given input text file.

Solution

%{
int num_lines = 0;
%}
%%
\n      ++num_lines;
.       ;
%%
main()
{
    yylex();
    printf("# of lines = %d\n", num_lines);
}

Solution (the same specification, illustrated as an automaton: from the initial state, a newline \n edge increments the counter and every other character loops back to the initial state)
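To try the solution out, a typical Flex build-and-run sequence would be the following (the file name count.l is only an assumed name; -lfl links Flex's support library, which supplies the default yywrap):

flex count.l          # generates lex.yy.c
cc lex.yy.c -lfl      # compile the generated scanner
./a.out < input.txt   # prints the number of lines in input.txt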

Outline
– Roles of lexical analysis
– What is a token
– Regular expressions and regular descriptions
– Lexical analysis
– Automatic creation of lexical analyzers
– Error handling

Basic Compiler Phases (figure): the front-end takes the source program (a string) through lexical analysis to tokens, syntax analysis to an abstract syntax tree, and semantic analysis to an annotated abstract syntax tree; the back-end then produces the final assembly.

Example Tokens

Type     Examples
ID       foo  n_14  last
NUM
REAL     e67  5.5e-10
IF       if
COMMA    ,
NOTEQ    !=
LPAREN   (
RPAREN   )

Example Non-Tokens

Type                    Examples
comment                 /* ignored */
preprocessor directive  #include   #define NUMS 5, 6
macro                   NUMS
whitespace              \t \n \b

Example

void match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0. ;
}

Token stream:
VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE
IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
RETURN REAL(0.0) SEMI
RBRACE EOF

Lexical Analysis (Scanning)
Input: program text (file)
Output: sequence of tokens
– Read input file
– Identify language keywords and standard identifiers
– Handle include files and macros
– Count line numbers
– Remove whitespace
– Report illegal symbols
– [Produce symbol table]

Why Lexical Analysis?
– Simplifies the syntax analysis (and the language definition)
– Modularity
– Reusability
– Efficiency

What is a token?
– Defined by the programming language
– Can be separated by spaces
– Smallest units
– Defined by regular expressions

A simplified scanner for C

Token nextToken()
{
    char c;
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;
    case ';': return SemiColumn;
    case '+':
        c = getchar();
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); return Plus;
        }
    case '<': /* ... */
    case 'w': /* ... */
    }
}
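The '<' and 'w' cases are left open above; as a sketch, the '<' case could be completed in the same style as '+', reusing the LessOrEqual and LessThen token names that appear in the Flex specification later in these slides:

    case '<':
        c = getchar();
        if (c == '=') return LessOrEqual;   /* matched "<=" */
        ungetc(c, stdin);                   /* push the lookahead back */
        return LessThen;                    /* plain "<" */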

Regular Expressions

Basic patterns   Matching
x                The character x
.                Any character except newline
[xyz]            Any of the characters x, y, z
R?               An optional R
R*               Zero or more occurrences of R
R+               One or more occurrences of R
R1R2             R1 followed by R2
R1|R2            Either R1 or R2
(R)              R itself
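As a small worked example of combining these operators (the pattern itself is an illustration, not taken from the slides), an optionally signed integer literal can be described by:

[+-]?[0-9]+

Here [+-] matches either sign character, ? makes the sign optional, and + requires at least one digit.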

Escape characters in regular expressions
– \ converts a single operator into text: a\+   (a\+\*)+
– Double quotes surround text: "a+*"+
– Aesthetically ugly, but standard

Regular Descriptions
EBNF where non-terminals are fully defined before first use:
letter            →  [a-zA-Z]
digit             →  [0-9]
underscore        →  _
letter_or_digit   →  letter | digit
underscored_tail  →  underscore letter_or_digit+
identifier        →  letter letter_or_digit* underscored_tail*
A token description consists of a token name and a regular expression.
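As a sketch, this regular description could be transcribed almost directly into Flex named definitions; the rule and its Id action below are illustrative, not part of the slides:

letter            [a-zA-Z]
digit             [0-9]
letter_or_digit   ({letter}|{digit})
underscored_tail  (_{letter_or_digit}+)
%%
{letter}{letter_or_digit}*{underscored_tail}*   { return Id; }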

The Lexical Analysis Problem
Given:
– A set of token descriptions
– An input string
Partition the string into tokens (class, value)
Ambiguity resolution (illustrated below):
– Prefer the longest matching token
– Between two matching tokens of equal length, select the one declared first
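A small illustration of both rules, as a Flex fragment (the token names If and Id here are placeholders):

"if"                    { return If; }
[a-zA-Z_][a-zA-Z_0-9]*  { return Id; }

On the input ifx the identifier rule wins because it matches three characters while "if" matches only two (longest match); on the input if both rules match the same two characters and the keyword rule wins because it is listed first.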

A Flex specification of a C scanner

Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]                        { ; }
[\n]                         { line_count++; }
";"                          { return SemiColumn; }
"++"                         { return PlusPlus; }
"+="                         { return PlusEqual; }
"+"                          { return Plus; }
"while"                      { return While; }
{Letter}({Letter}|{Digit})*  { return Id; }
"<="                         { return LessOrEqual; }
"<"                          { return LessThen; }
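For this specification to compile, line_count and the token codes have to be declared before the %% separator; a minimal sketch of such a definitions block (the enum is an assumption, not part of the slides) is:

%{
int line_count = 0;
/* start at 1: yylex() returns 0 at end of input */
enum Token { SemiColumn = 1, PlusPlus, PlusEqual, Plus, While, Id,
             LessOrEqual, LessThen };
%}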

Flex
Input: regular expressions and actions (C code)
Output: a scanner program that reads the input and applies the actions whenever an input regular expression is matched
(Diagram: the regular expressions are fed to flex, which generates a scanner; the scanner reads the input program and emits tokens.)

Naïve Lexical Analysis

Automatic Creation of Efficient Scanners
– Naïve approach on the regular expressions (dotted items)
– Construct a non-deterministic finite automaton over the items
– Convert it to a deterministic one
– Minimize the resultant automaton
– Optimize (compress) the representation

Dotted Items

Example
T → a+ b+
Input: 'aab'
After parsing 'aa': T → a+ • b+

Item Types
– Shift item: • in front of a basic pattern, e.g. A → (ab)+ • c (de|fe)*
– Reduce item: • at the end of the right-hand side, e.g. A → (ab)+ c (de|fe)* •
– Basic item: a shift or reduce item

Character Moves
For shift items character moves are simple:
T → α • c β   on c   T → α c • β
Example: Digit → • [0-9]   on 7   Digit → [0-9] •

ε Moves
For non-shift items the situation is more complicated:
– What character do we need to see?
– Where are we in the matching?
Examples: T → • a*    T → • (a*)

ε Moves for Repetitions
Where can we get from T → α • (R)* β ?
– If R occurs zero times: T → α (R)* • β
– If R occurs one or more times: T → α ( • R )* β
– When R ends: T → α ( R • )* β, which leads to either T → α (R)* • β or T → α ( • R )* β

ε Moves

I → [0-9]+     F → [0-9]*'.'[0-9]+     Input: '3.1;'
I → • ([0-9])+
I → ( • [0-9] )+
I → ( [0-9] • )+
I → ( • [0-9] )+     I → ( [0-9] )+ •

I → [0-9]+     F → [0-9]*'.'[0-9]+     Input: '3.1;'
F → • ([0-9])* '.' ([0-9])+
F → ( • [0-9] )* '.' ([0-9])+     F → ( [0-9] )* • '.' ([0-9])+
F → ( [0-9] • )* '.' ([0-9])+
F → ( • [0-9] )* '.' ([0-9])+     F → ( [0-9] )* • '.' ([0-9])+
F → ( [0-9] )* '.' • ([0-9])+
F → ( [0-9] )* '.' ( • [0-9] )+
F → ( [0-9] )* '.' ( [0-9] • )+
F → ( [0-9] )* '.' ( [0-9] )+ •     F → ( [0-9] )* '.' ( • [0-9] )+

Concurrent Search How to scan multiple token classes in a single run?

I → [0-9]+     F → [0-9]*'.'[0-9]+     Input: '3.1;'
I → • ([0-9])+          F → • ([0-9])* '.' ([0-9])+
I → ( • [0-9] )+        F → ( • [0-9] )* '.' ([0-9])+     F → ( [0-9] )* • '.' ([0-9])+
I → ( [0-9] • )+        F → ( [0-9] • )* '.' ([0-9])+
I → ( • [0-9] )+   I → ( [0-9] )+ •     F → ( • [0-9] )* '.' ([0-9])+     F → ( [0-9] )* • '.' ([0-9])+     F → ( [0-9] )* '.' • ([0-9])+

The Need for Backtracking
– A simple-minded solution may require unbounded backtracking: T1 → a+";"   T2 → a
– Quadratic behavior (see the count below)
– Does not occur in practice
– A linear solution exists
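To see where the quadratic behavior comes from, consider T1 → a+";" and T2 → a on an input of n a's with no semicolon: the scanner reads all the way to the end before giving up on T1 and returning a single-character T2 token, and it must repeat this for each of the n tokens, costing roughly n + (n-1) + … + 1 = n(n+1)/2 character reads.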

A Non-Deterministic Finite State Machine
– Add a production S' → T1 | T2 | … | Tn
– Construct an NFA over the items:
  – Initial state: S' → • (T1 | T2 | … | Tn)
  – For every character move T → α • c β ⟶ T → α c • β, construct a transition labeled c
  – For every ε move construct an ε transition
  – The accepting states are the reduce items
  – Accept the language defined by Ti

ε Moves

NFA for S → (I|F) with I → [0-9]+ and F → [0-9]*'.'[0-9]+ (figure): the states are the dotted items of I → ([0-9]+) and F → ([0-9]*)'.'[0-9]+, connected by [0-9] and "." character transitions and by ε transitions; the reduce items I → ([0-9])+ • and F → [0-9]*'.'([0-9]+) • are the accepting states.

Efficient Scanners
– Construct a deterministic finite automaton:
  – Every state is a set of items
  – Every transition is followed by an ε-closure (sketched in code below)
  – When a set contains two reduce items, select the one declared first
– Minimize the resultant automaton:
  – Rejecting states are initially indistinguishable
  – Accepting states of the same token are indistinguishable
– Exponential worst-case complexity, which does not occur in practice
– Compress the representation
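The "followed by an ε-closure" step is the core of the subset construction. Below is a minimal C sketch of ε-closure under an assumed adjacency-matrix representation of the NFA's ε transitions (the names eps, n_states and MAX_STATES are hypothetical, chosen only for the sketch):

#define MAX_STATES 64

int eps[MAX_STATES][MAX_STATES];   /* eps[s][t] != 0 iff s --epsilon--> t */
int n_states;                      /* number of NFA states (items)        */

/* in_set[s] marks the items of one DFA state; the closure adds every
   item reachable from them through epsilon moves alone.                  */
void epsilon_closure(int in_set[MAX_STATES])
{
    int work[MAX_STATES], top = 0;

    for (int s = 0; s < n_states; s++)
        if (in_set[s])
            work[top++] = s;           /* start from the given items */

    while (top > 0) {
        int s = work[--top];
        for (int t = 0; t < n_states; t++)
            if (eps[s][t] && !in_set[t]) {
                in_set[t] = 1;         /* newly reached by an epsilon move */
                work[top++] = t;
            }
    }
}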

DFA for S → (I|F) with I → [0-9]+ and F → [0-9]*'.'[0-9]+ (figure): each DFA state is a set of items, e.g. the initial state contains I → ( • [0-9])+, F → ( • [0-9]*)'.'[0-9]+ and F → ([0-9]*) • '.'[0-9]+; transitions on [0-9] and "." lead towards the states accepting I and F, while [^0-9.], [^0-9] and .|\n transitions lead to the Sink state.

A Linear-Time Lexical Analyzer

IMPORT Input Char [1..];
set Read Index to 1;

procedure Get_Next_Token:
    set Start of token to Read Index;
    set End of last token to uninitialized;
    set Class of last token to uninitialized;
    set State to Initial;
    while State /= Sink:
        set Ch to Input Char[Read Index];
        set State to δ[State, Ch];
        if accepting(State):
            set Class of last token to Class(State);
            set End of last token to Read Index;
        set Read Index to Read Index + 1;
    set Token.class to Class of last token;
    set Token.repr to Input Char[Start of token .. End of last token];
    set Read Index to End of last token + 1;
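A hedged C rendering of this procedure, assuming a precomputed transition table delta[state][character] and a token_class array (all of the names, and the SINK/INITIAL/NO_CLASS encodings, are assumptions made for the sketch, not part of the slides):

#define SINK     0
#define INITIAL  1
#define NO_CLASS (-1)

extern int  delta[][256];    /* DFA transition table: delta[state][ch]      */
extern int  token_class[];   /* Class(state); NO_CLASS for rejecting states */
extern char input[];         /* the whole program text, '\0'-terminated     */

static int read_index = 0;   /* "Read Index" (0-based here)                 */

typedef struct {
    int cls;                 /* class of the recognized token               */
    int start, end;          /* repr is input[start .. end]                 */
} Token;

Token get_next_token(void)
{
    int start      = read_index;     /* Start of token      */
    int last_end   = -1;             /* End of last token   */
    int last_class = NO_CLASS;       /* Class of last token */
    int state      = INITIAL;

    /* Follow transitions while the automaton is alive, remembering the
       last position at which an accepting state was seen (maximal munch). */
    while (state != SINK && input[read_index] != '\0') {
        state = delta[state][(unsigned char)input[read_index]];
        if (state != SINK && token_class[state] != NO_CLASS) {
            last_class = token_class[state];
            last_end   = read_index;
        }
        read_index++;
    }

    read_index = last_end + 1;   /* roll back to just past the last accept */
    Token t = { last_class, start, last_end };
    return t;                    /* handling of unmatched input is omitted */
}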

Scanning "3.1;"

input       state   next state   last token
• 3.1;      1       2            I
3 • .1;     2       3            I
3. • 1;     3       4            F
3.1 • ;     4       Sink         F

(The DFA figure shows states 1-4 and Sink, with [0-9] and "." transitions between them, [^0-9.] and [^0-9] transitions into Sink, and accepting states for I and F.)

Scanning "aaa"

input        state   next state   last token
• aaa$       1       2            T2
a • aa$      2       4            T2
a a • a$     4       4            T2
a a a • $    4       Sink         T2

(T1 → a+";"   T2 → a. The DFA figure shows states 1-4 and Sink: state 2 accepts T2, state 3, reached on ";", accepts T1, and the [^a], [^a;] and .\n transitions lead to Sink.)

Error Handling
– Illegal symbols
– Common errors

Missing
– Creating a lexical analyzer by hand
– Table compression
– Symbol tables
– Handling macros
– Start states
– Nested comments

Summary
– For most programming languages, lexical analyzers can easily be constructed automatically
– Exceptions: Fortran, PL/1
– Lex/Flex/JLex are useful beyond compilers