Compiler Principles, Winter 2012-2013. Lexical Analysis (Scanning). Mayer Goldberg and Roman Manevich, Ben-Gurion University.

General stuff
- Topics taught by me: lexical analysis (scanning), syntax analysis (parsing), …, dataflow analysis, register allocation
- Slides will be available from the web site after the lecture
- Request: please mute mobiles, tablets, and super-cool squeaking devices

Today
- Understand the role of lexical analysis
- Lexical analysis theory
- Implementing a modern scanner

Role of lexical analysis
- First part of the compiler front end
- Converts a stream of characters into a stream of tokens
- Splits the text into its most basic meaningful strings
- Simplifies the input for syntax analysis
[Figure: compiler pipeline. High-level language (Scheme) -> lexical analysis -> syntax analysis (parsing) -> AST, symbol table, etc. -> intermediate representation (IR) -> code generation -> executable code]

From scanning to parsing
[Figure: program text containing (7 * x) -> lexical analyzer -> token stream over ( ) id num * + -> parser -> abstract syntax tree (valid input) or syntax error (invalid input)]
Grammar used by the parser:
E -> id
E -> num
E -> E + E
E -> E * E
E -> ( E )

JavaScript example
Identify the basic units in this code:

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }

Kinds of basic units: keyword, numeric literal, operator, string literal, punctuation, identifier, whitespace

Scanner output
For the JavaScript example above, the scanner emits a stream of tokens in the format LINE: ID(value):
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...
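A minimal sketch of a scanner that produces this kind of output, assuming Python's standard re module. The rule table, the KEYWORDS map, and the function name scan are illustrative choices, not part of the lecture. Python tries alternatives in order rather than picking the longest match, which is sufficient for this small token set.

```python
import re

# Hypothetical token table for a tiny JavaScript subset; kind names mirror the slide.
KEYWORDS = {"var": "VAR", "function": "FUNCTION"}
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("INT_LITERAL", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("EQ", r"="),
    ("SEMI", r";"),
    ("LP", r"\("),
    ("RP", r"\)"),
    ("LCB", r"\{"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Return the token stream as strings of the form 'LINE: KIND(value)'."""
    line, pos, out = 1, 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"line {line}: unexpected character {text[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "NEWLINE":
            line += 1
        elif kind in ("SKIP", "COMMENT"):
            pass  # recognized, but not passed on to the parser
        elif kind == "ID" and lexeme in KEYWORDS:
            out.append(f"{line}: {KEYWORDS[lexeme]}")
        elif kind in ("ID", "INT_LITERAL"):
            out.append(f"{line}: {kind}({lexeme})")
        else:
            out.append(f"{line}: {kind}")
    return out

source = "var currOption = 0;\n// Choose content to display in lower pane.\nfunction choose ( id ) {\n"
tokens = scan(source)
```

Reserved words are recognized here by looking identifiers up in a table, one common alternative to giving each keyword its own rule.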

What is a token?
- Lexeme: a substring of the original text constituting an identifiable unit (identifiers, values, reserved words, …)
- Token: a record type storing
  - Kind
  - Value (when applicable)
  - Start position / end position
  - Any information that is useful for the parser
- Different for different languages
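One possible shape for such a record, sketched as a Python dataclass. The field names (kind, value, start, end, line) are assumptions made for illustration; a real scanner would store whatever its parser needs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Token:
    kind: str              # e.g. "ID", "SEMI"
    value: Optional[str]   # lexeme text, when applicable
    start: int             # offset of the first character of the lexeme
    end: int               # offset one past the last character
    line: int              # line number, useful for error messages

    def __str__(self):
        suffix = f"({self.value})" if self.value is not None else ""
        return f"{self.line}: {self.kind}{suffix}"

# Tokens for parts of the text "var currOption = 0;"
tok = Token(kind="ID", value="currOption", start=4, end=14, line=1)
semi = Token(kind="SEMI", value=None, start=18, end=19, line=1)
```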

C++ example 1
Splitting text into tokens can be tricky. How should the code below be split?
    vector<vector<int>> myVector
Is >> a single operator token, or are > and > two tokens?

C++ example 2
Splitting text into tokens can be tricky. How should the code below be split?
    vector<vector<int> > myVector
Here > and > are two tokens.

Example tokens
Type         Examples
Identifier   x, y, z, foo, bar
NUM          42
FLOATNUM
STRING       "so long, and thanks for all the fish"
LPAREN       (
RPAREN       )
IF           if
…

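Separating tokens
Type           Examples
Comments       /* ignore code */   // ignore until end of line
White spaces   \t \n
- These lexemes are recognized but get consumed rather than transmitted to the parser
- They still separate tokens: if is a single lexeme, whereas i f and i/*comment*/f are each split into i and f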

Preprocessor directives in C
Type                 Examples
Include directives   #include
Macros               #define THE_ANSWER 42

Designing a scanner
- Define each type of lexeme
  - Reserved words: var, if, for, while
  - Operators: < = ++
  - Identifiers: myFunction
  - Literals: 123 "hello"
- But how do we define lexemes of unbounded length? Regular expressions

Regular languages refresher
- Formal languages
  - Alphabet = finite set of letters
  - Word = finite sequence of letters
  - Language = set of words
- Regular languages are defined equivalently by
  - Regular expressions
  - Finite-state automata

Regular expressions
- Empty string: ε
- Letter: a
- Concatenation: R1 R2
- Union: R1 | R2
- Kleene star: R*
- Shorthand: R+ stands for R R*
- Scope: (R)
- Example: (0* 1*) | (1* 0*). What is this language?
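One way to explore the example is to test it by brute force. The sketch below uses Python's re module and checks a conjecture: the expression accepts exactly the binary words that switch between 0 and 1 at most once. The helper names are made up for this illustration.

```python
import re
from itertools import product

PATTERN = re.compile(r"(0*1*)|(1*0*)")

def in_language(word):
    # fullmatch requires the whole word to match, not just a prefix
    return PATTERN.fullmatch(word) is not None

def at_most_one_switch(word):
    # count the positions where two adjacent letters differ
    return sum(1 for a, b in zip(word, word[1:]) if a != b) <= 1

# Every binary word of length 0 to 6
all_words = ["".join(p) for n in range(7) for p in product("01", repeat=n)]
agree = all(in_language(w) == at_most_one_switch(w) for w in all_words)
```

The check covers only short words, so it supports the conjecture without proving it.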

Exercise 1 - Question
Language of Java identifiers:
- Identifiers start with either an underscore '_' or a letter
- They continue with any sequence of underscores, letters, or digits

Exercise 1 - Answer
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros:
First = _|a|b|…|z|A|…|Z
Next = First|0|…|9
R = First Next*
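The macro definitions translate almost directly into Python's re syntax, using character classes for the long unions. This is a sketch of the exercise's simplified identifier language only; the names FIRST, NEXT, and is_identifier are illustrative.

```python
import re

FIRST = r"[_a-zA-Z]"       # First = _ | a | ... | z | A | ... | Z
NEXT = r"[_a-zA-Z0-9]"     # Next  = First | 0 | ... | 9
IDENTIFIER = re.compile(f"{FIRST}{NEXT}*")  # R = First Next*

def is_identifier(word):
    return IDENTIFIER.fullmatch(word) is not None
```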

Exercise 2 - Question
Language of rational numbers in decimal representation (no leading or trailing zeros)
- Not 007
- Not

Exercise 2 - Answer
Digit = 1|2|…|9
Digit0 = 0|Digit
Num = Digit Digit0*
Frac = Digit0* Digit
Pos = Num | .Frac | 0.Frac | Num.Frac
PosOrNeg = (ε|-)Pos
R = 0 | PosOrNeg
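The same definitions can be assembled step by step in Python's re syntax, which makes it easy to try inputs. The variable names follow the macros above; is_rational is a hypothetical helper.

```python
import re

DIGIT = r"[1-9]"
DIGIT0 = r"[0-9]"
NUM = f"{DIGIT}{DIGIT0}*"                  # no leading zero
FRAC = f"{DIGIT0}*{DIGIT}"                 # no trailing zero
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
POS_OR_NEG = f"-?{POS}"                    # (ε|-)Pos
RATIONAL = re.compile(f"(?:0|{POS_OR_NEG})")

def is_rational(word):
    return RATIONAL.fullmatch(word) is not None
```

With these definitions 0, 42, -3.14, .5 and 0.5 are accepted, while 007, 0.30, 1. and -0 are rejected.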

Exercise 3 - Question
Equal number of opening and closing brackets: [^n ]^n = [], [[]], [[[]]], …

Exercise 3 - Answer
- Not regular
- Context-free
- Grammar: S ::= [] | [S]
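A recognizer for this grammar needs to remember how many brackets are still open, which a finite automaton cannot do for unbounded n. A small recursive sketch that follows the two productions directly (the function name derives_s is made up):

```python
def derives_s(word):
    """Return True if `word` is derivable from S ::= [] | [S]."""
    if word == "[]":
        return True                      # production S ::= []
    return (
        len(word) > 2
        and word[0] == "["
        and word[-1] == "]"
        and derives_s(word[1:-1])        # production S ::= [S]
    )
```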

Finite automata
An automaton is defined by states and transitions.
[Figure: automaton with a start state, transitions labeled a, b, b, c, and an accepting state]

Automaton running example
Words are read left-to-right.
[Figure: the automaton reads the input word abc one letter at a time, following one transition per letter, and ends in the accepting state: word accepted]

Word outside of language
Missing transition means non-acceptance.
[Figure: the same automaton run on the input word cbb]

Exercise - Question
What is the language defined by the automaton below?
[Figure: automaton with transitions labeled a, b, b, c]

Exercise - Answer
a b* c
Generally: all paths leading to accepting states
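A deterministic automaton can be run with a transition table and a single current state. The sketch below encodes a three-state automaton for a b* c; the state names q0, q1, q2 are assumptions, since the slide's figure is not available.

```python
# Hypothetical state names; the table encodes the language a b* c.
START, MIDDLE, ACCEPT = "q0", "q1", "q2"
TRANSITIONS = {
    (START, "a"): MIDDLE,
    (MIDDLE, "b"): MIDDLE,
    (MIDDLE, "c"): ACCEPT,
}
ACCEPTING = {ACCEPT}

def dfa_accepts(word):
    state = START
    for letter in word:  # words are read left-to-right
        if (state, letter) not in TRANSITIONS:
            return False  # missing transition means non-acceptance
        state = TRANSITIONS[(state, letter)]
    return state in ACCEPTING
```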

Non-deterministic automata
Allow multiple transitions from a given state labeled by the same letter.
[Figure: NFA with transitions labeled a, a, b, c, b, c]

NFA run example
- Maintain a set of states while reading the input word abc
- Accept the word if any of the states in the set is accepting
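The set-of-states idea can be sketched in a few lines. The NFA below is hypothetical (the slide's exact automaton is not recoverable): two a-transitions leave the start state, one branch spelling abc and the other acb.

```python
# Hypothetical NFA: maps (state, letter) to the set of possible next states.
NFA = {
    ("s0", "a"): {"s1", "s4"},   # non-deterministic choice on a
    ("s1", "b"): {"s2"},
    ("s2", "c"): {"s3"},
    ("s4", "c"): {"s5"},
    ("s5", "b"): {"s6"},
}
NFA_START = "s0"
NFA_ACCEPTING = {"s3", "s6"}

def nfa_accepts(word):
    current = {NFA_START}
    for letter in word:
        # follow every possible transition from every state in the set
        current = {t for s in current for t in NFA.get((s, letter), set())}
    # accept if any state in the final set is accepting
    return bool(current & NFA_ACCEPTING)
```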

NFA+ε automata
ε transitions can "fire" without reading the input.
[Figure: automaton with transitions labeled a, b, c and one ε transition]

NFA+ε run example
[Figure: the automaton reads the input word abc; at one point the ε transition can non-deterministically take place, and the run ends with the word accepted]

Reg-exp vs. automata
- Regular expressions are declarative
  - Offer a compact way for humans to define a regular language
  - Don't offer a direct way to check whether a given word is in the language
- Automata are operational
  - Define an algorithm for deciding whether a given word is in a regular language
  - Not a natural notation for humans

From reg. exp. to automata
- Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
- Proof: by induction on the structure of the regular expression
  - For each sub-expression R we build an automaton with exactly one start state and one accepting state
  - Start state has no incoming transitions
  - Accepting state has no outgoing transitions

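Base cases
[Figure: for R = ε, a start state with an ε transition to the accepting state; for R = a, a start state with an a transition to the accepting state]

Construction for R1 | R2
[Figure: a new start state with ε transitions into the automata for R1 and R2, and ε transitions from both of them into a new accepting state]

Construction for R1 R2
[Figure: the automata for R1 and R2 chained in sequence between a new start state and a new accepting state, connected by three ε transitions]

Construction for R*
[Figure: the automaton for R placed between a new start state and a new accepting state, connected by four ε transitions that allow R to be skipped or repeated]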
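The inductive construction can be sketched as a recursive function over a regular-expression tree. Everything below is an illustrative assumption: the tuple encoding of expressions, the edge-list representation, and the helper names. Each call returns a fragment with one fresh start state and one fresh accepting state, wired up with ε transitions as in the figures.

```python
from itertools import count

EPS = None            # label used for ε transitions in this sketch
_fresh = count()      # generator of fresh state numbers

def build(regex, edges):
    """Return (start, accept) of an NFA+ε for `regex`, appending to `edges`.

    `regex` is a nested tuple: ("eps",), ("lit", a), ("cat", r1, r2),
    ("alt", r1, r2) or ("star", r).
    """
    start, accept = next(_fresh), next(_fresh)
    op = regex[0]
    if op == "eps":
        edges.append((start, EPS, accept))
    elif op == "lit":
        edges.append((start, regex[1], accept))
    elif op == "alt":
        for sub in regex[1:]:
            s, a = build(sub, edges)
            edges += [(start, EPS, s), (a, EPS, accept)]
    elif op == "cat":
        s1, a1 = build(regex[1], edges)
        s2, a2 = build(regex[2], edges)
        edges += [(start, EPS, s1), (a1, EPS, s2), (a2, EPS, accept)]
    elif op == "star":
        s, a = build(regex[1], edges)
        # enter R, leave R, skip R entirely, repeat R
        edges += [(start, EPS, s), (a, EPS, accept), (start, EPS, accept), (a, EPS, s)]
    else:
        raise ValueError(f"unknown operator {op!r}")
    return start, accept

def closure(states, edges):
    """All states reachable from `states` through ε transitions only."""
    seen, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen

def accepts(regex, word):
    edges = []
    start, accept = build(regex, edges)
    current = closure({start}, edges)
    for letter in word:
        moved = {dst for src, label, dst in edges if src in current and label == letter}
        current = closure(moved, edges)
    return accept in current

# a b* c, the language of the earlier automaton exercise
A_BSTAR_C = ("cat", ("lit", "a"), ("cat", ("star", ("lit", "b")), ("lit", "c")))
```

The number of states grows linearly with the size of the expression, since every node of the tree contributes exactly two states.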

From NFA+ε to DFA
- Construction requires O(n) states for a reg-exp of length n
- Running an NFA+ε with n states on a string of length m takes O(m·n²) time
- Solution: determinization via subset construction
  - Number of states is worst-case exponential in n
  - Running time is O(m)

Subset construction
- For an NFA+ε with states M = {s1, …, sk}
- Construct a DFA with one state per set of states of the corresponding NFA: M' = { [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], … }
- Simulate transitions between individual states for every letter
  [Figure: the NFA+ε transitions s1 -a-> s2 and s4 -a-> s7 become the DFA transition [s1,s4] -a-> [s2,s7]]
- Extend macro states by states reachable via ε transitions
  [Figure: the NFA+ε transition s1 -ε-> s4 extends the DFA state [s1,s2] to [s1,s2,s4]]
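Both steps (simulating letters and extending by ε) can be combined into a worklist algorithm. The sketch below is one possible formulation; the dictionary encoding of the NFA and the function names are assumptions, and the tiny example NFA merges the two figure fragments into one hypothetical automaton.

```python
EPS = ""  # label for ε transitions in this sketch

def eps_closure(states, delta):
    """Extend a set of NFA states by everything reachable via ε transitions."""
    seen, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in delta.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return frozenset(seen)

def determinize(start, accepting, delta, alphabet):
    """Subset construction: returns (dfa_start, dfa_accepting, dfa_delta)."""
    dfa_start = eps_closure({start}, delta)
    dfa_delta, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        macro = todo.pop()
        for letter in alphabet:
            moved = {t for s in macro for t in delta.get((s, letter), ())}
            target = eps_closure(moved, delta)
            if not target:
                continue  # leave the transition missing instead of adding a dead state
            dfa_delta[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_accepting = {m for m in seen if m & accepting}
    return dfa_start, dfa_accepting, dfa_delta

# Hypothetical NFA+ε: s1 -ε-> s4, s1 -a-> s2, s4 -a-> s7, with s7 accepting.
delta = {
    ("s1", EPS): {"s4"},
    ("s1", "a"): {"s2"},
    ("s4", "a"): {"s7"},
}
dfa_start, dfa_accepting, dfa_delta = determinize("s1", {"s7"}, delta, "a")
```

Only macro states reachable from the start are built, which in practice is often far fewer than the worst-case exponential number.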

Scanning challenges
- Regular expressions allow us to define the language of all sequences of tokens
- Automata theory provides an algorithm for checking membership of words
- But we are interested in splitting the text, not just deciding on membership
  - How do we determine lexemes?
  - How do we handle ambiguities (lexemes matching more than one token)?

Separating lexemes
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Figure: automaton with transitions labeled a-z and 1, and accepting states ID and ONE]

Maximal munch
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
- Solution: find the longest matching lexeme
  - Keep reading text until the automaton leaves the accepting state
  - Return the token corresponding to the accepting state
  - Reset: go back to the start state and continue reading input from there
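A sketch of the maximal-munch loop for these two definitions. The state names and the step function are made up for illustration. The loop remembers the last accepting position it passed, a slight generalization of "until the automaton leaves the accepting state" that also covers automata where a match passes through non-accepting states.

```python
# Hypothetical DFA for ID = [a-z][a-z]* and ONE = 1.
def step(state, ch):
    if state in ("start", "in_id") and "a" <= ch <= "z":
        return "in_id"
    if state == "start" and ch == "1":
        return "one"
    return None  # missing transition

ACCEPTING = {"in_id": "ID", "one": "ONE"}

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "start", pos
        last_accept = None  # (end position, token kind) of the longest match so far
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPTING:
                last_accept = (i, ACCEPTING[state])
        if last_accept is None:
            raise SyntaxError(f"no token matches at position {pos}")
        end, kind = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end  # reset: restart from the start state right after the lexeme
    return tokens
```

For the input abb1 this yields ID(abb) followed by ONE.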

Handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
- The input matches both tokens. What should the scanner output?
- Solution: break the tie using the order of definitions
  - With ID defined before IF, the output is ID(if)
  - With IF defined before ID, the output is IF
- Conclusion: list keyword token definitions before the identifier definition
[Figure: NFA and equivalent DFA for ID and IF; in the DFA, the state reached on input if is labeled with both IF and ID]
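The tie-breaking rule can be sketched with Python's re module: try every rule at the current position, prefer the longest match, and on equal length keep the rule listed first. The helper make_scanner is hypothetical. For these simple patterns Python's greedy match coincides with the longest match, which is not true of arbitrary patterns.

```python
import re

def make_scanner(rules):
    """`rules` is an ordered list of (kind, pattern); earlier rules win ties."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]

    def next_token(text, pos=0):
        best = None  # (length, kind)
        for kind, regex in compiled:
            m = regex.match(text, pos)
            # Strictly longer wins; on equal length the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() - pos > best[0]):
                best = (m.end() - pos, kind)
        if best is None:
            raise SyntaxError(f"no token matches at position {pos}")
        length, kind = best
        return kind, text[pos:pos + length]

    return next_token

id_first = make_scanner([("ID", r"[a-z]+"), ("IF", r"if")])
if_first = make_scanner([("IF", r"if"), ("ID", r"[a-z]+")])
```

Note that the order only matters for ties: with IF listed first, the input iffy is still scanned as a single identifier, because the longest match takes priority.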

Implementing scanners in practice

Implementing scanners
- Manual construction of automata + determinization is
  - Very tedious
  - Error-prone
  - Non-incremental
- Fortunately, there are tools that automatically generate scanner code from a specification, for most languages
  - C: Lex, Flex
  - Java: JLex, JFlex

Using JFlex
- Define tokens (and states)
- Run JFlex to generate a Java implementation
- Usually MyScanner.nextToken() will be called in a loop by the parser
[Figure: regular expressions (MyScanner.lex) -> JFlex -> MyScanner.java; stream of characters -> MyScanner.java -> tokens]

Common format for reg-exps

Escape characters
- What is the expression for one or more + symbols?
  - (+)+ won't work
  - (\+)+ will
- A backslash \ before an operator turns it into a standard character: \*, \?, \+, …
- Newline: \n or \r\n depending on OS
- Tab: \t
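The same escaping rule can be tried out in Python's re module, whose syntax is similar to, but not identical with, the scanner-generator format discussed here. The helper compiles is made up for this illustration.

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a syntactically valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One or more literal + symbols: the inner + is escaped, the outer one is the operator.
PLUSES = re.compile(r"(\+)+")
```

In Python, the unescaped pattern (+)+ is rejected at compile time because the first + has nothing to repeat.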

Shorthands
- Use names for expressions
  letter = a | b | … | z | A | B | … | Z
  letter_ = letter | _
  digit = 0 | 1 | 2 | … | 9
  id = letter_ (letter_ | digit)*
- Use hyphen to denote a range
  letter = a-z | A-Z
  digit = 0-9

Catching errors
- What if the input doesn't match any token definition?
- Trick: add a "catch-all" rule that matches any character and reports an error
  - Add it after all other rules
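A sketch of the catch-all trick using Python's re module, with a made-up rule table. Because the ERROR rule matches any single character and is listed last, it only fires when no other rule applies, and scanning can continue after reporting the problem.

```python
import re

# Hypothetical rule table; order matters because alternatives are tried left to right.
RULES = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),  # catch-all: must stay last
]
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in RULES), re.DOTALL)

def scan(text):
    tokens, errors = [], []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "ERROR":
            errors.append(f"illegal character {m.group()!r} at offset {m.start()}")
        elif kind != "SKIP":
            tokens.append((kind, m.group()))
    return tokens, errors
```

Collecting errors instead of stopping at the first one lets the scanner report several illegal characters in a single run.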

Next lecture: parsing