Compiler Principles, Fall 2014-2015. Lecture 1: Lexical Analysis. Roman Manevich, Ben-Gurion University.



Agenda
  – Understand the role of lexical analysis in a compiler
  – Lexical analysis theory
  – Implementing a professional scanner via a scanner generator

JavaScript example

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }

Can you identify some basic units in this code?


Basic units in the JavaScript example: keyword, operator, identifier, numeric literal, punctuation, whitespace, comment, string literal

Role of lexical analysis
First part of the compiler front end
Convert a stream of characters into a stream of tokens
  – Split the text into its most basic meaningful strings
Simplify the input for syntax analysis
[Figure: compiler pipeline. High-level language (Scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, symbol table, etc. → Intermediate Representation (IR) → Code Generation → Executable code]

From scanning to parsing
Program text: 59 + (1257 * xPosition)
Lexical Analyzer: program text → token stream num + ( num * id ), or a lexical error
Parser: token stream → Abstract Syntax Tree if the syntax is valid, or a syntax error
Grammar:
  E → id
  E → num
  E → E + E
  E → E * E
  E → ( E )
[Figure: Abstract Syntax Tree with nodes +, num, *, x]

Scanner output
Stream of tokens for the JavaScript example, in the format LINE: ID(value)
  1: VAR
  1: ID(currOption)
  1: EQ
  1: INT_LITERAL(0)
  1: SEMI
  3: FUNCTION
  3: ID(choose)
  3: LP
  3: ID(id)
  3: RP
  3: LCB
  ...

Tokens

What is a token?
Lexeme: a substring of the original text constituting an identifiable unit
  – Identifiers, values, reserved words, …
A token is a record type storing:
  – Kind
  – Value (when applicable)
  – Start position/end position
  – Any information that is useful for the parser
Different for different languages
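The record described above could be sketched in Python roughly as follows. The field names (kind, value, start, end, line) and the describe helper are illustrative assumptions, not names from the lecture; the printed form imitates the LINE: ID(value) style used for the scanner output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; the field names are assumptions for illustration."""
    kind: str              # e.g. "ID", "VAR", "INT_LITERAL"
    value: Optional[str]   # the lexeme, when it matters (None for keywords, punctuation)
    start: int             # offset of the first character of the lexeme
    end: int               # offset one past the last character of the lexeme
    line: int              # source line, useful for error messages

    def describe(self) -> str:
        # Render as "LINE: KIND" or "LINE: KIND(value)".
        if self.value is None:
            return f"{self.line}: {self.kind}"
        return f"{self.line}: {self.kind}({self.value})"
```

A parser would typically look only at kind when checking syntax and consult value and the positions later, for example when building the AST or reporting errors.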

Example tokens
Type        | Examples
Identifier  | x, y, z, foo, bar
NUM         | 42
FLOATNUM    |
STRING      | “so long, and thanks for all the fish”
LPAREN      | (
RPAREN      | )
IF          | if
…

C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?
  vector<vector<int>> myVector
  – Is >> one operator token, or two tokens >, >?

C++ example 2
Splitting text into tokens can be tricky
How should the code below be split?
  vector<vector<int> > myVector
  – >, > are two tokens

Separating tokens
Type          | Examples
Comments      | /* ignore code */   // ignore until end of line
White spaces  | \t \n
Lexemes are recognized but get consumed rather than transmitted to the parser
  – if
  – i f
  – i/*comment*/f

Preprocessor directives in C
Type                | Examples
Include directives  | #include
Macros              | #define THE_ANSWER 42

First step of designing a scanner
Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 “hello”
How can we define lexemes of unbounded length?
  – Regular expressions

Agenda
  – Understand the role of lexical analysis in a compiler: convert text to a stream of tokens
  – Lexical analysis theory
  – Implementing a professional scanner via a scanner generator

Regular expressions

Regular languages refresher
Formal languages
  – Alphabet = finite set of letters
  – Word = sequence of letters
  – Language = set of words
Regular languages are defined equivalently by
  – Regular expressions
  – Finite-state automata

Regular expressions
Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene star: R*
  – Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*)
  – What is this language?
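One way to explore the example expression is to test words against it. The sketch below uses Python's re module, whose syntax for this expression happens to coincide with the notation above; the helper name in_language is an assumption for illustration.

```python
import re

# (0* 1*) | (1* 0*): a run of 0s followed by a run of 1s, or the other way round.
PATTERN = re.compile(r"(0*1*)|(1*0*)")


def in_language(word: str) -> bool:
    # fullmatch requires the whole word to match, i.e. membership in the language.
    return PATTERN.fullmatch(word) is not None
```

Trying words such as 0011, 1100, and 010 suggests the answer: the language contains the words that switch between 0 and 1 at most once.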

Exercise 1 - Question
Language of Java identifiers
  – Identifiers start with either an underscore ‘_’ or a letter
  – Continue with either underscore, letter, or digit

Exercise 1 - Answer
  – (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Exercise 1 - Better answer
Using shorthand macros
  First = _|a|b|…|z|A|…|Z
  Next = First|0|…|9
  R = First Next*
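The shorthand macros translate almost directly into code. The following is a minimal sketch using Python's re module; it follows the simplified identifier definition from the exercise, and the names FIRST, NEXT, and is_identifier are assumptions for illustration.

```python
import re

# Macros mirroring First, Next and R = First Next* from the exercise.
FIRST = r"[_a-zA-Z]"
NEXT = rf"(?:{FIRST}|[0-9])"
IDENTIFIER = re.compile(rf"{FIRST}{NEXT}*")


def is_identifier(word: str) -> bool:
    # The whole word must match, not just a prefix.
    return IDENTIFIER.fullmatch(word) is not None
```

Building the final expression from named pieces keeps each piece readable, which is the same motivation as the macros in the better answer.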

Exercise 2 - Question
Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Positive examples:
  – Negative examples:

Exercise 2 - Answer
Language of rational numbers in decimal representation (no leading or trailing zeros)
  Digit = 1|2|…|9
  Digit0 = 0|Digit
  Num = Digit Digit0*
  Frac = Digit0* Digit
  Pos = Num | .Frac | 0.Frac | Num.Frac
  PosOrNeg = (Є|-)Pos
  R = 0 | PosOrNeg
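The answer can be checked mechanically by transcribing the macros into a regular expression. This is a sketch using Python's re module, not part of the original exercise; the helper name is_rational and the test words are assumptions.

```python
import re

# One Python string per macro in the answer.
DIGIT = r"[1-9]"
DIGIT0 = r"[0-9]"
NUM = rf"{DIGIT}{DIGIT0}*"
FRAC = rf"{DIGIT0}*{DIGIT}"
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
POS_OR_NEG = rf"-?{POS}"                     # (Є|-)Pos
R = re.compile(rf"(?:0|{POS_OR_NEG})")


def is_rational(word: str) -> bool:
    # Membership test: the entire word has to match R.
    return R.fullmatch(word) is not None
```

For instance, 12.5 and -0.25 are accepted, while 007 (leading zeros) and 1.50 (trailing zero) are rejected.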

Exercise 3 - Question
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …

Exercise 3 - Answer
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …
Not regular
Context-free grammar: S ::= [] | [S]
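Although no finite automaton recognizes this language, the grammar S ::= [] | [S] maps onto a short recursive function. The sketch below is one possible recognizer; the function name derives_s is an assumption.

```python
def derives_s(word: str) -> bool:
    """Return True if word can be derived from S ::= [] | [S]."""
    if word == "[]":
        return True  # first alternative: S ::= []
    # Second alternative: S ::= [S], so strip one bracket pair and recurse.
    return (
        len(word) >= 4
        and word[0] == "["
        and word[-1] == "]"
        and derives_s(word[1:-1])
    )
```

The recursion depth grows with the nesting depth, which is exactly the unbounded counting that a finite number of automaton states cannot provide.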

Finite automata

Finite automata
An automaton is defined by states and transitions
[Figure: automaton with a start state, an accepting state, and transitions labeled a, b, b, c]

Automaton running example
Words are read left-to-right
Input: abc
  – The automaton follows one transition per letter, starting from the start state
  – After the last letter it is in an accepting state: word accepted

Word outside of language
Input: bbc
  – Missing transition means non-acceptance

Word outside of language
Input: abb
  – Final state is not an accepting state

Exercise - Question
What is the language defined by the automaton below?
[Figure: the automaton with transitions labeled a, b, b, c]

Exercise - Answer
  – a b* c
  – Generally: all paths leading to accepting states
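A deterministic automaton is easy to run in code: keep one current state and follow one transition per letter. The sketch below encodes an automaton for a b* c; the state names q0, q1, q2 and the exact transition table are assumptions, since the transcript does not preserve the slide's diagram.

```python
# Transition table for a b* c (assumed encoding, three states).
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",   # loop on b
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}


def accepts(word: str) -> bool:
    state = START
    for letter in word:  # words are read left-to-right
        state = TRANSITIONS.get((state, letter))
        if state is None:
            return False  # missing transition means non-acceptance
    return state in ACCEPTING  # the final state must be accepting
```

Both ways of rejecting a word show up here: a missing transition (as for bbc) and ending in a non-accepting state (as for abb).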

A little about me
Joined Ben-Gurion University two years ago
Research interests
  – Advanced compilation and synthesis techniques
  – Language-supported parallelism
  – Static analysis and verification

I am here for
Teaching you theory and practice of popular compiler algorithms
  – Hopefully make you think about solving problems by examples from the compilers world
  – Answering questions about material
Contacting me
  – Office hours: see course web page
Announcements
Forums (per assignment)

Tentative syllabus
Front End: Scanning, Top-down Parsing (LL), Bottom-up Parsing (LR), Attribute Grammars
Intermediate Representation: Lowering
Optimizations: Local Optimizations, Dataflow Analysis, Loop Optimizations
Code Generation: Register Allocation, Instruction Selection
Mid-term exam

Nondeterministic finite automata

Non-deterministic automata
Allow multiple transitions from a given state labeled by the same letter
[Figure: NFA with transitions labeled a, a, b, c, b, c]

NFA run example
Input: abc
  – Maintain a set of states
  – Accept the word if any of the states in the set is accepting

NFA+Є automata
Є transitions can “fire” without reading the input
[Figure: NFA with transitions labeled a, b, c, Є]

NFA+Є run example
Input: abc
  – The Є transition can non-deterministically take place at any point where it is enabled
  – Word accepted
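Running an NFA+Є amounts to tracking the set of states the automaton could be in and closing that set under Є transitions after every step. The following sketch does this for a small hypothetical automaton accepting a b* c; the state numbers and tables are assumptions and do not reproduce the slide's diagram.

```python
# Hypothetical NFA+Є: 0 -a-> 1, 1 -b-> 1, 1 -Є-> 2, 2 -c-> 3
DELTA = {(0, "a"): {1}, (1, "b"): {1}, (2, "c"): {3}}
EPSILON = {1: {2}}
START = 0
ACCEPTING = {3}


def epsilon_closure(states):
    """All states reachable from `states` using only Є transitions."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for target in EPSILON.get(state, ()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return closure


def accepts(word: str) -> bool:
    current = epsilon_closure({START})
    for letter in word:
        moved = set()
        for state in current:
            moved |= DELTA.get((state, letter), set())
        current = epsilon_closure(moved)
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)
```

Non-determinism never forces a guess here: all choices are followed at once, at the cost of handling a set of states per input letter.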

Reg-exp vs. automata
Regular expressions are declarative
  – Offer a compact way for humans to define a regular language
  – Don’t offer a direct way to check whether a given word is in the language
Automata are operational
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans

From regular expressions to NFA

From reg. exp. to automata
Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
Proof: by induction on the structure of the regular expression
  – For each sub-expression R we build an automaton with exactly one start state and one accepting state
  – Start state has no incoming transitions
  – Accepting state has no outgoing transitions

Base cases
R = Є and R = a
[Figure: start state connected to the accepting state by a single transition labeled Є, respectively a]

Construction for R1 | R2
[Figure: start state, automata for R1 and R2, accepting state, connected by four Є transitions]

Construction for R1 R2
[Figure: start state, automaton for R1, automaton for R2, accepting state, connected by three Є transitions]

Construction for R*
[Figure: start state, automaton for R, accepting state, connected by four Є transitions]
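The inductive proof is itself an algorithm. Below is a minimal sketch of such a construction (commonly known as Thompson's construction) over a tiny regular-expression syntax tree. The tuple encoding ("eps", "sym", "cat", "alt", "star"), the edge-list representation, and the exact placement of the Є edges are assumptions chosen to match the edge counts in the figures; the slide's own diagrams may differ in detail.

```python
from itertools import count


def thompson(regex):
    """Build an NFA+Є for a regex tree; returns (start, accept, edges).

    Edges are (source, label, target) triples; label None marks an Є edge.
    """
    fresh = count()
    edges = []

    def build(node):
        # Every sub-expression gets exactly one start and one accepting state.
        start, accept = next(fresh), next(fresh)
        kind = node[0]
        if kind == "eps":
            edges.append((start, None, accept))
        elif kind == "sym":
            edges.append((start, node[1], accept))
        elif kind == "alt":      # R1 | R2: four Є edges
            for sub in node[1:]:
                s, a = build(sub)
                edges.append((start, None, s))
                edges.append((a, None, accept))
        elif kind == "cat":      # R1 R2: three Є edges
            s1, a1 = build(node[1])
            s2, a2 = build(node[2])
            edges.extend([(start, None, s1), (a1, None, s2), (a2, None, accept)])
        elif kind == "star":     # R*: skip edge plus loop-back edge
            s, a = build(node[1])
            edges.extend([(start, None, s), (a, None, accept),
                          (start, None, accept), (a, None, s)])
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return start, accept

    start, accept = build(regex)
    return start, accept, edges


def closure(states, edges):
    """Close a set of states under Є edges."""
    result, worklist = set(states), list(states)
    while worklist:
        q = worklist.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in result:
                result.add(dst)
                worklist.append(dst)
    return result


def accepts(regex, word):
    start, accept, edges = thompson(regex)
    current = closure({start}, edges)
    for letter in word:
        moved = {dst for src, label, dst in edges
                 if src in current and label == letter}
        current = closure(moved, edges)
    return accept in current


# a b* c written as a tree (hypothetical encoding).
A_BSTAR_C = ("cat", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))
```

Each constructor adds a constant number of states and edges, which is why the resulting automaton has O(n) states for an expression of length n.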

NFA determinization

From NFA+Є to DFA
Construction requires O(n) states for a reg-exp of length n
Running an NFA+Є with n states on a string of length m takes O(m·n^2) time
  – Can we reduce the n^2 factor?
Theorem: for any NFA+Є automaton there exists an equivalent deterministic automaton
Proof: determinization via subset construction
  – Number of states in the worst case: O(2^n)
  – Running time: O(m)

Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
  – M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Simulate transitions between individual states for every letter
  [Figure: NFA+Є transitions s1 -a-> s2 and s4 -a-> s7 become the DFA transition [s1,s4] -a-> [s2,s7]]
Extend macro states by states reachable via Є transitions
  [Figure: NFA+Є transition s1 -Є-> s4 extends the DFA state [s1,s2] to [s1,s2,s4]]
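The two steps of the subset construction (simulate every letter, then extend by Є-reachable states) can be sketched as a worklist algorithm that only creates the macro states actually reachable from the start. The NFA below is a hypothetical one for a b* c; the function names and the dictionary encoding are assumptions.

```python
def epsilon_closure(states, epsilon):
    """Extend a set of NFA states by everything reachable via Є transitions."""
    closure, worklist = set(states), list(states)
    while worklist:
        state = worklist.pop()
        for target in epsilon.get(state, ()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return frozenset(closure)


def determinize(start, accepting, delta, epsilon, alphabet):
    """Return (dfa_start, dfa_accepting, dfa_delta); DFA states are frozensets."""
    dfa_start = epsilon_closure({start}, epsilon)
    dfa_delta, dfa_accepting = {}, set()
    seen, worklist = {dfa_start}, [dfa_start]
    while worklist:
        macro = worklist.pop()
        if macro & accepting:
            dfa_accepting.add(macro)
        for letter in alphabet:
            # Simulate the letter from every individual state in the macro state.
            moved = set()
            for state in macro:
                moved |= delta.get((state, letter), set())
            target = epsilon_closure(moved, epsilon)
            dfa_delta[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return dfa_start, dfa_accepting, dfa_delta


def dfa_accepts(word, dfa):
    # Assumes every letter of word belongs to the alphabet used for determinize.
    state, accepting, delta = dfa
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting


# Hypothetical NFA+Є: 0 -a-> 1, 1 -b-> 1, 1 -Є-> 2, 2 -c-> 3
NFA_DELTA = {(0, "a"): {1}, (1, "b"): {1}, (2, "c"): {3}}
NFA_EPSILON = {1: {2}}
DFA = determinize(0, {3}, NFA_DELTA, NFA_EPSILON, "abc")
```

The empty macro state [] acts as a dead state, and the DFA run takes one dictionary lookup per letter, which corresponds to the O(m) running time.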

Recap
We know how to define any single type of lexeme
We know how to convert any regular expression into a recognizing automaton
But is this enough for scanning?

Designing a scanner

Scanning challenges
Regular expressions allow us to recognize whether a given text is a sequence of legal lexemes
  – Define the language of all sequences of lexemes
Automata theory provides an algorithm for checking membership of words
  – But we are interested in splitting the text, not just deciding on membership
1. How do we split the text into lexemes?
2. How do we handle ambiguities (lexemes matching more than one token)?

Separating lexemes
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we return ID(abb), ONE?
[Figure: automaton with transitions labeled a-z and 1 and accepting states labeled ID and ONE]

Maximal munch algorithm
Solution: find the longest matching lexeme
  – Keep reading text until the automaton fails
  – Remember the text position of the last accepting state
  – Return the token corresponding to the last accepting state
  – Reset: go back to the start state and continue reading input from the remembered text position

Handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens. What should the scanner output?
Solution: break the tie using the order of definitions
  – With ID defined before IF, the output is ID(if)
  – With IF defined before ID, the output is IF
Conclusion: list keyword token definitions before the identifier definition
[Figure: NFA and equivalent DFA recognizing ID and IF; the DFA state reached after reading if is labeled with whichever of ID and IF is defined first]
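The effect of definition order can be demonstrated with a small rule-driven tokenizer. This sketch uses Python's re module instead of a hand-built DFA, which is a swapped-in technique: it tries every rule at the current position, keeps the longest match, and on ties keeps the rule listed first. The function name tokenize and the rule lists are assumptions.

```python
import re


def tokenize(text, rules):
    """Longest match wins; among equally long matches the earliest rule wins."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():      # whitespace separates tokens and is dropped
            pos += 1
            continue
        best_kind, best_len = None, 0
        for kind, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if match and len(match.group()) > best_len:
                best_kind, best_len = kind, len(match.group())
        if best_kind is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens


KEYWORD_FIRST = [("IF", r"if"), ("ID", r"[a-z]+")]
ID_FIRST = [("ID", r"[a-z]+"), ("IF", r"if")]
```

With KEYWORD_FIRST the input if becomes an IF token, with ID_FIRST it becomes ID(if), and in both cases iffy is a single identifier because the longest match takes priority over rule order.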

Filtering illegal combinations
Which tokens should the scanner return for “123foo”?
  – We sometimes want to rule out certain token concatenations prior to parsing
  – How can we do that with what we’ve seen so far?
Define “error” lexemes

Putting the algorithm pieces together

Scanner construction recap
List of regular expressions (one per lexeme) → NFA+Є → DFA (with minimization) → code implementing maximal munch with tie-breaking policy

Running the scanner
Maintain variables for
  – ltpos = position from which we scan for the next token
  – lposa = last position of an accepting state
  – state = current state
While not reaching end of text do
    Read until getting stuck
    If not end-of-file
        return token = input[ltpos, lposa]
        ltpos = lposa + 1
        state = q1 (initial)
        lposa = N/A
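The loop above can be turned into a small runnable scanner. The sketch below keeps the variable names ltpos, lposa, and state, and uses the ID and ONE lexemes from the maximal munch discussion; the state names q_id and q_one, the step function, and the choice to raise an exception on a lexical error are assumptions.

```python
import string

START = "q1"
ACCEPTING = {"q_id": "ID", "q_one": "ONE"}   # accepting state -> token kind


def step(state, letter):
    """DFA transition for ID = (a-z)(a-z)* and ONE = 1; None means stuck."""
    if state in ("q1", "q_id") and letter in string.ascii_lowercase:
        return "q_id"
    if state == "q1" and letter == "1":
        return "q_one"
    return None


def scan(text):
    tokens = []
    ltpos = 0                      # position from which we scan for the next token
    while ltpos < len(text):
        state, lposa, kind = START, None, None
        pos = ltpos
        while pos < len(text):     # read until getting stuck or end of text
            state = step(state, text[pos])
            if state is None:
                break
            if state in ACCEPTING:
                lposa, kind = pos, ACCEPTING[state]   # last accepting position
            pos += 1
        if lposa is None:
            raise ValueError(f"lexical error at position {ltpos}")
        tokens.append((kind, text[ltpos:lposa + 1]))
        ltpos = lposa + 1          # resume right after the returned lexeme
    return tokens
```

For the input abb1 the scanner reads abb, gets stuck on 1, returns ID(abb) from the last accepting position, and then restarts from position 3 to return ONE.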

Scanning exercise
You are given the following lexeme:
  – RAT: (1-9)(0-9)*. (0-9)*(1-9) | 0. (0-9)*(1-9)
Construct the corresponding scanner automaton
Run it on the inputs

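Scanner run (first input)
token | curr. state | curr. letter | curr. position | last accept pos. | scan from
      | q1          | 1            | 0              | N/A              | 0
      | q2          | .            | 1              | N/A              | 0
      | q3          | 2            | 2              | N/A              | 0
      | q?          | ?            | ?              | ?                | ?
      | q5          | .            | 4              | 3                | 0
ERROR | q1          | .            | 4              | N/A              | 4
[Figure: scanner automaton with start state q1 and further states including q3, q5, q6]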

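Scanner run (second input)
token | curr. state | curr. letter | curr. position | last accept pos. | scan from
      | q1          | 1            | 0              | N/A              | 0
      | q2          | .            | 1              | N/A              | 0
      | q3          | 2            | 2              | N/A              | 0
      | q?          | ?            | ?              | ?                | ?
      | q4          | .            | 5              | 3                | 0
      | q1          | 0            | 4              | N/A              | 4
      | q6          | .            | 5              | N/A              | 4
      | q3          | 2            | 6              | N/A              | 4
0.2   | q5          | EOF          |                | 6                | 4
[Figure: scanner automaton with start state q1 and further states including q3, q5, q6]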

Agenda
  – Understand the role of lexical analysis in a compiler: convert text to a stream of tokens
  – Lexical analysis theory: theory of regular languages + maximal munch + precedence
  – Implementing a professional scanner via a scanner generator

Implementing a scanner

Implementing modern scanners
Manual construction of automata + determinization + maximal munch + tie breaking
  – Very tedious
  – Error-prone
  – Non-incremental
Fortunately there are tools that automatically generate robust code from a specification for most languages
  – C: Lex, Flex
  – Java: JLex, JFlex

Using JFlex
Define tokens (and states)
Run JFlex to generate a Java implementation
Usually MyScanner.nextToken() will be called in a loop by the parser
[Figure: lexical specification MyScanner.lex → JFlex → MyScanner.java; stream of characters → MyScanner.java → tokens]

Common format for reg-exps

Escape characters
What is the expression for one or more + symbols?
  – (+)+ won’t work
  – (\+)+ will
A backslash \ before an operator turns it into a standard character
  – \*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t
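The same escaping rule can be observed in Python's re module, which is used here only as a convenient stand-in for the scanner generator's syntax; the helper names below are assumptions.

```python
import re

# Escaped: the inner + is a literal character, the outer + means "one or more".
PLUS_RUN = re.compile(r"(\+)+")


def is_plus_run(text: str) -> bool:
    return PLUS_RUN.fullmatch(text) is not None


def compiles(pattern: str) -> bool:
    """Report whether the pattern is accepted by the regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

Here compiles("(+)+") is False because the unescaped + has nothing to repeat, while the escaped version compiles and matches runs such as +++.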

Shorthands
Use names for expressions
  – letter = a | b | … | z | A | B | … | Z
  – letter_ = letter | _
  – digit = 0 | 1 | 2 | … | 9
  – id = letter_ (letter_ | digit)*
Use a hyphen to denote a range
  – letter = a-z | A-Z
  – digit = 0-9

Catching errors
What if the input doesn’t match any token definition?
  – Want to gracefully signal an error
Trick: add a “catch-all” rule that matches any character and reports an error
  – Add it after all other rules
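A catch-all rule can be sketched with Python's re module by placing a single-character alternative last. Note the assumptions: the token kinds NUM, ID, WS, and ERROR are hypothetical, errors are reported as ERROR tokens rather than exceptions, and Python picks the first alternative that matches rather than the longest one, so this illustrates only the catch-all trick, not maximal munch.

```python
import re

# The ERROR alternative comes last, so it only fires when nothing else matches.
TOKEN_RE = re.compile(
    r"(?P<NUM>[0-9]+)"
    r"|(?P<ID>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<WS>\s+)"
    r"|(?P<ERROR>.)"
)


def scan(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup       # name of the alternative that matched
        if kind == "WS":
            continue                 # whitespace is consumed, not transmitted
        tokens.append((kind, match.group()))
    return tokens
```

Because the catch-all consumes exactly one character, scanning can continue after the offending character and report further errors in the same run.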

Next lecture: parsing