Compiler Principles Fall 2015-2016 Compiler Principles Lecture 1: Lexical Analysis Roman Manevich Ben-Gurion University.

Agenda
Understand role of lexical analysis in a compiler
Regular languages reminder
Lexical analysis theory
Scanner generation

Javascript example
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}
Can you identify some basic units in this code?
keyword, operator, identifier, numeric literal, punctuation, whitespace, comment, string literal

Role of lexical analysis
First part of compiler front-end
Convert stream of characters into stream of tokens
– Split text into most basic meaningful strings
Simplify input for syntax analysis
[Figure: compiler pipeline. High-level Language (scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code]

From scanning to parsing
Program text: 59 + (1257 * xPosition)
Lexical Analyzer: program text → token stream num + ( num * id ) (valid), or a lexical error
Parser: token stream → Abstract Syntax Tree (valid), or a syntax error
Grammar:
E → id
E → num
E → E + E
E → E * E
E → ( E )
[Figure: Abstract Syntax Tree with nodes +, num, x, *]

Scanner output
Input: the Javascript example code (var currOption = 0; on line 1, function choose ( id ) { on line 3)
Stream of tokens, in the format LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...

Tokens

What is a token?
Lexeme – substring of original text constituting an identifiable unit
– Identifiers, values, reserved words, …
Record type storing:
– Kind
– Value (when applicable)
– Start-position/end-position
– Any information that is useful for the parser
Different for different languages

Example tokens
Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM    -3.141592654
STRING      “so long, and thanks for all the fish”
LPAREN      (
RPAREN      )
IF          if
…

C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?
vector<vector<int>> myVector
– >> operator, or
– >, > two tokens?

C++ example 2
Splitting text into tokens can be tricky
How should the code below be split?
vector<vector<int> > myVector
– >, > two tokens

Separating tokens
Type          Examples
Comments      /* ignore code */   // ignore until end of line
White spaces  \t \n
Lexemes that are recognized but get consumed rather than transmitted to parser
– Examples: if, i f, i/*comment*/f

Preprocessor directives in C
Type                Examples
Include directives  #include
Macros              #define THE_ANSWER 42

First step of designing a scanner
Define each type of lexeme
– Reserved words: var, if, for, while
– Operators: < = ++
– Identifiers: myFunction
– Literals: 123 “hello”
– Annotations: @SuppressWarnings
How can we define lexemes of unbounded length?
– Regular expressions

Agenda
Understand role of lexical analysis in a compiler
– Convert text to stream of tokens
Regular languages reminder
Lexical analysis theory
Scanner generation

Regular languages reminder

Regular languages refresher
Formal languages
– Alphabet = finite set of letters
– Word = sequence of letters
– Language = set of words
Regular languages defined equivalently by
– Regular expressions
– Finite-state automata

Regular expressions
Empty string: ε
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*
– Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*)
– What is this language?
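One way to explore the example language is to test a few words against it. The sketch below uses Python's re module, which is an assumption of this example (the slides do not prescribe a tool); in_language is a hypothetical helper name.

```python
import re

# (0* 1*) | (1* 0*): a block of zeros followed by a block of ones,
# or a block of ones followed by a block of zeros.
PATTERN = re.compile(r"(0*1*)|(1*0*)")

def in_language(word: str) -> bool:
    """Return True if the whole word matches the expression."""
    return PATTERN.fullmatch(word) is not None

for w in ["", "0011", "1100", "000", "010", "101"]:
    print(repr(w), in_language(w))
```

Words that switch between the two letters more than once, such as 010, fall outside the language.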

Exercise 1 - Question
Language of Java identifiers
– Identifiers start with either an underscore ‘_’ or a letter
– Continue with either underscore, letter, or digit

Exercise 1 - Answer
– (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Exercise 1 – Better answer
– Using shorthand macros
First = _|a|b|…|z|A|…|Z
Next = First|0|…|9
R = First Next*
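The shorthand macros translate directly into composed pattern strings. This is a minimal sketch using Python's re module (an assumption of this example); it follows the simplified definition on the slide, whereas real Java identifiers also allow characters such as '$'. The names FIRST, NEXT, IDENT and is_identifier are illustrative.

```python
import re

FIRST = r"[_a-zA-Z]"                      # First = _|a|...|z|A|...|Z
NEXT = rf"(?:{FIRST}|[0-9])"              # Next  = First|0|...|9
IDENT = re.compile(rf"{FIRST}{NEXT}*")    # R     = First Next*

def is_identifier(word: str) -> bool:
    """Return True if the whole word is an identifier per the slide's definition."""
    return IDENT.fullmatch(word) is not None

for w in ["foo", "_x1", "myFunction", "1abc", ""]:
    print(repr(w), is_identifier(w))
```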

Exercise 2 - Question
Language of rational numbers in decimal representation (no leading, ending zeros)
– Positive examples: 0, 123.757, .933333, 0.7
– Negative examples: 007, 0.30
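The transcript does not include an answer for this exercise. One possible answer, offered only as a sketch, is checked below against the slide's examples with Python's re module. The decomposition into INT_PART and FRAC_PART is an assumption of this example.

```python
import re

INT_PART = r"(?:0|[1-9][0-9]*)"   # a single 0, or digits with no leading zero
FRAC_PART = r"\.[0-9]*[1-9]"      # a dot, then digits with no trailing zero
# Integer part with optional fraction, or a fraction on its own.
RATIONAL = re.compile(rf"{INT_PART}(?:{FRAC_PART})?|{FRAC_PART}")

def is_rational(word: str) -> bool:
    """Return True if the whole word matches the candidate expression."""
    return RATIONAL.fullmatch(word) is not None

for w in ["0", "123.757", ".933333", "0.7", "007", "0.30"]:
    print(w, is_rational(w))
```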

Exercise 3 - Question
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …

Exercise 3 - Answer
Not regular
Context-free
Grammar: S ::= [] | [S]
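Although no regular expression describes this language, the grammar itself suggests a short recursive recognizer. The sketch below mirrors the two productions; derives_s is a hypothetical name.

```python
def derives_s(word: str) -> bool:
    """Return True if word can be derived from S ::= [] | [S]."""
    if word == "[]":                       # production S ::= []
        return True
    return (len(word) > 2                  # production S ::= [S]
            and word[0] == "[" and word[-1] == "]"
            and derives_s(word[1:-1]))

for w in ["[]", "[[]]", "[[[]]]", "[[]", "[][]", ""]:
    print(repr(w), derives_s(w))
```

The recursion depth grows with n, which is the unbounded memory that a finite automaton lacks.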

Finite automata

Finite automata: known results
Types of finite automata:
– Deterministic (DFA)
– Non-deterministic (NFA)
– Non-deterministic + epsilon transitions
Theorem: translation of regular expressions to NFA+epsilon (linear time)
Theorem: translation of NFA+epsilon to DFA
– Worst-case exponential time
Theorem [Myhill-Nerode]: DFA can be minimized

Finite automata
An automaton M = ⟨Q, Σ, δ, q0, F⟩ is defined by states and transitions
[Figure: automaton with a start state, an accepting state, and transitions labeled a, b, b, c]

Automaton running example
Words are read left-to-right
Input: cba
Result: word accepted
[Figure: automaton with transitions labeled a, b, b, c]

Word outside of language 1
Input: cbb
Missing transition means non-acceptance
[Figure: automaton with transitions labeled a, b, b, c]

Word outside of language 2
Input: bba
Final state is not an accepting state
[Figure: automaton with transitions labeled a, b, b, c]

Exercise - Question
What is the language defined by the automaton below?
[Figure: automaton with transitions labeled a, b, b, c]

Exercise - Answer
– a b* c
– Generally: all paths leading to accepting states
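Running a deterministic automaton amounts to following one transition per letter. The sketch below encodes a DFA for a b* c as a dictionary; the state names q0, q1, q2 and the exact transition table are assumptions, since the figure is not part of the transcript. It also assumes the slides draw the input tape with the first letter nearest the automaton, so that the tape shown as cba denotes the word abc.

```python
def run_dfa(word, delta, start, accepting):
    """Follow one transition per letter; reject on a missing transition."""
    state = start
    for letter in word:
        if (state, letter) not in delta:
            return False          # missing transition means non-acceptance
        state = delta[(state, letter)]
    return state in accepting     # must end in an accepting state

# Hypothetical DFA for a b* c.
DELTA = {("q0", "a"): "q1", ("q1", "b"): "q1", ("q1", "c"): "q2"}

for w in ["abc", "ac", "bbc", "abb"]:
    print(w, run_dfa(w, DELTA, "q0", {"q2"}))
```

Here bbc fails on a missing transition, and abb ends in a state that is not accepting, matching the two kinds of rejection shown on the slides.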

A little about me
Joined Ben-Gurion University in 2012
Research interests
– Future high-level languages
– Advanced synthesis techniques
– Language-supported parallelism
– Static analysis and verification

I am here for
Teaching you theory and practice of popular compiler algorithms
– Hopefully make you think about solving problems by examples from the compilers world
– Answering questions about material
Contacting me
– e-mail: romanm@cs.bgu.ac.il
– Office hours: see course web page
Announcements
Forums (per assignment)

Tentative syllabus
Front End
– Scanning
– Top-down Parsing (LL)
– Bottom-up Parsing (LR)
Intermediate Representation
– Operational Semantics
– Lowering
Optimizations
– Dataflow Analysis
– Loop Optimizations
Code Generation
– Register Allocation
– Instruction Selection
Mid-term exam

Nondeterministic Finite automata

Non-deterministic automata
Allow multiple transitions from given state labeled by same letter
[Figure: NFA with transitions labeled a, a, b, c, b, c]

NFA run example
Input: cba
Maintain set of states
Accept word if any of the states in the set is accepting
[Figure: NFA with transitions labeled a, a, b, c, b, c]

NFA+ε automata
ε transitions can “fire” without reading the input
[Figure: NFA+ε with transitions labeled a, b, c, ε]

NFA+ε run example
Input: cba
Now ε transition can non-deterministically take place
Word accepted
[Figure: NFA+ε with transitions labeled a, b, c, ε]
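The run examples can be sketched as a simulation that maintains a set of current states and extends it along ε transitions after every step. The automaton below is hypothetical (the slide's figure is not in the transcript): it has two a-transitions from the start state and one ε transition, and it accepts a b* c. EPS, eps_closure and run_nfa are illustrative names.

```python
EPS = None  # label used for epsilon transitions (an assumption of this sketch)

def eps_closure(states, delta):
    """All states reachable from `states` using only epsilon transitions."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return closure

def run_nfa(word, delta, start, accepting):
    """Maintain the set of possible states; accept if any of them is accepting."""
    current = eps_closure({start}, delta)
    for letter in word:
        moved = set()
        for s in current:
            moved |= delta.get((s, letter), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

# Hypothetical NFA+epsilon for a b* c.
DELTA = {
    (0, "a"): {1, 2},   # two transitions from state 0 labeled by the same letter
    (1, "b"): {1},
    (1, EPS): {2},      # can fire without reading input
    (2, "c"): {3},
}

for w in ["ac", "abbc", "ab", "bc"]:
    print(w, run_nfa(w, DELTA, 0, {3}))
```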

Reg-exp vs. automata
Regular expressions are declarative
– Offer compact way to define a regular language by humans
– Don’t offer direct way to check whether a given word is in the language
Automata are operative
– Define an algorithm for deciding whether a given word is in a regular language
– Not a natural notation for humans

From Regular expressions to NFA

From reg. exp. to NFA+ε automata
Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
Proof: by induction on the structure of the regular expression
– For each sub-expression R we build an automaton with exactly one start state and one accepting state
– Start state has no incoming transitions
– Accepting state has no outgoing transitions

Inductive constructions
[Figures: NFA+ε fragment built for each form of regular expression]
– R = ε
– R = a
– R1 | R2
– R1 R2
– R*
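The figures for the five cases are not in the transcript. The sketch below shows one standard way to realize them, a Thompson-style construction, and is not necessarily the exact wiring drawn on the slides. Each fragment is a tuple (start, accept, edges); the helper names lit, eps, union, concat, star and accepts are hypothetical.

```python
import itertools

EPS = None                      # label for epsilon edges
_fresh = itertools.count()      # generator of new state ids

def lit(a):                     # R = a
    s, t = next(_fresh), next(_fresh)
    return (s, t, [(s, a, t)])

def eps():                      # R = epsilon
    return lit(EPS)

def union(f, g):                # R1 | R2: new start and accept around both
    s, t = next(_fresh), next(_fresh)
    return (s, t, f[2] + g[2] + [(s, EPS, f[0]), (s, EPS, g[0]),
                                 (f[1], EPS, t), (g[1], EPS, t)])

def concat(f, g):               # R1 R2: link accept of R1 to start of R2
    return (f[0], g[1], f[2] + g[2] + [(f[1], EPS, g[0])])

def star(f):                    # R*: allow skipping R or repeating it
    s, t = next(_fresh), next(_fresh)
    return (s, t, f[2] + [(s, EPS, f[0]), (f[1], EPS, t),
                          (s, EPS, t), (f[1], EPS, f[0])])

def accepts(frag, word):
    """Simulate the fragment as an NFA+epsilon on the whole word."""
    start, accept, edges = frag
    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            s = todo.pop()
            for src, label, dst in edges:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen
    current = closure({start})
    for letter in word:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == letter})
    return accept in current

# a b* c, built bottom-up from its sub-expressions.
ABC = concat(lit("a"), concat(star(lit("b")), lit("c")))
print(accepts(ABC, "abbc"), accepts(ABC, "ab"))
```

Every combinator adds at most two states, so an expression of length k yields O(k) states, and each result keeps a start state with no incoming edges and an accepting state with no outgoing edges.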

Running time of NFA+ε
Construction requires O(k) states for a reg-exp of length k
Running an NFA+ε with k states on string of length n takes O(n·k²) time
– Each state in a configuration of O(k) states may have O(k) outgoing edges, so processing an input letter may take O(k²) time
– Can we reduce the k² factor?

From NFA+ε to DFA
Construction requires O(k) states for a reg-exp of length k
Running an NFA+ε with k states on string of length n takes O(n·k²) time
– Can we reduce the k² factor?
Theorem: for any NFA+ε automaton there exists an equivalent deterministic automaton
Proof: determinization via subset construction
– Number of states in the worst-case O(2^k)
– Running time O(n)

NFA determinization

Subset construction
For an NFA+ε with states M = {s1, …, sk}
Construct a DFA with one state per set of states of the corresponding NFA
– M’ = { [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], … }
Simulate transitions between individual states for every letter
[Figure: NFA+ε transitions from s1 to s2 and from s4 to s7, both labeled a, become the DFA transition from [s1,s4] to [s2,s7] labeled a]

Handling epsilon transitions
Extend macro states by states reachable via ε transitions
[Figure: with an ε transition from s1 to s4 in the NFA+ε, the DFA macro state [s1,s2] is extended to [s1,s2,s4]]
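Subset construction and ε-extension of macro states can be sketched together as a worklist algorithm that only builds the macro states reachable from the start. The NFA+ε used here is a hypothetical one for a b* c; determinize, eps_closure and dfa_accepts are illustrative names.

```python
from collections import deque

EPS = None  # label for epsilon transitions

def eps_closure(states, delta):
    """Extend a set of NFA states by everything reachable via epsilon."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return frozenset(closure)

def determinize(delta, start, accepting, alphabet):
    """Build the reachable part of the DFA whose states are sets of NFA states."""
    start_set = eps_closure({start}, delta)
    dfa_delta, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        macro = queue.popleft()
        if macro & accepting:
            dfa_accepting.add(macro)
        for letter in alphabet:
            moved = set()
            for s in macro:
                moved |= delta.get((s, letter), set())
            if not moved:
                continue                      # leave the transition missing
            target = eps_closure(moved, delta)
            dfa_delta[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_accepting

def dfa_accepts(word, dfa_delta, start_set, dfa_accepting):
    state = start_set
    for letter in word:
        if (state, letter) not in dfa_delta:
            return False
        state = dfa_delta[(state, letter)]
    return state in dfa_accepting

# Hypothetical NFA+epsilon for a b* c.
NFA = {(0, "a"): {1, 2}, (1, "b"): {1}, (1, EPS): {2}, (2, "c"): {3}}
DFA = determinize(NFA, 0, {3}, "abc")
print(dfa_accepts("abbc", *DFA), dfa_accepts("ab", *DFA))
```

Building only reachable macro states often keeps the DFA far smaller than the 2^k worst case.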

Recap
We know how to define any single type of lexeme
We know how to convert any regular expression into a recognizing automaton
But is this enough for scanning?

Designing a scanner

Scanning challenges
Regular expressions allow us to recognize whether a given text is a sequence of legal lexemes
– Define the language of all sequences of lexemes
Automata theory provides an algorithm for checking membership of words
– But we are interested in tokenization: splitting the text, not just deciding on membership
1. How do we split the text into lexemes?
2. How do we handle ambiguities (lexemes matching more than one token)?

Challenge 1: determine partitioning
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
What should the output be?
1. ID(a), ID(b), ID(b), ONE
2. ID(a), ID(bb), ONE
3. ID(ab), ID(b), ONE
4. ID(abb), ONE

Solution: maximal munch policy
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we return ID(abb), ONE?
Solution: find longest matching lexeme
– Automaton may enter and leave accepting state many times before longest match is found

Challenge 2: handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Figure: DFA starting at q0; the state reached after reading if is labeled ID, IF, the other accepting states are labeled ID]

Solution: precedence
Break tie using order of definitions
– Output: ID(if)

Handling ambiguities
IF = if
ID = (a|b|…|z) (a|b|…|z)*
Input: if
Matches both tokens
What should the scanner output?
Break tie using order of definitions
– Output: IF
Conclusion: list keyword token definitions before identifier definition
[Figure: DFA starting at q0; the state reached after reading if is labeled IF, ID, the other accepting states are labeled ID]
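Both policies, longest match and tie-breaking by definition order, fit in a few lines when each rule is tried at the current position. This sketch leans on Python's re module instead of a hand-built DFA, which is an assumption of the example; for these simple patterns a greedy match is also the longest one. RULES and tokenize are hypothetical names.

```python
import re

# Order matters: keywords are listed before identifiers.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z]+")),
    ("ONE", re.compile(r"1")),
]

def tokenize(text):
    """Split text into (kind, lexeme) pairs using maximal munch and rule order."""
    tokens, pos = [], 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_kind, best_len = kind, len(m.group())
        if best_kind is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("abb1"))   # [('ID', 'abb'), ('ONE', '1')]
print(tokenize("if"))     # [('IF', 'if')]
print(tokenize("iff"))    # [('ID', 'iff')]
```

The input iff shows the two policies interacting: maximal munch prefers the longer identifier, so precedence only decides between matches of equal length.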

Putting the algorithm pieces together

Scanner construction recap
R1 … Rk: list of regular expressions (one per lexeme)
→ NFA+ε for R1 | … | Rk
→ DFA for R1 | … | Rk
→ minimization
→ Scanner implementing maximal munch and tie breaking policy
How do we implement maximal munch?

Maximal munch algorithm

Maximal munch scanning algorithm
Input:
– input: string of n characters
– M: DFA for union of tokens
Output: positions in input that are the final characters of each token
Data:
– Stack of ⟨state, index⟩ pairs: states and their positions encountered since last accepting state
– i: index of next character in input
– q: current state or Bottom (no state)

Maximal munch pseudo-code
[Figure: pseudo-code listing, not included in the transcript]
Annotations on the listing:
– Used to indicate an error situation (no token is found)
– Reset DFA to look for next token
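Since the listing itself is missing from the transcript, here is a reconstruction sketch of the stack-based scanning loop described on the previous slide, not the author's exact pseudo-code. It uses 0-based indices internally and reports 1-based positions of final characters. The DFA is an assumption derived from R1 = a and R2 = a*b, with accepting states q1 and q2.

```python
BOTTOM = None  # sentinel meaning "no state": reaching it signals an error

def maximal_munch(text, delta, start, accepting):
    """Return the 1-based position of the final character of each token."""
    ends, i, n = [], 0, len(text)
    while i < n:
        q = start                          # reset DFA to look for next token
        stack = [(BOTTOM, i)]
        # Scan forward as far as the DFA allows.
        while i < n and (q, text[i]) in delta:
            if q in accepting:
                stack = []                 # older entries are no longer needed
            stack.append((q, i))
            q = delta[(q, text[i])]
            i += 1
        # Back up to the most recent accepting state.
        while q not in accepting:
            q, i = stack.pop()
            if q is BOTTOM:
                raise ValueError(f"no token found at position {i + 1}")
        ends.append(i)
    return ends

# Assumed DFA for a | a*b.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q3", ("q1", "b"): "q2",
    ("q3", "a"): "q3", ("q3", "b"): "q2",
}
ACCEPTING = {"q1", "q2"}

print(maximal_munch("aaa", DELTA, "q0", ACCEPTING))  # [1, 2, 3]
print(maximal_munch("aab", DELTA, "q0", ACCEPTING))  # [3]
```

On aaa every token requires scanning to the end of the input and backing up, which is the behavior traced in the run example.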

Maximal munch run example
Assume R1 = a, R2 = a*b, input = aaa
[Figure: DFA with states q0, q1, q2, q3 and transitions labeled a and b]

First token:
q    i    stack
q0   1    ⟨B,1⟩
q1   2    ⟨B,1⟩ ⟨q0,1⟩
q3   3    ⟨q1,2⟩
q3   4    ⟨q1,2⟩ ⟨q3,3⟩
q3   3    ⟨q1,2⟩
q1   2
Output = 1

Second token:
q    i    stack
q0   2    ⟨B,2⟩
q1   3    ⟨B,2⟩ ⟨q0,2⟩
q3   4    ⟨q1,3⟩
q1   3
Output = 1 2

Third token:
q    i    stack
q0   3    ⟨B,3⟩
q1   4    ⟨B,3⟩ ⟨q0,3⟩
Output = 1 2 3

Complexity of maximal munch
What is the complexity of tokenizing a text of n characters by matching longest tokens?
Assume the following token classes
R1 = a
R2 = a*b
For input a^n it is O(n²)
[Figure: n scans over the input aaa…a, each up to n characters long]
Can we improve the worst-case complexity?

Improved scanning algorithm
Idea: use work done on “leftover” stack to improve future decisions
Remember for each index which states have failed (cannot be extended to a token)
“Maximal-Munch” Tokenization in Linear Time, Tom Reps [TOPLAS 1998]

Improved algorithm pseudo-code
[Figure: pseudo-code listing, not included in the transcript]
How many times can this test fail for a given index?
What is the running time?
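The listing is missing from the transcript, so the following is only a sketch of the memoization idea stated on the previous slide, not a transcription of the paper's pseudo-code. It records every (state, index) pair that was popped during backtracking and stops a later scan as soon as it reaches such a pair. The DFA is the assumed one for R1 = a and R2 = a*b.

```python
BOTTOM = None  # sentinel meaning "no state"

def maximal_munch_memo(text, delta, start, accepting):
    """Maximal munch that never re-explores a (state, index) pair known to fail."""
    failed = set()                         # pairs that cannot be extended to a token
    ends, i, n = [], 0, len(text)
    while i < n:
        q = start
        stack = [(BOTTOM, i)]
        # Stop scanning early when (q, i) is already known to fail.
        while i < n and (q, i) not in failed and (q, text[i]) in delta:
            if q in accepting:
                stack = []
            stack.append((q, i))
            q = delta[(q, text[i])]
            i += 1
        while q not in accepting:
            failed.add((q, i))             # remember the dead end
            q, i = stack.pop()
            if q is BOTTOM:
                raise ValueError(f"no token found at position {i + 1}")
        ends.append(i)
    return ends

DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q3", ("q1", "b"): "q2",
    ("q3", "a"): "q3", ("q3", "b"): "q2",
}
ACCEPTING = {"q1", "q2"}

print(maximal_munch_memo("aaaa", DELTA, "q0", ACCEPTING))  # [1, 2, 3, 4]
```

Each (state, index) pair can be marked as failed at most once, so the total backtracking work is bounded by the number of DFA states times n.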

Agenda
Understand role of lexical analysis in a compiler
– Convert text to stream of tokens
Regular languages reminder
Lexical analysis theory
– Precedence + Maximal munch + algorithms
Scanner generation

Implementing a scanner

Implementing modern scanners
Manual construction of automata + determinization + maximal munch + tie breaking
– Very tedious
– Error-prone
– Non-incremental
Fortunately there are tools that automatically generate robust code from a specification for most languages
– C: Lex, Flex
– Java: JLex, JFlex

Using JFlex
Define tokens (and states)
Run JFlex to generate Java implementation
Usually MyScanner.nextToken() will be called in a loop by parser
[Figure: Lexical Specification (MyScanner.lex) → JFlex → MyScanner.java; Stream of characters → MyScanner.java → Tokens]

Filtering illegal combinations
Which tokens should the scanner return for “123foo”?
– We sometimes want to rule out certain token concatenations prior to parsing
– How can we do that with what we’ve seen so far?
Define “error” lexemes

Catching errors
What if input doesn’t match any token definition?
– Want to gracefully signal an error
Trick: add a “catch-all” rule that matches any character and reports an error
– Add after all other rules
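Both tricks, “error” lexemes and a final catch-all rule, can be sketched with an ordered rule list and longest-match selection. The example uses Python's re module rather than a JFlex specification; the rule names BAD_NUM and UNKNOWN and the function scan are hypothetical.

```python
import re

RULES = [
    ("NUM", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("BAD_NUM", re.compile(r"[0-9]+[a-zA-Z_]+")),   # "error" lexeme such as 123foo
    ("UNKNOWN", re.compile(r".", re.DOTALL)),        # catch-all, listed last
]
ERROR_KINDS = {"BAD_NUM", "UNKNOWN"}

def scan(text):
    """Return (kind, lexeme) pairs; error kinds mark input to be reported."""
    tokens, pos = [], 0
    while pos < len(text):
        kind, lexeme = None, ""
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Longest match wins; earlier rules win ties.
            if m and len(m.group()) > len(lexeme):
                kind, lexeme = name, m.group()
        tokens.append((kind, lexeme))
        pos += len(lexeme)      # the catch-all guarantees progress
    return tokens

for kind, lexeme in scan("123foo"):
    status = "error" if kind in ERROR_KINDS else "ok"
    print(kind, repr(lexeme), status)
```

Because the catch-all matches only one character and comes last, it wins only when no other rule matches at that position.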

Next lecture: parsing

Scanning exercise
You are given the following lexeme:
– RAT: (1-9)(0-9)*. (0-9)*(1-9) | 0. (0-9)*(1-9)
Construct the corresponding scanner automaton
Run it on the inputs
– 1.23.2
– 1.230.2

Question from 2015 midterm
