Fall Compiler Principles Lecture 1: Lexical Analysis


1 Fall 2016-2017 Compiler Principles Lecture 1: Lexical Analysis
Roman Manevich Ben-Gurion University of the Negev

2 Agenda
Understand role of lexical analysis in a compiler
Regular languages reminder
Lexical analysis algorithms
Scanner generation

3 JavaScript example
Can you identify some basic units in this code?
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";

4 JavaScript example
Can you identify some basic units in this code?
Same code as on slide 3, annotated with callouts: keyword, ?, ?, ?, ?, ?, ?, ?

5 JavaScript example
Can you identify some basic units in this code?
Same code as on slide 3, annotated with callouts: keyword, identifier, operator, numeric literal, punctuation, comment, string literal, whitespace

6 Role of lexical analysis
First part of compiler front-end
Convert stream of characters into stream of tokens
Split text into most basic meaningful strings
Simplify input for syntax analysis
Figure (compiler pipeline): High-level Language (Scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code

7 From scanning to parsing
Program text: 59 + (1257 * xPosition)
Lexical Analyzer: reports a lexical error, or produces a valid token stream: num + ( num * id )
Grammar:
  E → id
  E → num
  E → E + E
  E → E * E
  E → ( E )
Parser: reports a syntax error, or produces a valid Abstract Syntax Tree (figure: tree with nodes +, num, *, x)

8 Scanner output
Where is the white space?
Input: the JavaScript code from slide 3
Stream of tokens, in the form LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...

9 Tokens

10 What is a token?
Lexeme: substring of original text constituting an identifiable unit
  Identifiers, values, reserved words, …
Token: record type storing
  Kind
  Value (when applicable)
  Start-position/end-position
  Any information that is useful for the parser
Different for different languages

11 Example tokens
Type         Examples
Identifier   x, y, z, foo, bar
NUM          42
FLOATNUM
STRING       "so long, and thanks for all the fish"
LPAREN       (
RPAREN       )
IF           if

12 C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?
  vector<vector<int>> myVector
Is >> a single operator token, or two tokens >, >?

13 C++ example 2
Splitting text into tokens can be tricky
How should the code below be split?
  vector<vector<int> > myVector
Here > and > are two tokens

14 Separating tokens
Type           Examples
Comments       /* ignore code */
               // ignore until end of line
White spaces   \t \n
Lexemes that are recognized but get consumed rather than transmitted to parser
Examples: if, i f, i/*comment*/f

15 Preprocessor directives in C
Type                 Examples
Include directives   #include<foo.h>
Macros               #define THE_ANSWER 42

16 First step of designing a scanner
Define each type of lexeme
  Reserved words: var, if, for, while
  Operators: < = ++
  Identifiers: myFunction
  Literals: 123 "hello"
How can we define lexemes of unbounded length?

17 First step of designing a scanner
Define each type of lexeme
  Reserved words: var, if, for, while
  Operators: < = ++
  Identifiers: myFunction
  Literals: 123 "hello"
How can we define lexemes of unbounded length?
  Regular expressions

18 Agenda
Understand role of lexical analysis in a compiler
  Convert text to stream of tokens
Regular languages reminder
Lexical analysis algorithms
Scanner generation

19 Regular languages reminder

20 Basic definitions and facts
Formal languages
  Alphabet = finite set of letters
  Word = sequence of letters
  Language = set of words
Regular languages are defined equivalently by
  Regular expressions
  Finite-state automata

21 Regular expressions
Empty string: ε
Letter: a1, …, ak ∈ Alphabet
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*
Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*)
What is this language?
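One way to explore the example is to try words against it. The sketch below is an illustration rather than part of the lecture: it transcribes the expression into Python's `re` syntax, and the helper name `in_language` is made up here. Experimenting suggests that the language contains the words that switch between 0 and 1 at most once.

```python
import re

# (0* 1*) | (1* 0*) written in Python's regex syntax.
PATTERN = re.compile(r"(?:0*1*)|(?:1*0*)")

def in_language(word: str) -> bool:
    """Return True if the whole word matches the expression."""
    return PATTERN.fullmatch(word) is not None

for w in ["", "0011", "1100", "000", "010", "101"]:
    print(repr(w), in_language(w))
```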

22 Exercise 1 - Question Language of Java identifiers
Identifiers start with either an underscore ‘_’ or a letter Continue with either underscore, letter, or digit

23 Exercise 1 - Answer
Language of Java identifiers
Identifiers start with either an underscore ‘_’ or a letter
Continue with either underscore, letter, or digit
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

24 Exercise 1 – Better answer
Language of Java identifiers
Identifiers start with either an underscore ‘_’ or a letter
Continue with either underscore, letter, or digit
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros:
  First = _|a|b|…|z|A|…|Z
  Next = First|0|…|9
  R = First Next*
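The shorthand-macro style carries over directly to code. The following is a minimal sketch that builds the expression from the slide's First and Next macros using Python's `re` module. The names `FIRST`, `NEXT`, `IDENT` and `is_identifier` are illustrative, and the sketch follows the slide's simplified definition (real Java identifiers also allow `$` and many Unicode letters).

```python
import re

FIRST = r"[_a-zA-Z]"               # First = _|a|…|z|A|…|Z
NEXT = rf"(?:{FIRST}|[0-9])"       # Next  = First|0|…|9
IDENT = re.compile(rf"{FIRST}{NEXT}*")  # R = First Next*

def is_identifier(word: str) -> bool:
    """True if the whole word is an identifier under the slide's definition."""
    return IDENT.fullmatch(word) is not None

for w in ["x", "_tmp1", "myFunction", "1abc", ""]:
    print(repr(w), is_identifier(w))
```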

25 Exercise 2 - Question
Language of rational numbers in decimal representation (no leading or trailing zeros)
Positive examples: 0.7
Negative examples: 007, 0.30

26 Exercise 2 - Answer
Language of rational numbers in decimal representation (no leading or trailing zeros)
Digit = 1|2|…|9
Digit0 = 0|Digit
Num = Digit Digit0*
Frac = Digit0* Digit
Pos = Num | .Frac | 0.Frac | Num.Frac
PosOrNeg = (ε|-)Pos
R = 0 | PosOrNeg
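To check the answer against the examples from the question, the macros can be transcribed one by one. This is a sketch in Python's `re` syntax with illustrative names. The only liberty taken is writing the slide's (ε|-) as an optional minus sign.

```python
import re

DIGIT = r"[1-9]"                    # Digit  = 1|2|…|9
DIGIT0 = r"[0-9]"                   # Digit0 = 0|Digit
NUM = rf"{DIGIT}{DIGIT0}*"          # Num    = Digit Digit0*
FRAC = rf"{DIGIT0}*{DIGIT}"         # Frac   = Digit0* Digit
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
POS_OR_NEG = rf"-?{POS}"            # (ε|-)Pos
R = re.compile(rf"(?:0|{POS_OR_NEG})")

def is_rational(word: str) -> bool:
    """True if the whole word is in the language R."""
    return R.fullmatch(word) is not None

for w in ["0.7", "0", "42", "-3.5", "007", "0.30"]:
    print(repr(w), is_rational(w))
```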

27 Exercise 3 - Question
Equal number of opening and closing brackets: [^n ]^n = [], [[]], [[[]]], …

28 Exercise 3 - Answer
Equal number of opening and closing brackets: [^n ]^n = [], [[]], [[[]]], …
Not regular
Context-free
Grammar: S ::= [] | [S]
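The grammar can be read as a recursive recognizer, which makes the need for unbounded memory (the recursion depth) visible. The function below is a small sketch with a made-up name, not something from the slides.

```python
def derives(word: str) -> bool:
    """Recognize S ::= [] | [S] by peeling one bracket pair per recursive call."""
    if word == "[]":
        return True
    return (
        len(word) >= 4
        and word[0] == "["
        and word[-1] == "]"
        and derives(word[1:-1])
    )

for w in ["[]", "[[]]", "[[[]]]", "[[]", "[][]", ""]:
    print(repr(w), derives(w))
```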

29 Finite automata

30 Finite automata: known results
Types of finite automata:
  Deterministic (DFA)
  Non-deterministic (NFA)
  Non-deterministic + epsilon transitions
Theorem: translation of regular expressions to NFA+epsilon (linear time)
Theorem: translation of NFA+epsilon to DFA
  Worst-case exponential time
Theorem [Myhill-Nerode]: DFA can be minimized

31 Finite automata
An automaton M = ⟨Q, Σ, δ, q0, F⟩ is defined by states and transitions
Figure: example automaton with a start state, an accepting state, and transitions labelled a, b, b, c

32 Exercise - Question
What is the language defined by the automaton below?
Figure: automaton with a start state and transitions labelled a, b, b, c

33 Exercise - Answer
What is the language defined by the automaton below?
a b* c
Generally: all paths leading to accepting states
Figure: automaton with a start state and transitions labelled a, b, b, c
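A table-driven simulation shows how a DFA decides membership. The sketch below encodes one possible DFA for a b* c. The state names q0, q1, q2 are assumptions, since the slide's figure is not preserved in the transcript.

```python
# Transition function delta as a dictionary: (state, letter) -> next state.
# Missing entries mean "no transition", i.e. the word is rejected.
DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",  # self-loop for b*
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}

def accepts(word: str) -> bool:
    """Follow one transition per letter; accept if we end in an accepting state."""
    state = START
    for letter in word:
        state = DELTA.get((state, letter))
        if state is None:
            return False
    return state in ACCEPTING

for w in ["ac", "abc", "abbbc", "ab", "c", ""]:
    print(repr(w), accepts(w))
```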

34 Non-deterministic automata
Allow multiple transitions from a given state labeled by the same letter
Figure: automaton with a start state and transitions labelled a, a, b, b, c, c

35 NFA+ε automata
ε transitions can “fire” without reading the input
Figure: automaton with a start state and transitions labelled a, b, ε

36 A little about me
Joined Ben-Gurion University in 2012
Research interests
  Inductive programming and synthesis
  Static analysis and verification
  Language-supported parallelism

37 I am here for
Teaching you theory and practice of popular compiler algorithms
  Hopefully make you think about solving problems by examples from the compilers world
Answering questions about material
Contacting me
  Office hours: see course web-page
  Announcements
  Forums (per assignment)

38 Tentative syllabus
Front End: Scanning, Top-down Parsing (LL), Bottom-up Parsing (LR)
Intermediate Representation: Operational Semantics, Lowering
Optimizations: Dataflow Analysis, Loop Optimizations
Code Generation: Register Allocation, Energy Optimization, Instruction Selection
mid-term exam

39 Reg-exp vs. automata
Regular expressions are declarative
  A high-level language
  Offer a compact way for humans to define a regular language
  Don’t offer a direct way to check whether a given word is in the language
Automata are operational
  A machine language
  Define an algorithm for deciding whether a given word is in a regular language
  Not a natural notation for humans

40 From Regular expressions to automata

41 From reg. exp. to NFA+ε automata
Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
Proof: by induction on the structure of the regular expression

42 Inductive constructions
Figure: NFA+ε fragments for the base case R = a and for the inductive cases R1 | R2, R1 R2, and R*
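Since the diagrams are not preserved, here is a hedged sketch of one standard way to realize the four cases, in the style of Thompson's construction; it is not necessarily the exact shape drawn on the slide. The tuple-based expression format, the `Fragment` class, and all function names are assumptions made for this illustration.

```python
from itertools import count

EPS = None          # label used for epsilon transitions
_fresh = count()    # generator of fresh state numbers

class Fragment:
    """NFA+epsilon fragment with one start state and one accepting state."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def build(rx):
    """rx is ('sym', a) | ('cat', r1, r2) | ('alt', r1, r2) | ('star', r)."""
    kind = rx[0]
    if kind == "sym":                       # R = a
        s, f = next(_fresh), next(_fresh)
        return Fragment(s, f, [(s, rx[1], f)])
    if kind == "cat":                       # R1 R2
        a, b = build(rx[1]), build(rx[2])
        return Fragment(a.start, b.accept,
                        a.edges + b.edges + [(a.accept, EPS, b.start)])
    if kind == "alt":                       # R1 | R2
        a, b = build(rx[1]), build(rx[2])
        s, f = next(_fresh), next(_fresh)
        extra = [(s, EPS, a.start), (s, EPS, b.start),
                 (a.accept, EPS, f), (b.accept, EPS, f)]
        return Fragment(s, f, a.edges + b.edges + extra)
    if kind == "star":                      # R*
        a = build(rx[1])
        s, f = next(_fresh), next(_fresh)
        extra = [(s, EPS, a.start), (s, EPS, f),
                 (a.accept, EPS, a.start), (a.accept, EPS, f)]
        return Fragment(s, f, a.edges + extra)
    raise ValueError(f"unknown node kind: {kind}")

def eps_closure(states, edges):
    """States reachable from `states` using only epsilon edges."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of currently possible states."""
    current = eps_closure({nfa.start}, nfa.edges)
    for ch in word:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(moved, nfa.edges)
    return nfa.accept in current

# a b* c, written as a nested tuple
ABC = build(("cat", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c"))))
print(accepts(ABC, "abbc"), accepts(ABC, "ab"))
```

Each case adds at most two states, which is where the linear bound on the number of states comes from.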

43 Running time of NFA+ε
Construction requires O(k) states for a reg-exp of length k
Running an NFA+ε with k states on a string of length n takes O(n·k²) time
Can we reduce the k² factor?
Each state in a configuration of O(k) states may have O(k) outgoing edges, so processing an input letter may take O(k²) time

44 From NFA+ε to DFA
Construction requires O(k) states for a reg-exp of length k
Running an NFA+ε with k states on a string of length n takes O(n·k²) time
Can we reduce the k² factor?
Theorem: for any NFA+ε automaton there exists an equivalent deterministic automaton
Proof: determinization via subset construction
  Number of states in the worst case: O(2^k)
  Running time: O(n)
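The subset construction named in the proof can be sketched in a few lines. The dictionary-based NFA encoding, the function names, and the small example NFA+ε for a | a* b below are all assumptions for illustration. Each DFA state is a frozen set of NFA states.

```python
from collections import deque

def eps_closure(states, eps):
    """All states reachable from `states` using only epsilon transitions."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return frozenset(seen)

def determinize(start, accepting, delta, eps, alphabet):
    """Return (start_set, accepting_sets, dfa_delta) for the given NFA+epsilon."""
    start_set = eps_closure({start}, eps)
    dfa_delta, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        subset = queue.popleft()
        if subset & accepting:
            dfa_accepting.add(subset)
        for letter in alphabet:
            moved = set()
            for q in subset:
                moved |= delta.get((q, letter), set())
            target = eps_closure(moved, eps)
            if not target:
                continue  # no transition: the DFA gets stuck here
            dfa_delta[(subset, letter)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start_set, dfa_accepting, dfa_delta

def run(dfa, word):
    """One dictionary lookup per input letter: O(n) overall."""
    state, accepting, delta = dfa
    for ch in word:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting

# Hypothetical NFA+epsilon for a | a* b: state 0 branches by epsilon to 1 and 3.
EPS = {0: {1, 3}}
DELTA = {(1, "a"): {2}, (3, "a"): {3}, (3, "b"): {4}}
DFA = determinize(0, {2, 4}, DELTA, EPS, "ab")
print(run(DFA, "a"), run(DFA, "aab"), run(DFA, "aa"))
```

For this input the construction yields four reachable DFA states; the exponential blow-up only shows up on adversarial expressions.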

45 Recap We know how to define any single type of lexeme
We know how to convert any regular expression into a recognizing automaton But how do we use this for scanning?

46 The formal scanning problem

47 What is a scanner
Input text: var currOption = 0; // Choose content function choose ( id ) { ...
Lexical Specification: list of regular expressions (one per lexeme) R1 … Rk
Figure: input text and lexical specification → Scanner → stream of tokens
Stream of tokens, in the form LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
...

48 Scanning problem
Input:
  Lexical specification: R1, …, Rk (regular expressions, one per lexeme)
  input: string of n characters
Output: sequence of tokens R1(lex1) … Rn(lexn) such that
  The lexemes partition the input: lex1 … lexn = input
  R1 … Rn match the lexeme type from the specification

49 Example 1: partitioning
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
What should the output be?
  ID(a) ID(b) ID(b) ONE    (first match semantics)
  ID(a) ID(bb) ONE
  ID(ab) ID(b) ONE
  ID(abb) ONE              (maximal munch semantics)

50 Maximal munch semantics
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we return ID(abb) ONE?
Solution: find longest matching lexeme
Intuition: some tokens, such as identifiers, are prefix-closed
Automaton may enter and leave accepting state many times before longest match is found
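As a quick illustration of longest-match behaviour, the sketch below tries every rule at the current position and keeps the one with the furthest end. It leans on Python's `re` engine instead of an explicit automaton, which is an assumption of this example: for these two simple patterns, greedy matching returns each rule's longest match.

```python
import re

# Rules from the slide, in Python regex syntax.
RULES = [("ID", re.compile(r"[a-z][a-z]*")), ("ONE", re.compile(r"1"))]

def scan(text):
    """Split text into (kind, lexeme) pairs, always taking the longest lexeme."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (end, kind) of the longest match found so far
        for kind, rx in RULES:
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[0]):
                best = (m.end(), kind)
        if best is None:
            raise ValueError(f"lexical error at index {pos}")
        end, kind = best
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens

print(scan("abb1"))
```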

51 Example 2: handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output be?
DFA (figure): start state q0; edge labels i, f, a-z\i, a-z\f, a-z; accepting states labelled ID and ID, IF

52 Solution: precedence semantics
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output be?
Break tie using order of definitions
Output: ID(if)
DFA (figure): start state q0; edge labels i, f, a-z\i, a-z\f, a-z; accepting states labelled ID and ID, IF

53 Solution: precedence semantics
IF = if
ID = (a|b|…|z) (a|b|…|z)*
Input: if
Matches both tokens
What should the scanner output be?
Break tie using order of definitions
Output: IF
Conclusion: list keyword token definitions before identifier definition
DFA (figure): start state q0; edge labels i, f, a-z\i, a-z\f, a-z; accepting states labelled ID and IF, ID
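The effect of definition order can be seen by classifying a single lexeme against an ordered rule list. This is a small sketch with illustrative names: the first rule that matches the whole lexeme wins.

```python
import re

ID = ("ID", r"[a-z][a-z]*")
IF = ("IF", r"if")

def classify(lexeme, rules):
    """Return the kind of the first rule (in definition order) matching the lexeme."""
    for kind, pattern in rules:
        if re.fullmatch(pattern, lexeme):
            return kind
    return None

print(classify("if", [ID, IF]))    # identifier rule listed first
print(classify("if", [IF, ID]))    # keyword rule listed first
print(classify("iffy", [IF, ID]))  # only ID matches the whole lexeme
```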

54 Putting together an algorithm

55 Overall algorithm structure
Figure (pipeline): R1 … Rk, list of regular expressions (one per lexeme) → NFA+ε for R1 | … | Rk → DFA for R1 | … | Rk → minimization → Scanner implementation (efficient data structures)
Crucial: assign semantics
How do we implement maximal munch?

56 A First match algorithm

57 First match algorithm Suggestions? What is the complexity?
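One possible answer to the slide's question, offered as a sketch: run the DFA from the current position and emit a token as soon as an accepting state is reached. The DFA encoding and names below are assumptions, using the ID and ONE rules from the partitioning example. Every character is read exactly once, so the running time is O(n).

```python
import string

# Hypothetical DFA for ID = (a|…|z)(a|…|z)* and ONE = 1.
DELTA = {("start", c): "id" for c in string.ascii_lowercase}
DELTA[("start", "1")] = "one"
ACCEPTING = {"id": "ID", "one": "ONE"}  # accepting state -> token kind

def first_match_scan(text):
    """Emit a token the first time the DFA enters an accepting state."""
    tokens, i = [], 0
    while i < len(text):
        q, j, kind = "start", i, None
        while j < len(text) and kind is None:
            q = DELTA.get((q, text[j]))
            if q is None:
                raise ValueError(f"lexical error at index {j}")
            j += 1
            kind = ACCEPTING.get(q)
        if kind is None:
            raise ValueError("input ended in the middle of a token")
        tokens.append((kind, text[i:j]))
        i = j
    return tokens

print(first_match_scan("abb1"))
```

Under these rules the identifier abb is split into three one-letter identifiers, which is why first match is rarely what a language designer wants.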

58 A Maximal munch algorithm

59 Maximal munch scanning algorithm
Input:
  input: string of n characters
  M: DFA for union of tokens
Output: positions in input that are the final characters of each token
Data:
  Stack of (state, index) pairs: the states and their positions encountered since the last accepting state
  i: index of next character in input
  q: current state or Bottom (no state)

60 Maximal munch pseudo-code
Callouts on the pseudo-code:
  Reset DFA to look for next token
  Bottom is used to indicate an error situation (no token is found)
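The pseudo-code itself is not preserved in this transcript. The sketch below is a reconstruction of the usual stack-based backtracking scheme from the data described on the previous slide (stack of state/index pairs, i, q, Bottom); it should not be read as the slide's exact code. Indices are 0-based here, so the reported token ends are exclusive, and the DFA encodes R1 = a, R2 = a* b from the run example with assumed transitions.

```python
BOTTOM = object()  # sentinel: "no state", signals that no token was found

# Assumed DFA for R1 = a, R2 = a* b: q1 accepts R1, q2 accepts R2.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q3", ("q1", "b"): "q2",
    ("q3", "a"): "q3", ("q3", "b"): "q2",
}
ACCEPTING = {"q1", "q2"}

def maximal_munch(text, delta=DELTA, start="q0", accepting=ACCEPTING):
    """Return the end index of each token. Assumes no rule matches the empty string."""
    ends, i = [], 0
    while i < len(text):
        q = start                      # reset DFA to look for the next token
        stack = [(BOTTOM, i)]
        # Scan forward as far as the DFA allows.
        while i < len(text) and (q, text[i]) in delta:
            if q in accepting:
                stack.clear()          # older entries can never be needed again
            stack.append((q, i))
            q = delta[(q, text[i])]
            i += 1
        # Backtrack to the most recent accepting state.
        while q not in accepting:
            q, i = stack.pop()
            if q is BOTTOM:
                raise ValueError(f"no token found at index {i}")
        ends.append(i)
    return ends

print(maximal_munch("aaa"))
print(maximal_munch("aab"))
```

On input aaa this reports three one-letter tokens, matching the outputs 1, 2, 3 of the run example that follows.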

61 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
Output =

62 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
q1    2    B,1 q0,1
Output =

63 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
q1    2    B,1 q0,1
q3    3    q1,2
Output =

64 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
q1    2    B,1 q0,1
q3    3    q1,2
      4    q1,2 q3,3
Output =

65 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
q1    2    B,1 q0,1
q3    3    q1,2
      4    q1,2 q3,3
q3    3    q1,2
Output =

66 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
q1    2    B,1 q0,1
q3    3    q1,2
      4    q1,2 q3,3
q3    3    q1,2
q1    2
Output =

67 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    1    B,1
q1    2    B,1 q0,1
q3    3    q1,2
      4    q1,2 q3,3
q3    3    q1,2
q1    2
Output = 1

68 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    2    B,2
Output = 1

69 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    2    B,2
q1    3    B,2 q0,2
Output = 1

70 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    2    B,2
q1    3
q3    4    q1,3
Output = 1

71 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    2    B,2
q1    3
q3    4    q1,3
Output = 1

72 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    2    B,2
q1    3
q3    4    q1,3
q1    3
Output = 1

73 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    2    B,2
q1    3
q3    4    q1,3
q1    3
Output = 1 2

74 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    3    B,3
Output = 1 2

75 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    3    B,3
q1    4    B,3 q0,3
Output = 1 2

76 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    3    B,3
q1    4    B,3 q0,3
Output = 1 2

77 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    3    B,3
q1    4    B,3 q0,3
Output = 1 2 3

78 Maximal munch run example
Assume R1 = a, R2 = a* b, input = aaa
DFA (figure): states q0, q1, q3, q2; transitions labelled a and b
q     i    stack
q0    3    B,3
q1    4    B,3 q0,3
Output = 1 2 3

79 Complexity of maximal munch
What is the complexity of tokenizing a text of n characters by matching longest tokens?

80 Complexity of maximal munch
What is the complexity of tokenizing a text of n characters by matching longest tokens?
Assume the following token classes:
  R1 = a
  R2 = a* b
For input = a^n it is O(n²)
Can we improve the worst-case complexity?
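The quadratic bound can be made concrete with a short count. On input a^n, the scan for the token starting at position i reads all of the remaining n - i + 1 characters hoping to find a b that would complete R2, then backs up and emits a single a. Summing over the n tokens:

```latex
\sum_{i=1}^{n} (n - i + 1) \;=\; \sum_{j=1}^{n} j \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)
```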

81 Improved scanning algorithm
Idea: use work done on “leftover” stack to improve future decisions
Remember for each index which states have failed (cannot be extended to a token)
Reference: Tom Reps, “Maximal-Munch” Tokenization in Linear Time [TOPLAS 1998]

82 Improved algorithm pseudo-code
What is the running time? How many times can this test fail for a given index?
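The improved pseudo-code is likewise missing from the transcript. The following is a sketch of the memoization idea described on the previous slide, written from that description and not copied from the slide or the paper: every (state, index) pair seen while backtracking is recorded as failed, and a later scan stops as soon as it reaches a recorded pair. The DFA, the names, and the step counter are assumptions for illustration.

```python
BOTTOM = object()  # sentinel: "no state"

# Assumed DFA for R1 = a, R2 = a* b: q1 accepts R1, q2 accepts R2.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q3", ("q1", "b"): "q2",
    ("q3", "a"): "q3", ("q3", "b"): "q2",
}
ACCEPTING = {"q1", "q2"}

def tokenize(text, delta=DELTA, start="q0", accepting=ACCEPTING):
    """Return (token end indices, number of DFA transitions taken)."""
    failed = set()           # (state, index) pairs known not to lead to a token
    ends, steps, i = [], 0, 0
    while i < len(text):
        q = start
        stack = [(BOTTOM, i)]
        while i < len(text) and (q, text[i]) in delta:
            if (q, i) in failed:
                break        # this exact situation already failed once
            if q in accepting:
                stack.clear()
            stack.append((q, i))
            q = delta[(q, text[i])]
            i += 1
            steps += 1
        while q not in accepting:
            failed.add((q, i))   # remember the failure for future scans
            q, i = stack.pop()
            if q is BOTTOM:
                raise ValueError(f"no token found at index {i}")
        ends.append(i)
    return ends, steps

ends, steps = tokenize("a" * 50)
print(len(ends), steps)
```

Each (state, index) pair can be recorded as failed at most once, which bounds the total backtracking work by the number of states times n.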

83 Agenda
Understand role of lexical analysis in a compiler
  Convert text to stream of tokens
Regular languages reminder
Lexical analysis algorithms
  Precedence + First match
  Precedence + Maximal munch
Scanner generation

84 Implementing a scanner

85 Implementing modern scanners
Manual construction of automata + determinization + maximal munch + tie breaking
  Very tedious
  Error-prone
  Non-incremental
Fortunately there are tools that automatically generate robust code from a specification for most languages
  C: Lex, Flex
  Java: JLex, JFlex

86 Using JFlex
Define tokens (and states)
Run JFlex to generate Java implementation
Usually MyScanner.nextToken() will be called in a loop by parser
Figure: Lexical Specification (MyScanner.lex) → JFlex → MyScanner.java; Stream of characters → MyScanner.java → Tokens

87 Filtering illegal combinations
Which tokens should the scanner return for “123foo”?

88 Filtering illegal combinations
Which tokens should the scanner return for “123foo”?
We sometimes want to rule out certain token concatenations prior to parsing
How can we do that with what we’ve seen so far?

89 Filtering illegal combinations
Which tokens should the scanner return for “123foo”?
We sometimes want to rule out certain token concatenations prior to parsing
How can we do that with what we’ve seen so far?
Define “error” lexemes

90 Catching errors What if input doesn’t match any token definition?
Want to gracefully signal an error
Trick: add a “catch-all” rule that matches any character and reports an error
Add after all other rules
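Both tricks, an "error" lexeme and a trailing catch-all rule, fit into an ordinary rule list. The sketch below uses Python regexes with longest match and first-rule tie-breaking. The rule names `BAD_NUM_ID` and `ERROR`, and the choice to return error tokens instead of raising, are assumptions of this example and not JFlex behaviour.

```python
import re

RULES = [
    ("BAD_NUM_ID", r"[0-9]+[A-Za-z_][A-Za-z_0-9]*"),  # "error" lexeme such as 123foo
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("WS", r"[ \t\n]+"),
    ("ERROR", r"."),  # catch-all: any single character, listed after all other rules
]
COMPILED = [(kind, re.compile(rx, re.DOTALL)) for kind, rx in RULES]

def scan(text):
    """Longest match wins; ties go to the earlier rule. Whitespace is dropped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, rx in COMPILED:
            m = rx.match(text, pos)
            if m and m.end() > best_end:   # strict '>' keeps the earlier rule on ties
                best_kind, best_end = kind, m.end()
        if best_kind != "WS":
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens

print(scan("123foo"))
print(scan("x = 1"))
```

Because the catch-all matches only one character, it never wins over a longer legitimate match, and because it comes last it also loses every tie.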

91 Next lecture: parsing

