Winter Compiler Principles Lexical Analysis (Scanning)


1 Winter 2012-2013 Compiler Principles Lexical Analysis (Scanning)
Mayer Goldberg and Roman Manevich Ben-Gurion University

2 General stuff
Topics taught by me: lexical analysis (scanning), syntax analysis (parsing), dataflow analysis, register allocation
Slides will be available from the web site after the lecture
Request: please mute mobiles, tablets, super-cool squeaking devices

3 Today
Understand the role of lexical analysis
Lexical analysis theory
Implementing a modern scanner

4 Role of lexical analysis
First part of the compiler front-end
Convert a stream of characters into a stream of tokens
Split text into the most basic meaningful strings
Simplify input for syntax analysis
[Diagram: High-level Language (Scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code]

5 From scanning to parsing
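Program text: 5 + (7 * x)
Lexical Analyzer → token stream: num + ( num * id )
Grammar: E → id, E → num, E → E + E, E → E * E, E → ( E )
Parser → syntax error, or valid Abstract Syntax Tree
[Diagram: Abstract Syntax Tree with + at the root, num and * below it, and num and x below *]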

6 Javascript example Identify basic units in this code
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}

8 Javascript example Identify basic units in this code
Kinds of units labeled on the code above: keyword, identifier, operator, numeric literal, string literal, punctuation, whitespace

9 Scanner output
Stream of tokens for the Javascript example, in the format LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...
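
As a rough illustration of how such a stream could be produced, here is a minimal sketch using Python's `re` module instead of a generated scanner. The token names follow the slide; the `TOKEN_SPEC` table, its patterns, and the `scan_line` helper are assumptions made for this example only.

```python
import re

# Illustrative token table: names follow the slide, patterns are assumptions.
# Keywords come before ID so that "var" is not reported as an identifier.
TOKEN_SPEC = [
    ("VAR", r"var\b"),
    ("FUNCTION", r"function\b"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INT_LITERAL", r"[0-9]+"),
    ("EQ", r"="),
    ("SEMI", r";"),
    ("LP", r"\("),
    ("RP", r"\)"),
    ("LCB", r"\{"),
    ("SKIP", r"\s+|//[^\n]*"),  # whitespace and line comments are consumed
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan_line(text, line_no):
    """Return entries of the form 'LINE: KIND' or 'LINE: KIND(value)'."""
    out = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"line {line_no}: unexpected character {text[pos]!r}")
        kind = m.lastgroup
        if kind in ("ID", "INT_LITERAL"):
            out.append(f"{line_no}: {kind}({m.group()})")
        elif kind != "SKIP":
            out.append(f"{line_no}: {kind}")
        pos = m.end()
    return out

print(scan_line("var currOption = 0;", 1))
```

Python's alternation picks the first alternative that matches rather than the longest one, which is why the keyword patterns carry a `\b` word boundary here.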

10 What is a token?
Lexeme: substring of the original text constituting an identifiable unit (identifiers, values, reserved words, …)
Token: record type storing
  Kind
  Value (when applicable)
  Start-position/end-position
  Any information that is useful for the parser
Different for different languages
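
One way to picture the record type described above is the following sketch. The class name `Token`, its field names, and the `lexeme` helper are illustrative assumptions, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Token:
    kind: str              # e.g. "ID", "NUM", "IF"
    value: Optional[str]   # only set when applicable, e.g. for ID or NUM
    start: int             # offset of the first character of the lexeme
    end: int               # offset one past the last character

    def lexeme(self, text: str) -> str:
        """Recover the substring of the original text that this token covers."""
        return text[self.start:self.end]

source = "x = 42"
tok = Token(kind="NUM", value="42", start=4, end=6)
print(tok.lexeme(source))
```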

11 C++ example 1 Splitting text into tokens can be tricky
How should the code below be split?
vector<vector<int>> myVector
Is >> a single operator token, or two > tokens?

12 C++ example 2 Splitting text into tokens can be tricky
How should the code below be split?
vector<vector<int> > myVector
Here > and > are two tokens

13 Example tokens
Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM
STRING      “so long, and thanks for all the fish”
LPAREN      (
RPAREN      )
IF          if

14 Separating tokens
Type          Examples
Comments      /* ignore code */   // ignore until end of line
White spaces  \t \n
Lexemes are recognized but get consumed rather than transmitted to the parser
Compare: if, i f, i/*comment*/f

15 Preprocessor directives in C
Type                Examples
Include directives  #include<foo.h>
Macros              #define THE_ANSWER 42

16 Designing a scanner Define each type of lexeme
Reserved words: var, if, for, while Operators: < = ++ Identifiers: myFunction Literals: 123 “hello” But how do we define lexemes of unbounded length?

17 Designing a scanner Define each type of lexeme
Reserved words: var, if, for, while Operators: < = ++ Identifiers: myFunction Literals: 123 “hello” But how do we define lexemes of unbounded length? Regular expressions

18 Regular languages refresher
Formal languages
  Alphabet = finite set of letters
  Word = sequence of letters
  Language = set of words
Regular languages are defined equivalently by
  Regular expressions
  Finite-state automata

19 Regular expressions
Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*
Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*) What is this language?

20 Exercise 1 - Question Language of Java identifiers
Identifiers start with either an underscore ‘_’ or a letter Continue with either underscore, letter, or digit

21 Exercise 1 - Answer Language of Java identifiers
Identifiers start with either an underscore ‘_’ or a letter
Continue with either underscore, letter, or digit
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros
First = _|a|b|…|z|A|…|Z
Next = First|0|…|9
R = First Next*
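
The macro form R = First Next* maps directly onto character classes. The sketch below uses Python's `re` module and covers only the simplified definition on this slide (real Java identifiers also admit `$` and Unicode letters). The names `FIRST`, `NEXT`, and `is_identifier` are chosen for illustration.

```python
import re

FIRST = r"[_a-zA-Z]"       # First = _ | a | ... | z | A | ... | Z
NEXT = r"[_a-zA-Z0-9]"     # Next  = First | 0 | ... | 9
IDENT = re.compile(FIRST + NEXT + "*")   # R = First Next*

def is_identifier(word):
    """True if the whole word belongs to the language First Next*."""
    return IDENT.fullmatch(word) is not None

for w in ["_tmp1", "currOption", "1abc", ""]:
    print(repr(w), is_identifier(w))
```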

22 Exercise 2 - Question
Language of rational numbers in decimal representation (no leading or trailing zeros)
Not 007
Not 0.30

23 Exercise 2 - Answer
Language of rational numbers in decimal representation (no leading or trailing zeros)
Digit = 1|2|…|9
Digit0 = 0|Digit
Num = Digit Digit0*
Frac = Digit0* Digit
Pos = Num | .Frac | 0.Frac | Num.Frac
PosOrNeg = (Є|-)Pos
R = 0 | PosOrNeg
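
To check the answer against the two counter-examples from the question, the macros can be transcribed one for one into Python's `re` syntax. This is only a sketch; the variable names mirror the slide and `is_rational` is a hypothetical helper.

```python
import re

DIGIT = "[1-9]"                      # Digit  = 1 | 2 | ... | 9
DIGIT0 = "[0-9]"                     # Digit0 = 0 | Digit
NUM = f"{DIGIT}{DIGIT0}*"            # Num    = Digit Digit0*
FRAC = f"{DIGIT0}*{DIGIT}"           # Frac   = Digit0* Digit
# Pos = Num | .Frac | 0.Frac | Num.Frac   (the dot must be escaped in re syntax)
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
R = re.compile(rf"0|-?{POS}")        # R = 0 | (Є | -) Pos

def is_rational(word):
    return R.fullmatch(word) is not None

for w in ["0", "42", "-3.14", ".5", "0.5", "007", "0.30"]:
    print(w, is_rational(w))
```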

24 Exercise 3 - Question
Equal number of opening and closing parentheses: [ⁿ ]ⁿ = [], [[]], [[[]]], …

25 Exercise 3 - Answer
Equal number of opening and closing parentheses: [ⁿ ]ⁿ = [], [[]], [[[]]], …
Not regular
Context-free
Grammar: S ::= [] | [S]
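
Since no finite automaton can count an unbounded nesting depth, recognizing this language needs something like recursion. The sketch below follows the grammar S ::= [] | [S] literally; the function name `matches_s` is an assumption for illustration.

```python
def matches_s(word):
    """Recognize the language of S ::= [] | [S]."""
    if word == "[]":                       # S ::= []
        return True
    return (len(word) > 2                  # S ::= [S]
            and word[0] == "["
            and word[-1] == "]"
            and matches_s(word[1:-1]))

for w in ["[]", "[[]]", "[[[]]]", "[]]", "[][]", ""]:
    print(repr(w), matches_s(w))
```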

26 Finite automata
An automaton is defined by states and transitions
[Diagram: automaton with a start state, transitions labeled a, b, and c, and an accepting state]

27-30 Automaton running example
Words are read left-to-right
[Diagram: step-by-step run of the automaton on an input word; in the final frame the word is accepted]

31-32 Word outside of language
Missing transition means non-acceptance
[Diagram: run of the automaton on an input word that reaches a state with no matching transition]

33 Exercise - Question
What is the language defined by the automaton below?
[Diagram: automaton with a start state and transitions labeled a, b, and c]

34 Exercise - Answer
What is the language defined by the automaton below?
a b* c
Generally: all paths leading to accepting states
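
A deterministic automaton for a b* c can be simulated with a transition table. In this sketch the state names `q0`, `q1`, `q2` are hypothetical (the slide's diagram does not name its states), and a missing table entry plays the role of a missing transition.

```python
# Transition table for a b* c; state names are illustrative assumptions.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",   # loop on b
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}

def dfa_accepts(word):
    state = START
    for letter in word:                      # words are read left-to-right
        state = TRANSITIONS.get((state, letter))
        if state is None:                    # missing transition means non-acceptance
            return False
    return state in ACCEPTING

for w in ["abbc", "ac", "abb", "bbc", "abca"]:
    print(w, dfa_accepts(w))
```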

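35 Non-deterministic automata
Allow multiple transitions from a given state labeled by the same letter
[Diagram: automaton whose start state has two outgoing transitions labeled a, with further transitions labeled b and c]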

36-39 NFA run example
Maintain set of states
Accept word if any of the states in the set is accepting
[Diagram: step-by-step run of the NFA on an input word]
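
The "maintain a set of states" idea can be sketched as follows. The automaton here is a hypothetical NFA for (a b* c) | (a c* b) with made-up state names; it is not necessarily the one drawn on the slides.

```python
# Hypothetical NFA: two different transitions labeled 'a' leave the start state.
NFA = {
    ("s0", "a"): {"s1", "s2"},
    ("s1", "b"): {"s1"},
    ("s1", "c"): {"s3"},
    ("s2", "c"): {"s2"},
    ("s2", "b"): {"s3"},
}
NFA_START = "s0"
NFA_ACCEPTING = {"s3"}

def nfa_accepts(word):
    current = {NFA_START}                    # set of states the NFA may be in
    for letter in word:
        current = {t for s in current for t in NFA.get((s, letter), ())}
    # accept if any of the states in the set is accepting
    return bool(current & NFA_ACCEPTING)

for w in ["abbc", "accb", "ab", "abcb"]:
    print(w, nfa_accepts(w))
```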

40 NFA+Є automata
Є transitions can “fire” without reading the input
[Diagram: automaton with transitions labeled a and b and an Є transition]

41-46 NFA+Є run example
Now Є transition can non-deterministically take place
Word accepted
[Diagram: step-by-step run of the NFA+Є automaton on an input word]
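
Running an NFA+Є can be sketched by extending the state set with everything reachable through Є transitions, both before reading any input and after each letter. The automaton below is a hypothetical one for a* b*. All names (`DELTA`, `EPS`, `eps_nfa_accepts`) are assumptions for illustration.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only Є transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# Hypothetical NFA+Є for a* b*: q0 loops on a, q0 --Є--> q1, q1 loops on b.
DELTA = {("q0", "a"): {"q0"}, ("q1", "b"): {"q1"}}
EPS = {"q0": {"q1"}}

def eps_nfa_accepts(word, start="q0", accepting=frozenset({"q1"})):
    current = epsilon_closure({start}, EPS)
    for letter in word:
        moved = {t for s in current for t in DELTA.get((s, letter), ())}
        current = epsilon_closure(moved, EPS)
    return bool(current & accepting)

for w in ["", "aabb", "b", "ba"]:
    print(repr(w), eps_nfa_accepts(w))
```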

47 Reg-exp vs. automata Regular expressions are declarative
Offer compact way to define a regular language by humans Don’t offer direct way to check whether a given word is in the language Automata are operative Define an algorithm for deciding whether a given word is in a regular language Not a natural notation for humans

48 From reg. exp. to automata
Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
Proof: by induction on the structure of the regular expression
For each sub-expression R we build an automaton with exactly one start state and one accepting state
Start state has no incoming transitions
Accepting state has no outgoing transitions

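50 Base cases
R = Є
[Diagram: start state connected to the accepting state by an Є transition]
R = a
[Diagram: start state connected to the accepting state by a transition labeled a]

51 Construction for R1 | R2
[Diagram: automaton for R1 | R2 built from the automata for R1 and R2]

52 Construction for R1 R2
[Diagram: automaton for R1 R2 built from the automata for R1 and R2]

53 Construction for R*
[Diagram: automaton for R* built from the automaton for R]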
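
The inductive construction can be sketched in code. This version follows the usual textbook (Thompson-style) wiring, which may differ in detail from the slide diagrams. It keeps the invariant of one start state with no incoming transitions and one accepting state with no outgoing transitions. All class and function names are hypothetical.

```python
import itertools

_ids = itertools.count()

class Frag:
    """NFA+Є fragment with exactly one start and one accepting state."""
    def __init__(self):
        self.start = next(_ids)
        self.accept = next(_ids)
        self.edges = []          # (source, label, target); label None means Є

def epsilon():
    f = Frag()
    f.edges = [(f.start, None, f.accept)]
    return f

def letter(a):
    f = Frag()
    f.edges = [(f.start, a, f.accept)]
    return f

def union(r1, r2):
    f = Frag()
    f.edges = r1.edges + r2.edges + [
        (f.start, None, r1.start), (f.start, None, r2.start),
        (r1.accept, None, f.accept), (r2.accept, None, f.accept)]
    return f

def concat(r1, r2):
    f = Frag()
    f.edges = r1.edges + r2.edges + [
        (f.start, None, r1.start), (r1.accept, None, r2.start),
        (r2.accept, None, f.accept)]
    return f

def star(r):
    f = Frag()
    f.edges = r.edges + [
        (f.start, None, r.start), (f.start, None, f.accept),
        (r.accept, None, r.start), (r.accept, None, f.accept)]
    return f

def frag_accepts(frag, word):
    """Simulate the fragment by tracking the Є-closed set of current states."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, label, dst in frag.edges:
                if src == s and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({frag.start})
    for ch in word:
        current = closure({dst for src, label, dst in frag.edges
                           if src in current and label == ch})
    return frag.accept in current

# (0* 1*) | (1* 0*), the example expression from the regular expressions slide
R = union(concat(star(letter("0")), star(letter("1"))),
          concat(star(letter("1")), star(letter("0"))))
print(frag_accepts(R, "0011"), frag_accepts(R, "010"))
```

Each call to `letter` creates a fresh fragment, so the four leaves of the example expression do not share states.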

54 From NFA+Є to DFA
Construction requires O(n) states for a reg-exp of length n
Running an NFA+Є with n states on a string of length m takes O(m·n²) time
Solution: determinization via subset construction
Number of states worst-case exponential in n
Running time O(m)

55 Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Simulate transitions between individual states for every letter
[Diagram: NFA+Є transitions s1 --a--> s2 and s4 --a--> s7 become the DFA transition [s1,s4] --a--> [s2,s7]]

56 Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Extend macro states by states reachable via Є transitions
[Diagram: given the NFA+Є transition s1 --Є--> s4, the DFA macro state [s1,s2] is extended to [s1,s2,s4]]
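
A minimal sketch of the subset construction follows. It builds only the macro states reachable from the start, a common refinement over enumerating every subset. The function names and the small a* b* test automaton are assumptions for illustration.

```python
def epsilon_closure(states, eps):
    """Extend a set of states by everything reachable via Є transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(alphabet, delta, eps, start, accepting):
    """Return (start macro state, DFA transition table, accepting macro states)."""
    start_set = epsilon_closure({start}, eps)
    dfa_delta = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        macro = worklist.pop()
        for a in alphabet:
            # simulate the transitions of every individual state on letter a
            moved = {t for s in macro for t in delta.get((s, a), ())}
            target = epsilon_closure(moved, eps)
            if not target:
                continue             # leave the transition missing (= reject)
            dfa_delta[(macro, a)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    dfa_accepting = {m for m in seen if m & accepting}
    return start_set, dfa_delta, dfa_accepting

def run_dfa(word, start, dfa_delta, dfa_accepting):
    state = start
    for ch in word:
        state = dfa_delta.get((state, ch))
        if state is None:
            return False
    return state in dfa_accepting

# Hypothetical NFA+Є for a* b*: q0 loops on a, q0 --Є--> q1, q1 loops on b.
delta = {("q0", "a"): {"q0"}, ("q1", "b"): {"q1"}}
eps = {"q0": {"q1"}}
start, dfa, acc = subset_construction("ab", delta, eps, "q0", {"q1"})
print(sorted(start), run_dfa("aabb", start, dfa, acc))
```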

57 Scanning challenges
Regular expressions allow us to define the language of all sequences of tokens
Automata theory provides an algorithm for checking membership of words
But we are interested in splitting the text, not just deciding on membership
How do we determine lexemes?
How do we handle ambiguities (lexemes matching more than one token)?

58 Separating lexemes ID = (a+b+…+z) (a+b+…+z)* ONE = 1 Input: abb1
How do we identify ID(abb), ONE?

59 Separating lexemes
ID = (a+b+…+z) (a+b+…+z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Diagram: start --a-z--> ID state with a loop on a-z; start --1--> ONE state]

60 Maximal munch
ID = (a+b+…+z) (a+b+…+z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
Solution: find longest matching lexeme
Keep reading text until automaton leaves accepting state
Return token corresponding to accepting state
Reset: go back to start state and continue reading input from there
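
The maximal-munch loop can be sketched on the ID/ONE automaton as follows. This version remembers the last accepting position and resumes from there, a common way to implement "longest matching lexeme". The state names and the `step` and `munch` helpers are assumptions for illustration.

```python
import string

def step(state, ch):
    """Transition function of the ID/ONE automaton; None means no transition."""
    if state in ("start", "ID") and ch in string.ascii_lowercase:
        return "ID"
    if state == "start" and ch == "1":
        return "ONE"
    return None

ACCEPTING = {"ID": "ID", "ONE": "ONE"}   # accepting state -> token kind

def munch(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "start", pos
        last_kind, last_end = None, pos
        while i < len(text):
            state = step(state, text[i])
            if state is None:            # automaton is stuck: stop reading
                break
            i += 1
            if state in ACCEPTING:       # remember the longest match so far
                last_kind, last_end = ACCEPTING[state], i
        if last_kind is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((last_kind, text[pos:last_end]))
        pos = last_end                   # reset: continue from the end of the lexeme
    return tokens

print(munch("abb1"))
```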

61 Handling ambiguities
ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Diagram: NFA with start --a-z--> ID state (loop on a-z) and start --i--> --f--> IF state]

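62 Handling ambiguities
ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Diagram: DFA with transitions labeled i, f, a-z\i, a-z\f, and a-z; the state reached after "if" is labeled with both ID and IF, the other accepting states with ID]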

63 Handling ambiguities
ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
Solution: break tie using order of definitions
Output: ID(if)

64 Handling ambiguities
IF = if
ID = (a+b+…+z) (a+b+…+z)*
Input: if
Matches both tokens
What should the scanner output?
Solution: break tie using order of definitions
Output: IF
Conclusion: list keyword token definitions before identifier definition
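
The effect of definition order can be reproduced with a small rule-list scanner. It prefers the longest match and, on a tie, keeps the rule listed first. This is a sketch built on Python's `re` module, not how JFlex is implemented, and `make_scanner` is a hypothetical helper.

```python
import re

def make_scanner(rules):
    """rules: list of (kind, pattern) pairs in priority order."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]

    def next_token(text, pos=0):
        best_kind, best_end = None, pos
        for kind, regex in compiled:
            m = regex.match(text, pos)
            # a strictly longer match wins; on equal length the earlier rule is kept
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            return None
        return best_kind, text[pos:best_end]

    return next_token

id_first = make_scanner([("ID", r"[a-z]+"), ("IF", r"if")])
kw_first = make_scanner([("IF", r"if"), ("ID", r"[a-z]+")])
print(id_first("if"), kw_first("if"), kw_first("iffy"))
```

With the keyword rule first, `if` is reported as IF, while a longer word such as `iffy` is still an ID because the longest match takes precedence over rule order.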

65 Implementing scanners in practice

66 Implementing scanners
Manual construction of automata + determinization is
  Very tedious
  Error-prone
  Non-incremental
Fortunately there are tools that automatically generate code from a specification for most languages
  C: Lex, Flex
  Java: JLex, JFlex

67 Using JFlex
Define tokens (and states)
Run JFlex to generate a Java implementation
Usually MyScanner.nextToken() will be called in a loop by the parser
[Diagram: MyScanner.lex (regular expressions) → JFlex → MyScanner.java; stream of characters → MyScanner.java → tokens]

68 Common format for reg-exps
Basic patterns    Matching
x                 The character x
.                 Any character, usually except a new line
[xyz]             Any of the characters x, y, z
Repetition operators
R?                An R or nothing (= optionally an R)
R*                Zero or more occurrences of R
R+                One or more occurrences of R
Composition operators
R1R2              An R1 followed by R2
R1|R2             Either an R1 or R2
Grouping
(R)               R itself

69 Escape characters
What is the expression for one or more + symbols?
(+)+ won’t work
(\+)+ will
A backslash \ before an operator turns it into a standard character: \*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t
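
The same escaping rule can be observed in Python's `re` module, used here only as a stand-in for the scanner-generator syntax. The unescaped pattern is rejected at compile time, while the escaped one matches a run of plus signs. The `compiles` helper is a hypothetical name.

```python
import re

def compiles(pattern):
    """True if `pattern` is a well-formed regular expression for Python's re."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles("(+)+"))                             # '+' has nothing to repeat
print(compiles(r"(\+)+"))                           # escaped '+' is an ordinary character
print(re.fullmatch(r"(\+)+", "+++") is not None)    # one or more + symbols
```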

70 Shorthands
Use names for expressions
letter = a | b | … | z | A | B | … | Z
letter_ = letter | _
digit = 0 | 1 | 2 | … | 9
id = letter_ (letter_ | digit)*
Use hyphen to denote a range
letter = a-z | A-Z
digit = 0-9

71 Catching errors
What if the input doesn’t match any token definition?
Trick: add a “catch-all” rule that matches any character and reports an error
Add it after all other rules
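
A catch-all rule might look like the sketch below, again using Python's `re` module rather than a scanner generator. The rule names and the `scan_with_errors` helper are assumptions. Because the ERROR rule is listed last, it only fires when no other rule matches at the current position.

```python
import re

RULES = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[a-z]+"),
    ("WS", r"\s+"),
    ("ERROR", r"."),     # catch-all: any single character, added after all other rules
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in RULES))

def scan_with_errors(text):
    tokens, errors = [], []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "ERROR":
            errors.append(f"illegal character {m.group()!r} at offset {m.start()}")
        elif kind != "WS":
            tokens.append((kind, m.group()))
    return tokens, errors

print(scan_with_errors("x # 42"))
```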

72 Next lecture: parsing

