1 1 Programming Language Syntax

2 2 form (syntax) and meaning (semantics) must be precise
  example: numbers
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    number → non_zero_digit digit*
  * means zero or more repetitions
  no meaning, only symbols: could be anything (digits, days, colors, cars, channels, …)
  one syntax, several semantics: 110 can be a binary, decimal, octal, or hexadecimal number

3 3 important to distinguish between syntax and semantics
  different languages often have the same semantics under different syntax
  there are good algorithms for discovering the syntax of a program

4 4 the two problems we study here
  1. How to specify the structural rules of a language?
     of interest to programmers
     regular expressions and context-free grammars
  2. How does a compiler identify the structure of a program?
     of interest to compilers
     scanners and parsers
  clear and immediate connection between theory and practice

5 5 Specifying syntax: regular expressions and CFG
  formal specification by a set of rules
  formal languages: sets of strings without semantics
  formal grammars: regular, context-free

6 6 tokens
    three kinds of rules: concatenation, alternation, repetition ('Kleene closure')
    regular sets/languages/expressions
    recognized by scanners
  the rest
    recursion in addition to the three above
    context-free languages/grammars (CFG)
    recognized by parsers

7 7 programming languages - CF grammars
  programs - CF languages

8 8 Tokens and regular expressions
  tokens: keywords, identifiers, numbers, punctuation marks, …
  example: C keywords
    auto double int struct break else long switch case enum register typedef char extern return union const float short unsigned continue for signed void default goto sizeof volatile do if static while

9 9 constants
    integer: 23, 0x654L, 064U
    character: 'a' '\n' L'x' 'abc' '\0' L' '
    floating point constants
  symbols: ... && -= >= ~ + ; ] >> %, >>= *= /= ^= & - = { | %> %= += <= || ) / ? } %: ## -- == ! * : [ # %:%:
  comments: /* */ or //

10 10 Pascal: 64 tokens
  35 keywords: begin, end, while, record, div, etc.
  identifiers: MyIdent, YourType, maxint, etc.; 39 are predefined
  21 symbols: +, -, ;, :=, .., etc.
  literals
    integer: 111
    floating-point: 9.07e-23
    character/string: 'concept'
  comments
    (* *)
    { } in Turbo Pascal

11 11 upper and lower case letters
    distinct (Modula-2/3, C, C++, Java)
    identical (Pascal, Ada, Common Lisp, ...)
  names contain
    only letters and digits (Pascal, Modula-3)
    and '_' (almost all languages)
    and something else (Lisp)
  limits on the length

12 12 free format
    white space (blanks, tabs, carriage returns, line and page feed characters)
  exceptions
    Fortran (fixed format)
    Haskell, Occam (line breaks, indentation)

13 13 a regular expression is
  1. the empty string ε
  2. a character
  3. a concatenation of two regular expressions, i.e., any string generated by the first expression followed by a string generated by the second expression
  4. a union of two expressions, i.e., two expressions separated by |, meaning any string generated by the first expression OR any string generated by the second expression
  5. a Kleene closure, i.e., a regular expression followed by a *

14 14 example
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    unsigned_integer → digit digit*
    unsigned_number → unsigned_integer ((. unsigned_integer) | ε) ((e (+ | - | ε) unsigned_integer) | ε)
  83929e3312e-4
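The unsigned_number rule above can be tried out directly. The following is a minimal sketch, assuming Python's `re` module as the regular-expression engine; the names `DIGIT`, `UNSIGNED_INTEGER`, `UNSIGNED_NUMBER`, and `is_unsigned_number` are illustrative and not part of the slides. The empty alternative in `(x|)` plays the role of ε.

```python
import re

# Each definition mirrors one rule of the slide; "|)" encodes the ε alternative.
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(\.{UNSIGNED_INTEGER}|)"          # optional fraction
    rf"(e(\+|-|){UNSIGNED_INTEGER}|)"    # optional exponent with optional sign
)


def is_unsigned_number(text):
    """Return True if the whole string is generated by unsigned_number."""
    return re.fullmatch(UNSIGNED_NUMBER, text) is not None


if __name__ == "__main__":
    for sample in ["3.14", "12e-4", "29e33", "3.", ".5", "1e"]:
        print(sample, is_unsigned_number(sample))
```

Note that a digit is required on both sides of the decimal point, exactly as the grammar demands.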

15 15 Context-free grammars
  enable specifying nested constructs, defining concepts in terms of themselves
  example
    expression → identifier | number | - expression | ( expression ) | expression operator expression
    operator → + | - | ∗ | /

16 16 something → this | that
  rules are productions
  the symbols on the left-hand sides are nonterminals
  strings derived from a grammar consist of terminals (this); in our grammars they are tokens
  the start symbol is a distinguished nonterminal; in our grammars usually program

17 17 Backus-Naur Form (BNF)
  developed for Algol-60
  the meta-symbols of BNF are:
    ::= meaning "is defined as"
    | meaning "or"
    angle brackets, used to surround category names
  nonterminal ::= sequence_of_alternatives, consisting of strings of terminals or nonterminals separated by the meta-symbol |
  example
    ::= program begin end ;

18 18 extended BNF (EBNF)
    contains ( ), *, +
    or even [ ], { }
  expressive power is the same
    operator → + | - | ∗ | /
    or
    operator → +
    operator → -
    operator → ∗
    operator → /

19 19 identifier_list → identifier (, identifier)*
  or
    identifier_list → identifier
    identifier_list → identifier_list , identifier
  or
    identifier_list → identifier
    identifier_list → identifier , identifier_list

20 20 Derivations and parse trees
  a CFG shows how to generate valid strings of terminals:
    1. start with the start symbol
    2. choose a production with the start symbol on the left-hand side
    3. replace the start symbol with the right-hand side of the production
    4. choose a nonterminal S in the string, choose a production with S on the left-hand side, and replace S with the right-hand side of the production (denoted by ⇒)
    5. repeat 4 as long as there are nonterminals

21 21 example: derivation of 'slope ∗ x + intercept' in the 'expression' grammar
    expression ⇒ expression operator expression
      ⇒ expression operator identifier
      ⇒ expression + identifier
      ⇒ expression operator expression + identifier
      ⇒ expression operator identifier + identifier
      ⇒ expression ∗ identifier + identifier
      ⇒ identifier ∗ identifier + identifier
    where the identifiers are, in order, slope, x, and intercept
  the final form is the yield; the others are sentential forms

22 22 ⇒* denotes zero or more derivation steps
    expression ⇒* slope ∗ x + intercept
  at each step the right-most nonterminal has been replaced: the right-most, or canonical, derivation
  a parse tree

23 23 the left-most derivation
  our grammar is ambiguous: there is more than one derivation (tree) for a yield
  derivation orders in between are possible
  parsers use left-most or right-most

24 24 the same language can be generated by infinitely many grammars
  avoid:
    ambiguity, by means of disambiguating rules
    useless symbols
      nonterminals that cannot create terminal strings
      terminals that do not appear in the yield of any derivation
  desirable:
    reflects the structure, so useful to the rest of the compiler
    can be parsed efficiently

25 25 example: expressions (again)
  should capture
    associativity: (10 - 4) - 3 rather than 10 - (4 - 3)
    precedence: (10 ∗ 4) + 3 rather than 10 ∗ (4 + 3)
  expression → term | expression add_op term
  term → factor | term mult_op factor
  factor → identifier | number | - factor | ( expression )
  add_op → + | -
  mult_op → ∗ | /

26 26  parse tree for 3 + 4 ∗ 5, with precedence

27 27  parse tree for 10 – 4 – 3, with left associativity

28 28 *A short theory lesson
  G = (V, X, π) formal grammar
    V dictionary
    X ⊆ V set of terminals
    V \ X set of nonterminals (auxiliary symbols)
    π set of productions: string_of_nonterminals → any_string
  S a distinguished start symbol
  the language generated by (G, S) is the set of all strings of terminals that can be generated starting from S and using rules from π

29 29 Chomsky hierarchy
  context-sensitive (type 1)
    uAv → upv, where A is a nonterminal, p is any string, u and v are strings of nonterminals
  context-free (type 2)
    A → p, where A is a nonterminal, p is any string
  regular (type 3)
    A → pB or A → q, where A and B are nonterminals, p and q are strings of terminals

30 30 every context-free grammar is equivalent to a grammar with rules A → STRING or A → a, where A is a nonterminal, STRING is a string of nonterminals, and a is a terminal

31 31 regular languages are recognized by finite automata
  (A, X, δ)
    A states: q1, q2
    X input alphabet: 0, 1
    δ transition function:
          q1  q2
      0   q2  q1
      1   q1  q2
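The automaton above can be simulated in a few lines. This is a sketch under stated assumptions: the slide gives only states, alphabet, and δ (read here as: input 0 swaps q1 and q2, input 1 keeps the state), so the start state q1 and the accepting set {q1} are assumptions made for illustration. With them, the machine accepts strings with an even number of 0s.

```python
# Transition function delta as read from the slide's table.
DELTA = {
    ("q1", "0"): "q2",
    ("q2", "0"): "q1",
    ("q1", "1"): "q1",
    ("q2", "1"): "q2",
}


def run(word, start="q1"):
    """Return the state reached after reading word (start state is assumed)."""
    state = start
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state


def accepts(word, start="q1", accepting=frozenset({"q1"})):
    """Accepting set {q1} is an assumption, not given on the slide."""
    return run(word, start) in accepting


if __name__ == "__main__":
    for word in ["", "0", "0110", "010"]:
        print(repr(word), run(word), accepts(word))
```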

32 32  context-free languages recognized by push-down automata

33 33 CFG languages recognized by push-down automata

34 34 Recognizing syntax: scanners and parsers
  syntax analysis
    the first step in both compilation and interpretation
    we consider compilation
  the parser is the heart of a compiler
    calls the scanner to get tokens
    assembles the tokens into the parse tree
    passes the tree (maybe one subroutine after another) to the next phases
    in charge of the whole compilation process: syntax-directed translation
    a deterministic push-down automaton recognizing the syntax

35 35 the scanner
    groups characters into tokens
    removes white space
    removes comments, with special treatment for nested comments
    tags tokens with column and line numbers
    produces an annotated source listing
    a deterministic finite automaton recognizing tokens
  both scanners and parsers can be generated automatically, from regular expressions and CFGs, respectively: lex & yacc or flex & bison

36 36 Scanning
  after finding a token the scanner returns to the parser
  when invoked again, it continues from the point where it stopped
  the longest possible token is accepted
    3.1415 is not 3.14 and 15
    readln is not read and ln
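The longest-possible-token rule can be sketched as follows. This is an illustrative scanner, not the one from the course: the token kinds and patterns in `TOKEN_SPEC` are assumptions, and Python's `re` module stands in for the finite automaton. At each position every pattern is tried and the longest match wins.

```python
import re

# Hypothetical token definitions, chosen only to illustrate longest match.
TOKEN_SPEC = [
    ("number", re.compile(r"[0-9]+(\.[0-9]+)?")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("symbol", re.compile(r":=|[-+*/()]")),
]


def scan(text):
    """Split text into (kind, spelling) pairs, always taking the longest token."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # white space separates tokens
            pos += 1
            continue
        best = None                      # (kind, end position) of longest match
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match and (best is None or match.end() > best[1]):
                best = (kind, match.end())
        if best is None:
            raise ValueError(f"lexical error at position {pos}")
        kind, end = best
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens


if __name__ == "__main__":
    print(scan("3.1415"))
    print(scan("readln"))
    print(scan("sum := A + 2"))
```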

37 37

38 38 ad hoc scanner
    fast and compact
    used in production compilers
  code can represent a finite automaton
    by hand
    scanner generator
      automatically generated
      easier to write and modify
      used in language or compiler development, or when ease of implementation is more important than performance

39 39

40 40 a finite automaton can be realized in two ways
  1. embed the automaton in the control flow using case statements; usually results in a hand-written scanner
  2. use the (transition) table and a driver; usually leads to an automatically generated scanner

41 41 the first approach: two problems
  keywords: is read a keyword or an identifier?
    - can be solved by introducing (many) new states
    - solved by treating keywords as exceptions among identifiers
  distinguishing between . and .. (the 'dot-dot' problem)
    3.14   3..5   3.nothing
    - solved by look-ahead
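Both fixes are small in a hand-written scanner. The sketch below is illustrative only: the keyword set and the function names `classify` and `scan_number` are assumptions. A scanned word is looked up in a keyword table after the fact, and a dot is taken as part of a number only when the character after it is a digit.

```python
# Hypothetical keyword table; any identifier found here is a keyword.
KEYWORDS = frozenset({"read", "write", "begin", "end"})


def classify(word):
    """Treat keywords as exceptions among identifiers."""
    return ("keyword", word) if word in KEYWORDS else ("identifier", word)


def scan_number(text, pos):
    """Scan a number starting at pos; return (spelling, next position)."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    # Look ahead past the dot: it starts a fraction only if a digit follows.
    if end + 1 < len(text) and text[end] == "." and text[end + 1].isdigit():
        end += 1
        while end < len(text) and text[end].isdigit():
            end += 1
    return text[pos:end], end


if __name__ == "__main__":
    print(classify("read"), classify("readln"))
    for sample in ["3.14", "3..5", "3.nothing"]:
        print(sample, scan_number(sample, 0))
```

For 3..5 and 3.nothing the scanner stops after the 3 and leaves the dot for the next token.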

42 42 in messier languages one character of look-ahead is not enough
  FORTRAN IV
    DO 5 I = 1, 25   head of a loop
    DO 5 I = 1. 25   variable DO5I, no declarations of variables
  from FORTRAN 77
    DO 5, I = 1, 25
  if a larger look-ahead is needed, the scanner memorizes the longest token it has seen and buffers the continuation, looking for a longer one; if an error appears, it returns the last token it has seen and 'unreads' the buffered string

43 43

44 44 the second approach
  the table is indexed by states and inputs
  entries determine whether to
    - move to a new state (if so, which one)
    - return a token (if so, which one)
    - announce an error
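A table-and-driver scanner can be sketched as follows. The states, character classes, and table contents are assumptions made for illustration (a real generated table is far larger). A missing table entry means "stop": the driver then returns a token if the current state is accepting and announces an error otherwise.

```python
def char_class(ch):
    """Map a character to a table column (assumed classes)."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


# Transition table indexed by (state, input class).
TABLE = {
    ("start", "digit"): "in_number",
    ("start", "letter"): "in_ident",
    ("in_number", "digit"): "in_number",
    ("in_ident", "letter"): "in_ident",
    ("in_ident", "digit"): "in_ident",
}

# Accepting states and the token each one returns.
TOKEN_OF = {"in_number": "number", "in_ident": "identifier"}


def next_token(text, pos):
    """Driver: return (kind, spelling, next position) or raise on error."""
    state = "start"
    end = pos
    while end < len(text):
        target = TABLE.get((state, char_class(text[end])))
        if target is None:               # no move possible from here
            break
        state = target
        end += 1
    if state not in TOKEN_OF:
        raise ValueError(f"lexical error at position {pos}")
    return TOKEN_OF[state], text[pos:end], end


if __name__ == "__main__":
    print(next_token("x1+2", 0))
    print(next_token("42 ", 0))
```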

45 45

46 46 lexical errors
  relatively rare
  the next character is neither an acceptable continuation of the current token nor the beginning of another token
  an error message and error recovery
    1. throw away the current invalid token
    2. skip forward until a character that can start a new token is found
    3. restart the scanning
    4. count on the parser's error-recovery mechanism
  the scanner returns the kind of the token and its spelling, e.g., identifier num, its line and column, ...

47 47 significant comments, or pragmas
  do not affect the semantics, affect compiling
  handled by the scanner since they can appear anywhere
  in Ada they have reserved places, and hence are handled by the parser or later phases
  examples
    turn on or off various run-time checks (pointer or subscript checking) or certain code improvements
    give hints, e.g., a variable x is heavily used, subroutine S is not recursive, x and y have the same location, ...

48 48 Top-down and bottom-up parsing
  a CFG generates a context-free language; a parser recognizes a language
  for any CFG there is a parser that runs in O(n^3) time
    algorithms: Earley, Cocke-Younger-Kasami
  for some grammars O(n)
    LL: Left-to-right, Left-most derivation
    LR: Left-to-right, Right-most derivation

49 49 LL parsers
    simpler, easier to understand
    hand-written or automatically generated
  LR parsers
    larger class
    maybe more intuitive grammars, especially with arithmetic expressions
    usually automatically generated

50 50 LL parsers
    top-down
    predictive
    construct the tree from the root, predicting which production is going to be used depending on the input token
  LR parsers
    bottom-up
    shift-reduce
    construct the tree from the leaves (tokens), recognizing when a collection of leaves can be joined as the children of a single parent

51 51 example: comma-separated list of identifiers
    id_list → id id_list_tail
    id_list_tail → , id id_list_tail
    id_list_tail → ;
  this grammar is convenient for LL parsers, not for LR
  there is an equivalent LR grammar

52 52 top-down bottom-up

53 53 example: back to the id_list grammar, customized for bottom-up parsing
    id_list → id_list_prefix ;
    id_list_prefix → id_list_prefix , id
    id_list_prefix → id
  cannot be parsed top-down

54 54

55 55 subclasses of LR parsers
    SLR: simple LR parsers
    LALR: look-ahead LR parsers
    full LR parsers
  LL(k) and LR(k)
    k tokens of look-ahead
    LL(1), LALR(1), LL(2), LR(1), LR(0)
    in practice, at most one token of look-ahead

56 56

57 57 Recursive descent parser
  for top-down parsing
  when a grammar is simple, or an automatic tool is not available
  no two productions start with the same token, nor …
  example (old)
    expression → term | expression add_op term
    term → factor | term mult_op factor
    factor → identifier | number | - factor | ( expression )
    add_op → + | -
    mult_op → ∗ | /

58 58  calculator language grammar

59 59 the procedure: start from the top of the tree with the start symbol and predict the next production according to the current left-most nonterminal and the current input token
  two ways to do this, reminiscent of those used to build scanners
    1. recursive descent parser, by hand
    2. build an LL parse table and a driver

60 60 a recursive descent parser
    has a subroutine for every nonterminal
    has a mechanism to inspect the next token obtained from the scanner
    has a routine (match) to consume the token and verify that it is the expected one; if not, it announces an error and triggers a recovery mechanism
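The three ingredients above (one subroutine per nonterminal, a peek at the next token, and match) can be sketched for arithmetic expressions. This is an illustration under assumptions: the left-recursive expression grammar is replaced by an LL(1) form with tail nonterminals (expr → term term_tail, term_tail → add_op term term_tail | ε, and likewise for term), only numbers are supported as operands, the tokenizer silently skips unknown characters, and error recovery is reduced to raising `SyntaxError`. The parser evaluates as it goes instead of building a tree.

```python
import re


def tokenize(text):
    """Very small scanner: numbers and single-character symbols, then $$."""
    return re.findall(r"[0-9]+|[-+*/()]", text) + ["$$"]


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        """Consume the next token after verifying it is the expected one."""
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):                      # expr -> term term_tail
        return self.term_tail(self.term())

    def term_tail(self, left):           # term_tail -> add_op term term_tail | ε
        if self.peek() in ("+", "-"):
            op = self.peek()
            self.match(op)
            right = self.term()
            return self.term_tail(left + right if op == "+" else left - right)
        return left                      # ε production predicted

    def term(self):                      # term -> factor factor_tail
        return self.factor_tail(self.factor())

    def factor_tail(self, left):         # factor_tail -> mult_op factor factor_tail | ε
        if self.peek() in ("*", "/"):
            op = self.peek()
            self.match(op)
            right = self.factor()
            return self.factor_tail(left * right if op == "*" else left / right)
        return left

    def factor(self):                    # factor -> number | - factor | ( expr )
        token = self.peek()
        if token == "(":
            self.match("(")
            value = self.expr()
            self.match(")")
            return value
        if token == "-":
            self.match("-")
            return -self.factor()
        if token.isdigit():
            self.match(token)
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    parser.match("$$")                   # the whole input must be consumed
    return value


if __name__ == "__main__":
    print(evaluate("10 - 4 - 3"))
    print(evaluate("3 + 4 * 5"))
```

Passing the left operand into the tail routines is what keeps subtraction and division left-associative even though the grammar itself is right-recursive.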

61 61

62 62 example
    read A
    read B
    sum := A + B
    write sum
    write sum / 2

63 63

64 64 a parser saves the parse tree or some equivalent form as an explicit data structure
  save by allocating and linking together the records that are the children of a node immediately before executing the recursive subroutines and match invocations that represent the children
  each recursive routine should be passed an argument that points to the record that is to be expanded
  match should save information on certain tokens, e.g., character-string representations of literals and identifiers

65 65 too much information in a parse tree
    usually not constructed completely
    often an abstract syntax tree, or another more terse representation
  the trickiest part in writing a recursive descent parser: determining the arms of the case statements
    they are the productions predicted by the token that labels the arm
    a token X predicts a production if
      (1) the right-hand side, when recursively expanded, starts with X
      (2) the right-hand side yields ε, and X may begin the yield of what comes next
    FIRST and FOLLOW sets

66 66 *Syntax errors
  syntax error recovery
    any technique enabling the compiler to continue looking for errors
  find as many errors as possible
    modify the state of the parser
    turn off code generation
  methods
    panic mode
      a small set of 'safe' symbols
      when an error is discovered, deletes tokens until the next safe symbol
      backs the parser out to a context in which the symbol might appear
      may delete a 'starter' token (begin, while, if, ...)

67 67 phrase-level recovery
    different sets of 'safe' symbols in different contexts
    if an error occurs in the routine for a nonterminal, skips until
      a token that may start the nonterminal, and then proceeds with the routine, or
      a token that might follow the nonterminal, and then returns
    immediate error detection problem: the parser predicts an epsilon production (if possible) instead of announcing an error immediately
      Y := (A + X X + X) – (B – X / X) + C / X
  context-sensitive look-ahead
    context-specific FOLLOW sets
    does not delete 'starter' symbols

68 68 exception-based recovery
    Ada, Modula-3, C++, Java, ML
    a small set of contexts to which we back out in the event of an error (e.g., expression, statement)
    attach an exception handler to a code block
    if an error is detected, raise the syntax error exception, rearrange the stack, ...
  error productions
    language independent
    sometimes the grammar is modified to accept common harmless errors (; in Pascal)

69 69 Table-driven top-down parsing
  a stack containing a list of possible symbols is maintained
    yet unvisited nodes of the parse tree that appear in the right-hand side of a predicted production
    initially the start symbol (program) is on the top of the stack
    when predicting a production, the parser pops the left-hand side off and pushes the right-hand side in reverse order

70 70 loop
  if a terminal is on the top of the stack, then match it against the next token
    if the match does not succeed: error, error recovery
    if it succeeds, pop the terminal off
  if a nonterminal is on the top of the stack, it and the next token index a table that tells which production to predict, or signals an error
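The loop above can be sketched for the LL id_list grammar (id_list → id id_list_tail, id_list_tail → , id id_list_tail | ;). The parse table is written out by hand here, which is an assumption for illustration; a generator would derive it from predict sets. Error recovery is reduced to raising `SyntaxError`.

```python
# Parse table: (nonterminal on top of stack, next token) -> predicted right-hand side.
TABLE = {
    ("id_list", "id"): ["id", "id_list_tail"],
    ("id_list_tail", ","): [",", "id", "id_list_tail"],
    ("id_list_tail", ";"): [";"],
}
NONTERMINALS = {"id_list", "id_list_tail"}


def ll_parse(tokens, start="id_list"):
    """Return the list of predicted productions, or raise SyntaxError."""
    stack = [start]                      # the top of the stack is the end of the list
    pos = 0
    predictions = []
    while stack:
        top = stack.pop()
        token = tokens[pos] if pos < len(tokens) else "$$"
        if top in NONTERMINALS:
            rhs = TABLE.get((top, token))
            if rhs is None:
                raise SyntaxError(f"no production for {top} on {token!r}")
            predictions.append((top, rhs))
            stack.extend(reversed(rhs))  # push the right-hand side in reverse order
        else:
            if top != token:
                raise SyntaxError(f"expected {top!r}, found {token!r}")
            pos += 1                     # terminal matched and popped
    if pos != len(tokens):
        raise SyntaxError("input continues after the end of the list")
    return predictions


if __name__ == "__main__":
    for lhs, rhs in ll_parse(["id", ",", "id", ";"]):
        print(lhs, "→", " ".join(rhs))
```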

71 71

72 72 predict sets of productions
  FIRST(nonterm) = all tokens that can start a production with the left-hand side nonterm, plus ε if nonterm ⇒* ε
  FOLLOW(someth) = all tokens that can come after someth in a valid program, plus ε if someth can be the final token in the program
  extend FIRST to strings
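FIRST sets can be computed by iterating until nothing changes. The sketch below is an illustration: the grammar is the LL stmt_list fragment used elsewhere in these slides, with expr → id | number assumed as a stand-in because the full expression rules are not reproduced here. `first_of_string` is the extension of FIRST to strings of symbols; FOLLOW can be computed by a similar fixed-point loop and is omitted.

```python
EPS = "ε"

# Nonterminal -> list of right-hand sides; [] is an ε production.
GRAMMAR = {
    "stmt_list": [["stmt", "stmt_list"], []],
    "stmt": [["id", ":=", "expr"], ["read", "id"], ["write", "expr"]],
    "expr": [["id"], ["number"]],        # assumed simplification
}


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols, given FIRST of the nonterminals."""
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in first else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result                # this symbol cannot vanish; stop here
    result.add(EPS)                      # every symbol can derive ε
    return result


def compute_first(grammar):
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:                       # repeat until a fixed point is reached
        changed = False
        for nonterminal, productions in grammar.items():
            for rhs in productions:
                new = first_of_string(rhs, first)
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


if __name__ == "__main__":
    for nonterminal, tokens in compute_first(GRAMMAR).items():
        print(nonterminal, sorted(tokens))
```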

73 73

74 74

75 75 A, B, C, ... nonterminals
  X, Y, Z, ... arbitrary grammar symbols
  a, b, c, ... terminals (tokens)
  x, y, z, ... token strings
  α, β, γ, ... any strings

76 76

77 77

78 78

79 79 predict-predict conflict
  if the same token predicts more than one production with the same left-hand side, the grammar is not LL(1)
  two possible reasons:
    1. the same token begins more than one right-hand side (see the expression grammar)
    2. the same token begins one right-hand side and may appear after the left-hand side in a valid program, and one possible right-hand side can generate ε

80 80 writing an LL(1) grammar
  two most common obstacles
    1. left recursion
       id_list → id_list_prefix ;
       id_list_prefix → id_list_prefix , id
       id_list_prefix → id
    2. common prefixes
       stmt → id := expr
       stmt → id ( argument_list )

81 81 can be removed from the grammar mechanically
    1. id_list → id id_list_tail
       id_list_tail → , id id_list_tail
       id_list_tail → ;
    2. stmt → id stmt_list_tail
       stmt_list_tail → := expr | ( argument_list )
  in general, much more complicated
    indirect left recursion: S → A α, A → S β
    or hidden common prefixes: S → A α, S → B β, A ⇒* a γ, B ⇒* a δ
  moreover, it does not always help
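The mechanical removal of immediate left recursion can be sketched as a small function. This is the standard textbook transformation (A → A α | β becomes A → β A_tail, A_tail → α A_tail | ε), shown as an illustration; the function name and the `_tail` naming scheme are assumptions, and the result differs in shape from the hand-tuned id_list grammar on the slide, which also folds the terminating semicolon into the tail.

```python
def remove_immediate_left_recursion(nonterminal, productions):
    """Rewrite A → A α | β as A → β A_tail, A_tail → α A_tail | ε.

    productions is a list of right-hand sides, each a list of symbols;
    an empty list stands for ε.
    """
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    others = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}    # nothing to do
    tail = nonterminal + "_tail"
    return {
        nonterminal: [rhs + [tail] for rhs in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }


if __name__ == "__main__":
    grammar = remove_immediate_left_recursion(
        "id_list_prefix", [["id_list_prefix", ",", "id"], ["id"]]
    )
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            print(lhs, "→", " ".join(rhs) if rhs else "ε")
```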

82 82 the few non-LL languages used in practice can be handled by augmenting the parsing algorithm with one or two simple heuristics
  example: in Pascal the else in an if construct is optional
    stmt → if condition then_clause else_clause | other_stmt
    then_clause → then stmt
    else_clause → else stmt | ε
  if C1 then if C2 then S1 else S2
    whose else is this?

83 83 stmt → balanced_stmt | unbalanced_stmt
  balanced_stmt → if condition then balanced_stmt else balanced_stmt | other_stmt
  unbalanced_stmt → if condition then stmt | if condition then balanced_stmt else unbalanced_stmt
  suitable for bottom-up parsing, not top-down
  there is NO pure top-down grammar for Pascal else-statements

84 84 ambiguity
  disambiguating rule: in case two rules satisfy the conditions to be applied, choose the one that occurs first
    else_clause → else stmt
    else_clause → ε
  pairs else with the closest then
  what if we don't want else to refer to the closest then?

85 85 not correct:
    if i < n then
      if i > m then s := s + i
    else write ('write new i')
  correct:
    if i < n then begin
      if i > m then s := s + i
    end
    else write ('write new i')

86 86 end in Modula-2 (Modula, Oberon)
    stmt → IF condition then_clause else_clause END | other_stmt
    then_clause → THEN stmt_list
    else_clause → ELSE stmt_list | ε
  Modula-2: END for all structured statements
  Ada, Fortran: if … end if, while … end while
  Algol 68: if … fi, do … od, case … esac
  if A = B then … else if A = C then … else if A = D then … else … end end end

87 87 elsif solves this
    if A = B then … elsif A = C then … elsif A = D then … else … end
  Modula-2 grammar now
    stmt → IF condition then_clause elsif_clauses else_clause END | other_stmt
    then_clause → THEN stmt_list
    elsif_clauses → ELSIF condition then_clause elsif_clauses | ε
    else_clause → ELSE stmt_list | ε

88 88 Bottom-up parsing
  maintains a forest of partially completed subtrees
  joins them as the children of a new root node when it recognizes that their roots form the right-hand side of a production; the left-hand side is the new root

89 89 table-driven
    roots of partially completed trees are kept on a stack
    a new token from the scanner is shifted onto the top of the stack
    when the parser recognizes that the top few symbols form the right-hand side of a production, it pops them off and pushes the left-hand side
  stack in
    top-down: contains what is expected to be seen
    bottom-up: contains what has been seen
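The shift and reduce steps can be sketched with a deliberately naive recognizer for the bottom-up id_list grammar (id_list → id_list_prefix ;, id_list_prefix → id_list_prefix , id | id). This is not an LR parser: instead of consulting states and a table, it greedily reduces whenever the top of the stack equals some right-hand side, trying longer right-hand sides first. That shortcut is an assumption that happens to work for this small grammar and would fail for grammars where the handle depends on context.

```python
# Productions ordered so that longer right-hand sides are tried first.
PRODUCTIONS = [
    ("id_list_prefix", ["id_list_prefix", ",", "id"]),
    ("id_list", ["id_list_prefix", ";"]),
    ("id_list_prefix", ["id"]),
]


def shift_reduce(tokens):
    """Return (accepted, trace) for a list of tokens."""
    stack = []
    trace = []
    for token in tokens:
        stack.append(token)              # shift
        trace.append(("shift", token))
        reduced = True
        while reduced:                   # reduce as long as a handle is on top
            reduced = False
            for lhs, rhs in PRODUCTIONS:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    trace.append(("reduce", lhs))
                    reduced = True
                    break
    return stack == ["id_list"], trace


if __name__ == "__main__":
    accepted, trace = shift_reduce(["id", ",", "id", ";"])
    print(accepted)
    for step in trace:
        print(step)
```

The stack holds what has been seen so far, in contrast to the top-down stack of expected symbols.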

90 90 the right-most derivation
  roots of subtrees constitute sentential forms
  the handle of a sentential form consists of the roots that are going to be joined

91 91 the grammar
    id_list → id id_list_tail
    id_list_tail → , id id_list_tail
    id_list_tail → ;
  the derivation
    id_list ⇒ id id_list_tail
      ⇒ id , id id_list_tail
      ⇒ id , id , id ;

92 92  left-recursive production for stmt_list enables collapsing before reaching end  left-recursive productions for expr and term capture left associativity

93 93 example (again)
    read A
    read B
    sum := A + B
    write sum
    write sum / 2
  find the handle as soon as possible
  keep track of the productions we might be 'in the middle of', and of the place within each where we could be
  LR item: a production with a • marking that place

94 94 beginning: program on the stack
  only one production with program on the left (if not, e.g. S → A, S → B, add a new start symbol S' and the rule S' → S)
  the initial state of the parser:
    program → • stmt_list $$          the basis
    stmt_list → • stmt_list stmt      the closure
    stmt_list → • stmt                the closure
    stmt → • id := expr               the closure
    stmt → • read id                  the closure
    stmt → • write expr               the closure
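The closure of the initial state can be computed with a small worklist algorithm. The sketch below assumes an item is represented as a tuple (left-hand side, right-hand side, position of the marker); the grammar contains only the productions listed on this slide, and expr is treated as an opaque symbol because its productions are not shown here.

```python
# Productions from the slide; right-hand sides are tuples of symbols.
GRAMMAR = {
    "program": [("stmt_list", "$$")],
    "stmt_list": [("stmt_list", "stmt"), ("stmt",)],
    "stmt": [("id", ":=", "expr"), ("read", "id"), ("write", "expr")],
}


def closure(items):
    """Add, for every nonterminal right after the marker, all its productions."""
    result = set(items)
    worklist = list(items)
    while worklist:
        lhs, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for production in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], production, 0)   # marker at the very beginning
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def show(item):
    lhs, rhs, dot = item
    symbols = list(rhs[:dot]) + ["•"] + list(rhs[dot:])
    return f"{lhs} → {' '.join(symbols)}"


if __name__ == "__main__":
    basis = {("program", ("stmt_list", "$$"), 0)}
    for item in sorted(closure(basis)):
        print(show(item))
```

Starting from the single basis item, the function yields the six items of the initial state.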

95 95

96 96

97 97 The characteristic finite state machine and LR parsing variants
  states are pushed onto the stack together with all other symbols
    only states are important for parsing; symbols are used in semantic analysis
    symbols are inputs for the FSM
    for an input X (terminal or nonterminal), the basis of the new state consists of the items of the previous state in which • has passed over X, plus whatever forms the closure

98 98 LR(0), SLR(1), LALR(1) use characteristic finite state machines (CFSM)
  full LR: a much larger number of states
  shift-reduce conflict: one item has • in the middle of a production, suggesting shift, and another has it at the end, suggesting reduce
    LR(0) works only without this conflict
      it can be proven that, with $$ at the end, any language that can be deterministically parsed bottom-up has an LR(0) grammar
      for real programming languages such grammars are much larger and less intuitive
    SLR peeks at upcoming input and uses FOLLOW sets; a conflict is still possible if the tokens following are in FIRST sets of conflicted items

99 99 LALR improves SLR by using local look-ahead
    a conflict is possible if there are two paths through the same state in the CFSM
    full LR keeps paths disjoint when their local look-aheads are different, by duplicating states
  in practice, LALR parsers are the most common
    SLR parsers have the same size and speed, but LALR resolves more conflicts
  our example: shift-reduce conflicts in states 6, 7, 9, 13 are solved by global FOLLOW sets, so SLR suffices

100 100 Bottom-up parsing tables
  SLR(1), LALR(1), LR(1) parsers execute a loop, looking up in a two-dimensional table what action to take
    the table is indexed by the current input token and the current state
    entries are
      'shift': indicates the state that should be pushed
      'reduce': indicates the number of states to be popped and the nonterminal to be pushed back onto the input stream; the new state is determined from the table using the uncovered state and the newly recognized nonterminal

101 101

102 102 Handling epsilon productions
our 'program' grammar
  for LL: stmt_list → stmt stmt_list | ε
  for LR: stmt_list → stmt_list stmt | stmt
it makes sense to allow an empty stmt_list
  the epsilon production can be added to the LR grammar in place of stmt_list → stmt:
  stmt_list → stmt_list stmt | ε

103 103
how does this change our parser?
  state 0 is the only one to be changed
    the item stmt_list → • stmt becomes stmt_list → • ε
    equivalent to stmt_list → ε •
    or just stmt_list → •
  state 0:
    program → • stmt_list $$
    stmt_list → • stmt_list stmt
    stmt_list → •
  the items stmt → • id := expr, stmt → • read id and stmt → • write expr were in state 0 only through the closure of stmt_list → • stmt, so they leave state 0 together with it; they still appear in the state reached from state 0 on stmt_list

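104 104
the look-ahead for stmt_list → • is FOLLOW(stmt_list) = {id, read, write, $$}: a stmt_list can be followed by the first token of a stmt or by $$
once stmt_list → • stmt is replaced, no item in state 0 has the • in front of a terminal, so none of these tokens appears in any other look-ahead of the state
the grammar is still SLR(1)
epsilon productions commonly prevent a grammar from being LR(0): when an item A → • shares a state with an item whose • precedes a terminal, choosing between reduce and shift needs look-ahead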
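The claims on this slide can be checked mechanically. The sketch below is only an illustration under stated assumptions: it encodes the 'program' grammar with the epsilon production, treats expr as if it were a terminal (its productions are not needed here), and uses the hypothetical helper names closure, first_follow and shift_reduce_conflicts. Under these assumptions the closure of the start item contains no stmt items, and the computed FOLLOW(stmt_list) contains id, read and write in addition to $$, because a stmt_list can be followed by a stmt.

```python
# Productions as (lhs, rhs). 'expr' is treated as a terminal (simplification).
GRAMMAR = [
    ("program", ("stmt_list", "$$")),
    ("stmt_list", ("stmt_list", "stmt")),
    ("stmt_list", ()),                    # the epsilon production
    ("stmt", ("id", ":=", "expr")),
    ("stmt", ("read", "id")),
    ("stmt", ("write", "expr")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure(items):
    """LR(0) closure; an item is (production index, dot position)."""
    items = set(items)
    work = list(items)
    while work:
        p, dot = work.pop()
        rhs = GRAMMAR[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (q, 0) not in items:
                    items.add((q, 0))
                    work.append((q, 0))
    return items


def first_follow():
    """Fixed-point computation of FIRST and FOLLOW for the nonterminals."""
    nullable = set()
    first = {n: set() for n in NONTERMINALS}
    follow = {n: set() for n in NONTERMINALS}

    def first_of(sym):
        return first[sym] if sym in NONTERMINALS else {sym}

    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            before = (len(nullable), len(first[lhs]))
            prefix_nullable = True
            for sym in rhs:
                first[lhs] |= first_of(sym)
                if sym not in nullable:
                    prefix_nullable = False
                    break
            if prefix_nullable:
                nullable.add(lhs)
            if before != (len(nullable), len(first[lhs])):
                changed = True
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                size = len(follow[sym])
                tail_nullable = True
                for nxt in rhs[i + 1:]:
                    follow[sym] |= first_of(nxt)
                    if nxt not in nullable:
                        tail_nullable = False
                        break
                if tail_nullable:          # everything after sym can vanish
                    follow[sym] |= follow[lhs]
                if len(follow[sym]) != size:
                    changed = True
    return first, follow


def shift_reduce_conflicts(state, follow):
    """Tokens on which an SLR(1) parser would see both shift and reduce."""
    shift_on = set()
    for p, dot in state:
        rhs = GRAMMAR[p][1]
        if dot < len(rhs) and rhs[dot] not in NONTERMINALS:
            shift_on.add(rhs[dot])
    conflicts = set()
    for p, dot in state:
        lhs, rhs = GRAMMAR[p]
        if dot == len(rhs):                # complete item: reduce on FOLLOW(lhs)
            conflicts |= follow[lhs] & shift_on
    return conflicts


state0 = closure({(0, 0)})
_, follow = first_follow()
print(sorted(follow["stmt_list"]))         # ['$$', 'id', 'read', 'write']
print(shift_reduce_conflicts(state0, follow))
```

The check covers shift-reduce conflicts only; reduce-reduce conflicts would need a pairwise comparison of the FOLLOW sets of complete items.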

105 105 Programming assignments
Construct a parser for a language given by the following CFG:
1. CFG from Figure 2.10 in the book (slide 58)
2. CFG from Figure 2.20 in the book (slide 92)
Test for both:
  read A
  read B
  sum := (A+B) ∗ C
  write sum
  write sum / 2

106 106
3. E → ( E ) T | i T
   T → + E | ∗ E | ε
   i is an integer.
   Test the program for the following expressions:
   (55+8) ∗87, 9+6 ∗5, +876 ∗ i5, 6i ∗ (99+12), 4+4+8, 9 ∗ (+7+12).
4. A → B | A or B
   B → C | B and C
   C → D | not D
   D → identifier | (A)
   Test the program for the following expressions:
   (a or b1) and n1, not dd and true, a_1 and a_2 ( or a_1).
   identifier is a sequence of letters, digits and _, starting with a letter, or a constant true or false.
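Grammar 3 above can be recognized by a predictive (recursive-descent) parser, because each alternative of E and T starts with a different token. The following is a minimal sketch under stated assumptions, not a reference solution: the names tokenize and accepts are hypothetical, the terminal i is taken to mean an unsigned integer literal, and any other character (such as the letter in "6i" or "i5") is treated as a lexical error.

```python
import re

# One token: an unsigned integer or one of ( ) + *, with optional leading spaces.
TOKEN = re.compile(r"\s*(\d+|[()+*])")


def tokenize(text):
    """Return a list of tokens, or None on a lexical error."""
    text = text.replace("∗", "*").rstrip()   # the slides write '∗' for '*'
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            return None
        tok = m.group(1)
        tokens.append("i" if tok.isdigit() else tok)   # integers become terminal i
        pos = m.end()
    return tokens


def accepts(text):
    """True if text is a sentence of  E -> ( E ) T | i T,  T -> + E | * E | eps."""
    tokens = tokenize(text)
    if tokens is None:
        return False
    tokens.append("$")                       # end marker
    pos = 0

    def parse_E():
        nonlocal pos
        if tokens[pos] == "(":               # E -> ( E ) T
            pos += 1
            if not parse_E() or tokens[pos] != ")":
                return False
            pos += 1
            return parse_T()
        if tokens[pos] == "i":               # E -> i T
            pos += 1
            return parse_T()
        return False

    def parse_T():
        nonlocal pos
        if tokens[pos] in ("+", "*"):        # T -> + E | * E
            pos += 1
            return parse_E()
        return True                          # T -> eps; the caller checks what follows

    return parse_E() and tokens[pos] == "$"


for test in ["(55+8) ∗87", "9+6 ∗5", "+876 ∗ i5",
             "6i ∗ (99+12)", "4+4+8", "9 ∗ (+7+12)"]:
    print(test, "->", accepts(test))
```

Taking the ε alternative whenever the next token is neither + nor ∗ defers error detection to the caller; a stricter version would first check that the token is in FOLLOW(T) = { ), end marker }.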

107 107
5. E → E + E | E - E | E ∗ E | E / E | ( E ) | NUMBER
   Test with 55 - ( 5 + 6) /5, 7, 7 -, - 7, (765 - 98 ) ∗ ((65 -1 )/ 4 ).
6. record → record list_comp1 end ;
   list_comp1 → list_comp
   list_comp → id : tip | list_comp ; id : tip
   tip → integer | boolean | real | char | string [ num ] | range
   range → cons .. cons
   cons → id | num
   Here num is an integer, and id is an identifier (sequence of letters, digits and _, starting with a letter).
   Test the program with the following two tests:
   Test 1:
     record
     a1 : real;
     a2_second : string [100]
     end;
   Test 2:
     record
     dd : char ;
     dd1 : integer;
     end;

108 108
7. stmt → if condition then tail_if | while condition do stmt | begin list end | assignment
   tail_if → stmt | stmt else stmt
   list → list ; stmt | stmt
   Test your program with the following two examples:
   1. begin while condition do begin assignment; assignment end; if condition then assignment else begin while condition do assignment; assignment end end end
   2. begin if assignment then while condition do assignment; assignment end end

109 109
8. parsers for parts of grammars of real programming languages, for example
   (a) expressions in Java
   (b) structures in C
   (c)... you can suggest one yourself, but let me know

110 110 Remark:
It is not necessary to build a parse tree; it suffices to decide whether a given input is a correct yield of the grammar.
You can choose the programming language in which your parser will be written.
You can choose a partner to work with on the problem.
Choose an available problem from the list and inform me about your decision. Only one pair can work on one problem in one language!
You should present your parser in person. At most three presentations can be held on the same day in one demonstration group. Choose one of weeks 11, 13, 14, 15 for your presentation and let me know your choice by March 7th.

