
1 Compiler Principles Winter 2012-2013 Syntax Analysis (Parsing) – Part 1 Mayer Goldberg and Roman Manevich Ben-Gurion University

2 Books 2 Compilers: Principles, Techniques, and Tools Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman Advanced Compiler Design and Implementation Steven Muchnick Modern Compiler Design D. Grune, H. Bal, C. Jacobs, K. Langendoen Modern Compiler Implementation in Java Andrew W. Appel

3 Today Understand the role of syntax analysis Context-free grammars – Basic definitions – Ambiguities Top-down parsing – Predictive parsing Next time: bottom-up parsing 3

4 The bigger picture Compilers include different kinds of program analyses, each of which further constrains the set of legal programs – Lexical constraints – Syntax constraints – Semantic constraints – “Logical” constraints (Verifying Compiler grand challenge) 4 Program consists of legal tokens Program included in a given context-free language Type checking, legal inheritance graph, variables initialized before use Memory safety: null dereference, array-out-of-bounds access, data races, assertion violation

5 Role of syntax analysis Recover structure from stream of tokens – Parse tree / abstract syntax tree Error reporting (recovery) Other possible tasks – Syntax-directed translation (one-pass compilers) – Create symbol table – Create pretty-printed version of the program, e.g., Auto Formatting function in Eclipse 5 Pipeline: High-level Language (Scheme) – Lexical Analysis – Syntax Analysis (Parsing) – AST, Symbol Table, etc. – Intermediate Representation (IR) – Code Generation – Executable Code

6 From tokens to abstract syntax trees 6 program text: 5 + (7 * x) Lexical Analyzer token stream: num + ( num * id ) Parser Grammar: E  id E  num E  E + E E  E * E E  ( E ) Abstract Syntax Tree (if valid) or syntax error Regular expressions / finite automata handle tokens; context-free grammars / push-down automata handle parsing

7 Example grammar 7 S  S ; S S  id := E S  print (L) E  id E  num E  E + E L  E L  L, E (S is shorthand for Statement, E for Expression, L for List of expressions)

8 CFG terminology 8 Symbols : Terminals (tokens): ; := ( ) id num print Non-terminals: S E L Start non-terminal: S Convention: the non-terminal appearing in the first derivation rule Grammar productions (rules) N  α S  S ; S S  id := E S  print (L) E  id E  num E  E + E L  E L  L, E

9 Language of a CFG A sentence ω is in L(G) (valid program) if – There exists a corresponding derivation – There exists a corresponding parse tree 9

10 Derivations Show that a sentence ω is in a grammar G – Start with the start symbol – Repeatedly replace one of the non-terminals by a right-hand side of a production – Stop when the sentence contains only terminals Given a sentential form αNβ and rule N  µ, we can derive αNβ => αµβ ω is in L(G) if S =>* ω Two canonical orders: – Rightmost derivation – Leftmost derivation 10

11 Leftmost derivation 11 S => S ; S => id := E ; S => id := num ; S => id := num ; id := E => id := num ; id := E + E => id := num ; id := num + E => id := num ; id := num + num a := 56 ; b := 7 + 3 id := num ; id := num + num S  S ; S S  id := E S  print (L) E  id E  num E  E + E L  E L  L, E
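The leftmost derivation above can be replayed mechanically as string rewriting. A minimal Python sketch; the rule sequence is copied from the slide and hand-picked so that the named nonterminal is always the leftmost one (finding the right sequence is the parser's job, not shown here):

```python
# Sketch: a leftmost derivation replayed as string rewriting.
def apply_leftmost(sentential, nonterminal, rhs):
    """Replace the leftmost occurrence of `nonterminal` by `rhs`."""
    i = sentential.index(nonterminal)
    return sentential[:i] + rhs + sentential[i + len(nonterminal):]

# Rule choices for "a := 56 ; b := 7 + 3", i.e. id := num ; id := num + num
steps = [("S", "S ; S"), ("S", "id := E"), ("E", "num"),
         ("S", "id := E"), ("E", "E + E"), ("E", "num"), ("E", "num")]
sentential = "S"
for nt, rhs in steps:
    sentential = apply_leftmost(sentential, nt, rhs)
    print(sentential)          # one line per derivation step
# final line: id := num ; id := num + num
```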

12 Rightmost derivation 12 S => S ; S => S ; id := E => S ; id := E + E => S ; id := E + num => S ; id := num + num => id := E ; id := num + num => id := num ; id := num + num a := 56 ; b := 7 + 3 id := num ; id := num + num S  S ; S S  id := E S  print (L) E  id E  num E  E + E L  E L  L, E

13 Parse trees Tree nodes are symbols, children ordered left-to-right Each internal node is a non-terminal and its children correspond to one of its productions N  µ 1 … µ k Root is the start non-terminal Leaves are tokens Yield of parse tree: left-to-right walk over leaves 13

14 Parse tree example 14 S  S ; S S  id := E S  print (L) E  id E  num E  E + E L  E L  L, E Draw the parse tree for: id := num ; id := num + num

15 Parse tree example 15 S  S ; S S  id := E S  print (L) E  id E  num E  E + E L  E L  L, E The parse tree is an order-independent representation. Equivalently, add parentheses labeled by non-terminal names: ( S ( S a := ( E 56 ) E ) S ; ( S b := ( E ( E 7 ) E + ( E 3 ) E ) E ) S ) S

16 Capabilities and limitations of CFGs CFGs naturally express – Hierarchical structure A program is a list of classes, a class is a list of definitions, a definition is either… – Beginning-end type of constraints Balanced parentheses S  (S)S | ε Cannot express – Correlations between unbounded strings (identifiers) – Variables are declared before use: ω S ω – Handled by semantic analysis 16 p. 173

17 Sometimes there are two parse trees 17 Arithmetic expressions: E  id E  num E  E + E E  E * E E  ( E ) Input: 1 + 2 + 3 Leftmost derivation: E => E + E => num + E => num + E + E => num + num + E => num + num + num, giving the tree for 1 + (2 + 3) Rightmost derivation: E => E + E => E + num => E + E + num => E + num + num => num + num + num, giving the tree for (1 + 2) + 3

18 Is ambiguity a problem? 18 For 1 + 2 + 3 both trees, 1 + (2 + 3) and (1 + 2) + 3, evaluate to 6, so whether the ambiguity matters depends on the semantics

19 Problematic ambiguity example 19 For 1 + 2 * 3 the two trees differ: 1 + (2 * 3) = 7 and (1 + 2) * 3 = 9. This is what we usually want: * has precedence over +, i.e., the tree for 1 + (2 * 3)

20 Ambiguous grammars A grammar is ambiguous if there exists a sentence for which there are – Two different leftmost derivations – Two different rightmost derivations – Two different parse trees Ambiguity is a property of grammars, not languages Some languages are inherently ambiguous – no unambiguous grammar exists for them There is no algorithm to detect whether an arbitrary grammar is ambiguous (the problem is undecidable) 20

21 Drawbacks of ambiguous grammars Ambiguous semantics Parsing complexity May affect other phases Solutions – Transform the grammar into an unambiguous one – Handle ambiguity as part of the parsing method, using a special form of “precedence” (wait for the bottom-up parsing lecture) 21

22 Transforming ambiguous grammars to unambiguous ones by layering Ambiguous grammar E  E + E E  E * E E  id E  num E  ( E ) Unambiguous grammar E  E + T E  T T  T * F T  F F  id F  num F  ( E ) Layer 1 Layer 2 Layer 3 Let’s derive 1 + 2 * 3 Each layer takes care of one way of composing substrings to form a string: 1: by + 2: by * 3: atoms 22

23 Transformed grammar: * precedes + Ambiguous grammar E  E + E E  E * E E  id E  num E  ( E ) Unambiguous grammar E  E + T E  T T  T * F T  F F  id F  num F  ( E ) Derivation E => E + T => T + T => F + T => 1 + T => 1 + T * F => 1 + F * F => 1 + 2 * F => 1 + 2 * 3 Parse tree: + at the root, with 2 * 3 grouped under T 23

24 Transformed grammar: + precedes * Ambiguous grammar E  E + E E  E * E E  id E  num E  ( E ) Unambiguous grammar E  E * T E  T T  T + F T  F F  id F  num F  ( E ) Derivation E => E * T => T * T => T + F * T => F + F * T => 1 + F * T => 1 + 2 * T => 1 + 2 * F => 1 + 2 * 3 Parse tree: * at the root, with 1 + 2 grouped under E 24

25 Another example of layering 25 Ambiguous grammar P  ε | P P | ( P ) Unambiguous grammar S  P S | ε P  ( S ) S takes care of “concatenation”, P of nesting

26 “dangling-else” example 26 Ambiguous grammar S  if E then S | if E then S else S | other For the input if E 1 then if E 2 then S 1 else S 2 there are two parse trees: if E 1 then (if E 2 then S 1 else S 2 ) and if E 1 then (if E 2 then S 1 ) else S 2 This is what we usually want: match else to the closest unmatched then p. 174

27 “dangling-else” example 27 Ambiguous grammar S  if E then S | if E then S else S | other Unambiguous grammar S  M | U M  if E then M else M | other U  if E then S | if E then M else U M derives matched statements, U unmatched ones; for if E 1 then if E 2 then S 1 else S 2 only the tree if E 1 then (if E 2 then S 1 else S 2 ) remains p. 174

28 Broad kinds of parsers Parsers for arbitrary grammars – Earley’s method, CYK method O(n 3 ) – Not used in practice Top-down – Construct parse tree in a top-down manner – Find the leftmost derivation – Predictive: for every non-terminal and k tokens predict the next production LL(k) – Preorder tree traversal Bottom-up – Construct parse tree in a bottom-up manner – Find the rightmost derivation in reverse order – For every potential right-hand side and k tokens decide when a production is found LR(k) – Postorder tree traversal 28

29 Top-down vs. bottom-up Top-down parsing – Beginning with the start symbol, try to guess the productions to apply to end up at the user's program Bottom-up parsing – Beginning with the user's program, try to apply productions in reverse to convert the program back into the start symbol 29

30 Top-down parsing 30 Unambiguous grammar E  E * T E  T T  T + F T  F F  id F  num F  ( E ) Input: 1 + 2 * 3; the goal is the full parse tree for this string

31–36 Top-down parsing, step by step Starting from the root E, predict E  E * T (we need this rule to get the *); then expand the left E via E  T and T  T + F, match 1 and 2 through F  num, and finally expand the right T via T  F and match 3. The tree grows from the root downward, one production at a time

37 Bottom-up parsing 37 Unambiguous grammar E  E * T E  T T  T + F T  F F  id F  num F  ( E ) Input: 1 + 2 * 3

38–45 Bottom-up parsing, step by step Reduce the tokens 1, 2, 3 to F’s, reduce F to T where needed, combine T + F into T and promote it to E, and finally reduce E * T to E. The same tree is built from the leaves upward (postorder), one reduction at a time

46 Challenges in top-down parsing Top-down parsing begins with virtually no information – Begins with just the start symbol, which matches every program How can we know which productions to apply? In general, we can’t – There are some grammars for which the best we can do is guess and backtrack if we’re wrong If we have to guess, how do we do it? – Parsing as a search algorithm – Too expensive in theory (exponential worst-case time) and practice 46

47 Predictive parsing Given a grammar G and a word w, attempt to derive w using G Idea – Apply a production to the leftmost nonterminal – Pick the production rule based on the next input token General grammar – More than one option for choosing the next production based on a token Restricted grammars (LL) – Know exactly which single rule to apply – May require some lookahead to decide 47

48 Boolean expressions example 48 E  LIT | (E OP E) | not E LIT  true | false OP  and | or | xor Input: not ( not true or false ) E => not E => not ( E OP E ) => not ( not E OP E ) => not ( not LIT OP E ) => not ( not true OP E ) => not ( not true or E ) => not ( not true or LIT ) => not ( not true or false ) At each step, the production to apply is known from the next token

49 Recursive descent parsing Define a function for every nonterminal Every function works as follows – Find an applicable production rule – A terminal is checked by matching it against the next input token – A nonterminal is handled by (recursively) calling its function If there are several applicable productions for a nonterminal, use lookahead 49

50 Matching tokens Variable current holds the current input token 50 match(token t) { if (current == t) current = next_token() else error } E  LIT | (E OP E) | not E LIT  true | false OP  and | or | xor

51 Functions for nonterminals 51 E  LIT | (E OP E) | not E LIT  true | false OP  and | or | xor E() { if (current  {TRUE, FALSE}) { LIT(); } // E  LIT else if (current == LPAREN) { match(LPAREN); E(); OP(); E(); match(RPAREN); } // E  ( E OP E ) else if (current == NOT) { match(NOT); E(); } // E  not E else error; } LIT() { if (current == TRUE) match(TRUE); else if (current == FALSE) match(FALSE); else error; }

52 Implementation via recursion 52 E → LIT | ( E OP E ) | not E LIT → true | false OP → and | or | xor E() { if (current  {TRUE, FALSE}) LIT(); else if (current == LPAREN) { match(LPAREN); E(); OP(); E(); match(RPAREN); } else if (current == NOT) { match(NOT); E(); } else error; } LIT() { if (current == TRUE) match(TRUE); else if (current == FALSE) match(FALSE); else error; } OP() { if (current == AND) match(AND); else if (current == OR) match(OR); else if (current == XOR) match(XOR); else error; }
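For readers who want to run the pseudocode, here is a Python transcription. It is a sketch: representing the token stream as a list of strings terminated by "$" is our convention here, not part of the slides.

```python
# Recursive descent for E -> LIT | ( E OP E ) | not E,
# LIT -> true | false, OP -> and | or | xor.
class BoolParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.current == t:
            self.pos += 1
        else:
            raise SyntaxError("expected %s, got %s" % (t, self.current))

    def E(self):                                  # E -> LIT | ( E OP E ) | not E
        if self.current in ("true", "false"):
            self.LIT()
        elif self.current == "(":
            self.match("("); self.E(); self.OP(); self.E(); self.match(")")
        elif self.current == "not":
            self.match("not"); self.E()
        else:
            raise SyntaxError("unexpected token " + self.current)

    def LIT(self):                                # LIT -> true | false
        if self.current in ("true", "false"):
            self.match(self.current)
        else:
            raise SyntaxError("expected literal")

    def OP(self):                                 # OP -> and | or | xor
        if self.current in ("and", "or", "xor"):
            self.match(self.current)
        else:
            raise SyntaxError("expected operator")

def parses(text):
    p = BoolParser(text.split() + ["$"])
    try:
        p.E()
        return p.current == "$"   # the whole input must be consumed
    except SyntaxError:
        return False

print(parses("not ( not true or false )"))  # True
```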

53 Adding semantic actions Can add an action to perform on each production rule Can build the parse tree – Every function returns an object of type Node – Every Node maintains a list of children – Function calls can add new children 53

54 Building the parse tree 54 Node E() { result = new Node(); result.name = “E”; if (current  {TRUE, FALSE}) { result.addChild(LIT()); } // E  LIT else if (current == LPAREN) { result.addChild(match(LPAREN)); result.addChild(E()); result.addChild(OP()); result.addChild(E()); result.addChild(match(RPAREN)); } // E  ( E OP E ) else if (current == NOT) { result.addChild(match(NOT)); result.addChild(E()); } // E  not E else error; return result; }

55 Recursive descent How do you pick the right A-production? Generally – try them all and use backtracking In our case – use lookahead 55 void A() { choose an A-production, A  X 1 X 2 …X k ; for (i=1; i ≤ k; i++) { if (X i is a nonterminal) call procedure X i (); else if (X i == current) advance input; else report error; } }

56 Problem 1: productions with common prefix 56 term  ID | indexed_elem indexed_elem  ID [ expr ] The function for indexed_elem will never be tried… – What happens for input of the form ID [ expr ] ?

57 Problem 2: null productions 57 S  A a b A  a | ε int S() { return A() && match(token('a')) && match(token('b')); } int A() { return match(token('a')) || 1; } – What happens for input “ab”? – What happens if you flip the order of alternatives and try “aab”?
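To see both failure modes concretely, here is a Python transcription of the two procedures. The make_parser wrapper and the flip flag are illustrative additions: flip=False mirrors the code above (A tries "a" first), flip=True tries the null production first.

```python
# Sketch of problem 2: S -> A a b, A -> a | eps, without backtracking.
def make_parser(flip):
    def parse(s):
        pos = 0
        def match(c):
            nonlocal pos
            if pos < len(s) and s[pos] == c:
                pos += 1
                return True
            return False
        def A():                       # A -> a | eps
            if flip:
                return True            # eps always "succeeds"; 'a' is never tried
            return match("a") or True  # consumes an 'a' whenever one is present
        def S():                       # S -> A a b
            return A() and match("a") and match("b")
        return S() and pos == len(s)
    return parse

greedy = make_parser(flip=False)
lazy = make_parser(flip=True)
print(greedy("ab"), greedy("aab"))  # False True  (A steals the 'a' of "ab")
print(lazy("ab"), lazy("aab"))      # True False  (A never consumes an 'a')
```

Either fixed ordering rejects some valid input; without backtracking the parser needs the FIRST/FOLLOW machinery introduced later in the deck.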

58 Problem 3: left recursion 58 E  E - term | term int E() { return E() && match(token('-')) && term(); } – What happens with this procedure? – Recursive descent parsers cannot handle left-recursive grammars p. 127

59 FIRST sets For every production rule A  α – FIRST(α) = all terminals that α can start with – Every token that can appear first in α under some derivation In our Boolean expressions example – FIRST( LIT ) = { true, false } – FIRST( ( E OP E ) ) = { ( } – FIRST( not E ) = { not } If there is no intersection between the FIRST sets, we can always pick a single rule If the FIRST sets intersect, we may need longer lookahead – LL(k) = class of grammars in which the production rule can be determined using a lookahead of k tokens – LL(1) is an important and useful class 59

60 Computing FIRST sets Assume no null productions A  ε 1. Initially, for all nonterminals A, set FIRST(A) = { t | A  tω for some ω } 2. Repeat the following until no changes occur: for each nonterminal A, for each production A  Bω, set FIRST(A) = FIRST(A) ∪ FIRST(B) This is known as a fixed-point computation 60
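The two steps above translate directly into code. A Python sketch of the fixed-point computation for the STMT/EXPR/TERM grammar of the following slides; the dictionary encoding of productions is an assumption, and the code relies on the no-null-productions restriction (only the first symbol of each alternative matters):

```python
# Grammar: nonterminal -> list of alternatives, each a list of symbols.
grammar = {
    "STMT": [["if", "EXPR", "then", "STMT"],
             ["while", "EXPR", "do", "STMT"],
             ["EXPR", ";"]],
    "EXPR": [["TERM", "->", "id"], ["zero?", "TERM"],
             ["not", "EXPR"], ["++", "id"], ["--", "id"]],
    "TERM": [["id"], ["constant"]],
}

def first_sets(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:                      # iterate until a fixed point
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                x = alt[0]              # no eps: only the first symbol matters
                new = first[x] if x in grammar else {x}
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

f = first_sets(grammar)
print(f["STMT"])   # the fixed point of slide 65
```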

61 FIRST sets computation example 61 STMT  if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR  TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM  id | constant TERMEXPRSTMT

62 1. Initialization 62 STMT  if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR  TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM  id | constant FIRST(TERM) = { id, constant } FIRST(EXPR) = { zero?, not, ++, -- } FIRST(STMT) = { if, while }

63 2. Iterate 1 63 STMT  if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR  TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM  id | constant FIRST(TERM) = { id, constant } FIRST(EXPR) = { zero?, not, ++, -- } FIRST(STMT) = { if, while, zero?, not, ++, -- }

64 2. Iterate 2 64 STMT  if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR  TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM  id | constant FIRST(TERM) = { id, constant } FIRST(EXPR) = { zero?, not, ++, --, id, constant } FIRST(STMT) = { if, while, zero?, not, ++, -- }

65 2. Iterate 3 – fixed-point 65 STMT  if EXPR then STMT | while EXPR do STMT | EXPR ; EXPR  TERM -> id | zero? TERM | not EXPR | ++ id | -- id TERM  id | constant FIRST(TERM) = { id, constant } FIRST(EXPR) = { zero?, not, ++, --, id, constant } FIRST(STMT) = { if, while, zero?, not, ++, --, id, constant }

66 FOLLOW sets What do we do with nullable (ε) productions? – A  B C D B  ε C  ε – Use what comes afterwards to predict the right production For every non-terminal A – FOLLOW(A) = set of tokens that can immediately follow A Can predict the alternative N  α k for a non-terminal N when the lookahead token is in the set – FIRST(α k )  (if α k is nullable then FOLLOW(N)) p. 189 66
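FOLLOW is computed by a similar fixed point. A Python sketch for the problem-2 grammar (S  A a b, A  a | ε); the FIRST sets and nullable flags are hand-computed inputs here, and [] encodes the empty alternative:

```python
# Grammar S -> A a b, A -> a | eps; "$" is the end-of-input marker.
grammar = {"S": [["A", "a", "b"]], "A": [["a"], []]}
first = {"S": {"a"}, "A": {"a"}}           # hand-computed
nullable = {"S": False, "A": True}         # hand-computed

def follow_sets(grammar, start, first, nullable):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                 # end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in grammar: # only nonterminals get FOLLOW sets
                        continue
                    trailer = set()        # FIRST of what can follow sym here
                    rest_nullable = True
                    for nxt in alt[i + 1:]:
                        trailer |= first[nxt] if nxt in grammar else {nxt}
                        if not (nxt in grammar and nullable[nxt]):
                            rest_nullable = False
                            break
                    if rest_nullable:      # sym can end the production
                        trailer |= follow[A]
                    if not trailer <= follow[sym]:
                        follow[sym] |= trailer
                        changed = True
    return follow

follow = follow_sets(grammar, "S", first, nullable)
print(follow)   # FOLLOW(S) = {'$'}, FOLLOW(A) = {'a'}
```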

67 LL(k) grammars A grammar is in the class LL(k) when it can be parsed via: – Top-down derivation – Scanning the input from left to right (L) – Producing the leftmost derivation (L) – With lookahead of k tokens (k) For LL(1): for every two productions A  α and A  β we need FIRST(α) ∩ FIRST(β) = {} and, if β is nullable, FIRST(α) ∩ FOLLOW(A) = {} A language is said to be LL(k) when it has an LL(k) grammar 67

68 Back to problem 1 68 term  ID | indexed_elem indexed_elem  ID [ expr ] FIRST(term) = { ID } FIRST(indexed_elem) = { ID } FIRST/FIRST conflict

69 Solution: left factoring Rewrite the grammar to be in LL(1) Intuition: just like factoring x*y + x*z into x*(y+z) term  ID | indexed_elem indexed_elem  ID [ expr ] becomes term  ID after_ID after_ID  [ expr ] | ε 69

70 Left factoring – another example 70 S  if E then S else S | if E then S | T becomes S  if E then S S’ | T S’  else S | ε

71 Back to problem 2 71 S  A a b A  a | ε FIRST(S) = { a } FOLLOW(S) = { $ } FIRST(A) = { a, ε } FOLLOW(A) = { a } FIRST/FOLLOW conflict

72 Solution: substitution 72 S  A a b A  a | ε Substitute A in S: S  a a b | a b Then left factoring: S  a after_A after_A  a b | b

73 Back to problem 3 Left recursion cannot be handled with a bounded lookahead What can we do? E  E - term | term 73

74 Left recursion removal 74 G 1 : N  Nα | β L(G 1 ) = β, βα, βαα, βααα, … G 2 : N  βN’ N’  αN’ | ε L(G 2 ) = same For our 3rd example: E  E - term | term becomes E  term TE TE  - term TE | ε Can be done algorithmically. Problem: the grammar may become mangled beyond recognition p. 130
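After the transformation, the right-recursive tail TE  - term TE | ε collapses into a simple loop in a recursive-descent parser. A Python sketch; the token handling (a list of strings, integer literals only) and the evaluation of the expression are illustrative additions that make the left-associative grouping visible:

```python
# E -> term TE ; TE -> - term TE | eps, implemented as a while loop.
def parse_E(tokens):
    pos = 0
    def term():
        nonlocal pos
        value = int(tokens[pos])   # term -> num
        pos += 1
        return value
    value = term()                 # E -> term TE
    while pos < len(tokens) and tokens[pos] == "-":
        pos += 1                   # TE -> - term TE
        value = value - term()     # left-associative: ((7 - 3) - 2)
    if pos != len(tokens):
        raise SyntaxError("unexpected token " + tokens[pos])
    return value                   # TE -> eps

print(parse_E("7 - 3 - 2".split()))  # 2
```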

75 LL(k) Parsers Recursive Descent – Manual construction – Uses recursion Wanted – A parser that can be generated automatically – Does not use recursion 75

76 LL(k) parsing via pushdown automata 76 Pushdown automaton uses – Prediction stack – Input stream – Transition table: nonterminals × tokens -> production alternative Entry indexed by nonterminal N and token t contains the alternative of N that must be predicted when the current input starts with t

77 LL(k) parsing via pushdown automata Two possible moves – Prediction When top of stack is nonterminal N, pop N, lookup table[N,t]. If table[N,t] is not empty, push table[N,t] on prediction stack, otherwise – syntax error – Match When top of prediction stack is a terminal T, must be equal to next input token t. If (t == T), pop T and consume t. If (t ≠ T) syntax error Parsing terminates when prediction stack is empty – If input is empty at that point, success. Otherwise, syntax error 77
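The two moves translate into a short table-driven loop. A Python sketch using the grammar A  aAb | c of the following slides; the encoding of the transition table as a dictionary is an assumption:

```python
# Predict/match loop for grammar A -> aAb | c, with "$" as end marker.
table = {("A", "a"): ["a", "A", "b"], ("A", "c"): ["c"]}
nonterminals = {"A"}

def ll1_parse(tokens):
    tokens = list(tokens) + ["$"]     # append the end-of-input marker
    stack = ["$", "A"]                # prediction stack; top is the list end
    pos = 0
    while stack:
        top = stack.pop()
        t = tokens[pos]
        if top in nonterminals:       # prediction move
            alt = table.get((top, t))
            if alt is None:
                return False          # empty table entry: syntax error
            stack.extend(reversed(alt))
        elif top == t:                # match move (also matches the final "$")
            pos += 1
        else:
            return False              # terminal mismatch: syntax error
    return pos == len(tokens)         # success iff input fully consumed

print(ll1_parse("aacbb"))  # True
print(ll1_parse("abcbb"))  # False: predict(A,b) is an empty entry
```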

78 Model of non-recursive predictive parser 78 (Diagram: a predictive parsing program reads the input a + b $, consults a parsing table, and maintains a prediction stack X Y Z $, emitting output as it predicts and matches.)

79 Example transition table 79 (1) E → LIT (2) E → ( E OP E ) (3) E → not E (4) LIT → true (5) LIT → false (6) OP → and (7) OP → or (8) OP → xor Rows are nonterminals, columns are input tokens; each entry gives the rule to use: E: ( → 2, not → 3, true → 1, false → 1 LIT: true → 4, false → 5 OP: and → 6, or → 7, xor → 8

80 Running parser example 80 A  aAb | c Table: (A, a) → A  aAb ; (A, c) → A  c Input: aacbb$ Input suffix | Stack content | Move aacbb$ | A$ | predict(A,a) = A  aAb aacbb$ | aAb$ | match(a,a) acbb$ | Ab$ | predict(A,a) = A  aAb acbb$ | aAbb$ | match(a,a) cbb$ | Abb$ | predict(A,c) = A  c cbb$ | cbb$ | match(c,c) bb$ | bb$ | match(b,b) b$ | b$ | match(b,b) $ | $ | match($,$) – success

81 Illegal input example 81 A  aAb | c Table: (A, a) → A  aAb ; (A, c) → A  c Input: abcbb$ Input suffix | Stack content | Move abcbb$ | A$ | predict(A,a) = A  aAb abcbb$ | aAb$ | match(a,a) bcbb$ | Ab$ | predict(A,b) = ERROR

82 Error handling and recovery 82 x = a * (p+q * ( -b * (r-s); – Where should we report the error? – The valid prefix property – Recovery is tricky – Heuristics for dropping tokens, skipping to semicolon, etc.

83 Error handling in LL parsers 83 S  a c | b S Input: c$ Input suffix | Stack content | Move c$ | S$ | predict(S,c) = ERROR Now what? – Predict b S anyway: “missing token b inserted in line XXX”

84 Error handling in LL parsers 84 S  a c | b S Input: bc$ Input suffix | Stack content | Move bc$ | S$ | predict(S,b) = S  b S bc$ | bS$ | match(b,b) c$ | S$ | Looks familiar? Result: infinite loop

85 Error handling Requires more systematic treatment Enrichment – Acceptable-set method – Not part of course material 85

86 Summary Parsing – Top-down or bottom-up Top-down parsing – Recursive descent – LL(k) grammars – LL(k) parsing with pushdown automata LL(k) parsers – Cannot deal with left recursion – Left-recursion removal might result in a complicated grammar 86

87 See you next time

