Theory of Compilation 236360 Erez Petrank Lecture 2: Syntax Analysis, Top-Down Parsing 1.

Theory of Compilation 236360 Erez Petrank Lecture 2: Syntax Analysis, Top-Down Parsing 1

You are here 2 Executable code Executable code exe Source text Source text txt Compiler Lexical Analysis Syntax Analysis Parsing Semantic Analysis Inter. Rep. (IR) Inter. Rep. (IR) Code Gen.

Last Week: from characters to tokens (Using Regular Expressions) 3 x = b*b – 4*a*c txt Token Stream

The Lex Tool Lex automatically generates a lexical analyzer from declaration file. Advantages: easy to produce a lexical analyzer from a short declaration, easily verified, easily modified and maintained. Intuitively: Lex builds a DFA, The analyzer simulates the DFA on a given input. 4 Lex Declaration file Lexical Analysis characters tokens

Today: from tokens to AST 5 Lexical Analysis Syntax Analysis Sem. Analysis Inter. Rep. Code Gen. ‘b’ ‘4’ ‘b’ ‘a’ ‘c’ ID factor term factor MULT term expression factor term factor MULT term expression term MULTfactor MINUS Syntax Tree

Syntax Analysis (Parsing) Goal: discover the program structure. – For example, a C program is built of functions, each function is built from declarations and instructions, each instruction is built from expressions, etc. – Is a sequence of tokens a valid program in the language? – Construct a structured representation of the input text – Error detection and reporting A simple and accurate method for describing a program structure is context free grammars. We will look at families of grammars that can be efficiently parsed. The parser will read the token series, make sure that they are derivable in the grammar (or report an error), and construct the derivation tree. 6

Context free grammars V – non terminals T – terminals (tokens for us) P – derivation rules – Each rule of the form V ➞ (T ∪ V) S ∈ V – the initial symbol 7 G = (V,T,P,S)

Why do we need context free grammars? Important program structures cannot be expressed by regular expressions. E.g., balanced parenthesis… – S ➞ SS; S ➞ (S); S ➞ () Anything expressible as a regular expression is expressible by CFG. Why use regular expressions at all? – Separation, modularity, simplification. – No point in using strong (and less efficient) tools on easily analyzable regular expressions. Regular expressions describe lexical structures like identifiers, constants, keywords, etc. Grammars describe nested structured like balanced parenthesis, match begin-end, if-then-else, etc. 8

Example S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) 9 V = { S, E } T = { id, ‘+’, ‘*’, ‘(‘, ‘)’} S is the initial variable.

Terminology Derivation: a sequence of replacements of non-terminals using the derivation rules. Language: the set of strings of terminals derivable from the initial state. Sentential form ( תבנית פסוקית ) – the result of a partial derivation in which there may be non-terminals. 10

Derivation Example 11 S SS ; id := ES ; id := idS ; id := E ; id := idid := E + E ; id := idid := E + id ; id := idid := id + id ; x := z; y := x + z S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) S ➞ S;S S ➞ id := E E ➞ id S ➞ id := E E ➞ E + E E ➞ id x:= z;y := x + z inputgrammar

Parse Tree 12 S SS ; id := ES ; id := idS ; id := E ; id := idid := E + E ; id := idid := E + id ; id := idid := id + id ; x:= z;y := x + z S S S S ; ; S S id := E E id := E E E E + + E E id

Questions How did we know which rule to apply on every step? Does it matter? Would we always get the same result? 13

Ambiguity 14 x := y+z*w S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) S S id := E E E E + + E E id E E * * E E S S := E E E E * * E E id E E + + E E

Leftmost/rightmost Derivation Leftmost derivation – always expand leftmost non-terminal Rightmost derivation – Always expand rightmost non-terminal Allows us to describe derivation by listing a sequence of rules only. – always know what a rule is applied to Note that it does not necessarily solve ambiguity (e.g., previous slide). These are the orders of derivation applied in our parsers (coming soon). 15

Leftmost Derivation 16 x := z; y := x + z S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) S SS ; id := ES ; id := idS ; id := E ; id := idid := E + E ; id := idid := id + E ; id := idid := id + id ; S ➞ S;S S ➞ id := E E ➞ id S ➞ id := E E ➞ E + E E ➞ id x:= z;y := x + z

Rightmost Derivation 17 S SS ; Sid := E ; Sid := E + E ; Sid := E + id ; Sid := id + id ; id := Eid := id + id ; id := idid := id + id ; x := z; y := x + z S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) S ➞ S;S S ➞ id := E E ➞ E + E E ➞ id S ➞ id := E E ➞ id x:= z;y := x + z

Bottom-up Example 18 x := z; y := x + z S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) id := id ; id := id + id id := Eid := id + id ; S ; Sid := E + id ; Sid := E + E ; Sid := E ; SS ; S E ➞ id S ➞ id := E E ➞ id E ➞ E + E S ➞ id := E S ➞ S;S Bottom-up picking left alternative on every step  Rightmost derivation when going top-down

Parsing A context free language can be recognized by a non- deterministic pushdown automaton – But not a deterministic one… Parsing can be seen as a search problem – Can you find a derivation from the start symbol to the input word? – Easy (but very expensive) to solve with backtracking Cocke-Younger-Kasami parser can be used to parse any context- free language but has complexity O(n 3 ) – Imagine a program with hundreds of thousands of lines of code. We want efficient parsers – Linear in input size – Deterministic pushdown automata – We will sacrifice generality for efficiency 19

“Brute-force” Parsing 20 x := z; y := x + z S ➞ S;S S ➞ id := E E ➞ id | E + E | E * E | ( E ) id := id ; id := id + id id := Eid := id + id ; id := idid := E+ id ; … E ➞ id (not a parse tree… a search for the parse tree by exhaustively applying all rules) id := Eid := id + id ; id := Eid := id + id ;

Efficient Parsers Top-down (predictive) – Construct the leftmost derivation – Apply rules “from left to right” – Predict what rule to apply based on nonterminal and token Bottom up (shift reduce) – Construct the rightmost derivation – Apply rules “from right to left” – Reduce a right-hand side of a production to its non-terminal 21

Efficient Parsers Top-down (predictive parsing) 22  Bottom-up (shift reduce) to be read… already read…

Top-down Parsing Given a grammar G=(V,T,P,S) and a word w Goal: derive w using G Idea – Apply production to leftmost nonterminal – Pick production rule based on next input token General grammar – More than one option for choosing the next production based on a token Restricted grammars (LL) – Know exactly which single rule to apply – May require some lookahead to decide 23

An Easily Parse-able Grammar 24 E ➞ LIT | (E OP E) | not E LIT ➞ true | false OP ➞ and | or | xor not (not true or false) E => not E => not ( E OP E ) => not ( not E OP E ) => not ( not LIT OP E ) => not ( not true OP E ) => not ( not true or E ) => not ( not true or LIT ) => not ( not true or false ) Production to apply is known from next input token E E not E E E E OP E E LIT true not LIT or ( ( ) ) false At any stage, looking at the current variable and the next input token, a rule can be easily determined.

LL(k) Grammars A grammar is in the class LL(K) when it can be derived via: – Top down derivation – Scanning the input from left to right (L) – Producing the leftmost derivation (L) – With lookahead of k tokens (k) A language is said to be LL(k) when it has an LL(k) grammar 25

Recursive Descent Parsing Define a function for every nonterminal Every function simulates the derivation of the variable it represents: – Find applicable production rule – Terminal function checks match with next input token – Nonterminal function calls (recursively) other functions If there are several applicable productions for a nonterminal, use lookahead 26

Matching tokens Variable current holds the current input token 27 void match(token t) { if (current == t) current = next_token(); else ; }

functions for nonterminals 28 E ➞ LIT | (E OP E) | not E LIT ➞ true | false OP ➞ and | or | xor void E() { if (current  {TRUE, FALSE}) // E → LIT LIT(); else if (current == LPAREN) // E → ( E OP E ) match(LPARENT); E(); OP(); E(); match(RPAREN); else if (current == NOT)// E → not E match(NOT); E(); else error; }

functions for nonterminals 29 void LIT() { if (current == TRUE) match(TRUE); else if (current == FALSE) match(FALSE); else error; } E ➞ LIT | (E OP E) | not E LIT ➞ true | false OP ➞ and | or | xor

functions for nonterminals 30 void OP() { if (current == AND) match(AND); else if (current == OR) match(OR); else if (current == XOR) match(XOR); else error; } E ➞ LIT | (E OP E) | not E LIT ➞ true | false OP ➞ and | or | xor

Overall: Functions for Grammar 31 E → LIT | ( E OP E ) | not E LIT → true | false OP → and | or | xor void E() { if (current  {TRUE, FALSE})LIT(); else if (current == LPAREN)match(LPARENT); E(); OP(); E(); match(RPAREN); else if (current == NOT)match(NOT); E(); else error; } void LIT() { if (current == TRUE)match(TRUE); else if (current == FALSE)match(FALSE); elseerror; } void OP() { if (current == AND)match(AND); else if (current == OR)match(OR); else if (current == XOR)match(XOR); elseerror; }

Adding semantic actions Can add an action to perform on each production rule simply by executing it when a function is invoked. For example, can build the parse tree – Every function returns an object of type Node – Every Node maintains a list of children – Function calls can add new children 32

Building the parse tree 33 Node E() { result = new Node(); result.name = “E”; if (current  {TRUE, FALSE}) // E → LIT result.addChild(LIT()); else if (current == LPAREN) // E → ( E OP E ) result.addChild(match(LPARENT)); result.addChild(E()); result.addChild(OP()); result.addChild(E()); result.addChild(match(RPAREN)); else if (current == NOT) // E → not E result.addChild(match(NOT)); result.addChild(E()); else error; return result; }

Getting Back to the Example Input = “( not true and false )”; Node treeRoot = E(); 34 E (EOPE) notLIT false true and LIT

Recursive Descent How do you pick the right A-production? Generally – try them all and use backtracking (costly). In our case – use lookahead 35 void A() { choose an A-production, A -> X 1 X 2 …X k ; for (i=1; i≤ k; i++) { if (X i is a nonterminal) call procedure X i (); elseif (X i == current) advance input; else report error; } In its basic form, each variable has a procedure that looks like:

Recursive Descent: a problem With lookahead 1, the function for indexed_elem will never be tried… – What happens for input of the form ID [ expr ] 36 term ➞ ID | indexed_elem indexed_elem ➞ ID [ expr ]

Recursive Descent: Another Problem Bool S() { return A() && match(token(‘a’)) && match(token(‘b’)); } Bool A() { if (current == ‘a’) return match(token(‘a’)) else return true ; } 37 S ➞ A a b A ➞ a |   What happens for input “ab” ?  What happens if you flip order of alternatives and try “aab”?

Recursive descent: a third problem Bool E() { return E() && match(token(‘-’)) && term() || ID(); } 38 E ➞ E – term | term  What happens with this procedure?  Recursive descent parsers cannot handle left-recursive grammars

3 Bad Examples for Recursive Descent Can we make it work? 39 term ➞ ID | indexed_elem indexed_elem ➞ ID [ expr ] S ➞ A a b A ➞ a |  E ➞ E - term

The “FIRST” Sets To formalize the property (of a grammar) that we can determine a rule using a single lookahead we define the FIRST sets. For every production rule A ➞ – FIRST() = all terminals that can start with – i.e., every token that can appear first under some derivation for No intersection between FIRST sets => can pick a single rule In our Boolean expressions example – FIRST(LIT) = { true, false } – FIRST( ( E OP E ) ) = { ‘(‘ } – FIRST ( not E ) = { not } 40 E ➞ LIT | (E OP E) | not E LIT ➞ true | false OP ➞ and | or | xor

The “FIRST” Sets No intersection between FIRST sets => can pick a single rule If the FIRST sets intersect, may need longer lookahead – LL(k) = class of grammars in which production rule can be determined using a lookahead of k tokens – LL(1) is an important and useful class 41

The FOLLOW Sets FIRST is not enough when variables are nullified. Consider: S ➞ AB | c ; A ➞ a |  ; B ➞ b; Need to know what comes afterwards to select the right production For any non-terminal A – FOLLOW(A) = set of tokens that can immediately follow A Can select the rule N ➞ with lookahead “b”, if – b ∈ FIRST() or – may be nullified and b ∈ FOLLOW(N). 42

Back to our 1 st example FIRST(ID) = { ID } FIRST(indexed_elem) = { ID } FIRST/FIRST conflict This grammar is not in LL(1). Can we “fix” it? 43 term ➞ ID | indexed_elem indexed_elem ➞ ID [ expr ]

Left factoring Rewrite into an equivalent grammar in LL(1) 44 term ➞ ID | indexed_elem indexed_elem ➞ ID [ expr ] term ➞ ID after_ID after_ID ➞ [ expr ] |  Intuition: just like factoring x*y + x*z into x*(y+z)

Left factoring – another example 45 S ➞ if E then S else S | if E then S | T S ➞ if E then S S’ | T S’ ➞ else S | 

Back to our 2 nd example Select a rule for A with a in the look-ahead: – Should we pick (1) A ➞ a or (2) A ➞  ? (1) FIRST(a) = { ‘a’ } (and a cannot be nullified). (2) FIRST (  )= . Also,  can (must) be nullified and FOLLOW(A) = { ‘a’ } FIRST/FOLLOW conflict The grammar is not in LL(1). 46 S ➞ A a b A ➞ a | 

An Equivalent Grammar via Substitution 47 S ➞ A a b A ➞ a |  S ➞ a a b | a b Substitute A in S S ➞ a after_a after_a ➞ a b | b Left factoring

So Far We have tools to determine if a grammar is in LL(1) – The FIRST and FOLLOW sets. – The exercises will provide algorithms for finding and using those. We have some techniques for modifying a grammar to find an equivalent in LL(1). – Left factoring, – Assignment. Now let’s look at the 3 rd example and present one more such technique. 48

Back to our 3 rd example Left recursion cannot be handled with a bounded lookahead. What can we do? Any grammar with a left recursion has an equivalent grammar with no left recursion. 49 E ➞ E – term | term

Left Recursion Elimination L(G1) = β, βα, βαα, βααα, … L(G2) = same 50 N ➞ Nα | β N ➞ βN’ N’ ➞ αN’ |  G1 G2 E ➞ E – term | term E ➞ term TE TE ➞ - term TE | term  For our 3 rd example:

אלימינציה של רקורסיה שמאלית ביטול רקורסיה ישירה: נחליף את הכללים A → Aα 1 | Aα 2 | ··· | Aα n | β 1 | β 2 | ··· | β n בכללים A → β 1 A’ | β 2 A’ | ··· | β n A’ A’ → α 1 A’ | α 2 A’| ··· | α n A’ | Є שימו לב שהשיטה לא עובדת אם α i ריק... וגם עלולה ליצור רקורסיה שמאלית עקיפה אם β i ריק. אם α i ריק, אז נוצרת רקורסיה שמאלית של A’. אם β i ריק, אז תיתכן רקורסיה שמאלית עקיפה כאשר α j מתחיל ב-A: – A → A’ וגם – A’ → A….

אלימינציה של רקורסיה שמאלית ביטול רקורסיה ישירה: נחליף את הכללים A → Aα 1 | Aα 2 | ··· | Aα n | β 1 | β 2 | ··· | β n בכללים A → β 1 A’ | β 2 A’ | ··· | β n A’ A’ → α 1 A’ | α 2 A’| ··· | α n A’ | Є צריך לטפל גם ברקורסיה עקיפה. למשל: S → Aa | b A → Ac | Sd | Є ועבורה האלגוריתם מעט יותר מורכב.

אלגוריתם להעלמת רקורסיה שמאלת )עקיפה וישירה( מדקדוק קלט: דקדוק G שאולי יש בו רקורסיה שמאלית, ללא מעגלים, וללא כללי ε. פלט: דקדוק שקול ללא רקורסיה שמאלית. דוגמא לכלל אפסילון: A → Є. דוגמא למעגל: A → B; B → A;. ניתן לבטל כללי אפסילון ומעגלים בדקדוק )באופן אוטומטי(. רעיון האלגוריתם לסילוק רקורסיה שמאלית: נסדר את המשתנים לפי סדר כלשהו: A 1, A 2, …, A n נעבור על המשתנים לפי הסדר, ולכל A i נדאג שכל כלל שלו יהיה מהצורה A i → A j β with j > i. מדוע זה מספיק?

An Algorithm for Left-Recursion Elimination Input: Grammar G possibly left-recursive, no cycles, no ε productions. Output: An equivalent grammar with no left-recursion Method: Arrange the nonterminals in some order A 1, A 2, …, A n for i:=1 to n do begin for s:=1 to i-1 do begin replace each production of the form A i → A s β by the productionsA i → d 1 β |d 2 β|…|d k β where A s -> d 1 | d 2 | …| d k are all the current A s -productions; end eliminate immediate left recursion among the A i -productions end

ניתוח האלגוריתם נראה שבסיום האלגוריתם כל חוק גזירה מהצורה A k → A t β מקיים t > k. שמורה 1: כשגומרים את הלולאה הפנימית עבור s כלשהו )עם A i בלולאה החיצונית( אז כל כללי הגזירה של A i מתחילים בטרמינלים, או במשתנים A j עבורם j>s. שמורה 2: כשמסיימים עם המשתנה A i, כל כללי הגזירה שלו מתחילים במשתנים A j עבורם j>i או בטרמינלים. הוכחת שתי השמורות יחד באינדוקציה על i ו-s. מסקנה: בסיום האלגוריתם אין רקורסיה שמאלית בין המשתנים המקוריים )ישירה או עקיפה(. נובע משמורה 2. לגבי המשתנים החדשים, הם תמיד מופיעים כימניים ביותר, ולכן לעולם לא יהיו מעורבים ברקורסיה שמאלית.

LL(k) Parsers Recursive Descent – Manual construction – Uses recursion Wanted – A parser that can be generated automatically – Does not use recursion 56

LL(k) parsing with pushdown automata Pushdown automaton uses – A stack – Input stream – Transition table nonterminals x tokens -> production rule Entry indexed by nonterminal N and token t contains the rule of N that must be used when current input starts with t The initial state: – Input stream has the input ($ marks its end). – Stack starts with “S$” for the initial variable S. 57

LL(k) parsing with pushdown automata Two possible moves – Prediction: When top of stack is nonterminal N and next token is t: pop N, lookup rule at table[N,t]. If table[N,t] is not empty, push the right-side of the rule on prediction stack, otherwise – syntax error. – Match: When top of prediction stack is a terminal T and next token is t: If (t == T), pop T and consume t. If (t ≠ T) syntax error. Parsing terminates when prediction stack is empty. If input is empty at that point, success. Otherwise, syntax error 58

Stack During the Run: if ( E ) then Stmt else Stmt ; Stmts ; } $ מחסנית : top if ( id < id ) then id = id + num else break; id = id * id; … Remaining Input:

Example transition table 60 ()nottruefalseandorxor$ E2311 LIT45 OP678 (1) E → LIT (2) E → ( E OP E ) (3) E → not E (4) LIT → true (5) LIT → false (6) OP → and (7) OP → or (8) OP → xor Nonterminals Input tokens Which rule should be used

Simple Example 61 abc A A ➞ aAbA ➞ c A ➞ aAb | c aacbb$ Input suffixStack contentMove aacbb$A$ predict(A,a) = A ➞ aAb aacbb$aAb$match(a,a) acbb$Ab$ predict(A,a) = A ➞ aAb acbb$aAbb$match(a,a) cbb$Abb$ predict(A,c) = A ➞ c cbb$ match(c,c) bb$ match(b,b) b$ match(b,b) $$match($,$) – success Stack top on left

The Transition Table Constructing the transition table is not hard. – It builds on FIRST and FOLLOW. You will construct First, Follow, and the table in the exercises. 62

Simple Example on a Bad Word 63 abc A A ➞ aAbA ➞ c A ➞ aAb | cabcbb$ Input suffixStack contentMove abcbb$A$ predict(A,a) = A ➞ aAb abcbb$aAb$match(a,a) bcbb$Ab$predict(A,b) = ERROR

Error Handling Types of errors: – Lexical errors (typos) – Syntax errors (e.g., imbalanced parenthesis) – Semantic errors (e.g., type mismatch) – Logical errors (infinite loop, but also use of ‘=‘ instead of ‘==‘). Requirements: – Report the error clearly. – Recover and continue so that more errors can be discovered. – Be reasonably efficient. 64

Error Handling and Recovery x = a * (p+q * ( -b * (r-s); 65  Where should we report the error?  The valid prefix property  Recovery is tricky  Heuristics for dropping tokens, skipping to semicolon, etc.

Error Handling in LL Parsers Now what? – Predict bS anyway “missing token b inserted in line XXX” 66 S ➞ a c | b Sc$ abc S S ➞ a cS ➞ bS Input suffixStack contentMove c$S$predict(S,c) = ERROR

Error Handling in LL Parsers Result: infinite loop 67 S ➞ a c | b Sc$ abc S S ➞ a cS ➞ bS Input suffixStack contentMove bc$S$ predict(b,c) = S ➞ bS bc$bS$match(b,b) c$S$Looks familiar?

Error Handling Requires more systematic treatment Some examples – Panic mode (or acceptable-set method): drop tokens until reaching a synchronizing token, like a semicolon, a right parenthesis, end of file, etc. – Phrase-level recovery: attempting local changes: replace “,” with “;”, eliminate or add a “;”, etc. – Error production: anticipate errors and automatically handle them by adding them to the grammar. – Global correction: find the minimum modification to the program that will make it derivable in the grammar. Not a practical solution… 68

An Example Structure of a Program program Main function More Functions Function DeclsStmts Decls Stmts DeclsStmts Decl Decls Decl Decls Decl Stmt Stmts Stmt Id Type exprid= ; ; { } { { } }

Summary After peeling the lexical layer, we parse the token sequence to understand the program structure. Program legal structure is accurately described by a CFG. Parsing is executed top-down or bottom-up. Recursive descent uses recursion and a function for each variable. General grammars may be hard to parse. LL(k) grammars can be parsed efficiently (with small k’s) using a pushdown automata. Grammars that are not LL(K) may sometimes be “fixed” using left-recursion elimination, left factorization, and assignments. 70

Coming up next time Bottom-Up Parsing. 71

Theory of Compilation 236360 Erez Petrank Lecture 2: Syntax Analysis, Top-Down Parsing 1.

Similar presentations

Presentation on theme: "Theory of Compilation 236360 Erez Petrank Lecture 2: Syntax Analysis, Top-Down Parsing 1."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Theory of Compilation 236360 Erez Petrank Lecture 2: Syntax Analysis, Top-Down Parsing 1.

Similar presentations

Presentation on theme: "Theory of Compilation 236360 Erez Petrank Lecture 2: Syntax Analysis, Top-Down Parsing 1."— Presentation transcript:

Similar presentations

About project

Feedback