CMPSC 160 Translation of Programming Languages, Fall 2002
Lecture-Module #6: Parsing
Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon


2 Announcements
Programming assignment 1 is due Thursday
Read Chapter 4

3 The Front End: Parser
Input: a sequence of tokens representing the source program
Output: a parse tree (in practice, an abstract syntax tree)
While generating the parse tree, the parser checks the stream of tokens for grammatical correctness
– Checks the context-free syntax
The parser builds an IR representation of the code
– Generates an abstract syntax tree
– Guides checking at deeper levels than syntax
(Figure: source code → Scanner → tokens → Parser → IR → Type Checker, with errors reported along the way; the parser asks the scanner for the next token)

4 The Study of Parsing
Need a mathematical model of syntax, a grammar G
– Context-free grammars
Need an algorithm for testing membership in L(G)
– Parsing algorithms
Parsing is the process of discovering a derivation for some sentence from the rules of the grammar
– Equivalently, it is the process of discovering a parse tree
Natural-language analogy
– Lexical rules correspond to rules that define the valid words
– Grammar rules correspond to rules that define valid sentences

5 Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar
Formally, a grammar is a 4-tuple, G = (S, N, T, P)
T is a set of terminal symbols
– These correspond to tokens returned by the scanner
– For the parser, tokens are indivisible units of syntax
N is a set of non-terminal symbols
– These are syntactic variables that can be substituted during a derivation
– Variables that denote sets of substrings occurring in the language
S is the start symbol, S ∈ N
– All the strings in L(G) are derived from the start symbol
P is a set of productions or rewrite rules, P : N → (N ∪ T)*

6 Derivations
An example grammar:
1 S → Expr
2 Expr → Expr Op Expr
3      | num
4      | id
5 Op → +
6    | –
7    | *
8    | /
An example derivation for x – 2 * y:
  Rule   Sentential Form
  –      S
  1      Expr
  2      Expr Op Expr
  4      <id,x> Op Expr
  6      <id,x> – Expr
  2      <id,x> – Expr Op Expr
  3      <id,x> – <num,2> Op Expr
  7      <id,x> – <num,2> * Expr
  4      <id,x> – <num,2> * <id,y>
Such a sequence of rewrites is called a derivation; the process of discovering a derivation is called parsing
We denote this as: S →* id – num * id
A → B means A derives B by applying one production; A →* B means A derives B by applying zero or more productions

7 Derivations
At each step, we make two choices:
1. Choose a non-terminal to replace
2. Choose a production to apply
Different choices lead to different derivations
Two types of derivation are of interest:
Leftmost derivation: replace the leftmost non-terminal at each step
Rightmost derivation: replace the rightmost non-terminal at each step
These are the two systematic derivations (the first choice is fixed)
The example on the preceding slide was a leftmost derivation; of course, there is also a rightmost derivation

8 Parsing Techniques
Top-down parsers (LL(1), recursive descent)
– Start at the root of the parse tree, from the start symbol, and grow toward the leaves (similar to a derivation)
– Pick a production and try to match the input
– A bad "pick" may force the parser to backtrack
– Some grammars are backtrack-free (predictive parsing)
Bottom-up parsers (LR(1), operator precedence)
– Start at the leaves and grow toward the root
– We can think of the process as reducing the input string to the start symbol
– At each reduction step, a particular substring matching the right side of a production is replaced by the symbol on its left side
– Bottom-up parsers handle a large class of grammars

9 Top-down Parsing Algorithm
1. Construct the root node of the parse tree, label it with the start symbol, and set the current node to the root
Repeat until all the input is consumed (i.e., until the frontier of the parse tree matches the input string):
2. If the label of the current node is a non-terminal A, select a production with A on its lhs and, for each symbol on its rhs, construct the appropriate child
3. If the current node is a terminal symbol:
– If it matches the input, consume it (advance the input pointer)
– If it does not match the input, backtrack
4. Set the current node to the next node in the frontier of the parse tree
– If there is no node left in the frontier and the input is not consumed, backtrack
The key is picking the right production in step 2
– That choice should be guided by the input string

10 Example from last lecture
The version with correct precedence and associativity, derived last lecture:
1 S → Expr
2 Expr → Expr + Term
3      | Expr – Term
4      | Term
5 Term → Term * Factor
6      | Term / Factor
7      | Factor
8 Factor → num
9        | id
And the input: x – 2 * y

11 Example
Let’s try x – 2 * y:
  Rule   Sentential Form     Input
  –      S                   ↑ x – 2 * y
  1      Expr                ↑ x – 2 * y
  2      Expr + Term         ↑ x – 2 * y
  4      Term + Term         ↑ x – 2 * y
  7      Factor + Term       ↑ x – 2 * y
  9      <id,x> + Term       ↑ x – 2 * y
  →      <id,x> + Term       x ↑ – 2 * y
(The parse tree so far: S → Expr → Expr + Term, with the left Expr expanded down to Factor → <id,x>)

12 Example
Let’s try x – 2 * y:
  Rule   Sentential Form     Input
  –      S                   ↑ x – 2 * y
  1      Expr                ↑ x – 2 * y
  2      Expr + Term         ↑ x – 2 * y
  4      Term + Term         ↑ x – 2 * y
  7      Factor + Term       ↑ x – 2 * y
  9      <id,x> + Term       ↑ x – 2 * y
  →      <id,x> + Term       x ↑ – 2 * y
Note that "–" doesn’t match "+"
The parser must backtrack to the point where it chose rule 2 (the expansion of Expr)

13 Example
Continuing with x – 2 * y:
  Rule   Sentential Form     Input
  –      S                   ↑ x – 2 * y
  1      Expr                ↑ x – 2 * y
  3      Expr – Term         ↑ x – 2 * y
  4      Term – Term         ↑ x – 2 * y
  7      Factor – Term       ↑ x – 2 * y
  9      <id,x> – Term       ↑ x – 2 * y
  →      <id,x> – Term       x ↑ – 2 * y
  →      <id,x> – Term       x – ↑ 2 * y

14 Example
Continuing with x – 2 * y:
  Rule   Sentential Form     Input
  –      S                   ↑ x – 2 * y
  1      Expr                ↑ x – 2 * y
  3      Expr – Term         ↑ x – 2 * y
  4      Term – Term         ↑ x – 2 * y
  7      Factor – Term       ↑ x – 2 * y
  9      <id,x> – Term       ↑ x – 2 * y
  →      <id,x> – Term       x ↑ – 2 * y
  →      <id,x> – Term       x – ↑ 2 * y
This time "–" and "–" matched
We can advance past "–" to look at "2"
Now we need to expand Term, the last non-terminal on the fringe of the parse tree

15 Example
Trying to match the "2" in x – 2 * y:
  Rule   Sentential Form     Input
  –      <id,x> – Term       x – ↑ 2 * y
  7      <id,x> – Factor     x – ↑ 2 * y
  8      <id,x> – <num,2>    x – ↑ 2 * y
  →      <id,x> – <num,2>    x – 2 ↑ * y
Where are we? num matches "2"
We have more input, but no non-terminals left to expand
The expansion terminated too soon ⇒ need to backtrack

16 Example
Trying again with "2" in x – 2 * y:
  Rule   Sentential Form             Input
  –      <id,x> – Term               x – ↑ 2 * y
  5      <id,x> – Term * Factor      x – ↑ 2 * y
  7      <id,x> – Factor * Factor    x – ↑ 2 * y
  8      <id,x> – <num,2> * Factor   x – ↑ 2 * y
  →      <id,x> – <num,2> * Factor   x – 2 ↑ * y
  →      <id,x> – <num,2> * Factor   x – 2 * ↑ y
  9      <id,x> – <num,2> * <id,y>   x – 2 * ↑ y
  →      <id,x> – <num,2> * <id,y>   x – 2 * y ↑
This time, we matched and consumed all the input ⇒ success!

17 Example
Other choices for expansion are possible: another possible parse repeatedly expands the leftmost Expr, consuming no input!
This doesn’t terminate: a wrong choice of expansion leads to non-termination, and the parser will not backtrack, since it never reaches a point where it can backtrack
Non-termination is a bad property for a parser to have
The parser must make the right choice

18 Left Recursion
Top-down parsers cannot handle left-recursive grammars
Formally, a grammar is left-recursive if there exists a non-terminal A such that there is a derivation A →+ Aα, for some string α ∈ (N ∪ T)+
Our expression grammar is left-recursive
– This can lead to non-termination in a top-down parser
– For a top-down parser, any recursion must be right recursion
– We would like to convert the left recursion to right recursion (without changing the language defined by the grammar)

19 Eliminating Immediate Left Recursion
To remove left recursion, we can transform the grammar
Consider a grammar fragment of the form
  A → A α
    | β
where α and β are strings of terminal and non-terminal symbols and neither α nor β starts with A
We can rewrite this as follows:
  A → β R
  R → α R
    | ε
where R is a new non-terminal
This accepts the same language, but uses only right recursion
(Figure: the two parse-tree shapes. The left-recursive form grows A down the left spine, A → Aα with β at the bottom left; the rewritten form grows R down the right spine, A → βR, R → αR → … → ε)

20 Eliminating Immediate Left Recursion
The expression grammar contains two cases of left recursion:
  Expr → Expr + Term
       | Expr – Term
       | Term
  Term → Term * Factor
       | Term / Factor
       | Factor
Applying the transformation yields the following:
  Expr  → Term Expr'
  Expr' → + Term Expr'
        | – Term Expr'
        | ε
  Term  → Factor Term'
  Term' → * Factor Term'
        | / Factor Term'
        | ε
These fragments use only right recursion
They retain the original left associativity

21 Eliminating Immediate Left Recursion
Substituting back into the grammar yields
1  S     → Expr
2  Expr  → Term Expr'
3  Expr' → + Term Expr'
4         | – Term Expr'
5         | ε
6  Term  → Factor Term'
7  Term' → * Factor Term'
8         | / Factor Term'
9         | ε
10 Factor → num
11        | id
This grammar is correct, if somewhat non-intuitive
It is left associative, as was the original
A top-down parser will terminate using it

22 Left-Recursive and Right-Recursive Expression Grammars
Left-recursive (original):
1 S → Expr
2 Expr → Expr + Term
3      | Expr – Term
4      | Term
5 Term → Term * Factor
6      | Term / Factor
7      | Factor
8 Factor → num
9        | id
Right-recursive (transformed):
1  S     → Expr
2  Expr  → Term Expr'
3  Expr' → + Term Expr'
4         | – Term Expr'
5         | ε
6  Term  → Factor Term'
7  Term' → * Factor Term'
8         | / Factor Term'
9         | ε
10 Factor → num
11        | id

23 Need to Preserve Precedence and Associativity
(Figure: the parse trees for x – 2 * y under the two grammars. Left-recursive grammar: S → E, E → E – T, with the right-hand T expanding to T * F. Right-recursive grammar: S → E, E → T E', E' → – T E' → … → ε, T → F T', T' → * F T' → … → ε. In both trees * appears below –, so precedence is preserved.)

24 Eliminating Left Recursion
The transformation eliminates immediate left recursion
What about more general, indirect left recursion?
The general algorithm (Algorithm 4.1 in the textbook):
  arrange the non-terminals into some order A1, A2, …, An
  for i ← 1 to n
    for j ← 1 to i–1
      replace each production Ai → Aj γ with Ai → δ1 γ | δ2 γ | … | δk γ,
        where Aj → δ1 | δ2 | … | δk are all the current productions for Aj
    eliminate any immediate left recursion on Ai using the direct transformation
This assumes that the initial grammar has no cycles (Ai →+ Ai) and no ε-productions (Ai → ε)

25 Eliminating Left Recursion
How does this algorithm work?
1. Impose an arbitrary order on the non-terminals
2. The outer loop cycles through the non-terminals in order
3. The inner loop ensures that a production expanding Ai has no non-terminal Aj in its rhs, for j < i
4. The last step in the outer loop converts any direct recursion on Ai to right recursion, using the transformation shown earlier
5. New non-terminals are added at the end of the order and have no left recursion
Invariant, at the start of the i-th outer-loop iteration: for all k < i, no production that expands Ak contains a non-terminal As in its rhs, for s < k

26 Picking the "Right" Production
If it picks the wrong production, a top-down parser may backtrack
The alternative is to look ahead in the input and use context to pick correctly
How much lookahead is needed? In general, an arbitrarily large amount
– Use the Cocke–Younger–Kasami algorithm or Earley’s algorithm
– Complexity is O(|x|³), where x is the input string
Fortunately, large subclasses of CFGs can be parsed efficiently with limited lookahead
– Linear complexity, O(|x|), where x is the input string
– Most programming-language constructs fall in those subclasses
Among the interesting subclasses are LL(1) and LR(1) grammars

27 Predictive Parsing
Basic idea: given A → α | β, the parser should be able to choose between α and β
FIRST sets: for some rhs α in G, define FIRST(α) as the set of tokens that appear as the first symbol in some string that derives from α
– That is, x ∈ FIRST(α) iff α →* x γ, for some γ
The LL(1) property: if A → α and A → β both appear in the grammar, we would like
  FIRST(α) ∩ FIRST(β) = ∅
This would allow the parser to make a correct choice with a lookahead of exactly one symbol!
(Pursuing this idea leads to LL(1) parser generators…)

28 Predictive Parsing
Given a grammar that has the LL(1) property:
– We can write a simple routine to recognize each lhs
– The code is both simple and fast
Consider A → β1 | β2, with FIRST(β1) ∩ FIRST(β2) = ∅
  /* find an A */
  if (current_token ∈ FIRST(β1))
      match β1 and return true
  else if (current_token ∈ FIRST(β2))
      match β2 and return true
  report an error and return false
Grammars with the LL(1) property are called predictive grammars, because the parser can "predict" the correct expansion at each point in the parse
Parsers that capitalize on the LL(1) property are called predictive parsers
One kind of predictive parser is the recursive descent parser

29 Recursive Descent Parsing
Recursive-descent parsing is a top-down parsing method
– The term "descent" refers to the direction in which the parse tree is traversed (or built)
Use a set of mutually recursive procedures, one procedure for each non-terminal symbol
– Start the parsing process by calling the procedure that corresponds to the start symbol
– Each production becomes one clause in the procedure
We consider a special type of recursive-descent parsing called predictive parsing
– Use a lookahead symbol to decide which production to use

30 Recursive Descent Parsing
Recall the expression grammar, after transformation:
1  S     → Expr
2  Expr  → Term Expr'
3  Expr' → + Term Expr'
4         | – Term Expr'
5         | ε
6  Term  → Factor Term'
7  Term' → * Factor Term'
8         | / Factor Term'
9         | ε
10 Factor → num
11        | id
We can write a parser for this grammar with six mutually recursive procedures: Goal, Expr, ExprPrime, Term, TermPrime, Factor
Each recognizes one non-terminal

31 Recursive Descent Parsing
  int PLUS = 1, MINUS = 2, ...
  int lookahead = getNextToken();

  void advance() { lookahead = getNextToken(); }

  void match(int token) {
      if (lookahead == token) advance();
      else error();
  }

  void S()    { Expr(); }
  void Expr() { Term(); ExprPrime(); }

  void ExprPrime() {
      switch (lookahead) {
          case PLUS:  match(PLUS);  Term(); ExprPrime(); break;
          case MINUS: match(MINUS); Term(); ExprPrime(); break;
          default: return;
      }
  }

  void Term() { Factor(); TermPrime(); }

  void TermPrime() {
      switch (lookahead) {
          case TIMES: match(TIMES); Factor(); TermPrime(); break;
          case DIV:   match(DIV);   Factor(); TermPrime(); break;
          default: return;
      }
  }

  void Factor() {
      switch (lookahead) {
          case ID:     match(ID);     break;
          case NUMBER: match(NUMBER); break;
          default: error();
      }
  }

32 Recursive Descent Parsing
To build a parse tree:
– Augment the parsing routines to build nodes
– Pass nodes between routines using a stack
– Create a node for each symbol on the rhs
– The action is to pop the rhs nodes, make them children of the lhs node, and push this subtree
To build an abstract syntax tree:
– Build fewer nodes
– Put them together in a different order, to preserve left associativity
  Expr() {
      Term();
      ExprPrime();
      /* build an Expr node;
         pop ExprPrime node; pop Term node;
         make ExprPrime and Term children of Expr;
         push Expr node; */
  }

33 Another Grammar
1 S → if E then S else S
2   | begin S L
3   | print E
4 L → end
5   | ; S L
6 E → num = num

  void S() {
      switch (lookahead) {
          case IF:    match(IF); E(); match(THEN); S(); match(ELSE); S(); break;
          case BEGIN: match(BEGIN); S(); L(); break;
          case PRINT: match(PRINT); E(); break;
          default: error();
      }
  }

  void L() {
      switch (lookahead) {
          case END:  match(END); break;
          case SEMI: match(SEMI); S(); L(); break;
          default: error();
      }
  }

  void E() { match(NUM); match(EQ); match(NUM); }

34 Example Execution
For input: if 2=2 then print 5=5 else print 1=1

main: call S();
S1: find the production for (S, IF): S → if E then S else S
S1: match(IF);
S1: call E();
E1: find the production for (E, NUM): E → num = num
E1: match(NUM); match(EQ); match(NUM);
E1: return from E1 to S1
S1: match(THEN);
S1: call S();
S2: find the production for (S, PRINT): S → print E
S2: match(PRINT);
S2: call E();
E2: find the production for (E, NUM): E → num = num
E2: match(NUM); match(EQ); match(NUM);
E2: return from E2 to S2
S2: return from S2 to S1
S1: match(ELSE);
S1: call S();
S3: find the production for (S, PRINT): S → print E
S3: match(PRINT);
S3: call E();
E3: find the production for (E, NUM): E → num = num
E3: match(NUM); match(EQ); match(NUM);
E3: return from E3 to S3
S3: return from S3 to S1
S1: return from S1 to main
main: match(EOF); return success;

