Compiler Structures 5. Top-down Parsing Objectives

Semester 2
Objectives:
look at top-down (LL) parsing using recursive descent and tables
consider a recursive descent parser for the Expressions language

Overview
1. Parsing with a Syntax Analyzer
2. Creating a Recursive Descent Parser
3. The Expressions Language Parser
4. LL(1) Parse Tables
5. Making a Grammar LL(1)
6. Error Recovery in LL Parsing

In this lecture
The whole compiler pipeline is shown, but concentrating on top-down parsing in the Syntax Analyzer.
Front End: Source Program, Lexical Analyzer, Syntax Analyzer, Semantic Analyzer, Int. Code Generator, Intermediate Code
Back End: Code Optimizer, Target Code Generator, Target Lang. Prog.

1. Parsing with a Syntax Analyzer
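Source Program --> Lexical Analyzer (using chars) --> Syntax Analyzer (using tokens) --> parse tree
1. The syntax analyzer asks for the next token ("Get next token").
2. The lexical analyzer gets chars to make a token.
3. The lexical analyzer returns the token and token value.
The lexical analyzer reports lexical errors; the syntax analyzer reports syntax errors.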

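1.1. Top Down (LL) Parsing
Grammar:
B => begin SS end
SS => S ; SS
SS => e
S => simplestmt
S => begin SS end
Input: begin simplestmt ; simplestmt ; end
[Figure: the parse tree for this input, rooted at B; the numbers 1 to 6 show the order in which the B, SS and S nodes are expanded, top-down and left-to-right.]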

1.2. LL Parsing Definition An LL parser is a top-down parser for a context-free grammar. It parses input from Left to right, and constructs a Leftmost derivation of the input.

A Leftmost Derivation
In a leftmost derivation, the leftmost non-terminal is always chosen to be expanded next.
This builds the parse tree top-down, left-to-right.
Example grammar:
L => ( L ) L
L => e

Leftmost Derivation for (())()
Input: ( ( ) ) ( )
L ⇒ ( L ) L            // L => ( L ) L
  ⇒ ( ( L ) L ) L      // L => ( L ) L
  ⇒ ( ( ) L ) L        // L => e
  ⇒ ( ( ) ) L          // L => e
  ⇒ ( ( ) ) ( L ) L    // L => ( L ) L
  ⇒ ( ( ) ) ( ) L      // L => e
  ⇒ ( ( ) ) ( )        // L => e
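The derivation above can be mirrored by a small recognizer. This is only a sketch: the names `parse_L` and `balanced` are invented for illustration, and the input is assumed to be a plain C string of brackets rather than a stream of scanner tokens. Each call to `parse_L()` expands the leftmost L, choosing L => ( L ) L when the next character is '(' and L => e otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper names; the input is a plain C string of '(' and ')'. */
static const char *input;   /* next unread character */

/* L => ( L ) L | e */
static bool parse_L(void)
{
    if (*input == '(') {              /* choose L => ( L ) L */
        input++;                      /* consume '(' */
        if (!parse_L()) return false;
        if (*input != ')') return false;
        input++;                      /* consume ')' */
        return parse_L();
    }
    return true;                      /* choose L => e: consume nothing */
}

/* Returns true when the whole string can be derived from L. */
bool balanced(const char *s)
{
    input = s;
    return parse_L() && *input == '\0';
}
```

Calling `balanced("(())()")` follows the same sequence of production choices as the derivation.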

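1.3. LL(1) and LL(k)
An LL(1) parser uses only the current token to decide which production to use next.
An LL(k) parser uses k tokens of input to decide which production to use:
this makes the grammar easier to write
it adds little practical 'power' compared to LL(1), since many programming language constructs can still be written in LL(1) form (strictly speaking, some LL(k) grammars have no LL(1) equivalent)
it is harder to implement efficiently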

1.4. Two LL Implementation Approaches
Recursive Descent parsing
all the compiler code is generated (automatically) from the grammar
Table Driven parsing
a table is generated (automatically) from the grammar
the table is 'plugged' into an existing compiler

2. Creating a Recursive Descent Parser
Each non-terminal (e.g. A) is translated into a parsing function (e.g. A()). The A() function is generated from all the productions for A: A => B, A => a C, etc.

2.1. Basic Translation Rules
I'll start by assuming a production body doesn't use *, [], or e. I'll add to the translation rules later to deal with these extra features.
S => Body
becomes
void S()
{ translate< Body > }

If Body is B1 B2 . . . Bn then it becomes:
translate< B1 > ;
translate< B2 > ;
   :
translate< Bn > ;

If Body is B1 | B2 . . . | Bn then it becomes:
if (currToken in FIRST_SEQ<B1>)
    translate<B1> ;
else if (currToken in FIRST_SEQ<B2>)
    translate<B2> ;
   :
else if (currToken in FIRST_SEQ<Bn>)
    translate<Bn> ;
else
    error();
For this to work, FIRST_SEQ<B1>, FIRST_SEQ<B2>, ..., FIRST_SEQ<Bn> must be pairwise disjoint: no token may appear in more than one of the sets.

currToken is the current token, which is obtained from the lexical analyzer:
Token currToken;   // global

void nextToken(void)
{
    currToken = scanner();
}

The first token is read when the parser first starts. main() also calls the function representing the start symbol:
int main(void)
{
    nextToken();
    S();     // S is the grammar's start symbol
      :      // other code
    return 0;
}

error() reports that the current token cannot be matched against any production:
int lineNum;   // global

void error()
{
    // assumes currToken can be printed with %s
    printf("\nSyntax error at \'%s\' on line %d\n", currToken, lineNum);
    exit(1);
}

In a body, if B is a non-terminal, it is translated into the function call:
B();
In a body, if b is a terminal, it is translated into a match() call:
match(b);

match() checks that the current token is what is expected (e.g. b), and reads in the next one for future testing:
void match(Token expected)
{
    if (currToken == expected)
        currToken = scanner();
    else
        error();
}

Special '|' Body case. If Body is
a1 B1 | a2 B2 | ... | an Bn     // ai's are terminals
then it becomes:
if (currToken == a1) {
    match(a1); translate<B1> ;
}
else if (currToken == a2) {
    match(a2); translate<B2> ;
}
   :
else if (currToken == an) {
    match(an); translate<Bn> ;
}
else
    error();
a1, a2, ..., an must all be different.

2.2. Example Translation
void S()    // S => a B | b C
{
    if (currToken == a) {
        match(a); B();
    }
    else if (currToken == b) {
        match(b); C();
    }
    else
        error();
}

void B()    // B => b b C
{
    match(b); match(b); C();
}

void C()    // C => c c
{
    match(c); match(c);
}

And main(), nextToken(), match(), and error().

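Parsing "abbcc"
Function calls:
main() --> S() --> match(a); B() --> match(b); match(b); C() --> match(c); match(c)
[Figure: the parse tree built by these calls over the input a b b c c: S has children a and B, B has children b, b and C, and C has children c and c.]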
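The call sequence above can be tried out with a character-based version of the parser. This is a sketch under stated assumptions: tokens are single characters read from a C string, `'\0'` marks the end of input, and an `ok` flag replaces the `exit(1)` in error() so that the result can be returned to the caller. The names `in`, `ok` and `parse` are made up for the example.

```c
#include <stdbool.h>

static const char *in;   /* remaining input; '\0' acts as end of input */
static bool ok;          /* cleared on the first syntax error */

static void match(char expected)
{
    if (ok && *in == expected)
        in++;                 /* consume the expected token */
    else
        ok = false;           /* plays the role of error() */
}

static void C(void) { match('c'); match('c'); }        /* C => c c   */
static void B(void) { match('b'); match('b'); C(); }   /* B => b b C */

static void S(void)                                    /* S => a B | b C */
{
    if (*in == 'a')      { match('a'); B(); }
    else if (*in == 'b') { match('b'); C(); }
    else ok = false;
}

/* Returns true when s is a sentence of the grammar. */
bool parse(const char *s)
{
    in = s;
    ok = true;
    S();
    return ok && *in == '\0';
}
```

`parse("abbcc")` makes the calls listed above; `parse("abcc")` fails in B() at the second match('b').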

2.3. When can we use Recursive Descent?
A fast/efficient recursive descent parser can be generated for an LL(1) grammar.
So we must first check if the grammar is LL(1).
The check will generate information that can be used in constructing the parser, e.g. FIRST_SEQ<...>

Dealing with "if"
A tricky part of LL(1) is making sure that branches can be coded.
Each branch must start differently, so it's easy (and also fast) to decide which branch to use based only on the current input token (the currToken value).

e.g. with the input "a ..." and currToken at a:
A --> a B1
A --> b B2
is okay since the two branches start differently (a and b)
A --> a B1
A --> a B2
is not okay since both branches start the same way

In non-mathematical words, a grammar is LL(1) if the choice between productions can be made by looking only at the start of the production bodies and the current input token (currToken).

Is a Grammar LL(1)? (in maths)
The grammar is LL(1) if LL(A) is true for all the non-terminals, A.
LL(A) is true if
( PREDICT(A => ai) ∩ PREDICT(A => aj) ) == {}
for all pairs of A productions (i ≠ j). This is called being pairwise disjoint.
If there is only one A => a production, then LL(A) is true.

Calculating PREDICT()
PREDICT(A => a) = (FIRST_SEQ(a) - {e}) ∪ FOLLOW(A)    if e in FIRST_SEQ(a)
or
PREDICT(A => a) = FIRST_SEQ(a)                        if e not in FIRST_SEQ(a)
FIRST_SEQ() and FOLLOW() are the set functions I described in chapter 4.

Short Example 1
S => a S | a
The grammar is LL(1) if LL(S) is true. But LL(S) is false since
PREDICT(S => a S) = {a}
PREDICT(S => a) = {a}
LL(S) = (({a} ∩ {a}) == {}) = false
So the grammar is not LL(1).

Short Example 2
S => a S | b
The grammar is LL(1) if LL(S) is true. LL(S) is true since
PREDICT(S => a S) = {a}
PREDICT(S => b) = {b}
LL(S) = (({a} ∩ {b}) == {}) = true
So the grammar is LL(1).

Example 3
Is this grammar LL(1)?
S => T | R
T => up | pen
R => down | pen

Production     Predict
S => T         = FIRST_SEQ(T) = {up,pen}
S => R         = FIRST_SEQ(R) = {down,pen}
T => up        {up}
T => pen       {pen}
R => down      {down}
R => pen       {pen}

Calculate LL() for all the non-terminals:
LL(S): {up,pen} ∩ {down,pen} == {pen}; false
LL(T): {up} ∩ {pen} == {}; true
LL(R): {down} ∩ {pen} == {}; true
Not all true, so the grammar is not LL(1).

Example 4
Is this grammar LL(1)?
E => T E1
E1 => + T E1 | e
T => F T1
T1 => * F T1 | e
F => id | '(' E ')'

FIRST(F) = {(,id}     FOLLOW(E) = {$,)}
FIRST(T) = {(,id}     FOLLOW(E1) = {$,)}
FIRST(E) = {(,id}     FOLLOW(T) = {+,$,)}
FIRST(T1) = {*,e}     FOLLOW(T1) = {+,$,)}
FIRST(E1) = {+,e}     FOLLOW(F) = {*,+,$,)}

Production        Predict
E => T E1         = FIRST(T) = {(,id}
E1 => + T E1      {+}
E1 => e           = FOLLOW(E1) = {$,)}
T => F T1         = FIRST(F) = {(,id}
T1 => * F T1      {*}
T1 => e           = FOLLOW(T1) = {+,$,)}
F => id           {id}
F => ( E )        {(}

Calculate LL() for all the non-terminals:
LL(E): {(,id}; true (only one production)
LL(E1): {+} ∩ {$,)} == {}; true
LL(T): {(,id}; true (only one production)
LL(T1): {*} ∩ {+,$,)} == {}; true
LL(F): {id} ∩ {(} == {}; true
All true, so the grammar is LL(1).
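The PREDICT and LL() calculations used in these examples can be sketched with bit sets. This is an illustration, not part of the lecture's code: the enum constants and the functions `predict()` and `ll_ok()` are invented names, and each terminal of the expression grammar (plus e) is assumed to be given its own bit.

```c
#include <stdbool.h>

/* One bit per terminal of the expression grammar; EPS stands for e. */
enum { PLUS = 1, STAR = 2, LPAR = 4, RPAR = 8, ID = 16, END = 32, EPS = 64 };

/* PREDICT(A => a), given FIRST_SEQ(a) and FOLLOW(A) as bit sets. */
unsigned predict(unsigned first_seq, unsigned follow)
{
    if (first_seq & EPS)                               /* e in FIRST_SEQ(a) */
        return (first_seq & ~(unsigned)EPS) | follow;
    return first_seq;                                  /* e not in FIRST_SEQ(a) */
}

/* LL(A): true when the PREDICT sets of A's n productions are pairwise disjoint. */
bool ll_ok(const unsigned predicts[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (predicts[i] & predicts[j])   /* non-empty intersection */
                return false;
    return true;
}
```

For E1, `predict(PLUS, RPAR | END)` gives {+} and `predict(EPS, RPAR | END)` gives {$,)}, so `ll_ok()` on those two sets is true, matching LL(E1) above.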

2.4. Extended Translation Rules
These extra rules allow a production body to use *, [], or e.
S => Body
becomes (same as before)
void S()
{ translate< Body > }

If Body is
B1 | B2 . . . | Bn | e     // the e part is optional
then it becomes:
if (currToken in FIRST_SEQ(B1))
    translate<B1> ;
else if (currToken in FIRST_SEQ(B2))
    translate<B2> ;
   :
else if (currToken in FIRST_SEQ(Bn))
    translate<Bn> ;
else
    error();     // include this only if there's no e part in the grammar rule

If Body is [ B1 B2 . . . Bn ] then it becomes (rule []-1):
if (currToken in FIRST_SEQ(B1)) {
    translate<B1> ;
    translate<B2> ;
       :
    translate<Bn> ;
}
[ B1 B2 ... Bn ] is the same as ( B1 B2 ... Bn ) | e

A variant [] translation. If the body is
[ B1 B2 ... Bn ] C
then it can become (rule []-2):
if (currToken not in FIRST_SEQ(C)) {
    translate<B1> ;
    translate<B2> ;
       :
    translate<Bn> ;
}
translate<C> ;
This test may be simpler to code than one using FIRST_SEQ(B1).

Another variant [] translation. If the grammar rule is
A => [ B1 B2 ... Bn ]
then it becomes (rule []-3):
void A()
{
    if (currToken not in FOLLOW(A)) {
        translate<B1> ;
        translate<B2> ;
           :
        translate<Bn> ;
    }
}
This test may be simpler to code than one using FIRST_SEQ(B1).

If Body is ( B1 B2 . . . Bn )* then it becomes (rule *-1):
while (currToken in FIRST_SEQ(B1)) {
    translate<B1> ;
    translate<B2> ;
       :
    translate<Bn> ;
}

A variant * translation. If the body is
( B1 B2 ... Bn )* C
then it becomes (rule *-2):
while (currToken not in FIRST_SEQ(C)) {
    translate<B1> ;
    translate<B2> ;
       :
    translate<Bn> ;
}
translate<C> ;
This test may be simpler to code than one using FIRST_SEQ(B1).

Another variant * translation. If the grammar rule is
A => ( B1 B2 ... Bn )*
then it becomes (rule *-3):
void A()
{
    while (currToken not in FOLLOW(A)) {
        translate<B1> ;
        translate<B2> ;
           :
        translate<Bn> ;
    }
}
This test may be simpler to code than one using FIRST_SEQ(B1).

match() is slightly changed to deal with the end of input symbol, $:
void match(Token expected)
{
    if (currToken == expected) {
        if (currToken != $)
            currToken = scanner();
    }
    else
        error();
}

Translation Example 1
The LL(1) Grammar:
E => T E1
E1 => [ '+' T E1 ]
T => F T1
T1 => [ '*' F T1 ]
F => id | '(' E ')'
This is the same grammar as on slide 33, so we know it's LL(1).

Generated Parser
void E()     // E => T E1
{
    T(); E1();
}

void E1()    // E1 => [ '+' T E1 ]
{
    if (currToken == '+') {    // rule []-1; C code for "currToken in FIRST_SEQ(+)"
        match('+'); T(); E1();
    }
}

void T()     // T => F T1
{
    F(); T1();
}

void T1()    // T1 => [ '*' F T1 ]
{
    if (currToken == '*') {    // rule []-1; C code for "currToken in FIRST_SEQ(*)"
        match('*'); F(); T1();
    }
}

void F()     // F => id | '(' E ')'
{
    if (currToken == ID)
        match(ID);
    else if (currToken == '(') {
        match('('); E(); match(')');
    }
    else
        error();
}

Parsing "a + b * c"
[Figure: the parse tree for the input a + b * c, built from the E, E1, T, T1 and F rules. The leaves are id a, +, id b, * and id c; each E1 or T1 with no following + or * expands to e.]

Optimizations It's possible to combine grammar rules and/or parse functions, in order to simplify the compiler. For example, we can combine: E and E1 T and T1

Translation Example 2
The previous LL(1) grammar can be expressed using *:
E => T ( '+' T )*
T => F ( '*' F )*
F => id | '(' E ')'     (same as before)

Generated Parser
void E()     // E => T ('+' T)*
{
    T();
    while (currToken == '+') {    // rule *-1
        match('+'); T();
    }
}

void T()     // T => F ('*' F)*
{
    F();
    while (currToken == '*') {    // rule *-1
        match('*'); F();
    }
}

void F()     // F => id | '(' E ')'     (same as before)
{
    if (currToken == ID)
        match(ID);
    else if (currToken == '(') {
        match('('); E(); match(')');
    }
    else
        error();
}
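To experiment with the * version of the parser without a scanner, the sketch below makes several simplifying assumptions that are not in the lecture code: each token is one character, an id is a single lower-case letter, there is no white space, and a `good` flag is used instead of exiting on error. The names `src`, `good`, `eat` and `accepts` are made up for the example.

```c
#include <ctype.h>
#include <stdbool.h>

static const char *src;   /* remaining input */
static bool good;         /* false after a syntax error */

static void expr(void);

static void eat(char expected)          /* simplified match() */
{
    if (good && *src == expected) src++;
    else good = false;
}

static void factor(void)                /* F => id | '(' E ')' */
{
    if (islower((unsigned char)*src)) src++;               /* id: one letter */
    else if (*src == '(') { eat('('); expr(); eat(')'); }
    else good = false;
}

static void term(void)                  /* T => F ('*' F)*, rule *-1 */
{
    factor();
    while (good && *src == '*') { eat('*'); factor(); }
}

static void expr(void)                  /* E => T ('+' T)*, rule *-1 */
{
    term();
    while (good && *src == '+') { eat('+'); term(); }
}

/* Returns true when s is a complete expression. */
bool accepts(const char *s)
{
    src = s;
    good = true;
    expr();
    return good && *src == '\0';
}
```

`accepts("a+b*c")` handles the `+ T` part inside the expr() loop and the `* F` part inside the term() loop.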

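Parsing "a + b * c" Again
[Figure: the parse tree for a + b * c using the * rules. E has the children T + T, where the "+ T" part is done inside the E() loop. The second T has the children F * F, where the "* F" part is done inside the T() loop. The leaves are id a, id b and id c.]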

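3.1. FIRST and FOLLOW Sets
First(Stats) = {let, (, Int, Id, \n, e}
First(Stat) = {let, (, Int, Id}
First(Expr) = {(, Int, Id}
First(Term) = {(, Int, Id}
First(Fact) = {(, Int, Id}
Follow(Stats) = {$}
Follow(Stat) = {\n}
Follow(Expr) = {), \n}
Follow(Term) = {+, -, ), \n}
Follow(Fact) = {*, /, +, -, ), \n}
The ) is in these FOLLOW sets because of Fact => '(' Expr ')'. It does not change the PREDICT sets, since none of these rules has an e production.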

3.2. PREDICT Sets
Production                           Predict                 LL()
Stats => ( [ Stat ] \n )*            {let,(,Int,Id,\n,$}     true
Stat => let ID = Expr                {let}                   {}; true
Stat => Expr                         {(,Int,Id}
Expr => Term ( (+ | - ) Term )*      {(,Int,Id}              true
Term => Fact ( (* | / ) Fact )*      {(,Int,Id}              true
Fact => '(' Expr ')'                 {(}                     {}; true (3 pairs to check)
Fact => Int                          {Int}
Fact => Id                           {Id}

3.3. exprParse0.c
exprParse0.c is a recursive descent parser generated from the expressions grammar.
It reads in an expressions program file.
Its output is a print-out of the parse function calls.

An Expressions Program (test1.txt)
5 + 6 let x = ( (x*y)/2) // comments // y let x = 5 let y = x /0 // comments

Usage
> gcc -Wall -o exprParse0 exprParse0.c
> ./exprParse0 < test1.txt
 1: stats<
 2: stat<expr<term<fact<num(5) >>'+' term<fact<num(6) >>>>
 3: stat<'let' var(x) '=' expr<term<fact<num(2) >>>>
 4: stat<expr<term<fact<num(3) >>'+' term<fact<'(' expr<term<fact<'(' expr<term>
 5:
 6: stat<'let' var(x) '=' expr<term<fact<num(5) >>>>
 7: stat<'let' var(y) '=' expr<term<fact<var(x) >'/' fact<num(0) >>>>
 8:
 9:
10: >'eof'

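exprParse0.c Callgraph
[Figure: the call graph of exprParse0.c. One group of functions is the lexical analyzer (like exprTokens.c); the other group is the parsing functions generated from the grammar.]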

Standard Token Functions
// globals (first used in exprToken.c)
Token currToken;
char tokString[MAX_IDLEN];
int tokStrLen = 0;
int currTokValue;
int lineNum = 1;    // no. of lines read in

void nextToken(void)
{
    currToken = scanner();
}

void match(Token expected)
{
    if (currToken == expected) {
        printToken();    // produces the parser's output
        if (currToken != SCANEOF)
            currToken = scanner();
    }
    else
        printf("Expected %s, found %s on line %d\n",
               tokSyms[expected], tokSyms[currToken], lineNum);
}  // end of match()

void printToken(void)
{
    if (currToken == ID)
        printf("%s(%s) ", tokSyms[currToken], tokString);       // show token string
    else if (currToken == INT)
        printf("%s(%d) ", tokSyms[currToken], currTokValue);    // show value
    else if (currToken == NEWLINE)
        printf("%s%2d: ", tokSyms[currToken], lineNum);         // print newline token
    else
        printf("'%s' ", tokSyms[currToken]);                    // other tokens
}  // end of printToken()

Syntax Error Reporting
void syntax_error(Token tok)
{
    printf("\nSyntax error at \'%s\' on line %d\n", tokSyms[tok], lineNum);
    exit(1);
}

main()
int main(void)
{
    printf("%2d: ", lineNum);
    nextToken();
    statements();      // function for start symbol
    match(SCANEOF);    // check that program is finished at eof
    printf("\n\n");
    return 0;
}

Parsing Functions
void statements(void)
// Stats => ( [ Stat ] '\n' )*
{
    printf("stats<");
    while (currToken != SCANEOF) {     // rule *-3
        if (currToken != NEWLINE)      // rule []-2
            statement();
        match(NEWLINE);
    }
    printf(">");
}  // end of statements()

void statement(void)
// Stat => ( 'let' ID '=' Expr ) | Expr
{
    printf("stat<");
    if (currToken == LET) {
        match(LET);
        match(ID);
        match(ASSIGNOP);
        expression();
    }
    else if ((currToken == LPAREN) || (currToken == INT) || (currToken == ID))
        expression();
    else
        syntax_error(currToken);
    printf(">");
}  // end of statement()
Complicated, but it can be optimized with some 'tricks'.

Version 1
void expression(void)
// Expr => Term ( ( '+' | '-' ) Term )*
{
    printf("expr<");
    term();
    while ((currToken == PLUSOP) || (currToken == MINUSOP)) {    // rule *-1
        if (currToken == PLUSOP)
            match(PLUSOP);
        else if (currToken == MINUSOP)
            match(MINUSOP);
        else
            syntax_error(currToken);
        term();
    }
    printf(">");
}  // end of expression()

Version 2: simplified | code
void expression(void)
// Expr => Term ( ( '+' | '-' ) Term )*
{
    printf("expr<");
    term();
    while ((currToken == PLUSOP) || (currToken == MINUSOP)) {
        match(currToken);
        term();
    }
    printf(">");
}  // end of expression()
Shorter, but also harder to understand!

Version 1
void term(void)
// Term => Fact ( ('*' | '/' ) Fact )*
{
    printf("term<");
    factor();
    while ((currToken == MULTOP) || (currToken == DIVOP)) {    // rule *-1
        if (currToken == MULTOP)
            match(MULTOP);
        else if (currToken == DIVOP)
            match(DIVOP);
        else
            syntax_error(currToken);
        factor();
    }
    printf(">");
}  // end of term()

Version 2: simplified | code
void term(void)
// Term => Fact ( ('*' | '/' ) Fact )*
{
    printf("term<");
    factor();
    while ((currToken == MULTOP) || (currToken == DIVOP)) {
        match(currToken);
        factor();
    }
    printf(">");
}  // end of term()
Shorter, but also harder to understand!

void factor(void)
// Fact => '(' Expr ')' | INT | ID
{
    printf("fact<");
    if (currToken == LPAREN) {
        match(LPAREN);
        expression();
        match(RPAREN);
    }
    else if (currToken == INT)
        match(INT);
    else if (currToken == ID)
        match(ID);
    else
        syntax_error(currToken);
    printf(">");
}  // end of factor()

4. LL(1) Parse Tables
The format of a parse table: T[non-terminals][terminals]
The rows are the non-terminals and the columns are the terminals (== tokens).
The entry T[A][b] holds the production A => a when b ∈ PREDICT(A => a).

Other Data Structures
A sequence of input tokens (ending with $).
A parse stack to hold the non-terminals and terminals that are being processed.
[Figure: a stack with $ at the bottom and E on top, with push and pop operations.]

The Parsing Algorithm
push($);
push(start_symbol);
currToken = scanner();
do {
    X = pop(stack);
    if (X is a terminal or $) {
        if (X == currToken) {              // like match()
            if (X != $)
                currToken = scanner();
        }
        else
            error();
    }
    else {    // X is a non-terminal
        if (T[X][currToken] == X => Y1 Y2 ... Ym) {
            push(Ym); ... push(Y1);
        }
        else
            error();    // empty table entry
    }
} while (X != $);

4.1. Table Parsing Example
Use the LL(1) grammar:
E => T E1
E1 => '+' T E1 | e
T => F T1
T1 => '*' F T1 | e
F => id | '(' E ')'
This is the LL(1) grammar from slide 33.

Parse Table Generation
Production            Predict
1: E => T E1          {(,id}
2: E1 => + T E1       {+}
3: E1 => e            {$,)}
4: T => F T1          {(,id}
5: T1 => * F T1       {*}
6: T1 => e            {+,$,)}
7: F => id            {id}
8: F => ( E )         {(}

NT/T    +    *    (    )    ID   $
E                 1         1
E1      2              3         3
T                 4         4
T1      6    5         6         6
F                 8         7

Parsing "a + b * c $"
The top of the stack is on the right.
Stack          Input      Action
$E             a+b*c$     E => T E1
$E1 T          a+b*c$     T => F T1
$E1 T1 F       a+b*c$     F => id
$E1 T1 id      a+b*c$     match
$E1 T1         +b*c$      T1 => e
$E1            +b*c$      E1 => + T E1
$E1 T +        +b*c$      match
$E1 T          b*c$       T => F T1
$E1 T1 F       b*c$       F => id
$E1 T1 id      b*c$       match
$E1 T1         *c$        T1 => * F T1
$E1 T1 F *     *c$        match
$E1 T1 F       c$         F => id
$E1 T1 id      c$         match
$E1 T1         $          T1 => e
$E1            $          E1 => e
$              $          Success!
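The table-driven algorithm can be sketched in C for this grammar. The encoding is an assumption made for the example: the non-terminals are the characters E, X (for E1), T, Y (for T1) and F; the terminals are + * ( ) and $, with 'i' standing for the id token; and the parse table is written as a function, `table()`, rather than as a 2D array. The numbers in the comments are the production numbers from the table above.

```c
#include <stdbool.h>
#include <string.h>

/* T[nt][tok]: body of the predicted production, "" for e, NULL for an error. */
static const char *table(char nt, char tok)
{
    switch (nt) {
    case 'E': return (tok == '(' || tok == 'i') ? "TX" : NULL;      /* 1 */
    case 'X': if (tok == '+') return "+TX";                         /* 2 */
              return (tok == ')' || tok == '$') ? "" : NULL;        /* 3 */
    case 'T': return (tok == '(' || tok == 'i') ? "FY" : NULL;      /* 4 */
    case 'Y': if (tok == '*') return "*FY";                         /* 5 */
              return (tok == '+' || tok == ')' || tok == '$')
                     ? "" : NULL;                                   /* 6 */
    case 'F': if (tok == 'i') return "i";                           /* 7 */
              return (tok == '(') ? "(E)" : NULL;                   /* 8 */
    }
    return NULL;
}

/* The input must end with '$', e.g. "i+i*i$". */
bool ll1_parse(const char *input)
{
    char stack[256];
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'E';                          /* start symbol */
    while (top > 0) {
        char x = stack[--top];                   /* X = pop() */
        if (strchr("EXTYF", x) == NULL) {        /* terminal or $ */
            if (x != *input) return false;       /* like match() */
            if (x == '$') return true;           /* success */
            input++;
        } else {                                 /* non-terminal */
            const char *body = table(x, *input);
            if (body == NULL) return false;      /* empty table entry */
            size_t n = strlen(body);
            if (top + (int)n > (int)sizeof stack) return false;  /* stack full */
            while (n > 0) stack[top++] = body[--n];   /* push Ym ... Y1 */
        }
    }
    return false;
}
```

`ll1_parse("i+i*i$")` goes through the same stack contents as the trace above, with i in place of a, b and c.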

5. Making a Grammar LL(1) Not all context free grammars are LL(1).
Use LL() to find the productions which cause the problem, and use various techniques to rewrite them so LL() is true.

Example
Production       Predict
E => E + T       = FIRST(E) = {(,id}
E => T           = FIRST(T) = {(,id}
T => T * F       = FIRST(T) = {(,id}
T => F           = FIRST(F) = {(,id}
F => id          = {id}
F => ( E )       = {(}

FIRST(F) = {(,id}     FOLLOW(E) = {$,),+}
FIRST(T) = {(,id}     FOLLOW(T) = {+,$,),*}
FIRST(E) = {(,id}     FOLLOW(F) = {+,$,),*}
E and T are the problem since LL(E) and LL(T) are false.

Example of the Problem Input "5 + b"
There are two productions to choose from: E => E + T E => T Which should be chosen by looking only at the current token "5"?

5.1. From non-LL(1) to LL(1)
There are two main techniques for converting a non-LL(1) grammar to LL(1), but they don't work for every grammar.
1. Left Factoring
   e.g. used on A => B a C D | B a C E
2. Transforming left recursion to right recursion
   e.g. used on E => E + T | T

5.2. Left Factoring
S => a B | a C
To see the problem, try choosing a production to parse "a" in "andrew".
Change S to:
S => a S1
S1 => B | C
Now there is no difficult choice.

In general:
A => a b1 | a b2 | ... | a bn
becomes
A => a A1
A1 => b1 | b2 | ... | bn

5.3. Why is Left Recursion a Problem?
Grammar: A => A b A => b The input is "bbbb". Using only the current token, "b", which production should be used?

Remove Left Recursion
A => A a1 | A a2 | … | b1 | b2 | …
becomes
A => b1 A1 | b2 A1 | …
A1 => a1 A1 | a2 A1 | … | e
The left recursion is changed to right recursion in the new A1 rule.

Example Translation
The left recursive grammar:
A => A b | b
becomes
A => b A1
A1 => b A1 | e
Try parsing the input string "bbbb" using only the current token "b".
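The transformed grammar can be parsed with one token of lookahead, as the sketch below shows. It assumes the input is a C string of 'b' characters; the names `p` and `parse_bs` are invented for the example.

```c
#include <stdbool.h>

static const char *p;   /* remaining input */

static void A1(void)            /* A1 => b A1 | e */
{
    if (*p == 'b') {            /* PREDICT(A1 => b A1) = {b} */
        p++;
        A1();
    }
    /* otherwise choose A1 => e; the caller checks for the end of input */
}

static bool A(void)             /* A => b A1 */
{
    if (*p != 'b') return false;
    p++;
    A1();
    return true;
}

/* Returns true when s is one or more b's. */
bool parse_bs(const char *s)
{
    p = s;
    return A() && *p == '\0';
}
```

For "bbbb", A() consumes the first b and A1() consumes the other three, one per recursive call, then picks A1 => e at the end of the input.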

Fixing the E Grammar
The E grammar is not LL(1):
E => E + T | T
T => T * F | F
F => id | ( E )
This is the same as the grammar on slide 81, so the problems are in the E and T productions.

Eliminate left recursion in E and T:
E => T E1
E1 => + T E1 | e
T => F T1
T1 => * F T1 | e
F => id | ( E )
This version of the E grammar is LL(1). We've been using it for most of our examples (see slide 33).

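Example
A => B c | d
B => C f | B f
C => A e | g
[Figure: A depends on B, B on C, and C on A, so the left recursion goes round a cycle.]
Replace C in B's production by C's defn:
B => A e f | g f | B f
Replace A in B's production by A's defn:
B => B c e f | d e f | g f | B f
B is now directly left recursive, so the left recursion removal technique can be applied to it.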

6. Error Recovery in LL Parsing
Simple answer: when there's an error, print a message and exit.
Better error recovery:
1. insert the expected token and continue
   this approach can cause non-termination
2. keep deleting tokens until the parser gets a token in the FOLLOW set for the production that went wrong
   see the example that follows

Example: E→T E1 from slide 33 (grammar) and slide 46 (code)
void E()
{
    if (currToken in FIRST(T)) {    // error checking; FIRST(T) == {(,ID}
        T(); E1();
    }
    else {    // error reporting and recovery
        printf("Expecting one of FIRST(T)");
        while (currToken not in FOLLOW(E))    // FOLLOW(E) == {),$}
            currToken = scanner();            // skip input
    }
}  // end of E()

C Code
void E()
{
    if ((currToken == LPAREN) || (currToken == ID)) {
        T(); E1();
    }
    else {
        printf("Expecting ( or id");
        while ((currToken != RPAREN) && (currToken != SCANEOF))
            currToken = scanner();
    }
}  // end of E()
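The effect of this recovery can be seen in a small self-contained sketch. Several things here are assumptions and not part of the lecture code: tokens are single characters ('i' stands for id and `'\0'` for end of input), E1 and T1 are folded into loops, errors are counted rather than causing an exit, and only E() skips input. The names `tok`, `errors` and `parse_with_recovery` are made up.

```c
#include <stdio.h>

static const char *tok;   /* current token */
static int errors;        /* number of syntax errors reported */

static void E(void);

static void match(char expected)
{
    if (*tok == expected) tok++;
    else errors++;
}

static void F(void)                   /* F => id | '(' E ')' */
{
    if (*tok == 'i') match('i');
    else if (*tok == '(') { match('('); E(); match(')'); }
    else errors++;
}

static void T(void)                   /* T => F ('*' F)* */
{
    F();
    while (*tok == '*') { match('*'); F(); }
}

static void E(void)                   /* E => T ('+' T)* */
{
    if (*tok == '(' || *tok == 'i') {             /* FIRST(T) == {(,id} */
        T();
        while (*tok == '+') { match('+'); T(); }
    } else {                                      /* report and recover */
        errors++;
        fprintf(stderr, "Expecting ( or id\n");
        while (*tok != ')' && *tok != '\0')       /* FOLLOW(E) == {),$} */
            tok++;                                /* skip input */
    }
}

/* Returns the number of syntax errors found in s. */
int parse_with_recovery(const char *s)
{
    tok = s;
    errors = 0;
    E();
    if (*tok != '\0') errors++;       /* unparsed input left over */
    return errors;
}
```

With the input "(+i)*i", the inner E() reports one error, skips "+i" up to the ")", and the parser then carries on with the rest of the input instead of stopping.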
