# CS 410 / 510 Mastery in Programming Chapter 5 LL(1) Parsing Herbert G. Mayer, PSU CS Status 7/14/2013

1 CS 410 / 510 Mastery in Programming Chapter 5 LL(1) Parsing Herbert G. Mayer, PSU CS Status 7/14/2013

2 Syllabus

- Goal
- Grammars Formally, Intuitively
- BNF, EBNF
- A Sample Ambiguous Grammar G1
- Suitable Grammar
- Left-Recursion Elimination
- Lambda-Rule Elimination
- Suitable G5 and G6
- Use Grammar for Parsing
- Recursive Descent
- Recursive Descent Parser For s
- Parser for G2
- Parser for G4
- References

3 Goal

- Become familiar with suitable grammars. Suitable means that certain rules are not allowed, such as left-recursive rules, circular rules, and lambda-producing rules. Note: with one exception!
- The rules of a programming language L specify how to generate strings in L; all other strings are not part of L
- The number of strings in L (i.e. the size of the set L) is generally unbounded for usable programming languages
- One way of expressing language rules is through some grammar G
- The class of grammars handled here is restricted to context-free ones; the more powerful class of grammars with context-sensitive rules is excluded
- A side goal is to learn a particular notation for writing grammars, but that notation is simply a convenience, a handy way of writing
- We'll focus on Backus Naur Form (BNF), AKA Backus Normal Form, from the early days of Algol-60

4 Grammars Formally

- A grammar G for language L, named G(L), is a quintuple { terminals, nonterminals, metasymbols, start symbol, productions } defining all strings in L; a string in L is named a program
- Start Symbol: The one nonterminal that starts the process of generating strings in L; it does not have to be the first nonterminal defined in G
- Terminal: A final token in a program, e.g. "hello". Such a token cannot derive other strings; it solely represents itself
- Nonterminal Symbol: A grammar symbol used as short-hand for a string of other symbols; it is defined on the left-hand side of a production; multiple alternatives are grouped via the metasymbol |
- Metasymbol: A symbol of the grammar that helps define the process of string generation; it is not part of the language L defined by G, but is a grammar short-hand; hence the name metasymbol
- Production: A rule that defines a nonterminal; it consists of the nonterminal being defined on the left-hand side, followed by the "produces" metasymbol, plus some string of symbols on the right-hand side that is not circular

5 Grammars, Some Terminology

- The empty string is referred to as lambda. We'll use lambda as a convenience in grammar writing; otherwise it is superfluous. It is also frequently referred to in the literature as epsilon
- Lambda is superfluous as a grammar tool, except if the language allows the empty program. In all other cases, rules that produce lambda can be replaced by other rules that do not use lambda, at the expense of a more complex grammar
- The right-hand side of a suitable production (AKA alternative) eventually starts with a terminal; there could be several terminals, for several alternatives. The set of all distinct terminals that can start a right-hand side is called the first set

6 Grammars Intuitively

- A grammar G is a set of rules to produce programs; programs are strings of characters in a programming language L
- Each rule has a name on the left-hand side, the nonterminal, which generates at least one sequence of other symbols; those can be terminals or nonterminals, listed on the right-hand side
- A terminal is a symbol expressing a value directly, like 500. It can also be some fixed symbol, like + or ( or END. A terminal symbol cannot produce other strings
- A nonterminal is a name that can be used on the right-hand side of a production. It occurs at least once on the left-hand side of a production, and is defined by a string of nonterminals and terminals
- When there are multiple rules (AKA productions) for a nonterminal, we call these alternatives
- One of the nonterminals is the start symbol. That is where the generating process starts; it is often written as the first rule, but must be clearly identified somehow

7 Grammars

Example for grammar G0:

    s: s ( s )
     |

Discussion of G0:

- The only nonterminal symbol used in grammar G0 is s. Hence s must also be the start symbol
- There are 2 metasymbols, or if we are picky 3:
  - Metasymbol : means "the left side produces the string on the right"
  - Metasymbol | means "another alternative for s"
  - End of all rules means it is the end of G0
- Nothing to the right of | means: "this alternative generates the empty string", i.e. nothing, or lambda; some authors call this epsilon
- The first alternative of the 2 productions in G0 is left-recursive
- There are 2 terminal symbols in L(G0), these are ( and )
- We can debate whether the empty string lambda is also a terminal symbol
- I do not count the empty string, as this would create a situation in which an infinite sequence of the same terminal symbol (of nothings) is the same as a single occurrence; not suitable for language grammars

8 BNF, EBNF

- While authoring the report on the language Algol60 in the late 1950s, John Backus developed a convenient short-hand, ably supported by ideas from Peter Naur: Backus Normal Form, AKA Backus Naur Form
- Typical metasymbols in the Algol60 report: `::=` `|` `< >` `[ ]`
  - `[ .. ]` encloses an optional phrase; allowed once or not at all
  - `< .. >` defines the nonterminal enclosed; allows disambiguation between, say, a nonterminal `<start>` and a terminal symbol start
  - `::=` is the "produces" symbol; we'll use a simpler one
  - `|` starts another alternative for a production
- The notation found wide acceptance and was extended to allow multiple options, by using the additional `{` and `}` metasymbols:
  - `{ .. }` states that the .. part is included 0 or more times
  - `{ .. }+` states that the .. part is included 1 or more times
  - `[ .. ]` states that the .. part is optional, i.e. included once or not at all
- Hence this type of grammar is called EBNF, for Extended BNF

9 A Sample Ambiguous Grammar G1

- Metasymbol : means "produces"
- Metasymbol | means "r.h.s. also produces ..."; another alternative
- Nonterminals: e and n
- Terminals: + - * / ^ ( ) 0 1 2 3 4 5 6 7 8 9
- Start symbol: e

Grammar G1:

    e: e + e      -- addition, but that is "semantics"
     | e - e      -- subtraction
     | e * e      -- multiplication
     | e / e      -- division
     | e ^ e      -- exponentiation
     | ( e )      -- grouping with parentheses
     | n          -- nonterminal for the 10 terminals below
    n: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

10 Strings in G1

Sample strings (numbers in G1 are single digits):

    8
    0+7
    6*6-4+2
    2*(3+2)
    (((7)))
    ((9)+8)*(((5-4)/2)/0)

Operator precedence and ambiguity:

- In conventional arithmetic, * and / have stronger binding than + and -; this binding strength is AKA precedence or priority
- G1 alone cannot express that! The parser discussed does not account for precedences
- Precedence can be encoded in the grammar too, but that is not covered here, since we do not include a discussion of semantics, i.e. code generation
- If there are multiple parse trees for some string of terminals, then the grammar is called ambiguous. G1 is ambiguous
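The effect of ambiguity can be made concrete with a small sketch. It is not part of the slides: the `Node` type, `eval()`, and the tree names are hypothetical, and attaching values to a tree is exactly the "semantics" the chapter otherwise leaves out. The sketch hand-builds the two parse trees that G1 allows for the string 8-4-2 and evaluates each.

```c
#include <stddef.h>

/* Hypothetical parse-tree node: op == 0 marks a leaf holding a digit. */
typedef struct Node {
    char op;                          /* '+', '-', '*', '/' or 0 for a leaf */
    int value;                        /* used only by leaves */
    const struct Node *left, *right;  /* used only by operator nodes */
} Node;

/* Evaluate a tree bottom-up; division by zero is not checked. */
int eval(const Node *n) {
    if (n->op == 0) return n->value;
    int l = eval(n->left);
    int r = eval(n->right);
    switch (n->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        default:  return l / r;       /* '/' */
    }
}

static const Node eight = {0, 8, NULL, NULL};
static const Node four  = {0, 4, NULL, NULL};
static const Node two   = {0, 2, NULL, NULL};

/* Tree 1: the left operand of the top-level '-' is itself "8 - 4". */
static const Node left_inner = {'-', 0, &eight, &four};
const Node left_tree         = {'-', 0, &left_inner, &two};

/* Tree 2: the right operand of the top-level '-' is itself "4 - 2". */
static const Node right_inner = {'-', 0, &four, &two};
const Node right_tree         = {'-', 0, &eight, &right_inner};
```

Both trees are legal derivations of 8-4-2 in G1, yet `eval(&left_tree)` and `eval(&right_tree)` yield different values, which is why a grammar used for more than pure recognition should not be ambiguous.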

11 Grammar G2

Rewrite G1 so that it is suitable for RD parsing; introduce the metasymbols { } for repetition 0 or more times; see G2:

    expression: term { plus_op term }
    plus_op:    + | -
    term:       factor { mult_op factor }
    mult_op:    * | /
    factor:     primary { ^ primary }
    primary:    ( expression ) | number
    number:     0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Note that the position of the semantic action effectively defines the evaluation order among operators of equal precedence; this is important for ^, which is right-associative! Others are usually left-associative; except in APL! We won't cover semantics in CS 410/510

12 Suitable Grammar

- Definition: Parsing means "analysing a string for grammatical correctness, according to the rules of language L"
- Definition: A program written in language L is a string of terminal symbols; these symbols are strung together according to the grammar rules of L
- Such a program can be empty only if there is a way for the start symbol to generate lambda
- We parse program strings in a top-down fashion. Top-down means: we start the string generation with the start symbol, matching terminals from the input stream one symbol (i.e. terminal) at a time. Other methods exist that are not discussed here, e.g. bottom-up parsing
- When we see several alternatives during the parse that may have created the source string so far, we look ahead one source symbol to determine the correct next alternative
- Thus was coined the short-hand LL(1): Left-to-right reading of the source symbols, Leftmost derivation (the leftmost nonterminal is always expanded first), 1 symbol of look-ahead

13 Suitable Grammar

A meaningful grammar G is suitable for LL(1) parsing if it adheres to:

1. No lambda productions: Except for the start symbol, no nonterminal is allowed to generate the empty string; the reason is that a parser can always succeed in finding an empty string, so there is no real information in finding lambda
2. No left-recursive rules: In the presence of left-recursive rules, the resulting recursive descent parser would cause infinite regress, i.e. self-recursive calls until stack overflow. Details belong in the compiler course
3. No circular productions: There cannot be productions of the type

       a: a ...      -- left-recursive without intermediate productions!
       a: b ...
       b: a ...      -- circular: left-recursive with intermediate productions!

4. No context-sensitive rules: Two or more nonterminals a and b must not occur together on the left side of some production, defining unique strings different from the concatenation of a and b:

       a b: some unique sequence      -- is not permitted

5. Empty intersection of first sets: The set of possible tokens that can start a production is called the First Set. If some productions do share a token: factorise out!

14 Left-Recursion Elimination

- A grammar G is suitable for LL(1) parsing if it includes no left-recursive rules
- If we start out with a left-recursive G, replace the left-recursive productions with equivalent ones that are not left-recursive
- Sample grammar G with 2 terminals 'A' and 'B' and one nonterminal 'a', producing all strings of a single B followed by any number of A:

      G:   a:  a A | B

- Replace G with G', which is not left-recursive, but as a result introduces lambda-productions:

      G':  a:  B a'
           a': A a' | lambda
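A minimal sketch of why the rewrite to G' matters: each nonterminal of G' becomes one function, and because no function calls itself before consuming a symbol, the recursion terminates. The names `cursor`, `a_prime()` and `accepts()` are hypothetical, and reading from a string rather than from a scanner is an assumption made to keep the sketch short.

```c
#include <stdbool.h>

/* Hypothetical cursor into the input string; stands in for a scanner. */
static const char *cursor;

/* a': A a' | lambda */
static void a_prime(void) {
    if (*cursor == 'A') {   /* 'A' is in the first set of the non-empty alternative */
        cursor++;           /* consume the A */
        a_prime();          /* recurse only after a symbol was consumed */
    }
    /* otherwise take the lambda alternative: consume nothing */
}

/* a: B a' */
static bool a(void) {
    if (*cursor != 'B') return false;   /* every string must start with B */
    cursor++;
    a_prime();
    return true;
}

/* Returns true if text is a single B followed by zero or more A. */
bool accepts(const char *text) {
    cursor = text;
    return a() && *cursor == '\0';      /* no garbage after the string */
}
```

For example, `accepts("BAAA")` holds while `accepts("AB")` does not. Coding the original rule `a: a A | B` the same way would make `a()` call itself as its very first action, without consuming input, which is the infinite regress mentioned for left-recursive rules.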

15 Lambda-Rule Elimination

- A grammar is suitable for LL(1) parsing if it exhibits no lambda-productions
- But if we start out with lambda-productions, we have to transform the grammar into an equivalent one that is free of such rules, except if lambda is a program

      G':  a:  B a'
           a': A a' | lambda

- One method of lambda-elimination is to expand the grammar with additional productions for each nonterminal that can generate lambda:

      G'': a:  B a' | B      -- additional rule
           a': A a' | A      -- additional rule

16 Suitable G5 and G6

Analyse both grammars G5 and G6 below for their suitability for LL(1) parsing. In both, the start symbol is s, and A and B are terminals.

G5:

    s: A b | B a | lambda
    a: A | A s | B a a
    b: B | B s | A b b

- Is G5 context free?
- Is G5 lambda-free, aside from the start symbol producing lambda?
- Is G5 free from left-recursive rules?

G6:

    s: A s B | B s A | s s | lambda

- Is G6 context free?
- Is G6 lambda-free, aside from the start symbol producing lambda?
- Is G6 free from left-recursive rules?

Describe the languages. Compare L(G5) and L(G6): Are they similar?

17 Use Grammar for Parsing

1. Once we have a suitable grammar G, use G to methodically (mechanically, automatically) design a parser for language L(G). The method is named Recursive Descent Parsing; a common, old method, outlined below
2. Once we have a suitable grammar G, encode G directly as a data structure. Then write a simple loop that reads the source and traverses the data structure, driven by the incoming token stream, deciding at each point which production of G to use that would allow the current source symbol (AKA token)
3. If indeed a person can "mechanically implement a parser for all strings in L" given G, then a program can do so as well; Church's Thesis. These programs exist and are called parser generators. Their inventors sometimes call them "Compiler Compilers"; sounds fancier
4. A widely used industrial-quality parser generator is YACC, so named after the tongue-in-cheek phrase: Yet Another Compiler Compiler. Available on Unix systems
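Option 2 can be illustrated with a minimal sketch for the non-left-recursive parenthesis grammar `s: ( s ) s | lambda`. Everything here is an assumption for illustration: the one-row "table" `rhs_for_s()`, the explicit symbol stack, and the name `parse_parens()` are hypothetical, and a real table-driven parser would index a table by nonterminal and token rather than hard-code a single row.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_STACK 256

/* Parse table for the single nonterminal 's', indexed by the look-ahead:
   '(' selects the alternative "( s ) s"; anything else selects lambda,
   encoded here as the empty string. */
static const char *rhs_for_s(char look) {
    return look == '(' ? "(s)s" : "";
}

/* Simple loop driven by the incoming symbols: pop a grammar symbol,
   expand it if it is the nonterminal, match it if it is a terminal. */
bool parse_parens(const char *input) {
    char stack[MAX_STACK];
    size_t top = 0;
    stack[top++] = 's';                         /* start symbol */
    while (top > 0) {
        char symbol = stack[--top];
        if (symbol == 's') {                    /* nonterminal: consult the table */
            const char *rhs = rhs_for_s(*input);
            size_t len = strlen(rhs);
            if (top + len > MAX_STACK) return false;   /* nesting too deep */
            for (size_t i = len; i > 0; i--) {
                stack[top++] = rhs[i - 1];      /* push right-hand side reversed */
            }
        } else {                                /* terminal: must match the input */
            if (*input != symbol) return false;
            input++;
        }
    }
    return *input == '\0';                      /* no garbage after the program */
}
```

`parse_parens("()()()")` succeeds and `parse_parens(")(")` fails. The loop itself never changes; only the table would grow for a larger grammar, which is what makes this form attractive for parser generators.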

18 Now for the MAIN idea:

19 Recursive Descent

Goal: Describe an algorithm to mechanically produce a parser for language L(G) using a suitable grammar G

Preparation: Write a scanner, AKA lexical analyzer, scan() that reads the source program one character at a time and returns a token t for each string of characters constituting a whole token, AKA lexeme. Lambda is not one of the possible tokens. And then:

- For each nonterminal n defined in G, define a recursive function (procedure) by that name, n(); we'll re-write some nonterminals
- For each nonterminal n used on the right-hand side in G, call n()
- For each terminal t that is required by any alternative in G, call must_be( t ) to verify that t was found, and scan() the next token after t
- When a production has multiple alternatives, use the mutually exclusive first sets of each alternative and the next input token t (i.e. look ahead 1 token) to determine which nonterminal n() to call; if the first set does not resolve this: error; we don't have a suitable grammar!

20 Recursive Descent Parser For s

Grammar G0, here in a form that is not left-recursive:

    s: ( s ) s
     |

- Sample strings in L(G0): () or ((())) or ()()() but not )(
- scan(): For such simple tokens (AKA lexemes) that consist of the single characters '(' and ')', a scanner can be as simple as the C/C++ function getchar(). But generally, tokens are multi-character symbols
- Function must_be( t ) simply checks for the expected symbol t:

      // assume global: char next_char, and function void scan()
      void must_be( char expected )
      { // must_be
        if ( next_char != expected ) {
          printf( "Expected '%c', is '%c'.\n", expected, next_char );
        } //end if
        scan();
      } //end must_be

21 Recursive Descent Parser For s

    // other declarations here, e.g. BLANK, OPEN, CLOSED, next_char, Assert ...

    void scan( )
    { // scan
      next_char = getchar();           // read next input character
      if ( BLANK == next_char ) {      // skip ' '
        scan();
      }else{
        printf( "%c", next_char );     // echo the non-blank found
      } //end if
    } //end scan

    void s()                           // start symbol of grammar G0
    { // s
      if ( next_char == OPEN ) {       // that is open parenthesis '('
        scan();                        // throw away the '('
        s();                           // recurse for nested (
        must_be( CLOSED );             // i.e. closed parenthesis ')'
        s();                           // recurse for sequence ( ) ( )
      } //end if                       // no more OPEN found; return
    } //end s

    int main()
    { // main
      scan();                          // get first ever token
      s();                             // language
      Assert( EOF, "Garbage found" );  // nothing may follow the program
      return 0;
    } //end main

22 Repeat of Grammar G2

    expression: term { plus_op term }
    plus_op:    + | -
    term:       factor { mult_op factor }
    mult_op:    * | /
    factor:     primary { ^ primary }
    primary:    ( expression ) | number
    number:     0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

23 Parser For G2 expression(), 1

    // parser for grammar G2:
    //
    // expression: term { plus_op term }
    // plus_op:    '+' | '-'
    // term:       factor { mult_op factor }
    // mult_op:    '*' | '/'
    // factor:     primary { ^ primary }
    // primary:    '(' expression ')'
    //           | number
    // number:     '0' | '1' | '2' ... '9'

    #include <stdio.h>                 // printf(), getchar()

    #define BLANK  ' '
    #define EOL    '\n'
    #define OPEN   '('
    #define CLOSED ')'

    char next_char = BLANK;            // globally used for "token"

    void scan( );                      // defined below, used by ASSERT

    // if the current token is c, skip it; otherwise report an error
    #define ASSERT( c )                                                 \
      if ( next_char != c ) {                                           \
        printf( "Error, expected '%c', found '%c'\n", c, next_char );   \
      }else{                                                            \
        scan();                                                         \
      } //end if

    void scan( )
    { // scan
      next_char = getchar();
      if ( BLANK == next_char ) {
        scan();
      }else{
        printf( "%c", next_char );     // echo non-blank found
      } //end if
    } //end scan

    void expression();                 // forward announcement!!

24 Parser For G2 expression(), 2

    // scans a single digit number; if not found: error
    void number()
    { // number
      if ( ( next_char >= '0' ) && ( next_char <= '9' ) ) {
        scan();
      }else{
        printf( "primary expression 0,1,2.. or '(' expected.\n" );
      } //end if
    } //end number

    // parse primary expression, either:
    // ( ... ) or a number
    void primary()
    { // primary
      if ( next_char == OPEN ) {
        scan();
        expression();
        ASSERT( CLOSED );
      }else{
        number();
      } //end if
    } //end primary

25 Parser For G2 expression(), 3

    // parse highest priority operator ^
    void factor()
    { // factor
      primary();
      while ( next_char == '^' ) {
        scan();
        primary();
      } //end while
    } //end factor

    // parse multiply operators; no need to write mult_op nonterminal
    void term()
    { // term
      factor();
      while ( ( next_char == '*' ) || ( next_char == '/' ) ) {
        // note: abbreviation from "mult_op()"
        scan();
        factor();
      } //end while
    } //end term

    // parse adding operators + and -, no need to write plus_op nonterminal
    void expression()
    { // expression
      term();
      while ( ( next_char == '+' ) || ( next_char == '-' ) ) {
        // note: abbreviation from "plus_op()"
        scan();
        term();
      } //end while
    } //end expression

26 Parser For G2 expression(), 4

    // get first token
    // then parse complete expression
    // assert no more source after expression
    //
    int main()
    { // main
      scan();
      expression();
      ASSERT( EOL );
      return 0;
    } //end main

27 Sample Input for expression()

    (( 8) )
    ( ( ( 5 + 3* 3 ) / ( 5^6 ) - 2 ) ^ ( 2 ^ 6 ^ 7 ) )

28 A Parsing Variation

- We broke the general rule for Recursive Descent Parsing, namely defining a recursive function for each nonterminal symbol in G
- We coded the scanning of operators directly in-line, e.g. + and -, or * and /, using a while loop to parse one or more of the [repeated] operators instead!
- In such cases, the semantic actions can be associated with the operator just scanned in a left-to-right fashion, i.e. the semantic actions are done left-associatively
- An equally elegant way is to use an If-Statement and call the parsing function directly recursively
- This easily allows right-associative semantic actions
- Recursion parses multiple operators of the same precedence
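The difference between the two styles shows once a value is computed, which goes beyond what the slides cover. The following sketch is an illustration under stated assumptions: the mini-grammar with only `-` and `^`, the names `power()`, `difference()`, `ipow()` and `evaluate()`, and the string cursor `p` are all hypothetical, the input must consist of single digits and operators without blanks, and there is no error handling.

```c
/* Hypothetical cursor into the expression text; stands in for a scanner. */
static const char *p;

/* Assumes the current character is a digit; returns its value. */
static long digit(void) {
    return *p++ - '0';
}

/* Integer power by repeated multiplication; exp is assumed >= 0. */
static long ipow(long base, long exp) {
    long result = 1;
    while (exp-- > 0) result *= base;
    return result;
}

/* power: digit [ ^ power ]
   If-statement plus direct recursion: the right operand is evaluated
   completely before the action runs, so ^ becomes right-associative. */
static long power(void) {
    long base = digit();
    if (*p == '^') {
        p++;
        return ipow(base, power());
    }
    return base;
}

/* difference: power { - power }
   While loop: the action runs after each operand, folding the running
   value from left to right, so - becomes left-associative. */
static long difference(void) {
    long value = power();
    while (*p == '-') {
        p++;
        value -= power();
    }
    return value;
}

long evaluate(const char *text) {
    p = text;
    return difference();
}
```

With these definitions `evaluate("8-4-2")` computes (8-4)-2, while `evaluate("2^3^2")` computes 2^(3^2). Swapping the loop and the recursion between the two functions would swap the associativity as well.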

29 Change Grammar G2 to G3

    expression: term [ plus_op expression ]
    plus_op:    + | -
    term:       factor [ mult_op term ]
    mult_op:    * | /
    factor:     primary [ ^ factor ]
    primary:    ( expression ) | number
    number:     0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

30 Modified Parse For G3

    // parse highest priority operator ^
    void factor()
    { // factor
      primary();
      if ( '^' == next_char ) {
        scan();
        factor();                      // <- parse repeated ^ operators
      } //end if
    } //end factor

    // parse multiply operators; no need to write mult_op nonterminal
    void term()
    { // term
      factor();
      if ( ( next_char == '*' ) || ( next_char == '/' ) ) {
        // note: abbreviation from "mult_op()"
        scan();
        term();                        // <- parse repeated * and / operators
      } //end if
    } //end term

    // parse adding operators + and -, skip plus_op nonterminal
    void expression()
    { // expression
      term();
      if ( ( next_char == '+' ) || ( next_char == '-' ) ) {
        // note: abbreviation from "plus_op()"
        scan();
        expression();                  // <- parse repeated + and - operators
      } //end if
    } //end expression

31 Data Structure and Grammar To be handled in compiler course Possibly a future extension at CS 410/510

32 Grammar G4 For Statement s()

    s:                statement [ s ]
    statement:        if_statement | assign_statement
    if_statement:     IF_SYM expression THEN_SYM statement
                      [ ELSE_SYM statement ] FI_SYM ';'
    assign_statement: ident '=' expression ';'

Separate ideas:

- expression: as discussed earlier
- *_SYM: tokens returned by scan(), e.g. IF_SYM
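The slides do not show the scanner that delivers these tokens. A minimal sketch of one follows, with several assumptions that are not in the source: the keyword spellings `if`, `then`, `else`, `fi`, the `Token` enumeration, the helper `set_source()`, reading from a string instead of from a file, and the catch-all `other_sym` for any character the statement grammar does not name (e.g. expression operators).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token set for the statement grammar. */
typedef enum {
    if_sym, then_sym, else_sym, fi_sym,
    ident, number_sym, assign_sym, semi_sym,
    other_sym,              /* any other single character */
    eof_sym                 /* end of the source text */
} Token;

static const char *src = "";    /* remaining source text */
Token token = eof_sym;          /* current look-ahead token */

void set_source(const char *text) {
    src = text;
}

/* True if the lexeme of length len starting at start spells word. */
static int is_word(const char *start, size_t len, const char *word) {
    return strlen(word) == len && strncmp(start, word, len) == 0;
}

/* Skip white space, recognise the next lexeme, and store its token. */
Token scan(void) {
    while (isspace((unsigned char)*src)) src++;
    if (*src == '\0') return token = eof_sym;
    if (*src == '=') { src++; return token = assign_sym; }
    if (*src == ';') { src++; return token = semi_sym; }
    if (isdigit((unsigned char)*src)) { src++; return token = number_sym; }
    if (isalpha((unsigned char)*src)) {
        const char *start = src;
        while (isalnum((unsigned char)*src)) src++;   /* multi-character lexeme */
        size_t len = (size_t)(src - start);
        if (is_word(start, len, "if"))   return token = if_sym;
        if (is_word(start, len, "then")) return token = then_sym;
        if (is_word(start, len, "else")) return token = else_sym;
        if (is_word(start, len, "fi"))   return token = fi_sym;
        return token = ident;                         /* any other name */
    }
    src++;
    return token = other_sym;
}
```

After `set_source("if x = 1; fi")`, successive calls to `scan()` deliver `if_sym`, `ident`, `assign_sym`, `number_sym`, `semi_sym`, `fi_sym` and finally `eof_sym`. A parser then compares the global `token` against these values, one token of look-ahead at a time.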

33 Parser For G4 Statements s(), Part 1

    void s();                          // forward announcement

    void assign_statement()
    { // assign_statement
      must_be( ident );
      must_be( assign_sym );
      expression();
      must_be( semi_sym );
    } //end assign_statement

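34 Parser For G4 Statements s(), Part 2

    // note: G4 asks for a single statement after THEN_SYM and ELSE_SYM;
    // calling s() accepts a whole list of statements there
    void if_statement()
    { // if_statement
      must_be( if_sym );
      expression();
      must_be( then_sym );
      s();
      if ( else_sym == token ) {
        scan();
        s();
      } //end if
      must_be( fi_sym );
      must_be( semi_sym );
    } //end if_statement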

35 Parser For G4 Statements s(), Part 3

    void statement()
    { // statement
      if ( if_sym == token ) {
        if_statement();
      }else{
        assign_statement();
      } //end if
    } //end statement

    void s()
    { // s
      statement();
      // use first-set: more statements?
      if ( ( if_sym == token ) || ( ident == token ) ) {
        s();
      } //end if
    } //end s

36 Parser For G4 Statements s(), Part 4

    int main()
    { // main
      // ...                           // initializations
      scan();                          // find first token
      s();                             // list of statements
      ASSERT( EOF );                   // no junk after program
    } //end main

37 References

- Algol-60 Report: http://www.masswerk.at/algol60/report.htm
- John Backus: http://www-03.ibm.com/ibm/history/exhibits/builders/builders_backus.html
- BNF: http://cui.unige.ch/db-research/Enseignement/analyseinfo/AboutBNF.html
- ISO EBNF: http://www.cl.cam.ac.uk/~mgk25/iso-ebnf.html
- Left-Recursion elimination: Herbert G Mayer, "Programming Languages", © 1988 MacMillan Publishing Co., ISBN: 0-02-378295-1
- Church Thesis: http://plato.stanford.edu/entries/church-turing/
- YACC: http://dinosaur.compilertools.net/yacc/
- http://en.wikipedia.org/wiki/Compiler_Description_Language
