Compiler Design 4. Language Grammars

Kanat Bolazar January 28, 2010

Introduction to Parsing: Language Grammars
Programming language grammars are usually written as some variation of context-free grammars (CFGs). The notation used is often BNF (Backus-Naur Form):
<block> -> { <statementlist> }
<statementlist> -> <statement> ; <statementlist>
<statement> -> <assignment> ; | if ( <expr> ) <block> else <block> | while ( <expr> ) <block>
...

Example Grammar: Language 0+0
A language that we'll call "Language 0+0":
E -> E + E | 0
Equivalently:
E -> E + E
E -> 0
Note that if there are multiple rules for the same left-hand side, they are alternatives.
This language only contains sentences of the form: 0, 0+0, 0+0+0, ...
Derivation for 0+0+0:
E -> E + E -> E + E + E -> ... -> 0 + 0 + 0
Note: This grammar is ambiguous: in the second step, did we expand the first or the second E to E + E? Both paths work.
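The ambiguity can be made concrete by counting parse trees. The following is a minimal sketch, not part of the original slides; the function name count_parse_trees is an illustrative assumption. It tries every "+" as the top-level operator of E -> E + E and multiplies the counts for the two sides.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count distinct parse trees for tokens under E -> E + E | 0."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "0" else 0  # rule E -> 0
        total = 0
        # Rule E -> E + E: try each "+" as the top-level operator.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] == "+":
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees("0+0"))    # 1
print(count_parse_trees("0+0+0"))  # 2
```

A count above 1 means the sentence has more than one parse tree, which is exactly what makes the grammar ambiguous.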

Example Grammar: Arithmetic, Ambiguous
Arithmetic expressions:
Exp -> num | Exp Op Exp
Op -> + | - | * | / | %
The "num" here represents a token. What it corresponds to is defined in the lexical analyzer with a regular expression:
num [0-9]+
This language allows sentences such as 2 + 5 * 7.
The grammar as defined here is ambiguous: is 2 + 5 * 7 grouped as (2 + 5) * 7, with Exp * 7 at the top, or as 2 + (5 * 7), with 2 + Exp at the top?
Depending on the tools you use, you may be able to just define precedence of operators, or may have to change the grammar.
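To illustrate the lexical side, here is a small sketch of a tokenizer that recognizes num with the regular expression [0-9]+ and the five operators. The names TOKEN_RE and tokenize are assumptions made for this example, not taken from the slides. The last lines show why the ambiguity matters: the two groupings give different values.

```python
import re

# num is [0-9]+ as in the slide; the operator class lists + - * / %.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>[0-9]+)|(?P<op>[-+*/%]))")


def tokenize(text):
    """Split text into ("num", value) and ("op", symbol) tokens."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if match.group("num") is not None:
            tokens.append(("num", int(match.group("num"))))
        else:
            tokens.append(("op", match.group("op")))
        pos = match.end()
    return tokens


print(tokenize("2 + 5 * 7"))

# The two parse trees of the ambiguous grammar imply different results.
group_left = (2 + 5) * 7    # top-level rule applied at "*"
group_right = 2 + (5 * 7)   # top-level rule applied at "+"
print(group_left, group_right)  # 49 37
```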

Example Language: Arithmetic, Factored
Arithmetic expressions grammar, factored for operator precedence:
Exp -> Factor | Factor Addop Exp
Factor -> num | num Multop Factor
Addop -> + | -
Multop -> * | / | %
This language also allows the same sentences.
This grammar is not ambiguous; it first groups factors. Derivation of 2 + 5 * 7:
Exp
Factor Addop Exp
num + Exp
num + Factor
num + num Multop Factor
num + num * num
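Because each rule of the factored grammar starts with a distinct, predictable symbol, it can be parsed with one function per nonterminal. The sketch below is one possible recursive-descent parser, written for illustration only; the function names and the tuple-based tree shape are assumptions. It builds a parse tree rather than a value, so it shows the grouping without committing to any evaluation order.

```python
def parse_exp(tokens, pos=0):
    """Exp -> Factor | Factor Addop Exp"""
    left, pos = parse_factor(tokens, pos)
    if pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_exp(tokens, pos + 1)
        return ("Exp", left, op, right), pos
    return ("Exp", left), pos


def parse_factor(tokens, pos):
    """Factor -> num | num Multop Factor"""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected num at token {pos}")
    num = tokens[pos]
    pos += 1
    if pos < len(tokens) and tokens[pos] in ("*", "/", "%"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        return ("Factor", num, op, right), pos
    return ("Factor", num), pos


def parse(tokens):
    """Parse a full token list; reject trailing input."""
    tree, pos = parse_exp(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at {pos}")
    return tree


print(parse(["2", "+", "5", "*", "7"]))
```

In the resulting tree, 5 * 7 sits inside a single Factor node below the addition, which matches the derivation above.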

Grammar Definitions
The grammar is a set of rules, sometimes called productions, that construct valid sentences in the language.
Nonterminal symbols represent constructs in the language. These would be the phrases in a natural language.
Terminal symbols are the actual words of the language. These are the tokens produced by the lexical analyzer. In a natural language, these would be the words, symbols, and space.
A sentence in the language only contains terminal symbols. Nonterminals are intermediate linguistic constructs to define the structure of a sentence.

Rules, Nonterminal and Terminal Symbols
Arithmetic expressions grammar, using multiplicative factors for operator precedence:
Exp -> Factor | Factor Addop Exp
Factor -> num | num Multop Factor
Addop -> + | -
Multop -> * | / | %
This grammar has four rules as written here. If we expand each option, we would have 2 + 2 + 2 + 3 = 9 rules.
There are four nonterminals: Exp, Factor, Addop, Multop
There are six terminals (tokens): num + - * / %
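These counts can be checked mechanically. The sketch below stores the grammar as a Python dictionary (a representation chosen for this example, not prescribed by the slides) that maps each nonterminal to its list of alternatives, and treats any symbol that never appears on a left-hand side as a terminal.

```python
# Each nonterminal maps to a list of alternatives; each alternative is a
# list of symbols. The dictionary layout is an illustrative assumption.
GRAMMAR = {
    "Exp": [["Factor"], ["Factor", "Addop", "Exp"]],
    "Factor": [["num"], ["num", "Multop", "Factor"]],
    "Addop": [["+"], ["-"]],
    "Multop": [["*"], ["/"], ["%"]],
}

nonterminals = set(GRAMMAR)

# One expanded rule per alternative.
rule_count = sum(len(alternatives) for alternatives in GRAMMAR.values())

# A terminal is any right-hand-side symbol that has no rules of its own.
terminals = {
    symbol
    for alternatives in GRAMMAR.values()
    for rhs in alternatives
    for symbol in rhs
    if symbol not in nonterminals
}

print(rule_count)           # 9
print(len(nonterminals))    # 4
print(len(terminals))       # 6
```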

Grammar Definitions: Rules
The production rules are rewrite rules. The basic CFG rule form is:
X -> Y1 Y2 Y3 … Yn
where X is a nonterminal and the Y’s may be nonterminals or terminals.
There is a special nonterminal called the start symbol. The language is defined to be all the strings that can be generated by starting with the start symbol and repeatedly replacing a nonterminal with the right-hand side (rhs) of one of its rules until there are no more nonterminals.
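This definition can be turned directly into a small generator. The sketch below is an illustration under stated assumptions: the function name generate is hypothetical, the grammar has no empty right-hand sides (so pruning by length is safe), and only the leftmost nonterminal is rewritten at each step. It explores sentential forms breadth-first and collects those with no nonterminals left.

```python
from collections import deque


def generate(grammar, start, max_len):
    """Return all sentences of at most max_len symbols derivable from start."""
    sentences = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal, if there is one.
        index = next((i for i, s in enumerate(form) if s in grammar), None)
        if index is None:
            sentences.add(" ".join(form))  # only terminals remain
            continue
        # Replace that nonterminal with each of its right-hand sides.
        for rhs in grammar[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sentences


LANGUAGE_0_PLUS_0 = {"E": [["E", "+", "E"], ["0"]]}
print(sorted(generate(LANGUAGE_0_PLUS_0, "E", 5), key=len))
# ['0', '0 + 0', '0 + 0 + 0']
```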

Larger Grammar Examples
We'll look at language grammar examples for MicroJava and Decaf. Note: Decaf extends the standard notation; the very useful { X }, to mean X | X, X | X, X, X | ... is not standard.

Parse Trees
Derivation of a sentence by the language rules can be used to construct a parse tree.
We expect parse trees to correspond to meaningful semantic phrases of the programming language. Each node of the parse tree will represent some portion that can be implemented as one section of code.
The nonterminals expanded during the derivation are the trunk and branches of the parse tree. The terminals at the end of branches are the leaves of the parse tree.
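As a rough illustration of this structure, a parse tree can be modeled as nodes that hold a grammar symbol and a list of children. The Node class and leaves function below are hypothetical names introduced for this sketch. Reading the leaves from left to right gives back the sentence that was derived.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                       # nonterminal or terminal
    children: list = field(default_factory=list)


def leaves(node):
    """Return the terminals at the leaves, in left-to-right order."""
    if not node.children:
        return [node.symbol]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


# Parse tree for 0+0 under E -> E + E | 0.
tree = Node("E", [
    Node("E", [Node("0")]),
    Node("+"),
    Node("E", [Node("0")]),
])

print(leaves(tree))  # ['0', '+', '0']
```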

Parsing
A parser uses the grammar to check whether a sentence (a program, for us) is in the language or not. It gives a syntax error if this is not a proper sentence/program. It constructs a parse tree from the derivation of the correct program from the grammar rules.
Top-down parsing: Starts with the start symbol and applies rules until it gets the desired input program.
Bottom-up parsing: Starts with the input program and applies rules in reverse until it can get back to the start symbol. It looks at the left part of the input program to see if it matches the rhs of a rule.
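The bottom-up idea can be sketched for the tiny grammar E -> E + E | 0 with a stack: shift one token at a time, and whenever the top of the stack matches the right-hand side of a rule, replace it with the left-hand side. This is only an illustration; the function name is an assumption, and reducing greedily happens to work for this grammar but is not how general bottom-up parsers decide when to reduce.

```python
def shift_reduce_accepts(tokens):
    """Recognize sentences of E -> E + E | 0 by applying rules in reverse."""
    stack = []
    for token in tokens:
        stack.append(token)  # shift
        # Reduce while the top of the stack matches a right-hand side.
        while True:
            if stack[-1:] == ["0"]:
                stack[-1:] = ["E"]            # E -> 0 in reverse
            elif stack[-3:] == ["E", "+", "E"]:
                stack[-3:] = ["E"]            # E -> E + E in reverse
            else:
                break
    # Accept only if everything reduced to the start symbol.
    return stack == ["E"]


print(shift_reduce_accepts("0+0+0"))  # True
print(shift_reduce_accepts("0+"))     # False
```

A top-down parser for the factored arithmetic grammar would instead start from Exp and pick rules to expand, as a recursive-descent parser does.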

Parsing Issues
Derivation paths = choices: Naïve top-down and bottom-up parsing may require backtracking to find a correct parse. Restrictions on the form of grammar rules are used to make parsing deterministic.
Ambiguity: One program may have two different correct derivations from the grammar. This may be a problem if it implies two different semantic interpretations. Famous examples are arithmetic operators and the dangling else problem.

Ambiguity: Dangling Else Problem
Which if does this else associate with?
if X if Y find() else getConfused()
The corresponding ambiguous grammar may be:
IfSttmt -> if Cond Action | if Cond Action else Action
The two derivations at the top (associated with the top "if") are:
if Cond Action
if Cond Action else Action
Programming languages often associate the else with the inner if.
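The inner-if convention falls out naturally from a recursive-descent parser that takes an else as soon as it sees one. The sketch below assumes, as the example sentence implies, that an Action may itself be an if statement; the function names and the tuple layout are illustrative, and error handling for malformed input is omitted.

```python
def parse_action(tokens, pos):
    """Action is either a nested if statement or a single token."""
    if tokens[pos] == "if":
        return parse_if(tokens, pos)
    return tokens[pos], pos + 1


def parse_if(tokens, pos):
    """IfSttmt -> if Cond Action | if Cond Action else Action"""
    cond = tokens[pos + 1]                    # tokens[pos] is "if"
    then_part, pos = parse_action(tokens, pos + 2)
    else_part = None
    # The innermost if still being parsed sees the else first and takes it.
    if pos < len(tokens) and tokens[pos] == "else":
        else_part, pos = parse_action(tokens, pos + 1)
    return ("if", cond, then_part, else_part), pos


tokens = ["if", "X", "if", "Y", "find()", "else", "getConfused()"]
tree, _ = parse_if(tokens, 0)
print(tree)
# ('if', 'X', ('if', 'Y', 'find()', 'getConfused()'), None)
```

The outer if ends up with no else part, so getConfused() runs only when X holds and Y does not.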

Resources
Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Addison-Wesley, 2006.
Compiler Construction Course Notes at Linz
CS 143 Compiler Course at Stanford
