CS510 Compiler Lecture 4.

Syntax Analysis

What is Syntax Analysis?
After lexical analysis (scanning), we have a series of tokens. In syntax analysis (or parsing), we want to interpret what those tokens mean. Goal: Recover the structure described by that series of tokens. Goal: Report errors if those tokens do not properly encode a structure.

Intro to Parsing
Input: sequence of tokens from lexer
Output: parse tree of the program

Main Topics
Context-Free Grammars
Derivations
Concrete and Abstract Syntax Trees
Ambiguity
Next Week: Parsing algorithms
Top-Down Parsing
Bottom-Up Parsing

The Limits of Regular Languages
When scanning, we used regular expressions to define each token. Unfortunately, regular expressions are (usually) too weak to define programming languages. No regular expression can match all expressions with properly balanced parentheses. No regular expression can match all functions with properly nested block structure.

We need a more powerful formalism. We need:
A language for describing valid strings of tokens
A method for distinguishing valid from invalid strings of tokens
We need Context-Free Grammars.

Context-Free Grammars
A context-free grammar (or CFG) is a formalism for defining languages. Can define the context-free languages, a strict superset of the regular languages. CFGs are best explained by example...

Definition of a Context-free Grammar:
Terminals: An alphabet or set of basic symbols (like regular expressions, only now the symbols are whole tokens, not chars).
Non-terminals: A set of names for structures (like statement, expression, definition).
Productions: A set of grammar rules expressing the structure of each name.
Start symbol: The name of the most general structure (compilation-unit in C).

Context-Free Grammars (Continued)
Context-Free Grammars are designed to represent recursive structures. As a consequence: The structure of a matched string is no longer given by just a sequence of symbols (lexeme), but by a tree (parse tree). The recognition process is much more complex: algorithms can use stacks in many different ways. The notation used in writing grammar rules was developed by John Backus and modified by Peter Naur. Thus, grammar rules are usually said to be in Backus-Naur Form, or BNF.

A CFG consists of
A set of terminals
A set of non-terminals
A start symbol
A set of productions

Suppose we want to describe all legal arithmetic expressions. Formally, a context-free grammar is a collection of four objects:
A set of nonterminal symbols (or variables),
A set of terminal symbols,
A set of production rules saying how each nonterminal can be replaced by a string of terminals and nonterminals, and
A start symbol that begins the derivation.
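The four objects can be written down directly as data. The sketch below is only an illustration: the class name Grammar, the method check, and the particular expression grammar (nonterminals E and Op over a single int token) are assumptions made for this example, not definitions from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset
    terminals: frozenset
    productions: tuple  # pairs (head, body); body is a tuple of symbols
    start: str

    def check(self):
        """Return True if the four objects form a well-formed CFG."""
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            and not (self.nonterminals & self.terminals)
            and all(
                head in self.nonterminals and all(s in symbols for s in body)
                for head, body in self.productions
            )
        )


# Hypothetical grammar for arithmetic expressions over the token "int".
EXPR = Grammar(
    nonterminals=frozenset({"E", "Op"}),
    terminals=frozenset({"int", "+", "-", "*", "/", "(", ")"}),
    productions=(
        ("E", ("int",)),
        ("E", ("E", "Op", "E")),
        ("E", ("(", "E", ")")),
        ("Op", ("+",)),
        ("Op", ("-",)),
        ("Op", ("*",)),
        ("Op", ("/",)),
    ),
    start="E",
)
```

Each production has exactly one nonterminal as its head, which is what makes the grammar context-free: a nonterminal can be replaced regardless of the symbols around it.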

Not Notational Shorthand: The syntax for regular expressions does not carry over to CFGs. A production cannot use * or parentheses as operators.

Some CFG Notation
Capital letters at the beginning of the alphabet will represent nonterminals, e.g. A, B, C, D.
Lowercase letters at the end of the alphabet will represent terminals, e.g. t, u, v, w.
Lowercase Greek letters will represent arbitrary strings of terminals and nonterminals, e.g. α, γ, ω.

Examples
We might write an arbitrary production as A → ω.
We might write a string of a nonterminal followed by a terminal as At.
We might write an arbitrary production containing a nonterminal followed by a terminal as B → αAtω.

Which of the strings are in the language of the given CFG?

Solution SaXa …….X->bY ……. Y-> cYc first not allowed
SaXa …….X->bY second not allowed SaXa …….X->bY ……. Y-> ε SaXa …….X->bY ……. Y-> cYc forth not allowed The solution is : c

The language defined by a grammar
Grammar rules determine the legal strings of token symbols by means of derivations. A derivation is a sequence of replacements of structure names by choices on the right-hand sides of grammar rules. It starts with a single structure name and ends with a string of token symbols. At each step in a derivation, a single replacement is made using one choice from a grammar rule. The set of all strings of token symbols obtained by derivations from the exp symbol is the language defined by the grammar of expressions. We can write this as: L(G) = { s : exp ⇒* s }
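The definition of L(G) can be explored by enumerating derivations mechanically. The following is a minimal sketch, assuming the expression grammar exp → exp op exp | ( exp ) | number and op → - | * (the productions that appear in this lecture's worked derivation); the function name language_up_to and the length bound are choices made for this example, since the full language is infinite.

```python
from collections import deque

# Assumed grammar: exp -> exp op exp | ( exp ) | number ; op -> - | *
GRAMMAR = {
    "exp": [["exp", "op", "exp"], ["(", "exp", ")"], ["number"]],
    "op": [["-"], ["*"]],
}


def language_up_to(grammar, start, max_len):
    """Return every string of at most max_len tokens derivable from start."""
    seen = {(start,)}
    queue = deque(seen)
    sentences = set()
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal of the sentential form, if any.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            sentences.add(" ".join(form))  # only tokens left: a sentence
            continue
        for body in grammar[form[idx]]:
            new = form[:idx] + tuple(body) + form[idx + 1:]
            # No production shrinks a form, so over-long forms can be pruned.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sentences
```

Expanding only the leftmost nonterminal loses no strings, because every string in the language has a leftmost derivation.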

Derivations
A derivation is a sequence of productions.
A derivation can be drawn as a tree:
The start symbol is the tree’s root.
For a production X → Y1…Yn, add children Y1…Yn to node X.

Derivations
E ⇒ E+E ⇒ E*E+E ⇒ id*E+E ⇒ id*id+E ⇒ id*id+id

Derivations A leftmost derivation is a derivation in which each step expands the leftmost nonterminal. A rightmost derivation is a derivation in which each step expands the rightmost nonterminal.
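The difference between the two kinds of derivation can be made concrete with a small helper that always rewrites either the leftmost or the rightmost nonterminal. This is a sketch under assumptions: the grammar E → E + E | E * E | id is assumed, and the names derive, PLUS, TIMES, and ID are invented for the example.

```python
NONTERMINALS = {"E"}


def derive(start, productions, leftmost=True):
    """Apply each (head, body) production to the leftmost or rightmost
    nonterminal of the current sentential form; return every form visited."""
    form = [start]
    steps = [" ".join(form)]
    for head, body in productions:
        positions = [i for i, s in enumerate(form) if s in NONTERMINALS]
        if not positions:
            raise ValueError("no nonterminal left to expand")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != head:
            raise ValueError(f"cannot apply a production for {head} at {i}")
        form[i:i + 1] = body  # replace the nonterminal by the production body
        steps.append(" ".join(form))
    return steps


PLUS = ("E", ["E", "+", "E"])
TIMES = ("E", ["E", "*", "E"])
ID = ("E", ["id"])

# Same productions, applied in two different orders.
LEFT = derive("E", [PLUS, TIMES, ID, ID, ID], leftmost=True)
RIGHT = derive("E", [PLUS, ID, TIMES, ID, ID], leftmost=False)
```

Both runs end in id * id + id and use the same five productions; only the order of application differs.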

Derivations
Express the matching of a string of token symbols, “(34-3)*42”, by a derivation:
(1) exp ⇒ exp op exp [exp → exp op exp]
(2) ⇒ exp op number [exp → number]
(3) ⇒ exp * number [op → *]
(4) ⇒ ( exp ) * number [exp → ( exp )]
(5) ⇒ ( exp op exp ) * number [exp → exp op exp]
(6) ⇒ ( exp op number ) * number [exp → number]
(7) ⇒ ( exp - number ) * number [op → -]
(8) ⇒ ( number - number ) * number [exp → number]
The above derivation is a rightmost derivation: the rightmost non-terminal is replaced at each step.

Derivations (Continued)
Another derivation, a leftmost derivation, for the same string of token symbols:
(1) exp ⇒ exp op exp [exp → exp op exp]
(2) ⇒ ( exp ) op exp [exp → ( exp )]
(3) ⇒ ( exp op exp ) op exp [exp → exp op exp]
(4) ⇒ ( number op exp ) op exp [exp → number]
(5) ⇒ ( number - exp ) op exp [op → -]
(6) ⇒ ( number - number ) op exp [exp → number]
(7) ⇒ ( number - number ) * exp [op → *]
(8) ⇒ ( number - number ) * number [exp → number]
The representation of the structure of a string of tokens that abstracts the essential features of a derivation while factoring out differences in ordering is called a parse tree.

Derivations: Left and Right

Derivations
A derivation encodes two pieces of information:
What productions were applied to produce the resulting string from the start symbol?
In what order were they applied?
Multiple derivations might use the same productions, but apply them in a different order.

Parse Tree
A parse tree is a tree encoding the steps in a derivation.
Internal nodes represent the nonterminal symbols used in the productions.
An in-order walk of the leaves yields the generated string.
The tree encodes what productions are used, not the order in which those productions are applied.
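As a small illustration of the leaf walk, the sketch below builds the parse tree of ( number - number ) * number by hand and reads its leaves from left to right. The tuple representation and the helper names node, leaf, and frontier are assumptions chosen for this example.

```python
def node(symbol, *children):
    """Internal node: a nonterminal with its children, in order."""
    return (symbol, list(children))


def leaf(token):
    """Leaf node: a terminal (token) with no children."""
    return (token, [])


def number():
    return node("exp", leaf("number"))


# Parse tree for ( number - number ) * number
TREE = node(
    "exp",
    node(
        "exp",
        leaf("("),
        node("exp", number(), node("op", leaf("-")), number()),
        leaf(")"),
    ),
    node("op", leaf("*")),
    number(),
)


def frontier(tree):
    """Collect the leaves from left to right: the generated token string."""
    symbol, children = tree
    if not children:
        return [symbol]
    return [tok for child in children for tok in frontier(child)]
```

The tree stores which production was used at each internal node, but nothing in it records the order in which the nodes were expanded.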

Parse Trees
Abstract the structure of a derivation to a parse tree.
The following is the parse tree for (34-3)*42, where the nodes are numbered according to the leftmost and rightmost derivations.
[Figure: two copies of the parse tree for (34-3)*42, with nodes numbered in derivation order]
June 6, 2018 Prof. Abdelaziz Khamis

Parse Trees (Continued)
A leftmost derivation corresponds to a (top-down) pre-order traversal of the parse tree. A rightmost derivation corresponds to a (bottom-up) post-order traversal, but in reverse. Top-down parsers construct leftmost derivations. (LL = Left-to-right traversal of input, constructing a Leftmost derivation) Bottom-up parsers construct rightmost derivations in reverse order. (LR = Left-to-right traversal of input, constructing a Rightmost derivation)
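The correspondence between traversals and derivations can be checked on the parse tree of ( number - number ) * number. This is a sketch with assumed names (node, productions, TREE); it lists the production used at each internal node in pre-order and in reversed post-order.

```python
def node(symbol, *children):
    """A parse-tree node; a node without children is a leaf (token)."""
    return (symbol, list(children))


def number():
    return node("exp", node("number"))


# Parse tree for ( number - number ) * number
TREE = node(
    "exp",
    node(
        "exp",
        node("("),
        node("exp", number(), node("op", node("-")), number()),
        node(")"),
    ),
    node("op", node("*")),
    number(),
)


def productions(tree, order):
    """List the production at each internal node, in 'pre' or 'post' order."""
    symbol, children = tree
    if not children:
        return []  # leaves are tokens and carry no production
    here = [f"{symbol} -> {' '.join(child[0] for child in children)}"]
    below = [p for child in children for p in productions(child, order)]
    return here + below if order == "pre" else below + here


LEFTMOST = productions(TREE, "pre")
RIGHTMOST = list(reversed(productions(TREE, "post")))
```

LEFTMOST lists the productions in the order a leftmost derivation of this string applies them, and RIGHTMOST in the order a rightmost derivation applies them.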

Abstract Syntax Trees
Parse trees contain much more information than is necessary for a compiler to produce executable code. Abstract syntax trees express the essential structure of the parse trees. The expression can be represented more simply by an abstract syntax tree, or syntax tree for short.
[Figure: abstract syntax tree for the expression]
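For the expression (34-3)*42 used earlier, an abstract syntax tree keeps only the operators and the numbers; parentheses and the intermediate exp and op nodes disappear. The classes Num and BinOp and the function evaluate below are hypothetical names for this sketch, not part of the lecture.

```python
import operator
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: object
    right: object


# Only the two operators that occur in the example are supported here.
OPS = {"-": operator.sub, "*": operator.mul}

# Abstract syntax tree for (34-3)*42: grouping is implied by the tree shape.
AST = BinOp("*", BinOp("-", Num(34), Num(3)), Num(42))


def evaluate(tree):
    """Evaluate an abstract syntax tree bottom-up."""
    if isinstance(tree, Num):
        return tree.value
    return OPS[tree.op](evaluate(tree.left), evaluate(tree.right))
```

Here evaluate(AST) computes (34 - 3) * 42 without ever seeing a parenthesis token.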

Ambiguity A CFG is said to be ambiguous if there is at least one string with two or more parse trees. Note that ambiguity is a property of grammars, not languages. There is no algorithm for converting an arbitrary ambiguous grammar into an unambiguous one. Some languages are inherently ambiguous, meaning that no unambiguous grammar exists for them. There is no algorithm for detecting whether an arbitrary grammar is ambiguous.

CFG, Derivation, Parse Tree
Another example E  E * E | E + E | ( E ) | id Build a parse tree for: id * id + id * id Can have different ways Ambiguity. If, for some input string that can be derived from the grammar, there exists more than one parse tree to parse it, then the grammar is ambiguous E + * id E * id + E * id +

Ambiguity and Derivations
Example grammar: E → E * E | E + E | ( E ) | id
Derive: id * id + id * id
First parse tree (root E * E):
Leftmost: E ⇒ E * E ⇒ id * E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + E * E ⇒ id * id + id * E ⇒ id * id + id * id
Rightmost: E ⇒ E * E ⇒ E * E + E ⇒ E * E + E * E ⇒ E * E + E * id ⇒ E * E + id * id ⇒ E * id + id * id ⇒ id * id + id * id
Second parse tree (root E + E):
Leftmost: E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + E * E ⇒ id * id + id * E ⇒ id * id + id * id
Rightmost: E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ E * E + id * id ⇒ E * id + id * id ⇒ id * id + id * id
Multiple derivations do not imply ambiguity; only multiple parse trees do. If the grammar is ambiguous, then there exist multiple parse trees for some string, and for each parse tree there is a unique leftmost derivation and a unique rightmost derivation.
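Since ambiguity is about the number of parse trees, it can help to count them. The sketch below is specific to the grammar E → E * E | E + E | ( E ) | id; the function name count_parse_trees is an assumption. It counts the parse trees of a token list by trying every operator as the root of the tree.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E * E | E + E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving tokens[i:j] from E."""
        total = 0
        if j - i == 1 and tokens[i] == "id":
            total += 1  # E -> id
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)  # E -> ( E )
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # E -> E op E with the operator at position k as the root
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

A count above 1 for any string shows that the grammar is ambiguous. A count of 1 for a handful of strings proves nothing, which is consistent with the fact that no algorithm detects ambiguity for arbitrary grammars.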

Ambiguity
Ambiguity implies multiple parse trees.
It can make parsing more difficult.
It can impact the semantics of the language: different parse trees can have different semantic meanings and yield different execution results.
Sometimes rewriting the grammar can eliminate ambiguity, but the rewrite may add additional semantics to the language.
Eliminate ambiguity:
E → E + T | E * T | (E) | T
T → id
Derive id * id + id * id using a leftmost derivation:
E ⇒ E * T ⇒ E + T * T ⇒ E * T + T * T ⇒ T * T + T * T ⇒ … ⇒ id * id + id * id
[Figure: parse tree for this derivation]
Is this what we want?

Ambiguity
Rewrite the grammar to eliminate ambiguity.
There are many ways to rewrite the grammar.
The new grammar should accept the same language.
For each input string, the ambiguous grammar may allow multiple parse trees, each with a different semantic meaning. Which one do we want?
The rewrite should be based on the desired semantics.
There is no general algorithm to rewrite ambiguous grammars.

Rewrite Ambiguous Grammar
Try to use a single recursive nonterminal in each rule When the left symbol appears more than once on the right side Use additional symbols to substitute them and allow only one Force to only allow one expansion Example grammar E  E + E | E – E | E * E | E / E | (E) | id It is ambiguous Change to E  T + E | T – E | T * E | T / E | (E) | T T  id Parse: id * id – id E  T * E  T * T – E  T * T – T  …  id * id – id E T * E id T – E id T id

Rewrite Ambiguous Grammar
Build the desired precedence into the grammar.
Example: E → E + E | E * E | (E) | id is ambiguous.
Desired precedence: * executes before +
Change to:
E → E + T | T
T → T * F | F
F → (E) | id
Parse: id + id * id
[Figure: parse tree for id + id * id]
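The effect of the layered grammar E → E + T | T, T → T * F | F, F → (E) | id can be seen in a small hand-written parser. This is only a sketch: it is a recursive-descent parser in which the left-recursive rules are implemented as loops (a standard substitution, since recursive descent cannot handle left recursion directly), and the function names are assumptions.

```python
def parse(tokens):
    """Parse a token list and return a nested-tuple syntax tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {peek()!r}")
        pos += 1

    def parse_e():  # E -> E + T | T, implemented as T (+ T)*
        tree = parse_t()
        while peek() == "+":
            eat("+")
            tree = ("+", tree, parse_t())  # left-associative
        return tree

    def parse_t():  # T -> T * F | F, implemented as F (* F)*
        tree = parse_f()
        while peek() == "*":
            eat("*")
            tree = ("*", tree, parse_f())
        return tree

    def parse_f():  # F -> ( E ) | id
        if peek() == "(":
            eat("(")
            tree = parse_e()
            eat(")")
            return tree
        eat("id")
        return "id"

    tree = parse_e()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

Because * can only appear inside a T, and a T is always an operand of +, every multiplication ends up lower in the tree than the additions around it, so it is evaluated first.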

Ambiguity – Another Example
if statement:
stmt → if-stmt | while-stmt | …
if-stmt → if expr then stmt else stmt | if expr then stmt
Parse: if (a) then if (b) then x = c else x = d
[Figure: two parse trees. In one, the else x = d belongs to the outer if (a); in the other, it belongs to the inner if (b).]

if statement:
stmt → if-stmt | while-stmt | …
if-stmt → if expr then stmt else stmt | if expr then stmt
Recursion in the rule:
Not immediate recursion: if-stmt expands to another nonterminal, then to itself.
Indirect recursion in two places.
Desired semantics: match the else with the closest if.
How to rewrite the if-stmt grammar to eliminate ambiguity?
By defining different if statements: unmatched and matched.
Matched: with both then and else parts.

Solution
if-stmt → unmatched-stmt | matched-stmt
(1) matched-stmt → if expr then matched-stmt else matched-stmt
(2) unmatched-stmt → if expr then both
(3) unmatched-stmt → if expr then matched-stmt else unmatched-stmt
For (1) and (3), only matched-stmt can appear between then and else.
For (1), matched-stmt cannot leave any dangling if; otherwise, the else of the enclosing then-else would belong to the dangling if. So both the then and else parts should be matched.
The else part of (3) could also be matched, but that would introduce ambiguity with (1).
After all matches, use (2) for each dangling if. After then, (2) can have either matched or unmatched ("both").
Consider all statements:
matched-stmt → {stmt} – {unmatched-stmt}
unmatched-stmt → {stmt} – {matched-stmt}
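The desired semantics, match each else with the closest if, can also be sketched as a parser. The code below does not implement the matched/unmatched grammar itself; it is a greedy recursive-descent parser, which is a common way to obtain the same pairing. It assumes that each expr and each simple statement is a single token, and the name parse_stmt is hypothetical.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement at tokens[pos]; return (tree, next_pos).

    An if statement becomes ("if", cond, then_part, else_part), where
    else_part is None for an if without else.
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # simple statement: a single token
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected: if expr then stmt")
    cond = tokens[pos + 1]
    then_part, pos = parse_stmt(tokens, pos + 3)
    else_part = None
    # Greedy choice: the innermost open if sees the else first and takes it.
    if pos < len(tokens) and tokens[pos] == "else":
        else_part, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_part, else_part), pos
```

For the token list if a then if b then x=c else x=d, the else x=d is attached to the inner if b, and the outer if a is left without an else part.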
