CS510 Compiler Lecture 4.

Syntax Analysis

What is Syntax Analysis? After lexical analysis (scanning), we have a sequence of tokens. In syntax analysis (or parsing), we want to interpret what those tokens mean. Goal: Recover the structure described by that sequence of tokens. Goal: Report errors if the tokens do not properly encode a structure.

Intro to Parsing Input: sequence of tokens from lexer Output: parse tree of the program

Main Topics: Context-Free Grammars, Derivations, Concrete and Abstract Syntax Trees, Ambiguity. Next week: parsing algorithms (top-down parsing and bottom-up parsing).

The Limits of Regular Languages When scanning, we used regular expressions to define each token. Unfortunately, regular expressions are (usually) too weak to define programming languages. Cannot define a regular expression matching all expressions with properly balanced parentheses. Cannot define a regular expression matching all functions with properly nested block structure.
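To make the limitation concrete, here is a small sketch (names are illustrative): a regular expression can only approximate nesting up to a fixed depth, while a counter, the simplest form of a stack, recognizes balanced parentheses at any depth.

```python
import re

# This regex handles at most two levels of nesting; deeper inputs fail
# to match even though they are perfectly balanced.
two_levels = re.compile(r'^\((?:[^()]|\([^()]*\))*\)$')

def balanced(s):
    """A depth counter recognizes arbitrarily deep nesting."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # closing paren with no matching open
                return False
    return depth == 0

print(bool(two_levels.match('((()))')))  # False: regex gives up past depth 2
print(balanced('((()))'))                # True
```

No finite regular expression can track unbounded nesting depth; that is exactly the extra power context-free grammars provide.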

We need a more powerful formalism: a language for describing valid strings of tokens, and a method for distinguishing valid from invalid strings of tokens. What we need is... Context-Free Grammars.

Context-Free Grammars A context-free grammar (or CFG) is a formalism for defining languages. Can define the context-free languages, a strict superset of the regular languages. CFGs are best explained by example...

Context-Free Grammars. Definition of a context-free grammar: An alphabet, or set of basic symbols (as with regular expressions, except that the symbols are now whole tokens, not characters). (Terminals) A set of names for structures (like statement, expression, definition). (Non-terminals) A set of grammar rules expressing the structure of each name. (Productions) A start symbol (the name of the most general structure, e.g. compilation-unit in C).

Context-Free Grammars (Continued). Context-free grammars are designed to represent recursive structures. As a consequence: The structure of a matched string is no longer given by just a sequence of symbols (a lexeme), but by a tree (a parse tree). The recognition process is much more complex: algorithms can use stacks in many different ways. The notation used in writing grammar rules was developed by John Backus and modified by Peter Naur. Thus, grammar rules are usually said to be in Backus-Naur Form, or BNF.

Context-Free Grammars A CFG consists of A set of terminals A set of non-terminals A start symbol A set of productions

Context-Free Grammars. Suppose we want to describe all legal arithmetic expressions. Formally, a context-free grammar is a collection of four objects: A set of nonterminal symbols (or variables), A set of terminal symbols, A set of production rules saying how each nonterminal can be rewritten as a string of terminals and nonterminals, and A start symbol that begins the derivation.
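The four components can be written down directly as a data structure. This is a sketch; the class name `Grammar` and field names are illustrative, not from the lecture.

```python
from dataclasses import dataclass

@dataclass
class Grammar:
    terminals: set       # basic symbols (whole tokens)
    nonterminals: set    # names for structures
    productions: dict    # nonterminal -> list of right-hand sides (tuples)
    start: str           # the start symbol

# The arithmetic-expression grammar used later in the lecture:
#   E -> E + E | E * E | ( E ) | id
G = Grammar(
    terminals={'+', '*', '(', ')', 'id'},
    nonterminals={'E'},
    productions={'E': [('E', '+', 'E'), ('E', '*', 'E'),
                       ('(', 'E', ')'), ('id',)]},
    start='E',
)
print(len(G.productions['E']))  # 4 alternatives for E
```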

Context-Free Grammars Are Not Notational Shorthand: the syntax for regular expressions does not carry over to CFGs. You cannot use * or parentheses with their regular-expression meanings.

Some CFG Notation. Capital letters at the beginning of the alphabet represent nonterminals, e.g. A, B, C, D. Lowercase letters at the end of the alphabet represent terminals, e.g. t, u, v, w. Lowercase Greek letters represent arbitrary strings of terminals and nonterminals, e.g. α, γ, ω.

Examples We might write an arbitrary production as A → ω We might write a string of a nonterminal followed by a terminal as At We might write an arbitrary production containing a nonterminal followed by a terminal as B → αAtω

Context-Free Grammars Which of the strings are in the language of the given CFG?

Solution SaXa …….X->bY ……. Y-> cYc first not allowed SaXa …….X->bY second not allowed SaXa …….X->bY ……. Y-> ε SaXa …….X->bY ……. Y-> cYc forth not allowed The solution is : c

The language defined by a grammar. Grammar rules determine the legal strings of token symbols by means of derivations. A derivation is a sequence of replacements of structure names by choices on the right-hand sides of grammar rules. It starts with a single structure name and ends with a string of token symbols. At each step in a derivation, a single replacement is made using one choice from a grammar rule. The set of all strings of token symbols obtainable by derivations from the exp symbol is the language defined by the grammar of expressions. We can write this as: L(G) = { s : exp ⇒* s }
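The definition L(G) = { s : exp ⇒* s } can be made executable: enumerate derivations breadth-first from the start symbol and collect the all-terminal strings. This is a hedged sketch (the grammar, S → ( S ) | S S | ε for balanced parentheses, and all names are illustrative), with a step cap because the derivation space is infinite.

```python
from collections import deque

PRODS = {'S': [('(', 'S', ')'), ('S', 'S'), ()]}  # epsilon = empty tuple

def language(max_len, max_steps=2000):
    """All strings of at most max_len terminals derivable from S."""
    seen, out = set(), set()
    queue = deque([('S',)])
    steps = 0
    while queue and steps < max_steps:
        steps += 1
        form = queue.popleft()
        # find the leftmost nonterminal in the sentential form
        i = next((k for k, sym in enumerate(form) if sym in PRODS), None)
        if i is None:                       # all terminals: a string in L(G)
            out.add(''.join(form))
            continue
        for rhs in PRODS[form[i]]:          # one replacement per step
            new = form[:i] + rhs + form[i + 1:]
            n_terminals = sum(1 for s in new if s not in PRODS)
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(out, key=lambda s: (len(s), s))

print(language(4))  # ['', '()', '(())', '()()']
```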

Derivations

Derivations. A derivation is a sequence of productions. A derivation can be drawn as a tree: the start symbol is the tree's root, and for a production X → Y1…Yn we add children Y1…Yn to node X.

Derivations: E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + id

Derivations A leftmost derivation is a derivation in which each step expands the leftmost nonterminal. A rightmost derivation is a derivation in which each step expands the rightmost nonterminal.
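The two disciplines can be mechanized: at each step, rewrite either the leftmost or the rightmost occurrence of the nonterminal. A minimal sketch (function and variable names are illustrative), using the production E → E + E followed by E → id twice:

```python
def derive(steps, pick):
    """steps: productions (lhs, rhs) applied in order.
    pick: 'left' or 'right', choosing which occurrence of lhs to rewrite."""
    form = ['E']
    history = [' '.join(form)]
    for lhs, rhs in steps:
        idxs = [i for i, s in enumerate(form) if s == lhs]
        i = idxs[0] if pick == 'left' else idxs[-1]
        form[i:i + 1] = rhs                 # replace one occurrence
        history.append(' '.join(form))
    return history

steps = [('E', ['E', '+', 'E']), ('E', ['id']), ('E', ['id'])]
print(derive(steps, 'left'))   # ['E', 'E + E', 'id + E', 'id + id']
print(derive(steps, 'right'))  # ['E', 'E + E', 'E + id', 'id + id']
```

Both derivations use the same productions and reach the same string; only the order of replacement differs.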

Derivations. Express the matching of a string of token symbols, “(34-3)*42”, by a derivation:
(1) exp ⇒ exp op exp [exp → exp op exp]
(2) ⇒ exp op number [exp → number]
(3) ⇒ exp * number [op → *]
(4) ⇒ ( exp ) * number [exp → ( exp )]
(5) ⇒ ( exp op exp ) * number [exp → exp op exp]
(6) ⇒ ( exp op number ) * number [exp → number]
(7) ⇒ ( exp - number ) * number [op → -]
(8) ⇒ ( number - number ) * number [exp → number]
The above derivation is a rightmost derivation: the rightmost non-terminal is replaced at each step.

Derivations (Continued). Another derivation, a leftmost derivation, for the same string of token symbols:
(1) exp ⇒ exp op exp [exp → exp op exp]
(2) ⇒ ( exp ) op exp [exp → ( exp )]
(3) ⇒ ( exp op exp ) op exp [exp → exp op exp]
(4) ⇒ ( number op exp ) op exp [exp → number]
(5) ⇒ ( number - exp ) op exp [op → -]
(6) ⇒ ( number - number ) op exp [exp → number]
(7) ⇒ ( number - number ) * exp [op → *]
(8) ⇒ ( number - number ) * number [exp → number]
The representation of the structure of a string of tokens that abstracts the essential features of a derivation, while factoring out differences in ordering, is called a parse tree.

Derivations. A derivation encodes two pieces of information: Which productions were applied to produce the resulting string from the start symbol? In what order were they applied? Multiple derivations might use the same productions but apply them in a different order.

Parse Tree. A parse tree is a tree encoding the steps of a derivation. Internal nodes represent the nonterminal symbols used in productions. An in-order walk of the leaves yields the generated string. A parse tree encodes which productions are used, but not the order in which they are applied.
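A minimal parse-tree sketch (class and function names are illustrative): interior nodes hold nonterminals, leaves hold terminals, and walking the leaves in order recovers the derived string.

```python
class Node:
    def __init__(self, symbol, children=None):
        self.symbol = symbol
        self.children = children or []   # empty list => leaf (terminal)

def leaves(node):
    """In-order walk of the leaves: yields the generated string."""
    if not node.children:
        return [node.symbol]
    return [tok for child in node.children for tok in leaves(child)]

# Parse tree for id + id, using E -> E + E and E -> id:
tree = Node('E', [Node('E', [Node('id')]),
                  Node('+'),
                  Node('E', [Node('id')])])
print(' '.join(leaves(tree)))  # id + id
```

Note that the tree records which productions were used but says nothing about the order in which they were applied, exactly as the slide states.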

Parse Trees. Abstract the structure of a derivation into a parse tree. For (34-3)*42, the root exp has three children: exp, op (the *), and exp (number 42); the left exp expands to ( exp ), and the parenthesized exp expands to exp (number 34), op (the -), and exp (number 3). On the slide, the nodes were numbered in the orders visited by the leftmost and rightmost derivations. June 6, 2018 Prof. Abdelaziz Khamis

Parse Trees (Continued) A leftmost derivation corresponds to a (top-down) pre-order traversal of the parse tree. A rightmost derivation corresponds to a (bottom-up) post-order traversal, but in reverse. Top-down parsers construct leftmost derivations. (LL = Left-to-right traversal of input, constructing a Leftmost derivation) Bottom-up parsers construct rightmost derivations in reverse order. (LR = Left-to-right traversal of input, constructing a Rightmost derivation)

Abstract Syntax Trees. Parse trees contain much more information than is necessary for a compiler to produce executable code. Abstract syntax trees express the essential structure of the parse tree. The expression can be represented more simply by the following abstract syntax tree, or syntax tree for short.

Ambiguity A CFG is said to be ambiguous if there is at least one string with two or more parse trees. Note that ambiguity is a property of grammars, not languages. There is no algorithm for converting an arbitrary ambiguous grammar into an unambiguous one. Some languages are inherently ambiguous, meaning that no unambiguous grammar exists for them. There is no algorithm for detecting whether an arbitrary grammar is ambiguous.

CFG, Derivation, Parse Tree. Another example: E → E * E | E + E | ( E ) | id. Build a parse tree for: id * id + id * id. It can be built in different ways: ambiguity. If, for some input string that can be derived from the grammar, there exists more than one parse tree for it, then the grammar is ambiguous.

Ambiguity and Derivations. Example grammar: E → E * E | E + E | ( E ) | id. Derive: id * id + id * id
For one parse tree (with * at the root):
Leftmost: E ⇒ E * E ⇒ id * E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + E * E ⇒ id * id + id * E ⇒ id * id + id * id
Rightmost: E ⇒ E * E ⇒ E * E + E ⇒ E * E + E * E ⇒ E * E + E * id ⇒ E * E + id * id ⇒ E * id + id * id ⇒ id * id + id * id
For the other parse tree (with + at the root):
Leftmost: E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + E * E ⇒ id * id + id * E ⇒ id * id + id * id
Rightmost: E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ E * E + id * id ⇒ E * id + id * id ⇒ id * id + id * id
Multiple derivations do not imply ambiguity; only multiple parse trees do. If the grammar is ambiguous, then there exist multiple parse trees, and for each parse tree there is a unique leftmost derivation and a unique rightmost derivation.

Ambiguity. Ambiguity implies multiple parse trees. It can make parsing more difficult, and it can impact the semantics of the language: different parse trees can have different semantic meanings and yield different execution results. Sometimes rewriting the grammar can eliminate the ambiguity, but the rewrite may add additional semantics to the language. Eliminate ambiguity: E → E + T | E * T | (E) | T, T → id. Derive id * id + id * id using a leftmost derivation:
E ⇒ E * T ⇒ E + T * T ⇒ E * T + T * T ⇒ T * T + T * T ⇒ … ⇒ id * id + id * id
Is this what we want?
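The claim that different parse trees yield different execution results can be checked directly. A sketch (names and the sample values are illustrative): represent each tree as nested tuples and attach the usual meaning to + and *.

```python
def evaluate(tree):
    """Evaluate a parse tree given as ('+'|'*', left, right) or a number."""
    if isinstance(tree, tuple):
        op, lhs, rhs = tree
        l, r = evaluate(lhs), evaluate(rhs)
        return l + r if op == '+' else l * r
    return tree

# Two parse trees the ambiguous grammar allows for the tokens 2 * 3 + 4:
plus_on_top = ('+', ('*', 2, 3), 4)   # (2 * 3) + 4
star_on_top = ('*', 2, ('+', 3, 4))   # 2 * (3 + 4)
print(evaluate(plus_on_top))  # 10
print(evaluate(star_on_top))  # 14
```

Same token string, two trees, two answers: this is why an ambiguous grammar is a real problem for a compiler, not just a formal nuisance.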

Ambiguity. Rewrite the grammar to eliminate ambiguity. There are many ways to rewrite the grammar, but the new grammar should accept the same language. In the ambiguous grammar, an input string may have multiple parse trees, each with a different semantic meaning. Which one do we want? The rewritten grammar should be chosen based on the desired semantics. There is no general algorithm for rewriting ambiguous grammars.

Rewrite Ambiguous Grammar. Try to use a single recursive nonterminal in each rule: when the left-hand symbol appears more than once on the right side, introduce additional symbols to substitute for the extra occurrences, forcing only one expansion. Example grammar: E → E + E | E – E | E * E | E / E | (E) | id. It is ambiguous. Change it to: E → T + E | T – E | T * E | T / E | (E) | T, T → id. Parse id * id – id:
E ⇒ T * E ⇒ T * T – E ⇒ T * T – T ⇒ … ⇒ id * id – id

Rewrite Ambiguous Grammar. Build the desired precedence into the grammar. Example: E → E + E | E * E | (E) | id is ambiguous. Desired precedence: * executes before +. Change it to: E → E + T | T, T → T * F | F, F → (E) | id. Parsing id + id * id now yields a unique parse tree: the root uses E → E + T, the left E derives id through T and F, and the right T derives id * id through T → T * F, so the multiplication sits lower in the tree.
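A small parser sketch can show the precedence grammar in action (function names are illustrative; the left recursion in E → E + T and T → T * F is realized with loops, which preserves left associativity):

```python
def parse(tokens):
    """Parse tokens for: E -> E + T | T ; T -> T * F | F ; F -> ( E ) | id."""
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():                      # E: sums of terms, left-associative
        nonlocal pos
        node = term()
        while peek() == '+':
            pos += 1
            node = ('+', node, term())
        return node
    def term():                      # T: products of factors
        nonlocal pos
        node = factor()
        while peek() == '*':
            pos += 1
            node = ('*', node, factor())
        return node
    def factor():                    # F: parenthesized expr or id
        nonlocal pos
        if peek() == '(':
            pos += 1
            node = expr()
            assert peek() == ')', "expected closing parenthesis"
            pos += 1
            return node
        tok = peek()
        pos += 1
        return tok
    return expr()

print(parse(['id', '+', 'id', '*', 'id']))  # ('+', 'id', ('*', 'id', 'id'))
```

Because * is handled in `term`, below `expr`, the multiplication always ends up deeper in the tree: exactly the precedence the rewritten grammar encodes.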

Ambiguity – Another Example: the if statement.
stmt → if-stmt | while-stmt | …
if-stmt → if expr then stmt else stmt | if expr then stmt
Parse: if (a) then if (b) then x = c else x = d
There are two parse trees. In one, the else belongs to the outer if: its then part is if (b) then x = c and its else part is x = d. In the other, the else belongs to the inner if: the outer then part is the full statement if (b) then x = c else x = d.

Ambiguity – Another Example: the if statement. stmt → if-stmt | while-stmt | …; if-stmt → if expr then stmt else stmt | if expr then stmt. The recursion in the rule is not immediate: if-stmt expands to another nonterminal (stmt) and then back to itself, i.e. indirect recursion, in two places. Desired semantics: match each else with the closest if. How do we rewrite the if-stmt grammar to eliminate the ambiguity? By defining two kinds of if statements, unmatched and matched, where a matched statement has both then and else parts.

Ambiguity – Another Example. Solution: if-stmt → matched-stmt | unmatched-stmt, with
(1) matched-stmt → if expr then matched-stmt else matched-stmt
(2) unmatched-stmt → if expr then stmt
(3) unmatched-stmt → if expr then matched-stmt else unmatched-stmt
In (1) and (3), only a matched-stmt can appear between then and else. In (1), the matched-stmt cannot leave any dangling if; otherwise, the else of the enclosing then-else would belong to that dangling if, so both the then and else parts must be matched. Allowing a matched statement after the else in (3) would reintroduce the ambiguity with (1). After the then in (2), either a matched or an unmatched statement may appear, so every dangling if is covered by (2). Considering all statements: matched-stmt = {stmt} – {unmatched-stmt}, and unmatched-stmt = {stmt} – {matched-stmt}.