COMP 3438 – Part II - Lecture 4 Syntax Analysis I Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ.

Overview of the Subject (COMP 3438): Course Organization
Part I: Unix System Programming (Device Driver Development)
- Overview of Unix Sys. Prog.
- Process
- File System
- Overview of Device Driver Development
- Character Device Driver Development
- Introduction to Block Device Driver
Part II: Compiler Design
- Overview of Compiler Design
- Lexical Analysis
- Syntax Analysis (HW #4) (this lecture)

Review for the Previous Lecture
Lexical Specification → (obtain regular expressions) → Regular expressions → (conversion) → NFA (Nondeterministic Finite Automata) → (conversion) → DFA (Deterministic Finite Automata) → Table-driven Implementation of DFA (Lexical Analyzer)

Outline
Part I: Introduction to Syntax Analysis
1. Input (Tokens) and Output (Parse Tree)
2. How to specify syntax? Context Free Grammar (CFG)
3. How to obtain a parse tree?
   - CFG → remove left recursion, apply left factoring, remove ambiguity → LL (Leftmost Derivation)
   - CFG → (remove ambiguity) → LR (Reverse Rightmost Derivation)
Part II: Context Free Grammar, Parse Tree and Ambiguity
Part III: Top-down Parsing (LL)
- Left Recursion, Left Factoring (Tutorial)
- Recursive-Descent Parsing
- Predictive Parsing (without backtracking) – HW3
- Nonrecursive Predictive Parsing
Part IV: Bottom-up Parsing (LR)
- SLR, Canonical LR, LALR
- Software Tool: yacc (Lab)

Part I: Intro. to Syntax Analysis

The Phases of a Compiler
Source program → Lexical Analyzer → (tokens) → Syntax Analyzer (Parser) → (parse tree) → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target program
The Symbol-table Manager and the Error Handler interact with every phase.

Syntax Analysis (Parsing): 2nd Phase
Input: sequence of tokens from lexical analysis
Output: a parse tree (syntax tree) based on the grammar of a programming language
Comparison:
Phase | Input | Output
Lexical Analysis (Scanner) | Source program (string of characters) | String of tokens
Syntax Analysis (Parser) | String of tokens | Parse/syntax tree

Example
Code: if x == y then 1 else 2 fi
Parser input (sequence of tokens): IF ID == ID THEN NUM ELSE NUM FI
Parser output (tree): an IF-THEN-ELSE node with three children: == (with children ID and ID), NUM, and NUM

Syntax and Grammar
A programming language has rules that prescribe the syntax of programs. In Pascal:
  program → block ;
  block → statements ;
  …
Example program: var a,b,c; begin a = b + c; end
The syntax of programming language constructs can be described by a context-free grammar or BNF (Backus-Naur Form).

Context-Free Grammar (CFG)
A context-free grammar G = (N, T, S, P) consists of:
(1) N, a finite set of nonterminal symbols
(2) T, a finite set of terminal symbols
(3) S, the start (initial) nonterminal symbol (S ∈ N)
(4) P, a finite set of productions (rules). Every production in P has the form A → α, where A ∈ N and α is a string in (N ∪ T)*.
For example: G = { N = {S}, T = {a, b}, S, P = {S → aSb | ab} }
S → aSb | ab denotes the language L = {a^n b^n | n > 0}.

Why not Regular Expression?
An example:
  (x+y) * z
  ((x+y) + y) * z
  (((x+y)+y)+y) * z
  …
  (…(((x+y)+y))…) * z
How do we know the left and right parentheses match? We need number_of "(" == number_of ")".
L = {a^n b^n | n > 0} is not a regular set. But it is context free: S → a S b | ab
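The recursive production S → a S b | ab can be read directly as a recursive membership test. The sketch below is only an illustration; the function name `derives_from_S` is made up for this example and is not part of the lecture.

```python
def derives_from_S(s: str) -> bool:
    """Return True if s can be derived from S using S -> a S b | ab."""
    if s == "ab":
        # Production S -> ab
        return True
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        # Production S -> a S b: strip the outer a...b and recurse on the middle
        return derives_from_S(s[1:-1])
    return False


for candidate in ["ab", "aabb", "aaabbb", "aab", "abab", ""]:
    print(repr(candidate), derives_from_S(candidate))
```

Each recursive call peels off one matched a/b pair, which is exactly the unbounded counting that a finite automaton cannot do.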

Regular Expression, CFG and Automata
Regular Expression ↔ Finite Automata
CFG ↔ Pushdown Automata (with a stack)
[Figure: a finite automaton reading the input i = a * b, and a pushdown automaton reading the input aabb with a stack]
With a stack, it is easy to recognize a language like L = {a^n b^n | n > 0}.
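A minimal sketch of how a stack makes L = {a^n b^n | n > 0} easy to recognize: push a marker for every a, then pop one for every b. This simulates the idea of a pushdown automaton rather than a formal PDA definition; the function name `accepts_anbn` is an assumption for illustration.

```python
def accepts_anbn(s: str) -> bool:
    """Recognize a^n b^n (n > 0) with an explicit stack."""
    stack = []
    i = 0
    # Phase 1: push one marker per leading 'a'.
    while i < len(s) and s[i] == "a":
        stack.append("a")
        i += 1
    if not stack:
        return False  # n must be greater than 0
    # Phase 2: pop one marker per 'b' while markers remain.
    while i < len(s) and s[i] == "b" and stack:
        stack.pop()
        i += 1
    # Accept only if all input is consumed and every 'a' was matched.
    return i == len(s) and not stack


print(accepts_anbn("aabb"))
print(accepts_anbn("aab"))
```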

Language Classification
Nested classes, from largest to smallest: Recursive Language ⊃ Context-sensitive Language ⊃ Context-free Language ⊃ Regular Set

Regular expression vs. CFG
Every language that can be described by a regular expression can also be described by a CFG. For example, for (a|b)*abb, the following CFG gives the same language:
  A0 → a A0 | b A0 | a A1
  A1 → b A2
  A2 → b A3
  A3 → ε
[Figure: NFA with states 0, 1, 2, 3; state 0 loops on a and b; 0 → 1 on a, 1 → 2 on b, 2 → 3 on b; state 3 is accepting]
Convert an NFA into a CFG:
1. For each state i, create a nonterminal Ai
2. If state i has a transition to state j on input a, introduce Ai → a Aj
3. If state i goes to state j on input ε, add Ai → Aj
4. If i is an accepting state, introduce Ai → ε
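The four conversion rules can be sketched in a few lines. The representation here is an assumption made for the example: transitions are `(state, symbol, next_state)` triples, the empty string stands for an ε-transition, and `nfa_to_cfg` is a hypothetical helper name.

```python
EPSILON = ""  # assumed encoding of an epsilon transition


def nfa_to_cfg(transitions, accepting):
    """Return a list of (lhs, rhs) productions for the given NFA."""
    productions = []
    for i, symbol, j in transitions:
        if symbol == EPSILON:
            productions.append((f"A{i}", f"A{j}"))           # rule 3
        else:
            productions.append((f"A{i}", f"{symbol} A{j}"))  # rule 2
    for i in sorted(accepting):
        productions.append((f"A{i}", "ε"))                   # rule 4
    return productions


# NFA for (a|b)*abb: state 0 loops on a and b, then a, b, b leads to state 3.
nfa = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
for lhs, rhs in nfa_to_cfg(nfa, accepting={3}):
    print(f"{lhs} -> {rhs}")
```

Rule 1 is implicit: every state number i that appears becomes the nonterminal Ai.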

Derivations
Based on a grammar for a language, we can generate sentences (strings) of the language. This is done by derivations.
Syntax Analysis: given the input token string, can we obtain a derivation based on the grammar?
Grammar: S → a S b | ab
Derivation: S ⇒ a S b ⇒ aabb
where ⇒ means "derives in one step".

Parse Trees
The derivation of a sentence can be shown pictorially by a parse tree:
- each node is labeled by a grammar symbol
- an interior node and its children correspond to a production
A language is the set of all sentences that can be derived from the start symbol by the grammar.
Example:
Grammar: S → a S b | ab
Derivation: S ⇒ a S b ⇒ aabb
Parse tree:
      S
    / | \
   a  S  b
     / \
    a   b

Leftmost/Rightmost Derivation
At each derivation step, we have to make two choices:
- Which nonterminal to replace?
- Which alternative to use for that nonterminal?
Leftmost derivation: only the leftmost nonterminal is replaced at each derivation step.
Grammar: E → E + E | E * E | (E) | -E | id
Leftmost derivation for the sentence -(id+id): E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Similarly, for a rightmost derivation, the rightmost nonterminal is replaced at each step.
[Figure: parse tree for -(id+id)]
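A small sketch of a leftmost derivation as a mechanical process: at every step the leftmost nonterminal is replaced by a chosen alternative. Sentential forms are modeled as lists of symbols; the names `leftmost_step` and `leftmost_derivation` are assumptions for illustration.

```python
# Alternatives of E in the grammar E -> E + E | E * E | (E) | -E | id
PRODUCTIONS = {
    "E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["-", "E"], ["id"]],
}


def leftmost_step(form, rhs):
    """Replace the leftmost nonterminal in `form` with the alternative `rhs`."""
    for k, symbol in enumerate(form):
        if symbol in PRODUCTIONS:
            if rhs not in PRODUCTIONS[symbol]:
                raise ValueError(f"{symbol} -> {' '.join(rhs)} is not a production")
            return form[:k] + rhs + form[k + 1:]
    raise ValueError("no nonterminal left to replace")


def leftmost_derivation(start, choices):
    """Apply the chosen alternatives in order; return every sentential form."""
    forms = [[start]]
    for rhs in choices:
        forms.append(leftmost_step(forms[-1], rhs))
    return [" ".join(form) for form in forms]


steps = leftmost_derivation(
    "E", [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]]
)
print(" => ".join(steps))
```

Replacing the loop in `leftmost_step` with a right-to-left scan would give a rightmost derivation with the same structure.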

Ambiguous Grammars
Each parse tree corresponds to a unique leftmost derivation and a unique rightmost derivation.
Some sentences may have more than one leftmost or rightmost derivation and therefore more than one parse tree.
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
Grammar: E → E + E | E * E | id
Sentence: id + id * id
Parse tree 1: root E → E + E, where the right E → E * E (groups as id + (id * id))
Parse tree 2: root E → E * E, where the left E → E + E (groups as (id + id) * id)

Parsing Methods
Parser (Syntax Analyzer): given a token string, generate a parse tree based on the grammar of the programming language, if the string belongs to the language.
Three parsing methods:
- Universal Parsing: e.g. Earley's algorithm, which can parse any context-free grammar but is too inefficient for practical compilers (cubic time in the worst case).
- Top-Down Parsing: find a leftmost derivation for an input token string.
- Bottom-Up Parsing: find a reverse rightmost derivation for an input token string.
[Figure: a parse tree rooted at E, with arrows showing the Top-Down and Bottom-Up construction directions]

Top-Down Parsing
- Start at the root of the parse tree and work toward the leaves
- Produces a leftmost derivation
- Can be written efficiently by hand
- Only works for a certain class of grammars:
  - Unambiguous
  - No left recursion
  - Left-factored (no two alternatives of a nonterminal share a common prefix)
- Homework 4 asks you to implement a parser using top-down parsing
[Figure: parse tree rooted at E, built Top-Down]

Bottom-Up Parsing
- Start at the leaves and build the tree from the bottom up
- Produces a rightmost derivation in reverse
- Basic method: shift-reduce. Shift symbols onto the stack; when a handle is identified on top of the stack, reduce it to the left-hand side of its production
- Used to implement automatic parser generators such as yacc
- Works for a wider class of grammars than top-down parsing (the grammar must still be unambiguous)
- Lab: use yacc
[Figure: parse tree rooted at E, built Bottom-Up]
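To make shift and reduce concrete, here is a deliberately naive sketch for the grammar S → a S b | ab used earlier in the lecture. It reduces greedily whenever the top of the stack equals a right-hand side. That shortcut happens to work for this tiny grammar; a real LR parser (such as one generated by yacc) consults a parsing table to decide when to shift and when to reduce. The function name is an assumption for illustration.

```python
def shift_reduce_accepts(tokens):
    """Naive shift-reduce recognizer for S -> a S b | ab."""
    stack = []
    for token in tokens:
        stack.append(token)  # shift the next input symbol
        while True:          # reduce while a right-hand side is on top
            if stack[-3:] == ["a", "S", "b"]:
                stack[-3:] = ["S"]   # reduce by S -> a S b
            elif stack[-2:] == ["a", "b"]:
                stack[-2:] = ["S"]   # reduce by S -> ab
            else:
                break
    # Accept if the whole input has been reduced to the start symbol.
    return stack == ["S"]


print(shift_reduce_accepts(list("aabb")))
print(shift_reduce_accepts(list("abab")))
```

Read backwards, the reductions performed on aabb (ab to S, then aSb to S) are the rightmost derivation S ⇒ aSb ⇒ aabb.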

Part II: Context Free Grammars, Parse Trees and Ambiguity

Context-Free Grammar (CFG)
A context-free grammar G = (N, T, S, P) consists of:
(1) N, a finite set of nonterminal symbols
(2) T, a finite set of terminal symbols
(3) S, the start (initial) nonterminal symbol (S ∈ N)
(4) P, a finite set of productions (rules). Every production in P has the form A → α, where A ∈ N and α is a string in (N ∪ T)*.
For example: G = { N = {S}, T = {a, b}, S, P = {S → aSb | ab} }
S → aSb | ab denotes the language L = {a^n b^n | n > 0}.

Derivations
A grammar derives strings by:
- beginning with the start symbol
- repeatedly replacing a nonterminal by the right-hand side of a production for that nonterminal
We say that αAβ ⇒ αγβ if
1. A → γ is a production, and
2. α and β are arbitrary strings of grammar symbols
We use ⇒ to mean "derives in one step".
Example:
Grammar: S → a S b | ab
Derivation: S ⇒ a S b ⇒ aabb

Example for Derivations
Consider the following grammar G_E for simple expressions:
  S → E
  E → E | E + E | E * E | id
The string id + id * id can be derived from the start symbol S by the following sequence of replacements:
S ⇒ E ⇒ E + E ⇒ E + E * E ⇒ E + id * E ⇒ id + id * E ⇒ id + id * id
Each step is a derivation step. At each derivation step, we have to make two choices:
- Which nonterminal to replace?
- Which alternative to use for that nonterminal?

Derivations
Often we wish to say "derives in zero or more steps." For this purpose we use the symbol ⇒*:
1. α ⇒* α for any string α, and
2. if α ⇒* β and β ⇒ γ, then α ⇒* γ
Likewise, we use the symbol ⇒+ to mean "derives in one or more steps."
For a grammar G, if S ⇒* α, where S is the start symbol and α may contain nonterminals, then we say that α is a sentential form of G.
A sentence of the language defined by G is a sentential form with no nonterminals.
A string x is a sentence of L(G) iff S ⇒+ x.
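The definition "x is a sentence iff S ⇒+ x" suggests a simple enumeration: start from S, apply productions step by step, and collect every sentential form that contains no nonterminal. The sketch below does this breadth-first for S → aSb | ab. It assumes single-character symbols and prunes by length, which is only safe because no production of this grammar shortens a sentential form. The name `sentences` is an assumption for illustration.

```python
from collections import deque

PRODUCTIONS = {"S": ["aSb", "ab"]}  # S -> aSb | ab, one character per symbol


def sentences(start="S", max_len=6):
    """Enumerate all sentences of length <= max_len derivable from `start`."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None if form is a sentence.
        k = next((i for i, c in enumerate(form) if c in PRODUCTIONS), None)
        if k is None:
            found.append(form)
            continue
        for rhs in PRODUCTIONS[form[k]]:
            new_form = form[:k] + rhs + form[k + 1:]  # one derivation step
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(found, key=len)


print(sentences(max_len=6))
```

Every string that passes through the queue is a sentential form; only those without S end up in the result.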

Parse Trees
The derivation of a sentence can be shown pictorially by a parse tree:
- each node is labeled by a grammar symbol
- an interior node and its children correspond to a production
Example:
Grammar: S → a S b | ab
Derivation: S ⇒ a S b ⇒ aabb
Parse tree:
      S
    / | \
   a  S  b
     / \
    a   b

The Properties of a Parse Tree
A parse tree has the following properties:
- The root is labeled by the start symbol.
- A leaf node is labeled by a terminal symbol.
- An interior node is labeled by a nonterminal symbol.
- If A is the nonterminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then A → X1 X2 ... Xn is a production.
Visiting all leaves of a parse tree from left to right traces the sentence formed by the parse tree.
[Figure: parse tree rooted at E]
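These properties can be checked on a small tree data structure. The sketch below is one possible representation, not the lecture's: `Node`, `frontier`, and `interior_nodes_match_productions` are names invented here, and the tree is one parse tree for id + id * id under E → E + E | E * E | id.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def frontier(node):
    """Labels of the leaves, read from left to right."""
    if not node.children:
        return [node.label]
    leaves = []
    for child in node.children:
        leaves.extend(frontier(child))
    return leaves


def interior_nodes_match_productions(node, productions):
    """Check that every interior node and its children form a production."""
    if not node.children:
        return True  # leaves are assumed to be terminals in this sketch
    rhs = tuple(child.label for child in node.children)
    return (node.label, rhs) in productions and all(
        interior_nodes_match_productions(child, productions)
        for child in node.children
    )


PRODUCTIONS = {("E", ("E", "+", "E")), ("E", ("E", "*", "E")), ("E", ("id",))}


def E(*children):
    return Node("E", list(children))


# E -> E + E, where the right operand is E -> E * E
tree = E(E(Node("id")), Node("+"), E(E(Node("id")), Node("*"), E(Node("id"))))

print(" ".join(frontier(tree)))
print(interior_nodes_match_productions(tree, PRODUCTIONS))
```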

Parse Tree
[Figure: sketch of a parse tree for a complete program]

Ambiguous Grammars
Each parse tree corresponds to a unique leftmost derivation and a unique rightmost derivation.
Some sentences may have more than one leftmost or rightmost derivation and therefore more than one parse tree.
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
Grammar: E → E + E | E * E | id
Sentence: id + id * id
Parse tree 1: root E → E + E, where the right E → E * E (groups as id + (id * id))
Parse tree 2: root E → E * E, where the left E → E + E (groups as (id + id) * id)

Eliminating Ambiguity
Not always possible. Why?
- First, no algorithm exists that takes an arbitrary grammar and determines, with certainty and in a finite amount of time, whether it is ambiguous or not.
- Second, some languages are inherently ambiguous, i.e., every grammar for them is ambiguous.
So what can we do?
- We may use certain ambiguous grammars together with disambiguating rules that "throw away" undesirable parse trees.
- For expressions, use precedence and associativity.

Eliminating Ambiguity
Consider the ambiguity in the following "dangling-else" grammar G:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | other
G is ambiguous because the string "if E1 then if E2 then S1 else S2" has two parse trees:
Parse tree 1: the else is attached to the inner if: if E1 then (if E2 then S1 else S2)
Parse tree 2: the else is attached to the outer if: if E1 then (if E2 then S1) else S2

Eliminating Ambiguity
The general rule for the "dangling-else" grammar: "Match each else with the closest previous unmatched then."
Idea: a statement between a then and an else must be matched.
  stmt → matched_stmt | unmatched_stmt
  matched_stmt → if expr then matched_stmt else matched_stmt
               | other
  unmatched_stmt → if expr then stmt
                 | if expr then matched_stmt else unmatched_stmt
[Figure: the two parse trees for "if E1 then if E2 then S1 else S2"; the rewritten grammar admits only the one in which the else is attached to the inner if]
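The rule "match each else with the closest previous unmatched then" can also be applied directly in a hand-written parser, which is a different technique from rewriting the grammar as on the slide. The sketch below works under simplifying assumptions: the input is already a list of tokens, `expr` and `other` are single tokens, and the tuple shape `("if", then_branch, else_branch)` is invented for this example.

```python
def parse_stmt(tokens, i=0):
    """Parse one stmt starting at tokens[i]; return (tree, next_index)."""
    if i < len(tokens) and tokens[i] == "other":
        return "other", i + 1
    if tokens[i:i + 3] == ["if", "expr", "then"]:
        then_branch, i = parse_stmt(tokens, i + 3)
        else_branch = None
        # Greedy choice: an else seen here belongs to this (innermost open) if.
        if i < len(tokens) and tokens[i] == "else":
            else_branch, i = parse_stmt(tokens, i + 1)
        return ("if", then_branch, else_branch), i
    raise SyntaxError(f"unexpected input at position {i}")


def parse(tokens):
    tree, i = parse_stmt(tokens)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token at position {i}")
    return tree


tokens = "if expr then if expr then other else other".split()
print(parse(tokens))
```

Because the inner call consumes the else before the outer call gets to look for one, the else ends up attached to the inner if and the outer if has no else branch.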

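Ambiguous Grammar
Grammar: E → E + E | E * E | id
Consider the string id * id + id * id
It has several different parse trees. Three are shown (the grammar admits five in total):
[Figure: three parse trees for id * id + id * id, one with E → E + E at the root and two with E → E * E at the root]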
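The degree of ambiguity can be measured by counting parse trees. For E → E + E | E * E | id, any operator in the token string can serve as the root, so the count for a span is the sum, over each operator in it, of (trees for the left part) × (trees for the right part). The sketch below is an illustration with an invented function name, specific to this one grammar.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of distinct trees deriving tokens[lo:hi] from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0  # E -> id
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):  # tokens[k] is the root operator
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees("id + id * id".split()))
print(count_parse_trees("id * id + id * id".split()))
```

A count greater than 1 for any sentence is enough to show that the grammar is ambiguous.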

Specifying Precedence
Idea: build precedence and associativity into the grammar.
- Use a different nonterminal for each precedence level
  - Lowest precedence: highest in the tree
  - Highest precedence: lowest in the tree
  - Same level: same precedence
- Associativity:
  - left recursion: left associative
  - right recursion: right associative
  E → E + T | E - T | T
  T → T * F | T / F | F
  F → P | P ^ F
  P → ID | NUM | ( E )

Example
  E → E + T | E - T | T
  T → T * F | T / F | F
  F → P | P ^ F
  P → ID | NUM | ( E )
Input: 1+2+3+4^5^6
[Figure: parse tree in which E → E + T is applied three times down the left spine and F → P ^ F is applied twice down the right, so the input groups as ((1 + 2) + 3) + (4 ^ (5 ^ 6))]
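A minimal hand-written parser for this grammar shows how the grammar's shape produces the grouping. Two caveats, both departures from the slide: the left-recursive rules E → E + T and T → T * F are implemented as loops (the usual recursive-descent rewrite, which keeps left associativity), and only NUM operands are supported, not ID. All function names are assumptions for illustration.

```python
import re


def tokenize(text):
    # Numbers and single-character operators; any other character is skipped.
    return re.findall(r"\d+|[-+*/^()]", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_E():  # E -> E + T | E - T | T, as a loop: left associative
        node = parse_T()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, parse_T())
        return node

    def parse_T():  # T -> T * F | T / F | F, as a loop: left associative
        node = parse_F()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, parse_F())
        return node

    def parse_F():  # F -> P | P ^ F, right recursion: right associative
        base = parse_P()
        if peek() == "^":
            advance()
            return ("^", base, parse_F())
        return base

    def parse_P():  # P -> NUM | ( E )
        token = peek()
        if token == "(":
            advance()
            node = parse_E()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token is not None and token.isdigit():
            advance()
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

    tree = parse_E()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("1+2+3+4^5^6"))
```

The nested tuples mirror the parse tree: + nodes nest to the left, ^ nodes nest to the right, and ^ sits deeper in the tree than + because it has higher precedence.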

Summary
Introduction to syntax analysis
- Input (Tokens) and Output (Parse Tree)
- Specify syntax: Context Free Grammar (CFG)
- Parsing methods
Context-free grammar (CFG)
- Derivation, parse tree
- Ambiguous grammars
