Presentation is loading. Please wait.

Presentation is loading. Please wait.

COP4020 Programming Languages

Similar presentations


Presentation on theme: "COP4020 Programming Languages"— Presentation transcript:

1 COP4020 Programming Languages
Syntax analysis Prof. Xin Yuan

2 Overview Syntax analysis overview Grammar and context-free grammar
Grammar derivations Parse trees 4/17/2017 COP4020 Spring 2014 COP4020 Fall 2006

3 Syntax analysis Syntax analysis is done by the parser. Rest of
Detects whether the program is written following the grammar rules and reports syntax errors. Produces a parse tree from which intermediate code can be generated. token Rest of front end Lexical analyzer Parse tree Int. code Source program parser Request for token Symbol table

4 The syntax of a programming language is described by a context-free grammar (Backus-Naur Form (BNF)). Similar to the languages specified by regular expressions, but more general. A grammar gives a precise syntactic specification of a language. From some classes of grammars, tools exist that can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically. A compiler based on a grammatical description of a language is more easily maintained and updated.

5 Grammars A grammar has four components G=(N, T, P, S): T is a finite set of tokens (terminal symbols) N is a finite set of nonterminals P is a finite set of productions of the form Where and S is a special nonterminal that is a designated start symbol 4/17/2017 COP4020 Spring 2014

6 Example Grammar for expression (T=?, N=?, P=?, S=?)
Production: E ->E+E E-> E-E E-> (E) E-> -E E->num E->id How does this correspond to a language? Informally, you can expand the non-terminals using the productions until all are expanded: the ending sentence (a sequence of tokens) is recognized by the grammar. 4/17/2017 COP4020 Spring 2014

7 Language recognized by a grammar
We say “aAb derives awb in one step”, denoted as “aAb=>awb”, if A->w is a production and a and b are arbitrary strings of terminal or nonterminal symbols. We say a1 derives am if a1=>a2=>…=>am, written as a1=>am The languages L(G) defined by G are the set of strings of the terminals w such that S=>w. * * 4/17/2017 COP4020 Spring 2014

8 Example A->aA A->bA A->a A->b G=(N, T, P, S) N=? T=? P=? S=? What is the language recognized by this grammer? 4/17/2017 COP4020 Spring 2014

9 Chomsky Hierarchy (classification of grammars) A grammar is said to be
regular if it is right-linear, where each production in P has the form, or Here, A and B are non-terminals and w is a terminal or left-linear context-free if each production in P is of the form , where and context sensitive if each production in P is of the form where unrestricted if each production in P is of the form where All languages recognized by regular expression can be represented by a regular grammar.

10 A context free grammar has four components G=(N, T, P, S): T is a finite set of tokens (terminal symbols) N is a finite set of nonterminals P is a finite set of productions of the form Where and S is a special nonterminal that is a designated start symbol. Context free grammar is more expressive than regular expression. Consider language {ab, aabb, aaabbb, …}

11 BNF Notation (another form of context free grammar)
Backus-Naur Form (BNF) notation for productions: <nonterminal> ::= sequence of (non)terminals where Each terminal in the grammar is a token A <nonterminal> defines a syntactic category The symbol | denotes alternative forms in a production The special symbol  denotes empty 4/17/2017 COP4020 Spring 2014

12 Example <Program> ::= program <id> ( <id> <More_ids> ) ; <Block> . <Block> ::= <Variables> begin <Stmt> <More_Stmts> end <More_ids> ::= , <id> <More_ids> |  <Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables> |  <More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables> |  <Stmt> ::= <id> := <Exp> | if <Exp> then <Stmt> else <Stmt> | while <Exp> do <Stmt> | begin <Stmt> <More_Stmts> end <More_Stmts> ::= ; <Stmt> <More_Stmts> |  <Exp> ::= <num> | <id> | <Exp> + <Exp> | <Exp> - <Exp> 4/17/2017 COP4020 Spring 2014

13 Derivations From a grammar we can derive strings (= sequences of tokens) The opposite process of parsing Starting with the grammar’s designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal A sentence (in the language) is a sequence of terminals that can be derived from the start symbol. A sentential form is a sequence of terminals and nonterminals that can be derived from the start symbol. 4/17/2017 COP4020 Spring 2014

14 Example Derivation <expression> ::= identifier
              | unsigned_integer               | - <expression>               | ( <expression> )               | <expression> <operator> <expression> <operator> ::= + | - | * | / Start symbol <expression>    <expression> <operator> <expression>    <expression> <operator> identifier    <expression> + identifier    <expression> <operator> <expression> + identifier    <expression> <operator> identifier + identifier    <expression> * identifier + identifier    identifier * identifier + identifier Replacement of nonterminal with one of its productions Sentential forms The final string is the yield 4/17/2017 COP4020 Spring 2014

15 Rightmost versus Leftmost Derivations
When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step the derivation is called right-most (left-most) Replace in rightmost derivation <expression>    <expression> <operator> <expression>    <expression> <operator> identifier Replace in rightmost derivation Replace in leftmost derivation <expression>    <expression> <operator> <expression>    identifier <operator> <expression> Replace in leftmost derivation 4/17/2017 COP4020 Spring 2014

16 A Language Generated by a Grammar
A context-free grammar is a generator of a context-free language The language defined by a grammar G is the set of all strings w that can be derived from the start symbol S L(G) = { w | S * w } <S> ::= a | ‘(’ <S> ‘)’ L(G) = { set of all strings a (a) ((a)) (((a))) … } <S> ::= <B> | <C> <B> ::= <C> + <C> <C> ::= 0 | 1 L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 } 4/17/2017 COP4020 Spring 2014

17 Parse Trees A parse tree depicts the end result of a derivation
The internal nodes are the nonterminals The children of a node are the symbols (terminals and nonterminals) on a right-hand side of a production The leaves are the terminals <expression> <expression> <operator> <expression> <expression> <operator> <expression> identifier * identifier + identifier 4/17/2017 COP4020 Spring 2014

18 Parse Trees <expression>
   <expression> <operator> <expression>    <expression> <operator> identifier    <expression> + identifier    <expression> <operator> <expression> + identifier    <expression> <operator> identifier + identifier    <expression> * identifier + identifier    identifier * identifier + identifier <expression> <expression> <operator> <expression> <expression> <operator> <expression> identifier * identifier + identifier 4/17/2017 COP4020 Spring 2014

19 Ambiguity There is another parse tree for the same grammar and input: the grammar is ambiguous This parse tree is not desired, since it appears that + has precedence over * <expression> <expression> <operator> <expression> <expression> <operator> <expression> identifier * identifier + identifier 4/17/2017 COP4020 Spring 2014

20 Ambiguous Grammars Ambiguous grammar: more than one distinct derivation of a string results in different parse trees A programming language construct should have only one parse tree to avoid misinterpretation by a compiler For expression grammars, associativity and precedence of operators is used to disambiguate <expression> ::= <term> | <expression> <add_op> <term> <term> ::= <factor> | <term> <mult_op> <factor> <factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> ) <add_op> ::= + | - <mult_op> ::= * | / 4/17/2017 COP4020 Spring 2014

21 Ambiguous if-then-else: the “Dangling Else”
A classical example of an ambiguous grammar are the grammar productions for if-then-else: <stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt> It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design Ada uses different syntax to avoid ambiguity: <stmt> ::= if <expr> then <stmt> end if | if <expr> then <stmt> else <stmt> end if 4/17/2017 COP4020 Spring 2014


Download ppt "COP4020 Programming Languages"

Similar presentations


Ads by Google