Bernd Fischer RW713: Compiler and Software Language Engineering.


1 Bernd Fischer bfischer@cs.sun.ac.za RW713: Compiler and Software Language Engineering

2 Parsing

3 Prelude: Reading text…
thisissometextwithoutspacesandpunctuationmarkswhichisthereforequitedifficulttoreadbyhumanslexicalanalysiswillbreakthistextupintowordswhiletheparsingphasewillextractthegrammaticalstructureofthetext
this is some text without spaces and punctuation marks which is therefore quite difficult to read by humans lexical analysis will break this text up into words while the parsing phase will extract the grammatical structure of the text
This is some text without spaces and punctuation-marks which is therefore quite difficult to read by humans. Lexical-analysis will break this text up into words while the parsing-phase will extract the grammatical-structure of the text.

4 Syntax analysis determines the structure behind the token stream.
[Diagram: the compiler translates a source program into machine language. Front end (analysis): lexical, syntax, and contextual analysis. Back end (synthesis): intermediate code, object code. Data passed between the phases: text, tokens, (abstract) syntax tree.]

5 Syntax analysis determines the structure behind the token stream.
Token stream: IF ID(n) ROP(==) NUM(0) THEN RETURN NUM(0) ELSE RETURN ID(n) AOP(*) ID(f) LPAR ID(n) AOP(-) NUM(1) RPAR
Syntax analysis…
–recovers implied structure: converts the flat token stream into an (abstract) syntax tree
–discards redundant tokens (keywords,…)
–drives lexical analysis and sometimes all phases (syntax-directed translation)
[Figure: abstract syntax tree for the token stream, with nodes such as if, test, ==, id, num, return, and expr.]

6 Regular expressions are not expressive enough for parsing. Consider arithmetic expressions over natural numbers using + and *: L = { 0, 1, 2, 3,…, 0+0, 0+1, 0+2,…,1+0,1+1,1+2,…, 0*0, 0*1, 0*2,…, 0+0+0, 0+0+1,…, 0+0*0, 0+0*1, 0+0*2,…., 0*0+0, 0*0+1,…, 1*0+0, 1*0+1,…. } Pop-Quiz: Write a regular expression for L! Nat = 0 | [1-9][0-9]* Ex = Nat (( + | * ) Nat)* L(Ex) = L … but scanning does not recover operator precedences.
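The pop-quiz answer can be tried out directly. The following is a minimal sketch using Python's `re` module; the helper names `in_L` and `tokens` are made up for this illustration. It shows that the regular expression recognises L but only ever yields a flat sequence of tokens, with no information about operator precedence.

```python
import re

# Nat = 0 | [1-9][0-9]*   and   Ex = Nat (( + | * ) Nat)*
NAT = r"(?:0|[1-9][0-9]*)"
EX = re.compile(rf"{NAT}(?:[+*]{NAT})*")


def in_L(word: str) -> bool:
    """Return True if the whole word matches Ex."""
    return EX.fullmatch(word) is not None


def tokens(word: str) -> list[str]:
    """Split a word into a flat list of numbers and operators."""
    return re.findall(rf"{NAT}|[+*]", word)


if __name__ == "__main__":
    print(in_L("0+0*1"), in_L("01"), in_L("1+"))
    print(tokens("1+2*3"))  # a flat list; nothing says that * binds tighter
```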

7 Regular expressions are not expressive enough for parsing. Consider arithmetic expressions over natural numbers using +, *, (, and ): L’ = { 0, 1, 2, 3,…, (0), (1), (2), … 0+0, 0+1, 0+2,…,1+0,1+1,1+2,…, 0*0, 0*1, 0*2,…, 0+0+0, 0+0+1,…, 0+0*0, 0+0*1, 0+0*2,…., 0*0+0, 0*0+1,…, 1*0+0, 1*0+1,…., (0+0), (0+1), (0+2),…,(1+0), (1+1), (1+2),…, (0*0), (0*1), (0*2),…, (0+0)+0, (0+0)+1,…, } Pop-Quiz: Write a regular expression for L’!

8 Regular expressions are not expressive enough for parsing. Consider arithmetic expressions over natural numbers using +, *, (, and ): Pop-Quiz: Write a regular expression for L’! Nat = 0 | [1-9][0-9]* Ex 1 = Nat | Nat + Nat | Nat * Nat Ex 2 = Ex 1 | (Ex 1 ) | Ex 1 + Ex 1 | Ex 1 * Ex 1 Ex 3 = Ex 2 | (Ex 2 ) | Ex 2 + Ex 2 | Ex 2 * Ex 2 Ex 4 = … No finite regular expression accepts L’!

9 Regular expressions are not expressive enough for parsing.
Describing {“()”, “(())”, “((()))”, …} using a regular expression is impossible!
–infinitely large expression!
Similarly, a finite automaton cannot recognise it
–DFA states have no memory
[Figure: an automaton that would need an unbounded chain of states, one for each level of open parentheses.]
⇒ need more expressive specification formalism: context-free grammars
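To see what "memory" means here, the sketch below recognises the language {"()", "(())", "((()))", …} with a single unbounded counter. This is an illustration only (the function name `is_nested` is hypothetical); the point is that the counter can take arbitrarily many values, which no fixed, finite set of DFA states can mimic.

```python
def is_nested(word: str) -> bool:
    """Recognise n opening parentheses followed by n closing ones, n >= 1."""
    depth = 0            # unbounded counter: the memory a DFA lacks
    seen_close = False   # once a ')' was read, no further '(' may follow
    for ch in word:
        if ch == "(":
            if seen_close:
                return False
            depth += 1
        elif ch == ")":
            seen_close = True
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0 and seen_close
```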

10 Reminder: Context-Free Grammars Definition: A context-free grammar (CFG) G = (N, T, P, S) consists of a set N of non-terminal symbols, a set T of terminal symbols with N ∩ T = ∅, a set of productions P ⊆ N × (N ∪ T)*, and a start symbol S ∈ N.

11 Reminder: Context-Free Grammars
Notation: (simplified from the Dragon Book, 4.2.2)
A, B, C,... denote non-terminals
a, b, c,... denote terminals
X, Y, Z denote grammar symbols (i.e., X ∈ N ∪ T)
u, v,..., z denote strings of terminals (i.e., u ∈ T*)
α, β, γ,... denote strings of grammar symbols (i.e., α ∈ (N ∪ T)*)
A → α denotes a production rule
A → α | β | γ |... denotes A → α, A → β, A → γ,...

12 Context-Free Grammars: Pop-Quiz Write a context-free grammar for L’! L’ = { 0, 1, 2, 3,…, (0), (1), (2), … 0+0, 0+1, 0+2,…,1+0,1+1,1+2,…, 0*0, 0*1, 0*2,…, 0+0+0, 0+0+1,…, 0+0*0, 0+0*1, 0+0*2,…., 0*0+0, 0*0+1,…, 1*0+0, 1*0+1,…., (0+0), (0+1), (0+2),…,(1+0), (1+1), (1+2),…, (0*0), (0*1), (0*2),…, (0+0)+0, (0+0)+1,…, } P = {Ex → NAT | (Ex) | Ex + Ex | Ex * Ex} N = {Ex}, S = Ex, T = {NAT, (, ), +, *}

13 Context-Free Grammars: Pop-Quiz Write a “scanner-less” context-free grammar for L’! P= {NonZero → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Digit → 0 | NonZero Any → ε | Digit Any Nat → NonZero Any Ex → Nat | (Ex) | Ex + Ex | Ex * Ex} N= {Ex, Nat, NonZero, Any, Digit}, S = Ex, T= {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, (, ), +, *}

14 Context-Free Languages How do we construct the language described by... regular expressions: recursively build from languages of sub-expressions (syntax-directed) context-free grammars: derive individual words by recursively applying productions –start with ω = S –pick an occurrence of a non-terminal A in ω –pick a production A → α in P –replace A by α in ω –repeat until ω ∈ T*
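The derivation procedure for context-free grammars can be sketched as a small loop. This is an illustrative sketch, not a standard algorithm: the grammar is the Ex grammar from the pop-quiz, choices are made at random, and the `budget` parameter is an assumption added here so that the loop is guaranteed to stop (after `budget` steps it always picks the shortest production, which for this grammar is Ex → NAT).

```python
import random

# Productions of the Ex grammar; keys are the non-terminals.
GRAMMAR = {
    "Ex": [["NAT"], ["(", "Ex", ")"], ["Ex", "+", "Ex"], ["Ex", "*", "Ex"]],
}


def derive(grammar, start, rng, budget=20):
    """Derive one word by repeatedly replacing a non-terminal occurrence."""
    omega = [start]                                   # start with omega = S
    steps = 0
    while True:
        positions = [i for i, sym in enumerate(omega) if sym in grammar]
        if not positions:                             # omega is in T*: done
            return omega
        i = rng.choice(positions)                     # pick an occurrence of A
        options = grammar[omega[i]]
        if steps >= budget:
            # Assumption for this sketch: the shortest production contains
            # no non-terminal, so choosing it forces termination.
            alpha = min(options, key=len)
        else:
            alpha = rng.choice(options)               # pick A -> alpha in P
        omega[i:i + 1] = alpha                        # replace A by alpha
        steps += 1


if __name__ == "__main__":
    print(" ".join(derive(GRAMMAR, "Ex", random.Random(0))))
```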

15 Context-Free Derivations Definition: Let G = (N, T, P, S) be a context-free grammar. ψ is directly derivable in G from φ (written as φ ⇒ ψ) if there are A, α, σ, τ with φ = σAτ, A → α ∈ P, and ψ = σατ. ψ is derivable in G from φ (written as φ ⇒* ψ) if there are φ0, φ1,..., φn with φ = φ0, ψ = φn, and φi ⇒ φi+1 for 0 ≤ i < n. φ0,..., φn is called a derivation of ψ from φ. Note: ⇒* is the reflexive-transitive closure of ⇒.
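A direct derivation step replaces one non-terminal occurrence in a sentential form by the right-hand side of one of its productions. The sketch below spells this out for the Ex grammar used in these slides; sentential forms are represented as tuples of symbols, and the names `step`, `replay`, and `CHOICES` are made up for the illustration. It replays a derivation of ( NAT ) + NAT * NAT.

```python
# Productions of the Ex grammar as (left-hand side, right-hand side) pairs.
PRODUCTIONS = {
    ("Ex", ("NAT",)),
    ("Ex", ("(", "Ex", ")")),
    ("Ex", ("Ex", "+", "Ex")),
    ("Ex", ("Ex", "*", "Ex")),
}


def step(phi, pos, alpha):
    """One direct derivation step: phi = sigma A tau  =>  sigma alpha tau."""
    lhs = phi[pos]
    if (lhs, alpha) not in PRODUCTIONS:
        raise ValueError(f"{lhs} -> {' '.join(alpha)} is not a production")
    return phi[:pos] + alpha + phi[pos + 1:]


def replay(phi, choices):
    """Apply a list of (position, right-hand side) choices, keeping all forms."""
    forms = [phi]
    for pos, alpha in choices:
        forms.append(step(forms[-1], pos, alpha))
    return forms


# Ex => Ex * Ex => Ex + Ex * Ex => Ex + Ex * NAT => ( Ex ) + Ex * NAT => ...
CHOICES = [
    (0, ("Ex", "*", "Ex")),
    (0, ("Ex", "+", "Ex")),
    (4, ("NAT",)),
    (0, ("(", "Ex", ")")),
    (1, ("NAT",)),
    (4, ("NAT",)),
]

if __name__ == "__main__":
    for form in replay(("Ex",), CHOICES):
        print(" ".join(form))
```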

16 Context-Free Languages Definition: Let G = (N, T, P, S) be a context-free grammar. The language generated by G is defined as L(G) = {x ∈ T* | S ⇒ * x}. x ∈ L(G) is also called a sentence. φ ∈ (N ∪ T)* is a sentential form if S ⇒ * φ.
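L(G) is usually infinite, but its sentences up to a given length can be enumerated by searching through sentential forms. The sketch below does this breadth-first for the Ex grammar. It relies on an assumption that holds for this grammar but not in general: there are no ε-productions, so sentential forms never shrink and any form longer than `max_len` can be discarded. Expanding only the leftmost non-terminal is enough, because every sentence has a leftmost derivation.

```python
from collections import deque

GRAMMAR = {
    "Ex": [["NAT"], ["(", "Ex", ")"], ["Ex", "+", "Ex"], ["Ex", "*", "Ex"]],
}


def sentences(grammar, start, max_len):
    """Return all sentences of length <= max_len (assumes no epsilon rules)."""
    seen = {(start,)}
    queue = deque([(start,)])
    result = set()
    while queue:
        form = queue.popleft()
        nts = [i for i, sym in enumerate(form) if sym in grammar]
        if not nts:
            result.add(form)          # only terminals left: a sentence
            continue
        i = nts[0]                    # expand the leftmost non-terminal
        for alpha in grammar[form[i]]:
            new = form[:i] + tuple(alpha) + form[i + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result


if __name__ == "__main__":
    for s in sorted(sentences(GRAMMAR, "Ex", 3)):
        print(" ".join(s))
```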

17 Derivations can be represented by derivation trees.
Definition: Let G = (N, T, P, S) be a context-free grammar. A tree is a derivation tree for G if:
–every node is labelled with a symbol of N ∪ T;
–the root is labelled with S;
–if a node n is labelled with A and has at least one descendant, then A ∈ N;
–if nodes n1,..., nk with labels X1,..., Xk are the direct descendants (in this order) of a node n labelled with A, then A → X1 ... Xk ∈ P.
Note: Derivation trees are also called parse trees or (concrete) syntax trees.

18 Derivations can be represented by derivation trees: Pop-Quiz. Consider the grammar P ={Ex → NAT | (Ex) | Ex + Ex | Ex * Ex} N ={Ex}, S = Ex, T = {NAT, (, ), +, *} and construct a derivation tree for the derivation Ex ⇒ Ex * Ex ⇒ Ex + Ex * Ex ⇒ Ex + Ex * NAT (3) ⇒ ( Ex ) + Ex * NAT (3) ⇒ ( NAT (1) ) + Ex * NAT (3) ⇒ ( NAT (1) ) + NAT (2) * NAT (3)
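One possible answer to the pop-quiz can be written down and checked mechanically. The sketch below uses a hypothetical `Node` class (not part of the slides) to build the tree in which * is at the root and + is its left child, verifies that it satisfies the derivation-tree definition, and reads off the derived sentence from the leaves.

```python
from dataclasses import dataclass, field

GRAMMAR = {
    "Ex": [["NAT"], ["(", "Ex", ")"], ["Ex", "+", "Ex"], ["Ex", "*", "Ex"]],
}


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def is_derivation_tree(root, grammar, start):
    """Check the root label and that every inner node applies a production."""
    def ok(node):
        if not node.children:
            return True
        labels = [child.label for child in node.children]
        return (labels in grammar.get(node.label, [])
                and all(ok(child) for child in node.children))
    return root.label == start and ok(root)


def frontier(node):
    """Concatenate the leaf labels from left to right."""
    if not node.children:
        return [node.label]
    return [sym for child in node.children for sym in frontier(child)]


def nat():
    return Node("Ex", [Node("NAT")])


# Tree for ( NAT ) + NAT * NAT with * at the root.
paren = Node("Ex", [Node("("), nat(), Node(")")])
plus = Node("Ex", [paren, Node("+"), nat()])
tree = Node("Ex", [plus, Node("*"), nat()])

if __name__ == "__main__":
    print(is_derivation_tree(tree, GRAMMAR, "Ex"), " ".join(frontier(tree)))
```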

20 Ambiguity Each derivation corresponds to one derivation tree,... but the same tree can be derived in different ways. Ex ⇒ Ex * Ex ⇒ Ex + Ex * Ex ⇒ (Ex) + Ex * Ex ⇒ (NAT (1) ) + Ex * Ex ⇒ (NAT (1) ) + NAT (2) * Ex ⇒ (NAT (1) ) + NAT (2) * NAT (3) leftmost derivation Ex ⇒ Ex * Ex ⇒ Ex * NAT (3) ⇒ Ex + Ex * NAT (3) ⇒ Ex + NAT (2) * NAT (3) ⇒ (Ex) + NAT (2) * NAT (3) ⇒ (NAT (1) ) + NAT (2) * NAT (3) rightmost derivation P = {Ex → NAT | (Ex) | Ex + Ex | Ex * Ex} N = {Ex}, S = Ex, T = {NAT, (, ), +, *}

21 Left-/rightmost derivations Definition: Let G = (N, T, P, S) be a context-free grammar. φ ⇒l ψ is a leftmost derivation step if there are u, τ with φ = uAτ, A → α ∈ P, and ψ = uατ. φ ⇒r ψ is a rightmost derivation step if there are σ, u with φ = σAu, A → α ∈ P, and ψ = σαu. (By the notation convention, u ∈ T*, so A is the leftmost respectively rightmost non-terminal in φ.) φ0,..., φn is called a left-/rightmost derivation if every step is a left-/rightmost derivation step. Note: For each syntax tree, there exists exactly one leftmost derivation and exactly one rightmost derivation.
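The note that each syntax tree has exactly one leftmost and one rightmost derivation can be made concrete: both derivations can be read off the tree by always expanding the leftmost (or rightmost) node that still has children. The following sketch assumes a tree encoding chosen for this illustration, namely pairs `(label, children)` in which leaves have an empty child list.

```python
def derivation(tree, leftmost=True):
    """Return the sentential forms of the left-/rightmost derivation of tree."""
    form = [tree]                                  # subtrees still to be shown
    steps = [[t[0] for t in form]]
    while True:
        pending = [k for k, t in enumerate(form) if t[1]]
        if not pending:                            # only leaves remain
            return steps
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1]                 # replace node by children
        steps.append([t[0] for t in form])


def leaf(symbol):
    return (symbol, [])


# Tree for ( NAT ) + NAT * NAT with * at the root.
NAT_EX = ("Ex", [leaf("NAT")])
PAREN = ("Ex", [leaf("("), NAT_EX, leaf(")")])
PLUS = ("Ex", [PAREN, leaf("+"), NAT_EX])
TREE = ("Ex", [PLUS, leaf("*"), NAT_EX])

if __name__ == "__main__":
    for name, flag in (("leftmost", True), ("rightmost", False)):
        print(name + ":")
        for form in derivation(TREE, leftmost=flag):
            print("  " + " ".join(form))
```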

22 Ambiguity, again
Leftmost derivation: Ex ⇒l Ex * Ex ⇒l Ex + Ex * Ex ⇒l (Ex) + Ex * Ex ⇒l (NAT (1) ) + Ex * Ex ⇒l (NAT (1) ) + NAT (2) * Ex ⇒l (NAT (1) ) + NAT (2) * NAT (3)
Another leftmost derivation ??: Ex ⇒l Ex + Ex ⇒l (Ex) + Ex ⇒l (NAT (1) ) + Ex ⇒l (NAT (1) ) + Ex * Ex ⇒l (NAT (1) ) + NAT (2) * Ex ⇒l (NAT (1) ) + NAT (2) * NAT (3)
⇒ syntax trees must be different!
P = {Ex → NAT | (Ex) | Ex + Ex | Ex * Ex} N = {Ex}, S = Ex, T = {NAT, (, ), +, *}

23 Ambiguity of sentences, grammars, and languages Definition: A sentence is called unambiguous if it has exactly one syntax tree; it is called ambiguous otherwise. Definition: A grammar G is called ambiguous if L(G) contains an ambiguous sentence. Definition: A language for which every grammar is ambiguous is called inherently ambiguous. Note: There are inherently ambiguous context-free languages (Parikh’s Theorem, 1961). Programming language grammars should be unambiguous because semantics and translation are defined over the structure of the syntax trees.
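For a small grammar, the number of syntax trees of a sentence can simply be counted. The sketch below is hard-wired to the Ex grammar from the earlier slides (it is not a general parser) and counts the trees of a token list by trying every production on every sub-range, with memoization. A count above 1 means the sentence, and hence the grammar, is ambiguous.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count syntax trees for Ex -> NAT | ( Ex ) | Ex + Ex | Ex * Ex."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of trees deriving toks[i:j] from Ex."""
        total = 0
        if j - i == 1 and toks[i] == "NAT":                    # Ex -> NAT
            total += 1
        if j - i >= 3 and toks[i] == "(" and toks[j - 1] == ")":
            total += count(i + 1, j - 1)                       # Ex -> ( Ex )
        for k in range(i + 1, j - 1):                          # Ex -> Ex op Ex
            if toks[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))


if __name__ == "__main__":
    print(count_trees(["(", "NAT", ")", "+", "NAT", "*", "NAT"]))
```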

24 Ambiguity, again: “dangling else” Consider the following C statement: if(a) if(b) c=1; else c=2; Which if owns the else ? Pop-Quiz: Draw the two syntax trees! Pop-Quiz: How would you implement the C way?

25 Resolving ambiguity: “dangling else”
Consider the following original production:
stmt → if (expr) stmt; | if (expr) stmt else stmt; | other
Solution: introduce two new stmt-like non-terminals for the two different contexts.
stmt → stmt_u | stmt_b
stmt_u → if (expr) stmt; | if (expr) stmt_b else stmt_u;
stmt_b → if (expr) stmt_b else stmt_b; | other
The statement between the condition and the else is a stmt_b, so it cannot be an unbalanced if (one without matching else).
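Besides rewriting the grammar, a hand-written recursive-descent parser can obtain the C behaviour by consuming an else greedily as soon as it sees one, which attaches it to the nearest open if. The sketch below illustrates this under strong simplifying assumptions: conditions and simple statements are single, pre-scanned tokens, parentheses and semicolons are dropped, and the function name `parse_stmt` is hypothetical.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement from tokens[pos:]; return (tree, next position)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 2)
        # Greedy choice: an else directly after the then-branch belongs to
        # this (innermost) if.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    return tokens[pos], pos + 1          # any other token is a simple statement


if __name__ == "__main__":
    # if(a) if(b) c=1; else c=2;
    tree, _ = parse_stmt(["if", "a", "if", "b", "c=1", "else", "c=2"])
    print(tree)
```

The resulting tree nests the else inside the inner if, which is the interpretation the rewritten grammar enforces as well.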

26 Resolving ambiguity: priorities
Consider the following grammar:
S → E
E → E + E | E - E | E * E | (E) | id
This grammar does not respect the usual operator priorities (all operators are at the same precedence level).
Solution: introduce a new non-terminal for each precedence level.
S → E
E → T + E | T - E | T
T → F * T | F
F → (E) | id
Precedence levels increase from E to T to F.

27 Resolving ambiguity: associativity
Consider the following grammar:
S → E
E → T + E | T - E | T
T → F * T | F
F → (E) | id
This grammar does not respect the usual operator associativity (it interprets a - b - c as a - (b - c)).
Solution: use left (right) recursion for left- (right-) associative operators.
S → E
E → E + T | E - T | T
…
Note: left-recursive productions are not compatible with LL parsers (use %left directive).
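The effect of the stratified, left-associative grammar can be illustrated with a small hand-written parser. Because left recursion cannot be used directly in a top-down parser, this sketch replaces E → E + T | E - T | T by a loop that builds the tree to the left. The rule for T is elided on the slide; the sketch assumes T → T * F | F by analogy. Trees are nested tuples `(operator, left, right)`, and identifiers are plain strings.

```python
def parse(tokens):
    """Parse a token list into a nested-tuple tree; raise SyntaxError on errors."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected {tok!r} at position {pos}")
        pos += 1
        return tok

    def expr():                      # E -> E + T | E - T | T, as a loop
        node = term()
        while peek() in ("+", "-"):
            op = eat()
            node = (op, node, term())
        return node

    def term():                      # assumed: T -> T * F | F, as a loop
        node = factor()
        while peek() == "*":
            eat()
            node = ("*", node, factor())
        return node

    def factor():                    # F -> ( E ) | id
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        tok = eat()
        if not tok.isidentifier():
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input at position {pos}")
    return tree


if __name__ == "__main__":
    print(parse(["a", "-", "b", "-", "c"]))
    print(parse(["a", "+", "b", "*", "c"]))
```

With this structure a - b - c groups as (a - b) - c, and * binds tighter than + because it is handled one level further down.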

28 (Extended) Backus-Naur Form Backus-Naur form is an ASCII notation for grammars: ::= (“+” | “-”) | Many tools use an extended grammar formalism: EBNF ::= [“-”] ::= {“,” } BNF + regular operators ::= (“,” )*

29 Parsing is the search for derivations.
There are three classes of parsing algorithms that search...
exhaustively
–Earley, CYK: bottom-up and memoization
–GLR / GLL: split stacks on ambiguity
for leftmost derivations
–derive input from start symbol (top-down)
–predictive parsing, LL
for rightmost derivations
–reduce input to start symbol (bottom-up)
–shift-reduce parsing, LR

