Simplifying CFGs There are several ways in which context-free grammars can be simplified. One natural way is to eliminate useless symbols: those that cannot be part of a derivation (or parse tree). Symbols may be useless in one of two ways: they may not be reachable from the start symbol, or they may be variables that cannot derive a string of terminals.

Example of a useless symbol Consider the CFG G with rules S → aBC, B → b|Cb, C → c|cC, D → d. Here the symbols S, B, C, a, b, and c are reachable, but D is not. D may be removed, along with its rule, without changing L(G).

Reachable symbols In a CFG, a symbol is reachable iff it is S, or it appears in α where A → α is a rule of the grammar and A is reachable. So in the grammar above we first find that S is reachable, then that a, B, and C are, and finally that b and c are. A symbol that is unreachable cannot be part of a derivation. It may be eliminated along with all of its rules.
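The top-down search for reachable symbols can be sketched as a small worklist algorithm. The encoding is an assumption of this sketch, not notation from the slides: a dict maps each variable to its list of right-hand sides, each RHS a list of symbols.

```python
def reachable_symbols(rules, start):
    """Worklist search for the reachable symbols of a CFG.
    `rules` maps each variable to a list of RHSs; each RHS is a list of symbols."""
    reached = {start}
    frontier = [start]
    while frontier:
        var = frontier.pop()
        for rhs in rules.get(var, []):
            for sym in rhs:
                if sym not in reached:
                    reached.add(sym)
                    if sym in rules:  # only variables can lead further
                        frontier.append(sym)
    return reached

# The example grammar S → aBC, B → b|Cb, C → c|cC, D → d:
rules = {"S": [["a", "B", "C"]],
         "B": [["b"], ["C", "b"]],
         "C": [["c"], ["c", "C"]],
         "D": [["d"]]}
print(sorted(reachable_symbols(rules, "S")))  # → ['B', 'C', 'S', 'a', 'b', 'c']
```

Only variables are expanded further, so terminals simply land in the result set; D and d never appear, matching the example.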

Another reachability example Suppose the grammar G instead had rules S → aB, B → b|Cb, C → c|cC, D → d. Then we would first see that S is reachable, then that a and B are, then that b and C are, and finally that c is. We might say in this case that S is reachable at level 0, a and B at level 1, b and C at level 2, and c at level 3.

A second kind of useless symbol Two simple inductions show that X is reachable iff S =*> αXβ for some strings α and β of symbols. A symbol X is also useless if it cannot derive a string of terminals, that is, if there is no string w of terminals such that X =*> w.

Another simplification example In the grammar with rules S → aB, B → b|BD|cC, C → cC, D → d, the symbol C cannot derive a string of terminals. So it and all rules that contain it may be eliminated, giving just S → aB, B → b|BD, D → d.

Generating strings of terminals A simple induction shows that the only symbols that can generate strings of terminals are terminal symbols, and variables A for which A → α is a rule of the grammar and every symbol of α generates a string of terminals.
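The bottom-up counterpart can be sketched as a fixed-point iteration: keep adding any variable that has some rule whose RHS is already all generating. The dict encoding is again a hypothetical representation chosen for the sketch.

```python
def generating_symbols(rules, terminals):
    """Bottom-up fixed point: a symbol generates a terminal string iff it is
    a terminal, or some rule for it has an RHS of generating symbols only."""
    gen = set(terminals)
    changed = True
    while changed:
        changed = False
        for var, rhss in rules.items():
            if var not in gen and any(all(s in gen for s in rhs) for rhs in rhss):
                gen.add(var)
                changed = True
    return gen

# The example grammar S → aB, B → b|BD|cC, C → cC, D → d:
rules = {"S": [["a", "B"]],
         "B": [["b"], ["B", "D"], ["c", "C"]],
         "C": [["c", "C"]],
         "D": [["d"]]}
print(sorted(generating_symbols(rules, {"a", "b", "c", "d"})))
# → ['B', 'D', 'S', 'a', 'b', 'c', 'd']  (C is not generating)
```

Each pass of the while loop corresponds to one "level" in the slides: terminals at level 0, then B and D, then S.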

Our example revisited In the grammar above, we would first observe that a, b, c, and d generate strings of terminals (at level 0), then that B and D do (at level 1), and finally that S does (at level 2).

Removing the two kinds of useless symbols The characterizations of the two kinds of useless symbols are similar, except that to find reachable symbols we work top down, while to find generating symbols we work bottom up. When removing useless symbols, it's important to remove unreachable symbols last, since only this order is guaranteed to leave only useful symbols at the end of the process.

Bad example of removing useless symbols Using the algorithms implicit in the above characterizations, suppose a CFG has rules S → aB, B → b|bB|CD, C → cC, D → d. We first observe that a, b, c, and d generate strings of terminals (at level 0), then that B and D do (at level 1), and finally that S does (at level 2). But removing the nongenerating symbol C, and with it the rule B → CD, makes the symbol D unreachable, so a reachability pass is still needed afterward.

Eliminating λ-rules Sometimes it is desirable to eliminate λ-rules from a grammar G. This cannot be done if λ is in L(G), but it's always possible to eliminate λ-rules from a CFG and get a grammar that generates L(G) - {λ}.

Nullable symbols Eliminating λ-rules is like eliminating useless symbols. We first define a nullable symbol A to be one such that A =*> λ. Then for every rule that contains nullable symbols on the RHS, we add versions of the rule that omit each possible subset of those symbols (except versions with an empty RHS). Finally we remove all λ-productions from the resulting grammar.

Nullability Note that λ is in L(G) iff S is nullable. In this case a CFG with S → λ as its only λ-rule can be obtained by removing all other λ-rules and then adding this rule. Otherwise, removing λ-rules gives a CFG that generates L(G) - {λ} = L(G). By a simple induction, A is nullable iff G has a rule A → λ, or G has a rule A → α where every symbol in α is nullable.

Example: removing nullable symbols Suppose G has the 9 rules S → ABC | CDB, A → λ|aA, B → b|bB, C → λ|cC, D → AC. Then A and C are nullable, as is D. Optionally deleting nullable symbols adds: S → BC | AB | B, S → DB | CB | B, A → a, C → c, D → A | C. Removing A → λ and C → λ (and not adding D → λ) gives a CFG with 16 distinct rules.
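The nullability characterization above is another fixed-point computation, and can be sketched in the same style; here an empty list encodes a λ-RHS (a representation chosen for this sketch).

```python
def nullable_variables(rules):
    """A variable is nullable iff it has a rule A → λ (empty RHS here) or a
    rule A → α with every symbol of α nullable; iterate to a fixed point."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, rhss in rules.items():
            if var not in nullable and any(all(s in nullable for s in rhs)
                                           for rhs in rhss):
                nullable.add(var)
                changed = True
    return nullable

# The 9-rule grammar from the example; [] encodes a λ-RHS:
rules = {"S": [["A", "B", "C"], ["C", "D", "B"]],
         "A": [[], ["a", "A"]],
         "B": [["b"], ["b", "B"]],
         "C": [[], ["c", "C"]],
         "D": [["A", "C"]]}
print(sorted(nullable_variables(rules)))  # → ['A', 'C', 'D']
```

Note that `all(...)` over an empty RHS is vacuously true, which is exactly the base case A → λ. S is not nullable because B is not, matching the example.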

Observations on the previous example Note that if the rule S → ABC had been replaced by S → AC, then λ would be in L(G). We'd then have to allow the rule S → λ into the simplified grammar to generate all of L(G). Our algorithm for eliminating λ-rules has the annoying property that it introduces rules with a single variable on the RHS.

Unit productions Productions of the form A → B, where B is a variable, are called unit productions. Unit productions can be eliminated from a CFG: in all cases where A =*> B => α, a rule of the form A → α must be added.

When does A =*> B? This requires finding all cases where A =*> B for nonterminals A and B. But a version of our usual BFS algorithm will do the trick: we have that A =*> B (using only unit productions) iff A → B is a rule of the grammar, or A → C is a rule of the grammar and C =*> B. The =*> relation may be represented in a dependency graph.
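A minimal sketch of computing this relation as a transitive closure of the unit-production pairs (the dict encoding is an assumption of the sketch):

```python
def unit_pairs(rules):
    """All pairs (A, B) of variables with A =*> B using unit productions only:
    the transitive closure of the relation {(A, B) : A → B is a rule}."""
    variables = set(rules)
    pairs = {(a, rhs[0]) for a, rhss in rules.items()
             for rhs in rhss if len(rhs) == 1 and rhs[0] in variables}
    changed = True
    while changed:          # close under composition: A =*> B and B =*> C
        changed = False
        for (a, b) in list(pairs):
            for (c, d) in list(pairs):
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs

# The expression grammar E → E+T | T, T → T*F | F, F → x | (E):
rules = {"E": [["E", "+", "T"], ["T"]],
         "T": [["T", "*", "F"], ["F"]],
         "F": [["x"], ["(", "E", ")"]]}
print(sorted(unit_pairs(rules)))  # → [('E', 'F'), ('E', 'T'), ('T', 'F')]
```

The direct rules give (E, T) and (T, F) at level 0, and the closure adds (E, F) at level 1, as in the example on the next slide.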

Eliminating unit productions -- example Consider the familiar grammar with rules E → E+T | T, T → T*F | F, F → x | (E). Here we have that E =*> T and T =*> F (at level 0), and E =*> F (at level 1). Eliminating unit productions gives the new rules E → E+T | T*F | x | (E), T → T*F | x | (E), F → x | (E).

Order of steps when simplifying To eliminate useless symbols, λ-productions, and unit productions safely from a CFG, we need to apply the simplification algorithms in an appropriate order. A safe order is: first λ-productions, then unit productions, and then useless symbols (nongenerating symbols before unreachable symbols).

Chomsky normal form One additional way to simplify a CFG is to simplify each RHS. A CFG is in Chomsky normal form (CNF) iff each production has a RHS that consists of either a single terminal symbol or two variable symbols. Fact: for any CFG G with λ not in L(G), there is an equivalent grammar G1 in CNF.

Converting to Chomsky normal form A CFG that doesn't generate λ may be converted to CNF by first eliminating all λ-rules and unit productions. This will give a grammar where each RHS of length 1 consists of a lone terminal. Any RHS of length k > 2 may be broken up by introducing k-2 new variables. For any terminal a that remains on a longer RHS, we add a new variable Ca and the new rule Ca → a.

Converting to CNF: an example For example, the rule S → AbCD in a CFG G can be replaced by S → AX, X → bY, Y → CD. Here we don't change L(G). After the remaining steps, the new rules would be S → AX, X → CbY, Y → CD, Cb → b. Again we don't change L(G).
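These two steps — wrapping terminals on long RHSs and splitting RHSs of length k > 2 — can be sketched as follows, assuming λ-rules and unit productions are already gone. The names C_b, X1, X2 play the roles of Cb, X, Y in the example above; the encoding, and the convention that a symbol counts as a variable iff it has rules, are assumptions of the sketch.

```python
def to_cnf(rules):
    """Last two CNF steps: wrap stray terminals, then split long RHSs."""
    new_rules = {v: [] for v in rules}
    counter = iter(range(1, 10**6))   # source of fresh variable names

    def wrap(t):
        # Terminal on a long RHS: introduce C_t → t once and reuse it.
        name = "C_" + t
        if name not in new_rules:
            new_rules[name] = [[t]]
        return name

    for var, rhss in rules.items():
        for rhs in rhss:
            if len(rhs) == 1:
                new_rules[var].append(rhs)   # lone terminal: already CNF
                continue
            syms = [s if s in rules else wrap(s) for s in rhs]
            head = var
            while len(syms) > 2:             # k-2 fresh variables for length k
                nxt = "X" + str(next(counter))
                new_rules.setdefault(head, []).append([syms[0], nxt])
                head, syms = nxt, syms[1:]
            new_rules.setdefault(head, []).append(syms)
    return new_rules

# The slide's S → AbCD (with stub rules so A, C, D count as variables):
rules = {"S": [["A", "b", "C", "D"]],
         "A": [["a"]], "C": [["c"]], "D": [["d"]]}
out = to_cnf(rules)
# out now contains S → A X1, X1 → C_b X2, X2 → C D, C_b → b
```

This reproduces the slide's result S → AX, X → CbY, Y → CD, Cb → b up to renaming.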

A more complete example Consider the grammar with rules E → E + T | T * F | x, T → T * F | x, F → x. The last rule for each symbol is legal in CNF. We may replace E → E + T by E → EX, X → C+T, C+ → +; replace E → T * F by E → TY, Y → C*F, C* → *; and replace T → T * F by T → TZ, Z → C*F, C* → *.

The resulting grammar The resulting CFG is in CNF, with rules E → EX | TY | x, T → TZ | x, F → x, X → C+T, Y → C*F, Z → C*F (or Z could be replaced by Y), C+ → +, and C* → *.

Compact parse trees for CNF grammars Claim: a parse tree of height n for a CFG in CNF must have a yield of length at most 2^(n-1). Note that the number of nodes at level k can be at most twice the number on level k-1, giving a maximum of 2^k nodes at level k for k < n. And at level n, all nodes are terminals, and for a CFG in CNF they can't have siblings. So there are just as many nodes at level n as variables at level n-1, and this number is at most 2^(n-1).

Parsing and the membership question It's common to ask, for a CFG G and a string x, whether x ∈ L(G). We already have a nondeterministic algorithm for this question: for example, we may guess which rules to apply until we have constructed a parse tree yielding x.

A deterministic membership algorithm One reasonably efficient membership algorithm (the CYK algorithm) works bottom up. It's easiest to state for the case when G is in CNF. This assumption about G doesn't result in any loss of generality, but it does make the algorithm more efficient.

The CYK algorithm The CYK algorithm asks, for every substring wij of w, which variables generate wij. Here wij is the substring of w starting at position i and ending at position j. If i = j (and the grammar is in CNF), we need only check which rules have the single terminal wij on their RHS. Otherwise, we know that the first rule of a derivation for wij must be of the form A → BC.

Parsing bottom up with CYK So we have that A =*> wij iff there are B, C, and k such that B =*> wik, C =*> w(k+1)j, and A → BC is a rule of the grammar. Because the algorithm works bottom up, by the time we reach wij we already know which variables generate each of its shorter substrings. The CYK algorithm saves these variables in a 2D table, indexed by i and j.

The table for CYK parsing To fill the table entry for wij, we need only look in the table for the variables that generate wik, look in the table for those that generate w(k+1)j, and then see which rules of G have on their RHSs a variable from the first group followed by a variable from the second group. Of course, we need to do this for all possible values of k from i to j-1.
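The whole table-filling procedure can be sketched as follows, using 0-based indices so that table[i][j] plays the role of the entry for wij. The small CNF grammar used here (S → AB, A → BB | a, B → AB | b) is the one implied by the parse tree on the later slide; treating it as the slides' full example grammar is an assumption.

```python
def cyk(rules, w, start):
    """Fill the CYK table bottom up for a CFG in CNF.
    table[i][j] holds the variables generating w[i..j] (0-based, inclusive)."""
    n = len(w)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):                     # substrings of length 1
        for var, rhss in rules.items():
            if [ch] in rhss:
                table[i][i].add(var)
    for length in range(2, n + 1):                 # then longer substrings
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                  # split into w[i..k], w[k+1..j]
                for var, rhss in rules.items():
                    for rhs in rhss:
                        if (len(rhs) == 2 and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(var)
    return start in table[0][n - 1]

# S → AB, A → BB | a, B → AB | b:
rules = {"S": [["A", "B"]],
         "A": [["B", "B"], ["a"]],
         "B": [["A", "B"], ["b"]]}
print(cyk(rules, "bbaab", "S"))   # → True
print(cyk(rules, "ba", "S"))      # → False
```

The triple loop over (length, i, k) is exactly the "all values of k from i to j-1" step described above, and `start in table[0][n - 1]` is the corner-entry check of the next slides.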

Decoding the CYK table The algorithm concludes that w is in L(G) iff S is one of the variables that generate w1n, where n = |w|. This strategy of computing and saving answers to all possible subproblems, without checking whether the subproblems are relevant, is known as dynamic programming.

A CYK example The CYK table for G as in Linz, p. 172, and w = bbaab, was given in the original slide (a 5-by-5 table indexed by i and j; its entry for the full string contains B and S). Since S is one of the variables in the table entry for w1n = w, we may say that w is in the language.

Parse trees from CYK tables The table as we have built it does not give a parse tree, or say whether there is more than one. To recover the parse tree(s), we would have to save information about how each entry got into the table. The analogous observation holds in general for dynamic programming algorithms.

The parse tree for our CYK example For the given example, the S in the corner arises only from S → AB with k = 2. Continuing with A and B gives the parse tree below.

          S
        /   \
       A     B
      / \   / \
     B   B A   B
     |   | |  / \
     b   b a A   B
             |   |
             a   b

CYK time complexity If n = |w|, filling each of the Θ(n^2) table entries takes time O(n), for time O(n^3) in all. In fact, CYK takes time Θ(n^3), since Θ(j-i) time is needed to fill the i,j entry. The time complexity also depends on the size of the grammar, since the set of rules must be repeatedly traversed. CYK takes time Θ(n^2) on an unambiguous CFG.