
Transformation Schemes for Context-Free Grammars: Structural, Algorithmic, Linguistic Applications
Eli Shamir, Hebrew University of Jerusalem, Israel
ISCOL, Haifa University, September 2014

Overview
- CFG: devices producing strings and their derivation trees (with weights).
- Top-down schemes transforming the grammars, driven by a rotations operations-tree (BOT).
- Preserving derivation trees and semi-ring weights.
- Enhancing: property tests, parsing and optimal-tree algorithms; time down to O(n²), space down to O(n).
- Decomposition of bounded-ambiguity grammars (Sam Eilenberg's question [SE]).
- Non-expansive [NE] (quasi-rational) grammars.
- Implications for NLP, sequence alignment, …

Schemes - simple to subtle
- Chomsky normal form (CNF); elimination of redundant symbols and ε-rules.
- Greibach normal form (GNF) (subtle): all rules are A → Tx, T a terminal (lexicalization).
- GNF destroys derivation trees, yet has many applications (structural…).
- Schemes for sub-classes of CFG (in parsing technology): deterministic, LR(k), …
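
As a worked illustration (our example, not from the slides), the language { a^n b^n : n ≥ 1 } can be written in both normal forms:

    CNF:  S → AX | AB,  X → SB,  A → a,  B → b
    GNF:  S → aSB | aB,  B → b

Every CNF rule is of the form A → BC or A → a; every GNF rule starts with a terminal followed by non-terminals, which is the lexicalization property mentioned above.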

Context Free Basics 1
Such a grammar G = (V, T, P, S = root) is a well-known model to derive/generate a set of terminal strings in T*. G defines a derivation relation between strings over V ∪ T: one step x ⇒ y means y is obtained from x by rewriting a single occurrence of some A by B1…Bk, where A → B1…Bk is a production rule in P. Several steps: x ⇒* y if x ⇒ x1 ⇒ … ⇒ y. L_A(G) = { w ∈ T* | A ⇒* w }, and L(G) = L_S(G) is the language generated by G. A derivation is best described by a labeled tree in which the k sons of a node labeled A are labeled B1, …, Bk.
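
The derivation relation above is easy to animate. Here is a minimal sketch (our own encoding, with a hypothetical example grammar for { a^n b^n }) that enumerates the terminal strings the start symbol derives, up to a length bound.

    # A minimal sketch: a CFG as a Python dict and a breadth-first enumeration
    # of the terminal strings it derives (encoding and grammar are ours).
    from collections import deque

    productions = {"S": [("a", "S", "b"), ("a", "b")]}   # S -> aSb | ab
    terminals = {"a", "b"}

    def one_step(sentential):
        """Yield every string obtained by rewriting one non-terminal occurrence."""
        for i, sym in enumerate(sentential):
            if sym in productions:                       # sym is a non-terminal A
                for rhs in productions[sym]:             # rule A -> B1..Bk
                    yield sentential[:i] + rhs + sentential[i + 1:]

    def derive(start="S", max_len=8):
        """Collect terminal strings w with start =>* w, up to a length bound."""
        seen, results = set(), set()
        queue = deque([(start,)])
        while queue:
            sent = queue.popleft()
            if len(sent) > max_len or sent in seen:
                continue
            seen.add(sent)
            if all(s in terminals for s in sent):
                results.add("".join(sent))
            else:
                queue.extend(one_step(sent))
        return sorted(results, key=len)

    print(derive())   # ['ab', 'aabb', 'aaabbb', 'aaaabbbb']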

Context Free Basics 2
Ambiguity-deg(A ⇒* w) = the number of distinct derivation trees for (A ⇒* w); deg(G_A) = max over w of deg(A ⇒* w). A ⇒* -B- defines a partial order on V ∪ T, denoted A > B; it induces a complete order on any branch of a derivation tree. B in G is pumping if B > B' > B; then B' is also pumping, and both belong to the pumping equivalence class [B].

Node Type and Spread Lemma
Node types: (i) B pumping; (ii) C pre-terminal, if NOT {C > B for some pumping B}; (iii) D spread, if D is not pumping but D > B for some pumping B.
SPREAD LEMMA:
1. A pre-terminal C derives a bounded number of bounded terminal strings.
2. In each derivation tree, a spread node D derives a bounded sub-tree whose leaves are terminals or pump nodes.
3. In G, each spread symbol D derives a bounded number of such sub-trees, as mentioned in 2.

Non-Expansive Grammars
G is non-expansive (NE) if no production rule has the form B → -B'-B''- where the B's are from the same pumping class. Equivalently, no derivation B ⇒* -B-B- is possible (sideway pumping is forbidden!). NE is the quasi-rational class, the substitution closure of linear grammars [1]. Our BOT scheme simplifies proofs of its known properties and of new ones (parsing speed).
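
The rule-based NE test lends itself to a direct check. Below is a minimal sketch (our own dict-of-tuples grammar encoding, not the paper's notation) that computes the '>' reachability relation, the pumping classes, and then looks for a right-hand side carrying two symbols from the class of its left-hand side.

    # A minimal sketch: test the non-expansive (NE) condition via pumping classes.
    def reachable(productions):
        """gt[A] = set of non-terminals B with A > B (A derives ...B... in >= 1 step)."""
        nts = set(productions)
        edges = {A: {X for rhs in rules for X in rhs if X in nts}
                 for A, rules in productions.items()}
        closure = {A: set(edges[A]) for A in nts}
        changed = True
        while changed:                                   # naive transitive closure
            changed = False
            for A in nts:
                new = set()
                for B in closure[A]:
                    new |= closure.get(B, set())
                if not new <= closure[A]:
                    closure[A] |= new
                    changed = True
        return closure

    def is_non_expansive(productions):
        gt = reachable(productions)
        pumping = {A for A in productions if A in gt[A]}
        pump_class = {A: {B for B in pumping if B in gt[A] and A in gt[B]} | {A}
                      for A in pumping}
        for A in pumping:
            for rhs in productions[A]:
                # expansive if one right-hand side has two occurrences from [A]
                if sum(1 for X in rhs if X in pump_class[A]) >= 2:
                    return False
        return True

    # S -> SS | a is expansive; S -> aSb | c is non-expansive (linear).
    print(is_non_expansive({"S": [("S", "S"), ("a",)]}))      # False
    print(is_non_expansive({"S": [("a", "S", "b"), ("c",)]}))  # True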

Bounded Operation Tree (BOT)
BOT tree-nodes are labeled by:
- the current grammar as a product Π = P_1 … P_k;
- the current operation SPREAD / CYC / TTR (depending on the type of the root of P_1 or P_k), which determines the children nodes and their labels.
Root of BOT = #G; leaves of BOT are the linear grammars G(i).
Main Claim: each derivation tree for w w.r.t. #G is mapped onto a derivation tree for ƍw w.r.t. some G(i) (with the same weight), and vice versa.

SPREAD / CYC / TTR Operations
- Type = SPREAD: P_k is split into ∪ Q(j); the current grammar at the j'th child is P_1 … P_{k-1} * Q(j).
- Type = CYC: P_k is terminal; the (effective) current grammar at the single child is P_k P_1 … P_{k-1}.
- Type = TTR, if the root of P_k is pumping: let M = P_1 … P_{k-1} and N = P_k; the top trunk of N is rotated by 180° and mounted on M, so MN → M*N^.

Figure 1.1: Top Trunk Rotation of MN to (M*N^). For strings, m x_1 x_2 … n^ … y_2 y_1 is rotated to … y_2 y_1 m x_1 x_2 … n^; for trees, the top trunk of N (with exit node N^) is rotated by 180° and mounted on M, giving M*.

Figure 1.2: TTR for grammars.
    N grammar (top trunk)        M* grammar
    B → B'C                      B' → CB
    B → DB'                      B' → BD
    B → B^, B^ → α               B → root(M), root(M) → α
All other productions carry over from N to M*; those of M are unchanged. The TTR rotation is invertible, one-one and onto for the derivation trees, preserving ambiguity in a 'cyclic rotated' sense.

Termination and Correctness
TTR operations dominate the BOT scheme for NE grammars. The E-depth of N^ and of the two sides of the mounted trunk must decrease. The M* factors become taller and thinner until they become linear grammars G(i) [without spread symbols].
Claim: each derivation tree for w w.r.t. #G is mapped to a derivation tree for ƍw w.r.t. some G(i) (with the same weight), and vice versa; here ƍw is a cyclic rotation of w. This holds for each SPREAD/CYC/TTR step!

Tabular Dynamic Programming
For parsing G (CYK/Earley algorithms), for a terminal string w of length n, the table extends to items over rotated intervals [i+1, i+k (mod n), A → BC] at the same cost. For a linear G(i) the total time cost is only O(n²). The space cost is O(n): one or a few diagonals of width near k are kept in memory, with pointers to a few neighbors, enabling table reconstruction. Membership alone, or the total-weight algorithm, is in the parallel class NC(1), as for finite-state transductions.
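
For reference, here is a minimal baseline sketch of the ordinary CYK table (our own encoding, grammar assumed to be in CNF); the rotated-interval extension described above is not reproduced, only the tabular items [i, j, A] that it rotates.

    # A baseline sketch: standard CYK over ordinary intervals, grammar in CNF.
    def cyk(word, unit_rules, binary_rules, start="S"):
        """unit_rules: {A: set of terminals}; binary_rules: {A: set of (B, C) pairs}."""
        n = len(word)
        # table[i][j] holds the non-terminals deriving word[i : i + j + 1]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, a in enumerate(word):                     # spans of length 1
            table[i][0] = {A for A, ts in unit_rules.items() if a in ts}
        for length in range(2, n + 1):                   # longer spans
            for i in range(n - length + 1):
                for split in range(1, length):
                    left = table[i][split - 1]
                    right = table[i + split][length - split - 1]
                    for A, pairs in binary_rules.items():
                        if any(B in left and C in right for B, C in pairs):
                            table[i][length - 1].add(A)
        return start in table[0][n - 1]

    # Hypothetical CNF grammar for { a^n b^n : n >= 1 }:
    #   S -> AX | AB,  X -> SB,  A -> a,  B -> b
    unit = {"A": {"a"}, "B": {"b"}}
    binary = {"S": {("A", "X"), ("A", "B")}, "X": {("S", "B")}}
    print(cyk("aaabbb", unit, binary))   # True
    print(cyk("aabbb", unit, binary))    # False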

Example (from [4])
(M)(N) = (u I u^R)(v J v^R), with u, v ∈ {0,1}*, I = J, and u^R the reversal of u. It has unbounded "direct (product) ambiguity", which increases the time in Earley's algorithm. But after one TTR step, MN is rotated to (M*)(N^) = (v^R u I u^R v)(J), which has a linear grammar (of unbounded ambiguity degree), and all product-ambiguity trees are rotated to a union of trees for the linear M*N^.

Decomposing Bounded Ambiguity
SE Claim: if Ambiguity-deg(G) = l < ∞, then L(G) is a bounded-size union of languages of deg-1 grammars. This provides a positive answer to a question posed by Sam Eilenberg. "Bounded size" means polynomial in |G|, the size of the grammar G, and in l.

Expansive G and Ambiguity
If G is expansive, then each pump symbol has ambiguity degree 1 or unbounded (exponential in the length): from B ⇒* --B--B--…--B-- (k times), if deg B ≥ 2 then the k-fold pumping gives degree ≥ 2^k. This is a cornerstone in the proof of SE. Extending ambiguity to cyclic-closed strings is helpful (cf. the last slides).
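
Spelling out the counting step behind the exponential bound (the names w_0 and x_i below are ours, not the slides'): suppose some terminal string w_0 has two distinct derivation trees from B. In the k-fold derivation B ⇒* x_1 B x_2 B … x_k B x_{k+1}, completing the context to terminals in any fixed way, each of the k occurrences of B can be expanded by either of the two trees independently, so

\[
\deg\bigl(B \Rightarrow^{*} x_1 w_0\, x_2 w_0 \cdots x_k w_0\, x_{k+1}\bigr) \;\ge\; 2^{k},
\]

since distinct choices at any occurrence give distinct trees for one and the same string.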

Proof of SE
We briefly sketch the scheme for proving the claim. Starting with #G, and using the SPREAD LEMMA, the claim reduces to:
LEMMA. Let Π = MN(1)…N(k), with deg M = 1, deg Π = l < ∞, and the N(i) terminals or with pump roots. Then L(M) = ∪_{j∈J} L(M(j)), with deg M(j) = 1 and J bounded.
It suffices to prove it for a pair, starting with MN(1), after which the M(j)N(2) are decomposed, and so on.

Proof of SE (2)
For a pair MN, the operation TTR is used, transforming it to M*N^. Now deg M* < l and its ambiguity must be concentrated along the top trunk which it got from N. An easy direct argument shows it decomposes into a bounded union of M(j) of deg 1. As for N^, its E-depth is smaller than that of N, so for M(j)N^ we can use induction on the E-depth of the second factor or, more explicitly, continue the recursive descent on N^ until it is consumed.

Approximate G by an NE G'
Easy to achieve by duplicating symbols of the pumping classes; it also makes linguistic sense. Advantages of NE G' using the BOT scheme: view the linear G'(i) as finite-state transductions, a powerful tool in several linguistic fields. Applications to bioinformatics (stringology)? Extension of the NE condition to mildly context-sensitive models (LIG, TAG, …)?

The Hardest Context-Free Grammar
The concept is due to S. Greibach. The simplest reduction is based on Shamir's homomorphism theorem ([1]), mapping each b in T into a finite set φ(b) of strings over the vocabulary of the Dyck language, and claiming that w is in L(G) if and only if φ(w) contains a string in the Dyck language (see the description in [1]). In fact, the categorial grammar model in the 1960 article ([2]) provides another homomorphism which makes it a hardest CFG.
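
The Dyck language at the core of this reduction has a simple linear-time membership test. Here is a minimal sketch (two bracket pairs as an assumed alphabet; the grammar-specific homomorphism φ of [1] is not reproduced).

    # A minimal sketch of the Dyck-side check only: well-nestedness over two
    # assumed bracket pairs, '()' and '[]'.
    def is_dyck(w, pairs=(("(", ")"), ("[", "]"))):
        """True iff w is well-nested over the given bracket pairs."""
        openers = {o: c for o, c in pairs}
        closers = {c for _, c in pairs}
        stack = []
        for ch in w:
            if ch in openers:
                stack.append(openers[ch])      # remember the matching closer
            elif ch in closers:
                if not stack or stack.pop() != ch:
                    return False
            else:
                return False                   # symbol outside the Dyck alphabet
        return not stack

    print(is_dyck("([()[]])"))   # True
    print(is_dyck("([)]"))       # False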

However, those hardest CFG languages are inherently expansive. Indeed, an NE candidate grammar for Dyck would be negated by its BOT scheme, upon using local pump-shrinks, which for linear grammars can operate near any point of the (sufficiently long) main branch of non-terminals. We conjecture that any hardest CFG must be expansive. Note that finding a non-expansive one would entail O(n²) complexity of the membership test for any context-free grammar.

Ambiguity and Cyclic Rotation
Ambiguity in natural languages can be resolved (or created) by cyclic rotation. Consider the Bible verse in the Book of Job, chapter 6, verse 14 (six Hebrew words), translated to English: "a friend should extend mercy to the sufferer, even if he abandons God's fear." The ambiguity here is anaphoric: does the pronoun "he" refer to the sufferer or to the friend? The poetic, beautiful answer is: to both. The rotated sentences, starting at the points marked # and $ in the verse, resolve the ambiguity one way or the other. A politically loaded example: the policeman shot the boy with the gun.

References
1. J. Autebert, J. Berstel and L. Boasson, Context-free languages and pushdown automata. Chapter 3 in: Handbook of Formal Languages, Vol. 1, G. Rozenberg and A. Salomaa (eds.), Springer-Verlag.
2. Y. Bar-Hillel, H. Gaifman and E. Shamir, On categorial and phrase structure grammars. Bulletin of the Research Council of Israel, vol. 9F (1960).
3. S. Greibach, The hardest context-free language. SIAM J. on Computing 3 (1973).
4. E. Shamir, Some inherently ambiguous context-free languages. Information and Control 18 (1971).