Statistical NLP Lecture 18: Bayesian grammar induction & machine translation Roger Levy Department of Linguistics, UCSD Thanks to Percy Liang, Noah Smith, and Dan Klein for slides
Plan
1. Recent developments in Bayesian unsupervised grammar induction: nonparametric grammars; non-conjugate priors
2. A bit about machine translation
Nonparametric grammars. Motivation: how many symbols should a grammar have? This is really an open question; let the data have a say.
Hierarchical Dirichlet Process PCFG Start with the standard Bayesian picture: (Liang et al., 2007)
Grammar representation. Liang et al. use Chomsky normal-form (CNF) grammars. A CNF grammar has no ε-productions, and only has rules of the form X → Y Z [binary rewrite] or X → a [unary terminal production].
HDP-PCFG defined. Each grammar has a top-level distribution over (non-terminal) symbols. This distribution is drawn from a Dirichlet process (stick-breaking distribution; Sethuraman, 1994), so really there are infinitely many nonterminals. Each nonterminal symbol has: an emission distribution, a binary rule distribution, and a distribution over which type of rule to use.
The prior over symbols The Dirichlet Process controls expectations about symbol distributions
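As a concrete picture of the Dirichlet-process prior over symbols, here is a minimal truncated stick-breaking (GEM) sampler. This is a sketch under our own naming: `alpha` is the concentration parameter, and `n_symbols` truncates the infinite symbol sequence for practical use.

```python
import numpy as np

def stick_breaking(alpha, n_symbols, seed=None):
    """Truncated stick-breaking draw: top-level weights over
    nonterminal symbols in an HDP-PCFG.  Each Beta(1, alpha) draw
    takes a fraction of the remaining stick."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_symbols)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    # Weights sum to < 1; the leftover mass covers all unused symbols.
    return betas * remaining

w = stick_breaking(alpha=1.0, n_symbols=20, seed=0)
```

Smaller `alpha` concentrates mass on a few symbols; larger `alpha` spreads it over many, which is how the prior lets the data decide the effective number of nonterminals.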
Binary rewrite rules
Inference: variational Bayes. The tractable distribution is factored into data, top-level symbol, and rewrite components.
Results on treebank parsing: binarize the Penn Treebank and erase category labels; try to recover the label structure, then parse sentences with the resulting grammar (compared against ML estimation).
Dependency grammar induction & other priors. We'll now cover work by Noah Smith and colleagues on unsupervised dependency grammar induction. Highlight: non-conjugate priors. What types of priors are interesting to use?
Klein & Manning dependency recap
Klein and Manning's DMV: a probabilistic, unlexicalized dependency grammar over part-of-speech sequences, designed for unsupervised learning (Klein and Manning, 2004). Left and right arguments are independent; two states handle valence. Example tag sequence: $ Det Nsing Vpast Prep Adj Nsing.
Aside: Visual Notation. [Graphical-model legend: the grammar G is maximized over, the trees Y_t are integrated out, and the sentences X_t are observed.]
EM for Maximum Likelihood Estimation. E step: calculate the exact posterior given the current grammar. M step: calculate the best grammar, assuming the current posterior.
Convenient Change of Variable: replace the tree variables Y_t with derivation-event counts F_{t,e}.
EM (Algorithmic View). E step: calculate derivation event posteriors given the grammar. M step: calculate the best grammar using the event posteriors.
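The M step over derivation events can be sketched as follows. This is a simplification under an assumed layout: expected counts E[F_{t,e}] are keyed by (conditioning context, event), and each context's counts are renormalized into a multinomial.

```python
from collections import defaultdict

def m_step(expected_counts):
    """M step over derivation events: renormalize each conditioning
    context's expected event counts into a multinomial, giving the
    best grammar under the current posteriors."""
    by_context = defaultdict(dict)
    for (context, event), count in expected_counts.items():
        by_context[context][event] = count
    return {ctx: {e: c / sum(evs.values()) for e, c in evs.items()}
            for ctx, evs in by_context.items()}

# Toy expected counts for one context (labels are illustrative only):
g = m_step({("VB", "NN"): 3.0, ("VB", "PRP"): 1.0})
```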
Maximum a Posteriori (MAP) Estimation. The data are not the only source of information about the grammar. Robustness: the grammar should not have many zeroes; smooth. This can be accomplished by putting a prior U on the grammar (Chen, 1995; Eisner, 2001, inter alia). The most computationally convenient prior is a Dirichlet, with α > 1.
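Under a symmetric Dirichlet(α) prior with α > 1, the MAP estimate of each multinomial amounts to adding (α − 1) pseudo-counts before normalizing, so no event gets probability zero. A minimal sketch (function name is ours):

```python
def map_multinomial(counts, alpha=1.1):
    """MAP estimate of a multinomial under a symmetric Dirichlet(alpha)
    prior: add (alpha - 1) pseudo-counts to each event and normalize.
    With alpha > 1 every event keeps nonzero probability."""
    k = len(counts)
    total = sum(counts) + k * (alpha - 1.0)
    return [(c + alpha - 1.0) / total for c in counts]

p = map_multinomial([3, 0, 1])   # middle event was never observed
```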
MAP EM (Algorithmic View). E step: calculate derivation event posteriors given the grammar. M step: calculate the best grammar, under the prior U, using the event posteriors.
Experimental Results: EM and MAP EM. Evaluation of the learned grammar on a parsing task (unseen test data). Initialization and, for MAP, the smoothing hyperparameter u need to be chosen. This can be done with unlabeled dev data (modulo infinite cross-entropy), or labeled dev data (shown in blue). [Results table: EM vs. MAP for German, English, Bulgarian, Mandarin, Turkish, Portuguese.] Smith (2006, ch. 8)
Structural Bias and Annealing. Simple idea: use soft structural constraints to encourage structures that are more plausible. This affects the E step only; the final grammar takes the same form as usual. Here: favor short dependencies. Annealing: gradually shift this bias over time.
Algorithmic Issues Structural bias score for a tree needs to factor in such a way that dynamic programming algorithms are still efficient. Equivalently, g and b, taken together, factor into local features. Idea explored here: string distance between a word and its parent is penalized geometrically.
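The geometric distance penalty factors over individual attachments, which is what keeps dynamic programming efficient. A sketch of the per-dependency score (names and the log-space formulation are ours):

```python
import math

def biased_score(grammar_logprob, head_pos, dep_pos, delta=0.5):
    """Structural-bias sketch: penalize the string distance between a
    word and its parent geometrically (delta**distance, added in log
    space).  Because the penalty depends only on one attachment, it
    factors into local features like the grammar score g does."""
    distance = abs(head_pos - dep_pos)
    return grammar_logprob + distance * math.log(delta)

near = biased_score(-1.0, 3, 4)   # adjacent words
far = biased_score(-1.0, 3, 8)    # distance-5 dependency
```

Annealing corresponds to moving `delta` toward 1 over time, so the bias fades as learning proceeds.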
Experimental Results: Structural Bias & Annealing. Labeled dev data used to pick: initialization, hyperparameter, structural bias strength (for SB), annealing schedule (for SA). [Results table across German, English, Bulgarian, Mandarin, Turkish, Portuguese.] Smith (2006, ch. 8)
Correlating Grammar Events Observation by Blei and Lafferty (2006), regarding topic models: A multinomial over states that gives high probability to some states is likely to give high probability to other, correlated states. For us: a class that favors one type of dependents is likely to favor similar types of dependents. If Vpast favors Nsing as a subject, it might also favor Nplural. In general, certain classes are likely to have correlated child distributions. Can we build a grammar-prior that encodes (and learns) these tendencies?
Logistic Normal Distribution over Multinomials. Given: mean vector μ, covariance matrix Σ. Draw a vector η from Normal(η; μ, Σ). Apply the softmax: p_i = exp(η_i) / Σ_j exp(η_j).
Logistic Normal Distributions. [Figures: a draw η is mapped by the softmax onto the simplex p_1 + p_2 = 1; varying μ and Σ changes the induced density on the simplex.]
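The draw-then-softmax construction can be sketched in a few lines. This is a minimal sketch, not the paper's implementation; the point is that off-diagonal entries of Σ let component probabilities covary, which a Dirichlet cannot express.

```python
import numpy as np

def logistic_normal_draw(mu, sigma, seed=None):
    """Draw a multinomial from a logistic normal distribution:
    eta ~ Normal(mu, Sigma), then softmax."""
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, sigma)
    e = np.exp(eta - eta.max())          # numerically stable softmax
    return e / e.sum()

p = logistic_normal_draw([0.0, 0.0, 0.0], np.eye(3), seed=0)
```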
Logistic Normal Grammar. [Figure sequence: one normal draw η_1, …, η_n per multinomial; each is softmaxed into a distribution, and together they form the grammar g.]
Learning a Logistic Normal Grammar We use variational EM as before to achieve Empirical Bayes; the result is a learned μ and Σ corresponding to each multinomial distribution in the grammar. Variational model for G also has a logistic normal form. Cohen et al. (2009) exploit tricks from Blei and Lafferty (2006), as well as the dynamic programming trick for trees/derivation events used previously.
Experimental Results: EB. Single initializer. MAP hyperparameter value is fixed at 1.1. The LN covariance matrix is 1 on the diagonal and 0.5 for tag pairs within the same family (thirteen families, designed to be language-independent). [Results table, columns EM / MAP / EB(D) / EB(LN): English 46 … 59, Mandarin 38 … 47.] Cohen, Gimpel, and Smith (NIPS 2008); Cohen and Smith (NAACL-HLT 2009)
Shared Logistic Normals Logistic normal softly ties grammar event probabilities within the same distribution. What about across distributions? If Vpast is likely to have a noun argument, so is Vpresent. In general, certain classes are likely to have correlated parent distributions. We can capture this by combining draws from logistic normal distributions.
Shared Logistic Normal Distributions. [Figure sequence: several normal draws η_1, …, η_n with shared components are averaged, then softmaxed into a multinomial; the combined draws form the grammar g.]
What to Tie? All verb tags share components for all six distributions (left children, right children, and stopping in each direction in each state); likewise all noun tags. (Clearly, many more ideas to try!)
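The "average & softmax" combination from the preceding slides can be written compactly. A sketch under our own names: `etas` holds normal draws whose components are shared across several grammar distributions (e.g. all verb tags).

```python
import numpy as np

def shared_logistic_normal(etas):
    """Shared logistic normal combination: average the shared normal
    draws, then softmax the average into a single multinomial."""
    eta_bar = np.mean(etas, axis=0)
    e = np.exp(eta_bar - eta_bar.max())
    return e / e.sum()

# Two tied draws pulling in opposite directions average out:
p = shared_logistic_normal([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```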
Experimental Results: EB. Single initializer. MAP hyperparameter value is fixed at 1.1. Tag families used for the logistic normal and shared logistic normal models. Verb-as-parent distributions and noun-as-parent distributions are each tied in the shared logistic normal models. [Results table: EM, MAP, EB(LN), EB(SLN) for English and Mandarin.] Cohen and Smith (NAACL-HLT 2009)
Bayesian grammar induction summary. This is an exciting (though technical and computationally complex) area! Nonparametric models' ability to scale model complexity with data complexity is attractive. Since likelihood clearly won't guide us to the right grammars, exploring a wider variety of priors is also attractive. Open issue: nonparametric models constrain what types of priors can be used.
Machine translation Shifting gears…
Machine Translation: Examples
Machine Translation
Madame la présidente, votre présidence de cette institution a été marquante.
→ Mrs Fontaine, your presidency of this institution has been outstanding.
→ Madam President, president of this house has been discoveries.
→ Madam President, your presidency of this institution has been impressive.
Je vais maintenant m'exprimer brièvement en irlandais.
→ I shall now speak briefly in Irish.
→ I will now speak briefly in Ireland.
→ I will now speak briefly in Irish.
Nous trouvons en vous un président tel que nous le souhaitions.
→ We think that you are the type of president that we want.
→ We are in you a president as the wanted.
→ We are in you a president as we the wanted.
History
1950s: Intensive research activity in MT
1960s: Direct word-for-word replacement
1966 (ALPAC): NRC report on MT; conclusion: MT no longer worthy of serious scientific investigation
'Recovery period'
Resurgence (Europe, Japan)
1985-present: Gradual resurgence (US)
Levels of Transfer (Vauquois triangle): from source text to target text via increasingly abstract levels. At the bottom, direct word-level transfer (morphological analysis and generation); above it, syntactic transfer (syntactic analysis and generation); then semantic transfer (semantic analysis and generation); and at the top, an interlingua reached by semantic composition and decomposition.
General Approaches Rule-based approaches Expert system-like rewrite systems Interlingua methods (analyze and generate) Lexicons come from humans Can be very fast, and can accumulate a lot of knowledge over time (e.g. Systran) Statistical approaches Word-to-word translation Phrase-based translation Syntax-based translation (tree-to-tree, tree-to-string) Trained on parallel corpora Usually noisy-channel (at least in spirit)
The Coding View. "One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'" Warren Weaver (1955:18, quoting a letter he wrote in 1947)
MT System Components. Noisy channel: a language model P(e) generates English e; the channel P(f|e) (translation model) produces the observed French f. The decoder finds the best English sentence: argmax_e P(e|f) = argmax_e P(f|e) P(e). This finds an English translation which is both fluent (language model) and semantically faithful to the French source (translation model).
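The decoder's argmax can be sketched in log space. The candidate list and the two scoring callables below are placeholders standing in for a real language model and translation model; real decoders search a much larger hypothesis space.

```python
def decode(f, candidates, lm_logprob, tm_logprob):
    """Noisy-channel decoding sketch: pick the English hypothesis e
    maximizing log P(e) + log P(f|e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

# Toy models, assumed for illustration only:
lm = {"he adores music": -2.0, "him adore music": -8.0}.get
tm = lambda f, e: -1.0   # channel score, constant in this toy example
best = decode("il adore la musique",
              ["he adores music", "him adore music"], lm, tm)
```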
Overview: Extracting Phrases. Pipeline: sentence-aligned corpus → directional word alignments → intersected and grown word alignments → phrase table (translation model). Example phrase-table entries: cat ||| chat ||| 0.9; the cat ||| le chat ||| 0.8; dog ||| chien ||| 0.8; house ||| maison ||| 0.6; my house ||| ma maison ||| 0.9; language ||| langue ||| 0.9; …
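The `|||`-separated entries shown above can be parsed into a lookup structure in a few lines. A sketch, assuming the simple three-field layout from the slide (real phrase tables often carry several scores per entry):

```python
def load_phrase_table(lines):
    """Parse lines like 'cat ||| chat ||| 0.9' into a dict mapping a
    source phrase to a list of (target phrase, score) pairs."""
    table = {}
    for line in lines:
        src, tgt, score = (field.strip() for field in line.split("|||"))
        table.setdefault(src, []).append((tgt, float(score)))
    return table

table = load_phrase_table(["cat ||| chat ||| 0.9",
                           "the cat ||| le chat ||| 0.8"])
```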
Phrase-Based Decoding
Why Syntactic Translation? Kare ha ongaku wo kiku no ga daisuki desu → He adores listening to music. From Yamada and Knight (2001)
Two Places for Syntax? Language Model Can use with any translation model Syntactic language models seem to be better for MT than ASR (why?) Not thoroughly investigated [Charniak et al 03] Translation Model Can use any language model Linear LM can complement a tree-based TM (why?) Also not thoroughly explored, but much more work recently
Parse Tree (E) → Sentence (J). The channel applies three operations to the English parse tree: Reorder (children permuted: "he adores listening music to"), Insert (Japanese function words ha, ga, no, desu added), and Translate (English words replaced, yielding "kare ha ongaku wo kiku no ga daisuki desu").
1. Reorder: "he adores listening to music" → "he adores listening music to". P(PRP VB1 VB2 → PRP VB2 VB1) = …, P(VB TO → TO VB) = …, P(TO NN → NN TO) = …
Parameter Table: Reorder
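The reorder operation draws a permutation of a node's children from a learned distribution over child-sequence permutations, e.g. P(PRP VB1 VB2 → PRP VB2 VB1). A sketch, where the `reorder_table` dict layout is our own assumption:

```python
import random

def reorder_children(children, reorder_table, rng=None):
    """Yamada-Knight Reorder sketch: sample a permutation of a node's
    child-label sequence from the reorder table; unseen sequences
    keep their original order."""
    rng = rng or random.Random()
    dist = reorder_table.get(tuple(children), {tuple(children): 1.0})
    perms, probs = zip(*dist.items())
    return list(rng.choices(perms, weights=probs, k=1)[0])

table = {("PRP", "VB1", "VB2"): {("PRP", "VB2", "VB1"): 1.0}}
order = reorder_children(["PRP", "VB1", "VB2"], table)
```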
2. Insert: conditioning features are the parent label & node label (for position); word selection is unconditioned. P(none | TOP-VB) = …; P(right | VB-PRP) · P(ha) = …; P(right | VB-VB) · P(ga) = …; P(none | TO-TO) = …. Result: function words ha, ga, no, desu inserted.
Parameter Table: Insert
3. Translate: the conditioning feature is the English word identity. P(he → kare) = …, P(music → ongaku) = 0.900, P(to → wo) = …, P(listening → kiku) = …, P(adores → daisuki) = …
Parameter Table: Translate Note: Translation to NULL = deletion
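The translate operation, including the NULL-means-deletion convention noted above, can be sketched as a draw from P(j|e). The `t_table` dict layout is our assumption for illustration:

```python
import random

def translate_word(word, t_table, rng=None):
    """Yamada-Knight Translate sketch: replace an English leaf word
    with a target word drawn from P(j|e); drawing NULL deletes the
    word.  Unseen words pass through unchanged."""
    rng = rng or random.Random()
    dist = t_table.get(word, {word: 1.0})
    words, probs = zip(*dist.items())
    choice = rng.choices(words, weights=probs, k=1)[0]
    return None if choice == "NULL" else choice

t = {"he": {"kare": 1.0}, "the": {"NULL": 1.0}}
```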
Experiment: Y+K 03. Training corpus: 2K J-E sentence pairs. J: tokenized by ChaSen [Matsumoto et al., 1999]. E: parsed by the Collins parser [Collins, 1999], trained on 40K Treebank sentences, accuracy ~90%. E: parse trees flattened, to capture the word-order difference (SVO → SOV). EM training: 20 iterations; … min/iter (Sparc 200MHz, 1 CPU) or … sec/iter (Pentium3 700MHz, 30 CPUs).
Result: Alignments. [Table: average score and # perfect sentences for the Y/K model vs. IBM Model 5.] Averaged over 3 human judges on 50 sentences; okay (1.0), not sure (0.5), wrong (0.0); precision only.
Result: Alignment Example. [Figure: word alignments of "He adores listening to music" with "Kare ha ongaku wo kiku no ga daisuki desu" under the syntax-based model vs. IBM Model 3.]
Synchronous Grammars Multi-dimensional PCFGs (Wu 95, Melamed 04) Both texts share the same parse tree:
Synchronous Grammars. Formally: paired expansions, e.g. S → ⟨NP VP, NP VP⟩ and VP → ⟨V NP, NP V⟩ … with probabilities, of course! This gives a distribution over tree pairs. Strong assumption: constituents in one language are constituents in the other. Is this a good assumption? Why / why not?
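A paired expansion can be represented as a single rule with two right-hand sides sharing a left-hand side. A minimal data-structure sketch (the representation is ours, not a specific system's):

```python
from collections import namedtuple

# VP -> <V NP, NP V> says an English verb-object VP corresponds to an
# object-verb VP in the other language, with some probability.
SyncRule = namedtuple("SyncRule", ["lhs", "src_rhs", "tgt_rhs", "prob"])

rules = [
    SyncRule("S",  ["NP", "VP"], ["NP", "VP"], 1.0),
    SyncRule("VP", ["V", "NP"],  ["NP", "V"],  1.0),
]
```

Because both sides expand the same nonterminal in lockstep, a derivation yields a pair of trees over the two strings simultaneously.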
Synchronous Derivations (II)
Details Distinctions in lines of work are in the details: What about insertions? What about deletions? How flat can rules be? Multiple transductions of rules? Recent work (Eisner 04, Melamed 04, Chiang 05, Galley 04, others) much more flexible than early work
Top-Down Tree Transducers [next slides from Kevin Knight]. Original input: the English parse tree for "he enjoys listening to music", S(NP(PRO he), VP(VBZ enjoys, …)). Rules apply top-down, transforming the tree step by step: the subject NP(PRO he) is rewritten as kare plus a particle (wa, or alternatively ga), the VBZ enjoys contributes daisuki desu, and so on down the tree, until the final output string is kare, wa, ongaku, o, kiku, no, ga, daisuki, desu.
Top-Down Tree Transducers. A transducer rule has the form A(x0:B, x1:D, x2:E) → x0, F, x2, G, x1: subtrees matched by the variables are emitted in a new order, interleaved with fixed output symbols.
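The reordering effect of that rule shape is easy to see in code. A sketch of the slide's rule A(x0:B, x1:D, x2:E) → x0, F, x2, G, x1, with placeholder strings standing in for the matched subtrees:

```python
def apply_rule(subtrees):
    """Apply the transducer rule A(x0:B, x1:D, x2:E) -> x0, F, x2, G, x1:
    emit the bound subtrees in a new order, interleaved with the fixed
    output symbols F and G."""
    x0, x1, x2 = subtrees
    return [x0, "F", x2, "G", x1]

out = apply_rule(["B-subtree", "D-subtree", "E-subtree"])
```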
Derivation Tree for "These 7 people include astronauts coming from France and Russia":
RULE 1: DT(these) → …
RULE 2: VBP(include) → …
RULE 4: NNP(France) → …
RULE 5: CC(and) → …
RULE 6: NNP(Russia) → …
RULE 8: NP(NNS(astronauts)) → …
RULE 9: PUNC(.) → .
RULE 10: NP(x0:DT, CD(7), NNS(people)) → x0, 7, …
RULE 11: VP(VBG(coming), PP(IN(from), x0:NP)) → …, x0
RULE 13: NP(x0:NNP, x1:CC, x2:NNP) → x0, x1, x2
RULE 14: VP(x0:VBP, x1:NP) → x0, x1
RULE 15: S(x0:NP, x1:VP, x2:PUNC) → x0, x1, x2
RULE 16: NP(x0:NP, x1:VP) → x1, …, x0
The derivation builds up: France and Russia → coming from France and Russia → astronauts coming from France and Russia → these 7 people include astronauts coming from France and Russia.