A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
26 October 2008

Slide 2: Natural language inference (NLI), a.k.a. RTE
- Does premise P justify an inference to hypothesis H?
  - P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
  - H: Gazprom will double Georgia's gas bill. (answer: yes)
- An informal notion of inference; variability of linguistic expression
- Like MT, NLI depends on a facility for alignment, i.e., linking corresponding words/phrases in two related sentences

Slide 3: Alignment example
[Figure: alignment grid between P (premise) and H (hypothesis), annotated with:]
- unaligned content: "deletions" from P
- approximate match: price ~ bill
- phrase alignment: two-fold increase ~ double

Slide 4: Approaches to NLI alignment
- Alignment is addressed variously by current NLI systems
- In some approaches to NLI, alignments are implicit:
  - NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  - NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
- Other NLI systems make the alignment step explicit:
  - Align first, then determine inferential validity [Marsi & Krahmer 05, MacCartney et al. 06]
- What about using an MT aligner?
  - Alignment is familiar in MT, with an extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  - Can the tools & techniques of MT alignment transfer to NLI?

Slide 5: NLI alignment vs. MT alignment
Doubtful, because NLI alignment differs in several respects:
1. Monolingual: can exploit resources like WordNet
2. Asymmetric: P is often longer & has content unrelated to H
3. Cannot assume semantic equivalence: an NLI aligner must accommodate frequent unaligned content
4. Little training data available:
   - MT aligners use unsupervised training on huge amounts of bitext
   - NLI aligners must rely on supervised training & much less data

Slide 6: Contributions of this paper
In this paper, we:
1. Undertake the first systematic study of alignment for NLI
   - Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
2. Examine the relation between alignment in NLI and MT
   - How do existing MT aligners perform on the NLI alignment task?
3. Propose a new model of alignment for NLI: MANLI
   - Outperforms existing MT & NLI aligners on the NLI alignment task

Slide 7: The MANLI aligner
A model of alignment for NLI consisting of four components:
1. Phrase-based representation
2. Feature-based scoring function
3. Decoding using simulated annealing
4. Perceptron learning

Slide 8: Phrase-based alignment representation
Represent alignments as a sequence of phrase edits: EQ, SUB, DEL, INS
- EQ(Gazprom_1, Gazprom_1)
- INS(will_2)
- DEL(today_2)
- DEL(confirmed_3)
- DEL(a_4)
- SUB(two-fold_5 increase_6, double_3)
- DEL(in_7)
- DEL(its_8)
- …
One-to-one at the phrase level (but many-to-many at the token level)
Avoids arbitrary alignment choices; can use phrase-based resources
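A minimal sketch of how this representation might look in code; the class and field names here are illustrative, not taken from the MANLI implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PhraseEdit:
    """One edit in a MANLI-style alignment: EQ, SUB, DEL, or INS.
    Spans are (start, end) token offsets; INS has no P span, DEL no H span."""
    kind: str                                 # "EQ" | "SUB" | "DEL" | "INS"
    p_span: Optional[Tuple[int, int]] = None  # span in the premise
    h_span: Optional[Tuple[int, int]] = None  # span in the hypothesis

# An alignment is a sequence of edits covering both sentences:
# one-to-one at the phrase level, many-to-many at the token level.
Alignment = List[PhraseEdit]

example: Alignment = [
    PhraseEdit("EQ",  (0, 1), (0, 1)),   # Gazprom ~ Gazprom
    PhraseEdit("INS", None,   (1, 2)),   # will
    PhraseEdit("DEL", (1, 2), None),     # today
    PhraseEdit("SUB", (4, 6), (2, 3)),   # two-fold increase ~ double
]
```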

Slide 9: A feature-based scoring function
Score each edit as a linear combination of features, then sum over edits:
  score(A) = Σ_{e ∈ A} w · Φ(e)
- Edit type features: EQ, SUB, DEL, INS
- Phrase features: phrase sizes, non-constituents
- Lexical similarity feature: max over similarity scores
  - WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  - Distributional similarity à la Dekang Lin
  - Various measures of string/lemma similarity
- Contextual features: distortion, matching neighbors
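As a sketch, the scoring function reduces to one dot product per edit, summed over the alignment. The feature layout below is hypothetical (the real feature set is the one listed above); it reuses the PhraseEdit class from the previous sketch:

```python
import numpy as np

# Hypothetical layout: 4 edit-type indicators + 1 lexical-similarity slot.
EDIT_TYPE_INDEX = {"EQ": 0, "SUB": 1, "DEL": 2, "INS": 3}
NUM_FEATURES = 5

def edit_features(edit, lex_sim: float = 0.0) -> np.ndarray:
    """Stand-in extractor: the real features also include phrase sizes,
    non-constituent flags, distortion, and matching-neighbor counts."""
    phi = np.zeros(NUM_FEATURES)
    phi[EDIT_TYPE_INDEX[edit.kind]] = 1.0
    phi[4] = lex_sim  # max over WordNet / distributional / string similarity
    return phi

def score_alignment(alignment, w: np.ndarray) -> float:
    """score(A) = sum over edits e in A of w . phi(e)."""
    return sum(float(w @ edit_features(e)) for e in alignment)
```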

Slide 10: Decoding using simulated annealing
1. Start
2. Generate successors
3. Score
4. Smooth/sharpen: P(A) = P(A)^(1/T)
5. Sample
6. Lower the temperature: T = 0.9 × T
7. Repeat (steps 2-6) 100 times
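A rough sketch of this annealing loop, under the assumption that successor alignments are scored, turned into a distribution sharpened by exponent 1/T, and sampled; the starting temperature and the successor generator are stand-ins, not values from the paper:

```python
import math
import random

def decode_sa(initial, successors, score, w, steps=100, t0=10.0, cool=0.9):
    """Simulated-annealing decoder sketch: score the successors of the
    current alignment, sharpen the distribution by 1/T, sample one,
    and lower the temperature. Repeat 100 times."""
    current, temp = initial, t0
    for _ in range(steps):
        cands = successors(current)                # step 2: generate successors
        scores = [score(a, w) for a in cands]      # step 3: score
        m = max(scores)                            # subtract max for stability
        weights = [math.exp((s - m) / temp) for s in scores]  # step 4: sharpen
        current = random.choices(cands, weights=weights)[0]   # step 5: sample
        temp *= cool                               # step 6: T = 0.9 * T
    return current
```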

Slide 11: Perceptron learning of feature weights
We use a variant of averaged perceptron [Collins 2002]:
- Initialize weight vector w = 0, learning rate R_0 = 1
- For training epoch i = 1 to 50:
  - For each problem ⟨P_j, H_j⟩ with gold alignment E_j:
    - Set Ê_j = ALIGN(P_j, H_j, w)
    - Set w = w + R_i · (Φ(E_j) − Φ(Ê_j))
    - Set w = w / ‖w‖_2 (L2 normalization)
  - Set w[i] = w (store weight vector for this epoch)
  - Set R_i = 0.8 · R_{i−1} (reduce learning rate)
- Throw away the weight vectors from the first 20% of epochs
- Return the average weight vector
Training runs take about 20 hours (on 800 RTE problems)
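The slide's pseudocode, transcribed as a runnable sketch; the argument names and the problems/phi interfaces are assumptions:

```python
import numpy as np

def train_perceptron(problems, align, phi, dim, epochs=50, r0=1.0, decay=0.8):
    """Averaged-perceptron variant from the slide. `problems` yields
    (P, H, gold_alignment) triples; `align` decodes the best alignment
    under the current weights; `phi` maps an alignment to its summed
    feature vector of dimension `dim`."""
    w = np.zeros(dim)
    rate, history = r0, []
    for _ in range(epochs):
        for p, h, gold in problems:
            guess = align(p, h, w)                   # E-hat = ALIGN(P, H, w)
            w = w + rate * (phi(gold) - phi(guess))  # perceptron update
            norm = np.linalg.norm(w)
            if norm > 0:
                w = w / norm                         # L2 normalization
        history.append(w.copy())                     # store this epoch's weights
        rate *= decay                                # R_i = 0.8 * R_{i-1}
    keep = history[len(history) // 5:]               # discard first 20% of epochs
    return np.mean(keep, axis=0)                     # average weight vector
```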

Slide 12: The MSR RTE2 alignment data
- Previously, little supervised data was available
- Now: MSR gold alignments for RTE2 [Brockett 2007]
  - dev & test sets, 800 problems each
- Token-based, but many-to-many, which allows implicit alignment of phrases
- 3 independent annotators, merged using majority rule
  - 3 of 3 agreed on 70% of proposed links
  - 2 of 3 agreed on 99.7% of proposed links

Slide 13: Evaluation on MSR data
We evaluate several systems on the MSR data:
- A simple baseline aligner
- MT aligners: GIZA++ & Cross-EM
- NLI aligners: Stanford RTE, MANLI
How well do they recover the gold-standard alignments?
- We report per-link precision, recall, and F1
- We also report the exact match rate for complete alignments
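For concreteness, a per-link scorer might look like this sketch, with alignments reduced to sets of (P-token, H-token) index pairs (an assumption about the data format, not the paper's evaluation code):

```python
def link_prf(guess_links: set, gold_links: set):
    """Per-link precision, recall, and F1 for one alignment; links are
    (p_index, h_index) pairs. An exact match means guess == gold."""
    tp = len(guess_links & gold_links)
    p = tp / len(guess_links) if guess_links else 0.0
    r = tp / len(gold_links) if gold_links else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```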

Slide 14: Baseline: bag-of-words aligner
Matches each H token to its most similar P token [cf. Glickman et al. 2005]
- [Table: per-link precision (P), recall (R), F1, and exact match rate (E) on RTE2 dev & test; numeric values not preserved in this transcript]
- Surprisingly good recall, despite extreme simplicity
- But very mediocre precision, F1, & exact match rate
- Main problem: it aligns every token in H
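A sketch of this baseline, assuming some token-level similarity function sim; the exact-match sim shown in the comment is just a stand-in for the lexical similarity used in the original:

```python
def bag_of_words_align(p_tokens, h_tokens, sim):
    """Baseline aligner: link every hypothesis token to its most similar
    premise token. Returns a map from H token index to P token index.
    Note it aligns *every* H token, which is why precision suffers."""
    return {
        j: max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h))
        for j, h in enumerate(h_tokens)
    }

# e.g. sim = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
```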

Slide 15: MT aligners: GIZA++ & Cross-EM
Can we show that MT aligners aren't suitable for NLI?
- Run GIZA++ via Moses, with default parameters
  - Train on the dev set; evaluate on dev & test sets
  - Asymmetric alignments in both directions, then symmetrize using the INTERSECTION heuristic
- Initial results are very poor: 56% F1
  - It doesn't even align equal words
  - Remedy: add a lexicon of equal words as extra training data
- We do similar experiments with the Berkeley Cross-EM aligner
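The INTERSECTION heuristic itself is simple; a sketch over link sets, where the (p_index, h_index) pair format is an assumption:

```python
def symmetrize_intersection(p2h: set, h2p: set) -> set:
    """INTERSECTION heuristic: keep only links proposed by both
    asymmetric alignments (P->H and H->P), trading recall for
    precision. Each input is a set of (p_index, h_index) links."""
    return p2h & h2p

# Looser heuristics (UNION, GROW-DIAG, ...) start from this intersection
# or the union and add neighboring links; the next slide reports they
# gain recall but lose too much precision on this task.
```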

Slide 16: Results: MT aligners
- [Table: P, R, F1, and exact match rate on RTE2 dev & test for Bag-of-words, GIZA++, and Cross-EM; numeric values not preserved in this transcript]
- Similar F1, but GIZA++ wins on precision, Cross-EM on recall
- Both do best with the lexicon & the INTERSECTION heuristic
- Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  - All achieve better recall, but much worse precision & F1
- Problem: too little data for unsupervised learning
  - Need to compensate by exploiting external lexical resources

Slide 17: The Stanford RTE aligner
- Token-based alignments: a map from H tokens to P tokens
  - Phrase alignments are not directly representable
  - (But named entities & collocations are collapsed in pre-processing)
- Exploits external lexical resources: WordNet, LSA, distributional similarity, string similarity, …
- Syntax-based features to promote aligning corresponding predicate-argument structures
- Decoding & learning similar to MANLI

Slide 18: Results: Stanford RTE aligner
- [Table: P, R, F1, and exact match rate on RTE2 dev & test for Bag-of-words, GIZA++, Cross-EM, and Stanford RTE*; numeric values not preserved in this transcript]
  (* includes a generous correction for missed punctuation)
- Better F1 than the MT aligners, but recall lags precision
- Stanford does a poor job of aligning function words
  - 13% of links in the gold data are prepositions & articles
  - Stanford misses 67% of these (MANLI only 10%)
- Stanford also fails to align multi-word phrases: peace activists ~ protestors, hackers ~ non-authorized personnel

Slide 19: Results: MANLI aligner
- [Table: P, R, F1, and exact match rate on RTE2 dev & test for all five aligners; numeric values not preserved in this transcript]
- MANLI outperforms all the others on every measure
  - F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
- Good balance of precision & recall
- Matched >20% of alignments exactly

Slide 20: MANLI results: discussion
Three factors contribute to its success:
1. Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
2. Contextual features enable matching function words
3. Phrases: death penalty ~ capital punishment, abdicate ~ give up
But phrases help less than expected! If we set max phrase size = 1, we lose just 0.2% in F1.
- Recall errors: room to improve
  - 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
- Precision errors are harder to reduce: equal function words (49%), forms of be (21%), punctuation (7%)

Slide 21: Can aligners predict RTE answers?
- We've been evaluating against gold-standard alignments
- But alignment is just one component of an NLI system
- Does a good alignment indicate a valid inference?
  - Not necessarily: negations, modals, non-factives & implicatives, …
  - But the alignment score can be strongly predictive
  - And many NLI systems rely solely on alignment
- Using the alignment score to predict RTE answers (see the sketch below):
  - Predict YES if score > threshold
  - Tune the threshold on development data
  - Evaluate on test data
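A sketch of this thresholding scheme; the function names and the accuracy-maximizing tuning criterion are illustrative assumptions:

```python
def tune_threshold(dev_scores, dev_labels):
    """Pick the alignment-score threshold that maximizes accuracy on the
    dev set. `dev_labels` are booleans (True = entailment holds)."""
    def accuracy(t):
        hits = sum((s > t) == y for s, y in zip(dev_scores, dev_labels))
        return hits / len(dev_labels)
    return max(sorted(set(dev_scores)), key=accuracy)

def predict_rte(score, threshold):
    """Predict YES iff the alignment score exceeds the tuned threshold."""
    return score > threshold
```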

Slide 22: Results: predicting RTE answers
- [Table: accuracy & average precision on RTE2 dev & test for Bag-of-words, Stanford RTE, MANLI, the average RTE2 entry, and LCC [Hickl et al. 2006]; numeric values not preserved in this transcript]
- No NLI aligner rivals the best complete RTE system
  - (Most) complete systems do a lot more than just alignment!
- But Stanford & MANLI beat the average entry for RTE2
  - Many NLI systems could benefit from better alignments!

Slide 23: Conclusion
- MT aligners are not directly applicable to NLI
  - They rely on unsupervised learning from massive amounts of bitext
  - They assume semantic equivalence of P & H
- MANLI succeeds by:
  - Exploiting (manually & automatically constructed) lexical resources
  - Accommodating frequent unaligned phrases
- The phrase-based representation shows potential
  - But it's not yet proven: we need better phrase-based lexical resources
Thanks! Questions? :-)

Slide 24: Backup slides follow

Slide 25: Related work
- Lots of past work on phrase-based MT
  - But most systems extract phrases from word-aligned data, despite the assumption that many translations are non-compositional
- Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
- However, this is of limited applicability to the NLI task
  - MANLI uses phrases only when words aren't appropriate
  - MT uses longer phrases to realize more dependencies (e.g., word order, agreement, subcategorization)
  - MT systems don't model word insertions & deletions