
1 A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
26 October 2008

2 Natural language inference (NLI), a.k.a. RTE
P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
H: Gazprom will double Georgia’s gas bill. (answer: yes)
Does premise P justify an inference to hypothesis H?
- An informal notion of inference; variability of linguistic expression
Like MT, NLI depends on a facility for alignment
- i.e., linking corresponding words/phrases in two related sentences

3 Alignment example
[Figure: token-level alignment grid between P (premise) and H (hypothesis)]
- unaligned content: “deletions” from P
- approximate match: price ~ bill
- phrase alignment: two-fold increase ~ double

4 Approaches to NLI alignment
Alignment is addressed variously by current NLI systems
In some approaches to NLI, alignments are implicit:
- NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
- NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
Other NLI systems make the alignment step explicit:
- Align first, then determine inferential validity [Marsi & Kramer 05, MacCartney et al. 06]
What about using an MT aligner?
- Alignment is familiar in MT, with an extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
- Can the tools & techniques of MT alignment transfer to NLI?

5 NLI alignment vs. MT alignment
Doubtful: NLI alignment differs in several respects.
1. Monolingual: can exploit resources like WordNet
2. Asymmetric: P is often longer & has content unrelated to H
3. Cannot assume semantic equivalence: an NLI aligner must accommodate frequent unaligned content
4. Little training data available: MT aligners use unsupervised training on huge amounts of bitext; NLI aligners must rely on supervised training & much less data

6 Contributions of this paper
In this paper, we:
1. Undertake the first systematic study of alignment for NLI
   - Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
2. Examine the relation between alignment in NLI and MT
   - How do existing MT aligners perform on the NLI alignment task?
3. Propose a new model of alignment for NLI: MANLI
   - Outperforms existing MT & NLI aligners on the NLI alignment task

7 The MANLI aligner
A model of alignment for NLI consisting of four components:
1. Phrase-based representation
2. Feature-based scoring function
3. Decoding using simulated annealing
4. Perceptron learning

8 Phrase-based alignment representation
Represent alignments as a sequence of phrase edits: EQ, SUB, DEL, INS
  EQ( Gazprom_1, Gazprom_1 )
  INS( will_2 )
  DEL( today_2 )
  DEL( confirmed_3 )
  DEL( a_4 )
  SUB( two-fold_5 increase_6, double_3 )
  DEL( in_7 )
  DEL( its_8 )
  …
One-to-one at the phrase level (but many-to-many at the token level)
Avoids arbitrary alignment choices; can use phrase-based resources
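A minimal sketch (not the authors' code) of how such a phrase-edit sequence might be represented; the class name, fields, and 0-based token spans are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A span is a (start, end) pair of token indices into P or H.
Span = Tuple[int, int]

@dataclass
class PhraseEdit:
    """One edit in a phrase-based alignment: EQ, SUB, DEL, or INS."""
    kind: str                       # 'EQ', 'SUB', 'DEL' (from P), or 'INS' (into H)
    p_span: Optional[Span] = None   # span in the premise, None for INS
    h_span: Optional[Span] = None   # span in the hypothesis, None for DEL

# The example alignment from this slide, as an edit sequence.
# e.g. P[4:6] = "two-fold increase", H[2:3] = "double".
alignment: List[PhraseEdit] = [
    PhraseEdit('EQ',  p_span=(0, 1), h_span=(0, 1)),   # Gazprom ~ Gazprom
    PhraseEdit('INS', h_span=(1, 2)),                  # will
    PhraseEdit('DEL', p_span=(1, 2)),                  # today
    PhraseEdit('DEL', p_span=(2, 3)),                  # confirmed
    PhraseEdit('DEL', p_span=(3, 4)),                  # a
    PhraseEdit('SUB', p_span=(4, 6), h_span=(2, 3)),   # two-fold increase ~ double
    # ... remaining edits
]
```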

9 A feature-based scoring function
Score each edit as a linear combination of features, then sum over edits:
- Edit type features: EQ, SUB, DEL, INS
- Phrase features: phrase sizes, non-constituents
- Lexical similarity feature: max over similarity scores
  - WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  - Distributional similarity à la Dekang Lin
  - Various measures of string/lemma similarity
- Contextual features: distortion, matching neighbors
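A hedged sketch of this scoring scheme, reusing the PhraseEdit sketch above: each edit gets a feature vector, the edit score is a dot product with the weight vector, and the alignment score is the sum over edits. The feature names and the toy similarity function are assumptions, not the paper's actual feature set.

```python
def lexical_similarity(p_phrase: str, h_phrase: str) -> float:
    # Placeholder: the real system takes a max over WordNet relations,
    # distributional similarity, and string/lemma similarity measures.
    return 1.0 if p_phrase.lower() == h_phrase.lower() else 0.0

def edit_features(edit, p_tokens, h_tokens):
    """Illustrative feature map for one phrase edit."""
    feats = {'type=' + edit.kind: 1.0}                            # edit type feature
    if edit.p_span:
        feats['p_phrase_size'] = edit.p_span[1] - edit.p_span[0]  # phrase feature
    if edit.kind in ('EQ', 'SUB'):
        p_phrase = ' '.join(p_tokens[edit.p_span[0]:edit.p_span[1]])
        h_phrase = ' '.join(h_tokens[edit.h_span[0]:edit.h_span[1]])
        feats['lex_sim'] = lexical_similarity(p_phrase, h_phrase)  # lexical similarity feature
    # ... contextual features (distortion, matching neighbors) would go here
    return feats

def score_alignment(alignment, p_tokens, h_tokens, w):
    """Alignment score = sum over edits of w · features(edit)."""
    return sum(w.get(name, 0.0) * val
               for edit in alignment
               for name, val in edit_features(edit, p_tokens, h_tokens).items())
```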

10 Decoding using simulated annealing
1. Start with an initial alignment
2. Generate successors
3. Score them
4. Smooth/sharpen the distribution: P(A) ← P(A)^(1/T)
5. Sample a successor
6. Lower the temperature: T ← 0.9 × T
7. Repeat (100 iterations)
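A minimal sketch of this annealing loop, under stated assumptions: the successor-generation callback, the starting temperature, and the use of exp(score) as the unnormalized P(A) are all illustrative choices rather than the paper's exact settings.

```python
import math
import random

def decode(p_tokens, h_tokens, w, initial_alignment, successors,
           iterations=100, cooling=0.9):
    """Simulated-annealing decoding sketch. `successors` is an assumed
    callback that proposes neighboring alignments of the current one."""
    current = initial_alignment
    temperature = 1.0                                        # assumed starting temperature
    for _ in range(iterations):
        candidates = successors(current)                     # 2. generate successors
        scores = [score_alignment(a, p_tokens, h_tokens, w)  # 3. score them
                  for a in candidates]
        # 4. smooth/sharpen: with P(A) ∝ exp(score), P(A)^(1/T) ∝ exp(score / T)
        weights = [math.exp(s / temperature) for s in scores]
        total = sum(weights)
        probs = [x / total for x in weights]
        current = random.choices(candidates, weights=probs, k=1)[0]  # 5. sample
        temperature *= cooling                               # 6. lower the temperature
    return current
```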

11 Perceptron learning of feature weights
We use a variant of the averaged perceptron [Collins 2002]
  Initialize weight vector w = 0, learning rate R_0 = 1
  For training epoch i = 1 to 50:
    For each problem (P_j, H_j) with gold alignment E_j:
      Set Ê_j = ALIGN(P_j, H_j, w)
      Set w = w + R_i · (Φ(E_j) − Φ(Ê_j))
    Set w = w / ‖w‖_2 (L2 normalization)
    Set w[i] = w (store the weight vector for this epoch)
    Set R_i = 0.8 · R_(i−1) (reduce the learning rate)
  Throw away the weight vectors from the first 20% of epochs
  Return the average weight vector
Training runs take about 20 hours (on 800 RTE problems)
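A runnable sketch of the training loop above. The `align` and `features` callbacks (decoding with current weights, and Φ as a sparse feature-count dict) are assumed interfaces; everything else follows the pseudocode on the slide.

```python
import math

def train_perceptron(problems, align, features, epochs=50, keep_from=0.2):
    """Averaged-perceptron sketch. `problems` is a list of (P, H, gold_alignment)
    triples; `align(P, H, w)` decodes with the current weights; `features(A)`
    returns a sparse dict Φ(A)."""
    w = {}                                        # weight vector, as a sparse dict
    rate = 1.0                                    # learning rate R_0 = 1
    stored = []
    for epoch in range(epochs):
        for P, H, gold in problems:
            guess = align(P, H, w)
            # w += R_i * (Φ(gold) - Φ(guess))
            for name, val in features(gold).items():
                w[name] = w.get(name, 0.0) + rate * val
            for name, val in features(guess).items():
                w[name] = w.get(name, 0.0) - rate * val
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        w = {k: v / norm for k, v in w.items()}   # L2 normalization
        stored.append(dict(w))                    # store this epoch's weight vector
        rate *= 0.8                               # reduce the learning rate
    kept = stored[int(keep_from * epochs):]       # discard the first 20% of epochs
    avg = {}
    for vec in kept:                              # average the remaining vectors
        for k, v in vec.items():
            avg[k] = avg.get(k, 0.0) + v / len(kept)
    return avg
```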

12 The MSR RTE2 alignment data
Previously, little supervised data
Now, MSR gold alignments for RTE2 [Brockett 2007]
- dev & test sets, 800 problems each
- Token-based, but many-to-many: allows implicit alignment of phrases
- 3 independent annotators
  - 3 of 3 agreed on 70% of proposed links
  - 2 of 3 agreed on 99.7% of proposed links
- Merged using majority rule
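The majority-rule merge amounts to keeping a link whenever at least two of the three annotators proposed it. A small sketch, assuming each annotation is a set of (p_index, h_index) token links:

```python
from collections import Counter

def merge_by_majority(annotations, min_votes=2):
    """Keep a token link if at least `min_votes` annotators proposed it.
    `annotations` is a list of link sets, one per annotator."""
    votes = Counter(link for links in annotations for link in set(links))
    return {link for link, count in votes.items() if count >= min_votes}
```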

13 Evaluation on MSR data
We evaluate several systems on the MSR data:
- A simple baseline aligner
- MT aligners: GIZA++ & Cross-EM
- NLI aligners: Stanford RTE, MANLI
How well do they recover the gold-standard alignments?
- We report per-link precision, recall, and F1
- We also report the exact match rate for complete alignments
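For concreteness, a sketch of these per-problem metrics over sets of token links; how the paper aggregates across problems (micro vs. macro) is not specified here, so this is only the per-problem computation.

```python
def evaluate(predicted, gold):
    """Per-link precision, recall, F1, and exact match for one problem,
    where `predicted` and `gold` are sets of (p_index, h_index) links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    exact = predicted == gold
    return precision, recall, f1, exact
```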

14 Baseline: bag-of-words aligner
Match each H token to its most similar P token [cf. Glickman et al. 2005]

                        RTE2 dev                      RTE2 test
System                  P %    R %    F1 %   E %      P %    R %    F1 %   E %
Bag-of-words            57.8   81.2   67.5   3.5      62.1   82.6   70.9   5.3

Surprisingly good recall, despite extreme simplicity
But very mediocre precision, F1, & exact match rate
Main problem: it aligns every token in H
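A sketch of this baseline, with the token-level similarity function left as an assumed parameter. Because every H token gets a link, recall is high and precision suffers, exactly as the results above show.

```python
def bag_of_words_align(p_tokens, h_tokens, similarity):
    """Baseline sketch: link every H token to its most similar P token.
    `similarity(p_tok, h_tok)` is an assumed token-level similarity function."""
    links = set()
    for j, h_tok in enumerate(h_tokens):
        best_i = max(range(len(p_tokens)),
                     key=lambda i: similarity(p_tokens[i], h_tok))
        links.add((best_i, j))
    return links
```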

15 MT aligners: GIZA++ & Cross-EM
Can we show that MT aligners aren’t suitable for NLI?
Run GIZA++ via Moses, with default parameters
- Train on the dev set; evaluate on the dev & test sets
- Asymmetric alignments in both directions, then symmetrize using the INTERSECTION heuristic
- Initial results are very poor: 56% F1; it doesn’t even align equal words
- Remedy: add a lexicon of equal words as extra training data
Do similar experiments with the Berkeley Cross-EM aligner
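Two of the ideas above are simple enough to sketch: the INTERSECTION heuristic keeps only links found in both asymmetric runs, and the equal-word lexicon is extra "bitext" pairing each word with itself. The link and lexicon formats below are assumptions for illustration, not the Moses/GIZA++ file formats.

```python
def symmetrize_intersection(p_to_h_links, h_to_p_links):
    """Keep only the links proposed by both asymmetric alignment runs.
    Both inputs are sets of (p_index, h_index) pairs."""
    return set(p_to_h_links) & set(h_to_p_links)

def identity_lexicon(vocabulary):
    """Extra one-word 'sentence pairs', each word paired with itself, so the
    unsupervised aligner at least learns to align equal words."""
    return [(word, word) for word in vocabulary]
```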

16 Results: MT aligners

                        RTE2 dev                      RTE2 test
System                  P %    R %    F1 %   E %      P %    R %    F1 %   E %
Bag-of-words            57.8   81.2   67.5   3.5      62.1   82.6   70.9   5.3
GIZA++                  83.0   66.4   72.1   9.4      85.1   69.1   74.8   11.3
Cross-EM                67.6   80.1   72.1   1.3      70.3   81.0   74.1   0.8

Similar F1, but GIZA++ wins on precision, Cross-EM on recall
Both do best with the lexicon & the INTERSECTION heuristic
- Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
- All achieve better recall, but much worse precision & F1
Problem: too little data for unsupervised learning
- Need to compensate by exploiting external lexical resources

17 The Stanford RTE aligner
Token-based alignments: a map from H tokens to P tokens
- Phrase alignments are not directly representable
- (But named entities & collocations are collapsed in pre-processing)
Exploits external lexical resources
- WordNet, LSA, distributional similarity, string similarity, …
Syntax-based features to promote aligning corresponding predicate-argument structures
Decoding & learning similar to MANLI

18 Results: Stanford RTE aligner

                        RTE2 dev                      RTE2 test
System                  P %    R %    F1 %   E %      P %    R %    F1 %   E %
Bag-of-words            57.8   81.2   67.5   3.5      62.1   82.6   70.9   5.3
GIZA++                  83.0   66.4   72.1   9.4      85.1   69.1   74.8   11.3
Cross-EM                67.6   80.1   72.1   1.3      70.3   81.0   74.1   0.8
Stanford RTE *          81.1   75.8   78.4   0.5      82.7   75.8   79.1   0.3

* includes a (generous) correction for missed punctuation

Better F1 than the MT aligners, but recall lags precision
- Stanford does a poor job of aligning function words
  - 13% of links in the gold data are prepositions & articles; Stanford misses 67% of these (MANLI only 10%)
- Stanford also fails to align multi-word phrases
  - peace activists ~ protestors, hackers ~ non-authorized personnel

19 Results: MANLI aligner

                        RTE2 dev                      RTE2 test
System                  P %    R %    F1 %   E %      P %    R %    F1 %   E %
Bag-of-words            57.8   81.2   67.5   3.5      62.1   82.6   70.9   5.3
GIZA++                  83.0   66.4   72.1   9.4      85.1   69.1   74.8   11.3
Cross-EM                67.6   80.1   72.1   1.3      70.3   81.0   74.1   0.8
Stanford RTE            81.1   75.8   78.4   0.5      82.7   75.8   79.1   0.3
MANLI                   83.4   85.5   84.4   21.7     85.4   85.3   85.3   21.3

MANLI outperforms all the others on every measure
- F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
- Good balance of precision & recall
- Matched >20% of alignments exactly

20 MANLI results: discussion
Three factors contribute to its success:
1. Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
2. Contextual features enable matching function words
3. Phrases: death penalty ~ capital punishment, abdicate ~ give up
But phrases help less than expected!
- If we set the max phrase size to 1, we lose just 0.2% in F1
Recall errors: room to improve
- 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
Precision errors are harder to reduce
- equal function words (49%), forms of be (21%), punctuation (7%)

21 Can aligners predict RTE answers?
We’ve been evaluating against gold-standard alignments
But alignment is just one component of an NLI system
Does a good alignment indicate a valid inference?
- Not necessarily: negations, modals, non-factives & implicatives, …
- But the alignment score can be strongly predictive
- And many NLI systems rely solely on alignment
Using the alignment score to predict RTE answers:
- Predict YES if score > threshold
- Tune the threshold on development data
- Evaluate on test data
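A minimal sketch of this threshold-based prediction, assuming dev-set alignment scores and gold YES/NO labels as booleans; the exhaustive threshold search is an illustrative choice, and the tuning criterion (accuracy) is an assumption.

```python
def tune_threshold(dev_scores, dev_labels):
    """Pick the score threshold that maximizes accuracy on the development set."""
    candidates = [min(dev_scores) - 1.0] + sorted(set(dev_scores))
    def accuracy(thresh):
        return sum((s > thresh) == y
                   for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
    return max(candidates, key=accuracy)

def predict(score, threshold):
    """Answer YES iff the alignment score exceeds the tuned threshold."""
    return 'YES' if score > threshold else 'NO'
```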

22 Results: predicting RTE answers

                              RTE2 dev            RTE2 test
System                        Acc %   AvgP %      Acc %   AvgP %
Bag-of-words                  61.3    61.5        57.9    58.9
Stanford RTE                  63.1    64.9        60.9    59.2
MANLI                         59.3    69.0        60.3    61.0
RTE2 entries (average)        —       —           58.5    59.1
LCC [Hickl et al. 2006]       —       —           75.4    80.8

No NLI aligner rivals the best complete RTE system
- (Most) complete systems do a lot more than just alignment!
But Stanford & MANLI beat the average entry for RTE2
- Many NLI systems could benefit from better alignments!

23 Conclusion
MT aligners are not directly applicable to NLI
- They rely on unsupervised learning from massive amounts of bitext
- They assume semantic equivalence of P & H
MANLI succeeds by:
- Exploiting (manually & automatically constructed) lexical resources
- Accommodating frequent unaligned phrases
The phrase-based representation shows potential
- But not yet proven: need better phrase-based lexical resources
Thanks! Questions? :-)

24 Backup slides follow

25 Related work
Lots of past work on phrase-based MT
- But most systems extract phrases from word-aligned data, despite the assumption that many translations are non-compositional
- Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
However, this is of limited applicability to the NLI task
- MANLI uses phrases only when words aren’t appropriate
- MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
- MT systems don’t model word insertions & deletions

