Baselines for Recognizing Textual Entailment Ling 541 Final Project Terrence Szymanski
What is Textual Entailment? Informally: A text T entails a hypothesis H if the meaning of H can be inferred from the meaning of T. Example: T: Profits nearly doubled to nearly $1.8 billion. H: Profits grew to nearly $1.8 billion. Entailment holds (is true).
Types of Entailment For many entailments, H is simply a paraphrase of all or part of T. Other entailments are less obvious: T: Jorma Ollila joined Nokia in 1985 and held a variety of key management positions before taking the helm in 1992. H: Jorma Ollila is the CEO of Nokia. Human agreement on entailment judgments is about 95%.
The PASCAL RTE Challenge The first challenge was held in 2005 (RTE1), with 16 entries. System performances ranged from 50% to 59% accuracy. Entries used a wide array of approaches: word overlap, synonymy/word distance, statistical lexical relations, dependency tree matching… The second challenge (RTE2) is underway.
What is BLEU? BLEU was designed as a metric to measure the accuracy of machine-generated translations by comparing them to human-generated gold standards. Scores are based on n-gram overlap (typically for n=1, 2, 3, and 4), with a penalty for overly brief translations. Application for RTE?
Using the BLEU Algorithm for RTE Proposed by Pérez & Alfonseca in RTE1. Use the traditional BLEU algorithm to capture n-gram overlap between T-H pairs. Find a cutoff score such that a BLEU score above the cutoff implies a TRUE entailment (otherwise FALSE). Roughly 50% accuracy: a simple baseline. However, intuitively, the BLEU algorithm is not ideal for RTE: BLEU was designed for evaluating MT systems. BLEU could be adjusted to better suit the RTE task.
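The overlap-plus-cutoff idea can be sketched as follows. This is a minimal illustration, not Pérez & Alfonseca's implementation: the whitespace tokenization, the clipped counting, and the function names are assumptions, and it computes only the per-n overlap ratio rather than a full BLEU score.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(hyp_tokens, text_tokens, n):
    """Fraction of the hypothesis n-grams that also occur in the text.

    Counts are clipped, so a repeated hypothesis n-gram is matched at
    most as many times as it occurs in the text.
    """
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    text_counts = Counter(ngrams(text_tokens, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, text_counts[g]) for g, c in hyp_counts.items())
    return matched / total


def predict_entailment(score, cutoff):
    """TRUE entailment only if the score is strictly above the cutoff."""
    return score > cutoff


# The T-H pair from the opening example, naively tokenized.
text = "profits nearly doubled to nearly $1.8 billion .".split()
hyp = "profits grew to nearly $1.8 billion .".split()
unigram = ngram_precision(hyp, text, 1)
bigram = ngram_precision(hyp, text, 2)
```

Here H plays the role of the "test" string and T the role of the "reference", so the ratio measures how much of H is covered by T.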
Modifying the BLEU Algorithm Entailments are normally short; thus it does not make sense to penalize them for being short. BLEU uses a geometric mean to average the n-gram overlap for n=1, 2, 3, and 4. If any value of n produces a zero score, the entire score is nullified. Therefore: modify the algorithm to not penalize for brevity, and use a linear weighted average instead of the geometric mean.
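The effect of the averaging choice can be seen in a small sketch. The per-n overlap values below are made-up numbers for illustration, and the function names are hypothetical; the brevity factor is passed in as a plain number rather than computed.

```python
import math


def geometric_mean_score(precisions, brevity=1.0):
    """Original-style combination: brevity factor times a weighted geometric mean.

    A single zero precision drives the whole score to zero.
    """
    if any(p == 0 for p in precisions):
        return 0.0
    w = 1.0 / len(precisions)  # uniform weights, 1/N
    return brevity * math.exp(sum(w * math.log(p) for p in precisions))


def linear_mean_score(precisions):
    """Modified-style combination: weighted arithmetic mean, no brevity factor."""
    w = 1.0 / len(precisions)  # uniform weights, 1/N
    return sum(w * p for p in precisions)


# Illustrative overlap ratios for n = 1..4; the 4-gram overlap is zero.
precisions = [0.8, 0.5, 0.25, 0.0]
original = geometric_mean_score(precisions)
modified = linear_mean_score(precisions)
```

With these inputs the geometric mean collapses to zero while the linear average still reflects the unigram, bigram, and trigram overlap, which is why the modified score forms a continuum rather than clustering at zero.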
Modifying the BLEU Algorithm Original BLEU vs. Modified BLEU: $w_i$ is the weighting factor (universally set to $1/N$); $b$ is the brevity factor (see paper for details); $c_{test,ref}$ is the count of n-grams appearing in both test and ref; and $c_{test}$ is the count of total n-grams appearing in test.
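The two formulas themselves are not present in this transcript. One reconstruction consistent with the symbol definitions on this slide and with the standard form of BLEU is given here; it is an assumption, not the slide's exact notation, and the superscript $(i)$ marking the n-gram order on the counts is added for clarity.

```latex
\[
\text{Original BLEU} \;=\; b \cdot \exp\!\left( \sum_{i=1}^{N} w_i \,
  \log \frac{c^{(i)}_{test,ref}}{c^{(i)}_{test}} \right)
\qquad\qquad
\text{Modified BLEU} \;=\; \sum_{i=1}^{N} w_i \,
  \frac{c^{(i)}_{test,ref}}{c^{(i)}_{test}}
\]
```

With $w_i = 1/N$, the first expression is the brevity factor times the geometric mean of the per-n overlap ratios, and the second is their arithmetic mean with no brevity factor.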
Performance Comparison Ran both the unmodified and the modified BLEU algorithm on the RTE1 data sets. Used the development set to obtain the cutoff score. Used the test set as the evaluation data.
Cutoff Score for BLEU The unmodified algorithm produces a high percentage of zero scores (67%). Not surprisingly, the cutoff score is zero!
Cutoff Score for BLEU Two equivalent cutoff scores: 0 and 0.13. Both offer 53.8% accuracy, but the zero cutoff was used because it is a natural candidate for cutoff.
Cutoff Score for Modified BLEU Modified BLEU produces a continuum of scores, unlike the original BLEU. Need to find the optimal cutoff score that maximizes accuracy.
Cutoff Score for Modified BLEU Optimal cutoff score is found to be 0.221
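A cutoff search of this kind can be sketched as a simple sweep over the development-set scores. The scores and labels below are toy values invented for illustration, not RTE data, and `best_cutoff` is a hypothetical helper name.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, accuracy) maximizing accuracy on a development set.

    A pair is predicted TRUE when its score is strictly above the cutoff.
    Candidate cutoffs are zero plus every observed score; on ties the
    lowest candidate is kept.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal in length")
    best = (0.0, -1.0)
    for cutoff in sorted(set(scores) | {0.0}):
        correct = sum((s > cutoff) == y for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best[1]:
            best = (cutoff, accuracy)
    return best


# Toy development set: overlap scores and gold entailment labels.
dev_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]
dev_labels = [False, False, True, False, True, True]
cutoff, accuracy = best_cutoff(dev_scores, dev_labels)
```

The cutoff chosen this way is then held fixed and applied to the test set, which is why the test accuracy can be lower than the development accuracy.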
Validity of cutoff scores? The original BLEU seems to have a good natural cutoff score of zero. The modified BLEU optimal cutoff varies depending on the data set, although 0.221 is an acceptable value (future data may be needed for optimization; also, the cutoff may be task-specific).
Results on RTE1 Data Original BLEU: Development Set (cutoff score = zero), accuracy = 53.8%; Test Set, accuracy = 52.0%. Modified BLEU: Development Set (cutoff score = 0.221), accuracy = 57.8%; Test Set, accuracy = 53.8%.
Results on RTE2 Data Original BLEU: Development Set (cutoff score = zero), accuracy = 56.0%; Test Set: ??? Modified BLEU: Development Set, accuracy = 60.4% at cutoff score = 0.221 and 61.4% at cutoff score = 0.25; Test Set: ??? The RTE2 test set will be released in January.
Comparison of Results
                          BLEU    Modified BLEU    Pérez & Alfonseca    RTE1 Best
Development Set (RTE1)    53.8    57.8             54                   n/a
Test Set (RTE1)           52.0    53.8             49.5                 58.6
Development Set (RTE2)    56.0    60.4             n/a                  n/a
Accuracy scores (%) for four systems: Original BLEU, Modified BLEU, Pérez & Alfonseca's implementation of BLEU, and the best submission to the RTE1 Challenge. Modified BLEU is better than the other versions of BLEU, but nowhere near the best system performance.
End Results The modified BLEU algorithm outperforms the original BLEU algorithm for RTE, with a consistent increase in accuracy of roughly 2-4 percentage points. Does this mean that modified BLEU is a candidate system for RTE applications?
NO: BLEU is a baseline algorithm “Don’t climb a tree to get to the moon.” BLEU (and other n-gram based methods) are good baselines, but lack the potential for future improvement. Example: T: It is not the case that John likes ice cream. H: John likes ice cream. Perfect n-gram overlap, but entailment is FALSE.
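The negation example can be checked directly. The sketch below scores the pair with a linear average of n-gram overlap for n = 1 to 4; the lowercasing, whitespace tokenization, and function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def linear_overlap(hyp_tokens, text_tokens, max_n=4):
    """Uniformly weighted average of clipped n-gram overlap for n = 1..max_n."""
    total_score = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp_tokens, n))
        text_counts = Counter(ngrams(text_tokens, n))
        total = sum(hyp_counts.values())
        if total > 0:
            matched = sum(min(c, text_counts[g]) for g, c in hyp_counts.items())
            total_score += (matched / total) / max_n
    return total_score


text = "it is not the case that john likes ice cream".split()
hyp = "john likes ice cream".split()
score = linear_overlap(hyp, text)

# Every n-gram of H occurs in T, so any cutoff below the maximum score
# predicts TRUE, although the gold label for this pair is FALSE.
predicted = score > 0.221
gold = False
```

No choice of cutoff can fix this case without also rejecting genuine paraphrases, since the score is already at its maximum.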
Future Improvements Potential exists to add word-similarity enhancements, such as synonym substitution, etc. Rather than think of these as enhancements to the BLEU algorithm, we should think of the BLEU algorithm as a baseline for measuring the benefit offered by such improvements. i.e. Performance of BLEU vs. Performance of BLEU after synonym substitution. => Evaluate the benefit synonym substitution can have on a larger RTE system.
Conclusions The BLEU algorithm can be modified to better suit the RTE task. The modifications are theory-motivated: eliminate the brevity penalty, and use a linear rather than geometric mean. Performance benefits: modified BLEU consistently has roughly 2-4% higher accuracy. Still, BLEU is only a baseline algorithm: it lacks the capacity to incorporate future developments, but it can be used to measure the performance benefits of various enhancements.