1 Recognizing textual entailment: Rational, evaluation and approaches
Source: Natural Language Engineering 15 (4)
Authors: Ido Dagan, Bill Dolan, Bernardo Magnini and Dan Roth
Reporter: Yong-Xiang Chen

2 Textual Entailment
Textual entailment: the meaning of one piece of text can be plausibly inferred from another
– Entailment: a text t entails another text (the hypothesis) h if h is true in every circumstance (possible world) in which t is true, i.e. the hypothesis is necessarily true in any circumstance for which the text is true
– Example:
T: iTunes software has seen strong sales in Europe
H: Strong sales for iTunes in Europe
Text T entails hypothesis H
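As a minimal illustration (the class and field names below are our own, not from the paper), an RTE example can be represented as a text–hypothesis pair together with a gold judgment:

```python
from dataclasses import dataclass

@dataclass
class EntailmentPair:
    text: str        # T: the entailing text
    hypothesis: str  # H: the text whose truth should follow from T
    entails: bool    # gold-standard entailment judgment

pair = EntailmentPair(
    text="iTunes software has seen strong sales in Europe",
    hypothesis="Strong sales for iTunes in Europe",
    entails=True,
)
```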

3 Applied definition
Allows the truth of the hypothesis to be highly plausible, rather than certain
Example:
– T: The Republic of Yemen is an Arab, Islamic and independent sovereign state whose integrity is inviolable, and no part of which may be ceded
– H: The national language of Yemen is Arabic

4 Multiple applications
Text-understanding applications which need semantic inference
QA:
– The system has to identify (candidate) texts that entail the expected answer
Question: ‘Who is John Lennon’s widow?’
Answer: ‘Yoko Ono is John Lennon’s widow’
Candidate text: ‘Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England’s Liverpool Airport as Liverpool John Lennon Airport’
The candidate text entails the expected answer

5 Current lines of research
Automatic acquisition of paraphrases and lexical semantic relationships
Unsupervised inference in applications
– Question answering
– Information extraction
Needed:
– Inference methods
– Knowledge representations
– The use of learning methods
Community: PASCAL Recognizing Textual Entailment (RTE) challenges

6 RTE task
Given two text fragments (T and H), decide whether the meaning of one text is entailed (can be inferred) from the other
– Applied notion: a directional relationship between pairs of text expressions
– T entails H if a human reading T would infer that H is most probably true
by language
by common background knowledge
– The contests provided datasets for evaluation and a forum for presenting and comparing approaches
– RTE1 in 2005, RTE4 in 2008; now part of TAC 2010
Other workshops:
– Answer Validation Exercise (AVE) at the Cross-Language Evaluation Forum (QA@CLEF 2008)
– The second evaluation campaign of NLP tools for Italian (EVALITA 2009)

7 Rationale of RTE
Variability of semantic expression
– the same meaning can be expressed by, or inferred from, different texts
– considered the dual problem of language ambiguity: together they yield a many-to-many mapping between language expressions and meanings
Different applications need similar models of semantic variability
– they need a model that recognizes that different text variants express a particular target meaning
– application-oriented methods are difficult to evaluate and compare under a generic evaluation framework

8 Entailment for applications
IR
– A query denotes a combination of semantic concepts and relations
– Relevant retrieved documents should entail the query
IE
– Different text variants entail the same target relation
Multi-document summarization
– A redundant sentence (one to be omitted from the summary) should be entailed by the other sentences in the summary
MT
– A correct automatic translation should be semantically equivalent to the gold-standard translation

9 Evaluation of RTE
An operational evaluation task
– Human annotators decide whether the entailment relationship holds for a given pair of texts or not

10 RTE1 dataset & gold-standard annotation

11 The RTE systems’ results
Overall accuracy levels ranged:
– RTE1: from 50% to 60% (17 submissions)
– RTE2: from 53% to 75% (23 submissions)
– RTE3: from 49% to 80% (26 submissions)
– RTE4: from 45% to 74% (26 submissions, three-way task)
Common approaches:
– Machine learning (typically SVM)
– Logical inference
– Cross-pair similarity measures between T and H
– Word alignment

12 Collecting RTE datasets
The text–hypothesis pairs were collected from several application scenarios:
1. Information extraction pairs
2. Information retrieval pairs
3. Question answering pairs
4. Summarization pairs

13 Collecting information extraction pairs
Simulate the need of IE systems to recognize that a given text indeed entails the target semantic relation
– Relation: X works for Y
– T: ‘An Afghan interpreter, employed by the United States, was also wounded’
– H: ‘An interpreter worked for Afghanistan’
The setting is adapted to pairs of texts rather than a text and a structured template
The pairs were generated using four different approaches:
– 1. ACE-2004 relations (relations tested in the ACE-2004 RDR task) were taken as templates for hypotheses
Relevant news articles were collected as texts
The texts were then given to actual IE systems for extraction of ACE relation instances
The system outputs were used as hypotheses
– Generating both positive examples (from correct outputs) and negative examples (from incorrect outputs)

14
– 2. The output of IE systems on the dataset of the MUC-4 TST3 task (the events are acts of terrorism)
– 3. Additional entailment pairs were manually generated from both the annotated MUC-4 dataset and the news articles collected for the ACE relations
– 4. Hypotheses corresponding to new types of semantic relations (not found in the ACE and MUC datasets) were manually generated for sentences in the collected news articles
The relations were taken from various semantic domains, such as sports, entertainment and science
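As a toy illustration of the template idea (the helper below is hypothetical, not part of the paper's methodology), a relation template such as ‘X works for Y’ from slide 13 can be instantiated into a hypothesis string:

```python
# Hypothetical helper: fill a relation template such as 'X works for Y'
# with arguments extracted by an IE system to obtain a hypothesis.
def instantiate(template: str, x: str, y: str) -> str:
    return template.replace("X", x).replace("Y", y)

text = "An Afghan interpreter, employed by the United States, was also wounded"
hypothesis = instantiate("X worked for Y", "An interpreter", "Afghanistan")
candidate_pair = (text, hypothesis)  # to be labelled positive or negative by annotators
```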

15 Collecting information retrieval pairs
Assumption: relevant documents should entail the given propositional query (hypothesis)
Hypotheses: propositional IR queries
– which specify some statement
– e.g. ‘Alzheimer’s disease is treated using drugs’
– adapted and simplified from standard IR evaluation datasets (TREC and CLEF)
Texts: selected from documents retrieved by different search engines for each hypothesis
– e.g. Google, Yahoo and MSN

16 Collecting question answering pairs
In a QA system, the retrieved answer passage should entail the correct answer
Annotators were given questions taken from the TREC-QA and QA@CLEF datasets
Text: the corresponding answer passages were extracted from the web by QA systems
Hypothesis: obtained by transforming a question–answer pair into a text–hypothesis pair
1. Annotators picked from the answer passage an answer term of the expected answer type, either a correct or an incorrect one
2. Then the annotators turned the question into an affirmative sentence with the answer term ‘plugged in’
Given the question: ‘How many inhabitants does Slovenia have?’
Answer text (T): ‘In other words, with its 2 million inhabitants, Slovenia has only 5.5 thousand professional soldiers’
Picking ‘2 million’ as the (correct) answer term and turning the question into the statement ‘Slovenia has 2 million inhabitants’ (H) produces a positive entailment pair
Picking ‘5.5 thousand’ inhabitants as an (incorrect) answer term produces a negative pair
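A toy sketch of this conversion for the Slovenia example (in the challenge itself this step was done manually by annotators; the function below handles only this one question pattern):

```python
def hypothesis_from_answer_term(answer_term: str) -> str:
    # 'How many inhabitants does Slovenia have?' turned into an affirmative
    # statement with the chosen answer term plugged in
    return f"Slovenia has {answer_term} inhabitants"

text = ("In other words, with its 2 million inhabitants, Slovenia has only "
        "5.5 thousand professional soldiers")

positive_pair = (text, hypothesis_from_answer_term("2 million"))     # entailed
negative_pair = (text, hypothesis_from_answer_term("5.5 thousand"))  # not entailed
```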

17 Collecting summarization pairs
T and H are sentences taken from a news document cluster
– news articles that describe the same news item
Annotators were given the output of multi-document summarization systems:
– the document clusters
– the summary generated for each cluster
They picked sentence pairs with high lexical overlap
T: at least one of the sentences was taken from the summary
H:
– Positive examples: simplified by removing the parts of sentence H that cannot be entailed by sentence T, until H was fully entailed by T
– Negative examples: simplified similarly, but without reaching full entailment of H by T

18 Creating the final dataset
Cross-annotation by at least two annotators
– In RTE1, pairs on which the annotators disagreed were filtered out
– Average agreement on the test set (between each pair of annotators who shared at least 100 examples) was 89.2%, with an average kappa level of 0.78
– Additional filtering: discarded pairs that seemed controversial, too difficult or redundant
25.5% of the (original) pairs were removed from the test set
Spelling and punctuation were fixed, but not style
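For reference, inter-annotator agreement and the kappa statistic can be computed as below (a sketch with made-up judgments; the figures above come from the actual RTE1 annotation):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical judgments from two annotators over shared pairs (True = entailment)
annotator_a = [True, True, False, True, False, False, True, True]
annotator_b = [True, True, False, False, False, False, True, True]

# Raw agreement and chance-corrected kappa
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```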

19 Evaluation measures
The main task in the RTE Challenges was classification
– an entailment judgment for each pair in the test set
– The evaluation criterion for this task was accuracy
A secondary optional task was ranking
– the T–H pairs are ordered according to their entailment confidence
the first pair is the one for which entailment is most certain
the last pair is the one for which entailment is least likely
– A perfect ranking would place all the positive pairs before all the negative pairs
– This task was evaluated using the average precision measure
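A sketch of the two measures as described above (our own implementation, assuming boolean gold labels and real-valued confidences):

```python
def accuracy(gold, predicted):
    # Fraction of pairs whose entailment judgment matches the gold standard
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def average_precision(gold, confidences):
    # Rank pairs by decreasing entailment confidence, then average the
    # precision observed at each position that holds a positive pair
    ranked = [g for _, g in sorted(zip(confidences, gold), key=lambda x: -x[0])]
    precisions, positives_seen = [], 0
    for rank, is_positive in enumerate(ranked, start=1):
        if is_positive:
            positives_seen += 1
            precisions.append(positives_seen / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```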

20 RTE1
Manually collected text fragment pairs
– text (T): one or two sentences
– hypothesis (H): one sentence
Participating systems were required to judge for each pair whether T entails H
– The pairs represented success and failure settings of inferences in various application types: QA, IE, IR and MT
– Development set: 567 pairs
– Test set: 800 pairs
– 17 submissions
– Low accuracy: best results below 60%

21 Results
Submissions with partial coverage, which do not cover all test data, were allowed
The dataset is balanced in terms of true and false examples
– A system that uniformly predicts True (or False) would achieve an accuracy of 50%, serving as a baseline
The most basic inference type: measure word overlap between T and H
– Using a simple decision tree trained on the development set, this obtained an accuracy of 0.568
– This might reflect a knowledge-poor baseline
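A sketch of such a word-overlap baseline (our own reconstruction; `dev_pairs` and `dev_labels` stand for the RTE1 development set and are assumptions):

```python
from sklearn.tree import DecisionTreeClassifier

def word_overlap(text: str, hypothesis: str) -> float:
    # Fraction of hypothesis words that also appear in the text
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    return len(t & h) / len(h) if h else 0.0

def train_overlap_baseline(dev_pairs, dev_labels):
    # dev_pairs: list of (T, H) strings; dev_labels: boolean gold judgments
    features = [[word_overlap(t, h)] for t, h in dev_pairs]
    return DecisionTreeClassifier(max_depth=2).fit(features, dev_labels)
```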

23 RTE2
Four sites participated in the data collection and annotation, from Israel, Italy and the USA
The RTE2 dataset aimed to provide more ‘realistic’ text–hypothesis examples, based mostly on outputs of actual systems
The annotation process included cross-annotation
Preprocessing, including sentence splitting and dependency parsing, was added to the challenge data

24 RTE3
Part of the pairs contained longer texts (up to one paragraph)
– encouraging participants to move towards discourse-level inference
26 teams participated
The results were presented at the ACL 2007 Workshop
Scores ranged from 0.49 to 0.80

25 RTE4
Included as a track in the Text Analysis Conference (TAC), organized by NIST
Three-way classification: non-entailment cases are split into
CONTRADICTION: the negation of the hypothesis is entailed by the text
UNKNOWN: the truth of the hypothesis cannot be determined based on the text
– Runs submitted to the three-way task were automatically converted to two-way runs (non-entailment cases conflated to NO ENTAILMENT)
– In the three-way task, the best accuracy was 0.685; the average three-way accuracy was 0.51
– For the two-way judgment, the best accuracy was 0.72, lower than the best results achieved in the RTE3 competition (the datasets are different)

26 Some issues
Machine-learning approaches as applied to semantics
The availability of a training set made it possible to formulate the problem as a classification task (a minimal sketch is given below), with features including:
– lexical and syntactic features
– semantic features
– document co-occurrence counts
– first-order syntactic rewrite rules
– the information gain provided by lexical measures
Transformations: derive the hypothesis H from the text T
– transformation rules designed to preserve the entailment relation
– a probabilistic setting
Precision-oriented RTE modules
– Specialized textual entailment engines are designed to address a specific aspect of language variability, e.g. contradiction or lexical similarity
– They are combined, applying a voting mechanism, with a high-coverage backup module
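A minimal sketch of this classification formulation (our own simplification with only two toy features; the actual systems used far richer lexical, syntactic and semantic feature sets):

```python
from sklearn.svm import SVC

def features(text: str, hypothesis: str) -> list:
    # Two toy features: lexical coverage of H by T, and a crude length ratio
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    overlap = len(t & h) / len(h) if h else 0.0
    length_ratio = len(h) / len(t) if t else 0.0
    return [overlap, length_ratio]

def train_entailment_classifier(train_pairs, train_labels):
    # SVMs were a typical learner choice in RTE submissions
    X = [features(t, h) for t, h in train_pairs]
    return SVC(kernel="linear").fit(X, train_labels)
```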

27 Resources
Lexical-semantic resources
– WordNet and its extensions
Statistically learned inference rules
– DIRT (Discovery of Inference Rules from Text)
X is author of Y ≈ X wrote Y
X solved Y ≈ X found a solution to Y
X caused Y ≈ Y is triggered by X
Verb-oriented resources
– VerbNet and VerbOcean
The web as a resource
– to extract entailment rules, named entities and background knowledge
– Wikipedia
Various text collections
– Reuters corpus
– English Gigaword
– used to extract features based on documents’ co-occurrence counts and InfoMap
– Dekang Lin’s thesaurus and gazetteers to draw lexical similarity judgements
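To illustrate how such inference rules can be applied, here is a toy string-level sketch (real DIRT rules operate on dependency paths, not surface strings):

```python
import re

# Toy string-level stand-ins for the DIRT rules listed above
RULES = [
    (r"(?P<X>.+) is author of (?P<Y>.+)", r"\g<X> wrote \g<Y>"),
    (r"(?P<X>.+) solved (?P<Y>.+)", r"\g<X> found a solution to \g<Y>"),
    (r"(?P<X>.+) caused (?P<Y>.+)", r"\g<Y> is triggered by \g<X>"),
]

def apply_rules(sentence: str) -> list:
    # Return every paraphrase obtained by applying one rule to the sentence
    variants = []
    for pattern, template in RULES:
        match = re.fullmatch(pattern, sentence)
        if match:
            variants.append(match.expand(template))
    return variants

print(apply_rules("Perelman solved the Poincare conjecture"))
# ['Perelman found a solution to the Poincare conjecture']
```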

28 Predicts entailment using syntactic features and a general-purpose thesaurus
Goal: understand what proportion of the entailments in the RTE1 test set could be solved using a robust parser
Human annotators evaluated each T–H pair as
– true by syntax, false by syntax, not syntax, or can’t decide
Annotators also indicated whether the information in a general-purpose thesaurus entry would allow a pair to be judged true or false
– 37% of the test items can be handled by syntax
– 49% of the test items can be handled by syntax plus a general-purpose thesaurus
It is easier to decide when syntax can be expected to return ‘true’; it is uncertain when to assign ‘false’

29 Two intermediate models of textual entailment
Lexical level
– lexical-semantic and morphological relations
– lexical world knowledge
Lexical-syntactic level
– syntactic relationships and transformations
– lexical-syntactic inference patterns (rules)
– co-reference
The outcomes were compared for the two models as well as for their individual components
The lexical-syntactic model outperforms the lexical one
Both models fail to achieve high recall: the majority of pairs rely on a significant amount of so-called common human understanding of lexical and world knowledge

30 A harder task: contradiction
Contradiction occurs when two sentences (involving the same event) are extremely unlikely to be true simultaneously
Accompanied by a collection of contradiction corpora
A harder task, since it requires deeper inference, assessing event co-reference, and model building

31 Conclusions
The RTE task has reached a noticeable level of maturity
The long-term goal of textual entailment research is the development of robust entailment ‘engines’
– which will be used as a generic component in many text-understanding applications

