The PASCAL Recognizing Textual Entailment Challenges – RTE-1, 2, 3
Ido Dagan, Bar-Ilan University, Israel, with …

Presentation transcript:

1 The PASCAL Recognizing Textual Entailment Challenges – RTE-1, 2, 3
Ido Dagan, Bar-Ilan University, Israel, with …

2 Recognizing Textual Entailment – PASCAL NOE Challenge
Ido Dagan, Oren Glickman (Bar-Ilan University, Israel); Bernardo Magnini (ITC-irst, Trento, Italy)

3 The Second PASCAL Recognising Textual Entailment Challenge
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor
Bar-Ilan, CELCT, ITC-irst, Microsoft Research, MITRE

4 The Third Recognising Textual Entailment Challenge Danilo Giampiccolo (CELCT) and Bernardo Magnini (FBK-ITC) With Ido Dagan (Bar-Ilan) and Bill Dolan (Microsoft Research) Patrick Pantel (USC-ISI), for Resources Pool Hoa Dang and Ellen Voorhees (NIST), for Extended Task

5 RTE Motivation
Text applications require semantic inference.
A common framework for addressing applied inference as a whole is needed, but still missing:
– Global inference is typically application-dependent
– Application-independent approaches and resources exist for some semantic sub-problems
Textual entailment may provide such a common, application-independent semantic framework.

6 Framework Desiderata
A framework for modeling a target level of language processing should provide:
1) A generic module for applications – a common underlying task with a unified interface (cf. parsing)
2) A unified paradigm for investigating sub-phenomena

7 Outline
The textual entailment task – what and why?
Evaluation dataset & methodology
Participating systems and approaches
Potential for machine learning
Framework for investigating semantics

8 Natural Language and Meaning
[Diagram: the mapping between Meaning and Language – ambiguity (one expression, many meanings) and variability (one meaning, many expressions)]

9 Variability of Semantic Expression
Model variability as relations between text expressions:
– Equivalence: text1 ⇔ text2 (paraphrasing)
– Entailment: text1 ⇒ text2 – the general case
[Example: "Dow gains 255 points", "Dow climbs 255", "The Dow Jones Industrial Average closed up 255", and "Stock market hits a record high" all entail "Dow ends up"]

10 Typical Application Inference
Question: Who bought Overture? >> Expected answer form: X bought Overture
Text: Overture's acquisition by Yahoo … entails the hypothesized answer: Yahoo bought Overture.
Similar for IE: X buy Y
"Semantic" IR: t: Overture was bought …
Summarization (multi-document) – identify redundant info
MT evaluation (and recent ideas for MT)
Educational applications, …

11 KRAQ'05 Workshop – KNOWLEDGE and REASONING for ANSWERING QUESTIONS (IJCAI-05)
CFP:
– Reasoning aspects: information fusion, search criteria expansion models, summarization and intensional answers, reasoning under uncertainty or with incomplete knowledge
– Knowledge representation and integration: levels of knowledge involved (e.g. ontologies, domain knowledge), knowledge extraction models and techniques to optimize response accuracy
… but similar needs arise for other applications – can entailment provide a common empirical task?

12 Classical Entailment Definition
Chierchia & McConnell-Ginet (2001): A text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true.
Strict entailment – doesn't account for some uncertainty allowed in applications.

13 “Almost certain” Entailments t: The technological triumph known as GPS … was incubated in the mind of Ivan Getting. h: Ivan Getting invented the GPS.

14 Applied Textual Entailment
A directional relation between two text fragments, Text (t) and Hypothesis (h):
t entails h (t ⇒ h) if humans reading t will infer that h is most likely true.
Operational (applied) definition:
– Human gold standard – as in NLP applications
– Assuming common background knowledge – which is indeed expected from applications

15 Intended scope: what's in the text
Text (t): Sunscreen is used to protect from getting sunburned.
Hypothesis (h): Sunscreen prevents sunburns. (t ⇒ h)

16 Evaluation Dataset

17 Generic Dataset by Application Use
7 application settings in RTE-1, 4 in RTE-2/3:
– QA
– IE
– "Semantic" IR
– Comparable documents / multi-doc summarization
– MT evaluation
– Reading comprehension
– Paraphrase acquisition
Most data created from actual applications' output
~800 examples in development and test sets
50%-50% YES/NO split

18 Some Examples
1. Text: Regan attended a ceremony in Washington to commemorate the landings in Normandy.
   Hypothesis: Washington is located in Normandy. (Task: IE, Entailment: False)
2. Text: Google files for its long awaited IPO.
   Hypothesis: Google goes public. (Task: IR, Entailment: True)
3. Text: …: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   Hypothesis: Cardinal Juan Jesus Posadas Ocampo died in 1993. (Task: QA, Entailment: True)
4. Text: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   Hypothesis: The SPD is defeated by the opposition parties. (Task: IE, Entailment: True)

19 Final Dataset (RTE-2)
Average pairwise inter-judge agreement: 89.2%
– Average Kappa 0.78 – substantial agreement
– Better than RTE-1
Removed 18.2% of pairs due to disagreement (3-4 judges)
Disagreement example:
– (t) Women are under-represented at all political levels...
– (h) Women are poorly represented in parliament.
Additional review removed 25.5% of pairs – too difficult / vague / redundant
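For concreteness, here is a minimal sketch of how pairwise Cohen's kappa could be computed from two annotators' YES/NO judgments. The function and toy data are illustrative only, not the actual RTE annotation tooling.

```python
# Minimal sketch: Cohen's kappa for two annotators' YES/NO entailment
# judgments (toy data, not the actual RTE annotations).
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["YES", "YES", "NO", "NO", "YES", "NO"]
ann2 = ["YES", "NO",  "NO", "NO", "YES", "NO"]
print(cohen_kappa(ann1, ann2))  # ~0.67 on this toy sample
```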

20 Final Dataset (RTE-3)
Each pair judged by three annotators; pairs on which the annotators disagreed were filtered out.
Average pairwise annotator agreement: 87.8% (Kappa level of 0.75)
Filtered-out pairs:
– 19.2% due to disagreement
– 9.4% as controversial, too difficult, or too similar to other pairs

21 Progress from RTE-1 to RTE-3
More realistic application data:
– RTE-1: some partly synthetic examples
– RTE-2 & 3: mostly input from common benchmarks for the different applications, and output from real systems
– Tests entailment potential across applications
Text length:
– RTE-1 & 2: one-two sentences
– RTE-3: 25% full paragraphs, requiring discourse modeling/anaphora
Improved data collection and annotation:
– Revised and expanded guidelines
– Most pairs triply annotated, some across organizers' sites
Provided linguistic pre-processing and an RTE Resources Pool
RTE-3 pilot task by NIST: 3-way judgments; explanations

22 Suggested Perspective
Regarding the Arthur Bernstein competition: "… Competition, even a piano competition, is legitimate … as long as it is just an anecdotal side effect of the musical culture scene, and doesn't threaten to overtake the center stage" – Haaretz (Israeli newspaper), Culture Section, April 1st, 2005

23 Participating Systems

24 Participation
Popular challenges, worldwide:
– RTE-1 – 17 groups
– RTE-2 – 23 groups
– RTE-3 – 26 groups: 14 Europe, 12 US; 11 newcomers (~40 groups so far)
79 dev-set downloads (44 planned, 26 maybe)
42 test-set downloads
Joint ACL-07/PASCAL workshop (~70 participants)

25 Methods and Approaches
Estimate similarity match between t and h (coverage of h by t) – the sketch after this list shows the simplest variant:
– Lexical overlap (unigram, N-gram, subsequence)
– Lexical substitution (WordNet, statistical)
– Lexical-syntactic variations ("paraphrases")
– Syntactic matching / edit distance / transformations
– Semantic role labeling and matching
– Global similarity parameters (e.g. negation, modality)
– Anaphora resolution
Probabilistic tree transformations
Cross-pair similarity
Detecting mismatch (for non-entailment)
Logical interpretation and inference
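An illustrative sketch of the simplest similarity measure named above, unigram coverage of the hypothesis by the text; the function name and example pair are chosen for illustration, and real systems combine many such signals.

```python
# Toy sketch: unigram coverage of the hypothesis by the text.
def unigram_coverage(text, hypothesis):
    t_tokens = set(text.lower().split())
    h_tokens = hypothesis.lower().split()
    covered = sum(tok in t_tokens for tok in h_tokens)
    return covered / len(h_tokens)

t = "Google files for its long awaited IPO."
h = "Google goes public."
print(unigram_coverage(t, h))  # ~0.33: only "google" is covered
```

A system could threshold this score to output YES/NO, which is why lexical overlap alone already performs surprisingly well on balanced RTE data.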

26 Dominant approach: Supervised Learning
Features model various aspects of similarity and mismatch.
A classifier determines the relative weights of the information sources.
Train on the development set and auxiliary t-h corpora.
[Pipeline diagram: (t, h) pair → similarity features (lexical, n-gram, syntactic, semantic, global) → feature vector → classifier → YES/NO]
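A hedged sketch of this setup, with invented features and toy training pairs standing in for what participants actually used:

```python
# Sketch of the dominant supervised setup: hand-built similarity features
# per (t, h) pair feed a standard classifier. Features and data are toy
# stand-ins, not any participant's actual system.
from sklearn.linear_model import LogisticRegression

def features(t, h):
    t_set, h_toks = set(t.lower().split()), h.lower().split()
    coverage = sum(w in t_set for w in h_toks) / len(h_toks)
    len_ratio = len(h_toks) / max(len(t.split()), 1)
    # Global mismatch cue: negation present on one side only.
    neg_mismatch = float(("not" in t_set) != ("not" in h_toks))
    return [coverage, len_ratio, neg_mismatch]

train = [
    ("Google files for its long awaited IPO.", "Google goes public.", 1),
    ("Regan attended a ceremony in Washington.", "Washington is in Normandy.", 0),
    ("Yahoo bought Overture.", "Overture was acquired by Yahoo.", 1),
    ("The SPD polled 21.5% of the vote.", "The SPD won the election.", 0),
]
X = [features(t, h) for t, h, _ in train]
y = [label for _, _, label in train]
clf = LogisticRegression().fit(X, y)
print(clf.predict([features("Dow gains 255 points.", "Dow ends up.")]))
```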

27 Parse-based Proof Systems (Bar-Haim et al., RTE-3)
[Diagram: dependency parses illustrating an entailment proof as a sequence of tree rewrites: "It rained when John and Mary left" ⇒ "It rained when Mary left" ⇒ "Mary left"]
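A toy sketch of one such rewrite step, conjunction reduction, on a nested-tuple tree encoding invented here for illustration; it is not the representation used by Bar-Haim et al.

```python
# Toy rewrite step in the spirit of the slide's proof example:
# conjunction reduction turns "John and Mary left" into "Mary left".
def drop_conjunct(tree, keep):
    """Replace any ('conj', a, b) node by the conjunct we keep."""
    if isinstance(tree, tuple):
        if tree[0] == "conj" and keep in tree[1:]:
            return keep
        return tuple(drop_conjunct(child, keep) for child in tree)
    return tree

# "It rained when John and Mary left"
t = ("rain", ("when", ("leave", ("conj", "John", "Mary"))))
print(drop_conjunct(t, "Mary"))
# -> ('rain', ('when', ('leave', 'Mary')))  ~ "It rained when Mary left"
```

Chaining such entailment-preserving rewrites, each licensed by a rule, is what turns matching into a "proof" of the hypothesis from the text.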

28 Resources
WordNet, Extended WordNet, distributional similarity:
– Britain ⇒ UK
– steal ⇒ take
DIRT (paraphrase rules):
– X file a lawsuit against Y ⇒ X accuse Y (world knowledge)
– X confirm Y ⇒ X approve Y (linguistic knowledge)
FrameNet, PropBank, VerbNet – for semantic role labeling
Entailment pairs corpora – automatically acquired training
No dedicated resources for entailment yet
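A sketch of a WordNet-based lexical entailment lookup of the kind listed above (run `nltk.download('wordnet')` once beforehand). It checks whether some sense of the hypothesis word is a synonym or hypernym of a sense of the text word, as in steal ⇒ take; real RTE systems used such lookups as one signal among many, not as a standalone decision.

```python
# Sketch: lexical entailment check via WordNet synonymy/hypernymy.
from nltk.corpus import wordnet as wn

def wn_lexical_entailment(t_word, h_word):
    h_senses = set(wn.synsets(h_word))
    for t_sense in wn.synsets(t_word):
        if t_sense in h_senses:  # synonymy: the words share a synset
            return True
        # Hypernymy: h_word names a generalization of t_word.
        for path in t_sense.hypernym_paths():
            if any(s in h_senses for s in path[:-1]):
                return True
    return False

print(wn_lexical_entailment("steal", "take"))  # True via hypernymy
```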

29 Accuracy Results – RTE-1

30 Results (RTE-2)
Average Precision   Accuracy       First Author (Group)
80.8%               75.4%          Hickl (LCC)
71.3%               73.8%          Tatu (LCC)
64.4%               63.9%          Zanzotto (Milan & Rome)
62.8%               62.6%          Adams (Dallas)
66.9%               61.6%          Bos (Rome & Leeds)
–                   58.1%-60.5%    11 groups
–                   52.9%-55.6%    7 groups
Average accuracy: 60%; median: 59%

31 Results: RTE-3
Accuracy ranking:
1. Hickl – LCC
2. Tatu – LCC
3. Iftene – Uni. Iasi
4. Adams – Uni. Dallas
5. Wang – DFKI: 0.66
Baseline (all YES): 0.51
Two systems scored above 70%.
Most systems (65%) were in the 60-70% range; only 30% reached that range at RTE-2.

32 Current Limitations
Simple methods perform quite well, but not best.
System reports point at:
– Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
– Lack of training data
It seems that systems that coped better with these issues performed best:
– Hickl et al. – acquisition of large entailment corpora for training
– Tatu et al. – large knowledge bases (linguistic and world knowledge)

33 Impact
High interest in the research community:
– Papers, conference sessions and areas, PhD theses, funded projects
– Special issue – Journal of Natural Language Engineering
– ACL-07 tutorial
Initial contributions to specific applications:
– QA – Harabagiu & Hickl, ACL-06; CLEF-06/07
– RE – Romano et al., EACL-06
RTE-4 – by NIST, with CELCT – within TAC, a new semantic evaluation conference (with QA and summarization, subsuming DUC)

34 New Potentials for Machine Learning

35 Classical Approach = Interpretation
[Diagram: Language (by nature) mapped, across variability, onto a stipulated meaning representation (by scholar)]
Logical forms, word senses, semantic roles, named entity types, … – scattered tasks.
Is this a feasible/suitable framework for applied semantics?

36 Textual Entailment = Text Mapping
[Diagram: Language (by nature) mapped, across variability, directly onto assumed meaning (by humans)]

37 General Case – Inference
[Diagram: textual entailment maps between texts at the language level; interpretation maps into a meaning representation, where inference operates]
Entailment mapping is the actual applied goal – and also a touchstone for understanding!
Interpretation becomes a possible means.

38 Machine Learning Perspectives
Issues with the interpretation approach:
– Hard to agree on target representations
– Costly to annotate semantic representations for training
– Has this been a barrier?
Language-level entailment mapping refers to texts:
– Texts are semantic-theory neutral
– Amenable to unsupervised/semi-supervised learning
It would be interesting to explore (many do) language-based representations of meaning, inference knowledge, and ontology, for which learning and inference methods may be easier to develop.
Artificial intelligence through natural language?

39 Major Learning Directions
Learning entailment knowledge (!!!):
– Learning entailment relations between words/expressions
– Integrating with manual resources and knowledge
Inference methods:
– Principled frameworks for probabilistic inference – estimating the likelihood of deriving the hypothesis from the text
– Fusing information levels – more than bags of features
Relational learning is relevant for both.
How can we increase ML researchers' involvement?

40 Learning Entailment Knowledge
Entailing "topical" terms from words/texts:
– E.g. medicine, law, cars, computer security, …
– An unsupervised version of text categorization
Learning an entailment graph for terms/expressions (see the sketch below):
– Partial knowledge: statistical, lexical resources, Wikipedia, …
– Estimate link likelihood in context
[Diagram: candidate entailment graph over acquire/v, own/v, acquisition/n, buy/v, purchase/n, with edges from derivational links, WordNet synonymy, and distributional similarity]
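An illustrative sketch of such an entailment graph with edge likelihoods combined from several partial knowledge sources; the scores, weights, and noisy-OR combination are invented for illustration, not a method from the slides.

```python
# Toy entailment graph: edges carry evidence from multiple partial
# knowledge sources; link likelihood combines them noisy-OR style.
edges = {}  # (lhs, rhs) -> {source: score}

def add_evidence(lhs, rhs, source, score):
    edges.setdefault((lhs, rhs), {})[source] = score

# Partial knowledge, as in the slide's diagram.
add_evidence("purchase/n", "buy/v", "derivation", 0.9)
add_evidence("buy/v", "acquire/v", "wordnet_syn", 0.8)
add_evidence("acquire/v", "own/v", "dist_sim", 0.4)

WEIGHTS = {"derivation": 1.0, "wordnet_syn": 0.9, "dist_sim": 0.5}

def link_likelihood(lhs, rhs):
    p_none = 1.0  # probability that no evidence source supports the link
    for src, score in edges.get((lhs, rhs), {}).items():
        p_none *= 1.0 - WEIGHTS[src] * score
    return 1.0 - p_none

print(link_likelihood("acquire/v", "own/v"))  # 0.2: weak dist-sim evidence
```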

41 Meeting the knowledge challenge – by a coordinated effort?
A vast amount of "entailment rules" is needed.
Speculation: can we have a joint community effort for knowledge acquisition?
– Uniform representations
– Mostly automatic acquisition (millions of rules)
– Human Genome Project analogy
Preliminary: RTE-3 Resources Pool at ACLWiki (set up by Patrick Pantel)

42 Textual Entailment ≈ Human Reading Comprehension
From a children's English learning book (Sela and Greenberg):
Reference Text: "…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …"
Hypothesis (True/False?): The Bermuda Triangle is near the United States.

43 Where are we (from RTE-1)?

44 Cautious Optimism
1) Textual entailment provides a unified framework for applied semantics – towards generic inference "engines" for applications
2) Potential for:
– Scalable knowledge acquisition, boosted by (mostly unsupervised) learning
– Learning-based inference methods
Thank you!

45 Summary: Textual Entailment as Goal
The essence of our proposal:
– Base applied inference on entailment "engines" and KBs
– Formulate various semantic problems as entailment tasks
Interpretation and "mapping" methods may compete/complement.
Open question: which inferences
– can be represented at the language level?
– require logical or specialized representation and inference? (temporal, spatial, mathematical, …)

46 Collecting QA Pairs
Motivation: a passage containing the answer slot filler should entail the corresponding answer statement.
– E.g. for Who invented the telephone? with answer Bell, the text should entail Bell invented the telephone.
QA systems were given TREC and CLEF questions.
Hypotheses were generated by "plugging" the system's answer term into the affirmative form of the question (a toy sketch follows).
Texts correspond to the candidate answer passages.
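A toy sketch of this hypothesis-generation step. Real data creation handled question rewriting much more broadly; this hypothetical helper covers only the simple "Who <predicate>?" pattern used in the slide's example.

```python
# Toy sketch: plug a system's answer term into the affirmative form of
# the question. Only the "Who <predicate>?" pattern is handled here.
def qa_hypothesis(question, answer):
    q = question.strip().rstrip("?")
    if q.lower().startswith("who "):
        return f"{answer} {q[4:]}."
    raise ValueError("question pattern not handled in this sketch")

print(qa_hypothesis("Who invented the telephone?", "Bell"))
# -> "Bell invented the telephone."
```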

47 Collecting IE Pairs
Motivation: a sentence containing a target relation instance should entail an instantiated template of the relation.
– E.g.: X is located in Y
Pairs were generated in several ways:
– From outputs of IE systems, for ACE-2004 and MUC-4 relations
– Manually, for ACE-2004 and MUC-4 relations and for additional relations in the news domain

48 Collecting IR Pairs
Motivation: relevant documents should entail a given "propositional" query.
Hypotheses are propositional IR queries, adapted and simplified from TREC and CLEF:
– drug legalization benefits ⇒ drug legalization has benefits
Texts were selected from documents retrieved by different search engines.

49 Collecting SUM (MDS) Pairs
Motivation: identifying redundant statements (particularly in multi-document summaries).
Using web document clusters and a system summary, sentences with high lexical overlap with the summary were picked as hypothesis candidates.
In final pairs:
– Texts are original sentences (usually from the summary)
– Hypotheses: for positive pairs, simplify h until it is entailed by t; for negative pairs, simplify h similarly
In RTE-3: used the Pyramid benchmark data