Answer Validation Exercise - AVE QA subtrack at Cross-Language Evaluation Forum UNED (coord.) Anselmo Peñas Álvaro Rodrigo Valentín Sama Felisa Verdejo.

Presentation transcript:

Answer Validation Exercise - AVE QA subtrack at Cross-Language Evaluation Forum
UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo
Thanks to… Bernardo Magnini, Danilo Giampiccolo, Pamela Forner, Petya Osenova, Christelle Ayache, Bogdan Sacaleanu, Diana Santos, Juan Feu, Ido Dagan, …

What? The Answer Validation Exercise: validate the correctness of the answers given by real QA systems... the answers of participants at CLEF QA 2006.
Why? To give feedback on a single QA module, improve QA systems' performance, improve systems' self-scoring, help humans in the assessment of QA systems' output, develop criteria for collaborative QA systems, ...

How? By turning it into an RTE exercise: if the text semantically entails the hypothesis, then the answer is expected to be correct.
The QA system returns the question, the exact answer and a supporting snippet with its document ID. The question and the answer are combined into a hypothesis in affirmative form; the supporting snippet (several sentences, <500 bytes) becomes the text.
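As a rough illustration of this reformulation, the sketch below shows how a (question, answer, supporting snippet) triple could be turned into a text-hypothesis pair; the pattern-based hypothesis construction and the helper name are assumptions for illustration, not the actual AVE generation procedure.

# Illustrative sketch only: turning a (question, answer, snippet) triple from a
# QA system into an RTE-style text-hypothesis pair. The pattern substitution
# below is an assumption, not the actual AVE generation scripts.

def build_rte_pair(affirmative_pattern, answer, snippet):
    # affirmative_pattern is the question already rewritten in affirmative form,
    # with a placeholder for the answer, e.g. "{answer} is the President of Mexico"
    hypothesis = affirmative_pattern.format(answer=answer.strip())
    text = snippet  # supporting snippet: several sentences, <500 bytes in AVE
    return {"text": text, "hypothesis": hypothesis}

pair = build_rte_pair(
    "{answer} is the President of Mexico",
    "Vicente Fox",
    "...President Vicente Fox promises a more democratic Mexico...",
)
# A validation system must now decide whether pair["text"] entails pair["hypothesis"].

Under this construction, a correct answer should yield a hypothesis that the supporting text entails, which is exactly the judgement asked of AVE participants.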

Example
 Question: Who is the President of Mexico?
 Answer (obsolete): Vicente Fox
 Hypothesis: Vicente Fox is the President of Mexico
 Supporting Text: “...President Vicente Fox promises a more democratic Mexico...”
 Exercise: Does the Text entail the Hypothesis? Answer: YES | NO

Looking for robust systems
 Hypotheses are built semi-automatically from the systems' answers:
   Some answers are correct and exact
   Many are too long, too short, or plainly wrong
 Many hypotheses have:
   Wrong syntax, but understandable
   Wrong syntax, and not understandable
   Wrong semantics

So, the exercise
 Return an entailment value (YES | NO) for each given text-hypothesis pair
 Results were evaluated against the QA human assessments
 Subtasks: English, Spanish, Italian, Dutch, French, German, Portuguese and Bulgarian

Collections (available for CLEF participants at nlp.uned.es/QA/ave/)

Language     Testing          Training
English      2088 (10% YES)   2870 (15% YES)
Spanish      2369 (28% YES)   2905 (22% YES)
German       1443 (25% YES)
French       3266 (22% YES)
Italian      1140 (16% YES)
Dutch         807 (10% YES)
Portuguese   1324 (14% YES)

Evaluation
 The collections are not balanced
 Approach: detect whether there is enough evidence to accept an answer
 Measures: precision, recall and F over the YES pairs (pairs where the text entails the hypothesis)
 Baseline system: accept all answers (always give YES)
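For concreteness, here is a minimal sketch of these measures and of the accept-all baseline; it is not the official AVE scorer, and the labels and example data are made up.

# Minimal sketch (not the official AVE scorer): precision, recall and F over the
# YES class, plus the accept-all baseline described above. Labels are assumed to
# be the strings "YES" / "NO"; the example data below is made up.

def f_over_yes(gold, predicted):
    true_yes = sum(1 for g, p in zip(gold, predicted) if g == "YES" and p == "YES")
    pred_yes = sum(1 for p in predicted if p == "YES")
    gold_yes = sum(1 for g in gold if g == "YES")
    precision = true_yes / pred_yes if pred_yes else 0.0
    recall = true_yes / gold_yes if gold_yes else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = ["YES", "NO", "NO", "YES", "NO"]       # human QA assessments
system = ["YES", "NO", "YES", "NO", "NO"]     # a validation system's decisions
baseline = ["YES"] * len(gold)                # accept-all baseline
print("system  :", f_over_yes(gold, system))
print("baseline:", f_over_yes(gold, baseline))

The accept-all baseline always reaches recall 1.0, and its precision equals the proportion of YES pairs in the collection, which is why the imbalance of the collections matters for interpreting the scores.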

Participants and runs (runs were submitted in DE, EN, ES, FR, IT, NL and PT)

Participant                       Runs
Fernuniversität in Hagen          2
Language Computer Corporation     2
U. Rome "Tor Vergata"             2
U. Alicante (Kozareva)
U. Politecnica de Valencia        1
U. Alicante (Ferrández)           2
LIMSI-CNRS                        1
U. Twente
UNED (Herrera)                    2
UNED (Rodrigo)                    1
ITC-irst                          1
R2D2 project                      1

Results

Language     Baseline (F)   Best (F)   Reported Techniques
English      .27            .44        Logic
Spanish      .45            .61        Logic
German       .39            .54        Lexical, Syntax, Semantics, Logic, Corpus
French       .37            .47        Overlapping, Learning
Dutch        .19            .39        Syntax, Learning
Portuguese   .38            .35        Overlapping
Italian      .29            .41        Overlapping, Learning
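As an illustration of the "Overlapping" family of techniques reported above, a lexical-overlap baseline might look like the sketch below; the tokenization and the 0.5 threshold are arbitrary assumptions for illustration, not values used by any participant.

# Minimal sketch of a lexical-overlap validation baseline, in the spirit of the
# "Overlapping" techniques reported above. The tokenizer and the 0.5 threshold
# are arbitrary assumptions, not values used by any participating system.
import re

def overlap_decision(text, hypothesis, threshold=0.5):
    tokens = lambda s: set(re.findall(r"\w+", s.lower()))
    hyp = tokens(hypothesis)
    if not hyp:
        return "NO"
    coverage = len(hyp & tokens(text)) / len(hyp)  # share of hypothesis words found in the text
    return "YES" if coverage >= threshold else "NO"

print(overlap_decision(
    "...President Vicente Fox promises a more democratic Mexico...",
    "Vicente Fox is the President of Mexico"))  # -> YES (4 of 7 hypothesis words appear in the text)

Shallow overlap of this kind ignores syntax and semantics entirely, which is consistent with the table above, where the best English and Spanish systems relied on logic rather than overlap alone.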

Conclusions
 Developed methodologies: building collections from QA responses, and evaluating in chain with a QA track
 New testing collections for the QA and RTE communities, in 7 languages, not only English
 Evaluation in a real environment: real systems' outputs become the AVE input

Conclusions
 Reformulating Answer Validation as a Textual Entailment problem is feasible: it introduces a 4% error (in the semi-automatic generation of the collection)
 Good participation: 11 systems, 38 runs, 7 languages
 Systems that reported the use of Logic obtained the best results