QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track Overview, Main Task. CLEF 2011, Amsterdam. Anselmo Peñas, Eduard Hovy.

CLEF 2011, Amsterdam. QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track Overview. Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder. Modality and Negation: Roser Morante, Walter Daelemans.

QA Tasks & Time at CLEF. QA tasks over the years: Multiple Language QA Main Task, ResPubliQA, QA4MRE; temporal restrictions and lists; Answer Validation Exercise (AVE); GikiCLEF; negation and modality; Real Time QA over Speech Transcriptions (QAST); WiQA; WSD QA.

New setting: QA over a single document, as multiple-choice reading comprehension tests. Forget about the IR step (for a while); focus on answering questions about a single text and choosing the correct answer. Why this new setting?

Systems' performance: an upper bound of about 60% accuracy. Overall, the best result was below 60%; for definition questions, the best result was above 80%, and it was not an IR approach.

Pipeline upper bound. In the classic pipeline (Question → question analysis → passage retrieval → answer extraction → answer ranking → Answer), the per-stage accuracies multiply, capping end-to-end performance. We need SOMETHING to break the pipeline: answer validation instead of re-ranking, with the option of returning "not enough evidence".
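A minimal sketch of the upper-bound argument; the stage names follow the slide, but the accuracy figures are hypothetical, purely for illustration:

```python
# Hypothetical per-stage accuracies; not measured results from the track.
stage_accuracies = {
    "question_analysis": 0.90,
    "passage_retrieval": 0.80,
    "answer_extraction": 0.75,
    "answer_ranking": 0.85,
}

# End-to-end accuracy of a strict pipeline is at most the product of its stages.
upper_bound = 1.0
for accuracy in stage_accuracies.values():
    upper_bound *= accuracy

print(f"pipeline upper bound ~ {upper_bound:.1%}")  # ~45.9% with these figures
```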

Multi-stream upper bound: a perfect combination of systems would reach 81%, while the best single system achieved 52.5%. Different systems were best for ORGANIZATION, PERSON, and TIME questions.

Multi-stream architectures: different systems answer different types of questions better (specialization and collaboration). In the diagram, a question is sent to QA systems 1..n, each produces candidate answers, and SOMETHING combines or selects among them to return the final answer (a sketch of such a selector follows below).
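A minimal sketch of what the "SOMETHING for combining / selecting" box could do, assuming a per-answer-type routing table learned on development data; the stream names and the table are hypothetical:

```python
from typing import Dict

# Hypothetical routing table: which stream is strongest for which question type.
BEST_STREAM_BY_TYPE: Dict[str, str] = {
    "PERSON": "stream_2",
    "ORGANIZATION": "stream_1",
    "TIME": "stream_3",
}

def select_answer(question_type: str,
                  candidate_answers: Dict[str, str],
                  default_stream: str = "stream_1") -> str:
    """Return the answer proposed by the stream that specializes in this
    question type, falling back to a default stream."""
    stream = BEST_STREAM_BY_TYPE.get(question_type, default_stream)
    return candidate_answers.get(stream, candidate_answers[default_stream])

# Example: three streams each propose a candidate for a PERSON question.
candidates = {"stream_1": "Transfield Services",
              "stream_2": "Anselmo Peñas",
              "stream_3": "2011"}
print(select_answer("PERSON", candidates))  # -> "Anselmo Peñas"
```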

AVE. Answer validation: decide whether or not to return the candidate answer. Answer validation should help to improve QA: it introduces more content analysis, uses machine learning techniques, and is able to break pipelines and combine streams.

Hypothesis generation + validation: starting from the question, search the space of candidate answers using hypothesis generation functions plus answer validation functions to arrive at the answer.

ResPubliQA (2009 and 2010): transfer the AVE results to the QA main task and promote QA systems with better answer validation, in an evaluation setting that assumes leaving a question unanswered has more value than giving a wrong answer.

Evaluation measure (c@1): c@1 = (n_R + n_U · n_R / n) / n, where n is the number of questions, n_R the number of correctly answered questions, and n_U the number of unanswered questions. It rewards systems that maintain accuracy but reduce the number of incorrect answers by leaving some questions unanswered.
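A small sketch of the measure defined above, using the slide's variable names, with a worked example showing how abstention is rewarded:

```python
def c_at_1(n: int, n_r: int, n_u: int) -> float:
    """c@1 = (n_r + n_u * n_r / n) / n: unanswered questions are credited
    with the accuracy obtained on the questions that were answered."""
    return (n_r + n_u * (n_r / n)) / n

# 100 questions, 60 correct; skipping 20 instead of answering them wrongly
# raises the score from 0.60 to 0.72.
print(c_at_1(n=100, n_r=60, n_u=20))  # 0.72
print(c_at_1(n=100, n_r=60, n_u=0))   # 0.60
```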

Conclusions of ResPubliQA 2009-2010: this was not enough. We expected a bigger change in system architectures, but validation is still inside the pipeline (bad IR still means bad QA) and there was no qualitative improvement in performance. The technology needs room to develop.

The 2011 campaign promotes a bigger change in QA system architecture: QA4MRE, Question Answering for Machine Reading Evaluation. It measures progress in two reading abilities: answering questions about a single text, and capturing knowledge from text collections.

Reading test.
Text: "Coal seam gas drilling in Australia's Surat Basin has been halted by flooding. Australia's Easternwell, being acquired by Transfield Services, has ceased drilling because of the flooding. The company is drilling coal seam gas wells for Australia's Santos Ltd. Santos said the impact was minimal."
Multiple choice test: According to the text, what company owns wells in the Surat Basin?
a) Australia  b) Coal seam gas wells  c) Easternwell  d) Transfield Services  e) Santos Ltd.  f) Ausam Energy Corporation  g) Queensland  h) Chinchilla
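Purely as an illustration, a reading test like the one above could be represented as follows; the field names are hypothetical and not the official QA4MRE test format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Question:
    text: str            # answered from the document (plus background knowledge)
    options: List[str]   # candidate answers; exactly one is correct
    correct_index: int   # gold option

@dataclass
class ReadingTest:
    document: str        # the single source text
    questions: List[Question]

test = ReadingTest(
    document=("Coal seam gas drilling in Australia's Surat Basin has been "
              "halted by flooding. ... Santos said the impact was minimal."),
    questions=[Question(
        text="According to the text, what company owns wells in the Surat Basin?",
        options=["Australia", "Coal seam gas wells", "Easternwell",
                 "Transfield Services", "Santos Ltd.", "Ausam Energy Corporation",
                 "Queensland", "Chinchilla"],
        correct_index=4,  # presumably Santos Ltd., for which the wells are drilled
    )],
)
```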

Knowledge gaps: acquire this knowledge from the reference collection. [Diagram: if Company B drills Well C for Company A, then Company A owns Well C (P=0.8); the Surat Basin is part of Queensland, which is part of Australia.]

Knowledge-understanding dependence: we "understand" because we "know", and we need a little more of both to answer questions. The reading cycle: capture 'knowledge' expressed in texts, and 'understand' language.

Control the variable of knowledge. The ability to make inferences about texts is correlated with the amount of knowledge considered, so this variable has to be taken into account during evaluation; otherwise it is very difficult to compare methods. How do we control the variable of knowledge in a reading task?

Texts as sources of knowledge. The text collection must be big and diverse enough to acquire knowledge from, but that is impossible for all possible topics, so we define a scalable strategy, topic by topic: a reference collection per topic (20,000 – …,000 docs.), several topics, each narrow enough to limit the knowledge needed. Topics: AIDS, Climate Change, Music & Society.

Evaluation tests: 12 reading tests (4 documents per topic), 120 questions (10 questions per test), 600 choices (5 options per question), translated into 5 languages: English, German, Spanish, Italian, Romanian.

Evaluation tests: 44 questions required background knowledge from the reference collection, and 38 required combining information from different paragraphs. Textual inferences involved: lexical (acronyms, synonyms, hypernyms…), syntactic (nominalizations, paraphrasing…), discourse (coreference, ellipsis…).

Evaluation: a QA-perspective evaluation over all 120 questions, and a reading-perspective evaluation aggregating results by test. [Participation table: registered groups, participant groups, and submitted QA4MRE runs per task.]

QA4MRE workshop.
Tuesday 10:30 – 12:30: Keynote: Text Mining in Biograph (Walter Daelemans); QA4MRE methodology and results (Álvaro Rodrigo); Report on the Modality and Negation pilot (Roser Morante).
Tuesday 14:00 – 16:00: Reports from participants.
Wednesday 10:30 – 12:30: Breakout session.

CLEF 2011, Amsterdam. QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track, Breakout Session. Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder. Modality and Negation: Roser Morante, Walter Daelemans.

QA4MRE breakout session. Task: questions are more difficult and realistic, and the test sets are 100% reusable. Languages and participants: there were no participants for some languages, but the tests remain a valuable resource for evaluation and a good basis for developing tests in other languages (even without participants); the problem is finding parallel translations for the tests.

QA4MRE breakout session. Background collections: a good balance of quality and noise; the methodology to build them is fine. Test documents (TED): not ideal, but parallel, aimed at an open audience, and free of copyright issues; consider other possibilities such as CafeBabel or BBC news.

QA4MRE breakout session. Evaluation: encourage participants to test previous systems on new campaigns; ablation tests (what happens if you remove a component?); runs with and without background knowledge, and with and without external resources; processing time measurements.

QA4MRE 2012 topics. Previous: 1. AIDS, 2. Music and Society, 3. Climate Change. Added: Alzheimer's disease (popular sources: blogs, web, news, …).

QA4MRE 2012 pilots. Modality and negation: move to a three-value setting; given an event in the text, decide whether it is (1) asserted (no negation and no speculation), (2) negated (negation and no speculation), or (3) speculated. Roadmap: run it as a separate pilot first, then integrate modality and negation into the main task tests.
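As a toy sketch only, a cue-based baseline for the three-value decision could look like the following; the cue lists and precedence are illustrative assumptions, not the pilot's annotation guidelines:

```python
NEGATION_CUES = {"not", "no", "never", "without"}
SPECULATION_CUES = {"may", "might", "could", "possibly", "suggests"}

def event_value(sentence_tokens: list) -> str:
    """Toy three-value decision for an event mention: speculation takes
    precedence, then negation, otherwise the event is asserted."""
    tokens = {t.lower() for t in sentence_tokens}
    if tokens & SPECULATION_CUES:
        return "speculated"
    if tokens & NEGATION_CUES:
        return "negated"
    return "asserted"

print(event_value("The company has not halted drilling".split()))  # negated
```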

QA4MRE 2012 pilots. Biomedical domain: focus on one disease, Alzheimer's (59,000 Medline abstracts), in scientific language. Give participants the background collection already processed (tokenization, lemmatization, POS tagging, NER, dependency parsing) plus a development set.

QA4MRE 2012 in summary. Main task: multiple-choice reading comprehension tests, same format, with the additional topic of Alzheimer's; English, German (maybe Spanish, Italian, Romanian, and others). Two pilots: modality and negation (asserted, negated, speculated), and the biomedical domain focused on Alzheimer's disease, with the same format as the main task.

Thanks!