
1 CLEF 2011, Amsterdam. QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track Overview.
Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder.
Modality and Negation: Roser Morante, Walter Daelemans.

2 QA Tasks & Time at CLEF, 2003–2011
- Multiple Language QA Main Task (2003–2008), followed by ResPubliQA (2009–2010) and QA4MRE (2011)
- Temporal restrictions and lists
- Answer Validation Exercise (AVE, 2006–2008)
- GikiCLEF
- Negation and Modality
- Real Time QA over Speech Transcriptions (QAST)
- WiQA
- WSD QA

3 New setting
- QA over a single document: multiple-choice reading comprehension tests
- Forget about the IR step (for a while); focus on answering questions about a single text
- Choose the correct answer
Why this new setting?

4 Systems performance
- Upper bound of 60% accuracy
- Overall: best result below 60%
- Definitions: best result above 80%, and not with an IR approach

5 Pipeline upper bound
Question -> question analysis -> passage retrieval -> answer extraction -> answer ranking -> Answer
If the stages succeed at rates of 1.0, 0.8 and 0.8, the whole pipeline is bounded by 1.0 × 0.8 × 0.8 = 0.64: not enough evidence.
We need SOMETHING to break the pipeline: answer validation instead of re-ranking.
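
A quick sketch of why per-stage errors compound; the stage names and the 0.8 success rates mirror the slide's example, not a measured system:

```python
# Multiplying per-stage success rates gives the pipeline's upper bound:
# an answer survives only if every stage gets it right.
stages = {
    "question_analysis": 1.0,
    "passage_retrieval": 0.8,
    "answer_extraction": 0.8,
}

upper_bound = 1.0
for name, rate in stages.items():
    upper_bound *= rate  # a later stage cannot recover what an earlier one lost

print(f"pipeline upper bound: {upper_bound:.2f}")  # -> 0.64
```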

6 Multi-stream upper bound
- Perfect combination of streams: 81%
- Best single system: 52.5%
- Different streams win on different answer types: one system is best on ORGANIZATION questions, another on PERSON, another on TIME

7 Multi-stream architectures
- Different systems answer different types of questions better: specialization plus collaboration
- The question is sent to QA systems 1..n; each returns candidate answers; SOMETHING then combines or selects among them to produce the final answer (see the sketch below)
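
A minimal sketch of the "SOMETHING" that selects among streams. The routing-by-answer-type strategy and all skill scores below are hypothetical; the slide deliberately leaves the combination method open:

```python
# Route each question to the stream that has historically been most accurate
# on its answer type. Skill estimates would come from held-out data.
stream_skill = {
    "sys1": {"ORGANIZATION": 0.70, "PERSON": 0.40, "TIME": 0.50},
    "sys2": {"ORGANIZATION": 0.30, "PERSON": 0.80, "TIME": 0.40},
    "sys3": {"ORGANIZATION": 0.50, "PERSON": 0.50, "TIME": 0.90},
}

def select_answer(question_type: str, candidates: dict) -> str:
    """Pick the candidate proposed by the stream best on this question type."""
    best = max(stream_skill, key=lambda s: stream_skill[s].get(question_type, 0.0))
    return candidates[best]

candidates = {"sys1": "Santos Ltd.", "sys2": "Easternwell", "sys3": "Santos Ltd."}
print(select_answer("ORGANIZATION", candidates))  # -> Santos Ltd.
```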

8 AVE 2006–2008
- Answer validation: decide whether to return the candidate answer or not
- Answer validation should help to improve QA:
  - introduce more content analysis
  - use machine learning techniques
  - be able to break pipelines and combine streams
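
In this spirit, a toy validator that returns an answer only when a validation score clears a threshold; the lexical-overlap scorer is a crude stand-in for the entailment-style models AVE participants actually used:

```python
# Return the candidate answer only if the supporting snippet plausibly
# entails it; otherwise abstain (return None).
def validation_score(question: str, answer: str, snippet: str) -> float:
    """Crude proxy for entailment: question-term overlap with the snippet,
    gated on the answer string actually appearing in the snippet."""
    if answer.lower() not in snippet.lower():
        return 0.0
    q_terms = set(question.lower().split())
    s_terms = set(snippet.lower().split())
    return len(q_terms & s_terms) / max(len(q_terms), 1)

def validate(question, answer, snippet, threshold=0.3):
    score = validation_score(question, answer, snippet)
    return answer if score >= threshold else None
```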

9 Hypothesis generation + validation
Question -> search the space of candidate answers -> hypothesis generation functions + answer validation functions -> Answer

10 ResPubliQA 2009–2010
- Transfer AVE results to the QA main task in 2009 and 2010
- Promote QA systems with better answer validation
- A QA evaluation setting that assumes leaving a question unanswered has more value than giving a wrong answer

11 Evaluation measure: c@1
n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = (n_R + n_U * (n_R / n)) / n

Rewards systems that maintain accuracy while reducing the number of incorrect answers by leaving some questions unanswered.
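
The same measure as a small function; the formula is the published c@1 from ResPubliQA/QA4MRE, while the example numbers are our own:

```python
# c@1: unanswered questions are credited at the system's observed accuracy
# rate, instead of counting as plain errors.
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    if n_total == 0:
        return 0.0
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total

# Answering 50/100 correctly scores 0.50; answering the same 50 correctly
# while abstaining on 30 would-be errors scores (50 + 30*0.5)/100 = 0.65.
print(c_at_1(50, 0, 100))   # 0.50
print(c_at_1(50, 30, 100))  # 0.65
```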

12 Conclusions of ResPubliQA 2009–2010
- This was not enough: we expected a bigger change in systems architecture
- Validation is still inside the pipeline, so bad IR still means bad QA
- No qualitative improvement in performance
- Need for space to develop the technology

13 2011 campaign
- Promote a bigger change in QA systems architecture
- QA4MRE: Question Answering for Machine Reading Evaluation
- Measure progress in two reading abilities:
  - answer questions about a single text
  - capture knowledge from text collections

14 Reading test
Text: "Coal seam gas drilling in Australia's Surat Basin has been halted by flooding. Australia's Easternwell, being acquired by Transfield Services, has ceased drilling because of the flooding. The company is drilling coal seam gas wells for Australia's Santos Ltd. Santos said the impact was minimal."
Multiple-choice test. According to the text, what company owns wells in the Surat Basin?
a) Australia
b) Coal seam gas wells
c) Easternwell
d) Transfield Services
e) Santos Ltd.
f) Ausam Energy Corporation
g) Queensland
h) Chinchilla
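
One way to picture a test of this shape as data; the field names are illustrative, not the official QA4MRE schema:

```python
# A reading test as a plain dict: one source document plus multiple-choice
# questions. The gold answer follows the slide's example.
reading_test = {
    "document": "Coal seam gas drilling in Australia's Surat Basin has been "
                "halted by flooding. ... Santos said the impact was minimal.",
    "questions": [
        {
            "text": "According to the text, what company owns wells in the Surat Basin?",
            "options": ["Australia", "Coal seam gas wells", "Easternwell",
                        "Transfield Services", "Santos Ltd.",
                        "Ausam Energy Corporation", "Queensland", "Chinchilla"],
            "answer": "Santos Ltd.",
        },
    ],
}
```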

15 Knowledge gaps
Acquire this knowledge from the reference collection.
[Diagram: two knowledge gaps. I: from "Company B drills Well C for Company A", infer "Company A owns Well C" (P = 0.8). II: the Surat Basin is part of Queensland, which is part of Australia.]
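
A sketch of how such acquired knowledge might be stored and chained; the triple representation is our own illustration, while the drill-for/own rule and its 0.8 probability come from the slide:

```python
# Background knowledge as weighted relation triples, plus one soft rule:
# if B drills wells for A, then A owns those wells (P = 0.8).
facts = [
    ("Easternwell", "drills_wells_for", "Santos Ltd.", 1.0),
    ("Surat Basin", "part_of", "Queensland", 1.0),
    ("Queensland", "part_of", "Australia", 1.0),
]

def infer_ownership(facts):
    inferred = []
    for subj, rel, obj, p in facts:
        if rel == "drills_wells_for":
            inferred.append((obj, "owns_wells_drilled_by", subj, round(0.8 * p, 2)))
    return inferred

print(infer_ownership(facts))
# -> [('Santos Ltd.', 'owns_wells_drilled_by', 'Easternwell', 0.8)]
```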

16 Knowledge-understanding dependence
- We “understand” because we “know”
- We need a little more of both to answer questions
- The reading cycle: capture ‘knowledge’ expressed in texts, and use it to ‘understand’ language

17 Controlling the knowledge variable
- The ability to make inferences about texts is correlated with the amount of knowledge considered
- This variable has to be taken into account during evaluation; otherwise it is very difficult to compare methods
- How can the knowledge variable be controlled in a reading task?

18 Texts as sources of knowledge
- A text collection big and diverse enough to acquire knowledge from is impossible to build for all possible topics
- So define a scalable strategy, topic by topic:
  - one reference collection per topic (20,000–100,000 docs)
  - several topics, each narrow enough to limit the knowledge needed
- Topics: AIDS, CLIMATE CHANGE, MUSIC & SOCIETY

19 Evaluation tests
- 12 reading tests (4 docs per topic)
- 120 questions (10 questions per test)
- 600 choices (5 options per question)
- Translated into 5 languages: English, German, Spanish, Italian, Romanian

20 Evaluation tests
- 44 questions required background knowledge from the reference collection
- 38 required combining information from different paragraphs
- Textual inferences:
  - lexical: acronyms, synonyms, hypernyms...
  - syntactic: nominalizations, paraphrasing...
  - discourse: coreference, ellipsis...

21 Evaluation
- QA perspective: c@1 over all 120 questions
- Reading perspective: aggregating results by test
Participation in QA4MRE: 25 registered groups, 12 participant groups, 62 submitted runs.

22 Workshop
Tuesday 10:30–12:30:
- Keynote: Text Mining in Biograph (Walter Daelemans)
- QA4MRE methodology and results (Álvaro Rodrigo)
- Report on the Modality and Negation pilot (Roser Morante)
Tuesday 14:00–16:00: reports from participants
Wednesday 10:30–12:30: breakout session

23 CLEF 2011, Amsterdam. QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track Breakout Session.
Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder.
Modality and Negation: Roser Morante, Walter Daelemans.

24 QA4MRE breakout session
Task:
- Questions are more difficult and realistic
- 100% reusable test sets
Languages and participants:
- No participants for some languages, but the tests remain a valuable resource for evaluation
- Good balance for developing tests in other languages (even without participants)
- The problem is finding parallel translations for the tests

25 QA4MRE breakout session
Background collections:
- Good balance of quality and noise
- The methodology to build them is OK
Test documents (TED):
- Not ideal, but parallel
- Open audience and no copyright issues
- Consider other possibilities: CafeBabel, BBC News

26 QA4MRE breakout session
Evaluation:
- Encourage participants to test previous systems on new campaigns
- Ablation tests: what happens if you remove a component?
- Runs with and without background knowledge, with and without external resources
- Processing time measurements

27 QA4MRE 2012 topics
Previous:
1. AIDS
2. Music and Society
3. Climate Change
Added:
4. Alzheimer's disease (popular sources: blogs, web, news, ...)

28 QA4MRE 2012 pilots: Modality and Negation
Move to a three-value setting: given an event in the text, decide whether it is
1. asserted (no negation and no speculation),
2. negated (negation and no speculation), or
3. speculated.
Roadmap:
1. 2012: run as a separate pilot
2. 2013: integrate modality and negation into the main task tests

29 QA4MRE 2012 pilots: biomedical domain
- Focus on one disease: Alzheimer's (59,000 Medline abstracts)
- Scientific language
- Give participants the background collection already processed: tokenization, lemmatization, POS tagging, NER, dependency parsing
- Development set

30 QA4MRE 2012 in summary
Main task:
- Multiple-choice reading comprehension tests, same format
- Additional topic: Alzheimer's
- English, German (maybe Spanish, Italian, Romanian, others)
Two pilots:
- Modality and negation: asserted, negated, speculated
- Biomedical domain: focus on Alzheimer's disease, same format as the main task

31 Thanks!

