
1 CLEF 2011, Amsterdam. QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track Overview.
Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder.
Modality and Negation: Roser Morante, Walter Daelemans.

2 QA Tasks & Time at CLEF, 2003–2011
- Multiple Language QA Main Task (2003–2008), followed by ResPubliQA (2009–2010) and QA4MRE (2011)
- Temporal restrictions and lists
- Answer Validation Exercise (AVE, 2006–2008)
- GikiCLEF
- Negation and Modality
- Real Time QA over Speech Transcriptions (QAST)
- WiQA
- WSD QA

3 New setting
- QA over a single document: multiple-choice reading comprehension tests
- Forget about the IR step (for a while); focus on answering questions about a single text
- Choose the correct answer
Why this new setting?

4 Systems performance
- Upper bound of 60% accuracy
- Overall: best result below 60%
- Definitions: best result above 80%, and not with an IR approach

5 Pipeline upper bound
Question -> question analysis -> passage retrieval -> answer extraction -> answer ranking -> Answer
If the stages succeed at rates of 1.0, 0.8 and 0.8, the whole pipeline is bounded by 1.0 × 0.8 × 0.8 = 0.64: not enough evidence.
We need SOMETHING to break the pipeline: answer validation instead of re-ranking.
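
A quick sketch of why per-stage errors compound; the stage names and the 0.8 success rates mirror the slide's example, not a measured system:

```python
# Multiplying per-stage success rates gives the pipeline's upper bound:
# an answer survives only if every stage gets it right.
stages = {
    "question_analysis": 1.0,
    "passage_retrieval": 0.8,
    "answer_extraction": 0.8,
}

upper_bound = 1.0
for name, rate in stages.items():
    upper_bound *= rate  # a later stage cannot recover what an earlier one lost

print(f"pipeline upper bound: {upper_bound:.2f}")  # -> 0.64
```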

6 Multi-stream upper bound
- Perfect combination of streams: 81%
- Best single system: 52.5%
- Different streams win on different answer types: one system is best on ORGANIZATION questions, another on PERSON, another on TIME

7 Multi-stream architectures
- Different systems answer different types of questions better: specialization plus collaboration
- The question is sent to QA systems 1..n; each returns candidate answers; SOMETHING then combines or selects among them to produce the final answer (see the sketch below)
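
A minimal sketch of the "SOMETHING" that selects among streams. The routing-by-answer-type strategy and all skill scores below are hypothetical; the slide deliberately leaves the combination method open:

```python
# Route each question to the stream that has historically been most accurate
# on its answer type. Skill estimates would come from held-out data.
stream_skill = {
    "sys1": {"ORGANIZATION": 0.70, "PERSON": 0.40, "TIME": 0.50},
    "sys2": {"ORGANIZATION": 0.30, "PERSON": 0.80, "TIME": 0.40},
    "sys3": {"ORGANIZATION": 0.50, "PERSON": 0.50, "TIME": 0.90},
}

def select_answer(question_type: str, candidates: dict) -> str:
    """Pick the candidate proposed by the stream best on this question type."""
    best = max(stream_skill, key=lambda s: stream_skill[s].get(question_type, 0.0))
    return candidates[best]

candidates = {"sys1": "Santos Ltd.", "sys2": "Easternwell", "sys3": "Santos Ltd."}
print(select_answer("ORGANIZATION", candidates))  # -> Santos Ltd.
```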

8 AVE 2006–2008
- Answer validation: decide whether to return the candidate answer or not
- Answer validation should help to improve QA:
  - introduce more content analysis
  - use machine learning techniques
  - be able to break pipelines and combine streams
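
In this spirit, a toy validator that returns an answer only when a validation score clears a threshold; the lexical-overlap scorer is a crude stand-in for the entailment-style models AVE participants actually used:

```python
# Return the candidate answer only if the supporting snippet plausibly
# entails it; otherwise abstain (return None).
def validation_score(question: str, answer: str, snippet: str) -> float:
    """Crude proxy for entailment: question-term overlap with the snippet,
    gated on the answer string actually appearing in the snippet."""
    if answer.lower() not in snippet.lower():
        return 0.0
    q_terms = set(question.lower().split())
    s_terms = set(snippet.lower().split())
    return len(q_terms & s_terms) / max(len(q_terms), 1)

def validate(question, answer, snippet, threshold=0.3):
    score = validation_score(question, answer, snippet)
    return answer if score >= threshold else None
```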

9 Hypothesis generation + validation
Question -> search the space of candidate answers -> hypothesis generation functions + answer validation functions -> Answer

10 ResPubliQA 2009–2010
- Transfer AVE results to the QA main task in 2009 and 2010
- Promote QA systems with better answer validation
- A QA evaluation setting that assumes leaving a question unanswered has more value than giving a wrong answer

11 Evaluation measure: c@1
n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = (n_R + n_U * (n_R / n)) / n

Rewards systems that maintain accuracy while reducing the number of incorrect answers by leaving some questions unanswered.
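
The same measure as a small function; the formula is the published c@1 from ResPubliQA/QA4MRE, while the example numbers are our own:

```python
# c@1: unanswered questions are credited at the system's observed accuracy
# rate, instead of counting as plain errors.
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    if n_total == 0:
        return 0.0
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total

# Answering 50/100 correctly scores 0.50; answering the same 50 correctly
# while abstaining on 30 would-be errors scores (50 + 30*0.5)/100 = 0.65.
print(c_at_1(50, 0, 100))   # 0.50
print(c_at_1(50, 30, 100))  # 0.65
```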

12 Conclusions of ResPubliQA 2009–2010
- This was not enough: we expected a bigger change in systems architecture
- Validation is still inside the pipeline, so bad IR still means bad QA
- No qualitative improvement in performance
- Need for space to develop the technology

13 2011 campaign
- Promote a bigger change in QA systems architecture
- QA4MRE: Question Answering for Machine Reading Evaluation
- Measure progress in two reading abilities:
  - answer questions about a single text
  - capture knowledge from text collections

14 Reading test
Text: "Coal seam gas drilling in Australia's Surat Basin has been halted by flooding. Australia's Easternwell, being acquired by Transfield Services, has ceased drilling because of the flooding. The company is drilling coal seam gas wells for Australia's Santos Ltd. Santos said the impact was minimal."
Multiple-choice test. According to the text, what company owns wells in the Surat Basin?
a) Australia
b) Coal seam gas wells
c) Easternwell
d) Transfield Services
e) Santos Ltd.
f) Ausam Energy Corporation
g) Queensland
h) Chinchilla
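
One way to picture a test of this shape as data; the field names are illustrative, not the official QA4MRE schema:

```python
# A reading test as a plain dict: one source document plus multiple-choice
# questions. The gold answer follows the slide's example.
reading_test = {
    "document": "Coal seam gas drilling in Australia's Surat Basin has been "
                "halted by flooding. ... Santos said the impact was minimal.",
    "questions": [
        {
            "text": "According to the text, what company owns wells in the Surat Basin?",
            "options": ["Australia", "Coal seam gas wells", "Easternwell",
                        "Transfield Services", "Santos Ltd.",
                        "Ausam Energy Corporation", "Queensland", "Chinchilla"],
            "answer": "Santos Ltd.",
        },
    ],
}
```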

15 Knowledge gaps
Acquire this knowledge from the reference collection.
[Diagram: two knowledge gaps. I: from "Company B drills Well C for Company A", infer "Company A owns Well C" (P = 0.8). II: the Surat Basin is part of Queensland, which is part of Australia.]
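
A sketch of how such acquired knowledge might be stored and chained; the triple representation is our own illustration, while the drill-for/own rule and its 0.8 probability come from the slide:

```python
# Background knowledge as weighted relation triples, plus one soft rule:
# if B drills wells for A, then A owns those wells (P = 0.8).
facts = [
    ("Easternwell", "drills_wells_for", "Santos Ltd.", 1.0),
    ("Surat Basin", "part_of", "Queensland", 1.0),
    ("Queensland", "part_of", "Australia", 1.0),
]

def infer_ownership(facts):
    inferred = []
    for subj, rel, obj, p in facts:
        if rel == "drills_wells_for":
            inferred.append((obj, "owns_wells_drilled_by", subj, round(0.8 * p, 2)))
    return inferred

print(infer_ownership(facts))
# -> [('Santos Ltd.', 'owns_wells_drilled_by', 'Easternwell', 0.8)]
```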

16 Knowledge-understanding dependence
- We “understand” because we “know”
- We need a little more of both to answer questions
- The reading cycle: capture ‘knowledge’ expressed in texts, and use it to ‘understand’ language

17 Controlling the knowledge variable
- The ability to make inferences about texts is correlated with the amount of knowledge considered
- This variable has to be taken into account during evaluation; otherwise it is very difficult to compare methods
- How can the knowledge variable be controlled in a reading task?

18 Texts as sources of knowledge
- A text collection big and diverse enough to acquire knowledge from is impossible to build for all possible topics
- So define a scalable strategy, topic by topic:
  - one reference collection per topic (20,000–100,000 docs)
  - several topics, each narrow enough to limit the knowledge needed
- Topics: AIDS, CLIMATE CHANGE, MUSIC & SOCIETY

19 Evaluation tests
- 12 reading tests (4 docs per topic)
- 120 questions (10 questions per test)
- 600 choices (5 options per question)
- Translated into 5 languages: English, German, Spanish, Italian, Romanian

20 Evaluation tests
- 44 questions required background knowledge from the reference collection
- 38 required combining information from different paragraphs
- Textual inferences:
  - lexical: acronyms, synonyms, hypernyms...
  - syntactic: nominalizations, paraphrasing...
  - discourse: coreference, ellipsis...

21 Evaluation
- QA perspective: c@1 over all 120 questions
- Reading perspective: aggregating results by test
Participation in QA4MRE: 25 registered groups, 12 participant groups, 62 submitted runs.

22 Workshop
Tuesday 10:30–12:30:
- Keynote: Text Mining in Biograph (Walter Daelemans)
- QA4MRE methodology and results (Álvaro Rodrigo)
- Report on the Modality and Negation pilot (Roser Morante)
Tuesday 14:00–16:00: reports from participants
Wednesday 10:30–12:30: breakout session

23 CLEF 2011, Amsterdam. QA4MRE: Question Answering for Machine Reading Evaluation. Question Answering Track Breakout Session.
Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder.
Modality and Negation: Roser Morante, Walter Daelemans.

24 QA4MRE breakout session
Task:
- Questions are more difficult and realistic
- 100% reusable test sets
Languages and participants:
- No participants for some languages, but the tests remain a valuable resource for evaluation
- Good balance for developing tests in other languages (even without participants)
- The problem is finding parallel translations for the tests

25 QA4MRE breakout session
Background collections:
- Good balance of quality and noise
- The methodology to build them is OK
Test documents (TED):
- Not ideal, but parallel
- Open audience and no copyright issues
- Consider other possibilities: CafeBabel, BBC News

26 QA4MRE breakout session
Evaluation:
- Encourage participants to test previous systems on new campaigns
- Ablation tests: what happens if you remove a component?
- Runs with and without background knowledge, with and without external resources
- Processing time measurements

27 QA4MRE 2012 topics
Previous:
1. AIDS
2. Music and Society
3. Climate Change
Added:
4. Alzheimer's disease (popular sources: blogs, web, news, ...)

28 QA4MRE 2012 pilots: Modality and Negation
Move to a three-value setting: given an event in the text, decide whether it is
1. asserted (no negation and no speculation),
2. negated (negation and no speculation), or
3. speculated.
Roadmap:
1. 2012: run as a separate pilot
2. 2013: integrate modality and negation into the main task tests

29 QA4MRE 2012 pilots: biomedical domain
- Focus on one disease: Alzheimer's (59,000 Medline abstracts)
- Scientific language
- Give participants the background collection already processed: tokenization, lemmatization, POS tagging, NER, dependency parsing
- Development set

30 QA4MRE 2012 in summary
Main task:
- Multiple-choice reading comprehension tests, same format
- Additional topic: Alzheimer's
- English, German (maybe Spanish, Italian, Romanian, others)
Two pilots:
- Modality and negation: asserted, negated, speculated
- Biomedical domain: focus on Alzheimer's disease, same format as the main task

31 Thanks!

