
1 CLEF 2009, Corfu. Question Answering Track Overview. J. Turmo, P.R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, D. Buscaldi, D. Santos, L.M. Cabral, A. Peñas, P. Forner, R. Sutcliffe, Á. Rodrigo, C. Forascu, I. Alegria, D. Giampiccolo, N. Moreau, P. Osenova

2 QA Tasks & Time
(Timeline figure covering the 2003-2009 campaigns, showing when each task ran)
QA Tasks: Multiple Language QA Main Task; ResPubliQA; Temporal restrictions and lists; Answer Validation Exercise (AVE); GikiCLEF; Real Time QA over Speech Transcriptions (QAST); WiQA; WSD QA

3 2009 campaign
ResPubliQA: QA on European Legislation
GikiCLEF: QA requiring geographical reasoning on Wikipedia
QAST: QA on Speech Transcriptions of European Parliament Plenary sessions

4 QA 2009 campaign
Task        Registered groups   Participant groups  Submitted runs           Organizing people
ResPubliQA  20                  11                  28 + 16 (baseline runs)  9
GikiCLEF    27                  8                   17 runs                  2
QAST        12                  4                   86 (5 subtasks)          8
Total       59 showed interest  23 groups           147 runs evaluated       19 + additional assessors

5 ResPubliQA 2009: QA on European Legislation
Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional Assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory Board: Donna Harman, Maarten de Rijke, Dominique Laurent

6 Evolution of the task (2003-2009)
Target languages: 3, 7, 8, 9, 10, 11, 8
Collections: News 1994+ -> News 1995 + Wikipedia Nov. 2006 -> European Legislation
Number of questions: 200 -> 500
Type of questions: Factoid -> + Temporal restrictions + Definitions -> - Type of question, + Lists -> + Linked questions + Closed lists -> - Linked, + Reason, + Purpose, + Procedure
Supporting information: Document -> Snippet -> Paragraph
Size of answer: Snippet -> Exact -> Paragraph

7 Objectives
1. Move towards a domain of potential users
2. Compare systems working in different languages
3. Compare QA Tech. with pure IR
4. Introduce more types of questions
5. Introduce Answer Validation Tech.

8 Collection
Subset of JRC-Acquis (10,700 docs per language), parallel at document level
EU treaties, EU legislation, agreements and resolutions
Economy, health, law, food, ...
Between 1950 and 2006
XML-TEI.2 encoding
Unfortunately, not parallel at the paragraph level, which meant extra work
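(For concreteness, a minimal sketch of pulling paragraphs out of one such document. The <p> element name and the "n" attribute are assumptions about the XML-TEI.2 layout, not details given on the slide.)

```python
# Minimal sketch: extract numbered paragraphs from one JRC-Acquis TEI file.
# Element names and the "n" attribute are assumptions about the encoding.
import xml.etree.ElementTree as ET

def extract_paragraphs(path):
    """Return a list of (paragraph_id, text) pairs for one document."""
    tree = ET.parse(path)
    paragraphs = []
    for i, p in enumerate(tree.getroot().iter("p"), start=1):
        text = "".join(p.itertext()).strip()
        if text:
            # Fall back to the position if the paragraph carries no "n" attribute.
            paragraphs.append((p.get("n", str(i)), text))
    return paragraphs
```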

9 500 questions
REASON: Why did a commission expert conduct an inspection visit to Uruguay?
PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
PROCEDURE: How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph

10 500 questions
Also FACTOID: In how many languages is the Official Journal of the Community published?
DEFINITION: What is meant by “whole milk”?
No NIL questions

11

12 Translation of questions

13 Selection of the final pool of 500 questions out of the 600 produced

14

15 Systems response
No Answer ≠ Wrong Answer
1. Decide whether an answer is given or not [ YES | NO ]: a classification problem (Machine Learning, provers, Textual Entailment, etc.)
2. Provide the paragraph (ID + text) that answers the question
Aim: leaving a question unanswered has more value than giving a wrong answer
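(A hedged sketch of the answer/no-answer decision described on this slide: the system keeps its candidate paragraph but only returns it when a validation score clears a threshold. The data structure, the source of the score, and the threshold value are illustrative placeholders, not the participants' actual models.)

```python
# Sketch of the abstention decision: return the candidate paragraph only when
# the validation score is high enough; otherwise answer NoA (no answer).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    paragraph_id: str
    text: str
    validation_score: float  # e.g. output of an ML classifier, prover, or entailment check

def respond(candidate: Candidate, threshold: float = 0.5) -> Optional[Candidate]:
    """Return the candidate paragraph, or None to leave the question unanswered (NoA)."""
    return candidate if candidate.validation_score >= threshold else None
```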

16 Assessments
R: The question is answered correctly
W: The question is answered incorrectly
NoA: The question is not answered
NoA R: NoA, but the candidate answer was correct
NoA W: NoA, and the candidate answer was incorrect
NoA Empty: NoA, and no candidate answer was given
Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions answered correctly) that takes unanswered questions into account

17 Evaluation measure
c@1 = (n_R + n_U · (n_R / n)) / n
n: Number of questions
n_R: Number of correctly answered questions
n_U: Number of unanswered questions

18 Evaluation measure
If n_U = 0 then c@1 = n_R / n (accuracy)
If n_R = 0 then c@1 = 0
If n_U = n then c@1 = 0
Leaving a question unanswered adds value only if it avoids returning a wrong answer
The added value given to unanswered questions is the accuracy of the system: n_R / n
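(A small helper following the c@1 definition above, with a worked example using the German loga091 run from slide 21: with the same 186 correct answers, abstaining on 93 questions lifts c@1 from 0.37 to 0.44.)

```python
# c@1 = (n_R + n_U * (n_R / n)) / n: each unanswered question is credited with
# n_R / n, the accuracy measured over all questions.
def c_at_1(n_r: int, n_u: int, n: int) -> float:
    """n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

print(round(c_at_1(186, 93, 500), 2))  # 0.44 -> abstaining on 93 questions pays off
print(round(c_at_1(186, 0, 500), 2))   # 0.37 -> plain accuracy if nothing is left unanswered
```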

19 List of Participants
System  Team
elix    ELHUYAR-IXA, SPAIN
icia    RACAI, ROMANIA
iiit    Search & Info Extraction Lab, INDIA
iles    LIMSI-CNRS-2, FRANCE
isik    ISI-Kolkata, INDIA
loga    U. Koblenz-Landau, GERMANY
mira    MIRACLE, SPAIN
nlel    U. Politecnica Valencia, SPAIN
syna    Synapse Développement, FRANCE
uaic    Al.I.Cuza U. of Iasi, ROMANIA
uned    UNED, SPAIN

20 Value of reducing wrong answers (Romanian runs)
System       c@1   Accuracy  #R   #W   #NoA  #NoA R  #NoA W  #NoA empty
combination  0.76  0.76      381  119  0     0       0       0
icia092roro  0.68  0.52      260  84   156   0       0       156
icia091roro  0.58  0.47      237  156  107   0       0       107
UAIC092roro  0.47  0.47      236  264  0     0       0       0
UAIC091roro  0.45  0.45      227  273  0     0       0       0
base092roro  0.44  0.44      220  280  0     0       0       0
base091roro  0.37  0.37      185  315  0     0       0       0

21 Detecting wrong answers (German runs)
System       c@1   Accuracy  #R   #W   #NoA  #NoA R  #NoA W  #NoA empty
combination  0.56  0.56      278  222  0     0       0       0
loga091dede  0.44  0.40      186  221  93    16      68      9
loga092dede  0.44  0.40      187  230  83    12      62      9
base092dede  0.38  0.38      189  311  0     0       0       0
base091dede  0.35  0.35      174  326  0     0       0       0
While maintaining the number of correct answers, the candidate answer was incorrect for 83% of the unanswered questions
A very good step towards improving the system

22 IR important, not enough (English runs)
System       c@1   Accuracy  #R   #W   #NoA  #NoA R  #NoA W  #NoA empty
combination  0.90  0.90      451  49   0     0       0       0
uned092enen  0.61  0.61      288  184  28    15      12      1
uned091enen  0.60  0.59      282  190  28    15      13      0
nlel091enen  0.58  0.57      287  211  2     0       0       2
uaic092enen  0.54  0.52      243  204  53    18      35      0
base092enen  0.53  0.53      263  236  1     1       0       0
base091enen  0.51  0.51      256  243  1     0       1       0
elix092enen  0.48  0.48      240  260  0     0       0       0
uaic091enen  0.44  0.42      200  253  47    11      36      0
elix091enen  0.42  0.42      211  289  0     0       0       0
syna091enen  0.28  0.28      141  359  0     0       0       0
isik091enen  0.25  0.25      126  374  0     0       0       0
iiit091enen  0.20  0.11      54   37   409   0       11      398
elix092euen  0.18  0.18      91   409  0     0       0       0
elix091euen  0.16  0.16      78   422  0     0       0       0
Feasible task
The perfect combination is 50% better than the best system
Many systems fall below the IR baselines

23 Comparison across languages
Same questions, same documents, same baseline systems
A strict comparison is affected only by the language variable
But it is feasible to detect the most promising approaches across languages

24 Comparison across languages
System    RO    ES    EN    IT    DE
icia092   0.68
nlel092         0.47
uned092         0.41  0.61
uned091         0.41  0.60
icia091   0.58
nlel091               0.58  0.52
uaic092   0.47        0.54
uaic091   0.45
loga091                           0.44
loga092                           0.44
Baseline  0.44  0.40  0.53  0.42  0.38
Systems above the baselines
icia: Boolean + intensive NLP + ML-based validation & very good knowledge of the collection (Eurovoc terms, ...)
Baseline: Okapi BM25 tuned for paragraph retrieval

25 Comparison across languages (same results table as slide 24)
Systems above the baselines
nlel092: ngram-based retrieval, combining evidence from several languages
Baseline: Okapi BM25 tuned for paragraph retrieval

26 Comparison across languages (same results table)
Systems above the baselines
uned: Okapi BM25 + NER + paragraph validation + ngram-based re-ranking
Baseline: Okapi BM25 tuned for paragraph retrieval

27 Comparison across languages (same results table; this slide also shows nlel091's ES score of 0.35, below the ES baseline)
Systems above the baselines
nlel091: ngram-based paragraph retrieval
Baseline: Okapi BM25 tuned for paragraph retrieval

28 Comparison across languages (same results table)
Systems above the baselines
loga: Lucene + deep NLP + Logic + ML-based validation
Baseline: Okapi BM25 tuned for paragraph retrieval
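(The Okapi BM25 paragraph-retrieval baseline referred to on slides 24-28 can be approximated in a few lines of standard-library Python. This is a minimal sketch: the tokenization and the k1/b values are illustrative choices, not the organizers' actual tuning.)

```python
# Minimal BM25 paragraph retrieval: score every paragraph against the question
# and return paragraph indices sorted best-first.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def bm25_rank(question, paragraphs, k1=1.2, b=0.75):
    """Rank paragraphs by Okapi BM25 score against the question."""
    docs = [tokenize(p) for p in paragraphs]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    # Document frequency of each term over the paragraph collection.
    df = Counter(term for d in docs for term in set(d))
    query = tokenize(question)

    def score(doc):
        tf = Counter(doc)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        return s

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

# Usage: bm25_rank("What is meant by whole milk?", paragraphs)[0] -> index of the best paragraph.
```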

29 Conclusion
Compare systems working in different languages
Compare QA Tech. with pure IR
Pay more attention to paragraph retrieval: an old issue, with late-90s state of the art (English)
Pure IR performance: 0.38 - 0.58
Highest difference with respect to the IR baselines: 0.44 - 0.68 (intensive NLP, ML-based answer validation)
Introduce more types of questions
Some types are difficult to distinguish: any question that can be answered in a paragraph
Analysis of results by question types (in progress)

30 Conclusion
Introduce Answer Validation Tech.
Evaluation measure: c@1
Value of reducing wrong answers: detecting wrong answers is feasible
Feasible task: 90% of questions have been answered
Room for improvement: best systems around 60%
Even with fewer participants we have more comparison, more analysis, more learning
ResPubliQA proposal for 2010: SC and breakout session

31 Interest in ResPubliQA 2010
1. Uni. "Al.I.Cuza" Iasi (Dan Cristea, Diana Trandabat)
2. Linguateca (Nuno Cardoso)
3. RACAI (Dan Tufis, Radu Ion)
4. Jesus Vilares
5. Univ. Koblenz-Landau (Bjorn Pelzer)
6. Thomson Reuters (Isabelle Moulinier)
7. Gracinda Carvalho
8. UNED (Alvaro Rodrigo)
9. Uni. Politecnica Valencia (Paolo Rosso & Davide Buscaldi)
10. Uni. Hagen (Ingo Glockner)
11. Linguit (Jochen L. Leidner)
12. Uni. Saarland (Dietrich Klakow)
13. ELHUYAR-IXA (Arantxa Otegi)
14. MIRACLE TEAM (Paloma Martínez Fernández)
But we need more. You already have a Gold Standard of 500 questions & answers to play with...

