
1 Overview of the Multilingual Question Answering Track
Alessandro Vallin, ITC-irst, Trento - Italy
Cross Language Evaluation Forum 2004, Bath, September 16, 2004
Joint work with Bernardo Magnini, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Simov and Richard Sutcliffe

2 CLEF 2004 Workshop, Bath – September 16, 2004
Outline
- Introduction
- QA Track Setup: Overview, Task Definition, Questions and Answers, Assessment
- Evaluation Exercise: Participants, Results, Approaches
- Conclusions: Future Directions

3 CLEF 2004 Workshop, Bath – September 16, 2004
Introduction
Question Answering can be characterized along several dimensions: domain-specific vs. domain-independent, structured data vs. free text, and the Web vs. a fixed set of collections vs. a single document.
Growing interest in QA (3 QA-related tasks at CLEF 2004). Focus on multilinguality (Burger et al., 2001):
- monolingual tasks in languages other than English
- cross-language tasks

4 CLEF 2004 Workshop, Bath – September 16, 2004
Introduction
QA can be situated along two dimensions, faithfulness and compactness: Machine Translation aims to be as faithful as possible, Automatic Summarization as compact as possible, while in Automatic Question Answering answers must be both faithful w.r.t. the questions (correctness) and compact (exactness).

5 CLEF 2004 Workshop, Bath – September 16, 2004
QA @ CLEF 2004 (http://clef-qa.itc.it/2004)
Seven groups coordinated the QA track:
- ITC-irst (IT and EN test set preparation)
- DFKI (DE)
- ELDA/ELRA (FR)
- Linguateca (PT)
- UNED (ES)
- U. Amsterdam (NL)
- U. Limerick (EN assessment)
Two more groups participated in the test set construction:
- Bulgarian Academy of Sciences (BG)
- U. Helsinki (FI)

6 QA Track Setup - Overview
The slide presents the workflow of the track as a flowchart, roughly:
- question generation (about 2.5 person-months per group): 100 monolingual Q&A pairs per group (IT, FR, NL, ES, ...), with EN translation, drawn from the document collections;
- translation of the EN questions into the 7 languages: 700 Q&A pairs in one language + EN, collected in Multieight-04, an XML collection of 700 Q&A pairs in 8 languages;
- selection of an additional 80 + 20 questions and extraction of the plain text test sets;
- experiments: the Exercise ran in a 1-week window (10-23/5), after which the systems' answers were collected;
- manual assessment of the runs (about 2 person-days per run).

7 CLEF 2004 Workshop, Bath – September 16, 2004
QA Track Setup – Task Definition
Given 200 questions in a source language, find one exact answer per question in a collection of documents written in a target language, and provide a justification for each retrieved answer (i.e. the docid of the unique document that supports the answer).
Source languages: BG, DE, EN, ES, FI, FR, IT, NL, PT; target languages: DE, EN, ES, FR, IT, NL, PT.
6 monolingual and 50 bilingual tasks. Teams participated in 19 tasks, submitting 48 runs.
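The task definition above asks for exactly one answer string per question plus the docid of the supporting document. Purely as an illustration of those required elements (this is not the official run format of the track), a response can be pictured as a small record:

```python
# Hypothetical illustration of the required response elements (not the official
# CLEF 2004 run format): one exact answer plus the docid that justifies it.
from dataclasses import dataclass

@dataclass
class Response:
    question: str       # question in the source language
    answer: str         # exact answer string in the target language ("NIL" if none)
    support_docid: str  # id of the single document that supports the answer

example = Response(
    question="Where is Heathrow airport?",  # example question taken from a later slide
    answer="London",
    support_docid="LA010194-0123",          # made-up docid, for illustration only
)
```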

8 CLEF 2004 Workshop, Bath – September 16, 2004
QA Track Setup - Questions
All the test sets were made up of 200 questions:
- ~90% factoid questions
- ~10% definition questions
- ~10% of the questions did not have any answer in the corpora (the right answer-string was "NIL")
Problems in introducing definition questions:
- What is the right answer? (it depends on the user's model)
- What is the easiest and most efficient way to assess their answers?
- Overlap with factoid questions, e.g.:
  F: Who is the Pope? -> John Paul II
  D: Who is John Paul II? -> the Pope / the head of the Roman Catholic Church

9 CLEF 2004 Workshop, Bath – September 16, 2004
QA Track Setup - Answers
One exact answer per question was required. Exactness subsumes conciseness (the answer should not contain extraneous or irrelevant information) and completeness (the answer should be complete).
Exactness is relatively easy to assess for factoids:
  F: Where is Heathrow airport? -> "London" (judged Right)
  F: What racing team is Flavio Briatore the manager of? -> "I am happy for this thing in particular said Benetton Flavio Briatore" (judged ineXact)
Definition questions and How-questions could have a passage as answer (depending on the user's model):
  D: Who is Jorge Amado? -> "Bahian novelist" (Right)
  D: Who is Jorge Amado? -> "American authors such as Stephen King and Sidney Sheldon are perennial best sellers in Latin American countries, while Brazilian Jorge Amado, Colombian Gabriel Garcia Marquez and Mexican Carlos Fuentes are renowned in U.S. literary circles." (Right)

10 CLEF 2004 Workshop, Bath – September 16, 2004
QA Track Setup – Multieight
Example entry: the same question ("How did Pasolini die?") with the answer strings found in each collection:
- BG: Как умира Пазолини? TRANSLATION[убит]
- DE: Auf welche Art starb Pasolini? TRANSLATION[ermordet] -> ermordet
- EN: How did Pasolini die? TRANSLATION[murdered] -> murdered
- ES: ¿Cómo murió Pasolini? TRANSLATION[Asesinado] -> Brutalmente asesinado en los arrabales de Ostia
- FR: Comment est mort Pasolini ? TRANSLATION[assassiné] -> assassiné; assassiné en novembre 1975 dans des circonstances mystérieuses; assassiné il y a 20 ans
- IT: Come è morto Pasolini? TRANSLATION[assassinato] -> massacrato e abbandonato sulla spiaggia di Ostia
- NL: Hoe stierf Pasolini? TRANSLATION[vermoord] -> vermoord
- PT: Como morreu Pasolini? -> assassinado
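Multieight-04 is an XML collection; its actual schema is not reproduced on the slide, but conceptually each entry groups, for one question, the per-language question strings, the English gloss of the answer, and the answer strings found in each collection. A minimal sketch of that structure, with assumed field names:

```python
# Conceptual sketch of one Multieight-04 entry (field names and nesting are
# assumptions for illustration, not the actual XML schema of the collection).
entry = {
    "question_id": "0042",  # hypothetical identifier
    "by_language": {
        "EN": {"question": "How did Pasolini die?",
               "answer_gloss": "murdered",
               "answers_in_collection": ["murdered"]},
        "IT": {"question": "Come è morto Pasolini?",
               "answer_gloss": "assassinato",
               "answers_in_collection": ["massacrato e abbandonato sulla spiaggia di Ostia"]},
        # ... the remaining six languages follow the same pattern
    },
}
```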

11 CLEF 2004 Workshop, Bath – September 16, 2004
QA Track Setup - Assessment
Judgments were taken from the TREC QA tracks:
- Right
- Wrong
- ineXact
- Unsupported
Other criteria, such as the length of the answer-strings (instead of X, which is underspecified) or the usefulness of responses for a potential user, were not considered.
The main evaluation measure was accuracy (the fraction of Right responses). Whenever possible, a Confidence-Weighted Score was calculated over the responses ranked by the system's confidence:

  CWS = (1/Q) * sum_{i=1..Q} (number of correct responses in the first i ranks) / i

where Q is the number of questions.
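As an illustration of the measure (a minimal sketch, not the official scorer), CWS can be computed from the list of judgments sorted from the most to the least confident response:

```python
def confidence_weighted_score(judgments):
    """Compute CWS from a list of booleans (True = correct answer),
    ordered from the most to the least confident response.

    CWS = (1/Q) * sum_{i=1..Q} (correct answers in the first i ranks) / i
    """
    q = len(judgments)
    correct_so_far = 0
    total = 0.0
    for i, is_correct in enumerate(judgments, start=1):
        correct_so_far += is_correct
        total += correct_so_far / i
    return total / q

# Toy example: 3 correct answers out of 5, with the most confident ones correct.
print(confidence_weighted_score([True, True, False, True, False]))  # ~0.80
```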

12 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise - Participants
Distribution of participating groups in different QA evaluation campaigns (submitted runs in the last column):

  Campaign          America  Europe  Asia  Australia  TOTAL  Runs
  TREC-8               13       3      3       1        20     46
  TREC-9               14       7      6       -        27     75
  TREC-10              19       8      8       -        35     67
  TREC-11              16      10      6       -        32     67
  TREC-12              13       8      4       -        25     54
  NTCIR-3 (QAC-1)       1       -     15       -        16     36
  CLEF 2003             3       5      -       -         8     17
  CLEF 2004             1      17      -       -        18     48

13 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise - Participants
Number of participating teams - number of submitted runs at CLEF 2004, by source (rows) and target (columns) language. Comparability issue.

  Source \ Target    DE    EN    ES    FR    IT    NL    PT
  BG                  -   1-1     -   1-2     -     -     -
  DE                2-2   2-3     -   1-2     -     -     -
  EN                  -     -     -   1-2     -   1-1     -
  ES                  -     -   5-8   1-2     -     -     -
  FI                  -   1-1     -     -     -     -     -
  FR                  -   3-6     -   1-2     -     -     -
  IT                  -   1-2     -   1-2   2-3     -     -
  NL                  -     -     -   1-2     -   1-2     -
  PT                  -     -     -   1-2     -     -   2-3

14 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise - Results
Systems' performance at the TREC and CLEF QA tracks (accuracy, %):

  Campaign                 Best system  Average
  TREC-8                       70          25
  TREC-9                       65          24
  TREC-10                      67          23
  TREC-11                      83          22
  TREC-12 *                    70          21.4
  CLEF-2003 ** monolingual     41.5        29
  CLEF-2003 ** bilingual       35          17
  CLEF-2004 monolingual        45.5        23.7
  CLEF-2004 bilingual          35          14.7

  * considering only the 413 factoid questions
  ** considering only the answers returned at the first rank

15 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – Results (DE)
Results of the runs with German as target language (R/W/X/U = Right/Wrong/ineXact/Unsupported; NIL P/R = precision/recall on NIL questions):

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  dfki041dede    50  143   1  3       25.38           28.25        0.00     0.14   0.85     -
  FUHA041-dede   67  128   2  0       34.01           31.64       55.00     0.14   1.00   0.333
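For readers unfamiliar with the column conventions, a small sketch of how these figures relate to the raw judgment counts; the NIL precision/recall definitions used here are the usual ones and are an assumption, not quoted from the slide:

```python
# Sketch of the per-run measures (assumed standard definitions, for illustration).
def overall_accuracy(right, wrong, inexact, unsupported):
    """Fraction of Right answers over all assessed questions, as a percentage."""
    total = right + wrong + inexact + unsupported
    return 100.0 * right / total

def nil_precision_recall(correct_nil, returned_nil, total_nil_questions):
    """Precision and recall of the NIL answers a system returned."""
    precision = correct_nil / returned_nil if returned_nil else 0.0
    recall = correct_nil / total_nil_questions if total_nil_questions else 0.0
    return precision, recall

# dfki041dede from the table above: 50 Right out of 197 assessed questions.
print(round(overall_accuracy(50, 143, 1, 3), 2))  # -> 25.38
```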

16 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – Results (EN)
Results of the runs with English as target language:

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  bgas041bgen    26  168   5  1       13.00           11.67       25.00     0.13   0.40   0.056
  dfki041deen    47  151   0  2       23.50           23.89       20.00     0.10   0.75   0.177
  dltg041fren    38  155   7  0       19.00           17.78       30.00     0.17   0.55     -
  dltg042fren    29  164   7  0       14.50           12.78       30.00     0.14   0.45     -
  edin041deen    28  166   5  1       14.00           13.33       20.00     0.14   0.35   0.049
  edin041fren    33  161   6  0       16.50           17.78        5.00     0.15   0.55   0.056
  edin042deen    34  159   7  0       17.00           16.11       25.00     0.14   0.35   0.052
  edin042fren    40  153   7  0       20.00           20.56       15.00     0.15   0.55   0.058
  hels041fien    21  171   1  0       10.88           11.56        5.00     0.10   0.85   0.046
  irst041iten    45  146   6  3       22.50           22.22       25.00     0.24   0.30   0.121
  irst042iten    35  158   5  2       17.50           16.67       25.00     0.24   0.30   0.075
  lire041fren    22  172   6  0       11.00           10.00       20.00     0.05   0.05   0.032
  lire042fren    39  155   6  0       19.50           20.00       15.00     0.00   0.00   0.075

17 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – Results (ES)
Results of the runs with Spanish as target language:

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  aliv041eses    63  130   5  2       31.50           30.56       40.00     0.17   0.35   0.121
  aliv042eses    65  129   4  2       32.50           31.11       45.00     0.17   0.35   0.144
  cole041eses    22  178   0  0       11.00           11.67        5.00     0.10   1.00     -
  inao041eses    45  145   5  5       22.50           19.44       50.00     0.19   0.50     -
  inao042eses    37  152   6  5       18.50           17.78       25.00     0.21   0.50     -
  mira041eses    18  174   7  1        9.00           10.00        0.00     0.14   0.55     -
  talp041eses    48  150   1  1       24.00           18.89       70.00     0.19   0.50   0.087
  talp042eses    52  143   3  2       26.00           21.11       70.00     0.20   0.55   0.102

18 Evaluation Exercise – Results (FR)
Results of the runs with French as target language:

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  gine041bgfr    13  182   5  0        6.50            6.67        5.00     0.10   0.50   0.051
  gine041defr    29  161  10  0       14.50           14.44       15.00     0.15   0.20   0.079
  gine041enfr    18  170  12  0        9.00            8.89       10.00     0.05   0.10   0.033
  gine041esfr    27  165   8  0       13.50           14.44        5.00     0.12   0.15   0.056
  gine041frfr    27  160  13  0       13.50           13.89       10.00     0.00   0.00   0.048
  gine041itfr    25  165  10  0       12.50           13.33        5.00     0.15   0.30   0.049
  gine041nlfr    20  169  11  0       10.00           10.00       10.00     0.12   0.20   0.044
  gine041ptfr    25  169   6  0       12.50           12.22       15.00     0.11   0.15   0.044
  gine042bgfr    13  180   7  0        6.50            6.11       10.00     0.10   0.35   0.038
  gine042defr    34  154  12  0       17.00           15.56       30.00     0.23   0.20   0.097
  gine042enfr    27  164   9  0       13.50           12.22       25.00     0.06   0.10   0.051
  gine042esfr    34  162   4  0       17.00           17.22       15.00     0.11   0.10   0.075
  gine042frfr    49  145   6  0       24.50           23.89       30.00     0.09   0.05   0.114
  gine042itfr    29  164   7  0       14.50           15.56        5.00     0.14   0.30   0.054
  gine042nlfr    29  156  15  0       14.50           13.33       25.00     0.14   0.20   0.065
  gine042ptfr    29  164   7  0       14.50           13.33       25.00     0.10   0.15   0.056

19 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – Results (IT)
Results of the runs with Italian as target language:

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  ILCP-QA-ITIT   51  117  29  3       25.50           22.78       50.00     0.62   0.50     -
  irst041itit    56  131  11  2       28.00           26.67       40.00     0.27   0.30   0.155
  irst042itit    44  147   9  0       22.00           20.00       40.00     0.66   0.20   0.107

20 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – Results (NL)
Results of the runs with Dutch as target language:

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  uams041ennl    70  122   7  1       35.00           31.07       65.22     0.00   0.00     -
  uams041nlnl    88   98  10  4       44.00           42.37       56.52     0.00   0.00     -
  uams042nlnl    91   97  10  2       45.50           45.20       47.83     0.56   0.25     -

21 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – Results (PT)
Results of the runs with Portuguese as target language:

  Run Name        R    W   X  U  Overall Acc. (%)  Acc. F (%)  Acc. D (%)  NIL P  NIL R   CWS
  PTUE041ptpt    57  125  18  0       28.64           29.17       25.81     0.14   0.90     -
  sfnx041ptpt    22  166   8  4       11.06           11.90        6.45     0.13   0.75     -
  sfnx042ptpt    30  155  10  5       15.08           16.07        9.68     0.16   0.55     -

22 CLEF 2004 Workshop, Bath – September 16, 2004
Evaluation Exercise – CL Approaches
Most systems share a common pipeline: question analysis / keyword extraction on the INPUT (source language), candidate document selection against the (preprocessed) document collection, candidate document analysis, and answer extraction producing the OUTPUT (target language).
Cross-language systems bridge source and target language in one of two places: by translating the question into the target language before retrieval, or by translating the retrieved data.
Participating groups adopting these approaches included: U. Amsterdam, U. Edinburgh, U. Neuchatel, Bulg. Ac. of Sciences, ITC-Irst, U. Limerick, U. Helsinki, DFKI and LIMSI-CNRS.
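A schematic sketch of the two cross-language strategies described above; every helper is a trivial placeholder standing in for a real MT, IR or answer-extraction component, and nothing here reflects any particular group's system:

```python
# Illustrative skeleton of the two cross-language QA strategies (assumed stubs).
def translate(text, source_lang, target_lang):
    return text  # placeholder: a real system would call an MT component

def extract_keywords(question):
    return [w for w in question.split() if len(w) > 3]  # naive keyword extraction

def retrieve_documents(keywords, collection):
    return [doc for doc in collection
            if any(k.lower() in doc.lower() for k in keywords)]

def extract_answer(question, documents):
    # placeholder: return the first candidate passage
    # (a real system extracts an exact answer string plus a supporting docid)
    return documents[0] if documents else "NIL"

def qa_translate_question(question, src, tgt, collection):
    """Strategy 1: translate the question into the target language,
    then run a monolingual pipeline on the target-language collection."""
    q_tgt = translate(question, src, tgt)
    docs = retrieve_documents(extract_keywords(q_tgt), collection)
    return extract_answer(q_tgt, docs)

def qa_translate_retrieved(question, src, tgt, collection):
    """Strategy 2: retrieve with translated keywords, then translate the
    retrieved data back into the source language before answer extraction."""
    keywords = [translate(k, src, tgt) for k in extract_keywords(question)]
    docs = retrieve_documents(keywords, collection)
    docs_src = [translate(d, tgt, src) for d in docs]
    return extract_answer(question, docs_src)
```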

23 CLEF 2004 Workshop, Bath – September 16, 2004
Conclusions
The CLEF multilingual QA track (like TREC QA) represents a formal evaluation, designed with an eye to replicability. As an exercise, it is an abstraction of the real problems.
Future challenges:
- investigate QA in combination with other applications (for instance summarization)
- access not only free text, but also different sources of data (multimedia, spoken language, imagery)
- introduce automated evaluation along with judgments given by humans
- focus on the user's needs: develop real-time interactive systems, which means modeling a potential user and defining suitable answer types.

24 CLEF 2004 Workshop, Bath – September 16, 2004
Conclusions
Possible improvements of the CLEF QA track:
Questions
- closer interaction with the IR track (use topics)
- identify common areas of interest in the collections (minimize translation effort)
- more realistic factoid questions (focus on real newspaper readers, introduce yes/no questions)
Answers
- accept short snippets in response to definition and How-questions
Evaluation
- introduce new judgments (consider usefulness)
- introduce new evaluation measures (study measures that reward self-scoring, introduce automated evaluation)

