
1 Spanish Question Answering Evaluation
Anselmo Peñas, Felisa Verdejo and Jesús Herrera
UNED NLP Group, Distance Learning University of Spain
CICLing 2004, Seoul

2 Question Answering task
Give an answer to a question
– Approach: find (search) an answer in a document collection
– A document must support the answer
– Example: Where is Seoul?
  South Korea (correct)
  Korea (responsive?)
  Asia (non-responsive)
  Population of South Korea (inexact)
  Oranges of China (incorrect)

3 QA system architecture
[Diagram] Pipeline from Question to Answer: question analysis (answer type / structure, key terms), pre-processing / indexing of the documents, passage retrieval, answer extraction, and answer validation / scoring. Every stage is an opportunity for natural language techniques.
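
How such a pipeline fits together can be made concrete in code. Below is a minimal, hypothetical sketch in Python; the toy document collection, the keyword-overlap retrieval and the trivial extraction step are illustrative assumptions, not the systems evaluated at CLEF.

```python
from collections import Counter

# Toy document collection standing in for the real corpus.
DOCUMENTS = {
    "doc1": "Seoul is the capital of South Korea.",
    "doc2": "Oranges are grown in southern China.",
}

def analyse_question(question):
    # Question analysis: guess the expected answer type and the key terms.
    answer_type = "LOCATION" if question.lower().startswith("where") else "OTHER"
    key_terms = [w.strip("?").lower() for w in question.split()[2:]]
    return answer_type, key_terms

def retrieve_passages(key_terms):
    # Passage retrieval: rank documents by key-term overlap.
    scores = Counter()
    for doc_id, text in DOCUMENTS.items():
        scores[doc_id] = sum(term in text.lower() for term in key_terms)
    return [doc_id for doc_id, score in scores.most_common() if score > 0]

def extract_answer(doc_ids):
    # Answer extraction / validation: trivially return the first supporting
    # document's text; real systems extract and score an exact answer string.
    if not doc_ids:
        return "NIL", None
    return DOCUMENTS[doc_ids[0]], doc_ids[0]

_, terms = analyse_question("Where is Seoul?")
print(extract_answer(retrieve_passages(terms)))
```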

4 Overview
– Evaluation forums: objectives
– QA evaluation methodology
– The challenge of multilingualism
– QA at CLEF 2003
– QA at CLEF 2004
– Conclusion

5 Evaluation Forums: Objectives
– Stimulate research
– Establish shared lines of work
– Generate resources for evaluation and for training
– Compare different approaches and obtain evidence about them
– Serve as a meeting point for collaboration and exchange (CLEF, TREC, NTCIR)

6 QA Evaluation Methodology
– Test suite production: document collection (hundreds of thousands of documents), questions (hundreds)
– Systems answer (answer + document id) within a limited time
– Judgment of answers by human assessors: correct, inexact, unsupported, incorrect
– Measuring of system behavior: % of questions correctly answered, % of NIL questions correctly detected, precision, recall, F-measure, MRR (Mean Reciprocal Rank, illustrated below), confidence-weighted score, ...
– Results comparison
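
Among the ranked-answer measures listed, MRR has a simple closed form: each question contributes the reciprocal of the rank of its first correct answer (0 if none is correct), averaged over all questions. A minimal sketch, assuming judgments arrive as per-question lists of booleans in system rank order:

```python
def mean_reciprocal_rank(judged_runs):
    # judged_runs: one list per question, booleans in system rank order;
    # True marks a correct answer.
    total = 0.0
    for judgments in judged_runs:
        for rank, correct in enumerate(judgments, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(judged_runs)

# First correct at rank 2, at rank 1, and never: (0.5 + 1.0 + 0.0) / 3 = 0.5
print(mean_reciprocal_rank([[False, True, False], [True], [False, False]]))
```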

7 QA Evaluation Methodology: Considerations on task definition (I)
– Quantitative evaluation constrains the type of questions: they must be assessable in terms of correctness, completeness and exactness (e.g. "Which are the causes of the Iraq war?" is hard to assess)
– Human resources available: test suite generation, assessment (# of questions, # of answers per question)
– Collection: restricted vs. unrestricted domains (news vs. patents); multilingual QA requires comparable collections to be available

8 QA Evaluation Methodology: Considerations on task definition (II)
– Research direction: "Do it better" versus "How to get better results?" Systems are tuned according to the evaluation task, e.g. to the evaluation measure or to external resources (the web)
– Roadmap versus state of the art: What should systems do in the future? (Burger, 2000-2002) When is it realistic to incorporate new features into the evaluation? Type of questions, temporal restrictions, confidence in the answer, encyclopedic knowledge and inference, different sources and languages, consistency between different answers, ...

9 The challenge of multilingualism
May I continue this talk in Spanish?
Then multilingualism still remains a challenge...

10 The challenge of multilingualism
– Feasible with the current QA state of the art? A challenge for systems, but... also a challenge from the evaluation point of view
– What is a possible roadmap to achieve fully multilingual systems?
  – QA at CLEF (Cross-Language Evaluation Forum)
  – Monolingual → Bilingual → Multilingual systems
– What tasks can be proposed according to the current state of the art?
  – Monolingual other than English? Bilingual involving English?
  – Any bilingual pair? Fully multilingual?
– Which new resources are needed for the evaluation?
  – Comparable corpus? Unrestricted domain?
  – Parallel corpus? Domain-specific? Size?
  – Human resources: answers in any language make assessment by native speakers difficult

11 The challenge of multilingualism (cont.)
– How to ensure that fully multilingual systems receive a better evaluation?
  – Some answers in just one language? How?
    » Hard pre-assessment?
    » Different languages for different domains?
    » Different languages for different dates or localities?
    » Parallel collections, extracting a controlled subset of documents different for each language?
  – How to balance the type and difficulty of questions across all languages?
[Table] 250 questions: 50 per language (Spanish, Italian, Dutch, German, French), with 10 per language answerable only in that language. Ouch!

12 The challenge of multilingualism
– Fortunately (unfortunately?), with the current state of the art it is not realistic to plan such an evaluation... Very few systems are able to deal with several target languages... yet
– While we try to answer these questions... planning a separate evaluation for each target language seems more realistic
– This is the option followed by QA at CLEF in the short term

13 Overview
– Evaluation forums: objectives
– QA evaluation methodology
– The challenge of multilingualism
– QA at CLEF 2003
– QA at CLEF 2004
– Conclusion

14 QA at CLEF: groups
– ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
– UNED, Universidad Nacional de Educación a Distancia, Madrid, Spain
– ILLC, Language and Inference Technology Group, U. of Amsterdam
– DFKI, Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
– ELDA/ELRA, Evaluations and Language Resources Distribution Agency, Paris, France
– Linguateca, Oslo (Norway), Braga, Lisbon & Porto (Portugal)
– BulTreeBank Project, CLPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
– University of Limerick, Ireland
– ISTI-CNR, Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", Pisa, Italy
– NIST, National Institute of Standards and Technology, Gaithersburg, USA

15 QA at CLEF 2003
– Task: 200 factoid questions, up to 3 answers per question; exact answer or answer within a 50-byte string
– Document collection [Spanish]: >200,000 news articles (EFE, 1994)
– Questions: DISEQuA corpus (available on the web) (Magnini et al., 2003):
  – Coordinated work between ITC-irst (Italian), UNED (Spanish) and U. of Amsterdam (Dutch)
  – 450 questions and answers translated into English, Spanish, Italian and Dutch
  – 200 questions taken from the DISEQuA corpus (20 NIL)
– Assessment: incorrect, unsupported, non-exact, correct

16 Multilingual pool of questions
[Diagram] Coordination between several groups: each group produces 100 questions with known answers in its target language (Spanish, Italian, Dutch, German, French). All are translated into English to form an English pool (500 questions), which is then translated into the rest of the languages, giving a multilingual pool (500 x 6: Spanish, Italian, Dutch, German, French, English). The final questions are selected from the pool for each target language, after pre-assessment; a sketch of this flow follows.
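
A small sketch to make the pool arithmetic concrete; the `translate` stub stands in for the human translation step and is purely hypothetical:

```python
def translate(question, target_language):
    # Stand-in for the human translation work done by each group.
    return f"[{target_language}] {question}"

groups = ["Spanish", "Italian", "Dutch", "German", "French"]

# Each group contributes 100 questions with a known answer in its language.
native_questions = {g: [f"{g} question {i}" for i in range(100)] for g in groups}

# Step 1: translate everything into English -> English pool of 500 questions.
english_pool = [translate(q, "English")
                for questions in native_questions.values() for q in questions]

# Step 2: translate the English pool into all six languages -> 500 x 6 pool.
languages = groups + ["English"]
multilingual_pool = {lang: [translate(q, lang) for q in english_pool]
                     for lang in languages}

print(len(english_pool), len(multilingual_pool))  # 500 6
```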

17 QA at CLEF 2003

18 QA at CLEF 2004: tasks
[Diagram] Source languages (questions) and target languages (answers & docs.): English, Spanish, French, German, Italian, Portuguese, Dutch. Six main tasks, one per target language; e.g. Spanish: EFE 1994-1995, 1086 Mb (453,045 docs). Candidate additions: Portuguese? Bulgarian? ... Korean?

19 QA at CLEF 2004
– 200 questions
  – Factual: person, object, measure, organization, ...
  – Definition: person, organization
  – How-to
– 1 answer per question (without manual intervention), up to two runs, exact answers
– Assessment: correct, inexact, unsupported, incorrect
– Evaluation:
  – Fraction of correct answers
  – Measures based on the systems' self-scoring (sketched below)
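
The self-scoring-based measure is presumably in the spirit of TREC's confidence-weighted score, in which answers are sorted by the system's own confidence and correct answers near the top count more. A hedged sketch of that measure (the exact CLEF 2004 formulation may differ):

```python
def confidence_weighted_score(results):
    # results: one (confidence, is_correct) pair per question.
    # Sort by the system's self-reported confidence, most confident first.
    ordered = sorted(results, key=lambda r: r[0], reverse=True)
    correct_so_far = 0
    score = 0.0
    for i, (_, is_correct) in enumerate(ordered, start=1):
        if is_correct:
            correct_so_far += 1
        score += correct_so_far / i  # precision within the top-i answers
    return score / len(ordered)

# A correct answer given with high confidence scores better than the same
# answer given with low confidence.
print(confidence_weighted_score([(0.9, True), (0.6, False), (0.2, True)]))
```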

20 QA at CLEF 2004: Schedule
– Registration opens: January 15
– Corpora release: February
– Trial data: March
– Test sets release: May 10
– Submission of runs: May 17
– Release of results: from July 15
– Papers: August 15
– CLEF Workshop: September 15-16

21 Conclusion
Information and resources:
– Cross-Language Evaluation Forum: http://clef-qa.itc.it/2004
– DISEQuA Corpus: Dutch, Italian, Spanish, English
– Spanish QA at CLEF: http://nlp.uned.es/QA (anselmo@lsi.uned.es)

