Answer Validation Exercise (AVE), QA subtrack at the Cross-Language Evaluation Forum 2007. UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo.


1 Answer Validation Exercise (AVE), QA subtrack at the Cross-Language Evaluation Forum 2007. UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo. Thanks to the Main task organizing committee.

2 nlp.uned.es/QA/ave
What? The Answer Validation Exercise: validate the correctness of the answers given by the participants at CLEF QA 2007.

3 AVE 2006: an RTE exercise
If the text semantically entails the hypothesis, then the answer is expected to be correct.
(Diagram: the QA system outputs the question, the exact answer, and a supporting snippet with its doc ID; the question and answer are rewritten into an affirmative-form hypothesis, and the snippet plays the role of the text.)
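The pipeline on this slide (question plus exact answer, rewritten into an affirmative hypothesis that an RTE system checks against the supporting text) can be sketched as follows. The rewriting rules and the function name are illustrative assumptions, not the organizers' actual implementation.

```python
# Hypothetical sketch of AVE-2006-style hypothesis generation:
# combine a question and an exact answer into an affirmative
# hypothesis, which an RTE system then checks against the text.
def build_hypothesis(question: str, answer: str) -> str:
    """Turn a question plus a candidate answer into an affirmative sentence."""
    q = question.rstrip("?").strip()
    # Very rough rewriting rules for two common question forms (assumed).
    if q.lower().startswith("what is "):
        return f"{q[len('What is '):]} is {answer}."
    if q.lower().startswith("who is "):
        return f"{q[len('Who is '):]} is {answer}."
    # Fallback: attach the answer to the question stem.
    return f"{q}: {answer}."

hypothesis = build_hypothesis("What is Zanussi?",
                              "an Italian producer of home appliances")
# The entailment step then asks: does the supporting snippet entail this hypothesis?
```

A real system would need far richer question rewriting, but the slide's point is only that correctness checking reduces to a text/hypothesis entailment decision.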

4 Answer Validation Exercise
(Diagram: a Question Answering black box produces a question, a candidate answer and a supporting text. Automatic Hypothesis Generation combines the question and the answer into a hypothesis, and Textual Entailment decides between "answer is correct" and "answer is not correct or not enough evidence". AVE 2006 covered the entailment step; AVE 2007 covers the full answer-validation pipeline.)

5 Answer Validation Exercise
 AVE 2006: not possible to quantify the potential gain that AV modules give to QA systems
 Change in the AVE 2007 methodology: group answers by question; systems must validate all of them, but select one
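A minimal sketch of the changed protocol, assuming each system attaches a confidence score to its per-answer validation decision; the tuple layout and the scores are hypothetical, not the official AVE format.

```python
# Illustrative sketch (not the official scorer) of the AVE 2007
# protocol: judge every candidate answer in a question group,
# and additionally select exactly one validated answer per group.
def select_per_group(judgements):
    """judgements: {question_id: [(answer, validated: bool, score: float), ...]}
    Returns the selected answer per group (highest-scored validated one),
    or None when the whole group is rejected."""
    selected = {}
    for qid, answers in judgements.items():
        validated = [a for a in answers if a[1]]
        if validated:
            selected[qid] = max(validated, key=lambda a: a[2])[0]
        else:
            selected[qid] = None  # no answer accepted for this question
    return selected

sel = select_per_group({
    "q1": [("Zanussi", True, 0.9), ("Castellano", True, 0.4),
           ("1985", False, 0.7)],
})
```

Grouping by question is what makes the 2007 results comparable with QA systems: the selected answer per question can be scored like a QA run.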

6 AVE 2007 Collections
Example question: What is Zanussi? Candidate answers with their supporting snippets:
1. "was an Italian producer of home appliances": "Zanussi. For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought..."
2. "who had also been in Cassibile since August 31": "Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31."
3. "3 (1985)": "3 Out of 5 Live (1985) What Is This?"

7 Collections
 Remove duplicated answers inside the same question group
 Discard NIL answers, void answers and answers with too long a supporting snippet
 This processing led to a reduction in the number of answers to be validated
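The three filtering steps above can be sketched as follows, assuming each answer is a (question_id, answer_text, snippet) triple; the 500-character cutoff is an illustrative threshold, not the organizers' actual limit.

```python
MAX_SNIPPET_CHARS = 500  # assumed cutoff for "too long" snippets

def clean_collection(rows):
    """Filter (question_id, answer_text, snippet) triples as described above."""
    seen = set()
    kept = []
    for qid, answer, snippet in rows:
        if not answer or answer.upper() == "NIL":
            continue                      # discard NIL / void answers
        if len(snippet) > MAX_SNIPPET_CHARS:
            continue                      # discard over-long supporting snippets
        key = (qid, answer.lower())
        if key in seen:
            continue                      # duplicate inside the same question group
        seen.add(key)
        kept.append((qid, answer, snippet))
    return kept

cleaned = clean_collection([("q1", "Zanussi", "short snippet"),
                            ("q1", "zanussi", "another snippet"),
                            ("q1", "NIL", "snippet")])
```

Deduplication is scoped per question group, so the same answer string may legitimately appear under different questions.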

8 Collections (# answers to validate)
Available for CLEF participants at nlp.uned.es/QA/ave/

Language     Testing   Development
English      202       1121
Spanish      564       1817
German       282       504
French       187       1503
Italian      103       476
Dutch        202       528
Portuguese   367       817
Bulgarian    -         70
Romanian     127       -

9 Evaluation
 Collections are not balanced
 Approach: detect whether there is enough evidence to accept an answer
 Measures: precision, recall and F over ACCEPTED answers
 Baseline system: accept all answers
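The measures and the accept-all baseline can be sketched as follows; this is a minimal illustration of the definitions above, not the official scorer.

```python
def accepted_prf(decisions, gold):
    """Precision, recall and F over ACCEPTED answers.
    decisions, gold: parallel lists of booleans
    (True = system accepted / gold-correct answer)."""
    tp = sum(1 for d, g in zip(decisions, gold) if d and g)
    accepted = sum(decisions)
    correct = sum(gold)
    precision = tp / accepted if accepted else 0.0
    recall = tp / correct if correct else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = [True, False, False, True]       # toy gold correctness flags
baseline = [True] * len(gold)           # accept-all baseline
p, r, f = accepted_prf(baseline, gold)
```

On an unbalanced collection the accept-all baseline always reaches recall 1 but only the correct-answer ratio as precision, which is why F over accepted answers is the headline measure.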

10 Evaluation
Precision, recall and F measure over correct answers for English:

Group               System        F     Precision  Recall
DFKI                ltqa_2        0.55  0.44       0.71
DFKI                ltqa_1        0.46  0.37       0.62
U. Alicante         ofe_1         0.39  0.25       0.81
Text-Mess Project   Text-Mess_1   0.36  0.25       0.62
Iasi                adiftene      0.34  0.21       0.81
UNED                rodrigo       0.34  0.22       0.71
Text-Mess Project   Text-Mess_2   0.34  0.25       0.52
U. Alicante         ofe_2         0.29  0.18       0.81
100% VALIDATED (baseline)         0.19  0.11       1
50% VALIDATED (baseline)          0.18  0.11       0.5

11 Comparing AV systems' performance with QA systems (German)

Group              System        Type  QA accuracy  % of perfect selection
Perfect selection                QA    0.54         100%
FUH                iglockner_2   AV    0.50         93.44%
FUH                iglockner_1   AV    0.48         88.52%
DFKI               dfki071dede   QA    0.35         65.57%
FUH                fuha071dede   QA    0.32         59.02%
Random                           AV    0.28         51.91%
DFKI               dfki071ende   QA    0.25         45.9%
FUH                fuha072dede   QA    0.21         39.34%
DFKI               dfki071ptde   QA    0.05         9.84%

12 Techniques reported at AVE 2007
 10 reports, all of them reporting an RTE approach (number of systems per technique):
Generates hypotheses: 6
Wordnet: 3
Chunking: 3
n-grams, longest common subsequences: 5
Phrase transformations: 2
NER: 5
Num. expressions: 6
Temp. expressions: 4
Coreference resolution: 2
Dependency analysis: 3
Syntactic similarity: 4
Functions (sub, obj, etc.): 3
Syntactic transformations: 1
Word-sense disambiguation: 2
Semantic parsing: 4
Semantic role labeling: 2
First-order logic representation: 3
Theorem prover: 3
Semantic similarity: 2

13 Conclusion
 Evaluation in a real environment: real systems' outputs become the AVE input
 Developed methodologies: build collections from QA responses, evaluate in chain with a QA track, compare results with QA systems
 New testing collections for the QA and RTE communities, in 7 languages, not only English

14 Conclusion
 9 groups, 16 systems, 4 languages
 All systems based on Textual Entailment
 5 out of 9 groups also participated in QA: introduction of RTE techniques in QA, more NLP, more Machine Learning
 Systems based on syntactic or semantic analysis perform Automatic Hypothesis Generation: a combination of the question and the answer, in some cases directly in a logic form

