Priberam Informática Av. Defensores de Chaves, 32 – 3º Esq. 1000-119 Lisboa, Portugal Tel.: +351 21 781 72 60 / Fax: +351 21 781 72 79 CLEF Workshop, Vienna,


1 Priberam's question answering system for Portuguese
Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
CLEF Workshop, Vienna

2 Summary
Introduction
A workbench for NLP: lexical resources; software tools; question categorization
System description: indexing process; question analysis; document retrieval; sentence retrieval; answer extraction
Evaluation & results
Conclusions

3 Introduction
Goal: to build a question answering (QA) engine that finds a unique exact answer to natural-language (NL) questions.
Evaluation: Portuguese monolingual task.
Previous work by Priberam on this subject:
LegiX, a juridical information system
SintaGest, a workbench for NLP
TRUST project (Text Retrieval Using Semantics Technology): development of the Portuguese module in a cross-language environment

4 Lexical resources
Lexicon:
Lemmas, inflections and POS
Sense definitions (*)
Semantic features, subcategorization and selection restrictions
Ontological and terminological domains
English and French equivalents (*)
Lexical-semantic relations (e.g. derivations)
(*) Not used in the QA system.
Thesaurus
Ontology:
Multilingual (**) (English, French, Portuguese), enables translations
Designed by Synapse Développement for TRUST
(**) Only Portuguese information is used in the QA system.

5 Software tools
Priberam's SintaGest, an NLP application that allows:
Building & testing a context-free grammar (CFG)
Building & testing contextual rules for morphological disambiguation and for named entity & fixed expression recognition
Building & testing patterns for question categorization / answer extraction
Compressing & compiling all data into binary files
Statistical POS tagger:
Used together with contextual rules for morphological disambiguation
HMM-based (2nd order), trained on the CETEMPublico corpus
Fast & efficient decoding via the Viterbi algorithm
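The Viterbi decoding step the slide mentions can be illustrated with a toy sketch. The slide describes a 2nd-order HMM trained on CETEMPublico; for brevity this sketch is 1st-order, and the tiny Portuguese tag set and probability tables are invented examples, not the system's real model.

```python
# Toy sketch of Viterbi decoding for an HMM POS tagger (1st-order
# for brevity; the slide's tagger is 2nd-order). All probability
# tables below are invented miniature examples.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`."""
    # best[t] = (probability, path) for the best sequence ending in tag t
    best = {t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            prob, path = max(
                (best[prev][0] * trans_p[prev][t] * emit_p[t].get(w, 1e-6),
                 best[prev][1])
                for prev in tags
            )
            new_best[t] = (prob, path + [t])
        best = new_best
    return max(best.values())[1]

tags = ["DET", "N", "V"]
start_p = {"DET": 0.6, "N": 0.3, "V": 0.1}
trans_p = {
    "DET": {"DET": 0.1, "N": 0.8, "V": 0.1},
    "N":   {"DET": 0.2, "N": 0.2, "V": 0.6},
    "V":   {"DET": 0.5, "N": 0.4, "V": 0.1},
}
emit_p = {
    "DET": {"o": 0.9},
    "N":   {"presidente": 0.7, "discursa": 0.1},
    "V":   {"discursa": 0.8},
}
print(viterbi(["o", "presidente", "discursa"], tags, start_p, trans_p, emit_p))
# → ['DET', 'N', 'V']
```

Dynamic programming keeps decoding linear in sentence length, which is what makes the tagger "fast & efficient" in practice.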

6 Question categorization (I)
86 question categories, flat structure.
Categorization is performed through "rich" patterns (more powerful than regular expressions):
More than one category is allowed (avoiding hard decisions)
"Rich" patterns are conditional expressions with words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases
Everything is built & tested through SintaGest

7 Question categorization (II)
There are 3 kinds of patterns:
Question patterns (QPs): for question categorization
Answer patterns (APs): for sentence categorization (during indexation)
Question answering patterns (QAPs): for answer extraction
Example patterns for the category FUNCTION, with heuristic scores after '=':
QPs:
Question (FUNCTION)
: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15 // e.g. "Quem é Jorge Sampaio?"
: Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15 // e.g. "Que cargo desempenha Jorge Sampaio?"
APs:
Answer (FUNCTION)
: QuestIdent(FUNCTION_N) = 10
: Ergonym = 10
QAPs:
Answer
: Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20 // e.g. "Jorge Sampaio é o {Presidente da República}..."
: {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15 // e.g. "O {presidente da República}, Jorge Sampaio..."
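The pattern conditions above can be sketched as a toy matcher over analysed tokens. The condition names (Word, Root, AnyCat, Distance) mirror the slide, but the token representation and the matcher itself are a simplified illustration, not Priberam's engine, and the example analysis is hand-made.

```python
# Toy sketch of "rich" pattern matching over analysed tokens.
# Tokens carry the surface word, lemma (Root) and POS (Cat); the
# condition vocabulary mirrors the slide but the matcher is an
# invented simplification.

def matches(tokens, pattern):
    """Try to match `pattern` starting anywhere in `tokens`."""
    def match_from(i, p):
        if not p:
            return True
        op, arg = p[0]
        if op == "Distance":                 # skip lo..hi tokens
            lo, hi = arg
            return any(match_from(i + d, p[1:]) for d in range(lo, hi + 1))
        if i >= len(tokens):
            return False
        tok = tokens[i]
        ok = {"Word": tok["word"] == arg,
              "Root": tok["lemma"] == arg,
              "AnyCat": tok["cat"] in arg}[op]
        return ok and match_from(i + 1, p[1:])
    return any(match_from(i, pattern) for i in range(len(tokens)))

# "Quem é Jorge Sampaio?" analysed into tokens (hand-made toy analysis):
tokens = [
    {"word": "quem", "lemma": "quem", "cat": "Pron"},
    {"word": "é", "lemma": "ser", "cat": "V"},
    {"word": "Jorge Sampaio", "lemma": "Jorge Sampaio", "cat": "Nprop"},
]
# QP for FUNCTION: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT)
qp_function = [("Word", "quem"), ("Distance", (0, 3)),
               ("Root", "ser"), ("AnyCat", ("Nprop", "ENT"))]
print(matches(tokens, qp_function))  # True: question gets category FUNCTION
```

Because each condition tests a different annotation layer (surface form, lemma, POS), one pattern generalizes across inflections in a way a plain regular expression over characters cannot.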

8 QA system overview
The system architecture comprises 5 major modules: indexing, question analysis, document retrieval, sentence retrieval and answer extraction.

9 Indexing process
The collection of target documents is analysed (off-line) and information is stored in an index database:
Each document first feeds the sentence analyser
Sentence categorization: each sentence is classified with one or more question categories through the APs
We build indices for:
Lemmas
Heads of derivation
NEs and fixed expressions
Question categories
Ontology domains (at document level)
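The indexing pass above can be sketched as building inverted indices from analysed sentences. The analysis itself is stubbed here (pre-lemmatized input, hand-assigned categories); the real system derives lemmas, NEs and categories from full sentence analysis and the APs.

```python
# Minimal sketch of the indexing pass: each analysed sentence
# contributes its lemmas and question categories to inverted indices.
# The input is pre-analysed toy data; the real analyser is not shown.
from collections import defaultdict

def build_index(documents):
    lemma_index = defaultdict(set)     # lemma -> {doc_id}
    category_index = defaultdict(set)  # question category -> {doc_id}
    for doc_id, sentences in documents.items():
        for sentence in sentences:
            for lemma in sentence["lemmas"]:
                lemma_index[lemma].add(doc_id)
            for cat in sentence["categories"]:
                category_index[cat].add(doc_id)
    return lemma_index, category_index

docs = {
    "d1": [{"lemmas": ["presidente", "Albânia"], "categories": ["FUNCTION"]}],
    "d2": [{"lemmas": ["acordo", "companhia"], "categories": []}],
}
lemma_idx, cat_idx = build_index(docs)
print(sorted(lemma_idx["presidente"]))  # ['d1']
```

Indexing question categories alongside lemmas is what later lets retrieval reward documents whose sentences were pre-classified with the question's category.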

10 Question analysis
Input: an NL question (e.g. "Quem é o presidente da Albânia?" / "Who is the president of Albania?")
Procedure:
Sentence analysis
Question categorization & activation of QAPs (through the QPs)
Extraction of pivots (words, NEs, phrases, dates, abbreviations, …)
"Query expansion" (heads of derivation & synonyms)
Output:
Pivots' lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado)
Question categories (e.g. FUNCTION)
Relevant ontological domains
Active QAPs
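The query-expansion step can be sketched as a lookup in lexical tables. The miniature tables below are invented stand-ins for the real lexical resources; they just reproduce the slide's own example expansions (presidente → presidir, chefe de estado; Albânia → albanês).

```python
# Toy sketch of query expansion with heads of derivation and
# synonyms. The three lookup tables are invented miniatures that
# mirror the slide's example, not the real lexicon.

DERIVATION_HEADS = {"presidente": "presidir"}   # derivational family head
SYNONYMS = {"presidente": ["chefe de estado"]}
RELATED_FORMS = {"Albânia": ["albanês"]}        # e.g. gentilic adjectives

def expand_query(pivots):
    """Map each pivot lemma to the list of forms used for retrieval."""
    expanded = {}
    for p in pivots:
        forms = [p]
        if p in DERIVATION_HEADS:
            forms.append(DERIVATION_HEADS[p])
        forms += SYNONYMS.get(p, []) + RELATED_FORMS.get(p, [])
        expanded[p] = forms
    return expanded

# "Quem é o presidente da Albânia?" -> pivots: presidente, Albânia
print(expand_query(["presidente", "Albânia"]))
```

Expanding before retrieval lets the engine match documents that phrase the answer differently from the question, at the cost of some precision, which the retrieval weights (lemma > head > synonym) then compensate for.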

11 Document retrieval
Input:
Pivots' lemmas (w^L_i), heads (w^H_i) & synonyms (w^S_ij)
Question categories (c_k) & ontological domains (o_l)
Procedure:
Word weighting λ(w) according to: POS; ilf (inverse lexical frequency); idf (inverse document frequency).
Each document d is given a score σ_d:
  σ_d := 0
  For Each pivot i
    If d contains lemma w^L_i Then σ_d += K_L · λ(w^L_i)
    Else If d contains head w^H_i Then σ_d += K_H · λ(w^H_i)
    Else If d contains any synonym w^S_ij Then σ_d += max_j (K_S · ρ(w^S_ij, w^L_i) · λ(w^S_ij))
  If d contains any question category c_k Then σ_d += K_C
  If d contains any ontology domain o_l Then σ_d += K_O
  σ_d := RewardPivotProximity(d, σ_d)
Output: the top 30 scored documents.
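The slide's scoring loop can be transcribed into Python. Only the control flow follows the slide: the weight constants, the word-weighting and synonym-closeness stubs, and the omission of the proximity reward are all simplifications of this sketch.

```python
# Sketch of the per-document scoring loop from the slide. The
# constants and the two weighting stubs are invented; only the
# lemma > head > synonym fallback and the category/domain bonuses
# follow the slide. The proximity reward is omitted.

K_L, K_H, K_S, K_C, K_O = 3.0, 2.0, 1.5, 1.0, 0.5

def word_weight(w):
    return 1.0                  # stub for the POS/ilf/idf weighting

def syn_weight(syn, lemma):
    return 0.8                  # stub: closeness of a synonym to the lemma

def score_document(doc_terms, doc_cats, doc_domains, pivots, cats, domains):
    score = 0.0
    for p in pivots:            # each pivot has lemma, head and synonyms
        if p["lemma"] in doc_terms:
            score += K_L * word_weight(p["lemma"])
        elif p["head"] in doc_terms:
            score += K_H * word_weight(p["head"])
        else:
            syn_scores = [K_S * syn_weight(s, p["lemma"]) * word_weight(s)
                          for s in p["synonyms"] if s in doc_terms]
            if syn_scores:
                score += max(syn_scores)
    if any(c in doc_cats for c in cats):        # shared question category
        score += K_C
    if any(o in doc_domains for o in domains):  # shared ontology domain
        score += K_O
    return score

pivots = [{"lemma": "presidente", "head": "presidir",
           "synonyms": ["chefe de estado"]},
          {"lemma": "Albânia", "head": "Albânia", "synonyms": ["albanês"]}]
print(score_document({"presidente", "Albânia", "pior"}, {"FUNCTION"}, set(),
                     pivots, ["FUNCTION"], []))  # 3.0 + 3.0 + 1.0 = 7.0
```

The elif chain encodes the preference order: an exact lemma match outweighs a derivation-head match, which outweighs a synonym match, so expansion never drowns out direct evidence.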

12 Sentence retrieval
Input: scored documents {(d, σ_d)} with relevant sentences marked.
Procedure:
Sentence analysis
Sentence scoring: each sentence s is given a score σ_s according to:
the number of pivots' lemmas, heads & synonyms matching s
the number of partial matches (Fidel ↔ Fidel Castro)
order & proximity of pivots in s
existence of common question categories between the question q and s
the score σ_d of the document d containing s
Output: scored sentences {(s, σ_s)} above a fixed threshold.
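These criteria can be sketched as an additive sentence score. The individual weights below are invented for illustration; only the list of contributing factors (pivot matches, pivot proximity, shared categories, document score) comes from the slide, and partial matches are omitted for brevity.

```python
# Toy sketch of sentence scoring: pivot matches, pivot proximity,
# shared question categories and the containing document's score all
# contribute. Every numeric weight here is an invented example.

def score_sentence(sent_lemmas, sent_cats, pivots, q_cats, doc_score):
    positions = [i for i, lemma in enumerate(sent_lemmas) if lemma in pivots]
    score = 2.0 * len(positions)                 # one credit per pivot match
    if len(positions) > 1:                       # proximity bonus: closer
        span = positions[-1] - positions[0]      # pivots score higher
        score += max(0.0, 2.0 - 0.2 * span)
    if set(sent_cats) & set(q_cats):             # shared question category
        score += 1.5
    score += 0.1 * doc_score                     # inherit document evidence
    return score

# Lemmatized toy sentence for "O Presidente da Albânia, Sali Berisha,
# tentou evitar o pior ..."
sent = ["o", "presidente", "de", "Albânia", "Sali Berisha", "tentar", "evitar"]
print(score_sentence(sent, ["FUNCTION"], {"presidente", "Albânia"},
                     ["FUNCTION"], 7.0))
```

Keeping the document score as one term of the sentence score means a mediocre sentence in a strongly matching document can still beat a lone match in an irrelevant one.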

13 Answer extraction
Input:
Scored sentences {(s, σ_s)}
Active QAPs (from the question analysis module)
Procedure:
Answer extraction & scoring through the QAPs
e.g. "O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque." ("The President of Albania, Sali Berisha, tried to avoid the worst, stating that it is not proven that the Greek government was involved in the attack.")
Answer coherence: each answer a is rescored to σ_a taking into account its coherence with the whole collection of candidate answers (e.g. "Sali Berisha", "Ramiz Alia", "Berisha")
Selection of the final answer
Output: the answer a with the highest σ_a, or 'NIL' if no answer was extracted.
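The coherence rescoring step can be sketched with a simple mutual-support rule: candidates that contain or are contained in other candidates reinforce each other, as with "Berisha" and "Sali Berisha". The substring test and the 0.5 support weight are invented simplifications of whatever coherence measure the system actually uses.

```python
# Toy sketch of answer-coherence rescoring: a candidate answer is
# boosted by the scores of other candidates that textually support
# it (substring overlap). The support rule and its 0.5 weight are
# invented illustrations.

def rescore(candidates):
    """candidates: list of (answer, score). Returns {answer: new_score}."""
    rescored = {}
    for ans, score in candidates:
        support = sum(s for other, s in candidates
                      if other != ans and (other in ans or ans in other))
        rescored[ans] = score + 0.5 * support
    return rescored

cands = [("Sali Berisha", 20.0), ("Ramiz Alia", 15.0), ("Berisha", 10.0)]
best = max(rescore(cands).items(), key=lambda kv: kv[1])
print(best)  # ('Sali Berisha', 25.0): reinforced by "Berisha"
```

The effect is that near-duplicate extractions, instead of splitting the vote, concentrate it on the most complete variant of the answer.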

14 Results & evaluation (I)
Evaluation: Portuguese monolingual task.
Target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995.
Test set of 200 questions (in Brazilian and European Portuguese).
Results: 64.5% of right answers (R).

15 Results & evaluation (II)
Reasons for bad answers (W+X+U):
16.5%: extraction of candidate answers
"Como se chama a Organização para a Alimentação e Agricultura das Nações Unidas?" ("What is the United Nations Food and Agriculture Organization called?")
Overextraction: "(...) que viria a estar na origem da FAO (a Organização para a Alimentação e a Agricultura das Nações Unidas)"
8.0%: NIL validation
"Que partido foi fundado por Andrei Brejnev?" ("Which party was founded by Andrei Brezhnev?") should return NIL
6.5%: choice of the final answer
"O que é a Sabena?" ("What is Sabena?")
1st answer: "No caso da Sabena, a Swissair (…) terá de pronunciar-se"
2nd answer: "(...) o acordo de união entre a companhia aérea belga Sabena"
4.5%: document retrieval
"Diga o nome de um assassino em série americano." ("Name an American serial killer.")
The right document was missed: no match between americano and EUA in "(...) John Wayne Gacy, maior assassino em série da história dos EUA (…)"

16 Conclusions
Priberam's QA system exhibited encouraging results:
State-of-the-art accuracy (64.5%) in the evaluation
Possible advantages over other systems:
Adjustable & powerful patterns for categorization & extraction (SintaGest)
Query expansion through heads of derivation & synonyms
Use of the ontology to introduce semantic knowledge
Some future work:
Confidence measure for final answer validation
Handling of list-, how- and temporally-restricted questions
Semantic disambiguation & further exploitation of the ontology
Syntactic parsing & anaphora resolution
Refinement for Web & book searching


18 Ontology
Concept-based, tree-structured, 4 levels:
Nodes are concepts
Leaves are senses of words
Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech)
There are 3387 terminal nodes (the most specific concepts)
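The structure described above can be sketched as a nested tree whose leaves carry word senses with per-language translations. The miniature tree below, including its concept names and the sense entry, is an invented example of the shape, not an excerpt from the real 3387-node ontology.

```python
# Toy sketch of a concept-based ontology: internal nodes are
# concepts, leaves hold word senses with per-language translations.
# The tree and all names in it are invented illustrations.

ontology = {
    "entity": {
        "person": {
            "by-function": {
                "head-of-state": {          # a terminal (most specific) node
                    "senses": [{"PT": "presidente", "EN": "president",
                                "FR": "président"}],
                }
            }
        }
    }
}

def find_concept(tree, word, lang, path=()):
    """Return the concept path whose senses contain `word` in `lang`."""
    for name, node in tree.items():
        if "senses" in node:                # leaf level: check the senses
            if any(s.get(lang) == word for s in node["senses"]):
                return path + (name,)
        else:                               # internal concept: recurse
            found = find_concept(node, word, lang, path + (name,))
            if found:
                return found
    return None

print(find_concept(ontology, "presidente", "PT"))
```

Storing translations at the sense (leaf) level, rather than at the word level, is what lets the same concept node serve several languages without conflating distinct senses of a word.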

