QRISTAL (Questions-Réponses Intégrant un Système de Traitement Automatique des Langues): question answering integrating a natural language processing system


1 QRISTAL (Questions-Réponses Intégrant un Système de Traitement Automatique des Langues, i.e. "question answering integrating a natural language processing system") at QA@CLEF 2005
Presentation QA@CLEF, Sept. 22, 2005, Synapse Développement, D. LAURENT

2 1. History
• Synapse Développement was founded in January 1994. The company works exclusively on software that makes heavy use of linguistics.
• At first, we published proofing tools: a spell checker and a grammar checker.
• In 1997, Microsoft selected our company as its vendor for French proofing tools. Since then, our proofing tools have been integrated into the Office software, so our copyright notice, in the "About…" box of Word, is present on about 500 million computers worldwide.

3
• For the proofing tools, we developed a parser, recognised as the most efficient parser for French (used by almost all French NLP laboratories).
• In the 1990s, we developed so many linguistic resources (a general taxonomy and more than 50 dictionaries) that it seemed sensible to use them in NLP domains other than proofing tools.
• In 1999, with the help of ANVAR (a French government agency), we created our first question-answering system, commercialised in 2001 under the name "Chercheur".
• "Chercheur" was a monolingual system only, but it was really the initial version of QRISTAL.

4
• In 2001, we began an R&D project co-financed by the European Commission, named TRUST, under Synapse's technological leadership and addressing a multilingual QA system. It was submitted by a consortium of 6 SMEs:
  - Synapse Développement, Toulouse, France
  - Expert System Solutions, Modena, Italy
  - Priberam, Lisbon, Portugal
  - TiP, Katowice, Poland
  - Convis, Berlin, Germany & Paris, France
  - Sémiosphère, Toulouse, France (coordination)

5
• TRUST started in November 2001 and was completed in October 2003.
• It was designed as an industrial project, with the aim of commercialising, in B2B and B2C, QA software that lets any user retrieve one or several answers to a general-purpose or factual question.
• It was intended to answer questions over a finite corpus (hard disk, set of documents…) or questions addressed to the Internet through a meta-engine that uses the most popular search engines (Google, MSN, Altavista, AOL, etc.).

6
• Targeted languages: French, Italian, Polish, Portuguese.
• English was not part of TRUST but was developed in parallel. English is the pivot language, which makes it possible to ask a question in one language and get the reply in another.
• All partners owned a syntactic analyser and substantial linguistic resources.
• Synapse, as technology transferor, had at its disposal its previously commercialised engine ("Chercheur") to index and retrieve documents.

7 2. Qristal Engine description
At completion, the Qristal engine has highly original features:
• Indexing covers words, expressions and named entities, but also concepts, domains and the types of question and answer.
• Excerpt search and answer extraction rely on a deep, fine-grained syntactic, conceptual and semantic analysis.

8 A modular design
Diagram labels: French, Italian, Portuguese, Polish and English language modules; indexing engine; text extraction engine; index; documents; visualization of results.


10 2.1. Document Indexing
• Indexes numerous document formats (.html, .doc, .pdf, .ps, .sgml, .xml, .hlp, .dbx, etc.) as well as archived/compressed files (.zip) and plain-text files (Unicode or not).
• Automated spell checking may be carried out beforehand.
• Beyond the usual indexing of terms, a semantic and syntactic analysis indexes the concepts and the types of answers (e.g. a date of birth, or a title or an occupation for a person).

11
• Simple words are indexed by "head of derivation", i.e. words such as "symmetry", "symmetrical", "symmetrically", "asymmetric", "dissymmetric" or "symmetrisable" are all indexed under the same heading, "symmetry".
• This technique:
  - reduces the size of the indexes,
  - facilitates the grouping of neighbouring notions,
  - avoids the classical "term expansion" step at query time.
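Indexing by head of derivation can be illustrated with a minimal Python sketch. The derivation table below is hypothetical and hand-written from the example words on the slide; the real system presumably derives heads from its dictionaries, and `head_of` and `build_index` are illustrative names, not part of QRISTAL.

```python
from collections import defaultdict

# Hypothetical derivation table: surface form -> head of derivation.
# Only the example family from the slide is listed.
HEAD_OF_DERIVATION = {
    "symmetry": "symmetry",
    "symmetrical": "symmetry",
    "symmetrically": "symmetry",
    "asymmetric": "symmetry",
    "dissymmetric": "symmetry",
    "symmetrisable": "symmetry",
}


def head_of(word):
    """Return the head of derivation, or the lowercased word if unknown."""
    w = word.lower()
    return HEAD_OF_DERIVATION.get(w, w)


def build_index(docs):
    """Map each head of derivation to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[head_of(token.strip(".,;:!?"))].add(doc_id)
    return index


docs = {1: "An asymmetric shape.", 2: "Drawn symmetrically."}
index = build_index(docs)
print(sorted(index["symmetry"]))  # both documents share one index entry
```

Because every member of the family lands under one key, a query on "symmetry" reaches both documents without expanding the query into its derived forms.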

12 Technical characteristics
• Currently, indexing is performed in 1 KB blocks: the texts are sliced into 1 KB blocks (each ending at the end of a sentence), and every head of derivation is indexed with its number of occurrences (e.g. found 3 times in block 15, so the occurrence count is 3).
• Indexing speed depends on the language: about 300 MB/hour for French and Polish, about 240 MB/hour for Portuguese, about 100 MB/hour for English and about 10 MB/hour for Italian.
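The block slicing and per-block occurrence counts can be sketched as follows. This is a rough illustration, not the actual implementation: it assumes a block never grows past the size limit unless a single sentence is longer than the limit, it splits sentences with a naive regular expression, and it counts plain lowercased words where the real system would count heads of derivation.

```python
import re
from collections import Counter


def slice_blocks(text, max_bytes=1024):
    """Cut text into blocks that end on a sentence boundary.

    A block holds as many whole sentences as fit in max_bytes; a single
    sentence longer than max_bytes becomes a block of its own.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    blocks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            blocks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        blocks.append(current)
    return blocks


def index_blocks(blocks):
    """Map each term to {block number: occurrence count}."""
    index = {}
    for block_no, block in enumerate(blocks):
        counts = Counter(re.findall(r"\w+", block.lower()))
        for term, n in counts.items():
            index.setdefault(term, {})[block_no] = n
    return index


# Tiny limit so the example splits into two blocks.
blocks = slice_blocks("One two. Three four. Five six.", max_bytes=20)
occurrences = index_blocks(blocks)
print(blocks)
print(occurrences["five"])
```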

13 Conceptual Indexing and Ontology
• TRUST shares a common ontology across the linguistic modules of all its languages. This ontology, developed by Synapse, has 5 hierarchical levels:
  - 28 categories at the top level
  - 94 categories at the second level
  - 256 categories at the third level
  - 3,387 categories at the fourth level
  - over 71,000 terms (including 25,000 meanings for 9,000 polysemous words) and over 60,000 "syntagms" at the base level.

14 Indexing & Types of questions
• QRISTAL indexes the types of questions: when analysing each block of text, each linguistic module attempts to detect the possible answers for each type of question (person, date, event, cause, aim, etc.).
• The current taxonomy of question types comprises 86 categories. It goes beyond the "factual", since it includes notions such as "usefulness", "comparison" and "judgement", as well as categories like "yes/no" or classification.
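To make the idea of indexing answer types concrete, here is a toy Python sketch. It assumes that a block is tagged with a type whenever a simple surface cue appears; the two regular expressions are hypothetical stand-ins for the deep linguistic analysis the slide describes, and cover only 2 of the 86 categories.

```python
import re

# Hypothetical surface cues for two answer types. The real system uses
# syntactic and semantic analysis, not regular expressions.
ANSWER_TYPE_PATTERNS = {
    "date": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),
    "cause": re.compile(r"\bbecause\b", re.IGNORECASE),
}


def answer_types(block):
    """Return the answer types a block of text could provide."""
    return {t for t, pattern in ANSWER_TYPE_PATTERNS.items() if pattern.search(block)}


def index_answer_types(blocks):
    """Map each answer type to the set of block numbers that may answer it."""
    index = {}
    for block_no, block in enumerate(blocks):
        for t in answer_types(block):
            index.setdefault(t, set()).add(block_no)
    return index


blocks = [
    "TRUST started in November 2001.",
    "The feature was disabled because it was slow.",
]
type_index = index_answer_types(blocks)
print(type_index)
```

At query time, a question typed as "date" can then be matched directly against the blocks listed under that type.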

15 Analysis of the question
• When the user types in a question, its language is detected automatically, and the matching linguistic module performs the semantic and syntactic analysis of the question.
• When words in the question have several meanings, the most probable meaning in context is chosen, but the user may force the meaning of each word.
• The same linguistic module determines the domain, the concepts and, above all, the type of the question.
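The slides do not say how the language of the question is detected. One simple approach, shown here purely as an assumption, is to count function words per language and pick the best match; the word lists are small hypothetical samples, and Polish is omitted for brevity.

```python
import re

# Hypothetical function-word samples per language (not from the source).
FUNCTION_WORDS = {
    "fr": {"le", "la", "les", "est", "qui", "quel", "quelle", "de", "des", "en"},
    "en": {"the", "is", "who", "what", "which", "of", "in", "was"},
    "it": {"il", "lo", "gli", "chi", "che", "quale", "di", "è"},
    "pt": {"o", "os", "quem", "que", "qual", "do", "da", "é", "foi"},
}


def detect_language(question):
    """Return the language whose function words overlap most with the question."""
    tokens = set(re.findall(r"\w+", question.lower()))
    scores = {lang: len(tokens & words) for lang, words in FUNCTION_WORDS.items()}
    return max(scores, key=scores.get)


print(detect_language("Qui est le président de la France ?"))
print(detect_language("Who was the first president of Italy?"))
```

The detected code would then select which linguistic module analyses the question.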

16 2.2. The text search
• From the data obtained by analysing the question (heads of derivation, named entities, domains, concepts, question type), the search engine extracts from the index the blocks of text that best match this set of data.
• The different kinds of data are weighted against one another so that a disambiguation error on a word meaning or on the question type does not prevent the retrieval of blocks that may contain an answer.
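The balancing of evidence can be pictured as a weighted sum over the kinds of data, so that no single kind acts as a hard filter. The sketch below is an assumption about the general shape of such a scoring function; the weights and feature names are invented for illustration and do not come from QRISTAL.

```python
# Hypothetical weights: no feature is mandatory, so an error on one
# (for example a wrong question type) lowers the score without excluding the block.
WEIGHTS = {"heads": 0.4, "entities": 0.25, "concepts": 0.15, "domain": 0.05, "qtype": 0.15}


def score_block(question, block, weights=WEIGHTS):
    """Weighted share of the question's features that the block also carries."""
    score = 0.0
    for feature, weight in weights.items():
        wanted = question.get(feature, set())
        if wanted:
            overlap = len(wanted & block.get(feature, set())) / len(wanted)
            score += weight * overlap
    return score


def rank_blocks(question, blocks):
    """Return block ids ordered from best to worst score."""
    return sorted(blocks, key=lambda b: score_block(question, blocks[b]), reverse=True)


question = {
    "heads": {"create", "synapse"},
    "entities": {"Synapse Développement"},
    "qtype": {"date"},
}
blocks = {
    # Matches on words and entity, but the block was typed "person", not "date".
    0: {"heads": {"create", "synapse"}, "entities": {"Synapse Développement"}, "qtype": {"person"}},
    # Matches only on the question type.
    1: {"heads": {"parser"}, "qtype": {"date"}},
}
print(rank_blocks(question, blocks))
```

In this example block 0 still ranks first although its type does not match, which is the behaviour the slide asks for.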

17 Extraction of Answers
• For a given question, after an optional spell check and the syntactic, semantic and conceptual analysis, the detected question type, heads of derivation, named entities, concepts and domains are compared with the corresponding indexes.
• The best-ranked blocks are analysed.
• Then the answers are extracted.
• Answer extraction works by searching for the named entities or syntactic groups that are in a position to serve as the answer.

18 Response time
• For a question typed in on a closed corpus (hard disk, corpus, Intranet), the answer is provided in less than 3 seconds in French on a 3 GHz Pentium.
• With other languages it can take up to 10 seconds.
• For a question typed in on the Internet, the response time may be anywhere between 2 and 14 seconds, depending on:
  - the language used,
  - the number of pages analysed (user-definable),
  - the type of the question (many answers are retrieved very quickly from the available snippets alone).

19 2.3. QRISTAL Presentation
• QRISTAL is the B2C version of the European project TRUST.
• It is priced at €99 and sold in retail computer outlets and by large consumer retailers such as Virgin Stores or FNAC.
• The fruit of 6 years of development, QRISTAL goes beyond the limits set for TRUST, but it clearly descends from the "old" CHERCHEUR and from the TRUST project.


21
• QRISTAL has 2 major uses:
  - Provide exact answers to questions on "closed corpora" (hard disk, emails, Intranet, etc.), which are indexed beforehand so that the answers can be extracted from the blocks of text matching the analysis of the question.
  - Provide exact answers to questions addressed to the Internet (web). In this case, Qristal converts the questions into "understandable requests" for the standard engines, retrieves the returned pages and their short descriptions, analyses them and computes the answers.
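How a question becomes a request that a standard engine understands is not detailed in the slides. A crude approximation, given here only as an assumption, is to drop question words and other function words and keep the content terms; the stopword list is a hypothetical English sample, and `to_keyword_request` is an illustrative name.

```python
import re

# Hypothetical English stopword list. A real system would rely on its
# linguistic analysis rather than a fixed list.
STOPWORDS = {
    "a", "an", "the", "is", "was", "were", "did", "do", "does",
    "who", "what", "when", "where", "which", "how", "of", "in", "to",
}


def to_keyword_request(question):
    """Reduce a natural-language question to a space-separated keyword request."""
    tokens = re.findall(r"\w+", question.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


request = to_keyword_request("When was Synapse Développement created?")
print(request)
```

The resulting string is what would be sent to the engines; the returned pages and snippets would then go through the same analysis as blocks from a closed corpus.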

22
• In Qristal, special attention has been given to letting users configure the tool themselves:
  - By design, Qristal targets people unfamiliar with SQL or web queries who wish to obtain an answer directly by formulating their questions in everyday natural language.
  - The interface must therefore be very user-friendly and as simple as possible, so that they can tailor Qristal to their habits and wishes.
  - For more experienced users, files of questions and work on several indexes allow more advanced use.

23 Some figures
• Development languages: C (all linguistic components) and C++
• Program size: 380,000 lines
• Size of the linguistic resources: 35 MB (compressed)
• Number of words and expressions in our semantic network: 130,000
• Number of words and expressions in our translation dictionaries (French-English, English-French): 150,000
• Number of synonyms: more than 350,000 lemmas (5 million forms)

24 Commercialization
• QRISTAL has been on sale since December 2004. Users are satisfied with the results obtained in French, while their judgment of the results in the other languages is (a bit unfairly but honestly) critical.
• Qristal appears to be very reliable, stable and user-friendly, judging by the very small number of calls to customer support.
• Users' expectations are very high, and satisfying them will require a great deal of effort from us in the future.

25 Perspectives
• The same linguistic modules determine the domain, the concepts and, above all, the type of the question.
• QRISTAL will be updated in the coming years, with the following improvements:
  - improve the rate of exact answers and eliminate noise
  - use the popularity of pages to rank them
  - carry out more precise inferences to extract the answers
  - allow "user profiles"
  - include other languages (German, Spanish)
  - better differentiate the answer modes (single answer, all answers)
  - better situate the answers in their context.

26 Perspectives (continued)
• It is very important to keep answering fast. For example, tests in which the answer was validated by a query on the Web showed that this validation took about 10 seconds, so we disabled this feature in the final version.
• Even if it is certain that, in a few years, the Boolean queries currently used by Web engines like Google will seem as obsolete as DOS does now, we must take the high expectations of users into account and avoid claiming that every question has an answer!

27 Perspectives (end)
• Finding the answer to non-factual requests is a really big challenge.
• While the success rate for definitions is acceptable, the answers to questions that expect lists as answers are not good enough, because these questions require sophisticated analysis. That is another challenge.
• In the future we can only experiment with strategies that are not time-consuming. For example, we can imagine using answer databases, but integrated into our system rather than online, since online access would be too slow for the user even with ADSL.

28 Current and future developments
• With the TRUST partners, our company is taking part in 2005 and 2006 in another European project: M-CAST.
• M-CAST is an e-Content project (22249, Multilingual Content Aggregation System based on TRUST Search Engine).
• In this project, we are developing a client-server version of our system in order to exploit the digital resources of libraries.
• At the same time, we are developing specific B2B versions of our system (for example, last year we developed a French module for Invention Machine).

29 3. QRISTAL Evaluation at CLEF 2005
• QRISTAL took part in the QA@CLEF 2005 campaign in 4 categories:
  - French monolingual
  - English-French multilingual
  - Italian-French multilingual
  - Portuguese-French multilingual
• We participated only with French as the target language, leaving our partners to participate with their own languages as targets (in the end, only Priberam took part in a monolingual task).

30 Version of QRISTAL used
• The Synapse QA system evaluated during the QA@CLEF campaign was the commercial version of QRISTAL, with only a few changes to handle the specific format of the question files.
• No code and no resources were added for the specific features of the CLEF corpus (newspapers, importance of time in the questions).
• With QA@CLEF, Synapse took part in its first cross-lingual QA evaluation campaign, although our company had already taken part in the EQUER campaign (evaluation of French question-answering systems), where it ranked first.

31 Technical performance
• The full set of 200 questions on the general corpus was processed in 9 minutes and 18 seconds, i.e. less than 3 seconds per question.
• The linguistic analysis of the blocks ran at about 400 MB/hour for indexing, i.e. 18,000 words/second. Analysis and answer extraction ran at about 230 MB/hour, i.e. 10,000 words/second.
• Over the 200 questions, the "correct" question type was determined in 95.5% of cases. These speed tests were carried out on a 3 GHz Pentium with 1 GB of RAM.
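The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes 1 MB = 1,000,000 bytes, which the slide does not specify; the function names are illustrative.

```python
def seconds_per_question(minutes, seconds, questions):
    """Average processing time per question."""
    return (minutes * 60 + seconds) / questions


def bytes_per_word(mb_per_hour, words_per_second, bytes_per_mb=1_000_000):
    """Average word size implied by a MB/hour rate and a words/second rate.

    bytes_per_mb is an assumption: the slide does not say whether MB means
    10**6 or 2**20 bytes.
    """
    return mb_per_hour * bytes_per_mb / 3600 / words_per_second


avg_time = seconds_per_question(9, 18, 200)
print(f"{avg_time:.2f} s per question")
print(f"{bytes_per_word(400, 18_000):.2f} bytes/word while indexing")
print(f"{bytes_per_word(230, 10_000):.2f} bytes/word while extracting")
print(f"{0.955 * 200:.0f} questions with the correct type")
```

558 seconds over 200 questions gives 2.79 seconds each, and both speed pairs imply a little over 6 bytes per word (spaces and punctuation included), so the two sets of units are consistent with each other.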

32 Monolingual and multilingual results

33 Results by type of question

34 Results by type of question

                        French-French   English-French   Portuguese-French   Italian-French
Not wrong (R + U + X)   138 (69.0%)     92 (46.0%)       85 (42.5%)          (missing)

                                English-French   Portuguese-French   Italian-French
Total of pivot words            623              701                 769
Not translated into English     -                47                  451
Not translated from English     4                6                   -
Badly translated from English   55               63                  -
Total for translation mistakes  59 (9.5%)        116 (16.5%)         451 (58.6%)
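The translation figures on this slide arrive run together, so any split into cells is a reading rather than a given. The Python check below assumes that "47451" means 47 (Portuguese) and 451 (Italian), "46" means 4 (English) and 6 (Portuguese), and "5563" means 55 (English) and 63 (Portuguese), then verifies that this reading reproduces the totals and percentages stated on the slide.

```python
pivot_words = {"English-French": 623, "Portuguese-French": 701, "Italian-French": 769}

# Assumed split of the flattened cells; None marks a cell read as empty.
mistakes = {
    "English-French": {"not_into_english": None, "not_from_english": 4, "badly_from_english": 55},
    "Portuguese-French": {"not_into_english": 47, "not_from_english": 6, "badly_from_english": 63},
    "Italian-French": {"not_into_english": 451, "not_from_english": None, "badly_from_english": None},
}

# Totals as printed on the slide.
reported_totals = {"English-French": 59, "Portuguese-French": 116, "Italian-French": 451}


def total_mistakes(row):
    """Sum the filled cells of one language pair."""
    return sum(v for v in row.values() if v is not None)


def mistake_rate(pair):
    """Translation mistakes as a percentage of pivot words."""
    return 100 * total_mistakes(mistakes[pair]) / pivot_words[pair]


for pair in pivot_words:
    assert total_mistakes(mistakes[pair]) == reported_totals[pair]
    print(pair, total_mistakes(mistakes[pair]), f"{mistake_rate(pair):.2f}%")
```

Under this reading every column sums to its stated total, and the rates come out within rounding of the 9.5%, 16.5% and 58.6% shown on the slide.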

35 Future evaluations
• Synapse intends to participate in CLEF-QA, in both the monolingual and multilingual tracks, in the coming years.
• In the monolingual task (French), our objective for 2007 is to reach about 80% on factoid questions (we already obtain 80% on definition questions).
• In the multilingual task, our objective for 2007 is to reach about 60% for English and Portuguese.
• The maximum time per answer should stay under 3 seconds, given that hardware improvements allow more software processing within these 3 seconds.

36 END Thank you!

