Terminology Retrieval: towards a synergy between thesaurus and free text searching Anselmo Peñas, Felisa Verdejo and Julio Gonzalo Dpto. Lenguajes y Sistemas.


1 Terminology Retrieval: towards a synergy between thesaurus and free text searching
Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
Dpto. Lenguajes y Sistemas Informáticos, UNED
VIII Iberoamerican Conference on Artificial Intelligence, Sevilla, 2002

2 Overview
- Motivation
- Objectives
- Proposed approach: Terminology Retrieval
- Website Term Browser
- Evaluation
- Conclusions

3 Multilingual Thesaurus
- Designed for indexing and searching in a specific subject area
- Vocabulary control
- Promoting consistency
- Cross-language
- Guiding users about which terms to use
- Navigate the thesaurus

Thesaurus excerpt:
60. EDUCATIONAL SYSTEM
Education
  NT1 adult education
    RT adult (10)
    RT lifelong learning
  NT1 basic education
    RT* transition from basic to secondary education
    RT didactic continuity (50)
  NT1 distance education
    UF distance learning
    UF distance study
    UF distance training
    UF ODL
    UF open and distance learning
  NT1 informal education
  NT1 lifelong learning
    UF continuing education
    UF lifelong education
    UF recurrent education
    RT adult education
  (…)
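The thesaurus relations in the excerpt (NT = narrower term, RT = related term, UF = used for, i.e. a non-preferred synonym) can be sketched as a small data structure. This is only an illustration: the `Descriptor` class and `preferred_term` function are hypothetical names, and only a few entries from the excerpt are encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Descriptor:
    """One preferred term and its relations (hypothetical structure)."""
    name: str
    narrower: List[str] = field(default_factory=list)  # NT
    related: List[str] = field(default_factory=list)   # RT
    used_for: List[str] = field(default_factory=list)  # UF


# A few entries copied from the slide's excerpt.
THESAURUS: Dict[str, Descriptor] = {
    "education": Descriptor(
        "education",
        narrower=["adult education", "basic education", "distance education",
                  "informal education", "lifelong learning"],
    ),
    "distance education": Descriptor(
        "distance education",
        used_for=["distance learning", "distance study", "distance training",
                  "ODL", "open and distance learning"],
    ),
    "lifelong learning": Descriptor(
        "lifelong learning",
        related=["adult education"],
        used_for=["continuing education", "lifelong education",
                  "recurrent education"],
    ),
}


def preferred_term(query: str) -> Optional[str]:
    """Map a user term to its preferred descriptor, following UF links."""
    q = query.lower()
    for descriptor in THESAURUS.values():
        synonyms = {s.lower() for s in descriptor.used_for}
        if q == descriptor.name.lower() or q in synonyms:
            return descriptor.name
    return None
```

A user who types "ODL" is guided to "distance education", while a term outside the controlled vocabulary returns `None`, which illustrates why a less specialized audience gets poorer results from thesaurus-only searching.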

4 Multilingual Thesaurus: Problems
- Construction & management (high cost)
- Indexing
  – Manual keyword assessment
  – Errors in automatic keyword assessment
- Domain specific
  – New domain needs a new thesaurus
- Specialist oriented (users must know the preferred descriptors)
  – Less specialized audiences get poorer results

5 Objectives
- Develop a model
  – to help users express and refine their information needs
  – to help users overcome language barriers
- Bring the collection terminology to users
  – Morpho-syntactic, semantic & translingual variations
  – Without the need for thesaurus construction
- Establish an appropriate evaluation framework

6 Proposed approach
[Diagram. Labels: Information Retrieval; Controlled Vocabulary Searching; Free Text Searching; NLP Techniques; Automatic Terminology Extraction; Terminology Retrieval & Term browsing (Website Term Browser)]

7 Terminology Retrieval
- From Automatic Terminology Extraction...
  – Obtain lists of terms relevant for a specific domain
  – Term Extraction
  – Term Weighting
  – Term Selection
- ... to Terminology Retrieval
  – Retrieve terms relevant for an information need
  – The user query points to the relevant terms
  – No truncation of terminology lists
  – Favor recall by relaxing term extraction patterns
- ... & Browsing
  – Navigate through relevant terminology
  – Access information from retrieved terms
  – Bridge the gap between query and collection vocabularies
  – Cross-Language

8 Terminology Retrieval
- Requires
  – Phrase indexing and retrieval
  – Query expansion and translation
- To retrieve terminology variations
  – Morpho-syntactic variations
  – Semantic variations
  – Translingual variations
- Noise in retrieval
- Ambiguity reduction
  – Co-occurrence of expansion words in the same phrase

9 Indexing Steps
1. Text pre-processing and listing of words
2. Word tagging (oriented to phrase detection)
3. Phrase detection & lemmatization of components
4. Document indexing & statistics (document frequency)
5. Phrase selection (Subsumption & Lexicalization degree)
6. Phrase indexing
[Diagram labels: Lemma, Document, Phrase]
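A minimal sketch of steps 3, 4 and 6 may help make the Lemma, Document and Phrase indexes concrete. Everything here is an assumption for illustration: the input is already tokenized, lemmatized and tagged (steps 1 and 2), the phrase pattern (a run of adjectives and nouns ending in a noun) is a stand-in for the authors' extraction patterns, and step 5 (phrase selection) is skipped.

```python
from collections import defaultdict

# Hypothetical pre-tagged documents: (surface form, lemma, coarse POS tag).
DOCS = {
    "d1": [("nuclear", "nuclear", "ADJ"), ("test", "test", "NOUN"),
           ("ban", "ban", "NOUN"), ("treaties", "treaty", "NOUN")],
    "d2": [("the", "the", "DET"), ("treaty", "treaty", "NOUN"),
           ("bans", "ban", "VERB"), ("nuclear", "nuclear", "ADJ"),
           ("tests", "test", "NOUN")],
}


def detect_phrases(tokens, min_len=2):
    """Step 3: return lemmatized ADJ/NOUN runs that end in a noun."""
    phrases, run = [], []
    # A sentinel token closes the last run at the end of the document.
    for _form, lemma, tag in list(tokens) + [("", "", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((lemma, tag))
            continue
        while run and run[-1][1] != "NOUN":  # drop trailing adjectives
            run.pop()
        if len(run) >= min_len:
            phrases.append(tuple(lem for lem, _ in run))
        run = []
    return phrases


def build_indexes(docs):
    """Steps 4 and 6: lemma -> documents, lemma -> phrases, phrase -> documents."""
    lemma_to_docs = defaultdict(set)
    lemma_to_phrases = defaultdict(set)
    phrase_to_docs = defaultdict(set)
    for doc_id, tokens in docs.items():
        for _form, lemma, _tag in tokens:
            lemma_to_docs[lemma].add(doc_id)
        for phrase in detect_phrases(tokens):
            phrase_to_docs[phrase].add(doc_id)
            for lemma in phrase:
                lemma_to_phrases[lemma].add(phrase)
    return lemma_to_docs, lemma_to_phrases, phrase_to_docs


lemma_to_docs, lemma_to_phrases, phrase_to_docs = build_indexes(DOCS)
# Document frequency of a lemma is the size of its posting set.
doc_frequency = {lemma: len(ids) for lemma, ids in lemma_to_docs.items()}
```

Indexing phrases by the lemmas of their components is what lets a later query for any single component retrieve every phrase that contains it.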

10 Query expansion and translation
Query: Tratados de Prohibición de Pruebas Nucleares
- Tratados
  – Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
  – Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
- Prohibición
  – Expansion: embargo, entredicho, interdicción, interdicto, proscripción
  – Translation: ban, interdiction, prohibition, proscription
- Pruebas
  – Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
  – Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
- Nucleares
  – Translation: nuclear
Ambiguity reduction: "Nuclear taste proscription process?" or "Nuclear test ban treaty?"
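The ambiguity-reduction idea on this slide (keep only phrases in which translations of several query words co-occur) can be sketched as follows. The translation sets are a subset of those shown on the slide; the Spanish lemmas, the indexed phrases, the `min_matched` threshold and the function name are assumptions made for illustration, not the authors' implementation.

```python
# Subset of the slide's translations, keyed by (assumed) Spanish lemma.
TRANSLATIONS = {
    "tratado": {"accord", "pact", "process", "treatise", "treaty"},
    "prohibición": {"ban", "interdiction", "prohibition", "proscription"},
    "prueba": {"experiment", "proof", "sample", "taste", "test", "trial"},
    "nuclear": {"nuclear"},
}

# Hypothetical English phrases found in the collection's phrase index.
INDEXED_PHRASES = [
    ("wine", "taste", "test"),
    ("nuclear", "test", "ban", "treaty"),
    ("peace", "process"),
]


def retrieve_phrases(query_lemmas, phrases, min_matched=2):
    """Keep phrases in which translations of at least `min_matched`
    different query lemmas co-occur; rank by number of lemmas covered."""
    scored = []
    for phrase in phrases:
        words = set(phrase)
        matched = [q for q in query_lemmas
                   if TRANSLATIONS.get(q, set()) & words]
        if len(matched) >= min_matched:
            scored.append((len(matched), phrase))
    scored.sort(key=lambda item: -item[0])  # stable: ties keep index order
    return [phrase for _, phrase in scored]


query = ["tratado", "prohibición", "prueba", "nuclear"]
results = retrieve_phrases(query, INDEXED_PHRASES)
```

Spurious combinations such as "nuclear taste proscription process" never need to be rejected explicitly: they simply do not occur as phrases in the collection, so only attested phrases like "nuclear test ban treaty" are retrieved.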

11 Retrieval
[Pipeline diagram. Stages: query; Tokenising (tok 1, tok 2, tok 3); Lemmatising with the Lexicon (lem 11, lem 12, ...); Expansion / Translation with EWN & Dic. (exp, tran); Phrase retrieval over the Phrase index; Term ranking (output: terms); Document retrieval over the Document index; Document ranking (output: documents)]

12 [Screenshot labels: Query in Spanish; Hierarchy of terms (Catalan, English, Spanish); Ranking of documents]

13
- Translingual
- Morpho-syntactic variations (permutation, insertion)
- Semantic variations

14 Evaluation of Terminology Retrieval
Compare:
- Terminology Retrieval over 42,406 web pages (200 MB)
- Hand-crafted Multilingual Thesaurus (1051 descriptors)

16 Evaluation of Terminology Retrieval
- Recall of mono-lexical terms (lemmas)
  – Monolingual: 85% - 95%
  – Translingual: 55% - 65%
- Recall of poly-lexical terms (phrases)
  – Monolingual: 40% - 65%
  – Translingual: 10% - 45%
- Loss of recall due to
  – Phrase extraction (mainly POS tagging): 3% - 17%
  – Phrase indexing (mainly lemmatization): 2% - 34%
  – Phrase selection: 12% - 37%
  – Lack of connections between different languages in EWN
  – Gaps in the EWN adjective hierarchies
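Recall here measures how many thesaurus descriptors the system manages to retrieve. The slides do not spell out the exact protocol, so the following is only a sketch of the standard recall formula; the term lists are made-up illustrations, not data from the evaluation.

```python
def recall(retrieved_terms, reference_terms):
    """Fraction of reference terms that appear among the retrieved terms."""
    reference = {term.lower() for term in reference_terms}
    if not reference:
        raise ValueError("reference set must not be empty")
    retrieved = {term.lower() for term in retrieved_terms}
    return len(reference & retrieved) / len(reference)


# Illustrative (made-up) lists: thesaurus descriptors vs. retrieved terms.
reference = ["adult education", "distance education",
             "lifelong learning", "informal education"]
retrieved = ["Adult education", "distance education",
             "lifelong learning", "education policy"]

score = recall(retrieved, reference)  # 3 of the 4 descriptors are found
```

Computing the same measure after each indexing stage (extraction, indexing, selection) is one way to attribute the loss of recall to individual stages, as the slide does.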

17 Conclusions
- A search model based on extraction, retrieval and browsing of terminology has been developed
- User oriented
- Interaction over terminological information
  – An intermediate way between free-text searching and thesaurus-guided searching
  – Without the need for thesaurus construction
- Bringing the collection terminology to users
  – Morpho-syntactic & semantic variations
  – Translinguality

18 Conclusions
- An evaluation framework for Terminology Retrieval and Term Browsing has been established
  – It points the way to improving Terminology Retrieval
- Users appreciate Term Browsing
- WTB phrasal information can substantially complement the document ranking provided by search engines

