CLIR: opening up possibilities for indigenous languages in South Africa? Research team: Erica Cosijn1, Heikki Keskustalo2, Ari Pirkola2, Karen de Wet1.

1 CLIR: opening up possibilities for indigenous languages in South Africa?
Research team: Erica Cosijn1, Heikki Keskustalo2, Ari Pirkola2, Karen de Wet1 & Kalervo Järvelin2 1University of Pretoria, Pretoria, South Africa 2University of Tampere, Finland

2 Introduction What is CLIR? General methodology Afrikaans-English CLIR
Zulu-English CLIR The road ahead Conclusions

3 What is CLIR? The basic idea is to bridge the language boundary by providing access in one language (the source language) to documents written in another language (the target language). Source language: the language that gives access to the required information, i.e. the query language. Target language: the language of the content in the database.

4 CLIR (cont.)
Use CLIR in: query translation and/or document translation from the source language. Main strategies for query translation: dictionary-based methods, corpus-based methods, and machine translation.

5 CLIR approaches Corpus-based methods: work with frequency analysis
Implication: aboutness of the two collections should be similar  Machine translation: uses morphological parser etc.

6 CLIR: Machine translation
Translates source language texts into target language using: translation dictionaries, other linguistic resources, syntax analysis. Limited availability.
In this approach, a machine translation system is employed for query translation. Such systems aim at correct translation of source language texts into a target language. Translation is based on translation dictionaries, other linguistic resources and syntax analysis to arrive at an unambiguous and high-quality text in the target language. In CLIR applications, the source language query must be a grammatically correct sentence (or a longer text) for the translation to be successful. A major problem with this approach is the limited availability of good machine translation systems. For many language pairs no systems exist and, for many others, the quality of the systems is rather poor and/or their topical scope limited. In the South African context, there is a machine translation system between English, Afrikaans and several African languages ( but its quality for CLIR applications is inadequate.

7 CLIR: Dictionary Based
Problems: limitations of dictionaries, inflected word forms, phrases and compound words, lexical ambiguity. Possible solution: approximate string matching.
The problems inherent in this approach are as follows. Untranslatable keywords: the words are not in the dictionary. One of the reasons is that natural languages evolve and new words are not added to dictionaries on a regular basis. Other categories of words that are not generally found in dictionaries are compound words, proper names, spelling variants and special terms. Inflected word forms are another problem. If the source language words appear in inflected form, they cannot easily be translated, because they do not match the words as entered in the dictionary. A common way to deal with this is stemming: removing prefixes and suffixes from the word forms so as to find a common root or stem of different forms. One of the drawbacks of stemming is that different headforms may be conflated to the same form. Some languages have a high frequency of compounds.

8

9 CLEF The Cross-Language Evaluation Forum supports global digital library applications by (i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes

10 Retrieval system and test data
Inquery – commercially available Probabilistic – i.e. best match, not exact “Bag of words” or structured queries used by Finnish partners in their projects TEST DATA: CLEF 2001 newspaper articles 35 queries (title and description) English to English baseline for comparison 2 sets Afrikaans/Zulu title Afrikaans/Zulu title and description

11 Afrikaans-English CLIR
Afrikaans spoken by third largest group in South Africa as first language Originated mainly from Dutch Germanic language Not inflectional Good technical vocabulary Good resources – e.g. dictionaries, spell checkers, parsers, compound splitters.

12 Methodology : Resources
Electronic bilingual dictionary Filtered commercial dictionary Stopword list Translated from English and adapted Morphological analyzer Derived statistically from analysis of large newspaper text body

13 Dictionary Filtering Headwords identified by string-based rules
Alternative spellings separated and listed as separate headwords Homonyms: each sense listed as separate headword Compounds identified and listed as separate headwords Plurals not included, but solved by morph analyzer Manual checking and fine-tuning

14 Stopword list Translation of existing English stopword list
Check homonyms, e.g. again = weer = weather Large text body – Afrikaans language newspaper articles – 3500 words Frequency analysis compared to translated list Ad hoc additions Accented words added N=341
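The comparison between the frequency analysis and the translated list could be sketched roughly as follows. This is only an illustration: the function name, the toy sentence and the cut-off `top_n` are assumptions, not the team's actual procedure, and the output would still need the manual homonym check mentioned on the slide.

```python
from collections import Counter

def frequent_words_missing_from_list(tokens, translated_stopwords, top_n):
    """Return the top_n most frequent tokens that the translated list lacks."""
    counts = Counter(t.lower() for t in tokens)
    top = [word for word, _ in counts.most_common(top_n)]
    return [word for word in top if word not in translated_stopwords]

# Toy stand-in for the newspaper text body (hypothetical data).
tokens = "die kat is in die tuin en die hond is in die huis".split()
# Toy stand-in for the stopword list translated from English.
translated = {"die", "in", "en"}

# Frequent words the translated list missed are candidates for ad hoc addition.
candidates = frequent_words_missing_from_list(tokens, translated, top_n=3)
print(candidates)
```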

15 Morphological analyser (1)
Based on patterns in language Newspaper text used for manual analysis 3500 words sorted by frequency facilitated duplicate removal 1200 unique words

16 Morphological analyser: Plurals
All plural forms manually identified from 1200 words 62% of Afrikaans plurals formed by adding -e, -s or -’s to singular 13% of plurals have a double vowel in singular and plural is formed by removing one vowel and adding an -e to the end of the word Thus 75% of plurals solved by two simple rules
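A minimal sketch of how the two plural rules might be applied in reverse, to map a plural back to a dictionary headword. The function names and the tiny headword set are hypothetical, and the vowel test is a simplification that ignores accented vowels; the slide does not say how the rules were actually coded.

```python
VOWELS = "aeiou"

def singular_candidates(word):
    """Generate possible singular forms using the two rules on the slide."""
    candidates = []
    # Rule 1: plural = singular + -e, -s or -'s, so strip the ending.
    for suffix in ("'s", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidates.append(word[: -len(suffix)])
    # Rule 2: singular has a double vowel; plural drops one vowel and adds -e.
    # Reverse it: ...VCe -> ...VVC (for example bome -> boom).
    if len(word) >= 4 and word.endswith("e"):
        vowel, consonant = word[-3], word[-2]
        if vowel in VOWELS and consonant not in VOWELS and word[-4] != vowel:
            candidates.append(word[:-3] + vowel * 2 + consonant)
    return candidates

def singularise(word, headwords):
    """Return the first candidate that is a known headword, else None."""
    for candidate in singular_candidates(word):
        if candidate in headwords:
            return candidate
    return None

headwords = {"boom", "boek", "motor", "foto"}  # hypothetical dictionary entries
for plural in ("bome", "boeke", "motors", "foto's"):
    print(plural, "->", singularise(plural, headwords))
```

Checking each candidate against the headword list keeps the rules from inventing singulars that do not exist, which matters because the rules cover only about 75% of plurals.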

17 Morphological analyser: Affixes
Manual analysis of text shows Past tense indicated by ge- prefix, but sometimes embedded, e.g. aangesteek Various suffixes are common: -te, -ste, -er, -ing, -ke, -le, -de, etc. Suffix stripping possible by longest common substring (LCS) matching
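One way to picture suffix stripping by longest common substring matching is sketched below. It uses Python's standard `difflib` to find the longest shared substring between an inflected word and each headword; the selection rule (longest match wins, subject to a minimum length `min_len`) and the sample words are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]

def best_headword(word, headwords, min_len=4):
    """Pick the headword sharing the longest substring with word."""
    best, best_len = None, 0
    for headword in headwords:
        shared = len(longest_common_substring(word, headword))
        if shared >= min_len and shared > best_len:
            best, best_len = headword, shared
    return best

headwords = ["vinnig", "werk", "boom"]  # hypothetical dictionary entries
print(best_headword("vinnigste", headwords))  # superlative with suffix -ste
```

The minimum length guards against accepting a match on one or two shared letters, which would otherwise map unrelated words onto each other.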

18 Morphological analyser: Compounds
Manual analysis of text shows Relatively high occurrence of compounds in Afrikaans - 1% Different types of compounds With or without fogemorphemes (joining morphemes) Only two fogemorphemes identified, namely -s- and -e-

19 Morphological analyser test data: Statistics - solvable

20 Morphological analyser test data: Statistics – not solvable

21

22 Morphological analyser – steps (condensed from flow chart)
Match words found in dictionary Uppercase becomes lower case Remove ge- prefix Double vowel plural case Match longest common subsequence (suffixes as well as compounds solved) Modify lower case to uppercase (probably proper noun) Fuzzy match “as is” with target language database
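The condensed steps above might be chained roughly as in the following sketch. It is an assumption-laden outline rather than the team's analyser: the function names, the `min_lcs` threshold and the sample headwords are invented for illustration, only the prefix form of ge- is handled, and the final fuzzy match against the target database is represented simply by returning the word unchanged.

```python
from difflib import SequenceMatcher

def lcs_len(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def normalise(word, headwords, min_lcs=4):
    """Return (step, result) for the first step that resolves the word."""
    if word in headwords:
        return "dictionary", word
    lower = word.lower()
    if lower in headwords:
        return "lowercase", lower
    if lower.startswith("ge") and lower[2:] in headwords:
        return "ge-prefix", lower[2:]
    # Double-vowel plural, reversed: ...VCe -> ...VVC (bome -> boom).
    if (len(lower) >= 4 and lower.endswith("e")
            and lower[-3] in "aeiou" and lower[-2] not in "aeiou"):
        singular = lower[:-3] + lower[-3] * 2 + lower[-2]
        if singular in headwords:
            return "double-vowel plural", singular
    # Longest-common-substring match (covers suffixes and compound parts).
    best = max(sorted(headwords), key=lambda h: lcs_len(lower, h), default=None)
    if best is not None and lcs_len(lower, best) >= min_lcs:
        return "lcs", best
    # Unresolved: keep the original casing (probably a proper noun) and
    # leave the word for fuzzy matching against the target database.
    return "as-is", word

headwords = {"boom", "werk", "vinnig"}  # hypothetical dictionary entries
for w in ("boom", "Boom", "gewerk", "bome", "vinnigste", "Pretoria"):
    print(w, normalise(w, headwords))
```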

23 Example Database used: CLEF
English title: Pesticides in Baby Food Afrikaans source query: Plaagdoders in babakos English baseline query: #sum(pesticide baby food) The English target query translated from the Afrikaans source query: #sum(#syn(nullstr lues die van plague plague blight infestation pest affliction vexation killer) #syn( nullstr) #syn( baby food))
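The shape of the translated query above (one `#syn` group per source word, all wrapped in `#sum`) can be reproduced with a few lines of string assembly. The sketch below is hypothetical: the function name is invented, and the use of `nullstr` as a placeholder for a source word with no translations is inferred from the example rather than taken from Inquery documentation.

```python
def build_structured_query(translations_per_word):
    """Wrap each source word's translations in #syn, then combine with #sum."""
    groups = []
    for translations in translations_per_word:
        # Assumed convention: an untranslated word yields a nullstr placeholder.
        terms = " ".join(translations) if translations else "nullstr"
        groups.append(f"#syn({terms})")
    return "#sum(" + " ".join(groups) + ")"

# One list of dictionary translations per source query word (toy data).
query = build_structured_query([
    ["plague", "pest", "killer"],
    [],
    ["baby", "food"],
])
print(query)
```

Grouping alternatives under `#syn` treats them as instances of one query key, so a source word with many dictionary translations does not outweigh a word with few.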

24 Results

25 Conclusions Dictionary probably too large Normalizer worked quite well
Compound splitting by LCS methods mostly successful Stopword list adequate Results quite promising

26 Zulu-English CLIR isiZulu spoken by 8,8 million – largest number of speakers for a single language in SA Agglutinative – grammatical information conveyed by attaching pre- and suffixes to roots and stems Nouns: Grammatical genders – 8 classes in Zulu with distinctive prefixes in every class for singular and plural forms Verbs: Affixes mark grammatical relations such as object, subject, tense, mood, aspect

27 Methodology: Zulu to English

28 Methodology (1) Monolingual word list Approximate matching
No electronic bilingual dictionary Approximate matching Of all five metric and non-metric similarity measures tested, skipgrams yielded best results The Zulu word could be identified within three words 80% of the time
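To make the skipgram idea concrete, here is a small sketch of approximate matching against a monolingual word list. The slide does not give the skip distances or the similarity coefficient, so the choices below (character pairs with 0 or 1 skipped characters, pooled into one set and compared with a Jaccard ratio) are assumptions, as are the function names and the sample words.

```python
def skipgrams(word, skips=(0, 1)):
    """Character pairs formed by skipping k characters, for each k in skips."""
    grams = set()
    for k in skips:
        for i in range(len(word) - k - 1):
            grams.add(word[i] + word[i + k + 1])
    return grams

def similarity(a, b):
    """Jaccard overlap between the skipgram sets of two words."""
    grams_a, grams_b = skipgrams(a), skipgrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def rank(query_word, word_list, top_n=3):
    """Return the top_n word-list entries most similar to query_word."""
    ordered = sorted(word_list, key=lambda w: similarity(query_word, w), reverse=True)
    return ordered[:top_n]

word_list = ["umuntu", "isizwe", "abantu"]  # hypothetical monolingual word list
print(rank("izizwe", word_list))
```

Returning the top three candidates mirrors the way the result is reported on the slide, where the correct word was found within the first three matches.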

29 Methodology (2) Translations from Zulu source words into English done manually Problems experienced in this process Paraphrasing due to disparate vocabularies E.g. isinyabulala – person weak from age Homonyms – single words with various meanings E.g. –zwe isizwe izizwe = tribe OR rapidly spreading brain disease

30 Example of paraphrasing

31 Analysis of translation problems

32 The road forward Parsers and morphological analysers in process
Spellcheckers have extensive word lists Increasing web presence of indigenous languages, especially government sites and newspapers, leads to possibility of parallel corpora Cross Cultural Information Retrieval?

33 Conclusions Indigenous Knowledge is a valuable resource – it is important to make it accessible Learn from international research and create a good product from the outset Many opportunities for research

34 Cross Language Information Retrieval (CLIR)
To provide access in one language to documents written in another language Query translation or document translation Approaches Corpus-based techniques Machine translation Dictionary-based techniques Indigenous Knowledge is normally stored in databases in textual form, based on transcribed speech in some indigenous language, and is likely to contain IK in several indigenous languages. The users may well be able to read the text in the original language, but may have difficulty in expressing their interest properly in these languages, and this is the requirement for successful direct access to content. The purpose of CLIR is to provide access in one language (called the source language) to documents written in another language (the target language). In the South African IK context, the source languages could be English, Afrikaans or an African language. The target language would be the indigenous language. The basic approaches to CLIR involve either document or query translation. Document translation requires good machine translation systems, and these will not be available for the South African context anytime soon. Query translation from source language into target language requires fewer resources, but there is a requirement that the user can read the original documents in the language in which they were written. But at the worst, even if the user cannot read the language, you at least have retrieved a relevant set of documents that may be translated manually. The main strategies for query translation are based on using parallel corpora, machine translation and translation dictionaries. We will look at each of these in more detail.

