

1 Checking Terminology Consistency with Statistical Methods
LRC XIII, 2nd October 2008
Alfredo Maldonado Guerra, Microsoft European Development Centre
Masaki Itagaki, Microsoft Corporation

2 About this presentation
- Introduction
- Internal Consistency Check
  - Step 1: Mine Source Terms
  - Step 2: Identify translations of Source Terms (Alignment)
  - Step 3: Consistency Check
- Current Challenges
- Tips
- Future Improvements

3 Introduction
- Terminology consistency: a key element of localised language quality
- Terminology consistency: difficult to maintain
  - It is difficult to keep source and target in sync during the dev/loc process
  - Translation is done by several people (often working remotely)
  - Terminology changes (e.g. between product versions)
- Manual Language Quality Assurance (QA) can help, however
  - QA costs time and money
  - QA usually concentrates on a sample of the text
  - The reviewer must be familiar with the reference material
  - It's hard for humans to keep track of terminology

4 Introduction
- Can we use technology to control consistency?
- Yes, but…
  - Existing tools require term lists or term bases
  - Not all software companies have term bases set up
  - Companies that do have term bases won't have every single term captured: building a term base is always a work in progress

5 Introduction
- Our approach doesn't require a term base
- Using Term Mining technology, we identify terms in the source strings
- We then check the translation consistency of the mined terminology

6 Internal Consistency Check
1. Mine Source Terms
2. Align Translations
3. Consistency Check: report any inconsistency

7 Step 1: Source Term Mining
- Bigram and trigram extraction
- Noun phrases of the form
  - Noun + Noun
  - Noun + Noun + Noun
- Verb phrases are excluded: they make up 5% of terms
- Adjective phrases are excluded: they make up 2% of terms
- Monogram nouns are excluded: most are common words, and only 27% of terms are monograms
- In the future we might cover Adj + Noun forms
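As a rough illustration of the Noun + Noun and Noun + Noun + Noun patterns, the sketch below scans tokens that are already part-of-speech tagged. The slides do not say which tagger or tag set the authors use, so the `"NOUN"` tag, the function name `mine_noun_phrases` and the sample input are assumptions made for this example.

```python
from typing import List, Tuple


def mine_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return Noun+Noun bigrams and Noun+Noun+Noun trigrams.

    `tagged` is a list of (word, tag) pairs; the "NOUN" tag name is an
    assumption, not something the slides specify.
    """
    terms = []
    for n in (2, 3):  # bigrams first, then trigrams
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            # Keep the window only if every token in it is a noun.
            if all(tag == "NOUN" for _, tag in window):
                terms.append(" ".join(word.lower() for word, _ in window))
    return terms


# Hypothetical pre-tagged source string.
tagged_string = [
    ("Enable", "VERB"),
    ("user", "NOUN"),
    ("account", "NOUN"),
    ("control", "NOUN"),
]
mined_terms = mine_noun_phrases(tagged_string)
print(mined_terms)
```

A real miner would also need frequency thresholds or similar filters so that incidental noun sequences are not reported as terms; that part is not covered here.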

8 Step 2: Translation Alignment
- Problem statement: given a mined source term S, identify the corresponding target term T in the translation column.
- Example: mined term: input field (S); target term: champ d'entrée (T)

9 Step 2: Translation Alignment
- We need to consider all possible term combinations
- We call each combination an NGram
- NGrams: where N = 2, 3, 4, maybe 5. For languages like German we even consider N = 1
- How do we decide which NGram is the correct translation for the term? Bayesian statistics can help!
- Example target string: Réattribue leurs valeurs initiales à tous les champs d'entrée.
  - Bigrams: Réattribue leurs, leurs valeurs, valeurs initiales, initiales à, à tous, …
  - Trigrams: Réattribue leurs valeurs, leurs valeurs initiales, …
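The NGram enumeration for the example string can be sketched as follows. Splitting on whitespace and stripping the final full stop is a simplification chosen for this example; the slides do not describe the tokeniser actually used.

```python
from typing import Dict, List, Sequence


def generate_ngrams(sentence: str, sizes: Sequence[int] = (2, 3, 4, 5)) -> Dict[int, List[str]]:
    """Return every contiguous word NGram of the requested sizes."""
    # Naive tokenisation: drop the trailing full stop, split on whitespace.
    tokens = sentence.rstrip(".").split()
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in sizes
    }


target = "Réattribue leurs valeurs initiales à tous les champs d'entrée."
candidates = generate_ngrams(target)
for n, grams in candidates.items():
    print(n, grams[:3])
```

Every NGram produced this way is a candidate translation; the statistical step that follows decides which candidate best matches the source term.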

10 Step 2: Translation Alignment
- Problem statement: given a source term S, obtain the NGram T that maximises the conditional probability P(T | S) [1]
- But how do we calculate this?!

11 Step 2: Translation Alignment
- The multiplication rule of conditional probability tells us that P(S ∩ T) = P(T | S) P(S)
- So [1] becomes: P(T | S) = P(S ∩ T) / P(S) [2]
- We can also estimate P(S ∩ T) from two counts:
  - |NGrams| is the number of NGrams of the same N as T. For example, if T is a 2-word term (a bigram), |NGrams| is the number of NGrams made up of 2 words.
  - |STSeg| is the number of segments (strings) that contain both S in the source column and T in the target column.
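The formulas on slides 10 and 11 did not survive in this transcript. The block below is a reconstruction from the surrounding text, not the authors' original notation. Equations [1] and [2] follow directly from the problem statement and the multiplication rule. The count-based estimate on the last line is an assumption inferred from the definitions of |STSeg| and |NGrams|.

```latex
\begin{align}
T^{*} &= \arg\max_{T} P(T \mid S) \tag{1} \\
P(S \cap T) &= P(T \mid S)\, P(S)
  \quad\Longrightarrow\quad
  P(T \mid S) = \frac{P(S \cap T)}{P(S)} \tag{2} \\
P(S \cap T) &\approx \frac{|STSeg|}{|NGrams|}
  \qquad \text{(assumed form of the estimate)}
\end{align}
```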

12 Step 2: Translation Alignment
- In our Best Target Term Selection Routine we compare the probabilities of different target terms T_k
- Since P(S) remains constant during these comparisons, we can eliminate it. We call the resulting expression I(T_k): I(T_k) = P(S ∩ T_k) [3]
- The candidate T_k with the highest I is our Best Target Term Candidate
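A minimal sketch of the selection routine, under stated assumptions: I(T_k) is computed as |STSeg| divided by |NGrams|, where |NGrams| is taken to be the number of distinct NGrams of the same size in the target column. The slides do not give the exact counting rules, and the function names and sample segments below are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Contiguous word NGrams of size n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def best_target_term(
    source_term: str, segments: List[Tuple[str, str]], n: int = 2
) -> Tuple[str, Dict[str, float]]:
    """Score each candidate T_k by I(T_k) and return the best one."""
    st_seg: Counter = Counter()  # |STSeg| for each candidate T_k
    all_ngrams = set()           # distinct NGrams of size n (assumed |NGrams|)
    for source, target in segments:
        candidates = set(ngrams(target.lower().split(), n))
        all_ngrams |= candidates
        # Count the segment only if the source side contains S.
        if source_term.lower() in source.lower():
            st_seg.update(candidates)
    if not st_seg:
        raise ValueError("source term not found in any segment")
    scores = {t: count / len(all_ngrams) for t, count in st_seg.items()}
    return max(scores, key=scores.get), scores


# Invented source/target pairs for illustration only.
segments = [
    ("Clear the input field", "Effacer le champ d'entrée"),
    ("The input field is empty", "Le champ d'entrée est vide"),
    ("Select an input field", "Sélectionner un champ d'entrée"),
    ("Save the file", "Enregistrer le fichier"),
]
best, scores = best_target_term("input field", segments)
print(best)  # champ d'entrée
```

Because |NGrams| is the same for every candidate of a given size, it only matters when candidates of different sizes are compared; within one size the ranking is driven by |STSeg| alone.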

13 Step 2: Translation Alignment: Normalisation
- Depending on context, any particular term can be translated in a slightly different way.
- For example, file name could be translated in Spanish as:
  - nombre de archivo
  - nombre del archivo
  - nombres de archivo
  - nombres de archivos
  - nombres de los archivos
- Our algorithm has to be clever enough to realise that nombres de archivo is just a form of nombre de archivo.

14 Step 2: Translation Alignment: Normalisation
- So, during NGram generation, we need to generate regular expressions for our terms
- Since Asian languages do not inflect, regular expressions are simpler for these languages
- For European languages we use more complex regular expressions

Source term | Target term (Italian) | Regular expression | Matches (admitted translations)
Error code | codice errore | \bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b | codice d'errore, codice di errore, codice errore, codici di errore

Source term | Target term (Japanese) | Regular expression | Matches (admitted translations)
Error code | [Japanese text lost in extraction] | \b … \s? … \b | [Japanese text lost in extraction]
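The Italian pattern from the table can be checked against its admitted translations with Python's `re` module. The pattern is written here without a space before `err`, on the assumption that the space in the transcript is an extraction artefact: with a literal space the listed match codice d'errore would fail. The helper name `is_admitted` is hypothetical.

```python
import re

# Pattern for "codice errore"; IGNORECASE is added so that
# sentence-initial capitals still match (an assumption of this sketch).
ERROR_CODE_IT = re.compile(
    r"\bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b",
    re.IGNORECASE,
)


def is_admitted(text: str) -> bool:
    """True if the text contains an admitted form of the target term."""
    return ERROR_CODE_IT.search(text) is not None


admitted_forms = [
    "codice d'errore",
    "codice di errore",
    "codice errore",
    "codici di errore",
]
for form in admitted_forms:
    print(form, is_admitted(form))
```

The pattern keeps a short stem of each content word (`cod`, `err`), allows a few trailing letters for inflection, and tolerates up to two short intervening words such as di or d'.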

15 Step 3: Consistency Check
- Detect the strings that do not use any of our admitted translations
- Report these strings along with our findings to the user
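One way to picture the check: for every segment whose source contains the mined term, test whether the target matches the admitted-translation pattern, and report the segments that do not. This is a sketch under assumptions; the function name, the report format and the Italian sample strings are invented for this example.

```python
import re
from typing import List, Pattern, Tuple


def find_inconsistencies(
    source_term: str,
    admitted: Pattern[str],
    segments: List[Tuple[str, str]],
) -> List[Tuple[int, str, str]]:
    """Return (index, source, target) for segments that contain the
    source term but none of its admitted translations."""
    source_re = re.compile(r"\b" + re.escape(source_term) + r"\b", re.IGNORECASE)
    return [
        (i, src, tgt)
        for i, (src, tgt) in enumerate(segments)
        if source_re.search(src) and not admitted.search(tgt)
    ]


admitted_it = re.compile(
    r"\bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b", re.IGNORECASE
)
# Invented segments for illustration only.
segments_it = [
    ("Error code 5 was returned", "È stato restituito il codice di errore 5"),
    ("Unknown error code", "Codice errore sconosciuto"),
    ("Look up the error code", "Cercare il numero dell'errore"),
    ("Save the file", "Salvare il file"),
]
report = find_inconsistencies("error code", admitted_it, segments_it)
for index, src, tgt in report:
    print(index, src, "->", tgt)
```

Segments that do not contain the source term at all are ignored, so the report lists only candidate inconsistencies for a reviewer to confirm.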

16 Current Challenges
- False positives
  - Due to heavy rephrasing
- Unreliable for short, generic monograms

Source term | Admitted translations (Italian)
data | d, d3d, da, dac, dai, dal, dall, data, dati, dato, dc, ddc, dei, del, dell, deny, der, deve, dfs, dhcp, di, dir, disk, dll, dma, dns, dopo, dos, dove, dpc, dsis, dtr, due, dvd, dwm

17 Current Challenges
- Verbs can potentially cause problems
  - Due to high inflection:
    - amar => amo, amas, ama, amamos, amáis/aman, aman
    - venir => vengo, vienes, viene, venimos, venís/vienen, vienen
  - Difficult to differentiate from other parts of speech
- Not all languages are supported:
  - Arabic
  - Complex Script languages

Source term | Admitted translations | Target language
download | descarga, descargar, descargó, descargue | Spanish
install | install, installa, installare, installata, installati, installato, installer | Italian

18 Current Challenges
- Best Candidate Selection logic is very good, but it's not perfect.
- About 70% of term selections are correct.
[Figure: examples of correct and incorrect selections, with the correct term highlighted]

19 Tips
- Make sure your data is clean to a certain degree.
  - Remove any HTML/XML tags from your strings
  - Filter out any unlocalised strings and non-localisable strings.
  - For Asian languages, run a word breaker tool on your target strings (this is required for proper NGram handling)
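The first two cleaning tips could be approximated as below. This is a crude sketch: the tag-stripping regular expression is not a full HTML/XML parser, and treating a target that is identical to its source as unlocalised or non-localisable is a heuristic assumed for this example, since the slides do not say how such strings are detected. Word breaking for Asian languages is not covered.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, not a real parser


def strip_tags(text: str) -> str:
    """Remove tags and collapse the whitespace they leave behind."""
    return " ".join(TAG_RE.sub(" ", text).split())


def clean_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip markup and drop pairs that look unlocalised."""
    cleaned = []
    for source, target in pairs:
        source, target = strip_tags(source), strip_tags(target)
        if not source or not target:
            continue  # nothing left after removing markup
        if source == target:
            continue  # heuristic: identical text is treated as unlocalised
        cleaned.append((source, target))
    return cleaned


# Invented sample data for illustration only.
raw_pairs = [
    ("<b>Save</b> the file", "<b>Enregistrer</b> le fichier"),
    ("OK", "OK"),
    ("%s", "%s"),
]
clean = clean_pairs(raw_pairs)
print(clean)
```

The identical-text heuristic will also drop legitimately unchanged strings such as product names, which is acceptable here because those strings carry no translation evidence anyway.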

20 Tips
- If you already have source term lists you're interested in, you can use them to bypass the term mining process
- If your source terms are well selected, you'll achieve very good results. A well-selected source term has a precise technical meaning.

Source term | Good/Bad | Reason
failure | bad | Too generic
data | bad | Too generic, forms part of many other terms: data type, data structure, etc.
worker process | good | Has a precise meaning
user account control | good | Has a precise meaning

21 Tips
- The more data you have, the more accurate your results will be
- Try combining software data with help / user education data to increase term repetitions

22 Future Improvements
- More work with Adj + Noun
- Work with verbs
- Add support for Complex Script languages and languages that inflect on different parts of the word
- Further refine Best Translation Candidate Selection logic

23 Questions?

24 Thank You!
