
1 Parallel Data & Sentence Alignment Declan Groves, DCU

2 Parallel Corpus We saw in week 2 why data-driven Machine Translation (MT), based on using real-world translation examples/data, is now the most prevalent approach. There are 3 approaches to data-driven MT: Example-Based (EBMT); Statistical (SMT); Hybrid models (a mix of different approaches, possibly including non-data-driven approaches such as rule-based) which use some probabilistic processing. All need a parallel corpus (or bitext) of aligned sentences. We can create this resource manually; otherwise, if we have unaligned bilingual texts, we can automate the alignment.

3 Automatic Alignment (1/2) Most alignments are one-to-one: E1: Often, in the textile industry, businesses close their plant in Montreal to move to the Eastern townships. F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va s'installer dans les Cantons de l'Est. E2: There is no legislation to prevent them from doing so, for it is a matter of internal economy. F2: Il n'y a aucune loi pour empêcher cela, c'est de la régie interne. E3: That is serious. F3: C'est grave.

5 Automatic Alignment (2/2) But not always: some (like this) have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences. E1: Honourable members opposite scoff at the freeze suggested by this party; to them it is laughable. F1: Les députés d'en face se moquent du gel qu'a proposé notre parti. F2: Pour eux, c'est une mesure risible. (Here E1 aligns with both F1 and F2: a 1-2 alignment.)

6 Good Language Models Come from Good Corpora (1/2) Any statistical approach to MT requires the availability of aligned bilingual corpora which are: large; good-quality; representative. Class Question. Assume the following (tiny) corpus: Mary and John have two children. The children that Mary and John have are aged 3 and 4. John has blue eyes. Q1: what's P(have) vs. P(has) in a general corpus? Which is more likely? Q2: what's P(have | John) vs. P(has | John) in a general corpus? Q3: what's P(have) vs. P(has) in this corpus? What's their relative probability? Q4: what's P(have | John) vs. P(has | John) in this corpus? (* Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Jurafsky & Martin)
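As a quick check of Q3 and Q4, here is a minimal counting sketch (my own, not from the slides); the corpus is the one above, tokenised naively on whitespace.

from collections import Counter

corpus = ("Mary and John have two children . "
          "The children that Mary and John have are aged 3 and 4 . "
          "John has blue eyes .").lower().split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

print("P(have) =", unigrams["have"] / total)   # 2/25
print("P(has)  =", unigrams["has"] / total)    # 1/25
# Conditional probability: P(w2 | w1) = count(w1 w2) / count(w1)
print("P(have | John) =", bigrams[("john", "have")] / unigrams["john"])  # 2/3
print("P(has | John)  =", bigrams[("john", "has")] / unigrams["john"])   # 1/3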

9 Good Language Models Come from Good Corpora (2/2) Assume a different small corpus: Am I right, or am I wrong? Peter and I are seldom wrong. I am sometimes right. Sam and I are often mistaken. Q5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole? Are there any other generalisations that could be inferred? Q6: Try to think of some trigrams (and 4-grams, if you can) that cannot be 'discovered' by a bigram model. What you're looking for here is a phrase where the third (or subsequent) word depends on the first word, which in a bigram model is 'too far away'... Note that all the sentences in these corpora are well-formed. If, on the other hand, the corpus contains ill-formed input, then that too will skew our probability models... A bigram sketch of Q5 follows.
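The sketch below (again my own) shows the kind of false generalisation Q5 is after: trained on this corpus, a bigram model makes "I are" twice as likely as "I am", because it cannot see that "are" depends on the "and" two words back, which is exactly the trigram effect Q6 asks about.

from collections import Counter

corpus = ("am i right , or am i wrong ? "
          "peter and i are seldom wrong . "
          "i am sometimes right . "
          "sam and i are often mistaken .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# The model licenses "I are" more strongly than "I am":
print("P(are | i) =", bigrams[("i", "are")] / unigrams["i"])  # 2/5
print("P(am | i)  =", bigrams[("i", "am")] / unigrams["i"])   # 1/5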

10 Bilingual Corpora (1/2) The previous examples were all monolingual, but the same applies w.r.t. bilingual corpora. One issue that we can think about now is what corpora we're going to extract our probabilistic language (and translation) models from. What sorts of large, good-quality, representative bilingual corpora exist? Canadian Hansards; proceedings of the Hong Kong parliament; Dáil proceedings… i.e. while the statistical techniques can be applied to any pair of languages, this approach is currently limited to only a few language pairs.

12 Bilingual Corpora (2/2) W.r.t. 'representativeness', consider the following from a 9000-word experiment (Brown et al., 1990) on the Canadian Hansards... Q1: For what English word are these possible candidate translations? Q2: What is 'bravo' doing at the top of the list? Beware of sparse data and unrepresentative corpora!! Ditto poor-quality language... though I'll come back to this one! If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.

French / Probability
bravo / .808
entendre / .079
entendu / .026
entends / .024
entendons / .013

13 Is the World Wide Web a good corpus to use? Let's imagine you want to find the correct translation of the French compound (a word that consists of more than one token/stem) "groupe de travail". For each of the two French nouns here, there are several translations: groupe: cluster, group, grouping, concern, collective; travail: work, labor, labour. If "groupe de travail" is not in our dictionary (cf. the proliferation of compounds), but the two component nouns are, then any compositional translation is multiply ambiguous. Now here's the trick: let's search for all 15 possible pairs on the WWW. We might find:
labour cluster: 2
labour concern:
work group: 66593

14 Is the World Wide Web a good corpus to use? There are at least two ways in which we could attempt to resolve the potential ambiguity: simply take the most frequently occurring term as the translation; or see which candidate translations surpass some threshold (in terms of relative frequency, say), cf. share price, stock price: assuming these two massively outrank all other candidates, we can keep both as synonymous translations. So, we can use the WWW to create a very large bilingual lexicon... A sketch of both strategies follows.
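A sketch of both strategies, under the slide's assumptions. The two non-zero counts come from the slide; every other count, and the web_hits stub itself, is a placeholder for a real search-engine hit count.

groupe = ["cluster", "group", "grouping", "concern", "collective"]
travail = ["work", "labor", "labour"]

def web_hits(phrase):
    # Hypothetical stub: a real system would query a search engine here.
    counts = {"work group": 66593, "labour cluster": 2}
    return counts.get(phrase, 0)

# "groupe de travail": English order puts the travail-word first.
candidates = {f"{t} {g}": web_hits(f"{t} {g}") for g in groupe for t in travail}

# Strategy 1: take the single most frequent pairing.
best = max(candidates, key=candidates.get)
print(best, candidates[best])           # work group 66593

# Strategy 2: keep every candidate above a relative-frequency threshold,
# so near-synonyms (cf. share price / stock price) can both survive.
total = sum(candidates.values()) or 1
print([c for c, n in candidates.items() if n / total > 0.1])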

15 Can we use the WWW to extract parallel corpora? There are lots of pages where you can click on a link to get a version of that page in a different language. How do we find these? Query a search engine for the string "English" 'not too far away' from the string "Spanish". This might get you things like: "Click here for English version" vs. "Click here for Spanish version". But also: "English literature" vs. "Spanish literature". What then? We need a process to evaluate these candidates. Using something like 'diff', you could align each case and manually remove poor candidates:

16 Can we use the WWW to extract parallel corpora? Comparison using 'diff'. Or automate the comparison fully using string-compare routines measured against some % threshold (see the sketch below)....

sunsite{away}691: diff a b
1c1
< Mary and John have two children.
---
> Maria und Johannes haben zwei Kinder.
3c3
< The children that Mary and John have are aged 3 and 4.
---
> Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.
5c5
< John has blue eyes.
---
> Johannes hat blaue Augen.
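A rough sketch of such an automated comparison, using the slide's own example texts. The length-ratio test stands in for whatever string-compare routine and % threshold a real system would use; it is an illustration, not a tested recipe.

def align_candidates(src_lines, tgt_lines, threshold=0.6):
    """Pair files line by line, as 'diff' does for 1c1, 3c3, 5c5 above,
    and accept a pair only if the two lines have broadly similar lengths."""
    pairs = []
    for s, t in zip(src_lines, tgt_lines):
        ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
        if ratio >= threshold:   # the crude "% threshold" filter
            pairs.append((s, t))
    return pairs

english = ["Mary and John have two children.",
           "The children that Mary and John have are aged 3 and 4.",
           "John has blue eyes."]
german = ["Maria und Johannes haben zwei Kinder.",
          "Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt.",
          "Johannes hat blaue Augen."]

for s, t in align_candidates(english, german):
    print(s, "<->", t)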

17 Sentence Alignment Manual construction of aligned corpora (which are essential for probabilistic techniques) also avoids the considerable problem of trying to align source and target texts. But what if we already have bilingual corpora which are not aligned?! Automation!

18 Automatic Alignment We’ve just seen one (novel) way of automating this process (using simple string comparison techniques). What are the problems? Most alignments are one-to-one... but some are not, as we saw previously: Some have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences. Let’s look at (some of) the major algorithms for aligning pairs of sentences...

19 Gale & Church's Algorithm (1/4) Gale & Church (1993): 'A program for aligning sentences in bilingual corpora', in Computational Linguistics 19(1):75–102. All sentence alignment algorithms are premised on the notion that good candidates are not dissimilar with regard to length (i.e. shorter sentences tend to have shorter translations than longer sentences). G&C use this length-based notion, together with the probability that a typical translation is one of various many-to-many sentence relations (0-1, 1-0, 1-1, 2-1, 1-2, 2-2, etc.). Distance measure: P(match | δ), where match = 1-1, 0-1, etc., and δ = difference in length. How to calculate text length? Word tokens or characters; G&C use characters. No language maps character by character to another language, so first calculate the ratio of chars in L1 to chars in L2. E.g. English text = 100,000 chars, Spanish text = 110,000 chars, then scaling ratio = 1.1.

20 Gale & Church's Algorithm (2/4) Q: What's the difference in length between an English text of 50 chars and a Spanish text of 56 chars given this scaling ratio (1.1)? Typically, the longer the text, the bigger the difference, so we need to normalize the difference in length to take longer texts into account:

δ = (l_t − c·l_s) / √(l_m · s²)

where l_t and l_s are the lengths of the target and source texts respectively, l_m is the average of the two lengths, c is the scaling factor, and s² is the variance for all values of δ in a typical aligned corpus, which G&C suggest is 6.8.
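In code, the normalised difference looks like this (a minimal sketch based on the formula as reconstructed above):

import math

def delta(l_s, l_t, c=1.1, s2=6.8):
    l_m = (l_s + l_t) / 2            # average of the two lengths
    return (l_t - c * l_s) / math.sqrt(l_m * s2)

# The question above: English 50 chars vs Spanish 56 chars, c = 1.1.
print(delta(50, 56))   # roughly 0.05, i.e. a very plausible match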

21 Gale & Church's Algorithm (3/4) Terms like P(match | δ) are called conditional probabilities. Q: P(throwing at least 7 with 2 dice)? Q: P(throwing at least 7 | the first die thrown was a 6)? Bayes' Theorem is used to relate these probabilities:

P(match | δ) = P(δ | match) · P(match) / P(δ)

P(δ) is a normalizing constant, so it can be ignored. P(match) is estimated from correctly aligned corpora; G&C give one-to-one alignments a probability of 0.89, with the rest of the probability space assigned to other possible alignments. P(δ | match) has to be estimated (cf. Trujillo, 1999:71).
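The two dice questions can be checked by brute-force enumeration (my own aside, not on the slide):

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p_at_least_7 = sum(a + b >= 7 for a, b in outcomes) / len(outcomes)

# Conditioning on "first die was a 6" restricts the outcome space:
given_first_6 = [(a, b) for a, b in outcomes if a == 6]
p_given_6 = sum(a + b >= 7 for a, b in given_first_6) / len(given_first_6)

print(p_at_least_7)   # 21/36 = 0.583...
print(p_given_6)      # 6/6 = 1.0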

22 Gale & Church's Algorithm (4/4) Comment: Aligning complete texts is computationally expensive. Luckily, we can use paragraphs, sections, headings, chapters etc. to identify chunks of text and align these smaller elements. That is, we're looking for anchors which can identify text chunks in both languages. Other anchors can be tags, proper names, acronyms, cognates, figures, dates… (Q: can you think of others?) Results? G&C report error rates of 4.2% on 1316 alignments. Most errors occur in non-1-1 alignments. Selection of the best 80% of alignments reduces the error rate to 0.7%. Interestingly, for European languages at least, the algorithm is not sensitive to different values for c and s², but 'noisy' texts or texts from very different languages can degrade performance. A compressed sketch of the underlying dynamic programming follows.
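To make the method concrete, here is a compressed sketch of the length-based dynamic programming. The delta formula and the 0.89 prior for 1-1 matches come from the slides; the remaining priors and the cost-function details are illustrative stand-ins, not G&C's exact figures.

import math

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def match_cost(l_s, l_t, prior, c=1.1, s2=6.8):
    # -log P(match) - log P(delta | match), with P(delta | match)
    # approximated as 2 * (1 - CDF(|delta|)) under a normal model.
    l_m = (l_s + l_t) / 2 or 1
    d = abs(l_t - c * l_s) / math.sqrt(l_m * s2)
    p_delta = max(2 * (1 - norm_cdf(d)), 1e-12)
    return -math.log(prior) - math.log(p_delta)

# (source sentences, target sentences) consumed by each bead -> prior.
# Only the 0.89 for 1-1 is from the slides; the rest are stand-ins.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def align(src, tgt):
    """src, tgt: lists of sentences. Returns the minimal-cost bead sequence."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                l_s = sum(len(s) for s in src[i:ni])
                l_t = sum(len(t) for t in tgt[j:nj])
                new = cost[i][j] + match_cost(l_s, l_t, prior)
                if new < cost[ni][nj]:
                    cost[ni][nj], back[ni][nj] = new, (i, j)
    beads, i, j = [], n, m      # trace the best path back to (0, 0)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))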

23 Recap Data-driven (e.g. statistical) MT relies on the availability of parallel data that is: of sufficient quantity; of sufficient (i.e. high) quality; representative of the language we're trying to model. Poor-quality data = poor-quality models: too little data means data sparseness and poor coverage; poor quality means bad translations; unrepresentative data means the model will make wrong assumptions and produce incorrect translations. Data extracted from the web can be used to create some parallel texts: best for dictionary extraction; finding corresponding documents can be difficult. Parallel data needs to be aligned: document-level alignment; sentence-level alignment (manual vs automatic, using relative sentence-length ratio); Gale & Church (1993) character-based sentence alignment.

24 Brown et al.'s Algorithm (1/3) Brown et al. (1991): 'Aligning sentences in parallel corpora', in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, University of California, Berkeley, pp. 169–176. Brown et al. measure the length of a text in terms of the number of word tokens. Their method assumes alignments of text fragments are produced by a Hidden Markov Model (HMM), and the correct alignment is the one that maximizes the output probability of the HMM. The HMM of Brown et al. models alignment by determining probabilities for eight types of alignments, or beads: s, t, st, sst, stt, ¶s, ¶t, ¶s¶t. I.e. sst indicates that two source strings align with one target string (2-1), and ¶s signifies that a source paragraph delimiter matches nothing in the target language (i.e. it is deleted) (1-0).

25 Brown et al.'s Algorithm (2/3) [Diagram: an English text (s) with sentence lengths 17s, 25s, 12s, … aligned against a French text (t) with sentence lengths 19t, 20t, 8t, … via beads such as st and stt.]

26 Brown et al.’s Algorithm (3/3)

27 Kay & Röscheisen's Algorithm (1/2) Kay & Röscheisen (1993): 'Text-translation alignment', in Computational Linguistics 19(1). Text-based alignment assumes a set of anchors, a parallel text, and access to an English-French dictionary. Two sentences will be aligned iff word-translation pairs from the bilingual dictionary are found with sufficient frequency between the two sentences (relative to size and to other sentences in the corpus). Compare with length-based algorithms. Disadvantages: requires a bilingual dictionary (is one always available?). Uses information on relative sentence position for the English-French Canadian Hansards; previously we saw relative sentence length (length-based algorithms). A toy scoring sketch follows.
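A toy sketch of the dictionary-driven idea (mine, not K&R's actual procedure): score a candidate sentence pair by the dictionary word pairs it contains, relative to sentence size, and align pairs whose score clears a threshold. DICT is an assumed toy dictionary.

DICT = {"blue": "bleus", "eyes": "yeux", "children": "enfants"}

def dict_score(en_sent, fr_sent):
    en = set(en_sent.lower().split())
    fr = set(fr_sent.lower().split())
    hits = sum(1 for e, f in DICT.items() if e in en and f in fr)
    return hits / max(len(en), 1)   # relative to sentence size

# Align the pair iff the score clears a threshold:
print(dict_score("John has blue eyes", "Jean a les yeux bleus"))  # 0.5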

28 Kay & Röscheisen’s Algorithm (2/2)

30 Exact Matches In Translation Memory (TM) and EBMT systems we would like first to look for exact matches for input (source-language) sentences. Some non-exact matches:
Different spelling: Change the colour of the font. / Change the color of the font.
Different punctuation: Open the file and select the text. / Open the file, and select the text.
Different inflection: Delete the document. / Delete the documents.
Different numbers: Use version 1.1. / Use version 1.2.
Different formatting: Click on OK. / Click on OK. (the formatting difference, e.g. bolding of 'OK', is invisible in plain text)
A normalisation sketch follows.
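A sketch of why 'exact match' in practice usually means 'exact after normalisation': lower-casing, digit masking, and punctuation stripping collapse several of the pairs above (spelling variants like colour/color would need extra handling). This is my own illustration, not any particular TM product's behaviour.

import re

def normalise(segment):
    s = segment.lower()
    s = re.sub(r"\d+(\.\d+)?", "<NUM>", s)   # Use version 1.1 / 1.2
    s = re.sub(r"[^\w<>\s]", "", s)          # punctuation differences
    return " ".join(s.split())

print(normalise("Use version 1.1.") == normalise("Use version 1.2."))   # True
print(normalise("Open the file and select the text.") ==
      normalise("Open the file, and select the text."))                 # True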

31 Fuzzy Matches Exact matches often do not occur; TM systems then use "fuzzy" matching: A Fuzzy Match. New input segment: The specified operation failed because it requires the Character to be active. Stored TM Unit: EN: The specified language for the Character is not supported on the computer. FR: La langue spécifiée pour le Compagnon n'est pas prise en charge par cet ordinateur.

35 Fuzzy Matches Exact matches often do not occur; TM systems then use "fuzzy" matching: Multiple Fuzzy Matches in Ranked Order. New input segment: The operation was interrupted because the Character was hidden.
Best Match: EN: The operation was interrupted because the Listening key was pressed. FR: L'opération a été interrompue car la touche d'écoute a été enfoncée.
2nd Best Match: EN: The specified method failed because the Character is hidden. FR: La méthode spécifiée a échoué car le Compagnon est masqué.
3rd Best Match: EN: The operation was interrupted by the application. FR: L'opération a été interrompue par l'application.
A ranking sketch follows.
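A minimal sketch of fuzzy lookup, ranking the stored EN segments above against the new input with a generic string-similarity ratio; the scoring method is just one plausible choice, not what any particular TM system uses.

from difflib import SequenceMatcher

tm = [
    "The operation was interrupted because the Listening key was pressed.",
    "The specified method failed because the Character is hidden.",
    "The operation was interrupted by the application.",
]
query = "The operation was interrupted because the Character was hidden."

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

for seg in sorted(tm, key=lambda s: similarity(query, s), reverse=True):
    print(round(similarity(query, seg), 2), seg)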

36 Subsentential Alignment (1/4)

37 Subsentential Alignment (2/4)

38 Subsentential Alignment (3/4) A Simple Aligned Text:
EN: Start the operating system / ES: Comenzar el sistema operativo
EN: Launch the program via the keyboard / ES: Empezar el programa mediante el teclado

39 Subsentential Alignment (4/4) A Simple Aligned Text:
EN: Start the operating system / ES: Comenzar el sistema operativo
EN: Launch the program via the keyboard / ES: Empezar el programa mediante el teclado
Hypothesised word alignments: comenzar → start; *sistema → start; *empezar → system; teclado → keyboard. Two are right, two are wrong… but their correct translations lie close by. Assuming enough text, we assume the pairings (sistema, system) and (empezar, launch) can be confirmed as correct alignments. This leaves fewer candidates to be aligned. For European languages at least, we can assume that words which are close together in the source translate as words which are close together in the target. Similarly, words at the start/end of a source sentence map to words at the start/end of the target sentence. Other subsentential alignment techniques: the distance between occurrences of a source word mirrors the distance between occurrences of a target word; confirm aligned word pairs using a third corpus as reference. A counting sketch follows.
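A tiny sketch of the counting idea (my own): tally how often each source word co-occurs with each target word across aligned sentence pairs; with enough text, the true pairings come to dominate the counts.

from collections import Counter

pairs = [("start the operating system", "comenzar el sistema operativo"),
         ("launch the program via the keyboard",
          "empezar el programa mediante el teclado")]

cooc = Counter()
for en, es in pairs:
    for e in en.split():
        for s in es.split():
            cooc[(s, e)] += 1   # every (target, source) word pair seen together

# The most frequent co-occurrences become word-alignment candidates.
print(cooc.most_common(5))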

