
EBMT1 Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University Dave Inman.

1 EBMT1 Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University Dave Inman

2 EBMT2 Outline EBMT in outline. What data do we need? How do we create a lexicon? Indexing the corpus. Finding chunks to translate. Matching a chunk against the target. Quality of translation. Speed of translation. Good and bad points. Conclusions.

3 EBMT3 EBMT in outline - Corpus
Corpus:
   S1: The cat eats a fish. / Le chat mange un poisson.
   S2: A dog eats a cat. / Un chien mange un chat.
   ...
   S99,999,999: ...
Index:
   the: S1
   cat: S1
   eats: S1
   ...
   dog: S2

4 EBMT4 EBMT in outline – find chunks
A source language sentence is input: The cat eats a dog.
Chunks of this sentence are matched against the corpus:
   The cat: S1
   The cat eats: S1
   The cat eats a: S1
   a dog: S2

5 EBMT5 How does EBMT work in outline - Corpus
1. The target language sentences are retrieved for each chunk.
   The cat eats: S1
   Corpus S1: The cat eats a fish. / Le chat mange un poisson.
2. The chunks are aligned with the target sentences (hard!).
   The cat eats -> Le chat mange

6 EBMT6 How does EBMT work in outline - Corpus
Chunks are scored to find good matches:
   The cat eats -> Le chat mange   Score 78%
   The cat eats -> Le chat dorme   Score 43%
   a dog -> un chien   Score 67%
   a dog -> le chien   Score 56%
   a dog -> un arbre   Score 22%
The best translated chunks are put together to make the final translation:
   The cat eats -> Le chat mange
   a dog -> un chien
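The final selection step on this slide can be sketched as a small Python function: given hypothetical (chunk, target, score) triples like those above, keep the highest-scoring target for each chunk. This is an illustration of the idea, not the Pangloss code; the scores are the invented percentages from the slide.

```python
def best_translations(candidates):
    """Pick the highest-scoring target phrase for each source chunk."""
    best = {}
    for chunk, target, s in candidates:
        if chunk not in best or s > best[chunk][1]:
            best[chunk] = (target, s)
    return {chunk: target for chunk, (target, _s) in best.items()}

# Scored candidates, as on the slide
scored = [
    ("The cat eats", "Le chat mange", 0.78),
    ("The cat eats", "Le chat dorme", 0.43),
    ("a dog", "un chien", 0.67),
    ("a dog", "le chien", 0.56),
    ("a dog", "un arbre", 0.22),
]
print(best_translations(scored))
# {'The cat eats': 'Le chat mange', 'a dog': 'un chien'}
```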

7 EBMT7 What data do we need?
1. A large corpus of parallel sentences, if possible in the same domain as the translations.
2. A bilingual dictionary, though we can induce this from the corpus.
3. A target language root/synonym list, so we can see the similarity between words and their inflected forms (e.g. verbs).
4. Classes of words easily translated, such as numbers, towns, weekdays.

8 EBMT8 How to create a lexicon.
1. Take each sentence pair in the corpus.
2. For each word in the source sentence, record every word in the target sentence and increment each pair's frequency count.
3. Repeat for as many sentences as possible.
4. Use a threshold on the counts to get the plausible alternative translations.

9 EBMT9 How to create a lexicon... example
The cat eats a fish. / Le chat mange un poisson.
   the:  le,1  chat,1  mange,1  un,1  poisson,1
   cat:  le,1  chat,1  mange,1  un,1  poisson,1
   eats: le,1  chat,1  mange,1  un,1  poisson,1
   a:    le,1  chat,1  mange,1  un,1  poisson,1
   fish: le,1  chat,1  mange,1  un,1  poisson,1

10 EBMT10 Create a lexicon... after many sentences
   the: le,956  la,925  un,235
   ------ Threshold ----------
        chat,47  mange,33  poisson,28 ... arbre,18

11 EBMT11 Create a lexicon... after many sentences
   cat: chat,963
   ------ Threshold ----------
        le,604  la,485  un,305  mange,33  poisson,28 ... arbre,47
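The counting-and-threshold procedure from slides 8-11 can be sketched in Python. This is a toy illustration on a three-sentence corpus, not the Pangloss implementation; `build_lexicon` and `translations` are made-up names, and real systems normalise tokens far more carefully.

```python
from collections import Counter, defaultdict

def build_lexicon(pairs):
    """Count source-word / target-word co-occurrences over a parallel corpus."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        src_words = src.lower().rstrip('.').split()
        tgt_words = tgt.lower().rstrip('.').split()
        for s in src_words:
            for t in tgt_words:
                counts[s][t] += 1   # step 2: increment the pair's frequency
    return counts

def translations(counts, word, threshold):
    """Step 4: keep only target words whose count reaches the threshold."""
    return [t for t, n in counts[word].most_common() if n >= threshold]

corpus = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
    ("The cat eats a bird.", "Le chat mange un oiseau."),
]
lex = build_lexicon(corpus)
print(lex["cat"].most_common(3))          # function words still dominate here
print(translations(lex, "cat", threshold=3))
```

With only three sentences, function words such as "un" still outrank "chat", which is exactly why the slides rely on a large corpus plus a threshold to separate the real translation from the noise.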

12 EBMT12 Indexing the corpus. For speed the corpus is indexed on the source language sentences. Each word in each source language sentence is stored with info about the target sentence. Words can be added to the corpus and the index easily updated. Tokens are used for common classes of words (e.g. numbers). This makes matching more effective.
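A minimal sketch of such an inverted index, including incremental updates and a class token for numbers. The function names and the `<NUM>` token are illustrative assumptions, not Pangloss's actual storage format.

```python
def tokenize(text):
    """Lower-case, strip the final period, and replace numbers with a
    class token so that e.g. "3 cats" and "5 cats" index identically."""
    return ["<NUM>" if w.isdigit() else w
            for w in text.lower().rstrip('.').split()]

def build_index(corpus):
    """Map each source-language word to the IDs of sentences containing it."""
    index = {}
    for sid, (src, _tgt) in enumerate(corpus):
        for word in tokenize(src):
            index.setdefault(word, set()).add(sid)
    return index

def add_pair(corpus, index, src, tgt):
    """Adding a sentence pair only touches the entries for its own words,
    so the index is easily updated."""
    corpus.append((src, tgt))
    sid = len(corpus) - 1
    for word in tokenize(src):
        index.setdefault(word, set()).add(sid)

corpus = [("The cat eats a fish.", "Le chat mange un poisson.")]
index = build_index(corpus)
add_pair(corpus, index, "The cat caught 3 mice.", "Le chat a attrapé 3 souris.")
print(index["cat"])   # {0, 1}
```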

13 EBMT13 Finding chunks to translate. Look up each word of the source sentence in the index. Look for chunks of the source sentence (at least 2 adjacent words) which match the corpus. Prefer the most recent matches against the corpus (translation memory); Pangloss uses the last 5 matches for any chunk.
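Chunk finding can be sketched as: enumerate n-grams of at least two adjacent words, use the index to shortlist candidate sentences, then verify that the chunk really occurs contiguously. A toy sketch under those assumptions; Pangloss's actual matcher is more sophisticated.

```python
def find_chunks(sentence, corpus, index, min_len=2):
    """Return (chunk, sentence_id) pairs for every n-gram of at least
    min_len adjacent words found contiguously in an indexed sentence."""
    words = sentence.lower().rstrip('.').split()
    matches = []
    for i in range(len(words)):
        for j in range(i + min_len, len(words) + 1):
            chunk = words[i:j]
            # shortlist: sentences containing every word of the chunk
            postings = [index.get(w, set()) for w in chunk]
            for sid in set.intersection(*postings):
                src = corpus[sid][0].lower().rstrip('.').split()
                # verify the chunk occurs contiguously, not just word by word
                if any(src[k:k + len(chunk)] == chunk
                       for k in range(len(src) - len(chunk) + 1)):
                    matches.append((' '.join(chunk), sid))
    return matches

corpus = [("The cat eats a fish.", "Le chat mange un poisson."),
          ("A dog eats a cat.", "Un chien mange un chat.")]
index = {}
for sid, (src, _) in enumerate(corpus):
    for w in src.lower().rstrip('.').split():
        index.setdefault(w, set()).add(sid)

chunks = find_chunks("The cat eats a dog.", corpus, index)
print(chunks)   # includes ('the cat eats a', 0) and ('a dog', 1)
```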

14 EBMT14 Matching a chunk against the target. For each source chunk found previously, retrieve the target sentences from the corpus (using the index). Try to find the translation for the source chunk from these sentences. This is the hard bit! Look for the minimum and maximum segments in the target sentences which could correspond with the source chunk. Score each of these segments.

15 EBMT15 Scoring a segment...
Unmatched words: higher priority is given to sentences containing all the words in the input chunk.
Noise: higher priority is given to corpus sentences which have fewer extra words.
Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
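These four heuristics can be combined into a toy scoring function. The weights and the crude 4-letter "stem" below are purely illustrative assumptions, not Pangloss's actual formula.

```python
def score(chunk, sentence, stem=lambda w: w[:4]):
    """Toy combination of the four heuristics; weights are invented."""
    exact = [w for w in chunk if w in sentence]
    stems = {stem(s) for s in sentence}
    variant = [w for w in chunk if w not in sentence and stem(w) in stems]
    coverage = (len(exact) + len(variant)) / len(chunk)   # unmatched words
    noise = len(chunk) / len(sentence)                    # fewer extras is better
    positions = [sentence.index(w) for w in exact]
    order = 1.0 if positions == sorted(positions) else 0.0
    morphology = len(exact) / len(chunk)                  # exact beats variant
    return 0.4 * coverage + 0.2 * noise + 0.2 * order + 0.2 * morphology

chunk = "the cat eats".split()
good = "the cat eats a fish".split()
worse = "the fish eats a cat happily".split()   # same words, wrong order, noisier
print(score(chunk, good) > score(chunk, worse))   # True
```

The ranking, not the absolute number, is what matters: the sentence preserving the chunk's word order with less noise wins.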

16 EBMT16 Whole sentence match... If we are lucky the whole sentence will be found in the corpus! In that case the target sentence is used directly, with no alignment needed. This is especially useful when translation memory is available (sentences recently translated are added to the corpus).

17 EBMT17 Quality of translation. Pangloss was tested on source sentences from a different domain than the examples in the corpus. Pangloss "covered" about 70% of the input sentences. This means a match was found against the corpus... but not necessarily a good match. Others report that around 60% of the translations can be understood by a native speaker; Systran manages about 70%.

18 EBMT18 Speed of translation. Translations are much faster than with Systran: simple sentences are translated in seconds. The corpus can be extended (translation memory) at about 6 MB per minute (Sun SPARCstation). A 270 MB corpus takes 45 minutes to index.

19 EBMT19 Good points.
Fast.
Easy to add a new language pair.
No need to analyse the languages (much).
Can induce a dictionary from the corpus.
Allows easy implementation of translation memory.
Graceful degradation as the size of the corpus is reduced.

20 EBMT20 Bad points.
Quality is second best at present.
Depends on a large corpus of parallel, well translated sentences.
30% of the source has no coverage (translation).
Matching of words is brittle: a human can see matches that Pangloss cannot.
The domain of the corpus should match the domain to be translated, so that chunks match.

21 EBMT21 Conclusions.
An alternative to Systran: faster, but lower quality.
Quick to develop for a new language pair, if a corpus exists!
Needs no linguistics.
Might improve as bigger corpora become available?

