Statistical Machine Translation: SMT – Basic Ideas. Stephan Vogel, MT Class, Spring Semester 2011.


1 Statistical Machine Translation
SMT – Basic Ideas
Stephan Vogel
MT Class, Spring Semester 2011

2 Overview
- Deciphering foreign text – an example
- Principles of SMT
- Data processing

3 Deciphering Example
- Apinaye – English
  - Apinaye belongs to the Ge family of Brazil
  - Spoken by 800 (according to SIL, 1994)
  - http://www.ethnologue.com/show_family.asp?subid=
- Example from Linguistic Olympics 2008
- Parallel Corpus (some characters adapted)

  Apinaye                English
  Kukre kokoi            The monkey eats
  Ape kra                The child works
  Ape kokoi rats         The big monkey works
  Ape mi mets            The good man works
  Ape mets kra           The child works well
  Ape punui mi pinjets   The old man works badly

- Can we translate a new sentence?

4 Deciphering Example
- Parallel Corpus (some characters adapted)

  No.  Apinaye                English
  1    Kukre kokoi            The monkey eats
  2    Ape kra                The child works
  3    Ape kokoi rats         The big monkey works
  4    Ape mi mets            The good man works
  5    Ape mets kra           The child works well
  6    Ape punui mi pinjets   The old man works badly

- Can we build a lexicon from these sentence pairs?
- Observations:
  - Apinaye: Kukre (1), Ape (5); English: The (6), works (5)
    Aha! -> first guess: Ape – works
  - monkey in 1, 3; child in 2, 5; man in 4, 6
    Different distribution over the corpus: do we find words with a similar distribution on the Apinaye side?

5 … Vocabularies
- Vocabularies of the corpus (in order of first occurrence; rows are not translation pairs)

  Apinaye   English
  kukre     The
  kokoi     monkey
  ape       eats
  kra       child
  rats      works
  mi        big
  mets      good
  punui     man
  pinjets   well
            old
            badly

- Observations:
  - 9 Apinaye words, 11 English words
- Expectations:
  - English words without translation?
  - Apinaye words corresponding to more than 1 English word?

6 … Word Frequencies
- Vocabularies of the corpus, with frequencies

  Apinaye   Freq   English   Freq
  kukre     1      The       6
  kokoi     2      monkey    2
  ape       5      eats      1
  kra       2      child     2
  rats      1      works     5
  mi        2      big       1
  mets      2      good      1
  punui     1      man       2
  pinjets   1      well      1
                   old       1
                   badly     1

- Suggestions:
  - ‘ape’ (5) could align to ‘The’ (6) or ‘works’ (5)
  - More likely that the content word ‘works’ has a match, i.e. ‘ape’ = ‘works’
  - Other word pairs are difficult to predict: too many similar frequencies
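The frequency comparison on this slide can be sketched in a few lines of Python. This is only an illustration, not the course's own tooling: the names CORPUS, word_frequencies and same_frequency are made up for the example, and words are lowercased for counting.

```python
from collections import Counter

# The six sentence pairs from the deciphering example (Apinaye, English).
CORPUS = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]


def word_frequencies(sentences):
    """Count lowercased, whitespace-separated words over all sentences."""
    return Counter(w for s in sentences for w in s.lower().split())


ap_freq = word_frequencies(ap for ap, _ in CORPUS)
en_freq = word_frequencies(en for _, en in CORPUS)


def same_frequency(ap_word):
    """English words that occur exactly as often as the given Apinaye word."""
    return sorted(w for w, c in en_freq.items() if c == ap_freq[ap_word])


for word, count in ap_freq.most_common():
    print(word, count, same_frequency(word))
```

Only 'ape' gets a unique candidate this way; every word that occurs once is compatible with six English words, which is why frequency alone is not enough.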

7 … Location in Corpus
- Vocabularies of the corpus, with the sentences in which each word occurs

  Apinaye   Sentences    English   Sentences
  kukre     1            The       1 2 3 4 5 6
  kokoi     1 3          monkey    1 3
  ape       2 3 4 5 6    eats      1
  kra       2 5          child     2 5
  rats      3            works     2 3 4 5 6
  mi        4 6          big       3
  mets      4 5          good      4
  punui     6            man       4 6
  pinjets   6            well      5
                         old       6
                         badly     6

- Observations:
  - Same sentences: ‘kukre’ – ‘eats’, ‘kokoi’ – ‘monkey’, ‘ape’ – ‘works’, ‘kra’ – ‘child’, ‘rats’ – ‘big’, ‘mi’ – ‘man’
  - ‘mets’ (4 and 5) =? ‘good’ (4) and ‘well’ (5); makes sense
  - ‘punui’ and ‘pinjets’ match ‘old’ and ‘badly’: which is which?
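Matching words by the set of sentences they occur in can be sketched as follows. This is a minimal illustration under the assumption that an exact match of occurrence sets is the criterion; the names occurrences and candidates are hypothetical.

```python
from collections import defaultdict

# The six sentence pairs from the deciphering example (Apinaye, English).
CORPUS = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]


def occurrences(sentences):
    """Map each lowercased word to the set of 1-based sentence numbers it occurs in."""
    occ = defaultdict(set)
    for idx, sentence in enumerate(sentences, start=1):
        for word in sentence.lower().split():
            occ[word].add(idx)
    return occ


ap_occ = occurrences(ap for ap, _ in CORPUS)
en_occ = occurrences(en for _, en in CORPUS)

# For every Apinaye word, list the English words with an identical occurrence set.
candidates = {
    ap_word: sorted(en for en, sents in en_occ.items() if sents == ap_sents)
    for ap_word, ap_sents in ap_occ.items()
}

for ap_word, en_words in candidates.items():
    print(ap_word, sorted(ap_occ[ap_word]), en_words)
```

Exact matching leaves 'mets' without a candidate (it covers the sentences of both 'good' and 'well') and leaves 'punui' and 'pinjets' ambiguous between 'old' and 'badly', which mirrors the observations on the slide.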

8 … Location in Sentence
- Observations:
  - First English word (‘The’) does not align; we say it aligns to the NULL word
  - Apinaye verb in first position
  - English last word aligns to 1st or 2nd position
  - English -> Apinaye: reverse word order (not strictly in sentence pair 5)
- Hypothesis:
  - In the last sentence pair, ‘pinjets’ aligns to ‘old’ and ‘punui’ to ‘badly’

  Apinaye                English                   Alignment EN - AP
  Kukre kokoi            The monkey eats
  Ape kra                The child works
  Ape kokoi rats         The big monkey works
  Ape mi mets            The good man works
  Ape mets kra           The child works well
  Ape punui mi pinjets   The old man works badly   1-0 2-??? ???

9 … POS Information
- Observations:
  - English determiner (‘The’) does not align; perhaps no determiners in Apinaye
  - English Verb Adverb -> Apinaye: Verb Adverb -> no reordering
  - English Adjective Noun -> Apinaye: Noun Adjective -> reordering
- Hypothesis:
  - ‘pinjets’ is Adj to make it N Adj, ‘punui’ is Adv (consistent with alignment hypothesis)

  Apinaye                POS           English                   POS
  Kukre kokoi            V N           The monkey eats           Det N V
  Ape kra                V N           The child works           Det N V
  Ape kokoi rats         V N Adj       The big monkey works      Det Adj N V
  Ape mi mets            V N Adj       The good man works        Det Adj N V
  Ape mets kra           V Adv N       The child works well      Det N V Adv
  Ape punui mi pinjets   V ??? N ???   The old man works badly   Det Adj N V Adv

10 Translate New Sentences: Ap - En
- Source Sentence: Ape rats mi mets
  - Lexical information: works big man good/well
  - Reordering information: The good man works big
  - Better lexical choice: The good man works hard
  - Compare: Ape mi mets -> The good man works
- Source Sentence: Kukre rats kokoi punui
  - Lexical information: eats big monkey badly
  - Reordering information: The bad monkey eats big
  - Better lexical choice: The bad monkey eats a lot
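The two steps used above (lexical lookup, then reordering) can be sketched as a toy translator. Everything here is an assumption made for illustration: the lexicon is the one guessed from the six sentence pairs, the adjective/adverb readings of each modifier are hypothetical, and the rule "modifier before the noun is an adverb, after the noun an adjective" is just the hypothesis from the preceding slides. The final step, "better lexical choice", is not modelled.

```python
# Hypothetical lexicon, guessed from the example corpus.
VERBS = {"ape": "works", "kukre": "eats"}
NOUNS = {"kokoi": "monkey", "kra": "child", "mi": "man"}
# Each modifier maps to (adjective reading, adverb reading).
MODIFIERS = {
    "rats": ("big", "big"),
    "mets": ("good", "well"),
    "punui": ("bad", "badly"),
    "pinjets": ("old", "old"),
}


def translate(sentence):
    """Translate Apinaye 'V (Adv) N (Adj)' into English 'The Adj N V Adv'."""
    verb, noun, adverbs, adjectives = None, None, [], []
    for word in sentence.lower().split():
        if word in VERBS:
            verb = VERBS[word]
        elif word in NOUNS:
            noun = NOUNS[word]
        elif word in MODIFIERS:
            adjective, adverb = MODIFIERS[word]
            # Before the noun the modifier belongs to the verb, after it to the noun.
            if noun is None:
                adverbs.append(adverb)
            else:
                adjectives.append(adjective)
        else:
            raise ValueError(f"unknown word: {word}")
    if verb is None or noun is None:
        raise ValueError("expected one verb and one noun")
    # English order: determiner, adjectives, noun, verb, adverbs.
    return " ".join(["The", *adjectives, noun, verb, *adverbs])


print(translate("Ape rats mi mets"))
print(translate("Kukre rats kokoi punui"))
```

The output corresponds to the "reordering information" lines; turning "works big" into "works hard" needs knowledge about English that this sketch does not have.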

11 Translate New Sentences: En - Ap
- Source Sentence: The old monkey eats a lot
  - Lexical information: NULL pinjets kokoi kukre rats
  - Reordering information: kukre rats kokoi pinjets
- Or
  - Deleting words: old monkey eats a lot
  - Rephrase: old monkey eats big
  - Reorder: eats big monkey old
  - Lexical information: kukre rats kokoi pinjets
- Source Sentence: The big child works a long time
  - Delete plus rephrase: big child works big
  - Reorder: works big child big
  - Lexical information: Ape rats kra rats

12 Overview
- Deciphering foreign text – an example
- Principles of SMT
- Data processing

13 Principles of SMT
- We will use the same approach: learning from data
  - Build translation models using frequency, co-occurrence, word position, etc. information
  - Use the models to translate new sentences
- Not manually, but fully automatically
  - The training will be automatic
  - There is still lots of manual work left: designing models, preparing data, running experiments, etc.

14 Machine Translation Approaches
- Grammar-based
  - Interlingua-based
  - Transfer-based
- Direct
- Example-based
- Statistical

15 Statistical Approach
- Using statistical models
  - Create many alternatives; we call them hypotheses
  - Give a score to each hypothesis, based on statistical models
  - Select the best -> search problem
- Advantages
  - Avoid hard decisions
  - Sometimes, optimality can be guaranteed
  - Speed can be traded against quality, not all-or-nothing
  - It works better!
- Disadvantages
  - Difficulties in handling structurally rich models, mathematically and computationally (but that’s also true for non-statistical systems)
  - Need data to train the model parameters

16 Statistical versus Grammar-Based
- Often statistical and grammar-based MT are seen as alternatives, even opposing approaches: wrong!
- Dichotomies are:
  - Use probabilities || everything is equally likely, yes/no decision
  - Rich (deep) structure || no or only flat structure
- Both dimensions are continuous
- Examples
  - EBMT: no/little structure and heuristics
  - SMT: (initially only) flat structure and probabilities
  - XFER: deep(er) structure and heuristics
- Goal: structurally rich probabilistic models
  - statXFER: deep structure and probabilities
  - Syntax-augmented SMT: deep structure and probabilities

                   No Probs            Probs
  Flat Structure   EBMT                SMT
  Deep Structure   XFER, Interlingua   Holy Grail

17 Statistical Machine Translation
- Translator translates source text
- Use machine learning techniques to extract useful knowledge
  - Translation model: word and phrase translations
  - Language model: how likely words are to follow each other in a particular sequence
- Translation system (decoder) uses these models to translate new sentences
- Advantages:
  - Can quickly train for new languages
  - Can adapt to new domains
- Problems:
  - Need parallel data
  - All words, even punctuation, are equal
  - Difficult to pinpoint the causes of errors
[Diagram labels: Source, Target, Source Sentence, Translation Model, Language Model]

18 Tasks in SMT
- Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
- Training: train translation model on bilingual corpus, train language model on monolingual corpus
- Decoding: find best translation for new sentences according to models
- Evaluation
  - Subjective evaluation: fluency, adequacy
  - Automatic evaluation: WER, BLEU, etc.
- And all the nitty-gritty stuff
  - Text preprocessing, data cleaning
  - Parameter tuning (minimum error rate training)

19 Noisy Channel View
“French is actually English, which has been garbled during transmission; recover the correct, original English”
- Speaker speaks English
- Noisy channel distorts into French
- You hear French, but need to recover the English

20 Bayesian Approach
Select the translation which has the highest probability:
  ê = argmax{ p(e | f) } = argmax{ p(e) p(f | e) }
- argmax: search process
- p(e): source model
- p(f | e): channel model
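The step between the two argmax expressions is Bayes' rule; written out, using the slide's symbols e (English hypothesis) and f (foreign sentence):

```latex
\hat{e}
  = \operatorname*{arg\,max}_{e} \; p(e \mid f)
  = \operatorname*{arg\,max}_{e} \; \frac{p(e)\, p(f \mid e)}{p(f)}
  = \operatorname*{arg\,max}_{e} \; p(e)\, p(f \mid e)
```

The denominator p(f) can be dropped because the foreign sentence f is fixed during the search, so p(f) is the same for every candidate e and does not change which candidate wins.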

21 SMT Architecture
- p(e): language model
- p(f | e): translation model

22 Log-Linear Model
- In practice:
    ê = argmax{ log(p(e)) + log(p(f | e)) }
- Translation model (TM) and language model (LM) may be of different quality:
  - simplifying assumptions
  - trained on different amounts of data
- Give different weights to both models:
    ê = argmax{ w1 * log(p(e)) + w2 * log(p(f | e)) }
- Why not add more features?
    ê = argmax{ w1 * h1(e, f) + ... + wn * hn(e, f) }
- Note: We don’t need the normalization constant for the argmax
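A minimal sketch of the weighted scoring, assuming just two features, log p(e) and log p(f | e). The hypotheses and their probabilities below are invented purely for illustration, and the names loglinear_score and best_hypothesis are not from the lecture.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum w1*h1 + ... + wn*hn of feature values."""
    return sum(w * h for w, h in zip(weights, features))


# Hypothetical candidates for one source sentence: (p(e), p(f | e)).
HYPOTHESES = {
    "The good man works": (0.020, 0.30),
    "The man good works": (0.001, 0.35),
    "Good the works man": (0.0001, 0.30),
}


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest weighted log-probability score."""
    def score(e):
        features = [math.log(p) for p in hypotheses[e]]
        return loglinear_score(features, weights)
    return max(hypotheses, key=score)


# Equal weights: the language model favours the fluent word order.
print(best_hypothesis(HYPOTHESES, (1.0, 1.0)))
# Language model switched off: only the translation model decides.
print(best_hypothesis(HYPOTHESES, (0.0, 1.0)))
```

Changing the weights changes which hypothesis wins, which is why the weights themselves have to be tuned (the parameter tuning mentioned under "Tasks in SMT").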

23 Overview
- Deciphering foreign text – an example
- Principles of SMT
- Data processing

24 Corpus Statistics
We want to know how much data we have
- Corpus size: not file size, not documents, but words and sentences
  - Why is file size not important?
- Vocabulary: number of word types
We want to know some distributions
- How many words are seen only once?
  - Why is this interesting?
  - Does it help to increase the corpus?
- …
- How long are the sentences?
  - Does it matter if we have many short or fewer, but longer sentences?
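The basic numbers asked for here can be computed with a short script. This is a sketch assuming whitespace tokenization and lowercasing, run on the English side of the six-sentence example corpus; corpus_stats is a hypothetical helper name.

```python
from collections import Counter

# English side of the deciphering example, used here as a tiny corpus.
SENTENCES = [
    "The monkey eats",
    "The child works",
    "The big monkey works",
    "The good man works",
    "The child works well",
    "The old man works badly",
]


def corpus_stats(sentences):
    """Sentence count, token count, vocabulary size, singletons, average length."""
    lengths = [len(s.split()) for s in sentences]
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {
        "sentences": len(sentences),
        "tokens": sum(lengths),
        "types": len(counts),
        "singletons": sum(1 for c in counts.values() if c == 1),
        "avg_length": sum(lengths) / len(sentences),
    }


for name, value in corpus_stats(SENTENCES).items():
    print(name, value)
```

On a real corpus the same function would be fed one line per sentence from a file; the share of singletons among the types is the figure usually reported.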

25 All Simple, Basic, Important
- Important: When you publish, these numbers are important
  - To be able to interpret the results, e.g. what works on small corpora may not work on large corpora
  - To make them comparable to other papers
- Basic: no deep thinking, nothing fancy
- Simple: a few unix commands, a few simple scripts
  - wc, grep, sed, sort, uniq
  - perl, awk (my favorite), perhaps python, …
- Let’s look at some data!

26 BTEC Spa-Eng
- Corpus Statistics
  - Corpus and vocabulary size
  - Percentage of singletons
  - Number of unknown words, out-of-vocabulary (OOV) rate
  - Sentence length balance
- Text normalization
  - Spoken language forms: I’ll, we’re, but also I will, we are
Note: this was shown online
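The out-of-vocabulary rate is the share of test tokens that never occur in the training data. A minimal sketch, assuming whitespace tokenization and lowercasing, with toy training and test text taken from the deciphering example rather than from BTEC; oov_rate is a hypothetical helper name.

```python
def oov_rate(train_tokens, test_tokens):
    """Return (OOV rate over test tokens, sorted list of unknown word types)."""
    if not test_tokens:
        raise ValueError("test set is empty")
    vocabulary = set(train_tokens)
    unknown = [t for t in test_tokens if t not in vocabulary]
    return len(unknown) / len(test_tokens), sorted(set(unknown))


train = (
    "the monkey eats the child works the big monkey works "
    "the good man works the child works well the old man works badly"
).split()
test = "the old monkey eats a lot".split()

rate, unknown_words = oov_rate(train, test)
print(f"OOV rate: {rate:.1%}, unknown words: {unknown_words}")
```

This counts running tokens; counting unknown word types instead gives a different figure, so it is worth stating which one is reported.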

27 Tokenization
- Punctuation attached to words
  - Example: ‘you’ ‘you,’ ‘you.’ ‘you?’
  - All different strings, i.e. all are different words
- Tokenization can be tricky
  - What about punctuation in numbers?
  - What about abbreviations (A5-0104/1999)?
- Numbers are not just numbers
  - Percentages: 1.2%
  - Ordinals: 1st, 2.
  - Ranges: , 3:1
  - And more: (A5-0104/1999)
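A small regular-expression tokenizer shows both the basic idea and why it gets tricky. This is a sketch under simple assumptions (ASCII apostrophes, three hand-written token classes), not the tokenizer used in the course; TOKEN_RE and tokenize are illustrative names.

```python
import re

TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,:/]\d+)*%?(?!\w)   # numbers with internal punctuation, optional percent sign
  | \w+(?:[-']\w+)*             # words, allowing internal hyphens and apostrophes
  | [^\w\s]                     # any other single non-space character, e.g. punctuation
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("you, you. you?"))
print(tokenize("Growth was 1.2%, up 3:1."))
print(tokenize("A5-0104/1999"))
```

Punctuation is separated from 'you', and '1.2%' and '3:1' stay intact, but the document identifier A5-0104/1999 is still split at the slash. Each such case needs its own rule or an exception list.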

28 GigaWord Corpus
- Distributed by LDC
- Collection of newspapers: NYT, Xinhua News, …
- > 3 billion words
- How large is the vocabulary?
- Some observations in the vocabulary
  - Number of entries with digits
  - Number of entries with special characters
  - Number of strange ‘words’
- Some observations in the corpus
  - Sentences with lots of numbers
  - Sentences with lots of punctuation
  - Sentences with very long words
Note: this was shown online

29 And then the more interesting Stuff
- POS tagging
- Parsing
  - For syntax-based MT systems
  - How parallel are the parse trees?
- Word segmentation
- Morphological processing
In all these tasks the central problem is: How to make the corpus more parallel?

