Presentation is loading. Please wait.

Presentation is loading. Please wait.

LING 180 SYMBSYS 138 Intro to Computer Speech and Language Processing

Similar presentations


Presentation on theme: "LING 180 SYMBSYS 138 Intro to Computer Speech and Language Processing"— Presentation transcript:

1 LING 180 SYMBSYS 138 Intro to Computer Speech and Language Processing
Lecture 10: Machine Translation (II) October 26, 2005 Dan Jurafsky Thanks to Kevin Knight for much of this material, and many slides also came from Bonnie Dorr and Christof Monz! 12/7/2018

2 Outline for MT Week Intro and a little history
Language Similarities and Divergences Four main MT Approaches Transfer Interlingua Direct Statistical Evaluation 12/7/2018

3 How do we evaluate MT? Human
Fluency Overall fluency Human rating of sentences read out loud Cohesion (Lexical chains, anaphora, ellipsis) Hand-checking for cohesion. Well-formedness 5-point scale of syntactic correctness Fidelity (same information as source?) Hand rating of target text on 100pt scale Clarity Comprehensibility Noise test Multiple choice questionnaire Readability cloze 12/7/2018

4 Evaluating MT: Problems
Asking humans to judge sentences on a 5-point scale for 10 factors takes time and $$$ (weeks or months!) We can’t build language engineering systems if we can only evaluate them once every quarter!!!! We need a metric that we can run every time we change our algorithm. It would be OK if it wasn’t perfect, but just tended to correlate with the expensive human metrics, which we could still run in quarterly. 12/7/2018

5 BiLingual Evaluation Understudy (BLEU —Papineni, 2001)
Automatic Technique, but …. Requires the pre-existence of Human (Reference) Translations Approach: Produce corpus of high-quality human translations Judge “closeness” numerically (word-error rate) Compare n-gram matches between candidate translation and 1 or more reference translations 12/7/2018

6 BLEU Evaluation Metric
(Papineni et al, ACL-2002) Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport . N-gram precision (score is between 0 & 1) What percentage of machine n-grams can be found in the reference translation? An n-gram is an sequence of n words Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”) Brevity penalty Can’t just type out single word “the” (precision 1.0!) *** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t) Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. 12/7/2018

7 BLEU Evaluation Metric
(Papineni et al, ACL-2002) Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport . BLEU4 formula (counts n-grams up to length 4) exp (1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4 – max(words-in-reference / words-in-machine – 1, 0) p1 = 1-gram precision P2 = 2-gram precision P3 = 3-gram precision P4 = 4-gram precision Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. 12/7/2018

8 Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport . Reference translation 3: The US International Airport of Guam and its office has received an from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert . Reference translation 4: US Guam International Airport and its office received an from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter . Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places . Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport . Reference translation 3: The US International Airport of Guam and its office has received an from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert . Reference translation 4: US Guam International Airport and its office received an from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter . Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places . Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. 12/7/2018

9 BLEU in Action 枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police . (Reference Translation) the gunman was police kill . #1 wounded police jaya of #2 the gunman was shot dead by the police . #3 the gunman arrested by police kill . #4 the gunmen were killed #5 the gunman was shot to death by the police . #6 gunmen were killed by police ?SUB>0 ?SUB>0 #7 al by the police #8 the ringer is killed by the police . #9 police killed the gunman . #10 green = 4-gram match (good!) red = word not matched (bad!) 12/7/2018

10 Bleu Comparison Chinese-English Translation Example:
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party. Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct. Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 12/7/2018

11 How Do We Compute Bleu Scores?
Intuition: “What percentage of words in candidate occurred in some human translation?” Proposal: count up # of candidate translation words (unigrams) # in any reference translation, divide by the total # of words in # candidate translation But can’t just count total # of overlapping N-grams! Candidate: the the the the the the Reference 1: The cat is on the mat Solution: A reference word should be considered exhausted after a matching candidate word is identified. 12/7/2018

12 “Modified n-gram precision”
For each word compute: (1) the maximum number of times it occurs in any single reference translation (2) number of times it occurs in the candidate translation Instead of using count #2, use the minimum of #1 and #2, I.e. clip the counts at the max for the reference transcription Now use that modified count. And divide by number of candidate words. 12/7/2018

13 Modified Unigram Precision: Candidate #1
It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. What’s the answer??? 17/18 12/7/2018

14 Modified Unigram Precision: Candidate #2
It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. What’s the answer???? 8/14 12/7/2018

15 Modified Bigram Precision: Candidate #1
It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. 10/17 What’s the answer???? 12/7/2018

16 Modified Bigram Precision: Candidate #2
It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0) Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. What’s the answer???? 1/13 12/7/2018

17 Catching Cheaters the(2) the the the(0) the(0) the(0) the(0)
Reference 1: The cat is on the mat Reference 2: There is a cat on the mat What’s the unigram answer? 2/7 What’s the bigram answer? 0/7 12/7/2018

18 Bleu distinguishes human from machine translations
12/7/2018

19 Bleu problems with sentence length
Candidate: of the Solution: brevity penalty; prefers candidates translations which are same length as one of the references Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. Reference 3: It is the practical guide for the army always to heed the directions of the party. Problem: modified unigram precision is 2/2, bigram 1/1! 12/7/2018

20 Sentence length N-gram penalizes using extra words.
Fails to enforce the proper translation length. Sentence Brevity Multiplicative brevity penalty Done over the entire corpus, rather than sentence-by-sentence: r/c r = sum of the “best match lengths” c = total length of the candidate translation 12/7/2018

21 Brevity Penalty Ranking 12/7/2018

22 BLEU Tends to Predict Human Judgments
(variant of BLEU) slide from G. Doddington (NIST) 12/7/2018

23 Rule-Based vs. Statistical MT
Rule-based MT: Hand-written transfer rules Rules can be based on lexical or structural transfer Pro: firm grip on complex translation phenomena Con: Often very labor-intensive -> lack of robustness Statistical MT Mainly word or phrase-based translations Translation are learned from actual data Pro: Translations are learned automatically Con: Difficult to model complex translation phenomena 12/7/2018

24 Parallel Corpus Example from DE-News (8/1/1996) English German
Diverging opinions about planned tax reform Unterschiedliche Meinungen zur geplanten Steuerreform The discussion around the envisaged major tax reform continues . Die Diskussion um die vorgesehene grosse Steuerreform dauert an . The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen . 12/7/2018

25 Word-Level Alignments
Given a parallel sentence pair we can link (align) words or phrases that are translations of each other: 12/7/2018

26 Parallel Resources Newswire: DE-News (German-English), Hong-Kong News, Xinhua News (Chinese-English), Government: Canadian-Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portugese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, ) Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages) Web pages: STRAND project (Philip Resnik) 12/7/2018

27 Sentence Alignment If document De is translation of document Df how do we find the translation for each sentence? The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments Approximately 90% of the sentence alignments are 1:1 12/7/2018

28 Sentence Alignment (c’ntd)
There are several sentence alignment algorithms: Align (Gale & Church): Aligns sentences based on their character length (shorter sentences tend to have shorter translations then longer sentences). Works astonishingly well Char-align: (Church): Aligns based on shared character sequences. Works fine for similar languages or technical domains K-Vec (Fung & Church): Induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs. 12/7/2018

29 Computing Translation Probabilities
Given a parallel corpus we can estimate P(e | f) The maximum likelihood estimation of P(e | f) is: freq(e,f)/freq(f) Way too specific to get any reasonable frequencies! Vast majority of unseen data will have zero counts! P(e | f ) could be re-defined as: Problem: The English words maximizing P(e | f ) might not result in a readable sentence 12/7/2018

30 Computing Translation Probabilities (c’tnd)
We can account for adequacy: each foreign word translates into its most likely English word We cannot guarantee that this will result in a fluent English sentence Solution: transform P(e | f) with Bayes’ rule: P(e | f) = P(e) P(f | e) / P(f) P(f | e) accounts for adequacy P(e) accounts for fluency 12/7/2018

31 Decoding The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation: The choice of word e’ as translation of f’ depends on the translation probability P(f’ | e’) and on the context, i.e. other English words preceding e’ 12/7/2018

32 What makes a good translation
Translators often talk about two factors we want to maximize: Faithfulness or fidelity How close is the meaning of the translation to the meaning of the original (Even better: does the translation cause the reader to draw the same inferences as the original would have) Fluency or naturalness How natural the translation is, just considering its fluency in the target language 12/7/2018

33 Statistical MT: Faithfulness and Fluency formalized!
Best-translation of a source sentence S: Developed by researchers who were originally in speech recognition at IBM Called the IBM model 12/7/2018

34 The IBM model Hmm, those two factors might look familiar…
Yup, it’s Bayes rule: 12/7/2018

35 It’s also the Noisy Channel Model
12/7/2018 Slide from Christof Monz

36 Fluency: P(T) How to measure that this sentence
That car was almost crash onto me is less fluent than this one: That car almost hit me. Answer: language models (N-grams!) For example P(hit|almost) > P(was|almost) But can use any other more sophisticated model of grammar Advantage: this is monolingual knowledge! 12/7/2018

37 Faithfulness: P(S|T) French: ça me plait [that me pleases] English:
that pleases me - most fluent I like it I’ll take that one How to quantify this? Intuition: degree to which words in one sentence are plausible translations of words in other sentence Product of probabilities that each word in target sentence would generate each word in source sentence. 12/7/2018

38 Faithfulness P(S|T) Need to know, for every target language word, probability of it mapping to every source language word. How do we learn these probabilities? Parallel texts! Lots of times we have two texts that are translations of each other If we knew which word in Source Text mapped to each word in Target Text, we could just count! 12/7/2018

39 Faithfulness P(S|T) Sentence alignment: Word alignment
Figuring out which source language sentence maps to which target language sentence Word alignment Figuring out which source language word maps to which target language word 12/7/2018

40 Big Point about Faithfulness and Fluency
Job of the faithfulness model P(S|T) is just to model “bag of words”; which words come from say English to Spanish. P(S|T) doesn’t have to worry about internal facts about Spanish word order: that’s the job of P(T) P(T) can do Bag generation: put the following words in order have programming a seen never I language better -actual the hashing is since not collision-free usually the is less perfectly the of somewhat capacity table 12/7/2018

41 P(T) and bag generation: the answer
“Usually the actual capacity of the table is somewhat less, since the hashing is not collision-free” How about: loves Mary John 12/7/2018

42 A motivating example P(S|T) alone prefers: 2000 Correspondence
Japanese phrase 2000nen taio 2000nen highest Y2K 2000 years 2000 year Taio Correspondence -highest Corresponding Equivalent Tackle Dealing with Deal with P(S|T) alone prefers: 2000 Correspondence Adding P(T) might produce correct Dealing with Y2K 12/7/2018

43 Recent Progress in Statistical MT
slide from C. Wayne, DARPA 2002 2003 insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment . And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " . Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. " The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ". 12/7/2018

44 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 12/7/2018

45 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

46 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

47 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

48 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . ??? 12/7/2018

49 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

50 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

51 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

52 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . ??? 12/7/2018

53 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . 12/7/2018

54 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . process of elimination 12/7/2018

55 Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . cognate? 12/7/2018

56 Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp } 1a. ok-voon ororok sprok . 1b. at-voon bichat dat . 7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat . 2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat . 8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat . 3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat . 9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp . 4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat . 10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat . 5a. wiwok farok izok stok . 5b. totat jjat quat cat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat . zero fertility 12/7/2018

57 It’s Really Spanish/English
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa 1a. Garcia and associates . 1b. Garcia y asociados . 7a. the clients and the associates are enemies . 7b. los clients y los asociados son enemigos . 2a. Carlos Garcia has three associates . 2b. Carlos Garcia tiene tres asociados . 8a. the company has three groups . 8b. la empresa tiene tres grupos . 3a. his associates are not strong . 3b. sus asociados no son fuertes . 9a. its groups are in Europe . 9b. sus grupos estan en Europa . 4a. Garcia has a company also . 4b. Garcia tambien tiene una empresa . 10a. the modern groups sell strong pharmaceuticals . 10b. los grupos modernos venden medicinas fuertes . 5a. its clients are angry . 5b. sus clientes estan enfadados . 11a. the groups do not sell zenzanine . 11b. los grupos no venden zanzanina . 6a. the associates are also angry . 6b. los asociados tambien estan enfadados . 12a. the small groups are not modern . 12b. los grupos pequenos no son modernos . 12/7/2018

58 Statistical MT Systems
Spanish/English Bilingual Text English Text Statistical Analysis Statistical Analysis Broken English Spanish English What hunger have I, Hungry I am so, I am so hungry, Have I that hunger … Que hambre tengo yo I am so hungry 12/7/2018

59 Statistical MT Systems
Spanish/English Bilingual Text English Text Statistical Analysis Statistical Analysis Broken English Spanish English Translation Model P(s|e) Language Model P(e) Que hambre tengo yo I am so hungry Decoding algorithm argmax P(e) * P(s|e) e 12/7/2018

60 Three Problems for Statistical MT
Language model Given an English string e, assigns P(e) by formula good English string -> high P(e) random word sequence -> low P(e) Translation model Given a pair of strings <f,e>, assigns P(f | e) by formula <f,e> look like translations -> high P(f | e) <f,e> don’t look like translations -> low P(f | e) Decoding algorithm Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e) 12/7/2018

61 The Classic Language Model Word N-Grams
Goal of the language model -- choose among: He is on the soccer field He is in the soccer field Is table the on cup the The cup is on the table Rice shrine American shrine Rice company American company 12/7/2018

62 The Classic Language Model Word N-Grams
Generative approach: w1 = START repeat until END is generated: produce word w2 according to a big table P(w2 | w1) w1 := w2 P(I saw water on the table) = P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table) Probabilities can be learned from online English text. 12/7/2018

63 Translation Model? Mary did not slap the green witch
Generative approach: Mary did not slap the green witch Source-language morphological analysis Source parse tree Semantic representation Generate target structure Maria no dió una bofetada a la bruja verde 12/7/2018

64 Translation Model? Mary did not slap the green witch
Generative story: Mary did not slap the green witch Source-language morphological analysis Source parse tree Semantic representation Generate target structure What are all the possible moves and their associated probability tables? Maria no dió una bofetada a la bruja verde 12/7/2018

65 The Classic Translation Model Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach: Mary did not slap the green witch n(3|slap) Mary not slap slap slap the green witch P-Null Mary not slap slap slap NULL the green witch t(la|the) Maria no dió una bofetada a la verde bruja d(j|i) Maria no dió una bofetada a la bruja verde 12/7/2018 Probabilities can be learned from raw bilingual text.

66 Statistical Machine Translation
… la maison … la maison bleue … la fleur … … the house … the blue house … the flower … All word alignments equally likely All P(french-word | english-word) equally likely 12/7/2018

67 Statistical Machine Translation
… la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “la” and “the” observed to co-occur frequently, so P(la | the) is increased. 12/7/2018

68 Statistical Machine Translation
… la maison … la maison bleue … la fleur … … the house … the blue house … the flower … “house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle) 12/7/2018

69 Statistical Machine Translation
… la maison … la maison bleue … la fleur … … the house … the blue house … the flower … settling down after another iteration 12/7/2018

70 Statistical Machine Translation
… la maison … la maison bleue … la fleur … … the house … the blue house … the flower … Inherent hidden structure revealed by EM training! For details, see: “A Statistical MT Tutorial Workbook” (Knight, 1999). “The Mathematics of Statistical Machine Translation” (Brown et al, 1993) Software: GIZA++ 12/7/2018

71 Statistical Machine Translation
… la maison … la maison bleue … la fleur … … the house … the blue house … the flower … P(juste | fair) = 0.411 P(juste | correct) = 0.027 P(juste | right) = 0.020 new French sentence Possible English translations, to be rescored by language model 12/7/2018

72 More formally: The IBM Model
Let’s flesh out these intuitions about P(S|T) and P(T) a bit. Many of the next slides are drawn from Kevin Knight’s fantastic “A Statistical MT Tutorial Workbook”! 12/7/2018

73 IBM Model 3 as probabilistic version of Direct MT
We translate English into Spanish as follows: Replace the words in the English sentence by Spanish words Scramble around the words to look like Spanish order But we can’t propose that English words are replaced by Spanish words one-for-one, because translations aren’t the same length. 12/7/2018

74 IBM Model 3 (from Knight 1999)
For each word ei in English sentence, choose a fertility i. The choice of i depends only on ei, not other words or ’s. For each word ei, generate i Spanish words. Choice of French word depends only on English word ei, not English context or any Spanish words. Permute all the Spanish words. Each Spanish word gets assign absolute target position slot (1,2,3, etc). Choice of Spanish word position dependent only on absolute position of English word generating it. 12/7/2018

75 Translation as String rewriting (from Knight 1999)
Mary did not slap the green witch Assign fertilities: 1 = copy over word, 2= copy twice, etc. 0 = delete Mary not slap slap slap the the green witch Replace English words with Spanish one-for-one: Mary no daba una bofetada a la verde bruja Permute the words Mary no daba una bofetada a la bruja verde 12/7/2018

76 Model 3: P(S|T) training parameters
What are the parameters for this model? Just look at dependencies: Words: P(casa|house) Fertilities: n(1|house): prob that “house” will produce 1 Spanish word whenever ‘house’ appears. Distortions: d(5|2) prob that English word in position 2 of English sentence generates French word in position 5 of French translation Actually, distortions are d(5,2,4,6) where 4 is length of English sentence, 6 is Spanish length Remember, P(S|T) doesn’t have to model fluency 12/7/2018

77 Model 3: last twist Imagine some Spanish words are “spurious”; they appear in Spanish even though they weren’t in English original Like function words; we generated “a la” from “the” by giving “the” fertility 2 Instead, we could give “the” fertility 1, and generat “a” spuriously Do this by pretending every English sentence contains invisible word NULL as word 0. Then parameters like t(a|NULL) give probability of word “a” generating spuriously from NULL 12/7/2018

78 Spurious words We could imagine having n(3|NULL) (probability of being exactly 3 spurious words in a Spanish translation) Instead, of n(0|NULL), n(1|NULL) … N(25|NULL), have a single parameter p1 After assign fertilities to non-NULL English words we want to generate (say) z Spanish words. As we generate each of z words, we optionally toss in spurious Spanish word with probability p1 Probability of not tossing in spurious word p0=1-p1 12/7/2018

79 Distortion probabilities for spurious words
Can’t just have d(5|0,4,6), I.e. chance that NULL word will end up in position 5. Why? These are spurious words! Could occur anywhere!! To hard to predict Instead, Use normal-word distortion parameters to choose positions for normally-generated Spanish words Put Null-generated words into empty slots left over If three NULL-generated words, and three empty slots, then there are 3!, or six, ways for slotting them all in We’ll assign a probability of 1/6 for each way 12/7/2018

80 Real Model 3 For each word ei in English sentence, choose fertility i with prob n(i| ei) Choose number 0 of spurious Spanish words to be generated from e0=NULL using p1 and sum of fertilities from step 1 Let m be sum of fertilities for all words including NULL For each i=0,1,2,…L , k=1,2,… I : choose Spanish word ikwith probability t(ik|ei) For each i=1,2,…L , k=1,2,… I : choose target Spanish position ikwith prob d(ik|I,L,m) For each k=1,2,…, 0 choose position 0k from 0 -k+1 remaining vacant positions in 1,2,…m for total prob of 1/ 0! Output Spanish sentence with words ik in positions ik (0<=I<=1,1<=k<= I) 12/7/2018

81 The Classic Translation Model Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative approach: Mary did not slap the green witch n(3|slap) Mary not slap slap slap the green witch P-Null Mary not slap slap slap NULL the green witch t(la|the) Maria no dió una bofetada a la verde bruja d(j|i) Maria no dió una bofetada a la bruja verde 12/7/2018 Probabilities can be learned from raw bilingual text.

82 Model 3 parameters N,t,p,d If we had English strings and step-by-step rewritings into Spanish, we could: Compute n(0|did) by locating every instance of “did”, see what happens to it during first rewriting step If “did” appeared 15,000 times and was deleted during the first rewriting step 13,000 times, then n(0|did) = 13/15 12/7/2018

83 Alignments NULL And the program has been implemented | | | | | | /|\
| | | | | | /|\ Le programme a ete mis en application If we had lots of alignments like this, n(0|d): how many times “did” connects to no French words T(maison|house) how many of all French words generated by “house” were “maison” D(5|2,4,6) out of all times some word2 moved somewhere, how many times to word5? 12/7/2018

84 Where to get alignments: Parallel corpora
Example from DE-News (8/1/1996) English German Diverging opinions about planned tax reform Unterschiedliche Meinungen zur geplanten Steuerreform The discussion around the envisaged major tax reform continues . Die Diskussion um die vorgesehene grosse Steuerreform dauert an . The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen . 12/7/2018 Slide from Christof Monz

85 Sentence Alignment El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan. The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await. 12/7/2018

86 Sentence Alignment El viejo está feliz porque ha pescado muchos veces.
Su mujer habla con él. Los tiburones esperan. The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await. 12/7/2018

87 Sentence Alignment El viejo está feliz porque ha pescado muchos veces.
Su mujer habla con él. Los tiburones esperan. The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await. 12/7/2018

88 Sentence Alignment El viejo está feliz porque ha pescado muchos veces.
Su mujer habla con él. Los tiburones esperan. The old man is happy. He has fished many times. His wife talks to him. The sharks await. Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0). 12/7/2018

89 Where to get alignments
It turns out we can bootstrap alignments If we just have a bilingual corpus We can bootstrap alignments Assume some startup values for n,d,, etc Use values for n,d, , etc to use model 3 to do “forced alignment”; I.e. to pick the best word alignments between sentences Use these alignments to retrain n,d, , etc Go to 2 This is called the Expectation-Maximization or EM algorithm 12/7/2018

90 Summary Intro and a little history
Language Similarities and Divergences Four main MT Approaches Transfer Interlingua Direct Statistical Evaluation 12/7/2018

91 Classes LINGUIST 139M/239M. Human and Machine Translation. (Martin Kay) CS 224N. Natural Language Processing (Chris Manning) 12/7/2018


Download ppt "LING 180 SYMBSYS 138 Intro to Computer Speech and Language Processing"

Similar presentations


Ads by Google