Natural Language Processing Lecture 23—12/1/2015 Jim Martin.


Slide 3: Today
 Schedule
 Quiz 2 review
 HW 3 questions
 Machine translation framework: statistical
 Evaluation methods
 Basic probabilistic framework
Speech and Language Processing - Jurafsky and Martin (2/19/2016)

Slide 4: What's Left
 Information extraction HW
 Machine translation (next two lectures)
 Vector-based word representations
 Review (last lecture)
 Final: 12/17

Slides 5-8: Quiz 2
 Cancel the lunch reservation at Frasca

Slide 9: Machine Translation
 Automatic translation of text (and speech) from one language to another
 One of the oldest applications of natural language processing (1950s)

Slide 10: Current MT Approaches: Motivation
 If we're translating French to English, the French we're seeing is just a weird, garbled version of English
 The key problem is to decode the garble back into the original English using Bayes' rule:
  argmax_E P(E|F) = argmax_E P(F|E) P(E)
 As we'll see, the key to doing this is having access to aligned bilingual texts
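The Bayes step on this slide, written out in full: the denominator P(F) is constant over candidate translations E, so it drops out of the argmax.

```latex
\hat{E} \;=\; \operatorname*{argmax}_{E} P(E \mid F)
        \;=\; \operatorname*{argmax}_{E} \frac{P(F \mid E)\,P(E)}{P(F)}
        \;=\; \operatorname*{argmax}_{E} P(F \mid E)\,P(E)
```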

Slide 11: Warren Weaver (1947)
"When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."

Slide 12: What's This?

Slide 13: Rosetta Stone
 Egyptian hieroglyphs
 Demotic
 Greek

Slide 14: Training Data
 Egyptian hieroglyphs
 Demotic
 Greek

Slide 15: Centauri/Arcturan [Knight, 1997]
1a. ok-voon ororok sprok.  1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok.  2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok.  3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok.  4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok.  5b. totat jjat quat cat.
6a. lalok sprok izok jok stok.  6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok.  7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok.  8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp.  9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok.  10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok.  11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok.  12b. wat nnat forat arrat vat gat.
Your assignment: translate this Centauri text to Arcturan:
farok crrrok hihok yorok clok kantok ok-yurp

Slides 16-25 repeat the same corpus, working out the translation of the assignment sentence word by word: most words can be pinned down from their co-occurrences across sentence pairs, some by process of elimination, and one looks like a possible cognate.

Slide 26: Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp } (note: zero fertility)

Slide 27: Spanish/English text
1a. Garcia and associates.  1b. Garcia y asociados.
2a. Carlos Garcia has three associates.  2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong.  3b. sus asociados no son fuertes.
4a. Garcia has a company also.  4b. Garcia tambien tiene una empresa.
5a. its clients are angry.  5b. sus clientes estan enfadados.
6a. the associates are also angry.  6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies.  7b. los clientes y los asociados son enemigos.
8a. the company has three groups.  8b. la empresa tiene tres grupos.
9a. its groups are in Europe.  9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals.  10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine.  11b. los grupos no venden zanzanina.
12a. the small groups are not modern.  12b. los grupos pequenos no son modernos.
Translate: Clients do not sell pharmaceuticals in Europe.

Slide 28: The Point
 No dictionary for either language
 No grammar
 No meaning
 No broader meaning context
 Just text pairs and a bunch of dubious assumptions

Slide 29: Basic Statistical MT
 Statistical MT systems are based on maximizing P(E|F)
 E is the target language, F is the source
 We'll start, as usual, by applying Bayes' rule

Slide 30: Sub-Problems of Statistical MT
 Language model
  Given an English string e, assigns P(e) by the usual methods
  A good English string -> high P(e)
  A random word sequence -> low P(e)
 Translation model
  Given a pair of strings <f, e>, assigns P(f | e)
  Pairs that look like translations -> high P(f | e)
  Pairs that don't look like translations -> low P(f | e)
 Decoding algorithm
  Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) * P(f | e)
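The three components fit together as the decoder's objective. A minimal sketch, with invented toy probabilities standing in for a trained language model and translation model (`lm`, `tm`, and their values are assumptions for illustration):

```python
import math

def decode(f, candidates, lm_prob, tm_prob):
    """Pick the English candidate e maximizing P(e) * P(f | e).

    Log-probabilities are summed instead of multiplying raw
    probabilities, which avoids underflow on long sentences."""
    def score(e):
        return math.log(lm_prob(e)) + math.log(tm_prob(f, e))
    return max(candidates, key=score)

# Toy models (assumed values): both orderings have the same translation
# probability, so the language model breaks the tie.
lm = {"the house": 0.02, "house the": 0.0001}.get
tm = lambda f, e: {("la maison", "the house"): 0.3,
                   ("la maison", "house the"): 0.3}[(f, e)]

best = decode("la maison", ["the house", "house the"], lm, tm)
print(best)  # "the house" wins on the language-model score
```

The point of the toy numbers: a translation model alone cannot prefer fluent word order; that preference comes from P(e).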

Slides 31-32: Three Problems
 For the language model, just use normal language models
 For the translation model there are lots of choices: word-based, phrase-based, syntactic, semantic
 For the decoding model we'll focus on A*-like methods (heuristic search with a beam)

Slide 33: Phrase-Based MT
 In phrase-based MT, we proceed by:
  Translating phrases in the source text to phrases in the target text
  Arranging those translated phrases in a way that makes sense for the target language
 So for English to German...

Slide 34: Phrase-Based MT
 The probability of such a translation is the product of the individual phrase translation probabilities and the movement (dislocation) probabilities.
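A sketch of that product, computed in log space to avoid underflow. The exponential distortion penalty `alpha ** |d|` is one common simple choice, and `alpha`, the `(probability, distortion)` pairing, and the example numbers are all assumptions for illustration:

```python
import math

def phrase_translation_logprob(pairs, alpha=0.8):
    """pairs: one (phrase_prob, distortion) per output phrase, where
    distortion is how far the phrase moved from its monotone position.
    Returns log of: product of phrase probs * product of alpha**|d|."""
    return sum(math.log(p) + abs(d) * math.log(alpha) for p, d in pairs)

# Two phrases; the second one jumped 2 positions from monotone order.
s = phrase_translation_logprob([(0.5, 0), (0.25, 2)])
```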

Slides 35-36: Phrase-Based MT
 To do this, we need:
  Source and target language phrases, with their associated probabilities
  Movement probabilities
  A translation probability model
  A decoding algorithm

Slides 37-38: Phrase-Based Decoding
 The first step is to break the source text down into all the possible words and phrases that are candidates to be translated.
[Figure: the source "the green witch is at home this week" segmented into candidate units: individual words and the phrases "the green witch", "this week", "at home"]

Slide 39: Phrase-Based Decoding
 Then incrementally choose and score all possible translations of the first piece of the output, entering each option into a priority queue based on its score.
[Figure: queue entries "Diese Woche" (700), "ist" (720), "zu hause" (900), ...]

Slides 40-41: Phrase-Based Decoding
 Choose the option with the best score (least cost) and extend it with all the next possible phrases.
[Figure: "Diese Woche" is extended to "Diese Woche ist", "Diese Woche gruen", "Diese Woche die gruen hexe", ...]

Slides 42-43: Phrase-Based Decoding
 Score those phrases, then choose the best of all the current options to extend next.
[Figure: the frontier now includes "Diese Woche ist die gruen hexe", ...]

Slide 44: Phrase-Based Decoding
 Keep going until you have a target sentence that "covers" all the source content.

Slide 45: Decoding as Search
 We start with a null state:
  No foreign content accounted for
  No English content produced
 We drive the search by:
  1. Segmenting the foreign input (all segmentations allowed by the phrase table)
  2. Choosing foreign words/phrases to "cover"
  3. Choosing a way to cover them
 English translations are pasted left-to-right onto previous choices
 Done when all the foreign input is covered
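The covering loop just described can be sketched as a multistack decoder, where stack k holds hypotheses covering k source words. This is a simplified sketch, not a faithful decoder: the phrase table and scores are invented, and language-model and future-cost terms are omitted:

```python
import heapq

def beam_decode(src, phrase_table, beam=5):
    """Multistack beam decoder sketch.
    phrase_table maps source-word tuples to (translation, logprob) options;
    a hypothesis is (score, english_so_far, covered_source_positions)."""
    n = len(src)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, "", frozenset()))  # null state
    for k in range(n):
        # prune: expand only the beam best hypotheses on this stack
        for score, eng, cov in heapq.nlargest(beam, stacks[k],
                                              key=lambda h: h[0]):
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = tuple(src[i:j])
                    if span not in phrase_table or cov & set(range(i, j)):
                        continue  # unknown phrase, or already covered
                    for tgt, lp in phrase_table[span]:
                        stacks[k + j - i].append(
                            (score + lp,
                             (eng + " " + tgt).strip(),   # paste left-to-right
                             cov | set(range(i, j))))
    finished = stacks[n]  # everything is covered
    return max(finished, key=lambda h: h[0])[1] if finished else None

table = {("la",): [("the", -1.0)],
         ("maison",): [("house", -1.0)],
         ("la", "maison"): [("the house", -0.5)]}
print(beam_decode(["la", "maison"], table))  # prints "the house"
```

The single-phrase option wins here because its one log-cost (-0.5) beats any two-phrase combination (-2.0).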

Slide 46: Decoding
 Search cost is really based on two factors:
  Current cost: the language model cost and translation cost for the chosen phrases
  Future cost: the estimated cost to translate the remaining parts of the source sentence

Slides 47-48: Decoding (continued)

Slide 49: Decoding
 Lots of bells and whistles are needed to make this work; two major ones:
  Pure A* is really a depth-first mechanism, so we add some breadth-first behavior: maintain a list of N-best stacks
  A* is too conservative, so we do some major pruning to eliminate hypotheses that fall below a certain level
 Together these give a multistack beam decoder

Slide 50: Phrase-Based MT
 To build such a model, we need:
  Phrases, phrase translations, and probabilities
  Phrase dislocation probabilities
  A scoring scheme that makes sense
 To get the phrases, we're going to use translated texts aligned at the sentence level
 We first learn a word alignment, then discover the phrases given the word alignment

Slide 51: Word Alignments
 Let's start with a simple alignment type: from E to F with a 1-to-1 assumption
 Each word in E aligns with one word in F
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...
 This is one possible alignment.

Slide 52: Word Alignments
 Here's another:
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...

Slide 53: Word Alignments
 1-to-1 is not the only possible or useful type
 Depending on the language pair, 1-to-many, many-to-1, 1-to-none, etc. are all likely:
  1-to-many: a word aligned to a phrase
  Many-to-1: a phrase aligned to a word
  1-to-none: a word that just isn't there in the other text

Slide 54: Alignment Probabilities
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...
 Assume that all word alignments are equally likely, i.e. that all P(french-word | english-word) are equal
 Recall that we want P(f|e)

Slide 55: Word Alignment
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...
 Assume we're interested in P(la | the). "the" co-occurs with (can align to) 4 distinct French words. If we make each of those equally likely, then P(la | the) = .25:

        la    maison  bleue  fleur
  the   .25   .25     .25    .25

Slide 56: Word Alignment
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...
 But "la" and "the" are observed to co-occur more frequently than expected, so P(la | the) should increase
 That means the other P(x | the) need to decrease (to still sum to 1)

Slides 57-59: Word Alignment (continued)
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...

Slide 60: What?
 What was that? EM:
  1. Start with equiprobable 1-to-1 word alignments
  2. The P() of an alignment is the product of the probabilities of the word alignments that make it up
  3. Count the 1-to-1 word alignments, prorated by the P() of the alignment from which they're gathered
  4. Use those recomputed, discounted counts to recompute the P() of the alignments
  5. Go to 3
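The five-step recipe above is IBM-Model-1-style EM over word translation probabilities. Here is a compact sketch on the deck's la maison / la fleur corpus; it is a variant that marginalizes per word rather than enumerating whole alignments, with uniform initialization (the corpus lists and iteration count are taken from or assumed for this example):

```python
from collections import defaultdict

def model1_em(bitext, iterations=10):
    """EM for word translation probabilities t(f | e).
    bitext: list of (english_words, french_words) sentence pairs."""
    f_vocab = {f for _, fs in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for es, fs in bitext:        # E-step: prorate fractional counts
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalizer for this f
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:         # M-step: renormalize
            t[(f, e)] = count[(f, e)] / total[e]
    return t

corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "blue", "house"], ["la", "maison", "bleue"]),
          (["the", "flower"], ["la", "fleur"])]
t = model1_em(corpus)
# t[("la", "the")] now dominates t[("maison", "the")], as the slides predict.
```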

Slide 61: Word Alignment via EM
 Let's start with a two-sentence aligned corpus.

Slide 62: Alignment Probs
 Define the P() of a sentence alignment as the product of its component word translation probabilities.

Slide 63: Alignment Probs (2)
 But each of those needs to be normalized...
 Nothing very surprising here: the alignments are equally likely, so they each normalize to 50/50, since there are two per sentence (in this case).

Slide 64: Word Translation Probs
 Now that we "know" the alignment probabilities, we can gather and prorate the individual translation counts.

Slide 65: Word Translation Probs
 To turn those into probabilities, we just count and divide to get new conditional probabilities
 Note: these all started at 1/3. All the right ones have gone up, and some of the wrong ones have gone down.

Slide 66: New Alignment Probs
 Now we can use these new word translation probs to derive new sentence alignment probs
  ... and thereby get new adjusted counts
  ... to get new word translation probs
  ... to get new alignment probs ...

Slide 67: Discovering Phrases
 OK, so now we have a good idea of which words translate to which other words
 Next we need to use that to get phrases and phrase translation probabilities
 There are lots of (ad hoc?) schemes for doing this
 Symmetrizing alignments works by first aligning in both directions: (E, F) and (F, E)

Slide 68: Discovering Phrases (1)
 Align both ways, then intersect to get high-precision alignments.

Slide 69: Discovering Phrases (2)
 From these high-precision points, add word alignments from the union of the original alignments.

Slide 70: Discovering Phrases (3)
 These initial phrase alignments can then be grown by fusing the word alignments such that:
  Each proposed phrase alignment includes all the words in the component phrase alignments on each side (i.e., don't split adjacent alignment pairs)
  Words are included as necessary even if they were not in the original set
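The symmetrization idea can be sketched as: start from the intersection (high precision), then grow toward the union by admitting points adjacent to an already-aligned point. This is a simplified grow-style heuristic, not the full grow-diag-final procedure, and the toy alignment points are invented:

```python
def symmetrize(e2f, f2e):
    """e2f, f2e: sets of (e_index, f_index) alignment points from the
    two alignment directions. Returns a symmetrized point set."""
    inter, union = e2f & f2e, e2f | f2e
    aligned = set(inter)           # start from high-precision points
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - aligned):
            # admit a union point if it neighbors an existing point
            if any((i + di, j + dj) in aligned
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                aligned.add((i, j))
                added = True
    return aligned

e2f = {(0, 0), (1, 2), (2, 1)}     # invented one-directional alignments
f2e = {(0, 0), (1, 1), (1, 2)}
a = symmetrize(e2f, f2e)           # here the growth recovers all union points
```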

Slides 71-76: Discovering Phrases (3), continued.

Slide 77: Phrase Translation
 Given such phrases, we can get the required counts for our translation model.
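A common way to turn those counts into phrase translation probabilities is relative frequency over all extracted phrase pairs, i.e. P(f-phrase | e-phrase) = count(f-phrase, e-phrase) / count(e-phrase). A sketch, with invented example pairs:

```python
from collections import Counter

def phrase_probs(extracted_pairs):
    """extracted_pairs: (f_phrase, e_phrase) tuples from phrase extraction.
    Returns {(f_phrase, e_phrase): P(f_phrase | e_phrase)}."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)   # count(e_phrase)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

pairs = [("la maison", "the house"),
         ("la maison", "the house"),
         ("la maison bleue", "the house"),
         ("la fleur", "the flower")]
p = phrase_probs(pairs)
# p[("la maison", "the house")] == 2/3; p[("la fleur", "the flower")] == 1.0
```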

