Natural Language Processing Lecture 8—9/24/2013 Jim Martin.


Today
- Review
- Word classes
- Part-of-speech tagging
- HMMs
  - Basic HMM model
  - Decoding
  - Viterbi

Word Classes: Parts of Speech
- 8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags...
- Lots of debate within linguistics about the number, nature, and universality of these
- We'll completely ignore this debate.

POS Tagging
- The process of assigning a part-of-speech or lexical class marker to each word in a collection:

    WORD   TAG
    the    DET
    koala  N
    put    V
    the    DET
    keys   N
    on     P
    the    DET
    table  N

Penn TreeBank POS Tagset

POS Tagging
- Words often have more than one part of speech: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word in context (usually a sentence)

POS Tagging
- Note this is distinct from the task of identifying which sense of a word is being used given a particular part of speech. That's called word sense disambiguation. We'll get to that later.
  - "... backed the car into a pole"
  - "... backed the wrong candidate"

How Hard is POS Tagging? Measuring Ambiguity

Two Methods for POS Tagging
1. Rule-based tagging (ENGTWOL; Section 5.4)
2. Stochastic: probabilistic sequence models
   - HMM (Hidden Markov Model) tagging
   - MEMMs (Maximum Entropy Markov Models)

POS Tagging as Sequence Classification
- We are given a sentence (an "observation" or "sequence of observations"):
  - Secretariat is expected to race tomorrow
- What is the best sequence of tags that corresponds to this sequence of observations?
- Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1...w_n

Getting to HMMs
- We want, out of all sequences of n tags t_1...t_n, the single tag sequence such that P(t_1...t_n | w_1...w_n) is highest
- The hat (^) means "our estimate of the best one"
- argmax_x f(x) means "the x such that f(x) is maximized"
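The equation on this slide appeared as an image in the original; reconstructed from the surrounding text, it is:

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)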

Getting to HMMs
- This equation should give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian inference: use Bayes' rule to transform the equation into a set of probabilities that are easier to compute (and still give the right answer)

Using Bayes' Rule
- Know this.
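The derivation itself was an image on the slide; it is the standard Bayes'-rule rewrite of the equation above, where the denominator can be dropped because it is constant across candidate tag sequences:

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
                = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)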

Likelihood and Prior
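The content of this slide was a figure; in the textbook's treatment, two simplifying assumptions make the likelihood and the prior computable: each word depends only on its own tag, and each tag depends only on the previous tag:

    P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad
    P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

Putting the pieces together:

    \hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})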

Two Kinds of Probabilities
- Tag transition probabilities P(t_i | t_{i-1})
  - Determiners likely to precede adjectives and nouns:
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
- Compute P(NN|DT) by counting in a labeled corpus:
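The count formula was an image on the slide; the maximum-likelihood estimate is:

    P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}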

Two Kinds of Probabilities
- Word likelihood probabilities P(w_i | t_i)
  - VBZ (3sg pres verb) likely to be "is"
- Compute P(is|VBZ) by counting in a labeled corpus:
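Again the formula itself was an image; the corresponding maximum-likelihood estimate is:

    P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}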

Example: The Verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

Disambiguating "race"

Example
- P(NN|TO) = .00047
- P(VB|TO) = .83
- P(race|NN) = .00057
- P(race|VB) = .00012
- P(NR|VB) = .0027
- P(NR|NN) = .0012
- P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
- P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
- So we (correctly) choose the verb tag for "race"
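A quick Python check of this arithmetic, with the probabilities taken straight from the slide:

    # Probabilities from the slide (estimated from a labeled corpus)
    p = {
        ("NN", "TO"): 0.00047,    # P(NN | TO)
        ("VB", "TO"): 0.83,       # P(VB | TO)
        ("race", "NN"): 0.00057,  # P(race | NN)
        ("race", "VB"): 0.00012,  # P(race | VB)
        ("NR", "VB"): 0.0027,     # P(NR | VB)
        ("NR", "NN"): 0.0012,     # P(NR | NN)
    }

    verb = p[("VB", "TO")] * p[("NR", "VB")] * p[("race", "VB")]
    noun = p[("NN", "TO")] * p[("NR", "NN")] * p[("race", "NN")]
    print(f"race as VB: {verb:.2g}")  # ~2.7e-07
    print(f"race as NN: {noun:.2g}")  # ~3.2e-10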

Break
- Quiz readings: Chapters 1 to 6
  - Chapter 2
  - Chapter 3 (skip 3.4.1, 3.10, 3.12)
  - Chapter 4 (skip 4.7, 4.8-4.11)
  - Chapter 5 (skip 5.5.4, 5.6, 5.8-5.10)
  - Chapter 6 (skip 6.6-6.9)

Hidden Markov Models
- What we've just described is called a Hidden Markov Model (HMM)
- This is a kind of generative model:
  - There is a hidden underlying generator of observable events
  - The hidden generator can be modeled as a network of states and transitions
- We want to infer the underlying state sequence given the observed event sequence

Hidden Markov Models
- States Q = q_1, q_2, ..., q_N
- Observations O = o_1, o_2, ..., o_N
  - Each observation is a symbol from a vocabulary V = {v_1, v_2, ..., v_V}
- Transition probabilities: transition probability matrix A = {a_ij}
- Observation likelihoods: output probability matrix B = {b_i(k)}
- Special initial probability vector π
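As a concrete sketch, here is how these components might be laid out in Python for the ice-cream model of the next slides. The probabilities that appear later in the CHC walk-through (.2, .5, .4, .3, .4, .5) are kept; the remaining values are illustrative guesses chosen so each distribution sums to 1:

    # Two hidden states (Hot, Cold); observations are 1-3 ice creams.
    states = ("Hot", "Cold")

    # Initial probability vector pi: P(Cold|Start) = .2 is from the slides.
    pi = {"Hot": 0.8, "Cold": 0.2}

    # Transition probability matrix A = {a_ij}.
    A = {
        "Hot":  {"Hot": 0.6, "Cold": 0.4},
        "Cold": {"Hot": 0.4, "Cold": 0.6},
    }

    # Observation likelihood matrix B = {b_i(k)}.
    B = {
        "Hot":  {1: 0.2, 2: 0.5, 3: 0.3},
        "Cold": {1: 0.5, 2: 0.4, 3: 0.1},
    }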

HMMs for Ice Cream
- You are a climatologist in the year 2799 studying global warming
- You can't find any records of the weather in Baltimore for the summer of 2007
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Your job: figure out how hot it was each day

Eisner Task
- Given: an ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, ...
- Produce: the hidden weather sequence: H, C, H, H, H, C, C, ...

HMM for Ice Cream

Ice Cream HMM
- Let's just do 131 as the sequence
- How many underlying state (hot/cold) sequences are there? Eight:
    HHH  HHC  HCH  HCC  CCC  CCH  CHC  CHH
- How do you pick the right one?
    argmax P(sequence | 1 3 1)
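A brute-force sketch of that argmax in Python, continuing with the illustrative states, pi, A, B from the earlier sketch. Maximizing the joint P(sequence, observations) picks the same winner as the conditional, since P(observations) is constant across sequences:

    from itertools import product

    obs = (1, 3, 1)

    def joint(seq, obs):
        """P(state sequence, observations) under the model above."""
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for prev, cur, o in zip(seq, seq[1:], obs[1:]):
            p *= A[prev][cur] * B[cur][o]
        return p

    # Enumerate all 2^3 = 8 hidden sequences and keep the most probable.
    best = max(product(states, repeat=3), key=lambda s: joint(s, obs))
    print(best, joint(best, obs))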

Ice Cream HMM
- Let's just do one sequence: CHC
  - Cold as the initial state: P(Cold|Start) = .2
  - Observing a 1 on a cold day: P(1|Cold) = .5
  - Hot as the next state: P(Hot|Cold) = .4
  - Observing a 3 on a hot day: P(3|Hot) = .3
  - Cold as the next state: P(Cold|Hot) = .4
  - Observing a 1 on a cold day: P(1|Cold) = .5
- Product: .2 × .5 × .4 × .3 × .4 × .5 = .0024

POS Transition Probabilities

Observation Likelihoods

Question
- If there are 30 or so tags in the Penn set
- And the average sentence is around 20 words...
- How many tag sequences do we have to enumerate to argmax over in the worst case?
    30^20 (about 3.5 × 10^29)

Decoding
- OK, now we have a complete model that can give us what we need. Recall that we need to get:
    argmax_{t_1...t_n} ∏_i P(w_i|t_i) P(t_i|t_{i-1})
- We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea.
- Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.

Intuition
- Consider a state sequence (tag sequence) that ends at state j with a particular tag T
- The probability of that tag sequence can be broken into two parts:
  - The probability of the BEST tag sequence up through j-1
  - Multiplied by the transition probability from the tag at the end of the j-1 sequence to T, and the observation probability of the word given tag T
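Written as the standard Viterbi recurrence, with v_t(j) the probability of the best path that accounts for the first t observations and ends in state j:

    v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)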

The Viterbi Algorithm

Viterbi Example

Viterbi Summary
- Create an array with:
  - Columns corresponding to inputs
  - Rows corresponding to possible states
- Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs
- The dynamic programming key is that we need only store the MAX prob path to each cell (not all paths), as in the sketch below.
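A compact Python sketch of that sweep, continuing with the illustrative ice-cream model defined earlier. Each trellis column maps a state to its best probability so far plus a backpointer:

    def viterbi(obs, states, pi, A, B):
        """Most probable hidden state sequence for obs (Viterbi decoding)."""
        # First column: initial probability times observation likelihood.
        trellis = [{s: (pi[s] * B[s][obs[0]], None) for s in states}]
        for o in obs[1:]:
            prev, col = trellis[-1], {}
            for s in states:
                # Store only the MAX-probability path into this cell.
                q = max(states, key=lambda q: prev[q][0] * A[q][s])
                col[s] = (prev[q][0] * A[q][s] * B[s][o], q)
            trellis.append(col)
        # Follow backpointers from the best final state.
        state = max(states, key=lambda s: trellis[-1][s][0])
        path = [state]
        for col in reversed(trellis[1:]):
            state = col[state][1]
            path.append(state)
        return list(reversed(path))

    print(viterbi((1, 3, 1), states, pi, A, B))
    # ['Hot', 'Hot', 'Cold'] with these illustrative numbers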

Evaluation
- Once you have your POS tagger running, how do you evaluate it?
  - Overall error rate with respect to a gold-standard test set
  - Error rates on particular tags
  - Error rates on particular words
  - Tag confusions...

Error Analysis
- Look at a confusion matrix
- See what errors are causing problems:
  - Noun (NN) vs. proper noun (NNP) vs. adjective (JJ)
  - Preterite (VBD) vs. participle (VBN) vs. adjective (JJ)

Evaluation
- The result is compared with a manually coded "gold standard"
- Typically accuracy reaches 96-97%
- This may be compared with the result for a baseline tagger (one that uses no context)
- Important: 100% is impossible even for human annotators
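As a sketch of what such an evaluation looks like in code, here is a most-frequent-tag baseline (no context) and an overall accuracy score against a gold standard; the toy data is made up for illustration, not taken from the lecture:

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        """Most-frequent-tag baseline: per-word tag counts, no context."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def accuracy(predicted, gold):
        """Fraction of predicted tags that match the gold standard."""
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    # Illustrative toy data:
    train = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
             ("the", "DT"), ("race", "NN")]
    model = train_baseline(train)
    test_words, gold_tags = ["the", "race"], ["DT", "NN"]
    print(accuracy([model[w] for w in test_words], gold_tags))  # 1.0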

Summary
- Parts of speech
- Tagsets
- Part-of-speech tagging
- HMM tagging
  - Markov chains
  - Hidden Markov Models
  - Viterbi decoding

Next Time
- Three tasks for HMMs:
  - Decoding: the Viterbi algorithm
  - Assigning probabilities to inputs: the Forward algorithm
  - Finding parameters for a model: EM

