2 CSA3202 Human Language Technology HMMs for POS Tagging

3 Acknowledgement  Slides adapted from lecture by Dan Jurafsky

4 Hidden Markov Model Tagging
- Using an HMM to do POS tagging is a special case of Bayesian inference
- Foundational work in computational linguistics:
  - Bledsoe 1959: OCR
  - Mosteller and Wallace 1964: authorship identification
- It is also related to the "noisy channel" model that's the basis for ASR, OCR and MT

5 Noisy Channel Model for Machine Translation (Brown et al. 1990)
[Figure: source sentence → noisy channel → target sentence]

6 POS Tagging as Sequence Classification
- Probabilistic view:
  - Given an observation sequence O = w_1 … w_n
    - e.g. Secretariat is expected to race tomorrow
  - What is the best sequence of tags T that corresponds to this sequence of observations?
- Consider all possible sequences of tags
- Out of this universe of sequences, choose the tag sequence which maximises P(T|O)

7 Getting to HMMs
- i.e. out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest
- Hat ^ means "our estimate of the best one"
- argmax_x f(x) means "the x such that f(x) is maximized"
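
The equation this slide refers to appears only as an image in the original deck; written out, it is the standard formulation:

```latex
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
```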

8 Getting to HMMs
- But how do we make this operational? How do we compute this value?
- Intuition of Bayesian classification:
  - Use Bayes Rule to transform this equation into a set of other probabilities that are easier to compute

9 Bayes Rule
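
The rule itself is shown as an image in the original deck; written out:

```latex
P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}
```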

10 Recall this picture
[Figure: Venn diagram of events A and B within the sample space, with the overlap A ∩ B]

11 Using Bayes Rule
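
Applied to tagging, the slide's (image-only) derivation is the standard one: the denominator P(w_1^n) is the same for every candidate tag sequence, so it can be dropped from the argmax.

```latex
\hat{t}_1^n
  = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}
  = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)
```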

12 Two Probabilities Involved
1. Probability of a tag sequence
2. Probability of a word sequence given a tag sequence

13 Probability of tag sequence
- We are seeking, by the chain rule of probabilities,
  P(t_1, …, t_n) = P(t_1 | 0) · P(t_2 | t_1) · P(t_3 | t_1, t_2) · … · P(t_n | t_{n-1}, …, t_1)
  where 0 is the beginning of the string
- Next we approximate this product…

14 Probability of tag sequence by N-gram Approximation
- By employing the following assumption:
  P(t_i | t_{i-1}, …, t_1) ≈ P(t_i | t_{i-1})
  i.e. we approximate the entire left context by the previous tag
- So the product is simplified to
  P(t_1 | 0) · P(t_2 | t_1) · P(t_3 | t_2) · … · P(t_n | t_{n-1})
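
A minimal sketch of this bigram product in Python; the table format, tag names and probability values are illustrative assumptions, not taken from the lecture:

```python
def tag_sequence_prob(tags, trans, start="<s>"):
    """Bigram approximation: P(t_1 ... t_n) ≈ P(t_1|<s>) * Π P(t_i|t_{i-1}).
    `trans` maps (previous_tag, tag) -> probability (hypothetical format)."""
    prob, prev = 1.0, start
    for tag in tags:
        prob *= trans.get((prev, tag), 0.0)  # missing bigrams count as probability 0
        prev = tag
    return prob

# Toy numbers, not from the lecture:
trans = {("<s>", "DT"): 0.4, ("DT", "NN"): 0.5, ("NN", "VBZ"): 0.3}
print(tag_sequence_prob(["DT", "NN", "VBZ"], trans))  # ≈ 0.06 (0.4 * 0.5 * 0.3)
```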

15 Generalizing the Approach
- The same trick works for approximating P(w_1^n | t_1^n):
  P(w_1^n | t_1^n) ≈ P(w_1 | t_1) · P(w_2 | t_2) · … · P(w_{n-1} | t_{n-1}) · P(w_n | t_n)
- as seen on the next slide

16 Likelihood and Prior
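
Putting the two approximations together gives the familiar HMM tagging objective, presumably what this slide's image-only equation shows:

```latex
\hat{t}_1^n
  = \operatorname*{argmax}_{t_1^n}
    \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\,
    \underbrace{P(t_1^n)}_{\text{prior}}
  \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```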

17 Two Kinds of Probabilities
1. Tag transition probabilities P(t_i | t_{i-1})
2. Word likelihood probabilities P(w_i | t_i)

18 Estimating Probabilities
- Tag transition probabilities P(t_i | t_{i-1})
- Determiners likely to precede adjs and nouns:
  - That/DT flight/NN
  - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
- Compute P(t_i | t_{i-1}) by counting in a labeled corpus:
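
The counting formula that follows on the slide is the standard maximum-likelihood estimate, P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}). A small sketch of the counting (the corpus format and function name are assumptions for illustration):

```python
from collections import Counter

def transition_probs(tagged_sents):
    """MLE estimate P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) from a tagged corpus.
    `tagged_sents` is assumed to be a list of [(word, tag), ...] sentences."""
    bigrams, unigrams = Counter(), Counter()
    for sent in tagged_sents:
        tags = ["<s>"] + [tag for _, tag in sent]
        unigrams.update(tags[:-1])            # counts of the conditioning tag t_{i-1}
        bigrams.update(zip(tags[:-1], tags[1:]))  # counts of (t_{i-1}, t_i) pairs
    return {bg: n / unigrams[bg[0]] for bg, n in bigrams.items()}

probs = transition_probs([[("That", "DT"), ("flight", "NN")],
                          [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")]])
print(probs[("DT", "NN")])  # 0.5 in this toy corpus
```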

19 Example

20 Two Kinds of Probabilities
- Word likelihood probabilities P(w_i | t_i)
  - VBZ (3sg pres verb) likely to be "is"
- Compute P(is|VBZ) by counting in a labeled corpus:
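
The counting formula here is again the standard MLE estimate (shown as an image in the original deck):

```latex
P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)},
\qquad \text{e.g.}\quad
P(\textit{is} \mid \text{VBZ}) = \frac{C(\text{VBZ}, \textit{is})}{C(\text{VBZ})}
```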

21 Example: The Verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

22 Disambiguating "race"

23 Example
- P(NN|TO) = .00047
- P(VB|TO) = .83
- P(race|NN) = .00057
- P(race|VB) = .00012
- P(NR|VB) = .0027
- P(NR|NN) = .0012
- P(VB|TO) P(NR|VB) P(race|VB) = .00000027
- P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
- So we (correctly) choose the verb reading
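
A quick numeric check of the slide's arithmetic (a sketch, using the probabilities exactly as given above):

```python
# Probabilities from the slide's "race" example
p_vb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"VB reading: {p_vb:.2e}")   # ~2.7e-07
print(f"NN reading: {p_nn:.2e}")   # ~3.2e-10
# The VB reading is roughly 800x more probable, so "race" is tagged VB.
```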

24 Hidden Markov Models
- What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- We approach the concept of an HMM via the simpler notions:
  - Weighted finite-state automaton
  - Markov chain

25 WFST Definition
- A Weighted Finite-State Automaton (WFST) adds probabilities to the arcs
- The probabilities on the arcs leaving any state must sum to one
- A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
- Examples follow

26 Markov Chain for Pronunciation Variants

27 Markov Chain for Weather

28 Probability of a Sequence
- A Markov chain assigns probabilities to unambiguous sequences
- What is the probability of 3 consecutive rainy days?
- The sequence is rainy-rainy-rainy, i.e. the state sequence is 0-3-3-3
- P(0,3,3,3) = a_03 · a_33 · a_33 = 0.2 × (0.6)² = 0.072
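
A one-line check of this arithmetic, assuming (as the slide's numbers imply for the weather chain figure, which is not reproduced here) that a_03 = 0.2 and a_33 = 0.6:

```python
# Assumed transition probabilities: start -> rainy and rainy -> rainy
a03, a33 = 0.2, 0.6
print(a03 * a33 * a33)  # ≈ 0.072
```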

29 Markov Chain: "First-order observable Markov Model"
- A set of states Q = q_1, q_2, …, q_N; the state at time t is q_t
- Transition probabilities:
  - A set of probabilities A = a_01, a_02, …, a_n1, …, a_nn
  - Each a_ij represents the probability of transitioning from state i to state j
  - The set of these is the transition probability matrix A
- Current state only depends on previous state

30 But...
- Markov chains can't represent inherently ambiguous problems
- For that we need a Hidden Markov Model

31 Hidden Markov Model
- For Markov chains, the observed symbols are the same as the states
  - See hot weather: we're in state hot
- But in part-of-speech tagging (and other tasks):
  - The observed symbols are words
  - The hidden states are part-of-speech tags
- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the observed symbols are not the same as the states
- This means we don't know which state we are in

32 Hidden Markov Models
- States Q = q_1, q_2, …, q_N
- Observations O = o_1, o_2, …, o_N
  - Each observation is a symbol from a vocabulary V = {v_1, v_2, …, v_V}
- Transition probabilities
  - Transition probability matrix A = {a_ij}
- Observation likelihoods
  - Output probability matrix B = {b_i(k)}
- Special initial probability vector π
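
A minimal container for these quantities, as a sketch (the field names and dictionary layout are my own, not from the lecture):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    """Minimal holder for the HMM components listed on this slide."""
    states: List[str]                   # Q: hidden states (POS tags)
    vocab: List[str]                    # V: observation symbols (words)
    trans: Dict[str, Dict[str, float]]  # A: trans[i][j] = P(state j | state i)
    emit: Dict[str, Dict[str, float]]   # B: emit[i][w]  = P(word w | state i)
    init: Dict[str, float]              # pi: initial state probabilities
```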

33 Underlying Markov Chain is as before, but we need to add...

34 Observation Likelihoods

35 Observation Likelihoods
- Observation likelihoods P(O|S) link actual observations O with underlying states S
- They are necessary whenever there is not a 1:1 correspondence between the two
- Examples from many different domains where there is uncertainty of interpretation:
  - OCR
  - Speech recognition
  - Spelling correction
  - POS tagging

36 Decoding
- Recall that we need to compute the most probable tag sequence, argmax over t_1 … t_n of Π P(w_i|t_i) P(t_i|t_{i-1})
- We could just enumerate all paths given the input and use the model to assign probabilities to each
  - Many paths!
- Only need to store the most probable path
  - Dynamic programming
  - Viterbi algorithm

37 Viterbi Algorithm
- Inputs:
  - HMM
    - Q: states
    - A: transition probabilities
    - B: observation likelihoods
  - O: observed words
- Output:
  - Most probable state/tag sequence, together with its probability

38 POS Transition Probabilities

39 POS Observation Likelihoods

40 Viterbi Summary
- Create an array:
  - Columns corresponding to inputs (words)
  - Rows corresponding to states (tags)
- Sweep through the array in one pass, filling the columns left to right using our transition and observation probabilities
- Dynamic programming key: we need only store the MAX prob path to each cell (not all paths)
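
A compact sketch of the sweep just described: one column per word, one row per tag, keeping only the best path into each cell. The dictionary-based probability tables and "<s>" start symbol are an assumed format, not the lecture's own code.

```python
import math

def viterbi(words, states, trans, emit, start="<s>"):
    """Most probable tag sequence under a bigram HMM (a sketch).
    trans[(prev_tag, tag)] = P(tag | prev_tag); emit[(tag, word)] = P(word | tag).
    Works in log space; missing entries are treated as probability 0."""
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # trellis[i][tag] = (best log-prob of a path ending in `tag` at word i, backpointer)
    trellis = [dict() for _ in words]
    for tag in states:  # first column: start transition * emission
        trellis[0][tag] = (lp(trans.get((start, tag), 0.0)) +
                           lp(emit.get((tag, words[0]), 0.0)), None)
    for i in range(1, len(words)):
        for tag in states:
            best = max(states, key=lambda prev: trellis[i - 1][prev][0] +
                                                lp(trans.get((prev, tag), 0.0)))
            score = (trellis[i - 1][best][0] + lp(trans.get((best, tag), 0.0)) +
                     lp(emit.get((tag, words[i]), 0.0)))
            trellis[i][tag] = (score, best)

    # Follow backpointers from the best final state.
    last = max(states, key=lambda tag: trellis[-1][tag][0])
    path, logprob = [last], trellis[-1][last][0]
    for i in range(len(words) - 1, 0, -1):
        last = trellis[i][last][1]
        path.append(last)
    return list(reversed(path)), logprob
```

Given transition and emission tables estimated from a tagged corpus (as on the earlier slides), running this on the "Secretariat is expected to race tomorrow" sentence would return the VB reading of "race".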

41 Viterbi Example

42

43 POS-Tagger Evaluation
- Different metrics can be used:
  - Overall error rate with respect to a gold-standard test set
  - Error rates on particular tags: for tag t, per-tag accuracy is c(gold(tok)=t and tagger(tok)=t) / c(gold(tok)=t), and the per-tag error rate is one minus this
  - Error rates on particular words
  - Tag confusions...
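
A small sketch of these metrics over parallel lists of gold and predicted tags (the data representation is assumed, not from the lecture):

```python
from collections import Counter

def error_rates(gold, predicted):
    """Overall error rate and per-tag error rates against a gold standard."""
    total_errs = sum(g != p for g, p in zip(gold, predicted))
    gold_counts = Counter(gold)
    errs_by_tag = Counter(g for g, p in zip(gold, predicted) if g != p)
    per_tag = {t: errs_by_tag[t] / gold_counts[t] for t in gold_counts}
    return total_errs / len(gold), per_tag

overall, per_tag = error_rates(["DT", "NN", "VB"], ["DT", "JJ", "VB"])
print(round(overall, 3))  # 0.333
print(per_tag["NN"])      # 1.0 (the one NN token was mistagged)
```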

44 Evaluation
- The result is compared with a manually coded "Gold Standard"
- Typically accuracy reaches 96-97%
- This may be compared with the result for a baseline tagger (one that uses no context)
- Important: 100% is impossible even for human annotators
  - Inter-annotator agreement problem

45 Error Analysis
- Look at a confusion matrix
- See what errors are causing problems:
  - Noun (NN) vs. Proper Noun (NNP) vs. Adjective (JJ)
  - Preterite (VBD) vs. Participle (VBN) vs. Adjective (JJ)
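
A confusion matrix is just counts of (gold tag, predicted tag) pairs; a minimal sketch (toy data, not from the lecture):

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Counts of (gold tag, predicted tag) pairs; off-diagonal cells are the confusions."""
    return Counter(zip(gold, predicted))

cm = confusion_matrix(["NN", "NN", "JJ", "VBD"], ["NN", "NNP", "JJ", "VBN"])
print(cm[("NN", "NNP")])   # 1: an NN vs NNP confusion
print(cm[("VBD", "VBN")])  # 1: a preterite vs participle confusion
```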

46 POS Tagging Summary
- Parts of speech
- Tagsets
- Part-of-speech tagging
  - Rule-based
  - HMM tagging
    - Markov chains
    - Hidden Markov Models
- Next: Transformation-based tagging

