
1 More about tagging, assignment 2 DAC723 Language Technology Leif Grönqvist 4. March, 2003

2 Part-of-speech tagging (reminder)
We want to assign the right part-of-speech to each word in a corpus
The tagset is determined in advance
The word types in the corpus have various properties in the lexicon or training data
– Some are unambiguous
– Some are ambiguous (typically 2-7 possible parts of speech)
– Some are unknown

3 Various approaches
Rule-based tagging
– Constraint-based tagging (SweTwol, EngTwol by Lingsoft)
– Transformation-based tagging (Eric Brill)
Stochastic tagging (HMM)
– Using maximum likelihood estimation
– Or some bootstrap-based training (e.g. Baum-Welch)

4 An HMM tagger
The problem may be formulated as:
  t_1..n = argmax P(t_1..n | w_1..n)
Which may be reformulated (using Bayes' rule) as:
  t_1..n = argmax P(w_1..n | t_1..n) P(t_1..n) / P(w_1..n)
But the denominator is constant over tag sequences and may be removed, and we get:
  t_1..n = argmax P(w_1..n | t_1..n) P(t_1..n)

5 HMM tagger, cont.
The Markov assumption (for n=3) and the chain rule give us:
  P(t_1..n) ≈ ∏ P(t_i | t_{i-2}, t_{i-1})
  P(w_1..n | t_1..n) ≈ ∏ P(w_i | t_i)
What we need now is:
– the contextual probabilities P(t_i | t_{i-2}, t_{i-1})
– the lexical probabilities P(w_i | t_i)

6 The example: HMM
Word    Seq.1   Seq.2   Seq.3   Seq.4
you     pron    pron    pron    pron
have    verb    verb    verb    verb
to      infmrk  infmrk  infmrk  infmrk
book    noun    noun    verb    verb
a       art     art     art     art
chair   noun    verb    noun    verb
on      prep    prep    prep    prep
deck    noun    noun    noun    noun
Select the sequence with the highest probability!
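
With only a few ambiguous words, the selection can be done by brute force: enumerate every candidate sequence and score each one with the product of lexical and contextual probabilities. A minimal Python sketch; all probability values below are invented for illustration, not taken from the slides:

```python
from itertools import product

# The example sentence; only "book" and "chair" are ambiguous,
# which yields exactly the four candidate sequences above.
words = ["you", "have", "to", "book", "a", "chair", "on", "deck"]
tags = {
    "you": ["pron"], "have": ["verb"], "to": ["infmrk"],
    "book": ["noun", "verb"], "a": ["art"],
    "chair": ["noun", "verb"], "on": ["prep"], "deck": ["noun"],
}

# Invented lexical probabilities P(w|t) for the ambiguous words;
# unambiguous words get 1.0 to keep the sketch short.
lex = {("book", "verb"): 0.002, ("book", "noun"): 0.0005,
       ("chair", "noun"): 0.003, ("chair", "verb"): 0.0001}

# Invented contextual probabilities P(t_i | t_{i-1}); unseen pairs 0.01.
ctx = {("infmrk", "verb"): 0.5, ("infmrk", "noun"): 0.001,
       ("art", "noun"): 0.4, ("art", "verb"): 0.001}

def score(seq):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, seq):
        p *= lex.get((w, t), 1.0) * ctx.get((prev, t), 0.01)
        prev = t
    return p

best = max(product(*(tags[w] for w in words)), key=score)
print(best)
```

Brute force is exponential in the number of ambiguous words, which is exactly why the assignment asks for the Viterbi algorithm instead.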

7 Overview of assignment 2
Parameter estimation (training)
– MLE
– Smoothing
Implementation
– Data structures for all the probabilities
– The Viterbi algorithm
Testing
– Create a random sample from the test data
– Calculate the accuracy rate
Report
– Description and results
– Hand it in before the end of March

8 Parameter estimation: MLE
We need:
– Contextual probabilities: P(t_i | t_{i-1})
– Lexical probabilities: P(w | t)
We assume that test data will have the same properties as training data, so:
– P(t_i | t_{i-1}) = f(t_{i-1}, t_i) / f(t_{i-1})
– P(w | t) = f(w, t) / f(t)
The frequencies are taken from the training data
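
The two MLE formulas can be sketched in a few lines of Python. The toy corpus (lists of word/tag pairs) and the `<s>` sentence-start pseudo-tag are illustrative assumptions, not part of the assignment data format:

```python
from collections import Counter

# Toy training corpus: one (word, tag) pair per token
corpus = [
    [("han", "PN"), ("ser", "VB"), ("en", "DT"), ("bok", "NN")],
    [("han", "PN"), ("ser", "VB"), ("en", "DT"), ("stol", "NN")],
]

tag_freq = Counter()       # f(t)
pair_freq = Counter()      # f(t_{i-1}, t_i)
word_tag_freq = Counter()  # f(w, t)

for sent in corpus:
    prev = "<s>"           # sentence-start pseudo-tag
    tag_freq[prev] += 1    # so f(<s>) counts sentence starts
    for word, tag in sent:
        tag_freq[tag] += 1
        pair_freq[(prev, tag)] += 1
        word_tag_freq[(word, tag)] += 1
        prev = tag

def p_contextual(t, prev):  # P(t_i | t_{i-1}) = f(t_{i-1}, t_i) / f(t_{i-1})
    return pair_freq[(prev, t)] / tag_freq[prev]

def p_lexical(w, t):        # P(w | t) = f(w, t) / f(t)
    return word_tag_freq[(w, t)] / tag_freq[t]

print(p_contextual("NN", "DT"))  # 1.0: DT is always followed by NN here
print(p_lexical("bok", "NN"))    # 0.5: half of the NN tokens are "bok"
```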

9 The training data
(" " (AB "ändå"))
(" " (VB PRS AKT "bära"))
(" " (PN UTR SIN DEF SUB "han"))
(" " (PP "på"))
(" " (DT UTR SIN IND "en"))
(" " (NN UTR SIN IND NOM "misstanke"))
(" " (SN "att"))
(" " (NN UTR PLU DEF NOM "chans"))
(" " (PP "till"))
(" " (NN NEU SIN IND NOM "sommarjobb"))
(" " (VB PRT AKT "ha"))
(" " (VB SUP AKT "se"))
(" " (AB "annorlunda"))
(" " (AB "ut"))
(" " (SN "om"))
(" " (PN UTR SIN DEF SUB "han"))
(" " (VB SUP AKT "vara"))
(" " (JJ POS UTR SIN IND NOM "ljushyad"))
(" " (DL MAD "."))

10 Smoothing
We need smoothing to make sure there are sequences with non-zero probability:
– P(t_i | t_{i-1}) > 0 for all tag pairs (t_{i-1}, t_i)
– For every word w there is at least one tag t such that P(w | t) > 0
Laplace's law (additive smoothing) is good enough for this exercise
The lexical probabilities need smoothing only for unknown words, so accept that, for example, P(misstanke | pron) = 0
To get a better result and a faster tagger, it is also a good idea to smooth unknown words for the open classes only: AB, JJ, NN, PC, PM, RG, VB
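
A minimal sketch of these two smoothing rules, assuming frequency tables of the kind produced by MLE training; the counts in the demo at the end are toy values:

```python
# Laplace (add-one) smoothing for the contextual probabilities,
# so every tag pair (t_{i-1}, t_i) gets a non-zero probability.
def p_contextual_laplace(t, prev, pair_freq, tag_freq, num_tags):
    # P(t | prev) = (f(prev, t) + 1) / (f(prev) + |tagset|)
    return (pair_freq.get((prev, t), 0) + 1) / (tag_freq.get(prev, 0) + num_tags)

# Lexical probabilities: smooth only unknown words, and only for
# the open word classes listed on the slide.
OPEN_CLASSES = {"AB", "JJ", "NN", "PC", "PM", "RG", "VB"}

def p_lexical_smoothed(word, tag, word_tag_freq, tag_freq, vocab):
    if word in vocab:  # known word: plain MLE, possibly zero
        return word_tag_freq.get((word, tag), 0) / tag_freq[tag]
    # unknown word: a small uniform mass, open classes only
    return 1 / (tag_freq[tag] + 1) if tag in OPEN_CLASSES else 0.0

pair_freq = {("DT", "NN"): 2}
tag_freq = {"DT": 2, "NN": 2}
print(p_contextual_laplace("NN", "DT", pair_freq, tag_freq, 3))  # (2+1)/(2+3) = 0.6
print(p_contextual_laplace("VB", "NN", pair_freq, tag_freq, 3))  # (0+1)/(2+3) = 0.2
```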

11 Implementation: data structures
A fast implementation will need fast look-up of probabilities:
– The contextual probabilities could be put in a matrix, but this may not be needed
– The lexical probabilities could be put in a trie if a hash table is not fast enough
All these choices are up to you! They may of course depend on which programming language you are using
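
Hash tables (Python dicts) are one such choice, with constant average-case look-up time. This sketch only shows the look-up pattern; the probability values are invented:

```python
# One simple choice (an assumption, not prescribed by the assignment):
contextual = {}   # (prev_tag, tag) -> probability, dict instead of a matrix
lexical = {}      # word -> {tag: probability}, dict instead of a trie

contextual[("DT", "NN")] = 0.6
lexical["bok"] = {"NN": 0.5}

# A probability look-up in the inner Viterbi loop is then two dict accesses:
p = contextual.get(("DT", "NN"), 0.0) * lexical.get("bok", {}).get("NN", 0.0)
print(p)  # 0.3
```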

12 The Viterbi algorithm
The secret is to keep just the probability and the best trace (highest probability) to each class at the current position
We will loop through the text from position 1 to the end, word by word
For each class in the next position we have to check each class in the current position to be able to find the best choice
– This ends up as a nested loop over the classes inside the loop over positions
Pseudo code on page 179, fig. 5.19 in J&M
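
The loop structure described above can be sketched as follows (a bigram model, using log probabilities to avoid underflow). The tagset and the model in the demo at the end are invented toy values, not the assignment's real model:

```python
import math

def logp(x):
    return math.log(x) if x > 0 else float("-inf")

def viterbi(words, tagset, p_ctx, p_lex):
    # best[t] = (log prob of the best path ending in tag t, that path)
    best = {t: (logp(p_ctx(t, "<s>")) + logp(p_lex(words[0], t)), [t])
            for t in tagset}
    for w in words[1:]:               # loop over positions, word by word
        new_best = {}
        for t in tagset:              # each class at the next position...
            # ...checks each class at the current position
            lp, path = max(((best[prev][0] + logp(p_ctx(t, prev)), best[prev][1])
                            for prev in tagset), key=lambda c: c[0])
            new_best[t] = (lp + logp(p_lex(w, t)), path + [t])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

# Toy model for "to book a chair"; unseen tag pairs get 0.001,
# unseen word/tag pairs get probability 0.
CTX = {("<s>", "infm"): 0.2, ("infm", "verb"): 0.5,
       ("verb", "art"): 0.3, ("art", "noun"): 0.4}
LEX = {("to", "infm"): 0.7, ("book", "verb"): 0.001, ("book", "noun"): 0.00001,
       ("a", "art"): 0.5, ("chair", "noun"): 0.002, ("chair", "verb"): 0.0001}

path = viterbi(["to", "book", "a", "chair"],
               ["infm", "verb", "noun", "art"],
               lambda t, prev: CTX.get((prev, t), 0.001),
               lambda w, t: LEX.get((w, t), 0.0))
print(path)  # ['infm', 'verb', 'art', 'noun']
```

Keeping only the best trace to each class makes the cost linear in the number of positions times the square of the tagset size, instead of exponential.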

13 Viterbi, example
[Tables lost in transcription: a contextual probability matrix over the tags Infm/Verb/Noun, lexical probabilities for "to" and "book", and the Viterbi trellis scores for the phrase "to book a chair".]

14 Test corpus sampling
One way is: take random sentences from the test data until 5000 tokens are reached – no duplicated sentences!
Another: take 10 blocks of 500 tokens each – don't cut in the middle of sentences!
Make sure to save the test corpus, to be able to test on the same corpus several times
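
The first sampling scheme can be sketched like this; the `sentences` argument (a list of token lists) and the fixed seed are assumptions for illustration:

```python
import random

def sample_test_corpus(sentences, target_tokens=5000, seed=0):
    rng = random.Random(seed)   # fixed seed: the same sample every run
    order = list(range(len(sentences)))
    rng.shuffle(order)          # drawing without replacement: no duplicates
    sample, n_tokens = [], 0
    for i in order:
        if n_tokens >= target_tokens:
            break
        sample.append(sentences[i])
        n_tokens += len(sentences[i])
    return sample

# Tiny demo with ten 3-token "sentences" and a 6-token target
sents = [[f"w{i}a", f"w{i}b", f"w{i}c"] for i in range(10)]
picked = sample_test_corpus(sents, target_tokens=6)
print(sum(len(s) for s in picked))  # 6
```

Saving the sample (e.g. pickling `picked` to disk) is what makes repeated runs on the same test corpus possible.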

15 Measuring accuracy
Overall accuracy should be measured
The variance over, say, 10 blocks is also useful: it gives a measure of stability
Calculate the accuracy for a baseline tagger that uses no contextual information
Report
– Description, results and analysis/discussion should be included
– Hand it in (by email) before the end of March
– Tell me where to find your tests, and how to run them!
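
The accuracy, baseline, and variance calculations can be sketched as follows; the tagged data and block accuracies below are invented for illustration:

```python
import statistics

def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def baseline_tagger(words, best_tag, default_tag="NN"):
    # No contextual information: each word gets its most frequent
    # training-data tag; unknown words get a default tag.
    return [best_tag.get(w, default_tag) for w in words]

pred = baseline_tagger(["han", "ser", "en", "zebra"],
                       {"han": "PN", "ser": "NN", "en": "DT"})
print(accuracy(pred, ["PN", "VB", "DT", "NN"]))  # 0.75

# Stability estimate: variance of per-block accuracies over 10 blocks
block_accs = [0.95, 0.94, 0.96, 0.93, 0.95, 0.94, 0.95, 0.96, 0.94, 0.95]
print(statistics.variance(block_accs))
```

The HMM tagger's accuracy should clearly beat this baseline; if it does not, the contextual probabilities are probably not being used correctly.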

