Part of speech (POS) tagging — presentation transcript
1 Part of speech (POS) tagging
Tagging of words in a corpus with the correct part of speech, drawn from some tagset.
Early automatic POS taggers were rule-based.
Stochastic POS taggers are reasonably accurate.
2 Applications of POS tagging
Parsing
recovering syntactic structure requires correct POS tags
partial parsing refers to any syntactic analysis which does not result in a full syntactic parse (e.g. finding noun phrases) - “parsing by chunks”
3 Applications of POS tagging
Information extraction
fill slots in predefined templates with information
a full parse is not needed for this task, but partial parsing results (phrases) can be very helpful
information extraction tags text with grammatical categories in order to find semantic categories
4 Applications of POS tagging
Question answering
system responds to a user question with a noun phrase
Who shot JR? (Kristen Shepard)
Where is Starbucks? (UB Commons)
What is good to eat here? (pizza)
5 Background on POS tagging
How hard is tagging?
most words have just a single tag: easy
some words have more than one possible tag: harder
many common words are ambiguous
Brown corpus: 10.4% of word types are ambiguous; 40%+ of word tokens are ambiguous
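The two Brown-corpus figures measure different things: type ambiguity counts each distinct word once, while token ambiguity counts every occurrence. A minimal sketch of the measurement, using an invented toy corpus rather than the Brown corpus itself:

```python
from collections import defaultdict

# Toy tagged corpus of (word, tag) pairs. The real measurement
# (the slide's Brown corpus figures) requires a large tagged corpus.
tagged = [
    ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
    ("they", "PRP"), ("race", "VB"), ("to", "TO"), ("the", "DT"),
    ("finish", "NN"),
]

# Collect the set of tags observed for each word type.
tags_for = defaultdict(set)
for word, tag in tagged:
    tags_for[word].add(tag)

# A type is ambiguous if it was seen with more than one tag;
# a token is ambiguous if its type is ambiguous.
ambiguous_types = {w for w, ts in tags_for.items() if len(ts) > 1}
ambiguous_tokens = sum(1 for w, _ in tagged if w in ambiguous_types)

print(sorted(ambiguous_types))            # only "race" here
print(ambiguous_tokens / len(tagged))     # fraction of ambiguous tokens
```

In this toy corpus only “race” is ambiguous (NN or VB), yet it accounts for 2 of 9 tokens, which is the same effect the slide describes: few ambiguous types, but they are common words.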
6 Disambiguation approaches
Rule-based
rely on a large set of rules to disambiguate in context
rules are mostly hand-written
Stochastic
rely on probabilities of words having certain tags in context
probabilities derived from a training corpus
Combined
transformation-based tagger: uses a stochastic approach to determine an initial tagging, then uses a rule-based approach to “clean up” the tags
7 Determining the appropriate tag for an untagged word
Two types of information can be used:
syntagmatic information
consider the tags of other words in the surrounding context
a tagger using only such information correctly tagged approx. 77% of words
problem: content words (which are the ones most likely to be ambiguous) typically have many parts of speech, via productive rules (e.g. N → V)
8 Determining the appropriate tag for an untagged word
use information about the word itself (e.g. usage probability)
a baseline for tagger performance is given by a tagger that simply assigns the most common tag to ambiguous words
this baseline correctly tags about 90% of words
modern taggers use a variety of information sources
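The most-common-tag baseline described above is simple enough to sketch directly; the tiny training corpus here is invented for illustration:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: for each word, remember the tag
    it appeared with most often in the training corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Invented training data: "race" appears twice as NN, once as VB.
train = [("the", "DT"), ("race", "NN"), ("race", "NN"),
         ("race", "VB"), ("to", "TO"), ("finish", "VB")]
model = train_baseline(train)
print(model["race"])  # NN
```

Because the baseline ignores context entirely, it tags “race” as NN even after “to”, which is exactly the kind of error the contextual taggers on the following slides are designed to fix.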
9 Note about accuracy measures
Modern taggers claim accuracy rates of around 96% to 97%.
This sounds impressive, but how good are they really?
This is a measure of correctness at the level of individual words, not whole sentences or corpora.
With 96% accuracy, 1 word out of 25 is tagged incorrectly; for sentences of typical length, that is roughly one tagging error per sentence.
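The per-sentence claim follows from simple arithmetic, assuming an average sentence length of about 25 words (an assumption, not stated on the slide):

```python
# Expected tagging errors per sentence at a given word-level accuracy.
accuracy = 0.96
avg_sentence_len = 25  # assumed average length, for illustration
errors_per_sentence = (1 - accuracy) * avg_sentence_len
print(round(errors_per_sentence, 2))  # 1.0
```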
10 Rule-based POS tagging
Two-stage design:
first stage looks up individual words in a dictionary and tags each word with a set of possible tags
second stage uses rules to disambiguate, resulting in singleton tag sets
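The two-stage design can be sketched as follows; the lexicon entries and the single disambiguation rule are invented for illustration, not taken from any real rule-based tagger:

```python
# Stage 1 lexicon: word -> set of possible tags (invented entries).
LEXICON = {
    "the": {"DT"},
    "race": {"NN", "VB"},
    "to": {"TO"},
}

def stage1(words):
    """Dictionary lookup: unknown words get an open-class tag set."""
    return [LEXICON.get(w, {"NN", "VB", "JJ"}) for w in words]

def stage2(words, tagsets):
    """Hand-written rules shrink each tag set to a singleton.
    Illustrative rule: after TO, prefer VB (infinitive context)."""
    result, prev = [], None
    for w, ts in zip(words, tagsets):
        if len(ts) > 1 and prev == "TO" and "VB" in ts:
            ts = {"VB"}
        tag = sorted(ts)[0]  # arbitrary fixed fallback if still ambiguous
        result.append(tag)
        prev = tag
    return result

words = ["to", "race"]
print(stage2(words, stage1(words)))  # ['TO', 'VB']
```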
11 Stochastic POS tagging
Stochastic taggers choose tags that result in the highest probability:
P(word | tag) * P(tag | previous n tags)
Stochastic taggers generally maximize this probability over the whole tag sequence for a sentence.
12 Bigram stochastic tagger
This kind of tagger “…chooses tag ti for word wi that is most probable given the previous tag ti-1 and the current word wi:
ti = argmaxj P(tj | ti-1, wi)   (8.2)” [page 303]
Bayes’ law says: P(T|W) = P(T) P(W|T) / P(W), so
P(tj | ti-1, wi) = P(tj) P(ti-1, wi | tj) / P(ti-1, wi)
Since we take the argmax of this over the tj, and the denominator does not depend on tj, the result is the same as maximizing:
P(tj) P(ti-1, wi | tj)
Rewriting (with the usual independence assumptions):
ti = argmaxj P(tj | ti-1) P(wi | tj)
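The final decision rule is a one-line argmax once the two probability tables exist. In the sketch below, the transition probabilities P(VB|TO) and P(NN|TO) are the ones quoted in the race example later in this deck; the lexical likelihoods P(race|tag) are invented placeholders, since real values would come from corpus counts:

```python
# Bigram tagger decision rule:
#   t_i = argmax_j  P(t_j | t_{i-1}) * P(w_i | t_j)
P_trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}   # P(t_j | t_{i-1})
P_lex = {("race", "VB"): 0.0003,                       # placeholder values,
         ("race", "NN"): 0.0004}                       # not corpus figures

def choose_tag(prev_tag, word, candidates):
    """Pick the candidate tag maximizing transition * lexical likelihood."""
    return max(
        candidates,
        key=lambda t: P_trans.get((prev_tag, t), 0.0)
                      * P_lex.get((word, t), 0.0),
    )

print(choose_tag("TO", "race", ["NN", "VB"]))  # VB
```

Even with a lexical likelihood that slightly favors NN, the strong transition probability P(VB|TO) pulls the decision to VB.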
13 Example (page 304)
What tag do we assign to race?
to/TO race/??
the/DT race/??
If we are choosing between NN and VB as tags for race, the quantities to compare are:
P(VB|TO) P(race|VB)
P(NN|TO) P(race|NN)
The tagger will choose the tag for race which maximizes this probability.
14 Example
For the first part, look at the tag sequence probabilities:
P(NN|TO) = 0.021
P(VB|TO) = 0.34
For the second part, look at the lexical likelihoods:
P(race|NN) =
P(race|VB) =
Combining these:
P(VB|TO) P(race|VB) =
P(NN|TO) P(race|NN) =
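The slide leaves the lexical likelihoods blank. Plugging in placeholder values (assumptions for illustration, not corpus figures) shows how the comparison plays out: because P(VB|TO) is more than an order of magnitude larger than P(NN|TO), VB wins unless P(race|VB) is vastly smaller than P(race|NN):

```python
# Transition probabilities from the slide.
p_vb_given_to = 0.34
p_nn_given_to = 0.021

# Lexical likelihoods: placeholders, since the slide leaves them blank.
p_race_given_vb = 0.0003
p_race_given_nn = 0.0004

score_vb = p_vb_given_to * p_race_given_vb
score_nn = p_nn_given_to * p_race_given_nn
print("VB" if score_vb > score_nn else "NN")  # VB
```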