Part-of-speech Tagging cs224n Final project Spring, 2008 Tim Lai.
POS Tagging – 3 general techniques
1. Rule-based system
   Relies on a hand-picked set of rules.
   Performance is not very good.
2. Stochastic methods
   HMM with the Viterbi algorithm to determine the best tagging.
   Uses emission probabilities, i.e. P(word | tag), and transition probabilities, i.e. P(currentTag | prevTag).
   Maximum Entropy models are also useful.
3. Hybrid of the two
   Rule-based system to do POS tagging.
   Uses rule templates and learns useful rules during training.
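The HMM decoding step mentioned above can be sketched roughly as follows. This is a minimal illustration of Viterbi decoding with bigram transitions, not the project's actual code: the function name `viterbi`, the dictionary layout, the `floor` value standing in for real smoothing, and the toy probabilities are all assumptions made for the example.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    start_p[tag]       = P(tag at sentence start)
    trans_p[prev][tag] = P(tag | prev)
    emit_p[tag][word]  = P(word | tag)
    `floor` is a hypothetical stand-in for proper smoothing of unseen events.
    """
    if not words:
        return []

    def lp(p):
        # Work in log space to avoid underflow on long sentences.
        return math.log(max(p, floor))

    # best[i][t] = best log-probability of any tag path ending in t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model with made-up probabilities, for illustration only.
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"dog": 0.05, "barks": 0.5},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

Flooring unseen probabilities keeps the sketch short; a real tagger would more likely use a dedicated unknown-word model or smoothed estimates.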
Simple HMM vs Max-Ent
HMM: uses bigrams for transition probabilities.
Max-Ent: uses simple features such as the previous tag and the current word.
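To make the two baselines concrete, here is a hedged sketch of what each model consumes: relative-frequency bigram transition estimates for the HMM, and a small feature dictionary for the Max-Ent classifier. The names `bigram_transitions` and `simple_features`, the `<s>` start symbol, and the toy corpus are illustrative assumptions; the slides do not say how the estimates were smoothed, so none is applied here.

```python
from collections import Counter, defaultdict

def bigram_transitions(tagged_sentences):
    """Estimate P(tag | prev_tag) by relative frequency (no smoothing)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for _, tag in sentence:
            counts[prev][tag] += 1
            prev = tag
    return {
        prev: {tag: c / sum(following.values()) for tag, c in following.items()}
        for prev, following in counts.items()
    }

def simple_features(words, i, prev_tag):
    """Features for position i: the current word and the previous tag."""
    return {"word=" + words[i].lower(): 1, "prev_tag=" + prev_tag: 1}

# Toy corpus, for illustration only.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("old", "JJ"), ("dog", "NN")],
]
trans = bigram_transitions(corpus)
print(trans["DT"])
print(simple_features(["The", "dog"], 1, "DT"))
```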
Error Analysis
HMM and Max-Ent both perform well when tested on data from the same domain.
Only 6.6% of words were ambiguous, making known words easy to tag.
Accuracy drops when using test data from another domain.
Most errors are caused by unknown words, or by the tagging of words near unknown words.
In sentences without unknown words, accuracy is ~99%.
The most common mistake is mis-tagging JJ (adjective) as NN (noun).
Both taggers need to be enhanced to deal with unknown words.
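A breakdown like the one above could be computed along these lines. This is only a sketch of the bookkeeping, assuming a training vocabulary `vocab` and parallel lists of gold and predicted tags; the function name, bucket names, and toy data are hypothetical and the numbers it prints are not the project's results.

```python
def accuracy_breakdown(gold, predicted, vocab):
    """Split tagging accuracy by known words, unknown words, and
    sentences that contain no unknown words.

    gold:      list of sentences, each a list of (word, tag) pairs
    predicted: list of sentences, each a list of predicted tags
    vocab:     set of words seen in training
    """
    totals = {"known": [0, 0], "unknown": [0, 0], "all_known_sentences": [0, 0]}
    for gold_sent, pred_tags in zip(gold, predicted):
        has_unknown = any(word not in vocab for word, _ in gold_sent)
        for (word, tag), pred in zip(gold_sent, pred_tags):
            hit = int(tag == pred)
            bucket = "known" if word in vocab else "unknown"
            totals[bucket][0] += hit
            totals[bucket][1] += 1
            if not has_unknown:
                totals["all_known_sentences"][0] += hit
                totals["all_known_sentences"][1] += 1
    # None marks a bucket with no tokens in it.
    return {k: (c / n if n else None) for k, (c, n) in totals.items()}

# Toy data, for illustration only.
vocab = {"the", "dog", "barks"}
gold = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("zorgly", "JJ"), ("dog", "NN")],
]
predicted = [["DT", "NN", "VBZ"], ["DT", "NN", "NN"]]
report = accuracy_breakdown(gold, predicted, vocab)
print(report)
```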
Enhancement ideas
For HMM – Transition probabilities can be modeled using trigrams, taking more context into account when a word is unknown.
For Max-Ent – Word shapes, word features, and more context can help.
Results:
HMM – Switching from unigram to bigram helps a lot, but using trigrams doesn't help much.
Max-Ent – Hand-picked features did not help much, but adding prefixes and suffixes was most helpful.
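The word-shape and prefix/suffix features for the Max-Ent tagger might look something like the sketch below. The shape encoding (X for uppercase, x for lowercase, d for digits, with repeated symbols collapsed), the `max_len` of 3, and the function names are assumptions chosen for illustration; the slides do not specify the exact feature definitions.

```python
import re

def word_shape(word):
    """Map a word to a coarse shape, e.g. 'McDonald' -> 'XxXx'."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same symbol so long and short words share shapes.
    return re.sub(r"(.)\1+", r"\1", shape)

def affix_features(word, max_len=3):
    """Shape plus prefixes and suffixes of up to `max_len` characters."""
    feats = {"shape=" + word_shape(word): 1}
    lower = word.lower()
    # Stop one character short of the full word so an affix is never the whole word.
    for n in range(1, min(max_len, len(lower) - 1) + 1):
        feats["prefix%d=%s" % (n, lower[:n])] = 1
        feats["suffix%d=%s" % (n, lower[-n:])] = 1
    return feats

print(word_shape("McDonald"))
print(affix_features("running"))
```

Affixes help with unknown words because endings such as "ing" or "ly" carry tag information even when the word itself was never seen in training.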
Transformation-based tagging
One more idea to try: using rule templates to learn POS tagging rules.
Sample rule templates:
Change tag A to tag B when the [preceding | following] word is tagged Z.
Change tag A to tag B when the tag Z appears within [N] positions of the current word.
Result: Using a very restricted set of rule templates, accuracy went up 0.5%.
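Applying instantiated rules from the two templates above can be sketched as follows. This shows only rule application, not the learning loop that selects rules during training. The `Rule` tuple, the function names, and the choice to test conditions against the tags as they were before the rule fired are assumptions for illustration.

```python
from collections import namedtuple

# offset = -1 means "preceding word", +1 means "following word".
Rule = namedtuple("Rule", "from_tag to_tag context_tag offset")

def apply_rule(tags, rule):
    """Change from_tag to to_tag where the tag at i + offset is context_tag."""
    original = list(tags)  # conditions are checked against the pre-rule tags
    out = list(tags)
    for i, tag in enumerate(original):
        j = i + rule.offset
        if tag == rule.from_tag and 0 <= j < len(original) and original[j] == rule.context_tag:
            out[i] = rule.to_tag
    return out

def apply_window_rule(tags, from_tag, to_tag, context_tag, n):
    """Change from_tag to to_tag where context_tag occurs within n positions."""
    original = list(tags)
    out = list(tags)
    for i, tag in enumerate(original):
        window = original[max(0, i - n):i] + original[i + 1:i + 1 + n]
        if tag == from_tag and context_tag in window:
            out[i] = to_tag
    return out

# Hypothetical rule: retag NN as JJ when the following word is tagged NN.
nn_to_jj = Rule("NN", "JJ", "NN", +1)
print(apply_rule(["DT", "NN", "NN"], nn_to_jj))
print(apply_window_rule(["TO", "NN", "DT", "NN"], "NN", "VB", "TO", 1))
```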
Final results
HMM with bigrams and rule-based adjustments.
Max-Ent with prefix/suffix and word shape features, plus rule-based adjustments.
Max-Ent performs better, with 97% accuracy achievable.