
1 Tagging – more details. Reading: D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8; R Dale et al (2000) Handbook of Natural Language Processing, Ch 17; C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10.

2 POS tagging - overview What is a “tagger”? Tagsets How to build a tagger and how a tagger works –Supervised vs unsupervised learning –Rule-based vs stochastic –And some details

3 What is a tagger? Lack of distinction between … –Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger” –The result of running such software, e.g. a tagger for English (based on the such-and-such corpus) Taggers (even rule-based ones) are almost invariably trained on a given corpus “Tagging” usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)

4 Tagging vs. parsing Once the tagger is “trained”, the process consists of straightforward look-up, plus local context (and sometimes morphology) It will attempt to assign a tag to unknown words, and to disambiguate homographs The “tagset” (list of categories) is usually larger, with more distinctions

5 Tagset Parsing usually has basic word-categories, whereas tagging makes more subtle distinctions E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations Parser uses maybe 12-20 categories, tagger may use 60-100

6 Simple taggers Default tagger has one tag per word, and assigns it on the basis of dictionary lookup –Tags may indicate ambiguity but not resolve it, e.g. nvb for noun-or-verb Words may be assigned different tags with associated probabilities –Tagger will assign most probable tag unless –there is some way to identify when a less probable tag is in fact correct Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences)
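To make this concrete, here is a minimal sketch of such a default, lookup-based tagger in Python (the lexicon, tag names and default tag are invented for illustration, not taken from the slides):

```python
# Minimal dictionary-lookup tagger: one tag per known word, a default tag otherwise.
# The lexicon and tag names below are illustrative assumptions.
LEXICON = {
    "the": "DET",
    "dog": "NOUN",
    "barks": "VERB",
    "run": "NVB",   # ambiguity recorded but not resolved (noun-or-verb)
}
DEFAULT_TAG = "NOUN"   # unknown words get a common open-class tag

def tag(sentence):
    """Assign each word its dictionary tag, falling back to the default."""
    return [(w, LEXICON.get(w.lower(), DEFAULT_TAG)) for w in sentence]

print(tag("The dog barks".split()))
# [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```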

7 What probabilities do we have to learn? (a) Individual word probabilities: Probability that a given tag t is appropriate for a given word w –Easy (in principle): learn from training corpus: P(t|w) = C(w,t) / C(w) –Problem of “sparse data”: add a small amount to each count, so we get no zeros
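A small sketch of learning these individual word probabilities from a hand-tagged corpus by counting (the toy corpus below is invented; the zero returned for unseen words is exactly the sparse-data problem mentioned above):

```python
from collections import Counter

# Toy hand-tagged corpus of (word, tag) pairs -- purely illustrative.
corpus = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"),
          ("can", "VERB"), ("rust", "VERB"), ("rust", "NOUN")]

word_tag = Counter(corpus)                  # C(w, t)
word_count = Counter(w for w, _ in corpus)  # C(w)

def p_tag_given_word(tag, word):
    """Maximum-likelihood estimate P(t | w) = C(w, t) / C(w)."""
    if word_count[word] == 0:
        return 0.0   # unseen word: the sparse-data problem the slide mentions
    return word_tag[(word, tag)] / word_count[word]

print(p_tag_given_word("VERB", "can"))  # 2/3
print(p_tag_given_word("NOUN", "can"))  # 1/3
```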

8 (b) Tag sequence probability: Probability that a given tag sequence t_1,t_2,…,t_n is appropriate for a given word sequence w_1,w_2,…,w_n –P(t_1,t_2,…,t_n | w_1,w_2,…,w_n) = ??? –Too hard to calculate for the entire sequence: P(t_1,t_2,t_3,t_4,…) = P(t_2|t_1) × P(t_3|t_1,t_2) × P(t_4|t_1,t_2,t_3) × … –A subsequence is more tractable –A sequence of 2 or 3 should be enough: Bigram model: P(t_1,t_2) = P(t_2|t_1); Trigram model: P(t_1,t_2,t_3) = P(t_2|t_1) × P(t_3|t_2); N-gram model: P(t_1,…,t_n) = ∏_i P(t_i|t_{i−N+1},…,t_{i−1})
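As a sketch of the bigram approximation: given a table of transition probabilities P(t_i|t_{i−1}) (the numbers below are made up, and "<s>" is an assumed sentence-start marker), the probability of a tag sequence is approximated by a product over adjacent tag pairs:

```python
from functools import reduce

# Invented transition probabilities P(next_tag | prev_tag); "<s>" marks sentence start.
TRANS = {
    ("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.8,
    ("NOUN", "VERB"): 0.5, ("VERB", "NOUN"): 0.3,
}

def bigram_sequence_prob(tags):
    """Approximate P(t_1..t_n) as the product of P(t_i | t_{i-1})."""
    pairs = zip(["<s>"] + tags, tags)
    return reduce(lambda p, pair: p * TRANS.get(pair, 0.0), pairs, 1.0)

print(bigram_sequence_prob(["DET", "NOUN", "VERB"]))  # 0.6 * 0.8 * 0.5 = 0.24
```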

9 More complex taggers Bigram taggers assign tags on the basis of sequences of two words (usually assigning a tag to word_n on the basis of word_{n−1}) An nth-order tagger assigns tags on the basis of sequences of n words As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations

10 History (timeline, 1960–2000): Brown Corpus created (EN-US), 1 million words; Brown Corpus tagged; HMM tagging (CLAWS), 93%–95%; Greene and Rubin, rule-based, 70%; LOB Corpus created (EN-UK), 1 million words; DeRose/Church, efficient HMM with sparse data, 95%+; British National Corpus (tagged by CLAWS); POS tagging separated from other NLP; transformation-based tagging (Eric Brill), rule-based, 95%+; tree-based statistics (Helmut Schmid), 96%+; neural network, 96%+; trigram tagger (Kempe), 96%+; combined methods, 98%+; Penn Treebank Corpus (WSJ, 4.5M words); LOB Corpus tagged

11 How do they work? Tagger must be “trained” Many different techniques, but typically … Small “training corpus” hand-tagged Tagging rules learned automatically Rules define most likely sequence of tags Rules based on –Internal evidence (morphology) –External evidence (context)

12 Rule-based taggers Earliest type of tagging: two stages Stage 1: look up word in lexicon to give list of potential POSs Stage 2: apply rules which certify or disallow tag sequences Rules originally handwritten; more recently, Machine Learning methods can be used (cf. transformation-based learning, below)
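A toy sketch of the two-stage scheme: a lexicon proposes candidate tags (stage 1) and handwritten constraints disallow certain tag sequences (stage 2). The lexicon, tags and rules below are invented for illustration:

```python
from itertools import product

# Stage 1: lexicon of candidate tags per word (illustrative).
CANDIDATES = {"the": ["DET"], "can": ["NOUN", "VERB", "AUX"], "rust": ["NOUN", "VERB"]}
# Stage 2: handwritten constraints -- tag bigrams that are disallowed (illustrative).
DISALLOWED = {("DET", "VERB"), ("DET", "AUX")}

def tag(words):
    """Return all candidate tag sequences that survive the constraints."""
    options = [CANDIDATES.get(w, ["NOUN"]) for w in words]
    surviving = []
    for seq in product(*options):
        if not any(pair in DISALLOWED for pair in zip(seq, seq[1:])):
            surviving.append(seq)
    return surviving

print(tag(["the", "can", "rust"]))
# The VERB/AUX readings of "can" are ruled out after a determiner.
```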

13 Stochastic taggers Nowadays, pretty much all taggers are statistics-based and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 60s and 70s) The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)

14 (Hidden) Markov Models Probability calculations imply Markov models: we assume that the probability of a tag depends only on the previous tag (or a short sequence of previous tags) (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past Markov chains can be modelled by finite state automata: the next state in a Markov chain is always dependent on some finite history of previous states The model is “hidden” because the underlying states (the tags) are not directly observed: only the words are, and the tag sequence must be inferred

15 Three stages of HMM training Estimating likelihoods on the basis of a corpus: forward-backward algorithm; “Decoding”, i.e. applying the process to a given input: Viterbi algorithm; Learning (training): Baum-Welch algorithm or iterative Viterbi

16 Forward-backward algorithm Denote by A_t(s) the forward probability P(w_1…w_t, tag_t = s). Claim: A_{t+1}(s) = Σ_q A_t(q) × P(s|q) × P(w_{t+1}|s). Therefore we can calculate all A_t(s) in time O(L×T^n). Similarly, by going backwards, we can get the backward probabilities B_t(s) = P(w_{t+1}…w_L | tag_t = s). Multiplying, we can get A_t(s) × B_t(s) = P(w_1…w_L, tag_t = s). Note that summing this for all states at a time t gives the likelihood of w_1…w_L.
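A sketch of the forward pass for a tiny invented HMM (the states, transition and emission probabilities are assumptions, not from the slides; a real implementation would work in log space or rescale to avoid underflow):

```python
# Tiny toy HMM: states (tags), start P(s), transition P(s'|s), emission P(w|s).
STATES = ["NOUN", "VERB"]
START = {"NOUN": 0.6, "VERB": 0.4}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"fish": 0.6, "swim": 0.4}, "VERB": {"fish": 0.3, "swim": 0.7}}

def forward(words):
    """A[t][s] = P(w_1..w_t, tag_t = s); returns the whole trellis."""
    A = [{s: START[s] * EMIT[s].get(words[0], 0.0) for s in STATES}]
    for w in words[1:]:
        A.append({s: sum(A[-1][q] * TRANS[q][s] for q in STATES) * EMIT[s].get(w, 0.0)
                  for s in STATES})
    return A

A = forward(["fish", "swim"])
print(sum(A[-1].values()))   # likelihood of the whole word sequence
```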

17 Viterbi algorithm (aka dynamic programming) (see J&M p. 177ff) Denote by Q_t(s) the probability of the best tag sequence for w_1…w_t that ends in state s. Claim: Q_{t+1}(s) = max_q Q_t(q) × P(s|q) × P(w_{t+1}|s). Otherwise, appending s to the better prefix would give a path better than Q_{t+1}(s). Therefore, checking all possible states q at time t, multiplying by the transition probability between q and s and the emission probability of w_{t+1} given s, and taking the maximum, gives Q_{t+1}(s). We need to store for each state the previous state in Q_t(s). Find the maximal final state, and reconstruct the path. O(L×T^n) instead of T^L.
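The same kind of toy model can illustrate the Viterbi recursion with backpointers (the model numbers are the same invented ones as in the forward sketch; a practical tagger would use log probabilities):

```python
# Same invented toy HMM as in the forward sketch above.
STATES = ["NOUN", "VERB"]
START = {"NOUN": 0.6, "VERB": 0.4}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"fish": 0.6, "swim": 0.4}, "VERB": {"fish": 0.3, "swim": 0.7}}

def viterbi(words):
    """Return the most probable tag sequence for the word sequence."""
    Q = [{s: START[s] * EMIT[s].get(words[0], 0.0) for s in STATES}]  # Q_1(s)
    back = [{}]                                                        # backpointers
    for w in words[1:]:
        q_t, b_t = {}, {}
        for s in STATES:
            # Best previous state q, maximizing Q_t(q) * P(s|q)
            # (the emission factor is the same for every q).
            best_q = max(STATES, key=lambda q: Q[-1][q] * TRANS[q][s])
            q_t[s] = Q[-1][best_q] * TRANS[best_q][s] * EMIT[s].get(w, 0.0)
            b_t[s] = best_q
        Q.append(q_t)
        back.append(b_t)
    # Find the best final state, then follow backpointers to reconstruct the path.
    last = max(STATES, key=lambda s: Q[-1][s])
    path = [last]
    for b_t in reversed(back[1:]):
        path.append(b_t[path[-1]])
    return list(reversed(path))

print(viterbi(["fish", "swim"]))  # ['NOUN', 'VERB'] for this toy model
```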

18 Baum-Welch algorithm Start with an initial HMM Calculate, using forward-backward, the probability that each hidden state was used at time i, given our observations Re-estimate the HMM parameters Continue until convergence Can be shown to monotonically improve the likelihood
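A compact sketch of the Baum-Welch loop for a tiny invented HMM over a made-up observation sequence: it re-estimates the start, transition and emission probabilities from the expected counts produced by forward-backward. Real implementations use log-space or scaling and many observation sequences:

```python
import random

STATES = ["A", "B"]            # invented hidden states
VOCAB = ["x", "y"]             # invented observation symbols

def normalize(d):
    """Scale a dict of non-negative weights so that it sums to 1."""
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Random initial HMM parameters (a fresh, untrained model).
random.seed(0)
start = normalize({s: random.random() for s in STATES})
trans = {s: normalize({t: random.random() for t in STATES}) for s in STATES}
emit = {s: normalize({w: random.random() for w in VOCAB}) for s in STATES}

obs = ["x", "y", "y", "x", "y"]          # a single observation sequence

def forward_backward(obs):
    """Return the forward and backward trellises for the current model."""
    fwd = [{s: start[s] * emit[s][obs[0]] for s in STATES}]
    for w in obs[1:]:
        fwd.append({s: sum(fwd[-1][q] * trans[q][s] for q in STATES) * emit[s][w]
                    for s in STATES})
    bwd = [{s: 1.0 for s in STATES}]
    for w in reversed(obs[1:]):
        bwd.insert(0, {s: sum(trans[s][q] * emit[q][w] * bwd[0][q] for q in STATES)
                       for s in STATES})
    return fwd, bwd

for iteration in range(10):
    fwd, bwd = forward_backward(obs)
    likelihood = sum(fwd[-1][s] for s in STATES)
    # E-step: gamma[t][s] = P(state s at time t | obs),
    #         xi[t][(q, s)] = P(state q at t and s at t+1 | obs).
    gamma = [{s: fwd[t][s] * bwd[t][s] / likelihood for s in STATES}
             for t in range(len(obs))]
    xi = [{(q, s): fwd[t][q] * trans[q][s] * emit[s][obs[t + 1]] * bwd[t + 1][s] / likelihood
           for q in STATES for s in STATES} for t in range(len(obs) - 1)]
    # M-step: re-estimate the parameters from the expected counts.
    start = {s: gamma[0][s] for s in STATES}
    trans = {q: normalize({s: sum(x[(q, s)] for x in xi) for s in STATES}) for q in STATES}
    emit = {s: normalize({w: sum(g[s] for g, o in zip(gamma, obs) if o == w)
                          for w in VOCAB}) for s in STATES}
    print(f"iteration {iteration}: likelihood {likelihood:.6f}")   # should never decrease
```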

19 Unsupervised learning We have an untagged corpus We may also have partial information such as a set of tags, a dictionary, knowledge of tag transitions, etc. Use Baum-Welch to estimate both the context probabilities and the lexical probabilities

20 Supervised learning Use a tagged corpus Count the frequencies of tag-word pairs t,w: C(t,w) Estimate (Maximum Likelihood Estimate): P(w|t) = C(t,w) / C(t) Count the frequencies of tag n-grams C(t_1…t_n) Estimate (Maximum Likelihood Estimate): P(t_n | t_1…t_{n−1}) = C(t_1…t_n) / C(t_1…t_{n−1}) What about small counts? Zero counts?
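A sketch of these supervised counts and maximum-likelihood estimates over a toy tagged corpus (the sentences and the "<s>" start marker are invented for illustration):

```python
from collections import Counter

# Toy tagged sentences: lists of (word, tag) pairs -- invented for illustration.
sentences = [[("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
             [("dogs", "NOUN"), ("can", "VERB"), ("run", "VERB")]]

tag_word = Counter()    # C(t, w)
tag_count = Counter()   # C(t)
tag_bigram = Counter()  # C(t_{i-1}, t_i), with "<s>" marking sentence start

for sent in sentences:
    prev = "<s>"
    tag_count[prev] += 1
    for w, t in sent:
        tag_word[(t, w)] += 1
        tag_count[t] += 1
        tag_bigram[(prev, t)] += 1
        prev = t
    # a fuller version would also count an end-of-sentence transition here

def p_word_given_tag(w, t):      # MLE: C(t, w) / C(t)
    return tag_word[(t, w)] / tag_count[t]

def p_tag_given_prev(t, prev):   # MLE: C(prev, t) / C(prev)
    return tag_bigram[(prev, t)] / tag_count[prev]

print(p_word_given_tag("can", "VERB"))   # 1/3
print(p_tag_given_prev("NOUN", "DET"))   # 1/1
# Any unseen (tag, word) pair or tag bigram gets probability 0 -- the zero-count problem.
```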

21 Sparse Training Data - Smoothing Adding a bias λ to every count: P(w|t) = (C(t,w) + λ) / (C(t) + λV), where V is the vocabulary size Compensates for estimation error (Bayesian approach) Has larger effect on low-count words Solves the zero-count word problem Generalized smoothing: reduces to the simple bias for a particular choice of parameters
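A sketch of the additive bias, assuming the standard add-λ form (the slide's exact formula is not shown, so the function and the λ value below are assumptions):

```python
LAMBDA = 0.001   # the small bias added to every count (assumed value)

def smoothed_p_word_given_tag(count_tw, count_t, vocab_size, lam=LAMBDA):
    """Additive (add-lambda) smoothing: (C(t,w) + lam) / (C(t) + lam * V).
    Zero counts now get a small non-zero probability, and the bias matters
    proportionally more for low-count events than for frequent ones."""
    return (count_tw + lam) / (count_t + lam * vocab_size)

print(smoothed_p_word_given_tag(0, 50, vocab_size=1000))   # unseen word: small but non-zero
print(smoothed_p_word_given_tag(30, 50, vocab_size=1000))  # frequent word: close to the MLE 0.6
```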

22 Decision-tree tagging Not all n-grams are created equal: –Some n-grams contain redundant information that may be expressed well enough with fewer tags –Some n-grams are too sparse Decision Tree (Schmid, 1994)

23 Decision Trees Each node is a binary test of tag t_{i−k}. The leaves store probabilities for t_i. All HMM algorithms can still be used Learning: –Build tree from root to leaves –Choose tests for nodes that maximize information gain –Stop when branch too sparse –Finally, prune tree
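A small sketch of the tree-growing criterion: score a candidate binary test on the tag context by its information gain (the samples and the candidate test are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of tags, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, test):
    """Gain of splitting (context, tag) samples with a boolean test on the context."""
    tags = [t for _, t in samples]
    yes = [t for ctx, t in samples if test(ctx)]
    no = [t for ctx, t in samples if not test(ctx)]
    split_entropy = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(samples)
    return entropy(tags) - split_entropy

# Toy samples: (previous tag, current tag) pairs -- invented for illustration.
samples = [("DET", "NOUN"), ("DET", "NOUN"), ("DET", "ADJ"),
           ("VERB", "NOUN"), ("VERB", "DET"), ("VERB", "DET")]

test = lambda prev: prev == "DET"     # candidate binary test: is t_{i-1} = DET?
print(information_gain(samples, test))
```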

24 Transformation-based learning Eric Brill (1993) Start from an initial tagging, and apply a series of transformations The transformations are learned as well, from the training data Captures the tagging data in far fewer parameters than stochastic models The transformations learned have linguistic meaning

25 Transformation-based learning Examples: Change tag a to b when: –The preceding (following) word is tagged z –The word two before (after) is tagged z –One of the 2 preceding (following) words is tagged z –The preceding word is tagged z and the following word is tagged w –The preceding (following) word is W

26 Transformation-based Tagger: Learning Start with an initial tagging Score the possible transformations by comparing their result to the “truth” Choose the transformation that maximizes the score Repeat the last two steps
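A toy sketch of this loop for a single transformation template ("change tag a to b when the preceding tag is z"): score every instantiation against the truth, greedily apply the best one, and repeat until no transformation helps. The corpus, tags and initial tagging are invented:

```python
from itertools import product

# Hand-tagged "truth" and a crude initial tagging (both invented for illustration).
words = ["the", "can", "rusts", "and", "dogs", "can", "run"]
truth = ["DET", "NOUN", "VERB", "CONJ", "NOUN", "VERB", "VERB"]
initial = ["DET", "VERB", "VERB", "CONJ", "NOUN", "VERB", "VERB"]  # baseline tagging

TAGS = sorted(set(truth))

def apply_rule(tags, a, b, z):
    """Change tag a to b when the preceding tag is z."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == z:
            out[i] = b
    return out

def errors(tags):
    return sum(t != g for t, g in zip(tags, truth))

current = list(initial)
for _ in range(3):                      # learn up to three transformations
    # Score every instantiation of the template and keep the biggest improvement.
    best = max(product(TAGS, TAGS, TAGS),
               key=lambda rule: errors(current) - errors(apply_rule(current, *rule)))
    if errors(apply_rule(current, *best)) >= errors(current):
        break                           # no transformation helps any more
    print("learned rule: change", best[0], "to", best[1], "after", best[2])
    current = apply_rule(current, *best)

print(current, "errors:", errors(current))
```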

