
1 Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU

2 Parts of Speech
Grammar is stated in terms of parts of speech ('preterminals'):
– classes of words sharing syntactic properties: noun, verb, adjective, …

3 POS Tag Sets
The most influential tag sets were those defined for projects producing large POS-annotated corpora:
Brown corpus
– 1 million words from a variety of genres
– 87 tags
UPenn Tree Bank
– initially 1 million words of Wall Street Journal text; later retagged the Brown corpus
– first POS tags, then full parses
– 45 tags (some distinctions captured in parses)

4 The Penn POS Tag Set
Noun categories:
– NN (common singular)
– NNS (common plural)
– NNP (proper singular)
– NNPS (proper plural)
Verb categories:
– VB (base form)
– VBZ (3rd person singular present tense)
– VBP (present tense, other than 3rd person singular)
– VBD (past tense)
– VBG (present participle)
– VBN (past participle)

5 Some Tricky Cases
Present participles which act as prepositions:
– according/JJ to
Nationalities:
– English/JJ cuisine
– an English/NNP sentence
Adjective vs. participle:
– the striking/VBG teachers
– a striking/JJ hat
– he was very surprised/JJ
– he was surprised/VBN by his wife

6 Tokenization
Any annotated corpus assumes some tokenization. This is relatively straightforward for English:
– generally defined by whitespace and punctuation
– treat a negative contraction as a separate token: do | n't
– treat a possessive as a separate token: cat | 's
– do not split hyphenated terms: Chicago-based
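A minimal sketch of such a tokenizer, assuming straight ASCII apostrophes and simple regular expressions (the function name and exact rules are illustrative, not the tokenizer actually used for the Penn corpora):

```python
import re

def tokenize(text):
    """Whitespace split, then apply the English-specific rules above."""
    tokens = []
    for tok in text.split():
        # split off trailing punctuation: "know." -> "know" + "."
        m = re.match(r"^(.+?)([.,;?!]+)$", tok)
        trail = []
        if m:
            tok, trail = m.group(1), [m.group(2)]
        if (m := re.match(r"^(.+)(n't)$", tok)):
            # negative contraction becomes two tokens: "doesn't" -> "does" + "n't"
            tokens += [m.group(1), m.group(2)]
        elif (m := re.match(r"^(.+)('s)$", tok)):
            # possessive becomes two tokens: "cat's" -> "cat" + "'s"
            tokens += [m.group(1), m.group(2)]
        else:
            # hyphenated terms such as "Chicago-based" are kept whole
            tokens.append(tok)
        tokens += trail
    return tokens

print(tokenize("The Chicago-based cat's owner doesn't know."))
# ['The', 'Chicago-based', 'cat', "'s", 'owner', 'does', "n't", 'know', '.']
```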

7 The Tagging Task
Task: assign a POS to each word.
This is not trivial:
– many words have several tags
– a dictionary only lists the possible POS, independent of context
How about using a parser to determine tags?
– some analyzers (e.g., partial parsers) assume their input is already tagged

8 Why Tag?
POS tagging can help parsing by reducing ambiguity.
It can resolve some pronunciation ambiguities for text-to-speech ("desert").
It can resolve some semantic ambiguities.

9 Simple Models
Natural language is very complex. We don't know how to model it fully, so we build simplified models which provide some approximation to natural language.

10 Corpus-Based Methods
How can we measure how good these models are?
– assemble a text corpus
– annotate it by hand with respect to the phenomenon we are interested in
– compare the annotation with the predictions of our model, for example, how well the model predicts part-of-speech or syntactic structure
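For tagging, "how well the model predicts" is usually reported as per-token accuracy against the hand annotation; a minimal sketch (names illustrative):

```python
def tag_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the hand-annotated tag."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

print(tag_accuracy(["DT", "NN", "VBZ"], ["DT", "NN", "NN"]))  # 2 of 3 correct -> 0.666...
```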

11 Preparing a Good Corpus
To build a good corpus:
– we must define a task people can do reliably (choose a suitable POS set, for example)
– we must provide good documentation for the task so annotation can be done consistently
– we must measure human performance (through dual annotation and inter-annotator agreement)
This often requires several iterations of refinement.
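One common agreement measure (the slide does not name one, so this is an assumption) is Cohen's kappa, which discounts the agreement two annotators would reach by chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed proportion of items the annotators tag identically and p_e is the proportion expected by chance from their individual tag distributions.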

12 Training the Model
How do we build a model?
– we need a goodness metric
– train by hand, by adjusting rules and analyzing errors (e.g., Constraint Grammar)
– train automatically:
  develop new rules
  build a probabilistic model (generally very hard to do by hand)
– the choice of model is affected by the ability to train it (NN)

13 The Simplest Model
The simplest POS model considers each word separately: we tag each word with its most likely part of speech.
– this works quite well: about 90% accuracy when trained and tested on similar texts
– although many words have multiple parts of speech, one POS typically dominates within a single text type
How can we take advantage of context to do better?
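A minimal sketch of this most-frequent-tag baseline, assuming the training corpus is a list of (word, tag) pairs (all names are illustrative):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Map each word to the tag it appears with most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_likely_tag, default="NN"):
    # unknown words fall back to the most common open-class tag
    return [most_likely_tag.get(w, default) for w in words]

corpus = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("the", "DT"), ("dogs", "NNS")]
model = train_baseline(corpus)
print(tag_baseline(["the", "dog", "barks"], model))  # ['DT', 'NN', 'VBZ']
```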

14 A Language Model
To see how we might do better, let us consider a related problem: building a language model.
– a language model can generate sentences following some probability distribution

15 Markov Model
In principle each word we select depends on all the decisions which came before (all preceding words in the sentence).
But we'll make life simple by assuming that each decision depends only on the immediately preceding decision [a first-order Markov Model].
Such a model is representable by a finite state transition network, where T_ij = the probability of a transition from state i to state j.
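Spelled out, the first-order assumption factors the probability of a state sequence into pairwise transitions (a standard formulation; t_0 denotes the start state):

```latex
P(t_1, \ldots, t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) = \prod_{i=1}^{n} T_{t_{i-1}\,t_i}
```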

16 Finite State Network
[Figure: a finite state transition network with start and end states and two emitting states, cat ("meow") and dog ("woof"); the arcs carry transition probabilities such as 0.50, 0.30, and 0.40.]

17 Our Bilingual Pets
Suppose our cat learned to say "woof" and our dog "meow" … they started chatting in the next room … and we wanted to know who said what.

18 Hidden State Network
[Figure: the same network with the states hidden: the cat and dog states between start and end are no longer observable; only the outputs "woof" and "meow" are observed.]

19 How Do We Predict?
When the cat is talking: t_i = cat. When the dog is talking: t_i = dog.
We construct a probabilistic model of the phenomenon and then seek the most likely state sequence S.
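In symbols, with W the observed output sequence, we seek (a standard formulation, reconstructed since the slide's formulas did not survive transcription):

```latex
\hat{S} = \operatorname*{arg\,max}_{S} P(S \mid W) = \operatorname*{arg\,max}_{S} P(S)\,P(W \mid S)
```

where the second step uses Bayes' rule and drops P(W), which does not depend on S.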

20 Hidden Markov Model
Assume the current word depends only on the current tag.
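Together with the Markov assumption on the tags, this yields the usual HMM factorization of the joint probability:

```latex
P(W, T) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
```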

21 HMM for POS Tagging
We can use the same formulas for POS tagging: states ↔ POS tags.

22 Training an HMM
Training an HMM is simple if we have a completely labeled corpus:
– the POS of each word has been marked
– we can directly estimate both P(t_i | t_{i-1}) and P(w_i | t_i) from corpus counts using the Maximum Likelihood Estimator
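A minimal sketch of MLE training from a labeled corpus, assuming sentences are lists of (word, tag) pairs (all names are illustrative):

```python
from collections import Counter, defaultdict

def train_hmm(sentences):
    """Estimate P(tag | previous tag) and P(word | tag) by relative frequency."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"                       # designated start state
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    def normalize(counts):
        return {a: {b: n / sum(c.values()) for b, n in c.items()}
                for a, c in counts.items()}
    return normalize(trans), normalize(emit)

sents = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
P_trans, P_emit = train_hmm(sents)
print(P_trans["<s>"]["DT"], P_emit["NN"]["dog"])  # 1.0 1.0
```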

23 Greedy Decoder
The simplest decoder (tagger):
– assigns tags deterministically from left to right
– selects t_i to maximize P(w_i | t_i) * P(t_i | t_{i-1})
– does not take advantage of right context
Can we do better?
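A minimal sketch of this left-to-right decoder, using the tables produced by the training sketch above (the small floor probability for unseen words is an assumption):

```python
def greedy_decode(words, P_trans, P_emit, tagset):
    """Pick the locally best tag at each position, left to right."""
    tags, prev = [], "<s>"
    for w in words:
        best = max(tagset,
                   key=lambda t: P_trans.get(prev, {}).get(t, 0.0)
                               * P_emit.get(t, {}).get(w, 1e-6))
        tags.append(best)
        prev = best
    return tags

print(greedy_decode(["the", "dog", "barks"], P_trans, P_emit, ["DT", "NN", "VBZ"]))
# ['DT', 'NN', 'VBZ']
```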


25 Performance
Accuracy with a good unknown-word model, trained and tested on the WSJ corpus, is 96.5% to 96.8%.

26 Unknown Words
We have the problem (as with Naive Bayes) of zero counts: words not in the training corpus.
– simplest approach: assume all POS are equally likely for unknown words
– we can make a better estimate by observing that unknown words are very likely open-class words, and most likely nouns
– base P(t|w) for an unknown word on the probability distribution of words which occur once in the corpus
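A minimal sketch of that last idea: use the tag distribution of words seen exactly once (hapax legomena) as a stand-in for unseen words (names illustrative):

```python
from collections import Counter

def unknown_tag_distribution(tagged_corpus):
    """P(tag) estimated from words occurring exactly once in the corpus."""
    word_freq = Counter(w for w, _ in tagged_corpus)
    hapax_tags = Counter(t for w, t in tagged_corpus if word_freq[w] == 1)
    total = sum(hapax_tags.values())
    return {t: n / total for t, n in hapax_tags.items()}

corpus = [("the", "DT"), ("aardvark", "NN"), ("runs", "VBZ"), ("the", "DT")]
print(unknown_tag_distribution(corpus))  # {'NN': 0.5, 'VBZ': 0.5}
```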

27 Unknown Words, cont'd
We can do even better by taking into account the form of a word:
– whether it is capitalized
– whether it is hyphenated
– its last few letters

28 Trigram Models
In some cases we need to look two tags back to find an informative context:
– e.g., conjunctions (N and N, V and V, …)
But there's not enough data for a pure trigram model, so we combine unigram, bigram, and trigram estimates:
– linear interpolation
– backoff
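Linear interpolation, for example, mixes the three maximum-likelihood estimates with weights that sum to one (a standard formulation, not taken verbatim from the slides):

```latex
P(t_i \mid t_{i-2}, t_{i-1}) = \lambda_3 \hat{P}(t_i \mid t_{i-2}, t_{i-1}) + \lambda_2 \hat{P}(t_i \mid t_{i-1}) + \lambda_1 \hat{P}(t_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```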

29 Domain Adaptation
There is a substantial loss in accuracy when shifting to a new domain:
– 8-10% loss in the shift from WSJ to the biology domain
– adding a small annotated sample (200-500 sentences) in the new domain greatly reduces the error
– some reduction is possible without annotated target data (Blitzer, Structural Correspondence Learning)

30 Jet Tagger
The Jet tagger is HMM-based:
– trained on the WSJ
– file pos_hmm.txt

31 Transformation-Based Learning
TBL provides a very different corpus-based approach to part-of-speech tagging.
It learns a set of rules for tagging:
– the result is inspectable

32 TBL Model
TBL starts by assigning each word its most likely part of speech.
Then it applies a series of transformations to the corpus:
– each transformation states some condition and some change to be made to the assigned POS if the condition is met
– for example:
  Change NN to VB if the preceding tag is TO.
  Change VBP to VB if one of the previous 3 tags is MD.
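A minimal sketch of how such transformations might be represented and applied (the rule encoding is an assumption, not Brill's actual implementation):

```python
def apply_rule(tags, from_tag, to_tag, window, trigger):
    """Change from_tag to to_tag wherever trigger appears in the preceding window."""
    out = list(tags)
    for i, t in enumerate(tags):
        if t == from_tag and trigger in tags[max(0, i - window):i]:
            out[i] = to_tag
    return out

# "to increase" mis-tagged NN by the most-likely-tag baseline:
tags = ["TO", "NN"]
tags = apply_rule(tags, "NN", "VB", 1, "TO")  # Change NN to VB if the preceding tag is TO
print(tags)  # ['TO', 'VB']
```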

33 Transformation Templates
Each transformation is based on one of a small number of templates, such as:
– Change tag x to y if the preceding tag is z.
– Change tag x to y if one of the previous 2 tags is z.
– Change tag x to y if one of the previous 3 tags is z.
– Change tag x to y if the next tag is z.
– Change tag x to y if one of the next 2 tags is z.
– Change tag x to y if one of the next 3 tags is z.

34 Training the TBL Model
To train the tagger, using a hand-tagged corpus:
– we begin by assigning each word its most common POS
– we then try all possible rules (all instantiations of one of the templates) and keep the best rule, the one which corrects the most errors
– we repeat this until we can no longer find a rule which corrects some minimum number of errors
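A minimal sketch of this greedy training loop, building on apply_rule from the sketch above and restricted to the preceding-tag template for brevity (all names and defaults are illustrative):

```python
def train_tbl(gold, tags, tagset, min_gain=1, max_rules=20):
    """Greedily select transformations that fix the most errors."""
    rules = []
    for _ in range(max_rules):
        errors = sum(p != g for p, g in zip(tags, gold))
        best_rule, best_tags, best_gain = None, None, 0
        # instantiate the template "change x to y if the preceding tag is z"
        for x in tagset:
            for y in tagset:
                for z in tagset:
                    if x == y:
                        continue
                    cand = apply_rule(tags, x, y, 1, z)
                    gain = errors - sum(p != g for p, g in zip(cand, gold))
                    if gain > best_gain:
                        best_rule, best_tags, best_gain = (x, y, z), cand, gain
        if best_rule is None or best_gain < min_gain:
            break                  # no rule corrects enough errors; stop
        rules.append(best_rule)
        tags = best_tags
    return rules, tags
```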

35 Some Transformations
The first 9 transformations found for the WSJ corpus:

Change | to  | if
NN     | VB  | previous tag is TO
VBP    | VB  | one of previous 3 tags is MD
NN     | VB  | one of previous 2 tags is MD
VB     | NN  | one of previous 2 tags is DT
VBD    | VBN | one of previous 3 tags is VBZ
VBN    | VBD | previous tag is PRP
VBN    | VBD | previous tag is NNP
VBD    | VBN | previous tag is VBD
VBP    | VB  | previous tag is TO

36 TBL Performance
Performance is competitive with a good HMM: accuracy of 96.6% on the WSJ corpus.
Compared to an HMM, TBL is much slower to train, but faster to apply.

