Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.


1 Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391

2 NLP Task I – Determining Part of Speech Tags
The Problem:
  Word   POS listing in Brown Corpus
  heat   noun, verb
  oil    noun
  in     prep, noun, adv
  a      det, noun, noun-proper
  large  adj, noun, adv
  pot    noun

3 NLP Task I – Determining Part of Speech Tags
The Old Solution: depth-first search. If each of n words has k tags on average, try the k^n combinations until one works.
Machine Learning Solutions: automatically learn Part of Speech (POS) assignment. The best techniques achieve 97+% accuracy per word on new material, given large training corpora.

4 What is POS tagging good for?
Speech synthesis: how do you pronounce "lead"? Noun/verb pairs often differ in stress: INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT.
Stemming for information retrieval: knowing a word is a noun tells you it gets plurals, so a search for "aardvarks" can also retrieve "aardvark".
Parsing, speech recognition, etc.: possessive pronouns (my, your, her) are followed by nouns; personal pronouns (I, you, he) are likely to be followed by verbs.

5 Equivalent Problem in Bioinformatics
Durbin et al., Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g. proteins: from the primary structure ATCPLELLLD, infer the secondary structure HHHBBBBBC…

6 Penn Treebank Tagset I
  Tag   Description                             Example
  CC    coordinating conjunction                and
  CD    cardinal number                         1, third
  DT    determiner                              the
  EX    existential there                       there is
  FW    foreign word                            d'hoevre
  IN    preposition/subordinating conjunction   in, of, like
  JJ    adjective                               green
  JJR   adjective, comparative                  greener
  JJS   adjective, superlative                  greenest
  LS    list marker                             1)
  MD    modal                                   could, will
  NN    noun, singular or mass                  table
  NNS   noun, plural                            tables
  NNP   proper noun, singular                   John
  NNPS  proper noun, plural                     Vikings

7 Penn Treebank Tagset II
  Tag    Description            Example
  PDT    predeterminer          both the boys
  POS    possessive ending      friend 's
  PRP    personal pronoun       I, me, him, he, it
  PRP$   possessive pronoun     my, his
  RB     adverb                 however, usually, here, good
  RBR    adverb, comparative    better
  RBS    adverb, superlative    best
  RP     particle               give up
  TO     to                     to go, to him
  UH     interjection           uhhuhhuhh

8 Penn Treebank Tagset III
  Tag    Description                       Example
  VB     verb, base form                   take
  VBD    verb, past tense                  took
  VBG    verb, gerund/present participle   taking
  VBN    verb, past participle             taken
  VBP    verb, sing. present, non-3d       take
  VBZ    verb, 3rd person sing. present    takes
  WDT    wh-determiner                     which
  WP     wh-pronoun                        who, what
  WP$    possessive wh-pronoun             whose
  WRB    wh-adverb                         where, when

9 Simple Statistical Approaches: Idea 1

10 Simple Statistical Approaches: Idea 2
For a string of words W = w_1 w_2 w_3 … w_n, find the string of POS tags T = t_1 t_2 t_3 … t_n which maximizes P(T|W), i.e., the most likely POS tag t_i for each word w_i given its surrounding context.
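Spelled out, the objective is the standard one:

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{t_1 \ldots t_n} P(t_1 t_2 \ldots t_n \mid w_1 w_2 \ldots w_n)
```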

11 The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W): count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.

12 A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand-tagged training data? Assume a uniform distribution of 5000 words and 40 part-of-speech tags.
Rich models often require vast amounts of data.
Good estimates of models with bad assumptions often outperform better models which are badly estimated.
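One rough way to make the back-of-the-envelope calculation concrete, using only the slide's assumptions (5,000 word types, 40 tags, 10^6 hand-tagged tokens):

```latex
\begin{aligned}
P(w \mid t)                     &: 5000 \times 40 = 200{,}000 \text{ parameters} &&\Rightarrow \approx 5 \text{ training tokens per parameter}\\
P(t_i \mid t_{i-1})             &: 40 \times 40 = 1600 \text{ parameters}        &&\Rightarrow \approx 625 \text{ per parameter}\\
P(t_i \mid t_{i-2}, t_{i-1})    &: 40^3 = 64{,}000 \text{ parameters}            &&\Rightarrow \approx 15 \text{ per parameter}\\
P(T \mid W) \text{ directly}    &: \text{one parameter per word string}          &&\Rightarrow \text{hopelessly sparse}
\end{aligned}
```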

13 A Practical Statistical Tagger
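The derivation behind a practical tagger is the standard Bayes-rule rewrite of the objective, dropping the denominator P(W) because it is constant for a fixed word string:

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
```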

14 A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so… Again, we change to a model that we CAN estimate:
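The usual approximations at this step are a tag-bigram (Markov) model for P(T) and conditional independence of each word given its own tag for P(W|T):

```latex
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
\qquad
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
```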

15 A Practical Statistical Tagger III
So, for a given string W = w_1 w_2 w_3 … w_n, the tagger needs to find the string of tags T which maximizes:
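Putting the two approximations together, the quantity being maximized is:

```latex
\hat{T} = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```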

16 Training and Performance
To estimate the parameters of this model, given an annotated training corpus, we use relative-frequency counts (see the estimates below). Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
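The maximum-likelihood (relative-frequency) estimates, where C(·) counts occurrences in the annotated corpus:

```latex
P(t_i \mid t_{i-1}) \approx \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}
\qquad
P(w_i \mid t_i) \approx \frac{C(t_i, w_i)}{C(t_i)}
```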

17 Hidden Markov Models
This model is an instance of a Hidden Markov Model. Viewed graphically, the states are tags (Det, Adj, Noun, Verb) connected by transition probabilities, and each state has an emission distribution over words, e.g.:
  P(w|Det):  a .4, the .4
  P(w|Adj):  good .02, low .04
  P(w|Noun): price .001, deal .0001
[Figure: state-transition diagram over the tags Det, Adj, Noun, Verb, with transition probabilities on the arcs.]

18 Viewed as a generator, an HMM walks from state to state according to the transition probabilities and emits one word from each state it visits according to that state's emission distribution:
  P(w|Det):  a .4, the .4
  P(w|Adj):  good .02, low .04
  P(w|Noun): price .001, deal .0001
[Figure: the same Det/Adj/Noun/Verb state diagram, viewed as a generator.]
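A minimal sketch of the generator view in Python. The emission probabilities are the ones readable in the slide's figure; the transition probabilities and the catch-all <other> word are illustrative assumptions of mine, not values from the slide:

```python
import random

TRANSITIONS = {            # P(next state | current state) -- assumed, illustrative values
    "<s>":  {"Det": 0.7, "Adj": 0.1, "Noun": 0.2},
    "Det":  {"Adj": 0.4, "Noun": 0.6},
    "Adj":  {"Adj": 0.1, "Noun": 0.9},
    "Noun": {"Verb": 0.5, "</s>": 0.5},
    "Verb": {"Det": 0.6, "</s>": 0.4},
}
EMISSIONS = {              # P(word | state); listed values from the slide, rest lumped into <other>
    "Det":  {"a": 0.4, "the": 0.4, "<other>": 0.2},
    "Adj":  {"good": 0.02, "low": 0.04, "<other>": 0.94},
    "Noun": {"price": 0.001, "deal": 0.0001, "<other>": 0.9989},
    "Verb": {"<other>": 1.0},
}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r <= total:
            return value
    return value  # guard against floating-point rounding

def generate():
    """Run the HMM as a generator: walk states, emit one word per state visited."""
    state, pairs = "<s>", []
    while True:
        state = sample(TRANSITIONS[state])
        if state == "</s>":
            return pairs
        pairs.append((state, sample(EMISSIONS[state])))

if __name__ == "__main__":
    print(generate())   # e.g. [('Det', 'the'), ('Adj', '<other>'), ('Noun', 'price')]
```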

19 Recognition using an HMM

20 A Practical Statistical Tagger IV
Finding this maximum by brute-force search through all possible tag strings T takes exponential time. However, there is a linear-time solution using dynamic programming, called Viterbi decoding.

21 Parameters of an HMM
States: a set of states S = s_1, …, s_n.
Transition probabilities: A = a_{1,1}, a_{1,2}, …, a_{n,n}. Each a_{i,j} represents the probability of transitioning from state s_i to s_j.
Emission probabilities: a set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by s_i.
Initial state distribution: π_i is the probability that s_i is a start state.
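In the compact notation used on the following slides (following Rabiner), the whole parameter set is written λ:

```latex
\lambda = (A, B, \pi), \qquad
a_{i,j} = P(q_{t+1} = s_j \mid q_t = s_i), \qquad
b_i(o_t) = P(o_t \mid q_t = s_i), \qquad
\pi_i = P(q_1 = s_i)
```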

22 The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM model λ = (A, B, π), how do we compute the probability of O given the model, P(O | λ)?
Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM model λ, how do we find the state sequence that best explains the observations?
(This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)

23 The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

24 Problem 1: Probability of an Observation Sequence
What is P(O | λ)?
The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this and to Problem 2 is to use dynamic programming.

25 The Trellis

26 Forward Probabilities
What is the probability that, given an HMM, at time t the state is i and the partial observation o_1 … o_t has been generated?
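That quantity is the forward probability:

```latex
\alpha_t(i) = P(o_1 o_2 \ldots o_t,\; q_t = s_i \mid \lambda)
```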

27 Forward Probabilities

28 Forward Algorithm
Initialization, induction, and termination:
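In the standard notation, these three steps are:

```latex
\text{Initialization:}\quad \alpha_1(i) = \pi_i\, b_i(o_1), \qquad 1 \le i \le N
\text{Induction:}\quad \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{i,j} \Big] b_j(o_{t+1}), \qquad 1 \le t \le T-1,\; 1 \le j \le N
\text{Termination:}\quad P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
```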

29 Forward Algorithm Complexity
The naïve approach takes O(2T · N^T) computation.
The forward algorithm, using dynamic programming, takes O(N^2 T) computations.
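A minimal NumPy sketch of the forward algorithm using the λ = (A, B, π) layout from slide 21 (array shapes, names, and the toy numbers are mine, not the slide's). Each of the T steps does an N×N matrix-vector product, which is where the O(N^2 T) cost comes from:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(obs | lambda) and the alpha trellis.

    A   : (N, N) transition matrix, A[i, j] = P(next state j | state i)
    B   : (N, M) emission matrix,   B[i, k] = P(symbol k | state i)
    pi  : (N,)   initial state distribution
    obs : sequence of observation symbol indices, length T
    """
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # induction: O(N^2) per time step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                 # termination

# Tiny usage example with made-up numbers (2 states, 2 symbols):
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
prob, _ = forward(A, B, pi, [0, 1, 0])
print(prob)
```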

30 Backward Probabilities
What is the probability that, given an HMM and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?
Analogous to the forward probability, just in the other direction.
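That quantity is the backward probability:

```latex
\beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = s_i,\; \lambda)
```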

31 Backward Probabilities

32 Backward Algorithm
Initialization, induction, and termination:
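In the same notation:

```latex
\text{Initialization:}\quad \beta_T(i) = 1, \qquad 1 \le i \le N
\text{Induction:}\quad \beta_t(i) = \sum_{j=1}^{N} a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1
\text{Termination:}\quad P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)
```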

33 Problem 2: Decoding
The forward algorithm efficiently computes the sum over all paths through an HMM.
Here, we instead want to find the single highest-probability path.
We want to find the state sequence Q = q_1 … q_T such that:
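```latex
Q^{*} = \arg\max_{Q} P(Q \mid O, \lambda) = \arg\max_{Q} P(Q, O \mid \lambda)
```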

34 Viterbi Algorithm
Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, we compute the maximum:
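The two recursions differ only in replacing the sum with a max:

```latex
\text{Forward:}\quad \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{i,j} \Big] b_j(o_{t+1})
\qquad
\text{Viterbi:}\quad \delta_{t+1}(j) = \Big[ \max_{1 \le i \le N} \delta_t(i)\, a_{i,j} \Big] b_j(o_{t+1})
```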

35 Core Idea of Viterbi Algorithm

36 Viterbi Algorithm
Initialization, induction, termination, and reading out the path:
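The standard statement of the four steps, with ψ recording backpointers, followed by a small Python sketch (the function, variable names, and toy numbers are mine):

```latex
\text{Initialization:}\quad \delta_1(i) = \pi_i\, b_i(o_1), \qquad \psi_1(i) = 0
\text{Induction:}\quad \delta_t(j) = \max_i \big[ \delta_{t-1}(i)\, a_{i,j} \big] b_j(o_t), \qquad
  \psi_t(j) = \arg\max_i \big[ \delta_{t-1}(i)\, a_{i,j} \big]
\text{Termination:}\quad P^{*} = \max_i \delta_T(i), \qquad q_T^{*} = \arg\max_i \delta_T(i)
\text{Read out path:}\quad q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \qquad t = T-1, \ldots, 1
```

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state sequence for obs under HMM lambda = (A, B, pi)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # read out the path via backpointers
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())

# Usage with the same toy numbers as the forward example:
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 0]))   # -> ([0, 1, 0], 0.046656)
```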

37 Problem 3: Learning
Up to now we've assumed that we know the underlying model λ = (A, B, π).
Often these parameters are estimated on annotated training data, but:
  Annotation is often difficult and/or expensive.
  Training data is often different from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ̂ that maximizes P(O | λ).

38 Problem 3: Learning (If Time Allows…)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ = argmax_λ P(O | λ).
But it is possible to find a local maximum: given an initial model λ, we can always find a model λ̂ such that P(O | λ̂) ≥ P(O | λ).

39 Forward-Backward (Baum-Welch) Algorithm
Key idea: parameter re-estimation by hill-climbing.
From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence O was generated by the model λ.

40 Parameter Re-estimation
Three parameters need to be re-estimated:
  Initial state distribution: π_i
  Transition probabilities: a_{i,j}
  Emission probabilities: b_i(o_t)

41 Re-estimating Transition Probabilities
What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?
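That probability is usually written ξ_t(i,j), and it is computed from the forward and backward probabilities:

```latex
\xi_t(i,j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)
           = \frac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}
```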

42 Re-estimating Transition Probabilities

43 Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is: the expected number of transitions from state s_i to state s_j, divided by the expected number of transitions out of state s_i.
Formally:
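```latex
\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i,j')}
```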

44 Re-estimating Transition Probabilities
Defining γ_t(i) as the probability of being in state s_i at time t, given the complete observation O, we can say:
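```latex
\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \sum_{j=1}^{N} \xi_t(i,j) \;\; (\text{for } t < T)
\qquad\Rightarrow\qquad
\hat{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
```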

45 Re-estimating Initial State Probabilities
Initial state distribution: π_i is the probability that s_i is a start state.
Re-estimation is easy: the new estimate is just the expected relative frequency of being in state s_i at time t = 1.
Formally:
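```latex
\hat{\pi}_i = \gamma_1(i)
```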

46 Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as the expected number of times in state s_i observing symbol v_k, divided by the expected number of times in state s_i.
Formally (see below), where δ(o_t, v_k) is 1 if o_t = v_k and 0 otherwise.
Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!
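```latex
\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)},
\qquad
\delta(o_t, v_k) = \begin{cases} 1 & \text{if } o_t = v_k \\ 0 & \text{otherwise} \end{cases}
```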

47 The Updated Model
Coming from λ = (A, B, π), we get to λ̂ = (Â, B̂, π̂) by the update rules for â_{i,j}, b̂_i(k), and π̂_i given on the preceding slides.

48 Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm.
The E step: compute the forward and backward probabilities for a given model.
The M step: re-estimate the model parameters.
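A compact sketch of one EM iteration for a discrete HMM, tying together the E step (forward/backward trellises) and the M step (the re-estimation formulas from slides 41-47). Array names, shapes, and the toy numbers are my own, not the slides':

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """E step: forward (alpha) and backward (beta) trellises for one observation sequence."""
    N, T = A.shape[0], len(obs)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(A, B, pi, obs):
    """One EM iteration: re-estimate (A, B, pi) from expected counts."""
    N, M, T = A.shape[0], B.shape[1], len(obs)
    alpha, beta = forward_backward(A, B, pi, obs)
    likelihood = alpha[-1].sum()                       # P(O | lambda)
    gamma = alpha * beta / likelihood                  # gamma[t, i]
    xi = np.zeros((T - 1, N, N))                       # xi[t, i, j]
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    # M step: the re-estimation formulas from the preceding slides.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, likelihood

# Usage: iterating baum_welch_step raises P(O | lambda) until a local maximum.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0, 0, 1]
for _ in range(5):
    A, B, pi, ll = baum_welch_step(A, B, pi, obs)
    print(ll)
```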

