Conditional Markov Models: MaxEnt Tagging and MEMMs


1 Conditional Markov Models: MaxEnt Tagging and MEMMs
William W. Cohen CALD

2 Review: Hidden Markov Models
[Figure: a four-state HMM (S1–S4) with transition probabilities and per-state emission distributions over the symbols A and C.]
Efficient dynamic programming algorithms exist for: finding Pr(S); finding the highest-probability path P, i.e. the one that maximizes Pr(S, P) (Viterbi); and training the model (Baum-Welch algorithm).
In previous models, Pr(a_i) depended only on the symbols appearing within some preceding distance, but not on the position of the symbol, i.e. not on i. To model drifting/evolving sequences we need something more powerful; hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transition probabilities, as before, but also emission probabilities. Calculating Pr(seq) is not easy, since every symbol can potentially be generated from every state, so there is no single path that generates the sequence; there are multiple paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols. One could enumerate all possible paths and sum their probabilities, but we can do much better by exploiting the Markov property.
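In display form, the joint probability of a state path and the emitted sequence that the notes refer to, written in standard HMM notation (a_{kl} for transitions, e_k(x_i) for emissions; the notation is assumed here, not taken from the slide):

```latex
\Pr(x_1,\dots,x_L,\ \pi_1,\dots,\pi_L)
   \;=\; \prod_{i=1}^{L} a_{\pi_{i-1}\pi_i}\, e_{\pi_i}(x_i)
   \quad\text{(with } \pi_0 = \mathrm{begin}\text{)},
\qquad
\Pr(x) \;=\; \sum_{\pi} \Pr(x,\pi).
```

Summing over all paths naively is exponential in L; the dynamic programming algorithms above exploit the Markov property to avoid that.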

3 HMM for Segmentation Simplest Model: One state per entity type

4 HMM Learning Manually pick the HMM's graph (e.g. the simple model, or fully connected). Learn transition probabilities: Pr(si|sj). Learn emission probabilities: Pr(w|si). Attached to each state is a dictionary, which can be any probabilistic model over the content words of that element; the common easy case is a multinomial model: for each word, attach a probability value, with the probabilities summing to 1. Intuitively, we know that particular words are less important than some top-level features of the words, and these features may overlap. We need to train a joint probability model; maximum entropy provides a viable approach to capture this.

5 Learning model parameters
When training data defines a unique path through the HMM (parameters can be estimated by counting; see the sketch below):
Transition probabilities: the probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i).
Emission probabilities: the probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions from i).
When training data defines multiple paths: a more general EM-like algorithm (Baum-Welch).
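A minimal sketch of the unique-path (fully supervised) case, assuming the training data is given as word/state pairs; all names here are illustrative:

```python
from collections import Counter, defaultdict

def train_hmm(sequences):
    """Estimate HMM parameters by counting, given labeled sequences.

    sequences: list of [(word, state), ...] lists, i.e. the training data
    defines a unique path through the HMM.
    """
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        for (_, s), (_, s_next) in zip(seq, seq[1:]):
            trans[s][s_next] += 1          # count transitions i -> j
        for w, s in seq:
            emit[s][w] += 1                # count emissions of w from state s

    # Normalize counts into probabilities.
    pr_trans = {i: {j: c / sum(cs.values()) for j, c in cs.items()}
                for i, cs in trans.items()}
    pr_emit = {i: {w: c / sum(cs.values()) for w, c in cs.items()}
               for i, cs in emit.items()}
    return pr_trans, pr_emit
```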

6 What is a “symbol” ??? Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ? 4601 => “4601”, “9999”, “9+”, “number”, … ? Datamold: choose best abstraction level using holdout set
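A small sketch of what such symbol abstractions could look like; the regular expressions and function name are illustrative, not Datamold's actual interface:

```python
import re

def abstractions(token):
    """Return candidate symbol representations at increasing abstraction levels."""
    levels = [token, token.lower()]
    shape = re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "9", token)))
    levels.append(shape)                              # e.g. "Cohen" -> "Xxxxx", "4601" -> "9999"
    levels.append(re.sub(r"(.)\1+", r"\1+", shape))   # collapse runs: "Xxxxx" -> "Xx+", "9999" -> "9+"
    levels.append("number" if token.isdigit() else "word")
    return levels

# abstractions("Cohen") -> ['Cohen', 'cohen', 'Xxxxx', 'Xx+', 'word']
```

A holdout set can then be used to pick the abstraction level that generalizes best, as the slide describes.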

7 What is a symbol? Bikel et al. mix symbols from two abstraction levels.

8 What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …
[Figure: HMM states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; example features attached to the observation at time t: is "Wisniewski", part of noun phrase, ends in "-ski".]
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

9 Stupid HMM tricks
[Figure: a start state that goes to state "red" with probability Pr(red) or to state "green" with probability Pr(green); each of those states then loops on itself, with Pr(red|red) = 1 and Pr(green|green) = 1.]

10 Stupid HMM tricks
[Figure: same start / red / green HMM as on the previous slide.]
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y) = argmax_y Pr(y) * Pr(x1|y) * Pr(x2|y) * ... * Pr(xm|y)
Pr("I voted for Ralph Nader" | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)

11 HMM’s = sequential NB

12 From NB to Maxent

13 From NB to Maxent

14 From NB to Maxent Learning: set the alpha parameters to maximize this, the maximum-likelihood model of the data, given that we're using the same functional form as NB. It turns out this is the same as maximizing the entropy of p(y|x) over all distributions that satisfy the feature-expectation constraints.
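The equations from slides 12–14 did not survive extraction; as a sketch, the standard conditional MaxEnt (logistic regression) form with features f_i and weights alpha_i is:

```latex
p_\alpha(y \mid x) \;=\; \frac{1}{Z_\alpha(x)}\,
   \exp\!\Big(\sum_i \alpha_i f_i(x,y)\Big),
\qquad
Z_\alpha(x) \;=\; \sum_{y'} \exp\!\Big(\sum_i \alpha_i f_i(x,y')\Big),
```

and learning maximizes the conditional log likelihood, the sum of log p_alpha(y|x) over the training data.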

15 MaxEnt Comments
Implementation: All methods are iterative. Numerical issues (underflow, rounding) are important. For NLP-like problems with many features, modern gradient-based or Newton-like methods work well, sometimes better(?) and faster than GIS and IIS.
Smoothing: Typically MaxEnt will overfit the data if there are many infrequent features. Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize the log likelihood minus a penalty on the alphas). A sketch of this last option follows.
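A minimal sketch of the Gaussian-prior option, assuming scikit-learn is available; its L2 penalty plays the role of the zero-mean Gaussian prior, and the toy features and tags below are purely illustrative:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: overlapping word features -> tags (illustrative only).
feats = [{"word=Cohen": 1, "is_cap": 1, "suffix=-en": 1},
         {"word=4601": 1, "is_digit": 1},
         {"word=wisniewski": 1, "suffix=-ski": 1},
         {"word=voted": 1, "suffix=-ed": 1}]
tags = ["NNP", "CD", "NNP", "VBD"]

X = DictVectorizer().fit_transform(feats)

# The L2 penalty acts like a zero-mean Gaussian prior on the alphas;
# smaller C means a tighter prior (more smoothing).  A holdout set would
# typically be used to tune C or for early stopping.
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)
clf.fit(X, tags)
```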

16 MaxEnt Comments
Performance: Good MaxEnt methods are competitive with linear SVMs and other state-of-the-art classifiers in accuracy. They can't easily be extended to higher-order interactions (unlike, e.g., kernel SVMs or AdaBoost). Training is relatively expensive.
Embedding in a larger system: MaxEnt optimizes Pr(y|x), not error rate.

17 MaxEnt Comments
MaxEnt competitors: model Pr(y|x) with Pr(y|score(x)), using a score from SVMs, NB, …; regularized Winnow, BPETs, …; ranking-based methods that estimate whether Pr(y1|x) > Pr(y2|x).
Things I don't understand: Why don't we call it logistic regression? Why is it always used to estimate the density of (y,x) pairs rather than a separate density for each class y? When are its confidence estimates reliable?

18 What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …
[Figure: same HMM states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1} as on slide 8.]

19 What is a symbol? (Same feature list and figure as slide 18.) Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.

20 What is a symbol? (Same feature list and figure as slide 18.) Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.

21 What is a symbol? (Same feature list and figure as slide 18.) Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history (sketched below).
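A sketch of what "replace the generative model with a maxent model" amounts to, in the standard conditional-Markov-model form (notation assumed, not taken from the slide):

```latex
\Pr(s_t \mid s_{t-1}, o_t) \;=\;
   \frac{\exp\!\big(\sum_i \alpha_i f_i(o_t, s_{t-1}, s_t)\big)}
        {\sum_{s'} \exp\!\big(\sum_i \alpha_i f_i(o_t, s_{t-1}, s')\big)}
```

so the HMM's separate transition and emission distributions are replaced by a single conditional distribution over the next state, and the features f_i can be arbitrary and overlapping.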

22 Ratnaparkhi's MXPOST Sequential learning problem: predict the POS tags of words. Uses the MaxEnt model described above, with a rich feature set. To smooth, discard features occurring fewer than 10 times (a rough sketch of such features follows).
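A rough sketch of the kind of overlapping contextual features and the count cutoff described above; the templates below are an illustrative subset, not Ratnaparkhi's exact feature set:

```python
from collections import Counter

def features(words, tags, i):
    """Overlapping features for predicting the tag of words[i]."""
    w = words[i]
    return [
        f"w={w}", f"w_lower={w.lower()}",
        f"prefix={w[:4]}", f"suffix={w[-4:]}",
        f"is_cap={w[0].isupper()}", f"has_digit={any(c.isdigit() for c in w)}",
        f"prev_w={words[i-1] if i > 0 else '<BOS>'}",
        f"prev_tag={tags[i-1] if i > 0 else '<BOS>'}",   # previous state as a feature
        f"prev2_tags={tags[i-2] if i > 1 else '<BOS>'}+{tags[i-1] if i > 0 else '<BOS>'}",
    ]

def prune_rare(all_feats, min_count=10):
    """Smoothing by feature selection: drop features seen fewer than min_count times."""
    counts = Counter(f for feats in all_feats for f in feats)
    return [[f for f in feats if counts[f] >= min_count] for feats in all_feats]
```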

23 MXPOST

24 MXPOST: learning & inference
GIS; feature selection.

25 MXPost inference

26 MXPost results State-of-the-art accuracy (for 1996).
The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state-of-the-art). The same approach was used for NER by Borthwick, Malouf, Manning, and others.

27 Alternative inference

28 Finding the most probable path: the Viterbi algorithm (for HMMs)
define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
we want to compute v_end(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
we can define v_k(i) recursively
we can use dynamic programming to find it efficiently

29 Finding the most probable path: the Viterbi algorithm for HMMs
initialization (see the sketch below): Note: this is wrong for delete states; they shouldn't be initialized like this.
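The initialization equation is missing from the transcript; the standard form, with a begin state and the v_k(i) defined above, is:

```latex
v_{\mathrm{begin}}(0) = 1, \qquad v_k(0) = 0 \ \ \text{for all other states } k.
```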

30 The Viterbi algorithm for HMMs
recursion for emitting states (i = 1…L) (see the sketch below):
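The recursion itself is also missing from the transcript; the standard HMM Viterbi recursion for emitting states, in the same notation, is:

```latex
v_l(i) \;=\; e_l(x_i)\,\max_k \big(v_k(i-1)\,a_{kl}\big),
```

where e_l(x_i) is the probability of emitting x_i from state l and a_{kl} is the transition probability from state k to state l.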

31 The Viterbi algorithm for HMMs and Maxent Taggers
recursion for emitting states (i = 1…L) (see the sketch below):
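For a MaxEnt tagger, the transition and emission terms collapse into the single conditional model, so a sketch of the corresponding recursion is:

```latex
v_l(i) \;=\; \max_k \big(v_k(i-1)\,\Pr(l \mid k, x_i)\big),
```

where Pr(l | k, x_i) comes from the MaxEnt classifier with the previous state k as one of its features.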

32 MEMMs Basic difference from ME tagging:
ME tagging: the previous state is a feature of a single MaxEnt classifier.
MEMM: build a separate MaxEnt classifier for each state. Can build any HMM architecture you want, e.g. parallel nested HMMs, etc. Data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun".
Mostly a difference in viewpoint (see the sketch below).
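A minimal sketch of the MEMM viewpoint under these assumptions: one scikit-learn LogisticRegression per previous state, trained only on the fragment of data with that previous state, then combined in a Viterbi-style decode. All names here are illustrative:

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_memm(sequences, feature_fn):
    """sequences: list of [(obs, state), ...]; feature_fn(obs) -> dict of features.
    Trains one MaxEnt classifier per previous state, so the data fragmentation
    mentioned above is explicit: each classifier sees only its own fragment."""
    fragments = defaultdict(lambda: ([], []))   # prev_state -> (feature dicts, next states)
    for seq in sequences:
        prev = "<START>"
        for obs, state in seq:
            X, y = fragments[prev]
            X.append(feature_fn(obs))
            y.append(state)
            prev = state
    return {prev: make_pipeline(DictVectorizer(),
                                LogisticRegression(max_iter=1000)).fit(X, y)
            for prev, (X, y) in fragments.items()}

def decode(models, obs_seq, feature_fn, states):
    """Viterbi-style decoding: v[s] = max over prev of v[prev] * Pr(s | prev, obs)."""
    v, backptrs = {"<START>": 1.0}, []
    for obs in obs_seq:
        nxt, ptr = {}, {}
        for prev, score in v.items():
            clf = models.get(prev)
            if clf is None:                     # previous state never seen in training
                continue
            probs = dict(zip(clf.classes_, clf.predict_proba([feature_fn(obs)])[0]))
            for s in states:
                p = score * probs.get(s, 0.0)
                if p > nxt.get(s, -1.0):
                    nxt[s], ptr[s] = p, prev
        v = nxt
        backptrs.append(ptr)
    # Follow the back-pointers from the best final state.
    best = max(v, key=v.get)
    path = [best]
    for ptr in reversed(backptrs):
        best = ptr[best]
        path.append(best)
    return list(reversed(path))[1:]             # drop the <START> marker
```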

33 MEMMs

34 MEMM task: FAQ parsing

35 MEMM features

