Conditional Markov Models: MaxEnt Tagging and MEMMs

Presentation transcript:

Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen CALD

Review: Hidden Markov Models
[Figure: a four-state HMM (S1-S4) with transition probabilities between states and per-state emission probabilities over the symbols A and C.]
Efficient dynamic programming algorithms exist for
Finding Pr(S)
The highest-probability path P that maximizes Pr(S,P) (Viterbi)
Training the model (Baum-Welch algorithm)
Speaker notes: In previous models, Pr(a_i) depended only on the symbols appearing within some distance before it, not on the position i of the symbol. To model drifting/evolving sequences we need something more powerful; hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name “hidden”. There are two kinds of probabilities: transitions, as before, but now emissions too. Calculating Pr(seq) is not easy, since every symbol can potentially be generated from every state, so there is not a single path that generates the sequence but multiple paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols: enumerate all possible paths and sum their probabilities. We can do much better by exploiting the Markov property.

HMM for Segmentation Simplest Model: One state per entity type

HMM Learning
Manually pick the HMM's graph (e.g., the simple model, or a fully connected one).
Learn transition probabilities: Pr(s_i | s_j)
Learn emission probabilities: Pr(w | s_i)
Speaker notes: Attached to each state is a dictionary, which can be any probabilistic model over the content words associated with that element. The common easy case is a multinomial model: attach a probability value to each word, with the probabilities summing to 1. Intuitively we know that particular words are less important than some higher-level features of the words, and these features may be overlapping. Capturing them requires training a joint probability model; maximum entropy provides a viable approach.

Learning model parameters
When the training data defines a unique path through the HMM:
Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total number of transitions from state i)
Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (total number of symbols emitted from state i)
When the training data defines multiple paths: use a more general EM-like algorithm (Baum-Welch).
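A minimal sketch of these counting estimates, assuming fully labeled (word, state) sequences; the function and variable names are illustrative, not from the lecture.

```python
from collections import defaultdict

def train_hmm(tagged_sequences):
    """Maximum-likelihood HMM estimates from fully labeled sequences.

    tagged_sequences: a list of [(word, state), ...] lists.
    Returns (transition, emission) dictionaries of conditional probabilities.
    """
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))

    for seq in tagged_sequences:
        prev_state = "START"
        for word, state in seq:
            trans_counts[prev_state][state] += 1   # count transitions i -> j
            emit_counts[state][word] += 1          # count emissions of word from state
            prev_state = state

    transition = {i: {j: c / sum(js.values()) for j, c in js.items()}
                  for i, js in trans_counts.items()}
    emission = {s: {w: c / sum(ws.values()) for w, c in ws.items()}
                for s, ws in emit_counts.items()}
    return transition, emission

# Example: one fully labeled sequence
data = [[("William", "name"), ("Cohen", "name"), ("CMU", "affiliation")]]
transition, emission = train_hmm(data)
print(transition["name"])   # {'name': 0.5, 'affiliation': 0.5}
```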

What is a “symbol” ??? Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ? 4601 => “4601”, “9999”, “9+”, “number”, … ? Datamold: choose best abstraction level using holdout set
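To make the idea concrete, here is one way such abstraction levels might be computed; the specific patterns below are illustrative and are not the DataMold heuristics themselves.

```python
import re

def abstractions(token):
    """Return the token at several abstraction levels, most specific first."""
    word_shape = re.sub(r"[A-Z]", "X",
                        re.sub(r"[a-z]", "x",
                               re.sub(r"[0-9]", "9", token)))
    short_shape = re.sub(r"(.)\1+", r"\1", word_shape)   # collapse runs: "Xxxxx" -> "Xx"
    if token.isdigit():
        coarse = "number"
    elif token[:1].isupper():
        coarse = "capitalized"
    else:
        coarse = "lowercase"
    return [token, token.lower(), word_shape, short_shape, coarse]

print(abstractions("Cohen"))   # ['Cohen', 'cohen', 'Xxxxx', 'Xx', 'capitalized']
print(abstractions("4601"))    # ['4601', '4601', '9999', '9', 'number']
```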

What is a symbol? Bikel et al. mix symbols from two abstraction levels.

What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words: identity of the word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …
[Figure: state nodes S_{t-1}, S_t, S_{t+1} above observation nodes O_{t-1}, O_t, O_{t+1}, with example features such as “is ‘Wisniewski’”, “part of noun phrase”, “ends in ‘-ski’” attached to an observation.]
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

Stupid HMM tricks
[Figure: a degenerate two-state HMM over states “red” and “green”; from the start state the model enters a state (with probability Pr(red) for “red”), and each state then loops to itself with probability 1: Pr(red|red) = 1, Pr(green|green) = 1.]

Stupid HMM tricks
[Figure: the same degenerate HMM; from start, enter “red” with probability Pr(red) or “green” with probability Pr(green); Pr(red|red) = 1, Pr(green|green) = 1.]
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y)
                 = argmax_y Pr(y) * Pr(x1|y) * Pr(x2|y) * ... * Pr(xm|y)
Pr(“I voted for Ralph Nader” | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)
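A small sketch of this naive-Bayes-style scoring; the probabilities below are toy numbers for illustration, not estimates from any dataset.

```python
import math

def nb_class_score(words, cls, prior, cond_prob):
    """Log of Pr(cls) * prod_i Pr(word_i | cls), as in the slide's decomposition."""
    score = math.log(prior[cls])
    for w in words:
        score += math.log(cond_prob[cls].get(w, 1e-6))  # tiny floor for unseen words
    return score

# Toy parameters (illustrative numbers only)
prior = {"g": 0.5, "r": 0.5}
cond_prob = {
    "g": {"I": 0.1, "voted": 0.1, "for": 0.1, "Ralph": 0.05, "Nader": 0.05},
    "r": {"I": 0.1, "voted": 0.05, "for": 0.1, "Ralph": 0.01, "Nader": 0.01},
}
words = "I voted for Ralph Nader".split()
best = max(prior, key=lambda c: nb_class_score(words, c, prior, cond_prob))
print(best)  # "g" under these toy numbers
```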

HMMs = sequential NB

From NB to Maxent

From NB to Maxent

From NB to Maxent
Learning: set the alpha parameters to maximize this (the conditional likelihood shown on the slide): the ML model of the data, given that we’re using the same functional form as NB. This turns out to be the same as maximizing the entropy of p(y|x) over all distributions consistent with the feature-expectation constraints.
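The objective the slide points to is not reproduced in the transcript; in standard notation, the maxent (logistic) form and the conditional log-likelihood it maximizes are:

```latex
% Maxent form over features f_j(x, y) with weights \alpha_j
p_\alpha(y \mid x) = \frac{\exp\big(\sum_j \alpha_j f_j(x, y)\big)}
                          {\sum_{y'} \exp\big(\sum_j \alpha_j f_j(x, y')\big)}

% Learning: choose \alpha to maximize the conditional log-likelihood of the data
L(\alpha) = \sum_i \log p_\alpha(y_i \mid x_i)
```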

MaxEnt Comments
Implementation: All methods are iterative. Numerical issues (underflow, rounding) are important. For NLP-like problems with many features, modern gradient-like or Newton-like methods work well – sometimes better(?) and faster than GIS and IIS.
Smoothing: Typically maxent will overfit the data if there are many infrequent features. Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize the log likelihood minus a penalty on the alphas).
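Spelled out, the Gaussian-prior smoothing the slide describes is the usual L2-regularized objective (σ² is the prior variance; this is the standard formulation, not copied from the slides):

```latex
L_{\mathrm{reg}}(\alpha) = \sum_i \log p_\alpha(y_i \mid x_i)
                         - \sum_j \frac{\alpha_j^2}{2\sigma^2}
```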

MaxEnt Comments
Performance: Good MaxEnt methods are competitive in accuracy with linear SVMs and other state-of-the-art classifiers. They can’t easily be extended to higher-order interactions (unlike, e.g., kernel SVMs or AdaBoost). Training is relatively expensive.
Embedding in a larger system: MaxEnt optimizes Pr(y|x), not error rate.

MaxEnt Comments
MaxEnt competitors: modeling Pr(y|x) with Pr(y|score(x)), using a score from SVMs, NB, …; regularized Winnow, BPETs, …; ranking-based methods that estimate whether Pr(y1|x) > Pr(y2|x).
Things I don’t understand: Why don’t we call it logistic regression? Why is it always used to estimate the density of (y,x) pairs rather than a separate density for each class y? When are its confidence estimates reliable?

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words. identity of word ends in “-ski” is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor … S S S t - 1 t t+1 … is “Wisniewski” … part of noun phrase ends in “-ski” O O O t - 1 t t +1

What is a symbol?
[Figure: the same feature list and S/O graphical model as on the previous slide.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.

What is a symbol?
[Figure: the same feature list and S/O graphical model as above.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.

What is a symbol?
[Figure: the same feature list and S/O graphical model as above.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
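A minimal sketch of such a locally normalized next-state model, Pr(s_t | s_{t-1}, o_t); the feature functions and weights here are illustrative placeholders, not the lecture’s.

```python
import math

def local_features(prev_state, obs):
    """Arbitrary, overlapping features of (previous state, current observation)."""
    return {
        "word=" + obs: 1.0,
        "ends_in_ski": 1.0 if obs.endswith("ski") else 0.0,
        "is_capitalized": 1.0 if obs[:1].isupper() else 0.0,
        "prev=" + prev_state: 1.0,
    }

def next_state_probs(prev_state, obs, states, weights):
    """Pr(s_t | s_{t-1}, o_t) as a maxent model: softmax over candidate states."""
    feats = local_features(prev_state, obs)
    scores = {s: sum(weights.get((f, s), 0.0) * v for f, v in feats.items())
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

# Toy usage with made-up weights
states = ["person", "other"]
weights = {("ends_in_ski", "person"): 2.0, ("is_capitalized", "person"): 1.0}
print(next_state_probs("other", "Wisniewski", states, weights))
```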

Ratnaparkhi’s MXPOST Sequential learning problem: predict POS tags of words. Uses MaxEnt model described above. Rich feature set. To smooth, discard features occurring < 10 times.
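For concreteness, features in the spirit of MXPOST’s rich feature set might be generated like this; the templates below are illustrative, not an exact reproduction of Ratnaparkhi’s.

```python
def mxpost_style_features(words, tags, i):
    """Illustrative context features for predicting the tag of words[i]."""
    w = words[i]
    return [
        "w=" + w,
        "w-1=" + (words[i - 1] if i > 0 else "<BOS>"),
        "w+1=" + (words[i + 1] if i + 1 < len(words) else "<EOS>"),
        "t-1=" + (tags[i - 1] if i > 0 else "<BOS>"),
        "prefix3=" + w[:3],
        "suffix3=" + w[-3:],
        "has_digit=" + str(any(c.isdigit() for c in w)),
        "has_upper=" + str(any(c.isupper() for c in w)),
        "has_hyphen=" + str("-" in w),
    ]

words = "Pierre Vinken will join the board".split()
print(mxpost_style_features(words, ["NNP"], 1))
```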

MXPOST

MXPOST: learning & inference GIS Feature selection

MXPost inference

MXPost results
State-of-the-art accuracy (for 1996).
The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art).
The same approach was used for NER by Borthwick, Malouf, Manning, and others.

Alternative inference

Finding the most probable path: the Viterbi algorithm (for HMMs)
Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.
We want to compute v_end(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.
We can define v_k(i) recursively, and can use dynamic programming to compute it efficiently.

Finding the most probable path: the Viterbi algorithm for HMMs
Initialization: v_begin(0) = 1, and v_k(0) = 0 for every other state k.
Note: this is wrong for delete states: they shouldn’t be initialized like this.

The Viterbi algorithm for HMMs
Recursion for emitting states (i = 1…L):
v_k(i) = e_k(x_i) * max_j [ v_j(i-1) * a_{jk} ]
where e_k(x_i) is the probability of emitting x_i from state k and a_{jk} is the transition probability from state j to state k.

The Viterbi algorithm for HMMs and Maxent Taggers
Recursion for emitting states (i = 1…L):
v_k(i) = max_j [ v_j(i-1) * Pr(s_i = k | s_{i-1} = j, x_i) ]
i.e., the HMM’s product e_k(x_i) * a_{jk} is replaced by the maxent model’s conditional probability of state k given the previous state j and the observation.
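A compact dynamic-programming sketch of this recursion, parameterized by any local model Pr(s_i = k | s_{i-1} = j, x_i), such as the maxent classifier sketched earlier; the function and variable names are illustrative.

```python
def viterbi(observations, states, local_prob, start_state="START"):
    """Most probable state sequence under a locally normalized model.

    local_prob(prev_state, state, obs) returns Pr(state | prev_state, obs),
    e.g. the maxent "next state" model sketched above.
    """
    # v[i][k] = probability of the best path over the first i+1 observations ending in k
    v = [{k: local_prob(start_state, k, observations[0]) for k in states}]
    back = [{}]
    for i in range(1, len(observations)):
        v.append({})
        back.append({})
        for k in states:
            best_j = max(states, key=lambda j: v[i - 1][j] * local_prob(j, k, observations[i]))
            v[i][k] = v[i - 1][best_j] * local_prob(best_j, k, observations[i])
            back[i][k] = best_j
    # Trace back from the best final state
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage: a local model that prefers "person" for capitalized tokens
def toy_local_prob(prev, state, obs):
    return 0.8 if (state == "person") == obs[:1].isupper() else 0.2

print(viterbi("the painter Wisniewski".split(), ["person", "other"], toy_local_prob))
# ['other', 'other', 'person']
```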

MEMMs
Basic difference from ME tagging:
ME tagging: the previous state is a feature of a single MaxEnt classifier.
MEMM: build a separate MaxEnt classifier for each state. You can build any HMM architecture you want, e.g. parallel nested HMMs, etc. The data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”.
Mostly a difference in viewpoint.
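A hedged sketch of the MEMM viewpoint (one classifier per previous state) as opposed to the ME-tagging viewpoint (one classifier with the previous state as a feature); train_maxent is a placeholder for any MaxEnt trainer, not a specific library call.

```python
from collections import defaultdict

def train_memm(tagged_sequences, train_maxent):
    """Group training examples by previous state and fit one classifier per group.

    train_maxent(examples) is any MaxEnt trainer over (features, label) pairs;
    it is a placeholder here, not a specific library API.
    """
    by_prev_state = defaultdict(list)
    for seq in tagged_sequences:
        prev = "START"
        for word, state in seq:
            by_prev_state[prev].append(({"word": word}, state))
            prev = state
    # Data fragmentation: each classifier sees only examples with its previous state.
    return {prev: train_maxent(examples) for prev, examples in by_prev_state.items()}
```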

MEMMs

MEMM task: FAQ parsing

MEMM features