Graphical models for part of speech tagging

Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields

POS tagging: A Sequence Labeling Problem
Input and output:
- Input sequence x = x1 x2 … xn
- Output sequence y = y1 y2 … ym, the labels of the input sequence (a semantic representation of the input)
Other applications:
- Automatic speech recognition
- Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.

Hidden Markov Models
Doubly stochastic models. Efficient dynamic programming algorithms exist for:
- finding Pr(S)
- finding the highest-probability path P that maximizes Pr(S, P) (Viterbi)
- training the model (Baum-Welch algorithm)
[Figure: a four-state HMM (S1-S4) with transition probabilities between the states and emission probabilities over the symbols A and C.]
Speaker notes: In the previous models, Pr(ai) depended only on the symbols appearing within some distance before it, not on the position i of the symbol. To model drifting or evolving sequences we need something more powerful, and hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transition probabilities, as before, and emission probabilities. Calculating Pr(sequence) is not easy, since every symbol can potentially be generated from every state; there is no single path that generates the sequence, but multiple paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols, so we could enumerate all possible paths and sum their probabilities. We can do much better than that by exploiting the Markov property.
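The Viterbi step above can be made concrete with a short dynamic-programming routine. This is a minimal sketch, assuming dictionary-based start, transition, and emission probabilities; all names are illustrative and not from the slides.

```python
# Minimal Viterbi sketch for a discrete HMM (illustrative, not from the slides).
def viterbi(obs, states, start, trans, emit):
    """Return the highest-probability state path for an observation sequence."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans[p][s] * emit[s][obs[t]], p) for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), best[-1][last]
```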

Hidden Markov Model (HMM): Generative Modeling
- Source model P(Y), e.g., a 1st-order Markov chain
- Noisy channel P(X|Y), generating the observation sequence x from the label sequence y
- Parameter estimation: maximize the joint likelihood of the training examples
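For fully observed training data, maximizing the joint likelihood reduces to relative-frequency counting of transitions and emissions. A minimal sketch under that assumption; function and variable names are illustrative, not from the slides.

```python
# Minimal sketch of joint-likelihood (relative-frequency) HMM estimation.
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """tagged_sentences: iterable of [(word, tag), ...] sequences."""
    trans_counts = defaultdict(Counter)   # tag -> counts of the next tag
    emit_counts = defaultdict(Counter)    # tag -> counts of the emitted word
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
             for p, cs in trans_counts.items()}
    emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
            for t, cs in emit_counts.items()}
    return trans, emit
```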

Dependency (1st order)

Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields

Disadvantage of HMMs (1): No Rich Feature Information
Rich feature information is required:
- when xk is complex
- when the data for xk is sparse
Example: POS tagging. How do we evaluate P(wk|tk) for unknown words wk? Useful features include suffixes (e.g., -ed, -tion, -ing) and capitalization, as sketched below.
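A minimal sketch of such word-shape features; the feature names and the extra digit check are illustrative additions, not taken from the slides.

```python
# Illustrative word-shape features for handling unknown words.
def word_features(word):
    feats = []
    for suffix in ("ed", "tion", "ing", "s"):
        if word.lower().endswith(suffix):
            feats.append("suffix=-" + suffix)
    if word[:1].isupper():
        feats.append("capitalized")
    if any(ch.isdigit() for ch in word):
        feats.append("has_digit")
    return feats

# word_features("Nationalization") -> ['suffix=-tion', 'capitalized']
```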

Disadvantage of HMMs (2): Generative Model
- Parameter estimation: maximize the joint likelihood of the training examples
Better approach:
- A discriminative model that models P(y|x) directly
- Maximize the conditional likelihood of the training examples

Maximum Entropy Markov Model: Discriminative Sub-Models
- Unify the two kinds of parameters in the generative model (the source-model parameters and the noisy-channel parameters) into one conditional model
- Employ the maximum entropy principle

General Maximum Entropy Model
Model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y.
Idea:
- Collect feature statistics from the training data
- Assume nothing about the distribution P(Y|X) other than the collected information
- Use maximum entropy as the selection criterion

Features
0-1 indicator functions:
- 1 if (x, y) satisfies a predefined condition
- 0 if not
Example: POS tagging (an illustrative feature is sketched below).
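A minimal example of such an indicator feature; the specific condition (a word ending in -ing tagged as VBG) is illustrative, not taken from the slides.

```python
# Illustrative 0-1 indicator feature for maximum entropy POS tagging.
def f_ing_vbg(x, y, k):
    """1 if the k-th word ends in '-ing' and its tag is VBG, else 0."""
    return 1 if x[k].endswith("ing") and y[k] == "VBG" else 0
```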

Constraints
- Empirical information: feature statistics collected from the training data T
- Expected value: feature expectations under the distribution P(Y|X) we want to model
- Constraint: the two must agree (see the equations below)
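The constraint equations themselves were formula images missing from the transcript; in the standard maximum entropy formulation they equate empirical and model feature expectations:

```latex
E_{\tilde{p}}[f_i] \;=\; E_{P}[f_i], \qquad
E_{\tilde{p}}[f_i] = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y), \qquad
E_{P}[f_i] = \sum_{x} \tilde{p}(x) \sum_{y} P(y \mid x)\, f_i(x,y),
\qquad i = 1,\dots,l .
```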

Maximum Entropy: Objective
Maximization problem (the standard formulation is given below).
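The objective was a formula image on the slide; the standard constrained entropy maximization it refers to is:

```latex
\max_{P}\; H(P) \;=\; -\sum_{x} \tilde{p}(x) \sum_{y} P(y \mid x)\,\log P(y \mid x)
\quad\text{s.t.}\quad
E_{\tilde{p}}[f_i] = E_{P}[f_i]\;(i = 1,\dots,l),
\qquad \sum_{y} P(y \mid x) = 1 .
```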

Dual Problem
- Conditional model: maximum likelihood of the conditional training data
Solution:
- Improved iterative scaling (IIS) (Berger et al. 1996)
- Generalized iterative scaling (GIS) (McCallum et al. 2000)
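Solving the dual gives the familiar log-linear (exponential) form of the conditional model, whose parameters are found by maximizing the conditional likelihood:

```latex
P_{\Lambda}(y \mid x) \;=\; \frac{1}{Z_{\Lambda}(x)}
\exp\!\Big(\sum_{i} \lambda_i\, f_i(x, y)\Big),
\qquad
Z_{\Lambda}(x) \;=\; \sum_{y'} \exp\!\Big(\sum_{i} \lambda_i\, f_i(x, y')\Big).
```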

Maximum Entropy Markov Model
Use the maximum entropy approach to model the 1st-order conditional P(Yk | Yk-1, Xk).
Features:
- Basic features (like the parameters in an HMM): bigram (1st order) or trigram (2nd order) features over the source model, and state-output pair features (Xk = xk, Yk = yk)
- Advantage: other, richer features on (xk, yk) can be incorporated

Maximum Entropy Markov Model (MEMM)
HMM vs. MEMM (1st order)
[Figure: side-by-side graphical models of the HMM and the MEMM.]
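In a first-order MEMM, each step is a locally normalized maximum entropy classifier P(yk | yk-1, xk). A minimal sketch of that per-state softmax; the `features` and `weights` arguments are illustrative names, not from the slides.

```python
# Minimal sketch of the per-state MEMM distribution P(y_k | y_{k-1}, x_k).
import math

def memm_step(prev_tag, word, tags, features, weights):
    """Return P(tag | prev_tag, word) for every tag, normalized per state."""
    scores = {}
    for tag in tags:
        s = sum(weights.get(feat, 0.0)
                for feat in features(prev_tag, word, tag))  # active feature names
        scores[tag] = math.exp(s)
    z = sum(scores.values())                                 # per-state normalization
    return {tag: v / z for tag, v in scores.items()}
```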

Performance in POS Tagging
Data set: WSJ. Features: HMM features plus spelling features (e.g., -ed, -tion, -s, -ing).
Results (Lafferty et al. 2001):
- 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields

Disadvantage of MEMMs (1): Complex Maximum Entropy Training
- Both IIS and GIS are difficult to implement and require many tricks
- Training is slow: time-consuming when the data set is large, especially for MEMMs

Disadvantage of MEMMs (2): Label Bias
- The maximum entropy model is only a sub-model: entropy is optimized on the sub-models, not on the global model
- Label bias problem: in conditional models with per-state normalization, the effect of the observations is weakened for states with fewer outgoing transitions

Label Bias Problem
Training data (X:Y): rib:123, rob:456
[Figure: a state diagram in which the path 1 -> 2 -> 3 reads r, i, b and the path 4 -> 5 -> 6 reads r, o, b, together with the learned model parameters.]
New input: rob (a numeric walk-through is sketched below)
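A hedged numeric walk-through of label bias on this example. The per-state probabilities below are assumed from the training data above (each of states 1 and 4 has a single successor, so its outgoing distribution is 1 regardless of the observation); they are illustrative, not taken from the slide.

```python
# Illustrative label-bias computation for the rib/rob example.
# cond[(prev_state, symbol)] -> {next_state: prob}, normalized per state.
def path_prob(path, obs, cond):
    p, prev = 1.0, "start"
    for state, symbol in zip(path, obs):
        p *= cond[(prev, symbol)][state]
        prev = state
    return p

cond = {
    ("start", "r"): {1: 0.5, 4: 0.5},          # 'r' seen equally often before 1 and 4
    (1, "i"): {2: 1.0}, (1, "o"): {2: 1.0},    # state 1 has one successor, so the
    (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},    # observation cannot change anything
    (2, "b"): {3: 1.0},
    (5, "b"): {6: 1.0},
}

print(path_prob([1, 2, 3], "rob", cond))  # 0.5 -- the 'o' was ignored on this path
print(path_prob([4, 5, 6], "rob", cond))  # 0.5 -- even though only 4-5-6 matches "rob"
```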

Solution: Global Optimization
- Optimize the parameters of a global model simultaneously, not of sub-models separately
Alternatives:
- Conditional random fields
- Application of the perceptron algorithm

Conditional Random Field (CRF) (1)
Let G = (V, E) be a graph such that Y = (Yv), v in V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, conditioned globally on X, the random variables Yv obey the Markov property with respect to the graph: P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v), where w ~ v means w and v are neighbors in G.

Conditional Random Field (CRF) (2): Exponential Model
- G is a tree (or, more specifically, a chain) whose cliques are its edges and vertices
- Two kinds of features: those determined by state transitions (edges) and those determined by a single state (vertices)
Parameter estimation:
- Maximize the conditional likelihood of the training examples
- IIS or GIS
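For the chain case, the exponential model on the slide (the formula image is missing from the transcript) has the standard globally normalized form of Lafferty et al. (2001):

```latex
P_{\Lambda}(y \mid x) \;=\; \frac{1}{Z(x)}
\exp\!\Big( \sum_{k}\sum_{i} \lambda_i\, f_i(y_{k-1}, y_k, x, k)
          \;+\; \sum_{k}\sum_{j} \mu_j\, g_j(y_k, x, k) \Big),
```
where Z(x) sums the same exponential over all label sequences y, making the normalization global rather than per state.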

MEMM vs. CRF
Similarities:
- Both employ the maximum entropy principle
- Both incorporate rich feature information
Differences:
- Conditional random fields are always globally conditioned on X, resulting in a globally optimized model

Performance in POS Tagging
Data set: WSJ. Features: HMM features plus spelling features (e.g., -ed, -tion, -s, -ing).
Results (Lafferty et al. 2001):
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

Comparison of the three approaches to POS tagging
Results (Lafferty et al. 2001):
- 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

References
- A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
- J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.