Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum, Fernando Pereira.

Goal: Sequence segmentation and labeling, with applications in computational biology, computational linguistics, and computer science.

Overview: Generative models, conditional models, the label bias problem, conditional random fields, experiments.

Generative Models: HMMs and stochastic grammars. They assign a joint probability to paired observation and label sequences; parameters are trained to maximize the joint likelihood of the training examples.

Generative Models: Need to enumerate all possible observation sequences. To keep the inference problem tractable, they must make strong independence assumptions (i.e., observations are conditionally independent given the labels).
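To make these assumptions concrete, here is the standard first-order HMM factorization (a textbook form, not reproduced from the slides):

$$
p(\mathbf{x}, \mathbf{y}) \;=\; \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
$$

Each observation $x_t$ is assumed to depend only on its own label $y_t$, which is exactly the strong conditional-independence assumption the slide refers to.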

Conditional Models: Specify the probabilities of label sequences given an observation sequence. They do not expend modeling effort on the observations, which are fixed at test time, and the conditional probability can depend on arbitrary, non-independent features of the observation sequence.

Example: MEMMs (maximum entropy Markov models). Each source state has an exponential model that takes the observation features as input and outputs a distribution over possible next states. Weakness: the label bias problem.
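A common way to write the per-state exponential model of an MEMM (standard notation, assumed here rather than copied from the slides):

$$
p(y_t \mid y_{t-1}, x_t) \;=\; \frac{1}{Z(y_{t-1}, x_t)} \exp\Big(\sum_k \lambda_k f_k(x_t, y_t)\Big)
$$

The normalizer $Z(y_{t-1}, x_t)$ is computed separately for each source state; this per-state normalization is what gives rise to the label bias problem discussed next.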

Label Bias Problem: Per-state normalization of transition scores implies a "conservation of score mass", which biases the model toward states with fewer outgoing transitions; a state with a single outgoing transition effectively ignores its observation (illustrated in the sketch below).
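A minimal numerical sketch of the effect, assuming a toy two-branch lattice (the branch names and scores are invented for illustration, not taken from the paper):

```python
# Toy illustration of label bias under per-state normalization (MEMM-style).
# Two label paths compete to explain an observation sequence; each path's middle
# state has exactly one outgoing transition.

def normalize(scores):
    """Per-state normalization: turn unnormalized successor scores into a distribution."""
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

# Start state: the first observation splits mass between the two branches (say 55/45).
start = normalize({"branch_A": 0.55, "branch_B": 0.45})

# Middle states: each has a SINGLE successor, so regardless of how well the next
# observation matches, normalization forces probability 1.0 onto that successor.
mid_A = normalize({"end": 0.9})   # strongly matches the middle observation
mid_B = normalize({"end": 0.1})   # poorly matches it, yet still normalizes to 1.0

print(start["branch_A"] * mid_A["end"])  # ~0.55
print(start["branch_B"] * mid_B["end"])  # ~0.45 -- the middle observation changed nothing
```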

Solving Label Bias: Collapse states and delay branching until a discriminating observation is reached. This is not always possible and may lead to combinatorial explosion.

Solving Label Bias (cont'd): Start with a fully-connected model and let the training procedure figure out a good structure. This precludes the use of prior knowledge about the structure.

Overview: Generative models, conditional models, the label bias problem, conditional random fields, experiments.

Conditional Random Fields: An undirected graphical model (random field) used to construct a conditional model p(Y|X); it does not explicitly model the marginal p(X). Assumption: the graph is fixed. The paper concerns itself with chain graphs and sequences.

CRFs: Distribution. $p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$, where the $\lambda_k, \mu_k$ are the weights and the $f_k, g_k$ are the edge and vertex features.
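For a linear chain, this distribution can be computed with a short dynamic program. The following is an illustrative sketch (the function names, feature signatures, and log-space forward pass are my own assumptions, not code from the paper):

```python
import math

def sequence_score(x, y, edge_feats, vertex_feats, lam, mu):
    """Unnormalized log-score of label sequence y for observations x:
    sum over positions of weighted edge features f_k and vertex features g_k."""
    s = 0.0
    for t in range(len(y)):
        if t > 0:
            s += sum(lam[k] * f(y[t - 1], y[t], x, t) for k, f in enumerate(edge_feats))
        s += sum(mu[k] * g(y[t], x, t) for k, g in enumerate(vertex_feats))
    return s

def log_partition(x, labels, edge_feats, vertex_feats, lam, mu):
    """Forward algorithm in log space: log Z(x), summing over all label sequences."""
    alpha = {y: sum(mu[k] * g(y, x, 0) for k, g in enumerate(vertex_feats)) for y in labels}
    for t in range(1, len(x)):
        new = {}
        for y2 in labels:
            terms = [alpha[y1]
                     + sum(lam[k] * f(y1, y2, x, t) for k, f in enumerate(edge_feats))
                     + sum(mu[k] * g(y2, x, t) for k, g in enumerate(vertex_feats))
                     for y1 in labels]
            m = max(terms)
            new[y2] = m + math.log(sum(math.exp(v - m) for v in terms))
        alpha = new
    m = max(alpha.values())
    return m + math.log(sum(math.exp(v - m) for v in alpha.values()))

# log p(y | x) = sequence_score(...) - log_partition(...)
```

Note that there is a single global normalizer Z(x) over whole label sequences rather than a per-state normalizer, which is the key structural difference from the MEMM.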

CRFs: Example Features. For HMM-like chains, boolean indicator features of label pairs (transitions) and label-observation pairs (emissions); the corresponding parameters λ and μ are similar to the (logarithms of the) HMM parameters p(y'|y) and p(x|y).
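As an illustration of such features (assumed names, meant to plug into the sketch above), one boolean feature can be defined per label pair and per label-word pair:

```python
# HMM-like boolean features: edge features mirror transitions p(y'|y), vertex
# features mirror emissions p(x|y). Labels and vocabulary here are invented examples.
labels = ["N", "V"]
vocab = ["dog", "runs"]

# Fires when y_{t-1} = a and y_t = b (analogue of log p(y'|y)).
edge_feats = [
    (lambda a, b: lambda y1, y2, x, t: 1.0 if (y1, y2) == (a, b) else 0.0)(a, b)
    for a in labels for b in labels
]

# Fires when y_t = l and x_t = w (analogue of log p(x|y)).
vertex_feats = [
    (lambda l, w: lambda y, x, t: 1.0 if (y, x[t]) == (l, w) else 0.0)(l, w)
    for l in labels for w in vocab
]
```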

CRFs: Parameter Estimation. Maximize the log-likelihood objective function; the paper uses iterative scaling to find the optimal parameter vector.
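In standard exponential-family form (my notation; the slide only names the objective), the objective and its gradient are

$$
\mathcal{O}(\theta) = \sum_i \log p_\theta(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}),
\qquad
\frac{\partial \mathcal{O}}{\partial \lambda_k}
= \sum_i \Big( F_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)})
- \mathbb{E}_{p_\theta(\mathbf{y} \mid \mathbf{x}^{(i)})}\big[F_k(\mathbf{y}, \mathbf{x}^{(i)})\big] \Big),
$$

where $F_k$ is the total count of feature $k$ along a sequence. Iterative scaling (as used in the paper) and gradient-based methods both drive the empirical and expected feature counts toward agreement.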

Overview: Generative models, conditional models, the label bias problem, conditional random fields, experiments.

Experiment 1: Modeling Label Bias. Generate data from a simple HMM that encodes a noisy version of the network; each state emits its designated symbol with probability 29/32. 2,000 training and 500 test samples. MEMM error: 42%; CRF error: 4.6%.
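A minimal sketch of the emission noise described on the slide above (the actual network topology and data-generation code from the paper are not reproduced; only the 29/32 emission behavior is assumed):

```python
import random

def emit(designated_symbol, alphabet, p_correct=29 / 32):
    """Emit the state's designated symbol with probability 29/32,
    otherwise a different symbol chosen uniformly at random."""
    if random.random() < p_correct:
        return designated_symbol
    return random.choice([a for a in alphabet if a != designated_symbol])

# Example: a state designated to emit 'r' over a small alphabet.
print(emit("r", ["r", "i", "o", "b"]))
```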

Experiment 2: More Synthetic Data. Five labels (a-e) and 26 observation values (A-Z). Generate data from a mixed-order HMM: randomly generate a model, and for each model generate a sample of 1,000 sequences of length 25.

MEMM vs. CRF

MEMM vs. HMM

CRF vs. HMM

Experiment 3: Part-of-Speech Tagging. Each word is to be labeled with one of 45 syntactic tags; 50%-50% train-test split. Out-of-vocabulary (oov) words are those not observed in the training set.

Part-of-Speech Tagging: The second set of experiments adds a small set of orthographic features (whether the word is capitalized, whether it ends in -ing, -ogy, -ed, -s, -ly, …). The overall error rate is reduced by 25% and the oov error by around 50%.
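A sketch of what such orthographic indicator features might look like (the exact feature set used in the paper is only partially listed on the slide, so this is an assumption):

```python
def orthographic_features(word):
    """Boolean orthographic indicators of the kind listed on the slide."""
    feats = {
        "is_capitalized": word[:1].isupper(),
        "ends_ing": word.endswith("ing"),
        "ends_ogy": word.endswith("ogy"),
        "ends_ed": word.endswith("ed"),
        "ends_s": word.endswith("s"),
        "ends_ly": word.endswith("ly"),
    }
    return {name: 1.0 for name, on in feats.items() if on}

print(orthographic_features("Running"))  # {'is_capitalized': 1.0, 'ends_ing': 1.0}
```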

Part-of-Speech Tagging: Training usually starts with the zero parameter vector (corresponding to the uniform distribution). Alternatively, the optimal MEMM parameter vector can be used as the starting point for training the corresponding CRF. MEMM+ trained to convergence in around 100 iterations; CRF+ took an additional 1,000 iterations. When started from the uniform distribution, CRF+ had not converged after 2,000 iterations.

Further Aspects of CRFs: Automatic feature selection. Start from feature-generating rules and evaluate the benefit of the generated features automatically on data.

Conclusions: CRFs do not suffer from the label bias problem! Parameter estimation is guaranteed to find the global optimum. Limitation: slow convergence of the training algorithm.