Conditional Random Fields and Its Applications
Presenter: Shih-Hsiang Lin, 06/25/2007


2 Introduction
The task of assigning label sequences to a set of observation sequences arises in many fields
– Bioinformatics, computational linguistics, speech recognition, information extraction, etc.
One of the most common methods for performing such labeling and segmentation tasks is to employ hidden Markov models (HMMs) or probabilistic finite-state automata
– Identify the most likely sequence of labels for any given observation sequence

3 Labeling Sequence Data Problem
Given observed data sequences X = {x_1, x_2, …, x_n}
A corresponding label sequence y_k for each data sequence x_k, and Y = {y_1, y_2, …, y_n}
– Each y_i is assumed to range over a finite label alphabet A
The problem:
– Prediction task: given a sequence x and a model θ, predict y
– Learning task: given training sets X and Y, learn the best model θ
(Figure: POS-tagging example "Thinking is being", with X = (x_1, x_2, x_3) = (Thinking, is, being) and Y = (y_1, y_2, y_3) = (noun, verb, noun))

4 Brief Overview of HMMs
An HMM is a finite state automaton with stochastic state transitions and observations
Formally, an HMM consists of
– A finite set of states Y
– A finite set of observations X
– Two conditional probability distributions: P(y|y') for state y given the previous state y', and P(x|y) for observation x given state y
– The initial state distribution P(y_1)
Assumptions made by HMMs
– Markov assumption, stationarity assumption, output independence assumption
Three classical problems
– Evaluation problem, decoding problem, learning problem
– Efficient dynamic programming (DP) algorithms that solve these problems are the forward, Viterbi, and Baum-Welch algorithms, respectively
(Figure: HMM graphical model with states y_{t-1}, y_t, y_{t+1} emitting observations x_{t-1}, x_t, x_{t+1})
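To make the generative structure explicit, the joint probability an HMM assigns to a state sequence and an observation sequence factorizes as below (standard notation, supplied here rather than copied from the slide):

P(Y, X) = P(y_1)\, P(x_1 \mid y_1) \prod_{t=2}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)

The forward, Viterbi, and Baum-Welch algorithms all exploit this chain-structured factorization.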

5 Difficulties with HMMs: Motivation
An HMM cannot easily represent multiple interacting features or long-range dependencies between observed elements
We need a richer representation of observations
– Describe observations with overlapping features
– Example features in text-related tasks: capitalization, word ending, part-of-speech, formatting, position on the page
Model the conditional probability P(Y|X) rather than the joint probability P(Y, X)
– Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
– Specify the probability of possible label sequences given an observation sequence
– Allow arbitrary, non-independent features on the observation sequence X
– The probability of a transition between labels may depend on past and future observations
– Relax the strong independence assumptions made in generative models
(Figure: contrast between discriminative and generative models)
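In symbols (a standard contrast, not taken verbatim from the slide): a generative model such as an HMM learns the joint distribution and recovers labels through Bayes' rule, while a discriminative model parameterizes the conditional distribution directly:

Generative:      P(Y, X) = P(Y)\, P(X \mid Y), \qquad P(Y \mid X) = \frac{P(Y, X)}{\sum_{Y'} P(Y', X)}
Discriminative:  P(Y \mid X) \text{ modeled directly, with arbitrary features of the whole observation } X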

6 Maximum Entropy Markov Models (MEMMs)
A conditional model that represents the probability of reaching a state given an observation and the previous state
Observation sequences are treated as events to be conditioned upon
Given a training set X with label sequences Y
– Train a model θ that maximizes P(Y|X, θ)
– For a new data sequence x, the predicted label sequence y maximizes P(y|x, θ)
– Notice the per-state normalization
Subject to the label bias problem (HMMs do not suffer from the label bias problem)
– Per-state normalization: all the probability mass that arrives at a state must be distributed among its possible successor states
(Figure: MEMM graphical model with states y_{t-1}, y_t, y_{t+1} conditioned on observations x_{t-1}, x_t, x_{t+1})
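The per-state normalization referred to above can be written explicitly. A common form of the MEMM local distribution (standard notation, supplied here since the slide's formula is not in the transcript) is:

P(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(x_t, y_{t-1})} \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \Big),
\qquad Z(x_t, y_{t-1}) = \sum_{y'} \exp\Big( \sum_k \lambda_k f_k(y', y_{t-1}, x_t) \Big)

The sequence probability is then the product P(y | x) = \prod_t P(y_t | y_{t-1}, x_t); it is exactly this local normalization that causes the label bias problem described on the next slide.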

7 Label Bias Problem
Consider this MEMM (figure not reproduced)
The label sequence 1,2 should score higher when "ri" is observed compared to "ro"
– That is, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
In the training data, label value 2 is the only label value observed after label value 1
– Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
Mathematically,
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)

8 Random Field
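The body of this slide is not in the transcript. As background (supplied here, not from the original slide): a random field over a graph G whose vertices are the label variables Y is a distribution satisfying the Markov property with respect to G, and by the Hammersley-Clifford theorem a strictly positive random field factorizes over the cliques C of the graph:

P(Y) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \Psi_C(Y_C), \qquad Z = \sum_{Y'} \prod_{C \in \mathcal{C}(G)} \Psi_C(Y'_C)

A conditional random field applies the same idea to P(Y | X), with clique potentials that may depend on the entire observation X.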

9 Conditional Random Field
CRFs have all the advantages of MEMMs without the label bias problem
– An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
– A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
A CRF is an undirected graphical model globally conditioned on X
(Figure: linear-chain CRF graphical model with label nodes y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1})
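Written out, the single globally normalized exponential model for a linear-chain CRF takes the standard form below (notation supplied here, since the slide's formula image is not in the transcript):

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)

Here the f_k are feature functions that may inspect the whole observation sequence x, and the λ_k are learned weights; Z(x) sums over every possible label sequence, which is what removes the label bias of the MEMM.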

10 Example of CRFs

11 Conditional Random Field (cont.)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions
– Z(x) is a normalization constant that sums over all label sequences for the data sequence x
(Figure: side-by-side comparison of the HMM, MEMM, and CRF model equations)
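The comparison shown in the slide's figure is, in commonly used notation (my reconstruction, since the formula images are not in the transcript):

HMM:   P(y, x) = \prod_t P(y_t \mid y_{t-1})\, P(x_t \mid y_t)
MEMM:  P(y \mid x) = \prod_t P(y_t \mid y_{t-1}, x_t) = \prod_t \frac{\exp\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x_t)\big)}{Z(x_t, y_{t-1})}
CRF:   P(y \mid x) = \frac{1}{Z(x)} \prod_t \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)

The key difference is where normalization happens: locally per state for the MEMM, globally over the whole label sequence for the CRF.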

12 Example: Phone Classification
State feature function f([x is stop], /t/)
– One possible state feature function for our attributes and labels
State feature weight λ = 10
– One possible weight value for this state feature (strong)
Transition feature function g(x, /iy/, /k/)
– One possible transition feature function; indicates /k/ followed by /iy/
Transition feature weight μ = 4
– One possible weight value for this transition feature
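A minimal sketch of how such hand-written state and transition feature functions and their weights could be combined into an unnormalized CRF score for a label sequence. The function names, attribute encoding, and toy frames below are illustrative assumptions, not the presenters' implementation:

# Toy linear-chain CRF scoring with binary feature functions (illustrative only).

# State feature: fires when the acoustic attribute "is_stop" holds and the label is /t/
def f_stop_t(attrs, label):
    return 1.0 if attrs.get("is_stop") and label == "/t/" else 0.0

# Transition feature: fires when label /k/ is followed by /iy/
def g_k_iy(prev_label, label):
    return 1.0 if prev_label == "/k/" and label == "/iy/" else 0.0

STATE_FEATURES = [(f_stop_t, 10.0)]   # (feature function, weight lambda)
TRANS_FEATURES = [(g_k_iy, 4.0)]      # (feature function, weight mu)

def unnormalized_score(observations, labels):
    """Sum of weighted feature values over the sequence (before dividing by Z(x))."""
    score = 0.0
    for t, (attrs, label) in enumerate(zip(observations, labels)):
        for feat, weight in STATE_FEATURES:
            score += weight * feat(attrs, label)
        if t > 0:
            for feat, weight in TRANS_FEATURES:
                score += weight * feat(labels[t - 1], label)
    return score

# Hypothetical frames for /k/ /iy/ /t/ ("key" followed by a stop)
obs = [{"is_stop": True}, {"is_stop": False}, {"is_stop": True}]
print(unnormalized_score(obs, ["/k/", "/iy/", "/t/"]))  # 14.0 = 10 (state feature at /t/) + 4 (/k/ followed by /iy/)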

13 Parameter Estimation
Parameter estimation is typically performed by penalized maximum likelihood
– We need to determine the parameters θ from training data D = {(x^(i), y^(i))}
Because we are modeling the conditional distribution, the conditional log likelihood is appropriate
One way to understand the conditional likelihood p(y|x; θ) is to imagine combining it with some arbitrary prior p(x; θ') to form a joint probability. Then when we optimize the joint log likelihood log p(y, x) = log p(y|x; θ) + log p(x; θ')
– The value of θ' does not affect the optimization over θ
– If we do not need to estimate p(x), then we can simply drop the second term
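In symbols, the conditional log likelihood the slide refers to is (a standard form, reconstructed rather than copied from the slide):

\ell(\theta) = \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)

and the joint decomposition used in the argument above is \log p(y, x) = \log p(y \mid x; \theta) + \log p(x; \theta'), whose two terms share no parameters.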

14 Parameter Estimation (cont.)
After substituting the CRF model into the likelihood, we get the following expression
As a measure to avoid over-fitting, we use regularization, which is a penalty on weight vectors whose norm is too large
– A regularization parameter determines the strength of the penalty
In general, the function cannot be maximized in closed form, so numerical optimization is used; the partial derivatives are as follows
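The slide's formula images are not in the transcript; the standard forms, for the linear-chain CRF of slide 9 with a Gaussian (L2) penalty of variance σ², are:

\ell(\theta) = \sum_i \sum_t \sum_k \lambda_k f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}, t\big) - \sum_i \log Z\big(x^{(i)}\big) - \sum_k \frac{\lambda_k^2}{2\sigma^2}

\frac{\partial \ell}{\partial \lambda_k} = \sum_i \sum_t f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}, t\big)
  - \sum_i \sum_t \sum_{y, y'} f_k\big(y', y, x^{(i)}, t\big)\, p\big(y_{t-1}=y', y_t=y \mid x^{(i)}\big) - \frac{\lambda_k}{\sigma^2}

The first term is the empirical feature count, the second is the expected count under the model, and the third comes from the regularizer.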

15 Parameter Estimation (cont.)
There is no analytical solution for the parameters that maximize the log-likelihood
– Setting the gradient to zero and solving for the parameters does not always yield a closed-form solution
Iterative techniques are adopted
– Iterative scaling: IIS, GIS
– Gradient descent: conjugate gradient, L-BFGS
– Gradient tree boosting
The core of the above techniques lies in computing the expectation of each feature function with respect to the CRF model distribution
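As a concrete, much simplified illustration of the L-BFGS route: the sketch below fits a maximum-entropy (logistic regression) model, which is the sequence-length-one special case of a CRF, by handing the penalized negative log-likelihood and its gradient to SciPy's L-BFGS optimizer. The toy data and the choice of a maxent stand-in are my assumptions, not the presenters' setup:

import numpy as np
from scipy.optimize import minimize

# Toy data: 4 examples, 2 features, binary labels.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 0, 0])
sigma2 = 10.0  # Gaussian prior variance (regularization strength)

def neg_log_likelihood(w):
    """Penalized negative conditional log-likelihood and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))      # model probability of label 1
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + np.sum(w ** 2) / (2 * sigma2)
    grad = X.T @ (p - y) + w / sigma2        # expected minus empirical counts, plus penalty
    return nll, grad

result = minimize(neg_log_likelihood, x0=np.zeros(2), jac=True, method="L-BFGS-B")
print(result.x)  # learned weights

For a full CRF, the gradient's expected-count term is obtained from the forward-backward procedure of slide 17 rather than from a closed-form sigmoid.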

16 Making Predictions
Once a CRF model has been trained, there are (at least) two possible ways to do inference for a test sequence
– We can predict the entire sequence Y that has the highest probability using the Viterbi algorithm
– We can also make predictions for the individual labels y_t using the forward-backward algorithm
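A compact sketch of Viterbi decoding for a linear-chain model, given per-position state scores and transition scores in log space. The interface below is an illustrative assumption, not code from the presentation:

import numpy as np

def viterbi(state_scores, trans_scores):
    """Return the highest-scoring label sequence.

    state_scores: (T, L) array of log-potentials per position and label.
    trans_scores: (L, L) array of log-potentials for transitions (prev -> next).
    """
    T, L = state_scores.shape
    delta = np.zeros((T, L))          # best score of any path ending in label l at time t
    backptr = np.zeros((T, L), dtype=int)
    delta[0] = state_scores[0]
    for t in range(1, T):
        # For each next label, consider every previous label.
        scores = delta[t - 1][:, None] + trans_scores + state_scores[t][None, :]
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0)
    # Trace back the best path from the last position.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Example: 3 positions, 2 labels.
state = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]))
trans = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
print(viterbi(state, trans))  # -> [0, 0, 0]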

17 Making Predictions (cont.)
The marginal probability of the states at each position in the sequence can be computed by a dynamic programming inference procedure similar to the forward-backward procedure for HMMs
The marginal probability of label s at position t equals α_t(s) β_t(s) / Z(x), where the forward values α are computed recursively from the start of the sequence
The backward values β can be defined similarly
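Spelled out for the linear-chain CRF of slide 9 (a standard reconstruction; the slide's own formulas are not in the transcript), with per-position potentials \Psi_t(y', y, x) = \exp\big(\sum_k \lambda_k f_k(y', y, x, t)\big):

\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y')\, \Psi_t(y', y, x), \qquad
\beta_t(y) = \sum_{y'} \Psi_{t+1}(y, y', x)\, \beta_{t+1}(y')

p(y_t = s \mid x) = \frac{\alpha_t(s)\, \beta_t(s)}{Z(x)}, \qquad Z(x) = \sum_{y} \alpha_T(y)

with α_1 initialized from the potentials at the first position and β_T(y) = 1 for all y.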

CRF in Summarization

19 Corpus and Features
The data set is an open benchmark data set which contains 147 document-summary pairs from the Document Understanding Conference (DUC) 2001
Basic Features
– Position: the position of x_i along the sentence sequence of a document. If x_i appears at the beginning of the document, the feature "Pos" is set to 1; if it is at the end of the document, "Pos" is 2; otherwise, "Pos" is set to 3
– Length: the number of terms contained in x_i after removing words according to a stop-word list
– Log Likelihood: the log likelihood of x_i being generated by the document, log P(x_i|D)

20 Corpus and Features (cont.)
– Thematic Words: the most frequent words in the document after the stop words are removed. Sentences containing more thematic words are more likely to be summary sentences. We use this feature to record the number of thematic words in x_i
– Indicator Words: some words are indicators of summary sentences, such as "in summary" and "in conclusion". This feature denotes whether x_i contains such words
– Upper Case Words: some proper names are often important and presented through upper-case words, as are some other words the authors want to emphasize. We use this feature to reflect whether x_i contains upper-case words
– Similarity to Neighboring Sentences: we define features to record the similarity between a sentence and its neighbors
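A small sketch of how a few of the basic features above could be computed for a sentence. The thresholds, stop-word list, and function names are illustrative assumptions rather than the authors' implementation:

STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "and"}   # assumed tiny stop-word list
INDICATOR_PHRASES = ("in summary", "in conclusion")                      # phrases named in the slide

def position_feature(index, num_sentences, head=3, tail=3):
    """Pos = 1 near the beginning, 2 near the end, 3 otherwise (cutoffs are assumptions)."""
    if index < head:
        return 1
    if index >= num_sentences - tail:
        return 2
    return 3

def length_feature(sentence):
    """Number of non-stop-word terms in the sentence."""
    return sum(1 for tok in sentence.lower().split() if tok not in STOP_WORDS)

def indicator_feature(sentence):
    """1 if the sentence contains an indicator phrase, else 0."""
    lowered = sentence.lower()
    return int(any(phrase in lowered for phrase in INDICATOR_PHRASES))

def upper_case_feature(sentence):
    """1 if the sentence contains a capitalized non-initial word, else 0."""
    tokens = sentence.split()
    return int(any(tok[0].isupper() for tok in tokens[1:]))

sentence = "In conclusion, CRF models outperform the baselines"
print(position_feature(10, 12), length_feature(sentence),
      indicator_feature(sentence), upper_case_feature(sentence))  # -> 2 5 1 1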

21 Corpus and Features (cont.)
Complex Features
– LSA Scores: use the LSA projections of sentences as scores to rank the sentences and select the top sentences into the summary
– HITS Scores: the authority score of HITS on the directed backward graph, which is more effective than other graph-based methods

22 Results
(Results table not reproduced. Abbreviations: NB = Naïve Bayes, LR = Logistic Regression, LEAD = lead-sentence baseline; results are reported with basic features and with basic + complex features.)