Conditional Random Fields: An Introduction
Hanna M. Wallach, University of Pennsylvania, CIS Technical Report MS-CIS-04-21
Presented by Jian-Shiun Tzeng, 5/7/2009

Hanna Wallach is a postdoctoral research associate at the University of Massachusetts Amherst, working with Andrew McCallum. Hanna's Ph.D. work, undertaken at the University of Cambridge, introduced new methods for statistically modeling text using structured topic models—models that combine latent topics with information about document structure, ranging from local sentence structure to inter-document relationships. 2

Theses "Structured Topic Models for Language." Ph.D. thesis, University of Cambridge, "Efficient Training of Conditional Random Fields." M.Sc. thesis, University of Edinburgh, "Visual representation of CAD constraints." B.A. thesis, University of Cambridge,

Outline
Labeling Sequential Data
Undirected Graphical Models
– Potential Functions
Conditional Random Fields
Maximum Entropy
Maximum Likelihood Parameter Inference
CRF Probability as Matrix Computations
Dynamic Programming
4

Labeling Sequential Data One of the most common methods for performing labeling and segmentation tasks is to employ hidden Markov models [13] (HMMs) or probabilistic finite-state automata to identify the most likely sequence of labels for the words in any given sentence. 5

HMMs are a form of generative model that defines a joint probability distribution p(X, Y), where X and Y are random variables ranging over observation sequences and their corresponding label sequences, respectively. 6

In order to define a joint distribution of this nature, generative models must enumerate all possible observation sequences – a task which, for most domains, is intractable unless observation elements are represented as isolated units, independent of the other elements in an observation sequence. 7

More precisely, the observation element at any given instant in time may only directly depend on the state, or label, at that time. This is an appropriate assumption for a few simple data sets; however, most real-world observation sequences are best represented in terms of multiple interacting features and long-range dependencies between observation elements. 8
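For reference, the factorization below is not shown on the slides, but it is the standard first-order HMM form that embodies this assumption: each observation x_i depends only on its own label y_i, and each label only on the previous label.

p(x, y) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}) \, p(x_i \mid y_i)

Here p(y_1 | y_0) is read as the initial label distribution.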

This representation issue is one of the most fundamental problems when labeling sequential data. Clearly, a model that supports tractable inference is necessary; however, a model that represents the data without making unwarranted independence assumptions is also desirable. 9

One way of satisfying both these criteria is to use a model that defines a conditional probability p(Y | x) over label sequences given a particular observation sequence x, rather than a joint distribution over both label and observation sequences. 10

Conditional models are used to label a novel observation sequence x* by selecting the label sequence y* that maximizes the conditional probability p(y* | x*). The conditional nature of such models means that no effort is wasted on modeling the observations, and one is free from having to make unwarranted independence assumptions about these sequences. 11
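In symbols (a restatement of the sentence above, not a formula from the slides), labeling a new sequence amounts to

y^{*} = \arg\max_{y} \; p(y \mid x^{*})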

Arbitrary attributes of the observation data may be captured by the model, without the modeler having to worry about how these attributes are related. 12

Conditional random fields [8] (CRFs) are a probabilistic framework for labeling and segmenting sequential data, based on the conditional approach described in the previous paragraph. A CRF is a form of undirected graphical model that defines a single log-linear distribution over label sequences given a particular observation sequence. 13

The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem [8], a weakness exhibited by maximum entropy Markov models [9] (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world sequence labeling tasks [8, 11, 15]. 14

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations Dynamic Programming 15

Undirected Graphical Models A conditional random field may be viewed as an undirected graphical model, or Markov random field [3], globally conditioned on X, the random variable representing observation sequences. 16

Formally, we define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G, then (Y, X) is a conditional random field. 17
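Written out explicitly (following the definition in Lafferty et al. [8]), the Markov property required of each Y_v is

p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v),

where w \sim v means that w and v are neighbours in G.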

In theory the structure of graph G may be arbitrary, provided it represents the conditional independencies in the label sequences being modeled. However, when modeling sequences, the simplest and most common graph structure encountered is that in which the nodes corresponding to elements of Y form a simple first-order chain, as illustrated in Figure 1. 18

[Figure 1: graphical structure of a chain-structured CRF for sequences; the nodes corresponding to the label variables Y_i form a first-order chain, globally conditioned on the observation sequence X.] 19

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations Dynamic Programming 20

Potential Functions The graphical structure of a conditional random field may be used to factorize the joint distribution over elements Y_v of Y into a normalized product of strictly positive, real-valued potential functions, derived from the notion of conditional independence. 21

The product of a set of strictly positive, real-valued functions is not guaranteed to satisfy the axioms of probability. A normalization factor is therefore introduced to ensure that the product of potential functions is a valid probability distribution over the random variables represented by vertices in G. Each potential function operates on a subset of the random variables represented by vertices in G. 22

According to the definition of conditional independence for undirected graphical models, the absence of an edge between two vertices in G implies that the random variables represented by these vertices are conditionally independent given all other random variables in the model. 23

The potential functions must therefore ensure that it is possible to factorize the joint probability such that conditionally independent random variables do not appear in the same potential function – why? 24
(Slide annotation: two sets of label variables y_{s1}, …, y_{sm} and y_{t1}, …, y_{tn} that are conditionally independent given X should not appear in the same potential function.)

※ Maximal clique
A clique is defined as a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset
– The set of nodes in a clique is fully connected
A maximal clique is a set of vertices that induces a complete subgraph, and that is not a subset of the vertices of any larger complete subgraph
– That is, it is a set S such that every pair of vertices in S is connected by an edge and every vertex not in S is missing an edge to at least one vertex in S
25
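As a quick illustration (not part of the original slides), the networkx library can enumerate the maximal cliques of a small graph:

```python
import networkx as nx

# A triangle {0, 1, 2} plus one extra edge 2-3.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# find_cliques yields the *maximal* cliques of an undirected graph.
print(list(nx.find_cliques(G)))   # e.g. [[0, 1, 2], [2, 3]] (order may vary)
```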

The easiest way to fulfill this requirement is to require each potential function to operate on a set of random variables whose corresponding vertices form a maximal clique within G. This ensures that no potential function refers to any pair of random variables whose vertices are not "directly" connected and, if two vertices appear together in a clique, this relationship is made explicit. 26
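The equations on the original slides are not preserved in this transcript; as a sketch in the notation used so far, the resulting factorization has the standard form

p(y \mid x) = \frac{1}{Z(x)} \prod_{C} \Psi_C(y_C, x), \qquad
Z(x) = \sum_{y'} \prod_{C} \Psi_C(y'_C, x),

where C ranges over the maximal cliques of G, each \Psi_C is a strictly positive potential function over the variables in clique C, and Z(x) is the normalization factor introduced above.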

In the case of a chain-structured CRF, such as that depicted in Figure 1, each potential function will operate on pairs of adjacent label variables Y_i and Y_{i+1}. 27

It is worth noting that an isolated potential function does not have a direct probabilistic interpretation, but instead represents constraints on the configurations of the random variables on which the function is defined. This in turn affects the probability of global configurations – a global configuration with a high probability is likely to have satisfied more of these constraints than a global configuration with a low probability. 28

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations Dynamic Programming 29

Conditional Random Fields Lafferty et al. [8] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form: 30

\exp\Big( \sum_{j} \lambda_j \, t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k \, s_k(y_i, x, i) \Big)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i-1 in the label sequence, s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence, and \lambda_j and \mu_k are parameters to be estimated from training data. 31

When defining feature functions, we construct a set of real-valued features b(x, i) of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. An example of such a feature is

b(x, i) = 1 if the observation at position i is the word "September"; 0 otherwise.

32

Each feature function takes on the value of one of these real-valued observation features b(x, i) if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

t_j(y_{i-1}, y_i, x, i) = b(x, i) if y_{i-1} = IN and y_i = NNP; 0 otherwise.

33

(Slide 34 annotation: n_1 = number of transition functions, n_2 = number of state functions, giving n_1 + n_2 feature functions in total.) 34
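A minimal Python sketch (not from the slides; the specific word and part-of-speech tags are illustrative choices) of an observation feature b(x, i), a state feature, and a transition feature of the kind just described:

```python
# Observation feature b(x, i): a real-valued function of the observation
# sequence x (a list of words) and a position i.
def b(x, i):
    return 1.0 if x[i] == "September" else 0.0

# State feature s_k(y_i, x, i): takes the value b(x, i) only when the
# current label has a particular value.
def s_k(y_i, x, i):
    return b(x, i) if y_i == "NNP" else 0.0

# Transition feature t_j(y_prev, y_i, x, i): fires only for a particular
# pair of previous/current labels.
def t_j(y_prev, y_i, x, i):
    return b(x, i) if (y_prev == "IN" and y_i == "NNP") else 0.0

# Example: the phrase "in September" tagged IN NNP.
x = ["in", "September"]
print(t_j("IN", "NNP", x, 1))   # 1.0
print(s_k("NNP", x, 1))         # 1.0
print(b(x, 0))                  # 0.0
```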

This allows the probability of a label sequence y given an observation sequence x to be written as

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{j} \lambda_j F_j(y, x) \Big)    (3)

where F_j(y, x) = \sum_{i} f_j(y_{i-1}, y_i, x, i), each f_j being either a state function or a transition function, and Z(x) is a normalization factor. 35
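To make the normalization factor Z(x) concrete, here is a hedged brute-force sketch (practical only for tiny label sets and short sequences; the features and weights are invented for illustration) that enumerates every label sequence to compute Z(x) and then evaluates p(y | x, λ):

```python
import itertools
import math

LABELS = ["IN", "NNP", "OTHER"]

def b(x, i):
    return 1.0 if x[i] == "September" else 0.0

# Feature functions f_j(y_prev, y_cur, x, i): one transition feature
# and one state feature.
FEATURES = [
    lambda y_prev, y_cur, x, i: b(x, i) if (y_prev == "IN" and y_cur == "NNP") else 0.0,
    lambda y_prev, y_cur, x, i: 1.0 if y_cur == "OTHER" else 0.0,
]
LAMBDAS = [1.5, 0.3]   # illustrative weights

def score(y, x):
    """sum_j lambda_j * F_j(y, x), with F_j summed over positions i."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        for lam, f in zip(LAMBDAS, FEATURES):
            total += lam * f(y_prev, y[i], x, i)
    return total

def prob(y, x):
    """p(y | x, lambda) via brute-force normalization over all label sequences."""
    Z = sum(math.exp(score(y_prime, x))
            for y_prime in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / Z

x = ["in", "September"]
print(prob(("IN", "NNP"), x))
```

In practice Z(x) is computed with dynamic programming rather than enumeration, which is the topic of the later sections.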

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations Dynamic Programming 36

Maximum Entropy The form of a CRF, as given in (3), is heavily motivated by the principle of maximum entropy – a framework for estimating probability distributions from a set of training data. The entropy of a probability distribution [16] is a measure of uncertainty and is maximized when the distribution in question is as uniform as possible. 37

The principle of maximum entropy asserts that the only probability distribution that can justifiably be constructed from incomplete information, such as finite training data, is that which has maximum entropy subject to a set of constraints representing the information available. Any other distribution will involve unwarranted assumptions [7]. 38

If the information encapsulated within training data is represented using a set of feature functions such as those described in the previous section, the maximum entropy model distribution is the one that is as uniform as possible while ensuring that the expectation of each feature function with respect to the empirical distribution of the training data equals the expected value of that feature function with respect to the model distribution. Identifying this distribution is a constrained optimization problem that can be shown [2, 10, 14] to be satisfied by (3). 39
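Restated as an equation in the notation used above (not a formula taken from the slides), the constraints require, for every feature F_j,

E_{\tilde{p}(Y, X)}\big[ F_j(Y, X) \big] \;=\; E_{\tilde{p}(X)\, p(Y \mid X, \lambda)}\big[ F_j(Y, X) \big],

where \tilde{p} denotes the empirical distribution of the training data; the maximum entropy distribution is the most uniform p(Y | X, \lambda) satisfying these constraints.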

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations Dynamic Programming 40

Maximum Likelihood Parameter Inference Assuming the training data {(x^(k), y^(k))} are independently and identically distributed, the product of (3) over all training sequences, as a function of the parameters λ, is known as the likelihood, denoted by p({y^(k)} | {x^(k)}, λ). 41

Maximum likelihood training chooses parameter values such that the logarithm of the likelihood, known as the log-likelihood, is maximized. For a CRF, the log-likelihood is given by

\mathcal{L}(\lambda) = \sum_{k} \Big[ \log \frac{1}{Z(x^{(k)})} + \sum_{j} \lambda_j F_j(y^{(k)}, x^{(k)}) \Big]

42

This function is concave, guaranteeing convergence to the global maximum. Differentiating the log-likelihood with respect to parameter λ_j gives

\frac{\partial \mathcal{L}(\lambda)}{\partial \lambda_j} = \sum_{k} \Big[ F_j(y^{(k)}, x^{(k)}) - E_{p(Y \mid x^{(k)}, \lambda)}\big[ F_j(Y, x^{(k)}) \big] \Big]

43

Note that setting this derivative to zero yields the maximum entropy model constraint:
– The expectation of each feature with respect to the model distribution is equal to the expected value under the empirical distribution of the training data. 44

It is not possible to analytically determine the parameter values that maximize the log-likelihood – setting the gradient to zero and solving for λ does not always yield a closed-form solution. Instead, maximum likelihood parameters must be identified using an iterative technique such as iterative scaling [5, 1, 10] or gradient-based methods [15, 17]. 45
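As a toy illustration of the gradient-based option (a hedged sketch, not the report's algorithm; iterative scaling is not shown here, and the brute-force expectations are exponential in the sequence length), the following gradient ascent maximizes the log-likelihood for a tiny, invented training set:

```python
import itertools
import math

LABELS = ["IN", "NNP", "OTHER"]

def b(x, i):
    return 1.0 if x[i] == "September" else 0.0

# f_j(y_prev, y_cur, x, i): one transition feature and one state feature.
FEATURES = [
    lambda y_prev, y_cur, x, i: b(x, i) if (y_prev == "IN" and y_cur == "NNP") else 0.0,
    lambda y_prev, y_cur, x, i: 1.0 if y_cur == "OTHER" else 0.0,
]

def feature_counts(y, x):
    """F_j(y, x) = sum_i f_j(y_{i-1}, y_i, x, i) for every feature j."""
    counts = [0.0] * len(FEATURES)
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        for j, f in enumerate(FEATURES):
            counts[j] += f(y_prev, y[i], x, i)
    return counts

def expected_counts(x, lambdas):
    """E_{p(Y|x,lambda)}[F_j(Y, x)] by brute-force enumeration."""
    seqs = list(itertools.product(LABELS, repeat=len(x)))
    scores = [math.exp(sum(l * c for l, c in zip(lambdas, feature_counts(y, x))))
              for y in seqs]
    Z = sum(scores)
    expect = [0.0] * len(FEATURES)
    for y, s in zip(seqs, scores):
        for j, c in enumerate(feature_counts(y, x)):
            expect[j] += (s / Z) * c
    return expect

# Tiny training set: one labelled sequence.
train = [(["in", "September"], ("IN", "NNP"))]

lambdas = [0.0] * len(FEATURES)
eta = 0.5                        # learning rate
for _ in range(100):             # gradient ascent on the log-likelihood
    grad = [0.0] * len(FEATURES)
    for x, y in train:
        emp = feature_counts(y, x)          # empirical feature counts
        exp_ = expected_counts(x, lambdas)  # model-expected feature counts
        for j in range(len(FEATURES)):
            grad[j] += emp[j] - exp_[j]
    lambdas = [l + eta * g for l, g in zip(lambdas, grad)]

print(lambdas)
```

The update direction is exactly the derivative above: empirical feature counts minus model-expected feature counts.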

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations (not complete) Dynamic Programming 46

[Slide 47: CRF probability as matrix computations – equations only.] 47
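The equations for this section are not recoverable from the transcript; as a hedged sketch of what the section title refers to, following the formulation in Lafferty et al. [8], each position i of a chain CRF contributes a matrix over label pairs, and the probability and normalization factor become matrix products:

M_i(y', y \mid x) = \exp\Big( \sum_j \lambda_j f_j(y', y, x, i) \Big), \qquad
p(y \mid x, \lambda) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x),

where the label set is augmented with special start and stop labels (y_0 = start, y_{n+1} = stop) and Z(x) = [M_1(x) M_2(x) \cdots M_{n+1}(x)]_{start, stop}.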

Outline Labeling Sequential Data Undirected Graphical Models – Potential Functions Conditional Random Fields Maximum Entropy Maximum Likelihood Parameter Inference CRF Probability as Matrix Computations Dynamic Programming 48