
1 Conditional Random Fields model
QingSong.Guo

2 Recent work XML keyword query refinement. Two ways:
Focus on the XML tree structure
Focus on the keywords

3 XML tree In a keyword query, many nodes in the XML tree may match the keywords. The goal is to find semantically related keywords so that irrelevant XML nodes are not returned to users. Common result semantics: LCA (lowest common ancestor) and SLCA (smallest LCA).

4 Keyword Ambiguity "Mary Author Title Year"?
1. Find the title and year of publications of which Mary is an author.
2. Find the other authors of publications of which Mary is an author.
3. Find the year and authors of publications with titles similar to Mary's publications.

5 Keywords
Spelling error correction: machin → machine
Word splitting: universtyof RUC → university of RUC
Word merging: on line → online
Phrase segmentation: mark each word's position in the phrase
Word stemming: do → doing
Acronym expansion: RUC → Renmin University of China
Question: how can these refinements be done?

6 Labeling Sequence Data
X is a random variable over data sequences
Y is a random variable over label sequences
A is the set of possible part-of-speech tags
Problem: how do we obtain the label sequence y from the data sequence x?
Example: X = (Thinking, is, being) = (x1, x2, x3), Y = (y1, y2, y3) = (noun, verb, ...)

7 Hidden Markov models (HMMs)
Assign a joint probability to paired observation and label sequences
The parameters are typically trained to maximize the joint likelihood of the training examples

8 Markov model Markov property means that, given the present state, future states are independent of the past states State space, Random variables sequence from S Markov property:

9 HMM The state is not directly visible, but variables influenced by the state are visible
The hidden states provide a labeling of the data sequence

10 Example of HMM Suppose you have a friend who lives far away and calls you every day to tell you what he did that day. Your friend is interested in only three activities: walking in the park, shopping, and cleaning his apartment. What he does depends only on the weather. You have no direct knowledge of the weather where he lives, but you know the general trends. Based on what he tells you he did each day, you want to guess the weather there. You think of the weather as a Markov chain with two states, "rainy" and "sunny", but you cannot observe them directly; they are hidden from you. Each day, your friend performs one of the activities "walk", "shop", or "clean" with a probability that depends on the weather. Because your friend tells you his activities, these activities are your observations. The whole system is a hidden Markov model (HMM).

11 HMM Three problems:
Given the model λ = (A, B, π), how do we compute p(Y | λ)?
How do we select the proper state sequence Y?
How do we estimate the parameters to maximize p(Y | λ)?

12 HMM
[Diagram: workflow from data collection, through training (parameter estimation) and model establishment, to application]

13 HMM Definition: a quintuple (S, K, A, B, π)
S = {s1, ..., sn}: set of states
K = {k1, ..., km}: set of observation symbols
A = {aij}, aij = p(Xt+1 = sj | Xt = si): state transition probabilities
B = {bik}, bik = p(Ot = kk | Xt = si): output (emission) probabilities
π = {πi}, πi = p(X1 = si): initial state probabilities
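As an illustration of the quintuple above, here is a minimal sketch in Python; the states, observations, and probability values are hypothetical, loosely following the weather/activity example from slide 10:

```python
import numpy as np

# Hypothetical HMM following the quintuple (S, K, A, B, pi) above:
# hidden states = weather, observations = the friend's reported activity.
states = ["rainy", "sunny"]                  # S
observations = ["walk", "shop", "clean"]     # K

A = np.array([[0.7, 0.3],                    # transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],               # emission probabilities b_ik
              [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])                    # initial state probabilities

def sequence_likelihood(obs_idx):
    """Forward algorithm: p(O | lambda) for a sequence of observation indices."""
    alpha = pi * B[:, obs_idx[0]]            # initialization
    for o in obs_idx[1:]:
        alpha = (alpha @ A) * B[:, o]        # induction step
    return alpha.sum()

# p(walk, shop, clean | lambda)
print(sequence_likelihood([0, 1, 2]))
```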

14 HMM

15 Generative Models Difficulties and disadvantages
Need to enumerate all possible observation sequences
Not practical to represent multiple interacting features or long-range dependencies of the observations
Very strict independence assumptions on the observations

16 Discriminative models
Used in machine learning to model the dependence of an unobserved variable y on an observed variable x
Model the conditional probability distribution P(y | x), which can be used for predicting y from x

17 Maximum Entropy Markov Models (MEMMs)
A conditional model that represents the probability of reaching a state given an observation and the previous state
Given a training set X with label sequences Y: train the parameters θ to maximize P(Y | X, θ)
For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ)

18 MEMMs Have all the advantages of Conditional Models
Subject to the label bias problem: a bias toward states with fewer outgoing transitions

19 Label Bias Problem
P(1,2 | ro) = P(2 | 1, ro) P(1 | ro) = P(2 | 1, o) P(1 | r)
P(1,2 | ri) = P(2 | 1, ri) P(1 | ri) = P(2 | 1, i) P(1 | r)
Since P(2 | 1, x) = 1 for all x, P(1,2 | ro) = P(1,2 | ri)
In the training data, label value 2 is the only label value observed after label value 1
Therefore P(2 | 1) = 1, so P(2 | 1, x) = 1 for all x
However, we expect P(1,2 | ri) to be greater than P(1,2 | ro)
Per-state normalization does not allow this expectation to be met

20 Random Field

21 Conditional Random Fields (CRFs)
Have all the advantages of MEMMs without the label bias problem
An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
A CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence
An undirected acyclic graph
Allows some transitions to "vote" more strongly than others, depending on the corresponding observations

22 Definition of CRFs
X: random variable over data sequences to be labeled
Y: random variable over corresponding label sequences

23 Example of CRFs Here, we suppose the graph G is a chain

24 Conditional Distribution
If the graph G = (V, E) of Y is a tree, then by the fundamental theorem of random fields the conditional distribution over the label sequence Y = y, given X = x, is (reconstructed below the definitions):
v is a vertex from the vertex set V of label random variables
e is an edge from the edge set E over V
k is the number of features
λk and µk are parameters to be estimated
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v
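The equation itself appeared as an image on the slide. A reconstruction from the standard CRF formulation of Lafferty et al., using the symbols defined above:

```latex
p_\theta(y \mid x) \;\propto\; \exp\!\Big(
  \sum_{e \in E,\,k} \lambda_k\, f_k(e,\, y|_e,\, x)
  \;+\; \sum_{v \in V,\,k} \mu_k\, g_k(v,\, y|_v,\, x)
\Big)
```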

25 Conditional Distribution
CRFs use the observation-dependent normalization Z(x) for the conditional distributions: Z(x) is a normalization factor over the data sequence x
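The normalized distribution was also shown as an image; a sketch consistent with the previous slide's form, with Z(x) made explicit, is:

```latex
p_\theta(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big(
  \sum_{e \in E,\,k} \lambda_k\, f_k(e,\, y|_e,\, x)
  + \sum_{v \in V,\,k} \mu_k\, g_k(v,\, y|_v,\, x)
\Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Big(
  \sum_{e,\,k} \lambda_k\, f_k(e,\, y'|_e,\, x)
  + \sum_{v,\,k} \mu_k\, g_k(v,\, y'|_v,\, x)
\Big)
```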

26 Feature functions
Transition feature function, e.g.: equals 1 if yi-1 = IN and yi = NNP, 0 otherwise
State feature function, e.g.: equals 1 if xi is the word "september", 0 otherwise
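A minimal sketch of these two feature functions in Python; the indicator form is standard, and the tag names and word are taken from the slide's example:

```python
def transition_feature(y_prev, y_curr, x, i):
    """Transition feature: fires when an IN tag is followed by an NNP tag."""
    return 1.0 if (y_prev == "IN" and y_curr == "NNP") else 0.0

def state_feature(y_curr, x, i):
    """State feature: fires when the current word is 'september'."""
    return 1.0 if x[i].lower() == "september" else 0.0

# Example: x = ["in", "september"], y = ["IN", "NNP"]
x = ["in", "september"]
print(transition_feature("IN", "NNP", x, 1))  # 1.0
print(state_feature("NNP", x, 1))             # 1.0
```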

27 Maximum Entropy Principle
The form of the CRF given above is heavily motivated by the principle of maximum entropy ("A Mathematical Theory of Communication", Shannon, 1948)
The only probability distribution that can justifiably be constructed from finite training data is the one with maximum entropy, subject to a set of constraints representing the information available

28 Maximum Entropy Principle
If the information within the training data is represented using the set of feature functions described previously, the maximum entropy distribution is the one that is as uniform as possible while ensuring that the expectation of each feature function under the empirical distribution of the training data equals its expectation under the model distribution

29 Learning for CRFs
Assumption: the features fk and gk are given and fixed
The learning problem: determine the parameters θ = (λ1, λ2, ...; µ1, µ2, ...) that maximize the log-likelihood of the training data D = {(x(k), y(k))} with empirical distribution p~(x, y)
We simplify the notation, which allows the probability of a label sequence y given an observation sequence x to be written compactly
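The simplified notation and the resulting probability were slide images. A common reconstruction, following the usual convention of folding the state features into the transition features and summing each feature over positions (this F_j notation is my assumption, not from the transcript):

```latex
F_j(y, x) \;=\; \sum_{i=1}^{n} f_j(y_{i-1},\, y_i,\, x,\, i)
\qquad\Longrightarrow\qquad
p(y \mid x, \lambda) \;=\; \frac{1}{Z(x)}\,
  \exp\!\Big( \sum_{j} \lambda_j\, F_j(y, x) \Big)
```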

30 CRF Parameter Estimation
For a CRF, the log-likelihood is given by:
Differentiating the log-likelihood function with respect to the parameters gives:
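Both equations were slide images; standard forms in the F_j notation introduced above (reconstructed, not from the transcript) are:

```latex
\mathcal{L}(\lambda)
  \;=\; \sum_{k} \log p\big(y^{(k)} \mid x^{(k)}, \lambda\big)
  \;=\; \sum_{k} \Big[ \sum_{j} \lambda_j\, F_j\big(y^{(k)}, x^{(k)}\big)
        \;-\; \log Z\big(x^{(k)}\big) \Big]

\frac{\partial \mathcal{L}}{\partial \lambda_j}
  \;=\; \sum_{k} \Big[ F_j\big(y^{(k)}, x^{(k)}\big)
        \;-\; \mathbb{E}_{y \sim p(y \mid x^{(k)}, \lambda)}\, F_j\big(y, x^{(k)}\big) \Big]
```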

31 CRF Parameter Estimation
There is no analytical solution for the parameters that maximize the log-likelihood
Setting the derivative to zero and solving for the parameters does not always yield a closed-form solution
Iterative techniques are adopted instead:
Iterative scaling
Gradient descent
The core of these techniques lies in computing the expectation of each feature function with respect to the CRF model distribution (a gradient-ascent sketch follows below)
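A minimal sketch of gradient-based training for a linear-chain CRF. The helpers `feature_counts` (empirical counts F_j(y, x)) and `expected_counts` (model expectations, computed with forward-backward as on slides 33-34) are hypothetical placeholders:

```python
import numpy as np

def train_crf(data, num_features, feature_counts, expected_counts,
              lr=0.1, epochs=20):
    """Gradient ascent on the CRF log-likelihood.

    data            : list of (x, y) training pairs
    feature_counts  : (x, y) -> np.ndarray of empirical feature counts F_j(y, x)
    expected_counts : (x, weights) -> np.ndarray of E_{y ~ p(.|x)}[F_j(y, x)],
                      computed with the forward-backward algorithm
    """
    weights = np.zeros(num_features)
    for _ in range(epochs):
        grad = np.zeros(num_features)
        for x, y in data:
            # gradient = empirical counts - expected counts under the model
            grad += feature_counts(x, y) - expected_counts(x, weights)
        weights += lr * grad
    return weights
```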

32 CRF Probability as Matrix Computations
Augment the label sequence with start and end states
Define n+1 matrices, one per position, each indexed by pairs of labels:
The probability of a label sequence y given an observation sequence x can be written as the product of the appropriate entries of the n+1 matrices for that pair of sequences
The normalization factor can be computed from the product of these matrices
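The matrix definition and the resulting expressions were slide images; the standard forms from Lafferty et al., reconstructed for reference, are:

```latex
M_i(y', y \mid x) \;=\; \exp\!\Big( \sum_{k} \lambda_k\, f_k(e_i,\, (y', y),\, x)
                                  \;+\; \sum_{k} \mu_k\, g_k(v_i,\, y,\, x) \Big)

p_\theta(y \mid x) \;=\; \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z_\theta(x)},
\qquad
Z_\theta(x) \;=\; \Big[\, M_1(x)\, M_2(x) \cdots M_{n+1}(x) \,\Big]_{\text{start},\,\text{stop}}
```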

33 Dynamic Programming
The expectation of each feature function with respect to the CRF model distribution, for every observation sequence x(k) in the training data, is given by:
Rewriting the right-hand side of the above equation:

34 Dynamic Programming Defining forward and backward vectors
The probability of Yi and Yi-1 taking on labels y’ and y given observation sequence x(k) can be computed as
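The vector definitions and the resulting marginal were slide images; the standard forward-backward formulation in the matrix notation of slide 32 (reconstructed) is:

```latex
\alpha_0(y \mid x) = \begin{cases} 1 & y = \text{start} \\ 0 & \text{otherwise} \end{cases},
\qquad
\alpha_i(x) = \alpha_{i-1}(x)\, M_i(x),
\qquad
\beta_i(x)^{\top} = M_{i+1}(x)\, \beta_{i+1}(x)^{\top}

p\big(Y_{i-1}=y',\, Y_i=y \mid x^{(k)}\big)
  \;=\; \frac{\alpha_{i-1}(y' \mid x^{(k)})\; M_i(y', y \mid x^{(k)})\; \beta_i(y \mid x^{(k)})}{Z_\theta(x^{(k)})}
```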

35 Making Predictions Once a CRF model has been trained, there are (at least) two ways to do inference for a test sequence:
Predict the entire sequence Y with the highest probability using the Viterbi algorithm (MAP)
Make predictions for each individual yt using the forward-backward algorithm (MPM)
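A minimal Viterbi sketch for MAP decoding in a linear-chain CRF. The helper `log_potential(i, y_prev, y, x)` is a hypothetical stand-in for the log of the matrix entry M_i(y_prev, y | x) from slide 32:

```python
import numpy as np

def viterbi_decode(x, labels, log_potential):
    """MAP decoding: return the label sequence with the highest probability.

    x             : observation sequence (length n)
    labels        : list of possible labels
    log_potential : (i, y_prev, y, x) -> log M_i(y_prev, y | x); using "start"
                    as the initial previous label is an assumption here.
    """
    n, L = len(x), len(labels)
    score = np.full((n, L), -np.inf)    # best log-score ending in label j at position i
    back = np.zeros((n, L), dtype=int)  # backpointers

    for j in range(L):
        score[0, j] = log_potential(0, "start", labels[j], x)
    for i in range(1, n):
        for j in range(L):
            cand = [score[i - 1, k] + log_potential(i, labels[k], labels[j], x)
                    for k in range(L)]
            back[i, j] = int(np.argmax(cand))
            score[i, j] = cand[back[i, j]]

    # Trace back the best path
    best = [int(np.argmax(score[n - 1]))]
    for i in range(n - 1, 0, -1):
        best.append(back[i, best[-1]])
    return [labels[j] for j in reversed(best)]
```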

36 POS tagging Experiments

37 POS tagging Experiments
Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
Each word in a given input sentence must be labeled with one of 45 syntactic tags
Add a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
oov = out-of-vocabulary (not observed in the training set)

38 CRF for XML Trees XML documents are represented as DOM trees
Only element nodes, attribute nodes, and text nodes are considered
Attribute nodes are unordered; element nodes and text nodes are ordered

39 CRF for XML Trees
With every set of nodes, associate a random field X of observables Xn and a random field Y of output variables Yn, where n is a position in the tree
Xn are the symbols of the input tree, and Yn are the labels of its labeling
Triangle feature function:

40 CRF for XML Trees
[Figure: an example DOM tree (table, tr, td, @class nodes covering account, client, and product data such as id, name, address, price, number) with observation variables Xn at the tree nodes and output variables Y0, Y1, Y2, Y1.1, Y1.2, Y2.1 ... Y2.4 attached at the corresponding positions]

41 CRF-Query Refinement
Introduce refinement operations and incorporate them into the CRF model
Let o denote a sequence of refinement operations, o = o1, o2, …, on
The conditional model P(y, o | x) is called the CRF-QR model

42 Operations (task: operations)
Spelling correction: deletion / insertion / substitution / transposition
Word splitting: splitting
Word merging: merging
Phrase segmentation: begin / middle / end / out
Word stemming: +s / -s / +ed / -ed / +ing / -ing
Acronym expansion: expansion

43 Graphical representation

44 CRF-Query Refinement

45 Next work Continue along the two directions described above (XML tree structure and keywords). Thanks!

