# Sequence labeling and beam search (LING 572, Fei Xia, 2/15/07)


## Outline

- Classification problem (recap)
- Sequence labeling problem
- HMM and Viterbi algorithm
- Beam search
- MaxEnt: case study

## Classification Problem

Setting:

- C: a finite set of labels
- Input: x
- Output: y, where $y \in C$

Training data: an instance list $\{(x_i, y_i)\}$

- Supervised learning: $y_i$ is known
- Unsupervised learning: $y_i$ is unknown
- Semi-supervised learning: $y_i$ is unknown for most instances

### The 1st step: data conversion

Represent x as something else. Why?

- The number of possible values of x is infinite.
- The new representation makes learning possible.

How?

- Represent x as a feature vector.
- Define feature templates: which parts of x are useful for determining its y?
- Calculate the feature values (see the sketch below).
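
For illustration (not part of the original slides), a minimal sketch of defining feature templates and computing feature values for a POS-style task; all template names here are invented:

```python
def extract_features(words, i):
    """Map the word at position i to a feature dict (hypothetical templates)."""
    w = words[i]
    return {
        "word=" + w.lower(): 1,
        "suffix3=" + w[-3:]: 1,
        "is_capitalized": int(w[0].isupper()),
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<BOS>"): 1,
    }

# Features for "Maria" in "Maria likes cats"
print(extract_features(["Maria", "likes", "cats"], 0))
```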

### The 2nd step: modeling

- kNN and Rocchio: find the closest neighbors / prototypes.
- DT and DL: find the matching group.

### Modeling: NB and MaxEnt

Given x, choose y* such that

$$y^* = \arg\max_y P(y \mid x) = \arg\max_y P(x, y)$$

How do we calculate P(x, y)?

- How many "unique" (x, y) pairs are there?
- How can we make the task simpler? Decomposition.
- Number of parameters: $2^k |C| \Rightarrow O(k\,|C|)$ for k binary features (see the sketch below).
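
To make the decomposition concrete, a toy sketch of the Naive Bayes case, choosing $y^* = \arg\max_y P(y)\prod_k P(f_k \mid y)$; all probability tables here are invented:

```python
import math

def nb_score(feats, y, prior, cond):
    """log P(y) + sum_k log P(f_k | y): the decomposition that cuts
    the parameter count from 2^k |C| to O(k |C|)."""
    return math.log(prior[y]) + sum(math.log(cond[(f, y)]) for f in feats)

# Invented tables for two classes and two binary features
prior = {"A": 0.6, "B": 0.4}
cond = {("f1", "A"): 0.9, ("f2", "A"): 0.2,
        ("f1", "B"): 0.3, ("f2", "B"): 0.7}
feats = ["f1", "f2"]
print(max(prior, key=lambda y: nb_score(feats, y, prior, cond)))  # -> "A"
```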

### The 3rd step: training

- kNN: no training.
- Rocchio: calculate the prototypes.
- DT and DL: learn the trees/rules by selecting important features and splitting the data.
- NB: calculate the parameter values by simple counting.
- MaxEnt: estimate the parameters iteratively.

### The 4th step: testing

- kNN: calculate the distance between x and its neighbors.
- Rocchio: calculate the distance between x and the prototypes.
- DT and DL: traverse the tree/list.
- NB and MaxEnt: calculate P(x, y).

### Attribute-value table

- Each row corresponds to an instance.
- Each column except the last one corresponds to a feature.
- No features refer to the class label ⇒ at test time the classification of $x_i$ does not affect the classification of $x_j$, and all the feature values are available before testing starts.

## Sequence labeling problem

### Task

To find the most probable labeling of a sequence. Examples:

- POS tagging
- NP chunking
- NE detection
- Word segmentation
- IGT detection
- Parsing
- …

### Questions

- Training data: $\{(x_i, y_i)\}$. What is $x_i$? What is $y_i$?
- What are the features?
- How do we convert $x_i$ to a feature vector for the training data? How do we do that for the test data?

### How to solve a sequence labeling problem?

- Use a sequence labeling algorithm, e.g., an HMM.
- Use a classification algorithm:
  - Don't use features that refer to class labels, or
  - Use those features and get their values by running other processes, or
  - Use those features and find a good (global) solution.

### Major steps

- Data conversion: what is the label set?
- Modeling
- Training
- Testing:
  - How to combine individual labels into a label sequence?
  - How to find a good label sequence?

## HMM and Viterbi algorithm

### Two types of HMMs

- State-emission HMM (Moore machine): the emission probability depends only on the state (the from-state or the to-state).
- Arc-emission HMM (Mealy machine): the emission probability depends on the (from-state, to-state) pair.

### State-emission HMM

Two kinds of parameters:

- Transition probability: $P(s_j \mid s_i)$
- Output (emission) probability: $P(w_k \mid s_i)$

Number of parameters: $O(NM + N^2)$, for N states and M output symbols.

[Figure: states $s_1, s_2, \dots, s_N$, each emitting words such as $w_1, w_3, w_4, w_5$]

### Arc-emission HMM

[Figure: states $s_1, s_2, \dots, s_N$ with words $w_1, \dots, w_5$ emitted on the transition arcs]

Same kinds of parameters, but the emission probabilities depend on both states: $P(w_k, s_j \mid s_i)$.

Number of parameters: $O(N^2 M + N^2)$ (see the sketch below).
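
To make the counts concrete, a tiny sketch contrasting the two parameterizations; the numbers 45 and 10000 are made up for scale, and initial-state probabilities are ignored for simplicity:

```python
def num_params(N, M, arc_emission=False):
    """Parameter counts for an HMM with N states and M output symbols."""
    transitions = N * N                                # P(s_j | s_i)
    emissions = N * N * M if arc_emission else N * M   # per-state vs. per-arc
    return transitions + emissions

print(num_params(45, 10000))                      # state-emission: O(NM + N^2)
print(num_params(45, 10000, arc_emission=True))   # arc-emission: O(N^2 M + N^2)
```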

### Constraints

For any integer $n$ and any HMM $\mu$, the probabilities of all output sequences of length $n$ sum to one:

$$\sum_{w_{1,n}} P(w_{1,n} \mid \mu) = 1$$

### Properties of HMM

- Limited horizon: $P(X_{t+1} = s_k \mid X_1, \dots, X_t) = P(X_{t+1} = s_k \mid X_t)$
- Time invariance: the probabilities do not change over time; $P(X_{t+1} = s_k \mid X_t = s_j)$ is the same for every $t$.
- The states are hidden: we know the structure of the machine (i.e., S and Σ), but we don't know which state sequence generated a particular output.

### Three fundamental questions for HMMs

1. Finding the probability of an observation
2. Finding the best state sequence
3. Training: estimating the parameters

### (2) Finding the best state sequence

Given the observation sequence $O_{1,T} = o_1 \dots o_T$, find the state sequence $X_{1,T+1} = X_1 \dots X_{T+1}$ that maximizes $P(X_{1,T+1} \mid O_{1,T})$ ⇒ the Viterbi algorithm.

[Figure: state sequence $X_1, X_2, \dots, X_{T+1}$ with outputs $o_1, o_2, \dots, o_T$]

### Viterbi algorithm

Let $\delta_i(t)$ be the probability of the best path that produces $o_1 \dots o_{t-1}$ while ending up in state $s_i$:

$$\delta_i(t) = \max_{X_{1,t-1}} P(o_{1,t-1}, X_{1,t-1}, X_t = s_i)$$

Initialization: $\delta_i(1) = \pi_i$ (the initial-state probability)

Induction: $\delta_j(t+1) = \max_i \delta_i(t)\, P(o_t, s_j \mid s_i)$

### Important concepts

- State vs. class label
- Assumption: $P(t_i \mid t_1^{i-1}) = P(t_i \mid t_{i-1})$
- Multiple state sequences (paths) can lead to a given state, but one of them is the most likely path to that state, called the "survivor path".

### Viterbi search
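
Below is a minimal Viterbi sketch (my own illustration, not the original slide content), using the state-emission parameterization for simplicity; `init`, `trans`, and `emit` are assumed dict-based probability tables:

```python
def viterbi(obs, states, init, trans, emit):
    """Most probable state sequence for obs under a state-emission HMM.
    delta[s] holds the probability of the best path that ends in state s;
    back pointers record the survivor path into each state."""
    delta = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, bp = delta, {}, {}
        for s in states:
            r = max(states, key=lambda r_: prev[r_] * trans[r_].get(s, 0.0))
            delta[s] = prev[r] * trans[r].get(s, 0.0) * emit[s].get(o, 0.0)
            bp[s] = r
        back.append(bp)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for bp in reversed(back):          # follow survivor-path back pointers
        path.append(bp[path[-1]])
    return list(reversed(path)), delta[last]

# Toy usage with invented parameters
states = ["N", "V"]
init = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.6, "can": 0.4}, "V": {"fish": 0.3, "can": 0.7}}
print(viterbi(["can", "fish"], states, init, trans, emit))
```

In practice one works with log probabilities to avoid underflow on long sequences.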

## Beam Search

### Beam search (basic)

### More options

Expanding options: topN, minhyps

- If hyps_num < minhyps, then use max(topN, minhyps) tags for $w_i$; else use topN tags.

Pruning options: maxhyps, beam, minhyps

- Keep a hypothesis iff prob(hyp) * beam > max_prob and the hypothesis is among the top maxhyps, or the hypothesis is among the top minhyps (see the sketch below).
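
The pruning rule translates almost directly into code. This sketch (names mine) assumes raw probabilities and a multiplicative beam factor; with minhyps ≥ 1 the best hypothesis always survives:

```python
def prune(hyps, beam, maxhyps, minhyps):
    """Keep a hypothesis iff prob * beam > max_prob and it is among the
    top maxhyps, or it is among the top minhyps.
    hyps is a list of (prob, tag_sequence) pairs."""
    hyps = sorted(hyps, key=lambda h: h[0], reverse=True)
    max_prob = hyps[0][0]
    return [h for i, h in enumerate(hyps)
            if i < minhyps or (i < maxhyps and h[0] * beam > max_prob)]
```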

### Beam search

1. Generate m tags for $w_1$ and set the initial hypotheses $s_{1j}$ accordingly.
2. For i = 2 to n (n is the sentence length):
   - Expanding: for each surviving sequence $s_{(i-1)j}$, generate m tags for $w_i$ given $s_{(i-1)j}$ as the previous-tag context, and append each tag to $s_{(i-1)j}$ to make a new sequence.
   - Pruning: keep only the surviving hypotheses.
3. Return the highest-probability sequence $s_{n1}$ (see the sketch below).
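
Putting the steps together, a minimal sketch: `next_tags(tags, i)`, which returns the top m (tag, prob) candidates for `words[i]` given the tag context (e.g., from a MaxEnt classifier), is an assumed interface, and `prune` is the sketch above:

```python
def beam_search(words, next_tags, beam, maxhyps, minhyps):
    """Left-to-right beam search over tag sequences.
    Each hypothesis is a (prob, tags) pair."""
    hyps = [(p, [t]) for t, p in next_tags([], 0)]          # tags for w_1
    for i in range(1, len(words)):
        expanded = [(p * q, tags + [t])                     # expanding
                    for p, tags in hyps
                    for t, q in next_tags(tags, i)]
        hyps = prune(expanded, beam, maxhyps, minhyps)      # pruning (above)
    return max(hyps, key=lambda h: h[0])                    # best sequence
```

Unlike Viterbi, nothing here restricts `next_tags` to look only at the previous tag: the context can include the whole partial sequence, which is why beam search supports a bigger feature window at the cost of exactness.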

### Viterbi vs. beam search

- Dynamic programming vs. heuristic search
- Globally optimal vs. inexact
- Small window vs. big window for features