1
Sequence labeling and beam search
LING 572
Fei Xia
2/15/07

2
Outline
- Classification problem (recap)
- Sequence labeling problem
- HMM and the Viterbi algorithm
- Beam search
- MaxEnt: case study

3
Classification Problem

4
Classification problem
Setting:
- C: a finite set of labels
- Input: x
- Output: y, where y ∈ C
Training data: an instance list {(x_i, y_i)}
- Supervised learning: y_i is known
- Unsupervised learning: y_i is unknown
- Semi-supervised learning: y_i is unknown for most instances

5
The 1st step: data conversion
Represent x as something else.
Why?
- The number of possible x is infinite.
- The new representation makes learning possible.
How?
- Represent x as a feature vector.
- Define feature templates: which parts of x are useful for determining its y?
- Calculate the feature values.
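As an illustration of feature templates, here is a minimal sketch; the template names and the helper `extract_features` are hypothetical, not from the lecture. It maps a word in its sentence context to a feature dictionary:

```python
# Hypothetical feature templates for a tagging task: the word itself,
# its 2-character suffix, capitalization, and the previous word.
def extract_features(words, i):
    w = words[i]
    return {
        "word=" + w.lower(): 1,
        "suffix2=" + w[-2:].lower(): 1,
        "is_capitalized": 1 if w[0].isupper() else 0,
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1,
    }
```

The same templates are applied to training and test data; only the feature values change per instance.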

6
The 2nd step: modeling
- kNN and Rocchio: find the closest neighbors / prototypes
- DT and DL: find the matching group

7
Modeling: NB and MaxEnt
Given x, choose y* such that
  y* = arg max_y P(y|x) = arg max_y P(x, y)
How to calculate P(x, y)?
- How many "unique" (x, y) pairs are there?
- How can we make the task simpler? → Decomposition
Number of parameters: 2^k |C| → O(k |C|) (for k binary features)
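The decomposition can be sketched as a toy Naive Bayes classifier (the labels, features, and probability tables here are made up for illustration): instead of one parameter per (x, y) pair, we store P(y) plus one P(f | y) per feature and label.

```python
import math

def nb_classify(features, labels, prior, cond):
    # Naive Bayes decomposition: score(y) = log P(y) + sum_j log P(f_j | y),
    # so we need O(k|C|) parameters instead of one per (x, y) pair.
    def score(y):
        return math.log(prior[y]) + sum(math.log(cond[(f, y)]) for f in features)
    return max(labels, key=score)
```

Working in log space avoids underflow when many feature probabilities are multiplied.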

8
The 3rd step: training
- kNN: no training
- Rocchio: calculate the prototypes
- DT and DL: learn the trees/rules by selecting important features and splitting the data
- NB: calculate the parameter values by simple counting
- MaxEnt: estimate the parameters iteratively

9
The 4th step: testing
- kNN: calculate the distance between x and its neighbors
- Rocchio: calculate the distance between x and the prototypes
- DT and DL: traverse the tree/list
- NB and MaxEnt: calculate P(x, y)

10
Attribute-value table
- Each row corresponds to an instance.
- Each column except the last one corresponds to a feature.
- No features refer to the class label.
- At test time, the classification of x_i does not affect the classification of x_j: all the feature values are available before testing starts.

11
Sequence labeling problem

12
Task: find the most probable labeling of a sequence.
Examples:
- POS tagging
- NP chunking
- NE detection
- Word segmentation
- IGT detection
- Parsing
- …

13
Questions
Training data: {(x_i, y_i)}
- What is x_i? What is y_i?
- What are the features?
- How to convert x_i to a feature vector for the training data? How to do that for the test data?

14
How to solve a sequence labeling problem?
Using a sequence labeling algorithm: e.g., HMM
Using a classification algorithm:
- Don't use features that refer to class labels
- Use those features and get their values by running other processes
- Use those features and find a good (global) solution

15
Major steps
- Data conversion: what is the label set?
- Modeling
- Training
- Testing:
  - How to combine individual labels to get a label sequence?
  - How to find a good label sequence?

16
HMM and Viterbi algorithm

17
Two types of HMMs
State-emission HMM (Moore machine):
- The emission probability depends only on one state (the from-state or the to-state).
Arc-emission HMM (Mealy machine):
- The emission probability depends on the (from-state, to-state) pair.

18
State-emission HMM
Two kinds of parameters:
- Transition probability: P(s_j | s_i)
- Output (emission) probability: P(w_k | s_i)
Number of parameters: O(NM + N^2)
[Figure: states s_1, s_2, …, s_N, each emitting output symbols w_1, …, w_M]

19
Arc-emission HMM
Same kinds of parameters, but the emission probabilities depend on both states: P(w_k, s_j | s_i)
Number of parameters: O(N^2 M + N^2)
[Figure: states s_1, s_2, …, s_N, with output symbols emitted on the transition arcs]

20
Constraints
For any integer n and any HMM, the output probabilities for sequences of length n sum to one:
  Σ_{w_1 … w_n} P(w_1, …, w_n) = 1
The parameters are also normalized: Σ_j P(s_j | s_i) = 1 and Σ_k P(w_k | s_i) = 1 for each state s_i.

21
Properties of HMM
- Limited horizon: P(X_{t+1} = s_k | X_1, …, X_t) = P(X_{t+1} = s_k | X_t)
- Time invariance: the probabilities do not change over time: P(X_{t+1} = s_k | X_t = s_j) is the same for every t.
- The states are hidden: we know the structure of the machine (i.e., S and Σ), but we don't know which state sequence generated a particular output.

22
Three fundamental questions for HMMs
1. Finding the probability of an observation
2. Finding the best state sequence
3. Training: estimating parameters

23
(2) Finding the best state sequence
Given the observation O_{1,T} = o_1 … o_T, find the state sequence X_{1,T+1} = X_1 … X_{T+1} that maximizes P(X_{1,T+1} | O_{1,T}).
→ Viterbi algorithm
[Figure: trellis of hidden states X_1, X_2, …, X_{T+1} emitting o_1, o_2, …, o_T]

24
Viterbi algorithm
δ_i(t): the probability of the best path that produces O_{1,t-1} while ending up in state s_i:
  δ_i(t) = max_{X_{1,t-1}} P(X_{1,t-1}, O_{1,t-1}, X_t = s_i)
Initialization: δ_i(1) = π_i
Induction (arc-emission form): δ_j(t+1) = max_i δ_i(t) · P(o_t, s_j | s_i)
Keep a backpointer at each step to recover the best path at the end.
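The recurrence can be sketched in code. This is a minimal Viterbi implementation for a state-emission HMM; the parameter dictionaries and the toy weather model used to exercise it are standard illustrations, not from the lecture:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best path producing obs[:t+1]
    # and ending in state s; back[t][s] = predecessor on that path
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]])
                 for r in states),
                key=lambda x: x[1])
            best[t][s] = p
            back[t][s] = prev
    # follow backpointers from the best final state ("survivor path")
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

Unlike exhaustive enumeration over N^T paths, the dynamic program runs in O(T · N^2) time.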

25
Important concepts
- State vs. class label
- Assumption: P(t_i | t_1, …, t_{i-1}) = P(t_i | t_{i-1})
- Multiple sequences of states (paths) can lead to a given state, but one is the most likely path to that state, called the "survivor path".

26
Viterbi search

27
Beam Search

28
Beam search (basic)

29
More options
Expanding options: topN, minhyps
- If hyps_num < minhyps, use max(topN, minhyps) tags for w_i; otherwise use topN tags.
Pruning options: maxhyps, beam, minhyps
- Keep a hypothesis iff prob(hyp) * beam > max_prob and the hypothesis is among the top maxhyps, or the hypothesis is among the top minhyps.

30
Beam search
Generate m tags for w_1 and set s_{1,j} accordingly.
For i = 2 to n (n is the sentence length):
- Expanding: for each surviving sequence s_{i-1,j}:
  - generate m tags for w_i, given s_{i-1,j} as the previous-tag context;
  - append each tag to s_{i-1,j} to make a new sequence.
- Pruning
Return the highest-probability sequence s_{n,1}.
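The expand-then-prune loop can be sketched as follows. This is a minimal beam-search tagger; the local scoring function `score(prev_tag, word, tag)` is a hypothetical stand-in for whatever classifier (e.g., MaxEnt) supplies tag probabilities:

```python
def beam_search(words, tags, score, beam_width):
    # each hypothesis is a (tag_sequence, probability) pair
    hyps = [((t,), score(None, words[0], t)) for t in tags]
    hyps = sorted(hyps, key=lambda h: h[1], reverse=True)[:beam_width]
    for w in words[1:]:
        # expanding: extend every surviving sequence with every tag
        expanded = [(seq + (t,), p * score(seq[-1], w, t))
                    for seq, p in hyps for t in tags]
        # pruning: keep only the top beam_width hypotheses
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_width]
    return max(hyps, key=lambda h: h[1])[0]
```

With beam_width = 1 this degenerates to greedy left-to-right tagging; a larger beam trades time for a better (though still inexact) search.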

31
Viterbi vs. beam search
- DP vs. heuristic search
- Globally optimal vs. inexact
- Small window vs. big window for features

32
Additional slides

33
(1) Finding the probability of the observation
Forward probability: the probability of producing O_{1,t-1} while ending up in state s_i:
  α_i(t) = P(O_{1,t-1}, X_t = s_i)
Then P(O_{1,T}) = Σ_i α_i(T+1).

34
Calculating forward probability
Initialization: α_i(1) = π_i
Induction (arc-emission form): α_j(t+1) = Σ_i α_i(t) · P(o_t, s_j | s_i)
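The induction can be sketched in code. This is a minimal forward-algorithm implementation for a state-emission HMM (toy parameter tables assumed); it sums over all state paths where Viterbi takes the max:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    # alpha[s] = probability of producing the observations so far
    # and being in state s, summed over all paths
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())  # P(obs) under the HMM
```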
