Published by Linda O’Neal. Modified about 1 year ago.

1
DIRECTED GRAPHICAL MODEL: MEMM AS SEQUENCE CLASSIFIER
NIKET TANDON
Tutor: Martin Theobald, Date: 09 June, 2011

2
Why talk about them?
An HMM is a sequence classifier.
Hidden Markov models (HMMs) have been successfully applied to:
Part-of-speech tagging: He attends seminars
Named entity recognition: MPI director Gerhard Weikum

3
Agenda
HMM Recap
MaxEnt Model
HMM + MaxEnt ~= MEMM
Training & Decoding
Experimental results and discussion

5
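Markov chains and HMM
HMMs and Markov chains extend finite automata (FA).
States $q_i$; transitions between states $a_{ij}$ (transition matrix $A$).
Markov assumption: the next state depends only on the current state, $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$.
The probabilities on the arcs leaving a node sum to one.
Hierarchy shown on the slide: HMM, Markov chain, weighted FA, finite automaton.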

6
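HMM
One additional layer of uncertainty: the states are hidden! Think of them as causal factors in a probabilistic model.
States $q_i$
Transitions between states $a_{ij}$
Observation likelihood (emission probability) $B = b_i(o_t)$: the probability of observation $o_t$ being generated from state $q_i$
Markov assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$
Output independence assumption: $P(o_i \mid q_1 \ldots q_T, o_1 \ldots o_T) = P(o_i \mid q_i)$
Task: given the observed number of ice creams, predict the hidden weather states.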

7
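HMM characterized by three fundamental problems
Computing likelihood: given an HMM $\lambda = (A, B)$ and an observation sequence $O$, find $P(O \mid \lambda)$.
Decoding: given an HMM $\lambda = (A, B)$ and $O$, find the best hidden state sequence $Q$.
Learning: given $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.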

8
Computing $P(O \mid \lambda)$: observation likelihood
With $N$ hidden states and $T$ observations there are $N \times N \times \cdots \times N = N^T$ possible hidden sequences (HHC, HCH, CHH, ...). Exponentially large!
The efficient algorithm is the forward algorithm, which uses a table for intermediate values.

9
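Forward Algorithm
$\alpha_t(j)$ is the probability of being in state $j$ after seeing the first $t$ observations: $\alpha_t(j) = P(o_1 \ldots o_t, q_t = j \mid \lambda)$.
Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$, that is, previous forward probability * P(previous state $i$ to current state $j$) * P(observation $o_t$ given current state $j$).
In many HMM applications many $a_{ij}$ are zero, which reduces the search space.
Finally, sum over the probability of every path: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.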
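
The forward recursion can be sketched in a few lines of Python. This is only an illustration: the two weather states and every probability below are made-up assumptions in the spirit of the ice-cream task, not values from the slides. A brute-force sum over all N^T hidden sequences is included purely to check the table-based version.

```python
from itertools import product

# Hypothetical two-state ice-cream HMM; every number here is an illustrative assumption.
states = ["H", "C"]                       # hot, cold
start = {"H": 0.8, "C": 0.2}              # P(first state)
trans = {"H": {"H": 0.7, "C": 0.3},       # a_ij = P(state j | state i)
         "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4},    # b_j(o) = P(o ice creams | state j)
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """Return P(O | lambda) by updating the forward values one time step at a time."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * trans[i][j] for i in states) * emit[j][o]
                 for j in states}
    return sum(alpha.values())

def brute_force(obs):
    """Same quantity by enumerating all N^T hidden sequences (for checking only)."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = start[seq[0]] * emit[seq[0]][obs[0]]
        for prev, cur, o in zip(seq, seq[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        total += p
    return total

likelihood = forward([3, 1, 3])
```

The forward version does on the order of N^2 * T multiplications, while the brute-force check grows as N^T.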

11
Decoding: Viterbi Algorithm
Given an HMM $\lambda = (A, B)$ and $O$, find the most probable hidden state sequence $Q$.
Naive approach: run the forward algorithm for every possible hidden state sequence and choose the one with the maximum observation likelihood. Exponential!
An efficient alternative is Viterbi: $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$, that is, previous Viterbi path probability * P(previous state $i$ to current state $j$) * P(observation $o_t$ given current state $j$).
Compare with the forward algorithm: max instead of sum.
Keep a pointer to the best path that brought us here (backtrace).
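
A minimal Viterbi sketch follows, again on a hypothetical two-state ice-cream HMM whose probabilities are assumptions chosen for illustration. It differs from the forward algorithm only in taking a max instead of a sum and in storing backpointers.

```python
# Hypothetical two-state ice-cream HMM; every number here is an illustrative assumption.
states = ["H", "C"]
start = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.7, "C": 0.3},
         "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Return (best hidden state sequence, its joint probability with obs)."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev_v, v, bp = v, {}, {}
        for j in states:
            # Best predecessor of state j: max over i of v_{t-1}(i) * a_ij.
            best_i = max(states, key=lambda i: prev_v[i] * trans[i][j])
            v[j] = prev_v[best_i] * trans[best_i][j] * emit[j][o]
            bp[j] = best_i
        backpointers.append(bp)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    path.reverse()
    return path, v[last]

best_path, best_prob = viterbi([3, 1, 3])
```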

14
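Learning: Forward-Backward Algorithm
Train the transition probabilities $A$ and the emission probabilities $B$.
First consider a Markov chain: the states are observed, so $B = 1.0$ and only $A$ has to be computed.
Maximum likelihood estimate: $a_{ij} = \frac{\mathrm{count}(i \to j)}{\sum_{q} \mathrm{count}(i \to q)}$, that is, transitions from $i$ to $j$ divided by all transitions from $i$.
But an HMM has hidden states! The counts cannot be read off directly, so they are estimated with the EM-style forward-backward algorithm.
Backward probability: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, a sum over all successor values at $t+1$ weighted by $a_{ij}$ and the observation probability.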
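
For the fully observed Markov-chain case, the maximum likelihood estimate is just a ratio of counts. The sketch below shows that counting step only (not forward-backward itself); the toy weather sequences are invented for illustration.

```python
from collections import Counter

def estimate_transitions(sequences):
    """MLE of a_ij from fully observed state sequences: count(i -> j) / count(i -> any)."""
    pair_counts = Counter()
    from_counts = Counter()
    for seq in sequences:
        for i, j in zip(seq, seq[1:]):
            pair_counts[(i, j)] += 1
            from_counts[i] += 1
    return {(i, j): c / from_counts[i] for (i, j), c in pair_counts.items()}

# Invented toy data: two observed weather sequences (H = hot, C = cold).
weather = [["H", "H", "C"], ["C", "C", "H", "H"]]
a = estimate_transitions(weather)
```

In an HMM the state sequences are not observed, so forward-backward replaces these hard counts with expected counts.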

16
Background - Maximum Entropy models
A second probabilistic machine learning framework.
Maximum Entropy (MaxEnt) by itself is a non-sequential classifier.
The most common MaxEnt sequence classifier is the Maximum Entropy Markov Model (MEMM).
First, we discuss non-sequential MaxEnt.
MaxEnt works by extracting a set of features from the input, combining them linearly, and then using the sum as an exponent.
You must know linear regression: $y = \sum_i w_i f_i$. This value is not bounded, so it needs to be normalized before it can serve as a probability.
Instead of the probability, compute the odds. Recall that if $P(A) = 0.9$ and $P(A') = 0.1$, then the odds of $A$ are $0.9 / 0.1 = 9$.

17
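Logistic Regression
Linear model to predict the odds of $y = \text{true}$: $\frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} = w \cdot f$
The left side lies between 0 and ∞, but a linear function ranges between −∞ and +∞, so take the log: $\ln \frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)} = w \cdot f$
The function on the left is called the logit.
A regression model that estimates not the probability but the logit of the probability is called logistic regression.
So, if the linear function estimates the logit, what is $P(y = \text{true})$? $p(y = \text{true} \mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}} = \frac{1}{1 + e^{-w \cdot f}}$
This function is called the logistic function, which gives logistic regression its name.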
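
The relationship between probability, odds, logit and the logistic function can be checked numerically. The helper names below are illustrative, not from the slides.

```python
import math

def odds(p):
    """Odds of an event with probability p, e.g. 0.9 / 0.1 for p = 0.9."""
    return p / (1 - p)

def logit(p):
    """Log-odds: maps (0, 1) onto the whole real line."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit: maps any real z back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

p = 0.9
z = logit(p)             # the value a linear model w . f would have to produce
recovered = logistic(z)  # back to the probability
```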

18
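Learning Weights (w)
Unlike linear regression, which minimizes squared loss on the training set, logistic regression uses conditional maximum likelihood estimation.
Choose the $w$ that makes the probability of the observed $y$ values in the training data highest, given $x$: $\hat{w} = \arg\max_w P(y^{(i)} \mid x^{(i)})$
For the entire training set: $\hat{w} = \arg\max_w \prod_i P(y^{(i)} \mid x^{(i)})$
Taking the log: $\hat{w} = \arg\max_w \sum_i \log P(y^{(i)} \mid x^{(i)})$
Written out separately for $y = 1$ and $y = 0$, this is an unwieldy expression.
Condensed form (valid because $y^{(i)}$ is either 0 or 1) and substitution of the logistic function: $\hat{w} = \arg\max_w \sum_i \left[ y^{(i)} \log \frac{e^{w \cdot f^{(i)}}}{1 + e^{w \cdot f^{(i)}}} + (1 - y^{(i)}) \log \frac{1}{1 + e^{w \cdot f^{(i)}}} \right]$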

19
How to solve the convex optimization problem?
There are several methods for solving convex optimization problems.
Later we explain one algorithm: Generalized Iterative Scaling (GIS).

20
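MaxEnt
Until now there were two classes. With more classes, logistic regression is called multinomial logistic regression (or MaxEnt).
In MaxEnt, the probability that $y = c$ is: $p(c \mid x) = \frac{1}{Z} \exp\left(\sum_i w_{ci}\, f_i(c, x)\right)$
Normalize to make it a probability: $Z = \sum_{c'} \exp\left(\sum_i w_{c'i}\, f_i(c', x)\right)$
$f_i(c, x)$ means feature $i$ of observation $x$ for class $c$. The $f_i$ are usually not real-valued but binary (more common in text processing).
Learning $w$ is similar to logistic regression, with one change: MaxEnt tends to learn very high weights, so regularization is used.
Let us see a classification problem.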

21
Niket/NNP is/BEZ expected/VBN to/TO talk/?? today/
Is "talk" a VB or an NN?

          f1    f2    f3    f4    f5    f6
VB (f)    0     1     0     1     1     0
VB (w)          .8          .01   .1
NN (f)    1     0     0     0     0     1
NN (w)    .8                            -1.3
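
The VB/NN question can be answered by plugging the feature values and weights into the MaxEnt formula. The sketch below assumes that each listed weight belongs to a feature that fires for that class (VB: f2, f4, f5; NN: f1, f6); the flattened slide does not state this alignment explicitly, so it is an assumption. Weights of features that do not fire are set to 0 here because they do not affect the result.

```python
import math

# Feature indicators f1..f6 per class, as listed on the slide.
features = {"VB": [0, 1, 0, 1, 1, 0],
            "NN": [1, 0, 0, 0, 0, 1]}
# Assumed alignment of the listed weights with the firing features.
weights = {"VB": [0.0, 0.8, 0.0, 0.01, 0.1, 0.0],
           "NN": [0.8, 0.0, 0.0, 0.0, 0.0, -1.3]}

def maxent_probs(features, weights):
    """p(c | x) = exp(sum_i w_ci * f_i(c, x)) / Z for every class c."""
    scores = {c: math.exp(sum(w * f for w, f in zip(weights[c], features[c])))
              for c in features}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

probs = maxent_probs(features, weights)
```

Under that assumption the VB score is exp(0.91) and the NN score is exp(-0.5), so VB comes out at roughly 0.80 and NN at roughly 0.20.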

22
More complex features
A word starting with a capital letter (Day) is more likely to be a proper noun (NNP) than a common noun (e.g. Independence Day).
But a capitalized word at the beginning of a sentence is not more likely to be an NNP (e.g. Day after day).
In MaxEnt, such feature combinations have to be built by hand.
The key to successful use of MaxEnt is the design of appropriate features and feature combinations.

23
Why the name Maximum Entropy?
Suppose we tag a new word, Prabhu, with a model that makes the fewest assumptions and imposes no constraints. We get an equiprobable distribution:

NN    JJ    NNS   VB    NNP   VBG   IN    CD
1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8

Suppose we had some training data from which we learnt that the set of possible tags for Prabhu is NN, JJ, NNS, VB. Since one of these tags is correct, P(NN) + P(JJ) + P(NNS) + P(VB) = 1:

NN    JJ    NNS   VB    NNP   VBG   IN    CD
1/4   1/4   1/4   1/4   0     0     0     0

24
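Maximum Entropy
... of all possible distributions, the equiprobable distribution has the maximum entropy.
$p^* = \arg\max_{p} H(p)$, where $H(p) = -\sum_x p(x) \log p(x)$ and $p$ ranges over the distributions that satisfy the constraints.
The solution to this problem is the probability distribution of a MaxEnt model whose weights $W$ maximize the likelihood of the training data!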
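
As a quick numeric check of the claim, the entropy of the tag distributions from the Prabhu example can be computed directly. The skewed distribution is an invented comparison point that also satisfies the constraint P(NN) + P(JJ) + P(NNS) + P(VB) = 1.

```python
import math

def entropy(p):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

uniform8 = [1 / 8] * 8                                  # no constraints: 8 equally likely tags
constrained = [1 / 4] * 4 + [0] * 4                     # only NN, JJ, NNS, VB are possible
skewed = [0.7, 0.1, 0.1, 0.1, 0, 0, 0, 0]               # invented: same constraint, not equiprobable

h_uniform = entropy(uniform8)
h_constrained = entropy(constrained)
h_skewed = entropy(skewed)
```

Among distributions that satisfy the same constraint, the equiprobable one has the higher entropy (2 bits here), and adding the constraint lowers the maximum from 3 bits to 2 bits.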

26
Why HMM is not sufficient
An HMM is based on the probabilities P(tag | tag) and P(word | tag).
For tagging unknown words, useful features include capitalization, the presence of hyphens, and word endings, which do not fit naturally into these two probabilities.
An HMM is also unable to use information from later words to inform its decision early on.
The MEMM (Maximum Entropy Markov Model) mates the Viterbi algorithm with MaxEnt to overcome this problem!

27
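HMM (Generative) vs MEMM (Discriminative)
HMM: two probabilities, the observation likelihood and the prior: $\hat{T} = \arg\max_T P(W \mid T)\, P(T) = \arg\max_T \prod_i P(word_i \mid tag_i)\, P(tag_i \mid tag_{i-1})$
MEMM: a single probability, conditioned on the previous state and the current observation: $\hat{T} = \arg\max_T P(T \mid W) = \arg\max_T \prod_i P(tag_i \mid word_i, tag_{i-1})$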

28
MEMM
An MEMM can condition on many features of the input: capitalization, morphology (ending in -s or -ed), as well as earlier words or tags.
An HMM cannot, because it is likelihood-based and would need to compute the likelihood of each feature of the observation.

30
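Decoding (inference) and Learning in MEMM
Viterbi in HMM: $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, P(s_j \mid s_i)\, P(o_t \mid s_j)$
Viterbi in MEMM: $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, P(s_j \mid s_i, o_t)$
Learning in MEMM: train the weights so as to maximize the log-likelihood of the training corpus. GIS!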
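
The change from HMM Viterbi to MEMM Viterbi is that a single local distribution P(state | previous state, observation) replaces the product of transition and emission probabilities. The sketch below uses a hypothetical two-label model with hand-set feature weights; the labels, features and weights are all assumptions for illustration, not a trained MEMM.

```python
import math

STATES = ["question", "answer"]

def local_probs(prev_state, line):
    """Hypothetical P(state | prev_state, line): softmax over hand-set feature weights."""
    scores = {}
    for s in STATES:
        score = 0.0
        if line.rstrip().endswith("?") and s == "question":
            score += 2.0   # assumed weight: a trailing question mark suggests a question
        if prev_state == "question" and s == "answer":
            score += 1.0   # assumed weight: answers tend to follow questions
        if prev_state == s == "answer":
            score += 0.5   # assumed weight: answers span several lines
        scores[s] = math.exp(score)
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

def memm_viterbi(lines):
    """Most probable label sequence under the local model above."""
    v = local_probs(None, lines[0])          # no previous state for the first line
    back = []
    for line in lines[1:]:
        new_v, bp = {}, {}
        for j in STATES:
            best_i = max(STATES, key=lambda i: v[i] * local_probs(i, line)[j])
            new_v[j] = v[best_i] * local_probs(best_i, line)[j]
            bp[j] = best_i
        v = new_v
        back.append(bp)
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return path[::-1]

labels = memm_viterbi(["What cable should I use?",
                       "Here follows a diagram.",
                       "Pins 1, 4, and 8 must be connected."])
```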

31
Learning: GIS basics
Estimate a probability distribution $p(a, b)$ where $a$ is the class ($y_1$, $y_2$) and $b$ is a feature indicator (0, 1).

        0     1
y1      ?     ?
y2      ?     ?

Constraint on the model's expectation of the feature: $p(y_1, 0) + p(y_2, 0) = 0.6$, that is, $E_p f = 0.6$
Normalization: $p(y_1, 0) + p(y_1, 1) + p(y_2, 0) + p(y_2, 1) = 1$
Find the probability distribution $p$ with maximum entropy:

        0     1
y1      .3    .2
y2      .3    .2

32
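GIS algorithm sketch
GIS finds the weight parameters of the unique distribution $p^*$ that belongs to both $P$ (the distributions that satisfy the constraints) and $Q$ (the distributions of exponential, i.e. MaxEnt, form).
Each iteration of GIS requires the model expectation and the training-set expectation of every feature.
The training-set expectation is just a count of features. The model expectation, however, requires approximation.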
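
A toy run of GIS on the two-class, one-feature example (constraint p(y1,0) + p(y2,0) = 0.6) might look as follows. This is a sketch under assumptions: the correction feature f2 is added here so that the features sum to a constant C = 1, its target expectation 0.4 follows from normalization, and the model expectations are computed exactly because there are only four outcomes (no approximation is needed at this size).

```python
import math

outcomes = [("y1", 0), ("y1", 1), ("y2", 0), ("y2", 1)]

def f1(a, b):
    return 1.0 if b == 0 else 0.0      # the constrained feature: fires when b = 0

def f2(a, b):
    return 1.0 - f1(a, b)              # assumed correction feature, so f1 + f2 = C = 1

features = [f1, f2]
target = [0.6, 0.4]                    # desired expectations E[f1], E[f2]
C = 1.0

def model(weights):
    """Exponential-form distribution over the four outcomes for the given weights."""
    scores = [math.exp(sum(w * f(a, b) for w, f in zip(weights, features)))
              for a, b in outcomes]
    z = sum(scores)
    return [s / z for s in scores]

def gis(iterations=10):
    weights = [0.0, 0.0]               # start from the uniform distribution
    for _ in range(iterations):
        p = model(weights)
        expected = [sum(pi * f(a, b) for pi, (a, b) in zip(p, outcomes))
                    for f in features]
        # GIS update: move each weight by (1/C) * log(target / model expectation).
        weights = [w + math.log(t / e) / C
                   for w, t, e in zip(weights, target, expected)]
    return model(weights)

p_star = gis()   # order: (y1,0), (y1,1), (y2,0), (y2,1)
```

On this example the first update already matches the constraint, giving 0.3 for both b = 0 cells and 0.2 for both b = 1 cells.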

34
Application: segmentation of FAQs
38 files belonging to 7 Usenet multi-part FAQs (sets of files)
Basic file structure:
  header text in Usenet header format
  [preamble or table of contents]
  series of one or more question/answer pairs
  tail [copyright] [acknowledgements] [origin of document]
Formatting regularities: indentation, numbered questions, types of paragraph breaks
Consistent formatting within a single FAQ

35
Train data
Lines in each file are hand-labeled into 4 categories: head, questions, answers, tail

Archive-name: acorn/faq/part2
Frequency: monthly
2.6) What configuration of serial cable should I use
Here follows a diagram of the necessary connections
programs to work properly. They are as far as I know t
Pins 1, 4, and 8 must be connected together inside
is to avoid the well known serial port chip bugs. The

Prediction: given a sequence of lines, a learner must return a sequence of labels.

36
Boolean features of lines
The 24 line-based features used in the experiments are:

begins-with-number            contains-question-mark
begins-with-ordinal           contains-question-word
begins-with-punctuation       ends-with-question-mark
begins-with-question-word     first-alpha-is-capitalized
begins-with-subject           indented
blank                         indented-1-to-4
contains-alphanum             indented-5-to-10
contains-bracketed-number     more-than-one-third-space
contains-http                 only-punctuation
contains-non-space            prev-is-blank
contains-number               prev-begins-with-ordinal
contains-pipe                 shorter-than-30
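
A few of these features can be sketched as simple string tests. The slides give only the feature names, so the definitions below are guesses at plausible meanings (for example, "indented" is taken to mean leading spaces), and only a subset of the 24 features is shown.

```python
import re

def line_features(line, prev_line=""):
    """Compute a subset of the boolean line features; all definitions are assumptions."""
    stripped = line.strip()
    indent = len(line) - len(line.lstrip(" "))   # number of leading spaces
    return {
        "begins-with-number": bool(re.match(r"\d", stripped)),
        "blank": stripped == "",
        "contains-question-mark": "?" in line,
        "ends-with-question-mark": stripped.endswith("?"),
        "contains-http": "http" in line,
        "indented": indent > 0,
        "indented-1-to-4": 1 <= indent <= 4,
        "prev-is-blank": prev_line.strip() == "",
        "shorter-than-30": len(line) < 30,
    }

feats = line_features("2.6) What configuration of serial cable should I use", prev_line="")
```

Each line of a FAQ then becomes a vector of such booleans, which is what the MaxEnt-based learners condition on.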

37
Experiment setup
“Leave one out” testing: for each file in a group (FAQ), train a learner on that file and test it on each of the remaining files in the group.
Scores are averaged over the n(n-1) results.
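
The n(n-1) results correspond to all ordered (train file, test file) pairs of distinct files in a group. A small sketch of that averaging loop follows; `evaluate` is a hypothetical callback that trains on one file and returns a score on another, and the dummy scoring function is there only to exercise the loop.

```python
from itertools import permutations

def leave_one_out_average(files, evaluate):
    """Average evaluate(train_file, test_file) over all n * (n - 1) ordered pairs."""
    pairs = list(permutations(files, 2))
    total = sum(evaluate(train, test) for train, test in pairs)
    return total / len(pairs), len(pairs)

# Dummy stand-in for "train on one file, score on another" (assumption for illustration).
def dummy_evaluate(train_file, test_file):
    return 1.0 if train_file < test_file else 0.0

average, n_results = leave_one_out_average(["part1", "part2", "part3"], dummy_evaluate)
```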

38
Evaluation metrics
Segment: consecutive lines belonging to the same category
Co-occurrence agreement probability (COAP): the empirical probability that the actual and the predicted segmentation agree on the placement of two lines, according to some distance distribution D between lines. It measures whether segment boundaries are properly aligned by the learner.
Segmentation precision (SP): number of correctly identified segments / number of segments predicted
Segmentation recall (SR): number of correctly identified segments / number of actual segments
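
Segmentation precision and recall can be sketched from two label sequences. The code assumes that SP is the fraction of predicted segments that are correct and SR the fraction of actual segments that are found, and that a segment counts as correctly identified only if its start, end and label all match; the exact matching rule is an assumption. The label sequences are invented.

```python
from itertools import groupby

def segments(labels):
    """Turn a label sequence into a set of (start, end, label) runs of equal labels."""
    segs, start = set(), 0
    for label, run in groupby(labels):
        length = len(list(run))
        segs.add((start, start + length, label))
        start += length
    return segs

def segmentation_precision_recall(actual, predicted):
    """SP = correct / predicted segments, SR = correct / actual segments."""
    a, p = segments(actual), segments(predicted)
    correct = len(a & p)
    return correct / len(p), correct / len(a)

actual = ["head", "head", "question", "answer", "answer", "tail"]
predicted = ["head", "head", "question", "answer", "question", "tail"]
sp, sr = segmentation_precision_recall(actual, predicted)
```

In this invented example one mislabeled line splits the answer segment, so three of five predicted segments and three of four actual segments match.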

39
Comparison of learners
ME-Stateless: maximum entropy classifier
  document is an unordered set of lines
  lines are classified in isolation using the binary features, not using the label of the previous line
TokenHMM: fully connected HMM with hidden states for each of the four labels
  no binary features
  transitions between states only on line boundaries
FeatureHMM: same as TokenHMM
  lines are converted to sequences of features
MEMM

40
Results

Learner        COAP    SegPrec   SegRecall
ME-Stateless   0.52    0.038     0.362
TokenHMM       0.865   0.276     0.14
FeatureHMM     0.941   0.413     0.529
MEMM           0.965   0.867     0.681

41
Problems with MEMM
Label bias problem
This led to CRFs and further models (more on this in a while)

42
Thank you!
