# Directed Graphical Model: MEMM as Sequence Classifier

Niket Tandon. Tutor: Martin Theobald. Date: 09 June 2011



## Why talk about them?

- An HMM is a sequence classifier.
- Hidden Markov models (HMMs) have been successfully applied to:
  - Part-of-speech tagging: *He attends seminars*
  - Named entity recognition: *MPI director Gerhard Weikum*

## Agenda

- HMM recap
- MaxEnt model
- HMM + MaxEnt ~= MEMM
- Training & decoding
- Experimental results and discussion


## Markov chains and HMMs

- HMMs and Markov chains extend finite automata (FA).
- States q_i, transitions between states a_ij.
- Markov assumption: the next state depends only on the current state.
- The probabilities of the arcs leaving a node sum to one.
- (Diagram: finite automaton → weighted FA → Markov chain → HMM)

## HMM (one additional layer of uncertainty: the states are hidden!)

- Think of the hidden states as causal factors in a probabilistic model.
- States q_i.
- Transitions between states a_ij.
- Observation likelihoods (emission probabilities) B = b_i(o_t): the probability of observation o_t being generated from state q_i.
- Markov assumption.
- Output independence assumption.
- Task: given a sequence of observations (#ice creams eaten per day), predict the hidden weather states.
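The components above can be written down concretely. A minimal Python sketch for the ice-cream task; every number in the model (start, transition, and emission probabilities) is an illustrative assumption, not a value from the slides:

```python
# Illustrative HMM for the ice-cream task; all numbers are assumptions.
states = ["HOT", "COLD"]
start = {"HOT": 0.8, "COLD": 0.2}                    # pi
trans = {"HOT": {"HOT": 0.7, "COLD": 0.3},           # A: P(next | current)
         "COLD": {"HOT": 0.4, "COLD": 0.6}}
emit = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},             # B: P(#ice creams | state)
        "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_prob(obs, path):
    """P(O, Q): the Markov assumption means each transition depends only
    on the previous state; output independence means each observation
    depends only on the current state."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return p
```

For example, `joint_prob([3, 1, 3], ["HOT", "COLD", "HOT"])` multiplies one start probability, two transition probabilities, and three emission probabilities.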

## An HMM is characterized by three fundamental problems

- Likelihood: given an HMM λ = (A, B) and an observation sequence O, compute P(O | λ).
- Decoding: given an HMM λ = (A, B) and O, find the best hidden state sequence Q.
- Learning: given an observation sequence O (and the set of states), learn the HMM parameters A and B.

## Computing the observation likelihood P(O | λ)

- With N hidden states and T observations there are N × N × … × N = N^T possible state sequences (HHC, HCH, CHH, …). Exponentially many!
- An efficient algorithm is the forward algorithm, which stores intermediate values in a table.

## Forward algorithm

- Recursion: α_t(j) = Σ_i α_{t−1}(i) · a_ij · b_j(o_t), i.e. previous forward probability × probability of moving from the previous state to the current state × probability of observation o_t given current state j.
- α_t(j) is P(being in state j after seeing the first t observations).
- For many HMM applications, many a_ij are zero, which shrinks the search space.
- Finally, sum over the probabilities of every path: P(O | λ) = Σ_j α_T(j).
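The recursion above can be sketched in a few lines of Python. The model numbers are illustrative assumptions; the point is the table-filling, which costs O(N²T) instead of enumerating N^T paths:

```python
# Forward algorithm on an illustrative ice-cream HMM (numbers are assumptions).
states = ["HOT", "COLD"]
start = {"HOT": 0.8, "COLD": 0.2}
trans = {"HOT": {"HOT": 0.7, "COLD": 0.3},
         "COLD": {"HOT": 0.4, "COLD": 0.6}}
emit = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},
        "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """Return P(O | model) by filling the alpha table left to right."""
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({j: sum(alpha[t - 1][i] * trans[i][j] for i in states)
                         * emit[j][obs[t]]
                      for j in states})
    # P(O) = sum over the final column of the table
    return sum(alpha[-1][s] for s in states)
```

On a three-day observation sequence this agrees exactly with summing the joint probability over all 2³ hidden paths, while touching far fewer terms.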


## Decoding: Viterbi algorithm

- Given an HMM λ = (A, B) and O, find the most probable hidden state sequence Q.
- Naive approach: run the forward computation for each hidden state sequence and choose the one with maximum observation likelihood. Exponential!
- The efficient alternative is Viterbi: v_t(j) = max_i v_{t−1}(i) · a_ij · b_j(o_t), i.e. previous Viterbi path probability × transition probability × probability of observation o_t given current state j.
- Compare with the forward algorithm: sum vs. max.
- Keep a pointer to the best path that brought us here (the backtrace).
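A minimal Python sketch of the recursion, on the same kind of illustrative model (all numbers are assumptions). It is the forward algorithm with `max` in place of `sum`, plus backpointers:

```python
# Viterbi decoding on an illustrative ice-cream HMM (numbers are assumptions).
states = ["HOT", "COLD"]
start = {"HOT": 0.8, "COLD": 0.2}
trans = {"HOT": {"HOT": 0.7, "COLD": 0.3},
         "COLD": {"HOT": 0.4, "COLD": 0.6}}
emit = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},
        "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Return (most probable hidden state sequence, its probability)."""
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            prev = max(states, key=lambda i: v[t - 1][i] * trans[i][j])
            back[t][j] = prev
            v[t][j] = v[t - 1][prev] * trans[prev][j] * emit[j][obs[t]]
    # follow the backpointers from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], v[-1][last]
```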



## Learning: forward-backward algorithm

- Train the transition probabilities A and the emission probabilities B.
- Consider a plain Markov chain first: B = 1.0 (nothing is hidden), so only A needs to be computed.
- Maximum-likelihood estimate: a_ij = count(transitions from i to j) / count(all transitions from i).
- But an HMM has hidden states! The counts require additional work, computed with the EM-style forward-backward algorithm.
- The backward probability β_t(i) sums over all successor values β_{t+1}(j), weighted by a_ij and the observation probability b_j(o_{t+1}).
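The backward recursion in the last bullet can be sketched directly. As a sanity check, combining β at time 1 with the start and emission probabilities reproduces the likelihood P(O) that the forward algorithm computes. All model numbers are illustrative assumptions:

```python
# Backward probabilities: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j).
# The model numbers are illustrative assumptions.
states = ["HOT", "COLD"]
start = {"HOT": 0.8, "COLD": 0.2}
trans = {"HOT": {"HOT": 0.7, "COLD": 0.3},
         "COLD": {"HOT": 0.4, "COLD": 0.6}}
emit = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},
        "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def backward(obs):
    """Fill the beta table right to left; beta_T(i) = 1 for all i."""
    beta = [{s: 1.0 for s in states}]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]  # this is beta_{t+1}
        row = {i: sum(trans[i][j] * emit[j][obs[t + 1]] * nxt[j]
                      for j in states)
               for i in states}
        beta.insert(0, row)
    return beta

def likelihood_via_backward(obs):
    """P(O) = sum_i pi_i * b_i(o_1) * beta_1(i)."""
    beta = backward(obs)
    return sum(start[s] * emit[s][obs[0]] * beta[0][s] for s in states)
```

Forward-backward (Baum-Welch) combines the α and β tables to get expected transition and emission counts, then re-estimates A and B from those expectations.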

## Agenda

- HMM recap
- MaxEnt model
- HMM + MaxEnt ~= MEMM
- Training & decoding
- Experimental results and discussion

## Background: Maximum Entropy models

- A second probabilistic machine-learning framework.
- Maximum Entropy is, by itself, a non-sequential classifier.
- The most common MaxEnt sequence classifier is the Maximum Entropy Markov Model (MEMM).
- First, we discuss non-sequential MaxEnt.
- MaxEnt works by extracting a set of features from the input, combining them linearly, and using that sum as an exponent.
- Background you should know: linear regression.
- The linear combination is not bounded, so it needs to be normalized.
- Instead of the probability, compute the odds. (Recall that if P(A) = 0.9 and P(A′) = 0.1, then the odds of A = 0.9 / 0.1 = 9.)

## Logistic regression

- A linear model predicts the odds of y = true.
- The odds lie between 0 and ∞, but we need a value between −∞ and +∞, so take the log. The resulting function, logit(p) = ln(p / (1 − p)), is called the logit.
- A regression model used to estimate not the probability but the logit of the probability is called logistic regression.
- So, if the linear function estimates the logit, what is P(y = true)? Inverting the logit yields the logistic function, which gives logistic regression its name.
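The two mappings can be written in a couple of lines:

```python
import math

def logit(p):
    """Probability in (0, 1) -> log-odds anywhere on the real line."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Log-odds -> probability; the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))
```

`logistic(0.0)` is 0.5, and `logistic(logit(p))` recovers p, which is exactly the round trip the slide describes.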

## Learning the weights w

- Unlike linear regression, which minimizes squared loss on the training set, logistic regression uses conditional maximum-likelihood estimation.
- Choose the w that makes the probability of the observed y values in the training data highest, given the x values.
- Take the product over the entire training set, then take the log.
- The result is an unwieldy expression, but it can be condensed (because y is either 0 or 1) and, after substituting the logistic form, becomes a convex objective.
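A minimal gradient-ascent sketch of conditional maximum-likelihood training. The dataset, learning rate, and step count are illustrative assumptions; real implementations use better optimizers, but the gradient Σ_i (y_i − p_i)·x_i is the standard one for this objective:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, dim, lr=0.1, steps=1000):
    """Gradient ascent on the conditional log-likelihood
    sum_i log P(y_i | x_i; w); its gradient is sum_i (y_i - p_i) * x_i."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in data:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(dim):
                grad[j] += (y - p) * x[j]
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```

Each feature vector here carries a constant 1.0 component that acts as a bias feature.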

## How to solve the convex optimization problem?

- There are several methods for solving convex optimization problems.
- Later we explain one algorithm: the Generalized Iterative Scaling (GIS) method.

## MaxEnt

- Until now, two classes; with more classes, logistic regression is called multinomial logistic regression (or MaxEnt).
- In MaxEnt, the probability that y = c is an exponentiated weighted feature sum, normalized over all classes to make it a probability.
- f_i(c, x) means feature i of observation x for class c.
- The f_i are not real-valued but binary (more common in text processing).
- Learning w is similar to logistic regression, with one change: MaxEnt tends to learn very high weights, so regularization is needed.
- Let us see a classification problem…

## Niket/NNP is/BEZ expected/VBN to/TO talk/?? today/

VB or NN? Features f1…f6 and the weights of the active features:

|                | f1 | f2 | f3 | f4  | f5 | f6   |
|----------------|----|----|----|-----|----|------|
| VB: f_i(VB, x) | 0  | 1  | 0  | 1   | 1  | 0    |
| VB: w_i        |    | .8 |    | .01 | .1 |      |
| NN: f_i(NN, x) | 1  | 0  | 0  | 0   | 0  | 1    |
| NN: w_i        | .8 |    |    |     |    | −1.3 |
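Plugging the weights of the active features into the MaxEnt formula settles the question. A minimal sketch, using the weights on the slide (each class's score is the exponentiated sum of its active-feature weights, normalized over the two classes):

```python
import math

# Score each class by exponentiating the sum of the weights of its
# active (value-1) features, then normalize over the two classes.
vb_score = math.exp(0.8 + 0.01 + 0.1)  # weights of the active VB features
nn_score = math.exp(0.8 - 1.3)         # weights of the active NN features
z = vb_score + nn_score
p_vb = vb_score / z
p_nn = nn_score / z
# p_vb comes out near 0.80, so the model tags "talk" as VB
```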

## More complex features

- A word starting with a capital letter (*Day*) is more likely to be an NNP than a common noun (e.g. *Independence Day*).
- But a capitalized word occurring at the beginning of a sentence is not more likely to be an NNP (e.g. *Day after day*).
- In MaxEnt, such interactions have to be encoded by hand as combined features.
- The key to successful use of MaxEnt is the design of appropriate features and feature combinations.

## Why the name Maximum Entropy?

- Suppose we tag a new word *Prabhu* with a model that makes the fewest assumptions, imposing no constraints.
- We get an equiprobable distribution: each of the eight candidate tags NN, JJ, NNS, VB, NNP, VBG, IN, CD receives probability 1/8.
- Now suppose we have training data from which we learn that the set of possible tags for *Prabhu* is NN, JJ, NNS, VB.
- Since one of these tags must be correct, P(NN) + P(JJ) + P(NNS) + P(VB) = 1; the least committal distribution gives each of the four tags ¼ and the other tags 0.

## Maximum Entropy

- …of all distributions satisfying the constraints, the equiprobable one has the maximum entropy: p* = argmax_p H(p).
- The solution to this optimization coincides with the MaxEnt model whose weights w maximize the likelihood of the training data!
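A quick check of the claim from the previous slide: the uniform distribution over the eight tags has the highest entropy (3 bits), the constrained distribution putting ¼ on each of the four allowed tags has 2 bits, and any less even distribution over those four tags scores lower:

```python
import math

def entropy(dist):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits.
    Zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)
```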

## Agenda

- HMM recap
- MaxEnt model
- HMM + MaxEnt ~= MEMM
- Training & decoding
- Experimental results and discussion

## Why the HMM is not sufficient

- An HMM is based on the probabilities P(tag | tag) and P(word | tag).
- For tagging unknown words, useful features include capitalization, the presence of hyphens, and word endings.
- An HMM cannot easily exploit such features, and it is unable to use information from later words to inform its decisions early on.
- The MEMM (Maximum Entropy Markov Model) mates the Viterbi algorithm with MaxEnt to overcome this problem!

## HMM (generative) vs. MEMM (discriminative)

- HMM: two probabilities per step, the observation likelihood P(word | tag) and the prior P(tag | tag).
- MEMM: a single probability P(tag | tag, word), conditioned on the previous state and the observation.

## MEMM

- A MEMM can condition on many features of the input: capitalization, morphology (ending in *-s* or *-ed*), as well as earlier words or tags.
- An HMM can't, since it is likelihood-based and would need to compute the likelihood of each feature.

## Agenda

- HMM recap
- MaxEnt model
- HMM + MaxEnt ~= MEMM
- Training & decoding
- Experimental results and discussion

## Decoding (inference) and learning in the MEMM

- Viterbi in the HMM: v_t(j) = max_i v_{t−1}(i) · P(s_j | s_i) · P(o_t | s_j).
- Viterbi in the MEMM: v_t(j) = max_i v_{t−1}(i) · P(s_j | s_i, o_t).
- Learning in the MEMM: train the weights so as to maximize the log-likelihood of the training corpus.
- With GIS!
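A toy sketch of MEMM decoding. The state set, feature templates, and weights below are invented for illustration (not taken from the slides); the point is that the recursion is HMM Viterbi with the two HMM probabilities replaced by one MaxEnt local model P(s | s_prev, obs):

```python
import math

STATES = ["A", "B"]

def local_prob(s, s_prev, obs, w):
    """MaxEnt local model P(s | s_prev, obs): exponentiate the sum of the
    weights of the active binary features, normalized over all states."""
    def score(q):
        feats = [("prev", s_prev, q), ("obs", obs, q)]  # hypothetical templates
        return math.exp(sum(w.get(f, 0.0) for f in feats))
    z = sum(score(q) for q in STATES)
    return score(s) / z

def memm_viterbi(obs_seq, w, start="<s>"):
    """Same max recursion as HMM Viterbi, but with a single conditional
    probability per step."""
    v = [{s: local_prob(s, start, obs_seq[0], w) for s in STATES}]
    back = [{}]
    for t in range(1, len(obs_seq)):
        v.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: v[t - 1][p] * local_prob(s, p, obs_seq[t], w))
            back[t][s] = prev
            v[t][s] = v[t - 1][prev] * local_prob(s, prev, obs_seq[t], w)
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs_seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

With weights that tie observation "x" to state A and "y" to state B, the decoder follows the observations, which is exactly the extra flexibility the slides attribute to the MEMM.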

## Learning: GIS basics

- Toy task: estimate a probability distribution p(a, b), where a ranges over the classes (y1, y2) and b is a binary feature indicator (0, 1).
- Constraint observed in the data: p(y1, 0) + p(y2, 0) = 0.6, i.e. the model's expectation of the feature must satisfy E_p[f] = 0.6.
- Normalization: p(y1, 0) + p(y1, 1) + p(y2, 0) + p(y2, 1) = 1.
- Among all distributions p meeting these constraints, find the one with maximum entropy: p(y1, 0) = p(y2, 0) = .3 and p(y1, 1) = p(y2, 1) = .2.

## GIS algorithm sketch

- GIS finds the weight parameters of the unique distribution p* belonging to both P (the distributions satisfying the constraints) and Q (the exponential-form models).
- Each iteration of GIS requires the model expectation and the training-set expectation of every feature.
- The training-set expectation is just a count of the features; the model expectation, however, requires approximation.
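A runnable sketch of GIS on the toy problem above. The second feature is the complement of the first, acting as the "correction" feature GIS needs so that every point has the same total feature count C (here C = 1); the iteration count is an arbitrary choice:

```python
import math

# Outcomes of the toy joint distribution p(a, b).
points = [("y1", 0), ("y1", 1), ("y2", 0), ("y2", 1)]

# f1 fires when b == 0, f2 when b == 1; f1 + f2 == 1 on every point,
# so the GIS constant C is 1.
feats = [lambda a, b: 1.0 if b == 0 else 0.0,
         lambda a, b: 1.0 if b == 1 else 0.0]
C = 1.0
target = [0.6, 0.4]  # training-set expectations: p(y1,0) + p(y2,0) = 0.6

w = [0.0, 0.0]
for _ in range(25):
    # current model: p(a, b) proportional to exp(sum_i w_i * f_i(a, b))
    scores = [math.exp(sum(wi * f(a, b) for wi, f in zip(w, feats)))
              for a, b in points]
    z = sum(scores)
    p = [s / z for s in scores]
    # model expectation of each feature under p
    model_e = [sum(pk * f(a, b) for pk, (a, b) in zip(p, points))
               for f in feats]
    # GIS update: w_i += (1/C) * log(empirical / model expectation)
    w = [wi + (1.0 / C) * math.log(ti / mi)
         for wi, ti, mi in zip(w, target, model_e)]
```

The resulting p converges to (.3, .2, .3, .2), the maximum-entropy table of the toy problem.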

## Agenda

- HMM recap
- MaxEnt model
- HMM + MaxEnt ~= MEMM
- Training & decoding
- Experimental results and discussion

## Application: segmentation of FAQs

- 38 files belonging to 7 Usenet multi-part FAQs (a FAQ is a set of files).
- Basic file structure: header text in Usenet header format, [preamble or table of contents], a series of one or more question/answer pairs, tail ([copyright], [acknowledgements], [origin of document]).
- Formatting regularities: indentation, numbered questions, types of paragraph breaks.
- Formatting is consistent within a single FAQ.

## Training data

- The lines in each file are hand-labelled into 4 categories: head, question, answer, tail.
- Excerpt (truncated):

      Archive-name: acorn/faq/part2
      Frequency: monthly
      2.6) What configuration of serial cable should I use
      Here follows a diagram of the necessary connections
      programs to work properly. They are as far as I know t
      Pins 1, 4, and 8 must be connected together inside
      is to avoid the well known serial port chip bugs. The

- Prediction task: given a sequence of lines, a learner must return a sequence of labels.

## Boolean features of lines

The 24 line-based features used in the experiments are: begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30.

## Experimental setup

- "Leave one out" testing: for each file in a group (FAQ), train a learner on that file and test it on the remaining files in the group.
- Scores are averaged over the n(n − 1) results.

## Evaluation metrics

- Segment: consecutive lines belonging to the same category.
- Co-occurrence agreement probability (COAP): the empirical probability that the actual and the predicted segmentation agree on the placement of two lines, according to some distance distribution D between lines. It measures whether the learner aligns segment boundaries properly.
- Segmentation precision (SP).
- Segmentation recall (SR).

## Comparison of learners

- ME-Stateless: maximum entropy classifier; the document is treated as an unordered set of lines, and lines are classified in isolation using the binary features, without the label of the previous line.
- TokenHMM: fully connected HMM with hidden states for each of the four labels; no binary features; transitions between states only on line boundaries.
- FeatureHMM: same as TokenHMM, but the lines are converted to sequences of features.
- MEMM.

## Results

| Learner      | COAP  | SegPrec | SegRecall |
|--------------|-------|---------|-----------|
| ME-Stateless | 0.520 | 0.038   | 0.362     |
| TokenHMM     | 0.865 | 0.276   | 0.140     |
| FeatureHMM   | 0.941 | 0.413   | 0.529     |
| MEMM         | 0.965 | 0.867   | 0.681     |

## Problems with the MEMM

- The label bias problem.
- This led to CRFs and further models (more on this in a while).

## Thank you!
