Maximum Margin Markov Network. Ben Taskar, Carlos Guestrin, Daphne Koller, 2004.

Topics Covered: main idea; problem setting; structure in classification problems; Markov model; SVM; combining SVM and Markov network; generalization bound; experiments and results.

Main Idea: Combine SVMs (a kernel-based approach) with Markov networks (a graphical model) for sequential, structured learning. SVMs: 1) can use high-dimensional feature spaces; 2) cannot exploit structure in the problem. Markov networks: 1) can represent correlations between labels by exploiting structure in the problem; 2) cannot deal with high-dimensional feature spaces.

Problem Setting: Multi-label classification. The input is a training set of examples x paired with their target labelings t(x). The goal is to predict y given a new x. OCR data is used as the running example.

Structure in classification problems: The model uses a feature function f(x,y) and a hypothesis that predicts the arg max over y of w^T f(x,y), where w is a weight vector. For multi-label classification, the number of possible assignments to y is exponential in the number of labels, which makes the arg max over y difficult to compute. An alternative approach is to use probabilistic graphical models.
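To make the exponential blow-up concrete, here is a minimal sketch of the naive arg max, which scores every joint assignment. The function names and the toy scoring function are hypothetical and stand in for w^T f(x,y); they are not taken from the presentation.

```python
from itertools import product


def count_assignments(num_labels, num_classes):
    """Number of joint assignments y when each of num_labels labels has num_classes values."""
    return num_classes ** num_labels


def brute_force_argmax(score, num_labels, num_classes):
    """Return the best joint assignment by enumerating all num_classes**num_labels of them."""
    best_y, best_score = None, float("-inf")
    for y in product(range(num_classes), repeat=num_labels):
        s = score(y)
        if s > best_score:
            best_y, best_score = y, s
    return best_y, best_score


def toy_score(y):
    """Hypothetical stand-in for w^T f(x, y): rewards class 1 and agreeing neighbours."""
    node = sum(1.0 for yi in y if yi == 1)
    edge = sum(0.5 for a, b in zip(y, y[1:]) if a == b)
    return node + edge


if __name__ == "__main__":
    print(brute_force_argmax(toy_score, num_labels=3, num_classes=2))
    # An 8-letter word with 26 possible letters per position:
    print(count_assignments(num_labels=8, num_classes=26))
```

Enumeration is fine for three binary labels, but the count for even a short word over a 26-letter alphabet is already far too large to enumerate per prediction, which is why the structure of the problem has to be exploited.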

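Markov Model: Use a pairwise Markov model, defined as a graph G=(Y,E). Each edge (i,j) is associated with a potential function ψ_ij(x, y_i, y_j). The network encodes a conditional probability distribution P(y|x) proportional to the product of the edge potentials. With log-linear potentials, the log of this product is w^T f(x,y), so we can take the arg max over y of w^T f(x,y) to predict y for x.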
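When the graph is a chain, as for the letters of a word, the arg max over y can be computed exactly by dynamic programming instead of enumeration. The sketch below is a standard Viterbi pass under two assumptions that are not stated on the slide: the graph is a chain, and the edge scores are shared across positions. `node_scores` and `edge_scores` are hypothetical inputs standing in for the log-potentials.

```python
def viterbi(node_scores, edge_scores):
    """Highest-scoring labeling of a chain.

    node_scores[t][c]: score of class c at position t.
    edge_scores[p][c]: score of class p followed by class c (shared across positions).
    Returns (best_path, best_score).
    """
    num_positions = len(node_scores)
    num_classes = len(node_scores[0])

    best = list(node_scores[0])  # best[c]: best score of any prefix ending in class c
    backpointers = []
    for t in range(1, num_positions):
        new_best, pointers = [], []
        for c in range(num_classes):
            prev = max(range(num_classes), key=lambda p: best[p] + edge_scores[p][c])
            pointers.append(prev)
            new_best.append(best[prev] + edge_scores[prev][c] + node_scores[t][c])
        best = new_best
        backpointers.append(pointers)

    # Trace the best final class back to the start of the chain.
    last = max(range(num_classes), key=lambda c: best[c])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


if __name__ == "__main__":
    nodes = [[1.0, 0.0], [0.0, 0.2], [1.0, 0.0]]
    edges = [[0.5, 0.0], [0.0, 0.5]]
    print(viterbi(nodes, edges))
```

The cost is linear in the number of positions and quadratic in the number of classes, compared with the exponential cost of scoring every joint assignment.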

SVM

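Combining SVM and Markov Network: For single-label multi-class classification, Crammer and Singer extend the SVM framework by maximizing the margin γ subject to ||w|| ≤ 1 and w^T Δf_x(y) ≥ γ for every training example x and every y ≠ t(x), where Δf_x(y) = f(x,t(x)) - f(x,y). The constraints ensure that the arg max over y of w^T f(x,y) is t(x). Here we are predicting multiple labels, so the loss function is not simply the 0-1 loss but a per-label loss.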

More specifically, the margin between t(x) and y scales linearly with the number of wrong labels in y: w^T [f(x,t(x)) - f(x,y)] ≥ γ Δt_x(y) for weight vector w and margin γ, where Δt_x(y) is the number of labels in y that differ from t(x). However, this approach has a problem, which is discussed in Taskar et al.: it may give significant weight to output values that are not even close to the target values, because every increase in the loss increases the required margin.
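The per-label loss and the scaled margin requirement can be sketched in a few lines. The function names and the example scores are hypothetical; the scores stand in for w^T f(x,t(x)) and w^T f(x,y).

```python
def hamming_loss(y, target):
    """Number of positions where the labeling y disagrees with the target labeling."""
    return sum(1 for yi, ti in zip(y, target) if yi != ti)


def margin_violation(score_target, score_y, y, target, gamma=1.0):
    """Amount by which y violates the loss-scaled margin constraint (0.0 if satisfied).

    The constraint requires score_target - score_y >= gamma * hamming_loss(y, target).
    """
    required = gamma * hamming_loss(y, target)
    return max(0.0, required - (score_target - score_y))


if __name__ == "__main__":
    target = list("brace")
    near = list("brake")   # one wrong letter: needs a margin of 1 * gamma
    far = list("stomp")    # five wrong letters: needs a margin of 5 * gamma
    print(hamming_loss(near, target), hamming_loss(far, target))
    print(margin_violation(5.0, 4.5, near, target))
    print(margin_violation(5.0, 4.5, far, target))
```

With the same score gap, the labeling with more wrong letters incurs the larger violation. This is the behaviour the slide describes: the required margin grows with every additional wrong label.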

Using a standard transformation to eliminate the margin γ and introducing slack variables, we obtain a primal quadratic program and its dual.
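The formulas on this slide did not survive in the transcript. The block below is a reconstruction that follows the formulation in the Taskar et al. paper as best recalled, not the slide itself. It assumes Δf_x(y) = f(x,t(x)) - f(x,y), Δt_x(y) is the number of wrong labels in y, ξ_x is the slack variable for example x, C is the usual slack penalty, and α_x(y) are the dual variables.

```latex
\[
\begin{aligned}
\text{Primal:}\quad
  & \min_{w,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{x} \xi_x
  \quad \text{s.t.}\quad
  w^\top \Delta f_x(y) \ge \Delta t_x(y) - \xi_x,
  \quad \forall x,\ \forall y \\[4pt]
\text{Dual:}\quad
  & \max_{\alpha}\ \sum_{x,y} \alpha_x(y)\, \Delta t_x(y)
  - \tfrac{1}{2} \Big\| \sum_{x,y} \alpha_x(y)\, \Delta f_x(y) \Big\|^2
  \quad \text{s.t.}\quad
  \sum_{y} \alpha_x(y) = C,\ \ \alpha_x(y) \ge 0,
  \quad \forall x,\ \forall y
\end{aligned}
\]
```

In this form the primal has one constraint, and the dual one variable, per example and joint assignment y, so both are exponential in the number of labels.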

Generalization Bound: The bound relates training error to testing error. It is stated in terms of two quantities: the average per-label loss and the margin per-label loss.

The bound holds with probability at least 1 - δ, where q is the maximum edge degree in the network, l is the number of labels, K is a constant, and k is the number of classes in a label.
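The inequality itself is missing from the transcript. The following is a hedged reconstruction of the form of the bound, as recalled from the Taskar et al. paper, and it should be checked against the paper before being relied on. It assumes L(w,x) is the average per-label loss, L^γ(w,x) is the margin per-label loss, m is the number of training examples, and R_edge bounds the norm of the edge features. None of these three symbols appears in the transcript.

```latex
\[
\mathbb{E}_{x}\, L(w, x)
\;\le\;
\mathbb{E}_{S}\, L^{\gamma}(w, x)
+ \sqrt{\frac{K}{m}\left[
    \frac{R_{\text{edge}}^{2}\, \|w\|_2^{2}\, q^{2}}{\gamma^{2}}
    \bigl(\ln m + \ln l + \ln q + \ln k\bigr)
    + \ln\frac{1}{\delta}
  \right]}
\]
```

In this form the number of labels l and the number of classes k enter only logarithmically.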

Experiments and Results: Handwriting recognition. 1) The input corpus contains 6100 handwritten words. 2) The data set is divided into 10 folds of 600 training and 5500 testing examples. 3) Accuracy results are averaged over the 10 folds.
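The fold sizes suggest that each run trains on one fold and tests on the other nine, the reverse of the usual cross-validation split. That reading is an inference from the numbers on the slide. The sketch below shows such a split with hypothetical helper names. With 6100 examples it gives 610 training and 5490 testing examples per fold, which matches the rounded figures.

```python
def one_fold_train_splits(num_examples, num_folds):
    """Yield (train_indices, test_indices), training on one fold and testing on the rest."""
    indices = list(range(num_examples))
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, train in enumerate(folds):
        test = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


def average(values):
    """Mean of per-fold results, e.g. accuracies."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for train, test in one_fold_train_splits(6100, 10):
        print(len(train), len(test))
```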

Hypertext classification: 1) The dataset contains web pages from 4 different CS departments. 2) Each page is labelled as course, faculty, student, project, or other. 3) The model is learned from three schools and tested on the remaining one. 4) The error rate of M^3N is 40% lower than that of RMNs and 51% lower than that of multi-class SVMs.

THANK YOU