# Structured SVM Chen-Tse Tsai and Siddharth Gupta.



Outline

- Introduction to SVM
- Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis et al., 2005)
- Max-Margin Markov Networks (Taskar et al., 2003)
- Learning Structural SVMs with Latent Variables (Yu and Joachims, 2009)

SVM: The main idea

Maximum margin

- Find $w$ and $b$ such that the margin $\rho = 2 / \lVert w \rVert$ is maximized and, for all $(x_i, y_i)$, $i = 1, \dots, n$: $y_i (w^T x_i + b) \ge 1$.
- Equivalently, find $w$ and $b$ such that $\Phi(w) = \lVert w \rVert^2 = w^T w$ is minimized and, for all $(x_i, y_i)$, $i = 1, \dots, n$: $y_i (w^T x_i + b) \ge 1$.
- This is a quadratic optimization problem.

Binary SVM

- Training examples:
- Primal form:
- Dual form:
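The formulas on this slide did not survive extraction. As a reminder, the standard soft-margin binary SVM can be written as follows; the slack variables $\xi_i$ and the constant $C$ are assumptions here, since the slide may have shown the hard-margin form (which drops $\xi_i$ and the upper bound on $\alpha_i$).

$$
\begin{aligned}
\text{Training examples:}\quad & (x_1, y_1), \dots, (x_n, y_n), \quad x_i \in \mathbb{R}^d,\; y_i \in \{-1, +1\} \\
\text{Primal:}\quad & \min_{w, b, \xi}\; \tfrac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0 \\
\text{Dual:}\quad & \max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\; \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
$$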

Multiclass SVM

Structured Output

- Approach: view as a multi-class classification task
  - Every complex output is one class
- Problems:
  - Exponentially many classes
  - How to predict efficiently?
  - How to learn efficiently?
  - Potentially huge model
  - Manageable number of features?

[Figure: the sentence x = "The dog chased the cat" with candidate parse trees y_1, y_2, …, y_k.]

Multi-Class SVM (Crammer & Singer, 2001)

- Training examples:
- Inference:
- Training: find the weight vectors (one per class) that solve:
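The slide's formulas are missing from the transcript. One common way to write the Crammer and Singer multi-class SVM with $k$ classes is sketched below; the slack variables $\xi_i$ and the $C/n$ scaling are assumptions chosen to match the structural SVM notation used later in the talk.

$$
\begin{aligned}
\text{Training examples:}\quad & (x_1, y_1), \dots, (x_n, y_n), \quad y_i \in \{1, \dots, k\} \\
\text{Inference:}\quad & \hat{y} = \arg\max_{y \in \{1, \dots, k\}} w_y^T x \\
\text{Training:}\quad & \min_{w_1, \dots, w_k,\, \xi}\; \tfrac{1}{2} \sum_{y=1}^{k} \lVert w_y \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \\
& \text{s.t.}\quad \forall i,\; \forall y \ne y_i:\quad w_{y_i}^T x_i \ge w_y^T x_i + 1 - \xi_i, \quad \xi_i \ge 0
\end{aligned}
$$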

Multi-Class SVM (Crammer & Singer, 2001)

[Figure: the sentence x = "The dog chased the cat" with candidate parse trees y_1, y_2, y_4, y_12, y_34, y_58.]

Joint Feature Map

- Problem: exponential number of parameters
- Feature vector that describes the match between x and y
- Learn a single weight vector.
- Inference:

[Figure: the sentence x = "The dog chased the cat" with candidate parse trees y_1, y_2, y_4, y_12, y_34, y_58.]

Joint Feature Map for Trees

- Weighted context-free grammar
  - Each rule has a weight
  - The score of a tree is the sum of the weights of its rules
- Find the highest-scoring tree
  - Using a CKY parser

[Figure: the sentence x = "The dog chased the cat" and its parse tree y.]
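The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the grammar, its weights, and the function names `cky` and `rule_counts` are all made up for the example, and the grammar is assumed to be in Chomsky normal form. The joint feature map is taken to be the vector of rule counts, so the tree score equals the dot product of the rule weights with those counts.

```python
from collections import Counter

# Hypothetical weighted grammar in Chomsky normal form (weights are invented).
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 0.5, ("VP", "V", "NP"): 0.8}
LEXICAL = {("Det", "the"): 0.2, ("N", "dog"): 0.3, ("N", "cat"): 0.3, ("V", "chased"): 0.4}
WEIGHTS = {**BINARY, **LEXICAL}


def cky(words):
    """Return (score, tree) for the highest-scoring parse rooted at S, or None."""
    n = len(words)
    best = {}  # (i, j, label) -> (score, tree) for the span words[i:j]
    for i, word in enumerate(words):
        for (tag, w), weight in LEXICAL.items():
            if w == word:
                best[(i, i + 1, tag)] = (weight, (tag, word))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for (parent, left, right), weight in BINARY.items():
                    if (i, k, left) in best and (k, j, right) in best:
                        l_score, l_tree = best[(i, k, left)]
                        r_score, r_tree = best[(k, j, right)]
                        score = weight + l_score + r_score
                        if (i, j, parent) not in best or score > best[(i, j, parent)][0]:
                            best[(i, j, parent)] = (score, (parent, l_tree, r_tree))
    return best.get((0, n, "S"))


def rule_counts(tree):
    """Joint feature map for a tree: how often each grammar rule is used."""
    counts = Counter()
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, children[0])] += 1  # lexical rule, e.g. N -> dog
    else:
        counts[(label, children[0][0], children[1][0])] += 1  # binary rule
        for child in children:
            counts.update(rule_counts(child))
    return counts


score, tree = cky("the dog chased the cat".split())
features = rule_counts(tree)
# The CKY score and the dot product of weights with rule counts agree.
linear_score = sum(WEIGHTS[rule] * count for rule, count in features.items())
```

Because the score is linear in the rule counts, learning the grammar weights reduces to learning a single weight vector over those counts.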

Structured SVM

- Hard margin …
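The slide's formula is not in the transcript. A sketch of the hard-margin structural SVM, following the formulation of Tsochantaridis et al. (2005) with the joint feature map written as $\Psi(x, y)$ (notation assumed here), is:

$$
\min_{w}\; \tfrac{1}{2} \lVert w \rVert^2
\quad \text{s.t.}\quad \forall i,\; \forall y \in \mathcal{Y} \setminus \{y_i\}:\quad
w^T \Psi(x_i, y_i) - w^T \Psi(x_i, y) \ge 1
$$

There is one constraint for every incorrect output of every training example, which is why the number of constraints can be exponential in the size of the output.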

Structured SVM

- Soft margin
  - SVM$_1$
  - SVM$_2$
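The two soft-margin variants differ only in how slack is penalized. A sketch in the notation of Tsochantaridis et al. (2005), with $\Psi(x, y)$ the joint feature map (the exact form shown on the slide is not recoverable):

$$
\begin{aligned}
\text{SVM}_1:\quad & \min_{w, \xi}\; \tfrac{1}{2} \lVert w \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \\
\text{SVM}_2:\quad & \min_{w, \xi}\; \tfrac{1}{2} \lVert w \rVert^2 + \frac{C}{2n} \sum_{i=1}^{n} \xi_i^2 \\
\text{both s.t.}\quad & \forall i,\; \forall y \in \mathcal{Y} \setminus \{y_i\}:\quad
w^T \Psi(x_i, y_i) - w^T \Psi(x_i, y) \ge 1 - \xi_i
\end{aligned}
$$

SVM$_1$ additionally requires $\xi_i \ge 0$; with the quadratic penalty of SVM$_2$ this is satisfied automatically at the optimum.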

General Loss Function

- A loss function $\Delta(y_i, y)$ measures the difference between a prediction $y$ and the true value $y_i$. A $y$ with high loss should be penalized more severely.
- Slack re-scaling
- Margin re-scaling
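The two ways of building the loss into the constraints can be sketched as follows, using $\Delta(y_i, y)$ for the loss and $\Psi(x, y)$ for the joint feature map (notation after Tsochantaridis et al., 2005; the slide's own formulas are missing).

$$
\begin{aligned}
\text{Slack re-scaling:}\quad & \forall i,\; \forall y \ne y_i:\quad
w^T \Psi(x_i, y_i) - w^T \Psi(x_i, y) \ge 1 - \frac{\xi_i}{\Delta(y_i, y)} \\
\text{Margin re-scaling:}\quad & \forall i,\; \forall y \ne y_i:\quad
w^T \Psi(x_i, y_i) - w^T \Psi(x_i, y) \ge \Delta(y_i, y) - \xi_i
\end{aligned}
$$

Slack re-scaling scales the slack penalty by the loss, while margin re-scaling demands a larger margin for outputs with higher loss.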

A Cutting Plane Algorithm

- Only a polynomial number of constraints is needed.

A Cutting Plane Algorithm

- Cutting plane algorithm
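The algorithm on this slide was an image and is not in the transcript. The working-set idea can be sketched as below for margin re-scaling with a label set small enough to enumerate. This is a simplified illustration under several assumptions: all function names are hypothetical, the restricted problem is solved only approximately by subgradient descent rather than by a QP solver, and the weights are re-optimized once per pass over the data rather than after every added constraint.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def margin_violation(w, psi, loss, x, y, c):
    """Delta(y, c) - w . (Psi(x, y) - Psi(x, c)); positive if the constraint for c is violated."""
    return loss(y, c) - (dot(w, psi(x, y)) - dot(w, psi(x, c)))


def solve_restricted(examples, working, psi, loss, dim, C, steps):
    """Approximately minimize 0.5*||w||^2 + (C/n) * sum_i slack_i over the working sets.

    Stand-in for the QP solve: plain subgradient descent with step size 1/t.
    """
    w = [0.0] * dim
    n = len(examples)
    for t in range(1, steps + 1):
        grad = list(w)  # gradient of the regularizer
        for (x, y), constraints in zip(examples, working):
            if not constraints:
                continue
            worst = max(constraints, key=lambda c: margin_violation(w, psi, loss, x, y, c))
            if margin_violation(w, psi, loss, x, y, worst) > 0:
                for k, (a, b) in enumerate(zip(psi(x, y), psi(x, worst))):
                    grad[k] -= (C / n) * (a - b)
        w = [wk - gk / t for wk, gk in zip(w, grad)]
    return w


def cutting_plane(examples, labels, psi, loss, dim, C=1.0, eps=1e-3, max_rounds=20, steps=200):
    w = [0.0] * dim
    working = [[] for _ in examples]  # one constraint set per training example
    for _ in range(max_rounds):
        added = False
        for i, (x, y) in enumerate(examples):
            # Most violated constraint under margin re-scaling.
            y_hat = max(labels, key=lambda c: loss(y, c) + dot(w, psi(x, c)))
            slack = max([0.0] + [margin_violation(w, psi, loss, x, y, c) for c in working[i]])
            if margin_violation(w, psi, loss, x, y, y_hat) > slack + eps:
                working[i].append(y_hat)
                added = True
        if not added:
            break  # no constraint is violated by more than eps
        w = solve_restricted(examples, working, psi, loss, dim, C, steps)
    return w, working


# Toy two-class problem with a one-dimensional input (invented data).
def psi(x, c):
    return [x if k == c else 0.0 for k in range(2)]


def zero_one(y, c):
    return 0.0 if y == c else 1.0


examples = [(1.0, 0), (-1.0, 1)]
w, working = cutting_plane(examples, (0, 1), psi, zero_one, dim=2)
predictions = [max((0, 1), key=lambda c: dot(w, psi(x, c))) for x, _ in examples]
```

On this toy problem each example contributes a single constraint to its working set, and the loop stops in the second pass because no remaining constraint is violated by more than `eps`.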

Computational problem

- Prediction:
- Get the most violated constraint:
- Approximate inference methods in MRF
  - Training Structural SVMs when Exact Inference is Intractable. T. Finley, T. Joachims, ICML 2008
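The two argmax problems can be illustrated on a tiny sequence-labeling task where the output space is small enough to enumerate. The sketch below assumes margin re-scaling and Hamming loss; the feature map, label set, and function names are hypothetical. Real structured problems replace the brute-force enumeration with dynamic programming or approximate inference.

```python
from itertools import product

LABELS = ("A", "B")  # hypothetical label set


def psi(x, y):
    """Sparse joint feature map: emission and transition counts."""
    feats = {}
    for token, label in zip(x, y):
        key = ("emit", token, label)
        feats[key] = feats.get(key, 0) + 1
    for prev, cur in zip(y, y[1:]):
        key = ("trans", prev, cur)
        feats[key] = feats.get(key, 0) + 1
    return feats


def score(w, x, y):
    return sum(w.get(f, 0.0) * v for f, v in psi(x, y).items())


def hamming(y_true, y):
    return sum(a != b for a, b in zip(y_true, y))


def predict(w, x):
    """Prediction: argmax_y w . Psi(x, y)."""
    return max(product(LABELS, repeat=len(x)), key=lambda y: score(w, x, y))


def most_violated(w, x, y_true):
    """Loss-augmented inference: argmax_y Delta(y_true, y) + w . Psi(x, y)."""
    return max(product(LABELS, repeat=len(x)),
               key=lambda y: hamming(y_true, y) + score(w, x, y))


x = ("a", "b", "a")
y_true = ("A", "B", "A")
weak_w = {("emit", "a", "A"): 0.5, ("emit", "b", "B"): 0.5}
strong_w = {("emit", "a", "A"): 2.0, ("emit", "b", "B"): 2.0}
```

With `weak_w` the prediction is already correct, yet the margin is too small and the most violated constraint is the fully wrong labeling; with `strong_w` the loss-augmented argmax returns `y_true` itself, meaning no constraint is violated.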

Outline

- Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis et al., 2005)
- Max-Margin Markov Networks (Taskar et al., 2003)
- Learning Structural SVMs with Latent Variables (Yu and Joachims, 2009)

Max-Margin Markov Network

- Structured SVM entails a large number of constraints
  - So far, handled by adding one constraint at a time
- M$^3$ network
  - A way to solve SVM$_1$ with margin re-scaling
  - Use a Markov network to encode dependencies and generate features
  - Reduce an exponential number of constraints to a polynomial number

M$^3$ Network

- A way to generate features.
  - Define features on the edges
  - The k-th feature of this instance
- The loss function

M$^3$ Network

- A way to solve SVM$_1$ with margin re-scaling
- Primal:
- Dual:
- Only node and edge marginal probabilities are needed to compute the expectations
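The primal and dual formulas are missing from the transcript. A sketch in the notation of Taskar et al. (2003), where $\Delta f_x(y) = f(x, t(x)) - f(x, y)$ is the feature difference from the true labeling $t(x)$ and $\Delta t_x(y)$ is the per-label loss, is given below; details such as the scaling of $C$ are assumptions.

$$
\begin{aligned}
\text{Primal:}\quad & \min_{w, \xi}\; \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{x} \xi_x
\quad \text{s.t.}\quad \forall x,\; \forall y:\quad w^T \Delta f_x(y) \ge \Delta t_x(y) - \xi_x \\
\text{Dual:}\quad & \max_{\alpha}\; \sum_{x, y} \alpha_x(y)\, \Delta t_x(y)
- \tfrac{1}{2} \Big\lVert \sum_{x, y} \alpha_x(y)\, \Delta f_x(y) \Big\rVert^2
\quad \text{s.t.}\quad \forall x:\; \sum_{y} \alpha_x(y) = C,\quad \alpha_x(y) \ge 0
\end{aligned}
$$

Up to the factor $C$, the dual variables $\alpha_x(y)$ form a distribution over labelings of each example, and both terms of the dual objective are expectations under that distribution.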

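Polynomial-Size Reformulation

- The key step

| | $y_0$ | $y_1$ | $y_2$ | $\Delta t_x(y)$ | $\alpha_x(y)$ |
|---|---|---|---|---|---|
| All possible y | 1 | 1 | 1 | 1 | 0.1 |
| | 1 | 1 | 0 | 2 | |
| | 1 | 0 | 1 | 0 | |
| | 1 | 0 | 0 | 1 | |
| | 0 | 1 | 1 | 2 | |
| | 0 | 1 | 0 | 3 | 0.2 |
| | 0 | 0 | 1 | 1 | |
| | 0 | 0 | 0 | 2 | 0.1 |
| Gold y | 1 | 0 | 1 | | |

Marginals shown on the slide: $\mu_x(0)$ = 0.6, 0.5 and $\mu_x(1)$ = 0.4, 0.5.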

Polynomial-Size Reformulation

- The key step
  - Marginal dual variables
  - New constraints
  - Tree structure:
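The step from dual variables over full labelings to marginal dual variables can be sketched in Python. The distribution `alpha` below is invented for illustration (it is not the one on the slide) and is assumed to sum to 1, i.e. C = 1; the function names are hypothetical. The example also checks that the expected Hamming loss can be computed from node marginals alone, which is what makes the factored formulation possible.

```python
from itertools import product


def node_marginals(alpha, length, labels=(0, 1)):
    """mu_i(v) = sum of alpha(y) over all labelings y with y_i = v."""
    mu = [{v: 0.0 for v in labels} for _ in range(length)]
    for y, a in alpha.items():
        for i, v in enumerate(y):
            mu[i][v] += a
    return mu


def edge_marginals(alpha, length, labels=(0, 1)):
    """mu_{i,i+1}(u, v) = sum of alpha(y) over labelings with (y_i, y_{i+1}) = (u, v)."""
    mu = [{pair: 0.0 for pair in product(labels, repeat=2)} for _ in range(length - 1)]
    for y, a in alpha.items():
        for i in range(length - 1):
            mu[i][(y[i], y[i + 1])] += a
    return mu


def hamming(gold, y):
    return sum(g != v for g, v in zip(gold, y))


# Hypothetical dual variables over labelings of a length-3 binary chain.
alpha = {(1, 0, 1): 0.5, (1, 1, 1): 0.25, (0, 1, 0): 0.25}
gold = (1, 0, 1)

mu = node_marginals(alpha, 3)
mu_edge = edge_marginals(alpha, 3)

# Expected loss computed over full labelings ...
loss_full = sum(a * hamming(gold, y) for y, a in alpha.items())
# ... and from node marginals only: sum the mass on every wrong label.
loss_marginal = sum(mu[i][v] for i in range(3) for v in (0, 1) if v != gold[i])
```

Edge marginals obtained this way automatically agree with the node marginals (summing an edge marginal over one endpoint gives the node marginal of the other). In the factored QP this agreement has to be imposed as explicit consistency constraints.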

Polynomial-Size Reformulation

- Factored dual QP
  - Number of variables and constraints: from $N \cdot 2^M$ down to $N(M^2 + M)$, where N is the number of instances and M is the length of y
- Problem
  - If the structure is not simple, we may need an exponential number of new constraints
  - Enforce only local consistency of the marginals and get an approximate result

SMO

- Sequential minimal optimization
- In binary SVM, we have a linear constraint
- Working set selection: select the two variables to update
- M$^3$ net:

Experimental Results

- Max-Margin Parsing (Taskar et al., 2004)
  - Apply M$^3$ net to parsing
  - Discussed how to extract features from a grammar

Outline

- Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis et al., 2005)
- Max-Margin Markov Networks (Taskar et al., 2003)
- Learning Structural SVMs with Latent Variables (Yu and Joachims, 2009)

Latent Variable Models

- Widely used in machine learning and statistics
  - Unobserved quantities/missing data in experiments
  - Dimensionality reduction
- Classical examples: mixture models, PCA, LDA
- This paper: latent variables in supervised prediction tasks

Latent Variables in S-SVMs

- How can we extend structural SVM to handle latent variables?

Structured SVM

Latent S-SVM Formulation
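The content of this slide is missing from the transcript. A sketch of the formulation in Yu and Joachims (2009), with latent variable $h$ and joint feature map $\Psi(x, y, h)$ (notation assumed here), is:

$$
\begin{aligned}
\text{Prediction:}\quad & (\hat{y}, \hat{h}) = \arg\max_{(y, h) \in \mathcal{Y} \times \mathcal{H}} w^T \Psi(x, y, h) \\
\text{Training:}\quad & \min_{w}\; \tfrac{1}{2} \lVert w \rVert^2
+ C \sum_{i=1}^{n} \max_{(\hat{y}, \hat{h})} \big[ w^T \Psi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y}, \hat{h}) \big]
- C \sum_{i=1}^{n} \max_{h} w^T \Psi(x_i, y_i, h)
\end{aligned}
$$

The objective is a difference of two convex functions of $w$: the first two terms are convex, and the subtracted term is also convex, so the overall problem is not convex.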

CCCP Algorithm
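The slide's content is missing. As a sketch of how the concave-convex procedure (CCCP) applies to the latent structural SVM in Yu and Joachims (2009), each iteration $t$ alternates two steps (notation assumed, with $\Psi(x, y, h)$ the joint feature map over inputs, outputs, and latent variables):

$$
\begin{aligned}
\text{1. Impute latent variables:}\quad & h_i^{*} = \arg\max_{h} w_t^T \Psi(x_i, y_i, h) \quad \text{for each } i \\
\text{2. Solve a convex problem:}\quad & w_{t+1} = \arg\min_{w}\; \tfrac{1}{2} \lVert w \rVert^2
+ C \sum_{i=1}^{n} \max_{(\hat{y}, \hat{h})} \big[ w^T \Psi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y}, \hat{h}) \big]
- C \sum_{i=1}^{n} w^T \Psi(x_i, y_i, h_i^{*})
\end{aligned}
$$

Step 2 has the form of a standard structural SVM and can be solved with a cutting plane method. The objective does not increase from one iteration to the next, so the procedure converges to a local minimum or saddle point.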

aa

Noun Phrase Co-reference

Noun phrase co-reference results
