
1 Discriminative Machine Learning Topic 3: SVM Duality Slides available online at http://mpawankumar.info M. Pawan Kumar (Based on Prof. A. Zisserman’s course material)

2 Linear Classifier Training loss is 0, so a linear classifier is an appropriate choice. [Figure: linearly separable data in the (x_1, x_2) plane]

3 Linear Classifier Training loss is small, so a linear classifier is an appropriate choice. [Figures: two nearly separable datasets in the (x_1, x_2) plane]

4 Linear Classifier If the training loss is small, a linear classifier is an appropriate choice. If the training loss is large, a linear classifier is not an appropriate choice. [Figures: a nearly separable dataset and a non-separable dataset in the (x_1, x_2) plane]

5 Linear Classifier If the training loss is small, a linear classifier is an appropriate choice. If the training loss is large, the feature vector is not appropriate. [Figures: as on the previous slide]

6 Feature Vector We were using Φ(x) = [x_1, x_2]. Instead, let us use Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]. [Figures: the data in the original (x_1, x_2) plane, and the same data mapped into the (x_1^2, x_2^2, √2 x_1 x_2) space, where it becomes linearly separable]
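As a quick illustration of why this map helps, here is a minimal sketch; the sample points, the weight vector, and the function name phi are made up for the example:

```python
import numpy as np

def phi(x):
    """Quadratic feature map Phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# A point inside the unit circle and one outside: not separable by a line
# through the origin in (x1, x2), but Phi turns the circle into a plane,
# since w = [1, 1, 0] gives w @ phi(x) = x1^2 + x2^2.
x_in, x_out = np.array([0.3, 0.4]), np.array([1.2, -0.9])
w = np.array([1.0, 1.0, 0.0])
print(w @ phi(x_in))   # 0.25 (< 1: inside the circle)
print(w @ phi(x_out))  # 2.25 (> 1: outside the circle)
```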

7 Feature Vector Use a D-dimensional feature vector; the parameters will also be D-dimensional. For a large D (D >> n), the data may be linearly separable, so classification is accurate, but there is a large number of parameters to learn, so optimization is inefficient. Can we somehow avoid this?

8 Outline
Reformulation
– Examples
– SVM Learning Problem
SVM Dual
Kernels

9 Optimization – Simple Example min_ξ ξ s.t. ξ ≥ 3 [Figure: candidate values 2, 3, 4 on a number line; the minimizer is 3 ✔]

10 Optimization – Simple Example min_ξ ξ s.t. ξ ≥ 3, ξ ≥ 4, ξ ≥ 2 [Figure: candidate values 2, 3, 4; the minimizer is 4 ✔] We have to use the maximum lower bound. Let us make this a bit more abstract.

11 Constrained Optimization – Example min_ξ ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2 The minimizer is max{w_1 + w_2, 2w_1 + 3w_2} ✔. We have to use the maximum lower bound. Let us consider the other direction.

12 Unconstrained Optimization – Example min_w f(w_1, w_2) + max{w_1 + w_2, 2w_1 + 3w_2}

13 Unconstrained Optimization – Example min_{w,ξ} f(w_1, w_2) + ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2 This is an equivalent constrained optimization problem. We will call ξ a slack variable. Next, we reformulate the SVM learning problem in the same way.
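Written out, the equivalence is the following small derivation, with f left abstract:

```latex
\min_{w,\xi}\; f(w_1,w_2) + \xi
\quad \text{s.t.} \quad \xi \ge w_1 + w_2, \;\; \xi \ge 2w_1 + 3w_2
% For fixed w, the smallest feasible slack is
%   \xi^*(w) = \max\{\, w_1 + w_2,\; 2w_1 + 3w_2 \,\},
% so minimizing over \xi first recovers the unconstrained problem
%   \min_w\; f(w_1,w_2) + \max\{\, w_1 + w_2,\; 2w_1 + 3w_2 \,\}.
```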

14 Outline
Reformulation
– Examples
– SVM Learning Problem
SVM Dual
Kernels

15 SVM Learning Problem min_w λ||w||^2 + Σ_i [ max_y {w^T Ψ(x_i,y) + Δ(y_i,y)} - w^T Ψ(x_i,y_i) ]

16 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y) + Δ(y_i,y) - w^T Ψ(x_i,y_i) ≤ ξ_i for all y (Slight abuse of notation: the constraint is imposed for every i and every y.)

17 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,y) + Δ(y_i,y) ≤ ξ_i for all y where, with a slight abuse of notation, Ψ(x_i,y_i,y) = Ψ(x_i,y) - Ψ(x_i,y_i). This is a convex quadratic program.
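To connect this to the generic form on the next slide, here is the identification spelled out (an assumption-free bookkeeping step, with z packing w and the slacks):

```latex
% Casting the SVM primal into generic QP form with z = (w, \xi_1, \dots, \xi_n):
\min_{w,\xi}\; \lambda \|w\|^2 + \sum_i \xi_i
\quad\text{s.t.}\quad
w^\top \Psi(x_i, y_i, y) + \Delta(y_i, y) \le \xi_i \;\; \forall i, y
% Here Q = \mathrm{diag}(\lambda I,\, 0) \succeq 0 acts only on the w block,
% q places an all-ones vector on the \xi block, and every constraint is
% linear in z, so the problem matches min_z z^T Q z + z^T q + C.
```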

18 Convex Quadratic Program min_z z^T Q z + z^T q + C s.t. z^T a_i ≤ b_i, i = 1,…,m, with Q ⪰ 0. Many efficient solvers exist. But we already know how to optimize this; the real payoff of the reformulation is that it allows us to write down the dual.

19 Outline
Reformulation
SVM Dual
– Example
– Generalization
Kernels

20 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Since ξ ≥ 0, (w_1^2 + w_2^2) + 0 ≤ (w_1^2 + w_2^2) + ξ: a lower bound on the objective.
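To make the example concrete, here is a hedged numerical sketch that solves this small program with scipy's SLSQP solver; the solver choice and the variable packing are conveniences, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

# Primal: min (w1^2 + w2^2) + xi over z = (w1, w2, xi)
objective = lambda z: z[0]**2 + z[1]**2 + z[2]

# SLSQP expects inequality constraints in the form g(z) >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z: z[2] - (2*z[0] + z[1] + 1)},  # 2w1 + w2 + 1 <= xi
    {"type": "ineq", "fun": lambda z: z[2] - (z[0] + 2*z[1] + 1)},  # w1 + 2w2 + 1 <= xi
    {"type": "ineq", "fun": lambda z: z[2]},                        # 0 <= xi
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints,
               method="SLSQP")
# Expected optimum (see slides 26-29): w = (-1/3, -1/3), xi = 0, value 2/9.
print(res.x, res.fun)
```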

21 Example min_w (w_1^2 + w_2^2) + 0 is a lower bound on the objective. Setting derivatives with respect to w to 0 gives w_1 = 0, w_2 = 0, so this lower bound equals 0.

22 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Multiplying each constraint by 1/3 and summing gives w_1 + w_2 + 2/3 ≤ ξ, another lower bound on the objective.

23 Example min_w (w_1^2 + w_2^2) + w_1 + w_2 + 2/3 is a lower bound on the objective. Setting derivatives with respect to w to 0 gives w_1 = -1/2, w_2 = -1/2, so this lower bound equals 1/6. How can we find the maximum lower bound?

24 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Multiply the constraints by α_1, α_2, α_3 respectively, with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0 and α_1 + α_2 + α_3 = 1, and sum: (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ (α_1 + α_2 + α_3)ξ

25 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Since α_1 + α_2 + α_3 = 1, the combined constraint simplifies to (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ ξ

26 Example For fixed α with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0 and α_1 + α_2 + α_3 = 1, min_w (w_1^2 + w_2^2) + (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) is a lower bound on the objective. Setting derivatives with respect to w to 0 gives w_1 = -(2α_1 + α_2)/2, w_2 = -(α_1 + 2α_2)/2.

27 Example Substituting w_1 = -(2α_1 + α_2)/2 and w_2 = -(α_1 + 2α_2)/2 back in, the maximum lower bound is found by max_α -(2α_1 + α_2)^2/4 - (α_1 + 2α_2)^2/4 + (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1

28 Example Flipping the sign turns this into the dual problem: min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1 The dual is itself a convex quadratic program. Weak duality: the lower bound obtained from any feasible α is ≤ the value of the primal for any feasible w.

29 Example min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1 Strong duality: the lower bound at the optimal α equals the value of the primal at the optimal w.
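Following the same pattern as the primal sketch earlier, a hedged numerical check that solves the dual above and compares the two optimal values; again the solver choice is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

# Dual: min over alpha of (2a1 + a2)^2/4 + (a1 + 2a2)^2/4 - (a1 + a2)
dual = lambda a: (2*a[0] + a[1])**2/4 + (a[0] + 2*a[1])**2/4 - (a[0] + a[1])

constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1}]  # a1 + a2 + a3 = 1
bounds = [(0, None)] * 3                                      # each a_i >= 0

res = minimize(dual, x0=np.full(3, 1/3), bounds=bounds,
               constraints=constraints, method="SLSQP")
print(res.x)     # expected alpha ~ (2/9, 2/9, 5/9)
print(-res.fun)  # maximum lower bound ~ 2/9, matching the primal optimum
```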

30 Outline
Reformulation
SVM Dual
– Example
– Generalization
Kernels

31 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,y) + Δ(y_i,y) ≤ ξ_i for all i, y where Ψ(x_i,y_i,y) = Ψ(x_i,y) - Ψ(x_i,y_i). Introduce a multiplier α_i(y) for each constraint, with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.

32 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,y) + Δ(y_i,y) ≤ ξ_i for all i, y Multiplying each constraint by α_i(y) and summing over y gives Σ_y α_i(y) (w^T Ψ(x_i,y_i,y) + Δ(y_i,y)) ≤ ξ_i with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.

33 SVM Learning Problem The lower bound is min_w λ||w||^2 + Σ_i Σ_y α_i(y) (w^T Ψ(x_i,y_i,y) + Δ(y_i,y)) with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i. Setting derivatives with respect to w to 0 gives w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ: a linear combination of joint feature vectors.
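The derivative step, spelled out:

```latex
\frac{\partial}{\partial w}\Big[\lambda \|w\|^2
  + \sum_i \sum_y \alpha_i(y)\,\big(w^\top \Psi(x_i,y_i,y) + \Delta(y_i,y)\big)\Big]
  = 2\lambda w + \sum_i \sum_y \alpha_i(y)\,\Psi(x_i,y_i,y) = 0
\;\;\Longrightarrow\;\;
w = -\frac{1}{2\lambda} \sum_i \sum_y \alpha_i(y)\,\Psi(x_i,y_i,y)
```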

34 SVM Learning Problem Substituting w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ gives the maximum lower bound: max_α -Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ + Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ)

35 SVM Learning Problem Flipping the sign, equivalently: min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ), w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ

36 SVM Dual Problem min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ), w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ How do we deal with high-dimensional features?

37 Outline
Reformulation
SVM Dual
Kernels
– Prediction
– Learning
– Results

38 Prediction Recall w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ; define β_i(y) = α_i(y) / 2λ. We consider the multiclass SVM (M-SVM) here; binary classification is covered in the example sheet.

39 Prediction w = -Σ_i Σ_y β_i(y) Ψ(x_i,y_i,y)

40 Prediction w = Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y)) Given a test input x, predict y(w) = argmax_ŷ w^T Ψ(x,ŷ).

41 Prediction w = Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y)) We need to evaluate the score w^T Ψ(x,ŷ).

42 Prediction w^T Ψ(x,ŷ) = Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y))^T Ψ(x,ŷ) We need to compute dot products of features. Let us take a closer look at one product term.

43 Dot Product Consider one term Ψ(x_i,y)^T Ψ(x,ŷ). The joint feature vector stacks Φ(x) into the block of its class: Ψ(x,1) = [Φ(x); 0; …; 0], Ψ(x,2) = [0; Φ(x); …; 0], and so on. Hence Ψ(x_i,y)^T Ψ(x,ŷ) = 0 if y ≠ ŷ.

44 Dot Product With the same block structure, Ψ(x_i,y)^T Ψ(x,ŷ) = Φ(x_i)^T Φ(x) if y = ŷ.

45 Dot Product Ψ(x_i,y)^T Ψ(x,ŷ) = Φ(x_i)^T Φ(x) if y = ŷ. Computing this dot product explicitly is an O(D) operation for D-dimensional features. Isn’t that as expensive as computing the features? Not necessarily: we do not need the feature vector Φ(·) itself, only a function that computes the dot product directly, a kernel.

46 Kernel For Φ(x) = [x_1, x_2], we can use the kernel k(x,x’) = x_1 x’_1 + x_2 x’_2. [Figure: data in the (x_1, x_2) plane]

47 Kernel For Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2], the corresponding kernel is k(x,x’) = x_1^2 (x’_1)^2 + x_2^2 (x’_2)^2 + 2 x_1 x’_1 x_2 x’_2 = (x^T x’)^2, which can be evaluated without ever forming Φ. [Figure: data in the (x_1, x_2) plane]

48 Kernel The kernel k(x,x’) = exp(-||x-x’||^2 / 2σ^2) corresponds to an infinite-dimensional feature vector. [Figure: data in the (x_1, x_2) plane]

49 Prediction – Summary y(w) = argmax_ŷ Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y))^T Ψ(x,ŷ) Many of the dot products are 0; compute the non-zero ones using kernels. Compute the score for every possible ŷ and choose the maximum score to make the prediction.
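A rough sketch of kernelized prediction under this expansion; the names (score, predict, beta, kernel) are hypothetical, and beta is assumed to come from solving the dual:

```python
def score(x, y_hat, X_train, y_train, beta, kernel, num_classes):
    """w^T Psi(x, y_hat), expanded with kernels. Only the block whose class
    matches y_hat survives (slides 43-45), so each term reduces to
    k(x_i, x) * ([y_i == y_hat] - [y == y_hat])."""
    s = 0.0
    for i in range(len(X_train)):
        k_i = kernel(X_train[i], x)
        for y in range(num_classes):
            # (Psi(x_i, y_i) - Psi(x_i, y))^T Psi(x, y_hat)
            s += beta[i][y] * k_i * ((y_train[i] == y_hat) - (y == y_hat))
    return s

def predict(x, X_train, y_train, beta, kernel, num_classes):
    """Score every candidate label and choose the maximum."""
    return max(range(num_classes),
               key=lambda y_hat: score(x, y_hat, X_train, y_train,
                                       beta, kernel, num_classes))
```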

50 Kernel Commonly used kernels:
Linear: k(x,x’) = x^T x’
Polynomial: k(x,x’) = (1 + x^T x’)^d (Φ(·) has all polynomial terms up to degree d)
Gaussian or RBF: k(x,x’) = exp(-||x-x’||^2 / 2σ^2) (Φ(·) is infinite-dimensional)
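A brief sketch of these three kernels, with a sanity check connecting the degree-2 case back to the explicit feature map of slide 6; the sample points are made up:

```python
import numpy as np

# The three kernels from the slide; sigma and d are hyperparameters.
def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, d=2):
    return (1 + x @ xp) ** d

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# Sanity check (slides 6 and 47): the homogeneous degree-2 kernel (x^T x')^2
# equals the dot product of the explicit features [x1^2, x2^2, sqrt(2)*x1*x2].
phi = lambda x: np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])
x, xp = np.array([0.3, 0.4]), np.array([1.2, 0.9])
assert np.isclose((x @ xp) ** 2, phi(x) @ phi(xp))
```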

51 Outline
Reformulation
SVM Dual
Kernels
– Prediction
– Learning
– Results

52 SVM Dual Problem min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ) We need to compute Q, and Q only requires dot products: the kernel trick applies.
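Putting the kernel trick into code form: a sketch of one entry of Q, expanded with the block-structure identity Ψ(x,y)^T Ψ(x’,y’) = [y = y’] k(x,x’) from slides 43 and 44; the function name and argument layout are hypothetical:

```python
def Q_entry(i, y, j, y_hat, X, labels, kernel):
    """Q(i,y,j,y_hat) = Psi(x_i,y_i,y)^T Psi(x_j,y_j,y_hat).
    Expanding Psi(x_i,y_i,y) = Psi(x_i,y) - Psi(x_i,y_i) and using
    Psi(x,y)^T Psi(x',y') = [y == y'] * k(x,x') gives
      k(x_i,x_j) * ([y==y_hat] - [y==y_j] - [y_i==y_hat] + [y_i==y_j]),
    so only one kernel evaluation is needed per entry."""
    k_ij = kernel(X[i], X[j])
    return k_ij * ((y == y_hat) - (y == labels[j])
                   - (labels[i] == y_hat) + (labels[i] == labels[j]))
```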

53 Computational Efficiency min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Is this a convex quadratic program? Yes: for Mercer kernels, Q ⪰ 0.

54 Outline
Reformulation
SVM Dual
Kernels
– Prediction
– Learning
– Results

55 Data The data is not linearly separable in the original space, so we use an RBF kernel. [Figure: the dataset]

56 Results σ = 1.0, λ = 0 [Figure: resulting decision boundary]

57 Results σ = 1.0, λ = 0.01. Increasing λ increases the margin. [Figure: resulting decision boundary]

58 Results σ = 1.0, λ = 0.1 [Figure: resulting decision boundary]

59 Results σ = 1.0, λ = 0 [Figure: resulting decision boundary]

60 Results σ = 0.25, λ = 0 [Figure: resulting decision boundary]

61 Results σ = 0.1, λ = 0. How does σ affect prediction? [Figure: resulting decision boundary]

62 Results σ = 0.1, λ = 0 (the effect of σ is explored in the example sheet). [Figure: resulting decision boundary]

63 Questions?

