
1 Discriminative Machine Learning Topic 3: SVM Duality Slides available online at http://mpawankumar.info M. Pawan Kumar (Based on Prof. A. Zisserman’s course material)

2 Linear Classifier Training loss is 0, so a linear classifier is an appropriate choice. [Figure: linearly separable data in the (x_1, x_2) plane]

3 Linear Classifier Training loss is small, so a linear classifier is an appropriate choice. [Figures: two nearly separable datasets in the (x_1, x_2) plane]

4 Linear Classifier If the training loss is small, a linear classifier is an appropriate choice. If the training loss is large, a linear classifier is not an appropriate choice. [Figures: a nearly separable dataset and a non-separable dataset in the (x_1, x_2) plane]

5 Linear Classifier If the training loss is small, a linear classifier is an appropriate choice. If the training loss is large, the feature vector is not appropriate. [Figures: as on the previous slide]

6 Feature Vector We were using Φ(x) = [x_1, x_2]. Instead, let us use Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]. [Figures: the data in the original (x_1, x_2) plane, and the same data mapped into the (x_1^2, x_2^2, √2 x_1 x_2) space, where it becomes linearly separable]
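As a quick illustration of why this map helps, here is a minimal sketch; the sample points, the weight vector, and the function name phi are made up for the example:

```python
import numpy as np

def phi(x):
    """Quadratic feature map Phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# A point inside the unit circle and one outside: not separable by a line
# through the origin in (x1, x2), but Phi turns the circle into a plane,
# since w = [1, 1, 0] gives w @ phi(x) = x1^2 + x2^2.
x_in, x_out = np.array([0.3, 0.4]), np.array([1.2, -0.9])
w = np.array([1.0, 1.0, 0.0])
print(w @ phi(x_in))   # 0.25 (< 1: inside the circle)
print(w @ phi(x_out))  # 2.25 (> 1: outside the circle)
```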

7 Feature Vector Use a D-dimensional feature vector; the parameters will also be D-dimensional. For a large D (D >> n), the data may be linearly separable, so classification is accurate, but there is a large number of parameters to learn, so optimization is inefficient. Can we somehow avoid this?

8 Outline
Reformulation
– Examples
– SVM Learning Problem
SVM Dual
Kernels

9 Optimization – Simple Example min_ξ ξ s.t. ξ ≥ 3 [Figure: candidate values 2, 3, 4 on a number line; the minimizer is 3 ✔]

10 Optimization – Simple Example min_ξ ξ s.t. ξ ≥ 3, ξ ≥ 4, ξ ≥ 2 [Figure: candidate values 2, 3, 4; the minimizer is 4 ✔] We have to use the maximum lower bound. Let us make this a bit more abstract.

11 Constrained Optimization – Example min_ξ ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2 The minimizer is max{w_1 + w_2, 2w_1 + 3w_2} ✔. We have to use the maximum lower bound. Let us consider the other direction.

12 Unconstrained Optimization – Example min_w f(w_1, w_2) + max{w_1 + w_2, 2w_1 + 3w_2}

13 Unconstrained Optimization – Example min_{w,ξ} f(w_1, w_2) + ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2 This is an equivalent constrained optimization problem. We will call ξ a slack variable. Next, we reformulate the SVM learning problem in the same way.
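Written out, the equivalence is the following small derivation, with f left abstract:

```latex
\min_{w,\xi}\; f(w_1,w_2) + \xi
\quad \text{s.t.} \quad \xi \ge w_1 + w_2, \;\; \xi \ge 2w_1 + 3w_2
% For fixed w, the smallest feasible slack is
%   \xi^*(w) = \max\{\, w_1 + w_2,\; 2w_1 + 3w_2 \,\},
% so minimizing over \xi first recovers the unconstrained problem
%   \min_w\; f(w_1,w_2) + \max\{\, w_1 + w_2,\; 2w_1 + 3w_2 \,\}.
```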

14 Outline
Reformulation
– Examples
– SVM Learning Problem
SVM Dual
Kernels

15 SVM Learning Problem min_w λ||w||^2 + Σ_i [ max_y {w^T Ψ(x_i,y) + Δ(y_i,y)} - w^T Ψ(x_i,y_i) ]

16 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y) + Δ(y_i,y) - w^T Ψ(x_i,y_i) ≤ ξ_i for all y (Slight abuse of notation: the constraint is imposed for every i and every y.)

17 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,y) + Δ(y_i,y) ≤ ξ_i for all y where, with a slight abuse of notation, Ψ(x_i,y_i,y) = Ψ(x_i,y) - Ψ(x_i,y_i). This is a convex quadratic program.
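To connect this to the generic form on the next slide, here is the identification spelled out (an assumption-free bookkeeping step, with z packing w and the slacks):

```latex
% Casting the SVM primal into generic QP form with z = (w, \xi_1, \dots, \xi_n):
\min_{w,\xi}\; \lambda \|w\|^2 + \sum_i \xi_i
\quad\text{s.t.}\quad
w^\top \Psi(x_i, y_i, y) + \Delta(y_i, y) \le \xi_i \;\; \forall i, y
% Here Q = \mathrm{diag}(\lambda I,\, 0) \succeq 0 acts only on the w block,
% q places an all-ones vector on the \xi block, and every constraint is
% linear in z, so the problem matches min_z z^T Q z + z^T q + C.
```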

18 Convex Quadratic Program min_z z^T Q z + z^T q + C s.t. z^T a_i ≤ b_i, i = 1,…,m, with Q ⪰ 0. Many efficient solvers exist. But we already know how to optimize this; the real payoff of the reformulation is that it allows us to write down the dual.

19 Outline
Reformulation
SVM Dual
– Example
– Generalization
Kernels

20 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Since ξ ≥ 0, (w_1^2 + w_2^2) + 0 ≤ (w_1^2 + w_2^2) + ξ: a lower bound on the objective.
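To make the example concrete, here is a hedged numerical sketch that solves this small program with scipy's SLSQP solver; the solver choice and the variable packing are conveniences, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

# Primal: min (w1^2 + w2^2) + xi over z = (w1, w2, xi)
objective = lambda z: z[0]**2 + z[1]**2 + z[2]

# SLSQP expects inequality constraints in the form g(z) >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z: z[2] - (2*z[0] + z[1] + 1)},  # 2w1 + w2 + 1 <= xi
    {"type": "ineq", "fun": lambda z: z[2] - (z[0] + 2*z[1] + 1)},  # w1 + 2w2 + 1 <= xi
    {"type": "ineq", "fun": lambda z: z[2]},                        # 0 <= xi
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints,
               method="SLSQP")
# Expected optimum (see slides 26-29): w = (-1/3, -1/3), xi = 0, value 2/9.
print(res.x, res.fun)
```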

21 Example min_w (w_1^2 + w_2^2) + 0 is a lower bound on the objective. Setting derivatives with respect to w to 0 gives w_1 = 0, w_2 = 0, so this lower bound equals 0.

22 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Multiplying each constraint by 1/3 and summing gives w_1 + w_2 + 2/3 ≤ ξ, another lower bound on the objective.

23 Example min_w (w_1^2 + w_2^2) + w_1 + w_2 + 2/3 is a lower bound on the objective. Setting derivatives with respect to w to 0 gives w_1 = -1/2, w_2 = -1/2, so this lower bound equals 1/6. How can we find the maximum lower bound?

24 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Multiply the constraints by α_1, α_2, α_3 respectively, with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0 and α_1 + α_2 + α_3 = 1, and sum: (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ (α_1 + α_2 + α_3)ξ

25 Example min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ Since α_1 + α_2 + α_3 = 1, the combined constraint simplifies to (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ ξ

26 Example For fixed α with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0 and α_1 + α_2 + α_3 = 1, min_w (w_1^2 + w_2^2) + (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) is a lower bound on the objective. Setting derivatives with respect to w to 0 gives w_1 = -(2α_1 + α_2)/2, w_2 = -(α_1 + 2α_2)/2.

27 Example Substituting w_1 = -(2α_1 + α_2)/2 and w_2 = -(α_1 + 2α_2)/2 back in, the maximum lower bound is found by max_α -(2α_1 + α_2)^2/4 - (α_1 + 2α_2)^2/4 + (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1

28 Example Flipping the sign turns this into the dual problem: min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1 The dual is itself a convex quadratic program. Weak duality: the lower bound obtained from any feasible α is ≤ the value of the primal for any feasible w.

29 Example min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1 Strong duality: the lower bound at the optimal α equals the value of the primal at the optimal w.
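Following the same pattern as the primal sketch earlier, a hedged numerical check that solves the dual above and compares the two optimal values; again the solver choice is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

# Dual: min over alpha of (2a1 + a2)^2/4 + (a1 + 2a2)^2/4 - (a1 + a2)
dual = lambda a: (2*a[0] + a[1])**2/4 + (a[0] + 2*a[1])**2/4 - (a[0] + a[1])

constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1}]  # a1 + a2 + a3 = 1
bounds = [(0, None)] * 3                                      # each a_i >= 0

res = minimize(dual, x0=np.full(3, 1/3), bounds=bounds,
               constraints=constraints, method="SLSQP")
print(res.x)     # expected alpha ~ (2/9, 2/9, 5/9)
print(-res.fun)  # maximum lower bound ~ 2/9, matching the primal optimum
```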

30 Outline
Reformulation
SVM Dual
– Example
– Generalization
Kernels

31 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,y) + Δ(y_i,y) ≤ ξ_i for all i, y where Ψ(x_i,y_i,y) = Ψ(x_i,y) - Ψ(x_i,y_i). Introduce a multiplier α_i(y) for each constraint, with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.

32 SVM Learning Problem min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i,y_i,y) + Δ(y_i,y) ≤ ξ_i for all i, y Multiplying each constraint by α_i(y) and summing over y gives Σ_y α_i(y) (w^T Ψ(x_i,y_i,y) + Δ(y_i,y)) ≤ ξ_i with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.

33 SVM Learning Problem The lower bound is min_w λ||w||^2 + Σ_i Σ_y α_i(y) (w^T Ψ(x_i,y_i,y) + Δ(y_i,y)) with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i. Setting derivatives with respect to w to 0 gives w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ: a linear combination of joint feature vectors.
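The derivative step, spelled out:

```latex
\frac{\partial}{\partial w}\Big[\lambda \|w\|^2
  + \sum_i \sum_y \alpha_i(y)\,\big(w^\top \Psi(x_i,y_i,y) + \Delta(y_i,y)\big)\Big]
  = 2\lambda w + \sum_i \sum_y \alpha_i(y)\,\Psi(x_i,y_i,y) = 0
\;\;\Longrightarrow\;\;
w = -\frac{1}{2\lambda} \sum_i \sum_y \alpha_i(y)\,\Psi(x_i,y_i,y)
```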

34 SVM Learning Problem Substituting w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ gives the maximum lower bound: max_α -Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ + Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ)

35 SVM Learning Problem Flipping the sign, equivalently: min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ), w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ

36 SVM Dual Problem min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ), w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ How do we deal with high-dimensional features?

37 Outline
Reformulation
SVM Dual
Kernels
– Prediction
– Learning
– Results

38 Prediction Recall w = -Σ_i Σ_y α_i(y) Ψ(x_i,y_i,y) / 2λ; define β_i(y) = α_i(y) / 2λ. We consider the multiclass SVM (M-SVM) here; binary classification is covered in the example sheet.

39 Prediction w = -Σ_i Σ_y β_i(y) Ψ(x_i,y_i,y)

40 Prediction w = Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y)) Given a test input x, predict y(w) = argmax_ŷ w^T Ψ(x,ŷ).

41 Prediction w = Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y)) We need to evaluate the score w^T Ψ(x,ŷ).

42 Prediction w^T Ψ(x,ŷ) = Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y))^T Ψ(x,ŷ) We need to compute dot products of features. Let us take a closer look at one product term.

43 Dot Product Consider one term Ψ(x_i,y)^T Ψ(x,ŷ). The joint feature vector stacks Φ(x) into the block of its class: Ψ(x,1) = [Φ(x); 0; …; 0], Ψ(x,2) = [0; Φ(x); …; 0], and so on. Hence Ψ(x_i,y)^T Ψ(x,ŷ) = 0 if y ≠ ŷ.

44 Dot Product With the same block structure, Ψ(x_i,y)^T Ψ(x,ŷ) = Φ(x_i)^T Φ(x) if y = ŷ.

45 Dot Product Ψ(x_i,y)^T Ψ(x,ŷ) = Φ(x_i)^T Φ(x) if y = ŷ. Computing this dot product explicitly is an O(D) operation for D-dimensional features. Isn’t that as expensive as computing the features? Not necessarily: we do not need the feature vector Φ(·) itself, only a function that computes the dot product directly, a kernel.

46 Kernel For Φ(x) = [x_1, x_2], we can use the kernel k(x,x’) = x_1 x’_1 + x_2 x’_2. [Figure: data in the (x_1, x_2) plane]

47 Kernel For Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2], the corresponding kernel is k(x,x’) = x_1^2 (x’_1)^2 + x_2^2 (x’_2)^2 + 2 x_1 x’_1 x_2 x’_2 = (x^T x’)^2, which can be evaluated without ever forming Φ. [Figure: data in the (x_1, x_2) plane]

48 Kernel The kernel k(x,x’) = exp(-||x-x’||^2 / 2σ^2) corresponds to an infinite-dimensional feature vector. [Figure: data in the (x_1, x_2) plane]

49 Prediction – Summary y(w) = argmax_ŷ Σ_i Σ_y β_i(y) (Ψ(x_i,y_i) - Ψ(x_i,y))^T Ψ(x,ŷ) Many of the dot products are 0; compute the non-zero ones using kernels. Compute the score for every possible ŷ and choose the maximum score to make the prediction.
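A rough sketch of kernelized prediction under this expansion; the names (score, predict, beta, kernel) are hypothetical, and beta is assumed to come from solving the dual:

```python
def score(x, y_hat, X_train, y_train, beta, kernel, num_classes):
    """w^T Psi(x, y_hat), expanded with kernels. Only the block whose class
    matches y_hat survives (slides 43-45), so each term reduces to
    k(x_i, x) * ([y_i == y_hat] - [y == y_hat])."""
    s = 0.0
    for i in range(len(X_train)):
        k_i = kernel(X_train[i], x)
        for y in range(num_classes):
            # (Psi(x_i, y_i) - Psi(x_i, y))^T Psi(x, y_hat)
            s += beta[i][y] * k_i * ((y_train[i] == y_hat) - (y == y_hat))
    return s

def predict(x, X_train, y_train, beta, kernel, num_classes):
    """Score every candidate label and choose the maximum."""
    return max(range(num_classes),
               key=lambda y_hat: score(x, y_hat, X_train, y_train,
                                       beta, kernel, num_classes))
```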

50 Kernel Commonly used kernels:
Linear: k(x,x’) = x^T x’
Polynomial: k(x,x’) = (1 + x^T x’)^d (Φ(·) has all polynomial terms up to degree d)
Gaussian or RBF: k(x,x’) = exp(-||x-x’||^2 / 2σ^2) (Φ(·) is infinite-dimensional)
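A brief sketch of these three kernels, with a sanity check connecting the degree-2 case back to the explicit feature map of slide 6; the sample points are made up:

```python
import numpy as np

# The three kernels from the slide; sigma and d are hyperparameters.
def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, d=2):
    return (1 + x @ xp) ** d

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# Sanity check (slides 6 and 47): the homogeneous degree-2 kernel (x^T x')^2
# equals the dot product of the explicit features [x1^2, x2^2, sqrt(2)*x1*x2].
phi = lambda x: np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])
x, xp = np.array([0.3, 0.4]), np.array([1.2, 0.9])
assert np.isclose((x @ xp) ** 2, phi(x) @ phi(xp))
```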

51 Outline
Reformulation
SVM Dual
Kernels
– Prediction
– Learning
– Results

52 SVM Dual Problem min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Q(i,y,j,ŷ) = Ψ(x_i,y_i,y)^T Ψ(x_j,y_j,ŷ) We need to compute Q, and Q only requires dot products: the kernel trick applies.
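Putting the kernel trick into code form: a sketch of one entry of Q, expanded with the block-structure identity Ψ(x,y)^T Ψ(x’,y’) = [y = y’] k(x,x’) from slides 43 and 44; the function name and argument layout are hypothetical:

```python
def Q_entry(i, y, j, y_hat, X, labels, kernel):
    """Q(i,y,j,y_hat) = Psi(x_i,y_i,y)^T Psi(x_j,y_j,y_hat).
    Expanding Psi(x_i,y_i,y) = Psi(x_i,y) - Psi(x_i,y_i) and using
    Psi(x,y)^T Psi(x',y') = [y == y'] * k(x,x') gives
      k(x_i,x_j) * ([y==y_hat] - [y==y_j] - [y_i==y_hat] + [y_i==y_j]),
    so only one kernel evaluation is needed per entry."""
    k_ij = kernel(X[i], X[j])
    return k_ij * ((y == y_hat) - (y == labels[j])
                   - (labels[i] == y_hat) + (labels[i] == labels[j]))
```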

53 Computational Efficiency min_α Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) / 4λ - Σ_i Σ_y α_i(y) Δ(y_i,y) s.t. α_i(y) ≥ 0 for all i, y; Σ_y α_i(y) = 1 for all i Is this a convex quadratic program? Yes: for Mercer kernels, Q ⪰ 0.

54 Outline
Reformulation
SVM Dual
Kernels
– Prediction
– Learning
– Results

55 Data The data is not linearly separable in the original space, so we use an RBF kernel. [Figure: the dataset]

56 Results σ = 1.0, λ = 0 [Figure: resulting decision boundary]

57 Results σ = 1.0, λ = 0.01. Increasing λ increases the margin. [Figure: resulting decision boundary]

58 Results σ = 1.0, λ = 0.1 [Figure: resulting decision boundary]

59 Results σ = 1.0, λ = 0 [Figure: resulting decision boundary]

60 Results σ = 0.25, λ = 0 [Figure: resulting decision boundary]

61 Results σ = 0.1, λ = 0. How does σ affect prediction? [Figure: resulting decision boundary]

62 Results σ = 0.1, λ = 0 (the effect of σ is explored in the example sheet). [Figure: resulting decision boundary]

63 Questions?

