# Multiclass SVM and Applications in Object Classification


Multiclass SVM and Applications in Object Classification
Yuval Kaminka, Einat Granot
Advanced Topics in Computer Vision Seminar, Faculty of Mathematics and Computer Science, Weizmann Institute, May 2007

Outline
- Motivation and Introduction
- Classification Algorithms
  - K-Nearest neighbors (KNN)
  - SVM
  - Multiclass SVM
  - DAGSVM
  - SVM-KNN
- Results - A taste of the distance
  - Shape distance (shape context, tangent)
  - Texture (texton histograms)

Object Classification
?

Motivation – Human Visual System
- Large number of categories (~30,000)
- Discriminative process
- Small set of examples
- Invariance to transformation
- Similarity to prototype instead of features

Similarity to Prototypes Vs Features
- No need for a feature space
- Easy to enlarge the number of categories
- Includes spatial relations between features
- No need for feature definition, for example in the tangent distance

Distance Function D(·, ·)
- Similarity is defined by a distance function
- Easy to adjust to different types (shape, texture)
- Can include invariance to intra-class transformations

Distance Function – simple example
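Each image is represented by a vector of values, and the distance between two images A and B is the norm of the difference:

$D(A, B) = \lVert (2.1, 27, 31, 15, 8, \ldots) - (13, 45, 22.5, 78, 91, \ldots) \rVert = \,?$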
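As a minimal sketch of this simple example, assuming the norm is the Euclidean (L2) norm and using only the five values visible on the slide, the distance can be computed as follows. The function name `l2_distance` is illustrative and not from the presentation.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Only the values shown on the slide; the real vectors may be longer.
image_a = [2.1, 27, 31, 15, 8]
image_b = [13, 45, 22.5, 78, 91]
print(l2_distance(image_a, image_b))
```

Any other distance with the same two-argument signature could be swapped in, which is the point the slides make about adjusting the distance to shape or texture.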


A Classic Classification Problem
- Training set S: (X1..Xn), with class labels (Y1..Yn)
- Given a query image q, determine its label

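Nearest Neighbor (NN)
The query takes the label of its single nearest training sample.

K-Nearest Neighbor (KNN)
The query takes the majority label among its K nearest training samples (K = 3 in the illustrated example).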
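A minimal sketch of KNN classification with a pluggable distance function is shown below. The toy samples, the label strings, and the use of `math.dist` (Euclidean distance) as the default are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_label(query, samples, labels, k=3, dist=math.dist):
    """Return the majority label among the k training samples nearest to query."""
    # Rank training indices by distance to the query (n distance evaluations).
    ranked = sorted(range(len(samples)), key=lambda i: dist(query, samples[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    # On a tie, most_common returns the label encountered first,
    # i.e. the one whose nearest member is closest to the query.
    return votes.most_common(1)[0][0]

samples = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["class 1", "class 1", "class 1", "class 2", "class 2", "class 2"]
print(knn_label((0.5, 0.5), samples, labels, k=3))
print(knn_label((5.2, 5.1), samples, labels, k=1))  # k=1 is plain NN
```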

K-NN Pros
- Simple, yet can outperform other methods
- Low complexity: O(D·n), where D is the cost of one distance function calculation
- No need for a feature space definition
- No computational cost for adding new categories
- n → ∞ ⇒ error rate → Bayes optimal (for KNN with K → ∞ and K/n → 0)
- Bayes optimal: a classifier that always outputs the label with maximum probability, taken over all possible hypotheses

K-NN Cons
[Figure: NN vs. SVM on a complete set and on a missing set]
P. Vincent et al., K-local hyperplane and convex distance nearest neighbor algorithms, NIPS 2001


SVM
- Two-class classification algorithm
- Hyperplane: a subset of vectors of dimension n-1 that defines a separation in the n-dimensional space
- Linear hyperplane: a hyperplane that passes through the origin
- We're looking for a hyperplane that best separates the classes

Some of the slides on SVM are adapted with permission from Martin Law's presentation on SVM.

SVM - Motivation
The hyperplane should be as far away as possible from the data of both classes.

SVM – A learning algorithm
- KNN: simple classification, no training
- SVM: a learning algorithm with two phases:
  - Training: find the hyperplane
  - Classification: label a new query

SVM – Training Phase
We're looking for $(w, b)$, defining the hyperplane $w^T x + b = 0$, that will:
- Classify the classes correctly
- Give maximum margins

1. Correct classification
- $\{x_1, \ldots, x_n\}$ is our training set
- Correct classification: $w^T x_i + b > 0$ for green, and $w^T x_i + b < 0$ for red
- Assume the labels $\{y_1, \ldots, y_n\}$ are from the set $\{-1, 1\}$: $y_i (w^T x_i + b) > 0$ for all $i$

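2. Margin maximization
The margin $m$ is the distance from the hyperplane to the closest point $z$: $m = \frac{|w^T z + b|}{\lVert w \rVert}$

- We can scale $(w, b) \to (\lambda w, \lambda b)$, $\lambda > 0$
- This won't change the classification: $w^T x + b > 0 \iff \lambda w^T x + \lambda b > 0$
- We can get a desired distance: $|w^T z + b| = a \Rightarrow \lambda = 1/a$, $|\lambda w^T z + \lambda b| = 1$
- After this scaling, $m = 1 / \lVert w \rVert$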
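The scaling argument can be checked numerically. The sketch below uses a hypothetical hyperplane `(w, b)` and a hypothetical closest point `z`; it shows that rescaling by λ = 1/a leaves the geometric distance unchanged while making |wᵀz + b| equal to 1.

```python
import math

def score(w, b, x):
    """Signed value w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, z):
    """Geometric distance |w^T z + b| / ||w||."""
    return abs(score(w, b, z)) / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -5.0      # hypothetical hyperplane
z = [3.0, 4.0]               # hypothetical closest point
a = abs(score(w, b, z))      # |w^T z + b| before scaling
lam = 1.0 / a
w_scaled, b_scaled = [lam * wi for wi in w], lam * b

print(distance_to_hyperplane(w, b, z))                # margin before scaling
print(abs(score(w_scaled, b_scaled, z)))              # |w^T z + b| after scaling
print(distance_to_hyperplane(w_scaled, b_scaled, z))  # margin after scaling
```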

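SVM as an Optimization Problem
- Maximize margins: $\min_{w,b} \frac{1}{2} \lVert w \rVert^2$
- Correct classification: s.t. $y_i (w^T x_i + b) \ge 1$ for all $i$
- Solve an optimization problem with constraints
- We can find $a_1, \ldots, a_n$ (Lagrangian multipliers) such that: $w = \sum_{i=1}^{n} a_i y_i x_i$

C.J.C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998.

SVM as an Optimization Problem
Classic optimization problem with constraints: $\min_x f(x)$ s.t. $g_i(x) \le 0$, $i = 1, \ldots, n$

SVM as an Optimization Problem
- For the classic problem, there must exist non-negative $a_1, \ldots, a_n$ such that: $\nabla f(x) + \sum_{i} a_i \nabla g_i(x) = 0$
- In our case, with $f(w) = \frac{1}{2} \lVert w \rVert^2$ and $g_i(w) = 1 - y_i (w^T x_i + b)$, there must exist non-negative $a_1, \ldots, a_n$ such that: $w - \sum_{i} a_i y_i x_i = 0$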

Support Vectors
- $x_i$ with $a_i > 0$ are called support vectors (SV)
- All other training points have $a_i = 0$
- $w$ is determined only by the SV

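Allowing errors
- The margin is bounded by the hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$ around $w^T x + b = 0$
- We would now like to minimize the standard soft-margin objective $\frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i$ s.t. $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$

Allowing errors
As before we get: $w = \sum_i a_i y_i x_i$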

SVM – Classification phase
- Compute $w^T q + b$ for the query $q$
- Classify as class 1 if positive, and class 2 otherwise
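The classification phase can be sketched as below, assuming the usual expansion w = Σ aᵢ yᵢ xᵢ over the support vectors. The two support vectors, the multipliers `alphas`, and `b` form a hypothetical, hand-made trained model, not values from the presentation.

```python
def weight_vector(support_vectors, alphas, ys):
    """w = sum_i a_i * y_i * x_i over the support vectors."""
    dim = len(support_vectors[0])
    return [sum(a * y * x[d] for x, a, y in zip(support_vectors, alphas, ys))
            for d in range(dim)]

def classify(q, w, b):
    """Class 1 if w^T q + b is positive, class 2 otherwise."""
    value = sum(wi * qi for wi, qi in zip(w, q)) + b
    return 1 if value > 0 else 2

# Hypothetical trained model: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

w = weight_vector(support_vectors, alphas, ys)
print(w, classify((2.0, 0.0), w, b), classify((-3.0, 1.0), w, b))
```

Only the support vectors enter the sum, which is why classification cost depends on the number of SV rather than on the full training set.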

Upgrade SVM
- We only need to calculate inner products
- In order to find $a_1, \ldots, a_n$ we need to calculate $x_i^T x_j$ for all $i, j$
- In order to classify a query $q$ we need to calculate: $w^T q + b = \sum_i a_i y_i x_i^T q + b$
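For reference, the standard dual form of the hard-margin problem (as presented in tutorials such as Burges') makes the dependence on inner products explicit. It is given here as background and is not reproduced from the slides:

$$
\max_{a} \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j \, x_i^T x_j
\quad \text{s.t.} \quad a_i \ge 0, \;\; \sum_{i=1}^{n} a_i y_i = 0
$$

The training data appear only through the products $x_i^T x_j$, so replacing that product with another similarity measure changes the classifier without changing the optimization machinery.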

Feature Expansion
- $\phi(\cdot)$ maps the input space to an extended space, for example $(x, y) \to (1, x, y, xy, x^2, y^2)$
- Problem: too expensive!

Solution: The Kernel Trick
- We only need to calculate inner products
- Find a kernel function $K$ such that: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

The Kernel Trick
- We only need to calculate inner products
- In order to find $a_1, \ldots, a_n$ we need to calculate $x_i^T x_j$ for all $i, j$: build a kernel matrix $M_{n \times n}$ with $M[i,j] = \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$
- In order to classify a query $q$ we need to calculate $w^T q + b$: $\sum_i a_i y_i K(x_i, q) + b$
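The sketch below checks the kernel trick on the slide's quadratic expansion. One assumption is added: the expanded features carry √2 weights, (1, √2·x, √2·y, √2·xy, x², y²), because with those weights the inner product in the extended space equals the polynomial kernel K(p, q) = (1 + p·q)². The sample points are arbitrary.

```python
import math

def phi(p):
    """Explicit expansion of (x, y); the sqrt(2) weights are an assumption
    chosen so that phi(p) . phi(q) equals (1 + p . q)^2."""
    x, y = p
    r = math.sqrt(2.0)
    return (1.0, r * x, r * y, r * x * y, x * x, y * y)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(p, q):
    """K(p, q) = (1 + p . q)^2, computed directly in the 2-D input space."""
    return (1.0 + dot(p, q)) ** 2

def kernel_matrix(points, kernel):
    """M[i][j] = K(x_i, x_j) for all pairs of training points."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

points = [(1.0, 2.0), (0.5, -1.0), (-2.0, 0.0)]
M = kernel_matrix(points, poly_kernel)
explicit = kernel_matrix(points, lambda p, q: dot(phi(p), phi(q)))
print(M[0])
```

Both matrices agree up to rounding, but `poly_kernel` never builds the six-dimensional vectors.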

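Inner product → Distance Function
- We only need to calculate inner products
- In our case: convert to a distance function
- Parallelogram law: $\lVert u+v \rVert^2 + \lVert u-v \rVert^2 = 2\lVert u \rVert^2 + 2\lVert v \rVert^2$
- $\langle x, y \rangle = \frac{1}{2} \left( \langle x, x \rangle + \langle y, y \rangle - \langle x-y, x-y \rangle \right) = \frac{1}{2} \left( d(x, 0) + d(y, 0) - d(x, y) \right)$, where $d(x, 0)$ and $d(y, 0)$ are distances from the "origin", $d(x, y)$ is the pairwise distance, and $d$ plays the role of a squared distance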

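Inner product → Distance Function
- Use the fact that we only need to calculate inner products
- In order to find $a_1, \ldots, a_n$ we need to calculate $x_i^T x_j$ for all $i, j$: build a distance matrix $D_{n \times n}$ with $D[i,j] = x_i^T x_j = \frac{1}{2} \left[ d(x_i, 0) + d(x_j, 0) - d(x_i, x_j) \right]$
- In order to classify a query $q$ we need to calculate $w^T q + b$: $\sum_i a_i y_i \cdot \frac{1}{2} \left[ d(x_i, 0) + d(q, 0) - d(x_i, q) \right] + b$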
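A short sketch of the conversion, assuming `d` is a squared Euclidean distance and that the origin is the zero vector. With those assumptions the formula reproduces the ordinary dot product exactly; with other distance functions it defines the inner product instead.

```python
def squared_distance(u, v):
    """Squared Euclidean distance, standing in for the distance function d."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def inner_product_from_distances(xi, xj, d=squared_distance):
    """<xi, xj> = 1/2 * [d(xi, 0) + d(xj, 0) - d(xi, xj)]."""
    origin = [0.0] * len(xi)
    return 0.5 * (d(xi, origin) + d(xj, origin) - d(xi, xj))

def distance_gram_matrix(points, d=squared_distance):
    """D[i][j] built only from distance evaluations."""
    return [[inner_product_from_distances(xi, xj, d) for xj in points]
            for xi in points]

points = [(1, 2), (3, -1), (0, 4)]
print(distance_gram_matrix(points))
```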

SVM Pros and Cons
Pros:
- Easy to integrate different distance functions
- Fast classification of new objects (depends on the number of SV)
- Good performance even with a small set of examples

Cons:
- Slow training ($O(n^2)$, n = # of vectors in the training set)
- Separates only 2 classes

Note: the first drawback "disappears" when dealing with a small set of examples.


Multiclass SVM
- Extend SVM to multi-class separation
- Nc = number of classes

Two approaches
- Combine multiple binary classifiers: 1-vs-rest, 1-vs-1, DAGSVM
- Generate one function based on a single optimization problem

1-vs-rest
- Nc classifiers $w_1, \ldots, w_{N_c}$, each separating one class from the rest
- For a query $q$: $w_i^T q + b_i \sim \text{Similarity}(q, SV_i)$
- $\text{Label}(q) = \arg\max_{1 \le i \le N_c} \{ \text{Sim}(q, SV_i) \}$

1-vs-rest
- After training we'll have Nc decision functions: $f_i(x) = w_i^T x + b_i$
- The class of a query object $q$ is determined by: $\arg\max_{1 \le i \le N_c} \{ w_i^T q + b_i \}$
- Pros: only Nc classifiers to be trained and tested
- Cons: every classifier uses all vectors for training; no bound on generalization error
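The 1-vs-rest decision rule can be sketched as an argmax over the Nc decision values. The four `(w, b)` pairs below are hypothetical hand-picked hyperplanes, not trained classifiers.

```python
def decision_value(w, b, q):
    """f_i(q) = w_i^T q + b_i."""
    return sum(wi * qi for wi, qi in zip(w, q)) + b

def one_vs_rest_label(q, classifiers):
    """classifiers maps each class label to its (w_i, b_i); return the argmax class."""
    return max(classifiers, key=lambda c: decision_value(*classifiers[c], q))

# Hypothetical classifiers for 4 classes in a 2-D input space.
classifiers = {
    1: ([1.0, 0.0], 0.0),
    2: ([-1.0, 0.0], 0.0),
    3: ([0.0, 1.0], 0.0),
    4: ([0.0, -1.0], 0.0),
}
print(one_vs_rest_label((0.2, 3.0), classifiers))
print(one_vs_rest_label((-2.0, 0.5), classifiers))
```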

1-vs-rest Complexity
- For training: Nc classifiers, each using n vectors for finding the hyperplane: $O(D \cdot N_c \cdot n^2)$
- For classifying new objects: Nc classifiers, each tested once, M = max number of SV: $O(D \cdot M \cdot N_c)$

1-vs-1
- Nc(Nc-1)/2 classifiers, one per pair of classes: $w_{1,2}, w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}, w_{3,4}$

1-vs-1 with Max Wins
- For a query $q$, every pairwise classifier casts one vote, for example $\text{sign}(w_{1,2}^T q + b_{1,2})$ ~ 1 or 2?
- The other pairs answer 1 or 3?, 1 or 4?, 2 or 3?, 2 or 4?, 3 or 4?
- The class that collects the most votes (☺) wins

1-vs-1 with Max Wins
- After training we'll have Nc(Nc-1)/2 decision functions: $f_{ij}(x) = \text{sign}(w_{ij}^T x + b_{ij})$
- The class of a query object x is determined by max-votes
- Pros: every classifier uses a small set of vectors for training
- Cons: Nc(Nc-1)/2 classifiers to be trained and tested; no bound on generalization error
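Max Wins voting can be sketched as follows. To keep the example self-contained, each pairwise classifier is a hypothetical stand-in: the perpendicular bisector between two made-up class centers, rather than a trained SVM. Only the voting logic is meant to mirror the slide.

```python
from collections import Counter
from itertools import combinations

def bisector(ci, cj):
    """Stand-in pairwise classifier (not a trained SVM): the perpendicular
    bisector of two class centers. A positive value means class i."""
    w = [a - b for a, b in zip(ci, cj)]
    b = 0.5 * (sum(x * x for x in cj) - sum(x * x for x in ci))
    return w, b

def max_wins_label(q, pairwise):
    """Every f_ij votes for i (positive) or j (otherwise); most votes wins."""
    votes = Counter()
    for (i, j), (w, b) in pairwise.items():
        value = sum(wk * qk for wk, qk in zip(w, q)) + b
        votes[i if value > 0 else j] += 1
    return votes.most_common(1)[0][0]

centers = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (0.0, 4.0), 4: (4.0, 4.0)}
pairwise = {(i, j): bisector(centers[i], centers[j])
            for i, j in combinations(sorted(centers), 2)}
print(len(pairwise), max_wins_label((3.0, 0.5), pairwise))
```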

1-vs-1 Complexity
- Assume that every class contains ~n/Nc instances
- For training: Nc(Nc-1)/2 classifiers, each using ~2n/Nc vectors: $O(D \cdot N_c^2 \cdot (2n/N_c)^2) = O(D \cdot n^2)$
- For classifying new objects: Nc(Nc-1)/2 classifiers, each tested once, M as before: $O(D \cdot M \cdot N_c^2)$

What did we have so far?

| | 1-vs-1 | 1-vs-rest |
|---|---|---|
| # of classifiers (each needs to be trained and tested) | Nc(Nc-1)/2 | Nc |
| # of vectors for training (per classifier) | ~2n/Nc | n (all vectors) |
| Bound on generalization error | No | No |

Note: training on a small number of examples is an advantage in terms of complexity, but can be a disadvantage in terms of performance.

DAGSVM
- Uses the 1-vs-1 classifiers $w_{1,2}, w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}, w_{3,4}$, arranged in a Decision DAG (DDAG)
- Each node tests two classes and eliminates one of them (not 1, not 2, not 3, not 4)

J. C. Platt et al., Large margin DAGs for multiclass classification. NIPS, 1999.

DDAG on Nc Classes
- A DAG with a single root node
- Nc(Nc-1)/2 internal nodes, each holding a binary decision function (for 4 classes: 1 vs 4, 2 vs 4, 1 vs 3, 3 vs 4, 2 vs 3, 1 vs 2)
- Nc leaves, one per class

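Building the DDAG
- Start with the list of classes 1 2 3 4; the root tests the first class against the last (1 vs 4)
- "not 1" leaves the list 2 3 4 (next node: 2 vs 4); "not 4" leaves the list 1 2 3 (next node: 1 vs 3)
- The last level holds 3 vs 4, 2 vs 3 and 1 vs 2, leading to the leaves 4, 3, 2, 1
- Changing the list order does not affect the results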

Classification using DDAG
- The query $q$ is first tested at the root (1 vs 4) with $w_{1,4}$: ~ 1 or 4?
- Each answer removes one class ("not 1" or "not 4") and moves to the next node (2 vs 4 or 1 vs 3), until a single class remains
- Assuming the classes are separable and the resulting margins are indeed large, it makes sense to "discard" at each step the class that was not chosen
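DDAG evaluation can be sketched as list elimination: test the first class in the list against the last, drop the loser, and repeat. As a simplification, the pairwise classifiers below are hypothetical perpendicular bisectors between made-up class centers rather than trained SVMs; the traversal is the part that follows the slides.

```python
from itertools import combinations

def bisector(ci, cj):
    """Stand-in pairwise classifier (not a trained SVM); positive means class i."""
    w = [a - b for a, b in zip(ci, cj)]
    b = 0.5 * (sum(x * x for x in cj) - sum(x * x for x in ci))
    return w, b

def ddag_label(q, pairwise, classes):
    """Walk the DDAG; return (label, number of classifiers evaluated)."""
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]      # node "i vs j"
        w, b = pairwise[(i, j)]
        value = sum(wk * qk for wk, qk in zip(w, q)) + b
        evaluations += 1
        if value > 0:
            remaining.pop()                      # "not j"
        else:
            remaining.pop(0)                     # "not i"
    return remaining[0], evaluations

centers = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (0.0, 4.0), 4: (4.0, 4.0)}
pairwise = {(i, j): bisector(centers[i], centers[j])
            for i, j in combinations(sorted(centers), 2)}
print(ddag_label((3.0, 0.5), pairwise, [1, 2, 3, 4]))
```

For Nc classes the loop always runs Nc-1 times, which is where the classification saving over Max Wins comes from.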

DAGSVM
Pros:
- Only Nc-1 classifiers to be tested
- Every classifier uses a small set of vectors for training
- Bound on generalization error (~margin size)

Cons:
- Fewer vectors for training ⇒ worse classifier?
- Nc(Nc-1)/2 classifiers to be trained

DAGSVM Complexity
- Assume that every class contains ~n/Nc instances
- For training: Nc(Nc-1)/2 classifiers, each using ~2n/Nc vectors: $O(D \cdot n^2)$
- For classifying new objects: Nc-1 classifiers, each tested once, M = max number of SV: $O(D \cdot M \cdot N_c)$

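Multiclass SVM

| | DAGSVM | 1-vs-1 | 1-vs-rest |
|---|---|---|---|
| # of classifiers | Nc(Nc-1)/2 | Nc(Nc-1)/2 | Nc |
| Training complexity | $O(D \cdot n^2)$ | $O(D \cdot n^2)$ | $O(D \cdot N_c \cdot n^2)$ |
| Classification complexity | $O(M_2 \cdot N_c)$ | $O(M_2 \cdot N_c^2)$ | $O(M_1 \cdot N_c)$ |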
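To make the classifier counts concrete, the small helper below tabulates how many binary classifiers each scheme trains and how many it evaluates per query, following the counts stated on the earlier slides. The function and key names are illustrative.

```python
def classifier_counts(nc):
    """Binary classifiers trained, and evaluated per query, for nc classes."""
    pairs = nc * (nc - 1) // 2
    return {
        "1-vs-rest": {"trained": nc, "evaluated_per_query": nc},
        "1-vs-1": {"trained": pairs, "evaluated_per_query": pairs},
        "DAGSVM": {"trained": pairs, "evaluated_per_query": nc - 1},
    }

for name, counts in classifier_counts(4).items():
    print(name, counts)
```

With 4 classes the difference is small (6 vs. 3 evaluations), but it grows quadratically: for the 102 categories of Caltech-101, 1-vs-1 evaluates 5151 classifiers per query while the DDAG evaluates 101.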

Multiclass SVM comparison
[Figure: two panels, Classification and Training]

Multiclass SVM - Summary
- Training: 1-vs-rest $O(D \cdot N_c \cdot n^2)$; DAGSVM / 1-vs-1 $O(D \cdot n^2)$
- Classification: 1-vs-1 $O(D \cdot M \cdot N_c^2)$; DAGSVM / 1-vs-rest $O(D \cdot M \cdot N_c)$
- Error rates: bound on generalization error only for DAGSVM
- In practice: 1-vs-1 and DAGSVM
- The "one big optimization" methods: similar error rates; very slow training, limited to small data sets

So what do we have?

Nearest Neighbor (KNN):
- Fast
- Suitable for multi-class
- Easy to integrate different distance functions
- Problematic with few samples

SVM:
- Good performance even with a small set of examples
- No natural extension to multi-class
- Slow to train

SVM KNN - From coarse to fine
Suggestion → hybrid system: KNN followed by SVM

Zhang et al, SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition, 2006


SVM KNN – General Algorithm
1. Calculate distance from query to training images
2. Pick K nearest neighbors
3. Run SVM (SVM works well with few samples)
4. Label! (in the illustrated example with classes 1 to 3: query image → Class 2)

Training + Classification
- Classic process: Training → Classification
- SVM-KNN: Coarse classification (KNN) → Final classification (SVM)

Details Details Details
- Calculating distance is a heavy task
- Compute a crude distance (L2), which is faster, to find K_potential images; ignore all other images
- Compute the accurate distance only relative to the K_potential images
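The coarse-to-fine shortlist can be sketched with two pluggable distance functions. Both distances below are toy stand-ins chosen for illustration: `crude` looks at the first coordinate only and `accurate` is the full Euclidean distance. In the presentation the crude distance is L2 and the accurate one is a shape or texture distance.

```python
import math

def coarse_to_fine_neighbors(query, train, crude, accurate, k_potential, k):
    """Shortlist k_potential candidates by the crude distance, then keep the
    k nearest among them according to the accurate distance."""
    candidates = sorted(range(len(train)),
                        key=lambda i: crude(query, train[i]))[:k_potential]
    # The accurate distance is evaluated on the shortlist only.
    return sorted(candidates, key=lambda i: accurate(query, train[i]))[:k]

def crude(u, v):
    """Toy cheap distance (assumption): first coordinate only."""
    return abs(u[0] - v[0])

accurate = math.dist  # toy "expensive" distance (assumption)

train = [(0, 0), (0, 9), (1, 1), (5, 5), (0.5, 0.2)]
print(coarse_to_fine_neighbors((0, 0.1), train, crude, accurate,
                               k_potential=3, k=2))
```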

Details Details Details
Complexity: crude distance (L2) and accurate distance (restricted to the K_potential images)

Details Details Details
If all K neighbors are from the same class ⇒ done
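The overall control flow, including the same-class shortcut, can be sketched as below. The local classifier is left as a pluggable callable; the `nearest_mean_trainer` used in the example is explicitly not an SVM, only a stand-in so that the sketch runs without an SVM library. All names and the toy data are assumptions.

```python
import math

def svm_knn_label(query, train, labels, k, dist, train_local_classifier):
    """KNN shortlist, early exit on agreement, otherwise a local classifier."""
    nearest = sorted(range(len(train)), key=lambda i: dist(query, train[i]))[:k]
    neighbor_labels = {labels[i] for i in nearest}
    if len(neighbor_labels) == 1:
        return neighbor_labels.pop()             # all K agree: done
    classifier = train_local_classifier([train[i] for i in nearest],
                                        [labels[i] for i in nearest])
    return classifier(query)

def nearest_mean_trainer(samples, sample_labels):
    """Stand-in for the local SVM (NOT an SVM): nearest class mean."""
    groups = {}
    for s, y in zip(samples, sample_labels):
        groups.setdefault(y, []).append(s)
    means = {y: [sum(c) / len(pts) for c in zip(*pts)]
             for y, pts in groups.items()}
    return lambda q: min(means, key=lambda y: math.dist(q, means[y]))

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(svm_knn_label((0.2, 0.2), train, labels, 3, math.dist, nearest_mean_trainer))
print(svm_knn_label((3.4, 3.4), train, labels, 4, math.dist, nearest_mean_trainer))
```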

Details Details Details
- Construct the pairwise inner product matrix
- Improvement: cache distance calculations

Details Details Details
Selected SVM: DAGSVM (faster), with nodes 1 vs 4, 2 vs 4, 1 vs 3, 3 vs 4, 2 vs 3, 1 vs 2

Complexity
- Total complexity
- DAGSVM training complexity

SVM KNN – continuum
- Defining an SVM-KNN continuum from NN to SVM: a small K behaves like NN, and K = n (# of images) is an SVM over the whole training set
- More than a majority vote (MAJ)
- Biological motivation → human visual system

SVM KNN Summary
- Similarity to prototypes
- Combining advantages from both methods:
  - NN: fast, suitable for multiclass
  - SVM: performs well with few samples and classes
- Compatible with many types of distance functions
- Biological motivation: human visual system, discriminative process


Distance functions
- D(query image, training image) = ??
- Two kinds of distance: shape and texture

Understanding the need - Shape
- Well, which is it??
- Capturing the shape:
  - Distance 1: Shape context
  - Distance 2: Tangent distance

Distance 1: Shape context
- Find point correspondences between prototype and query
- Estimate transformation
- Distance: correspondence quality and transformation quality

Belongie et al., Shape matching and object recognition using shape contexts, IEEE Trans. (2002)

Find correspondences
- Detector: use edge points
- Descriptor: create a "landscape" of the relationship to other edge points, as a histogram of orientations and distances (each bin holds a count, such as Count = 5 or Count = 6)
- Matching → compare the histograms of prototype and query
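One common way to compare two histograms is the χ² statistic, which is the cost used in the shape context paper by Belongie et al. Whether the slide showed exactly this formula is an assumption, since the symbol did not survive in the transcript. A minimal sketch:

```python
def chi2_distance(h, g):
    """0.5 * sum_k (h_k - g_k)^2 / (h_k + g_k), skipping bins empty in both."""
    if len(h) != len(g):
        raise ValueError("histograms must have the same number of bins")
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h, g) if a + b > 0)

# Toy normalized histograms with three bins.
prototype = [0.5, 0.5, 0.0]
query = [0.5, 0.0, 0.5]
print(chi2_distance(prototype, query))
print(chi2_distance(prototype, prototype))
```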

Distance 1: Shape context
- Find point correspondences between prototype and query
- Estimate transformation
- Distance: correspondence quality and transformation (quality, magnitude)

MNIST – Digit DB
- 70,000 handwritten digits
- Each image 28x28

MNIST results Human error rate – 0.2% Better methods exist < 1%

Distance 2: Tangent distance
- The distance includes invariance to small changes: small rotations, translations, thickening
- Example: taking the original image and allowing small rotations

Simard et al., Transformation invariance in pattern recognition-tangent distance and tangent propagation. Neural Networks (1998)

Space induced by rotation
- A rotation function with parameter α (α = -2, -1, 0, 1, ...) traces a curve of dimension 1 in pixel space
- But this space might be nonlinear, therefore we actually look at a linear approximation

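Tangent distance – Visual intuition
- P is the prototype image and Q is the query image, both points in pixel space; S_P and S_Q are the curves of their transformed versions
- The Euclidean distance (L2) is measured directly between P and Q
- Desired distance: the distance between S_P and S_Q
- But calculating the distance between nonlinear curves can be difficult
- Solution: use a linear approximation, the tangent to each curve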

Tangent Distance - General
- For every image, create a surface allowing transformations: rotations, translations, thickness, etc.
- Find a linear approximation: the tangent plane (7 dimensions)
- Distance → calculate the distance between the linear planes
- Has efficient solutions
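As a deliberately simplified sketch, the code below computes a one-sided tangent distance with a single tangent vector: only the prototype P is allowed to move along its tangent, and the best step α has a closed form. This is an assumption-laden reduction of the method on the slides, which uses tangent planes on both sides and several transformations; the vectors are toy values rather than images.

```python
import math

def one_sided_tangent_distance(p, q, tangent):
    """min over alpha of ||p + alpha * tangent - q||, with the minimizing alpha.

    The minimizer is the projection of (q - p) onto the tangent direction.
    """
    diff = [qi - pi for pi, qi in zip(p, q)]
    tt = sum(t * t for t in tangent)
    alpha = sum(t * d for t, d in zip(tangent, diff)) / tt
    residual = [d - alpha * t for d, t in zip(diff, tangent)]
    return math.sqrt(sum(r * r for r in residual)), alpha

p = (0.0, 0.0)          # toy prototype
q = (3.0, 4.0)          # toy query
tangent = (1.0, 0.0)    # toy tangent direction at the prototype

print(math.dist(p, q))                              # plain L2 distance
print(one_sided_tangent_distance(p, q, tangent))    # (distance, alpha)
```

The tangent distance is never larger than the plain L2 distance, because α = 0 is always an admissible choice.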

USPS – digit DB
- 9298 handwritten digits taken from mail envelopes (US Postal Service)
- Each image 16x16

USPS results
- Human error rate: 2.5%
- For L2 (not optimal): DAGSVM has similar results
- For tangent distance: NN has similar results; DAGSVM is similar to SVM-KNN, but SVM-KNN is faster
- According to the paper on tangent distance, NN using tangent distance achieved a 2.5% error rate

Understanding Texture
Texture samples How to represent Texture??

Texture representation
- Represent a texture patch using responses to a filter bank
- Filter bank: 48 filters
- Motivation: V1

[Figure: texture patch with pixels P1, P2, P3, each mapped to a vector of 48 filter responses]

Introducing Textons
- Filter responses are points in 48-dimensional space; the points correspond to the pixels of one image
- A texture patch is spatially repeating, so the representation is redundant
- Select representative responses (K-means): textons!

T. Leung, J. Malik Representing and recognizing the visual appearance of materials using three-dimensional textons (2001)

Universal textons
- "Building blocks" for all textures
- Prototype textures → filter bank → filter responses in 48-dim space → textons T1, T2, T3, T4

Distance 3: Texton histograms
For a query texture:
- Create filter responses (query texture → filter bank → filter responses in 48-dim space)
- Build the texton histogram over T1, T2, T3, T4 (using universal textons)
- Distance → compare the query texton histogram with the prototype texton histogram
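Building a texton histogram can be sketched as nearest-texton assignment followed by counting. The example uses 2-dimensional toy responses and two toy textons instead of 48-dimensional filter responses, and assigns by Euclidean distance; these choices and all names are assumptions for illustration.

```python
import math

def texton_histogram(responses, textons):
    """Assign every filter-response vector to its nearest texton and return
    the normalized histogram of assignments."""
    if not responses:
        raise ValueError("need at least one filter response")
    counts = [0] * len(textons)
    for r in responses:
        nearest = min(range(len(textons)), key=lambda t: math.dist(r, textons[t]))
        counts[nearest] += 1
    return [c / len(responses) for c in counts]

textons = [(0.0, 0.0), (1.0, 1.0)]                            # toy T1, T2
responses = [(0.1, 0.0), (0.9, 1.0), (1.0, 1.2), (0.0, 0.2)]  # toy pixels
print(texton_histogram(responses, textons))
```

The resulting histogram for the query can then be compared with each prototype histogram using any histogram distance.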

CUReT – texture DB 61 textures Different view points
Different illuminations

CUReT Results (comparing texton histograms)

Caltech-101 DB
- 102 categories
- Variations in color, pose, illumination
- Distance function: combination of texture and shape
- 2 algorithms → Algo. A, Algo. B

Caltech-101 Results
- Correct rate (%) with 15 training images: 66% correct
- Algo. B: using only DAGSVM (no KNN)
- Still a long way to go…

Motivation – Human Visual System
- Large number of categories (~30,000)
- Discriminative process
- Small set of examples
- Invariance to transformation
- Similarity to prototype instead of features

Summary
- Popular methods: NN, SVM, DAGSVM (extension to multi-class SVM)
- The hybrid method, SVM KNN: motivated by human perception (??), improved complexity. Better methods exist?
- A taste of the distance: shape, texture
- Results: classification method, distance function

References
- H. Zhang, A. C. Berg, M. Maire and J. Malik. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. IEEE, Vol. 2, pages , 2006.
- P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. NIPS, pages , 2001.
- J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. NIPS, pages , 1999.
- C. Hsu and C. Lin. A comparison of methods for multiclass support vector machines. IEEE, Vol. 13, pages , 2002.
- T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Computer Vision, 43(1):29-44, 2001.
- P. Simard, Y. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition-tangent distance and tangent propagation. Neural Networks: Tricks of the Trade, pages , 1998.
- S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE, Vol. 24, pages , 2002.

Thank You!
