1 This whole paper is about...
1) Objective: we want to minimize the number of misclassified examples on the training data (the Minimum Classification Error objective: MCE)
2) Write down the objective as a differentiable function
3) Perform gradient descent on it

2 Some notation
* means transpose
g_i(.) := discriminant for class i, one of M classes
x := one of N observations
lambda_i := [w_i, w_0i], the trainable parameters for the ith discriminant
y := [x 1]*, the augmented observation, so the linear discriminant is g_i(y) = lambda_i* y = w_i* x + w_0i
Classification rule: assign x to the class i whose discriminant g_i(y) is largest
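To make the notation concrete, here is a minimal sketch in Python/NumPy (not from the paper; the dimensions and random parameters are made up) of the augmented vector y = [x 1], the linear discriminants g_i(y) = lambda_i* y, and the argmax classification rule:

```python
import numpy as np

M, D = 3, 4                        # M classes, D-dimensional observations (arbitrary toy sizes)
rng = np.random.default_rng(0)

# lambda_i = [w_i, w_0i]: one row of trainable parameters per class discriminant
Lambda = rng.normal(size=(M, D + 1))

def classify(x, Lambda):
    """Classification rule: pick the class whose discriminant g_i(y) = lambda_i* y is largest."""
    y = np.append(x, 1.0)          # augmented observation y = [x 1]
    g = Lambda @ y                 # g[i] = w_i* x + w_0i
    return int(np.argmax(g))

x = rng.normal(size=D)             # one observation
print("predicted class:", classify(x, Lambda))
```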

3 Other objectives besides MCE
Notation assumes two classes (M=2), but the argument holds for M>2. Such as:
- Minimum squared error (targets b_i > 0)
- Perceptron
- Selective Squared Distance
- Some others (e.g. SVM, large-margin perceptron)

4 Other objectives besides MCE
Notation assumes two classes (M=2), but the argument holds for M>2.
Linearly separable case:
- Minimum squared error (targets b_i > 0): converges to MCE (Ho-Kashyap procedure)
- Perceptron: converges to MCE
- Selective Squared Distance: converges to MCE
- Some others (e.g. SVM, large-margin perceptron): converge to MCE

5 Other objectives besides MCE
Notation assumes two classes (M=2), but the argument holds for M>2.
Non-linearly separable case:
- Minimum squared error (targets b_i > 0): converges, but not to MCE in general
- Perceptron: does not converge
- Selective Squared Distance: does not converge
- Some others (e.g. SVM, large-margin perceptron): converge, but not to MCE in general
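As an aside, here is a minimal sketch of the minimum-squared-error criterion listed above, assuming the standard two-class construction in which the class-0 rows are sign-flipped so every example must satisfy w* y_n = b_n with positive targets b_n; this is the textbook pseudoinverse solution, not code from the paper, and the toy data is made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-class toy data (M=2): class 0 around (-1,-1), class 1 around (+1,+1)
X0 = rng.normal(loc=-1.0, size=(20, 2))
X1 = rng.normal(loc=+1.0, size=(20, 2))

# Augment with a constant 1 and flip the sign of the class-0 rows, so the criterion
# becomes w* y_n = b_n with every target b_n > 0
Y = np.vstack([-np.hstack([X0, np.ones((20, 1))]),
                np.hstack([X1, np.ones((20, 1))])])
b = np.ones(len(Y))                # any positive targets work; 1 is the usual choice

# Minimum-squared-error solution: minimize ||Y w - b||^2 via the pseudoinverse
w = np.linalg.pinv(Y) @ b

# A correctly classified example satisfies w* y_n > 0 in this sign-flipped representation
errors = int(np.sum(Y @ w <= 0))
print("misclassified training examples:", errors)
```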

6 Paper's key contribution
Encode the classification rule in a differentiable function l(x; lambda) such that l = 0 when x is classified correctly and l = 1 when x is classified incorrectly (the per-example 0-1 loss function)

7 Paper's key contribution
Encode the classification rule in a differentiable function l(x; lambda) such that l = 0 when x is classified correctly and l = 1 when x is classified incorrectly (the per-example 0-1 loss function). Then we can optimize lambda by gradient descent to directly minimize this loss and (with a small extra step) the MCE on the training dataset

8 Paper's key contribution
Encode the classification rule in three steps:
1) Define a differentiable misclassification measure d_k(x; lambda) for class k on example x
2) Convert the misclassification measure into a smooth loss l_k(x; lambda) that is ~0 when x is correctly classified and ~1 when it is misclassified
3) Combine the per-class losses from step 2 into a single per-example 0-1 loss function
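A minimal numeric sketch (Python/NumPy) of these three steps. The softened-max misclassification measure and the sigmoid smoothing follow the general recipe described here, but the specific functional forms and the constants eta and gamma are illustrative choices rather than values taken from the paper:

```python
import numpy as np

def misclassification_measure(g, k, eta=2.0):
    """Step 1: a differentiable d_k: negative when class k's discriminant beats a
    soft maximum of its competitors, positive when it loses. eta controls how
    closely the soft maximum approximates a hard max (illustrative value)."""
    competitors = np.delete(g, k)
    soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
    return -g[k] + soft_max

def smooth_01_loss(d, gamma=4.0):
    """Step 2: squash d_k through a sigmoid so the loss is ~0 when x is correctly
    classified (d < 0) and ~1 when it is misclassified (d > 0). gamma sets the sharpness."""
    return 1.0 / (1.0 + np.exp(-gamma * d))

def example_loss(g, true_class):
    """Step 3: the per-example loss; with the true class known, combining the per-class
    losses with the class-membership indicator just selects the true class's loss."""
    return smooth_01_loss(misclassification_measure(g, true_class))

g = np.array([2.0, -1.0, 0.5])        # discriminant values g_i(x) for M=3 classes
print(example_loss(g, 0))             # near 0: class 0 wins, so x counts as correct
print(example_loss(g, 1))             # near 1: class 1 loses, so x counts as misclassified
```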

9 From example 0-1 loss to MCE on the training set
Empirical average cost: L(lambda) = (1/N) sum_n sum_i l_i(x_n; lambda) 1(x_n belongs to class i)
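A self-contained toy sketch of this empirical cost and of plain batch gradient descent on it (Python/NumPy). The paper actually uses a probabilistic per-sample descent rule, and everything here, the data, the constants, and the finite-difference gradient, is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
M, D, N = 2, 2, 100                             # toy sizes: M classes, D dimensions, N examples

# Toy training set: two Gaussian classes
X = np.vstack([rng.normal(-1.0, 1.0, size=(N // 2, D)),
               rng.normal(+1.0, 1.0, size=(N // 2, D))])
labels = np.array([0] * (N // 2) + [1] * (N // 2))
Y = np.hstack([X, np.ones((N, 1))])             # augmented observations y = [x 1]

def example_loss(Lambda, y, k, eta=2.0, gamma=4.0):
    """Smoothed per-example 0-1 loss for true class k (same construction as the previous sketch)."""
    g = Lambda @ y
    competitors = np.delete(g, k)
    d = -g[k] + np.log(np.mean(np.exp(eta * competitors))) / eta
    return 1.0 / (1.0 + np.exp(np.clip(-gamma * d, -60.0, 60.0)))   # clipped for numerical safety

def empirical_cost(Lambda):
    """L(lambda) = (1/N) sum_n l_{k_n}(x_n; lambda): the empirical average cost."""
    return np.mean([example_loss(Lambda, y, k) for y, k in zip(Y, labels)])

def numerical_gradient(f, Lambda, eps=1e-5):
    """Finite-difference gradient: crude, but enough for a sketch."""
    grad = np.zeros_like(Lambda)
    for idx in np.ndindex(*Lambda.shape):
        step = np.zeros_like(Lambda)
        step[idx] = eps
        grad[idx] = (f(Lambda + step) - f(Lambda - step)) / (2 * eps)
    return grad

Lambda = rng.normal(size=(M, D + 1))            # linear discriminants, one row per class
for _ in range(200):                            # plain batch gradient descent
    Lambda -= 0.5 * numerical_gradient(empirical_cost, Lambda)

predictions = np.argmax(Y @ Lambda.T, axis=1)
print("final empirical cost:", empirical_cost(Lambda))
print("training errors:", int(np.sum(predictions != labels)))
```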

10 Application: MCE multi-layer Perceptron
M outputs (classes), K inputs. Traditional objective: minimize the squared error between the network outputs and the class targets. Instead minimize L(lambda) (from the previous slide)

11 Application: MCE multi-layer Perceptron
Traditional objective: minimize the squared error. Instead minimize L(lambda) (from the previous slide).
The non-linearity on the output nodes is removed; error back-propagation on the internal nodes remains exactly the same.
Results? Crazy good on the Iris data (3 classes).
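A hypothetical sketch of the MCE multi-layer perceptron described on this slide (Python/NumPy, not the paper's code): the output layer is linear, the smoothed 0-1 loss replaces the squared error only in the error signal at the output nodes, and back-propagation through the hidden layer is the usual rule. Sizes, constants, and the toy data are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
K, H, M = 4, 8, 3                     # K inputs, H hidden units, M output classes (toy sizes)

# One sigmoid hidden layer; the output layer is LINEAR (its non-linearity is removed)
W1, b1 = rng.normal(scale=0.5, size=(H, K)), np.zeros(H)
W2, b2 = rng.normal(scale=0.5, size=(M, H)), np.zeros(M)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped for numerical safety

def mce_grads(x, k, eta=2.0, gamma=4.0):
    """One forward/backward pass with the smoothed 0-1 (MCE) loss at the output.
    Only the error signal at the output nodes differs from squared-error training;
    back-propagation through the internal (hidden) nodes is the usual rule."""
    # forward pass
    h = sigmoid(W1 @ x + b1)
    g = W2 @ h + b2                   # linear outputs play the role of the discriminants g_i(x)
    # smoothed 0-1 loss for the true class k
    mask = np.arange(M) != k
    e = np.exp(eta * g[mask])
    d = -g[k] + np.log(e.mean()) / eta
    loss = sigmoid(gamma * d)
    # backward pass: d(loss)/dg replaces the squared-error signal at the output nodes
    dg = np.empty(M)
    dg[k] = -1.0
    dg[mask] = e / e.sum()
    delta_out = gamma * loss * (1.0 - loss) * dg
    # standard back-propagation through the hidden layer (unchanged)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    return loss, (np.outer(delta_hid, x), delta_hid, np.outer(delta_out, h), delta_out)

# toy data: 20 examples per class, shifted per class so the task is learnable
labels = np.repeat(np.arange(M), 20)
X = rng.normal(size=(len(labels), K)) + labels[:, None]

for _ in range(50):                   # plain per-example gradient descent
    for x, k in zip(X, labels):
        _, (dW1, db1, dW2, db2) = mce_grads(x, k)
        W1 -= 0.2 * dW1; b1 -= 0.2 * db1
        W2 -= 0.2 * dW2; b2 -= 0.2 * db2

hidden = sigmoid(X @ W1.T + b1)
predictions = np.argmax(hidden @ W2.T + b2, axis=1)
print("training errors:", int(np.sum(predictions != labels)))
```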

12 More results: MCE also beats Perceptron and Min-Squared-Error on the Iris task, and on a 2-class problem with each class generated from a mixture of 2 Gaussians. It also beats improved variants of dynamic time warping on an isolated-word classification task (10-word vocab: b, c, d, e, g, p, t, v, z)

13 Future work idea
Hack quicknet to do MCE training of MLPs for tandem models.

14 The end

15 Min Squared Error

16 Discriminative Learning for Minimum Error Classification
Biing-Hwang Juang, Shigeru Katagiri Presented by Arthur Kantor

