1 This whole paper is about...
1) Objective: we want to minimize the number of misclassified examples on the training data (the Minimum Classification Error objective: MCE)
2) Write down the objective as a differentiable function
3) Perform gradient descent on it

2 Some notation
* means transpose
g_i(.) := discriminant for class i, one of M classes
x := one of N observations
lambda_i := [w_i, w_0i], the trainable parameters for the ith discriminant
y := [x 1]*, the augmented observation, so the linear discriminant is g_i(y) = lambda_i* y = w_i* x + w_0i
Classification rule: assign x to the class i whose discriminant g_i(y) is largest
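To make the notation concrete, here is a minimal sketch in Python/NumPy (not from the paper; the dimensions and random parameters are made up) of the augmented vector y = [x 1], the linear discriminants g_i(y) = lambda_i* y, and the argmax classification rule:

```python
import numpy as np

M, D = 3, 4                        # M classes, D-dimensional observations (arbitrary toy sizes)
rng = np.random.default_rng(0)

# lambda_i = [w_i, w_0i]: one row of trainable parameters per class discriminant
Lambda = rng.normal(size=(M, D + 1))

def classify(x, Lambda):
    """Classification rule: pick the class whose discriminant g_i(y) = lambda_i* y is largest."""
    y = np.append(x, 1.0)          # augmented observation y = [x 1]
    g = Lambda @ y                 # g[i] = w_i* x + w_0i
    return int(np.argmax(g))

x = rng.normal(size=D)             # one observation
print("predicted class:", classify(x, Lambda))
```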

3 Other objectives besides MCE
Notation assumes two classes (M=2), but the argument holds for M>2. Such as:
- Minimum squared error (targets b_i > 0)
- Perceptron
- Selective Squared Distance
- Some others (e.g. SVM, large-margin perceptron)

4 Other objectives besides MCE
Notation assumes two classes (M=2), but the argument holds for M>2.
Linearly separable case:
- Minimum squared error (targets b_i > 0): converges to MCE (Ho-Kashyap procedure)
- Perceptron: converges to MCE
- Selective Squared Distance: converges to MCE
- Some others (e.g. SVM, large-margin perceptron): converge to MCE

5 Other objectives besides MCE
Notation assumes two classes (M=2), but the argument holds for M>2.
Non-linearly separable case:
- Minimum squared error (targets b_i > 0): converges, but not to MCE in general
- Perceptron: does not converge
- Selective Squared Distance: does not converge
- Some others (e.g. SVM, large-margin perceptron): converge, but not to MCE in general
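As an aside, here is a minimal sketch of the minimum-squared-error criterion listed above, assuming the standard two-class construction in which the class-0 rows are sign-flipped so every example must satisfy w* y_n = b_n with positive targets b_n; this is the textbook pseudoinverse solution, not code from the paper, and the toy data is made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-class toy data (M=2): class 0 around (-1,-1), class 1 around (+1,+1)
X0 = rng.normal(loc=-1.0, size=(20, 2))
X1 = rng.normal(loc=+1.0, size=(20, 2))

# Augment with a constant 1 and flip the sign of the class-0 rows, so the criterion
# becomes w* y_n = b_n with every target b_n > 0
Y = np.vstack([-np.hstack([X0, np.ones((20, 1))]),
                np.hstack([X1, np.ones((20, 1))])])
b = np.ones(len(Y))                # any positive targets work; 1 is the usual choice

# Minimum-squared-error solution: minimize ||Y w - b||^2 via the pseudoinverse
w = np.linalg.pinv(Y) @ b

# A correctly classified example satisfies w* y_n > 0 in this sign-flipped representation
errors = int(np.sum(Y @ w <= 0))
print("misclassified training examples:", errors)
```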

6 Paper's key contribution
Encode the classification rule in a differentiable function l(x; lambda) such that l = 0 when x is classified correctly and l = 1 when x is classified incorrectly (the per-example 0-1 loss function)

7 Paper's key contribution
Encode the classification rule in a differentiable function l(x; lambda) such that l = 0 when x is classified correctly and l = 1 when x is classified incorrectly (the per-example 0-1 loss function). Then we can optimize lambda by gradient descent to directly minimize this loss and (with a small extra step) the MCE on the training dataset

8 Paper's key contribution
Encode the classification rule in three steps:
1) Define a differentiable misclassification measure d_k(x; lambda) for class k on example x
2) Convert the misclassification measure into a smooth loss l_k(x; lambda) that is ~0 when x is correctly classified and ~1 when it is misclassified
3) Combine the per-class losses from step 2 into a single per-example 0-1 loss function
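A minimal numeric sketch (Python/NumPy) of these three steps. The softened-max misclassification measure and the sigmoid smoothing follow the general recipe described here, but the specific functional forms and the constants eta and gamma are illustrative choices rather than values taken from the paper:

```python
import numpy as np

def misclassification_measure(g, k, eta=2.0):
    """Step 1: a differentiable d_k: negative when class k's discriminant beats a
    soft maximum of its competitors, positive when it loses. eta controls how
    closely the soft maximum approximates a hard max (illustrative value)."""
    competitors = np.delete(g, k)
    soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
    return -g[k] + soft_max

def smooth_01_loss(d, gamma=4.0):
    """Step 2: squash d_k through a sigmoid so the loss is ~0 when x is correctly
    classified (d < 0) and ~1 when it is misclassified (d > 0). gamma sets the sharpness."""
    return 1.0 / (1.0 + np.exp(-gamma * d))

def example_loss(g, true_class):
    """Step 3: the per-example loss; with the true class known, combining the per-class
    losses with the class-membership indicator just selects the true class's loss."""
    return smooth_01_loss(misclassification_measure(g, true_class))

g = np.array([2.0, -1.0, 0.5])        # discriminant values g_i(x) for M=3 classes
print(example_loss(g, 0))             # near 0: class 0 wins, so x counts as correct
print(example_loss(g, 1))             # near 1: class 1 loses, so x counts as misclassified
```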

9 From example 0-1 loss to MCE on the training set
Empirical average cost: L(lambda) = (1/N) sum_n sum_i l_i(x_n; lambda) 1(x_n belongs to class i)
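A self-contained toy sketch of this empirical cost and of plain batch gradient descent on it (Python/NumPy). The paper actually uses a probabilistic per-sample descent rule, and everything here, the data, the constants, and the finite-difference gradient, is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
M, D, N = 2, 2, 100                             # toy sizes: M classes, D dimensions, N examples

# Toy training set: two Gaussian classes
X = np.vstack([rng.normal(-1.0, 1.0, size=(N // 2, D)),
               rng.normal(+1.0, 1.0, size=(N // 2, D))])
labels = np.array([0] * (N // 2) + [1] * (N // 2))
Y = np.hstack([X, np.ones((N, 1))])             # augmented observations y = [x 1]

def example_loss(Lambda, y, k, eta=2.0, gamma=4.0):
    """Smoothed per-example 0-1 loss for true class k (same construction as the previous sketch)."""
    g = Lambda @ y
    competitors = np.delete(g, k)
    d = -g[k] + np.log(np.mean(np.exp(eta * competitors))) / eta
    return 1.0 / (1.0 + np.exp(np.clip(-gamma * d, -60.0, 60.0)))   # clipped for numerical safety

def empirical_cost(Lambda):
    """L(lambda) = (1/N) sum_n l_{k_n}(x_n; lambda): the empirical average cost."""
    return np.mean([example_loss(Lambda, y, k) for y, k in zip(Y, labels)])

def numerical_gradient(f, Lambda, eps=1e-5):
    """Finite-difference gradient: crude, but enough for a sketch."""
    grad = np.zeros_like(Lambda)
    for idx in np.ndindex(*Lambda.shape):
        step = np.zeros_like(Lambda)
        step[idx] = eps
        grad[idx] = (f(Lambda + step) - f(Lambda - step)) / (2 * eps)
    return grad

Lambda = rng.normal(size=(M, D + 1))            # linear discriminants, one row per class
for _ in range(200):                            # plain batch gradient descent
    Lambda -= 0.5 * numerical_gradient(empirical_cost, Lambda)

predictions = np.argmax(Y @ Lambda.T, axis=1)
print("final empirical cost:", empirical_cost(Lambda))
print("training errors:", int(np.sum(predictions != labels)))
```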

10 Application: MCE multi-layer Perceptron
M outputs (classes), K inputs. Traditional objective: minimize the squared error between the network outputs and the class targets. Instead minimize L(lambda) (from the previous slide)

11 Application: MCE multi-layer Perceptron
Traditional objective: minimize the squared error. Instead minimize L(lambda) (from the previous slide).
The non-linearity on the output nodes is removed; error back-propagation on the internal nodes remains exactly the same.
Results? Crazy good on the Iris data (3 classes).
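A hypothetical sketch of the MCE multi-layer perceptron described on this slide (Python/NumPy, not the paper's code): the output layer is linear, the smoothed 0-1 loss replaces the squared error only in the error signal at the output nodes, and back-propagation through the hidden layer is the usual rule. Sizes, constants, and the toy data are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
K, H, M = 4, 8, 3                     # K inputs, H hidden units, M output classes (toy sizes)

# One sigmoid hidden layer; the output layer is LINEAR (its non-linearity is removed)
W1, b1 = rng.normal(scale=0.5, size=(H, K)), np.zeros(H)
W2, b2 = rng.normal(scale=0.5, size=(M, H)), np.zeros(M)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped for numerical safety

def mce_grads(x, k, eta=2.0, gamma=4.0):
    """One forward/backward pass with the smoothed 0-1 (MCE) loss at the output.
    Only the error signal at the output nodes differs from squared-error training;
    back-propagation through the internal (hidden) nodes is the usual rule."""
    # forward pass
    h = sigmoid(W1 @ x + b1)
    g = W2 @ h + b2                   # linear outputs play the role of the discriminants g_i(x)
    # smoothed 0-1 loss for the true class k
    mask = np.arange(M) != k
    e = np.exp(eta * g[mask])
    d = -g[k] + np.log(e.mean()) / eta
    loss = sigmoid(gamma * d)
    # backward pass: d(loss)/dg replaces the squared-error signal at the output nodes
    dg = np.empty(M)
    dg[k] = -1.0
    dg[mask] = e / e.sum()
    delta_out = gamma * loss * (1.0 - loss) * dg
    # standard back-propagation through the hidden layer (unchanged)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    return loss, (np.outer(delta_hid, x), delta_hid, np.outer(delta_out, h), delta_out)

# toy data: 20 examples per class, shifted per class so the task is learnable
labels = np.repeat(np.arange(M), 20)
X = rng.normal(size=(len(labels), K)) + labels[:, None]

for _ in range(50):                   # plain per-example gradient descent
    for x, k in zip(X, labels):
        _, (dW1, db1, dW2, db2) = mce_grads(x, k)
        W1 -= 0.2 * dW1; b1 -= 0.2 * db1
        W2 -= 0.2 * dW2; b2 -= 0.2 * db2

hidden = sigmoid(X @ W1.T + b1)
predictions = np.argmax(hidden @ W2.T + b2, axis=1)
print("training errors:", int(np.sum(predictions != labels)))
```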

12 More results: MCE also beats Perceptron and Min-Squared-Error on the Iris task, and on a 2-class problem with each class generated from a mixture of 2 Gaussians. It also beats improved variants of dynamic time warping on an isolated-word classification task (10-word vocab: b, c, d, e, g, p, t, v, z)

13 Future work idea
Hack quicknet to do MCE training of MLPs for tandem models.

14 The end

15 Min Squared Error

16 Discriminative Learning for Minimum Error Classification
Biing-Hwang Juang, Shigeru Katagiri Presented by Arthur Kantor

