Kernel Technique Based on Mercer’s Condition (1909)


1 Kernel Technique Based on Mercer’s Condition (1909)
The value of a kernel function represents the inner product of two training points in feature space: $K(x, z) = \langle \phi(x), \phi(z) \rangle$. Kernel functions merge two steps:
1. Map the input data from input space to feature space (which might be infinite-dimensional).
2. Compute the inner product in the feature space.
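To make the merged steps concrete, here is a minimal numerical check (an illustration, not from the slides): for 2-D inputs, the homogeneous degree-2 polynomial kernel $K(x, z) = \langle x, z \rangle^2$ equals the inner product under the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    # Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, z))       # 1.0
print(np.dot(phi(x), phi(z)))  # 1.0 -- same value, without mapping explicitly
```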

2 More Examples of Kernel
Polynomial kernel: $K(x, z) = (\langle x, z \rangle + b)^d$, where $d$ is a positive integer (with $d = 1$ and $b = 0$ this is the linear kernel $\langle x, z \rangle$).
Gaussian (radial basis) kernel: $K(x, z) = e^{-\mu \|x - z\|_2^2}$.
The $ij$-entry of the kernel matrix $K(A, A')$ represents the "similarity" of data points $A_i$ and $A_j$.
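A small sketch of the Gaussian kernel's "similarity" interpretation (the value of $\mu$ and the data points are illustrative):

```python
import numpy as np

def gaussian_kernel_matrix(A, mu=0.1):
    # K[i, j] = exp(-mu * ||A_i - A_j||^2): pairwise similarity of rows of A.
    sq_dists = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-mu * sq_dists)

A = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
K = gaussian_kernel_matrix(A)
print(np.round(K, 3))  # diagonal is 1; nearby points score near 1, distant near 0
```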

3 Nonlinear 1-Norm Soft Margin SVM
In dual form:
Linear SVM: $\max_{\alpha}\; e'\alpha - \frac{1}{2}\alpha' D A A' D \alpha$ s.t. $e'D\alpha = 0$, $0 \le \alpha \le \nu e$.
Nonlinear SVM: replace the inner-product matrix $AA'$ by the kernel matrix $K(A, A')$: $\max_{\alpha}\; e'\alpha - \frac{1}{2}\alpha' D K(A, A') D \alpha$ s.t. $e'D\alpha = 0$, $0 \le \alpha \le \nu e$.

4 1-norm Support Vector Machines Good for Feature Selection
Solve the following mathematical program for some $\nu > 0$:
$\min_{w,b,\xi}\; \|w\|_1 + \nu\, e'\xi$ s.t. $D(Aw + eb) + \xi \ge e$, $\xi \ge 0$,
where the diagonal matrix $D$ with $D_{ii} = \pm 1$ denotes the class membership of each point $A_i$. Minimizing the 1-norm $\|w\|_1$ drives many components of $w$ to zero, which is why this formulation is good for feature selection. It is equivalent to solving a linear program:
$\min_{w,s,b,\xi}\; e's + \nu\, e'\xi$ s.t. $D(Aw + eb) + \xi \ge e$, $-s \le w \le s$, $\xi \ge 0$.
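A minimal sketch of this LP with scipy.optimize.linprog, assuming the standard formulation above (the variable ordering and the tiny dataset are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def svm_1norm(A, y, nu=1.0):
    """1-norm soft-margin SVM as an LP.

    min  e's + nu * e'xi
    s.t. D(Aw + e b) + xi >= e,  -s <= w <= s,  xi >= 0
    Variable order: [w (n), s (n), b (1), xi (m)].
    """
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n), [0.0], nu * np.ones(m)])

    DA = y[:, None] * A  # row i of A scaled by its label y_i
    # Margin constraints rewritten as -D(Aw + e b) - xi <= -e
    row1 = np.hstack([-DA, np.zeros((m, n)), -y[:, None], -np.eye(m)])
    # w - s <= 0 and -w - s <= 0 encode |w_j| <= s_j
    row2 = np.hstack([np.eye(n), -np.eye(n), np.zeros((n, 1)), np.zeros((n, m))])
    row3 = np.hstack([-np.eye(n), -np.eye(n), np.zeros((n, 1)), np.zeros((n, m))])
    A_ub = np.vstack([row1, row2, row3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])

    bounds = [(None, None)] * n + [(0, None)] * n + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b = res.x[:n], res.x[2 * n]
    return w, b

# Tiny separable example; entries of w often come out exactly zero,
# which is the feature-selection effect of the 1-norm.
A = np.array([[2.0, 1.0, 0.3], [1.5, 2.0, -0.1], [-1.0, -1.5, 0.2], [-2.0, -1.0, -0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_1norm(A, y)
print(np.round(w, 3), round(b, 3))
```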

5 SVM as an Unconstrained Minimization Problem
(QP): $\min_{w,b,\xi}\; \frac{\nu}{2}\,\xi'\xi + \frac{1}{2}(w'w + b^2)$ s.t. $D(Aw + eb) + \xi \ge e$.
At the solution of (QP): $\xi = (e - D(Aw + eb))_+$, where $(\cdot)_+$ replaces negative components by zero. Hence (QP) is equivalent to the nonsmooth SVM:
$\min_{w,b}\; \frac{\nu}{2}\,\|(e - D(Aw + eb))_+\|_2^2 + \frac{1}{2}(w'w + b^2)$.
This changes (QP) into an unconstrained minimization problem and reduces $(n+1+m)$ variables to $(n+1)$ variables.

6 Smooth the Plus Function: Integrate
Plus function: $(x)_+ = \max\{x, 0\}$.
Step function: $x_* = 1$ if $x > 0$, $0$ otherwise.
Sigmoid function: $s(x, \alpha) = \frac{1}{1 + e^{-\alpha x}}$, a smooth approximation of the step function.
p-function: $p(x, \alpha) = x + \frac{1}{\alpha}\log(1 + e^{-\alpha x})$, the integral of the sigmoid and a smooth approximation of the plus function.
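A quick numerical check (illustrative) that $p(x, \alpha)$ approaches the plus function as $\alpha$ grows; the maximum gap is $\log 2 / \alpha$, attained at $x = 0$:

```python
import numpy as np

def plus(x):
    return np.maximum(x, 0.0)

def p(x, alpha):
    # Integral of the sigmoid: smooth approximation of the plus function.
    # np.logaddexp(0, -alpha*x) = log(1 + exp(-alpha*x)), computed stably.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

xs = np.linspace(-2, 2, 5)
for alpha in (1.0, 5.0, 25.0):
    print(alpha, np.max(np.abs(p(xs, alpha) - plus(xs))))  # gap shrinks as alpha grows
```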

7 SSVM: Smooth Support Vector Machine
Replacing the plus function in the nonsmooth SVM by the smooth $p(\cdot, \alpha)$ gives our SSVM:
$\min_{w,b}\; \frac{\nu}{2}\,\|p(e - D(Aw + eb), \alpha)\|_2^2 + \frac{1}{2}(w'w + b^2)$.
The solution of the SSVM converges to the solution of the nonsmooth SVM as $\alpha$ goes to infinity. (Typically, $\alpha = 5$.)

8 Newton-Armijo Method: Quadratic Approximation of SSVM
The sequence generated by solving a quadratic approximation of the SSVM converges to the unique solution of the SSVM at a quadratic rate.
- Converges in 6 to 8 iterations.
- At each iteration we solve a linear system of n+1 equations in n+1 variables.
- Complexity depends on the dimension of the input space.
- A stepsize might need to be selected.

9 Newton-Armijo Algorithm
Start with any $(w^0, b^0)$. Having $(w^i, b^i)$, stop if $\nabla \Phi(w^i, b^i) = 0$; else:
(i) Newton direction: solve $\nabla^2 \Phi(w^i, b^i)\, d^i = -\nabla \Phi(w^i, b^i)$ for $d^i$.
(ii) Armijo stepsize: $(w^{i+1}, b^{i+1}) = (w^i, b^i) + \lambda_i d^i$, with $\lambda_i \in \{1, \frac{1}{2}, \frac{1}{4}, \dots\}$ chosen such that Armijo's rule (sufficient decrease of $\Phi$) is satisfied.
The iterates converge globally and quadratically to the unique solution in a finite number of steps.
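A compact sketch of Newton-Armijo on the linear SSVM objective, assuming the formulation from the previous slides (the gradient and Hessian are derived from that objective; the tiny dataset is illustrative):

```python
import numpy as np

def ssvm_newton_armijo(A, y, nu=1.0, alpha=5.0, tol=1e-8, max_iter=50):
    """Newton-Armijo for the SSVM objective
       Phi(w, b) = (nu/2)*||p(e - D(Aw + e b), alpha)||^2 + (1/2)*(w'w + b^2)."""
    m, n = A.shape
    E = np.hstack([A, np.ones((m, 1))])   # augmented data: u = [w; b]
    DE = y[:, None] * E                   # D E, with D = diag(y)
    u = np.zeros(n + 1)

    def p(x):    # smooth plus function, stable log form
        return x + np.logaddexp(0.0, -alpha * x) / alpha

    def sig(x):  # p'(x): sigmoid of alpha*x
        return 1.0 / (1.0 + np.exp(-alpha * x))

    def phi(u):
        r = 1.0 - DE @ u
        return 0.5 * nu * np.sum(p(r) ** 2) + 0.5 * u @ u

    for _ in range(max_iter):
        r = 1.0 - DE @ u
        pr, sr = p(r), sig(r)
        grad = -nu * DE.T @ (pr * sr) + u
        if np.linalg.norm(grad) < tol:
            break
        # Hessian: nu * E' diag(s^2 + alpha*p*s*(1-s)) E + I  (uses D^2 = I)
        d_diag = sr**2 + alpha * pr * sr * (1.0 - sr)
        H = nu * (DE.T * d_diag) @ DE + np.eye(n + 1)
        d = np.linalg.solve(H, -grad)     # n+1 equations in n+1 variables
        lam, f0, slope = 1.0, phi(u), grad @ d
        while phi(u + lam * d) > f0 + 1e-4 * lam * slope:  # Armijo rule
            lam *= 0.5
        u = u + lam * d
    return u[:n], u[n]

A = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = ssvm_newton_armijo(A, y)
print(np.sign(A @ w + b))  # recovers the labels
```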

10 Nonlinear Smooth SVM
Nonlinear classifier: $f(x) = \operatorname{sign}\!\big(K(x', A')\,Du + b\big)$. Replace $Aw$ by a nonlinear kernel term $K(A, A')\,Du$:
$\min_{u,b}\; \frac{\nu}{2}\,\|p(e - D(K(A, A')Du + eb), \alpha)\|_2^2 + \frac{1}{2}(u'u + b^2)$.
- Use the Newton-Armijo algorithm to solve the problem.
- Each iteration solves m+1 linear equations in m+1 variables.
- The nonlinear classifier depends only on the data points with nonzero coefficients $u_i \neq 0$.

11 Conclusion
An overview of SVMs for classification.
SSVM: a new formulation of the support vector machine as a smooth unconstrained minimization problem.
- Can be solved by a fast Newton-Armijo algorithm.
- No optimization (LP, QP) package is needed.
There are many important issues this lecture did not address, such as:
- How to solve the conventional SVM?
- How to select parameters (e.g., $\nu$, $\alpha$, and the kernel parameter $\mu$)?
- How to deal with massive datasets?

12 Perceptron
Linear threshold unit (LTU): inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and a fixed input $x_0 = 1$ with weight $w_0$ feed the sum $\sum_{i=0}^{n} w_i x_i$ into a threshold function $g$:
$o(x) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise.} \end{cases}$

13 Possibilities for function g
Sign function: $\operatorname{sign}(x) = +1$ if $x > 0$, $-1$ if $x \le 0$.
Step function: $\operatorname{step}(x) = 1$ if $x > \text{threshold}$, $0$ if $x \le \text{threshold}$ (in the LTU above, threshold = 0).
Sigmoid (logistic) function: $\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x}}$.
Adding an extra input with activation $x_0 = 1$ and weight $w_0 = -T$ (called the bias weight) is equivalent to having a threshold at $T$. This way we can always assume a 0 threshold.
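A small sketch (illustrative weights and threshold) of the three activation functions and the bias-weight trick:

```python
import numpy as np

def sign_fn(x):
    return np.where(x > 0, 1, -1)

def step_fn(x, threshold=0.0):
    return np.where(x > threshold, 1, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bias trick: a unit with threshold T equals a 0-threshold unit
# with an extra input x0 = 1 carrying weight w0 = -T.
w, T, x = np.array([0.5, -0.3]), 0.2, np.array([1.0, 1.0])
lhs = step_fn(w @ x, threshold=T)
rhs = step_fn(np.concatenate([[-T], w]) @ np.concatenate([[1.0], x]))
print(lhs == rhs)  # True
```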

14 Using a Bias Weight to Standardize the Threshold
With the constant input $x_0 = 1$ carrying the bias weight $-T$, the threshold test $w_1 x_1 + w_2 x_2 < T$ becomes the zero-threshold test $w_1 x_1 + w_2 x_2 - T < 0$.

15 Perceptron Learning Rule
[Figure: a worked example of the perceptron learning rule. Misclassified points, e.g. $(x, t) = ([2, 1], -1)$ with output $o = 1$, $(x, t) = ([-1, -1], 1)$ with $o = -1$, and $(x, t) = ([1, 1], 1)$ with $o = -1$, each trigger a weight update; intermediate weight vectors such as $w = [0.2, -0.2, -0.2]$ and $w = [-0.2, -0.4, -0.2]$ shift the decision boundary (e.g. $x_2 = 0.2\,x_1 - 0.5$, and finally $-0.5\,x_1 + 0.3\,x_2 + 0.45 > 0 \Rightarrow o = 1$).]

16 The Perceptron Algorithm Rosenblatt, 1956
Given a linearly separable training set $S = \{(x_1, y_1), \dots, (x_\ell, y_\ell)\}$ and learning rate $\eta \in \mathbb{R}^+$, set the initial weight vector and bias to $w_0 = 0$, $b_0 = 0$, $k = 0$, and let $R = \max_{1 \le i \le \ell} \|x_i\|$.

17 The Perceptron Algorithm (Primal Form)
Repeat:
  for $i = 1$ to $\ell$:
    if $y_i(\langle w_k, x_i \rangle + b_k) \le 0$ then
      $w_{k+1} = w_k + \eta\, y_i x_i$
      $b_{k+1} = b_k + \eta\, y_i R^2$
      $k = k + 1$
until no mistakes are made within the for loop.
Return $(w_k, b_k)$. What is $k$?
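A direct sketch of this primal-form algorithm (the dataset and learning rate are illustrative; the $\eta\, y_i R^2$ bias update follows the pseudocode above):

```python
import numpy as np

def perceptron_primal(X, y, eta=0.1, max_epochs=100):
    """Primal-form perceptron with the bias updated by eta * y_i * R^2."""
    w = np.zeros(X.shape[1])
    b, k = 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:  # mistake: functional margin <= 0
                w = w + eta * yi * xi
                b = b + eta * yi * R ** 2
                k += 1
                mistakes += 1
        if mistakes == 0:               # a full pass with no mistakes: stop
            break
    return w, b, k

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b, k = perceptron_primal(X, y)
print(w, b, k, np.sign(X @ w + b))  # k is the number of mistakes made
```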

19 The Perceptron Algorithm
(STOP in finite steps.) Theorem (Novikoff): Let $S$ be a non-trivial training set, and let $R = \max_{1 \le i \le \ell} \|x_i\|$. Suppose that there exists a vector $w_{opt}$ with $\|w_{opt}\| = 1$ and a $\gamma > 0$ such that $y_i(\langle w_{opt}, x_i \rangle + b_{opt}) \ge \gamma$ for all $i$. Then the number of mistakes made by the on-line perceptron algorithm on $S$ is at most $\left(\frac{2R}{\gamma}\right)^2$.

20 Proof of Finite Termination
Proof: Let $\hat{x}_i = (x_i', R)'$ and $\hat{w} = (w', b/R)'$ be the augmented training points and weight vector. The algorithm starts with the augmented weight vector $\hat{w}_0 = 0$ and updates it at each mistake. Let $\hat{w}_{t-1}$ be the augmented weight vector prior to the $t$-th mistake. The $t$-th update is performed when $y_i \langle \hat{w}_{t-1}, \hat{x}_i \rangle \le 0$, where $(x_i, y_i)$ is the point incorrectly classified by $\hat{w}_{t-1}$.

21 Update Rule of Perceptron
The update is $\hat{w}_t = \hat{w}_{t-1} + \eta\, y_i \hat{x}_i$, so
$\langle \hat{w}_t, \hat{w}_{opt} \rangle = \langle \hat{w}_{t-1}, \hat{w}_{opt} \rangle + \eta\, y_i \langle \hat{x}_i, \hat{w}_{opt} \rangle \ge \langle \hat{w}_{t-1}, \hat{w}_{opt} \rangle + \eta\gamma$,
and by induction $\langle \hat{w}_t, \hat{w}_{opt} \rangle \ge t\,\eta\gamma$. Similarly,
$\|\hat{w}_t\|^2 = \|\hat{w}_{t-1}\|^2 + 2\eta\, y_i \langle \hat{w}_{t-1}, \hat{x}_i \rangle + \eta^2 \|\hat{x}_i\|^2 \le \|\hat{w}_{t-1}\|^2 + 2\eta^2 R^2$,
since the middle term is nonpositive at a mistake and $\|\hat{x}_i\|^2 \le 2R^2$; by induction $\|\hat{w}_t\|^2 \le 2t\,\eta^2 R^2$.

22 Update Rule of Perceptron (cont.)
Combining the two bounds with the Cauchy-Schwarz inequality:
$t\,\eta\gamma \le \langle \hat{w}_t, \hat{w}_{opt} \rangle \le \|\hat{w}_t\|\,\|\hat{w}_{opt}\| \le \sqrt{2t}\,\eta R\,\|\hat{w}_{opt}\|$.
Since $\|\hat{w}_{opt}\|^2 = 1 + b_{opt}^2/R^2 \le 2$ for a non-trivial training set, this gives
$t \le \frac{2R^2 \|\hat{w}_{opt}\|^2}{\gamma^2} \le \left(\frac{2R}{\gamma}\right)^2. \qquad \square$

23 The Perceptron Algorithm (Dual Form)
Given a linearly separable training set $S$, set $\alpha = 0$, $b = 0$, and $R = \max_{1 \le i \le \ell} \|x_i\|$.
Repeat:
  for $i = 1$ to $\ell$:
    if $y_i\big(\sum_{j=1}^{\ell} \alpha_j y_j \langle x_j, x_i \rangle + b\big) \le 0$ then
      $\alpha_i = \alpha_i + 1$
      $b = b + y_i R^2$
until no mistakes are made within the for loop.
Return $(\alpha, b)$.
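A sketch of the dual form (illustrative data); note that the inputs enter only through the Gram matrix, so a kernel matrix could be substituted for the plain inner products:

```python
import numpy as np

def perceptron_dual(X, y, max_epochs=100):
    """Dual-form perceptron. The data appear only through the Gram matrix
    G[i, j] = <x_i, x_j>, so a kernel function could replace it."""
    ell = X.shape[0]
    G = X @ X.T                               # Gram matrix
    alpha = np.zeros(ell)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(ell):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0               # alpha_i counts mistakes on x_i
                b += y[i] * R ** 2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha, b = perceptron_dual(X, y)
print(alpha, b)  # points with alpha_i = 0 were never misclassified
```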

24 What Do We Get from the Dual-Form Perceptron Algorithm?
The number of updates equals $\sum_{i=1}^{\ell} \alpha_i \le \left(\frac{2R}{\gamma}\right)^2$.
$\alpha_i > 0$ implies that the training point $(x_i, y_i)$ has been misclassified at least once during the training process.
$\alpha_i = 0$ implies that removing the training point $(x_i, y_i)$ will not affect the final result.
The training data only appear in the algorithm through the entries of the Gram matrix $G \in \mathbb{R}^{\ell \times \ell}$, defined by $G_{ij} = \langle x_i, x_j \rangle$.

25 Reuters-21578: 21,578 docs, 27,000 terms, 135 classes
The 21,578 documents are split into a training set and a testing set. Reuters-21578 includes 135 categories; using the ApteMod version of the TOPICS set results in 90 categories, with 7,770 training documents and 3,019 testing documents.

26 Preprocessing Procedures (cont.)
- After stopword elimination
- After the Porter stemming algorithm

27 Binary Text Classification earn(+) vs. acq(-)
- Select the top 500 terms using mutual information.
- Evaluate each classifier using the F-measure.
- Compare the two classifiers using a 10-fold paired t-test.

28 10-fold Testing Results RSVM vs. Naïve Bayes
[Table: per-fold F-measures of RSVM and NB over the 10 folds; many entries are not recoverable from the transcript. Recoverable RSVM values include 0.965, 0.975, 0.99, 0.984, 0.974, 0.936, 0.98; recoverable NB values include 0.969, 0.941, 0.964, 0.953, 0.958. The recoverable per-fold differences (RSVM - NB) are -0.004, -0.009, 0.021, 0.01, 0.033, 0.02, -0.038, 0.006, 0.016.]
Null hypothesis $H_0$: there is no difference between RSVM and NB. Reject $H_0$ at the 95% confidence level.
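A sketch of the paired test on the nine per-fold differences that survive in the transcript (one fold's difference is missing, so this is illustrative rather than a reproduction of the slide's result):

```python
import numpy as np
from scipy import stats

# Nine per-fold F-measure differences (RSVM - NB) recoverable from the
# slide's table; the tenth fold's difference is missing from the transcript.
diffs = np.array([-0.004, -0.009, 0.021, 0.01, 0.033, 0.02, -0.038, 0.006, 0.016])

# A paired t-test is a one-sample t-test on the differences against mean 0.
t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Reject H0 ("no difference") at the 95% level only if p < 0.05.
```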

29 Multi-Class SVMs: Combining Binary Classifiers into a Multi-Class Classifier
One-vs-Rest (see the sketch below):
- Classes: in this class or not in this class.
- Positive training samples: the data in this class.
- Negative training samples: the rest.
- K binary SVMs (K is the number of classes).
One-vs-One:
- Classes: in class one or in class two.
- Positive training samples: the data in one class.
- Negative training samples: the data in the other class.
- K(K-1)/2 binary SVMs.
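A minimal one-vs-rest sketch; the base learner here is an ordinary least-squares scorer rather than an SVM, purely to keep the example self-contained (the data and classes are illustrative):

```python
import numpy as np

def train_binary(X, t):
    # Stand-in for any binary classifier returning a real-valued score;
    # here a least-squares linear fit (not an SVM) keeps the sketch short.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ w

def one_vs_rest(X, y, classes):
    # K binary problems: class k is positive (+1), all other classes negative (-1).
    scorers = {k: train_binary(X, np.where(y == k, 1.0, -1.0)) for k in classes}
    def predict(Z):
        scores = np.column_stack([scorers[k](Z) for k in classes])
        return np.array(classes)[np.argmax(scores, axis=1)]
    return predict

X = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6], [-5, 5], [-5, 6.0]])
y = np.array([0, 0, 1, 1, 2, 2])
predict = one_vs_rest(X, y, classes=[0, 1, 2])
print(predict(X))  # [0 0 1 1 2 2]
```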

30 Performance Measures: Precision, Recall, and F-measure
Precision $= \frac{TP}{TP + FP}$, recall $= \frac{TP}{TP + FN}$, F-measure $= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.

             prediction y=1   prediction y=-1
label y=1    True Positive    False Negative
label y=-1   False Positive   True Negative
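A direct computation of the three measures from predictions (the labels below are illustrative):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, 1, -1, 1, -1, -1])
print(precision_recall_f1(y_true, y_pred))  # (0.667, 0.667, 0.667)
```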

31 Measures for Multi-class Classification (one vs. rest)
Macro-averaging: compute the performance measure for each class separately, then take the arithmetic average over classes.
Micro-averaging: sum the per-class contingency (confusion) tables into one pooled table, then compute the measure from the pooled counts.
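A sketch contrasting the two averages in the one-vs-rest setting (the labels are illustrative); macro-averaging averages per-class F-measures, while micro-averaging pools the contingency tables first:

```python
import numpy as np

def per_class_counts(y_true, y_pred, k):
    # One-vs-rest contingency counts for class k.
    tp = np.sum((y_true == k) & (y_pred == k))
    fp = np.sum((y_true != k) & (y_pred == k))
    fn = np.sum((y_true == k) & (y_pred != k))
    return tp, fp, fn

def f_measure(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(y_true, y_pred, classes):
    counts = [per_class_counts(y_true, y_pred, k) for k in classes]
    macro = np.mean([f_measure(*c) for c in counts])  # average of per-class F
    tp, fp, fn = np.sum(counts, axis=0)               # pooled contingency table
    micro = f_measure(tp, fp, fn)
    return macro, micro

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])
print(macro_micro_f(y_true, y_pred, classes=[0, 1, 2]))  # macro ~0.719, micro 0.75
```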

32 Summary of Top 10 Categories
category (pos, neg, test):
acq (1648, 6075, 718)
corn (180, 7543, 56)
crude (385, 7338, 186)
earn (2861, 4862, 1080)
grain (428, 7295, 148)
interest (348, 7375, 131)
money-fx (534, 7189, 179)
ship (191, 7532, 87)
trade (367, 7356, 116)
wheat (211, 7512, 81)

33 F-measure of Top10 Categories
Category: F-measure
acq: 97.03
corn: 81.63
crude: 88.58
earn: 98.84
grain: 90.51
interest: 76.52
money-fx: 78.26
ship: 83.03
trade: 75.83
wheat: 80.00
macro avg.: 85.05
micro avg.: 92.87

