1 A Kernel Approach for Learning From Almost Orthogonal Patterns*
CIS 525 Class Presentation
Professor: Slobodan Vucetic
Presenter: Yilian Qin
* B. Scholkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.

2 Presentation Outline
Introduction
  - Motivation
  - A brief review of SVM for linearly separable patterns
  - Kernel approach for SVM
  - Empirical kernel map
Problem: almost orthogonal patterns in feature space
  - An example
  - Situations leading to almost orthogonal patterns
Method to reduce large diagonals of the Gram matrix
  - Gram matrix transformation
  - An approximate approach based on statistics
Experiments
  - Artificial data (string classification, microarray data with noise, hidden variable problem)
  - Real data (thrombin binding, lymphoma classification, protein family classification)
Conclusions
Comments

3 Introduction

4 Motivation
Support vector machine (SVM)
  - A powerful method for classification (or regression), with accuracy comparable to neural networks
  - Exploits kernel functions to separate patterns in a high-dimensional feature space
  - The information the SVM needs about the training data is stored in the Gram matrix (kernel matrix)
The problem:
  - The SVM does not perform well if the Gram matrix has large diagonal values

5 A Brief Review of SVM
For linearly separable patterns, find the separating hyperplane w·x + b = 0 with the maximum margin. The margin 2/||w|| depends only on the closest points (the support vectors), so to maximize the margin:
Minimize: (1/2)||w||^2
Constraints: y_i (w·x_i + b) ≥ 1, for i = 1, ..., m
[Figure: linearly separable "+" and "−" classes with the maximum-margin hyperplane between them; the margin depends on the closest points.]
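
Below is a minimal, illustrative sketch (not from the slides) of the hard-margin SVM using scikit-learn; the toy data points and the large C used to approximate a hard margin are assumptions for demonstration only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2/2 subject to y_i (w.x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors (the closest points):\n", clf.support_vectors_)
```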

6 Kernel Approach for SVM (1/3)
For linearly non-separable patterns:
  - A nonlinear mapping function Φ(x) ∈ H maps the patterns into a new feature space H of higher dimension (for example, the XOR problem; see the sketch below)
  - The SVM is then trained in the new feature space:
    Minimize: (1/2)||w||^2
    Constraints: y_i (w·Φ(x_i) + b) ≥ 1, for i = 1, ..., m
The kernel trick:
  - Solving the above minimization problem appears to require 1) the explicit form of Φ and 2) inner products in the high-dimensional space H
  - Both are avoided by a wise selection of kernel function with the property k(x_i, x_j) = Φ(x_i) · Φ(x_j)
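
As a concrete illustration of the XOR remark, here is a small sketch (assumed setup, not from the paper): a linear SVM cannot separate XOR, while a degree-2 polynomial kernel separates it without ever computing Φ explicitly.

```python
import numpy as np
from sklearn.svm import SVC

# XOR: not linearly separable in the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

linear = SVC(kernel="linear", C=1e6).fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=1, C=1e6).fit(X, y)   # k(x, y) = (gamma x.y + 1)^2

print("linear kernel training accuracy:", linear.score(X, y))            # stays below 1.0
print("degree-2 polynomial kernel training accuracy:", poly.score(X, y)) # 1.0
```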

7 Kernel Approach for SVM (2/3)
Transform the problem with the kernel method:
  - Expand w in the new feature space: w = Σ_i a_i Φ(x_i) = [Φ(x)] a, where [Φ(x)] = [Φ(x_1), Φ(x_2), ..., Φ(x_m)] and a = [a_1, a_2, ..., a_m]^T
  - Gram matrix: K = [K_ij], where K_ij = Φ(x_i) · Φ(x_j) = k(x_i, x_j) (symmetric!)
  - The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (a sufficient condition for the existence of an optimal solution is that K is positive definite)
  - The constraints: y_i {w^T Φ(x_i) + b} = y_i {a^T [Φ(x)]^T Φ(x_i) + b} = y_i {a^T K_i + b} ≥ 1, where K_i is the i-th column of K
The kernelized problem (verified numerically in the sketch below):
Minimize: (1/2) a^T K a
Constraints: y_i (a^T K_i + b) ≥ 1, for i = 1, ..., m
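
The following numpy check (illustrative only; the degree-2 feature map and random coefficients are made up) verifies the identity ||w||^2 = a^T K a when w = Σ_i a_i Φ(x_i).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up explicit feature map: degree-2 monomials of a 2-d input.
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

X = rng.normal(size=(5, 2))            # 5 training points
Phi = np.array([phi(x) for x in X])    # rows are Phi(x_i)

K = Phi @ Phi.T                        # Gram matrix, K_ij = Phi(x_i).Phi(x_j)
a = rng.normal(size=5)                 # arbitrary expansion coefficients

w = Phi.T @ a                          # w = sum_i a_i Phi(x_i)
print(np.dot(w, w))                    # ||w||^2
print(a @ K @ a)                       # a^T K a -- same value
```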

8 Kernel Approach for SVM (3/3)
To predict new data x with a trained SVM:
  f(x) = sign( w^T Φ(x) + b ) = sign( Σ_{i=1}^{m} a_i k(x_i, x) + b )
where a and b are the optimal solution based on the training data, and m is the number of training instances.
The explicit form of k(x_i, x_j) is therefore required for prediction on new data (see the sketch below).
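
A sketch (assuming scikit-learn's SVC; the RBF kernel, gamma and toy data are arbitrary choices) that reconstructs the prediction rule Σ_i a_i k(x_i, x) + b from a trained model, where the a_i correspond to sklearn's dual_coef_ entries, and compares it with decision_function.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0] + X[:, 1] ** 2 - 0.5)       # arbitrary labelling rule

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Manual prediction: f(x) = sum_i a_i k(x_i, x) + b, where the sum runs over
# the support vectors and a_i is stored in dual_coef_ (already signed by y_i).
X_new = rng.normal(size=(5, 3))
K_new = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)
f_manual = K_new @ clf.dual_coef_.ravel() + clf.intercept_[0]

print(np.allclose(f_manual, clf.decision_function(X_new)))   # True
print(np.sign(f_manual), clf.predict(X_new))
```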

9 Empirical Kernel Mapping
Assumption: m (the number of training instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space R^m.
Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), ..., k(x_i, x_m)]^T = K_i
The SVM in R^m:
Minimize: (1/2) a^T K_m a
Constraints: y_i (a^T (K_m)_i + b) ≥ 1, for i = 1, ..., m
The new Gram matrix K_m associated with Φ_m(x): K_m = [(K_m)_ij], where (K_m)_ij = Φ_m(x_i) · Φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e. K_m = K^T K = K K^T
Advantage of the empirical kernel map: K_m is positive (semi)definite (illustrated below)
  - K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U (K is symmetric, U is a unitary matrix, D is diagonal)
  - This satisfies the sufficient condition of the minimization problem above
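
A brief numpy illustration (the polynomial base kernel and random data are assumptions) of the empirical kernel map: the rows of the kernel matrix serve as the new feature vectors, and the resulting Gram matrix K K^T has no negative eigenvalues.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))

# Base kernel matrix K, with K_ij = k(x_i, x_j).
K = polynomial_kernel(X, X, degree=3, coef0=1)

# Empirical kernel map: Phi_m(x_i) = i-th row of K = [k(x_i, x_1), ..., k(x_i, x_m)].
Phi_m = K

# Gram matrix of the mapped points: K_m = K K^T (= K^2, since K is symmetric).
K_m = Phi_m @ Phi_m.T
print(np.linalg.eigvalsh(K_m).min() >= -1e-8)   # True: positive semidefinite
```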

10 The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

11 An Example of Almost Orthogonal Patterns
[Table: a training dataset X with almost orthogonal patterns and its Gram matrix under the linear kernel k(x_i, x_j) = x_i · x_j, which has large diagonal entries.]
w is the solution found by the standard SVM.
Observation: each large entry in w corresponds to a column of X with only one large entry, so w becomes a lookup table and the SVM will not generalize well (a toy reconstruction of this effect follows). A better solution is also shown on the slide.
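
The toy construction below (my own, not the dataset on the slide) reproduces the lookup-table effect: each pattern carries one large private feature, the Gram matrix gets a large diagonal, and the linear SVM puts nearly all of its weight on the private features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

m = 10
signal = rng.integers(0, 2, size=(m, 5)) * 0.1   # weak shared features
private = 10.0 * np.eye(m)                       # one large private feature per pattern
X = np.hstack([signal, private])                 # almost orthogonal patterns
y = np.array([1, -1] * (m // 2))

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]

K = X @ X.T
off = ~np.eye(m, dtype=bool)
print("mean diagonal of K:", K.diagonal().mean(), " mean off-diagonal:", K[off].mean())
print("weights on shared features:         ", np.round(w[:5], 3))
print("weights on private (lookup) features:", np.round(w[5:], 3))
```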

12 Situations Leading to Almost Orthogonal Patterns
Sparsity of the patterns in the new feature space, e.g.
  x = [0, 0, 0, 1, 0, 0, 1, 0]^T
  y = [0, 1, 1, 0, 0, 0, 0, 0]^T
  x·x ≈ y·y >> x·y (large diagonals in the Gram matrix)
Some selections of kernel function may result in sparsity in the new feature space:
  - String kernels (Watkins, 2000, et al.)
  - Polynomial kernels k(x_i, x_j) = (x_i · x_j)^d with large order d: if x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) even for moderately large d, because the gap grows exponentially in d (a quick numeric check follows)
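
The quick numeric check promised above (illustrative numbers, not from the paper): a modest ratio between diagonal and off-diagonal inner products is blown up by the exponent d.

```python
# Ratio k(x_i, x_i) / k(x_i, x_j) = (x_i.x_i / x_i.x_j)^d for the polynomial kernel.
xi_xi, xi_xj = 3.0, 1.0             # assumed inner products, x_i.x_i modestly larger
for d in (1, 2, 4, 8):
    print(d, (xi_xi / xi_xj) ** d)  # 3, 9, 81, 6561: the diagonal quickly dominates
```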

13 Methods to Reduce the Large Diagonals of Gram Matrices

14 Gram Matrix Transformation (1/2)
For a symmetric, positive definite Gram matrix K (or K_m):
  - K = U^T D U, where U is a unitary matrix and D is a diagonal matrix
  - Define f(K) = U^T f(D) U, with f(D)_ii = f(D_ii), i.e. the function f operates on the eigenvalues λ_i of K
  - f(K) should preserve the positive definiteness of the Gram matrix
A sample procedure for Gram matrix transformation (sketched in code below):
  - (Optional) Compute the positive definite matrix A = sqrt(K)
  - Suppress the large diagonals of A and obtain a symmetric A', i.e. transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
  - Compute the positive definite matrix K' = (A')^2
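
A minimal numpy sketch of the sample procedure, under two assumptions of mine: the base kernel is linear on random high-dimensional data, and the suppression step uses an elementwise shrinkage sign(a)|a|^p. The slide does not fix the suppression function f, so this is one possible instantiation rather than the authors' prescribed choice.

```python
import numpy as np

def sqrt_psd(K):
    """Symmetric square root of a PSD matrix via eigendecomposition, K = A @ A."""
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)              # guard against round-off
    return (U * np.sqrt(lam)) @ U.T

def suppress(A, p=0.3):
    """Elementwise shrinkage sign(a)|a|^p (an assumed suppression function);
    it keeps A symmetric, so (A')^2 below is guaranteed positive semidefinite."""
    return np.sign(A) * np.abs(A) ** p

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 200))                  # nearly orthogonal high-dim patterns
K = X @ X.T                                    # linear Gram matrix with a large diagonal

A = sqrt_psd(K)                                # (optional) step: K = A @ A
A_prime = suppress(A)                          # suppress the large diagonal of A
K_prime = A_prime @ A_prime                    # K' = (A')^2

off = ~np.eye(6, dtype=bool)
print("before: diag / |off-diag| =", K.diagonal().mean() / np.abs(K[off]).mean())
print("after:  diag / |off-diag| =", K_prime.diagonal().mean() / np.abs(K_prime[off]).mean())
print("K' PSD:", np.linalg.eigvalsh(K_prime).min() >= -1e-8)
```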

15 Gram Matrix Transformation (2/2)
Effect of the matrix transformation:
  - The explicit form of the new kernel function k' is not available
  - k' is required when the trained SVM is used to predict the test data
  - A solution: include all test data in K before the matrix transformation K -> K', i.e. the test data has to be known at training time
[Diagram: the map Φ(x) defines K through k(x_i, x_j) = Φ(x_i) · Φ(x_j); the transformation K' = f(K) implicitly defines a new map Φ'(x) and kernel k'(x_i, x_j) = Φ'(x_i) · Φ'(x_j).]
If x_i has been used in calculating K', the prediction for x_i can simply use K'_i together with a' and b' obtained from the portion of K' corresponding to the training data, for i = 1, 2, ..., m+n, where m is the number of training instances and n is the number of test instances (a sketch of this transductive recipe follows).
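
Here is the promised sketch of the transductive recipe, assuming scikit-learn's precomputed-kernel interface; the particular transformation (reusing the shrinkage idea from the previous sketch) and the random data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def transform_gram(K, p=0.3):
    """One assumed choice of K -> K' = f(K): take the PSD square root of K,
    shrink its entries elementwise with sign(a)|a|^p, and square again so
    the result stays positive semidefinite."""
    lam, U = np.linalg.eigh(K)
    A = (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T
    A = np.sign(A) * np.abs(A) ** p
    return A @ A

rng = np.random.default_rng(5)
m, n, d = 30, 10, 500                          # train size, test size, dimension
X = rng.normal(size=(m + n, d))
y = np.sign(X[:, :5].sum(axis=1))              # arbitrary labelling rule

K = X @ X.T                                    # Gram matrix over train AND test points
K_prime = transform_gram(K)                    # transform once, with test data included

train, test = slice(0, m), slice(m, m + n)
clf = SVC(kernel="precomputed", C=1.0).fit(K_prime[train, train], y[train])

# Predicting a test point x_i just reads the row K'_i restricted to the
# training columns -- no explicit form of k' is ever needed.
y_pred = clf.predict(K_prime[test, train])
print("test accuracy:", (y_pred == y[test]).mean())
```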

16 An Approximate Approach Based on Statistics
Strictly, the empirical kernel map Φ_{m+n}(x), built over both training and test points, should be used to calculate the Gram matrix.
Assuming the dataset size r is large, the map built on the training set alone changes little when the test points are added.
Therefore, the SVM can simply be trained with the empirical kernel map on the training set, Φ_m(x), instead of Φ_{m+n}(x).

17 Experiment Results

18 Artificial Data (1/3)
String classification
  - String kernel function (Watkins, 2000, et al.)
  - Sub-polynomial kernel k'(x, y) = [Φ(x) · Φ(y)]^P, 0 < P < 1: for sufficiently small P, the large diagonals of K can be suppressed (see the sketch below)
  - 50 strings (25 for training and 25 for testing), 20 trials
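
A schematic sketch of the sub-polynomial idea (a linear base kernel on random data stands in for the string kernel, which is not implemented here; the exponent P and the split are arbitrary): the kernel values are shrunk elementwise and, since the result need not be positive definite, the rows of the shrunken matrix are used as an empirical kernel map.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel

def subpolynomial(K, P=0.2):
    """Elementwise sub-polynomial transformation sign(k)|k|^P, 0 < P < 1."""
    return np.sign(K) * np.abs(K) ** P

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 300))                 # stand-in features for the 50 strings
y = np.sign(X[:, 0])                           # arbitrary labelling rule

K = linear_kernel(X)                           # base kernel (stand-in for the string kernel)
K_sub = subpolynomial(K)                       # suppressed dynamic range / diagonal

# Empirical kernel map over the 25 training points: each example is represented
# by its (transformed) similarities to the training set.
train, test = slice(0, 25), slice(25, 50)
F = K_sub[:, train]
clf = SVC(kernel="linear", C=1.0).fit(F[train], y[train])
print("test accuracy:", clf.score(F[test], y[test]))
```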

19 Artificial Data (2/3)
Microarray data with noise (Alon et al., 1999)
  - 62 instances (22 positive, 44 negative), 2000 features in the original data
  - 10000 noise features were added (each non-zero with probability 1%)
  - The error rate for the SVM without noise addition is 0.18 ± 0.15

20 Artificial Data (3/3)
Hidden variable problem
  - 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
  - The original kernel is a polynomial kernel of order 4

21 Real Data (1/3)
Thrombin binding problem
  - 1909 instances, 139,351 binary features
  - Only 0.68% of the entries are non-zero
  - 8-fold cross-validation

22 Real Data (2/3)
Lymphoma classification (Alizadeh et al., 2000)
  - 96 samples, 4026 features
  - 10-fold cross-validation
  - Improved results were observed compared with previous work (Weston, 2001)

23 Real Data (3/3)
Protein family classification (Murzin et al., 1995)
  - Small positive set, large negative set
  - Performance measured by the rate of false positives and by the receiver operating characteristic (ROC) score (1: best score, 0: worst score)

24 Conclusions
  - The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
  - The common situation in which sparse vectors lead to large diagonals was identified and discussed
  - A method of Gram matrix transformation to suppress the large diagonals was proposed to improve performance in such cases
  - Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed

25 Comments
Strong points:
  - The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
  - The experiments are extensive
Weak points:
  - The applicability of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the test data is not known at training time
  - The proposed Gram matrix transformation method was not tested directly in the experiments; instead, transformed kernel functions were used
  - Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist, so the necessary condition for the statistical approach to the pattern distribution is not satisfied

26 End!

