A Kernel Approach for Learning From Almost Orthogonal Patterns*
CIS 525 Class Presentation
Professor: Slobodan Vucetic
Presenter: Yilian Qin
* B. Schölkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp.

Presentation Outline
- Introduction
  - Motivation
  - A brief review of SVM for linearly separable patterns
  - Kernel approach for SVM
  - Empirical kernel map
- Problem: almost orthogonal patterns in the feature space
  - An example
  - Situations leading to almost orthogonal patterns
- Methods to reduce large diagonals of the Gram matrix
  - Gram matrix transformation
  - An approximate approach based on statistics
- Experiments
  - Artificial data (string classification, microarray data with noise, hidden variable problem)
  - Real data (thrombin binding, lymphoma classification, protein family classification)
- Conclusions
- Comments

Introduction

Motivation
Support vector machines (SVMs):
- A powerful method for classification (and regression), with accuracy comparable to neural networks
- Exploit kernel functions to separate patterns in a high-dimensional feature space
- The information the SVM uses about the training data is stored in the Gram matrix (kernel matrix)
The problem:
- The SVM does not perform well if the Gram matrix has large diagonal values

A Brief Review of SVM
For linearly separable patterns, the SVM finds the separating hyperplane that maximizes the margin, which depends only on the closest points:
  Minimize: ||w||^2
  Constraints: y_i (w^T x_i + b) >= 1, i = 1, …, m
[Figure: two classes separated by a hyperplane; the margin depends on the closest points on each side]
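Not part of the original slides: a minimal scikit-learn sketch of this hard-margin setup on made-up toy data (the data, the linear kernel, and the very large C used to approximate the hard-margin case are all assumptions for illustration).

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable 2-D data: two classes on opposite sides of a line.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM: minimize ||w||^2
# subject to y_i (w^T x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 2 / ||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors (closest points):\n", clf.support_vectors_)
```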

Kernel Approach for SVM (1/3)
For linearly non-separable patterns:
- A nonlinear mapping function Φ(x): x -> H maps the patterns into a new feature space H of higher dimension (for example, the XOR problem becomes separable after such a mapping)
- SVM in the new feature space:
  Minimize: ||w||^2
  Constraints: y_i (w^T Φ(x_i) + b) >= 1
The kernel trick:
- Solving the above minimization problem appears to require 1) the explicit form of Φ and 2) inner products in the high-dimensional space H
- Simplification: a wise selection of kernel function with the property k(x_i, x_j) = Φ(x_i)·Φ(x_j) avoids both
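As an aside (not from the slides), a small sketch of the XOR example: a hypothetical 4-point XOR dataset that a linear SVM cannot separate, while a degree-2 polynomial kernel implicitly maps it into a space where it is separable.

```python
import numpy as np
from sklearn.svm import SVC

# The XOR problem: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# A linear SVM cannot fit XOR, but the polynomial kernel k(x, z) = (x.z + 1)^2
# corresponds to a feature map that includes the product x1*x2, making XOR separable.
linear = SVC(kernel="linear", C=1e6).fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=1, C=1e6).fit(X, y)

print("linear kernel training accuracy:    ", linear.score(X, y))  # below 1.0
print("polynomial kernel training accuracy:", poly.score(X, y))    # 1.0
```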

Kernel Approach for SVM (2/3)
Transform the problem with the kernel method:
- Expand w in the new feature space: w = Σ_i a_i Φ(x_i) = [Φ(x)] a, where [Φ(x)] = [Φ(x_1), Φ(x_2), …, Φ(x_m)] and a = [a_1, a_2, …, a_m]^T
- Gram matrix: K = [K_ij], where K_ij = Φ(x_i)·Φ(x_j) = k(x_i, x_j) (symmetric!)
- The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (a sufficient condition for an optimal solution to exist is that K is positive definite)
- The constraints: y_i {w^T Φ(x_i) + b} = y_i {a^T [Φ(x)]^T Φ(x_i) + b} = y_i {a^T K_i + b} >= 1, where K_i is the i-th column of K
So the problem becomes:
  Minimize: a^T K a
  Constraints: y_i (a^T K_i + b) >= 1
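A quick numeric check of the identity ||w||^2 = a^T K a, using a made-up explicit feature map as a stand-in for Φ (everything below is an illustrative assumption, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical explicit feature map standing in for Phi(x): here the inputs
# concatenated with their squares.
X = rng.normal(size=(5, 3))          # m = 5 patterns, 3 input dimensions
Phi = np.hstack([X, X ** 2])

# Gram matrix K_ij = Phi(x_i) . Phi(x_j)
K = Phi @ Phi.T

# Expand w in the feature space: w = sum_i a_i Phi(x_i)
a = rng.normal(size=5)
w = Phi.T @ a

# ||w||^2 computed directly equals the quadratic form a^T K a.
print(np.allclose(w @ w, a @ K @ a))   # True
```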

Kernel Approach for SVM (3/3)
To predict new data x with a trained SVM:
  f(x) = sign( Σ_{i=1..m} a_i k(x_i, x) + b )
where a and b are the optimal solution obtained from the training data and m is the number of training instances.
- The explicit form of k(x_i, x_j) is therefore required for the prediction of new data
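A minimal sketch of this decision function; the RBF kernel, the coefficients a, and the offset b are placeholders for the example, not values from an actual trained SVM:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """An example kernel k(x, z); any valid kernel could be used here."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x_new, X_train, a, b, kernel=rbf_kernel):
    """Kernel SVM decision function: sign( sum_i a_i k(x_i, x_new) + b )."""
    s = sum(a_i * kernel(x_i, x_new) for a_i, x_i in zip(a, X_train))
    return np.sign(s + b)

# Toy usage with made-up coefficients.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
a, b = np.array([1.0, -1.0]), 0.0
print(predict(np.array([0.1, 0.1]), X_train, a, b))   # 1.0 (closer to the first pattern)
```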

Empirical Kernel Mapping
Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space R^m.
- Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), …, k(x_i, x_m)]^T = K_i
- The SVM is then trained in R^m (the same minimization problem, with Φ_m in place of Φ)
- The new Gram matrix K_m associated with Φ_m(x): K_m = [Km_ij], where Km_ij = Φ_m(x_i)·Φ_m(x_j) = K_i·K_j = K_i^T K_j, i.e. K_m = K^T K = K K^T
Advantage of the empirical kernel map: K_m is positive definite
- K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U (K is symmetric, U is a unitary matrix, D is diagonal)
- This satisfies the sufficient condition of the above minimization problem
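A small numpy sketch of the empirical kernel map (the linear base kernel and the random data are assumptions for illustration): each pattern is represented by its row of K, and the resulting Gram matrix K_m = K K^T is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                # m = 6 training patterns

# Base kernel (a linear kernel, as an example choice) and its Gram matrix K.
K = X @ X.T

# Empirical kernel map: Phi_m(x_i) = [k(x_i, x_1), ..., k(x_i, x_m)]^T = K_i,
# i.e. each pattern is represented by its own row/column of K.
Phi_m = K

# Gram matrix of the mapped patterns: K_m = K K^T (K is symmetric).
K_m = Phi_m @ Phi_m.T
i, j = 1, 4
print(np.isclose(K_m[i, j], K[i] @ K[j]))       # Km_ij = K_i^T K_j

# K_m is positive semi-definite: its eigenvalues are (numerically) non-negative.
print(np.linalg.eigvalsh(K_m).min() >= -1e-9)   # True
```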

The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

An Example of Almost Orthogonal Patterns
- A training dataset X with almost orthogonal patterns
- The Gram matrix with the linear kernel k(x_i, x_j) = x_i·x_j has large diagonal entries
- w is the solution found by the standard SVM
- Observation: each large entry of w corresponds to a column of X with only one large entry, so w becomes a lookup table and the SVM will not generalize well
- A better (more general) solution w is also shown on the slide for comparison
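A toy reconstruction of this effect (the data are invented for illustration and are not the slide's example): each pattern carries a weak shared feature plus a strong feature unique to it, and the linear SVM puts most of its weight on the unique "lookup-table" features.

```python
import numpy as np
from sklearn.svm import SVC

# Each pattern has one weak, shared, informative feature (column 0) and one
# strong feature unique to that pattern (a scaled identity block), so the
# patterns are almost orthogonal and the Gram matrix has a large diagonal.
m = 8
y = np.where(np.arange(m) < m // 2, 1, -1)
X = np.hstack([0.1 * y.reshape(-1, 1).astype(float), 5.0 * np.eye(m)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]

print("Gram matrix diagonal:", np.diag(X @ X.T))        # ~25 vs off-diagonal ~0.01
print("|w| on the shared feature:  ", abs(w[0]))        # small
print("|w| on the unique features: ", np.abs(w[1:]))    # larger: the lookup-table part
```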

Situations Leading to Almost Orthogonal Patterns
Sparsity of the patterns in the feature space, e.g.
  x = [0, 0, 0, 1, 0, 0, 1, 0]^T
  y = [0, 1, 1, 0, 0, 0, 0, 0]^T
  x·x = y·y >> x·y  (large diagonals in the Gram matrix)
Some choices of kernel function may produce sparsity in the new feature space:
- String kernels (Watkins 2000, et al.)
- Polynomial kernels k(x_i, x_j) = (x_i·x_j)^d with large order d: if x_i·x_i > x_i·x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) even for moderately large d, because the gap is raised to the power d
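A short numeric illustration of both situations (the vectors are made up for the example):

```python
import numpy as np

# Two sparse binary patterns: orthogonal, but each with non-zero norm.
x = np.array([0, 0, 0, 1, 0, 0, 1, 0], dtype=float)
y = np.array([0, 1, 1, 0, 0, 0, 0, 0], dtype=float)
print(x @ x, y @ y, x @ y)    # 2.0 2.0 0.0 -> large diagonal, zero off-diagonal

# Polynomial kernel: even a modest gap between x_i.x_i and x_i.x_j is
# amplified by the power d.
def poly_kernel(u, v, d):
    return (u @ v) ** d

u = np.array([1.0, 1.0, 1.0, 0.0])
v = np.array([1.0, 1.0, 0.0, 1.0])
for d in (1, 4, 8):
    print(d, poly_kernel(u, u, d), poly_kernel(u, v, d))
# d=1: 3 vs 2;  d=4: 81 vs 16;  d=8: 6561 vs 256 -> the diagonal dominates
```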

Methods to Reduce the Large Diagonals of Gram Matrices

Gram Matrix Transformation (1/2)
For a symmetric, positive definite Gram matrix K (or K_m):
- K = U^T D U, where U is a unitary matrix and D is a diagonal matrix
- Define f(K) = U^T f(D) U, with f(D)_ii = f(D_ii), i.e. the function f operates on the eigenvalues λ_i of K
- f(K) should preserve the positive definiteness of the Gram matrix
A sample procedure for Gram matrix transformation:
- (Optional) Compute the positive definite matrix A = sqrt(K)
- Suppress the large diagonals of A and obtain a symmetric A', i.e. transform the eigenvalues of A: [λ_min, λ_max] -> [f(λ_min), f(λ_max)]
- Compute the positive definite matrix K' = (A')^2
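A minimal numpy sketch of such a spectral transformation. The specific eigenvalue function used here, f(λ) = λ - λ_min (which subtracts a constant from the diagonal only), is an illustrative assumption and not necessarily the paper's choice:

```python
import numpy as np

def transform_gram(K, f):
    """Spectral transform f(K): apply f to the eigenvalues of a symmetric matrix K."""
    eigvals, U = np.linalg.eigh(K)              # columns of U are eigenvectors of K
    return U @ np.diag(f(eigvals)) @ U.T

rng = np.random.default_rng(3)

# A Gram matrix with a dominant diagonal (sparse-like, almost orthogonal patterns).
X = np.hstack([0.3 * rng.random((6, 3)), 3.0 * np.eye(6)])
K = X @ X.T

# Illustrative diagonal-suppressing choice: shift all eigenvalues down by the
# smallest one, which subtracts a constant from the diagonal and keeps K' PSD.
lam_min = np.linalg.eigvalsh(K).min()
K_prime = transform_gram(K, lambda lam: lam - lam_min)

dominance = lambda M: np.mean(np.diag(M)) / np.mean(np.abs(M - np.diag(np.diag(M))))
print("diagonal dominance before:", dominance(K))
print("diagonal dominance after: ", dominance(K_prime))
print("K' still PSD:", np.linalg.eigvalsh(K_prime).min() >= -1e-9)
```

Subtracting λ_min is just one admissible f: any function applied to the eigenvalues preserves symmetry, and positive semi-definiteness as long as f(λ_i) >= 0 for every eigenvalue.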

Gram Matrix Transformation (2/2)
Effect of the matrix transformation:
- The explicit form of the new kernel function k' is not available, but k' is required when the trained SVM is used to predict the test data
- A solution: include all test data in K before the matrix transformation K -> K' = f(K), i.e. the test data have to be known at training time
[Diagram: Φ(x) defines K via k(x_i, x_j) = Φ(x_i)·Φ(x_j); the transformation K' = f(K) implicitly defines a new map Φ'(x) with k'(x_i, x_j) = Φ'(x_i)·Φ'(x_j)]
- If x_i has been used in computing K', the prediction on x_i can simply use the column K'_i together with a' and b' obtained from the portion of K' corresponding to the training data, for i = 1, 2, …, m+n, where m is the number of training instances and n is the number of test instances
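A sketch of this transductive use of the transformed Gram matrix with scikit-learn's precomputed-kernel SVM; the data, labels, and the diagonal-suppressing transform below are all assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Transductive setting: training and test patterns are both available up front.
m, n = 20, 10
y = np.where(np.arange(m + n) % 2 == 0, 1.0, -1.0)
informative = y[:, None] * np.array([1.0, 0.5]) + 0.3 * rng.normal(size=(m + n, 2))
X = np.hstack([informative, 4.0 * np.eye(m + n)])   # almost orthogonal patterns

# Build the full (m+n) x (m+n) Gram matrix, then transform it as a whole.
K = X @ X.T
lam, U = np.linalg.eigh(K)
K_prime = U @ np.diag(lam - lam.min()) @ U.T        # suppress the diagonal, stay PSD

# Train on the training block of K'; predict test points from their rows of K'
# against the training columns (no explicit k' is ever needed).
clf = SVC(kernel="precomputed", C=1.0).fit(K_prime[:m, :m], y[:m])
print("test accuracy:", clf.score(K_prime[m:, :m], y[m:]))
```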

An Approximate Approach Based on Statistics
- Strictly, the empirical kernel map over all m+n patterns, Φ_{m+n}(x), should be used to calculate the Gram matrix
- Assuming the dataset size is large, the empirical map computed on the training set alone gives approximately the same representation
- Therefore, the SVM can simply be trained with the empirical kernel map on the training set, Φ_m(x), instead of Φ_{m+n}(x)

Experiment Results

Artificial Data (1/3)
String classification:
- String kernel function (Watkins 2000, et al.)
- Sub-polynomial kernel k'(x, y) = [Φ(x)·Φ(y)]^P with 0 < P < 1; for sufficiently small P, the large diagonals of K are suppressed
- 50 strings (25 for training, 25 for testing), 20 trials
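A small sketch of the sub-polynomial idea on a toy Gram matrix; the sparse binary vectors below merely stand in for string-kernel feature vectors (the actual string kernel of Watkins 2000 is not reproduced here):

```python
import numpy as np

def subpoly(K, P):
    """Sub-polynomial transform of kernel values: sign(k) * |k|^P, with 0 < P < 1.
    Not guaranteed positive definite on its own, hence the combination with the
    empirical kernel map (K_sub K_sub^T) described on the earlier slides."""
    return np.sign(K) * np.abs(K) ** P

# Deterministic sparse binary patterns standing in for string-kernel features.
X = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
], dtype=float)
K = X @ X.T
print(K)                 # diagonal 4, off-diagonal 1 or 0

K_sub = subpoly(K, P=0.25)
print(K_sub)             # diagonal 4**0.25 ~= 1.41, off-diagonal still 1 or 0

# Empirical kernel map restores positive semi-definiteness.
K_emp = K_sub @ K_sub.T
print(np.linalg.eigvalsh(K_emp).min() >= -1e-9)   # True
```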

Artificial Data (2/3)
Microarray data with noise (Alon et al., 1999):
- 62 instances (22 positive, 40 negative), 2000 features in the original data
- Noise features were added (each entry non-zero with probability 1%)
- The error rate for the SVM without noise addition is 0.18 ± 0.15

Artificial Data (3/3)
Hidden variable problem:
- 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
- The original kernel is a polynomial kernel of order 4

Real Data (1/3)
Thrombin binding problem:
- 1909 instances, 139,351 binary features
- 0.68% of the entries are non-zero
- 8-fold cross-validation

Real Data (2/3)
Lymphoma classification (Alizadeh et al., 2000):
- 96 samples, 4026 features
- 10-fold cross-validation
- Improved results observed compared with previous work (Weston, 2001)

Real Data (3/3)
Protein family classification (Murzin et al., 1995):
- Small positive set, large negative set
- Performance reported as the rate of false positives and the receiver operating characteristic (ROC) score, where 1 is the best score and 0 the worst

Conclusions
- The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
- The common situation in which sparse vectors lead to large diagonals was identified and discussed
- A Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases
- Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed

Comments
Strong points:
- The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
- The experiments are extensive
Weak points:
- The Gram matrix transformation may be severely restricted in forecasting or other applications where the test data are not known at training time
- The proposed Gram matrix transformation was not tested directly in the experiments; transformed kernel functions were used instead
- Almost orthogonal patterns imply that multiple pattern vectors pointing in the same direction rarely exist, so the necessary condition for the statistics-based approximation of the pattern distribution is not satisfied

End!