1 Nonlinear Knowledge in Kernel Machines Olvi Mangasarian UW Madison & UCSD La Jolla Edward Wild UW Madison Data Mining and Mathematical Programming Workshop Centre de Recherches Mathématiques Université de Montréal, Québec October 10-13, 2006

2 Objectives
- Primary objective: Incorporate prior knowledge over completely arbitrary sets into:
  - function approximation, and
  - classification
  without transforming (kernelizing) the knowledge
- Secondary objective: Achieve transparency of the prior knowledge for practical applications

3 Graphical Example of Knowledge Incorporation
[Figure: +/− labeled points separated by the nonlinear surface K(x′, B′)u = γ, with prior-knowledge regions g(x) ≤ 0, h1(x) ≤ 0, h2(x) ≤ 0]
- Similar approach for approximation

4 Outline
- Kernels in classification and function approximation
- Incorporation of prior knowledge
  - Previous approaches: require transformation of knowledge
  - New approach does not require any transformation of knowledge; knowledge given over completely arbitrary sets
- Fundamental tool used
  - Theorem of the alternative for convex functions
- Experimental results
  - Four synthetic examples and two examples related to breast cancer prognosis
  - Classifiers and approximations with prior knowledge more accurate than those without prior knowledge

5 Theorem of the Alternative for Convex Functions: Generalization of the Farkas Theorem
- Farkas: For A ∈ R^{k×n}, b ∈ R^n, exactly one of the following must hold:
  - I. Ax ≤ 0, b′x > 0 has a solution x ∈ R^n, or
  - II. A′v = b, v ≥ 0 has a solution v ∈ R^k
- Convex generalization: Let h: Γ ⊂ R^n → R^k, θ: Γ → R, Γ a convex set, h and θ convex on Γ, and h(x) < 0 for some x ∈ Γ. Then exactly one of the following must hold:
  - I. h(x) ≤ 0, θ(x) < 0 has a solution x ∈ Γ, or
  - II. ∃ v ∈ R^k, v ≥ 0: v′h(x) + θ(x) ≥ 0, ∀x ∈ Γ
- To get II of Farkas from the convex version: ∃ v ≥ 0 with v′Ax − b′x ≥ 0 ∀x ∈ R^n, that is, v ≥ 0, A′v − b = 0

6 Classification and Function Approximation
- Given a set of m points in n-dimensional real space R^n with corresponding labels
  - Labels in {+1, −1} for classification problems
  - Labels in R for approximation problems
- Points are represented by rows of a matrix A ∈ R^{m×n}
- Corresponding labels or function values are given by a vector y
  - Classification: y ∈ {+1, −1}^m
  - Approximation: y ∈ R^m
- Find a function f(A_i) = y_i based on the given data points A_i
  - f: R^n → {+1, −1} for classification
  - f: R^n → R for approximation

7 Classification and Function Approximation
- Problem: utilizing only the given data may result in a poor classifier or approximation
  - Points may be noisy
  - Sampling may be costly
- Solution: use prior knowledge to improve the classifier or approximation

8 Adding Prior Knowledge
- Standard approximation and classification: fit function at given data points without knowledge
- Constrained approximation: satisfy inequalities at given points
- Previous approaches (FMS 2001, FMS 2003, and MSW 2004): satisfy linear inequalities over polyhedral regions
- Proposed new approach: satisfy nonlinear inequalities over arbitrary regions without kernelizing (transforming) the knowledge

9 Kernel Machines
- Approximate f by a nonlinear kernel function K using parameters u ∈ R^k and γ ∈ R
- A kernel function is a nonlinear generalization of the scalar product
- f(x) ≈ K(x′, B′)u − γ, x ∈ R^n, K: R^n × R^{n×k} → R^k
- Gaussian kernel: K(x′, B′)_i = exp(−μ||x − B_i||²), i = 1, …, k
- B ∈ R^{k×n} is a basis matrix
  - Usually, B = A ∈ R^{m×n} = input data matrix
  - In Reduced Support Vector Machines, B is a small subset of the rows of A
  - B may be any matrix with n columns
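
To make the kernel map concrete, here is a minimal NumPy sketch (not from the talk) of the Gaussian kernel K(x′, B′) and the resulting predictor f(x) = K(x′, B′)u − γ; the width parameter mu and the helper names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, B, mu=0.1):
    """K(X, B')_ij = exp(-mu * ||X_i - B_j||^2), rows of X against rows of the basis matrix B."""
    sq_dists = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

def kernel_predictor(x, B, u, gamma, mu=0.1):
    """f(x) = K(x', B')u - gamma for a single point x or a batch of points (one per row)."""
    return gaussian_kernel(np.atleast_2d(x), B, mu) @ u - gamma
```

With B = A this is the usual kernel machine; a reduced machine simply passes a subset of the rows of A as the basis matrix B.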

10 Kernel Machines
- Introduce a slack variable s to measure error in classification or approximation
- Error s in kernel approximation of given data:
  - −s ≤ K(A, B′)u − γe − y ≤ s, where e is a vector of ones in R^m
  - Function approximation: f(x) ≈ K(x′, B′)u − γ
- Error s in kernel classification of given data:
  - K(A+, B′)u − γe + s+ ≥ e, s+ ≥ 0
  - K(A−, B′)u − γe − s− ≤ −e, s− ≥ 0
  - More succinctly, let D = diag(y), the m × m matrix with diagonal y of ±1's; then: D(K(A, B′)u − γe) + s ≥ e, s ≥ 0
  - Classifier: f(x) = sign(K(x′, B′)u − γ)

11 Kernel Machines in Approximation or Classification
- Trade off between solution complexity (||u||_1) and data fitting (||s||_1)
- At the solution: e′a = ||u||_1 and e′s = ||s||_1 (a is the vector bounding the absolute values of u)
[Slide shows the linear programming formulations for approximation and for classification]
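
As a concrete (and assumed) implementation of the approximation problem on this slide, a minimal sketch using cvxpy; the talk does not prescribe an implementation, and the parameter name nu for the fit/complexity trade-off is an assumption. The classification version simply replaces the fit constraint with D(K(A, B′)u − γe) + s ≥ e.

```python
import cvxpy as cp

def fit_kernel_approximation(K, y, nu=1.0):
    """min ||u||_1 + nu*||s||_1  subject to  -s <= K(A, B')u - gamma*e - y <= s.
    K is the m x k kernel matrix K(A, B'); y holds the m observed values."""
    m, k = K.shape
    u, gamma = cp.Variable(k), cp.Variable()
    s = cp.Variable(m, nonneg=True)          # data-fitting error |K(A,B')u - gamma e - y|
    problem = cp.Problem(cp.Minimize(cp.norm1(u) + nu * cp.sum(s)),
                         [cp.abs(K @ u - gamma - y) <= s])
    problem.solve()
    return u.value, gamma.value
```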

12 Incorporating Nonlinear Prior Knowledge: Previous Approaches
- Linear knowledge implication: Cx ≤ d ⇒ w′x − γ ≥ h′x + β
- Need to "kernelize" knowledge from input space to transformed (feature) space of kernel
- Requires change of variable x = A′t, w = A′u
  - CA′t ≤ d ⇒ u′AA′t − γ ≥ h′A′t + β
  - K(C, A′)t ≤ d ⇒ u′K(A, A′)t − γ ≥ h′A′t + β
- Use a linear theorem of the alternative in the t space
- Lost connection with original knowledge
- Achieves good numerical results, but is not readily interpretable in the original space

13 Nonlinear Prior Knowledge in Function Approximation: New Approach
- Start with an arbitrary nonlinear knowledge implication:
  - g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀x ∈ Γ ⊂ R^n
- g, h are arbitrary functions on Γ
  - g: Γ → R^k, h: Γ → R
- Sufficient condition, linear in v, u, γ:
  - ∃ v ≥ 0: v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ

14 Theorem of the Alternative for Convex Functions
- Assume that g(x), K(x′, B′)u − γ, and −h(x) are convex functions of x, that Γ is convex, and that ∃ x ∈ Γ: g(x) < 0. Then either:
  - I. g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has a solution x ∈ Γ, or
  - II. ∃ v ∈ R^k, v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ
  - But never both.
- If we can find v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem
  - g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has no solution x ∈ Γ, or equivalently:
  - g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀x ∈ Γ

15 Proof
- ¬I ⇒ II (Not needed for present application)
  - Follows from a fundamental theorem of Fan-Glicksberg-Hoffman for convex functions [1957] and the existence of an x ∈ Γ such that g(x) < 0.
- II ⇒ ¬I (Requires no assumptions on g, h, K, or u, γ whatsoever)
  - Suppose not. That is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0 such that:
    - g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0, (I)
    - v ≥ 0, v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ (II)
  - Then we have the contradiction: 0 > v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0

16 Incorporating Prior Knowledge
- The condition ∃ v ≥ 0: v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ is a linear semi-infinite program: an infinite number of constraints, one for each x ∈ Γ
- Discretize to obtain a finite linear program: g(x_i) ≤ 0 ⇒ K(x_i′, B′)u − γ ≥ h(x_i) is enforced through the linear constraint v′g(x_i) + K(x_i′, B′)u − γ − h(x_i) ≥ 0 at the points x_i, i = 1, …, k
- Slacks allow the knowledge to be satisfied inexactly
- Add a term to the objective function to drive the slacks to zero (a sketch of the resulting LP follows)
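
A sketch of how the discretized knowledge constraints might be appended to the approximation LP sketched earlier (again using cvxpy; the penalty sigma on the knowledge slack, the data layout, and all names are assumptions rather than the authors' code).

```python
import cvxpy as cp

def fit_with_knowledge(K, y, K_d, G_d, h_d, nu=1.0, sigma=1.0):
    """Approximation LP with the discretized prior-knowledge condition
    v'g(x_i) + K(x_i', B')u - gamma - h(x_i) + z_i >= 0,  v >= 0,  z >= 0.
    K   : m x k kernel matrix K(A, B') of the given data;  y : m observed values
    K_d : d x k kernel rows K(x_i', B') at the knowledge discretization points x_i
    G_d : d x p matrix whose row i is g(x_i)';  h_d : d values h(x_i)"""
    m, k = K.shape
    d, p = G_d.shape
    u, gamma = cp.Variable(k), cp.Variable()
    s = cp.Variable(m, nonneg=True)   # data-fitting error
    v = cp.Variable(p, nonneg=True)   # multiplier from the theorem of the alternative
    z = cp.Variable(d, nonneg=True)   # knowledge slack, penalized so it is driven toward zero
    constraints = [
        cp.abs(K @ u - gamma - y) <= s,               # fit the given data
        K_d @ u - gamma - h_d + G_d @ v + z >= 0,     # knowledge at the discretization points
    ]
    problem = cp.Problem(cp.Minimize(cp.norm1(u) + nu * cp.sum(s) + sigma * cp.sum(z)),
                         constraints)
    problem.solve()
    return u.value, gamma.value, v.value
```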

17 Numerical Experience: Approximation
- Evaluate on three datasets
  - Two synthetic datasets
  - Wisconsin Prognostic Breast Cancer Database (WPBC): 194 patients × 2 histological features (tumor size & number of metastasized lymph nodes)
- Compare approximation with prior knowledge to approximation without prior knowledge
  - Prior knowledge leads to improved accuracy
  - General prior knowledge used cannot be handled exactly by previous work (MSW 2004)
  - No kernelization needed on knowledge set

18 Two-Dimensional Hyperboloid
- f(x1, x2) = x1x2

19 Two-Dimensional Hyperboloid
- Given exact values only at 11 points along the line x1 = x2, at x1 ∈ {-5, …, 5}
[Figure: the 11 data points in the (x1, x2) plane]

20 Two-Dimensional Hyperboloid Approximation without Prior Knowledge

21 Two-Dimensional Hyperboloid
- Add prior (inexact) knowledge: x1x2 ≤ 1 ⇒ f(x1, x2) ≤ x1x2
- The nonlinear term x1x2 cannot be handled exactly by any previous approach
- Discretization used only 11 points along the line x1 = -x2, x1 ∈ {-5, -4, …, 4, 5}

22 Two-Dimensional Hyperboloid Approximation with Prior Knowledge

23 Two-Dimensional Tower Function

24 Two-Dimensional Tower Function (Misleading) Data
- Given 400 points on the grid [-4, 4] × [-4, 4]
- Values are min{g(x), 2}, where g(x) is the exact tower function

25 Two-Dimensional Tower Function Approximation without Prior Knowledge

26 Two-Dimensional Tower Function Prior Knowledge
- Add prior knowledge: (x1, x2) ∈ [-4, 4] × [-4, 4] ⇒ f(x) = g(x)
- Prior knowledge is the exact function value
- Enforced at 2500 points on the grid [-4, 4] × [-4, 4] through the above implication
- Principal objective of prior knowledge here is to overcome the poor given data

27 Two-Dimensional Tower Function Approximation with Prior Knowledge

28 Breast Cancer Application: Predicting Lymph Node Metastasis as a Function of Tumor Size
- Number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
- Determined by surgery in addition to the removal of the tumor
- Optional procedure, especially if the tumor size is small
- Wisconsin Prognostic Breast Cancer (WPBC) data: lymph node metastasis and tumor size for 194 patients
- Task: predict the number of metastasized lymph nodes given tumor size alone

29 Predicting Lymph Node Metastasis
- Split data into two portions
  - Past data: 20% used to find prior knowledge
  - Present data: 80% used to evaluate performance
- Past data simulates prior knowledge obtained from an expert

30 Prior Knowledge for Lymph Node Metastasis as a Function of Tumor Size
- Generate prior knowledge by fitting the past data: h(x) := K(x′, B′)u − γ
  - B is the matrix of the past data points
- Use density estimation to decide where to enforce the knowledge
  - p(x) is the empirical density of the past data
- Prior knowledge utilized on the approximating function f(x): the number of metastasized lymph nodes is greater than the value predicted on past data, with a tolerance of 0.01
  - p(x) ≥ 0.1 ⇒ f(x) ≥ h(x) − 0.01
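
As a rough illustration of this step (an assumption about implementation, not the authors' code), the fitted past-data predictor h and an empirical density estimate can be packaged into the region function g and right-hand side used by the knowledge constraint; the 0.1 density cutoff and 0.01 tolerance follow the slide, while scipy's gaussian_kde and the helper names are assumed.

```python
import numpy as np
from scipy.stats import gaussian_kde

def knowledge_from_past(past_X, h, density_cut=0.1, tol=0.01):
    """Encode  p(x) >= density_cut  ==>  f(x) >= h(x) - tol  as a pair (g, h_rhs),
    with g(x) <= 0 exactly where the knowledge should hold.
    past_X : past data points, one per row (shape (n, d), even if d = 1)
    h      : predictor fitted on the past data, e.g. from the approximation sketch above"""
    kde = gaussian_kde(past_X.T)                          # empirical density p(x) of the past data
    g = lambda X: density_cut - kde(np.atleast_2d(X).T)   # g(x) <= 0  <=>  p(x) >= density_cut
    h_rhs = lambda X: h(X) - tol                          # lower bound required of f at x
    return g, h_rhs
```

Evaluating g and h_rhs at a set of discretization points gives the G_d and h_d arrays used in the knowledge-constrained LP sketch.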

31 Predicting Lymph Node Metastasis: Results
- RMSE: root-mean-squared error; LOO: leave-one-out error
- Improvement due to knowledge: 14.9%

  Approximation                                        | Error
  Prior knowledge h(x) based on past data (20%)        | 6.12 RMSE
  f(x) without knowledge based on present data (80%)   | 5.92 LOO
  f(x) with knowledge based on present data (80%)      | 5.04 LOO

32 Incorporating Prior Knowledge in Classification (Very Similar)
- Implication for a positive region:
  - g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ α, ∀x ∈ Γ ⊂ R^n
  - Linear sufficient condition: K(x′, B′)u − γ − α + v′g(x) ≥ 0, v ≥ 0, ∀x ∈ Γ
- Similar implication for negative regions
- Add discretized constraints to the linear program (see the sketch below)
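
Mirroring the approximation sketch, a possible cvxpy formulation with one discretized positive-region knowledge constraint; alpha (the level the classifier must exceed inside the region), sigma, and the variable names are assumptions. A negative-region constraint is added the same way with the classifier output negated, i.e. −(K_neg @ u − gamma) − alpha + G_neg @ v_neg + z_neg >= 0.

```python
import cvxpy as cp
import numpy as np

def fit_classifier_with_knowledge(K, y, K_pos, G_pos, alpha=1.0, nu=1.0, sigma=1.0):
    """Kernel classifier with a discretized knowledge constraint for a positive region.
    K     : m x k kernel matrix K(A, B');  y : m labels in {+1, -1}
    K_pos : d x k kernel rows at points discretizing the positive knowledge region
    G_pos : d x p matrix whose row i is g(x_i)' for that region"""
    m, k = K.shape
    d, p = G_pos.shape
    u, gamma = cp.Variable(k), cp.Variable()
    s = cp.Variable(m, nonneg=True)   # misclassification slack
    v = cp.Variable(p, nonneg=True)   # knowledge multiplier
    z = cp.Variable(d, nonneg=True)   # knowledge slack
    constraints = [
        np.diag(y) @ (K @ u - gamma) + s >= 1,           # D(K(A,B')u - gamma e) + s >= e
        K_pos @ u - gamma - alpha + G_pos @ v + z >= 0,  # positive-region knowledge, discretized
    ]
    problem = cp.Problem(cp.Minimize(cp.norm1(u) + nu * cp.sum(s) + sigma * cp.sum(z)),
                         constraints)
    problem.solve()
    return u.value, gamma.value
```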

33 Incorporating Prior Knowledge: Classification

34 Numerical Experience: Classification
- Evaluate on three datasets
  - Two synthetic datasets
  - Wisconsin Prognostic Breast Cancer Database
- Compare classifier with prior knowledge to one without prior knowledge
  - Prior knowledge leads to improved accuracy
  - General prior knowledge used cannot be handled exactly by previous work (FMS 2001, FMS 2003)
  - No kernelization needed on knowledge set

35 Checkerboard Dataset: Black & White Points in R^2

36 Checkerboard Classifiers
[Figures: checkerboard classifier without knowledge using 16 center points; prior knowledge for the 16-point checkerboard classifier; checkerboard classifier with knowledge using 16 center points]

37 Spiral Dataset: 194 Points in R^2

38 Spiral Classifiers
[Figures: spiral classifier without knowledge based on 100 labeled points (note the many incorrectly classified +'s); prior knowledge function for the spiral (white ⇒ +, gray ⇒ −); spiral classifier with knowledge (no misclassified points)]
- Labels given only at 100 correctly classified circled points
- Prior knowledge imposed at 291 points in each region

39 Predicting Breast Cancer Recurrence Within 24 Months
- Wisconsin Prognostic Breast Cancer (WPBC) dataset
  - 155 patients monitored for recurrence within 24 months
  - 30 cytological features
  - 2 histological features: number of metastasized lymph nodes and tumor size
- Predict whether or not a patient remains cancer free after 24 months
  - 82% of patients remain disease free
  - 86% accuracy (Bennett, 1992): best previously attained
- Prior knowledge allows us to incorporate additional information to improve accuracy

40 Generating WPBC Prior Knowledge
- Gray regions indicate areas where g(x) ≤ 0
- Simulate an oncological surgeon's advice about recurrence
- Knowledge imposed at dataset points inside the given regions
[Figure: tumor size in centimeters vs. number of metastasized lymph nodes; + = recur within 24 months, − = cancer free within 24 months]
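
One simple way to express such a gray region as a function g with g(x) ≤ 0 inside the region is a box in (tumor size, lymph node) space. This is a hypothetical sketch: the bounds below are placeholders, not the regions actually used in the study.

```python
import numpy as np

def box_region_g(lo, hi):
    """Return g with g(x) <= 0 componentwise exactly when lo <= x <= hi."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    def g(x):
        x = np.asarray(x, float)
        return np.concatenate([lo - x, x - hi])   # every component <= 0  <=>  x inside the box
    return g

# Hypothetical "recur" advice region: tumors of 4-10 cm with 10-30 metastasized lymph nodes
g_recur = box_region_g(lo=[4.0, 10.0], hi=[10.0, 30.0])
```

Evaluating such a g at the dataset points inside the region supplies the rows of G_pos (or G_neg) in the classification sketch above.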

41 WPBC Results
- 49.7% improvement due to knowledge
- 35.7% improvement over the best previous predictor

  Classifier          | Misclassification Rate
  Without Knowledge   | 18.1%
  With Knowledge      | 9.0%

42 Conclusion
- General nonlinear prior knowledge incorporated into kernel classification and approximation
  - Implemented as linear inequalities in a linear programming problem
  - Knowledge appears transparently
- Demonstrated effectiveness
  - Four synthetic examples
  - Two real-world problems from breast cancer prognosis
- Future work
  - Prior knowledge with more general implications
  - User-friendly interface for knowledge specification
  - Generate prior knowledge for real-world datasets

43 Website Links to Talk & Papers  http://www.cs.wisc.edu/~olvi  http://www.cs.wisc.edu/~wildt

