Classification via Mathematical Programming Based Support Vector Machines Glenn M. Fung Computer Sciences Dept. University of Wisconsin - Madison November 26, 2002

Outline of Talk  (Standard) Support vector machines (SVM)  Classify by halfspaces  Proximal support vector machines (PSVM)  Classify by proximity to planes  Numerical experiments  Incremental PSVM classifiers  Synthetic dataset consisting of 1 billion points in 10-dimensional input space classified in less than 2 hours and 26 minutes  Knowledge-based linear SVMs  Incorporating knowledge sets into a classifier  Numerical experiments

Support Vector Machines: Maximizing the Margin between Bounding Planes (figure: the two bounding planes between classes A+ and A−, with the support vectors lying on them)

Standard Support Vector Machine: Algebra of the 2-Category Linearly Separable Case  Given m points in n-dimensional space  Represented by an m-by-n matrix A  Membership of each point in class +1 or −1 specified by an m-by-m diagonal matrix D with +1 and −1 entries  Separate by two bounding planes, x'w = γ + 1 and x'w = γ − 1, with A_i w ≥ γ + 1 for rows with D_ii = +1 and A_i w ≤ γ − 1 for rows with D_ii = −1  More succinctly: D(Aw − eγ) ≥ e, where e is a vector of ones.
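
As a small concrete illustration (not on the original slide): with m = 3 points and labels (+1, +1, −1), D = diag(1, 1, −1) and D(Aw − eγ) ≥ e amounts to A_1 w − γ ≥ 1, A_2 w − γ ≥ 1 and A_3 w − γ ≤ −1, where A_i denotes the i-th row of A; that is, the two +1 points lie on or above the plane x'w = γ + 1 and the −1 point lies on or below x'w = γ − 1.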

Standard Support Vector Machine Formulation  The margin between the bounding planes is 2/‖w‖, so it is maximized by minimizing (1/2)‖w‖²  Solve the quadratic program, for some ν > 0: min over (w, γ, y) of ν e'y + (1/2) w'w subject to D(Aw − eγ) + y ≥ e, y ≥ 0 (QP), where y denotes the nonnegative slack (error) vector and D encodes the +1 or −1 class membership.

Proximal Support Vector Machines (KDD 2001): Fitting the Data Using Two Parallel Bounding Planes (figure: classes A+ and A− clustered about their respective planes)

PSVM Formulation  We have from the QP SVM formulation: min ν e'y + (1/2) w'w subject to D(Aw − eγ) + y ≥ e, y ≥ 0 (QP)  PSVM replaces the inequality by an equality, measures the error by ‖y‖² and adds γ² to the regularizer: min (ν/2)‖y‖² + (1/2)(w'w + γ²) subject to D(Aw − eγ) + y = e  This simple but critical modification changes the nature of the optimization problem tremendously!  Solving for y in terms of w and γ gives the unconstrained problem: min over (w, γ) of (ν/2)‖e − D(Aw − eγ)‖² + (1/2)(w'w + γ²)

Advantages of New Formulation  Objective function remains strongly convex  An explicit exact solution can be written in terms of the problem data  PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space  Exact leave-one-out-correctness can be obtained in terms of problem data

Linear PSVM  We want to solve: min over (w, γ) of (ν/2)‖e − D(Aw − eγ)‖² + (1/2)(w'w + γ²)  Setting the gradient equal to zero gives a nonsingular system of linear equations  Solution of the system gives the desired PSVM classifier

Linear PSVM Solution  Define H = [A  −e]; the solution is [w; γ] = (I/ν + H'H)⁻¹ H'De  The linear system to solve depends on H'H, which is of size (n+1) × (n+1)  n+1 is usually much smaller than m

Linear Proximal SVM Algorithm  Input: A, D, ν  Define: H = [A  −e]  Solve: (I/ν + H'H) r = H'De  Calculate: [w; γ] = r  Classifier: sign(x'w − γ)

Nonlinear PSVM Formulation  Linear PSVM (linear separating surface x'w = γ): min (ν/2)‖y‖² + (1/2)(w'w + γ²) subject to D(Aw − eγ) + y = e (QP)  By QP "duality", w = A'Du; maximizing the margin in the "dual space" gives: min (ν/2)‖y‖² + (1/2)(u'u + γ²) subject to D(AA'Du − eγ) + y = e  Replace AA' by a nonlinear kernel K(A, A'): min (ν/2)‖y‖² + (1/2)(u'u + γ²) subject to D(K(A, A')Du − eγ) + y = e

The Nonlinear Classifier  The nonlinear classifier: sign(K(x', A')Du − γ)  Where K is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel: (K(A, A'))_ij = exp(−μ‖A_i' − A_j'‖²), i, j = 1, …, m  The ij-entry of K(A, A') represents the "similarity" of data points A_i and A_j

Nonlinear PSVM  Defining H slightly differently, H = [K(A, A')D  −e], and setting the gradient equal to zero as in the linear case, we obtain: [u; γ] = (I/ν + H'H)⁻¹ H'De  Here the linear system to solve is of size (m+1) × (m+1)  However, reduced kernel techniques (RSVM) can be used to reduce the dimensionality.

Nonlinear Proximal SVM Algorithm  Input: A, D, ν  Define: K = K(A, A'), H = [KD  −e]  Solve: (I/ν + H'H) r = H'De  Calculate: [u; γ] = r  Classifier: sign(K(x', A')Du − γ)

Linear & Nonlinear PSVM MATLAB Code
function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d=diag(D), nu. OUTPUT: w, gamma
% [w, gamma] = psvm(A,d,nu);
[m,n]=size(A); e=ones(m,1); H=[A -e];
v=(d'*H)';                      % v = H'*D*e
r=(speye(n+1)/nu+H'*H)\v;       % solve (I/nu+H'*H)r = v
w=r(1:n); gamma=r(n+1);         % getting w, gamma from r
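
For the nonlinear case the same system is solved after the kernel matrix replaces A; the following is a minimal MATLAB sketch of that step (not from the original slides), assuming a Gaussian kernel and the hypothetical function name psvm_nl:
function [u, gamma] = psvm_nl(A,d,nu,mu)
% Nonlinear PSVM sketch: build K(A,A') and solve with H = [K*D  -e]
m=size(A,1); e=ones(m,1);
sq=sum(A.^2,2);
K=exp(-mu*(sq*e' + e*sq' - 2*(A*A')));   % (K)_ij = exp(-mu*||A_i - A_j||^2)
H=[K*spdiags(d,0,m,m) -e];               % H = [K(A,A')*D  -e]
v=(d'*H)';                               % v = H'*D*e
r=(speye(m+1)/nu+H'*H)\v;                % (m+1)-by-(m+1) system (I/nu+H'H)r = v
u=r(1:m); gamma=r(m+1);
% classify a new row vector x (1-by-n):
%   Kx = exp(-mu*sum((A - repmat(x,m,1)).^2,2))';          % K(x',A')
%   class = sign(Kx*spdiags(d,0,m,m)*u - gamma);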

Linear PSVM Comparisons with Other SVMs: Much Faster, Comparable Correctness  Ten-fold test correctness (%) and time (sec.) compared for PSVM, SSVM and SVM on six datasets (m × n): WPBC (60 mo., m = 110), Ionosphere (m = 351), Cleveland Heart (m = 297), Pima Indians (m = 768), BUPA Liver (m = 345), Galaxy Dim (m = 4192)

Linear PSVM vs. LSVM on 2-Million-Point Datasets: Over 30 Times Faster  Training correctness (%), testing correctness (%) and time (sec.) compared for LSVM and PSVM on the NDC "Easy" and NDC "Hard" datasets

Nonlinear PSVM: Spiral Dataset 94 Red Dots & 94 White Dots

Nonlinear PSVM Comparisons  Ten-fold test correctness (%) and time (sec.) compared for PSVM, SSVM and LSVM on: Ionosphere (m = 351), BUPA Liver (m = 345), Tic-Tac-Toe (m = 958), Mushroom* (m = 8124)  * A rectangular kernel of size 8124 × 215 was used

Conclusion  PSVM is an extremely simple procedure for generating linear and nonlinear classifiers  The PSVM classifier is obtained by solving a single system of linear equations, in the usually small dimensional input space for a linear classifier  Comparable test set correctness to standard SVM  Much faster than standard SVMs: typically an order of magnitude less time

Incremental PSVM Classification (Second SIAM International Conference on Data Mining, 2002)  Suppose we have two "blocks" of data, A = [A_1; A_2] with D = diag(d_1; d_2)  The linear system to solve depends only on the compressed blocks H_1'H_1 + H_2'H_2 and H_1'D_1 e_1 + H_2'D_2 e_2, which are of size (n+1) × (n+1) and (n+1) × 1

Linear Incremental Proximal SVM Algorithm  Initialization  Read a block of data from disk  Compute its contribution and update the accumulated matrix and vector stored in memory  Discard the raw block; keep only the small accumulated quantities  If more blocks remain, repeat; otherwise compute the output classifier
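
As a rough MATLAB illustration of the loop above (not from the original slides), assuming n, nu and numBlocks are given and readBlock is a hypothetical helper returning one block A_i with its ±1 label vector d_i:
S=zeros(n+1,n+1); v=zeros(n+1,1);    % only (n+1)x(n+1) storage is ever kept
for i=1:numBlocks
    [Ai,di]=readBlock(i);            % hypothetical: read block i from disk
    e=ones(size(Ai,1),1); Hi=[Ai -e];
    S=S+Hi'*Hi;                      % accumulate the compressed block H_i'*H_i
    v=v+Hi'*di;                      % accumulate H_i'*D_i*e_i (= H_i'*d_i)
    clear Ai di Hi                   % the raw block can be discarded
end
r=(eye(n+1)/nu+S)\v;                 % solve the small (n+1)-by-(n+1) system
w=r(1:n); gamma=r(n+1);              % classifier: sign(x'*w - gamma)
Retiring an old block, as described on the next slide, would simply subtract its H_i'*H_i and H_i'*d_i from S and v.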

Linear Incremental Proximal SVM: Adding and Retiring Data  Capable of modifying an existing linear classifier by both adding and retiring data  The option of retiring old data is similar to adding new data  Financial data: old data is obsolete  The option of keeping old data and merging it with the new data:  Medical data: old data does not become obsolete.

Numerical Experiments: One-Billion-Point Two-Class Dataset  Synthetic dataset consisting of 1 billion points in 10-dimensional input space  Generated by the NDC (Normally Distributed Clustered) dataset generator  Dataset divided into 500 blocks of 2 million points each  Solution obtained in less than 2 hours and 26 minutes  About 30% of the time was spent reading data from disk  Testing set correctness: 90.79%

Numerical Experiments: Simulation of a Two-Month, 60-Million-Point Dataset  Synthetic dataset consisting of 60 million points (1 million per day) in 10-dimensional input space  Generated using NDC  At the beginning, we only have data corresponding to the first month  Every day:  The oldest block of data (1 million points) is retired  A new block (1 million points) is added  A new linear classifier is calculated daily  Only an 11-by-11 matrix is kept in memory at the end of each day; all other data is purged.

Numerical experiments Separator changing through time

Numerical Experiments: Normals to the Separating Hyperplanes, Corresponding to 5-Day Intervals (figure)

Conclusion  Proposed algorithm is an extremely simple procedure for generating linear classifiers in an incremental fashion for huge datasets.  The linear classifier is obtained by solving a single system of linear equations in the small dimensional input space.  The proposed algorithm has the ability to retire old data and add new data in a very simple manner.  Only a matrix of the size of the input space is kept in memory at any time

Support Vector Machines: Linear Programming Formulation  Use the 1-norm instead of the 2-norm: min ν‖y‖_1 + ‖w‖_1 subject to D(Aw − eγ) + y ≥ e, y ≥ 0  This is equivalent to the following linear program: min ν e'y + e't subject to D(Aw − eγ) + y ≥ e, −t ≤ w ≤ t, y ≥ 0
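
A minimal sketch (not from the original slides) of solving this LP with linprog from the MATLAB Optimization Toolbox, stacking the variables as z = [w; gamma; y; t]:
function [w, gamma] = svm1norm(A,d,nu)
[m,n]=size(A); e=ones(m,1);
f=[zeros(n+1,1); nu*e; ones(n,1)];             % objective: nu*e'*y + e'*t
DA=spdiags(d,0,m,m)*A;                         % D*A
% D(Aw - e*gamma) + y >= e   <=>   -DA*w + d*gamma - y <= -e
Aineq=[-DA,       d,           -speye(m),   sparse(m,n);
        speye(n), sparse(n,1),  sparse(n,m), -speye(n);    %  w - t <= 0
       -speye(n), sparse(n,1),  sparse(n,m), -speye(n)];   % -w - t <= 0
bineq=[-e; zeros(2*n,1)];
lb=[-inf(n+1,1); zeros(m+n,1)];                % y >= 0, t >= 0
z=linprog(f,Aineq,bineq,[],[],lb);             % solve the LP
w=z(1:n); gamma=z(n+1);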

Conventional Data-Based SVM

Knowledge-Based SVM via Polyhedral Knowledge Sets (NIPS 2002) (figure: polyhedral knowledge sets {x | B^1 x ≤ b^1}, {x | C^1 x ≤ c^1} and {x | C^2 x ≤ c^2} lying on the appropriate sides of the bounding planes x'w = γ + 1 and x'w = γ − 1, together with the data points of classes A+ and A−)

Incorporating Knowledge Sets Into an SVM Classifier  Suppose that the knowledge set {x | Bx ≤ b} belongs to the class A+; hence it must lie in the halfspace {x | x'w ≥ γ + 1}  We therefore have the implication: Bx ≤ b ⟹ x'w ≥ γ + 1  We will show that this implication is equivalent to a set of constraints that can be imposed on the classification problem.

Knowledge Set Equivalence Theorem
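
Roughly, following the published NIPS 2002 paper (not reproduced verbatim from the slide): for a nonempty knowledge set {x | Bx ≤ b}, the implication Bx ≤ b ⟹ x'w ≥ γ + 1 holds if and only if there exists u ≥ 0 such that B'u + w = 0 and b'u + γ + 1 ≤ 0.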

Proof of Equivalence Theorem (via Nonhomogeneous Farkas or LP Duality)  Proof: By LP duality:

Knowledge-Based SVM Classification

 Adding one set of constraints for each knowledge set to the 1-norm SVM LP, we have:
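
A sketch of the form these constraints take, derived from the equivalence above rather than copied from the slide: for each knowledge set {x | B^i x ≤ b^i} known to lie in class A+, introduce u^i ≥ 0 and add the constraints (B^i)'u^i + w = 0 and (b^i)'u^i + γ + 1 ≤ 0; for each knowledge set {x | C^j x ≤ c^j} in class A−, introduce v^j ≥ 0 and add (C^j)'v^j − w = 0 and (c^j)'v^j − γ + 1 ≤ 0. Relaxing these constraints with slack variables presumably yields the parametrized knowledge-based LP of the next slide.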

Parametrized Knowledge-Based LP

Numerical Testing The Promoter Recognition Dataset  Promoter: Short DNA sequence that precedes a gene sequence.  A promoter consists of 57 consecutive DNA nucleotides belonging to {A,G,C,T}.  Important to distinguish between promoters and nonpromoters  This distinction identifies starting locations of genes in long uncharacterized DNA sequences.

The Promoter Recognition Dataset: Numerical Representation  Simple "1 of N" mapping scheme for converting nominal attributes into a real-valued representation  Not the most economical representation, but commonly used.

The Promoter Recognition Dataset: Numerical Representation  Feature space mapped from the 57-dimensional nominal space to a real-valued 57 × 4 = 228 dimensional space (57 nominal values → 228 binary values).
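
A minimal MATLAB sketch of this mapping (not from the original slides), assuming the nucleotides are ordered A, G, C, T and using the hypothetical function name encode_promoter:
function x = encode_promoter(seq)
% seq: a 1-by-57 character array over {A,G,C,T}, e.g. 'AGCT...'
alphabet='AGCT';
x=zeros(1,57*4);                 % 228 binary features
for i=1:57
    j=find(alphabet==seq(i));    % which of the 4 letters appears at position i
    x(4*(i-1)+j)=1;              % "1 of N": one indicator per nominal value
end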

Promoter Recognition Dataset: Prior Knowledge Rules  Prior knowledge consists of the following 64 rules:

Promoter Recognition Dataset: Sample Rules  In the rules, a subscript denotes the position of a nucleotide with respect to a meaningful reference point in the sequence, the positions running from a fixed starting position to a fixed ending position; each rule then requires particular nucleotides at particular positions.

The Promoter Recognition Dataset: Comparative Algorithms  KBANN: Knowledge-based artificial neural network [Shavlik et al.]  BP: Standard back-propagation for neural networks [Rumelhart et al.]  O'Neill's Method: Empirical method suggested by biologist O'Neill [O'Neill]  NN: Nearest neighbor with k = 3 [Cost et al.]  ID3: Quinlan's decision tree builder [Quinlan]  SVM1: Standard 1-norm SVM [Bradley et al.]

The Promoter Recognition Dataset Comparative Test Results

Wisconsin Breast Cancer Prognosis Dataset Description of the data  110 instances corresponding to 41 patients whose cancer had recurred and 69 patients whose cancer had not recurred  32 numerical features  The domain theory: two simple rules used by doctors:

Wisconsin Breast Cancer Prognosis Dataset: Numerical Testing Results  Doctors' rules are applicable to only 32 of the 110 patients  Only 22 of those 32 patients are classified correctly by the rules (20% correctness over the full dataset)  The KSVM linear classifier is applicable to all patients, with correctness of 66.4%  Correctness comparable to the best available results using conventional SVMs  KSVM can produce classifiers based on knowledge alone, without using any data.

Conclusion  Prior knowledge easily incorporated into classifiers through polyhedral knowledge sets.  Resulting problem is a simple LP.  Knowledge sets can be used with or without conventional labeled data.  In either case KSVM is better than most knowledge based classifiers.

Breast Cancer Treatment Response: Joint with ExonHit (French BioTech)  35 patients treated by a drug cocktail  9 partial responders; 26 nonresponders  25 gene expression measurements made on each patient  1-norm SVM classifier selected 12 out of the 25 genes  Combinatorially selected 6 genes out of the 12  Separating plane obtained: T S U Z A X = 0  Leave-one-out error: 1 out of 35 (97.1% correctness)

Other papers:  A fast and Global Two Point Low Storage Optimization Technique for Tracing Rays in 2D and 3D Isotropic Media (Journal of Applied Geophysics)  Semi-Supervised Support Vector Machines for Unlabeled data Classification (Optimization Methods and Software)  Select a small subset of an unlabeled dataset to be labeled by an oracle or expert  Use the new labeled data and the remaining unlabeled data to train a SVM clasifier

Other papers:  Multicategory Proximal SVM Classifiers  Fast multicategory algorithm based on PSVM  Newton refinement step proposed  Data Selection for SVM Classifiers (KDD 2000)  Reduce the number of support vectors of a linear SVM  Minimal Kernel Classifiers (JMLR)  Use a concave minimization formulation to reduce the SVM model complexity.  Useful for online testing where testing time is an issue.

Other papers:  A Feature Selection Newton Method for SVM Classification  LP SVM solved using a Newton method  Very sparse solutions are obtained  Finite Newton method for Lagrangian SVM Classifiers (Neurocomputing Journal)  Very fast performance, specially when n>m