Machine Learning and Data Mining via Mathematical Programming Based Support Vector Machines
Glenn M. Fung
Ph.D. Dissertation Talk, University of Wisconsin - Madison, May 8, 2003

Thesis Overview
- Proximal support vector machines (PSVM): binary classification, multiclass classification, incremental classification (massive datasets)
- Knowledge-based SVMs (KSVM): linear KSVM and extension to nonlinear KSVM
- Sparse classifiers: data selection for linear classifiers (minimizing the number of support vectors), minimal kernel classifiers, feature selection
- Newton method for SVMs
- Semi-supervised SVMs
- Finite Newton method for Lagrangian SVM classifiers

Outline of Talk
- (Standard) support vector machine (SVM): classification by halfspaces
- Proximal support vector machine (PSVM): classification by proximity to planes
- Incremental PSVM classifiers: a synthetic dataset of 1 billion points in 10-dimensional input space classified in less than 2 hours and 26 minutes
- Knowledge-based SVMs: incorporate prior knowledge sets into classifiers
- Minimal kernel classifiers: reduce the data dependence of nonlinear classifiers

Support Vector Machines: Maximizing the Margin between Bounding Planes (figure: the two bounding planes, with the support vectors of classes A+ and A- lying on them)

Standard Support Vector Machine: Algebra of the 2-Category Linearly Separable Case. Given m points in n-dimensional space, represented by an m-by-n matrix A. Membership of each point in class +1 or -1 is specified by an m-by-m diagonal matrix D with +1 and -1 diagonal entries. The classes are separated by two bounding planes, written more succinctly as a single matrix inequality (as sketched below), where e denotes a vector of ones.
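A minimal sketch of the bounding planes and their succinct matrix form, assuming the standard notation defined above (A_i the i-th row of A, D_ii its +1/-1 label):

```latex
% Reconstructed sketch; the original slide equations are not reproduced verbatim.
\[
\begin{aligned}
A_i w &\ge \gamma + 1 \quad \text{for } D_{ii} = +1, \\
A_i w &\le \gamma - 1 \quad \text{for } D_{ii} = -1, \\
\text{more succinctly: } & \; D(Aw - e\gamma) \ge e .
\end{aligned}
\]
```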

Standard Support Vector Machine Formulation. Solve the quadratic program (QP) below for some fixed tradeoff parameter, where the diagonal matrix D denotes +1 or -1 class membership; the margin between the bounding planes is maximized by minimizing the norm of w (a sketch of the QP follows below).
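A minimal sketch of the quadratic program referenced here, assuming the notation above with nu > 0 the tradeoff parameter and y the slack vector:

```latex
% Reconstructed sketch of the quadratic program (QP).
\[
\min_{w,\gamma,y}\;\; \nu\, e^{\top}y + \tfrac{1}{2}\, w^{\top}w
\qquad \text{s.t.} \qquad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 .
\]
```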

Proximal Support Vector Machines: Fitting the Data Using Two Parallel Bounding Planes

PSVM Formulation. Starting from the QP SVM formulation, the inequality constraints are replaced by equalities and the slack is penalized with a squared 2-norm; solving for the slack in terms of w and gamma gives an unconstrained minimization (sketched below). This simple but critical modification changes the nature of the optimization problem tremendously.
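A minimal sketch of the PSVM modification and the resulting unconstrained problem, under the same notational assumptions:

```latex
% Reconstructed sketch: equality constraints, squared slack, and gamma^2 in the regularizer.
\[
\min_{w,\gamma,y}\;\; \tfrac{\nu}{2}\,\|y\|^{2} + \tfrac{1}{2}\,(w^{\top}w + \gamma^{2})
\quad \text{s.t.} \quad D(Aw - e\gamma) + y = e,
\]
\[
\text{and, substituting } y = e - D(Aw - e\gamma):\qquad
\min_{w,\gamma}\;\; \tfrac{\nu}{2}\,\|e - D(Aw - e\gamma)\|^{2} + \tfrac{1}{2}\,(w^{\top}w + \gamma^{2}) .
\]
```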

Advantages of the New Formulation
- The objective function remains strongly convex.
- An explicit exact solution can be written in terms of the problem data.
- The PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space.
- Exact leave-one-out correctness can be obtained in terms of the problem data.

Linear PSVM. We want to solve the unconstrained minimization above. Setting the gradient equal to zero gives a nonsingular system of linear equations, and the solution of that system gives the desired PSVM classifier.

Linear PSVM Solution. The linear system to solve depends on a matrix of size (n+1)-by-(n+1), where n+1 is usually much smaller than the number of data points m (an explicit form is sketched below).
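A minimal sketch of the explicit solution, assuming the augmented matrix E = [A  -e]:

```latex
% Reconstructed sketch of the explicit PSVM solution.
\[
\begin{bmatrix} w \\ \gamma \end{bmatrix}
= \Bigl(\tfrac{I}{\nu} + E^{\top}E\Bigr)^{-1} E^{\top} D e,
\qquad E = \bigl[\, A \;\; -e \,\bigr] \in \mathbb{R}^{m \times (n+1)} .
\]
```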

Linear Proximal SVM Algorithm: input the data; define the augmented matrix E; calculate the quantities entering the linear system; solve the small system; output the classifier (a sketch follows below).
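A minimal NumPy sketch of this training procedure, assuming the explicit solution sketched above; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def train_linear_psvm(A, d, nu=1.0):
    """A: m-by-n data matrix; d: length-m vector of +1/-1 labels; nu: tradeoff parameter."""
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])      # E = [A  -e]
    H = np.eye(n + 1) / nu + E.T @ E          # (n+1)-by-(n+1) system matrix
    rhs = E.T @ d                             # E' D e, since D e is just the label vector
    sol = np.linalg.solve(H, rhs)             # a single linear solve
    w, gamma = sol[:-1], sol[-1]
    return w, gamma

def classify(X, w, gamma):
    # Classifier: sign(x'w - gamma) for each row x of X
    return np.sign(X @ w - gamma)
```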

Nonlinear PSVM Formulation. The linear PSVM has a linear separating surface x'w = gamma. By QP "duality", w can be written in terms of a dual variable u; maximizing the margin in the "dual space" and then replacing the linear inner-product term by a nonlinear kernel K(A, A') gives the nonlinear formulation (sketched below).
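A minimal sketch of the resulting dual-space problem with a kernel substituted for the inner-product term, assuming u is the dual variable:

```latex
% Reconstructed sketch: w = A' D u, with A A' replaced by the kernel K(A, A').
\[
\min_{u,\gamma}\;\; \tfrac{\nu}{2}\,\bigl\| e - D\bigl(K(A,A^{\top})\,D u - e\gamma\bigr) \bigr\|^{2}
+ \tfrac{1}{2}\,(u^{\top}u + \gamma^{2}) .
\]
```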

The Nonlinear Classifier. Here K is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel; the ij-th entry of the kernel matrix represents the "similarity" of data points A_i and A_j (sketched below).
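A minimal sketch of the Gaussian kernel entries and the corresponding nonlinear classifier, assuming mu > 0 is the kernel width parameter:

```latex
% Reconstructed sketch of the Gaussian kernel and the nonlinear classifier.
\[
K(A, A^{\top})_{ij} = \exp\bigl(-\mu\,\|A_i - A_j\|^{2}\bigr),
\qquad
\text{classifier: } \operatorname{sign}\bigl(K(x^{\top}, A^{\top})\,D u - \gamma\bigr).
\]
```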

Nonlinear PSVM. Similar to the linear case, E is defined slightly differently (with the kernel matrix in place of the data matrix); setting the gradient equal to zero, we again obtain a linear system. Here, however, the linear system to solve is of size m+1, the number of data points plus one. Reduced kernel techniques (RSVM) can be used to reduce this dimensionality.

Nonlinear Proximal SVM Algorithm: the same input, define, calculate, and solve steps as the linear algorithm, with the kernel matrix used in the definition of E and the nonlinear classifier produced as output.

Incremental PSVM Classification. Suppose we have two "blocks" of data. The linear system to solve depends only on compressed versions of the blocks, whose size is determined by the input dimension rather than by the number of points (sketched below).
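A minimal sketch of how the solution depends only on the compressed blocks, assuming E_i = [A_i  -e] for block i with label matrix D_i:

```latex
% Reconstructed sketch: only (n+1)-sized quantities need to be stored per block.
\[
\begin{bmatrix} w \\ \gamma \end{bmatrix}
= \Bigl(\tfrac{I}{\nu} + E_1^{\top}E_1 + E_2^{\top}E_2\Bigr)^{-1}
\bigl(E_1^{\top}D_1 e + E_2^{\top}D_2 e\bigr),
\qquad E_i^{\top}E_i \in \mathbb{R}^{(n+1)\times(n+1)} .
\]
```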

Linear Incremental Proximal SVM Algorithm: after initialization, read a block of data from disk, compute its compressed quantities, and update the accumulated matrices in memory; the raw block is then either discarded or kept, depending on the data-retirement policy; when no blocks remain, solve the small system and compute the output classifier (a sketch follows below).
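A minimal NumPy sketch of this incremental loop, assuming the block-accumulation form sketched above; only the (n+1)-by-(n+1) accumulator and an (n+1)-vector stay in memory, and the block iterator name is an illustrative placeholder, not from the talk:

```python
import numpy as np

def incremental_psvm(block_iterator, n, nu=1.0):
    """block_iterator yields (A_i, d_i): a data block and its +1/-1 label vector."""
    H = np.zeros((n + 1, n + 1))          # accumulates the sum of E_i' E_i
    rhs = np.zeros(n + 1)                 # accumulates the sum of E_i' D_i e
    for A_i, d_i in block_iterator:
        E_i = np.hstack([A_i, -np.ones((A_i.shape[0], 1))])
        H += E_i.T @ E_i
        rhs += E_i.T @ d_i
        # The raw block is no longer needed here; retiring an old block would
        # simply subtract its two contributions instead of adding them.
    sol = np.linalg.solve(np.eye(n + 1) / nu + H, rhs)
    return sol[:-1], sol[-1]              # w, gamma
```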

Linear Incremental Proximal SVM: Adding and Retiring Data
- Capable of modifying an existing linear classifier by both adding and retiring data.
- Retiring old data is handled similarly to adding new data (financial data, for example, where old data becomes obsolete).
- Old data can also be kept and merged with the new data (medical data, for example, where old data does not become obsolete).

Numerical Experiments: One-Billion-Point Two-Class Dataset
- Synthetic dataset of 1 billion points in 10-dimensional input space, generated by the NDC (Normally Distributed Clustered) dataset generator.
- Dataset divided into 500 blocks of 2 million points each.
- Solution obtained in less than 2 hours and 26 minutes; about 30% of the time was spent reading data from disk.
- Testing set correctness: 90.79%.

Numerical Experiments: Simulation of a Two-Month, 60-Million-Point Dataset
- Synthetic dataset of 60 million points (1 million per day) in 10-dimensional input space, generated using NDC.
- At the beginning, only the data corresponding to the first month is available.
- Every day, the oldest block of data (1 million points) is retired and a new block (1 million points) is added.
- A new linear classifier is calculated daily.
- Only an 11-by-11 matrix is kept in memory at the end of each day; all other data is purged.

Numerical Experiments: Separator Changing Through Time (figure)

Numerical Experiments: Normals to the Separating Hyperplanes at 5-Day Intervals (figure)

Support Vector Machines: Linear Programming Formulation. Use the 1-norm of w instead of the 2-norm; the resulting problem is equivalent to a linear program (sketched below).
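A minimal sketch of the 1-norm formulation and its LP equivalent, assuming s is the vector bounding the absolute values of the components of w:

```latex
% Reconstructed sketch of the 1-norm SVM and its linear-programming equivalent.
\[
\min_{w,\gamma,y}\; \nu\, e^{\top}y + \|w\|_{1}
\;\;\text{s.t.}\;\; D(Aw - e\gamma) + y \ge e,\; y \ge 0
\]
\[
\Longleftrightarrow\quad
\min_{w,\gamma,y,s}\; \nu\, e^{\top}y + e^{\top}s
\;\;\text{s.t.}\;\; D(Aw - e\gamma) + y \ge e,\; -s \le w \le s,\; y \ge 0 .
\]
```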

Support Vector Machines: Maximizing the Margin between Bounding Planes (figure)

Incorporating Knowledge Sets Into an SVM Classifier. Suppose that a given knowledge set belongs to the class A+. It must then lie in the halfspace on the A+ side of the bounding plane, which yields an implication (sketched below). We will show that this implication is equivalent to a set of constraints that can be imposed on the classification problem.
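A minimal sketch of the implication, assuming the knowledge set is the polyhedron {x : Bx <= b}:

```latex
% Reconstructed sketch: the knowledge set must lie in the A+ halfspace.
\[
Bx \le b \;\;\Longrightarrow\;\; x^{\top}w \ge \gamma + 1 .
\]
```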

Knowledge Set Equivalence Theorem
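A minimal sketch of the equivalence stated by the theorem, assuming the knowledge set {x : Bx <= b} is nonempty; the precise statement on the slide is not reproduced here:

```latex
% Sketch of the assumed equivalence (an LP-duality / Farkas-type argument).
\[
\bigl(Bx \le b \;\Rightarrow\; x^{\top}w \ge \gamma + 1\bigr)
\;\;\Longleftrightarrow\;\;
\exists\, u \ge 0:\;\; B^{\top}u + w = 0,\;\; b^{\top}u + \gamma + 1 \le 0 .
\]
```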

Knowledge-Based SVM Classification

Knowledge-Based SVM Classification. Adding one set of constraints for each knowledge set to the 1-norm SVM LP, we obtain the knowledge-based LP (sketched below).
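A minimal sketch of the resulting LP for a single knowledge set {x : Bx <= b} belonging to class A+; this is an assumed form, shown before the slack variables for the knowledge constraints are introduced on the next slide:

```latex
% Sketch (assumed form) of the 1-norm SVM LP with one hard knowledge-set constraint.
\[
\min_{w,\gamma,y,\,u \ge 0}\; \nu\, e^{\top}y + \|w\|_{1}
\quad\text{s.t.}\quad
D(Aw - e\gamma) + y \ge e,\;\; y \ge 0,\;\;
B^{\top}u + w = 0,\;\; b^{\top}u + \gamma + 1 \le 0 .
\]
```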

Knowledge-Based LP SVM with slack variables

Knowledge-Based SVM via Polyhedral Knowledge Sets

Numerical Testing: The Promoter Recognition Dataset. A promoter is a short DNA sequence that precedes a gene sequence; each promoter consists of 57 consecutive DNA nucleotides belonging to {A, G, C, T}. It is important to distinguish between promoters and nonpromoters, since this distinction identifies the starting locations of genes in long uncharacterized DNA sequences.

The Promoter Recognition Dataset Comparative Test Results

Minimal Kernel Classifiers: Model Simplification
- Goal #1: generate a very sparse solution vector. Why? It minimizes the number of kernel functions used, simplifies the separating surface, and reduces storage.
- Goal #2: minimize the number of active constraints. Why? It reduces data dependence and is useful for massive incremental classification.

Model Simplification Goal #1: Simplifying the Separating Surface. The nonlinear separating surface is a kernel expansion over the training points; it does not depend explicitly on a datapoint whose corresponding expansion coefficient is zero. Hence, minimize the number of nonzero coefficients.

Model Simplification Goal #2: Minimize Data Dependence. By the KKT conditions, only datapoints whose constraints are active contribute to the solution; hence, minimize the number of nonzero multipliers, i.e. the number of active constraints.

Achieving Model Simplification: Minimal Kernel Classifier Formulation. The formulation penalizes the classification error through a new loss function, denoted # (the "pound" loss), described on the following slides.

The (Pound) Loss Function #

Approximating the Pound Loss Function #

Minimal Kernel Classifier as a Concave Minimization Problem. The problem can be effectively solved using the finite Successive Linearization Algorithm (SLA) (Mangasarian 1996); a sketch of the approximation involved follows below.
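A minimal sketch of the concave exponential approximation commonly used with SLA for such step-counting terms, assuming alpha > 0 is a smoothing parameter; this is an assumed form, not copied from the slide:

```latex
% Sketch (assumed form) of a concave approximation of the step function.
\[
t_{*} \;\approx\; 1 - \exp(-\alpha t) \quad \text{for } t \ge 0,
\qquad\text{where } t_{*} =
\begin{cases} 0, & t = 0,\\ 1, & t > 0. \end{cases}
\]
```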

Minimal Kernel Algorithm (SLA). Start from an initial point; given the current iterate, determine the next iterate by solving the linear program obtained by linearizing the concave objective at the current iterate; stop when successive iterates no longer improve the objective.

Minimal Kernel Algorithm (SLA). Each iteration of the algorithm solves a linear program. The algorithm terminates in a finite number of iterations (typically 5 to 7). The solution obtained satisfies the minimum principle necessary optimality condition.

Checkerboard Separating Surface (figure): number of kernel functions = 27; number of active constraints = 30.

Conclusions (PSVM)
- PSVM is an extremely simple procedure for generating linear and nonlinear classifiers by solving a single system of linear equations.
- Test set correctness is comparable to that of standard SVMs.
- Training is much faster than standard SVMs, typically by an order of magnitude.
- We also proposed an extremely simple algorithm for generating linear classifiers incrementally for huge datasets.
- The incremental algorithm can retire old data and add new data in a very simple manner.
- Only a matrix of the size of the input space is kept in memory at any time.

Conclusions (KSVM)
- Prior knowledge is easily incorporated into classifiers through polyhedral knowledge sets.
- The resulting problem is a simple LP.
- Knowledge sets can be used with or without conventional labeled data.
- In either case, KSVM performs better than most knowledge-based classifiers.

Conclusions (Minimal Kernel Classifiers)
- A finite algorithm that generates a classifier depending on only a fraction of the input data.
- Important for fast online testing of unseen data, e.g. fraud or intrusion detection.
- Useful for incremental training on massive data.
- The overall algorithm consists of solving 5 to 7 LPs.
- Kernel data dependence is reduced by up to 98.8% relative to the data used by a standard SVM.
- Testing time is reduced by up to 98.2%.
- MKC testing set correctness is comparable to that of the more complex standard SVM.