Support Vector Machines (Chapter 12)

Outline
- Separating hyperplanes: the separable case
- Extension to the non-separable case: the support vector classifier
- Nonlinear SVMs
- The SVM as a penalization method
- SVM regression

Separating Hyperplanes
- The separating hyperplane with maximum margin is likely to perform well on test data.
- In this example the maximum-margin hyperplane is almost identical to the more standard linear logistic regression boundary.

Distance to Hyperplanes
- Consider the hyperplane L = {x : β0 + βᵀx = 0}. For any point x0 in L, βᵀx0 = -β0.
- The signed distance of any point x to L is (1/‖β‖)(βᵀx + β0).
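
As a quick numerical check of the signed-distance formula, here is a minimal sketch; the hyperplane coefficients and the test point are invented for the example.

```python
import numpy as np

# Hyperplane L = {x : beta0 + beta^T x = 0}; coefficients chosen arbitrarily
beta = np.array([3.0, 4.0])
beta0 = -5.0

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane beta0 + beta^T x = 0."""
    return (beta @ x + beta0) / np.linalg.norm(beta)

x = np.array([2.0, 1.0])
print(signed_distance(x, beta, beta0))   # (3*2 + 4*1 - 5) / 5 = 1.0
```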

Maximum Margin Classifier
- Found by quadratic programming (convex optimization)
- Solution determined by just a few points (support vectors) near the boundary
- Sparse solution in the dual space
- Decision function: Ĝ(x) = sign(βᵀx + β0)
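
These properties can be seen on a toy problem. A minimal sketch using scikit-learn's SVC with a linear kernel and a very large cost C, which approximates the hard-margin (separable-case) classifier; the two-blob data are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs: linearly separable with high probability
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin maximum-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors per class:", clf.n_support_)          # typically very few
print("all training points classified correctly:",
      bool(np.all(np.sign(clf.decision_function(X)) == y)))
```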

Non-separable Case: Standard Support Vector Classifier
- Maximize the margin while allowing some slack: max M subject to yi(β0 + βᵀxi) ≥ M(1 − ξi), ξi ≥ 0, Σξi ≤ constant, ‖β‖ = 1
- This problem is computationally equivalent to min (1/2)‖β‖² + C Σξi subject to yi(β0 + βᵀxi) ≥ 1 − ξi, ξi ≥ 0

Computation of the SVM
- Lagrange (primal) function: LP = (1/2)‖β‖² + C Σξi − Σ αi[yi(β0 + βᵀxi) − (1 − ξi)] − Σ μiξi
- Minimize with respect to β, β0 and ξi by setting the derivatives to zero:
  β = Σ αi yi xi,   0 = Σ αi yi,   αi = C − μi   (with αi, μi, ξi ≥ 0)

Computation of the SVM
- Lagrange (dual) function: LD = Σ αi − (1/2) Σi Σi' αi αi' yi yi' xiᵀxi'
- Maximize LD subject to the constraints 0 ≤ αi ≤ C and Σ αi yi = 0
- Karush-Kuhn-Tucker conditions: αi[yi(β0 + βᵀxi) − (1 − ξi)] = 0,  μiξi = 0,  yi(β0 + βᵀxi) − (1 − ξi) ≥ 0

Computation of the SVM
- The final solution: β̂ = Σ α̂i yi xi, where α̂i is nonzero only for the support vectors
- β̂0 is obtained from any margin point with 0 < α̂i < C, using yi(β̂0 + β̂ᵀxi) = 1
- Decision function: Ĝ(x) = sign(β̂0 + β̂ᵀx)
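
The form of the final solution can be checked numerically: for a linear kernel, scikit-learn exposes the products α̂i yi for the support vectors (dual_coef_), so β̂ = Σ α̂i yi xi can be recomputed by hand and compared with the fitted coefficients. A minimal sketch on invented data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1.0, size=(30, 2)),
               rng.normal(+1.5, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
beta_hat = clf.dual_coef_ @ clf.support_vectors_        # = sum_i alpha_i y_i x_i
print("beta_hat matches clf.coef_:", np.allclose(beta_hat, clf.coef_))
print("all alphas satisfy 0 < alpha_i <= C:",
      bool(np.all((np.abs(clf.dual_coef_) > 0) & (np.abs(clf.dual_coef_) <= C + 1e-9))))
```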

Example: Mixture Data

SVMs for Large p, Small n
- Suppose we have 5000 genes (p) and 50 samples (n), divided into two classes
- Many more variables than observations
- Infinitely many separating hyperplanes in this feature space
- SVMs provide the unique maximal-margin separating hyperplane
- Prediction performance can be good, but typically no better than simpler methods such as nearest centroids
- All genes get a weight, so there is no gene selection
- May overfit the data

Non-Linear SVM via Kernels
- Note that the SVM classifier involves the data only through inner products ⟨xi, xj⟩ = xiᵀxj
- Enlarge the feature space with a basis expansion Φ(x)
- Replacing xiᵀxj by an appropriate kernel K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ provides a nonlinear SVM in the input space
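
The kernel substitution can be verified directly in a small case: for x in R², the degree-2 polynomial kernel K(x, z) = (1 + xᵀz)² equals the inner product of explicit six-dimensional feature maps. A minimal sketch; the feature map below is the standard expansion for this particular kernel.

```python
import numpy as np

def phi(x):
    """Explicit feature map with <phi(x), phi(z)> = (1 + x^T z)^2 for x in R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    return (1.0 + x @ z) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(poly_kernel(x, z), phi(x) @ phi(z))   # identical up to rounding error
```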

Popular Kernels
- dth-degree polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d
- Radial basis: K(x, x') = exp(−‖x − x'‖² / c)
- Neural network (sigmoid): K(x, x') = tanh(κ1⟨x, x'⟩ + κ2)
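
A minimal sketch implementing these three kernels as Gram-matrix functions (the function and parameter names are my own choices for illustration):

```python
import numpy as np

def polynomial_gram(X, Z, degree=2):
    return (1.0 + X @ Z.T) ** degree

def rbf_gram(X, Z, c=1.0):
    # K(x, z) = exp(-||x - z||^2 / c); smaller c gives a wigglier decision boundary
    sq_dist = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq_dist / c)

def sigmoid_gram(X, Z, kappa1=1.0, kappa2=0.0):
    return np.tanh(kappa1 * X @ Z.T + kappa2)

X = np.random.default_rng(0).normal(size=(5, 3))
print(rbf_gram(X, X).shape)                         # (5, 5)
print(np.allclose(np.diag(rbf_gram(X, X)), 1.0))    # ones on the diagonal
```

Any of these Gram matrices can be passed to an SVM implementation that accepts precomputed kernels (e.g., scikit-learn's SVC(kernel="precomputed")).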

Kernel SVM: Mixture Data

Radial Basis Kernel
- The radial basis kernel corresponds to an infinite-dimensional basis expansion: Φ(x) has infinitely many components
- The smaller the bandwidth c, the more wiggly the decision boundary and hence the less overlap between the classes on the training data
- The kernel trick does not allow the coefficients of all basis elements to be freely determined

SVM as a Penalization Method
- For f(x) = h(x)ᵀβ + β0, consider the problem min over β0, β of Σ [1 − yi f(xi)]+ + (λ/2)‖β‖²
- This has the form "margin (hinge) loss + penalty"
- For λ = 1/C, the penalized setup leads to the same solution as the SVM
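
To make the loss-plus-penalty view concrete, here is a minimal subgradient-descent sketch for the hinge loss with an L2 penalty; the step size, iteration count, and toy data are arbitrary choices rather than tuned values.

```python
import numpy as np

def fit_hinge_ridge(X, y, lam=0.1, lr=0.01, n_iter=2000):
    """Minimize sum_i [1 - y_i (x_i^T beta + beta0)]_+ + (lam/2) ||beta||^2."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta0)
        active = margins < 1                  # only points inside the margin contribute
        g_beta = -(y[active, None] * X[active]).sum(axis=0) + lam * beta
        g_beta0 = -y[active].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(+1, 1, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)
beta, beta0 = fit_hinge_ridge(X, y)
print("training accuracy:", np.mean(np.sign(X @ beta + beta0) == y))
```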

SVM and other Loss Functions

Population Minimizers for Two Loss Functions

Logistic Regression with Log-likelihood Loss

Curse of Dimensionality in SVM

SVM Loss Functions for Regression
- ε-insensitive loss: Vε(r) = 0 if |r| < ε, and |r| − ε otherwise, so residuals smaller than ε are ignored
- Compare with Huber-type robust losses, which are quadratic for small residuals and linear for large ones
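
A minimal sketch of the ε-insensitive idea using scikit-learn's SVR on an invented sinusoid (the ε and C values are arbitrary): only points whose residuals reach the edge of the ε-tube become support vectors, so a wider tube yields a sparser fit.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    # Larger epsilon -> wider insensitive tube -> fewer support vectors
    print(f"epsilon={eps}: {len(svr.support_)} support vectors out of {len(y)}")
```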

Example (figures)

Generalized Discriminant Analysis (Chapter 12)

Outline
- Flexible Discriminant Analysis (FDA)
- Penalized Discriminant Analysis (PDA)
- Mixture Discriminant Analysis (MDA)

Linear Discriminant Analysis
- Let P(G = k) = πk and P(X = x | G = k) = fk(x); then P(G = k | X = x) = πk fk(x) / Σl πl fl(x)
- Assume fk(x) ~ N(μk, Σk) and Σ1 = Σ2 = … = ΣK = Σ
- Then we can show the decision rule is (HW #1): G(x) = argmaxk δk(x), with δk(x) = xᵀΣ⁻¹μk − (1/2)μkᵀΣ⁻¹μk + log πk

LDA (continued)
- Plug in the estimates: π̂k = Nk/N,  μ̂k = (1/Nk) Σ{gi=k} xi,  Σ̂ = (1/(N − K)) Σk Σ{gi=k} (xi − μ̂k)(xi − μ̂k)ᵀ
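
A minimal from-scratch sketch of the plug-in rule, computing the linear discriminant functions δk(x) above on invented three-class Gaussian data:

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in estimates: priors, class means, pooled within-class covariance."""
    n, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum(np.cov(X[g == k].T) * ((g == k).sum() - 1) for k in range(K)) / (n - K)
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(X, priors, means, Sigma_inv):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    scores = (X @ Sigma_inv @ means.T
              - 0.5 * np.einsum("kp,pq,kq->k", means, Sigma_inv, means)
              + np.log(priors))
    return scores.argmax(axis=1)

rng = np.random.default_rng(4)
X = np.vstack([rng.multivariate_normal(m, np.eye(2), 50)
               for m in ([0, 0], [3, 0], [0, 3])])
g = np.repeat([0, 1, 2], 50)
priors, means, Sigma_inv = lda_fit(X, g, K=3)
print("training accuracy:", (lda_predict(X, priors, means, Sigma_inv) == g).mean())
```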

LDA Example
- Figure panels: prediction vectors and data
- In this three-class problem, the middle class is classified correctly

LDA Example: 11 classes and X ∈ R^10

Virtues and Failings of LDA
- Simple prototype (centroid) classifier: a new observation is classified into the class with the closest centroid, but using Mahalanobis distance
- Simple decision rules based on linear decision boundaries
- Estimated Bayes classifier for Gaussian class conditionals, but the data might not be Gaussian
- Provides a low-dimensional view of the data, using the discriminant functions as coordinates
- Often produces the best classification results, owing to its simplicity and low variance in estimation

Virtues and Failings of LDA
- LDA may fail in a number of situations:
- Linear boundaries often fail to separate the classes; with large N we could estimate a quadratic decision boundary, or we may want to model even more irregular (nonlinear) boundaries
- A single prototype per class may not be sufficient
- There may be many (correlated) predictors, e.g., for digitized analog signals; too many parameters are then estimated with high variance and performance suffers, so we may want to regularize

Generalizations of LDA
- Flexible Discriminant Analysis (FDA): LDA in an enlarged space of predictors via basis expansions
- Penalized Discriminant Analysis (PDA): with too many predictors we do not want to expand the set further (it is already too large); instead, fit the LDA model with the coefficients penalized to be smooth/coherent in the spatial domain. With a large number of predictors one could also use penalized FDA
- Mixture Discriminant Analysis (MDA): model each class by a mixture of two or more Gaussians with different centroids, all sharing the same covariance matrix; allows for subspace reduction

Flexible Discriminant Analysis
- Linear regression on derived responses for the K-class problem
- Define indicator variables for each class (K in all) and use these indicator responses to create a set of Y variables
- Obtain mutually orthogonal score functions as discriminant (canonical) variables
- Classify into the nearest class centroid, using the Mahalanobis distance of a test point x to the kth class centroid

Flexible Discriminant Analysis
- Mahalanobis distance of a test point x to the kth class centroid, computed in the space of fitted regression functions
- We can replace the linear regression fits by nonparametric fits, e.g., generalized additive fits, spline functions, MARS models, etc., with a regularizer, or kernel regression, and possibly reduced-rank regression

Computation of FDA
1. Multivariate nonparametric regression: fit the indicator responses on X
2. Optimal scores: compute the optimal scores for the classes from the fitted values
3. Update the model from step 1 using the optimal scores

Example of FDA
- Two Gaussian classes, N(0, I) and N(0, 9I/4)
- Figure: the Bayes decision boundary and the FDA fit using degree-two polynomial regression
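
A fit of this kind can be sketched by using the equivalence noted earlier: FDA with a polynomial regression fit amounts to LDA in the enlarged space of polynomial basis functions. A minimal sketch on invented data drawn from the two Gaussians above (the pipeline below is scikit-learn machinery, not the book's BRUTO/MARS implementation):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
# Two zero-mean Gaussian classes with different spreads: not linearly separable
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 200),
               rng.multivariate_normal([0, 0], 9 / 4 * np.eye(2), 200)])
y = np.repeat([0, 1], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
fda_poly2 = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          LinearDiscriminantAnalysis()).fit(X, y)

print("plain LDA training accuracy:     ", lda.score(X, y))        # near chance
print("degree-2 'FDA' training accuracy:", fda_poly2.score(X, y))  # clearly better
```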

Speech Recognition Data
- K = 11 classes: spoken vowel sounds
- p = 10 predictors extracted from digitized speech
- FDA/BRUTO uses adaptive additive-spline regression (BRUTO in S-PLUS)
- FDA/MARS uses multivariate adaptive regression splines; degree = 2 allows pairwise products

LDA vs. FDA/BRUTO

Penalized Discriminant Analysis
- PDA is regularized discriminant analysis on an enlarged set of predictors obtained via a basis expansion

Penalized Discriminant Analysis
- PDA enlarges the predictors to h(x)
- Use LDA in the enlarged space, with the penalized Mahalanobis distance D(x, μ) = (h(x) − h(μ))ᵀ(ΣW + λΩ)⁻¹(h(x) − h(μ)), where ΣW is the within-class covariance of the derived variables and Ω is a penalty matrix
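
A minimal sketch of a nearest-centroid classifier under this penalized metric. Here the penalty matrix Ω is just the identity, standing in for whatever roughness penalty would be used in practice, and the class centroids are taken as the means of the derived variables h(xi).

```python
import numpy as np

def pda_fit(H, g, K, lam=1.0, Omega=None):
    """H: basis-expanded inputs h(x_i). Returns class centroids and penalized metric."""
    n, q = H.shape
    if Omega is None:
        Omega = np.eye(q)                         # placeholder for a roughness penalty
    means = np.array([H[g == k].mean(axis=0) for k in range(K)])
    Sigma_W = sum(np.cov(H[g == k].T) * ((g == k).sum() - 1)
                  for k in range(K)) / (n - K)    # within-class covariance
    metric = np.linalg.inv(Sigma_W + lam * Omega)
    return means, metric

def pda_predict(H, means, metric):
    # D(x, mu_k) = (h(x) - centroid_k)^T (Sigma_W + lam*Omega)^{-1} (h(x) - centroid_k)
    diffs = H[:, None, :] - means[None, :, :]     # shape (n, K, q)
    D = np.einsum("nkq,qr,nkr->nk", diffs, metric, diffs)
    return D.argmin(axis=1)

rng = np.random.default_rng(6)
H = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(1.5, 1.0, (60, 5))])
g = np.repeat([0, 1], 60)
means, metric = pda_fit(H, g, K=2, lam=5.0)
print("training accuracy:", (pda_predict(H, means, metric) == g).mean())
```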

Penalized Discriminant Analysis
- Decompose the classification subspace using the penalized metric: maximize uᵀΣB u with respect to u, subject to uᵀ(ΣW + λΩ)u = 1, where ΣB is the between-class covariance

USPS Digit Recognition

Digit Recognition: LDA vs. PDA

PDA Canonical Variates

Mixture Discriminant Analysis
- The class-conditional densities are modeled as mixtures of Gaussians, with possibly a different number of components in each class
- Estimate the centroids and mixing proportions in each subclass by maximizing the joint likelihood P(G, X)
- EM algorithm for maximum likelihood estimation
- Could use penalized estimation
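
A simplified sketch of the mixture idea using scikit-learn's GaussianMixture: fit a small Gaussian mixture per class (each fit uses EM internally) and classify by the largest prior-weighted class density. Unlike full MDA, the covariance is not shared across classes and the mixtures are fit independently per class; this is only meant to illustrate the modeling idea, on invented bimodal data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mda_like_fit(X, g, K, n_components=2):
    models, log_priors = [], []
    for k in range(K):
        Xk = X[g == k]
        # EM for a Gaussian mixture within class k (tied covariance within the class)
        gm = GaussianMixture(n_components=n_components,
                             covariance_type="tied", random_state=0).fit(Xk)
        models.append(gm)
        log_priors.append(np.log(len(Xk) / len(X)))
    return models, np.array(log_priors)

def mda_like_predict(X, models, log_priors):
    # Bayes rule: argmax_k  log pi_k + log p_k(x)
    scores = np.column_stack([lp + gm.score_samples(X)
                              for gm, lp in zip(models, log_priors)])
    return scores.argmax(axis=1)

rng = np.random.default_rng(7)
# Each class is itself bimodal, so a single Gaussian per class would fit poorly
X = np.vstack([rng.normal(-3, 0.7, (50, 2)), rng.normal(3, 0.7, (50, 2)),
               rng.normal([0, 3], 0.7, (50, 2)), rng.normal([0, -3], 0.7, (50, 2))])
g = np.repeat([0, 0, 1, 1], 50)
models, log_priors = mda_like_fit(X, g, K=2, n_components=2)
print("training accuracy:", (mda_like_predict(X, models, log_priors) == g).mean())
```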

FDA and MDA

Waveform Signal with Additive Gaussian Noise
- Class 1: Xj = U h1(j) + (1 − U) h2(j) + εj
- Class 2: Xj = U h1(j) + (1 − U) h3(j) + εj
- Class 3: Xj = U h2(j) + (1 − U) h3(j) + εj
- where j = 1, …, 21, U ~ Unif(0, 1), the εj are Gaussian noise, and
  h1(j) = max(6 − |j − 11|, 0),  h2(j) = h1(j − 4),  h3(j) = h1(j + 4)
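
A minimal sketch that simulates this dataset (standard-normal εj assumed, since the slide only states that the noise is Gaussian):

```python
import numpy as np

def h1(j):
    return np.maximum(6 - np.abs(j - 11), 0)

def h2(j):
    return h1(j - 4)

def h3(j):
    return h1(j + 4)

def waveform(n_per_class, rng):
    j = np.arange(1, 22)                       # j = 1, ..., 21
    pairs = [(h1, h2), (h1, h3), (h2, h3)]     # classes 1, 2, 3
    X, y = [], []
    for label, (ha, hb) in enumerate(pairs, start=1):
        U = rng.uniform(0, 1, size=(n_per_class, 1))
        eps = rng.normal(size=(n_per_class, 21))
        X.append(U * ha(j) + (1 - U) * hb(j) + eps)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

rng = np.random.default_rng(8)
X, y = waveform(100, rng)
print(X.shape, np.bincount(y))    # (300, 21) and counts [0, 100, 100, 100]
```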

Waveform Data Results