Support Vector Machine


Support Vector Machine Rong Jin

Linear Classifiers: How do we find a linear decision boundary that separates data points from two classes (denoted +1 and -1)?


Linear Classifiers: Any of these boundaries would be fine… but which is best? Copyright © 2001, 2003, Andrew W. Moore

Classifier Margin: Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Maximum Margin: The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).
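A minimal sketch of the idea, assuming scikit-learn is available (any SVM library would do): fit a linear SVM on a tiny toy dataset and recover the margin width, which under the usual canonical scaling equals 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D-style data embedded in 2-D: negatives at x=0,1 and positives at x=3,4.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin maximum margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # width of the "street" between the classes
```

Here the closest points of the two classes are at x = 1 and x = 3, so the widest street has width 2.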

Why Maximum Margin? If we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), the maximum margin gives us the least chance of causing a misclassification. There is some theory (using VC dimension) that is related to, but not the same as, the proposition that this is a good thing. Empirically it works very well.

Estimate the Margin: What is the distance from a point x to the hyperplane wx + b = 0? It is |wx + b| / ||w||.
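The distance formula |w·x + b| / ||w|| can be checked numerically with plain NumPy (the function name here is illustrative):

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0: |w.x + b| / ||w||."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Example: hyperplane 3*x1 + 4*x2 = 0 and the point (1, 1):
# |3 + 4| / sqrt(3^2 + 4^2) = 7 / 5 = 1.4
d = distance_to_hyperplane(np.array([3.0, 4.0]), 0.0, np.array([1.0, 1.0]))
```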

Estimate the Margin: What is the classification margin for wx + b = 0? It is the distance from the boundary to the closest data point, min_i |w x_i + b| / ||w||.

Maximize the Classification Margin: Choose w and b to maximize this margin, subject to every training point being classified correctly.

Maximum Margin: Equivalently, fix the scaling so that the closest points satisfy y_i (w x_i + b) = 1, and minimize ||w||^2 / 2 subject to y_i (w x_i + b) >= 1 for all i.


Maximum Margin: This is a quadratic programming problem: a quadratic objective function with linear equality and inequality constraints, a well-studied problem in operations research.

Quadratic Programming: Find the argmax of a quadratic criterion, subject to n additional linear inequality constraints and e additional linear equality constraints.
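To make the QP concrete, here is a sketch (assuming SciPy; a dedicated QP solver would normally be used) that solves the hard-margin SVM dual on four 1-D points with a generic constrained optimizer:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Dual: maximize sum(a) - 1/2 a^T Q a with Q_ij = y_i y_j (x_i . x_j),
# subject to a_i >= 0 and sum_i a_i y_i = 0.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Recover the primal solution: w = sum_i a_i y_i x_i, b from a support vector.
w = ((alpha * y)[:, None] * X).sum(axis=0)
sv = alpha > 1e-6
b = np.mean(y[sv] - (X[sv] @ w))
```

For this data the support vectors are x = 1 and x = 3, giving w = 1 and b = -2 (boundary at x = 2).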

Linearly Inseparable Case: This is going to be a problem! What should we do?

Linearly Inseparable Case: Relax the constraints by introducing slack variables, and penalize the relaxation in the objective.


Linearly Inseparable Case: The relaxed (soft-margin) formulation is still a quadratic programming problem.


Linearly Inseparable Case: The support vector machine and regularized logistic regression share the same structure, a loss over the training data plus a penalty on ||w||; they differ only in the loss function (hinge loss versus logistic loss).
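The two losses can be compared directly as functions of the signed margin z = y (w·x + b); a small NumPy sketch:

```python
import numpy as np

def hinge_loss(z):
    """SVM loss: zero once the point is beyond the margin (z >= 1)."""
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    """Regularized logistic regression loss: log(1 + e^{-z}), never exactly zero."""
    return np.log1p(np.exp(-z))

z = np.linspace(-2.0, 2.0, 5)   # margins -2, -1, 0, 1, 2
h = hinge_loss(z)               # 3, 2, 1, 0, 0
```

Both penalize misclassified and barely-classified points; the hinge loss is exactly zero beyond the margin, which is what produces sparse support vectors.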

Dual Form of SVM: Maximize sum_i a_i - (1/2) sum_{i,j} a_i a_j y_i y_j (x_i . x_j), subject to a_i >= 0 and sum_i a_i y_i = 0, with w = sum_i a_i y_i x_i. How do we decide b? For any support vector on the margin, y_i (w x_i + b) = 1, so b = y_i - w x_i.


Suppose we're in 1 dimension: What would SVMs do with this data (points on a line, around x = 0)?

Suppose we're in 1 dimension: Not a big surprise: the SVM puts the boundary midway between the classes, with a positive "plane" and a negative "plane" on either side of x = 0.

Harder 1-Dimensional Dataset: One class surrounds the other on the line, so no single threshold separates them. What can be done?


Harder 1-Dimensional Dataset: Remap each point with basis functions, e.g. z = (x, x^2); in the new space the classes become linearly separable.
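The remapping trick in a few lines of NumPy. The data and the separating hyperplane below are illustrative, hand-picked for the example (an SVM would find the widest such hyperplane):

```python
import numpy as np

x = np.array([-3.0, -1.0, 1.0, 3.0])
y = np.array([-1, 1, 1, -1])      # positives between the negatives: not separable on the line

# Remap each point with the basis functions z = (x, x^2).
Z = np.stack([x, x ** 2], axis=1)

# In the new 2-D space the linear rule sign(5 - x^2) separates the classes perfectly.
pred = np.sign(5.0 - Z[:, 1])
```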

Common SVM Basis Functions: polynomial terms of x of degree 1 to q, and radial (Gaussian) basis functions.

Quadratic Basis Functions: a constant term, linear terms, pure quadratic terms, and quadratic cross-terms. The number of terms (assuming m input dimensions) is (m+2)-choose-2 = (m+2)(m+1)/2, roughly m^2/2.


Quadratic Dot Products: Expand the dot product Phi(a) . Phi(b) in the quadratic feature space term by term: constant, linear, pure quadratic, and cross terms.

Quadratic Dot Products: Just out of casual, innocent interest, let's look at another function of a and b: (a . b + 1)^2. Expanding it gives exactly the same terms, so Phi(a) . Phi(b) = (a . b + 1)^2.
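The identity can be verified numerically. For 2-D inputs the quadratic feature map is Phi(a) = (1, sqrt(2) a1, sqrt(2) a2, a1^2, a2^2, sqrt(2) a1 a2); a NumPy check:

```python
import numpy as np

def phi(a):
    """Quadratic feature map: constant, linear, pure quadratic, and cross terms."""
    a1, a2 = a
    return np.array([1.0,
                     np.sqrt(2) * a1, np.sqrt(2) * a2,
                     a1 ** 2, a2 ** 2,
                     np.sqrt(2) * a1 * a2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)       # dot product computed in the 6-D feature space
kernel = (a @ b + 1.0) ** 2      # the same value computed directly in 2-D
```

The kernel side costs O(m) instead of O(m^2), which is the whole point.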

Kernel Trick: In the dual form, the data appear only through dot products x_i . x_j, so we can replace every dot product with a kernel function K(x_i, x_j) and never compute Phi explicitly.

SVM Kernel Functions: the polynomial kernel K(a, b) = (a . b + 1)^q, and the radial basis kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), a universal kernel.

Kernel Tricks: Replacing the dot product with a kernel function works only for valid kernels, and not all functions are kernel functions. How do we tell whether a given function is one?

Kernel Tricks: Mercer's condition. To expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Phi(x) . Phi(y), k(x, y) has to be a positive semi-definite function; that is, for any function f(x) whose squared integral is finite, the inequality integral f(x) k(x, y) f(y) dx dy >= 0 must hold.
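A finite-sample analogue of Mercer's condition, sketched in NumPy: for any set of points, the Gram matrix K_ij = k(x_i, x_j) of a valid kernel is positive semi-definite, which we can check via its eigenvalues. The counterexample k(x, y) = -||x - y||^2 is symmetric but fails the check:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Pairwise squared distances.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# Gaussian (RBF) kernel: a valid kernel, so its Gram matrix is PSD.
K_rbf = np.exp(-sq / 2.0)
psd = np.linalg.eigvalsh(K_rbf).min() > -1e-8

# Negative squared distance: symmetric but NOT a kernel (trace is zero, so
# any nonzero matrix of this form has a negative eigenvalue).
K_bad = -sq
bad = np.linalg.eigvalsh(K_bad).min() < -1e-8
```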

Kernel Tricks: They introduce nonlinearity into the model while remaining computationally efficient.

Nonlinear Kernel (I)

Nonlinear Kernel (II)

Reproducing Kernel Hilbert Space (RKHS): By Mercer's theorem the kernel has an eigen-decomposition k(x, y) = sum_i lambda_i phi_i(x) phi_i(y). Elements of the space H are functions f = sum_i f_i phi_i, and the reproducing property states that <f, k(x, .)> = f(x).

Reproducing Kernel Hilbert Space (RKHS): Representer theorem: the minimizer of a regularized empirical loss over H can be written as a finite expansion over the training points, f(x) = sum_i alpha_i k(x_i, x).

Kernelize Logistic Regression: How can we introduce nonlinearity into the logistic regression model? By the representer theorem, replace the linear score w . x with sum_i alpha_i k(x_i, x).

Diffusion Kernel: A kernel function describes the correlation or similarity between two data points. Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition. How can we generate a kernel function from it? A graph-theory approach…

Diffusion Kernel: Create a graph over all the training examples; each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y). The resulting graph Laplacian is negative semi-definite.

Diffusion Kernel: Consider a simple Laplacian L, and then L^2, L^3, … What do these matrices represent? Powers of L propagate similarity along paths of increasing length, and summing the series gives a diffusion kernel, the matrix exponential K = exp(beta L).
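A small sketch (assuming SciPy for the matrix exponential): build the Laplacian for a 3-node chain graph, with off-diagonal entries s(x_i, x_j) and diagonal entries -sum_j s(x_i, x_j), and exponentiate it. The chain topology and beta value are illustrative:

```python
import numpy as np
from scipy.linalg import expm

# Similarities for a chain 0 - 1 - 2 (no direct edge between 0 and 2).
S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# Graph Laplacian with the deck's sign convention (negative semi-definite).
L = S - np.diag(S.sum(axis=1))

# Diffusion kernel with beta = 0.5.
K = expm(0.5 * L)

symmetric = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()   # all eigenvalues e^{beta*lambda} > 0
```

Note that K[0, 2] > 0 even though S[0, 2] = 0: diffusion propagates similarity through node 1, which is exactly what the powers L^2, L^3, … contribute.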

Diffusion Kernel: Properties: K = exp(beta L) is positive definite. How do we compute the diffusion kernel? Through the eigen-decomposition of L: if L = U D U', then exp(beta L) = U exp(beta D) U'.

Doing Multi-class Classification: SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs: SVM 1 learns "Output == 1" vs "Output != 1", SVM 2 learns "Output == 2" vs "Output != 2", …, SVM N learns "Output == N" vs "Output != N". Then, to predict the output for a new input, predict with each SVM and choose the class whose SVM puts the prediction furthest into the positive region.
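The one-vs-rest recipe above, sketched with scikit-learn's SVC (an assumption of this example; the toy data and class labels are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Three well-separated clusters, two points each.
X = np.array([[0.0, 0.0], [0.5, 0.0],
              [5.0, 0.0], [5.5, 0.0],
              [0.0, 5.0], [0.5, 5.0]])
y = np.array([0, 0, 1, 1, 2, 2])

# Train one binary SVM per class: "class k" vs "not class k".
machines = {k: SVC(kernel="linear").fit(X, y == k) for k in np.unique(y)}

def predict(x):
    # Pick the class whose SVM puts x furthest into its positive region.
    scores = {k: m.decision_function(x[None, :])[0] for k, m in machines.items()}
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
```

scikit-learn also offers this strategy pre-packaged, but writing it out makes the argmax-over-decision-values step explicit.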

Kernel Learning: What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels? Any convex combination of valid kernels is again a valid kernel, and the combination weights can themselves be learned.


References:
An excellent tutorial on VC dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
The VC/SRM/SVM bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Software: SVM-light, http://svmlight.joachims.org/, free download.