CS 4700: Foundations of Artificial Intelligence


CS 4700: Foundations of Artificial Intelligence
Carla P. Gomes (gomes@cs.cornell.edu)
Module: SVM (Reading: Chapter 20.6)
Adapted from Martin Law's slides: http://www.henrykautz.org/ai/intro_SVM_new.ppt

Support Vector Machines (SVM)

- Supervised learning methods for classification and regression: a relatively new class of successful learning methods that can represent non-linear functions and have an efficient training algorithm.
- Derived from statistical learning theory by Vapnik and Chervonenkis; first introduced at COLT-92 by Boser, Guyon, and Vapnik (COLT = Computational Learning Theory).
- SVMs entered the mainstream because of their exceptional performance on handwritten digit recognition: a 1.1% error rate, comparable to a very carefully constructed (and complex) ANN.

Two-Class Problem: Linearly Separable Case

- Many decision boundaries can separate these two classes (Class 1 vs. Class 2 in the figure). Which one should we choose?
- The perceptron learning rule can be used to find *any* decision boundary between class 1 and class 2 (a minimal sketch follows).
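The perceptron rule mentioned above fits in a few lines. This is a minimal sketch (function name, step size, and stopping rule are ours, not from the slides):

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Find *some* separating hyperplane w^T x + b = 0 for linearly
    separable data with labels y in {-1, +1} (no margin guarantee)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi        # nudge the hyperplane toward xi
                b += lr * yi
                updated = True
        if not updated:                  # no mistakes in a full pass: done
            break
    return w, b
```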

Example of Bad Decision Boundaries

[Figure: two plots of the same Class 1 / Class 2 data, each with a separating boundary that passes very close to the points of one class.]

Good Decision Boundary: Margin Should Be Large

- The decision boundary should be as far away from the data of both classes as possible: we should maximize the margin, m.
- Support vectors are the data points that the margin pushes up against.
- The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).

[Figure: Class 1 and Class 2 separated by a hyperplane, with the margin m marked between the support vectors of the two classes.]

The Optimization Problem

- Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i.
- The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i.
- This is a constrained optimization problem: minimize (1/2) ||w||^2 over w and b, subject to y_i (w^T x_i + b) ≥ 1 for all i, where ||w||^2 = w^T w.

Lagrangian of Original Problem

- The Lagrangian is L(w, b, α) = (1/2) w^T w − Σ_i α_i [y_i (w^T x_i + b) − 1], with Lagrange multipliers α_i ≥ 0. (Note that ||w||^2 = w^T w.)
- Setting the gradient of L with respect to w and b to zero, we get w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.

The Dual Optimization Problem

- We can transform the problem to its dual: maximize W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j, subject to α_i ≥ 0 and Σ_i α_i y_i = 0. The new variables α_i are the Lagrange multipliers, and the data x_i appear only through dot products.
- This is a convex quadratic programming (QP) problem, so the global maximum over the α_i can always be found; well-established tools exist for solving it (e.g., CPLEX).
- b can be recovered from one support vector of each class: if x^(1) and x^(-1) are two such support vectors, then b = -1/2 (w^T x^(1) + w^T x^(-1)).
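As an illustration of the dual above, the following sketch solves the hard-margin dual with SciPy's general-purpose SLSQP solver; a dedicated QP package such as CPLEX or LIBSVM would be the practical choice, and the function and variable names here are ours:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_hard_margin(X, y):
    """Maximize W(a) = sum(a) - 1/2 a^T H a with H_ij = y_i y_j x_i.x_j,
    subject to a_i >= 0 and sum_i a_i y_i = 0 (hard-margin dual)."""
    n = len(y)
    Z = y[:, None] * X                                  # rows are y_i x_i
    H = Z @ Z.T                                         # H_ij = y_i y_j <x_i, x_j>
    fun = lambda a: 0.5 * a @ H @ a - a.sum()           # minimize the negative dual
    jac = lambda a: H @ a - np.ones(n)
    cons = [{'type': 'eq', 'fun': lambda a: a @ y}]     # sum_i a_i y_i = 0
    res = minimize(fun, np.zeros(n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n, constraints=cons)
    a = res.x
    w = Z.T @ a                                         # w = sum_i a_i y_i x_i
    sv = a > 1e-6                                       # support vectors have a_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                      # from y_i (w.x_i + b) = 1
    return w, b, a
```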

A Geometrical Interpretation

- The support vectors are the points whose α values are different from zero; they "hold up" the separating plane.
- If we move or change the internal points (those with α_i = 0), there is no effect on the decision boundary.

[Figure: Class 1 and Class 2 points labeled with their multipliers; the support vectors have α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and all other points have α_i = 0.]

Non-linearly Separable Problems

- We allow an "error" ξ_i (a slack variable) in classification, based on the output of the discriminant function w^T x + b; the sum of the ξ_i approximates the number of misclassified samples.
- New objective function: minimize (1/2) ||w||^2 + C Σ_i ξ_i, subject to y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.
- C is a tradeoff parameter between error and margin, chosen by the user; a large C means a higher penalty on errors.
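A small sketch (ours, not from the slides) of how C trades error against margin, using scikit-learn's soft-margin SVC on overlapping synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),   # class -1: noisy cloud
               rng.normal(+1.0, 1.0, size=(50, 2))])  # class +1: overlaps the first
y = np.array([-1] * 50 + [+1] * 50)

# Small C tolerates margin violations (wide margin, many support vectors);
# large C penalizes them heavily (narrow margin, closer fit to the training data).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum():3d} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```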

The Optimization Problem

- The dual of this problem has the same form as before: maximize W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j, now subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. w is again recovered as w = Σ_i α_i y_i x_i.
- The only difference from the linearly separable case is the upper bound C on the α_i.
- Once again, a QP solver can be used to find the α_i efficiently. Note also that everything is done through inner products.

Extension to Non-linear SVMs (Kernel Machines)

Non-Linear SVM

- We know that the data appear only as dot products (x_i ∙ x_j). How can we generalize this procedure to non-linear data?
- Vapnik showed in 1992 that transforming the input data x_i into a higher-dimensional space makes the problem easier.
- Suppose we transform the data into some (possibly infinite-dimensional) space H via a mapping function Φ such that the data appear only in the form Φ(x_i) ∙ Φ(x_j).
- Why? A linear operation in H is equivalent to a non-linear operation in the input space (similar to the hidden layers of an ANN).

Non-linear SVMs: Feature Space

- General idea: the original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable. For example, for x = (x_1, x_2): Φ: x → φ(x) = (x_1^2, x_2^2, √2 x_1 x_2).
- If data are mapped into a space of sufficiently high dimension, they will in general be linearly separable; N data points are in general separable in a space of N−1 dimensions or more! (A small numerical check of this mapping follows.)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
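A quick numerical check (ours) that this mapping reduces to plain dot products in the input space, anticipating the kernel trick of the next slides:

```python
import numpy as np

def phi(x):
    """The slide's feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The dot product in feature space equals the *squared* dot product in input space:
# phi(x).phi(z) == (x.z)^2, so phi never has to be computed explicitly.
print(phi(x) @ phi(z))    # 1.0
print((x @ z) ** 2)       # 1.0
```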

Transformation to Feature Space

- Possible problems with the transformation: a high computational burden due to the high dimensionality, and difficulty in getting a good estimate.
- SVM solves these two issues simultaneously: the "kernel trick" allows efficient computation, and minimizing ||w||^2 can lead to a "good" classifier.

[Figure: points in the input space mapped by φ(·) to a feature space where the two classes become linearly separable.]

Kernel Trick

- Recall the dual: maximize W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i ∙ x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. Note that the data appear only as dot products.
- Since the data are represented only through dot products, we need not perform the mapping explicitly. Instead, we introduce a kernel function K such that K(x_i, x_j) = Φ(x_i) ∙ Φ(x_j).
- Kernel function: a function that can be applied to pairs of input data to evaluate dot products in some corresponding feature space.

Example Transformation

- Consider, for example, the transformation from the feature-space slide, φ(x) = (x_1^2, x_2^2, √2 x_1 x_2), and define the kernel function K(x, y) = (x ∙ y)^2.
- Then φ(x) ∙ φ(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x ∙ y)^2 = K(x, y): the inner product φ(·) ∙ φ(·) can be computed by K without going through the map φ(·) explicitly!

Modification Due to Kernel Function

- Change all inner products to kernel functions.
- For training: the original dual uses x_i ∙ x_j; with a kernel function it uses K(x_i, x_j) instead, i.e., maximize W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to the same constraints.

Examples of Kernel Functions

- Polynomial kernel of degree d: K(x, y) = (x ∙ y + 1)^d.
- Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||^2 / (2σ^2)); closely related to radial basis function neural networks.
- Sigmoid kernel with parameters κ and θ: K(x, y) = tanh(κ x ∙ y + θ); it does not satisfy the Mercer condition for all κ and θ, but despite this it can still work in practice.
- Research on different kernel functions for different applications is very active.
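For concreteness, the three kernels above written out in code (a sketch; function and parameter names are ours):

```python
import numpy as np

def polynomial_kernel(x, z, d=2, c=1.0):
    """Polynomial kernel of degree d: K(x, z) = (x.z + c)^d."""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel of width sigma: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """Sigmoid kernel: K(x, z) = tanh(kappa * x.z + theta).
    Not a Mercer kernel for all (kappa, theta), but often works in practice."""
    return np.tanh(kappa * np.dot(x, z) + theta)
```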

Example

- Suppose we have five one-dimensional data points: x_1=1, x_2=2, x_3=4, x_4=5, x_5=6, with 1, 2, 6 in class 1 and 4, 5 in class 2, so y_1=1, y_2=1, y_3=-1, y_4=-1, y_5=1.
- We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)^2, and C is set to 100.
- We first find the α_i (i = 1, ..., 5) by maximizing the dual with H_ij = y_i y_j K(x_i, x_j):

      H = [   4     9   -25   -36    49 ]
          [   9    25   -81  -121   169 ]
          [ -25   -81   289   441  -625 ]
          [ -36  -121   441   676  -961 ]
          [  49   169  -625  -961  1369 ]
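The H matrix above can be reproduced in a couple of lines (our sketch):

```python
import numpy as np

x = np.array([1, 2, 4, 5, 6], dtype=float)
y = np.array([1, 1, -1, -1, 1], dtype=float)

K = (np.outer(x, x) + 1.0) ** 2   # K_ij = (x_i x_j + 1)^2, polynomial kernel of degree 2
H = np.outer(y, y) * K            # H_ij = y_i y_j K(x_i, x_j)
print(H)                          # matches the matrix above, e.g. H[0,0] = 4, H[4,4] = 1369
```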

Example

- Using a QP solver (the MATLAB optimization toolbox, for instance, contains one), we get α_1=0, α_2=2.5, α_3=0, α_4=7.333, α_5=4.833. Verify (at home) that the constraints are indeed satisfied.
- The support vectors are {x_2=2, x_4=5, x_5=6}.
- The discriminant function is f(x) = Σ_i α_i y_i K(x_i, x) + b = 2.5 (2x+1)^2 − 7.333 (5x+1)^2 + 4.833 (6x+1)^2 + b ≈ 0.667 x^2 − 5.333 x + b.
- b is recovered by solving f(2)=1, f(5)=−1, or f(6)=1, since x_2, x_4, x_5 lie on the margin; all give b = 9.
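A quick check of the solution above (our sketch); we use 22/3 and 29/6 for the non-zero multipliers, which round to the 7.333 and 4.833 quoted on the slide:

```python
import numpy as np

x_sv = np.array([2.0, 5.0, 6.0])               # support vectors x2, x4, x5
y_sv = np.array([1.0, -1.0, 1.0])
a_sv = np.array([2.5, 22.0 / 3.0, 29.0 / 6.0]) # assumed exact values behind 2.5, 7.333, 4.833
b = 9.0

def f(z):
    """Discriminant f(z) = sum_i a_i y_i K(x_i, z) + b with K(x, z) = (x z + 1)^2."""
    return np.sum(a_sv * y_sv * (x_sv * z + 1.0) ** 2) + b

for z in (2, 5, 6):
    print(z, round(f(z), 3))                   # 1.0, -1.0, 1.0: the SVs lie on the margin
```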

Example

[Figure: the value of the discriminant function plotted along the 1D input axis at x = 1, 2, 4, 5, 6; it is positive over the class 1 regions (around 1-2 and 6) and negative over the class 2 region (around 4-5).]

Choosing the Kernel Function

- Probably the trickiest part of using an SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data.
- Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...), and there is even research on estimating the kernel matrix from the available information.
- In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try (see the cross-validation sketch below).
- Note that an SVM with an RBF kernel is closely related to an RBF neural network, with the centers of the radial basis functions chosen automatically by the SVM.
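One pragmatic way to make that "initial try" concrete (our sketch, not from the slides): cross-validate a few standard kernels and widths with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try a low-degree polynomial and an RBF kernel over a small grid, as suggested above.
param_grid = [
    {'kernel': ['poly'], 'degree': [2, 3], 'C': [1, 10, 100]},
    {'kernel': ['rbf'], 'gamma': [0.01, 0.1, 1.0], 'C': [1, 10, 100]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```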

Software

- A list of SVM implementations can be found at http://www.kernel-machines.org/software.html.
- Some implementations (such as LIBSVM) can handle multi-class classification.
- SVMlight is among the earliest implementations of SVM.
- Several MATLAB toolboxes for SVM are also available.

Recap of Steps in SVM

1. Prepare the data matrix {(x_i, y_i)}.
2. Select a kernel function.
3. Select the error parameter C.
4. "Train" the system (to find all α_i).
5. Classify new data using the α_i and the support vectors.

(An end-to-end sketch follows.)
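The same recipe end to end with scikit-learn (a sketch; the dataset and parameter values are placeholders, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                        # 1. prepare {(x_i, y_i)}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_tr)                               #    (scaling helps RBF kernels)
clf = SVC(kernel='rbf', gamma='scale', C=10.0)                    # 2. kernel, 3. error parameter C
clf.fit(scaler.transform(X_tr), y_tr)                             # 4. train: solves for the alphas
print("support vectors:", clf.n_support_.sum())
print("test accuracy:", clf.score(scaler.transform(X_te), y_te))  # 5. classify new data
```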

Summary: Weaknesses

- Training (and testing) is quite slow compared to ANNs, because of the constrained quadratic programming.
- Essentially a binary classifier (although there are tricks to get around this).
- Very sensitive to noise: a few off data points can completely throw off the algorithm.
- Biggest drawback: the choice of kernel function. There is no "set-in-stone" theory for choosing a kernel function for any given problem (still an active research area).
- Once a kernel function is chosen, there is only ONE modifiable parameter, the error penalty C.

Summary: Strengths

- Training is relatively easy: we don't have to deal with local minima as in ANNs, and the SVM solution is always global and unique (see the Burges paper for the proof and justification).
- Unlike ANNs, SVMs don't suffer from the "curse of dimensionality". How, when the feature space may be infinite-dimensional? The maximum-margin constraint needs only dot products!
- Less prone to overfitting.
- Simple, easy-to-understand geometric interpretation; no large networks to mess around with.

Applications of SVMs

- Bioinformatics
- Machine vision
- Text categorization
- Ranking (e.g., Google searches)
- Handwritten character recognition
- Time series analysis

Lots of very successful applications! (See, e.g., the work of Prof. Thorsten Joachims.)

Handwritten digit recognition
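As a small-scale stand-in for the handwritten digit task (our sketch using scikit-learn's 8x8 digits, not the benchmark behind the 1.1% figure quoted earlier):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF kernel; multi-class classification is handled internally (one-vs-one).
clf = SVC(kernel='rbf', gamma=0.001, C=10.0).fit(X_tr, y_tr)
print("test error rate:", round(1.0 - clf.score(X_te, y_te), 4))
```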

References

- Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Labs, 1998.
- Law, M. "A Simple Introduction to Support Vector Machines." Michigan State University, 2006.
- Prabhakar, K. "An Introduction to Support Vector Machines."

Resources

- http://www.kernel-machines.org
- http://www.support-vector.net/
- http://www.support-vector.net/icml-tutorial.pdf
- http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
- http://www.clopinet.com/isabelle/Projects/SVM/applist.html
- http://www.cs.cornell.edu/People/tj/
- http://svmlight.joachims.org/