Support Vector Machines


Support Vector Machines

Perceptron Revisited: Linear Classifier The classifier is y(x) = sign(w.x + b). The hyperplane w.x + b = 0 separates the input space into the regions w.x + b > 0 and w.x + b < 0.
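A minimal NumPy sketch of this decision rule (the weight vector, bias, and data below are made-up values for illustration):

```python
import numpy as np

def linear_classifier(X, w, b):
    """Predict +1 or -1 for each row of X using y(x) = sign(w.x + b)."""
    return np.sign(X @ w + b)

w = np.array([1.0, -2.0])          # hypothetical weight vector
b = 0.5                            # hypothetical bias
X = np.array([[3.0, 1.0], [0.0, 2.0]])
print(linear_classifier(X, w, b))  # -> [ 1. -1.]
```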

Which one is the best? Many different hyperplanes can separate the training data correctly. Which one is the best?

Notion of Margin The distance from a data point x to the boundary is r = |w.x + b| / ||w||. Data points closest to the boundary are called support vectors. The margin d is the distance between the two classes.

Maximum Margin Classification Intuitively, the classifier with the maximum margin is the best solution. Vapnik formally justifies this from the viewpoint of Structural Risk Minimization. Also, it seems that only the support vectors matter (is SVM a statistical classifier?).

Quantifying the Margin Canonical hyper-planes: there is redundancy in the choice of w and b, since (cw, cb) defines the same boundary for any c > 0. To break this redundancy, assume the closest data points lie on the hyper-planes w.x + b = +1 and w.x + b = -1 (the canonical hyper-planes). The margin is then d = 2 / ||w||. The condition of correct classification is: w.xi + b ≥ 1 if yi = 1, and w.xi + b ≤ -1 if yi = -1.

Maximizing Margin: The quadratic optimization problem A simpler formulation: find w and b such that the margin 2/||w|| is maximized (equivalently, (1/2)||w||^2 is minimized), and for all {(xi, yi)}: w.xi + b ≥ 1 if yi = 1; w.xi + b ≤ -1 if yi = -1.
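As a hedged illustration, this hard-margin problem can be approximated with scikit-learn's SVC using a linear kernel and a very large C (an assumption of this sketch, not part of the slides; the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (hypothetical data).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```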

The dual problem (1) Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem. The Lagrangian is L(w, b, α) = (1/2)||w||^2 - Σi αi [yi (w.xi + b) - 1], with multipliers αi ≥ 0. Minimizing L over w and b gives w = Σi αi yi xi and Σi αi yi = 0.

The dual problem (2) Therefore, the optimal value of w is w = Σi αi yi xi. Using this result, the dual optimization problem becomes: maximize Σi αi - (1/2) Σi Σj αi αj yi yj (xi.xj) subject to αi ≥ 0 and Σi αi yi = 0.
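A rough sketch of solving this dual directly with scipy.optimize.minimize on a tiny made-up dataset (SLSQP handles the equality constraint and the non-negativity bounds); in practice a dedicated QP solver would be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable toy set (hypothetical data), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
K = X @ X.T                                   # Gram matrix of inner products xi.xj

def neg_dual(alpha):
    # Negative of the dual objective, so that minimizing it maximizes the dual.
    return 0.5 * (alpha * y) @ K @ (alpha * y) - alpha.sum()

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, None)] * n,                            # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                            # optimal w = sum_i alpha_i y_i x_i
print("alpha =", alpha, "w =", w)
```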

Important Observations (1) The solution of the dual problem depends only on the inner-products between data points, xi.xj, rather than on the data points themselves. The dominant contribution of support vectors: the Kuhn-Tucker condition αi [yi (w.xi + b) - 1] = 0 implies that only support vectors have non-zero αi values.
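The following scikit-learn sketch (random synthetic data, an assumption of this example) illustrates the point: only the support vectors appear in dual_coef_, which stores the non-zero αi·yi values; every other training point has αi = 0:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=+2.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# dual_coef_ holds alpha_i * y_i for the support vectors only.
print("support vector indices:", clf.support_)
print("support vectors per class:", clf.n_support_)
print("alpha_i * y_i:", clf.dual_coef_)
```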

Important Observations (2) The form of the final solution is f(x) = sign(Σi αi yi (xi.x) + b). Two features: it depends only on the support vectors, and only through inner-products of data vectors. Fixing b: any support vector xs satisfies ys (w.xs + b) = 1, so b = ys - Σi αi yi (xi.xs).
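A small sketch, again using scikit-learn on made-up data, that recovers b from the support vectors and compares it with the solver's own intercept:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # hard-margin approximation
w = clf.coef_[0]

# Each support vector x_s lies on a canonical hyperplane, y_s (w.x_s + b) = 1,
# so b = y_s - w.x_s; averaging over all support vectors is numerically more stable.
sv, sv_y = clf.support_vectors_, y[clf.support_]
b = np.mean(sv_y - sv @ w)
print(b, clf.intercept_[0])                     # the two values should closely agree
```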

Soft Margin Classification What if the data points are not linearly separable? Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

The formulation of soft margin The original problem becomes: minimize (1/2)||w||^2 + C Σi ξi subject to yi (w.xi + b) ≥ 1 - ξi and ξi ≥ 0. The dual problem is the same as before, except that the multipliers are now box-constrained: 0 ≤ αi ≤ C.
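A hedged example of the effect of the penalty parameter C, using scikit-learn on overlapping synthetic data (an assumption of this sketch): a smaller C tolerates more slack and typically keeps more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Overlapping classes: not perfectly separable, so slack variables are needed.
X = np.vstack([rng.normal(-1.0, 1.5, size=(100, 2)),
               rng.normal(+1.0, 1.5, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum():3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```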

Linear SVMs: Overview The classifier is a separating hyperplane. The most "important" training points are the support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points xi are support vectors, i.e., those with non-zero Lagrangian multipliers αi. Both in the dual formulation of the problem and in the solution, training points appear only inside inner-products.

Who really needs linear classifiers? On datasets that are linearly separable up to some noise, linear SVMs work well. But what if the dataset is not linearly separable? How about mapping the data to a higher-dimensional space, e.g., x → (x, x^2)?
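A tiny NumPy sketch of this idea on made-up 1-D data:

```python
import numpy as np

# 1-D data that is not linearly separable: class +1 in the middle, -1 outside.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map each point to (x, x^2): in this 2-D feature space a horizontal line
# x^2 = c separates the two classes.
phi = np.column_stack([x, x ** 2])
print(phi)
```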

Non-linear SVMs: Feature spaces General idea: the original space can always be mapped to some higher-dimensional feature space where the training set becomes separable: Φ: x → φ(x)

The "Kernel Trick" The SVM only relies on the inner-product between vectors, xi.xj. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner-product becomes K(xi, xj) = φ(xi).φ(xj). K(xi, xj) is called the kernel function. For an SVM, we only need to specify the kernel K(xi, xj), without needing to know the corresponding non-linear mapping φ(x).
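As an illustration (assuming scikit-learn, which accepts a user-supplied kernel as a callable), the classifier below is trained using only kernel evaluations, never the explicit mapping φ; the data and the particular kernel are made up for this sketch:

```python
import numpy as np
from sklearn.svm import SVC

def poly2_kernel(A, B):
    """K(x, z) = (x.z + 1)^2, evaluated without ever forming phi(x)."""
    return (A @ B.T + 1.0) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular decision boundary

clf = SVC(kernel=poly2_kernel).fit(X, y)                  # only the kernel is supplied
print("training accuracy:", clf.score(X, y))
```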

Non-linear SVMs The dual problem becomes: maximize Σi αi - (1/2) Σi Σj αi αj yi yj K(xi, xj) subject to 0 ≤ αi ≤ C and Σi αi yi = 0. The optimization techniques for finding the αi's remain the same! The solution is f(x) = sign(Σi αi yi K(xi, x) + b).

Examples of Kernel Trick (1) For the example in the previous figure, the non-linear mapping is φ(x) = (x, x^2) and the corresponding kernel is K(x, z) = φ(x).φ(z) = xz + x^2 z^2. Where is the benefit?

Examples of Kernel Trick (2) Polynomial kernel of degree 2 in 2 variables. The non-linear mapping is φ(x) = (1, √2 x1, √2 x2, x1^2, x2^2, √2 x1 x2); the kernel is K(x, z) = φ(x).φ(z) = (1 + x.z)^2.
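A quick numerical check of this identity in NumPy (the two test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D vector v = (v1, v2)."""
    v1, v2 = v
    return np.array([1.0,
                     np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1 ** 2, v2 ** 2,
                     np.sqrt(2) * v1 * v2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))        # explicit 6-dimensional inner product -> 4.0
print((x @ z + 1.0) ** 2)     # kernel evaluated in the original 2-D space -> 4.0
```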

Examples of Kernel Trick (3) Gaussian kernel: K(x, z) = exp(-||x - z||^2 / (2σ^2)). The corresponding mapping φ is of infinite dimension. The moral: a very high-dimensional and complicated non-linear mapping can be achieved by using a simple kernel!

What Functions are Kernels? For some functions K(xi, xj), checking that K(xi, xj) = φ(xi).φ(xj) can be cumbersome. Mercer's theorem: every symmetric positive semi-definite function is a kernel.
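A rough numerical illustration: the Gram matrix of a valid kernel on any set of points is symmetric positive semi-definite, which can be checked via its eigenvalues (random data and a Gaussian kernel with σ = 1 are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

# Gram matrix of the Gaussian kernel on arbitrary points.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# A Mercer kernel gives a symmetric PSD Gram matrix: all eigenvalues >= 0
# up to numerical round-off.
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```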

Examples of Kernel Functions Linear kernel: K(xi, xj) = xi.xj. Polynomial kernel of power p: K(xi, xj) = (1 + xi.xj)^p. Gaussian kernel: K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2)); in this form it is equivalent to an RBF neural network, but has the advantage that the centres of the basis functions, i.e., the support vectors, are chosen in a supervised way. Two-layer perceptron kernel: K(xi, xj) = tanh(β0 xi.xj + β1).
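A minimal sketch of these kernels in NumPy; the exact parameterizations (the +1 offset, σ, β0, β1) are common conventions assumed here, not prescribed by the slide:

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, p=3):
    return (A @ B.T + 1.0) ** p

def gaussian_kernel(A, B, sigma=1.0):
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-sq / (2.0 * sigma ** 2))

def perceptron_kernel(A, B, beta0=1.0, beta1=0.0):
    # "Two-layer perceptron" (sigmoid) kernel; a valid Mercer kernel only for
    # some parameter choices.
    return np.tanh(beta0 * (A @ B.T) + beta1)
```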

SVM Overview Main features: by using the kernel trick, data is mapped into a high-dimensional feature space without introducing much computational effort; maximizing the margin achieves better generalization performance; the soft margin accommodates noisy data; not too many parameters need to be tuned. Demos: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

SVM so far SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s. SVMs are currently among the best performers on many benchmark datasets. SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97]. The most popular optimization algorithms for SVMs are SMO [Platt '99] and SVMlight [Joachims '99]; both use decomposition to handle large datasets. The kernel trick is arguably the most attractive aspect of SVMs; the idea has since been applied to many other learning models that involve inner-products, collectively called 'kernel' methods. Tuning SVMs remains a main research focus: how do we choose an optimal kernel? The kernel should match the smoothness structure of the data.