
Support vector machine LING 572 Fei Xia Week 8: 2/23/2010

Why another learning method? It is based on some “beautifully simple” ideas. It can use the kernel trick: kernel-based vs. feature-based. It performs well on many applications. It has some nice properties w.r.t. training and test error rates.

Kernel methods A rich family of “pattern analysis” algorithms; the best-known member is the Support Vector Machine (SVM). Very general task: given a set of data, find patterns. Examples: – Classification – Regression – Correlation – Clustering – …

History of SVM Linear classifier: 1962 – Use a hyperplane to separate examples – Choose the hyperplane that maximizes the minimal margin Kernel trick: 1992

History of SVM (cont) Soft margin: 1995 – To deal with non-separable data or noise Transductive SVM: 1999 – To take advantage of unlabeled data

Main ideas Use a hyperplane to separate the examples. Among all the hyperplanes, choose the one with the maximum margin. Maximizing the margin is the same as minimizing ||w|| subject to some constraints.

Main ideas (cont) For data sets that are not linearly separable, map the data to a higher-dimensional space and separate them there with a hyperplane. The kernel trick allows the mapping to be “done” efficiently. Soft margin deals with noise and/or non-separable data.

Outline Linear SVM – Maximizing the margin – Soft margin Nonlinear SVM – Kernel trick A case study Handling multi-class problems

Papers:
– (Manning et al., 2008), Chapter 15
– (Collins and Duffy, 2001): tree kernel
– (Schölkopf, 2000), Sections 3, 4, and 7
– (Hearst et al., 1998)

Inner product vs. dot product

Dot product
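For reference, the standard definition for n-dimensional real vectors x and z (the formula is not reproduced on the slide itself):

    x \cdot z = \langle x, z \rangle = \sum_{i=1}^{n} x_i z_i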

Inner product An inner product is a generalization of the dot product. It is a function that satisfies the following properties:
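In the real-vector case, the defining properties are the standard ones (stated here for completeness, for all vectors x, z, u and scalars a, b):

    \langle x, z \rangle = \langle z, x \rangle \quad\text{(symmetry)}
    \langle a x + b z, u \rangle = a\langle x, u\rangle + b\langle z, u\rangle \quad\text{(linearity)}
    \langle x, x \rangle \ge 0, \text{ with } \langle x, x\rangle = 0 \iff x = 0 \quad\text{(positive definiteness)}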

Some examples
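Two standard examples (not necessarily the ones on the original slide): the ordinary dot product, and a weighted inner product defined by a symmetric positive-definite matrix A:

    \langle x, z \rangle = \sum_i x_i z_i, \qquad \langle x, z \rangle_A = x^{\top} A z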

Linear SVM

The setting Input: x ∈ X. Output: y ∈ Y, Y = {-1, +1}. Training set: S = {(x_1, y_1), …, (x_m, y_m)}. Goal: find a function y = f(x) that fits the data, f: X → R.

Notations

Inner product Inner product between two vectors

Inner product (cont)

Hyperplane A hyperplane: <w, x> + b = 0. It divides the space into two regions: <w, x> + b > 0 and <w, x> + b < 0.

An example of a hyperplane: <w, x> + b = 0. The line x_1 + 2x_2 – 2 = 0, which crosses the axes at (2,0) and (0,1), corresponds to w=(1,2), b=-2; the same line written as 10x_1 + 20x_2 – 20 = 0 corresponds to w=(10,20), b=-20.

Finding a hyperplane Given the training instances, we want to find a hyperplane that separates them. If there is more than one such hyperplane, SVM chooses the one with the maximum margin.

Maximizing the margin The margin is the distance from the separating hyperplane <w, x> + b = 0 to the closest training examples. Training: find w and b.

Support vectors The support vectors are the training instances that lie on the margin hyperplanes <w, x> + b = 1 and <w, x> + b = -1, on either side of the separating hyperplane <w, x> + b = 0.

Training: maximizing the margin is equivalent to minimizing the norm ||w|| subject to the margin constraints. This is a quadratic programming (QP) problem; the standard form is given below.
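In the notation of these slides, the standard hard-margin primal problem is:

    \min_{w, b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i\,(\langle w, x_i\rangle + b) \ge 1,\quad i = 1, \dots, m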

Decoding Hyperplane: w=(1,2), b=-2, so f(x) = x_1 + 2x_2 – 2. For x=(3,1): f(x) = 3 + 2 – 2 = 3 > 0. For x=(0,0): f(x) = 0 + 0 – 2 = -2 < 0.
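A minimal sketch of this decoding step in Python with numpy, reusing the w and b from the slide (the function name decode is illustrative):

    import numpy as np

    # Hyperplane from the slide: w = (1, 2), b = -2, i.e. f(x) = x_1 + 2*x_2 - 2.
    def decode(w, b, x):
        """Linear SVM decoding: predict +1 if <w, x> + b > 0, otherwise -1."""
        score = float(np.dot(w, x)) + b
        return 1 if score > 0 else -1

    w, b = np.array([1.0, 2.0]), -2.0
    print(decode(w, b, np.array([3.0, 1.0])))  # f(x) = 3 + 2 - 2 =  3 > 0  ->  1
    print(decode(w, b, np.array([0.0, 0.0])))  # f(x) = 0 + 0 - 2 = -2 < 0  -> -1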

Lagrangian**
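The standard Lagrangian for the hard-margin problem, with one multiplier α_i ≥ 0 per constraint, is:

    L(w, b, \alpha) = \tfrac{1}{2}\lVert w\rVert^2 - \sum_i \alpha_i\big[\,y_i(\langle w, x_i\rangle + b) - 1\,\big], \qquad \alpha_i \ge 0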

Finding the solution This is a quadratic programming (QP) problem. The objective is convex, so there are no local minima, and it is solvable in polynomial time.

The dual problem** Maximize the dual objective subject to its constraints; a standard statement is given below.
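The standard hard-margin dual, in the slides' notation:

    \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j \langle x_i, x_j\rangle
    \quad\text{s.t.}\quad \alpha_i \ge 0,\quad \sum_i \alpha_i y_i = 0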

The solution Training: solve the QP for the α_i (and b). Decoding: compute f(x) from the support vectors. The standard formulas are given below.
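In the standard presentation, the learned weight vector and the resulting classifier are:

    w = \sum_i \alpha_i y_i x_i, \qquad
    f(x) = \langle w, x\rangle + b = \sum_i \alpha_i y_i \langle x_i, x\rangle + b, \qquad
    \hat{y} = \operatorname{sign}(f(x))

Only the support vectors have α_i > 0, so only they contribute to f(x).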

kNN vs. SVM Majority voting: c* = arg max_c g(c). Weighted voting (a weight on each neighbor): c* = arg max_c Σ_i w_i δ(c, f_i(x)). Weighted voting allows us to use more training examples, e.g., w_i = 1/dist(x, x_i) → we can use all the training examples. For the 2-class case, the weighted kNN rule and the SVM decision function have the same form, with one weight per training instance.

Summary of linear SVM Main ideas: – Choose a hyperplane to separate instances: <w, x> + b = 0 – Among all the allowed hyperplanes, choose the one with the max margin – Maximizing the margin is the same as minimizing ||w|| – Choosing w and b is the same as choosing the α_i

The problem

The dual problem

Remaining issues Linear classifier: what if the data is not separable? – The data would be linearly separable without noise → soft margin – The data is not linearly separable → map the data to a higher-dimensional space

The dual problem** Maximize the dual objective subject to its constraints, as before; note that the training data enter the dual only through the inner products <x_i, x_j>.

Outline Linear SVM – Maximizing the margin – Soft margin Nonlinear SVM – Kernel trick A case study Handling multi-class problems

Non-linear SVM

The highlight Problem: some data are not linearly separable. Intuition: transform the data to a higher-dimensional space (input space → feature space).

Example: the two spirals Separated by a hyperplane in feature space (Gaussian kernels)

Feature space Learning a non-linear classifier using SVM: – Define φ – Calculate φ(x) for each training example – Find a linear SVM in the feature space. Problems: – The feature space can be high-dimensional or even infinite-dimensional. – Calculating φ(x) can be very inefficient or even impossible. – Curse of dimensionality

Kernels Kernels are similarity functions that return inner products between the images of data points. Kernels can often be computed efficiently even for very high-dimensional spaces. Choosing K is equivalent to choosing φ → the feature space is implicitly defined by K.

An example
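A commonly used example (assuming two-dimensional input; not necessarily the one on the original slide) is the quadratic kernel:

    \phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2), \qquad
    \langle \phi(x), \phi(z)\rangle = (x_1 z_1 + x_2 z_2)^2 = \langle x, z\rangle^2 = K(x, z)

Computing K(x, z) directly costs about as much as one dot product, even though φ maps into a higher-dimensional space.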

An example**

From page 750 of (Russell and Norvig, 2002)

Another example**

The kernel trick No need to know what φ is or what the feature space is. No need to explicitly map the data to the feature space. Define a kernel function K, and replace the dot product with K(x, z) in both training and testing.

Training Maximize the dual objective, with the inner product replaced by the kernel K(x_i, x_j), subject to the same constraints; the standard form is given below.
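The standard kernelized dual:

    \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j K(x_i, x_j)
    \quad\text{s.t.}\quad \alpha_i \ge 0,\quad \sum_i \alpha_i y_i = 0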

Decoding Linear SVM (without mapping): f(x) = <w, x> + b. Non-linear SVM: w could be infinite-dimensional, so f(x) is computed with the kernel over the support vectors; a sketch is given below.
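A minimal sketch of kernelized decoding in Python with numpy, using an RBF kernel; the names rbf_kernel and decode are illustrative, and the alphas, labels, and support vectors are assumed to come from training:

    import numpy as np

    def rbf_kernel(x, z, sigma=1.0):
        """RBF kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
        return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))

    def decode(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
        """Kernelized decoding: the sign of sum_i alpha_i * y_i * K(x_i, x) + b."""
        score = sum(a * y * kernel(sv, x)
                    for a, y, sv in zip(alphas, labels, support_vectors)) + b
        return 1 if score > 0 else -1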

Kernel vs. features

A tree kernel

Common kernel functions Linear, polynomial, radial basis function (RBF), and sigmoid; the standard forms are given below.
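The standard forms (the parameter names c, d, σ, a, r are the usual ones, not taken from the slide):

    \text{Linear: } K(x, z) = \langle x, z\rangle
    \text{Polynomial: } K(x, z) = (\langle x, z\rangle + c)^d
    \text{RBF: } K(x, z) = \exp\!\big(-\lVert x - z\rVert^2 / (2\sigma^2)\big)
    \text{Sigmoid: } K(x, z) = \tanh(a\langle x, z\rangle + r)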

Other kernels Kernels for – trees – sequences – sets – graphs – general structures – … A tree kernel example next time

The choice of kernel function Given a function, we can test whether it is a kernel function by using Mercer’s theorem. Different kernel functions can lead to very different results. We need some prior knowledge in order to choose a good kernel.

Summary so far Find the hyperplane that maximizes the margin. Introduce a soft margin to deal with noisy data. Implicitly map the data to a higher-dimensional space to deal with non-linear problems. The kernel trick allows an infinite number of features and efficient computation of the dot product in the feature space. The choice of the kernel function is important.

MaxEnt vs. SVM
– Modeling: MaxEnt maximizes P(Y|X, λ); SVM maximizes the margin.
– Training: MaxEnt learns a λ_i for each feature function; SVM learns an α_i for each training instance.
– Decoding: MaxEnt calculates P(y|x); SVM calculates the sign of f(x), which is not a probability.
– Things to decide: for MaxEnt, the features, regularization, and training algorithm; for SVM, the kernel, regularization, training algorithm, and binarization.

More info Website: Textbook (2000): Tutorials: Workshops at NIPS

Soft margin

The highlight Problem: some data sets are not separable, or there are mislabeled examples. Idea: split the data as cleanly as possible while maximizing the distance to the nearest cleanly split examples. Mathematically, introduce slack variables ξ_i.

Objective function Minimize the regularized objective below, subject to the relaxed constraints, where C is the penalty and k = 1 or 2.
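The standard soft-margin formulation, in the slides' notation:

    \min_{w, b, \xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i^k
    \quad\text{s.t.}\quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0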

Additional slides

Linear kernel The map φ is linear. The kernel adjusts the weight of the features according to their importance.

The Kernel Matrix (a.k.a. the Gram matrix)
K(1,1)  K(1,2)  K(1,3)  …  K(1,m)
K(2,1)  K(2,2)  K(2,3)  …  K(2,m)
…
K(m,1)  K(m,2)  K(m,3)  …  K(m,m)

Mercer’s Theorem The kernel matrix is symmetric and positive semi-definite. Conversely, any symmetric positive semi-definite matrix can be regarded as a kernel matrix; that is, there exists a φ such that K(x, z) = <φ(x), φ(z)>.

Making kernels The set of kernels is closed under some operations. For instance, if K_1 and K_2 are kernels, so are the following: – K_1 + K_2 – cK_1 and cK_2 for c > 0 – cK_1 + dK_2 for c > 0 and d > 0 One can make complicated kernels from simple ones. A small sketch of this closure property follows.
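A small Python sketch of combining kernels (the kernel and function names are illustrative, not from the slides):

    import numpy as np

    def linear_kernel(x, z):
        """K(x, z) = <x, z>."""
        return float(np.dot(x, z))

    def poly_kernel(x, z, c=1.0, d=2):
        """K(x, z) = (<x, z> + c)^d."""
        return (float(np.dot(x, z)) + c) ** d

    def scaled_sum(k1, k2, c=1.0, d=1.0):
        """Build the kernel c*K1 + d*K2, which is again a kernel for c, d > 0."""
        return lambda x, z: c * k1(x, z) + d * k2(x, z)

    combined = scaled_sum(linear_kernel, poly_kernel, c=0.5, d=2.0)
    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(combined(x, z))  # 0.5 * <x, z> + 2.0 * (<x, z> + 1)^2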