Linear Separators.


Linear Separators

Bankruptcy example R is the ratio of earnings to expenses, and L is the number of late payments on credit cards over the past year. We would like to draw a linear separator here and thereby obtain a classifier.

1-Nearest Neighbor Boundary The decision boundary will be the boundary between cells defined by points of different classes, as illustrated by the bold line shown here.

Decision Tree Boundary Similarly, a decision tree also defines a decision boundary in the feature space. Although both 1-NN and decision trees agree on all the training points, they disagree on the precise decision boundary and so will classify some query points differently. This is the essential difference between different learning algorithms.

Linear Boundary Linear separators are characterized by a single linear decision boundary in the space. The bankruptcy data can be successfully separated in that manner. But there is no guarantee that a single linear separator will successfully classify every set of training data.

Linear Hypothesis Class Line equation (assume 2D first): w2x2 + w1x1 + b = 0
Fact 1: All the points (x1, x2) lying on the line make the equation true.
Fact 2: The line separates the plane into two half-planes.
Fact 3: The points (x1, x2) in one half-plane give us an inequality with respect to 0, which has the same direction for each of the points in that half-plane.
Fact 4: The points (x1, x2) in the other half-plane give us the reverse inequality with respect to 0.
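To make Facts 3 and 4 concrete, the sign of w2x2 + w1x1 + b tells us which half-plane a point lies in. Here is a tiny Python sketch with hypothetical weights chosen only for illustration:

```python
# Hypothetical line: 1.0*x2 + 2.0*x1 - 2.0 = 0  (w2=1.0, w1=2.0, b=-2.0)
w2, w1, b = 1.0, 2.0, -2.0

def side(x1, x2):
    """Evaluate w2*x2 + w1*x1 + b; the sign says which half-plane (x1, x2) is in."""
    return w2 * x2 + w1 * x1 + b

print(side(0.0, 2.0))   #  0.0 -> on the line
print(side(0.0, 0.0))   # -2.0 -> one half-plane (negative side)
print(side(2.0, 2.0))   #  4.0 -> the other half-plane (positive side)
```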

Fact 3 proof Take the line w2x2 + w1x1 + b = 0 (assume w2 > 0). Let (p, q) be an arbitrary point in the lower half-plane and let (p, r) be the point on the line with the same first coordinate. (p, r) is on the line so: w2r + w1p + b = 0. But q < r, so we get: w2q + w1p + b < w2r + w1p + b, i.e. w2q + w1p + b < 0. Since (p, q) was an arbitrary point in the half-plane, the same direction of inequality holds for any other point of the half-plane.

Fact 4 proof Again take w2x2 + w1x1 + b = 0 (with w2 > 0). Let (p, s) be an arbitrary point in the upper half-plane and let (p, r) be the point on the line with the same first coordinate. (p, r) is on the line so: w2r + w1p + b = 0. But s > r, so we get: w2s + w1p + b > w2r + w1p + b, i.e. w2s + w1p + b > 0. Since (p, s) was an arbitrary point in the (other) half-plane, the same direction of inequality holds for any other point of that half-plane.

Corollary What’s an easy way to determine the direction of the inequalities for each half-plane? Try it for the point (0, 0) (assuming the line does not pass through the origin), and determine the direction for the half-plane where (0, 0) belongs. The points of the other half-plane will have the opposite inequality direction. How much bigger (or smaller) than zero w2q + w1p + b is, is proportional to the distance of the point (p, q) from the line. The same can be said for an n-dimensional space. We simply don’t talk about “half-planes” but “half-spaces” (the line is now a hyperplane creating two half-spaces).

Linear classifier We can now exploit the sign of this distance to define a linear classifier, one whose decision boundary is a hyperplane. Instead of using 0 and 1 as the class labels (which was an arbitrary choice anyway) we use the sign of the distance, either +1 or -1, as the labels (that is, the values of the yi’s). Trick: define h(x) = sign(w·x + b), which outputs +1 or –1.

Margin The margin is the product of w·xi for the training point xi and the known sign of its class, yi. margin: γi = yi w·xi, which is proportional to the perpendicular distance of point xi to the line (hyperplane). γi > 0: the point is correctly classified (sign of distance = yi). γi < 0: the point is incorrectly classified (sign of distance ≠ yi).

Perceptron algorithm How do we find a linear separator? The Perceptron algorithm was developed by Rosenblatt in the mid 50's. It is a greedy, "mistake driven" algorithm.
Algorithm:
  Pick an initial weight vector (including b), e.g. [.1, …, .1]
  Repeat until all points are correctly classified:
    For each point xi:
      Calculate the margin yi w·xi (this is a number)
      If the margin > 0, point xi is correctly classified
      Else, change the weights in proportion to yi xi
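A minimal sketch of this algorithm in Python/NumPy (not the slides' reference implementation); the data points are made up for illustration, and b is folded into the weight vector by appending a constant 1 feature to each point:

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Train a perceptron. X: (n, d) data with a trailing 1-column for b; y: labels in {-1, +1}."""
    w = np.full(X.shape[1], 0.1)          # initial weight vector, e.g. [.1, ..., .1]
    for _ in range(max_passes):           # guard against non-separable data
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # margin not positive: misclassified
                w += yi * xi              # change the weights in proportion to yi*xi
                mistakes += 1
        if mistakes == 0:                 # all points correctly classified
            break
    return w

# Made-up 2-D example (third coordinate is the constant 1 standing in for b)
X = np.array([[2.0, 0.5, 1.0], [1.5, 1.0, 1.0], [0.2, 3.0, 1.0], [0.5, 4.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))                  # the signs should match y
```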

Gradient Ascent/Descent Why pick yi xi as the increment to the weights? The margin is a function of several input variables: w2, w1, w0 (or in general wn, …, w0). In order to move toward the maximum of this function, it is good to change the variables in the direction of the slope of the function. The slope is represented by the gradient of the function: the vector of first (partial) derivatives of the function with respect to each of the input variables. For the margin γi = yi w·xi, the partial derivative with respect to wj is yi xij, so the gradient is exactly yi xi, which is the increment used in the update.

Perceptron algorithm Changes for the different points interfere with each other. So, it will not be the case that one pass through the points will produce a correct weight vector. In general, we will have to go around multiple times. However, the algorithm is guaranteed to terminate with the weights for a separating hyperplane as long as the data is linearly separable. The proof of this fact is beyond our scope. Notice that if the data is not separable, then this algorithm is an infinite loop. Good idea to keep track of the best separator we've seen so far.

Perceptron algorithm Bankruptcy data It takes 49 iterations through the bankruptcy data for the algorithm to stop. The separator at the end of the loop is [0.4, 0.94, -2.2]. We can pick some small "rate" constant to scale the change to w; this is called eta (η).

Dual Form The calculated w will be: w = Σi αi yi xi, where αi is the number of times data instance xi got misclassified. So, for classification we’ll check the sign of: w·x = Σi αi yi (xi·x), where x is the new data instance to be classified.

Perceptron algorithm (dual form)
  Set α = 0
  Repeat until all points are correctly classified:
    For each point xi:
      Calculate the margin yi Σj αj yj (xj·xi)
      If the margin > 0, point xi is correctly classified
      Else, increment αi
If the data is not linearly separable, then the alphas grow without bound.
Note that, if yi = 1:
  If xji > 0 then wj increases (the margin increases)
  If xji < 0 then wj decreases (the margin again increases)
Similarly, for yi = -1, the margin always increases.
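Below is a corresponding sketch of the dual-form loop in Python/NumPy, again not the slides' code; it reuses the made-up data from the primal sketch, and the max_passes cap stands in for stopping when the alphas would otherwise grow without bound:

```python
import numpy as np

def dual_perceptron(X, y, max_passes=1000):
    """Dual-form perceptron: alpha[i] counts how many times x_i was misclassified."""
    n = X.shape[0]
    alpha = np.zeros(n)
    G = X @ X.T                                   # Gram matrix of dot products x_j . x_i
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            margin = y[i] * np.sum(alpha * y * G[:, i])
            if margin <= 0:                       # misclassified (or on the boundary)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

# Same made-up data as before (trailing 1 plays the role of b)
X = np.array([[2.0, 0.5, 1.0], [1.5, 1.0, 1.0], [0.2, 3.0, 1.0], [0.5, 4.0, 1.0]])
y = np.array([+1, +1, -1, -1])
alpha = dual_perceptron(X, y)
w = (alpha * y) @ X                               # recover the primal weights: w = sum_i alpha_i y_i x_i
print(alpha, np.sign(X @ w))                      # the signs should match y
```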

Non-linearly separable

Moving points into a different space Square every x1 and x2 value first: a point that was at (-1, 2) would now be at (1, 4), a point that was at (0.5, 1) would now be at (0.25, 1), and so on. It is now very easy to divide the X's from the O's.

Main Idea Transform the points (vectors) into another space using some function Φ and then do linear separation in the new space, i.e. considering the vectors Φ(x1), Φ(x2), ..., Φ(xn).
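A minimal sketch of this transformation in Python/NumPy, assuming the squaring map from the previous slide (Φ(x) = [x1², x2²]); the extra data points are made up for illustration:

```python
import numpy as np

def phi(x):
    """Map a 2-D point into the new space by squaring each coordinate."""
    return np.array([x[0] ** 2, x[1] ** 2])

# The two points mentioned on the previous slide:
print(phi(np.array([-1.0, 2.0])))   # [1. 4.]
print(phi(np.array([0.5, 1.0])))    # [0.25 1.  ]

# A whole (made-up) dataset can be mapped row-wise in one step and then
# handed to a linear separator such as the perceptron sketched earlier.
X = np.array([[-1.0, 2.0], [0.5, 1.0], [2.0, -2.0], [-0.3, 0.4]])
X_new = X ** 2
print(X_new)
```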

The Kernel Trick While you could write code to transform the data into a new space like this, it isn't usually done in practice because finding a dividing line when working with real datasets can require casting the data into hundreds or thousands of dimensions, and this is quite impractical to implement. However, with any algorithm that uses dot-products—including the linear classifier—you can use a technique called the kernel trick. The kernel trick involves replacing the dot-product function with a new function that returns what the dot-product would have been if the data had first been transformed to a higher dimensional space using some mapping function.

The Kernel Trick Remember, all we care about is computing dot products. Now see something interesting. Let Φ : R² → R³ such that Φ(x) = Φ([x1, x2]) = [z1, z2, z3] = [x1², √2·x1x2, x2²]. Now, let r = [r1, r2, r3] and s = [s1, s2, s3] be the two vectors in R³ corresponding to vectors a = [a1, a2] and b = [b1, b2] in R². Then Φ(a)·Φ(b) = r·s = r1s1 + r2s2 + r3s3 = (a1b1)² + 2a1a2b1b2 + (a2b2)² = (a1b1 + a2b2)² = (a·b)².
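This identity is easy to check numerically; below is a quick sketch in Python/NumPy with made-up vectors a and b:

```python
import numpy as np

def phi(x):
    """Explicit map R^2 -> R^3: [x1^2, sqrt(2)*x1*x2, x2^2]."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([1.0, 3.0])       # made-up vectors
b = np.array([2.0, -1.0])

lhs = np.dot(phi(a), phi(b))   # dot product in the mapped space
rhs = np.dot(a, b) ** 2        # kernel: (a . b)^2, computed in the original space
print(lhs, rhs)                # both equal 1.0 here, so the two agree
```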

The Kernel Trick So instead of mapping the data vectors via Φ and computing the modified inner product Φ(a)·Φ(b), we can do it in one operation, leaving the mapping completely implicit. Because “modified inner product” is a long name, we call it a kernel: K(a, b) = Φ(a)·Φ(b).
Useful Kernels
Polynomial Kernel: K(a, b) = (a·b)²
Visualization: http://www.youtube.com/watch?v=3liCbRZPrZA
Gaussian Kernel: K(a, b) = exp(−(1/2)·||a − b||²)
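Tying this back to the dual form of the perceptron, which only ever uses dot products xj·xi: we can replace those dot products with a kernel and never compute Φ explicitly. The sketch below is an illustrative kernel perceptron in Python/NumPy, not code from these slides; the toy data and the specific kernel functions are my own choices.

```python
import numpy as np

def poly_kernel(a, b):
    return np.dot(a, b) ** 2                       # polynomial kernel (a . b)^2

def gaussian_kernel(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))     # exp(-(1/2)||a - b||^2)

def kernel_perceptron(X, y, kernel, max_passes=1000):
    """Dual-form perceptron with dot products replaced by a kernel."""
    n = X.shape[0]
    K = np.array([[kernel(xj, xi) for xi in X] for xj in X])   # kernel matrix
    alpha = np.zeros(n)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:        # misclassified
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    """Classify a new point x using the dual representation."""
    return np.sign(sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X)))

# Toy data that is not linearly separable in the original space,
# but is separable in the feature space of the (a . b)^2 kernel.
X = np.array([[-0.5, 0.2], [0.3, -0.4], [2.0, 1.5], [-2.0, -1.8]])
y = np.array([+1, +1, -1, -1])
alpha = kernel_perceptron(X, y, poly_kernel)
print([predict(x, X, y, alpha, poly_kernel) for x in X])       # should match y
```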

Linear Separators It's difficult to characterize the separator that the Perceptron algorithm will come up with. Different runs (e.g., with different initial weights or different orderings of the points) can come up with different separators. Can we do better?

Which one to pick? Natural choice: pick the separator that has the maximal margin to its closest points on either side. This is the most conservative choice: any other separator will be "closer" to one class than to the other. Those closest points are called "support vectors".