Linear Separators.


1 Linear Separators

2 Bankruptcy example R is the ratio of earnings to expenses.
L is the number of late payments on credit cards over the past year. We would like to draw a linear separator here, and so obtain a classifier.

3 1-Nearest Neighbor Boundary
The decision boundary will be the boundary between cells defined by points of different classes, as illustrated by the bold line shown here.

4 Decision Tree Boundary
Similarly, a decision tree also defines a decision boundary in the feature space. Although both 1-NN and decision trees agree on all the training points, they disagree on the precise decision boundary and so will classify some query points differently. This is the essential difference between different learning algorithms.

5 Linear Boundary Linear separators are characterized by a single linear decision boundary in the space. The bankruptcy data can be successfully separated in that manner. But there is no guarantee that a single linear separator will successfully classify every set of training data.

6 Linear Hypothesis Class
Line equation (assume 2D first): w2x2 + w1x1 + b = 0
Fact 1: All the points (x1, x2) lying on the line make the equation true.
Fact 2: The line separates the plane into two half-planes.
Fact 3: The points (x1, x2) in one half-plane give us an inequality with respect to 0, which has the same direction for each of the points in that half-plane.
Fact 4: The points (x1, x2) in the other half-plane give us the reverse inequality with respect to 0.
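A minimal sketch of these facts in code, using an arbitrary example line (not one taken from the slides): points on the line evaluate to 0, and each half-plane gives a consistent sign.

```python
# Illustrative only: an arbitrary example line x2 = x1 + 1, written as w2*x2 + w1*x1 + b = 0
w2, w1, b = 1.0, -1.0, -1.0

def side(x1, x2):
    """Evaluate w2*x2 + w1*x1 + b: zero on the line, a consistent sign within each half-plane."""
    return w2 * x2 + w1 * x1 + b

print(side(0, 1))   #  0.0 -> on the line (Fact 1)
print(side(2, 3))   #  0.0 -> also on the line
print(side(0, 5))   #  4.0 -> one half-plane: positive (Fact 3)
print(side(0, -5))  # -6.0 -> other half-plane: negative (Fact 4)
```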

7 Fact 3 proof w2x2 + w1x1 + b = 0. We can write it as: x2 = -(w1x1 + b)/w2 (assume w2 > 0 for concreteness).
The point (p,r) is on the line, so: w2r + w1p + b = 0. Now take a point (p,q) in the half-plane below the line. But q < r, so we get: w2q + w1p + b < w2r + w1p + b, i.e. w2q + w1p + b < 0. Since (p,q) was an arbitrary point in the half-plane, we say that the same direction of inequality holds for any other point of the half-plane.

8 Fact 4 proof w2x2 + w1x1 + b = 0. We can write it as: x2 = -(w1x1 + b)/w2 (again assume w2 > 0).
The point (p,r) is on the line, so: w2r + w1p + b = 0. Now take a point (p,s) in the other half-plane, above the line. But s > r, so we get: w2s + w1p + b > w2r + w1p + b, i.e. w2s + w1p + b > 0. Since (p,s) was an arbitrary point in the (other) half-plane, we say that the same direction of inequality holds for any other point of that half-plane.

9 Corollary Depending on the direction (slope) of the line, which inequality goes with which half-plane may switch. However, the direction of the inequality is the same among all the points belonging to the same half-plane. What's an easy way to determine the direction of the inequalities for each half-plane? Try the point (0,0): determine the direction for the half-plane where (0,0) belongs; the points of the other half-plane will have the opposite inequality direction. How much bigger (or smaller) than zero w2q + w1p + b is, is proportional to the distance of the point (p,q) from the line. The same can be said for an n-dimensional space; we simply don't talk about "half-planes" but "half-spaces" (the line is now a hyperplane creating two half-spaces).
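A small sketch of the origin trick and the distance claim, continuing the same illustrative line as above (any other line works the same way): at (0,0) the expression reduces to b, and dividing the absolute value by the length of (w1, w2) gives the perpendicular distance.

```python
import math

# Same illustrative line as before: x2 = x1 + 1, i.e. w2*x2 + w1*x1 + b = 0
w2, w1, b = 1.0, -1.0, -1.0

def side(x1, x2):
    return w2 * x2 + w1 * x1 + b

# Origin trick: side(0, 0) == b, so the sign of b gives the inequality
# direction for the half-plane that contains (0, 0).
print(side(0, 0))  # -1.0 -> the origin lies in the "negative" half-plane

# The magnitude of side(p, q) is proportional to the distance from the line;
# dividing by the length of (w1, w2) gives the exact perpendicular distance.
p, q = 0.0, 5.0
print(abs(side(p, q)) / math.hypot(w1, w2))  # 4 / sqrt(2), about 2.83
```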

10 Linear classifier We can now exploit the sign of this distance to define a linear classifier, one whose decision boundary is a hyperplane. Instead of using 0 and 1 as the class labels (which was an arbitrary choice anyway), we use the sign of the distance, either +1 or -1, as the labels (that is, the values of the yi's). The classifier is h(x) = sign(w.x + b), which outputs +1 or -1.
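A minimal sketch of such a classifier in code; the weights here are placeholders, not values from the slides.

```python
import numpy as np

def predict(w, b, x):
    """Linear classifier: +1 on one side of the hyperplane w.x + b = 0, -1 on the other."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([-1.0, 1.0])   # placeholder weights (w1, w2), not trained values
b = -1.0
print(predict(w, b, np.array([0.0, 5.0])))   # +1
print(predict(w, b, np.array([0.0, -5.0])))  # -1
```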

11 Margin A variant of the signed distance of a training point to a hyperplane is the margin of the point. The margin γi is the product of w.xi + b for the training point xi and the known sign of its class, yi. If they agree (the training point is correctly classified), then the margin is positive; if they disagree (the classification is in error), then the margin is negative.
margin: γi = yi(w.xi + b); it is proportional to the perpendicular distance of point xi to the line (hyperplane).
γi > 0 : the point is correctly classified (sign of distance = yi)
γi < 0 : the point is incorrectly classified (sign of distance ≠ yi)
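As a quick illustrative check, reusing the placeholder separator from the sketch above (not trained values):

```python
import numpy as np

def margin(w, b, x, y):
    """Margin of a labeled point (x, y), with y in {+1, -1}: positive iff correctly classified."""
    return y * (np.dot(w, x) + b)

w, b = np.array([-1.0, 1.0]), -1.0               # placeholder separator
print(margin(w, b, np.array([0.0, 5.0]), +1))    #  4.0 -> correctly classified
print(margin(w, b, np.array([0.0, 5.0]), -1))    # -4.0 -> misclassified
```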

12 Perceptron algorithm How to find a linear separator?
The perceptron algorithm was developed by Rosenblatt in the mid 50's. It is a greedy, "mistake driven" algorithm. We will be using the extended form of the weight and data-point vectors in this algorithm: the bias b is folded into the weight vector as an extra component and a constant 1 is appended to each data point, so the separator can be written simply as w.x = 0. The extended form is in fact a trick that simplifies the presentation a bit.

13 Perceptron algorithm
Pick an initial weight vector (including b), e.g. [.1, …, .1]
Repeat until all points are correctly classified:
  Repeat for each point xi:
    Calculate the margin yi(w.xi) (this is a number)
    If the margin > 0, point xi is correctly classified
    Else, change the weights to increase the margin: change the weights by an amount proportional to yi.xi
Note that, if yi = 1:
  if xji > 0 then wj increases (so the margin increases)
  if xji < 0 then wj decreases (so the margin again increases)
Similarly, for yi = -1, the margin always increases.
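A minimal runnable sketch of this loop, assuming the extended form (a constant 1 appended to each point so that b is the last weight); the function name and the tiny dataset are illustrative, not taken from the slides.

```python
import numpy as np

def perceptron(X, y, rate=0.1, max_epochs=1000):
    """Perceptron on extended vectors: X is (n_points, n_features), y holds +1/-1 labels."""
    Xe = np.hstack([X, np.ones((len(X), 1))])    # append constant 1 -> b becomes the last weight
    w = np.full(Xe.shape[1], 0.1)                # initial weights, e.g. [.1, .1, .1]
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xe, y):
            if yi * np.dot(w, xi) <= 0:          # margin not positive: a mistake
                w += rate * yi * xi              # move w in the direction yi * xi
                mistakes += 1
        if mistakes == 0:                        # all points correctly classified
            return w
    return w                                     # may not separate if the data isn't separable

# Tiny illustrative dataset (not the bankruptcy data)
X = np.array([[2.0, 3.0], [3.0, 4.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))                          # extended weights [w1, w2, b]
```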

14 Perceptron algorithm (explanations)
The first step is to start with an initial value of the weight vector, usually all zeros. Then we repeat the inner loop until all the points are correctly classified using the current weight vector. The inner loop is to consider each point. If the point's margin is positive then it is correctly classified and we do nothing. Otherwise, if it is negative or zero, we have a mistake and we want to change the weights so as to increase the margin (so that it ultimately becomes positive). The trick is how to change the weights. It turns out that using a value proportional to yi.xi is the right thing. We'll see why, formally, later.

15 Perceptron algorithm So, each change of w increases the margin on a particular point. However, the changes for the different points interfere with each other, that is, different points might change the weights in opposing directions. So, it will not be the case that one pass through the points will produce a correct weight vector. In general, we will have to go around multiple times. The remarkable fact is that the algorithm is guaranteed to terminate with the weights for a separating hyperplane as long as the data is linearly separable. The proof of this fact is beyond our scope. Notice that if the data is not separable, then this algorithm is an infinite loop. It turns out that it is a good idea to keep track of the best separator we've seen so far (the one that makes the fewest mistakes) and after we get tired of going around the loop, return that one.
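A sketch of that last idea, building on the illustrative perceptron above: cap the number of passes and keep the weights that make the fewest mistakes on the training points (this variant is often called the pocket algorithm; the code structure here is an assumption, not taken from the slides).

```python
import numpy as np

def count_mistakes(w, Xe, y):
    """Number of extended points that w classifies incorrectly."""
    return sum(1 for xi, yi in zip(Xe, y) if yi * np.dot(w, xi) <= 0)

def perceptron_best(X, y, rate=0.1, max_epochs=1000):
    """Perceptron that remembers the best separator seen so far (fewest mistakes)."""
    Xe = np.hstack([X, np.ones((len(X), 1))])
    w = np.full(Xe.shape[1], 0.1)
    best_w, best_err = w.copy(), count_mistakes(w, Xe, y)
    for _ in range(max_epochs):
        for xi, yi in zip(Xe, y):
            if yi * np.dot(w, xi) <= 0:
                w += rate * yi * xi
        err = count_mistakes(w, Xe, y)           # how good is the current separator?
        if err < best_err:
            best_w, best_err = w.copy(), err     # keep the best one seen so far
        if err == 0:                             # separable case: done
            break
    return best_w
```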

16 Perceptron algorithm Bankruptcy data
This shows a trace of the perceptron algorithm on the bankruptcy data. Here it took 49 iterations through the data (the outer loop) for the algorithm to stop. The separator at the end of the loop is [0.4, 0.94, -2.2]. We usually pick some small "rate" constant to scale the change to w; .1 is used here, but other small values also work well.

17 Gradient Ascent/Descent
Why pick yi.xi as the increment to the weights? The margin is a function of several input variables; the variables are w2, w1, w0 (or in general wn, …, w0). In order to reach the maximum of this function, it is good to change the variables in the direction of the slope of the function. The slope is represented by the gradient of the function: the vector of first (partial) derivatives of the function with respect to each of the input variables.
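As a sketch of why that is the right increment (using the extended form from slide 12, so the margin of a single point xi is γi = yi(w.xi)): the partial derivative of γi with respect to one weight wj is ∂γi/∂wj = yi xji. Collecting these derivatives gives the gradient ∇w γi = yi.xi, so changing the weights by an amount proportional to yi.xi is exactly a gradient-ascent step on the margin of the misclassified point, which is the perceptron update rule above.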

