1 Announcements See Chapter 5 of Duda, Hart, and Stork. Tutorial by Burge linked to on the web page. “Learning quickly when irrelevant attributes abound,” by Littlestone, Machine Learning 2:285–318, 1988.

2 Supervised Learning Classification with labeled examples. Images are represented as vectors in a high-dimensional space.

3 Supervised Learning The labeled examples are called the training set. The query examples are called the test set. The training and test sets must come from the same distribution.

4 Linear Discriminants Images are represented as vectors x_1, x_2, …. Use these to find a hyperplane defined by a vector w and an offset w_0. x is on the hyperplane when w^T x + w_0 = 0. Notation: a^T = [w_0, w_1, …], y^T = [1, x_1, x_2, …], so the hyperplane is a^T y = 0. A query q is classified based on whether a^T q > 0 or a^T q < 0.
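Below is a minimal sketch of this decision rule, assuming numpy; the function name, the example hyperplane, and the query points are illustrative, not from the slides.

    import numpy as np

    def classify(a, x):
        # a = [w_0, w_1, ..., w_d] is the augmented weight vector (illustrative helper).
        # Form the augmented sample y = [1, x_1, ..., x_d] and test the sign of a^T y.
        y = np.concatenate(([1.0], x))
        return 1 if a @ y > 0 else -1

    # Example: the hyperplane x_1 + x_2 - 1 = 0, i.e. w_0 = -1, w = (1, 1).
    a = np.array([-1.0, 1.0, 1.0])
    print(classify(a, np.array([2.0, 2.0])))   # +1: a^T y > 0
    print(classify(a, np.array([0.0, 0.0])))   # -1: a^T y < 0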

5 Why linear discriminants? Optimal if the classes are Gaussian with the same covariance. Linear separators are easier to find. Hyperplanes have few parameters, which helps prevent overfitting. –They have lower VC dimension, but we don’t have time for this.

6 Linearly Separable Classes

7 For one class, a^T y > 0. For the other class, a^T y < 0. It is notationally convenient to negate y for the second class. Then finding a linear separator means finding an a such that a^T y_i > 0 for all i. Note that this is a linear program. –The problem is convex, so descent algorithms converge to a global optimum.
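A minimal sketch of this sign-flip normalization, assuming numpy; the function name and the +1/−1 label convention are assumptions for illustration.

    import numpy as np

    def normalize_samples(X, labels):
        # Augment each sample with a leading 1, then negate the rows of the
        # second class, so a separator is any a with a^T y_i > 0 for every row.
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        Y = np.hstack([np.ones((len(X), 1)), X])
        Y[labels == -1] *= -1
        return Y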

8 Perceptron Algorithm Perceptron error function: J_p(a) = Σ_{y in Y} (−a^T y), where Y is the set of misclassified vectors. So update a by gradient descent: a ← a + c Σ_{y in Y} y. Simplify by cycling through the samples and, whenever one is misclassified, updating a ← a + c y. For separable data, this converges after a finite number of steps.
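A minimal sketch of the single-sample rule above, assuming numpy and separable data with +1/−1 labels; the function name, the increment c = 1, and the epoch cap are illustrative choices.

    import numpy as np

    def perceptron(X, labels, c=1.0, max_epochs=1000):
        # Augment with a leading 1 and negate the second class, as on the
        # previous slide, so "correct" means a^T y > 0 for every sample.
        Y = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
        Y[np.asarray(labels) == -1] *= -1
        a = np.zeros(Y.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for y in Y:
                if a @ y <= 0:          # misclassified sample
                    a = a + c * y       # a <- a + c y
                    mistakes += 1
            if mistakes == 0:           # separator found
                return a
        return a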

9 Perceptron Intuition [Figure: the weight vector a(k) is updated to a(k+1) by adding a misclassified sample, moving it toward the half-space where that sample is correctly classified.]

10 Support Vector Machines Extremely popular. Find the linear separator with maximum margin. –There are some guarantees that this generalizes well. Can work in a high-dimensional space without overfitting. –Nonlinearly map to a high-dimensional space, then find a linear separator there. –Special tricks (the kernel trick) allow efficiency in ridiculously high-dimensional spaces. Can also handle non-separable classes. –This is less important if the space is very high-dimensional.
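A minimal usage sketch, assuming scikit-learn is available; the toy data, the large C value (to approximate a hard margin), and the kernel choice are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    # Two small separable point clouds (toy data for illustration only).
    X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)    # kernel="rbf" would give a nonlinear map
    clf.fit(X, y)

    print(clf.support_vectors_)          # the points at the minimum distance
    print(clf.predict([[2.0, 2.0]]))     # classify a query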

11 Maximum Margin Maximize the minimum distance from hyperplane to points. Points at this minimum distance are support vectors.

12 Geometric Intuitions The maximum margin between the points equals the maximum margin between the convex hulls of the two classes.

13 This implies the max-margin hyperplane is orthogonal to the vector connecting the nearest points of the two convex sets, and lies halfway between them.
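A minimal sketch of that construction, assuming the nearest pair of points between the two convex sets is already known; the function name and the example points are illustrative.

    import numpy as np

    def hyperplane_between(p, q):
        # Max-margin hyperplane w^T x + w_0 = 0: orthogonal to the vector q - p
        # connecting the nearest points, and passing through their midpoint.
        w = q - p
        midpoint = (p + q) / 2.0
        w0 = -w @ midpoint               # so w^T midpoint + w_0 = 0
        return w, w0

    w, w0 = hyperplane_between(np.array([1.0, 0.0]), np.array([3.0, 0.0]))
    # w = [2, 0], w0 = -4: the plane x_1 = 2, halfway between the two points.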

14 Non-statistical learning There is a class of functions that could label the data. Our goal is to select the correct function with as little information as possible. Don’t think of the data as coming from classes described by probability distributions. Look at worst-case performance. –This is the CS-style approach. –In a statistical model, the worst case is not meaningful.

15 On-Line Learning Let X be a set of objects (e.g., vectors in a high-dimensional space). Let C be a class of possible classifying functions (e.g., hyperplanes). –Each f in C maps X → {0, 1}. –One of these correctly classifies all the data. The learner is asked to classify an item in X, then is told the correct class. Eventually, the learner determines the correct f. –Measure the number of mistakes made. –Use a worst-case bound for the learning strategy.
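The sketch below shows this mistake-counting protocol; the learner object and its predict/update methods are hypothetical names, not a fixed API.

    def run_online_learning(learner, stream):
        # Mistake-bound protocol: predict, then learn the correct class,
        # and count how many predictions were wrong.
        mistakes = 0
        for x, true_label in stream:
            if learner.predict(x) != true_label:
                mistakes += 1
            learner.update(x, true_label)   # learner is told the correct class either way
        return mistakes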

16 VC Dimension A subset S of X is shattered by C if, for every subset U of S, there exists an f in C such that f is 1 on U and 0 on S − U. The VC dimension of C is the size of the largest set shattered by C. For example, hyperplanes in R^d shatter d + 1 points in general position but no set of d + 2 points, so their VC dimension is d + 1.
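A small brute-force sketch of the shattering definition; the finite list of classifiers is only an illustrative stand-in for the class C, and the 1-D threshold example is an assumption, not from the slides.

    from itertools import product

    def shatters(classifiers, S):
        # S is shattered if every 0/1 labeling of S is realized by some classifier.
        achieved = {tuple(f(x) for x in S) for f in classifiers}
        return all(labeling in achieved for labeling in product([0, 1], repeat=len(S)))

    # 1-D thresholds x -> 1 if x > t: they shatter any single point, but no pair,
    # since the smaller point can never get label 1 while the larger gets 0.
    thresholds = [lambda x, t=t: int(x > t) for t in (-1.0, 0.5, 1.5, 3.0)]
    print(shatters(thresholds, [1.0]))        # True
    print(shatters(thresholds, [1.0, 2.0]))   # False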

17 VC Dimension and worst-case learning Any learning strategy can be forced to make at least VCdim(C) mistakes in the worst case. –If S is shattered by C, –then for any assignment of labels to S, there is an f in C that makes this assignment, –so every choice the learner makes on S can be wrong.
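A small sketch of that adversary argument: answer the opposite of every prediction on a shattered set, and shattering guarantees some f in C is consistent with those answers. The learner's predict/update methods reuse the hypothetical interface from the on-line learning sketch above.

    def adversary_mistakes(learner, shattered_set):
        # Force a mistake on every point of the shattered set S.
        forced_labels, mistakes = [], 0
        for x in shattered_set:
            prediction = learner.predict(x)
            truth = 1 - prediction           # always the opposite label
            forced_labels.append((x, truth))
            learner.update(x, truth)
            mistakes += 1                    # every prediction was wrong
        # Since S is shattered, some f in C produces exactly forced_labels,
        # so the adversary's answers are consistent with the class.
        return mistakes, forced_labels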

18 Winnow The elements of x have binary values. Find weights w. Classify x by whether w^T x > 1 or w^T x < 1. The algorithm requires that there exist weights u such that: –u^T x > 1 when f(x) = 1, –u^T x < 1 − δ when f(x) = 0. –That is, there is a margin of δ.

19 Winnow Algorithm Initialize w = (1, 1, …, 1). Let α > 1. Decision: predict 1 if w^T x > θ, else 0. If the learner is correct, the weights don’t change. If wrong: –Predicted 1, correct response 0 (demotion step): w_i := w_i / α if x_i = 1; w_i unchanged if x_i = 0. –Predicted 0, correct response 1 (promotion step): w_i := α w_i if x_i = 1; w_i unchanged if x_i = 0.
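A minimal sketch of one Winnow prediction/update step as described above, assuming numpy; the choices α = 2 and θ = n (as on the theorem slide) and the function names are illustrative.

    import numpy as np

    def winnow_predict(w, x, theta):
        return 1 if w @ x > theta else 0

    def winnow_update(w, x, prediction, correct, alpha=2.0):
        # Demotion (predicted 1, correct 0): w_i := w_i / alpha where x_i = 1.
        # Promotion (predicted 0, correct 1): w_i := alpha * w_i where x_i = 1.
        # Weights with x_i = 0, and all weights on a correct prediction, are unchanged.
        if prediction == correct:
            return w
        factor = 1.0 / alpha if prediction == 1 else alpha
        return np.where(x == 1, w * factor, w)

    n = 5
    w = np.ones(n)                            # initialize w = (1, 1, ..., 1)
    x, label = np.array([1, 0, 1, 0, 0]), 1
    pred = winnow_predict(w, x, theta=n)      # w^T x = 2 <= 5, so pred = 0
    w = winnow_update(w, x, pred, label)      # promotion: w = (2, 1, 2, 1, 1)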

20 Some intuitions Note that this is just like the Perceptron, but with a multiplicative change rather than an additive one. It moves the weights in the right direction: –if w^T x was too big, it shrinks the weights that affect the inner product; –if w^T x was too small, it makes those weights bigger. The weights change more rapidly (exponentially in the number of mistakes, not linearly).

21 Theorem The number of errors is bounded by O(k log n) for a target disjunction of k literals (Littlestone 1988). Set θ = n. Note: if x_i is an irrelevant feature, then u_i = 0. Errors grow logarithmically with the number of irrelevant features; empirically, Perceptron errors grow linearly. This is optimal for monotone disjunctions of k literals: f(x_1, …, x_n) = x_{i1} ∨ x_{i2} ∨ … ∨ x_{ik}.

22 Winnow Summary Simple algorithm; variation on Perceptron. Great with irrelevant attributes. Optimal for monotone disjunctions.

