
1 Margins, support vectors, and linear programming, oh my! Reading: Bishop, 4.0, 4.1, 7.0, 7.1 Burges tutorial (on class resources page)

2 Administrivia I Tomorrow is the deadline for the CS grad student conference (CSUSC): http://www.cs.unm.edu/~csgsa/conference/cfp.html Still time to submit!

3 Administrivia II: Straw poll
Which would you rather do first (or possibly at all)?
- Unsupervised learning: clustering, structure of data, scientific discovery (genomics, taxonomy, etc.)
- Reinforcement learning: control, robot navigation, learning behavior
Let me know (in person, email, etc.)

4 Group Reading #1 (due Feb 20)
Dietterich, T. G., "An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization." Machine Learning, 40(2), 139-157. DOI: 10.1023/A:1007607513941 (http://www.springerlink.com)
Background material:
- Dietterich, T. G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz
- DH&S Ch 9.5.1-2

5 Time rolls on...
Last time:
- The linear regression problem
- Squared-error loss
- Vector derivatives
- The least-squares hyperplane
This time:
- What about multiclass, nonlinear, or nonseparable gunk?
- Intro to support vector machines

6 Exercise: Derive the vector derivative expressions, and find an expression for the minimum squared-error weight vector w in the loss function (both are reconstructed below).
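The equations on this slide were images and are not in the transcript. The following is a reconstruction under the assumption that the loss is the squared-error loss from the previous lecture and that the identities to derive are the standard vector-derivative rules; this is consistent with the Gram-matrix and pseudoinverse discussion on the later slides.

```latex
% Assumed reconstruction (the slide's own equations are not in the transcript).
% Squared-error loss in matrix form:
\[
  L(\mathbf{w}) \;=\; \sum_{i=1}^{N}\bigl(t_i - \mathbf{w}^{\top}\mathbf{x}_i\bigr)^{2}
              \;=\; \lVert \mathbf{t} - X\mathbf{w}\rVert^{2}
\]
% Vector-derivative identities to derive:
\[
  \frac{\partial}{\partial\mathbf{w}}\bigl(\mathbf{a}^{\top}\mathbf{w}\bigr) = \mathbf{a},
  \qquad
  \frac{\partial}{\partial\mathbf{w}}\bigl(\mathbf{w}^{\top}A\,\mathbf{w}\bigr)
    = \bigl(A + A^{\top}\bigr)\mathbf{w}
\]
```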

7 Solution to LSE regression
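The solution on this slide was also an image. A reconstruction of the standard derivation, assuming the loss given above, is:

```latex
% Expand the loss and set its gradient with respect to w to zero:
\[
  \nabla_{\mathbf{w}} L(\mathbf{w})
    = \nabla_{\mathbf{w}}\bigl(\mathbf{t}^{\top}\mathbf{t}
        - 2\,\mathbf{w}^{\top}X^{\top}\mathbf{t}
        + \mathbf{w}^{\top}X^{\top}X\,\mathbf{w}\bigr)
    = -2\,X^{\top}\mathbf{t} + 2\,X^{\top}X\,\mathbf{w} = \mathbf{0}
\]
% which gives the least-squares weight vector (when X^T X is invertible):
\[
  \mathbf{w} = \bigl(X^{\top}X\bigr)^{-1}X^{\top}\mathbf{t}
\]
```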

8 The LSE method
- The quantity X^T X is called a Gram matrix; it is positive semidefinite and symmetric
- The quantity (X^T X)^{-1} X^T is the pseudoinverse of X; it may not exist if the Gram matrix is not invertible
- The complete "learning algorithm" is 2 whole lines of Matlab code (a sketch follows below)
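The two lines of Matlab themselves are not in the transcript; here is a minimal NumPy sketch of the same computation (function names are illustrative, not from the course):

```python
# Minimal NumPy sketch of LSE regression via the pseudoinverse.
# Assumes X is an (N x d) design matrix and t an (N,) target vector.
import numpy as np

def lse_fit(X, t):
    """Least-squares weights w minimizing ||X w - t||^2."""
    # np.linalg.pinv(X) equals (X^T X)^{-1} X^T when the Gram matrix is invertible,
    # and falls back to a stable SVD-based pseudoinverse when it is not.
    return np.linalg.pinv(X) @ t

def lse_predict(X, w):
    return X @ w
```

If a bias/intercept term is wanted, append a constant-1 column to X before fitting.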

9 LSE example: w = [6.72, -0.36]

10 LSE example (plot): the fitted regression line t = y(x1, w)

11 LSE example (annotated plot): the line t = y(x1, w), with annotations [0.36, 1] and 6.72 relating the line's geometry to w

12 The LSE method
- So far, we have a regressor -- it estimates a real-valued t_i for each x_i
- Can convert it to a classifier by assigning t = +1 or -1 to binary-class training data (see the sketch below)
- Q: How do you handle non-binary data?
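A one-line sketch of that conversion, assuming the ±1 label encoding above:

```python
# Hedged sketch: threshold the regressor's real-valued output at zero to get a
# binary classifier (training labels assumed to be encoded as +1 / -1).
import numpy as np

def lse_classify(X, w):
    return np.sign(X @ w)   # +1 or -1 (0 only if the score is exactly zero)
```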

13 Handling non-binary data
- DTs and k-NN can handle multi-class data
- Linear discriminants (& many other) learners work only on binary data
- 3 ways to "hack" binary classifiers to c-ary data:

14 Handling non-binary data
- DTs and k-NN can handle multi-class data
- Linear discriminants (& many other) learners work only on binary data
- 3 ways to "hack" binary classifiers to c-ary data:
  - 1 against many: train c classifiers to recognize "class 1 vs anything else", "class 2 vs everything else", ... (sketched below)
    - Cheap, easy
    - May drastically unbalance the classes for each classifier
    - What if two classifiers make different predictions?
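A sketch of the 1-against-many scheme. The binary learner's API (fit / decision_function) is an assumption for illustration, not something specified on the slide:

```python
# Hedged sketch of the "1 against many" (one-vs-rest) hack around a binary learner.
import numpy as np

class OneVsRest:
    def __init__(self, make_binary_learner):
        self.make_binary_learner = make_binary_learner

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            t = np.where(y == c, 1, -1)        # class c vs. everything else
            m = self.make_binary_learner()
            m.fit(X, t)
            self.models_.append(m)
        return self

    def predict(self, X):
        # Resolve disagreements by taking the most confident classifier.
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]
```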

15 Multiclass trouble (figure): what happens when two of the 1-against-many classifiers claim the same test point, or none does?

16 Handling non-binary data
- All against all: train O(c^2) classifiers, one for each pair of classes (sketched below)
  - Run every test point through all classifiers; majority vote for the final classification
  - More stable than 1 vs many
  - Lots more overhead, especially for large c
  - Data may be more balanced, but each classifier is trained on a very small part of the data
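A sketch of the all-against-all scheme under the same assumed binary-learner API (here fit / predict returning ±1):

```python
# Hedged sketch of "all against all" (one-vs-one): one binary learner per pair
# of classes, majority vote at test time.
import numpy as np
from itertools import combinations

class OneVsOne:
    def __init__(self, make_binary_learner):
        self.make_binary_learner = make_binary_learner

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {}
        for a, b in combinations(self.classes_, 2):    # O(c^2) pairs
            mask = (y == a) | (y == b)
            t = np.where(y[mask] == a, 1, -1)
            m = self.make_binary_learner()
            m.fit(X[mask], t)                          # trained on a small slice of the data
            self.models_[(a, b)] = m
        return self

    def predict(self, X):
        votes = np.zeros((len(X), len(self.classes_)), dtype=int)
        idx = {c: i for i, c in enumerate(self.classes_)}
        for (a, b), m in self.models_.items():
            pred = m.predict(X)                        # +1 means class a, -1 means class b
            votes[pred == 1, idx[a]] += 1
            votes[pred == -1, idx[b]] += 1
        return self.classes_[np.argmax(votes, axis=1)]
```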

17 Handling non-binary data
- Coding theory approach: given c classes, choose b ≥ lg(c)
  - Assign each class a b-bit "code word"
  - Train one classifier for each bit
  - Apply each classifier to a test instance => new code => reconstruct the class (sketched after the tables below)

Original data:
  x1      x2    x3    y
  green   3.2   -9    apple
  yellow  1.8   0.7   lemon
  yellow  6.9   -3    banana
  red     0.8         grape
  green   3.4   0.9   pear

Re-encoded with 3-bit code words:
  x1      x2    x3    y1  y2  y3
  green   3.2   -9    0   0   0
  yellow  1.8   0.7   0   0   1
  yellow  6.9   -3    0   1   0
  red     0.8         0   1   1
  green   3.4   0.9   1   0   0
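A sketch of the coding-theory approach. The code-word assignment and the binary-learner API are assumptions for illustration:

```python
# Hedged sketch of the output-coding approach: each class gets a b-bit code word,
# one binary learner per bit; a test point's predicted bits are matched to the
# nearest code word (Hamming distance).
import numpy as np

def fit_output_code(X, y, codes, make_binary_learner):
    """codes: dict mapping class label -> tuple of b bits (0/1)."""
    b = len(next(iter(codes.values())))
    models = []
    for j in range(b):
        t = np.array([1 if codes[label][j] == 1 else -1 for label in y])
        m = make_binary_learner()
        m.fit(X, t)
        models.append(m)
    return models

def predict_output_code(X, models, codes):
    bits = np.column_stack([(m.predict(X) > 0).astype(int) for m in models])
    labels, words = zip(*codes.items())
    words = np.array(words)
    # Decode: pick the class whose code word is closest in Hamming distance.
    dists = np.abs(bits[:, None, :] - words[None, :, :]).sum(axis=2)
    return np.array(labels)[np.argmin(dists, axis=1)]
```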

18 Support Vector Machines

19 Linear separators are nice... but what if your data looks like this:

20 Linearly nonseparable data
2 possibilities:
- Use nonlinear separators (a different hypothesis space); possibly an intersection of multiple linear separators, etc. (e.g., a decision tree)

21 Linearly nonseparable data
2 possibilities:
- Use nonlinear separators (a different hypothesis space); possibly an intersection of multiple linear separators, etc. (e.g., a decision tree)
- Change the data: a nonlinear projection of the data
These turn out to be flip sides of each other; it is easier to think about (do the math for) the 2nd case.

22 Nonlinear data projection
- Suppose you have a "projection function" φ that maps the original feature space into a "projected" feature space
- Usually the projected space has (much) higher dimension than the original
- Do the learning with a linear model in the projected space
- Ex: see the sketch below
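The slide's example projection did not survive the transcript; the sketch below uses one common choice, a quadratic feature map for 2-D inputs, so that a linear separator in the projected space corresponds to a quadratic boundary in the original space:

```python
# Hedged sketch of a projection function: quadratic feature map for 2-D inputs.
import numpy as np

def phi(X):
    """Map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2, sqrt(2)*x1, sqrt(2)*x2, 1)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2,
                            np.sqrt(2) * x1, np.sqrt(2) * x2, np.ones(len(X))])

# Learning then uses the same linear machinery as before, just on phi(X), e.g.:
#   w = np.linalg.pinv(phi(X_train)) @ t_train
#   predictions = np.sign(phi(X_test) @ w)
```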

