Crash Course on Machine Learning Part III


Crash Course on Machine Learning, Part III. Several slides from Luke Zettlemoyer, Carlos Guestrin, Derek Hoiem, and Ben Taskar.

Logistic regression and Boosting
Logistic regression: minimize the log loss Σi ln(1 + exp(-yi f(xi))), where f(x) = w0 + Σj wj xj and the features xj are predefined; jointly optimize the parameters w0, w1, … wn via gradient ascent.
Boosting: minimize the exponential loss Σi exp(-yi f(xi)), where f(x) = Σt αt ht(x) and each weak learner ht(xi) is defined dynamically to fit the data; the weights αt are learned incrementally (a new one for each training pass).

What you need to know about Boosting
Combine weak classifiers to get a very strong classifier. A weak classifier is only slightly better than random on the training data; the resulting very strong classifier can get zero training error. The AdaBoost algorithm.
Boosting v. Logistic Regression: both are linear models, but boosting "learns" its features; similar loss functions; a single optimization (LR) v. incrementally improving the classification (Boosting).
Most popular application of Boosting: boosted decision stumps! Very simple to implement, very effective classifier.

Linear classifiers – Which line is better?
Each line scores example i by w.x = Σj w(j) x(j).
Candidate criteria: (1) perceptron, (2) minimize weights, (3) max margin.

Pick the one with the largest margin!
The margin measures the height of the w.x + b plane at each point and increases with distance from the dividing line: w.x + b >> 0 far on the positive side, w.x + b = 0 on the line, w.x + b << 0 far on the negative side.
Max Margin: two equivalent forms:
(1) maximize γ over γ, w, b such that w.xj + b ≥ γ when yj = +1 and w.xj + b ≤ -γ when yj = -1
(2) maximize γ over γ, w, b such that yj (w.xj + b) ≥ γ for all j
where w.x = Σj w(j) x(j).

How many possible solutions?
Are there any other ways of writing the same dividing line w.x + b = 0?
2w.x + 2b = 0, 1000w.x + 1000b = 0, ….
Any constant scaling has the same intersection with the z = 0 plane, so it gives the same dividing line!
Do we really want to max over γ, w, b?

Review: Normal to a plane w.x + b = 0
Key terms: the projection of xj onto w, and the unit vector w/||w||, which is normal to the plane.
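A minimal numeric sketch of these two terms, assuming NumPy; the particular w, b, and x values are made up for illustration.

```python
import numpy as np

# Hypothetical plane w.x + b = 0 with w = [3, 4], b = -5, and a point x off the plane.
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([4.0, 3.0])

w_unit = w / np.linalg.norm(w)                  # unit vector normal to the plane
signed_dist = (w @ x + b) / np.linalg.norm(w)   # (w.x + b) / ||w||
x_proj = x - signed_dist * w_unit               # projection of x onto the plane

print(signed_dist)        # 3.8
print(w @ x_proj + b)     # ~0.0, so x_proj lies on the plane
```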

Idea: constrained margin
Generally, place the canonical lines at w.x + b = +1 and w.x + b = -1, with the separator w.x + b = 0 between them. Assume x+ lies on the positive line and x- on the negative line, with x+ = x- + 2γ w/||w||.
Subtracting the two line equations gives 2γ ||w|| = 2, so γ = 1/||w||.
Final result: we can maximize the constrained margin by minimizing ||w||²!!!
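A tiny numeric check of this relationship, assuming NumPy; the vectors are made-up examples, not from the slides.

```python
import numpy as np

# With w = [3, 4], b = -5: ||w|| = 5, so the claimed margin is gamma = 1/||w|| = 0.2.
w = np.array([3.0, 4.0])
b = -5.0
norm_w = np.linalg.norm(w)

x_minus = np.array([0.8, 0.4])                       # w.x_minus + b = -1 (negative canonical line)
x_plus = x_minus + (2.0 / norm_w) * (w / norm_w)     # step 2*gamma along the unit normal

print(w @ x_minus + b)    # -1.0
print(w @ x_plus + b)     # +1.0, so the gap between the canonical lines is 2/||w||
print(1.0 / norm_w)       # gamma = 0.2
```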

Max margin using canonical hyperplanes
With the canonical hyperplanes at w.x + b = +1 and w.x + b = -1 (and the separator at w.x + b = 0), the problem becomes: minimize w.w over w, b subject to yj (w.xj + b) ≥ 1 for all j.
The assumption of canonical hyperplanes (at +1 and -1) changes the objective and the constraints!

Support vector machines (SVMs)
Minimize w.w over w, b subject to yj (w.xj + b) ≥ 1 for all j; the width of the margin between the canonical lines w.x + b = +1 and w.x + b = -1 is 2γ.
Solve efficiently by quadratic programming (QP): well-studied solution algorithms; not simple gradient ascent, but close.
The hyperplane is defined by the support vectors. We could use them as a lower-dimension basis to write down the line, although we haven't seen how yet; more on this later.
Support vectors: data points on the canonical lines. Non-support vectors: everything else; moving them will not change w.
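A minimal sketch of fitting a linear SVM and inspecting its support vectors, assuming scikit-learn is available; the toy dataset and the very large C (to approximate a hard margin) are choices made here for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin QP
clf.fit(X, y)

print(clf.support_vectors_)          # the data points on the canonical lines
print(clf.coef_, clf.intercept_)     # the learned w and b
```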

What if the data is not linearly separable? Add More Features!!! What about overfitting?

What if the data is still not linearly separable?
First idea: jointly minimize w.w + C #(mistakes), i.e., trade off w.w against the number of training mistakes.
How do we trade off the two criteria? Pick C on a development set / by cross validation.
Problems with trading off #(mistakes) and w.w via the 0/1 loss: it is not a QP anymore, it doesn't distinguish near misses from really bad mistakes, and it is NP-hard to find the optimal solution!!!

Slack variables – Hinge loss
Minimize w.w + C Σj ξj over w, b, ξ subject to yj (w.xj + b) ≥ 1 - ξj and ξj ≥ 0, with the canonical lines at w.x + b = +1 and w.x + b = -1 as before.
Slack penalty C > 0: C = ∞ means we have to separate the data! C = 0 means we ignore the data entirely! Select C on the dev set, etc.
For each data point: if the margin is ≥ 1, don't care; if the margin is < 1, pay a linear penalty.
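A short sketch of this soft-margin objective with the slacks written in their hinge form, assuming NumPy; the function name and toy values are illustrative only.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """w.w + C * sum_j xi_j, with xi_j = max(0, 1 - y_j (w.x_j + b))."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)   # xi_j = 0 when the margin is >= 1
    return w @ w + C * slacks.sum()

# Example: one point well inside the margin pays a linear penalty.
w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[2.0, 2.0], [0.2, 0.2], [-2.0, -2.0]])
y = np.array([1, 1, -1])
print(soft_margin_objective(w, b, X, y, C=1.0))   # 2 + 1 * max(0, 1 - 0.4) = 2.6
```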

Side Note: Different Losses (written in terms of the margin m = y f(x))
Logistic regression (log loss): ln(1 + exp(-m)).
Boosting (exponential loss): exp(-m).
SVM (hinge loss): max(0, 1 - m). Hinge loss can lead to sparse solutions!!!!
0-1 loss: 1 if m ≤ 0, 0 otherwise.
All our new losses approximate the 0/1 loss!
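A small sketch that evaluates the four losses at a few margin values, assuming NumPy; it only restates the formulas above in code.

```python
import numpy as np

def zero_one_loss(m): return (m <= 0).astype(float)     # 0/1 loss
def hinge_loss(m):    return np.maximum(0.0, 1.0 - m)   # SVM
def log_loss(m):      return np.log(1.0 + np.exp(-m))   # logistic regression
def exp_loss(m):      return np.exp(-m)                 # boosting

m = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])               # margins m = y * f(x)
for loss in (zero_one_loss, hinge_loss, log_loss, exp_loss):
    print(loss.__name__, np.round(loss(m), 3))
```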

What about multiple classes?

One against All
Learn 3 classifiers: + vs. {0, -} with weights w+, 0 vs. {+, -} with weights w0, and - vs. {+, 0} with weights w-.
Output for x: y = argmaxi wi.x.
Could we learn the lines above? Yes, but it would require slack variables. Any other way? Any problems?
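A minimal one-against-all sketch, assuming scikit-learn; the three synthetic clusters and the use of LinearSVC as the per-class binary learner are choices made here for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 3-class data: one cluster per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (-2.0, 0.0, 2.0)])
y = np.repeat([0, 1, 2], 20)

# One binary classifier per class: class c vs. the rest.
classifiers = [LinearSVC(C=1.0).fit(X, (y == c).astype(int)) for c in (0, 1, 2)]

# Output for x: argmax over the per-class scores w_c . x + b_c.
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
print((scores.argmax(axis=1) == y).mean())   # training accuracy on the toy data
```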

Learn 1 classifier: Multiclass SVM
Simultaneously learn 3 sets of weights (w+, w0, w-). Could we learn the lines above? Currently there are no slack variables…
How do we guarantee the correct labels? We need new constraints: for each example xi with label yi, require w_yi.xi ≥ w_y'.xi + 1 for each of the j possible classes y' ≠ yi.

Learn 1 classifier: Multiclass SVM
Also, we can introduce slack variables, as before: require w_yi.xi ≥ w_y'.xi + 1 - ξi with ξi ≥ 0 for every other class y', and add C Σi ξi to the objective.

What you need to know
Maximizing the margin; the derivation of the SVM formulation; slack variables and hinge loss; the relationship between SVMs and logistic regression (0/1 loss, hinge loss, log loss); tackling multiple classes (one against all, multiclass SVMs).

Constrained optimization
Example: minimize x². With no constraint, x* = 0; with x ≥ -1, still x* = 0; with x ≥ 1, x* = 1.
How do we solve with constraints? Lagrange multipliers!!!

Lagrange multipliers – Dual variables
Take minx x² subject to x ≥ b. Rewrite the constraint as x - b ≥ 0, add a Lagrange multiplier α with the new constraint α ≥ 0, and introduce the Lagrangian (objective) L(x, α) = x² - α(x - b). We will solve minx maxα≥0 L(x, α).
Why does this work at all??? min is fighting max!
If x < b, then (x - b) < 0, so maxα -α(x - b) = ∞; min won't let that happen!!
If x > b, then (x - b) > 0, so for α > 0 the term -α(x - b) is negative and maxα -α(x - b) = 0 with α* = 0; min is cool with 0, and L(x, α) = x² (the original objective).
If x = b, then α can be anything, and L(x, α) = x² (the original objective).
Since min is on the outside, it can force max to behave, and the constraints will be satisfied!!!
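A quick numeric check of the running example from two slides back (minimize x² subject to x ≥ 1), assuming SciPy; the solver choice is incidental, the point is just that the constrained minimum sits at the boundary.

```python
from scipy.optimize import minimize

# min x^2  subject to  x - 1 >= 0
res = minimize(lambda x: x[0] ** 2,
               x0=[5.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1.0}])

print(res.x)   # ~[1.0]: the constraint is tight, so its multiplier is positive
```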

Dual SVM derivation (1) – the linearly separable case
Original optimization problem: minimize ½ w.w over w, b (equivalent to minimizing w.w) subject to yj (w.xj + b) ≥ 1 for all j.
Rewrite the constraints as yj (w.xj + b) - 1 ≥ 0, with one Lagrange multiplier αj ≥ 0 per example.
Lagrangian: L(w, b, α) = ½ w.w - Σj αj [yj (w.xj + b) - 1].

Dual SVM derivation (2) – the linearly separable case
Setting the derivatives of the Lagrangian to zero, we can solve for the optimal w, b as a function of α: w = Σj αj yj xj, with Σj αj yj = 0.
Also, αk > 0 implies the corresponding constraint is tight: yk (w.xk + b) = 1, which gives b.
So, in the dual formulation we will solve for α directly! w, b are computed from α (if needed).

Dual SVM interpretation: Sparsity
Support vectors: αj > 0, the points on the canonical lines w.x + b = ±1.
Non-support vectors: αj = 0; moving them will not change w.
The final solution tends to be sparse: αj = 0 for most j, so we don't need to store those points to compute w or make predictions.

Dual SVM formulation – linearly separable
Substituting w = Σj αj yj xj into the Lagrangian (and some math we are skipping) produces the Dual SVM:
maximize over α: Σi αi - ½ Σi Σj αi αj yi yj (xi.xj), subject to αi ≥ 0 and Σi αi yi = 0.
Notes: max instead of min; one α for each training example; the sums run over all training examples; the αi and yi are scalars, and the examples enter only through the dot product xi.xj.
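A brief sketch that checks the primal-dual link w = Σi αi yi xi numerically, assuming scikit-learn (whose dual_coef_ attribute stores αi yi for the support vectors); the toy data and the large C are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ the separable case

alpha_times_y = clf.dual_coef_.ravel()        # alpha_i * y_i, nonzero only for support vectors
w_from_dual = alpha_times_y @ clf.support_vectors_

print(w_from_dual)        # should match the primal weight vector below
print(clf.coef_.ravel())
```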

Dual for the non-separable case – same basic story (we will skip the details)
Primal: minimize ½ w.w + C Σj ξj over w, b, ξ subject to yj (w.xj + b) ≥ 1 - ξj and ξj ≥ 0.
Dual: the same as in the separable case, except with the constraint 0 ≤ αi ≤ C.
What changed? We added an upper bound of C on the αi!
Intuitive explanation: without slack, αi → ∞ when constraints are violated (points misclassified); the upper bound of C limits the αi, so misclassifications are allowed.

Wait a minute: why did we learn about the dual SVM? There are some quadratic programming algorithms that can solve the dual faster than the primal At least for small datasets But, more importantly, the “kernel trick”!!! Another little detour…

Reminder: What if the data is not linearly separable? Use features of features of features of features…. Feature space can get really large really quickly!

Higher order polynomials
With m input features and a polynomial of degree d, the number of monomial terms grows fast: for d = 6 and m = 100, about 1.6 billion terms.
[Plot: number of monomial terms vs. number of input dimensions, for d = 2 and d = 3.]
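A one-liner check of that count, assuming Python's standard library; it uses the standard stars-and-bars formula C(m + d - 1, d) for the number of degree-d monomials in m variables.

```python
from math import comb

def n_monomials(m, d):
    """Number of monomials of degree exactly d in m input features."""
    return comb(m + d - 1, d)

print(n_monomials(100, 2))   # 5,050
print(n_monomials(100, 3))   # 171,700
print(n_monomials(100, 6))   # 1,609,344,100 -- about 1.6 billion terms
```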

The dual formulation only depends on dot-products, not on w!
Remember, the examples x only appear in one dot product, xi.xj.
First, we introduce features: replace xi.xj with Φ(xi).Φ(xj).
Next, replace the dot product with a Kernel: K(xi, xj) = Φ(xi).Φ(xj).
Why is this useful???

Efficient dot-product of polynomials
Polynomials of degree exactly d:
d = 1: Φ(x).Φ(z) = x.z
d = 2: Φ(x).Φ(z) = (x.z)²
For any d (we will skip the proof): Φ(x).Φ(z) = (x.z)^d.
Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional space and then taking the dot product.
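A quick numeric sanity check of the d = 2 case, assuming NumPy; here Φ is taken to be the explicit map to all pairwise products, which is one standard feature map for this kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

phi = lambda v: np.outer(v, v).ravel()   # all products v_i * v_j (9-dimensional here)

lhs = phi(x) @ phi(z)                    # dot product in the feature space
rhs = (x @ z) ** 2                       # kernel computed in the input space
print(np.isclose(lhs, rhs))              # True
```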

Finally: the "kernel trick"!
Never compute the features explicitly!!! Compute the dot products in closed form: constant-time high-dimensional dot-products for many classes of features.
But it takes O(n²) time in the size of the dataset to compute the objective; naïve implementations are slow, and there is much work on speeding this up.

Common kernels
Polynomials of degree exactly d: K(u, v) = (u.v)^d
Polynomials of degree up to d: K(u, v) = (u.v + 1)^d
Gaussian kernels: K(u, v) = exp(-||u - v||² / 2σ²)
Sigmoid: K(u, v) = tanh(η u.v + ν)
And many others: a very active area of research!
The feature space for the Gaussian kernel is infinite!!!
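Small sketches of those kernels as plain functions of two vectors, assuming NumPy; the parameter names (c, sigma, eta, nu) are conventional choices made here, not taken from the slides.

```python
import numpy as np

def poly_exact(u, v, d):            return (u @ v) ** d        # degree exactly d
def poly_up_to(u, v, d, c=1.0):     return (u @ v + c) ** d    # degree up to d
def gaussian(u, v, sigma=1.0):      return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))
def sigmoid_kernel(u, v, eta=1.0, nu=0.0):  return np.tanh(eta * (u @ v) + nu)
```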

Overfitting? Huge feature space with kernels, what about overfitting??? Maximizing margin leads to sparse set of support vectors Some interesting theory says that SVMs search for simple hypothesis with large margin Often robust to overfitting But everything overfits sometimes!!! Can control by: Setting C Choosing a better Kernel Varying parameters of the Kernel (width of Gaussian, etc.)

SVMs with kernels
Choose a set of features and a kernel function. Solve the dual problem to get the support vectors and the αi.
At classification time: if we need to build Φ(x), we are in trouble! Instead compute f(x) = Σi αi yi K(xi, x) + b, where the sum runs over the support vectors, and classify as sign(f(x)).
We only need to store the support vectors and the αi!!!
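A compact sketch of that prediction rule, assuming NumPy; the function signature and argument names are illustrative, and the kernel argument can be any of the functions sketched above.

```python
import numpy as np

def svm_predict(x, support_vectors, alphas, labels, b, kernel):
    """Classify x as sign( sum_i alpha_i y_i K(x_i, x) + b ), over the support vectors only."""
    k = np.array([kernel(sv, x) for sv in support_vectors])
    return np.sign((alphas * labels) @ k + b)
```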

Reminder: Kernel regression
Instance-based learning needs:
A distance metric: Euclidean (and many more).
How many nearby neighbors to look at? All of them.
A weighting function: wi = exp(-D(xi, query)² / Kw²). Points near the query are weighted strongly, far points weakly. The Kw parameter is the kernel width; very important.
How to fit with the local points? Predict the weighted average of the outputs: predict = Σ wi yi / Σ wi.
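A minimal sketch of that predictor (Nadaraya-Watson style), assuming NumPy; the function and variable names are made up here.

```python
import numpy as np

def kernel_regression(query, X, y, kw):
    """Predict sum_i w_i y_i / sum_i w_i with w_i = exp(-D(x_i, query)^2 / Kw^2)."""
    d2 = np.sum((X - query) ** 2, axis=1)   # squared Euclidean distances to the query
    w = np.exp(-d2 / kw ** 2)               # nearby points weighted strongly
    return (w @ y) / w.sum()

# Example on toy 1-D data: the prediction is a locally weighted average of y.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(kernel_regression(np.array([1.5]), X, y, kw=1.0))
```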

Kernels in logistic regression
Define the weights in terms of the data points: w = Σi αi Φ(xi), so that w.Φ(x) = Σi αi K(xi, x).
Derive a simple gradient descent rule on the αi and b.
Similar tricks work for all linear models: the perceptron, etc.

SVMs v. Kernel Regression
Both predict with a kernel-weighted combination of the training points: sign(Σi αi yi K(xi, x) + b) for SVMs, or Σi wi yi / Σi wi for kernel regression.
SVMs: learn the weights αi (and the bandwidth); often a sparse solution.
KR: fixed "weights", learn the bandwidth; the solution may not be sparse; much simpler to implement.

What's the difference between SVMs and Logistic Regression?
Loss function: hinge loss (SVMs) vs. log loss (logistic regression).
High-dimensional features with kernels: yes!!! (SVMs) vs. actually, yes! (logistic regression).

What's the difference between SVMs and Logistic Regression? (Revisited)

                              SVMs                          Logistic Regression
Loss function                 Hinge loss                    Log-loss
Kernels                       Yes!                          Yes!
Solution sparse               Often yes!                    Almost always no!
Semantics of learned model    Linear model from "Margin"    Probability distribution