Announcements See Chapter 5 of Duda, Hart, and Stork. Tutorial by Burges linked to on the web page. “Learning quickly when irrelevant attributes abound,” by Littlestone, Machine Learning 2:285–318, 1988.

Supervised Learning Classification with labeled examples. Images are represented as vectors in a high-dimensional space.

Supervised Learning The labeled examples are called the training set. The query examples are called the test set. The training and test sets must come from the same distribution.

Linear Discriminants Images are represented as vectors x_1, x_2, …. Use these to find a hyperplane defined by a weight vector w and offset w_0. A point x is on the hyperplane when w^T x + w_0 = 0. Notation: a^T = [w_0, w_1, …] and y^T = [1, x_1, x_2, …], so the hyperplane is a^T y = 0. A query q (in augmented form) is classified based on whether a^T q > 0 or a^T q < 0.
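
As a quick illustration of this decision rule, here is a minimal Python sketch (the data, the hyperplane, and the function name classify are made up for this example): augment the query with a leading 1 and test the sign of a^T y.

    import numpy as np

    def classify(a, x):
        """Classify query x with augmented weight vector a = [w0, w1, ...]."""
        y = np.concatenate(([1.0], x))   # augmented vector y = [1, x1, x2, ...]
        return 1 if a @ y > 0 else -1    # sign of a^T y decides the class

    # Example: hyperplane x1 + x2 - 1 = 0, i.e. a = [w0, w1, w2] = [-1, 1, 1]
    a = np.array([-1.0, 1.0, 1.0])
    print(classify(a, np.array([2.0, 2.0])))   # a^T y = 3 > 0  -> class +1
    print(classify(a, np.array([0.0, 0.0])))   # a^T y = -1 < 0 -> class -1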

Why linear discriminants? Optimal if the classes are Gaussian with equal covariances. Linear separators are easier to find. Hyperplanes have few parameters, which helps prevent overfitting. –They have lower VC dimension, but we don’t have time to cover this.

Linearly Separable Classes

For one class, a^T y > 0; for the other class, a^T y < 0. It is notationally convenient to negate y for the second class. Then finding a linear separator means finding an a such that a^T y_i > 0 for all i. Note that this is a linear program. –The problem is convex, so descent algorithms converge to the global optimum.
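
To make the linear-programming view concrete, here is a sketch that checks separability with scipy.optimize.linprog (the toy data and the rescaling of the constraint to a^T y_i >= 1 are my own choices, not from the slides).

    import numpy as np
    from scipy.optimize import linprog

    # Augmented, sign-normalized vectors y_i (second class already negated),
    # so a separator exists iff there is an a with Y a > 0.
    Y = np.array([[1.0,  2.0, 1.0],    # class 1 examples: y = [1, x1, x2]
                  [1.0,  1.5, 2.0],
                  [-1.0, 0.5, 0.2],    # class 2 examples, negated
                  [-1.0, 0.1, 0.4]])

    n = Y.shape[1]
    # Feasibility LP: find any a with Y a >= 1 (rescaling a turns "> 0" into ">= 1").
    res = linprog(c=np.zeros(n),                 # nothing to optimize, pure feasibility
                  A_ub=-Y, b_ub=-np.ones(len(Y)),
                  bounds=[(None, None)] * n)     # a is unconstrained in sign
    print("separable:", res.success, "a =", res.x)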

Perceptron Algorithm Perceptron error function: J_p(a) = Σ_{y in Y} (−a^T y), where Y is the set of misclassified vectors. Gradient descent therefore updates a by a ← a + c Σ_{y in Y} y. Simplify by cycling through the y’s and, whenever one is misclassified, updating a ← a + c y. This converges after a finite number of steps when the classes are linearly separable.
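
A minimal sketch of the single-sample, fixed-increment version (assuming the rows of Y are already augmented and sign-normalized; the function name, epoch cap, and toy data are mine):

    import numpy as np

    def perceptron(Y, c=1.0, max_epochs=1000):
        """Fixed-increment single-sample perceptron.
        Rows of Y are augmented, sign-normalized samples, so 'correct' means a^T y > 0."""
        a = np.zeros(Y.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for y in Y:                  # cycle through the samples
                if a @ y <= 0:           # y is misclassified
                    a = a + c * y        # update a <- a + c y
                    mistakes += 1
            if mistakes == 0:            # all a^T y > 0: a separator was found
                return a
        return a

    # Tiny separable example (two samples per class, already sign-normalized).
    Y = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [-1.0, 0.5, 0.2], [-1.0, 0.1, 0.4]])
    print(perceptron(Y))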

Perceptron Intuition (figure: the weight vector a(k) and the updated weight vector a(k+1))

Support Vector Machines Extremely popular. Find the linear separator with maximum margin. –There are some guarantees that this generalizes well. Can work in high-dimensional spaces without overfitting. –Nonlinearly map to a high-dimensional space, then find a linear separator there. –Special tricks (kernels) allow efficiency in ridiculously high-dimensional spaces. Can also handle non-separable classes. –Less important if the space is very high-dimensional.
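
As one way to see these pieces together, here is a hedged sketch using scikit-learn’s SVC (an off-the-shelf SVM, not code from this course); the RBF kernel plays the role of the nonlinear map to a high-dimensional space, and C controls the penalty for non-separable points. The toy data is made up.

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: two noisy classes in 2-D.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # RBF kernel = implicit nonlinear map to a high-dimensional space;
    # C trades margin width against training errors (non-separable case).
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print("support vectors per class:", clf.n_support_)
    print("prediction for [1.5, 1.5]:", clf.predict([[1.5, 1.5]]))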

Maximum Margin Maximize the minimum distance from the hyperplane to the points. The points at this minimum distance are the support vectors.

Geometric Intuitions Maximizing the margin between the points is equivalent to maximizing the margin between their convex hulls.

This implies that the max-margin hyperplane is orthogonal to the vector connecting the nearest points of the two convex hulls, and lies halfway between them.
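
A tiny numeric illustration of this geometric picture (the two “nearest points” p and q below are simply assumed; finding them is the real work): the normal is w = p − q and the hyperplane passes through their midpoint.

    import numpy as np

    p = np.array([1.0, 1.0])   # nearest point of one class's convex hull (assumed known)
    q = np.array([3.0, 2.0])   # nearest point of the other class's convex hull

    w = p - q                  # normal vector: orthogonal to the separating hyperplane
    m = (p + q) / 2            # the hyperplane passes halfway between p and q

    decision = lambda x: np.sign(w @ (x - m))
    print(decision(p), decision(q))                  # +1.0 and -1.0: each point on its own side
    print(np.abs(w @ (p - m)) / np.linalg.norm(w))   # the margin equals ||p - q|| / 2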

Non-statistical Learning There is a class of functions that could label the data. Our goal is to select the correct function with as little information as possible. Don’t think of the data as coming from classes described by probability distributions. Look at worst-case performance. –This is the CS-style approach. –In the statistical model, the worst case is not meaningful.

On-Line Learning Let X be a set of objects (e.g., vectors in a high-dimensional space). Let C be a class of possible classifying functions (e.g., hyperplanes). –Each f in C maps X to {0,1}. –One of these correctly classifies all the data. The learner is asked to classify an item in X, then told the correct class. Eventually, the learner determines the correct f. –Measure the number of mistakes made. –This gives a worst-case bound for a learning strategy.

VC Dimension A subset S of X is shattered by C if, for every subset U of S, there exists an f in C such that f is 1 on U and 0 on S − U. The VC dimension of C is the size of the largest set shattered by C.
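
To make the definition concrete, here is a brute-force shattering check for a very simple class, 1-D threshold functions f_t(x) = [x ≥ t] (the helper name and the candidate-threshold trick are my own illustration):

    from itertools import product

    def shatters_thresholds(S):
        """Check whether 1-D threshold classifiers f_t(x) = [x >= t] shatter the set S."""
        S = sorted(S)
        # Candidate thresholds: below all points, between consecutive points, above all.
        cands = [S[0] - 1] + [(a + b) / 2 for a, b in zip(S, S[1:])] + [S[-1] + 1]
        for labels in product([0, 1], repeat=len(S)):          # every possible labeling of S
            realizable = any(all(int(x >= t) == l for x, l in zip(S, labels))
                             for t in cands)
            if not realizable:                                  # some labeling cannot be produced
                return False
        return True

    print(shatters_thresholds([2.0]))        # True  -> VC dimension of thresholds is >= 1
    print(shatters_thresholds([2.0, 5.0]))   # False -> labeling (1, 0) is impossible; VC dim is 1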

VC Dimension and Worst-Case Learning Any learning strategy makes at least VCdim(C) mistakes in the worst case. –If S is shattered by C, –then for any assignment of labels to S there is an f in C that makes this assignment, –so every choice the learner makes on S can be wrong.

Winnow The elements of x have binary values. Find weights w. Classify x by whether w^T x > 1 or w^T x < 1. The algorithm requires that there exist weights u such that: –u^T x > 1 when f(x) = 1, –u^T x < 1 − δ when f(x) = 0. –That is, there is a margin of δ.

Winnow Algorithm Initialize w = (1, 1, …, 1). Let θ be the threshold (e.g., θ = n) and α > 1 the update factor. Decision: predict 1 if w^T x > θ. If the learner is correct, the weights don’t change. If wrong: –Prediction 1, correct response 0: w_i := w_i / α if x_i = 1, w_i unchanged if x_i = 0 (demotion step). –Prediction 0, correct response 1: w_i := α w_i if x_i = 1, w_i unchanged if x_i = 0 (promotion step).
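
A short sketch of one pass of this update rule (the choices α = 2 and θ = n, and the toy disjunction data, are illustrative assumptions):

    import numpy as np

    def winnow(X, labels, alpha=2.0):
        """One pass of Winnow. X: rows of binary feature vectors, labels in {0, 1}."""
        n = X.shape[1]
        w = np.ones(n)                      # initialize w = (1, 1, ..., 1)
        theta = n                           # threshold theta = n
        mistakes = 0
        for x, label in zip(X, labels):
            pred = int(w @ x > theta)       # decision: w^T x > theta
            if pred == label:
                continue                    # correct: weights don't change
            mistakes += 1
            if pred == 1 and label == 0:    # demotion step
                w[x == 1] /= alpha
            else:                           # pred 0, label 1: promotion step
                w[x == 1] *= alpha
        return w, mistakes

    # Target concept: monotone disjunction f(x) = x1 OR x3 (features 0 and 2), toy data.
    X = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]])
    labels = np.array([1, 0, 1, 0])
    print(winnow(X, labels))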

Some Intuitions Note that this is just like the Perceptron, but with a multiplicative change instead of an additive one. It moves the weights in the right direction: –if w^T x was too big, shrink the weights that affect the inner product; –if w^T x was too small, make those weights bigger. The weights change more rapidly (exponentially in the number of mistakes, not linearly).

Theorem The number of errors is bounded in terms of the margin δ and grows only logarithmically in the number of features n (set θ = n). Note: if x_i is an irrelevant feature, then u_i = 0, so errors grow logarithmically with the number of irrelevant features. Empirically, Perceptron errors grow linearly. This is optimal for k-literal monotone disjunctions: f(x_1, …, x_n) = x_{i1} ∨ x_{i2} ∨ … ∨ x_{ik}.

Winnow Summary Simple algorithm; variation on Perceptron. Great with irrelevant attributes. Optimal for monotone disjunctions.