CS 546 Machine Learning in NLP
Review 1: Supervised Learning, Binary Classifiers
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
Augmented and modified by Vivek Srikumar

Announcements
Registration
Piazza
Web site
– Topics
– Papers
Information forms will be available
– Essential for group formation

Supervised learning, Binary classification
1. Supervised learning: The general setting
2. Linear classifiers
3. The Perceptron algorithm
4. Learning as optimization
5. Support vector machines
6. Logistic regression

Where are we?
1. Supervised learning: The general setting
2. Linear classifiers
3. The Perceptron algorithm
4. Learning as optimization
5. Support vector machines
6. Logistic regression

Supervised learning: General setting
Given: Training examples of the form (x, f(x))
– The function f is an unknown function
The input x is represented in a feature space
– Typically x ∈ {0,1}^n or x ∈ ℝ^n
For a training example x, f(x) is called the label
Goal: Find a good approximation for f
The label determines the kind of problem:
– Binary classification: f(x) ∈ {-1, 1}
– Multiclass classification: f(x) ∈ {1, 2, 3, …, K}
– Regression: f(x) ∈ ℝ

Nature of applications
There is no human expert
– e.g., identifying DNA binding sites
Humans can perform a task, but cannot describe how they do it
– e.g., object detection in images
The desired function is hard to obtain in closed form
– e.g., the stock market

Binary classification
Spam filtering
– Is an email spam or not?
Recommendation systems
– Given a user's movie preferences, will she like a new movie?
Malware detection
– Is an Android app malicious?
Time series prediction
– Will the future value of a stock increase or decrease with respect to its current value?

The fundamental problem: Machine learning is ill-posed!
[Figure: an unknown Boolean function y = f(x_1, x_2, x_3, x_4), shown as a partially filled truth table over the inputs x_1, x_2, x_3, x_4]
Can you learn this function? What is it?

Is learning possible at all?
There are 2^16 = 65,536 possible Boolean functions over 4 inputs
– Why? The truth table over 4 inputs has 2^4 = 16 rows. Each way to fill these 16 output slots is a different function, giving 2^16 functions.
We have seen only 7 outputs
We cannot know what the rest are without seeing them
– Think of an adversary filling in the labels every time you make a guess at the function

Solution: Restricted hypothesis space
A hypothesis space is the set of possible functions we consider
– We were looking at the space of all Boolean functions
Instead, choose a hypothesis space that is smaller than the space of all functions
– Only simple conjunctions (with four variables, there are only 16 conjunctions without negations)
– m-of-n rules: pick a set of n variables; at least m of them must be true
– Linear functions
How do we pick a hypothesis space?
– Using some prior knowledge (or by guessing)
What if the hypothesis space is so small that nothing in it agrees with the data?
– We need a hypothesis space that is flexible enough

Where are we?
1. Supervised learning: The general setting
2. Linear classifiers
3. The Perceptron algorithm
4. Learning as optimization
5. Support vector machines
6. Logistic regression

Linear Classifiers
Input is an n-dimensional vector x
Output is a label y ∈ {-1, 1}
Linear threshold units classify an example x using the classification rule
sgn(b + w^T x) = sgn(b + Σ_i w_i x_i)
– b + w^T x ≥ 0 ⇒ predict y = 1
– b + w^T x < 0 ⇒ predict y = -1
The quantity b + w^T x is called the activation
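As a quick illustration, here is a minimal sketch of this prediction rule in Python (the function and variable names are my own, not from the slides):

    import numpy as np

    def predict(w, b, x):
        """Linear threshold unit: return +1 if b + w.x >= 0, else -1."""
        activation = b + np.dot(w, x)
        return 1 if activation >= 0 else -1

    # Example with a 3-dimensional input
    w = np.array([1.0, -2.0, 0.5])
    b = -0.5
    x = np.array([1.0, 0.0, 1.0])
    print(predict(w, b, x))   # activation = -0.5 + 1.5 = 1.0, so the prediction is +1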

The geometry of a linear classifier
[Figure: the classifier sgn(b + w_1 x_1 + w_2 x_2) in the (x_1, x_2) plane; the decision boundary is the line b + w_1 x_1 + w_2 x_2 = 0, with the weight vector [w_1 w_2] drawn normal to it]
In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces
We only care about the sign of the activation, not the magnitude

Linear classifiers are an expressive hypothesis class
Many functions are linear
– Conjunctions: y = x_1 ∧ x_2 ∧ x_3 can be written as y = sgn(-3 + x_1 + x_2 + x_3), i.e. w = (1, 1, 1), b = -3 (see the sketch below)
– At-least-m-of-n functions
We will see later in the class that many structured predictors are linear functions too
Often a good guess for a hypothesis space
Some functions are not linear
– The XOR function
– Non-trivial Boolean functions
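To make the conjunction example concrete, here is a small Python check (my own sketch, using the sign convention sgn(0) = +1 from the prediction rule above) that the threshold unit with w = (1, 1, 1), b = -3 reproduces x_1 ∧ x_2 ∧ x_3 on {0,1} inputs:

    from itertools import product

    def conjunction(x1, x2, x3):
        # Target function: +1 only when all three inputs are 1, else -1
        return 1 if (x1 and x2 and x3) else -1

    def linear_threshold(x1, x2, x3, w=(1, 1, 1), b=-3):
        activation = b + w[0] * x1 + w[1] * x2 + w[2] * x3
        return 1 if activation >= 0 else -1

    # Verify agreement on all 8 Boolean inputs
    for x in product([0, 1], repeat=3):
        assert conjunction(*x) == linear_threshold(*x)
    print("The linear threshold unit matches the conjunction on all inputs.")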

XOR is not linearly separable
[Figure: the four XOR points in the (x_1, x_2) plane, with opposite labels on the two diagonals]
No line can be drawn to separate the two classes

Even these functions can be made linear
The trick: Change the representation
[Figure: labeled points on a one-dimensional axis that cannot be separated by a single threshold]
These points are not separable in one dimension by a line
What is a one-dimensional line, by the way?

Even these functions can be made linear
The trick: Use feature conjunctions
Transform points: Represent each point x in 2 dimensions by (x, x^2)
Now the data is linearly separable in this space!
Exercise: How do you apply this for the XOR case?
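Here is a small sketch of the (x, x^2) trick (my own example data, not from the slides): points that are not separable by a single threshold on the line become separable by a line in the transformed two-dimensional space.

    import numpy as np

    # One-dimensional data: the positive class sits between the negative points,
    # so no single threshold on x separates the two classes.
    x = np.array([-2.0, -0.5, 0.5, 2.0])
    y = np.array([-1, 1, 1, -1])

    # Transform each point x into two dimensions: (x, x^2)
    features = np.stack([x, x ** 2], axis=1)

    # In the transformed space the line x^2 = 1, i.e. w = (0, -1), b = 1, separates the classes.
    w, b = np.array([0.0, -1.0]), 1.0
    predictions = np.where(b + features @ w >= 0, 1, -1)
    print(predictions)   # [-1  1  1 -1], matching y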

Almost linearly separable data
[Figure: two classes in the (x_1, x_2) plane with the classifier sgn(b + w_1 x_1 + w_2 x_2) and the boundary b + w_1 x_1 + w_2 x_2 = 0; a few points fall on the wrong side]
Training data is almost separable, except for some noise
How much noise do we allow for?

Training a linear classifier
Three cases to consider:
1. Training data is linearly separable
– Simple conjunctions, etc.
2. Training data is linearly inseparable
– XOR, etc.
3. Training data is almost linearly separable
– Noise in the data
– We could allow the classifier to make some mistakes on the training data to account for this

Simplifying notation
We can stop writing b at each step because of the following notational sugar:
The prediction function is sgn(b + w^T x)
Rewrite x as [1, x] → x' and w as [b, w] → w' (this increases the dimensionality by one)
Then we can write the prediction as sgn(w'^T x')
We will not show b, and instead fold the bias term into the input by adding an extra feature
But remember that it is there
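A minimal sketch of this bias-folding trick (the helper name is my own, not from the slides):

    import numpy as np

    def fold_bias(w, b, x):
        """Fold the bias into the weights and inputs: w' = [b, w], x' = [1, x]."""
        w_prime = np.concatenate(([b], w))
        x_prime = np.concatenate(([1.0], x))
        return w_prime, x_prime

    w, b = np.array([2.0, -1.0]), 0.5
    x = np.array([1.0, 3.0])
    w_prime, x_prime = fold_bias(w, b, x)

    # The two ways of computing the activation agree:
    print(b + w @ x)            # 0.5 + 2 - 3 = -0.5
    print(w_prime @ x_prime)    # also -0.5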

Where are we?
1. Supervised learning: The general setting
2. Linear classifiers
3. The Perceptron algorithm
4. Learning as optimization
5. Support vector machines
6. Logistic regression

The Perceptron algorithm
Rosenblatt, 1958
The goal is to find a separating hyperplane
– For separable data, guaranteed to find one
An online algorithm
– Processes one example at a time
Several variants exist (we will discuss them briefly towards the end)

The algorithm
Given a training set D = {(x, y)}, x ∈ ℝ^n, y ∈ {-1, 1}
1. Initialize w = 0 ∈ ℝ^n
2. For epoch = 1 … T:
   For each training example (x, y) ∈ D:
      Predict y' = sgn(w^T x)
      If y ≠ y', update w ← w + y x
3. Return w
Prediction: sgn(w^T x)
Notes: The weights are updated only on an error; the Perceptron is a mistake-driven algorithm. T is a hyperparameter of the algorithm. In practice, it is good to shuffle D before the inner loop.
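A minimal Python sketch of this algorithm (my own implementation of the pseudocode above; function and variable names are not from the slides):

    import numpy as np

    def perceptron(X, y, epochs=100):
        """Train a perceptron. X: (m, n) array of examples, y: labels in {-1, +1}."""
        w = np.zeros(X.shape[1])            # 1. initialize w = 0
        for _ in range(epochs):             # 2. T passes over the data
            for x_i, y_i in zip(X, y):      #    (shuffling each epoch, recommended above, is omitted for brevity)
                y_pred = 1 if w @ x_i >= 0 else -1
                if y_pred != y_i:           #    update only on a mistake
                    w = w + y_i * x_i
        return w                            # 3. return w

    # Example: learn the conjunction x1 AND x2, with the bias folded in
    # as a constant first feature (as described two slides earlier).
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1])
    w = perceptron(X, y)
    print(np.where(X @ w >= 0, 1, -1))      # reproduces y on this separable data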

Geometry of the perceptron update
[Figure: a positive example (x, +1) misclassified by w_old; after the update w ← w + y x, the new weight vector w_new points closer to x. Panels: Predict, Update, After]
Exercise: Verify for yourself that a similar picture can be drawn for negative examples too

Convergence
Convergence theorem
– If there exists a set of weights consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge.
– How long does it take?
Cycling theorem
– If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop

Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it
[Figure: a separating hyperplane, with the margin with respect to this hyperplane marked as the distance to the closest point]

Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
The margin of a dataset (γ) is the maximum margin possible for that dataset using any weight vector
[Figure: the maximum-margin separating hyperplane, with the margin of the data marked]
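In symbols, a standard way to write these two definitions (the exact normalization is my own choice; the slides state them only in words) is, for a hyperplane (w, b) that separates a dataset D:

    \gamma(\mathbf{w}, b) \;=\; \min_{(\mathbf{x}_i, y_i) \in D} \; \frac{y_i\,(\mathbf{w}^T \mathbf{x}_i + b)}{\lVert \mathbf{w} \rVert}
    \qquad\qquad
    \gamma \;=\; \max_{\mathbf{w}, b} \; \gamma(\mathbf{w}, b)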

The mistake bound theorem [Novikoff 1962]
Let D = {(x_i, y_i)} be a labeled dataset that is separable.
Let ||x_i|| ≤ R for all examples, and let γ be the margin of the dataset D.
Then the perceptron algorithm will make at most R²/γ² mistakes on the data.
Proof idea: We know that there is some true weight vector w*. Each perceptron update reduces the angle between w and w*.
Note: The theorem does not depend on the number of examples we have.
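For reference, here is the standard two-inequality argument behind this bound (my reconstruction of the usual proof sketch, not text from the slides). Assume a unit-norm separator w* achieving margin γ, and let M be the number of mistakes the algorithm makes; each mistake on (x, y) performs w ← w + y x, so:

    \mathbf{w}^{T}\mathbf{w}^{*} \text{ increases by } y\,\mathbf{x}^{T}\mathbf{w}^{*} \ge \gamma
      \;\;\Rightarrow\;\; \mathbf{w}^{T}\mathbf{w}^{*} \ge M\gamma
    \lVert\mathbf{w}\rVert^{2} \text{ increases by } 2y\,\mathbf{w}^{T}\mathbf{x} + \lVert\mathbf{x}\rVert^{2} \le R^{2}
      \quad(\text{since } y\,\mathbf{w}^{T}\mathbf{x} \le 0 \text{ on a mistake})
      \;\;\Rightarrow\;\; \lVert\mathbf{w}\rVert^{2} \le M R^{2}
    M\gamma \;\le\; \mathbf{w}^{T}\mathbf{w}^{*} \;\le\; \lVert\mathbf{w}\rVert \;\le\; R\sqrt{M}
      \;\;\Rightarrow\;\; M \le R^{2}/\gamma^{2}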

Beyond the separable case
The good news
– The Perceptron makes no assumption about the data distribution; it could even be adversarial
– After a fixed number of mistakes, you are done. You don't even need to see any more data
The bad news: the real world is not linearly separable
– We can't expect to never make mistakes again
– What can we do? Add more features and try to make the data linearly separable if you can

Variants of the algorithm
So far: we return the final weight vector
Voted perceptron
– Remember every weight vector in your sequence of updates
– At prediction time, each weight vector gets to vote on the label. The number of votes it gets is the number of iterations it survived before being updated
– Comes with strong theoretical guarantees about generalization, but is impractical because of storage issues
Averaged perceptron
– Instead of using all the weight vectors, use the average weight vector (i.e. longer-surviving weight vectors get more say)
– A more practical alternative, and widely used
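A minimal sketch of the averaged variant (my own code, following the description above; it keeps a running sum of the weight vectors rather than storing them all):

    import numpy as np

    def averaged_perceptron(X, y, epochs=10):
        """Averaged perceptron: predict with the average of the weight vectors over all steps."""
        w = np.zeros(X.shape[1])
        w_sum = np.zeros(X.shape[1])        # running sum of weight vectors, one term per step
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                y_pred = 1 if w @ x_i >= 0 else -1
                if y_pred != y_i:
                    w = w + y_i * x_i       # usual perceptron update
                w_sum += w                  # longer-surviving weight vectors contribute more terms
        return w_sum / (epochs * len(X))

    # At prediction time, classify x as sgn(w_avg @ x) with the returned averaged weights.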

Next steps
What is the perceptron doing?
– A mistake bound exists, but only over the training data
– Is it minimizing a loss function? Can we say something about future errors?
More on different loss functions
Support vector machines
– A look at regularization
Logistic regression

Announcements
Registration information
– Class size has increased. You should be able to register
The newsgroup for the class is now active
– Google groups: cs6961-fall14
Please return the information forms if you have filled them out
Please go over the material on logistic regression and at least one of the papers on SVM