Dan Roth Department of Computer and Information Science


CIS 700 Advanced Machine Learning for NLP Review 1: Supervised Learning, Binary Classifiers Dan Roth Department of Computer and Information Science University of Pennsylvania Augmented and modified by Vivek Srikumar

Supervised learning: General setting
Given: training examples of the form <x, f(x)>
The function f is an unknown function
The input x is represented in a feature space, typically x ∈ {0,1}^n or x ∈ ℝ^n
For a training example x, f(x) is called the label
Goal: find a good approximation of f
Binary classification: f(x) ∈ {-1, 1}
Multiclass classification: f(x) ∈ {1, 2, 3, …, K}
Regression: f(x) ∈ ℝ

Nature of applications
Humans can perform a task, but can't describe how they do it
E.g.: object detection in images; context-sensitive spelling correction
The desired function is hard to obtain in closed form
E.g.: the stock market; finding the subject argument of a verb

Binary classification Spam filtering Is an email spam or not? Recommendation systems Given user’s movie preferences, will she like a new movie? Malware detection Is an Android app malicious? Time series prediction Will the future value of a stock increase or decrease with respect to its current value?

The fundamental problem: Machine learning is ill-posed!
Suppose an unknown function y = f(x1, x2, x3, x4) maps four Boolean inputs to an output, and we only observe its value on a handful of input settings.
Can you learn this function? What is it?

Is learning possible at all?
There are 2^16 = 65536 possible Boolean functions over 4 inputs.
Why? There are 2^4 = 16 possible input settings, and each way to fill in the 16 output slots is a different function, giving 2^16 functions.
We have seen the outputs for only 7 of the 16 inputs; we cannot know what the rest are without seeing them.
Think of an adversary filling in the labels every time you make a guess at the function.
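The counting argument above can be checked by brute force. The sketch below uses a hypothetical set of 7 observed labels (the actual labels from the slide's figure are not in the transcript): it encodes each of the 2^16 Boolean functions as a 16-bit integer and counts how many agree with all 7 observations. Since 9 output slots are unconstrained, 2^9 = 512 functions remain consistent, so the data cannot pin down the target function.

```python
from itertools import product

# All 2**4 = 16 possible input settings of four Boolean variables.
inputs = list(product([0, 1], repeat=4))

# A Boolean function assigns one output bit to each of the 16 inputs,
# so there are 2**16 = 65536 distinct functions; encode each as a 16-bit int.
n_functions = 2 ** len(inputs)

# Hypothetical training data: labels observed for the first 7 inputs only.
observed = list(zip(range(7), [0, 1, 0, 1, 1, 0, 0]))

def output(f, i):
    # Output of function f (a 16-bit integer) on input row i.
    return (f >> i) & 1

# Count the functions consistent with every observation.
consistent = sum(
    all(output(f, i) == y for i, y in observed) for f in range(n_functions)
)
print(n_functions, consistent)  # 65536 total; 2**9 = 512 remain consistent
```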

Solution: Restricted hypothesis space
A hypothesis space is the set of possible functions we consider.
We were looking at the space of all Boolean functions. Instead, choose a hypothesis space that is smaller than the space of all functions:
Simple conjunctions (with four variables, there are only 16 conjunctions without negations)
m-of-n rules: pick a set of n variables; at least m of them must be true
Linear functions
How do we pick a hypothesis space? Using some prior knowledge (or by guessing).
What if the hypothesis space is so small that nothing in it agrees with the data? We need a hypothesis space that is flexible enough.

Where are we? Supervised learning: The general setting Linear classifiers The Perceptron algorithm Learning as optimization Support vector machines Logistic Regression

Linear Classifiers
Input is an n-dimensional vector x; output is a label y ∈ {-1, 1}.
Linear threshold units classify an example x with the rule sgn(b + wᵀx) = sgn(b + Σᵢ wᵢxᵢ):
b + wᵀx ≥ 0 ⇒ predict y = 1
b + wᵀx < 0 ⇒ predict y = -1
The quantity b + wᵀx is called the activation.
(For now we write the bias b explicitly.)
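As a minimal sketch, the linear threshold unit can be written directly from this rule (here sgn(0) is mapped to +1, matching the convention b + wᵀx ≥ 0 ⇒ predict +1; the example weights are made up):

```python
def predict(w, b, x):
    """Linear threshold unit: sgn(b + w^T x), with sgn(0) taken as +1."""
    activation = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if activation >= 0 else -1

# Example: w = (1, 1), b = -3 predicts +1 only when x1 + x2 >= 3.
print(predict([1.0, 1.0], -3.0, [2.0, 2.0]))  # activation  1.0 -> +1
print(predict([1.0, 1.0], -3.0, [1.0, 1.0]))  # activation -1.0 -> -1
```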

The geometry of a linear classifier
For sgn(b + w1x1 + w2x2), we only care about the sign of the activation, not its magnitude.
The decision boundary is the line b + w1x1 + w2x2 = 0, and the weight vector [w1 w2] is normal to it, pointing toward the positive side.
In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.

Linear classifiers are an expressive hypothesis class
Many functions are linear:
Conjunctions: y = x1 ∧ x2 ∧ x3 is equivalent to y = sgn(-3 + x1 + x2 + x3), i.e., w = (1, 1, 1), b = -3
At-least-m-of-n functions
We will see later in the class that many structured predictors are linear functions too.
Often a good guess for a hypothesis space.
Some functions are not linear:
The XOR function
Non-trivial Boolean functions
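The claim that the conjunction is linear is easy to verify exhaustively; this sketch checks sgn(-3 + x1 + x2 + x3) against x1 ∧ x2 ∧ x3 on all 8 Boolean inputs:

```python
from itertools import product

def sgn(a):
    # Convention: sgn(0) = +1, matching the rule b + w^T x >= 0 => predict +1.
    return 1 if a >= 0 else -1

w, b = (1, 1, 1), -3
for x1, x2, x3 in product([0, 1], repeat=3):
    linear = sgn(b + w[0] * x1 + w[1] * x2 + w[2] * x3)
    conjunction = 1 if (x1 and x2 and x3) else -1
    assert linear == conjunction  # the two definitions agree everywhere
print("x1 AND x2 AND x3 == sgn(-3 + x1 + x2 + x3) on all 8 inputs")
```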

XOR is not linearly separable
No line can be drawn in the (x1, x2) plane that separates the positive examples from the negative ones.

Even these functions can be made linear
The trick: change the representation.
These points are not separable in 1 dimension by a line. (What is a one-dimensional "line", by the way? A single point, i.e., a threshold.)

Even these functions can be made linear
The trick: use feature conjunctions.
Transform points: represent each 1-dimensional point x in 2 dimensions as (x, x²).
Now the data is linearly separable in this space!
Exercise: How do you apply this to the XOR case?
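As an illustration with hypothetical 1-D points (-1 and +1 positive, 0 negative, which no single threshold on x can separate), mapping x to (x, x²) makes the data separable by the line x² = 0.5:

```python
# Hypothetical 1-D data: x = -1 and x = +1 are positive, x = 0 is negative.
# No single threshold on x separates these labels.
points = [(-1.0, +1), (0.0, -1), (1.0, +1)]

def phi(x):
    # Feature map: represent the 1-D point x as the 2-D point (x, x^2).
    return (x, x * x)

def predict(x):
    # Linear classifier in the transformed space: sgn(-0.5 + 0*x + 1*x^2).
    _, x_sq = phi(x)
    return 1 if -0.5 + x_sq >= 0 else -1

assert all(predict(x) == y for x, y in points)  # separable after the transform
```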

Almost linearly separable data
For sgn(b + w1x1 + w2x2), the training data is almost separable by the line b + w1x1 + w2x2 = 0, except for some noise.
How much noise do we allow for?

Training a linear classifier
Three cases to consider:
Training data is linearly separable (simple conjunctions, general linear functions, etc.)
Training data is linearly inseparable (XOR, etc.)
Training data is almost linearly separable (noise in the data)
In the last case, we could allow the classifier to make some mistakes on the training data to account for the noise.

Simplifying notation
We can stop writing b at each step because of the following notational sugar:
The prediction function is sgn(b + wᵀx).
Rewrite x as [1, x] → x′ and w as [b, w] → w′ (this increases the dimensionality by one).
Then we can write the prediction as sgn(w′ᵀx′).
From now on we will not show b, and instead fold the bias term into the input by adding an extra feature; but remember that it is there.
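A quick sanity check of the folding trick, on made-up weights and inputs: the original rule sgn(b + wᵀx) and the folded rule sgn(w′ᵀx′) must always agree.

```python
def predict_with_bias(w, b, x):
    # Original rule: sgn(b + w^T x)
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def predict_folded(w_prime, x_prime):
    # Folded rule: sgn(w'^T x'), with the bias hidden in the extra feature.
    return 1 if sum(wi * xi for wi, xi in zip(w_prime, x_prime)) >= 0 else -1

w, b = [2.0, -1.0], 0.5
x = [3.0, 4.0]

x_prime = [1.0] + x   # x -> [1, x]
w_prime = [b] + w     # w -> [b, w]

# b + w.x = 0.5 + 6 - 4 = 2.5, so both rules predict +1.
assert predict_with_bias(w, b, x) == predict_folded(w_prime, x_prime)
```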

Where are we? Supervised learning: The general setting Linear classifiers The Perceptron algorithm Learning as optimization Support vector machines Logistic Regression

The Perceptron algorithm (Rosenblatt, 1958)
The goal is to find a separating hyperplane; for separable data, it is guaranteed to find one.
An online algorithm: it processes one example at a time.
Several variants exist (we will discuss them briefly towards the end).

The algorithm
Given a training set D = {(x, y)}, x ∈ ℝ^n, y ∈ {-1, 1}
Initialize w = 0 ∈ ℝ^n
For epoch = 1 … T:
  For each training example (x, y) ∈ D:
    Predict y′ = sgn(wᵀx)
    If y ≠ y′, update w ← w + y x
Return w
Prediction: sgn(wᵀx)
Notes: T is a hyperparameter of the algorithm. In practice, it is good to shuffle D before the inner loop. The weights are updated only on an error: the perceptron is a mistake-driven algorithm.
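The pseudocode above can be sketched in a few lines of Python; the toy dataset is hypothetical (a separable set whose labels follow the sign of x1 + x2):

```python
import random

def perceptron(data, n_features, epochs=10, seed=0):
    """Perceptron: mistake-driven updates w <- w + y*x on each error."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # shuffle D before the inner loop, as noted above
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x))
            y_pred = 1 if activation >= 0 else -1
            if y_pred != y:                          # update only on a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical separable toy data (label = sign of x1 + x2):
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w = perceptron(data, n_features=2)
assert all(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1) == y
    for x, y in data
)  # the learned w separates the training data
```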

Geometry of the perceptron update
On a mistake on a positive example (x, +1), the update w ← w + y x moves w_old toward x, yielding a w_new that is more likely to classify x correctly.
Exercise: Verify for yourself that a similar picture can be drawn for negative examples too.

Convergence
Convergence theorem: If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge. (How long does it take?)
Cycling theorem: If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.

Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
The margin of a dataset (γ) is the maximum margin achievable for that dataset over all weight vectors.
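The margin of a given hyperplane with respect to a dataset is just the minimum point-to-hyperplane distance; a minimal sketch (the points are hypothetical):

```python
import math

def margin(w, b, points):
    """Distance from the hyperplane w.x + b = 0 to the nearest data point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(b + sum(wi * xi for wi, xi in zip(w, x))) / norm for x in points
    )

# For the hyperplane x1 + x2 = 0, each of these points lies at distance
# |x1 + x2| / sqrt(2) = 3 / sqrt(2), so that is the margin.
points = [(2.0, 1.0), (1.0, 2.0), (-2.0, -1.0), (-1.0, -2.0)]
gamma = margin([1.0, 1.0], 0.0, points)
print(gamma)  # 3 / sqrt(2), about 2.1213
```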

The mistake bound theorem [Novikoff, 1962]
Let D = {(xi, yi)} be a labeled dataset that is separable. Let ||xi|| ≤ R for all examples, and let γ be the margin of the dataset D.
Then the perceptron algorithm will make at most R²/γ² mistakes on the data.
Proof idea: We know that there is some true weight vector w*. Each perceptron update reduces the angle between w and w*.
Note: The theorem does not depend on the number of examples we have.
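The bound can be checked empirically on a hypothetical separable toy set. Since the proof only needs some separating unit vector, plugging in the margin γ(u) of one particular separator u gives a valid upper bound R²/γ(u)² on the number of mistakes:

```python
import math

# Hypothetical separable toy data; u = (1, 1)/sqrt(2) is a unit-norm separator.
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]

R = max(math.sqrt(sum(xi * xi for xi in x)) for x, _ in data)            # sqrt(5)
u = [1 / math.sqrt(2), 1 / math.sqrt(2)]
gamma = min(y * sum(ui * xi for ui, xi in zip(u, x)) for x, y in data)   # 3/sqrt(2)

# Run the perceptron, counting mistakes, until a full clean pass over D.
w, mistakes = [0.0, 0.0], 0
changed = True
while changed:
    changed = False
    for x, y in data:
        y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if y_pred != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
            changed = True

# Novikoff bound: R^2 / gamma^2 = 5 / 4.5, so at most 1 mistake here.
assert mistakes <= (R / gamma) ** 2
print(mistakes)  # 1
```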

Beyond the separable case
The good news: The perceptron makes no assumption about the data distribution; the data could even be chosen adversarially. After a fixed number of mistakes, you are done; you don't even need to see any more data.
The bad news: The real world is not linearly separable, so we can't expect to never make mistakes again.
What can we do? Add more features, and try to make the data linearly separable if you can.

Variants of the algorithm
So far, we return the final weight vector.
Averaged perceptron: Remember every weight vector in the sequence of updates. Weight each weight vector by the number of examples that survived it (i.e., were processed without triggering an update). Make predictions with the weighted average of the weight vectors.
This comes with strong theoretical guarantees about generalization, but you need to be smart about the implementation.
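A sketch of the averaged perceptron described above, on hypothetical toy data: accumulating the current weight vector once per example weights each intermediate vector by the number of examples it survived. (A smarter implementation would avoid the per-example vector addition; this version favors clarity.)

```python
def averaged_perceptron(data, n_features, epochs=10):
    """Averaged perceptron: predict with the (unnormalized) average of all
    intermediate weight vectors, each weighted by how long it survived."""
    w = [0.0] * n_features
    avg = [0.0] * n_features   # running sum of the weight vectors
    for _ in range(epochs):
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if activation >= 0 else -1) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
            # Add w once per example: a vector that survives k examples
            # contributes k times to the sum.
            avg = [ai + wi for ai, wi in zip(avg, w)]
    return avg  # scaling doesn't change sgn(avg.x), so we skip normalizing

# Hypothetical separable toy data (label = sign of x1 + x2):
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w_avg = averaged_perceptron(data, n_features=2)
assert all(
    (1 if sum(wi * xi for wi, xi in zip(w_avg, x)) >= 0 else -1) == y
    for x, y in data
)  # the averaged weights also separate the training data
```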

Next steps What is the perceptron doing? Error-bound exists, but over training data Is it minimizing a loss function? Can we say something about future errors? More on different loss functions Support vector machine A look at regularization Logistic Regression