Pattern Recognition and Image Analysis Dr. Manal Helal – Fall 2014 Lecture 7 Linear Classifiers

Linear Models Linear models Perceptron Naïve Bayes Logistic regression 2

Linear Models Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries. Example: 2D Input vector x Two discrete classes C1 and C2 3

An Example Positive examples are blank, negative are filled 4 Which line describes the decision boundary? Higher dimensions?! Think of training examples as points in d-dimensional space. Each dimension corresponds to one feature. A linear binary classifier defines a plane in the space which separates positive from negative examples.

Linear Decision Boundary A hyper-plane is a generalization of a straight line to more than 2 dimensions. A hyper-plane contains all the points in a d-dimensional space satisfying the equation: w_1 x_1 + w_2 x_2 + ... + w_d x_d + w_0 = 0. Each coefficient w_i can be thought of as a weight on the corresponding feature. The vector containing all the weights, w = (w_0, ..., w_d), is the parameter vector or weight vector 5

Normal Vector Geometrically, the weight vector w is a normal vector of the separating hyper-plane A normal vector of a surface is any vector which is perpendicular to it 6

Hyper-plane as a classifier Let g(x) = w_1 x_1 + w_2 x_2 + ... + w_d x_d + w_0. Then g(x) ≥ 0 → x assigned to C_1, and g(x) < 0 → x assigned to C_2 7

Bias The slope of the hyper-plane is determined by w_1 ... w_d. The location (intercept) is determined by the bias w_0. Include the bias in the weight vector and add a dummy component to the feature vector, setting this component to x_0 = 1. Then g(x) = w · x 8
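A minimal NumPy sketch of this discriminant (the weights and the input point below are made-up values): the bias w_0 sits in the first position of w and a dummy x_0 = 1 is prepended to x.

import numpy as np

# Weight vector with the bias w0 stored in position 0 (made-up values).
w = np.array([-1.0, 2.0, 0.5])        # w0, w1, w2

def g(x):
    # Discriminant g(x) = w . x after prepending the dummy feature x0 = 1.
    x_aug = np.concatenate(([1.0], x))
    return np.dot(w, x_aug)

x = np.array([0.4, 1.2])              # a 2-D input vector
print(g(x), "-> C1" if g(x) >= 0 else "-> C2")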

Separating hyper-planes in 2 dimensions 9

10 y(x) = w^T x + w_0. y(x) ≥ 0 → x assigned to C_1; y(x) < 0 → x assigned to C_2. Thus y(x) = 0 defines the decision boundary

Learning The goal of the learning process is to come up with a “good” weight vector w. The learning process will use examples to guide the search for a “good” w. Different notions of “goodness” exist, which yield different learning algorithms. We will describe some of these algorithms in the following slides 11

Perceptron 12

Perceptron Training How do we find a set of weights that separates our classes? Perceptron: a simple mistake-driven online algorithm. 1. Start with a zero weight vector and process each training example in turn. 2. If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction. 3. If the weights stop changing, stop. If the examples are linearly separable, this algorithm is guaranteed to converge to a solution vector 13

Fixed increment online perceptron algorithm Binary classification, with classes +1 and −1. Decision function y′ = sign(w · x).
Perceptron(x_1:N, y_1:N, I):
1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y^(n) (w · x^(n)) ≤ 0 then
5:       w ← w + y^(n) x^(n)
6: return w
14

Or more explicitly:
1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y^(n) = sign(w · x^(n)) then
5:       pass
6:     else if y^(n) = +1 ∧ sign(w · x^(n)) = −1 then
7:       w ← w + x^(n)
8:     else if y^(n) = −1 ∧ sign(w · x^(n)) = +1 then
9:       w ← w − x^(n)
10: return w
15 Tracing an example for a NAND function is found in:
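A runnable Python sketch of the fixed-increment perceptron above (the toy data are invented); it uses the compact mistake test y^(n) (w · x^(n)) ≤ 0, which covers both misclassification branches of the explicit version.

import numpy as np

def perceptron(X, y, epochs):
    # Fixed-increment online perceptron; X is N x d, y holds labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * np.dot(w, x_n) <= 0:   # mistake (or on the boundary)
                w += y_n * x_n              # move w in the right direction
    return w

# Toy linearly separable data; column 0 is the dummy bias feature x0 = 1.
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y, epochs=10)
print(w, np.sign(X @ w))                    # learned weights and training predictions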

Weight averaging Although the algorithm is guaranteed to converge, the solution is not unique! It is sensitive to the order in which examples are processed, and separating the training sample does not guarantee good accuracy on unseen data. Empirically, weight averaging gives better generalization performance; it is a method of avoiding overfitting. As the final weight vector, use the mean of the weight vector values over all steps of the algorithm (cf. regularization in a following session) 16

Efficient averaged perceptron algorithm
Perceptron(x_1:N, y_1:N, I):
1: w ← 0; wa ← 0
2: b ← 0; ba ← 0
3: c ← 1
4: for i = 1...I do
5:   for n = 1...N do
6:     if y^(n) (w · x^(n) + b) ≤ 0 then
7:       w ← w + y^(n) x^(n); b ← b + y^(n)
8:       wa ← wa + c y^(n) x^(n); ba ← ba + c y^(n)
9:     c ← c + 1
10: return (w − wa/c, b − ba/c)
17
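The same pseudocode transcribed into Python (a sketch; the variable names mirror the slide and the caller supplies the data). Returning w − wa/c averages the weight vector over all c steps without storing every intermediate vector.

import numpy as np

def averaged_perceptron(X, y, epochs):
    # Efficient averaged perceptron; returns averaged weights and bias.
    d = X.shape[1]
    w, wa = np.zeros(d), np.zeros(d)    # current and accumulated weights
    b, ba = 0.0, 0.0                    # current and accumulated bias
    c = 1                               # example counter
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (np.dot(w, x_n) + b) <= 0:
                w += y_n * x_n
                b += y_n
                wa += c * y_n * x_n
                ba += c * y_n
            c += 1
    return w - wa / c, b - ba / c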

Naïve Bayes 18

Probabilistic Model Instead of thinking in terms of multidimensional space... Classification can be approached as a probability estimation problem. We will try to find a probability distribution which describes our training data well and allows us to make accurate predictions. We’ll look at Naive Bayes as the simplest example of a probabilistic classifier 19

Representation of Examples We are trying to classify documents. Let’s represent a document as the sequence of terms (words) it contains: t = (t_1 ... t_n). For (binary) classification we want to find the most probable class: ŷ = argmax_{y ∈ {−1,+1}} P(Y = y | t). But documents are close to unique: we cannot reliably condition Y | t. Bayes’ rule to the rescue 20

Bayes Rule Bayes’ rule determines how joint and conditional probabilities are related: P(Y = y | t) = P(t | Y = y) P(Y = y) / P(t) 21

Prior and likelihood With Bayes’ rule we can invert the direction of conditioning, decomposing the task into estimating the prior P(Y) (easy) and the likelihood P(t | Y = y) 22

Conditional Independence How to estimate P(t | Y = y)? Naively assume the occurrence of each word in the document is independent of the others when conditioned on the class: P(t | Y = y) ≈ ∏_i P(t_i | Y = y) 23

Naive Bayes Putting it all together: ŷ = argmax_y P(Y = y) ∏_i P(t_i | Y = y) 24

Decision Function For binary classification: assign class +1 if the ratio [P(Y = +1) ∏_i P(t_i | Y = +1)] / [P(Y = −1) ∏_i P(t_i | Y = −1)] exceeds the threshold 1, and class −1 otherwise 25

Documents in Vector Notation Let’s represent documents as vocabulary-size-dimensional count vectors: dimension i indicates how many times the i-th vocabulary item appears in document x 26

Naive Bayes in Vector Notation Counts appear as exponents: if we take the logarithm of the threshold (ln 1 = 0) and of the decision function g, we get the same decision function 27

Linear Classifier Remember the linear classifier? Log prior ratio corresponds to the bias term Log likelihood ratios correspond to feature weights 28
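A small sketch of this correspondence for a multinomial Naive Bayes over a three-word vocabulary (all probabilities below are invented toy values): the bias is the log prior ratio and the weight on each word is its log likelihood ratio, so the decision is a linear function of the count vector.

import numpy as np

# Toy estimates (assumed values) for a 3-word vocabulary.
p_pos = np.array([0.5, 0.3, 0.2])     # P(word i | Y = +1), sums to 1
p_neg = np.array([0.2, 0.3, 0.5])     # P(word i | Y = -1), sums to 1
prior_pos, prior_neg = 0.4, 0.6       # P(Y = +1), P(Y = -1)

w0 = np.log(prior_pos / prior_neg)    # bias = log prior ratio
w = np.log(p_pos / p_neg)             # weights = log likelihood ratios

x = np.array([3, 1, 0])               # document as a vector of word counts
score = w0 + np.dot(w, x)             # log P(+1 | x) - log P(-1 | x)
print("+1" if score > 0 else "-1")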

What is the difference? Training criterion and procedure. Perceptron: zero-one loss function, error-driven algorithm 29

Naive Bayes Maximum Likelihood criterion Find parameters which maximize the log likelihood Parameters reduce to relative counts + Ad-hoc smoothing Alternatives (e.g. maximum a posteriori) 30

Comparison 31

Logistic Regression 32

Probabilistic Conditional Model Let’s try to come up with a probabilistic model, which has some of the advantages of perceptron Model P(y|x) directly, and not via P(x|y) and Bayes rule as in Naïve Bayes Avoid issue of dependencies between features of x We’ll take linear regression as a starting point The goal is to adapt regression to model class-conditional probability 33

Linear Regression Training data: observations paired with real-valued outcomes (y ∈ ℝ). Observations have features (predictors, typically also real numbers). The model is a regression line y = ax + b which best fits the observations; a is the slope, b is the intercept. This model has two parameters (or weights) and one feature x. Example: x = number of vague adjectives in property descriptions, y = amount house sold over asking price 34


Multiple Linear Regression More generally, y = w_0 + w_1 x_1 + ... + w_d x_d, where y = outcome, w_0 = intercept, x_1..x_d = feature vector and w_1..w_d = weight vector. Get rid of the bias by adding a dummy feature x_0 = 1, giving g(x) = w · x. Linear regression: uses g(x) directly. Linear classifier: uses sign(g(x)) 36

Learning Linear Regression Minimize the sum of squared errors over the N training examples. Closed-form formula for choosing the best weights w: w = (X^T X)^{-1} X^T y, where the matrix X contains the training example features and y is the vector of outcomes 37
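A NumPy sketch of this closed-form fit, using invented numbers for the house-price example above; pinv is used rather than an explicit matrix inverse for numerical stability.

import numpy as np

# Invented data: x = number of vague adjectives, y = amount sold over asking ($1000s).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

X = np.column_stack([np.ones_like(x), x])    # prepend a column of 1s for the bias
w = np.linalg.pinv(X.T @ X) @ X.T @ y        # normal equation: w = (X^T X)^-1 X^T y
print(w)                                     # [intercept, slope]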

Logistic Regression In logistic regression we use the linear model to assign probabilities to class labels. For binary classification, predict P(Y = 1 | x). But the predictions of a linear regression model are in ℝ, whereas P(Y = 1 | x) ∈ [0, 1]. Instead, predict the logit function of the probability: ln[ P(Y = 1 | x) / (1 − P(Y = 1 | x)) ] = w · x 38

Solving for P(Y = 1 | x) we obtain: P(Y = 1 | x) = exp(w · x) / (1 + exp(w · x)) = 1 / (1 + exp(−w · x)) 39

Logistic Regression – Classification Example x belongs to class 1 if P(Y = 1 | x) > 0.5, i.e. if w · x > 0. The equation w · x = 0 defines a hyper-plane, with points above it belonging to class 1 40
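A short sketch (made-up weights and input) showing that thresholding the sigmoid probability at 0.5 gives the same decision as thresholding w · x at 0.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-0.5, 1.2, 0.8])        # made-up weights, bias in position 0
x = np.array([1.0, 0.7, -0.2])        # input with dummy feature x0 = 1

p = sigmoid(np.dot(w, x))             # P(Y = 1 | x)
print(p, "class 1" if np.dot(w, x) > 0 else "class 0")   # same decision as p > 0.5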

Multinomial Logistic Regression Logistic regression generalized to more than two classes: P(Y = y | x, W) = exp(w_y · x) / Σ_{y′} exp(w_{y′} · x) 41

Learning Parameters Conditional likelihood estimation: choose the weights which make the probability of the observed values y the highest, given the observations x_i. For a training set with N examples: W* = argmax_W ∏_{n=1}^{N} P(y^(n) | x^(n), W) 42

Error Function Equivalently, we seek the values of the parameters which minimize the error function: E(W) = − Σ_{n=1}^{N} ln P(y^(n) | x^(n), W) 43

This is a problem in convex optimization. Common methods include: L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno), gradient descent, conjugate gradient, and iterative scaling algorithms 44

Gradient Descent Learning Rule Weight update rule: w_j ← w_j + η ( t(i) − σ[f(i)] ) σ′[f(i)] x_j(i), which can be rewritten as w_j ← w_j + η · error · c · x_j(i). The basic idea: a gradient is the slope of a function, that is, a set of partial derivatives, one for each dimension (parameter). By following the gradient of a convex function we can descend to the bottom (minimum)

What is the gradient? Partial derivatives of E(w_0, w_1, w_2): e.g., ∂E(w_0, w_1, w_2)/∂w_1. The gradient is defined as the vector of partial derivatives: ∇E(w) = [ ∂E(w)/∂w_0, ∂E(w)/∂w_1, ∂E(w)/∂w_2 ] (defined here on 3 dimensions). 1. E(w) and ∇E(w) can be evaluated at any particular point w. 2. The components of the gradient ∇E(w) tell us how fast E(w) is changing in each direction. 3. When interpreted as a vector, ∇E(w) is the direction of steepest increase, so −∇E(w) is the direction of steepest decrease

Gradient Descent Rule in Multiple Dimensions Gradient Descent Rule: w_new = w_old − η ∇E(w), where ∇E(w) is the gradient and η is the learning rate (small, positive). Notes: 1. This moves us downhill in the direction −∇E(w) (the steepest downhill direction). 2. How far we go is determined by the value of η. 3. The perceptron learning rule is a special case of this general method

Illustration of Gradient Descent (figure: the error surface E(w) over weights w_0 and w_1)

Illustration of Gradient Descent Direction of steepest descent = direction of negative gradient (figure: the same error surface)

Illustration of Gradient Descent Original point in weight space, new point in weight space (figure: one descent step on the error surface)

Comments on the Gradient Descent Algorithm Equivalent to a hill-climbing heuristic search. Works on any objective function E(w) as long as we can evaluate the gradient ∇E(w); this can be very useful for minimizing complex functions E. Local minima: E(w) can have multiple local minima, and gradient descent goes to the closest local minimum; solution: random restarts from multiple places in weight space (note: for the perceptron, E(w) has a single global minimum and no local minima, so this is not a problem)

General Gradient Descent Algorithm Define an objective function E(w) (the function to be minimized). We want to find the vector of values w that minimizes E(w). Algorithm: pick an initial set of weights w, e.g. randomly; evaluate ∇E(w) at w (note: this can be done numerically or in closed form); update all the weights: w_new = w_old − η ∇E(w); check whether ∇E(w) is approximately 0: if so, we have converged to a “flat minimum”, if not, we move again in weight space. For perceptron learning, the resulting update for each weight w_j is ( t(i) − σ[f(i)] ) σ′[f(i)] x_j(i) (cf. the learning rule above)
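A generic Python sketch of this loop. Since the slide notes that the gradient can be evaluated numerically or in closed form, this version uses a central-difference numerical gradient; the objective function, starting point, learning rate and tolerance below are arbitrary illustrative choices.

import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    # Central-difference estimate of the gradient of E at w.
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (E(w + step) - E(w - step)) / (2 * eps)
    return grad

def gradient_descent(E, w, eta=0.1, tol=1e-6, max_steps=10000):
    for _ in range(max_steps):
        grad = numerical_gradient(E, w)
        if np.linalg.norm(grad) < tol:       # converged to a "flat minimum"
            break
        w = w - eta * grad                   # w_new = w_old - eta * grad E(w)
    return w

E = lambda w: (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2   # an arbitrary convex bowl
print(gradient_descent(E, np.array([0.0, 0.0])))            # approaches [3, -1]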

Minimization of Mean Squared Error, E(w) (figure: E(w) as a function of w_1, with the minimum of the function marked)

Minimization of Mean Squared Error, E(w) (figure: E(w) and its derivative dE(w)/dw_1 as functions of w_1)

Moving Downhill: Move in direction of negative derivative. When dE(w)/dw_1 > 0, the rule w_1 ← w_1 − η dE(w)/dw_1 decreases w_1, decreasing E(w)

Moving Downhill: Move in direction of negative derivative. When dE(w)/dw_1 < 0, the rule w_1 ← w_1 − η dE(w)/dw_1 increases w_1, decreasing E(w)

Gradient Descent Example Find argmin_θ f(θ) where f(θ) = θ^2. Initial value: θ^(1) = −1. Gradient function: ∇f(θ) = 2θ. Update: θ^(n+1) = θ^(n) − η ∇f(θ^(n)). The learning rate η (= 0.2) controls the speed of the descent. After the first iteration: θ^(2) = −1 − 0.2 · (−2) = −0.6

Five Iterations of Gradient Descent 59
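A few lines of Python reproducing the worked example above (f(θ) = θ^2, ∇f(θ) = 2θ, θ^(1) = −1, η = 0.2) for five iterations.

theta, eta = -1.0, 0.2           # initial value theta(1) = -1, learning rate 0.2
for n in range(5):
    grad = 2 * theta             # gradient of f(theta) = theta^2
    theta = theta - eta * grad   # theta(n+1) = theta(n) - eta * grad
    print(n + 2, theta)          # -0.6, -0.36, -0.216, -0.1296, -0.07776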

Stochastic Gradient Descent (SGD) We could compute the gradient of the error for the full dataset before each update. Instead: compute the gradient of the error for a single example, update the weights, and move on to the next example. On average, we will move in the right direction. This gives an efficient, online algorithm; it is stochastic, using random samples, which also helps it cope with many local maxima/minima 60

Error gradient The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters W_yi 61

Single training example 62

Update Stochastic gradient update step 63

Update: Explicit For the correct class (y = y^(n)): w_y ← w_y + η (1 − P(Y = y | x^(n), W)) x^(n), where 1 − P(Y = y | x^(n), W) is the residual. For all other classes (y ≠ y^(n)): w_y ← w_y − η P(Y = y | x^(n), W) x^(n) 64
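A sketch of one such stochastic-gradient step for multinomial logistic regression (the class/feature sizes and values below are invented): the row of W for the correct class moves by η(1 − P)x and every other row moves by −ηPx, matching the explicit update above.

import numpy as np

def softmax(z):
    z = z - np.max(z)                         # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, x, y, eta=0.1):
    # One SGD update of the K x d weight matrix W on example (x, y), y in 0..K-1.
    p = softmax(W @ x)                        # P(Y = k | x, W) for every class k
    for k in range(W.shape[0]):
        if k == y:
            W[k] += eta * (1.0 - p[k]) * x    # correct class: residual (1 - P)
        else:
            W[k] -= eta * p[k] * x            # other classes: -P
    return W

W = np.zeros((3, 4))                          # 3 classes, 4 features (invented sizes)
x = np.array([1.0, 0.5, -1.0, 2.0])           # dummy bias feature included as x[0]
W = sgd_step(W, x, y=2)
print(W)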


Logistic Regression SGD vs Perceptron Very similar update! The perceptron is simply an instantiation of SGD for a particular error function. The perceptron criterion: zero error for a correctly classified example; −y^(n) w · x^(n) for a misclassified example 66

Comparison 67

Exercise Do Example 2.2.1:2 and 2.3.1, not on randomly generated data as shown, but on your project’s dataset 68