Pattern Recognition and Image Analysis Dr. Manal Helal – Fall 2014 Lecture 7 Linear Classifiers.


1 Pattern Recognition and Image Analysis Dr. Manal Helal – Fall 2014 Lecture 7 Linear Classifiers

2 Linear Models
Outline: linear models, perceptron, Naïve Bayes, logistic regression.

3 Linear Models
Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.
Example: a 2D input vector x and two discrete classes C1 and C2.

4 An Example
Positive examples are blank, negative examples are filled. Which line describes the decision boundary? What about higher dimensions?
Think of training examples as points in d-dimensional space, where each dimension corresponds to one feature. A linear binary classifier defines a hyper-plane in that space which separates positive from negative examples.

5 Linear Decision Boundary
A hyper-plane is a generalization of a straight line to more than 2 dimensions.
A hyper-plane contains all the points in a d-dimensional space satisfying the equation $w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0 = 0$.
Each coefficient $w_i$ can be thought of as a weight on the corresponding feature.
The vector containing all the weights, $\mathbf{w} = (w_0, \dots, w_d)$, is the parameter vector or weight vector.

6 Normal Vector
Geometrically, the weight vector w (without the bias $w_0$) is a normal vector of the separating hyper-plane. A normal vector of a surface is any vector which is perpendicular to it.

7 Hyper-plane as a classifier
Let $g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0$. Then x is assigned to the positive class if $g(\mathbf{x}) \ge 0$ and to the negative class otherwise.

8 Bias
The slope of the hyper-plane is determined by $w_1, \dots, w_d$. The location (intercept) is determined by the bias $w_0$.
Include the bias in the weight vector and add a dummy component $x_0 = 1$ to the feature vector. Then $g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$.

9 Separating hyper-planes in 2 dimensions

10 $y(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + w_0$
$y(\mathbf{x}) \ge 0$ → x is assigned to C1
$y(\mathbf{x}) < 0$ → x is assigned to C2
Thus $y(\mathbf{x}) = 0$ defines the decision boundary.
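The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names and the example weights are hypothetical, and the bias is folded into the weight vector through the dummy component x0 = 1.

```python
def g(w, x):
    """Linear score w . x, where x[0] is the dummy component 1 and w[0] is the bias."""
    return sum(wi * xi for wi, xi in zip(w, x))


def classify(w, features):
    """Assign C1 when the score is >= 0 and C2 otherwise."""
    x = [1.0] + list(features)  # prepend the dummy component x0 = 1
    return "C1" if g(w, x) >= 0 else "C2"


# Hypothetical weights: the decision boundary is x1 + x2 - 1 = 0.
w = [-1.0, 1.0, 1.0]
label_a = classify(w, [2.0, 0.5])  # score 1.5, on the C1 side
label_b = classify(w, [0.2, 0.3])  # score -0.5, on the C2 side
```

Points exactly on the boundary (score 0) go to C1 here, matching the rule y(x) ≥ 0 → C1.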

11 Learning
The goal of the learning process is to come up with a “good” weight vector w. The learning process uses examples to guide the search for a “good” w. Different notions of “goodness” exist, which yield different learning algorithms. We will describe some of these algorithms in the following slides.

12 Perceptron

13 Perceptron Training
How do we find a set of weights that separates our classes? The perceptron is a simple mistake-driven online algorithm:
1. Start with a zero weight vector and process each training example in turn.
2. If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction.
3. If the weights stop changing, stop.
If the examples are linearly separable, this algorithm is guaranteed to converge to a solution vector.

14 Fixed increment online perceptron algorithm
Binary classification, with classes +1 and −1. Decision function: y′ = sign(w · x)
Perceptron(x 1:N, y 1:N, I):
  w ← 0
  for i = 1...I do
    for n = 1...N do
      if y(n) (w · x(n)) ≤ 0 then
        w ← w + y(n) x(n)
  return w

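15 Or more explicitly
  w ← 0
  for i = 1...I do
    for n = 1...N do
      if y(n) = sign(w · x(n)) then
        pass
      elseif y(n) = +1 ∧ sign(w · x(n)) = −1 then
        w ← w + x(n)
      elseif y(n) = −1 ∧ sign(w · x(n)) = +1 then
        w ← w − x(n)
  return w
Note: for this version to make progress from w = 0, sign(0) has to count as one of the two classes (either +1 or −1); the test y(n) (w · x(n)) ≤ 0 on the previous slide handles that case directly.
A trace of an example for the NAND function can be found at http://en.wikipedia.org/wiki/Perceptron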
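The fixed-increment rule can be written as runnable Python. The sketch below is an illustration under stated assumptions: the names `dot` and `perceptron` are hypothetical, each input is assumed to start with the dummy component 1 so the bias lives inside w, and the early exit when no weight changes is the "stop when weights stop changing" step from the training slide. The data set is the NAND function mentioned above.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron(xs, ys, epochs):
    """Fixed-increment perceptron; labels in ys are +1 or -1."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        changed = False
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:  # a full pass without updates: stop
            break
    return w


# NAND, with each input written as (1, a, b); the leading 1 is the dummy component.
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
ys = [+1, +1, +1, -1]
w = perceptron(xs, ys, epochs=100)
predictions = [1 if dot(w, x) > 0 else -1 for x in xs]
```

Because NAND is linearly separable, the loop stops after a finite number of passes and `predictions` equals `ys`. A different example order can produce a different separating w.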

16 Weight averaging
Although the algorithm is guaranteed to converge on linearly separable data, the solution is not unique: it is sensitive to the order in which examples are processed.
Separating the training sample does not guarantee good accuracy on unseen data.
Empirically, weight averaging gives better generalization performance; it is a method of avoiding overfitting.
As the final weight vector, use the mean of the weight vectors from every step of the algorithm (cf. regularization in a following session).

17 Efficient averaged perceptron algorithm
Perceptron(x 1:N, y 1:N, I):
  w ← 0; wa ← 0
  b ← 0; ba ← 0
  c ← 1
  for i = 1...I do
    for n = 1...N do
      if y(n) (w · x(n) + b) ≤ 0 then
        w ← w + y(n) x(n); b ← b + y(n)
        wa ← wa + c y(n) x(n); ba ← ba + c y(n)
      c ← c + 1
  return (w − wa/c, b − ba/c)
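A direct Python transcription of the averaged algorithm might look as follows. It is a sketch under one assumption: the counter c is incremented after every example, not only after mistakes. The function name and the toy data are hypothetical.

```python
def averaged_perceptron(xs, ys, epochs):
    """Averaged perceptron with a separate bias b; labels in ys are +1 or -1."""
    d = len(xs[0])
    w, wa = [0.0] * d, [0.0] * d  # current weights and counter-weighted sum of updates
    b, ba = 0.0, 0.0
    c = 1
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # mistake: update both accumulators
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                wa = [wai + c * y * xi for wai, xi in zip(wa, x)]
                ba += c * y
            c += 1  # assumed to advance on every example
    return [wi - wai / c for wi, wai in zip(w, wa)], b - ba / c


# Toy one-dimensional data: positive points on the right, negative on the left.
xs = [(2.0,), (1.0,), (-1.0,), (-2.0,)]
ys = [+1, +1, -1, -1]
w_avg, b_avg = averaged_perceptron(xs, ys, epochs=3)
```

The point of the two accumulators is that the average is obtained without storing or summing a full weight vector at every step.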

18 Naïve Bayes

19 Probabilistic Model
Instead of thinking in terms of multidimensional space, classification can be approached as a probability estimation problem. We will try to find a probability distribution which describes our training data well and allows us to make accurate predictions.
We’ll look at Naive Bayes as the simplest example of a probabilistic classifier.

20 Representation of Examples
We are trying to classify documents. Let’s represent a document as the sequence of terms (words) it contains, $\mathbf{t} = (t_1, \dots, t_n)$.
For (binary) classification we want to find the most probable class: $\hat{y} = \arg\max_{y \in \{-1, +1\}} P(Y = y \mid \mathbf{t})$.
But documents are close to unique, so we cannot reliably estimate $P(Y \mid \mathbf{t})$ directly. Bayes’ rule to the rescue.

21 Bayes Rule
Bayes’ rule determines how joint and conditional probabilities are related.
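In standard notation (the symbols X and Y are generic placeholders here), the rule follows from writing the joint probability in two ways:

```latex
\[
P(X, Y) = P(X \mid Y)\,P(Y) = P(Y \mid X)\,P(X)
\quad\Longrightarrow\quad
P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}
\]
```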

22 Prior and likelihood
With Bayes’ rule we can invert the direction of conditioning. This decomposes the task into estimating the prior $P(Y)$ (easy) and the likelihood $P(\mathbf{t} \mid Y = y)$.

23 Conditional Independence
How do we estimate $P(\mathbf{t} \mid Y = y)$? Naively assume that the occurrence of each word in the document is independent of the others, when conditioned on the class.

24 Naive Bayes
Putting it all together.

25 Decision Function
For binary classification:
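Combining Bayes' rule with the conditional independence assumption gives the usual Naive Bayes form. The block below is a reconstruction in the notation of the earlier slides (terms $t_1, \dots, t_n$, classes $\pm 1$), offered as the standard statement of the model. The threshold of 1 on the ratio matches the later remark that $\ln 1 = 0$.

```latex
\[
P(Y = y \mid \mathbf{t}) \;\propto\; P(Y = y) \prod_{i=1}^{n} P(t_i \mid Y = y)
\]
\[
\hat{y} =
\begin{cases}
+1 & \text{if }
\dfrac{P(Y = +1) \prod_{i=1}^{n} P(t_i \mid Y = +1)}
      {P(Y = -1) \prod_{i=1}^{n} P(t_i \mid Y = -1)} > 1 \\[2ex]
-1 & \text{otherwise}
\end{cases}
\]
```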

26 Documents in Vector Notation
Let’s represent documents as vocabulary-size-dimensional count vectors: dimension i indicates how many times the i-th vocabulary item appears in document x.

27 Naive Bayes in Vector Notation
Counts appear as exponents: a term that occurs $x_i$ times contributes its class-conditional probability raised to the power $x_i$. If we take the logarithm of both the threshold (ln 1 = 0) and g, we get the same decision function.

28 Linear Classifier
Remember the linear classifier? The log prior ratio corresponds to the bias term, and the log likelihood ratios correspond to the feature weights.
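To make the correspondence concrete, the sketch below turns term counts into a bias and per-term weights, so that the sign of bias + Σ weight × count reproduces the Naive Bayes decision. Everything specific in it is an assumption for illustration: the function names, the toy documents, and the add-one smoothing (one possible ad-hoc smoothing).

```python
import math
from collections import Counter


def train_nb_linear(docs, labels, vocab):
    """Return (bias, weights) of the linear classifier equivalent to Naive Bayes."""
    counts = {+1: Counter(), -1: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc)

    def log_lik(term, y):
        total = sum(counts[y][t] for t in vocab)
        # add-one smoothing (an assumption) keeps unseen terms away from log(0)
        return math.log((counts[y][term] + 1) / (total + len(vocab)))

    bias = math.log(n_docs[+1] / n_docs[-1])  # log prior ratio
    weights = {t: log_lik(t, +1) - log_lik(t, -1) for t in vocab}  # log likelihood ratios
    return bias, weights


def score(bias, weights, doc):
    # summing over tokens means a term that occurs k times contributes k * weight
    return bias + sum(weights.get(t, 0.0) for t in doc)


docs = [["good", "great"], ["good", "fun"], ["bad", "boring"], ["bad", "awful"]]
labels = [+1, +1, -1, -1]
vocab = sorted({t for d in docs for t in d})
bias, weights = train_nb_linear(docs, labels, vocab)
```

A positive score means class +1 and a negative score means class −1, exactly as with g(x) in the linear classifier.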

29 What is the difference?
Training criterion and procedure.
Perceptron: zero-one loss function; error-driven algorithm.

30 Naive Bayes
Maximum likelihood criterion: find the parameters which maximize the log likelihood.
Parameters reduce to relative counts, plus ad-hoc smoothing.
Alternatives exist (e.g. maximum a posteriori).

31 Comparison

32 Logistic Regression

33 Probabilistic Conditional Model
Let’s try to come up with a probabilistic model which has some of the advantages of the perceptron.
Model P(y|x) directly, and not via P(x|y) and Bayes’ rule as in Naïve Bayes. This avoids the issue of dependencies between the features of x.
We’ll take linear regression as a starting point. The goal is to adapt regression to model the conditional class probability P(y|x).

34 Linear Regression
Training data: observations paired with outcomes (real numbers).
Observations have features (predictors, typically also real numbers).
The model is a regression line y = ax + b which best fits the observations, where a is the slope and b is the intercept.
This model has two parameters (or weights) and one feature, x.
Example: x = number of vague adjectives in property descriptions; y = amount the house sold for over the asking price.
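For the one-feature case, the best-fitting line has a simple closed form. The sketch below assumes that "best fits" means least squares (the criterion named on the later slide about learning linear regression). The function name and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the model y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx  # slope: covariance of x and y over variance of x
    b = mean_y - a * mean_x  # the line passes through the point of means
    return a, b


# Made-up data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
```

With noisy data the same function returns the line that minimizes the sum of squared vertical distances to the points.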


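36 Multiple Linear Regression
More generally, $y = w_0 + \sum_{i=1}^{d} w_i x_i$, where y is the outcome, $w_0$ is the intercept, $x_1, \dots, x_d$ is the feature vector and $w_1, \dots, w_d$ is the weight vector.
Get rid of the bias by adding the dummy feature $x_0 = 1$: $g(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w} \cdot \mathbf{x}$.
Linear regression uses $g(\mathbf{x})$ directly; a linear classifier uses $\mathrm{sign}(g(\mathbf{x}))$.

37 Learning Linear Regression
Minimize the sum of squared errors over the N training examples: $\sum_{n=1}^{N} \big(g(\mathbf{x}^{(n)}) - y^{(n)}\big)^2$.
There is a closed-form formula for choosing the best weights: $\mathbf{w} = (X^{T} X)^{-1} X^{T} \mathbf{y}$, where the matrix X contains the training example features (one row per example) and y is the vector of outcomes.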

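38 Logistic Regression
In logistic regression we use the linear model to assign probabilities to class labels. For binary classification, predict $P(Y = 1 \mid \mathbf{x})$.
But the predictions of a linear regression model are in $\mathbb{R}$, whereas $P(Y = 1 \mid \mathbf{x}) \in [0, 1]$.
Instead, predict the logit function of the probability: $\ln \frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x}$.

39 Solving for $P(Y = 1 \mid \mathbf{x})$ we obtain: $P(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$.

40 Logistic Regression – Classification
Example x belongs to class 1 if $P(Y = 1 \mid \mathbf{x}) > 0.5$, that is, if $\mathbf{w} \cdot \mathbf{x} > 0$.
The equation $\mathbf{w} \cdot \mathbf{x} = 0$ defines a hyper-plane, with points above it belonging to class 1.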
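The relationship between the logit, the probability and the class decision can be checked numerically. In this sketch the function names, the label 0 for the other class and the example weights are assumptions, and x[0] = 1 is the dummy bias component.

```python
import math


def sigmoid(z):
    """Inverse of the logit: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predict_proba(w, x):
    """P(Y = 1 | x) for weights w, with x[0] = 1 as the dummy component."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def predict(w, x):
    """Class 1 when P(Y = 1 | x) > 0.5, i.e. when w . x > 0; otherwise class 0."""
    return 1 if predict_proba(w, x) > 0.5 else 0


w = [-1.0, 2.0]  # hypothetical weights: the boundary is at x1 = 0.5
p_above = predict_proba(w, [1.0, 1.0])  # score +1
p_below = predict_proba(w, [1.0, 0.0])  # score -1
```

Since the sigmoid is monotonic and equals 0.5 at a score of 0, thresholding the probability at 0.5 is the same as checking the sign of w · x.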

41 Multinomial Logistic Regression
Logistic regression generalized to more than two classes.

42 Learning Parameters
Conditional likelihood estimation: choose the weights which make the probability of the observed values y the highest, given the observations $x_i$.
For the training set with N examples:

43 Error Function
Equivalently, we seek the value of the parameters which minimizes the error function:
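In standard form, and assuming W holds one weight vector $\mathbf{w}_y$ per class (consistent with the $P(Y = y \mid \mathbf{x}^{(n)}, W)$ notation used on the update slides), the multinomial model, the conditional likelihood objective and the corresponding error function can be written as:

```latex
\[
P(Y = y \mid \mathbf{x}, W) =
\frac{\exp(\mathbf{w}_y \cdot \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})}
\]
\[
\hat{W} = \arg\max_{W} \prod_{n=1}^{N} P\big(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W\big)
\]
\[
E(W) = -\sum_{n=1}^{N} \ln P\big(Y = y^{(n)} \mid \mathbf{x}^{(n)}, W\big)
\]
```

Maximizing the product and minimizing the negative log of it select the same W, because the logarithm is monotonic.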

44 A problem in convex optimization
Applicable methods: L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno method), gradient descent, conjugate gradient, iterative scaling algorithms.

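45 Gradient Descent Learning Rule
Weight update rule: $w_j = w_j + \eta \, \big(t(i) - \sigma[f(i)]\big) \, \sigma'[f(i)] \, x_j(i)$
We can rewrite this as: $w_j = w_j + \eta \cdot \text{error} \cdot c \cdot x_j(i)$
Here η is the learning rate, t(i) is the target for example i, σ is the output function applied to the weighted sum f(i), and σ′ is its derivative, so error = t(i) − σ[f(i)] and c = σ′[f(i)].
The basic idea: a gradient is the slope of a function, that is, a set of partial derivatives, one for each dimension (parameter). By following the gradient of a convex function downhill we can descend to the bottom (minimum).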

46 What is the gradient?
Partial derivatives of $E(w_0, w_1, w_2)$: e.g., $\partial E(w_0, w_1, w_2) / \partial w_1$.
The gradient is defined as the vector of partial derivatives: $\nabla E(\mathbf{w}) = [\,\partial E(\mathbf{w})/\partial w_0,\ \partial E(\mathbf{w})/\partial w_1,\ \partial E(\mathbf{w})/\partial w_2\,]$ (defined here on 3 dimensions).
1. $E(\mathbf{w})$ and $\nabla E(\mathbf{w})$ can be evaluated at any particular point w.
2. The components of the gradient tell us how fast $E(\mathbf{w})$ is changing in each direction.
3. When interpreted as a vector, $\nabla E(\mathbf{w})$ is the direction of steepest increase, so $-\nabla E(\mathbf{w})$ is the direction of steepest decrease.

47 Gradient Descent Rule in Multiple Dimensions
Gradient descent rule: $\mathbf{w}_{new} = \mathbf{w}_{old} - \eta \, \nabla E(\mathbf{w}_{old})$, where $\nabla E(\mathbf{w})$ is the gradient and η is the learning rate (small, positive).
Notes:
1. This moves us downhill in the direction $-\nabla E(\mathbf{w})$ (the steepest downhill direction).
2. How far we go is determined by the value of η.
3. The perceptron learning rule is a special case of this general method.

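48 Illustration of Gradient Descent
[Figure: error surface E(w) plotted over the weights w0 and w1]

49 Illustration of Gradient Descent
[Figure: error surface E(w) plotted over the weights w0 and w1]

50 Illustration of Gradient Descent
[Figure: error surface E(w) over w0 and w1] Direction of steepest descent = direction of negative gradient.

51 Illustration of Gradient Descent
[Figure: error surface E(w) over w0 and w1, showing the original point in weight space and the new point in weight space]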

52 Comments on the Gradient Descent Algorithm
Equivalent to hill-climbing heuristic search.
Works on any objective function E(w) as long as we can evaluate the gradient $\nabla E(\mathbf{w})$; this can be very useful for minimizing complex functions E.
Local minima: E(w) can have multiple local minima (note: for the perceptron, E(w) only has a single global minimum, so this is not a problem). Gradient descent goes to the closest local minimum. Solution: random restarts from multiple places in weight space.

53 General Gradient Descent Algorithm
Define an objective function E(w) (the function to be minimized). We want to find the vector of values w that minimizes E(w).
Algorithm:
1. Pick an initial set of weights w, e.g. randomly.
2. Evaluate $\nabla E(\mathbf{w})$ at w (note: this can be done numerically or in closed form).
3. Update all the weights: $\mathbf{w}_{new} = \mathbf{w}_{old} - \eta \, \nabla E(\mathbf{w})$.
4. Check if $\nabla E(\mathbf{w})$ is approximately 0. If so, we have converged to a “flat minimum”; if not, we move again in weight space.
For perceptron learning, the update term for weight $w_j$ is $\big(t(i) - \sigma[f(i)]\big) \, \sigma'[f(i)] \, x_j(i)$.
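The general algorithm fits in a few lines of Python. The sketch below is illustrative only: the function names, the default learning rate and tolerance, and the quadratic objective are assumptions, and the gradient is supplied in closed form.

```python
def gradient_descent(grad, w, eta=0.1, tol=1e-6, max_steps=10000):
    """Repeat w_new = w_old - eta * grad(w_old) until the gradient is about 0."""
    for _ in range(max_steps):
        g = grad(w)
        if max(abs(gi) for gi in g) < tol:  # gradient approximately 0: converged
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical convex objective E(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at (3, -1).
def grad_E(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


w_min = gradient_descent(grad_E, [0.0, 0.0])
```

For a non-convex E the same loop stops at whichever local minimum is reached from the starting point, which is why random restarts are suggested.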

54 Minimization of Mean Squared Error, E(w)
[Figure: E(w) plotted against w1, with the minimum of E(w) marked]

55 Minimization of Mean Squared Error, E(w)
[Figure: E(w) plotted against w1, with the derivative dE(w)/dw1 at the current w1]

56 Moving Downhill: Move in direction of negative derivative
[Figure: E(w) plotted against w1, showing the direction of decreasing E(w)]
When $dE(\mathbf{w})/dw_1 > 0$, the rule $w_1 \leftarrow w_1 - \eta \, dE(\mathbf{w})/dw_1$ decreases $w_1$.

57 Moving Downhill: Move in direction of negative derivative
[Figure: E(w) plotted against w1, showing the direction of decreasing E(w)]
When $dE(\mathbf{w})/dw_1 < 0$, the rule $w_1 \leftarrow w_1 - \eta \, dE(\mathbf{w})/dw_1$ increases $w_1$.

58 Gradient Descent Example
Find $\arg\min_{\theta} f(\theta)$ where $f(\theta) = \theta^2$.
Initial value: $\theta^{(1)} = -1$.
Gradient function: $\nabla f(\theta) = 2\theta$.
Update: $\theta^{(n+1)} = \theta^{(n)} - \eta \, \nabla f(\theta^{(n)})$.
The learning rate η (= 0.2) controls the speed of the descent.
After the first iteration: $\theta^{(2)} = -1 - 0.2 \cdot (-2) = -0.6$.

59 Five Iterations of Gradient Descent
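The five iterations can be reproduced with a short loop, using the same function f(θ) = θ², starting point −1 and learning rate 0.2 as in the example. The variable names are illustrative.

```python
def grad_f(theta):
    """Gradient of f(theta) = theta**2."""
    return 2 * theta


eta = 0.2  # learning rate from the example
theta = -1.0  # initial value
trace = [theta]
for _ in range(5):
    theta = theta - eta * grad_f(theta)
    trace.append(theta)
```

Each update multiplies θ by 1 − 2η = 0.6, so the sequence is −1, −0.6, −0.36, −0.216, −0.1296, −0.07776 (up to floating-point rounding), approaching the minimum at 0.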

60 Stochastic Gradient Descent (SGD)
We could compute the gradient of the error for the full dataset before each update. Instead:
1. Compute the gradient of the error for a single example.
2. Update the weights.
3. Move on to the next example.
On average, we’ll move in the right direction. This gives an efficient, online algorithm.
However, it is stochastic: it uses random samples, which helps it cope with error surfaces that have many local maxima/minima.

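61 Error gradient
The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters $W_{yi}$.

62 Single training example

63 Update
Stochastic gradient update step.

64 Update: Explicit
For the correct class ($y = y^{(n)}$): $W_y \leftarrow W_y + \eta \, \big(1 - P(Y = y \mid \mathbf{x}^{(n)}, W)\big) \, \mathbf{x}^{(n)}$, where $1 - P(Y = y \mid \mathbf{x}^{(n)}, W)$ is the residual and $W_y$ is the weight vector for class y.
For all other classes ($y \ne y^{(n)}$): $W_y \leftarrow W_y - \eta \, P(Y = y \mid \mathbf{x}^{(n)}, W) \, \mathbf{x}^{(n)}$.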
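One stochastic update for multinomial logistic regression can be sketched as follows. The assumptions are labeled: W is stored as a dictionary from class label to weight list, class probabilities come from a softmax over the scores, and the function names, learning rate and toy data are hypothetical. The residual is 1 − P for the correct class and −P for every other class.

```python
import math


def softmax_probs(W, x):
    """P(Y = y | x, W) for every class y; W maps class label -> weight list."""
    scores = {y: sum(wi * xi for wi, xi in zip(w, x)) for y, w in W.items()}
    m = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def sgd_step(W, x, y_true, eta):
    """Update W in place from the single example (x, y_true)."""
    p = softmax_probs(W, x)
    for y in list(W):
        residual = (1.0 if y == y_true else 0.0) - p[y]
        W[y] = [wi + eta * residual * xi for wi, xi in zip(W[y], x)]
    return W


# Toy data: x = (1, feature) with the leading 1 as the dummy component.
data = [((1.0, 2.0), "a"), ((1.0, -2.0), "b")]
W = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
for _ in range(50):
    for x, y in data:
        sgd_step(W, x, y, eta=0.5)
```

After training, `softmax_probs(W, (1.0, 2.0))` puts most of the probability on class "a" and `softmax_probs(W, (1.0, -2.0))` on class "b".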


66 Logistic Regression SGD vs Perceptron
Very similar update! The perceptron is simply an instantiation of SGD for a particular error function.
The perceptron criterion: for a correctly classified example, zero error; for a misclassified example, $-y^{(n)} \, \mathbf{w} \cdot \mathbf{x}^{(n)}$.

67 Comparison

68 Exercise
Do Example 2.2.1:2, and 2.3.1, not using randomly generated data as shown, but on your project’s dataset.

