Presentation is loading. Please wait.

Presentation is loading. Please wait.

Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.

Similar presentations


Presentation on theme: "Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois."— Presentation transcript:

1 Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign Augmented and modified by Vivek Srikumar

2 Recall: Hypothesis space A hypothesis space is the set of possible functions we consider Instead choose a hypothesis space that is smaller than the space of all functions – Only simple conjunctions (with four variables, there are only 16 conjunctions without negations) – m-of-n rules: Pick a set of n variables. At least m of them must be true – Linear functions How do we pick a hypothesis space? – Using some background knowledge (or by guessing) 2

3 Perceptron algorithm Given a training set D = {(x,y)}, x 2 < n, y 2 {-1,1} 1.Initialize w = 0 2 < n 2.For epoch = 1 … T: 1.For each training example (x, y) 2 D: 1.Predict y’ = sgn(w T x) 2.If y ≠ y’, update w à w + y x 3.Return w Prediction: sgn(w T x) 3 Or equivalently, If y w T x · 0, update w à w + y x

4 Where are we? 1.Supervised learning: The general setting 2.Linear classifiers 3.The Perceptron algorithm 4.Support vector machines 5.Learning as optimization 6.Logistic Regression 4

5 What is the Perceptron algorithm doing? Mistake-bound on the training set What about future examples? Can we say something about them? Can we say anything about the future? 5

6 Recall: Margin The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. 6 + + + + + + + + - - - - - - - - - - - - - - - - - - Margin with respect to this hyperplane

7 Which line is a better choice? Why? 7 + + + + + + + + - - - - - - - - - - - - - - - - - - + + + + + + + + - - - - - - - - - - - - - - - - - - h1h1 h2h2

8 8 + + + + + + + + - - - - - - - - - - - - - - - - - - + + + + + + + + - - - - - - - - - - - - - - - - - - h2h2 + + A new example, not from the training set might be misclassified if the margin is smaller h1h1

9 Maximal margin and generalization Larger margin gives better generalization What we have seen so far: Learning = make as few errors as possible on the training set Maximizing margin => fewer errors on future examples – This idea forms the basis of many learning algorithms SVM, voted perceptron, AdaBoost,… 9

10 Maximizing margin Margin = distance of the closest point from the hyperplane We want max w ° 10

11 Recall: The geometry of a linear classifier 11 sgn(b +w 1 x 1 + w 2 x 2 ) x1x1 x2x2 + + + + + + + + - - - - - - - - - - - - - - - - - - b +w 1 x 1 + w 2 x 2 =0 We only care about the sign, not the magnitude [w 1 w 2 ]

12 Maximizing margin Margin = distance of the closest point from the hyperplane We want max w ° We only care about the sign of w in the end and not the magnitude – Set the activation of the closest point to be 1 and allow w to adjust itself – Sometimes called the functional margin max w ° is equivalent to min w ||w||in this setting 12

13 Max-margin classifiers Learning a classifier: min ||w|| such that the activation of the closest point is 1 Learning problem: This is called the “hard” Support Vector Machine We will look at solving this optimization problem later 13

14 What if the data is not separable? Hard SVM 14

15 What if the data is not separable? Hard SVM This is a constrained optimization problem If the data is not separable, there is no w that will classify the data Infeasible problem, no solution! 15

16 Dealing with non-separable data Key idea: Allow some examples to “break into the margin” 16 + + + + + + + + - - - - - - - - - - - - - - - - - - + This separator has a large enough margin that it should generalize well. So, while computing margin, ignore the examples that make the margin smaller or the data inseparable. -

17 Soft SVM Hard SVM: Introduce one slack variable » i per example and require y i w T x i ¸ 1 - » i and » i ¸ 0 New optimization problem for learning 17 Maximize margin Every example has an functional margin of at least 1

18 Soft SVM Hard SVM: Introduce one slack variable » i per example and require y i w T x i ¸ 1 - » i and » i ¸ 0 Soft SVM learning: 18 Maximize margin Every example is at least at a distance 1 from the hyperplane Minimize total slack (i.e allow as few examples as possible to violate the margin) Tradeoff between the two terms

19 Soft SVM Eliminate the slack variables to rewrite this as This form is more interpretable 19 Maximize margin Minimize total slack (i.e allow as few examples as possible to violate the margin) Tradeoff between the two terms

20 Maximizing margin and minimizing loss Three cases – An example is correctly classified: penalty = 0 – An example is incorrectly classified: penalty = 1 – y i w T x i – An example is correctly classified but within the margin: penalty = 1 – y i w T x i This is the hinge loss function 20 Maximize margin Penalty for the prediction according to the weight vector

21 The Hinge Loss 21

22 SVM objective function 22 Regularization term: Maximize the margin Imposes a preference over the hypothesis space and pushes for better generalization Can be replaced with other regularization terms which impose other preferences Empirical Loss: Hinge loss Penalizes weight vectors that make mistakes Can be replaced with other loss functions which impose other preferences A hyper-parameter that controls the tradeoff between a large margin and a small hinge-loss

23 Where are we? 1.Supervised learning: The general setting 2.Linear classifiers 3.The Perceptron algorithm 4.Support vector machines 5.Learning as optimization 6.Logistic Regression 23

24 Computational Learning Theory Studies theoretical issues about the importance of representation, sample complexity (how many examples are enough), computational complexity (how fast can I learn) – Led to algorithms such as support vector machine and boosting No assumptions about the distribution of examples – But assume that both training and test examples come from the same distribution Provides bounds that depend on the size of the hypothesis class 24

25 Learning as loss minimization Collect some annotated data. More is generally better Pick a hypothesis class (also called model) – Eg: binominal, linear classifiers – Also, decide on how to impose a preference over hypotheses Choose a loss function – Eg: negative log-likelihood – Decide on how to penalize incorrect decisions Minimize the loss – Eg: Set derivative to zero, more complex algorithm 25

26 Learning as loss minimization: The setup Examples are created from some unknown distribution P Identify a hypothesis class H Define penalty for incorrect predictions: – The loss function L Learning: Pick a function f 2 H to minimize expected loss min H E P [L] Use samples from P to estimate expectation: – The training set D = { } – “Empirical risk minimization” min f 2 H  D L(y, f(x)) 26

27 Regularized loss minimization: Logistic regression Learning: With linear classifiers: What is a loss function? – Loss functions should penalize mistakes – We are minimizing average loss over the training data What is the ideal loss function for classification? 27

28 The 0-1 loss Penalize classification mistakes between true label y and prediction y’ For linear classifiers, the prediction y’ = sgn(w T x) Mistake if y w T x · 0 Minimizing 0-1 loss is intractable. Need surrogates 28

29 Loss functions 29 Typically plotted as a function of yw T x SVM

30 Support Vector Machines: Summary SVM = linear classifier + regularization Recall that perceptron did not have regularization Ideally, we would like to minimize 0-1 loss, but can not SVM minimizes hinge loss – Variants exist Will not cover – Dual formulation, support vectors, kernels 30

31 Solving the SVM optimization problem This function is convex in w 31

32 Convex functions A function f is convex if for every u, v in the domain, and for every a 2 [0,1] we have f(a u + (1-a) v) · a f(u) + (1-a) f(v) The necessary condition for w * to be a minimum for a function f: df(w * )/dw = 0 For convex functions, this is both necessary and sufficient 32 uv f(v) f(u)

33 Solving the SVM optimization problem This function is convex in w This is a quadratic optimization problem because the objective is quadratic Earlier methods: Used techniques from Quadratic Programming – Very slow No constraints, can use gradient descent – Still very slow! 33

34 Gradient descent for SVM Gradient descent algorithm to minimize a function f(w): – Initialize solution w – Repeat Find the gradient of f at w: r f Set w à w – r r f Gradient of the SVM objective requires summing over the entire training set – Slow – Does not really scale 34

35 Stochastic gradient descent for SVM Given a training set D = {(x,y)}, x 2 < n, y 2 {-1,1} 1.Initialize w = 0 2 < n 2.For epoch = 1 … T: 1.For each training example (x, y) 2 D: 1.Treat (x,y) as a full dataset and take the derivative of the SVM objective at the current w to be r 2.w à w – r r 3.Return w What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!) 35

36 Stochastic sub-gradient descent for SVM Given a training set D = {(x,y)}, x 2 < n, y 2 {-1,1} 1.Initialize w = 0 2 < n 2.For epoch = 1 … T: 1.For each training example (x, y) 2 D: If y w T x · 1, then w à (1-r)w + r C y x else w à (1-r) w 3.Return w Prediction: sgn(w T x) Compare to the perceptron algorithm! 36 Perceptron update: If y w T x · 0, update w à w + r y x r: learning rate, many tweaks possible

37 SVM summary from optimization perspective Minimize regularized hinge loss Solve using stochastic gradient descent – Very fast, run time does not depend on number of examples – Compare with Perceptron algorithm: Perceptron does not maximize margin width Perceptron variants can force a margin – Convergence criterion is an issue; can be too aggressive in the beginning and get to a reasonably good solution fast; but convergence is slow for very accurate weight vector Another successful optimization algorithm: – Dual coordinate descent, implemented in liblinear 37 Questions?

38 Where are we? 1.Supervised learning: The general setting 2.Linear classifiers 3.The Perceptron algorithm 4.Support vector machine 5.Learning as optimization 6.Logistic Regression 38

39 Regularized loss minimization: Logistic regression Learning: With linear classifiers: SVM uses the hinge loss Another loss function: The logistic loss 39

40 Loss functions 40 Typically plotted as a function of yw T x SVMLogistic regression Smooth, differentiable

41 The probabilistic interpretation Suppose we believe that the labels are generated using the following probability distribution: Predict label = 1 if P(1 | x, w) > P(-1 | x, w) – Equivalent to predicting 1 if w T x ¸ 0 – Why? 41

42 The probabilistic interpretation Suppose we believe that the labels are generated using the following probability distribution: What is the log-likelihood of seeing a dataset D = { } given a weight vector w? 42

43 Prior distribution over the weight vectors A prior balances the tradeoff between the likelihood of the data and existing belief about the parameters – Suppose each weight w i is drawn independently from the normal distribution centered at zero with variance ¾ 2 Bias towards smaller weights – Probability of the entire weight vector: 43 Source: Wikipedia

44 Regularized logistic regression What is the probability of seeing a dataset D = { } and a weight vector w? P(w | D) / P(D, w) = P(D | w) P(w) Learning: Find weight vector by maximizing the posterior distribution P(w | D) Once again, regularized loss minimization! This is the Bayesian interpretation of regularization 44 Exercise: Derive the stochastic gradient descent algorithm for logistic regression.

45 Regularized loss minimization Learning objective for both SVM, logistic regression: loss over training data + regularizer – Different loss functions Hinge loss vs. logistic loss – Same regularizer, but different interpretation Margin vs prior – Hyper-parameter controls tradeoff between the loss and regularizer – Other regularizers/loss functions also possible 45 Questions?

46 Review of binary classification 1.Supervised learning: The general setting 2.Linear classifiers 3.The Perceptron algorithm 4.Support vector machine 5.Learning as optimization 6.Logistic Regression 46 Questions?


Download ppt "Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois."

Similar presentations


Ads by Google