Online Learning Algorithms

Presentation transcript:

1 Online Learning Algorithms

2 Outline
Online learning framework
Design principles of online learning algorithms (additive updates)
Perceptron, Passive-Aggressive, and Confidence-Weighted classification
Classification – binary, multi-class, and structured prediction
Hypothesis averaging and regularization
Multiplicative updates: Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient Descent (EGD)

3 Formal setting – Classification
Instances: images, sentences
Labels: parse trees, names
Prediction rule: a linear prediction rule
Loss: number of mistakes

4 Predictions
Continuous predictions carry both a label and a confidence
Linear classifiers: f(x) = w · x
Prediction: sign(w · x)
Confidence: |w · x|

5 Loss Functions
Natural loss – zero-one loss: 1 if the predicted label differs from y, else 0
Real-valued-prediction losses:
Hinge loss: max(0, 1 − y f(x))
Exponential loss (Boosting): exp(−y f(x))

6 Loss Functions
[Plot: hinge loss and zero-one loss as a function of the margin y f(x); the hinge loss upper-bounds the zero-one loss and reaches 0 at margin 1. A small code sketch of both losses follows.]
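As a rough illustration of the two losses on slides 5–6, here is a minimal Python sketch (not part of the original deck); the example scores are arbitrary.

```python
# Zero-one and hinge loss for a real-valued prediction f(x) and a label y in {-1, +1}.
import numpy as np

def zero_one_loss(y, score):
    """1 if the sign of the prediction disagrees with y, else 0."""
    return float(np.sign(score) != y)

def hinge_loss(y, score):
    """max(0, 1 - y*f(x)): zero only when the example is classified correctly with margin at least 1."""
    return max(0.0, 1.0 - y * score)

# A correct but low-confidence prediction suffers hinge loss even though its zero-one loss is 0.
print(zero_one_loss(+1, 0.3))  # 0.0
print(hinge_loss(+1, 0.3))     # 0.7
```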

7 Online Framework
Initialize the classifier w_1
The algorithm works in rounds. On round t the online algorithm:
Receives an input instance x_t
Outputs a prediction ŷ_t
Receives a feedback label y_t
Computes the loss ℓ(w_t; (x_t, y_t))
Updates the prediction rule w_t → w_{t+1}
Goal: suffer small cumulative loss (a minimal code sketch of this loop follows)
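The protocol above is independent of the particular update rule; a minimal sketch, assuming examples arrive as (x, y) pairs with y in {-1, +1} and measuring cumulative hinge loss:

```python
# Generic online learning loop; `update` is a placeholder for any rule
# discussed later (Perceptron, PA, CW, ...).
import numpy as np

def online_learn(stream, dim, update):
    """stream: iterable of (np.ndarray, +1/-1); update: (w, x, y) -> new w.
    Returns the final weights and the cumulative hinge loss."""
    w = np.zeros(dim)                       # initialize the classifier
    cumulative_loss = 0.0
    for x, y in stream:                     # round t
        y_hat = np.sign(w @ x)              # output a prediction
        loss = max(0.0, 1.0 - y * (w @ x))  # receive y, compute the loss
        cumulative_loss += loss
        w = update(w, x, y)                 # update the prediction rule
    return w, cumulative_loss
```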

8 Margin
Margin of an example (x, y) with respect to the classifier w: y (w · x)
Note: the margin is positive iff the prediction is correct
The set of examples is separable iff there exists u such that y_i (u · x_i) > 0 for all i

9 Geometrical Interpretation
[Figure: points relative to the separating hyperplane, with margin << 0, margin < 0, margin > 0, and margin >> 0]

10 Hinge Loss

11 Why Online Learning?
Fast
Memory efficient – processes one example at a time
Simple to implement
Formal guarantees – mistake bounds
Online-to-batch conversions
No statistical assumptions
Adaptive

12 Update Rules
Online algorithms are based on an update rule which defines w_{t+1} from w_t (and possibly other information)
Linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t)
Some update rules:
Perceptron (Rosenblatt)
ALMA (Gentile)
ROMMA (Li & Long)
NORMA (Kivinen et al.)
MIRA (Crammer & Singer)
EG (Littlestone and Warmuth)
Bregman-based (Warmuth)
CW (Dredze et al.)

13 Design Principles of Algorithms
If the learner suffers non-zero loss at any round, then we want to balance two goals:
(1) Corrective: change the weights enough so that we don't make this error again
(2) Conservative: don't change the weights too much
How do we define "too much"?

14 Design Principles of Algorithms
If we use Euclidean distance to measure the change between the old and new weights, enforcing (1) while minimizing (2) yields, e.g., the Perceptron, or, for squared loss, Widrow-Hoff (Least Mean Squares)
Passive-Aggressive algorithms do exactly the same, except (1) is much stronger – we want a correct classification with a margin of at least 1
Confidence-Weighted classifiers maintain a distribution over weight vectors:
(1) is the same as Passive-Aggressive, with a probabilistic notion of margin
The change is measured by the KL divergence between the two distributions

15 Design Principles of Algorithms
If we assume all weights are positive, we can use the (unnormalized) KL divergence to measure the change
This yields the multiplicative update, or EG algorithm (Kivinen and Warmuth), sketched below
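A minimal sketch of the normalized EG update, assuming linear regression with squared loss (the setting of Kivinen and Warmuth); the learning rate eta is chosen here only for illustration.

```python
import numpy as np

def eg_update(w, x, y, eta=0.1):
    """Multiplicative (exponentiated gradient) update.
    Keeps all weights positive and summing to one, which is why the
    KL divergence is the natural measure of change."""
    y_hat = w @ x
    grad = 2.0 * (y_hat - y) * x      # gradient of the squared loss
    w_new = w * np.exp(-eta * grad)   # multiplicative step
    return w_new / w_new.sum()        # renormalize onto the simplex
```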

16 The Perceptron Algorithm
If no mistake (y_t (w_t · x_t) > 0): do nothing
If mistake: update w_{t+1} = w_t + y_t x_t
Margin after the update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ||x_t||², so it increases by the squared norm of the example (see the sketch below)
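A minimal sketch of the Perceptron update on this slide, in the same form as the generic loop above:

```python
import numpy as np

def perceptron_update(w, x, y):
    if y * (w @ x) > 0:     # no mistake: be conservative, do nothing
        return w
    return w + y * x        # mistake: corrective additive update
```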

17 Passive-Aggressive Algorithms

18 Passive-Aggressive: Motivation
Perceptron: no guarantee on the margin after the update
PA: enforce a minimal non-zero margin after the update
In particular:
If the margin is large enough (at least 1), then do nothing
If the margin is less than 1, update such that the margin after the update is exactly 1

19 Aggressive Update Step
Set w_{t+1} to be the solution of the following optimization problem:
(1) w_{t+1} = argmin_w ½ ||w − w_t||²  s.t.  y_t (w · x_t) ≥ 1
Closed-form update:
(2) w_{t+1} = w_t + α_t y_t x_t,  where α_t = ℓ_t / ||x_t||²  and  ℓ_t = max(0, 1 − y_t (w_t · x_t))
A code sketch follows.
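A minimal sketch of the PA closed-form update (2): take the smallest step, in the Euclidean sense, that restores a margin of at least 1.

```python
import numpy as np

def pa_update(w, x, y):
    loss = max(0.0, 1.0 - y * (w @ x))   # hinge loss on the current round
    if loss == 0.0:
        return w                          # passive step
    alpha = loss / (x @ x)                # step size from the closed form
    return w + alpha * y * x              # aggressive step
```

For the unrealizable case of slide 21, the PA-I variant simply caps alpha at an aggressiveness parameter C, so a single noisy example cannot move the weights arbitrarily far.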

20 Passive-Aggressive Update

21 Unrealizable Case

22 Confidence Weighted Classification

23 Confidence-Weighted Classification: Motivation
Many positive reviews contain the word "best", raising w_best
Later, a negative review arrives: "boring book – best if you want to sleep in seconds"
A linear update will reduce both w_best and w_boring
But "best" has appeared far more often than "boring"
How can we adjust different weights at different rates?

24 Update Rules
The weight vector is a linear combination of the examples: w_{t+1} = w_t + α_t y_t x_t
Two rate schedules (among others):
Perceptron algorithm, conservative: α_t = 1 on a mistake, 0 otherwise
Passive-Aggressive: α_t = max(0, 1 − y_t (w_t · x_t)) / ||x_t||²

25 Distributions in Version Space
Maintain a distribution over weight vectors
[Figure: version space, showing the mean weight vector and an example]

26 Margin as a Random Variable
With w ~ N(μ, Σ), the signed margin y (w · x) is a Gaussian-distributed variable
Thus: y (w · x) ~ N( y (μ · x), xᵀ Σ x )

27 PA-like Update
PA: w_{t+1} = argmin_w ½ ||w − w_t||²  s.t.  y_t (w · x_t) ≥ 1
New update: (μ_{t+1}, Σ_{t+1}) = argmin D_KL( N(μ, Σ) || N(μ_t, Σ_t) )  s.t.  Pr_{w ~ N(μ, Σ)}[ y_t (w · x_t) ≥ 0 ] ≥ η

28 Weight Vector (Version) Space
Place most of the probability mass in this region

29 Passive Step
Nothing to do – most weight vectors already classify the example correctly

30 Aggressive Step
The mean is moved past the mistake line (large margin)
The covariance is shrunk in the direction of the new example
Project the current Gaussian distribution onto the half-space, as in the sketch below
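The exact CW closed form is given in Dredze et al.; as a rough illustration of the same idea (move the mean past the mistake line, shrink the covariance along the new example), here is the closely related AROW update of Crammer et al., with a regularization parameter r chosen only for illustration.

```python
import numpy as np

def arow_update(mu, sigma, x, y, r=1.0):
    """mu: mean weight vector, sigma: covariance matrix over weights."""
    margin = y * (mu @ x)
    if margin >= 1.0:
        return mu, sigma                     # passive step
    v = x @ sigma @ x                        # variance of the margin
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    mu = mu + alpha * y * (sigma @ x)        # move the mean
    sigma = sigma - beta * np.outer(sigma @ x, x @ sigma)  # shrink covariance
    return mu, sigma
```

Note how the step size alpha scales with sigma: weights we are already confident about (small variance) move less, which is exactly the behavior motivated on slide 23.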

31 Extensions: Multi-class and Structured Prediction

32 Multiclass Representation I
k prototypes: one weight vector w_r per class
Given a new instance x, compute the score of each class: w_r · x
Prediction: the class achieving the highest score
Example scores:
Class r   Score
1         -1.08
2          1.66
3          0.37
4         -2.09

33 Multiclass Representation II
Map all inputs and labels into a joint vector space: F(x, y)
Score labels by projecting the corresponding feature vector: w · F(x, y)
Example (sequence labeling):
"Estimated volume was a light 2.4 million ounces ." with tags B I O B I I I I O
F(x, y) = ( … )

34 Multiclass Representation II
Predict the label with the highest score (inference)
Naïve search is expensive if the set of possible labels is large:
No. of labelings = 3^(no. of words) for the tagging example
"Estimated volume was a light 2.4 million ounces ." → B I O B I I I I O
Efficient Viterbi decoding for sequences! (sketched below)
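A compact Viterbi sketch for the B/I/O tagging example: instead of scoring all 3^n labelings, dynamic programming finds the best one in O(n · 3²). The `emit` and `trans` score tables are illustrative stand-ins, not part of the slides.

```python
import numpy as np

def viterbi(emit, trans):
    """emit: (n_words, n_tags) per-position scores; trans: (n_tags, n_tags)
    transition scores. Returns the highest-scoring tag sequence."""
    n, k = emit.shape
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + trans + emit[t][None, :]
        back[t] = cand.argmax(axis=0)     # best previous tag for each tag
        score[t] = cand.max(axis=0)
    tags = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

# Toy usage with random scores for a 9-word sentence and 3 tags (B, I, O):
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(9, 3)), rng.normal(size=(3, 3))))
```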

35 Two Representations
Weight-vector per class (Representation I):
Intuitive
Improved algorithms
Single weight-vector (Representation II):
Generalizes Representation I (e.g., F(x, 4) places a copy of x in the block corresponding to class 4)
Allows complex interactions between input and output

36 Margin for Multi-Class
Binary: y (w · x)
Multi-class: w · F(x, y) − max_{r ≠ y} w · F(x, r)

37 Margin for Multi-Class
But different mistakes cost differently (as given by the loss function) – so use it!
Require a margin scaled by the loss function: the larger the cost of confusing y with r, the larger the margin demanded between them

38 Perceptron: multiclass online algorithm
Initialize w_1 = 0
For t = 1, 2, ...
Receive an input instance x_t
Output a prediction ŷ_t (the highest-scoring label)
Receive the feedback label y_t
Compute the loss
Update the prediction rule (see the sketch below)
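A minimal sketch of this algorithm in Representation I (one prototype per class): predict the highest-scoring class and, on a mistake, move the true class's prototype toward x and the predicted one away.

```python
import numpy as np

def multiclass_perceptron_update(W, x, y):
    """W: (k, d) matrix, one weight vector per class; y: true class index."""
    y_hat = int((W @ x).argmax())   # predict the class with the highest score
    if y_hat != y:                  # mistake: corrective update
        W[y] += x
        W[y_hat] -= x
    return W, y_hat
```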

39 PA: multiclass online algorithm
Initialize w_1 = 0
For t = 1, 2, ...
Receive an input instance x_t
Output a prediction ŷ_t (the highest-scoring label)
Receive the feedback label y_t
Compute the loss
Update the prediction rule, now with a PA step size (see the sketch below)
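A minimal sketch of the PA multiclass update in the joint feature representation (Representation II). Here F is an assumed callable (x, label) → feature vector; for simplicity the required margin is 1 rather than scaled by the cost of the mistake as on slide 37.

```python
import numpy as np

def pa_multiclass_update(w, F, x, y, labels):
    # Highest-scoring label other than the true one ("most violating" label).
    rivals = [r for r in labels if r != y]
    y_hat = max(rivals, key=lambda r: w @ F(x, r))
    loss = max(0.0, 1.0 - (w @ F(x, y) - w @ F(x, y_hat)))
    if loss == 0.0:
        return w                      # passive step: margin already >= 1
    delta = F(x, y) - F(x, y_hat)
    tau = loss / (delta @ delta)      # PA step size
    return w + tau * delta            # aggressive step
```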

40 Regularization
Key idea: if an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well.
Popular choices (a sketch of the first follows):
the averaged hypothesis
the majority vote
use a validation set to make a choice
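A minimal sketch of online-to-batch conversion by hypothesis averaging: run any of the update rules above over the i.i.d. sample and return the average of all intermediate weight vectors instead of the last one.

```python
import numpy as np

def averaged_hypothesis(stream, dim, update):
    """update: any online rule with signature (w, x, y) -> w,
    e.g. perceptron_update or pa_update from the earlier sketches."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    rounds = 0
    for x, y in stream:
        w = update(w, x, y)
        w_sum += w
        rounds += 1
    return w_sum / max(rounds, 1)   # the averaged hypothesis
```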

