Introduction to Machine Learning Fall 2013 Perceptron (6) Prof. Koby Crammer Department of Electrical Engineering Technion 1.


1 Introduction to Machine Learning, Fall 2013: Perceptron (6). Prof. Koby Crammer, Department of Electrical Engineering, Technion

2 Online Learning: Tyrannosaurus rex

3 Online Learning: Triceratops

4 Online Learning: Tyrannosaurus rex, Velociraptor

5 Formal Setting – Binary Classification
– Instances: images, sentences
– Labels: parse trees, names
– Prediction rule: linear prediction rules
– Loss: number of mistakes

6 Online Framework
Initialize a classifier. The algorithm works in rounds; on round t the online algorithm:
– receives an input instance x_t
– outputs a prediction
– receives a feedback label y_t
– computes the loss
– updates the prediction rule
Goal: suffer small cumulative loss
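The round structure above can be sketched as a generic loop. This is a minimal illustration, not code from the lecture; the learner interface (`predict`/`update`) is a hypothetical choice.

```python
# Generic online-learning protocol: predict, receive feedback, update.
# The learner interface (predict/update) is illustrative, not from the slides.

def run_online(learner, stream):
    """Run the online protocol over a stream of (x, y) pairs
    and return the cumulative number of mistakes (0-1 loss)."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)   # output a prediction
        if y_hat != y:               # receive feedback, compute the loss
            mistakes += 1
        learner.update(x, y)         # update the prediction rule
    return mistakes
```

Any mistake-driven algorithm (the Perceptron below, ALMA, MIRA, ...) fits this loop by supplying its own update rule.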

7 Why Online Learning?
– Fast
– Memory efficient: processes one example at a time
– Simple to implement
– Formal guarantees: mistake bounds
– Online-to-batch conversions
– No statistical assumptions
– Adaptive
– But: not as good as a well-designed batch algorithm

8 Update Rules
Online algorithms are based on an update rule which defines the next classifier from the current one (and possibly other information). Linear classifiers: find w_{t+1} from w_t based on the input. Some update rules:
– Perceptron (Rosenblatt)
– ALMA (Gentile)
– ROMMA (Li & Long)
– NORMA (Kivinen et al.)
– MIRA (Crammer & Singer)
– EG (Littlestone and Warmuth)
– Bregman based (Warmuth)
– Numerous online convex programming algorithms

9 Today
The Perceptron algorithm:
– Agmon 1954
– Rosenblatt 1958
– Block 1962, Novikoff 1962
– Minsky & Papert 1969
– Freund & Schapire 1999
– Blum & Dunagan 2002

10 The Perceptron Algorithm
Prediction: the sign of the inner product w_t · x_t
– If no mistake: do nothing
– If mistake: update w_{t+1} = w_t + y_t x_t
Margin after the update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ||x_t||², so the margin on the mistaken example increases
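The mistake-driven rule on this slide can be written out directly. This is a sketch under simple assumptions: feature vectors as plain Python lists, labels in {-1, +1}, and no bias term (the homogeneous setting used in the analysis).

```python
# Perceptron: predict with sign(w . x); on a mistake, add y*x to w.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron(stream, dim):
    """Run the Perceptron over (x, y) pairs with y in {-1, +1}.
    Returns the final weight vector and the number of mistakes."""
    w = [0.0] * dim                          # w_1 is the zero vector
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if dot(w, x) > 0 else -1   # prediction: sign(w . x)
        if y_hat != y:                       # mistake: update the rule
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes
```

Note that the update touches w only on mistake rounds, which is exactly what makes the mistake-bound analysis on the following slides go through.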

11 11 Geometrical Interpretation

12 Relative Loss Bound
For any competitor prediction function u, we bound the cumulative loss suffered by the algorithm (over its sequence of prediction functions w_1, …, w_T) by the cumulative loss suffered by u.

13 Relative Loss Bound
For any competitor prediction function u, we bound the loss suffered by the algorithm by the loss suffered by u. The bound is an inequality, with a possibly large gap; its additive term is the regret (extra loss) and its multiplicative factor is the competitiveness ratio.

14 Relative Loss Bound
For any competitor prediction function u, we bound the loss suffered by the algorithm by the loss suffered by u. The cumulative losses on both sides grow with T, while the extra-loss term stays constant.

15 Relative Loss Bound
For any competitor prediction function u, we bound the loss suffered by the algorithm by the loss suffered by u; in particular, the bound holds against the best prediction function in hindsight for the data sequence.
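The bound itself is an image in the original slides; its standard generic form, with a the competitiveness ratio and b the extra-loss (regret) term, is:

```latex
\sum_{t=1}^{T} \ell\big(w_t; (x_t, y_t)\big)
\;\le\;
a \sum_{t=1}^{T} \ell\big(u; (x_t, y_t)\big) \;+\; b
\qquad \text{for every competitor } u .
```

Both cumulative-loss sums grow with T, while b is constant, which is what makes the guarantee meaningful.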

16 Remarks
If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere). Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem. Instead, we bound the number of mistakes the Perceptron makes by the hinge loss of any competitor.

17 Definitions
Any competitor: the parameter vector u can be chosen using the input data. Three quantities: the parameterized hinge loss of u on (x_t, y_t), the true hinge loss, and the 1-norm (cumulative sum) of the hinge loss.
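The formulas for these quantities are images in the original; in standard notation (T rounds, labels y_t in {-1, +1}) they read:

```latex
% Parameterized hinge loss of u on (x_t, y_t), with margin parameter \gamma > 0:
\ell_\gamma\big(u; (x_t, y_t)\big) = \max\big\{0,\; \gamma - y_t \langle u, x_t \rangle\big\}
% True hinge loss: the case \gamma = 1:
\ell_1\big(u; (x_t, y_t)\big) = \max\big\{0,\; 1 - y_t \langle u, x_t \rangle\big\}
% 1-norm (cumulative) hinge loss over the sequence:
L_\gamma(u) = \sum_{t=1}^{T} \ell_\gamma\big(u; (x_t, y_t)\big)
```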

18 Geometrical Assumption
All examples are bounded in a ball of radius R: ||x_t|| ≤ R for all t.

19 Perceptron's Mistake Bound
Bounds on the number of mistakes M hold for any competitor u; if the sample is separable, then the number of mistakes is finite.
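The bound formulas on this slide are images in the original; statements consistent with the proof on the following slides are (M the number of mistakes, L_γ(u) the cumulative γ-hinge loss, ||x_t|| ≤ R):

```latex
% General (inseparable) bound, for any competitor u and margin \gamma:
M\,(2\gamma - R^2) \;\le\; \|u\|^2 + 2\, L_\gamma(u)
% Separable case: if y_t \langle u, x_t \rangle \ge \gamma for all t with \|u\| = 1,
% then L_\gamma(u) = 0 and optimizing the scale of u gives Novikoff's bound:
M \;\le\; \frac{R^2}{\gamma^2}
```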

20 Proof – Intuition
Two views:
– The angle between w_t and u decreases with t
– A certain sum is fixed, so as we make more mistakes, our solution gets better [FS99, SS05]

21 Proof
Define a potential measuring the distance between w_t and the competitor u; bound its cumulative sum from above and below [C04].

22 Proof
Bound from above: the sum telescopes; the discarded term is non-negative and w_1 is the zero vector, so the total is at most ||u||².

23 Proof
Bound from below:
– no error on the t-th round
– error on the t-th round

24 Proof
We bound each term:

25 Proof
Bound from below: handle the no-error and error rounds separately, then sum to get the cumulative bound.

26 Proof
Putting both bounds together; we use the first degree of freedom (and scale) to obtain the bound.
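The formulas in this proof (slides 21–26) are images in the original; a reconstruction in standard notation, assuming w_1 = 0, ||x_t|| ≤ R, and the γ-hinge loss ℓ_γ(u; (x_t, y_t)) = max{0, γ − y_t⟨u, x_t⟩}:

```latex
% Potential: per-round decrease of the squared distance to the competitor u.
\Delta_t = \|w_t - u\|^2 - \|w_{t+1} - u\|^2
% Upper bound: the sum telescopes; w_1 = 0 and the last term is non-negative.
\sum_{t=1}^{T} \Delta_t = \|w_1 - u\|^2 - \|w_{T+1} - u\|^2 \;\le\; \|u\|^2
% Lower bound: \Delta_t = 0 on no-error rounds. On an error round,
% w_{t+1} = w_t + y_t x_t and y_t \langle w_t, x_t \rangle \le 0, so
\Delta_t = -2 y_t \langle w_t, x_t \rangle + 2 y_t \langle u, x_t \rangle - \|x_t\|^2
\;\ge\; 2\gamma - 2\,\ell_\gamma\big(u; (x_t, y_t)\big) - R^2
% Summing over the M error rounds and combining with the upper bound:
M\,(2\gamma - R^2) - 2\, L_\gamma(u) \;\le\; \|u\|^2
```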

27 Proof
General bound; a particular choice of scale yields a simple bound whose right-hand side is the objective of SVM.
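The choice of scale can be made explicit (a reconstruction; c denotes the scaling constant applied to u, and L_1(u) the cumulative true hinge loss). Replacing u by cu scales the hinge loss linearly, so the combined bound becomes:

```latex
% Scale u by c > 0; with margin \gamma = c the hinge loss scales linearly:
M\,(2c - R^2) \;\le\; c^2 \|u\|^2 + 2c\, L_1(u)
% Choosing c = R^2:
M R^2 \;\le\; R^4 \|u\|^2 + 2 R^2 L_1(u)
\quad\Longrightarrow\quad
M \;\le\; R^2 \|u\|^2 + 2\, L_1(u)
% The right-hand side has the shape of the SVM objective:
% a norm regularizer plus the cumulative hinge loss.
```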

28 Proof
Better bound: optimize the value of the scaling constant instead of fixing it.

29 Remarks
– The bound does not depend on the dimension of the feature vector
– The bound holds for all sequences; it is not tight for most real-world data
– But there exists a worst-case setting for which it is tight

30 30 Three Bounds

31 Separable Case
Assume there exists u such that y_t⟨u, x_t⟩ ≥ γ > 0 for all examples. Then all bounds are equivalent, and the Perceptron makes a finite number of mistakes until convergence (not necessarily to u).
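The separable case is easy to check empirically. The sketch below uses illustrative toy data (not from the lecture); the margin γ and radius R are computed from the sample, so the Perceptron's mistake count can be compared with the (R/γ)² bound.

```python
# Check: on separable data the Perceptron converges, and its mistake
# count stays within the (R/gamma)^2 bound. Self-contained sketch.
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_until_convergence(data, dim, max_epochs=100):
    """Cycle over the data until an error-free pass; return (w, mistakes)."""
    w = [0.0] * dim
    mistakes = 0
    for _ in range(max_epochs):
        clean = True
        for x, y in data:
            if (1 if dot(w, x) > 0 else -1) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]  # mistake: update
                mistakes += 1
                clean = False
        if clean:
            return w, mistakes
    raise RuntimeError("did not converge")

# Separable toy data: the label is the sign of the first coordinate.
data = [([2.0, 1.0], 1), ([1.5, -1.0], 1), ([-2.0, 0.5], -1), ([-1.0, -1.0], -1)]
R = max(math.sqrt(dot(x, x)) for x, _ in data)   # radius of the data ball
u = [1.0, 0.0]                                   # a unit-norm separator
gamma = min(y * dot(u, x) for x, y in data)      # its margin on the sample
w, mistakes = perceptron_until_convergence(data, 2)
assert mistakes <= (R / gamma) ** 2              # the separable-case bound
```

The final w separates the sample but, as the slide notes, need not coincide with u.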

32 Separable Case – Other Quantities
Use the 1st (parameterization) degree of freedom: scale u such that ||u|| = 1, and define γ as the minimal margin over the sample. The bound becomes M ≤ (R/γ)².

33 33 Separable Case - Illustration

34 Separable Case – Illustration
When finding a separating hyperplane is more difficult (a smaller margin), the Perceptron will make more mistakes.

35 Inseparable Case
A difficult problem implies a large value of the competitor's cumulative hinge loss; in this case the Perceptron will make a large number of mistakes.

36 Perceptron Algorithm
Pros:
– Extremely easy to implement
– Relative loss bounds for the separable and inseparable cases, with minimal assumptions (not i.i.d.)
– Easy to convert to a well-performing batch algorithm (under i.i.d. assumptions)
Cons:
– Quantities in the bound are not compatible: number of mistakes vs. hinge loss
– The margin of examples is ignored by the update
– The same update is used in the separable and inseparable cases

37 Concluding Remarks
– Batch vs. Online: two phases (training, then test) vs. a single continuous process
– Statistical assumptions: a distribution over examples vs. all sequences
– Conversions: Online -> Batch and Batch -> Online

