# Advanced Online Learning for Natural Language Processing (Koby Crammer, University of Pennsylvania)


1 Advanced Online Learning for Natural Language Processing Koby Crammer University of Pennsylvania

2 Thanks Axel Bernal, Steve Caroll, Ofer Dekel, Mark Dredze, Kuzman Ganchev, Artemis Hatzigeorgiu, Joseph Keshet, Ryan McDonald, Fernando Pereira, Fei Sha, Shai Shalev-Shwartz, Yoram Singer, Partha Pratim Talukdar

3 Tutorial Context: SVMs, multiclass and structured prediction, optimization theory, online learning; applications: parsing, information extraction, speech recognition, text classification

4 Online Learning Tyrannosaurus rex

5 Online Learning Triceratops

6 Online Learning Tyrannosaurus rex Velociraptor

7 Journey Concepts Principles Algorithms Theory Binary Complex Structured Multi-Class Parsing Information Extraction Speech Recognition Document Classification Sentiment Classification

8 Formal Setting – Binary Classification Instances – speech signal, sentences Labels – parse tree, names Prediction rule – linear prediction rules Loss – no. of mistakes

9 Predictions Discrete Predictions: –Hard to optimize Continuous predictions : –Label –Confidence

10 Loss Functions Natural Loss: –Zero-One loss: Real-valued-predictions loss: –Hinge loss: –Exponential loss (Boosting) –Log loss (Max Entropy, Boosting)
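
The losses named on this slide can be sketched in Python as follows (function names are mine, not from the tutorial); all take a true label `y` in {-1, +1} and a real-valued score:

```python
import math

def zero_one_loss(y, score):
    """Zero-one loss: 1 iff the sign of the score disagrees with the label."""
    return 0.0 if y * score > 0 else 1.0

def hinge_loss(y, score):
    """Hinge loss: penalizes any margin below 1, linearly."""
    return max(0.0, 1.0 - y * score)

def exp_loss(y, score):
    """Exponential loss (used in boosting)."""
    return math.exp(-y * score)

def log_loss(y, score):
    """Log loss (max entropy / logistic regression)."""
    return math.log(1.0 + math.exp(-y * score))
```

All three real-valued losses upper-bound the zero-one loss at a mistake, which is what makes them tractable surrogates.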

11 Loss Functions 1 1 Zero-One Loss Hinge Loss

12 Online Framework Initialize Classifier Algorithm works in rounds On round the online algorithm : – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Goal : –Suffer small cumulative loss
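
The round structure above can be written as a generic loop; `predict`, `update`, and `loss` are caller-supplied placeholders, not part of the tutorial:

```python
def online_learn(examples, predict, update, loss):
    """Generic online protocol: on each round, predict, receive the true
    label, suffer loss, then update the prediction rule."""
    cumulative = 0.0
    for x, y in examples:
        y_hat = predict(x)             # output a prediction
        cumulative += loss(y, y_hat)   # suffer loss on the revealed label
        update(x, y, y_hat)            # update the prediction rule
    return cumulative                  # goal: keep this small
```

Any of the concrete algorithms later in the tutorial (Perceptron, PA) are instances of this loop with particular `predict`/`update` pairs.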

13 Any Features W.l.o.g. Binary Classifiers of the form Linear Classifiers Notation Abuse

14 Prediction : Confidence in prediction: Linear Classifiers (cntd.)

15 Margin of an example with respect to the classifier : Note : The set is separable iff there exists such that Margin

16 Geometrical Interpretation

17 Geometrical Interpretation

18 Geometrical Interpretation

19 Degree of Freedom - I The same geometrical hyperplane can be represented by many parameter vectors

20 Degree of Freedom - II Problem difficulty does not change if we shrink or expand the input space

21 Geometrical Interpretation Margin >0 Margin <<0 Margin <0 Margin >>0

22 Hinge Loss

23 Separable Set

24 Inseparable Sets

25 Why Online Learning? Fast Memory efficient - process one example at a time Simple to implement Formal guarantees – Mistake bounds Online to Batch conversions No statistical assumptions Adaptive Not as good as well-designed batch algorithms

26 Initialize Classifier Algorithm works in rounds On round the online algorithm : – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Online

27 Update Rules Online algorithms are based on an update rule which defines from (and possibly other information) Linear Classifiers : find from based on the input Some Update Rules : –Perceptron (Rosenblatt) –ALMA (Gentile) –ROMMA (Li & Long) –NORMA (Kivinen et al.) –MIRA (Crammer & Singer) –EG (Littlestone & Warmuth) –Bregman Based (Warmuth) –…

28 Three Update Rules The Perceptron Algorithm : –Agmon 1954; Rosenblatt 1952-1962, Block 1962, Novikoff 1962, Minsky & Papert 1969, Freund & Schapire 1999, Blum & Dunagan 2002 Hildreth’s Algorithm : –Hildreth 1957 –Censor & Zenios 1997 –Herbster 2002 Loss Scaled : –Crammer & Singer 2001, 2002

29 The Perceptron Algorithm If No-Mistake –Do nothing If Mistake –Update Margin after update :
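
The mistake-driven rule on this slide, as a minimal Python sketch (plain lists for vectors, names my own):

```python
def perceptron_update(w, x, y):
    """One Perceptron round: if sign(w.x) disagrees with y (or the score
    is zero), add y*x to w; otherwise do nothing."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score <= 0:                                  # mistake
        return [wi + y * xi for wi, xi in zip(w, x)], True
    return w, False                                     # no mistake: do nothing
```

After an update on (x, y) the margin on that example increases by y·(y·x·x) = ||x||², which is the "margin after update" observation on the slide.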

30 Geometrical Interpretation

31 Perceptron Algorithm Extremely easy to implement Relative loss bounds for separable and inseparable cases. Minimal assumptions (not iid) Easy to convert to a well-performing batch algorithm (under iid assumptions) Margin of examples is ignored by update Same update for separable case and inseparable case

32 Passive – Aggressive Approach The basis for a well-known algorithm in convex optimization due to Hildreth (1957) –Asymptotic analysis –Does not work in the inseparable case Three versions : –PA: separable case –PA-I, PA-II: inseparable case Beyond classification –Regression, one class, structured learning Relative loss bounds

33 Motivation Perceptron : –No guarantees of margin after the update PA : –Enforce a minimal non-zero margin after the update In particular : –If the margin is large enough (1), then do nothing –If the margin is less than a unit, update such that the margin after the update is enforced to be a unit

34 Input Space

35 Input Space vs. Version Space Input Space : –Points are input data –One constraint is induced by weight vector –Primal space –Half space = all input examples that are classified correctly by a given predictor (weight vector) Version Space : –Points are weight vectors –One constraint is induced by input data –Dual space –Half space = all predictors (weight vectors) that correctly classify a given input example

36 Weight Vector (Version) Space The algorithm forces to reside in this region

37 Passive Step Nothing to do. already resides on the desired side.

38 Aggressive Step The algorithm projects on the desired half-space

39 Aggressive Update Step Set to be the solution of the following optimization problem : Can be solved analytically If the constraint is already satisfied: –trivial solution Otherwise: –It will be satisfied as equality

40 An Equality Constraint Linear update : Force the constraint to hold as equality Solve :

41 Passive-Aggressive Update

42 Perceptron vs. PA Common Update : Perceptron Passive-Aggressive

43 Perceptron vs. PA Margin Error No-Error, Small Margin No-Error, Large Margin

44 Perceptron vs. PA

45 Unrealizable case There is no weight vector that satisfies all the constraints

46 Unrealizable Case

47 Unrealizable Case

48 Four Algorithms Perceptron PA PA-I PA-II

49 Four Algorithms Perceptron PA PA-I PA-II
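
The four algorithms share the update w ← w + τ·y·x and differ only in the step size τ. As a sketch (with ℓ the hinge loss, `xnorm2` = ||x||², and C the aggressiveness parameter of the PA variants; the Perceptron uses τ = 1 on mistakes):

```python
def tau_pa(loss, xnorm2):
    """PA: exact projection; assumes a separable-style hard constraint."""
    return loss / xnorm2

def tau_pa1(loss, xnorm2, C):
    """PA-I: step capped at C (slack variable with linear penalty)."""
    return min(C, loss / xnorm2)

def tau_pa2(loss, xnorm2, C):
    """PA-II: smoothed step (slack variable with quadratic penalty)."""
    return loss / (xnorm2 + 1.0 / (2.0 * C))
```

As C → ∞ both PA-I and PA-II recover the plain PA step, so C trades aggressiveness against robustness to noise.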

50 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Cumulative Loss Suffered by the Algorithm Sequence of Prediction Functions Cumulative Loss of Competitor

51 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Inequality Possibly Large Gap Regret Extra Loss Competitiveness Ratio

52 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Grows With T Constant Grows With T

53 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Best Prediction Function in hindsight for the data sequence

54 Summary Binary Classification : –Basic building block –Online learning, processes one example –Few additive algorithms Next: –Complex problems –Examples

55 Binary Classification If it’s not one class, it must be the other class

56 Multi Class Single Label Elimination of a single class is not enough

57 Ordinal Ranking / Regression Machine Prediction Viewer’s Rating Order relation over labels Structure Over Possible Labels Snyder & Barzilay, 2007

58 Hierarchical Classification Phonetic transcription of DECEMBER Gross error Small errors d ix CH eh m bcl b er d AE s eh m bcl b er d ix s eh NASAL bcl b er [DKS’04]

59 Phonetic Hierarchy Structure over possible labels: phonemes divide into sonorants (nasals: n, m, ng; liquids: l, y, w, r; vowels: front, center, back), obstruents (plosives: b, g, d, k, p, t; fricatives: f, v, sh, s, th, dh, zh, z; affricates: jh, ch), and silences [DKS’04]

60 Multi-Class Multi-Label ECONOMICS CORPORATE / INDUSTRIAL REGULATION / POLICY MARKETS LABOUR GOVERNMENT / SOCIAL LEGAL/JUDICIAL REGULATION/POLICY SHARE LISTINGS PERFORMANCE ACCOUNTS/EARNINGS COMMENT / FORECASTS SHARE CAPITAL BONDS / DEBT ISSUES LOANS / CREDITS STRATEGY / PLANS INSOLVENCY / LIQUIDITY Full topic rankingDocument The higher minimum wage signed into law… will be welcome relief for millions of workers …. The 90- cent-an-hour increase …. Relevant topics MARKETS LABOUR ECONOMICS REGULATION / POLICY CORPORATE / INDUSTRIAL GOVERNMENT / SOCIAL

61 Multi-Class Multi-Label Document: “The higher minimum wage signed into law… will be welcome relief for millions of workers…. The 90-cent-an-hour increase….” Full topic ranking: ECONOMICS, INSOLVENCY / LIQUIDITY, CORPORATE / INDUSTRIAL, REGULATION / POLICY, COMMENT / FORECASTS, LABOUR, LOANS / CREDITS, LEGAL/JUDICIAL, REGULATION/POLICY, SHARE LISTINGS, GOVERNMENT / SOCIAL, PERFORMANCE, ACCOUNTS/EARNINGS, MARKETS, SHARE CAPITAL, BONDS / DEBT ISSUES, STRATEGY / PLANS. Relevant topics: MARKETS, LABOUR, ECONOMICS, REGULATION / POLICY, CORPORATE / INDUSTRIAL, GOVERNMENT / SOCIAL. Recall, Precision, Any Error?, No. of Errors? Non-trivial Evaluation Measures

62 Noun Phrase Chunking Estimated volume was a light 2.4 million ounces. Simultaneous Labeling

63 Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Named Entity Extraction Interactive Decisions

64 Sentence Compression The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from \$8,000 for a single user to \$90,000 for a multiuser project site. Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams. The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from \$8,000 for a single user to \$90,000 for a multiuser project site. Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams. [McDonald06] Complex Input – Output Relation

65 Dependency Parsing John hit the ball with the bat Non-trivial Output

66 Aligning Polyphonic Music Symbolic representation: Acoustic representation: Two ways for representing music [Shalev-Shwartz, Keshet, Singer 2004]

67 Symbolic Representation time pitch - pitch symbolic representation: - start-time [Shalev-Shwartz, Keshet, Singer 2004]

68 [Shalev-Shwartz, Keshet, Singer 2004] Acoustic Representation Feature Extraction (e.g. Spectral Analysis) acoustic representation: acoustic signal:

69 [Shalev-Shwartz, Keshet, Singer 2004] The Alignment Problem Setting time pitch actual start-time:

70 Challenges Elimination is not enough Structure over possible labels Non-trivial loss functions Complex input – output relation Non-trivial output

71 Challenges Interactive Decisions A wide range of features Computing an answer is relatively costly Fake News Show

72 Analysis as Labeling Label gives role for corresponding input “Many to one relation” Still, not general enough Model

73 Examples B I O B I I I I O Estimated volume was a light 2.4 million ounces. Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. B-PER I-PER O B-ORG O B-PER I-PER O O O O O O

74 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

75 Loss Functions Hamming Distance (Number of wrong decisions) Levenshtein Distance (Edit distance) –Speech Number of words with incorrect parent –Dependency parsing B I O B I I I I O Estimated volume was a light 2.4 million ounces. B O O O B I I O O
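
Both structured losses above are a few lines of Python (sketch; tag sequences are given as strings of B/I/O characters for compactness):

```python
def hamming(y_true, y_pred):
    """Hamming distance: number of positions with a wrong decision."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def levenshtein(a, b):
    """Levenshtein (edit) distance via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

On the slide's chunking example (truth B I O B I I I I O vs. prediction B O O O B I I O O) the Hamming loss is 4.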

76 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

77 Multiclass Representation I k Prototypes

78 k Prototypes New instance Multiclass Representation I

79 Multiclass Representation I k Prototypes New instance Compute Class r scores: 1: -1.08, 2: 1.66, 3: 0.37, 4: -2.09

80 Multiclass Representation I k Prototypes New instance Compute Prediction: the class achieving the highest score Class r scores: 1: -1.08, 2: 1.66, 3: 0.37, 4: -2.09
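
Representation I keeps one weight vector (prototype) per class and predicts the argmax of the inner-product scores, as in the slide's example where the class with score 1.66 wins. A minimal sketch:

```python
def predict_class(prototypes, x):
    """Score each class prototype against x by an inner product and
    return (index of best class, all scores)."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in prototypes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```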

81 Map all input and labels into a joint vector space Score labels by projecting the corresponding feature vector Multiclass Representation II Estimated volume was a light 2.4 million ounces. F = (0 1 1 0 … ) B I O B I I I I O

82 Multiclass Representation II Predict label with highest score (Inference) Naïve search is expensive if the set of possible labels is large –No. of labelings = 3^(No. of words)

83 Features based on local domains Efficient Viterbi decoding for sequences Multiclass Representation II Estimated volume was a light 2.4 million ounces. F = (0 1 1 0 … ) B I O B I I I I O
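
When features decompose over local domains, the argmax over the exponentially many labelings is computed by Viterbi decoding, linear in sequence length and quadratic in the number of tags. A self-contained sketch (score matrices are my own illustrative interface):

```python
def viterbi(emit, trans, n_states):
    """Viterbi decoding. emit[t][s] = local score of state s at position t;
    trans[p][s] = score of transitioning p -> s. Returns the best path."""
    T = len(emit)
    delta = [emit[0][:]]          # best score of any path ending in s at t
    back = []                     # backpointers
    for t in range(1, T):
        row, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda p: delta[-1][p] + trans[p][s])
            row.append(delta[-1][best] + trans[best][s] + emit[t][s])
            ptr.append(best)
        delta.append(row)
        back.append(ptr)
    path = [max(range(n_states), key=delta[-1].__getitem__)]
    for ptr in reversed(back):    # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

For B/I/O chunking, `n_states` is 3 and `emit` comes from the learned weight vector applied to local features.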

84 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev

85 Multiclass Representation II Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

86 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev Worst Labeling

87 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev Worst Labeling

88 Two Representations Weight-vector per class (Representation I) –Intuitive –Improved algorithms Single weight-vector (Representation II) –Generalizes representation I –Allows complex interactions between input and output 000x0 F(x,4)=

89 Why linear models? Combine the best of generative and classification models: –Trade off labeling decisions at different positions –Allow overlapping features Modular –factored scoring –loss function –From features to kernels

90 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

91 Multiclass Multilabel Perceptron Single Round Get a new instance Predict ranking Get feedback Compute loss If update weight-vectors

92 Construct Error-Set Form any set of parameters that satisfies:   If then  Multiclass Multilabel Perceptron Update (1)

93 Multiclass Multilabel Perceptron Update (1)

94 Multiclass Multilabel Perceptron Update (1) 2 3 145

95 Multiclass Multilabel Perceptron Update (1) 2 3 145

96 Multiclass Multilabel Perceptron Update (1) 000 0 2 3 145

97 Multiclass Multilabel Perceptron Update (1) 000 1-a0a 2 3 145

98 Set for Update : Multiclass Multilabel Perceptron Update (2)

99 Multiclass Multilabel Perceptron Update (2) 000 1-a0a 2 3 145

100 Multiclass Multilabel Perceptron Update (2) 000 1-a0a 2 3 145

101 Uniform Update 000 1/20 2 3 145

102 Max Update 000 100 2 3 145 Sparse Performance is worse than Uniform’s

103 Update Results BeforeUniform UpdateMax Update

104 Binary : Multi Class : Margin for Multi Class Prediction Error Margin Error

105 Binary : Multi Class : Margin for Multi Class

106 Multi Class : Margin for Multi Class But not all mistakes are equal. How do we know? Because the loss function is not constant!

107 Multi Class : Margin for Multi Class So, use it !

108 Linear Structure Models Correct Labeling Almost Correct Labeling After Shalev

109 Linear Structure Models Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

110 Linear Structure Models Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

111 PA Multi Class Update

112 PA Multi Class Update Project the current weight vector such that the instance ranking is consistent with loss function Set to be the solution of the following optimization problem :

113 PA Multi Class Update Problem –intersection of constraints may be empty Solutions –Does not occur in practice –Add a slack variable –Remove constraints

114 Add a Slack Variable Add a slack variable : Rewrite the optimization : Generalized Hinge Loss

115 Add a Slack Variable We like to solve : May have exponential number of constraints, thus intractable –If the loss can be factored then there is a polynomial equivalent set of constraints (Taskar et al. 2003) –Remove constraints (the other solution for the empty-set problem) B I O B I I I I O Estimated volume was a light 2.4 million ounces.

116 PA Multi Class Update Remove constraints : How to choose the single competing labeling ? –The labeling that attains the highest score! –… which is the predicted label according to the current model

117 Initialize For – Receive an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Update the prediction rule PA Multiclass online algorithm

118 Advantages Process one training instance at a time Very simple Predictable runtime, small memory Adaptable to different loss functions Requires : –Inference procedure –Loss function –Features

119 Batch Setting Often we are given two sets : –Training set used to find a good model –Test set for evaluation Enumerate over the training set –Possibly more than once –Fix the weight vector Evaluate on Test Set Formal guarantees if training set and test set are i.i.d from a fixed (unknown) distribution

120 Two Improvements Averaging –Instead of using the final weight-vector, use a combination of all weight-vectors obtained during training Top-k update –Instead of using only the labeling with the highest score, use the k labelings with the highest scores
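
Averaging is a small change to the training loop: accumulate every intermediate weight vector and return the mean. A binary-Perceptron sketch (the same trick applies to MIRA/PA):

```python
def train_averaged(examples, dim, epochs=1):
    """Averaged Perceptron: train mistake-driven updates, but return the
    average of all weight vectors seen during training, not the final one."""
    w = [0.0] * dim
    total = [0.0] * dim
    n = 0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                        # mistake: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
            total = [t + wi for t, wi in zip(total, w)]
            n += 1
    return [t / n for t in total]
```

The averaged vector typically generalizes better than the last one because it downweights late, noisy updates.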

121 Initialize For – Receive an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Update the prediction rule –Update the averaged rule Averaging MIRA

122 Top-k Update Recall : –Inference : –Update

123 Top-K Update Recall : –Inference –Top-k Inference

124 Top-K Update Top-k Inference : Update :

125 Previous Approaches Focus on sequences Similar constructions for other structures Mainly batch algorithms

126 Previous Approaches Generative models: probabilistic generators of sequence-structure pairs –Hidden Markov Models (HMMs) –Probabilistic CFGs Sequential classification: decompose structure assignment into a sequence of structural decisions Conditional models : probabilistic model of labels given input –Conditional Random Fields [LMP 2001]

127 Previous Approaches Re-ranking : Combine generative and discriminative models –Full parsing [Collins 2000] Max-Margin Markov Networks [TGK 2003] –Use all competing labels –Elegant factorization –Closely related to PA with all the constraints

128 HMMs (graphical model: hidden states y1 → y2 → y3, each emitting observations x1, x2, x3)

129 HMM

130 HMMs Solves a more general problem than required: models the joint probability. Hard to model overlapping features, yet applications need richer input representations –E.g. word identity, capitalization, ends in “-tion”, word in word list, word font, white space ratio, begins with number, word ends with “?” Relaxing the conditional independence of features given labels leads to intractability

131 Conditional Random Fields Define distributions over labels Maximize the log-likelihood of data Dynamic programming for expectations (forward-backward algorithm) Standard convex optimization (L-BFGS) John Lafferty, Andrew McCallum, Fernando Pereira 2001

132 Local Classifiers Train local classifiers (e.g. MEMMs) and combine their results at test time Cheap to train Cannot model long-distance interactions well Problem: cannot trade off decisions at different locations (label-bias problem) B ? O B I I I I O Estimated volume was a light 2.4 million ounces. MEMM Dan Roth

133 Re-Ranking Use a generative model to reduce exponential number of labelings into a polynomial number of candidates –Local features Use the Perceptron algorithm to re-rank the list –Global features Great results! Michael Collins 2000

134 Empirical Evaluation Category Ranking / Multiclass multilabel Noun Phrase Chunking Named entities Dependency Parsing Genefinder Phoneme Alignment Non-projective Parsing

135 Experimental Setup Algorithms : –Rocchio, normalized prototypes –Perceptron, one per topic –Multiclass Perceptron - IsErr, ErrorSetSize Features –About 100 terms/words per category Online to Batch : –Cycle once through training set –Apply resulting weight-vector to the test set

136 Data Sets

| | Reuters-21578 |
| --- | --- |
| Training examples | 8,631 |
| Test examples | 2,158 |
| Topics | 90 |
| Avg. labels per document | 1.24 |
| No. of features | 3,468 |

137 Data Sets

| | Reuters-21578 | Reuters-2000 |
| --- | --- | --- |
| Training examples | 8,631 | 521,439 |
| Test examples | 2,158 | 287,944 |
| Topics | 90 | 102 |
| Avg. labels per document | 1.24 | 3.20 |
| No. of features | 3,468 | 9,325 |

138 Training Online Results (figure): average cumulative IsErr vs. round number on Reuters-21578 and Reuters-2000, comparing Perceptron, MP IsErr, and MP ErrSetSize

139 Training Online Results (figure): average cumulative AvgP (average precision) vs. round number on Reuters-21578 and Reuters-2000, comparing Perceptron, MP IsErr, and MP ErrSetSize

140 Test Results (figure): IsErr and ErrorSetSize on R21578 and R2000, comparing Perceptron, MP IsErr, MP ErrSetSize, and Rocchio

141 Sequence Text Analysis Features : –Meaningful word features –POS features –Unigram and bi-gram NER/NP features Inference : –Dynamic programming –Linear in length, quadratic in number of classes Estimated volume was a light 2.4 million ounces. F = (0 1 1 0 … ) B I O B I I I I O

142 Noun Phrase Chunking Estimated volume was a light 2.4 million ounces.

| Algorithm | Score |
| --- | --- |
| Avg. Perceptron | 0.941 |
| CRF | 0.942 |
| MIRA | 0.943 |

McDonald, Crammer, Pereira

143 Noun Phrase Chunking Training time in CPU minutes Performance on test data McDonald, Crammer, Pereira

144 Named Entity Extraction Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes.

| Algorithm | Score |
| --- | --- |
| Avg. Perceptron | 0.823 |
| CRF | 0.830 |
| MIRA | 0.831 |

McDonald, Crammer, Pereira

145 Named Entity Extraction Training time in CPU minutes Performance on test data McDonald, Crammer, Pereira

146 Dependency parsing Features : –Anything over words –single edges Inference : –Search over possible trees in cubic time (Eisner, Satta)

147 Dependency Parsing

| Algorithm | English | Czech |
| --- | --- | --- |
| Yamada & Matsumoto, 2003 | 90.3 | |
| Nivre & Scholz, 2004 | 87.3 | |
| Avg. Perc. | 90.6 | 82.9 |
| MIRA | 90.9 | 83.2 |

McDonald, Crammer, Pereira

148 Gene Finding

149 Gene Finding Semi-Markov Models (features include segment-length information) [Sarawagi & Cohen, 2004] Decoding quadratic in length of sequence Bernal, Crammer, Hatzigeorgiu, Pereira sensitivity = TP / (TP + FN) specificity = TN / (TN + FP)
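
For reference, the two evaluation measures in their standard form, sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP), as trivial helpers:

```python
def sensitivity(tp, fn):
    """True-positive rate: fraction of actual positives that are recovered."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: fraction of actual negatives correctly rejected."""
    return tn / (tn + fp)
```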

150 Dependency Parsing Complications : –High-order features and multiple parents per word –Non-projective trees Approximate inference McDonald & Pereira, 2006 Czech: 83.0 (1st-order, proj), 84.2 (2nd-order, proj), 84.1 (1st-order, non-proj), 85.2 (2nd-order, non-proj)

151 Phoneme Alignment Input: acoustic signal + true phonemes Output : segmentation of the signal

| | t<10 | t<20 | t<30 | t<40 |
| --- | --- | --- | --- | --- |
| Discriminative | 79.2 | 92.1 | 96.2 | 98.1 |
| Brugnara et al. | 75.3 | 88.9 | 94.4 | 97.1 |

Keshet, Shalev-Shwartz, Singer, Chazan

152 Summary Online training for complex decisions Simple to implement, fast to train Modular : –Loss function –Feature Engineering –Inference procedure (exact, approximated) Works well in practice Theoretically analyzed Find closest model that classify well current instance

153 Uncovered Topics Kernels Multiplicative updates Bregman divergences and projections Theory of online-to-batch Matrix representations and updates

154 Partial Bibliography Prediction, learning, and games. Nicolò Cesa-Bianchi and Gábor Lugosi Y. Censor & S.A. Zenios, “Parallel Optimization”, Oxford UP, 1997 Y. Freund & R. Schapire, “Large margin classification using the Perceptron algorithm”, MLJ, 1999. M. Herbster, “Learning additive models online with fast evaluating kernels”, COLT 2001 J. Kivinen, A. Smola, and R.C. Williamson, “Online learning with kernels”, IEEE Trans. on SP, 2004 H.H. Bauschke & J.M. Borwein, “On Projection Algorithms for Solving Convex Feasibility Problems”, SIAM Review, 1996

155 Applications Online Passive-Aggressive Algorithms, CDSS’03 + CDKSS’05 Online Ranking by Projecting, CS’05 Large Margin Hierarchical Classification, DKS’04 Online Learning of Approximate Dependency Parsing Algorithms. R. McDonald and F. Pereira. European Association for Computational Linguistics, 2006 Discriminative Sentence Compression with Soft Syntactic Constraints. R. McDonald. European Association for Computational Linguistics, 2006 Non-Projective Dependency Parsing using Spanning Tree Algorithms. R. McDonald, F. Pereira, K. Ribarov and J. Hajic. HLT-EMNLP, 2005 Flexible Text Segmentation with Structured Multilabel Classification. R. McDonald, K. Crammer and F. Pereira. HLT-EMNLP, 2005

156 Applications Online and Batch Learning of Pseudo-metrics, SSN’04 Learning to Align Polyphonic Music, SKS’04 The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees, DSS’04 First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Michael Wick, Robert Hall and Andrew McCallum. NAACL/HLT, 2007 Structured Models for Fine-to-Coarse Sentiment Analysis. R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar. Association for Computational Linguistics, 2007 Multilingual Dependency Parsing with a Two-Stage Discriminative Parser. R. McDonald, K. Lerman and F. Pereira. Conference on Natural Language Learning, 2006 Discriminative Kernel-Based Phoneme Sequence Recognition. Joseph Keshet, Shai Shalev-Shwartz, Samy Bengio, Yoram Singer and Dan Chazan. International Conference on Spoken Language Processing, 2006

157 Remarks If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere) Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-approximation error for the Open Hemisphere problem. Bound on the number of mistakes the Perceptron makes in terms of the hinge loss of any competitor

158 Thank You

159 Definitions Any Competitor The parameters vector can be chosen using the input data The parameterized hinge loss of on True hinge loss 1-norm and 2-norm of hinge loss

160 Geometrical Assumption All examples are bounded in a ball of radius R

161 Perceptron’s Mistake Bound Bounds : If the sample is separable then
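
In the separable case the classical bound (stated here in standard notation, since the slide's formulas were lost in transcription) uses the radius R of the data ball and the margin γ achieved by some unit-norm separator u:

```latex
M \;\le\; \left(\frac{R}{\gamma}\right)^{2},
\qquad
\gamma \;=\; \min_{t}\, y_t \,(u \cdot x_t),
\quad \|u\| = 1,
\quad \|x_t\| \le R .
```

The number of mistakes M is finite and independent of the dimension and of the order of the examples.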

162 Proof - Intuition Two views : –The angle between and decreases with –The following sum is fixed as we make more mistakes, our solution is better [FS99, SS05]

163 Proof Define the potential : Bound its cumulative sum from above and below [C04]

164 Proof Bound from above : Telescopic Sum Non-Negative Zero Vector

165 Proof Bound From Below : –No error on t th round –Error on t th round

166 Proof We bound each term :

167 Proof Bound From Below : –No error on t th round –Error on t th round Cumulative bound :

168 Proof Putting both bounds together : We use first degree of freedom (and scale) : Bound :

169 Proof General Bound : Choose : Simple Bound : Objective of SVM

170 Proof Better bound : optimize the value of

171 Remarks Bound does not depend on dimension of the feature vector The bound holds for all sequences. It is not tight for most real world data But, there exists a setting for which it is tight

172 Three Bounds

173 Separable Case Assume there exists such that for all examples Then all bounds are equivalent Perceptron makes a finite number of mistakes until convergence (not necessarily to )

174 Separable Case – Other Quantities Use 1 st (parameterization) degree of freedom Scale the such that Define The bound becomes

175 Separable Case - Illustration

176 Separable Case – Illustration The Perceptron will make more mistakes Finding a separating hyperplane is more difficult

177 Inseparable Case Difficult problem implies a large value of In this case the Perceptron will make a large number of mistakes

178 Aggressive Update Step Optimize for : Set the derivative to zero Substitute back into the Lagrangian : Dual optimization problem

179 Aggressive Update Step Dual Problem : Solve it : What about the constraint?

180 Three Decision Problems Classification Regression Uniclass

181 Each example defines a set of consistent hypotheses: The new vector is set to be the projection of onto The Passive-Aggressive Algorithm Classification Regression Uniclass

182 Loss Bound Assume there exists such that Assume : Then : Note :

183 Proof Sketch Define: Upper bound: Lower bound: Lipschitz Condition

184 Proof Sketch (Cont.) Combining upper and lower bounds

185 Loss Bound for PA-I Mistake bound : Optimal value, set

186 Loss Bound for PA-II Loss bound : Similar proof technique as of PA Bound can be improved similarly to the Perceptron

187 Goals Provide methods to design, analyze and efficiently implement online learning algorithms Introduce new online algorithms –state-of-the-art performance in practice –interesting theoretical guarantees Complex-output problems: –binary classification –multi-class categorization –information extraction, parsing & speech recognition