2 1 Advanced Online Learning for Natural Language Processing Koby Crammer University of Pennsylvania

3 2 Thanks Axel Bernal Steve Carroll Ofer Dekel Mark Dredze Kuzman Ganchev Artemis Hatzigeorgiou Joseph Keshet Ryan McDonald Fernando Pereira Fei Sha Shai Shalev-Shwartz Yoram Singer Partha Pratim Talukdar

4 3 Tutorial Context SVMs Tutorial Multiclass, Structured Optimization Theory Parsing Online Learning Information Extraction Speech Recognition Text Classification

5 4 Online Learning Tyrannosaurus rex

6 5 Online Learning Triceratops

7 6 Online Learning Tyrannosaurus rex Velociraptor

8 7 Journey Concepts Principles Algorithms Theory Binary Complex Structured Multi-Class Parsing Information Extraction Speech Recognition Document Classification Sentiment Classification

9 8 Formal Setting – Binary Classification Instances –Speech signal, Sentences Labels –Parse tree, Names Prediction rule –Linear prediction rules Loss –No. of mistakes

10 9 Predictions Discrete Predictions: –Hard to optimize Continuous predictions : –Label –Confidence

11 10 Loss Functions Natural Loss: –Zero-One loss: Real-valued-predictions loss: –Hinge loss: –Exponential loss (Boosting) –Log loss (Max Entropy, Boosting)
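The formulas for these losses appeared as images on the slide; below is a minimal Python sketch of the two losses used throughout the tutorial, assuming a label y in {-1, +1} and a real-valued score w·x (the boosting and max-entropy losses are listed on the slide but omitted here):

```python
import numpy as np

def zero_one_loss(y, score):
    """1 if the sign of the score disagrees with the label, else 0."""
    return float(np.sign(score) != y)

def hinge_loss(y, score):
    """Zero once the margin y*score reaches 1; grows linearly below that."""
    return max(0.0, 1.0 - y * score)

print(zero_one_loss(+1, 0.3), hinge_loss(+1, 0.3))    # 0.0 0.7 (correct but small margin)
print(zero_one_loss(+1, -0.5), hinge_loss(+1, -0.5))  # 1.0 1.5 (a mistake)
```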

12 11 Loss Functions 1 1 Zero-One Loss Hinge Loss

13 12 Online Framework Initialize Classifier Algorithm works in rounds On round the online algorithm : – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Goal : –Suffer small cumulative loss
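A schematic of this protocol in code, not taken from the slides; the `predict`, `loss`, and `update` callables are placeholders for whichever rule the later slides plug in:

```python
def online_learn(stream, w, predict, loss, update):
    """Generic online protocol: on each round predict, receive the true label,
    suffer a loss, and update the rule. Goal: small cumulative loss."""
    cumulative = 0.0
    for x, y in stream:            # rounds
        y_hat = predict(w, x)      # output a prediction
        cumulative += loss(y, y_hat)  # receive feedback, compute loss
        w = update(w, x, y)        # update the prediction rule
    return w, cumulative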

14 13 Any Features W.l.o.g. Binary Classifiers of the form Linear Classifiers Notation Abuse

15 14 Prediction : Confidence in prediction: Linear Classifiers (cntd.)

16 15 Margin of an example with respect to the classifier : Note : The set is separable iff there exists such that Margin
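The margin formula itself was an image; one standard way to write it, assuming a linear classifier w and a label y in {-1, +1}:

```latex
% Margin of an example (x, y) with respect to the classifier w
\gamma\bigl(w; (x, y)\bigr) \;=\; y\,(w \cdot x)

% The set \{(x_i, y_i)\}_{i=1}^{m} is separable iff there exists w such that
y_i\,(w \cdot x_i) > 0 \quad \text{for all } i = 1, \dots, m
```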

17 16 Geometrical Interpretation

18 17 Geometrical Interpretation

19 18 Geometrical Interpretation

20 19 Degree of Freedom - I The same geometrical hyperplane can be represented by many parameter vectors

21 20 Degree of Freedom - II Problem difficulty does not change if we shrink or expand the input space

22 21 Geometrical Interpretation Margin >0 Margin <<0 Margin <0 Margin >>0

23 22 Hinge Loss

24 23 Separable Set

25 24 Inseparable Sets

26 25 Why Online Learning? Fast Memory efficient - process one example at a time Simple to implement Formal guarantees – Mistake bounds Online to Batch conversions No statistical assumptions Adaptive Not as good as a well-designed batch algorithm

27 26 Initialize Classifier Algorithm works in rounds On round the online algorithm : – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Online

28 27 Update Rules Online algorithms are based on an update rule which defines from (and possibly other information) Linear Classifiers : find from based on the input Some Update Rules : –Perceptron (Rosenblatt) –ALMA (Gentile) –ROMMA (Li & Long) –NORMA (Kivinen et al.) –MIRA (Crammer & Singer) –EG (Littlestone & Warmuth) –Bregman Based (Warmuth) –…

29 28 Three Update Rules The Perceptron Algorithm : –Agmon 1954; Rosenblatt 1958, Block 1962, Novikoff 1962, Minsky & Papert 1969, Freund & Schapire 1999, Blum & Dunagan 2002 Hildreth’s Algorithm : –Hildreth 1957 –Censor & Zenios 1997 –Herbster 2002 Loss Scaled : –Crammer & Singer 2001, 2002

30 29 The Perceptron Algorithm If No-Mistake –Do nothing If Mistake –Update Margin after update :
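In code the rule above is a one-line update; a minimal sketch, assuming numpy vectors and y in {-1, +1}:

```python
import numpy as np

def perceptron_update(w, x, y):
    """Perceptron: do nothing on a correct prediction, add y*x on a mistake.
    After an update the margin on (x, y) grows by ||x||^2."""
    if y * np.dot(w, x) <= 0:   # mistake (or zero margin)
        w = w + y * x
    return w
```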

31 30 Geometrical Interpretation

32 31 Perceptron Algorithm Extremely easy to implement Relative loss bounds for separable and inseparable cases. Minimal assumptions (not iid) Easy to convert to a well-performing batch algorithm (under iid assumptions) Margin of examples is ignored by update Same update for separable case and inseparable case

33 32 Passive-Aggressive Approach The basis for a well-known algorithm in convex optimization due to Hildreth (1957) –Asymptotic analysis –Does not work in the inseparable case Three versions : –PA: separable case –PA-I, PA-II: inseparable case Beyond classification –Regression, one class, structured learning Relative loss bounds

34 33 Motivation Perceptron: –No guarantee of margin after the update PA : –Enforce a minimal non-zero margin after the update In particular : –If the margin is large enough (1), then do nothing –If the margin is less than a unit, update such that the margin after the update is enforced to be a unit

35 34 Input Space

36 35 Input Space vs. Version Space Input Space : –Points are input data –One constraint is induced by a weight vector –Primal space –Half space = all input examples that are classified correctly by a given predictor (weight vector) Version Space : –Points are weight vectors –One constraint is induced by input data –Dual space –Half space = all predictors (weight vectors) that classify a given input example correctly

37 36 Weight Vector (Version) Space The algorithm forces the new weight vector to reside in this region

38 37 Passive Step Nothing to do: the current weight vector already resides on the desired side.

39 38 Aggressive Step The algorithm projects the current weight vector onto the desired half-space

40 39 Aggressive Update Step Set the new weight vector to be the solution of the following optimization problem : Can be solved analytically If the constraint is already satisfied: –trivial solution Otherwise: –It will be satisfied as an equality

41 40 An Equality Constraint Linear update : Force the constraint to hold as equality Solve :
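Solving that constrained problem in closed form gives the PA step described on these slides; a minimal sketch, where the step size tau is the hinge loss divided by the squared norm of the input:

```python
import numpy as np

def pa_update(w, x, y):
    """PA step: the closest vector to w that satisfies y * (w' . x) >= 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w                   # passive step: constraint already holds
    tau = loss / np.dot(x, x)      # closed-form step size
    return w + tau * y * x         # aggressive step: constraint now holds with equality
```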

42 41 Passive-Aggressive Update

43 42 Perceptron vs. PA Common Update : Perceptron Passive-Aggressive

44 43 Perceptron vs. PA Margin Error No-Error, Small Margin No-Error, Large Margin

45 44 Perceptron vs. PA

46 45 Unrealizable case There is no weight vector that satisfies all the constraints

47 46 Unrealizable Case

48 47 Unrealizable Case

49 48 Four Algorithms Perceptron, PA, PA-I, PA-II

50 49 Four Algorithms Perceptron, PA, PA-I, PA-II
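The four step sizes were shown as formulas on the slide; the sketch below follows the published Passive-Aggressive formulation (Crammer et al.), with C the aggressiveness parameter of PA-I and PA-II. All four share the update w ← w + τ·y·x and differ only in τ:

```python
import numpy as np

def step_size(w, x, y, variant="PA", C=1.0):
    """Multiplier tau in the shared update w <- w + tau * y * x."""
    margin = y * np.dot(w, x)
    loss = max(0.0, 1.0 - margin)      # hinge loss
    sq_norm = np.dot(x, x)
    if variant == "Perceptron":
        return 1.0 if margin <= 0 else 0.0
    if variant == "PA":
        return loss / sq_norm
    if variant == "PA-I":
        return min(C, loss / sq_norm)
    if variant == "PA-II":
        return loss / (sq_norm + 1.0 / (2.0 * C))
    raise ValueError(f"unknown variant: {variant}")
```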

51 50 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Cumulative Loss Suffered by the Algorithm Sequence of Prediction Functions Cumulative Loss of Competitor

52 51 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Inequality Possibly Large Gap Regret Extra Loss Competitiveness Ratio

53 52 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Grows With T Constant Grows With T

54 53 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Best Prediction Function in hindsight for the data sequence

55 54 Summary Binary Classification : –Basic building block –Online learning, processes one example –Few additive algorithms Next: –Complex problems –Examples

56 55 Binary Classification 33 If it’s not one class, it must be the other class

57 56 Multi Class Single Label 3426 Elimination of a single class is not enough

58 57 Ordinal Ranking / Regression Machine Prediction Viewer’s Rating Order relation over labels Structure Over Possible Labels Snyder & Barzilay, 2007

59 58 Hierarchical Classification Phonetic transcription of DECEMBER Gross error Small errors d ix CH eh m bcl b er d AE s eh m bcl b er d ix s eh NASAL bcl b er [DKS’04]

60 59 Phonetic Hierarchy b g PHONEMES Sononorants Silences Obstruents Nasals Liquids Vowels Plosives Fricatives Front Center Back n m ng d k p t f v sh s th dh zh z l y w r Affricates jh ch oy ow uh uw aa ao er aw ay iy ih ey eh ae Structure Over Possible Labels [DKS’04]

61 60 Multi-Class Multi-Label ECONOMICS CORPORATE / INDUSTRIAL REGULATION / POLICY MARKETS LABOUR GOVERNMENT / SOCIAL LEGAL/JUDICIAL REGULATION/POLICY SHARE LISTINGS PERFORMANCE ACCOUNTS/EARNINGS COMMENT / FORECASTS SHARE CAPITAL BONDS / DEBT ISSUES LOANS / CREDITS STRATEGY / PLANS INSOLVENCY / LIQUIDITY Full topic ranking Document The higher minimum wage signed into law… will be welcome relief for millions of workers …. The 90-cent-an-hour increase …. Relevant topics MARKETS LABOUR ECONOMICS REGULATION / POLICY CORPORATE / INDUSTRIAL GOVERNMENT / SOCIAL

62 61 Multi-Class Multi-Label Document The higher minimum wage signed into law… will be welcome relief for millions of workers …. The 90-cent-an-hour increase …. ECONOMICS INSOLVENCY / LIQUIDITY CORPORATE / INDUSTRIAL REGULATION / POLICY COMMENT / FORECASTS LABOUR LOANS / CREDITS LEGAL/JUDICIAL REGULATION/POLICY SHARE LISTINGS GOVERNMENT / SOCIAL PERFORMANCE ACCOUNTS/EARNINGS MARKETS SHARE CAPITAL BONDS / DEBT ISSUES STRATEGY / PLANS Full topic ranking Relevant topics MARKETS LABOUR ECONOMICS REGULATION / POLICY CORPORATE / INDUSTRIAL GOVERNMENT / SOCIAL Recall Precision Any Error? No. Errors? Non-trivial Evaluation Measures

63 62 Noun Phrase Chunking Estimated volume was a light 2.4 million ounces. Simultaneous Labeling

64 63 Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Named Entity Extraction Interactive Decisions

65 64 Sentence Compression The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from $8,000 for a single user to $90,000 for a multiuser project site. Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams. The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from $8,000 for a single user to $90,000 for a multiuser project site. Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams. [McDonald06] Complex Input – Output Relation

66 65 Dependency Parsing John hit the ball with the bat Non-trivial Output

67 66 Aligning Polyphonic Music Symbolic representation: Acoustic representation: Two ways for representing music [Shalev-Shwartz, Keshet, Singer 2004]

68 67 Symbolic Representation time pitch - pitch symbolic representation: - start-time [Shalev-Shwartz, Keshet, Singer 2004]

69 68 [Shalev-Shwartz, Keshet, Singer 2004] Acoustic Representation Feature Extraction (e.g. Spectral Analysis) acoustic representation: acoustic signal:

70 69 [Shalev-Shwartz, Keshet, Singer 2004] The Alignment Problem Setting time pitch actual start-time:

71 70 Challenges Elimination is not enough Structure over possible labels Non-trivial loss functions Complex input – output relation Non-trivial output

72 71 Challenges Interactive Decisions A wide range of features Computing an answer is relatively costly Fake News Show

73 72 Analysis as Labeling Label gives role for corresponding input "Many to one relation” Still, not general enough Model

74 73 Examples B I O B I I I I O Estimated volume was a light 2.4 million ounces. Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. B-PER I-PER O B-ORG O B-PER I-PER O O O O O O

75 74 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

76 75 Loss Functions Hamming Distance (Number of wrong decisions) Levenshtein Distance (Edit distance) –Speech Number of words with incorrect parent –Dependency parsing B I O B I I I I O Estimated volume was a light 2.4 million ounces. B O O O B I I O O

77 76 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

78 77 Multiclass Representation I k Prototypes

79 78 k Prototypes New instance Multiclass Representation I

80 79 Multiclass Representation I k Prototypes New instance Compute Class r

81 80 Multiclass Representation I k Prototypes New instance Compute Prediction: The class achieving the highest Score Class r
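A minimal sketch of Representation I, assuming a matrix W whose rows are the k class prototypes:

```python
import numpy as np

def predict_class(W, x):
    """Representation I: one prototype (weight vector) per class.
    Predict the class whose prototype attains the highest score w_r . x."""
    scores = W @ x
    return int(np.argmax(scores))
```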

82 81 Map all input and labels into a joint vector space Score labels by projecting the corresponding feature vector Multiclass Representation II Estimated volume was a light 2.4 million ounces. F = ( … ) B I O B I I I I O

83 82 Multiclass Representation II Predict label with highest score (Inference) Naïve search is expensive if the set of possible labels is large –No. of labelings = 3^(No. of words) B I O B I I I I O Estimated volume was a light 2.4 million ounces.

84 83 Features based on local domains Efficient Viterbi decoding for sequences Multiclass Representation II Estimated volume was a light 2.4 million ounces. F = ( … ) B I O B I I I I O
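The decoder itself is not spelled out on the slide; below is a generic first-order Viterbi sketch, assuming precomputed per-position emission scores and label-pair transition scores (in the tutorial's notation both would come from w·F over local domains):

```python
import numpy as np

def viterbi(emit, trans):
    """Highest-scoring label sequence under first-order (bigram) scores.
    emit[t, y]  : score of label y at position t
    trans[y, y']: score of label y followed by label y'
    Runs in O(T * K^2): linear in length, quadratic in the number of labels."""
    T, K = emit.shape
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + trans + emit[t][None, :]  # K x K candidates
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0)
    path = [int(np.argmax(score[T - 1]))]
    for t in range(T - 1, 0, -1):       # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```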

85 84 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev

86 85 Multiclass Representation II Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

87 86 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev Worst Labeling

88 87 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev Worst Labeling

89 88 Two Representations Weight-vector per class (Representation I) –Intuitive –Improved algorithms Single weight-vector (Representation II) –Generalizes representation I –Allows complex interactions between input and output 000x0 F(x,4)=

90 89 Why linear models? Combine the best of generative and classification models: –Trade off labeling decisions at different positions –Allow overlapping features Modular –factored scoring –loss function –From features to kernels

91 90 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

92 91 Multiclass Multilabel Perceptron Single Round Get a new instance Predict ranking Get feedback Compute loss If a loss was suffered, update weight-vectors

93 92 Multiclass Multilabel Perceptron Update (1) Construct Error-Set Form any set of parameters that satisfies: If … then …

94 93 Multiclass Multilabel Perceptron Update (1)

95 94 Multiclass Multilabel Perceptron Update (1)

96 95 Multiclass Multilabel Perceptron Update (1)

97 96 Multiclass Multilabel Perceptron Update (1)

98 97 Multiclass Multilabel Perceptron Update (1)

99 98 Set for Update : Multiclass Multilabel Perceptron Update (2)

100 99 Multiclass Multilabel Perceptron Update (2)

101 100 Multiclass Multilabel Perceptron Update (2)

102 101 Uniform Update

103 102 Max Update Sparse Performance is worse than Uniform’s

104 103 Update Results Before, Uniform Update, Max Update

105 104 Binary : Multi Class : Margin for Multi Class Prediction Error Margin Error

106 105 Binary : Multi Class : Margin for Multi Class

107 106 Multi Class : Margin for Multi Class But not all mistakes are equal? How do you know? Because the loss function is not constant !

108 107 Multi Class : Margin for Multi Class So, use it !

109 108 Linear Structure Models Correct Labeling Almost Correct Labeling After Shalev

110 109 Linear Structure Models Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

111 110 Linear Structure Models Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

112 111 PA Multi Class Update

113 112 PA Multi Class Update Project the current weight vector such that the instance ranking is consistent with loss function Set to be the solution of the following optimization problem :

114 113 PA Multi Class Update Problem –intersection of constraints may be empty Solutions –Does not occur in practice –Add a slack variable –Remove constraints

115 114 Add a Slack Variable Add a slack variable : Rewrite the optimization : Generalized Hinge Loss
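The rewritten optimization and the generalized hinge loss appeared as formulas; a common way to write them for the structured case, using Φ(x,y) for the joint feature map and ρ(y,y') for the label loss (this notation is assumed here, not taken from the slide):

```latex
% PA multiclass objective with a slack variable \xi and aggressiveness C
\min_{w,\;\xi \ge 0} \;\; \tfrac{1}{2}\,\|w - w_t\|^{2} + C\,\xi
\quad \text{s.t.} \quad
w \cdot \Phi(x, y) - w \cdot \Phi(x, y') \;\ge\; \rho(y, y') - \xi
\quad \forall\, y'

% Generalized hinge loss of w on (x, y)
\ell\bigl(w; (x, y)\bigr) \;=\; \max_{y'} \Bigl[ \rho(y, y') -
  \bigl( w \cdot \Phi(x, y) - w \cdot \Phi(x, y') \bigr) \Bigr]
```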

116 115 Add a Slack Variable We would like to solve : May have an exponential number of constraints, thus intractable –If the loss can be factored then there is a polynomial equivalent set of constraints (Taskar et al. 2003) –Remove constraints (the other solution for the empty-set problem) B I O B I I I I O Estimated volume was a light 2.4 million ounces.

117 116 PA Multi Class Update Remove constraints : How to choose the single competing labeling ? –The labeling that attains the highest score! –… which is the predicted label according to the current model
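Keeping only that single constraint gives the 1-best PA/MIRA step; a minimal sketch, where `feat(x, y)` and `loss(y, y')` are assumed helper functions for the joint feature map and the label loss:

```python
import numpy as np

def pa_structured_update(w, feat, x, y_true, y_pred, loss):
    """1-best PA/MIRA step: demand a margin of loss(y_true, y_pred) over the
    single highest-scoring competing labeling y_pred."""
    delta = feat(x, y_true) - feat(x, y_pred)                  # F(x, y) - F(x, y_hat)
    hinge = max(0.0, loss(y_true, y_pred) - np.dot(w, delta))  # generalized hinge loss
    denom = np.dot(delta, delta)
    if hinge == 0.0 or denom == 0.0:
        return w
    return w + (hinge / denom) * delta
```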

118 117 PA Multiclass online algorithm Initialize For – Receive an input instance – Output a prediction – Receive a feedback label – Compute loss – Update the prediction rule

119 118 Advantages Process one training instance at a time Very simple Predictable runtime, small memory Adaptable to different loss functions Requires : –Inference procedure –Loss function –Features

120 119 Batch Setting Often we are given two sets : –Training set used to find a good model –Test set for evaluation Enumerate over the training set –Possibly more than once –Fix the weight vector Evaluate on the test set Formal guarantees if training set and test set are i.i.d. from a fixed (unknown) distribution

121 120 Two Improvements Averaging –Instead of using the final weight-vector, use a combination of all weight-vectors obtained during training Top-k update –Instead of using only the labeling with the highest score, use the k labelings with the highest score
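A sketch of the averaging improvement (the top-k variant would additionally pass the k best labelings into the update); the `update` callable stands for whatever per-example rule is being averaged and is an assumption of this sketch:

```python
import numpy as np

def train_averaged(examples, dim, update, epochs=1):
    """Averaged weights: accumulate the weight vector after every round and
    return the mean, which is typically more stable than the final vector."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    rounds = 0
    for _ in range(epochs):
        for x, y in examples:
            w = update(w, x, y)
            w_sum += w
            rounds += 1
    return w_sum / max(rounds, 1)
```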

122 121 Averaging MIRA Initialize For – Receive an input instance – Output a prediction – Receive a feedback label – Compute loss – Update the prediction rule –Update the averaged rule

123 122 Top-k Update Recall : –Inference : –Update

124 123 Top-K Update Recall : –Inference –Top-k Inference

125 124 Top-K Update Top-k Inference : Update :

126 125 Previous Approaches Focus on sequences Similar constructions for other structures Mainly batch algorithms

127 126 Previous Approaches Generative models: probabilistic generators of sequence-structure pairs –Hidden Markov Models (HMMs) – Probabilistic CFGs Sequential classification: decompose structure assignment into a sequence of structural decisions Conditional models : probabilistic model of labels given input –Conditional Random Fields [LMR 2001]

128 127 Previous Approaches Re-ranking : Combine generative and discriminative models –Full parsing [Collins 2000] Max-Margin Markov Networks [TGK 2003] –Use all competing labels –Elegant factorization –Closely related to PA with all the constraints

129 128 HMMs (graphical-model diagram: hidden states y1, y2, y3, each emitting observations x1a-x1c, x2a-x2c, x3a-x3c)

130 129 HMM

131 130 HMMs Solves a more general problem than required: models the joint probability. Hard to model overlapping features, yet applications need richer input representations. –E.g. word identity, capitalization, ends in “-tion”, word in word list, word font, white-space ratio, begins with number, ends with “?” Relaxing conditional independence of features given labels leads to intractability

132 131 Conditional Random Fields Define distributions over labels Maximize the log-likelihood of data Dynamic programming for expectations (forward-backward algorithm) Standard convex optimization (L-BFGS) John Lafferty, Andrew McCallum, Fernando Pereira 2001

133 132 Local Classifiers Train local classifiers (e.g. MEMMs) and combine the results at test time Cheap to train Cannot model long-distance interactions well –Problem: cannot trade off decisions at different locations (label-bias problem) B ? O B I I I I O Estimated volume was a light 2.4 million ounces. MEMM Dan Roth

134 133 Re-Ranking Use a generative model to reduce exponential number of labelings into a polynomial number of candidates –Local features Use the Perceptron algorithm to re-rank the list –Global features Great results! Michael Collins 2000

135 134 Empirical Evaluation Category Ranking / Multiclass multilabel Noun Phrase Chunking Named entities Dependency Parsing Genefinder Phoneme Alignment Non-projective Parsing

136 135 Experimental Setup Algorithms : –Rocchio, normalized prototypes –Perceptron, one per topic –Multiclass Perceptron - IsErr, ErrorSetSize Features –About 100 terms/words per category Online to Batch : –Cycle once through training set –Apply resulting weight-vector to the test set

137 136 Data Sets Reuters21578 Training Examples 8,631 Test Examples 2,158 Topics No. Features 3,468

138 137 Data Sets Reuters21578: Training Examples 8,631, Test Examples 2,158, No. Features 3,468. Reuters2000: Training Examples 521,439, Test Examples 287,944, No. Features 9,325. (Topic counts not shown.)

139 138 Training Online Results (plots: average cumulative IsErr vs. round number, on Reuters 21578 and Reuters 2000; curves: Perceptron, MP IsErr, MP ErrSetSize)

140 139 Training Online Results (plots: average cumulative average precision (AvgP) vs. round number, on Reuters 21578 and Reuters 2000; curves: Perceptron, MP IsErr, MP ErrSetSize)

141 140 Test Results (IsErr and ErrorSetSize on R21578 and R2000; algorithms: Rocchio, Perceptron, MP IsErr, MP ErrSetSize)

142 141 Sequence Text Analysis Features : –Meaningful word features –POS features –Unigram and bi-gram NER/NP features Inference : –Dynamic programming –Linear in length, quadratic in number of classes Estimated volume was a light 2.4 million ounces. F = ( … ) B I O B I I I I O

143 142 Noun Phrase Chunking Estimated volume was a light 2.4 million ounces Test scores: Avg. Perceptron (value not shown), CRF 0.942, MIRA 0.943 McDonald, Crammer, Pereira

144 143 Noun Phrase Chunking Training time in CPU minutes Performance on test data McDonald, Crammer, Pereira

145 Named Entity Extraction Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Test scores: Avg. Perceptron (value not shown), CRF 0.830, MIRA 0.831 McDonald, Crammer, Pereira

146 145 Named Entity Extraction Training time in CPU minutes Performance on test data McDonald, Crammer, Pereira

147 146 Dependency parsing Features : –Anything over words –single edges Inference : –Search over possible trees in cubic time (Eisner, Satta)

148 147 Dependency Parsing English: Yamada & Matsumoto 90.3, Nivre & Scholz (value not shown), Avg. Perc. (value not shown), MIRA 90.9. Czech: Avg. Perc. 82.9, MIRA 83.2. McDonald, Crammer, Pereira

149 148 Gene Finding

150 149 Gene Finding Semi-Markov Models (features include segment-length information) [ Sarawagi & Cohen, 2004 ] Decoding quadratic in length of sequence Bernal, Crammer, Hatzigeorgiou, Pereira sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)

151 150 Dependency Parsing Complications: –High-order features and multiple parents per word –Non-projective trees Approximate inference McDonald & Pereira (Czech results: projective vs. non-projective variants, scores not shown)

152 151 Phoneme Alignment Input: Acoustic signal + true phonemes Output : Segmentation of the signal (results table: accuracy at tolerances t<20, t<30, t<40 and one further threshold, values not shown; systems: Discriminative, Brugnara et al.) Keshet, Shalev-Shwartz, Singer, Chazan

153 152 Summary Online training for complex decisions Simple to implement, fast to train Modular : –Loss function –Feature engineering –Inference procedure (exact, approximated) Works well in practice Theoretically analyzed Finds the closest model that classifies the current instance well

154 153 Uncovered Topics Kernels Multiplicative updates Bregman divergences and projections Theory of online-to-batch Matrix representations and updates

155 154 Partial Bibliography Prediction, learning, and games. Nicolò Cesa-Bianchi and Gábor Lugosi Y. Censor & S.A. Zenios, “Parallel Optimization”, Oxford UP, 1997 Y. Freund & R. Schapire, “Large margin classification using the Perceptron algorithm”, MLJ, M. Herbster, “Learning additive models online with fast evaluating kernels”, COLT 2001 J. Kivinen, A. Smola, and R.C. Williamson, “Online learning with kernels”, IEEE Trans. on SP, 2004 H.H. Bauschke & J.M. Borwein, “On Projection Algorithms for Solving Convex Feasibility Problems”, SIAM Review, 1996

156 155 Applications Online Passive Aggressive Algorithms, CDSS’03 + CDKSS’05 Online Ranking by Projecting, CS’05 Large Margin Hierarchical Classification, DKS’04 Online Learning of Approximate Dependency Parsing Algorithms. R. McDonald and F. Pereira. European Association for Computational Linguistics, 2006 Discriminative Sentence Compression with Soft Syntactic Constraints. R. McDonald. European Association for Computational Linguistics, 2006 Non-Projective Dependency Parsing using Spanning Tree Algorithms. R. McDonald, F. Pereira, K. Ribarov and J. Hajic. HLT-EMNLP, 2005 Flexible Text Segmentation with Structured Multilabel Classification. R. McDonald, K. Crammer and F. Pereira. HLT-EMNLP, 2005

157 156 Applications Online and Batch Learning of Pseudo-metrics, SSN’04 Learning to Align Polyphonic Music, SKS’04 The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees, DSS’04 First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Michael Wick, Robert Hall and Andrew McCallum. NAACL/HLT, 2007 Structured Models for Fine-to-Coarse Sentiment Analysis. R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar. Association for Computational Linguistics, 2007 Multilingual Dependency Parsing with a Two-Stage Discriminative Parser. R. McDonald, K. Lerman and F. Pereira. Conference on Natural Language Learning, 2006 Discriminative Kernel-Based Phoneme Sequence Recognition, Joseph Keshet, Shai Shalev-Shwartz, Samy Bengio, Yoram Singer and Dan Chazan, International Conference on Spoken Language Processing, 2006

158 157 Remarks If the input is inseparable, then the problem of finding a separating hyperplane that attains fewer than M errors is NP-hard (Open Hemisphere) Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as approximating the error of the Open Hemisphere problem to within a constant factor. Bound the number of mistakes the Perceptron makes in terms of the hinge loss of any competitor

159 158 Thank You

160 159 Definitions Any Competitor The parameter vector can be chosen using the input data The parameterized hinge loss of on True hinge loss 1-norm and 2-norm of hinge loss

161 160 Geometrical Assumption All examples are bounded in a ball of radius R

162 161 Perceptron’s Mistake Bound Bounds : If the sample is separable then
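The bound itself was a formula on the slide; the standard separable-case statement (Novikoff-style), assuming every example satisfies ||x_t|| ≤ R and some unit-norm competitor u achieves margin at least γ on all rounds:

```latex
% Number of Perceptron mistakes M on a separable sequence
\|x_t\| \le R, \qquad y_t\,(u \cdot x_t) \ge \gamma > 0, \qquad \|u\| = 1
\;\;\Longrightarrow\;\;
M \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```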

163 162 Proof - Intuition Two views : –The angle between and decreases with –The following sum is fixed as we make more mistakes, our solution is better [FS99, SS05]

164 163 Proof Define the potential : Bound its cumulative sum from above and below [C04]

165 164 Proof Bound from above : Telescopic Sum Non-Negative Zero Vector

166 165 Proof Bound From Below : –No error on t th round –Error on t th round

167 166 Proof We bound each term :

168 167 Proof Bound From Below : –No error on t th round –Error on t th round Cumulative bound :

169 168 Proof Putting both bounds together : We use first degree of freedom (and scale) : Bound :

170 169 Proof General Bound : Choose : Simple Bound : Objective of SVM

171 170 Proof Better bound : optimize the value of

172 171 Remarks Bound does not depend on dimension of the feature vector The bound holds for all sequences. It is not tight for most real world data But, there exists a setting for which it is tight

173 172 Three Bounds

174 173 Separable Case Assume there exists such that for all examples Then all bounds are equivalent The Perceptron makes a finite number of mistakes until convergence (not necessarily to )

175 174 Separable Case – Other Quantities Use 1 st (parameterization) degree of freedom Scale the such that Define The bound becomes

176 175 Separable Case - Illustration

177 176 Separable Case – Illustration The Perceptron will make more mistakes Finding a separating hyperplane is more difficult

178 177 Inseparable Case Difficult problem implies a large value of In this case the Perceptron will make a large number of mistakes

179 178 Aggressive Update Step Optimize for : Set the derivative to zero Substitute back into the Lagrangian : Dual optimization problem

180 179 Aggressive Update Step Dual Problem : Solve it : What about the constraint?

181 180 Three Decision Problems: Classification, Regression, Uniclass

182 181 The Passive-Aggressive Algorithm (Classification, Regression, Uniclass) Each example defines a set of consistent hypotheses: The new vector is set to be the projection of onto

183 182 Loss Bound Assume there exists such that Assume : Then : Note :

184 183 Proof Sketch Define: Upper bound: Lower bound: Lipschitz Condition

185 184 Proof Sketch (Cont.) Combining upper and lower bounds

186 185 Loss Bound for PA-I Mistake bound : Optimal value, set

187 186 Loss Bound for PA-II Loss bound : Similar proof technique as of PA Bound can be improved similarly to the Perceptron

188 187 Goals Provide methods to design, analyze and efficiently implement online learning algorithms Introduce new online algorithms –state-of-the-art performance in practice –interesting theoretical guarantees Complex-output problems: –binary classification –multi-class categorization –information extraction, parsing & speech recognition

