Download presentation

Presentation is loading. Please wait.

Published byCeleste Hofford Modified over 2 years ago

2
1 Advanced Online Learning for Natural Language Processing Koby Crammer University of Pennsylvania

3
2 Thanks Axel Bernal Steve Caroll Ofer Dekel Mark Dredze Kuzman Ganchev Artemis Hatzigeorgiu Josheph Keshet Ryan McDonald Fernando Pereira Fei Sha Shai Shalev-Schwatz Yoram Singer Partha Pratim Talukdar

4
3 Tutorial Context SVMs Tutorial Multicass, Structured Optimization Theory Parsing Online Learning Information Extraction Speech Recognition Text Classification

5
4 Online Learning Tyrannosaurus rex

6
5 Online Learning Triceratops

7
6 Online Learning Tyrannosaurus rex Velocireptor

8
7 Journey Concepts Principles Algorithms Theory Binary Complex Structured Multi-Class Parsing Information Extraction Speech Recognition Document Classification Sentiment Classification

9
8 Formal Setting – Binary Classification Instances –Speech signal, Sentences Labels –Parse tree, Names Prediction rule –Linear predictions rules Loss –No. of mistakes

10
9 Predictions Discrete Predictions: –Hard to optimize Continuous predictions : –Label –Confidence

11
10 Loss Functions Natural Loss: –Zero-One loss: Real-valued-predictions loss: –Hinge loss: –Exponential loss (Boosting) –Log loss (Max Entropy, Boosting)

12
11 Loss Functions 1 1 Zero-One Loss Hinge Loss

13
12 Online Framework Initialize Classifier Algorithm works in rounds On round the online algorithm : – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Goal : –Suffer small cumulative loss

14
13 Any Features W.l.o.g. Binary Classifiers of the form Linear Classifiers Notation Abuse

15
14 Prediction : Confidence in prediction: Linear Classifiers (cntd.)

16
15 Margin of an example with respect to the classifier : Note : The set is separable iff there exists such that Margin

17
16 Geometrical Interpretation

18
17 Geometrical Interpretation

19
18 Geometrical Interpretation

20
19 Degree of Freedom - I The same geometrical hyperplane can be represented by many parameter vectors

21
20 Degree of Freedom - II Problem difficulty does not change if we shrink or expand the input space

22
21 Geometrical Interpretation Margin >0 Margin <<0 Margin <0 Margin >>0

23
22 Hinge Loss

24
23 Separable Set

25
24 Inseparable Sets

26
25 Why Online Learning? Fast Memory efficient - process one example at a time Simple to implement Formal guarantees – Mistake bounds Online to Batch conversions No statistical assumptions Adaptive Not as good as a well designed batch algorithms

27
26 Initialize Classifier Algorithm works in rounds On round the online algorithm : – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Online

28
27 Update Rules Online algorithms are based on an update rule which defines from (and possibly other information) Linear Classifiers : find from based on the input Some Update Rules : –P–Perceptron (Rosenblat) –A–ALMA (Gentile) –R–ROMMA (Li & Long) –N–NORMA (Kivinen et. al) –M–MIRA (Crammer & Singer) –E–EG (Littlestown and Warmuth) –B–Bregman Based (Warmuth) –…–…

29
28 Three Update Rules The Perceptron Algorithm : –Agmon 1954; Rosenblatt 1952-1962, Block 1962, Novikoff 1962, Minsky & Papert 1969, Freund & Schapire 1999, Blum & Dunagan 2002 Hildreth’s Algorithm : –Hildreth 1957 –Censor & Zenios 1997 –Herbster 2002 Loss Scaled : –Crammer & Singer 2001, 2002

30
29 The Perceptron Algorithm If No-Mistake –Do nothing If Mistake –Update Margin after update :

31
30 Geometrical Interpretation

32
31 Perceptron Algorithm Extremely easy to implement Relative loss bounds for separable and inseparable cases. Minimal assumptions (not iid) Easy to convert to a well-performing batch algorithm (under iid assumptions) Margin of examples is ignored by update Same update for separable case and inseparable case

33
32 Passive – Aggressive Approach The basis for a well-known algorithm in convex optimization due to Hildreth (1957) –Asymptotic analysis –Does not work in the inseparable case Three versions : –PAseparable case –PA-IPA-IIinseparable case Beyond classification –Regression, one class, structured learning Relative loss bounds

34
33 Motivation Perceptron: –No guaranties of margin after the update PA : –Enforce a minimal non-zero margin after the update In particular : –If the margin is large enough (1), then do nothing –If the margin is less then unit, update such that the margin after the update is enforced to be unit

35
34 Input Space

36
35 Input Space vs. Version Space Input Space : –Points are input data –One constraint is induced by weight vector –Primal space –Half space = all input examples that are classified correctly by a given predictor (weight vector) Version Space : –Points are weight vectors –One constraints is induced by input data –Dual space –Half space = all predictors (weight vectors) that classify correctly a given input example

37
36 Weight Vector (Version) Space The algorithm forces to reside in this region

38
37 Passive Step Nothing to do. already resides on the desired side.

39
38 Aggressive Step The algorithm projects on the desired half-space

40
39 Aggressive Update Step Set to be the solution of the following optimization problem : Can be solved analytically If the constraint is already satisfied: –trivial solution Otherwise: –It will be satisfied as equality

41
40 An Equality Constraint Linear update : Force the constraint to hold as equality Solve :

42
41 Passive-Aggressive Update

43
42 Perceptron vs. PA Common Update : Perceptron Passive-Aggressive

44
43 Perceptron vs. PA Margin Error N o - E r r o r, S m a l l M a r g i n No-Error, Large Margin

45
44 Perceptron vs. PA

46
45 Unrealizable case There is no weight vector that satisfy all the constraints

47
46 Unrealizable Case

48
47 Unrealizable Case

49
48 Four Algorithms PerceptronPAPA IPA II

50
49 Four Algorithms PerceptronPAPA IPA II

51
50 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Cumulative Loss Suffered by the Algorithm Sequence of Prediction Functions Cumulative Loss of Competitor

52
51 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Inequality Possibly Large Gap Regret Extra Loss Competitiveness Ratio

53
52 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Grows With T Constant Grows With T

54
53 For any competitor prediction function We bound the loss suffered by the algorithm with the loss suffered by Relative Loss Bound Best Prediction Function in hindsight for the data sequence

55
54 Summary Binary Classification : –Basic building block –Online learning, processes one example –Few additive algorithms Next: –Complex problems –Examples

56
55 Binary Classification 33 If it’s not one class, It most be the other class

57
56 Multi Class Single Label 3426 Elimination of a single class is not enough

58
57 Ordinal Ranking / Regression Machine Prediction Viewer’s Rating Order relation over labels Structure Over Possible Labels Snyder & Barzilay, 2007

59
58 Hierarchical Classification Phonetic transcription of DECEMBER Gross error Small errors d ix CH eh m bcl b er d AE s eh m bcl b er d ix s eh NASAL bcl b er [DKS’04]

60
59 Phonetic Hierarchy b g PHONEMES Sononorants Silences Obstruents Nasals Liquids Vowels Plosives Fricatives Front Center Back n m ng d k p t f v sh s th dh zh z l y w r Affricates jh ch oy ow uh uw aa ao er aw ay iy ih ey eh ae Structure Over Possible Labels [DKS’04]

61
60 Multi-Class Multi-Label ECONOMICS CORPORATE / INDUSTRIAL REGULATION / POLICY MARKETS LABOUR GOVERNMENT / SOCIAL LEGAL/JUDICIAL REGULATION/POLICY SHARE LISTINGS PERFORMANCE ACCOUNTS/EARNINGS COMMENT / FORECASTS SHARE CAPITAL BONDS / DEBT ISSUES LOANS / CREDITS STRATEGY / PLANS INSOLVENCY / LIQUIDITY Full topic rankingDocument The higher minimum wage signed into law… will be welcome relief for millions of workers …. The 90- cent-an-hour increase …. Relevant topics MARKETS LABOUR ECONOMICS REGULATION / POLICY CORPORATE / INDUSTRIAL GOVERNMENT / SOCIAL

62
61 Multi-Class Multi-Label Document The higher minimum wage signed into law… will be welcome relief for millions of workers …. The 90- cent-an-hour increase …. ECONOMICS INSOLVENCY / LIQUIDITY CORPORATE / INDUSTRIAL REGULATION / POLICY COMMENT / FORECASTS LABOUR LOANS / CREDITS LEGAL/JUDICIAL REGULATION/POLICY SHARE LISTINGS GOVERNMENT / SOCIAL PERFORMANCE ACCOUNTS/EARNINGS MARKETS SHARE CAPITAL BONDS / DEBT ISSUES STRATEGY / PLANS Full topic ranking Relevant topics MARKETS LABOUR ECONOMICS REGULATION / POLICY CORPORATE / INDUSTRIAL GOVERNMENT / SOCIAL RecallPrecisionAny Error?No. Errors? Non-trivial Evaluation Measures

63
62 Noun Phrase Chunking Estimated volume was a light 2.4 million ounces. Simultaneous Labeling

64
63 Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Named Entity Extraction Interactive Decisions

65
64 Sentence Compression The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from $8,000 for a single user to $90,000 for a multiuser project site. Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams. The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from $8,000 for a single user to $90,000 for a multiuser project site. Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams. [McDonald06] Complex Input – Output Relation

66
65 Dependency Parsing John hit the ball with the bat Non-trivial Output

67
66 Aligning Polyphonic Music Symbolic representation: Acoustic representation: Two ways for representing music [Shalev-Shwartz, Keshet, Singer 2004]

68
67 Symbolic Representation time pitch - pitch symbolic representation: - start-time [Shalev-Shwartz, Keshet, Singer 2004]

69
68 [Shalev-Shwartz, Keshet, Singer 2004] Acoustic Representation Feature Extraction (e.g. Spectral Analysis) acoustic representation: acoustic signal:

70
69 [Shalev-Shwartz, Keshet, Singer 2004] The Alignment Problem Setting time pitch actual start-time:

71
70 Challenges Elimination is not enough Structure over possible labels Non-trivial loss functions Complex input – output relation Non-trivial output

72
71 Challenges Interactive Decisions A wide range of features Computing an answer is relatively costly Fake News Show

73
72 Analysis as Labeling Label gives role for corresponding input "Many to one relation” Still, not general enough Model

74
73 Examples B I O B I I I I O Estimated volume was a light 2.4 million ounces. Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. B-PER I-PER O B-ORG O B-PER I-PER O O O O O O

75
74 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

76
75 Loss Functions Hamming Distance (Number of wrong decisions) Levenshtein Distance (Edit distance) –Speech Number of words with incorrect parent –Dependency parsing B I O B I I I I O Estimated volume was a light 2.4 million ounces. B O O O B I I O O

77
76 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

78
77 Multiclass Representation I k Prototypes

79
78 k Prototypes New instance Multiclass Representation I

80
79 Multiclass Representation I k Prototypes New instance Compute Class r 1-1.08 21.66 30.37 4-2.09

81
80 Multiclass Representation I k Prototypes New instance Compute Prediction: The class achieving the highest Score Class r 1-1.08 21.66 30.37 4-2.09

82
81 Map all input and labels into a joint vector space Score labels by projecting the corresponding feature vector Multiclass Representation II Estimated volume was a light 2.4 million ounces. F = (0 1 1 0 … ) B I O B I I I I O

83
82 Multiclass Representation II Predict label with highest score (Inference) Naïve search is expensive if the set of possible labels is large –N–No. of labelings = 3 No. of words B I O B I I I I O Estimated volume was a light 2.4 million ounces.

84
83 Features based on local domains Efficient Viterbi decoding for sequences Multiclass Representation II Estimated volume was a light 2.4 million ounces. F = (0 1 1 0 … ) B I O B I I I I O

85
84 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev

86
85 Multiclass Representation II Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

87
86 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev Worst Labeling

88
87 Multiclass Representation II Correct Labeling Almost Correct Labeling After Shalev Worst Labeling

89
88 Two Representations Weight-vector per class (Representation I) –Intuitive –Improved algorithms Single weight-vector (Representation II) –Generalizes representation I –Allows complex interactions between input and output 000x0 F(x,4)=

90
89 Why linear models? Combine the best of generative and classification models: –Trade off labeling decisions at different positions –Allow overlapping features Modular –factored scoring –loss function –From features to kernels

91
90 Outline of Solution A quantitative evaluation of all predictions –Loss Function (application dependent) Models class – set of all labeling functions –Generalize linear classification –Representation Learning algorithm –Extension of Perceptron –Extension of Passive-Aggressive

92
91 Multiclass Multilabel Perceptron Single Round Get a new instance Predict ranking Get feedback Compute loss If update weight-vectors

93
92 Construct Error-Set Form any set of parameters that satisfies: If then Multiclass Multilabel Perceptron Update (1)

94
93 Multiclass Multilabel Perceptron Update (1)

95
94 Multiclass Multilabel Perceptron Update (1) 2 3 145

96
95 Multiclass Multilabel Perceptron Update (1) 2 3 145

97
96 Multiclass Multilabel Perceptron Update (1) 000 0 2 3 145

98
97 Multiclass Multilabel Perceptron Update (1) 000 1-a0a 2 3 145

99
98 Set for Update : Multiclass Multilabel Perceptron Update (2)

100
99 Multiclass Multilabel Perceptron Update (2) 000 1-a0a 2 3 145

101
100 Multiclass Multilabel Perceptron Update (2) 000 1-a0a 2 3 145

102
101 Uniform Update 000 1/20 2 3 145

103
102 Max Update 000 100 2 3 145 Sparse Performance is worse than Uniform’s

104
103 Update Results BeforeUniform UpdateMax Update

105
104 Binary : Multi Class : Margin for Multi Class Prediction Error Margin Error

106
105 Binary : Multi Class : Margin for Multi Class

107
106 Multi Class : Margin for Multi Class But not all mistakes are equal? How do you know? Because the loss function is not constant !

108
107 Multi Class : Margin for Multi Class So, use it !

109
108 Linear Structure Models Correct Labeling Almost Correct Labeling After Shalev

110
109 Linear Structure Models Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

111
110 Linear Structure Models Correct Labeling Almost Correct Labeling Worst Labeling Incorrect Labeling After Shalev

112
111 PA Multi Class Update

113
112 PA Multi Class Update Project the current weight vector such that the instance ranking is consistent with loss function Set to be the solution of the following optimization problem :

114
113 PA Multi Class Update Problem –intersection of constraints may be empty Solutions –Does not occur in practice –Add a slack variable –Remove constraints

115
114 Add a Slack Variable Add a slack variable : Rewrite the optimization : Generalized Hinge Loss

116
115 Add a Slack Variable We like to solve : May have exponential number of constraints, thus intractable –If the loss can be factored then there is a polynomial equivalent set of constraints (Taskar et.al 2003) –Remove constraints (the other solution for the empty- set problem) B I O B I I I I O Estimated volume was a light 2.4 million ounces.

117
116 PA Multi Class Update Remove constraints : How to choose the single competing labeling ? –The labeling that attains the highest score! –… which is the predicted label according to the current model

118
117 Initialize For – Receive an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Update the prediction rule PA Multiclass online algorithm

119
118 Advantages Process one training instance at a time Very simple Predictable runtime, small memory Adaptable to different loss functions Requires : –Inference procedure –Loss function –Features

120
119 Batch Setting Often we are given two sets : –Training set used to find a good model –Test set for evaluation Enumerate over the training set –Possibly more than once –Fix the weight vector Evaluate on Test Set Formal guaranties if training set and test set are i.i.d from fixed (unknown) distribution

121
120 Two Improvements Averaging –Instead of using final weight-vector use a combination of all weight-vector obtained during training time Top – k update –Instead of using only the labeling with highest score, the k labelings with highest score

122
121 Initialize For – Receive an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Update the prediction rule –Update the averaged rule Averaging MIRA

123
122 Top-k Update Recall : –Inference : –Update

124
123 Top-K Update Recall : –Inference –Top-k Inference

125
124 Top-K Update Top-k Inference : Update :

126
125 Previous Approaches Focus on sequences Similar constructions for other structures Mainly batch algorithms

127
126 Previous Approaches Generative models: probabilistic generators of sequence-structure pairs –Hidden Markov Models (HMMs) – Probabilistic CFGs Sequential classification: decompose structure assignment into a sequence of structural decisions Conditional models : probabilistic model of labels given input –Conditional Random Fields [LMR 2001]

128
127 Previous Approaches Re-ranking : Combine generative and discriminative models –Full parsing [Collins 2000] Max Margin Makrov Networks [TGK 2003] –Use all competing labels –Elegant factorization –Closely related to PA with all the constraints

129
128 HMMs y1y1 y2y2 y3y3 x 1a x 1b x 1c x 2a x 2b x 2c x 3a x 3b x 3c

130
129 HMM

131
130 HMMs Solves a more general problem then required. Models the joint probability. Hard to model overlapping features, yet application needs richer input representation. –E.g. word identity, capitalization, ends in “-tion”, word in word list, word font, white space ratio, begins with number, word font ends with “?” Relax conditional independence of features on labels intractability

132
131 Conditional Random Fields Define distributions over labels Maximize the log-likelihood of data Dynamic programming for expectations (forward-backward algorithm) Standard convex optimization (L-BFGS) John Lafferty, Andrew McCallum, Fernando Pereira 2001

133
132 Local Classifiers Train local classifiers. –E.g. Combine results in test time Cheap to train Can not model well long distance interactions –MEMMs, local classifiers –Combine classifiers at test time –Problem: Can not trade-off decisions at different locations (label- bias problem) B ? O B I I I I O Estimated volume was a light 2.4 million ounces. MEMM Dan Roth

134
133 Re-Ranking Use a generative model to reduce exponential number of labelings into a polynomial number of candidates –Local features Use the Perceptron algorithm to re-rank the list –Global features Great results! Michael Collins 2000

135
134 Empirical Evaluation Category Ranking / Multiclass multilabel Noun Phrase Chunking Named entities Dependency Parsing Genefinder Phoneme Alignment Non-projective Parsing

136
135 Experimental Setup Algorithms : –Rocchio, normalized prototypes –Perceptron, one per topic –Multiclass Perceptron - IsErr, ErrorSetSize Features –About 100 terms/words per category Online to Batch : –Cycle once through training set –Apply resulting weight-vector to the test set

137
136 Data Sets Reuters21578 Training Examples 8,631 Test Examples 2,158 Topics 90 1.24 No. Features 3,468

138
137 Data Sets Reuters21578Reuters2000 Training Examples 8,631521,439 Test Examples 2,158287,944 Topics 90102 1.243.20 No. Features 3,4689,325

139
138 Training Online Results Reuters 21578Reuters 2000 Round Number Average Cumulative IsErr IsErr Perceptron MP IsErr MP ErrSetSize

140
139 Training Online Results Reuters 21578Reuters 2000 Round Number Average Cumulative AvgP Average Precision Perceptron MP lsErr MP ErrSetSize

141
140 Test Results IsErrErrorSetSize IsErr and ErrorSetSize Perceptron MP IsErr MP ErrSetSize Rocciho IsErr R21578R2000 ErrSetSize R21578R2000

142
141 Sequence Text Analysis Features : –Meaningful word features –POS features –Unigram and bi-gram NER/NP features Inference : –Dynamic programming –Linear in length, quadratic in number of classes Estimated volume was a light 2.4 million ounces. F = (0 1 1 0 … ) B I O B I I I I O

143
142 Noun Phrase Chunking Estimated volume was a light 2.4 million ounces. 0.941Avg. Perceptron 0.942CRF 0.943MIRA McDonald, Crammer, Pereira

144
143 Noun Phrase Chunking Training time in CPU minutes Performance on test data McDonald, Crammer, Pereira

145
144 0.823Avg. Perceptron 0.830CRF 0.831MIRA McDonald, Crammer, Pereira Named Entity Extraction Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes.

146
145 Named Entity Extraction Training time in CPU minutes Performance on test data McDonald, Crammer, Pereira

147
146 Dependency parsing Features : –Anything over words –single edges Inference : –Search over possible trees in cubic time (Eisner, Satta)

148
147 Dependency Parsing 90.3 Yamada & Matsumoto, 2003 87.3 Nivre & Scholz, 2004 90.6Avg. Perc. 90.9MIRA 82.9Avg. Perc. 83.2MIRA English Czech McDonald, Crammer, Pereira

149
148 Gene Finding

150
149 Gene Finding Semi-Markov Models (features includes segments length information) [ Sarawagi & Cohen, 2004 ] Decoding quadratic in length of sequence Bernal, Crammer, Hatzigeorgiu, Pereira specificity = FP / (FP + TN) sensitivity = FN / (TP + FN)

151
150 Dependency Parsing Complications: –High order features and multiply parents per word –Non-projective trees Approximate inference McDonald & Pereira,2006 83.01 – proj 84.22 – proj 84.11 – non-proj 85.22 – non-proj Czech

152
151 Phoneme Alignment Input: Acoustic signal + true phonemes Output : Segmentation of the signal t<40t<30t<20t<10 98.196.292.179.2Discriminative 97.194.488.975.3Brugnara et al Keshet, Shalev-Shwartz, Singer, Chazan

153
152 Summary Online training for complex decisions Simple to implement, fast to train Modular : –Loss function –Feature Engineering –Inference procedure (exact, approximated) Works well in practice Theoretically analyzed Find closest model that classify well current instance

154
153 Uncovered Topics Kernels Multiplicative updates Bregman divergences and projections Theory of online-to-batch Matrix representations and updates

155
154 Partial Bibliography Prediction, learning, and games. Nicolò Cesa-Bianchi and Gábor Lugosi Y. Censor & S.A. Zenios, “Parallel Optimization”, Oxford UP, 1997 Y. Freund & R. Schapire, “Large margin classification using the Perceptron algorithm”, MLJ, 1999. M. Herbster, “Learning additive models online with fast evaluating kernels”, COLT 2001 J. Kivinen, A. Smola, and R.C. Williamson, “Online learning with kernels”, IEEE Trans. on SP, 2004 H.H. Bauschke & J.M. Borwein, “On Projection Algorithms for Solving Convex Feasibility Problems”, SIAM Review, 1996

156
155 Applications Online Passive Aggressive Algorithms, CDSS’03 + CDKSS’05 Online Ranking by Projecting, CS’05 Large Margin Hierarchical Classification, DKS’04 Online Learning of Approximate Dependency Parsing Algorithms. R. McDonald and F. Pereira European Association for Computational Linguistics, 2006 Discriminative Sentence Compression with Soft Syntactic Constraints.R. McDonald. European Association for Computational Linguistics, 2006 Non-Projective Dependency Parsing using Spanning Tree Algorithms. R. McDonald, F. Pereira, K. Ribarov and J. Hajic HLT- EMNLP, 2005 Flexible Text Segmentation with Structured Multilabel Classification. R. McDonald, K. Crammer and F. Pereira HLT-EMNLP, 2005

157
156 Applications Online and Batch Learning of Pseudo-metrics, SSN’04 Learning to Align Polyphonic Music, SKS’04 The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees, DSS’04 First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Michael Wick, Robert Hall and Andrew McCallum. NAACL/HLT, 2007 Structured Models for Fine-to-Coarse Sentiment Analysis R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar Association for Computational Linguistics, 2007 Multilingual Dependency Parsing with a Two-Stage Discriminative Parser.R. McDonald and K. Lerman and F. Pereira Conference on Natural Language Learning, 2006 Discriminative Kernel-Based Phoneme Sequence Recognition, Joseph Keshet, Shai Shalev-Shwartz, Samy Begio, Yoram Singer and Dan Chazan, International Conference on Spoken Language Processing, 2006

158
157 Remarks If the input is inseparable, then the problem of finding a separating hyperplane which attains less then M errors is NP-hard (Open hemisphere) Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant approximating error for the Open Hemisphere problem. Bound of the number of mistakes the perceptron makes with the hinge loss of any competitor

159
158 Thank You

160
159 Definitions Any Competitor The parameters vector can be chosen using the input data The parameterized hinge loss of on True hinge loss 1-norm and 2-norm of hinge loss

161
160 Geometrical Assumption All examples are bounded in a ball of radius R

162
161 Perceptron’s Mistake Bound Bounds : If the sample is separable then

163
162 Proof - Intuition Two views : –The angle between and decreases with –The following sum is fixed as we make more mistakes, our solution is better [FS99, SS05]

164
163 Proof Define the potential : Bound it’s cumulative sum from above and below [C04]

165
164 Proof Bound from above : Telescopic Sum Non-Negative Zero Vector

166
165 Proof Bound From Below : –No error on t th round –Error on t th round

167
166 Proof We bound each term :

168
167 Proof Bound From Below : –No error on t th round –Error on t th round Cumulative bound :

169
168 Proof Putting both bounds together : We use first degree of freedom (and scale) : Bound :

170
169 Proof General Bound : Choose : Simple Bound : Objective of SVM

171
170 Proof Better bound : optimize the value of

172
171 Remarks Bound does not depend on dimension of the feature vector The bound holds for all sequences. It is not tight for most real world data But, there exists a setting for which it is tight

173
172 Three Bounds

174
173 Separable Case Assume there exists such that for all examples Then all bounds are equivalent Perceptron makes finite number of mistakes until convergence (not necessarily to )

175
174 Separable Case – Other Quantities Use 1 st (parameterization) degree of freedom Scale the such that Define The bound becomes

176
175 Separable Case - Illustration

177
176 separable Case – Illustration The Perceptron will make more mistakes Finding a separating hyperplance is more difficult

178
177 Inseparable Case Difficult problem implies a large value of In this case the Perceptron will make a large number of mistakes

179
178 Aggressive Update Step Optimize for : Set the derivative to zero Substitute back into the Lagrangian : Dual optimization problem

180
179 Aggressive Update Step Dual Problem : Solve it : What about the constraint?

181
180 Three Decision Problems ClassificationRegressionUniclass

182
181 Each example defines a set of consistent hypotheses: The new vector is set to be the projection of onto The Passive-Aggressive Algorithm ClassificationRegressionUniclass

183
182 Loss Bound Assume there exists such that Assume : Then : Note :

184
183 Proof Sketch Define: Upper bound: Lower bound: Lipschitz Condition

185
184 Proof Sketch (Cont.) Combining upper and lower bounds

186
185 Loss Bound for PA-I Mistake bound : Optimal value, set

187
186 Loss Bound for PA-II Loss bound : Similar proof technique as of PA Bound can be improved similarly to the Perceptron

188
187 Goals Provide methods to design, analyze and implement efficiently online learning algorithms Introduce new online algorithms –state-of-the-art performance in practice –interesting theoretical guarantees Complex-output problems: –binary classification –multi- class categorization –information extraction, parsing & speech recognition

Similar presentations

OK

Online Learning by Projecting: From Theory to Large Scale Web-spam filtering Yoram Singer Koby Crammer (Upenn), Ofer Dekel (Google/HUJI), Vineet Gupta.

Online Learning by Projecting: From Theory to Large Scale Web-spam filtering Yoram Singer Koby Crammer (Upenn), Ofer Dekel (Google/HUJI), Vineet Gupta.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google