Published by Daisy Wilkey. Modified over 2 years ago.

1
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Junichi Tsujii, and Sophia Ananiadou
University of Manchester

2
Log-linear models in NLP
Maximum entropy models
– Text classification (Nigam et al., 1999)
– History-based approaches (Ratnaparkhi, 1998)
Conditional random fields
– Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
Structured prediction
– Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.

3
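Log-linear models
Log-linear (a.k.a. maximum entropy) model:
$$p(y \mid x; \mathbf{w}) = \frac{1}{Z(x)} \exp \sum_i w_i f_i(y, x)$$
where $f_i$ is a feature function and $w_i$ is its weight.
Partition function:
$$Z(x) = \sum_{y} \exp \sum_i w_i f_i(y, x)$$
Training
– Maximize the conditional likelihood of the training data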
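A minimal sketch of how such a model assigns probabilities, assuming a small multiclass setting in which every candidate label has its own dense feature vector. The function name `log_linear_probs` and the data layout are hypothetical and chosen only for illustration.

```python
import math

def log_linear_probs(weights, feature_vectors):
    """Return p(y | x) for every candidate label y.

    feature_vectors[y] holds the feature values f_i(y, x) for label y
    (an assumed layout, not taken from the slides).
    """
    # Unnormalized log-scores: sum_i w_i * f_i(y, x)
    scores = [sum(w * f for w, f in zip(weights, feats))
              for feats in feature_vectors]
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # partition function, up to the constant factor exp(m)
    return [e / z for e in exps]

# Two labels, two features; label 0 fires feature 0, label 1 fires feature 1.
probs = log_linear_probs([0.5, -1.0], [[1.0, 0.0], [0.0, 1.0]])
```

The probabilities sum to one by construction, and the label whose active feature has the larger weight gets the larger share.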

4
Regularization
To avoid overfitting to the training data
– Penalize the weights of the features
L1 regularization
– Most of the weights become zero
– Produces sparse (compact) models
– Saves memory and storage

5
Training log-linear models
Numerical optimization methods
– Gradient descent (steepest descent or hill-climbing)
– Quasi-Newton methods (e.g. BFGS, OWL-QN)
– Stochastic Gradient Descent (SGD)
– etc.
Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.

6
Gradient Descent (Hill Climbing)
Figure: objective

7
Stochastic Gradient Descent (SGD)
Figure: objective
Compute an approximate gradient using one training sample
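A rough sketch of the per-sample gradient, assuming a plain multiclass log-linear model with one dense feature vector per candidate label. For such a model the gradient of the sample log-likelihood is the observed feature vector minus the expected feature vector under the current model. The function name `sample_gradient` and its arguments are hypothetical.

```python
import math

def sample_gradient(weights, feature_vectors, gold):
    """Gradient of log p(gold | x) with respect to the weights, for one sample."""
    scores = [sum(w * f for w, f in zip(weights, feats))
              for feats in feature_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = []
    for i in range(len(weights)):
        # Expected value of feature i under the current model distribution.
        expected = sum(p * feats[i] for p, feats in zip(probs, feature_vectors))
        # Observed value of feature i for the correct label, minus the expectation.
        grad.append(feature_vectors[gold][i] - expected)
    return grad

# With all-zero weights both labels are equally likely.
grad = sample_gradient([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gold=0)
```

Only features that are active for some candidate label of this sample get a non-zero gradient, which is why a single SGD update touches few weights.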

8
Stochastic Gradient Descent (SGD)
Weight update procedure
– Very simple (similar to the Perceptron algorithm)
– The L1 penalty term is not differentiable at zero
– η_k: learning rate
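As a sketch of the kind of update this slide refers to, assume the objective for sample j is its log-likelihood L(j, w) minus an L1 penalty of strength C shared across N training samples. The symbols C, N, and L(j, w) are assumed notation here, not taken from the slide text.

```latex
\mathbf{w}^{k+1} = \mathbf{w}^{k} + \eta_k
  \left. \frac{\partial}{\partial \mathbf{w}}
  \left( L(j, \mathbf{w}) - \frac{C}{N} \sum_i |w_i| \right)
  \right|_{\mathbf{w} = \mathbf{w}^{k}}
```

Under these assumptions the term with |w_i| is the problematic one: it has no derivative at w_i = 0, so the update cannot be applied as written.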

9
Using subgradients
Weight update procedure
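A minimal sketch of a subgradient-style update, assuming the L1 term is handled by using sign(w_i) as its subgradient. The names `subgradient_step`, `eta`, `c`, and `n` are hypothetical (learning rate, regularization strength, and number of training samples).

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def subgradient_step(weights, grad, eta, c, n):
    """One naive SGD step: follow the sample gradient, then subtract
    eta * c / n * sign(w) from every weight as the L1 subgradient."""
    penalty = eta * c / n
    return [w + eta * g - penalty * sign(w) for w, g in zip(weights, grad)]

# Zero gradient, so only the L1 term acts; the penalty per step is 0.1.
new_w = subgradient_step([0.05, -0.2, 0.0], [0.0, 0.0, 0.0],
                         eta=0.1, c=1.0, n=1)
```

In this example the first weight jumps from 0.05 to about -0.05 instead of stopping at zero, and the loop runs over every weight even though the gradient is zero everywhere.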

10
Using subgradients
Problems
– L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
– Few weights become zero as a result of training.

11
Clipping-at-zero approach
Carpenter (2008)
Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
Enables lazy update

12
Clipping-at-zero approach
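A sketch of the clipping idea for a single weight, assuming the update is split into a gradient step followed by an L1 step that is not allowed to cross zero. The function name `clipped_step` and the parameters `eta`, `c`, and `n` are hypothetical.

```python
def clipped_step(w, g, eta, c, n):
    """Gradient step on one weight, then an L1 penalty clipped at zero."""
    penalty = eta * c / n
    half = w + eta * g  # weight after the gradient step, before the penalty
    if half > 0:
        # Pull a positive weight toward zero, but never past it.
        return max(0.0, half - penalty)
    if half < 0:
        # Pull a negative weight toward zero, but never past it.
        return min(0.0, half + penalty)
    return half

small = clipped_step(0.05, 0.0, eta=0.1, c=1.0, n=1)   # clipped to exactly 0.0
large = clipped_step(-0.2, 0.0, eta=0.1, c=1.0, n=1)   # shrinks toward zero
```

Because a weight with a zero gradient only ever shrinks, its penalty can be postponed until the feature is next used, which is what makes a lazy update possible.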

13
Number of non-zero features

                         Text chunking   Named entity recognition   Part-of-speech tagging
Quasi-Newton                    18,109                     30,710                   50,870
SGD (Naive)                    455,651                  1,032,962                2,142,130
SGD (Clipping-at-zero)          87,792                    279,886                  323,199

14
Why it does not produce sparse models
In SGD, weights are not updated smoothly
– A weight fails to become zero!
– The L1 penalty is wasted away

15
Cumulative L1 penalty
– u_k: the absolute value of the total L1 penalty which should have been applied to each weight
– q_i^k: the total L1 penalty which has actually been applied to each weight

16
Applying L1 with cumulative penalty
Penalize each weight according to the difference between u_k and q_i^{k-1}
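One way to write this down, assuming the notation u_k for the penalty each weight could have received up to update k, q_i^k for the penalty weight i has actually received, C for the regularization strength, N for the number of training samples, and L(j, w) for the log-likelihood of sample j (all assumed notation):

```latex
\begin{aligned}
u_k &= \frac{C}{N} \sum_{t=1}^{k} \eta_t \\
w_i^{k+\frac{1}{2}} &= w_i^{k} + \eta_k
  \left. \frac{\partial L(j, \mathbf{w})}{\partial w_i} \right|_{\mathbf{w} = \mathbf{w}^{k}} \\
w_i^{k+1} &=
\begin{cases}
\max\left(0,\; w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1})\right) & \text{if } w_i^{k+\frac{1}{2}} > 0 \\
\min\left(0,\; w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1})\right) & \text{if } w_i^{k+\frac{1}{2}} < 0
\end{cases} \\
q_i^{k} &= \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right)
\end{aligned}
```

With this sign convention q_i is negative for a weight that has been pulled down from the positive side, so u_k + q_i^{k-1} is the penalty still owed to that weight. A weight that sits exactly at zero after the gradient step is left unchanged.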

17
Implementation
10 lines of code!
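The code shown on this slide is not part of the transcript. The following is an illustrative Python sketch of a cumulative-penalty update, not the authors' implementation. The class name `CumulativeL1`, the sparse gradient dictionary, and the parameters `c` and `n` (regularization strength and number of training samples) are all assumptions.

```python
class CumulativeL1:
    """SGD weights with an L1 penalty tracked cumulatively per feature."""

    def __init__(self, num_features, c, n):
        self.w = [0.0] * num_features  # weights
        self.q = [0.0] * num_features  # penalty actually applied to each weight
        self.u = 0.0                   # penalty each weight could have received
        self.c = c
        self.n = n

    def apply_penalty(self, i):
        z = self.w[i]
        if self.w[i] > 0:
            self.w[i] = max(0.0, self.w[i] - (self.u + self.q[i]))
        elif self.w[i] < 0:
            self.w[i] = min(0.0, self.w[i] + (self.u - self.q[i]))
        # Record how much penalty this weight received in this step.
        self.q[i] += self.w[i] - z

    def update(self, sparse_grad, eta):
        """One SGD step; sparse_grad maps feature index to gradient value."""
        self.u += eta * self.c / self.n
        for i, g in sparse_grad.items():
            self.w[i] += eta * g
            self.apply_penalty(i)

model = CumulativeL1(num_features=2, c=1.0, n=1)
model.update({0: 1.0}, eta=0.1)   # feature 0 is pushed up, then clipped to zero
model.update({0: 0.5}, eta=0.1)   # a weaker push is again absorbed by the penalty
model.update({1: 10.0}, eta=0.1)  # feature 1 pays its accumulated penalty at once
```

Only features present in the current sample are touched, yet feature 1 ends up having paid the full accumulated penalty `u` the first time it is used.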

18
Experiments
Model: Conditional Random Fields (CRFs)
Baseline: OWL-QN (Andrew and Gao, 2007)
Tasks
– Text chunking (shallow parsing): CoNLL 2000 shared task data; recognize base syntactic phrases (e.g. NP, VP, PP)
– Named entity recognition: NLPBA 2004 shared task data; recognize names of genes, proteins, etc.
– Part-of-speech (POS) tagging: WSJ corpus (sections 0-18 for training)

19
CoNLL 2000 chunking task: objective

20
CoNLL 2000 chunking: non-zero features

21
CoNLL 2000 chunking
Performance of the produced model

Method                         Passes  Obj.  # Features  Time (sec)  F-score
OWL-QN                         …       …     …           …           …
SGD (Naive)                    …       …     …,651       1,…         …
SGD (Clipping + Lazy Update)   …       …     …           …           …
SGD (Cumulative)               …       …     …           …           …
SGD (Cumulative + ED)          …       …     …           …           …

– Training is 4 times faster than OWL-QN
– The model is 4 times smaller than the clipping-at-zero approach
– The objective is also slightly better

22
NLPBA 2004 named entity recognition

Method                         Passes  Obj.  # Features  Time (sec)  F-score
OWL-QN                         …       …     …,710       2,…         …
SGD (Naive)                    …       …     …,032,962   4,…         …
SGD (Clipping + Lazy Update)   …       …     …           …           …
SGD (Cumulative)               …       …     …           …           …
SGD (Cumulative + ED)          …       …     …           …           …

Part-of-speech tagging on WSJ

Method                         Passes  Obj.  # Features  Time (sec)  Accuracy
OWL-QN                         …       …     …,870       5,…         …
SGD (Naive)                    …       …     …,142,130   18,…        …
SGD (Clipping + Lazy Update)   …       …     …,199       1,…         …
SGD (Cumulative)               …       …     …,043       1,…         …
SGD (Cumulative + ED)          …       …     …,857       1,…         …

23
Discussions
Convergence
– Demonstrated empirically
– Penalties applied are not i.i.d.
Learning rate
– The need for tuning can be annoying
– Rule of thumb: Exponential decay (passes = 30, alpha = 0.85)
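A small sketch of an exponentially decaying learning rate, assuming the schedule has the form eta0 * alpha ** (k / n), so that the rate is multiplied by alpha once per pass over n samples. The initial rate `eta0`, the exact form of the schedule, and the function name `exponential_decay` are assumptions. Only alpha = 0.85 and 30 passes come from the slide.

```python
def exponential_decay(eta0, alpha, k, n):
    """Learning rate after k updates on a training set of n samples."""
    return eta0 * alpha ** (k / n)

n = 100                                            # hypothetical training set size
start = exponential_decay(1.0, 0.85, 0, n)         # rate at the first update
one_pass = exponential_decay(1.0, 0.85, n, n)      # rate after one full pass
final = exponential_decay(1.0, 0.85, 30 * n, n)    # rate after 30 passes
```

Under this form the rate shrinks smoothly within each pass rather than dropping in steps at pass boundaries.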

24
Conclusions
Stochastic gradient descent training for L1-regularized log-linear models
– Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
3 to 4 times faster than OWL-QN
Extremely easy to implement
