Published by Daisy Wilkey. Modified over 3 years ago.

1
**Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty**

Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou

University of Manchester

2
**Log-linear models in NLP**

- Maximum entropy models
  - Text classification (Nigam et al., 1999)
  - History-based approaches (Ratnaparkhi, 1998)
- Conditional random fields
  - Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction
  - Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.

3
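**Log-linear models**

Log-linear (a.k.a. maximum entropy) model:

$$p(y \mid x; \mathbf{w}) = \frac{1}{Z(x)} \exp \sum_i w_i f_i(y, x)$$

where $w_i$ is a weight, $f_i$ is a feature function, and the partition function is

$$Z(x) = \sum_{y} \exp \sum_i w_i f_i(y, x)$$

Training: maximize the conditional likelihood of the training data.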
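As a minimal sketch of the model on this slide, the following Python computes the conditional probability of each label as the exponentiated weighted feature sum divided by the partition function. The function names, the toy feature function, and the example weights are illustrative assumptions, not taken from the presentation.

```python
import math

def conditional_probs(weights, feature_fn, x, labels):
    """Return p(y | x; w) for every candidate label y."""
    # Score of each label: sum_i w_i * f_i(y, x)
    scores = [sum(w * f for w, f in zip(weights, feature_fn(x, y))) for y in labels]
    # Subtract the maximum score before exponentiating for numerical stability
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)  # partition function (up to the constant factor exp(m))
    return {y: e / z for y, e in zip(labels, exp_scores)}

def toy_features(x, y):
    """Hypothetical feature function with two indicator features."""
    return [
        1.0 if (x == "good" and y == "pos") else 0.0,
        1.0 if (x == "bad" and y == "neg") else 0.0,
    ]

probs = conditional_probs([2.0, 1.0], toy_features, "good", ["pos", "neg"])
```

The probabilities sum to one by construction, because every score is normalized by the same partition function.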

4
**Regularization**

To avoid overfitting to the training data, penalize the weights of the features.

L1 regularization:
- Most of the weights become zero
- Produces sparse (compact) models
- Saves memory and storage
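A small sketch of what an L1-regularized objective looks like, assuming the penalty is the sum of absolute weight values scaled by a strength parameter. The name `c` and the example numbers are assumptions made for illustration.

```python
def l1_regularized_objective(log_likelihood, weights, c):
    """Log-likelihood minus an L1 penalty of strength c (c is an assumed name)."""
    return log_likelihood - c * sum(abs(w) for w in weights)

def count_nonzero(weights):
    """Number of features that remain active in the model."""
    return sum(1 for w in weights if w != 0.0)

weights = [0.0, 1.5, 0.0, -0.5, 0.0]
objective = l1_regularized_objective(-10.0, weights, c=2.0)
active = count_nonzero(weights)
```

Only the non-zero weights need to be stored, which is where the memory and storage savings come from.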

5
**Training log-linear models**

Numerical optimization methods:
- Gradient descent (steepest descent or hill-climbing)
- Quasi-Newton methods (e.g. BFGS, OWL-QN)
- Stochastic Gradient Descent (SGD)
- etc.

Training can take several hours (or even days), depending on the complexity of the model, the size of training data, etc.

6
**Gradient Descent (Hill Climbing)**

[Figure: the objective function]

7
**Stochastic Gradient Descent (SGD)**

[Figure: the objective function]

Compute an approximate gradient using one training sample.
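To illustrate the difference, here is a minimal sketch that contrasts the gradient averaged over all samples with the estimate obtained from a single randomly chosen sample. The one-dimensional toy objective and all names are hypothetical and chosen only to keep the example short.

```python
import random

def grad_one(w, s):
    """Gradient of the toy per-sample objective -(w - s)**2 / 2 (hypothetical)."""
    return s - w

def full_gradient(w, samples):
    """Gradient averaged over every training sample (one full pass)."""
    return sum(grad_one(w, s) for s in samples) / len(samples)

def stochastic_gradient(w, samples, rng):
    """Approximate gradient computed from one randomly selected sample."""
    return grad_one(w, rng.choice(samples))

samples = [1.0, 2.0, 3.0]
rng = random.Random(0)
exact = full_gradient(0.0, samples)
estimate = stochastic_gradient(0.0, samples, rng)
```

The stochastic estimate is noisy but costs one sample instead of a full pass over the data.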

8
**Stochastic Gradient Descent (SGD)**

Weight update procedure:
- Very simple (similar to the Perceptron algorithm)
- The L1 penalty term is not differentiable at zero
- Each update is scaled by a learning rate

9
**Using subgradients**

Weight update procedure
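The update formula itself is not preserved in this transcript. A minimal sketch of a subgradient-based step, assuming the L1 term contributes `sign(w)` scaled by a per-sample penalty `c / n`, could look like the following. The names `eta`, `c`, and `n` are assumptions.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def naive_l1_sgd_step(weights, grads, eta, c, n):
    """One step: gradient move plus an L1 subgradient penalty on every weight."""
    return [w + eta * g - eta * (c / n) * sign(w) for w, g in zip(weights, grads)]

# A weight smaller than the penalty step overshoots zero instead of stopping there.
new_weights = naive_l1_sgd_step([0.005, -0.2, 0.0], [0.0, 0.0, 0.0], eta=0.1, c=1.0, n=10)
```

Note that the list comprehension touches every weight, including those whose features do not occur in the current sample.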

10
**Using subgradients**

Problems:
- L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
- Few weights become zero as a result of training.

11
**Clipping-at-zero approach**

- Carpenter (2008)
- Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
- Enables lazy update
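A minimal sketch of the clipping idea, assuming the update first takes the gradient step and then applies the L1 penalty without letting the weight cross zero. The parameter names `eta`, `c`, and `n` are assumptions.

```python
def clipped_l1_sgd_update(w, grad, eta, c, n):
    """Gradient step followed by an L1 penalty that is clipped at zero."""
    w_half = w + eta * grad   # weight after the gradient step only
    penalty = eta * c / n     # L1 penalty for this update
    if w_half > 0:
        return max(0.0, w_half - penalty)
    if w_half < 0:
        return min(0.0, w_half + penalty)
    return w_half

small = clipped_l1_sgd_update(0.005, 0.0, eta=0.1, c=1.0, n=10)
large = clipped_l1_sgd_update(-0.2, 0.0, eta=0.1, c=1.0, n=10)
```

Here the small weight lands exactly on zero rather than flipping sign, while the larger weight simply shrinks.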

12
**Clipping-at-zero approach**

13
**Named entity recognition**

Number of non-zero features:

| Method | Text chunking | Named entity recognition | Part-of-speech tagging |
| --- | --- | --- | --- |
| Quasi-Newton | 18,109 | 30,710 | 50,870 |
| SGD (Naive) | 455,651 | 1,032,962 | 2,142,130 |
| SGD (Clipping-at-zero) | 87,792 | 279,886 | 323,199 |

14
**Why it does not produce sparse models**

In SGD, weights are not updated smoothly:
- Fails to become zero!
- L1 penalty is wasted away

15
**Cumulative L1 penalty**

Two quantities are tracked:
- The absolute value of the total L1 penalty which should have been applied to each weight
- The total L1 penalty which has actually been applied to each weight

16
**Applying L1 with cumulative penalty**

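Penalize each weight according to the difference between the total L1 penalty which should have been applied and the total L1 penalty which has actually been applied.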

17
**Implementation**

10 lines of code!
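The code from this slide is not preserved in the transcript. The following is a sketch of how the cumulative penalty could be implemented, not the authors' exact listing. The names `u` (penalty each weight could have received so far) and `q` (signed penalty each weight has actually received) are assumed here, as are `grad_fn`, `eta_fn`, and the toy problem at the end.

```python
import random

def apply_penalty(w, q, i, u):
    """Clip weight i toward zero by the penalty it has not yet received."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z  # record the (signed) penalty actually applied

def train(samples, grad_fn, n_features, c, eta_fn, n_iters, seed=0):
    """SGD with cumulative L1 penalty; only features in the sample are touched."""
    rng = random.Random(seed)
    n = len(samples)
    w = [0.0] * n_features
    q = [0.0] * n_features
    u = 0.0
    for k in range(n_iters):
        eta = eta_fn(k)
        u += eta * c / n  # penalty every weight could have received so far
        sample = rng.choice(samples)
        for i, g in grad_fn(sample, w).items():
            w[i] += eta * g
            apply_penalty(w, q, i, u)
    return w

# Hypothetical toy problem: feature 0 is pulled toward 1.0, feature 1 is never used.
toy_grad = lambda sample, w: {0: 1.0 - w[0]}
w = train(["only"], toy_grad, n_features=2, c=0.1, eta_fn=lambda k: 0.1, n_iters=200)
```

Because `u` accumulates globally, a feature that goes unused for many updates receives its whole outstanding penalty the next time it appears, so the update stays lazy.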

18
**Experiments**

Model: Conditional Random Fields (CRFs)

Baseline: OWL-QN (Andrew and Gao, 2007)

Tasks:
- Text chunking (shallow parsing)
  - CoNLL 2000 shared task data
  - Recognize base syntactic phrases (e.g. NP, VP, PP)
- Named entity recognition
  - NLPBA 2004 shared task data
  - Recognize names of genes, proteins, etc.
- Part-of-speech (POS) tagging
  - WSJ corpus (sections 0-18 for training)

19
**CoNLL 2000 chunking task: objective**

20
**CoNLL 2000 chunking: non-zero features**

21
**CoNLL 2000 chunking**

Performance of the produced model:

| Method | Passes | Obj. | # Features | Time (sec) | F-score |
| --- | --- | --- | --- | --- | --- |
| OWL-QN | 160 | -1.583 | 18,109 | 598 | 93.62 |
| SGD (Naive) | 30 | -1.671 | 455,651 | 1,117 | 93.64 |
| SGD (Clipping + Lazy Update) | | | 87,792 | 144 | 93.65 |
| SGD (Cumulative) | | -1.653 | 28,189 | 149 | 93.68 |
| SGD (Cumulative + ED) | | -1.622 | 23,584 | 148 | 93.66 |

- Training is 4 times faster than OWL-QN
- The model is 4 times smaller than the clipping-at-zero approach
- The objective is also slightly better

22
**NLPBA 2004 named entity recognition**

| Method | Passes | Obj. | # Features | Time (sec) | F-score |
| --- | --- | --- | --- | --- | --- |
| OWL-QN | 160 | -2.448 | 30,710 | 2,253 | 71.76 |
| SGD (Naive) | 30 | -2.537 | 1,032,962 | 4,528 | 71.20 |
| SGD (Clipping + Lazy Update) | | -2.538 | 279,886 | 585 | |
| SGD (Cumulative) | | -2.479 | 31,986 | 631 | 71.40 |
| SGD (Cumulative + ED) | | -2.443 | 25,965 | | 71.63 |

**Part-of-speech tagging on WSJ**

| Method | Passes | Obj. | # Features | Time (sec) | Accuracy |
| --- | --- | --- | --- | --- | --- |
| OWL-QN | 124 | -1.941 | 50,870 | 5,623 | 97.16 |
| SGD (Naive) | 30 | -2.013 | 2,142,130 | 18,471 | 97.18 |
| SGD (Clipping + Lazy Update) | | | 323,199 | 1,680 | |
| SGD (Cumulative) | | -1.987 | 62,043 | 1,777 | 97.19 |
| SGD (Cumulative + ED) | | -1.954 | 51,857 | 1,774 | 97.17 |

23
**Discussions**

Convergence:
- Demonstrated empirically
- Penalties applied are not i.i.d.

Learning rate:
- The need for tuning can be annoying
- Rule of thumb: exponential decay (passes = 30, alpha = 0.85)
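A sketch of one way to read the exponential-decay rule of thumb, assuming the learning rate is multiplied by `alpha` once per pass over the training data. The functional form, the initial rate `eta0 = 1.0`, and the sample count `n = 1000` are assumptions, since the slide gives only the number of passes and alpha.

```python
def exponential_decay_rate(eta0, alpha, k, n):
    """Learning rate after k sample updates, with n training samples per pass."""
    return eta0 * alpha ** (k / n)

n = 1000  # hypothetical number of training samples
# Learning rate at the start of each pass, for passes 0 through 30
rates = [exponential_decay_rate(1.0, 0.85, p * n, n) for p in range(31)]
```

With alpha = 0.85 the rate falls by 15% per pass, so after 30 passes it is well under 1% of its initial value.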

24
**Conclusions**

Stochastic gradient descent training for L1-regularized log-linear models:
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
- 3 to 4 times faster than OWL-QN
- Extremely easy to implement
