1 Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou (University of Manchester)
2 Log-linear models in NLP
- Maximum entropy models: text classification (Nigam et al., 1999), history-based approaches (Ratnaparkhi, 1998)
- Conditional random fields: part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction: parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.
3 Log-linear models
Log-linear (a.k.a. maximum entropy) model:
  p(y|x) = (1 / Z(x)) exp( sum_i w_i f_i(x, y) )
where w_i is a weight, f_i is a feature function, and Z(x) is the partition function:
  Z(x) = sum_y exp( sum_i w_i f_i(x, y) )
Training: maximize the conditional likelihood of the training data.
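The model above can be sketched numerically. This is a minimal toy illustration (the labels, feature functions, and weights are all assumptions for the example, not from the talk):

```python
import math

def conditional_prob(x, y, labels, weights, feature_funcs):
    """p(y|x) = exp(sum_i w_i f_i(x, y)) / Z(x)."""
    def score(label):
        return sum(w * f(x, label) for w, f in zip(weights, feature_funcs))
    # Partition function Z(x): sum of exponentiated scores over all labels
    z = sum(math.exp(score(label)) for label in labels)
    return math.exp(score(y)) / z

# Toy binary classifier: two labels, two indicator features
labels = ["POS", "NEG"]
feature_funcs = [
    lambda x, y: 1.0 if ("good" in x and y == "POS") else 0.0,
    lambda x, y: 1.0 if ("bad" in x and y == "NEG") else 0.0,
]
weights = [1.5, 2.0]

p = conditional_prob("a good movie", "POS", labels, weights, feature_funcs)
```

Only the first feature fires here, so p(POS | "a good movie") = e^1.5 / (e^1.5 + e^0).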
4 Regularization
- To avoid overfitting to the training data, penalize the weights of the features
- L1 regularization: most of the weights become zero
- Produces sparse (compact) models, which saves memory and storage
5 Training log-linear models
Numerical optimization methods:
- Gradient descent (steepest descent or hill climbing)
- Quasi-Newton methods (e.g. BFGS, OWL-QN)
- Stochastic gradient descent (SGD)
- etc.
Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.
13 Number of non-zero features

Text chunking:
  Quasi-Newton            18,109
  SGD (Naive)             455,651
  SGD (Clipping-at-zero)  87,792

Named entity recognition:
  Quasi-Newton            30,710
  SGD (Naive)             1,032,962
  SGD (Clipping-at-zero)  279,886

Part-of-speech tagging:
  Quasi-Newton            50,870
  SGD (Naive)             2,142,130
  SGD (Clipping-at-zero)  323,199
14 Why it does not produce sparse models
In SGD, weights are not updated smoothly: a weight that should be zero keeps fluctuating around zero and fails to become exactly zero, so the L1 penalty is wasted away.
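The problem can be illustrated with a sketch of a naive SGD-with-L1 step, where the L1 subgradient is applied directly to a noisy per-sample update (all values here are illustrative, not from the experiments):

```python
def naive_l1_step(w, grad, eta, C, N):
    # Apply the L1 subgradient C/N * sign(w) directly alongside the
    # (noisy) per-sample gradient of the log-likelihood.
    sign = 1.0 if w > 0 else -1.0 if w < 0 else 0.0
    return w + eta * (grad - (C / N) * sign)

w = 0.05
history = []
for grad in [0.02, -0.01, 0.03, -0.02]:  # noisy per-sample gradients
    w = naive_l1_step(w, grad, eta=0.5, C=1.0, N=10)
    history.append(w)
# The weight hops back and forth across zero but never lands exactly on it.
```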
15 Cumulative L1 penalty
- The absolute value of the total L1 penalty that should have been applied to each weight
- The total L1 penalty that has actually been applied to each weight
16 Applying L1 with cumulative penalty
Penalize each weight according to the difference between the total L1 penalty it should have received and the total penalty actually applied to it.
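A minimal sketch of the cumulative-penalty update described on these slides, assuming per-feature bookkeeping: `u` accumulates the total L1 penalty each weight should have received, and `q[i]` records what weight i has actually received (variable names are illustrative):

```python
def apply_l1_penalty(i, w, q, u):
    z = w[i]
    if w[i] > 0.0:
        # Subtract the outstanding penalty, clipping at zero
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0.0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z  # record the penalty actually applied

def sgd_l1_step(w, q, grad, eta, C, N, u):
    """One SGD step on a single example; grad is the gradient of the
    (unregularized) log-likelihood for that example."""
    u += eta * C / N  # penalty that should have been applied so far
    for i in range(len(w)):
        w[i] += eta * grad[i]       # ordinary gradient step
        apply_l1_penalty(i, w, q, u)  # then the cumulative L1 penalty
    return u
```

Even with a zero gradient, repeated steps drive a weight of 0.5 to exactly 0.0 (with eta = 1.0, C = 0.2, N = 1, three steps suffice), which is precisely the sparsity the naive update fails to achieve.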
21 CoNLL 2000 chunking: performance of the produced model

  Method                        Passes  Obj.    # Features  Time (sec)  F-score
  OWL-QN                        160     -1.583  18,109      598         93.62
  SGD (Naive)                   30      -1.671  455,651     1,117       93.64
  SGD (Clipping + Lazy Update)                  87,792      144         93.65
  SGD (Cumulative)                      -1.653  28,189      149         93.68
  SGD (Cumulative + ED)                 -1.622  23,584      148         93.66

- Training is 4 times faster than OWL-QN
- The model is 4 times smaller than the clipping-at-zero approach
- The objective is also slightly better
23 Discussions
Convergence:
- Demonstrated empirically
- The penalties applied are not i.i.d.
Learning rate:
- The need for tuning can be annoying
- Rule of thumb: exponential decay (passes = 30, alpha = 0.85)
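The exponential-decay rule of thumb can be sketched as eta_k = eta0 * alpha^(k/N), where k is the update count and N the number of training samples, so the rate shrinks by a factor of alpha each pass (eta0 = 1.0 here is an assumed default, not stated on the slide):

```python
def learning_rate(k, N, eta0=1.0, alpha=0.85):
    # After one full pass (k = N) the rate is eta0 * alpha;
    # after 30 passes it has decayed to eta0 * alpha**30.
    return eta0 * alpha ** (k / N)
```

With alpha = 0.85 and 30 passes, the final rate is roughly 0.8% of the initial one, so late updates (and late L1 penalties) are small.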
24 Conclusions
- Stochastic gradient descent training for L1-regularized log-linear models
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
- 3 to 4 times faster than OWL-QN
- Extremely easy to implement