Slide 1: Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou
University of Manchester
Slide 2: Log-linear models in NLP
- Maximum entropy models
  - Text classification (Nigam et al., 1999)
  - History-based approaches (Ratnaparkhi, 1998)
- Conditional random fields
  - Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction
  - Parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.
Slide 3: Log-linear models
A log-linear (a.k.a. maximum entropy) model defines the conditional probability

  p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big), \qquad Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)

where \lambda_i is a weight, f_i is a feature function, and Z(x) is the partition function.
Training: maximize the conditional likelihood of the training data.
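To make the notation concrete, here is a minimal Python sketch of this model with binary (indicator) features stored in a dict; the names (log_linear_prob, feature_fn) and the toy data are illustrative, not from the slides.

```python
import math

def log_linear_prob(weights, feature_fn, x, y, candidates):
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x), for binary features."""
    def score(label):
        # sum of the weights of the features active for (x, label)
        return sum(weights.get(f, 0.0) for f in feature_fn(x, label))
    z = sum(math.exp(score(label)) for label in candidates)  # partition function Z(x)
    return math.exp(score(y)) / z

# Toy usage: score a token's part of speech from one string-named feature.
weights = {"word=run&y=NOUN": 0.5, "word=run&y=VERB": 1.2}
feats = lambda x, label: [f"word={x}&y={label}"]
print(log_linear_prob(weights, feats, "run", "VERB", ["NOUN", "VERB"]))
```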
Slide 4: Regularization
- To avoid overfitting to the training data, penalize the weights of the features.
- L1 regularization:
  - Most of the weights become zero.
  - Produces sparse (compact) models.
  - Saves memory and storage.
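Written out, the L1-regularized training objective is the following (using C for the regularization strength, a standard symbol assumed here rather than taken from the slide):

\[
\max_{\boldsymbol{\lambda}} \;\; \sum_{j=1}^{N} \log p(y_j \mid x_j) \;-\; C \sum_{i} |\lambda_i|
\]

The second term drives many of the \lambda_i exactly to zero, which is what yields the sparse models mentioned above.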
Slide 5: Training log-linear models
- Numerical optimization methods:
  - Gradient descent (steepest descent or hill-climbing)
  - Quasi-Newton methods (e.g. BFGS, OWL-QN)
  - Stochastic gradient descent (SGD)
  - etc.
- Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.
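As a reference point for the SGD variants compared later, here is a minimal sketch of one plain SGD step on the conditional log-likelihood of a single example (no regularization yet). The gradient is the observed feature counts minus the model-expected counts; names and data layout follow the sketch after slide 3 and are illustrative.

```python
import math

def sgd_step(weights, x, y, candidates, feature_fn, eta):
    """One stochastic gradient ascent step on log p(y|x) for one example."""
    scores = {c: sum(weights.get(f, 0.0) for f in feature_fn(x, c)) for c in candidates}
    z = sum(math.exp(s) for s in scores.values())       # partition function Z(x)
    grad = {}
    for f in feature_fn(x, y):                          # observed feature counts
        grad[f] = grad.get(f, 0.0) + 1.0
    for c, s in scores.items():                         # minus expected counts
        p = math.exp(s) / z
        for f in feature_fn(x, c):
            grad[f] = grad.get(f, 0.0) - p
    for f, g in grad.items():
        weights[f] = weights.get(f, 0.0) + eta * g      # ascend the log-likelihood
```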
Slide 21: CoNLL 2000 chunking
Performance of the produced model:

  Method                         Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                         160      -1.583   18,109       598          93.62
  SGD (Naive)                    30       -1.671   455,651      1,117        93.64
  SGD (Clipping + Lazy Update)   30                87,792       144          93.65
  SGD (Cumulative)               30       -1.653   28,189       149          93.68
  SGD (Cumulative + ED)          30       -1.622   23,584       148          93.66

- Training is 4 times faster than OWL-QN.
- The model is 4 times smaller than the clipping-at-zero approach.
- The objective is also slightly better.
Slide 23: Discussion
- Convergence
  - Demonstrated empirically
  - The penalties applied are not i.i.d.
- Learning rate
  - The need for tuning can be annoying.
  - Rule of thumb: exponential decay (passes = 30, alpha = 0.85); see the sketch below.
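One common form of this schedule, consistent with the parameters above, multiplies the rate by alpha once per pass over the N training examples, i.e. eta_k = eta_0 * alpha^(k/N) at update k. A one-line sketch (eta0 is an assumed initial rate, not given on the slide):

```python
def learning_rate(eta0, k, n_examples, alpha=0.85):
    # decays smoothly; after one full pass (k = n_examples) the rate is eta0 * alpha
    return eta0 * alpha ** (k / n_examples)
```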
Slide 24: Conclusions
- Stochastic gradient descent training for L1-regularized log-linear models.
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available (see the sketch below).
- 3 to 4 times faster than OWL-QN.
- Extremely easy to implement.
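A minimal sketch of the cumulative penalty update summarized above, reusing sgd_step from the earlier sketch: u accumulates the total L1 penalty each weight could have received so far, q[f] records what weight f has actually received, and each touched weight is clipped toward zero by the difference, never crossing zero. C, N, and all variable names are assumptions for illustration.

```python
u = 0.0   # total L1 penalty each weight could have received so far
q = {}    # q[f]: penalty actually applied to weight f so far

def apply_penalty(weights, f):
    """Clip weight f toward zero by its outstanding cumulative penalty."""
    w = weights.get(f, 0.0)
    if w > 0.0:
        weights[f] = max(0.0, w - (u + q.get(f, 0.0)))
    elif w < 0.0:
        weights[f] = min(0.0, w + (u - q.get(f, 0.0)))
    q[f] = q.get(f, 0.0) + (weights[f] - w)   # record the penalty just applied

def train_one(weights, x, y, candidates, feature_fn, eta_k, C, N):
    """One SGD update followed by the lazy cumulative L1 penalty."""
    global u
    u += eta_k * C / N                        # grow the per-weight penalty budget
    sgd_step(weights, x, y, candidates, feature_fn, eta_k)
    touched = {f for c in candidates for f in feature_fn(x, c)}
    for f in touched:                         # penalize only the weights just updated
        apply_penalty(weights, f)
```

Because the clipping uses the cumulative budget u rather than only the current step's penalty, each weight eventually receives the full L1 penalty it would get under the true gradient, which is the key point of the conclusion above.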