Curriculum Learning Yoshua Bengio, U. Montreal Jérôme Louradour, A2iA Ronan Collobert, Jason Weston, NEC ICML, June 16th, 2009, Montreal Acknowledgment: Myriam Côté
Curriculum Learning Guided learning helps training humans and animals: shaping, education. Start from simpler examples / easier tasks (Piaget 1952, Skinner 1958)
The Dogma in question It is best to learn from a training set of examples sampled from the same distribution as the test set. Really?
Question Can machine learning algorithms benefit from a curriculum strategy? Cognition journal: (Elman 1993) vs (Rohde & Plaut 1999); see also (Krueger & Dayan 2009)
Convex vs Non-Convex Criteria Convex criteria: the order of presentation of examples should not matter to the convergence point, but could influence convergence speed. Non-convex criteria: the order and selection of examples could yield a better local minimum
Deep Architectures Theoretical arguments: deep architectures can be exponentially more compact than shallow ones representing the same function. Cognitive and neuroscience arguments. Many local minima. Guiding the optimization by unsupervised pre-training yields much better local minima, otherwise not reachable. Good candidate for testing curriculum ideas
Deep Training Trajectories (Erhan et al., AISTATS 09) [Figure: 2-D visualization of deep network training trajectories, comparing random initialization with unsupervised guidance]
Starting from Easy Examples [Figure: training proceeds from (1) the easiest examples / lower-level abstractions, through (2) intermediate ones, up to (3) the most difficult examples / higher-level abstractions]
Continuation Methods [Figure: start from a heavily smoothed objective (surrogate criterion), whose minimum is easy to find; then track local minima as the smoothing is gradually removed, ending at the final solution of the target objective]
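The continuation idea can be sketched numerically. Below is a toy 1-D example of our own (not from the talk): the target objective f(x) = x^4 - x^2 + 0.1x has a poor local minimum near x ≈ +0.68 and a global one near x ≈ -0.73. For this polynomial, Gaussian smoothing E[f(x + eps)] has a closed form, so we can anneal the smoothing and track the minimum by gradient descent.

```python
# Toy continuation method (illustrative only, not from the talk).
# Target objective f(x) = x^4 - x^2 + 0.1x: two local minima, the
# global one near x ~ -0.73, a poor one near x ~ +0.68.
# Smoothing with eps ~ N(0, sigma^2) gives, in closed form,
#   g_sigma(x) = E[f(x + eps)] = x^4 + (6*sigma^2 - 1)*x^2 + 0.1*x + const,
# which is convex for large sigma. Minimize g_sigma by gradient descent
# while annealing sigma -> 0, warm-starting each stage (continuation).

def grad(x, sigma):
    # Derivative of the smoothed objective (constants drop out).
    return 4 * x**3 + (12 * sigma**2 - 2) * x + 0.1

def descend(x, sigma, lr=0.02, steps=500):
    for _ in range(steps):
        x -= lr * grad(x, sigma)
    return x

# Continuation: heavy smoothing first, then gradually sharpen.
x = 1.0
for sigma in (2.0, 1.0, 0.5, 0.25, 0.1, 0.0):
    x = descend(x, sigma)
print(round(x, 2))  # tracks the minimum toward the global one near -0.73

# Plain gradient descent on the un-smoothed objective from the same start
# gets stuck in the poor local minimum near +0.68.
x_plain = descend(1.0, 0.0)
print(round(x_plain, 2))
```

The heavily smoothed stage is convex, so every run starts from the same basin; each later stage only has to refine the previous solution.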
Curriculum Learning as Continuation Sequence of training distributions: initially peaked on easier / simpler examples; gradually give more weight to more difficult ones until the target distribution is reached [Figure: from (1) easiest examples / lower-level abstractions to (3) most difficult examples / higher-level abstractions]
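A minimal sketch of such a sequence of training distributions, with hypothetical difficulty scores and a hard-threshold weighting of our own (the names and numbers are illustrative, not from the talk):

```python
# Curriculum as a sequence of training distributions Q_lambda.
# Each example z has a difficulty score in [0, 1]; the weight
# W_lambda(z) keeps only examples no harder than lambda, and
# Q_lambda re-normalizes the target distribution P over those.

def curriculum_weights(difficulties, lam):
    # W_lambda(z): 1 for examples with difficulty <= lam, else 0.
    return [1.0 if d <= lam else 0.0 for d in difficulties]

def q_lambda(p, difficulties, lam):
    # Q_lambda(z) proportional to W_lambda(z) * P(z).
    w = curriculum_weights(difficulties, lam)
    raw = [wi * pi for wi, pi in zip(w, p)]
    total = sum(raw)
    return [r / total for r in raw]

p = [0.25, 0.25, 0.25, 0.25]          # target distribution P over 4 examples
difficulties = [0.1, 0.3, 0.6, 0.9]

easy_stage = q_lambda(p, difficulties, 0.4)   # only the two easiest examples
final_stage = q_lambda(p, difficulties, 1.0)  # Q_1 = P: the target distribution
print(easy_stage)   # [0.5, 0.5, 0.0, 0.0]
print(final_stage)  # [0.25, 0.25, 0.25, 0.25]
```

As lambda grows, the support of Q_lambda widens until it matches the target distribution P.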
How to order examples? The right order is not known. 3 series of experiments: toy experiments with a simple order (larger margin first; less noisy inputs first); simpler shapes first, more varied ones later; smaller vocabulary first
Larger Margin First: Faster Convergence Another way to sort examples is by the margin yw'x, with the easiest examples corresponding to larger values. Again, the error-rate differences between the curriculum and no-curriculum strategies are statistically significant.
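The margin ordering can be sketched in a few lines (our own minimal version, with a hypothetical weight vector and examples):

```python
# "Larger margin first": for a linear classifier with weights w, the
# margin of example (x, y) is y * (w . x); larger margins are "easier",
# so the curriculum presents examples sorted by decreasing margin.

def margin(w, x, y):
    return y * sum(wi * xi for wi, xi in zip(w, x))

w = [1.0, -0.5]                       # hypothetical current weight vector
examples = [([2.0, 0.0], 1),          # margin  2.0 (easy)
            ([0.2, 0.0], 1),          # margin  0.2 (near the boundary)
            ([0.0, 2.0], -1),         # margin  1.0
            ([-1.0, 0.0], 1)]         # margin -1.0 (hardest: misclassified)

curriculum_order = sorted(examples, key=lambda ex: -margin(w, *ex))
print([round(margin(w, *ex), 1) for ex in curriculum_order])  # [2.0, 1.0, 0.2, -1.0]
```

Negative margins correspond to misclassified examples, which land at the end of the curriculum.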
Cleaner First: Faster Convergence We find that training only with easy examples gives rise to lower generalization error (16.3% vs 17.1%, average over 50 runs). The difference is statistically significant. Here the difficult examples are probably not useful because they confuse the learner rather than helping it establish the right location of the decision surface.
Shape Recognition First: easier, basic shapes The task is to classify geometrical shapes into 3 classes (rectangle, ellipse, triangle) Degrees of variability: object position, object size, object orientation, the grey levels of the foreground and the background Second = target: more varied geometric shapes
Shape Recognition Experiment 3-hidden-layer deep net, known to involve local minima (unsupervised pre-training finds much better solutions). 10 000 training / 5 000 validation / 5 000 test examples. Procedure: train for k epochs on the easier shapes, then switch to the target training set (more variations). The switch epoch k is the index of the epoch at which training switches to the target Geometric Shapes training set
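The 2-stage procedure amounts to a simple epoch-indexed schedule; a sketch with hypothetical dataset names of our own:

```python
# 2-stage curriculum schedule: train on the easier shape set for the
# first `switch_epoch` epochs, then on the more varied target set.

def dataset_for_epoch(epoch, switch_epoch, easy_set, target_set):
    return easy_set if epoch < switch_epoch else target_set

easy_set, target_set = "basic_shapes", "geom_shapes"
total_epochs, switch_epoch = 8, 4

schedule = [dataset_for_epoch(e, switch_epoch, easy_set, target_set)
            for e in range(total_epochs)]
print(schedule)
# ['basic_shapes', 'basic_shapes', 'basic_shapes', 'basic_shapes',
#  'geom_shapes', 'geom_shapes', 'geom_shapes', 'geom_shapes']
```

Sweeping `switch_epoch` from 0 to `total_epochs` interpolates between the no-curriculum baseline and training only on easy shapes.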
Shape Recognition Results Box plot of the distribution of test classification error as a function of the switch epoch k. Each box corresponds to 20 seeds for the random generators that initialize the network's free parameters. The horizontal line inside the box is the median, the box borders are the 25th and 75th percentiles, and the whisker ends are the 5th and 95th percentiles. Clearly, the best generalization is obtained by a 2-stage curriculum where the first half of the training time is spent on the easier examples rather than on the target examples.
Language Modeling Experiment Objective: compute the score of the next word given the previous ones (ranking criterion) Architecture of the deep neural network (Bengio et al. 2001, Collobert & Weston 2008)
Language Modeling Results Gradually increase the vocabulary size (the dips in the curve): train on Wikipedia sentences containing only words in the current vocabulary. Ranking language model trained with vs without curriculum on Wikipedia; the error is the log of the rank of the next word. In its first pass through Wikipedia, the curriculum-trained model skips examples containing words outside the 5k most frequent, then outside the 10k most frequent. The drop in rank occurs when the vocabulary size is increased, as the curriculum-trained model quickly gets better on the new words. The log rank on the target distribution with the curriculum strategy crosses the error of the no-curriculum strategy after about 1 billion updates, shortly after switching to the target vocabulary size of 20k words, and the gap keeps widening afterwards.
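The vocabulary schedule can be sketched as follows, with a toy corpus and tiny vocabulary sizes of our own standing in for Wikipedia and the 5k/10k/20k vocabularies:

```python
from collections import Counter

# Vocabulary curriculum sketch: at each stage, keep only the sentences
# whose words all fall inside the current most-frequent-word vocabulary,
# then enlarge the vocabulary for the next stage.

corpus = [
    "the cat",
    "the dog",
    "the cat saw the dog",
    "a quick brown fox",
]

# Rank words by frequency over the whole corpus.
freq = Counter(w for s in corpus for w in s.split())
ranked = [w for w, _ in freq.most_common()]

def usable(sentences, vocab_size):
    # Sentences trainable at this stage: all their words are in-vocabulary.
    vocab = set(ranked[:vocab_size])
    return [s for s in sentences if all(w in vocab for w in s.split())]

stage1 = usable(corpus, 3)            # small vocabulary: few sentences survive
stage2 = usable(corpus, len(ranked))  # full vocabulary: every sentence usable
print(len(stage1), len(stage2))  # 2 4
```

Each enlargement of the vocabulary admits new sentences, which is where the dips in the curriculum curve come from.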
Conclusion Yes, machine learning algorithms can benefit from a curriculum strategy.
Why? Faster convergence to a minimum: wastes less time on noisy or harder-to-predict examples. Convergence to better local minima: curriculum = a particular continuation method, which finds better local minima of a non-convex training criterion. Acts like a regularizer, with its main effect on the test set
Perspectives How could we define better curriculum strategies? We should try to understand the general principles that make some curricula work better than others, e.g. emphasizing harder examples, riding on the frontier of what the learner has mastered
THANK YOU! Questions? Comments?
Training Criterion: Ranking Words C_s = (1/|D|) sum_{w in D} max(0, 1 - f(s) + f(s_w)), with s a window of text from a word sequence in the training set S, w a word of the considered vocabulary D, s_w the window s with its last word replaced by w, and f the score computed by the network for the next word given the previous ones. The cost is minimized using stochastic gradient descent, by iteratively sampling pairs (s, w) composed of a window of text s from the training set S and a random word w, and performing a step in the direction of the gradient of C_{s,w} = max(0, 1 - f(s) + f(s_w)) with respect to the parameters, including the matrix of word embeddings W.
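A minimal numeric sketch of this pairwise hinge cost, with a stand-in scorer f in place of the deep network (the scorer and the toy windows are our own):

```python
# Pairwise ranking cost: for a text window s, replace its last word with
# a random word w and pay a hinge penalty whenever the corrupted window
# s_w scores within a margin of 1 of the true window s:
#   C_{s,w} = max(0, 1 - f(s) + f(s_w))

def ranking_cost(f, s, w):
    s_w = s[:-1] + [w]          # corrupt the window: swap in word w
    return max(0.0, 1.0 - f(s) + f(s_w))

# Hypothetical scorer: high score only for a window "seen in training".
good = {("the", "cat", "sat")}
f = lambda s: 2.0 if tuple(s) in good else 0.0

print(ranking_cost(f, ["the", "cat", "sat"], "banana"))  # 0.0: margin satisfied
print(ranking_cost(f, ["the", "cat", "ran"], "sat"))     # 3.0: corrupted window wins
```

Averaging this cost over all w in D gives C_s; the SGD procedure instead samples one random w per step.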
Curriculum = Continuation Method? Examples z from the target distribution P(z) are weighted by 0 <= W_lambda(z) <= 1, giving a sequence of training distributions Q_lambda(z) proportional to W_lambda(z) P(z), with W_1(z) = 1 so that Q_1 = P. The sequence of distributions Q_lambda is called a curriculum if: the entropy H(Q_lambda) of these distributions increases (larger domain), and W_lambda(z) is monotonically increasing in lambda
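The two conditions can be checked numerically for a toy weighting scheme of our own, W_lambda(z) = min(1, lambda / difficulty(z)), which is monotone in lambda and reaches W_1 = 1:

```python
import math

# Numeric check of the two curriculum conditions on a toy setup:
# target P over 3 examples; W_lambda(z) = min(1, lambda / difficulty(z)).

P = [0.5, 0.3, 0.2]
difficulty = [0.2, 0.6, 1.0]

def Q(lam):
    # Q_lambda(z) proportional to W_lambda(z) * P(z), re-normalized.
    w = [min(1.0, lam / d) for d in difficulty]
    raw = [wi * pi for wi, pi in zip(w, P)]
    z = sum(raw)
    return [r / z for r in raw]

def entropy(q):
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

lams = [0.2, 0.4, 0.6, 0.8, 1.0]
ents = [entropy(Q(l)) for l in lams]
print([round(e, 3) for e in ents])  # non-decreasing: entropy condition holds
```

Here W_lambda is monotone in lambda by construction, and the printed entropies increase toward H(P), so this particular schedule satisfies both conditions of the definition.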