1 Curriculum Learning
Yoshua Bengio, U. Montreal; Jérôme Louradour, A2iA; Ronan Collobert, Jason Weston, NEC
ICML, June 16th, 2009, Montreal. Acknowledgment: Myriam Côté

2 Curriculum Learning
Guided learning helps the training of humans and animals: shaping, education. Start from simpler examples / easier tasks (Piaget 1952, Skinner 1958).

3 The Dogma in question It is best to learn from a training set of examples sampled from the same distribution as the test set. Really?

4 Question: Can machine learning algorithms benefit from a curriculum strategy?
Cognition journal: (Elman 1993) vs (Rohde & Plaut 1999); (Krueger & Dayan 2009).

5 Convex vs Non-Convex Criteria
Convex criteria: the order of presentation of examples should not matter to the convergence point, but could influence convergence speed. Non-convex criteria: the order and selection of examples could yield a better local minimum.

6 Deep Architectures Theoretical arguments: deep architectures can be exponentially more compact than shallow ones representing the same function. Cognitive and neuroscience arguments. Many local minima: guiding the optimization by unsupervised pre-training yields much better local minima, otherwise not reachable. A good candidate for testing curriculum ideas.

7 Deep Training Trajectories (Erhan et al., AISTATS 2009)
[Figure: visualization of deep-network training trajectories, comparing random initialization with unsupervised guidance (pre-training).]

8 Starting from Easy Examples
[Figure: training stages from 1 (easiest examples, lower-level abstractions) to 3 (most difficult examples, higher-level abstractions).]

9 Continuation Methods
[Figure: start from a heavily smoothed objective (the surrogate criterion), whose minimum is easy to find; track local minima as the smoothing is gradually removed, ending at the target objective and the final solution.]
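To make the idea concrete, here is a minimal toy sketch of a continuation method (my own illustration, not the authors' code): a 1-D non-convex objective is minimized by first minimizing a heavily smoothed surrogate, then tracking the minimum as the smoothing is gradually removed.

```python
# Toy continuation method: minimize a smoothed surrogate first, then the target.
import numpy as np

def f(x):
    """Target objective: non-convex, with many local minima."""
    return 0.05 * x**2 + np.sin(3.0 * x)

def f_smoothed(x, sigma):
    """Gaussian-smoothed surrogate E[f(x + sigma*eps)], in closed form for this f."""
    return 0.05 * (x**2 + sigma**2) + np.exp(-4.5 * sigma**2) * np.sin(3.0 * x)

def gradient_descent(obj, x0, lr=0.05, steps=300, h=1e-4):
    """Plain gradient descent with a central finite-difference gradient."""
    x = x0
    for _ in range(steps):
        grad = (obj(x + h) - obj(x - h)) / (2.0 * h)
        x -= lr * grad
    return x

x = 4.0                                      # a deliberately poor starting point
for sigma in [3.0, 2.0, 1.0, 0.5, 0.0]:      # sigma = 0 recovers the target objective
    x = gradient_descent(lambda z: f_smoothed(z, sigma), x)   # warm start each stage
    print(f"sigma={sigma:.1f}  x={x:+.3f}  f(x)={f(x):+.3f}")
# Compare with gradient_descent(f, 4.0), which typically ends in a worse local minimum.
```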

10 Curriculum Learning as Continuation
A sequence of training distributions: initially peaked on easier / simpler examples, gradually giving more weight to more difficult ones until the target distribution is reached. [Figure: stages from 1 (easiest examples, lower-level abstractions) to 3 (most difficult examples, higher-level abstractions).]
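A minimal sketch of such a sequence of training distributions (my own illustration; the difficulty score and the quantile schedule are assumptions, not the paper's recipe):

```python
# Curriculum as a sequence of training distributions: each example z gets a
# weight W_lambda(z) in {0, 1} that grows with lambda, so minibatches are drawn
# mostly from easy examples early on and from the full target distribution at lambda = 1.
import numpy as np

def curriculum_weights(difficulty, lam):
    """W_lambda(z): 1 for examples no harder than the current quantile threshold."""
    threshold = np.quantile(difficulty, lam)          # lam in (0, 1]
    return (difficulty <= threshold).astype(float)

def sample_batch(X, y, difficulty, lam, batch_size, rng):
    """Draw a minibatch from Q_lambda(z), proportional to W_lambda(z) * P(z)."""
    w = curriculum_weights(difficulty, lam)
    p = w / w.sum()                                   # P(z) is uniform over the dataset here
    idx = rng.choice(len(X), size=batch_size, p=p)
    return X[idx], y[idx]

# Usage with a stand-in difficulty score; lambda is annealed toward the target.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 20)), rng.integers(2, size=1000)
difficulty = np.abs(rng.standard_normal(1000))        # hypothetical difficulty measure
for lam in np.linspace(0.2, 1.0, 5):
    xb, yb = sample_batch(X, y, difficulty, lam, batch_size=32, rng=rng)
    # ... run SGD updates on (xb, yb) with the learner of your choice ...
```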

11 How to order examples? The right order is not known
Three series of experiments: (1) toy experiments with a simple order (larger margin first; less noisy inputs first); (2) simpler shapes first, more varied ones later; (3) smaller vocabulary first.

12 Larger Margin First: Faster Convergence
Another way to sort examples is by the margin y·w'x, with the easiest examples corresponding to the larger values. Again, the error-rate differences between the curriculum strategy and the no-curriculum strategy are statistically significant.
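A minimal sketch of this ordering (my own toy setup: the margin is measured against a hypothetical generating weight vector w_true, and a logistic-loss SGD learner stands in for the learner used in the paper):

```python
# "Larger margin first": sort examples by y * w'x and train on the easiest half first.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
w_true = rng.standard_normal(d)                  # hypothetical generating weights
X = rng.standard_normal((n, d))
y = np.sign(X @ w_true + 0.5 * rng.standard_normal(n))

margin = y * (X @ w_true)                        # easiest examples: largest y * w'x
order = np.argsort(-margin)                      # sort easy-first

def sgd_logistic(X, y, w, lr=0.1, epochs=5):
    """Plain SGD on the logistic loss, visiting examples in the given order."""
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # gradient of log(1 + exp(-y * w'x)) w.r.t. w is -y * x / (1 + exp(y * w'x))
            w = w + lr * yi * xi / (1.0 + np.exp(yi * (w @ xi)))
    return w

w = np.zeros(d)
half = n // 2
w = sgd_logistic(X[order[:half]], y[order[:half]], w)   # stage 1: easiest half only
w = sgd_logistic(X[order], y[order], w)                  # stage 2: all examples
print("train accuracy:", np.mean(np.sign(X @ w) == y))
```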

13 Cleaner First: Faster Convergence
We find that training only with the easy examples gives rise to lower generalization error (16.3% vs 17.1%, averaged over 50 runs); the difference is statistically significant. Here the difficult examples are probably not useful because they confuse the learner rather than help it establish the right location of the decision surface.

14 Shape Recognition First: easier, basic shapes
The task is to classify geometrical shapes into 3 classes (rectangle, ellipse, triangle). Degrees of variability: object position, object size, object orientation, and the grey levels of the foreground and the background. Second set (= target): more varied geometric shapes.

15 Shape Recognition Experiment
A deep net with 3 hidden layers, known to involve local minima (unsupervised pre-training finds much better solutions); separate training / validation / test examples. Procedure: train for k epochs on the easier shapes, then switch to the target training set (more variations). The switch epoch is the index of the epoch at which training switches to the target Geometric Shapes training set.
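A pseudocode-level sketch of this procedure (model.fit_one_epoch, easy_data and target_data are hypothetical placeholders, not the authors' code):

```python
# Two-stage curriculum: easier shapes for the first `switch_epoch` epochs,
# then the target (more varied) training set for the remaining epochs.
def train_with_switch(model, easy_data, target_data, n_epochs, switch_epoch):
    """Two-stage curriculum controlled by the switch epoch k = switch_epoch."""
    for epoch in range(n_epochs):
        data = easy_data if epoch < switch_epoch else target_data
        model.fit_one_epoch(data)      # one pass of SGD over the chosen training set
    return model

# switch_epoch = 0 is the no-curriculum baseline (target data from the start);
# switch_epoch = n_epochs // 2 is the 2-stage curriculum reported to work best here.
```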

16 Shape Recognition Results
Box plot of the distribution of test classification error as a function of the switch epoch k. Each box corresponds to 20 seeds of the random generator that initializes the network's free parameters. The horizontal line inside the box is the median, the box borders are the 25th and 75th percentiles, and the whisker ends are the 5th and 95th percentiles. Clearly, the best generalization is obtained with a 2-stage curriculum in which the first half of the training time is spent on the easier examples rather than on the target examples.

17 Language Modeling Experiment
Objective: compute the score of the next word given the previous ones (ranking criterion). Architecture: a deep neural network (Bengio et al. 2001, Collobert & Weston 2008).

18 Language Modeling Results
Gradually increase the vocabulary size (the dips in the curve). Train on Wikipedia, using only sentences whose words are all in the current vocabulary. The comparison is between the ranking language model trained with vs without a curriculum on Wikipedia; the error is the log of the rank of the next word. In its first pass through Wikipedia, the curriculum-trained model skips examples containing words outside the 5k most frequent words, then skips examples with words outside the 10k-word most frequent vocabulary, and so on. The drop in rank occurs each time the vocabulary size is increased, as the curriculum-trained model quickly gets better on the new words. The log rank on the target distribution with the curriculum strategy crosses below the error of the no-curriculum strategy after about 1 billion updates, shortly after switching to the target vocabulary size of 20k words, and the difference keeps increasing afterwards.
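A minimal sketch of the vocabulary schedule (my own illustration; corpus_windows, word_freq_rank and train_step are hypothetical placeholders):

```python
# Vocabulary curriculum: on each pass over the corpus, a text window is used only
# if all of its words fall inside the current most-frequent-word vocabulary, and
# the vocabulary grows between passes.
def vocabulary_curriculum(corpus_windows, word_freq_rank, vocab_sizes, train_step):
    """word_freq_rank[w] is the frequency rank of word id w (0 = most frequent)."""
    for vocab_size in vocab_sizes:                 # one corpus pass per vocabulary size
        for window in corpus_windows:
            if all(word_freq_rank[w] < vocab_size for w in window):
                train_step(window)                 # one SGD update of the ranking model
            # windows containing rarer words are skipped on this pass

# Usage, with sizes following the 5k -> 10k -> ... -> 20k schedule described above:
# vocabulary_curriculum(windows, rank, [5000, 10000, 15000, 20000], sgd_step)
```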

19 Conclusion Yes, machine learning algorithms can benefit from a curriculum strategy.

20 Why? Faster convergence to a minimum
Less time is wasted on noisy or harder-to-predict examples. Convergence to better local minima: a curriculum is a particular continuation method that finds better local minima of a non-convex training criterion, acting like a regularizer whose main effect is on the test set.

21 Perspectives How could we define better curriculum strategies?
We should try to understand the general principles that make some curricula work better than others, such as emphasizing harder examples and riding on the frontier.

22 THANK YOU! Questions? Comments?

23 Training Criterion: Ranking Words
The cost is minimized using stochastic gradient descent, by iteratively sampling pairs (s, w) composed of a window of text s from the training set S and a random word w, and performing a step in the direction of the gradient of C_{s,w} with respect to the parameters, including the matrix of embeddings W. Here S is the training word sequence, C_{s,w} is the cost for the pair, based on the score of the next word given the previous ones, w is a word of the vocabulary, and D is the considered word vocabulary.
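A minimal sketch of one such stochastic update, assuming the pairwise hinge ranking cost C_{s,w} = max(0, 1 - f(s) + f(s^w)) of Collobert & Weston (2008), where s^w is the window s with its last word replaced by w; a single linear layer over the concatenated embeddings stands in for the deep architecture:

```python
# One SGD update on C_{s,w} = max(0, 1 - f(s) + f(s^w)); the linear scorer and
# hyperparameters are illustrative stand-ins, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, window = 1000, 50, 5
W = 0.01 * rng.standard_normal((vocab_size, emb_dim))  # embedding matrix W
v = 0.01 * rng.standard_normal(window * emb_dim)       # linear scoring weights
lr = 0.01

def score(word_ids):
    """f(s): score of a window of word ids (concatenate embeddings, dot with v)."""
    return v @ W[word_ids].reshape(-1)

def sgd_step(s):
    """Sample a random word w, form s^w, and step on the hinge ranking cost."""
    s = np.asarray(s)
    w = rng.integers(vocab_size)                       # random replacement word
    s_corrupt = s.copy()
    s_corrupt[-1] = w                                  # s^w: last word replaced by w
    if 1.0 - score(s) + score(s_corrupt) > 0:          # hinge is active
        x_pos, x_neg = W[s].reshape(-1), W[s_corrupt].reshape(-1)
        g = (lr * v).reshape(window, emb_dim)          # per-position embedding update
        v[:] -= lr * (x_neg - x_pos)                   # gradient step w.r.t. v
        np.add.at(W, s, g)                             # raise the true window's score
        np.add.at(W, s_corrupt, -g)                    # lower the corrupted window's score

# usage: one update on a random toy window of word ids
sgd_step(rng.integers(vocab_size, size=window))
```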

24 Curriculum = Continuation Method?
Examples z from the target distribution P(z) are weighted by 0 ≤ W_λ(z) ≤ 1, giving a sequence of training distributions Q_λ(z) ∝ W_λ(z) P(z). The sequence is called a curriculum if: the entropy of these distributions increases with λ (larger domain), and W_λ(z) is monotonically increasing in λ.
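A sketch of the formalization in LaTeX (notation assumed from the ICML 2009 paper):

```latex
% Sketch of the formal definition (notation assumed from Bengio et al., ICML 2009).
% P(z): target training distribution; W_lambda(z) in [0, 1]: weight on example z
% at step lambda in [0, 1], with W_1(z) = 1.
\[
  Q_\lambda(z) \propto W_\lambda(z)\, P(z), \qquad Q_1(z) = P(z)
\]
% (Q_lambda) is called a curriculum if the entropy increases,
\[
  H(Q_\lambda) < H(Q_{\lambda + \epsilon}) \qquad \forall\, \epsilon > 0,
\]
% and the weights are monotonically increasing in lambda:
\[
  W_{\lambda + \epsilon}(z) \ge W_\lambda(z) \qquad \forall\, z,\ \forall\, \epsilon > 0.
\]
```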

