Curriculum Learning
Yoshua Bengio (U. Montreal), Jérôme Louradour (A2iA), Ronan Collobert, Jason Weston (NEC)
ICML, June 16th, 2009, Montreal
Acknowledgment: Myriam Côté

Curriculum Learning
Guided learning helps train humans and animals:
- Shaping
- Education
Start from simpler examples / easier tasks (Piaget 1952, Skinner 1958).

The Dogma in Question
It is best to learn from a training set of examples sampled from the same distribution as the test set. Really?

Question
Can machine learning algorithms benefit from a curriculum strategy?
Cognition journal: (Elman 1993) vs. (Rohde & Plaut 1999), (Krueger & Dayan 2009).

Convex vs. Non-Convex Criteria
Convex criteria: the order of presentation of examples should not matter to the convergence point, but could influence convergence speed.
Non-convex criteria: the order and selection of examples could yield a better local minimum.

Deep Architectures
- Theoretical arguments: deep architectures can be exponentially more compact than shallow ones representing the same function.
- Cognitive and neuroscience arguments.
- Many local minima.
- Guiding the optimization by unsupervised pre-training yields much better local minima, otherwise not reachable.
- Good candidate for testing curriculum ideas.

Deep Training Trajectories (Erhan et al., AISTATS 2009)
[Figure: training trajectories obtained with random initialization vs. with unsupervised guidance.]

Starting from Easy Examples
[Figure: training progresses in stages, from the easiest examples / lower-level abstractions (1), through intermediate examples (2), to the most difficult examples / higher-level abstractions (3).]

Continuation Methods
[Figure: start from a heavily smoothed objective (a surrogate criterion) whose minimum is easy to find, then track local minima while gradually un-smoothing the objective until the final solution of the target objective is reached.]
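To make the idea concrete, here is a minimal, self-contained sketch (not from the talk) of a continuation method on a 1-D toy objective: first minimize a heavily Gaussian-smoothed version of a non-convex function, then keep tracking the minimum while the smoothing is gradually removed. The toy function, the Monte-Carlo smoothing, and the schedule are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def target(x):
    # Non-convex toy objective with several local minima.
    return np.sin(5.0 * x) + 0.5 * x ** 2

def smoothed(x, sigma, n_samples=512):
    # Monte-Carlo estimate of the Gaussian-smoothed objective
    # E[target(x + sigma * eps)], with eps ~ N(0, 1) and a fixed seed
    # so each evaluation is deterministic.
    rng = np.random.default_rng(0)
    eps = rng.standard_normal(n_samples)
    return target(x + sigma * eps).mean()

x = np.array([3.0])                       # deliberately poor starting point
for sigma in [2.0, 1.0, 0.5, 0.1, 0.0]:   # smoothing schedule: heavy -> none
    if sigma > 0.0:
        objective = lambda z, s=sigma: smoothed(z[0], s)
    else:
        objective = lambda z: target(z[0])  # final stage: the target objective itself
    # Warm-start each stage from the previous stage's solution (tracking the minimum).
    x = minimize(objective, x, method="Nelder-Mead").x
    print(f"sigma={sigma:.1f}  x={x[0]:+.4f}  target(x)={target(x[0]):+.4f}")
```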

Curriculum Learning as Continuation
Sequence of training distributions:
- Initially peaked on the easier / simpler examples (lower-level abstractions).
- Gradually give more weight to the more difficult examples (higher-level abstractions) until the target distribution is reached.
A sketch of such a sampler is given below.
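A minimal sketch of such a sequence of training distributions; the exponential weighting and the per-example difficulty scores are my own illustrative choices, not the paper's. Sampling weights start concentrated on easy examples and relax toward the target distribution as lambda goes from 0 to 1.

```python
import numpy as np

def curriculum_weights(difficulty, lam):
    """Unnormalized weight on each example at curriculum stage lam in [0, 1]:
    easy examples always get weight close to 1, hard ones are down-weighted early on."""
    return np.exp(-(1.0 - lam) * 5.0 * difficulty)

def sample_batch(rng, difficulty, lam, batch_size):
    w = curriculum_weights(difficulty, lam)
    p = w / w.sum()                      # training distribution at stage lam
    return rng.choice(len(difficulty), size=batch_size, p=p)

rng = np.random.default_rng(0)
difficulty = rng.random(1000)            # assumed per-example difficulty scores in [0, 1]
for lam in [0.0, 0.5, 1.0]:              # lam = 1 recovers (near-)uniform sampling
    batch = sample_batch(rng, difficulty, lam, batch_size=32)
    print(lam, difficulty[batch].mean()) # average difficulty of the batch rises with lam
```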

How to Order Examples?
The right order is not known. Three series of experiments:
- Toy experiments with a simple order: larger margin first; less noisy inputs first.
- Simpler shapes first, more varied ones later.
- Smaller vocabulary first.

Larger Margin First: Faster Convergence
Another way to sort examples is by the margin $y\,\mathbf{w}^\top \mathbf{x}$, with the easiest examples corresponding to larger values. Again, the error-rate differences between the curriculum and no-curriculum strategies are statistically significant.
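As a concrete sketch of this ordering (the random data and the reference direction w are illustrative, not the paper's setup):

```python
import numpy as np

def margin_order(X, y, w):
    """Return example indices sorted from easiest (largest margin y * w.x)
    to hardest (smallest margin) under the linear scorer w."""
    margins = y * (X @ w)          # y in {-1, +1}, X is (n, d), w is (d,)
    return np.argsort(-margins)    # descending margin = easiest first

# Usage sketch with synthetic data and an assumed reference direction w:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w = rng.standard_normal(5)
y = np.sign(X @ w + 0.3 * rng.standard_normal(100))
order = margin_order(X, y, w)      # present X[order], y[order] easiest-first
```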

Cleaner First: Faster Convergence
We find that training only on the easy (less noisy) examples gives lower generalization error: 16.3% vs. 17.1%, averaged over 50 runs. The difference is statistically significant. Here the difficult examples are probably not useful because they confuse the learner rather than help it establish the right location of the decision surface.

Shape Recognition
The task is to classify geometrical shapes into 3 classes (rectangle, ellipse, triangle). Degrees of variability: object position, object size, object orientation, and the grey levels of the foreground and background.
- First stage: easier, basic shapes.
- Second stage (target): more varied geometric shapes.

Shape Recognition Experiment
- 3-hidden-layer deep net, known to involve local minima (unsupervised pre-training finds much better solutions).
- 10,000 training / 5,000 validation / 5,000 test examples.
Procedure:
- Train for k epochs on the easier shapes.
- Switch to the target training set (more variations).
The switch epoch is the index of the epoch at which training switches to the more varied Geometric Shapes (target) training set; a sketch of this two-stage procedure is given below.
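A sketch of the two-stage (switch-epoch) procedure. The helpers train_one_epoch and evaluate, and the data objects, are hypothetical placeholders, not the original experiment code.

```python
def two_stage_curriculum(model, easy_data, target_data, valid_data,
                         total_epochs, switch_epoch,
                         train_one_epoch, evaluate):
    """Train on the easier shapes for switch_epoch epochs, then on the
    target (more varied) shapes for the remaining epochs."""
    for epoch in range(total_epochs):
        data = easy_data if epoch < switch_epoch else target_data
        train_one_epoch(model, data)
    return evaluate(model, valid_data)
```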

Shape Recognition Results
[Figure: box plot of the distribution of test classification error as a function of the switch epoch k. Each box corresponds to 20 random seeds used to initialize the network's free parameters; the horizontal line inside a box is the median, the box borders are the 25th and 75th percentiles, and the whisker ends are the 5th and 95th percentiles.]
Clearly, the best generalization is obtained with a 2-stage curriculum in which the first half of the training time is spent on the easier examples rather than on the target examples.

Language Modeling Experiment
Objective: compute the score of the next word given the previous ones (ranking criterion).
Architecture: the deep neural network of (Bengio et al. 2001; Collobert & Weston 2008).

Language Modeling Results
Curriculum: gradually increase the vocabulary size (the dips in the curve), training on Wikipedia sentences containing only words in the current vocabulary.
[Figure: ranking language model trained with vs. without curriculum on Wikipedia; the error is the log of the rank of the next word.]
In its first pass through Wikipedia, the curriculum-trained model skips examples containing words outside the 5k most frequent; in the next pass it only skips examples with words outside the 10k most frequent. The drop in rank occurs each time the vocabulary size is increased, as the curriculum-trained model quickly gets better on the new words. The log rank on the target distribution with the curriculum strategy crosses below the error of the no-curriculum strategy after about 1 billion updates, shortly after switching to the target vocabulary size of 20k words, and the difference keeps increasing afterwards. A sketch of this vocabulary curriculum is given below.
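A sketch of the vocabulary curriculum with hypothetical helper names (corpus_windows and word_frequency_rank are illustrative, not the original pipeline): in each pass over the corpus, keep only the text windows whose words all fall inside the current most-frequent vocabulary, then grow the vocabulary.

```python
def vocabulary_curriculum(corpus_windows, word_frequency_rank,
                          sizes=(5_000, 10_000, 20_000)):
    """Yield (pass_index, window) pairs, skipping any window that contains
    a word ranked outside the current vocabulary size."""
    for pass_index, vocab_size in enumerate(sizes):
        for window in corpus_windows:
            if all(word_frequency_rank[w] < vocab_size for w in window):
                yield pass_index, window

# Usage sketch: corpus_windows is an iterable of word-id tuples and
# word_frequency_rank maps a word id to its frequency rank (0 = most frequent).
```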

Conclusion
Yes, machine learning algorithms can benefit from a curriculum strategy.

Why?
- Faster convergence to a minimum: less time wasted on noisy or harder-to-predict examples.
- Convergence to better local minima: a curriculum is a particular continuation method, which finds better local minima of a non-convex training criterion; it acts like a regularizer, with its main effect on the test set.

Perspectives
- How could we define better curriculum strategies?
- We should try to understand the general principles that make some curricula work better than others.
- Emphasizing harder examples and riding on the frontier.

THANK YOU! Questions? Comments?

Training Criterion: Ranking Words
$$C_s = \frac{1}{|D|} \sum_{w \in D} C_{s,w}, \qquad C_{s,w} = \max\!\big(0,\; 1 - f(s) + f(s_w)\big)$$
where s is a word sequence (a window of text) from the training set S, w is a word of the vocabulary, D is the considered word vocabulary, f(s) is the score assigned to the next word given the previous ones, and s_w denotes s with its last word replaced by w.
The cost is minimized using stochastic gradient descent, by iteratively sampling pairs (s, w) composed of a window of text s from the training set S and a random word w, and performing a step in the direction of the gradient of C_{s,w} with respect to the parameters, including the matrix of word embeddings W.
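A minimal numpy sketch of this pairwise ranking cost; the linear scorer f is a stand-in for the deep network of (Bengio et al. 2001; Collobert & Weston 2008), and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, window = 20_000, 50, 5

W = 0.01 * rng.standard_normal((vocab_size, dim))   # word embedding matrix
v = 0.01 * rng.standard_normal(window * dim)        # stand-in scorer weights

def f(word_ids):
    """Score of a text window (sequence of word ids) under the stand-in scorer."""
    return W[word_ids].reshape(-1) @ v

def ranking_cost(s, w):
    """C_{s,w} = max(0, 1 - f(s) + f(s_w)), where s_w is s with its
    last word replaced by the random word w."""
    s_w = list(s[:-1]) + [w]
    return max(0.0, 1.0 - f(s) + f(s_w))

s = rng.integers(0, vocab_size, size=window)  # a window of text from the corpus
w = rng.integers(0, vocab_size)               # a random "negative" word
print(ranking_cost(s, w))   # SGD would take a gradient step on this cost
```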

Curriculum = Continuation Method?
Examples z from the target distribution $P(z)$ are weighted by $0 \le W_\lambda(z) \le 1$, with $W_1(z) = 1$, giving the training distribution at step $\lambda \in [0, 1]$:
$$Q_\lambda(z) \propto W_\lambda(z)\, P(z), \qquad \text{so that } Q_1(z) = P(z).$$
The sequence of distributions $Q_\lambda$ is called a curriculum if:
- the entropy of these distributions increases (larger domain): $H(Q_\lambda) < H(Q_{\lambda+\epsilon})$ for all $\epsilon > 0$;
- the weights are monotonically increasing in $\lambda$: $W_{\lambda+\epsilon}(z) \ge W_\lambda(z)$ for all $z$ and all $\epsilon > 0$.
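A tiny numeric check, using my own toy construction rather than anything from the paper, that a threshold-style weighting satisfies these two conditions on a 4-example target distribution:

```python
import numpy as np

P = np.array([0.4, 0.3, 0.2, 0.1])           # target distribution over 4 examples
difficulty = np.array([0.1, 0.4, 0.7, 0.9])  # assumed difficulty scores

def Q(lam):
    # W_lambda(z) is an indicator that grows with lambda, so it is
    # monotonically increasing in lambda by construction.
    W = (difficulty <= lam).astype(float)
    q = W * P
    return q / q.sum()

def entropy(q):
    nz = q[q > 0]
    return -(nz * np.log(nz)).sum()

for lam in [0.2, 0.5, 0.8, 1.0]:
    print(lam, Q(lam), entropy(Q(lam)))
# The entropy of Q(lam) increases with lam, and Q(1.0) equals P,
# matching the two conditions in the definition above.
```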