CSC321: Lecture 7: Ways to prevent overfitting


CSC321: Lecture 7: Ways to prevent overfitting Geoffrey Hinton

Overfitting
- The training data contains information about the regularities in the mapping from input to output. But it also contains noise:
  - The target values may be unreliable.
  - There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
- When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity.
- If the model is very flexible, it can model the sampling error really well. This is a disaster.

Preventing overfitting
- Use a model that has the right capacity:
  - enough to model the true regularities,
  - not enough to also model the spurious regularities (assuming they are weaker).
- Standard ways to limit the capacity of a neural net:
  - Limit the number of hidden units.
  - Limit the size of the weights.
  - Stop the learning before it has time to overfit.

Limiting the size of the weights
- Weight-decay involves adding an extra term to the cost function that penalizes the squared weights.
- This keeps weights small unless they have big error derivatives.
(Slide figure: the penalized cost C plotted against a weight w.)
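A sketch of the standard formulation (the symbols E for the unpenalized error and λ for the weight-cost coefficient are notation assumed here, not shown in the transcript):

```latex
C = E + \frac{\lambda}{2}\sum_i w_i^2,
\qquad
\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i .
```

Setting the gradient to zero gives w_i = -(1/λ) ∂E/∂w_i at the minimum of C, which is exactly the point above: a weight stays large only if it has a big error derivative.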

Weight-decay via noisy inputs
- Weight-decay reduces the effect of noise in the inputs.
- The noise variance is amplified by the squared weight.
- The amplified noise makes an additive contribution to the squared error, so minimizing the squared error tends to minimize the squared weights when the inputs are noisy.
- It gets more complicated for non-linear networks.
(Slide diagram: a noisy input unit i feeding a unit j.)
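For a linear neuron this can be made exact. A sketch, assuming the noise ε_i added to input x_i is zero-mean with variance σ_i² and independent across inputs (notation not in the transcript):

```latex
y^{\text{noisy}} = \sum_i w_i\,(x_i + \varepsilon_i),
\qquad
\mathbb{E}\big[(y^{\text{noisy}} - t)^2\big]
= (y - t)^2 + \sum_i w_i^2\,\sigma_i^2,
\quad\text{where } y = \sum_i w_i x_i .
```

The extra term is exactly an L2 weight penalty whose coefficient is set by the noise variance, which is why minimizing the squared error on noisy inputs keeps the squared weights small.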

Other kinds of weight penalty
- Sometimes it works better to penalize the absolute values of the weights. This drives some weights to exactly zero, which helps interpretation.
- Sometimes it works better to use a weight penalty that has negligible effect on large weights.
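As a concrete sketch (the saturating form below is one common choice, assumed here rather than taken from the slide), the three penalties might be written as:

```latex
\Omega_{2}(w) = \frac{\lambda}{2}\sum_i w_i^2,
\qquad
\Omega_{1}(w) = \lambda\sum_i |w_i|,
\qquad
\Omega_{\text{sat}}(w) = \lambda\sum_i \frac{w_i^2}{1 + w_i^2}.
```

The gradient of the absolute-value penalty has constant magnitude λ, so it pushes small weights all the way to zero; the gradient of the saturating penalty vanishes for large |w_i|, so weights that are already big are left almost untouched.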

The effect of weight-decay
- It prevents the network from using weights that it does not need.
  - This can often improve generalization a lot.
  - It helps to stop the network from fitting the sampling error.
- It makes a smoother model in which the output changes more slowly as the input changes.
- If the network has two very similar inputs, it prefers to put half the weight on each rather than all the weight on one.
(Slide diagram: weights of w/2 and w/2 on two similar inputs versus a weight of w on one of them.)
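The preference for spreading weight over two similar inputs follows directly from the quadratic penalty: splitting a weight w equally leaves the unit's output essentially unchanged (the two inputs carry the same signal) but halves the penalty.

```latex
\left(\frac{w}{2}\right)^2 + \left(\frac{w}{2}\right)^2 = \frac{w^2}{2} \;<\; w^2 + 0^2 .
```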

Model selection
- How do we decide which limit to use and how strong to make it?
- If we use the test data to choose, we get an unfairly optimistic estimate of the error rate we would get on new test data.
  - Suppose we compared a set of models that all gave random results: the best one on a particular dataset would do better than chance, but it won't do better than chance on another test set.
- So use a separate validation set to do model selection (see the sketch below).
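A minimal sketch of model selection on a validation set, here choosing a weight-decay coefficient. The helpers fit(lam) and valid_error(model) are hypothetical stand-ins for the real training and evaluation code:

```python
def select_weight_decay(candidates, fit, valid_error):
    """Return the weight-decay coefficient (and model) with the lowest validation error.

    `fit(lam)` is assumed to train a network with penalty coefficient `lam`;
    `valid_error(model)` is assumed to measure its error on the validation set.
    """
    best_lam, best_model, best_err = None, None, float("inf")
    for lam in candidates:
        model = fit(lam)
        err = valid_error(model)
        if err < best_err:
            best_lam, best_model, best_err = lam, model, err
    return best_lam, best_model

# e.g. select_weight_decay([0.0, 1e-4, 1e-3, 1e-2], fit, valid_error)
```

Only after this choice is made is the untouched test set used once, to report the final error rate.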

Using a validation set
- Divide the total dataset into three subsets:
  - Training data is used for learning the parameters of the model.
  - Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
  - Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.
- We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
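A minimal sketch of the three-way split (the 60/20/20 proportions and the NumPy-array interface are illustrative assumptions, not from the lecture):

```python
import numpy as np

def three_way_split(X, y, frac_train=0.6, frac_valid=0.2, seed=0):
    """Shuffle once, then carve the data into training, validation and test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(frac_train * len(X))
    n_valid = int(frac_valid * len(X))
    train = idx[:n_train]                    # used for learning the parameters
    valid = idx[n_train:n_train + n_valid]   # used for choosing model / regularization
    test  = idx[n_train + n_valid:]          # used once, for the final unbiased estimate
    return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])
```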

Preventing overfitting by early stopping
- If we have lots of data and a big model, it is very expensive to keep re-training it with different amounts of weight decay.
- It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse (but don't get fooled by noise!).
- The capacity of the model is limited because the weights have not had time to grow big.
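A minimal sketch of the early-stopping loop; the model interface (get_weights/set_weights) and the train_one_epoch and validation_error helpers are hypothetical, and the patience parameter is the usual guard against being fooled by noise on the validation set:

```python
import copy

def train_with_early_stopping(model, train_data, valid_data,
                              max_epochs=200, patience=10):
    """Assumes the model starts with very small weights; stops when validation error stops improving."""
    best_error = float("inf")
    best_weights = copy.deepcopy(model.get_weights())
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(model, train_data)            # hypothetical helper
        error = validation_error(model, valid_data)   # hypothetical helper
        if error < best_error:
            best_error = error
            best_weights = copy.deepcopy(model.get_weights())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:         # several bad epochs in a row,
                break                                 # not just validation noise
    model.set_weights(best_weights)                   # roll back to the best point seen
    return model
```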

Why early stopping works
- When the weights are very small, every hidden unit is in its linear range.
  - So a net with a large layer of hidden units is linear: it has no more capacity than a linear net in which the inputs are directly connected to the outputs!
- As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows.
(Slide diagram: a network with inputs, a layer of hidden units, and outputs.)
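To see why tiny weights keep a hidden unit linear, take a logistic unit as an example (the choice of nonlinearity is an assumption; the transcript does not name one). Near zero it is well approximated by its tangent line:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}} \;\approx\; \frac{1}{2} + \frac{z}{4}
\qquad \text{for small } |z|,
```

so when every incoming weight is tiny, each hidden unit computes an affine function of its inputs, and a composition of affine functions is still affine: the whole network is no more powerful than a direct linear mapping from inputs to outputs.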

Another framework for model selection
- Using a validation set to determine the optimal weight-decay coefficient is safe and sensible, but it wastes valuable training data.
- Is there a way to determine the weight-decay coefficient automatically?
- The minimum description length principle is a version of Occam's razor: the best model is the one that is simplest to describe.

The description length of the model
- Imagine that a sender must tell a receiver the correct output for every input vector in the training set. (The receiver can see the inputs.)
- Instead of just sending the outputs, the sender could first send a model and then send the residual errors.
  - If the structure of the model was agreed in advance, the sender only needs to send the weights.
- How many bits does it take to send the weights and the residuals? That depends on how the weights and residuals are coded.
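In symbols, a hedged paraphrase of the trade-off this sets up (the notation is not from the transcript):

```latex
\text{description length}
= \underbrace{L(\text{weights})}_{\text{bits to send the model}}
+ \underbrace{L(\text{residuals} \mid \text{weights})}_{\text{bits to send the errors}} .
```

The minimum description length principle says to prefer the model that minimizes this sum: more numerous or more precisely specified weights cost extra bits but shrink the residuals, and vice versa, which plays the same regularizing role as a weight-decay coefficient chosen on a validation set.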