Practical Advice For Building Neural Nets Deep Learning and Neural Nets Spring 2015.


Practical Advice For Building Neural Nets
Deep Learning and Neural Nets, Spring 2015

Day’s Agenda
1. Celebrity guests
2. Discuss issues and observations from homework
3. Catrin Mills on climate change in the Arctic
4. Practical issues in building nets

Homework Discussion
- Lei’s problem with the perceptron not converging (demo)
- Manjunath’s questions about priors:
  - Why does the balance of data make a difference in what is learned? (I believe it will for Assignment 2.)
  - If I’d told you the priors should be 50/50, what would you have to do differently?
  - What’s the relationship between the statistics of the training set and the test set?
  - Suppose we have a classifier that outputs a probability (not just 0 or 1). If you know the priors in the training set and you know the priors in the test set to be different, how do you repair the classifier to accommodate the mismatch between training and testing priors? (One standard fix is sketched below.)
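A hedged sketch of that standard fix, assuming the classifier outputs per-class posterior probabilities; the function name and the example priors below are illustrative, and this is not necessarily the exact answer intended in class. The idea is to rescale each posterior by the ratio of test to training priors and renormalize.

    import numpy as np

    def adjust_for_priors(p_train_posterior, train_priors, test_priors):
        """Rescale class posteriors from a model trained under train_priors
        so they are calibrated for data drawn under test_priors.

        p_train_posterior: array of shape (n_examples, n_classes), p_train(y | x)
        train_priors, test_priors: arrays of shape (n_classes,)
        """
        # Bayes rule: p(y | x) is proportional to p(x | y) * p(y).  Dividing out the
        # training priors and multiplying in the test priors swaps the class-frequency
        # assumption while leaving the likelihood term untouched.
        unnormalized = p_train_posterior * (np.asarray(test_priors) / np.asarray(train_priors))
        return unnormalized / unnormalized.sum(axis=1, keepdims=True)

    # Example: a balanced (50/50) training set but a 90/10 test environment.
    posteriors = np.array([[0.6, 0.4], [0.2, 0.8]])
    print(adjust_for_priors(posteriors, [0.5, 0.5], [0.9, 0.1]))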

Weight Initialization
- Break symmetries: use small random values
- Weight magnitudes should depend on fan-in
- What Mike has always done:
  - draw all weights feeding into neuron j (including bias) at random
  - normalize the weights feeding into neuron j
  - works well for logistic units; to be determined for ReLU units

Weight Initialization
A perhaps better idea (due to Kelvin Wagner):
- Draw all weights feeding into neuron j (including bias) at random
- If input activities lie in [-1, +1], then the variance of the net input to unit j grows with the fan-in to j, f_j
- Normalize so that the variance of the net input equals C^2 (sketched below)
- If input activities lie in [-1, +1], most net inputs will then be in [-2C, +2C]
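A minimal sketch of this kind of fan-in-scaled initialization, assuming inputs with roughly unit variance and a Gaussian draw; the constant C and the exact layout are assumptions, not necessarily the precise scheme used in class.

    import numpy as np

    def init_weights(fan_in, fan_out, C=1.0, seed=0):
        """Draw a (fan_in + 1) x fan_out weight matrix (the +1 row is the bias)
        so that the net input to each unit has variance of roughly C**2 when the
        incoming activities have unit variance."""
        rng = np.random.default_rng(seed)
        # Each net input sums fan_in terms, so its variance is fan_in * sigma_w**2;
        # choosing sigma_w = C / sqrt(fan_in) keeps net inputs on a fixed scale
        # regardless of layer width.
        sigma_w = C / np.sqrt(fan_in + 1)          # +1 counts the bias weight as an input
        return rng.normal(0.0, sigma_w, size=(fan_in + 1, fan_out))

    W = init_weights(fan_in=256, fan_out=50, C=1.0)
    print(W.std())   # should be close to 1/sqrt(257)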

Does Weight Initialization Scheme Matter?
Trained a 3-layer net on 2500 digit patterns:
- 10 output classes, tanh units
- varying numbers of hidden units
- automatic adaptation of learning rates
- 10 minibatches per epoch
- 500 epochs
- 12 replications of each architecture

Does Weight Initialization Scheme Matter?
Weight initialization schemes:
- Gaussian random weights, N(0, σ^2)
- Gaussian random weights, N(0, 0.01^2)
- L1 constraint on Gaussian random weights [conditioning for worst case]
- Gaussian weights, N(0, 4/FanIn) [conditioning for average case]
- Gaussian weights, N(0, 1/FanIn) [conditioning for average case]

Small Random Weights
Strangeness:
- the training set can’t be learned (caveat: plotting accuracy, not MSE)
- if there’s overfitting, it doesn’t happen until 200 hidden units

Mike’s L1 Normalization Scheme
About the same as small random weights

Normalization Based On Fan-In
- Perfect performance on the training set
- Test set performance depends on the scaling

Conditioning The Input Vectors
If m_i is the mean activity of input unit i over the training set and s_i is its standard deviation over the training set, then for each input pattern (in both the training and test sets), normalize the activity x_i by

    x_i <- (x_i - m_i) / s_i

where m_i is the training-set mean activity and s_i is the standard deviation of the training-set activities. A sketch of this standardization appears below.
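A minimal sketch of that standardization, assuming the data are stored as NumPy arrays with one pattern per row; note that the test set is normalized with the training set’s statistics, as the slide specifies.

    import numpy as np

    def standardize(train_X, test_X, eps=1e-8):
        """Z-score each input unit using statistics computed on the training set only."""
        m = train_X.mean(axis=0)            # m_i: mean activity of input unit i over the training set
        s = train_X.std(axis=0) + eps       # s_i: std dev over the training set (eps guards constant inputs)
        return (train_X - m) / s, (test_X - m) / s

    train_X = np.random.rand(1000, 64)
    test_X = np.random.rand(200, 64)
    train_Z, test_Z = standardize(train_X, test_X)
    print(train_Z.mean(axis=0)[:3], train_Z.std(axis=0)[:3])   # roughly 0 and 1 per unit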

Conditioning The Hidden Units
If you’re using logistic units, replace the logistic output with a function scaled from -1 to +1, i.e., the tanh function:
- with net = 0, y = 0
- will tend to cause biases to be closer to zero and more on the same scale as the other weights in the network
- will also satisfy the assumption I make to condition initial weights and weight updates for the units in the next layer
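For reference, the rescaled logistic is just a tanh in disguise: 2*sigmoid(net) - 1 = tanh(net/2). A quick numerical check of that identity (standard algebra, not from the slides):

    import numpy as np

    def logistic(net):
        return 1.0 / (1.0 + np.exp(-net))

    net = np.linspace(-5, 5, 11)
    scaled = 2.0 * logistic(net) - 1.0             # logistic output rescaled to [-1, +1], zero at net = 0
    print(np.allclose(scaled, np.tanh(net / 2)))   # True: the rescaled logistic is tanh(net/2)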

Setting Learning Rates I
Initial guess for the learning rate:
- if the error doesn’t drop consistently, lower the initial learning rate and try again
- if the error falls reliably but slowly, increase the learning rate
Toward the end of training:
- the error will often jitter, at which point you can gradually lower the learning rate toward 0 to clean up the weights
Remember, plateaus in error often look like minima:
- be patient
- have some idea a priori of how well you expect your network to do, and print statistics during training that tell you how well it’s doing
- plot epochwise error as a function of epoch, even if you’re doing minibatches

Setting Learning Rates II
Momentum
Adaptive and neuron-specific learning rates:
- observe the error on epoch t-1 and epoch t
- if decreasing, increase the global learning rate, ε_global, by an additive constant
- if increasing, decrease the global learning rate by a multiplicative constant
- if the fan-in of neuron j is f_j, then scale neuron j’s learning rate by its fan-in

Setting Learning Rates III
Mike’s hack

Initialization:
    epsilon = .01
    inc = epsilon / 10
    if (batch_mode_training) scale = .5 else scale = .9

Update (after each epoch):
    if (current_epoch_error < previous_epoch_error)
        epsilon = epsilon + inc
        saved_weights = weights
    else
        epsilon = epsilon * scale
        inc = epsilon / 10
        if (batch_mode_training) weights = saved_weights
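A hedged Python translation of that hack, wrapped in a small helper class; the class and attribute names are mine, not from the slides, and the constants follow the pseudocode above.

    class EpochwiseLearningRate:
        """Additive increase / multiplicative decrease of a global learning rate,
        following the pseudocode above."""

        def __init__(self, batch_mode_training, epsilon=0.01):
            self.epsilon = epsilon
            self.inc = epsilon / 10
            self.scale = 0.5 if batch_mode_training else 0.9
            self.batch_mode_training = batch_mode_training
            self.saved_weights = None

        def update(self, current_epoch_error, previous_epoch_error, weights):
            """Call once per epoch; returns the (possibly rolled-back) weights."""
            if current_epoch_error < previous_epoch_error:
                self.epsilon += self.inc              # error dropped: nudge the rate up additively
                self.saved_weights = [w.copy() for w in weights]
            else:
                self.epsilon *= self.scale            # error rose: cut the rate multiplicatively
                self.inc = self.epsilon / 10
                if self.batch_mode_training and self.saved_weights is not None:
                    weights = [w.copy() for w in self.saved_weights]   # roll back the bad step
            return weights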

Setting Learning Rates IV
- rmsprop (see Hinton’s lecture)
- Exploit optimization methods that use curvature: requires computation of the Hessian
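A minimal sketch of the rmsprop update as usually described in Hinton’s lecture notes (the hyperparameter values here are conventional defaults, not values from the slides): divide each gradient component by a running root-mean-square of its recent magnitudes.

    import numpy as np

    def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
        """One rmsprop update.  `cache` holds a running average of squared gradients
        (same shape as w); pass the returned cache back in on the next step."""
        cache = decay * cache + (1 - decay) * grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)   # per-parameter step size, scaled by recent gradient magnitude
        return w, cache

    w = np.zeros(3)
    cache = np.zeros_like(w)
    grad = np.array([0.5, -2.0, 0.1])
    w, cache = rmsprop_step(w, grad, cache)
    print(w)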

When To Stop Training
1. Train n epochs; lower the learning rate; train m more epochs
   - bad idea: can’t assume a one-size-fits-all approach
2. Error-change criterion: stop when the error isn’t dropping
   - My recommendation: a criterion based on the % drop over a window of, say, 10 epochs (sketched below); 1 epoch is too noisy, and an absolute error criterion is too problem dependent
   - Karl’s idea: train for a fixed number of epochs after the criterion is reached (possibly with a lower learning rate)
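A hedged sketch of that windowed error-change criterion; the 10-epoch window and 1% threshold below are illustrative choices, not numbers from the slides.

    import math

    def should_stop(epoch_errors, window=10, min_rel_drop=0.01):
        """Stop when the error has dropped by less than `min_rel_drop` (as a fraction)
        over the last `window` epochs.  `epoch_errors` is the list of per-epoch errors."""
        if len(epoch_errors) <= window:
            return False                              # not enough history yet
        old, new = epoch_errors[-window - 1], epoch_errors[-1]
        return (old - new) / old < min_rel_drop       # relative improvement over the window

    errors = [0.5 + 0.5 * math.exp(-t / 20) for t in range(200)]   # error curve that flattens out
    print([should_stop(errors[:t]) for t in (50, 150, 200)])       # [False, True, True]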

When To Stop Training
3. Weight-change criterion
   - Compare the weights at epochs t-10 and t and test whether any weight has changed by more than some threshold (sketched below)
   - Don’t base the test on the length of the overall weight-change vector
   - Possibly express the change as a percentage of the weight
   - Be cautious: small weight changes at critical points can result in a rapid drop in error
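A hedged sketch of one way to implement that test, consistent with the bullets above (per-weight relative change, not the norm of the whole change vector); the threshold and the max-over-weights choice are my assumptions.

    import numpy as np

    def weights_converged(w_old, w_new, rel_threshold=0.001, eps=1e-12):
        """True if no individual weight changed by more than rel_threshold (as a fraction
        of its old magnitude) between epoch t-10 (w_old) and epoch t (w_new)."""
        rel_change = np.abs(w_new - w_old) / (np.abs(w_old) + eps)   # per-weight, not overall vector length
        return np.max(rel_change) < rel_threshold

    w_old = np.array([0.5, -1.2, 0.03])
    w_new = w_old + np.array([1e-5, -2e-5, 1e-6])
    print(weights_converged(w_old, w_new))   # True: every weight moved by well under 0.1%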