7. How Many Hidden Units Do We Need To Learn Handprinted Digits?
- Two isn't enough
- Think of the hidden layer as a bottleneck conveying all information from input to output
- Sometimes networks can surprise you, e.g., the autoencoder
8. Autoencoder
- Self-supervised training procedure
- Given a set of input vectors (no target outputs)
- Map the input back to itself via a hidden-layer bottleneck (a sketch follows below)
- How to achieve the bottleneck?
  - Fewer neurons
  - Sparsity constraint
  - Information-transmission constraint (e.g., add noise to a unit, or shut units off randomly, a.k.a. dropout)
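A minimal sketch of such a bottleneck autoencoder, assuming sigmoid units, squared error, and plain full-batch gradient descent; the layer sizes, learning rate, and function names are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=2, lr=0.5, epochs=5000, seed=0):
    """Train input -> bottleneck -> input reconstruction by gradient descent."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)                 # bottleneck activations
        y = sigmoid(h @ W2 + b2)                 # reconstruction of the input
        err = y - X                              # the target is the input itself
        d_out = err * y * (1 - y)                # backprop through output sigmoid
        d_hid = (d_out @ W2.T) * h * (1 - h)     # backprop through hidden sigmoid
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)
        W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(0)
    return W1, b1, W2, b2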
9. Autoencoder and the 1-of-N Task
- Input/output vectors
- How many hidden units are required to perform this task? (see below)
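One way to see the answer: with roughly binary hidden activations, the bottleneck only has to give each of the N one-hot patterns a distinct code, so about ceil(log2 N) hidden units suffice (the classic 8-3-8 encoder problem). A small illustration; the values of N are just examples:

```python
import math

# Hidden units needed to give each of N one-hot patterns a distinct
# (roughly binary) code in the bottleneck.
for N in (4, 8, 16, 64):
    print(f"N = {N:3d}  ->  about {math.ceil(math.log2(N))} hidden units")
```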
10. When To Stop Training
1. Train n epochs; lower the learning rate; train m more epochs
   - bad idea: can't assume a one-size-fits-all approach
2. Error-change criterion
   - stop when the error isn't dropping
   - My recommendation: a criterion based on the % drop over a window of, say, 10 epochs (sketched below)
     - 1 epoch is too noisy
     - an absolute error criterion is too problem dependent
   - Karl's idea: train for a fixed number of epochs after the criterion is reached (possibly with a lower learning rate)
NOTE: these belong in practical_advice.pptx. Move after 2015.
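One way to code the recommended error-change criterion, assuming you record the training error once per epoch; the 10-epoch window and 1% threshold are illustrative settings:

```python
def should_stop(error_history, window=10, min_rel_drop=0.01):
    """Stop when the training error has dropped by less than min_rel_drop
    (e.g., 1%) over the last `window` epochs."""
    if len(error_history) <= window:
        return False
    old, new = error_history[-window - 1], error_history[-1]
    return (old - new) < min_rel_drop * old
```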
11. When To Stop Training
3. Weight-change criterion
   - Compare the weights at epochs t-10 and t and test whether each weight has changed by less than some tolerance (see the sketch below)
   - Don't base the test on the length of the overall weight-change vector
   - Possibly express the change as a percentage of the weight
   - Be cautious: small weight changes at critical points can result in a rapid drop in error
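A sketch of that per-weight test, assuming the weights are snapshotted 10 epochs apart and flattened into a single numpy array; the tolerances are illustrative:

```python
import numpy as np

def weights_converged(w_old, w_new, rel_tol=1e-3, abs_floor=1e-6):
    """Compare weights saved at epochs t-10 and t. The test is applied per
    weight, as a fraction of that weight's magnitude, rather than to the
    length of the overall weight-change vector."""
    rel_change = np.abs(w_new - w_old) / np.maximum(np.abs(w_old), abs_floor)
    return np.all(rel_change < rel_tol)
```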
12. Setting Model Hyperparameters
- How do you select the appropriate model size, i.e., # of hidden units, # of layers, connectivity, etc.?
- Validation method (sketched below)
  - split the training set into two parts, T and V
  - train many different architectures on T
  - choose the architecture that minimizes error on V
- Fancy Bayesian optimization methods are starting to become popular
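A sketch of the validation method; `train_fn` and `error_fn` are hypothetical placeholders for whatever training and evaluation code you already have, and here only the number of hidden units is varied:

```python
import numpy as np

def select_architecture(X, y, candidate_sizes, train_fn, error_fn,
                        val_frac=0.2, seed=0):
    """Hold out a validation split V, train each candidate architecture on
    the remaining part T, and keep the one with the lowest error on V."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val, tr = idx[:n_val], idx[n_val:]
    best_size, best_err = None, np.inf
    for n_hidden in candidate_sizes:
        model = train_fn(X[tr], y[tr], n_hidden)   # train on T
        err = error_fn(model, X[val], y[val])      # evaluate on V
        if err < best_err:
            best_size, best_err = n_hidden, err
    return best_size, best_err
```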
13. The Danger Of Minimizing Network Size
- My sense is that local optima arise only if you use a highly constrained network
  - minimum number of hidden units
  - minimum number of layers
  - minimum number of connections
  - XOR example?
- Having spare capacity in the net means there are many equivalent solutions to training
  - e.g., if you have 10 hidden units and need only 2, there are 45 equivalent solutions (see below)
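The 45 is just the number of ways to choose which 2 of the 10 hidden units carry the solution:

```python
from math import comb

# 10 hidden units available, only 2 needed: any pair of them can do the job.
print(comb(10, 2))  # 45
```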
14. Regularization Techniques
- Instead of starting with the smallest net possible, use a larger network and apply various tricks to avoid using the full network capacity
- 7 ideas to follow…
15. Regularization Techniques
1. Early stopping
   - Rather than training the network until the error converges, stop training early
   - Rumelhart: hidden units all go after the same source of error initially -> redundancy
   - Hinton: weights start small and grow over training
     - when the weights are small, the model is mostly operating in its linear regime (illustrated below)
   - Dangerous: very dependent on the training algorithm
     - e.g., what would happen with random weight search?
   - While probably not the best technique for controlling model complexity, it does suggest that you shouldn't obsess over finding a minimum-error solution.
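A small numeric illustration of the linear-regime point, using tanh as the squashing function; the weight values are arbitrary:

```python
import numpy as np

# While weights are small, a squashing unit operates in the nearly linear
# part of its nonlinearity, so the network's effective complexity grows as
# the weights grow during training.
x = np.linspace(-1, 1, 5)
for w in (0.01, 0.1, 1.0, 5.0):
    z = w * x
    gap = np.max(np.abs(np.tanh(z) - z))   # deviation from linearity
    print(f"w = {w:5.2f}   max |tanh(wx) - wx| = {gap:.4f}")
```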
16. Regularization Techniques
2. Weight penalty terms (gradient forms sketched below)
   - L2 weight decay
   - L1 weight decay
   - weight elimination
   - See Reed (1993) for a survey of 'pruning' algorithms
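The penalty terms and the gradients they add to each weight update, in a sketch using the standard forms; lam and w0 are illustrative hyperparameters, and the weight-elimination term is the usual (w/w0)^2 / (1 + (w/w0)^2) penalty:

```python
import numpy as np

def penalty_gradient(w, kind="l2", lam=1e-4, w0=1.0):
    """Gradient contribution of common weight penalties.
       L2 / weight decay : lam * sum(w^2)                 -> grad 2*lam*w
       L1                : lam * sum(|w|)                 -> grad lam*sign(w)
       weight elimination: lam * sum((w/w0)^2 / (1+(w/w0)^2))"""
    if kind == "l2":
        return 2 * lam * w
    if kind == "l1":
        return lam * np.sign(w)
    if kind == "elimination":
        r = (w / w0) ** 2
        return lam * (2 * w / w0 ** 2) / (1 + r) ** 2
    raise ValueError(kind)

# In a training step the penalty gradient is simply added to the error gradient:
#   w -= lr * (error_grad + penalty_gradient(w, "l2"))
```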
17. Regularization Techniques
3. Hard constraint on weights
   - Ensure that for every unit the squared length of its incoming weight vector stays below some limit c
   - If the constraint is violated, rescale all of that unit's incoming weights so the constraint holds (see the sketch below)
   - [See Hinton minute 4:00]
   - I'm not clear why L2 normalization and not L1
4. Injecting noise
   - [See Hinton video]
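A sketch of the rescaling step as described above (a per-unit L2 constraint); the limit c = 3.0 is an arbitrary illustrative value, and the layout of W is an assumption:

```python
import numpy as np

def apply_max_norm(W, c=3.0):
    """Hard constraint on incoming weights: for every unit, keep the squared
    length of its incoming weight vector at or below c; if violated, rescale
    that unit's weights so the constraint holds exactly.
    W is (n_inputs, n_units): column j holds unit j's incoming weights."""
    sq_norms = np.sum(W ** 2, axis=0)
    scale = np.where(sq_norms > c,
                     np.sqrt(c / np.maximum(sq_norms, 1e-12)),
                     1.0)
    return W * scale
```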
18. Regularization Techniques
6. Model averaging
   - Ensemble methods
   - Bayesian methods
7. Dropout
   - [watch Hinton video]
19. More On Dropout
- With H hidden units, each of which can be dropped, we have 2^H possible models
- Each of the 2^(H-1) models that include hidden unit h must share the same weights for that unit
  - serves as a form of regularization
  - makes the models cooperate
- Including all hidden units at test time with a scaling of 0.5 is equivalent to computing the geometric mean of all 2^H models (sketched below)
  - exact equivalence with one hidden layer
  - "pretty good approximation" according to Geoff with multiple hidden layers
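A dropout sketch for a single sigmoid hidden layer, following the description above; the drop probability of 0.5 comes from the slide, while the function name and array layout are illustrative:

```python
import numpy as np

def hidden_activations(x, W, b, rng=None, drop_prob=0.5, training=True):
    """Dropout for one hidden layer of sigmoid units.
    Training: each hidden unit is shut off with probability drop_prob,
    sampling one of the 2^H sub-models, all of which share the same weights.
    Test: all units are kept and their outputs are scaled by (1 - drop_prob),
    which for a single hidden layer matches the geometric mean of the
    sub-models' predictions (an approximation with more layers)."""
    h = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    if training:
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) > drop_prob   # keep with prob 1 - drop_prob
        return h * mask
    return h * (1.0 - drop_prob)
```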
20. Two Problems With Deep Networks
- Credit assignment problem
- Vanishing error gradients
  - note that the logistic derivative y(1-y) ≤ 0.25 (see below)
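A quick check of how fast that bound compounds across layers, ignoring the weights themselves:

```python
# The logistic derivative y*(1 - y) peaks at 0.25 (when y = 0.5). Ignoring
# the weights, an error signal backpropagated through L logistic layers is
# scaled by at most 0.25**L:
for L in (1, 2, 5, 10):
    print(L, 0.25 ** L)
# After 10 layers, at most ~1e-6 of the error signal survives.
```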
21. Unsupervised Pretraining
- Suppose you have access to a lot of unlabeled data in addition to labeled data
  - "Semisupervised learning"
- Can we leverage unlabeled data to initialize network weights?
  - alternative to small random weights
  - requires an unsupervised procedure: the autoencoder
- With good initialization, we can minimize the credit assignment problem.
22. Autoencoder (recap)
- Self-supervised training procedure
- Given a set of input vectors (no target outputs)
- Map the input back to itself via a hidden-layer bottleneck
- How to achieve the bottleneck?
  - Fewer neurons
  - Sparsity constraint
  - Information-transmission constraint (e.g., add noise to a unit, or shut units off randomly, a.k.a. dropout)
24. Stacked Autoencoders
- [figure: autoencoders trained layer by layer, with the encoder weights copied into a deep network]
- Note that the decoders can be stacked to produce a generative model of the domain
- (a layer-wise pretraining sketch follows below)
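A sketch of greedy layer-wise pretraining with stacked autoencoders; `train_autoencoder` is a placeholder for any single-hidden-layer autoencoder trainer (such as the earlier sketch), and the other names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain(X, layer_sizes, train_autoencoder):
    """Train one autoencoder per layer: each layer learns to reconstruct the
    previous layer's activations. The encoder weights are kept to initialize
    the deep network (then fine-tuned with backprop); the decoders could be
    stacked in reverse to form a generative model of the domain."""
    encoders, data = [], X
    for n_hidden in layer_sizes:
        W_enc, b_enc, W_dec, b_dec = train_autoencoder(data, n_hidden)
        encoders.append((W_enc, b_enc))
        data = sigmoid(data @ W_enc + b_enc)   # input to the next autoencoder
    return encoders
```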
25. Rectified Linear Units
- Version 1
- Version 2
- Do we need to worry about z = 0?
- Do we need to worry about the lack of gradient for z < 0?
- Note the sparsity of the activation pattern
- Note that the error derivative is not squashed
- (a plain ReLU sketch follows below)
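For concreteness, a plain rectified linear unit and its derivative; the slides' "Version 1" and "Version 2" formulas aren't reproduced in the text, so this shows only the basic form:

```python
import numpy as np

def relu(z):
    # Passes positive inputs unchanged and zeros the rest, so activation
    # patterns are sparse and the error derivative is not squashed.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for z > 0 and 0 for z < 0. At exactly z = 0 the function
    # is not differentiable, but the point has measure zero; in practice
    # either 0 or 1 is used there.
    return (z > 0).astype(float)
```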
26. Rectified Linear Units
- Hinton argues that this is a form of model averaging
27. Hinton Bag Of Tricks
- Deep network
- Unsupervised pretraining if you have lots of data
- Weight initialization to prevent gradients from vanishing or exploding (one common recipe sketched below)
- Dropout training
- Rectified linear units
- Convolutional NNs if there are spatial/temporal patterns
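For the weight-initialization item, the slide doesn't name a specific scheme; one common recipe is to scale the initial weights by the fan-in so that each unit's net input starts with roughly unit variance:

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Draw initial weights with standard deviation proportional to
    1/sqrt(fan_in), so net inputs start near unit variance and gradients
    are less likely to vanish or explode at the start of training."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```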