CS 189 Brian Chu brian.c@berkeley.edu Slides at: brianchu.com/ml/
Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge)
Agenda NEURAL NETS WOOOHOOO
Terminology
Unit – each “neuron”
2-layer neural network – a neural network with one hidden layer (what you’re building)
Epoch – one pass through the entire training data
For SGD, this is N iterations
For mini-batch gradient descent (batch size B), this is N/B iterations
First off…
Many of you will struggle to even finish. In that case, you can ignore my bells and whistles.
For reference, my 2.6GHz quad-core, 16GB RAM MacBook takes ~1.5 hours to train to ~96-97% accuracy.
First off… Add a signal handler + snapshotting
Implement functionality so that when you press Ctrl-C (on Unix systems this sends the interrupt signal, SIGINT), your code saves a snapshot of the training state (current iteration, decayed learning rate, momentum, current weights, anything else), then exits. Look into the Python “signal” and “pickle” libraries.
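A minimal sketch of what this could look like, using the signal and pickle libraries; the state names (iteration, learning_rate, velocity, weights) are placeholders for whatever your own training loop actually tracks:

```python
import pickle
import signal
import sys

def save_snapshot(state, path="snapshot.pkl"):
    # Serialize the training-state dict to disk so training can be resumed later.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def install_snapshot_handler(get_state):
    # On SIGINT (Ctrl-C), save whatever get_state() returns, then exit cleanly.
    def handler(signum, frame):
        save_snapshot(get_state())
        print("Snapshot saved; exiting.")
        sys.exit(0)
    signal.signal(signal.SIGINT, handler)

# In the training script (placeholder names from a hypothetical loop):
# install_snapshot_handler(lambda: {"iteration": iteration,
#                                   "learning_rate": learning_rate,
#                                   "velocity": velocity,
#                                   "weights": weights})
```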
Art of tuning
Training neural nets is an art, not a science.
Cross-validation? Pfffft.
“I used to tune that parameter but I’m too lazy and I don’t bother any more” – a grad student, talking about the weight-decay hyperparameter.
There are way too many hyperparameters for you to tune, and training is too slow for cross-validation to be worth it. For most hyperparameters, just use what is standard and spend your time elsewhere.
Knobs
Learning: SGD / mini-batch / full batch, momentum, RMSprop, Adagrad, NAG, etc. How to decay?
Activations: ReLU, tanh, sigmoid
Loss: MSE or cross-entropy (with softmax)
Regularization: L1, L2, max-norm, Dropout, DropConnect
Convolutional layers
Initialization: Xavier, Gaussian, etc.
When to stop? Early stop? Stopping rule? Or just run forever?
I recommend
(* = what everyone in the literature, and in practice, uses)
Cross-entropy loss with softmax*
Only decay the learning rate per epoch, or even less often* (e.g. don’t just divide by the iteration count). An epoch is one training pass through the entire data, so only decay after a round of seeing every data point. Note: if your mini-batch size is 10 and N = 20, then one epoch is 2 iterations.
Momentum learning rate* (maybe RMSProp?)
Mini-batch (batch size somewhere in between)*
No regularization
Gaussian initialization (mean 0, std. dev. 0.01)*
Run forever, take a snapshot when you feel like stopping (seriously!)
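To make the starred recommendations concrete, here is a minimal numpy sketch of softmax + cross-entropy, Gaussian initialization, a momentum update, and per-epoch decay; the shapes (784 inputs, 10 classes), decay factor, and momentum coefficient are illustrative assumptions, not required values.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels_onehot, eps=1e-12):
    # Mean negative log-likelihood of the correct class.
    return -np.mean(np.sum(labels_onehot * np.log(probs + eps), axis=1))

def momentum_step(W, grad, velocity, lr, mu=0.9):
    # Classical momentum: accumulate a velocity, then move the weights along it.
    velocity = mu * velocity - lr * grad
    return W + velocity, velocity

# Gaussian initialization (mean 0, std. dev. 0.01); 784 inputs, 10 outputs
# are illustrative shapes.
rng = np.random.RandomState(0)
W = rng.normal(0.0, 0.01, size=(784, 10))
velocity = np.zeros_like(W)

# Decay once per epoch (after a full pass over the data), not per iteration:
#   for epoch in range(num_epochs):
#       for X_batch, Y_batch in minibatches(X, Y):   # hypothetical helper
#           ...forward, backward, then momentum_step(...)
#       lr *= 0.9                                    # assumed decay factor
```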
Activation functions
tanh >>> sigmoid (tanh is just a scaled and shifted sigmoid anyway)
ReLU = stacked sigmoid
ReLU is basically standard in computer vision
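For reference, here is a small numpy sketch of the three activations and their derivatives (the derivatives are what backprop needs); this is generic, not assignment-specific, code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)              # note: tanh(x) = 2*sigmoid(2x) - 1

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(x.dtype)
```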
Almost certainly will improve accuracy, but total overkill
Considered “standard” today:
Convolutional layers (with max-pooling)
Dropout (DropConnect?)
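If you do try dropout, a minimal sketch of the inverted-dropout forward pass looks something like this; the keep probability p is an illustrative choice, not a recommendation.

```python
import numpy as np

def dropout_forward(activations, p=0.5, rng=np.random):
    # Drop each unit with probability 1 - p, scaling by 1/p so the expected
    # activation is unchanged and no rescaling is needed at test time.
    mask = (rng.rand(*activations.shape) < p) / p
    return activations * mask, mask   # keep the mask to reuse in backprop
```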
If using numpy
Not a single for-loop should be in your code – vectorize everything.
Avoid unnecessary memory allocation: use the “out=” keyword argument to re-use numpy arrays.
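For example (array names and shapes here are hypothetical):

```python
import numpy as np

X = np.random.randn(50, 784)      # one mini-batch
W = np.random.randn(784, 200)
Z = np.empty((50, 200))           # allocate once, outside the training loop

np.dot(X, W, out=Z)               # writes X @ W into Z without a new allocation
np.maximum(Z, 0.0, out=Z)         # in-place ReLU, again no new array
```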
May want to consider
A faster implementation than Python w/ numpy: Cython, Java, Go, Julia, etc.
Honestly, if you want to win…
Write a CUDA or OpenCL implementation (if you have a compatible graphics card) and train for many days. (You might consider adding regularization in this case.)
I didn’t do this: I used other generic tricks that you can read about in the literature.
Debugging
Check your dimensions.
Check your numpy dtypes.
Check your derivatives – comment all your backprop steps.
Numerical gradient calculator (see the sketch below):
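One way to fill this in is a centered-difference check against your analytic backprop gradient; loss_fn here is an assumed interface that maps a weight array to a scalar loss.

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-5):
    # Centered differences: perturb one weight at a time and difference the loss.
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + eps
        loss_plus = loss_fn(w)
        w.flat[i] = old - eps
        loss_minus = loss_fn(w)
        w.flat[i] = old                      # restore the original weight
        grad.flat[i] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

# This is slow, so compare against backprop on a tiny network or a random
# handful of weights, e.g.:
#   assert np.allclose(numerical_gradient(loss_fn, w), analytic_grad, atol=1e-6)
```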
Connection with SVMs / linear classifiers with kernels
A kernel SVM can be thought of as a 2-layer network:
1st layer: |units| = |support vectors|; the value of unit i is K(query, train(i))
2nd layer: a linear combination of the first layer
Simplest “training” for the 1st layer: store all training points as templates.
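In code, that two-layer view looks roughly like this; the RBF kernel and the names (support_vectors, alphas, b) are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.05):
    # An example kernel; any valid kernel K(., .) works here.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_svm_decision(query, support_vectors, alphas, b, kernel=rbf_kernel):
    # "1st layer": one unit per support vector, with value K(query, sv_i).
    hidden = np.array([kernel(query, sv) for sv in support_vectors])
    # "2nd layer": a linear combination of the hidden units.
    return hidden @ alphas + b
```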