Fitting Neural Networks

General workflow of back-propagation:
- Forward pass: fix the current weights and compute the values at each unit, from the hidden layer up to the fitted outputs.
- Backward pass: compute the errors at the output units, back-propagate them to obtain the errors at the hidden units, and use both to compute the gradients for the updates.
- Update the weights with a gradient-descent step (see the sketch below).
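A minimal numpy sketch of one forward/backward cycle for a single-hidden-layer network with sigmoid hidden units (not from the slides; the names alpha, beta and the squared-error loss are illustrative assumptions):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative shapes: p inputs, M hidden units, K outputs.
rng = np.random.default_rng(0)
p, M, K = 4, 3, 2
x = rng.normal(size=p)                        # one observation
y = rng.normal(size=K)                        # its target
alpha = rng.normal(scale=0.1, size=(M, p))    # input-to-hidden weights
beta = rng.normal(scale=0.1, size=(K, M))     # hidden-to-output weights

# Forward pass: fix the weights, compute hidden values Z and the fit f(x).
Z = sigmoid(alpha @ x)                        # hidden-unit activations
f = beta @ Z                                  # linear output units (regression)

# Backward pass: output errors, then back-propagated hidden errors.
delta = f - y                                 # d(loss)/d(output) for squared error
s = (beta.T @ delta) * Z * (1.0 - Z)          # chain rule through the sigmoid

# Use both to form the gradients, then update the weights.
grad_beta = np.outer(delta, Z)
grad_alpha = np.outer(s, x)
lr = 0.1
beta -= lr * grad_beta
alpha -= lr * grad_alpha
```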
Fitting Neural Networks

- Can use parallel computing: each hidden unit passes and receives information only to and from the units that share a connection.
- Online training: the fitting scheme allows the network to handle very large training sets, and to update the weights as new observations come in (see the sketch below).
- Training a neural network is an "art":
  - the model is generally overparametrized;
  - the optimization problem is nonconvex and unstable.
- A neural network model is a black box and hard to interpret directly.
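A sketch of online (stochastic) training, updating the weights one observation at a time (illustrative code, reusing the setup of the previous sketch):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Online training: weights are updated one observation at a time, so very
# large training sets can be streamed, and new observations can be folded
# in as they arrive.
def online_step(alpha, beta, x, y, lr=0.05):
    Z = sigmoid(alpha @ x)                    # forward pass
    f = beta @ Z
    delta = f - y                             # backward pass
    s = (beta.T @ delta) * Z * (1.0 - Z)
    beta -= lr * np.outer(delta, Z)           # immediate weight update
    alpha -= lr * np.outer(s, x)
    return alpha, beta

rng = np.random.default_rng(0)
p, M, K = 4, 3, 2
alpha = rng.normal(scale=0.1, size=(M, p))
beta = rng.normal(scale=0.1, size=(K, M))
for xi, yi in zip(rng.normal(size=(100, p)), rng.normal(size=(100, K))):
    alpha, beta = online_step(alpha, beta, xi, yi)
```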
Fitting Neural Networks: Initialization

- When the weight vectors are close to length zero, all of the sigmoid inputs are close to zero, where the sigmoid curve is close to linear, so the overall model is close to linear: a relatively simple model. (This can be seen as a regularized solution.)
- Start with very small weights; let the neural network learn the necessary nonlinear relations from the data (see the sketch below).
- Starting with large weights often leads to poor solutions.
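A small numeric illustration of why near-zero weights give a nearly linear model; the uniform [-0.7, 0.7] initialization range is an assumption (a choice sometimes used in practice), not from the slides:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Near zero, the sigmoid is almost linear: sigmoid(v) ~ 0.5 + v/4.
v = np.linspace(-0.1, 0.1, 201)
print(np.max(np.abs(sigmoid(v) - (0.5 + v / 4.0))))   # ~2e-5: essentially linear

# So starting weights drawn near zero (e.g., uniform on [-0.7, 0.7])
# make the initial network close to an overall linear model.
w_init = np.random.default_rng(1).uniform(-0.7, 0.7, size=(3, 4))
```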
Fitting Neural Networks: Overfitting

- The model is too flexible, involving too many parameters, and may easily overfit the data.
- Early stopping: do not let the algorithm converge. Because the model starts out nearly linear, this yields a regularized solution (shrunk toward the linear model).
- Explicit regularization ("weight decay"): minimize R(θ) + λJ(θ), where J(θ) = Σβ² + Σα² sums the squared weights. The related weight-elimination penalty, J(θ) = Σβ²/(1+β²) + Σα²/(1+α²), tends to shrink smaller weights more.
- Cross-validation is used to estimate λ.
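A minimal sketch of both penalized updates (illustrative names; grad_beta stands in for a gradient computed by back-propagation):

```python
import numpy as np

# Weight decay: penalizing R(theta) + lam * J(theta) with
# J(theta) = sum(beta**2) + sum(alpha**2) adds 2 * lam * w to each
# weight's gradient, so every step shrinks the weights toward zero
# (i.e., toward the linear model).
rng = np.random.default_rng(0)
beta = rng.normal(size=(2, 3))
grad_beta = rng.normal(size=(2, 3))       # stand-in for a backprop gradient
lr, lam = 0.1, 0.01
beta -= lr * (grad_beta + 2 * lam * beta)

# Weight-elimination variant, J = sum(w**2 / (1 + w**2)): its gradient,
# 2*w / (1 + w**2)**2, vanishes for large weights, so smaller weights
# are shrunk relatively more than under plain weight decay.
def elim_grad(w):
    return 2 * w / (1 + w**2) ** 2

beta -= lr * (grad_beta + lam * elim_grad(beta))
```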
Fitting Neural Networks: Number of Hidden Units and Layers

- Too few: the model might not have enough flexibility to capture the nonlinearities in the data.
- Too many: overly flexible, BUT the extra weights can be shrunk toward zero if appropriate regularization is used. ✔
- Typical range: 5-100 hidden units.
- Cross-validation can be used to choose the number, though it may not be necessary if cross-validation is already used to tune the regularization parameter (see the sketch below).
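A sketch of this tuning strategy using scikit-learn (an illustration assuming that library; not from the slides): fix a generous number of hidden units and let cross-validation choose the weight-decay penalty `alpha`.

```python
# Generous hidden layer + CV over the weight-decay penalty `alpha`.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
search = GridSearchCV(
    MLPRegressor(hidden_layer_sizes=(50,), activation="logistic",
                 max_iter=5000, random_state=0),
    param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```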
Examples

"A radial function is in a sense the most difficult for the neural net, as it is spherically symmetric and with no preferred directions."
Going beyond a single hidden layer

A benchmark problem: classification of handwritten numerals.
Going beyond a single hidden layer

[Figure: networks with local 5×5 and 3×3 receptive fields, with and without weight sharing.]
- No weight sharing: local receptive fields, each unit with its own weights.
- Weight sharing: each of the units in a single 8 × 8 feature map shares the same set of nine weights (but has its own bias parameter), so the same operation is applied to different parts of the image (see the sketch below).
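A minimal numpy sketch of weight sharing: one 3×3 kernel (nine shared weights) slid over the image, with a per-unit bias as the text describes. A 10×10 input and stride-1 valid convolution are assumptions chosen so the output is an 8×8 feature map:

```python
import numpy as np

# The same nine weights are applied to every image patch; only the bias
# differs from unit to unit in the feature map.
def feature_map(image, kernel, biases):
    k = kernel.shape[0]
    H, W = image.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel) + biases[i, j]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(10, 10))
kernel = rng.normal(size=(3, 3))                 # the nine shared weights
biases = rng.normal(scale=0.1, size=(8, 8))      # one bias per unit
print(feature_map(image, kernel, biases).shape)  # (8, 8)
```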
Deep learning

Data → Features → Model. Finding the correct features is critical to success:
- kernels in SVM;
- hidden-layer nodes in a neural network;
- predictor combinations in RF.
A successful machine learning technology needs to be able to extract useful features (data representations) on its own.
Deep learning methods:
- composition of multiple non-linear transformations of the data (see the sketch below);
- goal: more abstract, and ultimately more useful, representations.
IEEE Trans Pattern Anal Mach Intell Aug;35(8):
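A toy sketch of such a composition (illustrative layer sizes and random weights; in practice the weights are learned):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Each layer is a nonlinear transformation of the previous layer's
# representation; stacking them yields increasingly abstract features.
rng = np.random.default_rng(0)
x = rng.normal(size=16)                   # raw data
W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = rng.normal(scale=0.5, size=(4, 8))
W3 = rng.normal(scale=0.5, size=(2, 4))
h1 = sigmoid(W1 @ x)                      # low-level features
h2 = sigmoid(W2 @ h1)                     # mid-level features
h3 = sigmoid(W3 @ h2)                     # high-level representation
```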
Deep learning

[Figure from IEEE Trans Pattern Anal Mach Intell Aug;35(8):]
Deep learning

Has to learn high-level abstract concepts from data, e.g.:
- the wheels of a car;
- the eyes, nose, etc. of a face.
Must be very resistant to irrelevant information, e.g. a car's orientation.
Nature 505, 146–148 (09 January 2014)
Deep learning

Major areas of application:
- Speech recognition and signal processing
- Object recognition
- Natural language processing
- …
So far in bioinformatics:
- The training data size (subjects) is still too small compared to the number of variables (the N << p issue).
- Deep learning could be applied when human selection of variables is done first.
- Biological knowledge, in the form of existing networks, is already used explicitly instead of being learned from data; such approaches are hard to beat with a limited amount of data.
IEEE Trans Pattern Anal Mach Intell Aug;35(8):