Neural networks (3) Regularization Autoencoder

1 Neural networks (3)
Regularization
Autoencoder
Recurrent neural network
References: Goodfellow, Bengio, Courville, "Deep Learning"; Charte, "A practical tutorial on autoencoders for nonlinear feature fusion"; Le, "A tutorial on Deep Learning".

2 More on regularization
Neural network models are extremely versatile – potential for overfitting.
Regularization aims at reducing the expected prediction error (EPE), not the training error.
Too flexible: high variance, low bias. Too rigid: low variance, high bias.

3 More on regularization
Parameter Norm Penalties
The regularized loss function: J~(θ; X, y) = J(θ; X, y) + α Ω(θ)
L2 Regularization: Ω(θ) = (1/2) ||w||²
The gradient becomes: ∇w J~ = α w + ∇w J(w; X, y)
The update step becomes: w ← (1 − εα) w − ε ∇w J(w; X, y), i.e. the weights are shrunk multiplicatively before each gradient step (a sketch follows).
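A minimal numerical sketch of this update (not from the slides; the toy linear model, learning rate lr, and penalty strength alpha are illustrative):

```python
import numpy as np

# Sketch: gradient descent with an L2 penalty (weight decay) on a toy linear model.
# Regularized loss: J~(w) = J(w) + (alpha/2) * ||w||^2, so the gradient gains an
# extra alpha*w term and each update shrinks the weights multiplicatively.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
alpha, lr = 0.1, 0.01                                      # penalty strength, learning rate
for step in range(1000):
    grad_J = 2 * X.T @ (X @ w - y) / len(y)                # gradient of the data term J
    # w <- w - lr * (grad_J + alpha * w) = (1 - lr*alpha) * w - lr * grad_J
    w = (1 - lr * alpha) * w - lr * grad_J
print(w)                                                   # shrunk toward zero vs. plain least squares
```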

4 More on regularization
L1 Regularization
The loss function: J~(w; X, y) = J(w; X, y) + α ||w||₁
The gradient: ∇w J~ = α sign(w) + ∇w J(w; X, y)
Dataset Augmentation
Generate new (x, y) pairs by transforming the x inputs in the training set.
Ex: in object recognition,
- translating the training images a few pixels in each direction
- rotating or scaling the image
(A toy augmentation sketch follows.)
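As an illustration of dataset augmentation by translation (a sketch, not code from the slides; the helper name augment_with_shifts and the wrap-around shift are illustrative choices):

```python
import numpy as np

def augment_with_shifts(images, labels, max_shift=2):
    """Create new (x, y) pairs by translating each image a few pixels.

    images: array of shape (n, height, width); labels: array of shape (n,).
    Returns the original data plus one randomly shifted copy of each image.
    """
    rng = np.random.default_rng(0)
    shifted = []
    for img in images:
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted.append(np.roll(img, shift=(dy, dx), axis=(0, 1)))  # wrap-around shift
    return (np.concatenate([images, np.stack(shifted)]),
            np.concatenate([labels, labels]))                      # labels are unchanged

# usage with toy data
x = np.zeros((10, 28, 28)); y = np.arange(10)
x_aug, y_aug = augment_with_shifts(x, y)
print(x_aug.shape, y_aug.shape)   # (20, 28, 28) (20,)
```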

5 More on regularization
Early Stopping
- treat the number of training steps as another hyperparameter
- requires a validation set
- can be combined with other regularization strategies (a minimal loop is sketched below)
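A minimal early-stopping loop, assuming hypothetical train_step and validation_loss callables supplied by the rest of the training code:

```python
# Minimal early-stopping sketch: train_step and validation_loss are hypothetical
# stand-ins for one optimization step and a held-out validation evaluation.
def early_stopping_train(train_step, validation_loss, patience=10, max_steps=10_000):
    best_loss, best_step, steps_since_best = float("inf"), 0, 0
    for step in range(max_steps):
        train_step()                      # one optimization step on the training data
        loss = validation_loss()          # monitor the validation set
        if loss < best_loss:
            best_loss, best_step = loss, step
            steps_since_best = 0          # remember the best point seen so far
        else:
            steps_since_best += 1
            if steps_since_best >= patience:
                break                     # stop once validation loss stops improving
    return best_step                      # the chosen "number of training steps" hyperparameter
```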

6 More on regularization

7 More on regularization
Parameter Sharing
- CNNs are an example.
- Grows a large network without dramatically increasing the number of unique model parameters, and without requiring a corresponding increase in training data.
Sparse Representation
- L1 penalization of the weights induces a sparse parametrization.
- Representational sparsity means a representation in which many of the elements of the representation are zero.
- Achieved by an L1 penalty on the elements of the representation, or a hard constraint on the activation values (a toy penalty is sketched below).
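A toy sketch of representational sparsity via an L1 penalty on the hidden activations (the one-layer network, lam, and shapes are illustrative, not from the slides):

```python
import numpy as np

# Sketch: representational sparsity via an L1 penalty on the hidden activations h,
# J~ = J + lam * sum(|h_i|).  Note the penalty is on the representation, not the weights.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
W, b = rng.normal(size=(8, 16)) * 0.1, np.zeros(8)

h = np.maximum(0, W @ x + b)          # ReLU hidden representation
lam = 0.01
penalty = lam * np.sum(np.abs(h))     # added to the task loss during training
# Its (sub)gradient w.r.t. h is lam * sign(h); over training this pushes many
# elements of h toward exactly zero, i.e. a sparse representation.
grad_penalty_h = lam * np.sign(h)
print(penalty, grad_penalty_h)
```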

8 More on regularization
(Figure: sparse parametrization vs. representational sparsity.)

9 More on regularization
Dropout
Bagging (bootstrap aggregating) reduces generalization error by combining several models: train several different models separately, then have all of the models vote on the output for test examples.

10 More on regularization
Dropout
Neural networks have many solution points, due to:
- random initialization
- random selection of minibatches
- differences in hyperparameters
- different outcomes of non-deterministic implementations
so different members of an ensemble make partially independent errors.
However, training multiple models is impractical when each model is a large neural network.

11 More on regularization
Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
- Use a minibatch-based learning algorithm.
- Each time, randomly sample a different binary mask vector μ to apply to all of the input and hidden units in the network. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter.
- Run forward propagation, back-propagation, and the learning update as usual (a single masked step is sketched below).
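A sketch of one dropout training step on a toy two-layer network; p_include, the shapes, and the ReLU nonlinearity are illustrative choices:

```python
import numpy as np

# Sketch of one dropout training step: sample a binary mask for the input and
# hidden units, multiply it in, then run forward/backward as usual.
rng = np.random.default_rng(0)
p_include = 0.8                               # probability a unit is kept (hyperparameter)

x = rng.normal(size=20)
W1, W2 = rng.normal(size=(50, 20)) * 0.1, rng.normal(size=(1, 50)) * 0.1

mask_in = rng.binomial(1, p_include, size=x.shape)     # binary mask for the input units
x_drop = x * mask_in
h = np.maximum(0, W1 @ x_drop)                          # hidden layer
mask_h = rng.binomial(1, p_include, size=h.shape)       # binary mask for the hidden units
h_drop = h * mask_h
y_hat = W2 @ h_drop                                     # forward pass with masks applied
# Back-propagation and the parameter update then proceed exactly as usual,
# with the same masks held fixed for this minibatch.
print(y_hat)
```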

12 More on regularization

13 More on regularization

14 More on regularization
Dropout training consists in minimizing E_μ J(θ, μ).
- The expectation contains exponentially many terms: 2^(number of non-output nodes).
- The unbiased estimate is obtained by randomly sampling values of μ.
- Most of the exponentially large number of models are not explicitly trained. A tiny fraction of the possible sub-networks are each trained for a single step, and the parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters.
- "Weight scaling inference rule": approximate the ensemble output p_ensemble by evaluating p(y | x) in one model – the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i (sketched below).
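A sketch of the weight scaling inference rule for the same kind of toy network (all shapes and the shared inclusion probability are illustrative):

```python
import numpy as np

# Weight-scaling inference rule (sketch): at test time keep all units, but multiply
# the weights going out of each unit by its inclusion probability, so pre-activations
# match their expected value under the dropout masks used in training.
rng = np.random.default_rng(0)
p_include = 0.8
x = rng.normal(size=20)
W1, W2 = rng.normal(size=(50, 20)) * 0.1, rng.normal(size=(1, 50)) * 0.1

h = np.maximum(0, (W1 * p_include) @ x)   # scales the weights out of the input units
y_ensemble_approx = (W2 * p_include) @ h  # scales the weights out of the hidden units
print(y_ensemble_approx)                  # single forward pass approximating the ensemble
```

In practice many implementations instead divide activations by p_include during training ("inverted dropout"), which leaves the test-time weights unscaled; the two conventions are equivalent in expectation.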

15 Autoencoder
Seeks data structure within X.
Serves as a good starting point for fitting a CNN.
Maps data {x(1), x(2),…, x(m)} to {z(1), z(2),…, z(m)} of lower dimension, then recovers X from Z.
In the linear case the encoder and decoder are linear maps: z = W1 x and x̂ = W2 z.
Example: data compression.

16 Autoencoder
This is a linear autoencoder. The objective function is the reconstruction error, J(W1, W2) = Σ_i ||x(i) − W2 W1 x(i)||², which can be minimized by stochastic gradient descent (a sketch follows).
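A sketch of this linear autoencoder trained by stochastic gradient descent (toy data, learning rate, and dimensions are illustrative):

```python
import numpy as np

# Sketch of a linear autoencoder x -> z = W1 x -> x_hat = W2 z, trained by SGD
# on the squared reconstruction error ||x - W2 W1 x||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # toy data, 500 points in 10 dimensions
d, k = 10, 3                                   # input dimension, code dimension
W1 = rng.normal(size=(k, d)) * 0.1             # encoder
W2 = rng.normal(size=(d, k)) * 0.1             # decoder
lr = 0.01

for epoch in range(30):
    for x in X:
        z = W1 @ x                             # code of lower dimension
        x_hat = W2 @ z                         # reconstruction
        err = x_hat - x                        # d/dx_hat of ||x_hat - x||^2 is 2*err
        grad_W2 = 2 * np.outer(err, z)
        grad_W1 = 2 * np.outer(W2.T @ err, x)
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1

print(np.mean(np.sum((X - X @ W1.T @ W2.T) ** 2, axis=1)))   # mean reconstruction error
```

In the linear case the optimum is known to span the same subspace as the leading principal components of the data, which is why this setup is a natural example of data compression.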

17 Autoencoder Nonlinear autoencoder – finds nonlinear structures in the data. Multiple layers can extract highly nonlinear structure.

18 Autoencoder
Undercomplete, if the encoding layer has a lower dimensionality than the input.
Overcomplete, if the encoding layer has the same number of units as the input or more (restrictions are needed to avoid simply copying the data).

19 Autoencoder
Purposes:
- nonlinear dimensionality reduction
- feature learning
- de-noising
- generative modeling
Sparse autoencoder: a sparsity penalty on the code layer, often used to learn features for another task such as classification.

20 Autoencoder
We can think of feedforward networks trained by supervised learning as performing a kind of representation learning:
- the last layer is a linear classifier
- the previous layers learn a data representation for the classifier
- hidden layers take on properties that make the classification task easier
Unsupervised learning tries to capture the shape of the input distribution. It can sometimes be useful for another task: supervised learning with the same input domain. Greedy layer-wise unsupervised pretraining can be used.

21 Autoencoder Autoencoder for the pre-training of a classification task:

22 Autoencoder
Denoising autoencoder (DAE): receives a corrupted data point x̃ as input, and predicts the original, uncorrupted data point x as its output. Minimize the reconstruction error L(x, g(f(x̃))) measured against the clean x (a sketch follows).
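A sketch of a single DAE reconstruction-and-loss computation, with Gaussian corruption and a one-layer tanh encoder as illustrative choices:

```python
import numpy as np

# Sketch of one denoising-autoencoder step: corrupt x, encode/decode the corrupted
# copy, and penalize the distance to the ORIGINAL, uncorrupted x.
rng = np.random.default_rng(0)
x = rng.normal(size=20)                       # clean data point
x_tilde = x + 0.3 * rng.normal(size=20)       # corrupted input

W_enc = rng.normal(size=(5, 20)) * 0.1        # encoder f
W_dec = rng.normal(size=(20, 5)) * 0.1        # decoder g
z = np.tanh(W_enc @ x_tilde)                  # code computed from the corrupted input
x_hat = W_dec @ z                             # reconstruction g(f(x_tilde))
loss = np.sum((x_hat - x) ** 2)               # L(x, g(f(x_tilde)))
print(loss)
```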

23 Autoencoder

24 Recurrent neural networks (RNNs)
Recurrent neural networks operate on a sequence of vectors x(t), with the time step index t ranging from 1 to τ.
Challenges:
- A feedforward network that processes sentences of fixed length cannot extract, say, the year, because its weights are tied to input positions: "I went to Nepal in 2009" vs. "In 2009, I went to Nepal."
- τ can be different for different inputs.
Goal: share statistical strength across different sequence lengths and across different positions.

25 Recurrent neural networks
Ex: statistical language modeling – predict the next word given the previous words.
Ex: predicting stock price based on the existing data:

26 Computational Graph Ex:

27 Computational Graph A dynamical system driven by an external signal x(t): the state is updated as h(t) = f(h(t−1), x(t); θ), which an RNN unfolds through time.

28 Recurrent neural networks
Three sets of parameters:
- input-to-hidden weights (W)
- hidden-to-hidden weights (U)
- hidden-to-label weights (V)
Minimize a cost function, e.g. (y − f(x))², to obtain the appropriate weights.
Back-propagation can be used again: "back-propagation through time (BPTT)".

29 Recurrent neural networks
Assume the output is discrete. For each time step from t = 1 to t = τ:
h(t) = tanh(b + U h(t−1) + W x(t)), o(t) = c + V h(t), ŷ(t) = softmax(o(t)).
The input sequence and output sequence are of the same length.
The total loss is the sum of the per-step losses: L = Σ_{t=1}^{τ} L(t) (a forward-pass sketch follows).
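A forward-pass sketch in the slide's notation (W input-to-hidden, U hidden-to-hidden, V hidden-to-label); the tanh/softmax choices, shapes, and toy sequence are illustrative:

```python
import numpy as np

# Sketch of the forward pass of a simple RNN classifier and its total loss.
rng = np.random.default_rng(0)
d_in, d_h, d_out, tau = 4, 8, 3, 5
W = rng.normal(size=(d_h, d_in)) * 0.1        # input-to-hidden
U = rng.normal(size=(d_h, d_h)) * 0.1         # hidden-to-hidden
V = rng.normal(size=(d_out, d_h)) * 0.1       # hidden-to-label
b, c = np.zeros(d_h), np.zeros(d_out)

xs = rng.normal(size=(tau, d_in))             # input sequence x(1..tau)
ys = rng.integers(0, d_out, size=tau)         # target labels y(1..tau)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

h = np.zeros(d_h)
total_loss = 0.0
for t in range(tau):
    h = np.tanh(b + U @ h + W @ xs[t])        # hidden state update
    y_hat = softmax(c + V @ h)                # predicted label distribution at step t
    total_loss += -np.log(y_hat[ys[t]])       # per-step negative log-likelihood L(t)
print(total_loss)                             # L = sum of the per-step losses
# BPTT then runs back-propagation right to left through this unrolled loop.
```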

30 Recurrent neural networks
Gradient computation involves:
- a forward propagation pass moving left to right through the unrolled graph
- a backward propagation pass moving right to left through the graph (back-propagation through time, or BPTT)

31 Recursive Neural Networks
A generalization of recurrent networks, with a different kind of computational graph: a deep tree, rather than the chain-like structure of RNNs. The depth can be reduced from τ to O(log τ). The tree structure can be imposed or learned.

32 Recursive Neural Networks
An example. Socher et al. “Parsing Natural Scenes and Natural Language with Recursive Neural Networks”

33 Recursive Neural Networks
An example:
- data over-segmentation
- feature extraction
- map features into the "semantic" n-dimensional space (by a neural network)
- the network computes the potential parent representation for these possible child nodes
- object adjacency matrix

34 Deep learning: major areas of application
Speech recognition and signal processing, object recognition, natural language processing, ……
The technique is not applicable to all areas. In some challenging areas, data is limited:
- The training data size (number of subjects) is still too small compared to the number of variables.
- Neural networks can be applied when human selection of variables is done first.
- Existing knowledge, in the form of existing networks, is already explicitly used instead of being learned from data; such networks are hard to beat with a limited amount of data.
IEEE Trans Pattern Anal Mach Intell Aug;35(8):

