Neural networks (3) Regularization Autoencoder


Neural networks (3): Regularization, Autoencoder, Recurrent neural network. References: Goodfellow, Bengio, Courville, "Deep Learning"; Charte, "A practical tutorial on autoencoders for nonlinear feature fusion"; Le, "A tutorial on Deep Learning".

More on regularization. Neural network models are extremely versatile, so they have a high potential for overfitting. Regularization aims at reducing the expected prediction error (EPE), not the training error. Too flexible: high variance, low bias. Too rigid: low variance, high bias.

More on regularization. Parameter norm penalties: the regularized loss function adds a penalty term to the training objective. For L2 regularization, the gradient and the update step gain a weight-decay term (see the equations below).
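The formulas on this slide were shown as images; a standard reconstruction, following the Goodfellow et al. notation with weight-decay coefficient α and learning rate ε, is:

\[
\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2}\, w^\top w
\]
\[
\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha w
\]
\[
w \leftarrow w - \epsilon\big(\nabla_w J(w; X, y) + \alpha w\big)
\]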

More on regularization. L1 regularization: the loss function adds a penalty proportional to the absolute values of the weights, and the gradient gains a sign term (see the equations after this slide). Dataset augmentation: generate new (x, y) pairs by transforming the x inputs in the training set. For example, in object recognition: translating the training images a few pixels in each direction, rotating the image, or scaling the image.
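The L1 formulas, again reconstructed in the Goodfellow et al. notation:

\[
\tilde{J}(w; X, y) = J(w; X, y) + \alpha \|w\|_1
\]
\[
\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha\, \mathrm{sign}(w)
\]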

More on regularization. Early stopping: treat the number of training steps as another hyperparameter; it requires a validation set and can be combined with other regularization strategies. A minimal sketch follows.
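A minimal sketch of early stopping with a patience counter. The helpers train_one_epoch, evaluate, and copy_params are hypothetical placeholders for the user's own training code, not functions from the slides.

```python
def train_with_early_stopping(model, train_data, val_data,
                              train_one_epoch, evaluate, copy_params,
                              patience=10, max_epochs=1000):
    """Stop training when the validation loss has not improved for `patience` epochs."""
    best_val_loss = float("inf")
    best_params = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # one pass of (stochastic) gradient descent
        val_loss = evaluate(model, val_data)     # loss on the held-out validation set
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_params = copy_params(model)     # remember the best weights seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # no progress for `patience` epochs: stop
    return best_params, best_val_loss
```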

More on regularization [figure-only slide]

More on regularization. Parameter sharing (a CNN is an example): it grows a large network without dramatically increasing the number of unique model parameters, and therefore without requiring a corresponding increase in training data. Sparse representation: an L1 penalty on the weights induces a sparse parametrization, whereas representational sparsity means a representation in which many of the elements are zero. The latter is achieved by an L1 penalty on the elements of the representation, or by a hard constraint on the activation values (see the equations below).
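The two forms of sparsity differ only in what the L1 penalty is applied to; in the same notation as the earlier slides:

\[
\text{sparse parameters:}\quad \tilde{J}(\theta) = J(\theta) + \alpha \|w\|_1
\qquad\qquad
\text{sparse representation:}\quad \tilde{J}(\theta) = J(\theta) + \alpha \|h\|_1
\]

where h is the vector of hidden-unit activations (the representation).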

More on regularization [figure-only slide contrasting a sparse parametrization with representational sparsity]

More on regularization. Dropout. Bagging (bootstrap aggregating) reduces generalization error by combining several models: train several different models separately, then have all of the models vote on the output for test examples.

More on regularization. Dropout. Neural networks have many solution points: random initialization, random selection of minibatches, differences in hyperparameters, and different outcomes of non-deterministic implementations mean that different members of the ensemble make partially independent errors. However, training multiple models is impractical when each model is a large neural network.

More on regularization. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Use a minibatch-based learning algorithm; each time, randomly sample a different binary mask vector μ to apply to all of the input and hidden units in the network. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter. Run forward propagation, back-propagation, and the learning update as usual (see the sketch below).
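A minimal NumPy sketch of dropout for a one-hidden-layer network, including the weight-scaling inference rule discussed two slides below; the layer shapes and keep probability are illustrative assumptions, not values from the slides.

```python
import numpy as np

def dropout_mask(shape, keep_prob, rng):
    """Sample a binary mask mu: each unit is kept (mask = 1) with probability keep_prob."""
    return (rng.random(shape) < keep_prob).astype(float)

def forward_train(x, W1, b1, W2, b2, keep_prob=0.5, rng=None):
    """One training-time forward pass with dropout on the input and hidden units."""
    rng = rng or np.random.default_rng()
    x = x * dropout_mask(x.shape, keep_prob, rng)   # drop input units
    h = np.tanh(x @ W1 + b1)
    h = h * dropout_mask(h.shape, keep_prob, rng)   # drop hidden units
    return h @ W2 + b2                              # output scores

def forward_test(x, W1, b1, W2, b2, keep_prob=0.5):
    """Weight-scaling inference rule: keep all units, but multiply the weights
    going out of each dropped-out layer by its inclusion probability."""
    h = np.tanh(x @ (W1 * keep_prob) + b1)
    return h @ (W2 * keep_prob) + b2
```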

More on regularization [figure-only slides]

More on regularization. Dropout training consists of minimizing E_μ J(θ, μ). The expectation contains exponentially many terms: 2^(number of non-output units). An unbiased estimate of its gradient is obtained by randomly sampling values of μ. Most of the exponentially many models are never explicitly trained: a tiny fraction of the possible sub-networks are each trained for a single step, and parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters. "Weight-scaling inference rule": approximate the ensemble output p_ensemble by evaluating p(y | x) in one model, the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i.

Autoencoder. It seeks data structure within X and serves as a good starting point for fitting a deep network such as a CNN. Map the data {x^(1), x^(2), …, x^(m)} to codes {z^(1), z^(2), …, z^(m)} of lower dimension, then recover X from Z. In the linear case this amounts to a linear encoding and decoding. Example: data compression.

Autoencoder. This is a linear autoencoder. The objective function (see below) can be minimized by stochastic gradient descent.
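The objective on this slide was shown as an image; a standard form, with assumed encoder weights W_e and decoder weights W_d, is:

\[
J(W_e, W_d) = \sum_{i=1}^{m} \big\| x^{(i)} - W_d W_e\, x^{(i)} \big\|^2,
\qquad z^{(i)} = W_e\, x^{(i)}
\]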

Autoencoder Nonlinear autoencoder – finds nonlinear structures in the data. Multiple layers can extract highly nonlinear structure.
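A minimal PyTorch sketch of such a nonlinear (undercomplete) autoencoder; the layer sizes, activation, and optimizer settings are illustrative assumptions, not choices from the slides.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Nonlinear undercomplete autoencoder: n_in -> n_hidden -> n_in."""
    def __init__(self, n_in=784, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Tanh())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional code z
        return self.decoder(z)     # reconstruction of x

def train(model, batches, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in batches:                 # unsupervised: batches of inputs only, no labels
            loss = loss_fn(model(x), x)   # reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
```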

Autoencoder. It is undercomplete if the encoding layer has a lower dimensionality than the input, and overcomplete if the encoding layer has the same number of units as the input or more (restrictions are then needed to prevent it from simply copying the data).

Autoencoder. Purposes: nonlinear dimensionality reduction, feature learning, de-noising, generative modeling. Sparse autoencoder: adds a sparsity penalty on the code layer, often to learn features for another task such as classification (see the objective below).
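In the Goodfellow et al. notation, with encoder f, decoder g, and code h = f(x), the sparse autoencoder minimizes

\[
L\big(x,\, g(f(x))\big) + \Omega(h), \qquad \text{e.g. } \Omega(h) = \lambda \sum_i |h_i|
\]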

Autoencoder. We can think of feedforward networks trained by supervised learning as performing a kind of representation learning: the last layer is a linear classifier, and the previous layers learn a data representation for that classifier, so the hidden layers take on properties that make the classification task easier. Unsupervised learning instead tries to capture the shape of the input distribution, which can sometimes be useful for another task, such as supervised learning on the same input domain. Greedy layer-wise unsupervised pretraining can be used for this.

Autoencoder. Autoencoder for the pre-training of a classification task [figure-only slide].

Autoencoder. Denoising autoencoder (DAE): receives a corrupted data point as input and predicts the original, uncorrupted data point as its output, minimizing the reconstruction loss shown below.
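In the Goodfellow et al. notation, with x̃ a corrupted copy of x, encoder f, decoder g, and a loss L such as squared error, the DAE minimizes

\[
L\big(x,\; g(f(\tilde{x}))\big)
\]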

Autoencoder [figure-only slide]

Recurrent neural networks (RNNs): networks operating on a sequence of vectors x^(t), with the time step index t ranging from 1 to τ. Challenges: a feedforward network that processes sentences of fixed length cannot extract the time regardless of its position, because its weights are tied to specific input positions ("I went to Nepal in 2009" vs. "In 2009, I went to Nepal."), and τ can be different for different inputs. Goal: share statistical strength across different sequence lengths and across different positions.

Recurrent neural networks. Examples: statistical language modeling (predict the next word given the previous words); predicting a stock price based on the existing data.

Computational Graph [example figure]

Computational Graph. A dynamical system driven by an external signal x^(t) (see the recurrence below).
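The recurrence this slide refers to, as written in Goodfellow et al. (state s, parameters θ):

\[
s^{(t)} = f\big(s^{(t-1)},\, x^{(t)};\, \theta\big)
\]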

Recurrent neural networks. Three sets of parameters: input-to-hidden weights (W), hidden-to-hidden weights (U), and hidden-to-output weights (V). Minimize a cost function, e.g. (y − f(x))^2, to obtain the appropriate weights. Backpropagation can be used again: "backpropagation through time" (BPTT).
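One common parametrization of the forward pass, written with the slide's labels (W input-to-hidden, U hidden-to-hidden, V hidden-to-output; note that Goodfellow et al. swap the roles of W and U):

\[
h^{(t)} = \tanh\big(b + U h^{(t-1)} + W x^{(t)}\big), \qquad
o^{(t)} = c + V h^{(t)}, \qquad
\hat{y}^{(t)} = \mathrm{softmax}\big(o^{(t)}\big)
\]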

Recurrent neural networks. Assume the output is discrete and the input and output sequences have the same length. For each time step from t = 1 to t = τ, compute a per-step loss; the total loss is the sum over time steps (see below).
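The total loss in the Goodfellow et al. formulation (negative log-likelihood summed over time steps):

\[
L = \sum_{t} L^{(t)} = -\sum_{t} \log p_{\text{model}}\big(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}\big)
\]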

Recurrent neural networks. Gradient computation involves a forward-propagation pass moving left to right through the unrolled graph, followed by a backward-propagation pass moving right to left through the graph (back-propagation through time, or BPTT); a sketch of the forward pass follows.
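A minimal NumPy sketch of the forward pass through the unrolled graph, following the equations given above; the function signature and variable names are my own, not from the slides. BPTT would then backpropagate through the stored hidden states.

```python
import numpy as np

def rnn_forward(xs, ys, W, U, V, b, c, h0):
    """Forward pass of a simple RNN over one sequence.
    xs: list of input vectors x^(t); ys: list of target class indices y^(t).
    W: input-to-hidden, U: hidden-to-hidden, V: hidden-to-output (slide's labels)."""
    h, total_loss, hidden_states = h0, 0.0, []
    for x, y in zip(xs, ys):
        h = np.tanh(b + U @ h + W @ x)           # hidden-state update
        o = c + V @ h                            # output scores
        p = np.exp(o - o.max()); p /= p.sum()    # softmax over classes
        total_loss += -np.log(p[y])              # negative log-likelihood at step t
        hidden_states.append(h)                  # stored for back-propagation through time
    return total_loss, hidden_states
```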

Recursive Neural Networks. A generalization of recurrent networks, with a different kind of computational graph: the graph is a deep tree, rather than the chain-like structure of RNNs, so the depth can be reduced from τ to O(log τ). The structure can be imposed or learned.

Recursive Neural Networks An example. Socher et al. “Parsing Natural Scenes and Natural Language with Recursive Neural Networks”

Recursive Neural Networks. An example pipeline: over-segment the data; extract features; map the features into a "semantic" n-dimensional space (by a neural network); the network computes the potential parent representation for these possible child nodes; build an object adjacency matrix.

Deep learning: major areas of application include speech recognition and signal processing, object recognition, natural language processing, and more. The technique is not applicable to all areas. In some challenging areas data is limited: the training data size (number of subjects) is still too small compared to the number of variables; neural networks could be applied when human selection of variables is done first; and existing knowledge, in the form of existing networks, is already explicitly used instead of being learned from data, and such approaches are hard to beat with a limited amount of data. (IEEE Trans Pattern Anal Mach Intell. 2013 Aug; 35(8): 1798-828.)