Deep Learning and Its Application to Signal and Image Processing and Analysis Class II - Fall 2016 Tammy Riklin Raviv, Electrical and Computer Engineering.


Today’s plan
Last class …
Feature extraction
Optimization
Backpropagation

Last class – artificial neuron (AN)
An AN linearly combines the data at its input ports (like dendrites) and non-linearly transforms the weighted sum into its output port (like an axon).
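The neuron described above can be sketched in a few lines of Python. This is only an illustration: the choice of the sigmoid as the non-linearity, the bias term, and the names `neuron`, `weights` and `bias` are assumptions, not taken from the slide.

```python
import math

def sigmoid(z):
    """Logistic non-linearity: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs (the 'dendrites'), then a non-linearity (the 'axon')."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical inputs and weights: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
```

Here the weighted sum is exactly zero, so the neuron sits at the midpoint of the sigmoid.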

Last class - A single layer NN

Last class - A single layer NN – linear classifier

Hidden layers
The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.
Deep learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned.

Conventional Machine Learning
Conventional techniques required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data.

Classification by feature extraction
Conventional machine learning techniques required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data.
Reference: Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, Shree K. Nayar, "Describable Visual Attributes for Face Verification and Image Search," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), October 2011.

Feature extraction

Dimensionality reduction by PCA
Removing noise?

Eigenfaces – Principal component analysis
Developed by Sirovich and Kirby (1987). Used by Matthew Turk and Alex Pentland in face classification.

Training database
M: number of pixels
n: number of training examples

Principal Component Analysis
Quantities labelled on the slide: eigenvector, eigenvalue, eigenvector matrix, vectorized image, weights vector, diagonal eigenvalue matrix.

Linear Discriminant Analysis

Automatic feature extraction by DNN
Deep Learning - automatic feature extraction: Representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects

Last class: score and loss
A (parameterized) score function maps the raw image pixels to class scores (e.g. a linear function).
A loss function measures the quality of a particular set of parameters, based on how well the induced scores agree with the ground truth labels in the training data. We saw that there are many ways and versions of this (e.g. Softmax/SVM).
Optimization is the process of finding the set of parameters W that minimizes the loss function.
Equations shown on the slide: linear function, regularization, SVM loss.
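As a rough illustration of the three pieces named above, the sketch below computes linear class scores, a multiclass SVM (hinge) loss for one example, and an L2 regularization term. The margin of 1, the L2 form of the penalty, the weight `lam`, and all numeric values are assumptions made for the example; the slide's own equations were not recovered.

```python
def scores(W, b, x):
    """Linear score function: one score per class, s = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def svm_loss(s, y, margin=1.0):
    """Multiclass hinge loss for one example whose correct class index is y."""
    return sum(max(0.0, sj - s[y] + margin) for j, sj in enumerate(s) if j != y)

def l2_penalty(W, lam):
    """Regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for row in W for w in row)

# Hypothetical 3-class problem with 2 input features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 1.0]

s = scores(W, b, x)                          # [2.0, 1.0, -3.0]
data_loss = svm_loss(s, y=1)                 # max(0, 2-1+1) + max(0, -3-1+1) = 2.0
total = data_loss + l2_penalty(W, lam=0.1)   # 2.0 + 0.1 * 4.0 = 2.4
```

The correct class (index 1) scores lower than class 0, so the hinge loss is positive; optimization would adjust W to push that loss toward zero.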


Score & Loss regularization

Regularization

Optimization
Cost function values: the discrepancies between the outputs (NN estimations) and the training set data points.
Goal: find the set of weights for which the global minimum of the cost function is obtained.
The cost function is parameterized by the network’s weights; we control the loss by changing the weights.
In reality the cost function is not convex.

Optimization – why is it difficult?
1. There is no simple equation that can be solved analytically.
2. The function is high-dimensional.
3. The function might have many local minima and maxima.
Common approach: iterative optimization algorithms, e.g. gradient descent.

Gradients
Numerical evaluation: 1D and nD
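A numerical gradient can be approximated with finite differences, one coordinate at a time. The sketch below uses the centered difference (f(w + h) - f(w - h)) / 2h; the step size `h`, the function name `numerical_gradient` and the test function `bowl` are assumptions made for illustration.

```python
def numerical_gradient(f, w, h=1e-5):
    """Centered-difference estimate of the gradient of f at the point w (a list)."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h    # nudge coordinate i up
        w_minus[i] -= h   # and down
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad

def bowl(w):
    """Hypothetical test function: w0^2 + 3 * w1^2."""
    return w[0] ** 2 + 3.0 * w[1] ** 2

# The analytic gradient at (1, -2) is (2, -12).
g = numerical_gradient(bowl, [1.0, -2.0])
```

The 1D case is the same loop with a single coordinate. Each coordinate costs two function evaluations, which is why the numerical gradient is mainly used to check an analytic one.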

Numerical vs. analytic gradient

Gradient descent
Update rule
Learning rate: an important hyperparameter.
Too small: very slow convergence, or the search gets stuck in a local minimum.
Too big: may “skip” the target minimum; may go in the wrong direction.
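The effect of the learning rate can be seen on a one-dimensional example. The sketch below assumes the usual update w ← w − learning_rate · gradient and a hypothetical cost (w − 3)²; the three learning rates are arbitrary values chosen to show the three regimes.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeat the update w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad(w):
    """Derivative of the hypothetical cost (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

slow = gradient_descent(grad, 0.0, 0.001, 50)  # too small: has barely left the start
good = gradient_descent(grad, 0.0, 0.1, 50)    # reasonable: very close to 3
big = gradient_descent(grad, 0.0, 1.1, 50)     # too big: overshoots further on every step
```

With the largest rate each step lands on the opposite side of the minimum and further away from it, so the iterates diverge.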

Momentum
As we train, we accumulate a “velocity” value V. At each training step, we update V with the gradient at the current position. We then update our weights in the direction of the velocity, and repeat the process. Over the first few training iterations, V grows as our weights “pick up speed” and take successively bigger leaps. As we approach the minimum, the velocity stops accumulating as quickly and eventually begins to decay, until we have essentially reached the minimum.
The momentum method simulates a heavy ball rolling down a surface. The ball builds up velocity along the floor of the ravine, but not across the ravine, because the opposing gradients on opposite sides of the ravine cancel each other out over time. Instead of using the estimated gradient times the learning rate to increment the values of the parameters, the momentum method uses this quantity to increment the velocity, v, of the parameters, and the current velocity is then used as the parameter increment.
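A minimal sketch of the momentum update described above, assuming the classical form v ← mu · v − learning_rate · gradient followed by w ← w + v. The decay factor `mu`, the learning rate and the ravine-shaped cost 0.01·x² + y² are assumptions chosen for illustration.

```python
def momentum_descent(grad, w0, learning_rate=0.1, mu=0.9, steps=500):
    """Gradient descent with classical momentum on a list of parameters."""
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        # The scaled gradient increments the velocity, not the parameters...
        v = [mu * vi - learning_rate * gi for vi, gi in zip(v, g)]
        # ...and the current velocity is then used as the parameter increment.
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

def ravine_grad(w):
    """Gradient of the hypothetical cost 0.01 * x^2 + y^2 (shallow in x, steep in y)."""
    return [0.02 * w[0], 2.0 * w[1]]

with_momentum = momentum_descent(ravine_grad, [10.0, 1.0])
plain = momentum_descent(ravine_grad, [10.0, 1.0], mu=0.0)  # mu = 0 is plain gradient descent
```

After the same number of steps, the run with momentum has travelled along the shallow floor of the ravine to the minimum, while plain gradient descent is still far from it in that direction.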

Backpropagation to train multilayer architectures
The backpropagation procedure computes the gradient of an objective function with respect to the weights of a multilayer stack of modules. It is a practical application of the chain rule. The key insight is that the gradient of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima: weight configurations for which no small change would reduce the average error.
In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder.

Backprop

Backprop step-by-step (I)

Backprop step-by-step (II)
Using the chain rule:

Backprop step-by-step (III)
Using the chain rule:

Modularity: Sigmoid example
Sidenote: ReLU activation functions are also commonly used in classification contexts. There are downsides to using the sigmoid function, particularly the “vanishing gradient” problem.
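Treating the sigmoid as a single module, its backward pass only needs its own output: the local gradient is out · (1 − out), multiplied by the gradient arriving from above. The sketch below is an illustration with assumed function names; it also shows how small that local gradient becomes for a saturated unit.

```python
import math

def sigmoid_forward(z):
    """Forward pass of the sigmoid module."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_backward(out, upstream):
    """Backward pass: local gradient out * (1 - out), times the upstream gradient."""
    return out * (1.0 - out) * upstream

out = sigmoid_forward(0.0)         # 0.5
dz = sigmoid_backward(out, 1.0)    # 0.25, the largest value the local gradient can take

# For a large input the unit saturates and passes almost no gradient back.
saturated = sigmoid_backward(sigmoid_forward(10.0), 1.0)
```

Since the local gradient never exceeds 0.25, a chain of many sigmoid layers multiplies the gradient by a small factor at each layer, which is one way to see the vanishing gradient problem.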

Derivation example - notations
Let […] be a neural network with […] connections.
Input vectors: […]
Estimated output vectors: […]
Correct outputs: […]
Training examples: […]

Derivation example - specifications
A simple neural network with two input units and one output unit.
Let […]
A training example: […]
Estimated output: […]
Error function: […] (squared error function)

Derivation example
Error function: […] (squared error function)
For each neuron, its output is defined as […]
For sigmoid: […] and […]

Derivation example
Finding the derivative of the error with respect to the weights is done by using the chain rule twice: […]
If […]

Derivation example
Finding the derivative of the error with respect to the weights is done by using the chain rule twice: […]
Considering […] as a function of the inputs of all neurons […] receiving input from neuron […]

Derivation example
Putting it all together: […] with […]
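The slide equations were not recovered, so the block below is only a sketch of the standard textbook derivation for a squared error and sigmoid units, not necessarily the exact form shown in class. All symbols are assumptions: t is the target, y the network output, o_j the output of neuron j, net_j its weighted input, w_ij the weight from neuron i to neuron j, and L the set of neurons receiving input from neuron j.

```latex
% Error, neuron output, and the sigmoid with its derivative
E = \tfrac{1}{2}(t - y)^2, \qquad
o_j = \varphi(\mathrm{net}_j) = \varphi\Big(\sum_k w_{kj}\, o_k\Big), \qquad
\varphi(z) = \frac{1}{1 + e^{-z}}, \quad
\varphi'(z) = \varphi(z)\,\bigl(1 - \varphi(z)\bigr)

% Chain rule applied twice
\frac{\partial E}{\partial w_{ij}}
  = \frac{\partial E}{\partial o_j}\,
    \frac{\partial o_j}{\partial \mathrm{net}_j}\,
    \frac{\partial \mathrm{net}_j}{\partial w_{ij}}
  = \delta_j\, o_i

% Putting it all together
\delta_j =
\begin{cases}
(o_j - t_j)\, o_j\,(1 - o_j) & \text{if } j \text{ is an output neuron} \\[4pt]
\Bigl(\sum_{l \in L} \delta_l\, w_{jl}\Bigr)\, o_j\,(1 - o_j) & \text{if } j \text{ is a hidden neuron}
\end{cases}
```

The hidden-neuron case is where the recursion appears: the delta of neuron j is assembled from the deltas of the neurons it feeds, which is why the computation runs from the output layer backwards.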

Neural Network Training
1. Construct a network and initialize the weights (random numbers, not zeros).
2. Perform one feed-forward pass using the training data.
3. Perform backpropagation to get the error derivatives w.r.t. each and every weight in the neural network.
4. Perform gradient descent: update each weight by subtracting its error derivative scaled by a learning rate alpha. Increment the number of iterations.
5. If not converged, go to step 2.
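The five steps above can be sketched on the smallest possible case: two input units, one sigmoid output unit, and a squared error. Everything specific here is an assumption made for illustration: the OR training set, the learning rate `alpha`, the convergence threshold `tol`, the iteration cap and the random seed.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def train(data, alpha=0.5, max_iters=5000, tol=1e-2, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the weights with small random numbers, not zeros.
    w = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(max_iters):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        error = 0.0
        for x, t in data:
            # Step 2: feed-forward.
            y = predict(w, b, x)
            error += 0.5 * (t - y) ** 2
            # Step 3: backpropagation, dE/dnet = (y - t) * y * (1 - y).
            delta = (y - t) * y * (1.0 - y)
            grad_w[0] += delta * x[0]
            grad_w[1] += delta * x[1]
            grad_b += delta
        # Step 5: stop once the total error is small enough.
        if error < tol:
            break
        # Step 4: gradient descent update with learning rate alpha.
        w = [wi - alpha * gi for wi, gi in zip(w, grad_w)]
        b -= alpha * grad_b
    return w, b

# Hypothetical training set: the logical OR of two inputs.
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w, b = train(or_data)
```

This version accumulates the derivatives over the whole training set before each update; with hidden layers, step 3 would propagate the deltas backwards through each layer in turn.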

Modes of learning
There are two modes of learning to choose from: stochastic and batch.
Stochastic learning: each propagation is followed immediately by a weight update.
Batch learning: many propagations occur before updating the weights, accumulating errors over the samples within a batch.
Stochastic learning introduces "noise" into the gradient descent process, using the local gradient calculated from one data point; this reduces the chance of the network getting stuck in a local minimum. Yet batch learning typically yields a faster, more stable descent to a local minimum, since each update is performed in the direction of the average error of the batch samples.
In modern applications a common compromise is to use "mini-batches", meaning batch learning but with a batch of small size and with stochastically selected samples.
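The three modes differ only in how many samples contribute to each weight update. The sketch below makes that explicit with a single `batch_size` parameter on a hypothetical one-weight regression problem (fit y = w·x to noiseless data generated with w = 2); the data, learning rate and number of epochs are assumptions.

```python
import random

def minibatch_sgd(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error of each batch."""
    rng = random.Random(seed)
    samples = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(samples)  # stochastic selection of the samples in each batch
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Average the gradient over the batch, then make a single weight update.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w_stochastic = minibatch_sgd(data, batch_size=1)      # one update per sample
w_mini = minibatch_sgd(data, batch_size=2)            # one update per mini-batch
w_batch = minibatch_sgd(data, batch_size=len(data))   # one update per pass over the data
```

On this noiseless toy problem all three settings reach the same weight; the differences described above show up on noisy, non-convex problems.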

Training data collections
Online learning is used for dynamic environments that provide a continuous stream of new training data patterns. Offline learning makes use of a training set of static patterns.

Limitations
Gradient descent with backpropagation is not guaranteed to find the global minimum of the error function, but only a local minimum; it also has trouble crossing plateaux in the error function landscape. This issue, caused by the non-convexity of error functions in neural networks, was long thought to be a major drawback, but in a 2015 review article, Yann LeCun et al. (Deep Learning, Nature) argue that in many practical problems it is not.
Backpropagation learning does not require normalization of input vectors; however, normalization could improve performance.

ConvNet: four key ideas
Local connections
Shared weights
Pooling
Many layers

Conv-net : Architecture
The first few stages are composed of two types of layers: I. Convolutional layers II. Pooling layers. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers.

Conv-net : Convolutional layers
Units in a convolutional layer are organized in feature maps. Each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
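A minimal sketch of one feature map: a single shared filter slides over a single-channel image, each output unit takes a local weighted sum, and the result goes through a ReLU. Note the assumptions: the filter is applied without flipping (strictly a cross-correlation, a common convention in deep learning code), only positions where the filter fits entirely are computed, and the image and filter values are made up.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Slide one shared filter over the image ('valid' positions only), then apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Local weighted sum over the patch whose top-left corner is (r, c).
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(max(0.0, total))  # ReLU non-linearity
        feature_map.append(row)
    return feature_map

# Hypothetical image with a dark-to-bright vertical edge, and a filter that responds to it.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
edge_filter = [[-1.0, 1.0]]

fm = conv2d_relu(image, edge_filter)  # each row is [0.0, 1.0, 0.0]
```

The same two weights are reused at every position, which is the weight sharing mentioned above; a second feature map would simply use a different filter.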

Pooling layers
The role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions.
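Max pooling as described above can be sketched as follows. The 2×2 patch size, the stride of 2 and the example feature maps are assumptions chosen for illustration.

```python
def max_pool(feature_map, size=2, stride=2):
    """Maximum over size x size patches, with neighbouring patches shifted by `stride`."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled

# A hypothetical 4x4 feature map with two strong responses...
a = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 5]]

# ...and the same map with the first response shifted by one row and one column.
shifted = [[9, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 5]]

pooled = max_pool(a)  # [[9, 0], [0, 5]]
```

Both maps pool to the same 2×2 output: the representation is four times smaller, and the small shift of the response is no longer visible.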

Recurrent Neural Network (RNN)
A recurrent neural network and the unfolding in time of the computation involved in its forward computation.
The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_t′ (for t′ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as input for the next time step.
The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.
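The unfolding in time can be sketched as a plain loop that reuses the same three matrices at every step. The roles assigned here are assumptions: U maps the input to the hidden state, W maps the previous hidden state to the new one, V maps the hidden state to the output, the non-linearity is tanh, and there are no bias terms. The matrix values are made up.

```python
import math

def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, s0):
    """Unfold the recurrence over the input sequence, reusing U, W and V at every step."""
    s = s0
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]   # new hidden state s_t
        outputs.append(matvec(V, s))      # output o_t
    return outputs

U = [[0.5], [-0.5]]            # input (size 1) to hidden (size 2)
W = [[0.1, 0.0], [0.0, 0.1]]   # hidden to hidden
V = [[1.0, -1.0]]              # hidden to output (size 1)

# Only the first input is non-zero, yet the later outputs are non-zero too.
xs = [[1.0], [0.0], [0.0]]
outs = rnn_forward(xs, U, W, V, [0.0, 0.0])
```

The second and third outputs are produced from zero inputs, so whatever they contain was carried forward by the hidden state: each o_t depends on all earlier inputs.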

From Image to text

From Image to text
When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), it exploits this to achieve better ‘translation’ of images into captions.
Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image caption generator. In Proc. International Conference on Machine Learning arxiv.org/abs/ (2014).

LSTM memory cell
