Deep Learning Techniques and Applications


1 Deep Learning Techniques and Applications
Georgiana Neculae

2 Outline
Why Deep Learning?
Applications and specialized Neural Networks
Neural Networks basics and training
Potential issues
Preventing overfitting
Research directions
Implementing your own!

3 Why Deep Learning? Deep learning has gained immense popularity in recent years: it has transformed whole industries, set new performance standards for machine learning models, and attracted major hardware manufacturers, who are now building specialised hardware to make deep learning models faster and more efficient.

4 Why is it important? Impressive performance on tasks once perceived as exclusively human:
Playing games
Artistic creativity
Verbal communication
Problem solving

5 Applications

6 Speech Recognition
Aim: take speech recordings as input and produce text.
Why? Translation, AI assistants, automatic subtitles.
Challenges come from differences in pronunciation:
Intonation
Accent
Speed
Cadence or inflection
Deep learning finally made speech recognition accurate enough to be useful outside of carefully controlled environments. Andrew Ng has long predicted that as speech recognition goes from 95% accurate to 99% accurate, it will become a primary way we interact with computers: that 4% accuracy gap is the difference between annoyingly unreliable and incredibly useful. Thanks to deep learning, we are finally cresting that peak.

7 Recurrent Neural Networks (RNNs)
Make use of internal memory to predict the most likely future sequence based on what they have seen so far.
The input to the neural network is 20 millisecond audio chunks. For each little audio slice, it tries to figure out the letter that corresponds to the sound currently being spoken.
We use a recurrent neural network, that is, a neural network whose memory influences future predictions. Each letter it predicts should affect the likelihood of the next letter it predicts too. For example, if we have said "HEL" so far, it is very likely we will say "LO" next to finish the word "Hello", and much less likely that we will say something unpronounceable like "XYZ". Having that memory of previous predictions helps the network make more accurate predictions going forward.
After we run the entire audio clip through the neural network (one chunk at a time), we end up with a mapping of each audio chunk to the letters most likely spoken during that chunk, for example for the word "Hello".
The trick is to combine these pronunciation-based predictions with likelihood scores from a large database of written text (books, news articles, etc.). We throw out transcriptions that seem least likely to be real and keep the one that seems most realistic. Of the possible transcriptions "Hello", "Hullo" and "Aullo", "Hello" will appear far more frequently in a text database (not to mention in the original audio-based training data) and is therefore probably correct, so we pick "Hello" as the final transcription.
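As a rough illustration of the memory mechanism described above, here is a minimal recurrent cell in NumPy (a sketch only: the feature dimensions, random weights and stand-in audio features are assumptions, not the architecture of a real speech recogniser):

```python
import numpy as np

# Minimal sketch of a recurrent cell: the hidden state h carries
# information from earlier audio chunks into the current prediction.
# Dimensions and weights are illustrative placeholders.
rng = np.random.default_rng(0)
n_features, n_hidden, n_letters = 13, 64, 28   # per-chunk features in, 26 letters + space + blank out

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_letters, n_hidden))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    """One time step: combine the current chunk with the memory of past chunks."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)     # new hidden state (the "memory")
    y_t = softmax(W_hy @ h_t)                     # distribution over letters for this chunk
    return h_t, y_t

# Run a (random, stand-in) sequence of audio-chunk features through the cell.
chunks = rng.normal(size=(50, n_features))        # 50 chunks of ~20 ms each
h = np.zeros(n_hidden)
letter_probs = []
for x_t in chunks:
    h, y_t = rnn_step(x_t, h)
    letter_probs.append(y_t)
```

In a trained system these per-chunk letter distributions would then be combined with a language model, as described above, to pick the most realistic transcription.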

8 WaveNet
Generates speech that sounds more natural than previous techniques.
Also used to synthesize and generate music.
Developed by DeepMind.

9 Object Detection and Recognition
Why? Face detection for cameras, counting, visual search engines.
What features are important when learning to understand an image?
Image classification means predicting the label of an image from a set of predefined labels. Key issues are the number of classes to distinguish between and the amount of data needed. Take a dog as an example: there are many variations of what a dog looks like, and we want the network to recognise all of them as a dog. How much data is necessary, and how can we describe the characteristics of a dog (whatever shape it takes) to the network?
Face detection: since the mid-2000s, some point-and-shoot cameras have detected faces for more efficient auto-focus. While this is a narrower type of object detection, the methods used apply to other types of objects.
Counting: a simple but often ignored use of object detection is counting. The ability to count people, cars, flowers and even microorganisms is a real-world need across many image-based systems, and with the ongoing surge of video surveillance devices there is a bigger opportunity than ever to turn that raw footage into structured data using computer vision.
Visual search engines: Pinterest uses object detection as part of its pipeline for indexing different parts of an image. When searching for a specific purse, you can find instances of similar purses in different contexts, which is much more powerful than just finding similar whole images, as Google Image's reverse search does.

10 Object Detection and Recognition
Difficulty arises from:
Multiple objects can be identified in a photo
Objects can be occluded by the environment
The object of interest could be too small
Examples of the same class could look very different
Detection means not only finding the class of an object but also localising its extent in the image. The object can lie anywhere in the image and can be of any size (scale). Humans are especially good at identifying objects in their environment, but neural networks do not share the same advantages.
Variable number of objects: when training machine learning models, you usually need to represent data as fixed-size vectors. Since the number of objects in an image is not known beforehand, the correct number of outputs is unknown, so some post-processing is required, which adds complexity to the model. Historically, the variable number of outputs has been tackled with a sliding-window approach: fixed-size features are generated for the window at every position, and after all predictions are made some are discarded and some are merged to get the final result (see the sketch below).
Sizing: another big challenge is the range of possible object sizes. In simple classification you expect the object to cover most of the image, but some objects of interest may be as small as a dozen pixels (a small percentage of the original image). Traditionally this has been handled with sliding windows of different sizes, which is simple but very inefficient.
Modeling: a third challenge is solving two problems at once. How do we combine the two requirements, localisation and classification, into, ideally, a single model?
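The sliding-window idea can be sketched in a few lines (a toy illustration; classify_patch is a hypothetical placeholder standing in for a trained classifier, and the window size, stride and threshold are arbitrary assumptions):

```python
import numpy as np

def classify_patch(patch):
    """Placeholder classifier: returns a fake 'object present' score for a patch.
    In a real detector this would be a trained CNN applied to the patch."""
    return float(patch.mean())

def sliding_window_detect(image, window=32, stride=16, threshold=0.5):
    """Score every window position; return candidate boxes above the threshold.
    Different object sizes are traditionally handled by repeating this at
    several window sizes (or image scales), which is simple but inefficient."""
    h, w = image.shape
    detections = []
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            score = classify_patch(image[y:y + window, x:x + window])
            if score > threshold:
                detections.append((x, y, window, window, score))
    # Overlapping boxes would then be merged or discarded (e.g. non-maximum suppression).
    return detections

image = np.random.rand(128, 128)   # stand-in grayscale image
print(len(sliding_window_detect(image)))
```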

11 Convolutional Neural Networks (CNNs)
CNNs compare images piece by piece. The pieces they look for are called features. By finding rough feature matches in roughly the same positions in two images, CNNs get much better at seeing similarity than whole-image matching schemes.
When presented with a new image, the CNN does not know exactly where these features will match, so it tries them everywhere, in every possible position. In calculating the match to a feature across the whole image, we make it a filter. The math we use to do this is called convolution, from which Convolutional Neural Networks take their name. To complete a convolution, we repeat this process, lining up the feature with every possible image patch. We can take the answer from each position and build a new two-dimensional array from it, based on where in the image each patch is located. This map of matches is a filtered version of the original image: it is a map of where in the image the feature is found. Values close to 1 show strong matches, values close to -1 show strong matches for the photographic negative of the feature, and values near zero show no match of any sort.
The next step is to repeat the convolution process in its entirety for each of the other features. The result is a set of filtered images, one for each filter. It is convenient to think of this whole collection of convolution operations as a single processing step: in CNNs this is referred to as a convolution layer, hinting at the fact that other layers will soon be added to it.
Another power tool that CNNs use is pooling. Pooling is a way to take large images and shrink them down while preserving the most important information in them. It consists of stepping a small window across an image and taking the maximum value from the window at each step.
A small but important player in this process is the Rectified Linear Unit, or ReLU. Its math is very simple: wherever a negative number occurs, swap it out for a 0. This helps the CNN stay mathematically healthy by keeping learned values from getting stuck near 0 or blowing up toward infinity. It is the axle grease of CNNs: not particularly glamorous, but without it they do not get very far.
Together, these layers construct a representation in a hierarchical manner, with increasing levels of abstraction from the lower to the higher layers of the network.
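The convolution, ReLU and pooling steps described above can be written out directly in NumPy (a didactic sketch for a single-channel image and a single feature, not an efficient implementation; the toy image and feature are made up):

```python
import numpy as np

def convolve2d(image, feature):
    """Slide the feature over every position and record how well it matches.
    Values near 1 mean a strong match, near -1 a strong match to the negative."""
    fh, fw = feature.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + fh, x:x + fw]
            out[y, x] = np.mean(patch * feature)   # normalised match score
    return out

def relu(x):
    """Rectified Linear Unit: replace every negative value with 0."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Step a small window across the image and keep the maximum in each window."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.choice([-1.0, 1.0], size=(9, 9))        # toy black/white image
feature = np.array([[1.0, -1.0], [-1.0, 1.0]])            # toy 2x2 feature
feature_map = max_pool(relu(convolve2d(image, feature)))  # one conv "layer" for one filter
```

A full convolution layer simply repeats this for many features, producing one feature map per filter.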

12 Object Recognition

13 Object Recognition http://extrapolated-art.com/

14 Reinforcement Learning
Learning is done through trial and error, based on rewards or punishments.
Agents independently develop successful strategies that lead to the greatest long-term reward.
No hand-engineered features or domain heuristics are provided; the agents are capable of learning directly from raw inputs.
Reinforcement learning is the problem of getting an agent to act in the world so as to maximise its rewards. For example, consider teaching a dog a new trick: you cannot tell it what to do, but you can reward or punish it when it does the right or wrong thing. It then has to figure out what it did that earned the reward or punishment; this is known as the credit assignment problem. We can use a similar method to train computers on many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs (see the sketch below).
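A minimal tabular Q-learning loop illustrates this trial-and-error idea (a sketch with a made-up placeholder environment and arbitrary state/action sizes, not the deep reinforcement learning setup used in the systems above):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))          # estimated long-term reward per state/action
alpha, gamma, epsilon = 0.1, 0.99, 0.1       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    """Placeholder environment: returns (next_state, reward, done).
    A real agent would interact with a game or simulator here."""
    return rng.integers(n_states), float(rng.random() < 0.1), False

state = 0
for t in range(1000):
    # Explore occasionally, otherwise act greedily on the current estimates.
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward, done = step(state, action)
    # Credit assignment: nudge the estimate towards reward + discounted future value.
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])
    state = 0 if done else next_state
```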

15 Reinforcement Learning
AlphaGo, a system built on deep neural networks trained using reinforcement learning, defeated Lee Sedol (the strongest Go player of the last decade) by 4 games to 1.

16 Neural Networks Basics

17 Perceptron ”the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence” Frank Rosenblatt, 1957 -we’ve seen how deep learning has impacted all those areas and there is current research looking into self reproducing models however we are very far from a conscious model, as we aren’t even sure how to quantify consciousness. -having said that let’s now watch an interview from 1960 when the perceptron became popular in the community and see a very similar trend of excitement appear, however nowadays we actually are reaching a state where we have the computational resources to make those promises happen. Be more skeptical here, say that the first 4 are entirely possible and we have made huge progress in those areas however the last one is very ambitious and naive and we are a very long way from achieving it. The precursor to an algorithm that will be able to walk … Although NN have gained popularity in recent years it’s been around for a very long time but only now have we started to see significant success in these areas.

18 Perceptron to Logistic Regression (recap)

19 Logistic Regression (recap)
Linear model capable of solving 2-class problems.
Uses the sigmoid function to scale the output to [0,1]:
f(x) = \sigma(\mathbf{w}^T\mathbf{x} - t) = \frac{1}{1+e^{-(\mathbf{w}^T\mathbf{x}-t)}}

20 Logistic Regression (recap)
Uses the log-loss function (cross entropy) to measure the error to be minimized:
E = - \sum_{i=1}^{N}\left\{ y_i \log f(x_i) + (1-y_i)\log\left(1-f(x_i)\right) \right\}

21 Gradient Descent (recap)
Update rule: update each parameter in the negative direction of the gradient,
w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}
A negative gradient means the value of w_1 should be increased; a positive gradient means the value of w_1 should be decreased.

22 Gradient Descent (recap)
Log-loss function: the gradient is given by the partial derivative with respect to each parameter w_i.
What assumptions do we need to make about the cost function C for backpropagation to apply? The first is that it can be written as an average C = \frac{1}{n}\sum_x C_x over cost functions C_x for individual training examples x. This is the case for the quadratic cost, where the cost of a single example is C_x = \frac{1}{2}\|y - a^L\|^2, and it holds for the other cost functions we use. The reason we need this assumption is that backpropagation actually computes the partial derivatives \partial C_x/\partial w and \partial C_x/\partial b for a single training example; we then recover \partial C/\partial w and \partial C/\partial b by averaging over training examples. With this in mind, we suppose the training example x is fixed and drop the x subscript, writing the cost C_x simply as C.

23 Gradient Descent
The gradient is given by the partial derivative with respect to each parameter w_i (see the sketch below).
Note that the -t term no longer appears: it can be treated as an additional neuron with a trainable weight.
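Putting the log-loss gradient and the update rule together, a single sigmoid unit can be trained with plain gradient descent (a NumPy sketch on synthetic data; the bias is folded in as an extra weight on a constant input, as noted above, and the learning rate and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.normal(size=(N, D))
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # synthetic 2-class labels
X = np.hstack([X, np.ones((N, 1))])              # constant column replaces the -t threshold

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(D + 1)
eta = 0.1
for epoch in range(100):
    f = sigmoid(X @ w)                           # predictions in [0, 1]
    grad = X.T @ (f - y)                         # gradient of the log-loss w.r.t. w
    w -= eta * grad / N                          # step in the negative gradient direction
```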

24 Gradient Descent
What if we add another unit (neuron)? How do we update the parameters?
The new unit computes its own sigmoid output:
\frac{1}{1+e^{-(\textbf{w}_{(1)}^T \textbf{x}-t)}}

25 Gradient Descent Gradient is computed in the same way.
How do we combine the outputs of the two neurons?
\frac{\partial E}{\partial w_{(0)j}} = \frac{\partial E}{\partial f(x)}\frac{\partial f(x)}{\partial (\textbf{w}_{(0)}^T\textbf{x}-t)}\frac{\partial (\textbf{w}_{(0)}^T\textbf{x}-t)}{\partial w_{(0)j}} = \sum_{i} (f(x_i)-y_i)x_{ij}

26 Multi-layer Perceptron
Two neurons can only be combined by using another neuron:
\frac{\partial E}{\partial w_{(0)j}} = \frac{\partial E}{\partial f(x)}\frac{\partial f(x)}{\partial (\textbf{w}_{(0)}^T\textbf{x}-t)}\frac{\partial (\textbf{w}_{(0)}^T\textbf{x}-t)}{\partial w_{(0)j}} = \sum_{i} (f(x_i)-y_i)x_{ij}

27 Error Function
Regression (network predicts real values): squared error,
E = \frac{1}{2} \sum_{i=1}^{N} (f(x_i)-y_i)^2
Classification (network predicts class probability estimates): cross entropy, as in logistic regression.
Classification (predict discrete labels) versus regression (predict real values): for classification the last layer is a softmax, for regression a plain linear neuron, and the error function is chosen to suit the problem (see the sketch below).
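Both error functions are a couple of lines of code (a sketch assuming the targets y and network outputs f are given as NumPy arrays; the example values are made up):

```python
import numpy as np

def squared_error(f, y):
    """Regression: E = 1/2 * sum_i (f(x_i) - y_i)^2."""
    return 0.5 * np.sum((f - y) ** 2)

def cross_entropy(f, y, eps=1e-12):
    """Classification: E = -sum_i { y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) }."""
    f = np.clip(f, eps, 1 - eps)           # avoid log(0)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

y = np.array([1.0, 0.0, 1.0])
f = np.array([0.9, 0.2, 0.7])
print(squared_error(f, y), cross_entropy(f, y))
```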

28 Gradient Descent
The network output is a real value. Note the use of the chain rule to compute the derivative; we now go through backpropagating the error.

29 BackProp
The output is a real value. How do we update w_(0,0) and w_(0,1)?
We propagate the error backwards through the network.

30 BackProp
Descend to the next layer and compute the gradient with respect to w_(0,0). For the squared error on a real-valued output,
\frac{\partial E}{\partial f(x_i)} = f(x_i) - y_i
Applying the chain rule layer by layer:
\frac{\partial E}{\partial w_{(0,0)j}} = \frac{\partial E}{\partial f_2 (x)}\frac{\partial f_2 (x)}{\partial (\textbf{w}_{(1)}^T\textbf{f}_1)}\frac{\partial (\textbf{w}_{(1)}^T\textbf{f}_1)}{\partial \textbf{f}_{1}} \frac{\partial \textbf{f}_1(x)}{\partial (\textbf{w}_{(0,0)}^T\textbf{x})} \frac{\partial (\textbf{w}_{(0,0)}^T\textbf{x})}{\partial w_{(0,0)j}}
= (f(x_i) - y_i) \frac{\partial (\textbf{w}_{(1)}^T\textbf{f}_1)}{\partial (\textbf{w}_{(0,0)}^T\textbf{x})}\frac{\partial (\textbf{w}_{(0,0)}^T\textbf{x})}{\partial w_{(0,0)j}}
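The same chain-rule steps can be written out for a tiny network with one hidden layer (a NumPy sketch of backpropagation for the regression case; the data, layer sizes and learning rate are illustrative assumptions, and the biases are folded into the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 100, 3, 4                            # examples, inputs, hidden units
X = rng.normal(size=(N, D))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)  # synthetic regression target

W0 = rng.normal(scale=0.5, size=(D, H))        # input -> hidden weights (w_(0,*))
w1 = rng.normal(scale=0.5, size=H)             # hidden -> output weights (w_(1))
eta = 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):
    # Forward pass
    f1 = sigmoid(X @ W0)                       # hidden-layer activations
    f2 = f1 @ w1                               # real-valued output

    # Backward pass: propagate the error through the network
    dE_df2 = f2 - y                            # dE/df2 for the squared error
    grad_w1 = f1.T @ dE_df2                    # gradient for the output weights
    dE_df1 = np.outer(dE_df2, w1)              # error pushed back to the hidden layer
    grad_W0 = X.T @ (dE_df1 * f1 * (1 - f1))   # chain through the sigmoid derivative

    w1 -= eta * grad_w1 / N
    W0 -= eta * grad_W0 / N
```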

31 Deep Neural Network
Can add more layers, and more neurons in each layer.
A bias neuron can be used to shift the decision boundary, as in the Perceptron: the bias replaces the -t term that was removed from the equations and is simply an additional neuron with a trainable weight.

32 Activation Functions
Commonly used functions (the slide plots f(x) against x for each of them).

33 Activation Functions
Sigmoid: output can be interpreted as a probability.
ReLU (Rectified Linear Unit): helps avoid the vanishing gradient problem.
Tanh (Hyperbolic Tangent): converges faster than the sigmoid function.
Softmax: generalisation of the logistic function whose outputs lie in [0,1] and sum to 1, so they can be interpreted as probabilities. Where the sigmoid is used for 2-class logistic regression, the softmax is used for multi-class logistic regression.
A sketch of these four functions follows below.
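For reference, the four activations can be written as short NumPy functions (a sketch):

```python
import numpy as np

def sigmoid(x):
    """Squashes input to (0, 1); output can be read as a probability."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squashes input to (-1, 1); zero-centred, often converges faster than the sigmoid."""
    return np.tanh(x)

def relu(x):
    """Rectified Linear Unit: identity for positive inputs, 0 otherwise."""
    return np.maximum(x, 0)

def softmax(z):
    """Generalisation of the logistic function: outputs are in (0, 1) and sum to 1."""
    e = np.exp(z - z.max())               # subtract the max for numerical stability
    return e / e.sum()
```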

34 Decision boundary XOR problem (non-linear)
Neural networks are nonlinear models; the nonlinearity comes from layering several nonlinear activation functions.
The network could easily have solved the task perfectly, but it is possible that the two misclassified points were treated as outliers or mislabelled examples, and the network allowed them to be misclassified in order to generalise better.
Takasi J. Ozaki, Decision Boundaries for Deep Learning and other Machine Learning classifiers

35 Potential Issues

36 Local minima
In the high-dimensional parameter space of a deep network, most critical points turn out to be saddle points rather than local optima. Because of this, local minima are not a significant issue in practice.

37 Vanishing gradient problem
Appears when a change in a parameter's value causes only a very small change in the network output.
Manifests as very small gradient values when the parameter update is computed.
(The slide shows the derivative of the sigmoid function.)

38 Vanishing gradient problem
Appears in gradient-based methods and is caused by some activation functions (sigmoid or tanh) squashing their input into a very small output range.
Magnified by the addition of hidden layers, as shown in the example below.
(The slide plots the sigmoid and its derivative: the output of the function against its input.)
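The effect is easy to see numerically: the sigmoid's derivative is at most 0.25, so chaining many such factors shrinks the gradient quickly (a small NumPy illustration that ignores the weight terms for simplicity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)                    # peaks at 0.25 when x = 0

print(sigmoid_derivative(0.0))            # 0.25, the largest it can ever be

# Backpropagating through a stack of sigmoid layers multiplies these factors together,
# so the gradient reaching the early layers shrinks roughly geometrically.
pre_activations = np.zeros(10)            # best case: every unit sits at its steepest point
gradient_scale = np.prod(sigmoid_derivative(pre_activations))
print(gradient_scale)                     # 0.25**10, roughly 1e-6
```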

39 Overfitting
M = number of hidden units.
Figures from Bishop 2006: Pattern Recognition and Machine Learning.

40 Preventing Overfitting

41 Early Stopping

42 Weight sharing Parameters are shared by having the values stored in the same memory location Decrease amount of parameters at the cost of reducing model complexity Mostly used in convolutional and recurrent networks The main advantage of shared weights, is that you can substantially lower the degrees of freedom of your problem. Take the simplest case, think of a tied autoencoder, where the input weights are Wx∈RdWx∈Rd and the output weights are WTxWxT. You have lowered the parameters of your model by half from 2d→d2d→d. You can see some visualizations here: link. Similar results would be obtained in a Conv Net. This way you can get the following results: less parameters to optimize, which means faster convergence to some minima, at the expense of making your model less flexible. It is interesting to note that, this "less flexibility" can work as a regularizer many times and avoiding overfitting as the weights are shared with some other neurons. Therefore, it is a nice tweak to experiment with and I would suggest you to try both. I've seen cases where sharing information (sharing weights), has paved the way to better performance, and others, that made my model become significantly less flexible.

43 Dropout
Randomly omit some units of the network for a training batch (a group of training examples).
Encourages specialisation of the resulting thinned network to that batch.
Recall that gradient descent can be run over a batch instead of single examples. Suppose we present a batch of 100 examples to the network: before the forward pass that generates the predictions, we randomly deactivate some neurons so that they do not participate in learning from this batch and are not updated based on the error made on it.

44 Dropout It is a form of regularization
Akin to training an ensemble of networks, each trained on single batches (see the sketch below).
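A dropout layer is essentially a random binary mask applied during training. Below is a sketch of the common "inverted dropout" variant, which rescales the surviving units so that nothing needs to change at test time (the sizes and drop probability are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero out units during training; leave the layer untouched at test time."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= p_drop
    # Scale up the survivors ("inverted dropout") so the expected activation is unchanged.
    return activations * keep / (1.0 - p_drop)

hidden = rng.normal(size=(100, 64))      # a batch of 100 examples, 64 hidden units
hidden_train = dropout(hidden, p_drop=0.5, training=True)
hidden_test = dropout(hidden, training=False)
```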

45 Conclusions

46 Summary
Impressive performance on difficult tasks has made Deep Learning very popular.
Neural networks build on the Perceptron and Logistic Regression.
Training is done using Gradient Descent and Backprop.
Error function, activation function and architecture are problem dependent.
Easy to overfit, but there are ways to avoid it.

47 Research Directions Understanding more about how Neural Networks learn
Applications to vision, speech and problem solving.
Improving computational performance with specialised hardware, e.g. Tensor Processing Units (TPUs).
Moving towards more biologically inspired neurons, e.g. spiking neurons.

48 Libraries and Resources
Tensorflow: great support and lots of resources.
Theano: one of the first deep learning libraries; no multi-GPU support (development discontinued).
Keras: very high-level library that works on top of Theano or Tensorflow (a minimal example follows below).
Lasagne: similar to Keras, but only compatible with Theano.
Caffe: specialised more for computer vision than general deep learning.
Torch: uses the programming language Lua, has a wrapper for Python.
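As a starting point for implementing your own, here is a minimal Keras example (assuming Keras with the TensorFlow backend is installed; the data, layer sizes and hyperparameters are placeholders):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Placeholder data: 1000 examples with 20 features and a binary label.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = Sequential([
    Dense(32, activation="relu", input_dim=20),
    Dropout(0.5),                      # dropout as discussed above
    Dense(1, activation="sigmoid"),    # sigmoid output for a 2-class problem
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)
```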

49 Thank You!

