
1 Department of Electrical and Computer Engineering
CNT 6805 Network Science and Applications
Lecture 2: Unsupervised Deep Learning
Dr. Dapeng Oliver Wu, University of Florida
Department of Electrical and Computer Engineering, Fall 2016

2 Outline
Introduction to machine learning
Chronological development of ideas
Problems with neural networks
What exactly is different in deep learning
Energy-based models and training
Applications to real-world problems
Scalability issues

3 Learning to Learn
Examples: face recognition, object recognition, weather prediction.
ML can be broadly classified into three major categories of problems: clustering, regression, and classification.

4 Chronological Development
G0: blind guess.
G1: linear methods (PCA, LDA, LR). But what if the relationship is nonlinear?
G2: neural networks, which use multiple non-linear elements to approximate the mapping.
G3: kernel machines, which perform linear computations in an infinite-dimensional space without 'actually' learning a mapping.

5 Neural Network
A non-linear transformation (e.g. the sigmoid) is applied at the summing nodes of the hidden and output layers. The outputs are estimates of posterior probabilities.
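A minimal sketch of such a forward pass (illustrative layer sizes and names, not the lecture's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer feedforward pass with sigmoid units.

    Returns the hidden activations and the output, which can be read as an
    estimate of the posterior probability of the positive class.
    """
    h = sigmoid(W1 @ x + b1)   # hidden layer: non-linear transform of the summed inputs
    y = sigmoid(W2 @ h + b2)   # output layer: posterior probability estimate
    return h, y

# toy usage with random weights
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
h, y = forward(x, W1, b1, W2, b2)
print(y)   # a value in (0, 1)
```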

6 Back-Propagation If S is a logistic function,
then S’(x) = S(x)(1 – S(x))
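A quick numerical sketch (not from the slides; the values are illustrative) confirming this identity and showing where backpropagation uses it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # S'(x) = S(x)(1 - S(x))

# numerical check of the identity at a few points
z = np.linspace(-3, 3, 7)
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
print(np.allclose(numeric, sigmoid_prime(z), atol=1e-6))   # True

# backprop uses the identity when forming the error term, e.g. for a
# squared-error output unit with pre-activation a and target t:
a, t = 0.3, 1.0
delta = (sigmoid(a) - t) * sigmoid_prime(a)   # dE/da
```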

7 Challenges with Multi-Layer NNs
They get stuck in local minima or plateaus because of random initialization.
Vanishing gradient: the error signal becomes smaller and smaller in the lower layers.
Excellent training performance but poor test performance: a classic case of overfitting.

8 Why Vanishing Gradient?
Both the sigmoid and its derivative are < 1 (the derivative is at most 1/4). The gradient used to train each layer is a chain-rule product of such factors through all the layers above, so the lower layers remain undertrained.
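A toy calculation (an assumed example, not from the slides) illustrating the effect: each layer multiplies the backpropagated signal by a weight and by S'(a) ≤ 1/4, so its magnitude shrinks roughly geometrically with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
depth = 10
grad = 1.0                        # error signal at the output layer
for layer in reversed(range(depth)):
    w = rng.normal(scale=1.0)     # a typical weight on the backward path
    a = rng.normal()              # a typical pre-activation
    s = sigmoid(a)
    grad *= w * s * (1.0 - s)     # chain rule: multiply by S'(a) <= 1/4 and by w
    print(f"layer {layer:2d}: |gradient| = {abs(grad):.2e}")
# the printed magnitudes decay rapidly, so the lowest layers barely train
```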

9 Deep Learning – Early Phase
Unsupervised pre-training followed by traditional supervised backpropagation: let the data speak for itself and try to derive the inherent features of the input.
Why does it click? Pre-training helps create a data-dependent prior and hence better regularization, gives a set of W's that is a better starting point, and leaves the lower layers better optimized, so vanishing gradients do not hurt as much.

10 Restricted Boltzmann Machine-I
x: visible (input) units; h: hidden (latent) units.
Energy: E(x, h) = −bᵀx − cᵀh − hᵀWx.
Joint probability: P(x, h) = e^{−E(x, h)} / Z, where Z is the partition function Z = Σ_{x,h} e^{−E(x, h)}.
The target is to maximize P(x) (or its log-likelihood).
P(h|x) and P(x|h) are factorizable over units.
For the binary case {0, 1}, the sigmoid function arises again: P(h_j = 1 | x) = sigm(c_j + W_j x) and P(x_i = 1 | h) = sigm(b_i + (Wᵀ)_i h).
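As a rough illustration (a sketch, not the lecture's code; shapes and variable names are assumptions), these quantities for a binary RBM can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(x, h, W, b, c):
    """E(x, h) = -b^T x - c^T h - h^T W x for binary visible x and hidden h."""
    return -(b @ x) - (c @ h) - (h @ W @ x)

def p_h_given_x(x, W, c):
    """Factorized conditional: P(h_j = 1 | x) = sigmoid(c_j + W_j x)."""
    return sigmoid(c + W @ x)

def p_x_given_h(h, W, b):
    """Factorized conditional: P(x_i = 1 | h) = sigmoid(b_i + (W^T)_i h)."""
    return sigmoid(b + W.T @ h)
```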

11 Restricted Boltzmann Machine-II
The gradient of the log-likelihood looks like
∂ log P(x)/∂θ = −∂F(x)/∂θ + Σ_{x̃} P(x̃) ∂F(x̃)/∂θ,
where F(x) = −log Σ_h e^{−E(x, h)} is called the free energy.
If we average it over the training set Q, the RHS looks like
−E_Q[∂F(x)/∂θ] + E_P[∂F(x̃)/∂θ].
So, gradient = −training + model = −observable + reconstruction.

12 Sampling Approximations
The model term is generally intractable, but approximations (contrastive divergence, which replaces the model expectation with samples from a short Gibbs chain started at the data) lead to a simpler sampling problem. The update equation now uses the sampled reconstruction x̃ in place of the full expectation over the model.

13 Cont'd
Now we take the partial derivatives of the free energy with respect to the parameter vector (W, b, c). The resulting stochastic update rule for the weights looks like
ΔW ∝ P(h = 1 | x) xᵀ − P(h = 1 | x̃) x̃ᵀ,
where x̃ is the reconstruction obtained after k Gibbs steps. Usually one step (CD-1) is sufficient.
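A sketch of one CD-1 update built from these conditionals (learning rate, sampling details, and function names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM."""
    # positive phase: hidden probabilities driven by the data
    ph_data = sigmoid(c + W @ x)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # negative phase: one Gibbs step gives the reconstruction
    x_recon = (rng.random(b.shape) < sigmoid(b + W.T @ h_sample)).astype(float)
    ph_recon = sigmoid(c + W @ x_recon)

    # ascend the log-likelihood: data (observable) term minus reconstruction term
    W += lr * (np.outer(ph_data, x) - np.outer(ph_recon, x_recon))
    b += lr * (x - x_recon)
    c += lr * (ph_data - ph_recon)
    return W, b, c
```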

14 Deep Belief Network
A DBN defines conditional distributions for layers 0, 1, ..., l−1 and a joint distribution for the top two layers.
Each layer is initialized as an RBM, and training is done greedily, layer by layer, in sequential order. The pre-trained stack is then fed to (used to initialize) a conventional neural network.
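A sketch of the greedy layer-by-layer loop, assuming the `cd1_update` function from the previous sketch is in scope (layer sizes and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(data, layer_sizes, epochs=10, lr=0.1):
    """Greedy layer-wise pre-training of a stack of RBMs.

    `data` is an (n_samples, n_visible) binary array; `layer_sizes` lists the
    hidden-layer widths.  Each RBM is trained with CD-1 on the hidden
    activations produced by the layer below.
    """
    rng = np.random.default_rng(3)
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        n_visible = layer_input.shape[1]
        W = 0.01 * rng.normal(size=(n_hidden, n_visible))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            for x in layer_input:
                W, b, c = cd1_update(x, W, b, c, lr)   # CD-1 step per example
        weights.append((W, b, c))
        # feed the mean hidden activations upward as the next layer's "data"
        layer_input = sigmoid(layer_input @ W.T + c)
    return weights   # used to initialize a conventional neural network
```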

15 Deep Autoencoders
An autoencoder encodes its input and then reconstructs it at the output. Autoencoders can be stacked, like the RBMs of a DBN, and the training procedure is similarly layer by layer, except that the final step may be supervised (just like backprop) or unsupervised.
Variants: denoising AE, contractive AE, regularized AE.
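A minimal sketch of one denoising-autoencoder layer with tied weights and sigmoid units (all names, the corruption level, and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_step(x, W, b, c, lr=0.1, noise=0.3):
    """One gradient step of a denoising autoencoder with tied weights.

    The clean input x is corrupted, encoded, and decoded; the parameters move
    to reduce the cross-entropy between x and its reconstruction.
    """
    x_tilde = x * (rng.random(x.shape) > noise)   # randomly zero a fraction of the inputs
    h = sigmoid(W @ x_tilde + c)                  # encode the corrupted input
    x_hat = sigmoid(W.T @ h + b)                  # decode with tied weights W^T
    err = x_hat - x                               # cross-entropy gradient at the output pre-activation
    grad_h = (W @ err) * h * (1.0 - h)            # backpropagate through the sigmoid encoder
    W -= lr * (np.outer(h, err) + np.outer(grad_h, x_tilde))
    b -= lr * err
    c -= lr * grad_h
    return W, b, c
```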

16 Dimensionality Reduction
(Figure with panels labeled: Original, DBN, Logistic PCA, Just PCA.)

17 What Does It Learn?
Higher layers take a bird's-eye view and learn invariant features.
(Figure: features learned by a denoising AE vs. stacked RBMs (DBN).)

18

19 Computational Considerations
Part 1, unsupervised pre-training: dominated by matrix multiplications. The weight updates are sequential (just like adaptive systems/filters), but can be parallelized over nodes/dimensions. Trick: use minibatches, updating the weights only once per batch of examples with the averaged gradient.
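A generic sketch of the minibatch trick (the `grad_fn` callback, batch size, and learning rate are assumptions, not the lecture's code):

```python
import numpy as np

def minibatch_sgd(W, data, grad_fn, batch_size=64, lr=0.01, epochs=5, seed=5):
    """Update W once per minibatch using the average gradient over the batch.

    `grad_fn(W, batch)` should return the average gradient of the loss over
    that batch; batching lets the per-example matrix multiplications be
    computed together and parallelized.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            W -= lr * grad_fn(W, batch)   # one update per minibatch
    return W
```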

20 Unsupervised Pre-Training: Rarely Used Now
With a large number of labeled training examples, the lower layers will eventually change anyway. Recent architectures prefer careful weight initialization instead: Glorot and Bengio (2010) propose a zero-mean Gaussian whose variance is scaled by the layer fan-in and fan-out; Srivastava, Hinton, et al. (2014) propose a dropout method to mitigate overfitting; He et al. (2015) derive the optimal weight initialization for ReLU/PReLU activations.
ReLU: Rectified Linear Unit. PReLU: Parametric Rectified Linear Unit.
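A sketch of the two initialization schemes using their standard variance formulas (layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def glorot_normal(fan_in, fan_out):
    """Glorot/Xavier initialization: zero-mean Gaussian, variance 2 / (fan_in + fan_out)."""
    return rng.normal(scale=np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    """He initialization for ReLU/PReLU layers: zero-mean Gaussian, variance 2 / fan_in."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W1 = glorot_normal(784, 256)   # e.g. a sigmoid/tanh layer
W2 = he_normal(256, 128)       # e.g. a ReLU layer
```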

21 Dropout Neural Net Model (1)
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, 15(1):1929–1958.
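A sketch of (inverted) dropout applied to a hidden-layer activation, in the spirit of the cited paper (the keep probability and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(h, p_keep=0.5, training=True):
    """Randomly drop units during training; rescale so the expected activation is unchanged."""
    if not training:
        return h                            # at test time the full network is used
    mask = (rng.random(h.shape) < p_keep)   # keep each unit independently with prob p_keep
    return h * mask / p_keep                # inverted dropout: rescale the survivors

h = np.array([0.2, 0.7, 0.1, 0.9])
print(dropout(h))                    # some entries zeroed, others scaled by 1/p_keep
print(dropout(h, training=False))    # unchanged at test time
```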

22 Dropout Neural Net Model (2)

23 Dropout Neural Net Model (3)

24 Example: Handwritten Digits Recognition

25 Recurrent Neural Networks (RNN)
Deep learning for time-series data: an RNN uses memory (a recurrent hidden state) to process input sequences. The output y can be any supervised target, or even future samples of x, as in prediction.
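A minimal sketch of a vanilla RNN step with a tanh hidden state (names and shapes are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    """One step of a vanilla RNN: the hidden state is the network's memory."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)   # update memory from input and past state
    y_t = Why @ h_t + by                           # output: a target, or a prediction of x_{t+1}
    return h_t, y_t

def rnn_forward(xs, h0, params):
    """Process a whole sequence by carrying the hidden state forward."""
    h, ys = h0, []
    for x_t in xs:
        h, y_t = rnn_step(x_t, h, *params)
        ys.append(y_t)
    return ys, h
```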

26 Vanishing/Exploding Gradients: Both Temporal and Spatial
Multi-layered RNNs have their lower layers undertrained (spatial), and information from previous inputs is not properly carried forward because the gradients are chained through time (temporal). As a result, plain RNNs cannot handle long-range dependencies.

27 Why, again? We cannot relate inputs from the distant past to the target output.

28 Long Short-Term Memory (LSTM)
Error signals trapped within a memory cell flow through it unchanged. The gates have to learn which errors to trap in the cell and which ones to forget.
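A sketch of one LSTM step showing the gating (gate ordering, names, and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.  W maps [h_prev; x_t] to the stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.shape[0]
    f = sigmoid(z[:n])            # forget gate: which stored content to discard
    i = sigmoid(z[n:2 * n])       # input gate: which new content to trap in the cell
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:])        # candidate cell content
    c_t = f * c_prev + i * g      # cell state: error can flow through this sum unchanged
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```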

29 Conclusion
A practical breakthrough: companies are happy, but theoreticians remain unconvinced. Deep learning architectures have won many competitions in the recent past, and there are plans to use these concepts to build an artificial brain for big data.

