1 Deep Architectures for Artificial Intelligence

2 Learning Features: The Past
The traditional model of pattern recognition (in use since the late 1950s) combines hand-crafted features with a fixed kernel machine or other simple classifier. The first learning machine was the Perceptron: built at Cornell in 1960, it was a linear classifier on top of a simple feature extractor. The vast majority of practical applications of ML today still use glorified linear classifiers or glorified template matching.
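For concreteness, here is a minimal numpy sketch of the perceptron learning rule described above; the two synthetic "features" per sample, the labels, and the number of epochs are illustrative assumptions, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linearly separable data: two hand-picked features per sample
# (stand-ins for the output of a simple feature extractor).
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

# Perceptron learning rule: a linear classifier w.x + b, updated only on mistakes.
w, b = np.zeros(2), 0.0
for epoch in range(20):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the decision boundary)
            w += yi * xi             # nudge the hyperplane toward the sample
            b += yi

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```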

3 Learning Features: The Future
Modern approaches are based on trainable features AND a trainable classifier, since designing a feature extractor by hand requires considerable effort by experts.

4 Machine Learning
Supervised learning: the training data consists of inputs together with their corresponding outputs.
Unsupervised learning: the training data consists of inputs without their corresponding outputs.

5 Neural networks
Generative model: models the distribution of the input as well as the output, P(x, y).
Discriminative model: models the posterior probabilities, P(y | x).
[Figure: joint densities P(x, y1), P(x, y2) contrasted with posteriors P(y1 | x), P(y2 | x).]
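To make the joint-versus-posterior distinction concrete, here is a small sketch using two classical (non-neural) models as stand-ins that are not mentioned on the slides: scikit-learn's GaussianNB learns class priors and class-conditional densities (a generative model of p(x, y)), while LogisticRegression models p(y | x) directly. The synthetic data is an assumption.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models p(x, y)
from sklearn.linear_model import LogisticRegression   # discriminative: models p(y | x)

rng = np.random.default_rng(0)
# Illustrative two-class data: two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)            # learns priors and p(x | y), i.e. the joint
disc = LogisticRegression().fit(X, y)   # learns p(y | x) directly

x_new = np.array([[1.5, 1.5]])
print(gen.predict_proba(x_new))    # posterior derived from the learned joint
print(disc.predict_proba(x_new))   # posterior modeled directly
```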

6 Neural networks: two-layer networks (sigmoid neurons)
Back-propagation:
Step 1: Randomly initialize the weights and determine the output vector (forward pass).
Step 2: Evaluate the gradient of an error function with respect to the weights.
Step 3: Adjust the weights; repeat the forward pass, gradient evaluation, and weight update until the error is low enough. A minimal numpy sketch of this loop follows.
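A minimal sketch of the three-step loop, assuming squared error, a tiny XOR dataset, four hidden units, and a fixed learning rate (all illustrative choices not given on the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative data: XOR, a classic non-linearly-separable toy problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: randomly initialize the weights of a two-layer sigmoid network.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
lr = 1.0   # learning rate (assumed)

for epoch in range(10000):
    # Forward pass: determine the output vector.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Step 2: evaluate the gradient of the squared-error function by backprop.
    err = out - y
    d_out = err * out * (1 - out)         # delta at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # delta at the hidden layer

    # Step 3: adjust the weights; repeat until the error is low enough.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)
    if (err ** 2).mean() < 1e-3:
        break
```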

7 Deep Neural Networks
ANNs with more than two hidden layers are referred to as deep. Given enough hidden neurons, a single hidden layer is sufficient to approximate any function to any degree of precision; however, too many neurons can quickly make the network infeasible to train. Adding layers greatly improves the network's learning capacity, thus reducing the number of neurons needed.

8 Deep Learning
Deep learning is about representing high-dimensional data. Learning representations of data means discovering and disentangling the independent explanatory factors that underlie the data distribution. The manifold hypothesis: natural data lives on a low-dimensional (non-linear) manifold because the variables in natural data are mutually dependent. Internal intermediate representations can be viewed as latent variables to be inferred, and deep belief networks are a particular type of latent variable model.

9 Hierarchy of Representations
A hierarchy of representations with increasing levels of abstraction; each stage is a kind of trainable feature transform.
Image recognition: pixel → edge → texton → motif → part → object
Text: character → word → word group → clause → sentence → story
Speech: sample → spectral band → sound → phoneme → word

10 How to train deep models?
Purely supervised:
- Initialize the parameters randomly.
- Train in supervised mode, typically with SGD, using backprop to compute gradients.
- Used in most practical systems for speech and image recognition.
Unsupervised, layerwise + supervised classifier on top:
- Train each layer unsupervised, one after the other.
- Train a supervised classifier on top, keeping the other layers fixed.
- Good when very few labeled samples are available.
Unsupervised, layerwise + global supervised fine-tuning:
- Add a classifier layer, and retrain the whole thing supervised.
- Good when the label set is poor (e.g. pedestrian detection).
Unsupervised pre-training often uses regularized auto-encoders (a layerwise pre-training sketch follows this list).
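The sketch below illustrates the common core of the second and third strategies: pre-train layers unsupervised, then put a supervised classifier on top of the frozen features. Plain tied-weight sigmoid autoencoders stand in here for the regularized auto-encoders mentioned above; the data shapes, learning rates, and use of scikit-learn's LogisticRegression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(X, n_hidden, lr=0.1, epochs=200):
    """Train one tied-weight sigmoid autoencoder layer, unsupervised."""
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    for _ in range(epochs):
        h = sigmoid(X @ W)        # encode
        r = sigmoid(h @ W.T)      # decode with tied weights
        err = r - X
        d_r = err * r * (1 - r)
        d_h = (d_r @ W) * h * (1 - h)
        # Gradient of the reconstruction error w.r.t. the tied weight matrix.
        W -= lr * (X.T @ d_h + d_r.T @ h) / len(X)
    return W

# Illustrative data: plenty of unlabeled samples, very few labeled ones.
X_unlabeled = rng.random((500, 20))
X_labeled, y_labeled = rng.random((50, 20)), rng.integers(0, 2, 50)

# Unsupervised, layerwise: train each layer on the codes of the layer below.
W1 = pretrain_autoencoder(X_unlabeled, 10)
W2 = pretrain_autoencoder(sigmoid(X_unlabeled @ W1), 5)

# Supervised classifier on top, keeping the pretrained layers fixed.
features = sigmoid(sigmoid(X_labeled @ W1) @ W2)
clf = LogisticRegression().fit(features, y_labeled)
```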

11 Boltzmann Machine Model
One visible (input) layer and one hidden layer, typically with binary states for every unit.
Stochastic (vs. deterministic) and recurrent (vs. feed-forward).
A generative model (vs. discriminative): it estimates the distribution of the observations (say p(image)), while traditional discriminative networks only estimate the labels (say p(label | image)).
The energy of the network and the probability of a unit's state are defined as follows, where the scalar T is referred to as the "temperature" and \Delta E_i is the energy gap between unit i being off and on:
E(s) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i \theta_i s_i,   p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}}
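A minimal sketch of these two definitions for a tiny fully connected Boltzmann machine; the size, weights, and temperature are illustrative assumptions. One stochastic sweep over the units is shown to make the "stochastic, recurrent" character concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 5-unit Boltzmann machine: symmetric weights, no self-connections.
n = 5
W = rng.normal(scale=0.5, size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
theta = rng.normal(scale=0.5, size=n)     # biases
s = rng.integers(0, 2, n).astype(float)   # current binary state
T = 1.0                                   # "temperature"

def energy(s):
    # E(s) = -sum_{i<j} w_ij s_i s_j - sum_i theta_i s_i
    return -0.5 * s @ W @ s - theta @ s

def p_on(i, s):
    # Energy gap when unit i switches on: Delta E_i = sum_j w_ij s_j + theta_i
    delta_e = W[i] @ s + theta[i]
    return 1.0 / (1.0 + np.exp(-delta_e / T))

# One sweep of stochastic updates over all units (Gibbs sampling).
for i in range(n):
    s[i] = float(rng.random() < p_on(i, s))
print("energy after one sweep:", energy(s))
```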

12 Restricted Boltzmann Machine Model
A bipartite graph: there are no intra-layer connections, so inference runs feed-forward between the two layers. The RBM drops the temperature factor T; otherwise it is defined like the BM. One important feature of the RBM is that, given the visible units, the hidden units are conditionally independent (and vice versa), which leads to a beautiful result later on.

13 Restricted Boltzmann Machine
Two ingredients define a Restricted Boltzmann Machine:
- the states of all the units, obtained through a probability distribution;
- the weights of the network, obtained through training (Contrastive Divergence).
As mentioned before, the objective of the RBM is to estimate the distribution of the input data, and this goal is fully determined by the weights, given the input. The energy defined for the RBM (with visible biases a_i and hidden biases b_j) is:
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

14 Restricted Boltzmann Machine
Distribution of the visible layer of the RBM (a Boltzmann distribution):
p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}
where Z is the partition function, defined as the sum of e^{-E(v, h)} over all possible configurations of {v, h}:
Z = \sum_{v, h} e^{-E(v, h)}
Probability that hidden unit j is on (binary state 1), and symmetrically for visible unit i:
p(h_j = 1 | v) = \sigma\big(b_j + \sum_i v_i w_{ij}\big),   p(v_i = 1 | h) = \sigma\big(a_i + \sum_j w_{ij} h_j\big)
where \sigma(x) = 1 / (1 + e^{-x}) is the logistic/sigmoid function.
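A minimal numpy sketch of how these conditionals are used in Contrastive Divergence with a single Gibbs step (CD-1); the binary data, layer sizes, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

# Illustrative binary data and sizes (not taken from the slides).
V = (rng.random((200, 16)) > 0.5).astype(float)
n_visible, n_hidden, lr = 16, 8, 0.1

W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
a = np.zeros(n_visible)   # visible biases a_i
b = np.zeros(n_hidden)    # hidden biases b_j

for epoch in range(30):
    for v0 in V:
        # Positive phase: p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
        ph0 = sigmoid(b + v0 @ W)
        h0 = sample(ph0)
        # Negative phase, one Gibbs step (hence CD-1):
        pv1 = sigmoid(a + h0 @ W.T)     # p(v_i = 1 | h)
        v1 = sample(pv1)
        ph1 = sigmoid(b + v1 @ W)
        # Contrastive Divergence update: <v h>_data - <v h>_reconstruction
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)
```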

15 Deep Belief Net Based on RBMs
[Figure: a stack of layers labeled data, h1, h2, h3; the top pair h2-h3 is drawn as an RBM.]
DBNs are based on stacks of RBMs: the top two hidden layers form an undirected associative memory (which can be regarded as shorthand for an infinitely deep stack), and the remaining hidden layers form a directed acyclic graph. The red (upward) arrows are NOT part of the generative model; they are only used for inference.

16 Training Deep Belief Nets
The previous discussion gives an intuition for training a stack of RBMs one layer at a time. Hinton proved that this greedy learning algorithm is effective, in the sense that each added layer improves a variational bound on the data likelihood (see slide 19). First, learn all the weights tied.

17 Training Deep Belief Nets
Then freeze the bottom layer and relearn all the other layers.

18 Training Deep Belief Nets
Then freeze the bottom two layers and relearn all the other layers.
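Putting slides 16-18 together, the greedy layer-wise procedure can be sketched with scikit-learn's BernoulliRBM, which trains each layer with a contrastive-divergence-style update; the data, layer sizes, and hyperparameters below are illustrative assumptions, and this library routine does not implement the weight tying or the wake-sleep fine-tuning discussed on the following slides.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = (rng.random((500, 64)) > 0.5).astype(float)   # illustrative binary data

# Greedy layer-wise training: each RBM is trained on the hidden representation
# produced by the (frozen) RBMs below it.
layer_sizes = [32, 16, 8]
rbms, inputs = [], X
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=20, random_state=0)
    rbm.fit(inputs)
    rbms.append(rbm)
    inputs = rbm.transform(inputs)   # p(h = 1 | v) feeds the next layer
```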

19 Training Deep Belief Nets
Each time we learn a new layer, the inference at the lower layers becomes incorrect, but, as Hinton proved, the variational bound on the log probability of the data improves. Because the inference at the lower layers becomes incorrect, Hinton uses a fine-tuning procedure to adjust the weights, called the wake-sleep algorithm.

20 Training Deep Belief Nets
Wake-sleep algorithm:
Wake phase: do a bottom-up pass, sampling h from the recognition weights given the input v at each layer, and then adjust the generative weights by the RBM learning rule.
Sleep phase: do a top-down pass, starting from a random state of h at the top layer and generating v; then the recognition weights are modified.
An analogy for the wake-sleep algorithm:
Wake phase: if reality differs from the imagination, modify the generative weights so that what is imagined becomes as close as possible to reality.
Sleep phase: if the illusions produced by the concepts learned during the wake phase differ from those concepts, modify the recognition weights so that the illusions become as close as possible to the concepts.
A minimal single-layer sketch of these two phases follows.
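This sketch applies the wake-sleep delta rules to a single visible/hidden layer pair rather than a full DBN, and the random top-level states, data, sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

# Single visible/hidden layer pair with separate recognition and generative weights.
n_visible, n_hidden, lr = 16, 8, 0.05
R = rng.normal(scale=0.1, size=(n_visible, n_hidden))   # recognition (bottom-up)
G = rng.normal(scale=0.1, size=(n_hidden, n_visible))   # generative (top-down)
data = (rng.random((200, n_visible)) > 0.5).astype(float)

for epoch in range(50):
    for v in data:
        # Wake phase: recognize h from real data, then adjust the generative
        # weights so the "imagination" sigmoid(h @ G) gets closer to reality v.
        h = sample(sigmoid(v @ R))
        G += lr * np.outer(h, v - sigmoid(h @ G))

        # Sleep phase: dream a fantasy (h_f, v_f) from the generative side, then
        # adjust the recognition weights so sigmoid(v_f @ R) recovers h_f.
        h_f = sample(np.full(n_hidden, 0.5))    # random top-level state
        v_f = sample(sigmoid(h_f @ G))
        R += lr * np.outer(v_f, h_f - sigmoid(v_f @ R))
```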

21 Useful Resources
Webpages:
- Geoffrey E. Hinton's readings (with source code available for DBN)
- Notes on Deep Belief Networks
- MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean
- Deep Learning Tutorials
- Hinton's tutorial, Fergus's tutorial, CUHK MMlab project
People:
- Geoffrey E. Hinton
- Andrew Ng
- Yoshua Bengio
- Yann LeCun
- Rob Fergus

