CSC2535 Lecture 5 Sigmoid Belief Nets

CSC2535 Lecture 5 Sigmoid Belief Nets Geoffrey Hinton

Discovering causal structure as a goal for unsupervised learning
- It is better to associate responses with the hidden causes than with the raw data.
- The hidden causes are useful for understanding the data.
- It would be interesting if real neurons really did represent independent hidden causes.

Bayes Nets: Directed Acyclic Graphical models
- The model generates data by picking states for each node using a probability distribution that depends on the values of the node's parents.
- The model defines a probability distribution over all the nodes. This can be used to define a distribution over the leaf nodes.
[Figure: a DAG with hidden causes at the top and visible effects at the leaf nodes.]
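In symbols (a standard way of writing this; the notation s_i for the state of node i and pa(i) for its parents is assumed here rather than taken from the slides):

```latex
% Joint distribution defined by a directed acyclic graphical model:
% each node depends only on the states of its parents.
\[
p(s_1, \dots, s_N) \;=\; \prod_{i=1}^{N} p\big(s_i \mid \mathrm{pa}(i)\big),
\qquad
p(v) \;=\; \sum_{h} p(v, h)
\]
% The second equation sums out the hidden nodes h to get the
% distribution over the visible (leaf) nodes v.
```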

Ways to define the conditional probabilities
- For nodes that have discrete values, we could use conditional probability tables (one row for each state configuration of all the parents; the probabilities over the node's states in each row sum to 1).
- For nodes that have real values, we could let the parents define the parameters of a Gaussian.
- Alternatively, we could use a parameterized function. If the nodes have binary states, we could use a sigmoid applied to a weighted sum of the parents' states (written out below).
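For binary units this is the defining conditional probability of a sigmoid belief net (writing w_ji for the directed weight from parent j to node i and b_i for a bias; these symbol names are assumed here):

```latex
% Probability that binary node i turns on, given the states of its parents:
\[
p(s_i = 1 \mid \mathrm{pa}(i))
\;=\; \sigma\Big(b_i + \sum_{j \in \mathrm{pa}(i)} s_j\, w_{ji}\Big)
\;=\; \frac{1}{1 + \exp\big(-b_i - \sum_{j} s_j w_{ji}\big)}
\]
```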

What is easy and what is hard in a DAG?
- It is easy to generate an unbiased example at the leaf nodes (a sketch of this ancestral sampling appears below).
- It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
- It is also hard to compute the probability of an observed vector.
- Given samples from the posterior, it is easy to learn the conditional probabilities that define the model.
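A minimal sketch of the "easy" direction for a layered sigmoid belief net (the layer/weight layout and function names here are illustrative assumptions, not from the slides):

```python
# Ancestral sampling sketch for a layered sigmoid belief net.
# Assumed layout: biases[0] belongs to the top (parentless) layer and
# weights[l] connects layer l to layer l + 1; the last layer is visible.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_sample(weights, biases, rng):
    """Sample every layer top-down; returns a list of binary state vectors."""
    # Top-layer units have no parents, so their biases alone set p(on).
    s = (rng.random(biases[0].shape) < sigmoid(biases[0])).astype(float)
    states = [s]
    for W, b in zip(weights, biases[1:]):
        p = sigmoid(s @ W + b)             # parents define p(child = 1)
        s = (rng.random(p.shape) < p).astype(float)
        states.append(s)
    return states                          # states[-1] is an unbiased visible sample
```

Generating a visible vector is one cheap top-down pass; inverting this process to get the posterior over all the hidden layers is what is hard.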

Explaining away
- Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
- If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.
[Figure: two hidden causes, "truck hits house" and "earthquake", each with a bias of -10, send weights of +20 to the visible effect "house jumps", which has a bias of -20.]
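A quick numeric check of this example, using the parameters shown in the figure (biases of -10 on the two causes, a bias of -20 on the effect, and weights of +20 from each cause):

```python
# Explaining away, computed by brute force over the four cause configurations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p_cause = sigmoid(-10.0)                  # prior p(truck) = p(earthquake), tiny

posterior = {}
for t in (0, 1):                          # t = "truck hits house"
    for e in (0, 1):                      # e = "earthquake"
        prior = (p_cause if t else 1 - p_cause) * (p_cause if e else 1 - p_cause)
        p_jump = sigmoid(-20.0 + 20.0 * t + 20.0 * e)
        posterior[(t, e)] = prior * p_jump          # joint p(t, e, jump = 1)

z = sum(posterior.values())
posterior = {k: p / z for k, p in posterior.items()}
print(posterior)
# Nearly all the mass ends up on (1, 0) and (0, 1): once we see the house jump,
# the two causes become anti-correlated even though they are independent a priori.
```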

The learning rule for sigmoid belief nets
- Suppose we could "observe" the states of all the hidden units when the net was generating the observed data. E.g. generate randomly from the net and ignore all the times when it does not generate data in the training set. Keep n examples of the hidden states for each datavector in the training set.
- For each node, maximize the log probability of its "observed" state given the observed states of its parents. This minimizes the energy of the complete configuration.

The derivatives of the log prob
- If unit i is on, we differentiate log p_i with respect to the weight w_ji from parent j.
- If unit i is off, we differentiate log(1 - p_i).
- In both cases we get the same simple learning rule (written out below).
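These are the standard maximum-likelihood derivatives for a sigmoid belief net, writing p_i for the probability that unit i turns on given its parents' states:

```latex
% p_i = \sigma\big(b_i + \sum_j s_j w_{ji}\big) is unit i's probability of being on.
\begin{align*}
\frac{\partial \log p_i}{\partial w_{ji}} &= s_j\,(1 - p_i)      && \text{(unit $i$ is on)}\\
\frac{\partial \log (1 - p_i)}{\partial w_{ji}} &= -\,s_j\, p_i  && \text{(unit $i$ is off)}\\
\Delta w_{ji} &\propto s_j\,(s_i - p_i)                          && \text{(both cases)}
\end{align*}
```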

A coding view
- The sender and receiver use the SBN as a model for communicating the data.
- The sender must stochastically pick a hidden configuration for each data vector (using random bits from another message). The "bits-back" is the entropy of the distribution across hidden configurations.
- Using the chosen configuration, the cost is the energy of that configuration.
- The energy of a complete configuration is just the cost of sending that configuration. This is the sum over all units of the cost of sending the state of a unit given the states of its parents (which have already been sent).

The cost of sending a complete configuration
- The total cost is the sum over all units i of the cost of sending the state of unit i given the states of its parents (written out below).
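Summed over units, this is just the cross-entropy between each unit's binary state and the probability its parents assign to that state:

```latex
% Energy (total description length, in nats) of a complete configuration:
\[
E \;=\; -\sum_i \Big[\, s_i \log p_i + (1 - s_i)\log(1 - p_i) \,\Big],
\qquad
p_i = \sigma\Big(b_i + \sum_{j \in \mathrm{pa}(i)} s_j\, w_{ji}\Big)
\]
```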

Minimizing the coding cost
- Pick hidden configurations using a Boltzmann distribution in their energies. This is exactly the posterior distribution over configurations given the datavector.
- Minimize the expected energy of the chosen configurations: change the parameters to minimize the energies of configurations weighted by their probability of being picked.
- Don't worry about the changes in the free energy caused by changes in the posterior distribution. We chose the distribution to minimize free energy, so small changes in the distribution have no effect on the free energy!
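The Boltzmann distribution in question, with h ranging over hidden configurations and d the clamped datavector:

```latex
\[
p(h \mid d) \;=\; \frac{e^{-E(h,\, d)}}{\sum_{h'} e^{-E(h',\, d)}}
\]
```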

The Free Energy
- The free energy with data d clamped on the visible units is the expected energy minus the entropy of the distribution over hidden configurations.
- Picking configurations with probability proportional to exp(-E) minimizes the free energy.
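In symbols, for a distribution q over hidden configurations (indexed by α) with energies E_α:

```latex
% Free energy = expected energy minus the entropy of q.
\[
F(d) \;=\; \sum_{\alpha} q_\alpha E_\alpha \;+\; \sum_{\alpha} q_\alpha \log q_\alpha
\;=\; \langle E \rangle_q - H(q)
\]
```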

Why we can ignore the fact that changing the energies changes the equilibrium distribution
- When we change from the old distribution to the new one, we reduce F, so we can safely ignore this change.
- We therefore adjust the weights to minimize the expected energy, assuming that the old distribution does not change.

Sampling from the posterior distribution
- In a densely connected sigmoid belief net with many hidden units it is intractable to compute the full posterior distribution over hidden configurations. There are too many configurations to consider.
- But we can learn OK if we just get samples from the posterior. So how can we get samples efficiently?
- Generating at random and rejecting cases that do not produce data in the training set is hopeless.

Gibbs sampling
- First fix a datavector from the training set on the visible units.
- Then keep visiting hidden units and updating their binary states using information from their parents and descendants.
- If we do this in the right way, we will eventually get unbiased samples from the posterior distribution for that datavector.
- This is relatively efficient because almost all hidden configurations will have negligible probability and will probably not be visited.

The recipe for Gibbs sampling
- Imagine a huge ensemble of networks. The networks have identical parameters and the same clamped datavector.
- The fraction of the ensemble with each possible hidden configuration defines a distribution over hidden configurations.
- Each time we pick the state of a hidden unit from its posterior distribution given the states of the other units, the distribution represented by the ensemble gets closer to the equilibrium distribution. The free energy, F, always decreases.
- Eventually, we reach the stationary distribution, in which the number of networks that change from configuration a to configuration b is exactly the same as the number that change from b to a (the detailed balance condition written below).
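In symbols (detailed balance, writing p for the stationary probability of a configuration and T for the transition probability of the Gibbs updates):

```latex
\[
p(a)\, T(a \to b) \;=\; p(b)\, T(b \to a)
\]
```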

Computing the posterior for i given the rest
- We need to compute the difference between the energy of the whole network when i is on and the energy when i is off. The posterior probability for i is then a logistic function of this energy gap (written below).
- Changing the state of i changes two kinds of energy term: how well the parents of i predict the state of i, and how well i and its spouses predict the state of each descendant of i.
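Writing the energy gap as ΔE_i = E(s_i = 0, rest) - E(s_i = 1, rest) (this sign convention is a choice made here, so that a positive gap favours turning i on):

```latex
\[
p(s_i = 1 \mid \text{rest})
\;=\; \sigma(\Delta E_i)
\;=\; \frac{1}{1 + e^{-\Delta E_i}},
\qquad
\Delta E_i = E(s_i = 0,\ \text{rest}) - E(s_i = 1,\ \text{rest})
\]
```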

Terms in the global energy
- Compute, for each descendant of i, how the cost of predicting the state of that descendant changes.
- Compute, for i itself, how the cost of predicting the state of i changes.
(A small code sketch of this local Gibbs update follows.)
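A minimal sketch of this local computation for the simplest case, a two-layer net whose top-level hidden units have only biases as "parents" (the names gibbs_update, bh, bv and W are illustrative assumptions, not from the slides):

```python
# One Gibbs update for hidden unit i in a two-layer sigmoid belief net:
# hidden units h (biases bh, no parents) -> visible units v (biases bv), weights W[i, k].
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def send_cost(state, p):
    """Cost in nats of sending a binary state under probability p of being on."""
    return -(state * np.log(p) + (1 - state) * np.log(1 - p))

def gibbs_update(i, h, v, bh, bv, W, rng):
    """Resample hidden unit i from its posterior given v and the other hidden units."""
    # Term for i itself: with only a bias as parent, E(h_i=0) - E(h_i=1) = bh[i].
    delta_e = bh[i]

    # Terms for i's descendants: how the cost of predicting each visible unit changes.
    h_on, h_off = h.copy(), h.copy()
    h_on[i], h_off[i] = 1.0, 0.0
    p_on = sigmoid(bv + h_on @ W)          # p(v_k = 1) with h_i on
    p_off = sigmoid(bv + h_off @ W)        # p(v_k = 1) with h_i off
    delta_e += np.sum(send_cost(v, p_off) - send_cost(v, p_on))

    # Posterior for i given everything else, as in the formula above.
    h[i] = float(rng.random() < sigmoid(delta_e))
    return h
```

Repeatedly sweeping gibbs_update over the hidden units with a datavector clamped on v yields, after burn-in, approximately unbiased samples from the posterior for that datavector.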