Neural networks
Supervised learning: the training data consists of inputs together with their corresponding outputs.
Unsupervised learning: the training data consists of inputs without their corresponding outputs.

Neural networks
Generative model: models the distribution of the input as well as the output, P(x, y).
Discriminative model: models the posterior probabilities, P(y | x).
[Figure: curves of P(x, y1) and P(x, y2), and of P(y1 | x) and P(y2 | x)]

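The relationship between the two kinds of model can be checked on a tiny discrete example. The sketch below assumes a made-up joint table (the values `x0`, `x1`, `y1`, `y2` and all probabilities are illustrative, not taken from the slides) and derives the posterior P(y | x) from the joint P(x, y).

```python
# Hypothetical joint distribution P(x, y) over a binary input and two classes.
# The numbers are illustrative assumptions; they only need to sum to 1.
joint = {
    ("x0", "y1"): 0.30, ("x0", "y2"): 0.10,
    ("x1", "y1"): 0.15, ("x1", "y2"): 0.45,
}

def marginal_x(joint, x):
    """P(x) = sum over y of P(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def posterior(joint, x):
    """P(y | x) = P(x, y) / P(x), the quantity a discriminative model represents."""
    px = marginal_x(joint, x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

post_x1 = posterior(joint, "x1")
print(post_x1)
```

A generative model keeps the whole table, so it can also answer questions about P(x); a discriminative model only needs the normalized rows.
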
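Neural networks
What is a neuron?
Types of neuron: linear neurons, binary threshold neurons, sigmoid neurons, stochastic binary neurons.
[Figure: a neuron with inputs x1, x2 and a constant input 1, weights w1, w2 and bias b, and output y]
Binary threshold neuron: $z = b + w_1 x_1 + w_2 x_2$, and $y = 1$ if $z \ge 0$, $y = 0$ otherwise.
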
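The four neuron types differ only in what they do with the total input z. A minimal sketch, assuming the usual definitions (the function names are chosen here for illustration):

```python
import math
import random

def total_input(x, w, b):
    """z = b + sum_i x_i * w_i"""
    return b + sum(xi * wi for xi, wi in zip(x, w))

def linear(x, w, b):
    """Linear neuron: the output is the total input itself."""
    return total_input(x, w, b)

def binary_threshold(x, w, b):
    """Binary threshold neuron: 1 if z >= 0, otherwise 0."""
    return 1 if total_input(x, w, b) >= 0 else 0

def sigmoid(x, w, b):
    """Sigmoid neuron: a smooth, bounded output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-total_input(x, w, b)))

def stochastic_binary(x, w, b, rng=random):
    """Stochastic binary neuron: outputs 1 with probability sigmoid(z)."""
    return 1 if rng.random() < sigmoid(x, w, b) else 0
```

For example, `binary_threshold([1, 0], [1, 1], -0.5)` returns 1 because z = 0.5, while the sigmoid neuron with the same arguments returns a value between 0.5 and 1.
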
Neural networks
Two-layer neural networks (sigmoid neurons)
Back-propagation
Step 1: Randomly initialize the weights and determine the output vector.
Step 2: Evaluate the gradient of an error function.
Step 3: Adjust the weights.
Repeat the steps (without re-initializing the weights) until the error is low enough.

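The three steps can be sketched for a small two-layer sigmoid network. This is an illustration under stated assumptions, not the lecture's own code: the squared-error function, the XOR data set, the learning rate, the hidden-layer size, and the per-example (online) updates are all choices made here.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return the hidden activities and the scalar output for one input."""
    h = [sigmoid(b + sum(w * xi for w, xi in zip(row, x))) for row, b in zip(W1, b1)]
    y = sigmoid(b2 + sum(w * hj for w, hj in zip(W2, h)))
    return h, y

def total_error(data, W1, b1, W2, b2):
    """E = sum over cases of 0.5 * (y - t)^2 (assumed error function)."""
    return sum(0.5 * (forward(x, W1, b1, W2, b2)[1] - t) ** 2 for x, t in data)

def train(data, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: random initial weights.
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    errors = [total_error(data, W1, b1, W2, b2)]
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x, W1, b1, W2, b2)  # determine the output
            # Step 2: gradient of the error w.r.t. each unit's total input.
            d_out = (y - t) * y * (1 - y)
            d_hid = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
            # Step 3: adjust the weights against the gradient.
            for j in range(n_hidden):
                W2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_hid[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
            b2 -= lr * d_out
        errors.append(total_error(data, W1, b1, W2, b2))
    return (W1, b1, W2, b2), errors

XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params, errors = train(XOR)
print(errors[0], errors[-1])
```

The `errors` list records the total error after each pass over the data, which is what the "until the error is low enough" stopping rule would inspect.
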
Neural networks
Back-propagation is not good for deep learning
It requires labeled training data, but almost all data is unlabeled.
Learning is very slow in networks with multiple hidden layers.
It can get stuck in poor local optima; for deep nets these are far from optimal.
Alternative: learn P(input), not P(output | input).
What kind of generative model should we learn?

Graphical model
A graphical model is a probabilistic model in which a graph denotes the conditional dependence structure between random variables.
[Figure: example graph over the variables A, B, C, and D]
In this example: D depends on A, B, and C; C depends on B and D.

Graphical model
Directed graphical model
Undirected graphical model
[Figure: a directed graph and an undirected graph, each over the nodes A, B, C, D]

Belief nets
A belief net is a directed acyclic graph composed of stochastic variables.
When the variables are stochastic binary neurons, it is a sigmoid belief net.
[Figure: stochastic hidden causes, visible units]

Belief nets
We would like to solve two problems.
The inference problem: infer the states of the unobserved variables.
The learning problem: adjust the interactions between variables to make the network more likely to generate the training data.

Belief nets
It is easy to generate a sample from P(v | h).
It is hard to infer P(h | v), because of explaining away.

Belief nets
Explaining away
[Figure: two hidden causes H1 and H2 with a common visible unit V]

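Explaining away can be seen numerically by enumerating the posterior of a three-unit sigmoid belief net. The sketch below assumes illustrative parameters (biases of -10 on the hidden causes, -20 on the visible unit, and weights of +20); they are not read from the figure. Each cause is rare a priori, and either one alone is enough to make V likely.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed, illustrative parameters.
B_H = -10.0  # bias of each hidden cause: rarely on by itself
W = 20.0     # weight from each hidden cause to the visible unit
B_V = -20.0  # bias of the visible unit: rarely on without a cause

def joint(h1, h2, v):
    """P(h1, h2, v) = P(h1) * P(h2) * P(v | h1, h2) in the directed model."""
    p_on = sigmoid(B_H)
    p_h = 1.0
    for h in (h1, h2):
        p_h *= p_on if h else 1.0 - p_on
    p_v_on = sigmoid(B_V + W * h1 + W * h2)
    return p_h * (p_v_on if v else 1.0 - p_v_on)

def posterior_given_v(v):
    """P(h1, h2 | v), by brute-force enumeration of the four hidden states."""
    unnorm = {(h1, h2): joint(h1, h2, v) for h1, h2 in product((0, 1), repeat=2)}
    z = sum(unnorm.values())
    return {state: p / z for state, p in unnorm.items()}

post = posterior_given_v(1)
p_h1 = post[(1, 0)] + post[(1, 1)]                            # P(h1=1 | v=1)
p_h1_given_h2 = post[(1, 1)] / (post[(0, 1)] + post[(1, 1)])  # P(h1=1 | v=1, h2=1)
print(p_h1, p_h1_given_h2)
```

Observing V makes H1 roughly as likely as not, but once H2 is known to be on, H1 becomes very unlikely again. The two causes are independent in the prior yet dependent in the posterior, which is why P(h | v) does not factorize.
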
Belief nets
Some methods for learning deep belief nets
Monte Carlo methods: painfully slow for large, deep belief nets.
Learning with samples from the wrong distribution.
Use Restricted Boltzmann Machines.

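Boltzmann Machine
It is an undirected graphical model.
The energy of a joint configuration, ignoring bias terms and keeping only the visible-hidden connections: $E(v, h) = -\sum_{i,j} v_i h_j w_{ij}$, where $v_i$ is the binary state of visible unit i, $h_j$ the binary state of hidden unit j, and $w_{ij}$ the weight between them.
[Figure: hidden unit j connected to visible unit i]
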
Boltzmann Machine
An example of how weights define a distribution
[Figure: a small network with hidden units h1, h2 and visible units v1, v2]

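How weights define a distribution can be worked through by enumerating every joint configuration of a four-unit network: each configuration gets probability proportional to exp(-E). The weights below (+2 between v1 and h1, -1 between h1 and h2, +1 between v2 and h2, zero biases) are assumed for illustration, and the energy function used is the general form with biases and arbitrary pairwise connections.

```python
import math
from itertools import product

UNITS = ["v1", "v2", "h1", "h2"]
# Assumed, illustrative symmetric weights and biases.
WEIGHTS = {("v1", "h1"): 2.0, ("h1", "h2"): -1.0, ("v2", "h2"): 1.0}
BIASES = {u: 0.0 for u in UNITS}

def energy(state):
    """E(s) = -sum_i b_i s_i - sum over connected pairs of w_ab * s_a * s_b."""
    e = -sum(BIASES[u] * state[u] for u in UNITS)
    e -= sum(w * state[a] * state[b] for (a, b), w in WEIGHTS.items())
    return e

def joint_distribution():
    """p(s) = exp(-E(s)) / Z over all 2^4 binary configurations."""
    states = [dict(zip(UNITS, bits)) for bits in product((0, 1), repeat=len(UNITS))]
    unnorm = [math.exp(-energy(s)) for s in states]
    z = sum(unnorm)  # the partition function
    return [(s, u / z) for s, u in zip(states, unnorm)]

def visible_marginal(dist):
    """p(v) = sum over hidden configurations of p(v, h)."""
    marg = {}
    for s, p in dist:
        key = (s["v1"], s["v2"])
        marg[key] = marg.get(key, 0.0) + p
    return marg

dist = joint_distribution()
marg = visible_marginal(dist)
for v, p in sorted(marg.items()):
    print(v, round(p, 3))
```

Brute-force enumeration only works for tiny networks, since the number of configurations doubles with every unit; larger networks need sampling.
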
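Boltzmann Machine
A very surprising fact
$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_{v} - \langle s_i s_j \rangle_{\text{model}}$
Left-hand side: the derivative of the log probability of one training vector, v, under the model.
$\langle s_i s_j \rangle_{v}$: the expected value of the product of states at thermal equilibrium when v is clamped on the visible units.
$\langle s_i s_j \rangle_{\text{model}}$: the expected value of the product of states at thermal equilibrium with no clamping.
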
Boltzmann Machines
Restricted Boltzmann Machine
We restrict the connectivity to make learning easier.
Only one layer of hidden units (more layers are dealt with later).
No connections between hidden units.
This makes the updates more parallel.

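Because an RBM has no connections within a layer, the hidden units are conditionally independent given the visible vector, so all of them can be updated at once. A minimal sketch, assuming the standard sigmoid conditionals and a weight layout `W[i][j]` (visible i to hidden j) chosen here for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, b_hid):
    """p(h_j = 1 | v) for every hidden unit j, each computed independently."""
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]

def visible_probs(h, W, b_vis):
    """p(v_i = 1 | h) for every visible unit i, using the same weights."""
    return [sigmoid(b_vis[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_vis))]
```

Each entry of the result depends only on the other layer, never on its neighbours in the same layer, which is what makes the parallel update valid.
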
Boltzmann Machines
The Boltzmann machine learning algorithm for an RBM
[Figure: alternating updates of the visible units i and the hidden units j at t = 0, t = 1, t = 2, ..., t = infinity]

Boltzmann Machines
Contrastive divergence: a very surprising short-cut
[Figure: visible units i and hidden units j at t = 0 (data) and t = 1 (reconstruction)]
This is not following the gradient of the log likelihood, but it works well.

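The short-cut can be sketched as a single up-down-up pass: measure the pairwise statistics on the data (t = 0), reconstruct the visible units, measure the statistics again (t = 1), and move the weights by the difference. The code below is a sketch under assumptions made here: the toy data, learning rate, hidden-layer size, and the use of probabilities rather than sampled states for the reconstruction and the statistics.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, b_h):
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]

def visible_probs(h, W, b_v):
    return [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_v))]

def cd1_step(v0, W, b_v, b_h, lr, rng):
    """One in-place CD-1 update: data statistics minus reconstruction statistics."""
    ph0 = hidden_probs(v0, W, b_h)
    h0 = [1 if rng.random() < p else 0 for p in ph0]  # t = 0: sample hidden states
    v1 = visible_probs(h0, W, b_v)                    # t = 1: reconstruction
    ph1 = hidden_probs(v1, W, b_h)
    for i in range(len(v0)):
        b_v[i] += lr * (v0[i] - v1[i])
        for j in range(len(b_h)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for j in range(len(b_h)):
        b_h[j] += lr * (ph0[j] - ph1[j])

def train_rbm(data, n_hidden=2, lr=0.1, epochs=1000, seed=0):
    rng = random.Random(seed)
    n_vis = len(data[0])
    W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_vis)]
    b_v, b_h = [0.0] * n_vis, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            cd1_step(v0, W, b_v, b_h, lr, rng)
    return W, b_v, b_h

def reconstruction_error(data, W, b_v, b_h):
    """Mean squared error of a deterministic up-down pass (a rough progress check)."""
    total = 0.0
    for v in data:
        recon = visible_probs(hidden_probs(v, W, b_h), W, b_v)
        total += sum((a - b) ** 2 for a, b in zip(v, recon))
    return total / len(data)

DATA = [[1, 1, 0, 0], [0, 0, 1, 1]]  # assumed toy patterns
W, b_v, b_h = train_rbm(DATA)
print(reconstruction_error(DATA, W, b_v, b_h))
```

Reconstruction error is only a convenient monitor here; it is not the quantity that contrastive divergence optimizes.
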
DBN
It is easy to generate a sample from P(v | h).
It is hard to infer P(h | v), because of explaining away.
Using RBMs to initialize the weights can lead to a good optimum.

DBN
Combining two RBMs to make a DBN
Train the first RBM on the data.
Copy the binary hidden state for each v, then train the second RBM on those states.
Compose the two RBM models to make a single DBN model.
It's a deep belief net!

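The stacking procedure can be sketched as a loop: train one RBM, push each training vector through it to get a binary hidden state, and use those states as the data for the next RBM. This is an illustration under assumptions made here (CD-1 training for each RBM, toy data, layer sizes, and hyperparameters); it covers only the greedy pretraining, not any later fine-tuning.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def up(v, W, b_h):
    """p(h_j = 1 | v) for each hidden unit."""
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]

def down(h, W, b_v):
    """p(v_i = 1 | h) for each visible unit."""
    return [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_v))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm(data, n_hidden, rng, lr=0.1, epochs=200):
    """Train one RBM with CD-1 (an assumed choice of learning rule)."""
    n_vis = len(data[0])
    W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_vis)]
    b_v, b_h = [0.0] * n_vis, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = up(v0, W, b_h)
            v1 = down(sample(ph0, rng), W, b_v)
            ph1 = up(v1, W, b_h)
            for i in range(n_vis):
                b_v[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                b_h[j] += lr * (ph0[j] - ph1[j])
    return W, b_v, b_h

def train_dbn(data, layer_sizes, seed=0):
    """Greedy layer-wise training: each RBM models the layer below it."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_v, b_h = train_rbm(layer_input, n_hidden, rng)
        layers.append((W, b_v, b_h))
        # Copy a sampled binary hidden state for each training vector.
        layer_input = [sample(up(v, W, b_h), rng) for v in layer_input]
    return layers

DATA = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]  # assumed toy data
layers = train_dbn(DATA, layer_sizes=[3, 2])
print([(len(W), len(W[0])) for W, _, _ in layers])
```

Each entry of `layers` holds one RBM's weights and biases; stacking them in order gives the layers of the composed model.
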
DBN
Why can we use RBMs to initialize the weights of a belief net?
An infinite sigmoid belief net with replicated weights is equivalent to an RBM.
Inference in a directed net with replicated weights is trivial: we just multiply v0 by W transpose.
The model above h0 implements a complementary prior.
Multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
[Figure: a stack of layers v0, h0, v1, h1, v2, h2, etc.]

Reference
G. Hinton, Deep Belief Nets, NIPS 2007 tutorial.
https://class.coursera.org/neuralnets /class/index
Machine learning lecture notes.