reinforcement learning

1 reinforcement learning
(Diagram: concept map relating the world and the self, reinforcement learning, redundancy reduction in early sensory systems (?), and diversity + errors.)

2 Neural Networks Courtesy of: Elena Marchiori R4.47 elena@cs.vu.nl
Assistant: Kees Jong S2.22

3 Neural Networks A neural network (NN) is a machine learning approach inspired by the way in which the brain performs a particular learning task. Knowledge about the learning task is given in the form of examples called training examples. A NN is specified by: an architecture: a set of neurons and links connecting neurons. Each link has a weight, a neuron model: the information processing unit of the NN, a learning algorithm: used for training the NN by modifying the weights in order to model the particular learning task correctly on the training examples. The aim is to obtain a NN that generalizes well, that is, that behaves correctly on new instances of the learning task.

4 Single Layer Feed-forward
(Figure: an input layer of source nodes feeding an output layer of neurons.)

5 Multi-layer feed-forward
(Figure: a 3-4-2 network, with an input layer of 3 source nodes, a hidden layer of 4 neurons, and an output layer of 2 neurons.)

6 Perceptron: Neuron Model
The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function, the sign function. (Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn and bias b are summed into the induced local field v; the output is y = φ(v) = sign(v).)

7 Perceptron for Classification
The perceptron is used for binary classification. Given training examples of classes C1 and C2, train the perceptron in such a way that it correctly classifies the training examples: if the output of the perceptron is +1, the input is assigned to class C1; if the output is -1, the input is assigned to C2.

8 Perceptron Training How can we train a perceptron for a classification task? We try to find suitable values for the weights in such a way that the training examples are correctly classified. Geometrically, we try to find a hyper-plane that separates the examples of the two classes.

9 Perceptron Geometric View
The equation w1x1 + w2x2 + w0 = 0 describes a (hyper-)plane in the input space of real-valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class. (Figure: in the (x1, x2) plane the decision boundary is the line w1x1 + w2x2 + w0 = 0; the decision region for C1 is the half-plane where w1x1 + w2x2 + w0 >= 0, and the region for C2 is the other half-plane.)

10 The fixed-increment learning algorithm
n = 1;
initialize w(n) randomly;
while (there are misclassified training examples)
    select a misclassified augmented example (x(n), d(n));
    w(n+1) = w(n) + η d(n) x(n);
    n = n + 1;
end-while;
η = learning rate parameter (a positive real number)
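A minimal Python sketch of this fixed-increment rule (not from the slides; the function name, random initialization, and epoch cap are illustrative). It assumes the examples are already augmented with a leading 1 so the bias is handled as an extra weight:

import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    # X: examples, shape (n_examples, n_features), augmented with a leading 1
    # d: desired outputs in {+1, -1}
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])        # random initial weights
    for _ in range(max_epochs):
        misclassified = 0
        for x_n, d_n in zip(X, d):
            if np.sign(w @ x_n) != d_n:    # misclassified example (w.x = 0 counts too)
                w = w + eta * d_n * x_n    # w(n+1) = w(n) + eta * d(n) * x(n)
                misclassified += 1
        if misclassified == 0:             # all examples classified correctly
            break
    return w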

11 Example Consider the 2-dimensional training set C1 ∪ C2, with
C1 = {(1, 1), (1, -1), (0, -1)} with class label 1 and C2 = {(-1, -1), (-1, 1), (0, 1)} with class label -1. Train a perceptron on C1 ∪ C2.

12 A possible implementation
Consider the augmented training set C'1 ∪ C'2, with the first entry fixed to 1 (to deal with the bias as an extra weight): (1, 1, 1), (1, 1, -1), (1, 0, -1), (1, -1, -1), (1, -1, 1), (1, 0, 1). Replace x with -x for all x ∈ C'2 and use the following update rule: if w(n)T x(n) <= 0 (i.e. x(n) is misclassified) then w(n+1) = w(n) + η x(n), otherwise w(n+1) = w(n). Epoch = the application of the update rule to each example of the training set. Terminate the execution of the learning algorithm if the weights do not change after one epoch.

13 Execution The execution of the perceptron learning algorithm is illustrated epoch by epoch below, with w(1) = (1, 0, 0), η = 1, and transformed inputs (1, 1, 1), (1, 1, -1), (1, 0, -1), (-1, 1, 1), (-1, 1, -1), (-1, 0, -1). End of epoch 1: w = (-1, 2, 0).

14 Execution End of epoch 2: w = (0, 2, -1). At epoch 3 there are no weight changes (check!), so the execution of the algorithm stops. Final weight vector: (0, 2, -1); the decision hyperplane is 2x1 - x2 = 0.
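A short Python check of this execution (illustrative; it uses the negated-input update rule of slide 12 and the transformed inputs of slide 13):

import numpy as np

# Augmented and sign-transformed training set from slide 13
X = np.array([[ 1, 1,  1], [ 1, 1, -1], [ 1, 0, -1],
              [-1, 1,  1], [-1, 1, -1], [-1, 0, -1]])

w = np.array([1.0, 0.0, 0.0])   # w(1) = (1, 0, 0)
eta = 1.0

while True:
    changed = False
    for x in X:                 # one pass over X = one epoch
        if w @ x <= 0:          # misclassified (should satisfy w.x > 0)
            w = w + eta * x
            changed = True
    if not changed:             # weights unchanged for a whole epoch: stop
        break

print(w)                        # -> [ 0.  2. -1.], i.e. the hyperplane 2*x1 - x2 = 0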

15 Result (Figure: the training points of C1 (label +1) and C2 (label -1) in the (x1, x2) plane, separated by the decision boundary 2x1 - x2 = 0.)

16 Termination of the learning algorithm
Suppose the classes C1, C2 are linearly separable (that is, there exists a hyper-plane that separates them). Then the perceptron algorithm applied to C1 ∪ C2 terminates successfully after a finite number of iterations. Proof: Consider the set C containing the inputs of C1 ∪ C2 transformed by replacing x with -x for each x with class label -1. For simplicity assume w(1) = 0 and η = 1. Let x(1), …, x(k) ∈ C be the sequence of inputs that have been used after k iterations (i.e. that triggered an update). Then w(2) = w(1) + x(1), w(3) = w(2) + x(2), …, w(k+1) = w(k) + x(k), and hence w(k+1) = x(1) + … + x(k).

17 Convergence theorem (proof)
Since C1 and C2 are linearly separable, there exists w* such that w*T x > 0 for all x ∈ C. Let α = min over x ∈ C of w*T x. Then w*T w(k+1) = w*T x(1) + … + w*T x(k) ≥ k α. By the Cauchy-Schwarz inequality, ||w*||² ||w(k+1)||² ≥ [w*T w(k+1)]² ≥ k² α², and therefore ||w(k+1)||² ≥ k² α² / ||w*||².   (A)

18 Convergence theorem (proof)
Now we consider another route: w(k+1) = w(k) + x(k), so for the Euclidean norm ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w(k)T x(k). The last term is ≤ 0 because x(k) was misclassified, hence ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||². With w(1) = 0: ||w(2)||² ≤ ||w(1)||² + ||x(1)||², ||w(3)||² ≤ ||w(2)||² + ||x(2)||², …, and therefore ||w(k+1)||² ≤ ||x(1)||² + … + ||x(k)||².

19 Convergence theorem (proof)
Let β = max ||x(n)||² over x(n) ∈ C. Then ||w(k+1)||² ≤ k β.   (B)
For sufficiently large values of k, (B) comes into conflict with (A). Hence k cannot be greater than k_max, the value for which (A) and (B) are both satisfied with the equality sign. The algorithm therefore terminates successfully in at most k_max = β ||w*||² / α² iterations.

20 Perceptron: Limitations
The perceptron can only model linearly separable classes, like (those described by) the following Boolean functions: AND, OR, COMPLEMENT. It cannot model the XOR. You will experiment with these functions in the Matlab practical lessons.

21 Adaline: Adaptive Linear Element
When the two classes are not linearly separable, it may be desirable to obtain a linear separator that minimizes the mean squared error. Adaline (Adaptive Linear Element): uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm; useful for robust linear classification and regression. For an example (x, d) the error of the network is e(w) = d - wT x, and the squared error is E(w) = (1/2) e(w)².

22 Adaline The total error E_tot is the mean of the squared errors of all the examples. E_tot is a quadratic function of the weights whose derivative exists everywhere, so incremental gradient descent may be used to minimize E_tot (see Sec. 3.1-3.2 of the online Gurney book for an explanation of gradient descent). At each iteration the LMS algorithm selects an example and decreases the network error E for that example, even when the example is already correctly classified by the network.

23 Incremental Gradient Descent
Start from an arbitrary point in the weight space. The direction in which the error E of an example (as a function of the weights) decreases most rapidly is the opposite of the gradient of E. Take a small step (of size η) in that direction.

24 Weights Update Rule
Computation of the gradient of E: for E(w) = (1/2) e(w)² with e(w) = d - wT x, the gradient is ∇E(w) = -e(w) x.
Delta rule for weight update: w(n+1) = w(n) + η e(n) x(n), i.e. a step of size η opposite to the gradient.

25 LMS learning algorithm
n = 1;
initialize w(n) randomly;
while (E_tot unsatisfactory and n < max_iterations)
    select an example (x(n), d(n));
    e(n) = d(n) - w(n)T x(n);
    w(n+1) = w(n) + η e(n) x(n);
    n = n + 1;
end-while;
η = learning rate parameter (real number). A modification uses a learning rate η(n) that varies over the iterations.
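A minimal Python sketch of incremental LMS training for a linear neuron (not from the slides; the function name, the initialization, and the defaults are illustrative):

import numpy as np

def train_adaline(X, d, eta=0.1, max_iterations=1000, target_error=1e-3):
    # X: examples, shape (n_examples, n_features), augmented with a leading 1
    # d: desired outputs (real-valued targets or +/-1 class labels)
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    for n in range(max_iterations):
        i = n % len(X)                          # select an example
        e = d[i] - w @ X[i]                     # e(n) = d(n) - w(n)^T x(n)
        w = w + eta * e * X[i]                  # delta rule: step against the gradient
        E_tot = np.mean((d - X @ w) ** 2) / 2   # mean of the squared errors
        if E_tot < target_error:
            break
    return w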

26 Comparison Perceptron and Adaline

27 Multi-layer feed-forward NN (FFNN)
We consider a more general network architecture: between the input and output layers there are hidden layers, as illustrated below. Hidden nodes do not directly receive inputs from, nor send outputs to, the external environment. FFNNs overcome the limitation of single-layer NNs: they can handle non-linearly separable learning tasks. (Figure: input layer, hidden layer, output layer.)

28 XOR problem A typical example of a non-linearly separable function is the XOR. This function takes two input arguments with values in {-1, 1} and returns one output in {-1, 1}, as specified in the following table:
x1   x2   XOR(x1, x2)
-1   -1   -1
-1    1    1
 1   -1    1
 1    1   -1
If we think of -1 and 1 as encodings of the truth values false and true, respectively, then XOR computes the logical exclusive or, which yields true if and only if the two inputs have different truth values.

29 XOR problem (Figure: the four XOR input pairs in the (x1, x2) plane.) In this graph of the XOR, input pairs giving output equal to 1 and -1 are depicted with green and red circles, respectively. These two classes (green and red) cannot be separated using a single line; we have to use two lines, like those depicted in blue. The following NN with two hidden nodes realizes this non-linear separation, where each hidden node describes one of the two blue lines. This NN uses the sign activation function. The two green arrows indicate the directions of the weight vectors of the two hidden nodes, (1, -1) and (-1, 1). They indicate the regions where the network output will be 1. The output node is used to combine the outputs of the two hidden nodes.
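A small Python sketch of such a two-hidden-node sign network. The hidden weight vectors (1, -1) and (-1, 1) are from the slide; the biases and output weights below are one possible choice, since the numeric labels in the figure did not survive extraction:

import numpy as np

sign = lambda v: 1 if v > 0 else -1

def xor_net(x1, x2):
    # Hidden node 1: weight vector (1, -1); fires only for (x1, x2) = (1, -1)
    h1 = sign(1 * x1 - 1 * x2 - 1)        # bias -1 (assumed)
    # Hidden node 2: weight vector (-1, 1); fires only for (x1, x2) = (-1, 1)
    h2 = sign(-1 * x1 + 1 * x2 - 1)       # bias -1 (assumed)
    # Output node: fires if either hidden node fires (an OR of h1 and h2)
    return sign(h1 + h2 + 1)              # weights (1, 1), bias +1 (assumed)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_net(x1, x2))    # prints -1, 1, 1, -1, matching the XOR table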

30 Types of decision regions
(Figure, three panels: a network with a single node realizes a half-plane decision region in the (x1, x2) plane; a one-hidden-layer network realizes a convex region, each hidden node realizing one of the lines L1-L4 bounding the convex region; a two-hidden-layer network realizes the union of three convex regions P1, P2, P3, each box representing a one-hidden-layer network realizing one convex region.)

31 FFNN NEURON MODEL The classical learning algorithm for FFNNs is based on the gradient descent method. For this reason the activation functions used in FFNNs are continuous functions of the weights, differentiable everywhere. A typical activation function that can be viewed as a continuous approximation of the step (threshold) function is the sigmoid function. The activation function for node j is φ(v_j) = 1 / (1 + e^(-a v_j)), with a > 0; when a tends to infinity, φ becomes the step function. (Figure: sigmoid curves steepening with increasing a.)
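For later use in backprop, a quick Python sketch of this sigmoid and its derivative φ'(v) = a φ(v)(1 - φ(v)) (illustrative; default slope a = 1):

import numpy as np

def sigmoid(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def sigmoid_prime(v, a=1.0):
    # phi'(v) = a * phi(v) * (1 - phi(v)) for the logistic sigmoid
    y = sigmoid(v, a)
    return a * y * (1.0 - y)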

32 Training: Backprop algorithm
The Backprop algorithm searches for weight values that minimize the total error of the network over the set of training examples (training set). Backprop consists of the repeated application of the following two passes: Forward pass: in this step the network is activated on one example and the error of (each neuron of) the output layer is computed. Backward pass: in this step the network error is used for updating the weights (credit assignment problem). This process is more complex than the LMS algorithm for Adaline, because hidden nodes are linked to the error not directly but by means of the nodes of the next layer. Therefore, starting at the output layer, the error is propagated backwards through the network, layer by layer. This is done by recursively computing the local gradient of each neuron.

33 Backprop: back-propagation training algorithm
Backprop adjusts the weights of the NN in order to minimize the network total mean squared error. (Figure: forward step = network activation; backward step = error propagation.)

34 Total Mean Squared Error
The error of output neuron j after the activation of the network on the n-th training example is e_j(n) = d_j(n) - y_j(n). The network error is the (halved) sum of the squared errors of the output neurons, E(n) = (1/2) Σ_j e_j(n)²; the factor 1/2 simplifies the derivatives below. The total mean squared error is the average of the network errors over the training examples: E_tot = (1/N) Σ_n E(n).

35 Weight Update Rule The Backprop weight update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of the network error E. This direction is the opposite of the gradient of E.

36 Weight Update Rule The input of neuron j is v_j = Σ_i w_ji y_i, where y_i is the output of neuron i of the previous layer. Using the chain rule we can write ∂E/∂w_ji = (∂E/∂v_j)(∂v_j/∂w_ji) = (∂E/∂v_j) y_i. Moreover, if we define the local gradient of neuron j as δ_j = -∂E/∂v_j, then from Δw_ji = -η ∂E/∂w_ji we get Δw_ji = η δ_j y_i.

37 Weight update of output neuron
In order to compute the weight change we need to know the local gradient of neuron j. There are two cases, depending on whether j is an output or a hidden neuron. If j is an output neuron then, using the chain rule, we obtain δ_j = -(∂E/∂e_j)(∂e_j/∂y_j)(∂y_j/∂v_j) = e_j φ'(v_j), because ∂E/∂e_j = e_j, ∂e_j/∂y_j = -1 (since e_j = d_j - y_j), and ∂y_j/∂v_j = φ'(v_j). So if j is an output node then the weight from neuron i to neuron j is updated by Δw_ji = η e_j φ'(v_j) y_i.

38 Weight update of hidden neuron
If j is a hidden neuron then its local gradient is computed using the local gradients of all the neurons k of the next layer. Using the chain rule we have δ_j = -(∂E/∂y_j)(∂y_j/∂v_j). Observe that ∂E/∂y_j = Σ_k (∂E/∂v_k)(∂v_k/∂y_j) = -Σ_k δ_k w_kj, and moreover ∂y_j/∂v_j = φ'(v_j). Then δ_j = φ'(v_j) Σ_k δ_k w_kj. So if j is a hidden node then the weight from neuron i to neuron j is updated by Δw_ji = η δ_j y_i = η φ'(v_j) (Σ_k δ_k w_kj) y_i.

39 Error backpropagation
The flow-graph below illustrates how errors are back-propagated to hidden neuron j. (Figure: each output neuron k = 1, …, m contributes δ_k = e_k φ'(v_k), weighted by the connection w_kj, to the local gradient of hidden neuron j, δ_j = φ'(v_j) Σ_k δ_k w_kj.)

40 Summary: Delta Rule
Delta rule: Δw_ji = η δ_j y_i, where
δ_j = e_j φ'(v_j) if j is an output node;
δ_j = φ'(v_j) Σ_k δ_k w_kj if j is a hidden node, with k ranging over the neurons of the layer following j.
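A compact Python sketch of these delta rules for one training example of a one-hidden-layer network with logistic sigmoid units (a = 1). All names are illustrative and biases are omitted for brevity:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_one_example(x, d, W_hid, W_out, eta=0.5):
    # W_hid: hidden-layer weights, shape (n_hidden, n_inputs)
    # W_out: output-layer weights, shape (n_outputs, n_hidden)
    # Forward pass
    v_hid = W_hid @ x
    y_hid = sigmoid(v_hid)
    v_out = W_out @ y_hid
    y_out = sigmoid(v_out)

    # Backward pass
    e = d - y_out                                             # output errors e_j
    delta_out = e * y_out * (1 - y_out)                       # delta_j = e_j * phi'(v_j)
    delta_hid = (W_out.T @ delta_out) * y_hid * (1 - y_hid)   # hidden-node delta rule

    # Weight updates: delta_w_ji = eta * delta_j * y_i
    W_out = W_out + eta * np.outer(delta_out, y_hid)
    W_hid = W_hid + eta * np.outer(delta_hid, x)
    return W_hid, W_out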

41 Generalized delta rule
If η is small then the algorithm learns the weights very slowly, while if η is large then the large changes of the weights may cause unstable behavior, with oscillations of the weight values. A technique for tackling this problem is the introduction of a momentum term in the delta rule, which takes into account previous updates. We obtain the following generalized Delta rule: Δw_ji(n) = α Δw_ji(n-1) + η δ_j(n) y_i(n), where α is the momentum constant. The momentum accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.

42 Other techniques: η adaptation
Other heuristics for accelerating the convergence of the back-prop algorithm through η adaptation: Heuristic 1: every weight has its own η. Heuristic 2: every η is allowed to vary from one iteration to the next.

43 Backprop learning algorithm (incremental-mode)
n = 1;
initialize the weights randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from those of the output layer:
          w_ji = w_ji + Δw_ji, with Δw_ji computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while
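Putting the pieces together, a self-contained incremental-mode sketch (not from the slides) that trains a 2-2-1 network on XOR with the generalized delta rule (momentum) and per-epoch shuffling. The 0/1 coding, learning rate, momentum constant, and error threshold are illustrative choices; with only two hidden units a run can occasionally stall in a local minimum, in which case a different seed helps:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# XOR in 0/1 coding for the logistic sigmoid (the slides use -1/+1 with sign units)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(2, 3))   # hidden layer: 2 neurons, 2 inputs + bias
W2 = rng.normal(scale=0.5, size=(1, 3))   # output layer: 1 neuron, 2 hidden + bias
dW1_prev = np.zeros_like(W1)              # previous updates, for the momentum term
dW2_prev = np.zeros_like(W2)
eta, alpha = 0.5, 0.9                     # learning rate and momentum constant

for epoch in range(10000):
    E_tot = 0.0
    for i in rng.permutation(len(X)):     # randomized example order (see slide 44)
        x = np.append(X[i], 1.0)          # append 1 so the bias is an extra weight
        y1 = sigmoid(W1 @ x)              # forward pass, hidden layer
        y1b = np.append(y1, 1.0)
        y2 = sigmoid(W2 @ y1b)            # forward pass, output layer
        e = D[i] - y2                     # output error
        E_tot += 0.5 * float(e @ e)
        d2 = e * y2 * (1 - y2)                             # output local gradients
        d1 = (W2[:, :2].T @ d2) * y1 * (1 - y1)            # hidden local gradients
        dW2 = alpha * dW2_prev + eta * np.outer(d2, y1b)   # generalized delta rule
        dW1 = alpha * dW1_prev + eta * np.outer(d1, x)
        W2 += dW2
        W1 += dW1
        dW2_prev, dW1_prev = dW2, dW1
    if E_tot / len(X) < 1e-3:             # stop when the mean squared error is small
        break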

44 Backprop algorithm In batch mode the weights are updated only after all examples have been processed, using the accumulated per-example changes Δw_ji = η Σ_n δ_j(n) y_i(n). The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied. In incremental mode, choose a randomized ordering for selecting the examples of the training set from one epoch to the next, in order to avoid poor performance.

45 Stopping criteria Sensible stopping criteria:
Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate then stop. If this stopping criterion is used, the part of the training set used for testing the network's generalization must not be used for updating the weights.
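One way to read the first criterion in code (a sketch; the relative-change interpretation and the tolerance, taken from the quoted range, are assumptions):

def error_change_converged(prev_avg_sq_error, avg_sq_error, tol=0.01):
    # Stop when the rate of change of the average squared error per epoch is small
    if prev_avg_sq_error == 0.0:
        return avg_sq_error == 0.0
    return abs(avg_sq_error - prev_avg_sq_error) / prev_avg_sq_error < tol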

46 Metric Striatal Networks

47 Caudate Putamen GPe GPi

48 STN SN (r & c)

49 Basal Ganglia: 3 circuits
Sensorimotor: Putamen to GPi
Associative: Caudate to SNr
Limbic: Ventral striatum to ventral pallidum

50 (figure slide; no text in the transcript)

51 (figure slide; no text in the transcript)

52 (figure slide; no text in the transcript)

53 Sensorimotor circuit
(Figure: circuit diagram linking somatosensory and motor cortices, thalamus, putamen, GPe, GPi, STN, and SNc; direct and indirect pathways, excitatory and inhibitory connections, and D1 & D2 dopamine receptors are indicated.)

54 Medical Remarks Hypokinetic disorders result from overactivity in the indirect pathway. Example: a decreased level of dopamine supply in the nigrostriatal pathway results in akinesia, bradykinesia, and rigidity in Parkinson's disease (PD). Hyperkinetic disorders result from underactivity in the indirect pathway. Examples: lesions of the STN result in ballism, and damage to the pathway from the putamen to GPe results in chorea; both are involuntary limb movements. Lesion-making in the STN or GPi is a successful therapeutic procedure for PD.

55 Cognitive Remarks Putamen, GPi, and GPe are organized somatotopically. Their neurons are selectively responsive to the direction of limb movement. Considerable convergence is also evident along the cortico-basal ganglio-thalamo-cortical pathway. GPi cells have a baseline firing rate of 60-80 Hz. During a voluntary hand movement, the firing rates of 70% of the cells in the hand area of GPi increase, while those of the remaining 30% decrease. Focusing theory vs. scaling theory (the result of emphasis on somatotopy vs. convergence).

56 (figure slide; no text in the transcript)

57 (figure slide; no text in the transcript)

58 (figure slide; no text in the transcript)

59 paleocortex (olfactory), archicortex (hippocampus), neocortex (the rest)

60 Brain Behav Evol. 1997;49(4): The telencephalon of tetrapods in evolution. Striedter GF. Department of Psychobiology, University of California, Irvine , USA. Numerous scientists have sought a homologue of mammalian isocortex in sauropsids (reptiles and birds) and a homologue of sauropsid dorsal ventricular ridge in mammals. Although some of the proposed theories were enormously influential, alternative theories continued to coexist, primarily because the striking differences in pallial organization between adult mammals, sauropsids, and amphibians enabled different authors to enlist different subsets of similarity data in support of different hypotheses of putative homology. A phylogenetic analysis based on parsimony cannot discriminate between such alternative hypotheses of putative homology, because sauropsids and mammals are sister groups. One solution to this dilemma is to include embryological patterns of telencephalic organization in the comparative analysis. Because early developmental stages in different taxa tend to resemble each other more than the adults do, the embryological data may reveal intermediate patterns of organization that provide unambiguous support for a single hypothesis of putative homology. The validity of this putative homology may then be supported by means of a phylogenetic analysis based on parsimony. A comparative analysis of pallial organization that includes embryological data suggests the following set of homologies. The lateral cortex in reptiles is homologous to the piriform cortex in birds and mammals. The anterior dorsal ventricular ridge in reptiles is probably homologous to the neostriatum and ventral hyperstriatum in birds and to the endopiriform nucleus in mammals. The posterior dorsal ventricular ridge in reptiles is most likely homologous to the archistriatum in birds and to the pallial amygdala in mammals. The pallial thickening in reptiles is probably homologous to the dorsal and intercalated portions of the hyperstriatum in birds and to the claustrum proper in mammals. Finally, the dorsal cortex in reptiles is probably homologous to the accessory hyperstriatum and parahippocampal area in birds and to the isocortex in mammals. These hypotheses of homology imply relatively minor evolutionary changes in development but major changes in neuronal connections. Most significantly, they imply the independent elaboration of thalamic sensory projections to derivatives of the lateral and dorsal pallia in sauropsids and mammals, respectively. They also imply the independent evolution of lamination in the pallium of birds and mammals.

61 Avian brains and a new understanding of vertebrate brain evolution
Biochem Cell Biol. 1997;75(6): The brain in evolution and involution. Parent A. Laboratoire de neurobiologie, Universite Laval Robert-Giffard, Beauport, QC, Canada. This paper provides an overview of the phylogenetic evolution and structural organization of the basal ganglia. These large subcortical structures that form the core of the cerebral hemispheres directly participate in the control of psychomotor behavior. Neuroanatomical methods combined with transmitter localization procedures were used to study the chemical organization of the forebrain in each major group of vertebrates. The various components of the basal ganglia appear well developed in amniote vertebrates, but remain rudimentary in anamniote vertebrates. For example, a typical substantia nigra composed of numerous dopaminergic neurons that project to the striatum already exists in the brain of reptiles. Other studies in mammals show that glutamatergic cortical inputs establish distinct functional territories within the basal ganglia, and that neurons in each of these territories act upon other brain neuronal systems principally via a GABAergic disinhibitory output mechanism. The functional status of the various basal ganglia chemospecific systems was examined in animal models of neurodegenerative diseases, as well as in postmortem material from Parkinson's and Huntington's disease patients. The neurodegenerative processes at play in such conditions specifically target the most phylogenetically ancient components of the brain, including the substantia nigra and the striatum, and the marked involution of these brain structures is accompanied by severe motor and cognitive deficits. Studies of neural mechanisms involved in these akinetic and hyperkinetic disorders have led to a complete reevaluation of the current model of the functional organization of the basal ganglia in both health and disease. Avian brains and a new understanding of vertebrate brain evolution Nature Reviews Neuroscience 6:151 (2005) (consortium)

62 Scalable architecture in mammalian brains
Nature 411, (2001) Scalable architecture in mammalian brains DAMON A. CLARK*†, PARTHA P. MITRA‡ & SAMUEL S.-H. WANG* * Department of Molecular Biology and † Department of Physics, Princeton University, Princeton, New Jersey 08544, USA ‡ Bell Laboratories, Lucent Technologies, 600 Mountain Avenue, Murray Hill, New Jersey 07974, USA Correspondence and requests for materials should be addressed to S.S.-H.W. (

63 Estimating information in the time course:
the Optican & Richmond approach.

64 The problem of limited sampling
The shuffled information: purely an index of the seriousness of the problem.
The analytical correction C1.
Other correction methods: the jackknife (slow but reliable), a neural network method (John Hertz), a method developed by Optican et al., Paninski, etc.

65 Estimating mutual information after decoding:
simple 1D examples.

66 Cerebellar Networks

67 (figure slide; no text in the transcript)

68 (figure slide; no text in the transcript)

69 Theoretical concepts:
1) Timing theory (delay lines; Valentino Braitenberg)
2) Learned pattern recognition (Marr, Albus)
3) Control theory (forward loop, gain adjustment; Ito)

70 The numbers of expansion recoding
Each MF terminates in several hundred rosettes. Each rosette contacts the dendrites of ≈ 28 GCs. Each GC receives from ≈ 4 rosettes (MFs). There are 450 times more GCs than MFs. In humans, there are 3×10^10 GCs, each making about 300 PF synapses, for a total of about 10^13 storage locations on some 5×10^7 Purkinje cells.
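A quick arithmetic check of these figures (the numbers are from the slide; the comparison is approximate):

granule_cells = 3e10          # GCs in the human cerebellum (from the slide)
pf_synapses_per_gc = 300      # parallel-fiber synapses made by each GC
purkinje_cells = 5e7

total_storage = granule_cells * pf_synapses_per_gc    # 9e12, i.e. about 1e13
per_purkinje = total_storage / purkinje_cells         # ~180,000 synapses per Purkinje cell
print(f"{total_storage:.1e} storage locations, ~{per_purkinje:.0f} per Purkinje cell")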

71 (figure slide; no text in the transcript)

72 Scalable architecture in mammalian brains
Nature 411, (2001) Scalable architecture in mammalian brains DAMON A. CLARK*†, PARTHA P. MITRA‡ & SAMUEL S.-H. WANG* * Department of Molecular Biology and † Department of Physics, Princeton University, Princeton, New Jersey 08544, USA ‡ Bell Laboratories, Lucent Technologies, 600 Mountain Avenue, Murray Hill, New Jersey 07974, USA Correspondence and requests for materials should be addressed to S.S.-H.W. ( Comparison of mammalian brain parts has often focused on differences in absolute size, revealing only a general tendency for all parts to grow together. Attempts to find size-independent effects using body weight as a reference variable obscure size relationships owing to independent variation of body size and give phylogenies of questionable significance. Here we use the brain itself as a size reference to define the cerebrotype, a species-by-species measure of brain composition. With this measure, across many mammalian taxa the cerebellum occupies a constant fraction of the total brain volume (0.13 ± 0.02), arguing against the hypothesis that the cerebellum acts as a computational engine principally serving the neocortex. Mammalian taxa can be well separated by cerebrotype, thus allowing the use of quantitative neuroanatomical data to test evolutionary relationships. Primate cerebrotypes have progressively shifted and neocortical volume fractions have become successively larger in lemurs and lorises, New World monkeys, Old World monkeys, and hominoids, lending support to the idea that primate brain architecture has been driven by directed selection pressure. At the same time, absolute brain size can vary over 100-fold within a taxon, while maintaining a relatively uniform cerebrotype. Brains therefore constitute a scalable architecture.

73 Scalable architecture in mammalian brains
Nature 411, (2001) Scalable architecture in mammalian brains DAMON A. CLARK*†, PARTHA P. MITRA‡ & SAMUEL S.-H. WANG* Comparison of mammalian brain parts has often focused on differences in absolute size, revealing only a general tendency for all parts to grow together. Attempts to find size-independent effects using body weight as a reference variable obscure size relationships owing to independent variation of body size and give phylogenies of questionable significance. Here we use the brain itself as a size reference to define the cerebrotype, a species-by-species measure of brain composition. With this measure, across many mammalian taxa the cerebellum occupies a constant fraction of the total brain volume (0.13 ±  0.02), arguing against the hypothesis that the cerebellum acts as a computational engine principally serving the neocortex. Mammalian taxa can be well separated by cerebrotype, thus allowing the use of quantitative neuroanatomical data to test evolutionary relationships. Primate cerebrotypes have progressively shifted and neocortical volume fractions have become successively larger in lemurs and lorises, New World monkeys, Old World monkeys, and hominoids, lending support to the idea that primate brain architecture has been driven by directed selection pressure. At the same time, absolute brain size can vary over 100-fold within a taxon, while maintaining a relatively uniform cerebrotype. Brains therefore constitute a scalable architecture.

74 (figure slide; no text in the transcript)

