1
Overview over different methods
2
Different Types/Classes of Learning
Unsupervised Learning (non-evaluative feedback): Trial-and-error learning. No error signal. No influence from a teacher; correlation evaluation only.
Reinforcement Learning (evaluative feedback): (Classical & instrumental) conditioning, reward-based learning. “Good-Bad” error signals. A teacher defines what is good and what is bad.
Supervised Learning (evaluative error-signal feedback): Teaching, coaching, imitation learning, learning from examples, and more. Rigorous error signals. Direct influence from a teacher/teaching signal.
3
An unsupervised learning rule:
Basic Hebb rule: dw_i/dt = μ u_i v, with μ << 1. For learning: one input, one output.
A reinforcement learning rule (TD-learning): one input, one output, one reward.
A supervised learning rule (Delta Rule): no input, no output, one error-function derivative, where the error function compares input with output examples.
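As a concrete illustration, here is a minimal numerical sketch of the basic Hebb rule above (plain Python/NumPy; the input statistics and parameter values are made up for illustration, not taken from the lecture):

```python
import numpy as np

# Basic Hebb rule, dw_i/dt = mu * u_i * v with v = w . u,
# discretized in simple Euler steps; mu << 1 keeps weight changes slow.
rng = np.random.default_rng(0)
mu = 0.01                              # learning rate (mu << 1)
w = rng.normal(scale=0.1, size=2)      # initial weights

for step in range(1000):
    u = rng.normal(size=2) + np.array([1.0, 0.5])  # arbitrary input pattern
    v = w @ u                          # output: dot product of weights and input
    w += mu * v * u                    # Hebbian update: correlate input with output

print("final weights:", w)             # the weights keep growing (Hebb is unstable)
```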
4
Self-organizing maps: unsupervised learning
(Diagram: the input space is mapped onto the neural map.) Neighborhood relationships are usually preserved (+). The absolute structure depends on the initial condition and cannot be predicted (−).
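A minimal 1-D Kohonen self-organizing map sketch (hypothetical parameters; 2-D inputs mapped onto a chain of units), just to make the neighborhood-preservation idea concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
n_units = 20
# Weight vector (position in input space) for each map unit, random initialization.
W = rng.uniform(0, 1, size=(n_units, 2))

def train_som(W, n_steps=5000, eta0=0.5, sigma0=3.0):
    for t in range(n_steps):
        x = rng.uniform(0, 1, size=2)                        # random input sample
        winner = np.argmin(np.linalg.norm(W - x, axis=1))    # best-matching unit
        # Decaying learning rate and neighborhood width.
        eta = eta0 * (1 - t / n_steps)
        sigma = sigma0 * (1 - t / n_steps) + 0.5
        # Gaussian neighborhood on the 1-D map topology.
        d = np.arange(n_units) - winner
        h = np.exp(-d**2 / (2 * sigma**2))
        W += eta * h[:, None] * (x - W)                      # pull winner and neighbors toward x
    return W

W = train_som(W)
# Neighboring units end up at neighboring input positions (topology preserved),
# but the absolute orientation of the map depends on the random initialization.
print(np.round(W, 2))
```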
5
An unsupervised learning rule:
Basic Hebb rule: dw_i/dt = μ u_i v, with μ << 1. For learning: one input, one output.
A reinforcement learning rule (TD-learning): one input, one output, one reward.
A supervised learning rule (Delta Rule): no input, no output, one error-function derivative, where the error function compares input with output examples.
6
Classical Conditioning
I. Pavlov
7
An unsupervised learning rule:
Basic Hebb rule: dw_i/dt = μ u_i v, with μ << 1. For learning: one input, one output.
A reinforcement learning rule (TD-learning): one input, one output, one reward.
A supervised learning rule (Delta Rule): no input, no output, one error-function derivative, where the error function compares input with output examples.
8
Supervised Learning: Example OCR
9
The influence of the type of learning on speed and autonomy of the learner
Correlation-based learning: no teacher. Reinforcement learning: indirect influence. Reinforcement learning: direct influence. Supervised learning: teacher. Programming. (Diagram: these methods are ordered along two axes, learning speed and autonomy: learning speed increases and autonomy decreases from correlation-based learning towards programming.)
10
Hebbian learning: “When an axon of cell A excites cell B and repeatedly or persistently takes part in firing it, some growth processes or metabolic change takes place in one or both cells so that A‘s efficiency ... is increased.” Donald Hebb (1949). (Diagram: cell A projecting onto cell B.)
11
Overview over different methods
You are here !
12
Hebbian Learning
The basic Hebb rule correlates inputs with outputs: dw_1/dt = μ v u_1, with μ << 1. (Diagram: input u_1 reaches the cell through weight w_1; the cell produces output v.) Vector notation for the cell activity: v = w · u. This is a dot product, where w is the weight vector and u the input vector. Strictly, we need to assume that weight changes are slow, otherwise this turns into a differential equation.
13
dw/dt = μ ⟨v u⟩, μ << 1
Single input: dw_1/dt = μ v u_1, μ << 1.
Many inputs: dw/dt = μ v u, μ << 1. As v is a single output, it is a scalar.
Averaging inputs: dw/dt = μ ⟨v u⟩, μ << 1. We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.
If we replace v with w · u we can write: dw/dt = μ Q · w, where Q = ⟨u uᵀ⟩ is the input correlation matrix.
Note: Hebb yields an unstable (always growing) weight vector!
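A short sketch (with assumed toy input statistics) showing that the averaged rule dw/dt = μ Q·w lets the weight vector grow without bound along the leading eigenvector of the correlation matrix Q:

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D inputs; Q = <u u^T> is their correlation matrix.
U = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=10000)
Q = (U.T @ U) / len(U)

mu, w = 0.01, np.array([0.1, -0.05])
for _ in range(500):
    w = w + mu * Q @ w                  # averaged Hebb rule: dw/dt = mu * Q . w

eigvals, eigvecs = np.linalg.eigh(Q)
print("weight norm after 500 steps:", np.linalg.norm(w))   # keeps growing
print("weight direction:", w / np.linalg.norm(w))
print("leading eigenvector of Q:", eigvecs[:, -1])          # same direction (up to sign)
```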
14
Synaptic plasticity evoked artificially
Examples of long-term potentiation (LTP) and long-term depression (LTD). LTP was first demonstrated by Bliss and Lømo in 1973. Since then it has been induced in many different ways, usually in slice. LTD was robustly shown by Dudek and Bear in 1992, in hippocampal slice.
18
LTP will lead to new synaptic contacts
19
Symmetrical Weight-change curve
Conventional LTP = Hebbian learning. (Plot: synaptic change in % as a function of the pre- and postsynaptic spike times t_Pre and t_Post.) The weight-change curve is symmetrical: the temporal order of input and output does not play any role.
21
Spike timing dependent plasticity - STDP
Markram et al., 1997. There are various common methods for inducing bidirectional synaptic plasticity. The most traditional one uses extracellular stimulation at different frequencies: high frequency produces LTP, whereas low frequency may produce LTD. Another protocol is often called pairing: here the postsynaptic cell is voltage-clamped to a certain postsynaptic voltage and at the same time a low-frequency presynaptic stimulus is delivered. Small depolarization to about −50 mV induces LTD, and depolarization to −10 mV produces LTP. A third, recently popular protocol is spike-timing-dependent plasticity (STDP): here a presynaptic stimulus is delivered either closely before or after a postsynaptic AP. Typically, if the pre comes before the post, LTP is induced; if the post comes before the pre, LTD is produced.
22
Spike Timing Dependent Plasticity: Temporal Hebbian Learning
(Plot: synaptic change in % as a function of the spike times t_Pre and t_Post; weight-change curve from Bi & Poo, 2001.) Pre precedes Post (causal, possibly): long-term potentiation. Pre follows Post (acausal): long-term depression.
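The asymmetric weight-change curve is often summarized by two exponentials; the sketch below uses assumed amplitudes and time constants (not fitted to the Bi & Poo data) just to show the causal/acausal asymmetry:

```python
import numpy as np

def stdp_window(T, A_plus=1.0, A_minus=0.5, tau_plus=17.0, tau_minus=34.0):
    """Weight change (%) as a function of T = t_post - t_pre in ms.

    T > 0: pre precedes post -> potentiation; T < 0: post precedes pre -> depression.
    Amplitudes and time constants are illustrative assumptions.
    """
    T = np.asarray(T, dtype=float)
    return np.where(T >= 0,
                    A_plus * np.exp(-T / tau_plus),
                    -A_minus * np.exp(T / tau_minus))

for T in [-40, -10, -1, 1, 10, 40]:
    print(f"T = {T:+4d} ms -> dw = {float(stdp_window(T)):+.3f}")
```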
23
dw/dt = μ ⟨v u⟩, μ << 1
Back to the math. We had:
Single input: dw_1/dt = μ v u_1, μ << 1.
Many inputs: dw/dt = μ v u, μ << 1. As v is a single output, it is a scalar.
Averaging inputs: dw/dt = μ ⟨v u⟩, μ << 1. We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.
If we replace v with w · u we can write: dw/dt = μ Q · w, where Q = ⟨u uᵀ⟩ is the input correlation matrix.
Note: Hebb yields an unstable (always growing) weight vector!
24
Covariance Rule(s)
Normally firing rates are only positive and plain Hebb would yield only LTP. Hence we introduce a threshold θ to also get LTD:
dw/dt = μ (v − θ) u, μ << 1 (output threshold)
dw/dt = μ v (u − θ), μ << 1 (input vector threshold)
Many times one sets the threshold to the average activity over some reference time period (training period): θ = ⟨v⟩ or θ = ⟨u⟩. Together with v = w · u we get:
dw/dt = μ C · w, where C is the covariance matrix of the input:
C = ⟨(u − ⟨u⟩)(u − ⟨u⟩)ᵀ⟩ = ⟨u uᵀ⟩ − ⟨u⟩⟨u⟩ᵀ = ⟨(u − ⟨u⟩) uᵀ⟩
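A tiny sketch of the output-threshold covariance rule (toy positive inputs; θ kept as a running average of the output, which is one of the options named above):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 0.005
w = np.array([0.5, 0.5])
theta = 0.0                               # output threshold, here a running average of v

for step in range(2000):
    u = rng.uniform(0, 1, size=2)         # firing rates are only positive
    v = w @ u
    theta += 0.01 * (v - theta)           # slow estimate of the mean output <v>
    w += mu * (v - theta) * u             # covariance rule: LTP if v > theta, LTD if v < theta

print("weights:", np.round(w, 3), " threshold:", round(theta, 3))
```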
25
BCM Rule
The covariance rule can produce LTP without (!) post-synaptic output. This is biologically unrealistic and the BCM rule (Bienenstock, Cooper, Munro) takes care of this:
dw/dt = μ v u (v − θ), μ << 1
As such this rule is again unstable, but BCM introduces a sliding threshold:
dθ/dt = ν (v² − θ), ν < 1
Note: the rate of threshold change ν should be faster than the weight changes (μ), but slower than the presentation of the individual input patterns. This way the weight growth will be over-dampened relative to the (weight-induced) activity increase.
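A minimal BCM sketch (the rates μ and ν, the inputs, and the non-negativity clipping are assumptions for illustration) showing how the sliding threshold, which follows v², damps the weight growth:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, nu = 0.001, 0.05                  # threshold rate nu faster than weight rate mu
w = np.array([0.5, 0.5])
theta = 0.1

for step in range(5000):
    u = rng.uniform(0, 1, size=2)
    v = w @ u
    w += mu * v * u * (v - theta)     # BCM: LTP if v > theta, LTD if v < theta
    theta += nu * (v**2 - theta)      # sliding threshold follows v^2
    w = np.clip(w, 0, None)           # keep weights non-negative (illustrative choice)

print("weights:", np.round(w, 3), " theta:", round(theta, 3))
```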
26
Problem: Hebbian Learning can lead to unlimited weight growth.
Solution: Weight normalization a) subtractive (subtract the mean change of all weights from each individual weight); b) multiplicative (multiply each weight by a gradually decreasing factor). Evidence for weight normalization: reduced weight increase as soon as weights are already big (Bi and Poo, 1998, J. Neurosci.).
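A small sketch contrasting the two normalization schemes mentioned above, applied after each Hebbian update; inputs and parameters are arbitrary, and the multiplicative variant is implemented here as rescaling to a constant norm, which is one common reading of "a gradually decreasing factor":

```python
import numpy as np

rng = np.random.default_rng(5)
mu = 0.01

def hebb_step(w, u):
    v = w @ u
    return mu * v * u                       # raw Hebbian weight change

w_sub = rng.uniform(0.4, 0.6, size=5)       # weights for subtractive normalization
w_mul = w_sub.copy()                        # weights for multiplicative normalization

for _ in range(2000):
    u = rng.uniform(0, 1, size=5)
    dw = hebb_step(w_sub, u)
    w_sub += dw - dw.mean()                 # subtractive: remove mean change -> weight sum stays fixed
    dw = hebb_step(w_mul, u)
    w_mul += dw
    w_mul *= 1.0 / np.linalg.norm(w_mul)    # multiplicative: rescale to constant norm

print("subtractive:", np.round(w_sub, 3), " sum =", round(w_sub.sum(), 3))
print("multiplicative:", np.round(w_mul, 3), " norm =", round(np.linalg.norm(w_mul), 3))
```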
27
Examples of Applications
Kohonen (1984): speech recognition, a map of phonemes in the Finnish language. Goodhill (1993) proposed a model for the development of retinotopy and ocular dominance based on Kohonen maps (SOM). Angéniol et al. (1988): travelling salesman problem (an optimization problem). Kohonen (1990): learning vector quantization (a pattern classification problem). Ritter & Kohonen (1989): semantic maps. (Figures: ocular dominance (OD) and orientation (ORI) maps.)
28
Differential Hebbian Learning of Sequences
Learning to act in response to sequences of sensor events
29
Overview over different methods
You are here !
30
History of the Concept of Temporally
Asymmetrical Learning: Classical Conditioning. I. Pavlov
32
History of the Concept of Temporally
Asymmetrical Learning: Classical Conditioning. Correlating two stimuli which are shifted with respect to each other in time. Pavlov’s dog: “Bell comes earlier than Food”. This requires the system to remember the stimuli. Eligibility trace: a synapse remains “eligible” for modification for some time after it was active (Hull 1938, then a still abstract concept). I. Pavlov
33
Classical Conditioning: Eligibility Traces
(Diagram: the conditioned stimulus (Bell) passes through a stimulus trace E and a plastic weight w_1 (change Δw_1); the unconditioned stimulus (Food) enters with a fixed weight w_0 = 1; both are summed (Σ) to produce the response.) The first stimulus needs to be “remembered” in the system.
34
History of the Concept of Temporally
Asymmetrical Learning: Classical Conditioning. Eligibility Traces. Note: there are vastly different time-scales for (Pavlov’s) behavioural experiments, typically up to 4 seconds, as compared to STDP at neurons, typically milliseconds (max.). I. Pavlov
35
α-function: Defining the Trace
In general there are many ways to do this, but usually one chooses a trace that looks biologically realistic and also allows for some analytical calculations. EPSP-like functions: the α-function; a dampened sine wave (shows an oscillation); a double exponential (this one is easiest to handle analytically and is therefore often used).
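Possible concrete choices for the trace kernels named above (the exact parameterizations and time constants are assumptions, since the slide's formulas are not reproduced here):

```python
import numpy as np

def alpha_fn(t, tau=20.0):
    """Alpha-function: (t/tau) * exp(1 - t/tau), peaking at t = tau."""
    return np.where(t >= 0, (t / tau) * np.exp(1 - t / tau), 0.0)

def damped_sine(t, f=0.05, tau=40.0):
    """Dampened sine wave: shows an oscillation."""
    return np.where(t >= 0, np.sin(2 * np.pi * f * t) * np.exp(-t / tau), 0.0)

def double_exp(t, tau_rise=5.0, tau_decay=30.0):
    """Difference of two exponentials: easiest to handle analytically."""
    return np.where(t >= 0, np.exp(-t / tau_decay) - np.exp(-t / tau_rise), 0.0)

t = np.arange(0, 200.0)
for name, h in [("alpha", alpha_fn(t)), ("damped sine", damped_sine(t)), ("double exp", double_exp(t))]:
    print(f"{name:12s} peak at t = {t[np.argmax(h)]:5.1f}, peak value = {h.max():.3f}")
```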
36
Overview over different methods
Mathematical formulation of learning rules is similar but time-scales are much different.
37
Differential Hebb Learning Rule
Simpler notation: x = input, u = traced input, v = output. (Diagram: inputs x_i are filtered into traces u_i and weighted by w_i; the sum Σ gives the output v. Inputs x_i/u_i arrive early (“Bell”); input x_0/u_0 arrives late (“Food”).) The rule correlates the traced input with the derivative of the output: dw_i/dt = μ u_i(t) v'(t).
38
(Diagram: trace u and weight w.) Convolution is used to define the traced input; correlation is used to calculate the weight growth.
39
Differential Hebbian Learning
dw_i/dt = μ u_i(t) dv/dt: the filtered input is correlated with the derivative of the output. This produces an asymmetric weight-change curve as a function of the interval T (if the filters h produce unimodal „humps“).
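A sketch of this rule with two filtered inputs (assumed double-exponential filters and timings, with the plastic input's weight starting at zero): the total weight change comes out positive when the plastic input precedes the other input and negative when it follows, giving the asymmetric curve.

```python
import numpy as np

def trace(spike_time, t, tau_r=5.0, tau_d=30.0):
    """Double-exponential trace ('hump') triggered at spike_time."""
    s = t - spike_time
    return np.where(s >= 0, np.exp(-s / tau_d) - np.exp(-s / tau_r), 0.0)

dt = 0.1
t = np.arange(0, 600.0, dt)
mu = 0.01

def weight_change(T):
    """Total change of the plastic weight w1 for one pairing.

    The plastic input x1 ('Bell') occurs at t=200; the reflex input x0
    ('Food', fixed weight w0 = 1) occurs at t = 200 + T.
    """
    u1 = trace(200.0, t)               # traced plastic input
    u0 = trace(200.0 + T, t)           # traced reflex input
    v = 1.0 * u0 + 0.0 * u1            # output before learning (w1 starts at 0)
    dv = np.gradient(v, dt)            # derivative of the output
    return mu * np.sum(u1 * dv) * dt   # dw1 = mu * integral of u1(t) * v'(t) dt

for T in [-60, -20, 20, 60]:
    print(f"x0 delayed by T = {T:+4d}: dw1 = {weight_change(T):+.4f}")
```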
40
Symmetrical Weight-change curve
Conventional LTP. (Plot: synaptic change in % as a function of the pre- and postsynaptic spike times t_Pre and t_Post.) The weight-change curve is symmetrical: the temporal order of input and output does not play any role.
41
Differential Hebbian Learning
dw_i/dt = μ u_i(t) dv/dt: the filtered input is correlated with the derivative of the output. This produces an asymmetric weight-change curve as a function of the interval T (if the filters h produce unimodal „humps“).
42
Spike-timing-dependent plasticity (STDP): Some vague shape similarity
(Plot: synaptic change in % as a function of T = t_Post − t_Pre in ms; weight-change curve from Bi & Poo, 2001.) Pre precedes Post: long-term potentiation. Pre follows Post: long-term depression.
43
Overview over different methods
You are here !
44
The biophysical equivalent of
Hebb’s postulate. (Diagram: a plastic synapse with NMDA/AMPA receptors; the presynaptic signal is glutamate (Glu), the postsynaptic side provides a source of depolarization.) Pre-post correlation, but why is this needed?
45
Plasticity is mainly mediated by so-called
N-methyl-D-aspartate (NMDA) channels. These channels respond to glutamate as their transmitter and they are voltage dependent.
46
Biophysical Model: Structure
(Diagram: input x drives an NMDA synapse onto output v.) Source of depolarization: 1) any other drive (AMPA or NMDA), 2) a back-propagating spike. Hence NMDA synapses (channels) do require a (Hebbian) correlation between pre- and post-synaptic activity!
47
Local Events at the Synapse
Current sources “under” the synapse: the synaptic current I_synaptic, currents from all parts of the dendritic tree I_dendritic, and the influence of a back-propagating spike I_BP; these range from local to global. (Diagram: input x_1/u_1, output v.)
48
On „Eligibility Traces“
(Diagram: the membrane potential is built from the weighted synaptic input, the NMDA conductance g_NMDA triggered by the presynaptic spike and multiplied by the weight w, summed (Σ) with the depolarization source, a BP- or D-spike low-pass filtered as V*h.)
49
Model structure: dendritic compartment with a source of plastic depolarization
(Diagram: a plastic synapse with NMDA channels (NMDA/AMPA conductance g) acts as the source of Ca2+ influx and as the coincidence detector.) Source of depolarization: 1. back-propagating (BP) spike, 2. local dendritic spike.
50
Plasticity Rule (Differential Hebb): Instantaneous weight change
(Diagram: NMDA synapse = plastic synapse with NMDA/AMPA conductance g, plus a source of depolarization.) The instantaneous weight change is the product of a presynaptic influence (the glutamate effect on the NMDA channels) and a postsynaptic influence.
51
Pre-synaptic influence
(Diagram: NMDA synapse = plastic synapse with NMDA/AMPA conductance g, plus a source of depolarization.) Pre-synaptic influence: the normalized NMDA conductance. NMDA channels are instrumental for LTP and LTD induction (Malenka and Nicoll, 1999; Dudek and Bear, 1992).
52
Depolarizing potentials in the dendritic tree
Dendritic spikes (Larkum et al., 2001; Golding et al., 2002; Häusser and Mel, 2003). Back-propagating spikes (Stuart et al., 1997).
53
Postsynaptic influence
(Diagram: NMDA synapse = plastic synapse with NMDA/AMPA conductance g, plus a source of depolarization.) For F we use a low-pass filtered („slow“) version of a back-propagating or a dendritic spike.
54
BP and D-Spikes
55
Source of Depolarization: Back-Propagating Spikes
Weight-change curves. Source of depolarization: back-propagating spikes. (Plots: the NMDA receptor activation and the back-propagating spike together determine the weight-change curve as a function of T = t_Post − t_Pre.)
56
CLOSED LOOP LEARNING Learning to Act (to produce appropriate behavior)
Instrumental (Operant) Conditioning
57
This is an Open Loop System !
(Diagram: the conditioned input (Bell, Sensor 2) and the unconditioned input (Food) both lead to salivation; Pavlov, 1927; the bell precedes the food in a temporal sequence.) This is an open-loop system!
58
Closed loop: an adaptable neuron senses and behaves; the loop is closed through the environment (Env.) by behaving and sensing.
59
Instrumental/Operant Conditioning
60
Behaviorism B.F. Skinner (1904-1990)
“All we need to know in order to describe and explain behavior is this: actions followed by good outcomes are likely to recur, and actions followed by bad outcomes are less likely to recur.” (Skinner, 1953). Skinner invented the type of experiment called operant conditioning. B.F. Skinner (1904-1990)
61
Operant behavior: occurs without an observable external stimulus.
It operates on the organism’s environment. The behavior is instrumental in securing a stimulus and is more representative of everyday learning. (Figure: Skinner box.)
62
OPERANT CONDITIONING TECHNIQUES
POSITIVE REINFORCEMENT = increasing a behavior by administering a reward.
NEGATIVE REINFORCEMENT = increasing a behavior by removing an aversive stimulus when the behavior occurs.
PUNISHMENT = decreasing a behavior by administering an aversive stimulus following a behavior OR by removing a positive stimulus.
EXTINCTION = decreasing a behavior by not rewarding it.
63
Overview over different methods
You are here !
64
How to assure behavioral & learning convergence ??
This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action. Remove before being hit!
65
(Compare to an electronic closed loop controller!)
Reflex only (compare to an electronic closed-loop controller; think of a thermostat!). This structure assures initial (behavioral) stability (“homeostasis”).
66
Robot Application. (Diagram: the vision signal x arrives early (“Vision”); the bump signal arrives late (“Bump”); the signals are weighted (w) and summed (Σ) to drive the behavior.)
67
Robot Application Learning Goal:
Correlate the vision signals with the touch signals and navigate without collisions. Initially built-in behavior: Retraction reaction whenever an obstacle is touched.
68
Robot Example
69
What has happened during learning to the system ?
The primary reflex reaction has effectively been eliminated and replaced by an anticipatory action.
70
Overview over different methods – Supervised Learning
And many more You are here !
71
Supervised learning methods are mostly non-neuronal and will therefore not be discussed here.
72
Reinforcement Learning (RL)
Learning from rewards (and punishments). Learning to assess the value of states. Learning goal-directed behavior. RL has been developed rather independently in two different fields: 1) dynamic programming and machine learning (Bellman equation), and 2) psychology (classical conditioning) and later neuroscience (the dopamine system in the brain).
73
Back to Classical Conditioning
U(C)S = Unconditioned Stimulus, U(C)R = Unconditioned Response, CS = Conditioned Stimulus, CR = Conditioned Response. I. Pavlov
74
Less “classical” but also Conditioning !
(Example from a car advertisement) Learning the association CS → U(C)R Porsche → Good Feeling
75
Overview over different methods – Reinforcement Learning
You are here !
76
Overview over different methods – Reinforcement Learning
And later also here !
77
Notation
US = r, R = “Reward”. CS = s, u = Stimulus = “State”¹. CR = v, V = (strength of the) expected reward = “Value”. UR = --- (not required in mathematical formalisms of RL). Weight = w = weight used for calculating the value, e.g. v = w u. Action = a = “Action”. Policy = π = “Policy”.
¹ Note: the notion of a “state” really only makes sense as soon as there is more than one state.
78
A note on “Value” and “Reward Expectation”
If you are at a certain state then you would value this state according to how much reward you can expect when moving on from this state to the end-point of your trial. Hence: Value = Expected Reward ! More accurately: Value = Expected cumulative future discounted reward. (for this, see later!)
79
Types of Rules. Rescorla-Wagner rule: allows for explaining several types of conditioning experiments. TD-rule (TD-algorithm): allows measuring the value of states and accumulating rewards; thereby it generalizes the Rescorla-Wagner rule. The TD-algorithm can be extended to allow measuring the value of actions and thereby control behavior, either by way of Q- or SARSA-learning, or with Actor-Critic architectures.
80
Overview over different methods – Reinforcement Learning
You are here !
81
Rescorla-Wagner Rule
Pre-Train / Train / Result:
Pavlovian: train u→r; result u→v=max.
Extinction: pre-train u→r; train u→●; result u→v=0.
Partial: train u→r alternating with u→●; result u→v<max.
We define v = w u, with u = 1 or u = 0 (binary), and w → w + μ δ u, with δ = r − v. The associability between stimulus u and reward r is represented by the learning rate μ. This learning rule minimizes the average squared error between actual reward r and prediction v, hence min ⟨(r − v)²⟩. We realize that δ is the prediction error.
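A small simulation sketch of the Rescorla-Wagner rule for the three paradigms in the table above (binary stimulus, r = 1; the learning rate and trial counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 0.1                           # associability / learning rate

def rw_trials(w, rewards, u=1.0):
    """Run Rescorla-Wagner updates: w <- w + mu * delta * u, with delta = r - v."""
    for r in rewards:
        v = w * u
        delta = r - v
        w += mu * delta * u
    return w

# Pavlovian: stimulus always paired with reward -> v approaches max (= 1).
w_pav = rw_trials(0.0, [1.0] * 100)
# Extinction: first pair, then present the stimulus without reward -> v decays to 0.
w_ext = rw_trials(rw_trials(0.0, [1.0] * 100), [0.0] * 100)
# Partial: reward in 50% of trials -> v settles near 0.5 (< max).
w_par = rw_trials(0.0, rng.integers(0, 2, size=400).astype(float))

print(f"Pavlovian v = {w_pav:.2f}, Extinction v = {w_ext:.2f}, Partial v = {w_par:.2f}")
```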
82
(Plot: learning curves for the Pavlovian, Extinction and Partial paradigms.) Stimulus u is paired with r = 1 in 100% of the discrete “epochs” for Pavlovian and in 50% of the cases for Partial.
83
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli
We define v = w·u, and w → w + μ δ u with δ = r − v, where we minimize δ.
Blocking: pre-train u1→r; train u1+u2→r; result u1→v=max, u2→v=0.
For blocking: the association formed during pre-training leads to δ=0. As w2 starts with zero, the expected reward v = w1u1 + w2u2 remains at r. This keeps δ=0 and the new association with u2 cannot be learned.
84
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli
Inhibitory: train u1+u2→● alternating with u1→r; result u1→v=max, u2→v<0.
Inhibitory conditioning: presentation of one stimulus together with the reward, alternating with presentation of a pair of stimuli where the reward is missing. In this case the second stimulus actually predicts the ABSENCE of the reward (negative v). Trials in which the first stimulus is presented together with the reward lead to w1>0. In trials where both stimuli are present, the net prediction will be v = w1u1 + w2u2 = 0. As u1,2 = 1 (or zero) and w1>0, we get w2<0 and, consequently, v(u2)<0.
85
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli
Overshadow: train u1+u2→r; result u1→v<max, u2→v<max.
Overshadowing: always presenting two stimuli together with the reward will lead to a “sharing” of the reward prediction between them. We get v = w1u1 + w2u2 = r. Using different learning rates μ will lead to differently strong growth of w1,2 and represents the often observed different saliency of the two stimuli.
86
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli
Secondary: pre-train u1→r; train u2→u1; result u2→v=max.
Secondary conditioning reflects the “replacement” of one stimulus by a new one for the prediction of a reward. As we have seen, the Rescorla-Wagner rule is very simple but still able to represent many of the basic findings of diverse conditioning experiments. Secondary conditioning, however, CANNOT be captured.
87
Predicting Future Reward
The Rescorla-Wagner rule cannot deal with the sequentiality of stimuli (required to deal with secondary conditioning). As a consequence it treats this case similarly to inhibitory conditioning, leading to a negative w2. Animals can predict such sequences to some degree and form the correct associations. For this we need algorithms that keep track of time. Here we do this by means of states that are subsequently visited and evaluated.
88
Prediction and Control
The goal of RL is two-fold: To predict the value of states (exploring the state space following a policy) – Prediction Problem. Change the policy towards finding the optimal policy – Control Problem. Terminology (again): State, Action, Reward, Value, Policy
89
Markov Decision Problems (MDPs)
(Diagram: an agent moves between states by actions and receives rewards.) If the future of the system always depends only on the current state and action, then the system is said to be “Markovian”.
90
What does an RL-agent do ?
An RL-agent explores the state space trying to accumulate as much reward as possible. It follows a behavioral policy performing actions (which usually will lead the agent from one state to the next). For the Prediction Problem: It updates the value of each given state by assessing how much future (!) reward can be obtained when moving onwards from this state (State Space). It does not change the policy, rather it evaluates it. (Policy Evaluation).
91
For the Control Problem: It updates the value of each given action at a given state by assessing how much future reward can be obtained when performing this action at that state, and all following actions at the following states moving onwards (State-Action Space, which is larger than the State Space). Guess: Will we have to evaluate ALL states and actions onwards?
92
What does an RL-agent do ?
Exploration – Exploitation Dilemma: The agent wants to get as much cumulative reward (also often called return) as possible. For this it should always perform the most rewarding action, “exploiting” its (learned) knowledge of the state space. This way, however, it might miss an action which leads (a bit further on) to a much more rewarding path. Hence the agent must also “explore” unknown parts of the state space. The agent must thus balance its policy between exploitation and exploration. Policies. Greedy policy: the agent always exploits and selects the most rewarding action. This is sub-optimal as the agent never finds better new paths.
93
Policies. ε-Greedy policy: with a small probability ε the agent will choose a non-optimal action. *All non-optimal actions are chosen with equal probability.* This can take very long, as it is not known how big ε should be. One can also “anneal” the system by gradually lowering ε to become more and more greedy.
Softmax policy: ε-greedy can be problematic because of (*). Softmax ranks the actions according to their values and chooses roughly following the ranking, using for example P(a) = exp(Q_a/T) / Σ_b exp(Q_b/T), where Q_a is the value of the currently evaluated action a and T is a temperature parameter. For large T all actions have approximately equal probability of being selected.
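A sketch of the two action-selection policies described above (the Q-values here are just an assumed example array):

```python
import numpy as np

rng = np.random.default_rng(7)
Q = np.array([1.0, 1.5, 0.2])           # example action values

def epsilon_greedy(Q, eps=0.1):
    """With probability eps pick a uniformly random action, else the best one."""
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax_policy(Q, T=1.0):
    """Gibbs/Boltzmann selection: P(a) = exp(Q_a/T) / sum_b exp(Q_b/T)."""
    prefs = np.exp((Q - Q.max()) / T)   # subtract max for numerical stability
    p = prefs / prefs.sum()
    return int(rng.choice(len(Q), p=p))

counts_eps = np.bincount([epsilon_greedy(Q) for _ in range(10000)], minlength=3)
counts_sm = np.bincount([softmax_policy(Q, T=0.5) for _ in range(10000)], minlength=3)
print("epsilon-greedy selection counts:", counts_eps)   # mostly the best action
print("softmax selection counts:       ", counts_sm)    # graded by value
```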
94
Overview over different methods – Reinforcement Learning
You are here !
95
Towards TD-learning – Pictorial View
In the following slides we will treat “Policy evaluation”: We define some given policy and want to evaluate the state space. We are at the moment still not interested in evaluating actions or in improving policies. Back to the question: To get the value of a given state, will we have to evaluate ALL states and actions onwards? There is no unique answer to this! Different methods exist which assign the value of a state by using differently many (weighted) values of subsequent states. We will discuss a few but concentrate on the most commonly used TD-algorithm(s). Temporal Difference (TD) Learning
96
Formalising RL: Policy evaluation with the goal to find the optimal value function of the state space
We consider a sequence s_t, r_{t+1}, s_{t+1}, r_{t+2}, ..., r_T, s_T. Note, rewards occur downstream (in the future) from a visited state. Thus, r_{t+1} is the next future reward which can be reached starting from state s_t. The complete return R_t to be expected in the future from state s_t is thus given by:
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k≥0} γ^k r_{t+k+1}
where γ ≤ 1 is a discount factor. This accounts for the fact that rewards in the far future should be valued less. Reinforcement learning assumes that the value of a state V(s) is directly equivalent to the expected return E_π at this state, where π denotes the (here unspecified) action policy to be followed:
V(s) = E_π[R_t | s_t = s]
Thus, the value of state s_t can be iteratively updated with:
V(s_t) ← V(s_t) + α [R_t − V(s_t)]
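For illustration, a few lines computing the discounted return R_t for an assumed reward sequence and γ = 0.9:

```python
# Discounted return: R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]      # r_{t+1}, ..., r_T (made-up sequence)

R_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(f"R_t = {R_t:.3f}")                # 1.0*0.9**2 + 2.0*0.9**4 = 2.122
```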
97
We use α as a step-size parameter; it is not of great importance here, though, and can be held constant. Note, if V(s_t) correctly predicts the expected complete return R_t, the update will be zero and we have found the final value. This method is called the constant-α Monte Carlo update. It requires waiting until a sequence has reached its terminal state (see some slides before!) before the update can commence. For long sequences this may be problematic. Thus, one should try to use an incremental procedure instead. We define a different update rule with:
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
The elegant trick is to assume that, if the process converges, the value of the next state V(s_{t+1}) should be an accurate estimate of the expected return downstream of this state (i.e., downstream of s_{t+1}). Thus, we would hope that the following holds:
R_t ≈ r_{t+1} + γ V(s_{t+1})
This is why it is called TD (temporal difference) learning. Indeed, proofs exist that under certain boundary conditions this procedure, known as TD(0), converges to the optimal value function for all states.
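A minimal TD(0) policy-evaluation sketch on a small chain of states with a terminal reward (the environment and all parameters are made up; it only illustrates the update rule above):

```python
import numpy as np

rng = np.random.default_rng(8)
n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states + 1)               # state values; index n_states is the terminal state

def step(s):
    """Fixed stochastic policy: move right with p=0.8, stay with p=0.2.

    Reward 1 is given on entering the terminal state, else 0.
    """
    s_next = s + 1 if rng.random() < 0.8 else s
    r = 1.0 if s_next == n_states else 0.0
    return s_next, r

for episode in range(2000):
    s = 0
    while s != n_states:
        s_next, r = step(s)
        # TD(0): V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print("estimated values:", np.round(V[:n_states], 2))   # values increase toward the reward
```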
98
Reinforcement Learning – Relations to Brain Function I
You are here !
99
How to implement TD in a Neuronal Way
Now we have the TD-rule; compare it with what we had defined earlier (first lecture!): the stimulus is passed through a trace u_1. (Diagram.)
100
How to implement TD in a Neuronal Way
Serial-compound representations x_1, ..., x_n are used for defining an eligibility trace. Note: the term v(t+1) − v(t) is acausal (it refers to the future!). Make it “causal” by using delays.
101
Reinforcement Learning – Relations to Brain Function II
You are here !
102
TD-learning & Brain Function
This neuron is supposed to represent the δ-error of TD-learning, which has moved forward as expected. Omission of the reward leads to inhibition, as also predicted by the TD-rule. DA responses are recorded in the pars compacta of the substantia nigra (basal ganglia) and in the medially adjoining ventral tegmental area (VTA).
103
TD-learning & Brain Function
This neuron is supposed to represent the reward-expectation signal v. It has extended forward (almost) to the CS (here called Tr), as expected from the TD-rule. This is even better visible in the population response of 68 striatal neurons. Such neurons are found in the striatum, orbitofrontal cortex and amygdala.
104
Reinforcement Learning – The Control Problem
So far we have concentrated on evaluating an unchanging policy. Now comes the question of how to actually improve a policy π, trying to find the optimal policy. We will discuss: Actor-Critic architectures. But not: SARSA learning, Q-learning. Abbreviation for policy: π
105
Reinforcement Learning – Control Problem I
You are here !
106
Control Loops A basic feedback–loop controller (Reflex) as in the slide before.
107
Control Loops. An Actor-Critic architecture: the Critic produces evaluative reinforcement feedback for the Actor by observing the consequences of its actions. The Critic takes the form of a TD-error, which gives an indication of whether things have gone better or worse than expected with the preceding action. Thus, this TD-error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; otherwise, it should be lessened.
108
Example of an Actor-Critic Procedure
Action selection here follows the Gibbs softmax method:
π(s,a) = exp(p(s,a)) / Σ_b exp(p(s,b))
where p(s,a) are the values of the modifiable (by the Critic!) policy parameters of the actor, indicating the tendency to select action a when being in state s. We can now modify p for a given state-action pair at time t with:
p(s_t, a_t) → p(s_t, a_t) + β δ_t
where δ_t is the δ-error of the TD-Critic and β is a learning rate.
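A compact sketch tying together a Gibbs-softmax actor and a TD-Critic on a toy chain environment (the environment, the learning rates, and the use of β as the actor's step size are assumptions for illustration, not the lecture's specific model):

```python
import numpy as np

rng = np.random.default_rng(9)
n_states, alpha, beta, gamma = 5, 0.1, 0.1, 0.9
V = np.zeros(n_states + 1)                # Critic: state values (last index is terminal)
p = np.zeros((n_states, 2))               # Actor: preferences for actions 0=left, 1=right

def select_action(s, T=1.0):
    """Gibbs softmax over the actor's preferences p(s, a)."""
    prefs = np.exp((p[s] - p[s].max()) / T)
    return int(rng.choice(2, p=prefs / prefs.sum()))

for episode in range(3000):
    s = 0
    while s != n_states:
        a = select_action(s)
        s_next = min(s + 1, n_states) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states else 0.0   # reward only at the right end
        delta = r + gamma * V[s_next] - V[s]     # TD-error produced by the Critic
        V[s] += alpha * delta                    # Critic update
        p[s, a] += beta * delta                  # Actor update: reinforce if delta > 0
        s = s_next

print("V:", np.round(V[:n_states], 2))
print("preference for 'right' minus 'left':", np.round(p[:, 1] - p[:, 0], 2))
```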
109
Reinforcement Learning – Control I & Brain Function III
You are here !
110
Actor-Critics and the Basal Ganglia
The basal ganglia are a brain structure involved in motor control. It has been suggested that they learn by ways of an Actor-Critic mechanism. VP=ventral pallidum, SNr=substantia nigra pars reticulata, SNc=substantia nigra pars compacta, GPi=globus pallidus pars interna, GPe=globus pallidus pars externa, VTA=ventral tegmental area, RRA=retrorubral area, STN=subthalamic nucleus.
111
Actor-Critics and the Basal Ganglia: The Critic
Cortex = C, striatum = S, STN = subthalamic nucleus, DA = dopamine system, r = reward. So-called striosomal modules fulfill the functions of the adaptive Critic. The prediction-error (δ) characteristics of the DA neurons of the Critic are generated by: 1) equating the reward r with excitatory input from the lateral hypothalamus; 2) equating the term v(t) with indirect excitation at the DA neurons, which is initiated from striatal striosomes and channelled through the subthalamic nucleus onto the DA neurons; 3) equating the term v(t−1) with direct, long-lasting inhibition from striatal striosomes onto the DA neurons. There are many problems with this simplistic view, though: timing, mismatch to anatomy, etc.
112
The End