Prediction, Control and Decisions
Kenji Doya (doya@irp.oist.jp)
Initial Research Project, OIST
ATR Computational Neuroscience Laboratories
CREST, Japan Science and Technology Agency
Nara Institute of Science and Technology
“Creating the Brain”
- Traditional neuroscience: external observation (anatomy, recording, lesions)
- Synthetic approach: think from inside the brain. What is the problem? How can we solve it?
- Robots as thinking tools
Outline
- Introduction
- Cerebellum, basal ganglia, and cortex
- Meta-learning and neuromodulators
- Prediction time scale and serotonin
Learning to Walk (Doya & Nakano, 1985)
- Action: a cycle of 4 postures
- Reward: speed sensor output
- Multiple solutions: creeping, jumping, …
Learning to Stand Up (Morimoto & Doya, 2001)
- Reward: height of the head
- No desired trajectory
(Movies: early trials vs. after learning)
Reinforcement Learning (RL)
- A framework for learning a state-action mapping (policy) by exploration and reward feedback
- Critic: reward prediction
- Actor: action selection
- Learning signals: external reward r; internal reward δ, the difference from the prediction
(Diagram: an agent consisting of a critic and an actor interacts with the environment through state s, action a, and reward r)
Reinforcement Learning Methods
- Model-free methods
  - Episode-based: parameterize the policy P(a|s; θ)
  - Temporal difference: state value function V(s); (state-)action value function Q(s,a)
- Model-based methods
  - Dynamic programming with a forward model P(s'|s,a)
Temporal Difference Learning
- Predict reward: value functions
  V(s) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t) = s]
  Q(s,a) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t) = s, a(t) = a]
- Select action
  greedy: a = argmax_a Q(s,a)
  Boltzmann: P(a|s) ∝ exp[β Q(s,a)]
- Update prediction: TD error
  δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)
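The update rules above can be made concrete in a few lines of Python. The following is a minimal sketch, not material from the presentation: a tabular critic V and action values Q are trained with the TD error δ and Boltzmann action selection. The 5-state chain environment (`step`), the metaparameter values, and the restart rule are illustrative assumptions; only the update and selection equations follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
alpha, beta, gamma = 0.1, 2.0, 0.9   # learning rate, inverse temperature, discount factor

V = np.zeros(n_states)               # state value function V(s)
Q = np.zeros((n_states, n_actions))  # action value function Q(s, a)

def boltzmann(q_row, beta):
    """P(a|s) proportional to exp(beta * Q(s, a))."""
    p = np.exp(beta * (q_row - q_row.max()))   # subtract the max for numerical stability
    return p / p.sum()

def step(s, a):
    """Toy chain dynamics: action 1 moves right, action 0 moves left; reward at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

s = 0
for t in range(5000):
    a = rng.choice(n_actions, p=boltzmann(Q[s], beta))   # Boltzmann action selection
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]   # TD error: delta(t) = r(t) + gamma*V(s(t+1)) - V(s(t))
    V[s] += alpha * delta                  # delta-V(s(t)) = alpha * delta(t)
    Q[s, a] += alpha * delta               # delta-Q(s(t), a(t)) = alpha * delta(t)
    s = 0 if r > 0 else s_next             # restart at the left end after each reward

print(np.round(V, 2))
print(np.round(Q, 2))
```

Run for a few thousand steps, the learned V should increase toward the rewarded end of the chain, and larger β would make the choices nearly greedy.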
Dynamic Programming and RL
- Dynamic programming: model-based, off-line; solve the Bellman equation
  V(s) = max_a Σ_s' P(s'|s,a) [r(s,a,s') + γ V(s')]
- Reinforcement learning: model-free, on-line; learn from the TD error
  δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)
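For comparison with the TD rule, here is a minimal value-iteration sketch of the model-based route, i.e. repeatedly applying the Bellman backup above. The 3-state transition table `P` and reward table `R` are made-up examples, not anything from the presentation.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# P[s, a, s'] = transition probability, R[s, a, s'] = reward (toy deterministic model)
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 0], P[0, 1, 1] = 1.0, 1.0
P[1, 0, 0], P[1, 1, 2] = 1.0, 1.0
P[2, :, 2] = 1.0                      # absorbing goal state
R = np.zeros((n_states, n_actions, n_states))
R[1, 1, 2] = 1.0                      # reward for reaching the goal

V = np.zeros(n_states)
for _ in range(100):                  # iterate the Bellman optimality backup
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q(s,a) = sum_s' P(s'|s,a) [r + gamma*V(s')]
    V_new = Q.max(axis=1)                   # V(s) = max_a Q(s,a)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

print(np.round(V, 3))                 # optimal values under the toy model
```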
Discrete vs. Continuous RL (Doya, 2000)
(Figure: discrete-time and continuous-time formulations)
Questions
- Computational questions
  - How to learn a direct policy P(a|s), value functions V(s) and Q(s,a), and forward models P(s'|s,a)?
  - When to use which method?
- Biological questions
  - Where in the brain?
  - How are they represented and updated?
  - How are they selected and coordinated?
Brain Hierarchy
- Forebrain
  - Cerebral cortex: neocortex; paleocortex (olfactory cortex); archicortex (basal forebrain, hippocampus)
  - Basal nuclei: neostriatum (caudate, putamen); paleostriatum (globus pallidus); archistriatum (amygdala)
  - Diencephalon: thalamus, hypothalamus
- Brain stem and cerebellum
  - Midbrain
  - Hindbrain: pons, cerebellum
  - Medulla
- Spinal cord
Just for Motor Control? (Middleton & Strick 1994)
- Basal ganglia (globus pallidus)
- Prefrontal cortex (area 46)
- Cerebellum (dentate nucleus)
Specialization by Learning Algorithms (Doya, 1999)
- Cerebellum: supervised learning; input-output mapping trained by a target/error signal (from the inferior olive, IO)
- Basal ganglia: reinforcement learning; input-output mapping trained by a reward signal (from the substantia nigra, SN)
- Cerebral cortex: unsupervised learning of input-output representations
(Diagram: cortex, basal ganglia, and cerebellum connected through the thalamus)
Cerebellum
- Purkinje cells: ~10^5 parallel fibers, a single climbing fiber, long-term depression
- Supervised learning: the perceptron hypothesis; internal models
Internal Models in the Cerebellum (Imamizu et al., 2000)
- Learning to use a 'rotated' mouse
(Figure: cerebellar activity in early learning vs. after learning)
Motor Imagery (Luft et al. 1998)
(Figure: activation during finger movement vs. imagery of movement)
Basal Ganglia
- Striatum: striosome and matrix compartments; dopamine-dependent plasticity
- Dopamine neurons: reward-predictive responses; TD learning
Dopamine Neurons and TD Error (Schultz et al. 1997)
- δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
(Figure: dopamine responses and r, V traces before learning, after learning, and when the reward is omitted)
Reward-Predicting Activities of Striatal Neurons (Kawagoe et al., 1998)
- Delayed saccade task
- Neurons encode not just actions, but the resulting rewards
(Figure: responses for target directions Right, Up, Left, Down under reward conditions Right, Up, Left, Down, All)
Cerebral Cortex
- Recurrent connections
- Hebbian plasticity
- Unsupervised learning, e.g., PCA, ICA
Replicating V1 Receptive Fields (Olshausen & Field, 1996)
- Infomax and sparseness
- Hebbian plasticity and recurrent inhibition
Specialization by Learning?
- Cerebellum: supervised learning; error signal via climbing fibers; forward model s' = f(s,a) and policy a = g(s)
- Basal ganglia: reinforcement learning; reward signal via dopamine fibers; value functions V(s) and Q(s,a)
- Cerebral cortex: unsupervised learning; Hebbian plasticity and recurrent inhibition; representation of state s and action a
- But how are they recruited and combined?
Multiple Action Selection Schemes
- Model-free: a = argmax_a Q(s,a)
- Model-based: a = argmax_a [r + V(f(s,a))], using a forward model f(s,a)
- Encapsulation: a = g(s)
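A small Python sketch of the three selection schemes side by side may help. The tables `Q`, `V`, the forward model `f`, the predicted reward `r`, and the policy table `g` are all made-up placeholders for a 3-state, 2-action problem; only the three selection rules themselves come from the slide.

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.array([[0.1, 0.5], [0.3, 0.2], [0.0, 0.0]])   # cached action values Q(s, a)
V = np.array([0.2, 0.6, 1.0])                        # state values V(s)
f = np.array([[0, 1], [0, 2], [2, 2]])               # forward model: s' = f(s, a)
r = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # predicted immediate reward r(s, a)
g = np.array([1, 1, 0])                              # encapsulated policy a = g(s)

s = 1
a_model_free  = int(np.argmax(Q[s]))                 # a = argmax_a Q(s, a)
a_model_based = int(np.argmax(r[s] + V[f[s]]))       # a = argmax_a [r + V(f(s, a))]
a_policy      = int(g[s])                            # a = g(s)

print(a_model_free, a_model_based, a_policy)
```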
Lectures at OCNC 2005
- Internal models / Cerebellum: Reza Shadmehr, Stefan Schaal, Mitsuo Kawato
- Reward / Basal ganglia: Andrew G. Barto, Bernard Balleine, Peter Dayan, John O’Doherty, Minoru Kimura, Wolfram Schultz
- State coding / Cortex: Nathaniel Daw, Leo Sugrue, Daeyeol Lee, Jun Tanji, Anitha Pasupathy, Masamichi Sakagami
Outline
- Introduction
- Cerebellum, basal ganglia, and cortex
- Meta-learning and neuromodulators
- Prediction time scale and serotonin
Reinforcement Learning (RL)
- A framework for learning a state-action mapping (policy) by exploration and reward feedback
- Critic: reward prediction
- Actor: action selection
- Learning signals: external reward r; internal reward δ, the difference from the prediction
(Diagram: an agent consisting of a critic and an actor interacts with the environment through state s, action a, and reward r)
Reinforcement Learning
- Predict reward: value functions
  V(s) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t) = s]
  Q(s,a) = E[r(t) + γ r(t+1) + γ² r(t+2) + … | s(t) = s, a(t) = a]
- Select action
  greedy: a = argmax_a Q(s,a)
  Boltzmann: P(a|s) ∝ exp[β Q(s,a)]
- Update prediction: TD error
  δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)
RL Model of the Basal Ganglia (…, Doya 2000)
- Striatum: value functions V(s) and Q(s,a)
- Dopamine neurons: TD error δ
- SNr/GPi: action selection, Q(s,a) → a
Cyber Rodent Project
- Robots with the same constraints as biological agents
- What is the origin of rewards?
- What should be learned, and what should be evolved?
- Self-preservation: capture batteries
- Self-reproduction: exchange programs through IR ports
Cyber Rodent: Hardware
- Camera, range sensor, proximity sensors, gyro
- Battery latch, two wheels
- IR port, speaker, microphones, R/G/B LED
Evolving Robot Colony
- Survival: catch battery packs
- Reproduction: copy ‘genes’ through IR ports
Exploration: Inverse Temperature β
- Small β: wide exploration; large β: focused search (figure)
Discounting Future Reward γ
- Large γ: long-range prediction; small γ: emphasis on immediate reward (figure)
Setting of the Reward Function
- Reward r = r_main + r_supp - r_cost
- e.g., a supplementary reward for having a battery in view
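A hedged sketch of how such a composite reward might look in code; the particular terms (battery capture, a battery in view, a movement cost) and their weights are guesses prompted by the slide's example, not the actual Cyber Rodent reward function.

```python
def reward(captured_battery: bool, battery_in_view: bool, wheel_speed: float) -> float:
    """Composite reward r = r_main + r_supp - r_cost (illustrative terms only)."""
    r_main = 1.0 if captured_battery else 0.0   # main reward: capturing a battery
    r_supp = 0.1 if battery_in_view else 0.0    # supplementary reward: battery visible to the camera
    r_cost = 0.01 * abs(wheel_speed)            # cost: energy spent on movement
    return r_main + r_supp - r_cost

print(reward(False, True, 2.0))   # small shaping reward minus a movement cost
```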
Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003)
- Fluctuations in the metaparameters correlate with the average reward
Randomness Control by Battery Level
- Inverse temperature β as a function of battery level (figure; battery level 0-1, β roughly 0-14)
- Greedier action at both extremes of the battery level
Neuromodulators for Metalearning (Doya, 2002)
- Metaparameter tuning is critical in RL; how does the brain tune them?
- Dopamine: TD error δ
- Acetylcholine: learning rate α
- Noradrenaline: inverse temperature β
- Serotonin: discount factor γ
Learning Rate α
- ΔV(s(t-1)) = α δ(t); ΔQ(s(t-1),a(t-1)) = α δ(t)
- Small α: slow learning; large α: unstable learning
- Acetylcholine (basal forebrain)
  - Regulates memory update and retention (Hasselmo et al.)
  - LTP in cortex and hippocampus; top-down vs. bottom-up information flow
Inverse Temperature β
- Greediness in action selection: P(a_i|s) ∝ exp[β Q(s,a_i)]
- Small β: exploration; large β: exploitation
- Noradrenaline (locus coeruleus)
  - Correlation with performance accuracy (Aston-Jones et al.)
  - Modulation of cellular input-output gain (Cohen et al.)
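A short numerical illustration of how β trades exploration against exploitation, using a made-up set of action values; only the softmax rule itself comes from the slide.

```python
import numpy as np

def softmax(q, beta):
    """P(a_i|s) proportional to exp(beta * Q(s, a_i))."""
    p = np.exp(beta * (q - q.max()))
    return p / p.sum()

q = np.array([1.0, 0.8, 0.2])        # arbitrary action values
for beta in (0.1, 1.0, 10.0):
    print(beta, np.round(softmax(q, beta), 3))
# beta = 0.1  -> nearly uniform choice probabilities (exploration)
# beta = 10.0 -> almost always the best action (exploitation)
```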
Discount Factor γ
- V(s(t)) = E[r(t+1) + γ r(t+2) + γ² r(t+3) + …]
- Balance between short- and long-term results
- Serotonin (dorsal raphe)
  - Low activity is associated with impulsivity: depression, bipolar disorder, aggression, eating disorders
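The role of γ can be illustrated with a small comparison between a small early reward and a large delayed one; the reward sizes and delays below are arbitrary examples chosen only to show how the preferred option flips as γ grows, not parameters from the experiments described later.

```python
def discounted(reward, delay_steps, gamma):
    """Present value of a reward delivered after delay_steps."""
    return (gamma ** delay_steps) * reward

for gamma in (0.6, 0.9, 0.99):
    small_soon = discounted(1.0, delay_steps=2, gamma=gamma)    # small, short delay
    large_late = discounted(4.0, delay_steps=20, gamma=gamma)   # large, long delay
    choice = "large/late" if large_late > small_soon else "small/soon"
    print(gamma, round(small_soon, 3), round(large_late, 3), choice)
# A low gamma (impulsive) favours the small immediate reward;
# a high gamma favours the large delayed one.
```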
TD Error δ
- δ(t) = r(t) + γ V(s(t)) - V(s(t-1))
- Global learning signal
  - reward prediction: ΔV(s(t-1)) = α δ(t)
  - reinforcement: ΔQ(s(t-1),a(t-1)) = α δ(t)
- Dopamine (substantia nigra, VTA)
  - Responds to errors in reward prediction
  - Reinforcement of actions; addiction
Neuromodulators for Metalearning (Doya, 2002)
- δ: reward prediction error: dopamine, δ(t) = r(t) + γ V(s(t)) - V(s(t-1))
- γ: discount factor: serotonin
- β: inverse temperature: noradrenaline, P(a_i|s(t)) ∝ exp[β Q(s(t),a_i)]
- α: learning rate: acetylcholine
- Metaparameters are tuned through experience in the environment
- Understand the dynamics of neuromodulators: a computational approach to emotion
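As a compact summary of this mapping, here is a hedged sketch of a single RL agent whose metaparameters are named after the neuromodulators proposed to set them. The `Agent` class and its toy actor-critic updates are illustrative assumptions, not a model from the presentation; only the equations in the comments follow the slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Metaparameters:
    alpha: float = 0.1    # learning rate       (acetylcholine)
    beta: float = 2.0     # inverse temperature (noradrenaline)
    gamma: float = 0.9    # discount factor     (serotonin)

class Agent:
    def __init__(self, n_states, n_actions, meta: Metaparameters):
        self.meta = meta
        self.V = np.zeros(n_states)
        self.Q = np.zeros((n_states, n_actions))
        self.rng = np.random.default_rng(0)

    def act(self, s):
        # P(a_i|s) proportional to exp(beta * Q(s, a_i))
        p = np.exp(self.meta.beta * (self.Q[s] - self.Q[s].max()))
        return self.rng.choice(len(p), p=p / p.sum())

    def learn(self, s, a, r, s_next):
        # delta(t) = r(t) + gamma*V(s(t)) - V(s(t-1)), the dopamine-like signal
        delta = r + self.meta.gamma * self.V[s_next] - self.V[s]
        self.V[s] += self.meta.alpha * delta       # reward prediction update
        self.Q[s, a] += self.meta.alpha * delta    # reinforcement of the chosen action
        return delta
```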
TD Model of the Basal Ganglia (Houk et al. 1995, Montague et al. 1996, Schultz et al. 1997, ...)
- Striosome: state value V(s)
- Matrix: action value Q(s,a)
- DA neurons: TD error δ
- SNr/GPi: action selection, Q(s,a) → a
- Where do NA, ACh, and 5-HT fit in?
Possible Control of the Discount Factor
- Modulation of the TD error
- Selection/weighting of parallel networks: striatal modules V1, V2, V3 with discount factors γ1, γ2, γ3; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))
fMRI Experiment
- Marconi Eclipse 1.5 T scanner at ATR BAIC
- 50 horizontal slices, 3×3×3 mm resolution, TR = 6 s, TE = 55 ms, FOV = 192 mm
- Four repetitions of: no reward (4 scans), SHORT task (15 scans), random reward (4 scans), LONG task (15 scans)
- 20 right-handed subjects with informed consent, approved by the ethics committees of Hiroshima U. and ATR
- Analysis with SPM99
Markov Decision Task (Tanaka et al., 2004)
- State transition and reward functions
- Stimulus and response
Behavioral Results
- All subjects successfully learned the optimal behavior
Data Analysis (SPM99)
- Preprocessing: realignment, coregistration, normalization, and smoothing
- Block-design analysis: 4 boxcar regressors (NO, SHORT, RANDOM, LONG)
- Regressor analysis: 4 boxcar regressors + 1 explanatory variable (V or δ)
- Second-level analysis: 20 subjects, p < 0.001 uncorrected
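A rough sketch of the regression structure described above, written in plain NumPy rather than SPM99: four boxcar regressors following the block lengths on the previous slide plus one mean-centred, model-derived explanatory variable. The voxel time course and the value regressor are random placeholders; this is only meant to show the shape of the design matrix, not to reproduce the actual analysis.

```python
import numpy as np

conditions = ["NO", "SHORT", "RANDOM", "LONG"]
lengths = [4, 15, 4, 15]                       # scans per block, as on the previous slide
n_scans = 4 * sum(lengths)                     # four repetitions of the block cycle

design = np.zeros((n_scans, len(conditions) + 1))
t = 0
for _ in range(4):                             # four repetitions
    for c, length in enumerate(lengths):
        design[t:t + length, c] = 1.0          # boxcar regressor for this block
        t += length

rng = np.random.default_rng(0)
value_regressor = rng.standard_normal(n_scans)               # placeholder for V(t) or delta(t)
design[:, -1] = value_regressor - value_regressor.mean()     # mean-centred explanatory variable

y = rng.standard_normal(n_scans)                             # placeholder voxel time course
betas, *_ = np.linalg.lstsq(design, y, rcond=None)           # ordinary least squares fit
print(betas)                                                 # one coefficient per regressor
```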
Block-Design Analysis
- SHORT vs. NO (p < 0.001 uncorrected): OFC, insula, striatum, cerebellum
- LONG vs. SHORT (p < 0.0001 uncorrected): striatum, dorsal raphe, DLPFC, VLPFC, IPC, PMd
- Different brain areas are involved in immediate and future reward prediction
Ventro-Dorsal Difference
(Figure: lateral PFC, insula, and striatum)
Model-based Regressor Analysis
- Estimate V(t) and δ(t) from subjects' performance data
- Regression analysis of fMRI data
(Diagram: the agent, with value function V(s) and TD error δ(t), interacts with the environment through state s(t), action a(t), and reward r(t), e.g. 20 yen; the model-derived V and δ serve as regressors for the fMRI data)
Predicting Subject's Predictions
- Value function V(s(t)) = E[r(t) + γ r(t+1) + … + γ^k V(s(t+k))], e.g. γ = 0.9, estimated from previous visits to the current state
- TD error δ(t)
(Figure: time courses of V and δ)
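A hedged sketch of how such explanatory variables can be generated: estimate V online from previous visits to each state and record the V(t) and δ(t) time courses for several values of γ (the γ list matches the next slide). The behavioural record here is random placeholder data, and the learning rate is an arbitrary choice, not the estimation procedure actually used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_states = 300, 4
states = rng.integers(0, n_states, size=n_steps)            # placeholder behavioural record
rewards = (rng.random(n_steps) < 0.2).astype(float) * 20.0   # placeholder rewards (e.g. 20 yen)

def value_and_td(states, rewards, gamma, n_states, alpha=0.2):
    """Estimate V(s) online from previous visits; return the V(t) and delta(t) time courses."""
    V = np.zeros(n_states)
    V_t = np.zeros(len(states))
    delta_t = np.zeros(len(states))
    for t in range(len(states) - 1):
        V_t[t] = V[states[t]]
        delta_t[t] = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]   # TD error
        V[states[t]] += alpha * delta_t[t]                                  # value update
    return V_t, delta_t

# One pair of regressors per candidate discount factor
regressors = {g: value_and_td(states, rewards, g, n_states)
              for g in (0.0, 0.3, 0.6, 0.8, 0.9, 0.99)}
```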
Explanatory Variables (subject NS)
- Reward prediction V(t) for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
- Reward prediction error δ(t) for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
(Figure: time courses across trials)
Regression Analysis
(Figure: reward prediction V in mPFC (x = -2 mm) and insula (x = -42 mm); reward prediction error δ in the striatum (z = 2 mm))
Map of Temporal Discounting (Tanaka et al., 2004)
- Markov decision task with delayed rewards
- Regression by values and TD errors with different discount factors
Tryptophan Depletion/Loading
- Tryptophan: precursor of serotonin
- Depletion/loading affects central serotonin levels (e.g. Bjork et al. 2001, Luciana et al. 2001)
- 100 g amino acid drink; experiments 6 hours after ingestion
- Day 1: Tr- (depletion, no tryptophan)
- Day 2: Tr0 (control, 2.3 g of tryptophan)
- Day 3: Tr+ (loading, 10.3 g of tryptophan)
Blood Tryptophan Levels
(Figure; N.D.: not detectable, < 3.9 μg/ml)
Delayed Reward Choice Task
Session parameters (initial black patches and patches removed per step, for the yellow and white targets)
- Sessions 1, 2, 7, 8: 72, 24, 18, 9, 8, 2, 6, 2
- Session 3: 72, 24, 18, 9, 8, 2, 14, 2
- Session 4: 72, 24, 18, 9, 16, 2, 14, 2
- Sessions 5, 6: 72, 24, 18, 9, 16, 2, 6, 2
- Yellow: large reward with long delay; white: small reward with short delay
Choice Behaviors
- Shift of the indifference line was not consistent among the 12 subjects
Modulation of Striatal Response
(Figure: striatal correlation maps for γ = 0.6, 0.7, 0.8, 0.9, 0.99 under the Tr-, Tr0, and Tr+ conditions)
Modulation by Tr Levels
Changes in Correlation Coefficient
- ROI (region of interest) analysis of regression slopes: γ = 0.6 at (28, 0, -4); γ = 0.99 at (16, 2, 28)
- Tr- < Tr+: correlation with V at large γ in the dorsal putamen
- Tr- > Tr+: correlation with V at small γ in the ventral putamen
Summary
- Immediate reward: lateral OFC
- Future reward: parietal cortex, PMd, DLPFC, lateral cerebellum, dorsal raphe
- Ventro-dorsal gradient: insula, striatum
- Serotonergic modulation
Outline
- Introduction
- Cerebellum, basal ganglia, and cortex
- Meta-learning and neuromodulators
- Prediction time scale and serotonin
Collaborators
- Kyoto PUM: Minoru Kimura, Yasumasa Ueda
- Hiroshima U: Shigeto Yamawaki, Yasumasa Okamoto, Go Okada, Kazutaka Ueda, Shuji Asahi, Kazuhiro Shishida
- ATR: Jun Morimoto, Kazuyuki Samejima
- CREST: Nicolas Schweighofer, Genci Capi
- NAIST: Saori Tanaka
- OIST: Eiji Uchibe, Stefan Elfwing