2 Prediction, Control and Decisions Kenji Doya doya@irp.oist.jp Initial Research Project, OIST ATR Computational Neuroscience Laboratories CREST, Japan Science and Technology Agency Nara Institute of Science and Technology

3 “Creating the Brain” Traditional neuroscience: external observation (anatomy, recording, lesions). Synthetic approach: think from inside the brain. What is the problem? How can we solve it? Robots as thinking tools.

4 Outline Introduction Cerebellum, basal ganglia, and cortex Meta-learning and neuromodulators Prediction time scale and serotonin

5 Learning to Walk (Doya & Nakano, 1985) Action: cycle of 4 postures Reward: speed sensor output Multiple solutions: creeping, jumping,…

6 Learning to Stand Up (Morimoto & Doya, 2001) Movies: early trials vs. after learning. Reward: height of the head. No desired trajectory.

7 Reinforcement Learning (RL) Framework for learning a state-action mapping (policy) by exploration and reward feedback. Critic: reward prediction. Actor: action selection. Learning signals: external reward r; internal reward δ, the difference from prediction. (Diagram: agent with critic and actor interacting with the environment via state s, action a, and reward r.)

8 Reinforcement Learning Methods Model-free methods: episode-based (parameterize the policy P(a|s; θ)); temporal difference (state value function V(s), (state-)action value function Q(s,a)). Model-based methods: dynamic programming with a forward model P(s'|s,a).

9 Temporal Difference Learning Predict reward with value functions: V(s) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s]; Q(s,a) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a]. Select action: greedy, a = argmax_a Q(s,a); Boltzmann, P(a|s) ∝ exp[β Q(s,a)]. Update prediction by the TD error: δ(t) = r(t) + γV(s(t+1)) - V(s(t)); ΔV(s(t)) = α δ(t); ΔQ(s(t),a(t)) = α δ(t).
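As an illustration, the update rules above can be written in a few lines of tabular code. This is a minimal sketch, not the implementation behind the results in this talk; the state/action indexing and the metaparameter defaults (α = 0.1, γ = 0.9) are assumptions for the example.

```python
import numpy as np

def boltzmann(q_row, beta):
    """Boltzmann (softmax) action selection: P(a|s) proportional to exp(beta * Q(s,a))."""
    p = np.exp(beta * (q_row - q_row.max()))   # subtract the max for numerical stability
    return p / p.sum()

def td_step(V, Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the tables V and Q, as on the slide:
    delta = r + gamma*V(s') - V(s);  V(s) += alpha*delta;  Q(s,a) += alpha*delta."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    Q[s, a] += alpha * delta
    return delta

# Example of choosing an action from the current Q row:
# a = np.random.choice(Q.shape[1], p=boltzmann(Q[s], beta=2.0))
```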

10 Dynamic Programming and RL Dynamic programming: model-based, off-line; solve the Bellman equation V(s) = max_a Σ_s' P(s'|s,a) [r(s,a,s') + γV(s')]. Reinforcement learning: model-free, on-line; learn by the TD error δ(t) = r(t) + γV(s(t+1)) - V(s(t)); ΔV(s(t)) = α δ(t); ΔQ(s(t),a(t)) = α δ(t).
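For comparison, the Bellman equation can be solved off-line by value iteration when the model is known. A minimal sketch, assuming the model is given as dense arrays P(s'|s,a) and r(s,a,s') (not how the slides' examples were computed):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve V(s) = max_a sum_s' P(s'|s,a) * [r(s,a,s') + gamma*V(s')].
    P[a, s, s2] are transition probabilities, R[a, s, s2] the corresponding rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = (P * (R + gamma * V)).sum(axis=2)      # Q[a, s]
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)         # converged values and a greedy policy
        V = V_new
```

Reinforcement learning reaches the same fixed point without access to P and R, by sampling transitions and applying the TD update above.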

11 Discrete vs. Continuous RL (Doya, 2000) (Comparison of the discrete-time and continuous-time formulations.)

12 Questions Computational: How to learn a direct policy P(a|s), value functions V(s) and Q(s,a), and forward models P(s'|s,a)? When to use which method? Biological: Where in the brain? How are they represented and updated? How are they selected and coordinated?

13 Brain Hierarchy Forebrain: cerebral cortex (neocortex; paleocortex: olfactory cortex; archicortex: basal forebrain, hippocampus); basal nuclei (neostriatum: caudate, putamen; paleostriatum: globus pallidus; archistriatum: amygdala); diencephalon (thalamus, hypothalamus). Brain stem and cerebellum: midbrain; hindbrain (pons, cerebellum); medulla. Spinal cord.

14 Just for Motor Control? (Middleton & Strick 1994) Basal ganglia (globus pallidus), prefrontal cortex (area 46), cerebellum (dentate nucleus).

15 Specialization by Learning Algorithms (Doya, 1999) Cerebellum: supervised learning, trained by target/error signals (via the inferior olive, IO). Basal ganglia: reinforcement learning, trained by reward signals (via the substantia nigra, SN). Cerebral cortex: unsupervised learning of its inputs. (Diagram: cortex, basal ganglia, and cerebellum, with outputs routed through the thalamus.)

16 Cerebellum Purkinje cells: ~10^5 parallel fibers, a single climbing fiber, long-term depression. Supervised learning: perceptron hypothesis, internal models.

17 Internal Models in the Cerebellum (Imamizu et al., 2000) Learning to use a ‘rotated’ mouse (panels: early learning vs. after learning).

18 Motor Imagery (Luft et al. 1998) Finger movement Imagery of movement

19 Basal Ganglia Striatum: striosome and matrix, dopamine-dependent plasticity. Dopamine neurons: reward-predictive responses, TD learning.

20 rVrVrVrVrVrV Dopamine Neurons and TD Error  (t) = r(t) +  V(s(t+1)) - V(s(t)) before learning after learning omit reward (Schultz et al. 1997)

21 Reward-Predicting Activities of Striatal Neurons Delayed saccade task (Kawagoe et al., 1998): striatal activity reflects not just the actions but the resulting rewards. (Conditions: target Right/Up/Left/Down crossed with reward for Right/Up/Left/Down/All.)

22 Cerebral Cortex Recurrent connections Hebbian plasticity Unsupervised learning, e.g., PCA, ICA

23 Replicating V1 Receptive Fields (Olshausen & Field, 1996) Infomax and sparseness; Hebbian plasticity and recurrent inhibition.

24 Specialization by Learning? Cerebellum: supervised learning; error signal by climbing fibers; forward model s' = f(s,a) and policy a = g(s). Basal ganglia: reinforcement learning; reward signal by dopamine fibers; value functions V(s) and Q(s,a). Cerebral cortex: unsupervised learning; Hebbian plasticity and recurrent inhibition; representation of state s and action a. But how are they recruited and combined?

25 Multiple Action Selection Schemes Model-free: a = argmax_a Q(s,a). Model-based: a = argmax_a [r + V(f(s,a))], using a forward model f(s,a). Encapsulation: a = g(s).
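A schematic contrast of the three schemes, with placeholder objects (the tabular Q, the value table V, the one-step reward model r and forward model f, and the compiled policy g are all assumptions for illustration):

```python
import numpy as np

def model_free_action(Q, s):
    """Model-free: pick the action with the highest cached value Q(s,a)."""
    return int(np.argmax(Q[s]))

def model_based_action(V, f, r, s, actions):
    """Model-based: simulate one step with the forward model f(s,a) and score each
    action by the immediate reward plus the value of the predicted next state."""
    returns = [r(s, a) + V[f(s, a)] for a in actions]
    return actions[int(np.argmax(returns))]

def encapsulated_action(g, s):
    """Encapsulation: a compiled policy that maps state directly to action."""
    return g(s)
```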

26 Lectures at OCNC 2005 Internal models/Cerebellum Reza Shadmehr Stefan Schaal Mitsuo Kawato Reward/Basal ganglia Andrew G. Barto Bernard Balleine Peter Dayan John O’Doherty Minoru Kimura Wolfram Schultz State coding/Cortex Nathaniel Daw Leo Sugrue Daeyeol Lee Jun Tanji Anitha Pasupathy Masamichi Sakagami

27 Outline Introduction Cerebellum, basal ganglia, and cortex Meta-learning and neuromodulators Prediction time scale and serotonin

28 Reinforcement Learning (RL) Framework for learning a state-action mapping (policy) by exploration and reward feedback. Critic: reward prediction. Actor: action selection. Learning signals: external reward r; internal reward δ, the difference from prediction. (Diagram: agent with critic and actor interacting with the environment via state s, action a, and reward r.)

29 Reinforcement Learning Predict reward with value functions: V(s) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s]; Q(s,a) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a]. Select action: greedy, a = argmax_a Q(s,a); Boltzmann, P(a|s) ∝ exp[β Q(s,a)]. Update prediction by the TD error: δ(t) = r(t) + γV(s(t+1)) - V(s(t)); ΔV(s(t)) = α δ(t); ΔQ(s(t),a(t)) = α δ(t).

30 RL Model of Basal Ganglia (…, Doya 2000) Striatum: value functions V(s) and Q(s,a). Dopamine neurons: TD error δ. SNr/GPi: action selection, Q(s,a) → a.

31 Cyber Rodent Project Robots with the same constraints as biological agents. What is the origin of rewards? What should be learned, and what should be evolved? Self-preservation: capture batteries. Self-reproduction: exchange programs through IR ports.

32 Cyber Rodent: Hardware Camera, range sensor, proximity sensors, gyro, battery latch, two wheels, IR port, speaker, microphones, R/G/B LED.

33 Evolving Robot Colony Survival: catch battery packs. Reproduction: copy ‘genes’ through IR ports.

34 Exploration: Inverse Temperature (panels: focused search vs. wide exploration).

35 Discounting Future Reward (panels: large γ vs. small γ).

36 Setting of Reward Function Reward r = r_main + r_supp - r_cost, e.g., a supplementary reward for vision of a battery.
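For instance, a composite reward of this form could be coded as below; the components, weights, and sensor names are hypothetical placeholders, not the Cyber Rodent's actual reward function:

```python
def reward(captured_battery, sees_battery, wheel_speed, w_supp=0.1, w_cost=0.01):
    """r = r_main + r_supp - r_cost: main reward for capturing a battery,
    a small shaping reward for having a battery in view, and a motor-energy cost."""
    r_main = 1.0 if captured_battery else 0.0
    r_supp = w_supp if sees_battery else 0.0
    r_cost = w_cost * abs(wheel_speed)
    return r_main + r_supp - r_cost
```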

37 Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003) Fluctuations in the metaparameters (α, β, γ) correlate with the average reward.

38 Randomness Control by Battery Level Inverse temperature β as a function of battery level: greedier action at both extremes.

39 Neuromodulators for Metalearning (Doya, 2002) Metaparameter tuning is critical in RL. How does the brain tune them? Dopamine: TD error δ. Acetylcholine: learning rate α. Noradrenaline: inverse temperature β. Serotonin: discount factor γ.

40 Learning Rate α ΔV(s(t-1)) = α δ(t); ΔQ(s(t-1),a(t-1)) = α δ(t). Small α → slow learning; large α → unstable learning. Acetylcholine (basal forebrain): regulates memory update and retention (Hasselmo et al.); LTP in cortex and hippocampus; top-down vs. bottom-up information flow.
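A toy example of the two failure modes (the constant-reward setting and the α values are arbitrary; with noisy rewards, a large α also makes the estimate track the noise):

```python
import numpy as np

def learn_constant_value(alpha, r=1.0, n_steps=50):
    """Track a constant target r with the update V <- V + alpha*(r - V)."""
    V, trace = 0.0, []
    for _ in range(n_steps):
        V += alpha * (r - V)
        trace.append(V)
    return np.array(trace)

learn_constant_value(alpha=0.05)[-1]   # ~0.92 after 50 steps: slow but smooth learning
learn_constant_value(alpha=1.9)        # overshoots and oscillates around r (diverges for alpha > 2)
```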

41 Inverse Temperature β Greediness in action selection: P(a_i|s) ∝ exp[β Q(s,a_i)]. Small β → exploration; large β → exploitation. Noradrenaline (locus coeruleus): correlation with performance accuracy (Aston-Jones et al.); modulation of cellular I/O gain (Cohen et al.).
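Concretely, the same Q values give nearly uniform choice probabilities at small β and nearly greedy ones at large β (the Q values below are arbitrary examples):

```python
import numpy as np

def softmax_policy(q_row, beta):
    """P(a|s) proportional to exp(beta * Q(s,a))."""
    p = np.exp(beta * (q_row - q_row.max()))
    return p / p.sum()

q_row = np.array([1.0, 0.8, 0.2])
softmax_policy(q_row, beta=0.1)    # ~[0.34, 0.34, 0.32]: near-uniform, exploratory
softmax_policy(q_row, beta=10.0)   # ~[0.88, 0.12, 0.00]: near-greedy, exploitative
```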

42 Discount Factor γ V(s(t)) = E[r(t+1) + γr(t+2) + γ²r(t+3) + …]: balance between short- and long-term results. Serotonin (dorsal raphe): low activity associated with impulsivity, depression, bipolar disorders, aggression, eating disorders.
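The short/long trade-off can be made concrete with a toy calculation (reward sizes and delays are arbitrary): a low γ favors a small early reward over a larger delayed one, while a high γ favors waiting.

```python
def discounted_value(reward, delay, gamma):
    """Present value of a single reward delivered `delay` steps in the future."""
    return (gamma ** delay) * reward

discounted_value(2.0, delay=1, gamma=0.6)    # 1.2: the small-soon option
discounted_value(8.0, delay=6, gamma=0.6)    # ~0.37: an impulsive agent prefers the small-soon reward
discounted_value(8.0, delay=6, gamma=0.95)   # ~5.9: a patient agent waits for the large-late reward
```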

43 TD Error δ δ(t) = r(t) + γV(s(t)) - V(s(t-1)). Global learning signal: reward prediction, ΔV(s(t-1)) = α δ(t); reinforcement, ΔQ(s(t-1),a(t-1)) = α δ(t). Dopamine (substantia nigra, VTA): responds to errors in reward prediction; reinforcement of actions; addiction.

44 Neuromodulators for Metalearning (Doya, 2002) δ: reward prediction error (dopamine), δ(t) = r(t) + γV(s(t)) - V(s(t-1)). γ: discount factor (serotonin). β: inverse temperature (noradrenaline), P(a_i|s(t)) ∝ exp[β Q(s(t), a_i)]. α: learning rate (acetylcholine). Goal: understand the dynamics of neuromodulators; a computational approach to emotion.

45 TD Model of Basal Ganglia (Houk et al. 1995, Montague et al. 1996, Schultz et al. 1997, ...) Striosome: state value V(s). Matrix: action value Q(s,a). DA neurons: TD error δ. SNr/GPi: action selection, Q(s,a) → a. Roles of NA? ACh? 5-HT?

46 Possible Control of Discount Factor Modulation of the TD error, or selection/weighting of parallel networks: striatal value modules V_1, V_2, V_3 with discount factors γ_1, γ_2, γ_3, read out by dopamine neurons computing δ(t) from V(s(t)) and V(s(t+1)).
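One way to read the selection/weighting scheme is a bank of parallel value modules, each learned with its own γ_i, whose outputs are mixed by weights that a modulator such as serotonin could shift toward shorter or longer horizons. This is a speculative sketch of that idea, not the model in the slide:

```python
import numpy as np

GAMMAS = np.array([0.6, 0.9, 0.99])            # gamma_1..gamma_3 of the parallel modules

def module_td_errors(V_modules, s, r, s_next):
    """Each module computes its own TD error with its own discount factor."""
    return [r + g * V[s_next] - V[s] for g, V in zip(GAMMAS, V_modules)]

def combined_value(V_modules, s, weights):
    """Weighted mixture of the parallel value functions; shifting `weights` toward the
    high-gamma modules corresponds to valuing delayed rewards more strongly."""
    return sum(w * V[s] for w, V in zip(weights, V_modules))
```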


48 fMRI Experiment Marconi Eclipse 1.5T scanner at ATR BAIC; 50 horizontal slices, 3x3x3 mm resolution; TR = 6 s, TE = 55 ms, FOV = 192 mm. Four repetitions of: no reward (4 scans) → SHORT task (15 scans) → random reward (4 scans) → LONG task (15 scans). 20 right-handed subjects with informed consent, approved by the ethics committees of Hiroshima U. and ATR. Analysis by SPM99.

49 Markov Decision Task (Tanaka et al., 2004) State transition and reward functions Stimulus and response

50 Behavioral Results All subjects successfully learned the optimal behavior.

51 Data Analysis SPM99 preprocessing: realignment, coregistration, normalization, and smoothing. Block-design analysis: 4 boxcar regressors (NO, SHORT, RANDOM, LONG). Regressor analysis: 4 boxcar regressors + 1 explanatory variable (V or δ). Second-level analysis: 20 subjects, p < 0.001 uncorrected.

52 Block-Design Analysis SHORT vs. NO (p < 0.001 uncorrected): OFC, insula, striatum. LONG vs. SHORT (p < 0.0001 uncorrected): cerebellum, striatum, dorsal raphe, DLPFC, VLPFC, IPC, PMd. Different brain areas are involved in immediate and future reward prediction.

53 Ventro-Dorsal Difference Lateral PFC, insula, striatum.

54 Model-based Regressor Analysis Estimate V(t) and δ(t) from the subjects' performance data, then use them as regressors in the analysis of the fMRI data. (Diagram: the agent's policy and value function V(s) interact with the environment via state s(t), action a(t), and reward r(t); the model-derived V(t) and TD error δ(t) are regressed against the fMRI data.)

55 Predicting Subject's Predictions Value function V(s(t)) = E[r(t) + γr(t+1) + ... + γ^k V(s(t+k))], e.g. γ = 0.9, estimated from previous visits to the current state; TD error δ; time courses of V and δ.

56 Explanatory Variables (subject NS) Reward prediction V(t) and reward prediction error δ(t), each computed for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99, plotted over trials.
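In outline, such regressors can be generated by running an online value-learning model over each subject's state and reward sequence once per candidate γ. The sketch below uses a simple TD(0) estimate of V from previous visits to each state; the exact estimation procedure of the study may differ, and the data arrays are placeholders:

```python
import numpy as np

def make_regressors(states, rewards, n_states,
                    gammas=(0, 0.3, 0.6, 0.8, 0.9, 0.99), alpha=0.1):
    """For each gamma, learn V(s) online over the subject's trial sequence and record
    the time courses of V(s(t)) and the TD error delta(t) as candidate fMRI regressors."""
    regressors = {}
    for gamma in gammas:
        V = np.zeros(n_states)
        v_trace, d_trace = [], []
        for t in range(len(rewards) - 1):            # states and rewards indexed per trial step
            delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
            v_trace.append(V[states[t]])
            d_trace.append(delta)
            V[states[t]] += alpha * delta
        regressors[gamma] = (np.array(v_trace), np.array(d_trace))
    return regressors
```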

57 Regression Analysis Reward prediction V: mPFC and insula (slices at x = -2 mm and x = -42 mm). Reward prediction error δ: striatum (slice at z = 2 mm).

58 Map of Temporal Discounting (Tanaka et al., 2004) Markov decision task with delayed rewards; regression by values and TD errors with different discount factors γ.

59 Tryptophan Depletion/Loading Tryptophan is the precursor of serotonin; its depletion/loading affects central serotonin levels (e.g. Bjork et al. 2001, Luciana et al. 2001). 100 g amino acid drink; experiments 6 hours later. Day 1: Tr- (depletion, no tryptophan). Day 2: Tr0 (control, 2.3 g of tryptophan). Day 3: Tr+ (loading, 10.3 g of tryptophan).

60 Blood Tryptophan Levels N.D. (< 3.9 μg/ml) under depletion.

61 Delayed Reward Choice Task

62 Task parameters by session (yellow: large reward with long delay; white: small reward with short delay)
Session(s)   Initial black patches (yellow / white)   Patches per step (yellow / white)
1, 2, 7, 8   72 ± 24 / 18 ± 9                         8 ± 2 / 6 ± 2
3            72 ± 24 / 18 ± 9                         8 ± 2 / 14 ± 2
4            72 ± 24 / 18 ± 9                         16 ± 2 / 14 ± 2
5, 6         72 ± 24 / 18 ± 9                         16 ± 2 / 6 ± 2

63 Choice Behaviors The shift of the indifference line was not consistent across the 12 subjects.

64 Modulation of Striatal Response Correlation with V for discount factors γ = 0.6, 0.7, 0.8, 0.9, 0.99, compared across the Tr-, Tr0, and Tr+ conditions.

65 Modulation by Tr Levels

66 Changes in Correlation Coefficient (regression slopes, ROI analysis) Tr- < Tr+: correlation with V at large γ (γ = 0.99; ROI at 16, 2, 28) in the dorsal putamen. Tr- > Tr+: correlation with V at small γ (γ = 0.6; ROI at 28, 0, -4) in the ventral putamen.

67 Summary Immediate reward: lateral OFC. Future reward: parietal cortex, PMd, DLPFC, lateral cerebellum, dorsal raphe. Ventro-dorsal gradient: insula, striatum. Serotonergic modulation.

68 Outline Introduction Cerebellum, basal ganglia, and cortex Meta-learning and neuromodulators Prediction time scale and serotonin

69 Collaborators Kyoto PUM Minoru Kimura Yasumasa Ueda Hiroshima U Shigeto Yamawaki Yasumasa Okamoto Go Okada Kazutaka Ueda Shuji Asahi Kazuhiro Shishida ATR Jun Morimoto Kazuyuki Samejima CREST Nicolas Schweighofer Genci Capi NAIST Saori Tanaka OIST Eiji Uchibe Stefan Elfwing

