dopamine and prediction error

Presentation transcript:

dopamine and prediction error
[figure: TD error (Vt) alongside dopamine recordings under three conditions: no prediction, reward; prediction, reward; prediction, no reward (Schultz 1997)]
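As an illustration only (not the original model code), a minimal TD(0) simulation reproduces this pattern; the state layout, learning rate, discount, and trial count below are assumptions:

```python
import numpy as np

# Minimal TD(0) sketch of the Schultz-style prediction-error pattern (illustrative;
# T, gamma, alpha and n_trials are assumptions, not taken from the slides).
T, gamma, alpha, n_trials = 10, 1.0, 0.1, 500
V = np.zeros(T + 2)            # V[0]: pre-cue baseline, V[1..T]: cue to reward, V[T+1]: terminal

def trial(V, rewarded=True, learn=True):
    deltas = []
    for t in range(T + 1):     # transitions 0->1 (cue onset) ... T->T+1 (reward time)
        r = 1.0 if (rewarded and t == T) else 0.0
        d = r + gamma * V[t + 1] - V[t]          # TD prediction error
        deltas.append(d)
        if learn and t > 0:                      # baseline stays 0: the cue itself is unpredicted
            V[t] += alpha * d
    return np.array(deltas)

print(trial(np.zeros(T + 2), rewarded=True, learn=False))  # unpredicted reward: positive error at reward time

for _ in range(n_trials):                        # training: cue reliably followed by reward
    trial(V, rewarded=True)

print(trial(V, rewarded=True,  learn=False))     # predicted reward: positive error moves to cue onset, ~0 at reward
print(trial(V, rewarded=False, learn=False))     # omitted reward: negative error at the expected reward time
```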

humans are no different
dorsomedial striatum/PFC: goal-directed control
dorsolateral striatum: habitual control
ventral striatum: Pavlovian control; value signals
dopamine...

in humans…
[figure: trial timeline; stimulus < 1 sec, 0.5 sec, 5 sec ISI, outcome "You won 40 cents", 2-5 sec ITI]
5 stimuli: 40¢, 20¢, 0/40¢, 0¢
19 subjects (dropped 3 non-learners, N=16)
3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single-stimulus, randomly ordered and counterbalanced

what would a prediction error look like (in BOLD)?

prediction errors in NAc
raw BOLD (averaged over all subjects); unbiased anatomical ROI in nucleus accumbens (marked per subject*)
can actually decide between different neuroeconomic models of risk
* thanks to Laura deSouza

Polar Exploration
Peter Dayan, Nathaniel Daw, John O’Doherty, Ray Dolan

Background
Learning from experience; neural: dopamine, basal ganglia; computational: TD learning
What about sampling: gathering experience to learn?
Multiple, competitive decision systems, e.g. PFC vs. basal ganglia, hot/cold (e.g. Ringel, Loewenstein)
Surprising from an optimal control perspective

Exploration vs. exploitation
Classic dilemma in learned decision making: for unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained

Exploration vs. exploitation
[figure: reward over time under exploitation vs. exploration]
Exploitation: choose the action expected to be best; may never discover something better
Exploration: choose an action expected to be worse; if it turns out better, exploit it in the future; if not, go back to the original
The cost of exploring is balanced by the long-term gain if the alternative turns out better (even for risk- or ambiguity-averse subjects)
NB: learning is non-trivial when outcomes are noisy or changing

Bayesian analysis (Gittins 1972)
Tractable dynamic program in a restricted class of problems: the "n-armed bandit"
Solution requires balancing: expected outcome values; uncertainty (need for exploration); horizon/discounting (time to exploit)
Optimal policy: explore systematically; choose the best sum of value plus bonus, where the bonus increases with uncertainty
Intractable in the general setting; various heuristics used in practice
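A schematic way to write such a bonus-based rule (the actual Gittins index comes from a separate optimal-stopping calculation; the bonus weight β and the notation below are assumed for illustration):

```latex
% Pick the arm whose estimated value plus uncertainty bonus is largest;
% \hat{\mu}_a(t) and \hat{\sigma}_a(t) are the current value and uncertainty estimates for arm a,
% and \beta \ge 0 scales the exploration bonus.
a_t = \arg\max_a \left[ \hat{\mu}_a(t) + \beta\,\hat{\sigma}_a(t) \right]
```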

Experiment
How do humans handle the tradeoff?
Computation: which strategies fit behavior? Several popular approximations; the difference is what information influences exploration
Neural substrate: what systems are involved? PFC, high-level control; competitive decision systems (Daw et al. 2005); neuromodulators: dopamine (Kakade & Dayan 2002), norepinephrine (Usher et al. 1999)

Task design
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points ("money"), in the scanner
Trial timeline: trial onset, slots revealed; ~430 ms later, subject makes a choice and the chosen slot spins; ~3000 ms later, outcome: payoff revealed (e.g. "obtained 57 points"); ~1000 ms later, screen cleared and the trial ends

Payoff structure
Noisy, to require integration of data; subjects learn about payoffs only by sampling them
Nonstationary, to encourage ongoing exploration (Gaussian drift with decay)
[figure: payoff trajectories over trials]
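One way such payoffs could be generated is a mean-reverting (decaying) Gaussian random walk; the decay rate, noise levels, and payoff range below are placeholders, since the slide does not give the actual task parameters:

```python
import numpy as np

# Illustrative generative process for drifting payoffs: a Gaussian random walk with decay.
# All parameter values here are assumptions, not the experiment's settings.
rng = np.random.default_rng(0)
n_trials, n_slots = 300, 4
decay, grand_mean = 0.98, 50.0      # latent means revert toward a grand mean
drift_sd, payoff_sd = 3.0, 4.0      # diffusion noise and payoff (observation) noise

means = np.full(n_slots, grand_mean)
payoffs = np.zeros((n_trials, n_slots))
for t in range(n_trials):
    # each slot's latent mean decays toward the grand mean and diffuses a little
    means = decay * means + (1 - decay) * grand_mean + rng.normal(0, drift_sd, n_slots)
    # the payoff actually delivered is the latent mean plus noise, clipped to a point range
    payoffs[t] = np.clip(rng.normal(means, payoff_sd), 0, 100)
```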

Analysis strategy
Behavior: fit an RL model to choices; find the best-fitting parameters; compare different exploration models
Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.); use these as regressors for the fMRI signal
After Sugrue et al.

Behavior
[figures: behavioral data]

Behavior model
1. Estimate payoffs: Kalman filter (error update, like TD; exact inference), tracking per-slot means m_green, m_red, etc. and uncertainties s_green, s_red, etc. (Behrens & volatility)
2. Derive choice probabilities P_green, P_red, etc.; choose randomly according to these
[figure: estimated payoff distribution updated from trial t to t+1]
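A sketch of the per-slot Kalman filter just described: each slot keeps a mean m and variance s², which grows with drift noise between trials and shrinks when that slot's payoff is observed. The class, parameter names, and numbers are assumptions for illustration:

```python
# Per-slot Kalman filter (sketch): error-driven update of a posterior mean and variance.
class SlotEstimate:
    def __init__(self, m0=50.0, s2_0=100.0, drift_var=9.0, obs_var=16.0):
        self.m, self.s2 = m0, s2_0                # posterior mean and variance
        self.drift_var, self.obs_var = drift_var, obs_var

    def predict(self):
        # between trials the true payoff can drift, so uncertainty grows
        self.s2 += self.drift_var

    def update(self, payoff):
        # the Kalman gain acts as a per-slot, uncertainty-dependent learning rate
        k = self.s2 / (self.s2 + self.obs_var)
        self.m += k * (payoff - self.m)           # error update, like TD
        self.s2 *= (1 - k)                        # observing shrinks uncertainty

slots = [SlotEstimate() for _ in range(4)]
for est in slots:
    est.predict()           # every slot drifts each trial...
slots[2].update(57.0)       # ...but only the chosen slot's payoff is observed
```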

Behavior model
Compare rules: how is exploration directed?
Given the estimates m_green, m_red, etc. and s_green, s_red, etc., derive choice probabilities P_green, P_red, etc.:
randomly: "ε-greedy" (dumber)
by value: "softmax"
by value and uncertainty: "uncertainty bonuses" (smarter)
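Sketches of the three rules, mapping the per-slot means m (and, for the bonus rule, uncertainties s) onto choice probabilities; ε, the softmax weight β, and the bonus weight φ are free parameters that would be fit to behavior, with placeholder values here:

```python
import numpy as np

def epsilon_greedy(m, epsilon=0.1):
    # mostly pick the best slot, otherwise explore uniformly at random
    p = np.full(len(m), epsilon / len(m))
    p[np.argmax(m)] += 1 - epsilon
    return p

def softmax(m, beta=0.2):
    # exploration graded by value: nearly-as-good slots are chosen nearly as often
    z = np.exp(beta * (m - np.max(m)))
    return z / z.sum()

def uncertainty_bonus(m, s, beta=0.2, phi=1.0):
    # add a bonus for uncertainty, then choose by softmax over the boosted values
    return softmax(m + phi * s, beta)

m = np.array([55.0, 48.0, 60.0, 40.0])   # example means
s = np.array([ 5.0, 12.0,  4.0, 15.0])   # example uncertainties
print(epsilon_greedy(m), softmax(m), uncertainty_bonus(m, s), sep="\n")
```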

Model comparison
Assess models based on the likelihood of actual choices: the product over subjects and trials of the modeled probability of each choice
Find maximum-likelihood parameters (inference parameters, choice parameters)
Parameters yoked between subjects (… except choice noisiness, to model all heterogeneity)
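A sketch of that likelihood computation: the product of modeled choice probabilities over subjects and trials becomes a sum of negative log probabilities, to be minimized over the parameters. The data layout and the choice_probs function here are hypothetical stand-ins, not the actual analysis code:

```python
import numpy as np

def neg_log_likelihood(params, subjects, choice_probs):
    """-log of the product over subjects and trials of the modeled choice probabilities."""
    nll = 0.0
    for subj in subjects:                         # product over subjects and trials ...
        for trial in subj["trials"]:              # ... becomes a sum of -log probabilities
            p = choice_probs(params, subj, trial) # model's choice probabilities for this trial
            nll -= np.log(p[trial["choice"]])
    return nll                                    # smaller is better; minimize over params
```

In practice this would be minimized with a generic optimizer (e.g. scipy.optimize.minimize), yielding the -log likelihood values compared on the next slide.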

Behavioral results
                                        ε-greedy   softmax   uncertainty bonuses
  -log likelihood (smaller is better)     4208.3    3972.1    3972.1
  # parameters                                19        19        20
Strong evidence for exploration directed by value
No evidence for direction by uncertainty (tried several variations)

Imaging methods
1.5 T Siemens Sonata scanner; sequence optimized for OFC (Deichmann et al. 2003)
2 x 385 volumes; 36 slices; 3 mm thickness; 3.24 s TR
SPM2 random-effects model; regressors generated using the fit model and the trial-by-trial sequence of actual choices/payoffs
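As a generic illustration of how the model-derived quantities become regressors (not the actual SPM2 pipeline or its settings): place each trial's model-estimated prediction error at its outcome time, convolve with a canonical-style HRF, and sample at the TR. The onset times, prediction errors, run length, and HRF form below are assumptions:

```python
import numpy as np
from scipy.stats import gamma

TR, dt, run_length = 3.24, 0.1, 1247.0                  # seconds (run length is made up)
t = np.arange(0, 32, dt)
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0          # simple double-gamma HRF shape
hrf /= hrf.sum()

onsets = np.array([10.0, 25.5, 41.2])                   # hypothetical outcome times
pe = np.array([0.8, -0.3, 0.5])                         # model-derived prediction errors

stick = np.zeros(int(run_length / dt))
stick[(onsets / dt).astype(int)] = pe                   # impulses weighted by the PE
bold_pred = np.convolve(stick, hrf)[:len(stick)]        # predicted BOLD response
regressor = bold_pred[::int(round(TR / dt))]            # rough resampling at the TR
```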

Imaging results
TD error: dopamine targets (dorsal and ventral striatum); replicates previous studies, but weakish; graded payoffs?
[figure: L vStr at x,y,z = 9,12,-9 and dStr at x,y,z = 9,0,18; thresholds p<0.01, p<0.001]

Value-related correlates
Probability (or expected value) of chosen action: vmPFC (x,y,z = -3,45,-18)
Payoff amount: OFC/mOFC (x,y,z = 3,30,-21)
[figures: % signal change vs. probability of chosen action and vs. payoff; thresholds p<0.01, p<0.001]

Exploration
Non-greedy > greedy choices: exploration
Frontopolar cortex; survives whole-brain correction
[figure: left and right frontal pole, x,y,z = -27,48,4 and 27,57,6; thresholds p<0.01, p<0.001]

Timecourses
[figures: BOLD timecourses in frontal pole and IPS]

Checks
Do other factors explain the differential BOLD activity better? Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more
Only explore/exploit is significant (but 5 additional putative explore areas eliminated)
Individual subjects: BOLD differences stronger for better behavioral fit

Frontal poles
"One of the least well understood regions of the human brain"
No cortical connections outside PFC ("PFC for PFC"); rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)
Imaging: high-level control; coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004); mediating cognitive processes (Ramnani & Owen 2004); nothing this computationally specific
Lesions: task switching (Burgess et al. 2000); more generically, perseveration

Interpretation
Cognitive decision to explore overrides habit circuitry? Via parietal? Higher FP response when exploration is chosen most against the odds; explore RTs are longer
Exploration/exploitation are neurally distinct: computationally surprising, and especially bad for uncertainty-bonus schemes (proper exploration requires computational integration); no behavioral evidence for them either
Why softmax? It can misexplore, but deterministic bonus schemes are bad in adversarial/multiagent settings; dynamic temperature control? (norepinephrine; Usher et al.; Doya)

Conclusions
Subjects direct exploration by value but not uncertainty
Cortical regions differentially implicated in exploration, with computational consequences
Integrative approach: computation, behavior, imaging; quantitatively assess and constrain models using raw behavior; infer subjective states using the model and study their neural correlates

Open issues
model-based vs. model-free vs. Pavlovian control
environmental priors vs. naive optimism vs. neophilic compulsion?
environmental priors and generalization
curiosity / 'intrinsic motivation' from expected future reward