Presentation transcript: "dopamine and prediction error"

1 dopamine and prediction error
[Figure: dopamine TD error under three conditions: no prediction (reward delivered); prediction, reward; prediction, no reward (Schultz 1997)]
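To make the "TD error" label concrete, here is a minimal sketch (not taken from the slides) of a tabular TD(0) learner reproducing the three conditions: once a cue has come to predict reward, a delivered reward produces roughly zero error, an omitted reward a negative error, and an unpredicted reward a positive error. All values and parameters are illustrative.

```python
# Minimal sketch (illustrative values): tabular TD(0) on a cue -> reward episode,
# showing the three Schultz-style conditions via the TD error
# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

def td_episode(V, states, rewards, alpha=0.1, gamma=1.0):
    """Run one episode; return the TD error at each step and update V in place."""
    deltas = []
    for t, s in enumerate(states):
        next_value = V[states[t + 1]] if t + 1 < len(states) else 0.0
        delta = rewards[t] + gamma * next_value - V[s]
        V[s] += alpha * delta
        deltas.append(round(delta, 2))
    return deltas

V = {"cue": 0.0, "delay": 0.0}
for _ in range(200):                       # train: cue reliably followed by reward
    td_episode(V, ["cue", "delay"], [0.0, 1.0])

print("prediction, reward:   ", td_episode(V, ["cue", "delay"], [0.0, 1.0]))   # ~[0, 0]
print("prediction, no reward:", td_episode(V, ["cue", "delay"], [0.0, 0.0]))   # ~[0, -1]
print("no prediction, reward:", td_episode({"delay": 0.0}, ["delay"], [1.0]))  # [1.0]
```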

2 humans are no different
dorsomedial striatum / PFC: goal-directed control
dorsolateral striatum: habitual control
ventral striatum: Pavlovian control; value signals
dopamine...

3 in humans…
5 stimuli: 40¢, 20¢, 0/40¢, …
[Figure: trial timeline; stimulus < 1 sec; 0.5 sec; "You won 40 cents" outcome; 5 sec ISI; 2-5 sec ITI]
19 subjects (3 non-learners dropped, N = 16); 3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single-stimulus, randomly ordered and counterbalanced

4 what would a prediction error look like (in BOLD)?

5 prediction errors in NAC
raw BOLD (averaged over all subjects) in an unbiased anatomical ROI in the nucleus accumbens (marked per subject*)
can actually decide between different neuroeconomic models of risk
* thanks to Laura deSouza

6 Polar Exploration
Peter Dayan, Nathaniel Daw, John O’Doherty, Ray Dolan

7 Background Learning from experience
Neural: dopamine, basal ganglia; computational: TD learning
What about sampling: gathering the experience to learn from?
Multiple, competitive decision systems, e.g. PFC vs. basal ganglia, hot/cold (e.g. Ringel, Loewenstein)
Surprising from an optimal control perspective
[Figure: striatum and PFC]

8 Exploration vs. exploitation
Classic dilemma in learned decision making: for unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained?

9-14 Exploration vs. exploitation
[Figure: reward obtained over time under the two strategies]
Exploitation: choose the action expected to be best; you may never discover something better
Exploration: choose an action expected to be worse
If it is worse, go back to the original; if it is better, exploit it in the future
The cost of exploring is balanced by the long-term gain if the explored option turns out better (even for risk- or ambiguity-averse subjects)
NB: learning is non-trivial when outcomes are noisy or changing

15 Bayesian analysis (Gittins 1972)
Tractable dynamic program in a restricted class of problems: the "n-armed bandit"
Solution requires balancing: expected outcome values; uncertainty (need for exploration); horizon/discounting (time to exploit)
Optimal policy: explore systematically; choose the best sum of value plus bonus; the bonus increases with uncertainty
Intractable in the general setting; various heuristics are used in practice (one is sketched below)
[Figure: value by action]
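As a hedged illustration (not the Gittins dynamic program itself), one common heuristic with the same structure is to choose the action with the best value-plus-uncertainty-bonus score; the bonus weight below is a made-up parameter.

```python
# Sketch of an "uncertainty bonus" choice rule: pick the arm maximizing
# estimated value plus a bonus proportional to its uncertainty. This only
# approximates the Gittins-index policy; the bonus weight is illustrative.
import numpy as np

def choose_with_bonus(means, stds, bonus_weight=1.0):
    """Index of the arm with the highest mean + bonus_weight * std."""
    scores = np.asarray(means) + bonus_weight * np.asarray(stds)
    return int(np.argmax(scores))

# Arm 1 has the best mean, but arm 2 is uncertain enough to be worth exploring.
print(choose_with_bonus(means=[3.0, 5.0, 4.0], stds=[0.5, 0.2, 2.0]))  # -> 2
```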

16 Experiment
How do humans handle the tradeoff?
Computation: which strategies fit behavior? Several popular approximations; they differ in what information influences exploration
Neural substrate: what systems are involved? PFC (high-level control); competitive decision systems (Daw et al. 2005); neuromodulators: dopamine (Kakade & Dayan 2002), norepinephrine (Usher et al. 1999)

17-20 Task design
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points ("money"), in the scanner
Trial onset: slots revealed
+ ~430 ms: subject makes choice; the chosen slot spins
+ ~3000 ms: outcome: payoff revealed ("obtained 57 points")
+ ~1000 ms: screen cleared; trial ends

21-24 Payoff structure
Noisy, to require integration of data: subjects learn about payoffs only by sampling them
Nonstationary, to encourage ongoing exploration (Gaussian drift with decay)
[Figure: example payoff trajectories for the four slots]
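A small simulation of the kind of payoff process described: each slot's underlying mean follows a Gaussian random walk with decay toward a central value, and observed payoffs add further noise. The slides do not give the actual parameters, so the numbers below are placeholders.

```python
# Sketch of the drifting payoff structure: each slot's mean decays toward a
# central value and diffuses with Gaussian noise; observed payoffs add further
# noise. Parameter values are illustrative placeholders, not the ones used.
import numpy as np

rng = np.random.default_rng(0)
n_slots, n_trials = 4, 300
decay, center = 0.98, 50.0          # pull each mean back toward 50 points
drift_sd, payoff_sd = 3.0, 4.0      # diffusion noise and observation noise

means = np.full(n_slots, center)
mean_history = np.empty((n_trials, n_slots))
for t in range(n_trials):
    means = decay * means + (1 - decay) * center + rng.normal(0, drift_sd, n_slots)
    mean_history[t] = means

def observed_payoff(slot, t):
    """Noisy payoff seen when `slot` is chosen on trial t."""
    return mean_history[t, slot] + rng.normal(0, payoff_sd)

print(round(observed_payoff(0, 0), 1), round(observed_payoff(2, 150), 1))
```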

25 Analysis strategy
Behavior: fit an RL model to choices; find the best-fitting parameters; compare different exploration models
Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.) and use these as regressors for the fMRI signal
(After Sugrue et al.)
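A bare-bones sketch of the imaging step: a model-derived trial-by-trial quantity (e.g., a prediction error) is placed at event onsets and convolved with a haemodynamic response function before entering the GLM. The double-gamma HRF and all numbers here are generic stand-ins, not the SPM2 pipeline actually used.

```python
# Sketch: turn a model-derived trial-by-trial signal into a parametric fMRI
# regressor by convolving onset-locked values with a canonical-style HRF,
# then resampling at scan times. The HRF here is a generic stand-in.
import numpy as np
from scipy.stats import gamma

def hrf(t):
    """Generic double-gamma haemodynamic response (arbitrary units)."""
    return gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)

def build_regressor(onsets_s, values, n_scans, tr, dt=0.1):
    """Place model-derived values at event onsets, convolve with the HRF,
    and resample at scan acquisition times."""
    t_hi = np.arange(0, n_scans * tr, dt)
    stick = np.zeros_like(t_hi)
    for onset, value in zip(onsets_s, values):
        stick[int(round(onset / dt))] += value
    conv = np.convolve(stick, hrf(np.arange(0, 30, dt)))[: len(t_hi)]
    return np.interp(np.arange(n_scans) * tr, t_hi, conv)

# e.g. prediction errors from the fitted model at outcome onsets (made-up numbers):
regressor = build_regressor(onsets_s=[4.2, 12.9, 21.5], values=[1.3, -0.7, 0.4],
                            n_scans=20, tr=3.24)
print(regressor.shape)  # (20,)
```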

26-30 Behavior
[figures]

31-37 Behavior model
1. Estimate payoffs: means (mgreen, mred, etc.) and uncertainties (sgreen, sred, etc.)
2. Derive choice probabilities (Pgreen, Pred, etc.) and choose randomly according to these
Payoff estimation: Kalman filter; an error update (like TD), but exact inference, tracking both a mean and an uncertainty per slot
[Figure: a slot's estimated payoff distribution carried from trial t to t+1 and updated as payoffs are observed]
(cf. Behrens et al. on volatility)
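A hedged sketch of the per-slot Kalman filter just described: the update is error-driven like TD, but the learning rate is a Kalman gain derived from the tracked uncertainty, and between trials the means decay and the uncertainties grow to reflect the drifting payoffs. The class name and parameter values are placeholders, not the fitted ones.

```python
# Sketch of the Kalman-filter payoff tracker: each slot keeps a posterior
# mean and variance; observing a payoff updates the chosen slot with a gain
# set by the relative uncertainty, and all slots drift between trials.
import numpy as np

class KalmanBandit:
    def __init__(self, n_slots=4, decay=0.98, center=50.0,
                 drift_var=9.0, payoff_var=16.0):
        self.decay, self.center = decay, center
        self.drift_var, self.payoff_var = drift_var, payoff_var
        self.mean = np.full(n_slots, center)     # m_green, m_red, ...
        self.var = np.full(n_slots, 100.0)       # s_green**2, s_red**2, ...

    def predict(self):
        """Between trials: means decay toward the center, uncertainty grows."""
        self.mean = self.decay * self.mean + (1 - self.decay) * self.center
        self.var = self.decay ** 2 * self.var + self.drift_var

    def update(self, slot, payoff):
        """Error-driven update of the chosen slot; the Kalman gain is the learning rate."""
        gain = self.var[slot] / (self.var[slot] + self.payoff_var)
        self.mean[slot] += gain * (payoff - self.mean[slot])
        self.var[slot] *= (1 - gain)

model = KalmanBandit()
model.predict()
model.update(slot=1, payoff=62.0)
print(model.mean.round(1), model.var.round(1))
```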

38-43 Behavior model
The estimated means (mgreen, mred, etc.) and uncertainties (sgreen, sred, etc.) feed the choice rule
Compare rules: how is exploration directed?
Randomly: "e-greedy" (dumber)
By value: "softmax"
By value and uncertainty: "uncertainty bonuses" (smarter)
[Figure: choice probability as a function of action value under each rule]
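Hedged sketches of the three candidate choice rules, written over the estimated means and uncertainties; the epsilon, temperature, and bonus weight would be free parameters fit to behavior.

```python
# Sketch of the three exploration rules compared in the behavioral model:
# epsilon-greedy (undirected exploration), softmax (value-directed), and
# softmax with an uncertainty bonus. Parameter values are illustrative.
import numpy as np

def epsilon_greedy(means, epsilon=0.1):
    """Mostly pick the best arm; explore uniformly with probability epsilon."""
    p = np.full(len(means), epsilon / len(means))
    p[np.argmax(means)] += 1 - epsilon
    return p

def softmax(means, temperature=1.0):
    """Choice probabilities graded by estimated value."""
    z = np.asarray(means) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_uncertainty_bonus(means, stds, temperature=1.0, bonus=1.0):
    """Softmax over value plus an uncertainty bonus."""
    return softmax(np.asarray(means) + bonus * np.asarray(stds), temperature)

means, stds = [48.0, 55.0, 50.0, 40.0], [2.0, 1.0, 8.0, 3.0]
print(epsilon_greedy(means).round(3))
print(softmax(means, temperature=3.0).round(3))
print(softmax_uncertainty_bonus(means, stds, temperature=3.0).round(3))
```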

44 Model comparison Assess models based on likelihood of actual choices
Product over subjects and trials of the modeled probability of each choice
Find maximum-likelihood parameters: inference parameters, choice parameters
Parameters yoked between subjects (except choice noisiness, to model all heterogeneity)
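A minimal sketch of the fitting step, assuming a single free softmax temperature: minimize the summed negative log-probability of the choices actually made. The data and parameter bounds are illustrative.

```python
# Sketch of model fitting by maximum likelihood: minimize the negative
# log-likelihood of the observed choices. One free parameter (softmax
# temperature); value estimates, choices, and bounds are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(values, temperature):
    z = np.asarray(values) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def neg_log_likelihood(temperature, value_estimates, choices):
    """Sum over trials of -log P(choice actually made); smaller is better."""
    return -sum(np.log(softmax(v, temperature)[c])
                for v, c in zip(value_estimates, choices))

# Per-trial value estimates (e.g. from the Kalman filter) and the choices made:
value_estimates = [[48, 55, 50, 40], [49, 54, 52, 41], [50, 53, 55, 42]]
choices = [1, 2, 2]                   # the second choice is non-greedy

fit = minimize_scalar(neg_log_likelihood, bounds=(0.1, 50.0), method="bounded",
                      args=(value_estimates, choices))
print(round(fit.x, 2), round(fit.fun, 2))   # fitted temperature, -log likelihood
```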

45-46 Behavioral results
                                        e-greedy   softmax   uncertainty bonuses
  -log likelihood (smaller is better)     4208.3    3972.1                3972.1
  # parameters                                19        19                    20
Strong evidence for exploration directed by value
No evidence for direction by uncertainty (tried several variations)

47 Imaging methods 1.5 T Siemens Sonata scanner
Sequence optimized for OFC (Deichmann et al. 2003); 2 x 385 volumes; 36 slices; 3 mm slice thickness; 3.24 s TR
SPM2 random-effects model
Regressors generated using the fitted model and the trial-by-trial sequence of actual choices/payoffs

48 Imaging results
TD error correlates at dopamine targets (dorsal and ventral striatum): replicates previous studies, but weakish. Graded payoffs?
[Figure: L vStr (x,y,z = 9,12,-9) and dStr (x,y,z = 9,0,18), shown at p < 0.01 and p < 0.001]

49 Value-related correlates
Probability (or expected value) of the chosen action: vmPFC (x,y,z = -3,45,-18)
Payoff amount: medial OFC (x,y,z = 3,30,-21)
[Figure: L vmPFC and mOFC maps at p < 0.01 and p < 0.001, with % signal change plotted against choice probability and payoff]

50 Exploration Non-greedy > greedy choices: exploration
Frontopolar cortex, bilaterally; survives whole-brain correction
[Figure: L and R frontal pole (x,y,z = -27,48,4; 27,57,6) at p < 0.01 and p < 0.001]

51 Timecourses
[Figure: BOLD timecourses in the frontal pole and IPS]

52 Checks Do other factors explain differential BOLD activity better?
Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more: only explore/exploit is significant (though 5 additional putative explore areas are eliminated)
Individual subjects: BOLD differences are stronger for subjects with a better behavioral fit

53 Frontal poles
"One of the least well understood regions of the human brain"
No cortical connections outside PFC ("PFC for PFC"); rostrocaudal hierarchy in PFC (Christoff & Gabrielli 2000; Koechlin et al. 2003)
Imaging: high-level control; coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004); mediating cognitive processes (Ramnani & Owen 2004); nothing this computationally specific
Lesions: task switching (Burgess et al. 2000); more generically, perseveration

54 Interpretation
Cognitive decision to explore overrides habit circuitry? Via parietal?
Higher FP response when exploration is chosen most against the odds; explore RTs are longer
Exploration and exploitation are neurally distinct: computationally surprising, and especially bad for uncertainty-bonus schemes (proper exploration requires computational integration; no behavioral evidence for it either)
Why softmax? It can misexplore, but deterministic bonus schemes are bad in adversarial/multi-agent settings
Dynamic temperature control? (norepinephrine; Usher et al.; Doya)

55 Conclusions Subjects direct exploration by value but not uncertainty
Cortical regions differentially implicated in exploration: computational consequences
Integrative approach: computation, behavior, imaging
Quantitatively assess and constrain models using raw behavior
Infer subjective states using the model, and study their neural correlates

56 Open Issues model-based vs model-free vs Pavlovian control
environmental priors vs. naive optimism vs. neophilic compulsion?
environmental priors and generalization
curiosity / 'intrinsic motivation' from expected future reward

