Presentation on theme: "dopamine and prediction error (TD error; Schultz et al. 1997)" — Presentation transcript:

1 dopamine and prediction error
(figure, after Schultz et al. 1997: dopamine firing under "no prediction", "prediction, reward", and "prediction, no reward"; labels: TD error, V_t, R)
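A minimal sketch in Python of the temporal-difference error behind the Schultz et al. (1997) account; the discount factor, learning rate, and state names are illustrative, not taken from the slide.

    # Temporal-difference prediction error and value update (illustrative sketch).
    def td_update(V, s, s_next, reward, alpha=0.1, gamma=0.95):
        delta = reward + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # TD error
        V[s] = V.get(s, 0.0) + alpha * delta                         # error-driven update of V_t
        return delta

    V = {}
    # First encounter with a rewarded cue: large positive prediction error.
    print(td_update(V, s="cue", s_next="reward", reward=1.0))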

2 humans are no different
– dorsomedial striatum / PFC: goal-directed control
– dorsolateral striatum: habitual control
– ventral striatum: Pavlovian control; value signals
– dopamine ...

3 in humans…
(task timeline: stimulus < 1 sec; 0.5 sec; "You won 40 cents"; 5 sec ISI; 2-5 sec ITI)
19 subjects (dropped 3 non-learners, N=16)
3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single stimulus, randomly ordered and counterbalanced
5 stimuli: 40¢, 20¢, 0/40¢, 0¢

4 what would a prediction error look like (in BOLD)?

5 prediction errors in NAc
unbiased anatomical ROI in nucleus accumbens (marked per subject*)
* thanks to Laura deSouza
raw BOLD (averaged over all subjects)
can actually decide between different neuroeconomic models of risk

6 Polar Exploration
Peter Dayan, Nathaniel Daw, John O'Doherty, Ray Dolan

7 Background
Learning from experience
– Neural: dopamine, basal ganglia
– Computational: TD learning
What about sampling: gathering experience to learn?
Multiple, competitive decision systems
– e.g. PFC vs. basal ganglia
– hot/cold (e.g. Ringel, Loewenstein)
– surprising from an optimal control perspective
(figure: PFC, striatum)

8 Exploration vs. exploitation
Classic dilemma in learned decision making
For unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained?

9-14 Exploration vs. exploitation (built up over several slides)
Exploitation
– Choose the action expected to be best
– May never discover something better
Exploration
– Choose an action expected to be worse
– If it is worse, go back to the original; if it turns out better, exploit it in the future
– Balanced by the long-term gain if it turns out better
– (Even for risk- or ambiguity-averse subjects)
– NB: learning is non-trivial when outcomes are noisy or changing
(figure: reward vs. time)

15 Bayesian analysis (Gittins 1972)
Tractable dynamic program in a restricted class of problems
– "n-armed bandit"
Solution requires balancing
– expected outcome values
– uncertainty (need for exploration)
– horizon/discounting (time to exploit)
Optimal policy: explore systematically
– choose the best sum of value plus bonus
– bonus increases with uncertainty
Intractable in the general setting
– various heuristics used in practice
(figure: value per action)
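A minimal Python sketch of the uncertainty-bonus idea on the slide: score each arm by its estimated value plus a bonus that grows with its uncertainty, then take the best score. The bonus weight is an illustrative free parameter, not the Gittins index itself.

    import numpy as np

    def uncertainty_bonus_choice(means, sds, bonus_weight=1.0):
        """Pick the arm maximizing estimated value plus an uncertainty bonus."""
        scores = means + bonus_weight * sds
        return int(np.argmax(scores))

    # Arm 1 has a slightly lower mean but is much more uncertain, so it is explored.
    print(uncertainty_bonus_choice(np.array([5.0, 4.5]), np.array([0.5, 3.0])))  # -> 1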

16 Experiment
How do humans handle the tradeoff?
Computation: which strategies fit behavior?
– several popular approximations
– difference: what information influences exploration?
Neural substrate: what systems are involved?
– PFC, high-level control
– competitive decision systems (Daw et al. 2005)
– neuromodulators: dopamine (Kakade & Dayan 2002), norepinephrine (Usher et al. 1999)

17-20 Task design (trial timeline, built up over several slides)
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points ("money"), in the scanner
Trial onset: slots revealed
+ ~430 ms: subject makes a choice; the chosen slot spins
+ ~3000 ms: outcome; payoff revealed ("obtained 57 points")
+ ~1000 ms: screen cleared; trial ends

21-24 Payoff structure
Noisy, to require integration of data
– subjects learn about payoffs only by sampling them
Nonstationary, to encourage ongoing exploration (Gaussian drift with decay)
(figure: payoff timecourses)
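A minimal Python sketch of the kind of drifting payoff process described (Gaussian drift with decay toward a grand mean); all parameter values here are illustrative, not the ones used in the experiment.

    import numpy as np

    rng = np.random.default_rng(0)
    n_arms, n_trials = 4, 300
    decay, grand_mean = 0.98, 50.0          # illustrative
    drift_sd, payoff_sd = 3.0, 4.0          # illustrative

    mu = np.full(n_arms, grand_mean)        # each arm's (hidden) mean payoff
    means = np.zeros((n_trials, n_arms))
    for t in range(n_trials):
        # decay toward the grand mean plus Gaussian drift, so payoffs keep changing
        mu = decay * mu + (1 - decay) * grand_mean + rng.normal(0.0, drift_sd, n_arms)
        means[t] = mu

    # the payoff a subject actually sees is a noisy sample around the chosen arm's mean
    observed = rng.normal(means[10, 2], payoff_sd)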

25 Analysis strategy
Behavior: fit an RL model to choices
– find best-fitting parameters
– compare different exploration models
Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.)
– use these as regressors for the fMRI signal
– after Sugrue et al.
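A minimal Python sketch of the "model estimates as regressors" step in the spirit of the slide; the HRF, onset times, and values are illustrative assumptions, not the study's actual design.

    import numpy as np
    from scipy.stats import gamma

    def hrf(t):
        """Illustrative double-gamma haemodynamic response function."""
        return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

    def model_regressor(onsets, values, tr, n_scans):
        """Place a model-derived value at each trial's onset, convolve with the HRF."""
        stick = np.zeros(n_scans)
        stick[(np.asarray(onsets) / tr).astype(int)] = values   # nearest-scan approximation
        return np.convolve(stick, hrf(np.arange(0.0, 30.0, tr)))[:n_scans]

    # e.g. trial-by-trial choice probabilities from the fit model as a parametric regressor
    reg = model_regressor(onsets=[0.0, 9.7, 22.1], values=[0.8, 0.3, 0.6],
                          tr=3.24, n_scans=20)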

26 Behavior

27-30 (no transcript text on these slides)

31-37 Behavior model (built up over several slides)
1. Estimate payoffs: Kalman filter
– error update (like TD)
– exact inference
– (cf. Behrens et al. on volatility)
– yields estimated means and uncertainties per slot (μ_green, σ_green; μ_red, σ_red; etc.)
(figure: estimated payoff distribution vs. trial t, t+1)
2. Derive choice probabilities (P_green, P_red, etc.) and choose randomly according to these
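A minimal Python sketch of a per-arm Kalman filter of the kind the slides describe: the posterior mean is nudged by a prediction error (much like a TD update) with an uncertainty-dependent gain, and the posterior variance grows between observations because the payoffs drift. Parameter values are illustrative.

    class KalmanArm:
        """Tracks one slot machine's payoff as a Gaussian posterior (mean, var)."""
        def __init__(self, mean=50.0, var=100.0, drift_var=9.0, obs_var=16.0):
            self.mean, self.var = mean, var
            self.drift_var, self.obs_var = drift_var, obs_var

        def predict(self):
            # between trials the payoff drifts, so uncertainty grows
            self.var += self.drift_var

        def update(self, payoff):
            # observed payoff: error-driven update with the Kalman gain as learning rate
            gain = self.var / (self.var + self.obs_var)
            self.mean += gain * (payoff - self.mean)   # prediction-error update, like TD
            self.var *= (1.0 - gain)

    arm = KalmanArm()
    arm.predict()
    arm.update(payoff=62.0)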

38-43 Behavior model: derive choice probabilities (built up over several slides)
Compare rules: how is exploration directed? (from dumber to smarter)
– randomly: "ε-greedy"
– by value: "softmax"
– by value and uncertainty: "uncertainty bonuses"
Each rule maps the estimates (μ_green, σ_green; μ_red, σ_red; etc.) to P_green, P_red, etc.; choose randomly according to these
(figure: choice probability as a function of action value)
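A minimal Python sketch of the three candidate choice rules, mapping the Kalman-filter means (and, for the bonus rule, uncertainties) to choice probabilities; the epsilon, inverse temperature, and bonus weight are illustrative free parameters.

    import numpy as np

    def epsilon_greedy(means, eps=0.1):
        """Mostly pick the best arm; explore uniformly at random with probability eps."""
        p = np.full(len(means), eps / len(means))
        p[np.argmax(means)] += 1.0 - eps
        return p

    def softmax(means, beta=0.2):
        """Explore in proportion to exponentiated value; beta is the inverse temperature."""
        z = np.exp(beta * (means - np.max(means)))
        return z / z.sum()

    def uncertainty_bonus(means, sds, beta=0.2, phi=1.0):
        """Softmax over value plus an uncertainty bonus weighted by phi."""
        return softmax(means + phi * sds, beta)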

44 Model comparison
Assess models based on the likelihood of the actual choices
– product over subjects and trials of the modeled probability of each choice
– find maximum-likelihood parameters (inference parameters, choice parameters)
– parameters yoked between subjects (… except choice noisiness, to model all heterogeneity)
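A minimal Python sketch of the fitting criterion on the slide: the negative log of the probability the model assigns to each observed choice, summed over trials and subjects, then minimized over parameters. The data layout and choice_prob_fn are illustrative stand-ins.

    import numpy as np

    def neg_log_likelihood(params, subjects, choice_prob_fn):
        """subjects: list of (means, sds, choices) per-trial arrays, one tuple per subject."""
        nll = 0.0
        for means, sds, choices in subjects:
            for mu, sd, c in zip(means, sds, choices):
                p = choice_prob_fn(mu, sd, params)   # e.g. one of the rules sketched above
                nll -= np.log(p[c] + 1e-12)          # small constant guards against log(0)
        return nll

    # e.g.: from scipy.optimize import minimize
    # fit = minimize(neg_log_likelihood, x0=[0.2, 1.0], args=(subjects, my_rule), method="Nelder-Mead")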

45-46 Behavioral results
Strong evidence for exploration directed by value
No evidence for direction by uncertainty
– tried several variations
(table: -log likelihood, smaller is better, and number of parameters, for ε-greedy, softmax, and uncertainty bonuses)

47 Imaging methods
1.5 T Siemens Sonata scanner; sequence optimized for OFC (Deichmann et al. 2003)
2 × 385 volumes; 36 slices; 3 mm thickness; 3.24 s TR
SPM2, random-effects model
Regressors generated using the fit model and the trial-by-trial sequence of actual choices/payoffs

48 Imaging results
TD error: dopamine targets (dorsal and ventral striatum)
Replicates previous studies, but weakish
– graded payoffs?
(figure: ventral striatum at x,y,z = 9, 12, -9; dorsal striatum at x,y,z = 9, 0, 18; thresholds p<0.01 and p<0.001)

49 Value-related correlates
Probability (or expected value) of the chosen action: vmPFC (x,y,z = -3, 45, -18)
Payoff amount: medial OFC (x,y,z = 3, 30, -21)
(figures: % signal change vs. probability of chosen action and vs. payoff; thresholds p<0.01 and p<0.001)

50 Exploration
Non-greedy > greedy choices: exploration
Frontopolar cortex; survives whole-brain correction
(figure: left and right frontal pole, x,y,z = -27, 48, 4 and 27, 57, 6; thresholds p<0.01 and p<0.001)

51 Timecourses
(figure: BOLD timecourses in frontal pole and IPS)

52 Checks
Do other factors explain the differential BOLD activity better?
– multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more
– only explore/exploit is significant
– (but 5 additional putative explore areas were eliminated)
Individual subjects: BOLD differences are stronger for better behavioral fit

53 Frontal poles
Imaging: high-level control
– coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
– mediating cognitive processes (Ramnani & Owen 2004)
– nothing this computationally specific
Lesions: task switching (Burgess et al. 2000)
– more generic: perseveration
"One of the least well understood regions of the human brain"
No cortical connections outside PFC ("PFC for PFC")
Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)

54 Interpretation
Cognitive decision to explore overrides habit circuitry? Via parietal?
– higher FP response when exploration is chosen most against the odds
– explore RTs are longer
Exploration/exploitation are neurally distinct
– computationally surprising, especially bad for uncertainty-bonus schemes
– proper exploration requires computational integration
– no behavioral evidence either
Why softmax? It can misexplore
– deterministic bonus schemes are bad in adversarial/multi-agent settings
– dynamic temperature control? (norepinephrine; Usher et al.; Doya)

55 Conclusions
Subjects direct exploration by value but not by uncertainty
Cortical regions are differentially implicated in exploration
– computational consequences
Integrative approach: computation, behavior, imaging
– quantitatively assess and constrain models using raw behavior
– infer subjective states using the model, study their neural correlates

56 Open issues
Model-based vs. model-free vs. Pavlovian control
– environmental priors vs. naive optimism vs. neophilic compulsion?
Environmental priors and generalization
– curiosity / 'intrinsic motivation' from expected future reward

