1
**dopamine and prediction error**

[Figure: TD error (labels L, R, Vt) in three conditions: no prediction; prediction, reward; prediction, no reward. Schultz 1997]
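A minimal sketch of how a temporal-difference error behaves in the three conditions named on this slide. The four-step trial layout, the value tables, and the names `td_errors`, `values`, `rewards` and `gamma` are illustrative assumptions, not taken from the slides.

```python
def td_errors(values, rewards, gamma=1.0):
    """Return delta_t = r_t + gamma * V(t+1) - V(t) for each step of one trial.

    values[t]  : predicted future reward at step t (assumed value table)
    rewards[t] : reward received on leaving step t
    The value after the final step is taken to be 0.
    """
    deltas = []
    for t in range(len(values)):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        deltas.append(rewards[t] + gamma * next_value - values[t])
    return deltas


# Assumed layout: step 0 = before cue, 1 = cue, 2 = delay (reward on leaving), 3 = end.
naive_values = [0.0, 0.0, 0.0, 0.0]    # nothing learned yet
learned_values = [0.0, 1.0, 1.0, 0.0]  # cue predicts one unit of reward

no_prediction = td_errors(naive_values, [0.0, 0.0, 1.0, 0.0])
prediction_reward = td_errors(learned_values, [0.0, 0.0, 1.0, 0.0])
prediction_no_reward = td_errors(learned_values, [0.0, 0.0, 0.0, 0.0])
```

Under these assumptions the error is positive at reward time when nothing is predicted, moves to cue onset once the reward is predicted, and goes negative at the expected reward time when the reward is omitted.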

2
**humans are no different**

dorsomedial striatum/PFC: goal-directed control. dorsolateral striatum: habitual control. ventral striatum: Pavlovian control; value signals. dopamine...

3
**in humans…**

5 stimuli: 40¢, 20¢, 0/40¢, 0¢. 19 subjects (dropped 3 non-learners, N=16). 3T scanner, TR = 2 sec, interleaved. 234 trials: 130 choice, 104 single stimulus, randomly ordered and counterbalanced.

[Figure: trial timeline. < 1 sec; 5 sec ISI; 0.5 sec, "You won 40 cents"; 2-5 sec ITI]

4
**what would a prediction error look like (in BOLD)?**

5
**prediction errors in NAC**

Raw BOLD (avg over all subjects). Unbiased anatomical ROI in nucleus accumbens (marked per subject*). Can actually decide between different neuroeconomic models of risk.

* thanks to Laura deSouza

6
**Polar Exploration**

Peter Dayan, Nathaniel Daw, John O’Doherty, Ray Dolan

7
**Background**

Learning from experience. Neural: dopamine, basal ganglia. Computational: TD learning. What about sampling: gathering experience to learn? [Figure labels: Striatum, PFC]

Multiple, competitive decision systems, e.g. PFC vs basal ganglia; hot/cold (e.g. Ringel, Loewenstein). Surprising from optimal control perspective.

8
**Exploration vs. exploitation**

Classic dilemma in learned decision making. For unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained?

9
**Exploration vs. exploitation**

[Figure: reward over time] Exploitation: choose action expected to be best. May never discover something better.

10
**Exploration vs. exploitation**

[Figure: reward over time] Exploitation: choose action expected to be best. May never discover something better. Exploration: choose action expected to be worse.

11
**Exploration vs. exploitation**

[Figure: reward over time] Exploitation: choose action expected to be best. May never discover something better. Exploration: choose action expected to be worse. If it is, then go back to the original.

13
**Exploration vs. exploitation**

[Figure: reward over time] Exploitation: choose action expected to be best. May never discover something better. Exploration: choose action expected to be worse. If it is better, then exploit in the future.

14
**Exploration vs. exploitation**

[Figure: reward over time] Exploitation: choose action expected to be best. May never discover something better. Exploration: choose action expected to be worse. Balanced by the long-term gain if it turns out better (even for risk- or ambiguity-averse subjects). NB: learning is nontrivial when outcomes are noisy or changing.

15
**Bayesian analysis (Gittins 1972)**

Tractable dynamic program in restricted class of problems: “n-armed bandit”. Solution requires balancing: expected outcome values; uncertainty (need for exploration); horizon/discounting (time to exploit). Optimal policy: explore systematically; choose best sum of value plus bonus; bonus increases with uncertainty. Intractable in general setting; various heuristics used in practice.

[Figure: Value by Action]

16
**Experiment**

How do humans handle tradeoff? Computation: which strategies fit behavior? Several popular approximations. Difference: what information influences exploration? Neural substrate: what systems are involved? PFC, high-level control; competitive decision systems (Daw et al. 2005). Neuromodulators: dopamine (Kakade & Dayan 2002), norepinephrine (Usher et al. 1999).

20
**Task design**

Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in scanner. Trial onset: slots revealed. +~430 ms: subject makes choice; chosen slot spins. +~3000 ms: outcome, payoff revealed (“obtained 57 points”). +~1000 ms: screen cleared; trial ends.

21
**Payoff structure**

Noisy to require integration of data. Subjects learn about payoffs only by sampling them.

23
**Payoff structure**

[Figure; axis label: Payoff]

24
**Payoff structure**

Nonstationary to encourage ongoing exploration (Gaussian drift w/ decay).
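One way to picture "Gaussian drift w/ decay" combined with noisy payoffs is a decaying Gaussian random walk per slot machine. The sketch below is only an illustration: the function name `simulate_payoffs` and every numeric setting (`decay`, `center`, `drift_sd`, `noise_sd`, the starting range) are assumed values, not the experiment's parameters. Only the count of four machines comes from the slides.

```python
import random


def simulate_payoffs(n_trials, n_arms=4, decay=0.98, center=50.0,
                     drift_sd=3.0, noise_sd=4.0, seed=0):
    """Simulate drifting mean payoffs and the noisy payoffs drawn around them.

    All numeric defaults are assumptions chosen for illustration.
    Returns (mean_history, payoff_history), each a list of per-trial lists.
    """
    rng = random.Random(seed)
    means = [rng.uniform(20.0, 80.0) for _ in range(n_arms)]
    mean_history, payoff_history = [], []
    for _ in range(n_trials):
        mean_history.append(list(means))
        # Noisy payoff: what a subject would see if they sampled each machine.
        payoff_history.append([rng.gauss(m, noise_sd) for m in means])
        # Decay toward the center plus Gaussian diffusion (nonstationarity).
        means = [decay * m + (1.0 - decay) * center + rng.gauss(0.0, drift_sd)
                 for m in means]
    return mean_history, payoff_history


mean_history, payoff_history = simulate_payoffs(n_trials=300)
```

Because the means keep moving, a machine that was worst early on can become best later, which is what makes continued sampling worthwhile.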

25
**Analysis strategy**

Behavior: fit an RL model to choices. Find best-fitting parameters; compare different exploration models. Imaging: use model to estimate subjective factors (explore vs. exploit, value, etc.); use these as regressors for the fMRI signal. After Sugrue et al.

26
Behavior

31
**Behavior model**

1. Estimate payoffs: μ_green, μ_red, etc.; σ_green, σ_red, etc.
2. Derive choice probabilities: P_green, P_red, etc. Choose randomly according to these.

32
**Behavior model**

Kalman filter: error update (like TD); exact inference.

1. Estimate payoffs: μ_green, μ_red, etc.; σ_green, σ_red, etc.
2. Derive choice probabilities: P_green, P_red, etc. Choose randomly according to these.

37
**Behavior model**

Kalman filter: error update (like TD); exact inference.

1. Estimate payoffs: μ_green, μ_red, etc.; σ_green, σ_red, etc.
2. Derive choice probabilities: P_green, P_red, etc. Choose randomly according to these.

[Figure: payoff by trial (t, t+1)]

Behrens & volatility
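The "error update (like TD)" can be sketched as a scalar Kalman filter that tracks a mean and a variance for each slot machine. This is a generic textbook form under stated assumptions, not necessarily the exact model fit in the study; the names `kalman_update`, `kalman_drift`, `obs_var`, `decay`, `center` and `drift_var` are hypothetical.

```python
def kalman_update(mean, var, payoff, obs_var):
    """Update the chosen machine's estimate after observing its payoff.

    The mean moves by a fraction (the gain) of the prediction error,
    which is the sense in which the update resembles TD learning.
    """
    gain = var / (var + obs_var)   # learning rate, larger when more uncertain
    error = payoff - mean          # prediction error
    return mean + gain * error, (1.0 - gain) * var


def kalman_drift(mean, var, decay, center, drift_var):
    """Between trials, every machine's estimate decays toward an assumed
    center and its uncertainty grows, reflecting the nonstationary payoffs."""
    new_mean = decay * mean + (1.0 - decay) * center
    new_var = decay ** 2 * var + drift_var
    return new_mean, new_var


# Hypothetical numbers: prior mean 50, variance 16, observed payoff 60.
posterior_mean, posterior_var = kalman_update(50.0, 16.0, 60.0, obs_var=16.0)
```

Unchosen machines receive only the drift step, so their variance keeps growing until they are sampled again.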

38
**Behavior model**

Kalman filter.

1. Estimate payoffs: μ_green, μ_red, etc.; σ_green, σ_red, etc.

Compare rules: how is exploration directed?

2. Derive choice probabilities: P_green, P_red, etc. Choose randomly according to these.

39
**Behavior model**

μ_green, μ_red, etc.; σ_green, σ_red, etc.

Compare rules: how is exploration directed?

2. Derive choice probabilities: P_green, P_red, etc. Choose randomly according to these.

43
**Behavior model**

μ_green, μ_red, etc.; σ_green, σ_red, etc.

Compare rules: how is exploration directed?

2. Derive choice probabilities. From dumber to smarter: randomly (“ε-greedy”); by value (“softmax”); by value and uncertainty (“uncertainty bonuses”).

[Figure: Value and Probability by Action]
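The three rules can be written as functions that turn estimated means (and, for the last rule, standard deviations) into choice probabilities. This is a sketch using common textbook conventions; the parameter names `epsilon`, `beta` and `phi`, and the specific form of each rule, are assumptions rather than the exact parameterization fit in the study.

```python
import math


def epsilon_greedy(means, epsilon):
    """Explore randomly: with probability epsilon pick uniformly among all
    actions, otherwise pick the action with the highest estimated mean."""
    n = len(means)
    best = max(range(n), key=lambda i: means[i])
    return [(1.0 - epsilon) + epsilon / n if i == best else epsilon / n
            for i in range(n)]


def softmax(means, beta):
    """Explore by value: probability grows smoothly with estimated mean.
    beta is an inverse temperature (higher means greedier)."""
    top = max(means)  # subtract the max for numerical stability
    weights = [math.exp(beta * (m - top)) for m in means]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_with_bonus(means, sds, beta, phi):
    """Explore by value and uncertainty: add a bonus phi * sd to each mean
    before the softmax, so uncertain actions look more attractive."""
    return softmax([m + phi * s for m, s in zip(means, sds)], beta)


# Hypothetical estimates for four slot machines.
means = [40.0, 55.0, 50.0, 30.0]
sds = [2.0, 2.0, 10.0, 2.0]
probs = softmax_with_bonus(means, sds, beta=0.2, phi=1.0)
```

With `phi = 0` the bonus rule reduces to plain softmax, which is why it carries exactly one more parameter than softmax.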

44
**Model comparison**

Assess models based on likelihood of actual choices: product over subjects and trials of modeled probability of each choice. Find maximum likelihood parameters: inference parameters, choice parameters. Parameters yoked between subjects (… except choice noisiness, to model all heterogeneity).
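The likelihood described here is a product of per-choice probabilities, which in practice is computed as a sum of logs. A minimal sketch, assuming the model has already produced a probability vector for every trial; the function and argument names are hypothetical.

```python
import math


def negative_log_likelihood(choices, choice_probs):
    """Return -log of the product of the probabilities the model assigned
    to the choices actually made (smaller is better).

    choices      : index of the chosen action on each trial
    choice_probs : the model's probability vector on each trial
    """
    return -sum(math.log(probs[c]) for c, probs in zip(choices, choice_probs))


# Two hypothetical trials with two actions each.
example = negative_log_likelihood(
    [0, 1],
    [[0.5, 0.5], [0.25, 0.75]],
)
```

Maximum likelihood fitting then means searching over model parameters for the setting that minimizes this quantity summed over subjects and trials.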

45
**Behavioral results**

| | ε-greedy | softmax | uncertainty bonuses |
|---|---|---|---|
| -log likelihood (smaller is better) | 4208.3 | 3972.1 | 3972.1 |
| # parameters | 19 | 19 | 20 |

Strong evidence for exploration directed by value. No evidence for direction by uncertainty. Tried several variations.
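The reported figures can be read off with simple arithmetic. The snippet below only subtracts the numbers given on the slide; the dictionary layout and the helper name `nll_improvement` are illustrative, and no statistical test from the original analysis is implied.

```python
# (-log likelihood, number of parameters) as reported on the slide.
results = {
    "e-greedy": (4208.3, 19),
    "softmax": (3972.1, 19),
    "uncertainty bonuses": (3972.1, 20),
}


def nll_improvement(worse, better):
    """How much the -log likelihood drops when moving from one model to another."""
    return results[worse][0] - results[better][0]


value_gain = nll_improvement("e-greedy", "softmax")
uncertainty_gain = nll_improvement("softmax", "uncertainty bonuses")
extra_params = results["uncertainty bonuses"][1] - results["softmax"][1]
```

Softmax improves on ε-greedy by about 236 log units with the same number of parameters, while the uncertainty bonus adds one parameter and improves the fit by nothing at the reported precision.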

47
**Imaging methods**

1.5 T Siemens Sonata scanner. Sequence optimized for OFC (Deichmann et al. 2003). 2x385 volumes; 36 slices; 3 mm thickness; 3.24 sec TR. SPM2 random effects model. Regressors generated using fit model and trial-by-trial sequence of actual choices/payoffs.

48
**Imaging results**

TD error: dopamine targets (dorsal and ventral striatum). Replicate previous studies, but weakish. Graded payoffs?

[Figure: vStr, x,y,z = 9,12,-9; dStr, x,y,z = 9,0,18; thresholds p<0.01, p<0.001]

49
**Value-related correlates**

Probability (or exp. value) of chosen action: vmPFC (x,y,z = -3,45,-18). Payoff amount: OFC (mOFC; x,y,z = 3,30,-21).

[Figures: % signal change by probability of chosen action (vmPFC) and by payoff (mOFC); thresholds p<0.01, p<0.001]

50
**Exploration**

Non-greedy > greedy choices: exploration. Frontopolar cortex. Survives whole-brain correction.

[Figure: rFP and LFP; x,y,z = -27,48,4; 27,57,6; thresholds p<0.01, p<0.001]

51
**Timecourses**

[Figure: frontal pole; IPS]

52
**Checks**

Do other factors explain differential BOLD activity better? Multiple regression vs. RT, actual reward, predicted reward, choice prob, stay vs. switch, uncertainty, more. Only explore/exploit is significant (but 5 additional putative explore areas eliminated). Individual subjects: BOLD differences stronger for better behavioral fit.

53
**Frontal poles**

“One of the least well understood regions of the human brain”. No cortical connections outside PFC (“PFC for PFC”). Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003).

Imaging: high-level control. Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004). Mediating cognitive processes (Ramnani & Owen 2004). Nothing this computationally specific.

Lesions: task switching (Burgess et al. 2000); more generic: perseveration.

54
**Interpretation**

Cognitive decision to explore overrides habit circuitry? Via parietal? Higher FP response when exploration chosen most against the odds. Explore RT longer.

Exploration/exploitation are neurally distinct. Computationally surprising, esp. bad for uncertainty bonus schemes: proper exploration requires computational integration; no behavioral evidence either.

Why softmax? Can misexplore. Deterministic bonus schemes bad in adversarial/multiagent setting. Dynamic temperature control? (norepinephrine; Usher et al.; Doya)

55
**Conclusions**

Subjects direct exploration by value but not uncertainty. Cortical regions differentially implicated in exploration: computational consequences. Integrative approach: computation, behavior, imaging. Quantitatively assess & constrain models using raw behavior. Infer subjective states using model, study neural correlates.

56
**Open Issues**

Model-based vs model-free vs Pavlovian control. Environmental priors vs naive optimism vs neophilic compulsion? Environmental priors and generalization. Curiosity/‘intrinsic motivation’ from expected future reward.
