
1 models: reinforcement learning & fMRI Nathaniel Daw 11/28/2007

2 overview reinforcement learning model fitting: behavior model fitting: fMRI

3 overview reinforcement learning –simple example –tracking –choice model fitting: behavior model fitting: fMRI

4 Reinforcement learning: the problem Optimal choice learned by repeated trial-and-error –eg between slot machines that pay off with different probabilities But… –Payoff amounts & probabilities may be unknown –May additionally be changing –Decisions may be sequentially structured (chess, mazes: this we won't consider today) Very hard computational problem; computational shortcuts essential Interplay between what you can and should do Both have behavioral & neural consequences

5 Simple example n-armed bandit, unknown but IID payoffs –surprisingly rich problem Vague strategy to maximize expected payoff: 1)Predict expected payoff for each option 2)Choose the best (?) 3)Learn from outcome to improve predictions

6 Simple example 1) Predict expected payoff for each option –Take V_L = last reward received on option L –(more generally, some weighted average of past rewards) –This is an unbiased, albeit lousy, estimator 2) Choose the best –(more generally, choose stochastically s.t. the machine judged richer is more likely to be chosen) Say the left machine pays 10 with prob 10%, 0 otherwise Say the right machine pays 1 always What happens? (Niv et al. 2000; Bateson & Kacelnik)

7 Behavioral anomalies Apparent risk aversion arises due to learning, i.e. due to the way payoffs are estimated –Even though the learner is trying to optimize expected reward, i.e. is risk neutral –Easy to construct other examples for risk proneness, “probability matching” Behavioral anomalies can have computational roots Sampling and choice interact in subtle ways (see the simulation sketch below)
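
A minimal simulation sketch of the example above (the softmax temperature, trial count, and variable names are illustrative choices, not taken from the slides): an agent that estimates each machine's value by the last reward received ends up choosing the safe machine well over half the time, even though both machines have the same expected payoff.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_trials=10_000, beta=1.0):
    """Value of each arm = last reward received there; choices via softmax."""
    V = np.array([1.0, 1.0])                 # [risky, safe] initial estimates
    chose_risky = 0
    for _ in range(n_trials):
        p_risky = 1.0 / (1.0 + np.exp(-beta * (V[0] - V[1])))   # 2-arm softmax
        if rng.random() < p_risky:
            arm, reward = 0, (10.0 if rng.random() < 0.1 else 0.0)   # EV = 1
            chose_risky += 1
        else:
            arm, reward = 1, 1.0                                     # EV = 1
        V[arm] = reward                      # "last reward" estimator
    return chose_risky / n_trials

print(simulate())   # substantially below 0.5: apparent risk aversion
```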

8 what can we do?

9 What can we do? Exponentially weighted running average of rewards on an option: V ← V + α(r − V), i.e. V_t = α Σ_k (1−α)^k r_(t−k) Convenient form because it can be recursively maintained (‘exponential filter’); also known as ‘error-driven learning’, the ‘delta rule’, ‘Rescorla-Wagner’ [figure: the weight on past rewards in the reward prediction decays exponentially with trials into the past]
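
A one-function sketch of the recursive update described above (the learning-rate value and function name are just illustrative):

```python
def exponential_filter(rewards, alpha=0.1, v0=0.0):
    """Delta rule / Rescorla-Wagner: V <- V + alpha * (r - V)."""
    V = v0
    estimates = []
    for r in rewards:
        delta = r - V            # prediction error
        V = V + alpha * delta    # error-driven update
        estimates.append(V)
    return estimates
```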

10 what should we do? [learning]

11 Bayesian view Specify a ‘generative model’ for payoffs Assume the payoff following a choice of A is Gaussian with unknown mean μ_A and known variance σ²_PAYOFF Assume the mean μ_A changes via a Gaussian random walk with zero mean and variance σ²_WALK [figure: μ_A drifting across trials, with observed payoffs for A scattered around it]

12 Bayesian view Describe prior beliefs about the parameters as a probability distribution Assume they are Gaussian with mean μ̂_A and variance σ̂²_A Update beliefs in light of experience with Bayes’ rule: P(μ_A | payoff) ∝ P(payoff | μ_A) P(μ_A) [figure: belief distribution over the mean payoff for A]

13–17 Bayesian belief updating [figures: the Gaussian belief over the mean payoff for A is shifted toward each new observed payoff and its uncertainty updated, one observation per slide]

18 Notes on the Kalman filter Looks like Rescorla-Wagner, but: we track uncertainty as well as the mean; the learning rate is a function of uncertainty (asymptotically constant but nonzero) Why do we exponentially weight past rewards? Because the mean keeps drifting, older samples are less informative
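
A scalar Kalman-filter sketch of the tracking model in slides 11–18 (the variable names and default variances are illustrative): the posterior variance sets the trial-by-trial learning rate, which settles to a constant, nonzero value because the random walk keeps injecting uncertainty.

```python
def kalman_track(payoffs, var_payoff=1.0, var_walk=0.1, mu0=0.0, var0=10.0):
    """Track a payoff mean that drifts via a Gaussian random walk."""
    mu, var = mu0, var0
    for r in payoffs:
        var = var + var_walk                 # prediction: drift grows uncertainty
        kappa = var / (var + var_payoff)     # Kalman gain = learning rate
        mu = mu + kappa * (r - mu)           # Rescorla-Wagner-like error update
        var = (1.0 - kappa) * var            # observation shrinks uncertainty
        yield mu, var, kappa
```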

19 what should we do? [choice]

20 The n-armed bandit n slot machines; binary payoffs with unknown, fixed probabilities; you get some limited (technically: random, exponentially distributed) number of spins; want to maximize income Surprisingly rich problem

21 The n-armed bandit 1.Track payoff probabilities Bayesian: learn a distribution over possible probs for each machine This is easy: Just requires counting wins and losses (Beta posterior)
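
The “easy” tracking step as code: counting wins and losses gives a Beta posterior over each machine's payoff probability (a sketch; the uniform Beta(1, 1) prior and function name are assumptions).

```python
from scipy.stats import beta

def payoff_posterior(wins, losses, a0=1.0, b0=1.0):
    """Beta posterior over a machine's payoff probability."""
    return beta(a0 + wins, b0 + losses)

post = payoff_posterior(4, 4)          # e.g. 4/8 spins rewarded
print(post.mean(), post.std())
```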

22 The n-armed bandit 2.Choose This is hard. Why?

23 The explore-exploit dilemma 2. Choose Simply choosing the apparently best machine might miss something better: must balance exploration and exploitation Simple heuristics, eg choose at random once in a while (sketched below)
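
A sketch of the “choose at random once in a while” heuristic (often called epsilon-greedy; the exploration rate is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(values, epsilon=0.1):
    """Exploit the apparently best arm, but explore uniformly with prob. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))
```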

24 Explore / exploit Which should you choose? left bandit: 4/8 spins rewarded right bandit: 1/2 spins rewarded mean of both distributions: 50%

25 Explore / exploit left bandit: 4/8 spins rewarded right bandit: 1/2 spins rewarded the right bandit, having been sampled less, is more uncertain (its distribution has larger variance) Which should you choose?

26 Explore / exploit Which should you choose? Trade off uncertainty, expected value, and horizon ‘Value of information’: exploring improves future choices How to quantify? Although the more uncertain bandit has a larger chance of being worse, it also has a larger chance of being better… which would be useful to find out, if true

27 Optimal solution This is really a sequential choice problem; can be solved with dynamic programming Naïve approach: Each machine has k ‘states’ (number of wins/losses so far); the state of the total game is the product over all machines; curse of dimensionality (k^n states) – a sketch of this naïve dynamic program follows below Clever approach: (Gittins 1972) Problem decouples to one with k states – consider continuing on a single bandit versus switching to a bandit that always pays some known amount. The amount for which you’d switch is the ‘Gittins index’. It properly balances mean, uncertainty & horizon
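
A sketch of the naïve dynamic program for Bernoulli bandits mentioned above, using a fixed finite horizon rather than the exponentially distributed one, and a Beta(1, 1) prior (both simplifying assumptions). Memoizing over the joint (wins, losses) counts of all machines makes the combinatorial state space explicit.

```python
from functools import lru_cache

def bandit_value(counts, horizon, a0=1, b0=1):
    """Expected total reward from acting optimally for `horizon` more spins,
    given per-arm (wins, losses) counts and a Beta(a0, b0) prior."""
    @lru_cache(maxsize=None)
    def V(state, t):
        if t == 0:
            return 0.0
        best = 0.0
        for i, (w, l) in enumerate(state):
            p = (a0 + w) / (a0 + b0 + w + l)                 # posterior mean
            win  = state[:i] + ((w + 1, l),) + state[i + 1:]
            lose = state[:i] + ((w, l + 1),) + state[i + 1:]
            best = max(best, p * (1.0 + V(win, t - 1)) + (1.0 - p) * V(lose, t - 1))
        return best
    return V(tuple(counts), horizon)

print(bandit_value(((4, 4), (1, 1)), horizon=10))   # the explore/exploit example above
```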

28 overview reinforcement learning model fitting: behavior –pooling multiple subjects –example model fitting: fMRI

29 Model estimation What is a model? –a parameterized stochastic data-generation process Model m predicts data D given parameters θ Estimate parameters: posterior distribution over θ by Bayes’ rule, P(θ | D, m) ∝ P(D | θ, m) P(θ | m) Typically use a maximum likelihood point estimate instead, ie the parameters for which the data are most likely Can still study uncertainty around the peak: interactions, identifiability

30 Application to RL eg D for a subject is the ordered list of choices c_t and rewards r_t, with eg a softmax choice rule P(c_t = L) ∝ exp(β · V_L(t)), where V might be learned by an exponential filter with decay (learning rate) α
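
A sketch of maximum likelihood fitting for such a model (the parameterization, bounds, and starting values are illustrative, not the authors' exact specification):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_log_likelihood(params, choices, rewards, n_arms=2):
    """-log P(choices | alpha, beta) for an exponential filter + softmax model."""
    alpha, beta = params
    V = np.zeros(n_arms)
    nll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * V
        nll -= logits[c] - logsumexp(logits)   # log softmax probability of the choice
        V[c] += alpha * (r - V[c])             # update the chosen arm's value
    return nll

def fit_subject(choices, rewards):
    """Maximum likelihood point estimate of (alpha, beta) for one subject."""
    return minimize(neg_log_likelihood, x0=[0.1, 1.0], args=(choices, rewards),
                    bounds=[(1e-3, 1.0), (0.0, 20.0)])
```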

31 Example behavioral task Reinforcement learning for reward & punishment: participants (31) repeatedly choose between boxes each box has (hidden, changing) chance of giving money (20p) also, independent chance of giving electric shock (8 on 1–10 pain scale) [figure: choice boxes with money and shock outcomes]

32 This is good for what? parameters may measure something of interest –eg learning rate, monetary value of shock allows us to quantify & study neural representations of subjective quantities –expected value, prediction error compare models compare groups

33 Compare models In principle: compare model evidence P(D | m) = ∫ P(D | θ, m) P(θ | m) dθ (‘automatic Occam’s razor’) In practice: approximate the integral as max likelihood + penalty: Laplace, BIC, AIC etc Frequentist version: likelihood ratio test Or: holdout set; difficult in the sequential case Good example refs: Ho & Camerer
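
The standard penalized approximations as code (a sketch of the usual BIC and AIC formulas; lower values indicate the better model):

```python
import numpy as np

def bic(neg_log_lik, n_params, n_obs):
    """Bayesian information criterion: -2 log L + k log n."""
    return 2.0 * neg_log_lik + n_params * np.log(n_obs)

def aic(neg_log_lik, n_params):
    """Akaike information criterion: -2 log L + 2k."""
    return 2.0 * neg_log_lik + 2.0 * n_params
```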

34 Compare groups How to model data for a group of subjects? Want to account for (potential) inter-subject variability in the parameters θ –this is called treating the parameters as “random effects” –ie random variables instantiated once per subject –hierarchical model: each subject’s parameters drawn from a population distribution her choices drawn from the model given those parameters

35 Random effects model Hierarchical model: –What is θ_s? e.g., a learning rate –What is P(θ_s | φ)? eg a Gaussian, or a mixture of Gaussians –What is φ? eg the mean and variance, over the population, of the regression weights Interested in identifying the population characteristics φ (all multisubject fMRI analyses work this way)

36 Random effects model Interested in identifying the population characteristics φ –method 1: summary statistics of individual ML fits (cheap & cheerful: used in fMRI) –method 2: estimate the integral over parameters eg with Monte Carlo What good is this? –can make statistical statements about parameters in the population –can compare groups –can regularize individual parameter estimates, ie P(θ | c_s): “empirical Bayes”
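
A sketch of “method 1”, the summary-statistics approach: each subject's ML parameter estimate is treated as one observation from the population and tested with an ordinary one-sample t-test (the function and variable names are illustrative).

```python
import numpy as np
from scipy import stats

def random_effects_test(per_subject_estimates, popmean=0.0):
    """Is the population mean of a fitted parameter different from popmean?"""
    estimates = np.asarray(per_subject_estimates, dtype=float)
    return stats.ttest_1samp(estimates, popmean)
```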

37 Example behavioral task Reinforcement learning for reward & punishment: participants (31) repeatedly choose between boxes each box has (hidden, changing) chance of giving money (20p) also, independent chance of giving electric shock (8 on 1–10 pain scale) [figure: choice boxes with money and shock outcomes]

38 Behavioral analysis Fit trial-by-trial choices using a “conditional logit” regression model: coefficients estimate the effects on choice of past rewards, shocks, & choices (Lau & Glimcher; Corrado et al) → selective effect of acute tryptophan depletion? value(box 1) = [lagged 0/1 reward, shock & choice indicators for box 1] · [weights], value(box 2) = etc; values → choice probabilities using the logistic (‘softmax’) rule, prob(box 1) ∝ exp(value(box 1)); probabilities → choices stochastically; estimate the weights by maximizing the joint likelihood of the choices, conditional on rewards (sketched below)
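
A sketch of the regression setup described above: build lagged 0/1 indicator columns for past rewards, shocks, and choices, combine them with a weight vector into per-box values, and pass the values through a softmax to get choice probabilities (the helper names and lag count are illustrative).

```python
import numpy as np

def lagged_design(events, n_lags=8):
    """Columns = the 0/1 event indicator 1, 2, ..., n_lags trials in the past."""
    T = len(events)
    X = np.zeros((T, n_lags))
    for lag in range(1, n_lags + 1):
        X[lag:, lag - 1] = events[:T - lag]
    return X

def choice_probabilities(values):
    """Softmax over per-box values, row-wise (one row per trial)."""
    z = values - values.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```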

39 Summary statistics of individual ML fits –fairly noisy (unconstrained model, unregularized fits)

40 models predict exponential decays in reward & shock weights & typically neglect choice-choice autocorrelation

41 Fit of TD model (w/ exponentially decaying choice sensitivity), visualized same way (5x fewer parameters, essentially as good fit to data; estimates better regularized)

42 Quantify the value of pain [figure: fitted subjective values of £0.20, -£0.12 and £0.04]

43 Effect of acute tryptophan depletion?

44 Depleted participants are: equally shock-driven more ‘sticky’ (driven to repeat choices) less money-driven (this effect less reliable)

45 Linear effects of blood tryptophan levels: [figure; p > .5]

46 Linear effects of blood tryptophan levels: [figure; p < .005]

47 Linear effects of blood tryptophan levels: [figures; p < .01, p < .005]

48 overview reinforcement learning model fitting: behavior model fitting: fMRI –random effects –RL regressors

49 [figure: L FP and rFP activation maps, p < 0.01 and p < 0.001] What does this mean when there are multiple subjects? Regression coefficients as random effects: if we drew more subjects from this population, is the expected effect size > 0?

50 History 1990-1991 – SPM paper, software released, used for PET low ratio of samples to subjects (within-subject variance not important) 1992-1997 – Development of fMRI more samples per subject 1998 – Holmes & Friston introduce distinction between fixed and random effects analysis in conference presentation; reveal SPM had been fixed effects all along 1999 – Series of papers semi-defending fixed effects; but software fixed

51

52 RL & fMRI Common approach: fit models to behavior, use models to generate regressors for fMRI GLM –eg predicted value; error in predicted value –where in brain does BOLD signal correlate with computationally generated signal (convolved with HRF)? –quantify & study neural representation of subjective factors reward prediction error (O’Doherty et al 2003 and lots of other papers) Schoenberg et al 2007
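
A sketch of turning a model-derived trial-by-trial signal (e.g. the reward prediction error) into a parametric fMRI regressor by convolution with an HRF. The two-gamma HRF shape, time resolution, and function names here are illustrative assumptions, not the pipeline used in the cited papers.

```python
import numpy as np
from scipy.stats import gamma

def hrf(t):
    """Simple two-gamma haemodynamic response function (illustrative shape)."""
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

def parametric_regressor(onsets, amplitudes, tr, n_scans, dt=0.1):
    """Place model-derived amplitudes (e.g. prediction errors) at event onsets,
    convolve with the HRF, and downsample to one value per scan."""
    t_hi = np.arange(0, n_scans * tr, dt)
    stick = np.zeros_like(t_hi)
    for onset, amp in zip(onsets, amplitudes):
        stick[int(round(onset / dt))] += amp
    bold = np.convolve(stick, hrf(np.arange(0, 32, dt)))[:len(t_hi)]
    return bold[::int(round(tr / dt))]
```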

53 Examples: Value expectation (exactly the same approach is common in animal phys) Sugrue et al. (2004): primate LIP neurons Daw, O’Doherty et al. (2006): vmPFC activity in humans [figures: % signal change vs. probability of the chosen action]

54 note: can also fit parametric models to neural signals; compare neural & behavioral fits (Kable et al 2007; Tom et al 2007) note 2: must as always be suspicious about spurious correlations –still good to use controls (eg is regressor loading better in this condition than another)

55 Examples: loss aversion Tom et al (2007): compare loss aversion estimated from neural value signals to behavioral loss aversion from choices [figure: utility as a function of money]

56 example positional uncertainty in navigation task (Yoshida et al 2006) model: subjects assume they are someplace until proven wrong; then try assuming somewhere else estimate where subject thinks they are at each step

57 correlate uncertainty in position estimate with BOLD signal

58 summary trial and error learning & choice –interaction between the two –rich theory even for simple tasks model fits to choice behavior –hierarchical model of population –quantify subjective factors same methods for fMRI, ephys –but keep your wits about you

