1 Model-based RL (+ action sequences): maybe it can explain everything
Niv lab meeting 6/11/2012 Stephanie Chan

2 Goal-directed vs. habitual instrumental actions
Habitual (after extensive training):
- Choose action based on previous actions/stimuli
- Sensorimotor cortices + DLS (putamen)
- Not sensitive to: reinforcer devaluation, changes in action-outcome contingency
- Usually modeled with: model-free RL
Goal-directed (after moderate training):
- Choose action based on expected outcome
- PFC & DMS (caudate)
- Usually modeled with: model-based RL
(A toy contrast of the two controllers follows below.)
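A toy contrast between the two controllers named above (my own illustration; the task, names, and values are made up): the goal-directed/model-based agent re-evaluates actions through an outcome model using current utilities, while the habitual/model-free agent acts from cached values and so ignores devaluation.

```python
# Hypothetical one-step task: each action leads deterministically to an outcome.
transition = {"press": "pellet", "enter_magazine": "nothing"}  # outcome model
utility = {"pellet": 1.0, "nothing": 0.0}                      # current outcome values
cached_Q = {"press": 1.0, "enter_magazine": 0.0}               # values cached during training

def model_based_choice():
    """Goal-directed: plan through the model using *current* outcome utilities."""
    return max(transition, key=lambda a: utility[transition[a]])

def model_free_choice():
    """Habitual: act on cached action values; no look-ahead to outcomes."""
    return max(cached_Q, key=cached_Q.get)

utility["pellet"] = -1.0        # reinforcer devaluation (e.g., pairing with illness)
print(model_based_choice())     # "enter_magazine": adjusts immediately
print(model_free_choice())      # "press": cached value is untouched
```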

3 Goal-directed vs. habitual instrumental actions
What do real animals do?

4 Model-free RL
Explains resistance to devaluation:
- Devaluation occurs in "extinction": no feedback, so no TD error to update the cached values
Does NOT explain resistance to changes in action-outcome contingency:
- In fact, habitual behavior should be MORE sensitive to changes in contingency
- Maybe: learning rates become small after extended training
(A minimal sketch of the extinction point follows below.)
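A minimal sketch of why extinction testing leaves model-free values untouched (hypothetical names and numbers, not the paper's model): the cached value that drives responding is only changed by TD errors, and an extinction test delivers no reward signal to generate one.

```python
import numpy as np

# One-state, two-action tabular Q-learner with cached values (illustrative only).
Q = np.array([0.0, 0.0])
alpha = 0.1

def update(action, reward):
    """Standard model-free TD update for a single-state problem."""
    Q[action] += alpha * (reward - Q[action])

# Training: action 0 reliably earns reward, so its cached value grows toward 1.
for _ in range(200):
    update(0, reward=1.0)

# Devaluation happens offline (e.g., the outcome is paired with illness) and the
# test is run in extinction: the agent never experiences the devalued outcome as
# a reward signal, so no TD error occurs and the cached value never changes.
print(Q[0])  # still ~1.0 -> model-free control keeps responding after devaluation
```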

5 Alternative explanation
- We don’t need model-free RL
- Habit formation = association of individual actions into “action sequences”
- More parsimonious
- A means of modeling action sequences

6 Over the course of training
- Exploration -> exploitation
- Variability -> stereotypy
- Errors and RT -> decrease
- Individual actions -> “chunked” sequences
- PFC + associative striatum -> sensorimotor striatum
- “Closed loop” -> “open loop”

7 When should actions get chunked?
- Q-learning with dwell time: Q(s,a) = R(s) + E[V(s')] - D(s)·<R> (D(s): dwell time in s; <R>: average reward per unit time)
- Chunk when costs (possible mistakes) are outweighed by benefits (decreased decision time)
- Cost: C(s,a,a') = E[Q(s',a') - V(s')] = E[A(s',a')]
- Efficient way to compute this: TD_t = [r_t - d_t·<R> + V(s_{t+1})] - V(s_t) is a sample of A(s_t, a_t)
- Benefit: (# timesteps saved) × <R>
(A rough sketch of this criterion follows below.)
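A rough sketch of the chunking criterion above (my own illustration; the function name and numbers are assumptions, not the paper's code): advantage samples for the candidate second action come from TD errors, the cost is their expected shortfall, and the benefit is the decision time saved valued at the average reward rate <R>.

```python
import numpy as np

def should_chunk(advantage_samples, avg_reward, decision_time_saved):
    """Chunk (s, a) -> a' when benefit outweighs cost.

    advantage_samples   : TD errors observed when a' was taken in s'
                          (each is a sample of A(s', a')).
    avg_reward          : running estimate of <R>, reward per unit time.
    decision_time_saved : time saved by skipping the decision at s'.
    """
    cost = -np.mean(advantage_samples)          # expected loss from acting open-loop
    benefit = decision_time_saved * avg_reward  # value of the time saved
    return benefit > cost

# Hypothetical numbers: a' is nearly optimal in s' (advantage samples just below
# zero), and committing to it saves half a timestep of deliberation.
samples = np.random.normal(-0.02, 0.05, size=500)
print(should_chunk(samples, avg_reward=0.3, decision_time_saved=0.5))  # likely True
```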

8 When do they get unchunked?
- C(s,a,a') is insensitive to changes in the environment: primitive actions are no longer evaluated, so there is no TD error and no new samples for C
- But <R> is sensitive to changes
- Action sequences get unchunked when the environment changes in a way that decreases <R>
- No unchunking if the environment changes to present a better alternative that would increase <R>
- Ostlund et al. 2009: rats are immediately sensitive to devaluation of the state that the macro action lands on, but not of the intermediate states
(The sketch above is continued below.)
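Continuing the same hypothetical sketch, the asymmetry on this slide falls out of the criterion: the cost term is frozen (no TD errors are generated inside an open-loop sequence), so only movements of <R> can break a chunk, and only downward ones do.

```python
import numpy as np

def should_chunk(advantage_samples, avg_reward, decision_time_saved):
    """Same criterion as the sketch above: stay chunked while benefit > cost."""
    return decision_time_saved * avg_reward > -np.mean(advantage_samples)

# The cost estimate is frozen: once the sequence runs open-loop there are no new
# TD errors, so these advantage samples are never refreshed.
frozen_samples = np.random.normal(-0.05, 0.02, size=500)

# Environment worsens (e.g., the outcome is devalued): <R> falls, the benefit
# drops below the frozen cost, and the sequence is unchunked and re-evaluated.
print(should_chunk(frozen_samples, avg_reward=0.05, decision_time_saved=0.5))  # False

# A better alternative appears elsewhere, raising <R>: the criterion still
# passes, so the now-suboptimal sequence is NOT unchunked.
print(should_chunk(frozen_samples, avg_reward=0.60, decision_time_saved=0.5))  # True
```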

9 Simulations I: SRTT-like task (serial reaction time task)

10 Simulations II: Instrumental conditioning
- Reinforcer devaluation
- Non-contingent reward delivery
- Omission schedule

