Presentation on theme: "Summary of part I: prediction and RL"— Presentation transcript:
1Summary of part I: prediction and RL Prediction is important for action selectionThe problem: prediction of future rewardThe algorithm: temporal difference learningNeural implementation: dopamine dependent learning in BGA precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the modelPrecise (normative!) theory for generation of dopamine firing patternsExplains anticipatory dopaminergic responding, second order conditioningCompelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas
2prediction error hypothesis of dopamine model prediction errormeasured firing rateat end of trial: t = rt - Vt (just like R-W)Bayer & Glimcher (2005)
3Global plan Reinforcement learning I: Reinforcement learning II: predictionclassical conditioningdopamineReinforcement learning II:dynamic programming; action selectionPavlovian misbehaviourvigorChapter 9 of Theoretical Neuroscience
4Action Selection Evolutionary specification Immediate reinforcement: leg flexionThorndike puzzle boxpigeon; rat; human matchingDelayed reinforcement:these tasksmazeschessBandler;Blanchard
5Immediate Reinforcement stochastic policy:based on action values:
6Indirect Actoruse RW rule:switch every 100 trials
20Impulsivity & Hyperbolic Discounting humans (and animals) show impulsivity in:dietsaddictionspending, …intertemporal conflict between short and long term choicesoften explained via hyperbolic discount functionsalternative is Pavlovian imperative to an immediate reinforcerframing, trolley dilemmas, etc
21Direct/Indirect Pathways Frankdirect: D1: GO; learn from DA increaseindirect: D2: noGO; learn from DA decreasehyperdirect (STN) delay actions given strongly attractive choices
24Multiple Systems in RL model-based RL cached-based RL build a forward model of the task, outcomessearch in the forward model (online DP)optimal use of informationcomputationally ruinouscached-based RLlearn Q values, which summarize future worthcomputationally trivialbootstrap-based; so statistically inefficientlearn both – select according to uncertainty
30Human Canary... a b c if a c and c £££ , then do more of a or b? MB: bMF: a (or even no effect)
31Behaviour action values depend on both systems: expect that will vary by subject (but be fixed)
32Neural Prediction Errors (12) R ventral striatum(anatomical definition)note that MB RL does not use this prediction error – training signal?
33Neural Prediction Errors (1) right nucleus accumbensbehaviour1-2, not 1
34Vigour Two components to choice: real-valued DP what: lever pressingdirection to runmeal to choosewhen/how fast/how vigorousfree operant tasksreal-valued DP
35The model ? 1 time 2 time S0 S1 S2 vigour cost unit cost (reward) UR PRhow fast?LPNPOtherS0S1S21 time2 timechoose(action,) = (LP,1)CostsRewardschoose(action,) = (LP,2)CostsRewardsgoal
36The modelGoal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)S0S1S21 time2 timechoose(action,) = (LP,1)CostsRewardschoose(action,) = (LP,2)CostsRewardsARL
37Average Reward RL Compute differential values of actions Differential value of taking action L with latency when in state xρ = average rewards minus costs, per unit timeFuture ReturnsQL,(x) = Rewards – Costs +Mention that the model has few parameters (basically the cost constants and the reward utility) but we will not try to fit any of these, but just look at principlessteady state behavior (not learning dynamics)(Extension of Schwartz 1993)
38Average Reward Cost/benefit Tradeoffs 1. Which action to take?Choose action with largest expected reward minus costHow fast to perform it?slow less costly (vigour cost)slow delays (all) rewardsnet rate of rewards = cost of delay (opportunity cost of time)Choose rate that balances vigour and opportunity costsexplains faster (irrelevant) actions under hunger, etcmasochism
39Optimal response rates 1st Nose pokeNiv, Dayan, Joel, unpublishedExperimental datasecondsseconds since reinforcement1st Nose pokeseconds since reinforcementsecondsModel simulation
40% Reinforcements on lever A Optimal response rates20406080100Pigeon APigeon BPerfect matching% Reinforcements on key A% Responses on key AHerrnstein 1961Experimental dataModel simulation50% Reinforcements on lever A% Responses on lever AModelPerfect matchingMore:# responsesinterval lengthamount of rewardratio vs. intervalbreaking pointtemporal structureetc.
41Effects of motivation (in the model) RR25low utilityhigh utilitymean latencyLP Otherenergizingeffect
42seconds from reinforcement seconds from reinforcement Effects of motivation (in the model)RR25response rate / minuteseconds from reinforcementdirecting effect1response rate / minuteseconds from reinforcementUR 50%low utilityhigh utilitymean latencyLP Otherenergizingeffect2
43Relation to Dopamine Phasic dopamine firing = reward prediction error What about tonic dopamine?moreless
44Tonic dopamine = Average reward rate explains pharmacological manipulationsdopamine control of vigour through BG pathwaysAberman and Salamone 1999# LPs in 30 minutes1416645001000150020002500ControlDA depletedModel simulation# LPs in 30 minutesControlDA depletedeating time confoundcontext/state dependence (motivation & drugs?)less switching=perseverationNB. phasic signal RPE for choice/value learning
45Tonic dopamine hypothesis ♫$Satoh and Kimura 2003Ljungberg, Apicella and Schultz 1992reaction timefiring rate…also explains effects of phasic dopamine on response times
46Sensory Decisions as Optimal Stopping consider listening to:decision: choose, or sample
47Optimal Stoppingequivalent of state u=1 isand states u=2,3 is
49Computational Neuromodulation dopaminephasic: prediction error for rewardtonic: average reward (vigour)serotoninphasic: prediction error for punishment?acetylcholine:expected uncertainty?norepinephrineunexpected uncertainty; neural interrupt?
50Conditioning Ethology Psychology Computation Algorithm Neurobiology prediction: of important eventscontrol: in the light of those predictionsEthologyoptimalityappropriatenessPsychologyclassical/operantconditioningComputationdynamic progr.Kalman filteringAlgorithmTD/delta rulessimple weightsNeurobiologyneuromodulators; amygdala; OFCnucleus accumbens; dorsal striatum
51Markov Decision Process class of stylized tasks withstates, actions & rewardsat each timestep t the world takes on state st and delivers reward rt, and the agent chooses an action at
52Markov Decision Process World: You are in state 34.Your immediate reward is 3. You have 3 actions.Robot: I’ll take action 2.World: You are in state 77.Your immediate reward is -7. You have 2 actions.Robot: I’ll take action 1.World: You’re in state 34 (again).
53Markov Decision Process Stochastic process defined by:reward function: rt ~ P(rt | st)transition function: st ~ P(st+1 | st, at)
54Markov Decision Process Stochastic process defined by:reward function: rt ~ P(rt | st)transition function: st ~ P(st+1 | st, at)Markov propertyfuture conditionally independent of past, given st
55The optimal policyDefinition: a policy such that at every state, its expected value is better than (or equal to) that of all other policiesTheorem: For every MDP there exists (at least) one deterministic optimal policy. by the way, why is the optimal policy just a mapping from states to actions? couldn’t you earn more reward by choosing a different action depending on last 2 states?
57Pavlovian-Instrumental Interactions synergisticconditioned reinforcementPavlovian-instrumental transferPavlovian cue predicts the instrumental outcomebehavioural inhibition to avoid aversive outcomesneutralPavlovian cue predicts outcome with same motivational valenceopponentPavlovian cue predicts opposite motivational valencenegative automaintenance
58-ve Automaintenance in Autoshaping simple choice taskN: nogo gives reward r=1G: go gives reward r=0learn three quantitiesaverage valueQ value for NQ value for Ginstrumental propensity is
59-ve Automaintenance in Autoshaping Pavlovian actionassert: Pavlovian impetus towards G is v(t)weight Pavlovian and instrumental advantages by ω – competitive reliability of Pavlovnew propensitiesnew action choice
60-ve Automaintenance in Autoshaping basic –ve automaintenance effect (μ=5)lines are theoretical asymptotesequilibrium probabilities of action