Presentation is loading. Please wait.

Presentation is loading. Please wait.

Reinforcement learning 2: action selection Peter Dayan (thanks to Nathaniel Daw)

Similar presentations

Presentation on theme: "Reinforcement learning 2: action selection Peter Dayan (thanks to Nathaniel Daw)"— Presentation transcript:

1 Reinforcement learning 2: action selection Peter Dayan (thanks to Nathaniel Daw)

2 Global plan Reinforcement learning I: (Wednesday) –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –sequential sensory decisions –vigor –Pavlovian misbehaviour Chapter 9 of Theoretical Neuroscience

3 Learning and Inference Learning: predict; control weight (learning rate) x (error) x (stimulus) –dopamine phasic prediction error for future reward –serotonin phasic prediction error for future punishment –acetylcholine expected uncertainty boosts learning –norepinephrine unexpected uncertainty boosts learning

4 Action Selection Evolutionary specification Immediate reinforcement: –leg flexion –Thorndike puzzle box –pigeon; rat; human matching Delayed reinforcement: –these tasks –mazes –chess Bandler; Blanchard

5 Immediate Reinforcement stochastic policy: based on action values: 5

6 6 Indirect Actor use RW rule: switch every 100 trials

7 Direct Actor

8 8

9 Could we Tell? correlate past rewards, actions with present choice indirect actor (separate clocks): direct actor (single clock):

10 Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome

11 Matching income not return approximately exponential in r alternation choice kernel

12 12 Action at a (Temporal) Distance learning an appropriate action at u=1 : –depends on the actions at u=2 and u=3 –gains no immediate feedback idea: use prediction as surrogate feedback

13 13 Action Selection start with policy: evaluate it: improve it: thus choose L more frequently than R 0.025 -0.175 -0.125 0.125

14 14 Policy value is too pessimistic action is better than average

15 actor/critic m1m1 m2m2 m3m3 mnmn dopamine signals to both motivational & motor striatum appear, surprisingly the same suggestion: training both values & policies

16 Variants: SARSA Morris et al, 2006

17 Variants: Q learning Roesch et al, 2007

18 Summary prediction learning –Bellman evaluation actor-critic –asynchronous policy iteration indirect method (Q learning) –asynchronous value iteration

19 Sensory Decisions as Optimal Stopping consider listening to: decision: choose, or sample

20 Optimal Stopping equivalent of state u=1 is and states u=2,3 is

21 Transition Probabilities

22 Evidence Accumulation Gold & Shadlen, 2007

23 Current Topics Vigour & tonic dopamine Priors over decision problems (LH) Pavlovian-instrumental interactions –impulsivity –behavioural inhibition –framing Model-based, model-free and episodic control Exploration vs exploitation Game theoretic interactions (inequity aversion)

24 24 Vigour Two components to choice: –what: lever pressing direction to run meal to choose –when/how fast/how vigorous free operant tasks real-valued DP

25 25 The model choose (action, ) = (LP, 1 ) 1 time Costs Rewards choose (action, ) = (LP, 2 ) Costs Rewards cost LP NP Other ? how fast 2 time S1S1 S2S2 vigour cost unit cost (reward) S0S0 URUR PRPR goal

26 26 The model choose (action, ) = (LP, 1 ) 1 time Costs Rewards choose (action, ) = (LP, 2 ) Costs Rewards 2 time S1S1 S2S2 S0S0 Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time) ARL

27 27 Compute differential values of actions Differential value of taking action L with latency when in state x ρ = average rewards minus costs, per unit time steady state behavior (not learning dynamics) (Extension of Schwartz 1993) Q L, (x) = Rewards – Costs + Future Returns Average Reward RL

28 28 Choose action with largest expected reward minus cost 1. Which action to take? slow delays (all) rewards net rate of rewards = cost of delay (opportunity cost of time) Choose rate that balances vigour and opportunity costs 2.How fast to perform it? slow less costly (vigour cost) Average Reward Cost/benefit Tradeoffs explains faster (irrelevant) actions under hunger, etc masochism

29 29 Optimal response rates Experimental data Niv, Dayan, Joel, unpublished 1 st Nose poke seconds since reinforcement Model simulation 1 st Nose poke seconds since reinforcementseconds

30 30 Optimal response rates Model simulation 50 0 0 % Reinforcements on lever A % Responses on lever A Model Perfect matching 020406080100 80 60 40 20 Pigeon A Pigeon B Perfect matching % Reinforcements on key A % Responses on key A Herrnstein 1961 Experimental data More: # responses interval length amount of reward ratio vs. interval breaking point temporal structure etc.

31 31 Effects of motivation (in the model) energizing effect RR25 low utility high utility mean latency LPOther energizing effect

32 32 Effects of motivation (in the model) U R 50% RR25 response rate / minute seconds from reinforcement directing effect 1 low utility high utility mean latency LPOther energizing effect 2

33 33 What about tonic dopamine? Phasic dopamine firing = reward prediction error moreless Relation to Dopamine

34 34 Tonic dopamine = Average reward rate NB. phasic signal RPE for choice/value learning Aberman and Salamone 1999 # LPs in 30 minutes 1416 64 500 1000 1500 2000 2500 Control DA depleted Model simulation # LPs in 30 minutes Control DA depleted 1.explains pharmacological manipulations 2.dopamine control of vigour through BG pathways eating time confound context/state dependence (motivation & drugs?) less switching=perseveration

35 35 Tonic dopamine hypothesis Satoh and Kimura 2003 Ljungberg, Apicella and Schultz 1992 reaction time firing rate …also explains effects of phasic dopamine on response times $$$$$$

36 Pavlovian & Instrumental Conditioning Pavlovian –learning values and predictions –using TD error Instrumental –learning actions: by reinforcement (leg flexion) by (TD) critic –(actually different forms: goal directed & habitual)

37 Pavlovian-Instrumental Interactions synergistic –conditioned reinforcement –Pavlovian-instrumental transfer Pavlovian cue predicts the instrumental outcome behavioural inhibition to avoid aversive outcomes neutral –Pavlovian-instrumental transfer Pavlovian cue predicts outcome with same motivational valence opponent –Pavlovian-instrumental transfer Pavlovian cue predicts opposite motivational valence –negative automaintenance

38 -ve Automaintenance in Autoshaping simple choice task –N: nogo gives reward r=1 –G: go gives reward r=0 learn three quantities –average value –Q value for N –Q value for G instrumental propensity is

39 -ve Automaintenance in Autoshaping Pavlovian action –assert: Pavlovian impetus towards G is v(t) –weight Pavlovian and instrumental advantages by ω – competitive reliability of Pavlov new propensities new action choice

40 -ve Automaintenance in Autoshaping basic –ve automaintenance effect (μ=5) lines are theoretical asymptotes equilibrium probabilities of action

41 Impulsivity & Hyperbolic Discounting humans (and animals) show impulsivity in: –diets –addiction –spending, … intertemporal conflict between short and long term choices often explained via hyperbolic discount functions alternative is Pavlovian imperative to an immediate reinforcer framing, trolley dilemmas, etc

42 Kalman Filter Markov random walk (or OU process) no punctate changes additive model of combination forward inference

43 Kalman Posterior ε ^

44 Assumed Density KF Rescorla-Wagner error correction competitive allocation of learning –P&H, M

45 Blocking forward blocking: error correction backward blocking: -ve off-diag

46 Mackintosh vs P&H under diagonal approximation: for slow learning, –effect like Mackintosh E

47 predictor competition Summary stimulus/correlation rerepresentation (Daw) Kalman filter models many standard conditioning paradigms elements of RW, Mackintosh, P&H but: –downwards unblocking –negative patterning L r; T r; L+T · –recency vs primacy (Kruschke)

48 Uncertainty (Yu) expected uncertainty - ignorance –amygdala, cholinergic basal forebrain for conditioning –?basal forebrain for top-down attentional allocation unexpected uncertainty – `set change –noradrenergic locus coeruleus part opponent; part synergistic interaction

49 ACh & NE have distinct behavioral effects: ACh boosts learning to stimuli with uncertain consequences NE boosts learning upon encountering global changes in the environment (e.g. Bear & Singer, 1986; Kilgard & Merzenich, 1998) ACh & NE have similar physiological effects suppress recurrent & feedback processing enhance thalamocortical transmission boost experience-dependent plasticity (e.g. Gil et al, 1997) (e.g. Kimura et al, 1995; Kobayashi et al, 2000) Experimental Data (e.g. Bucci, Holland, & Gallagher, 1998) (e.g. Devauges & Sara, 1990)

50 Model Schematics ACh expected uncertainty top-down processing bottom-up processing sensory inputs cortical processing context NE unexpected uncertainty prediction, learning,...

51 Attention attentional selection for (statistically) optimal processing, above and beyond the traditional view of resource constraint sensory input Example 1: Posners Task stimulus location cue sensory input cue high validity low validity stimulus location (Phillips, McAlonan, Robb, & Brown, 2000) cue target response 0.2-0.5s 0.1s 0.15s generalize to the case that cue identity changes with no notice

52 Formal Framework cues: vestibular, visual,... target: stimulus location, exit direction... variability in quality of relevant cue variability in identity of relevant cue ACh NE Sensory Information avoid representing full uncertainty

53 Simulation Results: Posners Task increase ACh validity effect % normal level 100120140 decrease ACh % normal level 1008060 vary cue validity vary ACh fix relevant cue low NE nicotine validity effect concentration scopolamine (Phillips, McAlonan, Robb, & Brown, 2000)

54 Maze Task example 2: attentional shift reward cue 1cue 2 reward cue 1cue 2 relevant irrelevant relevant (Devauges & Sara, 1990) no issue of validity

55 Simulation Results: Maze Navigation fix cue validity no explicit manipulation of ACh change relevant cue NE % Rats reaching criterion No. days after shift from spatial to visual task % Rats reaching criterion No. days after shift from spatial to visual task experimental data model data (Devauges & Sara, 1990)

56 Simulation Results: Full Model true & estimated relevant stimuli neuromodulation in action trials validity effect (VE)

57 Simulated Psychopharmacology 50% NE 50% ACh/NE ACh compensation NE can nearly catch up

58 Inter-trial Uncertainty single framework for understanding ACh, NE and some aspects of attention ACh/NE as expected/unexpected uncertainty signals experimental psychopharmacological data replicated by model simulations implications from complex interactions between ACh & NE predictions at the cellular, systems, and behavioral levels activity vs weight vs neuromodulatory vs population representations of uncertainty

59 Aston-Jones: Target Detection detect and react to a rare target amongst common distractors elevated tonic activity for reversal activated by rare target (and reverses) not reward/stimulus related? more response related? no reason to persist as unexpected uncertainty argue for intra-trial uncertainty explanation

60 Vigilance Model variable time in start η controls confusability one single run cumulative is clearer exact inference effect of 80% prior

61 Phasic NE NE reports uncertainty about current state state in the model, not state of the model divisively related to prior probability of that state NE measured relative to default state sequence start distractor temporal aspect - start distractor structural aspect target versus distractor

62 Phasic NE onset response from timing uncertainty (SET) growth as P( target )/0.2 rises act when P( target )=0.95 stop if P( target )=0.01 arbitrarily set NE=0 after 5 timesteps (small prob of reflexive action)

63 Four Types of Trial 19% 1.5% 1% 77% fall is rather arbitrary

64 Response Locking slightly flatters the model – since no further response variability

65 Task Difficulty set η=0.65 rather than 0.675 information accumulates over a longer period hits more affected than crs timing not quite right

66 Interrupts LC PFC/ACC

67 Intra-trial Uncertainty phasic NE as unexpected state change within a model relative to prior probability; against default interrupts ongoing processing tie to ADHD? close to alerting (AJ) – but not necessarily tied to behavioral output (onset rise) close to behavioural switching (PR) – but not DA farther from optimal inference (EB) phasic ACh: aspects of known variability within a state?

68 Learning and Inference Learning: predict; control weight (learning rate) x (error) x (stimulus) –dopamine phasic prediction error for future reward –serotonin phasic prediction error for future punishment –acetylcholine expected uncertainty boosts learning –norepinephrine unexpected uncertainty boosts learning

69 69 Conditioning Ethology –optimality –appropriateness Psychology –classical/operant conditioning Computation –dynamic progr. –Kalman filtering Algorithm –TD/delta rules –simple weights Neurobiology neuromodulators; amygdala; OFC nucleus accumbens; dorsal striatum prediction: of important events control: in the light of those predictions

70 Computational Neuromodulation dopamine –phasic: prediction error for reward –tonic: average reward (vigour) serotonin –phasic: prediction error for punishment? acetylcholine: –expected uncertainty? norepinephrine –unexpected uncertainty; neural interrupt?

71 class of stylized tasks with states, actions & rewards –at each timestep t the world takes on state s t and delivers reward r t, and the agent chooses an action a t Markov Decision Process

72 World:You are in state 34. Your immediate reward is 3. You have 3 actions. Robot:Ill take action 2. World: You are in state 77. Your immediate reward is -7. You have 2 actions. Robot: Ill take action 1. World:Youre in state 34 (again). Your immediate reward is 3. You have 3 actions. Markov Decision Process

73 Stochastic process defined by: –reward function: r t ~ P(r t | s t ) –transition function: s t ~ P(s t+1 | s t, a t )

74 Markov Decision Process Stochastic process defined by: –reward function: r t ~ P(r t | s t ) –transition function: s t ~ P(s t+1 | s t, a t ) Markov property –future conditionally independent of past, given s t

75 The optimal policy Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies Theorem: For every MDP there exists (at least) one deterministic optimal policy. by the way, why is the optimal policy just a mapping from states to actions? couldnt you earn more reward by choosing a different action depending on last 2 states?

Download ppt "Reinforcement learning 2: action selection Peter Dayan (thanks to Nathaniel Daw)"

Similar presentations

Ads by Google