Reinforcement Learning I: prediction and classical conditioning

Presentation transcript:

Reinforcement Learning I: prediction and classical conditioning Peter Dayan Gatsby Computational Neuroscience Unit dayan@gatsby.ucl.ac.uk (thanks to Yael Niv)

Global plan. Reinforcement learning I: prediction; classical conditioning; dopamine. Reinforcement learning II: dynamic programming; action selection; Pavlovian misbehaviour; vigor. Chapter 9 of Theoretical Neuroscience.

Conditioning. Prediction: of important events. Control: in the light of those predictions. Ethology: optimality; appropriateness. Psychology: classical/operant conditioning. Computation: dynamic programming; Kalman filtering. Algorithm: TD/delta rules; simple weights. Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum.

Animals learn predictions (Ivan Pavlov). [Figure: Pavlov's experiment, with labels for the Conditioned Stimulus, the Unconditioned Stimulus, the Unconditioned Response (reflex) and the Conditioned Response (reflex).]

Animals learn predictions (Ivan Pavlov): very general across species, stimuli, behaviors.

But do they really? 1. Rescorla’s control: temporal contiguity is not enough; need contingency: P(food | light) > P(food | no light).
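The contingency condition can be checked on a trial record. A minimal sketch, assuming each trial is logged as a (light present, food delivered) pair; the two sessions and the function name `contingency` are hypothetical illustrations, not data from the slides:

```python
def contingency(trials):
    """Return (P(food | light), P(food | no light), difference) from (light, food) pairs."""
    with_light = [food for light, food in trials if light]
    without_light = [food for light, food in trials if not light]
    p_food_light = sum(with_light) / len(with_light)
    p_food_no_light = sum(without_light) / len(without_light)
    return p_food_light, p_food_no_light, p_food_light - p_food_no_light

# Hypothetical sessions: each trial is (light present?, food delivered?).
# Both sessions pair light with food equally often (same contiguity) ...
paired = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 0)] * 10
# ... but here food is just as likely without the light (zero contingency).
truly_random = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 8 + [(0, 0)] * 2

print(contingency(paired))
print(contingency(truly_random))
```

Both sessions contain the same number of light-food pairings; only the first satisfies P(food | light) > P(food | no light).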

But do they really? 2. Kamin’s blocking: contingency is not enough either… need surprise.

But do they really? 3. Reynolds’ overshadowing: it seems that stimuli compete for learning.

Theories of prediction learning: Goals. Explain how the CS acquires “value”. When (under what conditions) does this happen? Basic phenomena: gradual learning and extinction curves. More elaborate behavioral phenomena. (Neural data.) P.S. Why are we looking at old-fashioned Pavlovian conditioning? → It is the perfect uncontaminated test case for examining prediction learning on its own.

Rescorla & Wagner (1972). Error-driven learning: change in value is proportional to the difference between actual and predicted outcome. Assumptions: learning is driven by error (formalizes the notion of surprise); summation of predictors is linear. A simple model, but very powerful! Explains: gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more. Predicted overexpectation. Note: US as “special stimulus”.
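The two assumptions can be turned into a few lines of code. A minimal sketch, assuming the usual form of the rule (each present stimulus changes by a learning rate times reward minus the summed prediction); the learning rate `eta`, the trial counts and the stimulus names are illustrative choices, not values from the slides:

```python
def rescorla_wagner(trials, n_stimuli, eta=0.2):
    """Run the R-W rule over (stimuli_present, reward) trials; return final values."""
    V = [0.0] * n_stimuli
    for present, reward in trials:
        prediction = sum(V[i] for i in present)  # linear summation of predictors
        error = reward - prediction              # surprise: actual minus predicted
        for i in present:                        # only stimuli that are present learn
            V[i] += eta * error
    return V

LIGHT, TONE = 0, 1
# Blocking: light alone is trained first, then light + tone together.
blocking = [((LIGHT,), 1.0)] * 50 + [((LIGHT, TONE), 1.0)] * 50
# Control: only the compound phase.
control = [((LIGHT, TONE), 1.0)] * 50

V_block = rescorla_wagner(blocking, 2)
V_ctrl = rescorla_wagner(control, 2)
print("blocking:", V_block)
print("control: ", V_ctrl)
```

In the blocking schedule the light already predicts the reward, so the compound trials carry almost no error and the tone learns almost nothing. In the control schedule the two stimuli share the available value, which is the competition seen in overshadowing.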

Rescorla-Wagner learning. How does this explain acquisition and extinction? What would V look like with 50% reinforcement, e.g. 1 1 0 1 0 0 1 1 1 0 0? What would V be on average after learning? What would the error term be on average after learning?
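One way to explore the 50% reinforcement questions is a short simulation. This is a sketch under assumed settings (single stimulus, learning rate 0.1, rewards drawn independently with probability 0.5, fixed seed); none of these numbers come from the slides:

```python
import random

def run_partial_reinforcement(n_trials=20000, p_reward=0.5, eta=0.1, seed=0):
    """Single-stimulus R-W under random reinforcement; return per-trial V and errors."""
    rng = random.Random(seed)
    V = 0.0
    values, errors = [], []
    for _ in range(n_trials):
        r = 1.0 if rng.random() < p_reward else 0.0
        delta = r - V            # prediction error on this trial
        values.append(V)
        errors.append(delta)
        V += eta * delta         # R-W update
    return values, errors

values, errors = run_partial_reinforcement()
late_v, late_e = values[1000:], errors[1000:]   # discard the acquisition phase
mean_v = sum(late_v) / len(late_v)
mean_e = sum(late_e) / len(late_e)
print("mean V after learning:    ", round(mean_v, 3))
print("mean error after learning:", round(mean_e, 4))
```

V keeps fluctuating from trial to trial, but it hovers around the reinforcement probability, and the positive and negative errors cancel on average.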

Rescorla-Wagner learning. How is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …? The R-W rule estimates expected reward using a weighted average of past rewards; recent rewards weigh more heavily. Why is this sensible? Learning rate = forgetting rate!
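The weighted-average reading can be verified directly. A minimal sketch, assuming a single stimulus, an initial value of 0 and an illustrative learning rate of 0.3; unrolling the update gives a weight of eta * (1 - eta)^k to the reward k trials back:

```python
def rw_prediction(rewards, eta):
    """Apply the R-W update trial by trial, starting from V = 0."""
    V = 0.0
    for r in rewards:
        V += eta * (r - V)
    return V

def weighted_average(rewards, eta):
    """Same quantity written as an exponentially weighted sum of past rewards."""
    # The reward k trials back gets weight eta * (1 - eta)**k (k = 0 is the most recent).
    return sum(eta * (1 - eta) ** k * r for k, r in enumerate(reversed(rewards)))

rewards = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]   # reward sequence used as an example
eta = 0.3                                      # illustrative learning rate
print(rw_prediction(rewards, eta), weighted_average(rewards, eta))
```

A larger eta puts more weight on the latest reward and discounts older ones faster, which is the sense in which the learning rate is also the forgetting rate.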

Summary so far. Predictions are useful for behavior. Animals (and people) learn predictions (Pavlovian conditioning = prediction learning). Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality. Marr:

But: second order conditioning. Phase 1: [figure]. Phase 2: [figure]. Test: [figure]? What do you think will happen? What would Rescorla-Wagner learning predict here? Animals learn that a predictor of a predictor is also a predictor of reward! → Not interested solely in predicting immediate reward.

Let’s start over: this time from the top. Marr’s 3 levels. The problem: optimal prediction of future reward. What’s the obvious prediction error? What’s the obvious problem with this? We want to predict the expected sum of future reward in a trial/episode (N.B. here t indexes time within a trial).

Let’s start over: this time from the top. Marr’s 3 levels. The problem: optimal prediction of future reward. We want to predict the expected sum of future reward in a trial/episode: Bellman equation for policy evaluation.
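The slide's equation did not survive extraction. As a hedged reconstruction, the standard undiscounted form is sketched below, assuming t indexes time within a trial that ends at time T, and that the state at each time is fixed by t (so V_{t+1} needs no expectation of its own); the symbols T and \tau are assumptions introduced here:

```latex
\begin{align}
V_t &= \mathbb{E}\Big[\sum_{\tau = t}^{T} r_\tau\Big] \\
    &= \mathbb{E}[r_t] + \mathbb{E}\Big[\sum_{\tau = t+1}^{T} r_\tau\Big] \\
    &= \mathbb{E}[r_t] + V_{t+1}
\end{align}
```

The last line is the recursive consistency condition: a correct prediction now equals the expected immediate reward plus the prediction one step later.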

Let’s start over: this time from the top. Marr’s 3 levels. The problem: optimal prediction of future reward. The algorithm: temporal difference learning, driven by the temporal difference prediction error δ_t (compare to the Rescorla-Wagner error).

Prediction error. [Figure: value V_t and TD error, with events L and R marked, for three cases: no prediction; prediction, reward; prediction, no reward.]
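The three cases can be reproduced with a small tabular simulation. A minimal sketch, assuming one value per time step within a trial, the undiscounted TD error delta_t = r_t + V_{t+1} - V_t, and values before the predictive stimulus held at 0 because its arrival cannot be anticipated; the step counts, `CS` and `REWARD` times and learning rate are illustrative:

```python
def trial_errors(V, reward_time, reward):
    """TD errors delta_t = r_t + V[t+1] - V[t] for every step of one trial."""
    return [(reward if t == reward_time else 0.0) + V[t + 1] - V[t]
            for t in range(len(V) - 1)]

def train(n_steps=10, cs_time=2, reward_time=7, n_trials=500, eta=0.3):
    """Tabular TD learning over repeated identical trials; returns the learned values."""
    V = [0.0] * (n_steps + 1)              # V[n_steps] = 0: nothing follows the trial
    for _ in range(n_trials):
        for t in range(cs_time, n_steps):  # pre-stimulus values stay at 0 (assumption)
            r = 1.0 if t == reward_time else 0.0
            V[t] += eta * (r + V[t + 1] - V[t])
    return V

CS, REWARD = 2, 7
naive = [0.0] * 11
trained = train(cs_time=CS, reward_time=REWARD)

no_prediction = trial_errors(naive, REWARD, 1.0)       # error at the reward
predicted_reward = trial_errors(trained, REWARD, 1.0)  # error moves to stimulus onset
omitted_reward = trial_errors(trained, REWARD, 0.0)    # dip at the expected reward time

for name, errs in [("no prediction", no_prediction),
                   ("prediction, reward", predicted_reward),
                   ("prediction, no reward", omitted_reward)]:
    print(name, [round(e, 2) for e in errs])
```

In this indexing the error on the step into the stimulus appears at position `CS - 1`, since each entry describes the transition from t to t + 1.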

Summary so far. Temporal difference learning versus Rescorla-Wagner: derived from first principles about the future; explains everything that R-W does, and more (e.g. 2nd order conditioning); a generalization of R-W to real time.
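The second order conditioning difference can be made concrete. A sketch under stated assumptions: phase 1 pairs stimulus A with reward, phase 2 presents B followed by A with no reward, TD treats B and A as successive states, and R-W is given the phase 2 trial as a B + A compound. The stimulus names, trial counts and learning rate are hypothetical:

```python
def td_second_order(eta=0.2, n_phase1=50, n_phase2=5):
    """TD learning: B is followed by A, so A's value serves as B's target."""
    V = {"A": 0.0, "B": 0.0}
    for _ in range(n_phase1):                    # phase 1: A -> reward
        V["A"] += eta * (1.0 + 0.0 - V["A"])
    for _ in range(n_phase2):                    # phase 2: B -> A, no reward
        V["B"] += eta * (0.0 + V["A"] - V["B"])  # delta = r + V(next) - V(current)
        V["A"] += eta * (0.0 + 0.0 - V["A"])     # A itself starts to extinguish
    return V

def rw_second_order(eta=0.2, n_phase1=50, n_phase2=5):
    """R-W: phase 2 is seen as an unrewarded compound of A and B."""
    V = {"A": 0.0, "B": 0.0}
    for _ in range(n_phase1):                    # phase 1: A -> reward
        V["A"] += eta * (1.0 - V["A"])
    for _ in range(n_phase2):                    # phase 2: A + B, no reward
        error = 0.0 - (V["A"] + V["B"])
        V["A"] += eta * error
        V["B"] += eta * error
    return V

print("TD: ", td_second_order())
print("R-W:", rw_second_order())
```

Under TD, B acquires positive value although it is never paired with reward; under this reading of R-W, B ends up negative, i.e. a conditioned inhibitor. Because A is extinguishing during phase 2, B's value under TD is transient if phase 2 is prolonged.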

Back to Marr’s 3 levels. The problem: optimal prediction of future reward. The algorithm: temporal difference learning. Neural implementation: does the brain use TD learning?

Dopamine. Parkinson’s Disease → motor control + initiation? Intracranial self-stimulation; drug addiction; natural rewards → reward pathway? → learning? Also involved in: working memory; novel situations; ADHD; schizophrenia; … [Figure labels: Dorsal Striatum (Caudate, Putamen); Ventral Tegmental Area; Substantia Nigra; Amygdala; Nucleus Accumbens (Ventral Striatum); Prefrontal Cortex.]

Role of dopamine: many hypotheses. Anhedonia hypothesis; prediction error (learning, action selection); salience/attention; incentive salience; uncertainty; cost/benefit computation; energizing/motivating behavior.

Dopamine and prediction error. [Figure: value V_t and TD error, with events L and R marked, for three cases: no prediction; prediction, reward; prediction, no reward.]

Prediction error hypothesis of dopamine. The idea: dopamine encodes a reward prediction error. [Figures: Fiorillo et al, 2003; Tobler et al, 2005.]

Prediction error hypothesis of dopamine. [Figure: measured firing rate against model prediction error.] At end of trial: δ_t = r_t - V_t (just like R-W). Bayer & Glimcher (2005).

Where does dopamine project to? Basal ganglia: several large subcortical nuclei (unfortunate anatomical names follow structure rather than function, e.g. caudate + putamen + nucleus accumbens are all relatively similar pieces of striatum; but globus pallidus & substantia nigra each comprise two different things).

Where does dopamine project to? Basal ganglia: inputs to BG are from all over the cortex (and topographically mapped). Voorn et al, 2004.

Dopamine and plasticity. Prediction errors are for learning… Indeed: cortico-striatal synapses show dopamine-dependent plasticity (Wickens et al, 1996).

Corticostriatal synapses: 3 factor learning. [Figure: cortex (stimulus representation X1, X2, X3, …, XN) connects through adjustable synapses to striatum (learned values V1, V2, V3, …, VN). Prediction error (dopamine): VTA, SNc; PPTN, habenula etc.; R.] Explain 3 factor learning rule and contrast to normal Hebbian learning. But also amygdala; orbitofrontal cortex; ...
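The contrast with Hebbian learning can be sketched for a single synapse. This is an illustrative toy, not the model on the slide: it assumes the Hebbian change is presynaptic times postsynaptic activity, and that the three factor change multiplies that product by a dopamine-like prediction error (reward minus the value carried by the synapse). Function names and all numbers are hypothetical:

```python
def hebbian_update(w, pre, post, eta=0.1):
    """Two factors: presynaptic (cortical) and postsynaptic (striatal) activity."""
    return w + eta * pre * post

def three_factor_update(w, pre, post, dopamine, eta=0.1):
    """Three factors: the pre/post product is gated and signed by the dopamine signal."""
    return w + eta * dopamine * pre * post

w_hebb = 0.0
w_three = 0.0
for _ in range(100):
    pre, post, reward = 1.0, 1.0, 1.0      # stimulus, striatal activity, reward
    w_hebb = hebbian_update(w_hebb, pre, post)
    dopamine = reward - w_three * pre      # prediction error: reward minus learned value
    w_three = three_factor_update(w_three, pre, post, dopamine)

print("Hebbian weight:     ", round(w_hebb, 3))
print("three-factor weight:", round(w_three, 3))
```

With repeated pairing the Hebbian weight keeps growing, while the three factor weight settles where the prediction error vanishes, so the synapse stores a value instead of a co-activity count. A negative error (omitted reward) would weaken the three factor synapse, which the plain Hebbian rule cannot do.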

Punishment prediction error. [Figure: TD model value and prediction error for cue sequences ending in High Pain or Low Pain; numeric labels 0.8, 1.0, 0.2.]

Punishment prediction error. Experimental sequence: A – B – HIGH; C – D – LOW; C – B – HIGH; A – B – HIGH; A – D – LOW; C – D – LOW; A – B – HIGH; A – B – HIGH; C – D – LOW; C – B – HIGH; … MR scanner: brain responses. TD model: prediction error. Do they match? Ben Seymour; John O’Doherty.

punishment prediction error TD prediction error: ventral striatum Z=-4 R

punishment prediction dorsal raphe (5HT)? right anterior insula

generalization

aversion

opponency

Solomon & Corbit

Summary of this part: prediction and RL. Prediction is important for action selection. The problem: prediction of future reward. The algorithm: temporal difference learning. Neural implementation: dopamine dependent learning in BG. A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model. Precise (normative!) theory for generation of dopamine firing patterns. Explains anticipatory dopaminergic responding, second order conditioning. Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas.

Striatum and learned values. Striatal neurons show ramping activity that precedes a reward (and changes with learning!). [Figure: activity between trial start and food.] (Daw) (Schultz)

Phasic dopamine also responds to… novel stimuli; especially salient (attention grabbing) stimuli; aversive stimuli (??). Reinforcers and appetitive stimuli induce approach behavior and learning, but also have attention functions (elicit orienting response) and disrupt ongoing behaviour. → Perhaps DA reports salience of stimuli (to attract attention; switching) and not a prediction error? (Horvitz, Redgrave)