Journal club 09. 09. 2008 Marian Tsanov Reinforcement Learning.

Presentation transcript:

Journal club Marian Tsanov Reinforcement Learning

Predicting Future Reward – Temporal Difference Learning: Actor-Critic learning, Sarsa learning, Q-learning. TD error: δ(t) = r(t) + γV(s_{t+1}) − V(s_t), where V is the current value function implemented by the critic.
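As a minimal illustration of the value update driven by this TD error (a sketch, not part of the slides; the chain of states, learning rate alpha and discount gamma are assumptions made for the example):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward r + gamma * V(s_next)."""
    delta = r + gamma * V[s_next] - V[s]   # TD error, as on the slide
    V[s] += alpha * delta
    return delta

# toy example: a 5-state chain, reward only on the last transition
V = np.zeros(5)
for episode in range(200):
    for s in range(4):
        r = 1.0 if s == 3 else 0.0         # reward on entering the terminal state
        td0_update(V, s, r, s + 1)
print(V)  # earlier states acquire discounted estimates of the future reward
```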

Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independently of the value function. The policy structure is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor.
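A minimal actor-critic sketch along these lines (illustrative only; the softmax actor, tabular critic and learning rates are my assumptions, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
V = np.zeros(n_states)                   # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))  # actor: action preferences

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_action(s):
    """Actor: sample an action from the softmax of its preferences."""
    return int(rng.choice(n_actions, p=softmax(prefs[s])))

def actor_critic_step(s, a, r, s_next, alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    """Critic computes the TD error; the actor uses the same error as its teaching signal."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta               # critic update
    prefs[s, a] += alpha_p * delta        # actor update: reinforce a if delta > 0
    return delta
```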

Why is trial-to-trial variability needed for reinforcement learning? In reinforcement learning there is no "supervisor" that tells a neural circuit what to do with its input. Instead, the circuit has to try out different ways of processing the input until it finds a successful (i.e., rewarded) one. This process is called exploration.
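Exploration is usually modelled by injecting randomness into action selection; an epsilon-greedy rule is one common sketch (my illustration, since the slides do not specify how the variability is generated):

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the currently best one (exploitation)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# the same trial-to-trial variability lets the agent discover rewarded actions
q = np.array([0.0, 0.2, -0.1])
actions = [epsilon_greedy(q) for _ in range(10)]
```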

Trial-by-trial learning rule known as the Rescorla-Wagner rule: w → w + ε δ u, with prediction v = w · u and prediction error δ = r − v. Here ε is the learning rate, which can be interpreted in psychological terms as the associability of the stimulus with the reward.
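A small trial-by-trial sketch of this rule (illustrative; the stimulus vectors, reward schedule and ε are made up for the example):

```python
import numpy as np

def rescorla_wagner(stimuli, rewards, eps=0.2):
    """Update association weights after each trial: w += eps * (r - w.u) * u."""
    w = np.zeros(stimuli.shape[1])
    for u, r in zip(stimuli, rewards):
        v = w @ u                 # current reward prediction
        delta = r - v             # prediction error
        w += eps * delta * u
    return w

# classical conditioning toy: one stimulus always paired with reward
u = np.ones((50, 1))              # 50 trials, single stimulus present
r = np.ones(50)                   # reward on every trial
print(rescorla_wagner(u, r))      # weight approaches 1 (full prediction)
```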

Dayan and Abbott, 2000

Foster, Morris and Dayan, 2000

The prediction error δ plays an essential role in both the Rescorla-Wagner and temporal difference learning rules; biologically, it is implemented by the VTA (ventral tegmental area).

Actor-critic model and reinforcement learning circuits (Barnes et al., 2005)

In search of a critic – the striato-nigral problem. Proposed critic: striosomes of the dorsal striatum. Proposed actor: matrisomes of the dorsal striatum.

[Circuit diagram: critic circuit and actor circuit. Abbreviations: DS – dorsal striatum (coincidence detector, Hebbian w target); SMC – sensory motor cortex; DP – dorsal pallidum (action); SNc – substantia nigra pars compacta (dopamine); VS – ventral striatum (nucleus accumbens); VP – ventral pallidum; PPTN – pedunculopontine tegmental nucleus; LH – lateral hypothalamus (sensory-driven reward); with inputs from prefrontal cortex, amygdala and hippocampus.]

Evidence for interaction between learning systems across regions (DeCoteau et al.; Schultz et al.) [figure panels: SNc, striosome]

Q-learning
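The slide gives only the title; for reference, a sketch of the standard off-policy Q-learning update, contrasted with Sarsa as mentioned earlier (the table sizes and parameters below are illustrative assumptions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Sarsa, by contrast, bootstraps from the action actually taken."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

Q = np.zeros((5, 2))   # toy table: 5 states x 2 actions
```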

Need for multiple critics/actors – adaptive state aggregation.

Neuronal activity when shifting the cue modality. Could current models explain these data? Plain Q-learning? Sarsa? A single actor-critic?

[Simulation results, first phase (tones) vs. second phase (textures): switches between actors and actors used in the end (mean ± s.e.).
N = 3: 2.64 ± 0.05; N = 4: 2.96 ± 0.06.
N = 2: 0.76 ± 0.12; N = 3: 0.64 ± 0.09; N = 4: 0.62 ± 0.09.
N = 2: 1.06 ± 0.07; N = 3: 1.06 ± 0.1; N = 4: 1.05 ± 0.09.]

4 ACTORS

If the cortex/striatum can track the performance of the actors, then after the transfer there might be an initial bias towards the previously used actors (here we implemented the choice randomly). In that case the performance should be closer to the results with N = 2, even if more actors are available.

Motivation: How is the knowledge transferred to the second cue? What is a "state" in reinforcement learning? A point where you are free to choose your next action leading into other states. The representation of the environment should change: state aggregation. The knowledge transfer problem – state aggregation and sequence learning.

[Diagram: three repeated DS–SMC–DP (action) coincidence-detector modules; Hebbian w target vs. STDP w target. Sequence learning through theta-dependent STDP plasticity (DeCoteau et al., 2007).]

[Figure, panels A and B: unsupervised theta-dependent STDP.]
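As background on the plasticity rule named here, a generic pairwise STDP window written as a sketch (not the theta-gated variant from the talk; the amplitudes and time constants are illustrative assumptions):

```python
import numpy as np

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for a pre/post spike pair separated by dt = t_post - t_pre (ms).
    Pre-before-post (dt > 0) potentiates; post-before-pre depresses."""
    if dt > 0:
        return a_plus * np.exp(-dt / tau_plus)
    return -a_minus * np.exp(dt / tau_minus)

# a theta-gated variant could additionally scale the update by theta phase
print(stdp_dw(+10.0), stdp_dw(-10.0))
```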

[Figure: SMC and LC/AC activity before vs. after learning (audio cue); aggregation of actors in the dorsal striatum.]

Algorithm: adaptive combination of states. Knowledge transfer: keep the learned states. Number of activated states.
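One way to read this is as a tabular agent that merges states whose learned values agree and keeps the merged (learned) states when the cue changes; the merging threshold and table layout below are my assumptions, not the talk's:

```python
import numpy as np

def aggregate_states(V, tol=0.05):
    """Group states whose value estimates differ by less than tol into one abstract state.
    Returns a mapping raw_state -> aggregated_state_id and the aggregated values."""
    mapping, centers = {}, []
    for s, v in enumerate(V):
        for k, c in enumerate(centers):
            if abs(v - c) < tol:
                mapping[s] = k
                break
        else:
            mapping[s] = len(centers)
            centers.append(v)
    return mapping, np.array(centers)

# knowledge transfer: when the cue modality switches, re-use the aggregated
# value table instead of relearning every raw state from scratch
V_phase1 = np.array([0.02, 0.04, 0.5, 0.52, 0.95])
mapping, V_abstract = aggregate_states(V_phase1)
print(mapping)      # e.g. {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
print(V_abstract)   # initial values carried over into the second phase
```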

State aggregation – the knowledge transfer effect. [Plot: trial number vs. average running steps, comparing no aggregation against state aggregation.]

State Number Reduction

Conclusion: State aggregation – link to the learned actor. Multiple motor layers and higher-level decision making. State aggregation: change to abstract states of the motion selector. Sequential learning: learning a pattern generator.