Journal club 09. 09. 2008 Marian Tsanov Reinforcement Learning.

Presentation transcript:

1 Journal club 09. 09. 2008 Marian Tsanov Reinforcement Learning

Slides 2–13: figures only (no transcript text).

14 Predicting future reward: temporal difference learning, actor-critic learning, SARSA learning, Q-learning. TD error: δ(t) = r(t) + γ V(s(t+1)) − V(s(t)), where V is the current value function implemented by the critic.
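
A minimal sketch of the TD(0) critic update implied by this error term. The tabular value function, state indices, learning rate, and discount factor below are illustrative assumptions, not values taken from the slides.

```python
# Minimal TD(0) sketch: the critic stores a value table V and updates it with the TD error.

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: delta = r + gamma*V[s'] - V[s]; V[s] += alpha*delta."""
    delta = r + gamma * V[s_next] - V[s]   # TD (prediction) error
    V[s] += alpha * delta                  # move V[s] toward the bootstrapped target
    return delta

# Toy example: a three-state chain, reward received on the transition from state 1 to 2.
V = {0: 0.0, 1: 0.0, 2: 0.0}
td_update(V, s=1, r=1.0, s_next=2)
```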

15 Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independently of the value function. The policy structure is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor.
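
A minimal tabular actor-critic sketch under assumed names (V for the critic's value table, prefs for the actor's action preferences); both structures learn from the same TD error, as described above. This is a generic illustration, not the specific circuit model discussed later in the slides.

```python
import math
import random

def softmax_sample(prefs):
    """Actor: sample an action from a softmax over its action preferences."""
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    threshold, acc = random.random(), 0.0
    for action, e in enumerate(exps):
        acc += e / z
        if threshold <= acc:
            return action
    return len(prefs) - 1

def actor_critic_step(V, prefs, s, a, r, s_next,
                      alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    """Critic computes the TD error; critic (V) and actor (prefs) both learn from it."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta          # critic: improve the value estimate
    prefs[s][a] += alpha_p * delta   # actor: reinforce the chosen action if delta > 0
    return delta
```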

16 Why is trial-to-trial variability needed for reinforcement learning? In reinforcement learning there is no "supervisor" that tells a neural circuit what to do with its input. Instead, the circuit has to try out different ways of processing the input until it finds a successful (i.e., rewarded) one. This process is called exploration.
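
In algorithmic terms, exploration is usually implemented by injecting variability into action selection. A common (though by no means the only) choice is epsilon-greedy selection; the sketch below is a generic illustration, not a claim about how the neural circuits in these slides implement it.

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the currently best-valued action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])
```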

17 The trial-by-trial learning rule known as the Rescorla-Wagner rule: the reward prediction is v = w · u, the prediction error is δ = r − v, and the weights are updated as w → w + ε δ u. Here ε is the learning rate, which can be interpreted in psychological terms as the associability of the stimulus with the reward.
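
A trial-by-trial sketch of this update. The stimulus vector u, reward r, and learning rate below are illustrative values, not data from the slides.

```python
def rescorla_wagner_trial(w, u, r, eps=0.1):
    """One trial: predict v = w.u, compute delta = r - v,
    then update each weight by eps * delta * u_i (eps = associability)."""
    v = sum(wi * ui for wi, ui in zip(w, u))   # prediction from current weights
    delta = r - v                              # prediction error
    w_new = [wi + eps * delta * ui for wi, ui in zip(w, u)]
    return w_new, delta

# Toy example: a single stimulus repeatedly paired with reward r = 1;
# the weight converges toward 1 over trials.
w = [0.0]
for _ in range(20):
    w, delta = rescorla_wagner_trial(w, u=[1.0], r=1.0)
```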

18 Dayan and Abbott, 2000

19 Foster, Morris and Dayan, 2000

20 The prediction error δ plays an essential role in both the Rescorla-Wagner and temporal difference learning rules; biologically, it is thought to be implemented by the dopaminergic neurons of the VTA.

Slides 21–23: figures only (no transcript text).

24 Actor-critic model and reinforcement learning circuits (Barnes et al., 2005).

25 In search of a critic: the striato-nigral problem. Proposed critic: striosomes of the dorsal striatum. Proposed actor: matrisomes of the dorsal striatum.

26 Circuit diagram. Critic circuit: lateral hypothalamus (LH, sensory-driven reward), ventral striatum (VS, nucleus accumbens), ventral pallidum (VP), pedunculopontine tegmental nucleus (PPTN), and substantia nigra pars compacta (SNc, dopamine), with inputs from prefrontal cortex, amygdala, and hippocampus. Actor circuit: sensory-motor cortex (SMC), dorsal striatum (DS, coincidence detector with a Hebbian weight target), and dorsal pallidum (DP, action).

27 Evidence for interaction between learning systems across regions: SNc and striosome recordings (DeCoteau et al., 2007; Schultz et al., 1993).

28 Q-learning
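
A minimal tabular Q-learning sketch, assuming a dictionary Q that maps each state to a list of action values; the learning rate and discount factor are illustrative, not values from the presentation.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a])."""
    target = r + gamma * max(Q[s_next])        # target uses the greedy next-state value
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Toy example: two states, two actions each.
Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```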

Slide 29: figure only (no transcript text).

30 Need for multiple critics/actors: adaptive state aggregation.

31 Neuronal activity when shifting the cue modality. Could current models explain these data: plain Q-learning? SARSA? A single actor-critic?
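
For comparison with the Q-learning rule sketched above, SARSA is the on-policy variant: its target uses the action actually selected in the next state, so exploratory choices feed back into the learned values. A minimal sketch with illustrative parameters:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: Q[s][a] += alpha * (r + gamma * Q[s'][a'] - Q[s][a]),
    where a' is the action actually taken in the next state."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```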

32 Switch between actors: first phase (tones) vs. second phase (textures). Actors used in the end (mean ± s.e.): N = 3: 2.64 ± 0.05; N = 4: 2.96 ± 0.06. (Mean ± s.e.): N = 2: 0.76 ± 0.12; N = 3: 0.64 ± 0.09; N = 4: 0.62 ± 0.09. (Mean ± s.e.): N = 2: 1.06 ± 0.07; N = 3: 1.06 ± 0.1; N = 4: 1.05 ± 0.09.

33 4 ACTORS

34 If the cortex/striatum can track the performance of the actors, then after the transfer there might be an initial bias toward the previously used actors (here we implemented the bias randomly). In that case the performance should be closer to the results with N = 2, even if more actors are available.
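
One hedged reading of this idea, assuming that "tracking the performance of the actors" means keeping a running average reward per actor and biasing actor selection toward actors that performed well before. The slides state the bias was implemented randomly, so this sketch illustrates the alternative, not the model actually used.

```python
import math
import random

def pick_actor(perf, temperature=1.0):
    """Choose an actor with probability increasing in its tracked average reward."""
    weights = [math.exp(p / temperature) for p in perf]
    z = sum(weights)
    threshold, acc = random.random(), 0.0
    for actor, w in enumerate(weights):
        acc += w / z
        if threshold <= acc:
            return actor
    return len(perf) - 1

def track_performance(perf, counts, actor, reward):
    """Keep a running average of the reward obtained while each actor was in control."""
    counts[actor] += 1
    perf[actor] += (reward - perf[actor]) / counts[actor]
```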

35 Motivation: how is the knowledge transferred to the second cue? What is a "state" in reinforcement learning? A place from which you are free to choose your next action, leading into other states. The representation of the environment should change: state aggregation. The knowledge transfer problem; state aggregation and sequence learning.

36 Sequence learning with theta-dependent STDP plasticity (DeCoteau et al., 2007). Diagram: repeated modules of SMC, DS (coincidence detector), and DP (action); the Hebbian weight target is replaced by an STDP weight target.

37 Unsupervised theta-dependent STDP (panels A and B).

38 SMC and LC/AC inputs before and after learning (audio cue): aggregation of actors in the dorsal striatum.

39 Algorithm: adaptive combination of states. Knowledge transfer: keep the learned states; number of activated states.
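
The slides do not spell out the aggregation rule. As one possible illustration, assume that "adaptive combination of states" means merging states whose learned values are similar, so the merged (abstract) states can be kept when the cue changes; this is a sketch of that assumption, not the authors' algorithm.

```python
def aggregate_states(V, tol=0.05):
    """Group states whose learned values lie within tol of a group representative.
    Returns a mapping state -> group index (a crude stand-in for an abstract state)."""
    groups, reps = {}, []                       # reps: one representative value per group
    for s, v in sorted(V.items(), key=lambda kv: kv[1]):
        for g, rv in enumerate(reps):
            if abs(v - rv) <= tol:
                groups[s] = g
                break
        else:
            groups[s] = len(reps)               # open a new group for this value
            reps.append(v)
    return groups

# Toy example: states 0 and 1 end up in the same group, state 2 in another.
aggregate_states({0: 0.90, 1: 0.93, 2: 0.40})
```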

40 State aggregation: the knowledge transfer effect. Plot of trial number vs. average running steps, comparing no aggregation with state aggregation.

41 State Number Reduction

42 Conclusions: state aggregation links to the learned actor; multiple motor layers and higher-level decision making; state aggregation as a change to abstract states of the motion selector; sequential learning; learning a pattern generator.

Slide 43: figure only (no transcript text).

