Learning Rules 2 Computational Neuroscience 03 Lecture 9.


Reinforcement Learning
In reinforcement learning we have a stimulus s, a reward r and an expected reward v. We represent the presence or absence of the stimulus by a binary variable u (apologies for any confusion over labels: this follows the convention in the literature). The expected reward is v = wu, where the weight w is established by a learning rule that minimises the mean square error between the expected reward and the actual reward (note the similarity to ANN training). Using this terminology we have the Rescorla-Wagner rule (1972):

w -> w + εδu,  with  δ = r - v

where ε is the learning rate (a form of stochastic gradient descent).

If ε is sufficiently small and u = 1 on all trials, the rule makes w fluctuate about the equilibrium value w = ⟨r⟩. Using the above rule we can get most of the classical conditioning paradigms, where -> indicates an association between one or two stimuli and a reward (r) or the absence of a reward (denoted ·); in the Result column the association is with an expectation of a reward:

Paradigm     Pre-train   Train                   Result
Pavlovian                s -> r                  s -> 'r'
Extinction   s -> r      s -> ·                  s -> '·'
Partial                  s -> r, s -> ·          s -> α'r'
Blocking     s1 -> r     s1 + s2 -> r            s1 -> 'r', s2 -> '·'
Inhibitory               s1 + s2 -> ·, s1 -> r   s1 -> 'r', s2 -> -'r'
Overshadow               s1 + s2 -> r            s1 -> α1'r', s2 -> α2'r'
Secondary    s1 -> r     s1 -> s2                s2 -> 'r'

For instance, here we can see acquisition, extinction and partial reinforcement. We can also get blocking, inhibitory conditioning and overshadowing. However, we cannot get secondary conditioning, due to the lack of a temporal dimension and the fact that the reward is delayed.
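
As an illustration of the rule above, here is a minimal NumPy sketch of Rescorla-Wagner learning for the acquisition-then-extinction paradigm in the table; the reward magnitude (1), the learning rate (0.1) and the number of trials are illustrative choices rather than values from the lecture.

import numpy as np

def rescorla_wagner(stimuli, rewards, eps=0.1):
    # Rescorla-Wagner: v = w.u, delta = r - v, w -> w + eps*delta*u
    w = np.zeros(stimuli.shape[1])
    history = []
    for u, r in zip(stimuli, rewards):
        v = w @ u                    # expected reward for this stimulus pattern
        delta = r - v                # prediction error
        w = w + eps * delta * u      # only weights of stimuli that are present change
        history.append(w.copy())
    return np.array(history)

# Acquisition then extinction: s -> r (r = 1) for 100 trials, then s -> no reward
stimuli = np.ones((200, 1))
rewards = np.concatenate([np.ones(100), np.zeros(100)])
w_hist = rescorla_wagner(stimuli, rewards)
print(w_hist[99, 0], w_hist[-1, 0])  # w rises towards <r> = 1, then decays back towards 0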

But how are these estimates of expected reward used to determine an animal's behaviour? The idea is that the animal develops a policy (plan of action) aimed at maximising the reward it gets; the policy is thus tied to its estimate of the reward. If the reward/punishment follows the action immediately, we have what is known as static action choice; if rewards are delayed until several actions have been completed, we have sequential action choice.

Static Action Choice
Suppose we have bees foraging in a field of 20 blue and 20 yellow flowers. Blue flowers give a reward r_b of nectar drawn from a probability distribution p(r_b); yellow flowers give a reward r_y of nectar drawn from a probability distribution p(r_y). Forgetting about the spatial aspects of foraging, we assume that at each timestep the bee is faced with a blue or a yellow flower and must choose between them: a task known as a stochastic two-armed bandit problem.

The bee follows a stochastic policy parameterised by two action values, m_b and m_y, which means it chooses flowers with probabilities P(b) and P(y), where it is convenient to choose the softmax form

P(b) = exp(βm_b) / (exp(βm_b) + exp(βm_y)),  P(y) = 1 - P(b)

Here the action values m_b and m_y parameterise the probabilities and are updated using a learning process based on expected and received rewards. If there are multiple actions, we use a vector of action values m. Note that P(b) = 1 - P(y), and that both are sigmoidal functions of β(m_b - m_y); the sensitivity of the probabilities to the action values is therefore governed by β.

Exploration vs Exploitation
If β is large and m_b > m_y, then P(b) is almost one => deterministic sampling: exploitation. A low β implies more random sampling (β = 0 => P(b) = P(y) = 0.5): exploration. Clearly we need a trade-off between exploration and exploitation, as we must keep sampling all flowers to get a good estimate of the rewards, but this comes at the cost of not collecting the optimal amount of nectar.
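
A small sketch of the softmax choice rule from the previous slide, showing how β controls the exploration/exploitation trade-off; the action values m_b = 1, m_y = 0 are arbitrary illustrations.

import numpy as np

def p_blue(m_b, m_y, beta):
    # Softmax choice probability: P(b) = exp(beta*m_b) / (exp(beta*m_b) + exp(beta*m_y)),
    # which is a sigmoidal function of beta*(m_b - m_y).
    return 1.0 / (1.0 + np.exp(-beta * (m_b - m_y)))

for beta in (0.0, 1.0, 50.0):
    print(beta, p_blue(1.0, 0.0, beta))
# beta = 0  -> P(b) = 0.5: pure exploration
# beta = 50 -> P(b) close to 1: near-deterministic exploitation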

Indirect Actor
The first learning scheme is to learn the average nectar volumes for each type of flower, i.e. to set m_b = ⟨r_b⟩ and m_y = ⟨r_y⟩. This is an indirect actor scheme, as the policy is mediated indirectly by the total expected nectar volumes received. Using the Rescorla-Wagner rule we saw that w stabilises at ⟨r⟩, so we use this reinforcement learning rule (with u = 1 always) to update the m's via

m_b -> m_b + εδ, δ = r_b - m_b  (when a blue flower is sampled)
m_y -> m_y + εδ, δ = r_y - m_y  (when a yellow flower is sampled)

Results for model bees using the indirect actor scheme: ⟨r_b⟩ = 2 and ⟨r_y⟩ = 1 for the first 100 visits, then the reward values are swapped (⟨r_b⟩ = 1 and ⟨r_y⟩ = 2) for the second 100. Panel A shows m_b and m_y; panels B-D show cumulative visits to each type of flower (B: β = 1; C and D: β = 50). From the results we can see that with a low β value (β = 1, panel B) learning is slow but the switch to the optimal flower colour is reliable. With a high β value (β = 50), we sometimes get optimal behaviour (C) but sometimes suboptimal behaviour (D). However, such a scheme would have trouble if, e.g., r_y = 2 always while r_b = 6 one third of the time and r_b = 0 two thirds of the time.
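
The experiment just described can be sketched as follows; for simplicity the nectar rewards are taken as deterministic (2 and 1) rather than drawn from distributions, and β = 1 as in panel B.

import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.1, 1.0                  # learning rate and temperature (beta = 1, as in panel B)
m = {'b': 0.0, 'y': 0.0}              # action values = estimates of <r_b> and <r_y>
visits = {'b': 0, 'y': 0}

def nectar(flower, visit):
    # mean rewards are 2 (blue) and 1 (yellow) for the first 100 visits, then swapped
    if visit < 100:
        return 2.0 if flower == 'b' else 1.0
    return 1.0 if flower == 'b' else 2.0

for visit in range(200):
    p_b = 1.0 / (1.0 + np.exp(-beta * (m['b'] - m['y'])))   # softmax choice probability
    flower = 'b' if rng.random() < p_b else 'y'
    r = nectar(flower, visit)
    m[flower] += eps * (r - m[flower])                       # Rescorla-Wagner update with u = 1
    visits[flower] += 1

print(m, visits)   # m_b and m_y track the current mean rewards; visits favour the better flower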

Direct Actor
Direct actor schemes try to maximise the expected reward directly, i.e. to maximise

⟨r⟩ = ⟨r_b⟩P(b) + ⟨r_y⟩P(y)

over time using stochastic gradient ascent. For the same task as on the previous slide, one run has quite good results (A, B) while another has bad results (C, D). Results for this rule are quite variable, and behaviour after the reward change can be poor. However, the direct actor is useful for seeing how action choice can be separated from action evaluation.
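
The slide does not spell out the direct-actor update. The sketch below uses one standard stochastic-gradient-ascent form (a REINFORCE-style rule with a running-average reward baseline r_bar); the reward values, learning rate and β are illustrative assumptions, not values from the lecture.

import numpy as np

rng = np.random.default_rng(1)
eps, beta = 0.1, 1.0
m_b, m_y, r_bar = 0.0, 0.0, 0.0       # action values and a running-average reward baseline

for visit in range(200):
    p_b = 1.0 / (1.0 + np.exp(-beta * (m_b - m_y)))
    chose_blue = rng.random() < p_b
    r = 2.0 if chose_blue else 1.0    # illustrative mean nectar rewards
    # REINFORCE-style gradient ascent on the expected reward (the beta = 1 factor is absorbed):
    if chose_blue:
        m_b += eps * (1 - p_b) * (r - r_bar)
        m_y -= eps * (1 - p_b) * (r - r_bar)
    else:
        m_y += eps * p_b * (r - r_bar)
        m_b -= eps * p_b * (r - r_bar)
    r_bar += 0.05 * (r - r_bar)       # slowly track the mean reward as the baseline

print(m_b, m_y)                        # m_b - m_y grows, so P(b) -> 1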

Temporal difference learning
Imagine we have a stimulus presented at t = 5 but the reward not given until t = 10. To be able to learn on the basis of future rewards, we need to add a temporal dimension to Rescorla-Wagner. Use a discrete time variable t, with 0 <= t <= T, so that the stimulus u(t), prediction v(t) and reward r(t) are all functions of t. Here v(t) is now interpreted as the total expected future reward from time t to T, as this provides a better match to empirical data, i.e.

v(t) = ⟨ Σ_{τ=0}^{T-t} r(t + τ) ⟩

and the learning rule becomes

w(τ) -> w(τ) + εδ(t)u(t - τ),  where  δ(t) = r(t) + v(t+1) - v(t)

How does this work? Imagine we have a trial 10 timesteps long, with a single stimulus at t = 5 and a reward of 0.5 at t = 10. For the case of a single stimulus we have

v(t) = Σ_{τ=0}^{t} w(τ)u(t - τ)

So: v(0) = w(0)u(0)
v(1) = w(0)u(1) + w(1)u(0)
v(2) = w(0)u(2) + w(1)u(1) + w(2)u(0)
v(3) = w(0)u(3) + w(1)u(2) + w(2)u(1) + w(3)u(0), etc.

So, since u(t) = 0 except at t = 5 where u = 1, we have v(t) = 0 for t < 5 and:
v(5) = w(0)u(5) = w(0), v(6) = w(1)u(5) = w(1), v(7) = w(2), v(8) = w(3), v(9) = w(4), v(10) = w(5)
i.e. v(t) = w(t - 5) for t >= 5.

At the start (trial 0) all w = 0, and therefore all v = 0. Remembering that δ(t) = r(t) + v(t+1) - v(t), we therefore get δ(t) = 0 for t < 10 and δ(10) = 0.5. Also, as when calculating the v's: since u(t) = 0 for all t ≠ 5 and u(5) = 1, when calculating the increase in w we need t - τ = 5, i.e. t = τ + 5. Therefore, setting ε = 0.1, we get:

Trial 1: from trial 0, δ(t) = 0 for t < 10 and δ(10) = 0.5.
w's: w(τ) only changes if τ + 5 = 10, i.e. τ = 5, so w(5) = εδ(10) = 0.1 × 0.5 = 0.05; all other w's stay zero as the other δ's are zero.
v's: v(t) = w(t - 5), so all v are zero apart from v(10) = w(5) = 0.05.
δ's: δ(10) = r(10) + v(11) - v(10) = 0.5 + 0 - 0.05 = 0.45
δ(9) = r(9) + v(10) - v(9) = 0 + 0.05 - 0 = 0.05
the rest are 0.

Trial 2: from trial 1, δ(10) = 0.45 and δ(9) = 0.05.
w's: now we need either τ + 5 = 10 (τ = 5) or τ + 5 = 9 (τ = 4), so:
w(5) -> w(5) + εδ(10) = 0.05 + 0.1 × 0.45 = 0.095
w(4) -> w(4) + εδ(9) = 0 + 0.1 × 0.05 = 0.005
other w's remain zero.
v's: unless t - 5 = 5 or t - 5 = 4, w(t - 5) = 0, so v(10) = w(5) = 0.095 and v(9) = w(4) = 0.005.
δ's: δ(10) = r(10) + v(11) - v(10) = 0.5 + 0 - 0.095 = 0.405
δ(9) = r(9) + v(10) - v(9) = 0 + 0.095 - 0.005 = 0.09
δ(8) = r(8) + v(9) - v(8) = 0 + 0.005 - 0 = 0.005
others zero.

Trial 100, w's: w(6) and above stay at 0, since the δ's they would add on are zero. w(5) and below keep increasing until they hit 0.5. Why do they stop there? If w(5) = 0.5 then v(10) = 0.5, so δ(10) = r(10) + v(11) - v(10) = 0.5 + 0 - 0.5 = 0, i.e. no further change to w(5). And if w(4) = 0.5, then v(10) = v(9) = 0.5 and δ(9) = r(9) + v(10) - v(9) = 0 + 0.5 - 0.5 = 0, so no further change to w(4). Likewise if w(3) = 0.5, δ(8) = 0, so no change, and so on.

Trial 100, v's: since w(0) to w(5) = 0.5 and the rest are zero, v(5) to v(10) = 0.5 and the rest are zero. And the δ's:
δ(10) = r(10) + v(11) - v(10) = 0.5 + 0 - 0.5 = 0
δ(9) = r(9) + v(10) - v(9) = 0 + 0.5 - 0.5 = 0
and the same for δ(8) down to δ(5), until we get to δ(4). Here v(5) = 0.5 but v(4) = 0, so:
δ(4) = r(4) + v(5) - v(4) = 0 + 0.5 - 0 = 0.5
But for δ(3), v(4) = v(3) = 0, so δ(3) = r(3) + v(4) - v(3) = 0 + 0 - 0 = 0, and the same for δ(2), δ(1) and δ(0).
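
A short sketch that reproduces the hand calculation above (stimulus at t = 5, reward 0.5 at t = 10, ε = 0.1, with v(T+1) taken as 0). Printing after trials 1, 2 and 100 should give the w and v values worked out on the previous slides, with w(0)-w(5) close to 0.5 by trial 100.

import numpy as np

T, eps = 10, 0.1
u = np.zeros(T + 1); u[5] = 1.0        # single stimulus at t = 5
r = np.zeros(T + 1); r[10] = 0.5       # reward of 0.5 at t = 10
w = np.zeros(T + 1)

def predictions(w):
    # v(t) = sum over tau = 0..t of w(tau) * u(t - tau)
    return np.array([sum(w[tau] * u[t - tau] for tau in range(t + 1)) for t in range(T + 1)])

for trial in range(1, 101):
    v = predictions(w)
    v_next = np.append(v[1:], 0.0)             # take v(T+1) = 0
    delta = r + v_next - v                     # delta(t) = r(t) + v(t+1) - v(t)
    for t in range(T + 1):                     # w(tau) -> w(tau) + eps * delta(t) * u(t - tau)
        for tau in range(t + 1):
            w[tau] += eps * delta[t] * u[t - tau]
    if trial in (1, 2, 100):
        print(trial, np.round(w[:6], 3), np.round(predictions(w), 3))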

A similar effect can be seen here (stimulus at t = 100, reward at t = 200).

Sequential Action Choice
Temporal difference (TD) learning is needed in cases where the reward does not follow immediately after the action. Consider the maze task below. While we could use static action choice to select actions at B and C, we do not know what reward we get for turning left at A. Instead we use policy iteration: we have a stochastic policy which is maintained and updated and which determines the action taken at each point.

Actor-Critic Learning
This has two elements: a critic, which uses TD learning to estimate the future reward from A, B and C if the current policy is followed, and an actor, which maintains and improves the policy based on the values from the critic. Effectively, the rat still uses static action choice at A, but based on the expectation of future reward provided by the critic.

E.g. a rat in a maze. Initially the rat has no preference for left or right, i.e. m = 0, so the probability of going either way is 0.5. Thus:
v(B) = 0.5(0 + 5) = 2.5, v(C) = 0.5(0 + 2) = 1, v(A) = 0.5(v(B) + v(C)) = 1.75
These are the future rewards expected if the rat explores the maze making random choices, and they can be learnt via TD learning: if the rat chooses action a at location u and ends up at u', then

v(u) -> v(u) + εδ,  where  δ = r_a(u) + v(u') - v(u)

We get results as above. Dashed lines are the correct expected rewards; a learning rate of 0.5 was used (fast but noisy). Thin solid lines are the actual values, thick lines are running averages of the weight values. The weights converge to the true values of the rewards. This process is known as policy evaluation.
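
A sketch of the critic (policy evaluation) for the maze under the random policy. The reward layout (one arm at B pays 5, one arm at C pays 2, nothing elsewhere) is assumed from the values quoted above, and the learning rate 0.5 matches the one used for the figure.

import numpy as np

rng = np.random.default_rng(2)
eps = 0.5
v = {'A': 0.0, 'B': 0.0, 'C': 0.0}
# assumed layout: from A, left goes to B and right goes to C (no reward on that move);
# at B one turn pays 5, at C one turn pays 2, all other turns pay 0 and end the trial
next_state = {('A', 'L'): 'B', ('A', 'R'): 'C'}
reward = {('A', 'L'): 0.0, ('A', 'R'): 0.0,
          ('B', 'L'): 0.0, ('B', 'R'): 5.0,
          ('C', 'L'): 2.0, ('C', 'R'): 0.0}

for trial in range(1000):
    u = 'A'
    while u is not None:
        a = 'L' if rng.random() < 0.5 else 'R'          # random policy, P = 0.5 each way
        u_next = next_state.get((u, a))                 # leaving B or C ends the trial
        v_next = v[u_next] if u_next is not None else 0.0
        delta = reward[(u, a)] + v_next - v[u]          # TD error
        v[u] += eps * delta                             # critic update: v(u) -> v(u) + eps*delta
        u = u_next

print(v)   # individual estimates fluctuate around v(A) = 1.75, v(B) = 2.5, v(C) = 1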

Now use policy improvement, where the worth to the rat of taking action a at u and moving to u' is the sum of the reward received and the rewards expected to follow, i.e. r_a(u) + v(u'). Policy improvement uses the difference δ between this quantity and the total expected reward v(u):

δ = r_a(u) + v(u') - v(u)

This value is then used to update the policy.

E.g. suppose we start from location A. Using the true values of the locations evaluated earlier, we get:
for a left turn: δ = 0 + v(B) - v(A) = 2.5 - 1.75 = 0.75
for a right turn: δ = 0 + v(C) - v(A) = 1 - 1.75 = -0.75
This means that the policy is adapted to increase the probability of turning left, as the learning rule increases the probability for δ > 0 and decreases it for δ < 0.

Strictly, the policy should be evaluated fully before it is improved, and it is more straightforward to improve the policy fully before it is re-evaluated. However, a convenient (but not provably correct) alternative is to interleave partial policy evaluation and policy improvement steps. This is known as the actor-critic algorithm, and it generates the results above.
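
A sketch of the interleaved actor-critic on the same assumed maze. The critic and the actor are driven by the same δ; the actor update used here is the simple form that raises the chosen action's value in proportion to δ, which increases its probability for δ > 0 and decreases it for δ < 0 as described above (a simplification of the full softmax-gradient rule).

import numpy as np

rng = np.random.default_rng(3)
eps, beta = 0.5, 1.0
v = {'A': 0.0, 'B': 0.0, 'C': 0.0}                       # critic
m = {(u, a): 0.0 for u in 'ABC' for a in 'LR'}           # actor's action values
next_state = {('A', 'L'): 'B', ('A', 'R'): 'C'}
reward = {('A', 'L'): 0.0, ('A', 'R'): 0.0,
          ('B', 'L'): 0.0, ('B', 'R'): 5.0,
          ('C', 'L'): 2.0, ('C', 'R'): 0.0}

def p_left(u):
    return 1.0 / (1.0 + np.exp(-beta * (m[(u, 'L')] - m[(u, 'R')])))

for trial in range(1000):
    u = 'A'
    while u is not None:
        a = 'L' if rng.random() < p_left(u) else 'R'
        u_next = next_state.get((u, a))
        v_next = v[u_next] if u_next is not None else 0.0
        delta = reward[(u, a)] + v_next - v[u]           # one TD error drives both updates
        v[u] += eps * delta                              # partial policy evaluation (critic)
        m[(u, a)] += eps * delta                         # partial policy improvement (actor)
        u = u_next

print({u: round(p_left(u), 2) for u in 'ABC'})           # P(left) rises at A and C, falls at B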

Actor-Critic Generalisations
The actor-critic rule can be generalised in a number of ways, e.g.:
1. Discounting rewards: more imminent rewards/punishments have more effect. When calculating the expected future reward, multiply each reward by γ^t, where t is the number of timesteps until the reward is received and 0 <= γ <= 1; the smaller γ, the stronger the effect of discounting. This can be implemented simply by changing δ to

δ(t) = r(t) + γv(t+1) - v(t)
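
In code, the only change needed for discounting is in the TD error; γ = 0.9 below is an arbitrary illustrative value.

def td_error(r_t, v_t, v_next, gamma=0.9):
    # Discounted TD error: delta(t) = r(t) + gamma * v(t+1) - v(t)
    return r_t + gamma * v_next - v_t

print(td_error(0.0, 0.0, 0.5))   # a reward predicted one step ahead is only worth gamma * 0.5 = 0.45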

2. Multiple sources of sensory information at a location. E.g. as well as there being a stimulus at a location, there is also a food scent. Instead of representing u by a binary variable, we therefore use a vector u which parameterises the sensory input (e.g. stimulus and scent would give a 2-element vector; the vectors for the maze would be u(A) = (1, 0, 0), u(B) = (0, 1, 0), u(C) = (0, 0, 1), where the sensory information is 'at A', 'at B' and 'at C'). Now v(u) = w·u, so w needs to be a vector of the same length, and the update becomes

w -> w + εδu

Similarly, we need M to be a matrix of action-value parameters, so that the vector of action values is m = M·u.
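
A sketch of the vector generalisation using the one-hot place vectors above; δ = 1.0 is a placeholder TD error, and the point is only the shapes of u, w and M.

import numpy as np

u_place = {'A': np.array([1.0, 0.0, 0.0]),
           'B': np.array([0.0, 1.0, 0.0]),
           'C': np.array([0.0, 0.0, 1.0])}
w = np.zeros(3)                 # critic weights, one per sensory component
M = np.zeros((2, 3))            # actor parameters: 2 actions (L, R) x 3 sensory components
eps, delta = 0.5, 1.0           # delta is a placeholder TD error for illustration

u = u_place['B']
v = w @ u                       # v(u) = w . u
m = M @ u                       # vector of action values: m = M . u
w = w + eps * delta * u         # w -> w + eps * delta * u (only the 'at B' component changes)
print(v, m, w)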

3. Learning is usually based on the difference between the current prediction and the immediate reward plus the prediction at the next timestep. Instead, we can base the learning rule on the sum of the next 2, 3 or more immediate rewards plus an estimate of future reward at a more temporally distant timestep. Using λ to weight our future estimates, this can be achieved using, e.g., the recursive rule

ū(t) = (1 - λ)u(t) + λū(t - 1)

This basically takes into account some measure of past activity: λ = 0 means the new ū is the standard u and no notice is taken of the past; λ = 1 means no notice is taken of the present.
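
A one-line implementation of the recursive trace; it assumes the form ū(t) = (1 - λ)u(t) + λū(t - 1), which matches the two limits quoted above but is a reconstruction rather than a formula given explicitly in the lecture.

def update_trace(u_bar, u_t, lam):
    # Recursive stimulus trace: u_bar(t) = (1 - lam) * u(t) + lam * u_bar(t - 1)
    # lam = 0 recovers the standard rule (no memory of the past); lam = 1 ignores the present input
    return (1.0 - lam) * u_t + lam * u_bar

u_bar = 0.0
for u_t in [0.0, 0.0, 1.0, 0.0, 0.0]:     # a single stimulus pulse
    u_bar = update_trace(u_bar, u_t, lam=0.5)
    print(round(u_bar, 3))                 # the trace decays gradually after the pulse: 0, 0, 0.5, 0.25, 0.125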