
1
Unconditioned stimulus (food) causes unconditioned response (saliva). Conditioned stimulus (bell) causes conditioned response (saliva).

2
Rescorla-Wagner rule: v = wu, with u the stimulus (0 or 1), w the weight, and v the predicted reward. Adapt w to minimize the quadratic error (r − v)²; the gradient step gives the update Δw = ε(r − v)u.
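A minimal sketch of this rule for a single stimulus that is always present; the learning rate and trial count are illustrative assumptions:

```python
# Rescorla-Wagner update for one stimulus: v = w*u, and w follows the
# gradient of the quadratic error (r - v)^2. Values below are illustrative.
epsilon, w = 0.1, 0.0
for trial in range(50):
    u, r = 1.0, 1.0                 # stimulus present, reward delivered
    v = w * u                       # predicted reward
    w += epsilon * (r - v) * u      # gradient step on (r - v)^2 / 2
print(round(w, 3))                  # w -> 1: the stimulus fully predicts the reward
```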

3
The Rescorla-Wagner rule for multiple inputs (v = w·u) can predict various phenomena. Blocking: a learned association s1 → r prevents learning of the association s2 → r. Inhibition: s2 reduces the prediction when combined with any predicting stimulus.
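A minimal sketch of blocking with the vector form v = w·u, Δw = ε(r − v)u; the trial counts and learning rate are illustrative assumptions:

```python
import numpy as np

epsilon = 0.1
w = np.zeros(2)                      # weights for stimuli s1, s2

def train(u, r, n_trials):
    """Rescorla-Wagner updates for a fixed stimulus pattern u and reward r."""
    global w
    for _ in range(n_trials):
        v = w @ u                    # prediction from all present stimuli
        w += epsilon * (r - v) * u   # only present stimuli (u = 1) are updated

train(np.array([1.0, 0.0]), r=1.0, n_trials=100)   # pre-train s1 -> r
train(np.array([1.0, 1.0]), r=1.0, n_trials=100)   # then pair s1 + s2 with the same r
print(np.round(w, 2))   # ~[1, 0]: s1 already predicts r, so s2 learns nothing (blocking)
```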

4
Temporal difference learning: interpret v(t) as the total expected future reward in the trial; v(t) must be predicted from stimuli seen up to time t.

6
After learning, δ(t) = 0 implies: v(t = 0) is the sum of the expected future reward; where v(t) is constant, the expected reward r(t) = 0; where v(t) is decreasing, the expected reward is positive.
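A one-line rearrangement of the TD error (spelled out on the next slide as δ(t) = r(t) + v(t+1) − v(t)) makes these statements explicit:

```latex
% After learning, the TD error vanishes on average:
\delta(t) = \langle r(t) \rangle + v(t+1) - v(t) = 0
\quad\Longrightarrow\quad
\langle r(t) \rangle = v(t) - v(t+1).
% Hence: v constant  => <r(t)> = 0;   v decreasing => <r(t)> > 0.
% Summing from t' = t to T (with v(T+1) = 0) telescopes to
v(t) = \sum_{t'=t}^{T} \langle r(t') \rangle ,
% so v(0) is the total expected future reward in the trial.
```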

7
Explanation of fig. 9.2. Since u(t) = δ(t,0) (the stimulus appears only at t = 0), Eq. 9.6 becomes v(t) = w(t), and Eq. 9.7 becomes Δw(t) = ε δ(t). Thus Δv(t) = ε (r(t) + v(t+1) − v(t)). With r(t) = δ(t,T) (reward only at t = T): Step 1: the only change is v(T) → v(T) + ε. Step 2: v(T−1) and v(T) change. Etc.: the prediction propagates backwards, trial by trial, from the time of reward to the time of the stimulus.
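A minimal sketch of this trial-by-trial backward propagation; T, ε and the number of trials are illustrative choices, not the values used in the figure:

```python
import numpy as np

T, epsilon, n_trials = 20, 0.2, 200
w = np.zeros(T + 1)              # with u(t) = delta(t,0), v(t) = w(t)
r = np.zeros(T + 1)
r[T] = 1.0                       # r(t) = delta(t,T): reward only at t = T

for trial in range(n_trials):
    v = w.copy()                 # predictions at the start of the trial
    for t in range(T + 1):
        v_next = v[t + 1] if t < T else 0.0
        delta = r[t] + v_next - v[t]      # TD error
        w[t] += epsilon * delta           # delta w(t) = epsilon * delta(t)

print(np.round(w, 2))            # all values approach 1: the reward is predicted from t = 0
```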

8
Dopamine. The monkey releases one button and presses another after the stimulus to receive a reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to δ in the TD rule (fig. 9.2).

9
Dopamine. Dopamine neurons encode the reward prediction error (δ). B: withholding the reward reduces neural firing, in agreement with the δ interpretation.

10
Static action choice: rewards result from actions. Bees visit flowers whose color (blue, yellow) predicts the reward (sugar). The action values m_b, m_y encode the expected reward for each flower type; actions are chosen by a softmax in which β sets the amount of exploration.
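A minimal sketch of the softmax choice between the two flowers; the function name and parameter values are illustrative, not from the slides:

```python
import numpy as np

def choose_flower(m_blue, m_yellow, beta, rng=None):
    """Pick 'blue' or 'yellow' with probability proportional to exp(beta * m)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = beta * np.array([m_blue, m_yellow])
    p = np.exp(logits - logits.max())     # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(["blue", "yellow"], p=p)

# Small beta: near-random exploration; large beta: greedy exploitation.
print(choose_flower(m_blue=1.0, m_yellow=2.0, beta=0.5))
```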

11
The indirect actor model: learn the average nectar volume of each flower type and act accordingly, implemented by on-line learning. When the blue flower is visited, update m_b → m_b + ε(r_b − m_b) and leave the yellow estimate unchanged (and vice versa). Fig: r_b = 1, r_y = 2 for t = 1:100 and reversed for t = 101:200. A: m_y, m_b; B–D: cumulative reward for low β (B) and high β (C, D).
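A sketch of the indirect actor on this reversal task; the reward schedule follows the slide, while ε and β are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, beta = 0.1, 1.0
m = {"blue": 0.0, "yellow": 0.0}          # estimated nectar volumes (action values)

total_reward = 0.0
for t in range(200):
    r = {"blue": 1.0, "yellow": 2.0} if t < 100 else {"blue": 2.0, "yellow": 1.0}
    p_blue = 1.0 / (1.0 + np.exp(-beta * (m["blue"] - m["yellow"])))   # softmax choice
    a = "blue" if rng.random() < p_blue else "yellow"
    m[a] += epsilon * (r[a] - m[a])       # update only the flower that was visited
    total_reward += r[a]

print(m, total_reward)   # cumulative reward depends on the exploration level set by beta
```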

12
Bumble bees. Blue: r = 2 for every flower; yellow: r = 6 for 1/3 of the flowers (and 0 for the rest), so the mean reward is the same. When the contingencies are switched at t = 15, the bees adapt fast.

13
Bumble bees. Model the values as expected subjective utility, m = ⟨f(r)⟩, with f concave, so that m_b = f(2) is larger than m_y = (1/3) f(6); for example, with f(r) = √r, f(2) ≈ 1.41 > (1/3)√6 ≈ 0.82. The concave f makes the model risk-averse, preferring the constant blue reward.

14
Direct actor (policy gradient)

15
Direct actor: adjust the action values directly by stochastic gradient ascent on the average reward; when action a is chosen and reward r_a is received, m_a → m_a + ε(1 − P[a])(r_a − r̄) and m_b → m_b − ε P[b](r_a − r̄) for the other action, with r̄ a reference reward. Fig: two sessions as in fig. 9.4, one with good and one with bad behaviour. Problem: once the m values grow large the softmax saturates, which prevents exploration.
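A sketch of the direct actor on the same two-flower bandit, using the gradient-ascent update above; the learning rate, β and the reference reward r̄ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta, rbar = 0.1, 1.0, 1.5
m = np.zeros(2)                           # action values for (blue, yellow)
mean_r = np.array([1.0, 2.0])             # mean rewards as in fig. 9.4

for t in range(1000):
    p = np.exp(beta * m) / np.exp(beta * m).sum()   # softmax policy
    a = rng.choice(2, p=p)
    r = mean_r[a]
    for b in range(2):                    # gradient ascent on the expected reward
        grad = (1.0 if b == a else 0.0) - p[b]
        m[b] += eps * grad * (r - rbar)

print(np.round(m, 2), np.round(p, 2))     # probability mass moves toward the better flower
```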

16
Sequential action choice: the reward is obtained only after a sequence of actions, which creates the credit assignment problem.

17
Sequential action choice by policy iteration: –Critic: use TD learning to evaluate v(state) under the current policy. –Actor: improve the policy m(state) using these values.

18
Policy evaluation. The policy is a random left/right choice at each turn of the maze. Implemented as TD learning: v(u) → v(u) + ε δ with δ = r + v(u′) − v(u).
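A sketch of TD policy evaluation under the random policy on a two-level binary maze; the maze layout and reward values here are illustrative stand-ins, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
v = {"A": 0.0, "B": 0.0, "C": 0.0}
reward = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}
step = {("A", "L"): "B", ("A", "R"): "C"}      # interior transitions (reward 0)

for trial in range(2000):
    u = "A"
    while u is not None:
        a = rng.choice(["L", "R"])             # random left/right policy
        if (u, a) in step:
            u_next, r = step[(u, a)], 0.0
            delta = r + v[u_next] - v[u]
        else:                                  # leave the maze and collect the reward
            u_next, r = None, reward[(u, a)]
            delta = r - v[u]
        v[u] += eps * delta                    # v(u) <- v(u) + eps * delta
        u = u_next

print({k: round(x, 2) for k, x in v.items()})  # roughly {'A': 1.75, 'B': 2.5, 'C': 1.0}
```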

19
Policy improvement can be understood as the policy-gradient (direct actor) rule in which r_a − r̄ is replaced by the TD error δ = r + v(u′) − v(u), and m becomes state dependent, m(u; a). Example: the current state is A.

20
Policy improvement. Policy improvement changes the policy, so the policy must be re-evaluated for provable convergence. Interleaving policy improvement and policy evaluation at every step is called the actor-critic algorithm. Fig: actor-critic learning of the maze. NB: learning at C is slow, since C is rarely visited once the actor prefers the other arm.
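A sketch of actor-critic learning on the same illustrative maze: the critic runs the TD evaluation above, the actor applies the policy-gradient update with δ in place of r_a − r̄. All parameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
eps_v, eps_m, beta = 0.1, 0.1, 1.0
states, actions = ["A", "B", "C"], ["L", "R"]
v = {u: 0.0 for u in states}
m = {(u, a): 0.0 for u in states for a in actions}   # state-dependent action values
reward = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}
step = {("A", "L"): "B", ("A", "R"): "C"}

def policy(u):
    """Softmax over the action values in state u."""
    logits = np.array([beta * m[(u, a)] for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

for trial in range(3000):
    u = "A"
    while u is not None:
        p = policy(u)
        ia = rng.choice(2, p=p)
        a = actions[ia]
        if (u, a) in step:
            u_next, r = step[(u, a)], 0.0
            delta = r + v[u_next] - v[u]
        else:
            u_next, r = None, reward[(u, a)]
            delta = r - v[u]
        v[u] += eps_v * delta                            # critic: policy evaluation
        for ib, b in enumerate(actions):                 # actor: policy improvement
            m[(u, b)] += eps_m * ((1.0 if ib == ia else 0.0) - p[ib]) * delta
        u = u_next

# A learns to turn left toward B, B turns right, C turns left; note how learning
# at C slows down once A rarely visits it.
print({u: np.round(policy(u), 2) for u in states})
```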

21
Generalizations. Discounted reward: v(t) becomes the expected discounted sum of future rewards, and the TD error changes to δ(t) = r(t) + γ v(t+1) − v(t). TD(λ): apply the TD rule not only to the value of the current state but also to recently visited past states, weighted by an eligibility trace that decays with λ. TD(0) = plain TD; TD(1) updates all past states.
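A sketch of TD(λ) with an eligibility trace on a simple chain with a single terminal reward; the chain length, γ, λ and ε are illustrative assumptions:

```python
import numpy as np

n, gamma, lam, eps = 10, 0.9, 0.7, 0.1
v = np.zeros(n)

for episode in range(500):
    trace = np.zeros(n)               # eligibility of recently visited states
    for u in range(n):                # walk the chain; reward only on the last step
        r = 1.0 if u == n - 1 else 0.0
        v_next = 0.0 if u == n - 1 else v[u + 1]
        delta = r + gamma * v_next - v[u]   # discounted TD error
        trace[u] += 1.0               # mark the current state as eligible
        v += eps * delta * trace      # update current AND recently visited states
        trace *= gamma * lam          # decay: lam=0 -> plain TD, lam=1 -> all past states

print(np.round(v, 2))                 # values approach gamma**(n - 1 - u)
```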

22
Water maze. The state u is encoded by 493 place cells, and there are 8 movement directions (actions). The actor-critic rules are applied as before, with the value and the action preferences computed from the place-cell activities.
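A sketch of how the critic could be built on such a place-cell representation: the value is a linear readout of Gaussian place-cell activities, and the TD error trains the readout weights. The cell widths, learning rate and readout form are my assumptions, not details given on the slide:

```python
import numpy as np

rng = np.random.default_rng(5)
n_cells = 493
centres = rng.uniform(-1, 1, size=(n_cells, 2))   # place-cell centres in the pool
sigma, eps, gamma = 0.2, 0.05, 0.95
w = np.zeros(n_cells)

def f(pos):
    """Place-cell activities for a 2-D position."""
    d2 = ((centres - pos) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def td_update(pos, pos_next, r):
    """One critic step: v(u) = w . f(u), w <- w + eps * delta * f(u)."""
    global w
    v = w @ f(pos)
    v_next = 0.0 if pos_next is None else w @ f(pos_next)
    delta = r + gamma * v_next - v
    w += eps * delta * f(pos)
    return delta

# e.g. the final step of a swim that ends on the platform (reward 1):
td_update(np.array([0.3, -0.2]), None, r=1.0)
```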

23
Comparing rats and the model: RL predicts initial learning well, but not the rats' rapid adjustment when they are switched to a new task.

24
Markov decision process. State transitions P(u′ | u; a); absorbing states end a trial. Find a policy M(a; u) that maximizes the expected total future reward. Solution: solve the Bellman equation v*(u) = max_a [ ⟨r(u, a)⟩ + Σ_{u′} P(u′ | u; a) v*(u′) ].
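A sketch of solving the Bellman equation by value iteration for a small MDP version of the illustrative maze used above, with an absorbing exit state; the transition probabilities and rewards are my stand-in values:

```python
import numpy as np

n_states, n_actions = 4, 2                     # 0 = A, 1 = B, 2 = C, 3 = exit (absorbing)
P = np.zeros((n_states, n_actions, n_states))  # P[u, a, u'] = P(u' | u; a)
P[0, 0, 1] = P[0, 1, 2] = 1.0                  # A: left -> B, right -> C
P[1, :, 3] = P[2, :, 3] = 1.0                  # B and C both lead to the exit
R = np.zeros((n_states, n_actions))            # expected immediate reward r(u, a)
R[1, 1], R[2, 0] = 5.0, 2.0                    # B right = 5, C left = 2

v = np.zeros(n_states)
for _ in range(100):                           # value iteration
    # Bellman equation: v(u) = max_a [ r(u,a) + sum_u' P(u'|u;a) v(u') ]
    v_new = np.max(R + P @ v, axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

M = np.argmax(R + P @ v, axis=1)               # greedy policy M(u)
print(v, M)                                    # v = [5, 5, 2, 0]; A left, B right, C left
```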

25
Policy iteration is policy evaluation + policy improvement. Evaluation step: find the value of the current policy M, i.e. solve v(u) = Σ_a M(a; u) [ ⟨r(u, a)⟩ + Σ_{u′} P(u′ | u; a) v(u′) ]. RL evaluates the right-hand side stochastically, by sampling actions and transitions: v(u) → v(u) + ε δ(t).
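Under a fixed policy the evaluation equation is linear, so it can also be solved exactly; a sketch for the random policy on the same illustrative MDP (TD learning, as in the earlier sketches, estimates the same values stochastically):

```python
import numpy as np

P = np.zeros((4, 2, 4)); R = np.zeros((4, 2))  # same illustrative maze as above
P[0, 0, 1] = P[0, 1, 2] = 1.0
P[1, :, 3] = P[2, :, 3] = 1.0
R[1, 1], R[2, 0] = 5.0, 2.0
M = np.array([0.5, 0.5])                       # random left/right policy M(a; u)

P_M = np.einsum('a,uaw->uw', M, P)             # state-to-state transitions under M
r_M = R @ M                                    # expected one-step reward under M
v = np.linalg.solve(np.eye(4) - P_M, r_M)      # solve v = r_M + P_M v
print(np.round(v, 2))                          # [1.75, 2.5, 1.0, 0.0], matching the TD estimate
```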

26
Improvement step: in each state, maximize ⟨r(u, a)⟩ + Σ_{u′} P(u′ | u; a) v(u′) with respect to a. This requires knowledge of P(u′ | u; a). The earlier actor (policy-improvement) formula can be derived as a stochastic, model-free version of this step.
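A sketch of the model-based improvement step, using the P, R, v arrays from the value-iteration sketch above (illustrative names, not the book's code):

```python
import numpy as np

def improve_policy(P: np.ndarray, R: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Greedy policy with respect to the current value function."""
    q = R + P @ v                    # q[u, a] = r(u,a) + sum_u' P(u'|u;a) v(u')
    return np.argmax(q, axis=1)      # M(u) = argmax_a q[u, a]
```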
