
1 Unconditioned stimulus (food) causes the unconditioned response (saliva). Conditioned stimulus (bell) causes the conditioned response (saliva).

2 Rescorla-Wagner rule: v = w u, with u the stimulus (0 or 1), w the weight, and v the predicted reward. Adapt w to minimize the quadratic error ⟨(r − v)²⟩, which gives the delta rule w → w + ε δ u with δ = r − v.
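
A minimal Python sketch of this update, assuming a 0/1 stimulus and a unit reward; the learning rate and trial numbers are illustrative, not from the slides:

    def rescorla_wagner(rewards, stimuli, epsilon=0.1):
        """Scalar Rescorla-Wagner / delta rule: v = w*u, then w = w + eps*(r - v)*u."""
        w = 0.0
        weights = []
        for r, u in zip(rewards, stimuli):
            v = w * u                        # prediction for this trial
            w = w + epsilon * (r - v) * u    # delta rule, minimizes the squared error (r - v)^2
            weights.append(w)
        return weights

    # Example: stimulus always present, reward always delivered; w approaches 1.
    print(rescorla_wagner(rewards=[1.0] * 50, stimuli=[1] * 50)[-1])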

3 The Rescorla-Wagner rule for multiple inputs, v = w · u, can predict various phenomena: –Blocking: an already learned association s1 → r prevents learning of the association s2 → r when s1 and s2 are presented together –Inhibition: s2 acquires a negative weight and reduces the prediction when combined with any predicting stimulus
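
A small Python sketch of blocking under the vector form of the rule; the two-phase trial schedule and learning rate are illustrative assumptions:

    def rw_vector(trials, epsilon=0.2):
        """Vector Rescorla-Wagner rule: v = w.u, then w_i = w_i + eps*(r - v)*u_i."""
        w = [0.0, 0.0]                                   # weights for s1, s2
        for u, r in trials:
            v = sum(wi * ui for wi, ui in zip(w, u))     # prediction
            w = [wi + epsilon * (r - v) * ui for wi, ui in zip(w, u)]
        return w

    # Phase 1: s1 alone predicts the reward; phase 2: s1 and s2 together, same reward.
    phase1 = [((1, 0), 1.0)] * 50
    phase2 = [((1, 1), 1.0)] * 50
    print(rw_vector(phase1 + phase2))  # w1 near 1, w2 stays near 0: blocking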

4 Temporal difference learning: interpret v(t) as the total expected future reward from time t onward. v(t) is predicted from past stimuli, v(t) = Σ_{τ=0}^{t} w(τ) u(t − τ) (Eq. 9.6).

5

6 After learning, δ(t) = r(t) + v(t+1) − v(t) = 0, which implies: –v(t=0) is the sum of the expected future reward –if v(t) is constant, the expected reward r(t) = 0 –if v(t) is decreasing, the expected reward is positive, since r(t) = v(t) − v(t+1)

7 Explanation of fig. 9.2: since u(t) = δ_{t,0}, Eq. 9.6 becomes v(t) = w(t) and Eq. 9.7 becomes Δw(t) = ε δ(t), so Δv(t) = ε (r(t) + v(t+1) − v(t)). With r(t) = δ_{t,T}: Step 1: the only change is v(T) → v(T) + ε. Step 2: v(T−1) and v(T) change. And so on: the learned prediction propagates backwards in time, step by step.
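
A Python sketch of this simplified setting, where the values v(t) are updated directly because v(t) = w(t); the values of T, the number of trials, and the learning rate are illustrative assumptions:

    def td_trials(T=10, n_trials=50, epsilon=0.5):
        """TD learning with a stimulus at t=0 and a reward r(t) = delta(t, T).
        With u(t) = delta(t, 0) the prediction reduces to v(t) = w(t), so the
        values are updated directly: v(t) = v(t) + eps * (r(t) + v(t+1) - v(t))."""
        v = [0.0] * (T + 2)                       # v(T+1) stays 0 (after the reward)
        for _ in range(n_trials):
            for t in range(T + 1):
                r = 1.0 if t == T else 0.0
                delta = r + v[t + 1] - v[t]       # TD error
                v[t] += epsilon * delta
        return v[:T + 1]

    print(td_trials())  # values approach 1 for t = 0..T, as in fig. 9.2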

8 Dopamine: the monkey releases a button and presses another one after the stimulus to receive the reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to δ in the TD rule (fig. 9.2).

9 Dopamine: dopamine neurons encode the reward prediction error δ. B: withholding the reward reduces neural firing at the expected reward time, in agreement with the δ interpretation.

10 Static action choice: rewards result directly from actions. Bees visit flowers whose colour (blue, yellow) predicts the reward (sugar). –m_b, m_y are action values that encode the expected reward –p_b, p_y are the action probabilities, given by the softmax p_b = exp(β m_b) / (exp(β m_b) + exp(β m_y)); β controls the amount of exploration
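
A Python sketch of the softmax action choice; the β and action-value numbers are illustrative:

    import math
    import random

    def softmax_choice(m, beta):
        """Pick an action with probability proportional to exp(beta * m[a])."""
        z = [math.exp(beta * mi) for mi in m]
        p = [zi / sum(z) for zi in z]                    # action probabilities
        return random.choices(range(len(m)), weights=p)[0], p

    # Low beta gives near-random exploration; high beta gives greedy exploitation.
    action, probs = softmax_choice(m=[1.0, 2.0], beta=1.0)
    print(action, probs)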

11 The indirect actor model: learn the average nectar volume for each flower and act accordingly, implemented by on-line learning. When visiting a blue flower, update m_b → m_b + ε(r_b − m_b) and leave the yellow estimate m_y unchanged (and vice versa for yellow). Fig: r_b = 1, r_y = 2 for t = 1:100 and reversed for t = 101:200. A: m_y, m_b; B-D: cumulative reward for low β (B) and high β (C, D).
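
A Python sketch of the indirect actor in this two-flower task, assuming deterministic nectar volumes and an illustrative learning rate:

    import math
    import random

    def indirect_actor(epsilon=0.1, beta=1.0, T=200):
        """Indirect actor: track the mean reward per flower, choose via softmax."""
        m = {"blue": 0.0, "yellow": 0.0}
        total_reward = 0.0
        for t in range(1, T + 1):
            # Nectar volumes, swapped halfway through the session.
            r = {"blue": 1.0, "yellow": 2.0} if t <= 100 else {"blue": 2.0, "yellow": 1.0}
            z = {a: math.exp(beta * m[a]) for a in m}
            p = {a: z[a] / sum(z.values()) for a in m}
            a = random.choices(list(p), weights=list(p.values()))[0]
            m[a] += epsilon * (r[a] - m[a])              # update only the visited flower
            total_reward += r[a]
        return m, total_reward

    print(indirect_actor(beta=1.0))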

12 Bumble bees show risk aversion: –blue: r = 2 for all flowers; yellow: r = 6 for 1/3 of the flowers (and 0 otherwise); when the contingencies are switched at t = 15 the bees adapt quickly –A: average of 5 bees –B: a concave subjective utility function with m(2) > 2/3 m(0) + 1/3 m(6) favours risk avoidance –C: model prediction

13 Direct actor (policy gradient)

14 Direct actor: stochastic gradient ascent on the expected reward with respect to the action values m. Fig: two sessions as in fig. 9.4, showing good and bad behaviour. Problem: large values of m make the softmax nearly deterministic and prevent exploration.
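
A Python sketch of one standard form of this stochastic gradient update (a REINFORCE-style rule with a running-average reward as baseline); the baseline choice, learning rate, and session layout are assumptions, not taken from the slides:

    import math
    import random

    def direct_actor(epsilon=0.1, beta=1.0, T=200):
        """Direct actor: stochastic gradient ascent on the expected reward.
        The chosen action's value moves with (1 - p), the others with -p."""
        m = [0.0, 0.0]                     # action values for blue, yellow
        r_bar = 0.0                        # running average reward, used as baseline
        for t in range(1, T + 1):
            rewards = [1.0, 2.0] if t <= 100 else [2.0, 1.0]
            z = [math.exp(beta * mi) for mi in m]
            p = [zi / sum(z) for zi in z]
            a = random.choices([0, 1], weights=p)[0]
            r = rewards[a]
            for b in range(2):             # gradient of the log-softmax
                m[b] += epsilon * (r - r_bar) * ((1.0 if b == a else 0.0) - p[b])
            r_bar += 0.1 * (r - r_bar)     # slowly track the average reward
        return m

    print(direct_actor())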

15 Sequential action choice / delayed reward: the reward is obtained only after a sequence of actions. –The rat moves through the maze without backtracking; after obtaining the reward it is removed from the maze and the trial restarts. Delayed reward problem: –the choice at A yields no direct reward

16 Sequential action choice / delayed reward. Policy iteration (see also Kaelbling 3.2.2): Loop: –Policy evaluation: compute the value V_π of the current policy π by running Bellman backups until convergence –Policy improvement: improve π, e.g. make it greedy with respect to V_π

17 Sequential action choice / delayed reward. Actor-critic (see also Kaelbling 4.1): Loop: –Critic: use TD learning to evaluate V(state) under the current policy –Actor: improve the policy p(action; state)

18 Policy evaluation: the policy is a random left/right choice at each turn. Implemented as TD learning with w = v (one value per state): v(u) → v(u) + ε δ, with δ = r + v(u') − v(u).
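
A Python sketch of TD(0) evaluation of the random policy on a small binary maze of this kind; the maze layout and the leaf rewards (5 and 2) are illustrative assumptions:

    import random

    # Hypothetical maze: from A go left to B or right to C; B and C each have
    # two exits with the rewards below, after which the trial ends.
    REWARDS = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}

    def td_evaluate_random_policy(epsilon=0.1, n_trials=2000):
        """TD(0) under the random policy: v(u) = v(u) + eps * (r + v(next) - v(u))."""
        v = {"A": 0.0, "B": 0.0, "C": 0.0}
        for _ in range(n_trials):
            nxt = random.choice(["B", "C"])
            v["A"] += epsilon * (0.0 + v[nxt] - v["A"])          # no reward at A
            r = REWARDS[(nxt, random.choice(["L", "R"]))]
            v[nxt] += epsilon * (r + 0.0 - v[nxt])               # terminal value is 0
        return v

    print(td_evaluate_random_policy())  # roughly v(B)=2.5, v(C)=1.0, v(A)=1.75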

19 Policy improvement: base the action on the expected future reward of that action minus the expected current reward (the advantage of the action). Example, state A: compare the expected returns of moving to B versus C. Use ε-greedy or softmax action selection for exploration.
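
A minimal Python sketch of ε-greedy selection over such action preferences; ε and the preference values are illustrative:

    import random

    def epsilon_greedy(preferences, eps=0.1):
        """With probability eps explore uniformly, otherwise pick the best action."""
        if random.random() < eps:
            return random.randrange(len(preferences))
        return max(range(len(preferences)), key=lambda a: preferences[a])

    # At state A the preferences could be the estimated values of moving to B or C.
    print(epsilon_greedy([2.5, 1.0]))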

20 Policy improvement: because policy improvement changes the policy, the policy must be re-evaluated for provable convergence. Interleaving policy evaluation and policy improvement in this way is called actor-critic learning. Fig: actor-critic learning of the maze; note that learning at C is slow, since C is rarely visited once the policy prefers B.
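
A compact Python sketch of an actor-critic of this kind on the hypothetical maze from the TD-evaluation example above; the softmax actor, TD-error-driven preference update, and all parameter values are assumptions for illustration:

    import math
    import random

    REWARDS = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}
    NEXT = {("A", "L"): "B", ("A", "R"): "C"}        # all other moves end the trial

    def softmax(prefs, beta):
        z = [math.exp(beta * x) for x in prefs]
        return [zi / sum(z) for zi in z]

    def actor_critic(epsilon=0.1, beta=1.0, n_trials=3000):
        v = {s: 0.0 for s in ("A", "B", "C")}          # critic: state values
        m = {s: [0.0, 0.0] for s in ("A", "B", "C")}   # actor: preferences for L, R
        for _ in range(n_trials):
            u = "A"
            while u is not None:
                p = softmax(m[u], beta)
                a = random.choices([0, 1], weights=p)[0]
                act = "LR"[a]
                nxt = NEXT.get((u, act))               # None if the move is terminal
                r = REWARDS.get((u, act), 0.0)
                delta = r + (v[nxt] if nxt else 0.0) - v[u]    # TD error
                v[u] += epsilon * delta                         # critic update
                for b in range(2):                              # actor update
                    m[u][b] += epsilon * delta * ((1.0 if b == a else 0.0) - p[b])
                u = nxt
        return v, m

    values, preferences = actor_critic()
    print(values, preferences)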

21 Generalizations. Discounted reward: the TD rule changes to δ(t) = r(t) + γ v(t+1) − v(t), with discount factor γ. TD(λ): apply the TD update not only to the value of the current state but also to recently visited past states, weighted by an eligibility trace that decays with λ. TD(0) is the plain TD rule; TD(1) updates all past states of the trial.
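
A Python sketch of the TD(λ) update using an accumulating eligibility trace; the trace form and the values of γ and λ are illustrative assumptions:

    def td_lambda_episode(states, rewards, v, epsilon=0.1, gamma=0.9, lam=0.8):
        """One episode of TD(lambda): each state keeps an eligibility trace e(u)
        that decays by gamma*lambda, and every traced state shares each TD error."""
        e = {u: 0.0 for u in v}
        for t in range(len(rewards)):
            u, u_next = states[t], states[t + 1]
            delta = rewards[t] + gamma * v.get(u_next, 0.0) - v[u]   # discounted TD error
            e[u] += 1.0                                  # mark the current state
            for s in v:
                v[s] += epsilon * delta * e[s]           # update all traced states
                e[s] *= gamma * lam                      # decay the traces
        return v

    values = {"A": 0.0, "B": 0.0, "C": 0.0}
    print(td_lambda_episode(["A", "B", "end"], [0.0, 5.0], values))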

22 Water maze: state-dependent place-cell activity represents the rat's position (Foster et al., Eq. 1); 8 possible actions (movement directions); critic and actor as before (Foster et al., Eqs. 3-10).
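
A sketch of a Gaussian place-cell representation of the kind such models use; the field centres, width, and the exact form of Foster et al.'s Eq. 1 are assumptions here, not taken from the slides:

    import math

    def place_cell_activity(pos, centres, sigma=0.1):
        """Activity of each place cell as a Gaussian bump around its field centre."""
        return [
            math.exp(-((pos[0] - cx) ** 2 + (pos[1] - cy) ** 2) / (2 * sigma ** 2))
            for cx, cy in centres
        ]

    # A small grid of hypothetical place-field centres covering the pool.
    centres = [(x / 4, y / 4) for x in range(5) for y in range(5)]
    features = place_cell_activity((0.3, 0.7), centres)
    # The critic's value estimate would be a weighted sum of these activities.
    print(max(features))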

23 Comparing rats and model. Left: average performance of 12 rats, four trials per day. RL predicts initial learning well, but not the switch to a new task.

24 Markov decision process: state transitions P(u'|u, a); absorbing states end a trial. Find the policy M(a; u) that maximizes the expected total future reward from every state. Solution: solve the Bellman equation.
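
In its standard form (a reconstruction, since the slide's own equation did not survive in this transcript), the Bellman equation for the optimal values reads

    v*(u) = max_a [ ⟨r(u, a)⟩ + Σ_{u'} P(u'|u, a) v*(u') ],

and the optimal policy M*(u) selects the maximizing action a in each state u.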

25 Policy iteration is policy evaluation plus policy improvement. Evaluation step: find the value of a policy M, i.e. solve v(u) = Σ_a M(a; u) [ ⟨r(u, a)⟩ + Σ_{u'} P(u'|u, a) v(u') ]. RL evaluates the right-hand side stochastically by sampling: v(u) → v(u) + ε δ(t).

26 Improvement step: maximize the bracketed term { ⟨r(u, a)⟩ + Σ_{u'} P(u'|u, a) v(u') } with respect to a. This requires knowledge of P(u'|u, a). The earlier actor-critic formula can be derived as a stochastic version of this step.
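
A Python sketch of full, model-based policy iteration on the hypothetical maze used above, where P(u'|u, a) is deterministic and known; the earlier actor-critic rules are the stochastic, sampling-based counterpart that avoids needing this model:

    # Deterministic model of the hypothetical maze: (next state, reward);
    # a next state of None means the move ends the trial.
    MODEL = {
        "A": {"L": ("B", 0.0), "R": ("C", 0.0)},
        "B": {"L": (None, 0.0), "R": (None, 5.0)},
        "C": {"L": (None, 2.0), "R": (None, 0.0)},
    }

    def backup(u, a, v):
        """One Bellman backup: immediate reward plus the value of the next state."""
        nxt, r = MODEL[u][a]
        return r + (v[nxt] if nxt else 0.0)

    def policy_iteration():
        policy = {u: "L" for u in MODEL}                 # arbitrary initial policy
        while True:
            # Policy evaluation: exact, since the model is known and has no loops.
            v = {u: 0.0 for u in MODEL}
            for u in ("B", "C", "A"):                    # leaves first, then the root
                v[u] = backup(u, policy[u], v)
            # Policy improvement: act greedily with respect to v.
            new_policy = {u: max(MODEL[u], key=lambda a: backup(u, a, v)) for u in MODEL}
            if new_policy == policy:                     # converged to the optimum
                return policy, v
            policy = new_policy

    print(policy_iteration())  # goes to B from A and takes the rewarded arm at B and C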
