
1 Università di Milano-Bicocca, Laurea Magistrale in Informatica. Course: APPRENDIMENTO AUTOMATICO (Machine Learning). Lecture 12: Reinforcement Learning. Prof. Giancarlo Mauri

2 Agenda
- Reinforcement Learning Scenario
- Dynamic Programming
- Monte-Carlo Methods
- Temporal Difference Learning

3 Reinforcement Learning Scenario
[Figure: agent-environment interaction loop. At each step the agent observes state s_t, chooses action a_t, receives reward r_t, and the environment moves to state s_{t+1}, generating the trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...]
Goal: learn to choose actions a_t that maximize future rewards r_0 + γ r_1 + γ² r_2 + …, where 0 < γ < 1 is a discount factor.
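As a quick illustration (the code and names are mine, not from the slides), the discounted return of a finite reward sequence can be computed in Python as follows:

```python
# Minimal sketch: discounted return r_0 + gamma*r_1 + gamma^2*r_2 + ...
# of a finite reward sequence (an illustrative helper, not part of the slides).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 100]))  # 0 + 0.9*0 + 0.81*100 ≈ 81
```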

4 Reinforcement Learning Scenario
- Some application domains:
  - A robot learning to dock to a battery station
  - Learning to choose actions that optimize factory output
  - Learning to play Backgammon
- Characteristics:
  - Delayed reward
  - No direct feedback (error signal) for good and bad actions
  - Opportunity for active exploration
  - Possibility that the state is only partially observable

5 Example [Tesauro 1995]
- Learning to play backgammon; reward:
  - +100 if win
  - -100 if lose
  - 0 for all other states
- Trained by playing 1.5 million games against itself
- Now approximately equals the best human player

6 Markov Decision Process (MDP)
- Finite set of states S
- Finite set of actions A
- At each time step the agent observes state s_t ∈ S and chooses action a_t ∈ A(s_t), then receives immediate reward r_t, and the state changes to s_{t+1}
- Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t)
  - Reward and next state depend only on the current state s_t and action a_t
  - The functions δ(s_t, a_t) and r(s_t, a_t) may be non-deterministic
  - The functions δ(s_t, a_t) and r(s_t, a_t) are not necessarily known to the agent
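A deterministic MDP of this kind can be written down directly as data. The sketch below is an assumption of mine mirroring the grid-world example on the later slides (the 2x3 layout, state names and helper names are illustrative, not given here); it defines δ and r for a small grid with an absorbing goal state G:

```python
# Sketch of a deterministic MDP as plain Python data.
# Layout (an assumption for illustration): top row s1 s2 G, bottom row s4 s5 s6.
GAMMA = 0.9
STATES = ["s1", "s2", "G", "s4", "s5", "s6"]
ACTIONS = ["up", "down", "left", "right"]

# delta(s, a): deterministic next state; unlisted moves keep the agent in place.
DELTA = {
    ("s1", "right"): "s2", ("s2", "right"): "G", ("s2", "left"): "s1",
    ("s1", "down"): "s4",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s5", "up"): "s2",   ("s6", "up"): "G",
    ("s4", "right"): "s5", ("s5", "right"): "s6", ("s5", "left"): "s4",
    ("s6", "left"): "s5",
}

def delta(s, a):
    if s == "G":                 # absorbing terminal state
        return "G"
    return DELTA.get((s, a), s)  # moves off the grid bounce back to s

def reward(s, a):
    # +100 on entering G, 0 otherwise (including inside G)
    return 100 if delta(s, a) == "G" and s != "G" else 0
```

The later sketches reuse `STATES`, `ACTIONS`, `GAMMA`, `delta` and `reward` as their running example.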

7 Learning Task
- Execute actions in the environment, observe the results, and learn a policy π : S → A that associates to every state s ∈ S an action a ∈ A, so as to maximize the expected reward E[r_t + γ r_{t+1} + γ² r_{t+2} + …] from any starting state s
- NOTE: 0 < γ < 1 is the discount factor for future rewards
- The target function is π : S → A
- But there are no direct training examples of the form ⟨s, a⟩
- Training examples are of the form ⟨⟨s, a⟩, r⟩

8 Learning Task
- Consider deterministic environments, i.e. δ(s,a) and r(s,a) are deterministic functions of s and a
- To evaluate a given policy π : S → A the agent might adopt the discounted cumulative reward over time:
  V^π(s) = r_0 + γ r_1 + γ² r_2 + … = Σ_{i=0…∞} γ^i r_i
  where r_0, r_1, … are the rewards generated by following policy π from start state s
- Task: learn the optimal policy π* that maximizes V^π(s):
  π* = argmax_π V^π(s), ∀s
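Under these deterministic assumptions, V^π(s) can be estimated by simply rolling the policy forward from s and summing discounted rewards. The sketch below is mine (the `horizon` truncation of the infinite sum and the function signature are additions, not from the slides):

```python
# Sketch: estimate V_pi(s) for a deterministic policy by rollout.
# `policy(s)` returns an action; `delta`/`reward` are the MDP functions above;
# `horizon` truncates the infinite discounted sum (gamma^100 is negligible).
def v_pi(s, policy, delta, reward, gamma=0.9, horizon=100):
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward(s, a)
        discount *= gamma
        s = delta(s, a)
    return total
```

With the grid sketch above, `v_pi("s1", lambda s: "right", delta, reward)` evaluates the always-go-right policy and returns about 90 (0 + 0.9·100).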

9 Alternative definitions
- Finite horizon reward: Σ_{i=0…h} r_i
- Average reward: lim_{h→∞} (1/h) Σ_{i=0…h} r_i

10 State and Action Value Functions
- The state value function denotes the reward for starting in state s and following policy π:
  V^π(s) = r_0 + γ r_1 + γ² r_2 + … = Σ_{i≥0} γ^i r_i
- The action value function denotes the reward for starting in state s, taking action a, and following policy π afterwards:
  Q^π(s,a) = r(s,a) + γ r_1 + γ² r_2 + … = r(s,a) + γ V^π(δ(s,a))
- NOTE: this suggests a dynamic programming approach:
  π*(s) = argmax_a (r(s,a) + γ V*(δ(s,a)))

11 Example
- G is a terminal (absorbing) state: upon entering G the agent obtains a reward of +100, remains in G forever, and obtains no further rewards
- Rewards for all other actions are 0
- γ = 0.9
[Figure: grid world with labelled states (s2, s3, s6, …) and goal state G; transitions into G are marked +100]

12 Example
One optimal policy:
[Figure: the same grid world with an arrow in each state indicating an optimal action leading toward G]

13 Example
V* values for each state:
[Figure: the grid world annotated with the V* value of each state (0 in G; 100, 90 and 81 in the other states, decreasing with distance from G)]

14 Example
Q(s,a) values:
[Figure: the grid world annotated with the Q(s,a) value of each state-action pair (0 inside G; 100, 90, 81 and 72 for the other transitions)]

15 What to learn
- We might try to have the agent learn the evaluation function V*
- It could then do a one-step lookahead search to choose the best action from any state s, because
  π*(s) = argmax_a (r(s,a) + γ V*(δ(s,a)))
- But: this works well if the agent knows the functions δ and r; when it doesn't, it can't choose actions this way
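The lookahead itself is a one-liner; the sketch below (my own names; `V` is assumed to be a dict of state values) makes the dependence on δ and r explicit, which is exactly the limitation pointed out above:

```python
# Sketch: one-step lookahead, pi*(s) = argmax_a [ r(s,a) + gamma * V(delta(s,a)) ].
# Choosing actions this way requires knowing delta and r.
def greedy_action(s, V, actions, delta, reward, gamma=0.9):
    return max(actions, key=lambda a: reward(s, a) + gamma * V[delta(s, a)])
```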

16 Q function
- Since π*(s) = argmax_a Q(s,a), if the agent learns Q it can choose the optimal action even without knowing δ

17 Learning Q function
- Note that Q and V* are closely related:
  V*(s) = max_a Q(s,a)
- This allows us to write Q recursively as
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
- Let Q̂ denote the learner's current approximation to Q, and consider the training rule
  Q̂(s,a) := r + γ max_{a'} Q̂(s', a')
  where s' is the state resulting from applying action a in state s

18 Learning Q function
- For each s, a initialize the table entry Q̂(s,a) := 0
- Observe the current state s
- Do forever:
  - Select an action a and execute it
  - Receive immediate reward r
  - Observe the new state s'
  - Update the table entry: Q̂(s,a) := r + γ max_{a'} Q̂(s', a')
  - s := s'
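A runnable version of this loop might look like the sketch below. The ε-greedy action selection, `env_step(s, a) -> (r, s_next)`, `is_terminal`, and the episode/step limits are my additions to make the loop executable; the update line is the deterministic rule from the slide:

```python
import random
from collections import defaultdict

# Sketch of tabular Q-learning for the deterministic case on this slide.
def q_learning(start_state, actions, env_step, is_terminal,
               gamma=0.9, epsilon=0.2, episodes=500, max_steps=200):
    Q = defaultdict(float)                          # Q_hat, initialised to 0
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            if is_terminal(s):
                break
            if random.random() < epsilon:           # explore
                a = random.choice(actions)
            else:                                   # exploit current estimate
                a = max(actions, key=lambda a: Q[(s, a)])
            r, s_next = env_step(s, a)
            # Q_hat(s,a) := r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```

With the grid sketch from slide 6, `env_step = lambda s, a: (reward(s, a), delta(s, a))` and `is_terminal = lambda s: s == "G"`; Q̂("s2", "right") then converges to 100 and Q̂("s1", "right") to 90.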

19 Iterative Policy Evaluation
- The Bellman equation used as an update rule for the action-value function:
  Q^π_{k+1}(s,a) = r(s,a) + γ Σ_{a'} π(δ(s,a), a') Q^π_k(δ(s,a), a')
- γ = 0.9
[Figure: successive sweeps of the Q-value table on the example grid world, starting from all zeros; value first appears at the transitions into G (100) and then propagates backwards through the grid over later sweeps (e.g. 45, 30, 23, … up to roughly 100, 69, 60, 52, 47, 44 after convergence)]
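A direct transcription of this update into code could look like the sketch below (my own function; `pi(s, a)` is assumed to return the probability of action a in state s, and a fixed number of sweeps replaces a proper convergence test):

```python
# Sketch of iterative policy evaluation with the synchronous update
# Q_{k+1}(s,a) = r(s,a) + gamma * sum_a' pi(s',a') * Q_k(s',a'),  s' = delta(s,a).
def evaluate_policy(states, actions, delta, reward, pi, gamma=0.9, sweeps=60):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        Q_next = {}
        for s in states:
            for a in actions:
                s2 = delta(s, a)
                Q_next[(s, a)] = reward(s, a) + gamma * sum(
                    pi(s2, a2) * Q[(s2, a2)] for a2 in actions)
        Q = Q_next
    return Q
```

With the grid sketch and the uniform random policy `pi = lambda s, a: 1.0 / len(ACTIONS)`, this reproduces the kind of sweep-by-sweep value propagation shown in the figure.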

20 Policy Improvement
- Suppose we have determined the value function V^π for an arbitrary deterministic policy π
- For some state s we would like to know whether it is better to choose an action a ≠ π(s)
- Selecting a and following the existing policy π afterwards gives us reward Q^π(s,a)
- If Q^π(s,a) > V^π(s), then a is obviously better than π(s)
- Therefore choose the new policy π' as
  π'(s) = argmax_a Q^π(s,a) = argmax_a [r(s,a) + γ V^π(δ(s,a))]
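The improvement step itself is just an argmax over the evaluated action values; a one-line sketch (my names, reusing a Q table keyed by (state, action) such as the one produced by the evaluation sketch above):

```python
# Sketch of greedy policy improvement: pi'(s) = argmax_a Q_pi(s, a).
def improve_policy(states, actions, Q):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```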

21 Example
π'(s) = argmax_a [r(s,a) + γ V^π(δ(s,a))]
[Figure: the grid world under the uniform random policy π(s,a) = 1/|A|, with reward +100 on entering G. The state values are V^π = 0 (G), 71, 63, 56, 61, 78; after greedy improvement the new policy's values are V^π' = 0 (G), 100, 90, 81, 90, 100]

22 Example
π'(s) = argmax_a Q^π(s,a)
[Figure: the grid world annotated with the converged Q^π values (100, 69, 60, 52, 47, 44, …); the improved policy picks the highest-valued action in each state]

23 Generalized Policy Iteration
- Intertwine policy evaluation (E) with policy improvement (I):
  π_0 →E V^π_0 →I π_1 →E V^π_1 →I π_2 →E … →I π* →E V*
- Evaluation: V → V^π (make the value function consistent with the current policy)
- Improvement: π → greedy(V) (make the policy greedy with respect to the current value function)
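Putting the two steps together gives plain policy iteration; the sketch below reuses `evaluate_policy` and `improve_policy` from the earlier sketches and stops when the greedy policy no longer changes (the initial policy, the probability wrapper for a deterministic policy, and the fixed-sweep evaluation are my simplifying choices):

```python
# Sketch of (generalized) policy iteration: alternate evaluation (E) and
# greedy improvement (I) until the policy is stable.
def policy_iteration(states, actions, delta, reward, gamma=0.9):
    policy = {s: actions[0] for s in states}         # arbitrary initial policy
    while True:
        # deterministic policy expressed as action probabilities for evaluation
        pi = lambda s, a, p=policy: 1.0 if p[s] == a else 0.0
        Q = evaluate_policy(states, actions, delta, reward, pi, gamma)
        new_policy = improve_policy(states, actions, Q)
        if new_policy == policy:                     # stable: stop improving
            return new_policy, Q
        policy = new_policy
```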

24 Value Iteration (Q-Learning)
- Idea: do not wait for policy evaluation to converge, but improve the policy after each iteration:
  V_{k+1}(s) = max_a (r(s,a) + γ V_k(δ(s,a)))
  or
  Q_{k+1}(s,a) = r(s,a) + γ max_{a'} Q_k(δ(s,a), a')
- Stop when ∀s: |V_{k+1}(s) - V_k(s)| < ε, or ∀s,a: |Q_{k+1}(s,a) - Q_k(s,a)| < ε
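A sketch of the value-iteration loop for the deterministic setting (the function name and the `theta` threshold parameter are mine):

```python
# Sketch of value iteration: V_{k+1}(s) = max_a [ r(s,a) + gamma * V_k(delta(s,a)) ],
# stopping once the largest change across states drops below `theta`.
def value_iteration(states, actions, delta, reward, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        V_next = {s: max(reward(s, a) + gamma * V[delta(s, a)] for a in actions)
                  for s in states}
        if max(abs(V_next[s] - V[s]) for s in states) < theta:
            return V_next
        V = V_next
```

Run on the grid sketch from slide 6, this reproduces the V* values of the example (0 for G and 100, 90, 81 for the remaining states).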

