Università di Milano-Bicocca
Laurea Magistrale in Informatica
Corso di APPRENDIMENTO AUTOMATICO
Lecture 12 - Reinforcement Learning
Prof. Giancarlo Mauri
Agenda
- Reinforcement Learning Scenario
- Dynamic Programming
- Monte-Carlo Methods
- Temporal Difference Learning
Reinforcement Learning Scenario
[Figure: agent-environment interaction loop. At each step the agent observes state s_t, chooses action a_t, and receives reward r_t while the environment moves to state s_{t+1}, generating the trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...]
Goal: learn to choose actions a_t that maximize the future rewards r_0 + γ r_1 + γ² r_2 + ..., where 0 < γ < 1 is a discount factor.
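To make the objective concrete, here is a minimal Python sketch of computing a discounted return; the reward sequence is invented for illustration and does not come from the lecture:

```python
# Discounted return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one
# hypothetical episode (reward values are made up for illustration).
gamma = 0.9
rewards = [0, 0, 100]  # r_0, r_1, r_2

discounted_return = sum(gamma**i * r for i, r in enumerate(rewards))
print(discounted_return)  # 0 + 0.9*0 + 0.81*100 = 81.0
```

Note how the discount factor makes a reward received two steps in the future worth only 81% of an immediate one.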
Reinforcement Learning Scenario
Some application domains:
- A robot learning to dock to a battery charging station
- Learning to choose actions that optimize factory output
- Learning to play backgammon
Characteristics:
- Delayed reward
- No direct feedback (error signal) for good and bad actions
- Opportunity for active exploration
- Possibility that the state is only partially observable
Example [Tesauro 1995]
Learning to play backgammon:
- Reward +100 if the agent wins
- Reward -100 if it loses
- Reward 0 for all other states
Trained by playing 1.5 million games against itself; it now plays approximately as well as the best human players.
Markov Decision Process (MDP)
- Finite set of states S
- Finite set of actions A
At each time step the agent observes state s_t ∈ S and chooses action a_t ∈ A(s_t); it then receives immediate reward r_t, and the state changes to s_{t+1}.
Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. the reward and next state depend only on the current state s_t and action a_t.
- The functions δ(s_t, a_t) and r(s_t, a_t) may be non-deterministic
- The functions δ(s_t, a_t) and r(s_t, a_t) are not necessarily known to the agent
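As a concrete illustration, here is a minimal Python sketch of a deterministic MDP in this notation. The 2×3 grid layout, the state names s1...s6, and the convention that moves into a wall leave the state unchanged are my own encoding of the lecture's running grid-world example (goal state G, reward +100 on entering it):

```python
# Deterministic MDP sketch: delta(s, a) is the transition function,
# r(s, a) the reward function. Grid layout (goal G in the top-right):
#   s1 s2 G
#   s4 s5 s6
STATES = ["s1", "s2", "G", "s4", "s5", "s6"]
ACTIONS = ["up", "down", "left", "right"]

# Explicit transitions; pairs not listed leave the state unchanged.
DELTA = {
    ("s1", "right"): "s2", ("s2", "right"): "G",  ("s2", "left"): "s1",
    ("s1", "down"): "s4",  ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s5", "up"): "s2",    ("s6", "up"): "G",
    ("s4", "right"): "s5", ("s5", "right"): "s6", ("s5", "left"): "s4",
    ("s6", "left"): "s5",
}

def delta(s, a):
    return DELTA.get((s, a), s)  # bumping into a wall: stay put

def r(s, a):
    # +100 on any transition entering G; G itself is absorbing with reward 0
    return 100 if s != "G" and delta(s, a) == "G" else 0
```

With this model, delta("s2", "right") is "G" and r("s2", "right") is 100, matching the +100 transitions into the goal on the following slides.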
Learning Task
Execute actions in the environment, observe the results, and learn a policy π : S → A that associates with every state s ∈ S an action a ∈ A, so as to maximize the expected reward E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] from any starting state s.
NOTE: 0 < γ < 1 is the discount factor for future rewards.
The target function is π : S → A, but there are no direct training examples of the form <s, a>; training examples are of the form <<s, a>, r>.
Learning Task
Consider deterministic environments, namely δ(s, a) and r(s, a) are deterministic functions of s and a.
To evaluate a given policy π : S → A, the agent can use the discounted cumulative reward over time:
V^π(s) = r_0 + γ r_1 + γ² r_2 + ... = Σ_{i=0...∞} γ^i r_i
where r_0, r_1, ... are generated by following the policy π from start state s.
Task: learn the optimal policy π* that maximizes V^π(s):
π* = argmax_π V^π(s), ∀s
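Under these definitions, V^π can be approximated directly by rolling the policy forward; a minimal sketch, assuming the delta and r functions from the MDP sketch above and a deterministic policy represented as a dict from state to action:

```python
def v_pi(s, pi, delta, r, gamma=0.9, horizon=100):
    """Approximate V^pi(s) = sum_i gamma^i r_i by a truncated rollout."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi[s]                   # deterministic policy: state -> action
        total += discount * r(s, a)
        discount *= gamma
        s = delta(s, a)
    return total
```

Truncating at a finite horizon is safe here because γ^i shrinks the neglected tail geometrically.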
Alternative definitions
- Finite horizon reward: Σ_{i=0...h} r_i
- Average reward: lim_{h→∞} (1/h) Σ_{i=0...h} r_i
State and Action Value Functions
The state value function V^π(s) denotes the reward for starting in state s and following policy π:
V^π(s) = r_0 + γ r_1 + γ² r_2 + ... = Σ_{i=0...∞} γ^i r_i
The action value function Q^π(s, a) denotes the reward for starting in state s, taking action a, and following policy π afterwards:
Q^π(s, a) = r(s, a) + γ r_1 + γ² r_2 + ... = r(s, a) + γ V^π(δ(s, a))
NOTE: this suggests a dynamic programming approach:
π*(s) = argmax_a (r(s, a) + γ V*(δ(s, a)))
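The identity Q^π(s, a) = r(s, a) + γ V^π(δ(s, a)) is a one-liner given the model; a minimal sketch, assuming delta and r as before and v as a dict of state values under the policy:

```python
def q_from_v(s, a, v, delta, r, gamma=0.9):
    # Q^pi(s, a) = r(s, a) + gamma * V^pi(delta(s, a))
    return r(s, a) + gamma * v[delta(s, a)]
```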
Example
G is a terminal state: upon entering G the agent obtains a reward of +100, remains in G forever, and obtains no further rewards. The rewards for all other actions are 0. γ = 0.9.
[Figure: 6-state grid world with goal state G in a corner; arrows mark the immediate rewards, +100 on the transitions into G and 0 elsewhere.]
Example
One optimal policy:
[Figure: the same grid world with an arrow in each state pointing along a shortest path toward G.]
Example
V* values for each state:
[Figure: the grid world annotated with V* in each state: 100 for the states one step from G, 90 for those two steps away, 81 for the farthest state, and 0 for G itself.]
Example
Q(s, a) values:
[Figure: the grid world annotated with Q(s, a) for each state-action pair; the values shown are 100, 90, 81, and 72, with 0 for actions taken in G.]
What to learn
We might try to have the agent learn the evaluation function V*. It could then do a one-step lookahead search to choose the best action from any state s, because
π*(s) = argmax_a (r(s, a) + γ V*(δ(s, a)))
But: this works well only if the agent knows the functions δ and r. When it doesn't, it can't choose actions this way.
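The one-step lookahead itself is straightforward when the model is available; a minimal sketch, with delta, r, and a dict v_star of optimal state values as assumed inputs:

```python
def greedy_action(s, v_star, delta, r, actions, gamma=0.9):
    # pi*(s) = argmax_a (r(s, a) + gamma * V*(delta(s, a)))
    return max(actions, key=lambda a: r(s, a) + gamma * v_star[delta(s, a)])
```

Note that both delta and r appear explicitly, which is exactly why this rule is unusable when the agent does not know them.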
Q function
Since π*(s) = argmax_a Q(s, a), if the agent learns Q it can choose the optimal action even without knowing the functions δ and r.
Learning the Q function
Note that Q and V* are closely related:
V*(s) = max_a Q(s, a)
which allows us to write Q recursively as
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Let Q̂ denote the learner's current approximation to Q, and consider the training rule
Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s.
Learning the Q function
- For each s, a initialize Q̂(s, a) := 0
- Observe the current state s
- Do forever:
  - Select an action a and execute it
  - Receive the immediate reward r
  - Observe the new state s'
  - Update Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
  - s := s'
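A minimal Python sketch of this loop for the deterministic case. The env_step(s, a) -> (r, s') interface is an assumption standing in for "execute the action and observe the result", and the ε-greedy action selection is my addition, since the slide leaves the selection strategy open:

```python
import random
from collections import defaultdict

def q_learning(actions, env_step, start, episodes=1000,
               gamma=0.9, epsilon=0.1, max_steps=100):
    q = defaultdict(float)                    # Q-hat(s, a), initialized to 0
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if random.random() < epsilon:     # explore
                a = random.choice(actions)
            else:                             # exploit the current estimate
                a = max(actions, key=lambda a: q[(s, a)])
            reward, s_next = env_step(s, a)
            # Training rule: Q-hat(s, a) := r + gamma * max_a' Q-hat(s', a')
            q[(s, a)] = reward + gamma * max(q[(s_next, a2)] for a2 in actions)
            s = s_next
    return q
```

Usage with the earlier grid-world sketch would be q_learning(ACTIONS, lambda s, a: (r(s, a), delta(s, a)), "s4").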
Iterative Policy Evaluation
The Bellman equation as an update rule for the action-value function:
Q_{k+1}(s, a) = r(s, a) + γ Σ_{a'} π(δ(s, a), a') Q_k(δ(s, a), a'),   γ = 0.9
[Figure: successive Q_k tables for the grid world, starting from Q_0 = 0 everywhere; after the first sweep only the +100 entries into G are non-zero, and over further sweeps values such as 45, 30, 23, and 13 propagate backwards from G, approaching 100, 69, 60, 52, 47, and 44 near convergence.]
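A minimal sketch of this sweep for a deterministic model, with pi(s, a) assumed to be a function giving the probability of action a in state s (all names are my own):

```python
def evaluate_policy_q(states, actions, delta, r, pi, gamma=0.9, sweeps=50):
    q = {(s, a): 0.0 for s in states for a in actions}   # Q_0 = 0 everywhere
    for _ in range(sweeps):
        # Synchronous backup: Q_{k+1} is computed entirely from Q_k
        q = {(s, a): r(s, a) + gamma * sum(pi(delta(s, a), a2) * q[(delta(s, a), a2)]
                                           for a2 in actions)
             for s in states for a in actions}
    return q
```

For the uniform random policy on the grid world, pi would simply be lambda s, a: 1 / len(ACTIONS).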
Policy Improvement
Suppose we have determined the value function V^π for an arbitrary deterministic policy π. For some state s we would like to know whether it is better to choose an action a ≠ π(s). Selecting a in s and following the existing policy π afterwards yields reward Q^π(s, a); if Q^π(s, a) > V^π(s), then a is obviously better than π(s). Therefore choose the new policy π' as
π'(s) = argmax_a Q^π(s, a) = argmax_a (r(s, a) + γ V^π(δ(s, a)))
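The improvement step in code; a minimal sketch, assuming delta, r, and a dict v holding V^π:

```python
def improve_policy(states, actions, v, delta, r, gamma=0.9):
    # pi'(s) = argmax_a (r(s, a) + gamma * V^pi(delta(s, a)))
    return {s: max(actions, key=lambda a: r(s, a) + gamma * v[delta(s, a)])
            for s in states}
```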
Example
π'(s) = argmax_a (r(s, a) + γ V^π(δ(s, a)))
[Figure: the grid world under the uniform random policy π(s, a) = 1/|A(s)|, with state values V^π = 0 (in G), 71, 63, 56, 61, and 78, and r = 100 on the transitions into G; after one greedy improvement step the values become V^π' = 0, 100, 90, 81, 90, and 100.]
Example
π'(s) = argmax_a Q^π(s, a)
[Figure: the grid world annotated with the converged Q^π values (100, 69, 60, 52, 47, 44, and 0 in G); the improved policy picks the highest-valued action in each state.]
Generalized Policy Iteration
Intertwine policy evaluation with policy improvement:
π_0 →(E) V^{π_0} →(I) π_1 →(E) V^{π_1} →(I) π_2 →(E) ... →(I) π* →(E) V^{π*}
where E denotes policy evaluation (V → V^π) and I denotes policy improvement (π → greedy(V)).
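A minimal self-contained sketch of this alternation for a deterministic model; the fixed number of evaluation sweeps and the stop-when-stable test are my own choices:

```python
def policy_iteration(states, actions, delta, r, gamma=0.9, sweeps=50):
    pi = {s: actions[0] for s in states}        # arbitrary initial policy
    while True:
        # E: evaluate the current policy by repeated Bellman backups
        v = {s: 0.0 for s in states}
        for _ in range(sweeps):
            v = {s: r(s, pi[s]) + gamma * v[delta(s, pi[s])] for s in states}
        # I: greedy improvement with respect to V^pi
        pi_new = {s: max(actions, key=lambda a: r(s, a) + gamma * v[delta(s, a)])
                  for s in states}
        if pi_new == pi:                         # greedy(V) changes nothing
            return pi, v
        pi = pi_new
```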
Value Iteration (Q-Learning)
Idea: do not wait for policy evaluation to converge, but improve the policy after each iteration:
V_{k+1}(s) = max_a (r(s, a) + γ V_k(δ(s, a)))
or
Q_{k+1}(s, a) = r(s, a) + γ max_{a'} Q_k(δ(s, a), a')
Stop when Σ_s |V_{k+1}(s) − V_k(s)| < ε or Σ_{s,a} |Q_{k+1}(s, a) − Q_k(s, a)| < ε.
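A minimal sketch of value iteration with the slide's stopping criterion, again assuming the deterministic model functions delta and r from the earlier grid-world sketch:

```python
def value_iteration(states, actions, delta, r, gamma=0.9, eps=1e-6):
    v = {s: 0.0 for s in states}
    while True:
        # V_{k+1}(s) = max_a (r(s, a) + gamma * V_k(delta(s, a)))
        v_new = {s: max(r(s, a) + gamma * v[delta(s, a)] for a in actions)
                 for s in states}
        # Stop when sum_s |V_{k+1}(s) - V_k(s)| < eps
        if sum(abs(v_new[s] - v[s]) for s in states) < eps:
            return v_new
        v = v_new
```

On the grid-world sketch, value_iteration(STATES, ACTIONS, delta, r) converges to the V* values shown earlier (100, 90, 81, and 0 for G).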