Università di Milano-Bicocca
Master's Degree in Computer Science
MACHINE LEARNING Course
Lecture 12 - Reinforcement Learning
Prof. Giancarlo Mauri
Agenda
- Reinforcement Learning Scenario
- Dynamic Programming
- Monte-Carlo Methods
- Temporal Difference Learning
Reinforcement Learning Scenario
[Diagram: agent-environment interaction loop. The agent observes state s_t, chooses action a_t, receives reward r_t, and the environment moves to state s_{t+1}, generating the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...]
Goal: learn to choose actions a_t that maximize future rewards r_0 + γ r_1 + γ^2 r_2 + ..., where 0 < γ < 1 is a discount factor
Reinforcement Learning Scenario
Some application domains:
- Robot learning to dock to a battery station
- Learning to choose actions to optimize factory output
- Learning to play Backgammon
Characteristics:
- Delayed reward
- No direct feedback (error signal) for good and bad actions
- Opportunity for active exploration
- Possibility that the state is only partially observable
Example [Tesauro 1995]: learning to play backgammon
Rewards: +100 if win, -100 if lose, 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equals the best human players
Markov Decision Process (MDP)
- Finite set of states S
- Finite set of actions A
At each time step the agent observes state s_t ∈ S and chooses action a_t ∈ A(s_t); it then receives immediate reward r_t, and the state changes to s_{t+1}
Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. the reward and next state depend only on the current state s_t and action a_t
The functions δ(s_t, a_t) and r(s_t, a_t) may be non-deterministic, and are not necessarily known to the agent
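As a concrete illustration, here is a minimal sketch in Python of a deterministic finite MDP. The 3-state chain, the action names, and the +100 goal reward are invented for this sketch; they are not the lecture's grid example.

```python
# A minimal deterministic MDP: finite states, finite actions,
# a transition function delta(s, a) and a reward function r(s, a).
# Hypothetical 3-state chain: state 2 is an absorbing goal state.

STATES = [0, 1, 2]
ACTIONS = ["left", "right"]

# delta(s, a): the next state, here a plain lookup table
DELTA = {(0, "left"): 0, (0, "right"): 1,
         (1, "left"): 0, (1, "right"): 2,
         (2, "left"): 2, (2, "right"): 2}

def r(s, a):
    """Immediate reward r(s, a): +100 for first entering the goal state."""
    return 100 if s != 2 and DELTA[(s, a)] == 2 else 0
```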
Learning Task
Execute actions in the environment, observe the results, and learn a policy π : S → A that associates to every state s ∈ S an action a ∈ A, so as to maximize the expected reward E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...] from any starting state s
NOTE: 0 < γ < 1 is the discount factor for future rewards
The target function is π : S → A, but there are no direct training examples of the form ⟨s, a⟩: training examples are of the form ⟨⟨s, a⟩, r⟩
Learning task
Consider deterministic environments, namely δ(s, a) and r(s, a) are deterministic functions of s and a
To evaluate a given policy π : S → A the agent might adopt the discounted cumulative reward over time:
V^π(s) = r_0 + γ r_1 + γ^2 r_2 + ... = Σ_{i=0...∞} γ^i r_i
where r_0, r_1, ... are generated by following the policy π from start state s
Task: learn the optimal policy π* that maximizes V^π(s):
π* = argmax_π V^π(s), ∀s
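In a deterministic world, V^π(s) can be approximated directly by rolling the policy forward and summing discounted rewards. A minimal sketch, assuming DELTA and r tables like the hypothetical chain above; the step cap and γ = 0.9 are arbitrary choices.

```python
GAMMA = 0.9  # discount factor

def v_pi(s, pi, delta, r, steps=200):
    """Approximate V^pi(s) = sum_i gamma^i r_i by following the
    deterministic policy pi from s; the tail beyond `steps`
    is negligible since gamma < 1."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        a = pi[s]                    # action prescribed by the policy
        total += discount * r(s, a)  # accumulate discounted reward
        discount *= GAMMA
        s = delta[(s, a)]            # deterministic transition
    return total
```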
Alternative definitions
Finite horizon reward: Σ_{i=0...h} r_i
Average reward: lim_{h→∞} (1/h) Σ_{i=0...h} r_i
State and Action Value Functions
The state value function denotes the reward for starting in state s and following policy π:
V^π(s) = r_0 + γ r_1 + γ^2 r_2 + ... = Σ_{i=0...∞} γ^i r_i
The action value function denotes the reward for starting in state s, taking action a, and following policy π afterwards:
Q^π(s, a) = r(s, a) + γ r_1 + γ^2 r_2 + ... = r(s, a) + γ V^π(δ(s, a))
NOTE: this suggests a dynamic programming approach:
π*(s) = argmax_a (r(s, a) + γ V*(δ(s, a)))
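The identity Q^π(s, a) = r(s, a) + γ V^π(δ(s, a)) translates directly into code. A minimal sketch, assuming a value table v (e.g. produced by the rollout above) and the same hypothetical delta and r tables:

```python
def q_pi(s, a, v, delta, r, gamma=0.9):
    """Q^pi(s, a) = r(s, a) + gamma * V^pi(delta(s, a)):
    take a once, then follow the policy whose values are in v."""
    return r(s, a) + gamma * v[delta[(s, a)]]
```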
Example
G: terminal state; upon entering G the agent obtains a reward of +100, remains in G forever, and obtains no further rewards
Rewards for all other actions are 0; γ = 0.9
[Diagram: grid world with states s_1, ..., s_6 and terminal state G; the actions entering G are labelled +100]
Example: one optimal policy
[Diagram: grid world with arrows showing one optimal policy]
Example: V* values for each state
[Diagram: grid world annotated with the V* value of each state]
Example: Q(s, a) values
[Diagram: grid world annotated with the Q(s, a) value of each action]
What to learn
We might try to have the agent learn the evaluation function V*. It could then do a one-step lookahead search to choose the best action from any state s, because
π*(s) = argmax_a (r(s, a) + γ V*(δ(s, a)))
But: this works well only if the agent knows the functions δ and r; when it doesn't, it can't choose actions this way
Q function
Since π*(s) = argmax_a Q(s, a), if the agent learns Q it can choose the optimal action even without knowing δ and r
Learning the Q function
Note that Q and V* are closely related:
V*(s) = max_a Q(s, a)
which allows us to write Q recursively as
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Let Q̂ denote the learner's current approximation to Q, and consider the training rule
Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s
Learning the Q function
For each s, a initialize Q̂(s, a) := 0
Observe the current state s
Do forever:
- Select an action a and execute it
- Receive immediate reward r
- Observe the new state s'
- Update Q̂(s, a) := r + γ max_{a'} Q̂(s', a')
- s := s'
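A minimal runnable sketch of this loop in Python, on the hypothetical chain world used earlier; the random action selection, the 10,000-step cap, and restarting episodes at the goal are assumptions made to keep the example small.

```python
import random

GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
DELTA = {(0, "left"): 0, (0, "right"): 1,
         (1, "left"): 0, (1, "right"): 2,
         (2, "left"): 2, (2, "right"): 2}   # state 2 is the absorbing goal

def r(s, a):
    return 100 if s != 2 and DELTA[(s, a)] == 2 else 0

# For each s, a initialize Q_hat(s, a) := 0
q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

s = 0                                    # observe the current state s
for _ in range(10_000):                  # "do forever", truncated
    a = random.choice(ACTIONS)           # select an action a and execute it
    reward, s_next = r(s, a), DELTA[(s, a)]   # receive r, observe s'
    # Q_hat(s, a) := r + gamma * max_a' Q_hat(s', a')
    q[(s, a)] = reward + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
    s = 0 if s_next == 2 else s_next     # s := s' (restart at the goal)

print(q[(0, "right")])   # converges to gamma * 100 = 90 in this toy world
```

Note that in deterministic worlds this rule needs no learning rate: each update can simply overwrite the old Q̂(s, a) value.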
Iterative Policy Evaluation
The Bellman equation as an update rule for the action-value function:
Q_{k+1}(s, a) = r(s, a) + γ Σ_{a'} π(δ(s, a), a') Q_k(δ(s, a), a')
where π(s', a') is the probability that policy π chooses a' in state s'
[Diagram: successive Q_k grids for the example world, γ = 0.9, starting from Q_0 = 0]
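A sketch of this update as a synchronous sweep over all state-action pairs, assuming a deterministic delta table and a policy given as a function pi(s) returning an action-to-probability dict; both representations are illustrative assumptions.

```python
def policy_evaluation(pi, states, actions, delta, r, gamma=0.9, sweeps=100):
    """Q_{k+1}(s,a) = r(s,a) + gamma * sum_a' pi(s')[a'] * Q_k(s',a'),
    with s' = delta(s,a); pi(s') maps each action to its probability."""
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):  # repeat the synchronous Bellman update
        q = {(s, a): r(s, a)
                     + gamma * sum(pi(delta[(s, a)])[b] * q[(delta[(s, a)], b)]
                                   for b in actions)
             for s in states for a in actions}
    return q
```

For the random policy of the example, pi(s) would return {a: 1/len(actions) for a in actions} for every state.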
Policy Improvement
Suppose we have determined the value function V^π for an arbitrary deterministic policy π. For some state s we would like to know whether it is better to choose an action a ≠ π(s)
Selecting a and following the existing policy π afterwards gives us reward Q^π(s, a)
If Q^π(s, a) > V^π(s), then a is obviously better than π(s)
Therefore choose the new policy π' as
π'(s) = argmax_a Q^π(s, a) = argmax_a (r(s, a) + γ V^π(δ(s, a)))
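The improvement step is a one-line greedy choice per state. A minimal sketch, assuming a state-value table v for the current policy and the same hypothetical delta and r tables as before:

```python
def policy_improvement(v, states, actions, delta, r, gamma=0.9):
    """pi'(s) = argmax_a [ r(s, a) + gamma * V^pi(delta(s, a)) ]."""
    return {s: max(actions, key=lambda a: r(s, a) + gamma * v[delta[(s, a)]])
            for s in states}
```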
Example
π'(s) = argmax_a (r(s, a) + γ V^π(δ(s, a)))
[Diagram: grid world with r = 100 for entering G. Under the random policy π(s, a) = 1/|A|, the state values are V^π = 0, 71, 63, 56, 61, 78; after one improvement step they become V^π' = 0, 100, 90, 81, 90, 100]
Example
π'(s) = argmax_a Q^π(s, a)
[Diagram: grid world with the policy arrows that are greedy with respect to Q^π]
Generalized Policy Iteration
Intertwine policy evaluation with policy improvement:
π_0 →E V^{π_0} →I π_1 →E V^{π_1} →I π_2 →E ... →I π* →E V*
where E denotes policy evaluation (computing V ← V^π) and I denotes policy improvement (making the policy greedy with respect to V: π ← greedy(V))
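Put together, the E and I steps alternate until the policy stops changing. A self-contained sketch for a deterministic world and deterministic policies; the sweep count, γ default, and the evaluate/improve helper names are illustrative choices, not the lecture's notation.

```python
def evaluate(pi, states, delta, r, gamma=0.9, sweeps=100):
    """E step: V_{k+1}(s) = r(s, pi(s)) + gamma * V_k(delta(s, pi(s)))."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {s: r(s, pi[s]) + gamma * v[delta[(s, pi[s])]] for s in states}
    return v

def improve(v, states, actions, delta, r, gamma=0.9):
    """I step: pi'(s) = argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]."""
    return {s: max(actions, key=lambda a: r(s, a) + gamma * v[delta[(s, a)]])
            for s in states}

def policy_iteration(states, actions, delta, r, gamma=0.9):
    pi = {s: actions[0] for s in states}      # arbitrary initial policy
    while True:
        v = evaluate(pi, states, delta, r, gamma)              # E
        new_pi = improve(v, states, actions, delta, r, gamma)  # I
        if new_pi == pi:                      # policy stable: pi = pi*
            return pi, v
        pi = new_pi
```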
Value Iteration (Q-Learning)
Idea: do not wait for policy evaluation to converge, but improve the policy after each iteration:
V_{k+1}(s) = max_a (r(s, a) + γ V_k(δ(s, a)))
or
Q_{k+1}(s, a) = r(s, a) + γ max_{a'} Q_k(δ(s, a), a')
Stop when ∀s: |V_{k+1}(s) - V_k(s)| < ε, or ∀s, a: |Q_{k+1}(s, a) - Q_k(s, a)| < ε
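A minimal sketch of the V-form of this update with its stopping test, again on a deterministic world given as dict tables; the tolerance value is an arbitrary choice.

```python
def value_iteration(states, actions, delta, r, gamma=0.9, eps=1e-6):
    """V_{k+1}(s) = max_a [ r(s, a) + gamma * V_k(delta(s, a)) ];
    stop when |V_{k+1}(s) - V_k(s)| < eps for every state s."""
    v = {s: 0.0 for s in states}
    while True:
        v_next = {s: max(r(s, a) + gamma * v[delta[(s, a)]] for a in actions)
                  for s in states}
        if all(abs(v_next[s] - v[s]) < eps for s in states):
            return v_next
        v = v_next
```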