
Slide 1: ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 6: Optimality Criterion in MDPs
Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee
Fall 2011, September 8, 2011

Slide 2: Outline
- Optimal value functions (cont.)
- Implementation considerations
- Optimality and approximation

Slide 3: Recap on Value Functions
- We define the state-value function for policy π as
    V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\}
- Similarly, we define the action-value function for π as
    Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \}
- The Bellman equation:
    V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]
- The value function V^π(s) is the unique solution to its Bellman equation (see the sketch below).
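As a concrete illustration of the Bellman equation as a fixed point, here is a minimal iterative policy evaluation sketch in Python. The two-state MDP (the P dictionary), the policy pi, and gamma are made-up assumptions for illustration only; they are not taken from the lecture.

```python
# Minimal sketch: iterative policy evaluation of V^pi on a toy 2-state MDP.
# The MDP (P), the policy (pi), and gamma below are illustrative assumptions.
import numpy as np

gamma = 0.9
states, actions = [0, 1], [0, 1]

# P[s][a] = list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.5)], 1: [(1.0, 1, 1.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}  # pi(a|s)

V = np.zeros(len(states))
for _ in range(10_000):
    delta = 0.0
    for s in states:
        # Bellman equation for V^pi, applied as an update rule
        v_new = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in actions)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-10:
        break

print("V_pi =", V)
```

Sweeping states in place like this converges to the unique solution of the Bellman equation for the given policy.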

Slide 4: Optimal Value Functions
- A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states, i.e.
    \pi \geq \pi' \iff V^{\pi}(s) \geq V^{\pi'}(s) \ \text{for all } s \in S
- There is always at least one policy (a.k.a. an optimal policy, π*) that is better than or equal to all other policies.
- Optimal policies also share the same optimal action-value function, defined as
    Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \ \text{for all } s \in S, a \in A(s)

Slide 5: Optimal Value Functions (cont.)
- The latter gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write
    Q^{*}(s,a) = E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \}
- Since V*(s) is the value function for a policy, it must satisfy the Bellman equation; in this case it takes the special form
    V^{*}(s) = \max_{a} Q^{\pi^{*}}(s,a) = \max_{a} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{*}(s') \big]
- This is called the Bellman optimality equation.
- Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
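To show how the Bellman optimality equation is used in practice, here is a hedged sketch of value iteration, which turns the equation into an update rule. The toy MDP (P) and gamma mirror the made-up example in the earlier sketch and are assumptions, not part of the lecture.

```python
# Minimal sketch: value iteration, i.e. the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) [ R + gamma V(s') ]
# applied repeatedly until the values stop changing.
# The toy MDP and gamma are illustrative assumptions (same as the earlier sketch).
import numpy as np

gamma = 0.9
P = {  # P[s][a] = list of (probability, next_state, reward)
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.5)], 1: [(1.0, 1, 1.0)]},
}

V = np.zeros(len(P))
for _ in range(10_000):
    delta = 0.0
    for s in P:
        v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-10:
        break

print("V_star =", V)
```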

Slide 6: Optimal Value Functions (cont.)
[Equation/figure-only slide; no text survives in the transcript.]

Slide 7: Optimal Value Functions (cont.)
- The Bellman optimality equation for Q* is
    Q^{*}(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \} = \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \big]
- Backup diagrams: arcs have been added at the agent's choice points to represent that the maximum over that choice is taken, rather than the expected value (given some policy).

Slide 8: Optimal Value Functions (cont.)
- For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy.
- The Bellman optimality equation is actually a system of equations, one for each state:
  - N equations (one for each state)
  - N variables: V*(s)
  - This assumes you know the dynamics of the environment.
- Once one has V*(s), it is relatively easy to determine an optimal policy:
  - For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation.
  - Any policy that assigns nonzero probability only to these actions is an optimal policy.
- This translates to a one-step search, i.e. greedy decisions will be optimal (see the sketch below).
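The slide's point that a one-step greedy search suffices once V* is known can be sketched as follows. The toy dynamics P and gamma are the same illustrative assumptions as above, and the V_star values are placeholders standing in for value-iteration output.

```python
# Minimal sketch: extracting a greedy (hence optimal) policy from V* via a
# one-step lookahead through the known dynamics. P, gamma, and the V_star
# values are illustrative placeholders, not taken from the lecture.
gamma = 0.9
P = {  # P[s][a] = list of (probability, next_state, reward)
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.5)], 1: [(1.0, 1, 1.0)]},
}
V_star = {0: 8.9, 1: 10.0}  # placeholder values (e.g. output of value iteration)

def greedy_action(s):
    # one-step search: back up V* through the model and take a maximizing action
    q = {a: sum(p * (r + gamma * V_star[s2]) for p, s2, r in P[s][a]) for a in P[s]}
    return max(q, key=q.get)

policy = {s: greedy_action(s) for s in P}
print(policy)  # any tie-breaking among maximizing actions is also optimal
```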

Slide 9: Optimal Value Functions (cont.)
- With Q*, the agent does not even have to do a one-step-ahead search: for any state s, the agent can simply find any action that maximizes Q*(s,a).
- The action-value function effectively embeds the results of all one-step-ahead searches.
- It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair.
- The agent does not need to know anything about the dynamics of the environment.
- Q: What are the implementation tradeoffs here?
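By contrast, acting greedily with respect to Q* needs no model at all; a minimal sketch, with an illustrative Q-table standing in for learned values:

```python
# Minimal sketch: with a Q* table, action selection is a plain argmax over the
# current state's row; no transition probabilities or rewards are consulted.
# The Q values below are illustrative placeholders (e.g. learned by Q-learning).
import numpy as np

Q = np.array([[8.0, 8.9],     # Q[s, a]
              [9.3, 10.0]])

def act(s):
    return int(np.argmax(Q[s]))

print([act(s) for s in range(Q.shape[0])])
```

The tradeoff hinted at on the slide: storing Q requires a table over state-action pairs rather than states alone, in exchange for model-free action selection.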

Slide 10: Implementation Considerations
- Computational complexity
  - How complex is it to evaluate the state-value and action-value functions?
  - In software
  - In hardware
- Data flow constraints
  - Which part of the data needs to be globally vs. locally available?
  - Impact of memory bandwidth limitations

Slide 11: Recycling Robot Revisited
- A transition graph is a useful way to summarize the dynamics of a finite MDP (see the sketch below):
  - a state node for each possible state
  - an action node for each possible state-action pair
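As a concrete data structure, the transition graph can be encoded as a nested dictionary keyed by state and action. The sketch below follows the standard recycling-robot parameterization from Sutton and Barto (battery probabilities alpha and beta, rewards r_search and r_wait, and a -3 penalty for being rescued with a depleted battery); treat the specific numbers and names as assumptions rather than values read off the slide.

```python
# Transition graph for the recycling robot as a nested dict:
# dynamics[state][action] = list of (probability, next_state, reward).
# Parameter names and values follow the usual Sutton & Barto formulation and
# are illustrative assumptions, not taken from the slide.
alpha, beta = 0.9, 0.6          # P(battery stays high/low after searching)
r_search, r_wait = 2.0, 1.0     # expected rewards for searching and waiting

dynamics = {
    "h": {
        "s":  [(alpha, "h", r_search), (1 - alpha, "l", r_search)],
        "w":  [(1.0, "h", r_wait)],
    },
    "l": {
        "s":  [(beta, "l", r_search), (1 - beta, "h", -3.0)],  # rescued: -3 penalty
        "w":  [(1.0, "l", r_wait)],
        "re": [(1.0, "h", 0.0)],
    },
}

for state, acts in dynamics.items():
    for action, outcomes in acts.items():
        print(state, action, outcomes)
```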

Slide 12: Bellman Optimality Equations for the Recycling Robot
- To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, by h, l, s, w, and re, respectively (the resulting equations are written out below).
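For reference, writing the two Bellman optimality equations (one per state) in the usual Sutton-and-Barto form for this example, with α and β the probabilities that searching leaves the battery high or low, r_s and r_w the search/wait rewards, and a reward of -3 when the robot must be rescued: the exact constants are assumptions carried over from the standard textbook treatment rather than read off the slide.

```latex
\begin{aligned}
V^{*}(\mathtt{h}) &= \max
\begin{cases}
  r_{\mathtt{s}} + \gamma\,\big[\alpha V^{*}(\mathtt{h}) + (1-\alpha)V^{*}(\mathtt{l})\big] & \text{(search)}\\[2pt]
  r_{\mathtt{w}} + \gamma V^{*}(\mathtt{h}) & \text{(wait)}
\end{cases}\\[6pt]
V^{*}(\mathtt{l}) &= \max
\begin{cases}
  \beta r_{\mathtt{s}} - 3(1-\beta) + \gamma\,\big[(1-\beta)V^{*}(\mathtt{h}) + \beta V^{*}(\mathtt{l})\big] & \text{(search)}\\[2pt]
  r_{\mathtt{w}} + \gamma V^{*}(\mathtt{l}) & \text{(wait)}\\[2pt]
  \gamma V^{*}(\mathtt{h}) & \text{(recharge)}
\end{cases}
\end{aligned}
```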

Slide 13: Optimality and Approximation
- Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens; it usually involves a heavy computational load.
- Typically, agents compute approximations to the optimal policy.
- A critical aspect of the problem facing the agent is always the computational resources available to it, in particular the amount of computation it can perform in a single time step.
- Practical considerations are thus:
  - Computational complexity
  - Memory available (tabular methods apply only for small state sets; see the back-of-envelope sketch below)
  - Communication overhead (for distributed implementations)
  - Hardware vs. software
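To make the "tabular methods apply only for small state sets" point concrete, here is a rough back-of-envelope memory estimate; the state and action counts are arbitrary illustrative numbers.

```python
# Back-of-envelope: memory needed for a tabular Q(s, a) with 8-byte floats.
# The state/action counts below are arbitrary illustrative numbers.
def q_table_bytes(n_states, n_actions, bytes_per_entry=8):
    return n_states * n_actions * bytes_per_entry

print(q_table_bytes(10_000, 10))   # 800,000 bytes (~800 KB): comfortably tabular
print(q_table_bytes(10**10, 10))   # 8e11 bytes (~800 GB): calls for approximation
```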

Slide 14: Are approximations good or bad?
- RL typically relies on approximation mechanisms (see later).
- This could be an opportunity:
  - Efficient "feature-extraction" types of approximation may actually reduce "noise".
  - Approximation makes it practical for us to address large-scale problems.
- In general, making "bad" decisions in RL results in (online) learning opportunities.
- The online nature of RL encourages learning more effectively from events that occur frequently; this is supported in nature.
- Capturing regularities is a key property of RL.

