ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 8: Dynamic Programming – Value Iteration
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee
Fall 2011, September 20, 2011

Outline
- Value Iteration
- Asynchronous Dynamic Programming
- Generalized Policy Iteration
- Efficiency of Dynamic Programming

Second DP Method: Value Iteration
A drawback of policy iteration is the need to perform policy evaluation in each iteration:
- Computationally heavy
- Requires multiple sweeps through the state set
Question: can we truncate the policy evaluation process and reduce the number of computations involved?
It turns out we can, without losing convergence properties. One such way is value iteration:
- Policy evaluation is stopped after just one sweep
- Has the same convergence guarantees as policy iteration

Value Iteration (cont.)
- Effectively embeds one sweep of policy evaluation and one sweep of policy improvement
- All variations converge to an optimal policy for discounted MDPs
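The corresponding one-sweep backup can be written, using Sutton & Barto's notation (the notation is assumed here, not taken from the slide), as

$$ V_{k+1}(s) \;=\; \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma\, V_{k}(s') \right], \qquad \text{for all } s \in \mathcal{S}, $$

where each backup combines the expectation of policy evaluation with the max of policy improvement, which is the "one sweep of each" referred to above.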

Example: Gambler’s Problem
A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips:
- If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake
- The game ends when the gambler wins by reaching his goal of $100, or loses by running out of money
- On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars
This problem can be formulated as an undiscounted, finite (non-deterministic) MDP.

Gambler’s Problem: MDP Formulation
- The state is the gambler's capital: s ∈ {0, 1, 2, …, 100}
- The actions are stakes: a ∈ {1, 2, …, min(s, 100 − s)}
- The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1
- The state-value function then gives the probability of winning from each state
- A policy is a mapping from levels of capital to stakes; the optimal policy maximizes the probability of reaching the goal
- Let p denote the probability of the coin coming up heads. If p is known, then the entire problem space is known and can be solved (i.e., a complete model is provided)
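As a concrete illustration, here is a minimal value-iteration sketch for this formulation (the function and variable names are illustrative, and p = 0.4 is assumed to match the next slide; this is not code from the lecture):

```python
import numpy as np

def gamblers_value_iteration(p_heads=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem (undiscounted, gamma = 1)."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0  # reaching the goal earns reward +1, so its value is 1
    while True:
        delta = 0.0
        for s in range(1, goal):                        # non-terminal capitals
            stakes = range(1, min(s, goal - s) + 1)     # legal actions in state s
            # Expected return of each stake: win with prob p, lose with prob 1 - p
            returns = [p_heads * V[s + a] + (1 - p_heads) * V[s - a] for a in stakes]
            best = max(returns)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                                 # in-place value-iteration backup
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = list(range(1, min(s, goal - s) + 1))
        returns = [p_heads * V[s + a] + (1 - p_heads) * V[s - a] for a in stakes]
        policy[s] = stakes[int(np.argmax(returns))]
    return V, policy

V, policy = gamblers_value_iteration()
print(V[50], policy[50])   # estimated probability of winning and stake at capital $50
```

Ties among equally good stakes are broken toward the smallest stake here, so the resulting policy may differ from the familiar textbook plot even though the value function is the same.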

Gambler’s Problem: Solutions for p = 0.4

Asynchronous Dynamic Programming
Major drawback of DP – computational complexity:
- Involves operations that sweep the entire state set
- Example: backgammon has over 10^20 states – even backing up a million states per second would not suffice
Asynchronous DP algorithms – in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set:
- The values of some states may be backed up several times before the values of other states are backed up once
- The condition for convergence to the optimal policy is that all states continue to be backed up in the long run (no state is ignored forever)
- Allows practical sub-optimal solutions to be found

Asynchronous Dynamic Programming (cont.)
What we gain:
- Speed of convergence – no need for a complete sweep every iteration
- Flexibility in selecting the states to be updated
It's a zero-sum game, though:
- For the optimal solution we still need to perform all the calculations involved in sweeping the entire state set
- Some states may not need to be updated as often as others
- Some states may be skipped altogether (as will be discussed later)
Facilitates real-time learning (learn and experience concurrently):
- For example, we can focus updates on the states the agent actually visits (see the sketch below)
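A minimal sketch of such in-place, non-systematic backups, under assumed data structures (the dictionary-based MDP and all names below are illustrative, not from the lecture):

```python
def async_backups(V, mdp, states_to_update, gamma=1.0):
    """In-place value-iteration backups over an arbitrary subset and order of states.

    mdp[s][a] is assumed to be a list of (prob, next_state, reward) tuples;
    terminal states map to an empty dict.
    """
    for s in states_to_update:                       # no systematic sweep required
        if mdp[s]:                                   # skip terminal states
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in mdp[s].values()
            )
    return V

# Toy 3-state chain: "go" moves right; reaching state 2 pays +1 and ends the episode.
mdp = {
    0: {"go": [(1.0, 1, 0.0)]},
    1: {"go": [(1.0, 2, 1.0)]},
    2: {},                                           # terminal
}
V = {s: 0.0 for s in mdp}
# Back up only the states the agent "visited", some of them several times.
async_backups(V, mdp, states_to_update=[1, 0, 1])
print(V)   # values propagate back toward state 0 without a full sweep
```

The same routine could just as well be driven by the agent's actual trajectory, which is the real-time learning use mentioned above.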

Generalized Policy Iteration
Policy iteration consists of two processes:
- Making the value function consistent with the current policy (policy evaluation)
- Making the policy greedy with respect to the value function (policy improvement)
So far we've considered methods that alternate between the two phases (at different granularities)
Generalized Policy Iteration (GPI) refers to all variations of the above, regardless of granularity and details
The two components can be viewed as both competing and cooperating (see the sketch below):
- They compete in the sense of pulling in opposite directions
- They complement each other in that together they lead to an optimal policy
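A schematic GPI loop, given purely as a sketch (evaluate_partially and greedy_policy are assumed, illustrative helper names supplied by the caller; neither is defined in the lecture):

```python
def generalized_policy_iteration(policy, V, evaluate_partially, greedy_policy,
                                 max_rounds=1000):
    """Interleave (possibly partial) policy evaluation with greedy improvement.

    evaluate_partially(policy, V) may do anything from a single backup sweep
    (value iteration) to evaluation until convergence (classic policy iteration);
    greedy_policy(V) returns the policy that is greedy with respect to V.
    """
    for _ in range(max_rounds):
        V = evaluate_partially(policy, V)   # pull V toward V^pi
        new_policy = greedy_policy(V)       # pull pi toward greedy(V)
        if new_policy == policy:            # stable: V and pi consistent
            break
        policy = new_policy
    return policy, V
```

The granularity knob lives entirely in evaluate_partially, which is what makes both policy iteration and value iteration special cases of the same loop.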

Efficiency of Dynamic Programming
- DP may not be practical for large problems, but compared to the alternatives DP is pretty good
- Finding an optimal policy is polynomial in the number of states and actions: if |A| = M and |S| = N, DP is guaranteed to find the optimal policy in polynomial time, even though the total number of (deterministic) policies is M^N (see the illustration below)
- DP is often considered impractical because of the curse of dimensionality: |S| grows exponentially with the number of state variables
- In practice, DP methods can solve (on a PC) MDPs with millions of states
- On problems with large state spaces, asynchronous DP works best; it works because (practically speaking) only a few states occur along optimal solution trajectories
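For intuition about that gap, an illustrative calculation (the numbers are chosen for illustration, not taken from the lecture):

$$ M = 10,\; N = 20 \;\;\Rightarrow\;\; M^{N} = 10^{20} \text{ deterministic policies to enumerate,} $$

whereas a single value-iteration sweep over the same MDP costs on the order of $M N^{2} = 4000$ backups (assuming dense transitions), and only polynomially many sweeps are needed to reach an optimal policy.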