General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning Duke University Machine Learning Group Discussion Leader: Kai Ni June 17, 2005

Outline Reinforcement Learning Explicit Explore or Exploit (E3) algorithm Implicit Explore or Exploit (R-Max) algorithm Conclusions

What is Reinforcement Learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Two strategies for solving reinforcement-learning problems: –Search in the space of behaviors for one that performs well in the environment; –Use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.

Reinforcement Learning Model Formally, the model consists of: a discrete set of environment states, S; a discrete set of agent actions, A; and a set of scalar reinforcement signals, typically {0,1} or the real numbers. Figure 1: The standard reinforcement-learning model.
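To make the interface between these components concrete, here is a minimal sketch of the standard agent-environment loop; the Environment and RandomAgent classes, their method names, and the toy reward rule are illustrative assumptions, not part of the original slides.

```python
import random

class Environment:
    """Toy stationary environment with a few states and random transitions."""
    def __init__(self, n_states=3):
        self.n_states = n_states
        self.state = 0

    def step(self, action):
        # Reward 1 only for taking action 0 in state 0; then jump to a random state.
        reward = 1 if (self.state == 0 and action == 0) else 0
        self.state = random.randrange(self.n_states)
        return self.state, reward

class RandomAgent:
    """Baseline agent that ignores the state and acts uniformly at random."""
    def __init__(self, n_actions=2):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)

env, agent = Environment(), RandomAgent()
state, total = env.state, 0
for t in range(100):
    action = agent.act(state)          # agent emits action a given state s
    state, reward = env.step(action)   # environment returns next state and scalar reward
    total += reward
print("total reward after 100 steps:", total)
```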

Example Dialogue The environment is non-deterministic but stationary.

Some Measurements Models of optimal behavior: –Finite-horizon: maximize the expected reward over the next h steps, E[Σ_{t=0}^{h} r_t]; –Infinite-horizon discounted: maximize the discounted sum E[Σ_{t=0}^{∞} γ^t r_t], with discount factor 0 ≤ γ < 1; –Average-reward: maximize the long-run average reward, lim_{h→∞} E[(1/h) Σ_{t=0}^{h} r_t]. Learning performance: eventual convergence to (near-)optimal behavior and the speed of that convergence.

Exploitation versus Exploration One major difference between reinforcement learning and supervised learning is that a reinforcement-learner must explicitly explore its environment. The simplest traditional reinforcement-learning problem is the K-armed bandit problem: K gambling machines and h pulls. How do you decide which machine to pull?
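As an illustration of the exploration/exploitation tension, here is a minimal epsilon-greedy bandit sketch; the payoff probabilities, the epsilon value, and the epsilon-greedy rule itself are illustrative choices, not the strategy analyzed in the slides.

```python
import random

def epsilon_greedy_bandit(payoff_probs, pulls=1000, epsilon=0.1):
    """Play a K-armed Bernoulli bandit with an epsilon-greedy rule."""
    k = len(payoff_probs)
    counts = [0] * k          # number of pulls of each arm
    means = [0.0] * k         # empirical mean payoff of each arm
    total = 0
    for _ in range(pulls):
        if random.random() < epsilon:
            arm = random.randrange(k)                       # explore: random arm
        else:
            arm = max(range(k), key=lambda i: means[i])     # exploit: best arm so far
        reward = 1 if random.random() < payoff_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update
        total += reward
    return total, means

total, means = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(total, [round(m, 2) for m in means])
```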

Markov Decision Process Model The MDP is defined by the tuple <S, A, T, R>: S is a finite set of states of the world; A is a finite set of actions; T: S × A → Π(S) is the state-transition function, giving the probability that an action changes the world state from one state to another, T(s, a, s'); R: S × A → ℝ is the reward received by the agent in a given world state after performing an action, R(s, a). The agent does not know the parameters of this process.
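A minimal sketch of one way the tuple <S, A, T, R> could be represented in code; the dataclass layout and the two-state example numbers are assumptions made for illustration, not part of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list                 # S: finite set of states
    actions: list                # A: finite set of actions
    # transitions[(s, a)] maps next state s' -> probability T(s, a, s')
    transitions: dict = field(default_factory=dict)
    # rewards[(s, a)] is the expected reward R(s, a)
    rewards: dict = field(default_factory=dict)

# A two-state, two-action example with arbitrary numbers.
mdp = MDP(
    states=["s1", "s2"],
    actions=["a1", "a2"],
    transitions={("s1", "a1"): {"s1": 0.9, "s2": 0.1},
                 ("s1", "a2"): {"s2": 1.0},
                 ("s2", "a1"): {"s1": 1.0},
                 ("s2", "a2"): {"s2": 1.0}},
    rewards={("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
             ("s2", "a1"): 0.0, ("s2", "a2"): 0.5},
)
print(mdp.transitions[("s1", "a1")])
```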

Near-Optimal Learning in Polynomial Time We call the value of the lower bound on T given above the ε-horizon time for the discounted MDP M.
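The bound on T referred to here was not captured in the transcript. A plausible reconstruction, following the standard horizon-time lemma of Kearns and Singh (the exact constants should be checked against the paper), is:

```latex
T \;\ge\; \frac{1}{1-\gamma}\,\ln\!\left(\frac{R_{\max}}{\epsilon\,(1-\gamma)}\right)
\quad\Longrightarrow\quad
\bigl|\,V^{\pi}(s,T) - V^{\pi}(s)\,\bigr| \;\le\; \epsilon
\quad\text{for every policy } \pi \text{ and state } s,
```

and it is this lower bound on T that is named the ε-horizon time.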

Proof of the Lemma The lower bound follows from the definitions, since all expected payoffs are nonnegative. For the upper bound, fix any infinite path p, and let R_i be the expected payoff received at the i-th step along this path.
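The slide's derivation trails off at this point; a sketch of how the upper bound is typically completed, assuming each payoff is bounded by R_max and payoffs are discounted by γ, is the geometric tail bound:

```latex
V^{\pi}(s) - V^{\pi}(s,T)
\;\le\; \sum_{i=T+1}^{\infty} \gamma^{\,i-1} R_i
\;\le\; R_{\max} \sum_{i=T+1}^{\infty} \gamma^{\,i-1}
\;=\; \frac{\gamma^{T} R_{\max}}{1-\gamma}
\;\le\; \epsilon
\quad\text{once } T \ge \frac{1}{1-\gamma}\ln\!\left(\frac{R_{\max}}{\epsilon(1-\gamma)}\right).
```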

The Explicit Explore or Exploit (E3) Algorithm Model-based – maintain a model of the transition probabilities and the expected payoffs for some subset of the states of the unknown MDP M. Balanced wandering – from an "unknown" state, take the action that has been tried least often there; once a state has been visited enough times, it becomes a "known" state. Known-state MDP M_S – induced on the set S of currently known states; all of the unknown states are represented by a single additional absorbing state s_0.

Initialization – the set S of known states is empty. Balanced wandering – any time the current state is not in S, the algorithm performs balanced wandering. Discovery of new known states – any time a state i has been visited m_known times, it enters the known set S. Off-line optimization – upon reaching a known state i in S, the algorithm performs two off-line optimal-policy computations, on M_S and M_S': –Attempted exploitation: if the resulting exploitation policy achieves return from i in M_S of at least the required threshold (close to the optimal T-step return), the algorithm executes it for the next T steps. –Attempted exploration: otherwise, the algorithm executes the exploration policy derived from M_S' for T steps of exploration.
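A small, self-contained sketch of the known-state MDP construction described above; the function name, the dictionary-based model format, and the example counts are illustrative assumptions, not the authors' code.

```python
def build_known_state_mdp(counts, rewards, known):
    """counts[(s, a)]: dict next-state -> visit count; rewards[(s, a)]: average observed reward.
    Returns (states, T, R) for the known-state MDP M_S, in which any transition that
    leaves the known set is redirected to a single absorbing state "s0"."""
    T, R = {}, {}
    for (s, a), nexts in counts.items():
        if s not in known:
            continue
        total = sum(nexts.values())
        probs = {}
        for s_next, c in nexts.items():
            target = s_next if s_next in known else "s0"   # unknown states collapse into s0
            probs[target] = probs.get(target, 0.0) + c / total
        T[(s, a)] = probs
        R[(s, a)] = rewards[(s, a)]
    # s0 is absorbing; it pays 0 here (for the exploration MDP M_S' one would instead
    # give s0 the maximum reward and every known state reward 0).
    actions = {a for (_, a) in counts}
    for a in actions:
        T[("s0", a)] = {"s0": 1.0}
        R[("s0", a)] = 0.0
    states = sorted(known) + ["s0"]
    return states, T, R

# Tiny usage example with made-up counts: only "s1" is known so far.
counts = {("s1", "a"): {"s1": 8, "s2": 2}, ("s1", "b"): {"u": 5, "s1": 5}}
rewards = {("s1", "a"): 0.3, ("s1", "b"): 0.9}
print(build_known_state_mdp(counts, rewards, known={"s1"}))
```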

Explore or Exploit Lemma Informally: for any known state i and horizon T, either a policy confined to the known-state MDP M_S already achieves nearly the optimal T-step return from i (so exploitation is possible), or some policy computed in M_S' reaches the absorbing unknown state s_0 within T steps with non-negligible probability (so efficient exploration is possible).
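A hedged reconstruction of the formal statement (following Kearns and Singh's notation, where U denotes T-step return and G^T_max the maximum possible T-step return; the exact form and constants should be checked against the paper):

```latex
\text{For any known state } i,\ \text{horizon } T,\ \text{and } 0 < \alpha < 1,\ \text{either}\\
\exists\, \pi \text{ in } M_S:\;\; U^{\pi}_{M_S}(i,T) \;\ge\; U^{*}_{M}(i,T) - \alpha,\\
\text{or}\quad
\exists\, \pi \text{ in } M_S:\;\;
\Pr\bigl[\pi \text{ leaves the known set (reaches } s_0\text{) within } T \text{ steps}\bigr]
\;\ge\; \frac{\alpha}{G^{T}_{\max}}.
```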

R-Max – the Implicit Explore or Exploit Algorithm In the spirit of the E3 algorithm, R-Max is a general polynomial-time algorithm for near-optimal reinforcement learning. The agent does not know whether its current behavior is exploitation or exploration; however, it knows that it will either optimize or learn efficiently. R-Max is described in the context of a stochastic game (SG), which also accounts for the actions of an adversary. (Maybe useful for the moving-target problem?)

SG and MDP An MDP is an SG in which the adversary has a single action at each state.
                 SG                      MDP
State            G_i                     S_i
Action           (a, a')                 a
Transition       P_M(s, t, a, a')        P_M(s, t, a)
Unknown state    G_0                     S_0
Reward           matrix on each G_i      R(s, a)

Initialization – Construct a model M' consisting of N+1 stage-games, {G_0, G_1, …, G_N}, where G_0 is an additional fictitious game. Initialize all game matrices to have (R_max, 0) in all entries. Initialize P_M(G_i, G_0, a, a') = 1 for all i and all actions a, a'. Compute and Act – Compute an optimal T-step policy for the current state, and execute it for T steps or until a new entry becomes known. Observe and Update – Update the reward for (a, a') in the state G_i; update the set of states reached by playing (a, a') in G_i; if the record of states reached from this entry contains sufficiently many elements, mark this entry as KNOWN and update the transition matrix for this entry.
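A minimal, self-contained sketch of R-Max specialized to the single-agent (MDP) case, where the adversary has one action; the toy environment, the parameter values (R_max, K, horizon), the step-by-step replanning, and the value-iteration solver are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def value_iteration(states, actions, T, R, horizon):
    """Finite-horizon value iteration; returns a greedy action per state."""
    V = {s: 0.0 for s in states}
    policy = {}
    for _ in range(horizon):
        newV = {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = R[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)].items())
                if q > best_q:
                    best_a, best_q = a, q
            newV[s], policy[s] = best_q, best_a
        V = newV
    return policy

def rmax_mdp(env_step, states, actions, start, R_max=1.0, K=20, horizon=10, steps=2000):
    """Optimistic model: every (s, a) starts as a self-loop paying R_max until it has
    been tried K times; then its empirical model replaces the optimistic entry."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visits}
    reward_sum = defaultdict(float)
    known = set()
    T = {(s, a): {s: 1.0} for s in states for a in actions}   # optimistic transitions
    R = {(s, a): R_max for s in states for a in actions}      # optimistic rewards
    s = start
    for _ in range(steps):
        # For clarity this sketch replans every step; R-Max only needs to recompute
        # a policy when an entry becomes known or the T-step budget is exhausted.
        policy = value_iteration(states, actions, T, R, horizon)
        a = policy[s]
        s2, r = env_step(s, a)
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
        n = sum(counts[(s, a)].values())
        if n >= K and (s, a) not in known:                    # entry becomes KNOWN
            known.add((s, a))
            T[(s, a)] = {t: c / n for t, c in counts[(s, a)].items()}
            R[(s, a)] = reward_sum[(s, a)] / n
        s = s2
    return T, R, known

# Toy two-state environment: "go" moves s1 -> s2 with reward 1; everything else loops.
def env_step(s, a):
    if s == "s1" and a == "go":
        return "s2", 1.0
    if s == "s2" and a == "go":
        return "s1", 0.0
    return s, 0.0

T, R, known = rmax_mdp(env_step, states=["s1", "s2"], actions=["stay", "go"], start="s1")
print(sorted(known))
print({k: round(v, 2) for k, v in R.items()})
```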

Conclusion The authors describe R-Max, a simple RL algorithm that converges in polynomial time to near-optimal reward. R-Max is an optimistic model-based algorithm in the spirit of the E3 algorithm; however, unlike E3, R-Max makes an implicit trade-off between exploration and exploitation.

Related to our work This paper focuses on proving that such an algorithm exists and on its optimality and convergence guarantees; the detailed MDP solution method is not addressed. We may be able to use our POMDP solver within this framework as an extension. The algorithm does not require a random walk to learn the environment in advance, which may be interesting for our robot navigation problem.

Reference R. Brafman and M. Tennenholtz, "R-MAX – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning", Journal of Machine Learning Research, 2002. M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time", ICML 1998. L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey", Journal of AI Research, 1996.