Model-based Bayesian Reinforcement Learning in Partially Observable Domains by Pascal Poupart and Nikos Vlassis (2008 International Symposium on Artificial Intelligence and Mathematics).

Presentation transcript:

Model-based Bayesian Reinforcement Learning in Partially Observable Domains
by Pascal Poupart and Nikos Vlassis (2008 International Symposium on Artificial Intelligence and Mathematics)
Presented by Lihan He, ECE, Duke University, Oct 3, 2008

Outline

- Introduction
- POMDP represented as a dynamic decision network (DDN)
- Partially observable reinforcement learning
  - Belief update
  - Value function and optimal action
- Partially observable BEETLE
  - Offline policy optimization
  - Online policy execution
- Conclusion

Introduction

Final objective: learn the optimal actions (a policy) that achieve the best reward.

POMDP: partially observable Markov decision process
- represented by states, actions, observations, a transition model T, an observation model O, and a reward model R
- a sequential decision-making problem

Reinforcement learning for POMDPs: solve the decision-making problem from feedback received from the environment when the dynamics of the environment (T and O) are unknown.
- the history is given as an action-observation sequence
- model-based: explicitly model the environment
- model-free: avoid explicitly modeling the environment
- online learning: policy learning and execution happen at the same time
- offline learning: first learn the policy from training data, then execute it without further modification

Introduction

This paper:
- a Bayesian model-based approach
- the prior belief is set as a mixture of products of Dirichlets
- the posterior belief is again a mixture of products of Dirichlets
- the α-functions of the value function are also represented as mixtures of products of Dirichlets
- the number of mixture components increases exponentially with the time step
- the PO-BEETLE algorithm
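As a concrete reading of "mixture of products of Dirichlets" (notation mine, not the paper's): the unknown parameters θ consist of one unknown conditional distribution θ_j = Pr(·|pa_j) per parent configuration j of the DDN, and a belief of this family has the form

    b(\theta) = \sum_{i=1}^{k} c_i \prod_{j} \mathrm{Dir}\!\left(\theta_j ; n^{i}_{j}\right), \qquad c_i \ge 0, \; \sum_i c_i = 1,

where each n^i_j is a vector of Dirichlet hyperparameters (counts). Conjugacy of the Dirichlet with the multinomial conditional distributions is what keeps the posterior in the same family.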

POMDP and DDN

Redefine the POMDP as a dynamic decision network (DDN).
- X, X': the state variables of two consecutive time steps
- observations and rewards are subsets of the state variables
- the conditional probability distributions Pr(s'|pa_{s'}) of the state variables jointly encode the transition, observation and reward models T, O and R
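A hedged sketch of what "jointly encode" means here: if X' denotes the next-step state variables (with the observation and reward variables among them), the DDN factorizes the one-step dynamics as

    \Pr(X' \mid X, a) = \prod_{s' \in X'} \Pr\!\left(s' \mid \mathrm{pa}_{s'}\right),

so a single set of conditional distributions plays the roles of T, O and R at once.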

POMDP and DDN

Given the variables X, S, R, O, A, the edges E and the dynamics Pr(s'|pa_{s'}):

Belief update: after each action and observation, the belief over the current state is updated by Bayes' rule (equation below).

Objective: find a policy that maximizes the expected total reward.

The optimal value function satisfies Bellman's equation. Value iteration algorithms optimize the value function by iteratively computing the right-hand side of Bellman's equation.
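The slide's equations were rendered as images; the standard forms for a discrete POMDP with discount γ are

    b^{a,o'}(s') \propto \Pr(o' \mid s', a) \sum_{s} \Pr(s' \mid s, a)\, b(s),

    V^*(b) = \max_{a} \Big[ R(b,a) + \gamma \sum_{o'} \Pr(o' \mid b, a)\, V^*\!\big(b^{a,o'}\big) \Big], \qquad R(b,a) = \sum_s b(s) R(s,a),

which is what the belief update and Bellman's equation on this slide refer to.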

POMDP and DDN

For reinforcement learning, assume the variables X, S, R, O, A and the edges E are known, but the dynamics Pr(s'|pa_{s'}) are unknown.

We augment the graph: the unknown dynamics are included in the graph as additional (continuous) variables, denoted by the parameters Θ. If the unknown model is static, the parameters do not change between time steps.

The belief over s becomes a joint belief over s and θ.
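A hedged sketch of the augmentation in equations: treating θ as extra state gives the joint belief b(s, θ) and, for a static model, a degenerate transition for the parameters,

    \Pr(\theta' \mid \theta) = \delta_{\theta}(\theta'), \qquad
    b^{a,o'}(s', \theta) \propto \Pr(o' \mid s', a, \theta) \sum_{s} \Pr(s' \mid s, a, \theta)\, b(s, \theta),

i.e. the same belief update as before, now carried out jointly over the discrete state and the continuous parameters.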

PORL: belief update

Prior setting for the belief: a mixture of products of Dirichlets.

The posterior belief (after taking action a and receiving observation o') is again a mixture of products of Dirichlets.

Problem: the number of mixture components increases by a factor of |S| per step (exponential growth with time).
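To make the component growth concrete, here is a minimal Python sketch under simplifying assumptions of my own (flat states, unknown transition model only, known observation model obs_prob); the class and function names are mine, not the paper's.

# Illustrative sketch, not the paper's implementation: a belief represented as a
# mixture of products of Dirichlets for a flat-state POMDP whose transition
# model is unknown and whose observation model obs_prob[s', a, o'] is known.
import numpy as np

class Component:
    """One mixture component: a weight, a hypothesised current state,
    and Dirichlet counts n[s, a, s'] over the unknown transition model."""
    def __init__(self, weight, state, counts):
        self.weight = weight      # mixture weight c_i
        self.state = state        # hypothesised current state s_i
        self.counts = counts      # Dirichlet hyperparameters, shape (|S|, |A|, |S|)

def belief_update(components, a, o_prime, obs_prob):
    """Exact update after action a and observation o'. Each component branches
    over the |S| possible successor states, so the number of components grows
    by a factor of (at most) |S| per step."""
    new_components = []
    num_states = obs_prob.shape[0]
    for comp in components:
        # Predictive transition distribution: the Dirichlet mean for (state, a).
        n_sa = comp.counts[comp.state, a]
        pred = n_sa / n_sa.sum()
        for s_next in range(num_states):
            w = comp.weight * pred[s_next] * obs_prob[s_next, a, o_prime]
            if w <= 0.0:
                continue
            new_counts = comp.counts.copy()
            new_counts[comp.state, a, s_next] += 1.0   # conjugate count update
            new_components.append(Component(w, s_next, new_counts))
    # Normalise the mixture weights (assumes o' has nonzero probability
    # under at least one component).
    total = sum(c.weight for c in new_components)
    for c in new_components:
        c.weight /= total
    return new_components

Iterating belief_update for t steps multiplies the number of components by up to |S|^t, which is the exponential growth the slide refers to; the PO-BEETLE approximations later in the talk exist precisely to cap it.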

PORL: value function and optimal action

The augmented POMDP is hybrid, with discrete state variables S and continuous model variables Θ.

The value function is represented by a set of α-functions, for a discrete-state POMDP, for a continuous-state POMDP [1], and for the hybrid-state POMDP considered here (equations sketched below).

The α-functions α(s,θ) can also be represented as mixtures of products of Dirichlets.

[1] Porta, J. M.; Vlassis, N. A.; Spaan, M. T. J.; and Poupart, P. 2006. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7:2329–2367.
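The slide's equations were images; a hedged reconstruction of the three cases, using the standard α-function form of the value function with Γ the current set of α-functions:

    Discrete-state POMDP:        V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s)
    Continuous-state POMDP [1]:  V(b) = \max_{\alpha \in \Gamma} \int \alpha(x)\, b(x)\, dx
    Hybrid-state POMDP (here):   V(b) = \max_{\alpha \in \Gamma} \sum_{s} \int \alpha(s,\theta)\, b(s,\theta)\, d\theta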

PORL: value function and optimal action

Assume the value function for k steps-to-go is given (as a set of α-functions); the value function for k+1 steps-to-go is then obtained by a Bellman backup, decomposed into 3 steps:
1) compute the back-projected α-functions for each action a and observation o';
2) find the optimal action for the belief b;
3) find the corresponding α-function of the new value function.

Problem: the number of mixture components of the α-functions again increases by a factor of |S| per backup (exponential growth with time).
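The slide's three steps were shown as equations (images here); a hedged reconstruction of the standard point-based backup they correspond to, written for the hybrid state (s, θ) in notation of my own:

    1)  \alpha^{a,o'}(s,\theta) = \sum_{s'} \Pr(s' \mid s, a, \theta)\, \Pr(o' \mid s', a, \theta)\, \alpha'(s',\theta)   for each \alpha' \in \Gamma_k,
    2)  a^* = \arg\max_{a} \Big[ R(b,a) + \gamma \sum_{o'} \max_{\alpha'} \; \textstyle\sum_{s}\!\int \alpha^{a,o'}(s,\theta)\, b(s,\theta)\, d\theta \Big],
    3)  \alpha_b(s,\theta) = R(s,a^*) + \gamma \sum_{o'} \alpha^{a^*,o'}_{*}(s,\theta),   where \alpha^{a^*,o'}_{*} is the maximizer from step 2 for observation o'.

Because Pr(s'|s,a,θ) is a monomial in θ, step 1 maps a mixture of products of Dirichlets to another such mixture, but with roughly |S| times as many components, which is the growth the slide flags.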

PO-BEETLE: offline policy optimization

Policy learning is performed offline, given sufficient training data (an action-observation sequence).

PO-BEETLE: offline policy optimization

Keep the number of mixture components of the α-functions bounded:
- Approach 1: approximation using basis functions
- Approach 2: approximation by keeping the most important components (see the sketch below)
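A minimal sketch of what "approximation by important components" could look like in the toy representation used above (an assumption on my part; the paper may measure importance differently): keep the k components with the largest-magnitude coefficients.

def keep_important_components(components, k, renormalise=False):
    """Truncate a mixture/linear combination to its k largest-magnitude
    components. Renormalising only makes sense for belief mixtures, not for
    alpha-functions, whose coefficients need not sum to one."""
    kept = sorted(components, key=lambda c: abs(c.weight), reverse=True)[:k]
    if renormalise:
        total = sum(c.weight for c in kept)
        for c in kept:
            c.weight /= total
    return kept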

PO-BEETLE: online policy execution

Given the policy, the agent executes it and updates the belief online.

Keep the number of mixture components of the belief b bounded:
- Approach 1: approximation using importance sampling

PO-BEETLE: online policy execution

- Approach 2: particle filtering, which simultaneously updates the belief and limits the number of mixture components:
  - sample one updated component per particle (after taking a and receiving o')
  - the updated belief is represented by k particles
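A hedged sketch of such a particle-style update, reusing the Component class from the belief-update sketch above (the sampling scheme is my guess at one reasonable instantiation, not necessarily the paper's): instead of enumerating all |S| successor states per component, sample a single successor per particle, so the belief stays at k components.

import numpy as np

def particle_belief_update(components, a, o_prime, obs_prob, k, rng=np.random):
    """Approximate belief update that keeps exactly k mixture components.
    For each particle: pick a parent component by weight, sample one successor
    state from the Dirichlet-mean transition, and reweight by the observation
    likelihood."""
    num_states = obs_prob.shape[0]
    parent_w = np.array([c.weight for c in components])
    parent_w /= parent_w.sum()
    particles = []
    for _ in range(k):
        comp = components[rng.choice(len(components), p=parent_w)]
        n_sa = comp.counts[comp.state, a]
        pred = n_sa / n_sa.sum()                    # Dirichlet-mean transition
        s_next = rng.choice(num_states, p=pred)     # sample one successor state
        w = obs_prob[s_next, a, o_prime]            # importance weight
        new_counts = comp.counts.copy()
        new_counts[comp.state, a, s_next] += 1.0    # conjugate count update
        particles.append(Component(w, s_next, new_counts))
    total = sum(p.weight for p in particles)
    if total > 0:
        for p in particles:
            p.weight /= total
    return particles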

Conclusion

- Bayesian model-based reinforcement learning;
- the prior belief is a mixture of products of Dirichlets;
- the posterior belief is also a mixture of products of Dirichlets, with the number of mixture components growing exponentially with time;
- the α-functions (associated with the value function) are also represented as mixtures of products of Dirichlets whose number of components grows exponentially with time;
- the partially observable BEETLE (PO-BEETLE) algorithm keeps both kinds of growth bounded through the approximations above.