Generalized Point Based Value Iteration for Interactive POMDPs


Generalized Point Based Value Iteration for Interactive POMDPs
Prashant Doshi, Dept. of Computer Science and AI Institute, University of Georgia
Dennis P. Perez, AI Institute, University of Georgia
Twenty-Third AAAI Conference on Artificial Intelligence (AAAI 2008)

Outline
Background
 – Point Based Value Iteration (PBVI) for POMDPs
 – Overview of Interactive POMDPs
Interactive PBVI: Generalization of PBVI to Multiagent Settings
Experimental Results

Decision Making in Single Agent Setting: POMDP
[Figure: physical states (location, orientation, ...) connected by a transition under the agent's action, only partially observed by the agent]

Solving POMDPs
Exact solutions are difficult to compute in general
 – PSPACE-complete for finite horizon (Papadimitriou & Tsitsiklis '87, Mundhenk et al. '00)
 – Undecidable for infinite or indefinite horizon (Madani et al. '03)
Approximations that always guarantee a near-optimal policy quickly are not possible (Lusena et al. '01)
 – However, for many instances, algorithms in use or being developed are fast and close to optimal

Point Based Value Iteration (PBVI)
Potentially scalable approach for solving POMDPs approximately (Pineau et al. '03, '06)
[Figure: the anytime PBVI loop, contrasted with the exact solution: select belief points, back up and prune alpha vectors at those points, expand the set of belief points, and repeat in the next run]
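To make the backup at selected points concrete, here is a minimal Python sketch of the standard point-based backup that PBVI performs at each belief in the current set; the arrays T, O, R and the belief set B are generic POMDP inputs assumed for illustration, not the authors' code.

    import numpy as np

    def point_based_backup(B, alphas, T, O, R, gamma):
        """One PBVI backup: for each belief b in B, keep only the best backed-up alpha vector.

        B      : list of belief vectors over |S| states
        alphas : non-empty list of current alpha vectors (each of length |S|)
        T[a]   : |S| x |S| transition matrix for action a
        O[a]   : |S| x |Z| observation matrix for action a (rows indexed by next state)
        R[a]   : length-|S| immediate reward vector for action a
        """
        new_alphas = []
        num_actions, num_obs = len(T), O[0].shape[1]
        for b in B:
            best_value, best_vec = -np.inf, None
            for a in range(num_actions):
                backed_up = R[a].astype(float)
                for z in range(num_obs):
                    # Back-project every old vector through (a, z) and keep the one best at b.
                    projections = [T[a] @ (O[a][:, z] * alpha) for alpha in alphas]
                    backed_up += gamma * max(projections, key=lambda g: float(b @ g))
                value = float(b @ backed_up)
                if value > best_value:
                    best_value, best_vec = value, backed_up
            new_alphas.append(best_vec)
        return new_alphas

Repeating this backup on a fixed belief set until the values stabilize, then expanding the belief set and running again, yields the anytime loop in the figure above.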

Point Based Value Iteration
Many different belief expansion strategies
 – Stochastic trajectory generation
 – Greedy error minimization
 – Gain based methods (Samples et al. '07)
Improvements on PBVI
 – Randomly backing up vectors at select points (Perseus; Spaan & Vlassis '05)
 – Prioritized vector backup (Shani et al. '06)

Decision Making in Multiagent Setting: Interactive POMDP (see JAIR '05)
[Figure: as before, physical states (location, orientation, ...) connected by a transition, but the agent now reasons over an interactive state that also includes models of the other agent]

Interactive POMDPs
Key ideas: Integrate game theoretic concepts into a decision theoretic framework
 – Include possible models of other agents in decision making → intentional (types) and subintentional models
 – Address uncertainty by maintaining beliefs over the state and models of other agents → Bayesian learning
 – Beliefs over intentional models give rise to interactive belief systems → interactive epistemology, recursive modeling
 – Computable approximation of the interactive belief system → finitely nested belief system
 – Compute best responses to your beliefs → subjective rationality

Interactive POMDPs
Interactive state space
 – Include models of other agents in the state space
Beliefs in I-POMDPs are nested; finitely nested beliefs are computable
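In the standard finitely nested formulation (Gmytrasiewicz & Doshi, JAIR '05), the interactive state space and the nested beliefs of agent i about agent j can be written as follows; this is the standard notation, reproduced here as a reference sketch:

    \[ IS_{i,0} = S, \qquad b_{i,0} \in \Delta(IS_{i,0}) \]
    \[ IS_{i,l} = S \times M_{j,l-1}, \qquad b_{i,l} \in \Delta(IS_{i,l}), \quad l \ge 1 \]
    \[ M_{j,l-1} = \Theta_{j,l-1} \cup SM_j, \qquad \theta_{j,l-1} = \langle b_{j,l-1}, \hat{\theta}_j \rangle \]

Here \Theta_{j,l-1} is the set of intentional models (types) of j, SM_j the subintentional models, and \hat{\theta}_j is j's frame (its actions, observations, observation function, reward and optimality criterion).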

Interactive PBVI (I-PBVI)
Hypothesis: Extending the PBVI approach to I-POMDPs results in a scalable approximation for I-POMDPs
Generalizing PBVI to multiagent settings is not trivial
 – Research challenges:
   1. Space of agent models is countably infinite
   2. Parameterized representation of nested beliefs is difficult
   3. Other agents' actions need to be predicted, suggesting a recursive implementation

Issue 1: Space of agent models is infinite
Approach
Analogous to PBVI in POMDPs, select a few initial models of the other agent
 – Need to ensure that the true model is within this set, otherwise the belief update is inconsistent
Select models so that the Absolute Continuity Condition (ACC) is satisfied
 – The subjective distribution over future observations (paths of play) should not rule out the observation histories considered possible by the true distribution
How to satisfy ACC?
 – Cautious beliefs
 – Select a finite set of models using the partial (domain) knowledge that the true or an equivalent model is one of them

Issue 2: Representing nested beliefs is difficult
Level 0 beliefs are standard discrete distributions (vectors of probabilities that sum to 1)
Level 1 beliefs could be represented as probability density functions over level 0 beliefs
Probability density functions over level 1 beliefs may not be computable in general
 – Parameters of level 1 beliefs may not be bounded (e.g., a polynomial of any degree)
 – Level 2 beliefs are strictly partial recursive functions
Approach
 – We previously limited the set of models to a finite set
 – A level l belief then becomes a discrete probability distribution
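A minimal sketch of what such a discrete nested belief could look like once the model set is finite; the Model class and the tiger-problem labels below are hypothetical, for illustration only:

    from dataclasses import dataclass
    from typing import Dict, Hashable, Tuple

    @dataclass(frozen=True)
    class Model:
        """An intentional model of agent j: j's own discrete belief paired with a frame id."""
        belief: Tuple[Tuple[Hashable, float], ...]  # j's level l-1 belief, frozen so it hashes
        frame: str

    def normalize(b: Dict) -> Dict:
        z = sum(b.values())
        return {k: v / z for k, v in b.items()}

    # Agent i's level 1 belief: a discrete distribution over interactive states,
    # i.e., over pairs (physical state, model of j) drawn from the finite model set.
    m1 = Model(belief=(("tiger-left", 0.5), ("tiger-right", 0.5)), frame="tiger-j")
    m2 = Model(belief=(("tiger-left", 0.9), ("tiger-right", 0.1)), frame="tiger-j")
    b_i_level1: Dict[Tuple[Hashable, Model], float] = normalize({
        ("tiger-left", m1): 1.0, ("tiger-left", m2): 1.0,
        ("tiger-right", m1): 1.0, ("tiger-right", m2): 1.0,
    })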

Issue 3: Predict other agent's actions
Approach
Candidate agent models grow over time and must be tracked
 – Define a complete interactive state space using the set Reach (writing M̃_j for the finite model set selected above):
     Reach(M̃_j, 0) = M̃_j
     Reach(M̃_j, H) = set of models of agent j reachable in the course of H steps
Solve the other agent's models at each level to predict its actions
 – Recursively invoke I-PBVI to solve the models
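A sketch of how the reachable model set could be grown over the horizon, assuming a hypothetical update_model(m, a_j, o_j) that returns j's model after it acts and observes; this is illustrative only, not the paper's implementation:

    def reach(initial_models, horizon, j_actions, j_observations, update_model):
        """Reach(M_j, H): every model agent j could hold within H steps, starting from
        the initial finite set and closing it under j's action/observation updates."""
        reachable = set(initial_models)
        frontier = set(initial_models)
        for _ in range(horizon):
            next_frontier = set()
            for m in frontier:
                for a_j in j_actions:
                    for o_j in j_observations:
                        m_next = update_model(m, a_j, o_j)  # j's Bayesian belief update
                        if m_next not in reachable:
                            next_frontier.add(m_next)
            reachable |= next_frontier
            frontier = next_frontier
        return reachable

Each new model is itself solved (recursively, with I-PBVI at the level below) so that j's actions can be predicted at every step.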

Interactive PBVI
Back project alpha vectors for I-POMDPs (see paper)
Retain alpha vectors optimal at selected belief points
[Figure: computational savings of the point-based backup vs. the exact backup]
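As in PBVI, the value function is represented by a finite set \Gamma of alpha vectors, now defined over interactive states, and the point-based step retains only the vectors that are optimal at the selected belief points; a sketch of the standard form, not the paper's exact equations:

    \[ V(b_{i,l}) = \max_{\alpha \in \Gamma} \sum_{is \in IS_{i,l}} \alpha(is)\, b_{i,l}(is) \]
    \[ \Gamma' = \Big\{ \alpha_b \,:\, b \in B,\ \alpha_b = \arg\max_{\alpha \in \Gamma^{bck}} \alpha \cdot b \Big\} \]

where \Gamma^{bck} is the set of backed-up vectors and B the selected belief points, so |\Gamma'| \le |B| regardless of how many vectors an exact backup would generate.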

Experimental Results
Measured the least time taken to reach a particular performance level in terms of rewards
 – Function of belief points, number of models and horizons
 – Compared with the Interactive Particle Filter (I-PF; Doshi & Gmytrasiewicz '05)
[Plots: Multiagent Tiger Problem, Level 1 and Level 2; dual-processor Xeon, 3.4 GHz, 4 GB RAM, Linux]

Experimental Results
I-PBVI is able to solve for larger horizons than the previous approximation technique

Level 1 (time in secs):
  Problem   H    I-PBVI     I-PF
  Tiger     …    …,290*     403,043*
  MM        …    …,303*     403,360*

Discussion
Interactive PBVI generalizes PBVI to multiagent settings
 – The generalization is not trivial
I-PBVI demonstrates scalable results on toy problems
 – Further testing on realistic applications is within reach
Further improvement is possible by carefully limiting the set of models in Reach(M̃_j, H)
 – The true or an equivalent model must remain in the set, otherwise the belief update may become inconsistent

Thank You
Questions?

Interactive POMDPs
Background
 – Well-known framework for decision making in single agent partially observable settings: POMDP
 – Traditional analysis of multiagent interactions: Game theory
Problem
 – "... there is currently no good way to combine game theoretic and POMDP control strategies." - Russell and Norvig, AI: A Modern Approach, 2nd Ed.

Interactive POMDPs: Formal Definition and Relevant Properties
Belief Update: The belief update function for I-POMDP_i involves:
 – Prediction: use the other agent's model to predict its action(s), and anticipate the other agent's observations and how it updates its model
 – Correction: use your own observations to correct your beliefs
Policy Computation
 – Analogous to POMDPs (given the new belief update)
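For reference, the prediction and correction steps combine into the standard finitely nested I-POMDP belief update (written here following the JAIR '05 notation as a sketch of the standard form, not the slide's exact equations):

    \[ b_{i,l}^{t}(is^{t}) = \beta \sum_{is^{t-1}} b_{i,l}^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_{j,l-1}^{t-1}) \, T(s^{t-1}, a_i^{t-1}, a_j^{t-1}, s^{t}) \, O_i(s^{t}, a_i^{t-1}, a_j^{t-1}, o_i^{t}) \]
    \[ \qquad \times \sum_{o_j^{t}} O_j(s^{t}, a_i^{t-1}, a_j^{t-1}, o_j^{t}) \, \tau\big(b_{j,l-1}^{t-1}, a_j^{t-1}, o_j^{t}, b_{j,l-1}^{t}\big) \]

Here \beta is a normalizer, \Pr(a_j \mid \theta_{j,l-1}) comes from solving j's model (prediction), the O_j and \tau terms anticipate j's observation and the resulting update of its belief (prediction), and O_i weights by i's own observation (correction).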