Using Reinforcement Learning to Model True Team Behavior in Uncertain Multiagent Settings in Interactive DIDs. Muthukumaran Chandrasekaran, THINC Lab, Department of Computer Science, The University of Georgia.

Presentation transcript:

Using Reinforcement Learning to Model True Team Behavior in Uncertain Multiagent Settings in Interactive DIDs
Muthukumaran Chandrasekaran, THINC Lab, Department of Computer Science, The University of Georgia

INTRODUCTION
This work's contribution is two-fold:
1. We investigate modeling true team behavior of agents using reinforcement learning (RL) in the context of the Interactive Dynamic Influence Diagram (I-DID) framework for solving multiagent decision-making problems in uncertain settings. This presents a non-trivial problem: the learning agents must be set up so that they can learn about external factors, such as the existence of other agents, thereby modeling teamwork in a non-traditional but systematic way.
2. So far, we have only experimented with I-DIDs in 2-agent settings. We therefore seek to show the framework's flexibility by designing and testing I-DIDs for scenarios with 3 or more agents.

BACKGROUND
I-DIDs have nodes (decision: rectangle, chance: oval, utility: diamond, model: hexagon), arcs (functional, conditional, informational), and links (policy: dashed, model update: dotted), as shown in Fig. 1. I-DIDs are the graphical counterparts of I-POMDPs [1]. Fig. 1 shows a two-time-slice level l I-DID. To induce team behavior, our algorithm learns the level 0 policies with a variant of the RL algorithm Monte Carlo Exploring Starts for POMDPs (MCESP) [2], which uses a new definition of action value that provides information about the value of policies in a local neighborhood of the current policy. Fig. 4 shows an overview of RL in POMDPs.

APPROACH
Implementing Teamwork: A Paradox
Are we modeling true teamwork correctly? This issue is common to all existing sequential decision-making frameworks. Consider a 2-agent team setting in which agents i and j reason at levels 1 and 0, respectively. According to j, it is acting alone in the environment, so technically agent j does not know that it is part of a team. Agent i (which models j) is therefore forced to follow j's actions in order to maximize the joint reward, even though this may not be the team's optimal joint action.

Algorithm
When the environment emits a joint observation o ∈ O, indicative of how the environment has changed after both agents performed their actions in the previous step, the agents choose a joint action a from a finite set A(o) based on these observations. A deterministic reactive team policy π maps each o ∈ O to some a ∈ A(o). The goal of the agents is to compute a team policy π : o → a from the experience gathered. The algorithm assumes a start-state distribution for the agents and episodic tasks with one or more terminal states. First, agent j performs its actions according to the MCESP algorithm, and agent i performs its actions based on the Dec-POMDP policy of the other agent, j', that i assumes is part of its team. The Dec-POMDP policy is computed with the JESP algorithm [3] using the MADP toolkit [4]. Based on the agents' joint actions, the next state is sampled using the transition function, followed by observations for both agents using the observation function. Agent j then follows MCESP while agent i follows agent j''s Dec-POMDP policy to make their respective next decisions. In this way, a trajectory τ = {o_0, a_0, r_0, o_1, a_1, r_1, ..., o_T} (where o_T is an observation corresponding to a terminal state) is generated for the MCESP algorithm.
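To make the episode-generation step above concrete, the following Python code is a minimal, hypothetical sketch, not the authors' implementation: level-0 agent j acts from its current MCESP reactive policy, agent i acts from a precomputed Dec-POMDP (JESP) policy, and the environment produces rewards, the next state, and the next joint observation. The names rollout_episode, sample_transition, sample_observation, reward, and is_terminal are illustrative placeholders, and both policies are assumed reactive (conditioning only on the last observation) for simplicity.

```python
import random

def rollout_episode(start_dist, pi_j_mcesp, pi_i_decpomdp,
                    sample_transition, sample_observation, reward,
                    is_terminal, max_steps=50):
    """Generate one trajectory tau = [o_0, a_0, r_0, o_1, a_1, r_1, ..., o_T]."""
    # sample the initial state from the assumed start-state distribution
    state = random.choices(list(start_dist), weights=list(start_dist.values()))[0]
    o = sample_observation(state, None)      # initial joint observation (o_i, o_j)
    tau = [o]
    for _ in range(max_steps):
        if is_terminal(o):
            break
        a_i = pi_i_decpomdp[o[0]]            # agent i: Dec-POMDP policy it attributes to the team
        a_j = pi_j_mcesp[o[1]]               # agent j: level-0 MCESP reactive policy
        joint_a = (a_i, a_j)
        r = reward(state, joint_a)                    # joint reward r_t
        state = sample_transition(state, joint_a)     # s_{t+1} ~ T(. | s_t, a_i, a_j)
        o = sample_observation(state, joint_a)        # (o_i, o_j) ~ O(. | s_{t+1}, a_i, a_j)
        tau.extend([joint_a, r, o])
    return tau
```

Each call returns one trajectory that can then be fed to the MCESP update described next.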
The rest of the MCESP procedure is similar to the original one described in [2]; a sketch of this update step appears at the end of this transcript.

EXPERIMENTS
Test domains: variants of the Multiagent Box Pushing (Fig. 3) and the Multiagent Meeting in a Grid (Fig. 2) problems.
Implementation: Exact-BE [1] and MCESP [2], using the HUGIN C++ API for Dynamic Influence Diagrams (DIDs).

DISCUSSION
1. This work is significant because it bridges the Interactive POMDP and Decentralized POMDP frameworks.
2. We model true team behavior of agents using RL in the context of I-DIDs; the approach can be extended to any multiagent decision-making framework.
3. In the future, we also hope to show the flexibility of the I-DID framework by conducting scalability tests and extending it to scenarios containing 3 or more agents.

ACKNOWLEDGMENTS
I thank Dr. Prashant Doshi, Dr. Yifeng Zeng, and his students for their valuable contributions to the implementation of this work. This research is partially supported by the Obel Family Foundation (Denmark) and NSFC grants (# and #) to Dr. Yifeng Zeng, and an NSF CAREER grant (#IIS) to Dr. Prashant Doshi.

REFERENCES
1. P. Doshi, Y. Zeng, and Q. Chen, Graphical models for interactive POMDPs: Representations and solutions, JAAMAS, 2009.
2. T. J. Perkins, Reinforcement learning for POMDPs based on action values and stochastic optimization, AAAI, 2002.
3. R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella, Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings, IJCAI, 2003.
4. M. T. J. Spaan and F. A. Oliehoek, The MultiAgent Decision Process toolbox, MSDM Workshop at AAMAS, 2008.

Fig. 1. A generic two-time-slice level l I-DID.
Fig. 4. RL in POMDPs (using MCESP).
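Finally, as referenced in the Approach section above, the following Python snippet is a minimal, hypothetical sketch in the spirit of Perkins' MCESP [2], not its exact procedure: the discounted return from the first occurrence of the explored observation is attributed to the explored observation-action pair, the corresponding action-value estimate is updated with an averaging learning rate, and the reactive policy is changed at that observation only if the explored action now looks clearly better. The return definition, learning-rate schedule, and threshold eps are assumptions for illustration; o and a here denote the observation and action the learning agent (agent j) conditions on, and extracting them from the joint trajectory is domain-specific and omitted.

```python
def return_from(tau, o_explored, gamma=0.95):
    """Discounted return accumulated from the first occurrence of o_explored."""
    observations = tau[0::3]          # tau = [o_0, a_0, r_0, o_1, a_1, r_1, ..., o_T]
    rewards = tau[2::3]
    if o_explored not in observations:
        return None                   # the explored observation never occurred
    k = observations.index(o_explored)
    return sum(gamma ** t * r for t, r in enumerate(rewards[k:]))

def mcesp_update(Q, counts, pi, tau, o_explored, a_explored, eps=1e-3):
    """Update the action-value estimate and perform a local policy-improvement test."""
    g = return_from(tau, o_explored)
    if g is None:
        return pi
    key = (o_explored, a_explored)
    counts[key] = counts.get(key, 0) + 1
    alpha = 1.0 / counts[key]                                # averaging learning rate (assumed)
    Q[key] = Q.get(key, 0.0) + alpha * (g - Q.get(key, 0.0))
    current = (o_explored, pi[o_explored])
    if Q[key] > Q.get(current, 0.0) + eps:                   # local-neighborhood comparison
        pi[o_explored] = a_explored
    return pi
```

In an outer loop, each episode would pick one observation-action pair to explore (exploring starts), generate a trajectory with the rollout sketched earlier, and then call mcesp_update on that trajectory before the next episode.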