Learning Team Behavior Using Individual Decision Making in Multiagent Settings Using Interactive DIDs. Muthukumaran Chandrasekaran, THINC Lab, Department of Computer Science, The University of Georgia.

Presentation transcript:

Learning Team Behavior Using Individual Decision Making in Multiagent Settings Using Interactive DIDs
Muthukumaran Chandrasekaran, THINC Lab, Department of Computer Science, The University of Georgia

INTRODUCTION
Individual decision making in multiagent settings faces the task of reasoning about other agents' actions, where those agents may themselves be reasoning about others. An approximation that makes this approach tractable is to bound the otherwise infinite nesting of beliefs from below by introducing level 0 models. A consequence of this finitely nested modeling is that we may not obtain optimal team solutions in cooperative settings. We address this limitation by including models at level 0 whose solutions are obtained by learning, and we demonstrate that integrating learning with planning facilitates optimal team behavior. We investigate this approach within the framework of interactive dynamic influence diagrams (I-DIDs) and evaluate its performance.

BACKGROUND
I-DIDs contain decision nodes (rectangles), chance nodes (ovals), utility nodes (diamonds), and model nodes (hexagons); functional, conditional, and informational arcs; and policy links (dashed) and model-update links (dotted). I-DIDs are the graphical counterparts of I-POMDPs [1].

TEAMWORK IN INTERACTIVE DIDS
Teamwork involves multiple agents working collaboratively to optimize the team reward. Each agent in the team behaves according to a policy, which maps the agent's observation history or beliefs to the action(s) it should perform. We begin by showing that the finitely nested hierarchy in I-DIDs (I-POMDPs) does not facilitate team behavior. However, augmenting the traditional model space with models whose solutions are obtained via reinforcement learning (RL) provides a way for team behavior to emerge.

Implausibility of teamwork. Proposition 1: There exist cooperative multiagent settings in which intentional agents, each modeled using a finitely nested I-DID (or I-POMDP), may not choose the jointly optimal behavior of working together as a team.

APPROACH: AUGMENTED I-DID SOLUTION
To induce team behavior, our algorithm learns the level 0 policies using a variant of the RL algorithm Monte-Carlo Exploring Starts for POMDPs (MCESP) [2], which uses a redefined action value that captures the value of policies in a local neighborhood of the current policy. Solving augmented I-DIDs proceeds as for traditional I-DIDs, except that the candidate models of the other agent at level 0 may be learning models. For learning at level 0, we assume that agent i's policy is hidden from agent j and is treated as part of the environment. Because i's policy space may be extremely large, we use heuristics to obtain a subset of those policies and create correspondingly many candidate models of j for i's I-DID. We may further reduce agent j's policy space by keeping only the top-K policies of j, K > 0, ranked by expected utility. Proposition 2: The top-K policies of the level 0 models of agent j, given the same initial beliefs and K > 0, guarantee inclusion of j's optimal team policy, and hence the optimal team behavior of agent i at level 1.
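The poster does not spell out the level 0 learning procedure itself, so the following is a minimal, hypothetical Python sketch of an MCESP-style learner for a memoryless level 0 policy, in the spirit of Perkins [2] rather than the authors' actual implementation. The simulator interface (env.reset(), env.step()) and all parameter values are assumptions; the point illustrated is that returns are sampled for policies in the local neighborhood of the current policy (those differing at a single observation), and the policy is switched when a neighboring policy's estimated action value is higher.

```python
import random
from collections import defaultdict

def mcesp_level0_sketch(env, observations, actions, episodes=5000, gamma=0.95, horizon=25):
    """Learn a memoryless level 0 policy (observation -> action).

    env          : hypothetical generative simulator with reset() -> obs and
                   step(action) -> (obs, reward, done)
    observations : list of possible observations
    actions      : list of possible actions
    """
    # Start from an arbitrary deterministic policy.
    policy = {o: random.choice(actions) for o in observations}
    # returns[(o, a)] holds sampled returns of the neighboring policy that takes
    # action a whenever o is observed and follows `policy` otherwise.
    returns = defaultdict(list)

    for _ in range(episodes):
        # Exploring start: pick one observation-action pair to deviate on.
        o_dev, a_dev = random.choice(observations), random.choice(actions)
        obs, total, discount = env.reset(), 0.0, 1.0
        for _ in range(horizon):
            act = a_dev if obs == o_dev else policy[obs]
            obs, reward, done = env.step(act)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns[(o_dev, a_dev)].append(total)

        # Improve the policy at o_dev over its local neighborhood.
        q = {a: sum(returns[(o_dev, a)]) / len(returns[(o_dev, a)])
             for a in actions if returns[(o_dev, a)]}
        best = max(q, key=q.get)
        if q[best] > q.get(policy[o_dev], float("-inf")):
            policy[o_dev] = best
    return policy
```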
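Similarly, a minimal sketch of the top-K pruning step described above, under assumed interfaces: solve and expected_utility are placeholders for whatever level 0 solver and policy-evaluation routine are used.

```python
def top_k_level0_policies(candidate_models, solve, expected_utility, k):
    """Keep the K highest-valued level 0 policies of agent j.

    candidate_models : level 0 models of agent j (e.g., DIDs or learned models)
    solve            : hypothetical function mapping a model to its policy
    expected_utility : hypothetical function scoring a policy from j's initial belief
    """
    policies = [solve(model) for model in candidate_models]
    policies.sort(key=expected_utility, reverse=True)
    # Only these K policies are embedded as candidate models in i's level 1 I-DID.
    return policies[:k]
```

Keeping only K policies bounds the number of candidate models that must be carried in agent i's level 1 I-DID, which is where the added solution complexity of the augmented model grows (cf. Fig. 1).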
RESULTS / DISCUSSION
Experiments: Table 1 shows the experimental setup. Table 2 and Fig. 1 show results for the multiagent Box Pushing (BP), Grid Meeting (Grid), and Multi-Access Broadcast Channel (MABC) problems.

Table 1: Domain dimensions and experimental settings.
Table 2: Performance comparison; the augmented I-DIDs achieve near-optimal expected utility where the traditional I-DIDs fail.
Fig. 1: The top-K method reduces the added solution complexity of the augmented I-DID.

Contribution: We bridge a gap in the applicability of individual decision-making frameworks (e.g., I-POMDP, I-DID) by enabling them to reach globally optimal solutions exactly in cooperative settings, which was previously not possible because the level 0 models used in the hierarchy were insufficiently expressive.

ACKNOWLEDGMENTS
I thank Dr. Prashant Doshi, Dr. Yifeng Zeng, and his students for their valuable contributions to the implementation of this work.

REFERENCES
1. P. Doshi, Y. Zeng, and Q. Chen. Graphical models for interactive POMDPs: Representations and solutions. JAAMAS.
2. T. J. Perkins. Reinforcement learning for POMDPs based on action values and stochastic optimization. AAAI.