1 Graphical Models for Online Solutions to Interactive POMDPs. Prashant Doshi (University of Georgia, USA), Yifeng Zeng (Aalborg University, Denmark), Qiongyu Chen (National University of Singapore). International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007).

2 Decision-Making in Multiagent Settings. [Figure: agents i and j both act on a shared state S; agent i chooses actions A_i and receives observations O_i, agent j chooses actions A_j and receives observations O_j; each agent maintains a belief over the state and a model of the other agent, and acts to optimize its preferences given those beliefs.]

3 Finitely Nested I-POMDP (Gmytrasiewicz & Doshi, 2005). A finitely nested I-POMDP of agent i with strategy level l is a tuple ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i⟩: the interactive states IS_{i,l} combine beliefs about the physical environment with beliefs about other agents in terms of their preferences, capabilities, and beliefs (their types); A is the set of joint actions; Ω_i the possible observations; T_i the transition function S × A × S → [0,1]; O_i the observation function S × A × Ω_i → [0,1]; and R_i the reward function S × A → ℝ.
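In standard finitely nested I-POMDP notation (sketched here for reference, since the slide's symbols did not survive extraction), the interactive state space and agent types can be written as:

    IS_{i,l} = S \times M_{j,l-1}, \qquad
    M_{j,l-1} = \Theta_{j,l-1} \cup SM_j, \qquad
    \theta_{j,l-1} = \langle b_{j,l-1}, \hat{\theta}_j \rangle

where Θ_{j,l-1} are j's intentional models (types), each consisting of a level l-1 belief b_{j,l-1} and a frame \hat{θ}_j, and SM_j are subintentional models of j.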

4 Belief Update
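The update equation on this slide did not survive extraction; for reference, the finitely nested I-POMDP belief update (Gmytrasiewicz & Doshi, 2005) has roughly the form:

    b_{i,l}^{t}(is^{t}) \;\propto\; \sum_{is^{t-1}} b_{i,l}^{t-1}(is^{t-1})
    \sum_{a_j^{t-1}} \Pr\!\big(a_j^{t-1} \mid \theta_{j,l-1}^{t-1}\big)\,
    T_i(s^{t-1}, a^{t-1}, s^{t})\, O_i(s^{t}, a^{t-1}, o_i^{t})
    \sum_{o_j^{t}} O_j(s^{t}, a^{t-1}, o_j^{t})\,
    \tau\!\big(b_{j,l-1}^{t-1}, a_j^{t-1}, o_j^{t}, b_{j,l-1}^{t}\big)

where τ(·) equals 1 when j's own belief update with (a_j^{t-1}, o_j^{t}) produces the belief contained in is^{t}, and 0 otherwise.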

5 Forget It! A different approach: for the belief update, use the language of Influence Diagrams (IDs) to represent the problem more transparently; for the solution, use standard ID algorithms.

6 Challenges. Representation of nested models of other agents: the influence diagram is a single-agent-oriented language. Updating beliefs over the models of other agents: new models of others arise because, over time, agents revise their beliefs over the models of others as they receive observations.

7 Related Work. Multiagent Influence Diagrams (MAIDs) (Koller & Milch, 2001): use IDs to represent incomplete-information games and compute Nash equilibrium solutions efficiently by exploiting conditional independence. Networks of Influence Diagrams (NIDs) (Gal & Pfeffer, 2003): allow uncertainty over the game and multiple models of an individual agent; the solution involves collapsing the models into a MAID or ID. Both model static, single-play games and do not consider agent interactions over time (sequential decision-making).

8 Introduce the Model Node and the Policy Link. A generic level-l Interactive ID (I-ID) for agent i situated with one other agent j. Model node M_{j,l-1}: the models of agent j at level l-1. Policy link (dashed line): the distribution over the other agent's actions given its models. Beliefs over M_{j,l-1}, P(M_{j,l-1} | s): how are they updated? [Figure: level-l I-ID with decision node A_i, utility node R_i, chance nodes O_i, S, A_j, and model node M_{j,l-1}.]

9 Details of the Model Node. Members of the model node: the different chance nodes (A_j^1, A_j^2) are the solutions of the models m_{j,l-1}^1 and m_{j,l-1}^2, and Mod[M_j] represents the different models of agent j. The CPT of the chance node A_j is a multiplexer: it assumes the distribution of each of the action nodes (A_j^1, A_j^2) depending on the value of Mod[M_j]. [Figure: model node M_{j,l-1} containing Mod[M_j] and the action nodes A_j^1, A_j^2 of models m_{j,l-1}^1 and m_{j,l-1}^2, which could themselves be I-IDs or IDs, feeding the chance node A_j.]
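To make the multiplexer concrete, here is a minimal sketch (illustrative names, not the authors' implementation) of its effect: conditioning A_j on Mod[M_j] copies the action distribution of the selected model's solution, so marginalizing over Mod[M_j] yields a mixture weighted by the beliefs over j's models.

    # Effect of the multiplexer CPT: P(A_j | Mod[M_j] = m) equals the action
    # distribution of model m's solution; marginalizing over Mod[M_j] gives a
    # mixture weighted by the probabilities assigned to j's candidate models.
    def marginal_action_distribution(mod_belief, model_policies):
        """mod_belief: model id -> P(Mod[M_j] = m)
           model_policies: model id -> {action: probability}, the solution of model m."""
        p_aj = {}
        for m, p_m in mod_belief.items():
            for action, p_a in model_policies[m].items():
                p_aj[action] = p_aj.get(action, 0.0) + p_m * p_a
        return p_aj

    # Example with two candidate level-(l-1) models of j in the tiger problem:
    mod_belief = {"m1": 0.6, "m2": 0.4}
    model_policies = {"m1": {"L": 1.0},             # m1's solution: always listen
                      "m2": {"L": 0.5, "OL": 0.5}}  # m2's solution: mixed
    print(marginal_action_distribution(mod_belief, model_policies))
    # {'L': 0.8, 'OL': 0.2}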

10 The Whole I-ID. [Figure: the complete level-l I-ID with the model node expanded: decision node A_i, utility node R_i, chance nodes O_i, S, A_j, Mod[M_j], and the action nodes A_j^1, A_j^2 of the lower-level models m_{j,l-1}^1 and m_{j,l-1}^2, which could be I-IDs or IDs.]

11 Interactive Dynamic Influence Diagrams (I-DIDs). [Figure: two time slices of an I-DID; the nodes A_i^t, O_i^t, S^t, A_j^t, M_{j,l-1}^t and R_i at time t are connected to their counterparts at time t+1, and the model nodes across the two slices are connected by the model update link.]

12 Semantics of the Model Update Link. [Figure: the model node M_{j,l-1}^t at time t contains two models m_{j,l-1}^{t,1} and m_{j,l-1}^{t,2} together with j's action node A_j^t and observation node O_j; the model node M_{j,l-1}^{t+1} at time t+1 contains four models m_{j,l-1}^{t+1,1} through m_{j,l-1}^{t+1,4}.] These updated models differ in their initial beliefs, each of which is the result of j updating its beliefs due to its actions and possible observations.

13 Notes. The updated set of models at time step t+1 will have at most |M_j^t| |A_j| |Ω_j| models, where |M_j^t| is the number of models at time step t, |A_j| is the largest space of actions, and |Ω_j| is the largest space of observations. The new distribution over the updated models uses the original distribution over the models, the probability of the other agent performing the action, and the probability of it receiving the observation that led to the updated model.
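A minimal sketch of this bookkeeping (illustrative names and a simplified observation term, not the paper's implementation): each model at time t spawns one candidate model per action-observation pair, weighted by the original model probability, the action probability under that model's solution, and the observation probability.

    # Expand j's models over one step: at most |M_t| * |A_j| * |Omega_j| candidates.
    def expand_models(models_t, model_policies, observations_j, p_obs):
        """models_t: model id -> probability at time t
           model_policies: model id -> {action: probability} (that model's solution)
           observations_j: list of j's possible observations
           p_obs: function (a_j, o_j) -> probability of j observing o_j after a_j
                  (in the full model this also depends on the state transition)."""
        weights = {}
        for m, p_m in models_t.items():
            for a_j, p_a in model_policies[m].items():
                for o_j in observations_j:
                    # (m, a_j, o_j) identifies the model obtained by j's belief update
                    weights[(m, a_j, o_j)] = p_m * p_a * p_obs(a_j, o_j)
        total = sum(weights.values())
        if total == 0:
            return {}
        return {k: w / total for k, w in weights.items()}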

14 [Figure: the two-time-slice I-DID with the model nodes expanded, showing how the models m_{j,l-1}^{t,1} and m_{j,l-1}^{t,2} at time t, combined with j's actions (A_j^1, A_j^2) and observations (O_j^1, O_j^2), produce the updated models m_{j,l-1}^{t+1,1} through m_{j,l-1}^{t+1,4} at time t+1.]

15 Example Applications: Emergence of Social Behaviors Followership and Leadership in the persistent multiagent tiger problem Altruism and Reciprocity in the public good problem with punishment Strategies in a simple version of two-player Poker

16 Followership and Leadership in the Multiagent Persistent Tiger Problem. Experimental setup: agent j has a better hearing capability (95% accurate) than agent i (65% accurate); agent i has no initial information about the tiger's location; agent i considers two models of agent j that differ in j's level-0 initial beliefs (j likely thinks the tiger is behind the left door, or j likely thinks it is behind the right door). Solve the corresponding level-1 I-DID expanded over three time steps to obtain the normative behavioral policy of agent i.

17 Level 1 I-ID in the Tiger Problem. [Figure: the level-1 I-ID expanded over three time steps; decision nodes of the lower-level models are mapped to chance nodes.]

18 Policy Tree 1: Agent i has a hearing accuracy of 65%. [Figure: three-step policy tree; branches are labeled with i's growl observations (GL/GR) and j's door creaks or silence (CL/CR/S); i mostly listens (L) and opens the left or right door (OL/OR) only on particular growl-creak combinations.] Conditional followership.

19 Policy Tree 2: Agent i loses its hearing ability (accuracy is 0.5). [Figure: policy tree in which i's choice to open a door (OL/OR) depends only on j's creaks or silence (CL/CR/S), with i's own growl observations ignored (*).] Unconditional (blind) followership.

20 Example 2: Altruism and Reciprocity in the Public Good Problem. Public good game: two agents are initially endowed with X_T amount of resources; each agent may choose to contribute (C) a fixed amount of resources to a public pot or not contribute, i.e., defect (D). The agents' actions and the pot are not observable, but each agent receives an observation symbolizing the state of the public pot: plenty (PY) or meager (MR). The value of the resources in the public pot is discounted by c_i (< 1) for each agent i, where c_i is the marginal private return. To encourage contributions, the contributing agents punish free riders (P) but incur a small cost c_p for administering the punishment.
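As a concrete illustration of these quantities, the sketch below assumes a standard public-goods payoff form; the specific formula and the numeric values of X_T, c_i, P, and c_p are illustrative assumptions, not taken from the paper.

    # One-step payoff sketch for the two-agent public good game with punishment.
    def payoff(my_action, other_action, x_t=10.0, contribution=1.0,
               c_i=0.9, punishment=3.0, c_p=0.5):
        """my_action, other_action: 'C' (contribute) or 'D' (defect)."""
        pot = contribution * ((my_action == "C") + (other_action == "C"))
        kept = x_t - (contribution if my_action == "C" else 0.0)  # private resources retained
        value = kept + c_i * pot                                  # discounted return from the pot
        if my_action == "D" and other_action == "C":
            value -= punishment          # free rider is punished (P)
        if my_action == "C" and other_action == "D":
            value -= c_p                 # contributor pays the cost of punishing
        return value

    # Against a contributor, defecting triggers the punishment and pays less:
    print(payoff("C", "C"), payoff("D", "C"), payoff("C", "D"), payoff("D", "D"))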

21 Agent Types: altruistic and non-altruistic. An altruistic agent has a high marginal private return (c_i is close to 1) and does not punish others who defect. Optimal behavior: with one action remaining, both types of agents choose to contribute to avoid being punished; with two actions to go, the altruistic type chooses to contribute while the other defects (why?); with three steps to go, the altruistic agent contributes to avoid punishment and the non-altruistic type defects; with more than three steps to go, the altruistic agent continues to contribute to the public pot depending on how close its marginal return is to 1, while the non-altruistic type prescribes defection.

22 Level 1 I-ID in the Public Good Game Expand over three time steps

23 Policy Tree 1: Altruism in the Public Good game. If agent i (altruistic type) believes with probability 1 that j is altruistic, i chooses to contribute at each of the three steps. This behavior persists when i is unsure of whether j is altruistic, and even when i assigns a high probability to j being the non-altruistic type. [Figure: policy tree with contribute (C) at every node, regardless of observations (*).]

24 Policy Tree 2: Reciprocal Agents. The reciprocal type's marginal private return is lower, and it obtains a greater payoff when its action matches that of the other agent. Experimental setup: consider the case where the reciprocal agent i is unsure of whether j is altruistic and believes that the public pot is likely to be half full. Optimal behavior: from this prior belief, i chooses to defect; on receiving an observation of plenty, i decides to contribute, while an observation of meager makes it defect; with one action to go, i believes that j contributes and chooses to contribute too, to avoid punishment, regardless of its observations. [Figure: three-step policy tree consistent with this behavior, with branches labeled by the observations plenty (PY) and meager (MR) and nodes labeled contribute (C) or defect (D).]

25 Conclusion and Future Work. I-DIDs: a general ID-based formalism for sequential decision-making in multiagent settings, the online counterparts of I-POMDPs. Future work: solving I-DIDs approximately for computational efficiency (see the AAAI '07 paper on model clustering) and applying I-DIDs to other application domains. Visit our poster on I-DIDs today for more information.

26 Thank You!