Markov Models for Multi-Agent Coordination
Maayan Roth
Multi-Robot Reading Group
April 13, 2005

Multi-Agent Coordination
Teams:
– Agents work together to achieve a common goal
– No individual motivations
Objective:
– Generate policies (individually or globally) to yield the best team performance

Observability [Pynadath and Tambe, 2002]
“Observability” - the degree to which agents can, either individually or as a team, identify the current world state
– Individual observability: MMDP [Boutilier, 1996]
– Collective observability: the combined observations O1 + O2 + … + On uniquely identify the state, e.g. [Xuan, Lesser, Zilberstein, 2001]
– Collective partial observability: DEC-POMDP [Bernstein et al., 2000], POIPSG [Peshkin, Kim, Meuleau, Kaelbling, 2000]
– Non-observability

Communication [Pynadath and Tambe, 2002]
“Communication” - explicit message-passing from one agent to another
– Free communication: no cost to send messages; transforms an MMDP to an MDP and a DEC-POMDP to a POMDP
– General communication: communication is available but has a cost or is limited
– No communication: no explicit message-passing

Taxonomy of Cooperative MAS

                          No Communication    General Communication    Free Communication
Full Observability        MMDP                MMDP                     MDP
Collective Observability  DEC-MDP             DEC-MDP                  MDP
Partial Observability     DEC-POMDP           DEC-POMDP                POMDP
Unobservability           Open-loop control?

Complexity

                 Individually Observable   Collectively Observable   Collectively Partially Observable
No Comm.         P-complete                NEXP-complete             NEXP-complete
General Comm.    P-complete                NEXP-complete             NEXP-complete
Free Comm.       P-complete                P-complete                PSPACE-complete

Multi-Agent MDP (MMDP) [Boutilier, 1996]
Also called an IPSG (identical payoff stochastic game)
M = <S, {A_i}, T, R>
– S is the set of possible world states
– {A_i}, i ≤ m, defines the set of joint actions, where a_i ∈ A_i
– T defines transition probabilities over joint actions
– R is the team reward function
The state is fully observable by each agent
P-complete
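
For concreteness (this is not from the original slides), the MMDP tuple can be sketched as a small Python data structure; the field and type names below are my own illustrative choices.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    JointAction = Tuple[str, ...]          # one individual action a_i per agent

    @dataclass
    class MMDP:
        states: List[str]                                      # S
        agent_actions: List[List[str]]                         # A_i for each agent i
        transition: Callable[[str, JointAction, str], float]   # T(s' | s, a)
        reward: Callable[[str, JointAction], float]            # R(s, a), shared by the team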

Coordination Problems
Multiple equilibria – two optimal joint policies
[Figure: small example MDP over states s1, s2, … illustrating the two optimal joint policies]

Randomization with Learning
Expand the MMDP to include an FSM that selects actions based on whether the agents are coordinated or uncoordinated
Build the expanded MMDP by adding a coordination mechanism at every state where a coordination problem is discovered
[Figure: coordination FSM with an uncoordinated state U, coordinated states A and B, and action random(a, b) in U]
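
A minimal sketch of such a coordination FSM, assuming the convention that agents randomize between the two equilibrium actions while uncoordinated and commit once coordination is observed; the state names U, A, B follow the figure, but the update rule shown is my assumption rather than Boutilier's exact construction.

    import random

    class CoordinationFSM:
        """Tiny coordination FSM: randomize while uncoordinated, then commit."""

        def __init__(self):
            self.state = "U"                       # U = uncoordinated

        def act(self):
            if self.state == "U":
                return random.choice(["a", "b"])   # random(a, b) while uncoordinated
            return self.state.lower()              # play a in state A, b in state B

        def observe(self, my_action, other_action):
            # Once both agents happen to pick the same action, lock it in.
            if self.state == "U" and my_action == other_action:
                self.state = my_action.upper()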

Multi-Agent POMDP
DEC-POMDP [Bernstein et al., 2000], MTDP [Pynadath et al., 2002], POIPSG [Peshkin et al., 2000]
M = <S, {A_i}, T, {Ω_i}, O, R>
– S is the set of possible world states
– {A_i}, i ≤ m, is the set of joint actions, where a_i ∈ A_i
– T defines transition probabilities over joint actions
– {Ω_i}, i ≤ m, is the set of joint observations, where ω_i ∈ Ω_i
– O defines observation probabilities over joint actions and joint observations
– R is the team reward function
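
Extending the earlier sketch with observations gives the multi-agent POMDP tuple; again the field names are my own illustrative choices, not part of any of the cited formalisms.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    JointAction = Tuple[str, ...]
    JointObservation = Tuple[str, ...]     # one individual observation per agent

    @dataclass
    class DecPOMDP:
        states: List[str]                                       # S
        agent_actions: List[List[str]]                          # A_i
        agent_observations: List[List[str]]                     # Omega_i
        transition: Callable[[str, JointAction, str], float]    # T(s' | s, a)
        observation: Callable[[str, JointAction, JointObservation], float]  # O(omega | s', a)
        reward: Callable[[str, JointAction], float]             # R(s, a)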

Complexity [Bernstein, Zilberstein, Immerman, 2000]
For all m ≥ 2, a DEC-POMDP with m agents is NEXP-complete
– “For a given DEC-POMDP with finite time horizon T and an integer K, is there a joint policy which yields a total reward ≥ K?”
– Agents must reason about all possible actions and observations of their teammates
For all m ≥ 3, a DEC-MDP with m agents is NEXP-complete

Dynamic Programming [Hansen, Bernstein, Zilberstein, 2004]
Finds the optimal solution
Partially observable stochastic game (POSG) framework
– Iterate between a DP step and a pruning step that removes dominated strategies
Experimental results show a speed-up in some domains
– Still NEXP-complete, so hyper-exponential in the worst case
No communication
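
The pruning step can be illustrated with a deliberately simplified sketch: the actual algorithm tests dominance with a linear program over distributions of beliefs and opponent strategies, whereas the toy version below only removes strategies whose value vectors are pointwise dominated.

    def prune_pointwise_dominated(value_vectors):
        """Keep only strategies whose value vector is not pointwise dominated.
        value_vectors: dict mapping a strategy id to its vector of values (one
        entry per state / opponent strategy). The real pruning step uses a
        linear program; pointwise comparison is a simplification."""
        kept = {}
        for name, vec in value_vectors.items():
            dominated = any(
                other != name
                and all(o >= v for o, v in zip(ovec, vec))
                and any(o > v for o, v in zip(ovec, vec))
                for other, ovec in value_vectors.items()
            )
            if not dominated:
                kept[name] = vec
        return kept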

Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
DEC-MDP – collective observability
Transition independence:
– Local state transitions: each agent observes its local state; individual actions only affect local state transitions
– The team is connected through the joint reward
Coverage set algorithm – finds the optimal policy quickly in experimental domains
No communication

Efficient Policy Computation [Nair, Pynadath, Yokoo, Tambe, Marsella, 2003]
Joint Equilibrium-Based Search for Policies (JESP):
– Initialize random policies for all agents
– For each agent i: fix the policies of all other agents and find the optimal policy for agent i
– Repeat until no policies change
Finds locally optimal policies, with potentially exponential speedups over finding the global optimum
No communication
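
The alternating best-response structure of JESP can be summarized schematically as below; best_response is a placeholder for the single-agent solve obtained by fixing all other agents' policies, which in the real algorithm is itself a dynamic-programming step over an extended belief state.

    def jesp(initial_policies, best_response, max_iters=100):
        """Joint Equilibrium-based Search for Policies (schematic outline).
        initial_policies : one (random) policy per agent
        best_response(i, policies) : agent i's optimal policy when every other
        agent's policy is held fixed."""
        policies = list(initial_policies)
        for _ in range(max_iters):
            changed = False
            for i in range(len(policies)):
                new_policy = best_response(i, policies)
                if new_policy != policies[i]:
                    policies[i] = new_policy
                    changed = True
            if not changed:
                break        # locally optimal: no agent can improve unilaterally
        return policies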

Communication [Pynadath and Tambe, 2002]
COM-MTDP framework for analyzing communication
– Enhance the model with {Σ_i}, i ≤ m, where Σ_i is the set of possible messages agent i can send
– The reward function includes the cost (≤ 0) of communication
If communication is free:
– The DEC-POMDP is reducible to a single-agent POMDP (PSPACE-complete)
– The optimal communication policy is to communicate at every time step
Under general communication (no restrictions on cost), the DEC-POMDP is still NEXP-complete
– Σ_i may contain every possible observation history

COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
Add a SYNC action to the domain
– If one agent chooses SYNC, all other agents SYNC
– At SYNC, send the entire observation history since the last SYNC
SYNC brings agents to a synchronized belief over world states
Policies are indexed by the root synchronized belief and the observation history since the last SYNC
[Figure: belief tree rooted at t = 0 in the synchronized belief (SL 0.5, SR 0.5); after a = {Listen, Listen} the tree branches on the observations HL/HR, and a SYNC at t = 2 yields a new synchronized belief such as (SL 0.97, SR 0.03)]
“At-most K” heuristic – there must be a SYNC within at most K timesteps

Dec-Comm [Roth, Simmons, Veloso, 2005]
At plan time, assume communication is free
– Generate a joint policy using a single-agent POMDP solver
Reason over possible joint beliefs to execute the policy
Use communication to integrate local observations into the team belief

Tiger Domain: (States, Actions)
Two-agent tiger problem [Nair et al., 2003]:
States: S = {SL, SR} – the tiger is either behind the left door or behind the right door
Individual actions: a_i ∈ {OpenL, OpenR, Listen} – each robot can open the left door, open the right door, or listen

Tiger Domain: (Observations)
Individual observations: ω_i ∈ {HL, HR} – each robot can hear the tiger behind the left door or hear the tiger behind the right door
Observations are noisy and independent.

Tiger Domain: (Reward)
Coordination problem – agents must act together for maximum reward
Maximum reward (+20) when both agents open the door with the treasure
Minimum reward (-100) when only one agent opens the door with the tiger
Listen has a small cost (-1 per agent)
Both agents opening the door with the tiger leads to a medium negative reward (-50)
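
For illustration, the joint reward just described could be coded as below; only the payoffs stated on the slide are encoded, and any action combination the slide does not mention is left unspecified.

    # States: "SL" (tiger behind the left door), "SR" (tiger behind the right door).
    # Individual actions: "OpenL", "OpenR", "Listen".

    def joint_reward(state, a1, a2):
        """Team reward for the two-agent tiger problem, encoding only the
        payoffs stated on the slide; unmentioned combinations return None."""
        tiger = "OpenL" if state == "SL" else "OpenR"      # door hiding the tiger
        treasure = "OpenR" if tiger == "OpenL" else "OpenL"
        if a1 == a2 == "Listen":
            return -2.0                  # -1 per listening agent
        if a1 == a2 == treasure:
            return 20.0                  # both open the treasure door
        if a1 == a2 == tiger:
            return -50.0                 # both open the tiger door together
        if tiger in (a1, a2) and a1 != a2:
            return -100.0                # only one agent opens the tiger door
        return None                      # payoff not specified on the slide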

Joint Beliefs
Joint belief (b_t) – a distribution over world states
Why compute possible joint beliefs?
– Action coordination
– The transition and observation functions depend on the joint action
– An agent can't accurately estimate its belief if the joint action is unknown
To ensure action coordination, agents can only reason over information known by all teammates
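
Concretely, a joint belief can be maintained with the standard Bayesian POMDP belief update, applied to joint actions and joint observations; the sketch below assumes tabular T and O functions and is not taken from the paper's notation.

    def update_joint_belief(belief, joint_action, joint_obs, states, T, O):
        """Bayesian belief update over world states.
        belief : dict state -> probability
        T(s, a, s_next) and O(s_next, a, omega) are the joint-action transition
        and observation probabilities."""
        new_belief = {}
        for s_next in states:
            predicted = sum(T(s, joint_action, s_next) * belief[s] for s in states)
            new_belief[s_next] = O(s_next, joint_action, joint_obs) * predicted
        norm = sum(new_belief.values())
        if norm == 0.0:
            return belief                # joint_obs is impossible under this belief
        return {s: p / norm for s, p in new_belief.items()}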

Possible Joint Beliefs
[Figure: tree of possible joint beliefs. Root L0: P(SL) = 0.5, p(b) = 1.0, ω = {}. After a = {Listen, Listen}, the leaves L1 are: (P(SL) = 0.8, p(b) = 0.29, ω = {HL,HL}), (P(SL) = 0.5, p(b) = 0.21, ω = {HL,HR}), (P(SL) = 0.5, p(b) = 0.21, ω = {HR,HL}), (P(SL) = 0.2, p(b) = 0.29, ω = {HR,HR})]
How should agents select actions over joint beliefs?
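
Growing the tree of possible joint beliefs then amounts to branching every leaf on every joint observation; a rough sketch, with helper names (belief_update, joint_obs_prob) invented for illustration.

    from itertools import product

    def expand_leaves(leaves, joint_action, joint_obs_prob, belief_update,
                      individual_observations):
        """Expand each (belief, prob, histories) leaf over all joint observations.
        joint_obs_prob(belief, a, omega) gives P(omega | belief, a); belief_update
        is, e.g., the update_joint_belief above with T and O already bound."""
        children = []
        for belief, prob, histories in leaves:
            for joint_obs in product(*individual_observations):
                p_obs = joint_obs_prob(belief, joint_action, joint_obs)
                if p_obs == 0:
                    continue                         # this joint observation cannot occur
                child_belief = belief_update(belief, joint_action, joint_obs)
                child_histories = tuple(h + (o,) for h, o in zip(histories, joint_obs))
                children.append((child_belief, prob * p_obs, child_histories))
        return children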

Q-POMDP Heuristic
Select a joint action over the possible joint beliefs
Q-MDP (Littman et al., 1995) – approximate solution to a large POMDP using the underlying MDP
Q-POMDP – approximate solution to a DEC-POMDP using the underlying single-agent POMDP
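
Following the Q-MDP analogy, the heuristic can be written compactly (my notation): with L_t the current set of possible joint beliefs, p(ℓ) the probability of leaf ℓ, and Q(b, a) the value of joint action a in belief b under the centralized POMDP solution,

    \mathrm{Q\text{-}POMDP}(L_t) = \arg\max_{a} \sum_{\ell \in L_t} p(\ell)\, Q(b_\ell, a)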

Q-POMDP Heuristic
[Figure: the tree of possible joint beliefs from the previous slide, with root (P(SL) = 0.5, p(b) = 1.0, ω = {}) and leaves for ω = {HL,HL}, {HL,HR}, {HR,HL}, {HR,HR}]
Choose the joint action by computing the expected reward over all leaves
Agents will independently select the same joint action… but the action choice is very conservative (always {Listen, Listen})
DEC-COMM: use communication to add local observations to the joint belief
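
Putting the pieces together, one execution-time step of the Dec-Comm idea might look roughly like the sketch below; the helper names (q_pomdp, q_value, prune, expand, send) are placeholders of my own, and details such as incorporating teammates' incoming messages are omitted.

    def dec_comm_step(leaves, my_obs_history, q_pomdp, q_value, prune, expand,
                      send, epsilon=1.0):
        """One execution-time step of the Dec-Comm idea (schematic).
        leaves        : possible joint beliefs known to the whole team
        q_pomdp(L)    : joint action maximizing expected Q over the leaves L
        q_value(L, a) : expected Q-value of joint action a over the leaves L
        prune(L, h)   : leaves of L consistent with this agent's own history h
        expand(L, a)  : children of the leaves L under joint action a
        send(h)       : broadcast this agent's observation history to the team"""
        a_nc = q_pomdp(leaves)                   # best action using only common knowledge
        mine = prune(leaves, my_obs_history)     # what this agent additionally knows
        a_c = q_pomdp(mine)
        if q_value(mine, a_c) - q_value(mine, a_nc) > epsilon:
            send(my_obs_history)                 # sharing observations is worth the cost
            leaves = mine                        # team belief now reflects them
        action = q_pomdp(leaves)
        return action, expand(leaves, action)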

Dec-Comm Example
[Figure: the joint belief tree L1; agent 1 has observed HL]
a_NC = Q-POMDP(L1) = {Listen, Listen}
L* = circled nodes (the joint beliefs consistent with agent 1's own observation)
a_C = Q-POMDP(L*) = {Listen, Listen}
Don't communicate

Dec-Comm Example (cont'd)
[Figure: the joint belief tree expanded to level L2 after a second {Listen, Listen}; leaf beliefs range from (P(SL) = 0.97, p(b) = 0.12) down to (P(SL) = 0.16, p(b) = 0.06)]
a = {Listen, Listen}
a_NC = Q-POMDP(L2) = {Listen, Listen}
L* = circled nodes
a_C = Q-POMDP(L*) = {OpenR, OpenR}
V(a_C) - V(a_NC) > ε, so agent 1 communicates

Dec-Comm Example (cont'd)
[Figure: the joint belief tree L2, pruned by the communicated observations]
Agent 1 communicates
Agent 2 communicates
Q-POMDP(L2) = {OpenR, OpenR}
Agents open the right door!

References
Becker, R., Zilberstein, S., Lesser, V., Goldman, C. V. Transition-independent decentralized Markov decision processes. In Proceedings of the International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.
Bernstein, D., Zilberstein, S., Immerman, N. The complexity of decentralized control of Markov decision processes. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
Boutilier, C. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
Hansen, E. A., Bernstein, D. S., Zilberstein, S. Dynamic programming for partially observable stochastic games. In Proceedings of the National Conference on Artificial Intelligence, 2004.
Nair, R., Roth, M., Yokoo, M., Tambe, M. Communication for improving policy computation in distributed POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2004.
Nair, R., Pynadath, D., Yokoo, M., Tambe, M., Marsella, S. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
Peshkin, L., Kim, K.-E., Meuleau, N., Kaelbling, L. P. Learning to cooperate via policy search. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2000.
Pynadath, D. and Tambe, M. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 2002.
Roth, M., Simmons, R., Veloso, M. Reasoning about joint beliefs for execution-time communication decisions. To appear in Proceedings of the International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2005.
Xuan, P., Lesser, V., Zilberstein, S. Communication decisions in multi-agent cooperation: Model and experiments. In Proceedings of the International Joint Conference on Artificial Intelligence, 2001.