Presentation transcript:

University of Massachusetts, Amherst, Department of Computer Science
Achieving Goals in Decentralized POMDPs
Christopher Amato, Shlomo Zilberstein
UMass Amherst, May 14, 2009

Slide 2: Overview
- The importance of goals
- DEC-POMDP model
- Previous work on goals
- Indefinite-horizon DEC-POMDPs
- Goal-directed DEC-POMDPs
- Results and future work

Slide 3: Achieving goals in a multiagent setting
- General setting: the problem proceeds over a sequence of steps until a goal is achieved
- Multiagent setting: can terminate when any number of agents achieve local goals or when all agents achieve a global goal
- Many problems have this structure: meeting or catching a target, cooperatively completing a task
- How do we make use of this structure?

Slide 4: DEC-POMDPs
- Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives a local observation rather than the actual state, along with a joint immediate reward
- [Diagram: both agents act on the environment, agent 1 taking action a1 and observing o1, agent 2 taking action a2 and observing o2, with both receiving the shared reward r]

Slide 5: DEC-POMDP definition
- A two-agent DEC-POMDP can be defined with the tuple M = ⟨S, A1, A2, P, R, Ω1, Ω2, O⟩:
  - S, a finite set of states with designated initial state distribution b0
  - A1 and A2, each agent's finite set of actions
  - P, the state transition model: P(s' | s, a1, a2)
  - R, the reward model: R(s, a1, a2)
  - Ω1 and Ω2, each agent's finite set of observations
  - O, the observation model: O(o1, o2 | s', a1, a2)
- This model can be extended to any number of agents
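As a concrete illustration of the tuple above, here is a minimal Python sketch of a two-agent DEC-POMDP container. The field names and dictionary encodings are assumptions made for illustration; they are not from the talk or the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DecPOMDP:
    """Two-agent DEC-POMDP M = <S, A1, A2, P, R, Omega1, Omega2, O>."""
    states: List[str]                      # S
    actions1: List[str]                    # A1
    actions2: List[str]                    # A2
    b0: Dict[str, float]                   # initial state distribution
    # P(s' | s, a1, a2) as (s, a1, a2) -> {s': probability}
    P: Dict[Tuple[str, str, str], Dict[str, float]]
    # R(s, a1, a2)
    R: Dict[Tuple[str, str, str], float]
    obs1: List[str]                        # Omega1
    obs2: List[str]                        # Omega2
    # O(o1, o2 | s', a1, a2) as (s', a1, a2) -> {(o1, o2): probability}
    O: Dict[Tuple[str, str, str], Dict[Tuple[str, str], float]]
```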

Slide 6: DEC-POMDP solutions
- A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
- Note that planning can be centralized but execution is distributed
- A joint policy is a policy for each agent
- Finite-horizon case: the goal is to maximize expected reward over a finite number of steps
- Infinite-horizon case: discount the reward with a factor γ to keep the sum finite
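A minimal sketch of how a joint policy could be represented and evaluated by simulation, reusing the illustrative DecPOMDP container above. The helper names (sample, joint_policy_value) and the Monte Carlo evaluation are assumptions for illustration, not the paper's method.

```python
import random

def sample(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    r, acc, last = random.random(), 0.0, None
    for outcome, p in dist.items():
        last = outcome
        acc += p
        if r <= acc:
            return outcome
    return last  # guard against floating-point rounding

def joint_policy_value(m, pi1, pi2, gamma=0.95, horizon=50, episodes=1000):
    """Monte Carlo estimate of the discounted value of a joint policy.
    pi1 and pi2 map a tuple of past observations (the agent's local history)
    to an action, matching the Omega* -> A view on the slide."""
    total = 0.0
    for _ in range(episodes):
        s = sample(m.b0)
        h1, h2 = (), ()          # each agent's local observation history
        discount, ret = 1.0, 0.0
        for _ in range(horizon):
            a1, a2 = pi1[h1], pi2[h2]          # assumed complete mappings
            ret += discount * m.R[(s, a1, a2)]
            s = sample(m.P[(s, a1, a2)])       # s is now the next state s'
            o1, o2 = sample(m.O[(s, a1, a2)])
            h1, h2 = h1 + (o1,), h2 + (o2,)
            discount *= gamma
        total += ret
    return total / episodes
```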

Slide 7: Achieving goals
- If the problem terminates after the goal is achieved, how do we model it?
- It is unclear how many steps are needed until termination
- We want to avoid a discount factor: its value is often arbitrary and can change the solution

Slide 8: Previous work
- Some work on goals exists for POMDPs, but for DEC-POMDPs only Goldman and Zilberstein 04
- They modeled problems with goals as finite-horizon and studied the complexity
- The complexity is the same unless agents have independent transitions and observations and one goal is always better
- This assumes negative rewards for non-goal states and a no-op action available at the goal

Slide 9: Indefinite-horizon DEC-POMDPs
- Extend the POMDP assumptions (Patek 01 and Hansen 07)
- Our assumptions:
  - Each agent possesses a set of terminal actions
  - Negative rewards for non-terminal actions
  - The problem stops when a terminal action is taken by each agent simultaneously
- Can capture uncertainty about reaching the goal
- Many problems can be modeled this way
- Example: capturing a target
  - All (or a subset) of the agents must attack simultaneously, or the agents and the target must meet at the same location
  - Agents are unsure when the goal is reached, but must choose when to terminate the problem
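A small sketch of the indefinite-horizon termination rule described above: an episode runs without discounting until every agent simultaneously selects one of its terminal actions. It reuses sample() and the DecPOMDP container from the earlier sketches; the terminal-action sets and policy interface are illustrative assumptions.

```python
def run_indefinite_horizon(m, pi1, pi2, terminal1, terminal2, max_steps=10000):
    """Simulate one episode of an indefinite-horizon DEC-POMDP.
    terminal1/terminal2: each agent's set of terminal actions.
    Rewards are summed without discounting; the episode ends only when
    BOTH agents pick a terminal action at the same step."""
    s = sample(m.b0)
    h1, h2 = (), ()
    ret = 0.0
    for _ in range(max_steps):
        a1, a2 = pi1[h1], pi2[h2]
        ret += m.R[(s, a1, a2)]
        if a1 in terminal1 and a2 in terminal2:
            return ret                      # both agents terminated simultaneously
        s = sample(m.P[(s, a1, a2)])        # s is now the next state s'
        o1, o2 = sample(m.O[(s, a1, a2)])
        h1, h2 = h1 + (o1,), h2 + (o2,)
    return ret
```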

Slide 10: Optimal solution
- Lemma 3.1. An optimal set of indefinite-horizon policy trees must have a horizon bounded in terms of the value of the best combination of terminal actions, the value of the best combination of non-terminal actions, and the maximum value attained by choosing a set of terminal actions on the first step given the initial state distribution.
- Theorem 3.2. Our dynamic programming algorithm for indefinite-horizon DEC-POMDPs returns an optimal set of policy trees for the given initial state distribution.
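The slide's formula did not survive the transcript; the following LaTeX sketch reconstructs the kind of argument such a horizon bound rests on, under the stated assumption that non-terminal actions yield negative reward. The symbols R̄, V_T, and V_0 are illustrative names, not the paper's notation, and this is not the paper's exact statement.

```latex
% \bar{R} < 0 : value of the best (least negative) combination of non-terminal actions
% V_T        : value of the best combination of terminal actions
% V_0        : maximum value of taking terminal actions on the first step under b_0
% Any joint policy of horizon h is worth at most (h-1)\bar{R} + V_T, so it can only
% improve on terminating immediately if
\[
  (h-1)\,\bar{R} + V_T \;\ge\; V_0
  \quad\Longrightarrow\quad
  h \;\le\; 1 + \frac{V_T - V_0}{\lvert \bar{R} \rvert},
\]
% and any longer policy tree is dominated by immediate termination.
```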

Slide 11: Goal-directed DEC-POMDPs
- Relax the assumptions, but still have a goal
- The problem terminates when:
  - the set of agents reaches a global goal state,
  - a single agent or a set of agents reach local goal states, or
  - any chosen combination of actions and observations is taken or seen by the set of agents
- We can no longer guarantee termination, so this becomes a subclass of infinite-horizon problems
- More problems fall into this class (they can terminate without the agents' knowledge)
- Example: completing a set of experiments
  - Robots must travel to different sites and perform different experiments at each
  - Some experiments require cooperation (simultaneous action) while some can be completed independently
  - The problem ends when all necessary experiments are completed
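A sketch of how the three kinds of termination conditions listed above could be checked during simulation. The condition representations (global_goals, local_goals, goal_events) and the require_all_local flag are assumptions for illustration, not the paper's formulation.

```python
def episode_done(state, local_states, joint_action, joint_obs,
                 global_goals, local_goals, goal_events, require_all_local=True):
    """Check the goal-directed termination conditions listed on the slide.
    global_goals: set of global goal states
    local_goals:  one set of local goal states per agent
    goal_events:  set of (joint_action, joint_observation) pairs that end the problem
    require_all_local: whether every agent, or just some agent, must reach its local goal
    """
    if state in global_goals:                                    # global goal state reached
        return True
    reached = [ls in lg for ls, lg in zip(local_states, local_goals)]
    if reached and (all(reached) if require_all_local else any(reached)):
        return True                                              # local goal condition met
    if (tuple(joint_action), tuple(joint_obs)) in goal_events:   # designated action/observation combination
        return True
    return False
```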

Slide 12: Sample-based approach
- Use sampling to generate agent trajectories from the known initial state until the goal conditions are met
- This produces only action and observation sequences that lead to the goal, which reduces the number of policies to consider
- We prove a bound on the number of samples required to approach optimality (extended from Kearns, Mansour and Ng 99)
- Showed that, given a sufficient number of samples, the probability that the value attained is at least ε from optimal is at most δ
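A minimal sketch of the trajectory-sampling step, reusing the illustrative DecPOMDP container and sample() helper from above: roll out behavior from b0 and keep only the action-observation sequences that reach the goal. The random action choice and the done() predicate (for instance built from the conditions sketched earlier) are stand-ins, not the paper's sampling scheme.

```python
import random

def sample_goal_trajectories(m, done, num_samples=1000, max_steps=100):
    """Generate trajectories from b0 and keep those that reach the goal.
    Each kept entry pairs the total reward collected with one
    (action, observation) sequence per agent."""
    kept = []
    for _ in range(num_samples):
        s = sample(m.b0)
        traj1, traj2, ret = [], [], 0.0
        for _ in range(max_steps):
            a1 = random.choice(m.actions1)      # stand-in for a smarter proposal
            a2 = random.choice(m.actions2)
            ret += m.R[(s, a1, a2)]
            s = sample(m.P[(s, a1, a2)])
            o1, o2 = sample(m.O[(s, a1, a2)])
            traj1.append((a1, o1))
            traj2.append((a2, o2))
            if done(s, (a1, a2), (o1, o2)):     # goal condition reached
                kept.append((ret, traj1, traj2))
                break
    return sorted(kept, key=lambda t: t[0], reverse=True)   # highest-valued first
```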

Slide 13: Getting more from fewer samples
- Optimize a finite-state controller
- Use the trajectories to create a controller, which ensures a valid DEC-POMDP policy and allows the solution to be more compact
- Choose actions and adjust the resulting transitions (permitting possibilities that were not sampled)
- Optimize in the context of the other agents
- Trajectories create an initial controller which is then optimized to produce a high-valued policy

Slide 14: Generating controllers from trajectories
- Trajectories (action-observation pairs ending at the goal g):
  - a1-o1, g
  - a1-o3, a1-o1, g
  - a1-o3, a1-o3, a1-o1, g
  - a4-o4, a1-o2, a3-o1, g
  - a4-o3, a1-o1, g
- [Figure: the initial controller built from these trajectories, the reduced controller, and the optimized controllers (a) and (b)]
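A sketch of how an initial controller for one agent could be assembled from its sampled action-observation sequences, in the spirit of the figure: each trajectory prefix becomes a node labeled with the action taken there, and observations drive the transitions. Node merging (the reduced controller) and the later optimization of actions and transitions from Slide 13 are omitted; this is an illustration under assumed data structures, not the paper's exact construction.

```python
def initial_controller(trajectories):
    """Build a deterministic controller (a prefix forest) from one agent's
    goal-reaching trajectories. A node is identified by the action-observation
    prefix leading to it plus the action it takes; observations drive transitions.
    Returns (start node ids, action per node, transition table)."""
    node_id = {}              # (prefix, action) -> node id
    action_of = {}            # node id -> action taken at that node
    trans = {}                # (node id, observation) -> next node id
    starts = []

    def get_node(prefix, action):
        key = (prefix, action)
        if key not in node_id:
            node_id[key] = len(node_id)
            action_of[node_id[key]] = action
        return node_id[key]

    for traj in trajectories:
        prefix = ()
        for i, (action, obs) in enumerate(traj):
            node = get_node(prefix, action)
            if i == 0 and node not in starts:
                starts.append(node)
            prefix = prefix + ((action, obs),)
            if i + 1 < len(traj):
                next_action = traj[i + 1][0]
                trans[(node, obs)] = get_node(prefix, next_action)
            # on the final observation the agent would move to a node taking its
            # terminal/goal action g, which is omitted in this sketch
    return starts, action_of, trans

# Example using the agent-1 trajectories shown on the slide:
trajs = [
    [("a1", "o1")],
    [("a1", "o3"), ("a1", "o1")],
    [("a1", "o3"), ("a1", "o3"), ("a1", "o1")],
    [("a4", "o4"), ("a1", "o2"), ("a3", "o1")],
    [("a4", "o3"), ("a1", "o1")],
]
starts, actions, transitions = initial_controller(trajs)
```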

Slide 15: Experiments
- Compared our goal-directed approach with leading approximate infinite-horizon algorithms:
  - BFS: Szer and Charpillet 05
  - DEC-BPI: Bernstein, Hansen and Zilberstein 05
  - NLP: Amato, Bernstein and Zilberstein 07
- Each approach was run with larger and larger controllers until resources were exhausted (2 GB of memory or 4 hours)
- BFS provides an optimal deterministic controller for a given size
- The other algorithms were run 10 times, and mean times and values are reported

Slide 16: Experimental results
- We built controllers from a small number of the highest-valued trajectories
- Our sample-based approach outperforms the other methods on these problems
- [Results figure; per-problem settings: # samples = …, 10; # samples = …, 25; # samples = …, 10; # samples = 500000, 5]

Slide 17: Conclusions
- Make use of goal structure, when present, to improve efficiency and solution quality
- Indefinite-horizon approach:
  - Created the model for DEC-POMDPs
  - Developed an algorithm and proved its optimality
- Goal-directed problems:
  - Described a more general goal model
  - Developed a sample-based algorithm and demonstrated high-quality results
  - Proved a bound on the number of samples needed to approach optimality
- Future: this work can be extended to general finite- and infinite-horizon problems

Slide 18: Thank you