
Slide 1: Achieving Goals in Decentralized POMDPs
Christopher Amato and Shlomo Zilberstein
Department of Computer Science, University of Massachusetts Amherst
May 14, 2009

Slide 2: Overview
- The importance of goals
- The DEC-POMDP model
- Previous work on goals
- Indefinite-horizon DEC-POMDPs
- Goal-directed DEC-POMDPs
- Results and future work

Slide 3: Achieving goals in a multiagent setting
- General setting: the problem proceeds over a sequence of steps until a goal is achieved
- Multiagent setting: the problem can terminate when any number of agents achieve local goals or when all agents achieve a global goal
- Many problems have this structure:
  - Meeting or catching a target
  - Cooperatively completing a task
- How do we make use of this structure?

Slide 4: DEC-POMDPs
- Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives:
  - A local observation rather than the actual state
  - A joint immediate reward
[Figure: two agents choose actions a1 and a2, each receives only its own observation o1 or o2 from the environment, and both share the joint reward r.]

Slide 5: DEC-POMDP definition
- A two-agent DEC-POMDP can be defined by the tuple M = <S, A1, A2, P, R, Ω1, Ω2, O>:
  - S, a finite set of states with a designated initial state distribution b0
  - A1 and A2, each agent's finite set of actions
  - P, the state transition model: P(s' | s, a1, a2)
  - R, the reward model: R(s, a1, a2)
  - Ω1 and Ω2, each agent's finite set of observations
  - O, the observation model: O(o1, o2 | s', a1, a2)
- The model extends to any number of agents
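As a minimal sketch, the two-agent tuple above could be held in a simple dictionary-based structure like the one below (a hypothetical representation for illustration, not the authors' implementation):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action, Obs = str, str, str  # labels for states, actions, and observations

@dataclass
class DecPOMDP:
    """Two-agent DEC-POMDP M = <S, A1, A2, P, R, Omega1, Omega2, O>."""
    states: List[State]
    b0: Dict[State, float]                # initial state distribution over S
    actions1: List[Action]                # A1
    actions2: List[Action]                # A2
    obs1: List[Obs]                       # Omega1
    obs2: List[Obs]                       # Omega2
    # P(s' | s, a1, a2): maps (s, a1, a2) to a distribution over next states
    P: Dict[Tuple[State, Action, Action], Dict[State, float]]
    # R(s, a1, a2): joint immediate reward
    R: Dict[Tuple[State, Action, Action], float]
    # O(o1, o2 | s', a1, a2): maps (s', a1, a2) to a distribution over joint observations
    O: Dict[Tuple[State, Action, Action], Dict[Tuple[Obs, Obs], float]]
```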

Slide 6: DEC-POMDP solutions
- A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
  - Note that planning can be centralized while execution is distributed
- A joint policy is a policy for each agent
- Finite-horizon case: the goal is to maximize expected reward over a finite number of steps
- Infinite-horizon case: rewards are discounted by a factor γ to keep the sum finite
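One way to picture such a policy in code is a lookup from an agent's observation history to its next action, with the joint policy evaluated by simulation from b0. This is only a sketch: the sample_initial_state and step simulator functions are assumed placeholders, not part of the talk.

```python
from typing import Callable, Dict, Tuple

ObsHistory = Tuple[str, ...]            # the sequence of local observations seen so far
LocalPolicy = Dict[ObsHistory, str]     # maps each observation history to an action

def evaluate_joint_policy(pi1: LocalPolicy, pi2: LocalPolicy,
                          sample_initial_state: Callable[[], str],
                          step: Callable[[str, str, str], Tuple[str, str, str, float, bool]],
                          horizon: int, episodes: int = 1000) -> float:
    """Monte-Carlo estimate of a joint policy's expected total reward.

    step(s, a1, a2) is an assumed environment simulator returning
    (next_state, obs1, obs2, reward, done).  Execution is decentralized:
    each agent acts only on its own observation history.
    """
    total = 0.0
    for _ in range(episodes):
        s, h1, h2 = sample_initial_state(), (), ()
        for _ in range(horizon):
            a1, a2 = pi1[h1], pi2[h2]    # each local policy must cover reachable histories
            s, o1, o2, r, done = step(s, a1, a2)
            total += r
            h1, h2 = h1 + (o1,), h2 + (o2,)
            if done:
                break
    return total / episodes
```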

Slide 7: Achieving goals
- If the problem terminates after a goal is achieved, how do we model it?
- It is unclear how many steps are needed until termination
- We want to avoid a discount factor: its value is often arbitrary and can change the solution

Slide 8: Previous work
- Some work on goals in POMDPs, but for DEC-POMDPs only Goldman and Zilberstein (2004)
  - Modeled problems with goals as finite-horizon and studied their complexity
  - The complexity is the same unless agents have independent transitions and observations and one goal is always better
  - This assumes negative rewards for non-goal states and a no-op action available at the goal

Slide 9: Indefinite-horizon DEC-POMDPs
- Extend the indefinite-horizon POMDP assumptions of Patek (2001) and Hansen (2007)
- Our assumptions:
  - Each agent possesses a set of terminal actions
  - Non-terminal actions have negative rewards
  - The problem stops when every agent takes a terminal action simultaneously
- Can capture uncertainty about reaching the goal
- Many problems can be modeled this way
- Example: capturing a target
  - All (or a subset) of the agents must attack simultaneously
  - Or the agents and the target must meet at the same location
  - Agents are unsure when the goal is reached, but must choose when to terminate the problem
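Under these assumptions an episode ends only when every agent selects one of its terminal actions on the same step. A tiny illustrative check (the agent names and terminal-action sets below are made up for the example):

```python
from typing import Dict, Set

def episode_ends(joint_action: Dict[str, str],
                 terminal_actions: Dict[str, Set[str]]) -> bool:
    """True when every agent simultaneously chooses one of its terminal actions."""
    return all(a in terminal_actions[agent] for agent, a in joint_action.items())

# Example: both rovers must issue "attack" on the same step to capture the target.
terminal = {"rover1": {"attack"}, "rover2": {"attack"}}
print(episode_ends({"rover1": "attack", "rover2": "attack"}, terminal))  # True
print(episode_ends({"rover1": "attack", "rover2": "move"}, terminal))    # False
```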

Slide 10: Optimal solution
- Lemma 3.1. An optimal set of indefinite-horizon policy trees must have horizon less than 1 + (V_T(b0) − r_T) / r_N, where r_T is the value of the best combination of terminal actions, r_N (< 0) is the value of the best combination of non-terminal actions, and V_T(b0) is the maximum value attained by choosing a set of terminal actions on the first step given the initial state distribution b0.
- Theorem 3.2. Our dynamic programming algorithm for indefinite-horizon DEC-POMDPs returns an optimal set of policy trees for the given initial state distribution.
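A sketch of how a bound of this form can be derived from the assumptions on slide 9, using r_T, r_N and V_T(b0) as defined in the lemma (the exact constant in the paper may differ):

```latex
% r_T: best joint terminal reward;  r_N < 0: best joint non-terminal reward;
% V_T(b_0): best value obtainable by terminating on the very first step.
\begin{gather*}
  \text{any horizon-}h\text{ policy has value at most } (h-1)\,r_N + r_T, \\
  \text{while an optimal policy is worth at least } V_T(b_0), \text{ so} \\
  (h^\ast - 1)\,r_N + r_T \;\ge\; V_T(b_0)
  \;\Longrightarrow\;
  h^\ast \;\le\; 1 + \frac{V_T(b_0) - r_T}{r_N}
  \quad (\text{dividing by } r_N < 0 \text{ flips the inequality}).
\end{gather*}
```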

Slide 11: Goal-directed DEC-POMDPs
- Relax the assumptions, but still have a goal
- The problem terminates when:
  - The set of agents reaches a global goal state
  - A single agent or a set of agents reach local goal states
  - Any chosen combination of actions and observations is taken or seen by the set of agents
- Termination can no longer be guaranteed, so this becomes a subclass of infinite-horizon problems
- More problems fall into this class (the problem can terminate without the agents' knowledge)
- Example: completing a set of experiments
  - Robots must travel to different sites and perform different experiments at each
  - Some experiments require cooperation (simultaneous action) while others can be completed independently
  - The problem ends when all necessary experiments are completed

Slide 12: Sample-based approach
- Use sampling to generate agent trajectories from the known initial state until the goal conditions are met
- This produces only action and observation sequences that lead to the goal, reducing the number of policies to consider
- We prove a bound on the number of samples required to approach optimality (extended from Kearns, Mansour and Ng 1999)
  - Showed that, with a sufficient number of samples, the probability that the value attained is more than ε from optimal is at most δ
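A rough sketch of this sampling step: roll out joint actions from the known initial state until the goal condition fires, and keep only the rollouts that reach it. The sample_initial_state, step and is_goal functions below are assumed placeholders for a problem simulator, and uniform random joint actions stand in for whatever exploration scheme is actually used.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[Tuple[Tuple[str, str], Tuple[str, str]]]   # [((a1, a2), (o1, o2)), ...]

def sample_goal_trajectories(sample_initial_state: Callable[[], str],
                             step: Callable[[str, str, str], Tuple[str, str, str]],
                             actions1: List[str], actions2: List[str],
                             is_goal: Callable[[str], bool],
                             num_samples: int, max_steps: int) -> List[Trajectory]:
    """Roll out random joint actions from b0; keep only rollouts that reach the goal."""
    kept: List[Trajectory] = []
    for _ in range(num_samples):
        s, traj = sample_initial_state(), []
        for _ in range(max_steps):
            a1, a2 = random.choice(actions1), random.choice(actions2)
            s, o1, o2 = step(s, a1, a2)
            traj.append(((a1, a2), (o1, o2)))
            if is_goal(s):
                kept.append(traj)          # only goal-reaching trajectories are kept
                break
    return kept
```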

Slide 13: Getting more from fewer samples
- Optimize a finite-state controller
  - Use the trajectories to create a controller
  - Ensures a valid DEC-POMDP policy
  - Allows the solution to be more compact
- Choose actions and adjust the resulting transitions (permitting possibilities that were not sampled)
- Optimize in the context of the other agents
- The trajectories create an initial controller, which is then optimized to produce a high-valued policy

Slide 14: Generating controllers from trajectories
Example trajectories (action-observation pairs, with g marking the goal):
- a1-o1 g
- a1-o3 a1-o1 g
- a1-o3 a1-o3 a1-o1 g
- a4-o4 a1-o2 a3-o1 g
- a4-o3 a1-o1 g
[Figure: the initial controller built from these trajectories, the reduced controller, and the optimized controllers (a) and (b).]
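A sketch of how an initial deterministic controller for one agent might be built from such trajectories: each action-observation prefix becomes a controller node, identical prefixes are merged, and conflicting actions or unsampled observations are left for the optimization step described on the previous slide. The construction details here are illustrative, not the authors' exact procedure.

```python
from typing import Dict, List, Tuple

LocalTrajectory = List[Tuple[str, str]]   # one agent's [(action, observation), ...]

def build_initial_controller(trajectories: List[LocalTrajectory]):
    """Prefix-tree controller: node -> action to take, (node, observation) -> next node.

    A trajectory whose action conflicts with one already fixed at a node is cut off
    there; controller optimization can later revise such choices and add transitions
    for observations that were never sampled.
    """
    action_of: Dict[int, str] = {}
    transition: Dict[Tuple[int, str], int] = {}
    next_id = 1
    for traj in trajectories:
        node = 0                                      # every trajectory starts at node 0
        for action, obs in traj:
            if action_of.setdefault(node, action) != action:
                break                                 # conflicting action: stop following
            if (node, obs) not in transition:
                transition[(node, obs)] = next_id     # new node for an unseen branch
                next_id += 1
            node = transition[(node, obs)]
    return action_of, transition
```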

Slide 15: Experiments
- Compared our goal-directed approach with leading approximate infinite-horizon algorithms:
  - BFS: Szer and Charpillet 2005
  - DEC-BPI: Bernstein, Hansen and Zilberstein 2005
  - NLP: Amato, Bernstein and Zilberstein 2007
- Each approach was run with increasingly large controllers until resources were exhausted (2 GB of memory or 4 hours)
- BFS provides an optimal deterministic controller for a given size
- The other algorithms were run 10 times, and mean times and values are reported

Slide 16: Experimental results
- We built controllers from a small number of the highest-valued trajectories
- Our sample-based approach outperforms the other methods on these problems
[Results table omitted; settings per problem: # samples = 1,000,000, 10; 5,000,000, 25; 1,000,000, 10; 500,000, 5.]

Slide 17: Conclusions
- Make use of goal structure, when present, to improve efficiency and solution quality
- Indefinite-horizon approach:
  - Created the model for DEC-POMDPs
  - Developed an algorithm and proved its optimality
- Goal-directed problems:
  - Described a more general goal model
  - Developed a sample-based algorithm and demonstrated high-quality results
  - Proved a bound on the number of samples needed to approach optimality
- Future work: extend this work to general finite- and infinite-horizon problems

Slide 18: Thank you

