Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs


Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs
Ranjit Nair, Honeywell Labs
Pradeep Varakantham, USC
Milind Tambe, USC
Makoto Yokoo, Kyushu University

Background: DPOMDP
- Distributed Partially Observable Markov Decision Problems (DPOMDPs): a decision-theoretic approach
  - Performance linked to optimality of decision making
  - Explicitly reasons about rewards (positive and negative) and uncertainty
- Current methods use centralized planning and distributed execution
  - The complexity of finding the optimal joint policy is NEXP-complete
- In many domains, not all agents can interact with or affect each other
  - Most current DPOMDP algorithms do not exploit this locality of interaction
- Example domains: distributed sensors, disaster rescue simulations, battlefield simulations

Background: DCOP
- Distributed Constraint Optimization Problem (DCOP): constraint graph (V, E)
  - Vertices are agents' variables (x1, ..., x4), each with a domain d1, ..., d4
  - Edges represent rewards, given by functions f(di, dj)
- DCOP algorithms exploit locality of interaction
- DCOP algorithms do not reason about uncertainty
[Figure: constraint graph over x1-x4 with a cost table f(di, dj); example assignments with Cost = 0 and Cost = 7]
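To make the DCOP formulation concrete, here is a minimal sketch of a constraint graph with binary cost functions and brute-force evaluation of assignments. The edges and the cost table are illustrative assumptions, not the exact values from the slide's figure.

```python
# Minimal DCOP sketch: variables with finite domains, binary cost functions on
# edges, and brute-force search for the best complete assignment.
# The edges and the cost table below are illustrative, not the slide's figure.
from itertools import product

domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1], "x4": [0, 1]}
edges = [("x1", "x2"), ("x1", "x3"), ("x2", "x3"), ("x3", "x4")]
f = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0}   # f(di, dj) shared by all edges

def cost(assignment):
    """Total cost of a complete assignment: sum of f over all constraint edges."""
    return sum(f[(assignment[i], assignment[j])] for i, j in edges)

best = min((dict(zip(domains, values)) for values in product(*domains.values())),
           key=cost)
print(best, cost(best))
```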

Key ideas and contributions
- Exploit locality of interaction to enable scale-up
- Hybrid DCOP-DPOMDP approach to collaboratively find a joint policy
- Distributed offline planning and distributed execution
- Key contributions:
  - ND-POMDP: a distributed POMDP model that captures locality of interaction
  - Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP): hill climbing, based on the Distributed Breakout Algorithm (DBA); a distributed parallel algorithm for finding a locally optimal joint policy
  - Globally Optimal Algorithm (GOA): based on variable elimination

Outline
- Sensor net domain
- Networked Distributed POMDPs (ND-POMDPs)
- Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP)
- Globally optimal algorithm
- Experiments
- Conclusions and future work

Example Domain
- Two independent targets; each changes position based on its own stochastic transition function
- Sensing agents cannot affect each other or the targets' positions
- False positives and false negatives are possible when observing targets
- Reward is obtained if two agents track a target correctly together
- There is a cost for leaving a sensor on
[Figure: five sensing agents Ag1-Ag5 covering sectors Sec1-Sec5, scan directions N/E/S/W, two targets]

Networked Distributed POMDP
- ND-POMDP for a set of n agents Ag: <S, A, P, O, Ω, R, b>
- World state s ∈ S, where S = S1 × … × Sn × Su
  - Each agent i ∈ Ag has a local state si ∈ Si (e.g., is the sensor on or off?)
  - Su is the part of the state that no agent can affect (e.g., the location of the two targets)
- b is the initial belief state, a probability distribution over S, which factors as b = b1 · … · bn · bu
- A = A1 × … × An, where Ai is the set of actions for agent i (e.g., "Scan East", "Scan West", "Turn Off")
- No communication during execution; agents communicate during planning
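As an illustration of the factored state and belief, the sketch below enumerates joint states (su, s1, ..., sn) and scores each with the product b1 · … · bn · bu. The concrete state names and probabilities are made up for the example.

```python
# Sketch of the factored world state S = S1 x ... x Sn x Su and the factored
# initial belief b = b1 ... bn . bu.  State names and probabilities here are
# illustrative assumptions.
from itertools import product

S_u = ["targets_far", "targets_near"]               # unaffectable state Su
S_i = {"Ag1": ["on", "off"], "Ag2": ["on", "off"]}  # local states Si

b_u = {"targets_far": 0.5, "targets_near": 0.5}
b_i = {"Ag1": {"on": 1.0, "off": 0.0},
       "Ag2": {"on": 1.0, "off": 0.0}}

def belief(s_u, local_states):
    """b(s) = bu(su) * prod_i bi(si) for the joint state s = (su, s1, ..., sn)."""
    p = b_u[s_u]
    for agent, s in local_states.items():
        p *= b_i[agent][s]
    return p

for s_u, s1, s2 in product(S_u, S_i["Ag1"], S_i["Ag2"]):
    print(s_u, s1, s2, belief(s_u, {"Ag1": s1, "Ag2": s2}))
```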

ND-POMDP (continued)
- Transition independence: agent i's local state cannot be affected by other agents
  - Pi: Si × Su × Ai × Si → [0,1]
  - Pu: Su × Su → [0,1]
- Ω = Ω1 × … × Ωn, where Ωi is the set of observations for agent i (e.g., target present in sector)
- Observation independence: agent i's observations are not dependent on other agents
  - Oi: Si × Su × Ai × Ωi → [0,1]
- The reward function R is decomposable: R(s,a) = Σl Rl(sl1, …, slk, su, al1, …, alk), where l ⊆ Ag and k = |l|
- Goal: find a joint policy π = <π1, …, πn>, where πi is the local policy of agent i, such that π maximizes the expected joint reward over finite horizon T
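A hedged sketch of the decomposable reward R(s, a) = Σl Rl, where each link reads only the local states and actions of the agents in l plus the unaffectable state. The link functions and their values below are invented for illustration, not taken from the paper.

```python
# Decomposable reward sketch: R(s, a) = sum over links l of R_l, where each R_l
# depends only on the agents in l and the unaffectable state s_u.
# The link definitions and numbers are illustrative assumptions.

def R1(s, s_u, a):
    """Unary link: Ag1 pays a cost for keeping its sensor on."""
    return -1.0 if s["Ag1"] == "on" else 0.0

def R12(s, s_u, a):
    """Binary link: Ag1 and Ag2 are rewarded for jointly tracking the target."""
    both_scanning = a["Ag1"] == "scan_east" and a["Ag2"] == "scan_west"
    return 10.0 if both_scanning and s_u == "target_between_1_2" else 0.0

links = {("Ag1",): R1, ("Ag1", "Ag2"): R12}

def R(s, s_u, a):
    """Joint reward as the sum of link rewards."""
    return sum(R_l(s, s_u, a) for R_l in links.values())

print(R({"Ag1": "on", "Ag2": "on"}, "target_between_1_2",
        {"Ag1": "scan_east", "Ag2": "scan_west"}))   # -> 9.0
```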

ND-POMDP as a DCOP
- Inter-agent interactions are captured by an interaction hypergraph (Ag, E)
  - Each agent is a node
  - Set of hyperedges E = {l | l ⊆ Ag and Rl is a component of R}
- Neighborhood of agent i, the set of i's neighbors: Ni = {j ∈ Ag | j ≠ i, ∃l ∈ E such that i ∈ l and j ∈ l}
- Agents are solving a DCOP where:
  - The constraint graph is the interaction hypergraph
  - The variable at each node is the local policy of that agent
  - The objective is the expected joint reward
[Figure: interaction hypergraph over Ag1-Ag5; R1 is Ag1's cost for scanning, R12 is the reward for Ag1 and Ag2 tracking a target]
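The neighborhood Ni can be read directly off the hyperedges. A small sketch; the hyperedge list is an illustrative guess at a sensor topology, not the exact graph from the slide's figure.

```python
# Sketch: derive each agent's neighborhood N_i from the hyperedges E of the
# interaction hypergraph.  The hyperedges listed here are illustrative only.
hyperedges = [("Ag1",), ("Ag1", "Ag2"), ("Ag2", "Ag3"), ("Ag3", "Ag4"), ("Ag3", "Ag5")]

def neighborhood(i, E):
    """N_i = { j != i : some link l in E contains both i and j }."""
    return {j for l in E if i in l for j in l if j != i}

print(neighborhood("Ag3", hyperedges))   # -> {'Ag2', 'Ag4', 'Ag5'}
```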

ND-POMDP theorems
- Theorem 1: For an ND-POMDP, the expected reward for a joint policy π is the sum of the expected rewards on each of the links under π
  - The global value function is decomposable into value functions for each link
- Local neighborhood utility Vπ[Ni]: the expected reward obtained from all links involving agent i when executing joint policy π
- Theorem 2 (locality of interaction): For joint policies π and π′, if πi = π′i and πNi = π′Ni, then Vπ[Ni] = Vπ′[Ni]
  - Given its neighbors' policies, the local neighborhood utility of agent i does not depend on any non-neighbor's policy
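Written out as formulas, a reconstruction consistent with the slide's statements; the notation Vᵖⁱ_l for the expected reward accrued on link l is chosen here, not quoted from the paper.

```latex
% Value decomposition (Theorem 1) and local neighborhood utility, reconstructed
% from the slide; V^\pi_l denotes the expected reward accrued on link l under \pi.
\[
  V_\pi \;=\; \sum_{l \in E} V^{\pi}_{l},
  \qquad
  V_\pi[N_i] \;=\; \sum_{\substack{l \in E \\ i \in l}} V^{\pi}_{l}.
\]
% Locality of interaction (Theorem 2):
\[
  \pi_i = \pi'_i \ \text{and}\ \pi_{N_i} = \pi'_{N_i}
  \;\Longrightarrow\;
  V_\pi[N_i] = V_{\pi'}[N_i].
\]
```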

LID-JESP
LID-JESP algorithm (based on the Distributed Breakout Algorithm); a code sketch of one cycle follows this list:
1. Choose a local policy randomly
2. Communicate the local policy to neighbors
3. Compute the local neighborhood utility of the current policy with respect to neighbors' policies
4. Compute the local neighborhood utility of the best-response policy with respect to neighbors' policies (GetValue)
5. Communicate the gain (step 4 value minus step 3 value) to neighbors
6. If the gain is greater than the neighbors' gains:
   - Change the local policy to the best-response policy
   - Communicate the changed policy to neighbors
7. If termination has not been reached, go to step 3
Theorem 3: Global utility is strictly increasing with each iteration until a local optimum is reached
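A minimal sketch of one LID-JESP cycle for a single agent, parameterized over evaluation, best-response, and communication helpers that are assumed to exist. All names are placeholders for illustration, not the authors' implementation.

```python
# Sketch of one LID-JESP cycle for agent i.  evaluate(), best_response() and the
# comm object are assumed helpers (placeholders), not part of the original code.
def lid_jesp_cycle(agent, policy, neighbor_policies, evaluate, best_response, comm):
    # Step 3: local neighborhood utility of the current policy.
    current_value = evaluate(agent, policy, neighbor_policies)

    # Step 4: best-response policy and its local neighborhood utility (GetValue).
    new_policy, best_value = best_response(agent, neighbor_policies)

    # Step 5: exchange gains with neighbors.
    gain = best_value - current_value
    neighbor_gains = comm.exchange_gains(agent, gain)

    # Step 6: only an agent whose gain exceeds all of its neighbors' gains
    # switches to its best response and announces the new policy.
    if gain > 0 and gain > max(neighbor_gains, default=0.0):
        policy = new_policy
        comm.broadcast_policy(agent, policy)

    return policy, gain
```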

Termination Detection
- Each agent maintains a termination counter
  - Reset to zero if gain > 0, else increment by 1
  - Exchange the counter with neighbors
  - Set the counter to the minimum of its own counter and its neighbors' counters
- Termination is detected when the counter reaches d (the diameter of the interaction graph)
- Theorem 4: LID-JESP will terminate within d cycles of reaching a local optimum
- Theorem 5: If LID-JESP terminates, the agents are in a local optimum
- From Theorems 3-5, LID-JESP will terminate in a local optimum within d cycles
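A sketch of the counter update, assuming synchronous cycles and a known interaction-graph diameter d.

```python
# Sketch of the termination-detection counter, assuming synchronous cycles and
# a known interaction-graph diameter d.
def update_termination_counter(counter, gain, neighbor_counters, d):
    """Return (new counter value, whether termination is detected)."""
    counter = 0 if gain > 0 else counter + 1             # reset on improvement
    counter = min([counter] + list(neighbor_counters))   # min with neighbors' counters
    return counter, counter >= d                         # terminate at diameter d
```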

Computing the best-response policy
- Given its neighbors' fixed policies, each agent is faced with solving a single-agent POMDP
- The state of this POMDP combines the unaffectable state Su, the agent's own local state Si, its neighbors' local states SNi, and the neighbors' observation histories
- Note: this state is not fully observable
- The transition, observation, and reward functions of this POMDP are induced from Pu, Pi, Oi and the reward links involving agent i, with neighbors acting according to their fixed policies
- The best response is computed using a Bellman backup (dynamic programming) approach

Global Optimal Algorithm (GOA)
- Similar to variable elimination
- Relies on a tree-structured interaction graph
  - A cycle cutset algorithm is used to eliminate cycles
  - Assumes only binary interactions
- Phase 1: values are propagated upwards from the leaves to the root
  - For each of its policies, an agent sums up the values of its children's optimal responses
  - It computes the value of its optimal response to each of its parent's policies
  - It communicates these values to its parent
- Phase 2: policies are propagated downwards from the root to the leaves
  - Each agent chooses the policy corresponding to its optimal response to its parent's chosen policy
  - It communicates its policy to its children

Experiments
- Compared against:
  - LID-JESP-no-n/w: ignores the interaction graph
  - JESP: centralized solver (Nair et al., 2003)
- 3-agent chain: LID-JESP is exponentially faster than GOA
- 4-agent chain: LID-JESP is faster than JESP and LID-JESP-no-n/w

Experiments
- 5-agent chain: LID-JESP is much faster than JESP and LID-JESP-no-n/w
- Values: LID-JESP's values are comparable to GOA's
  - Random restarts can be used to find the global optimum

Experiments
Reasons for speedup (C: number of cycles; G: number of GetValue calls; W: number of agents that change their policies in a cycle):
- LID-JESP converges in fewer cycles (column C)
- LID-JESP allows multiple agents to change their policies in a single cycle (column W)
- JESP makes fewer GetValue calls than LID-JESP, but each such call is slower

Complexity
- Complexity of the best response:
  - JESP: O(|S|^2 · |Ai| · ∏j |Ωj|^T)
    - Depends on the entire world state
    - Depends on the observation histories of all agents
  - LID-JESP: O(|Su × Si × SNi|^2 · |Ai| · ∏j∈Ni |Ωj|^T)
    - Depends on the observation histories of only the neighbors
    - Depends only on Su, Si and SNi
  - Increasing the number of agents does not affect complexity (fixed number of neighbors)
- Complexity of GOA:
  - Brute-force global optimal: O(∏j |πj| · |S|^2 · ∏j |Ωj|^T)
  - GOA: O(n · |πj| · |Su × Si × Sj|^2 · |Ai| · |Ωi|^T · |Ωj|^T)
  - Increasing the number of agents causes a linear increase in run time

Conclusions
- DCOP algorithms are applied to finding solutions to distributed POMDPs
- Exploiting "locality of interaction" reduces run time
- LID-JESP, based on DBA: agents converge to a locally optimal joint policy
- GOA, based on variable elimination
- These are the first distributed, parallel algorithms for distributed POMDPs
- Complexity increases linearly with the number of agents (for a fixed number of neighbors)

Future Work
- How can communication be incorporated? Will introducing communication cause agents to lose locality of interaction?
- Remove the assumption of transition independence (this may cause all agents to be dependent on each other)
- Other globally optimal algorithms with increased parallelism

Backup slides

Global Optimal
- Considers only binary constraints; can be extended to n-ary constraints
- Runs a distributed cycle cutset algorithm in case the graph is not a tree
Algorithm:
  Convert the graph into trees and a cycle cutset C
  For each possible joint policy πC of the agents in C:
      Val[πC] ← 0
      For each tree of agents:
          Val[πC] ← Val[πC] + DP-Global(tree, πC)
  Choose the joint policy with the highest value

Global Optimal Algorithm (GOA): detail
- Similar to variable elimination; relies on a tree-structured interaction graph (cycle cutset algorithm to eliminate cycles; assumes only binary interactions)
- Phase 1: values are propagated upwards from the leaves to the root. From the deepest nodes in the tree to the root, each agent i does:
  1. For each of agent i's policies πi:
         eval(πi) ← Σ_ci value[ci][πi]
     where value[ci][πi] is received from child ci
  2. For each parent policy πj:
         value[i][πj] ← 0
         For each of agent i's policies πi:
             current-eval ← expected-reward(πj, πi) + eval(πi)
             if value[i][πj] < current-eval then value[i][πj] ← current-eval
         Send value[i][πj] to parent j
- Phase 2: policies are propagated downwards from the root to the leaves (a minimal code sketch of this message passing follows)
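Below is a minimal, self-contained sketch of GOA's two phases on a small tree. Policies are opaque labels and the pairwise expected rewards are given as a made-up table; it illustrates the message passing, not the authors' implementation.

```python
# Sketch of GOA's two-phase message passing on a tree-structured interaction
# graph.  The tree, policy labels, and expected-reward table are illustrative
# assumptions.

children = {"Ag1": ["Ag2", "Ag3"], "Ag2": [], "Ag3": []}   # root is Ag1
policies = {"Ag1": ["p0", "p1"], "Ag2": ["p0", "p1"], "Ag3": ["p0", "p1"]}

# expected_reward[(parent, child)][(pi_parent, pi_child)] -> value (made-up numbers).
expected_reward = {
    ("Ag1", "Ag2"): {("p0", "p0"): 2, ("p0", "p1"): 0, ("p1", "p0"): 1, ("p1", "p1"): 3},
    ("Ag1", "Ag3"): {("p0", "p0"): 1, ("p0", "p1"): 4, ("p1", "p0"): 0, ("p1", "p1"): 2},
}

tables = {}  # tables[agent]: parent policy -> (best subtree value, agent's best policy)

def upward(agent, parent, parent_policies):
    """Phase 1: for each parent policy, compute the best value the subtree rooted
    at `agent` can achieve, and remember the policy that achieves it."""
    child_tables = {c: upward(c, agent, policies[agent]) for c in children[agent]}
    result = {}
    for pj in parent_policies:
        result[pj] = max(
            ((expected_reward.get((parent, agent), {}).get((pj, pi), 0.0)
              + sum(child_tables[c][pi][0] for c in children[agent]), pi)
             for pi in policies[agent]),
            key=lambda t: t[0])
    tables[agent] = result
    return result

def downward(agent, parent_choice, chosen):
    """Phase 2: each agent adopts its best response to its parent's chosen policy."""
    _, pi = tables[agent][parent_choice]
    chosen[agent] = pi
    for c in children[agent]:
        downward(c, pi, chosen)
    return chosen

root_table = upward("Ag1", None, [None])   # root has no parent; use a dummy key
total_value, _ = root_table[None]
joint_policy = downward("Ag1", None, {})
print(total_value, joint_policy)           # -> 6 {'Ag1': 'p0', 'Ag2': 'p0', 'Ag3': 'p1'}
```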