Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs


1 Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs
Ranjit Nair, Honeywell Labs; Pradeep Varakantham, USC; Milind Tambe, USC; Makoto Yokoo, Kyushu University

2 Background: DPOMDP
Distributed Partially Observable Markov Decision Problems (DPOMDPs): a decision-theoretic approach
- Performance is linked to optimality of decision making
- Explicitly reasons about (positive and negative) rewards and about uncertainty
- Current methods use centralized planning and distributed execution
- The complexity of finding the optimal policy is NEXP-Complete
In many domains, not all agents can interact with or affect each other
- Most current DPOMDP algorithms do not exploit this locality of interaction
- Example domains: distributed sensors, disaster rescue simulations, battlefield simulations

3 Background: DCOP
Distributed Constraint Optimization Problem (DCOP): constraint graph (V, E)
- Vertices are the agents' variables (x1, ..., x4), each with a domain d1, ..., d4
- Edges represent rewards/costs f(di, dj) between pairs of variables
- DCOP algorithms exploit locality of interaction
- DCOP algorithms do not reason about uncertainty
[Figure: a four-variable constraint graph x1-x4 with a pairwise cost table f(di, dj), and two example assignments with total cost 0 and total cost 7]
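Below is a minimal sketch of how such a DCOP can be represented and evaluated. It is illustrative only: the chain topology and the cost values are assumptions, since the concrete numbers in the slide's figure are not recoverable.

```python
# Minimal DCOP sketch (illustrative, not the authors' formulation):
# variables x1..x4 with binary domains and pairwise costs on the edges of a chain.
edges = [("x1", "x2"), ("x2", "x3"), ("x3", "x4")]

def f(di, dj):
    # Hypothetical pairwise cost table f(di, dj); the real table is in the figure.
    return 0 if di == dj else 2

def total_cost(assignment):
    """Sum the cost of every constraint (edge) under a complete assignment."""
    return sum(f(assignment[i], assignment[j]) for i, j in edges)

print(total_cost({"x1": 0, "x2": 0, "x3": 0, "x4": 0}))  # all equal  -> cost 0
print(total_cost({"x1": 0, "x2": 1, "x3": 0, "x4": 1}))  # mismatches -> cost 6
```

DCOP algorithms search for an assignment optimizing this sum while each agent communicates only with the variables it shares a constraint with.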

4 Key ideas and contributions
- Exploit locality of interaction to enable scale-up
- Hybrid DCOP-DPOMDP approach to collaboratively find a joint policy
- Distributed offline planning and distributed execution
Key contributions:
- ND-POMDP: a distributed POMDP model that captures locality of interaction
- LID-JESP (Locally Interacting Distributed Joint Equilibrium-based Search for Policies): hill climbing in the style of the Distributed Breakout Algorithm (DBA); a distributed, parallel algorithm for finding a locally optimal joint policy
- Globally Optimal Algorithm (GOA): based on variable elimination

5 Outline
- Sensor net domain
- Networked Distributed POMDPs (ND-POMDPs)
- Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP)
- Globally optimal algorithm
- Experiments
- Conclusions and future work

6 Example Domain
- Two independent targets; each changes position based on its own stochastic transition function
- Sensing agents cannot affect each other or the targets' positions
- False positives and false negatives are possible when observing targets
- A reward is obtained only if two agents track a target correctly together
- There is a cost for leaving a sensor on
[Figure: five sensing agents Ag1-Ag5 covering sectors Sec1-Sec5 with two targets; each sensor can scan East, North, West, or South]
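The sketch below shows one possible way to encode this domain's pieces in code; all names and probability/payoff values are illustrative assumptions, not numbers from the paper.

```python
# Illustrative encoding of the sensor-net domain (all names and numbers assumed).
SENSOR_ACTIONS = ["scan_east", "scan_west", "scan_north", "scan_south", "off"]

# Unaffectable state Su: the targets' positions.  Each target moves according to
# its own stochastic transition function, independent of the sensors.
TARGET_MOVE = {                       # hypothetical per-target transition probabilities
    "sec1": {"sec1": 0.6, "sec2": 0.4},
    "sec2": {"sec2": 0.6, "sec1": 0.4},
}

def covers(action, sector):
    # Placeholder geometry: which scan direction observes which sector (assumed).
    return {"scan_east": "sec1", "scan_west": "sec2"}.get(action) == sector

def joint_reward(scans, target_sector):
    """Reward only when at least two agents track the target together, minus a cost per sensor left on."""
    trackers = sum(1 for a in scans if a != "off" and covers(a, target_sector))
    on_cost = sum(1 for a in scans if a != "off")
    return (10 if trackers >= 2 else 0) - on_cost   # payoff values assumed

print(joint_reward(["scan_east", "scan_east", "off"], "sec1"))  # 2 trackers, 2 sensors on -> 8
```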

7 Networked Distributed POMDP
An ND-POMDP for a set of n agents Ag is a tuple <S, A, P, O, Ω, R, b>
- World state s ∈ S, where S = S1 × ... × Sn × Su
  - Each agent i ∈ Ag has a local state si ∈ Si (e.g., is the sensor on or off?)
  - Su is the part of the state that no agent can affect (e.g., the locations of the two targets)
- b is the initial belief state, a probability distribution over S that factors as b = b1 · ... · bn · bu
- A = A1 × ... × An, where Ai is the set of actions for agent i (e.g., "Scan East", "Scan West", "Turn Off")
- No communication during execution; agents communicate only during planning
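As a reading aid, the tuple can be pictured as a plain data container like the sketch below; the field types are assumptions about one reasonable encoding, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class NDPOMDP:
    """Skeleton of the ND-POMDP tuple <S, A, P, O, Ω, R, b> (illustrative encoding)."""
    local_states: List[List[str]]          # S_i for each agent i (e.g., sensor on/off)
    unaffectable_states: List[str]         # S_u (e.g., target locations)
    actions: List[List[str]]               # A_i for each agent i
    observations: List[List[str]]          # Ω_i for each agent i
    P_i: List[Callable] = None             # P_i(s_i' | s_i, s_u, a_i), one per agent
    P_u: Callable = None                   # P_u(s_u' | s_u)
    O_i: List[Callable] = None             # O_i(ω_i | s_i, s_u, a_i), one per agent
    reward_links: Dict[Tuple[int, ...], Callable] = None   # one component R_l per hyperedge l
    b_i: List[Dict[str, float]] = None     # factored initial belief b = b_1 · ... · b_n · b_u
    b_u: Dict[str, float] = None
```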

8 ND-POMDP
- Transition independence: agent i's local state cannot be affected by the other agents
  - Pi : Si × Su × Ai × Si → [0,1]
  - Pu : Su × Su → [0,1]
- Ω = Ω1 × ... × Ωn, where Ωi is the set of observations for agent i (e.g., target present in sector)
- Observation independence: agent i's observations do not depend on the other agents
  - Oi : Si × Su × Ai × Ωi → [0,1]
- The reward function R is decomposable: R(s, a) = Σl Rl(sl1, ..., slk, su, al1, ..., alk), where l ⊆ Ag and k = |l|
- Goal: find a joint policy π = <π1, ..., πn>, where πi is the local policy of agent i, that maximizes the expected joint reward over a finite horizon T
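The decomposable reward is the property everything else builds on, so here is a small sketch of evaluating R(s, a) as a sum over links; the specific links and reward values are assumptions in the spirit of the sensor domain, not the paper's numbers.

```python
from typing import Callable, Dict, Tuple

# Each hyperedge l (a tuple of agent indices) has its own component R_l, which sees
# only the local states and actions of the agents on that link, plus s_u.
reward_links: Dict[Tuple[int, ...], Callable] = {
    (0,):   lambda s_l, s_u, a_l: -1 if a_l[0] != "off" else 0,          # Ag1's scanning cost
    (0, 1): lambda s_l, s_u, a_l: 10 if (a_l == ("scan_east", "scan_west")
                                         and s_u == "target_in_sec1") else 0,
}

def R(local_states, s_u, joint_action):
    """Global reward = sum of the link rewards (the decomposition Theorem 1 relies on)."""
    total = 0.0
    for link, R_l in reward_links.items():
        s_l = tuple(local_states[i] for i in link)
        a_l = tuple(joint_action[i] for i in link)
        total += R_l(s_l, s_u, a_l)
    return total

print(R(["on", "on"], "target_in_sec1", ["scan_east", "scan_west"]))  # 10 - 1 = 9.0
```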

9 ND-POMDP as a DCOP
- Inter-agent interactions are captured by an interaction hypergraph (Ag, E)
  - Each agent is a node
  - The set of hyperedges is E = {l | l ⊆ Ag and Rl is a component of R}
- Neighborhood of agent i: Ni = {j ∈ Ag | j ≠ i, ∃ l ∈ E such that i ∈ l and j ∈ l}
- The agents are solving a DCOP where:
  - The constraint graph is the interaction hypergraph
  - The variable at each node is that agent's local policy
  - The objective is the expected joint reward
[Figure: interaction hypergraph over Ag1-Ag5; R1 is Ag1's cost for scanning, R12 is the reward for Ag1 and Ag2 jointly tracking a target]
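A small sketch of deriving each agent's neighborhood from the hyperedges; the five-agent chain topology below only loosely mirrors the figure and is an assumption.

```python
# N_i = all agents j != i that share some hyperedge with i.
agents = [0, 1, 2, 3, 4]
hyperedges = [(0,), (0, 1), (1, 2), (2, 3), (3, 4)]   # each tuple is a link l with its own R_l

def neighborhood(i, edges):
    return {j for l in edges if i in l for j in l if j != i}

for i in agents:
    print(i, sorted(neighborhood(i, hyperedges)))
# Output: 0 [1], 1 [0, 2], 2 [1, 3], 3 [2, 4], 4 [3]
# Locality of interaction (Theorem 2): agent i's local neighborhood utility depends
# only on the policies of the agents in N_i, which is exactly what LID-JESP exploits.
```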

10 ND-POMDP theorems
Theorem 1: For an ND-POMDP, the expected reward of a joint policy π is the sum of the expected rewards of each of the links under π
- The global value function is decomposable into value functions for each link
Local neighborhood utility Vπ[Ni]: the expected reward obtained from all links involving agent i when executing policy π
Theorem 2 (locality of interaction): For joint policies π and π', if πi = π'i and πNi = π'Ni, then Vπ[Ni] = Vπ'[Ni]
- Given its neighbors' policies, the local neighborhood utility of agent i does not depend on any non-neighbor's policy
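Restated in standard notation (a paraphrase of the slide's statements; the symbols, not the results, are mine):

```latex
% Theorem 1: value decomposition over the links of the interaction hypergraph
V_{\pi}(b) \;=\; \sum_{l \in E} V_{\pi}^{l}(b)

% Local neighborhood utility of agent i: the links that involve i
V_{\pi}[N_i] \;=\; \sum_{l \in E \,:\, i \in l} V_{\pi}^{l}(b)

% Theorem 2 (locality of interaction)
\pi_i = \pi'_i \;\wedge\; \pi_{N_i} = \pi'_{N_i}
\;\Longrightarrow\; V_{\pi}[N_i] \;=\; V_{\pi'}[N_i]
```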

11 LID-JESP
LID-JESP algorithm (based on the Distributed Breakout Algorithm); a code sketch follows the list:
1. Choose a local policy randomly
2. Communicate the local policy to the neighbors
3. Compute the local neighborhood utility of the current policy wrt the neighbors' policies
4. Compute the local neighborhood utility of the best-response policy wrt the neighbors' policies (GetValue)
5. Communicate the gain (step 4 minus step 3) to the neighbors
6. If the gain is greater than every neighbor's gain:
   - Change the local policy to the best-response policy
   - Communicate the changed policy to the neighbors
7. If termination has not been detected, go to step 3
Theorem 3: Global utility is strictly increasing with each iteration until a local optimum is reached
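A highly simplified, single-process sketch of one LID-JESP cycle. In the real algorithm each agent runs this in parallel and the communication steps are actual message exchanges; evaluate, best_response (GetValue), and the policy representation are stand-ins, and ties between neighboring gains are ignored here.

```python
def lid_jesp_cycle(agents, policies, neighbors, evaluate, best_response):
    """One synchronous cycle: compute gains, let the local winners switch policies."""
    gains, best = {}, {}
    for i in agents:
        current = evaluate(i, policies)                  # V_pi[N_i] of the current policy
        best[i], best_val = best_response(i, policies)   # GetValue: best response to neighbors
        gains[i] = best_val - current
    changed = False
    for i in agents:
        # An agent switches only if its gain is positive and beats every neighbor's gain
        if gains[i] > 0 and all(gains[i] > gains[j] for j in neighbors[i]):
            policies[i] = best[i]
            changed = True
    return changed   # "no agent changed" feeds the termination-counter mechanism below
```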

12 Termination Detection
- Each agent maintains a termination counter
  - Reset it to zero if gain > 0, else increment it by 1
  - Exchange the counter with the neighbors
  - Set the counter to the minimum of its own counter and its neighbors' counters
- Termination is detected when the counter reaches d (the diameter of the interaction graph)
Theorem 4: LID-JESP will terminate within d cycles of reaching a local optimum
Theorem 5: If LID-JESP terminates, the agents are in a local optimum
From Theorems 3-5, LID-JESP will terminate in a local optimum within d cycles
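A minimal sketch of the counter update each agent would perform per cycle (variable names are assumptions):

```python
def update_termination_counter(counter, gain, neighbor_counters, d):
    """Returns (new_counter, terminated); d is the diameter of the interaction graph."""
    counter = 0 if gain > 0 else counter + 1             # reset on improvement, else count idle cycles
    counter = min([counter] + list(neighbor_counters))   # take the min with the neighbors' counters
    return counter, counter >= d                         # everyone idle for d cycles => local optimum

# Example with diameter d = 2 and no agent improving anywhere:
c, done = update_termination_counter(0, 0, [1, 1], d=2)   # -> (1, False)
c, done = update_termination_counter(c, 0, [2, 2], d=2)   # -> (2, True)
```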

13 Computing best response policy
Given the neighbors' fixed policies, each agent is faced with solving a single-agent POMDP
- The state of this POMDP combines su, si, sNi, and the neighbors' observation histories (note: this state is not fully observable)
- Its transition, observation, and reward functions are induced by Pu, Pi, Oi, the links Rl involving agent i, and the neighbors' fixed policies
- The best response is computed using a Bellman backup approach
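The paper's GetValue computes this best response exactly with Bellman backups over the extended state. As a much lighter illustration, the sketch below only estimates the local neighborhood utility of one fixed candidate policy by simulation; every helper signature (sample_initial, sample_next, sample_obs, link_reward, the policy interface) is an assumption, not the paper's API.

```python
def estimate_local_utility(T, trials, group, links, policies,
                           sample_initial, sample_next, sample_obs, link_reward):
    """Monte-Carlo estimate of agent i's local neighborhood utility.

    group    : agent i together with its neighbors N_i
    policies : for each agent in group, a function from its observation history to an action
    """
    total = 0.0
    for _ in range(trials):
        s_u, s = sample_initial()                        # draw <s_u, s_group> from the factored belief b
        hist = {j: () for j in group}                    # each agent's observation history so far
        for _ in range(T):
            acts = {j: policies[j](hist[j]) for j in group}
            total += sum(link_reward(l, s, s_u, acts) for l in links)
            s_u, s = sample_next(s_u, s, acts)           # uses P_u and each agent's P_j
            for j in group:                              # observations of the *new* state, per O_j
                hist[j] = hist[j] + (sample_obs(j, s[j], s_u, acts[j]),)
    return total / trials
```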

14 Global Optimal Algorithm (GOA)
- Similar to variable elimination
- Relies on a tree-structured interaction graph
  - A cycle-cutset algorithm eliminates cycles
  - Assumes only binary interactions
- Phase 1: values are propagated upwards from the leaves to the root
  - For each of its policies, an agent sums up the values of its children's optimal responses
  - It computes the value of its optimal response to each of the parent's policies
  - It communicates these values to the parent
- Phase 2: policies are propagated downwards from the root to the leaves
  - Each agent chooses the policy corresponding to its optimal response to the parent's chosen policy
  - It communicates its chosen policy to its children
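A compact, centralized sketch of the same two-pass computation on a tree (in GOA proper every agent computes and sends its own value table; here one recursion plays all the roles, and policies, pair_value, and local_value are assumed stand-ins for quantities derived from the ND-POMDP):

```python
def goa(tree_children, root, policies, pair_value, local_value):
    # Phase 1 (leaves -> root): eval_up[i][pi] = best value of i's subtree if i plays pi.
    eval_up = {}
    def up(i):
        for c in tree_children.get(i, []):
            up(c)
        eval_up[i] = {}
        for pi in policies[i]:
            total = local_value(i, pi)                 # value of i's own (unary) links
            for c in tree_children.get(i, []):
                total += max(pair_value(i, pi, c, pc) + eval_up[c][pc]
                             for pc in policies[c])    # child's optimal response to pi
            eval_up[i][pi] = total
    up(root)

    # Phase 2 (root -> leaves): each agent best-responds to its parent's chosen policy.
    chosen = {root: max(policies[root], key=lambda p: eval_up[root][p])}
    def down(i):
        for c in tree_children.get(i, []):
            chosen[c] = max(policies[c],
                            key=lambda pc: pair_value(i, chosen[i], c, pc) + eval_up[c][pc])
            down(c)
    down(root)
    return chosen          # a globally optimal joint policy on the tree
```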

15 Experiments
Compared against:
- LID-JESP-no-nw: ignores the interaction graph
- JESP: centralized solver (Nair et al., 2003)
Run time:
- 3-agent chain: LID-JESP is exponentially faster than GOA
- 4-agent chain: LID-JESP is faster than JESP and LID-JESP-no-nw

16 Experiments
- 5-agent chain: LID-JESP is much faster than JESP and LID-JESP-no-nw
Values:
- LID-JESP values are comparable to GOA's
- Random restarts can be used to find the global optimum

17 Experiments
Reasons for speedup (table columns: C = number of cycles, G = number of GetValue calls, W = number of agents that change their policies in a cycle):
- LID-JESP converges in fewer cycles (column C)
- LID-JESP allows multiple agents to change their policies in a single cycle (column W)
- JESP makes fewer GetValue calls than LID-JESP (column G), but each such call is slower

18 Complexity
Complexity of the best response:
- JESP: O(|S|^2 · |Ai| · ∏j |Ωj|^T)
  - Depends on the entire world state and on the observation histories of all agents
- LID-JESP: O(|Su × Si × SNi|^2 · |Ai| · ∏j∈Ni |Ωj|^T)
  - Depends only on Su, Si, and SNi, and only on the observation histories of the neighbors
  - Increasing the number of agents does not affect this complexity (fixed number of neighbors)
Complexity of globally optimal search:
- Brute-force global optimum: O(∏j |πj| · |S|^2 · ∏j |Ωj|^T)
- GOA: O(n · |πj| · |Su × Si × Sj|^2 · |Ai| · |Ωi|^T · |Ωj|^T)
  - Increasing the number of agents causes only a linear increase in run time

19 Conclusions
- DCOP algorithms are applied to solving distributed POMDPs
- Exploiting "locality of interaction" reduces run time
- LID-JESP, based on DBA: agents converge to a locally optimal joint policy
- GOA, based on variable elimination
- First distributed, parallel algorithms for distributed POMDPs
- Complexity increases only linearly with the number of agents (given a fixed number of neighbors)

20 Future Work
- How can communication be incorporated?
  - Will introducing communication cause agents to lose locality of interaction?
- Remove the assumption of transition independence
  - This may cause all agents to be dependent on each other
- Other globally optimal algorithms
  - Increased parallelism

21 Backup slides

22 Global Optimal
- Consider only binary constraints (the approach can be applied to n-ary constraints)
- Run a distributed cycle-cutset algorithm in case the graph is not a tree
Algorithm:
1. Convert the graph into trees and a cycle cutset C
2. For each possible joint policy πC of the agents in C:
   - Val[πC] = 0
   - For each tree of agents: Val[πC] += DP-Global(tree, πC)
3. Choose the joint policy with the highest value

23 Global Optimal Algorithm (GOA)
- Similar to variable elimination; relies on a tree-structured interaction graph (a cycle-cutset algorithm eliminates cycles; assumes only binary interactions)
Phase 1: values are propagated upwards from the leaves to the root. From the deepest nodes in the tree to the root, each agent i does:
1. For each of agent i's policies πi:
   eval(πi) ← Σci value[ci](πi), where value[ci](πi) is received from child ci
2. For each of the parent's policies πj:
   value[i](πj) ← 0
   For each of agent i's policies πi:
     current-eval ← expected-reward(πj, πi) + eval(πi)
     if value[i](πj) < current-eval then value[i](πj) ← current-eval
   Send value[i](πj) to the parent j
Phase 2: policies are propagated downwards from the root to the leaves

