
1 Bounded Rationality in Multiagent Systems using Decentralized Metareasoning. Shlomo Zilberstein and Alan Carlin.

2 Satisficing
Satisficing, a Scottish word meaning satisfying, was proposed by Herbert Simon in 1957 to denote decision making that searches until an alternative is found that meets the agent’s aspiration level criterion. It is also referred to as bounded rationality.
Approaches:
- Satisficing as approximate reasoning
- Satisficing as approximate modeling
- Satisficing as optimal meta-reasoning
- Satisficing as bounded optimality
- Satisficing as a combination of the above

3 Resource-bounded reasoning
An approach to satisficing that involves:
- Algorithms that allow small quantities of resources, such as time, memory, or information, to be traded for gains in the value of computed results.
- Compact representation of resource/quality tradeoffs.
- Techniques to compose larger systems from resource-bounded components.
- Meta-level control strategies that optimize the value of computation by exploiting these tradeoffs. (Focus of this talk.)

4 Anytime algorithms
[Figure: decision quality vs. time for the ideal, traditional, and anytime cases, with the cost of time and net value overlaid.]
- Ideal: maximal quality in no time.
- Traditional: quality maximizing.
- Anytime: value maximizing.
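
To make the anytime notion concrete, here is a minimal illustrative sketch (not from the talk): a Monte Carlo routine whose answer is available, and improving, at every moment, so it can be interrupted at any deadline.

```python
import random
import time

def anytime_pi_estimate(deadline_s: float):
    """Illustrative anytime algorithm: Monte Carlo estimate of pi.

    Result quality improves (in expectation) the longer it runs, and
    the best-so-far answer is returned whenever we are stopped.
    """
    inside = total = 0
    estimate = 0.0
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        x, y = random.random(), random.random()
        inside += (x * x + y * y) <= 1.0   # point falls inside quarter circle
        total += 1
        estimate = 4.0 * inside / total    # best-so-far solution
    return estimate, total

print(anytime_pi_estimate(0.1))  # a longer deadline yields a higher-quality estimate
```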

5 Models of time/quality tradeoff
- Anytime algorithms / flexible computation [Dean & Boddy ’88] [Horvitz ’87] [Zilberstein & Russell ’91]
- Progressive processing [Mouaddib & Charpillet ’93] [Zilberstein & Mouaddib ’95]
- Design-to-time [Garvey & Lesser ’91]
- Imprecise computation [Liu et al. ’91]

6 Algorithmic Approaches to Tradeoff
- Single agent: performance profiles with monitoring [Hansen & Zilberstein ’01]
- Multi-agent: meta-reasoning with negotiation [Raja & Lesser ’07]
[Figure: an example quality table over time steps T=1..8.]

7 Performance Profile

8 Distributed Monitoring Problem

9 Distributed Monitoring Problem (DMP)
The Distributed Monitoring Problem (DMP) is defined by a tuple ⟨Ag, Q, A, P, U, C_L, C_G, T⟩ such that:
- Ag is a set of agents.
- Q_1, Q_2, ..., Q_n is a set of possible quality levels for agents 1..n. Joint qualities are represented by q.
- A is the set of options, “continue”, “stop”, “monitor local”, “monitor global”, available to each agent.
- P_i is the transition model for agent i: P_i(q_i^{t+1} | q_i^t) ∈ [0,1].
- U(q, t) is a utility function that assigns utility to a quality vector q at time t.
- C_L, C_G are the costs assigned to local and global monitoring.
- T is the time horizon, the number of steps in the problem.
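
A direct encoding of this tuple as a data structure might look like the following sketch (field and type names are ours, purely illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Quality = int  # a discretized quality level

@dataclass
class DMP:
    """Distributed Monitoring Problem ⟨Ag, Q, A, P, U, C_L, C_G, T⟩."""
    n_agents: int                                # |Ag|
    quality_levels: Sequence[Sequence[Quality]]  # Q_1 ... Q_n
    # P_i(q' | q): agent i's one-step quality transition probability
    transition: Callable[[int, Quality, Quality], float]
    # U(q, t): utility of the joint quality vector q at time t
    utility: Callable[[Tuple[Quality, ...], int], float]
    cost_local: float                            # C_L
    cost_global: float                           # C_G
    horizon: int                                 # T

# The per-agent options from the slide:
ACTIONS = ("continue", "stop", "monitor_local", "monitor_global")
```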

10 Local Monitoring
[Diagram: two agents each alternate compute and evaluate-solution steps, then terminate.]

11 NP-hardness of DMP
Lemma 1: The problem of finding an optimal solution for a DMP with a fixed number of agents |Ag|, C_L = 0 and C_G = ∞ is NP-hard.
Proof: Reduction from Decentralized Detection (NP-complete). Given finite sets Y_1, Y_2, a probability mass function p: Y_1 × Y_2 → Q, and a partition {A_0, A_1} of Y_1 × Y_2, optimize J(γ_1, γ_2) over the selection of γ_i: Y_i → {0,1}, i = 1,2, where J(γ_1, γ_2) is given by:
J(γ_1, γ_2) = Σ_{(y_1,y_2) ∈ A_0} p(y_1, y_2) γ_1(y_1) γ_2(y_2) + Σ_{(y_1,y_2) ∈ A_1} p(y_1, y_2) (1 − γ_1(y_1) γ_2(y_2))

12 Lemma 1: DD → DMP
Form a 3-step DMP:
- Quality is known at the first step; the DMP transition model P to the second step is defined by y_1 and y_2.
- “Continue” at the second step iff γ_1, γ_2 = 0; effects are deterministic.
- Arbitrarily set U(q_i^2, q_j^2), for example to zero.
- Define the third-step utility so that: U(q_i^3, q_j^3) − U(q_i^2, q_j^2) = J(γ_1, γ_2)

13 DMP → Dec-MDP
Lemma 2: DMP is in NP.
We reduce DMP to the Dec-MDP, an NP-complete model that contains states, actions, and rewards.
[Diagram, modified from [Amato, Bernstein, & Zilberstein ’06]: the agents submit actions a_1 and a_2 to the environment, which changes state and emits a reward r.]

14 Dec-POMDP / Dec-MDP definition
A two-agent Dec-POMDP can be defined with the tuple M = ⟨S, A_1, A_2, P, R, Ω_1, Ω_2, O⟩:
- S, a finite set of states with designated initial state distribution b^0
- A_1 and A_2, each agent’s finite set of actions
- P, the state transition model: P(s' | s, a_1, a_2)
- R, the reward model: R(s, a_1, a_2)
- Ω_1 and Ω_2, each agent’s finite set of observations
- O, the observation model: O(o_1, o_2 | s', a_1, a_2)
A Dec-MDP is jointly fully observable. A Dec-MDP with transition and observation independence can be thought of as locally fully observable.
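
For symmetry with the DMP sketch above, the same hedged encoding idea applied to this tuple (again, all names are ours, not a standard API):

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class DecPOMDP:
    """Two-agent Dec-POMDP ⟨S, A_1, A_2, P, R, Ω_1, Ω_2, O⟩."""
    states: Sequence[int]                               # S
    b0: Sequence[float]                                 # initial state distribution
    actions: Tuple[Sequence[int], Sequence[int]]        # A_1, A_2
    transition: Callable[[int, int, int, int], float]   # P(s' | s, a1, a2)
    reward: Callable[[int, int, int], float]            # R(s, a1, a2)
    observations: Tuple[Sequence[int], Sequence[int]]   # Ω_1, Ω_2
    observe: Callable[[int, int, int, int, int], float] # O(o1, o2 | s', a1, a2)
```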

15 Lemma 2: DMP → Dec-MDP
- States: all triples ⟨…⟩, plus a terminal state for each agent, denoted s_i^f.
- Actions: “Continue”, “Stop”, “Monitor Local”.
- Transitions: P_MDP(⟨…⟩ | ⟨…⟩, monitor) = 0 if t' ≠ t or t' ≠ t'_0; P_MDP(⟨…⟩ | ⟨…⟩, monitor) = P'(q_i^{t'} | q_i^t) if t' = t'_0 = t.
- Reward: −kC if k agents choose to monitor; U(q_1, q_2) if one of the agents chooses to terminate. Utility is zero if one of the agents is in the final state.
- Note: the DMP model can be extended by defining U(s_i^f, q_2) as non-zero to account for different stop times.

16 Local Monitoring Complexity
Proof of Lemma 2, continued.
[Diagram: states at times t, t+1, and t+2 under the “continue” and “monitor” actions, with branch probabilities P and (1.0 − P).]
Theorem 1: DMP is NP-complete.
Proof: Follows from Lemma 1 and Lemma 2.

17 Greedy Solution
- Dynamic local performance profile: Pr_i(q'_i | q_i, Δt) denotes the probability of agent i obtaining a solution of quality q'_i by continuing for Δt steps.
- Definition: A greedy estimate of the expected value of computation (MEVC) for agent i at time t is:
  MEVC(q_i^t, t, t + Δt) = Σ_{q^t} Σ_{q^{t+Δt}} Pr(q^t | q_i^t, t) Pr(q^{t+Δt} | q^t, Δt) (U(q^{t+Δt}, t + Δt) − U(q^t, t))
- Definition: A monitoring policy π(q_i, t) for agent i is a mapping from time step t and local quality level q_i to a decision of whether to continue the algorithm or to stop and act on the currently available solution.
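
A direct transcription of the MEVC formula into code might look like this sketch; the distribution arguments (a local estimate of the joint quality, and a Δt-step joint profile) are assumed interfaces, not the paper’s API.

```python
def mevc(q_i, t, dt, joint_qualities, pr_joint, pr_step, utility):
    """Greedy estimate of the expected value of computation:

        MEVC(q_i, t, t+dt) = sum_{q^t} sum_{q^{t+dt}} Pr(q^t | q_i, t)
                             * Pr(q^{t+dt} | q^t, dt)
                             * (U(q^{t+dt}, t+dt) - U(q^t, t))

    pr_joint(q, q_i, t): agent i's belief over the joint quality q now.
    pr_step(q2, q, dt):  probability the joint quality moves q -> q2 in dt steps.
    """
    value = 0.0
    for q_now in joint_qualities:
        p_now = pr_joint(q_now, q_i, t)
        if p_now == 0.0:
            continue  # skip unreachable joint qualities
        for q_next in joint_qualities:
            p_next = pr_step(q_next, q_now, dt)
            value += p_now * p_next * (utility(q_next, t + dt) - utility(q_now, t))
    return value  # continue while this exceeds the cost of dt more steps
```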

18 Greedy Stopping Rule
- Step 1: Create a local utility function U_i: U_i(q_i, t) = Σ_{q_{-i}^t} Pr(q_{-i}^t) U(⟨q_i, q_{-i}⟩, t)
- Step 2: Create a value function using dynamic programming.
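
Step 2 can be realized by backward induction over (quality, time) pairs; the following is a minimal sketch that compares stopping now (the local utility from Step 1) against the expected value of continuing, ignoring monitoring costs. All function names are illustrative.

```python
def greedy_stopping_rule(qualities, horizon, p_local, u_local):
    """Backward-induction sketch for the greedy stopping rule.

    p_local(q2, q): agent i's one-step quality transition Pr_i(q2 | q).
    u_local(q, t):  the local utility U_i(q, t) built in Step 1.
    Returns a policy mapping (q, t) -> "stop" or "continue", plus values.
    """
    V = {(q, horizon): u_local(q, horizon) for q in qualities}
    policy = {}
    for t in range(horizon - 1, -1, -1):
        for q in qualities:
            stop = u_local(q, t)
            cont = sum(p_local(q2, q) * V[(q2, t + 1)] for q2 in qualities)
            V[(q, t)] = max(stop, cont)
            policy[(q, t)] = "stop" if stop >= cont else "continue"
    return policy, V
```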

19 Greedy Method
- Advantages: quick; a simple dynamic program.
- Disadvantages: may introduce error.

20 Optimal Solution Method
- Represent state-action probabilities as vectors x and y. Each component is the probability of a state-action tuple.
- Form the bilinear program:
  maximize over x, y:  r_1^T x + x^T R y + r_2^T y
  subject to:  A_1 x = α_1,  A_2 y = α_2
- r_1 and r_2 = 0 [no local reward].
- Each entry of R is U(⟨q_1, q_2⟩, t), with q_1, q_2, and t taken from the tuple above. If t is not the same for both agents, the reward is zero.
- α_1, α_2, A_1, and A_2 reflect constraints on policies: state-action probabilities must “add up” correctly, so that the chance of entering each state is the same as the chance of leaving it.
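
One standard way to attack such a bilinear program is alternating best response: fixing y makes the problem a linear program in x, and vice versa. The sketch below (using scipy, with r_1 = r_2 = 0 as on the slide) converges only to a local optimum, so it illustrates the problem shape rather than the talk’s exact solver.

```python
import numpy as np
from scipy.optimize import linprog

def solve_bilinear(R, A1, alpha1, A2, alpha2, iters=50, seed=0):
    """Alternating-LP sketch for:  max_{x,y} x^T R y
       s.t.  A1 x = alpha1,  A2 y = alpha2,  x >= 0, y >= 0."""
    rng = np.random.default_rng(seed)
    y = rng.random(R.shape[1])  # arbitrary starting point for agent 2
    for _ in range(iters):
        # Best x for fixed y: maximize (R y)^T x, i.e. minimize its negation.
        x = linprog(-(R @ y), A_eq=A1, b_eq=alpha1).x
        # Best y for fixed x.
        y = linprog(-(R.T @ x), A_eq=A2, b_eq=alpha2).x
    return x, y, x @ R @ y  # a locally optimal policy pair and its value
```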

21 Global Monitoring
[Diagram: agents alternate compute steps with communication of their current qualities, then jointly terminate.]

22 Global Monitoring
- Case where C_L = ∞ and C_G = 0.
- Greedy solution: use a Value of Information approach. Introduce V*, the expected utility after monitoring, and V, the expected utility without monitoring:
  VoI = V*(q_i, t) − V(q_i, t) − C_G
  V*(q_i, t) = Σ_{q_{-i}} Pr(q_{-i}, t) V*(q_i, q_{-i}, t)
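
In code, the greedy global-monitoring test is a short expected-value computation over those two quantities; this sketch assumes the value functions are supplied, and all names are ours.

```python
def value_of_information(q_i, t, others, pr_other, v_joint, v_local, c_global):
    """VoI = V*(q_i, t) - V(q_i, t) - C_G; monitor globally iff VoI > 0.

    pr_other(q_o, t):     belief over the other agents' quality q_o at time t.
    v_joint(q_i, q_o, t): expected utility V* given the monitored joint quality.
    v_local(q_i, t):      expected utility V without monitoring.
    """
    v_star = sum(pr_other(q_o, t) * v_joint(q_i, q_o, t) for q_o in others)
    return v_star - v_local(q_i, t) - c_global
```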

23 Global Monitoring Complexity
- Theorem 2: The DMP problem where C_L = 0 and C_G is a constant is NP-complete.
- Proof: Solve the DMP by forming and solving a Dec-MDP-Comm-Sync, a Dec-MDP in which agents may communicate after each step. Dec-MDP-Comm-Sync is NP-complete [Goldman & Zilberstein ’04], [Becker, Carlin, Lesser & Zilberstein ’09].

24 Experiments
- Rock Sampling domain: multiple rovers sample rocks; they must plan paths and execute simultaneously. We profile the HSVI POMDP solver.
- Max Flow network: multiple providers solve individual Max Flow problems. We profile the progress of the Max Flow solvers.

25 Rock Sample with Local Monitoring
- C_L = .5, 4, 7, and 10 (top to bottom).
- If the cost of time is low, agents always continue (values converge).
- If the cost of time is high, agents always stop immediately (values converge).
- Therefore C_L alters the slope of each plotted line.

26 Timing Results

Problem             Compile Time (s)  Solve Time (s)
Max Flow Local      3.5               11.4
Rock Sample Local   .13               2.8
Max Flow Global     .04               370
Rock Sample Global  .01               129

27 Rock Sample: Greedy Local Monitoring Policy

Quality  T=1  T=2  T=3  T=4  T=5  T=6
1        4    3    3    2    1    0
2        4    3    3    2    1    0
3        0    3    2    2    1    0
4        0    2    1    1    1    0
5        0    1    0    0    0    0

28 Max Flow: Local Monitoring Policy
[Policy tables: quality levels 1..5 by time steps T=1..6; entries combine continuation lengths with monitor actions (e.g., 3M, 1M, 2M).]

29 Local versus Global Monitoring
[Plots: Rock Sampling and Max Flow. The solid line represents global monitoring; the dotted lines represent optimal and greedy local monitoring.]

35 Conclusions
- Modern algorithms often have anytime properties.
- Previous research has shown that the quality/time tradeoff can be optimized using decision theory.
- Extending this approach to multi-agent systems may introduce error (greedy approach).
- We present a framework for analyzing and optimizing joint anytime performance.
- Simple extensions can relax the joint stopping assumption presented here: stopping can accumulate local quality only, or agents can jointly stop with a time delay.
- Future work: partial observability; exploiting structure in the dependencies (dependency graph).

