Bounded Rationality in Multiagent Systems using Decentralized Metareasoning
Shlomo Zilberstein and Alan Carlin

2 Satisficing
• Satisficing, a Scottish word meaning "satisfying," was proposed by Herbert Simon in 1957 to denote decision making that searches until an alternative is found that meets the agent's aspiration-level criterion. Also referred to as bounded rationality.
• Approaches:
  - Satisficing as approximate reasoning
  - Satisficing as approximate modeling
  - Satisficing as optimal meta-reasoning
  - Satisficing as bounded optimality
  - Satisficing as a combination of the above

3 Resource-bounded reasoning
An approach to satisficing that involves:
• Algorithms that allow small quantities of resources, such as time, memory, or information, to be traded for gains in the value of computed results.
• Compact representation of resource/quality tradeoffs.
• Techniques to compose larger systems from resource-bounded components.
• Meta-level control strategies that optimize the value of computation by exploiting these tradeoffs. [Focus of this talk]

4 Anytime algorithms
[Plot: decision quality vs. time for the Ideal, Traditional, and Anytime cases, with the cost of time and the resulting value overlaid.]
• Ideal - maximal quality in no time
• Traditional - quality maximizing
• Anytime - value maximizing

5 Models of time/quality tradeoff
• Anytime algorithms / flexible computation [Dean & Boddy '88] [Horvitz '87] [Zilberstein & Russell '91]
• Progressive processing [Mouaddib & Charpillet '93] [Zilberstein & Mouaddib '95]
• Design-to-time [Garvey & Lesser '91]
• Imprecise computation [Liu et al. '91]

6 Algorithmic Approaches to Tradeoff
• Single Agent: Performance Profiles with Monitoring [Hansen & Zilberstein '01]
• Multi-Agent: Meta-Reasoning with Negotiation [Raja & Lesser '07]
[Table: example policy/profile indexed by Quality and time T.]

7 Performance Profile

8 Distributed Monitoring Problem

9 Distributed Monitoring Problem (DMP)
• The Distributed Monitoring Problem (DMP) is defined by a tuple ⟨Ag, Q, A, P, U, C_L, C_G, T⟩ such that:
  - Ag is a set of agents.
  - Q_1, Q_2, ..., Q_n is a set of possible quality levels for agents 1..n. Joint qualities are represented by q.
  - A is a set of options, "continue", "stop", "monitor local", "monitor global", available to each agent.
  - P_i is the transition model for agent i: P_i(q_i^{t+1} | q_i^t) ∈ [0, 1].
  - U(q, t) is a utility function that assigns utility to a quality vector q at time t.
  - C_L, C_G are the costs assigned to local and global monitoring.
  - T is the time horizon, the number of steps in the problem.
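To make the tuple concrete, here is a minimal container sketch; the class name, field names, and the dict/callable representations are illustrative assumptions rather than anything prescribed by the DMP definition itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical container mirroring the DMP tuple <Ag, Q, A, P, U, C_L, C_G, T>;
# the field layout and the dict/callable choices are assumptions for illustration.
@dataclass
class DMP:
    agents: List[int]                                # Ag
    quality_levels: Dict[int, List[float]]           # Q_i: possible quality levels per agent
    actions: Tuple[str, ...]                         # A: "continue", "stop", "monitor local", "monitor global"
    P: Dict[int, Dict[Tuple[float, float], float]]   # P_i(q_i^{t+1} | q_i^t) stored as P[i][(q, q_next)]
    U: Callable[[Tuple[float, ...], int], float]     # U(q, t): utility of joint quality vector q at time t
    C_L: float                                       # cost of local monitoring
    C_G: float                                       # cost of global monitoring
    T: int                                           # time horizon
```

Keeping P as a per-agent dictionary mirrors the fact that each agent's quality evolves under its own local transition model between monitoring steps.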

10 Local Monitoring
[Diagram: each agent alternates Compute and Evaluate Solution steps, monitoring only its own progress, and eventually Terminates.]

11 NP-hardness of DMP
• Lemma 1: The problem of finding an optimal solution for a DMP with a fixed number of agents |Ag|, C_L = 0 and C_G = ∞ is NP-hard.
• Proof: Reduction from Decentralized Detection (NP-complete):
  Given finite sets Y_1, Y_2, a probability mass function p: Y_1 × Y_2 → Q, and a partition {A_0, A_1} of Y_1 × Y_2,
  optimize J(γ_1, γ_2) over the selection of γ_i: Y_i → {0, 1}, i = 1, 2, where J(γ_1, γ_2) is given by
  J(γ_1, γ_2) = Σ_{(y_1,y_2) ∈ A_0} p(y_1, y_2) γ_1(y_1) γ_2(y_2) + Σ_{(y_1,y_2) ∈ A_1} p(y_1, y_2) (1 − γ_1(y_1) γ_2(y_2))
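For readers reconstructing the objective, a brute-force evaluation of J over all decision rules makes the formula concrete; the tiny instance below (the sets Y_1, Y_2, the pmf p, and the block A_0) is made up purely for illustration.

```python
from itertools import product

# Toy decentralized-detection instance (all values below are made up for illustration).
Y1, Y2 = [0, 1], [0, 1]
p = {(y1, y2): 0.25 for y1, y2 in product(Y1, Y2)}   # uniform pmf on Y1 x Y2
A0 = {(0, 0), (1, 1)}                                # one block of the partition; A1 is the complement

def J(gamma1, gamma2):
    """Objective J(gamma_1, gamma_2) from the decentralized detection problem in Lemma 1."""
    total = 0.0
    for (y1, y2), prob in p.items():
        agree = gamma1[y1] * gamma2[y2]              # product of the two binary decisions
        total += prob * (agree if (y1, y2) in A0 else 1 - agree)
    return total

# Brute-force search over all decision rules gamma_i : Y_i -> {0, 1}.
best_value, best_g1, best_g2 = max(
    (J(dict(zip(Y1, g1)), dict(zip(Y2, g2))), g1, g2)
    for g1 in product([0, 1], repeat=len(Y1))
    for g2 in product([0, 1], repeat=len(Y2))
)
print(best_value, best_g1, best_g2)
```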

12 Lemma 1: DD → DMP
• Form a 3-step DMP. Quality is known at the first step; the DMP transition model P to the second step is defined by y_1 and y_2.
• Agents choose "continue" at the second step iff γ_1, γ_2 = 0; effects are deterministic.
• Arbitrarily set U(q_i^2, q_j^2), for example to zero.
• Define the third-step utility so that: U(q_i^3, q_j^3) − U(q_i^2, q_j^2) = J(γ_1, γ_2)

13 DMP → Dec-MDP
• Lemma 2: DMP is in NP.
• Proof approach: DMP → Dec-MDP. We reduce DMP to the NP-complete Dec-MDP, a model that contains states, actions, and rewards.
[Diagram: two agents choosing actions a_1, a_2 on a shared environment and receiving reward r; modified from [Amato, Bernstein, & Zilberstein '06].]

14 Dec-POMDP/Dec-MDP definition
• A two-agent Dec-POMDP can be defined with the tuple M = ⟨S, A_1, A_2, P, R, Ω_1, Ω_2, O⟩:
  - S, a finite set of states with designated initial state distribution b_0
  - A_1 and A_2, each agent's finite set of actions
  - P, the state transition model: P(s' | s, a_1, a_2)
  - R, the reward model: R(s, a_1, a_2)
  - Ω_1 and Ω_2, each agent's finite set of observations
  - O, the observation model: O(o_1, o_2 | s', a_1, a_2)
• A Dec-MDP is jointly fully observable.
• A Dec-MDP with transition and observation independence can be thought of as locally fully observable.
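For comparison with the DMP container above, the Dec-POMDP tuple can be written the same way; again, this is only an illustrative sketch of the definition, not an implementation from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DecPOMDP:
    """Two-agent Dec-POMDP <S, A1, A2, P, R, Omega1, Omega2, O> (illustrative container only)."""
    S: List[str]                                      # finite set of states
    b0: Dict[str, float]                              # initial state distribution
    A1: List[str]                                     # agent 1 actions
    A2: List[str]                                     # agent 2 actions
    P: Callable[[str, str, str, str], float]          # P(s' | s, a1, a2), called as P(s_next, s, a1, a2)
    R: Callable[[str, str, str], float]               # R(s, a1, a2)
    Omega1: List[str]                                 # agent 1 observations
    Omega2: List[str]                                 # agent 2 observations
    O: Callable[[str, str, str, str, str], float]     # O(o1, o2 | s', a1, a2), called as O(o1, o2, s_next, a1, a2)
```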

15 Lemma 2: DMP → Dec-MDP
• States: all triples ⟨...⟩, plus a terminal state for each agent, denoted s_i^f.
• Actions: "Continue", "Stop", "Monitor Local".
• Transitions: P_MDP(⟨...⟩ | ⟨...⟩, monitor) = 0 if t' ≠ t or t' ≠ t'_0; P_MDP(⟨...⟩ | ⟨...⟩, monitor) = P'(q_i^{t'} | q_i^t) if t' = t'_0 = t.
• Reward: −kC if k agents choose to monitor; U(q_1, q_2) if one of the agents chooses to terminate. Utility is zero if one of the agents is in the final state.
• Note: the DMP model can be extended by defining U(s_i^f, q_2) as non-zero to account for different stop times.

16 Local Monitoring Complexity
• Proof of Lemma 2, continued.
[Diagram: transition structure of the "continue" and "monitor" actions across times t, t+1, t+2, with probabilities P and (1.0 − P).]
• Theorem 1: DMP is NP-complete.
• Proof: Follows from Lemma 1 and Lemma 2.

17 Greedy Solution
• Dynamic local performance profile: Pr_i(q'_i | q_i, Δt) denotes the probability of agent i obtaining a solution of quality q'_i by continuing for Δt steps.
• Definition: A greedy estimate of the expected value of computation (MEVC) for agent i at time t is:
  MEVC(q_i^t, t, t + Δt) = Σ_{q^t} Σ_{q^{t+Δt}} Pr(q^t | q_i^t, t) Pr(q^{t+Δt} | q^t, Δt) (U(q^{t+Δt}, t + Δt) − U(q^t, t))
• Definition: A monitoring policy Π(q_i, t) for agent i is a mapping from time step t and local quality level q_i to a decision whether to continue the algorithm or act on the currently available solution.
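Written as code, the MEVC estimate is just the double sum over current and future joint qualities; the dictionary-returning helpers pr_joint_now and pr_joint_later below are hypothetical stand-ins for distributions derived from the performance profiles.

```python
def mevc(q_i, t, dt, pr_joint_now, pr_joint_later, U):
    """Greedy estimate of the expected value of computation (MEVC) for agent i.

    pr_joint_now(q_i, t): dict mapping each joint quality q^t to Pr(q^t | q_i^t, t).
    pr_joint_later(q, dt): dict mapping each joint quality q^{t+dt} to Pr(q^{t+dt} | q^t, dt),
                           built from the agents' performance profiles.
    U(q, t): utility of joint quality vector q at time t.
    The dict-based interfaces are illustrative assumptions, not the talk's implementation.
    """
    value = 0.0
    for q_now, p_now in pr_joint_now(q_i, t).items():
        for q_next, p_next in pr_joint_later(q_now, dt).items():
            value += p_now * p_next * (U(q_next, t + dt) - U(q_now, t))
    return value
```

A myopic agent would keep computing whenever this estimate is positive.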

18 Greedy Stopping Rule
• Step 1: Create a local utility function U_i:
  U_i(q_i, t) = Σ_{q_{-i}^t} Pr(q_{-i}^t) U(⟨q_i, q_{-i}⟩, t)
• Step 2: Create a value function using dynamic programming.
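Slide 18 leaves the dynamic program implicit; a minimal backward-induction sketch over local quality and time, assuming the local utility U_i from Step 1, a finite quality set, and a dict-of-dicts transition model (and ignoring monitoring costs), might look like this.

```python
def greedy_stopping_policy(qualities, T, P_i, U_i):
    """Backward induction over (q_i, t) for a single agent.

    qualities: finite list of local quality levels.
    P_i[q][q_next]: one-step probability of moving from quality q to q_next (assumed dict of dicts).
    U_i(q, t): local utility function from Step 1.
    Returns the value function V[(q, t)] and a policy pi[(q, t)] in {"stop", "continue"}.
    """
    V, pi = {}, {}
    for q in qualities:                    # at the horizon, the only option is to stop
        V[(q, T)] = U_i(q, T)
        pi[(q, T)] = "stop"
    for t in range(T - 1, -1, -1):
        for q in qualities:
            stop_value = U_i(q, t)
            cont_value = sum(p * V[(q_next, t + 1)] for q_next, p in P_i[q].items())
            if cont_value > stop_value:
                V[(q, t)], pi[(q, t)] = cont_value, "continue"
            else:
                V[(q, t)], pi[(q, t)] = stop_value, "stop"
    return V, pi
```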

19 Greedy Method
• Advantages: quick dynamic program.
• Disadvantages: may introduce error.

20 Optimal Solution method
• Represent state-action probabilities as vectors x and y. Each component is the probability of a state-action tuple.
• Form the bilinear program:
  maximize_{x,y}  r_1^T x + x^T R y + r_2^T y
  subject to  A_1 x = α_1,  A_2 y = α_2
• r_1 and r_2 = 0 [no local reward].
• Each entry of R = U(⟨q_1, q_2⟩, t), with q_1, q_2, and t taken from the tuple above. If t is not the same for both agents, the reward is zero.
• α_1, α_2, A_1, and A_2 reflect constraints on policies: state-action probabilities must "add up" correctly so that the chance of entering each state is the same as the chance of leaving it.
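The bilinear program above is not a plain LP, but one common way to approximate such problems is alternating best response: fix y and solve an LP in x, then fix x and solve an LP in y. The sketch below, using scipy.optimize.linprog with dense matrices, is a heuristic illustration under that assumption and only reaches a local optimum; it is not presented as the solution method used in the talk.

```python
import numpy as np
from scipy.optimize import linprog

def solve_bilinear(R, A1, alpha1, A2, alpha2, iters=50):
    """Alternating maximization for: max_{x,y} x^T R y  s.t.  A1 x = alpha1, A2 y = alpha2, x, y >= 0.

    (r_1 and r_2 are zero here, as on the slide.)  Each step fixes one vector and solves
    the resulting LP for the other.
    """
    # Start from any feasible y (a feasibility LP with a zero objective).
    y = linprog(np.zeros(R.shape[1]), A_eq=A2, b_eq=alpha2, bounds=(0, None)).x
    for _ in range(iters):
        # With y fixed, the objective is linear in x: (R y)^T x.  linprog minimizes, so negate.
        x = linprog(-(R @ y), A_eq=A1, b_eq=alpha1, bounds=(0, None)).x
        # With x fixed, the objective is linear in y: (R^T x)^T y.
        y = linprog(-(R.T @ x), A_eq=A2, b_eq=alpha2, bounds=(0, None)).x
    return x, y, x @ R @ y
```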

21 Global Monitoring
[Diagram: each agent alternates Compute and Communicate Quality steps and eventually Terminates.]

22 Global Monitoring
• Case where C_L = ∞ and C_G = 0.
• Greedy Solution: use a Value of Information approach.
  - Introduce V*, the expected utility after monitoring, and V, the expected utility without monitoring.
  - VoI = V*(q_i, t) − V(q_i, t) − C_G
  - V*(q_i, t) = Σ_{q_{-i}} Pr(q_{-i}, t) V*(q_i, q_{-i}, t)
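The greedy global-monitoring rule ("monitor when the value of information exceeds its cost") can be sketched directly from the two equations above; pr_others, V_star_joint, and V_no_monitor are hypothetical callables standing in for quantities estimated from the performance profiles.

```python
def value_of_information(q_i, t, pr_others, V_star_joint, V_no_monitor, C_G):
    """Greedy global-monitoring rule: monitor iff the returned VoI is positive.

    pr_others(t): dict mapping the other agents' quality vector q_{-i} to Pr(q_{-i}, t).
    V_star_joint(q_i, q_others, t): expected utility after monitoring (joint quality known).
    V_no_monitor(q_i, t): expected utility of proceeding without monitoring.
    All three are illustrative stand-ins, not functions defined in the talk.
    """
    V_star = sum(p * V_star_joint(q_i, q_others, t)
                 for q_others, p in pr_others(t).items())
    return V_star - V_no_monitor(q_i, t) - C_G
```

An agent following this rule requests global monitoring exactly when the returned value is positive.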

23 Global Monitoring Complexity
• Theorem 2: The DMP problem where C_L = 0 and C_G is a constant is NP-complete.
• Proof: Solve the DMP by forming and solving a Dec-MDP-Comm-Sync.
  - Dec-MDP-Comm-Sync is a Dec-MDP where agents may communicate after each step.
  - Dec-MDP-Comm-Sync is NP-complete [Goldman & Zilberstein '04], [Becker, Carlin, Lesser & Zilberstein '09].

24 Experiments
• Rock Sampling domain
  - Multiple rovers sampling rocks; they must plan paths and execute simultaneously.
  - Profile the HSVI POMDP solver.
• Max Flow network
  - Multiple providers solving individual Max Flow problems.
  - Profile the progress of Max Flow solvers.

25 Rock Sample with Local Monitoring
• C_L = 0.5, 4, 7, and 10 (top to bottom).
• If the cost of time is low, agents always continue (values converge).
• If the cost of time is high, agents always stop immediately (values converge).
• Therefore C_L alters the slope of each plotted line.

26 Timing Results
Problem              Compile Time (s)   Solve Time (s)
Max Flow Local
Rock Sample Local
Max Flow Global
Rock Sample Global   .01129

27 Rock Sample: Greedy Local Monitoring Policy
[Tables: greedy policy indexed by Quality and time step T.]

28 Max Flow: Local Monitoring Policy
[Tables: local monitoring policy indexed by Quality and time step T.]

29 Local versus Global Monitoring
[Plots for Rock Sampling and Max Flow. The solid line represents global monitoring; the dotted lines represent optimal and greedy local monitoring.]

35 Conclusions
• Often, modern algorithms have anytime properties.
• Previous research has shown that the quality/time tradeoff can be optimized using decision theory.
• Extending this approach to multi-agent systems may introduce error (greedy approach).
• We present a framework for analyzing and optimizing joint anytime performance.
• Simple extensions can relax the joint stopping assumption presented here:
  - Stopping accumulates local quality only
  - Joint stop with a time delay
• Future work:
  - Partial observability
  - Exploiting structure in the dependencies (dependency graph)