
1 Multiagent Planning with Factored MDPs. Carlos Guestrin and Daphne Koller (Stanford University), Ronald Parr (Duke University)

2 Multiagent Coordination Examples: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control. Common challenges: multiple simultaneous decisions, limited observability, limited communication.

3 Network Management Problem. Administrators must coordinate to maximize the global reward. [Figure: a ring of machines M1-M4 from time t to t+1; each machine i has status Si, load Li, and action Ai; reward Ri is received when a process terminates successfully; the next status Si' depends on the neighboring machines.]

4 Joint Decision Space. Represent the problem as a single MDP: the action space is the joint action a = {a1, …, an} of all agents; the state space is the joint state x of the entire system; the reward function is the total reward r. But the action space is exponential in the number of agents (n binary-action agents already give 2^n joint actions), the state space is exponential in the number of variables, and a global decision requires complete observation.

5 Long-term Utilities. One-step utility: SysAdmin Ai receives a reward ($) when a process completes. Total utility: the sum of rewards. The optimal action requires long-term planning. Long-term utility Q(x,a): the expected reward, given current state x and action a. The optimal action at state x is a* = argmax_a Q(x,a).
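The definition of Q is left to the slide graphics; a hedged reconstruction using the standard discounted-MDP formulation (the discount factor γ is an assumption, not shown in the transcript):

    Q(x, a) = R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) \, V(x')

where V(x') is the value of acting optimally from the next state onward.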

6 Local Q-Function Approximation. Approximate the global Q-function as a sum of local terms:
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)
Each Qi is associated with one agent (e.g., Q3 with agent 3). Limited observability: agent i only observes the variables in Qi (agent 3 observes only X2 and X3). The agents must choose a joint action maximizing Σi Qi.

7 Maximizing Σi Qi: the Coordination Graph. Use variable elimination for the maximization [Bertele & Brioschi '72]. The coordination graph links agents A1-A4 whose local Q functions share action variables. For a fixed state, the agents must compute
max_{A1,A2,A3,A4} [ Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4) ]
Eliminating A4 first yields g(A1,A3) = max_{A4} [ Q1(A1,A4) + Q4(A3,A4) ], leaving
max_{A1,A2,A3} [ Q2(A1,A2) + Q3(A2,A3) + g(A1,A3) ]
Here we need only 23 instead of 63 sum operations. Limited communication suffices for the optimal action choice: the communication bandwidth equals the induced width of the coordination graph.
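To make the elimination concrete, here is a minimal Python sketch of max-sum variable elimination over this coordination graph, using slide 6's scopes; the binary action domains and the random table entries are invented placeholders, not values from the talk:

    import itertools
    import random

    DOMAIN = (0, 1)  # assumed binary action choices; a placeholder, not from the talk

    def value(factor, assignment):
        # Look up a factor's value under a full assignment (a dict agent -> action).
        scope, table = factor
        return table[tuple(assignment[v] for v in scope)]

    def eliminate(factors, agent):
        # Max out one agent: merge every factor whose scope mentions it into
        # a single new factor g over that agent's neighbors.
        touching = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {agent}))
        table = {}
        for vals in itertools.product(DOMAIN, repeat=len(new_scope)):
            ctx = dict(zip(new_scope, vals))
            table[vals] = max(
                sum(value(f, {**ctx, agent: a}) for f in touching) for a in DOMAIN)
        return rest + [(new_scope, table)]

    # Slide 6's scopes for one fixed state: Q1(A1,A4), Q2(A1,A2), Q3(A2,A3),
    # Q4(A3,A4).  Random entries stand in for the real local Q values.
    random.seed(0)
    factors = [(scope, {acts: random.random()
                        for acts in itertools.product(DOMAIN, repeat=2)})
               for scope in (("A1", "A4"), ("A1", "A2"), ("A2", "A3"), ("A3", "A4"))]

    for agent in ("A4", "A3", "A2", "A1"):  # one elimination step per agent
        factors = eliminate(factors, agent)

    print("max_a sum_i Q_i =", factors[0][1][()])  # empty-scope factor = the maximum

A full implementation would also record the maximizing action at each elimination step and back-substitute in reverse order to recover the optimal joint action; the communication pattern mirrors the elimination order.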

8 Where do the Qi come from? Use function approximation to find the Qi:
Q(X1,…,X4, A1,…,A4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)
Long-term planning requires a Markov decision process, but the number of states and the number of actions are exponential. Efficient approximation is possible by exploiting structure!

9 Dynamic Decision Diagram. [Figure: a dynamic Bayesian network over machines M1-M4, showing state variables X1-X4, decisions A1-A4, rewards R1-R4, and next-state variables X1'-X4'; the network encodes the state dynamics, e.g., P(X1' | X1, X4, A1).]
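A factored MDP stores the transition model as one small conditional probability table per next-state variable. For illustration, a minimal Python sketch of the P(X1' | X1, X4, A1) dependency above; all probabilities and the "a1 = 1 means reboot" semantics are assumptions:

    # One small table per next-state variable; parents follow the slide's
    # diagram: X1' depends only on X1, X4, and agent 1's action A1.
    P_X1_NEXT = {
        # (x1, x4, a1) -> P(X1' = 1), i.e. machine 1 working at time t+1
        (1, 1, 0): 0.95,
        (1, 0, 0): 0.60,  # a failing neighbor drags machine 1 down
        (0, 1, 0): 0.05,
        (0, 0, 0): 0.01,
        (1, 1, 1): 0.98,
        (1, 0, 1): 0.98,
        (0, 1, 1): 0.90,  # reboot usually restores a dead machine
        (0, 0, 1): 0.90,
    }

    def p_x1_next(x1_next, x1, x4, a1):
        p = P_X1_NEXT[(x1, x4, a1)]
        return p if x1_next == 1 else 1.0 - p

    # The joint transition model is never built explicitly; it factors as
    #   P(x' | x, a) = prod_i P(X_i' | parents_i(x), A_i)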

10 Long-term Utility = Value of MDP. The value is computed by linear programming, with one variable V(x) for each state and one constraint for each state x and action a. But the numbers of states and actions are exponential!
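The LP itself appears only in the slide graphics; the standard exact formulation, with state-relevance weights α(x) > 0 and discount factor γ (both standard, neither shown in the transcript), is:

    \min_V \; \sum_x \alpha(x) \, V(x)
    \text{subject to } V(x) \ge R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) \, V(x') \quad \forall x, a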

11 Decomposable Value Functions. Approximate the value as a linear combination of restricted-domain basis functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99,'00] [Guestrin et al. '01]: V(x) ≈ Σi wi hi(x). Each hi depends on the status of a small part of the complex system, e.g., the status of a machine and its neighbors, or the load on a machine. We must find weights w giving a good approximate value function.

12 Single LP Solution for Factored MDPs [Schweitzer and Seidmann '85]. One variable wi for each basis function, so only polynomially many LP variables; but one constraint for every state and action, so exponentially many LP constraints. The key to tractability: hi and Qi depend only on small sets of variables and actions.
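Substituting V(x) ≈ Σi wi hi(x) into the exact LP above gives the approximate LP; a hedged reconstruction of its standard form:

    \min_w \; \sum_i w_i \sum_x \alpha(x) \, h_i(x)
    \text{subject to } \sum_i w_i h_i(x) \ge R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) \sum_i w_i h_i(x') \quad \forall x, a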

13 Representing Exponentially Many Constraints [Guestrin et al. '01]. The exponentially many linear constraints can be replaced by one nonlinear constraint.
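In symbols, writing Q_w(x, a) for the right-hand side of the constraint above:

    \sum_i w_i h_i(x) \ge Q_w(x, a) \;\; \forall x, a \quad \Longleftrightarrow \quad 0 \ge \max_{x, a} \Big[ Q_w(x, a) - \sum_i w_i h_i(x) \Big]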

14 Representing the Constraints. Because the functions are factored, variable elimination can represent the constraints compactly, making the number of constraints exponentially smaller.
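Schematically, following [Guestrin et al. '01] (a sketch, not the slide's own notation): the term inside the max decomposes into functions c_j(w, z_j) with small scopes z_j, so the max can be compiled by variable elimination. Each elimination of a variable x_l, which would normally compute e(z) = max_{x_l} Σ_j e_j(z, x_l), instead introduces one LP variable u^e_z per assignment z together with the linear constraints

    u^e_z \ge \sum_j u^{e_j}_{(z, x_l)} \quad \text{for every assignment } z \text{ and every value } x_l

and a final constraint requires 0 to dominate the sum of the last remaining functions. The number of constraints is then exponential only in the induced width of the elimination order, not in the number of state variables.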

15 Summary of Algorithm. 1. Pick local basis functions hi. 2. Solve a single LP to compute the local Qi's in the factored MDP. 3. Use the coordination graph to compute the maximizing action.
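As a toy end-to-end sketch of these three steps, here is a hypothetical two-machine problem, small enough that the LP constraints are enumerated explicitly rather than compiled by variable elimination; the transition probabilities, rewards, and basis choice are all invented:

    import itertools

    import numpy as np
    from scipy.optimize import linprog

    GAMMA = 0.95
    STATES = list(itertools.product((0, 1), repeat=2))   # (x1, x2): machine status
    ACTIONS = list(itertools.product((0, 1), repeat=2))  # (a1, a2): reboot flags

    def p_up(xi, ai):
        # P(machine i works next step); numbers are invented placeholders.
        return 0.9 if ai == 1 else (0.8 if xi == 1 else 0.1)

    def P(x, a, xn):
        # Factored joint transition: a product of per-machine tables.
        prob = 1.0
        for xi, ai, xni in zip(x, a, xn):
            p = p_up(xi, ai)
            prob *= p if xni == 1 else 1.0 - p
        return prob

    def R(x, a):
        return float(sum(x))  # one unit of reward per working machine

    # Step 1: pick local basis functions h_i (a constant plus one per machine).
    basis = [lambda x: 1.0, lambda x: float(x[0]), lambda x: float(x[1])]

    # Step 2: a single LP over the weights w:
    #   minimize sum_x V_w(x)  s.t.  V_w(x) >= R(x,a) + gamma * E[V_w(x') | x,a]
    c = np.array([sum(h(x) for x in STATES) for h in basis])
    A_ub, b_ub = [], []
    for x in STATES:
        for a in ACTIONS:
            eh = [sum(P(x, a, xn) * h(xn) for xn in STATES) for h in basis]
            A_ub.append([-(h(x) - GAMMA * e) for h, e in zip(basis, eh)])
            b_ub.append(-R(x, a))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * len(basis))
    assert res.success
    w = res.x

    # Step 3: act greedily with respect to the induced Q; the max over joint
    # actions is enumerated here, which is exactly the step the coordination
    # graph replaces when there are many agents.
    def Q(x, a):
        ev = sum(P(x, a, xn) * sum(wi * h(xn) for wi, h in zip(w, basis))
                 for xn in STATES)
        return R(x, a) + GAMMA * ev

    print(max(ACTIONS, key=lambda a: Q((0, 1), a)))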

16 Network Management Problem. [Figure: the network topologies used in the experiments: unidirectional ring, server star, and ring of rings.]

17 Single Agent Policy Quality. Single LP versus approximate policy iteration (PI = approximate policy iteration with max-norm projection [Guestrin et al. '01]). [Plot: discounted reward (50-350) vs. number of machines (0-30) for LP single basis, PI single basis, LP pair basis, and LP triple basis.]

18 Single Agent Running Time. PI = approximate policy iteration with max-norm projection [Guestrin et al. '01]. [Plot: running times for LP single basis, PI single basis, LP pair basis, and LP triple basis.]

19 Multiagent Policy Quality. Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99].

20 Multiagent Policy Quality. Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]. [Plot: distributed reward and distributed value curves.]

21 Multiagent Policy Quality. Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]. [Plot: LP single basis and LP pair basis curves alongside distributed reward and distributed value.]

22 Multiagent Running Time. [Plot: running times for the star topology (single basis and pair basis) and the ring of rings.]

23 Conclusions. A multiagent planning algorithm with limited communication and limited observability; a unified view of function approximation and multiagent communication. The single-LP solution is simple and very efficient. Exploit structure to reduce computation costs: very large MDPs can be solved efficiently!

24 Solve Very Large MDPs. Solved MDPs with a state count roughly 477 digits long (on the order of 10^477 states; the slide spells the full number out digit by digit), over 10^150 actions, and 500 agents.


