
1 Multiagent Planning with Factored MDPs Carlos Guestrin Stanford University

2 Collaborative Multiagent Planning. Example domains: search and rescue, factory management, supply chains, firefighting, network routing, air traffic control. Common features: long-term goals, multiple agents, coordinated decisions.

3 Exploiting Structure. Real-world problems have hundreds of objects and googols of states, but real-world problems have structure! Approach: exploit the structured representation to obtain an efficient approximate solution.

4 Real-time Strategy Game. Peasants collect resources and build; footmen attack enemies; buildings train peasants and footmen.

5 Joint Decision Space. Markov Decision Process (MDP) representation: State space: joint state x of the entire system. Action space: joint action a = {a1, ..., an} for all agents. Reward function: total reward R(x,a). Transition model: dynamics of the entire system, P(x'|x,a).
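Below is a minimal sketch (not from the talk) of this flat joint representation in Python; the state variables, action names, and discount value are hypothetical placeholders.

```python
# Flat (unfactored) MDP over joint states and joint actions, stored explicitly.
from dataclasses import dataclass
from typing import Dict, List, Tuple

JointState = Tuple[str, ...]   # one entry per state variable
JointAction = Tuple[str, ...]  # one entry per agent

@dataclass
class FlatMDP:
    states: List[JointState]                                    # all joint states x
    actions: List[JointAction]                                   # all joint actions a
    reward: Dict[Tuple[JointState, JointAction], float]          # R(x, a)
    transition: Dict[Tuple[JointState, JointAction],
                     Dict[JointState, float]]                    # P(x' | x, a)
    gamma: float = 0.95                                          # discount factor

# Even two peasants with three local actions each already yield 3**2 joint
# actions; the tables above grow exponentially in the number of agents and
# state variables, which is exactly what the rest of the talk works around.
```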

6 Policy. A policy π(x) = a selects, at state x, an action for all agents. Example: π(x0) = both peasants get wood; π(x1) = one peasant gets gold, the other builds the barracks; π(x2) = peasants get gold, footmen attack.

7 Value of Policy. The value Vπ(x) is the expected long-term reward starting from x: Vπ(x0) = E[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ...], where future rewards are discounted by γ ∈ [0,1).

8 Optimal Long-term Plan. The optimal policy π*(x) and optimal value function V*(x) satisfy the Bellman equations: V*(x) = max_a [R(x,a) + γ Σ_x' P(x'|x,a) V*(x')], and π*(x) = argmax_a [R(x,a) + γ Σ_x' P(x'|x,a) V*(x')].

9 Solving an MDP. Many algorithms solve the Bellman equations: policy iteration [Howard '60, Bellman '57], value iteration [Bellman '57], linear programming [Manne '60], ... Each yields the optimal value V*(x) and optimal policy π*(x).
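As a concrete illustration of one of the solvers listed above, here is a short value-iteration sketch; the tabular arrays P and R are assumed inputs, not something supplied in the talk.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: array [A, S, S] with P[a, x, x'] = P(x'|x,a); R: array [S, A]."""
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Bellman backup: Q(x,a) = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
        Q = R + gamma * np.einsum("axy,y->xa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal value and greedy policy
        V = V_new
```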

10 LP Solution to MDP. The value function can be computed by linear programming [Manne '60]: minimize Σ_x V(x) subject to V(x) ≥ Q(x,a) for all x, a, where Q(x,a) = R(x,a) + γ Σ_x' P(x'|x,a) V(x'). One variable V(x) for each state and one constraint for each state x and action a give a polynomial-time solution (in the number of states and actions).
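A sketch of this exact LP using scipy's linprog (an assumed dependency, not mentioned in the talk); it enumerates one constraint per state-action pair, which is only feasible for small flat MDPs.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.95):
    """P: [A, S, S], R: [S, A]. Returns the optimal value function V*."""
    num_actions, num_states, _ = P.shape
    c = np.ones(num_states)                 # objective: minimize sum_x V(x)
    A_ub, b_ub = [], []
    for a in range(num_actions):
        for x in range(num_states):
            # V(x) - gamma * sum_x' P(x'|x,a) V(x') >= R(x,a), flipped for <=
            A_ub.append(-np.eye(num_states)[x] + gamma * P[a, x])
            b_ub.append(-R[x, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * num_states, method="highs")
    return res.x
```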

11 Planning under Bellman's "Curse". Planning is polynomial in the number of states and actions, but the number of states is exponential in the number of variables and the number of actions is exponential in the number of agents. The remedy: efficient approximation by exploiting structure!

12 Structure in Representation: Factored MDP. State variables (Peasant, Footman, Enemy, Gold), decisions (A_Peasant, A_Build, A_Footman), and rewards R are arranged in a dynamic Bayesian network over time slices t and t+1; for example, P(F'|F, G, A_B, A_F). The complexity of the representation is exponential only in the number of parents of each variable (worst case) [Boutilier et al. '95].
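A hypothetical sketch of such a factored transition component, with one conditional probability table per next-step variable; the variable values and probabilities below are invented for illustration.

```python
import numpy as np
from typing import Dict, Tuple

# P(F' | F, G, A_Build, A_Footman): keyed by the parent assignment, with a
# distribution over F' as the value. Table size is exponential only in the
# number of parents, not in the total number of state variables.
CPT = Dict[Tuple[str, ...], Dict[str, float]]

footman_cpt: CPT = {
    ("alive", "have_gold", "train", "attack"): {"alive": 0.8, "dead": 0.2},
    ("alive", "no_gold", "wait", "attack"): {"alive": 0.6, "dead": 0.4},
    # ... one entry per parent assignment
}

def sample_next(cpt: CPT, parents: Tuple[str, ...], rng: np.random.Generator) -> str:
    """Draw the next value of one state variable given its parents."""
    values, probs = zip(*cpt[parents].items())
    return rng.choice(values, p=probs)

rng = np.random.default_rng(0)
print(sample_next(footman_cpt, ("alive", "have_gold", "train", "attack"), rng))
```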

13 Structured Value Function? Does a factored MDP imply structure in V*? [Diagram: the DBN over variables X, Y, Z and reward R unrolled over time steps t through t+3.] Almost! A factored MDP does not, in general, give a compactly factored V*, but a structured V yields a good approximate value function.

14 Structured Value Functions. Approximate V as a linear combination of restricted-domain basis functions [Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]: Ṽ(x) = Σ_i w_i h_i(x). Each h_i is the status of a small part of the complex system, e.g., the state of a footman and enemy, the status of the barracks, or the status of the barracks together with the state of a footman. A structured V gives a structured Q: Q̃ = Σ_i Q_i, where each Q_i depends only on a small number of A_i's and X_j's. The task is to find weights w giving a good approximate value function.
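The following sketch shows what such a linear value function looks like in code; the basis functions and weights are illustrative placeholders (in practice the weights come from the factored LP described next).

```python
# V~(x) = sum_i w_i * h_i(x), where each h_i inspects only a few state variables.

def indicator(var, value):
    """Basis function that looks at a single state variable."""
    return lambda x: 1.0 if x[var] == value else 0.0

basis = [
    indicator("footman", "alive"),
    indicator("barracks", "built"),
    lambda x: 1.0 if x["footman"] == "alive" and x["enemy"] == "weak" else 0.0,
]
weights = [2.0, 1.5, 3.0]   # w_i

def v_approx(x):
    return sum(w * h(x) for w, h in zip(weights, basis))

print(v_approx({"footman": "alive", "barracks": "built", "enemy": "weak"}))  # 6.5
```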

15 Approximate LP Solution [Schweitzer and Seidmann '85]: minimize Σ_x Σ_i w_i h_i(x) subject to Σ_i w_i h_i(x) ≥ Σ_i Q_i(x,a) for all x, a. One variable w_i for each basis function gives a polynomial number of LP variables, but one constraint for every state and action gives exponentially many LP constraints.
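For intuition, here is the approximate LP written out for a problem small enough to enumerate every (x, a) constraint explicitly (again using scipy's linprog as an assumed dependency); the factored LP of the next slides exists precisely to avoid this enumeration.

```python
import numpy as np
from scipy.optimize import linprog

def approx_lp_weights(P, R, H, gamma=0.95):
    """P: [A, S, S], R: [S, A], H: [S, K] with H[x, i] = h_i(x). Returns w.
    Include a constant basis function (a column of ones) so the LP is bounded."""
    num_actions, num_states, _ = P.shape
    k = H.shape[1]
    c = H.sum(axis=0)                        # objective: sum_x sum_i w_i h_i(x)
    A_ub, b_ub = [], []
    for a in range(num_actions):
        for x in range(num_states):
            # sum_i w_i [h_i(x) - gamma * sum_x' P(x'|x,a) h_i(x')] >= R(x,a)
            A_ub.append(-(H[x] - gamma * P[a, x] @ H))   # flipped for <=
            b_ub.append(-R[x, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * k, method="highs")
    return res.x
```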

16 Representing Exponentially Many Constraints. Exponentially many linear constraints are equivalent to one nonlinear constraint: 0 ≥ max_{x,a} [Σ_i Q_i(x,a) - Σ_i w_i h_i(x)], a maximization over an exponential space [Guestrin, Koller, Parr '01].

17 Variable Elimination. Use variable elimination to maximize over the state space [Bertele & Brioschi '72]:
max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
= max_{A,B,C} f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]
= max_{A,B,C} f1(A,B) + f2(A,C) + g1(B,C).
Here we need only 23 instead of 63 sum operations. Maximization is exponential only in the largest factor; the tree-width, a graph-theoretic measure of "connectedness", characterizes the complexity. It arises in many settings: integer programming, Bayes nets, computational geometry, ...

18 Variable Elimination (continued). The same elimination applies to the structured value-function constraint max_{x,a} [Σ_i Q_i(x,a) - Σ_i w_i h_i(x)], since each Q_i involves only a small number of A_i's and X_j's and each h_i only a small number of X_j's.
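A compact sketch of this max-sum variable elimination over factored functions; the factor scopes, domains, and table values in the usage lines are illustrative.

```python
from itertools import product

def eliminate_max(factors, domains, order):
    """Maximize a sum of local functions by eliminating variables in `order`.
    factors: list of (scope_tuple, table) with table keyed by value tuples."""
    factors = list(factors)
    for var in order:
        touched = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        scope = tuple(sorted({v for s, _ in touched for v in s if v != var}))
        table = {}
        for assign in product(*(domains[v] for v in scope)):
            ctx = dict(zip(scope, assign))
            table[assign] = max(
                sum(t[tuple(dict(ctx, **{var: val})[v] for v in s)] for s, t in touched)
                for val in domains[var])
        factors = rest + [(scope, table)]        # new factor over var's neighbours
    return sum(t[()] for _, t in factors)        # all scopes are now empty

# Tiny usage example: max over A,B,C,D of f1(A,B) + f3(C,D) with 0/1 domains.
domains = {v: [0, 1] for v in "ABCD"}
f1 = (("A", "B"), {(a, b): a + 2 * b for a in (0, 1) for b in (0, 1)})
f3 = (("C", "D"), {(c, d): 3 * c - d for c in (0, 1) for d in (0, 1)})
print(eliminate_max([f1, f3], domains, order=["D", "C", "B", "A"]))  # 6
```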

19 Representing the Constraints Use Variable Elimination to represent constraints: Number of constraints exponentially smaller!

20 Understanding Scaling Properties. Explicit LP: 2^n constraints; factored LP: (n+1-k)·2^k constraints, where k is the tree-width. [Plot: number of constraints versus number of state variables n, comparing the explicit LP against the factored LP for k = 3, 5, 8, 10, 12.]

21 Network Management Problem. Topologies: ring, star, ring of rings, k-grid. Each computer has status ∈ {good, dead, faulty}; dead neighbors increase the probability of dying. Computers run processes, and there is a reward for successful processes. Each SysAdmin takes a local action ∈ {reboot, not reboot}. A problem with n machines has 9^n states and 2^n actions.
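A hypothetical sketch of the per-machine dynamics described on this slide; the specific probabilities are invented, and only the qualitative behavior (dead neighbors make dying more likely, a reboot resets the machine) follows the slide.

```python
import numpy as np

def next_status(status, dead_neighbours, action, rng):
    """One machine's status transition: {good, faulty, dead} x local action."""
    if action == "reboot":
        return "good"                                   # reboot always repairs
    p_die = min(0.05 + 0.3 * dead_neighbours, 0.95)     # dead neighbours hurt
    if rng.random() < p_die:
        return "dead"
    if status == "good" and rng.random() < 0.1:
        return "faulty"                                 # occasional degradation
    return status

rng = np.random.default_rng(0)
print(next_status("good", dead_neighbours=2, action="not reboot", rng=rng))
```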

22 Running Time. [Plot of running time versus problem size for: ring with exact solution; ring, star, and 3-grid with a single basis (k = 4, 4, 5); star with pair basis (k = 4); ring with pair basis (k = 8). k denotes tree-width.]

23 Summary of Algorithm. 1. Pick local basis functions h_i. 2. The factored LP computes the value function. 3. The policy is the argmax_a of Q.

24 Large-scale Multiagent Coordination. The efficient algorithm computes V, and the action at state x is argmax_a Q(x,a). Problems: the number of joint actions is exponential, and this seems to require complete observability and full communication.

25 Distributed Q Function [Guestrin, Koller, Parr '02]. Q(A1,...,A4, X1,...,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4). Each agent maintains one part of the Q function.

26 Multiagent Action Selection. Given the distributed Q function Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4), first instantiate the current state x, then find the maximal action argmax_a.

27 Instantiate Current State x. After instantiating x, the local terms reduce to Q1(A1,A4), Q2(A1,A2), Q3(A2,A3), Q4(A3,A4). Limited observability: agent i only observes the variables appearing in Q_i (e.g., agent 2 observes only X1 and X2).

28 Multiagent Action Selection (continued). With the state instantiated, what remains is to compute the maximal joint action argmax_a of Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4).

29 Coordination Graph. Use variable elimination over the agents' actions:
max_{A1,A2,A3,A4} Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + max_{A3} [Q3(A2,A3) + Q4(A3,A4)]
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4).
Eliminating A3 produces a table g1(A2,A4) giving the value of the optimal A3 action for each combination of A2 and A4 (attack/defend, with values 5, 6, 8, 12 in the example). Limited communication suffices for the optimal action choice; the communication bandwidth equals the tree-width of the coordination graph.
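The sketch below (see also the elimination routine earlier) carries this out on a four-agent ring, recording each eliminated agent's best response so the joint argmax can be recovered by back-substitution; the Q tables are made-up numbers, not the values from the slide.

```python
from itertools import product

ACTIONS = ("attack", "defend")

# Pairwise Q functions on the ring A1-A2-A3-A4-A1 (illustrative values).
Q_funcs = {
    ("A1", "A2"): {("attack", "attack"): 2, ("attack", "defend"): 5,
                   ("defend", "attack"): 1, ("defend", "defend"): 4},
    ("A2", "A3"): {("attack", "attack"): 4, ("attack", "defend"): 1,
                   ("defend", "attack"): 1, ("defend", "defend"): 4},
    ("A3", "A4"): {("attack", "attack"): 5, ("attack", "defend"): 2,
                   ("defend", "attack"): 2, ("defend", "defend"): 2},
    ("A1", "A4"): {("attack", "attack"): 0, ("attack", "defend"): 2,
                   ("defend", "attack"): 2, ("defend", "defend"): 0},
}

def coordinate(q_funcs, domains, order):
    """Max-sum over agent actions, then back-substitute to get the joint argmax."""
    factors = [(scope, table) for scope, table in q_funcs.items()]
    best_response = {}                       # agent -> (scope, best-action table)
    for agent in order:
        touched = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        scope = tuple(sorted({v for s, _ in touched for v in s if v != agent}))
        new_table, br_table = {}, {}
        for assign in product(*(domains[v] for v in scope)):
            ctx = dict(zip(scope, assign))
            new_table[assign], br_table[assign] = max(
                (sum(t[tuple(dict(ctx, **{agent: act})[v] for v in s)] for s, t in touched), act)
                for act in domains[agent])
        best_response[agent] = (scope, br_table)
        factors = rest + [(scope, new_table)]
    value = sum(t[()] for _, t in factors)
    joint = {}
    for agent in reversed(order):            # agents in scope were eliminated later
        scope, br_table = best_response[agent]
        joint[agent] = br_table[tuple(joint[v] for v in scope)]
    return value, joint

domains = {agent: ACTIONS for agent in ("A1", "A2", "A3", "A4")}
print(coordinate(Q_funcs, domains, order=["A3", "A4", "A2", "A1"]))
```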

30 Coordination Graph Example. [Graph over eleven agents A1 through A11.] Trees don't increase communication requirements; cycles require graph triangulation.

31 Unified View: Function Approximation ⇔ Multiagent Coordination. A decomposition Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4) induces a connected coordination graph over A1,...,A4, whereas Q1(A1,X1) + Q2(A2,X2) + Q3(A3,X3) + Q4(A4,X4) induces no coordination edges. The factored MDP and value function representations induce the communication and coordination structure, giving a tradeoff between communication and accuracy.

32 How good are the policies? SysAdmin problem Power grid problem [Schneider et al. ‘99]

33 SysAdmin Ring: Quality of Policies. [Plot comparing estimated policy value against the utopic maximum value for the exact solution, constraint sampling with single basis, constraint sampling with pair basis, and the factored LP with single basis.]

34 Power Grid – Factored Multiagent Lower is better! [Guestrin, Lagoudakis, Parr ‘02]

35 Summary of Algorithm. 1. Pick local basis functions h_i. 2. The factored LP computes the value function. 3. The coordination graph computes the argmax_a of Q.

36 Planning in Complex Environments. When faced with a complex problem, exploit structure: for planning and for action selection. But given a new problem, must we replan from scratch? A different MDP means a new planning problem, and huge problems remain intractable even with the factored LP.

37 Generalizing to New Problems. Solve Problem 1, Problem 2, ..., Problem n, and obtain a good solution to Problem n+1. The MDPs are different (different sets of states, actions, rewards, transitions, ...), yet many problems are "similar".

38 Generalization with Relational MDPs [Guestrin, Koller, Gearhart, Kanodia '03]. "Similar" domains have similar types of objects; exploit these similarities by computing generalizable value functions. Relational MDPs plus generalization avoid the need to replan and let us tackle larger problems.

39 Relational Models and MDPs. Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, ... Relations: Collects, Builds, Trains, Attacks, ... Instances: Peasant1, Peasant2, Footman1, Enemy1, ...

40 Relational MDPs. Class-level transition probabilities depend on the object's attributes, its actions, and the attributes of related objects; there is also a class-level reward function. For example, a Peasant's attribute P' depends on P, the action A_P, and the Gold attribute G through the Collects relation. This is a very compact representation that does not depend on the number of objects.

41 Tactical Freecraft: Relational Schema. The Enemy class has attributes Health, H', a reward R, and a Count of attackers; the Footman class has attributes Health, H', an action A, and a my_enemy link. An enemy's health depends on the number of footmen attacking it; a footman's health depends on its enemy's health.

42 World is a Large Factored MDP. An instantiation (world) specifies the number of instances of each class and the links between instances, yielding a well-defined factored MDP: Relational MDP + number of objects + links between objects = Factored MDP.

43 World with 2 Footmen and 2 Enemies. [DBN over F1.Health, F1.A, E1.Health, F2.Health, F2.A, E2.Health, their next-step values F1.H', E1.H', F2.H', E2.H', and rewards R1, R2 for Footman1/Enemy1 and Footman2/Enemy2.]

44 World is a Large Factored MDP (continued). Instantiate the world to get a well-defined factored MDP and use the factored LP for planning; but if every new world requires planning from scratch, we have gained nothing!

45 Class-level Value Functions. V^π(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H). Units are interchangeable: V_F1 ≈ V_F2 ≈ V_F and V_E1 ≈ V_E2 ≈ V_E. At a given state x, each footman still makes a different contribution to V. Given the class-level V_C, we can instantiate a value function for any world.
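A small sketch of instantiating class-level value components for a concrete world; the two class-level functions and their numeric values are invented placeholders (in practice they come from the class-level LP on the next slide).

```python
# One value component per class: V_F(F.H, E.H) for footmen, V_E(E.H) for enemies.
def V_F(footman_health, enemy_health):
    return {"high": 3.0, "low": 1.0}[footman_health] - {"high": 2.0, "low": 0.5}[enemy_health]

def V_E(enemy_health):
    return {"high": -1.0, "low": 2.0}[enemy_health]

def value_of_world(pairs):
    """pairs: one (footman_health, enemy_health) tuple per footman/enemy link.
    The same class-level V_F, V_E work for 2 footmen or 10 -- no replanning."""
    return sum(V_F(f_h, e_h) + V_E(e_h) for f_h, e_h in pairs)

print(value_of_world([("high", "low"), ("low", "high")]))   # world with 2 of each
```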

46 Computing Class-level V_C. Solve the LP: minimize Σ_x V(x) subject to V(x) ≥ Q(x,a) for all worlds and all x, a, where V(x) = Σ_C Σ_{o ∈ O[C]} V_C(x[o]) and Q(x,a) = Σ_C Σ_{o ∈ O[C]} Q_C(x[o], a[o]). The constraints for each world are represented by the factored LP, but the number of worlds is exponential or infinite.

47 Sampling Worlds. Many worlds are similar: sample a set I of worlds, and replace the constraints "for all worlds ω, x, a" with "for all ω ∈ I, x, a".

48 Theorem. There are exponentially (even infinitely) many worlds; do we need exponentially many samples? No: with a sufficient number of sampled worlds, the value function is within ε of the class-level solution optimized for all worlds, with probability at least 1-δ, where R_max is the maximum class reward. The proof method is related to [de Farias, Van Roy '02].

49 Learning Classes of Objects. Plan for the sampled worlds separately; objects with similar values belong to the same class; find regularities between worlds. Decision tree regression was used in the experiments.

50 Summary of Algorithm. 1. Model the domain as a relational MDP. 2. Sample a set of worlds. 3. The factored LP computes a class-level value function for the sampled worlds. 4. Reuse the class-level value function in the new world. 5. The coordination graph computes the argmax_a of Q.

51 Experimental Results SysAdmin problem

52 Generalizing to New Problems

53 Learning Classes of Objects

54 Classes of Objects Discovered. Learned 3 classes: server, intermediate, leaf.

55 Strategic. World: 2 peasants, 2 footmen, 1 enemy, gold, wood, barracks; reward for a dead enemy; about 1 million state/action pairs. Algorithm: solve with the factored LP, with the coordination graph for action selection.

56 Strategic. World: 9 peasants, 3 footmen, 1 enemy, gold, wood, barracks; reward for a dead enemy; about 3 trillion state/action pairs. Algorithm: solving with the factored LP and using the coordination graph for action selection grows exponentially in the number of agents.

57 Strategic. World: 9 peasants, 3 footmen, 1 enemy, gold, wood, barracks; reward for a dead enemy; about 3 trillion state/action pairs. Algorithm: use the generalized class-based value function with the coordination graph for action selection; the instantiated Q-functions grow only polynomially in the number of agents.

58 Tactical. Planned in 3 footmen versus 3 enemies; generalized to 4 footmen versus 4 enemies.

59 Contributions Efficient planning with LP decomposition [Guestrin, Koller, Parr ’01] Multiagent action selection [Guestrin, Koller, Parr ’02] Generalization to new environments [Guestrin, Koller, Gearhart, Kanodia ’03] Variable coordination structure [Guestrin, Venkataraman, Koller ’02] Multiagent reinforcement learning [Guestrin, Lagoudakis, Parr ’02] [Guestrin, Patrascu, Schuurmans ’02] Hierarchical decomposition [Guestrin, Gordon ’02]

60 Open Issues: high tree-width problems, basis function selection, variable relational structure, partial observability.

61 Daphne Koller. Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy. Co-authors: M.S. Apaydin, D. Brutlag, F. Cozman, C. Gearhart, G. Gordon, D. Hsu, N. Kanodia, D. Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe, D. Ormoneit, R. Parr, R. Patrascu, D. Schuurmans, C. Varma, S. Venkataraman. DAGS members, Kristina and friends, my family.

62 Conclusions. Exploit structure in the planning problem (factored LP), in action selection (coordination graph), and between problems (generalization) to tackle complex multiagent planning tasks. The result is a formal framework for multiagent planning that scales to very large problems, with an astronomical number of states.

63 Network Management Problem. Topologies: ring, star, ring of rings, k-grid. Computers run processes; each computer has status ∈ {good, dead, faulty}; dead neighbors increase the probability of dying; there is a reward for successful processes. Each SysAdmin takes a local action ∈ {reboot, not reboot}.

64-66 Multiagent Policy Quality. Comparing to the Distributed Reward and Distributed Value Function algorithms of [Schneider et al. '99]. [Plots of policy quality for distributed reward, distributed value, LP with single basis, and LP with pair basis.]

67 Comparing to Apricodd [Boutilier et al.] Apricodd: Exploits context-specific independence (CSI) Factored LP: Exploits CSI and linear independence

68 Apricodd comparison. [Plots for the ring and star topologies.]

