
1 Multi-Agent Planning in Complex Uncertain Environments Daphne Koller Stanford University Joint work with: Carlos Guestrin (CMU) Ronald Parr (Duke)

2 ©2004 – Carlos Guestrin, Daphne Koller Collaborative Multiagent Planning Examples: search and rescue, firefighting; factory management; multi-robot tasks (Robosoccer); network routing; air traffic control; computer game playing Common features: long-term goals, multiple agents, coordinated decisions

3 ©2004 – Carlos Guestrin, Daphne Koller Joint Planning Space Joint action space: each agent i takes action a_i at each step; joint action a = {a_1,…,a_n} for all agents Joint state space: assignment x_1,…,x_n to some set of variables X_1,…,X_n; joint state x = {x_1,…,x_n} of entire system Joint system: payoffs and state dynamics depend on joint state and joint action Cooperative agents: want to maximize total payoff

4 ©2004 – Carlos Guestrin, Daphne Koller Exploiting Structure Real-world problems have: Hundreds of objects Googols of states Real-world problems have structure! Approach: Exploit structured representation to obtain efficient approximate solution

5 ©2004 – Carlos Guestrin, Daphne Koller Outline Action Coordination Factored Value Functions Coordination Graphs Context-Specific Coordination Joint Planning Multi-Agent Markov Decision Processes Efficient Linear Programming Solution Decentralized Market-Based Solution Generalizing to New Environments Relational MDPs Generalizing Value Functions

6 ©2004 – Carlos Guestrin, Daphne Koller One-Shot Optimization Task Q-function Q(x,a) encodes agents’ payoff for joint action a in joint state x Agents’ task: compute the maximizing joint action argmax_a Q(x,a) Obstacles: #actions is exponential; requires complete state observability; requires full agent communication

7 ©2004 – Carlos Guestrin, Daphne Koller Factored Payoff Function Approximate Q function as sum of Q sub-functions Each sub-function depends on a local part of the system: two interacting agents; an agent and an important resource; two inter-dependent pieces of machinery [K. & Parr ’99,’00] [Guestrin, K., Parr ’01] Q(A_1,…,A_4, X_1,…,X_4) ≈ Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4)
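To make the factored form concrete, here is a minimal Python sketch (illustrative only, not the authors' code) of a Q-function stored as a sum of local sub-functions, plus the brute-force joint maximization that the later slides replace with variable elimination; the scope names and payoff tables are assumptions.

```python
from itertools import product

class QSubFunction:
    """One local term Q_j, depending only on a few state variables and agent actions."""
    def __init__(self, state_scope, action_scope, table):
        self.state_scope = state_scope    # e.g. ["X1", "X2"]
        self.action_scope = action_scope  # e.g. ["A1", "A2"]
        self.table = table                # dict: (state values, action values) -> payoff

    def value(self, x, a):
        key = (tuple(x[v] for v in self.state_scope),
               tuple(a[v] for v in self.action_scope))
        return self.table[key]

def factored_q(sub_functions, x, a):
    """Evaluate Q(x, a) as the sum of the local sub-functions."""
    return sum(q.value(x, a) for q in sub_functions)

def naive_argmax(sub_functions, x, agents, actions=("attack", "defend")):
    """Brute-force joint argmax: cost is exponential in the number of agents."""
    best_a, best_v = None, float("-inf")
    for joint in product(actions, repeat=len(agents)):
        a = dict(zip(agents, joint))
        v = factored_q(sub_functions, x, a)
        if v > best_v:
            best_a, best_v = a, v
    return best_a, best_v
```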

8 ©2004 – Carlos Guestrin, Daphne Koller Distributed Q Function Q sub-functions assigned to relevant agents: Q(A_1,…,A_4, X_1,…,X_4) ≈ Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4) [Guestrin, K., Parr ’01]

9 ©2004 – Carlos Guestrin, Daphne Koller Multiagent Action Selection Distributed Q function: Q_1(A_1,A_4,X_1,X_4), Q_2(A_1,A_2,X_1,X_2), Q_3(A_2,A_3,X_2,X_3), Q_4(A_3,A_4,X_3,X_4) Instantiate current state x, then find the maximal joint action argmax_a

10 ©2004 – Carlos Guestrin, Daphne Koller Instantiating State x Limited observability: agent i only observes the variables in Q_i (e.g., the agent holding Q_2(A_1,A_2,X_1,X_2) observes only X_1 and X_2)

11 ©2004 – Carlos Guestrin, Daphne Koller Choosing Action at State x After instantiating the current state x, each sub-function depends only on actions: Q_1(A_1,A_4), Q_2(A_1,A_2), Q_3(A_2,A_3), Q_4(A_3,A_4) Remaining task: compute the maximizing joint action max_a of their sum

12 ©2004 – Carlos Guestrin, Daphne Koller Variable Elimination Use variable elimination for maximization:
max over A_1,A_2,A_3,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)
  = max over A_1,A_2,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + max over A_3 of [ Q_3(A_2,A_3) + Q_4(A_3,A_4) ]
  = max over A_1,A_2,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)
where g_1(A_2,A_4) records the value of the optimal A_3 action for each (A_2,A_4), e.g.:
  A_2 = Attack, A_4 = Attack : 5
  A_2 = Attack, A_4 = Defend : 6
  A_2 = Defend, A_4 = Attack : 8
  A_2 = Defend, A_4 = Defend : 12
Limited communication for optimal action choice Comm. bandwidth = tree-width of coord. graph
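A minimal Python sketch of this max-out elimination (illustrative, not the authors' implementation): each factor stores a table over a small action scope, and eliminating a variable replaces all factors that mention it with one new factor such as g_1. The scopes, action names, and elimination order are assumptions, and the backward pass that recovers the actual argmax is omitted for brevity.

```python
from itertools import product

ACTIONS = ("attack", "defend")

def eliminate(factors, var):
    """Max out `var`: combine every factor mentioning it into one new factor (like g_1)."""
    touching = [f for f in factors if var in f["scope"]]
    rest = [f for f in factors if var not in f["scope"]]
    new_scope = sorted({v for f in touching for v in f["scope"]} - {var})
    table = {}
    for assignment in product(ACTIONS, repeat=len(new_scope)):
        ctx = dict(zip(new_scope, assignment))
        table[assignment] = max(
            sum(f["table"][tuple({**ctx, var: val}[v] for v in f["scope"])]
                for f in touching)
            for val in ACTIONS)
    rest.append({"scope": new_scope, "table": table})
    return rest

def max_joint_value(factors, elimination_order):
    """Eliminate action variables one by one; the leftover constants sum to the max value."""
    for var in elimination_order:
        factors = eliminate(factors, var)
    return sum(f["table"][()] for f in factors)

# Example factors for the four-agent chain on the slide (tables filled arbitrarily):
def uniform_table(scope):
    return {vals: 1.0 for vals in product(ACTIONS, repeat=len(scope))}

factors = [{"scope": ["A1", "A4"], "table": uniform_table(["A1", "A4"])},
           {"scope": ["A1", "A2"], "table": uniform_table(["A1", "A2"])},
           {"scope": ["A2", "A3"], "table": uniform_table(["A2", "A3"])},
           {"scope": ["A3", "A4"], "table": uniform_table(["A3", "A4"])}]
print(max_joint_value(factors, ["A3", "A4", "A2", "A1"]))  # -> 4.0
```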

13 ©2004 – Carlos Guestrin, Daphne Koller Choosing Action at State x
max over A_1,A_2,A_3,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)
  = max over A_1,A_2,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + max over A_3 of [ Q_3(A_2,A_3) + Q_4(A_3,A_4) ]
  = max over A_1,A_2,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)

14 ©2004 – Carlos Guestrin, Daphne Koller Choosing Action at State x Maximizing out A_3 produces the new factor g_1(A_2,A_4):
max over A_1,A_2,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + max over A_3 of [ Q_3(A_2,A_3) + Q_4(A_3,A_4) ]
  = max over A_1,A_2,A_4 of  Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)

15 ©2004 – Carlos Guestrin, Daphne Koller Coordination Graphs Communication follows triangulated graph Computation grows exponentially in tree width Tree width: graph-theoretic measure of “connectedness”; arises in BNs, CSPs, … Cost exponential in worst case, fairly low for many real graphs

16 ©2004 – Carlos Guestrin, Daphne Koller Context-Specific Interactions Payoff structure can vary by context Example: agents A1, A2 both trying to pass through the same narrow corridor X Can use context-specific “value rules”: ⟨At(X,A1) ∧ At(X,A2) ∧ A1 = fwd ∧ A2 = fwd : −100⟩ Hope: Context-specific payoffs will induce context-specific coordination
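As an illustration only (not the authors' implementation), a small Python sketch of how such a value rule could be represented: the rule contributes its value only when its context, a partial assignment over state predicates and actions, holds. The predicate strings and class name are assumptions.

```python
class ValueRule:
    """A context-specific value rule: <context : value>."""
    def __init__(self, context, value):
        self.context = context  # dict over state predicates / agent actions
        self.value = value

    def applies(self, assignment):
        return all(assignment.get(var) == val for var, val in self.context.items())

    def contribution(self, assignment):
        return self.value if self.applies(assignment) else 0.0

# The corridor rule from the slide, <At(X,A1), At(X,A2), A1=fwd, A2=fwd : -100>:
corridor_rule = ValueRule(
    {"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"}, -100.0)

state_action = {"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"}
print(corridor_rule.contribution(state_action))  # -> -100.0
```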

17 ©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination Instantiate current state: x = true

18 ©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination Coordination structure varies based on context

19 ©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination Maximizing out A_1 Rule-based variable elimination [Zhang & Poole ’99] Coordination structure varies based on communication

20 ©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination Eliminate A_1 from the graph Rule-based variable elimination [Zhang & Poole ’99] Coordination structure varies based on agent decisions

21 ©2004 – Carlos Guestrin, Daphne Koller Robot Soccer UvA Trilearn 2002 won German Open 2002, but placed fourth in Robocup-2002. “ … the improvements introduced in UvA Trilearn 2003 … include an extension of the intercept skill, improved passing behavior and especially the usage of coordination graphs to specify the coordination requirements between the different agents.” Kok, Vlassis & Groen University of Amsterdam

22 ©2004 – Carlos Guestrin, Daphne Koller RoboSoccer Value Rules Coordination graph rules include conditions on player role and aspects of global system state Example rules for player i, in role of passer: Depends on distance of j to goal after move

23 ©2004 – Carlos Guestrin, Daphne Koller UvA Trilearn 2003 Results
Round 1: Mainz Rolling Brains (Germany) 4-0; Iranians (Iran) 31-0; Sahand (Iran) 39-0; a4ty (Latvia) 25-0
Round 2: Helios (Iran) 2-1; AT-Humboldt (Germany) 5-0; ZJUBase (China) 6-0; Aria (Iran) 6-0; Hana (Japan) 26-0
Round 3: Zenit-NewERA (Russia) 4-0; RoboSina (Iran) 6-0; Wright Eagle (China) 3-1; Everest (China) 7-1; Aria (Iran) 5-0
Semi-final: Brainstormers (Germany) 4-1
Final: TsinghuAeolus (China) 4-3
Overall goal record: 177-7
UvA Trilearn won German Open 2003, US Open 2003, RoboCup 2003, and German Open 2004

24 ©2004 – Carlos Guestrin, Daphne Koller Outline Action Coordination Factored Value Functions Coordination Graphs Context-Specific Coordination Joint Planning Multi-Agent Markov Decision Processes Efficient Linear Programming Solution Decentralized Market-Based Solution Generalizing to New Environments Relational MDPs Generalizing Value Functions

25 ©2004 – Carlos Guestrin, Daphne Koller Real-time Strategy Game Peasants collect resources and build Footmen attack enemies Buildings train peasants and footmen

26 ©2004 – Carlos Guestrin, Daphne Koller Planning Over Time Markov Decision Process (MDP) representation: Action space: joint agent actions a = {a_1,…,a_n} State space: joint state descriptions x = {x_1,…,x_n} Momentary reward function R(x,a) Probabilistic system dynamics P(x′|x,a)

27 ©2004 – Carlos Guestrin, Daphne Koller Policy Policy: π(x) = a At state x, action a for all agents π(x_0) = both peasants get wood π(x_1) = one peasant gets gold, other builds barrack π(x_2) = peasants get gold, footmen attack

28 ©2004 – Carlos Guestrin, Daphne Koller Value of Policy Value V^π(x): expected long-term reward starting from x Start from x_0: V^π(x_0) = E[ R(x_0) + γ R(x_1) + γ² R(x_2) + γ³ R(x_3) + γ⁴ R(x_4) + ⋯ ] Future rewards discounted by γ ∈ [0,1)

29 ©2004 – Carlos Guestrin, Daphne Koller Optimal Long-term Plan Optimal policy π*(x) and optimal Q-function Q*(x,a) Optimal policy: π*(x) = argmax_a Q*(x,a) Bellman equations: see below
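The Bellman equation itself is an image on the original slide; its standard discounted-MDP form, consistent with the R(x,a), P(x′|x,a), and γ defined above, is:

\[
Q^*(\mathbf{x},\mathbf{a}) \;=\; R(\mathbf{x},\mathbf{a}) \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \, \max_{\mathbf{a}'} Q^*(\mathbf{x}',\mathbf{a}'),
\qquad
\pi^*(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}} Q^*(\mathbf{x},\mathbf{a}).
\]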

30 ©2004 – Carlos Guestrin, Daphne Koller Solving an MDP Many algorithms solve the Bellman equations: Policy iteration [Howard ’60, Bellman ‘57] Value iteration [Bellman ‘57] Linear programming [Manne ’60] … Solving the Bellman equation yields the optimal value V*(x) and optimal policy π*(x)

31 ©2004 – Carlos Guestrin, Daphne Koller LP Solution to MDP One variable V (x) for each state One constraint for each state x and action a Polynomial time solution
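The LP referred to here is an image on the slide; in its standard form (with state-relevance weights α(x) > 0) it has one variable V(x) per state and one constraint per state-action pair:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{\mathbf{x}} \alpha(\mathbf{x}) \, V(\mathbf{x}) \\
\text{subject to} \quad & V(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \, V(\mathbf{x}') \qquad \forall\, \mathbf{x}, \mathbf{a}.
\end{aligned}
\]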

32 ©2004 – Carlos Guestrin, Daphne Koller Are We Done? Planning is polynomial in #states and #actions #states exponential in number of variables #actions exponential in number of agents Efficient approximation by exploiting structure!

33 ©2004 – Carlos Guestrin, Daphne Koller Structured Representation: Factored MDP [Boutilier et al. ’95] State dynamics, decisions, and rewards defined over variables (Peasant, Footman, Enemy, Gold) and actions (A_Peasant, A_Build, A_Footman) across time slices t and t+1, e.g. P(F′ | F, G, A_B, A_F) Complexity of representation: exponential in #parents (worst case)

34 ©2004 – Carlos Guestrin, Daphne Koller Structured Value Function? Does a factored MDP imply structure in V*? Not exactly, but almost: a factored V often provides a good approximate value function

35 ©2004 – Carlos Guestrin, Daphne Koller Structured Value Functions [Bellman et al. ‘63], [Tsitsiklis & Van Roy ‘96], [K. & Parr ’99,’00] Approximate V* as a factored value function V = Σ_i w_i h_i In the rule-based case: h_i is a rule concerning a small part of the system, and w_i is the value associated with the rule Goal: find w giving a good approximation V to V* A factored value function yields a factored Q function Q = Σ_i Q_i, so we can use the coordination graph

36 ©2004 – Carlos Guestrin, Daphne Koller Approximate LP Solution
Minimize:  Σ_x Σ_i w_i h_i(x)
Subject to:  Σ_i w_i h_i(x) ≥ Σ_i Q_i(x,a)   for all x, a
One variable w_i for each basis function: polynomial number of LP variables
One constraint for every state and action: exponentially many LP constraints

37 ©2004 – Carlos Guestrin, Daphne Koller So What Now? The exponentially many linear constraints are equivalent to one nonlinear constraint (written out below) [Guestrin, K., Parr ’01]
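Written out (the equation is an image on the slide; this reconstruction follows the approximate-LP constraint above):

\[
\sum_i w_i h_i(\mathbf{x}) \;\ge\; \sum_i Q_i(\mathbf{x},\mathbf{a}) \;\;\; \forall\, \mathbf{x},\mathbf{a}
\quad\Longleftrightarrow\quad
0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[ \sum_i Q_i(\mathbf{x},\mathbf{a}) - \sum_i w_i h_i(\mathbf{x}) \Big].
\]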

38 ©2004 – Carlos Guestrin, Daphne Koller Variable Elimination Revisited Use Variable Elimination to represent constraints: Exponentially fewer constraints [Guestrin, K., Parr ’01] Polynomial LP for finding good factored approximation to V*

39 ©2004 – Carlos Guestrin, Daphne Koller Network Management Problem Topologies: Ring, Star, Ring of Rings, k-grid Each computer runs processes; computer status ∈ {good, dead, faulty} Dead neighbors increase dying probability Reward for successful processes Each SysAdmin takes a local action ∈ {reboot, not reboot}

40 ©2004 – Carlos Guestrin, Daphne Koller Scaling of Factored LP Number of constraints: explicit LP 2^n; factored LP (n+1−k)·2^k, where k = tree-width [Plot: number of constraints vs. number of variables, explicit LP vs. factored LP with k = 3, 5, 8, 10, 12]

41 ©2004 – Carlos Guestrin, Daphne Koller Multiagent Running Time [Plot: running time for Star topology (single basis), Star topology (pair basis), and Ring of rings]

42 ©2004 – Carlos Guestrin, Daphne Koller Strategic 2x2 Factored MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs) Offline: factored LP computes value function Q Online: given world state x, coordination graph computes argmax_a Q(x,a)

43 ©2004 – Carlos Guestrin, Daphne Koller Demo: Strategic 2x2 Guestrin, Koller, Gearhart & Kanodia

44 ©2004 – Carlos Guestrin, Daphne Koller Limited Interaction MDPs Some MDPs have additional structure: Agents are largely autonomous Interact in limited ways, e.g., competing for resources Can decompose MDP as a set of agent-based MDPs, with a limited interface [Guestrin & Gordon, ’02]

45 ©2004 – Carlos Guestrin, Daphne Koller Limited Interaction MDPs In such MDPs, our LP matrix is highly structured Can use Dantzig-Wolfe LP decomposition to solve LP optimally, in a decentralized way Gives rise to a market-like algorithm with multiple agents and a centralized “auctioneer” [Guestrin & Gordon, ’02]

46 ©2004 – Carlos Guestrin, Daphne Koller Auction-style Planning Each agent solves its local (stand-alone) MDP Agents send ‘constraint messages’ to the auctioneer: they must agree on the “policy” for shared variables Auctioneer sends ‘pricing messages’ to agents: pricing reflects penalties for constraint violations and influences agents’ rewards in their MDPs The auctioneer sets pricing based on conflicts while the agents re-plan [Guestrin & Gordon, ’02]
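A heavily simplified, illustrative Python sketch of this message loop (not the Dantzig-Wolfe implementation from the paper): the auctioneer adjusts prices on shared variables based on disagreement between the agents' local plans; the agent interface, the numeric proposals, and the price-update rule are all assumptions.

```python
def auction_planning(agents, shared_vars, rounds=50, step=0.1):
    """Toy coordination loop: agents plan locally given prices; the auctioneer
    nudges prices on shared variables until the agents' proposals agree."""
    prices = {v: 0.0 for v in shared_vars}
    proposals = {}
    for _ in range(rounds):
        # Each agent solves its own MDP given current prices and reports its
        # proposed (numeric) setting for every shared variable it touches.
        proposals = {agent.name: agent.solve_local_mdp(prices) for agent in agents}
        # Auctioneer: measure disagreement on each shared variable...
        conflict = {}
        for v in shared_vars:
            values = [p[v] for p in proposals.values() if v in p]
            conflict[v] = (max(values) - min(values)) if values else 0.0
        if all(c == 0.0 for c in conflict.values()):
            break  # agents agree on the shared "policy"
        # ...and raise the price (penalty) wherever proposals conflict.
        for v, c in conflict.items():
            prices[v] += step * c
    return proposals, prices
```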

47 ©2004 – Carlos Guestrin, Daphne Koller Fuel Allocation Problem UAVs share a pot of fuel Targets have varying priority Ignore target interference Bererton, Gordon, Thrun & Khosla

48 ©2004 – Carlos Guestrin, Daphne Koller [Bererton, Gordon, Thrun, & Khosla, ’03] Fuel Allocation Problem

49 ©2004 – Carlos Guestrin, Daphne Koller High-Speed Robot Paintball Bererton, Gordon & Thrun

50 ©2004 – Carlos Guestrin, Daphne Koller High-Speed Robot Paintball [Maps for Game variant 1 and Game variant 2, showing the coordination point and sensor placement; x = start location, + = goal location]

51 ©2004 – Carlos Guestrin, Daphne Koller High-Speed Robot Paintball Bererton, Gordon & Thrun

52 ©2004 – Carlos Guestrin, Daphne Koller Outline Action Coordination Factored Value Functions Coordination Graphs Context-Specific Coordination Joint Planning Multi-Agent Markov Decision Processes Efficient Linear Programming Solution Decentralized Market-Based Solution Generalizing to New Environments Relational MDPs Generalizing Value Functions

53 ©2004 – Carlos Guestrin, Daphne Koller Generalizing to New Problems Solve Problem 1, Solve Problem 2, …, Solve Problem n ⇒ good solution to Problem n+1 The MDPs are different! Different sets of states, actions, rewards, transitions, … But many problems are “similar”

54 ©2004 – Carlos Guestrin, Daphne Koller Generalizing with Relational MDPs Avoid need to replan Tackle larger problems “Similar” domains have similar “types” of objects Exploit similarities by computing generalizable value functions Relational MDP Generalization

55 ©2004 – Carlos Guestrin, Daphne Koller Relational Models and MDPs Classes: Peasant, Footman, Gold, Barracks, Enemy… Relations Collects, Builds, Trains, Attacks… Instances Peasant1, Peasant2, Footman1, Enemy1… Builds on Probabilistic Relational Models [K. & Pfeffer ‘98] [Guestrin, K., Gearhart & Kanodia ‘03]

56 ©2004 – Carlos Guestrin, Daphne Koller Relational MDPs [Guestrin, K., Gearhart & Kanodia ‘03] Very compact representation! Does not depend on # of objects Class-level transition probabilities depend on: attributes, actions, and attributes of related objects (e.g., a Footman’s next Health H′ depends on its Health, its action A, and the Health of its my_enemy Enemy) Class-level reward function
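As an illustration only (the value names, probabilities, and function names below are made up, not from the model in the talk), a Python sketch of how one class-level transition template can be shared by every instance of a class:

```python
import random

# One class-level transition template, shared by all Footman instances:
# a footman's next Health depends on its own Health, its action, and the
# Health of its linked enemy (the my_enemy relation).
def footman_health_transition(health, action, enemy_health):
    """Illustrative class-level P(H' | Health, A, my_enemy.Health), as a sampler."""
    if health == "dead":
        return "dead"
    p_hit = 0.6 if enemy_health == "healthy" else 0.2
    if action == "attack":
        p_hit += 0.1  # attacking exposes the footman slightly more (made-up numbers)
    return "wounded" if random.random() < p_hit else health

# The same template is instantiated for every footman in the world:
footmen = {"Footman1": ("healthy", "attack", "healthy"),
           "Footman2": ("wounded", "retreat", "dead")}
next_health = {name: footman_health_transition(*args) for name, args in footmen.items()}
print(next_health)
```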

57 ©2004 – Carlos Guestrin, Daphne Koller World is a Large Factored MDP Instantiation (world): # instances of each class, plus links between instances Relational MDP + # of objects + links between objects ⇒ a well-defined factored MDP

58 ©2004 – Carlos Guestrin, Daphne Koller MDP with 2 Footmen and 2 Enemies [DBN over F1.Health, F1.A, E1.Health, F2.Health, F2.A, E2.Health and their next-step values F1.H′, E1.H′, F2.H′, E2.H′, with rewards R1 and R2]

59 ©2004 – Carlos Guestrin, Daphne Koller World is a Large Factored MDP Instantiate world ⇒ well-defined factored MDP Use factored LP for planning But so far we have gained nothing!

60 ©2004 – Carlos Guestrin, Daphne Koller Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H)
Units are interchangeable: V_F1 ≈ V_F2 ≈ V_F and V_E1 ≈ V_E2 ≈ V_E At state x, each footman has a different contribution to V Given the class-level weights w_C, we can instantiate the value function for any world
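In general form (a reconstruction consistent with the slide; ω denotes a particular world and x[o] the state of object o):

\[
V_\omega(\mathbf{x}) \;=\; \sum_{C} \;\; \sum_{o \,\in\, \text{objects of class } C \text{ in } \omega} V_C\big(\mathbf{x}[o]\big),
\qquad
V_C \;=\; \sum_i w^C_i \, h^C_i .
\]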

61 ©2004 – Carlos Guestrin, Daphne Koller Factored LP-based Generalization Sample a set of small worlds (e.g., Footman1 vs. Enemy1, Footman2 vs. Enemy2) Class-level factored LP computes V_F and V_E Generalize to larger worlds (e.g., three footmen vs. three enemies) How many samples?

62 ©2004 – Carlos Guestrin, Daphne Koller Sampling Complexity Exponentially many worlds ⇒ need exponentially many samples? # objects in a world is unbounded ⇒ must sample very large worlds? NO!

63 ©2004 – Carlos Guestrin, Daphne Koller Theorem Sample m small worlds of up to O(ln 1/ ) objects; then the value function is within O(ε) of the class-level value function optimized for all worlds, with probability at least 1−δ. R^c_max is the maximum class reward. m =

64 ©2004 – Carlos Guestrin, Daphne Koller Strategic 2x2 Relational MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs) Offline: factored LP computes value function Q Online: given world state x, coordination graph computes argmax_a Q(x,a)

65 ©2004 – Carlos Guestrin, Daphne Koller Strategic 9x3 Relational MDP model: 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs; grows exponentially in # agents) Offline: factored LP computes value function Q Online: given world state x, coordination graph computes argmax_a Q(x,a)

66 ©2004 – Carlos Guestrin, Daphne Koller Strategic Generalization Relational MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs) Offline: factored LP computes class-level value function weights w_C Online: in a world with 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs), coordination graph computes argmax_a Q_ω(x,a) Instantiated Q-functions grow polynomially in # agents

67 ©2004 – Carlos Guestrin, Daphne Koller Demo: Generalized 9x3 Guestrin, Koller, Gearhart & Kanodia

68 ©2004 – Carlos Guestrin, Daphne Koller Tactical Generalization Planned in 3 Footmen versus 3 Enemies Generalized to 4 Footmen versus 4 Enemies

69 ©2004 – Carlos Guestrin, Daphne Koller Demo: Planned Tactical 3x3 Guestrin, Koller, Gearhart & Kanodia

70 ©2004 – Carlos Guestrin, Daphne Koller Demo: Generalized Tactical 4x4 [Guestrin, K., Gearhart & Kanodia ‘03] Guestrin, Koller, Gearhart & Kanodia

71 ©2004 – Carlos Guestrin, Daphne Koller Summary Structured Multi-Agent MDPs Effective planning under uncertainty Distributed coordinated action selection Generalization to new problems

72 ©2004 – Carlos Guestrin, Daphne Koller Important Questions Continuous spaces Partial observability Complex actions Learning to act How far can we go??

73 http://robotics.stanford.edu/~koller Carlos Guestrin Ronald Parr Chris Gearhart Neal Kanodia Shobha Venkataraman Curt Bererton Geoff Gordon Sebastian Thrun Jelle Kok Matthijs Spaan Nikos Vlassis

