Generalizing Plans to New Environments in Relational MDPs


1 Generalizing Plans to New Environments in Relational MDPs
Carlos Guestrin, Daphne Koller, Chris Gearhart, Neal Kanodia
Stanford University

2 Collaborative Multiagent Planning
Long-term goals, multiple agents, coordinated decisions.
Example domains: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.

3 Real-time Strategy Game
Peasants collect resources and build.
Footmen attack enemies.
Buildings train peasants and footmen.
[Figure: game screenshot labeling a peasant, a footman, and a building.]

4 Structure in Representation: Factored MDP
[Boutilier et al. '95]
[Figure: two-slice dynamic Bayesian network over time steps t and t+1, with state variables Peasant, Footman, Enemy, Gold; decision variables A_Peasant, A_Footman; next-state variables P', F', E', G'; and reward node R, grouped into state, dynamics, decisions, and rewards.]
Each next-state variable depends on only a few parents, e.g., P(F' | F, G, A_F).
The # of states is exponential and the # of actions is exponential, so the exact solution is intractable.
Complexity of representation: exponential in # parents (worst case).
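For concreteness, a minimal Python sketch of one factored transition model; the health and gold domain values are assumptions for illustration, not from the talk:

```python
# P(F' | F, G, A_F): the footman's next health depends only on its parents
# (its health F, the gold G, and its action A_F), no matter how many other
# state variables the world has. Table size is |F| * |G| * |A_F|, not 2^n.
P_F_next = {
    ("healthy", "have_gold", "attack"): {"healthy": 0.8, "wounded": 0.2},
    ("healthy", "no_gold",   "attack"): {"healthy": 0.6, "wounded": 0.4},
    ("wounded", "have_gold", "wait"):   {"healthy": 0.5, "wounded": 0.5},
    # ... one row per parent assignment
}

def transition_prob(f_next, f, g, a_f):
    """Probability of the footman's next health, given only its parents."""
    return P_F_next[(f, g, a_f)].get(f_next, 0.0)

print(transition_prob("wounded", "healthy", "no_gold", "attack"))  # 0.4
```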

5 Structured Value Functions
Linear combination of restricted-domain functions [Bellman et al. '63, Tsitsiklis & Van Roy '96, Koller & Parr '99,'00, Guestrin et al. '01]:
$\tilde{V}(x) = \sum_i w_i h_i(x)$
Each h_i is the status of a small part of a complex system:
- state of footman and enemy;
- status of barracks;
- status of barracks and state of footman.
Structured V gives structured Q: $\tilde{Q}(x,a) = \sum_i Q_i$, where each Q_i depends on a small number of A_i's and X_j's; e.g., V_i(Footman) yields Q_i(Footman, Gold, A_Footman).
Must find weights w giving a good approximate value function.
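As a toy illustration (the basis functions, state encoding, and weights below are assumptions, not the talk's):

```python
def h_footman(x):   # looks only at one footman/enemy pair
    return 1.0 if (x["F.health"], x["E.health"]) == ("healthy", "wounded") else 0.0

def h_barracks(x):  # looks only at the barracks
    return 1.0 if x["B.status"] == "built" else 0.0

basis = [h_footman, h_barracks]
w = [2.5, 1.0]      # weights would come from the planner (the LP on the next slide)

def V(x):
    """Approximate value: V(x) = sum_i w_i * h_i(x)."""
    return sum(w_i * h_i(x) for w_i, h_i in zip(w, basis))

print(V({"F.health": "healthy", "E.health": "wounded", "B.status": "built"}))  # 3.5
```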

6 Approximate LP Solution
[Schweitzer and Seidmann '85]
Minimize: $\sum_x V_w(x)$, where $V_w(x) = \sum_o w_o V_o(x)$
Subject to: $V_w(x) \ge Q_w(x, a)$ for all $x, a$
One variable w_o per basis function, so a polynomial number of LP variables.
One constraint for every state and action, so exponentially many LP constraints.
Efficient LP decomposition [Guestrin et al. '01]: the functions depend on small sets of variables, giving a polynomial-time solution.
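To make the LP concrete, a hedged sketch using SciPy on a tiny explicit two-state MDP with indicator basis functions (all numbers are toy assumptions); it encodes one constraint per state-action pair:

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
states, actions = [0, 1], [0, 1]
R = np.array([[0.0, 1.0],                    # R[x, a]
              [1.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[x, a, x']
              [[0.5, 0.5], [0.1, 0.9]]])
H = np.array([[1.0, 0.0],                    # H[x, i] = h_i(x); indicator basis
              [0.0, 1.0]])

c = H.sum(axis=0)                            # objective: sum_x V(x)
A_ub, b_ub = [], []
for x in states:
    for a in actions:
        # V(x) >= R(x,a) + gamma * E[V(x')] becomes a linear row in w:
        A_ub.append(-(H[x] - gamma * P[x, a] @ H))
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * H.shape[1])
print("weights w:", res.x)                   # V(x) = H @ w approximates V*
```

With indicator bases this reduces to the exact LP; the factored-LP decomposition avoids ever enumerating the exponentially many rows.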

7 Summary of Multiagent Algorithm
[Guestrin et al. '01, '02]
Offline:
- Model the world as a factored MDP.
- Select basis functions h_i.
- Factored LP computes the value function: weights w and components Q_o.
Online:
- Observe the real-world state x; the coordination graph computes argmax_a Q(x,a); execute the chosen action.
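The online argmax can be sketched as variable elimination over the coordination graph; the two-agent structure and payoff numbers below are illustrative assumptions:

```python
ACTIONS = ["attack", "wait"]

def q1(a1):       return {"attack": 2.0, "wait": 0.0}[a1]          # agent 1 alone
def q2(a2):       return {"attack": -1.0, "wait": 0.5}[a2]         # agent 2 alone
def q12(a1, a2):  return 1.5 if (a1, a2) == ("attack", "attack") else 0.0  # joint term

# Eliminate agent 2: for each a1, agent 2's best-response value and action.
msg = {a1: max((q12(a1, a2) + q2(a2), a2) for a2 in ACTIONS) for a1 in ACTIONS}
# Maximize over agent 1, then back-substitute agent 2's best response.
best_a1 = max(ACTIONS, key=lambda a1: q1(a1) + msg[a1][0])
best_a2 = msg[best_a1][1]
print(best_a1, best_a2)   # coordinated joint action without enumerating all pairs
```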

8 Planning in Complex Environments
When faced with a complex problem, exploit structure:
- for planning;
- for action selection.
Given a new problem, we must replan from scratch: it is a different MDP, hence a new planning problem.
Huge problems are intractable, even with the factored LP.

9 Generalizing to New Problems
Many problems are "similar": solve Problem 1, solve Problem 2, ..., solve Problem n, and we want a good solution to Problem n+1.
But the MDPs are different! Different sets of states, actions, rewards, transitions, ...

10 Generalization with Relational MDPs
"Similar" domains have similar types of objects → relational MDP.
Exploit similarities by computing generalizable value functions.
Generalization: avoid the need to replan; tackle larger problems.

11 Relational Models and MDPs
Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, ...
Relations: Collects, Builds, Trains, Attacks, ...
Instances: Peasant1, Peasant2, Footman1, Enemy1, ...
Builds on Probabilistic Relational Models [Koller, Pfeffer '98].
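A minimal sketch of how such a schema might be represented; the class, relation, and instance names come from the slide, while the data structures themselves are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RMDPSchema:
    classes: list[str]                      # class names
    relations: list[tuple[str, str, str]]   # (class, relation, class) triples

@dataclass
class World:
    instances: dict[str, list[str]]         # class name -> object names
    links: list[tuple[str, str, str]]       # ground relations between instances

schema = RMDPSchema(
    classes=["Peasant", "Gold", "Wood", "Barracks", "Footman", "Enemy"],
    relations=[("Peasant", "Collects", "Gold"),
               ("Barracks", "Trains", "Footman"),
               ("Footman", "Attacks", "Enemy")],
)
world = World(
    instances={"Peasant": ["Peasant1", "Peasant2"],
               "Footman": ["Footman1"], "Enemy": ["Enemy1"]},
    links=[("Footman1", "Attacks", "Enemy1")],
)
```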

12 Relational MDPs
[Figure: class-level DBN fragment linking a Footman's Health and action A_Footman to the Health of its my_enemy Enemy, with reward R and a Count aggregate.]
Class-level transition probabilities depend on: attributes, actions, and attributes of related objects.
Class-level reward function.
Very compact representation! Does not depend on # of objects.
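As an illustration of a class-level transition model (the health values and probabilities are assumptions), one function serves every Footman instance in every world:

```python
def footman_health_dynamics(health, action, enemy_health):
    """P(Health' | Health, A_Footman, my_enemy.Health) for the Footman class."""
    if health == "dead":
        return {"dead": 1.0}
    if action == "attack" and enemy_health != "dead":
        return {"healthy": 0.5, "wounded": 0.3, "dead": 0.2}  # combat is risky
    return {health: 1.0}                                      # otherwise unchanged

# Shared by Footman1, Footman2, ... in every world: the model's size does
# not depend on how many footmen exist.
```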

13 World is a Large Factored MDP
Relational MDP + # of objects + links between objects → factored MDP.
An instantiation (world) fixes the # of instances of each class and the links between instances, yielding a well-defined factored MDP.
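A hedged sketch of instantiation: given a world's links, stamp out ground variables and wire their parents according to the class-level model (the footman/enemy dependency structure here is assumed for illustration):

```python
def ground(world_links):
    """world_links: footman object -> its enemy object.
    Returns the DBN structure (variable -> parents) of the factored MDP."""
    dbn = {}
    for obj, enemy in world_links.items():
        # Footman health depends on own health, own action, enemy health.
        dbn[f"{obj}.H'"] = [f"{obj}.H", f"{obj}.A", f"{enemy}.H"]
        # Enemy health depends on own health and the attacking footman.
        dbn[f"{enemy}.H'"] = [f"{enemy}.H", f"{obj}.H"]
    return dbn

print(ground({"Footman1": "Enemy1", "Footman2": "Enemy2"}))
```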

14 World with 2 Footmen and 2 Enemies
[Figure: ground DBN with F1.Health, F1.A, F1.H', E1.Health, E1.H', and reward R1 for the Footman1/Enemy1 pair, and symmetrically F2.Health, F2.A, F2.H', E2.Health, E2.H', R2 for Footman2/Enemy2.]

15 World is a Large Factored MDP
Relational MDP + # of objects + links between objects → factored MDP.
Instantiate the world → well-defined factored MDP → use the factored LP for planning.
But then we have gained nothing: each new world still means solving a new factored MDP from scratch!

16 Class-level Value Functions
For the 2-vs-2 world:
$V(F1.H, E1.H, F2.H, E2.H) = V_{F1}(F1.H, E1.H) + V_{E1}(E1.H) + V_{F2}(F2.H, E2.H) + V_{E2}(E2.H)$
Units are interchangeable: $V_{F1} \approx V_{F2} \approx V_F$ and $V_{E1} \approx V_{E2} \approx V_E$, so one component per class suffices.
At state x, each footman still makes a different contribution to V.
Given the class-level weights w_C, we can instantiate the value function for any world.
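A sketch of instantiating a class-level value function in an arbitrary world; the value tables are toy assumptions:

```python
# Class-level components, shared by every footman and every enemy:
V_F = {("healthy", "wounded"): 3.0, ("healthy", "healthy"): 1.0,
       ("wounded", "healthy"): -1.0, ("wounded", "wounded"): 0.5}
V_E = {"healthy": -2.0, "wounded": 1.0}

def value(world, x):
    """world: list of (footman, its enemy) pairs; x: health of each unit."""
    total = 0.0
    for f, e in world:
        total += V_F[(x[f], x[e])]   # this footman's contribution at state x
        total += V_E[x[e]]           # this enemy's contribution
    return total

# The same V_F, V_E instantiate in a 2-vs-2 world, a 3-vs-3 world, etc.
x = {"F1": "healthy", "E1": "wounded", "F2": "wounded", "E2": "healthy"}
print(value([("F1", "E1"), ("F2", "E2")], x))  # 1.0
```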

17 Computing Class-level VC
å Î C o V ) ( ] [ x w å x ) ( V w C : minimize å Î C o V ) ( ] [ x w w C å Î C o Q ) , ( ] [ a x w w C : subject to î í ì ) ( x V ) , ( x a Q , " a x Constraints for each world represented efficient by factored LP  Number of worlds exponential or infinite 
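The structural point, a single shared weight vector w_C appearing in every world's constraints, can be sketched as below; the per-world constraint generator is a random stand-in for the factored-LP decomposition, not the real construction:

```python
import numpy as np
from scipy.optimize import linprog

def world_constraints(world_id, w_dim):
    """Stand-in for the factored-LP constraint block of one sampled world:
    each row encodes V(x) >= Q(x, a) for some (x, a) in that world."""
    rng = np.random.default_rng(world_id)
    U = rng.uniform(0.1, 1.0, size=(4, w_dim))   # placeholder coefficients
    r = rng.uniform(0.0, 1.0, size=4)            # placeholder targets
    return -U, -r                                # as A_ub w <= b_ub rows

w_dim, sampled_worlds = 3, [0, 1, 2]
blocks = [world_constraints(i, w_dim) for i in sampled_worlds]
A_ub = np.vstack([A for A, _ in blocks])         # constraints stack up per world...
b_ub = np.concatenate([b for _, b in blocks])
c = np.ones(w_dim)                               # ...but the variables w_C do not
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * w_dim)
print("class-level weights w_C:", res.x)
```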

18 Sampling Worlds
Many worlds are similar.
Instead of constraints for all worlds ($\forall \omega, \forall x, a$), sample a set I of worlds and keep only their constraints ($\forall \omega \in I, \forall x, a$).
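Sampling the set I might look like this sketch; the world distribution and class mix are assumptions:

```python
import random

def sample_world(rng):
    """Draw instance counts per class; here each footman is paired with one enemy."""
    n = rng.randint(1, 5)
    return {"Footman": n, "Enemy": n, "Peasant": rng.randint(1, 9)}

rng = random.Random(0)
I = [sample_world(rng) for _ in range(10)]   # the sampled set I of worlds
print(I[0])
```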

19 Factored LP-based Generalization
[Figure: sampled worlds (F1/E1, F2/E2, F3/E3) feed the class-level factored LP, which outputs V_F and V_E; these generalize to new worlds.]
How many samples do we need?

20 Complexity of Sampling Worlds
Exponentially many worlds → do we need exponentially many samples?
The # of objects in a world is unbounded → must we apply the LP decomposition to very large worlds?
NO!

21 (Improved) Theorem
Sample m small worlds, each with up to O( ln 1/ ) objects.
The resulting value function is within O(ε) of the class-level solution optimized for all worlds, with probability at least 1 - δ. (R_Cmax is the maximum class reward.)

22 Learning Subclasses of Objects
[Figure: two sampled worlds with per-object value components V1 and V2 plotted over objects 1-5.]
Find regularities between worlds: objects with similar values belong to the same subclass.
Plan for the sampled worlds separately, then group the objects.
Used decision-tree regression in the experiments.
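The slide names decision-tree regression; here is a hedged scikit-learn sketch on illustrative toy data (the object features and value components are made up) that groups objects by the leaf they fall into:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per object across the separately-planned sampled worlds,
# e.g., features = [# neighbors, runs a server process]; targets are the
# per-object value components found by planning.
X = np.array([[1, 0], [1, 0], [3, 0], [3, 0], [3, 1], [3, 1]])
v = np.array([1.0, 1.1, 2.0, 2.1, 5.0, 5.2])

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, v)
print(tree.apply(X))   # leaf index per object: the discovered subclass labels
```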

23 Summary of Generalization Algorithm
Offline:
- Build the relational MDP model; sample worlds I; learn class definitions C.
- Factored LP computes the class-level value function w_C.
Online, in a new world:
- Observe the real-world state x; the coordination graph computes argmax_a Q(x,a); execute the chosen action.

24 Experimental Results
SysAdmin problem.

25 Generalizing to New Problems

26 Classes of Objects Discovered
Learned 3 classes: Leaf, Intermediate, Server.

27 Learning Classes of Objects

28 Strategic / Tactical

29 Strategic 2x2
Offline: relational MDP model with 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); factored LP computes the value function Q_o.
Online: coordination graph computes argmax_a Q(x,a) from the world state x.

30 Strategic 9x3
Offline: relational MDP model with 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs; grows exponentially in # agents); factored LP computes the value function Q_o.
Online: coordination graph computes argmax_a Q(x,a) from the world state x.

31 Strategic - Generalization
Offline: factored LP computes the class-level value function w_C on the world with 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs).
Generalize to 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs): the instantiated Q-functions grow only polynomially in # agents.
Online: coordination graph computes argmax_a Q(x,a) from the world state x.

32 Tactical
Planned in 3 Footmen versus 3 Enemies; generalized to 4 Footmen versus 4 Enemies.

33 Conclusions
- Relational MDP representation.
- Class-level value functions: an efficient linear program optimizes over sampled environments; a theorem gives polynomial sample complexity; generalizes from small to large problems.
- Learning subclass definitions.
- Generalization of value functions to new worlds: avoid replanning; tackle larger worlds.

