Multiagent Planning with Factored MDPs
Carlos Guestrin, Daphne Koller (Stanford University), Ronald Parr (Duke University)

Multiagent Coordination Examples
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
These domains involve multiple, simultaneous decisions under limited observability and limited communication.

Network Management Problem
Administrators must coordinate to maximize the global reward.
[Figure: ring of machines M_1 to M_4. Each machine i has status S_i, load L_i, action A_i, and reward R_i, received when a process terminates successfully; primed variables S_i', L_i' denote time t+1, and each machine's dynamics depend on its neighboring machines.]

Joint Decision Space
Represent as an MDP:
- Action space: joint action a = {a_1, ..., a_n} for all agents
- State space: joint state x of the entire system
- Reward function: total reward r
The action space is exponential in the number of agents, the state space is exponential in the number of variables, and a global decision requires complete observation.
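
A minimal sketch (hypothetical binary machine statuses and reboot/no-op actions, not from the slides) of how quickly this flat joint representation grows:

```python
# With n binary status variables and n agents choosing between 2 actions each,
# the joint state and joint action spaces both grow as 2^n.
from itertools import product

n_machines = 4                         # hypothetical small instance
statuses = [0, 1]                      # 0 = faulty, 1 = working
actions = [0, 1]                       # 0 = do nothing, 1 = reboot

joint_states = list(product(statuses, repeat=n_machines))
joint_actions = list(product(actions, repeat=n_machines))

print(len(joint_states), len(joint_actions))   # 16 16  -> 2^n each
print(len(joint_states) * len(joint_actions))  # 256 state-action pairs already
```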

Long-term Utilities
- One-step utility: SysAdmin A_i receives a reward ($) if its process completes.
- Total utility: sum of rewards.
- Optimal action requires long-term planning.
- Long-term utility Q(x,a): expected reward, given current state x and action a.
- Optimal action at state x is a* = argmax_a Q(x,a).
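
A minimal sketch of the flat action choice argmax_a Q(x,a), with an invented toy Q over joint actions; enumerating all 2^n joint actions is exactly the cost the coordination graph on the later slides avoids:

```python
from itertools import product

def greedy_joint_action(Q, x, n_agents):
    """Return argmax over all joint actions of Q(x, a) by brute force."""
    best_a, best_q = None, float("-inf")
    for a in product([0, 1], repeat=n_agents):     # 2^n joint actions
        q = Q(x, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

def toy_Q(x, a):
    # invented utility: reward for rebooting (a_i = 1) exactly the faulty machines (x_i = 0)
    return sum(1.0 if a[i] == 1 - x[i] else -0.5 for i in range(len(x)))

print(greedy_joint_action(toy_Q, x=(1, 0, 1, 0), n_agents=4))   # ((0, 1, 0, 1), 4.0)
```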

Local Q Function Approximation
Q(A_1,...,A_4, X_1,...,X_4) ≈ Q_1(A_1,A_4, X_1,X_4) + Q_2(A_1,A_2, X_1,X_2) + Q_3(A_2,A_3, X_2,X_3) + Q_4(A_3,A_4, X_3,X_4)
Each local Q_i is associated with one agent (e.g., Q_3 with Agent 3).
Limited observability: agent i only observes the variables in Q_i (Agent 3 observes only X_2 and X_3).
Must choose the joint action that maximizes Σ_i Q_i.

Maximizing Σ_i Q_i: Coordination Graph
Use variable elimination for the maximization [Bertele & Brioschi '72]:
  max_{A_1,...,A_4} [ Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4) ]
    = max_{A_1,A_2,A_3} [ Q_2(A_1,A_2) + Q_3(A_2,A_3) + max_{A_4} ( Q_1(A_1,A_4) + Q_4(A_3,A_4) ) ]
    = max_{A_1,A_2,A_3} [ Q_2(A_1,A_2) + Q_3(A_2,A_3) + g(A_1,A_3) ]
[Figure: coordination graph with nodes A_1, A_2, A_3, A_4 connected in a ring.]
Here we need only 23 instead of 63 sum operations.
- Limited communication for optimal action choice.
- Communication bandwidth = induced width of the coordination graph.
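
A minimal sketch of this variable-elimination step, with invented random local Q_i tables over binary actions; it eliminates A_4 first, exactly as above, and checks the result against brute force:

```python
from itertools import product
import random

random.seed(0)
A = [0, 1]  # binary action for each agent

# hypothetical local Q tables, indexed by the pair of actions they depend on
Q1 = {(a1, a4): random.random() for a1 in A for a4 in A}
Q2 = {(a1, a2): random.random() for a1 in A for a2 in A}
Q3 = {(a2, a3): random.random() for a2 in A for a3 in A}
Q4 = {(a3, a4): random.random() for a3 in A for a4 in A}

# Step 1: eliminate A4 -> new factor g(A1, A3) and the maximizing A4 choice.
g, best_a4 = {}, {}
for a1, a3 in product(A, A):
    vals = [(Q1[a1, a4] + Q4[a3, a4], a4) for a4 in A]
    g[a1, a3], best_a4[a1, a3] = max(vals)

# Step 2: maximize the remaining (small) function of A1, A2, A3.
val, (a1, a2, a3) = max(
    (Q2[a1, a2] + Q3[a2, a3] + g[a1, a3], (a1, a2, a3))
    for a1, a2, a3 in product(A, A, A)
)
a4 = best_a4[a1, a3]
print("VE optimum:", val, (a1, a2, a3, a4))

# Sanity check against brute force over all 2^4 joint actions.
brute = max(Q1[a1, a4] + Q2[a1, a2] + Q3[a2, a3] + Q4[a3, a4]
            for a1, a2, a3, a4 in product(A, A, A, A))
assert abs(brute - val) < 1e-12
```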

Where do the Q_i come from?
Use function approximation to find the Q_i:
Q(X_1,...,X_4, A_1,...,A_4) ≈ Q_1(A_1,A_4, X_1,X_4) + Q_2(A_1,A_2, X_1,X_2) + Q_3(A_2,A_3, X_2,X_3) + Q_4(A_3,A_4, X_3,X_4)
Long-term planning requires solving a Markov Decision Process with exponentially many states and exponentially many actions. Efficient approximation is possible by exploiting structure!

Dynamic Decision Diagram
[Figure: two-time-slice dynamic decision network for machines M_1 to M_4, with state variables X_1,...,X_4, decisions A_1,...,A_4, rewards R_1,...,R_4, and next-step variables X_1',...,X_4'. State dynamics are factored, e.g. P(X_1' | X_1, X_4, A_1).]
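
A minimal sketch (invented probabilities) of one factored transition, e.g. P(X_1' | X_1, X_4, A_1), showing that each conditional probability table touches only a machine, its neighbor, and its own agent's action:

```python
def p_x1_next(x1, x4, a1):
    """Hypothetical CPT: probability that machine 1 is working at the next step."""
    if a1 == 1:              # rebooting machine 1 restores it with high probability
        return 0.95
    if x1 == 1 and x4 == 1:  # healthy machine, healthy upstream neighbor
        return 0.90
    if x1 == 1 and x4 == 0:  # healthy machine, faulty upstream neighbor
        return 0.60
    return 0.05              # faulty machine left alone rarely recovers

print(p_x1_next(x1=1, x4=0, a1=0))   # 0.6
```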

Long-term Utility = Value of MDP
The value is computed by linear programming:
  minimize   Σ_x V(x)
  subject to V(x) ≥ R(x,a) + γ Σ_{x'} P(x'|x,a) V(x')   for every state x and action a
- One variable V(x) for each state.
- One constraint for each state x and action a.
- The number of states and actions is exponential!
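
A minimal sketch of this exact LP for a tiny MDP (invented 2-state, 2-action rewards and dynamics; assumes scipy is available). It makes concrete why one variable per state and one constraint per state-action pair cannot scale to the joint state space:

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_states, n_actions = 2, 2
R = np.array([[0.0, 1.0],                       # R[x, a], invented rewards
              [2.0, 0.0]])
P = np.zeros((n_states, n_actions, n_states))   # P[x, a, x'], invented dynamics
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [1.0, 0.0]

# minimize sum_x V(x)  s.t.  V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
c = np.ones(n_states)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[x, a].copy()
        row[x] -= 1.0               # (gamma*P - e_x) . V <= -R(x,a)
        A_ub.append(row)
        b_ub.append(-R[x, a])
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("V* =", res.x)
```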

Decomposable Value Functions
Linear combination of restricted-domain basis functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99,'00] [Guestrin et al. '01]:
  V(x) ≈ Σ_i w_i h_i(x)
Each h_i is the status of a small part of the complex system, e.g.:
- status of a machine and its neighbors
- load on a machine
Must find weights w giving a good approximate value function.

Single LP Solution for Factored MDPs [Schweitzer and Seidmann '85]
- One variable w_i for each basis function ⇒ polynomially many LP variables.
- One constraint for every state and action ⇒ exponentially many LP constraints.
- h_i and Q_i depend on small sets of variables/actions.
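
A minimal sketch of the approximate LP over basis weights w_i, for a hypothetical problem with two binary state variables, a single global binary action, invented reward and transition models, and indicator basis functions. The constraints are enumerated here only because the example is tiny; the point of the next slides is to avoid exactly this enumeration:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

gamma = 0.9
states = list(product([0, 1], repeat=2))          # 4 joint states
actions = [0, 1]                                  # one global binary action, for brevity

def R(x, a):                                      # invented reward
    return float(sum(x)) + (0.5 if a == 1 else 0.0)

def P(x, a):                                      # invented transition distribution
    base = 3.0 if a == 1 else 1.5                 # higher persistence under action 1
    probs = {xn: base ** sum(int(xi == xni) for xi, xni in zip(x, xn)) for xn in states}
    z = sum(probs.values())
    return {xn: p / z for xn, p in probs.items()}

# basis functions: a constant plus one indicator per state variable
H = [lambda x: 1.0, lambda x: float(x[0]), lambda x: float(x[1])]
k = len(H)

# objective: minimize sum_x sum_i w_i h_i(x), i.e. uniform state-relevance weights
c = np.array([sum(h(x) for x in states) for h in H])
A_ub, b_ub = [], []
for x in states:
    for a in actions:
        p = P(x, a)
        # sum_i w_i (gamma E[h_i | x,a] - h_i(x)) <= -R(x,a)
        row = [gamma * sum(p[xn] * h(xn) for xn in states) - h(x) for h in H]
        A_ub.append(row)
        b_ub.append(-R(x, a))
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(None, None)] * k)
print("basis weights w =", res.x)
```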

Representing Exponentially Many Constraints [Guestrin et al. '01]
Exponentially many linear constraints = one nonlinear constraint.
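
A restatement in LaTeX of the equivalence the slide names, using the standard approximate-LP constraint form (the notation is assumed, not copied from the slides):

```latex
% The exponentially many linear constraints, one per joint state x and joint action a,
\forall\, \mathbf{x},\mathbf{a}:\quad
  \sum_i w_i\, h_i(\mathbf{x})
  \;\ge\; R(\mathbf{x},\mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}'\mid\mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}')
% hold simultaneously if and only if the single nonlinear (max) constraint holds:
0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[
  R(\mathbf{x},\mathbf{a})
  + \sum_i w_i \Big( \gamma \sum_{\mathbf{x}'} P(\mathbf{x}'\mid\mathbf{x},\mathbf{a})\, h_i(\mathbf{x}') - h_i(\mathbf{x}) \Big)
\Big]
```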

Representing the Constraints
The functions are factored, so variable elimination can be used to represent the constraints. The number of constraints becomes exponentially smaller.
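
A sketch in LaTeX of one elimination step of this factored constraint construction, following the general scheme of Guestrin et al.; the exact notation here is assumed rather than taken from the slides:

```latex
% Goal: enforce  0 \ge \max_{\mathbf{x}} \sum_j c_j(\mathbf{x}[C_j])
% without enumerating all joint states.  To eliminate variable X_l:
%   let  F_l = \{ c_j : X_l \in C_j \}  and
%   \mathbf{Z} = \big( \bigcup_{c_j \in F_l} C_j \big) \setminus \{ X_l \},
% introduce one new LP variable u(\mathbf{z}) per assignment to \mathbf{Z}, constrained by
u(\mathbf{z}) \;\ge\; \sum_{c_j \in F_l} c_j(\mathbf{z}, x_l)
\qquad \text{for every assignment } \mathbf{z} \text{ and every value } x_l ,
% then replace the functions in F_l by u and eliminate the next variable.
% The constraint count is exponential only in the induced width of the
% elimination order, not in the total number of state variables.
```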

Summary of Algorithm
1. Pick local basis functions h_i.
2. Solve a single LP to compute the local Q_i's in the factored MDP.
3. The coordination graph computes the maximizing joint action.

Network Management Problem
[Figure: network topologies used in the experiments: unidirectional ring, star (with a central server), and ring of rings.]

Single Agent Policy Quality: Single LP versus Approximate Policy Iteration
[Figure: discounted reward versus number of machines, for LP with single, pair, and triple basis functions and for PI with single basis. PI = approximate policy iteration with max-norm projection [Guestrin et al. '01].]

Single Agent Running Time
[Figure: running time for LP with single, pair, and triple basis functions and for PI with single basis. PI = approximate policy iteration with max-norm projection [Guestrin et al. '01].]

Multiagent Policy Quality
[Figure: policy quality compared against the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]; curves for LP single basis, LP pair basis, distributed reward, and distributed value.]

Multiagent Running Time
[Figure: running time for the star topology with single and pair basis functions, and for the ring-of-rings topology.]

Conclusions
- Multiagent planning algorithm with limited communication and limited observability.
- Unified view of function approximation and multiagent communication.
- The single-LP solution is simple and very efficient.
- Exploit structure to reduce computation costs!
- Solves very large MDPs efficiently.

Solve Very Large MDPs
Solved MDPs with: … states, over … actions, and 500 agents.
