Pradeep Varakantham, Singapore Management University. Joint work with J. Y. Kwak, M. Taylor, J. Marecki, P. Scerri, and M. Tambe.

Motivating Domains
Disaster Rescue, Sensor Networks
Characteristics of these domains: uncertainty; coordinating multiple agents; sequential decision making

Meeting the Challenges
Problem: multiple agents coordinating to perform multiple tasks in the presence of uncertainty
Solution: represent the problem as a Distributed POMDP and solve it; since computing the optimal solution is NEXP-Complete, we use an approximate algorithm that dynamically exploits structure in agent interactions
Result: vast improvement in performance over existing algorithms

Outline
Illustrative domain
Model
Approach: exploit dynamic structure in interactions
Results

Illustrative Domain
Multiple types of robots
Uncertainty in movements
Reward terms: saving victims, collisions, clearing debris
Maximize expected joint reward

Model: Distributed POMDPs with Coordination Locales (DPCL)
Joint model: the global state represents completion of tasks
Agents are independent except in coordination locales (CLs)
Two types of CLs:
Same-time CL (e.g., agents colliding with each other)
Future-time CL (e.g., a cleaner robot clearing debris assists a rescue robot in reaching its goal)
Individual observability
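As a rough, hypothetical illustration of this structure (the names and types below are assumptions, not the paper's formal model), a DPCL instance can be viewed as individual POMDP components per agent plus an explicit list of coordination locales:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str      # hypothetical: states, actions, observations named by strings
Action = str
Obs = str

@dataclass
class AgentModel:
    # Individual POMDP components for one agent; the global (task-completion)
    # state s_g is folded into each agent's local view.
    states: List[State]
    actions: List[Action]
    observations: List[Obs]
    P: Dict[Tuple, float]   # P_i((s_g, s_i), a_i, (s_g', s_i'))
    R: Dict[Tuple, float]   # R_i((s_g, s_i), a_i, (s_g', s_i'))
    O: Dict[Tuple, float]   # O_i(o_i, a_i, (s_g', s_i'))

@dataclass
class CoordinationLocale:
    kind: str                  # "same_time" or "future_time"
    agents: List[int]          # the agents whose dynamics couple here
    states: List[State]        # joint states at which the interaction can occur
    actions: List[Action]      # actions involved in the interaction

@dataclass
class DPCL:
    global_states: List[State]         # completion status of tasks
    agents: List[AgentModel]           # independent models outside the CLs
    locales: List[CoordinationLocale]  # the only places where agents interact
```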

Solving DPCLs with TREMOR
TREMOR: Teams REshaping of MOdels for Rapid execution
Two steps:
1. Branch-and-bound search over task assignments, using MDP-based heuristics
2. Task-assignment evaluation, by computing policies for every agent
Joint policy computation is performed only at CLs
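A minimal sketch of step 1, assuming callback functions for the MDP heuristic and for step 2 (both are hypothetical interfaces, not the paper's actual bounding scheme):

```python
import heapq
from itertools import count

def tremor_search(tasks, agents, mdp_upper_bound, evaluate_assignment):
    """Hypothetical sketch of TREMOR's step 1: branch-and-bound over task assignments.

    mdp_upper_bound(partial_assignment) -> optimistic value (MDP-based heuristic)
    evaluate_assignment(full_assignment) -> value from step 2 (individual POMDP policies)
    """
    best_value, best_assignment = float("-inf"), None
    tie = count()  # tie-breaker so heapq never has to compare dicts
    frontier = [(-mdp_upper_bound({}), next(tie), {})]
    while frontier:
        neg_bound, _, partial = heapq.heappop(frontier)
        if -neg_bound <= best_value:
            continue  # prune: even the optimistic bound cannot beat the incumbent
        unassigned = [t for t in tasks if t not in partial]
        if not unassigned:
            value = evaluate_assignment(partial)  # step 2: solve shaped POMDPs
            if value > best_value:
                best_value, best_assignment = value, dict(partial)
            continue
        task = unassigned[0]
        for agent in agents:  # branch: assign the next task to each agent in turn
            child = dict(partial)
            child[task] = agent
            bound = mdp_upper_bound(child)
            if bound > best_value:
                heapq.heappush(frontier, (-bound, next(tie), child))
    return best_assignment, best_value
```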

1. Branch and Bound search

2. Task Assignment Evaluation
Until convergence of policies or maximum iterations:
1) Solve individual POMDPs
2) Identify potential coordination locales
3) Based on the type and value of coordination:
Shape P and R of the relevant individual agents
Capture interactions
Encourage/discourage interactions
4) Go to step 1
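This loop can be sketched as follows; solve_pomdp, find_coordination_locales, and shape_models are assumed interfaces standing in for the single-agent solver, CL detection, and model shaping described on the surrounding slides:

```python
def evaluate_task_assignment(agent_models, solve_pomdp, find_coordination_locales,
                             shape_models, max_iterations=10):
    """Hypothetical sketch of TREMOR's step 2: evaluate one task assignment.

    solve_pomdp(model) -> policy for one agent
    find_coordination_locales(policies) -> list of (locale, probability) pairs
    shape_models(agent_models, locale, probability) -> updates P and R in place
    """
    policies = [solve_pomdp(m) for m in agent_models]         # 1) solve individual POMDPs
    for _ in range(max_iterations):
        active = find_coordination_locales(policies)          # 2) identify potential CLs
        if not active:
            break                                             # no interactions to handle
        for locale, probability in active:                    # 3) shape P and R so that the
            shape_models(agent_models, locale, probability)   #    interaction is captured and
                                                              #    encouraged or discouraged
        new_policies = [solve_pomdp(m) for m in agent_models] # 4) re-solve and repeat
        if new_policies == policies:
            break                                             # policies converged
        policies = new_policies
    return policies
```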

Identifying Potential CLs
Probability of a CL occurring at a time step T, given the starting belief
Computed via the standard belief update under the given policy: the policy over belief states, the probability of observing ω in belief state b, and the update of b
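The equation on this slide is an image; below is a sketch of the quantities its labels describe, using the standard POMDP belief update. The explicit enumeration of observation histories and the function names are illustrative assumptions, not the paper's exact computation:

```python
def belief_update(b, a, w, P, O, states):
    """Standard POMDP belief update: b'(s') ∝ O(w, a, s') * sum_s P(s, a, s') * b(s)."""
    unnorm = {s2: O.get((w, a, s2), 0.0) * sum(P.get((s, a, s2), 0.0) * b.get(s, 0.0)
                                               for s in states)
              for s2 in states}
    total = sum(unnorm.values())
    return {s2: (v / total if total > 0 else 0.0) for s2, v in unnorm.items()}

def observation_prob(b, a, w, P, O, states):
    """Probability of observing w after taking action a in belief state b."""
    return sum(O.get((w, a, s2), 0.0) * P.get((s, a, s2), 0.0) * b.get(s, 0.0)
               for s in states for s2 in states)

def cl_probability(b0, policy, T, cl_state, cl_action, P, O, states, observations):
    """Hypothetical sketch: probability that the agent is at cl_state and takes cl_action
    at time step T, starting from belief b0 and following the given policy.
    Observation histories are enumerated explicitly (exponential in T); shown for clarity only."""
    frontier = [(b0, 1.0)]                  # reachable beliefs and their probabilities
    for _ in range(T):
        nxt = []
        for b, p in frontier:
            a = policy(b)                   # policy over belief states
            for w in observations:
                pw = observation_prob(b, a, w, P, O, states)
                if pw > 0:
                    nxt.append((belief_update(b, a, w, P, O, states), p * pw))
        frontier = nxt
    return sum(p * b.get(cl_state, 0.0) for b, p in frontier if policy(b) == cl_action)
```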

Types of CL
STCL: there exist s and a for which the transition or reward function is not decomposable,
P(s, a, s') ≠ Π_{1≤i≤N} P((s_g, s_i), a_i, (s_g', s_i'))  OR  R(s, a, s') ≠ Σ_{1≤i≤N} R((s_g, s_i), a_i, (s_g', s_i'))
FTCL: completion of a task (i.e., a change in the global state) by an agent at t' affects the transitions/rewards of other agents at t

Shaping the Model (STCL)
Shaping the transition function: a new transition probability for agent i is derived from the joint transition probability when the CL occurs
Shaping the reward function: analogous
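The shaping equations themselves are images on the original slide. One plausible form, shown purely as an assumption, blends each agent's individual transition (and reward) entries with the corresponding joint quantities, weighted by the estimated probability that the CL occurs:

```python
def shape_transition(P_i, P_joint_marginal, p_cl, locale_keys):
    """Assumed form of STCL transition shaping (not the paper's exact update):
    within the locale, blend agent i's individual dynamics with the marginal of the
    joint dynamics, weighted by the probability p_cl that the CL actually occurs."""
    shaped = dict(P_i)
    for key in locale_keys:   # key = ((s_g, s_i), a_i, (s_g', s_i'))
        shaped[key] = (1.0 - p_cl) * P_i.get(key, 0.0) + p_cl * P_joint_marginal.get(key, 0.0)
    return shaped

def shape_reward(R_i, R_joint_share, p_cl, locale_keys):
    """Analogous assumed reward shaping: move agent i's reward toward its share of the
    joint reward at the locale, which encourages or discourages the interaction."""
    shaped = dict(R_i)
    for key in locale_keys:
        shaped[key] = (1.0 - p_cl) * R_i.get(key, 0.0) + p_cl * R_joint_share.get(key, 0.0)
    return shaped
```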

Results
Benchmark algorithms: independent POMDPs; Memory-Bounded Dynamic Programming (MBDP)
Criteria: decision quality and run-time
Parameters varied: (i) agents; (ii) CLs; (iii) states; (iv) horizon

State space

Agents

Coordination Locales

Time Horizon

Related Work
DEC-MDPs: assume individual or collective full observability; take task allocation and dependencies as input
DEC-POMDPs: JESP, MBDP; exploiting independence in transition/reward/observation
Model shaping: Guestrin and Gordon, 2002

Conclusion
DPCL is a specialization of Distributed POMDPs
TREMOR exploits the presence of few CLs in a domain
TREMOR relies on single-agent POMDP solvers
Results: TREMOR outperformed existing Distributed POMDP algorithms, except on tightly coupled, small problems

Questions?

Same-Time CL (STCL)
There is an STCL if any of the following is not decomposable:
Transition function: P(s, a, s') ≠ Π_{1≤i≤N} P((s_g, s_i), a_i, (s_g', s_i'))
Observation function: O(s', a, o) ≠ Π_{1≤i≤N} O(o_i, a_i, (s_g', s_i'))
Reward function: R(s, a, s') ≠ Σ_{1≤i≤N} R((s_g, s_i), a_i, (s_g', s_i'))
Example: two robots colliding in a narrow corridor
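As an illustration of this decomposability test (with an assumed dictionary representation and a hypothetical project helper for factoring joint states), a joint state-action pair belongs to an STCL when the joint value deviates from the product of individual transition probabilities or the sum of individual rewards:

```python
from math import isclose

def is_stcl(s, a, joint_P, joint_R, indiv_P, indiv_R, next_states, project):
    """Hypothetical check for a same-time CL at joint state s and joint action a.

    project(s, i) -> (s_g, s_i): agent i's view of the joint state (assumed helper).
    joint_P[(s, a, s2)] and indiv_P[i][(project(s, i), a[i], project(s2, i))] hold
    probabilities; joint_R / indiv_R hold rewards with the same keys.
    """
    n = len(indiv_P)
    for s2 in next_states:
        p_product, r_sum = 1.0, 0.0
        for i in range(n):
            key = (project(s, i), a[i], project(s2, i))
            p_product *= indiv_P[i].get(key, 0.0)
            r_sum += indiv_R[i].get(key, 0.0)
        if not isclose(joint_P.get((s, a, s2), 0.0), p_product, abs_tol=1e-9):
            return True   # transition function not decomposable at (s, a)
        if not isclose(joint_R.get((s, a, s2), 0.0), r_sum, abs_tol=1e-9):
            return True   # reward function not decomposable at (s, a)
    return False          # (observation-function check omitted from this sketch)
```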

Future-Time CL (FTCL)
Actions of one agent at t' can affect the transitions, observations, or rewards of other agents at t, for some t' < t:
P((s_g^t, s_i^t), a_i^t, (s_g^t', s_i^t') | a_j^t') ≠ P((s_g^t, s_i^t), a_i^t, (s_g^t', s_i^t'))
R((s_g^t, s_i^t), a_i^t, (s_g^t', s_i^t') | a_j^t') ≠ R((s_g^t, s_i^t), a_i^t, (s_g^t', s_i^t'))
O(ω_i^t, a_i^t, (s_g^t', s_i^t') | a_j^t') ≠ O(ω_i^t, a_i^t, (s_g^t', s_i^t'))
Example: clearing debris assists rescue robots in getting to victims faster