
1 Pruning for Monte Carlo Distributed Reinforcement Learning in Dec-POMDPs
Conference paper by: Bikramjit Banerjee, University of Southern Mississippi
From the Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013)
Presentation by: John Mills, Florida Institute of Technology

2 Agenda
Review of POMDPs
Decentralized POMDPs
Applications of Dec-POMDPs
Overview of Reinforcement Learning
Purpose of this Work
Reinforcement Learning for Dec-POMDPs
Dec-POMDP Algorithms
Experimental Results
Review
Conclusions

3 Review of POMDPs
Partially Observable Markov Decision Processes
Agents do not have complete knowledge of their state
POMDPs are essentially MDPs augmented with a sensor model:
Transition model: P(s'|s,a); Actions: A(s); Reward function: R(s); Sensor model: P(e|s)
Agents estimate their state by computing a belief state, a conditional probability distribution over the actual states given the history of observations and actions
The optimal action depends only on the current belief state
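The belief-state update summarized above can be made concrete with a short sketch. This is a generic Bayes-filter illustration rather than code from the paper; the array layouts T[a][s, s'] and O[a][s', e] and the two-state example numbers are assumptions.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Bayesian belief filtering for a POMDP (illustrative sketch).

    belief: np.ndarray of shape (|S|,), the current distribution over states
    T: T[a][s, s'] = P(s' | s, a)   (assumed layout)
    O: O[a][s', e] = P(e | s', a)   (assumed layout)
    """
    predicted = belief @ T[action]                        # predict: sum_s b(s) P(s'|s,a)
    unnormalized = predicted * O[action][:, observation]  # weight by P(e|s')
    return unnormalized / unnormalized.sum()              # renormalize to a distribution

# Tiny two-state, one-action example (numbers are illustrative only)
T = [np.array([[0.9, 0.1], [0.2, 0.8]])]
O = [np.array([[0.85, 0.15], [0.1, 0.9]])]
b = belief_update(np.array([0.5, 0.5]), action=0, observation=1, T=T, O=O)
```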

4 Decentralized POMDPs

5 Applications of Dec-POMDPs
Formation flight of UAVs
Cooperative robotics
Swarm robotics
Load balancing among queues
Communication networks
Sensor networks

6 Overview of Reinforcement Learning

7 Purpose of this Work
Exact solutions to finite-horizon Dec-POMDPs require significant time and memory
Many Dec-POMDP solvers are centralized and assume prior knowledge of the model
Banerjee et al. aim to use decentralized planning and reinforcement learning to solve Dec-POMDPs with lower sample complexity and minimal error
Additionally, pruning is used to remove parts of the experience tree
Their methods are evaluated by solving four benchmark Dec-POMDPs

8 Reinforcement Learning for Dec-POMDPs
The authors use a Monte Carlo approach to solve Dec-POMDP problems
Agents take turns learning the best response to each other's policies (see the sketch below)
Agents do not know the models P, R, and O
Assumptions: agents know the size of the problem; agents know the overall maximum reward; agents have partial communication during the learning phase
The approach is semi-model-based because it estimates intermediate reward and history-transition functions
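A rough sketch of the alternating-learning scheme described on this slide (not the authors' implementation): each agent in turn learns a best response while the other agents' policies stay fixed, stopping once the joint value no longer improves. The helpers learn_best_response and evaluate_joint are hypothetical placeholders.

```python
def alternating_best_response(agents, initial_policies, learn_best_response,
                              evaluate_joint, tol=1e-6, max_rounds=50):
    """Agents take turns improving their own policy against the others' fixed policies."""
    policies = dict(initial_policies)
    best_value = float("-inf")
    for _ in range(max_rounds):
        for agent in agents:
            # This agent learns (e.g., by Monte Carlo Q-learning) a best response
            # to the currently fixed policies of its teammates.
            policies[agent] = learn_best_response(agent, policies)
        value = evaluate_joint(policies)
        if value - best_value < tol:       # no further improvement: converged
            break
        best_value = value
    return policies
```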

9 MCQ-ALT Algorithm
Algorithm description:
The first action is performed and rewards and observations are received
The experience tree is explored as the reward and history-transition functions are estimated
After a policy has been created, it is evaluated and the Q-values are updated
Subroutine descriptions (a rough sketch follows):
Actions are selected based on the history (SELECTACTION)
The reward and history-transition functions are estimated (STEP)
The visit count N of each history-action pair is tracked, and when a pair occurs frequently enough the history is marked "known" (ENDEPISODE)
The Q-value is estimated (QUPDATE)
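The first two subroutines might look roughly like the sketch below. It is an illustration under simplified assumptions (Q, counts, and R_hat are dict-like tables with default 0, keyed by (history, action); H_hat maps (history, action) to a Counter of next observations), not the published MCQ-ALT code.

```python
from collections import Counter, defaultdict

def select_action(history, actions, Q, counts, N_known):
    """SELECTACTION (illustrative): exploit once the history is sufficiently sampled,
    otherwise try the least-frequently chosen action."""
    if all(counts[(history, a)] >= N_known for a in actions):
        return max(actions, key=lambda a: Q[(history, a)])      # greedy selection
    return min(actions, key=lambda a: counts[(history, a)])     # least frequent first

def step(history, action, reward, observation, counts, R_hat, H_hat):
    """STEP (illustrative): update visit counts and the estimated reward and
    history-transition models from one observed (reward, observation) sample."""
    key = (history, action)
    counts[key] += 1
    R_hat[key] += (reward - R_hat[key]) / counts[key]   # incremental mean reward
    H_hat[key][observation] += 1                        # empirical next-observation counts

# Example table setup matching the assumptions above
Q, counts, R_hat = defaultdict(float), defaultdict(int), defaultdict(float)
H_hat = defaultdict(Counter)
```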

10 MCQ-ALT Algorithm (flow diagram): explore the experience tree; estimate the R and H functions; action choice is greedy or least-frequent; update the Q-value
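The "Update Q-value" step amounts to a backup over the estimated models, roughly Q(h,a) = R̂(h,a) + Σ_o Ĥ(o|h,a) · max_a' Q(h·(a,o), a'). The sketch below is a hedged illustration reusing the tables from the previous sketch, not the paper's exact equations.

```python
def q_update(history, action, actions, Q, R_hat, H_hat, at_horizon):
    """QUPDATE (illustrative): back up a Q-value from the estimated reward and
    history-transition tables (Q and R_hat default to 0 for unseen entries)."""
    key = (history, action)
    total = sum(H_hat[key].values())
    if at_horizon or total == 0:                         # leaf of the experience tree
        Q[key] = R_hat[key]
        return
    expected_future = 0.0
    for obs, count in H_hat[key].items():                # empirical P(o | h, a)
        next_history = history + (action, obs)
        expected_future += (count / total) * max(Q[(next_history, a)] for a in actions)
    Q[key] = R_hat[key] + expected_future
```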

11 Modification to MCQ-ALT
MCQ-ALT invests N samples into every leaf node of the experience tree
Rare histories become a significant liability yet contribute little to the value function
The value function may not need to be so accurate: (1) policies usually converge before value functions, and (2) most of the experience tree does not appear in the optimal policy
Confidence is preserved by removing actions that do not meet a derived criterion (illustrated after the next slide)

12 IMCQ-ALT Algorithm
Remove actions from histories at the appropriate level, based on the confidence-preservation criterion
Perform several passes through the experience tree
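A hedged illustration of the confidence-preservation idea: an action is pruned at a history when even an optimistic estimate of its value cannot beat a pessimistic estimate of the best action's value. The Hoeffding-style bound below is an assumption chosen for illustration; the paper derives its own criterion.

```python
import math

def prune_actions(history, actions, Q, counts, max_reward, steps_to_go, delta=0.05):
    """Keep only actions whose optimistic value could still match the current best
    action's pessimistic value (illustrative bound, not the paper's criterion)."""
    def bound(a):
        n = max(counts[(history, a)], 1)
        # Returns over the remaining steps lie within a range of steps_to_go * max_reward.
        return steps_to_go * max_reward * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    best = max(actions, key=lambda a: Q[(history, a)])
    pessimistic_best = Q[(history, best)] - bound(best)
    return [a for a in actions if Q[(history, a)] + bound(a) >= pessimistic_best]
```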

13 Experimental Results
The MCQ-ALT and IMCQ-ALT algorithms were tested on four benchmark Dec-POMDPs: DEC-TIGER, RECYCLING ROBOTS, BOX-PUSHING, and MARS-ROVERS
Varied parameters:
Maximum frequency of an action-history pair: N = 10, 20, 50, 100, 200, 500
Horizon (maximum number of action-observation steps): T = 3, 4, 5 and T = 2, 3, 4

14 DEC-TIGER

15 BOX-PUSHING

16 Review
Advantages:
Runtime and memory usage are improved
No model is needed
The computational burden is effectively distributed
The algorithm has a well-defined stopping criterion
Disadvantages:
Agents can only learn from the previous agent
Robustness and adaptability are not considered

17 Conclusions
Decentralized POMDPs look like a promising method for modelling multi-agent problems
The authors propose a method for solving Dec-POMDPs that promises to find (near-)optimal solutions in less time and with less memory than other methods
The algorithm has been shown to perform well on benchmark problems
Larger horizon values should be investigated
Future testing should apply the algorithm to other problems with well-defined requirements and success criteria

18 QUESTIONS?

19 REFERENCES
Banerjee, B. 2013. Pruning for Monte Carlo Distributed Reinforcement Learning in Decentralized POMDPs. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), 88-94.
Banerjee, B.; Lyle, J.; Kraemer, L.; and Yellamraju, R. 2012. Solving Finite Horizon Decentralized POMDPs by Distributed Reinforcement Learning. In The Seventh Annual Workshop on Multiagent Sequential Decision-Making Under Uncertainty (MSDM-2012).
Image sources: "Confused Robot"; "Dec-POMDPs" by Frans A. Oliehoek; "Formation Flight of UAVs"; "Swarm Robotics"; Dec-Tiger picture; Box-Pushing picture

