Modified MDPs for Concurrent Execution. AnYuan Guo, Victor Lesser. University of Massachusetts
Concurrent Execution: A set of tasks in which each task is relatively easy to solve on its own, but new interactions arise when the tasks are executed concurrently, complicating the execution of the composite task. Two settings: a single agent executing multiple tasks in parallel (example: office robot); multiple agents acting in parallel (a team).
Cross Product MDP: The problem of concurrent execution can be solved optimally by solving the cross-product MDP formed by the separate processes. Problem: exponential blow-up, since the joint state space grows exponentially with the number of processes.
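To make the blow-up concrete, here is a minimal sketch that enumerates the joint states of a cross-product MDP. The function names and the task sizes are hypothetical, chosen only for illustration; they do not come from the slides.

```python
from itertools import product


def joint_states(component_states):
    """Enumerate joint states: one local state per component MDP."""
    return list(product(*component_states))


def joint_size(component_sizes):
    """Number of joint states, i.e. the product of the component sizes."""
    size = 1
    for n in component_sizes:
        size *= n
    return size


# Hypothetical example: n tasks with 10 local states each give 10 ** n joint states.
for n_tasks in (1, 2, 3, 4):
    print(n_tasks, joint_size([10] * n_tasks))
```

Each additional task multiplies the number of joint states by the size of its local state space, which is why solving the cross-product MDP directly stops scaling.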
Related Work: Deterministic planning: situation calculus [Reiter96], extending STRIPS [Boutilier97, Knoblock94]. Termination schemes for temporally extended actions [Rohanimanesh03]. Planning in the cross-product MDP [Singh98]. Learning: W-learning [Humphrys96], MAXQ [Dietterich00].
The Goal: Break apart the interactions and encapsulate them within each agent, so that the tasks can again be solved independently.
Algorithm Summary: 1) Define the types of events and interactions of interest. 2) Summarize the other agent’s effect on this agent as a statistic: how often the constraining event occurs. 3) Change this agent’s model to reflect that statistic.
Events in MDP: State-based events (agent enters s5). Action-based events (agent moves north 1 step). State-action-based events (agent moves north 1 step from s4). Events in MDP 1 affect events in MDP 2, giving 3 x 3 = 9 types of interactions.
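The three event types and their nine pairings can be sketched as a small data structure. The `Event` class below is a hypothetical representation, not one given in the slides: a missing state or action is treated as a wildcard.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; None acts as a wildcard for that field."""
    state: Optional[str] = None
    action: Optional[str] = None

    def __post_init__(self):
        if self.state is None and self.action is None:
            raise ValueError("an event needs a state, an action, or both")

    def kind(self):
        if self.state is not None and self.action is not None:
            return "state-action"
        return "state" if self.state is not None else "action"

    def matches(self, state, action):
        """True if the given (state, action) pair is an instance of this event."""
        return self.state in (None, state) and self.action in (None, action)


EVENT_TYPES = ("state", "action", "state-action")

# (type of the constraining event in MDP 1, type of the affected event in MDP 2)
INTERACTION_TYPES = list(product(EVENT_TYPES, repeat=2))
```

For example, `Event(state="s4", action="north")` corresponds to "agent moves north 1 step from s4", and `INTERACTION_TYPES` holds the nine ordered pairs of event types.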
Assumptions: The list of possible interactions between the MDPs is given. The constraints are one-way only: the effects do not propagate back to the originator of the constraint.
Directed Acyclic Constraints: The constraints among a set of events form a directed acyclic graph.
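One consequence of acyclicity is that the tasks can be ordered so that every originator of a constraint comes before the task it affects. The slides do not spell out a processing order, so the sketch below is an assumption: it uses Python's standard `graphlib` module, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def processing_order(constrained_by):
    """Order tasks so that each originator precedes the tasks it constrains.

    constrained_by maps each affected task to the set of tasks that
    constrain it. Raises CycleError if the constraints are not acyclic.
    """
    return list(TopologicalSorter(constrained_by).static_order())


# Hypothetical example: mdp1 constrains mdp2, and both constrain mdp3.
order = processing_order({"mdp2": {"mdp1"}, "mdp3": {"mdp1", "mdp2"}})
print(order)
```

With such an order, each task's model could be modified only after the frequencies of the events that constrain it are known.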
Event Frequency & MDP Modification: 1) Calculate frequency. 2) Modify MDP.
Calculating State Visitation Frequency Given a policy, solve the system of simultaneous linear equations: Under the constraint that:
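The equations on this slide are not preserved in the transcript. A common formulation, assumed here rather than taken from the slides, is the stationary condition F(s') = sum over s of F(s) P(s, π(s), s'), with the frequencies summing to 1. The sketch below approximates that solution by power iteration instead of solving the linear system directly, which keeps it free of external libraries; it assumes the chain induced by the policy has a unique stationary distribution that the iteration converges to.

```python
def state_visitation_frequencies(P, policy, iters=10_000, tol=1e-12):
    """Approximate the stationary state frequencies under a fixed policy.

    P[s][a][s2] is the probability of moving from s to s2 under action a;
    policy[s] is the action taken in s. Both layouts are assumptions.
    """
    states = list(P)
    F = {s: 1.0 / len(states) for s in states}  # start from a uniform guess
    for _ in range(iters):
        new = {s: 0.0 for s in states}
        for s in states:
            for s2, p in P[s][policy[s]].items():
                new[s2] += F[s] * p  # probability mass flowing from s into s2
        if max(abs(new[s] - F[s]) for s in states) < tol:
            return new
        F = new
    return F


# Hypothetical two-state example with a single action "a".
P_EXAMPLE = {
    "s1": {"a": {"s1": 0.9, "s2": 0.1}},
    "s2": {"a": {"s1": 0.5, "s2": 0.5}},
}
POLICY_EXAMPLE = {"s1": "a", "s2": "a"}
F_EXAMPLE = state_visitation_frequencies(P_EXAMPLE, POLICY_EXAMPLE)
# F_EXAMPLE is approximately {"s1": 5/6, "s2": 1/6}
```

The update preserves the total probability mass, so the normalization constraint holds at every iteration.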
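Calculating Action Frequencies: Given a policy, the action frequency F(a) is the sum of the visitation frequencies of all the states in which action a is executed: $F(a) = \sum_{s : \pi(s) = a} F(s)$, where $F(s)$ is the visitation frequency of state $s$ and $\pi(s)$ is the action the policy executes in $s$.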
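The definition translates directly into code. In this sketch the dictionary layouts for the state frequencies `F` and the policy are assumptions, as are the example values.

```python
from collections import defaultdict


def action_frequencies(F, policy):
    """Sum the visitation frequency of every state in which each action is executed.

    F[s] is the visitation frequency of state s; policy[s] is the action taken in s.
    """
    freq = defaultdict(float)
    for s, f in F.items():
        freq[policy[s]] += f
    return dict(freq)


# Hypothetical example values.
F = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
policy = {"s1": "up", "s2": "left", "s3": "up"}
print(action_frequencies(F, policy))
```

Because every state contributes to exactly one action under a deterministic policy, the action frequencies sum to the same total as the state frequencies.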
Calculating State-Action Frequencies: Now both the action and the state in which it is executed matter: $F(s, a) = F(s)$ if $\pi(s) = a$, and $0$ otherwise. This also generalizes to a set of states and actions.
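A short sketch of the state-action case and its generalization to a set of pairs. It assumes a deterministic policy and the same dictionary layouts as before; the function names are hypothetical.

```python
def state_action_frequency(F, policy, s, a):
    """Frequency of executing action a in state s under a deterministic policy."""
    return F[s] if policy[s] == a else 0.0


def event_set_frequency(F, policy, pairs):
    """Total frequency of a set of (state, action) pairs; duplicates count once."""
    return sum(state_action_frequency(F, policy, s, a) for s, a in set(pairs))


# Hypothetical example values.
F = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
policy = {"s1": "up", "s2": "left", "s3": "up"}
print(event_set_frequency(F, policy, [("s1", "up"), ("s2", "up"), ("s3", "up")]))
```

Passing every state paired with one action recovers the action frequency, so the action-based case is a special instance of this one.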
Account for the Effects of Constraints: Modify the model by modifying the transition probability table (TPT). Intuition: other agents can change the dynamics of my environment. Example (figure): A1, A2.
Account for State Based Events: A constraint from another task can affect the current task’s ability to enter certain states. A slice of the TPT under action a1 (rows: from, columns: to):
from \ to    s1               s2               s3
s1           P(s1, a1, s1)    P(s1, a1, s2)    P(s1, a1, s3)
s2           P(s2, a1, s1)    P(s2, a1, s2)    P(s2, a1, s3)
s3           P(s3, a1, s1)    P(s3, a1, s2)    P(s3, a1, s3)
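The transcript does not preserve how the table entries change, so the following is only a sketch under a stated assumption: with probability f (the frequency of the constraining event), an attempt to enter the blocked state fails and the agent stays where it is. The function name and the example numbers are hypothetical.

```python
def block_state_entry(P_a, blocked, f):
    """Return a modified copy of one TPT slice P_a[s][s2] (a single action).

    Assumption: entering `blocked` from another state fails with probability f,
    and the lost probability mass is moved to staying in the current state.
    """
    out = {}
    for s, row in P_a.items():
        new_row = dict(row)
        if s != blocked:  # remaining in the blocked state is not an entry
            moved = f * row.get(blocked, 0.0)
            new_row[blocked] = row.get(blocked, 0.0) - moved
            new_row[s] = new_row.get(s, 0.0) + moved
        out[s] = new_row
    return out


# Hypothetical slice under action a1.
P_a1 = {
    "s1": {"s1": 0.1, "s2": 0.8, "s3": 0.1},
    "s2": {"s1": 0.0, "s2": 1.0, "s3": 0.0},
    "s3": {"s1": 0.2, "s2": 0.3, "s3": 0.5},
}
modified = block_state_entry(P_a1, "s2", 0.25)
```

Only the column of the blocked state and the diagonal change, and every row still sums to 1.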
Account for Action Based Events: A constraint from another task can affect the current task’s ability to carry out certain actions. TPT for affected action a1 (rows: from, columns: to):
from \ to    s1               s2               s3
s1           P(s1, a1, s1)    P(s1, a1, s2)    P(s1, a1, s3)
s2           P(s2, a1, s1)    P(s2, a1, s2)    P(s2, a1, s3)
s3           P(s3, a1, s1)    P(s3, a1, s2)    P(s3, a1, s3)
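As a hedged illustration of an action-based modification, assume that the affected action fails with probability f in every state and that a failed action leaves the agent in place. This rule is an assumption, since the slide's details are not in the transcript; the names and numbers are hypothetical.

```python
def block_action(P, action, f):
    """Return a copy of P[s][a][s2] in which `action` fails with probability f.

    Assumption: a failed action leaves the agent in its current state, so each
    row of the affected action is scaled by (1 - f) and f is added to the diagonal.
    """
    out = {}
    for s, by_action in P.items():
        out[s] = {a: dict(row) for a, row in by_action.items()}
        if action in by_action:
            row = {s2: (1.0 - f) * p for s2, p in by_action[action].items()}
            row[s] = row.get(s, 0.0) + f
            out[s][action] = row
    return out


# Hypothetical two-state model.
P = {
    "s1": {"a1": {"s2": 1.0}, "a2": {"s1": 1.0}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5}},
}
P_mod = block_action(P, "a1", 0.2)
```

Rows for other actions are copied unchanged, so only the table of the affected action differs from the original model.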
Account for State-Action Based Events: A constraint from another task can affect the current task’s ability to carry out certain actions at certain states. TPT for affected action a1 (rows: from, columns: to):
from \ to    s1               s2               s3
s1           P(s1, a1, s1)    P(s1, a1, s2)    P(s1, a1, s3)
s2           P(s2, a1, s1)    P(s2, a1, s2)    P(s2, a1, s3)
s3           P(s3, a1, s1)    P(s3, a1, s2)    P(s3, a1, s3)
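For the state-action case, a modification would touch only the rows of the affected states. The sketch below again assumes that a blocked action fails with probability f and leaves the agent in place; that rule, the function name, and the example values are assumptions, not details from the slides.

```python
def block_action_in_states(P, action, states, f):
    """Return a copy of P[s][a][s2] where `action` fails with probability f,
    but only when it is executed in one of the given states.
    """
    out = {s: {a: dict(row) for a, row in by_a.items()} for s, by_a in P.items()}
    for s in states:
        row = {s2: (1.0 - f) * p for s2, p in P[s][action].items()}
        row[s] = row.get(s, 0.0) + f  # failed attempts stay in place
        out[s][action] = row
    return out


# Hypothetical three-state model with a single action a1.
P3 = {
    "s1": {"a1": {"s2": 1.0}},
    "s2": {"a1": {"s3": 1.0}},
    "s3": {"a1": {"s1": 0.6, "s3": 0.4}},
}
P3_mod = block_action_in_states(P3, "a1", {"s3"}, 0.5)
```

Here only the s3 row of the a1 table changes; the rows for s1 and s2 are identical to the original.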
Experiments: The mountain climbing scenario. States: location of the agent. Actions: move up, down, left, right, or any of the 4 diagonal steps (8 total). Transitions: probability 0.05 of slipping to an adjacent state rather than the intended one. Rewards: -1 per move, -3 for a diagonal move, 100 for reaching the goal. Constraint: agent 1 taking the “up” action prevents agent 2 from doing so.
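The action set and reward structure of this scenario could be encoded roughly as follows. The grid coordinates, the action names for the diagonal moves, and the choice that the goal reward replaces the move cost are all assumptions made for this sketch; the slides do not specify them.

```python
# Eight moves as (dx, dy) offsets on a grid; the coordinate convention is assumed.
ACTIONS = {
    "up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
    "up-left": (-1, 1), "up-right": (1, 1),
    "down-left": (-1, -1), "down-right": (1, -1),
}


def is_diagonal(action):
    dx, dy = ACTIONS[action]
    return dx != 0 and dy != 0


def reward(action, next_state, goal):
    """100 for reaching the goal; otherwise -3 for a diagonal move and -1 for any other."""
    if next_state == goal:
        return 100
    return -3 if is_diagonal(action) else -1
```

Under the constraint on this slide, the event of interest in agent 1's MDP is the action-based event "up", and the affected event in agent 2's MDP is the same action.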
Results: Policies. Figure panels: policies when executed independently; policies when executed concurrently, after applying the algorithm.
Results. Figure axes: Size of State Space, Average Value of Policy.
Improvements: Explore different ways to modify the MDP (e.g., shrinking the action set). Relax the directed-acyclic restriction on constraints (take an iterative approach). Show that the approach is optimal for summaries that consist of a single random variable.
New Directions: Different types of summaries: steady-state behavior (current work), multi-state summaries, summaries with temporal information. Dynamic task arrival/departure: given some model of arrival; without a model (learning). Positive interactions (e.g., enable).