
1 Concurrent Markov Decision Processes. Mausam and Daniel S. Weld, University of Washington, Seattle

2 Planning: an agent receives Percepts from its Environment and issues Actions. What action next?

3 Motivation
Two features of real-world planning domains:
- Concurrency (widely studied in the classical planning literature): some instruments may warm up, others may perform their tasks, others may shut down to save power.
- Uncertainty (widely studied in the MDP literature): all actions (pick up the rock, send data, etc.) have a probability of failure.
We need both!

4 Probabilistic Planning
Probabilistic planning is typically modeled as a Markov Decision Process. Traditional MDPs assume a "single action per decision epoch". Solving concurrent MDPs in the naive way incurs an exponential blowup in running time.

5 Outline of the talk
- MDPs
- Concurrent MDPs
  - Sound pruning rules to reduce the blowup
  - Sampling techniques to obtain orders-of-magnitude speedups
- Experiments
- Conclusions and Future Work

6 Markov Decision Process
- S: a set of states, factored into Boolean variables
- A: a set of actions
- Pr: S × A × S → [0,1], the transition model
- C: A → R, the cost model
- γ: the discount factor (γ ∈ (0,1])
- s0: the start state
- G: a set of absorbing goals
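For concreteness, here is a minimal Python sketch of this tuple; the class and field names (MDP, transition, cost, gamma, s0, goals) are illustrative choices, not from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple

State = FrozenSet[str]   # a state = the set of Boolean variables that are true
Action = str

@dataclass
class MDP:
    states: Set[State]
    actions: Set[Action]
    # transition[(s, a)] maps each successor s' to Pr(s' | s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    cost: Callable[[Action], float]   # C : A -> R
    gamma: float                      # discount factor in (0, 1]
    s0: State                         # start state
    goals: Set[State]                 # absorbing goal states
```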

7 Goal of an MDP
Find a policy (S → A) which minimises the expected discounted cost of reaching a goal, for an infinite horizon, in a fully observable Markov decision process.

8 Bellman Backup
Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from state s. Given an estimate of the J* function (say Jn), back up the Jn function at state s to calculate a new estimate (Jn+1) as follows:
  Qn+1(s,a) = C(a) + γ Σ_{s'} Pr(s'|s,a) Jn(s')
  Jn+1(s) = min over a in Ap(s) of Qn+1(s,a)
Value Iteration: perform Bellman updates at all states in each iteration; stop when costs have converged at all states.
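A small sketch of this backup and the value-iteration loop, reusing the MDP sketch above; the applicable helper (returning Ap(s)) and the convergence threshold eps are assumptions.

```python
def q_value(mdp, J, s, a):
    """Q_{n+1}(s,a) = C(a) + gamma * sum_{s'} Pr(s'|s,a) * J_n(s')."""
    return mdp.cost(a) + mdp.gamma * sum(
        p * J[s2] for s2, p in mdp.transition[(s, a)].items())

def bellman_backup(mdp, J, s, applicable):
    """J_{n+1}(s) = min over applicable actions of Q_{n+1}(s,a)."""
    return min(q_value(mdp, J, s, a) for a in applicable(s))

def value_iteration(mdp, applicable, eps=1e-4):
    J = {s: 0.0 for s in mdp.states}
    while True:
        residual = 0.0
        for s in mdp.states:
            if s in mdp.goals:
                continue                 # goals are absorbing, cost 0
            new = bellman_backup(mdp, J, s, applicable)
            residual = max(residual, abs(new - J[s]))
            J[s] = new
        if residual < eps:
            return J
```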

9 Min Bellman Backup
[Figure: the backup at state s; each applicable action a1, a2, a3 in Ap(s) branches to successor states with values Jn; Qn+1(s,a) is computed per action and Jn+1(s) is the minimum over them.]

10 Min RTDP Trial
[Figure: one RTDP trial from state s; Qn+1(s,a) is computed for a1, a2, a3, Jn+1(s) is their minimum, the greedy action amin = a2 is chosen, an outcome is sampled, and the trial continues until the Goal is reached.]

11 Real Time Dynamic Programming (Barto, Bradtke and Singh '95)
- Trial: simulate the greedy policy; perform Bellman backups on the visited states.
- Repeat RTDP trials until the cost function converges.
- Anytime behaviour; only expands the reachable state space.
- Complete convergence is slow.
Labeled RTDP (Bonet & Geffner '03)
- Admissible, if started with an admissible cost function.
- Monotonic; converges quickly.
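A sketch of one RTDP trial under the same assumptions, reusing q_value from the value-iteration sketch above; the step limit and the random.choices-based outcome simulation are illustrative.

```python
import random

def greedy_action(mdp, J, s, applicable):
    return min(applicable(s), key=lambda a: q_value(mdp, J, s, a))

def rtdp_trial(mdp, J, applicable, max_steps=1000):
    s = mdp.s0
    for _ in range(max_steps):
        if s in mdp.goals:
            break
        a = greedy_action(mdp, J, s, applicable)   # simulate greedy policy
        J[s] = q_value(mdp, J, s, a)               # Bellman backup on visited state
        succ = mdp.transition[(s, a)]
        s = random.choices(list(succ), weights=list(succ.values()))[0]
```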

12 Concurrent MDPs
Redefining the applicability function: Ap: S → P(P(A))
Inheriting mutex definitions from classical planning:
- Conflicting preconditions: a1: if p1 set x1; a2: if ¬p1 set x1
- Conflicting effects: a1: set x1 (pr=0.5); a2: toggle x1 (pr=0.5)
- Interfering preconditions and effects: a1: if p1 set x1; a2: toggle p1 (pr=0.5)

13 Concurrent MDPs (contd.)
Ap(s) = { Ac ⊆ A | all actions in Ac are individually applicable in s, and no two actions in Ac are mutex }
⇒ The actions in Ac do not interact with each other. Hence, the transition probability of the combination can be computed by composing the individual actions' transitions (in any order).
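A sketch of how Ap(s) could be enumerated from these definitions; the applicable_single and mutex predicates are left abstract (assumptions), and the subset enumeration is deliberately naive and exponential in the number of applicable actions.

```python
from itertools import combinations

def applicable_combinations(actions, s, applicable_single, mutex):
    """All non-empty A_c whose members are individually applicable in s
    and pairwise non-mutex."""
    singles = [a for a in actions if applicable_single(a, s)]
    combos = []
    for k in range(1, len(singles) + 1):
        for combo in combinations(singles, k):
            if all(not mutex(a, b) for a, b in combinations(combo, 2)):
                combos.append(frozenset(combo))
    return combos
```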

14 Concurrent MDPs (contd.)
Cost model: C: P(A) → R
Typically, C(Ac) < Σ_{a ∈ Ac} C({a})
- Time component
- Resource component
(If C(Ac) = Σ_{a ∈ Ac} C({a}), then the optimal sequential policy is optimal for the concurrent MDP.)
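One plausible instantiation of such a cost model, assuming the time component is a max over the parallel actions while the resource component is a sum; the 0.2/0.8 weights echo the experiments slide below, and the exact form is an assumption here, not the talk's definition.

```python
def combo_cost(combo, time_cost, resource_cost, w_time=0.2, w_res=0.8):
    """C(A_c): concurrency can save time (max over the set) but not resources (sum),
    so C(A_c) stays below the sum of the single-action costs."""
    return (w_time * max(time_cost(a) for a in combo)
            + w_res * sum(resource_cost(a) for a in combo))
```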

15 Bellman Backup (Concurrent MDP)
[Figure: the backup at state s now ranges over all applicable combinations in Ap(s): {a1}, {a2}, {a3}, {a1,a2}, {a1,a3}, {a2,a3}, {a1,a2,a3}; each combination branches to successor states with values Jn, and Jn+1(s) is the minimum over the combinations.]
Exponential blowup to calculate a Bellman backup!
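A sketch of the concurrent backup that makes the blowup explicit: the minimization now ranges over every combination returned by Ap(s); combo_cost and combo_transition are abstract placeholders for the combination-level cost and transition models.

```python
def concurrent_backup(mdp, J, s, ap, combo_cost, combo_transition):
    """Backup over action combinations:
    J_{n+1}(s) = min over A_c in Ap(s) of C(A_c) + gamma * sum_{s'} Pr(s'|s,A_c) J_n(s')."""
    def q(A_c):
        return combo_cost(A_c) + mdp.gamma * sum(
            p * J[s2] for s2, p in combo_transition(s, A_c).items())
    best = min(ap(s), key=q)   # exponentially many candidates in the worst case
    return q(best), best
```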

16 Outline of the talk
- MDPs
- Concurrent MDPs
  - Sound pruning rules to reduce the blowup
  - Sampling techniques to obtain orders-of-magnitude speedups
- Experiments
- Conclusions and Future Work

17 Combo skipping (proven sound pruning rule)
If ⌈Jn(s)⌉ < γ^(1-k) Qn(s,{a1}) + func(Ac, γ), then prune Ac for state s in this backup.
- Use Qn(s, Aprev) as an upper bound of Jn(s).
- Choose a1 as the action with maximum Qn(s,{a1}) to obtain maximum pruning.
- Skips a combination only for the current iteration.

18 Combo elimination (proven sound pruning rule)
If ⌊Q*(s,Ac)⌋ > ⌈J*(s)⌉, then eliminate Ac from the applicability set of state s.
- Eliminates the combination Ac from the applicable list of s for all subsequent iterations.
- Use Qn(s,Ac) as a lower bound of Q*(s,Ac).
- Use J*sing(s) (the optimal cost for the single-action MDP) as an upper bound of J*(s).

19 Pruned RTDP
RTDP with modified Bellman backups:
- Combo-skipping
- Combo-elimination
Guarantees: convergence, optimality.
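A sketch of how combo elimination could sit in front of the backup (combo skipping is analogous but only lasts for the current iteration); q_lower and j_upper stand for the bounds named above, e.g. Qn(s,Ac) and J*sing(s).

```python
def eliminate_combos(s, combos, q_lower, j_upper):
    """Combo elimination: permanently drop A_c when a lower bound on
    Q*(s, A_c) exceeds an upper bound on J*(s); survivors stay in Ap(s)."""
    upper = j_upper(s)   # e.g. J*_sing(s) from the single-action MDP
    return [A_c for A_c in combos if q_lower(s, A_c) <= upper]
```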

20 Experiments
Domains: NASA Rover domain, Factory domain, Switchboard domain
Cost function: time component 0.2, resource component 0.8
State variables: 20-30
Avg(Ap(s)): 170 - 12287

21 Speedups in Rover domain

22 Stochastic Bellman Backups
Sample a subset of combinations for a Bellman backup.
Intuition: actions with low Q-values are likely to be in the optimal combination.
Sampling distribution: (i) calculate all single-action Q-values; (ii) bias towards choosing combinations containing actions with low Q-values; (iii) also include the best combinations for this state from the previous iteration (memoization).
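A sketch of this sampling step: weight each combination by how low the single-action Q-values of its members are, then sample; the exponential weighting, temperature, and num_samples default are assumptions.

```python
import math
import random

def sample_combinations(combos, single_q, num_samples=40, temperature=1.0):
    """Sample combinations with probability increasing as the sum of their
    members' single-action Q-values decreases (low Q-value = low cost = good)."""
    weights = [math.exp(-sum(single_q[a] for a in A_c) / temperature)
               for A_c in combos]
    return random.choices(combos, weights=weights, k=num_samples)
```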

23 Sampled RTDP
Non-monotonic and inadmissible ⇒ convergence and optimality not proven.
Heuristics:
- Complete backup phase (labeling).
- Run Pruned RTDP with the value function from Sampled RTDP (after scaling).

24 Speedup in the Rover domain

25 Close to optimal solutions

Problem   J*(s0) (S-RTDP)   J*(s0) (Optimal)   Error
Rover1    10.7538           10.7535            <0.01%
Rover2    10.7535           10.7535            0
Rover3    11.0016           11.0016            0
Rover4    12.7490           12.7461            0.02%
Rover5    7.3163            7.3163             0
Rover6    10.5063           10.5063            0
Rover7    12.9343           12.9246            0.08%
Art1      4.5137            4.5137             0
Art2      6.3847            6.3847             0
Art3      6.5583            6.5583             0
Fact1     15.0859           15.0338            0.35%
Fact2     14.1414           14.0329            0.77%
Fact3     16.3771           16.3412            0.22%
Fact4     15.8588           15.8588            0
Fact5     9.0314            8.9844             0.56%

26 Speedup vs. Concurrency

27 Varying num_samples
[Plots: optimality and efficiency as num_samples varies.]

28 Contributions
- Modeled concurrent MDPs.
- Sound, optimal pruning methods: combo-skipping, combo-elimination.
- Fast sampling approaches: close-to-optimal solutions; heuristics to improve optimality.
- Our techniques are general and can be applied to any algorithm: VI, LAO*, etc.

29 Related Work
- Factorial MDPs (Meuleau et al. '98; Singh & Cohn '98)
- Multiagent planning (Guestrin, Koller, Parr '01)
- Concurrent Markov options (Rohanimanesh & Mahadevan '01)
- Generate, test and debug paradigm (Younes & Simmons '04)
- Parallelization of sequential plans (Edelkamp '03; Nigenda & Kambhampati '03)

30 Future Work
- Find error bounds and prove convergence for Sampled RTDP.
- Concurrent reinforcement learning.
- Modeling durative actions (Concurrent Probabilistic Temporal Planning). Initial results: Mausam & Weld '04 (AAAI Workshop on MDPs).

31 Concurrent Probabilistic Temporal Planning (CPTP)
CPTP extends the concurrent MDP model with durative actions.
Our solution (AAAI Workshop on MDPs):
- Model CPTP as a concurrent MDP in an augmented state space.
- Present admissible heuristics to speed up the search and manage the state-space blowup.

