Model-based Bayesian Reinforcement Learning in Partially Observable Domains by Pascal Poupart and Nikos Vlassis (2008 International Symposium on Artificial.


1 Model-based Bayesian Reinforcement Learning in Partially Observable Domains
by Pascal Poupart and Nikos Vlassis (2008 International Symposium on Artificial Intelligence and Math)
Presented by Lihan He, ECE, Duke University
Oct 3, 2008

2 Outline
- Introduction
- POMDP represented as dynamic decision network (DDN)
- Partially observable reinforcement learning
  - Belief update
  - Value function and optimal action
- Partially observable BEETLE
  - Offline policy optimization
  - Online policy execution
- Conclusion

3 Introduction
Final objective: learn the optimal actions (policy) that achieve the best reward.
POMDP: partially observable Markov decision process
- a sequential decision-making problem
- represented by a tuple whose transition model T and observation model O define the dynamics of the environment
Reinforcement learning for POMDPs: solve the decision-making problem from environment feedback when the dynamics of the environment (T and O) are unknown.
- the history is given as an action-observation sequence
- model-based: explicitly model the environment
- model-free: avoid explicitly modeling the environment
- online learning: learn and execute the policy at the same time
- offline learning: learn the policy first from training data, then execute it without modifying it

4 Introduction
This paper:
- Bayesian model-based approach
- The prior belief is set as a mixture of products of Dirichlets
- The posterior belief is again a mixture of products of Dirichlets
- The value function is also a mixture of products of Dirichlets
- The number of mixture components increases exponentially with the time step
- PO-BEETLE algorithm

5 POMDP and DDN
Redefine the POMDP as a dynamic decision network (DDN).
- X, X': the variables at two consecutive time steps
- The observation and reward variables are subsets of the state variables
- The conditional probability distributions of the state variables, Pr(s'|pa_{s'}), jointly encode the transition, observation and reward models T, O and R

6 POMDP and DDN
Given X, S, R, O, A, the edges E and the dynamics Pr(s'|pa_{s'}):
- Belief update: the belief b over states is revised after each action a and observation o'.
- Objective: find a policy that maximizes the expected total reward.
- The optimal value function satisfies Bellman's equation.
- Value iteration algorithms optimize the value function by iteratively computing the right-hand side of Bellman's equation.
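The belief update named on this slide can be illustrated with a minimal sketch. It assumes a flat (unfactored) POMDP rather than the DDN of the paper; the array layouts `T[a][s][s2]` and `O[a][s2][o]`, the function name, and the toy numbers are all hypothetical choices made for illustration.

```python
def belief_update(b, a, o, T, O):
    """Flat-POMDP belief update: b'(s2) is proportional to
    O[a][s2][o] * sum_s b[s] * T[a][s][s2].

    Assumed layouts (not from the slides): T[a][s][s2] is the probability of
    moving from s to s2 under action a; O[a][s2][o] is the probability of
    observing o after reaching s2 under action a.
    """
    n = len(b)
    unnormalized = [
        O[a][s2][o] * sum(b[s] * T[a][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    z = sum(unnormalized)  # equals Pr(o | b, a)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return [u / z for u in unnormalized]


# Toy two-state, one-action model; the numbers are made up.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.7, 0.3], [0.1, 0.9]]]
b1 = belief_update([0.5, 0.5], a=0, o=1, T=T, O=O)
print(b1)
```

The normalizer `z` is the probability of the observation given the belief and action, which is the same quantity that weights each observation branch in Bellman's equation.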

7 POMDP and DDN
For reinforcement learning, assume X, S, R, O, A and the edges E are known, but the dynamics Pr(s'|pa_{s'}) are unknown.
We augment the graph: the dynamics are included in the graph as parameters Θ.
If the unknown model is static, the parameters θ stay the same from one time step to the next.
Belief over s → joint belief over s and θ

8 PORL: belief update
Prior setting for the belief: a mixture of products of Dirichlets.
The posterior belief (after taking action a and receiving observation o') is again a mixture of products of Dirichlets.
Problem: the number of mixture components increases by a factor of |S| at each update (exponential growth with time).
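The growth by a factor of |S| can be seen in a small sketch. It is a simplification, not the paper's factored form: it assumes a flat state space, an unknown transition model with one Dirichlet per (state, action) pair, and a known observation model `O[s2][o]`. Each mixture component is a hypothetical triple `(weight, state, counts)`.

```python
import copy


def update_mixture(components, a, o, O):
    """Exact Bayesian update of a mixture of Dirichlet components.

    components: list of (weight, state, counts), where counts[s][a][s2] are
    Dirichlet hyperparameters over the unknown transition model.
    O[s2][o] is an assumed known observation model.
    """
    updated = []
    for w, s, counts in components:
        row = counts[s][a]
        total = sum(row)
        for s2, n in enumerate(row):
            # Expected transition probability under the Dirichlet, times Pr(o | s2).
            w2 = w * (n / total) * O[s2][o]
            if w2 == 0.0:
                continue
            new_counts = copy.deepcopy(counts)
            new_counts[s][a][s2] += 1  # multiplying by theta increments one count
            updated.append((w2, s2, new_counts))
    z = sum(w for w, _, _ in updated)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [(w / z, s, c) for w, s, c in updated]


# Two states, one action, uniform prior counts; all numbers are made up.
counts0 = [[[1.0, 1.0]], [[1.0, 1.0]]]
O = [[0.8, 0.2], [0.3, 0.7]]
belief = [(1.0, 0, counts0)]
sizes = []
for o in [0, 1, 1]:
    belief = update_mixture(belief, 0, o, O)
    sizes.append(len(belief))
print(sizes)  # [2, 4, 8]
```

Every component spawns one successor per reachable next state, so with |S| = 2 the mixture doubles at each step.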

9 PORL: value function and optimal action
The augmented POMDP is hybrid, with discrete state variables S and continuous model variables Θ.
- Discrete state POMDP: the value of a belief is a maximum over α-vectors, each combined with the belief by a sum over s
- Continuous state POMDP [1]: the sum over s becomes an integral
- Hybrid state POMDP: a sum over s combined with an integral over θ
The α-function α(s,θ) can also be represented as a mixture of products of Dirichlets.
[1] Porta, J. M.; Vlassis, N. A.; Spaan, M. T. J.; and Poupart, P. 2006. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7:2329–2367.
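For reference, the three cases are usually written in the α-function form below. This is the standard notation from the point-based value iteration literature, given as a hedged reconstruction; the symbol Γ for the set of α-functions is an assumption and may differ from the slide's notation.

```latex
\begin{align*}
\text{Discrete:}\quad   V(b) &= \max_{\alpha \in \Gamma} \sum_{s} b(s)\,\alpha(s) \\
\text{Continuous:}\quad V(b) &= \max_{\alpha \in \Gamma} \int_{s} b(s)\,\alpha(s)\,ds \\
\text{Hybrid:}\quad     V(b) &= \max_{\alpha \in \Gamma} \sum_{s} \int_{\theta} b(s,\theta)\,\alpha(s,\theta)\,d\theta
\end{align*}
```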

10 PORL: value function and optimal action
Assume the value function for k steps-to-go is given as a set of α-functions. The value function for k+1 steps-to-go is then computed in 3 steps, which include:
- finding the optimal action for belief b
- finding the corresponding α-function
Problem: the number of mixture components increases by a factor of |S| at each backup (exponential growth with time).
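As a point of comparison, the sketch below shows the standard point-based backup for a flat, discrete POMDP. It is an analogue, not the paper's hybrid backup over (s, θ): the α-functions here are plain vectors, and the layouts `T[a][s][s2]`, `O[a][s2][o]`, `R[a][s]` and all names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def point_based_backup(b, alphas, T, O, R, gamma):
    """One point-based Bellman backup at belief b (flat, discrete POMDP).

    alphas: list of alpha-vectors for k steps-to-go.
    Returns the best action at b and its alpha-vector for k+1 steps-to-go.
    """
    n_states = len(b)
    n_obs = len(O[0][0])
    best_action, best_alpha, best_value = None, None, float("-inf")
    for a in range(len(T)):
        candidate = [R[a][s] for s in range(n_states)]
        for o in range(n_obs):
            # Project each alpha-vector back through the model for (a, o).
            projections = [
                [
                    sum(T[a][s][s2] * O[a][s2][o] * alpha[s2] for s2 in range(n_states))
                    for s in range(n_states)
                ]
                for alpha in alphas
            ]
            # Keep the projection that scores highest at b.
            g = max(projections, key=lambda p: dot(b, p))
            candidate = [candidate[s] + gamma * g[s] for s in range(n_states)]
        value = dot(b, candidate)
        if value > best_value:
            best_action, best_alpha, best_value = a, candidate, value
    return best_action, best_alpha


# Toy model: two states, two actions, two uninformative observations (made-up numbers).
T = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
R = [[1.0, 0.0], [0.0, 2.0]]
action, alpha = point_based_backup([0.5, 0.5], [[0.0, 0.0]], T, O, R, gamma=0.9)
print(action, alpha)
```

In the hybrid setting the sums over next states act on mixtures of Dirichlets instead of vectors, which is where the extra factor of |S| in the number of components comes from.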

11 PO-BEETLE: offline policy optimization
Policy learning is performed offline, given sufficient training data (action-observation sequences).

12 PO-BEETLE: offline policy optimization
Keep the number of mixture components for α-functions bounded:
- Approach 1: approximation using basis functions
- Approach 2: approximation by important components
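Approach 2 can be pictured as truncating the mixture. The sketch below assumes that "important" means largest absolute coefficient, which is an illustrative choice and not necessarily the criterion used in the paper; the component layout `(coefficient, params)` and the labels are hypothetical.

```python
def keep_important(components, k):
    """Keep the k mixture components with the largest absolute coefficient.

    components: list of (coefficient, params) pairs, where params stands in
    for the Dirichlet hyperparameters of that component.
    """
    return sorted(components, key=lambda c: abs(c[0]), reverse=True)[:k]


# Placeholder labels stand in for Dirichlet hyperparameters.
alpha_components = [(0.05, "c1"), (-1.2, "c2"), (0.6, "c3"), (0.01, "c4")]
pruned = keep_important(alpha_components, 2)
print(pruned)  # [(-1.2, 'c2'), (0.6, 'c3')]
```

The coefficients of an α-function are not probabilities, so the sketch does not renormalize after pruning.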

13 PO-BEETLE: online policy execution
Given the policy, the agent executes it and updates the belief online.
Keep the number of mixture components for the belief b bounded:
- Approach 1: approximation using importance sampling

14 PO-BEETLE: online policy execution
- Approach 2: particle filtering, which simultaneously updates the belief and reduces the number of mixture components
  - Sample one updated component (after taking a and receiving o')
  - The updated belief is represented by k particles
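A minimal sketch of the particle-filtering idea follows, under simplifying assumptions: a flat state space, an unknown transition model with Dirichlet counts, and a known observation model `O[s2][o]`. Each particle is a hypothetical pair `(state, counts)` with implicit weight 1/k; the function and variable names are illustrative.

```python
import copy
import random


def particle_update(particles, a, o, O, k, rng):
    """Draw k updated components instead of enumerating all of them.

    particles: list of (state, counts), equally weighted.
    counts[s][a][s2] are Dirichlet hyperparameters; O[s2][o] is assumed known.
    """
    candidates, weights = [], []
    for i, (s, counts) in enumerate(particles):
        row = counts[s][a]
        total = sum(row)
        for s2, n in enumerate(row):
            candidates.append((i, s2))
            # Weight of the updated component: expected transition prob. times Pr(o | s2).
            weights.append((n / total) * O[s2][o])
    if sum(weights) == 0.0:
        raise ValueError("observation has zero probability under this belief")
    draws = rng.choices(candidates, weights=weights, k=k)
    new_particles = []
    for i, s2 in draws:
        s, counts = particles[i]
        new_counts = copy.deepcopy(counts)
        new_counts[s][a][s2] += 1  # posterior Dirichlet after seeing s -> s2
        new_particles.append((s2, new_counts))
    return new_particles


# Two states, one action, uniform prior counts; all numbers are made up.
rng = random.Random(0)
prior = [[[1.0, 1.0]], [[1.0, 1.0]]]
O = [[0.8, 0.2], [0.3, 0.7]]
particles = [(0, copy.deepcopy(prior)) for _ in range(5)]
for o in [0, 1, 1]:
    particles = particle_update(particles, 0, o, O, k=5, rng=rng)
print(len(particles))
```

The belief stays at k components no matter how many steps are taken, at the cost of sampling error.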

15 Conclusion
- Bayesian model-based reinforcement learning
- The prior belief is a mixture of products of Dirichlets
- The posterior belief is also a mixture of products of Dirichlets, with the number of mixture components growing exponentially with time
- α-functions (associated with value functions) are also represented as mixtures of products of Dirichlets that grow exponentially with time
- Partially observable BEETLE algorithm

