
Partially Observable Markov Decision Process By Nezih Ergin Özkucur.


1 Partially Observable Markov Decision Process By Nezih Ergin Özkucur

2 Contents
- Markov Decision Process (MDP)
  - Value Iteration Algorithm
  - Reinforcement Learning
- Partially Observable Markov Decision Process (POMDP)
  - POMDP vs MDP
  - Value Function Representation
  - Exact algorithms
- Kalman Filtering
- ART2A Network
- ARKAQ Learning Algorithm

3 Markov Decision Process Consider an agent that must act rationally in an environment. At each discrete time step, the agent must choose one of the available actions, and over the long term it tries to obtain good results. An MDP is a way to model this kind of problem; once the problem is modeled, we can run automated algorithms to solve it.

4 MDP Components An MDP is defined by a tuple (S, A, T, R) where:
- S is a finite set of states describing the situation of the environment.
- A is a finite set of actions, from which the agent must choose at each time step.
- T (the state transition function) maps S x A to probability distributions over S. T(s, a, s') is the probability of ending up in state s' when the agent was in state s and chose action a.
- R (the reward function) maps S x A to real numbers.
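A minimal sketch of how these components might be stored in code; the tiny two-state MDP below is invented purely for illustration and is not taken from the slides:

```python
# Illustrative MDP components (S, A, T, R) for a made-up two-state problem.

S = ["s0", "s1"]                 # finite state set
A = ["stay", "move"]             # finite action set

# T[(s, a)] is a probability distribution over next states s'
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0, ("s1", "move"): 0.0,
}

# Sanity check: each T(s, a, .) must be a valid probability distribution
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```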

5 Value Iteration Algorithm A policy (π) is a mapping from S to A that gives the action to select in each state. The value of a state is the expected long-term return starting from that state. The algorithm repeatedly applies the update rule below.
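In its standard form (the discount factor γ and the iterate V_t are the usual textbook notation, not symbols defined in the slides):

$$V_{t+1}(s) \;=\; \max_{a \in A}\Big[\, R(s,a) \;+\; \gamma \sum_{s' \in S} T(s,a,s')\, V_t(s') \Big]$$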

6 Q-Learning Algorithm Action values Q(s, a) estimate the expected long-term return of taking action a in state s and acting optimally afterwards. They are learned from sampled experience with the update rule below.
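The standard tabular Q-learning update, with learning rate α, discount factor γ, observed reward r, and observed next state s':

$$Q(s,a) \;\leftarrow\; Q(s,a) \;+\; \alpha\Big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]$$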

7 Partially Observable Markov Decision Process Consider an MDP in which the agent cannot observe the state completely. We can model this problem with a POMDP. A POMDP has two more components: O is the finite set of observations, and O(s, a, o) is the probability of making observation o from state s after having taken action a.

8 Agent’s Internal State The agent can represent the situation of the environment with belief states. A belief state b is a probability distribution over S; b(s) is the probability of being in state s when the belief state is b. The next belief state can be calculated from the previous one, as shown below.
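The standard belief-update (state estimation) rule, after taking action a and observing o, written with the T and O functions defined above:

$$b'(s') \;=\; \frac{O(s',a,o)\, \sum_{s \in S} T(s,a,s')\, b(s)}{\Pr(o \mid b, a)}$$

where the denominator is simply the normalizing constant that makes b' sum to one.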

9 MDP vs POMDP

10 Belief State Example
Observations: [goal, non-goal]
Step 1: b = [0.33, 0.33, 0, 0.33]
Step 2: b = [0, 0.5, 0, 0.5]
Step 3: b = [0, 0, 0, 1]
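A minimal sketch of this belief update as code; the grid's actual transition and observation models are not given in the slides, so the arrays below are placeholder values rather than the example's real numbers:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One step of the POMDP belief update.

    b : current belief over the |S| states, shape (|S|,)
    T : T[a][s, s'] -- transition probabilities
    O : O[a][s', o] -- observation probabilities
    Returns the normalized next belief b'.
    """
    # b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') * b(s)
    unnormalized = O[a][:, o] * (b @ T[a])
    return unnormalized / unnormalized.sum()

# Placeholder 4-state model (illustrative numbers, not the slide's grid):
T = {0: np.full((4, 4), 0.25)}           # one action, uniform transitions
O = {0: np.array([[0.9, 0.1],            # two observations: goal / non-goal
                  [0.1, 0.9],
                  [0.1, 0.9],
                  [0.1, 0.9]])}
b = np.full(4, 0.25)                      # start from a uniform belief
print(belief_update(b, a=0, o=0, T=T, O=O))
```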

11 Value Iteration Algorithm We can rewrite the transition probabilities and the reward function over belief states and then try to apply the value iteration algorithm. The problem is how to represent the value function, and how to iterate over the infinite belief space.
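Concretely, the belief-space ("belief MDP") reward and transition functions take roughly the following form (the symbols ρ, τ, and SE are assumed notation, following Kaelbling et al., 1998):

$$\rho(b,a) \;=\; \sum_{s \in S} b(s)\, R(s,a), \qquad \tau(b,a,b') \;=\; \sum_{o \in O} \Pr(o \mid b,a)\; \mathbf{1}\big[\, b' = SE(b,a,o) \,\big]$$

where SE(b, a, o) denotes the belief update from the previous slide.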

12 Value Function Representation The value function can be approximated by a finite set of vectors; the resulting function has the piecewise linear and convex (PWLC) property.
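With a vector set Γ (the vectors α are often called alpha-vectors; both symbols are assumed notation here), the value of a belief is:

$$V(b) \;=\; \max_{\alpha \in \Gamma} \sum_{s \in S} \alpha(s)\, b(s)$$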

13 Witness Algorithm Start with a set of belief points located at the corners of the belief space. At each iteration, find a witness point, i.e. a belief at which the current vector set underestimates the true backed-up value. Calculate the new vector at that point and add it to the vector set. Stop when there is no witness.
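Roughly, a belief b is a witness when the exact one-step backup at b beats every vector currently in the set (Q_a and the current set Γ̂ are assumed notation):

$$\exists\, b:\quad Q_a(b) \;>\; \max_{\alpha \in \hat{\Gamma}} \sum_{s \in S} b(s)\, \alpha(s)$$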

14 Incremental Pruning Algorithm

15 Heuristic Search Value Iteration Algorithm (HSVI)

16 ARKAQ Learning

17 Result of ARKAQ Learning Algorithm: 4x4 Grid Problem

18 References
- Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
- Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of Uncertainty in Artificial Intelligence (UAI), 1997.
- Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proc. of UAI, 2004.
- Alp Sardağ. Autonomous Strategy Planning Under Uncertainty. PhD Thesis, Boğaziçi University, 2006.

