
Reinforcement Learning with Partially Known World Dynamics




1 Reinforcement Learning with Partially Known World Dynamics
Christian R. Shelton Stanford University

2 Reinforcement Learning
[Diagram: the agent-environment loop. The environment has state and dynamics; the agent takes actions, receives rewards, and pursues a goal.]
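To make the loop in the diagram concrete, here is a minimal sketch of the interaction it depicts. The Environment and Agent classes, the toy dynamics, and the random policy are all illustrative stand-ins, not anything from the talk:

```python
# Minimal sketch of the agent-environment loop shown in the diagram.
# Everything here (the toy dynamics, the random policy) is illustrative.
import random

class Environment:
    def __init__(self):
        self.state = 0  # internal state; the agent never sees it directly

    def step(self, action):
        self.state = (self.state + action) % 10   # toy dynamics
        observation = self.state % 2              # partial observability
        reward = 1.0 if self.state == 0 else 0.0  # goal signal
        return observation, reward

class Agent:
    def act(self, observation):
        return random.choice([0, 1])  # placeholder policy

env, agent = Environment(), Agent()
obs, total_return = 0, 0.0
for t in range(100):
    action = agent.act(obs)          # agent acts on what it observed
    obs, reward = env.step(action)   # environment updates and rewards
    total_return += reward
```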

3 Motivation
Reinforcement learning promises great things:
- Automatic task optimization
- Without any prior information about the world
Reinforcement learning is hard:
- The optimization goal could be arbitrary
- Every new situation might be different
So modify the problem slightly:
- Keep the basic, general, flexible framework
- Allow the specification of domain knowledge
- Don't require full specification of the problem (as planning does)

4 Our Approach
Partial world modeling:
- Keep the partial observability
- Allow conditional dynamics
Example:
- Known dynamics: sensor models, motion models, etc.
- Unknown dynamics: enemy movements, maps, etc.
Flexible barrier between the two

5 Partially Known Markov Decision Process (PKMDP)
[Diagram: two Markov chains unrolled over time. Unknown dynamics: s0 -> s1 -> s2. Known dynamics: x0 -> x1 -> x2.]

6 Partially Known Markov Decision Process (PKMDP)
[Diagram: the unknown chain, now written z0, z1, z2, is coupled to the known chain x0, x1, x2 through interface variables y1, y2.]

7 Partially Known Markov Decision Process (PKMDP)
[Diagram: the same network with observation variables o0, o1, o2 added.]

8 Partially Known Markov Decision Process (PKMDP)
[Diagram: the same network with action variables a0, a1, a2 added.]

9 Partially Known Markov Decision Process (PKMDP)
[Diagram: the complete DBN with a legend. Unknown: y0, z0, y1, z1, y2, z2. Known but unobserved: x0, x1, x2. Observed: o0, o1, o2 and a0, a1, a2.]
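One way to hold this factorization in code is to keep the known part as explicit conditional distributions and the unknown part as nothing but logged data. The class below is a hypothetical container; the exact dependency arrows among x, y, and z are not recoverable from the transcript, so the signatures are assumptions:

```python
# Hypothetical container for a PKMDP. It separates what the slide marks as
# known (supplied as a model) from what is unknown (available only as data).
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class PKMDP:
    # Known dynamics: explicit conditional distributions, so they can be
    # reasoned about exactly with DBN inference.
    known_transition: Callable[[Any, Any, Any], Dict]  # assumed P(x' | x, a, y)
    observation_model: Callable[[Any], Dict]           # assumed P(o | x)
    # Unknown dynamics (z, and the interface variables y): no model given,
    # only logged trajectories, each step recorded as (o, a, y, z) to match
    # the algorithm's stated input.
    trajectories: List[List[Tuple[Any, Any, Any, Any]]] = field(default_factory=list)
```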

10 Algorithm Outline
Input:
- Set of trajectories (o, a, y, z)
- Set of policies
Output:
- Policy that maximizes expected return
Method:
- Construct a non-parametric model of the return (a standard form is sketched below)
- Maximize with respect to the policy
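The transcript does not show the estimator itself, so the following is only the standard importance-sampling form that a non-parametric model of the return built from logged trajectories usually takes; the notation ($\pi_b$ for the behavior policy, $h_t^i$ for the observable history, $R^i$ for the return of trajectory $i$) is mine, not the talk's:

$$
\hat{J}(\pi) \;=\; \frac{1}{N}\sum_{i=1}^{N}
\left( \prod_{t} \frac{\pi\left(a_t^i \mid h_t^i\right)}{\pi_b\left(a_t^i \mid h_t^i\right)} \right) R^i
$$

The next slide's point is that when part of the dynamics is known, pieces of this estimate can be computed exactly by inference in the DBN rather than sampled, which lowers the variance.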

11 Algorithm Details
Unknown dynamics:
- Use experience
- Importance sampling
Known dynamics:
- Use the model
- DBN inference
- Exact calculation: lower variance
Maximize using conjugate gradient:
- A policy search method, but not a policy-gradient method
(A minimal sketch of this estimate-and-optimize loop follows.)
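Here is a minimal sketch of that estimate-and-optimize pattern: importance-sample the return from logged trajectories, then hand the (negated) estimate to a conjugate-gradient routine. The softmax policy, the trajectory format, and the use of scipy's CG minimizer are all assumptions; the talk's exact objective, with its DBN-computed terms, is not reproduced here:

```python
# Sketch of importance-sampled return estimation plus conjugate-gradient
# maximization. Policy form and trajectory format are assumptions.
import numpy as np
from scipy.optimize import minimize

def policy_prob(theta, obs, action, n_actions):
    """Softmax policy over actions, conditioned on a discrete observation."""
    logits = theta.reshape(-1, n_actions)[obs]
    p = np.exp(logits - logits.max())
    return (p / p.sum())[action]

def estimated_return(theta, trajectories, n_actions):
    """Importance-sampled return: reweight each logged trajectory by the
    ratio of new-policy to behavior-policy action probabilities."""
    total = 0.0
    for traj in trajectories:  # traj: list of (obs, action, behavior_prob, reward)
        w, ret = 1.0, 0.0
        for obs, action, b_prob, reward in traj:
            w *= policy_prob(theta, obs, action, n_actions) / b_prob
            ret += reward
        total += w * ret
    return total / len(trajectories)

def optimize_policy(trajectories, n_obs=2, n_actions=2):
    theta0 = np.zeros(n_obs * n_actions)
    res = minimize(lambda th: -estimated_return(th, trajectories, n_actions),
                   theta0, method='CG')  # conjugate gradient, as on the slide
    return res.x
```

As I read the slide's distinction, the method searches policy space by ascending a constructed estimate of the return, rather than estimating the policy gradient directly from samples.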

12 Total Estimate
For each sample, K and V involve reasoning in the DBN (a filtering sketch follows):
[Diagram: one sampled trajectory unrolled in the DBN over z0, x0, y0, o1, a1, z1, x1, y1, o2, a2, z2, x2, y2.]
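The transcript does not define K and V, but the "reasoning in the DBN" over the known-but-unobserved chain x amounts to standard exact filtering. Below is a minimal forward pass over a discrete chain; every distribution here is a hypothetical stand-in:

```python
# Exact forward filtering over a discrete hidden chain, as a stand-in for
# the slide's DBN reasoning. prior, T, and O are hypothetical inputs.
import numpy as np

def forward_filter(prior, T, O, observations):
    """Belief over x_t given o_1..o_t.
    prior: (S,) initial distribution over x
    T: (S, S) transitions, T[i, j] = P(x'=j | x=i)
    O: (S, M) observation model, O[j, o] = P(o | x=j)
    """
    belief = np.asarray(prior, dtype=float)
    for o in observations:
        belief = belief @ T        # predict the next state distribution
        belief = belief * O[:, o]  # weight by the observation likelihood
        belief /= belief.sum()     # renormalize to a distribution
    return belief
```

Because these quantities come exactly from the known model rather than from samples, they contribute no sampling variance, which is the "lower variance" benefit cited on the previous slide.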

13 Load-Unload Example
26 states, 14 observations, 4 actions
Three versions:
1. No world knowledge
2. Memory dynamics known
3. End-point & memory dynamics known

14 Clogged Pipe Example
144 states, 12 observations, 8 actions
Three versions:
1. Memory only
2. Known cart control
3. Incoming flow unknown

15 Conclusion
Advantages:
- Uses samples to estimate the unknown dynamics
- Uses the exact dynamics when they are known
- Allows natural specification of domain knowledge
Current work:
- Improving the gradient-ascent planner
- Exploiting structure within the known dynamics
- Removing the requirement that the interface be observable




