Reinforcement Learning Dealing with Complexity and Safety in RL Subramanian Ramamoorthy School of Informatics 27 March, 2012.

1 Reinforcement Learning: Dealing with Complexity and Safety in RL
Subramanian Ramamoorthy, School of Informatics
27 March, 2012

2 (Why) Isn’t RL Deployed More Widely?
Very interesting discussion at: 20Reinforcement%20Learning, maintained by Satinder Singh
Negative views/myths: RL is hard due to dimensionality, partial observability, function approximation, etc.
Positive view: There is no getting away from the fact that RL is the proper statement of the “agent’s problem”. So, the question is really one of how to solve it!

3 A Provocative Claim
“The (PO)MDP frameworks are fundamentally broken, not because they are insufficiently powerful representations, but because they are too powerful. We submit that, rather than generalizing these models, we should be specializing them if we want to make progress on solving real problems in the real world.”
T. Lane, W.D. Smart, Why (PO)MDPs Lose for Spatial Tasks and What to Do About It, ICML Workshop on Rich Representations for RL, 2005.

4 What is the Issue? (Lane et al.)
In our efforts to formalize the notion of “learning control”, we have striven to construct ever more general and, putatively, powerful models. By the mid-1990s we had (with a little bit of blatant “borrowing” from the Operations Research community) arrived at the (PO)MDP formalism (Puterman, 1994) and grounded our RL methods in it (Sutton & Barto, 1998; Kaelbling et al., 1996; Kaelbling et al., 1998).
These models are mathematically elegant, have enabled precise descriptions and analysis of a wide array of RL algorithms, and are incredibly general. We argue, however, that their very generality is a hindrance in many practical cases.
In their generality, these models have discarded the very qualities (metric, topology, scale, etc.) that have proven to be so valuable for many, many science and engineering disciplines.

5 What is Missing in POMDPs?
POMDPs do not describe natural metrics in environment
– When driving, we know both global and local distances
POMDPs do not natively recognize differences between scales
– Uncertainty in control is entirely different from uncertainty in routing
POMDPs conflate properties of the environment with properties of the agent
– Roads and buildings behave differently from cars and pedestrians: we need to generalize over them differently
POMDPs are defined in a global coordinate frame, often discrete!
– We may need many different representations in real problems

6 Specific Insight #1
Metric of a space imposes a “speed limit” on the agent: the agent cannot transition to arbitrary points in the environment in a single step.
Consequences:
Agent can neglect large parts of the state space when planning.
More importantly, however, this result implies that control experience can be generalized across regions of the state space.
– If the agent learns a good policy for one bounded region of the state space, and it can find a second bounded region that is homeomorphic to the first, it can reuse that policy in the second region.
Figure: Metric envelope bound for point-to-point navigation in an open-space gridworld environment. The outer region is the elliptical envelope that contains 90% of the trajectory probability mass. The inner, darker region is the set of states occupied by an agent in a total of 10,000 steps of experience (319 trajectories from bottom to top).
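
The "speed limit" argument can be illustrated with a minimal sketch. It assumes a 4-connected, obstacle-free gridworld with Manhattan distance and a hypothetical step budget; the function names and the budget parameter are illustrative and do not come from Lane and Smart. Because each step moves one cell, a cell s can lie on a start-to-goal path of at most `budget` steps only if d(start, s) + d(s, goal) <= budget, which gives an ellipse-like envelope with the start and goal as foci.

```python
def manhattan(a, b):
    """Manhattan distance between two grid cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def metric_envelope(width, height, start, goal, budget):
    """Cells that can lie on some start-to-goal path of at most `budget` steps.

    Assumes an open grid (no obstacles) where each step moves one cell, so a
    path through cell s needs at least d(start, s) + d(s, goal) steps.
    """
    return {
        (r, c)
        for r in range(height)
        for c in range(width)
        if manhattan(start, (r, c)) + manhattan((r, c), goal) <= budget
    }


if __name__ == "__main__":
    # Hypothetical bottom-to-top navigation task on a 21 x 21 grid.
    envelope = metric_envelope(21, 21, start=(0, 10), goal=(20, 10), budget=24)
    print(f"{len(envelope)} of {21 * 21} cells need to be considered when planning")
```

A planner restricted to the envelope can ignore every other cell, which is the sense in which the metric lets the agent neglect large parts of the state space.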

7 Insight #2: Manifold Representations
Informally, a manifold representation models the domain of the value function using a set of overlapping local regions, called charts.
Each chart has a local coordinate frame, is a (topological) disk, and has a (local) Euclidean distance metric.
The collection of charts and their overlap regions is called a manifold.
We can embed partial value functions (and other models) on these charts, and combine them, using the theory of manifolds, to provide a global value function (or model).
Figure annotation: 13 equivalence classes. If you consider rotational symmetry, only 4 classes.
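
As a rough illustration of combining chart-local value functions, here is a minimal one-dimensional sketch. It assumes charts are open intervals and that the combination step is a partition-of-unity blend with tent-shaped weights; the slide does not state which blending scheme is used, so that choice and all names below are assumptions.

```python
def blend_weight(x, lo, hi):
    """Tent-shaped weight: positive inside the open chart (lo, hi), zero outside."""
    if x <= lo or x >= hi:
        return 0.0
    mid = 0.5 * (lo + hi)
    return (x - lo) / (mid - lo) if x <= mid else (hi - x) / (hi - mid)


def global_value(x, charts):
    """Blend local value functions over all charts that contain x.

    Each chart is a tuple (lo, hi, local_value), where local_value takes the
    chart-local coordinate x - lo.
    """
    weights = [blend_weight(x, lo, hi) for lo, hi, _ in charts]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is not covered by any chart")
    blended = sum(w * v(x - lo) for w, (lo, _, v) in zip(weights, charts))
    return blended / total


if __name__ == "__main__":
    # Two overlapping charts whose local functions both encode V(x) = x.
    charts = [(0.0, 2.0, lambda u: u), (1.0, 3.0, lambda u: u + 1.0)]
    print(global_value(1.5, charts))
```

In the overlap region both charts contribute, and the normalised weights keep the global function continuous as x moves from one chart into the next.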

8 What Makes Some POMDP Problems Easy to Approximate?
David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007

9 Understanding Why PBVI Works
Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for POMDPs.
What are the belief-space properties that allow some POMDP problems to be approximated efficiently, explaining the point-based algorithms’ success?

10 Hardness of POMDPs
Intractability due to curse of dimensionality
Size of belief space grows exponentially with state space, |S|
But, in recent years, good progress has been made in sampling the belief space and approximating solutions
Hsu et al. refer to solutions to a POMDP with hundreds of states in seconds
– Tag problem: robot needs to search for a moving tag (whose position is unobserved except when robot bumps into it), ~870-dim space
– Solved using PBVI methods in <1 minute

11 Initial Observation
Many point-based algorithms only explore a subset of the belief space B, the reachable space R(b0)
The reachable space contains all points reachable from a given initial belief point b0 under arbitrary sequences of actions and observations
– Is the reason for PBVI’s success that reachable space is small?
– Not always: Tag has approx. 860-dim reachable space.
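
To make the reachable space concrete, the following sketch enumerates the beliefs reachable from b0 within a fixed number of steps using the standard Bayes belief update. The table layout (`T[a][s][s2]` for transitions, `O[a][s2][o]` for observations), the depth limit, and the rounding used to merge duplicate beliefs are assumptions made for illustration; this is not the sampling scheme of any particular point-based solver.

```python
def belief_update(b, T, O, a, o):
    """Bayes filter: b'(s2) is proportional to O[a][s2][o] * sum_s T[a][s][s2] * b[s]."""
    n = len(b)
    unnorm = [
        O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnorm)
    if total == 0.0:
        return None  # observation o cannot occur after action a from belief b
    return tuple(x / total for x in unnorm)


def reachable_beliefs(b0, T, O, depth, ndigits=6):
    """Beliefs reachable from b0 within `depth` steps, rounded to merge duplicates."""
    frontier = {tuple(round(x, ndigits) for x in b0)}
    seen = set(frontier)
    for _ in range(depth):
        new_frontier = set()
        for b in frontier:
            for a in range(len(T)):
                for o in range(len(O[a][0])):
                    b2 = belief_update(b, T, O, a, o)
                    if b2 is None:
                        continue
                    b2 = tuple(round(x, ndigits) for x in b2)
                    if b2 not in seen:
                        seen.add(b2)
                        new_frontier.add(b2)
        frontier = new_frontier
    return seen


if __name__ == "__main__":
    # Hypothetical 2-state POMDP: one action, static state, perfect observations.
    T = [[[1.0, 0.0], [0.0, 1.0]]]
    O = [[[1.0, 0.0], [0.0, 1.0]]]
    print(sorted(reachable_beliefs((0.5, 0.5), T, O, depth=3)))
```

Even in this toy case the reachable set is a handful of points rather than the whole belief simplex, which is why restricting attention to it can help.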

12 Covering Number
Covering number of a space is the minimum number of balls of a given size needed to cover the space fully
Hsu et al. show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0)
Covering number also reveals that the belief space for Tag behaves more like the union of some 29-dimensional spaces rather than an 870-dimensional space, as the robot’s position is fully observed.
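
For a finite sample of belief points, an upper bound on the covering number can be obtained greedily. The sketch below assumes L1 distance between beliefs and restricts ball centers to the sampled points; both are illustrative choices, and the result is only an upper bound for the sample, not the exact covering number of the reachable space.

```python
def l1_distance(p, q):
    """L1 distance between two belief vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def greedy_cover(points, radius):
    """Pick centers so that every point lies within `radius` of some center.

    A point becomes a new center only if no existing center already covers it.
    The number of centers upper-bounds the covering number of the sample when
    centers are restricted to sampled points.
    """
    centers = []
    for p in points:
        if all(l1_distance(p, c) > radius for c in centers):
            centers.append(p)
    return centers


if __name__ == "__main__":
    sample = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]
    centers = greedy_cover(sample, radius=0.3)
    print(f"{len(centers)} balls of radius 0.3 cover the sample")
```

A small count relative to what the ambient dimension would suggest indicates that the sampled beliefs concentrate on a lower-dimensional subset.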

13 Further Questions
Is it possible to compute an approximate solution efficiently under the weaker condition of having a small covering number for an optimal reachable space R*(b0), which contains only points in B reachable from b0 under an optimal policy?
Unfortunately, this problem is NP-hard. The problem remains NP-hard, even if the optimal policies have a compact piecewise-linear representation using α-vectors.
However, given a suitable set of points that “cover” R*(b0) well, a good approximate solution can be computed in polynomial time.
Using sampling to approximate an optimal reachable space, and not just the reachable space, may be a promising approach in practice.

14 Lyapunov Design for Safe Reinforcement Learning
Theodore J. Perkins and Andrew G. Barto, JMLR 2002

15 Dynamical Systems
Dynamical systems can be described by states and evolution of states over time
The evolution of states is constrained by dynamics of the system
In other words, dynamical systems are mappings from current state to next state
If the mapping is a contraction, the state will eventually converge to a fixed point
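
The contraction claim can be checked numerically with a small sketch. The scalar map f(x) = 0.5 x + 1 is a hypothetical example chosen for illustration: it shrinks distances by a factor of 0.5 and has the fixed point x = 2.

```python
def fixed_point_iteration(f, x0, tol=1e-10, max_steps=1000):
    """Apply x <- f(x) repeatedly until successive states differ by less than tol.

    Returns the final state and the number of steps taken.
    """
    x = x0
    for step in range(1, max_steps + 1):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt, step
        x = nxt
    raise RuntimeError("no convergence within max_steps")


def contraction(x):
    """Example map with contraction factor 0.5 and fixed point x = 2."""
    return 0.5 * x + 1.0


if __name__ == "__main__":
    state, steps = fixed_point_iteration(contraction, x0=10.0)
    print(f"converged to {state:.6f} after {steps} steps")
```

Starting from any initial state, the distance to the fixed point halves at every step, so the trajectory converges regardless of where it begins.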

16 Reinforcement Learning – Traditional Methods
The target or goal state may not be a natural attractor
Hypothesis: Learning is easier if target is a fixed point, e.g., TD-Gammon
People have tried to embed domain knowledge in various ways:
– Known good actions are specified
– Sub-goals are explicitly specified

17 Key Idea
Use Lyapunov functions to constrain action selection
This forces the RL agent to move towards the goal
e.g., in a grid world, the agent reaches the goal in a finite number of steps if its actions are Lyapunov constrained
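
The grid-world example can be sketched as follows. The sketch assumes a deterministic, obstacle-free square grid and uses Manhattan distance to the goal as the candidate Lyapunov function; a uniformly random policy stands in for the learning agent. These choices and all names are illustrative and are not taken from Perkins and Barto.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def lyapunov(state, goal):
    """Candidate Lyapunov function: Manhattan distance to the goal."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


def step(state, action, size):
    """Deterministic move, clipped at the grid boundary."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), size - 1)
    col = min(max(state[1] + dc, 0), size - 1)
    return (row, col)


def safe_actions(state, goal, size):
    """Actions whose successor strictly decreases the Lyapunov function."""
    current = lyapunov(state, goal)
    return [a for a in ACTIONS if lyapunov(step(state, a, size), goal) < current]


def run_episode(start, goal, size, rng):
    """Follow a random policy restricted to safe actions; return the step count."""
    state, steps = start, 0
    while state != goal:
        state = step(state, rng.choice(safe_actions(state, goal, size)), size)
        steps += 1
    return steps


if __name__ == "__main__":
    steps = run_episode(start=(0, 0), goal=(4, 4), size=5, rng=random.Random(0))
    print(f"goal reached in {steps} steps")
```

Since every permitted action lowers the Lyapunov value by one, any policy over the restricted action set reaches the goal in exactly lyapunov(start, goal) steps, however poor its choices are otherwise.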

18 Problem Setup
Deterministic dynamical system
Evolution according to MDP

19 Lyapunov Functions
Generalized energy functions

20 Pendulum Problem

21 Results 1
A_EA and A_All had shorter trials than A_Const
A_EA outperformed A_All, especially at fine resolutions of discretization
A_EA trial times seemed independent of binning
A_Const alone never worked
Note: Theorem guarantees that A_EA monotonically increases energy.

22 Results 2
1: A_EA, G_2
2: A_All, G_2
3: A_Const, G_2
4: A_All + sat LQR, G_1

23 Stochastic Case

24 Results – Stochastic Case

25 Some Open Questions
How can you improve performance using less sophisticated ‘primitive’ actions?
Perkins and Barto use deep intuition to design local laws, e.g., to avoid undesired gravity-control equilibria. How do we deal with this when the dynamics is less understood?
Stochastic cases have rather weak guarantees. How can they be improved?
