
1 Exploration in Reinforcement Learning. Jeremy Wyatt, Intelligent Robotics Lab, School of Computer Science, University of Birmingham, UK. jlw@cs.bham.ac.uk, www.cs.bham.ac.uk/research/robotics, www.cs.bham.ac.uk/~jlw

2 The talk in one slide. Optimal learning problems: how to act while learning how to act. We are going to look at this in the setting of learning from rewards. Old heuristic: be optimistic in the face of uncertainty. Our method: apply the principle of optimism directly to a Bayesian model of how the world works.

3 Plan. Reinforcement learning; how to act while learning from rewards; an approximate algorithm; results; learning with structure.

4 Reinforcement Learning (RL). Learning from punishments and rewards: the agent moves through the world, observing states and rewards, and adapts its behaviour to maximise some function of reward. [Figure: an example trajectory of states s_t, actions a_t and rewards r_t, including rewards of +3 and +50.]
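
The slide's trajectory diagram is lost in the transcript, but what it depicts is the standard agent-environment loop. A minimal Python sketch of that loop follows; the environment interface (reset/step) and the random policy are hypothetical stand-ins, not anything specified in the talk:

```python
import random

def run_episode(env, policy, max_steps=100):
    """Roll out one episode: the agent observes states and rewards
    and chooses actions according to its policy."""
    s = env.reset()
    trajectory = []                      # (s_t, a_t, r_{t+1}) triples
    for _ in range(max_steps):
        a = policy(s)                    # act according to the current policy
        s_next, r, done = env.step(a)    # environment returns next state and reward
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

# A random policy over a hypothetical discrete action set.
def random_policy(state, actions=("a1", "a2")):
    return random.choice(actions)
```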

5 Reinforcement Learning (RL). Let's assume our agent acts according to some rules, called a policy, π. The return R_t is a measure of the long-term reward collected after time t. [Figure: the rewards r_t collected along the example trajectory, including +3 and +50.]
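
The return formula on the slide is not captured in the transcript. The later "Cumulative Discounted Return" slides suggest the standard discounted definition, R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... A small sketch computing it from a finite reward sequence (the example rewards and their positions are assumed):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite sequence of rewards."""
    R = 0.0
    for k, r in enumerate(rewards):
        R += (gamma ** k) * r
    return R

# e.g. a trajectory that picks up the slide's example rewards +3 and +50
print(discounted_return([0, 3, 0, 0, 50], gamma=0.9))
```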

6 Reinforcement Learning (RL). R_t is a random variable, so it has an expected value in a state under a given policy. The RL problem is to find the optimal policy π* that maximises this expected value in every state.

7 Markov Decision Processes (MDPs). The transitions between states are uncertain, and the transition probabilities depend only on the current state (and the action taken). The model consists of a transition matrix P and a reward function R. [Figure: a two-state MDP with actions a1 and a2 and rewards r = 2 and r = 0.]
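
For reference, the slide's two-state example can be written down directly as a transition model and reward function. Only the structure (two states, actions a1 and a2, rewards r = 2 and r = 0) comes from the slide; the probabilities and the assignment of rewards to states below are made up for illustration:

```python
# P[action][state] is a distribution over next states (probabilities assumed).
P = {
    "a1": {1: {1: 0.8, 2: 0.2},   # from state 1 under action a1
           2: {1: 0.1, 2: 0.9}},
    "a2": {1: {1: 0.3, 2: 0.7},
           2: {1: 0.5, 2: 0.5}},
}

# R[state]: which state carries r = 2 and which r = 0 is an assumption here.
R = {1: 0.0, 2: 2.0}
```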

8 Bellman equations and bootstrapping. Conditional independence allows us to define the expected return V* for the optimal policy π* in terms of a recurrence relation (the equation shown on the slide is not captured in the transcript; the standard form is given below). We can use the recurrence relation to bootstrap our estimate of V* in two ways. [Figure: a backup from state i under action a over successor states 3, 4 and 5.]
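
The recurrence itself appears only as an image on the slide. The standard Bellman optimality equation it almost certainly refers to, written in the slide's state-indexed notation, is:

```latex
V^{*}(i) \;=\; \max_{a} \sum_{j} P^{a}_{ij} \left( R^{a}_{ij} + \gamma\, V^{*}(j) \right)
```

where P^a_{ij} is the probability of moving from state i to state j under action a, R^a_{ij} the associated expected reward, and γ the discount factor.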

9 Two types of bootstrapping. We can bootstrap using explicit knowledge of P and R (dynamic programming), or we can bootstrap using samples drawn from P and R (temporal difference learning); both are sketched below. [Figure: a full DP backup over successor states, contrasted with a single sampled transition (s_t, a_t, r_{t+1}, s_{t+1}).]
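
A minimal sketch of the two kinds of backup the slide contrasts, reusing the hypothetical P and R dictionaries from the MDP sketch above: a dynamic-programming backup needs the model explicitly, while a TD(0) backup needs only one sampled transition.

```python
def dp_backup(V, P, R, state, gamma=0.9):
    """One Bellman backup using explicit knowledge of P and R."""
    return max(
        sum(p * (R[s_next] + gamma * V[s_next])
            for s_next, p in P[a][state].items())
        for a in P
    )

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update from a single sampled transition
    (s_t, r_{t+1}, s_{t+1}); no model of P or R is required."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```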

10 Multi-agent RL: learning to play football. Learning to play in a team is too time-consuming to do on real robots, but there is a well-established simulator league, and we can learn effectively from reinforcement.

11 Learning to play backgammon. TD(λ) learning with a backprop net with one hidden layer; 1,500,000 training games (self-play); equivalent in skill to the top dozen human players. Backgammon has ~10^20 states, so it can't be solved using DP.

12 The exploration problem: intuition. We are learning to maximise performance, but how should we act while learning? Trade-off: exploit what we know, or explore to gain new information? Optimal learning: maximise performance while learning, given your imperfect knowledge.

13 The optimal learning problem. If we knew P it would be easy. However: we estimate P from observations; P is a random variable; there is a density f(P) over the space of possible MDPs. What is it? How does it help us solve our problem?

14 A density over MDPs. Suppose we've wandered around for a while, so we have a matrix M containing the transition counts. The density over possible P depends on M, written f(P|M), and is a product of Dirichlet densities. [Figure: the two-state MDP with its observed transition counts.]
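
A sketch of the posterior the slide describes: for each state-action pair the transition counts in M give a Dirichlet posterior over that row of P, and the posterior over the whole of P is the product of these. Uniform Dirichlet(1, ..., 1) priors and the example count values are assumptions made here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transition_model(counts, prior=1.0):
    """counts[s][a] is a vector of observed transition counts to each next state.
    Returns one P drawn from the product-of-Dirichlets posterior f(P | M)."""
    P = {}
    for s, actions in counts.items():
        P[s] = {}
        for a, m in actions.items():
            # posterior for this row is Dirichlet(counts + prior)
            P[s][a] = rng.dirichlet(np.asarray(m, dtype=float) + prior)
    return P

# e.g. two states, two actions; the count vectors are hypothetical
M = {1: {"a1": [3, 1], "a2": [0, 2]},
     2: {"a1": [1, 1], "a2": [4, 0]}}
P_sample = sample_transition_model(M)
```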

15 A density over multinomials. [Figure: a Dirichlet density over the parameters of a multinomial (the next-state probabilities for one state-action pair), given observed counts.]

16 Optimal learning formally stated. Given f(P|M), find the policy π that maximises the expected return averaged over the posterior f(P|M) (the slide's expression is not captured in the transcript; the standard form is given below). [Figure: the two-state MDP with its transition counts.]
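
The objective itself appears only as an image on the slide. The standard way of writing the Bayesian optimal learning objective that the surrounding slides point to is:

```latex
\pi^{*} \;=\; \arg\max_{\pi} \int V^{\pi}_{P}\, f(P \mid M)\, dP
```

where V^π_P denotes the expected return of policy π in the MDP whose transition matrix is P. The exact notation on the slide may differ; this is a reconstruction of the form the next slide's "integral" refers to.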

17 Transforming the problem. When we evaluate the integral we get another MDP, this time defined over the space of information states. This space grows exponentially in the depth of the lookahead. [Figure: a lookahead tree over information states, branching on actions a1 and a2 and leading to returns R2 and R3.]

18 A heuristic: optimistic model selection. Solving the information-state-space MDP is intractable. An old heuristic is to be optimistic in the face of uncertainty, so here we pick an optimistic P and find V* for that P only. How do we pick P optimistically?

19 Optimistic Model Selection. [The selection rule shown on the slide is not captured in the transcript; one possible reading is sketched below.] Then do some DP-style bootstrapping to improve the estimated V.
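
Since the rule itself is lost with the slide, the following is only a sketch under an assumed interpretation: draw a few candidate next-state distributions from the Dirichlet posterior for each state-action pair, keep the candidate whose one-step backup on the current value estimate is highest, and then run DP-style sweeps with that optimistic model. This is an illustration, not necessarily the talk's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimistic_transitions(counts, V, R, gamma=0.9, n_samples=20, prior=1.0):
    """For each (s, a), sample candidate next-state distributions from the
    Dirichlet posterior and keep the one with the highest backed-up value.
    States are assumed to be indexed 0..n-1; V and R are arrays over states."""
    P_opt = {}
    for s, actions in counts.items():
        P_opt[s] = {}
        for a, m in actions.items():
            alpha = np.asarray(m, dtype=float) + prior
            candidates = rng.dirichlet(alpha, size=n_samples)
            backups = candidates @ (R + gamma * V)   # one-step backup per candidate
            P_opt[s][a] = candidates[np.argmax(backups)]
    return P_opt

def dp_sweep(P, R, V, gamma=0.9):
    """One sweep of value iteration under the chosen optimistic model."""
    return np.array([max(P[s][a] @ (R + gamma * V) for a in P[s])
                     for s in sorted(P)])
```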

20 Experimental results

21 Bayesian view: performance while learning

22 Bayesian view: policy quality

23 Do we really care? Why solve MDPs at all? While challenging, they are too simple to be useful; structured representations are more powerful.

24 Model-based RL: structured models. The transition model P is represented compactly using a Dynamic Bayes Net (a factored MDP), and V is represented as a tree. Backups look like goal-regression operators. This work is converging with the AI planning community. A sketch of a factored transition model follows.
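
To make the compactness concrete: in a DBN / factored MDP the state is a vector of variables, and each variable's next value depends only on a few parent variables, so P(s'|s, a) factors into a product of small conditional tables. The variables, parents and numbers below are hypothetical, chosen only to show the structure:

```python
# P(s' | s, a) = prod_i P(s'_i | parents_i(s), a)

cpts = {
    # variable -> {(action, parent values...): probability the variable is True next}
    "wet":   {("mop", True): 0.1, ("mop", False): 0.0,
              ("wait", True): 0.9, ("wait", False): 0.0},
    "clean": {("mop", True): 0.8, ("mop", False): 0.9,
              ("wait", True): 0.2, ("wait", False): 0.3},
}

parents = {"wet": ["wet"], "clean": ["clean"]}   # each variable depends only on itself here

def factored_prob(s, a, s_next):
    """Probability of the full next state as a product of per-variable terms."""
    p = 1.0
    for var, cpt in cpts.items():
        key = (a,) + tuple(s[parent] for parent in parents[var])
        p_true = cpt[key]
        p *= p_true if s_next[var] else (1.0 - p_true)
    return p

# e.g. P(wet'=False, clean'=True | wet=True, clean=False, action="mop")
print(factored_prob({"wet": True, "clean": False}, "mop",
                    {"wet": False, "clean": True}))
```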

25 Structured Exploration: results

26 Challenge: learning with hidden state. This means learning in a POMDP, or k-Markov environment. Planning in POMDPs is intractable, but factored POMDPs look promising, and POMDPs are the basis of the state of the art in mobile robotics. [Figure: the POMDP influence diagram over hidden states s_t, observations o_t, actions a_t and rewards r_t.]
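 
In the hidden-state setting the agent cannot observe s_t directly and instead maintains a belief over states. The standard Bayes-filter belief update (a sketch only; the transition and observation models are whatever the problem supplies, represented here as assumed array-valued dictionaries) is:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b: current belief over states (vector); a: action taken; o: observation index.
    T[a][s, s'] = P(s' | s, a);  O[a][s', o] = P(o | s', a).
    Returns b'(s') proportional to O(o | s', a) * sum_s P(s' | s, a) b(s)."""
    predicted = b @ T[a]                 # sum_s b(s) P(s' | s, a)
    unnormalised = predicted * O[a][:, o]
    return unnormalised / unnormalised.sum()
```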

27 Wrap up. RL is a class of problems, and we can pose some optimal learning problems elegantly in this framework. We can't be perfect, but we can do alright. BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability. General probabilistic representations are best avoided. How?

28-30 Cumulative Discounted Return [results figures not captured in the transcript]

31-33 Policy Quality [results figures not captured in the transcript]

