Reinforcement Learning

1 Reinforcement Learning
CSLT ML Summer Seminar (12): Reinforcement Learning. Dong Wang. Most slides are from David Silver’s UCL Course on RL; some slides are from John Schulman’s ‘Deep Reinforcement Learning’, Lecture 1.

2 Content: What is reinforcement learning? Shallow discrete learning.
Deep Q learning.

3 What is reinforcement learning?
Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. Given a state from the environment, the agent learns which action to take; the environment then returns a (random) reward and moves to another (random) state. The probabilities of the reward and the next state are stationary.
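The interaction loop described above can be sketched in a few lines. This is only an illustration under assumptions: the TwoStateEnv class, its reward probabilities (0.8 and 0.2), and the epsilon-greedy averaging agent are all hypothetical and do not come from the slides.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with stationary reward and transition probabilities."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Matching the action to the current state pays off with probability 0.8,
        # the other action with probability 0.2 (fixed, i.e. stationary, numbers).
        p_reward = 0.8 if action == self.state else 0.2
        reward = 1.0 if self.rng.random() < p_reward else 0.0
        # The next state is random too; in this toy it does not depend on the action.
        self.state = self.rng.randrange(2)
        return self.state, reward


def run_agent(num_steps=4000, epsilon=0.1, seed=0):
    env = TwoStateEnv(seed)
    rng = random.Random(seed + 1)
    # Running average reward for every (state, action) pair.
    mean = [[0.0, 0.0], [0.0, 0.0]]
    count = [[0, 0], [0, 0]]
    state = env.state
    for _ in range(num_steps):
        # Trial and error: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: mean[state][a])
        next_state, reward = env.step(action)
        count[state][action] += 1
        mean[state][action] += (reward - mean[state][action]) / count[state][action]
        state = next_state
    return mean


estimates = run_agent()
```

After enough steps, estimates[s][s] should end up clearly larger than the estimate for the other action in each state, which is the behavior the agent was meant to discover.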

4 An example


6 Main component of RL

7 It is different from other tasks
Unlike supervised learning, there is no ‘label’ saying which action should be taken. Feedback is often delayed, e.g., in game playing. Time is important: it is a sequential decision process. Decisions affect the environment. Compared with unsupervised learning, it has some ‘supervision’ (the reward).


9 Applications: Fly stunt manoeuvres in a helicopter.
Defeat the world champion at Backgammon. Manage an investment portfolio. Control a power station. Make a humanoid robot walk. Play many different Atari games better than humans.

10 Robot

11 Robot

12 Business

13 Finance

14 Media

15 Medicine

16 Sequence prediction

17 Game playing

18 Some important things to mention
Whether the environment is known (e.g., the transition and reward probabilities). Whether we can observe the state (hidden or explicit). Whether we need to model the environment. Whether we learn from episodes or online. Whether we use function approximation or an explicit table.

19 Content: What is reinforcement learning? Shallow discrete learning.
Deep Q learning.

20 Markov decision process
Markov decision processes formally describe an environment for reinforcement learning where the environment is fully observable, i.e. the current state completely characterises the process. Almost all RL problems can be formalised as MDPs, e.g.: optimal control primarily deals with continuous MDPs; partially observable problems can be converted into MDPs; bandits are MDPs with one state.

21 Markov process

22 Markov reward process

23 An example of Markov reward process

24 Return in Markov reward process

25 Value function

26 Bellman Equation

27 Markov decision process (MDP)


29 Policy

30 Value function in MDP

31 Bellman Expectation Equation in MDP

32 Relation between two valuation functions

33 Relation between two valuation functions

34 Optimal value function

35 Optimal policy

36 Find optimal policy

37 Bellman optimization: The Bellman optimality equations are non-linear (because of the max()) and have no closed-form solution in general. Iterative procedures can solve them.
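For reference, the equations this slide refers to are usually written as below. The notation is an assumption borrowed from the standard textbook form (state-value function v_*, action-value function q_*, reward R, transition probability P, discount gamma); the slide's own formulas were not preserved in this transcript.

```latex
v_*(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \Big)

q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, \max_{a'} q_*(s', a')
```

The max over actions is what makes these equations non-linear in the unknown value function.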

38 Policy evaluation: Given a policy, compute its value function at each state.
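A minimal sketch of iterative policy evaluation follows. The two-state MDP stored in P, the policy always_1, and the discount 0.9 are made-up illustrations, not taken from the slides; the loop simply repeats the Bellman expectation backup until the values stop changing.

```python
# Hypothetical 2-state, 2-action MDP:
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def evaluate_policy(P, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation; policy[s][a] is the probability of action a in state s."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {}
        for s in P:
            # Expected one-step reward plus discounted value of the successor state.
            new_v[s] = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            return v


# A deterministic policy that always picks action 1.
always_1 = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}
values = evaluate_policy(P, always_1)
```

For this toy MDP the result can be checked by hand: state 1 satisfies v = 2 + 0.9 v, so v = 20, and state 0 satisfies v = 17.1 + 0.09 v, so v = 17.1 / 0.91.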

39 Improve policy

40 General policy process

41 Value iteration: This does not involve an explicit policy update; the optimal policy is learned implicitly through the ‘max’ operation.
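Value iteration can be sketched as follows, assuming a small hypothetical MDP (the table P and the discount 0.9 are illustrative only). No policy is stored during the iteration; it is read off at the end by acting greedily with respect to the converged values.

```python
# Hypothetical 2-state, 2-action MDP:
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def value_iteration(P, gamma=0.9, tol=1e-8):
    def backup(s, a, v):
        # One-step lookahead value of taking action a in state s.
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    v = {s: 0.0 for s in P}
    while True:
        # The max over actions replaces the explicit policy improvement step.
        new_v = {s: max(backup(s, a, v) for a in P[s]) for s in P}
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    # Extract the greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: backup(s, a, v)) for s in P}
    return v, policy


optimal_values, optimal_policy = value_iteration(P)
```

In this toy problem the greedy policy chooses action 1 in both states, and the optimal value of state 1 is 20.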


43 Can be performed asynchronously
Three simple ideas for asynchronous dynamic programming: in-place dynamic programming, prioritised sweeping, and real-time dynamic programming.

44 Full-width approach and sampling approach
The approaches above use all possible successor states in each update (full-width), so we have to know the dynamics of the system. The update can instead use actual sampled experience, which gives a model-free approach: the agent interacts with the environment and learns from the experience. The sampling approach is easier to implement and often more efficient, because each update uses one sampled transition rather than every successor.

45 Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience. MC is model-free: no knowledge of MDP transitions / rewards. MC learns from complete episodes: no bootstrapping. MC uses the simplest possible idea: value = mean return. Caveat: MC can only be applied to episodic MDPs; all episodes must terminate.
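The "value = mean return" idea can be sketched with first-visit Monte-Carlo evaluation. The environment is an assumption chosen for illustration: a five-state random walk (states 1 to 5, terminals at 0 and 6, reward 1 only when exiting on the right), for which the true value of state s is s/6 under the uniformly random policy.

```python
import random


def generate_episode(rng, n_states=5):
    """Random walk on states 1..n_states; 0 and n_states + 1 are terminal."""
    state = (n_states + 1) // 2
    episode = []  # list of (state, reward received after leaving that state)
    while 0 < state <= n_states:
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == n_states + 1 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode


def first_visit_mc(num_episodes=5000, gamma=1.0, seed=0):
    rng = random.Random(seed)
    value, counts = {}, {}
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        # Compute the return following every time step, working backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if state in seen:
                continue  # only the first visit in each episode counts
            seen.add(state)
            counts[state] = counts.get(state, 0) + 1
            old = value.get(state, 0.0)
            # Incremental mean: nudge the estimate towards the new return.
            value[state] = old + (g - old) / counts[state]
    return value


mc_values = first_visit_mc()
```

Dropping the seen check would turn this into every-visit MC; the incremental mean avoids storing all returns.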

46 MC evaluation Do not touch the policy, just the value function

47 First-visit MC

48 Every-visit MC

49 Incremental MC

50 Temporal-Difference Learning
TD methods learn directly from episodes of experience. TD is model-free: no knowledge of MDP transitions / rewards. TD learns from incomplete episodes, by bootstrapping. TD updates a guess towards a guess. It can work without knowing the final outcome, which makes it well suited to online learning.
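A sketch of TD(0) evaluation shows the "guess towards a guess" update. The five-state random walk (terminals at 0 and 6, reward 1 on the right exit) and the step size 0.02 are assumptions for illustration, not values from the slides.

```python
import random


def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    rng = random.Random(seed)
    n = 5
    # Indices 0 and n + 1 are terminal states whose value stays at 0.
    value = [0.0] * (n + 2)
    for _ in range(num_episodes):
        state = 3
        while 0 < state < n + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == n + 1 else 0.0
            # Bootstrapped target: one real reward plus the current guess
            # for the next state, available before the episode ends.
            td_target = reward + gamma * value[next_state]
            value[state] += alpha * (td_target - value[state])
            state = next_state
    return value[1:n + 1]


td_values = td0_random_walk()
```

Each update happens right after a single transition, so learning proceeds online; the estimates should approach s/6 for s = 1..5, with some residual noise from the constant step size.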


52 Some comparison


54 Now we update the policy
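On-policy learning: "learn on the job"; learn about a policy from experience sampled from that same policy. Off-policy learning: "look over someone's shoulder"; learn about a policy from experience sampled from a different (behaviour) policy.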

55 MC policy learning

56 Sarsa (TD) policy learning

57 Off-policy learning: Update the policy and the Q values together.
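Q-learning is one standard off-policy method, used here as an assumed example since the transcript does not preserve the algorithm shown on the slide. In this sketch the behaviour policy is uniformly random, yet the learned Q values describe the greedy policy. The corridor environment (states 0 to 4, goal at 4, reward 1 on reaching it) is hypothetical.

```python
import random


def q_learning_corridor(num_episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    goal = 4
    # q[state][action]; action 0 moves left, action 1 moves right.
    q = [[0.0, 0.0] for _ in range(goal + 1)]
    for _ in range(num_episodes):
        state = 0
        while state != goal:
            # Behaviour policy: uniformly random, independent of q.
            action = rng.randrange(2)
            next_state = max(state - 1, 0) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Off-policy target: bootstrap from the best next action, not from
            # the action the behaviour policy will actually take. q[goal] is
            # never updated, so the terminal state contributes 0.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    # The learned (target) policy is greedy with respect to q.
    greedy = [max((0, 1), key=lambda a: q[s][a]) for s in range(goal)]
    return q, greedy


q_table, greedy_policy = q_learning_corridor()
```

Sarsa would differ only in the target: it would bootstrap from the Q value of the next action actually chosen by the behaviour policy, which makes it on-policy.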

58 But all of the above is mostly useless
How do we know the state? How do we store the value function if the state space is large? What if the state is continuous? What if we meet states that are similar to, but different from, those in the training data? What if we observe only a small number of training examples? Use a parametric function to approximate it!

59 Value function approximation

60 We consider differentiable function approximators, e.g.
Linear combinations of features, neural networks, decision trees, nearest neighbour, Fourier / wavelet bases, ... Furthermore, we require a training method that is suitable for non-stationary, non-iid data.
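As a sketch of the simplest case in the list, a linear combination of features can be trained with a semi-gradient TD(0) update. The feature map [1, s/6], the five-state random walk, and the step size are all assumptions for illustration; the true values s/6 happen to be exactly representable by these features.

```python
import random


def features(state, n=5):
    # Hypothetical two-dimensional feature vector: a bias term and a scaled position.
    return [1.0, state / (n + 1)]


def predict(weights, state, n=5):
    if state <= 0 or state >= n + 1:
        return 0.0  # terminal states have value 0 by definition
    return sum(w * x for w, x in zip(weights, features(state, n)))


def linear_td0(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    rng = random.Random(seed)
    n = 5
    weights = [0.0, 0.0]
    for _ in range(num_episodes):
        state = 3
        while 0 < state < n + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == n + 1 else 0.0
            td_error = (reward + gamma * predict(weights, next_state, n)
                        - predict(weights, state, n))
            # For a linear model the gradient of the prediction is the feature vector.
            x = features(state, n)
            weights = [w + alpha * td_error * xi for w, xi in zip(weights, x)]
            state = next_state
    return weights


learned_weights = linear_td0()
```

Only two weights are stored regardless of how many states exist, and unseen states get a prediction through their features, which is the point of replacing the table.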

61 Content: What is reinforcement learning? Shallow discrete learning.
Deep Q learning.

62 Deep Q network: Use a DNN to approximate the value function.
Use MC or TD to generate training samples, and use the error signals from those samples to train the DNN.
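The TD-based error signal can be sketched without committing to any deep learning library. Here q_values is a stand-in for the network's forward pass (any callable that maps a state to a list of action values); toy_q, the batch layout, and the discount value are hypothetical choices for illustration.

```python
def td_targets(batch, q_values, gamma=0.99):
    """batch is a list of (state, action, reward, next_state, done) transitions."""
    targets = []
    for state, action, reward, next_state, done in batch:
        # Bootstrap from the best action value of the next state unless the
        # episode has ended there.
        bootstrap = 0.0 if done else gamma * max(q_values(next_state))
        targets.append(reward + bootstrap)
    return targets


def squared_td_errors(batch, q_values, gamma=0.99):
    # Per-sample squared error between the target and the current prediction
    # for the action that was taken; a network would be trained to reduce this.
    targets = td_targets(batch, q_values, gamma)
    return [(t - q_values(s)[a]) ** 2 for t, (s, a, _, _, _) in zip(targets, batch)]


def toy_q(state):
    # Stand-in for a neural network with two action outputs.
    return [0.1 * state, 0.2 * state]


batch = [(1, 0, 1.0, 2, False), (2, 1, 0.0, 3, True)]
targets = td_targets(batch, toy_q, gamma=0.5)
errors = squared_td_errors(batch, toy_q, gamma=0.5)
```

In a real deep Q network the squared errors would be averaged into a loss and minimised by gradient descent on the network weights, with the targets treated as constants.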


64 Incremental update for Q function

65 DQN for game learning Human-level control through deep reinforcement

66 Two mechanisms

67 DQN for AlphaGo Mastering the game of Go with deep neural networks and tree search

68 Wrap up: Reinforcement learning learns a policy.
It is basically formulated as learning in a Markov decision process. It can be learned in a ‘batch’ way or by sampling, and in an episodic or incremental fashion. Learning the value function is highly important. Deep learning provides a brilliant solution. It opens the door to fantastic machine intelligence.
