Reinforcement Learning
CSLT ML Summer Seminar (12): Reinforcement Learning
Dong Wang
Most slides are from David Silver's UCL Course on RL; some slides are from John Schulman's 'Deep Reinforcement Learning', Lecture 1.
Content
What is reinforcement learning?
Shallow discrete learning
Deep Q learning
What is reinforcement learning?
Reinforcement learning is the problem faced by an agent that learns behaviour through trial-and-error interactions with a dynamic environment. Given a state by the environment, the agent learns how to take an action; in return it receives some (random) reward and the system moves to another (random) state. The probabilities of the reward and of the next state are stationary.
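To make the state-action-reward loop concrete, here is a minimal sketch of the interaction described above; the toy `Environment` and `Agent` classes are illustrative placeholders, not part of the original slides.

```python
import random

class Environment:
    """A toy environment: two states, two actions, random rewards and transitions."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Stationary (state, action) -> (reward, next state) probabilities
        reward = random.gauss(1.0 if action == self.state else 0.0, 0.1)
        self.state = random.choice([0, 1])
        return reward, self.state

class Agent:
    """An agent that currently acts at random; learning would improve this policy."""
    def act(self, state):
        return random.choice([0, 1])

env, agent = Environment(), Agent()
state, total_reward = env.state, 0.0
for t in range(100):                      # trial-and-error interaction
    action = agent.act(state)             # agent picks an action given the state
    reward, state = env.step(action)      # environment returns a reward and the next state
    total_reward += reward
print("return over 100 steps:", total_reward)
```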
An example
Main components of RL
It is different from other tasks
Unlike supervised learning, there is no 'label' telling the agent which action to take. Feedback is often delayed, e.g., in game playing. Time is important: it is a sequential decision process, and decisions impact the environment. Compared with unsupervised learning, it does have some 'supervision' (the reward).
Applications
Fly stunt manoeuvres in a helicopter
Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
Example domains: robots, business, finance, media, medicine, sequence prediction, game playing.
Some important things to mention
Is the environment known (e.g., the transition and reward probabilities)? Can we observe the state (hidden or explicit)? Do we need to model the environment? Do we learn from episodes or online? Do we want to use function approximation or an explicit table?
Content
What is reinforcement learning?
Shallow discrete learning
Deep Q learning
Markov decision process
Markov decision processes formally describe an environment for reinforcement learning in which the environment is fully observable, i.e. the current state completely characterises the process. Almost all RL problems can be formalised as MDPs, e.g. optimal control primarily deals with continuous MDPs, partially observable problems can be converted into MDPs, and bandits are MDPs with one state.
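As a concrete illustration (not taken from the slides), an MDP can be written down explicitly as the tuple (S, A, P, R, γ); the tiny two-state example below is invented purely for illustration.

```python
# A hypothetical two-state, two-action MDP written out explicitly as (S, A, P, R, gamma).
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] is a dict of next-state probabilities; R[s][a] is the expected reward.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.7, "s1": 0.3}},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}
gamma = 0.9  # discount factor
```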
Markov process
Markov reward process
An example of Markov reward process
Return in Markov reward process
Value function
Bellman Equation
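For reference, the standard Bellman equation for the value function of a Markov reward process, presumably the equation presented on this slide, is:

```latex
v(s) = \mathbb{E}\left[ R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s \right]
     = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')
```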
27
Markov decision process (MDP)
29
Policy
30
Value function in MDP
31
Bellman Expectation Equation in MDP
32
Relation between two valuation functions
33
Relation between two valuation functions
34
Optimal value function
35
Optimal policy
36
Find optimal policy
37
Bellman optimality equations
They are non-linear (because of the max()) and have no closed-form solution; iterative procedures are needed.
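In the usual notation, the Bellman optimality equations being referred to are:

```latex
v_*(s) = \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_*(s') \Big), \qquad
q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')
```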
Policy evaluation
Given a policy, compute the value function at each state.
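A minimal sketch of iterative policy evaluation on a made-up two-state MDP; the states, rewards, and policy are invented purely for illustration.

```python
# Iterative policy evaluation: repeatedly apply the Bellman expectation backup.
P = {"s0": {"a": {"s0": 0.5, "s1": 0.5}}, "s1": {"a": {"s1": 1.0}}}
R = {"s0": {"a": 1.0}, "s1": {"a": 0.0}}
policy = {"s0": "a", "s1": "a"}   # the (fixed) policy being evaluated
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        a = policy[s]
        # Expected reward plus discounted value of the successor states
        new_v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break
print(V)
```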
Improve the policy
Generalised policy iteration
Value iteration
It does not involve an explicit policy update; however, the optimal policy is implicitly learned through the 'max' operation.
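A minimal value-iteration sketch on a made-up two-state MDP; the 'max' in the backup is what implicitly encodes the greedy policy.

```python
# Value iteration: repeatedly apply the Bellman optimality backup, then read off the policy.
P = {"s0": {"stay": {"s0": 1.0}, "move": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}}}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 2.0, "move": 0.0}}
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # 'max' over actions replaces the explicit policy update
        new_v = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in P[s])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

# The greedy policy can then be read off from the converged values:
policy = {s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in P}
print(V, policy)
```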
Dynamic programming can be performed asynchronously
Three simple ideas for asynchronous dynamic programming: in-place dynamic programming, prioritised sweeping, and real-time dynamic programming.
Full-width approach and sampling approach
The approaches above use all possible successor states in each update, and therefore require knowing the dynamics of the system. Alternatively, one can design a model-free approach based on sampling: the agent interacts with the environment and learns from the actual experience. The sampling approach is easier to implement and more efficient.
Monte-Carlo Reinforcement Learning
MC methods learn directly from episodes of experience. MC is model-free: no knowledge of MDP transitions or rewards is needed. MC learns from complete episodes: no bootstrapping. MC uses the simplest possible idea: value = mean return. Caveat: MC can only be applied to episodic MDPs, where all episodes terminate.
MC evaluation
Do not touch the policy; only estimate the value function.
First-visit MC
Every-visit MC
Incremental MC
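A minimal sketch of incremental (every-visit) Monte-Carlo evaluation: after each complete episode the return G is computed backwards and the value estimate is nudged towards it. The episode data below is invented for illustration.

```python
# Incremental Monte-Carlo evaluation: V(s) <- V(s) + alpha * (G - V(s))
from collections import defaultdict

gamma, alpha = 0.9, 0.1
V = defaultdict(float)

def mc_update(episode):
    """episode is a list of (state, reward) pairs from one complete, terminated episode."""
    G = 0.0
    # Walk backwards so G accumulates the discounted future rewards
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])   # move the estimate towards the observed return

# A made-up terminated episode, purely for illustration:
mc_update([("s0", 1.0), ("s1", 0.0), ("s0", 2.0)])
print(dict(V))
```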
Temporal-Difference Learning
TD methods learn directly from episodes of experience. TD is model-free: no knowledge of MDP transitions or rewards is needed. TD learns from incomplete episodes, by bootstrapping: it updates a guess towards a guess. TD can work before the final outcome is known, which makes it well suited to online learning.
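A minimal TD(0) sketch: each transition produces a target r + γV(s') that bootstraps from the current estimate, so updates can be applied online, one transition at a time. The transitions shown are invented.

```python
# TD(0) evaluation: bootstrap from the current estimate of the next state's value.
from collections import defaultdict

gamma, alpha = 0.9, 0.1
V = defaultdict(float)

def td_update(state, reward, next_state):
    # 'Update a guess towards a guess': the target uses the current estimate V(s')
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])

# Applied online, one transition at a time (made-up transitions for illustration):
td_update("s0", 1.0, "s1")
td_update("s1", 0.0, "s0")
print(dict(V))
```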
Some comparisons
Now we update the policy
On-policy learning ('learn on the job'): learn about a policy from experience sampled from that same policy. Off-policy learning ('look over someone's shoulder'): learn about a policy from experience sampled from a different (behaviour) policy.
MC policy learning
Sarsa (TD) policy learning
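A minimal Sarsa-style sketch (on-policy TD control): the Q update uses the action that the ε-greedy policy actually chooses in the next state. The state and action encoding here is a placeholder.

```python
# Sarsa (on-policy TD control): update Q using the action actually taken next.
from collections import defaultdict
import random

gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = defaultdict(float)
actions = [0, 1]

def epsilon_greedy(state):
    # Behave mostly greedily w.r.t. Q, but explore with probability epsilon
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s2, a2):
    # The target uses Q(s', a') for the action a' the current policy actually chose
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
```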
Off-policy learning
Update the policy and the Q value together.
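A minimal sketch of an off-policy TD update in the Q-learning style, assuming that is what this slide refers to: the target maximises over next actions regardless of what the behaviour policy will actually do.

```python
# Q-learning (off-policy TD control): the target uses max over next actions,
# independently of the action the behaviour policy will take.
from collections import defaultdict

gamma, alpha = 0.9, 0.1
Q = defaultdict(float)
actions = [0, 1]

def q_learning_update(s, a, r, s2):
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```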
But all of the above is of limited practical use
How do we represent the state? How do we store the value function when the state space is large? What if the state space is continuous? What about states that are similar to, but different from, those seen in training? What if we only observe a small number of training examples? Use a parametric function to approximate the value function!
Value function approximation
We consider differentiable function approximators, e.g.
Linear combinations of features
Neural network
Decision tree
Nearest neighbour
Fourier / wavelet bases
...
Furthermore, we require a training method that is suitable for non-stationary, non-iid data.
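As a sketch of the linear case: V(s; w) = w·φ(s), trained by a semi-gradient TD update. The feature map φ below is a made-up placeholder, not part of the original slides.

```python
# Linear value function approximation trained by stochastic semi-gradient TD updates.
import numpy as np

alpha, gamma = 0.01, 0.9
w = np.zeros(4)

def phi(state):
    """Hypothetical feature map turning a (scalar) state into a fixed-length vector."""
    return np.array([1.0, state, state ** 2, np.sin(state)])

def td_update(state, reward, next_state):
    v, v_next = w @ phi(state), w @ phi(next_state)
    td_error = reward + gamma * v_next - v
    # Semi-gradient update: the gradient of V(s; w) w.r.t. w is just phi(s)
    w[:] = w + alpha * td_error * phi(state)

td_update(0.5, 1.0, 0.7)   # one illustrative transition
print(w)
```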
Content
What is reinforcement learning?
Shallow discrete learning
Deep Q learning
Deep Q network
Use a DNN to approximate the value function. Use MC or TD to generate samples, and use the error signals from these training samples to train the DNN.
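A minimal DQN-style training step, sketched here in PyTorch under the assumption of a small fully connected network; it illustrates the idea (regress Q(s, a) towards the TD target), not the original implementation.

```python
# DQN-style training step: the online network predicts Q(s, .) and is trained to match
# the TD target r + gamma * max_a' Q_target(s', a') computed with a separate target network.
import torch
import torch.nn as nn

n_state, n_action, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_action))
target_net = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_action))
target_net.load_state_dict(q_net.state_dict())   # periodically re-synchronised in full DQN
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(s, a, r, s2, done):
    """s, s2: float tensors (batch, n_state); a: long tensor (batch,); r, done: float tensors."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a) for the taken actions
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                    # TD error as the training signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```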
Incremental update of the Q function
DQN for game playing: 'Human-level control through deep reinforcement learning'
Two mechanisms
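The two mechanisms in the DQN paper are experience replay and a separate, periodically synchronised target network; assuming that is what this slide covered, here is a minimal replay-buffer sketch (the target network appears in the previous sketch).

```python
# Experience replay: store transitions and sample random mini-batches to break the
# correlation between consecutive samples used to train the Q network.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))   # (states, actions, rewards, next_states, dones)
```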
DQN for AlphaGo: 'Mastering the game of Go with deep neural networks and tree search'
Wrap up
Reinforcement learning learns a policy. It is basically formulated as Markov decision process learning. It can be learned in a full-width ('batch') way or by sampling, and in an episodic or incremental fashion. Learning the value function is highly important, and deep learning provides a brilliant solution. It opens the door to fantastic machine intelligence.