Download presentation

Presentation is loading. Please wait.

1
**Reinforcement learning**

Thomas Trappenberg

2
**Three kinds of learning: Supervised learning**

2. Unsupervised Learning 3. Reinforcement learning Detailed teacher that provides desired output y for a given input x: training set {x,y} find appropriate mapping function y=h(x;w) [= W j(x) ] Unlabeled samples are provided from which the system has to figure out good representations: training set {x} find sparse basis functions bi so that x=Si ci bi Delayed feedback from the environment in form of reward/ punishment when reaching state s with action a: reward r(s,a) find optimal policy a=p*(s) Most general learning circumstances

3
**Maximize expected Utility**

4
**2. Reinforcement learning**

-0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 From Russel and Norvik

5
**Markov Decision Process (MDP)**

6
**Two important quantities**

policy: value function: Goal: maximize total expected payoff Optimal Control

7
**Calculate value function (dynamic programming)**

Deterministic policies to simplify notation Bellman Equation for policy p Solution: Analytic or Incremental Richard Bellman

8
**Remark on different formulations:**

Some (like Sutton, Alpaydin, but not Russel & Norvik) define the value as the reward at the next state plus all the following reward: instead of

9
Policy Iteration

10
Value Iteration Bellman Equation for optimal policy

11
Solution: But: Environment not known a priori Observability of states Curse of Dimensionality Online (TD) POMDP Model-based RL

12
POMDP: Partially observable MDPs can be reduced to MDPs by considering believe states b:

13
**What if the environment is not completely known ? **

Online value function estimation (TD learning) If the environment is not known, use Monte Carlo method with bootstrapping Expected payoff before taking step Expected reward after taking step = actual reward plus discounted expected payoff of next step Temporal Difference

14
**Online optimal control: Exploitation versus Exploration**

On-policy TD learning: Sarsa Off-policy TD learning: Q-learning

15
Model-based RL: TD(1) Instead of tabular methods as mainly discussed before, use function approximator with parameters q and gradient descent step (Satton 1988): For example by using a neural network with weights q and corresponding delta learning rule when updating the weights after an episode of m steps. The only problem is that we receive the feedback r only after the t-th step. So we need to keep a memory (trace) of the sequence.

16
**Model-based RL: TD(1) … alternative formulation**

We can write An putting this into the formula and rearranging the sum gives We still need to keep the cumulative sum of the derivative terms, but otherwise it looks already closer to bootstrapping.

17
Model-based RL: TD(l) We now introduce a new algorithm by weighting recent gradients more than ones in the distance This is called the TD(l) rule. For l=1 we recover the TD(1) rule. Interesting is also the the other extreme of TD(0) Which uses the prediction of V(t+1) as supervision signal for step t. Otherwise this is equivalent to supervised learning and can easily be generalized to hidden layer networks.

18
**Free-Energy-Based RL:**

This can be generalized to Boltzmann machines (Sallans & Hinton 2004) Paul Hollensen: Sparse, topographic RBM successfully learns to drive the e-puck and avoid obstacles, given training data (proximity sensors, motor speeds)

20
**Classical Conditioning**

Ivan Pavlov Nobel Prize 1904 Classical Conditioning Rescorla-Wagner Model (1972)

21
**Reward Signals in the Brain**

Wolfram Schultz Stimulus A No reward Stimulus B Stimulus A Reward

23
**Disorders with effects On dopamine system: Parkinson’s disease **

Tourett’s syndrome ADHD Drug addiction Schizophrenia Maia & Frank 2011

24
**Conclusion and Outlook**

Three basic categories of learning: Supervised: Lots of progress through statistical learning theory Kernel machines, graphical models, etc Unsupervised: Hot research area with some progress, deep temporal learning Reinforcement: Important topic in animal behavior, model-based RL

Similar presentations

OK

Reinforcement Learning Basic idea: Receive feedback in the form of rewards Agent’s utility is defined by the reward function Must learn to act.

Reinforcement Learning Basic idea: Receive feedback in the form of rewards Agent’s utility is defined by the reward function Must learn to act.

© 2018 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on beer lambert law units Ppt on ram and rom material Ppt on marketing management on jeans Ppt on power plant in india Ppt on polynomials and coordinate geometry pdf Ppt on introduction to bluetooth technology Ppt on crop production management Ppt on features of ms word Ppt on cdma and bluetooth technology Ppt on barack obama leadership profile