Published by Antonio Elliott. Modified over 4 years ago.

1
**Reinforcement Learning: Learning from Interaction**

Winter School on Machine Learning and Vision, 2010. B. Ravindran. Many slides adapted from Sutton and Barto.

2
**Learning to Control**

So far we have looked at two models of learning. Supervised: classification, regression, etc. Unsupervised: clustering, etc. How did you learn to cycle? Neither of the above. Trial and error! Falling down hurts! Intro to RL

3
Can You hear me now? Can You hear me now? Can You hear me now?

4
**Reinforcement Learning**

A trial-and-error learning paradigm: rewards and punishments. Not just an algorithm but a new paradigm in itself. Learn about a system (behaviour, control) from minimal feedback. Inspired by behavioural psychology.

5
**RL Framework**

[Figure labels: State, Action, Evaluation, Agent]

Learn from close interaction. Stochastic environment. Noisy, delayed, scalar evaluation. Maximize a measure of long-term performance.

Benefits of RL. Applications. Future rewards. Situate in literature: AI, stochastic systems, connections to OR.

6
**Not Supervised Learning!**

[Figure labels: Input, Output, Agent, Target, Error]

Very sparse “supervision”. No target output provided. No error gradient information available. Action chooses next state. Explore to estimate gradient: trial-and-error learning.

7
**Not Unsupervised Learning**

[Figure labels: Input, Activation, Agent, Evaluation]

Sparse “supervision” available. Pattern detection is not the primary goal.

8
**TD-Gammon**

Tesauro 1992, 1994, 1995, ... White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same) 2 steps. The objective is to advance all pieces to points 19-24. Hitting. 30 pieces and 24 locations imply an enormous number of configurations. Effective branching factor of 400.

9
**The Agent-Environment Interface**

$$\dots,\ s_t,\ a_t,\ r_{t+1},\ s_{t+1},\ a_{t+1},\ r_{t+2},\ s_{t+2},\ a_{t+2},\ r_{t+3},\ s_{t+3},\ a_{t+3},\ \dots$$

10
**The Agent Learns a Policy**

Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent’s goal is to get as much reward as it can over the long run.

11
**Goals and Rewards**

Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent’s direct control, thus outside the agent. The agent must be able to measure success explicitly and frequently during its lifespan.

12
**Returns**

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. The return from step $t$ is

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$

where $T$ is a final time step at which a terminal state is reached, ending an episode.

13
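**The Markov Property**

“The state” at step $t$ means whatever information is available to the agent at step $t$ about its environment. The state can include immediate “sensations”, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$

for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0$.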

14
**Markov Decision Processes**

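If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give the state and action sets and the one-step “dynamics”, defined by the transition probabilities

$$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s),$$

and the reward expectations

$$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$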

15
**An Example Finite MDP: Recycling Robot**

At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.

16
**Recycling Robot MDP**
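The transition diagram for this slide did not survive, so the following is only a sketch of how the recycling robot could be written down as a finite MDP in Python. The data layout and every number (`ALPHA`, `BETA`, `R_SEARCH`, `R_WAIT`, `R_RESCUE`) are hypothetical values chosen for illustration, not taken from the slides. Restricting `recharge` to the low-battery state is also an assumption.

```python
# Hypothetical parameters; the slides give no numeric values.
ALPHA = 0.9      # P(battery stays high | high, search)
BETA = 0.6       # P(battery stays low | low, search)
R_SEARCH = 2.0   # expected cans collected while searching
R_WAIT = 1.0     # expected cans collected while waiting
R_RESCUE = -3.0  # penalty when the battery dies and the robot is rescued

# mdp[state][action] -> list of (probability, next_state, expected_reward)
recycling_robot = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait": [(1.0, "high", R_WAIT)],
    },
    "low": {
        # With probability 1 - BETA the battery dies; after the rescue the
        # robot is assumed to start again with a high battery.
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait": [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}


def check_mdp(mdp, tol=1e-9):
    """Return True if all outcome distributions sum to 1 and stay inside the state set."""
    for actions in mdp.values():
        for outcomes in actions.values():
            if abs(sum(p for p, _, _ in outcomes) - 1.0) > tol:
                return False
            if any(next_state not in mdp for _, next_state, _ in outcomes):
                return False
    return True
```

The nested-dictionary layout mirrors the definition of a finite MDP: state and action sets are the dictionary keys, and each outcome list carries the transition probabilities and reward expectations.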

17
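**Value Functions**

The value of a state is the expected return starting from that state; it depends on the agent’s policy $\pi$:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\}$$

The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following $\pi$:

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\}$$

Here $R_t$ denotes the return from step $t$.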

18
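**Bellman Equation for a Policy π**

The basic idea, with discount factor $\gamma$:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$$

So:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$$

Or, without the expectation operator:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$$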

19
**Optimal Value Functions**

For finite MDPs, policies can be partially ordered:

$$\pi \ge \pi' \text{ if and only if } V^\pi(s) \ge V^{\pi'}(s) \text{ for all } s \in S$$

There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$. Optimal policies share the same optimal state-value function:

$$V^*(s) = \max_\pi V^\pi(s) \text{ for all } s \in S$$

20
**Bellman Optimality Equation**

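The value of a state under an optimal policy must equal the expected return for the best action from that state:

$$V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$$

$V^*$ is the unique solution of this system of nonlinear equations.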

21
**Bellman Optimality Equation**

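Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:

$$Q^*(s, a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$$

$Q^*$ is the unique solution of this system of nonlinear equations.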

22
**Dynamic Programming**

DP is the solution method of choice for MDPs. Requires complete knowledge of system dynamics (P and R). Expensive and often not practical. Curse of dimensionality. Guaranteed to converge!

RL methods: online approximate dynamic programming. No knowledge of P and R. Sample trajectories through state space. Some theoretical convergence analysis available.

23
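**Policy Evaluation**

Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$. Recall the Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$$

where $\pi(s, a)$ is the probability of taking action $a$ in state $s$, $P^a_{ss'}$ and $R^a_{ss'}$ are the transition probabilities and expected rewards, and $\gamma$ is the discount factor.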
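One common way to compute the state-value function of a fixed policy is to turn the Bellman equation into a repeated update. The sketch below assumes a dictionary-based MDP layout and a two-state toy problem, both made up for illustration. It should be read as a minimal example, not as the method used in the lecture.

```python
def policy_evaluation(mdp, policy, gamma=0.9, theta=1e-10):
    """Sweep the Bellman backup for `policy` until the largest change is below theta.

    mdp[s][a]    -> list of (probability, next_state, reward)
    policy[s][a] -> probability of taking action a in state s
    """
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v_new = sum(
                pi_sa * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                for a, pi_sa in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update: later states in the sweep see the new value
        if delta < theta:
            return V


# Hypothetical toy MDP: staying in B pays 2 per step, moving A -> B pays 1.
toy = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
always_stay = {"A": {"stay": 1.0}, "B": {"stay": 1.0}}
V_stay = policy_evaluation(toy, always_stay, gamma=0.5)
```

With `gamma=0.5`, staying in `B` forever is worth 2 / (1 - 0.5) = 4, and staying in `A` is worth 0, which is what the sweep converges to.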

24
**Policy Improvement**

Suppose we have computed $V^\pi$ for a deterministic policy $\pi$. For a given state $s$, would it be better to do an action $a \ne \pi(s)$?
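The question can be answered by a one-step lookahead: compute the value of each action under the current value estimates and pick the best. The sketch below assumes the same kind of hypothetical two-state toy MDP and hard-codes the values of an "always stay" policy at discount 0.5. All names are illustrative.

```python
def action_value(mdp, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])


def greedy_policy(mdp, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    return {
        s: max(mdp[s], key=lambda a: action_value(mdp, V, s, a, gamma))
        for s in mdp
    }


# Hypothetical toy MDP: mdp[s][a] -> list of (probability, next_state, reward)
toy = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V_stay = {"A": 0.0, "B": 4.0}  # values of "always stay" at gamma = 0.5
improved = greedy_policy(toy, V_stay, gamma=0.5)
```

In state `A` the lookahead gives 0 for `stay` and 1 + 0.5 * 4 = 3 for `go`, so the improved policy switches to `go` there and keeps `stay` in `B`.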

25
**Policy Improvement Cont.**


26
**Policy Improvement Cont.**


27
**Policy Iteration**

Alternate policy evaluation and policy improvement (“greedification”).
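As a rough illustration of how evaluation and greedification alternate, the sketch below runs policy iteration on a hypothetical two-state MDP. The data layout, the toy problem, and the stopping rule (stop when the greedy policy no longer changes) are assumptions made for this example.

```python
def evaluate(mdp, policy, gamma, theta=1e-10):
    """Policy evaluation for a deterministic policy (policy[s] is an action)."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V


def improve(mdp, V, gamma):
    """Policy improvement: act greedily with respect to V."""
    return {
        s: max(
            mdp[s],
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a]),
        )
        for s in mdp
    }


def policy_iteration(mdp, gamma=0.9):
    policy = {s: next(iter(mdp[s])) for s in mdp}  # arbitrary start: first listed action
    while True:
        V = evaluate(mdp, policy, gamma)
        new_policy = improve(mdp, V, gamma)
        if new_policy == policy:  # policy is stable, so it is greedy w.r.t. its own values
            return policy, V
        policy = new_policy


# Hypothetical toy MDP: mdp[s][a] -> list of (probability, next_state, reward)
toy = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
best_policy, best_V = policy_iteration(toy, gamma=0.5)
```

On this toy problem the loop stops after two evaluations: the starting policy stays everywhere, one improvement step switches `A` to `go`, and the next improvement leaves the policy unchanged.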

28
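**Value Iteration**

Recall the Bellman optimality equation:

$$V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$$

We can convert it to a full value iteration backup:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$$

Iterate until “convergence”.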
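A minimal sketch of this backup in Python follows. It assumes a dictionary-based MDP and a made-up two-state problem, and it uses "largest change below `theta`" as one possible reading of “convergence”.

```python
def value_iteration(mdp, gamma=0.9, theta=1e-10):
    """Apply the Bellman optimality backup to every state until values stop changing.

    mdp[s][a] -> list of (probability, next_state, reward)
    """
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                for a in mdp[s]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v  # in-place variant of the backup
        if delta < theta:
            return V


# Hypothetical toy MDP, for illustration only.
toy = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V_star = value_iteration(toy, gamma=0.5)
```

Unlike policy iteration, no explicit policy is kept during the sweeps; a greedy policy can be read off the final values with a one-step lookahead.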

29
**Generalized Policy Iteration**

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. A geometric metaphor for convergence of GPI:

30
**Dynamic Programming**

[Figure: backup diagram]

31
**Simplest TD Method**

[Figure: backup diagram]

32
**RL Algorithms – Prediction**

Policy Evaluation (the prediction problem): for a given policy, compute the state-value function. No knowledge of P and R, but access to the real system, or a “sample” model, is assumed. Uses “bootstrapping” and sampling.
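To make “bootstrapping” and sampling concrete, here is a sketch of the standard one-step TD(0) update, which moves the value of a visited state toward the sampled reward plus the current estimate for the next state. The step size `alpha`, the tiny two-step episode, and all names are assumptions for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s_next) - V(s)]."""
    target = r + (0.0 if terminal else gamma * V[s_next])  # bootstrapped target
    V[s] += alpha * (target - V[s])
    return V


# Hypothetical experience: s0 -> s1 (reward 0), then s1 -> end (reward 1).
episode = [("s0", 0.0, "s1", False), ("s1", 1.0, "end", True)]
V = {"s0": 0.0, "s1": 0.0, "end": 0.0}
for _ in range(200):  # replay the same sampled episode many times
    for s, r, s_next, done in episode:
        td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0, terminal=done)
```

No transition probabilities or expected rewards appear anywhere: the update only needs observed transitions, and it can be applied after every step rather than at the end of the episode.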

33
**Advantages of TD**

TD methods do not require a model of the environment, only experience. TD methods can be fully incremental. You can learn before knowing the final outcome: less memory, less peak computation. You can learn without the final outcome: from incomplete sequences.

34
**RL Algorithms – Control**

SARSA and Q-learning.
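The update rules for these two algorithms are not preserved in the transcript. The sketch below shows their standard textbook forms side by side. The table layout (`defaultdict` keyed by state-action pairs), the parameter names, and the example numbers are assumptions for illustration.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=1.0):
    """SARSA (on-policy): bootstrap from the action a2 actually selected in s2."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])


def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=1.0):
    """Q-learning (off-policy): bootstrap from the best action available in s2."""
    best_next = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


# Same hypothetical transition, seen by both learners.
actions = ["left", "right"]
Q_sarsa = defaultdict(float, {("s2", "left"): 1.0, ("s2", "right"): 5.0})
Q_qlearn = defaultdict(float, {("s2", "left"): 1.0, ("s2", "right"): 5.0})

# The behaviour policy happened to pick the worse action "left" in s2.
sarsa_update(Q_sarsa, "s1", "right", 0.0, "s2", "left", alpha=0.5)
q_learning_update(Q_qlearn, "s1", "right", 0.0, "s2", actions, alpha=0.5)
```

The only difference is the bootstrap target: SARSA uses the value of the action the agent really takes next, so exploration is reflected in its estimates, while Q-learning always uses the greedy value. Entries for terminal states are never updated here, so they keep the default value 0.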

35
**Cliff Walking**

ε-greedy, ε = 0.1
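For reference, ε-greedy action selection can be sketched as follows. The Q-table layout, the function name, and the optional `rng` argument are assumptions for illustration; the default `epsilon=0.1` matches the value on the slide.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one.

    Q maps (state, action) pairs to values; missing entries count as 0.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit


# Hypothetical usage on a single state with two actions.
Q = {("start", "up"): 0.0, ("start", "right"): 1.0}
greedy_choice = epsilon_greedy(Q, "start", ["up", "right"], epsilon=0.0)
random_choice = epsilon_greedy(
    Q, "start", ["up", "right"], epsilon=1.0, rng=random.Random(0)
)
```

With ε = 0.1 the agent takes a random action roughly one step in ten. On a cliff-edge path those random steps are costly, which is why on-policy and off-policy methods can prefer different routes in this task.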

36
**Actor-Critic Methods**

Explicit representation of policy as well as value function. Minimal computation to select actions. Can learn an explicit stochastic policy. Can put constraints on policies. Appealing as psychological and neural models.

37
**Actor-Critic Details**
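The details on this slide are not preserved. One common tabular formulation, sketched below, has the critic compute a TD error that updates both the state value and the actor's preference for the action taken, with a softmax over preferences as the policy. The step sizes `alpha` and `beta`, the table layout, and the example transition are assumptions, and the lecture may have used a different variant.

```python
import math


def softmax_policy(prefs, s, actions):
    """Actor: action probabilities from preferences via a softmax."""
    m = max(prefs[(s, a)] for a in actions)  # subtract max for numerical stability
    exps = [math.exp(prefs[(s, a)] - m) for a in actions]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(V, prefs, s, a, r, s2,
                      alpha=0.1, beta=0.1, gamma=1.0, terminal=False):
    """Critic computes the TD error; both critic and actor move along it."""
    delta = r + (0.0 if terminal else gamma * V[s2]) - V[s]
    V[s] += alpha * delta          # critic: state-value update
    prefs[(s, a)] += beta * delta  # actor: strengthen or weaken the chosen action
    return delta


# Hypothetical single transition: action "a" in state "s" earns reward 1 and ends.
V = {"s": 0.0, "t": 0.0}
prefs = {("s", "a"): 0.0, ("s", "b"): 0.0}
delta = actor_critic_step(V, prefs, "s", "a", 1.0, "t",
                          alpha=0.5, beta=0.5, terminal=True)
probs = softmax_policy(prefs, "s", ["a", "b"])
```

A positive TD error means the outcome was better than the critic expected, so the preference for the chosen action goes up and the softmax assigns it a higher probability.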

38
**Applications of RL**

Robot navigation. Adaptive control: helicopter pilot! Combinatorial optimization: VLSI placement and routing, elevator dispatching. Game playing: backgammon, world’s best player! Computational neuroscience: modeling of reward processes.

39
**Other Topics**

Different measures of return: average rewards, discounted returns. Policy gradient approaches: directly perturb policies. Generalization [Saturday]: function approximation, temporal abstraction. Least squares methods: better use of data, more suited for “off-line” RL.

40
**References**

Mitchell, T. Machine Learning. McGraw Hill, 1997.

Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.

Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific.
