1
Reinforcement Learning: Learning from Interaction
Winter School on Machine Learning and Vision, 2010. B. Ravindran. Many slides adapted from Sutton and Barto.
2
Learning to Control. So far we have looked at two models of learning:
Supervised: classification, regression, etc. Unsupervised: clustering, etc. How did you learn to cycle? Neither of the above: trial and error! Falling down hurts!
3
Can you hear me now? (cartoon slide)
4
Reinforcement Learning
A trial-and-error learning paradigm: rewards and punishments. Not just an algorithm but a new paradigm in itself. Learn about a system – behaviour control – from minimal feedback. Inspired by behavioural psychology.
5
RL Framework: learn from close interaction with a stochastic environment.
(Diagram: agent ↔ environment, exchanging states, actions, and evaluations.) The agent receives a noisy, delayed scalar evaluation and must maximize a measure of long-term performance. Connections: AI, stochastic systems, operations research (OR).
6
Not Supervised Learning!
(Diagram: input → agent → output, compared against a target to produce an error.) Very sparse "supervision": no target output is provided and no error-gradient information is available. The action chooses the next state, so the agent must explore to estimate a gradient – trial-and-error learning.
7
Not Unsupervised Learning
(Diagram: input → agent → activation, with only an evaluation signal.) Sparse "supervision" is available, and pattern detection is not the primary goal.
8
TD-Gammon (Tesauro 1992, 1994, 1995, ...). White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same piece) 2 steps. The objective is to advance all pieces to points 19-24; hitting is allowed. With 30 pieces and 24 locations there is an enormous number of configurations, and an effective branching factor of 400.
9
The Agent-Environment Interface
The agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$: at each step the agent observes state $s_t$, selects action $a_t$, and then receives reward $r_{t+1}$ and the next state $s_{t+1}$, producing the trajectory $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, \ldots$
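As an illustration (not part of the original slides), this interaction loop can be written as a short Python sketch. The `env` and `agent` objects, with `reset()`, `step(action)` returning `(next_state, reward, done)`, `act(state)`, and `learn(...)`, are hypothetical names chosen for the sketch, not an API defined by the slides.

```python
# Sketch of the agent-environment loop described above, using a
# hypothetical environment/agent interface (assumed, not prescribed here).

def run_episode(env, agent):
    """Run one episode: observe s_t, act a_t, receive r_{t+1}, s_{t+1}."""
    state = env.reset()                     # observe s_0
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)                     # choose a_t from the policy
        next_state, reward, done = env.step(action)   # receive r_{t+1} and s_{t+1}
        agent.learn(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward
```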
10
The Agent Learns a Policy
A policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s, a)$ is the probability that $a_t = a$ when $s_t = s$. Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.
11
Goals and Rewards. Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent's direct control, thus outside the agent. The agent must be able to measure success: explicitly, and frequently during its lifespan.
12
Returns. Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. The return from step $t$ is $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where $T$ is a final time step at which a terminal state is reached, ending an episode. (For continuing tasks the discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, with discount factor $0 \le \gamma \le 1$, is used instead.)
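For concreteness, a small Python helper (my addition, not from the slides) that computes the return from a recorded reward sequence; passing `gamma < 1` gives the discounted return used by the later Bellman equations.

```python
def episodic_return(rewards, gamma=1.0):
    """R_t = r_{t+1} + gamma*r_{t+2} + ... for the rewards observed after time t."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards so discounting compounds
        g = r + gamma * g
    return g

# e.g. episodic_return([0, 0, 1], gamma=0.9) == 0.81
```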
13
The Markov Property. "The state" at step $t$ means whatever information is available to the agent at step $t$ about its environment. The state can include immediate "sensations", highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov property: $\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, \ldots, s_1, a_1, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$ for all $s'$, $r$, and histories.
14
Markov Decision Processes
If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: the state and action sets, and the one-step "dynamics" defined by the transition probabilities $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ and the reward expectations $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$.
15
An Example Finite MDP: The Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.
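One way to make this concrete is to write the finite MDP down as explicit tables of transition probabilities and rewards. The sketch below follows the recycling-robot structure just described, but the parameter names (`alpha`, `beta`) and all numerical values are illustrative assumptions, not values given in the slides.

```python
# Finite-MDP tables for the recycling robot: P[s][a] is a list of
# (probability, next_state, reward) triples.  alpha, beta and the
# reward values are illustrative placeholders; rewards stand for the
# expected number of cans collected (being rescued is penalized).
alpha, beta = 0.9, 0.6
r_search, r_wait, r_rescue = 2.0, 1.0, -3.0

P = {
    "high": {
        "search": [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
        "wait":   [(1.0, "high", r_wait)],
    },
    "low": {
        "search":   [(beta, "low", r_search), (1 - beta, "high", r_rescue)],
        "wait":     [(1.0, "low", r_wait)],
        "recharge": [(1.0, "high", 0.0)],
    },
}
```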
16
Recycling Robot MDP: transition graph over the two energy levels (high, low) with actions search, wait, and recharge (figure omitted; the sketch above lists the same dynamics).
17
Value Functions. The value of a state is the expected return starting from that state; it depends on the agent's policy: $V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\}$. The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following $\pi$: $Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\}$.
18
Bellman Equation for a Policy $\pi$
The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$. So: $V^{\pi}(s) = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$. Or, without the expectation operator: $V^{\pi}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi}(s') \big]$.
19
Optimal Value Functions
For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s$. There is always at least one policy (possibly many) that is better than or equal to all the others; this is an optimal policy, denoted $\pi^*$. Optimal policies share the same optimal state-value function: $V^*(s) = \max_{\pi} V^{\pi}(s)$ for all $s$.
20
Bellman Optimality Equation
The value of a state under an optimal policy must equal the expected return for the best action from that state: $V^*(s) = \max_a Q^{\pi^*}(s, a) = \max_a E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^*(s') \big]$. $V^*$ is the unique solution of this system of nonlinear equations.
21
Bellman Optimality Equation
Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy: $Q^*(s, a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \big]$. $Q^*$ is the unique solution of this system of nonlinear equations.
22
Dynamic Programming: DP is the solution method of choice for MDPs.
It requires complete knowledge of the system dynamics (P and R), and it is expensive and often impractical (the curse of dimensionality), but it is guaranteed to converge! RL methods are online, approximate dynamic programming: no knowledge of P and R is needed, they sample trajectories through the state space, and some theoretical convergence analysis is available.
23
Policy Evaluation: for a given policy $\pi$, compute the
state-value function $V^{\pi}$. Recall: $V^{\pi}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi}(s') \big]$. Iterative policy evaluation turns this into a sweep over all states: $V_{k+1}(s) \leftarrow \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]$.
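A minimal Python sketch of iterative policy evaluation, reusing the `(probability, next_state, reward)` table format from the recycling-robot sketch above and assuming `policy[s][a]` holds $\pi(s, a)$ for every action available in $s$.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iteratively compute V^pi for a fixed policy pi, sweeping until stable.

    P[s][a] is a list of (probability, next_state, reward) triples;
    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:       # largest change in the sweep is negligible
            return V
```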
24
Policy Improvement. Suppose we have computed $V^{\pi}$ for a deterministic policy $\pi$. For a given state $s$, would it be better to do an action $a \neq \pi(s)$? The value of doing $a$ in $s$ and thereafter following $\pi$ is $Q^{\pi}(s, a) = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi}(s') \big]$.
25
Policy Improvement Cont.
It is better to switch to action $a$ in state $s$ if and only if $Q^{\pi}(s, a) > V^{\pi}(s)$. More generally (the policy improvement theorem): if $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$ for all $s$, then $V^{\pi'}(s) \ge V^{\pi}(s)$ for all $s$.
26
Policy Improvement Cont.
Doing this for all states gives a new policy $\pi'$ that is greedy with respect to $V^{\pi}$: $\pi'(s) = \arg\max_a Q^{\pi}(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi}(s') \big]$. Then $V^{\pi'} \ge V^{\pi}$; if $V^{\pi'} = V^{\pi}$, both policies are optimal.
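This greedification step can be sketched as a one-step look-ahead over the same assumed MDP tables; the function below returns a deterministic policy mapping each state to its greedy action. Alternating it with policy evaluation (after converting the deterministic policy into action probabilities) is exactly policy iteration.

```python
def greedy_improvement(P, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in P:
        # One-step look-ahead Q(s, a) for every available action.
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
             for a, outcomes in P[s].items()}
        policy[s] = max(q, key=q.get)   # argmax_a Q(s, a)
    return policy
```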
27
Policy Iteration: alternate policy evaluation and policy improvement ("greedification"):
$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$, where E denotes a policy evaluation step and I a policy improvement (greedification) step.
28
Value Iteration. Recall the Bellman optimality equation: $V^*(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^*(s') \big]$.
We can convert it into a full value iteration backup: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]$. Iterate until "convergence".
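A sketch of this value-iteration backup under the same assumptions (model tables in the `(probability, next_state, reward)` format used earlier); "convergence" here means the largest change in a sweep falls below a small threshold. The greedy policy with respect to the returned values (as in the greedification sketch above) is then approximately optimal.

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Turn the Bellman optimality equation into an iterative backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # max over actions of the one-step look-ahead value
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```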
29
Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. A geometric metaphor for the convergence of GPI (figure omitted): the two processes repeatedly pull the value function and the policy toward each other until they jointly reach $V^*$ and $\pi^*$.
30
Dynamic Programming (backup diagram omitted): $V(s_t) \leftarrow E_{\pi}\{r_{t+1} + \gamma V(s_{t+1})\}$, a full backup over every action and every possible successor state, using the current estimate $V(s_{t+1})$.
31
Simplest TD method, TD(0) (backup diagram omitted): $V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$, a sample backup along the single transition actually experienced.
32
RL Algorithms – Prediction
Policy Evaluation (the prediction problem): for a given policy, compute the state-value function. No knowledge of P and R is required, but access to the real system, or to a "sample" model, is assumed. Uses "bootstrapping" and sampling.
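A minimal tabular TD(0) sketch under these assumptions: no model of P and R, only sampled transitions from a hypothetical `env` whose `step(action)` returns `(next_state, reward, done)`, with `select_action` standing in for the fixed policy being evaluated. All names are illustrative, not an API from the slides.

```python
from collections import defaultdict

def td0_prediction(env, select_action, episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = select_action(state)            # follow the policy being evaluated
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # bootstrapped sample backup
            state = next_state
    return V
```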
33
Advantages of TD: TD methods do not require a model of the environment, only experience. TD methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome at all, from incomplete sequences.
34
RL Algorithms – Control
SARSA (on-policy): $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]$. Q-learning (off-policy): $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \big]$.
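A sketch of tabular Q-learning with ε-greedy exploration, again assuming a hypothetical `env` with `reset()` and `step(action)` returning `(next_state, reward, done)` and a discrete action list; the comment marks the one line that would change for SARSA.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                      # Q[(state, action)]

    def eps_greedy(state):
        if random.random() < eps:
            return random.choice(actions)                      # explore
        return max(actions, key=lambda a: Q[(state, a)])       # exploit

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = eps_greedy(state)
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            # SARSA would instead bootstrap from the action actually taken next,
            # i.e. Q[(next_state, eps_greedy(next_state))].
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```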
35
Cliffwalking example: both methods use an ε-greedy policy with ε = 0.1 (figure omitted: Q-learning learns the optimal path along the cliff edge, while SARSA learns the longer but safer path away from the cliff).
36
Actor-Critic Methods: explicit representation of the policy as well as the value function; minimal computation to select actions; can learn an explicit stochastic policy; can put constraints on policies; appealing as psychological and neural models.
37
Actor-Critic details: the critic computes the TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ and updates $V(s_t) \leftarrow V(s_t) + \alpha \delta_t$; the actor adjusts its action preferences, $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$, and the policy is a softmax (Gibbs) distribution over the preferences.
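A rough tabular sketch along these lines; the function and variable names and the step sizes are illustrative assumptions. The critic's TD error drives both the value update and the actor's action preferences, and actions are drawn from a softmax over those preferences.

```python
import math
import random
from collections import defaultdict

V = defaultdict(float)        # critic: tabular state-value estimates
prefs = defaultdict(float)    # actor: action preferences p(s, a) -> softmax policy

def softmax_action(actions, s):
    """Sample an action from the softmax (Gibbs) policy for state s."""
    weights = [math.exp(prefs[(s, a)]) for a in actions]
    x, acc = random.random() * sum(weights), 0.0
    for a, w in zip(actions, weights):
        acc += w
        if x <= acc:
            return a
    return actions[-1]

def actor_critic_update(s, a, r, s2, done, alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    """One update from the transition (s, a, r, s2)."""
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # critic's TD error
    V[s] += alpha_v * delta                               # critic update
    prefs[(s, a)] += alpha_p * delta                      # actor: adjust preference for a in s
    return delta
```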
38
Applications of RL: robot navigation; adaptive control (helicopter piloting!); combinatorial optimization (VLSI placement and routing, elevator dispatching); game playing (backgammon – world's best player!); computational neuroscience (modeling of reward processes).
39
Other Topics. Different measures of return: average rewards, discounted returns. Policy gradient approaches: directly perturb policies. Generalization [Saturday]: function approximation, temporal abstraction. Least-squares methods: better use of data, more suited for "off-line" RL.
40
References
Mitchell, T. Machine Learning. McGraw-Hill, 1997.
Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press.
Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific.