Reinforcement Learning: Learning from Interaction Winter School on Machine Learning and Vision, 2010 B. Ravindran Many slides adapted from Sutton and Barto
Intro to RL 2 Learning to Control So far looked at two models of learning –Supervised: Classification, Regression, etc. –Unsupervised: Clustering, etc. How did you learn to cycle? –Neither of the above –Trial and error! –Falling down hurts!
Intro to RL 4 Reinforcement Learning A trial-and-error learning paradigm –Rewards and Punishments Not just an algorithm but a new paradigm in itself Learn about a system –behaviour –control from minimal feedback Inspired by behavioural psychology
5 RL Framework Learn from close interaction Stochastic environment Noisy delayed scalar evaluation Maximize a measure of long term performance [Diagram: the Agent sends an Action to the Environment; the Environment returns a State and a scalar evaluation]
Intro to RL 6 Not Supervised Learning! Very sparse supervision No target output provided No error gradient information available Action chooses next state Explore to estimate gradient –Trial-and-error learning [Diagram: supervised learner maps Input to Output; comparison with a Target produces an Error signal]
Intro to RL 7 Not Unsupervised Learning Sparse supervision available Pattern detection not the primary goal [Diagram: learner maps Input to Activation and receives only an Evaluation]
Intro to RL 8 TD Gammon Tesauro 1992, 1994, 1995,... White has just rolled a 5 and a 2, so can move one of his pieces 5 steps and one (possibly the same) 2 steps Objective is to advance all pieces to points 19-24; hitting is allowed 30 pieces and 24 locations imply an enormous number of configurations Effective branching factor of 400
9 The Agent-Environment Interface Agent and environment interact at discrete time steps, producing the trajectory ..., s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...
Intro to RL 10 The Agent Learns a Policy Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.
Intro to RL 11 Goals and Rewards Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent's direct control, thus outside the agent. The agent must be able to measure success: –explicitly; –frequently during its lifespan.
Intro to RL 12 Returns Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. where T is a final time step at which a terminal state is reached, ending an episode.
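The return formulas missing from this slide are the standard ones from Sutton and Barto. For an episodic task the return is the sum of rewards up to the end of the episode,

\[ R_t = r_{t+1} + r_{t+2} + \cdots + r_T, \]

and for continuing tasks the discounted return

\[ R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1, \]

where the discount rate \(\gamma\) makes rewards arriving later count for less.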
Intro to RL 13 The Markov Property the state at step t, means whatever information is available to the agent at step t about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:
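Formally, the Markov property referred to here requires that the one-step dynamics depend only on the current state and action, not on the full history:

\[ \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t\} \]

for all \(s'\), \(r\), and all possible histories.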
Intro to RL 14 Markov Decision Processes If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: –state and action sets –one-step dynamics defined by transition probabilities: –reward expectations:
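In the standard notation, the one-step dynamics and reward expectations are

\[ P^a_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s,\, a_t=a\}, \qquad R^a_{ss'} = E\{r_{t+1} \mid s_t=s,\, a_t=a,\, s_{t+1}=s'\}, \]

for all states \(s, s'\) and actions \(a\).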
Intro to RL 15 Recycling Robot An Example Finite MDP At each step, robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad). Decisions made on basis of current energy level: high, low. Reward = number of cans collected
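The slide fixes the structure of this MDP but not the numbers, so the sketch below encodes the robot's dynamics as a transition table with hypothetical values: the battery probabilities `ALPHA`, `BETA`, the expected cans per step `R_SEARCH`, `R_WAIT`, and the rescue penalty are all assumptions, not from the slide.

```python
# Encoding of the recycling-robot finite MDP described above.
# ALPHA, BETA and all reward values are hypothetical, chosen for illustration.
ALPHA, BETA = 0.9, 0.6        # P(battery stays high / stays low) while searching
R_SEARCH, R_WAIT = 2.0, 1.0   # expected cans collected per step
R_RESCUE = -3.0               # penalty for running flat and being rescued

# P[state][action] -> list of (probability, next_state, reward)
P = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait":   [(1.0, "high", R_WAIT)],
    },
    "low": {
        # running out of power while searching means a rescue back to "high"
        "search":   [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait":     [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

# Sanity check: outcome probabilities sum to 1 for every (state, action) pair
for actions in P.values():
    for outcomes in actions.values():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12
```

Note how the energy level is the entire state: that is what makes the Markov property hold for this model.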
Intro to RL 17 The value of a state is the expected return starting from that state; depends on the agent's policy π: The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following π: Value Functions
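The two definitions referred to on this slide, in standard notation:

\[ V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty}\gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}, \]

\[ Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s,\, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty}\gamma^k r_{t+k+1} \,\Big|\, s_t = s,\, a_t = a\Big\}. \]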
Intro to RL 18 Bellman Equation for a Policy The basic idea: So: Or, without the expectation operator:
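The equations missing from this slide follow the standard development. The basic idea is that the return telescopes, \(R_t = r_{t+1} + \gamma R_{t+1}\), so

\[ V^{\pi}(s) = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}, \]

and, written out without the expectation operator,

\[ V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^{\pi}(s')\big]. \]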
Intro to RL 19 For finite MDPs, policies can be partially ordered: There is always at least one policy (possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*. Optimal policies share the same optimal state-value function: Optimal Value Functions
Intro to RL 20 Bellman Optimality Equation The value of a state under an optimal policy must equal the expected return for the best action from that state: V* is the unique solution of this system of nonlinear equations.
Intro to RL 21 Bellman Optimality Equation Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy: Q* is the unique solution of this system of nonlinear equations.
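The two Bellman optimality equations referred to on these slides, in standard form:

\[ V^{*}(s) = \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^{*}(s')\big], \]

\[ Q^{*}(s,a) = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma \max_{a'} Q^{*}(s',a')\big]. \]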
Intro to RL 22 Dynamic Programming DP is the solution method of choice for MDPs –Requires complete knowledge of system dynamics (P and R) –Expensive and often not practical –Curse of dimensionality –Guaranteed to converge! RL methods: online approximate dynamic programming –No knowledge of P and R –Sample trajectories through state space –Some theoretical convergence analysis available
Intro to RL 23 Policy Evaluation Policy Evaluation: for a given policy, compute the state-value function Recall:
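A minimal sketch of iterative policy evaluation: the Bellman equation for \(V^{\pi}\) is swept as an update rule until the value function stops changing. The two-state MDP below is invented purely for illustration.

```python
# Iterative policy evaluation on an invented two-state MDP.
# P[s][a] -> list of (probability, next_state, reward)
GAMMA = 0.9
P = {
    0: {"a": [(1.0, 1, 0.0)]},
    1: {"a": [(0.5, 0, 1.0), (0.5, 1, 1.0)]},
}
policy = {0: "a", 1: "a"}      # the (deterministic) policy being evaluated

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:                # in-place (Gauss-Seidel style) sweep
        v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:          # stop when a full sweep changes almost nothing
        break
```

Because the update is a contraction (for γ < 1), the sweeps converge to the unique fixed point of the Bellman equation for this policy.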
Intro to RL 24 Policy Improvement Suppose we have computed for a deterministic policy. For a given state s, would it be better to do an action ?
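The test on this slide is: switching to action \(a\) in state \(s\) helps whenever \(Q^{\pi}(s,a) > V^{\pi}(s)\). Acting greedily everywhere gives the improved policy

\[ \pi'(s) = \arg\max_a Q^{\pi}(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^{\pi}(s')\big], \]

which, by the policy improvement theorem, is at least as good as \(\pi\), and strictly better unless \(\pi\) is already optimal.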
Intro to RL 28 Value Iteration Recall the Bellman optimality equation: We can convert it to a full value iteration backup: Iterate until convergence
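The backup above, turned into code: a sketch of value iteration on an invented two-state MDP (all transition and reward numbers are made up), with the greedy policy read off at the end.

```python
# Value iteration: apply the Bellman optimality backup until convergence.
# Invented MDP: "right" earns reward, "left" returns to state 0 for nothing.
GAMMA = 0.9
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 1.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 2.0)]},
}

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        v = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())   # max over actions
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:
        break

# Greedy policy with respect to the converged value function
policy = {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
```

Unlike policy iteration, no explicit policy is maintained during the sweeps; the max over actions inside the backup plays the role of the improvement step.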
Intro to RL 29 Generalized Policy Iteration Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. A geometric metaphor for convergence of GPI:
Intro to RL 30 Dynamic Programming [Backup diagram: DP backs up the value of a state from all possible actions and all possible successor states at every step]
Intro to RL 32 RL Algorithms – Prediction Policy Evaluation (the prediction problem): for a given policy, compute the state-value function. No knowledge of P and R, but access to the real system, or a sample model assumed. Uses bootstrapping and sampling
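As a concrete sketch of "bootstrapping and sampling": the tabular TD(0) update is V(s) ← V(s) + α[r + γV(s') − V(s)], applied to sampled transitions only. The two-state environment below is invented; its `step` function stands in for the real system, which the agent can sample but not inspect.

```python
import random

# Tabular TD(0) prediction: learn V for a fixed policy from samples alone.
random.seed(0)
GAMMA, ALPHA = 0.9, 0.1

def step(s):
    """Sample (next_state, reward) under the policy being evaluated."""
    if s == 0:
        return 1, 0.0
    return (0, 1.0) if random.random() < 0.5 else (1, 1.0)

V = [0.0, 0.0]
s = 0
for _ in range(50000):
    s2, r = step(s)
    V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])   # the TD(0) update
    s = s2

# For these dynamics the true values are about V = [6.21, 6.90];
# the estimates hover near them (noisy, since ALPHA stays constant).
```

No P or R appears anywhere: the target r + γV(s') is both a sample (of the transition) and a bootstrap (off the current estimate V(s')).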
Intro to RL 33 Advantages of TD TD methods do not require a model of the environment, only experience TD methods can be fully incremental –You can learn before knowing the final outcome, with less memory and less peak computation –You can learn without the final outcome, from incomplete sequences
Intro to RL 34 RL Algorithms – Control SARSA Q-learning
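Both methods learn Q from sampled transitions; SARSA bootstraps from the action actually taken next (on-policy), while Q-learning bootstraps from the greedy action (off-policy). A minimal Q-learning sketch on an invented deterministic two-state MDP, with epsilon-greedy exploration:

```python
import random

# Q-learning: off-policy TD control with epsilon-greedy exploration.
random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
ACTIONS = ("left", "right")

def step(s, a):
    """Invented dynamics: 'left' goes to state 0 for 0 reward;
    'right' goes to state 1, paying 1 from state 0 and 2 from state 1."""
    if a == "left":
        return 0, 0.0
    return (1, 1.0) if s == 0 else (1, 2.0)

Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
s = 0
for _ in range(20000):
    if random.random() < EPS:                      # explore
        a = random.choice(ACTIONS)
    else:                                          # exploit
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
    s2, r = step(s, a)
    # Bootstrap from the best action in s2, not the one taken next;
    # SARSA would use the actually-taken next action here instead.
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
                          - Q[(s, a)])
    s = s2
```

Because the bootstrap target uses the greedy action, Q-learning estimates Q* regardless of the exploration policy being followed; that is what makes it off-policy.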
Intro to RL 36 Actor-Critic Methods Explicit representation of policy as well as value function Minimal computation to select actions Can learn an explicit stochastic policy Can put constraints on policies Appealing as psychological and neural models
Intro to RL 38 Applications of RL Robot navigation Adaptive control –Helicopter pilot! Combinatorial optimization –VLSI placement and routing, elevator dispatching Game playing –Backgammon – world's best player! Computational Neuroscience –Modeling of reward processes
Intro to RL 39 Other Topics Different measures of return –Average rewards, Discounted Returns Policy gradient approaches –Directly perturb policies Generalization [Saturday] –Function approximation –Temporal abstraction Least square methods –Better use of data –More suited for off-line RL
Intro to RL 40 References Mitchell, T. Machine Learning. McGraw Hill. 1997. Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education. 2000. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press. 1998. Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press. 2001. Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific. 1996.