LECTURE 21: REINFORCEMENT LEARNING. Objectives: Definitions and Terminology; Agent-Environment Interaction; Markov Decision Processes; Value Functions; Bellman Equations. Resources: RSAB/TO: The RL Problem; AM: RL Tutorial; RKVM: RL SIM; TM: Intro to RL; GT: Active Speech. URL: .../publications/courses/ece_8443/lectures/current/lecture_21.ppt

Attributions These slides were originally developed by R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. (They have been reformatted and slightly annotated to better integrate them into this course.) The original slides have been incorporated into many machine learning courses, including Tim Oates' Introduction to Machine Learning, which contains links to several good lectures on various topics in machine learning (and is where I first found these slides). A slightly more advanced version of the same material is available as part of Andrew Moore's excellent set of statistical data mining tutorials. The objectives of this lecture are: describe the RL problem; present an idealized form of the RL problem for which we have precise theoretical results; introduce the key components of the mathematics, value functions and Bellman equations; and describe the trade-offs between applicability and mathematical tractability.

The Agent-Environment Interface Agent and environment interact at a sequence of discrete time steps: at each step t the agent observes state $s_t$, selects action $a_t$, and one step later receives reward $r_{t+1}$ and finds itself in the next state $s_{t+1}$, producing the trajectory $\dots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, \dots$ [Figure: the standard agent-environment interaction loop.]
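
The interaction loop above can be sketched in a few lines of Python. This is not part of the original slides; the `env.reset()` / `env.step()` / `policy()` interface below is a hypothetical minimal one, used only to make the $s_t$, $a_t$, $r_{t+1}$ bookkeeping concrete.

```python
# Minimal sketch of the agent-environment loop (hypothetical interfaces).
# `env.reset()` returns an initial state; `env.step(a)` returns
# (next_state, reward, done); `policy(s)` returns an action.

def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the list of (s_t, a_t, r_{t+1}) tuples."""
    trajectory = []
    s = env.reset()                      # s_0
    for t in range(max_steps):
        a = policy(s)                    # a_t chosen in state s_t
        s_next, r, done = env.step(a)    # environment emits r_{t+1} and s_{t+1}
        trajectory.append((s, a, r))
        s = s_next
        if done:                         # terminal state reached
            break
    return trajectory
```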

The Agent Learns A Policy Definition of a policy: the policy at step t, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$. Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run. Learning can occur in several ways: adaptation of classifier parameters based on prior and current data (e.g., many help systems now ask you "was this answer helpful to you?"); selection of the most appropriate next training pattern during classifier training (e.g., active learning). Common algorithm design issues include rate of convergence, bias vs. variance, adaptation speed, and batch vs. incremental adaptation. A concrete sketch of a stochastic policy is given below.
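
As an illustration of a policy as a mapping from states to action probabilities, the following sketch (not from the slides) stores $\pi(s,a)$ in a nested dictionary and samples an action accordingly. The state and action names, and the probabilities, are made up for illustration.

```python
import random

# pi[s][a] = probability of taking action a in state s (illustrative values).
pi = {
    "high": {"search": 0.7, "wait": 0.3},
    "low":  {"search": 0.1, "wait": 0.5, "recharge": 0.4},
}

def sample_action(pi, s):
    """Draw a_t ~ pi(s, .) for the current state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "low"))
```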

Getting the Degree of Abstraction Right Time steps need not refer to fixed intervals of real time (e.g., each new training pattern can be considered a time step). Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., a shift in focus of attention), etc. Actions can be rule-based (e.g., the user expresses a preference) or mathematics-based (e.g., assignment of a class or update of a probability). States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost"). States can be hidden or observable. A reinforcement learning (RL) agent is not the same thing as a whole animal or robot, which may consist of many RL agents as well as other components. The environment is not necessarily unknown to the agent, only incompletely controllable. Reward computation is placed in the agent's environment because the agent cannot change it arbitrarily.

Goals Is a scalar reward signal an adequate notion of a goal? Perhaps not, but it is surprisingly flexible. A goal should specify what we want to achieve, not how we want to achieve it. A goal must be outside the agent's direct control, and thus outside the agent. The agent must be able to measure success: explicitly, and frequently during its lifespan.

Returns Suppose the sequence of rewards after step t is $r_{t+1}, r_{t+2}, \dots$ What do we want to maximize? In general, we want to maximize the expected return, $E[R_t]$, for each step t, where $R_t = r_{t+1} + r_{t+2} + \dots + r_T$ and T is a final time step at which a terminal state is reached, ending an episode. (You can view this as a variant of the forward-backward calculation in HMMs.) Here episodic tasks denote a complete transaction (e.g., a play of a game, a trip through a maze, a phone call to a support line). Some tasks have no natural episode boundary and are considered continuing tasks. For these tasks, we define the discounted return $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma$ is the discount rate and $0 \le \gamma \le 1$. A $\gamma$ close to zero favors short-term returns (shortsighted), while a $\gamma$ close to 1 favors long-term returns. $\gamma$ can also be thought of as a "forgetting factor": since it is less than one, it weights near-term future rewards more heavily than longer-term future rewards.
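
A short numeric sketch, not from the slides, of the discounted return defined above: it computes $R_t = \sum_k \gamma^k r_{t+k+1}$ for a given reward sequence and shows how the choice of $\gamma$ trades off near-term against long-term reward.

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_k gamma**k * r_{t+k+1} for rewards = [r_{t+1}, r_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                   # a delayed payoff
print(discounted_return(rewards, 0.1))    # shortsighted: 0.01
print(discounted_return(rewards, 0.9))    # farsighted:   7.29
```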

Example Goal: get to the top of the hill as quickly as possible. Reward: -1 for each step taken while not at the top of the hill, so Return = -(number of steps). The return is maximized by minimizing the number of steps needed to reach the top of the hill. Other distinctions include deterministic versus dynamic tasks: the context for a task can change as a function of time (e.g., an airline reservation system). In such cases the time to solution might also be important (minimizing the number of steps as well as maximizing the overall return).

A Unified Notation In episodic tasks, we number the time steps of each episode starting from zero. We usually do not need to distinguish between episodes, so we write $s_t$ instead of $s_{t,j}$ for the state at step t of episode j. Think of each episode as ending in an absorbing state that always produces a reward of zero. We can then cover both the episodic and continuing cases by writing $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma = 1$ is allowed only if a zero-reward absorbing state is always reached. [Diagram: $s_0 \to s_1 \to s_2 \to$ absorbing state, with $r_1 = +1$, $r_2 = +1$, $r_3 = +1$, $r_4 = 0$, $r_5 = 0$, ...]
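
The equivalence claimed above can be checked numerically: padding an episode with a zero-reward absorbing state leaves the (discounted) return unchanged, so the same summation covers episodic and continuing tasks. This snippet is purely illustrative and reuses the discounted_return helper sketched earlier.

```python
def discounted_return(rewards, gamma):
    # Same helper as in the previous sketch.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [1, 1, 1]                    # r_1 = r_2 = r_3 = +1, as in the diagram above
padded  = episode + [0] * 100          # absorbing state keeps emitting reward 0

for gamma in (1.0, 0.9):
    assert abs(discounted_return(episode, gamma)
               - discounted_return(padded, gamma)) < 1e-12
    print(gamma, discounted_return(padded, gamma))
```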

Markov Decision Processes "The state" at step t refers to whatever information is available to the agent at step t about its environment. The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations. Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov property: $P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0) = P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t)$ for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0$. If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP, which consists of the usual parameters for a Markov model: the transition probabilities, given by $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, and a slightly modified expression for the output probability at a state, called the reward probability in this context: $R^a_{ss'} = E[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$.

Example: Recycling Robot At each step, the robot has to decide whether it should actively search for a can, wait for someone to bring it a can, or go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.

A Recycling Robot MDP
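
For concreteness, here is a sketch (not from the slides) of how the recycling-robot MDP can be written down as finite transition and reward tables, in the spirit of Sutton and Barto's formulation with parameters alpha, beta, r_search, r_wait and a penalty for being rescued. All numeric values below are illustrative placeholders.

```python
alpha, beta = 0.8, 0.6          # illustrative probabilities of the battery staying charged
r_search, r_wait = 2.0, 1.0     # illustrative expected cans collected per step

# P[s][a] = {s_next: probability}, R[s][a] = {s_next: expected reward}
P = {
    "high": {"search": {"high": alpha, "low": 1 - alpha},
             "wait":   {"high": 1.0}},
    "low":  {"search": {"low": beta, "high": 1 - beta},   # 1 - beta: battery dies, robot is rescued
             "wait":   {"low": 1.0},
             "recharge": {"high": 1.0}},
}
R = {
    "high": {"search": {"high": r_search, "low": r_search},
             "wait":   {"high": r_wait}},
    "low":  {"search": {"low": r_search, "high": -3.0},   # illustrative rescue penalty
             "wait":   {"low": r_wait},
             "recharge": {"high": 0.0}},
}

# Sanity check: transition probabilities sum to one for every (s, a).
for s in P:
    for a in P[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```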

Value Functions The value of a state is the expected return starting from that state; it depends on the agent's policy. The state-value function for policy $\pi$ is $V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$. The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$: the action-value function $Q^\pi(s,a) = E_\pi[R_t \mid s_t = s, a_t = a] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$.
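
One way to make the definition of $V^\pi$ concrete is a Monte Carlo estimate: average the observed returns $R_t$ over many episodes that start in s and follow $\pi$. The sketch below is illustrative and not part of the original slides; the callable `sample_returns` is a hypothetical helper that runs one episode under the policy.

```python
def mc_state_value(sample_returns, s, gamma, n_episodes=1000):
    """Monte Carlo estimate of V^pi(s): the average of sampled returns from s.

    `sample_returns(s)` is assumed to run one episode under pi starting
    from s and return the reward sequence [r_{t+1}, r_{t+2}, ...].
    """
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_returns(s)
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes
```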

Bellman Equation for a Policy The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = r_{t+1} + \gamma R_{t+1}$. So $V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s]$. Or, without the expectation operator: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$. This is a set of linear equations, one for each state, and it reduces the problem of evaluating a policy, and ultimately of finding the optimal actions, to a search over a graph of backups.
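
Because the Bellman equation for $V^\pi$ is linear, with one equation per state, a small finite MDP can be evaluated exactly by solving the linear system $(I - \gamma P_\pi)V = r_\pi$. The NumPy sketch below is illustrative; the arrays P_pi and r_pi are assumed to have already been computed from the policy and the MDP's transition and reward tables.

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V exactly.

    P_pi[i, j] = probability of moving from state i to state j under policy pi,
    r_pi[i]    = expected one-step reward from state i under pi.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny illustrative two-state chain:
P_pi = np.array([[0.9, 0.1],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])
print(evaluate_policy(P_pi, r_pi, gamma=0.9))
```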

Optimal Value Functions For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s$. There is always at least one (and possibly many) policy that is better than or equal to all the others; this is an optimal policy, and we denote them all $\pi^*$. Optimal policies share the same optimal state-value function, $V^*(s) = \max_\pi V^\pi(s)$ for all $s$. Optimal policies also share the same optimal action-value function, $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ for all $s$ and $a$. This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.
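
The last point has a direct computational reading: once $Q^*$ is known, an optimal (greedy) policy simply picks $\arg\max_a Q^*(s,a)$ in each state. A minimal sketch, with Q stored as a nested dict of made-up illustrative values:

```python
def greedy_policy(Q):
    """Return the deterministic policy pi*(s) = argmax_a Q(s, a)."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

Q_star = {"high": {"search": 8.1, "wait": 6.3},
          "low":  {"search": 4.0, "wait": 5.2, "recharge": 6.0}}
print(greedy_policy(Q_star))   # {'high': 'search', 'low': 'recharge'}
```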

Bellman Optimality Equation for V* The value of a state under an optimal policy must equal the expected return for the best action from that state: $V^*(s) = \max_a Q^{\pi^*}(s,a) = \max_a E[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a] = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$. $V^*$ is the unique solution of this system of nonlinear equations. The corresponding equation for action values is again obtained through the maximization process: $Q^*(s,a) = E[r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a] = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$. $Q^*$ is the unique solution of this system of nonlinear equations.

Solving the Bellman Optimality Equation Finding an optimal policy by solving the Bellman optimality equation requires: accurate knowledge of the environment dynamics; enough space and time to do the computation; and the Markov property. How much space and time do we need? Polynomial in the number of states (via dynamic programming methods); BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states). We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.
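
The dynamic-programming route mentioned above can be illustrated with value iteration, which turns the Bellman optimality equation into the fixed-point update $V(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}[R^a_{ss'} + \gamma V(s')]$. The sketch below assumes the nested-dict P and R tables from the recycling-robot snippet earlier; it is a minimal illustration, not a production solver.

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```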

Q-Learning* Q-learning is a reinforcement learning technique that works by learning an action-value function giving the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The core of the algorithm is a simple value-iteration update. For each state s in the state set S and each action a in the action set A, we update the expected discounted reward with $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$, where $r_{t+1}$ is the reward observed after taking action $a_t$ in state $s_t$, the $\alpha_t(s,a)$ are learning rates with $0 \le \alpha_t(s,a) \le 1$, and $\gamma$ is the discount factor with $0 \le \gamma < 1$. This can be thought of as incrementally maximizing over a one-step lookahead, and it may not immediately produce the globally optimal solution. * From Wikipedia (http://en.wikipedia.org/wiki/Q-learning)
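
A tabular version of the update above, written out as code. This is a generic sketch with epsilon-greedy exploration and a constant learning rate, assuming the same hypothetical env.reset / env.step interface as the earlier loop sketch; it is not the specific implementation referenced by the slide.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative)."""
    Q = defaultdict(float)                       # Q[(s, a)], default value 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # explore
                a = random.choice(actions)
            else:                                # exploit the current estimate
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # No bootstrap from a terminal state.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```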

Summary Agent-environment interaction (states, actions, rewards); policy: a stochastic rule for selecting actions; return: the function of future rewards the agent tries to maximize; episodic and continuing tasks; the Markov property; Markov decision processes and transition probabilities; value functions: state-value and action-value functions for a policy, optimal state-value and action-value functions; optimal value functions and optimal policies; Bellman equations; the need for approximation; other forms of learning such as Q-learning.