1 Reinforcement Learning Problem Week #3

2 Reinforcement Learning Loop [Figure: the agent-environment interaction loop; the agent sends an action to the environment, and the environment responds with a state and a reward. Reproduced from the figure on page 52 in reference [1].]

3 Agent – Environment Interaction The figure on slide 2 represents the interaction between agent and environment. The agent takes an action a_t at discrete time t. Upon perceiving the action, the environment communicates two signals to the agent at time t+1: –a reward, r_{t+1} ∈ ℝ –some representation of the environment's state, s_{t+1}.
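A minimal Python sketch of this interaction loop follows; the Agent/Environment interfaces and the method names reset, step, and select_action are illustrative assumptions, not part of the slides.

```python
# Sketch of the agent-environment interaction loop described above.
# The interfaces and method names are illustrative assumptions.

class Environment:
    def reset(self):
        """Return the initial state s_0."""
        raise NotImplementedError

    def step(self, action):
        """Apply action a_t; return (r_{t+1}, s_{t+1}, done)."""
        raise NotImplementedError


class Agent:
    def select_action(self, state):
        """Choose a_t given the current state s_t."""
        raise NotImplementedError


def run_episode(agent, env, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.select_action(state)        # agent takes a_t
        reward, state, done = env.step(action)     # environment returns r_{t+1} and s_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```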

4 Time Domain Time is discrete here; the framework may be extended to continuous time [2]. The time steps over which reinforcement learning occurs need not be equally spaced. Time steps may actually be event-based; that is, each time step corresponds to a successive stage of some decision-making process.

5 Reward: Response of Environment The reward is the response of the environment to the last action taken by the agent. It is a numerical signal r_t that represents the extent to which the environment found the agent's selection (i.e., its current action) favorable or unfavorable. It may represent a positive or negative response either as a crisp (binary) measure {0,1}, as a finite set of values within a certain interval {a,...,b}, or as a continuous function (infinitely many values) with a certain range [a,b].

6 Assignment of Reward Reward is used in RL problems to make the agent move directly toward its goal. This is one of the features that distinguishes RL from other learning schemes. For the agent to learn goal-directed behavior, rewards should be given only for actions that really do lead to the goal. There may be subgoals that help the agent reach its goal but do not reliably result in the primary goal. Assigning rewards to such subgoals may cause the agent to learn to maximize its long-run reward by reaching these subgoals rather than its main goal.

7 Representation of Environment's State A state s_t ∈ S is the situation/position the environment is in at time t. Depending on the RL task, a state manifests the location of the environment within the domain. Examples may range from –low-level readings such as the "angular position of the rotor in an electric motor" to –high-level modes of the environment such as some "tendency of a society to feel pessimistic, optimistic, lost or happy..."

8 Representation of Environment's State Given the previous examples of state representation, a wide spectrum of methods exists to represent the states of the environment: numerically in a single- or multi-dimensional space, using strings, sets of tuples, etc. In general, any information that contributes to the agent's decision-making can be considered part of the state. Most of the time the state information is not clearly perceived by the agent; it is noisy or imperfect. Hence, instead of exact state information, the agent perceives some representation of the state, i.e., an observation implying some state.

9 Action: Decision of Agent An action is a move the agent selects in order to better understand/control the environment. By selecting actions, the agent attempts to anticipate/predict the behavior of the environment and fine-tune its decisions so as to maximize a cumulative criterion. "An action is any decision that agent wants to learn how to make" (p. 53, [1]).

10 Separation of Environment from Agent The rule for distinguishing the environment from the agent: anything under the absolute control of the agent is considered part of the agent. Anything not under the absolute control of the agent (i.e., anything it cannot arbitrarily change) is considered to be outside of it and hence part of the environment.

11 RL Framework A considerable abstraction of the problem of goal-directed learning from interaction. It asserts that –any problem of goal-directed behavior can be reduced to three signals communicated between an agent and its environment: the choices made by the agent (actions), the basis on which those choices are made (states), and the signals that define the agent's goal (rewards).

12 Goals & Rewards Goals are what we want the agent to achieve. Rewards are used to define the goal, or to direct the agent toward its goal, in such a way that maximizing reward in the long run becomes the agent's goal.

13 Goals & Rewards...(2) Examples –while having a mouse agent learn to find the cheese through a maze, it may be given a reward of +1 when it finds the cheese and 0 until then. –to make it learn to find the cheese as quickly as possible, it may instead be rewarded negatively (with -1, for instance) for each unit of time that passes; a small sketch of such a reward function follows below. –A tic-tac-toe playing agent may be rewarded only when it wins. –Discussion: how should a chess-playing agent be rewarded?
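A small sketch of a reward function for the mouse-and-cheese example above; the grid coordinates and the name CHEESE_CELL are hypothetical.

```python
# Illustrative reward function for the mouse-and-cheese maze example above.
# The state encoding (cell coordinates) and CHEESE_CELL are assumptions.

CHEESE_CELL = (4, 7)  # hypothetical goal cell

def maze_reward(next_state):
    """+1 when the cheese is reached; -1 per time step otherwise,
    which pushes the agent to find the cheese as quickly as possible."""
    return 1.0 if next_state == CHEESE_CELL else -1.0
```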

14 Returns The return is the long-term (accumulated) reward. The agent seeks to maximize the expected return. The return may be defined simply as the sum of the individual rewards, R_t = r_{t+1} + r_{t+2} + ... + r_T. An alternative definition, in which rewards received farther in the future have less effect on the return, is the discounted return: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where 0 ≤ γ ≤ 1 is the discount rate.
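A short Python sketch of computing a discounted return from a finite reward sequence; the function name and the example rewards are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Return R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: three rewards received after time t
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*5.0 = 5.05
```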

15 Markov Property The state is whatever information about the environment is available to the agent. The state provides current and past information on the environment. The state does not, and should not be expected to, provide every detail of knowledge about the environment.

16 Markov Property... (2) This implies that, due to the nature of the task, some information may be unavailable (i.e., hidden) to the agent most of the time, and this information may or may not be valuable in solving the learning problem. Therefore, each environment state s_t may be represented on the agent's side by whatever observation o_t the agent perceives.

17 Markov Property... (3) Hence, the agent should be judged not on whether it knows information that was never available to it, but rather on whether it has forgotten valuable information that it once knew.

18 Markov Property The state information should ideally summarize past inputs while preserving all relevant information. A state with that property is said to be Markov, or to have the Markov property.

19 Markov state A general environment, whose dynamics may depend on the entire history of states, actions, and rewards, has a conditional state probability distribution of the form Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0 }. For a first-order Markovian environment, this conditional state probability distribution depends only upon the current state and action: Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }.

20 Markov Decision Processes (MDPs) An RL task that satisfies the Markov property is called a Markov decision process (MDP). An MDP with finite state and action sets is called a finite MDP. Transition probability: P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }. Expected next reward: R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }. These two quantities completely specify the dynamics of a finite MDP.
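As a sketch of how these two quantities might be tabulated, here is a toy two-state finite MDP in Python; the state names, action names, and all numbers are invented for illustration.

```python
# Toy finite MDP: the states, actions, and all numbers below are invented for illustration.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# Transition probabilities: P[s][a][s2] = Pr{ s_{t+1} = s2 | s_t = s, a_t = a }
P = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# Expected next rewards: R[s][a][s2] = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s2 }
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 0.0}, "move": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 0.5}, "move": {"s0": 0.0, "s1": 0.0}},
}
```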

21 Value Functions Value functions are functions of states (V^π(s)) or of state-action pairs (Q^π(s,a)) that give the agent a basis for deciding "how good" it is to be in a state s, or to select an action a in a specific state s, respectively, under a specific policy π. State-value function for policy π: V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }. Action-value function for policy π: Q^π(s,a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }.
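One way to make these expectations concrete is a Monte Carlo estimate of V^π(s): roll out episodes under π and average the sampled discounted returns. The sketch below assumes nested-dict tables P and R in the form of the toy MDP above and a stochastic policy with policy[s][a] = π(s,a); all names are illustrative.

```python
import random

def sample_return(s0, P, R, policy, gamma=0.9, horizon=100):
    """Roll out one episode from s0 under the stochastic policy and
    return the sampled discounted return (truncated at `horizon` steps)."""
    s, g, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(policy[s]), weights=list(policy[s].values()))[0]
        s_next = random.choices(list(P[s][a]), weights=list(P[s][a].values()))[0]
        g += discount * R[s][a][s_next]
        discount *= gamma
        s = s_next
    return g

def mc_state_value(s0, P, R, policy, gamma=0.9, n_episodes=1000):
    """Monte Carlo estimate of V^pi(s0) = E_pi{ R_t | s_t = s0 }."""
    returns = [sample_return(s0, P, R, policy, gamma) for _ in range(n_episodes)]
    return sum(returns) / n_episodes
```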

22 Recursive Relationships of Value Functions For a specific state, value functions satisfy the following recursive (Bellman) property: V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ].
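This Bellman equation can be turned directly into iterative policy evaluation: repeatedly replace V(s) by the right-hand side until the values stop changing. The sketch below assumes the same nested-dict table format as the toy MDP above.

```python
def policy_evaluation(states, actions, P, R, policy, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman equation for V^pi until the values converge.

    P[s][a][s2]: transition probability, R[s][a][s2]: expected next reward,
    policy[s][a]: probability of taking a in s (all nested dicts, as sketched above).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy[s][a] * sum(
                    P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in states
                )
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V
```

For example, the uniform random policy on the toy MDP above can be evaluated with policy = {s: {a: 0.5 for a in ACTIONS} for s in STATES} and V = policy_evaluation(STATES, ACTIONS, P, R, policy).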

23 Better and Optimal Value Functions A policy π_i is defined to be better than or equal to a policy π_j (π_i ≥ π_j) if V^{π_i}(s) ≥ V^{π_j}(s) for all s ∈ S. The policy π* is an optimal policy if V^{π*}(s) = V*(s) = max_π V^π(s) for all s ∈ S, or Q^{π*}(s,a) = Q*(s,a) = max_π Q^π(s,a) for all s ∈ S and a ∈ A(s).
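A related sketch is value iteration, which computes V* by replacing the expectation over the policy with a max over actions and then reads off a greedy (optimal) policy; the same assumed nested-dict table format as above is used, and all names are illustrative.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Compute the optimal state-value function V* and a greedy (optimal) policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy with respect to V*
    policy = {
        s: max(
            actions,
            key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states),
        )
        for s in states
    }
    return V, policy
```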

24 References [1] Sutton, R. S. and Barto, A. G., "Reinforcement Learning: An Introduction," MIT Press, 1998. [2] Bertsekas, D. P. and Tsitsiklis, J. N., "Neuro-Dynamic Programming," Athena Scientific, 1996.