Reinforcement Learning Introduction Presented by Alp Sardağ.


Supervised vs. Unsupervised Learning
Any situation in which both the inputs and outputs of a component can be perceived is called supervised learning. Learning when there is no hint at all about the correct outputs is called unsupervised learning. Reinforcement learning lies in between: the agent receives some evaluation of its action, but is not told the correct action.

Sequential Decision Problems
In single-decision problems, the utility of each action's outcome is well known. [Diagram: from state S_i, actions A_j1, ..., A_jn lead to states S_j1, ..., S_jn with utilities U_j1, ..., U_jn.] Choose the next action with the maximum utility, Max(U).

Sequential Decision Problems
In sequential decision problems, the agent's utility depends on a sequence of actions. The difference is that what is returned is not a single action but a policy, arrived at by calculating the utilities of the states.

Example
Terminal states: the environment terminates when the agent reaches one of the states marked +1 or -1.
Model: the set of probabilities associated with the possible transitions between states after any given action. The notation M^a_ij means the probability of reaching state j if action a is taken in state i. (Accessible environment, MDP: the next state depends only on the current state and action.)
The available actions (A): North, South, East and West.
P(IE | A) = 0.8; P(¬IE | A) = 0.2, where IE stands for the intended action taking effect.
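The following is a minimal sketch (not from the slides) of such a transition model in Python, for the 4x3 grid world suggested by the states (4,3) and (4,2) mentioned later. It assumes, as in the standard Russell & Norvig example, that the 0.2 "unintended" probability splits equally between the two directions perpendicular to the intended one, that blocked moves leave the agent in place, and that there is an obstacle at (2,2); those details are assumptions.

```python
# A minimal sketch of the transition model M^a_ij for the grid world above.
# Assumptions not stated on the slide: the 0.2 unintended probability is split
# equally between the two perpendicular directions, blocked moves stay in place,
# and the obstacle sits at (2, 2).

WIDTH, HEIGHT = 4, 3
WALLS = {(2, 2)}                       # hypothetical obstacle
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
MOVES = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}
PERPENDICULAR = {"North": ("East", "West"), "South": ("East", "West"),
                 "East": ("North", "South"), "West": ("North", "South")}

def step(state, direction):
    """Deterministic move; bumping into a wall or the boundary leaves the state unchanged."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT):
        return state
    return nxt

def transition_model(state, action):
    """Return {next_state: probability}, i.e. one row M^a_i. of the model."""
    probs = {}
    left, right = PERPENDICULAR[action]
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition_model((1, 1), "North"))   # {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}
```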

Model
[Figure: the transition model, obtained by simulation.]

Example
There is no utility for states other than the terminal states (T). We have to base the utility function on a sequence of states rather than on a single state, e.g. U_ex(s_1, ..., s_n) = -(1/25) n + U(T).
To select the next action: consider sequences as one long action and apply the basic maximum-expected-utility principle to sequences:
Max_a EU(A | I) = Max_a Σ_j M^a_ij U_j
– Result: the first action of the optimal sequence.
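As a small illustration of this rule, the sketch below scores each action by Σ_j M^a_ij U_j and picks the best one. It reuses the hypothetical transition_model from the previous sketch, and the utility values are made-up placeholders, not values from the slides.

```python
# Illustration of Max_a Σ_j M^a_ij U_j with made-up utility values.
utilities = {(4, 3): 1.0, (3, 3): 0.918, (3, 2): 0.660}    # assumed, not from the slides

def expected_utility(action, state, utilities, transition_model):
    """Σ_j M^a_ij U_j for one action: outcome utilities weighted by their probabilities."""
    return sum(p * utilities.get(s, 0.0)
               for s, p in transition_model(state, action).items())

def best_action(state, actions, utilities, transition_model):
    """Pick the action with the maximum expected utility."""
    return max(actions, key=lambda a: expected_utility(a, state, utilities, transition_model))

print(best_action((3, 3), ["North", "South", "East", "West"],
                  utilities, transition_model))            # -> 'East'
```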

Drawback
Consider the action sequence [North, East] starting from state (3,2). Because action outcomes are uncertain, the agent may not end up where the fixed sequence assumes; it is then better to calculate a utility function for each state.

Value Iteration
The basic idea is to calculate the utility of each state, U(state), and then use the state utilities to select an optimal action in each state.
Policy: a complete mapping from states to actions.
H(state, policy): the history tree starting from the state and taking actions according to the policy.
U(i) ≡ EU(H(i, policy*) | M) = Σ P(H(i, policy*) | M) U_h(H(i, policy*))

The Property of the Utility Function
For a utility function on states (U) to make sense, we require that the utility function on histories (U_h) have the property of separability:
U_h([s_0, s_1, ..., s_n]) = f(s_0, U_h([s_1, ..., s_n]))
The simplest form of separable utility function is additive:
U_h([s_0, s_1, ..., s_n]) = R(s_0) + U_h([s_1, ..., s_n])
where R is called the reward function. Note: additivity was implicit in our use of path-cost functions in heuristic search algorithms. The utility of a state is then the sum of the rewards from that state until a terminal state is reached.

Refresher
We have to base the utility function on a sequence of states rather than on a single state, e.g. U_ex(s_1, ..., s_n) = -(1/25) n + U(T).
In that case R(s_i) = -1/25 for non-terminal states, +1 for state (4,3), and -1 for state (4,2).

Utility of States
Given a separable utility function U_h, the utility of a state can be expressed in terms of the utilities of its successors:
U(i) = R(i) + max_a Σ_j M^a_ij U(j)
This equation is the basis of dynamic programming.

Dynamic Programming
There are two approaches:
– The first starts by calculating the utilities of all states at step n-1 in terms of the utilities of the terminal states, then at step n-2, and so on.
– The second approximates the utilities of states to any degree of accuracy using an iterative procedure. This approach is used because in most decision problems the environment histories are potentially of unbounded length.

Algorithm
Function DP(M, R) returns a utility function
Begin
  // Initialization
  U ← R; U' ← R
  Repeat
    U ← U'
    For each state i do
      U'[i] ← R[i] + max_a Σ_j M^a_ij U[j]
    End
  Until ||U' - U|| < ε
End
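Below is a minimal, runnable Python sketch of this value-iteration loop, not taken from the slides. It reuses WIDTH, HEIGHT, WALLS, TERMINALS and transition_model from the earlier grid-world sketch, uses the rewards stated on the slides (-1/25 per non-terminal state), and treats terminal states as absorbing with fixed utility.

```python
# A minimal, runnable sketch of the value-iteration loop above.
# Reuses the grid-world sketch (WIDTH, HEIGHT, WALLS, TERMINALS, transition_model);
# terminal states keep their reward as their utility and are not updated further.

ACTIONS = ["North", "South", "East", "West"]
STATES = [(x, y) for x in range(1, WIDTH + 1) for y in range(1, HEIGHT + 1)
          if (x, y) not in WALLS]
R = {s: TERMINALS.get(s, -1.0 / 25.0) for s in STATES}

def value_iteration(epsilon=1e-4):
    U = dict(R)                        # initialize U = R, as in the pseudocode
    while True:
        U_new = dict(U)
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:         # terminal states are absorbing
                continue
            best = max(sum(p * U[s2] for s2, p in transition_model(s, a).items())
                       for a in ACTIONS)
            U_new[s] = R[s] + best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon:
            return U

U = value_iteration()
print(round(U[(1, 1)], 3))             # utility estimate of the start state
```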

Policy
Policy function: policy*(i) = arg max_a Σ_j M^a_ij U(j)
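The greedy policy can then be read off the computed utilities, as in the hypothetical helper below (again reusing names from the earlier sketches).

```python
# Extracting the greedy policy policy*(i) = arg max_a Σ_j M^a_ij U(j)
# from the utilities computed above (a sketch, reusing the earlier definitions).

def extract_policy(U):
    policy = {}
    for s in STATES:
        if s in TERMINALS:
            continue
        policy[s] = max(ACTIONS,
                        key=lambda a: sum(p * U[s2]
                                          for s2, p in transition_model(s, a).items()))
    return policy

print(extract_policy(U)[(1, 1)])       # e.g. 'North' for the start state
```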

Reinforcement Learning
The task is to use rewards and punishments to learn a successful agent function (policy).
– This is difficult: the agent is never told which actions were right, nor which reward belongs to which action. The agent starts with no model and no utility function.
– In many complex domains, RL is the only feasible way to train a program to perform at high levels.

Example: An Agent Learning to Play Chess
Supervised learning: it is very hard for the teacher to choose, from the huge number of positions, accurate examples to train on directly.
In RL, the program is told when it has won or lost, and can use this information to learn an evaluation function.

Two Basic Designs
The agent learns a utility function on states (or histories) and uses it to select actions that maximize the expected utility of their outcomes.
The agent learns an action-value function giving the expected utility of taking a given action in a given state. This is called Q-learning. Here the agent does not need to reason about the outcomes of its actions.

Active & Passive Learners
A passive learner simply watches the world go by and tries to learn the utility of being in various states.
An active learner must also act using the learned information, and can use its problem generator to suggest exploration of unknown portions of the environment.

Comparison of the Basic Designs
The policy for an agent that learns a utility function on states is: policy*(i) = arg max_a Σ_j M^a_ij U(j)
The policy for an agent that learns an action-value function is: policy*(i) = arg max_a Q(a, i)

Passive Learning
[Figure: (a) a simple stochastic environment; (b) the transition model, M_ij in passive learning, M^a_ij in active learning; (c) the exact utility values.]

Calculation of the Utility of States in Passive Learning
Dynamic programming (ADP): U(i) ← R(i) + Σ_j M_ij U(j). Because the agent is passive, there is no maximization over actions.
Temporal difference learning: U(i) ← U(i) + α (R(i) + U(j) - U(i)), where α is the learning rate and j is the observed successor state. This update pushes U(i) toward agreement with its successor.
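A minimal sketch of the passive TD update applied along one observed trial; the episode, learning rate, and terminal utility below are assumed for illustration and are not from the slides.

```python
# Passive TD(0) learning from one observed trial, following
# U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)). The trial is hypothetical.

from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed)

def td_passive_update(U, trial, reward):
    """Update utility estimates U from one trial (a sequence of visited states)."""
    for i, j in zip(trial, trial[1:]):
        U[i] += ALPHA * (reward(i) + U[j] - U[i])
    return U

U_est = defaultdict(float)
U_est[(4, 3)] = 1.0                                        # terminal utility, assumed observed
trial = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]   # one hypothetical episode
td_passive_update(U_est, trial, reward=lambda s: -1.0 / 25.0)
```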

Comparison of ADP & TD
ADP converges faster than TD, because ADP knows the current environment model.
ADP uses the full model; TD uses no model, just information about the connectedness of states taken from the current training sequence.
TD adjusts a state to agree with its observed successor, whereas ADP adjusts the state to agree with all successors; this difference disappears when the effects of TD adjustments are averaged over a large number of transitions.
Full ADP may be intractable when the number of states is large. The prioritized-sweeping heuristic prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates.

Calculation of the Utility of States in Active Learning
Dynamic programming (ADP): U(i) ← R(i) + max_a Σ_j M^a_ij U(j)
Temporal difference learning: U(i) ← U(i) + α (R(i) + U(j) - U(i))

The Problem of Exploration in Active Learning
An active learner acts using the learned information, and can use its problem generator to suggest exploration of unknown portions of the environment.
There is a trade-off between immediate good and long-term well-being.
– One idea: change the constraint equation so that it assigns a higher utility estimate to relatively unexplored action-state pairs:
U(i) ← R(i) + max_a F(Σ_j M^a_ij U(j), N(a, i))
where N(a, i) is the number of times action a has been tried in state i, and
F(u, n) = R+ if n < N_e, and u otherwise,
with R+ an optimistic estimate of the best possible reward.
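The sketch below shows one way such an exploration function could look in code; the values of N_e and R_PLUS, and the reuse of ACTIONS and transition_model from the earlier sketches, are all assumptions for illustration.

```python
# Sketch of the optimistic exploration function F above: action-state pairs tried
# fewer than N_E times are treated as if they were worth R_PLUS, driving exploration.

from collections import defaultdict

N_E = 5          # assumed: try each (state, action) pair at least this many times
R_PLUS = 1.0     # assumed: optimistic estimate of the best possible reward

visit_count = defaultdict(int)    # N(a, i), keyed by (action, state)

def exploration_value(expected_utility, n):
    """F(u, n): optimistic value for under-explored pairs, the real estimate otherwise."""
    return R_PLUS if n < N_E else expected_utility

def optimistic_update(state, U, reward):
    """One update U(i) <- R(i) + max_a F(sum_j M^a_ij U(j), N(a, i))."""
    best = max(exploration_value(
                   sum(p * U.get(s2, 0.0) for s2, p in transition_model(state, a).items()),
                   visit_count[(a, state)])
               for a in ACTIONS)
    return reward(state) + best
```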

Learning an Action-Value Function
The function assigns an expected utility to taking a given action in a given state: Q(a, i) is the expected utility of taking action a in state i.
– Like condition-action rules, Q-values suffice for decision making.
– Unlike condition-action rules, they can be learned directly from reward feedback.

Calculation of the Action-Value Function
Dynamic programming: Q(a, i) ← R(i) + Σ_j M^a_ij max_a' Q(a', j)
Temporal difference learning: Q(a, i) ← Q(a, i) + α (R(i) + max_a' Q(a', j) - Q(a, i)), where α is the learning rate and j is the observed successor state.
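A minimal tabular sketch of this TD update (standard Q-learning, without a discount factor, matching the slides); the learning rate and the convention of passing the current state's reward R(i) follow the formula above.

```python
# Tabular Q-learning following Q(a,i) <- Q(a,i) + alpha * (R(i) + max_a' Q(a',j) - Q(a,i)).
# The transition (state, action, next_state) is assumed to come from the agent's experience.

import random
from collections import defaultdict

ALPHA = 0.1                            # learning rate (assumed)
Q = defaultdict(float)                 # Q[(action, state)]

def q_update(state, action, next_state, reward, actions):
    """Apply one Q-learning update for the observed transition."""
    best_next = max(Q[(a, next_state)] for a in actions)
    Q[(action, state)] += ALPHA * (reward + best_next - Q[(action, state)])

def greedy_action(state, actions):
    """policy*(i) = arg max_a Q(a, i), breaking ties at random."""
    best = max(Q[(a, state)] for a in actions)
    return random.choice([a for a in actions if Q[(a, state)] == best])
```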

Question & The Answer that Refused to be Found
Is it better to learn a utility function or to learn an action-value function?