Introduction to Reinforcement Learning and Q-Learning


Introduction to Reinforcement Learning and Q-Learning
Andrew L. Nelson, Visiting Research Faculty, University of South Florida

Overview
References
Introduction
Learning an optimal policy in a known environment
Learning an approximate optimal policy in an unknown environment
Example
Generalization and representation
Knowledge-based vs. general function approximation methods

References
C. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, pp. 279-292, 1992.
T. M. Mitchell, Machine Learning, WCB/McGraw-Hill, 1997.

Introduction: Situated Learning Agents
The goal of a learning agent is to learn to choose actions (a) so that the net reward over a sequence of actions is maximized.
Supervised learning methods make use of knowledge of the world and of known reward functions.
Reinforcement learning methods use rewards to learn an optimal policy in a given (unknown) environment.

Agent and Environment
An agent produces an action (a) and receives a reward (and a new state, s') from a given environment.

Nomenclature
Action: a ∈ A
State: s ∈ S
Reward: r = R(s)
Policy: π: S → A
Optimal policy: π*
World model: s' = T(s, a)
Utility: U(s)
Value: Q(s, a)

Cell World
Elements of the example grid environment: Agent, States, Transitions, Reward.
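
The cell world can be sketched as a small grid environment in Python. A minimal sketch, assuming a 4x3 grid with a single goal cell; the grid size, goal position, and reward values (+1 at the goal, a small step cost elsewhere) are illustrative assumptions, not values from the slides:

    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    class CellWorld:
        def __init__(self, width=4, height=3, goal=(3, 2)):
            self.width, self.height, self.goal = width, height, goal

        def transition(self, state, action):
            # World model s' = T(s, a): deterministic one-cell moves.
            dx, dy = ACTIONS[action]
            x, y = state[0] + dx, state[1] + dy
            if 0 <= x < self.width and 0 <= y < self.height:
                return (x, y)
            return state  # bumping a wall leaves the state unchanged

        def reward(self, state):
            # Reward function r = R(s): +1 at the goal, a small step cost elsewhere.
            return 1.0 if state == self.goal else -0.04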

Learning π* in Known Environments
The supervised method:
Find the maximum possible utility U* for each state (iterative search).
Learn the optimal policy π*: S → A by learning, for each state s, the action that leads to the next state s' with maximum possible utility U*.
Requirements:
Known world model, T(s, a)
Known reward function, R(s)

Known Rewards and Transitions
R(s) and s' = T(s, a) are known for all s ∈ S and a ∈ A.

Calculate U* for Each State
(Using an iterative search algorithm such as value iteration, for example.)
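
A minimal value-iteration sketch over the hypothetical CellWorld above, as one concrete instance of such an iterative algorithm; the discount factor gamma and the convergence threshold are illustrative choices (the slides do not specify a discount):

    def value_iteration(world, states, gamma=0.9, tol=1e-6):
        # Repeatedly apply U(s) <- R(s) + gamma * max_a U(T(s, a)) until convergence.
        U = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                u_new = world.reward(s) + gamma * max(
                    U[world.transition(s, a)] for a in ACTIONS)
                delta = max(delta, abs(u_new - U[s]))
                U[s] = u_new
            if delta < tol:
                return U

    # Example usage on the hypothetical grid:
    # states = [(x, y) for x in range(4) for y in range(3)]
    # U = value_iteration(CellWorld(), states)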

Calculate π* Using the Known U* Values
π*(s) = argmax_a U*(T(s, a)), for all s
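
The extraction step follows directly from that definition; a minimal sketch, reusing the hypothetical CellWorld and the U table from the value-iteration sketch:

    def extract_policy(world, U):
        # Greedy policy: for each state, pick the action whose successor has maximal U*.
        return {s: max(ACTIONS, key=lambda a: U[world.transition(s, a)]) for s in U}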

Notes
Supervised learning methods work well when a complete model of the environment and the reward function are known.
Since R(s) and T(s, a) are known, we can reduce learning to a standard iterative learning process.

Unknown Environments
What if the environment is unknown?


The Q-Function
Instead of learning utilities, state-action values Q(s, a) will be learned:
U(s) = max_a Q(s, a)
Local action and exploration can be used to discover and learn Q(s, a) values in an unknown environment.
We will use the following update equation:
Q(s, a) ← r + max_a' Q(s', a')
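
In code, the Q table can be a dictionary keyed by (state, action) pairs, with the utility recovered by maximizing over actions; a minimal sketch (the names Q, utility, and q_update are my own, not from the slides):

    from collections import defaultdict

    Q = defaultdict(float)  # Q(s, a) table; unseen pairs default to 0

    def utility(s):
        # U(s) = max_a Q(s, a)
        return max(Q[(s, a)] for a in ACTIONS)

    def q_update(s, a, r, s_next):
        # The slide's undiscounted update: Q(s, a) <- r + max_a' Q(s', a')
        Q[(s, a)] = r + utility(s_next)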

The Q-Learning Algorithm
Build up a table of Q(s, a) values as follows. Do forever (a runnable sketch of this loop follows below):
1. From the current state s, set each uninitialized state-action value Q(s, a) to 0 and add it to the table of Q values.
2. With probability p, select the action a with the maximum Q value (otherwise select a at random).
3. Execute a and receive immediate reward r.
4. Update the table entry for Q(s, a) as Q(s, a) ← r + max_a' Q(s', a').
5. s ← s'
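
A minimal sketch of the loop in Python, reusing the hypothetical CellWorld, Q table, and q_update from the earlier sketches; the exploration probability p, the step count, and the restart-at-goal behavior are illustrative assumptions rather than details from the slides:

    import random

    def q_learn(world, start=(0, 0), p=0.8, steps=10000):
        # Tabular Q-learning: mostly-greedy action choice plus the slide's update.
        s = start
        for _ in range(steps):
            if random.random() < p:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])  # greedy action
            else:
                a = random.choice(list(ACTIONS))               # exploratory action
            s_next = world.transition(s, a)
            r = world.reward(s_next)
            q_update(s, a, r, s_next)                # Q(s, a) <- r + max_a' Q(s', a')
            s = start if s_next == world.goal else s_next  # terminal state: start over
        return Q

Because Q is a defaultdict, every unseen (s, a) pair is created with value 0 the first time it is touched, which plays the role of step 1 above.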

Q-Learning Example
Initialize table and first position.

Q-Learning Example
Move to s'... iterate.

Q-Learning Example
Continue.

Q-Learning Example
Terminal state: start over.

Q-Learning Example
Starting a new iteration.

Q-Learning Example
After a few more iterations...

Representation and Generalization
Policies learned using explicit state-transition (tabular) representations do not generalize to unvisited states.
Functional representations allow for generalization to states not explored, e.g. a parameterized polynomial:
f(s) = p_1 a + p_2 a^2 + p_3 a^3 + ...
Functional representations might, however, cover search spaces that do not contain the target policy.
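
One way to realize a functional representation is a linear model over hand-chosen features, trained with a gradient step toward the same target as the tabular update. A minimal sketch of that idea; the feature map, weight layout, and learning rate are illustrative assumptions (the slides do not prescribe a specific functional form):

    import numpy as np

    def features(s, a):
        # Hypothetical feature map phi(s, a): raw cell coordinates, a bias
        # term, and a one-hot encoding of the action.
        one_hot = [1.0 if a == name else 0.0 for name in ACTIONS]
        return np.array([s[0], s[1], 1.0] + one_hot)

    w = np.zeros(7)  # one weight per feature (3 state features + 4 actions)

    def q_hat(s, a):
        # Approximate Q(s, a) = w . phi(s, a); generalizes to unvisited states.
        return w @ features(s, a)

    def approx_update(s, a, r, s_next, alpha=0.1):
        # Move w toward the target r + max_a' q_hat(s', a').
        global w
        target = r + max(q_hat(s_next, b) for b in ACTIONS)
        w += alpha * (target - q_hat(s, a)) * features(s, a)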

Summary
Reinforcement learning (RL) is useful for learning policies in uncharacterized environments.
RL uses rewards from actions taken during exploration.
Tabular RL is practical only on small state-transition spaces.
Functional representations increase the power of RL in terms of both generalization and representation.