ONLINE Q-LEARNER USING MOVING PROTOTYPES by Miguel Ángel Soto Santibáñez

Reinforcement Learning What does it do? Tackles the problem of learning control strategies for autonomous agents. What is the goal? The goal of the agent is to learn an action policy that maximizes the total reward it will receive from any starting state.
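A standard way to write this goal formally, as a sketch assuming the usual discounted-reward setting with discount factor γ (the same γ that appears in the Q-learning update later in the deck):

% Discounted return collected from state s_t by following policy \pi, with 0 <= \gamma < 1.
V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^{\,i}\, r_{t+i},
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \;\; \text{for all } s .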

Reinforcement Learning What does it need? This method assumes that training information is available in the form of a real-valued reward signal given for each state-action transition, i.e. (s, a, r). What problems? Very often, reinforcement learning fits a problem setting known as a Markov decision process (MDP).

Reinforcement Learning vs. Dynamic Programming: both are stated in terms of a reward function r(s, a) → r and a state transition function δ(s, a) → s'. Dynamic programming assumes both functions are known in advance; reinforcement learning must learn to act when they are not.
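A short formal sketch of this setting, assuming the deterministic MDP notation used in the rest of the deck:

% At each step the agent observes s_t, picks a_t, and the environment returns a reward
% and a next state; in reinforcement learning both functions may be unknown to the agent.
r_t = r(s_t, a_t), \qquad s_{t+1} = \delta(s_t, a_t),
\qquad \text{giving the experience tuple } (s_t, a_t, r_t).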

Q-learning An off-policy control algorithm. Advantage: Converges to an optimal policy in both deterministic and nondeterministic MDPs. Disadvantage: Only practical on a small number of problems.
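For reference, a sketch of the quantity Q-learning estimates, assuming the deterministic notation above:

% Q(s, a): discounted return of taking action a in state s and acting optimally afterwards.
Q(s, a) = r(s, a) + \gamma \max_{a'} Q\bigl(\delta(s, a), a'\bigr),
\qquad
\pi^{*}(s) = \arg\max_{a} Q(s, a).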

Q-learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s ← s'
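A minimal runnable sketch of this tabular algorithm in Python. The environment interface (reset() and step(a)) is a hypothetical assumption for illustration, not something defined by the slides; the update line mirrors the one above.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; `env` is assumed to expose reset() -> s and
    step(a) -> (s_next, r, done). Q is stored in a dict keyed by (s, a)."""
    Q = defaultdict(float)                       # Initialize Q(s, a) arbitrarily (here: 0)
    for _ in range(episodes):                    # Repeat for each episode
        s = env.reset()                          # Initialize s
        done = False
        while not done:                          # Repeat for each step of the episode
            if random.random() < epsilon:        # Exploratory (epsilon-greedy) policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)        # Take action a, observe r, s'
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                           # s <- s'
    return Q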

Introduction to the Q-learning Algorithm. An episode: {(s1, a1, r1), (s2, a2, r2), …, (sn, an, rn)}. s': the next state, given by δ(s, a) → s'. Q(s, a): the learned estimate of the value of taking action a in state s. γ: the discount factor. α: the learning rate.

A Sample Problem (diagram): a small grid world with two marked cells, A and B, and transition rewards of r = 8, r = 0, and r = -8.

States and actions (diagram): the states are the grid cells; the actions are N, S, E, W.

The Q(s, a) function (table): one row per state and one column per action (N, S, W, E).

Q-learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s ← s'

Initializing the Q(s, a) function (table): the same state-by-action table with its initial values.

Q-learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s ← s'

An episode

Q-learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s ← s'

Calculating new Q(s, a) values (worked on the slide): 1st step, 2nd step, 3rd step, 4th step.
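A worked example of one such update, assuming for illustration γ = 0.9, α = 1, all Q entries still at their initial value of 0, and a step that earns the reward r = 8 (these particular numbers are illustrative, not read off the slide):

Q(s, a) \leftarrow Q(s, a) + \alpha\bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]
        = 0 + 1 \cdot \bigl[\,8 + 0.9 \cdot 0 - 0\,\bigr] = 8 .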

The Q(s, a) function after the first episode (table of updated values).

A second episode

Calculating new Q(s, a) values (worked on the slide): 1st step, 2nd step, 3rd step, 4th step.

The Q(s, a) function after the second episode (table of updated values).

The Q(s, a) function after a few episodes (table of converged values).

One of the optimal policies (table: the greedy action for each state).

An optimal policy graphically

Another of the optimal policies (table: an alternative greedy action for each state).

Another optimal policy graphically

The problem with tabular Q-learning. What is the problem? It is only practical for a small class of problems because: a) Q-learning can require many thousands of training iterations to converge even on modest-sized problems, and b) very often the memory resources required by the method become too large.
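A sketch of point (b), assuming a state described by d variables that each take roughly k values: the table needs one entry per state-action pair, so its size grows exponentially with the number of state variables (the curse of dimensionality):

|Q| = |S| \times |A| \approx k^{d} \cdot |A| .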

Solution. What can we do about it? Use generalization. What are some examples? Tile coding, Radial Basis Functions, Fuzzy function approximation, Hashing, Artificial Neural Networks, LSPI, Regression Trees, Kanerva coding, etc.
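A small sketch of the generalization idea using linear function approximation, one of the simplest of the listed options. The feature function phi is a stand-in for something like tile coding or RBFs; it is an assumption for illustration, not something specified by the slides.

import numpy as np

def semi_gradient_q_update(theta, phi, s, a, r, s_next, actions,
                           alpha=0.1, gamma=0.9):
    """One Q-learning step with a linear approximator Q(s, a) ~= theta . phi(s, a).
    Instead of updating one table cell, the shared weight vector generalizes
    the change across all similar states."""
    q_sa = theta @ phi(s, a)
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)
    td_error = r + gamma * q_next - q_sa
    return theta + alpha * td_error * phi(s, a)   # gradient of Q w.r.t. theta is phi(s, a)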

Shortcomings. Tile coding: curse of dimensionality. Kanerva coding: static prototypes. LSPI: requires a priori knowledge of the Q-function. ANN: requires a large number of learning experiences. Batch + regression trees: slow and requires lots of memory.

Needed properties: 1) Memory requirements should not explode exponentially with the dimensionality of the problem. 2) It should tackle the pitfalls caused by the usage of “static prototypes”. 3) It should try to reduce the number of learning experiences required to generate an acceptable policy. NOTE: all this without requiring a priori knowledge of the Q-function.

Overview of the proposed method: 1) The proposed method limits the number of prototypes available to describe the Q-function (as in Kanerva coding). 2) The Q-function is modeled using a regression tree (as in the batch method proposed by Sridharan and Tesauro). 3) Unlike Kanerva coding, however, the prototypes are not static but dynamic. 4) The method can update the Q-function once for every available learning experience, so it can be an online learner.
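A rough Python sketch of the kind of structure this suggests: a regression tree whose leaves act as movable prototypes and are updated online. The class layout and update rule are illustrative assumptions, not the thesis's exact data structures.

class Leaf:
    """A prototype: one region of state-action space with a single Q estimate."""
    def __init__(self, low, high, value=0.0):
        self.low, self.high, self.value = low, high, value
        self.n_updates = 0

    def update(self, q_target, alpha=0.1):
        # Online update: move this prototype's value toward the observed TD target.
        self.value += alpha * (q_target - self.value)
        self.n_updates += 1

class Node:
    """Internal node: splits one dimension of the (state, action) vector."""
    def __init__(self, dim, threshold, left, right):
        self.dim, self.threshold, self.left, self.right = dim, threshold, left, right

def lookup(node, x):
    """Descend the tree to the leaf (prototype) covering point x."""
    while isinstance(node, Node):
        node = node.left if x[node.dim] <= node.threshold else node.right
    return node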

Changes to the normal regression tree

Basic operations in the regression tree: rupture and merging.
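A hedged sketch of what these two operations could look like on the Leaf and Node classes sketched above. The criteria for when to split or merge are assumptions for illustration; the thesis defines its own rules and keeps the total number of prototypes bounded.

def rupture(leaf, dim, threshold):
    """Split one prototype into two children along dimension `dim`."""
    left_high = list(leaf.high);  left_high[dim] = threshold
    right_low = list(leaf.low);   right_low[dim] = threshold
    left = Leaf(leaf.low, left_high, leaf.value)
    right = Leaf(right_low, leaf.high, leaf.value)
    return Node(dim, threshold, left, right)

def merge(node):
    """Collapse two sibling leaves whose estimates agree into one prototype."""
    l, r = node.left, node.right
    if isinstance(l, Leaf) and isinstance(r, Leaf):
        return Leaf(l.low, r.high, (l.value + r.value) / 2.0)
    return node   # only leaf siblings are merged in this sketch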

Impossible Merging

Rules for a sound tree (diagram of parent and children nodes).

Impossible Merging

Sample Merging: the “smallest predecessor”.

Sample Merging: List 1.

Sample Merging: the node to be inserted.

Sample Merging: List 1, List 1.1, List 1.2.

Sample Merging

The agent (diagram): detectors' signals and the reward flow into the agent; actuators' signals flow out.
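A minimal sketch of this agent/environment loop. The read_detectors(), send_actuators(), and get_reward() functions are hypothetical stand-ins for the real sensor and actuator interfaces, and the learner could be either the tabular or the moving-prototype Q-learner sketched earlier.

def run_agent(learner, read_detectors, send_actuators, get_reward, steps=1000):
    """Standard sense-act-reward loop: detectors' signals in, actuators' signals out."""
    s = read_detectors()
    for _ in range(steps):
        a = learner.choose_action(s)     # pick actuators' signals for the current state
        send_actuators(a)
        r = get_reward()                 # scalar reward signal
        s_next = read_detectors()
        learner.update(s, a, r, s_next)  # one online learning experience (s, a, r, s')
        s = s_next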

Applications: BOOK STORE.

Results first application

                           Tabular Q-learning    Moving Prototypes       Batch Method
Policy Quality             Best                  Best                    Worst
Computational Complexity   O(n)                  O(n log n) to O(n^2)    O(n^3)
Memory Usage               Bad                   Best                    Worst

Results first application (details)

                 Tabular Q-learning    Moving Prototypes    Batch Method
Policy Quality   $2,423,355            $2,423,355           $2,297,100
Memory Usage     10,202 prototypes     413 prototypes       11,975 prototypes

Results second application

                           Moving Prototypes       LSPI (least-squares policy iteration)
Policy Quality             Best                    Worst
Computational Complexity   O(n log n) to O(n^2)    O(n)
Memory Usage               Worst                   Best

Results second application (details)

                                Moving Prototypes       LSPI (least-squares policy iteration)
Policy Quality                  forever (succeeded)     26 time steps (failed); 170 time steps (failed); forever (succeeded)
Required Learning Experiences   ,902,621                183,
Memory Usage                    about 170 prototypes    2 weight parameters

Results third application. Reason for this experiment: to evaluate the performance of the proposed method in a scenario we consider ideal for it, namely one for which no application-specific knowledge is available. What it took to learn a good policy: less than 2 minutes of CPU time, fewer than 25,000 learning experiences, and fewer than 900 state-action-value tuples.

Swimmer first movie

Swimmer second movie

Swimmer third movie

Future Work: Different types of splits. Continue characterization of the Moving Prototypes method. Moving Prototypes + LSPI. Moving Prototypes + eligibility traces.