Distributed Q Learning Lars Blackmore and Steve Block

Contents
What is Q-learning?
–MDP framework
–The Q-learning algorithm
Distributed Q-learning: different approaches
Sharing Q-values: why it is the most interesting approach
Simple averaging (and why it falls short)
Expertness based distributed Q-learning
Expertness with specialised agents

Markov Decision Processes
Framework: MDP
–States S
–Actions A
–Rewards R(s,a)
–Transition function T(s,a,s')
Goal: find the optimal policy π*(s)
(Slide figure: grid world with goal cells marked G)
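
As a concrete illustration (not from the original slides), the MDP tuple can be written down directly in Python; the state names, reward values and transition probabilities below are invented for the example.

S = ["start", "goal"]                      # states
A = ["stay", "move"]                       # actions

# R(s, a): immediate reward for taking action a in state s (values invented)
R = {("start", "stay"): 0.0, ("start", "move"): -0.04,
     ("goal", "stay"): 1.0,  ("goal", "move"): 1.0}

# T(s, a, s'): probability of ending in s' after taking action a in state s
T = {("start", "move", "goal"): 0.8, ("start", "move", "start"): 0.2,
     ("start", "stay", "start"): 1.0,
     ("goal", "stay", "goal"): 1.0,  ("goal", "move", "goal"): 1.0}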

Reinforcement Learning
Want to find π* through experience
–Reinforcement learning
–Intuitively similar to human/animal learning
–Use some policy π for motion
–Converge to the optimal policy π*
An algorithm for reinforcement learning…

Q-Learning
Define Q*(s,a):
–"Total reward if the agent is in state s, takes action a, then acts optimally forever"
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively: Q(s,a) ← Q(s,a) + α [ R(s,a) + γ max_a' Q(s',a') - Q(s,a) ]
Optimality theorem:
–"If each (s,a) pair is updated an infinite number of times, Q converges to Q* with probability 1"
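
A minimal tabular sketch of the update above. The epsilon-greedy exploration, the environment interface (reset(), step(a), an actions list) and all hyperparameter values are assumptions for illustration, not from the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; Q[(s, a)] starts at 0 and is updated in place."""
    Q = defaultdict(float)

    def policy(s):
        # epsilon-greedy version of pi(s) = argmax_a Q(s, a)
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)      # assumed environment interface
            # Recursive update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q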

Distributed Q-Learning
Problem formulation
Different approaches:
–Expanding state (share sensor information)
–Sharing experiences
–Sharing Q-values
Sharing Q-values is the most interesting approach, and the one examined in the rest of this talk.

Sharing Q-values
First approach: Simple Averaging
Learning framework:
–Individual learning for t_i trials
–Each trial starts from a random state and ends when the robot reaches the goal
–Next, all robots switch to cooperative learning (sketched below)
Result: Simple Averaging is worse in general!
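
A sketch of what the simple averaging step amounts to, assuming each agent's Q-table is a dict keyed by (state, action) pairs (that representation is an assumption, not from the slides): every agent replaces its estimate with the unweighted mean over all agents.

def average_q_tables(q_tables):
    """Simple averaging: Q_i_new(s,a) = (1/n) * sum_j Q_j(s,a) for every agent i."""
    n = len(q_tables)
    keys = set().union(*q_tables)              # all (state, action) pairs seen by any agent
    averaged = {k: sum(q.get(k, 0.0) for q in q_tables) / n for k in keys}
    return [dict(averaged) for _ in q_tables]  # every agent ends up with the same table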

Why is Simple Averaging Worse?
Slower learning rate:
–Example: first robot to find the goal (at time t)
Insensitive to environment changes:
–First robot to find the change
(Slide table columns: Robot | Q(s,a) at t | Q(s,a) at t+1 | Q*(s,a))

Expertness
Idea: pay more attention to agents who are 'experts'
–Expertness based cooperative Q-learning
New Q-sharing equation: Q_i_new(s,a) = Σ_j W_ij · Q_j(s,a)
Agent i weights agent j's Q-value based on their relative expertness e_i and e_j
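
A minimal sketch of that sharing step, assuming the same dict-based Q-table layout as above; the weight matrix is supplied by one of the weighting strategies described on a later slide.

def share_q_values(q_tables, weights):
    """Expertness-weighted sharing: Q_i_new(s,a) = sum_j W[i][j] * Q_j(s,a)."""
    n = len(q_tables)
    keys = set().union(*q_tables)
    return [{k: sum(weights[i][j] * q_tables[j].get(k, 0.0) for j in range(n))
             for k in keys}
            for i in range(n)]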

Expertness Measures
Need to define the expertness of agent j
–Based on the reinforcement agent j has encountered
Alternative definitions:
–Simple Sum
–Abs
–Positive
–Negative
Different interpretations (see the sketch below)
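
One plausible reading of the four measures, consistent with their names; the exact formulas were not in the transcript, so treat these forms as assumptions. Each takes the history of reinforcement signals an agent has received.

def expertness(rewards, measure="sum"):
    """Expertness of an agent from the reinforcement it has encountered (assumed forms).

    sum      - algebraic sum of rewards
    abs      - sum of absolute rewards (all experience counts)
    positive - rewards only (how much good experience)
    negative - punishments only (how much bad experience)
    """
    if measure == "sum":
        return sum(rewards)
    if measure == "abs":
        return sum(abs(r) for r in rewards)
    if measure == "positive":
        return sum(r for r in rewards if r > 0)
    if measure == "negative":
        return sum(-r for r in rewards if r < 0)
    raise ValueError("unknown expertness measure: " + measure)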

Weighting Strategies
How do we derive the weights W_ij from the expertness values?
Alternative strategies:
–'Learn from all': every agent's Q-table contributes, with weights based on expertness
–'Learn from experts': an agent only takes Q-values from agents more expert than itself
(Both are sketched below.)
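
A hedged sketch of the two strategies; the slide's actual equations did not survive, so the forms below (weights proportional to expertness, and weights proportional to the expertness gap for more expert agents) are assumptions that only capture the described behaviour.

def learn_from_all(e):
    """'Learn from all' (assumed form): W[i][j] = e_j / sum_k e_k for every agent i."""
    total = sum(e) or 1.0
    return [[e_j / total for e_j in e] for _ in e]

def learn_from_experts(e):
    """'Learn from experts' (assumed form): agent i only weights agents more expert
    than itself, in proportion to the expertness gap; otherwise it keeps its own table."""
    n = len(e)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        gaps = [max(e[j] - e[i], 0.0) for j in range(n)]
        total = sum(gaps)
        if total == 0.0:
            W[i][i] = 1.0                  # nobody is more expert: keep own Q-table
        else:
            for j in range(n):
                W[i][j] = gaps[j] / total
    return W

Put together with the earlier pieces, one cooperative round would look roughly like: weights = learn_from_all([expertness(h, "abs") for h in reward_histories]); q_tables = share_q_values(q_tables, weights).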

Experimental Setup
Hunter-prey scenario
Individual trial phase, as before
–Different number of trials for each agent
Then the cooperative phase

Results
Cooperative vs. individual learning
Different weighting strategies
Interpretation
Conclusion: expertness based methods work well when the agents' expertness levels differ significantly.

Specialised Agents
Agent i may have explored area A a lot but area B very little
–What is agent i's expertness?
–Agent i is an expert in area A but not in area B
Idea:
–Agents can be specialised, i.e. experts in certain areas of the world
–Pay more attention to Q-values from agents which are experts in that area (a sketch follows below)
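
One way to realise this idea, purely as an illustrative assumption: keep a separate expertness value per area of the state space and compute the sharing weights per state from the expertness in that state's area. The region_of mapping and the per-area bookkeeping are hypothetical, not from the slides.

def regional_weights(state, region_of, regional_expertness):
    """Weights for combining Q(state, .) across agents, using each agent's
    expertness in the area that contains `state`.

    region_of(state)              -> area label, e.g. "A" or "B"
    regional_expertness[j][area]  -> agent j's expertness in that area
    """
    area = region_of(state)
    e = [exp_j.get(area, 0.0) for exp_j in regional_expertness]
    total = sum(e) or 1.0
    return [e_j / total for e_j in e]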

Specialised Agents Continued