Effective Reinforcement Learning for Mobile Robots. Smart, W.D. and Kaelbling, L.P.

Content: Background; Q-learning review; Reinforcement learning on mobile robots; Learning framework; Experimental results; Conclusion; Discussion

Background. Robot behaviour is hard to code by hand efficiently and correctly. Reinforcement learning lets us tell the robot what to do, not how to do it. How well suited is reinforcement learning to mobile robots?

Review: Q-learning. Discrete states s and actions a; learn the value function by observing rewards. – Optimal value function: Q*(s,a) = E[R(s,a) + γ max_a' Q*(s',a')] – Learning update: Q(s_t,a_t) ← (1-α) Q(s_t,a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1},a')) The sampling distribution has no effect on the learned greedy policy π*(s) = argmax_a Q*(s,a).
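A minimal sketch of the tabular Q-learning update on this slide, in Python. The environment interface (env.reset, env.step, env.actions) and the default hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.2, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])

            s_next, r, done = env.step(a)

            # Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in env.actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next

    # Greedy policy: pi*(s) = argmax_a Q(s,a)
    return Q
```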

Reinforcement learning on mobile robots. Sparse reward function: – the reward R(s,a) is almost always zero – non-zero reward only on success or failure. Continuous environment: – HEDGER is used as the function approximator – function approximation is only safe when it never extrapolates from the data (see the sketch below).
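The following is not the actual HEDGER algorithm, only a minimal sketch of the "never extrapolate" idea it relies on: queries are answered with a distance-weighted nearest-neighbour average, and the approximator falls back to a conservative default whenever the query lies outside the bounding box of the training data (HEDGER itself uses a stricter test; the class and parameter names here are illustrative assumptions).

```python
import numpy as np

class NonExtrapolatingApproximator:
    """Toy value-function approximator that refuses to extrapolate."""

    def __init__(self, k=5, default=0.0):
        self.k = k
        self.default = default
        self.X = None  # training inputs, shape (n, d)
        self.y = None  # training targets, shape (n,)

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y, dtype=float)
        # Axis-aligned bounding box of the training data.
        self.lo = self.X.min(axis=0)
        self.hi = self.X.max(axis=0)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        # Outside the observed data: return a conservative default instead of extrapolating.
        if np.any(x < self.lo) or np.any(x > self.hi):
            return self.default
        # Inside: distance-weighted average of the k nearest neighbours.
        d = np.linalg.norm(self.X - x, axis=1)
        idx = np.argsort(d)[: self.k]
        w = 1.0 / (d[idx] + 1e-8)
        return float(np.dot(w, self.y[idx]) / w.sum())
```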

Reinforcement learning on mobile robots. Q-learning can only succeed once a state with positive reward has been visited. With a sparse reward function and a continuous environment, reward states are hard to find by trial and error. Solution: show the robot how to reach the reward states.

Learning framework. Split learning into two phases: – phase one: actions are chosen by an external controller (a human or hand-coded policy) while the learning algorithm only passively observes – phase two: the learning algorithm is in control and learns the optimal policy. By 'showing' the robot where the interesting states are, learning should be quicker (see the sketch below).
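A minimal sketch of the two-phase framework, under assumed interfaces (this is not the authors' code). In phase one the teacher chooses the actions and the learner only updates its value estimates from the observed transitions; in phase two the learner's own greedy policy chooses the actions, with the same update applied throughout.

```python
def run_phase(env, learner, policy, n_runs):
    """Run n_runs episodes with the given policy choosing actions."""
    for _ in range(n_runs):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            learner.update(s, a, r, s_next)   # same Q-update in both phases
            s = s_next

def train(env, learner, teacher_policy, n_teacher_runs, n_learner_runs):
    # Phase one: teacher in control, learner passively observes the transitions.
    run_phase(env, learner, teacher_policy, n_teacher_runs)
    # Phase two: learner in control, refining its own policy.
    run_phase(env, learner, learner.greedy_action, n_learner_runs)
```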

Experimental setup. Two experiments on a B21r mobile robot: – the translation speed is fixed externally – the rotation speed has to be learned – settings: learning rate α = 0.2, discount factor γ = 0.99 or 0.90. Performance is measured after every 5 runs: – the robot does not learn from these test runs – starting position and orientation are similar, but not identical.

Experimental Results: Corridor Following Task. State space: – distance to the end of the corridor – distance to the left wall, as a fraction of the corridor width – angle θ to the target point.
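A possible encoding of the corridor-following state described on this slide; the field names and units are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class CorridorState:
    dist_to_end: float      # distance to the end of the corridor (m)
    wall_offset: float      # distance to the left wall, as a fraction of corridor width (0..1)
    angle_to_target: float  # angle theta to the target point (radians)

    def as_vector(self):
        return (self.dist_to_end, self.wall_offset, self.angle_to_target)
```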

Experimental Results: Corridor Following Task. Computer-controlled teacher: – the rotation speed is set to a fixed fraction of the angle θ to the target point (see the sketch below).
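A sketch of a simple computer-controlled teacher of the kind the slide describes: rotation speed proportional to the angle to the target point, with the translation speed fixed elsewhere. The gain value and the clamping limit are assumptions, and the state is assumed to use the CorridorState encoding sketched above.

```python
def teacher_policy(state, gain=0.3, max_rot=1.0):
    """Return a rotation-speed command proportional to the angle to the target."""
    rot = gain * state.angle_to_target
    # Clamp to the robot's rotation-speed limits.
    return max(-max_rot, min(max_rot, rot))
```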

Experimental Results: Corridor Following Task. Human-controlled teacher: – demonstrations were given in a different corridor than the one used with the computer-controlled teacher.

Experimental Results: Corridor Following Task, Results. Performance initially decreases after the training phase: – phase 2 supplies more novel experiences. The sloppy human controller leads to faster convergence than the rigid computer controller: – fewer phase 1 and phase 2 runs are needed – the human controller supplies more varied data.

Experimental Results: Corridor Following Task, Results. Simulated performance without the benefit of teacher examples, for comparison.

Experimental Results: Obstacle Avoidance Task State space: – direction and distance to obstacles – direction and distance to target

Experimental Results: Obstacle Avoidance Task Results. Human-controlled teacher: – the robot starts 3 m from the target with a random orientation.

Experimental Results: Obstacle Avoidance Task Results. Simulation without teacher examples: – no obstacles present; the robot only has to reach the goal – the simulated robot starts in the right orientation – starting 3 meters from the target, only 18.7% of runs reached it within one week of simulated time, taking 6.54 hours on average.

Conclusion. Passive observation of appropriate state-action behaviour can speed up Q-learning. Knowledge of the robot or of the learning algorithm is not necessary to supply the example trajectories. Any demonstrated solution will do; providing a good (near-optimal) solution is not necessary.

Discussion