Reinforcement Learning with Partially Known World Dynamics

Presentation transcript:

Reinforcement Learning with Partially Known World Dynamics
Christian R. Shelton
Stanford University

Reinforcement Learning
[Slide diagram: the standard agent-environment loop. The environment has a state and dynamics; the agent takes actions, receives rewards, and pursues a goal.]
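To make the loop on this slide concrete, here is a minimal interaction-loop sketch in Python. All names (`env.reset`, `env.step`, `agent.act`, `agent.observe`) are generic placeholders for illustration, not an interface from the talk.

```python
def run_episode(env, agent, horizon=100):
    """Standard RL interaction loop: the agent picks actions, the environment's
    (possibly unknown) dynamics produce the next state and a reward, and the
    agent's goal is to maximize the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = agent.act(state)                 # agent chooses an action
        state, reward, done = env.step(action)    # world dynamics + reward
        agent.observe(state, reward)              # agent updates its knowledge
        total_reward += reward
        if done:
            break
    return total_reward
```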

Motivation
- Reinforcement learning promises great things
  - Automatic task optimization
  - Without any prior information about the world
- Reinforcement learning is hard
  - The optimization goal could be arbitrary
  - Every new situation might be different
- Modify the problem slightly
  - Keep the basic, general, flexible framework
  - Allow the specification of domain knowledge
  - Don't require full specification of the problem (planning)

Our Approach
Partial world modeling:
- Keep the partial observability
- Allow conditional dynamics
Example:
- Known dynamics: sensor models, motion models, etc.
- Unknown dynamics: enemy movements, maps, etc.
- Flexible barrier between the two

Partially Known Markov Decision Process (PKMDP)
[Slide figure, built up over several slides: a dynamic Bayesian network unrolled over time steps 0, 1, 2. The state is split into a part with unknown dynamics (s0 s1 s2, later drawn as y0 y1 y2) and a part with known dynamics (x0 x1 x2), joined through interface nodes z0 z1 z2. Observation nodes o0 o1 o2 and action nodes a0 a1 a2 are attached at each time step. The final slide's legend marks the unknown-dynamics nodes as "unknown", the x nodes as "known, unobserved", and the observations and actions as "observed".]
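As a rough illustration of how the pieces of a PKMDP fit together, the sketch below separates the known dynamics (given as explicit models) from the unknown dynamics (available only through sampled experience). The class and field names are hypothetical, and the discrete-variable typing is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative container for a Partially Known MDP (names are hypothetical).
# The "known" half is specified with explicit probability models; the
# "unknown" half is never modeled directly -- it is only ever seen through
# sampled trajectories.
@dataclass
class PKMDP:
    # p(x_t | x_{t-1}, z_{t-1}, a_{t-1}): the part of the world we can model
    known_transition: Callable[[int, int, int], Dict[int, float]]
    # p(o_t | x_t): known sensor model (assumed form for this sketch)
    observation_model: Callable[[int], Dict[int, float]]
    # reward as a function of the known state (assumption for this sketch)
    reward: Callable[[int], float]
    # sampled experience for the unknown half: sequences of (o, a, y, z)
    trajectories: List[List[Tuple[int, int, int, int]]]
```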

Algorithm Outline
Input: a set of trajectories (o, a, y, z) and a set of policies
Output: a policy that maximizes expected return
Method: construct a non-parametric model of the return, then maximize it with respect to the policy

Algorithm Details
Unknown dynamics: use experience (importance sampling)
Known dynamics: use the model (DBN inference); the exact calculation gives lower variance
Maximize using conjugate gradient: a policy search method, but not policy gradient
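Below is a simplified sketch of the estimation step, assuming discrete observations and actions and stochastic policies given as action probabilities. The unknown-dynamics side is handled by importance-weighting the sampled trajectories; the known-dynamics side, which the talk handles with exact DBN inference, is abstracted here into a per-trajectory value supplied by the caller. All function and variable names are hypothetical.

```python
def estimate_return(trajectories, behavior_probs, candidate_probs, trajectory_values):
    """Importance-sampling estimate of a candidate policy's expected return.

    trajectories      : list of trajectories; each is a list of (observation, action) pairs
    behavior_probs    : pi_b(a | o) of the policy that collected the data
    candidate_probs   : pi(a | o) of the policy being evaluated
    trajectory_values : per-trajectory return (in the talk this is where exact
                        DBN inference over the known dynamics enters; here it is
                        simply given)
    """
    total, weight_sum = 0.0, 0.0
    for traj, value in zip(trajectories, trajectory_values):
        w = 1.0
        for obs, act in traj:
            # Reweight each chosen action by how likely the candidate policy is
            # to choose it relative to the behavior policy.
            w *= candidate_probs(obs, act) / behavior_probs(obs, act)
        total += w * value
        weight_sum += w
    # Normalized (weighted) importance-sampling estimate.
    return total / weight_sum if weight_sum > 0 else 0.0
```

Viewed as a function of the candidate policy's parameters, an estimate like this could then be handed to a generic conjugate-gradient optimizer (for example `scipy.optimize.minimize` with `method="CG"`), in the spirit of the policy search described on the slide.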

Total Estimate
For each sample, K and V involve reasoning in the DBN.
[Slide figure: the DBN unrolled over three time steps, with nodes z0 x0 y0, o1 a1 z1 x1 y1, and o2 a2 z2 x2 y2.]

Load-Unload Example
26 states, 14 observations, 4 actions
Three versions:
1. No world knowledge
2. Memory dynamics known
3. End-point & memory dynamics known

Clogged Pipe Example
144 states, 12 observations, 8 actions
Three versions:
1. Memory only
2. Known cart control
3. Incoming flow unknown

Conclusion
Advantages:
- Uses samples to estimate the unknown dynamics
- Uses the exact dynamics when they are known
- Allows natural specification of domain knowledge
Current work:
- Improving the gradient-ascent planner
- Using structure within the known dynamics
- Removing the requirement that the interface be observable