Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density
Amy McGovern and Andrew Barto

Abstract
- The paper presents a method to automatically discover subgoals.
- It is based on the idea of mining a set of behavioral trajectories to look for commonalities.
- These commonalities are assumed to be subgoals (bottlenecks).

Motivating Application
- The agent should recognize the doorway as a bottleneck.
- The doorway links two strongly connected regions.
- By adding an option for reaching the doorway, the two rooms become more closely connected.
[Figure: a two-room gridworld environment with start state S, goal G, and doorway D]

Multiple Instance Learning
- A bottleneck is a region in the agent's observation space that is visited on all successful paths but not on unsuccessful paths.
- The problem of finding bottleneck regions is treated as a multiple-instance learning problem.
- In this setting, a target concept is identified from bags of instances: a positive bag corresponds to a successful trajectory, and a negative bag corresponds to an unsuccessful trajectory.
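As a rough illustration (not the authors' code), each trajectory can be turned into a bag of state instances labeled by whether the trajectory reached the goal; the sketch assumes trajectories are stored as (state, action, reward) tuples:

```python
# Minimal sketch, assuming a trajectory is a list of (state, action, reward) tuples.

def make_bag(trajectory, reached_goal):
    """Turn one trajectory into a labeled bag of instances.

    Every visited state becomes one instance; the bag is positive (+1) if the
    trajectory was successful and negative (-1) otherwise."""
    instances = [state for (state, action, reward) in trajectory]
    return instances, (+1 if reached_goal else -1)
```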

Diverse Density
- The most diversely dense region is the region containing instances from the most positive bags and the fewest negative bags.
- For a target concept c, the diverse density is defined as
  DD(c) = \Pr(c \mid B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^-),
  which, under a uniform prior and the noisy-or model, yields
  DD(c) = \prod_{i=1}^{n} \left[ 1 - \prod_{j} \left( 1 - \Pr(c \mid B^{+}_{ij}) \right) \right] \prod_{i=1}^{m} \prod_{j} \left( 1 - \Pr(c \mid B^{-}_{ij}) \right),
  where B_{ij} is the j-th instance of bag i and \Pr(c \mid B_{ij}) = \exp(-\lVert B_{ij} - c \rVert^2) is defined as a Gaussian based on the distance from the instance to the target concept.
- Find the concept with the highest DD value.
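A hedged sketch of the noisy-or diverse density computation described above; the Euclidean distance, the `scale` parameter, and the candidate-concept search are illustrative assumptions:

```python
import numpy as np

def instance_prob(instance, concept, scale=1.0):
    """Gaussian-style similarity between one instance and a candidate concept."""
    d2 = np.sum((np.asarray(instance, dtype=float) - np.asarray(concept, dtype=float)) ** 2)
    return np.exp(-d2 / scale)

def diverse_density(concept, positive_bags, negative_bags, scale=1.0):
    """Noisy-or diverse density of a candidate concept.

    Each positive bag should contain at least one instance near the concept;
    no negative bag should contain an instance near it."""
    dd = 1.0
    for bag in positive_bags:
        dd *= 1.0 - np.prod([1.0 - instance_prob(x, concept, scale) for x in bag])
    for bag in negative_bags:
        dd *= np.prod([1.0 - instance_prob(x, concept, scale) for x in bag])
    return dd

# The subgoal candidate is the concept with the highest DD value, e.g.:
# best = max(candidate_concepts, key=lambda c: diverse_density(c, pos_bags, neg_bags))
```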

Options Framework
- An option is a macro-action which, when chosen, executes until a termination condition is satisfied.
- An option is defined as a triple ⟨I, π, β⟩, where I is the option's input set of states, π is the option's policy, and β is the termination condition.
- An option bases its policy on its own internal value function.
- An option is a way of reaching a subgoal.
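The triple can be mirrored directly in code; the class below is only an illustrative container, and the `env.step` interface in the runner is an assumption:

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]             # I: states in which the option may be invoked
    policy: Callable[[Any], Any]         # pi: maps a state to a primitive action
    termination: Callable[[Any], float]  # beta: probability of terminating in a state

def run_option(env, state, option):
    """Execute an option until its termination condition fires (sketch)."""
    rewards = []
    while random.random() >= option.termination(state):
        state, reward, done = env.step(option.policy(state))  # assumed environment API
        rewards.append(reward)
        if done:
            break
    return state, rewards
```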

Online Subgoal Discovery
- At the end of each run, the agent creates a new bag and searches for diverse density peaks (concepts with high DD).
- Bottlenecks (diverse density peaks) appear in the initial stages of learning and persist throughout learning.
- A running average of how often each concept c appears as a peak is maintained; at the end of each trajectory this running average is updated (see the sketch below).
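The paper's exact update is not reproduced here; the sketch below keeps a simple count-based running average of how often each concept has appeared as a DD peak:

```python
from collections import defaultdict

peak_count = defaultdict(int)   # concept -> number of trajectories where it was a DD peak
trajectories_seen = 0

def record_peaks(peak_concepts):
    """Call once per finished trajectory with the concepts found as DD peaks."""
    global trajectories_seen
    trajectories_seen += 1
    for c in peak_concepts:
        peak_count[c] += 1

def peak_frequency(concept):
    """Running average of how often a concept has appeared as a peak."""
    return peak_count[concept] / trajectories_seen if trajectories_seen else 0.0
```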

Forming New Options
- An option is created for a subgoal found at time step t in the trajectory.
- The option's input set I can be initialized with the set of states visited by the agent from time (t - n) to t, where n is a parameter.
- The termination condition β is set to 1 when the subgoal is reached or when the agent is no longer in I, and is set to 0 otherwise.
- The reward function for learning the option's policy gives -1 on each step and 0 when the option terminates at the subgoal; the agent is rewarded negatively for leaving the input set.
- The option's policy uses the same state space as the overall problem.
- The option's value function is learned using experience replay with the saved trajectories.
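A hedged sketch of this construction; `is_subgoal`, the `leave_penalty` value, and the way trajectories are stored are illustrative assumptions rather than details from the paper:

```python
def create_subgoal_option(states, t, n, is_subgoal, leave_penalty=1.0):
    """Build the pieces of an option for a subgoal found at step t of a trajectory.

    `states` is the sequence of states visited on that trajectory."""
    # I: the states visited from time (t - n) to t seed the initiation set.
    initiation_set = set(states[max(0, t - n): t + 1])

    def beta(state):
        # Terminate when the subgoal is reached or the agent has left I.
        return 1.0 if (is_subgoal(state) or state not in initiation_set) else 0.0

    def subgoal_reward(next_state):
        # -1 per step, 0 on reaching the subgoal, an extra penalty for leaving I.
        if is_subgoal(next_state):
            return 0.0
        if next_state not in initiation_set:
            return -1.0 - leave_penalty
        return -1.0

    return initiation_set, beta, subgoal_reward
```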

Pseudocode for Subgoal Discovery

  Initialize the full trajectory database to the empty set
  For each trial:
      Interact with the environment / learn using RL
      Add the observed full trajectory to the database
      Create a positive or negative bag from the state trajectory
      Search for diverse density peaks
      For each peak concept c found:
          Update the running average for c
          If c is above threshold and passes the static filter:
              Create a new option o = ⟨I, π, β⟩ for reaching concept c
              Initialize I by examining the trajectory database
              Set β and initialize the policy π using experience replay
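A hedged Python rendering of the loop above; `run_trial`, `find_dd_peaks`, `passes_static_filter`, `build_option`, `experience_replay`, and `agent.add_option` stand in for the components described on the earlier slides and are not real library calls:

```python
from collections import defaultdict

def discover_subgoals(env, agent, num_trials, threshold,
                      run_trial, find_dd_peaks, passes_static_filter,
                      build_option, experience_replay):
    trajectory_db = []                    # full trajectory database, initially empty
    pos_bags, neg_bags = [], []
    peak_count = defaultdict(int)

    for trial in range(1, num_trials + 1):
        trajectory, success = run_trial(env, agent)      # interact with the environment / learn with RL
        trajectory_db.append(trajectory)
        bag = [s for (s, a, r) in trajectory]            # bag of states from this trajectory
        (pos_bags if success else neg_bags).append(bag)  # positive if successful, else negative

        for c in find_dd_peaks(pos_bags, neg_bags):      # concepts with high DD
            peak_count[c] += 1
            if peak_count[c] / trial > threshold and passes_static_filter(c):
                option = build_option(c, trajectory_db)  # init I and beta from saved trajectories
                experience_replay(option, trajectory_db) # initialize the option's policy pi
                agent.add_option(option)                 # assumed agent interface
```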

Macro Q-Learning
- Q-learning (standard one-step update for primitive actions):
  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
- Macro Q-learning (update after an option o has executed for k steps):
  Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \left[ r + \gamma^{k} \max_{a'} Q(s_{t+k}, a') - Q(s_t, o) \right],
  where r = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} is the discounted reward accumulated while the option executes, and the max ranges over the actions and options available in s_{t+k}.
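A sketch of the two updates in code, with Q represented as a dict from (state, action or option) to value; the default step-size and discount parameters are illustrative:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update for a primitive action."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def macro_q_update(Q, s, option, rewards, s_next, actions_and_options, alpha=0.1, gamma=0.9):
    """Macro (SMDP) Q-learning update after an option ran for k = len(rewards) steps."""
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))  # r_{t+1} + ... + gamma^{k-1} r_{t+k}
    best_next = max(Q.get((s_next, x), 0.0) for x in actions_and_options)
    target = discounted + (gamma ** k) * best_next
    Q[(s, option)] = Q.get((s, option), 0.0) + alpha * (target - Q.get((s, option), 0.0))
```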

Experimental Results
- Environments: a two-room gridworld and a four-room gridworld.
- No negative bags were created.
- The agent was limited to creating only one option per run.
- The learned options were compared against an appropriate multi-step policy.