Reinforcement Learning in a Multi-Robot Domain
Author: Maja J Mataric
Presenter: Drew Bagnell
Note: Some of the material I will discuss comes from Mataric's paper "Learning in Multi-robot Systems" in Adaption and Learning in Multi-Agent Systems.
Motivation
Reinforcement learning potentially provides a method to construct sophisticated behaviors in single and multiple robots with little programming in the classical sense
Declarative task specification: what, not how
Good results in many domains (see the RL survey by Kaelbling, Littman and Moore)
Why do we care?
Problems in RL?
Standard RL algorithms and analyses are inapplicable in the multi-agent environments Mataric is interested in
The MDP is an inaccurate model in situated robotics
Traditional algorithms do not take advantage of domain knowledge to enable or accelerate learning
Two main classes of problems: managing state-space complexity, and structuring and assigning reinforcement
Note: Mataric takes a very broad view of the reinforcement learning problem; it is refreshing in that it is not as theoretically biased. Specifically, she argues the MDP model fails for a number of reasons.
State Representation
Combinatorial explosion: the state space is exponential in the number of features
Continuous state: the high-level filters used in simulation may be unrealistic
She suggests, in her case, using behavior pre-conditions as a "state" space
Note: Mataric's definition of state is unclear; I take it to mean any observation space the learning algorithm can work with. At this point she is essentially giving up on finding optimal or near-optimal policies: she has abandoned any notion of true state, and once we do so, even finding the best memoryless policy is NP-hard.
Transitions/Events
The world and the agent are asynchronous
The world is largely uncontrolled
Noise and uncertainty have "specific usually complex properties that cannot be modeled"
Building a predictive model can be very slow, so she goes model-free instead
Note: The statement about noise reiterates her thoughts on state.
Reinforcement vs. “Shaping”
Monolithic reinforcement functions
Multiple goals and their ordering
The more immediate the reinforcement, the better
Domain knowledge can give us progress estimates that are informative
"RL methods hide (domain knowledge) in the reinforcement function, which often employs some ad hoc embedding of the domain semantics."
Note: "Shaping" is a term Mataric borrows from the psychology literature, a shortening of "shaping by successive approximations". A robot or animal is shaped when it is conditioned in steps toward an ultimate goal. Traditional RL techniques without domain knowledge would have to reach the goal by accident; the further away the goal is, the harder it is to associate good actions with it. A toy contrast between the two reward styles is sketched below.
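As a toy illustration of that contrast (mine, not the paper's: the state fields, subgoals, and weights below are invented), a monolithic reward pays off only at the ultimate goal, while a shaped one also pays for subgoals and measurable progress:

```python
from dataclasses import dataclass

@dataclass
class ForagingState:
    """Toy observation record; every field here is an invented example."""
    holding_puck: bool
    at_home: bool
    delivered_puck: bool
    distance_home: float

def monolithic_reward(s: ForagingState) -> float:
    """Reinforcement only at the ultimate goal, so credit assignment is hard."""
    return 1.0 if s.at_home and s.delivered_puck else 0.0

def shaped_reward(s: ForagingState, prev: ForagingState) -> float:
    """Immediate reinforcement for subgoals plus a progress estimate."""
    r = 0.0
    if s.holding_puck and not prev.holding_puck:
        r += 0.5   # subgoal achieved: picked up a puck
    if s.holding_puck and s.distance_home < prev.distance_home:
        r += 0.1   # progress estimate: carrying the puck toward home
    if s.at_home and s.delivered_puck:
        r += 1.0   # ultimate goal reached
    return r
```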
Progress Estimators
Partial internal critics
Functions that provide positive or negative reinforcement with respect to the current goal (behavior post-conditions)
Encourage exploration, in the sense that as long as the behavior makes progress we do not switch behaviors at every discrete time step
Allow us to catch thrashing within a single behavior (a sketch follows below)
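A minimal sketch of what a progress estimator for one behavior might look like; the patience window, thresholds, and reward magnitudes are assumptions, not values from the paper:

```python
class ProgressEstimator:
    """Partial internal critic for a single behavior: reinforces measurable
    progress toward that behavior's goal and flags thrashing. The patience
    window and reward magnitudes are illustrative assumptions."""

    def __init__(self, patience: int = 20):
        self.patience = patience
        self.steps_without_progress = 0

    def update(self, metric: float, prev_metric: float) -> float:
        """Return intermediate reinforcement; positive when the metric improved
        (e.g. the distance to the behavior's post-condition shrank)."""
        if metric < prev_metric:
            self.steps_without_progress = 0
            return 0.1
        self.steps_without_progress += 1
        return -0.05   # mild punishment for stalling

    def declares_failure(self) -> bool:
        """True once the behavior has gone too long without progress."""
        return self.steps_without_progress >= self.patience
```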
The Learning Task
Mataric assumes the behaviors are known; the task is to learn a switching function between them
The observation space is shifted to be defined by the operating conditions of each behavioral module
Behaviors run until the next event, or until a progress estimator declares failure (the loop is sketched below)
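The implied execution loop, roughly sketched; every interface name here (the robot, behavior, switcher, and estimator objects) is assumed for illustration rather than taken from the paper:

```python
def run_behavior_switching(robot, switcher, estimators):
    """Run the selected behavior until the next event, or until its progress
    estimator declares failure, then reselect. All interfaces are assumed."""
    while robot.active():
        condition = robot.observe_preconditions()       # which behaviors apply now
        behavior = switcher.choose_behavior(condition)  # learned switching function
        estimator = estimators[behavior.name]
        behavior.start(robot)
        while not robot.event_occurred() and not estimator.declares_failure():
            behavior.step(robot)
        behavior.stop(robot)
```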
The Learning Algorithm
Find a total ordering on behaviors using values A(c, b)
A(c, b) is a weighted sum of immediate ("heterogeneous") reinforcement for subgoals achieved and of progress-estimator feedback
Learning is continuous
No bootstrapping occurs between (c, b) pairs: there is no flow of information between them of the kind DP algorithms exploit (see the sketch below)
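A minimal sketch of the A(c, b) bookkeeping as I read it: each (condition, behavior) pair accumulates a weighted sum of heterogeneous reinforcement and progress-estimator feedback, and the switcher picks the highest-valued behavior for the current condition. The tabular form and the particular weights are my assumptions, not the paper's exact formulation:

```python
from collections import defaultdict

class BehaviorSwitcher:
    """Tabular A(c, b) values over (condition, behavior) pairs. There is no
    bootstrapping between pairs: each entry accumulates only the reinforcement
    received while that behavior ran under that condition."""

    def __init__(self, behaviors, w_event: float = 1.0, w_progress: float = 0.5):
        self.behaviors = behaviors
        self.w_event = w_event        # weight on heterogeneous (subgoal) reinforcement
        self.w_progress = w_progress  # weight on progress-estimator feedback
        self.A = defaultdict(float)   # (condition, behavior) -> accumulated value

    def update(self, condition, behavior,
               event_reinforcement: float, progress_reinforcement: float) -> None:
        """Accumulate the weighted sum for the pair that was active; learning
        is continuous, so this is called every time reinforcement arrives."""
        self.A[(condition, behavior)] += (
            self.w_event * event_reinforcement
            + self.w_progress * progress_reinforcement
        )

    def choose_behavior(self, condition):
        """Pick the behavior with the highest accumulated value for this
        condition; this induces the learned ordering on behaviors."""
        return max(self.behaviors, key=lambda b: self.A[(condition, b)])
```

With this table in place, choose_behavior plays the role of the switching function from the previous slide, and update is called whenever an event or a progress estimator delivers reinforcement for the pair that was active.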
Experimental Setup
Foraging: a "complex and biologically inspired" task
Known behaviors, plus utility behaviors:
Safe-wandering
Dispersion
Resting
Homing
The space of behavior pre-conditions serves as the observation space (illustrated below)
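To make the observation space concrete, a hypothetical set of pre-condition predicates might look like the following; the actual predicates used in the paper may differ, but the point is that observations are a handful of binary conditions rather than raw sensor state:

```python
from typing import NamedTuple

class Condition(NamedTuple):
    """Illustrative pre-condition predicates (invented names)."""
    have_puck: bool
    at_home: bool
    near_other_robot: bool

BEHAVIORS = ["safe-wandering", "dispersion", "resting", "homing"]

# Example: the condition one robot observes at some instant.
c = Condition(have_puck=True, at_home=False, near_other_robot=True)
```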
Experimental Results
Monolithic Q-learning does poorly; in particular, it has problems with other robots and only performs tasks that gain it immediate reward
Heterogeneous rewards do better
The "shaping" reward structure does the best of all
Uncovered Interesting Issues
Interaction and credit assignment: the task is essentially a single-robot task
Hidden environmental state: the problems it induces are hardly discussed
Hidden state introduced by interaction
Objections
Use of the word "state" implies a sufficient statistic
Features need not expand the observation space exponentially if we know a priori that they are only relevant given certain other features
Using that observation, we can build progress estimators into an initial value function; moreover, such estimators can tune themselves
Semi-Markov processes deal with discrete-event systems instead of discrete time
It almost certainly took longer to craft the reward system and the progress estimators than it would have taken to write the "empirically derived" optimal switcher
Conclusions
Mataric makes important and valid points regarding the difficulty of learning on real robots
She shows her methods can improve performance by using domain knowledge
The methods described are ad hoc, and very little can be said even about convergence to any policy
It is not clear that any work is saved by this approach, nor that what we have when learning is done is a good or optimal policy