
1 Flexible and fast convergent learning agent Miguel A. Soto Santibanez Michael M. Marefat Department of Electrical and Computer Engineering University of Arizona, Tucson, AZ santiban@ece.arizona.edu marefat@ece.arizona.edu

2 Background and Motivation “A computer program is said to LEARN from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
A robot driving learning problem:
Task T: driving on public four-lane highways using vision sensors
Performance measure P: average distance traveled before an error (as judged by a human overseer)
Training experiences E: a sequence of images and steering commands recorded while observing a human driver

3 Background and Motivation II
1) Artificial Neural Networks
Robust to errors in the training data
Depends on the availability of good and extensive training examples
2) Instance-Based Learning
Able to model complex policies by making use of less complex local approximations
Depends on the availability of good and extensive training examples
3) Reinforcement Learning
Independent of the availability of good and extensive training examples
Convergence to the optimal policy can be extremely slow

4 Background and Motivation III Motivation: Is it possible to get the best of both worlds? Is it possible for a Learning Agent to be flexible and fast convergent at the same time?

5 The Problem Formalization: Given
a) a set of actions A = {a1, a2, a3, ...},
b) a set of situations S = {s1, s2, s3, ...},
c) and a function TR(a, s) → tr, where tr is the total reward associated with applying action a while in situation s,
the LA needs to construct a set of rules P = {rule(s1, a1), rule(s2, a2), ...} such that ∀ rule(s, a) ∈ P, a = amax, where TR(amax, s) = max(TR(a1, s), TR(a2, s), ...).
Also: 1) increase flexibility and 2) increase speed of convergence.
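To make the formalization concrete, here is a minimal sketch, assuming a small dictionary-based TR table (the names A, S, TR and greedy_policy are illustrative, not from the slides), of how the rule set P would be read off once TR is known:

```python
# Hypothetical illustration of the formalization above: once the total
# rewards TR(a, s) are known, P simply picks the best action per situation.

A = ["a1", "a2", "a3"]          # set of actions
S = ["s1", "s2", "s3"]          # set of situations
TR = {(a, s): 0.0 for a in A for s in S}   # TR(a, s) -> tr (placeholder values)

def greedy_policy(TR, A, S):
    """Build P = {rule(s, a)} with a = amax such that TR(amax, s) is maximal."""
    return {s: max(A, key=lambda a: TR[(a, s)]) for s in S}

P = greedy_policy(TR, A, S)     # e.g. {'s1': 'a1', 's2': 'a1', 's3': 'a1'}
```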

6 The Solution The Q learning Algorithm:
1: ∀ rule(s, a) ∈ P, TR(a, s) ← 0
2: find out what is the current situation si
3: do forever:
4:   select an action ai ∈ A and execute it
5:   find out what is the immediate reward r
6:   find out what is the current situation si′
7:   TR(ai, si) ← r + aFactor · max(TR(a, si′)) over all a ∈ A
8:   si ← si′
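For reference, a minimal tabular sketch of steps 1-8 follows. The environment interface (observe/step) and the epsilon-greedy action selection are assumptions for illustration; the slide itself only fixes the update rule in step 7.

```python
import random
from collections import defaultdict

def q_learning(env, actions, steps=1000, aFactor=0.9, epsilon=0.1):
    """Tabular Q learning following steps 1-8 above.

    Assumes a hypothetical environment exposing:
      env.observe()    -> current situation
      env.step(action) -> (immediate reward, new situation)
    """
    TR = defaultdict(float)                       # step 1: TR(a, s) <- 0
    s = env.observe()                             # step 2: current situation si
    for _ in range(steps):                        # step 3: "do forever" (bounded here)
        # step 4: select and execute an action (epsilon-greedy; the slide
        # leaves the selection strategy open)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: TR[(x, s)])
        r, s_next = env.step(a)                   # steps 5-6: reward and new situation
        # step 7: back up the discounted best total reward of the new situation
        TR[(a, s)] = r + aFactor * max(TR[(x, s_next)] for x in actions)
        s = s_next                                # step 8: si <- si'
    return TR
```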

7 The Solution II
Advantages:
1) The LA does not depend on the availability of good and extensive training examples.
Reason: a) This method learns from experimentation instead of from given training examples.
Shortcomings:
1) Convergence to the optimal policy can be very slow.
Reasons: a) The Q learning Algorithm propagates “good findings” very slowly. b) The speed of convergence is tied to the number of situations that need to be handled.
2) This method may not be usable on high dimensionality problems.
Reason: a) The memory requirements grow exponentially as we add more dimensions to the problem.

8 The Solution III Speed of convergence is tied to the number of situations:
more situations ==> more P rules that need to be found
more P rules that need to be found ==> more experiments are needed
more experiments are needed ==> slower convergence

9 The Solution IV Slow propagation of “good findings”:

10 The Solution V First Sub-problem: Slow propagation of “good findings” Solution: Develop a method that propagates “good findings” beyond the previous state

11 The Solution VI Solution to First Sub-problem:
a) Use a buffer, which we call “short term memory”, to keep track of the last n situations
b) After each learning experience, apply the following algorithm:
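The propagation algorithm itself appears only as a figure in the original slides. The following is just a plausible sketch of how such a buffer could be used, under the assumption that it stores the last n transitions and that the step-7 update is re-applied to them newest-first after every real experience; the names and the backward-sweep strategy are illustrative, not necessarily the authors' exact algorithm.

```python
from collections import deque

def make_short_term_memory(n):
    """Short term memory: the last n transitions (situation, action, reward, next situation)."""
    return deque(maxlen=n)

def propagate_good_findings(TR, memory, actions, aFactor=0.9):
    """Sweep backwards over the stored transitions so that a high reward found
    at the most recent situation is pushed back through the situations that
    led to it, instead of moving only one step per real experience."""
    for s, a, r, s_next in reversed(memory):
        TR[(a, s)] = r + aFactor * max(TR[(x, s_next)] for x in actions)
```

In this sketch, each real experience would be appended to the buffer and propagate_good_findings called once, so a single experiment can update up to n rules instead of one.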

12 The Solution VII The Second and Third Sub-problems:
a) Memory requirements grow exponentially as we add more dimensions to the problem
b) Speed of convergence is tied to the number of situations that need to be handled
Solution:
1) We keep only a few examples of the policy (also called prototypes)
2) We generate the policy for situations not described explicitly by these prototypes by “generalizing” from “nearby” prototypes
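As an illustration of the prototype idea only (the slides do not spell out the generalization rule; a nearest-prototype lookup over numeric situation vectors is assumed here):

```python
def nearest_prototype_value(prototypes, situation):
    """prototypes: list of (situation_vector, stored_value) pairs.
    A situation never stored explicitly inherits the value of the
    closest ("nearby") prototype."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    _, value = min(prototypes, key=lambda pv: sq_dist(pv[0], situation))
    return value

# Example: two prototypes in a two-dimensional situation space
prototypes = [((0.0, 0.0), 1.0), ((1.0, 1.0), 5.0)]
print(nearest_prototype_value(prototypes, (0.9, 0.8)))   # -> 5.0
```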

13 The Solution VIII (figure: Kanerva coding, tile coding, and Moving Prototypes)

14 The Solution IX

15 The Solution X

16 The Solution XI

17 The Solution XII A sound tree:
a) all the “areas” are mutually exclusive
b) their merging is exhaustive
c) the merging of any two sibling “areas” is equal to their parent’s “area”
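A minimal sketch of these three conditions, using one-dimensional intervals [lo, hi) as the “areas” (the interval representation and the names Node and is_sound are assumptions for illustration):

```python
class Node:
    """A node covers the interval [lo, hi); its children, if any, split it."""
    def __init__(self, lo, hi, children=None):
        self.lo, self.hi = lo, hi
        self.children = children or []

def is_sound(node):
    """Sibling areas must be mutually exclusive, exhaustive, and together
    equal to their parent's area, at every level of the tree."""
    if not node.children:
        return True
    kids = sorted(node.children, key=lambda c: c.lo)
    if kids[0].lo != node.lo or kids[-1].hi != node.hi:
        return False                       # merging not equal to the parent's area
    for left, right in zip(kids, kids[1:]):
        if left.hi != right.lo:
            return False                   # gap (not exhaustive) or overlap (not exclusive)
    return all(is_sound(c) for c in kids)

# A sound split of [0, 10) into [0, 4) and [4, 10)
print(is_sound(Node(0, 10, [Node(0, 4), Node(4, 10)])))   # -> True
```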

18 The Solution XIII Impossible Merge

19 The Solution XIV “Smallest predecessor”

20 The Solution XV

21 The Solution XVI Possible ways of breaking the existing nodes when a new node is being inserted (figure)

22 The Solution XVII (figure: List 1, List 1.1, and List 1.2)

23 The Solution XVIII

24 The Solution XIX

25 The Solution XX

26 Results
The performance of the algorithm “Propagation of Good Findings” is especially good when the world is large.
The algorithm “Propagation of Good Findings” is more efficient when the size of its “Short Term Memory” is large.

27 Results II
The algorithm “Propagation of Good Findings” is more efficient when the value of the parameter “discount factor” is large.
Results do not depend on the sequence of random numbers.

28 Conclusions
Q Learning Algorithm ==> LA becomes more flexible
Propagating concept ==> convergence is accelerated
Moving Prototypes concept ==> LA becomes more flexible
Moving Prototypes concept ==> convergence is accelerated

29 Conclusions II What is left to do:
Obtain results on the advantages of using regression trees and linear approximation over other similar methods (just as we have already done with the method “Propagation of Good Findings”).
Apply the proposed model to example applications, such as a self-optimizing middleman between a high-level planner and the actuators in a robot.
Determine more precisely the limits on the use of this model.

