
Slide 1: Python Programming, 2/e

Slide 2: Most useful
- Presentation of algorithms
- Requiring us to read & respond (x2)
- Discussion (x2) or in-class exercises
- Examples (x2)

Slide 3: Least useful
- The book: hard to read, redundant, goes too quickly
- Reading responses
- No deadline for exercises
- Long uninterrupted lectures
- Easy to get confused during discussions

Slide 4: What could students do?
- Talk more / less (x4)
- Work towards mastery
- Summarize chapters and discuss with others
- Rephrase problems / algorithms in their own words
- Read the chapter more than once

Slide 5: What could Matt do?
- Provide more examples / code (x3)
- Show some real applications
- Present a challenge / problem before the method
- Go through the textbook more slowly

Slide 6
- δ is the normal TD error
- e is the vector of eligibility traces
- θ is a weight vector
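The three symbols on the slide combine into the standard linear TD(λ) update with accumulating traces. A minimal sketch, assuming linear function approximation; the function name and step-size values are illustrative, not from the slides:

```python
import numpy as np

def td_lambda_update(theta, e, phi, phi_next, reward,
                     alpha=0.1, gamma=0.99, lam=0.9):
    """One accumulating-trace TD(lambda) step with linear function approximation.

    theta    -- weight vector (the value estimate is theta . phi(s))
    e        -- eligibility trace vector, same shape as theta
    phi      -- feature vector of the current state
    phi_next -- feature vector of the next state
    """
    # delta: the normal TD error
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # decay the old traces, then bump the trace for the current features
    e = gamma * lam * e + phi
    # move every weight in proportion to its eligibility
    theta = theta + alpha * delta * e
    return theta, e
```

Each weight is updated in proportion to how recently (and how strongly) its feature was active, which is what lets a single TD error propagate credit backwards over many steps.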

Slide 7: Linear Methods
Why are these a particularly important type of function approximation?
- Parameter vector θ_t
- Column vector of features φ(s) for every state (same number of components as θ)
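One concrete reason linear methods are special: the value estimate is just the dot product θᵀφ(s), so its gradient with respect to θ is simply φ(s), which makes gradient-descent TD updates cheap and analyzable. A tiny sketch (names and numbers are invented for the example):

```python
import numpy as np

def value(theta, phi_s):
    """Linear value-function approximation: V(s) = theta . phi(s).
    The gradient of V with respect to theta is just phi(s)."""
    return theta @ phi_s

theta = np.array([0.5, -0.2, 1.0])   # parameter vector: one weight per feature
phi_s = np.array([1.0, 0.0, 2.0])    # feature vector for some state s
print(value(theta, phi_s))           # 0.5*1 + (-0.2)*0 + 1.0*2 = 2.5
```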

Slide 8: Tile coding
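Tile coding builds a binary feature vector by overlaying several offset grids ("tilings") on the input space; each tiling contributes exactly one active feature, so nearby inputs share many but not all active tiles. A minimal 1-D sketch; the uniform-offset scheme, parameter values, and function names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def tile_indices(x, n_tilings=8, n_tiles=10, lo=0.0, hi=1.0):
    """Return one active tile index per tiling for a scalar input x in [lo, hi]."""
    width = (hi - lo) / n_tiles
    indices = []
    for t in range(n_tilings):
        offset = t * width / n_tilings          # uniform offsets between tilings
        i = int((x - lo + offset) / width)
        i = min(i, n_tiles)                     # inputs near hi may spill one tile over
        indices.append(t * (n_tiles + 1) + i)   # flatten (tiling, tile) into one index
    return indices

def features(x, n_tilings=8, n_tiles=10):
    """Binary feature vector with exactly n_tilings ones."""
    phi = np.zeros(n_tilings * (n_tiles + 1))
    phi[tile_indices(x, n_tilings, n_tiles)] = 1.0
    return phi
```

Because only `n_tilings` features are active at any time, a linear TD update touches only that many weights per step, regardless of how fine the discretization is.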

Slide 9: Mountain-Car Task

Slide 10: 3D Mountain Car
- X: position and acceleration
- Y: position and acceleration
- http://eecs.wsu.edu/~taylorm/traj.gif

Slide 11: Control with FA
- Bootstrapping
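For control, the same linear machinery is applied to action values, and the bootstrapped target (which itself depends on θ) is treated as a constant when differentiating, hence "semi-gradient". A one-step Sarsa sketch under those assumptions (function name, feature encoding, and step sizes are illustrative):

```python
import numpy as np

def sarsa_update(theta, phi_sa, phi_sa_next, reward,
                 alpha=0.1, gamma=0.99, done=False):
    """One semi-gradient Sarsa step for a linear action-value function
    Q(s, a) = theta . phi(s, a).  The bootstrapped target
    reward + gamma * Q(s', a') is treated as fixed, so we only follow
    the gradient of the current estimate Q(s, a)."""
    target = reward if done else reward + gamma * theta @ phi_sa_next
    delta = target - theta @ phi_sa
    return theta + alpha * delta * phi_sa
```

The combination on this slide (function approximation + bootstrapping, plus off-policy training) is exactly the setting where naive TD methods can become unstable, which motivates the gradient-TD papers on the next slides.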

Slide 12: Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML-09.

Abstract: Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. We introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD.
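The TDC update the abstract describes (conventional TD plus an additional correction term driven by a second weight vector w that starts at zero) can be sketched roughly as follows. This is a paraphrase of the published update from memory, not code from the paper, and the step-size values are illustrative:

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward,
               alpha=0.05, beta=0.5, gamma=0.99):
    """One TDC (linear TD with gradient correction) step.

    theta -- main weight vector; w -- auxiliary weight vector.
    While w is zero, the theta update reduces to conventional TD."""
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # main weights: ordinary TD step plus the gradient-correction term
    theta = theta + alpha * delta * phi - alpha * gamma * (w @ phi) * phi_next
    # auxiliary weights: track the expected TD error per feature
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```

The correction term is what lets the method stay convergent under off-policy training with linear function approximation, at the cost of maintaining the second weight vector.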

Slide 13: van Seijen, H., Sutton, R. S. True online TD(λ). ICML-14.

Abstract: TD(λ)'s appeal is based on its equivalence to a clear and conceptually simple forward view, and on the fact that it can be implemented online in an inexpensive manner. The equivalence between TD(λ) and the forward view is exact only for the offline version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications), the match to the forward view is only approximate. In this paper we introduce a new forward view that takes into account the possibility of changing estimates, and a new variant of TD(λ) that exactly achieves it. In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems that, by adhering more truly to the original goal of TD(λ), matching an intuitively clear forward view even in the online case, we have found a new algorithm that simply improves on classical TD(λ).
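The online algorithm the abstract describes replaces accumulating traces with dutch-style traces plus a correction involving the previous value estimate. A rough per-step sketch, following my reading of the published update; the function name and step-size values are illustrative:

```python
import numpy as np

def true_online_td_step(theta, z, v_old, phi, phi_next, reward,
                        alpha=0.1, gamma=0.99, lam=0.9):
    """One true online TD(lambda) step with linear function approximation."""
    v = theta @ phi
    v_next = theta @ phi_next
    delta = reward + gamma * v_next - v
    # dutch trace: like an accumulating trace, minus an overlap correction
    z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ phi)) * phi
    # the extra (v - v_old) terms are what distinguish this from plain TD(lambda)
    theta = theta + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * phi
    return theta, z, v_next  # v_next becomes v_old on the following step
```

With these per-step updates, the weights at the end of an episode exactly match what the new forward view would compute, which plain online TD(λ) only approximates.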

Slide 14: Efficiency in ML / AI
1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)

Slide 15: Reading responses

- Would it have been fine to skip the review [of RL], or is it bad form?
- Also, it says V*(s) is the optimal policy, but shouldn't that be the optimal state-value function, with π* the optimal policy?
- Why are action-value functions preferred to value functions?
- I'm still confused by the concepts of expert or domain knowledge. "The knowledge the domain expert must supply to IFSA is less detailed." What does this mean, and what is the benefit? (Like needing less information about features?)
- Performance of IFSA-rev: it dropped off very quickly, then jumped around for a while, eventually finding a better solution than Sarsa, and it finds the optimal behavior before either IFSA or Sarsa. But why does it freak out at the beginning? Do we know?
- Different initial intuitions may result in totally different orderings of the features, so I believe this will influence the performance of IFSA and IFSA-rev, since it's easy to understand that picking more relevant features to learn more important concepts first usually speeds up learning. But I'm not sure whether they will generally converge to the same level in the end.
- The paper states that the agent adds a feature to the feature set when the algorithm is reasonably converged. In the case of a continuous state, how does the algorithm address the values of the new state?
- I can imagine a greedy heuristic approach where the algorithm runs each feature set as the first feature set, and after x runs, whichever feature is doing best is selected as the first feature set. Then fork that run and run each other feature set from there, again selecting the one that performs best, and so on. This gives a quadratic run time in the number of feature sets, rather than the exponential time it takes to find the optimal ordering. This could be used when having an expert select the order is not feasible.
- If in chess the positions of the pawns didn't matter that much, you could train the RL agent with just the more important pieces, thereby shrinking your potential state space down tremendously. Then when you later re-add the pawns, you now know something about every state you encounter for a particular configuration of the main 8 pieces, independent of how the pawns are situated. That would allow you to ignore millions of states where the pawns were shifted just slightly but you already knew the primary 8 pieces had a low value.

Slide 16
- Background of 1st author
- Self-citations
- Understandability ("utilize")
- Paper length
- "Reasonably converged"
- Related work
- Error bars / statistical significance
- How to pick test domains
- XOR
- Keepaway
