Presentation on theme: "Python Programming, 2/e1. Most useful Presentation of algorithms Requiring us to read & respond (2) Discussion (2) or in-class exercises Examples (x2)"— Presentation transcript:
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML-09. Sutton, Szepesvari and Maei (2009) recently introduced the first temporal- difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. We introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD.
van Seijen, H., Sutton, R. S. True online TD(λ) ICML-14 TD(λ) based on equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. Equivalence between TD(λ) and the forward view is exact only for the off- line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(λ) that exactly achieves it. In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)matching an intuitively clear forward view even in the online casethat we have found a new algorithm that simply improves on classical TD(λ).
Efficiency in ML / AI 1.Data efficiency (rate of learning) 2.Computational efficiency (memory, computation, communication) 3.Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)
Would it have been fine to skip the review [of RL] or is it bad form? Also it says V*(s) is the optimal policy but shouldn't that be the optimal state-value function, and pi* is the optimal policy? Why are action-value functions are preferred to value functions Im still confused with the concepts of expert or domain knowledge. the knowledge the domain expert must supply to IFSA is less detailed. What does this mean and what is the benefit? (like need less information about features?) performance of IFSA-rev: dropped off very quickly then jumped around for a while to eventually find a better solution than Sarsa, and is able to find the optimal way before either IFSA or Sarsa, but why does it freak out at the beginning? Do we know? Different initial intuition may result in totally different ordering of the features, so I believe that it will influence the performance of IFSA and IFSA-rev since its easy for us to understand that picking more relevant features to learn more important concepts at first is usually better to speed up learning. But Im not sure whether they will generally converge to the same level finally. The paper states that the agent adds a feature to the feature set when the algorithm is reasonably converged, in a case of a continuous state, how does the algorithm address the values of the new state? I can imagine a greedy heuristic approach where the algorithm runs each feature set as the first feature set, and after x runs, whichever feature is doing best is selected as the first feature set. Then fork that run and run each other feature set from there, again selecting the one which performs best, etc. This gives a quadratic run time based on the number of feature sets, rather than the exponential time it takes to find the optimal one. This could be used when having an expert select the order is not feasible. If in Chess the positions of the pawns didnt matter that much you could train the RL with just the more important pieces. thereby shrinking your potential state space down tremendously. Then when you later readd the pawns back in you now know something about every state you encounter for a particular set of the main 8 pieces independent of how the pawns are situated. That would allow you to ignore millions of states where the pawns were shifted just slightly but you already knew the primary 8 pieces had a low value.
Background of 1 st author Self-citations Understandability – utilize Paper length Reasonably Converged Related work Error bars / stat sig How to pick test domains XOR Keepaway