Presentation transcript:


Most useful
- Presentation of algorithms
- Requiring us to read & respond (x2)
- Discussion (x2) or in-class exercises
- Examples (x2)

Least useful
- The book: hard to read, redundant, goes too quickly
- Reading responses
- No deadline for exercises
- Long uninterrupted lectures
- Easy to get confused during discussions

What could students do?
- Talk more / less (x4)
- Work towards mastery
- Summarize chapters and discuss with others
- Rephrase problems / algorithms in own words
- Read the chapter more than once

What could Matt do?
- Provide more examples / code (x3)
- Show some real applications
- Present the challenge / problem before the method
- Go through the textbook more slowly

δ is the normal TD error, e is the vector of eligibility traces, and θ is the weight vector.
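For reference, these symbols come from the standard gradient-descent TD(λ) update with linear function approximation and accumulating traces (as in Sutton & Barto; α is the step size):

```latex
\begin{aligned}
\delta_t     &= r_{t+1} + \gamma\,\theta_t^\top \phi_{s_{t+1}} - \theta_t^\top \phi_{s_t} \\
e_t          &= \gamma\lambda\, e_{t-1} + \phi_{s_t} \\
\theta_{t+1} &= \theta_t + \alpha\, \delta_t\, e_t
\end{aligned}
```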

Linear Methods
Why are these a particularly important type of function approximation?
- Parameter vector θ_t
- Column vector of features φ_s for every state (same number of components)
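In the linear case the value estimate is just the inner product of these two vectors, and the gradient with respect to the parameters is simply the feature vector, which is the main reason linear methods are so convenient to analyze:

```latex
V_t(s) = \theta_t^\top \phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i),
\qquad
\nabla_{\theta_t} V_t(s) = \phi_s
```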

Tile coding
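A minimal Python sketch of grid tile coding for a 2-D state such as mountain car. The ranges, tile counts, and offset scheme below are illustrative assumptions, not values taken from the slides; practical code often uses Sutton's tile-coding software with hashing instead.

```python
def tile_code(x, x_dot, n_tilings=8, tiles_per_dim=8,
              x_range=(-1.2, 0.5), x_dot_range=(-0.07, 0.07)):
    """Return the index of the active tile in each tiling for a 2-D state.

    Each tiling lays a (tiles_per_dim + 1) x (tiles_per_dim + 1) grid over
    the state space; successive tilings are shifted by a fraction of a tile
    width, so the binary features overlap and generalize locally.
    """
    x_width = (x_range[1] - x_range[0]) / tiles_per_dim
    v_width = (x_dot_range[1] - x_dot_range[0]) / tiles_per_dim
    tiles_per_tiling = (tiles_per_dim + 1) ** 2
    active = []
    for t in range(n_tilings):
        # each tiling uses a differently offset grid origin
        x0 = x_range[0] - t * x_width / n_tilings
        v0 = x_dot_range[0] - t * v_width / n_tilings
        ix = min(int((x - x0) / x_width), tiles_per_dim)
        iv = min(int((x_dot - v0) / v_width), tiles_per_dim)
        active.append(t * tiles_per_tiling + iv * (tiles_per_dim + 1) + ix)
    return active

# The binary feature vector for a state has exactly n_tilings ones,
# at the indices returned by tile_code(position, velocity).
```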

Mountain-Car Task
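For readers who have not seen the task: in the standard mountain-car problem (Sutton & Barto) an underpowered car must escape a valley by rocking back and forth. The state is (position, velocity), the action is a discrete throttle, and the reward is -1 per step. The usual dynamics, with bound() clipping each variable to its range, are:

```latex
\begin{aligned}
\dot{x}_{t+1} &= \operatorname{bound}\big(\dot{x}_t + 0.001\,a_t - 0.0025\cos(3 x_t)\big),
  \qquad a_t \in \{-1, 0, +1\} \\
x_{t+1}       &= \operatorname{bound}\big(x_t + \dot{x}_{t+1}\big),
  \qquad x \in [-1.2,\, 0.5],\ \ \dot{x} \in [-0.07,\, 0.07]
\end{aligned}
```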

3D Mountain Car
- X: position and velocity
- Y: position and velocity

Control with FA Bootstrapping
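To make "control with FA" concrete, here is a rough sketch of episodic semi-gradient Sarsa with a linear action-value function over binary (e.g. tile-coded) features. It is a sketch under stated assumptions, not code from the deck: env is assumed to follow a gym-style reset()/step() API, and features(s, a) is a hypothetical helper returning the indices of the active features for a state-action pair.

```python
import numpy as np

def sarsa_linear(env, features, n_features, n_actions,
                 alpha=0.1, gamma=1.0, epsilon=0.1, episodes=500):
    """Episodic semi-gradient Sarsa with a linear action-value function.

    q(s, a) is the sum of theta over the active feature indices returned by
    features(s, a); the gradient of q w.r.t. theta is that binary vector,
    so each update only touches the active indices.
    """
    theta = np.zeros(n_features)

    def q(s, a):
        return theta[features(s, a)].sum()

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            if terminated:
                target = r                 # no bootstrapping past a terminal state
                a_next = None
            else:
                a_next = epsilon_greedy(s_next)
                target = r + gamma * q(s_next, a_next)
            # semi-gradient update on the active features only
            theta[features(s, a)] += alpha * (target - q(s, a))
            s, a = s_next, a_next
    return theta
```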

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML-09.
Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. We introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD.
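For reference, the TDC update the abstract describes ("the same update rule as conventional TD except for an additional term which is initially zero") maintains a second weight vector w with its own step size β. The following is a paraphrase of the updates from that paper and should be checked against the original:

```latex
\begin{aligned}
\delta_t     &= r_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \\
\theta_{t+1} &= \theta_t + \alpha\,\delta_t\,\phi_t
               \;-\; \alpha\,\gamma\,\phi_{t+1}\,\big(\phi_t^\top w_t\big)
               && \text{(TDC: conventional TD plus a correction term)} \\
w_{t+1}      &= w_t + \beta\,\big(\delta_t - \phi_t^\top w_t\big)\,\phi_t
\end{aligned}
```

GTD2 uses the same w update but replaces the θ update with θ_{t+1} = θ_t + α (φ_t − γ φ_{t+1})(φ_t^⊤ w_t).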

van Seijen, H., Sutton, R. S. True online TD(λ). ICML-14.
The appeal of TD(λ) is based on its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(λ) and the forward view is exact only for the offline version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(λ) (in which updates are made at each step, which generally performs better and is always used in applications), the match to the forward view is only approximate. In this paper we introduce a new forward view that takes into account the possibility of changing estimates, and a new variant of TD(λ) that exactly achieves it. In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems that, by adhering more truly to the original goal of TD(λ), matching an intuitively clear forward view even in the online case, we have found a new algorithm that simply improves on classical TD(λ).

Efficiency in ML / AI
1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)

- Would it have been fine to skip the review [of RL], or is it bad form?
- Also, it says V*(s) is the optimal policy, but shouldn't that be the optimal state-value function, with π* being the optimal policy?
- Why are action-value functions preferred to value functions?
- I'm still confused by the concepts of expert or domain knowledge. "The knowledge the domain expert must supply to IFSA is less detailed." What does this mean, and what is the benefit? (Does it need less information about the features?)
- Performance of IFSA-rev: it dropped off very quickly, then jumped around for a while, eventually finding a better solution than Sarsa, and it finds the optimal policy before either IFSA or Sarsa. But why does it freak out at the beginning? Do we know?
- A different initial intuition may result in a totally different ordering of the features, so I believe it will influence the performance of IFSA and IFSA-rev, since picking more relevant features to learn more important concepts first usually speeds up learning. But I'm not sure whether they will generally converge to the same level in the end.
- The paper states that the agent adds a feature to the feature set when the algorithm has reasonably converged. In the case of a continuous state, how does the algorithm handle the values of the new state?
- I can imagine a greedy heuristic approach where the algorithm runs each feature set as the first feature set, and after x runs, whichever feature is doing best is selected as the first feature set. Then fork that run and run each other feature set from there, again selecting the one which performs best, and so on. This gives a run time quadratic in the number of feature sets, rather than the exponential time it takes to find the optimal ordering. It could be used when having an expert select the order is not feasible.
- If in chess the positions of the pawns didn't matter that much, you could train the RL agent with just the more important pieces, thereby shrinking your potential state space tremendously. When you later re-add the pawns, you already know something about every state you encounter for a particular configuration of the main 8 pieces, independent of how the pawns are situated. That would let you ignore millions of states where the pawns were shifted just slightly but you already knew the primary 8 pieces had a low value.

- Background of 1st author
- Self-citations
- Understandability – "utilize"
- Paper length
- "Reasonably converged"
- Related work
- Error bars / statistical significance
- How to pick test domains: XOR, Keepaway