
1 Fitted/batch/model-based RL: A (sketchy, biased) overview(?)
Csaba Szepesvári, University of Alberta

2 Contents
- What, why?
- Constraints
- How?
- Model-based learning
  - Model learning
  - Planning
- Model-free learning
  - Averagers
  - Fitted RL

3 Motto
“Nothing is more practical than a good theory.” [Lewin]
“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” [Leonardo da Vinci]

4 What? Why?
- What is batch RL?
  - Input: samples (the algorithm cannot influence the samples)
  - Output: a good policy
- Why?
  - Common problem
  - Sample efficiency -- data is expensive
  - Building block
- Why not?
  - Too much work (for nothing?) → “Don’t worry, be lazy!”
  - Old samples are irrelevant
  - Missed opportunities (evaluate a policy!?)

5 Constraints
- Large (infinite) state/action spaces
- Limits on:
  - Computation
  - Memory use

6 How?
- Model learning + planning
- Model-free
  - Policy search
  - DP
    - Policy iteration
    - Value iteration

7 Model-based learning

8 Model learning

9 Model-based methods
- Model learning: how?
  - Model: what happens if..?
  - Features vs. observations vs. states
  - System identification? → Satinder! Carlos! Eric! …
- Planning: how?
  - Sample + learning! (Batch RL? .. but you can influence the samples)
  - What else? (Discretize? Nay..)
- Pro: a model is good for multiple things
- Contra: the problem is doubled: we need high-fidelity models and good planning
- Problem 1: Should planning take the uncertainty of the model into account? (“robustification”)
- Problem 2: How to learn relevant, compact models? For example: how to reject irrelevant features and keep the relevant ones?
- Need: tight integration of planning and learning!
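
Not from the slides: a minimal sketch of the "model learning + planning" pipeline in its simplest instance, assuming a finite MDP and a batch of (s, a, r, s') tuples. It estimates an empirical model from the batch and then runs value iteration on it; the names (batch, n_states, n_actions, gamma) are illustrative.

```python
# Hedged sketch: learn a tabular model from a fixed batch, then plan in it.
import numpy as np

def learn_model(batch, n_states, n_actions):
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in batch:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2)
    # Unvisited (s, a) pairs fall back to a uniform model and zero reward.
    P = np.where(n_sa[..., None] > 0,
                 counts / np.maximum(n_sa, 1)[..., None],
                 1.0 / n_states)
    R = rew_sum / np.maximum(n_sa, 1)
    return P, R

def plan(P, R, gamma=0.95, n_iter=1000):
    V = np.zeros(P.shape[0])
    for _ in range(n_iter):
        Q = R + gamma * P @ V        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V       # greedy policy and its value estimate
```

The sketch side-steps both problems raised on the slide: it neither quantifies the uncertainty of the learned model nor selects relevant features.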

10 Planning

11 Bad news..
- Theorem (Chow & Tsitsiklis ’89): for Markovian decision problems with a d-dimensional state space and bounded, Lipschitz-continuous transition probabilities and rewards, any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r.
- What’s next then??
- Open: policy approximation?

12 The joy of laziness
- Don’t worry, be lazy: “If something is too hard to do, then it’s not worth doing.”
- Luckiness factor: “If you really want something in this life, you have to work for it. Now quiet, they’re about to announce the lottery numbers!”

13 Sparse lookahead trees [Kearns et al., ’02]
- Idea: computing a good action = planning → build a (sampled) lookahead tree
- Size of the tree: S = (c|A|)^{H(ε)} (unavoidable), where H(ε) = K_r/(ε(1-γ))
- Good news: S is independent of d!
- Bad news: S is exponential in H(ε)
- Still attractive: generic, easy to implement
- Problem: not really practical
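
A minimal sketch in the spirit of sparse-sampling lookahead (not the authors' code): from the current state, sample a few successors per action from a generative model and back the values up recursively. `simulate(s, a) -> (reward, next_state)` is an assumed generative-model interface; `actions`, the branching factor `c`, the depth and `gamma` are illustrative.

```python
import random  # only needed if `simulate` itself is not already stochastic

def q_estimate(simulate, actions, s, depth, c=8, gamma=0.95):
    """Sampled Q-value estimates for every action at state s."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(c):                       # c sampled successors per action
            r, s_next = simulate(s, a)
            v_next = max(q_estimate(simulate, actions, s_next,
                                    depth - 1, c, gamma).values())
            total += r + gamma * v_next
        q[a] = total / c
    return q

def sparse_lookahead_action(simulate, actions, s, depth=3, c=8, gamma=0.95):
    q = q_estimate(simulate, actions, s, depth, c, gamma)
    return max(q, key=q.get)
```

Each call costs on the order of (c|A|)^depth generative-model queries: independent of the state dimension but exponential in the horizon, as the slide notes.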

14 Idea..
- Be more lazy
- Need to propagate values from good leaves as early as possible
- Why sample suboptimal actions at all?
- Breadth-first → depth-first!
- Bandit algorithms → Upper Confidence Bounds → UCT [KoSze ’06] → Remi
- Similar ideas: [Peret and Garcia, ’04], [Chang et al., ’05]
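
A minimal sketch of the bandit rule at the heart of UCT: at each tree node, pick the action with the highest average return plus an upper-confidence exploration bonus (UCB1). `node` is an assumed structure with per-action visit counts and summed returns; `c_ucb` is illustrative.

```python
import math

def ucb1_select(node, c_ucb=1.4):
    """UCB1 action selection at a search-tree node."""
    total_visits = sum(node.counts.values())
    best_a, best_score = None, float("-inf")
    for a in node.counts:
        if node.counts[a] == 0:
            return a                              # try every action at least once
        mean = node.returns[a] / node.counts[a]   # average return of action a
        bonus = c_ucb * math.sqrt(math.log(total_visits) / node.counts[a])
        if mean + bonus > best_score:
            best_a, best_score = a, mean + bonus
    return best_a
```

In UCT this rule is applied at every node during the rollouts, so simulation effort concentrates on promising branches instead of being spread breadth-first as in the sparse lookahead tree.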

15 Results: Sailing
- ‘Sailing’: stochastic shortest path
- State-space size = 24 * problem size
- Extension to two-player, full-information games
- Good results in Go! (→ Remi, David!)
- Open: why (when) does UCT work so well? Conjecture: when being (very) optimistic does not abuse the search
- Open: how to improve UCT?

16 Random Discretization Method [Rust ’97]
- Method:
  - Random base points
  - Value function computed at these points (weighted importance sampling)
  - Values at other points computed at run-time (“half-lazy method”)
- Why Monte Carlo? Avoid grids!
- Result:
  - State space: [0,1]^d; action space: finite
  - p(y|x,a), r(x,a) Lipschitz continuous, bounded
  - Theorem [Rust ’97], Theorem [Sze ’01]: polynomially many samples are enough to come up with ε-optimal actions (polynomial dependence on H); smoothness of the value function is not required
- Open: can we improve the result by changing the distribution of the samples? Idea: presample + follow the obtained policy
- Open: can we get polynomial dependence on both d and H without representing a value function? (e.g., lookahead trees)
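
A minimal sketch in the spirit of Rust's random-grid value iteration, assuming a transition density `density(x, a, y)` and a reward function `reward(x, a)` are available as callables; the grid size, discount and iteration count are illustrative. Base points are drawn uniformly from [0,1]^d and the Bellman operator is iterated on this random grid with self-normalised density weights.

```python
import numpy as np

def random_discretization_vi(density, reward, n_actions, d,
                             n_points=500, gamma=0.95, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, d))                 # random base points in [0,1]^d
    R = np.array([[reward(x, a) for a in range(n_actions)] for x in X])
    # W[a, i, j] ~ p(x_j | x_i, a), normalised over the random grid.
    W = np.empty((n_actions, n_points, n_points))
    for a in range(n_actions):
        for i, x in enumerate(X):
            w = np.array([density(x, a, y) for y in X])
            W[a, i] = w / w.sum()
    V = np.zeros(n_points)
    for _ in range(n_iter):
        Q = R + gamma * np.stack([W[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)
    return X, V                                    # grid and value estimates on it
```

Values at off-grid points can then be computed at query time with the same weighted backup, which is the "half-lazy" part; the random (rather than regular) grid is the "avoid grids" point on the slide.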

17 Pegasus [Ng & Jordan ’00]
- Idea: policy search + the method of common random numbers (“scenarios”)
- Results (condition: deterministic simulative model):
  - Thm: finite action space, finite-complexity policy class → polynomial sample complexity
  - Thm: infinite action spaces, Lipschitz continuity of transition probabilities + rewards → polynomial sample complexity
  - Thm: finitely computable models + policies → polynomial sample complexity
- Pro: nice results
- Contra: global search? What policy space?
- Problem 1: How to avoid global search?
- Problem 2: When can we find a good policy efficiently? How?
- Problem 3: How to choose the policy class?
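
A minimal sketch of the common-random-numbers idea behind Pegasus: freeze a set of "scenarios" (noise sequences) and evaluate every candidate policy on the same scenarios, so the Monte-Carlo return becomes a deterministic function of the policy parameters. `simulate_step(s, a, noise) -> (reward, next_state)` is an assumed deterministic simulative model; `policy(theta, s)`, `init_state`, the horizon and the scenario count are illustrative.

```python
import numpy as np

def make_pegasus_objective(simulate_step, policy, init_state,
                           horizon=50, n_scenarios=20, noise_dim=4,
                           gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    # One frozen noise sequence per scenario.
    scenarios = rng.standard_normal((n_scenarios, horizon, noise_dim))

    def objective(theta):
        total = 0.0
        for noise_seq in scenarios:
            s, ret, discount = init_state, 0.0, 1.0
            for t in range(horizon):
                a = policy(theta, s)
                r, s = simulate_step(s, a, noise_seq[t])
                ret += discount * r
                discount *= gamma
            total += ret
        return total / n_scenarios   # deterministic in theta: same scenarios every call

    return objective
```

Because the objective is deterministic once the scenarios are frozen, any optimizer can search over theta; the sample-complexity theorems on the slide concern how many such scenarios suffice.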

18 Other planning methods
- Your favorite RL method! Plus: planning is easier than learning -- you can reset the state!
- Dyna-style planning with prioritized sweeping → Rich
- Conservative policy iteration
  - Problem: policy search with guaranteed improvement in every iteration
  - [K&L’00]: bound for finite MDPs, policy class = all policies
  - [K’03]: arbitrary policies, reduction-style result
- Policy search by DP [Bagnell, Kakade, Ng & Schneider ’03]: similar to [K’03], finite-horizon problems
- Fitted value iteration..
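
Not from the slides: a minimal sketch of the conservative policy-iteration update on a finite MDP, assuming the action values Q of the current stochastic policy pi are available. Instead of switching fully to the greedy policy, it mixes the greedy policy into the current one with a small step size, which is what makes per-iteration improvement guarantees possible.

```python
import numpy as np

def conservative_update(pi, Q, alpha):
    """One conservative policy-iteration step.

    pi: (S, A) stochastic policy; Q: (S, A) action values of pi; alpha in (0, 1].
    """
    greedy = np.eye(Q.shape[1])[Q.argmax(axis=1)]   # one-hot greedy policy w.r.t. Q
    return (1.0 - alpha) * pi + alpha * greedy       # mixture policy (rows still sum to 1)
```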

19 Model-free: Policy Search
- ????
- Open: how to do it?? (I am serious)
- Open: how to evaluate a policy / policy gradient given some samples? (Partial result: in the limit, under some conditions, policies can be evaluated [AnSzeMu ’08].)

20 Model-free: Dynamic Programming
- Policy iteration
  - How to evaluate policies?
  - Do good value functions give rise to good policies?
- Value iteration
  - Use action-value functions
  - How to represent value functions? How to do the updates?

21 Value-function based methods
- Questions: what representation to use? How are errors propagated?
- Averagers [Gordon ’95] ~ kernel methods
  - V_{t+1} = Π_F T V_t
  - L_∞ theory → can we have an L_2 (L_p) theory?
- Counterexamples [Boyan & Moore ’95, Baird ’95, BeTsi ’96]
- L_2 error propagation [Munos ’03, ’05]
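
A minimal sketch of an averager-style fitted value iteration on a finite set of base points: the fitted value at each point is a fixed convex combination of the backed-up targets, i.e. V_{t+1} = Π_F T V_t with Π_F a non-expansion. `bellman_backup(V)` is an assumed callable returning one-step lookahead values at the base points (e.g., estimated from the batch); the Gaussian weights and bandwidth are illustrative.

```python
import numpy as np

def averager_weights(X, bandwidth=0.2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)               # rows are convex weights

def averager_value_iteration(X, bellman_backup, n_iter=100):
    W = averager_weights(X)                                # the projection Pi_F
    V = np.zeros(len(X))
    for _ in range(n_iter):
        V = W @ bellman_backup(V)                          # V <- Pi_F T V
    return V
```

Because each row of W is a set of convex weights, composing it with the Bellman operator keeps the iteration a contraction, which is the basis of the averager theory cited on the slide.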

22 Fitted methods
- Idea: use regression/classification inside value/policy iteration
- Notable examples:
  - Fitted Q-iteration
    - Use trees (→ averagers; Damien!)
    - Use neural nets (→ L_2, Martin!)
  - Policy iteration
    - LSTD [Bradtke & Barto ’96, Boyan ’99], BRM [AnSzeMu ’06, ’08]
    - LSPI: use action-value functions + iterate [Lagoudakis & Parr ’01, ’03]
  - RL as classification [La & Pa ’03]
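
A minimal sketch of Fitted Q-Iteration on a batch of (s, a, r, s') samples, here with an extra-trees regressor in the spirit of the tree-based variant (any regressor with fit/predict would do). `S`, `A`, `R`, `S_next` are assumed numpy arrays; the discrete action set, discount and iteration count are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, actions, gamma=0.95, n_iter=50):
    X = np.column_stack([S, A])                  # regressor input: (state, action)
    q = None
    for _ in range(n_iter):
        if q is None:
            targets = R                          # first iteration: one-step reward
        else:
            # max_a' Q(s', a') over the discrete action set
            q_next = np.column_stack([
                q.predict(np.column_stack([S_next, np.full(len(S_next), a)]))
                for a in actions])
            targets = R + gamma * q_next.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q

def greedy_action(q, s, actions):
    vals = [q.predict(np.array([np.append(s, a)]))[0] for a in actions]
    return actions[int(np.argmax(vals))]
```

Swapping the regressor for a neural network trained with squared loss gives the neural-fitted variant mentioned on the slide.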

23 Results for fitted algorithms
- Results for LSPI/BRM-PI, FQI:
  - Finite action space, continuous state space
  - Smoothness conditions on the MDP
  - Representative training set
  - Function class F large (the Bellman error of F is small), but of controlled complexity
  - → Polynomial rates (similar to supervised learning)
- FQI, continuous action spaces:
  - Similar conditions + restricted policy class
  - → Polynomial rates, but bad scaling with the dimension of the action space [AnSzeMu ’06-’08]
- Open: how to choose the function space in an adaptive way? (~ model selection in supervised learning)
- Supervised learning does not work without model selection -- why would RL work? → NO, IT DOES NOT.
- Idea: Regularize! → Problem: how to evaluate policies?
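
A minimal sketch of the "Regularize!" idea applied to policy evaluation: an L2-regularized LSTD step that solves the penalized linear system for the value-function weights, given features of states and next states along data generated by the policy being evaluated. `Phi`, `Phi_next`, `rewards`, `gamma` and the ridge coefficient `lam` are illustrative names.

```python
import numpy as np

def regularized_lstd(Phi, Phi_next, rewards, gamma=0.95, lam=1e-2):
    """Return weights w with V(s) ~ phi(s) @ w (ridge-penalized LSTD)."""
    n, k = Phi.shape
    A = Phi.T @ (Phi - gamma * Phi_next) / n + lam * np.eye(k)   # regularized LSTD matrix
    b = Phi.T @ rewards / n
    return np.linalg.solve(A, b)
```

The ridge term lam * I is regularization in its simplest form; choosing lam adaptively is exactly the model-selection question the slide raises.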

24 Regularization

25 Final thoughts
- Batch RL: a flourishing area
- Many open questions -- more should come soon!
- Some good results in practice
- Take computation cost seriously?
- Connect to online RL?

26 Batch RL
“Let’s switch to that policy -- after all, the paper says that learning converges at an optimal rate!”