
1 Fitted/batch/model-based RL: A (sketchy, biased) overview(?)
Csaba Szepesvári, University of Alberta

2 Contents
- What, why?
- Constraints
- How?
- Model-based learning
  - Model learning
  - Planning
- Model-free learning
  - Averagers
  - Fitted RL

3 Motto
“Nothing is more practical than a good theory.” [Lewin]
“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” [Leonardo da Vinci]

4 What? Why?
- What is batch RL?
  - Input: samples (the algorithm cannot influence the samples)
  - Output: a good policy
- Why?
  - Common problem
  - Sample efficiency: data is expensive
  - Building block
- Why not?
  - Too much work (for nothing?) ⇒ “Don’t worry, be lazy!”
  - Old samples are irrelevant
  - Missed opportunities (evaluate a policy!?)

5 Constraints
- Large (infinite) state/action space
- Limits on
  - Computation
  - Memory use

6 How?
- Model learning + planning
- Model-free
  - Policy search
  - DP
    - Policy iteration
    - Value iteration

7 Model-based learning

8 Model learning

9 Model-based methods
- Model learning: How?
  - Model: What happens if..?
  - Features vs. observations vs. states
  - System identification? (→ Satinder! Carlos! Eric! …)
- Planning: How?
  - Sample + learning! (batch RL? ..but you can influence the samples)
  - What else? (Discretize? Nay..)
- Pro: a model is good for multiple things
- Contra: the problem is doubled: we need high-fidelity models and good planning
- Problem 1: Should planning take into account the uncertainties in the model? (“robustification”)
- Problem 2: How to learn relevant, compact models? For example: how to reject irrelevant features and keep the relevant ones?
- Need: tight integration of planning and learning! (A minimal learn-then-plan sketch follows below.)
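To make the learn-a-model-then-plan loop concrete, here is a minimal sketch (not from the slides) assuming a finite, discretized state and action space: a counts-based maximum-likelihood model estimated from the batch, followed by value iteration in the learned model. The function names and the uniform fallback for unvisited (s, a) pairs are my own choices.

```python
import numpy as np

def learn_model(transitions, n_states, n_actions):
    """Counts-based maximum-likelihood model from a batch of (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1.0
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)                # visits of each (s, a)
    P = np.where(n_sa > 0, counts / np.maximum(n_sa, 1.0), 1.0 / n_states)
    R = reward_sum / np.maximum(n_sa[:, :, 0], 1.0)
    return P, R                                             # P: (S, A, S), R: (S, A)

def plan(P, R, gamma=0.95, n_iters=1000):
    """Value iteration in the learned model; returns the greedy policy."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = R + gamma * P @ V                               # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                                 # greedy w.r.t. the learned model
```

Problem 1 from the slide (robustification) would enter exactly at the `plan` step, e.g. by planning against confidence sets around `P` instead of the point estimate.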

10 10 Planning

11 Bad news..
- Theorem (Chow, Tsitsiklis ’89): For Markovian decision problems with a d-dimensional state space and bounded, Lipschitz-continuous transition probabilities and rewards, any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r.
- What’s next then??
- Open: policy approximation?

12 The joy of laziness
- Don’t worry, be lazy: “If something is too hard to do, then it’s not worth doing.”
- Luckiness factor: “If you really want something in this life, you have to work for it. Now quiet, they’re about to announce the lottery numbers!”

13 Sparse lookahead trees [Kearns et al., ’02]
- Idea: computing a good action ≡ planning ⇒ build a lookahead tree (sketched below)
- Size of the tree: S = c·|A|^{H(ε)} (unavoidable), where H(ε) = K_r / (ε(1−γ))
- Good news: S is independent of d!
- Bad news: S is exponential in H(ε)
- Still attractive: generic, easy to implement
- Problem: not really practical
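As an illustration only, here is a minimal recursive sketch of a sparse-sampling planner in the spirit of [Kearns et al., ’02], assuming access to a generative model `simulate(s, a) -> (reward, next_state)`. The per-action sample count `C` and the lookahead `depth` stand in for the constants in S = c·|A|^{H(ε)}; the names are mine.

```python
def sparse_value(simulate, actions, state, depth, C, gamma=0.95):
    """Estimate V*(state) from a depth-limited sparse lookahead tree.

    `simulate(s, a)` is an assumed generative model returning (reward, next_state).
    The tree has roughly (C * |A|)**depth nodes, independent of state-space size."""
    if depth == 0:
        return 0.0
    return max(sparse_q(simulate, actions, state, a, depth, C, gamma) for a in actions)

def sparse_q(simulate, actions, state, a, depth, C, gamma=0.95):
    """Average of C sampled one-step returns plus the recursive value estimate."""
    total = 0.0
    for _ in range(C):
        r, s_next = simulate(state, a)
        total += r + gamma * sparse_value(simulate, actions, s_next, depth - 1, C, gamma)
    return total / C

def plan_action(simulate, actions, state, depth=3, C=4, gamma=0.95):
    """Root decision: computing a good action is the whole planning problem here."""
    return max(actions, key=lambda a: sparse_q(simulate, actions, state, a, depth, C, gamma))
```

The slide's point is visible in the code: the work depends only on C, |A| and the depth, never on the size of the state space, but it blows up exponentially with the depth.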

14 Idea..
- Be more lazy
- Need to propagate values from good leaves as early as possible
- Why sample suboptimal actions at all?
- Breadth-first ⇒ depth-first!
- Bandit algorithms ⇒ Upper Confidence Bounds ⇒ UCT [KoSze ’06] (→ Remi); sketched below
- Similar ideas: [Peret and Garcia, ’04], [Chang et al., ’05]
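For concreteness, a minimal single-player UCT-style sketch (my own simplification, not the algorithm of [KoSze ’06] verbatim): UCB1 selects actions inside the tree, a uniformly random rollout evaluates new leaves, and returns are backed up as running means. The exploration constant `c` and the rollout depth are assumptions; states must be hashable.

```python
import math
import random
from collections import defaultdict

class UCT:
    """Minimal single-player UCT sketch: UCB1 inside the tree, random rollouts at the frontier."""

    def __init__(self, simulate, actions, gamma=0.95, c=1.4, rollout_depth=20):
        self.simulate = simulate          # generative model: (s, a) -> (reward, next_state)
        self.actions = actions
        self.gamma, self.c, self.rollout_depth = gamma, c, rollout_depth
        self.N = defaultdict(int)         # visit count per (state, action)
        self.Ns = defaultdict(int)        # visit count per state
        self.Q = defaultdict(float)       # running mean return per (state, action)

    def search(self, root, n_simulations=1000, max_depth=50):
        for _ in range(n_simulations):
            self._simulate_tree(root, max_depth)
        return max(self.actions, key=lambda a: self.Q[(root, a)])

    def _simulate_tree(self, state, depth):
        if depth == 0:
            return 0.0
        if self.Ns[state] == 0:           # frontier node: expand and evaluate by rollout
            self.Ns[state] = 1
            return self._rollout(state)
        a = self._ucb_action(state)       # optimism in the face of uncertainty
        r, s_next = self.simulate(state, a)
        ret = r + self.gamma * self._simulate_tree(s_next, depth - 1)
        self.Ns[state] += 1
        self.N[(state, a)] += 1
        self.Q[(state, a)] += (ret - self.Q[(state, a)]) / self.N[(state, a)]
        return ret

    def _ucb_action(self, state):
        def ucb(a):
            n = self.N[(state, a)]
            if n == 0:
                return float("inf")       # try every action at least once
            return self.Q[(state, a)] + self.c * math.sqrt(math.log(self.Ns[state]) / n)
        return max(self.actions, key=ucb)

    def _rollout(self, state):
        ret, discount = 0.0, 1.0
        for _ in range(self.rollout_depth):
            r, state = self.simulate(state, random.choice(self.actions))
            ret += discount * r
            discount *= self.gamma
        return ret
```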

15 Results: Sailing
- ‘Sailing’: stochastic shortest path
- State-space size = 24 × problem-size
- Extension to two-player, full-information games
- Good results in Go! (→ Remi, David!)
- Open: Why (when) does UCT work so well? Conjecture: when being (very) optimistic does not abuse the search
- Open: How to improve UCT?

16 Random Discretization Method [Rust ’97]
- Method (sketched below):
  - Random base points
  - Value function computed at these points (weighted importance sampling)
  - Compute values at other points at run-time (“half-lazy method”)
- Why Monte Carlo? Avoid grids!
- Setting: state space [0,1]^d; finite action space; p(y|x,a) and r(x,a) Lipschitz continuous and bounded
- Theorem [Rust ’97]
- Theorem [Sze ’01]: polynomially many samples are enough to come up with ε-optimal actions (polynomial dependence on H); smoothness of the value function is not required
- Open: Can we improve the result by changing the distribution of samples? Idea: presample + follow the obtained policy
- Open: Can we get polynomial dependence on both d and H without representing a value function? (e.g., lookahead trees)
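A rough sketch of the idea, under the assumption (as in the slide's setting) that the transition density p(y|x,a) and the reward r(x,a) are known and can be evaluated pointwise; the base-point count and the self-normalized weights are my reading of the "weighted importance sampling" bullet, and all names are mine.

```python
import numpy as np

def random_discretization_vi(p, r, actions, d, n_points=500, gamma=0.95, n_iters=200, seed=0):
    """Sketch of Rust-style value iteration on random base points in [0, 1]^d.

    `p(y, x, a)` and `r(x, a)` are assumed callables for the known transition density
    and reward.  Transition weights are self-normalized over the base points."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, d))                           # random base points
    W, R = {}, {}
    for a in actions:
        raw = np.array([[p(X[j], X[i], a) for j in range(n_points)]
                        for i in range(n_points)])
        W[a] = raw / np.maximum(raw.sum(axis=1, keepdims=True), 1e-12)
        R[a] = np.array([r(X[i], a) for i in range(n_points)])
    V = np.zeros(n_points)
    for _ in range(n_iters):                                # value iteration on the base points
        V = np.max([R[a] + gamma * W[a] @ V for a in actions], axis=0)

    def act(x):
        """Run-time ("half-lazy") step: back out a greedy action at a fresh point x."""
        def q(a):
            w = np.array([p(X[j], x, a) for j in range(n_points)])
            return r(x, a) + gamma * (w @ V) / max(w.sum(), 1e-12)
        return max(actions, key=q)

    return V, act
```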

17 Pegasus [Ng & Jordan ’00]
- Idea: policy search + the method of common random numbers (“scenarios”); a sketch follows below
- Results (condition: deterministic simulative model):
  - Thm: finite action space, finite-complexity policy class ⇒ polynomial sample complexity
  - Thm: infinite action spaces, Lipschitz continuity of transition probabilities + rewards ⇒ polynomial sample complexity
  - Thm: finitely computable models + policies ⇒ polynomial sample complexity
- Pro: nice results
- Contra: global search? what policy space?
- Problem 1: How to avoid global search?
- Problem 2: When can we find a good policy efficiently? How?
- Problem 3: How to choose the policy class?
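The "scenarios" trick can be sketched as follows (my own illustration, with hypothetical `step`/`reset`/`policy` callables): the random numbers are drawn once and frozen, which makes the estimated return a deterministic function of the policy parameters.

```python
import numpy as np

def make_pegasus_objective(step, reset, policy, n_scenarios=20, horizon=50,
                           gamma=0.95, noise_dim=1, seed=0):
    """PEGASUS-style evaluation sketch: freeze the random numbers ('scenarios') once.

    `step(s, a, u)` is an assumed deterministic simulative model driven by noise u,
    `reset()` returns an initial state, and `policy(theta, s)` returns an action."""
    rng = np.random.default_rng(seed)
    scenarios = rng.random((n_scenarios, horizon, noise_dim))   # frozen common random numbers

    def objective(theta):
        total = 0.0
        for noise in scenarios:
            s, discount = reset(), 1.0
            for t in range(horizon):
                a = policy(theta, s)
                r, s = step(s, a, noise[t])
                total += discount * r
                discount *= gamma
        return total / n_scenarios                              # deterministic in theta

    return objective
```

With the scenarios fixed, `objective(theta)` can be handed to any deterministic optimizer for the policy-search part; this is exactly where the slide's "global search?" worry enters.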

18 Other planning methods
- Your favorite RL method! (+ planning is easier than learning: you can reset the state!)
- Dyna-style planning with prioritized sweeping (→ Rich)
- Conservative policy iteration
  - Problem: policy search with guaranteed improvement in every iteration
  - [K&L’00]: bound for finite MDPs, policy class = all policies
  - [K’03]: arbitrary policies, reduction-style result
- Policy search by DP [Bagnell, Kakade, Ng & Schneider ’03]: similar to [K’03], finite-horizon problems
- Fitted value iteration..

19 Model-free: Policy Search
- ????
- Open: How to do it?? (I am serious)
- Open: How to evaluate a policy/policy gradient given some samples? (Partial result: in the limit, under some conditions, policies can be evaluated [AnSzeMu ’08].)

20 Model-free: Dynamic Programming
- Policy iteration
  - How to evaluate policies?
  - Do good value functions give rise to good policies?
- Value iteration
  - Use action-value functions
  - How to represent value functions?
  - How to do the updates?

21 Value-function based methods
- Questions: What representation to use? How are errors propagated?
- Averagers [Gordon ’95] ~ kernel methods: V_{t+1} = Π_F T V_t, with a sup-norm (L∞) theory (spelled out below)
- Can we have an L2 (Lp) theory?
  - Counterexamples [Boyan & Moore ’95, Baird ’95, BeTsi ’96]
  - L2 error propagation [Munos ’03, ’05]
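For completeness, the standard argument behind the sup-norm theory for averagers (a sketch of the usual reasoning, not a quote from the slides): if the fitting operator Π_F is a non-expansion in the sup norm and T is the γ-contraction Bellman operator with fixed point V*, then the composed iteration inherits the contraction and its fixed point Ṽ = Π_F T Ṽ is controlled by the best approximation of V*.

```latex
\begin{align*}
\|\Pi_F T V - \Pi_F T V'\|_\infty
  &\le \|T V - T V'\|_\infty \le \gamma\,\|V - V'\|_\infty, \\
\|\tilde V - V^*\|_\infty
  &\le \|\Pi_F T \tilde V - \Pi_F T V^*\|_\infty + \|\Pi_F V^* - V^*\|_\infty
   \le \gamma\,\|\tilde V - V^*\|_\infty + \|\Pi_F V^* - V^*\|_\infty \\
  &\;\Longrightarrow\;
   \|\tilde V - V^*\|_\infty \le \frac{\|\Pi_F V^* - V^*\|_\infty}{1-\gamma}.
\end{align*}
```

No analogous guarantee holds for general L2 fits, which is where the counterexamples and the L2 error-propagation analysis above come in.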

22 Fitted methods
- Idea: use regression/classification with value/policy iteration
- Notable examples:
  - Fitted Q-iteration (a minimal sketch follows below)
    - Use trees (→ averagers; Damien!)
    - Use neural nets (→ L2; Martin!)
  - Policy iteration
    - LSTD [Bradtke & Barto ’96, Boyan ’99], BRM [AnSzeMu ’06, ’08]
    - LSPI: use action-value functions + iterate [Lagoudakis & Parr ’01, ’03]
    - RL as classification [La & Pa ’03]
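As an illustration of the fitted Q-iteration pattern (a minimal sketch, not the exact algorithm of any of the cited papers): regress Bellman targets onto (state, action) features with any supervised learner, iterate, and read off the greedy policy. The `regressor_factory` interface and the assumption of a small finite set of numeric actions are mine.

```python
import numpy as np

def fitted_q_iteration(batch, regressor_factory, actions, gamma=0.95, n_iters=50):
    """Minimal fitted Q-iteration sketch on a fixed batch of (s, a, r, s') samples.

    `regressor_factory()` must return an object with fit(X, y) / predict(X)
    (trees, neural nets, ...); states are assumed to be 1-D feature vectors and
    `actions` a small finite set of numbers."""
    S = np.array([s for s, a, r, s_next in batch], dtype=float)
    A = np.array([a for s, a, r, s_next in batch], dtype=float).reshape(-1, 1)
    R = np.array([r for s, a, r, s_next in batch], dtype=float)
    S_next = np.array([s_next for s, a, r, s_next in batch], dtype=float)
    X = np.hstack([S, A])                          # regress Q on (state, action) pairs
    Q = None
    for _ in range(n_iters):
        if Q is None:
            y = R                                  # first pass: one-step rewards
        else:                                      # Bellman targets: r + gamma * max_a' Q(s', a')
            q_next = np.column_stack([
                Q.predict(np.hstack([S_next, np.full((len(S_next), 1), a)]))
                for a in actions])
            y = R + gamma * q_next.max(axis=1)
        Q = regressor_factory()                    # refit from scratch each iteration
        Q.fit(X, y)

    def greedy_policy(s):
        return max(actions, key=lambda a: Q.predict(np.hstack([s, [a]]).reshape(1, -1))[0])

    return Q, greedy_policy
```

One possible choice (matching the "use trees" bullet in spirit) would be `regressor_factory=lambda: sklearn.ensemble.ExtraTreesRegressor(n_estimators=50)`; a small neural network fitter would correspond to the "use neural nets" bullet.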

23 Results for fitted algorithms
- Results for LSPI/BRM-PI, FQI:
  - Finite action space, continuous state space
  - Smoothness conditions on the MDP
  - Representative training set
  - Function class F large (Bellman error of F is small), but of controlled complexity
  ⇒ polynomial rates (similar to supervised learning)
- FQI, continuous action spaces: similar conditions + restricted policy class ⇒ polynomial rates, but bad scaling with the dimension of the action space [AnSzeMu ’06-’08]
- Open: How to choose the function space in an adaptive way? (~ model selection in supervised learning)
- Supervised learning does not work without model selection, so why would RL? ⇒ No, it does not. Idea: regularize! (a ridge-regularized sketch follows below) ⇒ Problem: how to evaluate policies?
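One way to read "regularize!" concretely, as a sketch only: add a ridge penalty to the linear system that LSTD-Q solves, with a penalty weight that a model-selection procedure would have to tune. The feature-map interface and the placement of the penalty are my assumptions, not a statement of the authors' method.

```python
import numpy as np

def regularized_lstd(batch, phi, policy, gamma=0.95, lam=1e-2):
    """Ridge-regularized LSTD-Q sketch: evaluate `policy` from (s, a, r, s') samples.

    `phi(s, a)` is an assumed feature map returning a 1-D numpy array; `lam` is the
    regularization weight that model selection would need to choose."""
    s0, a0, _, _ = batch[0]
    d = phi(s0, a0).shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next in batch:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))          # next feature under the evaluated policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    theta = np.linalg.solve(A + lam * np.eye(d), b)   # ridge term keeps the system well-posed
    return lambda s, a: float(phi(s, a) @ theta)      # estimated Q^pi(s, a)
```

The slide's closing problem applies directly: tuning `lam` requires evaluating the resulting policies, which is itself the open question.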

24 Regularization

25 Final thoughts
- Batch RL: a flourishing area
- Many open questions
- More should come soon!
- Some good results in practice
- Take computation cost seriously?
- Connect to on-line RL?

26 Batch RL
Let’s switch to that policy; after all, the paper says that learning converges at an optimal rate!

