A Unifying Framework for Computational Reinforcement Learning Theory Lihong Li Rutgers Laboratory for Real-Life Reinforcement Learning (RL 3 ) Department.

A Unifying Framework for Computational Reinforcement Learning Theory
Lihong Li
Rutgers Laboratory for Real-Life Reinforcement Learning (RL³), Department of Computer Science, Rutgers University
PhD Defense Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy
Joint work with Michael Littman, Alex Strehl, Tom Walsh, …

04/17/2009 Lihong Li 2
$ponsored $earch
Are these better alternatives? Need to EXPLORE!

Thesis
The KWIK (Knows What It Knows) learning model provides a flexible, modular, and unifying way to create and analyze RL algorithms with provably efficient exploration.

Outline
Reinforcement Learning (RL)
The KWIK Framework
Provably Efficient RL
–Model-based Approaches
–Model-free Approaches
Conclusions

Reinforcement Learning Example: AT&T Dialer [Li & Williams & Balakrishnan 09]
Task: a user wants to call someone at AT&T
States: 100K-dimensional dialog state, features, etc. (from speech recognition, NLP, belief tracking, etc.)
Actions: responses to the user (via language generation, text-to-speech, etc.)
–Example: the user says “May I speak to John Smith?”; the action Confirm(“John Smith”) is spoken as “So you want to call John Smith, is that right?”
Reward: -1 per response, +20 if the call succeeds, -20 if it fails
Dialog design objective (optimized by RL): succeed in the conversation with the fewest responses

RL Summary
European Workshop on Reinforcement Learning 2008
Define reward and let the agent chase it!

Markov Decision Process
Environment is often modeled as an MDP
[Figure: trajectory over time through states s_1, s_2, …, s_t, s_{t+1}]
–Set of states S
–Set of actions A
–Transition probabilities T
–Reward function R
–Discount factor γ in (0,1)
Regularity Assumption:

Policies and Value Functions
The slide defines, in order: policy, value function, optimal value function, optimal policy, and what it means to solve an MDP (the formulas are not preserved in this transcript).

Solving an MDP
Planning (when T and R are known)
–Dynamic programming, linear programming, …
–Relatively easy to analyze
Learning (when T or R is unknown)
–Q-learning [Watkins 89], …
–Fundamentally harder
–Exploration/exploitation dilemma
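The planning case can be illustrated with value iteration, one standard dynamic-programming method. This is a minimal sketch, not taken from the talk: the array layout (T of shape states × actions × states, R of shape states × actions), the function name `value_iteration`, and the two-state example are all assumptions made for illustration.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Compute Q* for a known finite MDP.

    T: array (S, A, S) of transition probabilities (assumed layout).
    R: array (S, A) of expected rewards (assumed layout).
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                 # greedy state values
        Q_new = R + gamma * (T @ V)       # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Hypothetical two-state MDP: state 1 is absorbing and pays reward 1 per step.
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]   # state 0, action 0: stay in state 0
T[0, 1] = [0.0, 1.0]   # state 0, action 1: move to state 1
T[1, 0] = [0.0, 1.0]
T[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

Q = value_iteration(T, R, gamma=0.9)
policy = Q.argmax(axis=1)   # greedy policy with respect to Q*
```

In this example the greedy policy moves from state 0 to state 1. With T and R given, nothing has to be explored, which is why planning is the easier case.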

Exploration/Exploitation Dilemma
“Exploitation” (reward maximization): take optimal actions, which needs estimates of T and R
“Exploration” (knowledge acquisition): try suboptimal actions
Also known as “dual control”
Similar to
–active learning (like selective sampling)
–bandit problems (ad ranking)
But different/harder
–Many heuristics may fail

Combination Lock
[Figure: chain of states 1, 2, 3, …, 98, 99, 100 with reward labels 0, 0.001, and 1000; plot of total rewards over time for poor (insufficient) exploration, active (efficient) exploration, and the optimal policy]

PAC-MDP RL
RL algorithm viewed as a non-stationary policy
Sample complexity [Kakade 03] (given ε): the number of steps at which the algorithm's policy is not ε-optimal
A is PAC-MDP (Probably Approximately Correct in MDP) [Strehl, Li, Wiewiora, Langford & Littman 06] if:
–With prob. at least 1-δ
–The sample complexity is polynomial in |M|, 1/ε, 1/δ, and 1/(1-γ)
In words: we want the algorithm to act near-optimally except in a small number of steps
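The slide's formulas did not survive transcription. The block below restates the standard definitions from the cited papers in LaTeX. The symbols $\mathcal{A}_t$ (the algorithm's policy at step $t$) and $\zeta$ (the sample complexity) are notation chosen here and may differ from the slide's.

```latex
% Sample complexity of exploration [Kakade 03]: the number of steps
% at which the current policy is more than \epsilon below optimal.
\zeta(\epsilon) \;=\; \Bigl|\bigl\{\, t \;:\; V^{\mathcal{A}_t}(s_t) < V^{*}(s_t) - \epsilon \,\bigr\}\Bigr|

% PAC-MDP: for all \epsilon > 0 and \delta \in (0,1),
\Pr\!\Bigl[\, \zeta(\epsilon) \;\le\; \mathrm{poly}\bigl(|M|,\ 1/\epsilon,\ 1/\delta,\ 1/(1-\gamma)\bigr) \Bigr] \;\ge\; 1 - \delta
```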

Why PAC-MDP?
Sample complexity
–number of steps where learning/exploration happens
–related to “learning speed” or “exploration efficiency”
Roles of parameters
–ε: allow small sub-optimality
–δ: allow failure due to unlucky data
–|M|: measures problem complexity
–1/(1-γ): larger γ makes problem harder
Generality
–No assumption on ergodicity
–No assumption on mixing
–No need for reset or generative model

Rmax [Brafman & Tennenholtz 02]
Rmax is for finite-state, finite-action MDPs
Learns T and R by counting/averaging
In s_t, takes the optimal action in the learned, optimistic model
[Figure: S×A partitioned into known state-actions and unknown state-actions]
“Optimism in the face of uncertainty”:
–Either: explore “unknown” region
–Or: exploit “known” region
Thm: Rmax is PAC-MDP [Kakade 03]
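The counting/averaging and optimism steps can be sketched as follows. This is an illustrative simplification rather than the published algorithm's exact form: the class name `RmaxAgent`, the known-ness threshold `m`, the fixed number of planning sweeps, and the five-state chain environment `chain_step` are all assumptions.

```python
import numpy as np

class RmaxAgent:
    """Rmax-style agent: a state-action is 'known' after m visits (m is assumed)."""

    def __init__(self, n_states, n_actions, gamma, r_max, m):
        self.S, self.A = n_states, n_actions
        self.gamma, self.r_max, self.m = gamma, r_max, m
        self.counts = np.zeros((n_states, n_actions))
        self.next_counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sums = np.zeros((n_states, n_actions))
        # Optimistic start: every state-action looks maximally rewarding.
        self.Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))

    def plan(self, sweeps=500):
        known = self.counts >= self.m
        n = np.maximum(self.counts, 1)            # avoid division by zero
        T_hat = self.next_counts / n[:, :, None]  # empirical transitions
        R_hat = self.reward_sums / n              # empirical rewards
        v_max = self.r_max / (1.0 - self.gamma)
        Q = np.full((self.S, self.A), v_max)
        for _ in range(sweeps):
            V = Q.max(axis=1)
            backup = R_hat + self.gamma * (T_hat @ V)
            # Unknown state-actions keep the optimistic value v_max.
            Q = np.where(known, backup, v_max)
        self.Q = Q

    def act(self, s):
        return int(self.Q[s].argmax())

    def update(self, s, a, r, s_next):
        if self.counts[s, a] >= self.m:
            return                                # already known: stop updating
        self.counts[s, a] += 1
        self.next_counts[s, a, s_next] += 1
        self.reward_sums[s, a] += r
        if self.counts[s, a] == self.m:
            self.plan()                           # replan when something becomes known

def chain_step(s, a, n=5):
    """Hypothetical chain: action 1 moves right, action 0 resets to state 0."""
    if a == 1:
        return (s, 1.0) if s == n - 1 else (s + 1, 0.0)
    return 0, 0.0

agent = RmaxAgent(n_states=5, n_actions=2, gamma=0.9, r_max=1.0, m=1)
s, total = 0, 0.0
for _ in range(200):
    a = agent.act(s)
    s_next, r = chain_step(s, a)
    agent.update(s, a, r, s_next)
    s, total = s_next, total + r
```

Because the chain is deterministic, m = 1 is enough here; stochastic MDPs need a larger threshold. The agent is drawn toward unknown state-actions by their optimistic value, then settles on the rewarding loop at the end of the chain.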

KWIK: Knows What It Knows [Li & Littman & Walsh 08]
A self-aware, supervised-learning model
KWIK Notation
–Input set: X
–Output set: Y
–Observation set: Z
–Hypothesis class: H ⊆ (X → Y)
–Target function: h* ∈ H (“realizable assumption”)
–Special symbol: ? (“I don’t know”)

KWIK Definition
Given: ε, δ, H
Env: Pick h* ∈ H secretly & adversarially
Env: Pick x adversarially
Learner: either predicts ŷ (“I know”) or outputs ? (“I don’t know”)
–After ?: observe y = h*(x) [deterministic] or a measurement z [stochastic, where E[z] = h*(x)]
Learning succeeds if
–W/prob. 1-δ, all predictions are correct: |ŷ - h*(x)| ≤ ε
–Total # of ? is small: at most poly(1/ε, 1/δ, dim(H))

Related Frameworks
PAC: Probably Approximately Correct [Valiant 84]
MB: Mistake Bound [Littlestone 87]
KWIK: Knows What It Knows [Li & Littman & Walsh 08]
PAC vs. MB: separated if one-way functions exist [Blum 94]
MB vs. KWIK: KWIK may be exponentially harder [Li & Littman & Walsh 08]

Deterministic / Finite Case (X or H is finite)
Thought Experiment: You own a bar frequented by n patrons…
–One is an instigator. When he shows up, there is a fight, unless
–Another patron, the peacemaker, is also there.
–We want to predict, for a subset of patrons, {fight or no-fight}
Alg. 1: Memorization
–Memorize outcome for each subgroup of patrons
–Predict ? if unseen before
–# ? ≤ |X|
–Bar-fight: # ? ≤ 2^n
Alg. 2: Enumeration
–Enumerate all consistent (instigator, peacemaker) pairs
–Say ? when they disagree
–# ? ≤ |H| - 1
–Bar-fight: # ? ≤ n(n-1)
Can make accurate predictions before complete identification of h*
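The enumeration algorithm on the bar-fight problem can be sketched in a few lines. The class and helper names (`EnumerationKWIK`, `make_bar_hypothesis`), the use of `None` for “?”, and the choice of n = 4 with a fixed input order are assumptions made for illustration.

```python
from itertools import combinations, permutations

class EnumerationKWIK:
    """Keep every hypothesis consistent with the data; answer None ('?') on disagreement."""

    def __init__(self, hypotheses):
        self.version_space = list(hypotheses)

    def predict(self, x):
        outputs = {h(x) for h in self.version_space}
        return outputs.pop() if len(outputs) == 1 else None

    def observe(self, x, y):
        # Discard hypotheses that contradict the observed label.
        self.version_space = [h for h in self.version_space if h(x) == y]

def make_bar_hypothesis(instigator, peacemaker):
    """Fight happens iff the instigator is present and the peacemaker is not."""
    def h(patrons):
        return instigator in patrons and peacemaker not in patrons
    return h

n = 4
hypotheses = [make_bar_hypothesis(i, p) for i, p in permutations(range(n), 2)]
target = make_bar_hypothesis(2, 0)          # hidden (instigator, peacemaker) pair
learner = EnumerationKWIK(hypotheses)

# Present every subset of patrons once (any order would do).
inputs = [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]
unknowns, mistakes = 0, 0
for x in inputs:
    y_hat = learner.predict(x)
    if y_hat is None:
        unknowns += 1
        learner.observe(x, target(x))       # the label is revealed only after '?'
    elif y_hat != target(x):
        mistakes += 1
```

Every “?” eliminates at least one hypothesis while the target always survives, which gives the |H| - 1 = n(n-1) - 1 bound. Any committed prediction is unanimous across the surviving hypotheses and therefore correct.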

Stochastic / Finite Case: Dice-Learning
Problem
–Learn a multinomial distribution over N outcomes
–Same input at all times
–Observe outcomes, not actual probabilities
Algorithm
–Predict ? until enough outcomes have been observed
–Use empirical estimate afterwards
–Correctness follows from Chernoff’s bound
Building block for many other stochastic cases
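A dice-learner can be sketched as below. The slide's own threshold is not preserved, so the count `m` here is an assumption: it comes from Hoeffding's inequality with a union bound over the N outcomes, which gives per-outcome accuracy ε with probability at least 1-δ. The names `DiceLearner` and `hoeffding_threshold` are hypothetical.

```python
import math
import random

def hoeffding_threshold(n_outcomes, eps, delta):
    """Assumed threshold: 2*N*exp(-2*m*eps^2) <= delta, solved for m."""
    return math.ceil(math.log(2 * n_outcomes / delta) / (2 * eps ** 2))

class DiceLearner:
    """Says None ('?') for the first m observations, then predicts frequencies."""

    def __init__(self, n_outcomes, m):
        self.n, self.m = n_outcomes, m
        self.counts = [0] * n_outcomes
        self.total = 0

    def predict(self):
        if self.total < self.m:
            return None
        return [c / self.total for c in self.counts]

    def observe(self, outcome):
        self.counts[outcome] += 1
        self.total += 1

n_outcomes, eps, delta = 6, 0.05, 0.05
m = hoeffding_threshold(n_outcomes, eps, delta)
learner = DiceLearner(n_outcomes, m)
rng = random.Random(0)

unknowns, prediction = 0, None
for _ in range(m + 10):
    prediction = learner.predict()
    if prediction is None:
        unknowns += 1
        learner.observe(rng.randrange(n_outcomes))  # outcome seen only after '?'
```

The learner says “?” exactly m times and then commits to the empirical distribution. The total number of “?” is fixed in advance and does not depend on the data.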

More Examples
Distance to an unknown point in ℝ^n [Li & Littman & Walsh 08]
Linear functions with white noise [Strehl & Littman 08] [Walsh & Szita & Diuk & Littman 09]
Gaussian distributions [Brunskill & Leffler & Li & Littman & Roy 08]

Model-based RL
–First learn T and R
–Then use the learned T and R to compute a policy
Simulation lemma [Kearns & Singh 02]
Building a model often makes more efficient use of training data in practice

KWIK-Rmax [Li et al. 09]
Generalizes Rmax to general MDPs
KWIK-learns T and R simultaneously
In s_t, takes the optimal action in the learned, optimistic model
“Optimism in the face of uncertainty”:
–Either: explore “unknown” region
–Or: exploit “known” region
For comparison, Rmax [Brafman & Tennenholtz 02]
–Is for finite-state, finite-action MDPs
–Learns T and R by counting/averaging
[Figure: S×A partitioned into known state-actions and unknown state-actions]

KWIK-Rmax Analysis
Explore-or-Exploit Lemma [Li et al. 09]
–KWIK-Rmax either follows an ε-optimal policy, or
–explores an unknown state, allowing the KWIK-learners to learn T and R!
Theorem [Li et al. 09]: KWIK-Rmax is PAC-MDP, with sample complexity determined by the KWIK bounds for learning T and R

KWIK-Learning Finite MDPs by Input-Partition
T(·|s,a) is a multinomial distribution
–There are |S||A| of them
–Each indexed by (s,a)
Input-Partition: each input from the environment, e.g. x = (s_1,a_2), is routed to its own dice-learning sub-learner, one each for T(·|s_1,a_1), T(·|s_1,a_2), …, T(·|s_n,a_m)
[Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
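Input-partition can be sketched as a dictionary of independent sub-learners keyed by the input. The names `InputPartition` and `CountingLearner` and the threshold m = 3 are assumptions for illustration. The sub-learner is a stripped-down stand-in for dice-learning with no accuracy guarantee attached.

```python
class CountingLearner:
    """Minimal dice-style sub-learner: None ('?') until m outcomes are seen (m is assumed)."""

    def __init__(self, m):
        self.m = m
        self.counts = {}
        self.total = 0

    def predict(self):
        if self.total < self.m:
            return None
        return {o: c / self.total for o, c in self.counts.items()}

    def observe(self, outcome):
        self.counts[outcome] = self.counts.get(outcome, 0) + 1
        self.total += 1

class InputPartition:
    """Route each input x to its own sub-learner, created on first use."""

    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.learners = {}

    def _learner_for(self, x):
        if x not in self.learners:
            self.learners[x] = self.make_learner()
        return self.learners[x]

    def predict(self, x):
        return self._learner_for(x).predict()

    def observe(self, x, outcome):
        self._learner_for(x).observe(outcome)

# Transition model of a finite MDP: one sub-learner per (state, action) pair.
model = InputPartition(lambda: CountingLearner(m=3))
for _ in range(3):
    model.observe((0, 0), 1)       # (s=0, a=0) led to next state 1 three times

known = model.predict((0, 0))      # estimate of T(.|s=0, a=0)
unknown = model.predict((0, 1))    # never visited, so still '?'
```

Since the sub-learners never share data, the total number of “?” is at most the sum of the individual bounds, which is |S||A| times the dice-learning bound for a finite MDP.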

Factored-State MDPs
DBN representation [Dean & Kanazawa 89]
[Figure: network topologies from [Guestrin & Koller & Parr & Venkataraman 03]: Bidirectional Ring, Star, Ring and Star, 3 Legs, Ring of Rings]

Factored-State MDPs
DBN representation [Dean & Kanazawa 89]
–Assuming #parents is bounded by a constant D
Challenges:
–How to estimate T_i(s_i’ | parents(s_i’), a)?
–How to discover parents of each s_i’?
–How to combine learners L(s_i’) and L(s_j’)?

KWIK-Learning DBNs with Unknown Structure
–Dice-Learning: entries in CPT
–Input-Partition: CPTs for T(s_i’ | parent(s_i’), a)
–Noisy-Union: discovery of parents of s_i’
–Cross-Product: learning a DBN
[Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]; first solved by [Strehl & Diuk & Littman 07]
From [Kearns & Koller 99]: “This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial.”

Experiment: “System Administrator”
Ring network, 8 machines, 9 actions
Algorithms compared:
–Met-Rmax [Diuk & Li & Leffler 09]
–SLF-Rmax [Strehl & Diuk & Littman 07]
–Factored Rmax [Guestrin & Patrascu & Schuurmans 02]

MDPs with Gaussian Dynamics
Examples: robot navigation, transportation planning
State offset follows a multivariate normal distribution
CORL [Brunskill & Leffler & Li & Littman & Roy 08]
RAM-Rmax [Leffler & Littman & Edmunds 07] (video by Leffler)

Model-free RL
Estimate Q* directly
–Implying a greedy policy
–No need to estimate T or R
Benefits
–Tractable computational complexity
–Tractable space complexity
Drawbacks
–Seems to make inefficient use of data
–Are there PAC-MDP model-free algorithms?
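The deck cites Q-learning [Watkins 89] as its model-free reference point, so one tabular update step is sketched below. This is the standard Q-learning rule, not Delayed Q-learning. The function name `q_learning_update` and the numbers in the example are assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step; updates Q in place and returns it."""
    target = r + gamma * Q[s_next].max()     # bootstrapped one-step return
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
    return Q

# Two hypothetical transitions in a 2-state, 2-action problem.
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
q_learning_update(Q, s=1, a=0, r=0.0, s_next=0, alpha=0.5, gamma=0.9)
```

Each step touches one table entry and stores no model of T or R, which is the source of the computation and space benefits listed above.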

PAC-MDP Model-free RL
[Figure: optimistic Q-functions; small E(s,a) ⇒ near-optimal (exploit), otherwise explore; labeled “Can be KWIK-learned”]

Delayed Q-learning
Delayed Q-learning (for finite MDPs): first known PAC-MDP model-free algorithm [Strehl & Li & Wiewiora & Langford & Littman 06]
Similar to Q-learning [Watkins 89]
–Minimal computational complexity
–Minimal space complexity

Comparison

Improved Lower Bound for Finite MDPs
Lower bound for N=1 [Mannor & Tsitsiklis 04]:
Theorem: a new lower bound
Delayed Q-learning’s upper bound:

KWIK with Linear Function Approximation
Linear FA: the value function is represented as a weighted sum of features
LSPI-Rmax [Li & Littman & Mansley 09]
–LSPI [Lagoudakis & Parr 03] with online exploration
–(s,a) is unknown if under-represented in training set
–Includes Rmax as a special case
REKWIRE [Li & Littman 08]
–For finite-horizon MDPs
–Learns Q in a bottom-up manner

Open Problems
Agnostic learning [Kearns & Schapire & Sellie 94] in KWIK
–Hypothesis class H may not include h*
–“Unrealizable” KWIK [Li & Littman 08]
Prior information in RL
–Bayesian prior [Asmuth & Li & Littman & Nouri & Wingate 09]
–Heuristic/shaping [Asmuth & Littman & Zinkov 08] [Strehl & Li & Littman 09]
Approximate RL with KWIK
–Least-squares policy iteration [Li & Littman & Mansley 09]
–Fitted value iteration [Brunskill & Leffler & Li & Littman & Roy 08]
–Linear function approximation [Li & Littman 08]

Conclusions: A Unification
KWIK [Li & Littman & Walsh 08]
Model-based
–Finite MDP [Kearns & Singh 02] [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
–Linear MDP [Strehl & Littman 08]
–RAM-MDP [Leffler & Littman & Edmunds 07]
–Gaussian-Offset MDP [Brunskill & Leffler & Li & Littman & Roy 08]
–Factored MDP [Kearns & Koller 99] [Strehl & Diuk & Littman 07] [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]
–Delayed-Observation MDP [Walsh & Nouri & Li & Littman 07]
Model-free
–Finite MDP [Strehl & Li & Wiewiora & Langford & Littman 06]
–KWIK-based VFA [Li & Littman 08] [Li & Littman & Mansley 09]
–Matching Lower Bound
The KWIK (Knows What It Knows) learning model provides a flexible, modular, and unifying way to create and analyze RL algorithms with provably efficient exploration.

References (grouped on the slide as KWIK, MBRL, MFRL)
1. Li, Littman, & Walsh: “Knows what it knows: A framework for self-aware learning”. In ICML 2008.
2. Diuk, Li, & Leffler: “The adaptive k-meteorologist problem and its applications to structure discovery and feature selection in reinforcement learning”. In ICML 2009.
3. Brunskill, Leffler, Li, Littman, & Roy: “CORL: A continuous-state offset-dynamics reinforcement learner”. In UAI 2008.
4. Walsh, Nouri, Li, & Littman: “Planning and learning in environments with delayed feedback”. In ECML 2007.
5. Strehl, Li, & Littman: “Incremental model-based learners with formal learning-time guarantees”. In UAI 2006.
6. Li, Littman, & Mansley: “Online exploration in least-squares policy iteration”. In AAMAS 2009.
7. Li & Littman: “Efficient value-function approximation via online linear regression”. In AI&Math 2008.
8. Strehl, Li, Wiewiora, Langford, & Littman: “PAC model-free reinforcement learning”. In ICML 2006.
