A Unifying Framework for Computational Reinforcement Learning Theory. Lihong Li, Rutgers Laboratory for Real-Life Reinforcement Learning (RL³), Department of Computer Science, Rutgers University. PhD Defense. Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy. Joint work with Michael Littman, Alex Strehl, Tom Walsh, …

Slide 2: $ponsored $earch. Are these better alternatives? Need to EXPLORE!

Slide 3: Thesis. The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.

Slide 4: Outline. Reinforcement Learning (RL); The KWIK Framework; Provably Efficient RL (Model-based Approaches, Model-free Approaches); Conclusions.

Slide 5: Reinforcement Learning Example: the AT&T Dialer [Li & Williams & Balakrishnan 09]. A user wants to call someone at AT&T. States: a roughly 100K-dimensional dialog state (features from speech recognition, NLP, belief tracking, etc.). Actions: responses to the user, produced via language generation and text-to-speech; for example, the user says "May I speak to John Smith?" and the action Confirm("John Smith") produces "So you want to call John Smith, is that right?". Reward: -1 per response, +20 if the call succeeds, -20 if it fails. Dialog design objective: succeed in the conversation with the fewest responses; the dialog policy is optimized by RL.

Slide 6: RL Summary (European Workshop on Reinforcement Learning 2008): define reward and let the agent chase it!

Slide 7: Markov Decision Process. The environment is often modeled as an MDP: states s_1, s_2, ..., s_t, s_{t+1}, ... unfold over time, and the model consists of a set of states, a set of actions, transition probabilities, a reward function, and a discount factor in (0,1), plus a regularity assumption.
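In standard notation (a reconstruction rather than a quote from the slide; the boundedness condition below is the usual regularity assumption, stated here as an assumption):

$$ M = \langle S, A, T, R, \gamma \rangle, \qquad T(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s,\, a_t = a), \qquad \gamma \in (0, 1), $$
$$ \text{regularity (assumed): } R(s, a) \in [0, 1] \ \text{for all } (s, a). $$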

Slide 8: Policies and Value Functions. The slide defines a policy, its value function, the optimal value function, the optimal policy, and what it means to solve an MDP; see the standard definitions below.
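The standard definitions, given as a reconstruction in place of the slide's own formulas:

$$ \pi : S \to A, \qquad V^{\pi}(s) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \,\middle|\, s_0 = s \right], $$
$$ V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad \pi^{*} \in \arg\max_{\pi} V^{\pi}, \qquad \text{solving an MDP means finding } \pi^{*} \text{ (or a near-optimal policy)}. $$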

Slide 9: Solving an MDP. Planning (when T and R are known): dynamic programming, linear programming, …; relatively easy to analyze. Learning (when T or R is unknown): Q-learning [Watkins 89], …; fundamentally harder; exploration/exploitation dilemma.
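As a concrete instance of the learning setting, here is a minimal tabular Q-learning sketch. The environment interface (env.reset, env.step, env.actions) and all hyperparameter values are illustrative assumptions, not part of the thesis.

```python
import random
from collections import defaultdict

def q_learning(env, num_steps=10000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning [Watkins 89]: learn Q(s, a) from experience,
    without ever estimating the transition or reward model."""
    Q = defaultdict(float)                     # Q[(s, a)] defaults to 0
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy action selection: a simple heuristic exploration rule
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda b: Q[(s, b)])
        s_next, r, done = env.step(a)
        # one-step temporal-difference update toward r + gamma * max_b Q(s', b)
        target = r + gamma * max(Q[(s_next, b)] for b in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = env.reset() if done else s_next
    return Q
```

Epsilon-greedy is exactly the kind of exploration heuristic the next slides argue can fail; the thesis replaces it with exploration schemes that come with PAC-MDP guarantees.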

Slide 10: Exploration/Exploitation Dilemma. Similar to active learning (like selective sampling) and bandit problems (ad ranking), but different and harder: many heuristics may fail. Exploitation means taking actions that look optimal under current estimates (reward maximization); exploration means trying possibly suboptimal actions to acquire knowledge; balancing the two is the classic "dual control" problem.

Slide 11: Combination Lock. [Plot: total rewards vs. time in the combination-lock domain, comparing poor (insufficient) exploration, active (efficient) exploration, and the optimal policy.]

Slide 12: PAC-MDP RL. An RL algorithm is viewed as a non-stationary policy. Sample complexity [Kakade 03] (given an accuracy parameter ε): the number of time steps on which the algorithm's policy is more than ε worse than optimal. Algorithm A is PAC-MDP (Probably Approximately Correct in MDPs) [Strehl, Li, Wiewiora, Langford & Littman 06] if, with probability at least 1-δ, its sample complexity is polynomial in the relevant quantities. In words: we want the algorithm to act near-optimally except in a small number of steps.
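In standard notation (a reconstruction following Kakade's definition, not copied from the slide):

$$ \mathrm{SC}(A, \epsilon) \;=\; \big|\{\, t : V^{A_t}(s_t) < V^{*}(s_t) - \epsilon \,\}\big|, $$
$$ A \text{ is PAC-MDP if, with probability at least } 1 - \delta, \quad \mathrm{SC}(A, \epsilon) \le \mathrm{poly}\big(|M|,\, 1/\epsilon,\, 1/\delta,\, 1/(1-\gamma)\big). $$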

Slide 13: Why PAC-MDP? Sample complexity is the number of steps where learning/exploration happens; it is related to "learning speed" or "exploration efficiency". Roles of the parameters: ε allows small sub-optimality; δ allows failure due to unlucky data; |M| measures problem complexity; 1/(1-γ) reflects that a larger γ makes the problem harder. Generality: no assumption on ergodicity, no assumption on mixing, no need for a reset or a generative model.

Slide 14: Rmax [Brafman & Tennenholtz 02]. Rmax is for finite-state, finite-action MDPs; it learns T and R by counting/averaging. The state-action space S×A is partitioned into known and unknown state-actions, and in state s_t Rmax takes the optimal action in an optimistic model in which unknown state-actions are treated as maximally rewarding. "Optimism in the face of uncertainty": either explore the "unknown" region or exploit the "known" region. Theorem: Rmax is PAC-MDP [Kakade 03].
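A minimal sketch of the Rmax bookkeeping for finite MDPs. The "known" threshold m, the class interface, and the optimistic self-loop placeholder are illustrative assumptions; the planning step (solving the optimistic model) is omitted.

```python
from collections import defaultdict

class Rmax:
    """Model-based optimism: unknown (s, a) pairs are assumed to yield the
    maximum reward, so planning in the model drives the agent to explore them."""
    def __init__(self, states, actions, gamma=0.95, m=20, r_max=1.0):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.m, self.r_max = m, r_max              # m = "known" visit threshold
        self.counts = defaultdict(int)             # visits to (s, a)
        self.trans_counts = defaultdict(int)       # visits to (s, a, s')
        self.reward_sum = defaultdict(float)

    def known(self, s, a):
        return self.counts[(s, a)] >= self.m

    def update(self, s, a, r, s_next):
        if not self.known(s, a):                   # stop counting once known
            self.counts[(s, a)] += 1
            self.trans_counts[(s, a, s_next)] += 1
            self.reward_sum[(s, a)] += r

    def model(self, s, a):
        """Empirical (T, R) estimates if (s, a) is known; otherwise an
        optimistic self-loop with reward r_max."""
        n = self.counts[(s, a)]
        if self.known(s, a):
            T = {s2: self.trans_counts[(s, a, s2)] / n for s2 in self.states}
            return T, self.reward_sum[(s, a)] / n
        return {s: 1.0}, self.r_max
```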

Slide 15: Outline. Reinforcement Learning (RL); The KWIK Framework; Provably Efficient RL (Model-based Approaches, Model-free Approaches); Conclusions.

Slide 16: KWIK: Knows What It Knows [Li & Littman & Walsh 08]. A self-aware, supervised-learning model. KWIK notation: input set X; output set Y; observation set Z; hypothesis class H ⊆ (X → Y); target function h* ∈ H (the "realizability assumption"); special symbol ? ("I don't know").

Slide 17: KWIK Definition. Given ε, δ, and H. The environment picks h* ∈ H secretly and adversarially, then picks inputs x adversarially. For each x, the learner either predicts ŷ ("I know") or outputs ? ("I don't know"); after a ?, it observes y = h*(x) [deterministic case] or a measurement z with E[z] = h*(x) [stochastic case]. Learning succeeds if, with probability at least 1-δ, all predictions are accurate (|ŷ - h*(x)| ≤ ε) and the total number of ?'s is small (at most poly(1/ε, 1/δ, dim(H))).
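A minimal sketch of the KWIK protocol as code, using None for the ? symbol; the class and function names are illustrative, not from the thesis, and only the deterministic case is checked.

```python
class KWIKLearner:
    """Protocol: predict(x) returns a value ("I know") or None ("?").
    Feedback is revealed only after a "?" response."""
    def predict(self, x):
        raise NotImplementedError
    def observe(self, x, y):
        raise NotImplementedError

def run_kwik(learner, inputs, target, max_unknowns):
    """Exercise the two success conditions: every prediction is correct,
    and the number of "?" responses stays within the budget."""
    num_unknown = 0
    for x in inputs:
        y_hat = learner.predict(x)
        if y_hat is None:                  # "I don't know"
            num_unknown += 1
            learner.observe(x, target(x))  # the label is seen only after "?"
        else:
            assert y_hat == target(x)      # deterministic case: must be exact
    return num_unknown <= max_unknowns
```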

Slide 18: Related Frameworks. PAC: Probably Approximately Correct [Valiant 84]; MB: Mistake Bound [Littlestone 87]; KWIK: Knows What It Knows [Li & Littman & Walsh 08]. KWIK learnability implies MB learnability, which in turn implies PAC learnability; the converses can fail: PAC-learnable classes need not be MB-learnable if one-way functions exist [Blum 94], and MB-learnable classes may be exponentially harder to KWIK-learn [Li & Littman & Walsh 08].

Slide 19: Deterministic / Finite Case (X or H is finite). Thought experiment: you own a bar frequented by n patrons… One is an instigator: when he shows up, there is a fight, unless another patron, the peacemaker, is also there. We want to predict, for a subset of patrons, {fight or no-fight}. Algorithm 1, Memorization: memorize the outcome for each subgroup of patrons and predict ? for unseen subgroups; # ? ≤ |X|, which for the bar-fight problem is # ? ≤ 2^n. Algorithm 2, Enumeration: enumerate all consistent (instigator, peacemaker) pairs and say ? when they disagree; # ? ≤ |H| - 1, which for the bar-fight problem is # ? ≤ n(n-1). Lesson: accurate predictions are possible before h* is completely identified.
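A sketch of the enumeration learner for the bar-fight example (class and variable names are illustrative). Each ? forces an observation that eliminates at least one surviving hypothesis, which is where the n(n-1) bound on the number of ?'s comes from.

```python
from itertools import permutations

class BarFightEnumeration:
    """KWIK enumeration: keep every (instigator, peacemaker) pair consistent
    with the data; predict only when all surviving hypotheses agree."""
    def __init__(self, patrons):
        # a hypothesis is an ordered pair (instigator, peacemaker)
        self.hypotheses = set(permutations(patrons, 2))

    @staticmethod
    def fight(hypothesis, present):
        instigator, peacemaker = hypothesis
        return instigator in present and peacemaker not in present

    def predict(self, present):
        answers = {self.fight(h, present) for h in self.hypotheses}
        return answers.pop() if len(answers) == 1 else None   # None = "?"

    def observe(self, present, fought):
        self.hypotheses = {h for h in self.hypotheses
                           if self.fight(h, present) == fought}
```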

Slide 20: Stochastic / Finite Case: Dice-Learning. Problem: learn a multinomial distribution over N outcomes (the same input at all times), observing outcomes rather than actual probabilities. Algorithm: predict ? for the first several observations (a number polynomial in N, 1/ε, and 1/δ), then use the empirical estimate afterwards; correctness follows from Chernoff's bound. Dice-learning is a building block for many other stochastic cases.
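A sketch of dice-learning in the same protocol. The threshold m stands in for the Chernoff-based sample size; its value here is a placeholder, not the bound from the thesis.

```python
from collections import Counter

class DiceLearner:
    """KWIK learner for a multinomial over N outcomes: answer "?" until
    enough samples are seen, then report the empirical distribution."""
    def __init__(self, num_outcomes, m=1000):
        self.num_outcomes = num_outcomes
        self.m = m          # placeholder for the Chernoff-based threshold
        self.counts = Counter()
        self.total = 0

    def predict(self):
        if self.total < self.m:
            return None     # "?": not enough data yet
        return {k: self.counts[k] / self.total for k in range(self.num_outcomes)}

    def observe(self, outcome):
        self.counts[outcome] += 1
        self.total += 1
```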

Slide 21: More Examples. Distance to an unknown point in R^n [Li & Littman & Walsh 08]; linear functions with white noise [Strehl & Littman 08] [Walsh & Szita & Diuk & Littman 09]; Gaussian distributions [Brunskill & Leffler & Li & Littman & Roy 08].

Slide 22: Outline. Reinforcement Learning (RL); The KWIK Framework; Provably Efficient RL (Model-based Approaches, Model-free Approaches); Conclusions.

Slide 23: Model-based RL. Model-based RL first learns T and R, then uses the learned model to compute a (near-)optimal policy; the simulation lemma [Kearns & Singh 02] guarantees that an accurate model yields accurate value estimates. Building a model often makes more efficient use of training data in practice.
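One common form of the simulation lemma, reconstructed in standard notation (the constants below are the usual ones, not quoted from the slide): if the learned model satisfies

$$ \big\| \hat T(\cdot \mid s, a) - T(\cdot \mid s, a) \big\|_1 \le \epsilon_T \quad \text{and} \quad \big| \hat R(s, a) - R(s, a) \big| \le \epsilon_R \quad \text{for all } (s, a), $$

then for every policy $\pi$,

$$ \big| \hat V^{\pi}(s) - V^{\pi}(s) \big| \;\le\; \frac{\epsilon_R + \gamma\, \epsilon_T V_{\max}}{1 - \gamma}, $$

so planning in an accurate model evaluates every policy nearly correctly.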

Slide 24: KWIK-Rmax [Li et al. 09]. KWIK-Rmax generalizes Rmax to general MDPs: it KWIK-learns T and R simultaneously and, in state s_t, takes the optimal action in an optimistic model. The same "optimism in the face of uncertainty" principle applies: either explore the "unknown" region or exploit the "known" region of S×A. (Compare Rmax [Brafman & Tennenholtz 02], which is restricted to finite-state, finite-action MDPs and learns T and R by counting/averaging.)

Slide 25: KWIK-Rmax Analysis. Explore-or-Exploit Lemma [Li et al. 09]: at every step, KWIK-Rmax either follows an ε-optimal policy or explores an unknown state, allowing the KWIK learners to learn T and R. Theorem [Li et al. 09]: KWIK-Rmax is PAC-MDP, with a sample complexity determined by the KWIK bounds of its T- and R-learners.

Slide 26: KWIK-Learning Finite MDPs by Input-Partition. Each T(.|s,a) is a multinomial distribution; there are |S||A| of them, one indexed by each pair (s,a). The Input-Partition construction runs one dice-learning instance per pair, T(.|s_1,a_1), T(.|s_1,a_2), …, T(.|s_n,a_m), and routes the environment's input, e.g. x = (s_1,a_2), to the corresponding learner. [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
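A sketch of input-partition layered on the DiceLearner sketch above (names illustrative): one sub-learner per (s, a), with the total number of ?'s bounded by the sum over sub-learners.

```python
class InputPartition:
    """Route each input (s, a) to its own sub-learner."""
    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.learners = {}

    def _get(self, sa):
        if sa not in self.learners:
            self.learners[sa] = self.make_learner()
        return self.learners[sa]

    def predict(self, sa):
        return self._get(sa).predict()

    def observe(self, sa, outcome):
        self._get(sa).observe(outcome)

# usage sketch: one dice-learner per state-action pair
# model = InputPartition(lambda: DiceLearner(num_outcomes=len(states)))
# t_hat = model.predict((s, a))   # None means (s, a) is still "unknown"
```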

Slide 27: Factored-State MDPs. DBN representation [Dean & Kanazawa 89]. [Figure: example network topologies from [Guestrin & Koller & Parr & Venkataraman 03]: Bidirectional Ring, Star, Ring and Star, 3 Legs, Ring of Rings.]

Slide 28: Factored-State MDPs. DBN representation [Dean & Kanazawa 89], assuming the number of parents is bounded by a constant D. Challenges: how to estimate T_i(s_i' | parents(s_i'), a)? How to discover the parents of each s_i'? How to combine the learners L(s_i') and L(s_j')?
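For reference, the DBN factorization assumed here, written in standard notation (a reconstruction, not a quote from the slide):

$$ T(s' \mid s, a) \;=\; \prod_{i=1}^{n} T_i\big( s_i' \mid \mathrm{parents}(s_i'),\, a \big), \qquad \big| \mathrm{parents}(s_i') \big| \le D. $$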

Slide 29: KWIK-Learning DBNs with Unknown Structure. Learning a DBN decomposes into a stack of KWIK learners: dice-learning for the entries in each CPT, input-partition for the CPTs for T(s_i' | parents(s_i'), a), noisy-union for the discovery of the parents of s_i', and cross-product to combine the per-feature learners into a DBN learner [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]. From [Kearns & Koller 99]: "This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial." First solved by [Strehl & Diuk & Littman 07].

Slide 30: Experiment: "System Administrator" (ring network, 8 machines, 9 actions). [Plot: learning curves comparing Met-Rmax [Diuk & Li & Leffler 09], SLF-Rmax [Strehl & Diuk & Littman 07], and Factored Rmax [Guestrin & Patrascu & Schuurmans 02].]

Slide 31: MDPs with Gaussian Dynamics. Examples: robot navigation, transportation planning. The state offset follows a multivariate normal distribution. Algorithms: CORL [Brunskill & Leffler & Li & Littman & Roy 08] and RAM-Rmax [Leffler & Littman & Edmunds 07]. (Video by Leffler.)

Slide 32: Outline. Reinforcement Learning (RL); The KWIK Framework; Provably Efficient RL (Model-based Approaches, Model-free Approaches); Conclusions.

Slide 33: Model-free RL. Estimate the optimal value function directly, which immediately yields a greedy policy; there is no need to estimate T or R. Benefits: tractable computational complexity and tractable space complexity. Drawbacks: it seems to make inefficient use of data. Are there PAC-MDP model-free algorithms?
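The implication in standard notation (a reconstruction in place of the slide's formulas):

$$ Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q^{*}(s', a'), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a), $$

so an accurate estimate of $Q^{*}$ is enough to act near-optimally without a model of T or R.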

Slide 34: PAC-MDP Model-free RL. Ingredients of the general recipe: Q-functions that can be KWIK-learned and are kept optimistic; when the error term E(s,a) is small, the agent is near-optimal (exploit); otherwise it explores.

Slide 35: Delayed Q-learning. Delayed Q-learning (for finite MDPs) is the first known PAC-MDP model-free algorithm [Strehl & Li & Wiewiora & Langford & Littman 06]. It is similar to Q-learning [Watkins 89], with minimal computational complexity and minimal space complexity.

Slide 36: Comparison. [Comparison table.]

Slide 37: Improved Lower Bound for Finite MDPs. The slide relates three quantities: the lower bound for the N = 1 case [Mannor & Tsitsiklis 04]; Theorem: a new lower bound; and Delayed Q-learning's upper bound, which the new lower bound matches.

Slide 38: KWIK with Linear Function Approximation. Linear FA represents the value function as a weighted combination of features (see below). LSPI-Rmax [Li & Littman & Mansley 09]: LSPI [Lagoudakis & Parr 03] with online exploration; (s,a) is unknown if it is under-represented in the training set; includes Rmax as a special case. REKWIRE [Li & Littman 08]: for finite-horizon MDPs; learns Q in a bottom-up manner.
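The linear form assumed here, in standard notation (a reconstruction, not copied from the slide):

$$ Q(s, a) \;\approx\; \theta^{\top} \phi(s, a) \;=\; \sum_{i=1}^{k} \theta_i\, \phi_i(s, a), $$

where $\phi$ is a fixed feature map and $\theta$ is the learned weight vector.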

Slide 39: Outline. Reinforcement Learning (RL); The KWIK Framework; Provably Efficient RL (Model-based Approaches, Model-free Approaches); Conclusions.

Slide 40: Open Problems. Agnostic learning [Kearns & Schapire & Sellie 94] in KWIK: the hypothesis class H may not include h*; "unrealizable" KWIK [Li & Littman 08]. Prior information in RL: Bayesian priors [Asmuth & Li & Littman & Nouri & Wingate 09]; heuristics/shaping [Asmuth & Littman & Zinkov 08] [Strehl & Li & Littman 09]. Approximate RL with KWIK: least-squares policy iteration [Li & Littman & Mansley 09]; fitted value iteration [Brunskill & Leffler & Li & Littman & Roy 08]; linear function approximation [Li & Littman 08].

Slide 41: Conclusions: A Unification. At the center is KWIK [Li & Littman & Walsh 08]. Model-based results: finite MDPs [Kearns & Singh 02] [Brafman & Tennenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]; linear MDPs [Strehl & Littman 08]; RAM-MDPs [Leffler & Littman & Edmunds 07]; Gaussian-offset MDPs [Brunskill & Leffler & Li & Littman & Roy 08]; factored MDPs [Kearns & Koller 99] [Strehl & Diuk & Littman 07] [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]; delayed-observation MDPs [Walsh & Nouri & Li & Littman 07]. Model-free results: finite MDPs [Strehl & Li & Wiewiora & Langford & Littman 06], with a matching lower bound; KWIK-based value-function approximation [Li & Littman 08] [Li & Mansley & Littman 09]. The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.

Slide 42: References.
1. Li, Littman, & Walsh: "Knows what it knows: A framework for self-aware learning". In ICML 2008.
2. Diuk, Li, & Leffler: "The adaptive k-meteorologist problem and its applications to structure discovery and feature selection in reinforcement learning". In ICML 2009.
3. Brunskill, Leffler, Li, Littman, & Roy: "CORL: A continuous-state offset-dynamics reinforcement learner". In UAI 2008.
4. Walsh, Nouri, Li, & Littman: "Planning and learning in environments with delayed feedback". In ECML 2007.
5. Strehl, Li, & Littman: "Incremental model-based learners with formal learning-time guarantees". In UAI 2006.
6. Li, Littman, & Mansley: "Online exploration in least-squares policy iteration". In AAMAS 2009.
7. Li & Littman: "Efficient value-function approximation via online linear regression". In AI&Math 2008.
8. Strehl, Li, Wiewiora, Langford, & Littman: "PAC model-free reinforcement learning". In ICML 2006.