Extraction and Transfer of Knowledge in Reinforcement Learning
A. LAZARIC, Inria
"30 minutes de Science" Seminars, SequeL, Inria Lille – Nord Europe
December 10th, 2014

SequeL (Sequential Learning)
[Team slide: timeline 2005, 2008, 2010, since Dec. ...]
Tools: statistics, multi-armed bandits, stochastic approximation, online optimization, dynamic programming, optimal control theory.
Problems: reinforcement learning, sequence prediction, online learning.
Results: theory (learnability, sample complexity, regret), algorithms (online/batch RL, bandits with structure), applications (finance, recommendation systems, computer games).

Extraction and Transfer of Knowledge in Reinforcement Learning

[Plot illustrating what good transfer looks like: positive transfer vs. no transfer vs. negative transfer.]

Can we design algorithms able to learn from experience and transfer knowledge across different problems to improve their learning performance?

Outline
- Transfer in Reinforcement Learning
- Improving the Exploration Strategy
- Improving the Accuracy of Approximation
- Conclusions

Outline (next: Transfer in Reinforcement Learning)

Reinforcement Learning
[Diagram: agent-environment loop with a critic and delayed feedback. Example: state = <position, speed>, action = <handlebar, pedals>, next state = <new position, new speed>, reward = advancement. The agent learns a value function and a control policy.]

Markov Decision Process (MDP)
A Markov Decision Process is defined by:
- a set of states
- a set of actions
- the dynamics (transition probabilities)
- the reward
Policy: a mapping from states to actions.
Objective: maximize the value function.
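The formulas on the original slide are embedded as images and do not survive in this transcript; as a hedged reconstruction, the standard discounted objective reads (the exact formulation used in the talk may differ, e.g. finite horizon or average reward):

```latex
V^{\pi}(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r\big(s_t, \pi(s_t)\big) \,\Big|\, s_0 = s\Big],
\qquad \pi^{*} \in \arg\max_{\pi} V^{\pi}(s),
\qquad \gamma \in (0, 1).
```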

Reinforcement Learning Algorithms
Over time:
- observe the current state
- take an action
- observe the next state and the reward
- update the policy and the value function
Two key difficulties: the exploration/exploitation dilemma and function approximation.
RL algorithms often require many samples and careful design and hand-tuning.
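As a concrete, minimal instance of this observe/act/update loop, here is a tabular Q-learning sketch. Q-learning is only one example of such an algorithm (not the specific methods of this talk), and the `env.reset()`/`env.step()` interface and all hyperparameters are illustrative assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning: observe, act, observe reward, update."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()                      # observe the initial state
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # observe next state and reward
            # temporal-difference update of the action-value estimate
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```

Even in this toy form the two difficulties flagged on the slide are visible: `epsilon` governs the exploration/exploitation trade-off, and the table `Q` would have to be replaced by a function approximator in large state spaces.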

Reinforcement Learning
[Agent-environment-critic diagram again, with reward = advancement.] Very inefficient!

Transfer in Reinforcement Learning
[Diagram, built up over several slides: the agent-environment-critic loop is augmented with a transfer-of-knowledge component that carries knowledge from previously solved tasks into the current one.]

Outline (next: Improving the Exploration Strategy)

Multi-arm Bandit: a "Simple" RL Problem
The multi-armed bandit problem:
- set of states: none
- set of actions (e.g., movies, lessons)
- dynamics: none
- reward (e.g., rating, grade)
Policy and objective: maximize the reward over time.
In short: online optimization of an unknown stochastic function under computational constraints.
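In the usual stochastic-bandit notation (a hedged paraphrase of "maximize the reward over time", not copied from the slide), the learner pulls arm a_t at step t, observes a reward with unknown mean, and equivalently minimizes the regret:

```latex
R_n = n\,\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{n} \mu_{a_t}\Big],
\qquad \mu^{*} = \max_{a} \mu_{a},
```

where each arm a has an unknown mean reward \mu_a.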

Sequential Transfer in Bandits: explore and exploit.

Sequential Transfer in Bandits
[Illustration: a stream of users over time: past users, the current user, future users.]

Sequential Transfer in Bandits (past users, current user, future users)
Idea: although the type of the current user is unknown, we can collect knowledge about users and exploit their similarity to identify the type and speed up the learning process.

Sequential Transfer in Bandits (past users, current user, future users)
Sanity check: design an algorithm that, given the information about the possible users as prior knowledge, outperforms a non-transfer approach.

The model-Upper Confidence Bound Algorithm
Over time, select the action that scores highest on two terms:
- Exploitation: the higher the (estimated) reward, the higher the chance of selecting the action.
- Exploration: the higher the (theoretical) uncertainty, the higher the chance of selecting the action.
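The slide states the two terms informally; a minimal UCB1-style sketch of "estimated reward + uncertainty bonus" follows. The bonus constant and the `pull(arm)` reward interface are assumptions, and the talk's model-based variant additionally exploits the set of candidate models Θ (next slide):

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """UCB1: pull the arm maximizing estimated mean + exploration bonus."""
    counts = np.zeros(n_arms)        # number of times each arm was pulled
    means = np.zeros(n_arms)         # empirical mean reward of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                # pull every arm once to initialize
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)    # exploration term
            a = int(np.argmax(means + bonus))            # exploitation + exploration
        r = pull(a)                  # observe a stochastic reward for arm a
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
    return means, counts
```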

The model-Upper Confidence Bound Algorithm
Over time, select the action as above, with a "transfer" term: combine the current estimates with prior knowledge about the users in Θ.
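One hedged reading of how the prior knowledge in Θ can be combined with the running estimates: keep only the candidate mean-reward models (one NumPy vector per possible user type) that are still consistent with the confidence intervals, and act optimistically among the survivors. This is an illustrative sketch of the idea, not the exact algorithm analyzed in the talk:

```python
import numpy as np

def model_ucb_action(models, means, counts, t):
    """One illustrative decision step. `models` is the prior knowledge Theta:
    a list of candidate mean-reward vectors (one per possible user type)."""
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
    # keep only the models still consistent with the confidence intervals
    consistent = [m for m in models if np.all(np.abs(m - means) <= bonus)]
    if not consistent:
        consistent = models                      # fall back: trust all models
    # optimism: the surviving model with the best achievable reward
    optimistic = max(consistent, key=lambda m: m.max())
    return int(np.argmax(optimistic))            # arm optimal for that model
```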

Sequential Transfer in Bandits (past users, current user, future users)
Built up over several slides:
- collect knowledge from past users
- transfer knowledge to the current user
- keep collecting and transferring knowledge as new users arrive

The transfer-Upper Confidence Bound Algorithm
Over time, select the action as before, now also collecting and transferring knowledge: a method-of-moments approach is used to solve the underlying latent variable model problem.

Results
Theoretical guarantees:
- improvement with respect to the no-transfer solution
- reduction in the regret
- explicit dependency on the number of "users" and on how different they are

Empirical Results
Synthetic data [performance plot, annotated from BAD to GOOD].
NIPS 2013, with E. Brunskill (CMU) and M. Azar (Northwestern Univ).
Currently testing on a "movie recommendation" dataset.

Outline (next: Improving the Accuracy of Approximation)

Sparse Multi-task Reinforcement Learning
Learning to play poker:
- states: cards, chips, ...
- actions: stay, call, fold
- dynamics: deck, opponent
- reward: money
Use RL to solve it!

Sparse Multi-task Reinforcement Learning
This is a multi-task RL problem!

Sparse Multi-task Reinforcement Learning
Let's use as much information as possible to solve the problem! Not all the "features" are equally useful!

The linear Fitted Q-Iteration Algorithm
- collect samples from the environment
- create a regression dataset (based on a set of features)
- solve a linear regression problem
- return the greedy policy
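A minimal sketch of these four steps, assuming a feature map `phi(s, a)`, a dataset of `(s, a, r, s_next)` transitions, and plain least-squares for the regression step (all illustrative assumptions):

```python
import numpy as np

def linear_fqi(samples, phi, n_actions, gamma=0.99, n_iter=50):
    """Fitted Q-Iteration with a linear approximation Q(s, a) = phi(s, a) . w."""
    X = np.array([phi(s, a) for s, a, _, _ in samples])   # regression inputs
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Bellman backup: targets built from the current Q estimate
        y = np.array([r + gamma * max(phi(s2, b) @ w for b in range(n_actions))
                      for _, _, r, s2 in samples])
        # ordinary least-squares fit of the next Q function
        w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def greedy_policy(s):
        return int(np.argmax([phi(s, b) @ w for b in range(n_actions)]))

    return w, greedy_policy
```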

Sparse Linear Fitted Q-Iteration
- collect samples from the environment
- create a regression dataset
- solve a sparse linear regression problem: the LASSO (L1-regularized least-squares)
- return the greedy policy
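The only change with respect to the previous sketch is the regression step; for example, with scikit-learn's Lasso (the regularization strength `alpha` is an illustrative assumption):

```python
from sklearn.linear_model import Lasso

def sparse_regression_step(X, y, alpha=0.01):
    """Drop-in replacement for the least-squares fit in the FQI sketch above."""
    lasso = Lasso(alpha=alpha, fit_intercept=False)
    lasso.fit(X, y)          # X: features of (s, a) pairs, y: Bellman targets
    return lasso.coef_       # sparse weights: many entries are driven exactly to zero
```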

The Multi-task Joint Sparsity Assumption
[Diagram: a features-by-tasks weight matrix in which only a few feature rows are non-zero, and those rows are shared across all tasks.]

Multi-task Sparse Linear Fitted Q-Iteration
- collect samples from each task
- create T regression datasets
- solve a multi-task sparse linear regression problem: the Group LASSO (L(1,2)-regularized least-squares)
- return the greedy policies
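One way to realize the L(1,2) regression step across T tasks is scikit-learn's MultiTaskLasso, which enforces a sparsity pattern shared by all tasks. The sketch below assumes, for simplicity, that the tasks share the same design matrix X of (s, a) features, with one column of Bellman targets per task:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def multitask_sparse_regression_step(X, Y, alpha=0.01):
    """X: (n_samples, d) shared features; Y: (n_samples, T), one target column per task."""
    mt_lasso = MultiTaskLasso(alpha=alpha, fit_intercept=False)
    mt_lasso.fit(X, Y)
    W = mt_lasso.coef_                                        # shape (T, d)
    shared_support = np.flatnonzero(np.any(W != 0, axis=0))   # jointly selected features
    return W, shared_support
```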

The Multi-task Joint Sparsity Assumption
[Features-by-tasks diagram revisited: in the original features the joint sparsity may be poor.]

Learning a sparse representation
Learn a transformation of the features (aka dictionary learning).
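As a generic illustration of learning such a transformation (standard dictionary learning applied to the stacked task weights; the specific formulation and solver used in the talk may differ):

```python
from sklearn.decomposition import DictionaryLearning

def learn_sparse_representation(W, n_atoms=10, alpha=0.1):
    """W: (T, d) matrix stacking one weight vector per task."""
    dico = DictionaryLearning(n_components=n_atoms, alpha=alpha, max_iter=200)
    codes = dico.fit_transform(W)     # sparse code of each task in the learned basis
    atoms = dico.components_          # learned feature directions (the dictionary)
    return codes, atoms
```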

Multi-task Feature Learning Linear Fitted Q-Iteration
- collect samples from each task
- create T regression datasets
- MT feature learning: learn a sparse representation, then solve a multi-task sparse linear regression problem
- return the greedy policies

Theoretical Results
Number of samples (per task) needed for an accurate approximation using d features:
- Standard approach: linearly proportional to d... too many samples!
- Lasso: only log(d)! But no advantage from multiple tasks...
- Group Lasso: decreasing in T! But the joint sparsity may be poor...
- Representation learning: only the smallest number of important features! But learning the representation may be expensive...
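Schematically, and only as a hedged paraphrase of the slide (n is the number of samples per task, d the number of features, s the number of truly relevant features, s* the number of relevant features after representation learning; constants and exact exponents are omitted):

```latex
n_{\text{least squares}} \propto d, \qquad
n_{\text{Lasso}} \propto s \log d, \qquad
n_{\text{Group Lasso}} \propto s\Big(1 + \tfrac{\log d}{T}\Big), \qquad
n_{\text{feature learning}} \propto s^{*}, \quad s^{*} \le s.
```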

Empirical Results: Blackjack
NIPS 2014, with D. Calandriello and M. Restelli (PoliMi).
Under study: application to other computer games.

Outline (next: Conclusions)

Conclusions
[Plot: learning performance with transfer vs. without transfer.]

Thanks!! Inria Lille – Nord Europe