
1 Extraction and Transfer of Knowledge in Reinforcement Learning. A. Lazaric, Inria. “30 minutes de Science” Seminars, SequeL, Inria Lille – Nord Europe, December 10th, 2014

2 SequeL (Sequential Learning). Tools: statistics, multi-armed bandits, stochastic approximation, online optimization, dynamic programming, optimal control theory. Problems: reinforcement learning, sequence prediction, online learning. Results: theory (learnability, sample complexity, regret), algorithms (online/batch RL, bandits with structure), applications (finance, recommendation systems, computer games). [Team timeline graphic: 2005, 2008, 2010, …]

3 Extraction and Transfer of Knowledge in Reinforcement Learning

4 Good transfer: positive transfer vs. no transfer vs. negative transfer. [learning-curve graphic]

5 Can we design algorithms able to learn from experience and transfer knowledge across different problems to improve their learning performance?

6 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

7 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

8 Reinforcement Learning. The agent acts on the environment and receives a delayed critic (evaluation) signal; from this interaction it learns a value function and a control policy. [agent–environment loop diagram]

9 Markov Decision Process (MDP). A Markov Decision Process is defined by: a set of states; a set of actions; the dynamics (transition probabilities); the reward. A policy maps states to actions. Objective: maximize the value function.
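As a toy illustration of these ingredients (not taken from the talk), here is value iteration on a hypothetical two-state, two-action MDP; the states, transitions, rewards, and discount factor are all invented for the example:

```python
# Toy illustration (not from the talk): value iteration on an invented
# 2-state, 2-action MDP. P[s][a] lists (next_state, probability) pairs,
# R[s][a] is the expected reward, GAMMA is the discount factor.
GAMMA = 0.9

P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(1, 1.0)], 1: [(0, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}

def value_iteration(n_iters=500):
    """Repeatedly apply the Bellman optimality backup until (near) convergence."""
    V = {0: 0.0, 1: 0.0}
    for _ in range(n_iters):
        V = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                    for a in (0, 1))
             for s in (0, 1)}
    return V
```

Here staying in state 1 yields reward 2 forever, so its value converges to 2/(1-0.9) = 20, and the greedy policy (the action achieving the max) is the control policy mentioned above.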

10 Reinforcement Learning Algorithms. Over time: observe the state; take an action; observe the next state and the reward; update the policy and value function. Challenges: the exploration/exploitation dilemma and approximation. RL algorithms often require many samples and careful design and hand-tuning.
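The observe/act/update loop above can be sketched with tabular Q-learning and an epsilon-greedy exploration rule. This is a generic sketch on an invented two-state chain, not an algorithm from the talk; all names and constants are illustrative:

```python
import random

# Generic Q-learning sketch (illustrative, not from the talk).
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.2

def step(s, a):
    """Toy dynamics: action 1 flips the state; reward 1 for landing in state 1."""
    s2 = 1 - s if a == 1 else s
    return s2, float(s2 == 1)

def q_learning(n_steps=20000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = 0
    for _ in range(n_steps):
        # Exploration/exploitation dilemma: epsilon-greedy action choice.
        if rng.random() < EPS:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a_: Q[(s, a_)])
        s2, r = step(s, a)
        # Update toward the bootstrapped target r + gamma * max_a' Q(s', a').
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        s = s2
    return Q
```

Even on this tiny problem the loop needs thousands of samples and hand-tuned constants (ALPHA, EPS), which is exactly the inefficiency the slide points at.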

11 Reinforcement Learning. Learning each task from scratch through the agent–environment–critic loop is very inefficient!

12 Transfer in Reinforcement Learning. The agent–environment–critic loop is augmented with the transfer of knowledge from previously solved tasks.

17 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

18 Multi-armed Bandit: a “Simple” RL Problem. The multi-armed bandit problem: no states; a set of actions (e.g., movies, lessons); no dynamics; a reward (e.g., rating, grade); a policy. Objective: maximize the reward over time. In short: online optimization of an unknown stochastic function under computational constraints.

19 Sequential Transfer in Bandit: explore and exploit.

20 Sequential Transfer in Bandit: past users, current user, future users.

21 Sequential Transfer in Bandit: past users, current user, future users. Idea: although the type of the current user is unknown, we may collect knowledge about users over time and exploit their similarity to identify the type and speed up the learning process.

22 Sequential Transfer in Bandit: past users, current user, future users. Sanity check: develop an algorithm that, given the information about the possible users as prior knowledge, can outperform a non-transfer approach.

23 The model-Upper Confidence Bound Algorithm. Over time, select the action that maximizes a score combining exploitation (the higher the estimated reward, the higher the chance to select the action) and exploration (the higher the theoretical uncertainty, the higher the chance to select the action).
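The exploitation-plus-exploration score can be sketched with a standard UCB1-style rule (a generic sketch of the principle, before the model-based prior knowledge is added; arm means, noise level, and horizon are invented):

```python
import math
import random

# Generic UCB1-style sketch: score = empirical mean + confidence width.

def ucb_select(counts, sums, t):
    """Pick the arm maximizing mean + sqrt(2 ln t / n); untried arms first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
                             + math.sqrt(2 * math.log(t) / counts[a]))

def run_bandit(means, horizon, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    for t in range(1, horizon + 1):
        a = ucb_select(counts, sums, t)
        r = means[a] + rng.gauss(0, 0.1)   # noisy reward from the chosen arm
        counts[a] += 1
        sums[a] += r
    return counts
```

As the counts grow the confidence width shrinks, so the rule smoothly shifts from exploration to exploitation.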

24 The model-Upper Confidence Bound Algorithm. Over time, select the action; “transfer”: combine the current estimates with the prior knowledge about the users in Θ.
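A rough sketch of this combination, as an illustration of the principle rather than the exact rule from the talk: assume Θ is a finite set of candidate mean-reward vectors (one per user type), discard the models inconsistent with the current confidence intervals, and act optimistically within the surviving set. Everything below (the consistency test, the fallback) is an invented simplification:

```python
import math

# Hedged sketch: combining empirical estimates with a finite model set Theta.
# Not the exact model-UCB algorithm; an illustration of the idea only.

def model_ucb_select(counts, sums, t, Theta):
    n_arms = len(counts)
    means = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]
    width = [math.sqrt(2 * math.log(t) / counts[a]) if counts[a] else float("inf")
             for a in range(n_arms)]
    # A model survives if every arm's mean lies inside the confidence interval.
    alive = [th for th in Theta
             if all(abs(th[a] - means[a]) <= width[a] for a in range(n_arms))]
    if not alive:                      # fall back to plain optimism
        return max(range(n_arms), key=lambda a: means[a] + width[a])
    # Optimistic choice: the best arm of the most favorable surviving model.
    best = max(alive, key=max)
    return max(range(n_arms), key=lambda a: best[a])
```

Once only one model in Θ is consistent with the data, the user's type is identified and the algorithm can commit to that model's best arm, which is where the speed-up over plain UCB comes from.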

25 Sequential Transfer in Bandit: past users, current user, future users. Collect knowledge.

26 Sequential Transfer in Bandit: past users, current user, future users. Transfer knowledge.

27 Sequential Transfer in Bandit: past users, current user, future users. Collect & transfer knowledge.

30 The transfer-Upper Confidence Bound Algorithm. Over time, select the action; “collect and transfer”: use a method-of-moments approach to solve a latent variable model problem.

31 Results. Theoretical guarantees: an improvement w.r.t. the no-transfer solution; a reduction in the regret; a dependency on the number of “users” and on how different they are.

32 Empirical Results. Synthetic data. [regret plots] NIPS 2013, with E. Brunskill (CMU) and M. Azar (Northwestern Univ.). Currently testing on a “movie recommendation” dataset.

33 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

34 Sparse Multi-task Reinforcement Learning. Learning to play poker: states (cards, chips, …); actions (stay, call, fold); dynamics (deck, opponent); reward (money). Use RL to solve it!

35 Sparse Multi-task Reinforcement Learning. This is a multi-task RL problem!

36 Sparse Multi-task Reinforcement Learning. Let’s use as much information as possible to solve the problem! Not all the “features” are equally useful!

37 The Linear Fitted Q-Iteration Algorithm. Collect samples from the environment; create a regression dataset over the features; solve a linear regression problem; return the greedy policy.
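The four steps above can be sketched in a few lines. This is a toy illustration, not the talk's implementation: the MDP and its reward are invented, and the features are one-hot over (state, action), which makes the least-squares fit degenerate to a per-pair assignment so the regression step stays trivial; a real application would use generic features and a proper solver.

```python
# Toy sketch of linear Fitted Q-Iteration on an invented deterministic MDP.
GAMMA = 0.9

def step(s, a):
    """Toy dynamics: action 1 flips the state; reward 1 for landing in state 1."""
    s2 = 1 - s if a == 1 else s
    return s2, float(s2 == 1)

def fitted_q_iteration(n_iters=200):
    # 1) collect samples from the environment (here: every (s, a) pair once)
    samples = [(s, a) + step(s, a) for s in (0, 1) for a in (0, 1)]
    w = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}   # one weight per feature
    for _ in range(n_iters):
        # 2) create the regression dataset: targets r + gamma * max_a' Q(s', a')
        targets = {(s, a): r + GAMMA * max(w[(s2, 0)], w[(s2, 1)])
                   for s, a, s2, r in samples}
        # 3) solve the (here trivial) linear regression problem
        w = targets
    # 4) the greedy policy is argmax_a w[(s, a)]
    return w
```

With one-hot features this reduces to exact Q-iteration, so the weights converge to the optimal Q-values of the toy chain.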

38 Sparse Linear Fitted Q-Iteration. Collect samples from the environment; create a regression dataset; solve a sparse linear regression problem via the LASSO (L1-regularized least squares); return the greedy policy.
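To see why the L1 penalty zeroes out useless features, here is a minimal LASSO solver via ISTA (proximal gradient with soft-thresholding). The data and λ are invented for the example: feature 1 is irrelevant to the target, so its weight is driven exactly to zero:

```python
# Minimal LASSO via ISTA: gradient step on the squared loss, then
# soft-thresholding, which sets small coordinates exactly to zero.

def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward 0 by t."""
    return max(x - t, 0.0) if x > 0 else min(x + t, 0.0)

def lasso_ista(X, y, lam, lr=0.01, n_iters=5000):
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        # gradient of 0.5 * ||Xw - y||^2
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i]
                 for i in range(len(X))]
        grad = [sum(X[i][j] * resid[i] for i in range(len(X)))
                for j in range(d)]
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w
```

The soft-thresholding step is what distinguishes the LASSO from ridge regression: instead of merely shrinking all weights, it produces exact zeros, i.e. a sparse solution over the features.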

39 The Multi-task Joint Sparsity Assumption. [weight-matrix graphic: features × tasks]

40 Multi-task Sparse Linear Fitted Q-Iteration. Collect samples from each task; create T regression datasets; solve a multi-task sparse linear regression problem via the Group LASSO (L(1,2)-regularized least squares); return the greedy policies.
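The L(1,2) penalty of the Group LASSO acts through a group soft-thresholding step: for each feature, the weights across all T tasks form one group, and the whole group is shrunk by its Euclidean norm, so a feature is kept or dropped jointly for every task. A minimal sketch of that proximal operator only, with invented numbers:

```python
import math

# Group soft-thresholding: the proximal operator of lam * sum_j ||w_j||_2,
# where w_j collects feature j's weights across the T tasks.

def group_soft_threshold(w_rows, lam):
    """w_rows[j] holds feature j's weights across tasks; shrink each row's norm."""
    out = []
    for row in w_rows:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * x for x in row])
    return out
```

A row whose norm falls below λ is zeroed for all tasks at once, which is exactly the joint-sparsity assumption of the previous slide.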

42 Learning a Sparse Representation: a transformation of the features (aka dictionary learning).

43 Multi-task Feature Learning Linear Fitted Q-Iteration. Collect samples from each task; create T regression datasets; learn a sparse representation (multi-task feature learning); solve a multi-task sparse linear regression problem; return the greedy policies.

44 Theoretical Results. Number of samples (per task) needed to have an accurate approximation using d features:
 Standard approach: linearly proportional to d… too many samples!
 LASSO: only log(d)! But no advantage from multiple tasks…
 Group LASSO: decreasing in T! But the joint sparsity may be poor…
 Representation learning: the smallest number of important features! But learning the representation may be expensive…

45 Empirical Results: Blackjack. NIPS 2014, with D. Calandriello and M. Restelli (PoliMi). Under study: application to other computer games.

46 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

47 Conclusions: with transfer vs. without transfer. [learning-curve comparison]

48 Thanks!! Inria Lille – Nord Europe

