
1 Extraction and Transfer of Knowledge in Reinforcement Learning. A. Lazaric, Inria. “30 minutes de Science” Seminars, SequeL, Inria Lille – Nord Europe, December 10th, 2014

2 SequeL (Sequential Learning). Tools: statistics, multi-armed bandits, stochastic approximation, online optimization, dynamic programming, optimal control theory. Problems: reinforcement learning, sequence prediction, online learning. Results: theory (learnability, sample complexity, regret), algorithms (online/batch RL, bandits with structure), applications (finance, recommendation systems, computer games). [Team timeline graphic: 2005, 2008, 2010, …]

3 Extraction and Transfer of Knowledge in Reinforcement Learning

4 Good transfer: positive transfer vs. no transfer vs. negative transfer. [learning-curve graphic]

5 Can we design algorithms able to learn from experience and transfer knowledge across different problems to improve their learning performance?

6 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

7 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

8 Reinforcement Learning. The agent acts on the environment and receives a delayed critic (evaluation) signal; from this interaction it learns a value function and a control policy. [agent–environment loop diagram]

9 Markov Decision Process (MDP). A Markov Decision Process is defined by: a set of states; a set of actions; the dynamics (transition probabilities); the reward. A policy maps states to actions. Objective: maximize the value function.
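As a toy illustration of these ingredients (not taken from the talk), here is value iteration on a hypothetical two-state, two-action MDP; the states, transitions, rewards, and discount factor are all invented for the example:

```python
# Toy illustration (not from the talk): value iteration on an invented
# 2-state, 2-action MDP. P[s][a] lists (next_state, probability) pairs,
# R[s][a] is the expected reward, GAMMA is the discount factor.
GAMMA = 0.9

P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(1, 1.0)], 1: [(0, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}

def value_iteration(n_iters=500):
    """Repeatedly apply the Bellman optimality backup until (near) convergence."""
    V = {0: 0.0, 1: 0.0}
    for _ in range(n_iters):
        V = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                    for a in (0, 1))
             for s in (0, 1)}
    return V
```

Here staying in state 1 yields reward 2 forever, so its value converges to 2/(1-0.9) = 20, and the greedy policy (the action achieving the max) is the control policy mentioned above.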

10 Reinforcement Learning Algorithms. Over time: observe the state; take an action; observe the next state and the reward; update the policy and value function. Challenges: the exploration/exploitation dilemma and approximation. RL algorithms often require many samples and careful design and hand-tuning.
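The observe/act/update loop above can be sketched with tabular Q-learning and an epsilon-greedy exploration rule. This is a generic sketch on an invented two-state chain, not an algorithm from the talk; all names and constants are illustrative:

```python
import random

# Generic Q-learning sketch (illustrative, not from the talk).
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.2

def step(s, a):
    """Toy dynamics: action 1 flips the state; reward 1 for landing in state 1."""
    s2 = 1 - s if a == 1 else s
    return s2, float(s2 == 1)

def q_learning(n_steps=20000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = 0
    for _ in range(n_steps):
        # Exploration/exploitation dilemma: epsilon-greedy action choice.
        if rng.random() < EPS:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a_: Q[(s, a_)])
        s2, r = step(s, a)
        # Update toward the bootstrapped target r + gamma * max_a' Q(s', a').
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        s = s2
    return Q
```

Even on this tiny problem the loop needs thousands of samples and hand-tuned constants (ALPHA, EPS), which is exactly the inefficiency the slide points at.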

11 Reinforcement Learning. Learning each task from scratch through the agent–environment–critic loop is very inefficient!

12 Transfer in Reinforcement Learning. The agent–environment–critic loop is augmented with the transfer of knowledge from previously solved tasks.

17 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

18 Multi-armed Bandit: a “Simple” RL Problem. The multi-armed bandit problem: no states; a set of actions (e.g., movies, lessons); no dynamics; a reward (e.g., rating, grade); a policy. Objective: maximize the reward over time. In short: online optimization of an unknown stochastic function under computational constraints.

19 Sequential Transfer in Bandit: explore and exploit.

20 Sequential Transfer in Bandit: past users, current user, future users.

21 Sequential Transfer in Bandit: past users, current user, future users. Idea: although the type of the current user is unknown, we may collect knowledge about users over time and exploit their similarity to identify the type and speed up the learning process.

22 Sequential Transfer in Bandit: past users, current user, future users. Sanity check: develop an algorithm that, given the information about the possible users as prior knowledge, can outperform a non-transfer approach.

23 The model-Upper Confidence Bound Algorithm. Over time, select the action that maximizes a score combining exploitation (the higher the estimated reward, the higher the chance to select the action) and exploration (the higher the theoretical uncertainty, the higher the chance to select the action).
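The exploitation-plus-exploration score can be sketched with a standard UCB1-style rule (a generic sketch of the principle, before the model-based prior knowledge is added; arm means, noise level, and horizon are invented):

```python
import math
import random

# Generic UCB1-style sketch: score = empirical mean + confidence width.

def ucb_select(counts, sums, t):
    """Pick the arm maximizing mean + sqrt(2 ln t / n); untried arms first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
                             + math.sqrt(2 * math.log(t) / counts[a]))

def run_bandit(means, horizon, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    for t in range(1, horizon + 1):
        a = ucb_select(counts, sums, t)
        r = means[a] + rng.gauss(0, 0.1)   # noisy reward from the chosen arm
        counts[a] += 1
        sums[a] += r
    return counts
```

As the counts grow the confidence width shrinks, so the rule smoothly shifts from exploration to exploitation.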

24 The model-Upper Confidence Bound Algorithm. Over time, select the action; “transfer”: combine the current estimates with the prior knowledge about the users in Θ.
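A rough sketch of this combination, as an illustration of the principle rather than the exact rule from the talk: assume Θ is a finite set of candidate mean-reward vectors (one per user type), discard the models inconsistent with the current confidence intervals, and act optimistically within the surviving set. Everything below (the consistency test, the fallback) is an invented simplification:

```python
import math

# Hedged sketch: combining empirical estimates with a finite model set Theta.
# Not the exact model-UCB algorithm; an illustration of the idea only.

def model_ucb_select(counts, sums, t, Theta):
    n_arms = len(counts)
    means = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]
    width = [math.sqrt(2 * math.log(t) / counts[a]) if counts[a] else float("inf")
             for a in range(n_arms)]
    # A model survives if every arm's mean lies inside the confidence interval.
    alive = [th for th in Theta
             if all(abs(th[a] - means[a]) <= width[a] for a in range(n_arms))]
    if not alive:                      # fall back to plain optimism
        return max(range(n_arms), key=lambda a: means[a] + width[a])
    # Optimistic choice: the best arm of the most favorable surviving model.
    best = max(alive, key=max)
    return max(range(n_arms), key=lambda a: best[a])
```

Once only one model in Θ is consistent with the data, the user's type is identified and the algorithm can commit to that model's best arm, which is where the speed-up over plain UCB comes from.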

25 Sequential Transfer in Bandit: past users, current user, future users. Collect knowledge.

26 Sequential Transfer in Bandit: past users, current user, future users. Transfer knowledge.

27 Sequential Transfer in Bandit: past users, current user, future users. Collect & transfer knowledge.

30 The transfer-Upper Confidence Bound Algorithm. Over time, select the action; “collect and transfer”: use a method-of-moments approach to solve a latent variable model problem.

31 Results. Theoretical guarantees: an improvement w.r.t. the no-transfer solution; a reduction in the regret; a dependency on the number of “users” and on how different they are.

32 Empirical Results. Synthetic data. [regret plots] NIPS 2013, with E. Brunskill (CMU) and M. Azar (Northwestern Univ.). Currently testing on a “movie recommendation” dataset.

33 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

34 Sparse Multi-task Reinforcement Learning. Learning to play poker: states (cards, chips, …); actions (stay, call, fold); dynamics (deck, opponent); reward (money). Use RL to solve it!

35 Sparse Multi-task Reinforcement Learning. This is a multi-task RL problem!

36 Sparse Multi-task Reinforcement Learning. Let’s use as much information as possible to solve the problem! Not all the “features” are equally useful!

37 The Linear Fitted Q-Iteration Algorithm. Collect samples from the environment; create a regression dataset over the features; solve a linear regression problem; return the greedy policy.
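The four steps above can be sketched in a few lines. This is a toy illustration, not the talk's implementation: the MDP and its reward are invented, and the features are one-hot over (state, action), which makes the least-squares fit degenerate to a per-pair assignment so the regression step stays trivial; a real application would use generic features and a proper solver.

```python
# Toy sketch of linear Fitted Q-Iteration on an invented deterministic MDP.
GAMMA = 0.9

def step(s, a):
    """Toy dynamics: action 1 flips the state; reward 1 for landing in state 1."""
    s2 = 1 - s if a == 1 else s
    return s2, float(s2 == 1)

def fitted_q_iteration(n_iters=200):
    # 1) collect samples from the environment (here: every (s, a) pair once)
    samples = [(s, a) + step(s, a) for s in (0, 1) for a in (0, 1)]
    w = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}   # one weight per feature
    for _ in range(n_iters):
        # 2) create the regression dataset: targets r + gamma * max_a' Q(s', a')
        targets = {(s, a): r + GAMMA * max(w[(s2, 0)], w[(s2, 1)])
                   for s, a, s2, r in samples}
        # 3) solve the (here trivial) linear regression problem
        w = targets
    # 4) the greedy policy is argmax_a w[(s, a)]
    return w
```

With one-hot features this reduces to exact Q-iteration, so the weights converge to the optimal Q-values of the toy chain.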

38 Sparse Linear Fitted Q-Iteration. Collect samples from the environment; create a regression dataset; solve a sparse linear regression problem via the LASSO (L1-regularized least squares); return the greedy policy.
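To see why the L1 penalty zeroes out useless features, here is a minimal LASSO solver via ISTA (proximal gradient with soft-thresholding). The data and λ are invented for the example: feature 1 is irrelevant to the target, so its weight is driven exactly to zero:

```python
# Minimal LASSO via ISTA: gradient step on the squared loss, then
# soft-thresholding, which sets small coordinates exactly to zero.

def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward 0 by t."""
    return max(x - t, 0.0) if x > 0 else min(x + t, 0.0)

def lasso_ista(X, y, lam, lr=0.01, n_iters=5000):
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        # gradient of 0.5 * ||Xw - y||^2
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i]
                 for i in range(len(X))]
        grad = [sum(X[i][j] * resid[i] for i in range(len(X)))
                for j in range(d)]
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w
```

The soft-thresholding step is what distinguishes the LASSO from ridge regression: instead of merely shrinking all weights, it produces exact zeros, i.e. a sparse solution over the features.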

39 The Multi-task Joint Sparsity Assumption. [weight-matrix graphic: features × tasks]

40 Multi-task Sparse Linear Fitted Q-Iteration. Collect samples from each task; create T regression datasets; solve a multi-task sparse linear regression problem via the Group LASSO (L(1,2)-regularized least squares); return the greedy policies.
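The L(1,2) penalty of the Group LASSO acts through a group soft-thresholding step: for each feature, the weights across all T tasks form one group, and the whole group is shrunk by its Euclidean norm, so a feature is kept or dropped jointly for every task. A minimal sketch of that proximal operator only, with invented numbers:

```python
import math

# Group soft-thresholding: the proximal operator of lam * sum_j ||w_j||_2,
# where w_j collects feature j's weights across the T tasks.

def group_soft_threshold(w_rows, lam):
    """w_rows[j] holds feature j's weights across tasks; shrink each row's norm."""
    out = []
    for row in w_rows:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * x for x in row])
    return out
```

A row whose norm falls below λ is zeroed for all tasks at once, which is exactly the joint-sparsity assumption of the previous slide.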

42 Learning a Sparse Representation: a transformation of the features (aka dictionary learning).

43 Multi-task Feature Learning Linear Fitted Q-Iteration. Collect samples from each task; create T regression datasets; learn a sparse representation (multi-task feature learning); solve a multi-task sparse linear regression problem; return the greedy policies.

44 Theoretical Results. Number of samples (per task) needed to have an accurate approximation using d features:
 Standard approach: linearly proportional to d… too many samples!
 LASSO: only log(d)! But no advantage from multiple tasks…
 Group LASSO: decreasing in T! But the joint sparsity may be poor…
 Representation learning: the smallest number of important features! But learning the representation may be expensive…

45 Empirical Results: Blackjack. NIPS 2014, with D. Calandriello and M. Restelli (PoliMi). Under study: application to other computer games.

46 Outline  Transfer in Reinforcement Learning  Improving the Exploration Strategy  Improving the Accuracy of Approximation  Conclusions

47 Conclusions: with transfer vs. without transfer. [learning-curve comparison]

48 Thanks!! Inria Lille – Nord Europe

