1
Extraction and Transfer of Knowledge in Reinforcement Learning
A. Lazaric, Inria
"30 minutes de Science" Seminars, SequeL, Inria Lille – Nord Europe, December 10th, 2014

2
SequeL (Sequential Learning). December 10th, 2014. A. LAZARIC – Transfer in RL - 2
(2005) (2008) (2010) since Dec
Tools: statistics, multi-arm bandits, stochastic approximation, online optimization, dynamic programming, optimal control theory
Problems: reinforcement learning, sequence prediction, online learning
Results: theory (learnability, sample complexity, regret); algorithms (online/batch RL, bandits with structure); applications (finance, recommendation systems, computer games)

3
Extraction and Transfer of Knowledge in Reinforcement Learning

4
[Figure: "good transfer" learning curves, contrasting positive transfer, no transfer, and negative transfer]

5
Can we design algorithms able to learn from experience and transfer knowledge across different problems to improve their learning performance?

6
Outline
Transfer in Reinforcement Learning
Improving the Exploration Strategy
Improving the Accuracy of Approximation
Conclusions

7
Outline recap. Next: Transfer in Reinforcement Learning

8
Reinforcement Learning
[Diagram: the agent acts on the environment; a critic returns a delayed reward signal]

9
Markov Decision Process (MDP)
A Markov Decision Process is defined by: a set of states, a set of actions, the dynamics (transition probabilities), and a reward function. A policy maps states to actions. Objective: maximize the value function.
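As a concrete illustration of these ingredients, the sketch below builds a tiny two-state MDP and computes its optimal value function by value iteration. All states, actions, probabilities, and rewards are invented for the example; this is not a model from the talk.

```python
# Toy illustration of an MDP (states, actions, dynamics, reward) and the
# value function it induces. The two-state chain below is made up.

GAMMA = 0.9  # discount factor

# P[s][a] = list of (next_state, probability); R[s][a] = expected reward
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {0: {"stay": 0.0, "go": 0.0}, 1: {"stay": 1.0, "go": 0.0}}

def value_iteration(tol=1e-8):
    """Compute V*(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V*(s') ]."""
    V = {s: 0.0 for s in P}
    while True:
        newV = {
            s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                   for a in P[s])
            for s in P
        }
        if max(abs(newV[s] - V[s]) for s in P) < tol:
            return newV
        V = newV

V = value_iteration()
# State 1 pays reward 1 forever under "stay", so V*(1) = 1 / (1 - 0.9) = 10.
```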

10
Reinforcement Learning Algorithms
Over time: observe the state, take an action, observe the next state and the reward, update the policy and value function.
Key difficulties: the exploration/exploitation dilemma and approximation. RL algorithms often require many samples and careful design and hand-tuning.
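The loop above can be sketched with tabular Q-learning. The chain environment, the optimistic initialization, and all constants are invented for illustration; they are not taken from the talk.

```python
# A minimal sketch of the loop on the slide: observe the state, take an
# action, observe the next state and reward, update the value estimates.
# Tabular Q-learning with epsilon-greedy exploration; optimistic initial
# values nudge the agent to try under-explored actions.
import random

random.seed(0)
ACTIONS = ["left", "right"]
N_STATES = 5  # reward only when reaching the right end of the chain
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

Q = {(s, a): 1.0 for s in range(N_STATES) for a in ACTIONS}  # optimistic

s = 0
for _ in range(5000):
    # exploration/exploitation dilemma: mostly greedy, sometimes random
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
    s2, r = step(s, a)
    done = (s2 == N_STATES - 1)
    # no bootstrapping past the (terminal) goal state
    target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    s = 0 if done else s2  # restart the episode at the goal
```

After training, the greedy policy moves right from every state, which is optimal in this toy chain.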

11
Reinforcement Learning
[Diagram: agent, environment, critic; delayed reward] Very inefficient!

12
Transfer in Reinforcement Learning
[Diagram: agent, environment, critic with delayed reward; knowledge transferred across a sequence of tasks]

17
Outline recap. Next: Improving the Exploration Strategy

18
Multi-arm Bandit: a "Simple" RL Problem
The multi-armed bandit problem: no states, a set of actions (e.g., movies, lessons), no dynamics, a reward (e.g., a rating, a grade), and a policy. Objective: maximize the reward over time. In short: online optimization of an unknown stochastic function under computational constraints.
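A hedged toy version of this setup: three Bernoulli arms with invented means, and a uniformly exploring policy whose regret against the best arm grows linearly with the horizon, which is exactly why smarter exploration is needed.

```python
# A minimal Bernoulli bandit, illustrating "maximize the reward over time".
# The arm means are invented; regret is measured against always playing
# the best arm in hindsight.
import random

random.seed(1)
MEANS = [0.2, 0.5, 0.8]  # hypothetical click/rating probabilities per arm

def pull(arm):
    return 1.0 if random.random() < MEANS[arm] else 0.0

# A policy that keeps exploring uniformly never commits to the best arm,
# so its regret grows linearly with the horizon T.
T = 10000
reward = sum(pull(random.randrange(len(MEANS))) for _ in range(T))
regret = max(MEANS) * T - reward
```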

19
Sequential Transfer in Bandits: explore and exploit

20
Sequential Transfer in Bandits
[Diagram: past users, current user, future users]

21
Sequential Transfer in Bandits
Idea: although the type of the current user is unknown, we may collect knowledge about past users and exploit their similarity to identify the type and speed up the learning process.

22
Sequential Transfer in Bandits
Sanity check: develop an algorithm that, given the information about possible users as prior knowledge, can outperform a non-transfer approach.

23
The model-Upper Confidence Bound Algorithm
Over time, select the action that balances:
Exploitation: the higher the (estimated) reward, the higher the chance of selecting the action.
Exploration: the higher the (theoretical) uncertainty, the higher the chance of selecting the action.
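A minimal sketch of such an upper-confidence rule, in the spirit of the standard UCB1 index (estimated mean plus a confidence width); the Bernoulli arm means are invented and the constants are the textbook UCB1 choices, not necessarily those of the talk.

```python
# score(arm) = exploitation term + exploration term; pull the max.
import math
import random

random.seed(2)
MEANS = [0.2, 0.5, 0.8]  # hypothetical arm means
counts = [0] * len(MEANS)
sums = [0.0] * len(MEANS)

def ucb_score(arm, t):
    if counts[arm] == 0:
        return float("inf")  # force one pull of every arm first
    mean = sums[arm] / counts[arm]                    # exploitation term
    width = math.sqrt(2 * math.log(t) / counts[arm])  # exploration term
    return mean + width

T = 5000
for t in range(1, T + 1):
    arm = max(range(len(MEANS)), key=lambda a: ucb_score(a, t))
    sums[arm] += 1.0 if random.random() < MEANS[arm] else 0.0
    counts[arm] += 1
```

The confidence widths shrink as an arm is pulled, so the best arm ends up receiving the vast majority of the pulls.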

24
The model-Upper Confidence Bound Algorithm
Over time, select the action. "Transfer": combine the current estimates with prior knowledge about the users in Θ.
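One hedged way to realize this combination (a sketch of the idea, not necessarily the exact algorithm from the talk): keep only the candidate models in Θ that are consistent with the confidence intervals observed so far, and act optimistically over the survivors. The candidate models and the current user's type are invented.

```python
# Restrict optimism to the models in Θ consistent with the data so far.
import math
import random

random.seed(3)
# Θ: candidate models (mean reward of each of the 2 arms, per user type)
THETA = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
TRUE = THETA[1]  # the current user happens to be of the second type

counts = [0, 0]
sums = [0.0, 0.0]

def consistent(model, t):
    # a model survives if it lies inside every arm's confidence interval
    for arm in range(2):
        if counts[arm] == 0:
            continue
        mean = sums[arm] / counts[arm]
        width = math.sqrt(2 * math.log(t + 1) / counts[arm])
        if abs(model[arm] - mean) > width:
            return False
    return True

for t in range(2000):
    alive = [m for m in THETA if consistent(m, t)] or THETA
    # optimism over the surviving models instead of a generic bonus
    arm = max(range(2), key=lambda a: max(m[a] for m in alive))
    sums[arm] += 1.0 if random.random() < TRUE[arm] else 0.0
    counts[arm] += 1
```

Once the data rule out the user types that favor the wrong arm, the algorithm locks onto the correct arm much faster than a prior-free UCB would.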

25
Sequential Transfer in Bandits: collect knowledge

26
Sequential Transfer in Bandits: transfer knowledge

27
Sequential Transfer in Bandits: collect and transfer knowledge

30
The transfer-Upper Confidence Bound Algorithm
Over time, select the action. "Collect and transfer": use a method-of-moments approach to solve a latent variable model problem.

31
Results
Theoretical guarantees: improvement with respect to the no-transfer solution, a reduction in the regret, and a dependency on the number of "users" and their differences.

32
Empirical Results
Synthetic data. [Plot: regret, ranging from BAD to GOOD]
NIPS 2013, with E. Brunskill (CMU) and M. Azar (Northwestern Univ.). Currently testing on a "movie recommendation" dataset.

33
Outline recap. Next: Improving the Accuracy of Approximation

34
Sparse Multi-task Reinforcement Learning
Learning to play poker. States: cards, chips, … Actions: stay, call, fold. Dynamics: deck, opponent. Reward: money. Use RL to solve it!

35
Sparse Multi-task Reinforcement Learning
This is a multi-task RL problem!

36
Sparse Multi-task Reinforcement Learning
Let's use as much information as possible to solve the problem! But not all the "features" are equally useful!

37
The linear Fitted Q-Iteration Algorithm
Collect samples from the environment
Create a regression dataset (using the features)
Solve a linear regression problem
Return the greedy policy
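The four steps can be sketched as follows. One-hot features keep the least-squares step trivial (it reduces to a per-feature average), and the chain environment is invented for illustration; it is not the setting from the talk.

```python
# Linear Fitted Q-Iteration: collect transitions, build a regression
# dataset with targets r + gamma * max_a' Q(s', a'), fit a linear model,
# and repeat until the greedy policy stabilizes.
import random

random.seed(4)
N_STATES, ACTIONS, GAMMA = 4, ["left", "right"], 0.9

def step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

# 1) collect samples from the environment (uniformly random behavior)
data = []
for _ in range(4000):
    s = random.randrange(N_STATES)
    a = random.choice(ACTIONS)
    s2, r = step(s, a)
    data.append((s, a, s2, r))

# weight vector, indexed by the one-hot feature (s, a)
w = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(50):  # FQI iterations
    # 2) create the regression dataset: y = r + gamma * max_a' w . phi(s', a')
    targets = {k: [] for k in w}
    for s, a, s2, r in data:
        targets[(s, a)].append(r + GAMMA * max(w[(s2, b)] for b in ACTIONS))
    # 3) solve the least-squares problem (per-feature mean under one-hot phi)
    w = {k: (sum(v) / len(v) if v else 0.0) for k, v in targets.items()}

# 4) return the greedy policy
policy = {s: max(ACTIONS, key=lambda a: w[(s, a)]) for s in range(N_STATES)}
```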

38
Sparse Linear Fitted Q-Iteration
Collect samples from the environment
Create a regression dataset
Solve a sparse linear regression problem (the Lasso: L1-regularized least squares)
Return the greedy policy
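A sketch of the Lasso step on synthetic data: L1-regularized least squares solved by proximal gradient descent (ISTA, used here as a simple stand-in for whichever solver the authors used). Only 2 of the 10 invented features actually matter, and the L1 penalty should zero out the rest.

```python
# Lasso via ISTA: gradient step on the squared loss, then the L1 proximal
# operator (soft-thresholding). Everything below is synthetic.
import random

random.seed(5)
D, N, LAM, LR = 10, 200, 0.1, 0.01
TRUE_W = [3.0, -2.0] + [0.0] * (D - 2)  # sparse ground truth

X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
y = [sum(w * x for w, x in zip(TRUE_W, row)) + random.gauss(0, 0.1)
     for row in X]

def soft_threshold(v, t):
    return (abs(v) - t) * (1 if v > 0 else -1) if abs(v) > t else 0.0

w = [0.0] * D
for _ in range(500):
    # gradient of the mean squared loss
    grad = [0.0] * D
    for row, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, row)) - yi
        for j in range(D):
            grad[j] += 2 * err * row[j] / N
    # gradient step followed by soft-thresholding (the L1 prox)
    w = [soft_threshold(wj - LR * gj, LR * LAM) for wj, gj in zip(w, grad)]
```

The recovered weights are large only on the two relevant features; the other eight are driven to (near) zero.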

39
The Multi-task Joint Sparsity Assumption
[Diagram: weight matrix with features as rows and tasks as columns; only a few feature rows are non-zero across the tasks]

40
Multi-task Sparse Linear Fitted Q-Iteration
Collect samples from each task
Create T regression datasets
Solve a multi-task sparse linear regression problem (the Group Lasso: L(1,2)-regularized least squares)
Return the greedy policies
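A sketch of the Group-Lasso step behind joint sparsity: the weights of feature j across all T tasks form one group, and the L(1,2) penalty shrinks whole groups to zero together. Data, dimensions, and constants are synthetic, and proximal gradient descent again stands in for whichever solver was actually used.

```python
# Group Lasso via proximal gradient: per-task gradient steps, then
# group soft-thresholding on each feature's weight vector across tasks.
import math
import random

random.seed(6)
D, T, N, LAM, LR = 8, 3, 100, 0.15, 0.01
# jointly sparse ground truth: only features 0 and 1 matter, in every task
TRUE = [[2.0, -1.5, 3.0], [1.0, 2.0, -2.0]] + [[0.0] * T] * (D - 2)

tasks = []
for t in range(T):
    X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
    y = [sum(TRUE[j][t] * row[j] for j in range(D)) + random.gauss(0, 0.1)
         for row in X]
    tasks.append((X, y))

W = [[0.0] * T for _ in range(D)]  # W[j][t]: feature j, task t
for _ in range(500):
    grad = [[0.0] * T for _ in range(D)]
    for t, (X, y) in enumerate(tasks):
        for row, yi in zip(X, y):
            err = sum(W[j][t] * row[j] for j in range(D)) - yi
            for j in range(D):
                grad[j][t] += 2 * err * row[j] / N
    for j in range(D):
        v = [W[j][t] - LR * grad[j][t] for t in range(T)]
        norm = math.sqrt(sum(x * x for x in v))
        scale = max(0.0, 1 - LR * LAM / norm) if norm > 0 else 0.0
        W[j] = [scale * x for x in v]  # group soft-thresholding
```

Because the penalty acts on whole rows, the irrelevant features are zeroed out in every task at once, which is the point of the joint-sparsity assumption.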

42
Learning a Sparse Representation
Learn a transformation of the features (aka dictionary learning).

43
Multi-task Feature Learning Linear Fitted Q-Iteration
Collect samples from each task
Create T regression datasets
Learn a sparse representation (multi-task feature learning)
Solve a multi-task sparse linear regression problem
Return the greedy policies

44
Theoretical Results
Number of samples (per task) needed for an accurate approximation using d features:
Standard approach: linear in d… too many samples!
Lasso: only log(d)! But no advantage from multiple tasks…
Group Lasso: decreasing in T! But the joint sparsity may be poor…
Representation learning: the smallest number of important features! But learning the representation may be expensive…

45
Empirical Results: Blackjack
NIPS 2014, with D. Calandriello and M. Restelli (PoliMi). Under study: application to other computer games.

46
Outline recap. Next: Conclusions

47
Conclusions
[Figure: learning performance with transfer vs. without transfer]

48
Thanks! Inria Lille – Nord Europe
