
1 The ideals vs. the reality of science. Ideals: the pursuit of verifiable answers; reality: highly cited papers for your c.v. Ideals: the validation of our results by reproduction; reality: convincing referees who did not see your code or data. Ideals: an altruistic, collective enterprise; reality: a race to outrun your colleagues in front of the giant bear of grant funding. Credit: Fernando Pérez @ PyCon 2014

2 Transfer Learning in RL. Matthew E. Taylor and Peter Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.

3 Inter-Task Transfer. Learning tabula rasa can be unnecessarily slow. Humans can use past information – e.g., soccer with different numbers of players. Agents can likewise leverage learned knowledge in novel tasks – bias learning: a speedup method.

4 Primary Questions. Source task: (S_SOURCE, A_SOURCE); target task: (S_TARGET, A_TARGET). Is it possible to transfer learned knowledge? Is it possible to transfer without providing a task mapping? We consider only reinforcement learning tasks; there is already lots of work in supervised settings.

5 Value Function Transfer. ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T). The action-value function is transferred; ρ is task-dependent and relies on inter-task mappings, since Q_S is not defined on S_T and A_T. [Diagram: source-task and target-task agent–environment loops, with Q_S : S_S × A_S → ℜ and Q_T : S_T × A_T → ℜ.]
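As a rough illustration of how ρ might be realized in the simplest (tabular) case, the sketch below initializes a target-task Q-table from a learned source-task Q-table via inter-task mappings. The function and variable names are assumptions made for illustration only; the work described in this deck uses CMAC function approximation rather than tables.

```python
# Hedged sketch: initialize a tabular target-task Q-function from a learned
# source-task Q-function via inter-task mappings (one way to realize rho).
# chi_x maps a target state to a source state; chi_a maps a target action to
# a source action. The tabular form and all names are illustrative assumptions.

def transfer_q_values(q_source, target_states, target_actions, chi_x, chi_a):
    """Return an initial Q-table for the target task.

    q_source: dict mapping (source_state, source_action) -> value
    chi_x:    callable, target_state  -> source_state
    chi_a:    callable, target_action -> source_action
    """
    q_target = {}
    for s_t in target_states:
        for a_t in target_actions:
            s_s, a_s = chi_x(s_t), chi_a(a_t)
            # rho: the target Q-value starts where the source Q-value left off.
            q_target[(s_t, a_t)] = q_source.get((s_s, a_s), 0.0)
    return q_target
```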

6 Enabling Transfer. [Diagram: the agent–environment loops for the source task (Q_S : S_S × A_S → ℜ) and the target task (Q_T : S_T × A_T → ℜ).]

7 Inter-Task Mappings. [Diagram: correspondence between the source task and the target task.]

8 Inter-Task Mappings. χ_X : s_target → s_source – given a state variable in the target task (some x from s = x_1, x_2, …, x_n), return the corresponding state variable in the source task. χ_A : a_target → a_source – similar, but for actions. Intuitive mappings exist in some domains (oracle); the mappings are used to construct the transfer functional. [Diagram: target task ⟨x_1 … x_n⟩, {a_1 … a_m} mapped by χ_X and χ_A to source task ⟨x_1 … x_k⟩, {a_1 … a_j}.]
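In code, such mappings can be as simple as lookup tables. The fragment below is a hedged sketch for the 4 vs. 3 → 3 vs. 2 Keepaway case described later in the deck (slide 21 gives the hand-coded χ_A); the concrete string names are assumptions, not the authors' exact encoding.

```python
# Illustrative inter-task mappings for Keepaway, written as plain dictionaries.
# chi_A maps each 4 vs. 3 action to a "similar" 3 vs. 2 action (see slide 21);
# chi_X does the same for state variables (only an assumed fragment is shown).
chi_A = {
    "hold_4v3":  "hold_3v2",
    "pass1_4v3": "pass1_3v2",
    "pass2_4v3": "pass2_3v2",
    "pass3_4v3": "pass2_3v2",   # the novel action maps to the closest source action
}

chi_X = {
    # target state variable -> source state variable (assumed names)
    "dist(K1,T1)_4v3": "dist(K1,T1)_3v2",
    "dist(K1,K2)_4v3": "dist(K1,K2)_3v2",
}
```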

9 Q-value Reuse. Could be considered a type of reward shaping: directly use expected rewards from the source task to bias the learner in the target task. Not function-approximator specific, and no initialization step is needed between learning the two tasks. Drawbacks include increased lookup time and larger memory requirements.
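A minimal sketch of the idea, assuming tabular Q-functions and hypothetical mapping functions chi_x and chi_a: the frozen source Q-values are added to the locally learned target values on every lookup, which is why both lookup time and memory grow.

```python
# Hedged sketch of Q-value Reuse: the source Q-function is kept around and
# consulted on every lookup to bias the target learner, while a separate,
# locally learned component is updated. All names are illustrative assumptions.

class QValueReuse:
    def __init__(self, q_source, chi_x, chi_a):
        self.q_source = q_source      # learned source-task Q: dict (s, a) -> value (frozen)
        self.q_learned = {}           # target-task component, learned from scratch
        self.chi_x, self.chi_a = chi_x, chi_a

    def q(self, s, a):
        # Total estimate = transferred source value + locally learned value.
        source_part = self.q_source.get((self.chi_x(s), self.chi_a(a)), 0.0)
        return source_part + self.q_learned.get((s, a), 0.0)

    def update(self, s, a, td_error, alpha=0.1):
        # Only the local component changes; both tables must be kept in memory.
        self.q_learned[(s, a)] = self.q_learned.get((s, a), 0.0) + alpha * td_error
```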

10 Example. [Slide images: "Cancer" and "Castle attack".]

11 Lazaric: transfer can – reduce the need for instances in the target task – reduce the need for domain expert knowledge. What can change between tasks – state space, state features – A (actions), T (transitions), R (rewards), goal state – learning method.

12 Transfer Evaluation Metrics. Set a threshold performance (here 8.5) that the majority of agents can achieve with learning. Two distinct scenarios: 1. Target Time Metric – transfer is successful if target-task learning time is reduced; the source task's "sunk cost" is ignored, the source task(s) are independently useful, and we only care about the target (the AI goal: effectively utilize past knowledge). 2. Total Time Metric – transfer is successful if the total (source + target) time is reduced; the source task(s) are not independently useful (the engineering goal: minimize total training). [Plot: performance vs. training time for target with no transfer, target with transfer, and target + source with transfer, with the threshold at 8.5.]
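A small sketch of these two comparisons, assuming each learning curve is simply a list of smoothed per-episode performance values measured in the same time units; the threshold value comes from the slide, while the names and units are illustrative assumptions.

```python
# Illustrative computation of the two transfer-evaluation metrics.
# A "curve" is a list of smoothed performance values, one per unit of training time.

def time_to_threshold(curve, threshold=8.5):
    """First time index at which performance reaches the threshold (None if never)."""
    for t, performance in enumerate(curve):
        if performance >= threshold:
            return t
    return None

def evaluate_transfer(source_time, target_curve_transfer, target_curve_scratch,
                      threshold=8.5):
    t_transfer = time_to_threshold(target_curve_transfer, threshold)
    t_scratch = time_to_threshold(target_curve_scratch, threshold)
    # Target Time Metric: source training is a sunk cost and is ignored.
    target_ok = t_transfer is not None and (t_scratch is None or t_transfer < t_scratch)
    # Total Time Metric: source training time counts against the transfer agent.
    total_ok = t_transfer is not None and (t_scratch is None or
                                           source_time + t_transfer < t_scratch)
    return target_ok, total_ok
```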

13 The previous metric measured "learning speed improvement." One can also have – asymptotic improvement – jumpstart improvement.

14 Value Function Transfer: time to threshold in 4 vs. 3. [Chart: target-task time and total time compared against no transfer.]

15 Results: scaling up to 5 vs. 4. [Chart: with transfer, 47% of no transfer; all results statistically significant when compared to no transfer.]

16 Problem Statement. Humans can select a training sequence, which results in faster training / better performance. This is a meta-planning problem for agent learning – known vs. unknown final task? Can it be posed as an MDP?

17 TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift – Learning from Easy Missions – changing the length of the cart pole.

18 Keepaway [Stone, Sutton, and Kuhlmann 2005]. [Diagram: keepers K1–K3 and takers T1–T2 on the field.] Goal: maintain possession of the ball. 3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables; the keeper with the ball may hold the ball or pass to either teammate; both takers move towards the player with the ball. 4 vs. 3: 7 agents, 4 actions, 19 state variables.

19 Learning Keepaway. Sarsa update – CMAC, RBF, and neural network function approximation all successful. Q^π(s, a): predicted number of steps the episode will last – reward = +1 for every timestep.
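A minimal tabular Sarsa sketch for this reward structure, so that Q predicts how many more timesteps the episode will last. The published agents use CMAC / RBF / neural-network approximation rather than a table, and the environment interface (reset() returning a hashable state, step(action) returning (next_state, done)) and the learning parameters here are assumptions made for illustration.

```python
# Hedged tabular Sarsa sketch for Keepaway-style rewards (+1 per surviving timestep).
import random
from collections import defaultdict

def sarsa_episode(env, q, actions, alpha=0.125, gamma=1.0, epsilon=0.05):
    def policy(s):
        # epsilon-greedy over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    s = env.reset()
    a = policy(s)
    done = False
    while not done:
        s_next, done = env.step(a)      # hypothetical interface; reward is +1 per step
        r = 1.0
        a_next = None if done else policy(s_next)
        target = r if done else r + gamma * q[(s_next, a_next)]
        q[(s, a)] += alpha * (target - q[(s, a)])
        s, a = s_next, a_next

q = defaultdict(float)   # Q(s, a): predicted number of remaining timesteps
```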

20 ρ's Effect on CMACs. [Diagram: 3 vs. 2 CMAC → ρ → 4 vs. 3 CMAC.] For each weight in the 4 vs. 3 function approximator: use the inter-task mapping to find the corresponding 3 vs. 2 weight.
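A hedged sketch of that weight-copying step, treating the CMAC as a dictionary of tile weights; `corresponding_source_tile` is a hypothetical helper standing in for applying χ_X and χ_A at the tile level.

```python
# Illustrative rho over CMAC weights: each tile weight in the new 4 vs. 3
# approximator is initialized from the corresponding 3 vs. 2 weight.

def initialize_target_cmac(target_tiles, source_weights, corresponding_source_tile):
    """target_tiles: iterable of tile indices in the 4 vs. 3 CMAC.
    source_weights: dict tile -> weight learned in 3 vs. 2.
    Returns an initial weight dict for the 4 vs. 3 CMAC."""
    target_weights = {}
    for tile in target_tiles:
        source_tile = corresponding_source_tile(tile)  # uses chi_X / chi_A internally
        target_weights[tile] = source_weights.get(source_tile, 0.0)
    return target_weights
```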

21 Keepaway: hand-coded χ_A. Hold_4v3 → Hold_3v2; Pass1_4v3 → Pass1_3v2; Pass2_4v3 → Pass2_3v2; Pass3_4v3 → Pass2_3v2. Actions in 4 vs. 3 have "similar" actions in 3 vs. 2.

22 Value Function Transfer: time to threshold in 4 vs. 3. [Chart: target-task time and total time compared against no transfer.]

23 Example Transfer Domains. Series of mazes with different goals [Fernandez and Veloso, 2006]; mazes with different structures [Konidaris and Barto, 2007].

24 Example Transfer Domains. Series of mazes with different goals [Fernandez and Veloso, 2006]; mazes with different structures [Konidaris and Barto, 2007]; Keepaway with different numbers of players [Taylor and Stone, 2005]; Keepaway to Breakaway [Torrey et al., 2005].

25 Example Transfer Domains. Series of mazes with different goals [Fernandez and Veloso, 2006]; mazes with different structures [Konidaris and Barto, 2007]; Keepaway with different numbers of players [Taylor and Stone, 2005]; Keepaway to Breakaway [Torrey et al., 2005]. All of these tasks are drawn from the same domain – task: an MDP – domain: a setting for semantically similar tasks. What about cross-domain transfer? – The source task could be much simpler – it would show that source and target can be less similar.

26 Source Task: Ringworld. Ringworld – goal: avoid being tagged; 2 agents; 3 actions; 7 state variables; fully observable; discrete state space (Q-table with ~8,100 (s, a) pairs); stochastic actions. The opponent moves directly towards the player; the player may stay or run towards a pre-defined location. 3 vs. 2 Keepaway – goal: maintain possession of the ball; 5 agents; 3 actions; 13 state variables; partially observable; continuous state space; stochastic actions. [Diagram: keepers K1–K3 and takers T1–T2.]

27 Source Task: Knight's Joust. Knight's Joust – goal: travel from the start to the goal line; 2 agents; 3 actions; 3 state variables; fully observable; discrete state space (Q-table with ~600 (s, a) pairs); deterministic actions. The opponent moves directly towards the player; the player may move North or take a knight's jump to either side. 3 vs. 2 Keepaway – goal: maintain possession of the ball; 5 agents; 3 actions; 13 state variables; partially observable; continuous state space; stochastic actions. [Diagram: keepers K1–K3 and takers T1–T2.]

28 Rule Transfer Overview. 1. Learn a policy (π : S → A) in the source task – TD, policy search, model-based, etc. 2. Learn a decision list, D_source, summarizing π. 3. Translate D_source → D_target (applies to the target task) – state variables and actions can differ in the two tasks. 4. Use D_target to learn a policy in the target task. This allows different learning methods and function approximators in the source and target tasks. [Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target.] A code skeleton of these four steps is sketched below.
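The skeleton below mirrors those four steps as plain Python; every helper is a placeholder name rather than part of any published codebase, and the following slides describe what each step does.

```python
# Illustrative skeleton of the four rule-transfer steps; each helper is a stub
# that a real implementation would fill in (see the following slides).

def learn_policy(source_env):
    """Step 1: learn pi : S -> A in the source task (TD, policy search, model-based, ...)."""
    raise NotImplementedError

def learn_decision_list(pi, source_env):
    """Step 2: summarize pi as an ordered list of if-then rules (D_source)."""
    raise NotImplementedError

def translate_rules(d_source, chi_x, chi_a):
    """Step 3: rewrite each rule's state variables and actions via the inter-task mappings."""
    raise NotImplementedError

def learn_target_policy(target_env, d_target):
    """Step 4: learn in the target task while using the translated rules as advice."""
    raise NotImplementedError

def rule_transfer(source_env, target_env, chi_x, chi_a):
    pi_source = learn_policy(source_env)
    d_source = learn_decision_list(pi_source, source_env)
    d_target = translate_rules(d_source, chi_x, chi_a)
    return learn_target_policy(target_env, d_target)
```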

29 Rule Transfer Details (step 1: learn π). [Pipeline: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target; diagram: source-task agent–environment loop.] In this work we use Sarsa – Q : S × A → Return; other learning methods are possible.

30 Rule Transfer Details (step 2: learn D_source). [Pipeline diagram as before.] Use the learned policy to record (state, action) pairs, then use JRip (RIPPER in Weka) to learn a decision list, e.g.: IF s_1 > 5 → a_1; ELSEIF s_1 < 3 → a_2; ELSEIF s_3 > 7 → a_1; … A sketch of this step follows below.
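A hedged sketch of this step: roll out the learned source policy, record (state, action) pairs, and fit an interpretable classifier to them. The original work uses JRip (RIPPER) in Weka; here a small scikit-learn decision tree stands in as the rule learner purely for illustration, and the environment/policy interfaces are assumptions.

```python
# Record (state, action) pairs from the learned source policy and summarize them
# with an interpretable classifier playing the role of D_source.
from sklearn.tree import DecisionTreeClassifier

def summarize_policy(env, policy, episodes=100):
    states, actions = [], []
    for _ in range(episodes):
        s, done = env.reset(), False        # hypothetical interface: state is a numeric vector
        while not done:
            a = policy(s)                   # e.g., greedy action from the learned Q-function
            states.append(list(s))
            actions.append(a)
            s, done = env.step(a)
    # Stand-in for JRip/RIPPER: a shallow tree keeps the learned "rules" readable.
    rule_learner = DecisionTreeClassifier(max_depth=4)
    rule_learner.fit(states, actions)
    return rule_learner                     # plays the role of D_source
```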

31 Rule Transfer Details (step 3: translate). [Pipeline diagram as before.] Inter-task mappings: χ_X : s_target → s_source – given a state variable in the target task (some x from s = x_1, x_2, …, x_n), return the corresponding state variable in the source task. χ_A : a_target → a_source – similar, but for actions. [Diagram: rule → translate (via χ_X, χ_A) → rule′.]

32 Rule Transfer Details (step 3 example: Ringworld → 3 vs. 2 Keepaway). [Pipeline diagram as before; Keepaway field diagram.] Translation of terms: Stay → Hold Ball; Run Near → Pass to K2; Run Far → Pass to K3; dist(Player, Opponent) → dist(K1, T1); … Example: "IF dist(Player, Opponent) > 4 → Stay" becomes "IF dist(K1, T1) > 4 → Hold Ball".
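The same translation in code, using the vocabulary from this slide; the tuple-based rule representation and the function name are illustrative assumptions.

```python
# Translate a source-task rule into target-task terms (Ringworld -> Keepaway).
variable_map = {"dist(Player, Opponent)": "dist(K1,T1)"}   # source variable -> target variable
action_map = {"Stay": "Hold Ball", "Run Near": "Pass to K2", "Run Far": "Pass to K3"}

def translate_rule(rule):
    """rule: (variable, comparison, threshold, action) in source-task vocabulary."""
    var, cmp_op, threshold, action = rule
    return (variable_map[var], cmp_op, threshold, action_map[action])

# "IF dist(Player, Opponent) > 4 -> Stay" becomes "IF dist(K1,T1) > 4 -> Hold Ball"
print(translate_rule(("dist(Player, Opponent)", ">", 4, "Stay")))
```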

33 Rule Transfer Details (step 4: use D_target). [Pipeline diagram as before.] There are many possible ways to use D_target: Value Bonus, Extra Action, Extra Variable. A TD learner is assumed in the target task; the approach should generalize to other learning methods. Example: the agent evaluates its 3 actions in state s = (s_1, s_2), with Q(s_1, s_2, a_1) = 5, Q(s_1, s_2, a_2) = 3, Q(s_1, s_2, a_3) = 4, and D_target(s) = a_2. Value Bonus (a form of shaping): add a bonus (+8) to the Q-value of the recommended action a_2. Extra Action: add a pseudo-action a_4 that takes the recommended action a_2, e.g. Q(s_1, s_2, a_4) = 7; initially the agent is forced to select it. Extra Variable: add the recommendation as an extra state variable s_3, e.g. Q(s_1, s_2, a_2, a_1) = 5, Q(s_1, s_2, a_2, a_2) = 9, Q(s_1, s_2, a_2, a_3) = 4. A sketch of the Value Bonus option follows below.
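As a hedged sketch, the Value Bonus variant can be written as a small wrapper over action evaluation; the bonus size and all names are assumptions, and the Extra Action / Extra Variable variants would instead add a pseudo-action or an extra state feature rather than a shaping term.

```python
# Value Bonus: the action recommended by the translated decision list receives a
# fixed bonus when the agent evaluates its actions (a form of shaping).

def action_values_with_bonus(q, s, actions, d_target, bonus=8.0):
    """Return {action: shaped value} for state s.

    q:        callable (state, action) -> learned Q-value
    d_target: callable state -> action recommended by the translated rules
    """
    recommended = d_target(s)
    return {a: q(s, a) + (bonus if a == recommended else 0.0) for a in actions}
```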

34 Comparison of Rule Transfer Methods. [Plot: learning curves for Without Transfer, Only Follow Rules, Extra Variable, Extra Action, and Value Bonus; rules taken from 5 hours of source-task training.]

35 Results: Extra Action. [Plot: learning curves for Extra Action transfer vs. without transfer; source-task training: Ringworld for 20,000 episodes, Knight's Joust for 50,000 episodes.]

