
1 Lisa Torrey University of Wisconsin – Madison Doctoral Defense May 2009

2 Given: Task S. Learn: Task T.

3 Reinforcement learning loop between Agent and Environment: the agent starts in state s1 with Q(s1, a) = 0; its policy chooses π(s1) = a1; the environment returns δ(s1, a1) = s2 and reward r(s1, a1) = r2; the agent updates Q(s1, a1) ← Q(s1, a1) + Δ; then π(s2) = a2, δ(s2, a2) = s3, r(s2, a2) = r3, and so on. The agent balances exploration and exploitation to maximize reward. Reference: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press 1998
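
A minimal tabular Q-learning sketch of this loop (the environment interface, learning rate, discount factor, and exploration rate are illustrative assumptions, not the thesis testbed):

import random
from collections import defaultdict

def q_learning(env, episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: repeat the observe-act-update loop from the slide."""
    Q = defaultdict(float)  # Q(s, a) starts at 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)  # delta(s, a) and r(s, a)
            best_next = max(Q[(s_next, act)] for act in env.actions)
            # Q(s, a) <- Q(s, a) + Delta
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q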

4 Performance vs. training curve: transfer aims for a higher start, a higher slope, and a higher asymptote.

5 Tasks: 3-on-2 BreakAway, 3-on-2 KeepAway, 3-on-2 MoveDownfield, 2-on-1 BreakAway. The Q-function is linear in state features: Q_a(s) = w1 f1 + w2 f2 + w3 f3 + … Hand-coded defenders; single learning agent.
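
A sketch of such a linear action-value model (the feature names and weights are illustrative, not the actual RoboCup feature set):

def linear_q(weights, features):
    """Q_a(s) = w1*f1 + w2*f2 + w3*f3 + ... for one action's weight vector."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical weights for one action and feature values for one state
weights_pass = [0.4, -0.2, 0.7]
features = [5.0, 30.0, 1.0]  # e.g., a distance, an angle, an indicator
print(linear_q(weights_pass, features))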

6 Starting-point methods Imitation methods Hierarchical methods Alteration methods New RL algorithms

7 Ground actions pass(t1) and pass(t2) are represented relationally as pass(Teammate); likewise Opponent 1 and Opponent 2 appear in rules of the form IF feature(Opponent) THEN …

8 Advice transfer: advice taking; inductive logic programming; skill-transfer algorithm, ECML 2006 (ECML 2005). Macro transfer: macro-operators; demonstration; macro-transfer algorithm, ILP 2007. Markov Logic Network transfer: Markov Logic Networks; MLNs in macros; MLN Q-function transfer algorithm, AAAI workshop 2008; MLN policy-transfer algorithm, ILP 2009.

9 Advice transfer Advice taking Inductive logic programming Skill-transfer algorithm Macro transfer Macro-operators Demonstration Macro-transfer algorithm Markov Logic Network transfer Markov Logic Networks MLNs in macros MLN Q-function transfer algorithm MLN policy-transfer algorithm

10 IF these conditions hold THEN pass is the best action

11 Try what worked in a previous task!

12 Batch Reinforcement Learning via Support Vector Regression (RL-SVR): the agent interacts with the environment in batches (Batch 1, Batch 2, …) and after each batch computes Q-functions (one per action) that minimize: ModelSize + C × DataMisfit.

13 Batch Reinforcement Learning with Advice (KBKR): the same batch loop, but the Q-functions now minimize: ModelSize + C × DataMisfit + µ × AdviceMisfit. Robust to negative transfer!
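
A schematic view of this trade-off as a plain penalized objective (the three terms are stand-ins; the thesis uses support-vector regression with advice as linear constraints, which this sketch does not reproduce):

def kbkr_objective(model_size, data_misfit, advice_misfit, C=1.0, mu=0.5):
    """ModelSize + C * DataMisfit + mu * AdviceMisfit.

    A large mu trusts the transferred advice more; letting mu shrink toward 0
    discounts bad advice, which is why the method is robust to negative transfer.
    """
    return model_size + C * data_misfit + mu * advice_misfit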

14 ILP searches a space of candidate clauses, for example:
IF [ ] THEN pass(Teammate)
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 15 THEN pass(Teammate)
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate)
IF distance(Teammate) ≤ 5 THEN pass(Teammate)
IF distance(Teammate) ≤ 10 THEN pass(Teammate)
…
Clauses are scored with F(β) = (1 + β²) × Precision × Recall / (β² × Precision + Recall).
Reference: De Raedt, Logical and Relational Learning, Springer 2008
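
The F(β) measure as code (this just restates the formula from the slide):

def f_beta(precision, recall, beta):
    """F(beta) = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: F(10) weights recall much more heavily than precision
print(f_beta(precision=0.9, recall=0.6, beta=10))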

15 Skill-transfer pipeline: ILP learns rules from the source task, e.g. IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate), and advice taking applies them in the target task.

16 Skill transfer from 3-on-2 MoveDownfield to 4-on-3 MoveDownfield:
IF distance(me, Teammate) ≥ 15 AND distance(me, Teammate) ≤ 27 AND distance(Teammate, rightEdge) ≤ 10 AND angle(Teammate, me, Opponent) ≥ 24 AND distance(me, Opponent) ≥ 4 THEN pass(Teammate)

17 Skill transfer from several tasks to 3-on-2 BreakAway Torrey et al. ECML 2006

18 Advice transfer Advice taking Inductive logic programming Skill-transfer algorithm Macro transfer Macro-operators Demonstration Macro-transfer algorithm Markov Logic Network transfer Markov Logic Networks MLNs in macros MLN Q-function transfer algorithm MLN policy-transfer algorithm

19 pass(Teammate) move(Direction) shoot(goalRight) shoot(goalLeft) IF [... ] THEN pass(Teammate) IF [... ] THEN move(ahead) IF [... ] THEN shoot(goalRight) IF [... ] THEN shoot(goalLeft) IF [... ] THEN pass(Teammate) IF [... ] THEN move(left) IF [... ] THEN shoot(goalRight) IF [... ] THEN shoot(goalRight)

20 Demonstration: the policy transferred from the source task is used at the start of target-task training. No more protection against negative transfer! But… the best-case scenario could be very good.

21 Macro-transfer pipeline: ILP learns a macro from the source task; demonstration applies it in the target task.

22 Learning macro structures. Positive examples: BreakAway games that score. Negative examples: BreakAway games that didn't score. ILP learns, e.g.: IF actionTaken(Game, StateA, pass(Teammate), StateB) AND actionTaken(Game, StateB, move(Direction), StateC) AND actionTaken(Game, StateC, shoot(goalRight), StateD) AND actionTaken(Game, StateD, shoot(goalLeft), StateE) THEN isaGoodGame(Game)

23 Learning rules for arcs. Positive examples: states in good games that took the arc. Negative examples: states in good games that could have taken the arc but didn't. ILP learns rules such as IF [ … ] THEN enter(State) and IF [ … ] THEN loop(State, Teammate) for nodes like pass(Teammate) and shoot(goalRight).

24 Selecting and scoring rules. Candidates are ranked by precision (Rule 1: precision = 1.0; Rule 2: precision = 0.99; Rule 3: precision = 0.96; …); a rule is added to the ruleset if it increases the ruleset's F(10). Rule score = (# good games that follow the rule) / (# games that follow the rule).
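
A sketch of this greedy selection and scoring, assuming each candidate rule carries its precision and that an F(10) routine for rulesets is available (both are illustrative stand-ins):

def select_rules(candidates, f10_of):
    """Greedily keep precision-ranked rules that raise the ruleset's F(10)."""
    ruleset = []
    for rule in sorted(candidates, key=lambda r: r.precision, reverse=True):
        if f10_of(ruleset + [rule]) > f10_of(ruleset):
            ruleset.append(rule)
    return ruleset

def rule_score(games_following_rule, good_games):
    """(# good games that follow the rule) / (# games that follow the rule)."""
    followed = len(games_following_rule)
    good = sum(1 for g in games_following_rule if g in good_games)
    return good / followed if followed else 0.0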

25 Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway pass(Teammate) move(ahead) pass(Teammate) move(right) shoot(goalLeft) move(right) move(left) shoot(goalLeft) shoot(goalRight) move(left) shoot(goalLeft) shoot(goalRight) move(ahead) move(right) shoot(goalLeft) shoot(goalRight) move(away) shoot(goalLeft)shoot(goalRight) move(right) shoot(goalLeft) shoot(goalRight) shoot(goalLeft) shoot(goalRight) shoot(GoalPart)

26 Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway Torrey et al. ILP 2007

27 Macro self-transfer in 2-on-1 BreakAway (probability of goal vs. training games): asymptote 56%, initial 1%, single macro 32%, multiple macros 43%.

28 Advice transfer Advice taking Inductive logic programming Skill-transfer algorithm Macro transfer Macro-operators Demonstration Macro-transfer algorithm Markov Logic Network transfer Markov Logic Networks MLNs in macros MLN Q-function transfer algorithm MLN policy-transfer algorithm

29 Markov Logic Networks: a set of formulas (F), e.g. evidence1(X) AND query(X), evidence2(X) AND query(X), with weights (W), e.g. w0 = 1.1, w1 = 0.9. ni(world) = number of true groundings of the i-th formula in the world; groundings such as query(x1) and query(x2) connect to their evidence nodes. Reference: Richardson and Domingos, Markov Logic Networks, Machine Learning 2006
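
A tiny sketch of the resulting distribution, P(world) ∝ exp(Σ w_i · n_i(world)), as given in the Richardson and Domingos reference (the two-formula, two-world example below is illustrative):

import math

def world_probabilities(weights, counts_per_world):
    """Normalize exp(sum_i w_i * n_i(world)) over the candidate worlds."""
    scores = [math.exp(sum(w * n for w, n in zip(weights, counts)))
              for counts in counts_per_world]
    z = sum(scores)
    return [s / z for s in scores]

# Weights w0 = 1.1, w1 = 0.9; world A has one true grounding of each formula,
# world B has none.
print(world_probabilities([1.1, 0.9], [[1, 1], [0, 0]]))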

30 From ILP: rules of the form IF [ … ] THEN … Alchemy then learns the weights, e.g. w0 = 1.1, producing an MLN. Reference: http://alchemy.cs.washington.edu

31 IF distance(Teammate, goal) < 12 THEN pass(Teammate) matches t1 with score 0.92; IF angle(Teammate, defender) > 30 THEN pass(Teammate) matches t2 with score 0.88. The pass(Teammate) MLN gives P(t1) = 0.35, P(t2) = 0.65.

32 Grounded network for the pass(Teammate) MLN: formulas pass(Teammate) AND angle(Teammate, defender) > 30 and pass(Teammate) AND distance(Teammate, goal) < 12, grounded with pass(t1), pass(t2), distance(t1, goal) < 12, distance(t2, goal) < 12, angle(t1, defender) > 30, angle(t2, defender) > 30.

33 Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

34 Macro self-transfer in 2-on-1 BreakAway (probability of goal vs. training games): asymptote 56%, initial 1%, regular macro 32%, macro with MLN 43%.

35 MLN Q-function transfer pipeline: ILP and Alchemy learn from the source task an MLN Q-function, one MLN per action mapping state to Q-value, which is used via demonstration in the target task.

36 Each action's Q-values are discretized into bins: 0 ≤ Qa < 0.2, 0.2 ≤ Qa < 0.4, 0.4 ≤ Qa < 0.6, …; for a given state the MLN assigns a probability to each bin (bin number vs. probability histograms).
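
One way to collapse such bin probabilities into a single Q estimate is a probability-weighted average of bin values; this sketch uses bin midpoints as the representative values, an illustrative choice rather than the thesis's exact procedure:

def expected_q(bin_edges, bin_probs):
    """Estimate Q as the probability-weighted average of bin midpoints."""
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    return sum(p * m for p, m in zip(bin_probs, midpoints))

# Bins [0, 0.2), [0.2, 0.4), [0.4, 0.6) with MLN probabilities 0.1, 0.3, 0.6
print(expected_q([0.0, 0.2, 0.4, 0.6], [0.1, 0.3, 0.6]))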

37 MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway, example rules:
IF distance(me, GoalPart) ≥ 42 AND distance(me, Teammate) ≥ 39 THEN pass(Teammate) falls into [0, 0.11]
IF angle(topRight, goalCenter, me) ≤ 42 AND angle(topRight, goalCenter, me) ≥ 55 AND angle(goalLeft, me, goalie) ≥ 20 AND angle(goalCenter, me, goalie) ≤ 30 THEN pass(Teammate) falls into [0.11, 0.27]
IF distance(Teammate, goalCenter) ≤ 9 AND angle(topRight, goalCenter, me) ≤ 85 THEN pass(Teammate) falls into [0.27, 0.43]

38 MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway Torrey et al. AAAI workshop 2008

39 MLN policy transfer pipeline: ILP and Alchemy learn from the source task an MLN policy, a single MLN (F, W) mapping state and action to a probability, which is used via demonstration in the target task.

40 The MLN assigns each action (move(ahead), pass(Teammate), shoot(goalLeft), …) a probability; the policy takes the highest-probability action.
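
The policy itself is then a one-liner, given a hypothetical mapping from actions to the MLN's probabilities:

def mln_policy(action_probs):
    """Pick the action to which the MLN assigns the highest probability."""
    return max(action_probs, key=action_probs.get)

print(mln_policy({"move(ahead)": 0.2, "pass(Teammate)": 0.5, "shoot(goalLeft)": 0.3}))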

41 MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway, example rules:
IF angle(topRight, goalCenter, me) ≤ 70 AND timeLeft ≥ 98 AND distance(me, Teammate) ≥ 3 THEN pass(Teammate)
IF distance(me, GoalPart) ≥ 36 AND distance(me, Teammate) ≥ 12 AND timeLeft ≥ 91 AND angle(topRight, goalCenter, me) ≤ 80 THEN pass(Teammate)
IF distance(me, GoalPart) ≥ 27 AND angle(topRight, goalCenter, me) ≤ 75 AND distance(me, Teammate) ≥ 9 AND angle(Teammate, me, goalie) ≥ 25 THEN pass(Teammate)

42 MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway Torrey et al. ILP 2009

43 MLN self-transfer in 2-on-1 BreakAway (probability of goal vs. training games): asymptote 56%, initial 1%, MLN Q-function 59%, MLN policy 65%.

44 Advice transfer: advice taking; inductive logic programming; skill-transfer algorithm, ECML 2006 (ECML 2005). Macro transfer: macro-operators; demonstration; macro-transfer algorithm, ILP 2007. Markov Logic Network transfer: Markov Logic Networks; MLNs in macros; MLN Q-function transfer algorithm, AAAI workshop 2008; MLN policy-transfer algorithm, ILP 2009.

45 Related work. Starting-point: Taylor et al. 2005, value-function transfer. Imitation: Fernandez and Veloso 2006, policy reuse. Hierarchical: Mehta et al. 2008, MaxQ transfer. Alteration: Walsh et al. 2006, aggregate states. New algorithms: Sharma et al. 2007, case-based RL.

46 Transfer can improve reinforcement learning in both initial performance and learning speed. Advice transfer: low initial performance, steep learning curves, robust to negative transfer. Macro transfer and MLN transfer: high initial performance, shallow learning curves, vulnerable to negative transfer.

47 In both close-transfer and distant-transfer scenarios, the methods rank: Multiple Macro = MLN Policy = Single Macro ≥ MLN Q-Function ≥ Skill Transfer.

48 Multiple source tasks: transfer from several source tasks (Task S1, …) to Task T.

49 Theoretical results: How high can the initial performance be? How quickly can the target-task learner improve? How many episodes are “saved” through transfer? What is the source-target relationship?

50 Joint learning and inference in macros: a single search with combined rule/weight learning (e.g., over pass(Teammate) and move(Direction) nodes).

51 Refinement of transferred knowledge. Macros: revising rule scores, relearning rules, relearning structure. MLNs: revising weights, relearning rules. For example, a too-specific clause or a too-general clause can be revised into a better clause (Mihalkova et al. 2007).

52 Relational reinforcement learning: Q-learning with an MLN Q-function, or policy search with MLN policies or macros. MLN Q-functions lose too much information (only a bin-number vs. probability distribution survives).

53 General challenges in RL transfer: diverse tasks, complex testbeds, automated mapping, protection against negative transfer.

54 Advisor: Jude Shavlik. Collaborators: Trevor Walker and Richard Maclin. Committee: David Page, Mark Craven, Jerry Zhu, Michael Coen. UW Machine Learning Group. Grants: DARPA HR0011-04-1-0007, NRL N00173-06-1-G002, DARPA FA8650-06-C-7606.

55

56 Starting-point methods: before target-task training, the Q-table is initialized from the source task (e.g., values 2 5 4 8 / 9 1 7 2 / 5 9 1 4) rather than all zeros (no transfer).
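
A minimal sketch of starting-point initialization, with a hypothetical 3×4 Q-table copied from a source-task learner:

import copy

def init_q_table(source_q=None, n_states=3, n_actions=4):
    """Start the target task from the source Q-table when one is available."""
    if source_q is not None:
        return copy.deepcopy(source_q)  # starting-point transfer
    return [[0.0] * n_actions for _ in range(n_states)]  # no transfer

source_q = [[2, 5, 4, 8], [9, 1, 7, 2], [5, 9, 1, 4]]
print(init_q_table(source_q))  # transfer
print(init_q_table())          # no transfer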

57 Imitation methods: the source-task policy is used during target-task training.

58 Hierarchical methods: low-level skills (Run, Kick, Pass, Shoot) serve as components of the high-level task (Soccer).

59 Alteration methods: Task S's original states, actions, or rewards are altered into new states, actions, or rewards.

60 Advice taking: from the source task, advice of the form IF Q(pass(Teammate)) > Q(other) THEN pass(Teammate) is given to the target task.

61 Flowchart for generating ILP training examples: based on whether the action taken was pass(X), whether the outcome was caught(X), whether pass(X) was good / clearly best / clearly bad, and whether some action was good, a state becomes a positive example for pass(X), a negative example for pass(X), or is rejected.

62 Exact inference: let x1 be the world where pass(t1) is true and x0 the world where pass(t1) is false. Note: when pass(t1) is false, no formulas are true. Formulas: pass(t1) AND angle(t1, defender) > 30; pass(t1) AND distance(t1, goal) < 12.
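
Given that observation, the query probability reduces to a two-world computation; this sketch reuses the earlier illustrative weights (w0 = 1.1, w1 = 0.9) and assumes both formulas are satisfied in the world where pass(t1) is true:

import math

def p_query_true(weights_of_satisfied_formulas):
    """P(query) = exp(sum of weights) / (exp(sum of weights) + exp(0)),
    valid when every formula contains the query and so is false in x0."""
    score_x1 = math.exp(sum(weights_of_satisfied_formulas))  # pass(t1) true
    score_x0 = math.exp(0.0)                                  # no formulas true
    return score_x1 / (score_x1 + score_x0)

print(p_query_true([1.1, 0.9]))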

63 Exact Inference

