
1 Tópicos Especiais em Aprendizagem. Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2012

2 Goal of This Lecture
• Reinforcement Learning:
– Planning and Learning.
– Relational Reinforcement Learning.
– Using heuristics to speed up RL.
– Dimensions of RL - conclusions.
• Today's class:
– Chapters 9 and 10 of Sutton & Barto.
– The Relational RL paper, MLJ, 2001.
– Bianchi's PhD thesis.
This is a more informative, overview-style lecture.

3 Planning and Learning (Chapter 9 of Sutton & Barto)

4 Objectives
• Use of environment models.
• Integration of planning and learning methods.

5 Models
• Model: anything the agent can use to predict how the environment will respond to its actions.
– Distribution model: a description of all possible outcomes and their probabilities, e.g., the transition probabilities and expected rewards for every state-action pair.
– Sample model: produces sample experiences, e.g., a simulation model.
• Both types of model can be used to produce simulated experience.
• Sample models are often much easier to come by. (A small sketch contrasting the two model types follows below.)
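To make the distinction concrete, here is a minimal Python sketch (not part of the original slides) of the two kinds of model for a hypothetical two-state MDP; the transition table P and the class names are illustrative assumptions.

import random

# Hypothetical toy MDP: for each (state, action), all possible
# (probability, next_state, reward) outcomes.
P = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}

class DistributionModel:
    """Distribution model: describes every outcome and its probability."""
    def outcomes(self, state, action):
        return P[(state, action)]

class SampleModel:
    """Sample model: returns one sampled experience, like a simulator."""
    def sample(self, state, action):
        triples = P[(state, action)]
        probs = [p for p, _, _ in triples]
        _, next_state, reward = random.choices(triples, weights=probs)[0]
        return next_state, reward

Either model can drive planning: the distribution model supports full (expected) backups, while the sample model supports sample backups.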

6 Planning
• Planning: any computational process that uses a model to create or improve a policy.
• Planning in AI:
– state-space planning;
– plan-space planning (e.g., partial-order planning).

7 Planning in RL
• We take the following (unusual) view:
– all state-space planning methods involve computing value functions, either explicitly or implicitly;
– they all apply backups to simulated experience.

8 Planning (cont.)
• Classical DP methods are state-space planning methods.
• Heuristic search methods are state-space planning methods.

9 Random-Sample One-Step Tabular Q-Planning
• A planning method based on Q-learning (the algorithm box appears on the slide; a sketch follows below).
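A sketch of that method in Python, assuming a sample_model(s, a) callable and explicit states/actions lists (none of these names come from the slides):

import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_steps, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning (sketch).

    Repeatedly: pick a state-action pair at random, ask the sample model
    for a next state and reward, then apply a one-step Q-learning backup.
    """
    Q = defaultdict(float)                       # Q[(s, a)] -> estimated value
    for _ in range(n_steps):
        s = random.choice(states)
        a = random.choice(actions)
        s_next, r = sample_model(s, a)           # simulated experience
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q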

10 Learning, Planning, and Acting
• Two uses of real experience:
– model learning: to improve the model;
– direct RL: to directly improve the value function and policy.
• Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

11 Direct vs. Indirect RL
• Indirect methods:
– make fuller use of experience: get a better policy with fewer environment interactions.
• Direct methods:
– simpler;
– not affected by bad models.
• But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can all occur simultaneously and in parallel.

12 The Dyna Architecture (Sutton, 1990)

13 The Dyna-Q Algorithm (shown as a boxed algorithm on the slide, with its steps labelled direct RL, model learning, and planning; a sketch follows below)
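A compact Python sketch of those three components; the environment interface (env.reset / env.step returning next state, reward, and a done flag) and the hyperparameter names are assumptions, not part of the original algorithm box.

import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                       # action values
    model = {}                                   # learned model: (s, a) -> (r, s')

    def policy(s):                               # epsilon-greedy action selection
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # (a) direct RL: one-step Q-learning on the real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
            # (b) model learning: remember the observed (deterministic) transition
            model[(s, a)] = (r, s_next)
            # (c) planning: n extra backups on experience simulated from the model
            for _ in range(n_planning):
                (ps, pa), (pr, pn) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(pn, b)] for b in actions) - Q[(ps, pa)])
            s = s_next
    return Q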

14 Dyna-Q on a Simple Maze (rewards are 0 until the goal is reached, where the reward is 1)

15 Dyna-Q Snapshots: Midway in 2nd Episode

16 Prioritized Sweeping
• Which states or state-action pairs should be generated during planning?
• Work backwards from states whose values have just changed:
– maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change;
– when a new backup occurs, insert predecessors according to their priorities;
– always perform backups from the first pair in the queue.
• Moore and Atkeson, 1993; Peng and Williams, 1993. (A sketch of the planning phase follows below.)
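A sketch of the planning phase just described, assuming a deterministic learned model, a predecessors table, and a threshold theta (all names are illustrative; in the full algorithm the priority queue persists across real steps):

import heapq
import itertools

def prioritized_sweeping_step(Q, model, predecessors, actions, s, a, r, s_next,
                              n_backups=5, alpha=0.1, gamma=0.95, theta=1e-4):
    """One planning phase of prioritized sweeping (sketch).

    model:        (s, a) -> (r, s') learned from real experience
    predecessors: s -> set of (s_pred, a_pred) pairs known to lead to s
    """
    tie = itertools.count()                      # tie-breaker for the heap
    pq = []                                      # max-priority queue via negated priorities
    p = abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    if p > theta:
        heapq.heappush(pq, (-p, next(tie), (s, a)))

    for _ in range(n_backups):
        if not pq:
            break
        _, _, (ps, pa) = heapq.heappop(pq)       # pair whose value would change most
        pr, pn = model[(ps, pa)]
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(pn, b)] for b in actions) - Q[(ps, pa)])
        # insert predecessors of ps according to how much they would now change
        for (qs, qa) in predecessors.get(ps, ()):
            qr, _ = model[(qs, qa)]
            pri = abs(qr + gamma * max(Q[(ps, b)] for b in actions) - Q[(qs, qa)])
            if pri > theta:
                heapq.heappush(pq, (-pri, next(tie), (qs, qa)))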

17 Prioritized Sweeping

18 Prioritized Sweeping vs. Dyna-Q (both use N = 5 backups per environmental interaction)

19 Summary
• Emphasized the close relationship between planning and learning.
• Important distinction between distribution models and sample models.
• Looked at some ways to integrate planning and learning:
– synergy among planning, acting, and model learning.

20 Summary
• Distribution of backups: where to focus the computation:
– trajectory sampling: back up along trajectories;
– prioritized sweeping;
– heuristic search.
• Size of backups: full vs. sample; deep vs. shallow.

21 Relational Reinforcement Learning (based on a lecture by Tayfun Gürel)

22 Relational Representations
• In most applications the state space is too large.
• Generalization over states is essential.
• Many states are similar in some respects.
• Representations for RL have to be enriched to support generalization.
• RRL was proposed by Dzeroski, De Raedt and Blockeel (1998).

23 Blocks World
• An action: move(a,b)
– precondition: clear(a) ∈ s and clear(b) ∈ s
• Example state: s1 = {clear(b), clear(a), on(b,c), on(c,floor), on(a,floor)}
(A small illustration of this representation follows below.)
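A small illustration (mine, not from the slides) of the same state encoded as a set of ground atoms, with the precondition check for move:

# State s1 from the slide, encoded as a set of ground atoms (predicate, args...).
s1 = {("clear", "a"), ("clear", "b"),
      ("on", "b", "c"), ("on", "c", "floor"), ("on", "a", "floor")}

def can_move(state, x, y):
    """Precondition of move(x, y): clear(x) and clear(y) both hold in the state."""
    return ("clear", x) in state and ("clear", y) in state

print(can_move(s1, "a", "b"))   # True: both a and b are clear
print(can_move(s1, "a", "c"))   # False: b is on c, so clear(c) does not hold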

24 Relational Representations
• With relational representations for states:
– abstraction from details, e.g., the learned rule
  q_value(0.72) :- goal_unstack, numberofblocks(A), action_move(B,C), height(D,E), E=2, on(C,D), !.
– flexibility with respect to goal changes: retraining from scratch is not necessary;
– transfer of experience to more complex domains.

25 Relational Reinforcement Learning
• How does it work? An integration of RL with ILP (inductive logic programming).
• Do forever:
– use Q-learning to generate sample Q-values for sample state-action pairs;
– generalize them using ILP (in this case, TILDE).
(A high-level sketch of this loop follows below.)
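A high-level sketch of this loop; run_episode (which returns a list of (state, action, reward) steps) and regression_tree_learner (standing in for TILDE-RT) are assumed helpers, not names from the paper:

def relational_q_learning(run_episode, regression_tree_learner, n_episodes, gamma=0.9):
    """Q-RRL outer loop (sketch): sample Q-values with Q-learning,
    then generalize them with an ILP regression-tree learner."""
    examples = []                   # (relational_state, action, q_value) triples
    q_tree = None                   # current generalized Q-function
    for _ in range(n_episodes):
        # 1. Run one episode; the current tree (if any) can guide action selection.
        trajectory = run_episode(q_tree)
        # 2. Back up rewards along the trajectory to get sample Q-values.
        q = 0.0
        for state, action, reward in reversed(trajectory):
            q = reward + gamma * q
            examples.append((state, action, q))
        # 3. Generalize the collected examples into a logical regression tree.
        q_tree = regression_tree_learner(examples)
    return q_tree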

26 TILDE (Top-Down Induction of Logical Decision Trees)
• The generalization of Q-values is represented by a logical decision tree.
• Logical decision trees:
– nodes are first-order logic atoms (Prolog queries used as tests), e.g., on(A,c): is there any block on c?
– training data is a relational database or a Prolog knowledge base.

27 Logical Decision Tree vs. Decision Tree (a decision tree and a logical decision tree deciding whether the blocks are stacked)

28 TILDE Algorithm
• Declarative bias: e.g., on(+,-).
• Background knowledge: a Prolog program (an example fragment is shown on the slide).

29 Examples generated by Q-RRL learning, and the Q-RRL algorithm (shown as figures on the slide)

30 Logical regression tree generated by TILDE-RT, and the equivalent Prolog program (shown on the slide)

31 Experiments
• Tested for three different goals:
– 1. one-stack
– 2. on(a,b)
– 3. unstack
• Also tested for:
– a fixed number of blocks;
– the number of blocks changed after learning;
– the number of blocks changed while learning.
• P-RRL vs. Q-RRL.

32 Results: Fixed Number of Blocks
• Accuracy of random policies (plot on the slide).
• Accuracy: percentage of correctly classified (s,a) pairs (optimal vs. non-optimal).

33 Results: Fixed Number of Blocks

34 Results: Evaluating Learned Policies on a Varying Number of Blocks

35 Conclusion
• RRL shows promising initial results but needs more research.
• RRL is more successful when the number of blocks is increased (it generalizes better to more complex domains).
• Theoretical research explaining why it works is still missing.

36 Using Heuristics to Accelerate Reinforcement Learning (separate PDF)

