Published by Joanna Hotchkiss. Modified over 2 years ago.

1
1 Special Topics in Learning. Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2007

2
2 Objective of This Lecture
• Reinforcement learning:
  – Monte Carlo methods.
  – Temporal-difference learning.
  – Eligibility traces.
• Today's lecture: chapters 5, 6, and 7 of Sutton & Barto.

3
3 Review of the previous lecture.

4
4 What Is Reinforcement Learning?
• Learning by interaction.
• Goal-oriented learning.
• Learning about, from, and while interacting with an external environment.
• Learning what to do:
  – How to map situations to actions.
  – So as to maximize a numerical reward signal.

5
5 The Agent in RL
• Situated in time.
• Continual learning and planning.
• The objective is to affect the environment.
[Diagram: agent-environment loop with labels Environment, Action, State, Reward, Agent]

6
6 Elements of RL
[Diagram: Policy, Reward, Value, Model of environment]
• Policy: what to do.
• Reward: what is good.
• Value: what is good because it predicts reward.
• Model: what causes what.

7
7 The Agent-Environment Interface
[Diagram: interaction trajectory $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$]

8
8 The Agent Learns a Policy
• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.

9
9 Goals and Rewards
• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal must be outside the agent's direct control, and thus outside the agent.
• The agent must be able to measure success:
  – explicitly;
  – frequently during its lifespan.

10
10 Returns
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where T is a final time step at which a terminal state is reached, ending an episode.

11
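11 Important!!!
• These are very different:
  – Reward ($r_t$): what the agent receives when it takes an action.
  – Return ($R_t$): the cumulative reward collected from step t onward.
• One possible relation between the two, for an episodic task: $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$
• Expected return ($E\{R_t\}$):
  – This is what we want to maximize.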

12
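12 Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.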
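As a small illustration, the discounted return (each reward k steps ahead weighted by gamma to the power k) can be computed for a finite reward list as in the Python sketch below. The function name `discounted_return` and the sample rewards are assumptions made for this example, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite list of rewards r_{t+1}, r_{t+2}, ..."""
    g = 0.0
    # Work backwards so each step applies R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With gamma = 1 the same function gives the undiscounted episodic return.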

13
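13 The Markov Property
• Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:
$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$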

14
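14 Defining a Markov Decision Process
• To define a finite MDP, you need to give:
  – state set $S$ and action sets $A(s)$.
  – one-step "dynamics" defined by transition probabilities: $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
  – reward expectations: $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$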

15
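15 Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$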

16
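16 Value Functions
• The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$: $Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$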

17
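17 Bellman Equation for a Policy
The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$
So: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$
Or, without the expectation operator: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$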

18
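18 Policy Iteration
[Diagram: $\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^*$, alternating policy evaluation and policy improvement ("greedification")]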

19
19 Policy Iteration

20
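20 Value Iteration
Recall the full policy evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V_k(s')]$
Here is the full value iteration backup: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V_k(s')]$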
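The value iteration backup (take, for each state, the maximum over actions of the expected one-step reward plus discounted next-state value) can be sketched in Python as follows. The dictionary layout `P[state][action] = [(probability, next_state, reward), ...]`, the three-state toy MDP, and the in-place sweep are assumptions chosen for this illustration.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Sweep the value iteration backup until the largest change is below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            # Backup: max over actions of the expected reward plus discounted value.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update; later states in the sweep see the new value
        if delta < theta:
            return V


# Hypothetical deterministic MDP: from A, "quit" pays 1 now, "go" leads to B,
# from which "go" pays 10 and ends the episode.
P = {
    "A": {"go": [(1.0, "B", 0.0)], "quit": [(1.0, "end", 1.0)]},
    "B": {"go": [(1.0, "end", 10.0)]},
    "end": {},
}
V = value_iteration(P, gamma=0.9)
print(V)  # V["B"] is 10 and V["A"] is 0.9 * 10 = 9
```

Note that this needs the transition model `P`, which is exactly the requirement that the Monte Carlo methods later in the lecture avoid.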

21
21 Value Iteration Cont.

22
22 End of the Review
• Important:
  – Basic concepts well understood.
• Problem:
  – DP requires the state-transition model P.
• How can this problem be solved if the model is not known?

23
23 Monte Carlo Methods. Chapter 5 of Sutton and Barto.

24
24 Monte Carlo Methods
• Monte Carlo methods learn from complete sample returns.
  – Defined for episodic tasks.
• Monte Carlo methods make it possible to learn directly from experience:
  – On-line: no model is needed to attain the optimal solution.
  – Simulated: no complete model is needed.

25
25 Wikipedia: Monte Carlo Definition
• Monte Carlo methods are a widely used class of computational algorithms for simulating the behavior of various physical and mathematical systems.
• They are distinguished from other simulation methods (such as molecular dynamics) by being stochastic, usually by using random numbers, as opposed to deterministic algorithms.
• Because of the repetition of algorithms and the large number of calculations involved, Monte Carlo needs large computing power.

26
26 Wikipedia: Monte Carlo Definition
• A Monte Carlo algorithm is a numerical Monte Carlo method used to find solutions to mathematical problems (which may have many variables) that cannot easily be solved, for example, by integral calculus or other numerical methods.
• For many types of problems, its efficiency relative to other numerical methods increases as the dimension of the problem increases.

27
27 Monte Carlo principle
• Consider the game of solitaire: what's the chance of winning with a properly shuffled deck?
• Hard to compute analytically, because winning or losing depends on a complex procedure of reorganizing cards.
• Insight: why not just play a few hands and see empirically how many do in fact win?
• More generally, we can approximate a probability density function using only samples from that density.
[Figure: sample hands labeled Lose and Win; the estimated chance of winning is 1 in 4.]
http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt

28
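28 Monte Carlo principle
• Given a very large set $X$ and a distribution $p(x)$ over it.
• We draw a set of N samples $x^{(1)}, \ldots, x^{(N)}$ from $p(x)$.
• We can then approximate the distribution using these samples: $p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(x^{(i)} = x)$, which approaches $p(x)$ as N grows.
[Figure: the set $X$ and the distribution $p(x)$]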

29
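29 Monte Carlo principle
• We can also use these samples to compute expectations: $E_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$, which approaches $E(f) = \sum_x f(x) p(x)$ as N grows.
• And even use them to find a maximum: $\hat{x} = \arg\max_{x^{(i)}} p(x^{(i)})$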
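A minimal Python sketch of estimating an expectation from samples, as described on this slide: draw N samples and average f over them. The helper name `mc_expectation`, the fair six-sided die, and the choice f(x) = x² are assumptions made for this example.

```python
import random


def mc_expectation(f, sample, n, rng):
    """Average f over n independent samples drawn by sample(rng)."""
    return sum(f(sample(rng)) for _ in range(n)) / n


def roll_die(rng):
    """Hypothetical distribution p(x): a fair six-sided die."""
    return rng.randint(1, 6)


rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = mc_expectation(lambda x: x * x, roll_die, 100_000, rng)
exact = sum(x * x for x in range(1, 7)) / 6  # 91/6, about 15.17
print(estimate, exact)
```

The estimate only needs the ability to sample from p(x); the exact sum is computed here just to have something to compare against.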

30
30 Monte Carlo Example: Approximation of π (the number)...
• If a circle of radius r = 1 is inscribed inside a square with side length L = 2, then we obtain: $\frac{S_{circle}}{S_{square}} = \frac{\pi r^2}{L^2} = \frac{\pi}{4}$
http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd

31
31 MC Example: Approximation of π (the number)...
• Inside the square, we place N points with (x, y) coordinates drawn at random from a uniform distribution.
• We then count how many points have fallen inside the circle.
http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd

32
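32 MC Example: Approximation of π (the number)...
• If N is large enough, the fraction of points inside the circle approximates the ratio of the circle's area to the square's: $\frac{N_{circle}}{N} \approx \frac{\pi}{4}$, so $\pi \approx 4 N_{circle} / N$.
http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd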

33
33 MC Example: Approximation of π (the number)...
• For N = 1000:
  – $N_{circle}$ = 768
  – Pi = 3.072
  – Error = 0.07

34
34 MC Example: Approximation of π (the number)...
• For N = 10000:
  – $N_{circle}$ = 7802
  – Pi = 3.1208
  – Error = 0.021

35
35 MC Example: Approximation of π (the number)...
• For N = 100000:
  – $N_{circle}$ = 78559
  – Pi = 3.1426
  – Error = 0.008
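The experiment on the preceding slides (scatter N uniform points in the square of side 2, count those inside the unit circle, multiply the fraction by 4) can be reproduced with a short Python sketch. The function name `estimate_pi` and the fixed seed are assumptions for this example; the counts it produces will differ from the ones reported on the slides because the random draws differ.

```python
import random


def estimate_pi(n, seed=0):
    """Estimate pi from n uniform points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the point fell inside the unit circle
            inside += 1
    # N_circle / N approximates pi / 4, so multiply by 4.
    return 4.0 * inside / n


for n in (1_000, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The error shrinks roughly in proportion to one over the square root of N, which is why each tenfold increase in points buys only about half a decimal digit.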

36
36 Monte Carlo Policy Evaluation
• Goal: learn $V^\pi(s)$.
• Given: some number of episodes under $\pi$ which contain s.
• Idea: average the returns observed after visits to s.
[Figure: episode diagram with labels 1 to 5]

37
37 Monte Carlo Policy Evaluation
• Every-visit MC: average returns for every time s is visited in an episode.
• First-visit MC: average returns only for the first time s is visited in an episode.
• Both converge asymptotically.

38
38 First-visit Monte Carlo policy evaluation
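A minimal Python sketch of first-visit Monte Carlo policy evaluation: for every episode, the return following the first visit to each state is recorded, and the value estimate is the average of those returns. The episode format (a list of `(state, reward)` pairs), the helper `random_walk_episode`, and the five-state random walk used as the task are assumptions for illustration, not taken from the slides.

```python
import random
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Average, per state, the return that follows its first visit in each episode."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:  # episode: [(s_0, r_1), (s_1, r_2), ...]
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)  # remember the first time step each state occurs
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = r + gamma * g  # return from time t onward
            if first[s] == t:  # only the first visit contributes
                returns_sum[s] += g
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}


def random_walk_episode(rng, n_states=5):
    """Hypothetical task: start in the middle, step left or right at random,
    reward 1 for leaving on the right, 0 otherwise."""
    s = (n_states + 1) // 2
    episode = []
    while 1 <= s <= n_states:
        s2 = s + rng.choice((-1, 1))
        r = 1.0 if s2 > n_states else 0.0
        episode.append((s, r))
        s = s2
    return episode


rng = random.Random(1)
episodes = [random_walk_episode(rng) for _ in range(5000)]
V = first_visit_mc(episodes)
print({s: round(v, 3) for s, v in sorted(V.items())})
```

For this walk the true values are s/6 for state s, so the printed estimates should lie near 0.17, 0.33, 0.5, 0.67, and 0.83. Removing the `first[s] == t` check turns the sketch into every-visit MC.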

39
39 Blackjack example
• Object: have your card sum be greater than the dealer's without exceeding 21.
• States (200 of them):
  – current sum (12-21)
  – dealer's showing card (ace-10)
  – do I have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing.
• Actions: stick (stop receiving cards), hit (receive another card).
• Policy: stick if my sum is 20 or 21, else hit.

40
40 Blackjack value functions

41
41 Backup diagram for Monte Carlo
• Entire episode included.
• Only one choice at each state (unlike DP).
• MC does not bootstrap.
• Time required to estimate one state does not depend on the total number of states.

42
42 Monte Carlo Estimation of Action Values (Q)
• Monte Carlo is most useful when a model is not available.
  – We want to learn $Q^*$.
• $Q^\pi(s,a)$: average return starting from state s and action a, then following $\pi$.
• Also converges asymptotically if every state-action pair is visited.
• Exploring starts: every state-action pair has a non-zero probability of being the starting pair.

43
43 Monte Carlo Control
• MC policy iteration: policy evaluation using MC methods followed by policy improvement.
• Policy improvement step: greedify with respect to the value (or action-value) function.

44
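44 Convergence of MC Control
• Policy improvement theorem tells us: $Q^{\pi_k}(s, \pi_{k+1}(s)) = Q^{\pi_k}(s, \arg\max_a Q^{\pi_k}(s,a)) = \max_a Q^{\pi_k}(s,a) \ge Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation.
• To solve the latter:
  – update only to a given level of performance;
  – alternate between evaluation and improvement per episode.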

45
45 Monte Carlo Exploring Starts
• Fixed point is the optimal policy $\pi^*$.
• Proof is an open question.

46
46 Blackjack example continued
• Exploring starts.
• Initial policy as described before.

47
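47 On-policy Monte Carlo Control
• On-policy: learn about the policy currently being executed.
• How do we get rid of exploring starts?
  – Need soft policies: $\pi(s,a) > 0$ for all s and a.
  – e.g. $\epsilon$-soft policy: $\pi(s,a) = \frac{\epsilon}{|A(s)|}$ for each non-max action and $\pi(s,a) = 1 - \epsilon + \frac{\epsilon}{|A(s)|}$ for the greedy action.
• Similar to GPI: move the policy towards the greedy policy (i.e. $\epsilon$-soft).
• Converges to the best $\epsilon$-soft policy.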
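A short Python sketch of a soft policy built from action values: every action gets probability epsilon divided by the number of actions, and the greedy action gets the remaining 1 - epsilon on top of that. The function name `epsilon_greedy_probs` and the sample action values are assumptions for this example.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy (epsilon-soft) policy in one state."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # first maximizer on ties
    probs = [epsilon / n] * n        # every action keeps a non-zero share
    probs[greedy] += 1.0 - epsilon   # the greedy action takes the rest
    return probs


# Hypothetical action values for a state with three actions.
probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
print(probs)  # approximately [0.1, 0.8, 0.1]
```

Because every entry is at least epsilon / n, the policy keeps exploring all actions, which is what removes the need for exploring starts.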

48
48 On-policy Monte Carlo Control

49
49

50
50 On-policy MC Control

51
51 Learning about $\pi$ while following $\pi'$

52
52 Learning about $\pi$ while following $\pi'$

53
53 Off-policy Monte Carlo control
• Recall that the distinguishing feature of on-policy methods is that they estimate the value of a policy while using it for control.
• In off-policy methods these two functions are separated:
  – The behavior policy generates behavior in the environment.
  – The estimation policy is the policy being learned about.
• Returns generated by the behavior policy are averaged with weights given by their relative probability of occurring under the estimation policy.
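One common way to realize this weighting is weighted importance sampling: each return is weighted by the product, over the steps of its episode, of the estimation-policy probability divided by the behavior-policy probability of the action taken. The Python sketch below is an illustration under that assumption; the function name `weighted_is_estimate`, the episode format, and the one-state example with two actions are hypothetical.

```python
def weighted_is_estimate(episodes, target, behavior):
    """Weighted importance-sampling average of returns.

    episodes: list of (trajectory, ret), with trajectory = [(state, action), ...]
    target(s, a), behavior(s, a): action probabilities under each policy.
    """
    num = 0.0
    den = 0.0
    for trajectory, ret in episodes:
        w = 1.0
        for s, a in trajectory:
            w *= target(s, a) / behavior(s, a)  # relative probability of this step
        num += w * ret
        den += w
    return num / den if den > 0 else 0.0


# Hypothetical one-step task: action 1 yields return 1, action 0 yields return 0.
def behavior(s, a):
    return 0.5  # behaves uniformly at random


def target(s, a):
    return 0.9 if a == 1 else 0.1  # policy being learned about


episodes = [([("s", 0)], 0.0), ([("s", 1)], 1.0), ([("s", 1)], 1.0)]
estimate = weighted_is_estimate(episodes, target, behavior)
print(estimate)  # (1.8 + 1.8) / (0.2 + 1.8 + 1.8), about 0.947
```

The behavior policy must give non-zero probability to every action the estimation policy can take, otherwise the ratio is undefined.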

54
54 Off-policy MC control

55
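55 Incremental Implementation
• MC can be implemented incrementally:
  – saves memory.
• Compute the weighted average of each return:
  – non-incremental: $V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k}$
  – incremental equivalent: $V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}} [R_{n+1} - V_n]$, with $W_{n+1} = W_n + w_{n+1}$ and $V_0 = W_0 = 0$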
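A small Python sketch of the incremental form of a weighted average of returns: keep only the current estimate and the running weight total, and move the estimate toward each new return by the fraction that its weight contributes to the total. The function name `incremental_weighted_average` and the sample numbers are assumptions for this example, and the weights are assumed to be positive.

```python
def incremental_weighted_average(returns, weights):
    """Weighted average computed one return at a time, without storing past returns."""
    v = 0.0        # current estimate
    w_total = 0.0  # running sum of weights
    for r, w in zip(returns, weights):
        w_total += w
        v += (w / w_total) * (r - v)  # step toward the new return
    return v


returns = [1.0, 2.0, 4.0]   # hypothetical returns
weights = [1.0, 1.0, 2.0]   # hypothetical positive weights
print(incremental_weighted_average(returns, weights))  # (1 + 2 + 8) / 4 = 2.75
```

With all weights equal to 1 this reduces to the ordinary running mean used in on-policy MC.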

56
56 Monte Carlo Summary
• MC has several advantages over DP:
  – Can learn directly from interaction with the environment.
  – No need for full models.
  – No need to learn about ALL states.
  – Less harmed by violations of the Markov property (later in the book).
• MC methods provide an alternate policy evaluation process.
• One issue to watch for: maintaining sufficient exploration.
  – exploring starts, soft policies
• No bootstrapping (as opposed to DP).

57
57 Temporal-Difference Methods. After the break.

58
58 Break
