Temporal-Difference Learning and Monte Carlo Methods


1 Temporal-Difference Learning and Monte Carlo Methods

2 Introduction TD learning combines ideas from Monte Carlo (MC) methods and Dynamic Programming (DP). Like Monte Carlo, TD methods can do without a model of the environment's dynamics and can learn directly from raw experience. Like DP, TD methods bootstrap: they update estimates based in part on other learned estimates, without waiting for the final outcome.

3 TD-Prediction Recall that the policy evaluation or prediction problem is estimating the value function Vπ for a given policy π. Both TD and MC update their estimate V of Vπ after some experience of following policy π. If a non-terminal state st is visited at time t, both methods update their estimate V(st). The difference between the two methods arises from how long each method waits before updating the estimate.

4 How do MC methods update?
Monte Carlo methods wait until the return following the visit is known, then use that return as a target for V(st): V(st) ← V(st) + 𝛼[Rt − V(st)], where Rt is the actual return following time t and 𝛼 is a constant step-size parameter. The equation can be read as: NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]. Thus MC methods must wait until the end of the episode to determine the increment, because only then is Rt known. The target of the MC update is Rt, and the bracketed term Rt − V(st) is the error in the estimate.
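As a concrete illustration, here is a minimal sketch of the constant-𝛼 MC update in Python. The episode encoding as a list of (state, reward) pairs, the undiscounted return, and the function name are illustrative assumptions, not from the slides.

```python
# Minimal sketch of a constant-alpha Monte Carlo update (illustrative names).
# An episode is assumed to be a list of (state, reward) pairs; returns are undiscounted.

def mc_episode_update(V, episode, alpha=0.1):
    """At the end of the episode, move each V(s_t) toward the return R_t that followed it."""
    G = 0.0
    # Walk the episode backwards so G accumulates the return that followed each state.
    for state, reward in reversed(episode):
        G += reward                                   # R_t = r_{t+1} + r_{t+2} + ...
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))
    return V

# Example: one tiny episode A -> B -> terminal with rewards 0 and 1.
V = mc_episode_update({}, [("A", 0.0), ("B", 1.0)])
print(V)   # {'B': 0.1, 'A': 0.1} with alpha = 0.1
```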

5 TD Method Updates Temporal-Difference methods wait only until the next time step to update the estimate V(st): V(st) ← V(st) + 𝛼[rt+1 + γV(st+1) − V(st)], where rt+1 is the observed reward and γ is the discount rate. This update rule is known as TD(0). The target of the TD update is rt+1 + γV(st+1). TD(0) is a bootstrapping method like DP: it bases its update in part on an existing estimate, V(st+1). TD methods thus combine the sampling of Monte Carlo with the bootstrapping of DP.
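For comparison, a hedged sketch of the TD(0) update; the dictionary-based value table and function name are illustrative. Note how the update can be applied immediately after each transition instead of at the end of the episode.

```python
# Minimal sketch of the TD(0) update (illustrative names).
# V is a dict of state values; gamma is the discount rate.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One-step update: V(s_t) <- V(s_t) + alpha [r_{t+1} + gamma V(s_{t+1}) - V(s_t)]."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)  # V of a terminal state is 0
    td_error = r + gamma * v_next - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

V = {}
V = td0_update(V, "A", 0.0, "B")                      # update right after the A -> B transition
V = td0_update(V, "B", 1.0, None, terminal=True)      # B -> terminal with reward 1
print(V)   # {'A': 0.0, 'B': 0.1}
```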

6 TD as Combination of MC and DP
MC methods use an estimate of Vπ(st) = Eπ[Rt | st] as the target: the expected value is not known, so the sample return is used in place of the real expected return. DP methods use an estimate of Vπ(st) = Eπ[rt+1 + γVπ(st+1) | st] as the target: the target is an estimate because Vπ(st+1) is not known and the current estimate Vt(st+1) is used in its place. (The driving-home example, Example 6.1 in the book, illustrates the difference.) The TD target is an estimate for both of these reasons: it samples the expected value like MC, and it uses the current estimate Vt instead of the true Vπ like DP.

7 TD(0) The estimate for the state node is updated on the basis of the one sample transition from it to the immediately following state. This is a sample backup: it looks ahead to a sample successor state (or state-action pair), uses the value of the successor and the reward along the way to compute a backed-up value, and then changes the value of the original state (or state-action pair) accordingly, based on a single sample successor rather than on a complete distribution of all possible successors.

8 Advantages of TD Prediction Methods
TD methods learn estimates in part on the basis of other estimates, learning a guess from a guess! Is that good? Unlike DP, TD methods do not require a model of the environment, of its rewards, or of its next-state probability distributions. Unlike MC methods, TD methods are naturally implemented in an on-line, fully incremental fashion: they only need to wait a single time step, not until the end of a full episode, to learn. This also applies to learning tasks without episodes (continuing tasks).

9 Advantages of TD Prediction Methods
Are TD methods sound: do they guarantee convergence to the correct answer? Learning a guess from another guess to get to the right answer seems counterintuitive. TD(0) has been proven to converge to Vπ for any fixed policy π: in the mean for a constant step-size parameter if it is sufficiently small, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions, i.e. ∑k 𝛼k = ∞ and ∑k 𝛼k² < ∞. The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations; the second guarantees that the steps eventually become small enough to assure convergence. Which method, TD or MC, converges faster, given that both converge asymptotically to the correct predictions? This is an open question: there is no mathematical proof as to which converges faster, and it is not even clear how to formulate the question in the most appropriate formal way. In practice, however, TD methods have been found to converge faster than constant-𝛼 MC methods on stochastic tasks.

10 Optimality of TD(0) When the amount of experience is finite, incremental learning methods present the experience repeatedly until the method converges upon an answer. Batch updating: updates to the value function are made only after processing each complete batch of training data. Given an approximate value function V, increments are computed for every time step t at which a non-terminal state is visited, but the value function is changed only once, by the sum of all the increments. The available experience is then processed again with the new value function to produce a new overall increment, and so on, until the value function converges. Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter 𝛼, provided 𝛼 is sufficiently small. The constant-𝛼 MC method also converges deterministically under the same conditions, but to a different answer. Under normal (non-batch) updating the methods do not move all the way to their respective batch answers, but in some sense they take steps in those directions.

11 Optimality of TD(0) Example 6.2, 6.3: Random Walk. Under batch training, constant-𝛼 MC converges to values V(s) that are sample averages of the actual returns experienced after visiting each state s. These values are optimal in the sense that they minimize the mean-squared error from the actual returns in the training set, but this is optimality in a limited way: batch TD still does better. Why?

12 Optimality of TD(0) Given this batch of data (eight episodes involving states A and B), what are the best value estimates V(A) and V(B)? V(B) = 3/4: six out of the eight times the process was in state B it terminated immediately with a return of 1. For V(A) there are two possible answers. (1) 100% of the time the process was in state A it moved immediately to B (with a reward of 0); since B has value 3/4, A must have value 3/4 as well. This is the answer given by the Markov-process model, and it is the answer of batch TD(0). (2) A was observed only once and the return that followed was 0, so estimate V(A) = 0. This is the answer of the batch Monte Carlo method, and it gives the minimum squared error on the training data.
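To see where these two answers come from, here is a hedged sketch of batch TD(0) (batch updating as defined on slide 10) applied to this A/B batch. The episode encoding and all names are illustrative assumptions.

```python
# Sketch of batch TD(0): sweep over a fixed batch of episodes, sum the increments,
# apply them once per sweep, and repeat until the value function stops changing.
# An episode is a list of (state, reward, next_state) transitions; next_state is None at termination.

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6, max_sweeps=10000):
    V = {}
    for _ in range(max_sweeps):
        increments = {}
        for episode in episodes:                      # compute increments without changing V yet
            for s, r, s_next in episode:
                v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
                delta = r + gamma * v_next - V.get(s, 0.0)
                increments[s] = increments.get(s, 0.0) + alpha * delta
        for s, inc in increments.items():             # apply the summed increments once
            V[s] = V.get(s, 0.0) + inc
        if max(abs(inc) for inc in increments.values()) < tol:
            break
    return V

# The eight-episode A/B batch from this slide: one episode A -> B (rewards 0, 0),
# six episodes B -> terminal with return 1, one episode B -> terminal with return 0.
episodes = [[("A", 0.0, "B"), ("B", 0.0, None)]] + \
           [[("B", 1.0, None)]] * 6 + [[("B", 0.0, None)]]
print(batch_td0(episodes))   # both V(A) and V(B) approach 3/4
```

A batch Monte Carlo computation on the same data would instead average the observed returns per state, giving V(B) = 3/4 but V(A) = 0.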

13 Optimality of TD(0) So what do we learn?
Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set. Batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process. The maximum-likelihood estimate of a parameter is the parameter value whose probability of generating the data is greatest. Here the maximum-likelihood estimate is the model of the Markov process formed in the obvious way from the observed episodes: the estimated transition probability from i to j is the fraction of observed transitions from i that went to j, and the associated expected reward is the average of the rewards observed on those transitions. From this model we can compute the estimate of the value function that would be exactly correct if the model were exactly correct; this is called the certainty-equivalence estimate. Batch TD(0) converges to the certainty-equivalence estimate, and that is why it converges faster than batch MC methods. It is almost never feasible to compute the certainty-equivalence estimate directly: doing so requires roughly N² memory and N³ computational steps for a model with N states. TD methods may be the only feasible way of approximating the certainty-equivalence solution.

14 Sarsa: On-Policy TD Control
The control problem is finding an optimal policy. The aim is to learn an action-value function rather than a state-value function: estimate Qπ(s,a) for the current behavior policy π, for all states s and actions a, using essentially the same method as for learning Vπ. We consider transitions from state-action pair to state-action pair and learn the values of state-action pairs: Q(st,at) ← Q(st,at) + 𝛼[rt+1 + γQ(st+1,at+1) − Q(st,at)]. The update is done after every transition from a non-terminal state st; if st+1 is terminal, then Q(st+1,at+1) is defined as 0. The rule uses every element of the quintuple of events (st, at, rt+1, st+1, at+1), hence the name Sarsa. Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

15 Sarsa Algorithm Continually estimate Qπ for the behavior policy π, and at the same time change π toward greediness with respect to Qπ. The convergence properties of the Sarsa algorithm depend on the nature of the policy's dependence on Q. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.
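A hedged sketch of the Sarsa control loop with an 𝟄-greedy policy. The `env.reset()` / `env.step(action)` interface follows a common Gym-like convention and is an assumption, not something specified on the slides.

```python
# Sketch of Sarsa with an epsilon-greedy behavior policy (illustrative env interface:
# reset() -> state, step(action) -> (next_state, reward, done)).

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            # Q(s_{t+1}, a_{t+1}) is taken to be 0 when s_{t+1} is terminal.
            a_next = None if done else epsilon_greedy(Q, s_next, actions, epsilon)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```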

16 Q-Learning: Off-Policy TD Control
Its simplest form is one-step Q-learning: Q(st,at) ← Q(st,at) + 𝛼[rt+1 + γ maxa Q(st+1,a) − Q(st,at)]. The learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed. The policy still determines which state-action pairs are visited and updated, and for convergence all pairs must continue to be updated; this is a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must satisfy it. Qt has been shown to converge with probability 1 to Q* under the usual stochastic approximation conditions on the sequence of step-size parameters.

17 Q-learning Algorithm In the backup diagram, a state-action pair is updated, so the top node, the root of the backup, is a small filled action node. The backup is also from action nodes, maximizing over all the actions possible in the next state.
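For contrast with Sarsa, a hedged sketch of one-step Q-learning under the same assumed env interface; note that the target maximizes over next actions regardless of which action the 𝟄-greedy behavior policy actually takes.

```python
# Sketch of one-step Q-learning (same illustrative env interface as the Sarsa sketch).

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy, so all state-action pairs keep being visited.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Target uses the greedy (max) action value, independent of the behavior policy.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```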

18 The cliff-walking task
Sarsa vs. Q-learning: Q-learning learns values for the optimal policy, the one that travels right along the edge of the cliff, which results in its occasionally falling off the cliff because of the 𝟄-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. As a result, the on-line performance of Q-learning is worse than that of Sarsa. If 𝟄 is gradually reduced, both methods asymptotically converge to the optimal policy. What is the difference between on-policy and off-policy learning, and what is the trade-off in exploration? An off-policy learner learns the value of the optimal policy independently of the agent's actions; Q-learning is an off-policy learner. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.

19 The actor-critic architecture
Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independently of the value function. The policy structure is the actor, used to select actions; the estimated value function is the critic, which criticizes the actions made by the actor. Learning is always on-policy: the critic must learn about and critique whatever policy is currently being followed by the actor. The critique takes the form of a TD error, which is the sole output of the critic and drives all learning in both the actor and the critic.

20 Actor-Critic Methods Typically, the critic is a state-value function.
After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected by calculating the TD error δt = rt+1 + γV(st+1) − V(st), where V is the current value function implemented by the critic. The TD error can be used to evaluate the action just selected: if the TD error is positive, the action just selected was good and the tendency to select it should be strengthened in the future; if the TD error is negative, the tendency to select it should be weakened. Advantages of actor-critic methods: they require minimal computation in order to select actions; they can learn an explicitly stochastic policy, that is, they can learn the optimal probabilities of selecting various actions; and they can be useful in competitive and non-Markov cases. Regarding the first point, consider the case where there are an infinite number of possible actions, for example a continuous-valued action. Any method learning just action values must search through this infinite set in order to pick an action; if the policy is explicitly stored, this extensive computation may not be needed for each action selection.
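A minimal actor-critic sketch along these lines: the critic is a state-value table and the actor keeps softmax (Gibbs) preferences over actions that are strengthened or weakened by the TD error. The class layout, step sizes, and preference update are illustrative assumptions.

```python
# Sketch of a tabular actor-critic agent (illustrative names and constants).

import math
import random
from collections import defaultdict

class ActorCritic:
    def __init__(self, actions, alpha_actor=0.1, alpha_critic=0.1, gamma=1.0):
        self.actions = actions
        self.pref = defaultdict(float)     # actor: action preferences p(s, a)
        self.V = defaultdict(float)        # critic: state values
        self.aa, self.ac, self.gamma = alpha_actor, alpha_critic, gamma

    def act(self, s):
        # Softmax over preferences: an explicitly stochastic policy.
        exps = [math.exp(self.pref[(s, a)]) for a in self.actions]
        z = sum(exps)
        return random.choices(self.actions, weights=[e / z for e in exps])[0]

    def update(self, s, a, r, s_next, done):
        v_next = 0.0 if done else self.V[s_next]
        delta = r + self.gamma * v_next - self.V[s]   # TD error, the critic's sole output
        self.V[s] += self.ac * delta                  # critic moves toward the TD target
        self.pref[(s, a)] += self.aa * delta          # actor strengthens or weakens the action
```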

21 R-Learning for Undiscounted Continuing Tasks
R-learning is an off-policy control method for the advanced version of the reinforcement learning problem in which there is no discounting and experience is not divided into distinct episodes. Instead of maximizing total expected reward, the goal is to maximize the expected reward per time step. Value functions for a policy π are defined relative to the average expected reward per time step under that policy, ρπ, assuming the process is ergodic (there is a nonzero probability of reaching any state from any other under any policy), so that ρπ does not depend on the starting state. The average reward in the long run is the same from every state, but there is a transient: from some states better-than-average rewards are received for a while, and from others worse-than-average rewards.

22 R-Learning for Undiscounted Continuing Tasks
This transient defines the (relative) value of a state, Ṽπ(s) = ∑k Eπ[rt+k − ρπ | st = s], and the value of a state-action pair is similarly the transient difference in reward when starting in that state and taking that action. These values are relative to the average reward under the current policy. For practical purposes, policies can be ordered according to their average reward per time step, and a policy that attains the maximal value of ρπ can be considered optimal.

23 R-Learning for Undiscounted Continuing Tasks
The R-learning method maintains two policies, a behavior policy and an estimation policy, plus an action-value function and an estimated average reward. The behavior policy is used to generate experience; the estimation policy is the one involved in GPI, typically the greedy policy with respect to the action-value function. If π is the estimation policy, then the action-value function Q is an approximation of Q̃π and the average reward ρ is an approximation of ρπ.
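A hedged sketch of a single R-learning update, following the standard tabular formulation; Q is assumed to be a defaultdict, and the function name, step sizes, and example values are illustrative.

```python
# Sketch of one R-learning update: action values are learned relative to the
# average-reward estimate rho, which is itself updated on greedy steps.

from collections import defaultdict

def r_learning_step(Q, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """Apply one R-learning update and return the new average-reward estimate rho."""
    max_next = max(Q[(s_next, x)] for x in actions)
    # r - rho replaces r: values are relative to the average reward per step.
    Q[(s, a)] += alpha * (r - rho + max_next - Q[(s, a)])
    # Update the average-reward estimate only when the chosen action is greedy in s.
    if Q[(s, a)] == max(Q[(s, x)] for x in actions):
        rho += beta * (r - rho + max_next - Q[(s, a)])
    return rho

Q, rho = defaultdict(float), 0.0
rho = r_learning_step(Q, rho, "s0", "a0", 1.0, "s1", ["a0", "a1"])
print(Q[("s0", "a0")], rho)   # approximately 0.1 and 0.009
```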

24 Games, Afterstates, and Other Special Cases
Exceptional tasks are sometimes better treated in a specialized way. A conventional state-value function evaluates states in which the agent has the option of selecting an action, but the state-value function used in tic-tac-toe evaluates board positions after the agent has made its move. A similar, more efficient approach applies to queuing tasks, where there are actions such as assigning customers to servers, rejecting customers, or discarding information; in such cases the actions are defined in terms of their immediate effects, which are completely known. Afterstates are environment states evaluated after the agent has acted, and the corresponding value functions are called afterstate value functions. They are useful when we know the initial part of the environment's dynamics but not necessarily the full dynamics; e.g., in chess we know for each possible move what the resulting position will be, but not how the opponent will reply. In such cases, afterstate value functions produce a more efficient learning method, performing better because many state-action pairs can produce the same resulting state.

25 Questions
Monte Carlo and TD can solve stochastic tasks, while DP can only deal with deterministic tasks. I am wondering whether a real, meaningful deterministic task exists? [Yifang] Note that the problem can be non-deterministic for DP as well; what DP requires is that the model be given. And if a task is fully deterministic and known, do we need learning in the first place? Ideas? Why do actor-critic TD methods learn an explicitly stochastic policy? Is that relatively unique? [Brendan] Because actions are chosen based on probabilities, and the method simply increases or decreases the tendency to select an action in the future depending on whether the action was good or bad. For TD(𝜆) learning, what is the difference when choosing different values of 𝜆 rather than setting 𝜆 = 0? [Jiyun] The 𝜆 parameter is the trace-decay parameter, with 0 ≤ 𝜆 ≤ 1. As 𝜆 increases toward 1, traces last longer, i.e., a larger proportion of the credit from a reward can be given to more distant states and actions. How does Q-learning control exploration and exploitation? [Jiyun] Q-learning is exploration insensitive: the Q values will converge to the optimal values independent of how the agent behaves while the data is being collected. TD learning is a combination of Monte Carlo and DP. Can you illustrate this explicitly? [Prof. Grace] TD methods combine the sampling of Monte Carlo with the bootstrapping of DP: TD resembles a Monte Carlo method because it learns by sampling the environment according to some policy, and it relates to DP because it bases its current estimate on previously learned estimates. The driving-home example in the book [Example 6.1] also illustrates this.

26 Monte Carlo Method and Temporal-Difference Learning

27 Last time: Dynamic Programming
We assume the environment is a finite MDP: 𝓢 and A(s) are finite, and the dynamics are given by transition probabilities and expected immediate rewards. The key idea of DP is the use of value functions to organize and structure the search for good policies (that is, to guide how to search).

28 Solving RL Problems
Dynamic Programming: mathematically well developed, but requires a complete and accurate model of the environment.
Monte Carlo methods: do not require a model, but are not suitable for step-by-step incremental computation.
Temporal-Difference learning: fully incremental, but more complex to analyze.

29 Monte Carlo Method Monte Carlo methods do not assume complete knowledge of the environment,
whereas in DP we assume a complete and accurate model of the environment. They require only experience from on-line or simulated interaction with an environment. In many cases it is easy to generate experience sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form (any examples?). Advantages: Monte Carlo methods are attractive when one requires the value of only a subset of the states, and they can learn from actual experience as well as from simulated experience.

30 Monte Carlo Method The idea: solve the RL problem by averaging sample returns. Monte Carlo methods are defined only for episodic tasks, to ensure that well-defined returns are available. We assume experience is divided into episodes and that all episodes eventually terminate no matter what actions are selected. Only upon the completion of an episode are value estimates and policies changed: Monte Carlo methods are incremental in an episode-by-episode sense, not in a step-by-step sense.

31 Monte Carlo Method Monte Carlo and DP share the most important ideas:
they both compute the same value functions, and they interact to attain optimality in essentially the same way: Generalized Policy Iteration.

32 What is “Monte Carlo”? Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; typically one runs simulations many times over in order to obtain the distribution of an unknown probabilistic entity. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to obtain a closed-form expression, or infeasible to apply a deterministic algorithm. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration and generation of draws from a probability distribution. — wikipedia

33 What is “Monte Carlo”? Monte Carlo methods vary, but tend to follow a particular pattern: Define a domain of possible inputs. Generate inputs randomly from a probability distribution over the domain. Perform a deterministic computation on the inputs. Aggregate the results. Consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method: Draw a square on the ground, then inscribe a circle within it. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square. Count the number of objects inside the circle and the total number of objects. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.
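The recipe above, written out as a short script; random points in a unit square stand in for the grains of rice, and the sample count is arbitrary.

```python
# Monte Carlo estimate of pi: the fraction of random points in the unit square
# that land inside the quarter circle of radius 1 approximates pi/4.

import random

def estimate_pi(n_samples=1_000_000):
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()   # uniform point in the unit square
        if x * x + y * y <= 1.0:                  # inside the quarter circle
            inside += 1
    return 4.0 * inside / n_samples               # area ratio is pi/4, so multiply by 4

print(estimate_pi())   # e.g. 3.14... ; accuracy improves with more samples
```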

34 What is “Monte Carlo”? Before the Monte Carlo method was developed, simulations tested a previously understood deterministic problem and statistical sampling was used to estimate uncertainties in the simulations. Monte Carlo simulations invert this approach, solving deterministic problems using a probabilistic analog. An early variant of the Monte Carlo method can be seen in the Buffon's needle experiment, in which π can be estimated by dropping needles on a floor made of parallel and equidistant strips. In the 1930s, Enrico Fermi first experimented with the Monte Carlo method while studying neutron diffusion, but did not publish anything on it.

35 What is “Monte Carlo”? In 1946, physicists at Los Alamos Scientific Laboratory were investigating radiation shielding and the distance that neutrons would likely travel through various materials. Despite having most of the necessary data, such as the average distance a neutron would travel in a substance before it collided with an atomic nucleus, and how much energy the neutron was likely to give off following a collision, the Los Alamos physicists were unable to solve the problem using conventional, deterministic mathematical methods. Stanislaw Ulam had the idea of using random experiments. He recounts his inspiration as follows: The first thoughts and attempts I made to practice [the Monte Carlo Method] were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than "abstract thinking" might not be to lay it out say one hundred times and simply observe and count the number of successful plays. This was already possible to envisage with the beginning of the new era of fast computers, and I immediately thought of problems of neutron diffusion and other questions of mathematical physics, and more generally how to change processes described by certain differential equations into an equivalent form interpretable as a succession of random operations. Later [in 1946], I described the idea to John von Neumann, and we began to plan actual calculations. –Stanislaw Ulam[4]

36 What is “Monte Carlo”? Being secret, the work of von Neumann and Ulam required a code name. A colleague of von Neumann and Ulam, Nicholas Metropolis, suggested using the name Monte Carlo, which refers to the Monte Carlo Casino in Monaco where Ulam's uncle would borrow money from relatives to gamble. Using lists of "truly random" random numbers was extremely slow, but von Neumann developed a way to calculate pseudorandom numbers, using the middle-square method. Though this method has been criticized as crude, von Neumann was aware of this: he justified it as being faster than any other method at his disposal, and also noted that when it went awry it did so obviously, unlike methods that could be subtly incorrect. Monte Carlo methods were central to the simulations required for the Manhattan Project, though severely limited by the computational tools at the time. In the 1950s they were used at Los Alamos for early work relating to the development of the hydrogen bomb, and became popularized in the fields of physics, physical chemistry, and operations research. The Rand Corporation and the U.S. Air Force were two of the major organizations responsible for funding and disseminating information on Monte Carlo methods during this time, and they began to find a wide application in many different fields. Uses of Monte Carlo methods require large amounts of random numbers, and it was their use that spurred the development of pseudorandom number generators, which were far quicker to use than the tables of random numbers that had been previously used for statistical sampling.

37 Monte Carlo Policy Evaluation
Goal: learn the state-value function Vπ for a given policy. The value of a state is the expected return (expected cumulative future discounted reward) starting from that state, so an obvious approach is to average the returns observed after visits to that state. Suppose we wish to estimate Vπ(s) given a set of episodes obtained by following π and passing through s. Each occurrence of state s in an episode is called a visit to s. Every-visit MC estimates Vπ(s) as the average of the returns following all visits to s in a set of episodes; first-visit MC estimates Vπ(s) as the average of just the returns following first visits to s. Both estimates converge to Vπ(s) as the number of visits (or first visits) to s goes to infinity, because each return is an independent, identically distributed estimate of Vπ(s). (Every-visit MC is treated in Chapter 7.)
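A hedged sketch of first-visit MC policy evaluation. Episodes are assumed to be lists of (state, reward) pairs generated by following the given policy; names are illustrative.

```python
# First-visit Monte Carlo policy evaluation: average, per state, only the return
# that followed the first visit to that state in each episode.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following each time step by walking the episode backwards.
        G, returns_after = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_after.append((state, G))
        returns_after.reverse()
        # Keep only the return following the FIRST visit to each state.
        seen = set()
        for state, G in returns_after:
            if state not in seen:
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

episodes = [[("A", 0.0), ("B", 1.0)], [("B", 0.0)]]
print(first_visit_mc(episodes))   # {'A': 1.0, 'B': 0.5}
```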

38 Monte Carlo Policy Evaluation
The computational expense of estimating the value of a single state is independent of the number of states, so MC methods are immune to the curse of dimensionality (the number of states often grows exponentially with the number of state variables). This makes them particularly attractive when one requires the value of only a subset of the states. An important fact: the estimates for each state are independent; the estimate for one state does not build upon the estimate of any other state.

39 Monte Carlo Estimation of Action Values
If a model is not available, then it is particularly useful to estimate action values rather than state values: with a model, state values alone are sufficient to determine a policy; without a model, state values alone are not sufficient, and we must explicitly estimate the value of each action. The policy evaluation problem for action values is to estimate Qπ(s,a), the expected return when starting in state s, taking action a, and thereafter following policy π.

40 Monte Carlo Estimation of Action Values
Problem: many relevant state-action pairs may never be visited. If 𝜋 is a deterministic policy, then in following 𝜋 one will observe returns only for one of the actions from each state. The assumption of exploring starts: the first step of each episode starts at a state-action pair, and every such pair has a nonzero probability of being selected as the start. This guarantees that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes.

41 Monte Carlo Control The Monte Carlo control algorithm assuming exploring starts (Monte Carlo ES) is sketched below.
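Since the algorithm figure does not survive in this transcript, here is a hedged sketch of Monte Carlo control with exploring starts. `env.episode_from(start_state, start_action, policy)` is an assumed helper, not a real API: it plays out one episode from the given start pair and returns a list of (state, action, reward) triples.

```python
# Sketch of Monte Carlo ES: exploring starts, first-visit averaging of returns for Q,
# and greedy policy improvement at the visited states.

import random
from collections import defaultdict

def mc_es(env, states, actions, iterations=10_000, gamma=1.0):
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(iterations):
        # Exploring start: every state-action pair has a nonzero chance of starting an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = env.episode_from(s0, a0, policy)        # [(state, action, reward), ...]
        # Returns following each step, computed backwards and restored to forward order.
        G, rets = 0.0, []
        for s, a, r in reversed(episode):
            G = r + gamma * G
            rets.append((s, a, G))
        rets.reverse()
        seen = set()
        for s, a, G in rets:
            if (s, a) in seen:
                continue                                   # first-visit averaging only
            seen.add((s, a))
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            policy[s] = max(actions, key=lambda x: Q[(s, x)])   # greedy improvement
    return policy, Q
```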

42 Monte Carlo Control (blackjack example) An ace that can count as 11 without going bust is called usable.

43 On-Policy Monte Carlo Control
How can we avoid the assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches: on-policy and off-policy methods. An on-policy method attempts to evaluate or improve the policy that is used to make decisions. In on-policy methods the policy is generally soft, meaning that 𝜋(s,a) > 0 for all s and a. 𝟄-greedy policies: most of the time they choose an action that has maximal estimated action value, but with probability 𝟄 they instead select an action at random. A sketch of an 𝟄-greedy choice follows.
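A tiny sketch of the 𝟄-greedy choice described above: every action keeps probability at least 𝟄/|A(s)|, so the policy stays soft. Names and the example Q values are illustrative.

```python
# Epsilon-greedy action probabilities: mostly the greedy action, occasionally a random one.

import random

def epsilon_greedy_probs(Q, state, actions, epsilon=0.1):
    """Return {action: probability} for an epsilon-greedy policy with respect to Q."""
    greedy = max(actions, key=lambda a: Q.get((state, a), 0.0))
    probs = {a: epsilon / len(actions) for a in actions}   # each action gets epsilon / |A(s)|
    probs[greedy] += 1.0 - epsilon                         # the rest goes to the greedy action
    return probs

def sample(probs):
    return random.choices(list(probs), weights=list(probs.values()))[0]

Q = {("s0", "left"): 1.0, ("s0", "right"): 0.5}
p = epsilon_greedy_probs(Q, "s0", ["left", "right"])
print(p)           # {'left': 0.95, 'right': 0.05}
print(sample(p))   # usually 'left', occasionally 'right'
```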

44 On-Policy Monte Carlo Control

45 Evaluating One Policy While Following Another
So far we have considered methods for estimating the value functions for a policy given an infinite supply of episodes generated using that policy. Suppose all we have are episodes generated from a different policy: we wish to estimate V𝜋 or Q𝜋, but all we have are episodes following 𝜋', where 𝜋' ≠ 𝜋. In order to use episodes from 𝜋' to estimate values for 𝜋, we require that every action taken under 𝜋 is also taken, at least occasionally, under 𝜋'; that is, 𝜋(s,a) > 0 must imply 𝜋'(s,a) > 0. The returns observed under 𝜋' can then be reweighted, as sketched below.
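A hedged sketch of the underlying idea: weight each observed return by the relative probability of its action sequence under 𝜋 versus 𝜋' (an importance-sampling ratio). The two policy arguments are assumed callables returning action probabilities; the book's own estimator additionally normalizes by the sum of the weights (weighted importance sampling), which this simple average omits.

```python
# Off-policy evaluation idea: returns observed under the behavior policy pi' are
# reweighted by the probability ratio of the trajectory under pi versus pi'.

def importance_weight(episode, pi, pi_behavior):
    """episode: list of (state, action, reward); pi, pi_behavior: (s, a) -> probability."""
    w = 1.0
    for s, a, _ in episode:
        # Requires pi_behavior(s, a) > 0 wherever pi(s, a) > 0.
        w *= pi(s, a) / pi_behavior(s, a)
    return w

def off_policy_value(episodes, pi, pi_behavior, gamma=1.0):
    """Estimate V_pi(s0) from episodes that all start in state s0 but follow pi_behavior."""
    total = 0.0
    for episode in episodes:
        G = 0.0
        for _, _, r in reversed(episode):
            G = r + gamma * G
        total += importance_weight(episode, pi, pi_behavior) * G
    return total / len(episodes)
```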

46 Off-Policy Monte Carlo Control
In off-policy methods the policy used to generate behavior, called the behavior policy, may be unrelated to the policy that is evaluated and improved, called the estimation policy. An advantage of this separation is that the estimation policy may be deterministic (e.g., greedy) while the behavior policy continues to sample all possible actions. Off-policy Monte Carlo control methods use the technique of the previous slide: estimating the value function for one policy while following another. They follow the behavior policy while learning about and improving the estimation policy. This requires that the behavior policy have a nonzero probability of selecting all actions that might be selected by the estimation policy, i.e., that the behavior policy be soft, in order to explore all possibilities.

47 Off-Policy Monte Carlo Control
Following 𝜋', while evaluating and improving 𝜋.

48 Questions What is the key difference between on-policy and off-policy learning? Use examples and compare algorithms from both chapters. (Grace) On-policy methods follow and evaluate the same policy (e.g., Sarsa, on-policy MC control); off-policy methods follow one policy, the behavior policy, while evaluating and improving another, the estimation policy (e.g., Q-learning, off-policy MC control).

49 Questions Re: off-policy MC control: "If nongreedy actions are frequent, then learning will be slow, particularly for states appearing in the early portions of long episodes." How would one ensure that the choices are only optimally non-greedy? What's the trade-off? (Brad) Non-greedy actions cannot be guaranteed to be optimal; that is exactly why we need to keep exploring. The trade-off is a slower learning process. In Chapter 4 we used an iterative DP algorithm with a termination condition to evaluate the state-value function, while in Chapter 5.1 we use Monte Carlo to evaluate it. Which one is better? (Yifang) They make different assumptions about the environment model (DP requires the full model, MC requires only sample episodes), so it is inappropriate to compare them directly.

50 Questions Is first-visit MC method to estimate state-value function an application of Gibbs Sampling? It looks like that if we aim to get distribution of return(s) rather than expectation(s). (Yifang) In statistics and in statistical physics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution (i.e. from the joint probability distribution of two or more random variables), when direct sampling is difficult. This sequence can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution); to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled. Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers, and hence may produce different results each time it is run), and is an alternative to deterministic algorithms for statistical inference such as variational Bayes or the expectation-maximization algorithm (EM). As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired (typically by thinning the resulting chain of samples by only taking every nth value, e.g. every 100th value). In addition (again, as in other MCMC algorithms), samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution.

51 Questions "Monte Carlo" seems like a strange name for this type of machine learning; is it just referring to the expected environment randomness from episode to episode? (Brendan) “Monte Carlo” here refers to using repeated random sampling to estimate the expected value/return. The name is from physics. What are the advantages of "focus"-ing Monte Carlo rather than explicitly limiting the search space? (Brendan) Limiting search space would reduce the amount/possibility of exploration, which may take long time to converge or find optimal policy. not equally exploring will need to know the exact search space

52 Questions Section 5.4 of the textbook introduced on-policy Monte Carlo control. My question is, can 𝟄-soft policies guarantee convergence? What if we have a dynamic rather than a constant 𝟄 value? (Sicong) For any 𝟄-soft policy 𝜋, any 𝟄-greedy policy with respect to Q𝜋 is guaranteed to be better than or equal to 𝜋, so the method will converge. If 𝟄 is dynamic but remains small, it only affects how much the agent explores; it does not affect the result. In every-visit MC, the visits to a state s within an episode are dependent on each other, so the returns are not independent estimates of V(s) given policy 𝜋. Why use every-visit MC, then, and what is its advantage compared to first-visit MC? (Jiyun) Every-visit MC, introduced in Chapter 7, gives an estimate closer to the definition of V(s): V(s) is the expected return when the agent enters state s, and the definition does not specify whether it is the first visit to s, so to estimate V(s) one can also use the returns following non-first visits to s.

53 Questions I notice that the idea of the Monte Carlo method here seems to have some similarity with simulated annealing: the value function is what the "robot" wants to learn, and at the same time the learning is also directed by the current value function; in addition, during each iteration both algorithms have some chance of choosing a random action. Is that right? (Sicong) Simulated annealing (SA) is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). For certain problems, simulated annealing may be more efficient than exhaustive enumeration, provided that the goal is merely to find an acceptably good solution in a fixed amount of time rather than the best possible solution. The key difference here is that in RL we do not have a clear, given function to optimize.

