Neuro-Dynamic Programming


1 Neuro-Dynamic Programming
José A. Ramírez, Yan Liao. Advanced Decision Processes, ECECS 841, Spring 2003. University of Cincinnati

2 Outline 1. Neuro-Dynamic Programming (NDP): motivation.
2. Introduction to Infinite Horizon Problems: Minimization of Total Cost, Discounted Problems, Finite-State Systems, Value Iteration and Error Bounds, Policy Iteration, The Role of Contraction Mappings. 3. Stochastic Control Overview: State Equation (system model), Value Function, Stationary policies and value function, Introductory example: Tetris (game).

3 Outline 4. Control of Complex Systems:
Motivation for the use of NDP in complex systems, Examples of complex systems where NDP could be applied. 5. Value Function Approximation: Linear parameterization: parameter vector and basis functions, Continuation of the Tetris example.

4 Outline 6. Temporal-Difference Learning (TD(λ)):
Introduction: Autonomous systems, general TD(λ) algorithm, Controlled Systems, TD(λ) for more general systems: Approximate policy iteration, Controlled TD, Q-functions, and approximating the Q-function (Q-learning), Comments about the relationship with Approximate Value Iteration. 7. Actors and Critics: Averaged Rewards, Independent Actors, Using Critic Feedback.

5 1. Neuro-Dynamic Programming (NDP): motivation
Study of decision-making, “rational” and “irrational” behavior:
-How decisions are made (psychologists, economists).
-How decisions ought to be made: “rational decision-making” (engineers and management scientists); clear objectives, strategic behavior.
Rational decision problems:
-Development of mathematical theory: understanding of dynamic models, uncertainties, objectives, and characterization of optimal decision strategies.
-If optimal strategies do exist, then computational methods are used as a complement (e.g., implementation).

6 1. Neuro-Dynamic Programming (NDP): motivation
-In contrast to rational decision-making, there is no clear-cut mathematical theory about decisions made by participants in natural systems (speculative theories, refining ideas by experimentation).
-One approach: hypothesize that behavior is in some sense rational; then ideas from the study of rational decision-making are used to characterize such behavior, e.g., utility and equilibrium theory in financial economics.
-Also, the study of animal behavior is a subject of interest: evolutionary theory and its popular precept “survival of the fittest” support the possibility that behavior to some extent concurs with that of a rational agent.
-Contributions from the study of natural systems to the science of rational decision-making:
-Computational complexity of decision problems and the lack of systematic approaches for dealing with it.

7 1. Neuro-Dynamic Programming (NDP): motivation
-For example: practical problems addressed by the theory of dynamic programming (DP) can rarely be solved using DP algorithms, because the computational time required for the generation of optimal strategies typically grows exponentially in the number of variables involved → curse of dimensionality.
-This calls for an understanding of suboptimal solutions / decision-making under computational constraints. Problem: no satisfactory theory has been developed to this end.
-Interesting: the fact that biological mechanisms facilitate the efficient synthesis of adequate strategies motivates the possibility that understanding such mechanisms can inspire new and computationally feasible methodologies for strategic decision-making.

8 1. Neuro-Dynamic Programming (NDP): motivation
-Reinforcement Learning (RL): over two decades, RL algorithms (originally conceived as descriptive models for phenomena observed in animal behavior) have grown up in the field of artificial intelligence and been applied to solving complex sequential decision problems.
-The success of RL in solving large-scale problems has generated special interest among operations researchers and control theorists → research devoted to understanding those methods and their potential.
-Developments from operations research and control theory: focused on the normative view, acknowledging the relative disconnect from descriptive models of animal behavior → some operations researchers and control theorists have come to refer to this area of research as Neuro-Dynamic Programming (NDP) instead of RL.

9 1. Neuro-Dynamic Programming (NDP): motivation
-During these lectures we will present a sample of the recent developments and open research issues in NDP.
-Specifically, we will focus on two algorithmic ideas of greatest use in NDP, and for which there has been significant theoretical progress in recent years:
-Temporal-Difference learning
-Actor-Critic methods.
-First, we begin by providing some background and perspective on the methodology and the problems it may address. Comments about references.

10 2. Introduction to Infinite Horizon Problems
Material taken from “Dynamic Programming and Optimal Control”, vols. I, II, and “Neuro-Dynamic Programming”, by Dimitri P. Bertsekas and John Tsitsiklis. Dynamic programming problems with an infinite horizon are characterized by the following aspects: a) The number of stages is infinite*. b) The system is stationary → the system equation, the cost per stage, and the random disturbance statistics do not change from one stage to the next. Why infinite horizon problems? -They are interesting because their analysis is elegant and insightful. -Implementation of optimal policies is often simple. Optimal policies are typically stationary, i.e., the optimal rule used to choose controls does not change from stage to stage. -NDP! → complex systems. *This assumption is never satisfied in practice, but is a reasonable approximation for problems with a finite but very large number of stages.

11 2. Introduction to Infinite Horizon Problems
-They require more sophisticated analysis than finite horizon problems → it is necessary to analyze limiting behavior as the horizon tends to infinity. -We consider four principal classes of infinite horizon problems. The first two classes try to minimize the total cost over an infinite number of stages, Jπ(x0) = lim N→∞ E{ Σk=0..N−1 αk g(xk, μk(xk), wk) }: i) Stochastic shortest path problems: in this case α = 1 and we assume that there is an additional state 0, which is a cost-free termination state; once the system reaches the termination state it remains there at no additional cost. Objective: reach the termination state with minimal cost. ii) Discounted problems with bounded cost per stage: here α < 1, and the absolute one-stage cost |g(x,u,w)| is bounded from above by some constant M. Thus, Jπ(x0) is well defined because it is the infinite sum of a sequence of numbers that are bounded in absolute value by the decreasing geometric progression Mαk.

12 2. Introduction to Infinite Horizon Problems
iii) Discounted problems with unbounded cost per stage: here the discount factor α may or may not be less than 1, and the cost per stage may be unbounded. This problem is difficult to analyze because of the possibility of infinite cost for some policies (more details in chap. 3, Dynamic Programming, vol. II, by Bertsekas). iv) Average cost problems: in some problems we have Jπ(x0) = ∞ for every policy π and initial state x0; then in many problems the average cost per stage, given by lim N→∞ (1/N) JπN(x0), where JπN(x0) is the N-stage cost-to-go of policy π starting at state x0, is well defined as a limit and is finite.

13 2. Introduction to Infinite Horizon Problems
A Preview of Infinite Horizon Results: Let J* be the optimal cost-to-go function of the infinite horizon problem, and consider the case α = 1, with JN(x) as the optimal cost of the problem involving N stages, initial state x, cost per stage g(x,u,w), and zero terminal cost. Thus, the N-stage cost can be computed after N iterations of the DP algorithm*: Jk+1(x) = min u∈U(x) E{ g(x,u,w) + Jk(f(x,u,w)) }, with J0(x) = 0. Thus, we can speculate the following: i) The optimal infinite horizon cost is the limit of the corresponding N-stage optimal costs as N → ∞: J*(x) = lim N→∞ JN(x), for all x. *Note that the time indexing has been reversed from the original DP algorithm, thus the optimal finite horizon cost functions can be computed with a single DP recursion (more details in chap. 1, “Dynamic Programming”, vol. II, by D.P. Bertsekas).

14 2. Introduction to Infinite Horizon Problems
ii) The following limiting form of the DP algorithm should hold for all states x: J*(x) = min u∈U(x) E{ g(x,u,w) + J*(f(x,u,w)) }. This is not an algorithm, but a system of equations (one equation per state), which has as solution the costs-to-go of all states. It can also be viewed as a functional equation for the cost-to-go function J*, and it is called Bellman’s equation. iii) If μ(x) attains the minimum in the right-hand side of Bellman’s equation for each x, then the policy π = {μ, μ, …} should be optimal. This is true for most infinite horizon problems of interest → stationary policies. Most of the analysis of infinite horizon problems is focused on the above three issues and on efficient methods to compute J* and optimal policies.
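To make this fixed-point view concrete, the following is a minimal sketch of value iteration for a tiny finite problem, repeatedly applying the DP mapping until Bellman’s equation is (approximately) satisfied. The problem data (dynamics f, stage cost g, disturbance probabilities, discount factor) are made-up illustrative values, not taken from the slides.

```python
import numpy as np

# Hypothetical finite problem: 3 states, 2 controls, 2 disturbance values.
n_states, n_controls, n_disturb = 3, 2, 2
alpha = 0.9                                   # discount factor
rng = np.random.default_rng(0)
f = rng.integers(0, n_states, size=(n_states, n_controls, n_disturb))  # next state f(x,u,w)
g = rng.random(size=(n_states, n_controls, n_disturb))                 # stage cost g(x,u,w)
p_w = np.array([0.5, 0.5])                    # disturbance probabilities

def T(J):
    """DP mapping: (TJ)(x) = min_u E_w[ g(x,u,w) + alpha * J(f(x,u,w)) ]."""
    Q = np.zeros((n_states, n_controls))
    for x in range(n_states):
        for u in range(n_controls):
            Q[x, u] = sum(p_w[w] * (g[x, u, w] + alpha * J[f[x, u, w]])
                          for w in range(n_disturb))
    return Q.min(axis=1), Q.argmin(axis=1)

J = np.zeros(n_states)
for k in range(1000):                         # value iteration: J_{k+1} = T J_k
    J_new, policy = T(J)
    if np.max(np.abs(J_new - J)) < 1e-10:     # stop near the fixed point of T (Bellman's equation)
        break
    J = J_new
print("J* approx:", J, "greedy policy:", policy)
```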

15 2. Introduction to Infinite Horizon Problems
Stationary Policy: A stationary policy is an admissible policy π = {μ, μ, …}, with a corresponding cost function Jμ(x). μ is optimal if Jμ(x) = J*(x) for all states x. Some Shorthand Notation: The use of single recursions in the DP algorithm to compute optimal costs over a finite horizon motivates the introduction of two mappings that play an important theoretical role and give us a convenient shorthand notation for expressions that are complicated to write. For any function J: S → ℜ, where S is the state space, we consider the function obtained by applying the DP mapping T to J as follows: (TJ)(x) = min u∈U(x) E{ g(x,u,w) + α J(f(x,u,w)) }. T can be viewed as a mapping that transforms a function J on S into the function TJ on S. TJ represents the optimal cost function for the one-stage problem that has stage cost g and terminal cost J.

16 2. Introduction to Infinite Horizon Problems
Similarly, for any control function μ: S → C, where C is the space of controls, we have: (TμJ)(x) = E{ g(x,μ(x),w) + α J(f(x,μ(x),w)) }. Also, we denote by Tk the composition of the mapping T with itself k times: (TkJ)(x) = (T(Tk−1J))(x), k = 1, 2, …. Then, for k = 0 we have (T0J)(x) = J(x).

17 2. Introduction to Infinite Horizon Problems
Some Basic Properties. Monotonicity Lemma: For any functions J: S → ℜ and J′: S → ℜ such that J(x) ≤ J′(x) for all x ∈ S, and for any function μ: S → C with μ(x) ∈ U(x) for all x ∈ S, we have (TkJ)(x) ≤ (TkJ′)(x) and (TμkJ)(x) ≤ (TμkJ′)(x), for all x ∈ S and k = 1, 2, ….

18 2. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas). Definition 1.4.1: A mapping H: B(S) → B(S) is said to be a contraction mapping if there exists a scalar ρ < 1 such that ||HJ − HJ′|| ≤ ρ ||J − J′||, for all J, J′ ∈ B(S), where || ∙ || is the maximum norm ||J|| = max x∈S |J(x)|. It is said to be an m-stage contraction mapping if there exist a positive integer m and some ρ < 1 such that ||HmJ − HmJ′|| ≤ ρ ||J − J′||, for all J, J′ ∈ B(S), where Hm denotes the composition of H with itself m times. Note: B(S) is the set of all bounded real-valued functions on S, i.e., all functions J: S → ℜ with ||J|| < ∞.

19 2. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas). Proposition 1.4.1 (Contraction Mapping Fixed-Point Theorem): If H: B(S) → B(S) is a contraction mapping or an m-stage contraction mapping, then there exists a unique fixed point of H; i.e., there exists a unique function J* ∈ B(S) such that HJ* = J*. Furthermore, if J is any function in B(S) and Hk is the composition of H with itself k times, then lim k→∞ ||HkJ − J*|| = 0.

20 3. Stochastic Control Overview
State Equation: Let’s consider a discrete-time dynamic system that, at each time t, takes on a state xt and evolves according to xt+1 = f(xt, at, wt), where wt is a disturbance (iid) and at is a control decision. We restrict attention to finite state, disturbance, and control spaces, denoted by X, W, and A, respectively. Value Function: Let r: X × A → ℜ associate a reward r(xt, at) with a decision at made at state xt. Let μ be a stationary policy, with μ: X → A. For each policy μ we define a value function V(∙, μ): X → ℜ by V(x, μ) = E[ Σt=0..∞ αt r(xt, μ(xt)) | x0 = x ], where α ∈ (0,1) is a discount factor.

21 3. Stochastic Control Overview
Optimal Value Function: we define the optimal value function V by V(x) = max μ V(x, μ). From dynamic programming, we have that any stationary policy μ* given by μ*(x) ∈ argmax a∈A E[ r(x, a) + α V(f(x, a, w)) ] is optimal, in the sense that V(x, μ*) = V(x) for all states x.

22 3. Stochastic Control Overview
Example 1: Tetris The video arcade game of Tetris can be viewed as an instance of stochastic control. In particular, we can view the state xt as an encoding of the current “wall of bricks” and the shape of the current “falling piece.” The decision at identifies an orientation and horizontal position for placement of the falling piece onto the wall. Though the arcade game employs a more complicated scoring system, consider for simplicity a reward r(xt, at) equal to the number of rows eliminated by placing the piece in the position described by at. Then, a stationary policy that maximizes the value essentially optimizes a combination of present and future row elimination, with decreasing emphasis placed on rows to be eliminated at times farther into the future.

23 3. Stochastic Control Overview
Example 1: Tetris, cont. Tetris was first programmed by Alexey Pajitnov, Dmitry Pavlovsky, and Vadim Gerasimov, computer engineers at the Computer Center of the Russian Academy of Sciences. (Figure: Tetris board of height h = 20 and width w = 10, the standard piece shapes, and the number of states.)

24 3. Stochastic Control Overview
Dynamic programming algorithms compute the optimal value function V. The result is stored in a “look-up” table with one entry V(x) per state x ∈ X. When required, the value function is used to generate optimal decisions. For example, given a current state xt ∈ X, a decision at is selected according to at ∈ argmax a∈A E[ r(xt, a) + α V(f(xt, a, w)) ].
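A minimal sketch of this decision rule, assuming a small made-up problem in which the lookup table V, the dynamics f, the reward r, and the disturbance probabilities are all illustrative placeholders:

```python
import numpy as np

# Illustrative lookup-table setup: 4 states, 2 decisions, 2 disturbance values.
V = np.array([1.0, 0.5, 2.0, 0.0])            # stored value per state x in X
f = lambda x, a, w: (x + a + w) % 4           # hypothetical dynamics f(x, a, w)
r = lambda x, a: float(a == x % 2)            # hypothetical reward r(x, a)
p_w, alpha = [0.7, 0.3], 0.9                  # disturbance probabilities and discount

def greedy_decision(x):
    """Select a_t achieving max_a E_w[ r(x,a) + alpha * V(f(x,a,w)) ]."""
    q = [sum(p * (r(x, a) + alpha * V[f(x, a, w)]) for w, p in enumerate(p_w))
         for a in (0, 1)]
    return int(np.argmax(q))

print([greedy_decision(x) for x in range(4)])
```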

25 4. Control of Complex Systems
The main objective is the development of a methodology for the control of “complex systems”. Two common characteristics of this type of system are: i) An intractable state space: intractable state spaces preclude the use of classical DP, which computes and stores one numerical value per state. ii) Severe nonlinearities: methods of traditional linear control, which are applicable even in large state spaces, are ruled out by severe nonlinearities. Let’s review some examples of complex systems where NDP could be, and has been, applied.

26 4. Control of Complex Systems
a) Call Admission and Routing. With rising demand in telecommunication network resources, effective management is as important as ever. Admission (deciding which calls to accept/reject) and routing (allocating links in the network to particular calls) are examples of decisions that must be made at any point in time. The objective is to make the “best” use of limited network resources. In principle, such sequential decision problems can be addressed by dynamic programming. Unfortunately, the enormous state spaces involved render dynamic programming algorithms inapplicable, and heuristic control strategies are used in their place.
b) Strategic Asset Allocation. Strategic asset allocation is the problem of distributing an investor’s wealth among assets in the market in order to take on a combination of risk and expected return that best suits the investor’s preferences. In general, the optimal strategy involves dynamic rebalancing of wealth among assets over time. If each asset offers a fixed rate of risk and return, and some additional simplifying assumptions are made, the only state variable is wealth, and the problem can be solved efficiently by dynamic programming algorithms. There are even closed-form solutions in cases involving certain types of investor preferences. However, in the more realistic setting involving risks and returns that fluctuate with economic conditions, economic indicators must be taken into account as state variables, and this quickly leads to an intractable state space. The design of effective strategies in such situations constitutes an important challenge in the growing field of financial engineering.

27 4. Control of Complex Systems
c) Supply-Chain Management. With today’s tight vertical integration, increased production complexity, and diversification, the inventory flow within and among corporations can be viewed as a complex network, called a supply chain, consisting of storage, production, and distribution sites. In a supply chain, raw materials and parts from external vendors are processed through several stages to produce finished goods. Finished goods are then transported to distributors, then to wholesalers, and finally retailers, before reaching customers. The goal in supply-chain management is to achieve a particular level of product availability while minimizing costs. The solution is a policy that decides how much to order or produce at various sites given the present state of the company and the operating environment.
d) Emissions Reductions. The threat of global warming that may result from accumulation of carbon dioxide and other “greenhouse gases” poses a serious dilemma. In particular, cuts in emission levels bear a detrimental short-term impact on economic growth. At the same time, a depleting environment can severely hurt the economy, especially the agricultural sector, in the longer term. To complicate the matter further, scientific evidence on the relationship between emission levels and global warming is inconclusive, leading to uncertainty about the benefits of various cuts. One systematic approach to considering these conflicting goals involves the formulation of a dynamic system model that describes our understanding of economic growth and environmental science. Given such a model, the design of environmental policy amounts to dynamic programming. Unfortunately, classical algorithms are inapplicable due to the size of the state space.

28 4. Control of Complex Systems
e) Semiconductor Wafer Fabrication. The manufacturing floor at a semiconductor wafer fabrication facility is organized into service stations, each equipped with specialized machinery. There is a single stream of jobs arriving on a production floor. Each job follows a deterministic route that revisits the same station multiple times. This leads to a scheduling problem where, at any time, each station must select a job to service such that (long-term) production capacity is maximized. Such a system can be viewed as a special class of queueing network, a model suitable for a variety of applications in manufacturing, telecommunications, and computer systems. Optimal control of queueing networks is notoriously difficult, and this reputation is strengthened by formal characterizations of computational complexity.
Other systems: parking lots, football, game strategy, combinatorial optimization, maintenance and repair, dynamic channel allocation, backgammon.
Some papers on applications:
-Tsitsiklis, J. and Van Roy, B., “Neuro-Dynamic Programming Overview and a Case Study in Optimal Stopping,” IEEE Proceedings of the 36th Conference on Decision & Control, San Diego, California, December 1997.
-Van Roy, B., Bertsekas, D.P., Lee, Y., and Tsitsiklis, J., “A Neuro-Dynamic Programming Approach to Retailer Inventory Management,” IEEE Proceedings of the 36th Conference on Decision & Control, San Diego, California, December 1997.
-Marbach, P. and Tsitsiklis, J., “A Neuro-Dynamic Programming Approach to Admission Control in ATM Networks: The Single Link Case,” Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997.
-Marbach, P., Mihatsch, O., and Tsitsiklis, J., “Call Admission Control and Routing in Integrated Services Networks Using Reinforcement Learning,” IEEE Proceedings of the 37th Conference on Decision & Control, Tampa, Florida, December 1998.
-Bertsekas, D.P. and Homer, M.L., “Missile Defense and Interceptor Allocation by Neuro-Dynamic Programming,” IEEE Transactions on Systems, Man and Cybernetics, vol. 30, 2000.

29 4. Control of Complex Systems
-For the examples presented, state spaces are intractable as a consequence of the “curse of dimensionality,” that is, state spaces grow exponentially in the number of state variables → it is difficult (if not impossible) to compute and store one value per state, as is required by classical DP.
-An additional shortcoming of classical DP: the computations require the use of transition probabilities → for many complex systems, such probabilities are not readily accessible. On the other hand, it is often easier to develop simulation models of the system and generate sample trajectories.
-Objective of NDP: overcoming the curse of dimensionality through the use of parameterized (value) function approximators and through the use of output generated by simulators, rather than explicit transition probabilities.

30 5. Value Function Approximation
-Intractability of state spaces → value function approximation.
-Two important pre-conditions for the development of effective approximations: i) choose a parameterization* that yields a good approximation; ii) algorithms for computing appropriate parameter values.
*Note: the choice of a suitable parameterization requires some practical experience or theoretical analysis that provides rough information about the shape of the function to be approximated.

31 5. Value Function Approximation
Linear parameterization*. General classes of parameterizations have found use in NDP; to keep the exposition simple, let us focus on linear parameterizations, which take the form Ṽ(x, u) = Σk=1..K u(k) φk(x), where φ1, …, φK are “basis functions” mapping X to ℜ and u = (u(1), …, u(K))′ is a vector of scalar weights. As in statistical regression, the basis functions φ1, …, φK are selected by a human, based on intuition or analysis of the problem at hand. Hint: one interpretation that is useful for the construction of basis functions involves viewing each function φk as a “feature”, that is, a numerical value capturing a salient characteristic of the state that may be pertinent to effective decision making.
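As a small sketch of such a linear architecture, with made-up basis functions over a one-dimensional state (purely illustrative; real applications would use problem-specific features):

```python
import numpy as np

def features(x):
    """Hypothetical basis functions phi_1..phi_K evaluated at a state x (here x is a float)."""
    return np.array([1.0, x, x ** 2])          # K = 3 basis functions, including a constant

def V_tilde(x, u):
    """Linear approximation V~(x, u) = sum_k u(k) * phi_k(x)."""
    return float(np.dot(u, features(x)))

u = np.array([0.5, -1.0, 0.2])                 # weight vector to be tuned by TD or regression
print(V_tilde(2.0, u))                         # evaluate the approximation at x = 2.0
```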

32 5. Value Function Approximation
Example 2: Tetris, continuation. In our stochastic control formulation of Tetris, the state is an encoding of the current wall configuration and the current falling piece. There are clearly too many states for exact dynamic programming algorithms to be applicable. However, we may believe that most information relevant to game-playing decisions can be captured by a few intuitive features. In particular, one feature, say φ1, may map states to the height of the wall. Another, say φ2, could map states to a measure of “jaggedness” of the wall. A third might provide a scalar encoding of the type of the current falling piece (there are seven different shapes in the arcade game). Given a collection of such features, the next task is to select weights u(1), …, u(K) such that Σk u(k) φk(x) ≈ V(x) for all states x. This approximation could then be used to generate a game-playing strategy.

33 5. Value Function Approximation
Example 2: Tetris, continuation. A similar approach is presented in the book “Neuro-Dynamic Programming” (chapter 8, case studies) by D.P. Bertsekas and J. Tsitsiklis, with the following parameterization, arrived at after some experimentation:
-The height hk of the kth column of the wall. There are w such features, where w is the wall’s width.
-The absolute difference |hk − hk+1| between the heights of the kth and (k+1)st columns, k = 1, …, w−1.
-The maximum wall height, maxk hk.
-The number of holes L in the wall, that is, the number of empty positions of the wall that are surrounded by full positions.

34 5. Value Function Approximation
Example 2: Tetris, continuation. Thus, there are 2w+1 features, which, together with a constant offset, require 2w+2 weights in a linear architecture of the form Ṽ(x, u) = Σk=1..2w+1 u(k) φk(x) + u(2w+2), where the last weight multiplies the constant offset. Using this parameterization, with w = 10 (22 features), a strategy is generated by NDP that eliminates an average of 3554 rows per game, reflecting performance comparable to that of an expert player.
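A sketch of how this 2w+2-dimensional feature vector could be computed, assuming the wall is represented as a boolean array with True for filled cells and the bottom row first; the data layout, like the hole-counting convention used here (empty cells below each column top), is an illustrative assumption:

```python
import numpy as np

def tetris_features(wall):
    """wall: boolean array (h, w), True = filled cell, row 0 at the bottom (assumed layout).
    Returns the 2w+2 features: column heights, adjacent height differences,
    maximum height, number of holes, and a constant offset."""
    h, w = wall.shape
    heights = np.array([h - np.argmax(wall[::-1, k]) if wall[:, k].any() else 0
                        for k in range(w)])                            # h_k, k = 1..w
    diffs = np.abs(np.diff(heights))                                   # |h_k - h_{k+1}|, w-1 values
    max_height = heights.max()                                         # max_k h_k
    holes = sum(int(np.sum(~wall[:heights[k], k])) for k in range(w))  # empty cells under column tops
    return np.concatenate([heights, diffs, [max_height, holes, 1.0]])  # 2w+2 entries incl. offset

wall = np.zeros((20, 10), dtype=bool)
wall[0, :4] = True                                                     # a few filled cells in the bottom row
print(tetris_features(wall).shape)                                     # (22,) for w = 10
```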

35 6. Temporal-Difference Learning
Introduction to Temporal-Difference Learning. Material from: Richard Sutton, “Learning to Predict by the Methods of Temporal Differences,” Machine Learning, 3:9–44, 1988 → this paper provides the first formal results in the theory of temporal-difference (TD) methods.
-“Learning to predict”: use of past experience (historical information) with an incompletely known system to predict its future behavior. “Learning to predict” is one of the most basic and prevalent kinds of learning. In prediction learning, training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required.
-Conventional prediction-learning methods (Widrow-Hoff, LMS, Delta Rule, Backpropagation): driven by the error between predicted and actual outcomes.

36 6. Temporal-Difference Learning
-TD methods: driven by the error or difference between temporally successive predictions → learning occurs whenever there is a change in prediction over time.
-Advantages of TD methods over conventional methods: they are more incremental, and therefore easier to compute; they tend to make more efficient use of their experience: they converge faster and produce better predictions.
-TD approach: predictions are based on numerical features combined using adjustable parameters or “weights” → similar to connectionist models (neural networks).

37 6. Temporal-Difference Learning
-TD and supervised-learning approaches to prediction: historically, the most important learning paradigm has been supervised learning: the learner is asked to associate pairs of items (input, output). Supervised learning has been used in pattern classification, concept acquisition, learning from examples, system identification, and associative memory.
(Block diagram: the input is fed both to the real system, producing the real output, and to the learning algorithm, producing an estimated output; the error between the two outputs is used to adjust the estimator parameters.)

38 6. Temporal-Difference Learning
-Single-step and multi-step prediction problems: Single-step: all information about the correctness of each prediction is revealed at once → supervised-learning methods. Multi-step: correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step → TD learning methods.
-Computational issues: Sutton introduces a particular TD procedure by formally relating it to a classical supervised-learning procedure, the Widrow-Hoff rule (also known as the “delta rule”, the ADALINE (Adaptive Linear Element), and the Least Mean Squares (LMS) filter). We consider multi-step prediction problems in which experience comes in observation-outcome sequences of the form x1, x2, x3, …, xm, z, where each xt is a vector of observations available at time t in the sequence, and z is the outcome of the sequence. Also, xt ∈ ℜn and z ∈ ℜ.

39 6. Temporal-Difference Learning
-Computational issues (cont.): For each observation-outcome sequence, the learner produces a corresponding sequence of predictions P1, P2, P3, …, Pm, each of which is an estimate of z. Predictions Pt are based on a vector of modifiable parameters w → Pt(xt, w). All learning procedures are expressed as rules for updating w. For each observation, an increment to w, denoted Δwt, is determined. After a complete sequence has been processed, w is changed by the sum of all the sequence’s increments: w := w + Σt=1..m Δwt. The supervised-learning approach treats each sequence of observations and its outcome as a sequence of observation-outcome pairs: (x1, z), (x2, z), …, (xm, z). In this case the increment due to time t depends on the error between Pt and z, and on how changing w will affect Pt.

40 6. Temporal-Difference Learning
-Computational issues (cont.): Then, a prototypical supervised-learning update procedure is Δwt = α (z − Pt) ∇wPt, where α is a positive parameter affecting the rate of learning, and the gradient ∇wPt is the vector of partial derivatives of Pt with respect to each component of w. Special case: consider Pt a linear function of xt and w, that is, Pt = wᵀxt = Σi w(i) xt(i), where w(i) and xt(i) are the ith components of w and xt. Then ∇wPt = xt, and thus Δwt = α (z − Pt) xt, which corresponds to the Widrow-Hoff rule. This equation depends critically on z, and thus cannot be evaluated until the end of the sequence, when z becomes known. All observations and predictions made during a sequence must be remembered until its end: Δwt cannot be computed incrementally.
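A minimal sketch of this supervised (Widrow-Hoff) update for the linear case, applied only after a complete sequence has been observed; the learning rate and the example data are illustrative:

```python
import numpy as np

def widrow_hoff_update(w, xs, z, eta=0.1):
    """Supervised-learning (Widrow-Hoff / LMS) update for a linear predictor P_t = w . x_t.
    Every increment depends on the final outcome z, so the increments can only be
    applied once the sequence x_1..x_m has ended."""
    delta_w = np.zeros_like(w)
    for x_t in xs:                        # one increment per observation in the sequence
        P_t = np.dot(w, x_t)
        delta_w += eta * (z - P_t) * x_t  # grad_w P_t = x_t in the linear case
    return w + delta_w

w = np.zeros(3)
xs = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])]
print(widrow_hoff_update(w, xs, z=1.0))
```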

41 6. Temporal-Difference Learning
-TD procedure: There is a temporal-difference procedure that produces exactly the same result and can be computed incrementally. The key is to represent the error z − Pt as a sum of changes in predictions, as follows: z − Pt = Σk=t..m (Pk+1 − Pk), with the convention Pm+1 = z. Using this equation and the prototypical supervised-learning equation, we have Δwt = α (Pt+1 − Pt) Σk=1..t ∇wPk. This equation can be computed incrementally, because it depends only on a pair of successive predictions and on the sum of all past values of the gradient. We refer to this procedure as TD(1).

42 6. Temporal-Difference Learning
The TD() family of learning procedures: The “hallmark” of temporal-difference methods is their sensitivity to changes in successive predictions rather than overall error between predictions and the final outcome. In response to an increase (decrease) in prediction from Pt to Pt+1 , an increment wt is determined that increases (decreases) the predictions for some or all of the preceding observations vectors x1, …,xt. TD(1) is a special case where all the predictions are altered to an equal extent. Now, consider the case where greater alterations are made to more recent predictions. We consider an exponential weighting with recency, in which alterations to the predictions of observation vectors occurring k steps in the past are weighted according to k for 0    1: t-k  =1, TD(1) 1  =0 TD(0) 0 <  < 1 k increases t-k Neuro-Dynamic Programming - ECECS 841- Advanced Decision Processes - J.Ramirez & Y. Liao

43 6. Temporal-Difference Learning
The TD() family of learning procedures: For  =0 we have the TD(0) procedure: For  =1 we have the TD(1) procedure, that is equivalent to the Widrow-Hoff rule, except that TD(1) is incremental: Alterations of past predictions can be weighted in ways other than the exponential form given previously, let Also referred in literature as eligibility vectors. Neuro-Dynamic Programming - ECECS 841- Advanced Decision Processes - J.Ramirez & Y. Liao

44 6. Temporal-Difference Learning
(Material taken from Neuro-Dynamic Programming, Chapter 5.) Monte Carlo Simulation: brief overview. Suppose v is a random variable with unknown mean m that we want to estimate. Using Monte Carlo simulation to estimate m: generate a number of samples v1, v2, …, vN, and then estimate m by forming the sample mean MN = (1/N) Σi=1..N vi. Also, we can compute the sample mean recursively: MN = MN−1 + (1/N)(vN − MN−1), with M1 = v1.
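A quick sketch of the recursive sample mean against made-up samples (the distribution is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=1000)  # v_1..v_N with unknown mean m = 2

M = 0.0
for N, v in enumerate(samples, start=1):
    M += (v - M) / N                                  # M_N = M_{N-1} + (v_N - M_{N-1}) / N ; M_1 = v_1
print(M, samples.mean())                              # recursive and batch sample means coincide
```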

45 6. Temporal-Difference Learning
Monte Carlo simulation: case of iid samples. Suppose the N samples v1, v2, …, vN are independent and identically distributed, with mean m and variance σ². Then we have E[MN] = m, so MN is said to be an unbiased estimator of m. Also, its variance is given by var(MN) = σ²/N. As N → ∞ the variance of MN converges to zero → MN converges to m. Also, the strong law of large numbers provides an additional property: the sequence MN converges to m with probability one → the estimator is consistent.

46 6. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation. Consider the stochastic shortest path problem, with state space {0, 1, 2, …, n}, where 0 is an absorbing, cost-free state. Let g(i, j) be the cost of a transition from i to j (given the control action μ(i) and transition probabilities pij(μ(i))). Suppose that we have a fixed stationary policy μ (proper) and we want to calculate, using simulation, the corresponding cost-to-go vector Jμ′ = ( Jμ(1), Jμ(2), …, Jμ(n) ). Approach: generate, starting from each i, many sample state trajectories and average the corresponding costs to obtain an approximation to Jμ(i). Instead of doing this for each state i, let’s use each trajectory to obtain cost samples for all states visited by the trajectory, considering the cost of the trajectory portion that starts at each intermediate state.

47 6. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation. Suppose that a number of simulation runs are performed, each ending at the termination state 0. Consider the m-th time a given state i0 is encountered, and let (i0, i1, …, iN) be the remainder of the corresponding trajectory, where iN = 0. Then, let c(i0, m) be the corresponding cumulative cost up to reaching state 0: c(i0, m) = g(i0, i1) + g(i1, i2) + … + g(iN−1, iN). Some assumptions: different simulated trajectories are statistically independent, and each trajectory is generated according to the Markov process determined by the policy μ. Then we have Jμ(i0) = E[ c(i0, m) ].

48 6. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation. The estimate of Jμ(i) is obtained by forming the sample mean J(i) = (1/K) Σm=1..K c(i, m), subsequent to the Kth encounter with state i. The sample mean can be expressed in iterative form: J(i) := J(i) + γm ( c(i, m) − J(i) ), m = 1, 2, …, K, where γm = 1/m, starting with J(i) = 0.

49 6. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation. Consider the trajectory (i0, i1, …, iN), and let k be an integer such that 1 ≤ k ≤ N. The trajectory contains the subtrajectory (ik, ik+1, …, iN) → a sample trajectory with initial state ik that can be used to update J(ik) using the iterative equation previously presented. Algorithm*: run a simulation and generate the state trajectory (i0, i1, …, iN); update the estimates J(ik) for each k = 0, 1, …, N−1 with the formula J(ik) := J(ik) + γ(ik)( g(ik, ik+1) + g(ik+1, ik+2) + … + g(iN−1, iN) − J(ik) ). The step size γ(ik) can change from one iteration to the next. *Additional details → Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, chapter 5.
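A minimal sketch of this every-visit Monte Carlo evaluation for an illustrative stochastic shortest path problem; the transition probabilities and costs under the fixed policy are made up, and the step size is the 1/m sample-mean choice described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                                                  # states 1..n, plus absorbing state 0
P = np.array([[0.0, 0.0, 0.0, 0.0],                    # illustrative transition matrix under a
              [0.5, 0.0, 0.5, 0.0],                    # fixed proper policy mu (row i -> state j)
              [0.3, 0.3, 0.0, 0.4],
              [0.6, 0.0, 0.4, 0.0]])
g = lambda i, j: 1.0                                   # cost per transition (illustrative)

J = np.zeros(n + 1)                                    # cost-to-go estimates, J[0] = 0
visits = np.zeros(n + 1)

for run in range(5000):                                # each run ends at the termination state 0
    traj = [rng.integers(1, n + 1)]                    # random initial state
    while traj[-1] != 0:
        traj.append(rng.choice(n + 1, p=P[traj[-1]]))
    costs = [g(traj[k], traj[k + 1]) for k in range(len(traj) - 1)]
    for k in range(len(traj) - 1):                     # every intermediate state yields a cost sample
        i = traj[k]
        visits[i] += 1
        sample = sum(costs[k:])                        # cost of the trajectory portion starting at i_k
        J[i] += (sample - J[i]) / visits[i]            # iterative sample mean, step size 1/m
print(J[1:])
```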

50 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences. Here we consider an implementation of the Monte Carlo policy evaluation algorithm that incrementally updates the cost-to-go estimates J(i). First, for any trajectory i0, i1, …, iN with iN = 0, adopt the conventions ik = 0 for k > N, g(ik, ik+1) = 0 for k ≥ N, and J(0) = 0. Also, the policy under consideration is proper. Let’s rewrite the previous formula in the following form: J(ik) := J(ik) + γ(ik)[ ( g(ik, ik+1) + J(ik+1) − J(ik) ) + ( g(ik+1, ik+2) + J(ik+2) − J(ik+1) ) + … + ( g(iN−1, iN) + J(iN) − J(iN−1) ) ]. Note that we use the property J(iN) = 0.

51 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences. Equivalently, we can rewrite the previous equation as J(ik) := J(ik) + γ(ik)( dk + dk+1 + … + dN−1 ), where dm = g(im, im+1) + J(im+1) − J(im) are called temporal differences (TD). The temporal difference dm represents the difference between the estimate g(im, im+1) + J(im+1) of the cost-to-go based on the simulated outcome of the current stage, and the current estimate J(im).

52 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences. Advantage: the estimates can be computed incrementally. The temporal difference dl appears in the update formula for J(ik) for every k ≤ l; hence, as soon as the transition to il+1 has been simulated and dl becomes available, we can carry out the update J(ik) := J(ik) + γ(ik) dl, for all k ≤ l.

53 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). Here we introduce the TD(λ) algorithm as a stochastic approximation method for solving a suitably reformulated Bellman equation. The Monte Carlo evaluation algorithm can be viewed as a Robbins-Monro stochastic approximation method (more details in chapter 4 of Neuro-Dynamic Programming) for solving the above equations for the unknowns J(ik), as ik ranges over the states in the state space. Other algorithms can be generated in a similar way, e.g., starting from other systems of equations involving J and then replacing expectations by single-sample estimates. For example, we can start from Bellman’s equation for the policy μ, J(i) = E[ g(i, j) + J(j) ], where j denotes the successor state.

54 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). From the one-step Bellman equation, the stochastic approximation method takes the form J(ik) := (1 − γ) J(ik) + γ ( g(ik, ik+1) + J(ik+1) ), which is updated each time state ik is visited. Let us now take a fixed nonnegative integer l and take into consideration the costs of the first l+1 transitions; the stochastic algorithm could then be based on the (l+1)-step Bellman equation J(i) = E[ g(i0, i1) + g(i1, i2) + … + g(il, il+1) + J(il+1) | i0 = i ]. Without any special knowledge to select one value of l over another, we consider forming a weighted average of all possible multistep Bellman equations. Specifically, we fix some λ < 1, multiply the (l+1)-step equation by (1 − λ)λl, and sum over all nonnegative l.

55 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). This yields a λ-weighted Bellman equation. Interchanging the order of the two summations, and using the fact that Σl≥m (1 − λ)λl = λm, we obtain an equivalent form of the equation, which is expressed in terms of temporal differences on the next slide.

56 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). The previous equation can be expressed in terms of the temporal differences as J(i) = J(i) + E[ Σm≥0 λm dm | i0 = i ], where dm = g(im, im+1) + J(im+1) − J(im) are the temporal differences and E{dm} = 0 for all m (Bellman’s equation). The Robbins-Monro stochastic approximation method equivalent to the previous equation is J(ik) := J(ik) + γ Σm≥k λm−k dm, where γ is a stepsize parameter (which can change from iteration to iteration). The above equation provides us with a family of algorithms, one for each choice of λ, known as TD(λ). Note that for λ = 1 we recover the Monte Carlo policy evaluation method, also called TD(1).
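A sketch of the resulting tabular, on-line TD(λ) iteration for the same illustrative stochastic shortest path problem used above; each temporal difference is applied, as soon as the transition is simulated, to all previously visited states on the trajectory with λ-decaying eligibility coefficients (the constant step size and problem data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, gamma = 3, 0.7, 0.05                            # lambda and a constant step size (illustrative)
P = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.3, 0.3, 0.0, 0.4],
              [0.6, 0.0, 0.4, 0.0]])
g = lambda i, j: 1.0
J = np.zeros(n + 1)                                     # J[0] = 0 for the cost-free termination state

for run in range(20000):
    i = rng.integers(1, n + 1)
    z = np.zeros(n + 1)                                 # eligibility of each state on this trajectory
    while i != 0:
        j = rng.choice(n + 1, p=P[i])
        d = g(i, j) + J[j] - J[i]                       # temporal difference d_k
        z[i] += 1.0                                     # state i was just visited
        J += gamma * d * z                              # earlier states get lambda-weighted credit
        z *= lam                                        # decay eligibilities by lambda
        i = j
print(J[1:])
```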

57 6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). Also, for λ = 0 we have another limiting case; using the convention 0⁰ = 1, the TD(0) method is J(ik) := J(ik) + γ dk = J(ik) + γ ( g(ik, ik+1) + J(ik+1) − J(ik) ). This coincides with the stochastic approximation method based on the one-step Bellman equation previously presented. Off-line and on-line variants: when all of the updates are carried out simultaneously, after the entire trajectory has been simulated, we have the off-line version of TD(λ); when the updates are carried out one temporal difference at a time, as the trajectory is being generated, we have the on-line version of TD(λ).

58 Temporal-Difference Learning (TD(λ)):
Discounted Problem. The development is repeated starting from the (l+1)-step Bellman equation, which now includes the discount factor α. Specifically, we fix some λ < 1, multiply by (1 − λ)λl, and sum over all nonnegative l.

59 Temporal-Difference Learning (TD(λ)):
Interchanging the order of the two summations, and using the fact that Σl≥m (1 − λ)λl = λm, we obtain the discounted counterpart of the previous equation.

60 Temporal-Difference Learning (TD(λ)):
In terms of the temporal differences, now defined by dm = g(im, im+1) + α J(im+1) − J(im), we obtain the corresponding λ-weighted equation. Again we have E{dm} = 0 for all m.

61 Temporal-Difference Learning (TD(λ)):
From here on, the development is entirely similar to the development for the undiscounted case. The only differences are that the discount factor α enters the definition of the temporal difference and that λ is replaced by αλ. In particular, we have J(ik) := J(ik) + γ Σm≥k (αλ)m−k dm.

62 Temporal-Difference Learning (TD(λ)):
Approximation (linear): using temporal differences to tune basis-function weights. Topics: value function for autonomous systems; controlled systems (approximate policy iteration, controlled TD, Q-functions); relationship with approximate value iteration; historical view.

63 Value function Autonomous systems
Problem formulation. Autonomous process: xt+1 = f(xt, wt). Value function: V(x) = E[ Σt=0..∞ αt r(xt) | x0 = x ], where r(xt) is a scalar reward and α ∈ (0,1) is a discount factor. Linear approximation: Ṽ(x, u) = Σk=1..K u(k) φk(x), where {φ1, …, φK} is a collection of basis functions.

64 Value function Autonomous systems
Suppose that we observe a sequence of states x0, x1, x2, …; at time t the weight vector has been set to some value ut. The temporal difference corresponding to the transition from xt to xt+1 is dt = r(xt) + α Ṽ(xt+1, ut) − Ṽ(xt, ut). The term r(xt) + α Ṽ(xt+1, ut) is an “improved prediction” that incorporates knowledge of the reward and the next state, while Ṽ(xt, ut) is a prediction of V(xt) given our current approximation to the value function.

65 Value function Autonomous systems
Given an arbitrary initial weight vector u0, we would like the iterates to converge to a “correct” weight vector. The updating law of the weight vector is ut+1 = ut + γt dt zt, where γt is a scalar step size and zt is called the eligibility vector.

66 Value function Autonomous systems
The eligibility vector is defined as zt = Σk=0..t (αλ)t−k ∇uṼ(xk, ut) = Σk=0..t (αλ)t−k φ(xk), where φ(x) = (φ1(x), …, φK(x))′, providing a direction for the adjustment of ut such that the predictions at recently visited states move towards the improved prediction.

67 Value function Autonomous systems
Note that the eligibility vectors can be updated recursively according to zt = αλ zt−1 + φ(xt), with z−1 = 0.

68 Value function Autonomous systems
Consequently, when λ = 0 the updating law of the weight vector can be rewritten as ut+1 = ut + γt dt φ(xt). That means only the last state has an effect on the update. In the more general case of λ > 0, the update ut+1 = ut + γt dt zt combines a step size (γt), a “trigger” (the temporal difference dt), and a direction (the eligibility vector zt), so that all recently visited states are affected.
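A sketch of this TD(λ) iteration with a linear approximator for a made-up autonomous process; the dynamics, reward, basis functions, step size, and discount below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, lam, step = 0.95, 0.6, 0.01                       # discount, lambda, constant step size
phi = lambda x: np.array([1.0, x, x * x])                # basis functions phi_1..phi_K
r = lambda x: -abs(x)                                    # illustrative scalar reward r(x_t)
f = lambda x, w: 0.8 * x + w                             # illustrative autonomous dynamics f(x_t, w_t)

u = np.zeros(3)                                          # weight vector
z = np.zeros(3)                                          # eligibility vector
x = 0.0
for t in range(50000):
    x_next = f(x, rng.normal(scale=0.1))
    V = lambda s: float(np.dot(u, phi(s)))               # current approximation V~(s, u_t)
    d = r(x) + alpha * V(x_next) - V(x)                  # temporal difference d_t
    z = alpha * lam * z + phi(x)                         # z_t = alpha*lambda*z_{t-1} + phi(x_t)
    u = u + step * d * z                                 # u_{t+1} = u_t + step * d_t * z_t
    x = x_next
print(u)
```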

69 Convergence – Linear Approximators
Under appropriate technical conditions: i) For any λ ∈ [0,1], there exists a vector u(λ) such that the sequence ut generated by the algorithm converges (with probability one) to u(λ). ii) The limit of convergence u(λ) satisfies a bound on the approximation error of Ṽ(∙, u(λ)) in terms of the error of the best approximation achievable within the span of the basis functions. [59] J. N. Tsitsiklis and B. Van Roy, “An Analysis of Temporal-Difference Learning with Function Approximation,” IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [10] Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, chapter 6.

70 Value function Controlled systems
Unlike an autonomous system, a controlled system cannot be passively simulated and observed. Control decisions are required and influence the system’s dynamics. The objective here is to approximate the optimal value function of a controlled system.

71 Value function Controlled systems
Approximate Policy Iteration. A classical dynamic programming algorithm: policy iteration. Given the value function V(∙, μ) corresponding to a stationary policy μ, an improved policy μ̄ can be defined by μ̄(x) = argmax a∈A E[ r(x, a) + α V(f(x, a, w), μ) ]. In particular, V(x, μ̄) ≥ V(x, μ) for all x. Furthermore, a sequence of policies initialized with some arbitrary μ0 and updated according to this rule converges to an optimal policy μ*.

72 Value function Controlled systems
Approximate Policy Iteration. Start with an arbitrary initial stationary policy μ0. For each policy μm, apply temporal-difference learning to its value function V(∙, μm), generating a sequence of weight vectors that converges to some um with Ṽ(∙, um) ≈ V(∙, μm). Then select the next policy μm+1 such that μm+1(x) = argmax a∈A E[ r(x, a) + α Ṽ(f(x, a, w), um) ].

73 Value function Controlled systems
Approximate Policy Iteration involves two loops. External loop: once a converged weight vector has been found, update the present stationary policy. Internal loop: apply temporal-difference learning to generate each iterate of the weight vector → value function approximation. Initialization: an arbitrary stationary policy μ0.

74 Value function Controlled systems
Approximate Policy Iteration. A result from section 6.2 of [10] (D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996): if there exists some ε such that the approximation error of each policy evaluation is at most ε for all m, then the asymptotic performance loss of the generated policies is bounded by a constant (depending on the discount factor) times ε. The external sequence of policies does not always converge.

75 Value function Controlled systems
Controlled TD: arbitrarily initialize the weight vector u0 and the eligibility vector; at each time t, generate a decision at according to at = argmax a∈A E[ r(xt, a) + α Ṽ(f(xt, a, w), ut) ], where the expectation is over the disturbance w; then, after observing the transition to xt+1, update ut and the eligibility vector with the TD(λ) rules given above.

76 Value function Controlled systems
Big problem: convergence. A modification that has been found to be useful in practical applications involves adding “exploration noise” to the controls. One approach to this end involves randomizing decisions, by choosing at each time t a decision a, for a ∈ A, with a probability that is positive for every decision and is governed by a small scalar ε.
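One common randomization consistent with this description (every decision keeps positive probability, and the maximizing decision dominates as the scalar shrinks) is a Boltzmann/softmax rule over the one-step values; the exact formula used on the original slide is not reproduced, so the sketch below is an illustrative assumption:

```python
import numpy as np

def exploratory_decision(q_values, epsilon, rng):
    """Randomize the decision: choose action a with probability proportional to
    exp(q(a) / epsilon). Every action keeps positive probability, and as
    epsilon -> 0 the distribution concentrates on argmax_a q(a)."""
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / epsilon            # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

rng = np.random.default_rng(0)
print([exploratory_decision([1.0, 1.2, 0.5], eps, rng) for eps in (1.0, 0.1, 0.01)])
```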

77 Value function Controlled systems
Note: 1) every decision receives positive probability; 2) as ε → 0, the probability of the decision maximizing the one-step value tends to 1, so the randomized rule approaches the greedy rule (simple proof from the form of the randomization).

78 Q-Function
Given V, define a Q-function: Q(x, a) = E[ r(x, a) + α V(f(x, a, w)) ]; then V(x) = max a∈A Q(x, a), and a greedy decision at state x is one attaining argmax a∈A Q(x, a). Q-learning is a variant of temporal-difference learning that approximates Q-functions rather than value functions.

79 Q-Function
Q-learning: arbitrarily initialize the weights (or table entries) of the Q-function approximation Q̃; at each time t, generate a decision according to at = argmax a∈A Q̃(xt, a), and, after observing the transition and reward, update the approximation using the resulting temporal difference r(xt, at) + α max a∈A Q̃(xt+1, a) − Q̃(xt, at).
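A sketch of the basic tabular form of this update for a made-up environment; the slides discuss Q-learning with an approximated Q-function, while the tabular version below just illustrates the temporal-difference structure of the update (environment, step size, and exploration rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, alpha, step = 4, 2, 0.9, 0.1       # alpha = discount factor
# Illustrative environment: next state and reward as functions of (x, a).
next_state = lambda x, a: (x + a + 1) % n_states
reward = lambda x, a: 1.0 if x == n_states - 1 else 0.0

Q = np.zeros((n_states, n_actions))
x = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[x].argmax())  # simple exploration
    x_next = next_state(x, a)
    target = reward(x, a) + alpha * Q[x_next].max()      # improved prediction from the observed transition
    Q[x, a] += step * (target - Q[x, a])                 # Q-learning temporal-difference update
    x = x_next
print(Q)
```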

80 Q-Function As in the case of controlled TD, it is often desirable to incorporate exploration: randomize decisions by choosing at each time t a decision a, for a ∈ A, with a probability that is positive for every decision and is governed by a small scalar ε; as ε → 0, the probability of the maximizing decision tends to 1. The analysis of Q-learning bears many similarities with that of controlled TD, and results that apply to one can often be generalized in a straightforward way to accommodate the other.

81 Relationship with Approximate Value Iteration
The classical value iteration algorithm can be described compactly in terms of the “dynamic programming operator” T: Vk+1 = T Vk. Approximate value iteration replaces each iterate by an approximation within the chosen parameterization (e.g., a projection onto the span of the basis functions). Disadvantage: approximate value iteration need not possess fixed points, and therefore should not be expected to converge. In fact, even in cases where a fixed point exists, and even when the system is autonomous, the algorithm can generate a diverging sequence of weight vectors.

82 Relationship with Approximate Value Iteration

83 Relationship with Approximate Value Iteration
Controlled TD can be thought of as a stochastic approximation algorithm designed to converge on fixed points of approximate value iteration. Advantage: controlled TD uses simulation to effectively bypass the need to explicitly compute the projections required for approximate value iteration. This view applies directly to autonomous systems; for controlled systems, it is complicated by the possible introduction of exploration.

84 Historical View A long history and big names
Sutton: temporal-difference learning, based on earlier work by Barto and Sutton on models for classical conditioning phenomena observed in animal behavior, and by Barto, Sutton, and Anderson on “actor-critic methods”. Witten: the look-up table algorithm bears similarities with one proposed a decade earlier. Watkins: Q-learning was proposed in his thesis, and the study of temporal-difference learning was integrated with classical ideas from dynamic programming and stochastic approximation theory. The work of Werbos and of Barto, Bradtke, and Singh also contributed to this integration.

85 Historical View Application
Tesauro: a world-class Backgammon-playing program, where the practical potential of these methods was first demonstrated. Since then: channel allocation in cellular communication networks, elevator dispatching, inventory management, job-shop scheduling.

86 Actors and Critics: Averaged Rewards Independent actors
An actor is a parameterized class of policies.

87 Actors and Critics: Independent actors (cont.)
One stochastic gradient method, proposed by Marbach and Tsitsiklis, updates the actor’s parameters along an estimate of the gradient of the averaged reward with respect to those parameters. Using critic feedback: the critic supplies the information used to form an estimate of the gradient given the current parameters.


89 Bibliography
[1] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] R. S. Sutton, Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[4] R. S. Sutton, Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9–44, 1988.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[6] J. N. Tsitsiklis and B. Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

