
1 Reinforcement Learning: Dynamic Programming
Csaba Szepesvári, University of Alberta
Kioloa, MLSS'08

2 Reinforcement Learning
RL = "sampling-based methods to solve optimal control problems" (Rich Sutton)
Contents:
- Defining AI
- Markovian Decision Problems
- Dynamic Programming
- Approximate Dynamic Programming
- Generalizations

3 Literature
Books:
- Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
- Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
Journals: JMLR, MLJ, JAIR, AI
Conferences: NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

4 Some More Books
- Martin L. Puterman: Markov Decision Processes. Wiley.
- Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007).
- James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.

5 Resources
- RL-Glue
- RL-Library
- The RL Toolbox 2.0 (toolbox/general/overview.html)
- OpenDP
- RL-Competition (2008)! June 1st, 2008: test runs begin!
Related fields: operations research (MOR, OR), control theory (IEEE TAC, Automatica, IEEE CDC, ECC), simulation optimization (Winter Simulation Conference)

6 Abstract Control Model
[Diagram: the controller (= agent) sends actions to the environment and receives sensations (and reward) back: the "perception-action loop".]

7 Zooming in...
[Diagram: inside the agent, external sensations and reward feed a memory/state; internal sensations and the state drive the choice of actions.]

8 A Mathematical Model
Plant (controlled object): x_{t+1} = f(x_t, a_t, v_t), where x_t is the state and v_t is noise; z_t = g(x_t, w_t), where z_t is the sensation/observation and w_t is noise.
State: a sufficient statistic for the future, either independently of what we measure ("objective state"), or relative to the measurements ("subjective state").
Controller: a_t = F(z_1, z_2, ..., z_t), where a_t is the action (control) => PERCEPTION-ACTION LOOP, "CLOSED-LOOP CONTROL".
Design problem: F = ? Goal: r(z_t, a_t) → max.
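A hypothetical sketch of this perception-action loop in Python; the plant f, sensor g, and controller F below are toy stand-ins, not taken from the lecture:

```python
import numpy as np

# Toy closed-loop control: plant x_{t+1} = f(x_t, a_t, v_t), observation
# z_t = g(x_t, w_t), controller a_t = F(z_1, ..., z_t). All three maps are
# illustrative assumptions.
rng = np.random.default_rng(0)

f = lambda x, a, v: 0.8 * x + a + v        # linear plant driven by noise v_t
g = lambda x, w: x + w                     # noisy sensor
F = lambda zs: -0.5 * zs[-1]               # purely reactive feedback on z_t

x, zs = 1.0, []
for t in range(100):
    zs.append(g(x, rng.normal(scale=0.1)))   # sense
    a = F(zs)                                # decide
    x = f(x, a, rng.normal(scale=0.1))       # act: plant evolves
```

With this choice of F the closed loop is stable: the state contracts toward zero up to the noise level.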

9 A Classification of Controllers
- Feedforward: a_1, a_2, ... is designed ahead of time. ???
- Feedback:
  - Purely reactive systems: a_t = F(z_t). Why is this bad? (The current observation alone may not determine the state.)
  - Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) interprets the sensations; a_t = F(m_t) is the decision making (deliberative vs. reactive).

10 Feedback Controllers
Plant: x_{t+1} = f(x_t, a_t, v_t), z_t = g(x_t, w_t)
Controller: m_t = M(m_{t-1}, z_t, a_{t-1}), a_t = F(m_t)
m_t ≈ x_t: state estimation, "filtering"; difficulties: noise, unmodelled parts.
How do we compute a_t? With a model (f'): model-based control, which assumes (some kind of) state estimation. Without a model: model-free control.

11 Markovian Decision Problems

12 Markovian Decision Problems
An MDP is a tuple (X, A, p, r, γ):
- X: set of states
- A: set of actions (controls)
- p: transition probabilities p(y|x,a)
- r: rewards r(x,a,y), or r(x,a), or r(x)
- γ: discount factor, 0 ≤ γ < 1
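A made-up finite MDP can be written down directly in array form, following the slide's notation (the numbers below are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical two-state, two-action MDP: p[a, x, y] = p(y | x, a) and
# r[a, x, y] = r(x, a, y); gamma is the discount factor in [0, 1).
p = np.array([[[0.9, 0.1],    # action 0, from state 0
               [0.2, 0.8]],   # action 0, from state 1
              [[0.5, 0.5],    # action 1, from state 0
               [0.0, 1.0]]])  # action 1, from state 1
r = np.array([[[1.0, 0.0],
               [0.0, 0.0]],
              [[0.0, 2.0],
               [0.0, 1.0]]])
gamma = 0.9

# Each row p(.|x, a) must be a probability distribution over next states.
assert np.allclose(p.sum(axis=2), 1.0)
```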

13 The Process View
The process is (X_t, A_t, R_t): X_t is the state, A_t the action, and R_t the reward at time t.
Laws: X_{t+1} ~ p(.|X_t, A_t); A_t ~ π(.|H_t), where π is the policy and H_t = (X_t, A_{t-1}, R_{t-1}, ..., A_1, R_1, X_0) is the history; R_t = r(X_t, A_t, X_{t+1}).
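These laws translate directly into a simulator. A sketch with a made-up MDP and a uniformly random (memoryless) policy:

```python
import numpy as np

# Simulating the process (X_t, A_t, R_t): actions from a uniformly random
# policy, next states from p(.|X_t, A_t). The MDP below is made up.
rng = np.random.default_rng(0)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],    # p[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
r = lambda x, a, y: 1.0 if (x == 0 and y == 1) else 0.0   # toy r(x, a, y)

x, trajectory = 0, []
for t in range(10):
    a = int(rng.integers(2))               # A_t ~ pi(.|H_t), here uniform
    y = int(rng.choice(2, p=p[a, x]))      # X_{t+1} ~ p(.|X_t, A_t)
    trajectory.append((x, a, r(x, a, y)))  # (X_t, A_t, R_t = r(X_t, A_t, X_{t+1}))
    x = y
```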

14 The Control Problem
Value functions: V^π(x) = E^π[ Σ_{t≥0} γ^t R_t | X_0 = x ]
Optimal value function: V*(x) = sup_π V^π(x)
Optimal policy: π* such that V^{π*}(x) = V*(x) for all x

15 Applications of MDPs
Fields: operations research, econometrics, control, statistics, games, AI.
Example problems: optimal investments, replacement problems, option pricing, logistics and inventory management, active vision, production scheduling, dialogue control, bioreactor control, robotics (RoboCup soccer), driving, real-time load balancing, design of experiments (medical tests).

16 Variants of MDPs
- Discounted
- Undiscounted: stochastic shortest path
- Average reward
- Multiple criteria
- Minimax
- Games

17 MDP Problems
- Planning: the MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
- Learning: the MDP is unknown, but you are allowed to interact with it. Find an optimal policy π*!
- Optimal learning: while interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.

18 Solving MDPs: Dimensions
- Which problem? (planning, learning, optimal learning)
- Exact or approximate?
- Uses samples? Incremental?
- Uses value functions?
  - Yes: value-function based methods. Planning: DP, Random Discretization Method, FVI, ... Learning: Q-learning, actor-critic, ...
  - No: policy search methods. Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus).
- Representation: structured state (factored states, logical representation, ...); structured policy space (hierarchical methods).

19 Dynamic Programming

20 Richard Bellman
Control theory, systems analysis. Dynamic programming (RAND Corporation). Bellman equation, Bellman-Ford algorithm, Hamilton-Jacobi-Bellman equation, the "curse of dimensionality", invariant imbedding, Grönwall-Bellman inequality.

21 Bellman Operators
Let π: X → A be a stationary policy.
B(X) = { V | V: X → R, ||V||_∞ < ∞ }
T^π: B(X) → B(X), (T^π V)(x) = Σ_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
Theorem: T^π V^π = V^π.
Note: this is a linear system of equations: r^π + γ P^π V^π = V^π, so V^π = (I - γ P^π)^{-1} r^π.
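The linear-system view gives a direct way to evaluate a fixed policy. A sketch with a made-up two-state MDP (P, R, pi below are illustrative assumptions):

```python
import numpy as np

# Policy evaluation by solving (I - gamma P_pi) V_pi = r_pi.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward r(x, a)
pi = np.array([0, 1])                      # deterministic policy x -> a

S = P.shape[1]
P_pi = P[pi, np.arange(S)]                 # P_pi[x, y] = p(y|x, pi(x))
r_pi = R[pi, np.arange(S)]                 # r_pi[x]   = r(x, pi(x))
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# V_pi is the fixed point of T_pi: applying T_pi leaves it unchanged.
assert np.allclose(r_pi + gamma * P_pi @ V_pi, V_pi)
```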

22 Proof of T^π V^π = V^π
What you need to know:
- Linearity of expectation: E[A+B] = E[A] + E[B]
- Law of total expectation: E[Z] = Σ_x P(X=x) E[Z|X=x], and E[Z|U=u] = Σ_x P(X=x|U=u) E[Z|U=u,X=x]
- Markov property: E[f(X_1,X_2,...) | X_1=y, X_0=x] = E[f(X_1,X_2,...) | X_1=y]
V^π(x) = E^π[ Σ_{t≥0} γ^t R_t | X_0 = x ]
= Σ_y P(X_1=y|X_0=x) E^π[ Σ_{t≥0} γ^t R_t | X_0=x, X_1=y ] (by the law of total expectation)
= Σ_y p(y|x,π(x)) E^π[ Σ_{t≥0} γ^t R_t | X_0=x, X_1=y ] (since X_1 ~ p(.|X_0, π(X_0)))
= Σ_y p(y|x,π(x)) { E^π[ R_0 | X_0=x, X_1=y ] + γ E^π[ Σ_{t≥0} γ^t R_{t+1} | X_0=x, X_1=y ] } (by the linearity of expectation)
= Σ_y p(y|x,π(x)) { r(x,π(x),y) + γ V^π(y) } (using the definitions of r and V^π, and the Markov property)
= (T^π V^π)(x). (using the definition of T^π)

23 The Banach Fixed-Point Theorem
Let B = (B, ||.||) be a Banach space.
T: B_1 → B_2 is L-Lipschitz (L > 0) if for any U, V: ||T U - T V|| ≤ L ||U - V||.
T is a contraction if B_1 = B_2 and L < 1; L is then a contraction coefficient of T.
Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for all V_0 ∈ B, the iteration V_{k+1} = T V_k satisfies V_k → V with ||V_k - V|| = O(γ^k).
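The iteration in the theorem can be watched numerically. A sketch using an affine γ-contraction T V = r + γ P V with a made-up stochastic matrix P and reward vector r:

```python
import numpy as np

# Banach iteration V_{k+1} = T V_k for a gamma-contraction in the sup norm.
gamma = 0.9
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # a stochastic matrix
r = np.array([1.0, 0.0])
T = lambda V: r + gamma * P @ V            # gamma-contraction (P is stochastic)

V_fix = np.linalg.solve(np.eye(2) - gamma * P, r)   # the unique fixed point
V, errs = np.zeros(2), []
for k in range(50):
    V = T(V)
    errs.append(np.max(np.abs(V - V_fix)))

# ||V_k - V|| = O(gamma^k): each iteration contracts the error by gamma.
assert all(e1 <= gamma * e0 + 1e-12 for e0, e1 in zip(errs, errs[1:]))
```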

24 An Algebra for Contractions
Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
Def: If T is 1-Lipschitz, T is called a non-expansion.
Prop: M: B(X × A) → B(X), M(Q)(x) = max_a Q(x,a) is a non-expansion.
Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.

25 Policy Evaluations are Contractions
Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||.|| below.
Theorem: Let T^π be the policy evaluation operator of some policy π. Then T^π is a γ-contraction.
Corollary: V^π is the unique fixed point of T^π; V_{k+1} = T^π V_k → V^π for all V_0 ∈ B(X), with ||V_k - V^π|| = O(γ^k).

26 The Bellman Optimality Operator
Let T: B(X) → B(X) be defined by (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }.
Def: π is greedy w.r.t. V if T^π V = T V.
Prop: T is a γ-contraction.
Theorem (Bellman optimality equation): T V* = V*.
Proof sketch: Let V be the unique fixed point of T. For any policy π, T^π V ≤ T V = V, which implies V^π ≤ V; hence V* ≤ V. Now let π be greedy w.r.t. V, so T^π V = T V = V; then V^π = V, hence V ≤ V*. Therefore V = V*.

27 Value Iteration
Theorem: For any V_0 ∈ B(X), V_{k+1} = T V_k satisfies V_k → V*, and in particular ||V_k - V*|| = O(γ^k).
What happens when we stop "early"?
Theorem: Let π be greedy w.r.t. V. Then ||V^π - V*|| ≤ 2 ||T V - V|| / (1-γ).
Proof: ||V^π - V*|| ≤ ||V^π - V|| + ||V - V*|| ...
Corollary: In a finite MDP the number of policies is finite, so we can stop when ||V_k - T V_k|| ≤ Δ (1-γ)/2, where Δ = min{ ||V* - V^π|| : V^π ≠ V* }. This gives pseudo-polynomial complexity.
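A minimal value-iteration sketch with a sup-norm stopping test in the spirit of the slide; the MDP (P, R, gamma) and the tolerance tol are illustrative assumptions:

```python
import numpy as np

# Value iteration: V_{k+1} = T V_k, stopping when ||T V - V|| is small.
gamma, tol = 0.9, 1e-8
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward

def T(V):
    # (T V)(x) = max_a { R[a, x] + gamma * sum_y P[a, x, y] V(y) }
    return np.max(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)

V = np.zeros(2)
while np.max(np.abs(T(V) - V)) > tol:      # sup-norm stopping test
    V = T(V)
pi = np.argmax(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)  # greedy policy
```

The loop terminates because T is a γ-contraction, so the residual ||T V_k - V_k|| shrinks geometrically.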

28 Policy Improvement [Howard '60]
Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X.
Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X such that V(x) > U(x).
Theorem (Policy Improvement): Let π' be greedy w.r.t. V^π. Then V^{π'} ≥ V^π. If T V^π > V^π, then V^{π'} > V^π.

29 Policy Iteration
PolicyIteration(π):
  V ← V^π
  do {improvement}
    V' ← V
    let π be such that T^π V = T V   {π greedy w.r.t. V}
    V ← V^π
  while (V > V')
  return π
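A minimal policy-iteration sketch matching the loop above: exact policy evaluation by a linear solve, then greedy improvement, stopping when the greedy policy no longer changes. The MDP is made up:

```python
import numpy as np

# Howard's policy iteration on a hypothetical two-state, two-action MDP.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward
S = P.shape[1]

def evaluate(pi):
    # Solve (I - gamma P_pi) V = r_pi for V = V^pi (exact evaluation).
    P_pi, r_pi = P[pi, np.arange(S)], R[pi, np.arange(S)]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

pi = np.zeros(S, dtype=int)
while True:
    V = evaluate(pi)                       # V <- V^pi
    pi_new = np.argmax(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)
    if np.array_equal(pi_new, pi):         # greedy policy unchanged: optimal
        break
    pi = pi_new
```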

30 Policy Iteration Theorem
Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
Proof: Follows from the Policy Improvement Theorem.

31 Linear Programming
If V ≥ T V, then V ≥ V* = T V*. Hence V* is the smallest V that satisfies V ≥ T V.
V ≥ T V is equivalent to (*): V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) } for all x, a.
LinProg(V): minimize Σ_x V(x) subject to (*).
Theorem: LinProg(V) returns the optimal value function V*.
Corollary: pseudo-polynomial complexity.
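A sketch of this LP using SciPy's generic solver and a made-up MDP; the constraints (*) are stacked as inequality rows A_ub V ≤ b_ub:

```python
import numpy as np
from scipy.optimize import linprog

# LP for V*: minimise sum_x V(x) subject to, for every (x, a),
# V(x) - gamma * sum_y p(y|x,a) V(y) >= r(x,a). The MDP below is made up.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward
A, S = R.shape

# One inequality row per (a, x), rewritten as (gamma P_a - I) V <= -r_a.
A_ub = np.vstack([gamma * P[a] - np.eye(S) for a in range(A)])
b_ub = -R.reshape(-1)                      # same (a, x) ordering as A_ub
res = linprog(c=np.ones(S), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
V_star = res.x                             # the optimal value function
```

Note that V must be left unbounded below (linprog defaults to V ≥ 0), hence the explicit `bounds` argument.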

32 Variations of a Theme

33 Approximate Value Iteration
AVI: V_{k+1} = T V_k + ε_k.
AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k - V*|| ≤ 2γε/(1-γ). (e.g., [BT96])
Proof: Let a_k = ||V_k - V*||. Then a_{k+1} = ||V_{k+1} - V*|| = ||T V_k - T V* + ε_k|| ≤ γ ||V_k - V*|| + ε = γ a_k + ε. Hence (a_k) is bounded. Take the limsup of both sides: a ≤ γa + ε; reorder. //
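The bound can be checked numerically. A sketch where each backup is corrupted by bounded noise; the MDP and the noise model are illustrative assumptions:

```python
import numpy as np

# Approximate value iteration: V_{k+1} = T V_k + eps_k with ||eps_k|| <= eps.
gamma, eps = 0.9, 0.01
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
rng = np.random.default_rng(0)

def T(V):
    return np.max(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)

V_star = np.zeros(2)
for _ in range(600):                       # (near-)exact VI for reference
    V_star = T(V_star)

V = np.zeros(2)
for _ in range(600):
    V = T(V) + rng.uniform(-eps, eps, size=2)   # noisy backup, ||eps_k|| <= eps

# The residual error settles below eps/(1 - gamma), matching the
# contraction recursion a_{k+1} <= gamma a_k + eps.
assert np.max(np.abs(V - V_star)) <= eps / (1 - gamma) + 1e-6
```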

34 Fitted Value Iteration: Non-expansion Operators
FVI: Let A be a non-expansion and iterate V_{k+1} = A T V_k. Where does this converge to?
Theorem: Let U, V be such that A T U = U and T V = V. Then ||V - U|| ≤ ||A V - V|| / (1-γ).
Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1-γ). Since A U' = A T (A U'), U = A U'. Hence ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| ... [Gordon '95]

35 Application to Aggregation
Let Π be a partition of X and S(x) the unique cell that x belongs to. Let A: B(X) → B(X) be (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(.; S(x)) is a distribution over S(x).
p'(C|B,a) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a)
r'(B,a,C) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a) r(x,a,y)
Theorem: Take the aggregated MDP (Π, A, p', r'), let V' be its optimal value function, and define V'_E(x) = V'(S(x)). Then ||V'_E - V*|| ≤ ||A V* - V*|| / (1-γ).
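The formulas for p' and r' can be implemented directly. A sketch with a made-up 4-state MDP, a 2-cell partition, and uniform weights μ(.; B):

```python
import numpy as np

# Building the aggregate MDP (Pi, A, p', r') from the slide's formulas.
rng = np.random.default_rng(0)
A, S = 2, 4
p = rng.dirichlet(np.ones(S), size=(A, S))   # p[a, x, :] = p(.|x, a)
r = rng.uniform(size=(A, S, S))              # r[a, x, y] = r(x, a, y)
cells = [np.array([0, 1]), np.array([2, 3])] # partition Pi of X
mu = [np.full(2, 0.5), np.full(2, 0.5)]      # mu(.; B): uniform over each cell

C = len(cells)
p_agg = np.zeros((A, C, C))                  # p'(C|B, a)
r_agg = np.zeros((A, C, C))                  # r'(B, a, C)
for a in range(A):
    for b, (xs, w) in enumerate(zip(cells, mu)):
        for c, ys in enumerate(cells):
            # p'(C|B,a) = sum_{x in B} mu(x;B) sum_{y in C} p(y|x,a)
            p_agg[a, b, c] = w @ p[a][xs][:, ys].sum(axis=1)
            # r'(B,a,C) = sum_{x in B} mu(x;B) sum_{y in C} p(y|x,a) r(x,a,y)
            r_agg[a, b, c] = w @ (p[a][xs][:, ys] * r[a][xs][:, ys]).sum(axis=1)

# p' is again a transition kernel: its rows sum to one.
assert np.allclose(p_agg.sum(axis=2), 1.0)
```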

36 Action-Value Functions
L: B(X) → B(X × A), (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }: the "one-step lookahead" operator.
Note: π is greedy w.r.t. V iff (L V)(x,π(x)) = max_a (L V)(x,a).
Def: Q* = L V*.
Def: Max: B(X × A) → B(X), (Max Q)(x) = max_a Q(x,a). Note: Max L = T.
Corollary: Q* = L Max Q*. Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
L Max is a γ-contraction on B(X × A), so value iteration, policy iteration, etc. carry over to action-value functions.
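Value iteration on action values, Q_{k+1} = L Max Q_k, as a sketch with the same made-up MDP convention as before (P[a, x, y], R[a, x]):

```python
import numpy as np

# Q-value iteration: iterate the gamma-contraction L Max on B(X x A).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

Q = np.zeros_like(R)                       # Q[a, x]
for _ in range(400):
    V = Q.max(axis=0)                      # (Max Q)(x) = max_a Q(x, a)
    Q = R + gamma * np.einsum('axy,y->ax', P, V)   # Q <- L (Max Q)

pi = Q.argmax(axis=0)                      # greedy policy read off from Q
```

The greedy policy needs no model once Q is available, which is why action values matter for learning.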

37 Changing Granularity
Asynchronous value iteration: at each time step, update only a few states.
AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
How to use this? Prioritized sweeping. IPS [McMahan & Gordon '05]: instead of updating a state immediately, put it on a priority queue; when a state is picked from the queue, update it and put its predecessors on the queue.
Theorem: On shortest-path problems with non-positive rewards, this is equivalent to Dijkstra's algorithm.
LRTA* [Korf '90] ~ RTDP [Barto, Bradtke & Singh '95]: focus on the parts of the state space that matter. Constraints: the same problem is solved from several initial positions, and decisions have to be fast. Idea: update values along the paths.
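The simplest asynchronous scheme updates one state per step in a round-robin sweep, which trivially updates every state infinitely often. A sketch on a made-up MDP:

```python
import numpy as np

# Asynchronous (Gauss-Seidel) value iteration: one Bellman backup per step.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
S = P.shape[1]

V = np.zeros(S)
for step in range(2000):
    x = step % S                           # round-robin: every state updated i.o.
    V[x] = np.max(R[:, x] + gamma * P[:, x, :] @ V)   # Bellman backup at x only
```

Prioritized sweeping replaces the round-robin choice of x with a priority queue keyed by how much an update is expected to change V.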

38 Changing Granularity (continued)
Generalized policy iteration: partial evaluation and partial improvement of policies; multi-step lookahead improvement.
AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird '93]

39 Variations of a Theme [SzeLi99]
- Game against nature [Heger '94]: inf_w Σ_t γ^t R_t(w) with X_0 = x
- Risk-sensitive criterion: log E[ exp( Σ_t γ^t R_t ) | X_0 = x ]
- Stochastic shortest path
- Average reward
- Markov games: simultaneous action choices (rock-paper-scissors), sequential action choices, zero-sum (or not)

40 References
[Howard '60] R.A. Howard: Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
[Gordon '95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD Thesis.
[McMahan & Gordon '05] H.B. McMahan and G.J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS, 2005.
[Korf '90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189–211, 1990.
[Barto, Bradtke & Singh '95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81–138, 1995.
[Williams & Baird '93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS, November 1993.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation 11, 2017–2059, 1999.
[BT96] D.P. Bertsekas and J. Tsitsiklis: Neuro-Dynamic Programming. Athena Scientific, 1996.
