
1 Reinforcement Learning: Dynamic Programming
Csaba Szepesvári, University of Alberta
Kioloa, MLSS'08

2 Reinforcement Learning
RL = "sampling-based methods to solve optimal control problems" (Rich Sutton)
Contents:
- Defining AI
- Markovian Decision Problems
- Dynamic Programming
- Approximate Dynamic Programming
- Generalizations

3 Literature
Books:
- Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
- Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
Journals: JMLR, MLJ, JAIR, AI
Conferences: NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

4 Some More Books
- Martin L. Puterman: Markov Decision Processes. Wiley.
- Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007).
- James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.

5 Resources
- RL-Glue
- RL-Library
- The RL Toolbox 2.0 (toolbox/general/overview.html)
- OpenDP
- RL-Competition (2008)! June 1st, 2008: test runs begin!
Related fields: operations research (MOR, OR), control theory (IEEE TAC, Automatica, IEEE CDC, ECC), simulation optimization (Winter Simulation Conference)

6 Abstract Control Model
[Diagram: the controller (= agent) sends actions to the environment and receives sensations (and reward) back: the "perception-action loop".]

7 Zooming in...
[Diagram: inside the agent, external sensations and reward feed a memory/state; internal sensations and the state drive the choice of actions.]

8 A Mathematical Model
Plant (controlled object): x_{t+1} = f(x_t, a_t, v_t), where x_t is the state and v_t is noise; z_t = g(x_t, w_t), where z_t is the sensation/observation and w_t is noise.
State: a sufficient statistic for the future, either independently of what we measure ("objective state"), or relative to the measurements ("subjective state").
Controller: a_t = F(z_1, z_2, ..., z_t), where a_t is the action (control) => PERCEPTION-ACTION LOOP, "CLOSED-LOOP CONTROL".
Design problem: F = ? Goal: r(z_t, a_t) → max.
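A hypothetical sketch of this perception-action loop in Python; the plant f, sensor g, and controller F below are toy stand-ins, not taken from the lecture:

```python
import numpy as np

# Toy closed-loop control: plant x_{t+1} = f(x_t, a_t, v_t), observation
# z_t = g(x_t, w_t), controller a_t = F(z_1, ..., z_t). All three maps are
# illustrative assumptions.
rng = np.random.default_rng(0)

f = lambda x, a, v: 0.8 * x + a + v        # linear plant driven by noise v_t
g = lambda x, w: x + w                     # noisy sensor
F = lambda zs: -0.5 * zs[-1]               # purely reactive feedback on z_t

x, zs = 1.0, []
for t in range(100):
    zs.append(g(x, rng.normal(scale=0.1)))   # sense
    a = F(zs)                                # decide
    x = f(x, a, rng.normal(scale=0.1))       # act: plant evolves
```

With this choice of F the closed loop is stable: the state contracts toward zero up to the noise level.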

9 A Classification of Controllers
- Feedforward: a_1, a_2, ... is designed ahead of time. ???
- Feedback:
  - Purely reactive systems: a_t = F(z_t). Why is this bad? (The current observation alone may not determine the state.)
  - Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) interprets the sensations; a_t = F(m_t) is the decision making (deliberative vs. reactive).

10 Feedback Controllers
Plant: x_{t+1} = f(x_t, a_t, v_t), z_t = g(x_t, w_t)
Controller: m_t = M(m_{t-1}, z_t, a_{t-1}), a_t = F(m_t)
m_t ≈ x_t: state estimation, "filtering"; difficulties: noise, unmodelled parts.
How do we compute a_t? With a model (f'): model-based control, which assumes (some kind of) state estimation. Without a model: model-free control.

11 Markovian Decision Problems

12 Markovian Decision Problems
An MDP is a tuple (X, A, p, r, γ):
- X: set of states
- A: set of actions (controls)
- p: transition probabilities p(y|x,a)
- r: rewards r(x,a,y), or r(x,a), or r(x)
- γ: discount factor, 0 ≤ γ < 1
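A made-up finite MDP can be written down directly in array form, following the slide's notation (the numbers below are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical two-state, two-action MDP: p[a, x, y] = p(y | x, a) and
# r[a, x, y] = r(x, a, y); gamma is the discount factor in [0, 1).
p = np.array([[[0.9, 0.1],    # action 0, from state 0
               [0.2, 0.8]],   # action 0, from state 1
              [[0.5, 0.5],    # action 1, from state 0
               [0.0, 1.0]]])  # action 1, from state 1
r = np.array([[[1.0, 0.0],
               [0.0, 0.0]],
              [[0.0, 2.0],
               [0.0, 1.0]]])
gamma = 0.9

# Each row p(.|x, a) must be a probability distribution over next states.
assert np.allclose(p.sum(axis=2), 1.0)
```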

13 The Process View
The process is (X_t, A_t, R_t): X_t is the state, A_t the action, and R_t the reward at time t.
Laws: X_{t+1} ~ p(.|X_t, A_t); A_t ~ π(.|H_t), where π is the policy and H_t = (X_t, A_{t-1}, R_{t-1}, ..., A_1, R_1, X_0) is the history; R_t = r(X_t, A_t, X_{t+1}).
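These laws translate directly into a simulator. A sketch with a made-up MDP and a uniformly random (memoryless) policy:

```python
import numpy as np

# Simulating the process (X_t, A_t, R_t): actions from a uniformly random
# policy, next states from p(.|X_t, A_t). The MDP below is made up.
rng = np.random.default_rng(0)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],    # p[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
r = lambda x, a, y: 1.0 if (x == 0 and y == 1) else 0.0   # toy r(x, a, y)

x, trajectory = 0, []
for t in range(10):
    a = int(rng.integers(2))               # A_t ~ pi(.|H_t), here uniform
    y = int(rng.choice(2, p=p[a, x]))      # X_{t+1} ~ p(.|X_t, A_t)
    trajectory.append((x, a, r(x, a, y)))  # (X_t, A_t, R_t = r(X_t, A_t, X_{t+1}))
    x = y
```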

14 The Control Problem
Value functions: V^π(x) = E^π[ Σ_{t≥0} γ^t R_t | X_0 = x ]
Optimal value function: V*(x) = sup_π V^π(x)
Optimal policy: π* such that V^{π*}(x) = V*(x) for all x

15 Applications of MDPs
Fields: operations research, econometrics, control, statistics, games, AI.
Example problems: optimal investments, replacement problems, option pricing, logistics and inventory management, active vision, production scheduling, dialogue control, bioreactor control, robotics (RoboCup soccer), driving, real-time load balancing, design of experiments (medical tests).

16 Variants of MDPs
- Discounted
- Undiscounted: stochastic shortest path
- Average reward
- Multiple criteria
- Minimax
- Games

17 MDP Problems
- Planning: the MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
- Learning: the MDP is unknown, but you are allowed to interact with it. Find an optimal policy π*!
- Optimal learning: while interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.

18 Solving MDPs: Dimensions
- Which problem? (planning, learning, optimal learning)
- Exact or approximate?
- Uses samples? Incremental?
- Uses value functions?
  - Yes: value-function based methods. Planning: DP, Random Discretization Method, FVI, ... Learning: Q-learning, actor-critic, ...
  - No: policy search methods. Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus).
- Representation: structured state (factored states, logical representation, ...); structured policy space (hierarchical methods).

19 Dynamic Programming

20 Richard Bellman
Control theory, systems analysis. Dynamic programming (RAND Corporation). Bellman equation, Bellman-Ford algorithm, Hamilton-Jacobi-Bellman equation, the "curse of dimensionality", invariant imbedding, Grönwall-Bellman inequality.

21 Bellman Operators
Let π: X → A be a stationary policy.
B(X) = { V | V: X → R, ||V||_∞ < ∞ }
T^π: B(X) → B(X), (T^π V)(x) = Σ_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
Theorem: T^π V^π = V^π.
Note: this is a linear system of equations: r^π + γ P^π V^π = V^π, so V^π = (I - γ P^π)^{-1} r^π.
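The linear-system view gives a direct way to evaluate a fixed policy. A sketch with a made-up two-state MDP (P, R, pi below are illustrative assumptions):

```python
import numpy as np

# Policy evaluation by solving (I - gamma P_pi) V_pi = r_pi.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward r(x, a)
pi = np.array([0, 1])                      # deterministic policy x -> a

S = P.shape[1]
P_pi = P[pi, np.arange(S)]                 # P_pi[x, y] = p(y|x, pi(x))
r_pi = R[pi, np.arange(S)]                 # r_pi[x]   = r(x, pi(x))
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# V_pi is the fixed point of T_pi: applying T_pi leaves it unchanged.
assert np.allclose(r_pi + gamma * P_pi @ V_pi, V_pi)
```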

22 Proof of T^π V^π = V^π
What you need to know:
- Linearity of expectation: E[A+B] = E[A] + E[B]
- Law of total expectation: E[Z] = Σ_x P(X=x) E[Z|X=x], and E[Z|U=u] = Σ_x P(X=x|U=u) E[Z|U=u,X=x]
- Markov property: E[f(X_1,X_2,...) | X_1=y, X_0=x] = E[f(X_1,X_2,...) | X_1=y]
V^π(x) = E^π[ Σ_{t≥0} γ^t R_t | X_0 = x ]
= Σ_y P(X_1=y|X_0=x) E^π[ Σ_{t≥0} γ^t R_t | X_0=x, X_1=y ] (by the law of total expectation)
= Σ_y p(y|x,π(x)) E^π[ Σ_{t≥0} γ^t R_t | X_0=x, X_1=y ] (since X_1 ~ p(.|X_0, π(X_0)))
= Σ_y p(y|x,π(x)) { E^π[ R_0 | X_0=x, X_1=y ] + γ E^π[ Σ_{t≥0} γ^t R_{t+1} | X_0=x, X_1=y ] } (by the linearity of expectation)
= Σ_y p(y|x,π(x)) { r(x,π(x),y) + γ V^π(y) } (using the definitions of r and V^π, and the Markov property)
= (T^π V^π)(x). (using the definition of T^π)

23 The Banach Fixed-Point Theorem
Let B = (B, ||.||) be a Banach space.
T: B_1 → B_2 is L-Lipschitz (L > 0) if for any U, V: ||T U - T V|| ≤ L ||U - V||.
T is a contraction if B_1 = B_2 and L < 1; L is then a contraction coefficient of T.
Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for all V_0 ∈ B, the iteration V_{k+1} = T V_k satisfies V_k → V with ||V_k - V|| = O(γ^k).
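The iteration in the theorem can be watched numerically. A sketch using an affine γ-contraction T V = r + γ P V with a made-up stochastic matrix P and reward vector r:

```python
import numpy as np

# Banach iteration V_{k+1} = T V_k for a gamma-contraction in the sup norm.
gamma = 0.9
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # a stochastic matrix
r = np.array([1.0, 0.0])
T = lambda V: r + gamma * P @ V            # gamma-contraction (P is stochastic)

V_fix = np.linalg.solve(np.eye(2) - gamma * P, r)   # the unique fixed point
V, errs = np.zeros(2), []
for k in range(50):
    V = T(V)
    errs.append(np.max(np.abs(V - V_fix)))

# ||V_k - V|| = O(gamma^k): each iteration contracts the error by gamma.
assert all(e1 <= gamma * e0 + 1e-12 for e0, e1 in zip(errs, errs[1:]))
```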

24 An Algebra for Contractions
Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
Def: If T is 1-Lipschitz, T is called a non-expansion.
Prop: M: B(X × A) → B(X), M(Q)(x) = max_a Q(x,a) is a non-expansion.
Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.

25 Policy Evaluations are Contractions
Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||.|| below.
Theorem: Let T^π be the policy evaluation operator of some policy π. Then T^π is a γ-contraction.
Corollary: V^π is the unique fixed point of T^π; V_{k+1} = T^π V_k → V^π for all V_0 ∈ B(X), with ||V_k - V^π|| = O(γ^k).

26 The Bellman Optimality Operator
Let T: B(X) → B(X) be defined by (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }.
Def: π is greedy w.r.t. V if T^π V = T V.
Prop: T is a γ-contraction.
Theorem (Bellman optimality equation): T V* = V*.
Proof sketch: Let V be the unique fixed point of T. For any policy π, T^π V ≤ T V = V, which implies V^π ≤ V; hence V* ≤ V. Now let π be greedy w.r.t. V, so T^π V = T V = V; then V^π = V, hence V ≤ V*. Therefore V = V*.

27 Value Iteration
Theorem: For any V_0 ∈ B(X), V_{k+1} = T V_k satisfies V_k → V*, and in particular ||V_k - V*|| = O(γ^k).
What happens when we stop "early"?
Theorem: Let π be greedy w.r.t. V. Then ||V^π - V*|| ≤ 2 ||T V - V|| / (1-γ).
Proof: ||V^π - V*|| ≤ ||V^π - V|| + ||V - V*|| ...
Corollary: In a finite MDP the number of policies is finite, so we can stop when ||V_k - T V_k|| ≤ Δ (1-γ)/2, where Δ = min{ ||V* - V^π|| : V^π ≠ V* }. This gives pseudo-polynomial complexity.
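A minimal value-iteration sketch with a sup-norm stopping test in the spirit of the slide; the MDP (P, R, gamma) and the tolerance tol are illustrative assumptions:

```python
import numpy as np

# Value iteration: V_{k+1} = T V_k, stopping when ||T V - V|| is small.
gamma, tol = 0.9, 1e-8
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward

def T(V):
    # (T V)(x) = max_a { R[a, x] + gamma * sum_y P[a, x, y] V(y) }
    return np.max(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)

V = np.zeros(2)
while np.max(np.abs(T(V) - V)) > tol:      # sup-norm stopping test
    V = T(V)
pi = np.argmax(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)  # greedy policy
```

The loop terminates because T is a γ-contraction, so the residual ||T V_k - V_k|| shrinks geometrically.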

28 Policy Improvement [Howard '60]
Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X.
Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X such that V(x) > U(x).
Theorem (Policy Improvement): Let π' be greedy w.r.t. V^π. Then V^{π'} ≥ V^π. If T V^π > V^π, then V^{π'} > V^π.

29 Policy Iteration
PolicyIteration(π):
  V ← V^π
  do {improvement}
    V' ← V
    let π be such that T^π V = T V   {π greedy w.r.t. V}
    V ← V^π
  while (V > V')
  return π
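A minimal policy-iteration sketch matching the loop above: exact policy evaluation by a linear solve, then greedy improvement, stopping when the greedy policy no longer changes. The MDP is made up:

```python
import numpy as np

# Howard's policy iteration on a hypothetical two-state, two-action MDP.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward
S = P.shape[1]

def evaluate(pi):
    # Solve (I - gamma P_pi) V = r_pi for V = V^pi (exact evaluation).
    P_pi, r_pi = P[pi, np.arange(S)], R[pi, np.arange(S)]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

pi = np.zeros(S, dtype=int)
while True:
    V = evaluate(pi)                       # V <- V^pi
    pi_new = np.argmax(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)
    if np.array_equal(pi_new, pi):         # greedy policy unchanged: optimal
        break
    pi = pi_new
```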

30 Policy Iteration Theorem
Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
Proof: Follows from the Policy Improvement Theorem.

31 Linear Programming
If V ≥ T V, then V ≥ V* = T V*. Hence V* is the smallest V that satisfies V ≥ T V.
V ≥ T V is equivalent to (*): V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) } for all x, a.
LinProg(V): minimize Σ_x V(x) subject to (*).
Theorem: LinProg(V) returns the optimal value function V*.
Corollary: pseudo-polynomial complexity.
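A sketch of this LP using SciPy's generic solver and a made-up MDP; the constraints (*) are stacked as inequality rows A_ub V ≤ b_ub:

```python
import numpy as np
from scipy.optimize import linprog

# LP for V*: minimise sum_x V(x) subject to, for every (x, a),
# V(x) - gamma * sum_y p(y|x,a) V(y) >= r(x,a). The MDP below is made up.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, x, y] = p(y|x, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[a, x] = expected reward
A, S = R.shape

# One inequality row per (a, x), rewritten as (gamma P_a - I) V <= -r_a.
A_ub = np.vstack([gamma * P[a] - np.eye(S) for a in range(A)])
b_ub = -R.reshape(-1)                      # same (a, x) ordering as A_ub
res = linprog(c=np.ones(S), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
V_star = res.x                             # the optimal value function
```

Note that V must be left unbounded below (linprog defaults to V ≥ 0), hence the explicit `bounds` argument.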

32 Variations of a Theme

33 Approximate Value Iteration
AVI: V_{k+1} = T V_k + ε_k.
AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k - V*|| ≤ 2γε/(1-γ). (e.g., [BT96])
Proof: Let a_k = ||V_k - V*||. Then a_{k+1} = ||V_{k+1} - V*|| = ||T V_k - T V* + ε_k|| ≤ γ ||V_k - V*|| + ε = γ a_k + ε. Hence (a_k) is bounded. Take the limsup of both sides: a ≤ γa + ε; reorder. //
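The bound can be checked numerically. A sketch where each backup is corrupted by bounded noise; the MDP and the noise model are illustrative assumptions:

```python
import numpy as np

# Approximate value iteration: V_{k+1} = T V_k + eps_k with ||eps_k|| <= eps.
gamma, eps = 0.9, 0.01
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
rng = np.random.default_rng(0)

def T(V):
    return np.max(R + gamma * np.einsum('axy,y->ax', P, V), axis=0)

V_star = np.zeros(2)
for _ in range(600):                       # (near-)exact VI for reference
    V_star = T(V_star)

V = np.zeros(2)
for _ in range(600):
    V = T(V) + rng.uniform(-eps, eps, size=2)   # noisy backup, ||eps_k|| <= eps

# The residual error settles below eps/(1 - gamma), matching the
# contraction recursion a_{k+1} <= gamma a_k + eps.
assert np.max(np.abs(V - V_star)) <= eps / (1 - gamma) + 1e-6
```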

34 Fitted Value Iteration: Non-expansion Operators
FVI: Let A be a non-expansion and iterate V_{k+1} = A T V_k. Where does this converge to?
Theorem: Let U, V be such that A T U = U and T V = V. Then ||V - U|| ≤ ||A V - V|| / (1-γ).
Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1-γ). Since A U' = A T (A U'), U = A U'. Hence ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| ... [Gordon '95]

35 Application to Aggregation
Let Π be a partition of X and S(x) the unique cell that x belongs to. Let A: B(X) → B(X) be (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(.; S(x)) is a distribution over S(x).
p'(C|B,a) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a)
r'(B,a,C) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a) r(x,a,y)
Theorem: Take the aggregated MDP (Π, A, p', r'), let V' be its optimal value function, and define V'_E(x) = V'(S(x)). Then ||V'_E - V*|| ≤ ||A V* - V*|| / (1-γ).
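The formulas for p' and r' can be implemented directly. A sketch with a made-up 4-state MDP, a 2-cell partition, and uniform weights μ(.; B):

```python
import numpy as np

# Building the aggregate MDP (Pi, A, p', r') from the slide's formulas.
rng = np.random.default_rng(0)
A, S = 2, 4
p = rng.dirichlet(np.ones(S), size=(A, S))   # p[a, x, :] = p(.|x, a)
r = rng.uniform(size=(A, S, S))              # r[a, x, y] = r(x, a, y)
cells = [np.array([0, 1]), np.array([2, 3])] # partition Pi of X
mu = [np.full(2, 0.5), np.full(2, 0.5)]      # mu(.; B): uniform over each cell

C = len(cells)
p_agg = np.zeros((A, C, C))                  # p'(C|B, a)
r_agg = np.zeros((A, C, C))                  # r'(B, a, C)
for a in range(A):
    for b, (xs, w) in enumerate(zip(cells, mu)):
        for c, ys in enumerate(cells):
            # p'(C|B,a) = sum_{x in B} mu(x;B) sum_{y in C} p(y|x,a)
            p_agg[a, b, c] = w @ p[a][xs][:, ys].sum(axis=1)
            # r'(B,a,C) = sum_{x in B} mu(x;B) sum_{y in C} p(y|x,a) r(x,a,y)
            r_agg[a, b, c] = w @ (p[a][xs][:, ys] * r[a][xs][:, ys]).sum(axis=1)

# p' is again a transition kernel: its rows sum to one.
assert np.allclose(p_agg.sum(axis=2), 1.0)
```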

36 Action-Value Functions
L: B(X) → B(X × A), (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }: the "one-step lookahead" operator.
Note: π is greedy w.r.t. V iff (L V)(x,π(x)) = max_a (L V)(x,a).
Def: Q* = L V*.
Def: Max: B(X × A) → B(X), (Max Q)(x) = max_a Q(x,a). Note: Max L = T.
Corollary: Q* = L Max Q*. Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
L Max is a γ-contraction on B(X × A), so value iteration, policy iteration, etc. carry over to action-value functions.
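Value iteration on action values, Q_{k+1} = L Max Q_k, as a sketch with the same made-up MDP convention as before (P[a, x, y], R[a, x]):

```python
import numpy as np

# Q-value iteration: iterate the gamma-contraction L Max on B(X x A).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

Q = np.zeros_like(R)                       # Q[a, x]
for _ in range(400):
    V = Q.max(axis=0)                      # (Max Q)(x) = max_a Q(x, a)
    Q = R + gamma * np.einsum('axy,y->ax', P, V)   # Q <- L (Max Q)

pi = Q.argmax(axis=0)                      # greedy policy read off from Q
```

The greedy policy needs no model once Q is available, which is why action values matter for learning.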

37 Changing Granularity
Asynchronous value iteration: at each time step, update only a few states.
AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
How to use this? Prioritized sweeping. IPS [McMahan & Gordon '05]: instead of updating a state immediately, put it on a priority queue; when a state is picked from the queue, update it and put its predecessors on the queue.
Theorem: On shortest-path problems with non-positive rewards, this is equivalent to Dijkstra's algorithm.
LRTA* [Korf '90] ~ RTDP [Barto, Bradtke & Singh '95]: focus on the parts of the state space that matter. Constraints: the same problem is solved from several initial positions, and decisions have to be fast. Idea: update values along the paths.
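The simplest asynchronous scheme updates one state per step in a round-robin sweep, which trivially updates every state infinitely often. A sketch on a made-up MDP:

```python
import numpy as np

# Asynchronous (Gauss-Seidel) value iteration: one Bellman backup per step.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
S = P.shape[1]

V = np.zeros(S)
for step in range(2000):
    x = step % S                           # round-robin: every state updated i.o.
    V[x] = np.max(R[:, x] + gamma * P[:, x, :] @ V)   # Bellman backup at x only
```

Prioritized sweeping replaces the round-robin choice of x with a priority queue keyed by how much an update is expected to change V.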

38 Changing Granularity (continued)
Generalized policy iteration: partial evaluation and partial improvement of policies; multi-step lookahead improvement.
AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird '93]

39 Variations of a Theme [SzeLi99]
- Game against nature [Heger '94]: inf_w Σ_t γ^t R_t(w) with X_0 = x
- Risk-sensitive criterion: log E[ exp( Σ_t γ^t R_t ) | X_0 = x ]
- Stochastic shortest path
- Average reward
- Markov games: simultaneous action choices (rock-paper-scissors), sequential action choices, zero-sum (or not)

40 References
[Howard '60] R.A. Howard: Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
[Gordon '95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD Thesis.
[McMahan & Gordon '05] H.B. McMahan and G.J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS, 2005.
[Korf '90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189–211, 1990.
[Barto, Bradtke & Singh '95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81–138, 1995.
[Williams & Baird '93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS, November 1993.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation 11, 2017–2059, 1999.
[BT96] D.P. Bertsekas and J. Tsitsiklis: Neuro-Dynamic Programming. Athena Scientific, 1996.
