1 From Markov Decision Processes to Artificial Intelligence. Rich Sutton, with thanks to: Andy Barto, Satinder Singh, Doina Precup

2 The steady march of computing science is changing artificial intelligence
More computation-based approximate methods
  – Machine learning, neural networks, genetic algorithms
Machines are taking on more of the work
  – More data, more computation
  – Fewer handcrafted solutions, less reliance on human understandability
More search
  – Exponential methods are still exponential… but compute-intensive methods are increasingly winning
More general problems
  – Stochastic, non-linear, optimal
  – Real-time, large

3 The problem is to predict and control a doubly branching interaction unfolding over time, with a long-term goal
[Figure: the Agent sends actions to the World; the World returns states]

4 Sequential, state-action-reward problems are ubiquitous
Walking; flying a helicopter; playing tennis; logistics; inventory control; intruder detection; repair or replace?; visual search for objects; playing chess, Go, poker; medical tests and treatment; conversation; user interfaces; marketing; queue/server control; portfolio management; industrial process control; pipeline failure prediction; real-time load balancing

5 Markov Decision Processes (MDPs)
Discrete time, states, actions, policy, transition probabilities, rewards (see the notation below)
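
The formal notation on this slide did not survive extraction. A standard rendering of the listed ingredients, in the style used in the rest of the talk, would be:

```latex
\begin{aligned}
&\text{Discrete time:}            && t = 1, 2, 3, \ldots \\
&\text{States:}                   && s_t \in \mathcal{S} \\
&\text{Actions:}                  && a_t \in \mathcal{A} \\
&\text{Policy:}                   && \pi(s,a) = \Pr\{a_t = a \mid s_t = s\} \\
&\text{Transition probabilities:} && p(s' \mid s,a) = \Pr\{s_{t+1} = s' \mid s_t = s,\ a_t = a\} \\
&\text{Rewards:}                  && r(s,a,s') = \mathrm{E}\{r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s'\}
\end{aligned}
```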

6 MDPs Part II: The Objective
"Maximize cumulative reward"
Define the value of being in a state under a policy as the expected cumulative reward from that state, where delayed rewards are discounted by γ ∈ [0,1] (see the formulas below)
This defines a partial ordering over policies, with at least one optimal policy (needs proving)
There are other possibilities...
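
The value and optimality formulas on this slide were images; the standard definitions they correspond to are:

```latex
V^{\pi}(s) \;=\; \mathrm{E}_{\pi}\!\left[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \;\middle|\; s_t = s \right],
\qquad \gamma \in [0,1]

\pi \geq \pi' \;\iff\; V^{\pi}(s) \geq V^{\pi'}(s) \ \text{ for all } s,
\qquad
V^{*}(s) = \max_{\pi} V^{\pi}(s)
```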

7 Markov Decision Processes
Extensively studied since the 1950s
In Optimal Control
  – Specializes to Riccati equations for linear systems
  – And to HJB equations for continuous-time systems
  – The only general, nonlinear, optimal-control framework
In Operations Research
  – Planning, scheduling, logistics
  – Sequential design of experiments
  – Finance, marketing, inventory control, queuing, telecomm
In Artificial Intelligence (last 15 years)
  – Reinforcement learning, probabilistic planning
Dynamic Programming is the dominant solution method

8 Outline
Markov decision processes (MDPs)
Dynamic Programming (DP)
  – The curse of dimensionality
Reinforcement Learning (RL)
  – TD(λ) algorithm
  – TD-Gammon example
  – Acrobot example
RL significantly extends DP methods for solving MDPs
  – RoboCup example
Conclusion, from the AI point of view
  – Spy plane example

9 The Principle of Optimality
Dynamic Programming (DP) requires a decomposition into subproblems
In MDPs this comes from the Independence of Path assumption
Values can be written in terms of successor values, e.g., the "Bellman Equations" shown below
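
The equations referred to here were lost in extraction; the standard Bellman equations for a fixed policy and for the optimal value function are:

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(s,a) \sum_{s'} p(s' \mid s,a)\,\bigl[\, r(s,a,s') + \gamma\, V^{\pi}(s') \,\bigr]

V^{*}(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s,a)\,\bigl[\, r(s,a,s') + \gamma\, V^{*}(s') \,\bigr]
```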

10 Dynamic Programming: sweeping through the states, updating an approximation to the optimal value function
For example, Value Iteration (sketched below):
  Initialize V arbitrarily
  Do forever: sweep through the states, backing each value up from its successors
  Pick any of the maximizing actions to get π*
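
The update boxes on this slide were images. A minimal tabular value-iteration sketch that matches the description, in Python with hypothetical transition and reward arrays P and R (not code from the talk):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration (a minimal sketch, not the talk's exact algorithm).

    P: transition array of shape (S, A, S); P[s, a, s'] = Pr{s' | s, a}
    R: expected-reward array of shape (S, A)
    Returns the optimal value function V and a greedy (optimal) policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)                          # Initialize
    while True:                              # Do forever (until a sweep changes nothing)
        Q = R + gamma * (P @ V)              # back up each (s, a) from its successors
        V_new = Q.max(axis=1)                # V(s) <- max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi_star = Q.argmax(axis=1)               # pick any of the maximizing actions to get pi*
    return V_new, pi_star
```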

11 DP is repeated backups: shallow lookahead searches
[Backup diagram: from state s, each action a leads to successor states s' and s''; V(s') and V(s'') are backed up into V(s)]

12 Dynamic Programming is the dominant solution method for MDPs
Routinely applied to problems with millions of states
Worst case scales polynomially in |S| and |A|
Linear Programming has better worst-case bounds but in practice scales 100s of times worse
On large stochastic problems, only DP is feasible

13 Perennial Difficulties for DP
1. Large state spaces: "the curse of dimensionality"
2. Difficulty calculating the dynamics, e.g., from a simulation
3. Unknown dynamics

14 The Curse of Dimensionality (Bellman, 1961)
The number of states grows exponentially with dimensionality -- the number of state variables
Thus, on large problems:
  – Can't complete even one sweep of DP
    Can't enumerate states, need sampling!
  – Can't store separate values for each state
    Can't store values in tables, need function approximation!

15 Reinforcement Learning: using experience in place of dynamics
Let s_0, a_0, r_1, s_1, a_1, r_2, s_2, ... be an observed sequence with actions selected by π
For every time step t, the Bellman Equation suggests a DP-like update toward the expected one-step return (see below)
We don't know this expected value, but we do know the actual one-step return, an unbiased sample of it
In RL, we take a step toward this sample, e.g., halfway: "Tabular TD(0)"
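
The missing formulas here are the one-step Bellman equation and the tabular TD(0) update it suggests; in standard notation:

```latex
V^{\pi}(s_t) \;=\; \mathrm{E}_{\pi}\!\left[\, r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \;\middle|\; s_t \right]
\qquad \text{(Bellman equation)}

V(s_t) \;\leftarrow\; V(s_t) + \alpha \,\bigl[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\bigr]
\qquad \text{(tabular TD(0); stepping halfway means } \alpha = \tfrac{1}{2}\text{)}
```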

16 Temporal-Difference Learning (Sutton, 1988)
Updating a prediction based on its change (temporal difference) from one moment to the next
Tabular TD(0); or, if V is, e.g., a neural network with parameter θ, use gradient-descent TD(0)
TD(λ), λ > 0, uses differences from later predictions as well
[Figure: the temporal difference between a first prediction and a better, later prediction]
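
The gradient-descent form of the update, and the TD(λ) version with eligibility traces, are standard and read (reconstructed, not copied from the slide):

```latex
\theta \;\leftarrow\; \theta + \alpha\,\bigl[\, r_{t+1} + \gamma\, V_{\theta}(s_{t+1}) - V_{\theta}(s_t) \,\bigr]\,
\nabla_{\theta} V_{\theta}(s_t)
\qquad \text{(gradient-descent TD(0))}

\delta_t = r_{t+1} + \gamma\, V_{\theta}(s_{t+1}) - V_{\theta}(s_t), \qquad
e_t = \gamma \lambda\, e_{t-1} + \nabla_{\theta} V_{\theta}(s_t), \qquad
\theta \;\leftarrow\; \theta + \alpha\, \delta_t\, e_t
\qquad \text{(TD(}\lambda\text{))}
```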

17 TD-Gammon (Tesauro)
Start with a random network
Play millions of games against itself
Learn a value function from this simulated experience
Action selection by 2-3 ply search
This produces arguably the best player in the world
[Figure: a backgammon position fed to a neural network whose output, the estimated value ≈ probability of winning, is trained from the TD error]

18 TD-Gammon vs an Expert-Trained Net (Tesauro, 1992)
[Chart: fraction of games won against Gammontool vs. number of hidden units, comparing TD-Gammon and TD-Gammon+features with EP (a BP net trained from expert moves) and EP+features ("Neurogammon")]

19 Examples of Reinforcement Learning
Elevator Control (Crites & Barto)
  – (Probably) world's best down-peak elevator controller
Walking Robot (Benbrahim & Franklin)
  – Learned critical parameters for bipedal walking
RoboCup Soccer Teams (e.g., Stone & Veloso, Riedmiller et al.)
  – RL is used in many of the top teams
Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  – 10-15% improvement over industry standard methods
Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin)
  – World's best assigner of radio channels to mobile telephone calls
KnightCap and TDleaf (Baxter, Tridgell & Weaver)
  – Improved chess play from intermediate to master in 300 games

20 Does function approximation beat the curse of dimensionality?
Yes… probably
FA makes dimensionality per se largely irrelevant
With FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it
If you can get FA to work!

21 FA in DP and RL (1st bit)
Conventional DP works poorly with FA
  – Empirically [Boyan and Moore, 1995]
  – Diverges with linear FA [Baird, 1995]
  – Even for prediction (evaluating a fixed policy) [Baird, 1995]
RL works much better
  – Empirically [many applications and Sutton, 1996]
  – TD(λ) prediction converges with linear FA [Tsitsiklis & Van Roy, 1997]
  – TD(λ) control converges with linear FA [Perkins & Precup, 2002]
Why? Following actual trajectories in RL ensures that every state is updated at least as often as it is the basis for updating (see the linear TD(λ) sketch below)
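
For concreteness, here is a minimal linear TD(λ) prediction sketch in Python. The episode data format and the feature function phi are assumptions for illustration, not part of the talk:

```python
import numpy as np

def linear_td_lambda(episodes, phi, n_features, alpha=0.1, gamma=1.0, lam=0.7):
    """Linear TD(lambda) prediction for a fixed policy (illustrative sketch).

    episodes: iterable of trajectories, each a list of (s, r, s_next, done)
              tuples generated by following the policy (hypothetical format).
    phi:      feature function mapping a state to a length-n_features vector.
    Returns the weight vector theta, with V(s) approximated by theta . phi(s).
    """
    theta = np.zeros(n_features)
    for episode in episodes:
        e = np.zeros(n_features)                 # eligibility trace
        for (s, r, s_next, done) in episode:
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v       # TD error
            e = gamma * lam * e + phi(s)         # accumulate the trace
            theta += alpha * delta * e           # step toward the sampled return
    return theta
```

Because the updates follow actual trajectories, every visited state contributes an update as it is left, which is the intuition the slide gives for why RL+FA behaves better than DP+FA.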

22 DP+FA fails; RL+FA works
DP+FA fails: more transitions can go in to a state than go out
RL+FA works: real trajectories always leave a state after entering it

23 Outline
Markov decision processes (MDPs)
Dynamic Programming (DP)
  – The curse of dimensionality
Reinforcement Learning (RL)
  – TD(λ) algorithm
  – TD-Gammon example
  – Acrobot example
RL significantly extends DP methods for solving MDPs
  – RoboCup example
Conclusion, from the AI point of view
  – Spy plane example

24 The Mountain Car Problem (Moore, 1990)
A minimum-time-to-goal problem: gravity wins, so the car cannot simply thrust its way straight up to the goal
SITUATIONS: car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always -1 until the car reaches the goal
No discounting

25 Value Functions Learned while solving the Mountain Car problem (Sutton, 1996)
[Figure: learned value surfaces over position and velocity; value = estimated time to goal, so lower is better; the goal region is marked]

26 Sparse, Coarse Tile Coding (CMAC) (Albus, 1980)
[Figure: overlapping tilings laid over the two-dimensional car position x car velocity space]

27 Tile Coding (CMAC) (Albus, 1980)
An example of sparse coarse-coded networks
A fixed, expansive re-representation of the state features, followed by a linear last layer
Coarse: large receptive fields
Sparse: few features present at one time
(A minimal tile-coding sketch follows below)
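
A minimal 2-D tile coder in Python, in the spirit of CMAC rather than Albus's exact scheme; the parameter names and index layout are illustrative assumptions:

```python
def tile_features(x, y, x_range, y_range, n_tilings=8, tiles_per_dim=8):
    """Return the indices of the active tiles, one per tiling.

    The result is a sparse, coarse, binary feature vector with exactly
    n_tilings ones out of n_tilings * (tiles_per_dim + 1)**2 features.
    """
    active = []
    x01 = (x - x_range[0]) / (x_range[1] - x_range[0])   # scale to [0, 1)
    y01 = (y - y_range[0]) / (y_range[1] - y_range[0])
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)          # each tiling is shifted slightly
        ix = int((x01 + offset) * tiles_per_dim)          # coarse cell along x
        iy = int((y01 + offset) * tiles_per_dim)          # coarse cell along y
        tile = (t * (tiles_per_dim + 1) + ix) * (tiles_per_dim + 1) + iy
        active.append(tile)
    return active

# Usage with a linear last layer: V(s) = sum(theta[i] for i in tile_features(...))
```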

28 The Acrobot Problem (e.g., DeJong & Spong, 1994; Sutton, 1996)
A minimum-time-to-goal problem
4 state variables: 2 joint angles, 2 angular velocities
Tile coding with 48 tilings
Reward = -1 per time step
[Figure: a two-link pendulum with a fixed base; torque is applied at the middle joint; goal: raise the tip above a line]

29 The RoboCup Soccer Competition

30 13 Continuous State Variables (for 3 vs 2) (Stone & Sutton, 2001)
  – 11 distances among the players, the ball, and the center of the field
  – 2 angles to takers along passing lanes

31 RoboCup Feature Vectors (Stone & Sutton, 2001)
[Pipeline: full soccer state s -> 13 continuous state variables -> sparse, coarse tile coding -> huge binary feature vector (about 400 1's and 40,000 0's) -> linear map θ -> action values]
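
With a sparse binary feature vector, the linear map reduces to summing the weights of the active features. A tiny illustrative sketch (the shapes and names are assumptions, not the RoboCup code):

```python
import numpy as np

def action_values(active_features, theta):
    """Action values from a sparse binary feature vector under a linear map.

    active_features: indices of the features that are 1 (about 400 of ~40,000)
    theta: weight matrix of hypothetical shape (n_actions, n_features)
    """
    # Dot product with a 0/1 vector = sum of the weights at the active indices.
    return theta[:, active_features].sum(axis=1)

# Greedy action selection over the resulting values:
# a = int(np.argmax(action_values(active, theta)))
```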

32 Learning Keepaway Results (Stone & Sutton, 2001)
3v2 against handcrafted takers
Multiple, independent runs of TD(λ)

33 Hajime Kimura's RL Robots (dynamics knowledge)
[Photos: Before / After / Backward / New Robot, Same algorithm]

34 Assessment re: DP
RL has added some new capabilities to DP methods
  – Much larger MDPs can be addressed (approximately)
  – Simulations can be used without explicit probabilities
  – Dynamics need not be known or modeled
Many new applications are now possible
  – Process control, logistics, manufacturing, telecomm, finance, scheduling, medicine, marketing…
Theoretical and practical questions remain open

35 Outline
Markov decision processes (MDPs)
Dynamic Programming (DP)
  – The curse of dimensionality
Reinforcement Learning (RL)
  – TD(λ) algorithm
  – TD-Gammon example
  – Acrobot example
RL significantly extends DP methods for solving MDPs
  – RoboCup example
Conclusion, from the AI point of view
  – Spy plane example

36 A lesson for AI: The Power of a "Visible" Goal
In MDPs, the goal (reward) is part of the data, part of the agent's normal operation
The agent can tell for itself how well it is doing
This is very powerful… we should do more of it in AI
Can we give AI tasks visible goals?
  – Visual object recognition? Better would be active vision
  – Story understanding? Better would be dialog, e.g., call routing
  – User interfaces, personal assistants
  – Robotics… say mapping and navigation, or search
The usual trick is to make them into long-term prediction problems
There must be a way. If you can't feel it, why care about it?

37 Assessment re: AI
DP and RL are potentially powerful probabilistic planning methods
  – But typically don't use logic or structured representations
How are they as an overall model of thought?
  – Good mix of deliberation and immediate judgments (values)
  – Good for causality and prediction, not for logic and language
The link to data is appealing… but incomplete
  – MDP-style knowledge may be learnable, tunable, verifiable
  – But only if the "level" of the data is right; sometimes it seems too low-level, too flat

38 Ongoing and Future Directions
Temporal abstraction [Sutton, Precup, Singh, Parr, others]
  – Generalize transitions to include macros, "options"
  – Multiple overlying MDP-like models at different levels
State representation [Littman, Sutton, Singh, Jaeger...]
  – Eliminate the nasty assumption of observable state
  – Get really real with data
  – Work up to higher-level, yet grounded, representations
Neuroscience of reward systems [Dayan, Schultz, Doya]
  – The dopamine reward system behaves remarkably like TD
Theory and practice of value function approximation [everybody]

39 Spy Plane Example (Reconnaissance Mission Planning) (Sutton & Ravindran, 2001)
Mission: fly over (observe) the most valuable sites and return to base
Stochastic weather affects observability (cloudy or clear) of sites
Limited fuel
Intractable with classical optimal control methods
Temporal scales:
  – Actions: which direction to fly now
  – Options: which site to head for (see the option sketch below)
Options compress space and time
  – Reduce steps from ~600 to ~6
  – Reduce states from ~10^11 to ~10^6
[Figure: planning over any state (~10^6 states) vs. over the sites only (6)]
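
To make the "option" idea concrete, here is a minimal Python sketch of an option as the triple (initiation set, policy, termination condition) from the options framework. The mission-specific helpers (direction_toward, at_site, out_of_fuel) are hypothetical placeholders, not the planner used in this work:

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """An option: an initiation set I, an internal policy pi, and a termination condition beta."""
    initiation: Set[Any]                  # states where the option may be started
    policy: Callable[[Any], Any]          # maps a state to a primitive action while the option runs
    terminate: Callable[[Any], float]     # probability of terminating in a given state

def head_for_site(k, direction_toward, at_site, out_of_fuel):
    """Hypothetical 'head for site k' option for the reconnaissance task."""
    return Option(
        initiation={"any state with fuel remaining"},           # placeholder initiation set
        policy=lambda s: direction_toward(s, k),                 # primitive action: which way to fly now
        terminate=lambda s: 1.0 if at_site(s, k) or out_of_fuel(s) else 0.0,
    )
```

Planning over a handful of such options, rather than over every primitive flight direction, is what compresses ~600 steps to ~6 and ~10^11 states to ~10^6.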

40 Spy Plane Results (Sutton & Ravindran, 2001)
SMDP planner:
  – Assumes options are followed to completion
  – Plans the optimal SMDP solution
SMDP planner with re-evaluation:
  – Plans as if options must be followed to completion
  – But actually takes them for only one step
  – Re-picks a new option on every step
Static re-planner:
  – Assumes weather will not change
  – Plans the optimal tour among clear sites
  – Re-plans whenever weather changes
[Chart: expected reward per mission, in low-fuel and high-fuel cases, for the SMDP planner, the static re-planner, and the SMDP planner with re-evaluation of options on each step]
Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner

41 Didn't have time for
Action selection
  – Exploration/exploitation
  – Action values vs. search
  – How learning values leads to policy improvements
Different returns, e.g., the undiscounted case
Exactly how FA works, backprop
Exactly how options work
  – How planning at a high level can affect primitive actions
How states can be abstracted to affordances
  – And how this directly builds on the option work

