
1 An Overview of Reinforcement Learning
Angela Yu, Cogs 118A, February 26, 2009

2 Outline
A formal framework for learning from reinforcement
– Markov decision problems
– Interactions between an agent and its environment
Dynamic programming as a formal solution
– Policy iteration
– Value iteration
Temporal difference methods as a practical solution
– Actor-critic learning
– Q-learning
Extensions
– Exploration vs. exploitation (analogous to the plasticity vs. stability problem)
– Representation and neural networks

3 RL as a Markov Decision Process
(Figure: the agent-environment loop; at each step the agent in state x_t emits action a_t, and the environment returns reward r_t and next state x_{t+1}.)
x_t and a_t form a Markov blanket for r_t and x_{t+1}: the reward and next state depend only on the current state and action, through P(r_t | x_t, a_t) and P(x_{t+1} | x_t, a_t).

4 RL as a Markov Decision Process
Goal: find the optimal policy π: x → a by maximizing the return, the expected (discounted) cumulative reward:
R = < Σ_t γ^t r_t >,  with discount factor 0 ≤ γ < 1
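To make the return concrete, here is a minimal Python sketch (not from the slides; the function name and γ = 0.9 are illustrative) that computes the discounted return of one sampled reward trajectory:

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Discounted return  < sum_t gamma^t r_t >  of one sampled reward trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ np.asarray(rewards, dtype=float))

# A reward of 1 arriving two steps from now is worth gamma^2 today:
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ~0.81
```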

5 RL as a Markov Decision Process
Simple case: assume the transition probabilities P(x' | x, a) and reward probabilities P(r | x, a) are known.

6 Dynamic Programming I: Policy Iteration
Policy evaluation (a system of linear equations):
V^π(x) = <r(x, π(x))> + γ Σ_x' P(x' | x, π(x)) V^π(x')
Policy improvement: based on the values of the state-action pairs,
Q^π(x, a) = <r(x, a)> + γ Σ_x' P(x' | x, a) V^π(x'),
incrementally improve the policy: π'(x) = argmax_a Q^π(x, a).
Alternating evaluation and improvement is guaranteed to converge on (one set of) optimal values V* and an optimal policy π*.
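A minimal tabular sketch of policy iteration (illustrative code, not from the slides), assuming hypothetical inputs P, a list of per-action (S × S) transition matrices, and r, an (S × A) array of expected rewards:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Tabular policy iteration for a known MDP.
    P: list of A transition matrices, each (S, S); r: (S, A) expected rewards."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                      # arbitrary initial policy
    while True:
        # Policy evaluation: solve the linear system V = r_pi + gamma * P_pi V
        P_pi = np.array([P[pi[s]][s] for s in range(S)])
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily on the state-action values
        Q = r + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):               # policy stable => optimal
            return V, pi
        pi = pi_new
```

Evaluation here solves the linear system exactly, matching the slide's "system of linear equations" rather than iterating to a fixed point.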

7 Dynamic Programming II: Value Iteration
Q-value update:
Q(x, a) ← <r(x, a)> + γ Σ_x' P(x' | x, a) max_a' Q(x', a')
Guaranteed to converge on (one set of) optimal values Q*.
Policy: π*(x) = argmax_a Q*(x, a).
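The same model-based setting, sketched as value iteration on Q-values (illustrative; P and r follow the conventions of the previous sketch):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Tabular value iteration on Q-values; P and r as in the previous sketch."""
    S, A = r.shape
    Q = np.zeros((S, A))
    while True:
        # Back up the best next-state value through the known model
        Q_new = r + gamma * np.stack([P[a] @ Q.max(axis=1) for a in range(A)], axis=1)
        if np.abs(Q_new - Q).max() < tol:            # converged (to Q*)
            return Q_new, Q_new.argmax(axis=1)       # optimal values, greedy policy
        Q = Q_new
```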

8 Temporal Difference Learning
Difficult (realistic) case: the transition and reward probabilities are unknown; the agent only sees sampled states, actions, and rewards.
Solution: temporal difference (TD) learning.

9 Actor-Critic Learning (related to policy iteration)
Critic: improves the value estimate incrementally by stochastic gradient ascent, replacing the expectation <·> with Monte Carlo samples and boot-strapping on V(x_{t+1}):
δ_t = r_t + γ V(x_{t+1}) − V(x_t)   (temporal difference δ_t)
V(x_t) ← V(x_t) + ε δ_t   (ε: learning rate)
Actor: improves policy execution incrementally, using a stochastic policy (e.g., softmax over action preferences m(x, a)) and a delta-rule update on the same Monte Carlo samples:
m(x_t, a_t) ← m(x_t, a_t) + ε δ_t
Mutual dependence: actor and critic each learn against the other's current estimate, so convergence is not guaranteed.
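A minimal sketch of one actor-critic update plus a softmax actor (illustrative Python, not from the slides; the arrays V and m, the tabular setting, and the step size eps are all assumed names):

```python
import numpy as np

def actor_critic_step(V, m, x, a, r, x_next, gamma=0.9, eps=0.1):
    """One actor-critic update from a single sampled (x, a, r, x') transition.
    V: (S,) state values (critic); m: (S, A) action preferences (actor)."""
    delta = r + gamma * V[x_next] - V[x]   # TD error, boot-strapping on V(x')
    V[x] += eps * delta                    # critic: nudge V(x) toward the sample
    m[x, a] += eps * delta                 # actor: delta rule on the taken action
    return delta

def softmax_policy(m, x, beta=1.0, rng=None):
    """Stochastic policy: sample a with P(a | x) proportional to exp(beta * m(x, a))."""
    rng = rng or np.random.default_rng()
    p = np.exp(beta * (m[x] - m[x].max()))           # subtract max for stability
    return rng.choice(len(p), p=p / p.sum())
```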

10 Actor-Critic Learning
Exploration vs. exploitation: a stochastic (softmax) policy explores; making it more deterministic over time exploits what has been learned. Annealing the softmax inverse temperature β trades one off against the other. Best annealing schedule?
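The slide leaves the question open; one common heuristic (an assumption here, not the slide's answer) is to grow β geometrically over trials:

```python
def beta_schedule(trial, beta0=0.1, growth=1.01):
    """Heuristic annealing: low beta early (explore), high beta late (exploit)."""
    return beta0 * growth ** trial
```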

11 Q-Learning (related to value iteration)
State-action value estimation, replacing the expectation <·> with Monte Carlo samples and boot-strapping on Q(x_{t+1}, a'):
Q(x_t, a_t) ← Q(x_t, a_t) + ε [r_t + γ max_a' Q(x_{t+1}, a') − Q(x_t, a_t)]
Proven convergence (to the optimal Q*).
No explicit parameter to control explore/exploit.
Policy: greedy with respect to the learned values, π(x) = argmax_a Q(x, a).
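A matching sketch of the Q-learning update (illustrative; Q is assumed to be a NumPy (S, A) array):

```python
import numpy as np

def q_learning_step(Q, x, a, r, x_next, gamma=0.9, eps=0.1):
    """One Q-learning update from a sampled (x, a, r, x') transition.
    Off-policy: boot-straps on max_a' Q(x', a'), not the action actually taken."""
    delta = r + gamma * Q[x_next].max() - Q[x, a]
    Q[x, a] += eps * delta
    return delta

# Usage: Q = np.zeros((n_states, n_actions)), then call once per observed transition.
```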

12 Pros and Cons of TD Learning
TD learning is practically appealing:
– no representation of sequences of states & actions
– relatively simple computations
– TD in the brain: dopamine signals resemble the temporal difference δ_t
TD suffers from several disadvantages:
– local optima
– can be (exponentially) slow to converge
– actor-critic is not guaranteed to converge
– no principled way to trade off exploration and exploitation
– cannot easily deal with non-stationary environments

13 TD in the Brain

14 TD in the Brain

15 Extensions to Basic TD Learning
A continuum of improvements is possible:
– more complete partial models of the effects of actions
– estimating the expected reward <r(x_t)>
– representing & processing longer sequences of actions & states
– faster learning & more efficient use of the agent's experiences
– parameterized value function (versus a look-up table; see the sketch after this list)
Timing and partial observability in reward prediction:
– the state is not (always) directly observable
– delayed payoffs
– reward-prediction only (no instrumental contingencies)
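As an example of the parameterized-value-function point above, a minimal sketch of TD(0) with a linear approximator (illustrative; the feature map phi and weight vector w are assumed names):

```python
import numpy as np

def td0_linear_step(w, phi_x, r, phi_x_next, gamma=0.9, eps=0.1):
    """TD(0) with a linear value function V(x) = w . phi(x).
    Same TD error as the tabular critic; the update is now a gradient step on w."""
    delta = r + gamma * (w @ phi_x_next) - (w @ phi_x)
    w += eps * delta * phi_x               # in-place update of the weight vector
    return delta
```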


17 References
Sutton, RS & Barto, AG (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Bellman, RE (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Daw, ND, Courville, AC, & Touretzky, DS (2003). Timing and partial observability in the dopamine system. In Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P & Watkins, CJCH (2001). Reinforcement learning. Encyclopedia of Cognitive Science. London, England: Macmillan Press.
Dayan, P & Abbott, LF (2001). Theoretical Neuroscience. Cambridge, MA: MIT Press.
Gittins, JC (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41.
Schultz, W, Dayan, P, & Montague, PR (1997). A neural substrate of prediction and reward. Science, 275.

