Reinforcement Learning Chapter 21 Vassilis Athitsos
Reinforcement Learning In previous chapters: Learning from examples. Reinforcement learning: Learning what to do. Learning to fly (a helicopter). Learning to play a game. Learning to walk. Learning based on rewards.
Relation to MDPs Feedback can be provided at the end of the sequence of actions, or more frequently (compare chess and ping-pong). There may be no complete model of the environment: transitions may be unknown, and the reward function may be unknown.
Agents Utility-based agent: learns a utility function on states. Q-learning agent: learns a utility function on (action, state) pairs. Reflex agent: learns a function mapping states to actions.
Passive Reinforcement Learning Assume a fully observable environment. Passive learning: the policy is fixed (behavior does not change). The agent learns how good each state is. Similar to policy evaluation, but the transition function and reward function are unknown. Why is it useful? For future policy revisions.
Direct Utility Estimation For each state the agent ever visits, and for each time the agent visits that state: keep track of the accumulated rewards from that visit onwards. Similar to inductive learning: learning a function on states from samples. Weaknesses: ignores correlations between utilities of neighboring states; converges very slowly.
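A minimal sketch of direct utility estimation, assuming each trial is recorded as a list of (state, reward) pairs obtained while following the fixed policy; the function name and data layout are illustrative.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average (discounted) return observed from s onwards.

    trials: list of trials, each a list of (state, reward) pairs recorded
            while following the fixed policy.
    """
    totals = defaultdict(float)   # sum of observed returns per state
    counts = defaultdict(int)     # number of visits per state
    for trial in trials:
        ret = 0.0
        returns = []
        # Scan backwards so each position's return is reward + gamma * later return.
        for state, reward in reversed(trial):
            ret = reward + gamma * ret
            returns.append((state, ret))
        for state, ret in returns:
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```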
Adaptive Dynamic Programming Learns transitions and state utilities. Plugs values into Bellman equations. Solves equations with linear algebra, or policy iteration. Problem: Intractable for large number of states. Example: backgammon. 10^50 equations, with 10^50 unknowns.
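A sketch of a passive ADP learner under these assumptions: the fixed policy is given as a dictionary from states to actions, transition probabilities are estimated from observed counts, and the evaluation equations are solved by simple iteration rather than exact linear algebra. Class and method names are illustrative.

```python
from collections import defaultdict, Counter

class PassiveADP:
    """Passive ADP sketch: learn the transition model and rewards from experience,
    then re-evaluate the fixed policy on the learned model."""

    def __init__(self, policy, gamma=0.9):
        self.policy = policy                 # fixed policy: state -> action
        self.gamma = gamma
        self.counts = defaultdict(Counter)   # (s, a) -> Counter of next states
        self.R = {}                          # observed rewards per state
        self.U = defaultdict(float)          # utility estimates

    def observe(self, s, r, a, s_next, r_next):
        """Record one transition and re-solve the evaluation equations."""
        self.R[s] = r
        self.R[s_next] = r_next
        self.counts[(s, a)][s_next] += 1
        self._evaluate_policy(iterations=50)

    def _evaluate_policy(self, iterations):
        # Iteratively apply U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U(s'),
        # with T estimated from the observed transition counts.
        for _ in range(iterations):
            for (s, a), outcomes in self.counts.items():
                if a != self.policy.get(s):
                    continue
                total = sum(outcomes.values())
                expected = sum(n / total * self.U[s2] for s2, n in outcomes.items())
                self.U[s] = self.R.get(s, 0.0) + self.gamma * expected
```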
Temporal Difference Every time we make a transition from state s to state s': Update the utility of s': if s' has not been seen before, set U[s'] to its currently observed reward. Update the utility of s: U[s] = (1 - a) U[s] + a (r + g U[s']). a: learning rate. r: reward received in state s. g: discount factor.
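The update above, as a small sketch; `alpha` and `gamma` play the roles of a and g, and unseen states are assumed to default to utility 0.

```python
def td_update(U, s, s_next, r, alpha=0.1, gamma=0.9):
    """One temporal-difference update after observing the transition s -> s_next.

    U: dict mapping states to utility estimates (unseen states default to 0).
    r: reward received in state s.
    """
    U[s] = (1 - alpha) * U.get(s, 0.0) + alpha * (r + gamma * U.get(s_next, 0.0))
    return U
```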
Properties of Temporal Difference What happens when an unlikely transition occurs? U[s] becomes a bad approximation of the true utility. However, because such transitions are rare, U[s] is rarely a bad approximation. The average value of U[s] converges to the correct value. If the learning rate a decreases appropriately over time, U[s] itself converges to the correct value.
Hybrid Methods ADP: more accurate, but slower and intractable for large numbers of states. TD: less accurate, but faster and tractable. An intermediate approach, pseudo-experiences: imagine transitions that have not actually happened, and update utilities according to those transitions.
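One way to realize pseudo-experiences, sketched under the assumption that a learned model maps each state to (probability, reward, next state) outcomes under the policy's action; imagined transitions are sampled from this model and given the same TD-style update as real ones. The function name and model layout are illustrative.

```python
import random

def replay_pseudo_experiences(U, model, n_updates=20, alpha=0.1, gamma=0.9):
    """Apply TD-style updates to imagined transitions sampled from a learned model.

    model: dict mapping a state s to a list of (probability, reward, next_state)
           triples estimated from past experience.
    """
    states = list(model)
    for _ in range(n_updates):
        s = random.choice(states)
        outcomes = model[s]
        weights = [p for p, _, _ in outcomes]
        _, r, s_next = random.choices(outcomes, weights=weights, k=1)[0]
        U[s] = (1 - alpha) * U.get(s, 0.0) + alpha * (r + gamma * U.get(s_next, 0.0))
    return U
```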
Hybrid Methods Making ADP more efficient: Do a limited number of adjustments after each transition. Use estimated transition probabilities to identify the most useful adjustments.
Active Reinforcement Learning Using passive reinforcement learning, utilities of states and transition probabilities are learned. Those utilities and transitions can be plugged into the Bellman equations. Problem? The Bellman equations give optimal solutions only when the utility and transition functions are correct, whereas passive reinforcement learning produces approximate estimates of those functions. Solutions?
Exploration/Exploitation The goal is to maximize utility, but the utility function is only approximately known. Dilemma: should the agent maximize utility based on its current knowledge (exploitation), or try to improve that knowledge (exploration)? Answer: a little of both.
Exploration Function U[s] = R[s] + g max over actions a of f(Q(a,s), N(a,s)). R[s]: current reward. g: discount factor. Q(a,s): estimated utility of performing action a in state s. N(a,s): number of times action a has been performed in state s. f(u, n): preference according to the utility and the degree of exploration so far for (a, s). Initialization: U[s] = an optimistically large value.
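A minimal sketch of one common choice of exploration function: report an optimistic value until an (action, state) pair has been tried enough times, then report the current estimate. The constants R_PLUS and N_E, and the dictionary-based Q and N, are assumptions for illustration.

```python
R_PLUS = 2.0   # assumed optimistic upper bound on achievable utility
N_E = 5        # assumed minimum number of tries per (action, state) pair

def exploration_f(u, n):
    """f(u, n): prefer under-explored pairs by reporting an optimistic value."""
    return R_PLUS if n < N_E else u

def exploratory_value(s, actions, R, Q, N, gamma=0.9):
    """U[s] = R[s] + g * max over a of f(Q(a, s), N(a, s))."""
    return R[s] + gamma * max(exploration_f(Q[(a, s)], N[(a, s)]) for a in actions)
```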
Q-learning Learning the utility of state-action pairs: U[s] = max over actions a of Q(a, s). Learning can be done using TD: Q(a,s) = (1 - b) Q(a,s) + b (R(s) + g max over a' of Q(a', s')). b: learning rate. g: discount factor. s': next state. a': any action available at the next state.
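The Q-learning update as a short sketch, with `beta` and `gamma` in the roles of b and g; Q is assumed to be a dictionary keyed by (action, state) pairs, and `actions` lists the actions available in the next state.

```python
def q_update(Q, s, a, s_next, r, actions, beta=0.1, gamma=0.9):
    """One Q-learning update for the observed transition (s, a) -> s_next.

    Q: dict mapping (action, state) pairs to value estimates.
    r: reward R(s) received in state s.
    """
    best_next = max(Q.get((a2, s_next), 0.0) for a2 in actions)
    Q[(a, s)] = (1 - beta) * Q.get((a, s), 0.0) + beta * (r + gamma * best_next)
    return Q
```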
Generalization in Reinforcement Learning How do we apply reinforcement learning to problems with huge numbers of states (chess, backgammon, …)? Solution, similar to estimating probabilities of a huge number of events: learn a parametric function of features of each state, and learn the parameters instead of a table of utilities. Example: chess. About 20 features are adequate for describing the current board.
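A sketch of TD learning with a parametric (linear) utility function, assuming states are already described by numeric feature vectors; the parameters theta are updated instead of a table of utilities. Function names are illustrative.

```python
def linear_utility(theta, f):
    """U_theta(s) = sum_i theta_i * f_i(s), for a state described by features f."""
    return sum(t * x for t, x in zip(theta, f))

def td_update_linear(theta, f_s, f_next, r, alpha=0.01, gamma=0.9):
    """TD update applied to the parameters rather than to individual states.

    f_s, f_next: feature vectors of the current and next state.
    r: reward received in the current state.
    """
    error = r + gamma * linear_utility(theta, f_next) - linear_utility(theta, f_s)
    return [t + alpha * error * x for t, x in zip(theta, f_s)]
```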
Learning Parametric Utility Functions For Backgammon First approach: design a weighted linear function of 16 terms. Collect a training set of board states. Ask human experts to evaluate the training states. Result: the program was not competitive with human experts, and collecting the training data was very tedious.
Learning Parametric Utility Functions For Backgammon Second approach: design a weighted linear function of 16 terms. Let the system play against itself, with the reward provided at the end of each game. Result (after 300,000 games, taking a few weeks): the program was competitive with the best players in the world.