University Paderborn 07 January 2009 RG Knowledge Based Systems Prof. Dr. Hans Kleine Büning Reinforcement Learning

Outline
–Motivation
–Applications
–Markov Decision Processes
–Q-learning
–Examples

Reinforcement Learning: The Idea
A way of programming agents by reward and punishment, without specifying how the task is to be achieved

Learning to Ride a Bicycle
[Figure: agent-environment loop, labelled Environment, state, action]

Learning to Ride a Bicycle
States:
–Angle of handlebars
–Angular velocity of handlebars
–Angle of bicycle to vertical
–Angular velocity of bicycle to vertical
–Acceleration of angle of bicycle to vertical

Learning to Ride a Bicycle
Actions:
–Torque to be applied to the handlebars
–Displacement of the center of mass from the bicycle's plane (in cm)

Learning to Ride a Bicycle
Reward:
–Angle of bicycle to vertical is greater than 12°: Reward = -1
–Otherwise: Reward = 0
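
The reward rule above can be sketched as a small function. This is only an illustration: the names `FALL_ANGLE_DEG` and `bicycle_reward` are hypothetical, and treating a tilt to either side symmetrically (via `abs`) is an assumption not stated on the slide.

```python
FALL_ANGLE_DEG = 12.0  # threshold taken from the slide


def bicycle_reward(tilt_deg: float) -> int:
    """Return -1 if the bicycle tilts more than 12 degrees from vertical, else 0.

    Assumption: the tilt is signed, and a tilt to either side counts equally.
    """
    return -1 if abs(tilt_deg) > FALL_ANGLE_DEG else 0
```

With this signal the agent is never told how to balance; it only learns that tilting too far is bad.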

Learning To Ride a Bicycle Reinforcement Learning

Reinforcement Learning: Applications
Board Games
–TD-Gammon, a program based on reinforcement learning, has become a world-class backgammon player
Mobile Robot Control
–Learning to ride a bicycle
–Navigation
–Pole-balancing
–Acrobot
Sequential Process Control
–Elevator dispatching

History of Reinforcement Learning
–Trial-and-error learning in the psychology of animal learning
–Optimal control and dynamic programming
–Temporal-difference methods

Key Features of Reinforcement Learning
Learner is not told which actions to take
Trial-and-error search
Possibility of delayed reward:
–Sacrifice of short-term gains for greater long-term gains
Explore/exploit trade-off
Considers the whole problem of a goal-directed agent interacting with an uncertain environment

The Agent-Environment Interaction
Agent and environment interact at discrete time steps: $t = 0, 1, 2, \dots$
–Agent observes state at step $t$: $s_t \in S$
–produces action at step $t$: $a_t \in A$
–gets resulting reward: $r_{t+1} \in \mathbb{R}$
–and resulting next state: $s_{t+1} \in S$
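
The interaction described above can be sketched as a loop. The environment `ChainEnv`, its `reset`/`step` interface, and the helper `run_episode` are hypothetical names chosen for illustration; they are not part of the lecture.

```python
class ChainEnv:
    """Hypothetical toy environment: states 0..4 on a line, goal at state 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to [0, 4]
        self.state = min(4, max(0, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and record (s_t, a_t, r_{t+1}, s_{t+1}) per step."""
    s = env.reset()
    trajectory = []
    for _ in range(max_steps):
        a = policy(s)                  # agent produces action a_t
        s_next, r, done = env.step(a)  # environment returns r_{t+1}, s_{t+1}
        trajectory.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return trajectory


traj = run_episode(ChainEnv(), lambda s: +1)  # a policy that always moves right
```

Each tuple in `traj` corresponds to one time step of the interaction.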

The Agent's Goal
Coarsely, the agent's goal is to get as much reward as it can over the long run
A policy is a mapping from states to actions: $\pi(s) = a$
Reinforcement learning methods specify how the agent changes its policy as a result of experience
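
For a finite state set, a deterministic policy can be held in a lookup table. The sketch below is illustrative only: the state and action names are made up, and changing the policy from action-value estimates is one possible way to realize "changes its policy as a result of experience", assumed here rather than taken from the slide.

```python
# A deterministic policy as a lookup table: pi(s) = a (hypothetical states/actions).
policy = {"s0": "right", "s1": "right", "s2": "up"}


def act(policy, state):
    """Return the action the policy prescribes in the given state."""
    return policy[state]


def improve(policy, state, action_values):
    """Return a copy of the policy that is greedy in `state`.

    `action_values` maps each action to an estimate of its long-run reward,
    obtained from experience (assumption for this sketch).
    """
    new_policy = dict(policy)
    new_policy[state] = max(action_values, key=action_values.get)
    return new_policy


improved = improve(policy, "s0", {"left": 0.2, "right": 0.1})
```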

Deterministic Markov Decision Process

Example

Example: Corresponding MDP

Example: Policy

Value of Policy and Rewards

Value of Policy and Agent's Task
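
A common formalization, assumed here rather than taken from the slide, defines the value of a policy $\pi$ as the discounted sum of the rewards received when following $\pi$ from a state, using the reward indexing $r_{t+1}$ from the agent-environment interaction. The discount factor $\gamma$ is a symbol introduced for this sketch.

```latex
V^{\pi}(s_t) = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots
             = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad 0 \le \gamma < 1

\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S
```

Under this reading, the agent's task is to learn a policy $\pi^{*}$ that maximizes $V^{\pi}(s)$ in every state.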

Nondeterministic Markov Decision Process P = 0.8 P = 0.1

Example with South-Eastern Wind

Methods
Model (reward function and transition probabilities) is known:
–discrete states: Dynamic Programming
–continuous states: Value Function Approximation + Dynamic Programming
Model (reward function or transition probabilities) is unknown:
–discrete states: Reinforcement Learning (Q-learning, Monte Carlo Methods)
–continuous states: Value Function Approximation + Reinforcement Learning

Q-learning Algorithm
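
A minimal sketch of tabular Q-learning follows. It is not the lecture's exact pseudocode: the epsilon-greedy exploration, the learning rate `alpha`, the discount `gamma`, the `step(s, a) -> (s_next, r, done)` interface, and the toy chain task are all assumptions made for illustration. The update used is Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)].

```python
import random
from collections import defaultdict


def q_learning(step, actions, start, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(actions)  # explore
            else:
                # exploit, breaking ties between equally good actions at random
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next, r, done = step(s, a)
            # a terminal state contributes no future value
            future = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q


def chain_step(s, a):
    """Hypothetical deterministic task: states 0..4, reward 1 on reaching state 4."""
    s_next = min(4, max(0, s + a))
    return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4


ACTIONS = [-1, +1]
Q = q_learning(chain_step, ACTIONS, start=0)
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)}
```

On this toy task the greedy policy read off the learned table moves right in every non-terminal state. With `alpha=1` the update reduces to Q(s, a) ← r + γ max_a' Q(s', a'), which is sufficient when transitions and rewards are deterministic.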

Example

Example: Q-table Initialization

Example: Episode 1

Example: Q-table

Example: Episode 1

Episode 1

Example: Q-table

Example: Episode 2

Example: Q-table after Convergence

Example: Value Function after Convergence

Example: Optimal Policy

Q-learning

Convergence of Q-learning

Blackjack
Standard rules of blackjack hold
State space:
–element[0]: current value of the player's hand (4-21)
–element[1]: value of the dealer's face-up card (2-11)
–element[2]: whether the player has a usable ace (0/1)
Starting states:
–player has any 2 cards (uniformly distributed), dealer has any 1 card (uniformly distributed)
Actions:
–HIT
–STICK
Rewards:
–Loss: -1
–Draw: 0
–Win: +1
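
The state encoding and reward scheme above might be implemented along the following lines. This is a sketch under assumptions: cards are represented as integers with an ace given as 1 and face cards as 10, element[2] is taken to be 1 when the player holds a usable ace, and special payouts such as naturals are ignored. The function names are hypothetical.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a hand; aces are 1, face cards are 10.

    An ace is "usable" if counting it as 11 does not push the total over 21.
    """
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


def encode_state(player_cards, dealer_card):
    """Build the 3-element state: (player total, dealer card value, usable-ace flag)."""
    total, usable = hand_value(player_cards)
    dealer_value = 11 if dealer_card == 1 else dealer_card  # dealer's ace shown as 11
    return (total, dealer_value, int(usable))


def final_reward(player_total, dealer_total):
    """Return -1 for a loss, 0 for a draw, +1 for a win."""
    if player_total > 21:
        return -1  # player busts
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    if player_total == dealer_total:
        return 0
    return -1
```

For example, a player holding an ace and a 6 against a dealer's 10 is encoded as `(17, 10, 1)`.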

Blackjack: Optimal Policy

Reinforcement Learning: Example
States
–Grid cells
Actions
–Left
–Up
–Right
–Down
Rewards
–Bonus: 20
–Food: 1
–Predator: -10
–Empty cell: -0.1
Transition probabilities
–0.80: the agent goes where it intends to go
–0.20: the agent moves to any other adjacent cell or remains where it was (if it is on the border of the grid world, it wraps around to the opposite side)
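
The transition model above can be sketched as follows. How the remaining 0.20 is divided is not specified on the slide; this sketch assumes it is split evenly over the three other directions and staying in place (0.05 each). The names `MOVES`, `transition_distribution`, and `sample_next` are hypothetical.

```python
import random

MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}


def transition_distribution(pos, action, n_rows, n_cols):
    """Return {next_cell: probability} for taking `action` in cell `pos`.

    Moving off one border wraps around to the opposite side of the grid.
    """
    row, col = pos
    outcomes = {}

    def add(delta, p):
        cell = ((row + delta[0]) % n_rows, (col + delta[1]) % n_cols)
        outcomes[cell] = outcomes.get(cell, 0.0) + p

    add(MOVES[action], 0.80)  # intended move
    # Assumption: the remaining 0.20 is shared equally by the other three
    # directions and by staying in place.
    others = [d for name, d in MOVES.items() if name != action] + [(0, 0)]
    for d in others:
        add(d, 0.20 / len(others))
    return outcomes


def sample_next(pos, action, n_rows, n_cols, rng=random):
    """Draw the next cell according to the transition distribution."""
    dist = transition_distribution(pos, action, n_rows, n_cols)
    cells = list(dist)
    return rng.choices(cells, weights=[dist[c] for c in cells], k=1)[0]


dist = transition_distribution((0, 0), "left", 5, 5)
```

In a 5 x 5 grid, moving left from the top-left cell `(0, 0)` lands in `(0, 4)` with probability 0.80 because of the wrap-around.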

