Presentation on theme: "Reinforcement Learning" — Presentation transcript:
1 Reinforcement Learning
16 January 2009, RG Knowledge Based Systems, Hans Kleine Büning
My name is Natalia Akchurina. I am doing my PhD at the University of Paderborn in Germany, and I would like to present to your attention my report on the theme Multi-Agent Reinforcement Learning Algorithm with Variable Optimistic-Pessimistic Criterion.
2 Outline
Motivation, Applications, Markov Decision Processes, Q-learning, Examples.
Here is the outline of my presentation. First I would like to introduce the concept of reinforcement learning to you. Second I would like to present to your attention the concept of learning by observation: the developed methods, the benchmarks I test the methods on, and the results of the experiments. Then I would like to dwell on virtual markets and present the developed sellers based on the Q-learning approach, the models of virtual markets I tested the developed sellers on, and the results of the experiments. Then I would like to dwell on how the learning-by-observation concept can be applied to virtual markets. Then conclusions, my publications during my PhD studies, future plans, and how I see the structure of my PhD work.
3 How to program a robot to ride a bicycle?
We can't program the robot in a traditional way. Sometimes we don't have a precise model: how to keep the bicycle in equilibrium (we have some abstract model but we don't know the precise parameters), the model is of ultimate complexity, or we want our agents to adapt to a changing environment (the model of this change is too complicated, or we don't have the slightest idea of it).
Randlov J. and Alstrom P. Learning to drive a bicycle using reinforcement learning and shaping, 1998.
4 Reinforcement Learning: The Idea
A way of programming agents by reward and punishment without specifying how the task is to be achieved.
Reinforcement learning provides such a way: the agent chooses an action, and the environment provides it with a reward (feedback) that reflects how well the agent is functioning in the environment, and changes its state (in general non-deterministically). The task of the agent is to learn, by trial and error from indirect, delayed reward, a policy that maps states of the environment to the actions the agent ought to take to maximize its cumulative reward.
5 Learning to Ride a Bicycle
[Diagram: agent-environment loop with state, action, reward]
For this task the states are…
6 Learning to Ride a Bicycle
States:
- Angle of handlebars
- Angular velocity of handlebars
- Angle of bicycle to vertical
- Angular velocity of bicycle to vertical
- Acceleration of angle of bicycle to vertical
The states are characterized by the parameters above.
7 Learning to Ride a Bicycle
[Diagram: agent-environment loop with state, action, reward]
The actions in this task are…
8 Learning to Ride a Bicycle
Actions:
- Torque to be applied to the handlebars
- Displacement of the center of mass from the bicycle's plane (in cm)
The actions consist of the torque applied to the handlebars and the displacement of the center of mass.
9 Learning to Ride a Bicycle
[Diagram: agent-environment loop with state, action, reward]
And the rewards…
10 Angle of bicycle to vertical is greater than 12°? no: Reward = 0; yes: Reward = -1
The bicycle must be held upright within ±12° measured from the vertical position. If the angle of the bicycle to the vertical falls outside this interval, the bicycle falls and the agent receives a punishment of -1. Otherwise the reward is 0.
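The reward signal described above can be sketched in a few lines (the function name and the degree-valued argument are assumptions, not part of the slides):

```python
def reward(angle_to_vertical_deg):
    """Reward signal from the slide: -1 when the bicycle falls
    (angle to the vertical outside the +/-12 degree interval),
    0 otherwise."""
    if abs(angle_to_vertical_deg) > 12.0:
        return -1  # bicycle has fallen: punishment
    return 0       # still upright: neutral reward
```

Note that the agent is never rewarded positively; it only learns to postpone the -1 as long as possible, which is exactly what keeps the bicycle upright.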
11 Learning to Ride a Bicycle
Reinforcement Learning
So, provided only with the above information about its state and its performance, and able to apply torque to the handlebars and displace its center of mass, the robot itself, with the use of reinforcement learning, has learnt to ride a bicycle.
12 Reinforcement Learning: Applications
Board games: the TD-Gammon program, based on reinforcement learning, has become a world-class backgammon player.
Mobile robot control: learning to ride a bicycle, navigation, pole-balancing, Acrobot.
Sequential process control: elevator dispatching.
In spite of the simplicity of the idea, reinforcement learning has some very successful applications.
TD-Gammon: As Deep Blue became a world-class chess player due to its computational power and not to sophisticated AI algorithms, TD-Gammon became a world-class backgammon player due to the high efficiency of reinforcement learning. The program was based on the reinforcement learning approach and used value function approximation based on a neural network. It was trained against itself.
Learning to ride a bicycle: To balance on the bicycle and to ride it are tasks that can't be solved by traditional methods. With the use of reinforcement learning they have been solved: the robot, after a lot of falling down, managed to balance on the bicycle and to ride it.
Elevator dispatching: Waiting for an elevator is a situation with which we are all familiar. We press a button and then wait for an elevator to arrive traveling in the right direction. We may have to wait a long time if there are too many passengers or not enough elevators. Just how long we wait depends on the dispatching strategy the elevators use to decide where to go. For example, if passengers on several floors have requested pickups, which should be served first? If there are no pickup requests, how should the elevators distribute themselves to await the next request? Elevator dispatching is a good example of a stochastic optimal control problem of economic importance that is too large to solve by classical techniques such as dynamic programming. Crites and Barto studied the application of reinforcement learning techniques to a four-elevator, ten-floor system.
Along the right-hand side are pickup requests and an indication of how long each has been waiting. Each elevator has a position, direction, and speed, plus a set of buttons to indicate where passengers want to get off. Roughly quantizing the continuous variables, Crites and Barto estimated that the system has over 10^22 states. This large state set rules out classical dynamic programming methods such as value iteration. Even if one state could be backed up every microsecond, it would still require over 1000 years to complete just one sweep through the state space.

In practice, modern elevator dispatchers are designed heuristically and evaluated on simulated buildings. The simulators are quite sophisticated and detailed. The physics of each elevator car is modeled in continuous time with continuous state variables. Passenger arrivals are modeled as discrete, stochastic events, with arrival rates varying frequently over the course of a simulated day. Not surprisingly, the times of greatest traffic and greatest challenge to the dispatching algorithm are the morning and evening rush hours. Dispatchers are generally designed primarily for these difficult periods.

The performance of elevator dispatchers is measured in several different ways, all with respect to an average passenger entering the system. The average waiting time is how long the passenger waits before getting on an elevator, and the average system time is how long the passenger waits before being dropped off at the destination floor. Another frequently encountered statistic is the percentage of passengers whose waiting time exceeds 60 seconds. The objective that Crites and Barto focused on is the average squared waiting time. This objective is commonly used because it tends to keep the waiting times low while also encouraging fairness in serving all the passengers.

Crites and Barto applied a version of one-step Q-learning augmented in several ways to take advantage of special features of the problem.
The most important of these concerned the formulation of the actions. First, each elevator made its own decisions independently of the others. Second, a number of constraints were placed on the decisions. An elevator carrying passengers could not pass by a floor if any of its passengers wanted to get off there, nor could it reverse direction until all of its passengers wanting to go in its current direction had reached their floors. In addition, a car was not allowed to stop at a floor unless someone wanted to get on or off there, and it could not stop to pick up passengers at a floor if another elevator was already stopped there. Finally, given a choice between moving up or down, the elevator was constrained always to move up (otherwise evening rush hour traffic would tend to push all the elevators down to the lobby). These last three constraints were explicitly included to provide some prior knowledge and make the problem easier. The net result of all these constraints was that each elevator had to make few and simple decisions. The only decision that had to be made was whether or not to stop at a floor that was being approached and that had passengers waiting to be picked up. At all other times, no choices needed to be made.

That each elevator made choices only infrequently permitted a second simplification of the problem. As far as the learning agent was concerned, the system made discrete jumps from one time at which it had to make a decision to the next. When a continuous-time decision problem is treated as a discrete-time system in this way, it is known as a semi-Markov decision process. To a large extent, such processes can be treated just like any other Markov decision process by taking the reward on each discrete transition as the integral of the reward over the corresponding continuous-time interval.
The notion of return generalizes naturally from a discounted sum of future rewards to a discounted integral of future rewards.

One complication is that the reward as defined -- the negative sum of the squared waiting times -- is not something that would normally be known while an actual elevator was running. This is because in a real elevator system one does not know how many people are waiting at a floor, only how long it has been since the button requesting a pickup on that floor was pressed. Of course this information is known in a simulator, and Crites and Barto used it to obtain their best results. They also experimented with another technique that used only information that would be known in an on-line learning situation with a real set of elevators. In this case one can use how long it has been since each button was pushed, together with an estimate of the arrival rate, to compute an expected summed squared waiting time for each floor. Using this in the reward measure proved nearly as effective as using the actual summed squared waiting time.

For function approximation, a nonlinear neural network trained by backpropagation was used to represent the action-value function. Crites and Barto experimented with a wide variety of ways of representing states to the network. After much exploration, their best results were obtained using networks with 47 input units, 20 hidden units, and two output units, one for each action. The way the state was encoded by the input units was found to be critical to the effectiveness of the learning. The 47 input units were as follows:
18 units: Two units encoded information about each of the nine hall buttons for down pickup requests. A real-valued unit encoded the elapsed time if the button had been pushed, and a binary unit was on if the button had not been pushed.
16 units: A unit for each possible location and direction for the car whose decision was required.
Exactly one of these units was on at any given time.
10 units: The location of the other elevators superimposed over the 10 floors. Each elevator had a "footprint" that depended on its direction and speed. For example, a stopped elevator caused activation only on the unit corresponding to its current floor, but a moving elevator caused activation on several units corresponding to the floors it was approaching, with the highest activations on the closest floors. No information was provided about which one of the other cars was at a particular location.
1 unit: This unit was on if the elevator whose decision was required was at the highest floor with a passenger waiting.
1 unit: This unit was on if the elevator whose decision was required was at the floor with the passenger who had been waiting for the longest amount of time.
1 unit: Bias unit, always on.

Two architectures were used. In RL1, each elevator was given its own action-value function and its own neural network. In RL2, there was only one network and one action-value function, with the experiences of all four elevators contributing to learning in the one network. In both cases, each elevator made its decisions independently of the other elevators, but shared a single reward signal with them. This introduced additional stochasticity as far as each elevator was concerned, because its reward depended in part on the actions of the other elevators, which it could not control. In the architecture in which each elevator had its own action-value function, it was possible for different elevators to learn different specialized strategies (although in fact they tended to learn the same strategy). On the other hand, the architecture with a common action-value function could learn faster because it learned simultaneously from the experiences of all elevators. Training time was an issue here, even though the system was trained in simulation.
The reinforcement learning methods were trained for about four days of computer time on a 100 MIPS processor (corresponding to about 60,000 hours of simulated time). While this is a considerable amount of computation, it is negligible compared with what would be required by any conventional dynamic programming algorithm.

By all of the performance measures, the reinforcement learning dispatchers compare favorably with the others. Although the optimal policy for this problem is unknown, and the state of the art is difficult to pin down because details of commercial dispatching strategies are proprietary, these learned dispatchers appeared to perform very well.
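The generalization of the return mentioned in the notes above can be written out; this is the standard semi-Markov formulation (the symbols, including the continuous-time discount rate β, are conventional notation rather than taken from the slides):

```latex
R_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
\quad\longrightarrow\quad
R_t \;=\; \int_{0}^{\infty} e^{-\beta \tau}\, r_{t+\tau}\, d\tau ,
```

where β plays the role that the discount factor γ plays in discrete time.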
13 Key Features of Reinforcement Learning
- The learner is not told which actions to take
- Trial-and-error search
- Possibility of delayed reward: sacrifice of short-term gains for greater long-term gains
- Explore/exploit trade-off
- Considers the whole problem of a goal-directed agent interacting with an uncertain environment
14 The Agent-Environment Interaction
Agent and environment interact at discrete time steps t = 0, 1, 2, …
The agent observes the state at step t: s_t ∈ S,
produces an action at step t: a_t ∈ A,
gets the resulting reward r_{t+1} ∈ ℝ,
and the resulting next state s_{t+1} ∈ S.
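This interaction loop can be sketched as follows; the `env` object with `reset`/`step` methods and the `policy` callable are hypothetical interfaces used for illustration, not part of the slides:

```python
def run_episode(env, policy, max_steps=1000):
    """One episode of the agent-environment interaction: at each
    step t the agent observes s_t, picks a_t = policy(s_t), and the
    environment answers with r_{t+1} and s_{t+1}."""
    s = env.reset()                     # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = policy(s)                   # a_t in A
        s_next, r, done = env.step(a)   # r_{t+1}, s_{t+1}
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```

The loop accumulates the rewards so the agent's long-run performance can be compared across episodes.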
15 The Agent's Goal
Coarsely, the agent's goal is to get as much reward as it can over the long run. A policy is a mapping from states to actions: π(s) = a. Reinforcement learning methods specify how the agent changes its policy as a result of experience.
16 Deterministic Markov Decision Process
Formally it can be defined in the following way. Here γ ∈ [0,1) is the discount factor that determines the relative value of delayed versus immediate rewards. The reward function and the transition function must have the Markovian property.
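The formal definition the slide refers to can be sketched as follows; this is a standard textbook formulation, with symbol names not taken verbatim from the slide:

```latex
\text{A deterministic MDP is a tuple } (S, A, \delta, r), \text{ where}\\
S \text{ is a finite set of states},\qquad
A \text{ is a finite set of actions},\\
\delta : S \times A \to S \text{ is the transition function},\qquad
r : S \times A \to \mathbb{R} \text{ is the reward function},
```

together with the discount factor γ ∈ [0,1). The Markovian property means that δ and r depend only on the current state and action, not on the history.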
17 Example
Let's consider a simple grid world. In states s1-s8 the agent can take the actions LEFT, UP, RIGHT, DOWN, as shown in the picture. This is a deterministic MDP, and the agent gets where it intends to get. s9 is an absorbing state: the agent can't take any action to leave it. s9 is the terminal state of an episode; s1 is the starting state. Above the arrows are the rewards for taking the corresponding actions in the states. Let's find the optimal policy with the use of Q-learning (the students will have the algorithm as hand-outs).
Parameters of Q-learning: alpha = 0.1, gamma = 0.9, epsilon = 0.1.
23 Value of Policy and Agent's Task
V^π(s_t) is the cumulative reward achieved by following a policy π from an initial state s_t. π* is the optimal policy: the one that maps states of the world to the actions the agent ought to take to maximize its cumulative reward.
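Written out as formulas consistent with these definitions (standard notation, not copied from the slide):

```latex
V^{\pi}(s_t) \;=\; r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \dots
\;=\; \sum_{i=0}^{\infty} \gamma^{i}\, r_{t+i},
\qquad
\pi^{*} \;=\; \arg\max_{\pi}\; V^{\pi}(s) \quad \text{for all } s \in S .
```

The discount factor γ ∈ [0,1) makes the infinite sum finite and weights immediate rewards above delayed ones.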
24 Nondeterministic Markov Decision Process
In a nondeterministic Markov decision process, the transition to the next state upon taking an action in the current state is probabilistic. The robot decides to move to the right; with probability 0.8 it gets where it intends to, and with probability 0.2 it gets to one of the other two adjacent grids.
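The 0.8/0.2 transition rule above can be sketched as follows; the function and argument names are hypothetical, and splitting the remaining 0.2 mass uniformly over the other adjacent cells is an assumption for illustration:

```python
import random

def sample_next_state(intended, others, p_intended=0.8):
    """Sample the successor state for the slide's noisy move: with
    probability 0.8 the agent reaches the intended cell; with the
    remaining 0.2 it ends up in one of the other adjacent cells,
    chosen uniformly."""
    if random.random() < p_intended:
        return intended
    return random.choice(others)
```

Repeating the draw many times recovers the stated transition probabilities empirically.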
27 Example with South-Eastern Wind
Let's consider the same simple grid world, now with a south-eastern wind. In states s1-s8 the agent can take the actions LEFT, UP, RIGHT, DOWN, as shown in the picture. Because of the wind this is a nondeterministic MDP: the agent does not always get where it intends to get. s9 is an absorbing state: the agent can't take any action to leave it. s9 is the terminal state of an episode; s1 is the starting state. Above the arrows are the rewards for taking the corresponding actions in the states. Let's find the optimal policy with the use of Q-learning (the students will have the algorithm as hand-outs).
Parameters of Q-learning: alpha = 0.1, gamma = 0.9, epsilon = 0.1.
29 Methods
- Model (reward function and transition probabilities) known, discrete states: Dynamic Programming
- Model known, continuous states: Value Function Approximation + Dynamic Programming
- Model (reward function or transition probabilities) unknown, discrete states: Reinforcement Learning, Monte Carlo Methods
- Model unknown, continuous states: Value Function Approximation + Reinforcement Learning
We must use value function approximation when the number of states is very large or continuous. In this case we no longer have theoretical results of convergence to the optimal policy, and developing methods of approximation is an active problem of research. The approaches to value function approximation are numerous: decision trees, memory-based methods, linear approximation, neural networks. What is lacking is theoretical results of convergence to the optimal policy; even worse, for many widely used reinforcement learning algorithms (for example Q-learning), examples are known where using value function approximation leads to divergence. It is worth emphasizing that reinforcement learning is applied when the model (reward function or transition probabilities) is unknown.
31 Q-learning
One of the most important breakthroughs in reinforcement learning was the development of the Q-learning algorithm. Q-learning is also the most popular reinforcement learning algorithm. It computes a tabular function Q(s,a) whose greatest value in each particular state s indicates the action a that should be taken so as to maximize the expected cumulative reward. The students will have it as hand-outs with the exercise.
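A minimal tabular Q-learning sketch with the standard update rule Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)); the `env` interface (`reset`/`actions`/`step`) is a hypothetical stand-in for the hand-out's grid world, not the hand-out itself:

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: after each transition (s, a, r, s') apply
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            actions = env.actions(s)
            if random.random() < epsilon:                  # explore
                a = random.choice(actions)
            else:                                          # exploit
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(
                Q[(s2, b)] for b in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

The epsilon parameter controls the explore/exploit trade-off mentioned on the key-features slide: with probability epsilon the agent tries a random action instead of the currently best one.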
32 Example
Let's consider a simple grid world. In states s1-s8 the agent can take the actions LEFT, UP, RIGHT, DOWN, as shown in the picture. This is a deterministic MDP, and the agent gets where it intends to get. s9 is an absorbing state: the agent can't take any action to leave it. s9 is the terminal state of an episode; s1 is the starting state. Above the arrows are the rewards for taking the corresponding actions in the states. Let's find the optimal policy with the use of Q-learning (the students will have the algorithm as hand-outs).
Parameters of Q-learning: alpha = 0.1, gamma = 0.9, epsilon = 0.1.
33 Example: Q-table Initialization
First we initialize the Q-table with arbitrary numbers (let them be zeros). Some actions are forbidden in some states; for example, we can't take the action LEFT in state s1.
34 Example: Episode 1
The episode starts at s1 (just the conditions of the task).
35 Example: Episode 1
All values of the Q-table for state s1 are equal, so we choose the action RIGHT probabilistically. We take action RIGHT and observe the reward and the new state. Then we update the Q-value of s1 and the action we took according to the formula, and move to s2.
37 Example: Episode 1
We repeat the same sequence once more without much difference (all Q-values won't change and will stay zero).
38 Example: Episode 1
After some steps the agent finds itself in state s8.
39 Example: Q-table
After all these steps the Q-table is still unchanged and zero, for all the actions we took on the path have zero rewards, and all Q-values of the next states were also zero.
40 Example: Episode 1
All values of the Q-table for state s8 are equal, so we choose an action probabilistically. We take the action and observe the reward and the new state. Then we update the Q-value of s8 and the action we took according to the formula. This time the reward is not zero, so we update the Q-table, and move to s9.
41 Episode 1
s9 is the terminal state, so the episode ends.
43 Example: Episode 2
The second episode starts at s5. In s5 we take the action DOWN.
44 Example: Episode 2
All values of the Q-table for state s5 are equal, so we choose the action DOWN probabilistically. We take action DOWN and observe the reward and the new state s8. Then we update the Q-value of s5 and the action we took according to the formula. This time, though the reward is zero, the new state has some positive Q-values. We move to s8.
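A worked instance of this update with the slide's parameters α = 0.1 and γ = 0.9; the value max_a' Q(s8, a') = 10 is an assumed number for illustration, since the actual rewards live in the picture:

```latex
Q(s_5,\mathrm{DOWN}) \;\leftarrow\; Q(s_5,\mathrm{DOWN})
  + \alpha\,\bigl(r + \gamma \max_{a'} Q(s_8,a') - Q(s_5,\mathrm{DOWN})\bigr)
  \;=\; 0 + 0.1\,\bigl(0 + 0.9 \cdot 10 - 0\bigr) \;=\; 0.9 .
```

This is how value propagates backwards: even with zero immediate reward, s5 inherits a fraction of s8's value.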
45 Example: Episode 2
After these updates the Q-table will look like this…
46 Example: Q-table after Convergence
Continuing in the same way, we finally get the following Q-table. Above the arrows are the Q-values of the corresponding states and actions.
47 Example: Value Function after Convergence
In green is the value function of the states: V*(s) = max_a Q(s,a).
48 Example: Optimal Policy
One of the optimal policies is shown with green arrows. To get an optimal policy, in each state we take an action with the maximal Q-value (the Q we get as a result of the Q-learning algorithm): π*(s) = argmax_a Q(s,a).
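The extraction step π*(s) = argmax_a Q(s,a) can be sketched in a couple of lines (storing Q as a dict keyed by (state, action) pairs is a hypothetical representation, not the hand-out's):

```python
def greedy_policy(Q, states, actions):
    """Extract pi*(s) = argmax_a Q(s, a) from a learned Q-table
    (Q is a dict keyed by (state, action) pairs)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

Note that ties are broken arbitrarily here, which is why the grid world can have several optimal policies.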
49 Example: Optimal Policy
This is one of the optimal policies.
50 Q-learning
Because the agent learns by trial and error, its performance is rather poor at the very beginning. This is the general view of the cumulative reward as time elapses. (Here there are also state-action pairs with negative rewards.)
51 Convergence of Q-learning
Q-learning was proved to converge to an optimal policy when the following conditions hold: every state-action pair is visited infinitely often, and the learning rates α_t satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞.
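As a quick numerical illustration of the standard learning-rate conditions Σ_t α_t = ∞ and Σ_t α_t² < ∞ (the schedule α_t = 1/t used below is a textbook example, not one prescribed by the slide):

```python
def partial_sums(schedule, n):
    """Partial sums of a learning-rate schedule alpha_t and of its
    squares: the convergence conditions require the first sum to
    diverge and the second to stay bounded."""
    s1 = sum(schedule(t) for t in range(1, n + 1))
    s2 = sum(schedule(t) ** 2 for t in range(1, n + 1))
    return s1, s2

# For alpha_t = 1/t the plain partial sums grow without bound
# (harmonic series), while the squared sums stay below pi^2/6.
```

A constant learning rate like the slide's alpha = 0.1 violates the second condition, which is why it gives tracking behavior rather than guaranteed convergence.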
52 Blackjack
Standard rules of blackjack hold.
State space: current value of the player's hand (4-21); value of the dealer's face-up card (2-11); whether the player has a usable ace (0/1).
Starting states: the player has any 2 cards (uniformly distributed), the dealer has any 1 card (uniformly distributed).
Actions: HIT, STICK.
Rewards: -1 for a loss, 0 for a draw, 1 for a win.
Another example is blackjack. Blackjack is one of the most popular casino card games in the world. Much of blackjack's popularity is due to the mix of chance with elements of skill, and the publicity that surrounds card counting (keeping track of which cards have been played since the last shuffle).
The rules of blackjack: Each player is dealt two cards and is then offered the opportunity to take more. The hand with the highest total wins as long as it doesn't exceed 21; a hand with a higher total than 21 is said to bust or have too many. Cards 2 through 10 are worth their face value, and face cards (jack, queen, king) are also worth 10. An ace's value is 11 unless this would cause the player to bust, in which case it is worth 1. A hand in which an ace's value is counted as 11 is called a soft hand, because it cannot be busted if the player draws another card. The goal of each player is to beat the dealer by having the higher, unbusted hand. Note that if the player busts he loses, even if the dealer also busts (therefore blackjack favors the dealer). If both the player and the dealer have the same point value, it is called a "push", and neither player nor dealer wins the hand. Each player has an independent game with the dealer, so it is possible for the dealer to lose to one player, but still beat the other players in the same round.
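The "usable ace" component of the state can be sketched as follows (a hypothetical helper; representing cards as integers with ace = 1 is an encoding assumption):

```python
def hand_value(cards):
    """Value of a blackjack hand, cards given as integers (ace = 1,
    face cards = 10).  An ace counts as 11 when that does not bust
    the hand ('usable ace'); otherwise it counts as 1.
    Returns (total, usable_ace)."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True   # soft hand: ace counted as 11
    return total, False           # hard hand
```

The (total, usable_ace) pair is exactly the player-side part of the state space listed on the slide.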
53 Blackjack: Optimal Policy
The optimal policy and optimal value function that were found.
54 Reinforcement Learning: Example
States: grids.
Actions: LEFT, UP, RIGHT, DOWN.
Rewards: bonus +20, food +1, predator -10, empty grid -0.1.
Transition probabilities: 0.80 the agent goes where it intends to go; 0.20 it goes to any other adjacent grid or remains where it was (in case it is on the border of the grid world, it goes to the other side).
Let's consider a simple grid world: the states, the actions the agent can take, the rewards when it finds itself in the states, and the transitions the environment returns in answer to the agent's choice of action. The aim is to save up the greatest cumulative reward.
55 Reinforcement Learning: Example
The agent chooses action UP.
56 Reinforcement Learning: Example
And it indeed gets up. It chooses action UP again.
57 Reinforcement Learning: Example
And it starts collecting rewards. This is the nearest allocation of rewards where it can travel from one grid with a reward to another (not from a reward to an empty grid).