Reinforcement Learning (II.) Exercise Solutions Ata Kaban A.Kaban@cs.bham.ac.uk School of Computer Science University of Birmingham 2003
Exercise 1 In the grid-based environment below, the state values have all been computed except for one. Possible actions are up, down, left and right. All actions yield zero reward except those that move the agent out of states A and B. Calculate the value of the blank state assuming a random policy (each possible action is selected with equal probability). Use a discount factor of γ = 0.9.
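Solution Label the blank state 5 and its four neighbours 1, 2, 3 and 4, with V(1) = 4.4, V(2) = 1.9, V(3) = 0.7 and V(4) = 3.0. Under the random policy each of the four actions is chosen with probability 0.25 and yields zero reward, so $V(5) = 0.25\,(0 + \gamma V(1)) + 0.25\,(0 + \gamma V(2)) + 0.25\,(0 + \gamma V(3)) + 0.25\,(0 + \gamma V(4))$. With $\gamma = 0.9$: $V(5) = 0.25\,(0.9)\,[4.4 + 1.9 + 0.7 + 3.0] = 2.25$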
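The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `random_policy_value` is made up for illustration, and it assumes, as in the exercise, that every move out of the blank state has zero reward and equal probability.

```python
def random_policy_value(neighbour_values, gamma=0.9, reward=0.0):
    """One-step backup for a state under a uniformly random policy.

    neighbour_values: values of the states reached by each possible action.
    Assumes every action is equally likely and yields the same reward.
    """
    p = 1.0 / len(neighbour_values)  # 0.25 for four actions
    return sum(p * (reward + gamma * v) for v in neighbour_values)


# Values of the four neighbours of the blank state, taken from the solution.
v5 = random_policy_value([4.4, 1.9, 0.7, 3.0])
print(round(v5, 2))  # 2.25
```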
Exercise 2 The diagram below depicts an MDP model of a fierce battle.
You can move between two locations, L1 and L2, one of them being closer to the adversary. If you attack from the closer location, you have a better chance of succeeding (90%, versus only 70% from the farther location); however, you could also be detected (with 80% chance) and killed (while the chance of being detected from the farther location is 50%). You can only be detected if you stay in the same location. You need to come up with an action plan for the situation.
The arrows represent the possible actions: ‘move’ (M) is a deterministic action; ‘attack’ (A) and ‘stay’ (S) are stochastic. For the stochastic actions, the probabilities of transitioning to the next state are indicated on the arrow. All rewards are 0, except in the terminal states, where your success is represented by a reward of +50 while your adversary’s success is a reward of -50 for you. Employing a discount factor of 0.9, compute an optimal policy (action plan).
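Solution The computations of action-values for all states and actions are required. Denote by $Q(s,a)$ the value of taking action $a$ in state $s$. In value iteration, we start with initial estimates $Q(s,a) = 0$ for all non-terminal states. Then we update all action values according to the update rule: $Q(s,a) \leftarrow \sum_{s'} P(s' \mid s,a)\,[\,r(s,a,s') + \gamma V(s')\,]$, where $V(s') = \max_{a'} Q(s',a')$.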
Hence, in the first iteration of the algorithm we get: The values for the ‘move’ action stay the same (at 0): After this iteration, the values of the two states are and they correspond to the action of ‘attacking’ in both states.
The next iteration gives the following: The new V-values are (by computing max): These correspond to the ‘attack’ action in both states.
This process can continue until the values do not change much between successive iterations. From what we can see at this point, the best action plan seems to be attacking all the time. Can we say more without a full computer simulation?
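The procedure described above (start from zero action-values, back up every state-action pair, take the max to get state values, repeat until the values stop changing much) can be sketched in Python as follows. This is a sketch under stated assumptions, not the slide's own computation: the function name `q_value_iteration`, the `transitions` data layout and the tiny `toy` model are all hypothetical, and the toy model is deliberately not the battle MDP, whose transition diagram is not reproduced in this transcript.

```python
def q_value_iteration(transitions, gamma=0.9, tol=1e-6, max_iters=1000):
    """Synchronous value iteration on action-values.

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    States that do not appear as keys are treated as terminal (value 0).
    """
    # Initial estimates: Q(s, a) = 0 for every non-terminal state.
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(max_iters):
        # State values from the previous sweep: V(s) = max_a Q(s, a).
        v = {s: max(q[s].values()) for s in q}
        new_q = {
            s: {
                a: sum(p * (r + gamma * v.get(s2, 0.0)) for p, s2, r in outcomes)
                for a, outcomes in acts.items()
            }
            for s, acts in transitions.items()
        }
        # Stop once no action-value changed by more than tol.
        change = max(abs(new_q[s][a] - q[s][a]) for s in q for a in q[s])
        q = new_q
        if change < tol:
            break
    return q


# Hypothetical one-state example (NOT the battle model): 'go' ends the
# episode with reward 10, 'wait' stays in place with reward 0.
toy = {"s": {"go": [(1.0, "T", 10.0)], "wait": [(1.0, "s", 0.0)]}}
q = q_value_iteration(toy)
policy = {s: max(q[s], key=q[s].get) for s in q}  # greedy action per state
```

To apply it to the exercise, one would fill `transitions` with the probabilities and rewards read off the diagram for ‘move’, ‘attack’ and ‘stay’ in each location.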
Continuing (optional)… It is clear that to ‘Stay’ is suboptimal in both states. In the Close state, it is also clear that the best thing to do is to ‘Attack’ continuously (given that attacking carries no cost). In fact, we can compute the values in the limit analytically, by keeping an eye on how the update changes from iteration to iteration.
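One way such a limit can be obtained without simulation: if repeating the same action makes the update take the affine form v ← c + βv with 0 ≤ β < 1, the iterates form a geometric series that converges to c / (1 − β). The sketch below compares the two. The names `iterate_update` and `analytic_limit`, and the constants c and β, are hypothetical placeholders rather than numbers from the exercise, and whether the battle model's update really has this form depends on the diagram.

```python
def iterate_update(c, beta, n_iters, v0=0.0):
    """Apply v <- c + beta * v repeatedly, starting from v0."""
    v = v0
    for _ in range(n_iters):
        v = c + beta * v
    return v


def analytic_limit(c, beta):
    """Fixed point of v = c + beta * v, valid for 0 <= beta < 1."""
    if not 0 <= beta < 1:
        raise ValueError("beta must lie in [0, 1) for the iteration to converge")
    return c / (1.0 - beta)


# Hypothetical constants chosen only to illustrate the convergence.
approx = iterate_update(c=1.0, beta=0.5, n_iters=60)
exact = analytic_limit(c=1.0, beta=0.5)
```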
Now for the far state, the question is between ‘Attack’ or ‘Move’ to the closer orbit. Compute the values for both these actions (in the same way as before):
Hence it is worth moving to the closer orbit. The optimal policy for this problem setting (!) is to move closer and attack from there. Can you imagine a different policy making more sense for this problem? Can you imagine another setting (parameter design) which would lead to a different (more desirable) optimal policy? Designing the parameter setting for a situation according to the conditions is up to the human and not up to the machine… In this exercise all parameters were given, but in your potential future real applications they will not be.