Reinforcement Learning
Lehrstuhl für Informatik 2, Gabriella Kókai: Machine Learning
Literature
Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto. http://www.cs.ualberta.ca/~sutton/book/the-book.html
Context
➔ Introduction
The Learning Task
Q Learning
Nondeterministic Rewards and Actions
Summary
Introduction
Situation: A robot or agent has a set of sensors to observe the state of its environment and a set of actions it can perform to alter this state. The agent knows only the current state and the actions it can choose from.
Learning strategy: A reward function assigns a numerical value to each distinct action the agent may take from each distinct state. The task of the robot is to perform sequences of actions, observe their consequences, and learn a control policy that, from any initial state, chooses the actions that maximise the reward accumulated over time.
Problem areas: Learning to control a mobile robot, learning to optimise operations in a factory, learning to play board games, or teaching a robot to dock onto its battery charger whenever the battery level runs low.
Context
Introduction
➔ The Learning Task
  Classification of the Problem
  The Markov Decision Process (MDP)
  Goal of the Learning
  Example
Q Learning
Nondeterministic Rewards and Actions
Summary
Classification of the Problem
The agent interacts with its environment. The agent exists in an environment described by some set of possible states S and can perform any of a set of possible actions A. Each time it performs an action a_t in a state s_t, the agent receives a reward r_t that indicates the immediate value of this state-action transition. This produces a sequence of states s_t, actions a_t, and immediate rewards r_t.
The agent's task is to learn a control policy π : S → A that maximizes the expected sum of the immediate rewards and the future rewards, exponentially discounted by their delay.
Classification of the Problem (2)
Specific settings to consider:
- The actions have deterministic or nondeterministic outcomes.
- The agent does or does not have prior knowledge about the effects of its actions on the environment (the reward function r and the state-transition behaviour).
- Is the agent trained by an expert?
Differences to other function approximation tasks:
- Delayed rewards: The trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions => temporal credit assignment determines which of the actions in the sequence are to be credited with producing the eventual rewards.
- Exploration: Which experimentation strategy produces the most effective learning? Should the agent explore unknown states and actions, or exploit states and actions it has already learned to yield high reward?
- Partially observable state: In practical situations sensors provide only partial information.
- Life-long learning: The robot learns several tasks within the same environment using the same sensors => previously obtained experience can be reused.
The Markov Decision Process (MDP)
The process is deterministic. The agent can perceive a set of distinct states S in its environment and has a set of actions A it is allowed to perform.
At each discrete time step t, the agent senses the current state s_t, chooses a current action a_t and performs it. The environment responds by giving the agent a reward r_t = r(s_t, a_t) and by producing the succeeding state s_{t+1} = δ(s_t, a_t).
In an MDP the functions δ and r depend only on the current state and action, not on earlier ones. They are part of the environment and need not be known to the agent. S and A are finite.
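To make the definitions concrete, here is a minimal Python sketch of a deterministic MDP; the corridor-shaped state space, action names, and reward values are illustrative assumptions, not taken from the slides.

```python
# A toy deterministic MDP: a 1-D corridor of four states where only the
# transition into the rightmost, absorbing goal state is rewarded.

STATES = [0, 1, 2, 3]            # S; state 3 is the absorbing goal state
ACTIONS = ["left", "right"]      # A

def delta(s, a):
    """State-transition function delta(s, a) -> s' (deterministic)."""
    if s == 3:                                        # the goal state absorbs
        return s
    return max(s - 1, 0) if a == "left" else min(s + 1, 3)

def reward(s, a):
    """Reward function r(s, a): 100 for the transition into the goal, else 0."""
    return 100 if (s == 2 and a == "right") else 0
```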
Goal of the Learning
Goal: Learn a policy π : S → A for selecting the next action a_t based on the currently observed state s_t: π(s_t) = a_t.
Approach: Require the policy that produces the greatest possible cumulative reward. The value of a policy π from a state s_t is the discounted cumulative reward
  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i},  with 0 ≤ γ < 1.
Precisely, the agent's learning task is to learn a policy π that maximizes V^π(s) for all states s. Such a policy is called an optimal policy and is denoted π*:
  π* = argmax_π V^π(s), for all s.
Simplify the notation: V*(s) denotes V^{π*}(s), the maximum discounted cumulative reward that the agent can obtain starting from s.
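The discounted cumulative reward can be sketched directly from this definition; the policy, γ, and horizon below are illustrative, and `delta` and `reward` are assumed to be the toy functions from the MDP sketch above.

```python
# V^pi(s_t) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... accumulated along the
# trajectory that the policy induces (truncated at a finite horizon).

def discounted_return(policy, s, delta, reward, gamma=0.9, horizon=50):
    """Accumulate gamma-discounted rewards along the trajectory induced by `policy`."""
    total, factor = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += factor * reward(s, a)
        factor *= gamma
        s = delta(s, a)
    return total

# Always moving right from state 0 reaches the goal reward on the third step:
# V = 0 + 0.9*0 + 0.9^2*100, i.e. roughly 81.
print(discounted_return(lambda s: "right", 0, delta, reward))
```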
Example
(Grid-world figure, not reproduced: r(s,a) values, V* values, Q(s,a) values, and one optimal policy.)
Context
Introduction
The Learning Task
➔ Q Learning
  The Q Function
  An Algorithm for Learning Q
  An Illustrative Example
  Convergence
  Experimentation Strategies
  Updating Sequence
Nondeterministic Rewards and Actions
Summary
The Q Function
Problem: There is no training data in the form ⟨s, π(s)⟩; instead, the only training information available to the learner is the sequence of immediate rewards r(s_i, a_i).
The agent can learn a numerical evaluation function such as V*: the agent prefers state s_1 to state s_2 whenever V*(s_1) > V*(s_2). But how can the agent choose among actions?
Solution: The optimal action in a state s is the action a that maximizes the sum of the immediate reward r(s,a) and the value V* of the immediate successor state, discounted by γ:
  π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]
The Q Function (2)
Problem: Using this rule requires perfect knowledge of the immediate reward function r and the state-transition function δ.
Solution: Define Q(s,a) as the sum of the reward received immediately upon executing action a from state s, plus the value gained by following the optimal policy thereafter:
  Q(s,a) = r(s,a) + γ V*(δ(s,a)),   so that   π*(s) = argmax_a Q(s,a).
Advantage of using Q instead: the agent is able to select optimal actions even when it has no knowledge of the functions r and δ.
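A small sketch of this advantage: given a lookup table of Q̂ values (the entries below are hypothetical), the greedy policy consults neither r nor δ.

```python
# pi*(s) = argmax_a Q(s, a), computed from a table alone.

Q = {(0, "left"): 66.0, (0, "right"): 81.0}   # hypothetical learned entries for state 0

def greedy_action(Q, s, actions=("left", "right")):
    """Return the action with the highest Q value in state s."""
    return max(actions, key=lambda a: Q[(s, a)])

print(greedy_action(Q, 0))                    # 'right'
```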
An Algorithm for Learning Q
How can Q be learned? Since V*(s) = max_{a'} Q(s, a'), a simple transformation gives a recursive definition of Q (Watkins 1989):
  Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')
Let Q̂ refer to the learner's estimate, or hypothesis, of the actual Q function. It is represented by a large table with a separate entry for each state-action pair; the table can be initially filled with random values.
The agent acts as before and, after each step, updates the table entry Q̂(s,a) using the observed reward and the estimates for the successor state. Q-learning thereby propagates the estimates backward from the new state to the old one.
Episode: During each episode, the agent begins at some randomly chosen state and is allowed to execute actions until it reaches the absorbing goal state. When it does, the episode ends and the agent is transported to a new, randomly chosen initial state for the next episode.
An Algorithm for Learning Q (2)
Algorithm:
For each (s, a) pair, initialise the table entry Q̂(s,a) to zero.
Observe the current state s.
Do forever:
  Select an action a and execute it
  Receive the immediate reward r
  Observe the new state s'
  Update the table entry: Q̂(s,a) ← r + γ max_{a'} Q̂(s', a')
  s ← s'
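A minimal sketch of this loop, assuming the toy deterministic MDP (`STATES`, `ACTIONS`, `delta`, `reward`) from the earlier sketch, γ = 0.9, and purely random action selection.

```python
import random

GAMMA = 0.9
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialise every entry to zero

for episode in range(1000):
    s = random.choice([0, 1, 2])          # random non-goal start state
    while s != 3:                         # run until the absorbing goal state
        a = random.choice(ACTIONS)        # select an action and execute it
        r = reward(s, a)                  # receive the immediate reward
        s_next = delta(s, a)              # observe the new state s'
        # Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print(Q[(1, "right")])                    # converges to 0.9 * 100 = 90.0
```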
An Algorithm for Learning Q (3)
Two general properties of this Q-learning algorithm hold for any deterministic MDP in which the rewards are non-negative, assuming the Q̂ values are initialized to zero:
- Q̂ values never decrease during training: Q̂_{n+1}(s,a) ≥ Q̂_n(s,a) for all s, a, n.
- Every Q̂ will remain in the interval between zero and its true Q value: 0 ≤ Q̂_n(s,a) ≤ Q(s,a).
An Illustrative Example
The diagram on the left (not reproduced) shows the initial state s_1 of the robot and several relevant Q̂ values in its initial hypothesis.
When the robot executes the action a_right, it receives immediate reward r = 0 and transitions to state s_2. It then updates its estimate Q̂(s_1, a_right) based on its Q̂ estimates for the new state:
  Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
Here the new estimate is computed from the values shown in the figure.
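A worked version of this single update with hypothetical numbers (the actual values are in the figure, which is not reproduced here): the robot moves right, receives r = 0, γ is taken as 0.9, and the successor state's current estimates are assumed to be 63, 81 and 100.

```python
gamma = 0.9
r = 0
successor_estimates = [63, 81, 100]      # assumed current Q-hat values for the actions in s_2
q_new = r + gamma * max(successor_estimates)
print(q_new)                             # 90.0, the new estimate Q-hat(s_1, a_right)
```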
Convergence
Will the algorithm converge towards Q? Yes, but under some conditions:
- deterministic MDP,
- the immediate reward values are bounded: |r(s,a)| < c for some constant c,
- the agent selects actions in such a fashion that it visits every possible state-action pair infinitely often.
Theorem: Consider a Q-learning agent in a deterministic MDP with bounded rewards. The agent uses the training rule Q̂(s,a) ← r + γ max_{a'} Q̂(s',a'), initializes its table Q̂ to arbitrary finite values, and uses a discount factor γ with 0 ≤ γ < 1. Let Q̂_n(s,a) denote the agent's hypothesis after the nth update. If each state-action pair is visited infinitely often, then Q̂_n(s,a) converges to Q(s,a) for all s, a as n → ∞.
Experimentation Strategies
Question: How are actions chosen during training?
One option: the agent selects, in state s, the action a that maximizes Q̂(s,a).
Disadvantage: the agent will prefer actions that are found to have high Q̂ values early in training and will fail to explore other actions that might have even higher values.
Alternative: use a probabilistic approach to selecting actions. The probability of selecting action a_i, given that the agent is in state s, is
  P(a_i | s) = k^{Q̂(s,a_i)} / Σ_j k^{Q̂(s,a_j)},
where k > 0 is a constant that determines how strongly the selection favours actions with high Q̂ values.
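A sketch of this probabilistic selection rule; the Q̂ values and the constant k below are illustrative.

```python
import random

# P(a_i | s) = k^Q(s, a_i) / sum_j k^Q(s, a_j); note that k ** Q can overflow
# for very large Q values.

def select_action(Q, s, actions, k=2.0):
    weights = [k ** Q[(s, a)] for a in actions]          # unnormalised probabilities
    return random.choices(actions, weights=weights)[0]   # draw one action

Q = {(0, "left"): 1.0, (0, "right"): 2.0}
print(select_action(Q, 0, ["left", "right"]))            # 'right' about 2/3 of the time
```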
Updating Sequence
Possibilities to improve the convergence:
If all Q̂ values are initialized to 0, then after the first full episode only one entry in the agent's table will have changed: the entry corresponding to the final transition into the goal state; the nonzero values then propagate backward over later episodes.
First strategy: train on the same state-action transitions but in reverse chronological order for each episode, i.e. apply the same update rule for each transition, but perform the updates in reverse order (see the sketch below). This yields convergence in fewer iterations at the cost of higher memory usage, since the episode's transitions must be stored.
A second strategy stores past state-action transitions and retrains on them periodically.
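A sketch of the reverse-order strategy, assuming a Q̂ table, γ, and action list as in the earlier sketches.

```python
# Store the episode's (s, a, r, s') transitions, then apply the same training
# rule backwards so the goal reward propagates through the whole episode in
# one pass over the stored transitions.

def replay_backwards(Q, episode, gamma, actions):
    """episode: list of (s, a, r, s_next) tuples in the order they occurred."""
    for s, a, r, s_next in reversed(episode):
        # Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
```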
Context
Introduction
The Learning Task
Q Learning
➔ Nondeterministic Rewards and Actions
Summary
Nondeterministic Rewards and Actions
In the nondeterministic case, δ(s,a) and r(s,a) can be viewed as first producing a probability distribution over outcomes based on s and a, and then drawing an outcome at random according to this distribution.
The change in the Q algorithm: first, redefine V^π as the expected value of the discounted cumulative reward.
Generalization of Q:  Q(s,a) = E[ r(s,a) + γ V*(δ(s,a)) ]
Recursive Q function:  Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
Modify the training rule so that it takes a decaying weighted average of the current Q̂ value and the revised estimate:
  Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ],  with α_n = 1 / (1 + visits_n(s,a)),
where visits_n(s,a) is the total number of visits to this state-action pair up to and including the nth iteration.
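A sketch of this decaying-average training rule, assuming dictionary-based Q̂ and visit-count tables as in the earlier sketches.

```python
# Q(s,a) <- (1 - alpha_n) * Q(s,a) + alpha_n * (r + gamma * max_a' Q(s',a'))
# with alpha_n = 1 / (1 + visits_n(s,a)).

def nondeterministic_update(Q, visits, s, a, r, s_next, gamma, actions):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])                          # decaying learning rate
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)   # revised estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target          # weighted average
```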
Context
Introduction
The Learning Task
Q Learning
Nondeterministic Rewards and Actions
➔ Summary
Summary
Reinforcement learning addresses the problem of learning control strategies for autonomous agents.
Training information is available in the form of a real-valued reward signal given for each state-action transition. The goal of the agent is to learn an action policy that maximizes the total reward it receives from any starting state.
Q-learning applies to a restricted class of problems, namely Markov decision processes. The Q function is defined as the maximum expected, discounted, cumulative reward the agent can achieve by applying action a in state s. Q̂ is represented by a lookup table with a distinct entry for each (s, a) pair.
Convergence can be shown in both deterministic and nondeterministic MDPs.