An Introduction to Reinforcement Learning


1 An Introduction to Reinforcement Learning
Presenter: Verena Rieser, Course: Classification and Clustering, WS 2005

2 Contents
Part 1: The main ideas of RL
Part 2: The general framework of RL
Part 3: Automatic Optimization of Dialogue Management (Application)

3 Reinforcement Learning
RL sits at the intersection of several fields: Artificial Intelligence, Control Theory and Operations Research, Psychology, Neuroscience, and Artificial Neural Networks. Control theory and operations research contribute the algorithms and applications; psychology contributes the theoretic model of how animals and humans learn.

4 Part 1: The Idea of Reinforcement Learning
Learning from interaction with the environment to achieve some goal.
Example 1: A baby playing. No teacher; a sensorimotor connection to the environment. It learns cause and effect, the consequences of actions, and how to achieve goals.
Example 2: Learning to hold a conversation, etc. We find out the effects of our actions only later.
-- Learning from interaction with the environment; other ML methods learn from examples.
-- Learning to achieve some goal: only the final outcome is specified; the "optimal" steps in between must be learnt. Other ML methods satisfy an immediate purpose.
-- The idea is a learning system that wants something and adapts its behaviour to maximise a special signal from the environment (a "hedonistic" learning system); RL "agents" are a human metaphor.
-- First forms of human learning: an infant plays, looks around, etc.; it has no teacher but gets information about the causes and effects of its actions from the environment.
-- RL is learning what to do: how to map situations to actions. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
-- Second example: making an utterance in a conversation; we may find out later that the formulation caused effects we did not intend, e.g. insulting somebody.

5 Supervised Learning
Training info = desired (target) outputs. The supervised learning system maps inputs to outputs; Error = (target output – actual output).
SL is learning from examples provided by a knowledgeable external supervisor, e.g. statistical pattern recognition or neural networks. There is a clear definition of what is wrong and what is right -> "instructive feedback": the teacher tells the child what to do.
-- Not adequate for learning from interaction: in interactive problems it is often impractical to obtain examples of desired behaviour that are both correct and representative of all situations in which the agent has to act - there will always be new situations which require new strategies to develop! We need DYNAMIC behaviour matching different needs in the environment.

6 Reinforcement Learning
Training info = evaluations ("rewards" / "penalties"). The RL system maps inputs to outputs ("actions").
Inputs = a feature representation of the environment. Outputs = actions selected from a predefined action space. Rewards = the environment evaluates the result of an action; the environment is "responding", i.e. actions have an impact -> evaluative feedback! The environment tells the child whether the action was good (trial and error).
Example: RL: the child touches the oven, gets burnt, and learns not to touch the oven. SL: the teacher tells the child that touching the oven is a "no-go".
Chess example: Features = positions of the chess pieces; Outputs = the next move; Rewards = how the move contributes to winning or losing the game.
-- RL is a type of unsupervised (US) learning in that there is no "right" answer; but in contrast to US we want to achieve a long-distance goal. US: make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should be no smaller than n). RL: plan your decisions to achieve some goal in the future; for example, a *bad/costly* action right now might bring good results later (e.g. chess). Objective: get as much reward as possible.

7 RL - How does it work?
Learning a mapping from situations to actions in order to maximize a scalar reward/reinforcement signal. How? Try out actions to learn which ones produce the highest reward - trial-and-error search. Actions affect the immediate reward plus all subsequent rewards - delayed effects, delayed rewards. Two main features: 1. trial-and-error search, 2. delayed rewards.
Why do we need it? The agent has to try out actions because it is not told what to do. As the agent will be confronted with a dynamic environment, there is no single "right" way to react. E.g. in most cases it is good to help old people cross the street, although you should not do it if the person is accompanied by a dangerous dog. If you are very stupid, you need to get bitten by the dog first! Delayed effects: all subsequent rewards are also important; maybe you got bitten, but one week later you got a prize for being the bravest citizen!

8 Exploration/Exploitation Trade-off
High rewards come from trying previously well-rewarded actions - EXPLOITATION (= greedy). BUT: which actions are best? We must also try ones not tried before - EXPLORATION (= e). We must do both! The exploitation/exploration trade-off also depends on the life-time of the agent.
Exploitation: the agent was confronted with a (similar) situation before and performed an action which gave it a good reward. Exploration: can I do any better than that? Is there an action out there which is even better, i.e. which will give me even higher rewards? This is more risky because the new action can also be worse!
Stochastic tasks: success is not defined by two numbers, but by a probability of success for particular types of tasks. That means that for stochastic tasks, task failure does not necessarily mean that the other solution would have been right!
Why life-time? See the next slide: the greedy strategy is better at the beginning, but then levels off because it gets stuck performing sub-optimal actions; the non-greedy strategy continues to explore and is more likely to find an "optimal" strategy.
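To make the trade-off concrete, here is a minimal epsilon-greedy sketch on a toy multi-armed bandit (the arm payoffs, the parameter values, and the incremental averaging are illustrative assumptions, not taken from the slides):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit

# Toy bandit: each arm pays a noisy reward around a hidden mean (invented values).
true_means = [0.2, 0.5, 0.8]
q = [0.0] * len(true_means)      # estimated value per arm
counts = [0] * len(true_means)   # pulls per arm

for step in range(1000):
    a = epsilon_greedy(q, epsilon=0.1)
    reward = random.gauss(true_means[a], 1.0)
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]   # incremental sample average

print(q)  # estimates should approach the hidden means
```

Setting epsilon to 0 recovers the purely greedy behaviour discussed above; a small positive epsilon keeps some exploration going over the whole life-time of the agent.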

9 e-Greedy Methods on the 10-Armed Testbed [Sutton and Barto, 2002]
Greedy = simply supervised learning: do whatever you were told to do! Exploration = lose some feathers at the beginning, then you will refine your strategy. Example: the risky chess player has to gain experience by trying new actions; later on he has more experience of what good moves are and a richer repertoire than the cautious player. When confronted with new situations the risky player is "creative" as well as "more experienced". At first the greedy action performs better, then exploration will find better ways to do it!

10 Part 2: The Framework of RL
Temporally situated; continual learning and planning; the objective is to affect the environment; the environment is stochastic and uncertain.
Nice theory -- but how does it work? Picture: the interaction between an active decision-making agent and its environment (the agent takes actions; the environment returns a state and a reward).
-- temporally situated: the environment is dynamic and changes over time
-- continual learning: as said earlier, the agent enriches its repertoire by exploring new actions
-- planning: a way of deciding on a course of actions by considering possible future situations before they are experienced ("foresight"); this needs a model of the environment to reason with
-- the agent's actions will affect the future state of the environment, i.e. the options available to the agent at later times
-- the environment is stochastic (not 0/1) and uncertain: there will always be new and unexpected events!
Summary: agent = sensation (state + reward), action (affecting the environment), goal -> this triple is all the designer specifies!
-- What's wrong with this model? The environment is not only outside! The agent has to have a model of it to reason about future events; this model will be learnt from data (SL).
-- What's most important? INTERACTION: the agent learns to CONTROL its environment; it takes actions to make its environment respond in a certain way!
-- Remark: interactions are what make RL different from SL. The main problem facing a supervised learning system is to construct a mapping from situations to actions that mimics the correct actions specified by the environment and that generalizes correctly to new situations. A supervised learning system cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives. Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment. Evaluation vs. instruction!
-- Next: how does it work?

11 Elements of RL
Policy: what to do. Reward: what is good. Value: what is good because it predicts reward. Model: what follows what. (The circles in the diagram mean "is embedded in" / "is influenced by" / "is defined with respect to".)
-- Policy: defines the agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
-- Reward function: defines the goal in an RL problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
-- Value function: whereas the reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
-- Model of the environment: mimics the behaviour of the environment, e.g. given a state and an action, the model predicts the resultant next state and next reward.
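To make these four elements concrete, here is a minimal sketch of how they could be represented for a toy two-state problem (the state names, action names, and numbers are invented for illustration and are not part of the slides):

```python
# Toy problem with two states and two actions, purely illustrative.
states = ["s1", "s2"]
actions = ["stay", "move"]

# Policy: state -> action (here deterministic; it could also be a distribution).
policy = {"s1": "move", "s2": "stay"}

# Reward function: (state, action) -> immediate reward ("what is good").
reward = {("s1", "stay"): 0.0, ("s1", "move"): 0.0,
          ("s2", "stay"): 1.0, ("s2", "move"): 1.0}

# Value function: state -> expected long-term return ("good because it predicts reward").
value = {"s1": 0.0, "s2": 0.0}

# Model of the environment: (state, action) -> next state ("what follows what").
model = {("s1", "stay"): "s1", ("s1", "move"): "s2",
         ("s2", "stay"): "s2", ("s2", "move"): "s1"}
```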

12 General RL Algorithm
Initialise the learner's internal state. Do forever (!?): observe the current state s; choose action a using some evaluation function; execute action a; let r be the immediate reward and s' the new state; update the internal state based on (s, a, r, s').
"Do forever": in real-world applications an "optimal" policy could only be learnt if we never stop learning, i.e. there is no optimal policy! In practice we stop learning when the training gives us no more possibilities to learn (i.e. the simulated environment model limits learning).
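As a hedged sketch of how this loop can look in code, the following tabular Q-learning agent (one common instantiation of the update step; the corridor environment and all parameter values are assumptions, not from the slides) follows exactly the observe / choose / execute / update cycle above:

```python
import random
from collections import defaultdict

# A toy 1-D corridor: states 0..4, reward 1.0 for reaching state 4 (invented environment).
def step(state, action):                 # action: -1 = left, +1 = right
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

Q = defaultdict(float)                   # Q[(state, action)] -> estimated return
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate
actions = [-1, +1]

for episode in range(500):
    s = 0
    done = False
    while not done:
        # Choose action a using an evaluation function (epsilon-greedy on Q).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        # Execute action a; observe reward r and new state s'.
        s_next, r, done = step(s, a)
        # Update the internal state based on (s, a, r, s') -- the Q-learning rule.
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print({k: round(v, 2) for k, v in sorted(Q.items())})
```

Note that the "do forever" is replaced here by a fixed number of episodes, mirroring the remark above that in practice we stop once the training setup offers nothing new to learn.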

13 To solve the problem mathematically:
Formulate it as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP). Maximize the state-value and action-value functions using the Bellman optimality equation. Use approximations to solve the Bellman equation, such as dynamic programming, Monte Carlo methods, and temporal-difference learning.
Markov property (see the extra slide if nobody knows): this is sometimes also referred to as an "independence of path" property because all that matters is in the current state signal; its meaning is independent of the "path", or history, of signals that have led up to it. MDP -> all we need from the past are the transition probabilities P and expected rewards R.
Intuitively, the Bellman optimality equation estimates "how good" it is for an agent to be in a certain state. Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, but it rests on at least three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; (2) we have enough computational resources to complete the computation of the solution; and (3) the Markov property holds. -> approximations (just to list some names!). For example, although the first and third assumptions present no problem for the game of backgammon, the second is a major impediment: since the game has about 10^20 states, it would take thousands of years on today's fastest computers to solve the Bellman equation for V*, and the same is true for finding Q*. In reinforcement learning one typically has to settle for approximate solutions.

14 The Bellman Equation
The Bellman optimality equation estimates "how good" it is to be in a state s.
V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
V^*(s) = \max_a Q^{\pi^*}(s,a)   [figure (a)]
Q^*(s,a) = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') ]   [figure (b)]
Following a policy π: \sum_a \pi(s,a) averages over all possible actions; \sum_{s'} P^a_{ss'} weights each possible next state by its probability of occurring; [ R^a_{ss'} + \gamma V^\pi(s') ] captures that for each pair (s,a) the environment can respond with several next states s', each together with a reward R.
Q -- the action-value function: the value of taking action a in state s under policy π. V -- the state-value function: the expected return when starting in s and following π.
"What actions are available?" "How good are those actions?"
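A short value-iteration sketch, which repeatedly applies the Bellman optimality backup above until the state values stop changing; the three-state MDP, the discount of 0.9, and the convergence threshold are all invented for illustration:

```python
# Value iteration on a tiny, made-up MDP.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"a": [(1.0, 1, 0.0)], "b": [(1.0, 0, 0.1)]},
    1: {"a": [(0.8, 2, 1.0), (0.2, 0, 0.0)], "b": [(1.0, 1, 0.0)]},
    2: {"a": [(1.0, 2, 0.0)], "b": [(1.0, 2, 0.0)]},   # absorbing state
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(1000):
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) <- max_a sum_{s'} P [ R + gamma * V(s') ]
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-6:      # stop once the values have converged
        break

print({s: round(v, 3) for s, v in V.items()})
```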

15 Summary: Key Features of RL
The learner is not told which actions to take; trial-and-error search; the possibility of delayed reward (sacrifice short-term gains for greater long-term gains); the need to explore and exploit; RL considers the whole problem of a goal-directed agent interacting with an uncertain environment.
-- Trade-off between exploration and exploitation: to obtain a lot of reward, an RL agent must prefer actions that it has tried in the past and found to be effective in producing reward (exploit). But to discover such actions, it has to try actions that it has not selected before (explore). The agent has to exploit what it already knows, but it also has to explore in order to make better action selections in the future. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades. For now just note that the entire issue of balancing exploration and exploitation does not even arise in SL.
-- Another key feature: RL considers the whole problem of a goal-directed agent interacting with an uncertain environment. Many approaches consider only subproblems without addressing how they fit into the larger picture. SL: how would such an ability finally be useful? Planning theories: planning with general goals, but without addressing the interplay between planning and real-time decision making, or the question of how an environmental model is acquired. RL can involve SL, but does so for specific reasons. AI & engineering: extended ideas from optimal control theory and stochastic approximation.

16 Interactive Exercise:
Help me to annotate the example "a dog catching a stick" with concepts from RL. Explain: how would an artificial dog learn to catch the stick using RL?

17 Part 3: Application for CoLi
Diane J. Litman, Michael S. Kearns, Satinder Singh, and Marilyn A. Walker: Automatic Optimization of Dialogue Management. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, 2000.

18 Dialogue Management
Motivation: the agent wants to achieve some goal; it makes non-trivial choices based on its internal state; usability should be guaranteed by iterative prototyping; dialogue management is costly! Why not "simply" learn the optimal choices?
Formulate dialogue as an MDP: represent the environment (= states), define a set of possible dialogue strategies (= actions), evaluate actions (= reward).

19 The NJFun System
Represent a dialogue strategy as a mapping from state set S to a set of dialogue acts. Deploy an initial training system which generates exploratory training data w.r.t. S. Construct an MDP model from the training data. Use value iteration to learn the optimal strategy. Evaluate the system against a hand-coded strategy.
NJFun: a telephone-based tourist information system for New Jersey; it accesses a database via three values: activity type, location, and time of day.

20 NJFun: Action Space
Initiative -- User: the system asks open questions with an unrestricted grammar for recognition. System: the system uses directed prompts with restricted grammars. Mixed: the system uses directed prompts with non-restricted grammars.
Confirmation -- Explicit: the system asks the user to verify an attribute. No confirmation: the system does not generate a confirmation prompt.

21 NJFun: State Space
{Greet}: whether the system has greeted the user or not (0,1)
{Attr}: which attribute the system is trying to obtain or verify (1=activity, 2=location, 3=time, 4=done)
{Conf}: ASR confidence after obtaining a value for an attribute (0,1,2,3,4)
{Val}: whether the system has obtained a value for the attribute (0,1)
{Times}: number of times the system has asked for the attribute
{Gram}: type of grammar most recently used to obtain the attribute
{Hist}: "trouble-in-past"
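For concreteness, here is a minimal sketch of how such a state vector might be encoded; the field names follow the slide, but the dataclass itself, the grammar encoding, and the example values are illustrative assumptions rather than part of NJFun:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NJFunState:
    greet: int   # has the system greeted the user? (0/1)
    attr: int    # attribute in focus: 1=activity, 2=location, 3=time, 4=done
    conf: int    # ASR confidence bucket (0-4)
    val: int     # value obtained for the attribute? (0/1)
    times: int   # how often the system has asked for this attribute
    gram: int    # grammar most recently used (e.g. 0=restricted, 1=unrestricted)
    hist: int    # "trouble-in-past" flag (0/1)

# Example: system has greeted, is working on the activity with medium confidence.
s = NJFunState(greet=1, attr=1, conf=2, val=1, times=1, gram=0, hist=0)
```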

22 Example
S1: Welcome to NJFun. How may I help you? s[greet=1] - a[user initiative]
U1: I'd like to find *um* wine tasting in Lambertville. s[conf=2, val=1]
S2a: Did you say you are interested in wine tasting in Lambertville? s'[attr=(1,2), times=1] - a[explicit confirmation]
S2b: At what time? s'[attr=3] - a[no confirmation]
(On the slide, states are blue; actions are red for initiative and green for the confirmation strategy.)
How to learn the state-action mapping? 0. Observe the state. 1. Make a decision / plan future events: 1.1 Have I seen a similar state before? What was the value function for that state, and which action did I take? 1.2 Do I want to exploit (this state-action pair already has a high reward) or explore (no high reward for that state so far - find a better strategy!)? 2. Take the action. 3. Observe the impact on the environment: log the reward.

23 NJFun: Optimizing the strategy
NJFun's initial strategy, "Exploratory for Initiative and Confirmation" (EIC), chooses randomly between the possible actions in each state. Data: 54 subjects for training, 21 for testing. Binary reward function: 1 if the system queries the DB with all specified attributes, 0 otherwise. Results: a large and significant improvement for expert users and a non-significant degradation for novices.

24 Discussion
Questions: 1. How general are the features? What about dialogues in other domains (e.g. information seeking vs. tutorial dialogue)? 2. What about the algorithm? Why can't we use supervised learning? 3. Do we really save costs? Stochastic user models for training; "boot-strapping" an initial system from training data.
Answers: 1. {Attr} is domain specific; the rest is OK. {Times} and the reward might vary across domains: task completion time might be less important for tutorial dialogues; clarifications should not be punished, as they are essential for a student to learn!
2. SL generalizes known states correctly to new situations and constructs a mapping from situations to actions that mimics the correct actions specified by the environment. Instructive learning: there is one correct action -> no trial-and-error search! It cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives. The goal of dialogue modelling, however, is to make the environment behave in a certain way (maximise user satisfaction) -> SL offers no interaction! No future planning: no long-term reward; it cannot sacrifice immediate actions for a long-term goal; no model of future events -> no delayed reward.
3. Should a user really sit there and interact with a poor dialogue system until it gets better? -> build a user model. How to build an initial system? NJFun: a rule-based system in which only a few choices are learnt by interaction. "Boot-strap" an initial policy: collect data (WOZ trials), annotate the data with state-action sets -> we can learn the transition probabilities from states to actions (SL) -> then use RL to optimize the policy in a dynamic environment.

25 Additional Slides

26 Simple Learning Taxonomy
Supervised Learning: a "teacher" provides the required response to inputs; the desired behaviour is known. Unsupervised Learning: the learner looks for patterns in the input; there is no "right" answer. Reinforcement Learning: the learner is not told which actions to take, but gets reward/punishment from the environment and learns which action to pick the next time.
SL: the two main problems an SL system faces are to construct a mapping from situations to actions that mimics the correct actions specified by the environment, and to generalize correctly to new situations.
US: make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should be no smaller than n). RL is a kind of US learning.
RL: the environment evaluates the action in context.

27 RL vs. SL
The main problem facing an SL system is to construct a mapping from situations to actions that mimics the correct actions specified by the environment and that generalizes correctly to new situations. An SL system cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives. Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment.

28 RL vs. US
US: make some decision *now* which satisfies the immediate constraints (e.g. clustering: clusters should be no smaller than n). RL: plan your decisions to achieve some goal in the future; delayed rewards.

29 A More Formal Definition of the RL Framework...
Policy: \pi(s,a) = P\{a_t = a \mid s_t = s\}. Given that the situation at time t is s, the policy gives the probability that the agent's action will be a.
Reward function: defines the goal and the immediate good or bad experience.
Value function: an estimate of the total future long-term reward. (We want actions that lead to states of high value, not necessarily high immediate reward!)
Model of the environment: maps states and actions onto states, S × A → S. If in state s1 we take action a2, the model predicts s2 (and sometimes the reward r2).
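A hedged sketch of what a stochastic tabular policy of this form can look like in code (the states, actions, and probabilities are made up for illustration):

```python
import random

# pi[s][a] = P{a_t = a | s_t = s}; the probabilities per state must sum to 1.
pi = {
    "s1": {"left": 0.3, "right": 0.7},
    "s2": {"left": 0.9, "right": 0.1},
}

def sample_action(policy, state):
    """Draw an action from the policy's distribution for the given state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s1"))  # "right" about 70% of the time
```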

30 Markov Property
A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. For example: the current position and velocity of a cannonball is all that matters for its future flight; it doesn't matter how that position and velocity came about. This is sometimes also referred to as an "independence of path" property because all that matters is in the current state signal; its meaning is independent of the "path", or history, of signals that have led up to it.

31 MDPs vs. POMDPs
Major difference: how they represent uncertainty. In MDPs the state space is in general represented as vectors describing information slots, where each slot is associated with a discrete value. POMDPs explicitly model uncertainty by maintaining a belief state - a distribution over MDP states - in the absence of knowing the state exactly.
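To illustrate the belief-state idea, here is a minimal Bayes-filter update sketch; the two-state transition and observation tables are invented for illustration and are not part of any particular POMDP dialogue system:

```python
# Belief update for a toy 2-state POMDP: b'(s') ~ O(o | s', a) * sum_s T(s' | s, a) * b(s)
states = ["noisy", "clear"]
T = {("noisy", "ask"): {"noisy": 0.7, "clear": 0.3},        # T[(s, a)][s'] = P(s' | s, a)
     ("clear", "ask"): {"noisy": 0.2, "clear": 0.8}}
O = {("noisy", "ask"): {"low_conf": 0.8, "high_conf": 0.2}, # O[(s', a)][o] = P(o | s', a)
     ("clear", "ask"): {"low_conf": 0.3, "high_conf": 0.7}}

def update_belief(belief, action, observation):
    """One Bayes-filter step: predict with T, weight by O, renormalise."""
    new_belief = {}
    for s_next in states:
        predicted = sum(T[(s, action)][s_next] * belief[s] for s in states)
        new_belief[s_next] = O[(s_next, action)][observation] * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}

b = {"noisy": 0.5, "clear": 0.5}
print(update_belief(b, "ask", "low_conf"))  # belief shifts toward the "noisy" state
```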

32 Some Notable RL Applications
TD-Gammon (Tesauro): the world's best backgammon program. Elevator control (Crites & Barto): a high-performance down-peak elevator controller. Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls. In general, RL is applicable to all (?) optimization tasks which are goal-oriented.

