Dialogue as a partially observable Markov decision process (POMDP)

[Figure: POMDP graphical model with hidden states s_t, s_t+1, system action a_t, observations o_t, o_t+1 and reward r_t]

Let's remind ourselves of the POMDP model for dialogue:
- The state depends on the previous state and the action that the system took.
- The state is unobservable and must be inferred from a noisy observation.
- We keep track of the probability distribution over all states at every time step.
- Action selection (the policy) is based on this distribution over all states at every time step t, the belief state b(s_t).

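Concretely, the belief is updated every turn from the previous belief, the system's action and the new observation. A minimal sketch of this Bayesian filter, with array shapes and names as assumptions (not from the slides):

```python
import numpy as np

def belief_update(b: np.ndarray, T: np.ndarray, O: np.ndarray,
                  action: int, obs: int) -> np.ndarray:
    """One step of the POMDP belief update:
        b'(s') ∝ P(o | s') * sum_s P(s' | s, a) * b(s)

    Assumed shapes: b[s] is the current belief, T[a, s, s2] = P(s2 | s, a)
    is the transition model, O[s2, o] = P(o | s2) is the observation model.
    """
    predicted = T[action].T @ b        # sum_s P(s'|s,a) b(s)
    updated = O[:, obs] * predicted    # weight by observation likelihood
    return updated / updated.sum()     # normalise to a distribution
```
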
Dialogue policy optimisation

[Figure: agent-environment loop with state, action and reward]

- The dialogue is in a state, the belief state b.
- The dialogue manager takes actions a, as defined by a policy π.
- It gets rewards r.

Optimal policy

The optimal policy is the one which generates the highest expected reward over time.

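Written out in standard RL notation (this formulation is assumed, not spelled out on the slide), with discount factor γ:

```latex
\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]
```
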
Reinforcement learning: the idea

- Take actions randomly.
- Compute the average reward.
- Change the policy to take the actions that generated high reward.

Challenges in dialogue policy optimisation

- How to define the reward?
- The belief state is large and continuous.
- Reinforcement learning takes many iterations.

Problem 1: the reward function

Solution: the reward is a measure of how good the dialogue is.
- It should incorporate a measure of success: whether the system gave all the information that the user wanted.
- It should favour shorter dialogues: penalise the system for every dialogue turn.
- It can incorporate further elements.

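A minimal sketch of such a reward (the +20 success bonus and per-turn penalty of 1 are illustrative values, not taken from the slides):

```python
def dialogue_reward(success: bool, num_turns: int,
                    success_bonus: float = 20.0,
                    turn_penalty: float = 1.0) -> float:
    """Reward = success bonus minus a penalty for every dialogue turn.

    Favours dialogues where the user got all the information they wanted,
    and shorter dialogues over longer ones.
    """
    return (success_bonus if success else 0.0) - turn_penalty * num_turns
```
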
Problem 2: the belief state is large and continuous

Solution: compress the belief state into a smaller summary space.

[Figure: a summary function maps the original belief space into a summary space; a summary policy picks summary actions, and a master function maps them back to actions in the original space]

[1] J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."

Summary space

- The summary space contains the features of the belief space that are important for learning.
- This mapping is hand-coded!
- It can contain the probabilities of concepts, their values and so on.
- Continuous variables can be discretised into a grid.

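A sketch of such a hand-coded summary function; the feature choice (top probability per concept) and the grid step are illustrative assumptions:

```python
def summary_function(belief: dict, grid_step: float = 0.2) -> tuple:
    """Map a full belief state onto a small, discretised summary point.

    `belief` maps each concept (slot) to a distribution over its values,
    e.g. {"price": {"cheap": 0.7, "expensive": 0.3}}. We keep only the
    probability of each concept's most likely value, then snap it onto
    a grid with spacing `grid_step` so the summary space is finite.
    """
    features = []
    for slot in sorted(belief):
        top_prob = max(belief[slot].values())
        features.append(round(top_prob / grid_step) * grid_step)
    return tuple(features)  # hashable grid point, usable as a dictionary key
```

Returning a tuple makes the grid point usable directly as a key in a tabular Q-function.
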
Q-function

The Q-function measures the expected discounted reward that can be obtained at a grid point when an action is taken:

    Q^π(b, a) = E_π [ Σ_{k=0}^∞ γ^k r_{t+k} | b_t = b, a_t = a ]

where b is the starting grid point, a is the starting action, γ ∈ (0, 1] is the discount factor, r is the reward, and the expectation is taken with respect to the policy π.

- It takes into account the reward of the future actions.
- Optimising the Q-function is equivalent to optimising the policy.

Online learning

- Reinforcement learning in direct interaction with the environment.
- Actions are taken ε-greedily:
  - Exploitation: choose the action according to the best estimate of the Q-function.
  - Exploration: choose an action randomly (with probability ε).

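A minimal ε-greedy selection sketch (the Q-table layout is an assumption):

```python
import random

def epsilon_greedy(Q: dict, grid_point: tuple, actions: list,
                   epsilon: float = 0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the best Q estimate (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((grid_point, a), 0.0))
```
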
Monte Carlo control algorithm

- Initialise Q arbitrarily.
- Repeat:
  - For every turn in a dialogue:
    - Update the belief state and map it to the summary space.
    - Record the grid point and the reward.
  - At the end of the dialogue, for each grid point sum up all the rewards that followed.
  - Update the Q-function and the policy.

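A sketch of this loop (every-visit Monte Carlo with undiscounted returns; the `run_episode` environment hook and all names are hypothetical):

```python
import random
from collections import defaultdict

def mc_control(run_episode, actions, num_dialogues=10000, epsilon=0.1):
    """Every-visit Monte Carlo control over a discretised summary space.

    `run_episode(policy)` is a hypothetical environment hook: it plays one
    dialogue with the given policy and returns a list of
    (grid_point, action, reward) triples, one per turn.
    """
    Q = defaultdict(float)        # (grid_point, action) -> value estimate
    returns = defaultdict(list)   # (grid_point, action) -> observed returns

    def policy(grid_point):
        if random.random() < epsilon:                          # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(grid_point, a)])  # exploit

    for _ in range(num_dialogues):
        episode = run_episode(policy)
        # For each visited grid point, sum up all the rewards that followed.
        for i, (g, a, _) in enumerate(episode):
            ret = sum(r for _, _, r in episode[i:])
            returns[(g, a)].append(ret)
            Q[(g, a)] = sum(returns[(g, a)]) / len(returns[(g, a)])
    return Q
```
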
How many iterations?

- Each grid point needs to be visited sufficiently often to obtain a good estimate.
- If the grid cells are large, the estimates are not precise enough.
- If there are lots of grid points, policy optimisation is slow.
- In practice, 10,000s of dialogues are needed!

Learning in interaction with a simulated user

[Figure: the dialogue manager (speech understanding, dialogue state, dialogue policy, speech generation) in a loop with a simulated user; the expected reward is used to optimise the policy]

Simulated user

- Various models exist.
- A simulated user should exhibit a variety of behaviour.
- It should imitate real users.

Agenda-based user simulator

Consists of an agenda and a goal.
- Goal: the concepts that describe the entity that the user wants.
  - Example: restaurant, cheap, Chinese.
- Agenda: the dialogue acts needed to elicit the user goal.
  - Dynamically changed during the dialogue.
  - Generated either deterministically or stochastically.

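A toy data-structure sketch of this idea (class and method names are assumptions, and only the deterministic variant is shown):

```python
from dataclasses import dataclass, field

@dataclass
class AgendaBasedUser:
    """Agenda-based user simulator: a fixed goal plus a stack of dialogue acts."""
    goal: dict                                   # e.g. {"type": "restaurant", "price": "cheap", "food": "Chinese"}
    agenda: list = field(default_factory=list)   # pending dialogue acts (a stack)

    def initialise(self):
        # Deterministic variant: push one inform act per goal constraint.
        # A stochastic variant would sample the order or bundle constraints.
        self.agenda = [("inform", slot, value) for slot, value in self.goal.items()]

    def next_act(self, system_act=None):
        # The agenda changes dynamically, e.g. a system confirm triggers an affirm.
        if system_act and system_act[0] == "confirm":
            return ("affirm",)
        return self.agenda.pop() if self.agenda else ("bye",)
```
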
Learning with noisy input

- The point of POMDPs is to provide robustness to speech recognition errors, so expose the manager to noise during learning.
- The user simulator output can be corrupted to produce an N-best list of scored noisy inputs.

Example: the true act inform(price=cheap, area=centre) might be corrupted to

    inform(price=cheap, area=south)   0.63
    inform(price=expensive)           0.22
    request(area)                     0.15

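A sketch of such an error model (the confusion mechanism and score generation are illustrative assumptions):

```python
import random

def corrupt_to_nbest(true_act: str, confusions: list, n: int = 3,
                     p_correct_top: float = 0.7):
    """Corrupt the simulated user's act into an N-best list of scored
    noisy hypotheses, mimicking the speech recognition channel.

    With probability `p_correct_top` the true act keeps the top rank;
    otherwise a confused hypothesis outranks it, as in the example above
    where the top hypothesis has the wrong area.
    """
    ranked = [true_act] + random.sample(confusions, k=n - 1)
    if random.random() > p_correct_top:
        ranked[0], ranked[1] = ranked[1], ranked[0]
    raw = sorted((random.random() for _ in ranked), reverse=True)
    total = sum(raw)
    return [(act, score / total) for act, score in zip(ranked, raw)]
```
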
Evaluating a dialogue system

- A dialogue system consists of many components, and joint evaluation is difficult.
- What matters is the user experience.
- The dialogue manager uses the reward to optimise the dialogue policy; the same reward can also be used for evaluation.

Problem 3: policy optimisation requires a lot of dialogues

The POMDP approach requires 10,000s of dialogues to train the policy, which is too many for learning in interaction with real users.

Solution: take into account similarities between different belief states.
- Essential ingredients: a Gaussian process and a kernel function.
- Outcome: fast policy optimisation.

The Q-function as a Gaussian process

- The Q-function in a POMDP is the expected long-term reward from taking action a in belief state b(s).
- It can be modelled as a stochastic process, a Gaussian process, to take into account the uncertainty of the approximation.

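In the usual GP-RL formulation (notation assumed, following standard convention rather than the slides), the prior over Q-values is written as

```latex
Q(b, a) \sim \mathcal{GP}\big(m(b, a),\; k\big((b, a), (b', a')\big)\big)
```

with mean function m (often taken as zero) and kernel k encoding the correlation between belief-action pairs.
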
VoiceMail example

- The user asks the system to save or delete the message.
- System actions: save, delete, confirm.
- The user input is corrupted with noise, so the true dialogue state is unknown.

Q-function as a Gaussian process

[Figure: GP estimate of the Q-function plotted over the belief state b(s)]

The role of the kernel function in a Gaussian process

The kernel function models the correlation between different Q-function values.

[Figure: Q-function value plotted against belief state for the confirm action, illustrating correlated estimates]

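A sketch of such a kernel over belief-action pairs; the squared-exponential form over beliefs and the delta kernel over actions are common choices assumed here, not prescribed by the slides:

```python
import numpy as np

def kernel(b1: np.ndarray, a1: str, b2: np.ndarray, a2: str,
           length_scale: float = 0.5) -> float:
    """Correlation between Q(b1, a1) and Q(b2, a2).

    Beliefs are compared with a squared-exponential kernel, so nearby
    belief states get strongly correlated Q-values; actions are compared
    with a delta kernel (no correlation across different actions).
    """
    if a1 != a2:
        return 0.0
    sq_dist = float(np.sum((b1 - b2) ** 2))
    return float(np.exp(-sq_dist / (2.0 * length_scale ** 2)))
```
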
Exploration in online learning

- The state space needs to be sufficiently explored to find the optimal path.
- How can the space be explored efficiently?

Active learning in GP reinforcement learning

- The GP model of the Q-function gives the uncertainty of its estimates.
- Exploration: choose the action that the model is uncertain about.
- Exploitation: choose the action with the highest expected reward.

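A sketch of this uncertainty-driven selection; `gp_predict`, returning the GP posterior mean and standard deviation of Q for a belief-action pair, is a hypothetical interface:

```python
import random

def select_action(gp_predict, belief, actions, p_explore=0.2):
    """Active exploration with a GP model of the Q-function.

    Exploration: pick the action whose Q-value the GP is most uncertain
    about (largest posterior standard deviation).
    Exploitation: pick the action with the highest posterior mean.
    """
    stats = {a: gp_predict(belief, a) for a in actions}   # a -> (mean, std)
    if random.random() < p_explore:
        return max(actions, key=lambda a: stats[a][1])    # most uncertain
    return max(actions, key=lambda a: stats[a][0])        # highest expected reward
```
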