Training in interaction with humans
- Problem 1: Optimisation requires too many dialogues
- Problem 2: Training makes random moves
- Problem 3: Humans give inconsistent ratings
Outline
- Background
- Dialogue model
- Dialogue optimisation
- Sample-efficient optimisation
- Models for learning
- Robust reward function
- Human experiments
- Conclusion
Model: Partially Observable Markov Decision Process
[Figure: POMDP graphical model with states s_t, actions a_t, observations o_t and rewards r_t]
- The state is Markov: it depends only on the previous state and action, P(s_{t+1} | s_t, a_t), the transition probability
- The state is unobservable and generates a noisy observation, P(o_t | s_t), the observation probability
- In every state an action is taken and a reward is obtained
- A dialogue is a sequence of states
- Action selection (the policy) is based on the distribution over all states at every time step t, the belief state b(s_t)
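The belief state is maintained by a Bayesian filter: predict through the transition model, then weight by the observation probability and renormalise. A minimal sketch for a two-state domain; the "save"/"delete" intents echo the voice-mail example later in the deck, and the transition and observation probabilities are illustrative numbers, not taken from the slides:

```python
import numpy as np

# Hypothetical two-state POMDP: the user wants to "save" or "delete".
# T[a][s][s'] = P(s' | s, a); one "ask" action that leaves the goal unchanged.
T = {"ask": np.array([[1.0, 0.0],
                      [0.0, 1.0]])}
# O[s][o] = P(o | s): noisy speech recognition of the user's intent.
O = np.array([[0.8, 0.2],   # true intent "save"   -> heard "save"/"delete"
              [0.3, 0.7]])  # true intent "delete" -> heard "save"/"delete"

def belief_update(b, a, o):
    """b'(s') is proportional to P(o | s') * sum_s P(s' | s, a) * b(s)."""
    predicted = T[a].T @ b             # predict the next-state distribution
    unnormalised = O[:, o] * predicted # weight by the observation probability
    return unnormalised / unnormalised.sum()

b = np.array([0.5, 0.5])               # uniform prior over {save, delete}
b = belief_update(b, "ask", o=0)       # utterance recognised as "save"
print(b)                               # belief shifts towards "save"
```

One noisy "save" observation moves the belief from (0.5, 0.5) to roughly (0.73, 0.27): the state is never known exactly, only this distribution is.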
Dialogue state factorisation
- Decompose the state into conditionally independent elements: the user goal g_t, the user action u_t and the dialogue history d_t
[Figure: factorised dynamic Bayesian network with goal, user-action and history nodes g_t, u_t, d_t alongside a_t, r_t and o_t, and their successors at time t+1]
Further dialogue state factorisation
[Figure: each of the goal, user-action and history nodes is split per slot, e.g. g_t^food, g_t^area, u_t^food, u_t^area, d_t^food, d_t^area, and likewise at time t+1]
Policy optimisation in summary space
- Compress the belief state into a summary space [1]
[Figure: the original belief space and its actions are mapped via a summary function to a summary space with summary actions; a summary policy acts there, and a master function maps summary actions back to full actions]
1. J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."
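A common choice of summary function keeps only a few features of the full belief, such as the probabilities of the top hypotheses. A minimal sketch under that assumption (the exact features used are not specified on the slide):

```python
import numpy as np

# Hypothetical summary function: compress a full belief vector into a
# low-dimensional summary point, here the probabilities of the top two
# hypotheses. The choice of features is an illustrative assumption.
def summary_function(belief):
    return np.sort(belief)[::-1][:2]   # two largest probabilities

b = np.array([0.1, 0.6, 0.05, 0.25])  # belief over four user goals
print(summary_function(b))            # [0.6, 0.25]
```

The policy is then learned over these two numbers instead of the full distribution, which is what makes optimisation tractable.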
Q-function
- The Q-function measures the expected discounted reward that can be obtained when an action is taken at a summary point:
  Q^π(b, a) = E_π[ Σ_{k≥0} γ^k r_{t+k} | b_t = b, a_t = a ]
  where b is the starting summary point, a the starting action, r the reward, γ ∈ (0, 1] the discount factor, and the expectation is taken with respect to the policy π
- It takes into account the reward of future actions
- Optimising the Q-function is equivalent to optimising the policy
Online learning
- Reinforcement learning in direct interaction with the environment
- Actions are taken ε-greedily:
  - Exploitation: with probability 1 − ε, choose the action with the best current estimate of the Q-function
  - Exploration: with probability ε, choose an action at random
- In practice, tens of thousands of dialogues are needed!
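ε-greedy selection over a tabular Q-function can be sketched as follows; the dictionary layout, the action names and the 0.1 exploration rate are illustrative assumptions:

```python
import random

# Q maps (summary_point, action) -> current value estimate.
def epsilon_greedy(Q, summary_point, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # exploration: random action
    # exploitation: action with the best current Q estimate
    return max(actions, key=lambda a: Q.get((summary_point, a), 0.0))

Q = {("b0", "confirm"): 1.2, ("b0", "ask"): 0.4}
action = epsilon_greedy(Q, "b0", ["confirm", "ask"], epsilon=0.0)
print(action)   # with epsilon=0 this is the greedy action, "confirm"
```

The random exploration moves are exactly what Problem 2 below objects to: with probability ε the system acts arbitrarily, regardless of how confident the Q estimates already are.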
Problem 1: Standard models require too many dialogues
Solution: Take into account similarities between different belief states
- Essential ingredients: a Gaussian process and a kernel function
- Outcome: sample-efficient policy optimisation
Gaussian Process Policy Optimisation
- The Q-function is the expected long-term reward
- It can be modelled as a Gaussian process
- Prior: a zero-mean Gaussian process with a kernel k over (belief state, action) pairs
- Posterior, given the visited summary states, the actions taken and the rewards obtained: a Gaussian over Q(b, a) with an updated mean and variance for every point
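The posterior computation can be sketched with plain Gaussian-process regression: given returns observed at a few visited summary points, the GP yields a mean and a variance for the Q-value at any new point. The RBF kernel, the noise level and the data are illustrative assumptions, not the slides' exact model:

```python
import numpy as np

def rbf(x1, x2, length=1.0):
    # Squared-exponential kernel over summary points (an assumed choice).
    return np.exp(-0.5 * np.sum((x1 - x2) ** 2) / length ** 2)

def gp_posterior(X, y, x_star, noise=0.1):
    """Posterior mean and variance of the GP at query point x_star."""
    K = np.array([[rbf(a, b) for b in X] for a in X]) + noise * np.eye(len(X))
    k_star = np.array([rbf(x, x_star) for x in X])
    K_inv = np.linalg.inv(K)
    mean = k_star @ K_inv @ y
    var = rbf(x_star, x_star) - k_star @ K_inv @ k_star
    return mean, var

X = np.array([[0.1], [0.5], [0.9]])   # visited summary points
y = np.array([0.0, 1.0, 0.2])         # observed returns
mean, var = gp_posterior(X, y, np.array([0.5]))
```

Because the kernel correlates nearby points, one observed dialogue also informs the Q-value at similar belief states, which is where the sample efficiency comes from.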
Voice mail example
- The user asks the system to save or delete the message
- The user input is corrupted with noise, so the true dialogue state is unknown
[Figure: belief state b(s) over the save and delete hypotheses]
The role of the kernel function in a Gaussian process
- The kernel function models the correlation between Q-function values at different points
[Figure: Q-function values for the "confirm" action plotted against the belief state]
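A kernel over (belief state, action) pairs in this style typically compares actions with a Kronecker delta and belief states with a smooth kernel, so Q-values are correlated across nearby beliefs but only within the same action. A sketch; the RBF belief kernel and its length scale are assumptions:

```python
import numpy as np

def kernel(b1, a1, b2, a2, length=0.3):
    # Different actions are uncorrelated (Kronecker delta over actions).
    if a1 != a2:
        return 0.0
    # Same action: correlation decays with distance between beliefs.
    return float(np.exp(-0.5 * np.sum((b1 - b2) ** 2) / length ** 2))

b       = np.array([0.80, 0.20])
b_close = np.array([0.75, 0.25])
b_far   = np.array([0.10, 0.90])
print(kernel(b, "confirm", b_close, "confirm"))  # close to 1
print(kernel(b, "confirm", b_far, "confirm"))    # close to 0
print(kernel(b, "confirm", b_close, "ask"))      # exactly 0
```

This is what the figure on the slide illustrates: the Q-value of "confirm" at one belief says a lot about its value at a neighbouring belief, and nothing about a different action.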
Problem 2: Standard models make random moves
- Exploitation? Exploration?
Solution: Define a stochastic policy
- The Gaussian process defines a Gaussian distribution over the Q-value of each action
- Sample from these distributions and take the action with the highest sample
- This automatically handles the exploration/exploitation trade-off
- Outcome: less unexpected behaviour
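This sampling-based selection can be sketched directly: each action gets a Gaussian (mean, standard deviation) for its Q-value at the current summary point, one value is sampled per action, and the argmax is taken. The posterior numbers below are invented for illustration:

```python
import random

def sample_action(q_posteriors):
    # q_posteriors: action -> (posterior mean, posterior std dev).
    samples = {a: random.gauss(mu, sd) for a, (mu, sd) in q_posteriors.items()}
    return max(samples, key=samples.get)

q_posteriors = {
    "confirm": (1.0, 0.05),   # well-explored: high mean, low uncertainty
    "ask":     (0.6, 0.80),   # rarely tried: lower mean, high uncertainty
}
counts = {"confirm": 0, "ask": 0}
for _ in range(1000):
    counts[sample_action(q_posteriors)] += 1
print(counts)   # "confirm" wins most of the time, but "ask" is still tried
```

Uncertain actions get wide distributions and are therefore sampled highest occasionally, so exploration happens exactly where the model is still unsure, rather than uniformly at random as with ε-greedy.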