# On-line dialogue policy optimisation

Milica Gašić, Dialogue Systems Group


Spoken Dialogue System Optimisation
Problem: What is the optimal behaviour?
Solution: Find it automatically through interaction.

Reinforcement learning

Training in interaction with humans
- Problem 1: Optimisation requires too many dialogues.
- Problem 2: Training makes random moves.
- Problem 3: Humans give inconsistent ratings.

Outline
- Background
- Dialogue model
- Dialogue optimisation
- Sample-efficient optimisation
- Models for learning
- Robust reward function
- Human experiments
- Conclusion

Model: Partially Observable Markov Decision Process (POMDP)
- The state is Markov: it depends only on the previous state and action, P(s_{t+1} | s_t, a_t), the transition probability.
- The state is unobservable and generates a noisy observation, P(o_t | s_t), the observation probability.
- In every state an action is taken and a reward is obtained.
- A dialogue is a sequence of states.
- Action selection (the policy) is based on the distribution over all states at every time step t, the belief state b(s_t).
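The belief state b(s_t) is maintained by Bayesian filtering. Using the transition and observation probabilities defined above, the standard POMDP belief update is:

```latex
b(s_{t+1}) \;\propto\; P(o_{t+1} \mid s_{t+1}) \sum_{s_t} P(s_{t+1} \mid s_t, a_t)\, b(s_t)
```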

Dialogue state factorisation
Decompose the state into conditionally independent elements:
- user goal g_t
- user action u_t
- dialogue history d_t

Further dialogue state factorisation
Each element is further factorised per slot (e.g. *food* and *area*), giving slot-level goals, user actions and histories: g_t^food, u_t^food, d_t^food, g_t^area, u_t^area, d_t^area.
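A factorised state lets each slot's belief be updated independently. The following is a minimal illustrative sketch, not the system described in the talk: a single slot's goal distribution is updated from a noisy observation with a confidence score.

```python
# Minimal sketch of a per-slot factorised belief update (illustrative).
# Each slot's goal belief is updated independently from a noisy observation
# that is assumed correct with probability p_correct.

def update_slot_belief(belief, observed_value, p_correct):
    """Bayesian update of one slot's goal distribution given a noisy observation."""
    n = len(belief)
    posterior = {}
    for value, prior in belief.items():
        # likelihood: observation is correct with p_correct, else uniform error
        like = p_correct if value == observed_value else (1 - p_correct) / (n - 1)
        posterior[value] = like * prior
    z = sum(posterior.values())
    return {v: p / z for v, p in posterior.items()}

# belief over the "food" slot after hearing "indian" with 80% confidence
food_belief = {"indian": 1/3, "chinese": 1/3, "italian": 1/3}
food_belief = update_slot_belief(food_belief, "indian", 0.8)
```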

Policy optimisation in summary space
Compress the belief state into a summary space¹: a summary function maps the original belief space and actions into a summary space with summary actions, the summary policy is optimised there, and a master function maps summary actions back to actions in the original space.

¹ J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."

Q-function
The Q-function measures the expected discounted reward that can be obtained when an action is taken at a summary point, taking into account the rewards of future actions. Optimising the Q-function is equivalent to optimising the policy:

$$Q(b, a) = \mathbb{E}_\pi\!\left[\sum_{k \geq 0} \gamma^{k} r_{t+k} \,\middle|\, b_t = b,\; a_t = a\right]$$

where γ ∈ (0, 1] is the discount factor, r the reward, b the starting summary point, a the starting action, and the expectation is taken with respect to the policy π.
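The quantity the Q-function estimates can be computed directly for one logged dialogue. A short sketch (reward values are illustrative, not from the talk):

```python
# Sketch: the discounted return that the Q-function estimates, computed for
# one sequence of rewards collected from a given summary point onwards.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * r_k over the reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# e.g. a per-turn penalty of -1 and +20 for a successful five-turn dialogue
rewards = [-1, -1, -1, -1, 20]
g = discounted_return(rewards, gamma=1.0)  # undiscounted return: 16
```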

Online learning
Reinforcement learning in direct interaction with the environment. Actions are taken ε-greedily:
- Exploitation: choose the action according to the best estimate of the Q-function.
- Exploration: choose an action at random (with probability ε).
In practice, tens of thousands of dialogues are needed!
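The ε-greedy rule above can be sketched in a few lines. This assumes a tabular Q estimate for illustration; the talk's policies operate over a continuous summary space.

```python
import random

# Minimal epsilon-greedy action selection over a tabular Q estimate
# (illustrative sketch; action names are made up).

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: dict mapping action -> estimated Q. Returns the chosen action."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))    # explore: random action
    return max(q_values, key=q_values.get)   # exploit: best current estimate

action = epsilon_greedy({"confirm": 1.2, "request": 0.7}, epsilon=0.1)
```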

Problem 1: Standard models require too many dialogues

Solution: take into account similarities between different belief states.
Essential ingredients: a Gaussian process and a kernel function.
Outcome: sample-efficient policy optimisation.

Gaussian Process Policy Optimisation
The Q-function is the expected long-term reward and can be modelled as a Gaussian process: a prior is placed over the Q-function, and a posterior is computed given the visited summary states, the actions taken and the obtained rewards.
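A toy sketch of the idea: Gaussian-process regression over a one-dimensional summary feature for a single action, with a squared-exponential kernel. This is an illustration of GP regression under made-up data, not the GP-SARSA algorithm itself.

```python
import math

# Toy GP regression of a Q-function over a 1-D summary feature for one action.
# Posterior mean at a query point, given observed (point, return) pairs.

def kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel: nearby belief points get correlated Q values."""
    return math.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def gp_posterior_mean(X, returns, x_query, noise=0.1):
    n = len(X)
    # Gram matrix K + noise * I
    K = [[kernel(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    # solve K w = returns by Gauss-Jordan elimination (fine for tiny n)
    A = [row[:] + [returns[i]] for i, row in enumerate(K)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    w = [A[i][n] / A[i][i] for i in range(n)]
    # posterior mean is k(x_query, X) . w
    return sum(kernel(x_query, X[i]) * w[i] for i in range(n))

# two visited summary points with observed returns 5 and 8
q_est = gp_posterior_mean([0.0, 1.0], [5.0, 8.0], x_query=0.0)
```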

Voice mail example
The user asks the system to save or delete the message. The user input is corrupted by noise, so the true dialogue state is unknown and the system maintains a belief state b(s).

The role of the kernel function in a Gaussian process
The kernel function models the correlation between Q-function values at different belief states and actions (e.g. between two belief points for the *Confirm* action).
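On the voice-mail example, a kernel over belief points can be as simple as an inner product of belief vectors, so that similar beliefs receive correlated Q-function values. The kernel choice and numbers below are illustrative only:

```python
# Sketch of a kernel between belief points over the save/delete voice-mail
# states: a dot-product kernel, so similar beliefs get correlated Q values.

def belief_kernel(b1, b2):
    """Inner product of two belief vectors (dict: state -> probability)."""
    return sum(p * b2.get(s, 0.0) for s, p in b1.items())

b_a = {"save": 0.9, "delete": 0.1}
b_b = {"save": 0.8, "delete": 0.2}
b_c = {"save": 0.1, "delete": 0.9}
# b_a and b_b are more similar than b_a and b_c, so their Q values
# are modelled as more strongly correlated
sim_ab = belief_kernel(b_a, b_b)   # 0.74
sim_ac = belief_kernel(b_a, b_c)   # 0.18
```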

Problem 2: Standard models make random moves Exploitation? Exploration?

Solution: define a stochastic policy.
- The Gaussian process defines a Gaussian distribution over the Q-value of each action.
- Sample from these distributions and act on the samples.
- This automatically handles the exploration/exploitation trade-off.
Outcome: less unexpected behaviour.
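The stochastic policy can be sketched as follows: one Q-value is drawn per action from its Gaussian posterior and the best sample wins, so uncertain actions are explored in proportion to their uncertainty. Action names and numbers are illustrative:

```python
import random

# Sketch of the stochastic policy: each action's Q-value has a Gaussian
# posterior (mean, stddev) from the GP; sample once per action and act
# greedily on the samples.

def sample_action(q_posteriors, rng=random):
    """q_posteriors: dict action -> (mean, stddev). Returns the sampled-best action."""
    samples = {a: rng.gauss(mu, sd) for a, (mu, sd) in q_posteriors.items()}
    return max(samples, key=samples.get)

# a lower-mean but more uncertain action can still win a sample,
# which gives built-in exploration
action = sample_action({"confirm": (1.2, 0.05), "request": (1.0, 0.8)})
```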

Results during testing (with simulated user)

Results during training (with simulated user)

Problem 3: Humans give inconsistent ratings
The reward is a measure of how good the dialogue is.

On-line learning from user rating

User rating inconsistency

| | Random policy | Online learned policy | Simulator trained policy |
|---|---|---|---|
| User rating (%) | 36.3 | 76.9 | 85.7 |
| Objective score (%) | 17.7 | 53.8 | 63.7 |
| P(user rating = 1 \| objective score = 1) | | 0.80 | 0.94 |
| P(user rating = 1 \| objective score = 0) | 0.26 | 0.57 | 0.68 |

Solution: Incorporate both objective and subjective evaluation

Evaluation results

| | Simulator trained | On-line trained |
|---|---|---|
| Evaluation dialogues | 400 | 410 |
| Reward | 11.6 ± 0.4 | 13.4 ± 0.3 |
| Success (%) | 93.5 ± 1.2 | 96.8 ± 0.9 |

Conclusions
- Gaussian processes in policy optimisation automate dialogue manager optimisation.
- They enable sample-efficient optimisation.
- The resulting policies outperform simulator-trained policies.
