Emotional Grounding in Spoken Dialog Systems
Jackson Liscombe, Giuseppe Riccardi, Dilek Hakkani-Tür

Presentation transcript:

1 Emotional Grounding in Spoken Dialog Systems
Jackson Liscombe (jaxin@cs.columbia.edu)
Giuseppe Riccardi (dsp3@research.att.com)
Dilek Hakkani-Tür (dtur@research.att.com)
Presented 10.14.04, CU / AT&T

2 The Problem: Emotion
In Spoken Dialog Systems, users can …
… start angry.
… get angry.
… end angry.

3 Outline: Previous Work · Corpus Description · Feature Extraction · Classification Experiments

4 Outline: Previous Work · Corpus Description · Feature Extraction · Classification Experiments

5 Past Work
I. Isolated Speech
II. Spoken Dialog Systems

6 Past Work: Isolated Speech
- Acted data
- Features: F0/pitch, energy, speaking rate
- Researchers (late 1990s to present): Aubergé, Campbell, Cowie, Douglas-Cowie, Hirschberg, Liscombe, Mozziconacci, Oudeyer, Pereira, Roach, Scherer, Schröder, Tato, Yuan, Zetterholm, …

7 Past Work: Spoken Dialog Systems (1)
Batliner, Huber, Fischer, Spilker, Nöth (2003)
- system: Verbmobil (Wizard-of-Oz scenarios)
- binary classification
- features: prosodic; lexical (POS tags, swear words); dialog acts (repeat/repair/insult)
- 0.1% relative improvement using dialog acts

8 Past Work: Spoken Dialog Systems (2)
Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
- system: DARPA Communicator
- binary classification
- features: prosodic; lexical (language model); dialog acts (repeats/repairs)
- 4% relative improvement using dialog acts

9 Past Work: Spoken Dialog Systems (3)
Lee, Narayanan (2004)
- system: SpeechWorks call center
- binary classification
- features: prosodic; lexical (weighted mutual information); dialog acts (repeat/rejection)
- 3% improvement using dialog acts

10 Past Work: Summary
- Past research has focused on acoustic data
- The field is moving toward grounding emotion in context (dialog acts)
- This summer's work: extend contextual features for better emotion prediction

11 Outline: Previous Work · Corpus Description · Feature Extraction · Classification Experiments

12 Corpus Description
AT&T's "How May I Help You?"(SM) corpus (0300 Benchmark)
Labeled with "Voice Signature" information:
- user state (emotion)
- gender
- age
- accent type

13 Corpus Description
Statistic                      Training   Testing
number of user turns           15,013     5,000
number of dialogs              4,259      1,431
number of turns per dialog     3.5        3.5
number of words per turn       9.0        9.9

14 User Emotion Distribution [chart]

15 Emotion Labels
Original set:
- Positive/Neutral
- Somewhat Frustrated
- Very Frustrated
- Somewhat Angry
- Very Angry
- Other Somewhat Negative
- Very Negative
Reduced set:
- Positive
- Negative

16 Corpus Description: Binary User States
Statistic                                      Training   Testing
% of turns that are positive                   88.1%      73.1%
% of dialogs with at least one negative turn   24.8%      44.7%
% of negative dialogs that start negative      43.5%      59.9%
% of negative dialogs that end negative        42.4%      48.7%

17 Outline: Previous Work · Corpus Description · Feature Extraction · Classification Experiments

18 Feature Set Space
A grid of feature types (Prosodic, Lexical, Discourse) against dialog context (turn i; turn i-1, turn i; turn i-2, turn i-1, …).

19 Feature Set Space: Context Overview
In the same grid, features of turn i alone are isolated features; extending the context to turn i-1, turn i-2, … yields differentials and prior statistics.
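Read concretely, each cell of this grid is a (feature type, context offset) pair that contributes entries to a single per-turn feature vector. A minimal sketch of that flattening follows; the container layout and field names are assumptions for illustration, not something specified on the slides.

```python
from typing import Dict, List

def turn_feature_vector(turns: List[Dict[str, Dict[str, float]]],
                        i: int, window: int = 2) -> Dict[str, float]:
    """Flatten the (feature type x context offset) grid for turn i.

    Each element of `turns` is assumed to hold already-extracted feature
    groups, e.g. {"prosodic": {...}, "lexical": {...}, "discourse": {...}}.
    """
    vector: Dict[str, float] = {}
    for offset in range(window + 1):           # 0 = turn i, 1 = turn i-1, ...
        j = i - offset
        if j < 0:                              # start of dialog: context cell is empty
            continue
        for group, feats in turns[j].items():  # prosodic / lexical / discourse
            for name, value in feats.items():
                vector[f"{group}_{name}_t-{offset}"] = value
    return vector
```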

20 Lexical Features
Language model (n-grams)
Examples of words significantly correlated with negative user state (p < 0.001):
- 1st person pronouns: 'I', 'me'
- requests for a human operator: 'person', 'talk', 'speak', 'human', 'machine'
- billing-related words: 'dollars', 'cents'
- curse words: …
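The slide reports that these correlations are significant at p < 0.001 but does not say how they were measured. One plausible way to reproduce such a word list, shown here only as a hedged sketch, is a per-word chi-squared contingency test over negative versus positive turns:

```python
from collections import Counter
from scipy.stats import chi2_contingency

def negatively_correlated_words(turns, labels, alpha=0.001):
    """Words whose presence is associated with negative turns.

    `turns` is a list of token lists; `labels` is a parallel list of
    "negative"/"positive" strings. Returns (word, p-value) pairs.
    """
    neg, pos = Counter(), Counter()
    n_neg, n_pos = labels.count("negative"), labels.count("positive")
    for tokens, label in zip(turns, labels):
        (neg if label == "negative" else pos).update(set(tokens))  # presence per turn
    hits = []
    for word in set(neg) | set(pos):
        table = [[neg[word], n_neg - neg[word]],   # negative turns with/without the word
                 [pos[word], n_pos - pos[word]]]   # positive turns with/without the word
        _, p, _, _ = chi2_contingency(table)
        if p < alpha and neg[word] / max(n_neg, 1) > pos[word] / max(n_pos, 1):
            hits.append((word, p))
    return sorted(hits, key=lambda pair: pair[1])
```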

21 Prosodic Features: Praat
- open-source tool for speech analysis, synthesis, statistics, manipulation, …
- by Paul Boersma and David Weenink, University of Amsterdam
- www.praat.org
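The slides only credit Praat itself; for scripting the same analyses from Python, the praat-parselmouth bindings are one option (an assumption of this sketch, not mentioned on the slide; the file name is a placeholder):

```python
import parselmouth  # praat-parselmouth: Python bindings to the Praat engine

snd = parselmouth.Sound("user_turn.wav")    # one user turn, placeholder file name
pitch = snd.to_pitch()                      # F0 contour via Praat's pitch analysis
f0 = pitch.selected_array["frequency"]      # Hz per frame; 0 where unvoiced
intensity = snd.to_intensity()              # energy contour in dB
print(f0.max(), intensity.values.max())
```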

22 Prosodic Features
Pitch (F0)
1. overall minimum
2. overall maximum
3. overall median
4. overall standard deviation
5. mean absolute slope
6. slope of final vowel
7. longest vowel mean
Other
8. local jitter over longest vowel
Energy
9. overall minimum
10. overall maximum
11. overall mean
12. overall standard deviation
13. longest vowel mean
Speaking Rate
14. vowels per second
15. mean vowel length
16. ratio of voiced frames to total frames
17. percent internal silence
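As an illustration of how the turn-level statistics above can be derived from a contour like the one extracted in the previous sketch, here is a numpy-only sketch covering features 1-5 and 16; the unvoiced-frame convention (F0 = 0) and the frame step are assumptions:

```python
import numpy as np

def pitch_statistics(f0: np.ndarray, frame_step: float = 0.01) -> dict:
    """Turn-level pitch features from an F0 contour (Hz per frame).

    Unvoiced frames are assumed to be marked with 0; `frame_step` is the
    analysis step in seconds.
    """
    voiced = f0[f0 > 0]
    if voiced.size == 0:
        return {}
    # Frame-to-frame differences of the voiced contour, in Hz per second.
    # This is one rough reading of "mean absolute slope"; boundaries between
    # voiced regions are not treated specially in this sketch.
    slope = np.abs(np.diff(voiced)) / frame_step
    return {
        "f0_min": float(voiced.min()),                  # feature 1
        "f0_max": float(voiced.max()),                  # feature 2
        "f0_median": float(np.median(voiced)),          # feature 3
        "f0_std": float(voiced.std()),                  # feature 4
        "f0_mean_abs_slope": float(slope.mean()) if slope.size else 0.0,  # feature 5
        "voiced_ratio": float(voiced.size) / f0.size,   # feature 16
    }
```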

23 Prosodic Features in Action

24 A View of Pitch: "That amount is incorrect." (call 20020508/0508095935fhf2910)
[pitch contours of the utterance labeled Somewhat Frustrated vs. Very Frustrated]

25 Pitch/Energy/Speaking Rate: Change (call 20020221/0221080552atf1536)
…
Agent: I'm sorry, I didn't understand that. How may I help you?
User: please an operator please [.hst] may I speak-   [Positive/Neutral]
Agent: I'm sorry we're having trouble. Can we try one more time? Just briefly tell me how I may help you.
User: may I speak to an assistant please   [Somewhat Frustrated]
Agent: Are you calling from your home phone?
User: yes   [Somewhat Angry]

26 Pitch/Energy/Speaking Rate: Change (call 20020221/0221080552atf1536) [figure]

27 Pitch/Energy/Speaking Rate: No Change (call 20020317/0317220739atf9051)
Agent: Hello, I am an AT&T automated assistant. You can speak naturally to me. How may I help you?
User: I need to find out about a number that I don't recognize   [Positive/Neutral]
Agent: Would you like to look up a number you don't recognize on your bill?
User: yes I would   [Positive/Neutral]
Agent: Are you calling from your home phone?
User: yes I am   [Positive/Neutral]

28 Pitch/Energy/Speaking Rate: No Change (call 20020317/0317220739atf9051) [figure]

29 Feature Set Space: Baseline
The same feature/context grid, with the baseline configuration highlighted (prosodic and lexical features of the current turn).

30 Discourse Features
Dialog acts:
- greeting
- re-prompt
- confirmation
- specification
- acknowledgment
- disambiguation
- …

31 Feature Set Space: State-of-the-Art
The same grid, with the state-of-the-art configuration highlighted (the baseline features plus discourse features, i.e. dialog acts).

32 Contextual Features
Lexical (2)
- edit distance with previous 2 turns
Discourse (10)
- turn number
- call type repetition with previous 2 turns
- dialog act repetition with previous 2 turns
Prosodic (34)
- 1st and 2nd order differentials for each feature
Other (2)
- user state of previous 2 turns
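The slide does not spell out how these contextual features are computed. As one hedged reading: the prosodic differentials can be taken as differences of each per-turn prosodic feature across consecutive turns, and the lexical feature as a word-level Levenshtein distance between the current turn and a previous turn:

```python
def differentials(values):
    """1st and 2nd order differences of one prosodic feature across turns.

    `values` holds the feature for turns ..., i-2, i-1, i (most recent last).
    Returns (delta, delta-of-delta) for the current turn.
    """
    d1 = values[-1] - values[-2] if len(values) >= 2 else 0.0
    d1_prev = values[-2] - values[-3] if len(values) >= 3 else 0.0
    return d1, d1 - d1_prev

def word_edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance between two turn transcriptions."""
    wa_list, wb_list = a.split(), b.split()
    prev = list(range(len(wb_list) + 1))
    for i, wa in enumerate(wa_list, 1):
        cur = [i]
        for j, wb in enumerate(wb_list, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

# e.g. the two operator requests from the slide-25 dialog:
print(word_edit_distance("may I speak to an assistant please",
                         "please an operator please may I speak"))
```

A small distance relative to the turn lengths signals that the user is repeating a request, which is the kind of contextual cue these features are meant to capture.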

33 Feature Set Space: Contextual
The same grid, with the contextual configuration highlighted (all feature types, extended with context from previous turns).

34 Outline: Previous Work · Corpus Description · Feature Extraction · Classification Experiments

35 Experimental Design
- Training size = 15,013 turns
- Testing size = 5,000 turns
- Most frequent user state (positive) accounts for 73.1% of testing data
- Learning algorithm: BoosTexter (boosting with weak learners)
  - continuous and discrete valued features
  - 2000 iterations
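BoosTexter itself is AT&T software and is not reproduced here. Purely as an illustration of the configuration on this slide (boosting of weak learners over mixed continuous/discrete features, 2000 rounds), a scikit-learn stand-in with synthetic placeholder data might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the per-turn feature vectors; the real experiments
# used 15,013 training turns and 5,000 test turns.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosting with weak learners (scikit-learn's default base learner is a
# depth-1 decision stump) for 2000 rounds, mirroring the slide's setup.
clf = AdaBoostClassifier(n_estimators=2000, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```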

36 Performance Accuracy Summary
Feature Set         Accuracy   Rel. Improv. over Baseline
Most Freq. State    73.1%      -----
Baseline            76.1%      -----
State-of-the-Art    77.0%      1.2%
Contextual          79.0%      3.8%
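The relative improvements are measured against the 76.1% prosodic-plus-lexical baseline: (77.0 - 76.1) / 76.1 ≈ 1.2% for the state-of-the-art set and (79.0 - 76.1) / 76.1 ≈ 3.8% for the contextual set.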

37 Conclusions
- Baseline (prosodic and lexical features) leads to improved emotion prediction over chance
- State-of-the-Art (baseline plus dialog acts) gives further improvement
- Innovative contextual features improve emotion prediction even further
- Towards a computational model of emotional grounding

38 Thank You

