CEICES: a “Vertical” Approach Towards Recognizing Emotion in Speech Anton Batliner Lehrstuhl für Mustererkennung (Informatik 5) (Chair for Pattern Recognition)


1 CEICES: a “Vertical” Approach Towards Recognizing Emotion in Speech
Anton Batliner
Lehrstuhl für Mustererkennung (Informatik 5) (Chair for Pattern Recognition)
Friedrich-Alexander-Universität Erlangen-Nürnberg
HUMAINE Plenary, Paris, June 4th, 2007

2 Seite 2 A. Batliner Click to edit Master title style 16.08.2014Ergebnisse Mitarbeiterbefragung 2001 page 2 What is CEICES?

3 CEICES Initiative
Combining Efforts for Improving automatic Classification of Emotional user States - a "forced co-operation" initiative
Partners active:
- from outside HUMAINE: TUM (Technische Universität München), FBK-irst (inside/outside)
- from within HUMAINE, WP4: FAU, UA, LIMSI, TAU/AFEKA
People:
- Anton Batliner, Stefan Steidl, Björn Schuller, Dino Seppi, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, Vered Aharonson

4 Idea Behind
- different research traditions at different sites
- somehow fossilized approaches at different sites
- co-operation pays off: pooling together competence and feature sets

5 Which data do we use?

6 Database
- German corpus with recordings of 51 ten- to twelve-year-old children communicating with Sony's Aibo pet robot (9.2 hours of speech, 51,393 words)
- data ± reverberated, transliterated, annotated: 5 labellers, 11 word-based "emotion" labels
- originator site (FAU) provides speech files, phonetic lexicon, definition of train and test samples, etc.
- effort for manual “pre-processing” only: ~80 k € researcher, ~80 k € students (conservative estimate)

7 A “Vertical” Approach
- segmentation, transliteration
- emotion labelling
- annotation of interaction
- manual word segmentation
- manual correction of F0
- syntactic annotation
- rule-based chunking system
- .............

8 Basics of Chosen Approach
- children: not exotic but normal
- annotation: with context (it’s speech, not sounds)
- majority voting (≥ 3 out of 5 agree)
- unit of annotation: the word, because
  - link to ASR
  - link to higher processing (syntax, dialogue, semantics)
  - smallest possible emotional unit
  - can be combined into higher units of different size
- mapping onto 4 cover classes, due to sparse data:
  - Motherese (positive valence)
  - default class Neutral
  - "pre-stage" to negative: Emphatic
  - negative (Angry) "dimension" (smearing fine-grained differences between: touchy, reprimanding, angry)
- result: the AMEN sub-sample (Angry, Motherese, Emphatic, Neutral)
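The voting and mapping steps on this slide can be sketched as follows. This is a minimal illustration, not the procedure used for the corpus: it reads the voting rule as "at least three of five labellers agree", it maps raw labels to cover classes before voting (the slide does not say in which order the two steps were applied), and the raw label spellings and the names `COVER_CLASS` and `majority_label` are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from raw word labels to the four cover classes.
# Only touchy/reprimanding/angry -> Angry is stated on the slide; the
# spelling of the other raw labels is an assumption.
COVER_CLASS = {
    "motherese": "M",
    "neutral": "N",
    "emphatic": "E",
    "touchy": "A",
    "reprimanding": "A",
    "angry": "A",
}


def majority_label(raw_labels, min_votes=3):
    """Return the cover class chosen by at least min_votes labellers, else None."""
    # Labels outside the four cover classes are ignored in this sketch.
    mapped = [COVER_CLASS[label] for label in raw_labels if label in COVER_CLASS]
    if not mapped:
        return None
    label, votes = Counter(mapped).most_common(1)[0]
    return label if votes >= min_votes else None
```

With five labellers and a threshold of three, at most one class can reach the threshold, so no tie-breaking is needed.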

9 The AMEN Sub-sample
- syntactically/semantically meaningful chunks with at least one AMEN word
- syntactic-prosodic chunking rules: IF (synt. bound. = sentence/free phrase/between vocatives) OR (pause ≥ 500 ms at any other synt. bound.)
- frequencies:
  - Motherese: 586
  - Neutral: 1998
  - Emphatic: 1045
  - Angry: 914
- experiments so far with 2- or 3-fold speaker-independent cross-validation, upsampling for training
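The chunking rule and the evaluation setup described on this slide can be sketched in a few lines. Everything here is an assumption made for illustration: the boundary type names, the reading of the pause threshold as "at least 500 ms", random duplication as the upsampling method, and the round-robin fold assignment. The slide itself only names the rule, the class frequencies, and "speaker-independent cross-validation, upsampling for training".

```python
import random
from collections import defaultdict

# Boundary types that always close a chunk (names are hypothetical).
STRONG_BOUNDARIES = {"sentence", "free_phrase", "between_vocatives"}


def is_chunk_boundary(synt_boundary, pause_ms, min_pause_ms=500):
    """Apply the syntactic-prosodic rule after one word.

    synt_boundary is the boundary type after the word, or None if there is none.
    """
    if synt_boundary is None:
        return False
    if synt_boundary in STRONG_BOUNDARIES:
        return True
    # Any other syntactic boundary counts only with a long enough pause.
    return pause_ms >= min_pause_ms


def upsample(items, label_of, seed=0):
    """Randomly duplicate items until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in items:
        by_class[label_of(item)].append(item)
    target = max(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(members)
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out


def speaker_folds(speakers, n_folds=3):
    """Assign each speaker to exactly one fold, so no speaker is split across folds."""
    ordered = sorted(set(speakers))
    return {speaker: i % n_folds for i, speaker in enumerate(ordered)}
```

In such a setup, `upsample` would be applied to the training folds only, so that the test fold keeps the natural class distribution.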

10 Type of Database
Scenario
  - acted
  - prompted
  - real-life
  + elicited/induced
  + volunteering
  + application-oriented
  - emotion-oriented
Outcome
  + spontaneous
  + natural
  + realistic
  - selected
acted - induced - natural

11 fully exploiting the state of the art: relevance of features

12 Feature Encoding Scheme (WS at FAU 12/06)
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.10.00N00X0000000000T0000000000Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000001000F00.10.00N00X0000000000T__________PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.01N00X0000000000T0000000000PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00C0000010000F10.30.00N00X0000000000T0000000000PTUM_0_fa1_band_stdd
SEM S8I02108M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.91N04X0000000000T0000000000PPOV_positive_valence_norm
BOW S5I01440M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_wieder
E S8I01229M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.10N00X0000000000T0000000000PEnEneAbs0_0_f29_mean
POS S5I00042M1D4R5101L020000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_APN
POS S5I00045M1D4R5101L050000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_PAJ
SEM S8I02106M1D4R5111L009000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PRES_rest
D S8I01056M1D4R5111L000000A10.00.00.00.00.00.00.00.00.00C0000010000F00.99.01N00X0000000000T0000000000PDurAbsSyl0_0_f56_min
P S8I01263M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F20.61.10N00X0000000000T0000000000PF0RegCoeff0_0_f63_mean
S S4I00080M1D4R5111L000000A00.00.00.10.00.00.04.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvnhr
P S4I01055M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F00.22.00N00X1000000000T0000000000Pprctilep4A
E S4I01001M1D4R5111L000000A00.10.10.00.00.00.00.00.00.00C0000010000F30.02.00N00X1000000000T0000000000Ploud_maxval
V S4I00075M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimapq3
E S8I01029M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.01N00X0000000000T0000000000PEnEneAbs0_0_f29_min
V S4I00074M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimloc
BOW S5I01217M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_halt
BOW S5I01382M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_sollst
C S5I03116M1D4R5102L000000A00.00.00.00.00.10.00.00.00.00C0000010000F10.10.00N00X0000000000T0000000000PTUM_MFCC10Average
C S5I05195M1D4R5102L000000A00.00.00.00.00.12.00.00.00.00C0000010000F11.18.00N00X0000000000T0000000000PTUM_0_mfcc_c12_d_cnt
E S6I01002M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.00N00X0000000000T0000000000Penergy_max
C S6I04113M1D4R5112L000000A00.00.00.00.00.04.00.00.00.00C0000010000F00.31.00N00X0000000000T0000000000Pmfcc4_var
E S1I00006M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000101010F00.49.00N00X0000000000T__________PeneTau____
S S6I03006M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.21.00N00X0000000000T0000000000Pspectral_cog_median
SEM S8I02103M1D4R5111L003000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PNEV_negative_valence
E S6I01114M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F02.21.00N00X0000000000T0000000000Penergy_deltadelta_median
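Each code on this slide is a fixed-width string in which a capital letter introduces every field. The transcript does not say what the individual fields mean, so the sketch below only splits a code into its letter-tagged parts; the field widths are inferred from the examples shown here, and the names `FEATURE_CODE` and `parse_feature_code` are hypothetical. The L and A fields appear to match the "linguistic encoding" and "acoustic encoding" labels of the zoom slide, but that reading is an inference.

```python
import re

# One named group per marker letter; widths inferred from the codes on the slide.
FEATURE_CODE = re.compile(
    r"S(?P<S>\d)I(?P<I>\d{5})M(?P<M>\d)D(?P<D>\d)R(?P<R>\d{4})"
    r"L(?P<L>\d{6})A(?P<A>\d\d(?:\.\d\d){9})C(?P<C>\d{10})"
    r"F(?P<F>\d\d\.\d\d\.\d\d)N(?P<N>\d\d)X(?P<X>\d{10})"
    r"T(?P<T>[\d_]{10})P(?P<P>.+)"
)


def parse_feature_code(code):
    """Split one feature code into a dict keyed by its marker letters."""
    match = FEATURE_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a recognised feature code: {code!r}")
    return match.groupdict()
```

A line such as `SEM S8I02102...` would first be split on whitespace, with the leading tag (SEM, BOW, E, ...) kept as the feature type and the remainder passed to `parse_feature_code`.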

13 Zoom on Feature Encoding Scheme
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00.. PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00.. PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00.. Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00.. PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00.. PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00.. PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00.. PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00.. PTUM_0_fa1_band_stdd
Labels on the slide: linguistic encoding; acoustic encoding

14 Impact of Feature Types (SVM), Separate and Combined
Classification of Chunks, F Values: Acoustic and Linguistic Features

feature types      #     all    red. (150)  SFFS sep  SFFS comb
energy             265   58.5   60.0        56.9      56.3
duration           391   55.1   60.6        54.9      49.6
F0                 333   56.1   55.1        46.7      46.8
spectral/formant   656   54.4   56.0        49.9      46.2
cepstral           1699  52.7   57.1        50.4      46.4
voice quality      154   51.5   51.6        41.5      38.7
wavelets           216   56.0   56.3        44.9      35.3
bag of words       476   62.6   62.3        53.2      37.4
part-of-speech     31    54.7   -           54.9      48.1
higher semantics   12    57.6   -           57.9      56.0
non-verbal         8     24.2   -           -         -
disfluencies       4     26.8   -           -         -
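SFFS is commonly expanded as sequential forward floating search (or selection): features are added one at a time, and after each addition the search may drop earlier picks again if that beats the best subset of the smaller size seen so far. The sketch below illustrates that search strategy only; it is not the configuration used for the table. The name `sffs` and the `score` callback are hypothetical, and in an experiment like this one `score` would presumably be a cross-validated classifier F value.

```python
def sffs(features, score, k):
    """Sequential forward floating search for a subset of k features.

    score(subset) returns a number to maximise for a list of features.
    """
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and the number of features")
    selected = []
    best = {}  # subset size -> (best score seen so far, subset)
    while len(selected) < k:
        # Forward step: add the single feature that raises the score most.
        candidates = [f for f in features if f not in selected]
        selected = selected + [max(candidates, key=lambda f: score(selected + [f]))]
        current = score(selected)
        if len(selected) not in best or current > best[len(selected)][0]:
            best[len(selected)] = (current, list(selected))
        # Floating step: drop the least useful feature while the reduced
        # subset beats the best subset of that size found earlier.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            reduced_score = score(reduced)
            if reduced_score > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_score, list(reduced))
            else:
                break
    return best[k][1]
```

Because a backward step is accepted only when it strictly improves the best score recorded for that subset size, the search cannot cycle.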

20 Types of Approaches, SFFS, F Values
- knowledge-based and sequential (FAU, FBK: 118)*: 58.8
- knowledge-based (TAU, LIMSI: 312): 53.3
- brute-force (TUM, UA: 3304): 54.9
- all acoustic features (3714): 63.4
- all linguistic features (531): 62.6
- all together (4245): 65.5
* word-based features, using manually corrected word boundaries, combined into chunk-based features

21 beyond the state of the art: units of analysis

22 Performance for Different Chunks, Preliminary Experiments at FBK, Small Feature Set, F Values

chunking                                       #      F
optimal, i.e. adjacent identical labels        7008   64.0
  (e.g.: NNN EE AAAA NNNNN M N MMM)
turns (pause > 1.5 sec.)                       3990   50.0
syntactic-prosodic rule system                 9152   55.2
words                                          17611  55.0
syntactic rule system (clauses/phrases/ …)     9102   53.9
prosodic rule system (pause > 0.5 sec.)        5129   53.0
LM2 (bi-gram language model)                   5480   52.8
POS-LM (part-of-speech language model)         4637   56.0
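The "optimal" chunking in the first row merges adjacent words that carry the same label. A minimal sketch of that grouping, assuming one single-letter label per word as in the slide's example (the function name `optimal_chunks` is hypothetical):

```python
from itertools import groupby


def optimal_chunks(word_labels):
    """Merge runs of adjacent identical word labels into chunks."""
    return ["".join(run) for _, run in groupby(word_labels)]


# The example sequence from the slide, one label per word.
print(optimal_chunks("NNNEEAAAANNNNNMNMMM"))
```

Since this grouping needs the reference labels themselves, it serves as an upper reference point and is not available when labelling unseen data.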

23 Summing up our Results
- impact of acoustic feature types: energy most important, voice quality less important, other types in between (note domain-dependency!)
- impact of linguistic feature types: very high - to be checked with real Automatic Speech Recognition (ASR) output
- sequential approach promising
- chunking is the right way to do
- emotion recognition seems to be less prone to noise than comparable speech processing tasks (ICASSP 2007)
- PDA (Pitch Detection Algorithm) extraction errors deteriorate performance consistently but not detrimentally (ICPhS 2007)

24 In a Nutshell
full exploitation of state-of-the-art approaches
- > 4 k features
- knowledge-based vs. brute-force
- selection and classification
and beyond state-of-the-art
- towards new dimensions (UMUAI 2007)
- meaningful units of analysis (chunking)
- interaction/dialogue modelling
- prototyping
- personalization
- ……

25 and the Message of the Day
people stare at classification performance, which is tuned explicitly by highly sophisticated classifiers and implicitly by settings not obvious to the 'normal' reader, such as
- manual emotion chunking
- using only prototypes
- using acted data
- and other devices

26 Thank you for your attention

