Presentation on theme: "Mixed Feelings About Using Phoneme-Level Models in Emotion Recognition Hannes Pirker Austrian Research Institute for Artificial Intelligence (OFAI) Vienna."— Presentation transcript:
Mixed Feelings About Using Phoneme-Level Models in Emotion Recognition
Hannes Pirker
Austrian Research Institute for Artificial Intelligence (OFAI), Vienna
Poster presented at ACII-2007, September 16-19, Lisbon, Portugal
Abstract
This study deals with the application of MFCC-based models to both the recognition of emotional speech and the recognition of emotions in speech. More specifically, it investigates the performance of phone-level models. First, results from forced alignment for the phonetic segmentation of GEMEP, a novel multimodal corpus of acted emotional utterances, are presented. The newly acquired segmentations are then used in emotion recognition experiments that compare phone-level models with sentence-level models.

1 Motivation
The Geneva Multimodal Emotion Portrayals corpus (GEMEP) is a novel corpus of highly controlled and uniform content, containing 2 pseudo-linguistic utterances produced with 18 different emotions and varying levels of intensity. This uniform content provides a promising basis for investigating both the acoustic correlates of emotions and timing issues in the synchronization between speech, facial expressions and body movements.
In order to provide a sound basis for the investigation of fine-grained temporal issues, phonetic segmentation at the level of individual speech sounds was to be performed. Apart from this practical goal, it was also to be investigated how well standard segmentation techniques, i.e. Hidden Markov Models (HMM) with Mel Frequency Cepstral Coefficients (MFCC), would cope with the variability of manner and voice quality typically found in emotional speech.
The third goal was to shed some light on the conflicting requirements placed on MFCCs. They are used in speech recognition for discriminating speech sounds and for their robustness against variation in intonation and voice quality. But they have also become popular in emotion recognition, where they should typically be indifferent to the underlying speech sounds and sensitive to voice quality.
As the identity of the speech sounds is a major influencing factor on MFCCs, we compare emotion classifiers that rely on phoneme-level models with classifiers that rely on sentence-level models.
2 Database Description
The Geneva Multimodal Emotion Portrayals corpus (GEMEP) consists of audio and video recordings of 10 speakers (professional French-speaking actors), 5 of whom are female. Verbal content is restricted to only 2 different pseudo-linguistic sentences:
(type1): Ne kal ibam soud mol'en!
(type2): Koun s'e mina lod belam?
These were uttered with 18 different emotions, which were chosen for extensive coverage of both the activation (high-low) and the evaluation (positive-negative) dimensions. Utterances were also produced in a less intense, a more intense and a 'masked' manner (i.e. attempting unsuccessfully to ‘hide’ the actual emotion from the audience). On average, utterances were repeated 8 times per condition. This results in 3815 sentences: 2739 of type1 and 1076 of type2.
Emotion categories in the GEMEP corpus. Subset of “selected 6” set in bold.

3 Phonetic Segmentation
3.2 Workflow
Starting with 50 manually segmented sentences for training an initial set of HMMs, several cycles of training, alignment, manual correction and re-training with the increased set were performed. By now, 1313 manually validated sentences are available, which are currently split into 892 training and 421 test samples. To make use of the constrained content of the corpus, HMMs were trained for each individual sound, i.e. 4 different models for /a/ were trained. Apart from unconstrained training (Global models), specific models for male and female speakers (per Gender), for each actor (per Speaker) and for each emotion (per Emotion) were constructed and evaluated.
3.1 Technical Procedure
The original audio data was downsampled to kHz and high-pass filtered at 55 Hz. MFCCs were calculated with a frame size of 30 ms and a window shift of 2.5 ms (i.e. resulting in 400 frames per second). 12 MFCCs and energy, together with their delta and acceleration coefficients, were included, resulting in a 39-dimensional feature vector (13 static coefficients times 3). For the phoneme-level models, Hidden Markov Models with 3 states and 5 Gaussian mixtures in each state were employed. The Baum-Welch algorithm was used for training the HMMs, and Viterbi decoding was performed in order to retrieve the segment boundaries. Extraction of MFCC features and training and application of the HMMs were performed with the respective tools provided by HTK.

Evaluation
The absolute position error at the initial phone boundaries was used for evaluation. Quantiles and error thresholds provide a meaningful measure: approx. 85% of segments are located with an error of less than 20 ms and should not require manual correction. The performance for two differently sized training sets (N=503 vs. N=892) and for gender-, speaker- and emotion-specific phoneme models is illustrated below. The results indicate that the performance of the aligner is already levelling out: for type1, the increase in the size of the training data from 395 to 685 no longer shows much of an effect, while the less frequent type2 still benefits from the increase in training size. Further fractionating the training set by using speaker-specific models etc. also does not decrease the performance dramatically.
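The evaluation measure described above (absolute position error at phone boundaries, summarised by error thresholds and quantiles) can be illustrated with a minimal Python sketch. The function names, the nearest-rank quantile definition and the toy boundary times are assumptions made for illustration; they are not taken from the poster or from HTK.

```python
import math

def boundary_errors_ms(reference_s, aligned_s):
    """Absolute position error in ms for paired phone start times given in seconds."""
    if len(reference_s) != len(aligned_s):
        raise ValueError("reference and aligned boundaries must be paired")
    return [abs(r - a) * 1000.0 for r, a in zip(reference_s, aligned_s)]

def share_within(errors_ms, threshold_ms=20.0):
    """Fraction of boundaries whose error is below the threshold."""
    return sum(e < threshold_ms for e in errors_ms) / len(errors_ms)

def error_quantile(errors_ms, q=0.85):
    """Nearest-rank quantile: smallest error with at least a share q of errors at or below it."""
    ordered = sorted(errors_ms)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Toy data (hypothetical): manually validated vs. automatically aligned phone starts.
reference = [0.000, 0.120, 0.250, 0.400, 0.610]
aligned = [0.005, 0.118, 0.275, 0.401, 0.600]

errors = boundary_errors_ms(reference, aligned)
print(share_within(errors))          # share of boundaries with an error below 20 ms
print(error_quantile(errors, 0.85))  # 85% error quantile in ms
```

Reporting both numbers is useful because the threshold share indicates how many boundaries can be left without manual correction, while the quantile shows how large the errors become in the upper range.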
This is in line with the subjective impression that ‘normal’ cases were handled remarkably well by models trained on small sets, but persistently problematic classes exist, e.g. flustered speech with its weakly articulated formant structures. Soft speech, even of very low intensity, as well as loud or even shouted speech are less critical. The most problematic cases are intermingled laughter, hesitations etc., which pose severe problems to any procedure based on forced alignment.

4 Emotion Classification
4.1 Technical Procedure
In order to test the capabilities of phone-level MFCC modelling for emotion classification, virtually the same methods as used for phonetic segmentation were re-applied. For phone-level modelling, emotion-specific 3-state left-to-right HMMs with 5 Gaussian mixtures were trained. Alternatively, two sentence-level HMM topologies were tested: an 'elongated' version of the phone-level model with 22 states and 5 mixtures, and several 1-state HMMs, which are equivalent to a Gaussian Mixture Model. The “recognition grammar” of the aligner was adapted as shown in Fig.2. A majority vote was performed on the outcome of the Viterbi decoding.
For the experiments with emotion recognition, a test set of 875 samples was reserved, leaving a maximum of 2940 samples for training (i.e. 77%/23%). Currently, only sentence-level models could make use of the whole training set, as they do not require any pre-segmented data. Phone-based models were for now trained on a smaller set of 892 samples, as it was necessary to use manually segmented data.
Fig.1: 85% Error quantile [ms] per segment for the two sentences in the corpus (N train=892)
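The majority vote over the Viterbi output mentioned in Section 4.1 can be sketched in Python as follows. The label format `phone_emotion` and the tie-breaking rule are assumptions made for this illustration; the poster does not specify how decoded segments are labelled or how ties are resolved.

```python
from collections import Counter

def majority_vote(decoded_labels):
    """Return the emotion carried by most decoded phone segments.

    Each label is assumed (hypothetically) to look like "<phone>_<emotion>".
    Ties go to the emotion that appears first in the decoded sequence.
    """
    emotions = [label.rsplit("_", 1)[1] for label in decoded_labels]
    # most_common keeps first-encountered order among equal counts (Python 3.7+).
    return Counter(emotions).most_common(1)[0][0]

# Hypothetical Viterbi output for one utterance: one emotion-specific model per phone.
decoded = ["n_anger", "e_anger", "k_joy", "a_anger", "l_fear"]
print(majority_vote(decoded))
```

Because every phone segment votes with equal weight in this sketch, a badly aligned segment can flip the decision for short utterances; weighting votes by segment duration or likelihood would be an alternative design.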
Fig.2: Recognition grammar for phone-level and sentence-level modelling.
18 categories is an untypically high number; for better comparability, all experiments were also performed with a subset of 6 emotions (anger, joy, fear, sadness, pleasure, interest). For sentence-level HMMs, different topologies were compared. For the full set of 2940 training samples, a single-state model with an exceedingly high number of Gaussian mixtures (512) performed best. This model was compared with phone-level modelling.

4.2 Evaluation
The table below provides a comparison between the phone-based model and a sentence-based model on the same training and test sets that were used in the evaluation of the phonetic alignment task above. Again, the smaller number of training samples for type2 results in a significant drop in performance.
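To make the single-state sentence-level model more concrete, the following Python sketch scores an utterance with one Gaussian mixture per emotion and picks the emotion with the highest total log-likelihood. It is an illustration under stated assumptions, not the poster's HTK setup: diagonal covariances are assumed, HMM transition probabilities are ignored, and the two-dimensional toy models with two mixture components stand in for 39-dimensional features and 512 components.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_frame_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def utterance_loglik(frames, gmm):
    """Frames are treated as independent, so their log-likelihoods add up."""
    return sum(gmm_frame_loglik(x, gmm) for x in frames)

def classify(frames, models):
    """Pick the emotion whose model gives the highest utterance log-likelihood."""
    return max(models, key=lambda emotion: utterance_loglik(frames, models[emotion]))

# Hypothetical toy models: one 2-component GMM per emotion.
models = {
    "anger": [(0.5, [2.0, 2.0], [1.0, 1.0]), (0.5, [3.0, 1.0], [1.0, 1.0])],
    "sadness": [(0.5, [-2.0, -2.0], [1.0, 1.0]), (0.5, [-3.0, -1.0], [1.0, 1.0])],
}
frames = [[2.1, 1.8], [2.9, 1.2]]
print(classify(frames, models))
```

Unlike the phone-level approach, this kind of scoring needs no segment boundaries, which is why sentence-level models can use training data that has not been segmented.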
The recognition quality is likely to still benefit from an increased training size, as the results from the experiments with sentence-level models indicate.

5 Conclusion
We presented a study on phone-based MFCC models for emotion recognition, which originated in the ‘practical task’ of phonetic segmentation of the GEMEP corpus. The results for the automatic segmentation have probably reached a certain ceiling, but provide a valid basis for manual correction of further samples. Results for emotion classification are still in flux, i.e. results with differently sized training sets still show significant volatility. Definitive conclusions about the relationship between phone-based and global models are difficult to draw. Because the segmental content in GEMEP is so restricted, phoneme-based models lose their expected implicit advantages in this context. On the other hand, errors in the automatic alignment have a strong influence on the classification: surprisingly, the sound /s/ provided the best classification results of all phonemes, which is probably due to the fact that fricatives are the most suitable sounds for the aligner. The study on emotion recognition was not at all aimed at coming up with ‘impressive’ recognition rates, which could easily be boosted by using e.g. 10-fold cross-validation, a-priori probabilities, more equally balanced training sets etc., but at testing the relative performance of different MFCC-based models.
Acknowledgements
I am greatly indebted to Klaus Scherer and his group in Geneva for designing, creating and sharing the GEMEP corpus, and especially to Tanja Baenziger for long-standing and ongoing interaction and support. This work has been funded by the EU Network of Excellence HUMAINE (IST ) and by the Austrian Funds for Research and Technology Promotion for Industry (FFF /2970 KA/SA). Financial support for OFAI is provided by the Austrian Federal Ministry of Science and Research and by the Federal Ministry of Transport, Innovation and Technology.

References
1. Baenziger T., Pirker H., Scherer K.: GEMEP - GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions, in Devillers L. et al. (eds.), Proceedings of LREC'06 Workshop on Corpora for Research on Emotion and Affect, May 23, Genoa, Italy, pp. 15-19,
2. Lee C.M., Yildirim S., Bulut M., Kazemzadeh A., Busso C., Deng Z., Lee S., Narayanan S.: Emotion Recognition based on Phoneme Classes, Proceedings of ICSLP 04, Jeju, Korea,
3. Schuller B., Rigoll G.: Timing Levels in Segment-Based Speech Emotion Recognition, in Proceedings of INTERSPEECH ICSLP, September, Pittsburgh, PA, USA, pp ,
4. Young S., Evermann G., Kershaw D., Moore G., Odell J., Ollason D., Povey D., Valtchev V., Woodland P.: The HTK Book (version 3.4), Cambridge University Engineering Department, Cambridge UK, 2006.